
Progress in Neurobiology 590 (2002) 1–31

Review

Functional integration and inference in the brain

Karl Friston∗

The Wellcome Department of Imaging Neuroscience, Institute of Neurology, University College London, 12 Queen Square, London WC1N 3BG, UK

Received 17 October 2001; accepted 16 August 2002

Abstract

Self-supervised models of how the brain represents and categorises the causes of its sensory input can be divided into two classes: those that minimise the mutual information (i.e. redundancy) among evoked responses and those that minimise the prediction error. Although these models have similar goals, the way they are attained, and the functional architectures employed, can be fundamentally different. This review describes the two classes of models and their implications for the functional anatomy of sensory cortical hierarchies in the brain. We then consider how empirical evidence can be used to disambiguate between architectures that are sufficient for perceptual learning and synthesis.

Most models of representational learning require prior assumptions about the distribution of sensory causes. Using the notion of empirical Bayes, we show that these assumptions are not necessary and that priors can be learned in a hierarchical context. Furthermore, we try to show that learning can be implemented in a biologically plausible way. The main point made in this review is that backward connections, mediating internal or generative models of how sensory inputs are caused, are essential if the process generating inputs cannot be inverted. Because these processes are dynamical in nature, sensory inputs correspond to a non-invertible nonlinear convolution of causes. This enforces an explicit parameterisation of generative models (i.e. backward connections) to enable approximate recognition and suggests that feedforward architectures, on their own, are not sufficient. Moreover, nonlinearities in generative models, that induce a dependence on backward connections, require these connections to be modulatory; so that estimated causes in higher cortical levels can interact to predict responses in lower levels. This is important in relation to functional asymmetries in forward and backward connections that have been demonstrated empirically.

To ascertain whether backward influences are expressed functionally requires measurements of functional integration among brain systems. This review summarises approaches to integration in terms of effective connectivity and proceeds to address the question posed by the theoretical considerations above. In short, it will be shown that functional neuroimaging can be used to test for interactions between bottom–up and top–down inputs to an area. The conclusion of these studies points toward the prevalence of top–down influences and the plausibility of generative models of sensory brain function.

© 2002 Published by Elsevier Science Ltd.

Contents

1. Introduction
2. Functional specialisation and integration
   2.1. Background
   2.2. Functional specialisation and segregation
   2.3. The anatomy and physiology of cortico-cortical connections
   2.4. Functional integration and effective connectivity
3. Representational learning
   3.1. The nature of representations
   3.2. Supervised models
      3.2.1. Category specificity and connectionism
      3.2.2. Implementation
   3.3. Information theoretic approaches
      3.3.1. Efficiency, redundancy and information
      3.3.2. Implementation

∗ Tel.: +44-207-833-7454; fax: +44-207-813-1445. E-mail address: k.friston@fil.ion.ucl.ac.uk (K. Friston).

0301-0082/02/$ – see front matter © 2002 Published by Elsevier Science Ltd.
PII: S0301-0082(02)00076-X

   3.4. Predictive coding and the inverse problem
      3.4.1. Implementation
      3.4.2. Predictive coding and Bayesian inference
   3.5. Cortical hierarchies and empirical Bayes
      3.5.1. Empirical Bayes in the brain
   3.6. Generative models and representational learning
      3.6.1. Density estimation and EM
      3.6.2. Supervised representational learning
      3.6.3. Information theory
      3.6.4. Predictive coding
   3.7. Summary
4. Generative models and the brain
   4.1. Context, causes and representations
   4.2. Neuronal responses and representations
      4.2.1. Examples from electrophysiology
5. Functional architectures assessed with brain imaging
   5.1. Context-sensitive specialisation
      5.1.1. Categorical designs
      5.1.2. Multifactorial designs
      5.1.3. Psychophysiological interactions
   5.2. Effective connectivity
      5.2.1. Effective connectivity and Volterra kernels
      5.2.2. Nonlinear coupling among brain areas
6. Functional integration and neuropsychology
   6.1. Dynamic diaschisis
      6.1.1. An empirical demonstration
7. Conclusion
Acknowledgements
References

1. Introduction

In concert with the growing interest in contextual and extra-classical receptive field effects in electrophysiology (i.e. how the receptive fields of sensory neurons change according to the context a stimulus is presented in), a similar paradigm shift is emerging in imaging neuroscience. Namely, the appreciation that functional specialisation exhibits similar extra-classical phenomena in which a cortical area may be specialised for one thing in one context but something else in another. These extra-classical phenomena have implications for theoretical ideas about how the brain might work. This review uses the relationship among theoretical models of representational learning as a vehicle to illustrate how imaging can be used to address important questions about functional brain architectures.

We start by reviewing two fundamental principles of brain organisation, namely functional specialisation and functional integration and how they rest upon the anatomy and physiology of cortico-cortical connections in the brain. Section 2 deals with the nature and learning of representations from a theoretical or computational perspective. This section reviews supervised (e.g. connectionist) approaches, information theoretic approaches and those predicated on predictive coding and reprises their heuristics and motivation using the framework of generative models. The key focus of this section is on the functional architectures implied by each model of representational learning. Information theory can, in principle, proceed using only forward connections. However, it turns out that this is only possible when processes generating sensory inputs are invertible and independent. Invertibility is precluded when the cause of a percept and the context in which it is engendered interact. These interactions create a problem of contextual invariance that can only be solved using internal or generative models. Contextual invariance is necessary for categorisation of sensory input (e.g. category-specific responses) and represents a fundamental problem in perceptual synthesis. Generative models based on predictive coding solve this problem with hierarchies of backward and lateral projections that prevail in the real brain. In short, generative models of representational learning are a natural choice for understanding real functional architectures and, critically, confer a necessary role on backward connections.

Empirical evidence, from electrophysiological studies of animals and functional neuroimaging studies of human subjects, is presented in Sections 3 and 4 to illustrate the context-sensitive nature of functional specialisation and how its expression depends upon integration among remote cortical areas. Section 3 looks at extra-classical effects in electrophysiology, in terms of the predictions afforded by generative models of brain function. The theme of
context-sensitive evoked responses is generalised to a cortical level and human functional neuroimaging studies in the subsequent section. The critical focus of this section is evidence for the interaction of bottom–up and top–down influences in determining regional brain responses. These interactions can be considered signatures of backward connections. The final section reviews some of the implications of the foregoing sections for lesion studies and neuropsychology. 'Dynamic diaschisis' is described, in which aberrant neuronal responses can be observed as a consequence of damage to distal brain areas providing enabling or modulatory afferents. This section uses neuroimaging in neuropsychological patients and discusses the implications for constructs based on the lesion-deficit model.

2. Functional specialisation and integration

2.1. Background

The brain appears to adhere to two fundamental principles of functional organisation, functional integration and functional specialisation, where the integration within and among specialised areas is mediated by effective connectivity. The distinction relates to that between 'localisationism' and '(dis)connectionism' that dominated thinking about cortical function in the nineteenth century. Since the early anatomic theories of Gall, the identification of a particular brain region with a specific function has become a central theme in neuroscience. However, functional localisation per se was not easy to demonstrate: for example, a meeting that took place on 4 August 1881, addressed the difficulties of attributing function to a cortical area, given the dependence of cerebral activity on underlying connections (Phillips et al., 1984). This meeting was entitled "Localisation of function in the cortex cerebri". Goltz, although accepting the results of electrical stimulation in dog and monkey cortex, considered that the excitation method was inconclusive, in that the behaviours elicited might have originated in related pathways, or current could have spread to distant centres. In short, the excitation method could not be used to infer functional localisation because localisationism discounted interactions, or functional integration among different brain areas. It was proposed that lesion studies could supplement excitation experiments. Ironically, it was observations on patients with brain lesions some years later (see Absher and Benson, 1993) that led to the concept of 'disconnection syndromes' and the refutation of localisationism as a complete or sufficient explanation of cortical organisation. Functional localisation implies that a function can be localised in a cortical area, whereas specialisation suggests that a cortical area is specialised for some aspects of perceptual or motor processing where this specialisation can be anatomically segregated within the cortex. The cortical infrastructure supporting a single function may then involve many specialised areas whose union is mediated by the functional integration among them. Functional specialisation and integration are not exclusive, they are complementary. Functional specialisation is only meaningful in the context of functional integration and vice versa.

2.2. Functional specialisation and segregation

The functional role, played by any component (e.g. cortical area, sub-area, neuronal population or neuron) of the brain, is defined largely by its connections. Certain patterns of cortical projections are so common that they could amount to rules of cortical connectivity. "These rules revolve around one, apparently, overriding strategy that the cerebral cortex uses—that of functional segregation" (Zeki, 1990). Functional segregation demands that cells with common functional properties be grouped together. This architectural constraint in turn necessitates both convergence and divergence of cortical connections. Extrinsic connections, between cortical regions, are not continuous but occur in patches or clusters. This patchiness has, in some instances, a clear relationship to functional segregation. For example, the secondary visual area V2 has a distinctive cytochrome oxidase architecture, consisting of thick stripes, thin stripes and inter-stripes. When recordings are made in V2, directionally selective (but not wavelength or colour selective) cells are found exclusively in the thick stripes. Retrograde (i.e. backward) labelling of cells in V5 is limited to these thick stripes. All the available physiological evidence suggests that V5 is a functionally homogeneous area that is specialised for visual motion. Evidence of this nature supports the notion that patchy connectivity is the anatomical infrastructure that underpins functional segregation and specialisation. If it is the case that neurons in a given cortical area share a common responsiveness (by virtue of their extrinsic connectivity) to some sensorimotor or cognitive attribute, then this functional segregation is also an anatomical one. Challenging a subject with the appropriate sensorimotor attribute or cognitive process should lead to activity changes in, and only in, the areas of interest. This is the model upon which the search for regionally specific effects with functional neuroimaging is based.

2.3. The anatomy and physiology of cortico-cortical connections

If specialisation rests upon connectivity then important organisational principles should be embodied in the neuroanatomy and physiology of extrinsic connections. Extrinsic connections couple different cortical areas whereas intrinsic connections are confined to the cortical sheet. There are certain features of cortico-cortical connections that provide strong clues about their functional role. In brief, there appears to be a hierarchical organisation that rests upon the distinction between forward and backward connections. The designation of a connection as forward or backward depends primarily on its cortical layers of origin and termination.


Table 1
Some key characteristics of extrinsic cortico-cortical connections in the brain

Hierarchical organisation
- The organisation of the visual cortices can be considered as a hierarchy (Felleman and Van Essen, 1991)
- The notion of a hierarchy depends upon a distinction between forward and backward extrinsic connections
- This distinction rests upon different laminar specificity (Rockland and Pandya, 1979; Salin and Bullier, 1995)
- Backward connections are more numerous and transcend more levels
- Backward connections are more divergent than forward connections (Zeki and Shipp, 1988)

Forwards connections:
- Sparse axonal bifurcations
- Topographically organised
- Originate in supragranular layers
- Terminate largely in layer IV
- Postsynaptic effects through fast AMPA (1.3–2.4 ms decay) and GABAA (6 ms decay) receptors

Backwards connections:
- Abundant axonal bifurcation
- Diffuse topography
- Originate in bilaminar/infragranular layers
- Terminate predominantly in supragranular layers
- Modulatory afferents activate slow (50 ms decay), voltage-sensitive NMDA receptors

Some characteristics of cortico-cortical connections are presented below and are summarised in Table 1. The list is not exhaustive, nor properly qualified, but serves to introduce some important principles that have emerged from empirical studies of visual cortex.

• Hierarchical organisation
The organisation of the visual cortices can be considered as a hierarchy of cortical levels with reciprocal extrinsic cortico-cortical connections among the constituent cortical areas (Felleman and Van Essen, 1991). The notion of a hierarchy depends upon a distinction between forward and backward extrinsic connections.

• Forwards and backwards connections—laminar specificity
Forwards connections (from a low to a high level) have sparse axonal bifurcations and are topographically organised; originating in supragranular layers and terminating largely in layer IV. Backward connections, on the other hand, show abundant axonal bifurcation and a diffuse topography. Their origins are bilaminar/infragranular and they terminate predominantly in supragranular layers (Rockland and Pandya, 1979; Salin and Bullier, 1995).

• Forward connections are driving and backward connections are modulatory
Reversible inactivation (e.g. Sandell and Schiller, 1982; Girard and Bullier, 1989) and functional neuroimaging (e.g. Büchel and Friston, 1997) studies suggest that forward connections are driving, whereas backward connections can be modulatory. The notion that forward connections are concerned with the promulgation and segregation of sensory information is consistent with: (i) their sparse axonal bifurcation; (ii) patchy axonal terminations; and (iii) topographic projections. In contradistinction, backward connections are generally considered to have a role in mediating contextual effects and in the co-ordination of processing channels. This is consistent with: (i) their frequent bifurcation; (ii) diffuse axonal terminations; and (iii) non-topographic projections (Salin and Bullier, 1995; Crick and Koch, 1998).

• Modulatory connections have slow time constants
Forward connections mediate their post-synaptic effects through fast AMPA (1.3–2.4 ms decay) and GABAA (6 ms decay) receptors. Modulatory afferents activate NMDA receptors. NMDA receptors are voltage-sensitive, showing nonlinear and slow dynamics (50 ms decay). They are found predominantly in supragranular layers where backward connections terminate (Salin and Bullier, 1995). These slow time-constants again point to a role in mediating contextual effects that are more enduring than phasic sensory-evoked responses.

• Backwards connections are more divergent than forward connections
Extrinsic connections show an orderly convergence and divergence of connections from one cortical level to the next. At a macroscopic level, one point in a given cortical area will connect to a region 5–8 mm in diameter in another. An important distinction between forward and backward connections is that backward connections are more divergent. For example, the divergence region of a point in V5 (i.e. the region receiving backward afferents from V5) may include thick and inter-stripes in V2, whereas its convergence region (i.e. the region providing forward afferents to V5) is limited to the thick stripes (Zeki and Shipp, 1988). Reciprocal interactions between two levels, in conjunction with the divergence of backward connections, render any area sensitive to the vicarious influence of other regions at the same hierarchical level even in the absence of direct lateral connections.

• Backward connections are more numerous and transcend more levels
Backward connections are more abundant than forward connections. For example, the ratio of forward efferent connections to backward afferents in the lateral geniculate is about 1:10–1:20. Another important distinction is that backward connections will traverse a number of hierarchical levels, whereas forward connections are more restricted. For example, there are backward connections from TE and TEO to V1 but no monosynaptic connections from V1 to TE or TEO (Salin and Bullier, 1995).


263 In summary, the anatomy and physiology of cortico- nonlinear are used almost synonymously. Modulatory ef- 317 264 cortical connections suggest that forward connections are fects imply the post-synaptic response evoked by one input 318 265 driving and commit cells to a pre-specified response given is modulated, or interacts, with another. By definition this 319 266 the appropriate pattern of inputs. Backward connections, on interaction must depend on nonlinear synaptic mechanisms. 320 267 the other hand, are less topographic and are in a position In summary, the brain can be considered as an ensemble 321 268 to modulate the responses of lower areas to driving inputs of functionally specialised areas that are coupled in a nonlin- 322 269 from either higher or lower areas (see Table 1). Backwards ear fashion by effective connections. Empirically, it appears 323 270 connections are abundant in the brain and are in a position that connections from lower to higher areas are predomi- 324 271 to exert powerful effects on evoked responses, in lower nantly driving whereas backwards connections, that medi- 325 272 levels, that define the specialisation of any area or neuronal ate top–down influences, are more diffuse and are capable 326 273 population. The idea pursued below is that specialisa- of exerting modulatory influences. In the next section we 327 274 tion depends upon backwards connections and, due to the describe a theoretical perspective, provided by ‘generative 328 275 greater divergence of the latter, can embody contextual ef- models’, that highlights the functional importance of back- 329 276 fects. Appreciating this is important for understanding how wards connections and nonlinear interactions. 330 277 functional integration can dynamically reconfigure the spe- 278 cialisation of brain areas that mediate perceptual synthesis. 3. Representational learning 331 279 2.4. 
Functional integration and effective connectivity This section compares and contrasts the heuristics behind 332 280 Electrophysiology and imaging neuroscience have firmly three prevalent computational approaches to representational 333 281 established functional specialisation as a principle of brain learning and perceptual synthesis, supervised learning, and 334 282 organisation in man. The functional integration of spe- two forms of self-supervised learning based on information 335 283 cialised areas has proven more difficult to assess. Functional theory and predictive coding. These approaches will then 336 284 integration refers to the interactions among specialised neu- be reconciled within the framework of generative models. 337 285 ronal populations and how these interactions depend upon This article restricts itself to sensory processing in cortical 338 286 the sensorimotor or cognitive context. Functional integration hierarchies. This precludes a discussion of other important 339 287 is usually assessed by examining the correlations among ideas (e.g. reinforcement learning (Sutton and Barto, 1990; 340 288 activity in different brain areas, or trying to explain the Friston et al., 1994), neuronal selection (Edelman, 1993) and 341 289 activity in one area in relation to activities elsewhere. Func- dynamical systems theory (Freeman and Barrie, 1994)). 342 290 tional connectivity is defined as correlations between remote The relationship between model and real neuronal archi- 343 291 neurophysiological events. However, correlations can arise tectures is central to . We address this 344 292 in a variety of ways. For example, in multi-unit electrode relationship, in terms of representations, starting with an 345 293 recordings they can result from stimulus-locked transients overview of representations in which the distinctions among 346 294 evoked by a common input or reflect stimulus-induced various approaches can be seen clearly. 
An important focus 347 295 oscillations mediated by synaptic connections (Gerstein of this section is the interaction among ‘causes’ of sensory 348 296 and Perkel, 1969). Integration within a distributed system input. These interactions posit the problem of contextual 349 297 is usually better understood in terms of effective connec- invariance. In brief, it will be shown that the problem of 350 298 tivity. Effective connectivity refers explicitly to the influ- contextual invariance points to the adoption of generative 351 299 ence that one neuronal system exerts over another, either models where interactions among causes of a percept are 352 300 at a synaptic (i.e. synaptic efficacy) or population level. modelled explicitly. Within the class of self-supervised 353 301 It has been proposed that “the (electrophysiological) no- models, we will compare classical information theoretic 354 302 tion of effective connectivity should be understood as the approaches and predictive coding. These two schemes use 355 303 experiment- and time-dependent, simplest possible circuit different heuristics which imply distinct architectures that 356 304 diagram that would replicate the observed timing relation- are sufficient for their implementation. The distinction rests 357 305 ships between the recorded neurons” (Aertsen and Preißl, on whether an explicit model, of the way sensory inputs are 358 306 1991). This speaks to two important points: (i) effective generated, is necessary for representational learning. If this 359 307 connectivity is dynamic, i.e. activity- and time-dependent; model is instantiated in backwards connections, then theo- 360 308 and (ii) it depends upon a model of the interactions. An retical distinctions may shed light on the functional role of 361 309 important distinction, among models employed in func- backward and lateral connections that are so prevalent in 362 310 tional neuroimaging, is whether these models are linear or the brain. 363 311 nonlinear. 
Recent characterisations of effective connectivity 312 have focussed on nonlinear models that accommodate the 3.1. The nature of representations 364 313 modulatory or nonlinear effects mentioned above. A more 314 detailed discussion of these models is provided in Section What is a representation? Here a representation is taken 365 315 5.2, after the motivation for their application is established to be a neuronal event that represents some ‘cause’ in the 366 316 in the next section. In this review the terms modulatory and sensorium. Causes are simply the states of the process gen- 367

UNCORRECTED PROOF 6 K. Friston / Progress in Neurobiology 590 (2002) 1–31

368 erating sensory data. It is not easy to ascribe meaning to At the point the process is modelled, this distinction is in- 420 369 these states without appealing to the way that we categorise voked through notions of deterministic or observation noise. 421 370 things, perceptually or conceptually. High-level conceptual This section deals with how the brain might construct such 422 371 causes may be categorical in nature, such as the identity of a models. 423 372 face in the visual field or the semantic category a perceived The importance of this formulation is that it highlights: (i) 424 373 object belongs to. In a hierarchical setting, high-level causes the dynamical aspects of sensory input; and (ii) the role of 425 374 may induce priors on lower-level causes that are more para- interactions among the causes of the sensory input. Dynamic 426 375 metric in nature. For example, the perceptual cause “mov- aspects imply that the current state of the world, registered 427 376 ing quickly” may show a one-to-many relationship with through our sensory receptors, depends not only on the ex- 428 377 over-complete representations of different velocities in V5 tant causes but also on their history. Interactions among these 429 378 (MT) units. An essential aspect of causes is their relation- causes, at any time in the past, can influence what is currently 430 379 ship to each other (e.g. ‘is part of’) and, in particular, their sensed. The second-order terms with i = 2inEq. (2) repre- 431 380 hierarchical structure. This ontology is often attended by sent pairwise interactions among the causes. These interac- 432 381 ambiguous many-to-one and one-to-many mappings (e.g. a tions are formally identical to interaction terms in conven- 433 382 table has legs but so do horses; a wristwatch is a watch irre- tional statistical models of observed data and can be viewed 434 383 spective of the orientation of its hands). 
This ambiguity can render the problem of inferring causes from sensory information ill-posed (as we will see further).

Even though causes may be difficult to describe, they are easy to define operationally. Causes are the variables or states that are necessary to specify the products of a process (or model of that process) generating sensory information. In very general terms, let us frame the problem of representing real world causes s(t) in terms of the system of deterministic equations

$$\dot{x} = f(x, s), \qquad u = g(x) \tag{1}$$

where s is a vector of underlying causes in the environment (e.g. the velocity of a particular object, direction of radiant light, etc.) and u represents sensory inputs. $\dot{x}$ means the rate of change of x, which here denotes some unobserved states of the world that form our sensory impression of it. The functions f and g can be highly nonlinear and allow for both the current state of the world and the causes of changes in those states to interact, when evoking responses in sensory units. Sensory input can be shown to be a function of, and only of, the causes and their recent history.

$$u = G(s) = \sum_{i=1}^{\infty} \int_0^t \cdots \int_0^t \frac{\partial^i u(t)}{\partial s(t-\sigma_1)\cdots\partial s(t-\sigma_i)}\, s(t-\sigma_1)\cdots s(t-\sigma_i)\, d\sigma_1 \cdots d\sigma_i \tag{2}$$

G(s) is a functional (function of a function) that generates inputs from the causes. Eq. (2) is simply a functional Taylor expansion covering dynamical systems of the sort implied by Eq. (1). This expansion is called a Volterra series and can be thought of as a nonlinear convolution of the causes to give the inputs (see Box 1). Convolution is like smoothing, in this instance over time. A key aspect of this expansion is that it does not refer to the many hidden states of the world, only the causes of changes in states, that we want to represent. Furthermore, Eq. (1) does not contain any noise or error. This is because Eqs. (1) and (2) describe a real world process. There is no distinction between deterministic and stochastic behaviour until that process is observed.

The nonlinearities in Eq. (2) mean that causes interact; these interactions manifest as contextual effects, where the expression of a particular cause depends on the context induced by another. For example, the extraction of motion from the visual field depends upon there being sufficient luminance or wavelength contrast to define the moving surface. Another ubiquitous example, from early visual processing, is the occlusion of one object by another. In the absence of interactions, we would see a linear superposition of both objects, but the visual input caused by the nonlinear mixing of these two causes renders one occluded by the other. At a more cognitive level, the cause associated with the word 'HAMMER' will depend on the semantic context (that determines whether the word is a verb or a noun). These contextual effects are profound and must be discounted before the representations of the underlying causes can be considered veridical.

The problem the brain has to contend with is to find a function of the input u(t) that recognises or represents the underlying causes. To do this, the brain must effectively undo the convolution and interactions to expose contextually invariant causes. In other words, the brain must perform some form of nonlinear unmixing of 'causes' and 'context' without knowing either. The key point here is that this nonlinear mixing may not be invertible and that the estimation of causes from input may be fundamentally ill-posed. For example, no amount of unmixing can discern the parts of an object that are occluded by another. The mapping $u = s^2$ provides a trivial example of this non-invertibility: knowing u does not uniquely determine s.

Nonlinearities are not the only source of non-invertibility. Because sensory inputs are convolutions of causes, there is a potential loss of information during the convolution or smoothing that may have been critical for a unique determination of the causes. The convolution implied by Eq. (2) means the brain has to de-convolve the inputs to obtain these causes. In estimation theory this problem is sometimes called 'blind de-convolution' because the estimation is blind to the underlying causes that are convolved to give the observed variables. To simplify the presentation of the ideas below we will assume that the vectors of causes s, and their estimates v, include a sufficient history to accommodate the dynamics implied by Eq. (1).
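The non-invertibility of Eqs. (1) and (2) can be made concrete with a minimal numerical sketch. Everything here (the leaky dynamics, the time constant and the squaring nonlinearity) is an illustrative assumption, not taken from the text: a cause s(t) drives a hidden state x through f, and the input u = g(x) is a nonlinear function of that state. Two different causes evoke identical inputs, so no recognition function of u alone can recover s.

```python
import numpy as np

def generate_input(s, dt=0.01, tau=0.1):
    """Eq. (1) in miniature: x' = f(x, s) (leaky integration), u = g(x) = x^2."""
    x = 0.0
    u = np.empty_like(s)
    for t, s_t in enumerate(s):
        x += dt * (-x / tau + s_t)   # hidden state driven by the cause
        u[t] = x ** 2                # nonlinear observation of the state
    return u

t = np.arange(0.0, 1.0, 0.01)
s_pos = np.sin(2 * np.pi * t)        # one cause ...
s_neg = -s_pos                       # ... and a quite different one

u1 = generate_input(s_pos)
u2 = generate_input(s_neg)

# The squaring makes G non-invertible: distinct causes give identical inputs,
# so knowing u does not uniquely determine s (cf. the u = s^2 example).
print(np.allclose(u1, u2))           # True
```

The state x is never observed; u is a deterministic functional of the recent history of s alone, which is exactly the situation Eq. (2) summarises.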


Box 1. Dynamical systems and Volterra kernels

Input-state-output systems and Volterra series

Neuronal systems are inherently nonlinear and lend themselves to modelling by nonlinear dynamical systems. However, due to the complexity of biological systems it is difficult to find analytic equations that describe them adequately. Even if these equations were known, the state variables are often not observable. An alternative approach to identification is to adopt a very general model (Wray and Green, 1994) and focus on the inputs and outputs. Consider the single input-single output (SISO) system

$$\dot{x}(t) = f(x(t), u(t)), \qquad y(t) = g(x(t))$$

The Fliess fundamental formula (Fliess et al., 1983) describes the causal relationship between the outputs and the recent history of the inputs. This relationship can be expressed as a Volterra series, in which the output y(t) conforms to a nonlinear convolution of the inputs u(t), critically without reference to the state variables x(t). This series is simply a functional Taylor expansion of y(t).

$$y(t) = \sum_{i=1}^{\infty} \int_0^t \cdots \int_0^t \kappa_i(\sigma_1, \ldots, \sigma_i)\, u(t-\sigma_1)\cdots u(t-\sigma_i)\, d\sigma_1 \cdots d\sigma_i$$

$$\kappa_i(\sigma_1, \ldots, \sigma_i) = \frac{\partial^i y(t)}{\partial u(t-\sigma_1)\cdots\partial u(t-\sigma_i)}$$

where $\kappa_i(\sigma_1, \ldots, \sigma_i)$ is the ith-order kernel. Volterra series have been described as a 'power series with memory' and are generally thought of as a high-order or 'nonlinear convolution' of the inputs to provide an output. See Bendat (1990) for a fuller discussion. This expansion is used in a number of places in the main text. When the inputs and outputs are measured neuronal activity the Volterra kernels have a special interpretation.

Volterra kernels and effective connectivity

Volterra kernels are useful for characterising the effective connectivity or influences that one neuronal system exerts over another because they represent the causal characteristics of the system in question. Neurobiologically they have a simple and compelling interpretation: they are synonymous with effective connectivity.

$$\kappa_1(\sigma_1) = \frac{\partial y(t)}{\partial u(t-\sigma_1)}, \qquad \kappa_2(\sigma_1, \sigma_2) = \frac{\partial^2 y(t)}{\partial u(t-\sigma_1)\,\partial u(t-\sigma_2)}, \quad \ldots$$

It is evident that the first-order kernel embodies the response evoked by a change in input at $t-\sigma_1$. In other words, it is a time-dependent measure of driving efficacy. Similarly, the second-order kernel reflects the modulatory influence of the input at $t-\sigma_1$ on the response evoked at $t-\sigma_2$, and so on for higher orders.
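These kernels can be illustrated with a discrete-time Volterra model. The kernels below are invented for illustration (none of the numbers come from the text): the first-order kernel can be read out as the impulse response (the driving efficacy), while the second-order kernel makes the response to one input depend on another, i.e. a modulatory interaction.

```python
import numpy as np

K = 4                                    # memory length (time bins)
k1 = np.array([0.5, 0.3, 0.1, 0.05])     # first-order (driving) kernel
k2 = np.zeros((K, K))
k2[0, 1] = k2[1, 0] = 0.2                # second-order (modulatory) kernel

def volterra(u):
    """Discrete second-order Volterra series: a 'power series with memory'."""
    y = np.zeros(len(u))
    for t in range(K - 1, len(u)):
        past = u[t - K + 1:t + 1][::-1]  # past[s] = u(t - s)
        y[t] = k1 @ past + past @ k2 @ past
    return y

# A single impulse reads out the first-order kernel (the driving efficacy):
u = np.zeros(20); u[10] = 1.0
print(volterra(u)[10:14])                # [0.5  0.3  0.1  0.05]

# Two impulses at successive times interact through the second-order kernel;
# the response exceeds the sum of the two first-order responses:
u2 = np.zeros(20); u2[10] = u2[11] = 1.0
extra = volterra(u2)[11] - (k1[0] + k1[1])
print(round(extra, 3))                   # 0.4: the modulatory effect
```

The superadditive 0.4 is $2\kappa_2(0, 1)$: the input at one lag modulates the efficacy of the input at another, which is exactly the interpretation of second-order kernels as modulatory connectivity.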

All the schemas considered below can be construed as trying to effect a blind de-convolution of sensory inputs to estimate the causes with a recognition function

$$v = R(u, \phi, \theta) \tag{3}$$

Here v represents an estimate of the causes and could correspond to the activity of neuronal units (i.e. neurons or populations of neurons) in the brain. The parameters φ and θ determine the transformations that sensory input is subject to and can be regarded as specifying the connection strengths and architecture of a neuronal network model or effective connectivity (see Box 1). For reasons that will become apparent later, we make a distinction between parameters for forward connections φ and backward connections θ.

The problem of recognising causes reduces to finding the right parameters such that the activity of the representational units v has some clearly defined relationship to the causes s. More formally, one wants to find the parameters that maximise the mutual information or statistical dependence between the dynamics of the representations and their causes. Models of neuronal computation try to solve this problem in the hope that the ensuing parameters can be interpreted in relation to real neuronal infrastructures. The greater the biological validity of the constraints under which these solutions are obtained, the more plausible this relationship becomes. In what follows, we will consider three modelling approaches: (i) supervised models; (ii) models based on information theory; and (iii) those based on predictive coding. The focus will be on the sometimes hidden constraints imposed on the parameters and the ensuing implications for connectivity architectures and the representational properties of the units. In particular, we will ask whether backward connections, corresponding to the parameters θ, are necessary and, if so, what their role is. The three approaches are reprised at the end of this section by treating them as special cases of generative models.


Fig. 1. Schematic illustrating the architectures implied by supervised, information theory-based approaches and predictive coding. The circles represent nodes in a network and the arrows represent a few of the connections. See the main text for an explanation of the equations and designation of the variables each set of nodes represents. The light grey boxes encompass connections and nodes within the model. Connection strengths are determined by the free parameters of the model φ (forward connections) and θ (backward connections). Nonlinear effects are implied when one arrow connects with another. Nonlinearities can be construed as the modulation of responsiveness to one input by another (see Box 1 for a more formal account). The broken arrow in the lower panel denotes connections that convey an error signal to the higher level from the input level.

Each subsection below provides the background and heuristics for each approach and describes its implementation using the formalism above. Fig. 1 provides a graphical overview of the three schemes.

3.2. Supervised models

Connectionism is an approach that has proved very useful in relating putative cognitive architectures to neuronal ones and, in particular, in modelling the impact of brain lesions on cognitive performance. Connectionism is used here as a well-known example of supervised learning in cognitive neuroscience. We start by reviewing the role played by connectionist models in the characterisation of brain systems underlying cognitive functions.

3.2.1. Category specificity and connectionism

Semantic memory impairments can result from a variety of pathophysiological insults, including Alzheimer's disease, encephalitis and cerebrovascular accidents (e.g. Nebes, 1989; Warrington and Shallice, 1984).


The concept of category specificity stems from the work of Warrington and colleagues (Warrington and McCarthy, 1983; Warrington and Shallice, 1984) and is based on the observation that patients with focal brain lesions have difficulties in recognising or naming specific categories of objects. Patients can exhibit double dissociations in terms of their residual semantic capacity. For example, some patients can name artifacts but have difficulty with animals, whereas others can name animals with more competence than artifacts. These findings have engendered a large number of studies, all pointing to impairments in perceptual synthesis, phonological or lexico-semantic analysis that are specific for certain categories of stimuli. Several theories have been posited to account for category specificity, and connectionist models have been used to adjudicate among some of them.

Connectionist (e.g. parallel distributed processing or PDP) techniques use model neuronal architectures that can be lesioned to emulate neuropsychological deficits. This involves modelling semantic networks using connected units or nodes and suitable learning rules to determine a set of connection strengths (Rumelhart and McClelland, 1986). Semantic memory impairments are then simulated by lesioning the model to establish the nature of the interaction between neuropathology and cognitive deficit (e.g. Hinton and Shallice, 1991; Plaut and Shallice, 1993). A compelling example of this sort of approach is the connectionist model of Farah and McClelland (1991): patterns of category-specific deficits led Warrington and McCarthy (1987) to suggest that an animate/inanimate distinction could be understood in terms of a differential dependence on functional and structural (perceptual) features for recognition. For example, tools have associated motor acts whereas animals do not, or tools are easier to discriminate based upon their structural descriptions than four-legged animals. Farah and McClelland (1991) incorporated this difference in terms of the proportion of the two types of semantic featural representations encoding a particular object, with perceptual features dominating for animate objects and both represented equally for artifacts. Damage to visual features led to impairment for natural kinds and, conversely, damage to functional features impaired the output for artifacts. Critically, the model exhibited category-specific deficits in the absence of any category-specific organisation. The implication here is that an anatomical segregation of structural and functional representations is sufficient to produce category-specific deficits following focal brain damage. This example serves to illustrate how the connectionist paradigm can be used to relate neuronal and cognitive domains. In this example, connectionist models were able to posit a plausible anatomical infrastructure wherein the specificity of deficits, induced by lesions, is mediated by differential dependence on either the functional or structural attributes of an object and not by any (less plausible) category-specific anatomical organisation per se.

3.2.2. Implementation

In connectionist models causes or 'concepts' like "TABLE" are induced by patterns of activation over units encoding semantic primitives (e.g. structural: "has four legs"; or functional: "can put things on it"). These primitives are simple localist representations "that are assumed to be encoded by larger pools of neurons in the brain" (Devlin et al., 1998). Irrespective of their theoretical bias, connectionist models assume the existence of fixed representations (i.e. units that represent a structural, phonological or lexico-semantic primitive) that are activated by some input. These representational attributions are immutable, in that each unit has its 'label'. The representation of a concept, object or 'cause' in the sensorium is defined in terms of which primitives are active.

Connectionist models employ some form of supervised learning where the model parameters (connection strengths or biases) change to minimise the difference between the observed and required output. This output is framed in terms of a distributed profile or pattern of activity over the (output) units $v = R(u, \phi)$, which arises from sensory input u corresponding to activity in (input) primitives associated with the stimulus being simulated. There are often hidden units interposed between the input and output units. The initial input (sometimes held constant or 'clamped' for a while) is determined by a generative function of the ith stimulus or cause $u_i = G(s_i)$. Connectionist models try to find the free parameters φ that minimise some function or potential V of the error or difference between the output obtained and that desired

$$\phi = \min_\phi V(\varepsilon, \phi), \qquad \varepsilon_i = R(u_i, \phi) - s_i \tag{4}$$

The potential is usually the (expected) sum of squared differences. Although the connectionist paradigm has been very useful in relating cognitive science and neuropsychology, it has a few limitations in the context of understanding how the brain learns to represent things:

• First, one has to know the underlying cause $s_i$ and the generative function, whereas the brain does not. This is the conventional criticism of supervised algorithms as a model of neuronal computation. Neural networks, of the sort used in connectionism, are well known to be flexible nonlinear function approximators. In this sense they can be used to approximate the inverse of any generative function $u_i = G(s_i)$ to give model architectures that can be lesioned. However, representational learning in the brain has to proceed without any information about the processes generating inputs, and the ensuing architectures cannot be ascribed to connectionist mechanisms.

• Secondly, the generative mapping $u_i = G(s_i)$ precludes nonlinear interactions among stimuli or causes, dynamic or static. This is a fundamental issue because one of the main objectives of neuronal modelling is to see how representations emerge with the nonlinear mixing and contextual effects prevalent in real sensory input. Omitting interactions among the causes circumvents one of the most important questions that could have been asked; namely, how does the brain unmix sensory inputs to discount contextual effects and other aspects of nonlinear mixing? In short, the same inputs are activated by a given cause, irrespective of the context. This compromises the plausibility of connectionist models when addressing the emergence of representations.
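Eq. (4) can be sketched for the simplest possible case: a linear generative process and a linear network trained by gradient descent on the squared error. The matrix A, the learning rate and the linear form of R are all illustrative assumptions; the point of the sketch is that the teacher must supply both the cause $s_i$ and the generative mapping G, which is exactly the limitation raised in the first bullet above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear generative process u_i = G(s_i) = A s_i (invertible here,
# which is what the connectionist setting tacitly assumes).
A = np.array([[2.0, 0.5],
              [0.3, 1.5]])

phi = np.zeros((2, 2))                   # forward connection strengths
lr = 0.02
for _ in range(2000):
    s = rng.standard_normal(2)           # supervision: the cause is given ...
    u = A @ s                            # ... and so is the generative mapping
    eps = phi @ u - s                    # output-level error, as in Eq. (4)
    phi -= lr * np.outer(eps, u)         # gradient descent on V = |eps|^2 / 2

# The trained recognition function approximates the inverse of G:
print(np.round(phi @ A, 1))              # ~ identity matrix
```

Because the problem is exactly realisable, the stochastic gradient converges to $\phi \approx A^{-1}$; remove the supervision signal s and the scheme has nothing to descend on, which is why the approaches in the next subsections dispense with it.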


In summary, connectionist models specify distributed profiles of activity over (semantic) primitives that are induced by (conceptual) causes, and try to find connectivity parameters that emulate the inverse of these mappings. They have been used to understand how the performance (storage and generalisation) of a network responds to simulated damage, after learning is complete. However, connectionism has a limited role in understanding representational learning per se. In the next subsection we will look at self-supervised approaches that do not require the causes for learning.

3.3. Information theoretic approaches

There have been many compelling developments in theoretical neurobiology that have used information theory (e.g. Barlow, 1961; Optican and Richmond, 1987; Linsker, 1988; Oja, 1989; Foldiak, 1990; Tovee et al., 1993; Tononi et al., 1994). Many appeal to the principle of maximum information transfer (e.g. Linsker, 1988; Atick and Redlich, 1990; Bell and Sejnowski, 1995). This principle has proven extremely powerful in predicting some of the basic receptive field properties of cells involved in early visual processing (e.g. Atick and Redlich, 1990; Olshausen and Field, 1996). This principle represents a formal statement of the common sense notion that neuronal dynamics in sensory systems should reflect, efficiently, what is going on in the environment (Barlow, 1961). In the present context, the principle of maximum information transfer (infomax; Linsker, 1988) suggests that a model's parameters should maximise the mutual information between the sensory input u and the evoked responses or outputs $v = R(u, \phi)$. This maximisation is usually considered in the light of some sensible constraints, e.g. the presence of noise in sensory input (Atick and Redlich, 1990) or dimension reduction (Oja, 1989), given the smaller number of divergent outputs from a neuronal population than convergent inputs (Friston et al., 1992).

Intuitively, mutual information is like the covariance or correlation between two variables but extended to cover multivariate observations. It is a measure of statistical dependence. In a similar way, entropy can be regarded as the uncertainty or variability of an observation (cf. the variance of a univariate observation). The mutual information between inputs and outputs, under φ, is given by

$$I(u, v; \phi) = H(u) + H(v; \phi) - H(u, v; \phi) = H(v; \phi) - H(v|u) \tag{5}$$

where H(v|u) is the conditional entropy or uncertainty in the output, given the input. For a deterministic system there is no such uncertainty and this term can be discounted (see Bell and Sejnowski, 1995). More generally

$$\frac{\partial}{\partial \phi} I(u, v; \phi) = \frac{\partial}{\partial \phi} H(v; \phi) \tag{6}$$

It follows that maximising the mutual information is the same as maximising the entropy of the responses. The infomax principle (maximum information transfer) is closely related to the idea of efficient coding. Generally speaking, redundancy minimisation and efficient coding are all variations on the same theme and can be considered as the infomax principle operating under some appropriate constraints or bounds. Clearly, it would be trivial to conform to the infomax principle by simply multiplying the inputs by a very large number. What we would like to do is to capture the information in the inputs using a small number of output channels operating in some bounded way. The key thing that distinguishes among the various information theoretic schemas is the nature of the constraints under which entropy is maximised. These constraints render infomax a viable approach to recovering the original causes of data, if one can enforce the outputs to conform to the same distribution as the causes (see Section 3.3.1). One useful way of looking at constraints is in terms of efficiency.

3.3.1. Efficiency, redundancy and information

The efficiency of a system can be considered as the complement of redundancy (Barlow, 1961): the less redundant, the more efficient a system will be. Redundancy is reflected in the dependencies or mutual information among the outputs (cf. Gawne and Richmond, 1993).

$$I(v; \phi) = \sum_i H(v_i; \phi) - H(v; \phi) \tag{7}$$

Here $H(v_i; \phi)$ is the entropy of the ith output. Eq. (7) implies that redundancy is the difference between the sum of the entropies of the individual units (component entropies) and their joint entropy. Intuitively, this expression makes sense if one considers that the variability in activity of any single unit corresponds to its entropy. Therefore, an efficient neuronal system represents its inputs with minimal excursions from baseline firing rates. Another way of thinking about Eq. (7) is to note that maximising efficiency is equivalent to minimising the mutual information among the outputs. This is the basis of approaches that seek to de-correlate or orthogonalise the outputs. To minimise redundancy one can either minimise the entropy of the output units or maximise their joint entropy, while ensuring the other is bounded in some way. Olshausen and Field (1996) present a very nice analysis based on sparse coding. Sparse coding minimises redundancy using single units with low entropy: it implies coding by units that fire very sparsely and will, generally, not be firing. Therefore, one can be relatively certain about their (quiescent) state, conferring low entropy on them.


Approaches that seek to maximise the joint entropy of the units include principal component analysis (PCA) learning algorithms, which sample the subspace of the inputs that has the highest entropy (e.g. Foldiak, 1990), and independent component analysis (ICA). In PCA the component entropies are bounded by scaling the connection strengths of a simple recognition model $v = R(u, \phi) = \phi u$ so that the sum of the variances of $v_i$ is constant. ICA finds nonlinear functions of the inputs that maximise the joint entropy (Comon, 1994; Bell and Sejnowski, 1995). The component entropies are constrained by passing the outputs through a sigmoid squashing function $v = R(u, \phi) = \sigma(\phi u)$ so that the outputs lie in a bounded interval (hypercube). See Section 3.6.1 for a different perspective on ICA in which the outputs are not bounded but forced to have cumulative density functions that conform to the squashing function.

An important aspect of the infomax principle is that it goes a long way towards explaining functional segregation in the cortex. One perspective on functional segregation is that each cortical area is segregating its inputs into relatively independent functional outputs. This is exactly what infomax predicts. See Friston et al. (2001, and references therein) for an example of how infomax can be used to predict the segregation of processing streams from V2 to specialised motion, colour and form areas in extrastriate cortex.

3.3.2. Implementation

In terms of the above formulation, information theoretic approaches can be construed as finding the parameters of a forward recognition function that maximise the efficiency or minimise the redundancy

$$\phi = \min_\phi I(v; \phi), \qquad v = R(u, \phi) \tag{8}$$

But when are the outputs of an infomax model veridical estimates of the causes of its inputs? This is assured when: (i) the generating process is invertible; and (ii) the real world causes are independent, such that $H(s) = \sum_i H(s_i)$. This can be seen by noting

$$I(v; \phi) = \sum_i H(v_i; \phi) - H(v; \phi) = \sum_i H(R_i(G(s), \phi)) - \sum_i H(s_i) - \left\langle \ln \left| \frac{\partial R(G(s), \phi)}{\partial s} \right| \right\rangle \geq 0 \tag{9}$$

with equality when $v = R(u, \phi) = G^{-1}(u) = s$. Compared to the connectionist scheme this has the fundamental advantage that the learning is unsupervised, by virtue of the fact that the causes and generating process are not needed by Eq. (8). Note that the architectures in Fig. 1, depicting connectionist and infomax schemes, are identical apart from the nodes representing desired output (unfilled circles in the upper panel). However, there are some outstanding problems:

• First, infomax recovers causes only when the generating process is invertible. However, as we have seen above, the nonlinear convolution of causes generating inputs may not be invertible. This means that the recognition enacted by forward connections may not be defined in relation to the generation of inputs.

• Second, we have to assume that the causes are independent. While this may be sensible for simple systems, it is certainly not appropriate for the more realistic hierarchical processes that generate sensory inputs (see Section 3.5.1). This is because correlations among causes at any level are induced by, possibly independent, causal changes at supraordinate levels.

Finally, the dynamical nature of evoked neuronal transients is lost in many information theoretic formulations, which treat the inputs as a stationary process, not as the products of a dynamical system. This is because the mutual information and entropy measures that govern learning pertain to probability distributions. These densities do not embody information about the temporal evolution of states if they simply describe the probability that the system will be found in a particular state when sampled over time. Indeed, in many instances, the connection strengths are identifiable given just the densities of the inputs, without any reference to the fact that they were generated dynamically or constituted a time-series (cf. principal component learning algorithms that need only the covariances of the inputs). Discounting dynamics is not a fundament of infomax schemas. For example, my own work using ICA referred to above (Friston et al., 2000) expanded inputs using temporal basis functions to model the functional segregation of motion, colour and form in V2. This segregation emerged as a consequence of maximising the information transfer between spatio-temporal patterns of visual inputs and V2 outputs.

In summary, ICA and like-minded approaches, which try to find some deterministic function of the inputs that maximises information transfer, impose some simplistic and strong constraints on the generating process that must be met before veridical representations emerge. In the final approach considered here, we discuss predictive coding models that do not require invertibility or independence and, consequently, suggest a more natural form for representational learning.
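The infomax constraint described above, outputs pushed through a squashing function that plays the role of an assumed cumulative density, can be sketched in one dimension. The example below is an illustrative scalar version of the Bell and Sejnowski (1995) gradient; the gain of 2, sample size and learning rate are assumptions. Maximising the output entropy of $v = \sigma(\phi u)$ drives φ to undo the input gain, because the output is uniform (maximum entropy on a bounded interval) exactly when φu matches the logistic source.

```python
import numpy as np

rng = np.random.default_rng(3)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

s = rng.logistic(size=20000)     # a logistic-distributed cause
u = 2.0 * s                      # input: the cause under an unknown gain

# Gradient ascent on the output entropy H(sigmoid(phi * u)):
# dH/dphi = 1/phi + E[u (1 - 2 sigmoid(phi u))]   (scalar infomax rule)
phi = 1.0
for _ in range(500):
    phi += 0.05 * (1.0 / phi + np.mean(u * (1.0 - 2.0 * sigmoid(phi * u))))

# phi settles near 0.5, undoing the gain: sigmoid(phi * u) = sigmoid(s) is
# the source's own CDF evaluated at the source, i.e. uniformly distributed.
print(round(phi, 1))             # 0.5
```

The same logic fails when the generating process is non-invertible or the squashing function does not match the causes' distribution, which is the pair of constraints the bullets above identify.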


844 their nonlinear mixing is sufficient to predict the input with or forward model is instantiated explicitly. Here the forward 899 845 minimal error. model is the nonlinear mixing of causes that, by definition 900 846 Conceptually, predictive coding and generative models must exist. The estimation of the causes still rests upon con- 901 847 (see further) are related to ‘analysis-by-synthesis’ (Neisser, straints, but these are now framed in terms of the forward 902 848 1967). This approach to perception, from cognitive psychol- model and have a much more direct relationship to casual 903 849 ogy, involves adapting an internal model of the world to processes in the real world. The ensuing mirror symmetry 904 850 match sensory input and was suggested by Mumford (1992) between the real generative process and its forward model 905 851 as a way of understanding hierarchical neuronal process- is illustrated in the architecture in Fig. 1. Notice that the 906 852 ing. The idea is reminiscent of Mackay’s epistemological connections within the model are now going backwards. In 907 853 automata (MacKay, 1956) which perceive by comparing ex- the predictive coding scheme these backward connections, 908 854 pected and actual sensory input (Rao, 1999). These mod- parameterised by θ form predictions from some estimate of 909 855 els emphasise the role of backward connections in medi- the causes v to provide a prediction error. The parameters 910 856 ating the prediction, at lower or input levels, based on the now change to minimise some function of the prediction er- 911 857 activity of units in higher levels. The connection strengths ror cf. Eq. (4). 912 858 of the model are changed so as to minimise the error be- θ = min θV(ε, θ) 859 tween the predicted and observed inputs at any level. This (10) ε = u − G(v, θ) 913 860 is in direct contrast to connectionist approaches were con- 861 nection strengths change to minimise the error between the The differences between Eqs. 
observed and desired output. In predictive coding there is no 'output' because the representational meaning of the units is not pre-specified but emerges during learning.

Predictive coding schemes can also be regarded as arising from the distinction between forward and inverse models adopted in machine vision (Ballard et al., 1983; Kawato et al., 1993). Forward models generate inputs from causes, whereas inverse models approximate the reverse transformation of inputs to causes. This distinction embraces the non-invertibility of generating processes and the ill-posed nature of inverse problems. As with all underdetermined inverse problems, the role of constraints becomes central. In the inverse literature, a priori constraints usually enter in terms of regularised solutions. For example: "Descriptions of physical properties of visible surfaces, such as their distance and the presence of edges, must be recovered from the primary image data. Computational vision aims to understand how such descriptions can be obtained from inherently ambiguous and noisy data. A recent development in this field sees early vision as a set of ill-posed problems, which can be solved by the use of regularisation methods" (Poggio et al., 1985). The architectures that emerge from these schemes suggest that "feedforward connections from the lower visual cortical area to the higher visual cortical area provides an approximated inverse model of the imaging process (optics), while the backprojection connection from the higher area to the lower area provides a forward model of the optics" (Kawato et al., 1993).

3.4.1. Implementation

Predictive, or more generally generative, models turn the inverse problem on its head. Instead of trying to find functions of the inputs that predict the causes, they find functions of causal estimates that predict the inputs. As in approaches based on information theory, the causes do not enter into the learning rules, which are therefore unsupervised. Furthermore, they do not require the convolution of causes, engendering the inputs, to be invertible. This is because generative […]

[…] (10) and (4) are that the errors are at the input level, as opposed to the output level, and the parameters now pertain to a forward model instantiated in backward connections. This minimisation scheme eschews the real causes s, but where do their estimates come from? These causal estimates or representations change in the same way as the other free parameters of the model. They change to minimise prediction error, subject to some a priori constraint modelled by a regularisation term λ(v, θ), usually through gradient ascent.¹

v̇ = −∂V(ε, θ)/∂v + ∂λ(v, θ)/∂v    (11)

The error is conveyed from the input layer to the output layer by forward connections that are rendered as a broken line in the lower panel of Fig. 1. This component of the predictive coding scheme has a principled (Bayesian) motivation that is described in the next subsection. For the moment, consider what would transpire after training, when prediction error is largely eliminated. This implies that the brain's nonlinear convolution of the estimated causes recapitulates the real convolution of the real causes. In short, there is a veridical (or at least sufficient) representation of both the causes and the dynamical structure of their mixing through the backward connections θ.

The dynamics of representational units or populations implied by Eq. (11) represent the essential difference between this class of approaches and those considered above. Only in predictive coding are the dynamics changing to minimise the same objective function as the parameters. In both the connectionist and infomax schemes the representations of a given cause can only be changed vicariously through the connection parameters. Predictive coding is a strategy that has some compelling (Bayesian) underpinnings (see further) and is not simply using a connectionist architecture in auto-associative mode or using error minimisation to maximise information transfer.

¹ For simplicity, time constants have been omitted from expressions describing the ascent of states or parameters on objective functions.
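To make the ascent in Eq. (11) concrete, the following sketch minimises prediction error for a linear forward model, with a small shrinkage term standing in for the regulariser λ(v, θ). Everything here is an illustrative assumption (a known linear θ, a quadratic error potential V = ½‖u − θv‖²), not the implementation described in the text.

```python
import numpy as np

# Toy version of Eq. (11): causal estimates v_hat descend the prediction
# error V = 1/2 ||u - theta v||^2 while a shrinkage term -1/2 a ||v||^2
# stands in for the regulariser lambda(v, theta). theta is assumed known.
rng = np.random.default_rng(0)
theta = rng.standard_normal((8, 3))   # backward-connection parameters
v_true = np.array([1.0, -2.0, 0.5])   # real causes (never seen directly)
u = theta @ v_true                    # sensory input generated by the causes

v_hat = np.zeros(3)                   # causal estimates / representations
a, lr = 0.001, 0.02                   # regulariser weight, step size
for _ in range(5000):
    eps = u - theta @ v_hat           # prediction error at the input level
    v_hat += lr * (theta.T @ eps - a * v_hat)

print(np.round(v_hat, 2))             # close to the real causes
```

With the error largely eliminated, the estimates recapitulate the real causes, illustrating the 'veridical representation' point made above.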

K. Friston / Progress in Neurobiology 590 (2002) 1–31

It is a real-time, dynamical scheme that embeds two concurrent processes: (i) the parameters of the generative or forward model change to emulate the real-world mixing of causes, using the current estimates; and (ii) these estimates change to best explain the observed inputs, using the current forward model. Both the parameters and the states change in an identical fashion to minimise prediction error. The predictive coding scheme eschews the problems associated with earlier schemes. It can easily accommodate nonlinear mixing of causes in the real world. It does not require this mixing to be invertible and needs only the sensory inputs. However, there is an outstanding problem:

• To finesse the inverse problem, posed by non-invertible generative models, regularisation constraints are required. These resolve the problem of non-invertibility that confounds simple infomax schemes but introduce a new problem. Namely, one needs to know the prior distribution of the causes. This is because, as shown next, the regularisation constraints are based on these priors.

In summary, predictive coding treats representational learning as an ill-posed inverse problem and uses an explicit parameterisation of a forward model to generate predictions of the observed input. The ensuing error is then used to refine the forward model. This component of representational learning is dealt with below (Section 3.6). The predictions are based on estimated causes that also minimise prediction error, under some constraints that resolve the generally ill-posed estimation problem. We now consider these constraints from a Bayesian point of view.

3.4.2. Predictive coding and Bayesian inference

One important aspect of predictive coding and generative models (see further) is that they portray the brain as an inferential machine (Dayan et al., 1995). From this perspective, functional architectures exist, not to filter the input to obtain the causes, but to estimate causes and test the predictions against the observed input. A compelling aspect of predictive coding schemes is that they lend themselves to Bayesian treatment. This is important because it can be extended using empirical Bayes and hierarchical models. In what follows, we shall first describe the Bayesian view of regularisation in terms of priors on the causes. We then consider hierarchical models in which priors can be derived empirically. The key implication, for neuronal implementations of predictive coding, is that empirical priors eschew assumptions about the independence of causes (cf. infomax schemes) or the form of constraints in regularised inverse solutions.

Suppose we knew the a priori distribution of the causes p(v) but wanted the best estimate given the input. This maximum a posteriori (MAP) estimate maximises the posterior p(v|u). The two are related through Bayes rule, which states that the probability of the cause and input occurring together is the probability of the cause given the input times the probability of the input. This, in turn, is the same as the probability of the input given the causes times the prior probability of the causes.

p(u, v) = p(v|u) p(u) = p(u|v) p(v)    (12)

The MAP estimator of the causes is the most likely given the data

vm = max_v ln p(v|u) = max_v [ln p(u|v) + ln p(v)]    (13)

The first term on the right is known as the log likelihood or likelihood potential and the second is the prior potential. A gradient ascent to find vm would take the form

v̇ = ∂ℓ/∂v,  ℓ(u) = ln p(u|v; θ) + ln p(v; θ)    (14)

where the dependence of the likelihood and priors on the model parameters has been made explicit. The likelihood is defined by the forward model u = G(v, θ) + ε, where p(u|v; θ) ∝ exp(−V(ε, θ)). V now plays the role of a Gibbs potential that specifies one's distributional assumptions about the prediction error. Now we have

v̇ = −∂V(ε, θ)/∂v + ∂ ln p(v; θ)/∂v    (15)

This is formally identical to the predictive coding scheme of Eq. (11), in which the regularisation term λ(v, θ) = ln p(v; θ) becomes a log prior that renders the ensuing estimation Bayesian. In this formulation the state of the brain changes, not to minimise error per se, but to attain an estimate of the causes that maximises both the likelihood of the input given that estimate and the prior probability of the estimate being true. The implicit Bayesian estimation can be formalised from a number of different perspectives. Rao and Ballard (1998) give a very nice example using the Kalman filter that goes some way to dealing with the dynamical aspect of real sensory inputs.

3.5. Cortical hierarchies and empirical Bayes

The problem with Eq. (15) is that the brain cannot construct priors de novo. They have to be learned along with the forward model. In Bayesian estimation, priors are estimated from data using empirical Bayes. Empirical Bayes harnesses the hierarchical structure of a forward model, treating the estimates of causes at one level as prior expectations for the subordinate level (Efron and Morris, 1973). This provides a natural framework within which to treat cortical hierarchies in the brain, each level providing constraints on the level below. Fig. 2 depicts a hierarchical architecture that is described in more detail below. This extension models the world as a hierarchy of (dynamical) systems, where supraordinate causes induce, and moderate, changes in subordinate causes. For example, the presence of a particular object in the visual field changes the incident light falling on a particular part of the retina. A more abstract example, which illustrates the brain's inferential capacities, is presented in Fig. 3.
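For a scalar linear model, the MAP scheme of Eqs. (13)–(15) can be checked against the analytic answer. This is a hypothetical toy (Gaussian prior and error, known θ; all values invented), not an example from the text:

```python
# Scalar sketch of Eq. (15): u = theta*v + noise with error variance se2
# and prior v ~ N(0, sp2). Gradient ascent on ln p(u|v) + ln p(v) should
# settle at the analytic posterior mode.
theta, se2, sp2 = 2.0, 0.5, 1.0
u = 3.0                                              # observed input

v = 0.0
for _ in range(1000):
    dl = theta * (u - theta * v) / se2 - v / sp2     # likelihood + prior terms
    v += 0.01 * dl

v_map = theta * sp2 * u / (theta**2 * sp2 + se2)     # analytic MAP estimate
print(round(v, 4), round(v_map, 4))                  # → 1.3333 1.3333
```

Note that the prior term shrinks the estimate below the likelihood-only solution u/θ = 1.5, exactly as the log prior in Eq. (15) biases the state of the network.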


Fig. 2. Schematic depicting a hierarchical extension to the predictive coding architecture, using the same format as Fig. 1. Here, hierarchical arrangements within the model serve to provide predictions or priors to representations in the level below. The open circles are the error units and the filled circles are the representations of causes in the environment. These representations change to minimise both the discrepancy between their value and the prediction from the level above, and the mismatch incurred by their own prediction of the representations in the level below. These two constraints correspond to prior and likelihood potentials, respectively (see main text).

On reading the first sentence 'Jack and Jill went up the hill' we perceive the word 'event' as 'went'. In the absence of any hierarchical inference, the best explanation for the pattern of visual stimulation incurred by the text is 'event'. This would correspond to the maximum likelihood estimate of the word and would be the most appropriate in the absence of prior information about which is the most likely word. However, within hierarchical inference the semantic context provides top–down predictions to which the posterior estimate is accountable. When this prior biases in favour of 'went' we tolerate a small error at the lower level of visual analysis to minimise the overall prediction error at the visual and lexical levels. This illustrates the role of higher-level estimates in providing predictions or priors for subordinate levels. These priors offer contextual guidance towards the most likely cause of the input. Note that predictions at higher levels are subject to the same constraints; only the highest level, if there is one in the brain, is free to be directed solely by bottom–up influences (although there are always implicit priors). If the brain has evolved to recapitulate the causal structure of its environment, in terms of its sensory infrastructures, it is interesting to reflect on the possibility that our visual cortices reflect the hierarchical causal structure of our environment. The hierarchical structure of the real world is literally reflected by the hierarchical architectures trying to minimise prediction error, not just at the level of sensory input but at all levels of the hierarchy (notice the deliberate mirror symmetry in Fig. 2). The nice thing about this architecture is that the dynamics of causal representations at the ith level vi require only the error for the current level and the immediately preceding level. This follows from the Markov property of hierarchical systems, where one only needs to know the immediately supraordinate causes to determine the density of the causes at any level in question, i.e. p(vi|vi+1, ..., vn) = p(vi|vi+1). The fact that only error from the current and lower level is required to drive the dynamics of vi is important because it permits a biologically plausible implementation, where the connections driving the error minimisation have only to run forward from one level to the next (see Section 3.5.1 and Fig. 2).

Fig. 3. Schematic illustrating the role of priors in biasing towards one representation of an input or another. Left: The word 'event' is selected as the most likely cause of the visual input. Right: The word 'went' is selected as the most likely word that is: (i) a reasonable explanation for the sensory input; and (ii) conforms to prior expectations induced by semantic context.
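The 'went'/'event' example can be caricatured with Bayes rule; the numbers below are invented purely for illustration:

```python
import numpy as np

# Visual evidence alone slightly favours 'event'; an empirical prior from
# the sentence context favours 'went'; the posterior selects 'went'.
words = ['event', 'went']
log_lik = np.log([0.55, 0.45])             # p(visual input | word)
log_prior_flat = np.log([0.5, 0.5])        # no semantic context
log_prior_context = np.log([0.1, 0.9])     # 'Jack and Jill ... up the hill'

ml_choice = words[int(np.argmax(log_lik + log_prior_flat))]
map_choice = words[int(np.argmax(log_lik + log_prior_context))]
print(ml_choice, map_choice)               # → event went
```

The maximum likelihood choice tolerates no error at the visual level; the MAP choice accepts a small visual error to reduce the overall (visual plus lexical) prediction error.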


3.5.1. Empirical Bayes in the brain

The biological plausibility of the scheme depicted in Fig. 2 can be established fairly simply. To do this, a hierarchical predictive scheme is described in some detail. A more thorough account of this scheme, including simulations of various neurobiological and psychophysical phenomena, will appear in future publications. For the moment, we will review the neuronal implementation at a purely theoretical level, using the framework developed above.

Consider any level i in a cortical hierarchy containing units (neurons or neuronal populations) whose activity vi is predicted by corresponding units in the level above vi+1. The hierarchical form of the implicit generative model is

u = G1(v2, θ1) + ε1
v2 = G2(v3, θ2) + ε2    (16)
v3 = ···

with v1 = u. Technically, these models fall into the class of conditionally independent hierarchical models when the error terms are independent at each level (Kass and Steffey, 1989). These models are also called parametric empirical Bayes (PEB) models because the obvious interpretation of the higher-level densities as priors led to the development of PEB methodology (Efron and Morris, 1973). We require units in all levels to jointly maximise the posterior probabilities of vi+1 given vi. We will assume the errors are Gaussian with covariance Σi = Σ(λi). Therefore, θi and λi parameterise the means and covariances of the likelihood at each level

p(vi|vi+1) = N(vi : G(vi+1, θi), Σi) ∝ |Σi|^(−1/2) exp(−½ εiᵀ Σi⁻¹ εi)    (17)

This is also the prior density for the level below. Although θi and λi are both parameters of the forward model, λi are sometimes referred to as hyperparameters and in classical statistics correspond to variance components. We will preserve the distinction between parameters and hyperparameters because minimising the prediction error with respect to the estimated causes and parameters is sufficient to maximise the likelihood of neuronal states at all levels. This is the essence of predictive coding. For the hyperparameters there is an additional term that depends on the hyperparameters themselves (see further).

In this hierarchical setting, the objective function comprises a series of log likelihoods

ℓ(u) = ln p(u|v2) + ln p(v2|v3) + ···
     = −½ ξ1ᵀξ1 − ½ ξ2ᵀξ2 − ··· − ½ ln|Σ1| − ½ ln|Σ2| − ···    (18)
ξi = vi − Gi(vi+1, θ) − λi ξi = (1 + λi)⁻¹ εi

Here Σ(λi)^(1/2) = 1 + λi. The likelihood at each level corresponds to p(vi|vi+1), which also plays the role of a prior on vi that is jointly maximised with the likelihood of the level below p(vi−1|vi). In a neuronal setting the (whitened) prediction error is encoded by the activities of units denoted by ξi. These error units receive a prediction from units in the level above² and connections from the principal units vi being predicted. Horizontal interactions among the error units serve to de-correlate them (cf. Foldiak, 1990), where the symmetric lateral connection strengths λi hyper-parameterise the covariances of the errors Σi, which are the prior covariances for level i − 1.

The estimators vi+1 and the connection strength parameters perform a gradient ascent on the compound log probability

v̇i+1 = ∂ℓ/∂vi+1 = −(∂ξiᵀ/∂vi+1) ξi − (∂ξi+1ᵀ/∂vi+1) ξi+1
θ̇i = ∂ℓ/∂θi = −(∂ξiᵀ/∂θi) ξi    (19)
λ̇i = ∂ℓ/∂λi = −(∂ξiᵀ/∂λi) ξi − (1 + λi)⁻¹

When Gi(vi+1, θ) models dynamical processes (i.e. is effectively a convolution operator) this gradient ascent is more complicated. In a subsequent paper we will show that, with dynamical models, it is necessary to maximise both ℓ and its temporal derivatives (e.g. ℓ̇). An alternative is to assume a simple hidden Markov model for the dynamics and use Kalman filtering (cf. Rao and Ballard, 1998). For the moment, we will assume the inputs change sufficiently slowly for gradient ascent not to be confounded.

Despite the complicated nature of the hierarchical model and the abstract theorising, three simple and biologically plausible things emerge:

• Reciprocal connections
The dynamics of representational units vi+1 are subject to two, locally available, influences: a likelihood term mediated by forward afferents from the error units in the level below, and an empirical prior term conveyed by error units in the same level. This follows from the conditional independence conferred by the hierarchical structure of the model. Critically, the influences of the error units in both levels are mediated by linear connections with a strength that is exactly the same as the (negative) effective connectivity of the reciprocal connection from vi+1 to ξi and ξi+1 (see Box 1 for a definition of effective connectivity). In short, the lateral, forward and backward connections are all reciprocal, consistent with anatomical observations. Lateral connections, within each level, decorrelate the error units, allowing competition between prior expectations with different precisions (precision is the inverse of variance).

² Clearly, the backward connections are not inhibitory but, after mediation by inhibitory interneurons, their effective influence could be rendered inhibitory.
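A stripped-down numerical sketch of the joint ascent in Eq. (19) for a two-level linear hierarchy can be made, with λi = 0 so that ξi = εi and with the connections θ assumed known (only the states are updated). All dimensions and values are illustrative assumptions:

```python
import numpy as np

# u = th1 @ v2 + e1 and v2 = th2 @ v3 + e2 (noise-free for clarity).
# Each level's estimate is driven only by the error units in its own
# level and the level below, as in Eq. (19).
rng = np.random.default_rng(1)
th1 = rng.standard_normal((6, 4)) / 2
th2 = rng.standard_normal((4, 2))
v3_true = np.array([1.5, -1.0])
v2_true = th2 @ v3_true
u = th1 @ v2_true                       # sensory input

v2, v3 = np.zeros(4), np.zeros(2)
lr = 0.05
for _ in range(100_000):
    xi1 = u - th1 @ v2                  # error units, level 1
    xi2 = v2 - th2 @ v3                 # error units, level 2
    v2 += lr * (th1.T @ xi1 - xi2)      # likelihood + empirical prior terms
    v3 += lr * (th2.T @ xi2)

print(np.round(v3, 2))                  # close to v3_true
```

Note the locality emphasised in the text: the update for v3 uses only ξ2 (its own level's error), and the update for v2 uses only ξ1 and ξ2.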


• Functionally asymmetric forward and backward connections
The forward connections are the reciprocal (negative transpose) of the backward effective connectivity ∂ξi/∂vi+1 from the higher level to the lower level, extant at that time. However, the functional attributes of the forward and backward influences are different. The influences of units on error units in the lower level mediate the forward model ξi = −Gi(vi+1, θ) + .... These can be nonlinear, where each unit in the higher level may modulate or interact with the influence of others (according to the nonlinearities in G). In contradistinction, the influences of units in lower levels do not interact when producing changes in the higher level, because their effects are linearly separable: v̇i+1 = −(∂ξiᵀ/∂vi+1) ξi − ···. This is a key observation because the empirical evidence, reviewed in the previous section, suggests that backward connections are in a position to interact (e.g. through NMDA receptors expressed predominantly in the supragranular layers receiving backward connections) whereas forward connections are not. It should be noted that, although the implied forward connections ∂ξi/∂vi+1 mediate linearly separable effects of ξi on vi+1, these connections might be activity- and time-dependent because of their dependence on vi+1.

• Associative plasticity
Changes in the parameters correspond to plasticity in the sense that the parameters control the strength of backward and lateral connections. The backward connections parameterise the prior expectations of the forward model and the lateral connections hyper-parameterise the prior covariances. Together they parameterise the Gaussian densities that constitute the priors (and likelihoods) of the model. The motivation for these parameters maximising the same objective function ℓ as the neuronal states is discussed in the next subsection. For the moment, we are concerned with the biological plausibility of these changes. The plasticity implied is seen more clearly with an explicit parameterisation of the connections. For example, let Gi(vi+1, θi) = θi vi+1. In this instance

θ̇i = (1 + λi)⁻¹ ξi vi+1ᵀ    (20)
λ̇i = (1 + λi)⁻¹ (ξi ξiᵀ − 1)

This is just Hebbian or associative plasticity, where the connection strengths change in proportion to the product of pre- and post-synaptic activity. An intuition about Eq. (20) obtains by considering the conditions under which the expected change in parameters is zero (i.e. after learning). For the backward connections this implies there is no component of prediction error that can be explained by causal estimates at the higher level, ⟨ξi vi+1ᵀ⟩ = 0. The lateral connections stop changing when the prediction error has been whitened, ⟨ξi ξiᵀ⟩ = 1.

Non-diagonal forms for λi complicate the biological interpretation because changes at any one connection depend on changes elsewhere. The problem can be finessed slightly by rewriting the equations as

θ̇i = ξi vi+1ᵀ − λi θ̇i    (21)
λ̇i = ξi ξiᵀ − λi λ̇i − 1

where the decay terms are mediated by integration at the cell body, in a fashion similar to that described in Friston et al. (1993).

The overall scheme implied by Eq. (19) sits comfortably with the hypothesis of Mumford (1992): "On the role of the reciprocal, topographic pathways between two cortical areas, one often a 'higher' area dealing with more abstract information about the world, the other 'lower', dealing with more concrete data. The higher area attempts to fit its abstractions to the data it receives from lower areas by sending back to them from its deep pyramidal cells a template reconstruction best fitting the lower level view. The lower area attempts to reconcile the reconstruction of its view that it receives from higher areas with what it knows, sending back from its superficial pyramidal cells the features in its data which are not predicted by the higher area. The whole calculation is done with all areas working simultaneously, but with order imposed by synchronous activity in the various top–down, bottom–up loops."

In summary, the predictive coding approach lends itself naturally to a hierarchical treatment, which considers the brain as an empirical Bayesian device. The dynamics of the units or populations are driven to minimise error at all levels of the cortical hierarchy and implicitly render themselves posterior estimates of the causes given the data. In contradistinction to connectionist schemas, hierarchical prediction does not require any desired output. Indeed, predictions of intermediate outputs at each hierarchical level emerge spontaneously. Unlike information theoretic approaches, they do not assume independent causes and invertible generative processes. In contrast to regularised inverse solutions (e.g. in machine vision) they do not depend on a priori constraints. These emerge spontaneously as empirical priors from higher levels. The Bayesian considerations above pertain largely to the estimates of the causes. In the final subsection we consider the estimation of the model parameters using the framework provided by density learning with generative models.
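Before turning to generative models, the Hebbian rule in Eq. (20) can be sketched for the linear case with λi = 0. The causes are sampled directly rather than inferred, which is an illustrative shortcut; all values are assumptions:

```python
import numpy as np

# Backward weights change in proportion to the product of pre-synaptic
# causal estimates v and post-synaptic prediction error xi (Eq. (20)).
# Learning stops, in expectation, when <xi v^T> = 0.
rng = np.random.default_rng(2)
theta_true = rng.standard_normal((5, 3))   # real mixing of causes
theta = np.zeros((5, 3))                   # learned backward connections

for _ in range(5000):
    v = rng.standard_normal(3)             # higher-level causes (sampled)
    u = theta_true @ v                     # input they generate
    xi = u - theta @ v                     # prediction error units
    theta += 0.01 * np.outer(xi, v)        # Hebbian: error x cause

print(np.round(np.abs(theta - theta_true).max(), 4))  # → 0.0
```

Once no component of the error can be explained by the causes, the expected update vanishes and the backward model has converged.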


3.6. Generative models and representational learning

In this section we bring together the various schemes considered above using the framework provided by density estimation as a way of fitting generative models. This section follows Dayan and Abbott (2001), to which the reader is referred for a fuller discussion. Generative models represent a generic formulation of representational learning in a self-supervised context. There are many forms of generative models, ranging from conventional statistical models (e.g. factor and cluster analysis) to those motivated by Bayesian inference and learning (e.g. Dayan et al., 1995; Hinton et al., 1995). Indeed, many of the algorithms discussed under the heading of information theory can be formulated as generative models. The goal of generative models is "to learn representations that are economical to describe but allow the input to be reconstructed accurately" (Hinton et al., 1995). In current treatments, representational learning is framed in terms of estimating probability densities of the inputs and outputs. Although density learning is formulated at a level of abstraction that eschews many issues of neuronal implementation (e.g. the dynamics of real-time learning), it does provide a unifying framework that connects the various schemes considered so far.

The goal of generative models is to make the density of the inputs implied by the generative model, p(u; θ), as close as possible to that observed, p(u). The generative model is specified in terms of the prior distribution over the causes p(v; θ) and the conditional generative distribution of the inputs given the causes p(u|v; θ), which together define the marginal distribution that has to be matched to the input distribution

p(u; θ) = ∫ p(u|v; θ) p(v; θ) dv    (22)

Once the parameters of the generative model have been estimated, through this matching, the posterior density of the causes, given the inputs, is given by the recognition model, defined in terms of the recognition distribution

p(v|u; θ) = p(u|v; θ) p(v; θ) / p(u; θ)    (23)

However, as considered in depth above, the generative model may not be invertible and it may not be possible to compute the recognition distribution from Eq. (23). In this instance, an approximate recognition distribution q(v; u, φ) can be used, which we try to approximate to the true one. This distribution has some parameters φ that need to be learned, for example, the strength of forward connections. The question addressed in this review is whether forward connections are sufficient for representational learning. For a moment, consider deterministic models that discount probabilistic or stochastic aspects. We have been asking whether we can find the parameters of a deterministic recognition model that renders it the inverse of a generating process

v(u, φ) = G⁻¹(u, θ)    (24)

The problem is that G(v, θ) is a nonlinear convolution and is generally not invertible. The generative model approach posits that it is sufficient to find the parameters of an (approximate) recognition model φ and the generative model θ that predict the inputs

G(v(u, φ), θ) = u    (25)

under the constraint that the recognition model is (approximately) the inverse of the generative model. Eq. (25) is the same as Eq. (24) after applying G to both sides. The implication is that one needs an explicit parameterisation of the (approximate) recognition (inverse) and generative (forward) models, which induces the need for both forward and backward influences. Separate recognition and generative models resolve the problem caused by the non-invertibility of generating processes. The corresponding motivation, in probabilistic learning, rests on finessing the combinatorial explosion of ways in which stochastic generative models can generate input patterns (Dayan et al., 1995). The combinatorial explosion represents another perspective on the uninvertible 'many to one' relationship between causes and inputs.

In the general density learning framework, representational learning has two components that can be seen in terms of expectation maximisation (EM, Dempster et al., 1977). In the E-Step the approximate recognition distribution is modified to match the density implied by the generative model parameters, so that q(v; u, φ) ≈ p(v|u; θ), and in the M-Step these parameters are changed to render p(u; θ) ≈ p(u). In other words, the E-Step ensures the recognition model approximates the generative model and the M-Step ensures that the generative model can predict the observed inputs. If the model is invertible the E-Step reduces to setting q(v; u, φ) = p(v|u; θ) using Eq. (23). Probabilistic recognition proceeds by using q(v; u, φ) to determine the probability that v caused the observed sensory inputs. This recognition becomes deterministic when q(v; u, φ) is a Dirac δ-function over the MAP estimator of the causes vm. The distinction between probabilistic and deterministic recognition is important because we have restricted ourselves to deterministic models thus far, but these are special cases of density estimation in generative modelling.

3.6.1. Density estimation and EM

EM provides a useful procedure for density estimation that helps relate many different models within a framework that has direct connections with statistical physics. Both steps of the EM algorithm involve maximising a function of the densities that corresponds to the negative free energy in statistical physics.

F(φ, θ) = ⟨∫ q(v; u, φ) ln [p(v, u; θ)/q(v; u, φ)] dv⟩_u
        = ⟨ln p(u; θ)⟩_u − ⟨KL(q(v; u, φ), p(v|u; θ))⟩_u    (26)

This objective function comprises two terms. The first is the expected log likelihood of the inputs, under the generative model, over the observed inputs. Maximising this term implicitly minimises the Kullback–Leibler (KL) divergence³ between the actual input density and that implied by the generative model. This is equivalent to maximising the log likelihood of the inputs.

³ A measure of the discrepancy between two densities.
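The decomposition in Eq. (26) can be verified numerically for a toy discrete model with a binary cause; all probabilities below are invented for illustration:

```python
import numpy as np

# F(q) = E_q[ln p(v,u) - ln q(v)] = ln p(u) - KL(q, p(v|u)), so F is
# bounded by ln p(u) and touches it exactly when q is the true
# recognition density (the E-Step).
p_v = np.array([0.7, 0.3])              # prior over causes v in {0, 1}
p_u_given_v = np.array([0.2, 0.9])      # p(u = 1 | v)
p_vu = p_v * p_u_given_v                # joint, for the observed u = 1
p_u = p_vu.sum()                        # marginal likelihood of the input
post = p_vu / p_u                       # true recognition density p(v|u)

def F(q):
    return float(np.sum(q * (np.log(p_vu) - np.log(q))))

qs = [np.array([a, 1 - a]) for a in np.linspace(0.01, 0.99, 99)]
best = max(qs, key=F)
print(np.round(best, 2), np.round(post, 2))   # both ≈ [0.34 0.66]
```

A grid search over candidate recognition densities recovers the true posterior, illustrating why improving q can only tighten, never exceed, the log likelihood bound.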


The second term is the KL divergence between the approximating and true recognition densities. In short, maximising F encompasses two components of representational learning: (i) it increases the likelihood that the generative model could have produced the inputs; and (ii) it minimises the discrepancy between the approximate recognition model and that implied by the generative model. The E-Step increases F with respect to the recognition parameters φ through minimising the KL term, ensuring a veridical approximation to the recognition distribution implied by θ. The M-Step increases F by changing θ, enabling the generative model to reproduce the inputs.

E: φ = max_φ F(φ, θ)    (27)
M: θ = max_θ F(φ, θ)

This formulation of representational learning is critical for the thesis of this review because it shows that backward connections, parameterising a generative model, are essential when the model is not invertible. If the generative model is invertible then the KL term can be discounted and learning reduces to the M-Step (i.e. maximising the likelihood). In principle, this could be done using a feedforward architecture corresponding to the inverse of the generative model. However, when the processes generating inputs are non-invertible (due to nonlinear interactions among, and temporal convolutions of, the causes) a parameterisation of the generative model (backward connections) and approximate recognition model (forward connections) is required that can be updated in M- and E-Steps, respectively. In short, non-invertibility enforces an explicit parameterisation of the generative model in representational learning. In the brain this parameterisation may be embodied in backward and lateral connections.

The EM scheme enables exact and approximate maximum likelihood density estimation for a whole variety of generative models that can be specified in terms of priors and generative distributions. Dayan and Abbott (2001) work through a series of didactic examples, from cluster analysis to independent component analysis, within this unifying framework. For example, factor analysis corresponds to the generative model

p(v; θ) = N(v : 0, 1)    (28)
p(u|v; θ) = N(u : θv, Σ)

Namely, the underlying causes of inputs are independent normal variates that are mixed linearly and added to Gaussian noise to form inputs. In the limiting case of Σ → 0 the generative and recognition models become deterministic and the ensuing model conforms to PCA. By simply assuming non-Gaussian priors one can specify generative models for sparse coding of the sort proposed by Olshausen and Field (1996). […] corresponds to the squashing function in Section 3.3.1. The deterministic equivalent of sparse coding is ICA, which obtains when Σ → 0. The relationships among different models are rendered apparent under the perspective of generative models. It is useful to revisit the schemes above to examine their implicit generative and recognition models.

3.6.2. Supervised representational learning

In supervised schemes the generative model is already known and only the recognition model needs to be estimated. The generative model is known in the sense that the desired output determines the input, either deterministically or stochastically (e.g. the input primitives are completely specified by their cause, which is the desired output). In this case only the E-Step is required, in which the parameters φ that specify q(v; u, φ) change to maximise F. The only term in Eq. (26) that depends on φ is the divergence term, such that learning reduces to minimising the expected difference between the approximate recognition density and that required by the generative model. This can proceed probabilistically (e.g. contrastive Hebbian learning in stochastic networks (Abbott and Dayan, 2001, p. 322)) or deterministically. In the deterministic mode q(v; u, φ) corresponds to a δ-function over the point estimator vm = R(u, φ). The connection strengths φ are changed, typically using the delta rule, such that the distance between the modes of the approximate and desired recognition distributions is minimised over all inputs. This is equivalent to nonlinear function approximation; a perspective that can be adopted on all supervised learning of deterministic mappings with neural nets. Note, again, that any scheme based on supervised learning requires the processes generating inputs to be known a priori and as such cannot be used by the brain.

3.6.3. Information theory

In the section on information theory we considered whether infomax principles were sufficient to specify deterministic recognition architectures in the absence of backward connections. They were introduced in terms of finding some function of the inputs that produces an output density with maximum entropy. Maximisation of F attains the same thing through minimising the discrepancy between the observed input distribution p(u) and that implied by a generative model with maximum entropy priors. Although the infomax and density learning approaches have the same objective, their heuristics are complementary. Infomax is motivated by maximising the mutual information between u and v under some constraints. The generative model approach takes its heuristics from the assumption that the causes of inputs are independent and possibly non-Gaussian. This results in a prior with maximum entropy p(v; θ) = ∏ᵢ p(vᵢ; θ).

The reason for adopting non-Gaussian priors (e.g. sparse 1477 p(v; θ) = p(vi,θ)  (29) coding and ICA) is that the central limit theorem implies 1478 1425 p(u |v ; θ) = N(u : θv, ) mixtures of causes will have Gaussian distributions and 1479 1426 where p(viθ) are chosen to be suitably sparse (i.e. therefore something that is not Gaussian is unlikely to be a 1480 1427 heavy-tailed) with a cumulative density function that cor- mixture. 1481
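To make the E- and M-Steps of Eq. (27) concrete, here is a minimal numerical sketch of EM for the factor analysis model of Eq. (28). This is an illustration under our own simplifying assumptions (isotropic noise, invented dimensions and variable names), not an implementation from the review; the recognition density is exact here because the model is linear and Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs from the generative model of Eq. (28): independent unit
# normal causes v, mixed linearly and added to Gaussian noise.
n, d, k = 500, 5, 2
theta_true = rng.normal(size=(d, k))
U = theta_true @ rng.normal(size=(k, n)) + 0.1 * rng.normal(size=(d, n))

def marginal_loglik(U, theta, psi):
    """Log-likelihood of the inputs under u ~ N(0, theta theta' + psi I)."""
    d, n = U.shape
    S = theta @ theta.T + psi * np.eye(d)
    _, logdet = np.linalg.slogdet(S)
    quad = np.trace(np.linalg.solve(S, U @ U.T / n))
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + quad)

theta, psi, ll = rng.normal(size=(d, k)), 1.0, []
for _ in range(50):
    # E-Step: exact recognition density over the causes (Gaussian posterior)
    cov_v = np.linalg.inv(np.eye(k) + theta.T @ theta / psi)
    Ev = cov_v @ theta.T @ U / psi            # posterior means of v
    Evv = n * cov_v + Ev @ Ev.T               # summed second moments of v
    # M-Step: update the generative parameters to increase the likelihood
    theta = (U @ Ev.T) @ np.linalg.inv(Evv)
    psi = (np.sum(U * U) - np.sum((theta @ Ev) * U)) / (n * d)
    ll.append(marginal_loglik(U, theta, psi))
```

Each pass increases (never decreases) the marginal likelihood, the analogue of increasing F, precisely because the E-Step is exact here; with a non-invertible nonlinear model only an approximate recognition density would be available and F would be a lower bound on the log-likelihood.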


For invertible deterministic models v = R(u, φ) = G⁻¹(u, θ) the KL component of F disappears, leaving only the likelihood term.

F = ⟨ln p(u; θ)⟩_u = ⟨ln p(v; θ)⟩_u + ⟨ln p(u|v; θ)⟩_u
  = ⟨∑_i ln p(v_i; θ)⟩_u + ⟨ln |∂R(u, φ)/∂u|⟩_u
  = −∑_i H(v_i; θ) + H(v; φ) − H(u)        (30)

This has exactly the same dependence on the parameters as the objective function employed by infomax in Eq. (7). In this context, the free energy and the information differ only by the entropy of the inputs: −F = I + H(u). This equivalence rests on using maximum entropy priors of the sort assumed for sparse coding.

Notice again that, in the context of invertible deterministic generative models, the parameters of the recognition model specify the generative model, and only the recognition model (i.e. forward connections mediating v = R(u, φ)) needs to be instantiated. If the generative model cannot be inverted, the recognition model is not defined and the scheme above is precluded. In this instance one has to parameterise both an approximate recognition model and a generative model, as required by EM. This enables the use of nonlinear generative models, such as nonlinear PCA (e.g. Kramer, 1991; Karhunen and Joutsensalo, 1994; Dong and McAvoy, 1996; Taleb and Jutten, 1997). These schemes typically employ a 'bottleneck' architecture that forces the inputs through a small number of nodes. The output from these nodes then diverges to produce the predicted inputs. The approximate recognition model is implemented, deterministically, in connections to the bottleneck nodes and the generative model by connections from these nodes to the outputs. Nonlinear transformations, from the bottleneck nodes to the output layer, recapitulate the nonlinear mixing of the real causes of the inputs. After learning, the activity of the bottleneck nodes can be treated as estimates of the causes. These representations obtain by projection of the input onto a low-dimensional curvilinear manifold (encompassing the activity of the bottleneck nodes) by an approximate recognition model.

3.6.4. Predictive coding

In the foregoing, density learning is based on the expectations of probability distributions over the inputs. Clearly the brain does not have direct access to these expectations but sees only one input at any instant. In this instance representational learning has to proceed on-line, by sampling inputs over time.

For deterministic recognition models, q(v; u, φ) is parameterised by its input-specific mode v(u), where q(v(u); u) = 1 and

ℓ(u) = ∫ q(v; u, φ) ln [p(v, u; θ)/q(v; u, φ)] dv = ln p(v(u), u; θ)
     = ln p(u|v(u); θ) + ln p(v(u); θ)        (31)

F = ⟨ℓ(u)⟩_u

ℓ(u) is simply the log of the joint probability, under the generative model, of the observed inputs and their cause implied by approximate recognition. This log probability can be decomposed into a log likelihood and a log prior and is exactly the same objective function used to find the MAP estimator in predictive coding (cf. Eq. (14)).

On-line representational learning can be thought of as comprising two components, corresponding to the E- and M-Steps. The expectation (E) component updates the recognition density, whose mode is encoded by the neuronal activity v, by maximising ℓ(u). Maximising ℓ(u) is sufficient to maximise its expectation F over inputs because it is maximised for each input separately. The maximisation (M) component corresponds to an ascent of the parameters, encoded by the connection strengths, on the same log probability

E: φ̇ = v̇ = ∂ℓ/∂v
M: θ̇ = ∂ℓ/∂θ        (32)

such that the expected change approximates an ascent on F: ⟨θ̇⟩ ≈ ⟨∂ℓ/∂θ⟩_u = ∂F/∂θ.⁴ Eq. (32) is formally identical to Eq. (19), the hierarchical prediction scheme, where the hyperparameters have been absorbed into the parameters. In short, predictive coding can be regarded as an on-line or dynamic form of density estimation using a deterministic recognition model and a stochastic generative model. Conjoint changes in neuronal states and connection strengths map to the expectation maximisation of the approximate recognition and generative models, respectively. Note that there is no explicit parameterisation of the recognition model; the recognition density is simply represented by its mode for the input u at a particular time. This affords a very unconstrained recognition model that can, in principle, approximate the inverse of highly nonlinear generative models.

3.7. Summary

In summary, the formulation of representational learning in terms of generative models embodies a number of key distinctions: (i) invertible versus non-invertible models; (ii) deterministic versus probabilistic representations; and (iii) dynamic versus density learning. Non-invertible generative models require their explicit parameterisation and suggest an important role for backward connections in the brain. Invertible models can, in principle, be implemented using only forward connections because the recognition model completely specifies the generative model and vice versa. However, nonlinear and dynamic aspects of the sensorium render invertibility highly unlikely.

⁴ This approximation can be finessed by using traces, to approximate the expectation explicitly, and changing the connections in proportion with the trace.
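The on-line scheme of Eq. (32) can be sketched with a toy simulation: a scalar cause with a unit normal prior generates the input through a single connection θ; the 'neuronal state' v performs a fast gradient ascent on ℓ(u) for each input (the E component), while θ performs a slow ascent across inputs (the M component). The model, learning rates and noise level are all our own illustrative assumptions, not a simulation from the review.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy generative model (our own illustrative choice): a scalar cause
# v ~ N(0, 1) produces the input u = theta_true * v + Gaussian noise.
theta_true, sigma2 = 2.0, 1.0

def grads(u, v, theta):
    """Gradients of l(u) = ln p(u|v) + ln p(v) with respect to v and theta."""
    err = (u - theta * v) / sigma2      # precision-weighted prediction error
    return theta * err - v, v * err     # dl/dv (E), dl/dtheta (M)

theta, mse = 0.1, []
for _ in range(2000):
    u = theta_true * rng.normal() + np.sqrt(sigma2) * rng.normal()
    # E: fast ascent of l(u) in the 'neuronal state' v, for this input alone
    v = 0.0
    for _ in range(100):
        v += 0.05 * grads(u, v, theta)[0]
    # M: slow ascent in the 'connection strength' theta, across inputs
    theta += 0.001 * grads(u, v, theta)[1]
    mse.append((u - theta * v) ** 2)
```

Over many inputs θ moves from its initial value towards the true mixing strength and the prediction error shrinks; the two time scales mirror fast neuronal dynamics (recognition) and slow synaptic changes (learning).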


This section has focused on the conditions under which forward connections are sufficient to parameterise a generative model. In short, these conditions rest on invertibility and speak to the need for backward connections in the context of nonlinear and non-invertible generative models.

Most of the examples in this section have focused on deterministic recognition models, where neuronal dynamics encode the most likely causes of the current sensory input. This is largely because we have been concerned with how the brain represents things. The distinction between deterministic and probabilistic representation addresses a deeper question about whether neuronal dynamics represent the states of the world or the probability densities of those states. From the point of view of hierarchical models, the state of the neuronal units encodes the mode of the posterior density at any given level. This can be considered a point recognition density. However, the states of units at any level also induce a prior density in the level below. This is because the prior mode is specified by dynamic top-down influences and the prior covariance by the strength of lateral connections. These covariances render the generative model a probabilistic one. By encoding densities in terms of their modes, using neuronal activity, the posterior and prior densities can change quickly with sensory inputs. However, this does entail unimodal densities. From the point of view of a statistician this may be an impoverished representation of the world that compromises any proper inference, especially when the posterior distribution is multimodal. However, it is exactly this approximate nature of recognition that preoccupies psychophysicists and psychologists; the emergence of unitary, deterministic perceptual representations in the brain is commonplace and is of special interest when the causes are ambiguous (e.g. illusions and perceptual transitions induced by binocular rivalry and ambiguous figures).

The brain is a dynamical system that samples inputs dynamically over time. It does not have instantaneous access to the statistics of its inputs that are required for distinct E- and M-Steps. Representational learning therefore has to proceed under this constraint. In this review, hierarchical predictive coding has been portrayed as a variant of density learning that conforms to these constraints.

We have seen that supervised, infomax and generative models require prior assumptions about the distribution of causes. This section introduced empirical Bayes to show that these assumptions are not necessary and that priors can be learned in a hierarchical context. Furthermore, we have tried to show that hierarchical prediction can be implemented in brain-like architectures using mechanisms that are biologically plausible.

4. Generative models and the brain

The arguments in the preceding section clearly favour predictive coding, over supervised or information theoretic frameworks, as a more plausible account of functional brain architectures. However, it should be noted that the differences among them have been deliberately emphasised. For example, predictive coding and its implicit error minimisation result in the maximisation of information transfer; predictive coding is therefore entirely consistent with the principle of maximum information transfer, but serves it in a distinct way. The infomax principle is a principle, whereas predictive coding represents a particular scheme that serves that principle. There are examples of infomax that do not employ predictive coding (e.g. transformations of stimulus energy in early visual processing; Atick and Redlich, 1990) that may be specified genetically or epigenetically. However, predictive coding is likely to play a much more prominent role at higher levels of processing, for the reasons detailed in the previous section.

In a similar way predictive coding, especially in its hierarchical formulation, conforms to the same PDP principles that underpin connectionist schemes. The representation of any cause depends upon the internally consistent representations of subordinate and supraordinate causes in lower and higher levels. These representations mutually induce and maintain themselves, across and within all levels of the sensory hierarchy, through dynamic and reentrant interactions (Edelman, 1993). The same PDP phenomena (e.g. lateral interactions leading to competition among representations) can be observed. For example, the lateral connection strengths embody what has been learnt empirically about the prior covariances among causes. A prior that transpires to be very precise (i.e. low variance) will receive correspondingly low-strength inhibitory connections from its competing error units (recall Σ(λ_i)^{1/2} = 1 + λ_i). It will therefore supervene over other error units and have a greater corrective impact on the estimate causing the prediction error. Conversely, top-down expectations that are less informative will induce errors that are more easily suppressed and have less effect on the representations. In predictive coding these dynamics are driven explicitly by error minimisation, whereas in connectionist simulations the activity is determined solely by the connection strengths established during training.

In addition to the theoretical bias toward generative models and predictive coding, the clear emphasis on backward and reentrant (Edelman, 1993) dynamics makes it a more natural framework for understanding neuronal infrastructures. Fig. 1 shows the fundamental difference between infomax and generative schemes. In the infomax schemes the connections are universally forward. In the predictive coding scheme the forward connections (broken lines) drive the predictions so as to minimise error, whereas backward connections (solid lines) use these representations of causes to emulate the mixing enacted by the real world. The nonlinear aspects of this mixing imply that only backward influences interact in the predictive coding scheme, whereas the nonlinear unmixing, in classical infomax schemes, is mediated by forward connections. Section 2 assembled some of the anatomical and physiological evidence suggesting that backward


connections are prevalent in the real brain and could support nonlinear mixing through their modulatory characteristics. It is pleasing that purely theoretical considerations and neurobiological empiricism converge on the same architecture. Before turning to electrophysiological and functional neuroimaging evidence for backward connections, we consider the implications for classical views of receptive fields and the representational capacity of neuronal units.

4.1. Context, causes and representations

The Bayesian perspective suggests something quite profound for the classical view of receptive fields. If neuronal responses encompass a bottom-up likelihood term and top-down priors, then responses evoked by bottom-up input should change with the context established by prior expectations from higher levels of processing. Consider the example in Fig. 3 again. Here a unit encoding the visual form of 'went' responds when we read the first sentence at the top of this figure. When we read the second sentence, 'The last event was cancelled', it would not. If we recorded from this unit we might infer that our 'went' unit was, in some circumstances, selective for the word 'event'. Without an understanding of hierarchical inference and the semantic context in which the stimulus was presented, this might be difficult to explain. In short, under a predictive coding scheme, the receptive fields of neurons should be context-sensitive. The remainder of this section deals with empirical evidence for these extra-classical receptive field effects.

Generative models suggest that the role of backward connections is to provide contextual guidance to lower levels through a prediction of the lower level's inputs. When this prediction is incomplete or incompatible with the lower area's input, an error is generated that engenders changes in the area above until reconciliation. When, and only when, the bottom-up driving inputs are in harmony with the top-down prediction, is error suppressed and a consensus between the prediction and the actual input established. Given this conceptual model, a stimulus-related response or 'activation' corresponds to some transient error signal that induces the appropriate change in higher areas until a veridical higher-level representation emerges and the error is 'cancelled' by backward connections. Clearly the prediction error will depend on the context, and consequently the backward connections confer context-sensitivity on the functional specificity of the lower area. In short, the activation does not just depend on bottom-up input but on the difference between bottom-up input and top-down predictions.

The prevalence of nonlinear or modulatory top-down effects can be inferred from the fact that context interacts with the content of representations. Here context is established simply through the expression of causes other than the one in question. Backward connections from one higher area can be considered as providing contextual modulation of the prediction from another. Because the effect of context will only be expressed when the thing being predicted is present, these contextual afferents will not elicit a response by themselves. Effects of this sort, which change the responsiveness of units but do not elicit a response, are a hallmark of modulatory projections. In summary, hierarchical models offer a scheme that allows for contextual effects: firstly through biasing responses towards their prior expectation and secondly by conferring a context-sensitivity on these priors through modulatory backward projections. Next we consider the nature of real neuronal responses and whether they are consistent with this perspective.

4.2. Neuronal responses and representations

Classical models (e.g. classical receptive fields) assume that evoked responses will be expressed invariably in the same units or neuronal populations irrespective of the context. However, real neuronal responses are not invariant but depend upon the context in which they are evoked. For example, visual cortical units have dynamic receptive fields that can change from moment to moment (cf. the non-classical receptive field effects modelled in Rao and Ballard, 1998). Another example is attentional modulation of evoked responses, which can change the sensitivity of neurons to different perceptual attributes (e.g. Treue and Maunsell, 1996). The evidence for contextual responses comes from neuroanatomical and electrophysiological studies. There are numerous examples of context-sensitive neuronal responses. Perhaps the simplest is short-term plasticity, which refers to changes in connection strength, either potentiation or depression, following pre-synaptic inputs (e.g. Abbott et al., 1997). In brief, the underlying connection strengths that define what a unit represents are a strong function of the immediately preceding neuronal transient (i.e. the preceding representation). A second, and possibly richer, example is that of attentional modulation. It has been shown, both in single unit recordings in primates (Treue and Maunsell, 1996) and in human fMRI studies (Büchel and Friston, 1997), that attention to specific visual attributes can profoundly alter the receptive fields or event-related responses to the same stimuli.

These sorts of effects are commonplace in the brain and are generally understood in terms of the dynamic modulation of receptive field properties by backward and lateral afferents. There is clear evidence that lateral connections in visual cortex are modulatory in nature (Hirsch and Gilbert, 1991), speaking to an interaction between the functional segregation implicit in the columnar architecture of V1 and the neuronal dynamics in distal populations. These observations suggest that lateral and backward interactions may convey contextual information that shapes the responses of any neuron to its inputs (e.g. Kay and Phillips, 1996; Phillips and Singer, 1997), conferring on the brain the ability to make conditional inferences about sensory input. See also McIntosh (2000), who develops the idea, from a cognitive neuroscience perspective, "that a particular region in isolation may not act as a reliable index for a particular cognitive function.
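The hallmark just described, afferents that change responsiveness without themselves eliciting a response, can be caricatured with a multiplicative gain term. This is a toy illustration with invented numbers and functional form, not a model from the review.

```python
def response(driving, context):
    """Toy unit in which backward/lateral context multiplicatively gates
    the driving (bottom-up) input rather than adding to it."""
    return driving * (1.0 + context)

quiet = response(0.0, 2.0)    # contextual input alone: no response
low = response(1.0, 0.0)      # driving input out of context
high = response(1.0, 2.0)     # same driving input, supportive context
```

Had the contextual term entered additively, it would evoke responses on its own; the multiplicative form is precisely an interaction between bottom-up and top-down influences, the signature sought in the neuroimaging experiments below.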


Instead, the neural context in which an area is active may define the cognitive function." His argument is predicated on careful characterisations of effective connectivity using neuroimaging.

4.2.1. Examples from electrophysiology

In the next section we will illustrate the context-sensitive nature of cortical activations, and implicit specialisation, in the inferior temporal lobe using neuroimaging. Here we consider the evidence for contextual representations in terms of single cell responses, to visual stimuli, in the temporal cortex of awake behaving monkeys. If the representation of a stimulus depends on establishing representations of subordinate and supraordinate causes at all levels of the visual hierarchy, then information about the high-order attributes of a stimulus must be conferred by top-down influences. Consequently, one might expect to see the emergence of selectivity, for high-level attributes, after the initial visually evoked response (it typically takes about 10 ms for volleys of spikes to be propagated from one cortical area to another and about 100 ms to reach prefrontal areas). This is because the representations at higher levels must emerge before backward afferents can reshape the response profile of neurons in lower areas. This temporal delay, in the emergence of selectivity, is precisely what one sees empirically: Sugase et al. (1999) recorded neurons in macaque temporal cortex during the presentation of faces and objects. The faces were either human or monkey faces and were categorised in terms of identity (whose face it was) and expression (happy, angry, etc.). "Single neurones conveyed two different scales of facial information in their firing patterns, starting at different latencies. Global information, categorising stimuli as monkey faces, human faces or shapes, was conveyed in the earliest part of the responses. Fine information about identity or expression was conveyed later", starting on average about 50 ms after face-selective responses. These observations demonstrate representations of facial identity or expression that emerge dynamically in a way that might rely on backward connections. These influences imbue neurons with a selectivity that is not intrinsic to the area but depends on interactions across levels of a processing hierarchy.

A similar late emergence of selectivity is seen in motion processing. A critical aspect of visual processing is the integration of local motion signals generated by moving objects. This process is complicated by the fact that local velocity measurements can differ depending on contour orientation and spatial position. Specifically, any local motion detector can measure only the component of motion perpendicular to a contour that extends beyond its field of view (Pack and Born, 2001). This "aperture problem" is particularly relevant to direction-selective neurons early in the visual pathways, where small receptive fields permit only a limited view of a moving object. Pack and Born (2001) have shown "that neurons in the middle temporal visual area (known as MT or V5) of the macaque brain reveal a dynamic solution to the aperture problem. MT neurons initially respond primarily to the component of motion perpendicular to a contour's orientation, but over a period of approximately 60 ms the responses gradually shift to encode the true stimulus direction, regardless of orientation".

The preceding examples were taken from electrophysiology. Similar predictions can be made, albeit at a less refined level, about population responses elicited in functional neuroimaging, where functional specialisation (cf. selectivity in unit recordings) is established by showing regionally-specific responses to some sensorimotor attribute or cognitive component. At the level of cortical responses in neuroimaging, the dynamic and contextual nature of evoked responses means that regionally-specific responses to a particular cognitive component may be expressed in one context but not another. In the next section we look at some empirical evidence from functional neuroimaging that confirms the idea that functional specialisation is conferred in a context-sensitive fashion by backward connections from higher brain areas.

5. Functional architectures assessed with brain imaging

Information theory and predictive coding schemes suggest alternative architectures that are sufficient for representational learning. Forward connections are sufficient for the former, whereas the latter posits that most of the brain's infrastructure is used to predict sensory input through a hierarchy of top-down projections. Clearly, to adjudicate between these alternatives the existence of backward influences must be established. This is a slightly deeper problem for functional neuroimaging than might be envisaged, because making causal inferences about effective connectivity is not straightforward (see Pearl, 2000). It might be thought that showing regional activity was partially predicted by activity in a higher level would be sufficient to confirm the existence of backward influences, at least at a population level. The problem is that this statistical dependency does not permit any causal inference: statistical dependencies could easily arise in a purely forward architecture, because the higher-level activity is predicated on activity in the lower level. One resolution of this problem is to perturb the higher level directly using transcranial magnetic stimulation or pathological disruptions (see Section 6). However, discounting these interventions, one is left with the difficult problem of inferring backward influences based on measures that could be correlated because of forward connections. Although there are causal modelling techniques that can address this problem, we will take a simpler approach and note that interactions between bottom-up and top-down influences cannot be explained by a purely feedforward architecture. This is because the top-down influences have no access to the bottom-up inputs. An interaction, in this context, can be construed as an effect of backward connections on the driving efficacy of forward connections. In other words, the response evoked by the


same driving bottom-up inputs depends upon the context established by top-down inputs. This interaction is used below simply as evidence for the existence of backward influences. However, there are some instances of predictive coding that emphasise this phenomenon. For example, the "Kalman filter model demonstrates how certain forms of attention can be viewed as an emergent property of the interaction between top-down expectations and bottom-up signals" (Rao, 1999).

The remainder of this article focuses on the evidence for these interactions. From the point of view of functionally specialised responses, these interactions manifest as context-sensitive or contextual specialisation, where modality-, category- or exemplar-specific responses, driven by bottom-up inputs, are modulated by top-down influences induced by perceptual set. The first half of this section adopts this perspective. The second part uses measurements of effective connectivity to establish interactions between bottom-up and top-down influences. All the examples presented below rely on attempts to establish interactions by trying to change sensory-evoked neuronal responses through putative manipulations of top-down influences. These include inducing independent changes in perceptual set, cognitive (attentional) set and, in the last section, through the study of patients with brain lesions.

5.1. Context-sensitive specialisation

If functional specialisation is context-dependent then one should be able to find evidence for functionally-specific responses, using neuroimaging, that are expressed in one context and not in another. The first part of this section provides an empirical example. If the contextual nature of specialisation is mediated by backward modulatory afferents, then it should be possible to find cortical regions in which functionally-specific responses, elicited by the same stimuli, are modulated by activity in higher areas. The second example shows that this is indeed possible. Both of these examples depend on multifactorial experimental designs that have largely replaced subtraction and categorical designs in mapping.

5.1.1. Categorical designs

Categorical designs, such as cognitive subtraction, have been the mainstay of functional neuroimaging over the past decade. Cognitive subtraction involves elaborating two tasks that differ in a separable component. Ensuing differences in brain activity are then attributed to this component. The tenet of cognitive subtraction is that the difference between two tasks can be formulated as a separable cognitive or sensorimotor component, and that regionally specific differences in hemodynamic responses identify the corresponding functionally specialised area. Early applications of subtraction range from the functional anatomy of word processing (Petersen et al., 1989) to functional specialisation in extrastriate cortex (Lueck et al., 1989). The latter studies involved presenting visual stimuli with and without some sensory attribute (e.g. colour, motion, etc.). The areas highlighted by subtraction were identified with homologous areas in monkeys that showed selective electrophysiological responses to equivalent visual stimuli.

Consider a specific example; namely, the difference between simply saying "yes" when a recognisable object is seen, and saying "yes" when an unrecognisable non-object is seen. Regionally specific differences in brain activity that distinguish between these two tasks could be implicated in implicit object recognition. Although its simplicity is appealing, this approach embodies some strong assumptions about the way that the brain implements cognitive processes. A key assumption is 'pure insertion'. Pure insertion asserts that one can insert a new component into a task without affecting the implementation of pre-existing components (for example, how do we know that object recognition is not itself affected by saying "yes"?). The fallibility of this assumption has been acknowledged for decades, perhaps most explicitly by Sternberg's revision of Donders' subtractive method. The problem for subtraction is as follows: if one develops a task by adding a component, then the new task comprises not only the previous components and the new component but also the integration of the new and old components (for example, the integration of phonology and object recognition). This integration or interaction can itself be considered as a new component. The difference between two tasks therefore includes the new component and the interactions between the new component and those of the original task. Pure insertion requires that all these interaction terms are negligible. Clearly in many instances they are not. We next consider factorial designs that eschew the assumption of pure insertion.

5.1.2. Multifactorial designs

Factorial designs combine two or more factors within a task or tasks. Factorial designs can be construed as performing subtraction experiments in two or more different contexts. The differences in activations, attributable to the effects of context, are simply the interaction. Consider repeating the above implicit object recognition experiment in another context, for example naming (of the object's name or the non-object's colour). The factors in this example are implicit object recognition, with two levels (objects versus non-objects), and phonological retrieval (naming versus saying "yes"). The idea here is to look at the interaction between these factors, or the effect that one factor has on the responses elicited by changes in the other. Generally, interactions can be thought of as a difference in activations brought about by another processing demand. Dual task interference paradigms are a clear example of this approach (e.g. Fletcher et al., 1995).

Consider the above object recognition experiment again. Noting that object-specific responses are elicited (by asking subjects to view objects relative to meaningless shapes), with and without phonological retrieval, reveals the factorial nature of this experiment. This 'two by two' design allows one to look specifically at the interaction between phono-

UNCORRECTED PROOF 24 K. Friston / Progress in Neurobiology 590 (2002) 1–31

Fig. 4. This example of regionally specific interactions comes from an experiment where subjects were asked to view coloured non-object shapes or coloured objects and say “yes”, or to name either the coloured object or the colour of the shape. Left: A regionally specific interaction in the left infero-temporal cortex. The SPM threshold is P<0.05 (uncorrected) (Friston et al., 1995b). Right: The corresponding activities in the maxima of this region are portrayed in terms of object recognition-dependent responses with and without naming. It is seen that this region shows object recognition responses when, and only when, there is phonological retrieval. The ‘extra’ activation with naming corresponds to the interaction. These data were acquired from 6 subjects scanned 12 times using PET.
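The logic of the 'two by two' design can be made concrete with a toy general linear model. The factor coding, effect sizes, scan counts and noise level below are invented for illustration; the point is only that the interaction term in the model equals the 'difference of differences' between the four cells of a balanced factorial design.

```python
import numpy as np

# Toy 2x2 factorial design: factor A = implicit object recognition
# (objects vs non-objects), factor B = phonological retrieval
# (naming vs saying "yes"). All values are invented for illustration.
rng = np.random.default_rng(0)
A = np.repeat([0.0, 1.0], 24)               # 24 simulated scans per level of A
B = np.tile(np.repeat([0.0, 1.0], 12), 2)   # crossed with A: 12 scans per cell

# Simulate a voxel that responds to objects only when naming: a pure interaction.
y = 2.0 * A * B + rng.normal(0.0, 0.1, A.size)

# General linear model: mean, two main effects and their interaction.
X = np.column_stack([np.ones_like(A), A, B, A * B])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# For a balanced design the interaction estimate equals the
# 'difference of differences' between the four cell means.
def cell(a, b):
    return y[(A == a) & (B == b)].mean()

diff_of_diffs = (cell(1, 1) - cell(0, 1)) - (cell(1, 0) - cell(0, 0))
print(beta[3], diff_of_diffs)  # both close to the simulated interaction of 2
```

A main-effect contrast on A alone would miss this voxel entirely: its response to objects exists only in the naming context, which is exactly the kind of context-sensitive specialisation at issue.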

This analysis identifies not regionally specific activations but regionally specific interactions. When we actually performed this experiment, these interactions were evident in the left posterior inferior temporal region and can be associated with the integration of phonology and object recognition (see Fig. 4 and Friston et al., 1996 for details). Alternatively, this region can be thought of as expressing recognition-dependent responses that are realised in, and only in, the context of having to name the object seen. These results can be construed as evidence of contextual specialisation for object recognition that depends upon modulatory afferents (possibly from temporal and parietal regions) that are implicated in naming a visually perceived object. There is no empirical evidence in these results to suggest that the temporal or parietal regions are the source of this top–down influence, but in the next example the source of modulation is addressed explicitly using psychophysiological interactions.

5.1.3. Psychophysiological interactions

Psychophysiological interactions speak directly to the interactions between bottom–up and top–down influences, where one is modelled as an experimental factor and the other constitutes a measured brain response. In an analysis of psychophysiological interactions one is trying to explain a regionally specific response in terms of an interaction between the presence of a sensorimotor or cognitive process and activity in another part of the brain (Friston et al., 1997). The supposition here is that the remote region is the source of backward modulatory afferents that confer functional specificity on the target region. For example, using information about activity in the posterior parietal cortex, mediating attentional or perceptual set pertaining to a particular stimulus attribute, can we identify regions that respond to that stimulus when, and only when, activity in the parietal source is high? If such an interaction exists, then one might infer that the parietal area is modulating responses to the stimulus attribute for which the area is selective. This has clear ramifications in terms of the top–down modulation of specialised cortical areas by higher brain regions.

The statistical model employed in testing for psychophysiological interactions is a simple regression model of effective connectivity that embodies nonlinear (second-order or modulatory) effects. As such, this class of model speaks directly to functional specialisation of a nonlinear and contextual sort. Fig. 5 illustrates a specific example (see Dolan et al., 1997 for details). Subjects were asked to view (degraded) faces and non-face (object) controls. The interaction between activity in the parietal region and the presence of faces was expressed most significantly in the right infero-temporal region, not far from the homologous left infero-temporal region implicated in the object naming experiment above. Changes in parietal activity were induced experimentally by pre-exposure to the (un-degraded) stimuli before some scans but not others, to prime them. The data in the right panel of Fig. 5 suggest that the infero-temporal region shows face-specific responses, relative to non-face objects, when, and only when, parietal activity is high. These results can be interpreted as priming-dependent face-specific responses in infero-temporal regions that are mediated by interactions with medial parietal cortex. This is a clear example of contextual specialisation that depends on top–down effects.

5.2. Effective connectivity

The previous examples demonstrating contextual specialisation are consistent with functional architectures implied by predictive coding. However, they do not provide definitive evidence for an interaction between top–down and bottom–up influences.


Fig. 5. Top: Examples of the stimuli presented to subjects. During the measurement of brain responses only degraded stimuli were shown (e.g. the right-hand picture). In half the scans the subject was given the underlying cause of these stimuli, through presentation of the original picture (e.g. left) before scanning. This priming induced a profound difference in perceptual set for the primed, relative to non-primed, stimuli. Right: Activity observed in a right infero-temporal region, as a function of (mean-corrected) PPC activity. This region showed the most significant interaction between the presence of faces in visually presented stimuli and activity in a reference location in the posterior medial parietal cortex (PPC). This analysis can be thought of as finding those areas that are subject to top–down modulation of face-specific responses by medial parietal activity. The crosses correspond to activity whilst viewing non-face stimuli and the circles to faces. The essence of this effect can be seen by noting that this region differentiates between faces and non-faces when, and only when, medial parietal activity is high. The lines correspond to the best second-order polynomial fit. These data were acquired from six subjects using PET. Left: Schematic depicting the underlying conceptual model in which driving afferents from ventral form areas (here designated as V4) excite infero-temporal (IT) responses, subject to permissive modulation by PPC projections.
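The form of a psychophysiological interaction analysis can be sketched with a toy regression; the variable names, effect sizes and noise level below are invented and do not reproduce the actual analysis, but the design matrix has the same structure: two main effects plus their product.

```python
import numpy as np

# Toy psychophysiological interaction (PPI) regression. All values invented.
rng = np.random.default_rng(1)
n = 200
psy = rng.integers(0, 2, n).astype(float)  # psychological factor: faces (1) vs non-faces (0)
phys = rng.normal(0.0, 1.0, n)             # physiological regressor, e.g. mean-corrected parietal activity

# Simulate a target region whose face-specific response is expressed
# only when the 'source' activity is high: a purely modulatory effect.
y = 0.2 * psy + 0.3 * phys + 1.0 * psy * phys + rng.normal(0.0, 0.5, n)

# PPI design matrix: both main effects plus their product (the interaction).
X = np.column_stack([np.ones(n), psy, phys, psy * phys])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evidence for top-down modulation is carried by the interaction
# coefficient beta[3], not by either main effect alone.
print(beta)
```

The main effects are included so that the interaction coefficient reflects a genuine change in the stimulus-specific response with source activity, rather than a shared additive component.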

In this subsection we look for direct evidence of these interactions using functional imaging. This rests upon being able to measure effective connectivity in a way that is sensitive to interactions among inputs. This requires a plausible model of coupling among brain regions that accommodates nonlinear and dynamical effects. We have used a model that is based on the Volterra expansion introduced in Section 3. Before turning to empirical evidence for interactions between bottom–up and top–down inputs, the motivation for this particular model of effective connectivity is presented briefly.

5.2.1. Effective connectivity and Volterra kernels

The problem faced, when trying to measure effective connectivity, is that measurements of brain responses are usually very limited, either in terms of their resolution (in space or time) or in terms of the neurophysiological or biophysical variable that is measured. Given the complicated nature of neuronal interactions, involving a huge number of microscopic variables, it may seem an impossible task to make meaningful measurements of coupling among brain systems, especially with measurements afforded by techniques like fMRI. However, the problem is not as intractable as one might think.

Suppose that the variables x represented a complete and self-consistent description of the state variables of a brain region. In other words, everything needed to determine the evolution of that region's state, at a particular place and time, was embodied in these measurements. If such a set of variables existed they would satisfy some immensely complicated nonlinear equations (cf. Eq. (1))

ẋ = f(x, u)
y = g(x)    (33)

Here u represents a set of inputs conveyed by projections from other regions and x is a large vector of state variables, which range from depolarisation at every point in the dendritic tree
to the phosphorylation status of every relevant enzyme; from the biochemical status of every glial cell compartment to every aspect of gene expression. The vast majority of these variables are hidden and not measurable directly. However, there are measurements y that can be made that, as we have seen in Section 3, are simply a nonlinear convolution of the inputs with some Volterra kernels. These measures usually reflect the activity of whole cells or populations and are measured in many ways, for example firing at the initial segment of an axon or local field potentials. The critical thing here is that the output is causally related to the inputs, which are the outputs of other regions. This means that we never need to know the underlying and 'hidden' variables that describe the details of each region's electrochemical status. We only need to know the history of its inputs, which obtain from the measurable outputs of other regions. In principle, a complete description of regional responses could be framed in terms of inputs and the Volterra kernels required to produce the outputs. The nice thing about the kernels is that they can be interpreted directly as effective connectivity (see Box 1). Because the inputs (and outputs) are measurable one can estimate the kernels empirically. The first-order kernel is simply the change in response induced by a change in input in the recent past. The second-order kernels are the change in the first-order effective connectivity induced by changes in a second (modulatory) input, and so on for higher orders.
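This empirical estimation can be illustrated with a toy discrete-time example. The memory of three time bins, the kernel values and the noise level are all invented, and this is not the estimation scheme used in the study described below; it only illustrates the principle that, because the response is linear in the unknown kernel coefficients, they can be recovered by ordinary regression.

```python
import numpy as np

# Toy estimation of first- and second-order Volterra kernels by regression.
rng = np.random.default_rng(2)
T, lag = 500, 3
u1 = rng.normal(0.0, 1.0, T)  # driving input (output of a lower region)
u2 = rng.normal(0.0, 1.0, T)  # modulatory input (output of a higher region)

k1 = np.array([0.8, 0.4, 0.2])  # 'true' first-order kernel over lags 1..3
y = np.zeros(T)
for t in range(lag, T):
    past = u1[t - lag:t][::-1]                # u1 at lags 1, 2, 3
    y[t] = k1 @ past + 0.5 * u2[t] * past[0]  # second-order (modulatory) term
y += rng.normal(0.0, 0.05, T)

# Regressors: lagged u1 (first order) and the u2 x u1 product (second order).
X = np.array([np.concatenate([u1[t - lag:t][::-1], [u2[t] * u1[t - 1]]])
              for t in range(lag, T)])
beta, *_ = np.linalg.lstsq(X, y[lag:], rcond=None)
print(beta)  # recovers approximately [0.8, 0.4, 0.2, 0.5]
```

The last coefficient quantifies how the influence of u1 on the output changes with u2, i.e. a second-order kernel read directly as modulatory effective connectivity.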

Fig. 6. Left: Brain regions and connections comprising the model. Right: Characterisation of the effects of V2 inputs on V5 and their modulation by posterior parietal cortex (PPC). The broken lines represent estimates of V5 responses when PPC activity is zero, according to a second-order Volterra model of effective connectivity with inputs to V5 from V2, PPC and the pulvinar (PUL). The solid curves represent the same response when PPC activity is one standard deviation of its variation over conditions. It is evident that V2 has an activating effect on V5 and that PPC increases the responsiveness of V5 to these inputs. The insert shows all the voxels in V5 that evidenced a modulatory effect (P<0.05 uncorrected). These voxels were identified by thresholding a SPM (Friston et al., 1995b)oftheF statistic testing for the contribution of second-order kernels involving V2 and PPC (treating all other terms as nuisance variables). The data were obtained with fMRI under identical stimulus conditions (visual motion subtended by radially moving dots) whilst manipulating the attentional component of the task (detection of velocity changes).


Another nice thing about the Volterra formulation is that the response is linear in the unknowns, which can be estimated using standard least squares procedures. In short, Volterra kernels are synonymous with effective connectivity because they characterise the measurable effect that an input has on its target.

5.2.2. Nonlinear coupling among brain areas

Linear models of effective connectivity assume that the multiple inputs to a brain region are linearly separable. This assumption precludes activity-dependent connections that are expressed in one context and not in another. The resolution of this problem lies in adopting nonlinear models, like the Volterra formulation, that include interactions among inputs. These interactions can be construed as a context- or activity-dependent modulation of the influence that one region exerts over another (Büchel and Friston, 1997). In the Volterra model, second-order kernels model modulatory effects. Within these models the influence of one region on another has two components: (i) the direct or driving influence of input from the first (e.g. hierarchically lower) region, irrespective of the activities elsewhere; and (ii) an activity-dependent, modulatory component that represents an interaction with inputs from the remaining (e.g. hierarchically higher) regions. These are mediated by the first- and second-order kernels, respectively. The example provided in Fig. 6 addresses the modulation of visual cortical responses by attentional mechanisms (e.g. Treue and Maunsell, 1996) and the mediating role of activity-dependent changes in effective connectivity.

The right panel in Fig. 6 shows a characterisation of this modulatory effect in terms of the increase in V5 responses, to a simulated V2 input, when posterior parietal activity is zero (broken lines) and when it is high (solid lines). In this study, subjects were scanned with fMRI under identical stimulus conditions (visual motion subtended by radially moving dots) whilst manipulating the attentional component of the task (detection of velocity changes). The brain regions and connections comprising the model are shown in the left panel of Fig. 6, and the right panel shows a characterisation of the effects of V2 inputs on V5 and their modulation by posterior parietal cortex (PPC) using simulated inputs at different levels of PPC activity. It is evident that V2 has an activating effect on V5 and that PPC increases the responsiveness of V5 to these inputs. The insert shows all the voxels in V5 that evidenced a modulatory effect (P<0.05 uncorrected). These voxels were identified by thresholding statistical parametric maps of the F statistic (Friston et al., 1995b) testing for the contribution of second-order kernels involving V2 and PPC, while treating all other components as nuisance variables. The estimation of the Volterra kernels and the statistical inference procedure are described in Friston and Büchel (2000).

This sort of result suggests that backward parietal inputs may be a sufficient explanation for the attentional modulation of visually evoked extrastriate responses. More importantly, these results are consistent with the functional architecture implied by predictive coding, because they establish the existence of functionally expressed backward connections. V5 cortical responses evidence an interaction between bottom–up input from early visual cortex and top–down influences from parietal cortex. In the final section the implications of this sort of functional integration are addressed from the point of view of the lesion-deficit model and neuropsychology.

6. Functional integration and neuropsychology

If functional specialisation depends on interactions among cortical areas then one might predict changes in functional specificity in cortical regions that receive enabling or modulatory afferents from a damaged area. A simple consequence is that aberrant responses will be elicited in regions hierarchically below the lesion if, and only if, these responses depend upon inputs from the lesion site. However, there may be other contexts in which the region's responses are perfectly normal (relying on other, intact, afferents). This leads to the notion of a context-dependent regionally-specific abnormality, caused by, but remote from, a lesion (i.e. an abnormal response that is elicited by some tasks but not others). We have referred to this phenomenon as 'dynamic diaschisis' (Price et al., 2000).

6.1. Dynamic diaschisis

Classical diaschisis, demonstrated by early anatomical studies and more recently by neuroimaging studies of resting brain activity, refers to regionally specific reductions in metabolic activity at sites that are remote from, but connected to, damaged regions. The clearest example is 'crossed cerebellar diaschisis' (Lenzi et al., 1982), in which abnormalities of cerebellar metabolism are seen characteristically following cerebral lesions involving the motor cortex. Dynamic diaschisis describes the context-sensitive and task-specific effects that a lesion can have on the evoked responses of a distant cortical region. The basic idea behind dynamic diaschisis is that an otherwise viable cortical region expresses aberrant neuronal responses when, and only when, those responses depend upon interactions with a damaged region. This can arise because normal responses in any given region depend upon inputs from, and reciprocal interactions with, other regions. The regions involved will depend on the cognitive and sensorimotor operations engaged at any particular time. If these regions include one that is damaged, then abnormal responses may ensue. However, there may be situations when the same region responds normally, for instance when its dynamics depend only upon integration with undamaged regions. If the region can respond normally in some situations then forward driving components must be intact. This suggests that dynamic diaschisis will only present itself when the lesion involves a hierarchically equivalent or higher area.


Fig. 7. (a) Top: These renderings illustrate the extent of cerebral infarcts in four patients, as identified by voxel-based morphometry. Regions of reduced grey matter (relative to neurologically normal controls) are shown in white on the left hemisphere. The SPMs (Friston et al., 1995b) were thresholded at P<0.001 uncorrected. All patients had damage to Broca's area. The first (upper left) patient's left middle cerebral artery infarct was the most extensive, encompassing temporal and parietal regions as well as frontal and motor cortex. (b) Bottom: SPMs illustrating the functional imaging results, with regions of significant activation shown in black on the left hemisphere. Results are shown for: (i) normal subjects reading words (left); (ii) activations common to normal subjects and patients reading words, using a conjunction analysis (middle top); (iii) areas where normal subjects activate significantly more than patients reading words, using the group-by-condition interaction (middle lower); and (iv) the first patient activating normally for a semantic task. Context-sensitive failures to activate are implied by the abnormal activations in the first patient, for the implicit reading task, despite a normal activation during the semantic task.

6.1.1. An empirical demonstration

We investigated this possibility in a functional imaging study of four aphasic patients, all with damage to the left posterior inferior frontal cortex, classically known as Broca's area (see Fig. 7, upper panels). These patients had speech output deficits but relatively preserved comprehension. Generally, functional imaging studies can only make inferences about abnormal neuronal responses when changes in cognitive strategy can be excluded. We ensured this by engaging the patients in an explicit task that they were able to perform normally. This involved a keypress response when a visually presented letter string contained a letter with an ascending visual feature (e.g. h, k, l or t). While the task remained constant, the stimuli presented were either words or consonant letter strings. Activations detected for words, relative to letters, were attributed to implicit word processing. Each patient showed normal activation of the left posterior middle temporal cortex that has been associated with semantic processing (Price, 1998). However, none of the patients activated the left posterior inferior frontal cortex (damaged by the stroke), or the left posterior inferior temporal region (undamaged by the stroke) (see Fig. 4). These two regions are crucial for word production (Price, 1998). Examination of individual responses in this area revealed that all the normal subjects showed increased activity for words relative to consonant letter strings, while all four patients showed the reverse effect. The abnormal responses in the left posterior inferior temporal lobe occurred even though this undamaged region: (i) lies adjacent and posterior to a region of the left middle temporal cortex that activated normally (see middle column of Fig. 7b); and (ii) is thought to be involved in an earlier stage of word processing than the damaged left inferior frontal cortex (i.e. is hierarchically lower than the lesion). From these results we can conclude that, during the reading task, responses in the left basal temporal language area rely on afferent inputs from the left posterior inferior frontal cortex. When the first patient was scanned again, during an explicit semantic task, the left posterior inferior temporal lobe responded normally. The abnormal implicit reading-related responses were therefore task-specific.

These results serve to illustrate the concept of dynamic diaschisis; namely the anatomically remote and context-specific effects of focal brain lesions.


Dynamic diaschisis represents a form of functional disconnection where regional dysfunction can be attributed to the loss of enabling inputs from hierarchically equivalent or higher brain regions. Unlike classical or anatomical disconnection syndromes, its pathophysiological expression depends upon the functional brain state at the time responses are evoked. Dynamic diaschisis may be characteristic of many regionally specific brain insults and may have implications for neuropsychological inference.

7. Conclusion

In conclusion, the representational capacity and inherent function of any neuron, neuronal population or cortical area in the brain is dynamic and context-sensitive. Functional integration, or interactions among brain systems that employ driving (bottom–up) and backward (top–down) connections, mediates this adaptive and contextual specialisation. A critical consequence is that hierarchically organised neuronal responses, in any given cortical area, can represent different things at different times. We have seen that most models of representational learning require prior assumptions about the distribution of causes. However, empirical Bayes suggests that these assumptions can be relaxed and that priors can be learned in a hierarchical context. We have tried to show that this hierarchical prediction can be implemented in brain-like architectures and in a biologically plausible fashion.

The main point made in this review is that backward connections, mediating internal or generative models of how sensory inputs are caused, are essential if the processes generating inputs are non-invertible. Because these generating processes are dynamical in nature, sensory input corresponds to a non-invertible nonlinear convolution of causes. This non-invertibility demands an explicit parameterisation of generative models (backward connections) to enable approximate recognition and suggests that feedforward architectures are not sufficient for representational learning. Moreover, nonlinearities in generative models that induce dependence on backward connections require these connections to be modulatory, so that estimated causes in higher cortical levels can interact to predict responses in lower levels. This is important in relation to asymmetries in forward and backward connections that have been characterised empirically.

The arguments in this article were developed under prediction models of brain function, where higher-level systems provide a prediction of the inputs to lower-level regions. Conflict between the two is resolved by changes in the higher-level representations, which are driven by the ensuing error in lower regions, until the mismatch is 'cancelled'. From this perspective the specialisation of any region is determined both by bottom–up driving inputs and by top–down predictions. Specialisation is therefore not an intrinsic property of any region but depends on both forward and backward connections with other areas. Because the latter have access to the context in which the inputs are generated, they are in a position to modulate the selectivity or specialisation of lower areas. The implications for classical models (e.g. classical receptive fields in electrophysiology, classical specialisation in neuroimaging and connectionism in cognitive models) are severe and suggest these models may provide incomplete accounts of real brain architectures. On the other hand, predictive coding in the context of hierarchical generative models not only accounts for many extra-classical phenomena seen empirically but also enforces a view of the brain as an inferential machine through its empirical Bayesian motivation.

Uncited references

Friston (1995a), Friston (2000), Fukushima (1986), Harth et al. (1987), McIntosh and Gonzalez-Lima (1994).

Acknowledgements

The Wellcome Trust funded this work. I would like to thank my colleagues for help in writing this review and developing these ideas, especially Cathy Price for the psychological components and Peter Dayan for the theoretical neurobiology.

References

Abbott, L.F., Varela, J.A., Nelson, S.B., 1997. Synaptic depression and cortical gain control. Science 275, 220–223.
Absher, J.R., Benson, D.F., 1993. Disconnection syndromes: an overview of Geschwind's contributions. Neurology 43, 862–867.
Aertsen, A., Preißl, H., 1991. Dynamics of activity and connectivity in physiological neuronal networks. In: Schuster, H.G. (Ed.), Nonlinear Dynamics and Neuronal Networks. VCH Publishers, New York, NY, USA, pp. 281–302.
Atick, J.J., Redlich, A.N., 1990. Towards a theory of early visual processing. Neural Comput. 2, 308–320.
Ballard, D.H., Hinton, G.E., Sejnowski, T.J., 1983. Parallel visual computation. Nature 306, 21–26.
Barlow, H.B., 1961. Possible principles underlying the transformation of sensory messages. In: Rosenblith, W.A. (Ed.), Sensory Communication. MIT Press, Cambridge, MA.
Bell, A.J., Sejnowski, T.J., 1995. An information maximisation approach to blind separation and blind de-convolution. Neural Comput. 7, 1129–1159.
Bendat, J.S., 1990. Nonlinear System Analysis and Identification from Random Data. Wiley, New York, USA.
Büchel, C., Friston, K.J., 1997. Modulation of connectivity in visual pathways by attention: cortical interactions evaluated with structural equation modelling and fMRI. Cereb. Cortex 7, 768–778.
Comon, P., 1994. Independent component analysis, a new concept? Signal Processing 36, 287–314.
Crick, F., Koch, C., 1998. Constraints on cortical and thalamic projections: the no-strong-loops hypothesis. Nature 391, 245–250.
Dayan, P., Hinton, G.E., Neal, R.M., 1995. The Helmholtz machine. Neural Comput. 7, 889–904.

