Bayesian multisensory integration and cross-modal spatial links

Sophie Deneve a,*, Alexandre Pouget b

a Gatsby Computational Neuroscience Unit, Alexandra House, 17 Queen Square, London WC1N 3AR, UK
b Department of Brain and Cognitive Sciences, Meliora Hall, University of Rochester, Rochester, NY 14627, USA

* Corresponding author. E-mail addresses: [email protected] (S. Deneve), [email protected] (A. Pouget).

Journal of Physiology - Paris 98 (2004) 249–258. doi:10.1016/j.jphysparis.2004.03.011. © 2004 Published by Elsevier Ltd.

Abstract

Our perception of the world is the result of combining information from several senses, such as vision, audition and proprioception. These sensory modalities use widely different frames of reference to represent the properties and locations of objects. Moreover, multisensory cues come with different degrees of reliability, and the reliability of a given cue can change in different contexts. The Bayesian framework, which we describe in this review, provides an optimal solution to the problem of combining cues that are not equally reliable. However, this approach does not address the issue of frames of reference. We show that this problem can be solved by creating cross-modal spatial links in basis function networks. Finally, we show how the basis function approach can be combined with the Bayesian framework to yield networks that can perform optimal multisensory combination. On the basis of this theory, we argue that multisensory integration is a dialogue between sensory modalities rather than the convergence of all sensory information onto a supra-modal area.

Keywords: Multisensory integration; Basis functions; Recurrent networks; Bayesian; Frames of reference; Gain modulation; Partially shifting receptive fields

1. Introduction

Multisensory integration refers to the capacity of combining information coming from different sensory modalities to obtain a more accurate representation of the environment and body. For example, vision and touch can be combined to estimate the shape of objects, and viewing somebody's lips moving can improve speech comprehension. This integration process is difficult for two main reasons. First, the reliability of sensory modalities varies widely according to the context. For example, in daylight, visual cues are more reliable than auditory cues for localizing objects, while the contrary is true at night. Thus, the brain should rely more on auditory cues at night and more on visual cues during the day to estimate object positions.

Another reason why multisensory integration is a complex issue is that each sensory modality uses a different format to encode the same properties of the environment or body. Thus multisensory integration cannot be a simple averaging of converging sensory inputs. More elaborate computations are required to interpret neural responses corresponding to the same object in different sensory areas. To use an analogy from the linguistic domain, each sensory modality speaks its own language, and information cannot be shared between modalities without translation mechanisms for the different languages. In particular, each sensory modality encodes the position of objects in a different frame of reference. Visual stimuli are represented by neurons with receptive fields on the retina, auditory stimuli by neurons with receptive fields anchored to the head, and tactile stimuli by neurons with receptive fields anchored to the skin. Thus, a change in eye position or body posture will result in a change in the correspondence between the visual, auditory and tactile neural responses encoding the same object. To combine these different sensory responses, the brain must take into account the posture and the movements of the body in space.
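To make the frame-of-reference problem concrete, here is a toy sketch of ours (not from the paper); the one-dimensional geometry and names are made up for illustration. It shows how the same object yields different coordinates in eye-centered and head-centered frames, and how an eye movement breaks the correspondence between them.

```python
# Toy illustration (not from the paper): the same object represented in
# two frames of reference. A visual neuron encodes the object relative
# to the eye; an auditory neuron encodes it relative to the head.

def retinal_position(object_head_centered, eye_position):
    """Eye-centered (retinal) coordinate of an object whose position is
    given in head-centered coordinates."""
    return object_head_centered - eye_position

object_pos = 10.0  # head-centered position of the object (degrees)

# With the eyes straight ahead, retinal and head-centered frames agree.
print(retinal_position(object_pos, eye_position=0.0))   # 10.0

# After a 15-degree eye movement, the retinal coordinate changes even
# though the object (and hence its auditory, head-centered coordinate)
# has not moved: the visual/auditory correspondence is broken.
print(retinal_position(object_pos, eye_position=15.0))  # -5.0
```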
We first review the Bayesian framework for multisensory integration, which provides a set of rules for optimally combining sensory inputs with varying reliabilities. We then describe several psychophysical studies supporting the notion that multisensory integration in the nervous system is indeed akin to a Bayesian inference process. We then review evidence from psychophysics and neuropsychology that sensory inputs from different modalities, but originating at the same location in space, can influence one another regardless of body posture, suggesting that there is a link, or translation mechanism, between the spatial representations of the different sensory systems. Finally, we turn to neurophysiological and modeling data regarding the neural mechanisms of spatial transformations and Bayesian inference.

2. Bayesian framework for multisensory integration

The Bayesian framework allows the optimal combination of multiple sources of information about a quantity x [22]. We consider a specific example in which x refers to the position of an object that can be seen and heard at the same time. Given noisy neural responses in the visual cortex, r_vis, the position of the object is most probably near the receptive fields of the most active cells, but this position cannot be determined with infinite precision due to the presence of neural noise (we use bold letters to refer to vectors; thus r_vis is a vector corresponding to the firing rates of a large population of visual neurons). Given the uncertainty associated with x, a good strategy is to compute the posterior probability that the object is at position x given the visual neural responses, P(x|r_vis). Using Bayes' rule, P(x|r_vis) can be obtained by combining the distribution of the neural noise, P(r_vis|x), with prior knowledge of the distribution of object positions, P(x), and the prior probability of the neural responses, P(r_vis):

$$P(x \mid \mathbf{r}_{\mathrm{vis}}) = \frac{P(\mathbf{r}_{\mathrm{vis}} \mid x)\, P(x)}{P(\mathbf{r}_{\mathrm{vis}})} \quad (1)$$

This distribution is called a posterior probability because it refers to the probability of the object position after taking into account the sensory input, r (as opposed to the prior P(x), which is independent of r). P(r_vis|x) is called the noise distribution because it corresponds to the variability in the neural responses for a fixed stimulus, i.e., to variability that is not accounted for by the stimulus.

Note several important points about Eq. (1). First, we can ignore the denominator, P(r_vis), because it is independent of x, and x is the only variable we care about. Second, P(r_vis|x) can be measured experimentally by repeatedly presenting an object at the same position x and measuring the variability in r_vis (which is why we call this term the "noise" distribution). Finally, if we happen to know that the object is more likely to appear in some visual locations than others, we can represent this knowledge in the prior distribution P(x). In this section, we will assume that all positions are equally likely, that is, P(x) = c, where c is a constant. This implies that P(x) does not depend on x, in which case we can ignore it as well. Therefore, Eq. (1) reduces to:

$$P(x \mid \mathbf{r}_{\mathrm{vis}}) \propto P(\mathbf{r}_{\mathrm{vis}} \mid x) \quad (2)$$

Once the posterior distribution is computed, an estimate of the position of the object can be obtained by recovering the value of x that maximizes that distribution:

$$\hat{x}_{\mathrm{vis}} = \arg\max_{x} P(x \mid \mathbf{r}_{\mathrm{vis}})$$

This is known as the maximum a posteriori estimate, or MAP estimate for short.

When we hear the object, a similar posterior distribution, P(x|r_aud), and its corresponding estimate, x̂_aud, can be computed based on the noisy responses of the auditory neurons, r_aud.
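As a concrete numerical illustration of Eqs. (1), (2) and the MAP estimate, here is a minimal sketch of our own construction (not the authors' model): a population of visual neurons with Gaussian tuning curves and independent Poisson noise, both assumptions we make for the example, from which the posterior P(x|r_vis) and x̂_vis are computed over a grid of candidate positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate object positions (degrees) and preferred positions of a
# population of visual neurons. Gaussian tuning and Poisson noise are
# our assumptions; the framework only requires some P(r_vis | x).
x_grid = np.linspace(-40.0, 40.0, 401)
preferred = np.linspace(-40.0, 40.0, 50)

def tuning(x, gain=20.0, width=10.0):
    """Mean firing rate of each neuron for an object at position x."""
    return gain * np.exp(-0.5 * ((x - preferred) / width) ** 2)

# One noisy population response r_vis to an object at x = 5.
x_true = 5.0
r_vis = rng.poisson(tuning(x_true))

# log P(r_vis | x) on the grid, dropping the x-independent log(r!) term
# of the Poisson likelihood.
mean_rates = tuning(x_grid[:, None])          # (n_positions, n_neurons)
log_likelihood = (r_vis * np.log(mean_rates) - mean_rates).sum(axis=1)

# Flat prior: the posterior is proportional to the likelihood (Eq. (2)).
posterior = np.exp(log_likelihood - log_likelihood.max())
posterior /= posterior.sum()

# MAP estimate: the candidate position that maximizes the posterior.
x_hat_vis = x_grid[np.argmax(posterior)]
print(f"true position: {x_true}, MAP estimate: {x_hat_vis:.2f}")
```

Note that with a flat prior the MAP estimate coincides with the maximum-likelihood estimate, which is why only the likelihood appears in the code.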
What should we do when the object is both heard and seen at the same time? Using the same approach, we need to compute the estimate x̂_bim ("bim" stands for bimodal) maximizing the posterior distribution P(x|r_vis, r_aud):

$$\hat{x}_{\mathrm{bim}} = \arg\max_{x} P(x \mid \mathbf{r}_{\mathrm{vis}}, \mathbf{r}_{\mathrm{aud}})$$

To compute the posterior distribution, we use Bayes' law, which, under the assumption of a flat prior distribution, reduces to:

$$P(x \mid \mathbf{r}_{\mathrm{vis}}, \mathbf{r}_{\mathrm{aud}}) \propto P(\mathbf{r}_{\mathrm{vis}}, \mathbf{r}_{\mathrm{aud}} \mid x) \quad (3)$$

$$P(x \mid \mathbf{r}_{\mathrm{vis}}, \mathbf{r}_{\mathrm{aud}}) \propto P(\mathbf{r}_{\mathrm{vis}} \mid x)\, P(\mathbf{r}_{\mathrm{aud}} \mid x) \quad (4)$$

$$P(x \mid \mathbf{r}_{\mathrm{vis}}, \mathbf{r}_{\mathrm{aud}}) \propto P(x \mid \mathbf{r}_{\mathrm{vis}})\, P(x \mid \mathbf{r}_{\mathrm{aud}}) \quad (5)$$

To go from Eq. (3) to Eq. (4), we assumed that the noise corrupting the visual neurons is independent of the noise corrupting the auditory neurons (which seems reasonable given how far apart these neurons are in the cortex). The step from Eq. (4) to Eq. (5) is a consequence of Eq. (2). From Eq. (5), we see that the bimodal posterior distribution can be obtained by simply taking the product of the unimodal distributions. An example of this operation is illustrated in Fig. 1.

When P(x|r_vis) and P(x|r_aud) are Gaussian probability distributions, as is the case in Fig. 1, the bimodal estimate x̂_bim can be obtained by taking a linear combination of the unimodal estimates x̂_vis and x̂_aud, weighted by their respective reliabilities [6,21,41]:

$$\hat{x}_{\mathrm{bim}} = \frac{1/\sigma^2_{\mathrm{vis}}}{1/\sigma^2_{\mathrm{vis}} + 1/\sigma^2_{\mathrm{aud}}}\, \hat{x}_{\mathrm{vis}} + \frac{1/\sigma^2_{\mathrm{aud}}}{1/\sigma^2_{\mathrm{vis}} + 1/\sigma^2_{\mathrm{aud}}}\, \hat{x}_{\mathrm{aud}} \quad (6)$$

where 1/σ²_vis and 1/σ²_aud, the reliabilities of the visual and auditory estimates respectively, are the inverses of the variances of the visual and auditory posterior probabilities. In particular, if the visual input is more reliable than the auditory input (σ²_vis is smaller than σ²_aud), then the bimodal estimate of the position should be closer to the visual estimate, and vice versa if audition is more reliable than vision.

Fig. 1. [Figure: the unimodal posterior distributions P(x|r_vis) and P(x|r_aud) and their product, the bimodal posterior P(x|r_vis, r_aud), plotted against the position of the object, x, with the corresponding estimates x̂_vis, x̂_bim and x̂_aud marked on the axis.]

The Bayesian hypothesis predicts that the distribution of bimodal estimates should be approximately a product of the unimodal estimate distributions (Eq. (5)). This approach has been applied successfully to the estimation of the position of the hand from visual and proprioceptive inputs [33–35]. In these experiments, subjects were required to localize a proprioceptive target (the middle finger of their right hand, without visual feedback), a visual target, or a visual and proprioceptive target (the middle finger of their right hand, with visual feedback).
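A minimal sketch of Eq. (6), with made-up numbers of our choosing: the two unimodal estimates are combined in proportion to their reliabilities, and the result is cross-checked against the peak of the product of the two Gaussian posteriors (Eq. (5)).

```python
import numpy as np

# Made-up unimodal estimates and standard deviations (degrees).
x_vis, sigma_vis = 2.0, 1.0    # vision: precise
x_aud, sigma_aud = 8.0, 3.0    # audition: less reliable

# Eq. (6): reliability-weighted (inverse-variance-weighted) average.
w_vis = (1 / sigma_vis**2) / (1 / sigma_vis**2 + 1 / sigma_aud**2)
w_aud = (1 / sigma_aud**2) / (1 / sigma_vis**2 + 1 / sigma_aud**2)
x_bim = w_vis * x_vis + w_aud * x_aud
print(f"bimodal estimate: {x_bim:.2f}")          # 2.60

# Cross-check against Eq. (5): the product of the two Gaussian
# posteriors peaks at the same position.
def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

x = np.linspace(-10.0, 20.0, 30001)
product = gauss(x, x_vis, sigma_vis) * gauss(x, x_aud, sigma_aud)
print(f"peak of product posterior: {x[np.argmax(product)]:.2f}")  # 2.60
```

Because vision is nine times more reliable in this example, the bimodal estimate (2.60) lies much closer to the visual estimate, and the product of the two Gaussians peaks at the same position, exactly the behavior described in the text.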