SPACING OUT:

DISTAL ATTRIBUTION IN SENSORY SUBSTITUTION

______

A Thesis

Presented to

The Honors Tutorial College

Ohio University

______

In Partial Fulfillment

of the Requirements for Graduation

from the Honors Tutorial College

with the degree of

Bachelor of Arts in Philosophy

______

by

David E. Pence

May 2013

Contents

1. Introduction: The ABCs of SSDs 1

1.1 More Background: Sensory Substitution and Enactivism 4

2. Why Action May not be Necessary 10

2.1 Seeing Without Structure 13

2.2 Seeing Without Corroboration? 18

2.3 A Thought Experiment 22

3. Alternative Frameworks for Sensory Substitution 25

3.1 The Prospects Considered 32

3.2 How it’s Done 32

3.3 Where it Happens 37

4. Plasticity 40

4.1 The Mechanisms of Plasticity 41

4.2 Where the Senses Trade 46

4.3 Acquired Synaesthesia? 48

5. The Multimodal Mind 52

5.1 Where it All Comes Together 53

5.2 Integration at Work 57

5.3 A Second Multimodal Proposal 67

6. Mental Imagery 71

6.1 The Step-by-Step 72

6.2 The Land of Imagination 77

6.3 Speaking For and Against 79

6.4 Does Reading Make a Better Metaphor? 87

7. How They All Fit Together 93

8. Conclusions: Lessons and Future Possibilities 97

9. References 100

List of Figures

Figure 1: Retinal Disparity 9

Figure 2: Gibsonian Ecological Optics 9

Figure 3: High Contrast Image 14

Figure 4: Resolution of Early TVSS 16

Figure 5: Early TVSS Device 19

Figure 6: Working Memory Map 34

Figure 7: Map of Brain Regions Involved in Sensory Substitution 39

Figure 8: Multisensory Reverse Hierarchy Theory 69

Figure 9: Resolution of The vOICe 89


1. Introduction: The ABCs of SSDs

One would be hard pressed to find a psychological result as surprising as sensory substitution. The basic idea is that, by translating sensory stimuli from one modality to another, one can regain lost or damaged perceptual capacities. Sensory substitution devices (SSDs) turn images into vibrations on a tactile grid or into variations in an auditory “soundscape,” and when suitably processed, these inputs support adaptive world-directed action, spatial navigation, and even object recognition. The blind, so many claim, can learn to see.1 Most significant among the gains shown by SSD users is the ability to perceive objects beyond their bodies, things “out there.” Sighted persons often take for granted their ability to sense and interact with the world of extrapersonal space, but for early and congenitally blind (i.e., blind from birth) users, the gain is massive. In the lab, users are often surprised to encounter depth and optical illusions (Bach-y-Rita, 1972); in everyday life, they might be surprised to “see” household objects long forgotten or distant trees on a walk outside.2

1 Sensory substitution is not limited to vision, but to date, the majority of research has centered on it.

2 See “what blind users say” at www.seeingwithsound.com.

These perceptual capacities all fall under what psychologists have called “distal attribution” (Epstein et al., 1986; Auvray et al., 2003). Various definitions have been proposed, but it can be broadly understood as the ability to represent objects as occupying spatial locations beyond one’s own body.

As surprising as the result itself, however, is the fact that despite over four decades of research, sensory substitution remains so mysterious. There is no orthodox explanation of what, exactly, is going on when subjects learn to use an SSD. The most influential account to date, and the one with which we seek to draw the sharpest contrast, is that of enactivism (also known as the sensorimotor or sensorimotor contingency theory). For this radical way of modeling perception, our awareness of the world beyond is inextricably linked to our familiarity with movement and action. In particular, enactivism holds that perception is the recognition of sensorimotor contingencies, the law-like connections between our actions and resultant sensory input. To see a cup of tea, for example, it is necessary to understand how visual sensations caused by the object would change as a result of environmental exploration and movement. In sensory substitution, enactivists have found perhaps their most intuitively powerful source of evidence. First off, they can explain how one modality sitting in for another is even possible: immediate sensation is unimportant so long as the sensorimotor contingencies that govern use of the device, that is, how movements affect video input, parallel those of vision. Second, and still more impressively, enactivism seems to predict the reliance of distal attribution upon control over visual input from the camera. Passively trained subjects were able to pick out the orientation of lines and even identify 2D shapes (albeit with high error rates; see White, 1970), but they never reached the astounding capacities of their actively trained counterparts.

While active subjects moved freely into perception of external objects, passive subjects never described their experiences in anything but proximal, tactile terms—exactly what one would expect were understanding action a key component of perception. We should likewise note that this has been one of the most longstanding and popular morals drawn from experiments with SSDs. The enactivists are in good company. Bach-y-Rita, for example, thought it a “plausible hypothesis” that self-guidance of input constitutes “the necessary and sufficient condition for the experienced phenomena to be attributed to a stable outside world” (1972, p. 99). The position seems to have been shared by his colleagues as well, since nearly identical language is used in White et al. (1970). The hypothesis was discussed rather positively in Epstein et al.’s (1986) seminal paper on distal attribution, and as recently as ten years ago reviewers (Lenay et al., 2003) went so far as to claim “empirical proof” that “there is no perception without action” (p. 282).

Nevertheless, we suspect such judgments are hasty. In the following, we shall make our case. First, we contend that the purported necessity of action is overstated and that the reports upon which enactivists hinge their account have not been adequately scrutinized. In particular, the quality of visual input available to subjects in the early experiments they cite was so low that action could plausibly be transformed from a mere aid into an essential means of disambiguation. Second, there are at least three alternative mechanisms that could explain the substitution process and the empirical observations associated with it: blindness-induced neural plasticity, multimodal3 learning, and mental imagery. These three accounts, we shall argue, fit together in a clear and mutually beneficial way, making for a formidable, albeit speculative, answer to the problem of sensory substitution.

3 Throughout the paper, we shall use the terms multimodal and multisensory to mean more or less the same thing.

1.1 More Background: Sensory Substitution and Enactivism

Although the idea of sensory substitution is an odd one, it is not entirely new.

The basic idea has been around at least since Descartes, who in his Dioptrics hypothesized that the blind gain a kind of sight through use of a cane:

No doubt you have had the experience of walking over rough ground without a light, and finding it necessary to use a stick in order to guide yourself. You may then have been able to notice that by means of this stick, you could feel the various objects situated around you, and that you could even tell whether there were trees or rocks, or sand, or water, or grass, or mud, or any other such thing. It is true that this kind of sensation is somewhat confused and obscure in those who do not have long practice with it. But consider it in those born blind, who have made use of it all their lives: with them, you will find it is so perfect and so exact, that one might almost say that they see with their hands, or that their stick is the organ of some sixth sense given to them in place of sight. (Descartes, 1637/1985, p. 153)

He went on to add that, given two such rods, the blind may even be able to triangulate depth (for a recent take, see Cabe, Wright, & Wright, 2003). Another philosopher to consider the possibility was Rousseau. In Emile, he noted that touch might substitute for hearing:

As our sense of feeling, when properly exercised, becomes a supplement to sight, why may it not also substitute for hearing to a certain degree, since sounds excite in resonant bodies vibrations sensible to touch? Lay a hand on the body of the cello, and you will be able, without the assistance of either eyes or ears, to distinguish, merely by the way in which the wood vibrates and trembles, whether the sound it gives is deep or shrill, whether it comes from the treble or the bass. If one were to train the senses to these differences, I do not doubt that, in time, one could become so sensitive as to be able to distinguish a whole air by means of the fingers. Now, if we concede this, it is clear that we might easily talk to deaf people by means of music; for tone and measures are no less susceptible of regular combination than voice and articulation, so they may be made use of in the same way as the elements of speech. (Rousseau, 1762, 237-238, quoted in Geldard, 1966)4

4 Interestingly, both philosophers drew this conclusion from reflection on haptic touch. Some studies we shall discuss later might help explain this.

In order to reach something close to the contemporary SSDs, however, one must go to the 20th century. Near the middle of the century, studies started dealing with the skin as an information channel. Geldard and his colleagues (1957; 1961; 1966) constructed a vibrotactile device used for reading block text. These early studies had success, but whether the resulting capacities were visual is debatable. There is little distinctively visual or distal about the spatial extension of a letter. A capital “F,” for example, could be pressed against the skin and recognized by the subject without anything like distal attribution taking place. The first SSD study of interest to us, then, is Bach-y-Rita (1968). The device in these studies consisted of a freestanding television camera and a dental chair with a vibratory pin matrix mounted on the backrest. Each pin consisted of a solenoid vibrator and was placed 12mm from any neighboring pins. Together they totaled 400 (a 20 x 20 grid). Camera input was routed through a digital switching matrix that matched photocathode surface points to pins, passing through a video amplifier and signal conditioner before reaching the vibratory array (Collins, 1967). Subjects received vibratory tactile input corresponding to a handful of objects or patterns placed in front of the camera. These would often be simple geometric shapes, though they could be as complex as a stuffed animal (see fig. 4). Later devices often followed this tactile-visual model, becoming more and more efficient and portable as time went on. By 1973, subjects could make use of a glasses-mounted camera and a portable, albeit lower-resolution, tactile pad. Other advances included electrotactile devices that use small shocks rather than vibrations, as well as tongue-routed SSDs (Bach-y-Rita, 2003).
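To make the resulting information bottleneck concrete, the sketch below shows one way a camera frame could be reduced to a 20 x 20 binary pin pattern. It is only an illustrative Python toy, not a description of the actual Collins (1967) circuitry; the function name, threshold, and block-averaging scheme are our own assumptions.

```python
# Illustrative sketch (not the actual Collins, 1967, signal path): reducing a
# grayscale camera frame to the 20 x 20 binary "tactile image" an early TVSS
# could display.
import numpy as np

GRID = 20  # 400 solenoid pins arranged 20 x 20

def frame_to_tactors(frame: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Downsample a grayscale frame (values in [0, 1]) to a 20 x 20 grid of
    pin states: 1 = vibrate, 0 = still."""
    h, w = frame.shape
    bh, bw = h // GRID, w // GRID                      # pixels per pin
    blocks = frame[:bh * GRID, :bw * GRID].reshape(GRID, bh, GRID, bw)
    means = blocks.mean(axis=(1, 3))                   # average each block
    # Binarization is where shading, texture, and most depth cues are lost.
    return (means > threshold).astype(int)

# Example: a bright square on a dark background becomes a featureless blob.
frame = np.zeros((400, 400))
frame[120:280, 120:280] = 0.9
print(frame_to_tactors(frame))
```

Whatever richness the original scene had, only 400 on/off values per frame survive this stage, a point that matters for everything argued below.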

Non-tactile routes have also been explored. In the early 1990s, Meijer (1992) developed The vOICe, an audition-based visual SSD that has since become one of the most popular devices available. At present, one can even download a simple phone-based vOICe program online. Finally, some devices have extended beyond vision substitution altogether. The variety of SSDs includes sound for touch (Kim & Zatorre, 2011), vestibular sensory substitution, and extended touch for people with spinal cord injuries—one such device even purporting to treat sexual dysfunction (Bach-y-Rita, 2003).

Paralleling the proliferation of devices has been a steady growth in SSD effectiveness. On first experience, all devices are roughly the same: users seem to experience little more than “noise.” With a tactile device, users might feel a tickle or itch. With an auditory one, it will be a swoosh sound. After proper training, however, subjects have shown increasingly impressive abilities. Early on, subjects could identify shapes and faces. More advanced subjects might jerk back from an unexpected zoom or perform complex tasks like batting a ball across a table (Bach-y-Rita, 1972). In one later case, a subject even assembled electronics as part of his job (Bach-y-Rita, 1995). If these abilities are impressive, though, what users are capable of today is absolutely staggering. Color, a perceptual element unknown to the early devices, has been the subject of increasing amounts of research as of late. Bologna, Deville, and Pun (2009) developed a prototype, dubbed See CoLOr, which uses orchestral sounds (recall Rousseau’s example). The device reportedly allows subjects to perform tasks like pairing colored socks and walking a winding red path. More surprising still are the recent results regarding SSD acuity. Using an auditory-visual SSD, Striem-Amit, Guendelman, and Amedi (2012) showed that cortically blind subjects could exceed the acuity threshold set by the World Health Organization for blindness. That is to say, the discriminatory capacities of today’s users can be greater than those of at least some sighted persons. Given these kinds of results, the difference these devices can make in the lives of blind persons is difficult to overstate.

Enactivism has a history of its own, with relevant precursors dating back to the sense-data theorists in the late 17th and early 18th centuries (Locke, 1690; Berkeley, 1709). The key contention that these theorists pass on to the enactivist is that visual input is, by itself, insufficient for our perception of the outside world. For the enactivists and many of their intellectual ancestors, the world of vision is flat. The standard example is that of a coin (Russell, 1912). Looking at a quarter on the table in front of us, we are liable to see a figure that is not perfectly round but slightly elliptical instead. As one moves away from the coin or stoops more toward its level of elevation, the coin will appear still more elliptical. Conversely, if one moves closer to it or elevates oneself to a height higher than the current perspective, the coin will appear more circular. What one has access to at a given instant, the two-dimensional snapshot available, is what the enactivists call “p-properties” (see Noë, 2004). Like the elliptical coin, these appear the way a scene might look if outlined on a pane of glass. What makes our experience three dimensional, what gives it depth, is experience with active movement. We need to know what will happen if we move thus and so: the object in front of us, as Mill famously put it, is no more than the “permanent possibility of sensation.”

The second principal component of enactivism is its anti-representationalism. As a matter of course, most contemporary vision science employs representations. Like earlier theorists, the principal challenge facing researchers in this tradition is to understand how we arrive at a rich visual experience from input limited to two-dimensional retinal images. This task is thought to be accomplished by a number of assumptions and inferences on the part of the visual system. That two converging inputs from either eye describe a single object at the point of optic convergence, rather than two distant objects lining up with the eyes individually, is one example of such an assumption (Marr, 1982; see fig. 1). Beneath these general kinds of descriptions is a network of algorithms and representations. The former operate on visual input, selecting and transforming the sea of available environmental information into a steady and manageable stream. The latter act as a stand-in for the world, a symbol (or more recently, a map) mediating between action and perception.

Needless to say, the enactivists find this account bulky and outmoded. Inspired by James Gibson’s (1966) early critique of the representational program, they argue that perceivers are dynamically engaged with and inseparable from their environment, rendering mediators and heuristics unnecessary. Looking at the same problem addressed by Marr, Gibson found a solution in movement. If a subject moves from one point to another, the retinal image projected by each object will move at a different rate and take on different shapes depending on its position relative to the observer (fig. 2). Picking up on these changes renders the single-object assumption superfluous on his account.5

5 One of the key differences between Gibson and the enactivists, however, is the centrality of action. For Gibson, movement simply makes more environmental information available: the contribution is instrumental. For the enactivists, by contrast, knowledge of sensorimotor regularities is partially constitutive of sight. Active movement not only makes more inputs available, it can also change the nature of the inputs already there. One cannot truly perceive without it. This, we suspect, is a legacy of the sense data theorists.

Figure 1. Retinal Disparity (Julesz, 1971). Given the retinal inputs, the four central points A, B, C, and D are indistinguishable from the eight points comprised by A_R, A_L, etc.

Figure 2. Ecological Optics (Gibson, 1979). Visual input will change as the subject’s position changes, giving him or her a sense of how close objects are.

Putting these historical lines together, we get a better sense of just what sensorimotor contingencies mean to the enactivists. We have no representations to help us and no reliable sense of the outside world independent of action. Every would-be perceiver is thrown into a world of raw, unstructured sensation. Each organism, then, must find its way around this world, exploring and learning how the sensations it receives change as a result of exploration. By the end of the process, each and every perceiving organism will have mastered the sensorimotor contingencies specific to its body. O’Regan and Noë (2001) have provided numerous illustrations of the end result in humans:

Suppose you are looking at an apple. In central vision you have stimulation from your [retina] corresponding to redness, and above it in peripheral vision you have stimulation of lesser resolution corresponding to the green leaves. You know this is an apple because you know that if you move your eyes up to the green bits, the change in stimulation will be typical of the change that occurs when green things move from periphery into central vision. You perceive this as an apple because you additionally know that if you turn it in a certain way, you can make the green leaves disappear and re-appear, and because you know that, being round, the apple’s profile will not change very much when it rotates about itself. (cit., p. 83)

As we shift the source of visual input, the stimulation we receive changes accordingly.

In defending this proposal, the enactivists are carrying forward the principles first enunciated by sense data theorists (recall Russell’s coin example) in a way that draws from dynamic anti-representational psychology. In doing so, they have given new life to a very traditional position.6

2. Why Action May not be Necessary

Returning to the central issue, one is hardly surprised to learn that enactivism sees in sensory substitution a powerful source of evidence. Critically, withholding self-guided action from SSD training prevents the development of perceptual capacities. The reason for this, some argue, is that subjects are kept from learning sensorimotor contingencies. Practically speaking, learning sensorimotor contingencies is impossible unless one (a) controls movement and (b) has access to the resultant inputs. Passive training cuts off the first of these, making perception impossible on the enactivist story (see Hurley, 1998; O’Regan and Noë, 2001; Noë, 2004).

6 Interestingly, the basic enactive take on distal attribution, and the formulation that enactivists have (knowingly or not) adopted, was that of the 19th century physiologist Hermann von Helmholtz: “When we perceive before us the objects distributed in space, this perception is the acknowledgement of a lawlike connection between our movements and the therewith occurring sensations” (1878/1977, pp. 138-9). Helmholtz is, we should note, the patron saint of inverse optics.

There is, however, serious reason to doubt the enactivist interpretation of these results. To our knowledge, each of the reports used to motivate the necessity of action (Bach-y-Rita et al., 1969; Bach-y-Rita, 1972, 1984, 1995; White, 1970; White et al., 1970) can be traced back to a handful of studies conducted with early vibrotactile apparatuses.7 Any support enactivism might draw from TVSS is ultimately predicated on the adequacy of these devices for testing passive learning in sensory substitution contexts. If there exist confounds, things that could be added or substituted in these experiments to enable learning, the enactivist line would be seriously undermined. We think there are at least two. The first rests with the fact that the total amount of visual information routed through the device was quite low. Sending visual information through these early TVSS devices was, in many relevant respects, the equivalent of translating Caravaggio into stick figures. The second stems from the fact that, to our knowledge, there were no significant sources of concurrent disambiguating information8 available from non-visual modalities. There is abundant empirical evidence that integration of inputs from distinct modalities plays a significant role in perceptual learning. Since SSD training is a variety of perceptual learning, it is at the very least plausible that the presence of concurrent, non-visual sensory inputs could facilitate the development of prosthetic vision. Ultimately, both objections have to do with the amount and kinds of information reaching subjects, and both are troubling because a lack of information means scenes that would otherwise be clear require disambiguation. Successful sensory substitution would depend on movement, not because it enables the learning of sensorimotor contingencies but because the input subjects received was impossible to reconstruct without kinesthetic information about spatial layout.

7 Some limitations are granted, but often only when introducing the device or suggesting future improvements: they feature alongside the weight of the machine and the placement of pins. On occasions when informational shortcomings are considered as a criticism, they serve only as support for some standing objection, such as whether TVSS provides genuinely visual phenomenology (see, for example, O’Regan & Noë, 2001).

Perception relies upon sources of natural, that is, non-intentional, information about our environments carried in the proximal physical stimuli affecting our sensory organs.9 Under normal ecological conditions, vision makes use of numerous and independently variable sources of optical information: shading carries information about shape, whereas linear perspective (the distant convergence of parallel lines) and motion parallax (differential object/background rate of displacement) carry information about depth. Other sources of information come from other, non-visual sensory modalities as well. Sounds, for example, will arrive at each ear at a different time depending on the position of the object. Noises coming from behind us will also be slightly muffled by the external structure of the ear. Both help to provide distinctively auditory space perception. Vestibular information (balance), proprioception (our sense of the arrangement of our body), and efference copy (a log of outgoing motor commands) also play an important, if less introspectively salient, role in disambiguating and integrating sources of perceptual information. Taken together, the sources within (and between) modalities provide an abundance of information, so much so that cues are often redundant in naturalistic settings. Shading and texture gradient, for example, both convey information about depth and shape. When numerous sources are eliminated, however, what remains is absolutely essential. One can cut one spoke in a wheel with little consequence, but if a great many spokes go missing, the wheel will cease to function. This seems to have been the case in early TVSS experiments.

9 Information in this sense is said to be carried whenever some state A “is a regular, or nomological, or counterfactually supported consequence of” another state B (Burge, 2010, p. 316).
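The interaural timing cue mentioned in the paragraph above can be given a rough quantitative form. The following is a minimal sketch under a simplified flat-wavefront model, our own illustration rather than a formula taken from the sources cited here: for a source at azimuth theta, ear separation d, and speed of sound c,

```latex
\[
  \Delta t \;\approx\; \frac{d \sin\theta}{c},
  \qquad d \approx 0.2\,\mathrm{m},\;\; c \approx 343\,\mathrm{m/s}
  \;\;\Rightarrow\;\; \Delta t_{\max} \approx 0.6\,\mathrm{ms}.
\]
```

Timing differences on this sub-millisecond scale, together with the filtering of the outer ear, are part of what gives audition its distinctively spatial character.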

2.1 Seeing Without Structure?

Over the years, researchers have put together a long list of visual space cues including, among many others, stereopsis; motion parallax (rate of movement compared to background); shading; shape from movement; color/surface reflections; foreshortening; occlusion; defocus (fore or background objects “blurring”); and texture (closer textures appearing more detailed). From these, the visual system is normally able to produce a reliable 3D interpretation of the input it receives.

Unfortunately, when transmitted through the early TVSS devices, many of these cues were either eliminated or minimized (especially for passive subjects). Provided with visual input comparable to what these subjects received, perception would be a struggle even for sighted subjects.

Although more sophisticated devices exist today (researchers are even working on color; see Bologna, 2009), those used in the experiments purportedly establishing the necessity of camera control were almost universally limited to high-contrast, “black and white” images. Just how difficult this makes perception is not immediately clear, but considering how sighted persons interpret two-tone, or “concealed,” images helps clarify the issue (fig. 3). When first shown a picture like the one below, many will find it utterly incomprehensible. Cues like shading, occlusion, and defocus are all but eliminated—replaced by binary black and white. Even those cues that make it through the change (e.g. foreshortening) are hard to discern in their new context. The difficulty of a high-contrast image, as compared with a typical one, is not a matter of sensorimotor ignorance, as even those with a lifetime of visual experience have difficulty parsing the image; nor is it that the scene is too complex, since the same result holds for very simple objects (Moore & Cavanaugh, 1998). Rather, we must admit that the difference in interpretational difficulty between a two-tone image and its grayscale counterpart stems from the amount of information each reflects to the eye.

Figure 3. High Contrast Image. Those who are unfamiliar with R. C. James’ famous image often fail to notice the Dalmatian. One interesting fact about the image above is that it is of considerably higher resolution than what would be available to one of Bach-y-Rita’s subjects (650 x 700 pixels as opposed to 20 x 20).
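A rough upper-bound comparison, our own arithmetic based on the pixel counts given in the caption, makes the gap vivid. Treating each pixel or pin as a single bit, since both the two-tone image and the pin matrix are binary:

```latex
\[
  650 \times 700 = 455{,}000 \ \text{bits per frame}
  \qquad \text{vs.} \qquad
  20 \times 20 = 400 \ \text{bits per frame},
\]
```

a ratio of roughly 1,100 to 1. An ordinary 8-bit grayscale photograph at the same resolution would carry roughly eight times more again, so even the already difficult two-tone image is more than three orders of magnitude richer than anything a 400-pin array can display.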

Compounding the problem is the fact that TVSS resolution was low. Experiments were conducted with an apparatus using no more than 400 pins (like a 20x20 pixel image), meaning that scenes could not be captured with the kind of detail common in naturalistic settings. The difficulty of recognizing low-resolution images should be obvious, but it is especially pressing given the damage already done by high-contrast rendering. Consider the bottom right image of figure 4 below. Were one not already aware that the image presents two adjacent objects, one might well mistake them for a single complex figure; even after we have been primed to think of them three-dimensionally, the objects seem flat. This is because the image lacks the depth cues we take for granted. Occlusion, for example, takes a serious blow. As Bach-y-Rita noted, the “vibrotactile system has no provision for enhancement of the overlapping edge, by color, shadow, or any of the other everyday manifestations of the borders between two objects” (1972, p. 81). In fact, subjects using the device came to perceive occlusion only indirectly, via factors like the relative height of the spout (cit.).10 Texture is likewise missing. Under normal conditions, remote objects will have compressed and less easily discerned surfaces. A plowed field receding into the distance is the classic example (Gibson, 1966). With the vibrotactile system, this cue is absent. With sufficiently high resolution, the absence may not be a problem (fig. 3 has the beginnings of texture gradient), but when limited to 400 pins, there just is not room. A white point at the bottom of the matrix will have just the same significance as a white point at the top.

10 This particular means of obtaining information is only available once subjects have been told which object faces the camera. The underlying reason for this is discussed alongside the spinning dancer illusion below.

Figure 4. Resolution of Early TVSS. A scene presented to a TVSS subject rendered on an oscilloscope. Notice that the objects were placed before a black background, making recognition easier but lessening available information.

The absence of these still-image cues places a much heavier burden on their movement-based counterparts (e.g. motion parallax). Regrettably, these cues are eliminated by a second, less obvious implication. SSDs available at the time of these experiments showed a fundamental “inability to deal with visual clutter” (Bach-y-Rita, 1995, p. 179). To get around this, researchers used only a few objects on homogeneous black cloth backgrounds (Collins & Bach-y-Rita, 1973; see also fig. 2). In effect, any information dependent on fore/background contrast went missing, a fact that is troubling precisely because so many of the cues left after the black and white filtering were completely unavailable without it. Motion parallax, for example, depends on objects’ relative rate of movement as compared to their background. Clearly, if an object is surrounded by darkness, the cue will be unavailable.11 The same goes for some still-image cues like linear perspective. If there is no background to recede into, subjects are not going to learn visual depth. The general importance of these types of cues has come up in a number of separate debates, but the best illustration comes from visual displacement adaptation (Rock, 1966). Like sensory substitution, these experiments tested the flexibility of our perceptual capacities. Subjects were given visual input at various degrees of displacement and tested for their capacity to adapt. Shifting input 30 degrees to the left or right will initially cause visually guided action to err accordingly; after a few hours of use, however, subjects can move about with relative ease. Again, like sensory substitution, early studies seemed to indicate that action and resultant visual input were necessary for successful adaptation (Held & Freedman, 1963). Subjects whose arm movements were guided by experimenters did not adapt. As later experiments established, however, the introduction of a structured environment alters these results significantly (Pick & Kay, 1965; Singer & Day, 1966). The simple inclusion of a rich background meant the difference between passive learning and failure. Such improvements have not been taken into consideration when it comes to sensory substitution, but the parallels drawn with prism adaptation suggest they should.

11 We should also note that motion parallax is unclear without non-visual, especially vestibular, input (Cornilleau-Pérès & Gielen, 1996; Hayashibe, 1991). This ties in with our second point: that subjects had insufficient disambiguating information from non-visual modalities.

What we have seen is a systematic elimination of visual cues. One by one, the signals of depth, shape, position, and all other aspects of the visual scene were cut out. Lack of color and low resolution eliminated the vast majority of still cues. Meanwhile, the absence of environmental structure (itself a byproduct of the devices’ low resolution) finished off those that come from movement. Anyone, blind or sighted, would have had to turn to outside help if they were to derive something significant from the quality of input given TVSS subjects. For active subjects, this help came from camera control, but this does not mean that learners in naturalistic settings would need to be active. TVSS users might simply be at a disadvantage as compared to sighted persons, in which case the necessity of action would not generalize. Nor would it mean that action is the only way tactile-visual input can be made sense of. Disambiguation might come from audition, proprioception, or even vestibular information, bringing us to the second major limitation.

2.2 Seeing Without Corroboration?

There is, to our knowledge, no mention of multimodal input in either the experimental or philosophical literature. For the early SSDs, structural features made the odds of incorporating multiple-source inputs low. A dental chair (see fig. 5) is not something that can be easily moved around or oriented to match with camera input. Nor is it flexible enough to let subjects feel the objects in front of them while maintaining contact with the apparatus. Finally, although there is nothing about the setup that would similarly exclude audition, it is unmentioned in the early experiments. Later researchers might have made up for this, but their goals seem more centered on engineering and/or clinical aspects of the device. The more sophisticated SSDs became, the less interested experimenters seemed in perceptual learning. The lone exception seems to be Epstein et al. (1986), who tested passive and active learning with a head-mounted device. In this case, however, no subjects, active or passive, progressed to distal attribution.

Figure 5. Early TVSS Device (White et al., 1970)

Were multimodal input an unimportant aspect of perceptual learning, its exclusion might not be a worry. This is far from the case, however, as is eminently clear from the various behavioral findings of Bahrick and her colleagues (e.g. Bahrick and Lickliter, 2000, 2002; Lickliter, Bahrick, and Honeycutt, 2002). In countless studies, they have shown that intersensory redundancy leads to more effective processing, learning, and memory for multi- as compared to unimodal information, all pointing toward what they call the intersensory redundancy hypothesis. According to this theory, “information presented redundantly and in temporal synchrony across two sense modalities selectively recruits attention and facilitates perceptual differentiation more effectively than does the same information presented unimodally” (Bahrick and Lickliter, 2000, p. 190). The evidence for this has built up over the course of several studies. Lickliter, Bahrick, and Honeycutt (2002), for example, found that exposing bobwhite quail embryos to synchronized regularities like the pairing of a maternal call with a flashing light increased learning rate by as much as four times. Asynchronous pairings, by contrast, appeared to hinder learning. In another study, Bahrick and Lickliter (2000) found that redundancy allowed 5-month-old infants to perceive rhythms their unimodal counterparts were unable to master. Subjects were habituated to either a bimodal (auditory and visual) or unimodal (auditory or visual) rhythm from a hammer tapping on a surface.12 In the testing condition, half of the subjects were shown the same rhythm, and the other half were given a novel one. Those habituated to the bimodal stimulus displayed visual recovery (returning to the video once it changed) significantly more often than those in the unimodal group, suggesting that these infants had noticed the shift in rhythm. The finding was later replicated and extended by Bahrick and Lickliter (2002), who found similar results for 3-month-olds regarding tempo. The long and short of it is that infants who receive only unimodal information fail to notice changes easily picked out by their bimodal counterparts.

12 Habituation studies are a common way of ascertaining what infants perceive. They are predicated on the assumption that, like adults, babies will tire of and lose interest in a single lengthy stimulus. If the infant looks away (as judged by a third party), the infant is taken to have lost interest. If the infant suddenly returns to the stimulus (again, as judged by observers largely ignorant of the experiment), experimenters take it that the infant has shifted its attention.

Yet another source of evidence in favor of multimodal facilitation comes from the recent studies of Seitz, Kim, and Shams (2006) and Kim, Seitz, and Shams (2008), both of which worked with adults. In these studies, adults were trained using audiovisual or visual-only stimuli and given tests dealing with congruent visual motion. The difference between the two groups was larger than might be expected based on the earlier cited studies. In the former experiment, for example, audiovisual training “reduced the number of sessions required to reach asymptote by ∼60%” compared to its unisensory counterpart (Shams, Wozny, Kim, & Seitz, 2011). Should one have doubts about the force of infant learning experiments, such adult learning studies will certainly help to dispel them. Still more corroborating results have come from a phenomenon known as attentional blink. If a subject is very briefly shown a stimulus, say a red dot, and is immediately afterward shown another, say a green square, she will only report having seen the former. If more time is allotted between the two inputs, the subject will recall them both, but so long as they are shown in quick succession, the latter will be overlooked. The same does not happen if the second event is specified multimodally (Olivier & Van der Burg, 2008). If the green square from our earlier example were paired with a quick beeping sound, the subject would have a much better chance of picking it up. Like the learners in the earlier cited studies, the subjects in Olivier and Van der Burg’s experiment seem to have been guided by intersensory redundancy. Inputs that would have otherwise gone unnoticed were brought to the fore by additional sensory input. To think that similar results may hold for sensory substitution as well is not unreasonable. In any event, the enactivist cannot assume that they will not.

2.3 A Thought Experiment

Both informational concerns, the low quality of input and the lack of input from other senses, can be summed up through a thought experiment. Suppose we give sighted individuals input comparable to what was available in Bach-y-Rita’s experiments. Subjects would be brought into an isolated room with a large 20x20 oscilloscope monitor showing prerecorded input. They would be familiar with the sensorimotor contingencies distinctive of vision and, if need be, could even be given prior training with some form of camera control. What matters most is that they are passive when the experiment occurs. This should pose no problems for the enactivists, however, since they have long emphasized that what matters is sensorimotor knowledge rather than actual motor control (Noë, 2004, 2010). Noë in particular has made this quite clear:

Actionism [another name for enactivism] does not claim that visual awareness depends on visuomotor skill, if by ‘visuomotor skill’ one means the ability to make use of vision to reach out and manipulate or grasp. Our claim is that seeing depends on an appreciation of the sensory effects of movement (not, as it were, on the practical significance of sensation).…actionism is not committed to the general claim that seeing is a matter of knowing how to act on or in respect of or relation to the things we see. (2010, p. 249)

Passive subjects in our experiment should have the proper “appreciation,” and, given that actual movement is not supposed to be a factor, they should be able to form an accurate interpretation of the object.

There is, however, serious reason to expect a different result. Given that tactile input did not code for color or grayscale (fig. 4), subjects would see what is, for all intents and purposes, a silhouette. Add to this the fact that subjects have no input from other modalities, and the odds that cues for depth will remain undeciphered skyrocket.

The most famous example of this is the spinning dancer illusion. Observers are shown the silhouette of a pirouetting dancer, but since the video contains no depth cues, they can choose to see her as rotating clockwise or counterclockwise. They could also choose to see the dancer as static by self-attributing movement.13 This ambiguity extends to all forms of rotational, forward/backward, and tangential movement, so no matter how the scene varies, depth and orientation are shrouded in mystery. We know that increases in information would make up for the problem. Were subjects shown a rich visual scene, they would be able to interpret it with little trouble. Alternatively, were subjects given input from other modalities or active control, they would likely discern the dancer’s true direction. If the subject is able to touch the front of the dancer’s foot or hear it approaching and receding (e.g. if the dancer has a siren attached to her leg), the subject will immediately apprehend her orientation. If the subject has control of the camera, the same result follows. Suppose he/she chooses to rotate the image clockwise. If the dancer is facing forward, the silhouette’s leg will appear to the left of her body; if the subject chooses to rotate her counterclockwise, the dancer’s leg will appear to the right. If the dancer is facing backward, the opposite will hold. These additions have nothing to do with general sensorimotor knowledge, though; rather, they give situation-specific disambiguating information. The passive subjects know what a moving dancer, and moving around a dancer, looks like. What they do not know is how this dancer is moving.

13 The same would be true for the horse and phone in fig. 4, provided the subject does not have an association between apparent size and distance. Of course, building an experiment like this would be impossible because sensorimotor contingencies will inevitably entail such associations.
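The underlying ambiguity can be made concrete with a toy simulation, our own sketch with illustrative numbers and function names rather than anything drawn from the experimental literature. Under a projection that discards depth, a marker on a body rotating clockwise traces exactly the same image-plane path as a depth-mirrored marker rotating counterclockwise:

```python
# Toy demonstration (our own sketch) of why a depth-free silhouette cannot
# specify rotation direction: a point circling one way and a depth-mirrored
# point circling the other way project to identical image coordinates.
import numpy as np

def projected_x(omega: float, phase: float, t: np.ndarray, r: float = 1.0) -> np.ndarray:
    """Horizontal image coordinate of a point rotating in the ground plane,
    viewed side-on with depth (the viewing axis) discarded."""
    return r * np.cos(omega * t + phase)

t = np.linspace(0.0, 10.0, 500)
clockwise = projected_x(omega=-2.0, phase=0.7, t=t)    # spinning one way
counter   = projected_x(omega=+2.0, phase=-0.7, t=t)   # mirrored in depth, spinning the other way

# The two trajectories coincide, so the silhouette alone leaves the direction open.
print("Silhouette trajectories identical:", np.allclose(clockwise, counter))
```

Touch, a sound source attached to the dancer, or self-controlled changes in viewpoint each reintroduce a depth-sensitive signal and collapse the ambiguity, which is just the point made above.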

Provided this account is correct, the enactivist faces two troubles. The first is that, if a passive subject is unable to form an accurate impression, then either the sufficiency of sensorimotor knowledge is challenged or the legitimacy of TVSS as a support for sensorimotor theories is. If enactivists accept the informational conditions, they will have to explain why subjects with the requisite knowledge are unsuccessful. If, on the other hand, they object to the quality of information in the experiment, they will risk inconsistency. Sensorimotor theorists have, after all, already relied on very similar input to draw conclusions about passive TVSS subjects. A third option might be to object that we have set the bar too high. Passive observers do form a volumetric interpretation of the dancer; whether or not it is stable or accurate may not be the important aspect. However, the 3D reading most people have of the dancer is probably a function of its highly detailed human form and/or regular movement. Provided a novel object, low resolution, and less predictable movements (experimenter-controlled camera movement is rarely as smooth and regular as the dancer’s pirouettes), a 2D reading becomes likely. Under these conditions, the object would seem like no more than an unstable 2D blot. Moreover, given that our focus is on perceptual learning, the importance of actual depth information comes to the fore. Understanding depth is likely a necessary step in sensing objects as “out there,” and that subjects could accomplish this with only ambiguous and impoverished depth information is highly implausible. That is to say, if sighted subjects do not form accurate interpretations of depth, we cannot very well rely on blind subjects to learn what visual depth is on the basis of the same evidence.

The second concern is that the success of active subjects implies a role for action over and above providing knowledge of sensorimotor contingencies. Both active and passive subjects have the requisite knowledge (supposing it is properly characterized as such), but only subjects with direct control (or, we think, multimodal information) are able to parse out depth. At the very least, action is bringing more to the table than enactivists have emphasized. At worst, the new, distinct role action plays threatens to obviate sensorimotor knowledge altogether. This brings us to our next section.

3. Alternative Frameworks for Sensory Substitution

We have given some reason to doubt the centrality of sensorimotor knowledge in sensory substitution. There is, however, a gap. If the section above was successful, we have shown that the deck was stacked against passive learning: the information available from visual input was sparse and the sources of multimodal input were minimal. Were we to close with this, however, there would still be a mystery surrounding how self-guided action or multimodal specification actually makes up the difference. In short, we would have no positive account.

Until very recently, enactivists seem to be the only ones who have stepped up to the plate. There were commentaries within the experimental literature, some of which we mentioned in the introduction, but these are tantalizingly brief. What is more, they were often couched in what might be considered proto-enactivist terms. Bach-y-Rita (1972), for example, seems much impressed with the reafference theory of Held and Hein.14 Within the philosophical community, meanwhile, debate has tended to focus on other issues, such as whether sensory substitution really is qualitatively visual (Block, 2005; Prinz, 2006). The job of outlining an alternative, then, is still outstanding. We do not presume to offer anything complete in what follows, but the literature does present several promising leads. In the coming sections, we will sketch three tentative proposals, some our own and some emerging in the contemporary literature, that together comprise an “inactive” front. The field is a very speculative one at the moment, but we nevertheless stand to gain much from a proper examination.

14 This theory, well known in the 1960s, held that perception depends in part on a mechanism comparing motor output with resultant “reafferent” visual input. It is one of the major precursors to the now popular sensorimotor account.

The proposals presented here are the following:

(a) Crossmodal plasticity. This first route makes use of a phenomenon whereby sensory deprivation causes brain regions distinctive of one modality to be triggered by and perform the functional tasks of another. We tend to think of the visual cortex of blind individuals as fallow, but as proponents of crossmodality point out, the region is quite active. Indeed, blind subjects show greater visual cortex metabolism than their sighted counterparts (De Volder et al., 1997). In lieu of optical input, these regions have taken on a variety of functions. The occipital (“visual”) cortex shows activity during Braille reading (Sadato, 1996), memory (Röder et al., 2001), language-related tasks (Röder et al., 2000), and perhaps auditory and somatosensory localization as well (Ptito and Kupers, 2005). The discovery of occipital activity in sensory substitution has led many to postulate that a similar “recruitment” process might be at work there.

Exactly how this process works on the psychological level is less well-defined than the other two options we will consider, but much has been made of certain cross-cortical connections and interactions evidenced between modalities. Sensory substitution, on this proposal, comes from the “unmasking” of these connections (Ptito & Kupers, 2005; Ward & Meijer, in press). That is, connections between sensory regions that under normal circumstances would have been inhibitory are removed or retooled. Under normal developmental conditions, organisms benefit from the independence of the senses, sight being unfettered by outside influences like, say, Braille reading. Obviously, vision subserves a very important function and includes many aspects of processing wholly unrelated to other modalities. Allowing other cognitive processes to have unchecked competition with or modulation of visual centers may interfere with these (the same applies vice versa). When subjects are raised in visually deprived conditions, however, this need dissolves. Occipital regions are not receiving any optical input, so the space is left open to other processes. Connections between modalities are less inhibited and available for use in sensory substitution.

An interesting take on this comes from Ward and Wright (in press; see also Proulx, 2010), who conceive of sensory substitution as a form of acquired synaesthesia. Although synaesthesia is often reserved for the developmental abnormality, a condition that likely involves a genetic component (Baron-Cohen et al., 1996; Grossenbacher & Lovelace, 2001), certain cases of “induced synaesthesia” have been observed. These are, as it so happens, generally tied to sensory deprivation of some sort. A sudden loss of optical input due to eye or nerve damage can cause it (Rao et al., 2007), as can brain damage (Beauchamp & Ro, 2008). The most significant implication of this approach is that both modalities involved will remain present. Senses are not substituted, per se; rather, they are supplemented. Any given SSD-related experience, then, will have dual components. The reason subjects shift their reports from tactile to distal has more to do with the allocation of attention than with the number of modalities involved.

(b) Multimodal means.15 A second plausible route would be through multimodal learning and integration. This route ties most closely with our concerns regarding the absence of multiple sensory inputs in Bach-y-Rita’s studies. There are a couple of faces to the multimodal account we mean to develop here. The first has to do with the multimodality (or supramodality) of the brain itself. A growing body of evidence suggests that the brain is better subdivided by task than by modality (Reich et al., 2012). Many regions respond to certain types of information regardless of which sense they arrive from. The LOC (lateral occipital complex), for example, shows activity for shape perception based on visual, auditory, and tactile inputs. The PPC (posterior parietal cortex), in turn, does the same for space. One can easily see how this might lend itself to sensory substitution: if the right regions are already responsive, using SSDs may only be a matter of strengthening pre-existing connections to these regions.

15 One usage of multimodal found within the empirical literature restricts it to cases of two simultaneously stimulated modalities influencing each other. Multimodal integration, for example, occurs when two inputs are bound. We shall take up a broader usage that refers to subsystems that, by default, work with multiple modalities and potentially mediate impacts between them. This stands in contrast to impacts dependent upon plastic changes in the brain, though the two are not mutually exclusive.

This is where multimodal integration comes in. The default interpretation of a given unimodal input will tend to be in line with whatever information is typical of it. If there is nothing like self-guided movement to draw attention to them, inputs to the back will be interpreted as just that. Integration (or some similar binding process) might serve to overcome this by linking the SSD-based input with another modality. In this way, integration, understood as the recognition of crossmodal correspondences, would move the hypothesis of distal attribution from a remote possibility to a likely scenario. Suppose one is trying to decode a certain script. On its own, the text will be next to impossible to understand, but if one is simultaneously given a corresponding text in a known language—if one finds a Rosetta Stone—things change. Similar reasoning seems to be employed by the perceptual systems. If some novel input lines up with converging evidence from other modalities, the mystery input more likely represents a real-world event than some perceptual quirk. This has been suggested by multiple perceptual learning studies (e.g. Bahrick et al., 2000, 2002), and serves as a plausible gateway into the “metamodal” space mentioned above.
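As a toy illustration of what the recognition of crossmodal correspondences might amount to computationally, consider flagging which channel of a novel input stream covaries in time with a familiar modality. This is our own sketch, not a model drawn from the studies cited above, and every name and number in it is illustrative.

```python
# Toy sketch (ours, not a model from the cited studies): a novel SSD channel is
# treated as "about the world" when its time course correlates with a signal
# from a familiar modality, and as noise otherwise.
import numpy as np

rng = np.random.default_rng(0)

def correlates(novel: np.ndarray, familiar: np.ndarray, threshold: float = 0.5) -> bool:
    """Crude synchrony test: normalized correlation between two time series."""
    r = np.corrcoef(novel, familiar)[0, 1]
    return abs(r) > threshold

# A shared event (e.g., an object approaching) drives both the SSD channel and
# a familiar tactile/auditory channel; a second SSD channel is pure noise.
event = np.sin(np.linspace(0, 6 * np.pi, 200))
ssd_channel_1 = event + 0.3 * rng.standard_normal(200)   # tracks the event
ssd_channel_2 = rng.standard_normal(200)                 # unrelated noise
familiar_touch = event + 0.3 * rng.standard_normal(200)

print(correlates(ssd_channel_1, familiar_touch))  # True: worth binding and attending to
print(correlates(ssd_channel_2, familiar_touch))  # False: discard as a perceptual quirk
```

The point of the toy is only that temporal synchrony gives the system a cheap, modality-neutral criterion for deciding which novel inputs are worth treating as world-directed.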

Independently of the binding mechanism we have in mind, Proulx et al. (in press) have proposed a multimodal learning theory of their own. This account draws on reverse hierarchy theory, an increasingly popular model of perceptual processing according to which “learning is a top-down guided process, which begins at high-level areas of the visual system, and when these do not suffice, progresses backwards to the input levels” (Ahissar & Hochstein, 2004). According to Proulx et al.’s particular version of it, the upper regions of this hierarchy are multimodal, serving to mediate crossmodal training. As subjects train with a novel input, say soundscapes, task-relevant aspects will be shared across modalities by way of the higher-level representation. Low-level processing feeds into high-level, and high-level trains low-level (see figure 8).

(c) Mental imagery. A third and final possible route comes from mental imagery, a neutral term designating auditory, visual, and tactile imagination. That informed subjects, those who know the task is to locate some object, would employ imagery seems highly plausible, but most experiments take little note of it as a strategy. Most often, imagery is presented more as a confound than a theory of its own (see, e.g., Poirier, De Volder, & Scheiber, 2007), as something that could account for, say, occipital lobe activity in blind persons without calling on significant neural reorganization. It is also often regarded as only a partial explanation: as Ward and Wright contend, “visual imagery alone cannot explain the shift in experience before versus after immersion with sensory substitution” (in press, p. 7). Nevertheless, we think an imagery-based account has some potential and is certainly deserving of some independent consideration.

On the imagery-based account we will be developing, subjects would come to “perceive” when they had attained the ability to update and adjust these imaginings automatically (i.e. non-deliberately) and as a result of SSD input. Mental imagery would still be a high-level process, but it would be used in a way unlike any examined in the current literature. Sighted persons use imagery only when more direct means are unavailable. A mechanic might imagine the location and orientation of an unseen bolt, but doing the same for something in plain sight would be ridiculous. Moreover, if the process is automatic, it is likely piecemeal. A given smell might tend to trigger a given image, but nothing complex or systematic goes on. For blind SSD users, however, the situation is very different. Rather than an occasional stopgap or association, they would use imagery as an on-line, real-time representation of the objects around them. The imagined shape, location, and orientation of the object would each be finely tuned by sensory inputs and triggered automatically. Imagery would not be used in lieu of perception: it would be perception.

Explained thus, the options are relatively clear, but when one looks to how sensory substitution actually comes about, the crisp distinctions begin to blur. Moreover, as we shall see, the three proposals employ many of the same brain regions, a fact that has made distinguishing them via neuroimaging somewhat difficult. As hard as they are to work with experimentally, these aspects of imagery prove useful when one actually goes about learning to use an SSD. If connections between tactile processing and mental imagery areas are fostered, then by default the same connections could work for long-term crossmodal effects. Additionally, if multimodal means are employed, they would likely be helped by multimodal imagery. This may be especially true for sighted subjects, who have less reason to develop much in the way of crossmodal plasticity. Although we discuss them independently, we ought to keep in mind that the three are not exclusive. Some possibilities for combination will be explored in section 7.

3.1 The Prospects Considered

Like so many problems, sensory substitution can be broken down into questions of where and how. In the following, we shall consider how each of our three options answers these questions, with particular attention paid to the context of actual training and crossmodal impact. We will be considering each individually, but every account must eventually answer the same fundamental questions. Many, as we shall see, end up drawing on the same resources.

3.2 How it’s Done

To understand any one of the proposed routes, a few basics of orthodox psychology are helpful. At the psychological level of description, there are two preconditions on successful sensory substitution. The first is that subjects have sufficient information available for perceptual learning. This is true of each of the orthodox accounts as well as the enactivist proposal. It was also essentially the point we discussed at length earlier in the paper, so we will not belabor it. Suffice it to say, one cannot cook without ingredients: subjects may have all the necessary capacities but will never reach distal attribution without the right kinds of information.

The second condition is that this information be detected and decoded in a way that makes distal attribution possible. This, in turn, requires that the cues presented by the SSD (or rather the representations constructed on the basis of this input) be made available to working memory. The availability of these representations is a major aspect of how we have chosen to interpret distal attribution—that subjects be able to consider and/or act upon sensory inputs—and will help to elucidate some of the differences between approaches. Fundamentally, this condition is that inputs run the attentional gamut.

Attention is a process we are all quite familiar with. Colloquially, we say that a student is paying attention or not; that an artist attends to the minutest detail of her work; and that soldiers stand at attention. The rough notion behind these examples has been refined into a more precise, scientific one employed in perceptual and cognitive psychology. The contemporary psychological understanding of attention breaks down into four components: working memory, competitive selection, filtering, and orienting (Knudsen, 2007). Working memory is the space containing the information used in conscious planning and decision making. An example of this would be whatever the reader happens to be monitoring at the moment, a page or computer screen most likely.

Just before working memory stands a process of competitive selection. Space in working memory is very limited. Famously, George Miller (1957) estimated it at around 7 “chunks” of information (7 numbers, 7 dates, 7 deadly sins, etc.). The information available to it, however, far exceeds this number. We can distinguish literally millions of shades, but (at least in the West) emphasize the 7 colors of the rainbow. Given this sharp difference, there is going to have to be some selection process, some decision on the part of attention. This decision is made by a combination of two factors: bottom-up filtering and top-down orienting. The first occurs when an input is amplified on the basis of either learned or instinctive importance. Immediately recognizing when one’s name is called is an example of filtering. Another example is attentional bias for multimodally specified events (Bahrick et al., 2000). Orienting, by contrast, is done on the basis of executive control. Listening to one person in a noisy room is possible only because we have these powers. Another example might be focusing on this thesis rather than what is going on in the adjacent room. Figure 6 shows how each of these components fits together.

Figure 6. Attentional Map (Knudsen, 2007).
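To make the gating just described concrete, the following is a toy sketch of how bottom-up filtering and top-down orienting might jointly decide what reaches a capacity-limited working memory. Every label, weight, and signal below is a hypothetical placeholder of ours, not a parameter from Knudsen (2007).

```python
# Toy sketch of the four-component attention model discussed above.
# Every number and label is a hypothetical placeholder, not a value drawn
# from Knudsen (2007) or from any experiment.

WORKING_MEMORY_CAPACITY = 7  # Miller's rough "seven chunks" estimate


def bottom_up_salience(signal):
    """Filtering: amplify inputs with learned or instinctive importance."""
    boost = 2.0 if signal["learned_importance"] else 1.0
    return signal["intensity"] * boost


def top_down_gain(signal, current_goal):
    """Orienting: executive control boosts inputs relevant to the current task."""
    return 3.0 if signal["category"] == current_goal else 1.0


def competitive_selection(signals, current_goal):
    """Rank all candidate inputs and admit only the strongest few to working memory."""
    scored = sorted(
        signals,
        key=lambda s: bottom_up_salience(s) * top_down_gain(s, current_goal),
        reverse=True,
    )
    return [s["label"] for s in scored[:WORKING_MEMORY_CAPACITY]]


candidates = [
    {"label": "own name called",  "intensity": 0.9, "learned_importance": True,  "category": "speech"},
    {"label": "refrigerator hum", "intensity": 0.2, "learned_importance": False, "category": "noise"},
    {"label": "SSD soundscape",   "intensity": 0.5, "learned_importance": False, "category": "ssd"},
]
print(competitive_selection(candidates, current_goal="ssd"))
```

The only point of the sketch is that SSD-based input must fare well in a competition of roughly this shape if it is to reach executive processing at all.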

Uncontroversially, any orthodox account of what happens in sensory substitution is going to have to work with (or around) this model. Attention is the process by which information becomes available to executive decision making, and sensory substitution, if nothing else, requires that the information it presents is not discarded. Those novel sources we discussed in section 1 are going to have to find their way through competitive selection if they are to be of any use to the subject. This comprises our first desideratum: any proposed route must provide a way of filtering out irrelevant information while preserving and adding weight to those sources necessary for distal attribution. This desideratum corresponds most closely to the account one would give if asked how a proposal “works,” and it will serve as a central talking point in each proposal’s section.

This also highlights a major point of divergence between mental imagery and the other two options. When pinpointing where, on the map, each of the three finds itself, mental imagery is the only one with a significant top-down component. Imagery is prototypically the result of volitional control. If one so chooses, one can picture one’s bedroom back at home: the placement of the windows, where the bed rests, perhaps a bookcase or television. One can likewise imagine one’s favorite song or the feeling of a soft pillow on one’s face if so inclined. Once familiar, it may not take much effort. Likely, the reader imagined the sensations described above without much consideration, but it is nevertheless a top-down process (Stokes, Thompson, Cusack, & Duncan, 2009). There is even some debate as to whether it simply is a facet of selective attention (see Pylyshyn, 2003). Plasticity and multimodality, by contrast, are recognizable as bottom-up from the get-go. Bahrick and her colleagues’ multimodal studies are especially clear in illustrating this.

This difference parallels another interesting, though less pressing, distinction.

Imagery, with its ties to executive control, is rather rare, evolutionarily speaking. The same cannot be said for multimodal integration or crossmodal plasticity. These mechanisms seem much too old and much too basic. Whereas imagery is of little use to less complex organisms (hence its widespread absence; see Whiten & Suddendorf, 2001),16 multimodal integration and cortical plasticity are exceedingly advantageous at much lower levels of complexity. This is because, again in contrast to imagery, both make their contributions from the bottom up. Multimodal input operates by putting greater weight on inputs attested to by two or more modalities. Crossmodal plasticity, meanwhile, has to do with the amount and kind of processing undergone by a given input. By routing inputs to unused or underused processing regions, plasticity makes available entire classes of otherwise cut-off neural representations.

As we shall see, the bottom-up/top-down distinction has significant implications for how each route sorts information and hence how each can contribute to sensory substitution. Explaining how subjects reach distal attribution, then, will involve a couple of distinct challenges and advantages. In particular, mental imagery will need to demonstrate that top-down mechanisms are both relevant for the crossmodal effects observed in sensory substitution and capable of quasi-perceptual use. Imagery might be too closely tied to volitional control to bring about perception-like experience. Alternatively, we could find that imagery, even if widely non-volitional, is tied too closely to singular modalities or simply insensitive to training, making it a poor mediator for sensory substitution and the crossmodal transfer it presupposes. Multi- and crossmodal plasticity have their own challenges. First, we must make clear that purely bottom-up means can trigger a distal representation from a default proximal one, something that mental imagery takes for granted (one simply imagines the object as external to oneself). Second, we ought to show that this process can allow subjects to recognize aspects of a scene that are not available through the various modalities involved. The pattern of a sheet may not be recognizable through the sounds it makes, but subjects are nevertheless able to discern it. This, too, is decided by fiat where imagery is concerned. Strictly speaking, this second hurdle is beyond our stated aim of accounting for distal attribution, but it is an important aspect of sensory substitution and warrants discussion nonetheless.

16 An interesting experiment might involve outfitting blind animals, a dog with cataracts, for example, with an SSD. If these animals show signs of distal attribution, it could show that mental imagery is not strictly necessary.

3.3 Where it Happens

The second desideratum has to do with where neuroimaging places the processes of sensory substitution. The technology necessary for conducting these studies is not especially new, but it was not until the 2000s, especially the latter half of the first decade, that one begins to see such studies in publication. These began with Arno et al. (2001) and continue today. What they have found is that, contrary to what some maintain (Hurley & Noë, 2003), activity has tended to focus around “visual” areas. These included the “visual” cortex, alongside certain multimodal areas. We focus on two particularly interesting areas: the Lateral Occipital Complex (LOC; shape perception, mostly) and the Posterior Parietal Cortex (PPC; spatial attention). The two of them are well-positioned for crossmodal impact (fig. 7) and have been subject to a good deal of study. Most importantly, both have been implicated in successful sensory substitution. Activity in the PPC has been observed using both fMRI and PET (Amedi et al., 2007; Arno et al., 2001), and the LOC, especially subregion LOtv, has been linked to it by PET, fMRI, and EEG (Arno et al., 2001; Renier et al., 2005; Amedi et al., 2007; Ortiz et al., 2011). When comparing the brains of practiced and novice SSD users, these areas stand out, and when subjects are presented with noise (as opposed to meaningful SSD coding), the regions show no significant activation (Ptito, Mosegaard, Gjedde, & Kupers, 2005).

Additional support for the LOC in particular comes from transcranial magnetic stimulation (TMS). Temporarily knocking out the LOC of long-term vOICe user PF with TMS led to a reported phenomenal “darkening” as well as “dramatic” identification errors (Merabet et al., 2009). Auditory sensation, by contrast, was left intact: she had no trouble discerning pitch, sound intensity, or experimenter instruction.

This suggests a stronger relation than “mere” correlation, at least where the LOC is concerned.


Figure 7. Map of Brain Regions Involved in Sensory Substitution (Poirier et al., 2007). Solid bidirectional lines indicate a known connection in the human brain; the dashed line indicates a connection only evidenced in monkeys. BA 19 contains both intermediate level visual areas and the LOC. S1 contains the primary somatosensory cortex.

The first thing to note about this data is that all three of the areas are commonly tied to vision. The LOC was first discovered when comparing visual shape perception with that of unstructured “noise” images (Malach et al.,

1995), and the PPC is well-known as the center of visuospatial attention.

Sensory substitution is known to activate other, more posterior regions as well, including earlier visual areas like V2 (Arno et al., 2001; Collignon et al., 2007;

Ortiz et al., 2011). Given the “visual” nature of most SSDs, activity in these regions is somewhat unsurprising. Nevertheless, the issue has been a center of controversy in the past. In particular, enactivists claimed that sensory substitution left the occipital cortex untouched, activating instead the somatosensory cortex (Hurley & Noë, 2003; Noë, 2004). This was actually a major point of contention for them, as it supported their own highly flexible conception of the brain. Empirical study has not borne out enactivist contentions, however, marking a major opportunity for each of the orthodox proposals. These theories have been structured with such neuroimaging results as an explicit explanandum. Hence, for each of the brain regions we consider, there is independent and persuasive evidence for crossmodality, multimodal integration, and mental imagery (our second desideratum).

4. Crossmodal Plasticity

Crossmodal plasticity is perhaps the easiest of the three routes to explain. It also appears to be the most popular (see Ptito & Kupers, 2005; Ptito et al., 2005;

Poirier, De Volder, & Scheiber, 2007; Merabet et al., 2009; Reich et al., 2012; Ward

& Wright, 2012). The story of how crossmodal plasticity selects the right information boils down to getting the input to the right processing regions. This, in turn, is dependent upon the functional and/or anatomical neural development of blind persons.

Regions specifically tied to optical input have long been known to undergo serious changes due to sensory deprivation (Hubel & Wiesel, 1962). Cats raised in the dark or in environments with only rightward movement, for example, will show selectively retarded development (Daw & Wyatt, 1976). For this reason, researchers long thought that deprivation caused irreparable harm to the possibility of visual processing in the blind. A more recent slew of findings has suggested a more optimistic outlook, however. Examination of the occipital lobe of blind persons has revealed no significant atrophy or degeneration (Breitenseher, 1998). Rather, the region appears to open up to alternative routes of input and kinds of processing.

Moreover, later studies have found that the occipital lobe maintains many species-typical processing regions and organizational features (Striem-Amit et al., 2011; Ptito, Matteau, Gjedde, & Kupers, 2009; Reich, Szwed, Cohen, & Amedi, 2012). These include specific subregions like area MT (a motion processing region; Ptito et al., 2009), as well as large-scale organizational features like the distinction between dorsal and ventral processing streams, the “how” and “what” areas of the visual brain (Striem-Amit et al., 2012). Both seem to suggest that these features are at least partially innate.

If this is the case, we might expect visual information making its way to these regions to have a head start. SSD-based input could potentially be filtered and processed in a manner similar to that observed in typical visual processing. The capacities may be unused, but they are still there in some respect. Some additional support for this take comes from the fact that sensory substitution in the blind tends to activate the same regions as would be expected in vision: the MT for motion, the LOC for shape, the PPC for spatial tasks, etc. (see, for example, Ptito, Chebat, & Kupers, 2008). In sensory substitution, we appear to have a fortuitous pairing of the right input with the right cortical regions.

4.1 The Mechanisms of Plasticity

In terms of navigating through the filters of attention, it helps that, for the early blind subject, the representations produced in these regions would be entirely new.

Our attention is naturally drawn to novel happenings. Any inputs coming from these regions, then, would be marked as salient and fast-tracked to executive processing.

Even for late blind subjects, the representations would be sharply discontinuous with anything the subject would have experienced in quite a while, giving them extra weight. A more scientific way of fleshing out the representations’ novelty would be to suggest that input was tapping into previously unused “visual” saliency maps (see Koch and Ullman, 1985). These maps are one way of accounting for our ability to select important features from the cluttered visual world. They may well work for blind subjects too. The basic idea is that different visual features (which have their own maps) combine into a single, normalized map. From here, the features that are most discontinuous, say a swiftly moving dot in the foreground, are selected, and those features that are not selected are suppressed. What this means for sensory substitution is that information coming from the substituting modality could be treated more or less like visual information. Since the SSD is, in effect, giving subjects visual information, one might expect features to be selected and idiosyncrasies from the mediating sense to be suppressed. The neural location of these maps has been debated, but proposed cortical areas have included V1 (Li, 2002), V4 (Mazer & Gallant, 2003), and the PPC (Gottlieb, 2007). Routing to these processing regions would allow access to an altogether new (for the subject) path to working memory.
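To illustrate the saliency-map idea in code, the toy sketch below normalizes two hypothetical feature maps, sums them into a master map, and applies a winner-take-all selection. It is an illustration under our own simplifying assumptions, not a reconstruction of Koch and Ullman's (1985) model.

```python
import numpy as np

# Toy illustration of the saliency-map idea described above: per-feature maps
# are normalized, summed into a single master map, and the most conspicuous
# location is selected while everything else is suppressed.  The maps and the
# normalization scheme are illustrative assumptions only.


def normalize(feature_map):
    """Rescale a map to [0, 1] so no feature dominates by its units alone."""
    span = feature_map.max() - feature_map.min()
    return (feature_map - feature_map.min()) / span if span > 0 else feature_map * 0.0


def master_saliency(feature_maps):
    """Combine the individual feature maps into one normalized saliency map."""
    return normalize(sum(normalize(m) for m in feature_maps))


def winner_take_all(saliency):
    """Return the single most salient location; all others lose the competition."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)


motion = np.full((8, 8), 0.1)
contrast = np.full((8, 8), 0.1)
motion[2, 5] = 1.0  # a swiftly moving dot in an otherwise quiet scene

print(winner_take_all(master_saliency([motion, contrast])))  # -> (2, 5)
```

If SSD-based input really does tap maps of this kind, the idiosyncrasies of the mediating sense would be among the losers suppressed by the competition.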

Now the question becomes how plasticity makes this possible. One option is that blind subjects develop different circuitry as a result of their experiences with the device. Researchers have known for some time that lesions and informational rerouting can trigger neural rearrangement. One striking finding cited by orthodox theorists and enactivists alike is that of Sur et al. (1999), who found that rerouting visual inputs to the auditory cortex of ferrets will lead to visual responses in auditory regions. Blind subjects, on such an account, would be much like the ferrets in Sur et al.’s study: receiving a certain kind of input would trigger major rearrangement and development. Such an account is unlikely to be accurate, however. For one, it would leave many cases of sensory substitution untouched. In particular, it would overlook sensory substitution in late blind and sighted subjects, both of whom have a sizeable literature surrounding them (see Renier et al., 2005; Amedi et al., 2007; Poirier et al.,

2007; Ward & Meijer, 2010; Kim & Zatorre, 2010; 2011). Additionally, it would fail to take into consideration the speed at which some subjects learn to use their SSDs.

Oftentimes, new abilities will manifest over the course of a few hours, much less time than is necessary for new connections to sprout. Finally, the prototypical ferret example suffers from a major disanalogy: SSD subjects still have use of the

“substituted” modality, making the seemingly clear-cut case much messier (Ward &

Wright, in press). If neural rearrangement has a role, it would likely be limited to early development and facilitation of crossmodal effects. Once again, this is not a critical aspect. Although the force of early brain plasticity is undoubted, it may not be what is needed for sensory substitution.

An alternative and more conservative hypothesis is that SSD subjects merely repurpose connections that have developed for other reasons (Collignon et al., 2007).

This is often called “recruitment” or “unmasking.” As we noted in the opening, the brain is a richly interconnected organ. One would not expect all these connections to be used, however. If one is looking out for some unisensory event, say a bird silently swooping down at you, interference from other modalities could be disastrous. These connections are “masked” to protect the organism. If, however, conditions are such that this typical relation does not hold (as is the case when visual input has been cut off), these connections and the cortical regions they lead to become fair game. The occipital cortex will open up to Braille reading, somatosensory localization, and a whole host of other functions (Ptito & Kupers, 2005). This approach does a much better job of accommodating sighted and late blind subjects. Moreover, there is some independent evidence to suggest rapid and reversible crossmodal effects in the occipital cortex following sensory deprivation (Merabet et al., 2008). These effects can follow the onset of blindness as well as the blindfolding of sighted subjects (Pascual-Leone et al., 2005; Pitskel et al., 2007). Sometimes, abrupt sensory deprivation will even trigger quasi-sensory experiences. That is, subjects will hallucinate (Merabet et al., 2004; Afra, Funke, & Matsuo, 2009). Other studies have found early and late blind subjects to be the same with regard to Braille-triggered occipital activity (Burton et al., 2002), and one even found that blindfolded sighted subjects show the tendency (Kauffmann, Theoret, & Pascual-Leone, 2002). Once again, these effects unfold on a timeframe much faster than can be accounted for by neural growth, suggesting that recruitment is the important factor. Even so, neural reorganization might still have a part to play. Reorganization and recruitment are far from mutually exclusive.

The first way the two might fit together would be early neural reorganization aiding in the later, SSD-driven recruitment process. If deprived of input, the occipital cortex does not lie fallow; rather, it connects itself to other sensory inputs. Needless to say, this involves a large-scale physiological reorganization wherein “visual” areas become much more closely coupled with auditory and tactile regions (Bubic, Striem-

Amit, & Amedi, 2009). This is present in the many neuroimaging studies we have reviewed, as well as anatomical studies conducted with animals (see Chebat et al.,

2007). Now, suppose a blind subject is outfitted with the vOICe. We know that the blind have increased auditory localization skills (e.g. Röder et al. 1999), and that degree of skill has been correlated with activity in the occipital cortex (Gougoux et al.,

2005). Based on the animal studies, we may infer that these increased capacities will have both a functional and an anatomical side, the latter present via neural sprouting and growth. When input rich with spatial information comes through the ears, it will follow these connections to its natural place in the occipital cortex. The end result is that neural reorganization and growth have provided the mechanisms of later, functionally defined plasticity with more connections to work with. A similar account might be given of TVSS and blind subjects’ tactile localization abilities (van Boven et al., 2000). Early sensory deprivation will have allotted occipital space to touch, so when visual input is routed through the skin it finds its natural home.

The second way would be by slow, long-term reorganization following recruitment. The two are not exclusive, though. The quick process may well open the door for the more long-term one (Amedi et al., 2005; Pascual-Leone et al., 2005). This means that, as subjects become more and more accustomed to the device functionally, they should show greater receptivity to neural reorganization. In practice, this is a difficult difference to detect, but the possibility is open (see Lee et al., 2005). Despite encouraging results regarding later-life brain plasticity, there is little doubt that reorganization would be most relevant for very young subjects. Precious few studies have explored the effects of early SSD exposure on blind infants, but a handful of experiments have been conducted. The most relevant is a 1983 study by Aitken and Bower, which found that exposure was significantly more effective for infants than for young blind subjects, presumably as a result of the former’s greater capacity for neural development. More recently, there have been some steps made to address the literature gap, but these remain tentative. One pilot study involving infant TVSS has been conducted by Segond, Weiss, and Sampaio (2007), and another team has just now been assembled at Vanderbilt under the direction of Amy Needham. We expect more information to be forthcoming, but until the relevant studies have been conducted we regard this particular combination as an interesting possibility.

4.2 Where the Senses Trade

As should be clear from the preceding discussion, the second desideratum is more than satisfied. There is a long line of evidence for crossmodal modulation of

“visual” regions following blindness. Moreover, the plasticity approach makes some solid predictions. The classic experiment in crossmodal modulation is Sadato et al.’s

(1996) study on Braille reading. Blind subjects display significant occipital activity when reading, suggesting a kind of recruitment process following sensory loss. The case for occipital involvement was further strengthened by Hamilton, Keenan, Catala, and Pascual-Leone (2000) and Pascual-Leone et al. (2005), who describe interference from occipital lesions and TMS, respectively. More impressively still, a dual TMS-fMRI study conducted by Wittenberg et al. (2004) found that blind, and only blind, subjects showed V1 responsiveness when TMS was applied to S1. Other studies have found similar results for auditory tasks as well, including word memory (Amedi et al., 2003), enhanced speech processing (Röder et al., 2002), and related auditory effects (Alho et al., 1993). Each, we should note, is distinctive of the blind.

What is most noteworthy about the crossmodal approach, however, are the results that tie directly to sensory substitution. Proponents of crossmodal plasticity can point toward several impressive results with confidence. For one, Ptito and Kupers

(2005) were able to link the involvement of the parietal cortex to independently observed cases of early blind plasticity. As they noted, single unit recordings by

Hyvarinen et al. (1981) have shown increases in parietal and parieto-occipital responsiveness to object manipulation for blind but not sighted monkeys. A connectivity analysis of these monkeys evidenced a kind of relay from the anterior parietal cortex to the PPC and eventually occipital regions. These results, Ptito and

Kupers observe, are generally supportive of the view that the parietal cortex acts as a route by which tactile inputs reach backward. As we have already seen, imaging studies of sensory substitution are quite consistent with this explanation.

Further support comes from the fact that crossmodal activation can explain why occipital lobe activation is more reliably observed with blind SSD subjects. Arno et al.’s (2001) PET study found occipital activation in blind, but not sighted, subjects.

A later study conducted by Ptito, Mosegaard, Gjedde, and Kupers (2005) replicated the result. PET is a comparatively rough approximation of brain activity, though, so it may have been that sighted subjects experienced meaningful activity just below the threshold necessary for PET detection (Poirier et al., 2007). Supporting the finding of these experiments is a TMS study by Collignon et al. (2007), which evidenced interference for blind users, but no such difficulties for sighted subjects. Many imaging studies have shown occipital activation for sighted subjects (Renier et al., 2005; Amedi et al., 2007; Poirier et al., 2007; Kim & Zatorre, 2011), but the results reviewed above are still noteworthy. At the very least, occipital activation is more consistently observed for blind subjects: they have a head start, so to speak. The simplest explanation is that this difference has to do with crossmodal plasticity. Blind subjects have a head start because auditory and/or tactile inputs already lead to activation of the occipital cortex. Sensory substitution may involve the other routes, but the fundamental feature is the influence, one way or another, of certain inputs on neural processing: plasticity is unavoidable. Given that this is the case, a gain for plasticity is mutatis mutandis a gain for sensory substitution.

4.3 Acquired Synaesthesia?

One final source of support comes from reports consistent with Ward and

Wright’s (in press) synaesthesia account of sensory substitution. Since such an account is a subclass of crossmodal accounts, any evidence in its favor counts a fortiori for the crossmodal account generally. The account comes with a couple of novel and empirically supported predictions. These are the persistence of the inducing modality and the non-specificity of triggers, respectively. The first predicts that, rather than completely losing the “substituted” modality, subjects should feel both. This comes from the long-observed tendency of synaesthetes to experience two, sometimes conflicting, experiences in response to a single input. A sound may accompany a color without replacing it. This prediction runs counter to the enactivist assertion that subjects “no longer feel the tickling stimulation on the skin, but ‘see’ objects in front of them” (O’Regan & Noë, 2001, p. 87). This has been the dominant view for some time, owing to certain reports given by Bach-y-Rita’s (1972) subjects. As Ward and

Wright note, however, such reports cannot be taken at face value. Tactile sensation may well be present but largely unattended. Ward and Wright note that dual sensation accounts have been advanced in the past, including proposals by Humphrey and

Humphrey (2006), Auvray and Myin (2009), and even Bach-y-Rita (Bach-y-Rita &

Kercel, 2003), who noted that “even during task performance…the subject can perceive purely tactile sensations when he is asked to concentrate on those sensations”

(p. 543). The idea is not especially new, then. We have found similarly supportive evidence from the “what blind users say” section of Meijer’s vOICe-dedicated website.17 One user, PF, notes that:

The soundscape information is placed forward from my left temple across my eyes to my right temple. They are two distinct separate areas of consciousness. this may seem strange. for sound to generate two different types of input. I can not explain it. I just am aware it is true. (November 5, 2001, emphasis ours)

Later on the page, PF implicitly described her experiences as dual, claiming that she

“could distinctly see/hear the edge of the bed.” Another user, early blind MF, also uses hearing and seeing quite freely:

The first thing I noticed was that there are nots [sic] of twinkly noises that I hear and can't identify. The second thing I noticed is that cars sound in the voice view just like they do when they are going by, only the sound always goes up. A soft swooshy sound. The third thing I noticed was that there was a

17 www.seeingwithsound.com

large object on my right. I must have looked strange constantly turning to touch what I was seeing. (March 18, 2004)

The fact that users often report seeing rather than seeing/hearing or seeing/feeling can be explained by appealing to attention. As Ward and Wright briefly point out,

“expertise reduces the need to attend to the inducing modality and thereby reduces awareness of it” (in press, p. 8). The point is given little argument, but it too gains support from user testimonials. PF described the shift: “the soundscape sounds over time are relegated to the subconscious ‘background’ noise” (2003). The comparison with background noise is instructive because such sounds are still a kind of phenomenal presence (even if “subconscious”). Like the humming refrigerator or irregular rainfall outside, they can easily be overlooked. Were the stimulus suddenly removed, however, we would take note. Given this explanation, the advantage seems to rest with the crossmodal account over the enactive one: explaining the absence of an experience is, after all, easier than explaining away its presence.

The second point is that subjects’ experiences are not strictly tied to the device, what we call above the non-specificity of triggers. A sound or feeling similar to what is provided by the device ought to trigger its corresponding “visual” experience, since synaesthesia, acquired or not, is not known to be training specific.

There is some evidence for this too. Ward and Meijer (2010) cite one interview with

PF where she was asked the question directly:

JW: If you are not wearing The vOICe and you hear certain sounds could those also trigger vision? I am thinking of artificial noises such as a SHHH or a truck reversing or certain other sounds that aren’t related to The vOICe?


PF: Yes, it does. Absolutely. Because my mind automatically records it as a visual sound. It has to be in a certain vOICe frequency. I understand that now. But you can’t use a high car horn and it become a vision of a car. But if I hear a car horn, I see it in my mind through the ‘vOICe sight’. I don’t think of it like I use to ‘see sight’. The vOICe sight, I call it The vOICe sight.

JW: But you would have The vOICe sight without using The vOICe for certain sounds?

PF: Yes. (p.498)

This interview also helps explain apparently contrary reports that stimulation to the

TVSS site does not cause hallucination (Bach-y-Rita & Kercel, 2003): the stimulus was just too imprecise. Ward and Wright (in press) note that the adequacy of this explanation is ultimately an empirical matter, but the interview is highly suggestive.

Our worries about the synaesthesia analogy act as a microcosm of our concerns about plasticity-based accounts as a whole. Fundamentally, the issue is that we do not have an explanation of the phenomenon at the psychological level. We are labeling a phenomenon as opposed to outlining the psychological mechanisms that make it possible. There seem to be too many questions left unaddressed when only plasticity or acquired synaesthesia is invoked. Synaesthesia, like sensory substitution, is realized by a multiplicity of means. There are congenital (from birth) synaesthetes who are thought to retain sensory connections as the result of a neural pruning abnormality (Ramachandran, 2001), as well as hallucinogen-induced experiences, blindness-triggered connections, and a number of other cognitive abnormalities (it sometimes accompanies autism; see Bogdashina, 2001). Limiting ourselves to the blindness case, there is a plethora of possible underpinnings. Inputs may be rerouted to new regions after sensory loss, once-inhibited connections to memory may be unmasked, or “back projections” from one cortex to another may be strengthened (Armel & Ramachandran, 1999). Even understanding the neural processes, there are many questions left outstanding. Why is the synaesthetic experience coupled with training and memory in such a systematic way? No subject seems to have been trained for this. Why do the crossmodal connections emerge so quickly and cleanly? Hypnosis and brain damage can trigger synaesthesia (Kadosh et al., 2009), but nothing so obvious is present here.

Most importantly, what reason is there for a connection to emerge between these two if distal attribution has yet to emerge? Connections between cortical regions need to be open for information to travel, but there needs to be reason for specific connections to become open, which seems to presume antecedent synchronization of the areas to be connected. We have tried to point out possibilities stemming from innate cortical specialization and plastic changes prior to SSD use, but these have real limitations.

Simply appealing to these structures will not tell us why SSD-based inputs are marked as significant to begin with or why the input is interpreted visually. The emergence of these functional connections is a feature to be explained, not the thing doing the explaining. Although plasticity is doubtlessly a piece of the puzzle, more needs to be said.

5. The Multimodal Mind

What we are here calling the multimodal mind actually comprises a number of distinct, though interconnected, effects. These include inputs from one modality modulating, triggering, and/or integrating with another. Many effects distinguished in the empirical literature as crossmodal, supramodal, or metamodal will fall under this broad heading. In the following, we will be focusing on a couple of key aspects. The first is evidence that the brain operates in a task-based, rather than strictly modality-based, way. If the brain were really an equal opportunity processor (see Renier et al.,

2012), sensory substitution would be unsurprising or perhaps even expected.

Additionally, the crossmodal plasticity of the last section might be regarded as a strengthening of connections already posited by a multimodal learning account.

Second is the fact that such an account has at least two responses to sensory substitution’s “how” question. The first is that of multimodal integration (or some antecedent correlation-based perceptual binding process), whereby a statistical inference is made by the perceptual systems on the basis of converging input from several modalities. Subjects will have the novel, proximal sensation accompanying

SSD use alongside one or more familiar “teacher” senses. The already familiar modality or modalities will attach themselves like a pair of training wheels to the new input and its various untapped cues. The second proposal, and the one preferred by

Proulx et al. (in press), is that perceptual processing exists within a hierarchy and that abilities acquired in training with one modality can transfer to another by way of a higher level multisensory intermediary.

5.1 Where it All Comes Together

First off is the “where” of multimodality. We noted above that mounting evidence suggests a multimodal, task-based brain. Information travels freely between modalities and often finds itself being processed by regions once considered modality-specific. This is the general picture this proposal means to support, and as luck would have it, each of the areas highlighted by SSD neuroimaging counts as multimodal.

That is to say, each region responds to a variety of sensory inputs18 and seems to play some role in integration (see Beauchamp et al., 2005). They are also the seat of some very suggestive crossmodal phenomena. Each, for example, appears to be involved in priming across modalities. Input from one modality is known to impact the processing of other, related modalities (Butter, Buchtel, & Santucci, 1989; Driver & Spence, 1998;

Pavani, Spence, & Driver, 2000), and this, in turn, has been linked to these multimodal regions. James et al.’s (2002) haptic-visual cross-priming experiment is a good example. Subjects were briefly exposed to novel objects via sight or haptic touch and later shown/allowed to touch objects one at a time (some explored, some novel).

Neuroimaging evidenced significant activity in the LOtv for those that had already been explored, evidencing both haptic-to-visual and visual-to-haptic priming. In fact, the level of activation present was comparable to those of within-modality priming experiments (documented by Easton et al., 1997, and Reales & Ballesteros, 1999).

This result, we should note, is consistent with the crossmodal activation reported by

PF and with studies showing spontaneous mental imagery accompanying haptic exploration (discussed later). Priming has also been correlated with occipito-parietal junctions (Carlesimo et al., 2004) and observed within each of the various sensory cortices individually (Eimer & Driver, 2001). Each of these experiments was conducted with sighted subjects, implying that the neural reorganization suggested by plasticity-based accounts may not be necessary.

18 In fact, in Kim and Zatorre’s (2011) sensory substitution study, the LOC was found to be active from the very beginning. Even prior to training, the LOC was responsive to the shape information present in the SSD’s input. Other, similar studies have found the region responding to two-dimensional tactile input as well.

A case of particular interest to proponents of multimodal integration is Kim and Zatorre’s (2011) study, which subjected participants to fMRI testing before and after multimodal training with an audiotactile SSD. The device coded haptic features in a manner similar to the vOICe (Meijer, 1992), and training involved blindfolded subjects feeling and hearing simultaneous input. Analysis of connectivity19 before and after training showed significant increases between the LOC and auditory areas. When paired with post-training behavioral improvements, this suggests an increase in communication efficiency and temporal coupling. Training seems to allow the LOC and auditory regions to “become part of the same network and work together” (Kim &

Zatorre, 2011, p. 7855). In addition to these increases in communication, they demonstrated a robust carry-over effect for untrained modalities. The same connections used in auditory haptic training were present when subjects performed a subsequent auditory visual task (picking out the visual shape they heard). Rather than requiring its own training sessions, visual identification came almost immediately.

This, Kim and Zatorre have suggested, implicates a common shape representation in the LOC. Training establishes a close relationship between the input’s source modality

(audition in this case) and relevant representational resources like shape processing.

When the job at hand calls on more specialized processing, the resources become available there, with the central representation acting as middleman. In this experiment, “the required non-visual shape task involved fine spatial analysis, which the visual cortex is best capable of, and consequently, the auditory input requiring such processing gained access to this region” (p. 7854). Importantly, the subjects did not have control over auditory input. Instead, a stock recording was played 10 times while they explored the object by hand. The experiment, then, speaks against the enactivist approach as much as it speaks for multimodal approaches. SSD input was most certainly not under the subjects’ control, yet they came to comprehend the soundscapes they received. While it is not distal attribution, the result suggests a meaningful role for multimodal training.

19 The analysis they call functional connectivity analysis is also called correlation analysis. It is a statistical method used to identify significant covariation of activity between brain regions.
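Since footnote 19 describes connectivity analysis only in passing, a minimal sketch may help. The region names and time series below are synthetic placeholders of ours, not data or code from Kim and Zatorre (2011); the point is simply that "connectivity" here means correlated activity over time.

```python
import numpy as np

# Minimal sketch of a correlation-based "functional connectivity" analysis of
# the sort described in footnote 19: covariation between two regions' activity
# time series, compared before and after training.  The series are synthetic
# placeholders, not measurements from any study.


def connectivity(region_a, region_b):
    """Pearson correlation between two regions' activity time series."""
    return np.corrcoef(region_a, region_b)[0, 1]


rng = np.random.default_rng(1)
t = np.arange(200)
shared_signal = np.sin(t / 10.0)

# Before training: the two regions fluctuate more or less independently.
loc_pre, auditory_pre = rng.normal(size=200), rng.normal(size=200)

# After training: both track a common signal plus noise (temporal coupling).
loc_post = shared_signal + 0.5 * rng.normal(size=200)
auditory_post = shared_signal + 0.5 * rng.normal(size=200)

print(f"pre-training  r = {connectivity(loc_pre, auditory_pre):+.2f}")
print(f"post-training r = {connectivity(loc_post, auditory_post):+.2f}")
```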

5.2 Integration at Work

Having discussed the “where,” we now proceed to the “how.” First is our own integrative approach, which holds that subjects may learn to decode SSD-based input via multimodal binding or something very near to it. We live in a unified world. One can pick up a book, look over its colorful jacket, hear its pages turn, feel the cloth spine and the rough pages, all without the slightest doubt that these sensations all stem from one solitary object. If one draws distinctions between the senses, however, there will be nothing that necessitates this unified experience. The sights and sounds of a bouncing ball could just as easily stem from wholly unrelated domains as they do from a single event. We cannot take for granted the unity of experience; rather, we must look to understand how the brain ties everything back together. This is the function of multimodal integration, and it is a significant route through which relevant cues can be sorted from irrelevant tactile or auditory sensation.20

20 A similar problem exists for binding the informational sources intramodally (e.g., fitting together color and shape).

Decades of research have found what might be called rules of integration. Two are of particular interest to us: congruency and precision. First, the inputs to be integrated must be congruous along spatial or temporal dimensions. The particulars of this may depend on the cognitive task (some have wiggle room; Stein & Stanford,

2008), but synchrony and spatial alignment (the up, down, left, and right) constrain

“virtually all regions” of integration (Stein & Stanford, 2008, p. 260). The reason for this is fairly obvious: the events themselves are spatially and temporally unified and the difference in speed between modalities is relatively small. Differences in the speed of light and the speed of sound may matter for thunder and lightning, but when something is five feet away, they are practically simultaneous. In a natural setting, objects of interest like food, mates, and predators will have fairly tightly coupled sensory inputs. This is not to say, however, that two inputs must be perfectly aligned or congruous along both dimensions. As we shall see, this is far from the case.

Generally, the process of integration is seamless, but on occasion, we can catch a glimpse behind the curtain. These cases will often provide a good illustration of the first rule. A familiar example will be the ventriloquist’s dummy. The ventriloquist’s dialogue is temporally coupled with the dummy’s mouth movements better than anything else, making it seem as though the small figure is talking. A more surprising case is the so-called rubber hand illusion (RHI). Subjects who simultaneously attend to tactile input from their hand and visual input of a rubber hand getting the same treatment refer their sensations to the dummy hand (Botvinick & Cohen, 1998). If the dummy hand is unexpectedly struck with a hammer, subjects will experience a jolt in galvanic skin response as compared to controls (skin response is an indicator of fear; Armel & Ramachandran, 2003). The same illusion can be brought on as a result of active movement (Drummer, Picot-Annand, Neal, & Moore, 2009) and over the course of long-term exposure. It can even, so certain experiments suggest, happen to the entire body (Stratton, 1899). George Stratton, a pioneering turn-of-the-century psychologist, outfitted himself with a mirror-based contraption that allowed him to see a bird’s-eye view of his body. Over the course of several days, he recorded his changes in perception. By the end, Stratton had experienced major shifts in body position and tactile sensation: “In walking,” he remarks, “I felt as though I were moving along above the shoulders of the figure below me, although this too was part of myself,—as if I were both Sinbad and the Old Man of the Sea” (p. 496). These cases all seem quite odd, but for the perceptual systems, they are far and away the most reasonable attributions to make. We evolved around actual, multiply specified events, not contrived experimental conditions. Hence, when given tightly coupled sensory inputs, binding inevitably occurs. Weathering a few rare misfires is much better than acting conservatively on converging information (e.g., missing the opportunity to localize a predator, mate, or food source).

These cases also provide an illustration of the second rule of integration, that the outputs of different modalities will be integrated in an optimally precise and reliable way. Each case we just mentioned has vision guiding another modality (audition, touch, and body schema, respectively). What makes it the case that vision takes the lead, rather than touch or audition, is the fact that it is so much more precise than the others. Simply put, the one that seems least shaky will dominate. Studies on the so-called ventriloquist effect have been especially helpful in establishing this result.

Vision tended to be considered dominant over the other modalities (Posner, Nissen, &

Klein, 1976), but by tweaking the relative variability of auditory and visual cues researchers found that, in certain circumstances, the situation can be reversed. Alais and Burr (2004) tested subjects using “blobs” of light and “clicks” of sound at different levels of variance. When the two were put in conflict, the visual stimulus placed +Δ° away from some central point and the auditory stimulus placed -Δ° from it, they found that estimates of the object’s location tended to track whichever had the least variance. If, for example, a blurry, nebulous visual input is integrated with a clear auditory input, the latter will tend to dominate. A number of frameworks have been proposed that do not share Alais and Burr’s inverse variance formula,21 but these all share the basic commitment that the more reliable (cashed out in various ways) modality takes the lead, and this is all we really need.

21 One example would be accounts based on Bayesian probability (Knill & Pouget, 2004), which sees integration as an exercise in uncertainty reduction.
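The reliability rule lends itself to a short worked example. The sketch below uses the standard inverse-variance weighting with which the Alais and Burr (2004) results are usually modeled; the numbers and function name are ours, chosen only to show why the less variable cue dominates.

```python
# Sketch of precision-weighted cue combination with made-up numbers: each
# modality's location estimate is weighted by the inverse of its variance,
# so the less "shaky" cue dominates the fused percept.


def integrate(cues):
    """Combine (location, variance) pairs into one precision-weighted estimate."""
    weights = [1.0 / variance for _, variance in cues]
    location = sum(w * loc for w, (loc, _) in zip(weights, cues)) / sum(weights)
    variance = 1.0 / sum(weights)  # the fused estimate is more precise than either cue alone
    return location, variance


# A blurry visual "blob" at +5 degrees against a crisp auditory "click" at -5 degrees:
visual = (+5.0, 16.0)    # high variance: unreliable
auditory = (-5.0, 1.0)   # low variance: reliable

loc, var = integrate([visual, auditory])
print(f"fused location = {loc:+.2f} deg, variance = {var:.2f}")
# -> roughly -4.4 deg: the percept tracks the more reliable auditory cue.
```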

To see how these two principles might apply to sensory substitution, suppose there is a beeping, flashing object moving around the subject’s would-be field of vision. When it comes closer, it becomes louder and the tactile matrix presents it as occupying a larger space on the back. When it moves left or right, she can hear and feel it, and likewise for up and down. The concurrence of events is quite obvious in this case: the two senses will report in tandem whenever and however the object moves. The perceptual systems cannot help but notice that the two inputs are connected.

Integration effects have been observed for inputs as simple as beeps and taps

(Bresciani et al., 2005), so we are not worried about the two inputs being joined. If audition presents a more reliable sense of space than touch, it will take the lead when the two are joined together. Since audition is prototypically a means for perceiving distal objects, we will have distal attribution.

It is easy to see, too, how input from other modalities might spark further attentional training in sensory substitution. Learning to attend to apparent size as an indicator of proximity, for example, might open the door for cues that have much weaker ties to audition, such as linear perspective and occlusion. Feeling a larger object “pass through” a smaller one might be recognized as one being “in front of” another. Feeling and hearing many objects recede to a small point in the distance could trigger the recognition that two lines converging on that point could represent a triangle or parallel lines receding.
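As a toy illustration of why apparent size can stand in for proximity, the sketch below assumes a simple pinhole projection; the object, focal length, and distances are hypothetical.

```python
# Toy illustration of apparent size as a proximity cue under a pinhole
# projection.  The object, focal length, and distances are hypothetical.


def apparent_size_px(true_size_m, distance_m, focal_length_px=500.0):
    """Projected size, in pixels, of an object at a given distance."""
    return focal_length_px * true_size_m / distance_m


for distance in (1.0, 2.0, 4.0):
    print(f"0.10 m cup at {distance} m -> {apparent_size_px(0.10, distance):.0f} px tall")

# Halving the distance doubles the projected size, so a swelling region of the
# tactile grid (or of the soundscape) is a usable signal of approach.
```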

Our only real concern might be that auditory input is too imprecise to guide

SSD training. One response might be to note increases in auditory localization capacities for blind subjects, a point of overlap with crossmodal plasticity, but similar results hold for tactile discrimination as well. If need be, then, the experiment could add in haptic exploration. Skepticism is understandable for the case presented above, but with more and more detail, successful sensory substitution becomes more plausible. Some object, perhaps a teacup or stuffed animal, would be placed in front of the subject, and she would be instructed to manually explore it (the camera would be rigged above the subject’s head so as not to get in the way of her exploration).

Binding is a plausible outcome for a number of reasons. First, studies with infants have shown intermodal transfer as early as 29 days (Meltzoff & Borton, 1979), suggesting that some kind of linking can happen with minimal, perhaps even negligible experience with multiply specified events. Second, similar strategies have been used successfully in sensory substitution. Kim and Zatorre (2011) found that pairing auditory input from the vOICe with haptic exploration allowed for SSD-based shape recognition as well as visual recognition. Another report comes from long-time vOICe user PF, who reported arriving at such a strategy independently:

I decided to take and place known objects on my scanner and then maybe I could tell what I was working with. One object that I used was a plastic drinking glass. I would place the glass on the scanner, take the scan and then whilst holding it in my hand, listening to the soundscapes, try to relate that which I was hearing to that which I was touching. This gave me a general understanding of how soundscapes worked and how they related visually. (Presentation by PF, April 2002 as cited in Ward & Meijer, 2009)

This passage actually describes one of her first experiences with sensory substitution.

At the time, PF did not have a camera. The scanner mentioned in the passage is a simple flatbed scanner. Nevertheless, the combination of the vOICe’s soundscapes with haptic input seems to have been sufficient to gain a sense of visual shape.22 These cases do not involve distal attribution since they are arguably tied to haptic touch. They do, however, show major gains for shape perception, making similar gains for spatial information more plausible.

22 Sensorimotor contingencies make no real contribution in this case, since we may safely say she had no real control over input orientation. Enactivists may attempt to mount an objection based on the fact that haptic exploration is/was active. There is nothing to prevent us from making haptic input passive, though. The subject would be instructed to keep her arm and hand slack, and an experimenter would guide it toward the distal object and bring it to rest on top of it. We do not think this adjustment would change the outcome, but the matter is ultimately an empirical one.
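For readers unfamiliar with what a soundscape is, the sketch below gives a rough, vOICe-style image-to-sound mapping (cf. Meijer, 1992): the image is scanned column by column from left to right, a pixel's row sets the pitch of a sinusoid, and its brightness sets that sinusoid's loudness. The scan time, frequency range, and sample rate are illustrative choices of ours, not the device's actual parameters.

```python
import numpy as np

# Rough sketch of a vOICe-style image-to-sound mapping (cf. Meijer, 1992).
# The image is scanned left to right; each pixel's row sets the pitch of a
# sinusoid and its brightness sets that sinusoid's amplitude.  Scan time,
# frequency range, and sample rate are illustrative, not the device's values.


def soundscape(image, scan_seconds=1.0, sample_rate=11025, f_low=500.0, f_high=5000.0):
    rows, cols = image.shape
    samples_per_column = int(scan_seconds * sample_rate / cols)
    t = np.arange(samples_per_column) / sample_rate
    freqs = np.linspace(f_high, f_low, rows)  # top row -> highest pitch
    audio = []
    for c in range(cols):
        column = image[:, c]  # brightness values in [0, 1]
        tone = sum(b * np.sin(2 * np.pi * f * t) for b, f in zip(column, freqs))
        audio.append(tone / rows)
    return np.concatenate(audio)


# A bright diagonal line is heard as a tone sweeping upward as the scan proceeds.
img = np.zeros((16, 16))
for i in range(16):
    img[15 - i, i] = 1.0
print(soundscape(img).shape)  # about one second of audio samples
```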

The two extra modalities have the potential for some interesting combinations.

By touching the object in front of them, subjects will have solid reason for supposing that the object is, well, in front of them. If one closes one’s eyes and grabs a box, there is little doubt that something is “out there.” One could add to this auditory information by having subjects tap or flick the object. If this constitutes too much of a sensorimotor contingency, the experimenter could guide the tap, taking the subject’s finger or hand and running it against the object. Another way of doing this would be to make the object a squeaky toy. Once the two are combined, there will be little doubt that something is “out there,” and when it comes time to tie together the various task-based representations (object, space, etc.), the SSD-based input will, by virtue of its being bound to the doubly defined object, be referred outward.

If integration were successful, one would also expect skills to carry over to the unisensory domain. Seitz et al.’s (2006) and Kim et al.’s (2008) work is certainly supportive of this. Subjects trained with audiovisual input were better and faster than visually trained subjects in a purely visual task. Similar results have been found in studies focusing on auditory perception too. Von Kriegstein and Giraud (2006) found, for example, that by pairing a face with a voice, they could improve later performance in a voice discrimination task. Effects do not seem to be limited to these naturalistic pairings, either. In yet another study, Wozny, Seitz, and Shams (2008) found that multimodal pairings could help even in pure associative learning. The arbitrary pairing of a sound frequency with a visual stimulus was found to increase performance in a purely visual task involving that stimulus. A number of competing hypotheses were tested, including response biases, shared contexts, and others, but the multimodal finding held robustly (for a review of these issues and studies, see Shams, Wozny,

Kim, & Seitz, 2011). Finally, although it was not a unisensory task, Kim and Zatorre’s

(2011) study is relevant. Importantly, it shows us that impacts can spread to multimodal tasks involving only one of the trained inputs: not only does multimodal training aid in subsequent unimodal processing, it can help in novel multimodal pairings as well.

Beyond these specifics, there is good evidence that multimodal training actually occurs when we are first learning to perceive. The parallel between SSD learners and the ever-learning infant has been drawn since the inception of the device

(White, 1970). Nowhere is it more relevant than in the case of multimodal learning, however. Bahrick and her colleagues (Bahrick & Lickliter, 2000; Bahrick et al., 2002) have advanced a theory of infant perceptual learning called the intersensory redundancy hypothesis, the supporting evidence for which we have already seen.

Similar accounts have made their way to adult perceptual learning as well (Shams &

Seitz, 2008; Kim, Seitz, & Shams, 2008). Evolutionarily, the advantage of such a filter would come when distinguishing real threats from false alarms. The probability of two or more senses mistakenly agreeing is much lower than that of any one of them randomly misreporting something, so it is a simple matter of probability that multimodal information is to be preferred. A speck in one’s eye may look like an insect flying about, but only the latter will make a sound.
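A back-of-the-envelope calculation, with invented false-alarm rates and an independence assumption, makes the arithmetic behind this point concrete:

```python
# Back-of-the-envelope illustration of the probability point above, assuming
# hypothetical, independent false-alarm rates for each modality.

p_visual_false_alarm = 0.05    # a speck in the eye "looks like" an insect
p_auditory_false_alarm = 0.05  # a stray noise "sounds like" one

p_single_misfire = p_visual_false_alarm
p_joint_misfire = p_visual_false_alarm * p_auditory_false_alarm

print(f"one sense misfires alone:     {p_single_misfire:.4f}")   # 0.0500
print(f"both misfire simultaneously:  {p_joint_misfire:.4f}")    # 0.0025
```

Under these invented numbers, a coincidental multimodal false alarm is twenty times rarer than a unimodal one, which is why doubly specified events earn extra weight.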

The success of infants in the multimodal synchrony groups has been attributed to an increase in salience: events that are doubly specified draw the infant’s attention more than their unimodal counterparts. One classic study, Bahrick, Walker, and

Neisser (1981), presented subjects with two overlapping films. Viewed without audio, they appeared as “an amalgamation of ghostly images passing through one another,” but given audio from one, it “stood out from the other event, creating a strong impression of figure and ground” (Bahrick and Lickliter, 2000, p. 191). When shown the videos separately, infants seemed much more interested in the one that had not been given audio before. The unpaired video seemed new, implying that the subjects had not noticed it earlier. Another study, Bahrick, Lickliter, and Flom (2006), highlights this attentional asymmetry even better. Here, unimodal and bimodal events were pitted against each other in a more naturalistic way. Young infants were shown video of a toy hammer striking a surface. Some infants were given the accompanying sound (the bimodal group) and some were not (the unimodal group). Both were then tested on attentiveness to hammer orientation: whether the hammer is striking the ground or the ceiling. That is, the infants were shown video where the hammer changed orientation, and naïve observers were instructed to register whether they lost interest. Infants in the bimodal group tended to habituate and lose interest when those in the unimodal group did not. The long and short of it is that when bimodal and unimodal stimuli are both available, infants attend to the former. At around eight months, they do not have the same problem. There are several possible explanations (general increases in attention might be one), but one probable element is that older infants have more experience with both modalities. Greater familiarity would be accompanied by greater facility, and subjects would gain the ability to move freely between multi- and unimodal cues. One might think of multimodal input as attentional training wheels. They guide the less experienced riders but are taken off when they are no longer needed. We can apply this same way of thinking to sensory substitution.

Having information from many modalities might train new users to attend to relevant cues rather than the proximal, idiosyncratic aspects of the target modality like the itching or tickling sensation of TVSS.

Another point of interest when drawing contrasts with the enactivist account is that multimodal coordination has been used to help explain why self-guided action is so beneficial for developing infants. Interview studies with parents, along with many other sources, seem to suggest that the onset of self-movement marks significant milestones in childhood development. Locomotor infants take much greater notice of their caregiver’s departure and engage in referential gestural communication.

Experimental studies likewise attest to an attention-action link. In an unpublished honors thesis, Freedman (1992) found that locomotor infants, those with either crawling or walker experience, were more likely to focus on far objects as opposed to empty space and to attend to distant objects while manipulating nearby ones. The reason locomotion is such a boon, Campos (2000) suggests, is not because distally directed attention and sensitivity to depth cues are unavailable without it, but because

locomotion “mak[es] the operation of such processes almost inevitable” (p. 210).

Wariness of heights is an excellent case study. Depth is perceptible as early as 5 months (Schwartz, Campos, & Baisel, 1973), but it is not until the onset of self-guided movement that infants develop any apprehensions. Before this milestone, self-movement and attention are largely unrelated. As Campos (2000) has noted, "nothing demands that passively moved infants (in strollers, cars, or parents' arms) direct their attention toward the direction of motion" (p. 175).23 Fairly often, prelocomotor infants become distracted and direct vision toward targets well beyond the direction of movement. This, in turn, results in ambiguities for cues like optical flow as well as conflicting vestibular, visual, and somatosensory information. With the onset of self-guided movement, however, infants actually look where they are going

(Higgins, Campos, & Kermoian, 1996). If distracted, they will naturally stop to examine the new focus rather than continue along in their previous direction. Goal-driven action teaches them to expect typical optic flow information and multimodal congruence, so when confronting a visual cliff, they experience a sharp surprise.

Vestibular and somatosensory information register motion, but vision shows relative stasis. Think, for example, of the slow approach of distant objects when driving.

Attending to these unexpected divergences, Campos and colleagues suspect, helps produce a fear response.

One concern facing integration is the possibility that it requires that both sources be distally attributed in the first place. Rather than acting as a stepping stone

23 This fact is related to our earlier comment that passive TVSS subjects have their attention directed toward irrelevant proximal sources.

to distal attribution, integration would presuppose it. This line of thought is suggested by the fact that the bulk of non-SSD cases seem to be ones where the two relevant inputs are already interpreted distally. When the perceptual systems integrate the sights and sounds of a bouncing ball, for example, both sources of input are already in a kind of distal mode. We have two responses to this. First, we can point toward cases like the rubber hand illusion and ventriloquist effect. These cases of sensory "re-education" might not be considered plausible a priori, but they happen nonetheless. Integration may have surprised us again. Second, even if sensory substitution training is not a case of integration per se, many of our points still hold. The synchronies found between the two inputs might lead to a kind of "proto-binding." Once these are observed, the same uncertainty reduction mechanisms underwriting integration would come into effect. Seeing that two or more inputs are in close correspondence and that all but one reports a distal event, the best move would be to interpret the other in a similar fashion. This need not be a process limited to integration, and it may well occur prior to and as a precondition of it. If the integration interpretation does not work out, we could always switch to the nearly equivalent binding interpretation. This is the reason we sometimes qualified our remarks, presenting our points as consistent with integration or some similar binding process.
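The "uncertainty reduction" just mentioned can be pictured with the reliability-weighted (inverse-variance) scheme commonly used to model cue combination. The sketch below is a generic illustration under our own assumptions, with invented numbers; it is not a model of any particular SSD or experiment.

```python
# Minimal sketch of reliability-weighted cue combination: each estimate is
# weighted by the inverse of its variance, and the fused estimate has a
# variance smaller than either input's. Numbers are invented.

def integrate(cues):
    """cues: list of (estimate, variance) pairs from independent senses."""
    weights = [1.0 / var for _, var in cues]
    total = sum(weights)
    fused = sum(w * est for w, (est, _) in zip(weights, cues)) / total
    return fused, 1.0 / total

tactile = (2.3, 0.50)   # noisy distance estimate (metres, variance)
auditory = (2.0, 0.10)  # sharper estimate from the other input

estimate, variance = integrate([tactile, auditory])
print(estimate, variance)  # ~2.05, ~0.083: less uncertain than either cue alone
```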

5.3 A Second Multimodal Proposal

Independent, though not exclusive, of our own account, Proulx, Brown,

Pasqualotto, and Meijer (2012) have presented one based on reverse hierarchy theory, an account of learning proposed by Ahissar and Hochstein (2004). In brief, the theory

states that processing occurs at multiple levels and that “the difficulty and characteristics of a task determine the level of cortical processing at which attentional mechanisms are required” (Proulx et al., 2012, p. 2). Most importantly, reverse hierarchy theory allows learning to go both ways. Training with inputs routed to the lower level can generalize to processing that goes on at the higher level and vice versa.

When paired with the hypothesis that the brain has higher-order multisensory areas, this makes for a clean account (see fig. 8):

Learning can progress from primary sensory areas to higher-level multisensory areas under complex unisensory stimulation. Activity may then cascade back down the hierarchy such that generalization across modalities occurs when these higher-level, multisensory areas are implicated in learning either unisensory or multisensory tasks. (p. 3)

If, as is the case with both blindfolded and blind users, the only source of relevant input is coming from these higher-order areas, one should expect them to engender a degree of crossmodal transfer. Moreover, since learning sensory substitution is doubtlessly a demanding regimen, these higher-order areas are likely to take the lead in perceptual learning/training. Some support for this account comes from Levy-Tzedek et al.'s (2012) study on crossmodal transfer in sensory substitution. Rotating visual feedback, making a cursor seem to be 30 degrees clockwise or counterclockwise from its true location, led to compensation and adjustment in both visual tasks and auditory ones. That is to say, sensorimotor training carried over. Kim and Zatorre's (2011) results also serve to support the generalizability of training.

Haptic-auditory skills, they have suggested, easily transfer to the auditory-visual realm.

The specifics of the proposal are well-outlined in their article, however, so we will not spend more time defending it here.

Figure 8. Multisensory Reverse Hierarchy Theory (Proulx et al., in press). By routing itself through a common multisensory area, training in one modality can generalize to an unused one, like vision in the blind.

A concern we have regarding this proposal mirrors the earlier concern we raised in response to our own integrative model. That is, whether the system presented can accommodate the weight of distal attribution. There is a decent amount of evidence to suggest the existence of higher-order multisensory regions. There is likewise wide support for a reverse hierarchy model of perceptual learning. Joining the two together, then, seems like an excellent proposal and certainly one of interest when confronting sensory substitution. Most of the evidence so far, however, has involved training transferring between already synced modalities. That is to say, both were

antecedently calibrated for distal attribution. When it comes to learning to “see” through the skin or ears, however, one has difficulty understanding what kind of training would be necessary and how the modalities would come to work together.

The basic story Proulx et al. (2012) tell is that blindness brings it about: “if presented with a task that would require the spatial abilities normally attributed to vision, such as shape perception (Amedi et al., 2007), then auditory stimulation (green) can activate and induce perceptual learning in these ‘visual’ areas” (p. 6). One wonders how the task is tagged as visual, however. Just as likely seems the result that information coming from audition is filtered out or stopped well before distal attribution. Subjects might progress to shape recognition, for example, without going all the way to distal attribution. In this scenario, shape information would make its way to the multisensory regions, but other important variations in input could be lost, classified as tactile. Our own account has tried to confront this hurdle by relying on low-level synchronies to make important aspects of tactile input salient. Integration or pre-binding could guarantee access to higher-order multisensory areas. The trouble with this is that it only works with more than one sensory input. What needs to be provided is a mechanism capable of accomplishing the same thing unilaterally (i.e. with only one modality in use). We have seen a partial response in plasticity, but more needs to be said. We shall see another in mental imagery.

Such concerns aside, we think that the multimodal integration and reverse hierarchy proposals actually fit together quite well. The first point of overlap is the ever-recurring higher-order multisensory zones. The most plausible regions for the multisensory hub of Proulx et al.'s (2012) model also happen to be sites of multimodal integration (though perhaps not the only ones; see Macaluso & Driver, 2005; some kind of earlier binding would much facilitate the integration of the two proposals). Their names are familiar and include both the LOC and the PPC. The former is brought up by Proulx et al. (2012) for its responsiveness to haptic input

(Pietrini et al., 2004) as well as its presence in Kim and Zatorre’s (2011) SSD study.

The latter, while not mentioned explicitly, has similar supports that the reader will be familiar with by now. The second point where the two dovetail best concerns the already noted benefits of multimodal over unimodal input in learning. Whether binding occurs at or before these multimodal sites, its effects will clearly carry over to other modalities. For all the reasons we have outlined in the preceding section, multimodally congruent information is generally understood by the perceptual systems as more trustworthy. Any training passed down to the untrained modality, vision in our case, would therefore be expected to gain in force as it gained in sources. Such is our suspicion, anyway. A simple test of this would involve using an already established paradigm for obtaining crossmodal training between two modalities and simply adding another to the mix. If there is a significant increase associated with the addition of the third modality, our suspicions would be corroborated.

6. Mental Imagery

The final proposal that we will be considering is that sensory substitution comes about through the training and use of mental imagery. Subjects use their capacities to imagine the object in front of them as an initial substitute for more direct

means of perception. As imagery becomes more and more closely coupled with tactile or auditory input, it will cease to be dependent on conscious effort. SSD use would be similar to the habit of visualizing what one reads, only much more informationally loaded and less semantically bound (we shall see this analogy explored in greater detail later). Once this has happened, the distinction between imagery and straightforward crossmodal activation will break down, leaving a stable and properly perceptual capacity in its stead. Neurofunctionally, this will allow information to be channeled to "visual" areas, and as tasks demand, this information will be made available to still further processing regions, each new capacity building on earlier ones.

6.1 The Step-by-Step

The first step will be the initial localization process. Subjects may be blind, but there is little doubt that they have a functioning sense of space (Farah, Hammond,

Levine, & Calvanio, 1988). Indeed, their capacity for imagery is often surprising.

Regarding analogue spatial representations, for example, they perform just as well as sighted individuals (Fleming et al., 2006; for a full review of imagery capacities in the blind see Cattaneo et al., 2008). Using even the simplest SSD, one can see how such a representation could be refined and updated. Say the object is five feet in front of the user and sits at eye level. If the camera starts at this same height and the subject swivels it side to side, she will doubtlessly notice that feedback only comes when the device is pointed straight ahead, that is, forming a 90 degree angle with her body. If she then lowers the camera by, say, a foot and swivels it up and down, she will then notice that feedback only comes when the camera is at a certain angle of elevation, 80.5 degrees in this case. Between these two, the location of the object can be deduced, and an initial image can be formed. With each successive movement, this image will be corroborated and refined. The object location one has come to expect from one swipe will be either duplicated or corrected by the next. The account is somewhat similar to that of the enactivists in that it uses gradual estimation and correction. The main differences lie in the fact that imagery is representational and focuses on external objects rather than variations in input.
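To show how little machinery the two-sweep localization just described requires, here is a toy reconstruction under our own simplifying conventions (the direction has already been fixed as straight ahead, the elevation angle is measured from the horizontal, and the numbers are illustrative rather than taken from the example above).

```python
import math

# Toy reconstruction of localizing an object from a known camera drop and the
# elevation angle at which feedback returns. All values are illustrative.
camera_drop_ft = 1.0   # the camera was lowered this far below eye level
elevation_deg = 11.3   # angle above horizontal at which the sweep gives feedback

# The object sits camera_drop_ft above the lowered camera, so simple
# trigonometry recovers the horizontal distance.
distance_ft = camera_drop_ft / math.tan(math.radians(elevation_deg))
print(f"estimated distance: {distance_ft:.1f} ft")  # ~5.0 ft

# Further sweeps would yield further angle readings; averaging the resulting
# distance estimates is one simple way the initial image could be
# "corroborated and refined".
```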

More complex devices allow for more subtle representations and means of input. If the SSD codes for shape, this will be incorporated into the mental image.

Supposing they have not already played a role in fixing the object's location, input from other senses will be incorporated. The most important step, however, lies in making this an automatic process. Consciously picturing something based on a sensory input is one thing; having that input trigger and shape the picture involuntarily is another. This possibility is suggested by two lines of evidence. The first involves the priming of one modality based on attention paid to another. The point here would be to show that top-down processes can unconsciously modulate multiple modalities at once. The second adds two cases of crossmodal mental imagery, supporting the suggestions of the first line and adding to it the endogenous triggering of actual phenomenal events (rather than the mere modulation of inputs). Both are supportive of either a "metamodal" executive mechanism, working with each modality, or a multimodal one, having distinctions between them but allowing for significant crosstalk. The existence of such a mechanism may not be necessary for an imagery-based account (connections might be fostered over the course of learning), but it makes things a good bit easier.

The first line of evidence comes from priming studies. Were executive processing modality specific, one would expect cues to impact only the modality they were presented in. Conversely, were either the meta or multimodal hypothesis accurate, the opposite would hold. As it turns out, and as we have already seen, a long line of research suggests the latter. Driver and Spence (1998), for example, performed an experiment in which they directed the subject’s attention with either a visual or auditory cue. Subjects were then given some stimulus (e.g., they might be asked whether a pitch was high or low) from the uncued modality on either the same side as the cue or the opposite side. They found that subjects cued to the same side were quicker and more accurate than their counterparts. In another study, Eimer and

Schröger (1998) had subjects perform a similar task, fixing their gaze on a given point but endogenously shifting their attention to the right or left. As was expected, they observed a spike in cortical activity for the regions of the uncued modality. Such studies have been duplicated in a variety of ways, incorporating shape, touch, and many different protocols (James et al., 2002; Eimer & Driver, 2000; Eimer & Driver,

2001). These cases, though they are not of mental imagery, do show that top-down mechanisms can mediate crossmodal impacts and, more importantly, that these can be automatic. Spatial attention was most certainly exercised executively. Fixing attention on a given location in the way that Eimer and Schröger's subjects did is fairly obviously a conscious effort. The readying of uncued modalities, on the other hand,

was presumably nothing subjects had intended to do. The spatial location of the uncued stimulus was completely task irrelevant and apparently unrelated to the initial cue. Nevertheless, these ostensibly unrelated modalities exercised influence on their neighbors. We need not rely on priming experiments, though. Mental imagery has made more than a few appearances independently and from the bottom-up.

A few particularly noteworthy investigations involve silent lip reading, RHI, and haptic exploration, respectively. In the first, subjects were shown a silent film of lip movements while subjected to neuroimaging. Results indicated significant activity in the “auditory” cortex despite the absence of auditory input (Calvert et al., 1997;

Hertrich, Dietrich, & Ackermann, 2011). This suggests significant crossmodal connections for speech perception and gives some experimental backing to the subjective experience of "almost hearing" muted speakers. The second involved subjects seeing and hearing a rubber head being stroked on the ear (the sound recorder was located within the rubber head; Kitagawa & Igarashi, 2005). Many subjects in this condition reported the faint sensation of being touched and still more reported a tickling sensation. What is most important about these studies is that they show an automatic crossmodal coupling of mental imagery. Subjects were not simply primed; rather, sensations were triggered from the inside. The involuntary character that imagery can take on is especially clear here. Although subjects in the lip-reading experiments may have made some effort to interpret the speaker's words, that the same holds for Kitagawa and Igarashi's subjects is highly doubtful. Nor, we wager, would either of the two be subject to conscious control. That is to say, one cannot simply

push the sensations away. A last group of experiments involves the spontaneous generation of mental imagery from haptic exploration (Sathian & Zangaladze, 2001;

Zhang, Weisser, Stilla, Prather, & Sathian, 2004). Blindfolded subjects regularly report visualizing objects explored with the hands. When subjected to neuroimaging,

Zhang et al. (2004) found that subjects’ reports of image vividness correlated with haptic selective activity within the LOC. Were such effects writ large, were they taken advantage of in a broad and systematic fashion, one would expect something on the order of sensory substitution. In fact, subject PF reports something eerily similar. In the same Ward and Meijer (2010) interview cited in our crossmodal plasticity section, she reports haptic triggering of sensory substitution (even though she uses an audiovisual SSD!):

JW: If you were to touch things you wouldn't get any visual experiences through that?
PF: Yes, I can! If I pick up a pencil, I feel pencil, I see pencil.
JW: Even if you touch it? If you touch an object with your hands you have an experience of seeing it?
PF: Yes! Touch is vision.
JW: If you weren't using The vOICe and you were to touch an object you would see that object?
PF: I'd see an image in my head, yes.
JW: You would feel this as occurring automatically rather than something you are deliberately creating?
PF: Yes, absolutely.

This report is rather puzzling for the enactive approach, since haptic exploration in no way emulates the sensorimotor contingencies of sight. The imagery view, coupled with a multimodal account of the brain, makes short work of it, though. If the vOICe is decoded using imagery and haptic exploration tends to use the same automatic imagery mechanisms, haptic "sight" comes as little surprise.

6.2 The Land of Imagination

This brings us to the second aspect of imagery, the "where" of it. Each of the regions we have highlighted so far is commonly and really quite unsurprisingly involved in mental imagery. Broadly speaking, mental imagery uses most of the same cortical regions as perception (Ganis, Thompson, & Kosslyn, 2004). Imagery of shape, for example, is known to activate the left LOC (De Volder et al., 2001), spatial imagery triggers occipito-parietal regions (Mellet et al., 1996), etc. This holds true of the blind as well. De Volder et al.'s (2001) study of shape imagery involved blind subjects, for example. Blind imagery has even been found to strike up activity as far back as the primary visual area (Lambert, Sampaio, Mauss, & Scheiber, 2004), though the most pronounced overlap comes from frontal and parietal regions (Ganis et al.,

2004). This is why imagery is so worrisome as a confound (Merabet et al., 2009; Kim

& Zatorre, 2011).

In fact, many of the neuroimaging studies cited in this and other sections could be accounted for via mental imagery. We made mention of silent lip reading and haptic exploration, both of which show activity within corresponding perceptual regions, but a similar case could possibly be made for Kim and Zatorre's multimodal training experiment. In this sense, imagery is as well off as any proposal. The only major hurdle seems to come from Merabet et al. (2009). As we noted above, application of TMS to the LOC of patient PF interfered with her ability to use her SSD as such. An additional finding of this experiment, however, was that TMS application

failed to disrupt mental imagery. PF was told to imagine letters of the English alphabet in lower and upper case. She was then queried about various aspects of the letter, whether it featured a diagonal or an enclosed space, for example. PF reported no darkening for the letters and was actually quite good at answering questions about them. If sensory substitution can be eliminated without impacting imagery, however, it speaks against any grounding relation between the two. We do not think this challenge insurmountable, though. It might be perfectly consistent with an imagery account. The dissociation established within the experiment is to be expected given certain well-recognized divisions within imagery itself. Evidence has long suggested that even in the sighted, visual and spatial imagery rely on their own distinct processing regions.

Farah et al. (1988) report one subject whose lesions resulted in a clear dissociation of deficits between spatial and visual imagery. Further behavioral support comes from experiments like that of Baddeley and Lieberman (1980), which found that non-visual spatial tasks interfered with spatial visualization where non-spatial visual tasks did not, and Hollins (1985), which found that blind subjects, though they have little trouble imagining three-dimensional scenes, show deficits in two-dimensional pictorial tasks proportional to the amount of time they have spent without sight.24 Imagining letters is a two-dimensional, and hence non-spatial, task. This imagery may involve "spatial" aspects, such as identifying closed spaces, but these are dissociable from other, 3D spatial processing. Far from weighing against imagery, Merabet et al.'s (2009) study helps provide support to an independently observed phenomenon stemming from

24 We should note that these results, along with the general scheme of splitting spatial and pictorial representations, are consistent with our task-based partition of the brain.

imagery. It likewise helps us form a further hypothesis: namely, that TMS should impair PF's voluntary spatial imagery despite its lack of effect on "visual" imagery.

6.3 Speaking For and Against

The extant interviews with PF are actually some of the best sources of evidence imagery has. We saw above her report of seeing haptically explored objects, but she has made several other noteworthy observations. After many years of using the device, she has come to experience a number of things not coded by her SSD. In one presentation, she described a discrepancy in the apparent detail with which objects present themselves:

I took the program down to see my Christmas tree. I just wanted to see the pretty lights, that's all, but to my surprise I was able to see the branches swirling and twirling around. And then one thing caught my eye. That was that by looking at the branches, I could see the points of the needles and before I even could stick out my finger and touch them I could feel that it would be sharp as if it was saying "touch me and I will prick you". . . . as I've gone along later in life, life has changed and I have things in my house that I have no memories other than what I have learned from my hands what they look like. Things like a microwave. Things like my computer. Unfortunately even though I love a computer (it has given me so much) when I look at a computer all I again see is the line box drawing. It's not filled in solid like the tree. I cannot explain why. I just know that it is. I don't have a memory for what computers truly look like. (Presentation by PF, April 2002)

The clearest explanation is that these details are drawn from memory. Seeing the outline of the Christmas tree brought back old memories in the same way that a song reminds us of an old friend or a smell reminds us of a favorite food. These memories were, in turn, relevant to the scene before her, so they were used to interpret it. The most natural way of accounting for so robust an effect is to attribute it to imagery. The Christmas tree outline called forth a stored mental image that she then "saw" in front

of her. The phenomenon is not limited to these finer points, however. In more recent years, PF has claimed to see colors, something completely absent from the vOICe. The following snippet of interview gives her report:

PF: Now it has developed into what I perceive as color.
JW: Really? Before you had said that it is not colored?
PF: Yes, that's true. But before my brain wasn't seeing the finer detail. Over time my brain seems to have developed, and pulled out everything it can from the soundscape and then used my memory to color everything.
JW: Aside from The vOICe, if you think of a strawberry you could still think of it in your mind's eye as being colored?
PF: Yes, red color with yellow seeds all around it and a green stalk. I can see it instantly.
JW: But if you look at someone's sweater or pants you wouldn't necessarily know the color? It could be blue or red.
PF: My brain would probably take a guess at that time. It would be greyish black. Something I know such as grass, tree bark, leaves, my mind just colors it in.
JW: How long ago was it when you started having the colors?
PF: Gradual, gradual but it is strongest now. Within the past year, year and a half, after my depth perception developed.

Once again, the clearest explanation comes from imagery. In the same way that the shape of a Christmas tree can summon up the image of pine needles, it can bring to mind its distinctive green as well. Just like those subjects who experience the tickle of a rubber head or the sound of a muted speaker, PF is “filling in” her experiences, though she would doubtlessly be more practiced at it.25 This interview is cited by

Ward and Wright (2010) in favor of their synaesthesia explanation, but we think it is more beneficially understood in terms of imagery. Color experience relies rather systematically upon memory, but there is little to suggest that memory plays any real role in synaesthesia (Witthoft & Winawer, 2006, report one case in which a subject’s

25 There is some evidence that sighted subjects undergo a similar tinting or "filling in" phenomenon. Delk and Fillenbaum (1965) found that stored color associations with certain shapes impact the perceived color of that shape. This line of thought is developed in McPherson (2012).

color-grapheme association was linked to a childhood alphabet, but it is the exception to a rule seen throughout the literature). Imagery, however, is about as closely linked to memory as one could want.26

Supposing imagery does provide the best explanation of PF's colorful and richly detailed experiences, a possibility suggests itself. Imagery could be all there is to it. Seeing its success in accounting for these perceptual details, one wonders if there is much to speak against its implementation more broadly. If a given stimulus can retrieve an image from memory, is it really implausible that it could structure an image directly? The line between the two is thin, especially in light of PF's claim to "see" haptically explored objects in an SSD-like way. Stored object representations might end up functioning as the background knowledge of perception, plausibly serving as the Bayesian priors in a causal inference model of SSD-based perception (Briscoe, personal communication). We cannot develop the possibility here, but it is well worth consideration.
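As a toy illustration of what treating stored object representations as Bayesian priors might look like, the sketch below runs a single round of Bayes' rule over a handful of candidate objects. The categories, priors, and likelihoods are all invented for the example; this is not the causal inference model itself, only the flavor of it.

```python
# Invented prior beliefs about which household objects might be in view.
priors = {"christmas tree": 0.30, "computer": 0.50, "coat rack": 0.20}

# Invented likelihoods: how probable the heard soundscape (a tall, roughly
# triangular outline, say) is under each candidate object.
likelihoods = {"christmas tree": 0.60, "computer": 0.05, "coat rack": 0.35}

unnormalized = {obj: priors[obj] * likelihoods[obj] for obj in priors}
evidence = sum(unnormalized.values())
posterior = {obj: p / evidence for obj, p in unnormalized.items()}

for obj, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{obj:15s} {p:.2f}")
# Even a weakly informative soundscape, filtered through stored knowledge,
# settles on "christmas tree" -- the kind of memory-driven filling in that
# PF describes.
```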

Another noteworthy finding is that senses can be “made up” using SSDs. We have in mind a belt designed by Nagel et al. (2005) that responds to magnetic fields.

Clearly, humans have no innate capacity for detecting magnetic north, ruling out multimodal approaches. Likewise, it appears that sighted subjects can gain expertise with the device rather quickly, speaking against unilateral plasticity accounts (along

26 A point worth noting is that PF is late blind, having lost her sight in early adulthood. This certainly explains how she identifies, let alone remembers, colors. This does not negate the importance of her case, however. Even if mental imagery is a strategy largely reserved for late blind subjects (as is hypothesized by Poirier et al., 2007), it is still a case of sensory substitution. The perceptual or quasi-perceptual experiences of at least some users can potentially be attributed to imagery. Stated otherwise, sensory substitution would still be explained by imagery in some cases. Having a plurality of mechanisms is no vice.

with the fact that there are no real modalities to call crossmodal). Nevertheless, at least some subjects learn to use the belt. The results of the study were mixed. Many subjects learned little to nothing, but those who did have produced some interesting reports.

Looking through them, an imagery-based approach seems plausible. One subject noted that "it was different from mere tactile stimulation because the belt mediated a spatial feeling" (p. R22, emphasis ours). Another reported that "During the first two weeks, I had to concentrate on it; afterwards, it was intuitive. I could even imagine the arrangement of places and rooms where I sometimes stay. Interestingly, when I take off the belt at night I still feel the vibration: When I turn to the other side, the vibration is moving too—this is a fascinating feeling" (p. R22). From these we gather that successful users gain increased spatial abilities that are not, strictly speaking, dependent upon input ("I still feel the vibration"). These are very interesting for the proponent of imagery because (a) spatial imagery is widely regarded as independent of visual imagery, making the report of a new spatial feeling perfectly understandable, and (b) the lingering effects reported when the belt was removed seem awfully reminiscent of the silent lip reading and rubber head sensations from before.

The authors side with an enactive account, but given what we know about imagery, it too seems plausible. It also lacks the baggage we have seen from enactivism.

A final supportive, or at the very least consistent, study is Siegle and Warren’s

(2010) work with minimal SSDs. The experiments were designed to test the role of attention in SSD learning and the role of sensorimotor contingencies in the process.

Subjects were instructed to locate a luminous point by using a finger-mounted TVSS

device consisting of a single photodiode and a tactile stimulator. Each was trained to locate the target using the same method, sweeping the arm side to side at the elbow joint, but given different instructions on what to attend to. Subjects were split into two major groups: distal attention (DA) and proximal attention (PA). The difference between them, as Siegle and Warren (2010) explain, is that:

Participants in the PA group were instructed to attend to the location of the arm when the vibrating motor was active, and to consciously triangulate the location of the target by imagining their finger extending out into space. Participants in the DA group were explicitly told to not attend to their arms during the experiment, but to get an intuitive sense of the target's location and report how far away it felt. (p.214)

In both cases, the instructions were based on strategies spontaneously adopted by subjects in a pilot study. Over the course of the experiment, Siegle and Warren found that DA subjects were significantly more accurate than PA subjects, and furthermore, that they were more likely to attribute solidity to the target (an operationalization of distal attribution). When the finger-mounted device was inverted or switched to the opposite hand, changes that they argued altered the sensorimotor contingencies associated with the device, little trouble was observed.

The latter part of the experiment, where the device was changed in its placement, is what Siegle and Warren (2010) take to be most dangerous for sensorimotor accounts. Enactivists like Noë (2004) predict that abrupt changes in sensorimotor contingencies precipitate a kind of "experiential blindness," but subjects in this experiment encountered little trouble. Sensorimotor contingencies, these results suggest, are neither sufficient nor necessary for distal attribution. From our own perspective, however, the first part of the study is the more serious. This is because it

highlights the contrast between the two approaches directly, speaking better of the orthodox one. For sensorimotor accounts, perception depends upon knowledge of sensorimotor contingencies: as one learns how input varies as a function of movement, one will come to see the world as such. If the enactivist draws any distinction between the two groups, he will find himself more closely allied with PA as a strategy. Both groups know how input varies as a result of movement, and debatably, the subjects attending to proximal stimulation would know better. Mental imagery, by contrast, tells a very different story. The two groups are very different, and those using DA as a strategy have a massive advantage. If subjects are too busy focusing on tactile feedback, they will fail to allocate any resources to visual imagination—to getting "an intuitive sense of the object's location." Subjects in the DA condition, by contrast, are absolutely free to employ imagery. In fact, it is perhaps the most plausible strategy, given that subjects were blindfolded and had no input besides that of the SSD. Starting out, subjects would know that there is a luminous object to be located somewhere in front of them. With the first swipe of the device, they would anchor the general space to focus on, and with the next swipe they would form a more precise estimate of where the object lies. With the third, they would receive input to correct or corroborate the estimate formed from the first two. Each successive swipe would, in turn, refine the image the subject has formed, leading to significant increases in accuracy. Likewise, when subjects report solidity, it is not because they are receiving any relevant input, but because the image has bled over into the other modalities (see Kitagawa & Igarashi, 2005; this phenomenon is a point of overlap between the mental imagery and multimodal hypotheses).
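One simple way to picture the swipe-by-swipe strategy available to the DA group is as a running estimate that each new sweep corrects or corroborates. The sketch below is our own illustration with invented noise figures, not a model of Siegle and Warren's data.

```python
import random
random.seed(1)

# Each sweep yields a noisy reading of the target's bearing; the subject folds
# it into a single running estimate (an incremental mean). Values are invented.
true_bearing_deg = 12.0
noise_deg = 6.0

estimate = None
for swipe in range(1, 6):
    reading = random.gauss(true_bearing_deg, noise_deg)
    estimate = reading if estimate is None else estimate + (reading - estimate) / swipe
    print(f"swipe {swipe}: reading {reading:6.1f}, running estimate {estimate:6.1f}")

# On average the running estimate's error shrinks as sweeps accumulate, which
# is all the "significant increases in accuracy" described above require.
```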

The evidence reviewed above makes imagery a formidable option for orthodox theorists, but unlike multimodal and plasticity-based accounts, imagery faces a couple of additional, intuitive hurdles. One is related to its status as a mechanism subject to cognitive control. There is little evidence that SSD users have any real control over what they see. That is to say, no known users can will a thing into sight. Orthodox cases of imagery, by contrast, are most certainly subject to executive alteration. If asked to imagine a cup, one can freely add or subtract a handle, change the material of the thing (plastic or ceramic), or alter its color. As we have already seen, however, imagery can be much more than this. An involuntary image is suggested by cases like Kitagawa and Igarashi's (2005) rubber head study or the many blindfolded subjects who automatically imagine the shape that they hold. Deliberate control is clearly not what is going on for PF, who perceives haptically explored objects as if through sight. Concerns about the malleability of mental imagery, then, are not as worrisome as they might appear at first glance.

The second challenge facing imagery is its phenomenal presence, or “oomph.”

Few can claim to have confused a mental image for the real thing. If I imagine eating a large piece of chocolate cake, I will feel neither guilty nor sated.

Understandably, the tendency is to think of imagery as a weak tool for when we cannot directly see or hear something. Aristotle likens it to a "residue" of our actual perceptions (De Insomniis 461b), and the metaphor has stuck around. As a number of

studies suggest, however, imagery shares major processes with dreaming, a process the “oomph” of which is unquestioned. Lesion studies are a good tool for assessing their relationship. Humphrey and Zangwill (1951) document the case of subjects who, after parietal lesioning, lost both dreaming and waking imagery. Similar results for occipital lesions were later reported by Nielsen (1955) and a literature review by Farah found that these cases were part of a larger tendency of the two to correlate (Farah,

1986), a result subsequent lesion studies have borne out (Farah, Levine, & Calvanio, 1988). If we find ourselves doubting the phenomenal power of imagery, we need only think of its cousin, dreams.

A related and perhaps more pressing concern is that we are, by quantitative measures, actually very poor at visualizing scenes in any detail. Even very simple mental rotation tasks, for example, can be quite difficult. One response is that SSD users will be receiving helpful input throughout the process. Whenever there is a change in the camera, there will be a different auditory or tactile stimulus. Subjects need not do anything so taxing as mental rotation. All they need to do is construct a single image for sequential inputs. There is ample evidence that subjects are fully capable of updating and adjusting spatial imagery in a similar way (Loomis et al.,

2012). This is the case even for multiple objects. We may, in addition, borrow a page from crossmodal plasticity. Training with an SSD forces subjects to make use of mental imagery more than just about any non-user. As common sense and countless findings tell us, training and expertise mean major neural shifts.

Musicians show major reorganization in the primary motor cortex (Pascual-Leone,

2001), London taxi drivers show increases in hippocampal grey matter (Maguire,

Woollett, & Spiers, 2006), and SSD users, if this account is correct, should be fine-tuned to spark up and update mental images. Blindness might also help. Visual and spatial imagery in the sighted is constantly competing with input from the eyes. For good reason, the former will be inhibited. This prevents any dangerous confusion that might occur otherwise, such as hallucinations. When visual input has been cut off, this problem is made moot. There is neither competition for resources nor worry of confusion between vision and imagery. A final point worth noting is that the perceptual experiences of SSD users are not as rich as those of normal perception. As has been noted elsewhere, "the number of objects which can be jointly accessed or available through sensory substitution devices seems to be much lower…users of sensory substitution devices have not been shown to be able to track or identify multiple objects at once" (Deroy & Auvray, 2012, p. 4). This oddity, if a mental imagery account is taken up, would be explained by the very limitations that ground the objection.

6.4 Does Reading Make a Better Metaphor?

A final consideration comes from a very recent proposal by Deroy and Auvray

(2012). Although it has overlap with each of the three routes considered in the current work, it focuses on top-down mechanisms and will hence be discussed under the imagery heading. What they present is an alternative to the perceptual accounts of sensory substitution based in an analogy with reading. As in popular dual-route accounts of reading, Deroy and Auvray think learning happens when a new access

point to cognitive/spatial representations is “grafted onto some pre-existing perceptual-cognitive route” (p. 7). This grafting is, we take it, similar to our own proposal for multimodal integration. The preexisting route acts as an “in” for the novel one. Although the reading analogy has quite a few merits, we are skeptical toward some of the conclusions Deroy and Auvray draw from it. In particular, we do not see why such an analogy cannot work together with the already popular perceptual one.

The dual-route reading paradigm, they argue, fares considerably better than its perceptual rival on a number of important issues. One anomaly explained by this approach is the trouble users seem to have with tracking multiple objects. This, they correctly note, is common for tasks like reading but is not a typical feature of perception. We have already seen that our own imagery-based account also deals with this oddity. What we have proposed is, however, quite compatible with a perceptual understanding of sensory substitution. Some capacities, like tracking multiple objects at once, will be diminished, but this hardly precludes a perceptual interpretation.

Many animals have trouble dealing with multiple objects, but this does not mean they are not perceivers. Even sighted persons will have limits on what they can attend to efficiently. SSD users could simply have a lower threshold. Similarly, they point toward the low "visual" acuity demonstrated by some users, citing Sampaio, Maris, and Bach-y-Rita's (2001) finding that subjects averaged 40/860 on ophthalmological tests (Deroy and Auvray, 2012). This can be addressed by simply pointing to the more recent findings of Striem-Amit, Guendelman, and Amedi (2012), whose median

subject performed at 20/360 (fig. 9), moving beyond the WHO blindness threshold of

20/400.

Figure 9. Resolution of The vOICe (Striem-Amit, Guendelman, & Amedi, 2012). An approximation of the acuity of SSD users.

A second important point that Deroy and Auvray (2012) note is that sensory substitution seems to have “no clear modality” (p.10). One example of what they mean comes from Kupers et al. (2006), who subjected trained SSD users to TMS of area V1.

Deroy and Auvray (2012) note, “Before training, no subjective tactile sensations were reported,” but after training “some of the blind participants reported tactile sensations that were referred to the tongue [the location of the SSD]” (p. 6). This could also be explained by the synaesthesia account of Ward and Wright (2012), though. It is likewise compatible with our multimodal integration and imagery view, provided the two senses have been coupled. If the two inputs are normally coupled, the triggering of one could set off another. We have seen this in cases of lip reading, for example.

Finally, this activity is consistent with the view that SSD users perceive but do so in a

phenomenologically tactile way (Block, 2005). Deroy and Auvray (2012) seem hesitant to consider options like this for auditory SSDs, however, since, as they put it,

“it seems like a stretch to say that sensory substitution devices can change the proper objects of audition” (p. 9). We are less optimistic about our abilities to define the limits of sensory substitution a priori, though. If, for example, the brain is naturally a highly multimodal processor (Reich, 2012), the notion of a proper object for a given modality might be a thing of the past. There may be typical objects, but “proper” will be so much Aristotelian baggage.

We are nevertheless fond of the reading analogy, in part because it joins nicely with some of our earlier proposals. The automaticity of imagery and the dual route aspects of integration are especially compatible. The tendency to think of reading as necessarily non-perceptual is, however, odd for us. We would sooner think of it as lying on the far end of a continuum, differing from but not incompatible with the perceptual view. Excluding this possibility seems, more than anything else, to rely on a narrowly construed understanding of reading as “learning to access words through vision instead of audition” (Deroy & Auvray, 2012, p. 7). Their assumption is that words are wholly auditory, and it comes in quite often in their discussion:

Writing systems have been designed to preserve the phonemic structures that are relevant to access semantic information. In this sense, the code remains “auditorily phonemically” constrained or governed. The acquisition of reading itself relies on existing phonemic skills, not just on auditory perception, and consists in mapping what one hears onto what one sees through the mediation of what one knows the later means. It is only as a result of mapping the known written signs to known spoken words and phonemes that readers can progressively entertain auditory representation on the basis of visual words, and this even for unknown or novel items. (p. 7)


On Deroy and Auvray's (2012) view, written words must be "translated into variations in sounds, and from there to meaning" (p. 7). Likewise, when it comes to learning sensory substitution, inputs must be translated into ("grafted onto") the terms of another modality. In both cases, the reason the novel input needs translation is that the mapping is arbitrary. There is very little about the look of "dog" to suggest the furry household creature. Although they note that non-arbitrary links exist between touch and sight, they contend that the vOICe (or any SSD using an auditory-visual mapping) is in roughly the same position:

Variations in shape/surface will not result in a variation in the auditory signal, at least not one from which that variation could normally be inferred. The only way to constrain the inference is to learn an arbitrary translation from one to the other. This is then more similar to the case of reading where variations in the shapes of words or letters do not directly lead to differences in sound. (p. 9)

The trouble is that in both cases they do not give non-arbitrary connections their due.

We have already seen the evidence that innate mappings exist between modalities, facilitating or perhaps underwriting the whole process. Deroy and Auvray (2012) grant a handful of these but are still skeptical. The best response we have to offer is to point toward Kim and Zatorre (2011). Subjects receiving auditory input produced from novel shapes showed significant activity in the LOC even before training. These preexisting connections were so prominent, in fact, that there was no significant increase in LOC activity after training, improvement coming in selectivity only.

Obviously these connections were not enough for perception right away—subjects did not just start "seeing"—but it is far from the arbitrary mapping Deroy and Auvray's phonemic reading analogy suggests. If the mapping is an arbitrary one, why would the

LOC respond so soon to auditory SSD input? They would do best to ease up on the arbitrariness claims, we think. This does not mean a move away from the broad reading analogy, though. We might still be able to draw a comparison with a pictographic writing system. Close contemporaries might be characters in a logographic language like Chinese, which involves both direct written-word-to-semantic-meaning and written-word-to-spoken-word-to-semantic-meaning mappings

(Shu, Chen, Anderson, Wu, & Xuan, 2003). Returning to the dog example, if one looks at the character "狗," a dog is easier to make out (hint: the dog is standing and facing left; similarities tend to be even clearer for older character systems). Although not immediately transparent, these kinds of characters are not arbitrary either.27

Thinking in terms of pictographs also helps dispel some pesky intuitions regarding the compatibility of the perceptual and reading analogies. Unlike phonetic writing systems, these are much more easily construed as continuous with more direct, perceptual signs. A pictograph is a stylized and simplified drawing, a kind of low- information picture, and intuitions regarding pictures as perceptual channels are much more amenable. Is a person watching a video seeing the event? What if the frames are slowed down to one per second? What if they are routed through another modality?

Each step is a little further away from the prototypical case of perception but not so

27 The legendary origins of Chinese script are apropos. One day, the emperor became dissatisfied with the existing system of communication. He therefore directed the bureaucrat Cangjie (who is incidentally said to have had four eyes) to create a new system. Troubled by the task, Cangjie went to rest by the banks of a river where he noticed a set of mysterious tracks. Not knowing their origin, he sought out a hunter who might identify them and was told that they could be none other than the footprints of the Pixiu, a mythical lion-like creature. From this experience, Cangjie came to understand that each thing had its own distinctive sign and sought to capture each in print, just as the Pixiu had done on the banks of the river.

much that it makes connecting the two outlandish. At the far end would be pictographs, which are not widely regarded as perceptual, but might still be thought compatible with an analogy based thereon. The difference between the two could be one of emphasis. The reading analogy would highlight certain aspects like learned automaticity and the help a guiding modality can provide. The perceptual analogy, meanwhile, would naturally do best when discussing things like on-line “visual” action guidance. By emphasizing the arbitrariness of the mappings, Deroy and Auvray

(2012) seem to miss a chance to integrate the analogy with other models. This brings us to our final section.

7. How They All Fit Together

We opened by suggesting that the three proposals we have considered comprise a united front against enactivism. We have, however, spent a great deal of time considering them and their various advantages and disadvantages in isolation. An unfortunate aspect of this approach is that it obscures just how much these accounts can provide for each other. Despite what the formatting might suggest, our earlier claim was not simply a rhetorical flourish. These three can be integrated in suggestive and interesting ways. In fact, they are better for it.

We have already seen how this applies to the major difficulties facing mental imagery. Imagery is prototypically less vivid than sight, yet subjects using SSDs do not seem to notice this. One response would be to argue that these subjects were either mistaken or that they had been without sight for so long that any visualization appeared vivid and exciting. Some (e.g., Dennett, 1991) have suggested that even

conscious perceptual experience in sighted individuals is much sparser than we take it to be—that the detailed world is a kind of grand illusion. Avoiding the hypothesis of massive error would certainly be a plus, though, and barring it, we will have to explain why subjects self-report vivid scenes. Plasticity can help us do this. Blind subjects have been known to show augmented capacities for sound localization, some even using a hand-clapping form of echolocation (Thaler, Arnott, & Goodale, 2011). The possibility does not seem to be limited to early blind subjects, either. Experts, like professional musicians and drivers, show similar neural benefits. In one study,

Alemann, Nieuwenstein, Bocker, and De Haan (2000) found that subjects given a kind of auditory training task outperformed naïve subjects in both task specific and general auditory imagery tests. In the realm of crossmodality, there is even evidence that blindfolded sighted subjects can take on some well-known blindness induced capacities. Blindfolded persons have been observed during Braille character discrimination and have shown significant occipital activity during it (Kauffmann,

Theoret, & Pascual-Leone, 2002). We do not think it implausible that a similar process of augmentation could occur with mental imagery through SSD training.

More quietly, we have seen similar problems addressed for the other routes too.

What, for example, is the major trouble facing the multimodality account? Since our primary suggestion has been that subjects attain distal attribution through multimodal integration, the problem will be that not all subjects who learn to use SSDs have access to more than one sensory modality. Bach-y-Rita's subjects (reviewed in Bach-y-Rita, 1972), from what we can gather, had only tactile feedback, camera control, and

some experimenter instruction. No mention is ever made of possible haptic exploration or auditory input. The multimodal learning methods outlined above are not relevant in this situation, then. Stated bluntly, the multimodality hypothesis has no account of unisensory SSD learning.

There are two ways of addressing this. The first would be through mental imagery. Subjects can construct an image wholesale and compare it with whatever input they receive. In some cases, in fact, experimenters have found that degree of crossmodal activation can be predicted by vividness of mental imagery (Zhang, et al.,

2004). Other relevant findings for this proposal come from the crossmodal imagery experiments of Kitagawa and Igarashi (2005), Sathian and Zangaladze (2001), and others. These can each be seen as the intersection of imagery and multimodality. Their classification as imagery comes both from their phenomenological side, the fact that they “seem like” imagery, and the fact that they occur in the absence of the relevant input (tactile for Kitagawa and Igarashi, visual for Sathian and Zangaladze). The multimodal aspect comes in when one tries to explain the automaticity of the triggers.

Were the processes in question carried out in multisensory regions, one would expect modality hopping on a fairly regular basis.

Additional help comes from blindness-facilitated plasticity. This is the route brought in by Proulx et al. (2012). Higher-order regions can train lower-level ones, but the degree to which a lower-level region is receptive to the training will depend in part on whether it is receiving a steady stream of conflicting information. A sighted subject will not learn to "see" through a handheld camera if they are already seeing through

their eyes. There is no reason to switch. By eliminating visual inputs, however, visual processing regions would be opened up. Once again, this could be the case for either blind or blindfolded subjects. If, as we have seen, blindfolded subjects show occipital activity for Braille reading, sensory substitution might not be too far off.

Lastly, we have seen a number of difficulties for crossmodal plasticity. It has some trouble dealing with sensory substitution in the sighted, as well as a more general problem concerning psychological-level explanation. Turning to the first, we know that sightedness does not preclude subjects from feeling the effects of plasticity.

Blindfolded subjects show a plethora of plastic effects, but oftentimes these subjects are blindfolded for fairly short periods of time. The spontaneous development of the right connections is not out of the question, nor is the use of already present plastic changes in blind subjects, but the hypothesis that there are pre-existing multimodal connections involved seems more likely. The functional-anatomical proposals the multimodal approach is coupled with make this especially appealing. Rather than having to forge altogether new routes, there would be already active connections free to use. Perception would be fast-tracked, in a sense, avoiding competition with any and all cognitive processes for the space. A second and more significant gain is found in the benefit potentially offered by mental imagery. Sighted subjects (as well as blind ones) will doubtlessly have experience with imagery, and some of this will be involuntary. The experience subjects have with this automatic process of triggering will prime the development of correct functional pathways in a way more precise than the free exposure a bare plasticity account would offer. It likewise combines well with

a multimodal account, as we have seen. Finally, if it is combined with these other two explanations, our worry about the mechanisms involved would more or less dissolve.

The mechanisms that explain the guidance of plasticity are simply the mechanisms of imagery and multimodality.

We are reminded of a method of sealing a cardboard box without tape. One takes each flap and places its left edge under the preceding flap and its right edge over the following one. The four fit together in this way, each maintaining the place of its neighbors, and the box holds solidly. While any one mechanism behind sensory substitution might prove too little to explain the whole of the process, the mechanisms are quite formidable together. The converse of the flexibility and compatibility of these programs is, of course, the difficulty of identifying which mechanism is at work in a particular case. Decomposing the contributions of the various mechanisms to sensory substitution in early-blind, late-blind, and sighted users is an outstanding problem. Some proposals have already been made (see Poirier et al., 2007), but the issue is much debated, and the subtleties involved place a full discussion beyond our scope here.

8. Conclusions: Lessons and Future Possibilities

We hope to have provided the reader with an understanding of sensory substitution: where it came from, what research has been done, what explanations have been ventured, and what the devices might mean for blind users. Our discussion has ranged from the humble devices of the 1960s to the dazzling projects of the last several years. Despite the range covered, there are still some general lessons we think can be drawn.

Of these morals, two stand out. The first is the importance of the environment and how easy it is to overlook. Going forward, it will benefit researchers to exercise caution when interpreting these kinds of experiments, especially with regard to the structure of the environment. This is not a new issue; Gibson raised it decades ago, but it is one easily missed. Ironically, the enactivists (Noë, Hurley, O’Regan, etc.), who have been some of the greatest champions of the body and environment in perception, were among those who overlooked the major environmental and informational hurdles facing passive subjects. Working from essentially Gibsonian principles, one would expect passive adaptation to fail for reasons much more quotidian than the lack of sensorimotor contingencies. Even a casual glance at the kind of input available in these early experiments shows that it was far from the rich detail available to organisms in a naturalistic setting. Many, if not most, cues were degraded, and some were eliminated completely. The same is true of multimodal correspondences. The absence of these sources is a bit less conspicuous, but their oversight is no more forgivable. Gibson, as well as the enactivists, saw perception as amodal: what matters on his account is the information conveyed, not the modality it comes from. Given that many of these early experiments were unimodal, employing only the modality under training, the loss of informational input was drastic. Additionally, one need not be a Gibsonian to recognize the importance of multimodal inputs. We have reviewed only a small subset of the experimental evidence in favor of multimodal learning and still found major contributions to learning in infants and adults. Subjects failed to see in part because there was just too little there.

The second lesson is that, despite nearly 45 years of research, there is still much to be learned from sensory substitution. As the reader may have gathered from the wide-ranging discussion, there are many things that we just do not know about the process, including dozens of nooks and crannies that have yet to be investigated. In the future, we would like to see research aimed at sorting through the differing contributions of each mechanism in specific cases. One example might be Poirier et al. (2007), who have suggested that mental imagery is a major component of SSD training in blindfolded sighted subjects. Some of the interview evidence we have reviewed here suggests that the same may be true of late blind subjects like PF. The time to make firm judgments on the matter is still some way off, however, making research that individuates the three paths of increasing interest.

An additional question concerns the possible role of anatomical neural reorganization. Late blind plasticity works mostly with functional reorganization, but the possibility of similar anatomical compensation ought not to be ruled out over the long term. SSD training may happen faster than anatomical changes can take effect, but that does not mean that no such changes are at work months, years, or decades after initial exposure. Long-term users of The vOICe will be especially helpful in studying this. Looking into the effects of SSD exposure on infants may be similarly revealing. Most SSD research has been conducted well past the “critical period” of neural development. The abilities of users tested so far, though miraculous, may be only a small sample of the SSDs’ capacities. We predict, then, that forthcoming research on infant SSDs will be of increasing interest over the next few decades. There is no telling what the future may hold, but if the next several decades are anything like the last, we look forward to it.

9. References

Afra, P., Funke, M., & Matsuo, F. (2009). Acquired auditory-visual synaesthesia: A window to early crossmodal sensory interactions. Psychology Research and Behavior Management, 2, 31.
Ahissar, M., & Hochstein, S. (1997). Task difficulty and the specificity of perceptual learning. Nature, 387(6631), 401–406.
Aitken, S., & Bower, T. (1983). Developmental aspects of sensory substitution. International Journal of Neuroscience, 19(1-4), 13–19.
Alais, D., & Burr, D. (2004). The ventriloquist effect results from near-optimal bimodal integration. Current Biology, 14(3), 257–262.
Aleman, A., Nieuwenstein, M., Böcker, K., & de Haan, E. (2000). Music training and mental imagery ability. Neuropsychologia, 38(12), 1664–1668.
Alho, K., Kujala, T., Paavilainen, P., Summala, H., & Näätänen, R. (1993). Auditory processing in visual brain areas of the early blind: Evidence from event-related potentials. Electroencephalography and Clinical Neurophysiology, 86(6), 418–427.
Amedi, A., Jacobson, G., Hendler, T., Malach, R., & Zohary, E. (2002). Convergence of visual and tactile shape processing in the human lateral occipital complex. Cerebral Cortex, 12(11), 1202–1212.
Amedi, A., Stern, W. M., Camprodon, J. A., Bermpohl, F., Merabet, L., Rotman, S., … Pascual-Leone, A. (2007). Shape conveyed by visual-to-auditory sensory substitution activates the lateral occipital complex. Nature Neuroscience, 10(6), 687–689.
Armel, K. C., & Ramachandran, V. (2003). Projecting sensations to external objects: Evidence from skin conductance response. Proceedings of the Royal Society of London. Series B: Biological Sciences, 270(1523), 1499–1506.
Armel, K., & Ramachandran, V. (1999). Acquired synaesthesia in retinitis pigmentosa. Neurocase, 5(4), 293–296.
Arno, P., De Volder, A. G., Vanlierde, A., Wanet-Defalque, M. C., Streel, E., Robert, A., … Veraart, C. (2001). Occipital activation by pattern recognition in the early blind using auditory substitution for vision. NeuroImage, 13(4), 632–645.
Auvray, M., Hanneton, S., Lenay, C., & O’Regan, K. (2005). There is something out there: Distal attribution in sensory substitution, twenty years later. Journal of Integrative Neuroscience, 4(4), 505–521.
Auvray, M., & Deroy, O. (2012). Interpreting sensory substitution beyond the perceptual assumption: An analogy with reading. Seeing and Perceiving, 25(s1), 142.
Auvray, M., & Myin, E. (2009). Perception with compensatory devices: From sensory substitution to sensorimotor extension. Cognitive Science, 33(6), 1036–1058.
Bach-y-Rita, P. (1972). Brain mechanisms in sensory substitution. New York: Academic Press.
Bach-y-Rita, P. (1983). Tactile vision substitution: Past and future. International Journal of Neuroscience, 19(1-4), 29–36.
Bach-y-Rita, P. (1984). The relationship between motor processes and cognition in tactile vision substitution. In W. Prinz & A. F. Sanders (Eds.), Cognition and motor processes (pp. 149–160). New York: Springer.
Bach-y-Rita, P. (1990). Brain plasticity as a basis for recovery of function in humans. Neuropsychologia, 28(6), 547–554.
Bach-y-Rita, P. (1995). Nonsynaptic diffusion neurotransmission and late brain reorganization. New York: Demos.
Bach-y-Rita, P., Collins, C. C., Saunders, F. A., White, B., & Scadden, L. (1969). Vision substitution by tactile image projection. Nature, 221, 963–964.
Bach-y-Rita, P., & Kercel, S. (2003). Sensory substitution and the human–machine interface. Trends in Cognitive Sciences, 7(12), 541–546.
Baddeley, A. D., & Lieberman, K. (1980). Spatial working memory. In R. S. Nickerson (Ed.), Attention and performance VIII (pp. 521–539). Hillsdale, NJ: Erlbaum.
Bahrick, L., Flom, R., & Lickliter, R. (2002). Intersensory redundancy facilitates discrimination of tempo in 3-month-old infants. Developmental Psychobiology, 41(4), 352–363.
Bahrick, L., & Lickliter, R. (2000). Intersensory redundancy guides attentional selectivity and perceptual learning in infancy. Developmental Psychology, 36(2), 190.
Bahrick, L., Lickliter, R., & Flom, R. (2006). Up versus down: The role of intersensory redundancy in the development of infants’ sensitivity to the orientation of moving objects. Infancy, 9(1), 73–96.
Bahrick, L. E., Walker, A. S., & Neisser, U. (1981). Selective looking by infants. Cognitive Psychology, 13(3), 377–390.
Bahrick, L. E., & Lickliter, R. (2003). Intersensory redundancy guides early perceptual and cognitive development. In R. V. Kail (Ed.), Advances in child development and behavior (Vol. 30, pp. 153–187). San Diego: Elsevier.
Baron-Cohen, S., Burt, L., Smith-Laittan, F., Harrison, J., & Bolton, P. (1996). Synaesthesia: Prevalence and familiality. Perception, 25, 1073–1079.
Beauchamp, M. S. (2005). See me, hear me, touch me: Multisensory integration in lateral occipital-temporal cortex. Current Opinion in Neurobiology, 15(2), 145–153.
Beauchamp, M., & Ro, T. (2008). Neural substrates of sound–touch synaesthesia after a thalamic lesion. The Journal of Neuroscience, 28(50), 13696–13702.
Berkeley, G. (1709). An essay towards a new theory of vision. London: Aaron Rhames.
Block, N. (2005). Review of Alva Noë, Action in Perception. Journal of Philosophy, 102.
Bogdashina, O. (2001). A reconstruction of the sensory world of autism. Sheffield: Sheffield Hallam University Press.
Bologna, G., Deville, B., & Pun, T. (2009). On the use of the auditory pathway to represent image scenes in real-time. Neurocomputing, 72(4), 839–849.
Botvinick, M., & Cohen, J. (1998). Rubber hands ‘feel’ touch that eyes see. Nature, 391(6669), 756.
Bresciani, J., Ernst, M., Drewing, K., Bouyer, G., Maury, V., & Kheddar, A. (2005). Feeling what you hear: Auditory signals can modulate tactile tap perception. Experimental Brain Research, 162(2), 172–180.
Bubic, A., Striem-Amit, E., & Amedi, A. (2010). Large-scale brain plasticity following blindness and the use of sensory substitution devices. In M. J. Naumer & J. Kaiser (Eds.), Multisensory object perception in the primate brain (pp. 351–380). New York: Springer.
Butter, C., Buchtel, H., & Santucci, R. (1989). Spatial attentional shifts: Further evidence for the role of polysensory mechanisms using visual and tactile stimuli. Neuropsychologia, 27(10), 1231–1240.
Calvert, G., Bullmore, E., Brammer, M., Campbell, R., Williams, S., McGuire, P. K., … David, A. S. (1997). Activation of auditory cortex during silent lipreading. Science, 276(5312), 593–596.
Campos, J., Anderson, D., Barbu-Roth, M., Hubbard, E., Hertenstein, M., & Witherington, D. (2000). Travel broadens the mind. Infancy, 1(2), 149–219.
Carlesimo, G., Turriziani, P., Paulesu, E., Gorini, A., Caltagirone, C., Fazio, F., & Perani, D. (2004). Brain activity during intra- and crossmodal priming: New empirical data and review of the literature. Neuropsychologia, 42(1), 14–24.
Cattaneo, Z., Vecchi, T., Cornoldi, C., Mammarella, I., Bonino, D., Ricciardi, E., & Pietrini, P. (2008). Imagery and spatial processes in blindness and visual impairment. Neuroscience and Biobehavioral Reviews, 32(8), 1346–1360.
Chebat, D.-R., Rainville, C., Kupers, R., & Ptito, M. (2007). Tactile-‘visual’ acuity of the tongue in early blind individuals. Neuroreport, 18(18), 1901–1904.
Collignon, O., Lassonde, M., Lepore, F., Bastien, D., & Veraart, C. (2007). Functional cerebral reorganization for auditory spatial processing and auditory substitution of vision in early blind subjects. Cerebral Cortex, 17(2), 457–465.
Collins, C., & Bach-y-Rita, P. (1973). Transmission of pictorial information through the skin. Advances in Biological and Medical Physics, 14, 285–315.
Cornilleau-Pérès, V., & Gielen, C. C. A. M. (1996). Interactions between self-motion and depth perception in the processing of optic flow. Trends in Neurosciences, 19(5), 196–202.
Daw, N., & Wyatt, H. (1976). Kittens reared in a unidirectional environment: Evidence for a critical period. The Journal of Physiology, 257(1), 155–170.
Delk, J. L., & Fillenbaum, S. (1965). Differences in perceived color as a function of characteristic color. The American Journal of Psychology, 78(2), 290–293.
Dennett, D. (1991). Consciousness explained. Boston: Little, Brown, and Co.
De Volder, A. G., Toyama, H., Kimura, Y., Kiyosawa, M., Nakano, H., Vanlierde, A., … Senda, M. (2001). Auditory triggered mental imagery of shape involves visual association areas in early blind humans. NeuroImage, 14(1), 129–139.
Deroy, O., & Auvray, M. (2012). Reading the world through the skin and ears: A new perspective on sensory substitution. Frontiers in Psychology, 3.
Driver, J., & Spence, C. (1998). Attention and the crossmodal construction of space. Trends in Cognitive Sciences, 2(7), 254–261.
Dummer, T., Picot-Annand, A., Neal, T., & Moore, C. (2009). Movement and the rubber hand illusion. Perception, 38(2), 271–280.
Eimer, M., & Driver, J. (2000). An event-related brain potential study of crossmodal links in spatial attention between vision and touch. Psychophysiology, 37(5), 697–705.
Eimer, M., & Driver, J. (2001). Crossmodal links in endogenous and exogenous spatial attention: Evidence from event-related brain potential studies. Neuroscience & Biobehavioral Reviews, 25(6), 497–511.
Eimer, M., & Schröger, E. (1998). ERP effects of intermodal attention and crossmodal links in spatial attention. Psychophysiology, 35(3), 313–327.
Epstein, W., Hughes, B., Schneider, S., & Bach-y-Rita, P. (1986). Is there anything out there? A study of distal attribution in response to vibrotactile stimulation. Perception, 15(3), 275–284.
Farah, M. J., Hammond, K. M., Levine, D. N., & Calvanio, R. (1988). Visual and spatial mental imagery: Dissociable systems of representation. Cognitive Psychology, 20(4), 439–462.
Fleming, P., Ball, L. J., Ormerod, T. C., & Collins, A. F. (2006). Analogue versus propositional representation in congenitally blind individuals. Psychonomic Bulletin & Review, 13(6), 1049–1055.
Ganis, G., Thompson, W. L., & Kosslyn, S. M. (2004). Brain areas underlying visual mental imagery and visual perception: An fMRI study. Cognitive Brain Research, 20(2), 226–241.
Geldard, F. A. (1957). Adventures in tactile literacy. American Psychologist, 12, 115–124.
Geldard, F. A. (1961). Cutaneous channels of communication. In W. A. Rosenblith (Ed.), Sensory communication (pp. 73–87). New York: Wiley.
Geldard, F. A. (1966). Cutaneous coding of optical signals: The optohapt. Perception & Psychophysics, 1, 377–381.
Gibson, E. J. (1969). Principles of perceptual learning and development. New York: Appleton-Century-Crofts.
Gibson, J. J. (1962). Observations on active touch. Psychological Review, 69(6), 477.
Gibson, J. J. (1966). The senses considered as perceptual systems. New York: Houghton Mifflin.
Gottlieb, J. (2007). From thought to action: The parietal cortex as a bridge between perception, action, and cognition. Neuron, 53(1), 9.
Gougoux, F., Zatorre, R. J., Lassonde, M., Voss, P., & Lepore, F. (2005). A functional neuroimaging study of sound localization: Visual cortex activity predicts performance in early-blind individuals. PLoS Biology, 3(2), e27.
Grill-Spector, K., Kourtzi, Z., & Kanwisher, N. (2001). The lateral occipital complex and its role in object recognition. Vision Research, 41(10–11), 1409–1422.
Grossenbacher, P. G., & Lovelace, C. T. (2001). Mechanisms of synaesthesia: Cognitive and physiological constraints. Trends in Cognitive Sciences, 5(1), 36–41.
Guarniero, G. (1974). Experience of tactile vision. Perception, 3(1), 101–104.
Hamilton, R., Keenan, J. P., Catala, M., & Pascual-Leone, A. (2000). Alexia for Braille following bilateral occipital stroke in an early blind woman. Neuroreport, 11(2), 237–240.
Hayashibe, K. (1991). Reversals of visual depth caused by motion parallax. Perception, 20(1), 17–28.
Held, R., & Freedman, S. J. (1963). Plasticity in human sensorimotor control. Science, 142(3591), 455–462.
Helmholtz, H. ([1878]/1977). The facts in perception (with notes and comments by Moritz Schlick). In R. S. Cohen & Y. Elkana (Eds.), Hermann von Helmholtz: Epistemological writings (pp. 115–185). Dordrecht: Reidel Publishing Company.
Hertrich, I., Dietrich, S., & Ackermann, H. (2011). Crossmodal interactions during perception of audiovisual speech and nonspeech signals: An fMRI study. Journal of Cognitive Neuroscience, 23(1), 221–237.
Higgins, C. I., Campos, J. J., & Kermoian, R. (1996). Effect of self-produced locomotion on infant postural compensation to optic flow. Developmental Psychology, 32(5), 836.
Hollins, M. (1985). Styles of mental imagery in blind adults. Neuropsychologia, 23(4), 561–566.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1), 106.
Humphrey, M., & Zangwill, O. (1951). Cessation of dreaming after brain injury. Journal of Neurology, Neurosurgery & Psychiatry, 14(4), 322–325.
Humphrey, N. (2006). Seeing red: A study in consciousness. Cambridge: Harvard University Press.
Hurley, S. L. (2002). Consciousness in action. Cambridge: Harvard University Press.
Hurley, S., & Noë, A. (2003). Neural plasticity and consciousness. Biology and Philosophy, 18(1), 131–168.
Hyvärinen, J., Carlson, S., & Hyvärinen, L. (1981). Early visual deprivation alters modality of neuronal responses in area 19 of monkey cortex. Neuroscience Letters, 26(3), 239–243.
James, T. W., Humphrey, G. K., Gati, J. S., Servos, P., Menon, R. S., & Goodale, M. A. (2002). Haptic study of three-dimensional objects activates extrastriate visual areas. Neuropsychologia, 40(10), 1706–1714.
Jouen, F., Lepecq, J.-C., Gapenne, O., & Bertenthal, B. (2000). Optic flow sensitivity in neonates. Infant Behavior and Development, 23(3–4), 271–284.
Julesz, B. (1971). Foundations of cyclopean perception. Cambridge: MIT Press.
Kadosh, R. C., Henik, A., Catena, A., Walsh, V., & Fuentes, L. J. (2009). Induced cross-modal synaesthetic experience without abnormal neuronal connections. Psychological Science, 20(2), 258–265.
Kauffmann, T., Théoret, H., & Pascual-Leone, A. (2002). Braille character discrimination in blindfolded human subjects. Neuroreport, 13(5), 571–574.
Kim, J.-K., & Zatorre, R. J. (2008). Generalized learning of visual-to-auditory substitution in sighted individuals. Brain Research, 1242, 263–275.
Kim, J.-K., & Zatorre, R. J. (2010). Can you hear shapes you touch? Experimental Brain Research, 202(4), 747–754.
Kim, J.-K., & Zatorre, R. J. (2011). Tactile–auditory shape learning engages the lateral occipital complex. The Journal of Neuroscience, 31(21), 7848–7856.
Kitagawa, N., & Igarashi, Y. (2005). Tickle sensation induced by hearing a sound. The Japanese Journal of Psychonomic Science, 24, 121–122.
Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–719.
Knudsen, E. I. (2007). Fundamental components of attention. Annual Review of Neuroscience, 30, 57–78.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4(4), 219–227.
Lacey, S., Tal, N., Amedi, A., & Sathian, K. (2009). A putative model of multisensory object representation. Brain Topography, 21(3), 269–274.
Lambert, S., Sampaio, E., Mauss, Y., & Scheiber, C. (2004). Blindness and brain plasticity: Contribution of mental imagery? An fMRI study. Cognitive Brain Research, 20(1), 1–11.
Lee, W., Huang, H., Feng, G., Sanes, J., Brown, E., So, P., & Nedivi, E. (2005). Dynamic remodeling of dendritic arbors in GABAergic interneurons of adult visual cortex. PLoS Biology, 4(2), e29.
Lenay, C., Canu, S., & Villon, P. (1997). Technology and perception: The contribution of sensory substitution systems. In Proceedings of the Second International Conference on Cognitive Technology: Humanizing the Information Age (pp. 44–53). IEEE.
Lenay, C., Gapenne, O., Hanneton, S., Marque, C., & Genouelle, C. (2003). Sensory substitution: Limits and perspectives. In Touching for knowing (pp. 275–292).
Li, Z. (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6(1), 9–16.
Lickliter, R., Bahrick, L., & Honeycutt, H. (2002). Intersensory redundancy facilitates prenatal perceptual learning in bobwhite quail (Colinus virginianus) embryos. Developmental Psychology, 38(1), 15.
Locke, J. (1690/1888). An essay on the human understanding (Vol. 1). London: Ward, Lock & Company.
Macpherson, F. (2012). Cognitive penetration of colour experience: Rethinking the issue in light of an indirect mechanism. Philosophy and Phenomenological Research, 84(1), 24–62.
Maguire, E., Gadian, D., Johnsrude, I., Good, C., Ashburner, J., Frackowiak, R., & Frith, C. (2000). Navigation-related structural change in the hippocampi of taxi drivers. Proceedings of the National Academy of Sciences, 97(8), 4398–4403.
Malach, R., Reppas, J., Benson, R., Kwong, K., Jiang, H., Kennedy, W., … Tootell, R. (1995). Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proceedings of the National Academy of Sciences, 92(18), 8135–8139.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Henry Holt and Co.
Mazer, J., & Gallant, J. (2003). Goal-related activity in V4 during free viewing visual search: Evidence for a ventral stream visual salience map. Neuron, 40(6), 1241–1250.
Meijer, P. (1992). An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2), 112–121.
Mellet, E., Tzourio, N., Crivello, F., Joliot, M., Denis, M., & Mazoyer, B. (1996). Functional anatomy of spatial mental imagery generated from verbal instructions. The Journal of Neuroscience, 16(20), 6504–6512.
Meltzoff, A., & Borton, R. (1979). Intermodal matching by human neonates. Nature, 282, 403–404.
Merabet, L., Battelli, L., Obretenova, S., Maguire, S., Meijer, P., & Pascual-Leone, A. (2009). Functional recruitment of visual cortex for sound encoded object identification in the blind. Neuroreport, 20(2), 132–138.
Merabet, L. B., Hamilton, R., Schlaug, G., Swisher, J., Kiriakopoulos, E., Pitskel, N., … Pascual-Leone, A. (2008). Rapid and reversible recruitment of early visual cortex for touch. PLoS One, 3(8), e3046.
Merabet, L. B., Maguire, D., Warde, A., Alterescu, K., Stickgold, R., & Pascual-Leone, A. (2004). Visual hallucinations during prolonged blindfolding in sighted subjects. Journal of Neuro-Ophthalmology, 24(2), 109–113.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81–97.
Molholm, S., Ritter, W., Javitt, D. C., & Foxe, J. J. (2004). Multisensory visual–auditory object recognition in humans: A high-density electrical mapping study. Cerebral Cortex, 14(4), 452–465.
Moore, C., & Cavanagh, P. (1998). Recovery of 3D volume from 2-tone images of novel objects. Cognition, 67(1), 45–71.
Moore, C., & Engel, S. (2001). Neural response to perception of volume in the lateral occipital complex. Neuron, 29(1), 277–286.
Murray, S., Kersten, D., Olshausen, B., Schrater, P., & Woods, D. (2002). Shape perception reduces activity in human primary visual cortex. Proceedings of the National Academy of Sciences, 99(23), 15164–15169.
Nielsen, J. (1955). Occipital lobes, dreams and psychosis. Journal of Nervous and Mental Disease.
Noë, A. (2005). Action in perception. Cambridge: MIT Press.
Noë, A. (2010). Vision without representation. In N. Gangopadhyay, M. Madary, & F. Spicer (Eds.), Perception, action and consciousness (pp. 245–256). Oxford: Oxford University Press.
O'Regan, J. K. (2011). Why red doesn't sound like a bell: Understanding the feel of consciousness. Oxford: Oxford University Press.
O'Regan, J. K., & Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5), 939–972.
Ortiz, T., Poch, J., Santos, J. M., Requena, C., Martínez, A. M., Ortiz-Terán, L., … Calvo, A. (2011). Recruitment of occipital cortex during sensory substitution training linked to subjective experience of seeing in people with blindness. PLoS One, 6(8), e23264.
Pascual-Leone, A. (2001). The brain that plays music and is changed by it. Annals of the New York Academy of Sciences, 930(1), 315–329.
Pick, H. L., Jr., & Hay, J. C. (1965). A passive test of the Held reafference hypothesis. Perceptual and Motor Skills, 20(3c), 1070–1072.
Pietrini, P., Furey, M., Ricciardi, E., Gobbini, M., Wu, W., Cohen, L., … Haxby, J. (2004). Beyond sensory images: Object-based representation in the human ventral pathway. Proceedings of the National Academy of Sciences of the United States of America, 101(15), 5658–5663.
Pitskel, N. B., Merabet, L. B., Ramos-Estebanez, C., Kauffman, T., & Pascual-Leone, A. (2007). Time-dependent changes in cortical excitability after prolonged visual deprivation. Neuroreport, 18(16), 1703–1707.
Poirier, C., De Volder, A. G., & Scheiber, C. (2007). What neuroimaging tells us about sensory substitution. Neuroscience and Biobehavioral Reviews, 31(7), 1064.
Posner, M., Nissen, M., & Klein, R. (1976). Visual dominance: An information-processing account of its origins and significance. Psychological Review, 83(2), 157.
Prinz, J. (2006). Putting the brakes on enactive perception. Psyche, 12(1), 1–19.
Prinz, J. (2012). The conscious brain. Oxford: Oxford University Press.
Proulx, M., Brown, D., Pasqualotto, A., & Meijer, P. (in press). Multisensory perceptual learning and sensory substitution. Neuroscience & Biobehavioral Reviews.
Ptito, M., & Kupers, R. (2005). Crossmodal plasticity in early blindness. Journal of Integrative Neuroscience, 4(4), 479–488.
Pylyshyn, Z. (2003). Return of the mental image: Are there really pictures in the brain? Trends in Cognitive Sciences, 7(3), 113–118.
Ramachandran, V., & Hubbard, E. (2001). Synaesthesia: A window into perception, thought and language. Journal of Consciousness Studies, 8(12), 3–34.
Reich, L., Maidenbaum, S., & Amedi, A. (2012). The brain as a flexible task machine: Implications for visual rehabilitation using noninvasive vs. invasive approaches. Current Opinion in Neurology, 25(1), 86.
Renier, L., Collignon, O., Poirier, C., Tranduy, D., Vanlierde, A., Bol, A., … De Volder, A. G. (2005). Crossmodal activation of visual cortex during depth perception using auditory substitution of vision. NeuroImage, 26(2), 573–580.
Röder, B., Rösler, F., & Neville, H. J. (2000). Event-related potentials during auditory language processing in congenitally blind and sighted people. Neuropsychologia, 38(11), 1482.
Röder, B., Rösler, F., & Neville, H. J. (2001). Auditory memory in congenitally blind adults: A behavioral-electrophysiological investigation. Cognitive Brain Research, 11(2), 289–303.
Röder, B., Stock, O., Bien, S., Neville, H., & Rösler, F. (2002). Speech processing activates visual cortex in congenitally blind humans. European Journal of Neuroscience, 16(5), 930–936.
Röder, B., Teder-Sälejärvi, W., Sterr, A., Rösler, F., Hillyard, S. A., & Neville, H. J. (1999). Improved auditory spatial tuning in blind humans. Nature, 400(6740), 162–165.
Rousseau, J. J. (1762). Émile, ou de l'éducation. Tome VII, Oeuvres de Jean Jaques Rousseau. Amsterdam: Néaulme.
Russell, B. (1912). The problems of philosophy. New York: Barnes & Noble Books.
Sadato, N., Pascual-Leone, A., Grafman, J., Ibanez, V., Deiber, M., Dold, G., & Hallett, M. (1996). Activation of the primary visual cortex by Braille reading in blind subjects. Nature, 380(6574), 526–528.
Sampaio, E., Maris, S., & Bach-y-Rita, P. (2001). Brain plasticity: “Visual” acuity of blind persons via the tongue. Brain Research, 908(2), 204–207.
Sathian, K., & Zangaladze, A. (2001). Feeling with the mind's eye: The role of visual imagery in tactile perception. Optometry and Vision Science, 78(5), 276–281.
Schwartz, A. N., Campos, J. J., & Baisel, E. J. (1973). The visual cliff: Cardiac and behavioral responses on the deep and shallow sides at five and nine months of age. Journal of Experimental Child Psychology, 15(1), 86–99.
Seeing with Sound - The vOICe. (n.d.). Retrieved February 5, 2013, from http://www.seeingwithsound.com/
Segond, H., Weiss, D., & Sampaio, E. (2007). A proposed tactile vision-substitution system for infants who are blind tested on sighted infants. Journal of Visual Impairment and Blindness, 101(1), 32.
Shams, L., Wozny, D. R., Kim, R., & Seitz, A. (2011). Influences of multisensory experience on subsequent unisensory processing. Frontiers in Psychology, 2.
Sharma, J., Angelucci, A., & Sur, M. (2000). Induction of visual orientation modules in auditory cortex. Nature, 404(6780), 841–847.
Shu, H., Chen, X., Anderson, R. C., Wu, N., & Xuan, Y. (2003). Properties of school Chinese: Implications for learning to read. Child Development, 74(1), 27–47.
Siegle, J. H., & Warren, W. H. (2010). Distal attribution and distance perception in sensory substitution. Perception, 39(2), 208.
Singer, G., & Day, R. H. (1966). Spatial adaptation and aftereffect with optically transformed vision: Effects of active and passive responding and the relationship between test and exposure responses. Journal of Experimental Psychology, 71(5), 725–731.
Spence, C., Pavani, F., & Driver, J. (2004). Spatial constraints on visual-tactile crossmodal distractor congruency effects. Cognitive, Affective, & Behavioral Neuroscience, 4(2), 148–169.
Stein, B. E., & Stanford, T. R. (2008). Multisensory integration: Current issues from the perspective of the single neuron. Nature Reviews Neuroscience, 9(4), 255–266.
Stokes, M., Thompson, R., Cusack, R., & Duncan, J. (2009). Top-down activation of shape-specific population codes in visual cortex during mental imagery. The Journal of Neuroscience, 29(5), 1565–1572.
Stratton, G. (1899). The spatial harmony of touch and sight. Mind, 8(4), 492–505.
Striem-Amit, E., Guendelman, M., & Amedi, A. (2012). ‘Visual’ acuity of the congenitally blind using visual-to-auditory sensory substitution. PLoS One, 7(3), e33136.
Suddendorf, T., & Whiten, A. (2003). Reinterpreting the mentality of apes. New York: Psychology Press.
Sur, M., Angelucci, A., & Sharma, J. (1999). Rewiring cortex: The role of patterned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41(1), 33–43.
Thaler, L., Arnott, S., & Goodale, M. (2011). Neural correlates of natural human echolocation in early and late blind echolocation experts. PLoS One, 6(5), e20162.
Van Boven, R., Hamilton, R., Kauffman, T., Keenan, J., & Pascual-Leone, A. (2000). Tactile spatial resolution in blind Braille readers. Neurology, 54(12), 2230–2236.
Von Kriegstein, K., & Giraud, A. (2006). Implicit multisensory associations influence voice recognition. PLoS Biology, 4(10), e326.
Ward, J., & Meijer, P. (2010). Visual experiences in the blind induced by an auditory sensory substitution device. Consciousness and Cognition, 19(1), 492–500.
White, B. W. (1970). Perceptual findings with the vision-substitution system. IEEE Transactions on Man-Machine Systems, 11(1), 54–58.
White, B. W., Saunders, F. A., Scadden, L., Bach-y-Rita, P., & Collins, C. C. (1970). Seeing with the skin. Perception & Psychophysics, 7(1), 23–27.
Whiten, A., & Suddendorf, T. (2001). Meta-representation and secondary representation. Trends in Cognitive Sciences, 5(9), 378.
Wittenberg, G., Werhahn, K., Wassermann, E., Herscovitch, P., & Cohen, L. (2004). Functional connectivity between somatosensory and visual cortex in early blind humans. European Journal of Neuroscience, 20(7), 1923–1927.
Wozny, D., Seitz, A., & Shams, L. (2008). Learning associations between simple visual and auditory features. Journal of Vision, 8(6), 171.
Zhang, M., Weisser, V., Stilla, R., Prather, S., & Sathian, K. (2004). Multisensory cortical processing of object shape and its relation to mental imagery. Cognitive, Affective, & Behavioral Neuroscience, 4(2), 251–259.