Reading Speech from Still and Moving Faces: The Neural Substrates of Visible Speech

Gemma A. Calvert (University of Oxford) and Ruth Campbell (University College London)

Abstract

Speech is perceived both by ear and by eye. Unlike heard speech, some seen speech gestures can be captured in stilled image sequences. Previous studies have shown that in hearing people, natural time-varying silent seen speech can access the auditory cortex (left superior temporal regions). Using functional magnetic resonance imaging (fMRI), the present study explored the extent to which this circuitry was activated when seen speech was deprived of its time-varying characteristics.

In the scanner, hearing participants were instructed to look for a prespecified visible speech target sequence ("voo" or "ahv") among other monosyllables. In one condition, the image sequence comprised a series of stilled key frames showing apical gestures (e.g., separate frames for "v" and "oo" [from the target] or "ee" and "m" [i.e., from nontarget syllables]). In the other condition, natural speech movement of the same overall segment duration was seen. In contrast to a baseline condition in which the letter "V" was superimposed on a resting face, stilled speech face images generated activation in posterior cortical regions associated with the perception of biological movement, despite the lack of apparent movement in the speech image sequence. Activation was also detected in traditional speech-processing regions including the left inferior frontal (Broca's) area, left superior temporal sulcus (STS), and left supramarginal gyrus (the dorsal aspect of Wernicke's area). Stilled speech sequences also generated activation in the ventral premotor cortex and anterior inferior parietal sulcus bilaterally.

Moving faces generated significantly greater cortical activation than stilled face sequences, and in similar regions. However, a number of differences between stilled and moving speech were also observed. In the visual cortex, stilled faces generated relatively more activation in primary visual regions (V1/V2), while visual movement areas (V5/MT+) were activated to a greater extent by moving faces. Cortical regions activated more by naturally moving speaking faces included the auditory cortex (Brodmann's Areas 41/42; lateral parts of Heschl's gyrus) and the left STS and inferior frontal gyrus.

Seen speech with normal time-varying characteristics appears to have preferential access to "purely" auditory processing regions specialized for language, possibly via acquired dynamic audiovisual integration mechanisms in STS. When seen speech lacks natural time-varying characteristics, access to speech-processing systems in the left temporal lobe may be achieved predominantly via action-based speech representations, realized in the ventral premotor cortex.

INTRODUCTION

Speechreading is the ability to understand a spoken message by watching the speech actions of a talker. It has traditionally been seen as a topic of interest to clinical researchers interested in the implications of hearing loss, deafness (Jeffers & Barley, 1971), or hearing in noise (Sumby & Pollack, 1954), but is increasingly seen to have implications for understanding the mechanisms of speech and language processing more generally (Liberman & Whalen, 2000; Green, 1998; Dodd & Burnham, 1988). This is because all people who use speech are sensitive to its visual qualities, despite the fact that individual speechreading abilities can vary widely. For example, audiovisual speech perception is reliably better than the perception of speech that is simply heard (Sumby & Pollack, 1954), even when auditory speech is perfectly clear (Reisberg, McLean, & Goldfield, 1987).
Behavioral studies have shown that hearing infants are sensitive to audiovisual speech synchronization (Dodd, 1979) and to the fit of the seen and heard speech characteristics, including the discrimination of the vowel that is uttered, the identity of the speaker, and the type of utterance produced (Lewkowicz, 1996, 1998; Burnham, 1993; Kuhl & Meltzoff, 1982). Susceptibility to audiovisual speech illusions, whereby a dubbed utterance, such as seen "ga" with heard "ba," is perceived as "da" (McGurk & MacDonald, 1976), has also been demonstrated in infants (Rosenblum, Schmuckler, & Johnson, 1997). Studies in adults (Massaro, 1998) have shown that these audiovisual speech illusions are not an isolated phenomenon, but evidence of systematic integration of seen and heard speech in normal speech processing (e.g., Massaro, 1999). Indeed, adult audiovisual speech processing is sensitive to native heard language structure (Sekiyama, 1997; Sekiyama & Tohkura, 1993; Werker, Frost, & McGurk, 1992) and to a range of visible perceived talker characteristics (Green, Kuhl, Meltzoff, & Stevens, 1991). Only models of speech perception that go beyond auditory processing and implicate "supramodal" or "amodal" procedures can accommodate such findings (see Green et al., 1991; Summerfield, 1987, 1992; Fowler & Rosenblum, 1991). If speechreading is intrinsic to an understanding of speech processing, one important question that arises is: which of its visual stimulus dimensions or properties are utilized by the speech-processing system? Here, two contrasting possibilities are outlined, which have implications for understanding the cortical bases of speechreading and its relation to heard speech.

Time-Varying Information in Seen Speech

The pattern of movement made by a face can be captured by a point-light display using sparse (8–30) illuminated points on the facial surface, including the cheeks, lips, chin, and nose. Under these conditions, face features such as the mouth, lips, and tongue cannot be reliably identified. That is, the configural image properties of the face and mouth are impoverished or absent. However, when the speaker's face moves in speech such that the illuminated points follow the appropriate trajectories, these point-light displays can affect the accuracy with which auditory speech tokens are identified (Rosenblum, Johnson, & Saldaña, 1996; Rosenblum & Saldaña, 1996). An explanation of this effect may lie in the fact that the actions of the articulators have both visible and audible consequences that are likely to be highly correlated with each other because of their common source properties in the vocalizations of the talker.

One such property is the timing of changes in vocalization—the dynamic properties of speech are visible as well as audible in terms of their time-varying patterns (Munhall & Vatikiotis-Bateson, 1998). For example, increases in speech sound amplitude can be accompanied by visible indicators of change in the disposition of the visible articulators—such as the speed and acceleration of mouth opening. The correlation between some auditory and visual dynamic patterns in utterances can be high: when both the visual and auditory streams have each been degraded to a level at which speech cannot be identified, the audiovisual stream may yet be understandable because of redundancy in the dynamic patterning of speech across the two modalities. Several studies show an influence of vision on auditory speech even when the individual speech events cannot be discriminated by eye. For instance, Jordan and Sergeant (2000) showed that vision could affect the report of auditory syllables at viewing distances too great for the visual syllable to be identified, yet sufficiently close for the seen action to be perceived as plausibly congruent with the heard syllable. Grant and Seitz (2000) have shown that correlated information from the face in motion improves detection of noisy auditory messages even though neither visual nor auditory segments could be identified reliably on their own. Such demonstrations suggest that the visible dynamic signature of a spoken utterance is informative when segmental properties of speech within either the visible or auditory speech stream are not fully accessible. Its utility lies in the redundancy of information perceived from the talking head, in particular in the common dynamic properties of the utterance, whether seen or heard.
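This notion of shared dynamic patterning can be given an illustrative formalization (a sketch for exposition; the notation is ours and is not drawn from the studies cited). Let A(t) denote the acoustic amplitude envelope of an utterance and L(t) a visible kinematic measure such as lip aperture, each sampled at t = 1, ..., T. The degree to which the two streams covary is then captured by their correlation:

$$ r_{AL} \;=\; \frac{\sum_{t=1}^{T}\bigl(A(t)-\bar{A}\bigr)\bigl(L(t)-\bar{L}\bigr)}{\sqrt{\sum_{t=1}^{T}\bigl(A(t)-\bar{A}\bigr)^{2}}\;\sqrt{\sum_{t=1}^{T}\bigl(L(t)-\bar{L}\bigr)^{2}}} $$

When r_{AL} is high, a degraded rendering of either stream still carries information about the time course of the other; this cross-modal redundancy is one way to construe the detection advantage reported by Grant and Seitz (2000).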
A dynamic systems (time-varying) account of speechreading, therefore, does not require the perceiver to identify a particular image component of the speaking face. Even if the form of the facial image is underspecified, vision can nevertheless improve speech processing. However, a completely contrary case can also be made for visible speech processing—that good image processing in the absence of well-specified time-varying information is an important feature of multimodal speech.

Configural (Image-Based) Considerations in Speechreading

Although the time-varying regularities in seen and heard speech are used in speech processing, human sensitivity to audiovisual synchronization is often quite poor. Imperfect time-streaming of videoclips in digitized audiovisual speech segments, where synchrony of the seen and heard message is poorly preserved, may not be noticeable. Campbell and