Perception & Psychophysics
2000, 62 (7), 1405-1412

Effects of talker variability on speechreading

DEBORAH A. YAKEL, LAWRENCE D. ROSENBLUM, and MICHELLE A. FORTIER
University of California, Riverside, California

The effects of talker variability on visual speech perception were tested by having subjects speechread sentences from either single-talker or mixed-talker sentence lists. Results revealed that changes in talker from trial to trial decreased speechreading performance. To help determine whether this decrement was due to talker change, and not to a change in superficial characteristics of the stimuli, Experiment 2 tested speechreading from visual stimuli whose images were tinted by a single color or by mixed colors. Results revealed that the mixed-color lists did not inhibit speechreading performance relative to the single-color lists. These results are analogous to findings in the auditory speech literature and suggest that, like auditory speech, visual speech operations include a resource-demanding component that is influenced by talker variability.

The relationship between speaker and speech recognition has been explored extensively in the last 40 years. Many theories of auditory speech perception propose that speech input undergoes a normalization process in which talker-specific attributes are extracted and discarded, leaving the phonetic material needed for the perception of speech segments (Halle, 1985; K. Johnson, 1990). The concept that this talker normalization process might also extend to audiovisual speech is implicit in other speech perception theories (Fowler, 1986; Liberman & Mattingly, 1985; McClelland & Elman, 1986). However, no research has addressed this question for visual speech. By borrowing the recent methods used in the auditory speech literature (Mullennix, Pisoni, & Martin, 1989; Sommers, Nygaard, & Pisoni, 1994), the present study examines the degree to which visual talker normalization influences speechreading (lipreading).

The issue of visual speech normalization would seem important for two reasons. First, there is accumulating evidence that visual speech perception is an important component of the general speech perception process. While it is clear that speechreading can be useful for the hearing impaired, it is also known that visual speech is used by individuals with good hearing when they are faced with a noisy environment (e.g., MacLeod & Summerfield, 1987), speech with a heavy foreign accent, or speech conveying complicated subject matter (Reisberg, McLean, & Goldfield, 1987). There is also evidence that access to visual speech is necessary for normal speech development (Mills, 1987). Finally, in one experimental context using discrepant audiovisual speech, integration of visual speech was shown to be automatic and mandatory for most subjects (McGurk & MacDonald, 1976).

The second reason visual speech normalization would seem an important issue is that it bears on a general theoretical question in cognitive science. The question of modularity (see, e.g., Fodor, 1983), or whether particular cognitive functions exhibit behavioral and anatomical specialization, has been central to modern cognitive science. Among other characteristics, modules are considered to be informationally encapsulated in that they have access to only the information/processes needed for their particular function. Interestingly, two of the prototypical modules cited by theorists are those for speech and face perception (e.g., Ellis, 1989; Fodor, 1983; Liberman & Mattingly, 1985). In this sense, visual speech perception would seem to pose a particularly interesting theoretical problem: Are processes enlisted for visual speech perception associated with speech, face, or both functions? The modular characteristic of information encapsulation would seem to suggest that all language recognition, including visual speech perception, would discard talker-specific facial properties. Testing the influences of talker/face normalization on visual speech perception can help address this question. We now turn to the auditory speech normalization literature in order to borrow conceptual and methodological tools.

One way the effects of auditory speech normalization have been examined is through measuring the processing costs incurred during listening to multiple- versus single-talker stimuli. It is known that vowel and word stimuli are easier to identify when spoken by a single talker than when the talker changes from trial to trial (e.g., Strange, Verbrugge, Shankweiler, & Edman, 1976; Verbrugge, Strange, Shankweiler, & Edman, 1976). Analogous multiple-talker effects have been observed for latencies in vowel categorizing and matching (Summerfield & Haggard, 1973). With regard to more complex stimuli, Creelman (1957) found that listeners identify words embedded in noise less accurately when they are spoken by multiple versus single talkers. Mullennix et al. (1989) replicated these findings using a larger set of words both embedded in noise and in the clear. In one experiment, 11 subjects received a list of 68 words produced by one of the talkers, while another 11 subjects received a list of 68 words derived from 15 talkers. Both lists were presented against varying degrees of white noise. Word identification was more accurate for the single- than for the mixed-talker list. Concerned that the effects might be attributable to the degraded nature of the stimuli, Mullennix et al. conducted a second experiment with undegraded words and measured response latency for a naming task. Performance on both identification and latency measures was significantly worse for the mixed- than for the single-talker list.

Mullennix and his colleagues (1989) offered two explanations for these results. First, the effects of talker variability could be due to speaker normalization processes operating at a very early stage in the acoustic-phonetic analysis. The normalized output would then be passed on to higher level language processes without talker-specific information. On the basis of this account, one would expect processing costs due to talker variability to occur early in speech perception but not during higher level processing.

Alternatively, talker variability may affect performance because talker-specific features are retained for some time, rather than discarded. Retaining such talker-specific features for mixed-talker lists would incur a greater processing cost than it would for single-talker lists. Potentially, talker-specific features from a previous item could produce interference when a subsequent item with different talker-specific features is perceived (Mullennix et al., 1989). From this account, the effects of talker-specific dimensions could appear for higher level functioning.

In fact, other research has supported this latter explanation. There is now evidence that talker-specific information can be retained and that it can act to facilitate speech recognition. For example, Nygaard, Sommers, and Pisoni (1994) trained two groups of subjects to recognize the voices of 10 talkers over a 9-day period. One group was then tested with novel words embedded in noise spoken by unfamiliar talkers, while the other group was tested with the novel words spoken by the 10 familiar talkers used during the training phase. The results revealed that familiarity with the talker significantly improved word identification accuracy. There is also evidence that explicit and implicit memory for words is facilitated when study and test items are presented in the same voice (Church & Schacter, 1994; Craik & Kirsner, 1974; Palmeri, Goldinger, & Pisoni, 1993). Finally, research on form-based priming has revealed that priming effects are contingent on prime and target being presented in the same voice (Saldana & Rosenblum, 1994).

To summarize, a good amount of recent evidence not only supports Mullennix et al.'s (1989) suggestion that talker-specific information can be retained during phonetic processing, but also suggests that talker-specific information can facilitate speech recognition. In this sense, any "normalization" process that might occur would not completely discard talker-specific information. More generally, recent evidence suggests that the functions of phonetic recognition and talker identification are not as independent as once thought (e.g., Halle, 1985; K. Johnson, 1990). In fact, Remez, Fellowes, and Rubin (1997) have proposed that both functions could use similar acoustic primitives, a contention very different from those of traditional theories of speech and speaker perception (e.g., Pollack, Pickett, & Sumby, 1954; Van Lancker, Kreiman, & Emmorey, 1985).

While research on the effects of auditory talker normalization is abundant, there seems to be little analogous research in the visual speech literature. It is known that speakers vary widely in their visible speech movements, which bears on how difficult they are to speechread (Demorest & Bernstein, 1992; Kricos & Lesner, 1982; Montgomery, Walden, & Prosek, 1987). However, it is not known which, if any, talker-specific facial information might be discarded during visual speech perception. Potentially, a visual speech normalization process would strip away phonetically irrelevant information about the face, such as eye color, skin tone, and featural information beyond the mouth. The end product of this normalization might be the retention of only phonetically relevant dimensions, including positions and movements of the lips, tongue, and jaw. Presumably, this process of normalization would take some time, so that speechreading from a multiple-talker list would be more difficult than that from a single-speaker list.

The following experiments were designed to examine the effects of visual speech "normalization" by testing the influences of talker variability on speechreading. The experiments borrow from the method of the Mullennix et al. (1989) auditory speech study discussed above. The first experiment compares subjects' speechreading ability from stimuli that are either from a single talker or from multiple talkers. If talker variability has detrimental effects on speechreading, as it does for auditory word recognition, then speechreading accuracy should be worse for subjects in the multiple-talker condition.

This research was supported by NSF Grant SBR-9617047 awarded to L.D.R. We gratefully acknowledge the assistance of Elizabeth Alberto, Mike Gordon, Sheila Kirby, Anjani Panchal, Julia Rogenski, and Christi Royster, as well as the helpful comments of two anonymous reviewers and the UCR cognitive science group. D.A.Y. is currently at Orange Coast College. M.A.F. is currently in the Department of Psychology at San Diego State University. Correspondence should be addressed to L. D. Rosenblum, Department of Psychology, University of California, Riverside, CA 92521 (e-mail: rosenblu@citrus.ucr.edu).

Copyright 2000 Psychonomic Society, Inc.

EXPERIMENT 1

Method
Subjects. Sixty-two undergraduates at the University of California, Riverside, participated in the experiment and were either paid $10 or were given course credit as part of a requirement of an introductory psychology course. All subjects reported normal or corrected-to-normal vision, good hearing, and were native speakers of English with no prior speechreading experience. Prescreening tests (lasting 10 min) were given to all subjects to ensure a minimal

speechreading ability for subjects used in the experiment. For the prescreening task, 2 talkers (1 male and 1 female) were recorded, each articulating 20 different sentences from the Revised Bamford-Kowal-Bench Sentence Test (BKB; Bench & Bamford, 1979). The recording and editing methods were the same as discussed below. The prescreening tape consisted of two blocks (one for the male talker and one for the female talker), each consisting of two consecutive presentations of 20 different sentences. After the second presentation of each sentence, subjects were asked to repeat whatever words from the sentence they could speechread. A keyword scoring technique was applied (see below). If a subject passed a criterion of 32.5% recognition accuracy of keywords, they were asked to participate in the critical portion of the experiment. (This criterion was set at 1 SD below the mean on the basis of other speechreading research [Bernstein, personal communication, 1997] and a pilot study that tested these particular stimuli with 40 subjects.) On the basis of this prescreening criterion, 22 of the original 62 subjects were eliminated from the study.

Stimuli. Ten speakers (5 men and 5 women) were recorded in a fully lit room with no alteration to their faces. The speakers were recorded articulating 100 BKB sentences (Bench & Bamford, 1979; Rosenblum, Johnson, & Saldana, 1996), four to five times each. BKB sentences are short (5-7 words), simply constructed sentences (e.g., "The football game is over"). BKB sentences are scored using the Loose Keyword Scoring method (Bench & Bamford, 1979), for which a point is given for each of three key words recognized (e.g., football, game, over), with morphological errors permitted. None of the sentences or speakers used in the prescreening test were used in the critical stimuli.

The speakers were instructed to articulate clearly, but not to exaggerate their movements. A Panasonic PVS350 camcorder was used to record the initial videotapes. The speakers were seated 8 ft in front of the camera. The camera was positioned so that the recorded image consisted of the speaker's entire head and neck.

Using a Panasonic 7510 video player and a Panasonic 7500A recorder, 11 presentation tapes were made using the 10 different speakers. Ten single-talker tapes were produced, each consisting of 1 of the 10 talkers articulating the same 90 BKB sentences. Each sentence was actually recorded onto the tape twice, in succession. A 1-sec interstimulus interval (ISI) was used between the first and second presentations of a sentence, and a 3-sec ISI was used between different sentences. The sentence pairs were separated into six blocks of 15. A 15-sec ISI separated blocks.

A mixed-talker tape was produced composed of all of the 10 talkers, each articulating (a different set of) 9 of the 90 sentences. Again, each sentence was placed on the tape twice, consecutively. The 9 sentences were chosen randomly from the talkers with the constraint that the sentence order was the same as that used in the single-talker tapes, and no 1 talker could be seen producing two different sentences consecutively (ensuring trial-to-trial variation). The ISIs and blocks used for the single-talker tapes were also used for the mixed-talker tape.

Procedure. The 40 subjects who passed the prescreening test were randomly assigned to either the mixed-talker or single-talker conditions. The 20 subjects in the mixed-talker condition were all presented with the mixed-talker tape. The 20 subjects in the single-talker condition saw one of the single-talker presentation tapes so that 2 subjects saw each tape (Mullennix et al., 1989). Comparing overall performance across all subjects in the single-talker condition with subjects in the mixed-talker condition ensures that observed differences between conditions would not be based on differences in how easy the talkers were to speechread.

Subjects were seated at a table 5 ft in front of a Panasonic 21-in. video monitor. The only source of illumination was the television monitor and one small light that was focused away from the monitor but shed enough light for the experimenter to record the subject's response.

Subjects were told that they would be seeing a speaking face and that they were to attempt to speechread sentences. They were informed that each sentence would be presented twice, and after the second presentation the experimenter would pause the tape and allow the subject to respond. The subjects were instructed to respond verbally to any words they could recognize. The experimenter recorded the key words that the subject verbalized by circling correct responses on a response sheet. The critical phase of the experiment lasted approximately 30 min for each subject.

Results and Discussion
The data were scored for percent correct identification of the three keywords in each of the BKB sentences (Bench & Bamford, 1979). An average of 55.8% (11.1) keywords were identified in the single-talker condition, and 47.9% (11.75) keywords were identified in the mixed-talker condition. This difference for talker condition was significant at the p < .05 level [F(1,20) = 4.86, p = .034]. These results suggest that a change in the talker from trial to trial can hinder speechreading performance.

While the decrement imparted by the mixed-talker list might seem relatively small (7.9%), it is well within the range of decrement values observed by Mullennix et al. (1989), based on their auditory mixed-list condition. Their set of four experiments, which tested word identification of both degraded and clean auditory stimuli, displayed mean mixed-talker decrement values between 4.4% and 21.0%, with an overall average of 9.3%. Clearly, a number of critical differences exist between the Mullennix et al. and present experiments (e.g., auditory vs. visual material; words vs. sentences; 10 vs. 15 talkers), and this similarity in decrement could be coincidental. Still, it could be that the time course of talker normalization and/or retaining of talker-specific information is similar across auditory and visual speech modalities.

In summary, the results of the first experiment demonstrate that speechreading performance from a variable-talker list is worse than performance from a single-speaker list. However, it is not clear that the detriments produced are a result of talker variability per se, or are a result of superficial stimulus variability caused by a general stimulus change from trial to trial. It could be that any trial-to-trial change in the appearance of the stimuli would demand more attention, and thus take away from the attentional resources that could be used for speechreading. A similar consideration was entertained by Sommers et al. (1994) with regard to the auditory mixed-talker effects observed by Mullennix et al. (1989; see also Mullennix & Pisoni, 1990). Sommers et al. tested whether trial-to-trial variation in a phonetically irrelevant, talker-unrelated dimension (overall amplitude) would also induce performance decrements relative to single-amplitude lists. They observed that, unlike the talker characteristics of speaker identity and speaking rate, variation in overall amplitude did not produce decrements in performance. They concluded that variability-induced decrements are not due to the effects of general stimulus uncertainty (and increased attention), but instead, to changes in talker-related dimensions that can have acoustic-phonetic ramifications. Other researchers have used a similar overall amplitude manipulation to support the same conclusion about memory effects of spoken words (Bradlow, Nygaard, & Pisoni, 1999; Church & Schacter, 1994; Nygaard & Burt, 1996).

On the basis of the analogous consideration of our visual speech effects, a second experiment was designed to examine whether more superficial trial-to-trial changes in our visual stimuli could produce decrements in speechreading performance. For this purpose, 10 different color tints were superimposed onto the images of 2 of our talkers' sentence stimuli. The color dimension was chosen for the control manipulation because, while it is doubtful that it has an influence on visual-phonetic (viseme) perception, image color is known to be relevant to face perception. Color information has been shown to aid recognition of famous faces (Lee & Perrett, 1997), as well as judgments of face gender (Hill, Bruce, & Akamatsu, 1995) and age (Burt & Perrett, 1995). More generally, color information can enhance nonface object identification (see, e.g., Humphreys, Goodale, Jakobson, & Servos, 1994; Price & Humphreys, 1989; but see Biederman & Ju, 1988) as well as motion perception (e.g., Edwards & Badcock, 1996; Gegenfurtner & Hawken, 1996; but see Livingstone & Hubel, 1987, and Ramachandran & Gregory, 1978).

In Experiment 2, subjects were asked to speechread sentences under both single- and mixed-color conditions. This procedure enabled the assessment of the effects of talker and phonetically irrelevant visual stimulus variability. In addition, a second group of subjects was asked to speechread from both single- and multiple-talker lists. Beyond providing a comparison for the single/multiple color conditions, the latter group provides a test of whether the multiple-talker effects observed in Experiment 1 would replicate under a within-subjects design.

EXPERIMENT 2

Method
Subjects. Eighty-six undergraduates at the University of California, Riverside, were given credit as part of a requirement of an introductory psychology course. All reported normal or corrected-to-normal vision, good hearing, and were native speakers of English with no reported speechreading experience. Subjects ranged in age from 17 to 26 years. Sixty subjects passed the prescreening criterion used for Experiment 1 and were used in the main experiment.

Stimuli. The 11 stimulus presentation tapes used in Experiment 1 were also used in Experiment 2. In addition, 22 new tapes were made using the stimuli from two of the single-talker tapes of Experiment 1. Two sets of color tapes were produced: one set of 11 was derived from the talker that provided the best mean speechreading scores in Experiment 1 (65.7% correct, as calculated across the 20 subjects in the multiple-speaker condition); the other set was derived from the speaker that provided the worst mean speechreading scores in Experiment 1 (35.3% correct). These 2 speakers were chosen for the color control condition to test whether any differences observed between single/mixed speaker and color conditions were a function of overall ease of speechreading. For each of these 2 speakers, 90 BKB sentences were captured onto a computer using the software program Adobe Premiere. To make the 10 single-color tapes, the Adobe Premiere program was used to apply 10 different color tint filters (black, brown, olive, emerald, red, royal blue, aqua, purple, light pink, and hot pink) to all 90 sentences. Tint values were established in RGB color space as various proportions of red, green, and blue values, each ranging from 0 to 255 (see Lee & Perrett, 1997). These values for each tint are listed in Table 1. The tints were applied to the entire picture frame (including face) but were transparent enough to allow the speaking face to be seen.

Table 1
RGB Values for the Color Tints Applied to Images in Experiment 2

Color name       R    G    B
Bright pink     255    0  173
Light pink      255   99  163
Purple          253   83  255
Royal blue       93    0  255
Aqua blue        66  208  255
Emerald green    35  255   46
Olive green      83  118    0
Red             255    0   14
Brown           124   44    0
Black            14    4    0
Note—See text for details.

In order to establish that the color tints were easily distinguishable, a pilot experiment was conducted to test tint discrimination. For this pilot experiment, all 10 different tinted versions of the sentence "The football game is over" were used from the poor speaker's sentence set. Multiple instances of these tokens were recorded onto a presentation videotape arranged in 80 total pairs. Forty of these pairs were composed of tokens with the same color tint, while 40 of the pairs were composed of two tokens with different color tints. This second set was composed of one instance (ordering) of all possible combinations of two tints. All 80 pairs were recorded onto a presentation tape in random order with a 1-sec ISI between tokens in each pair, and a 3-sec ISI between pairs. Ten new subjects with (self-reported) good color vision were each paid $5 to judge the stimuli. Subjects were asked to watch each pair of sentences and to judge whether the color tints of each token of a pair were the same or different. They indicated their judgments by circling "same" or "different" on a response sheet. Mean percent correct for this task was 98.1% (with subject means ranging from 96.3% to 100%), and no single-color pair seemed more difficult than others to discriminate. This pilot study shows that the color tints were highly discriminable and were therefore useful for testing the influence of superficial trial-to-trial changes on speechreading.

For the critical speechreading stimuli, the 10 sets of single-color tinted sentences were each recorded onto a presentation tape. The 90 sentences for each color tape were recorded in the same random order and had the same block organization as in Experiment 1. Each sentence was presented twice with a 1-sec ISI, and a 3-sec ISI separated novel sentences. A 15-sec ISI separated blocks. A mixed-color presentation tape was also made for each of the 2 speakers. To produce the mixed-color presentation tapes, nine different sentences from each of the single-color sets were recorded onto a new tape in the same random order as that of the single-color tapes, with the same ISIs, blocks, and ordering constraints that were applied to the mixed-talker tape.

Procedure. The room, viewing distance, lighting, and equipment were the same as in Experiment 1. Subjects were randomly assigned to one of two groups. Twenty subjects participated in the single/mixed-talker condition, and 40 participated in the single/mixed-color condition; 20 were tested with the stimuli derived from the best talker, and 20 were tested with the stimuli derived from the worst talker.

In the single/mixed-talker condition, 10 (of the 20) subjects were presented with the first block of 45 sentences from one of the single-talker presentation tapes used in Experiment 1, followed by the second block of 45 sentences from the mixed-talker presentation tape used in Experiment 1. For the remaining 10 subjects of this group, presentation order was counterbalanced so that the first block of the mixed-talker tape was seen first, followed by the second block of one of the single-talker tapes. As in Experiment 1, each of the single-talker tapes was seen by 2 of the subjects. The instructions and task were identical to those of Experiment 1.

For each of the single/mixed-color conditions, 20 (of the 40) subjects were presented with the first block of 45 sentences from one of the single-color presentation tapes, followed by the second block of 45 sentences from one of the mixed-color presentation tapes. For the remaining 20 subjects of each of these groups, presentation order was counterbalanced so that the first block of one of the mixed-color tapes was seen first, followed by the second block of one of the single-color tapes. Each single-color tape was seen by 2 of the subjects in each group. The instructions and task were identical to those of Experiment 1.

Results and Discussion
As before, the data were analyzed in terms of keyword correct identification accuracy. In the single-talker condition, subjects identified 51.01% (11.89) of the keywords correctly, while in the multiple-speaker condition, 42.06% (8.1) of the keywords were correctly identified. For the best speaker, during the single-color condition, subjects identified 63.74% (11.49) of the keywords correctly, while in the mixed-color condition, 64.15% (13.21) of the keywords were correctly identified. For the worst speaker, subjects correctly identified 36.6% (13.38) of the keywords in the single-color condition and 36.4% (10.44) of the keywords in the multiple-color condition. These mean values are similar to the mean percent correct values for these speakers in Experiment 1 (65.7% and 35.3% for the best and worst speakers, respectively), suggesting that adding the color tints did not have an overall effect on speechreading.

A three-way analysis of variance (ANOVA) was conducted on the following factors: condition (talker, color best speaker vs. color worst speaker), variability (single vs. mixed), and order (single first vs. mixed first), with alpha = .05. Two significant main effects were found. First, there was a main effect of variability [F(1,54) = 14.814, p = .0003], with performance in the single speaker/color conditions (50.25%) better than performance in the multiple speaker/color conditions (47.54%). Second, there was a main effect of condition [F(2,54) = 32.730, p = .0001], with subjects in the best speaker/color condition (63.95%) performing better than those in the worst speaker/color condition (46.54%) or the single/multiple-speaker conditions (36.5%). There was no main effect of order [F(2,54) = .681, p = .5103].

A significant interaction was observed for variability and condition [F(2,54) = 22.470, p = .0001]. Accordingly, pairwise comparisons were conducted on single- and mixed-list performance for the talker and the two color conditions. There was a significant effect of variability in the talker condition [F(1,37) = 7.545, p = .0092], with performance in the single-talker condition (51.01%) better than performance in the multiple-speaker condition (42.06%). This replicates the findings of Experiment 1 and demonstrates that talker variability can inhibit speechreading performance. A significant difference was not found for variability in the color condition using the best speaker [F(1,19) = .011, p = .9170]. Nor was there a significant difference of color variability using the worst speaker [F(1,19) = .003, p = .9537] (see means above).

The results of the color manipulation suggest that, unlike talker variability, phonetically irrelevant stimulus variability does not produce detrimental effects on speechreading performance. This finding is analogous to the findings of Sommers et al. (1994), who found that the phonetically irrelevant dimension of overall stimulus amplitude variability also had no influence on auditory speech identification. However, conclusions about our color manipulation must be made with some caution, since it is difficult to compare across manipulations that vary different stimulus dimensions. As discussed above, it is known that image coloration does influence face perception (Burt & Perrett, 1995; Hill et al., 1995; Lee & Perrett, 1997). Also, our pilot experiment revealed that the color tints used in the present study were highly discriminable. Still, it could be that the relative salience of the color versus speaker dimensions, or the range over which they each varied, was not perceptually equivalent. This issue of relative perceptual salience was also considered by Sommers et al. in discussing their mixed-amplitude versus mixed-talker effects. These authors suggest that future methods that provide for more direct comparisons of perceived similarity (e.g., multidimensional scaling) should be implemented to address this issue. The same suggestion applies to the present set of visual speech stimulus manipulations.

GENERAL DISCUSSION

The results of the present study demonstrate that talker variability produces detrimental effects on visual speech perception. These results suggest that some resource-demanding component of the visual speech process is activated when the talker changes from trial to trial. These results are consistent with results of previous research on auditory speech perception (e.g., Mullennix et al., 1989; Strange et al., 1976; Summerfield & Haggard, 1973; Verbrugge et al., 1976), and in fact show a decrement similar in magnitude to the auditory speech effects.

In discussing the present visual speech results, it would seem useful to turn to explanations offered for the analogous auditory mixed-talker results. As stated, Mullennix et al. (1989) offered two possible ways that talker variability may demand cognitive resources. First, mixed-talker stimuli could invoke a "normalization" process in which some cognitive effort is taken to strip away phonetically irrelevant information. Alternatively, talker-specific features could actually be encoded with phonetic information, a process that could be more resource demanding under mixed-talker than single-talker conditions. As noted, subsequent research in novel word recognition, explicit and implicit memory, and form-based priming has supported this latter interpretation (Craik & Kirsner, 1974; Nygaard et al., 1994; Palmeri et al., 1993; Saldana & Rosenblum, 1994).

More recently, other theories have been offered to explain the speech-speaker contingencies in the auditory speech literature (e.g., Church & Schacter, 1994; Remez et al., 1997; Sheffert & Fowler, 1995). However, for the present visual speech "normalization" results, the two alternative explanations of Mullennix et al. (1989) provide a good starting point for evaluating our findings. First, the performance decrements observed for the mixed-talker lists could be due to a stripping away of phonetically irrelevant visual speech dimensions. Features ranging from skin tone to eye color to visible talker-specific articulatory (idiolectical) style could be discarded to leave relevant visual speech information. In the second account, the stimulus dimensions that are relevant to face recognition could be retained for some period and induce longer term effects on visual speech recognition.

The present experiments do not distinguish between these explanations. Future research, analogous to the work in auditory speech (e.g., Nygaard et al., 1994), should be conducted to examine this question. However, there currently exist some findings in the visual speech literature that bear on this question.

First, Walker, Bruce, and O'Malley (1995) found that audiovisual speech integration was influenced by the gender congruency between the face and voice depending on whether subjects were familiar with the face. Specifically, when the face and voice were incongruent, sub-

that these effects might not be completely analogous to those observed for auditory speech (e.g., Craik & Kirsner, 1974; Palmeri et al., 1993). Sheffert and Fowler (1995) used a continuous recognition paradigm to test for talker facilitation of word recognition from audiovisual stimuli. The audiovisual stimuli were dubbed to allow for independent manipulation of auditory and visual talker information. Sheffert and Fowler were able to replicate the previous auditory talker facilitation effects (e.g., Palmeri et al., 1993); they observed that recognition of words was better if the auditory talker was the same across presentations. However, they found no significant recognition facilitation for visual talker. In follow-up experiments, Sheffert and Fowler asked subjects to explicitly judge whether the visual talker was the same across presentations of a given word. They found that subjects could perform this task at better-than-chance levels. They concluded that although visual talker information can be retained along with spoken words, it does not act to facilitate word recognition. They added that potentially, talker-specific auditory and visual information are preserved differently.

In a similar study, Saldana et al. (1996) also used a continuous recognition method to test for facilitory effects of visual talker information using audiovisual stimuli. In order to force subjects to attend more to the visual speech information, Saldana et al. embedded the auditory speech in varying degrees of noise. An additional manipulation involved presenting the faces either articulating along with the auditory words or in an unmoving, static configuration. Like Sheffert and Fowler (1995), Saldana et al.
found auditory, but not visual, talker facil­ jects that were personally familiar with the faces were itation for word recognition (across speech in noise con­ less susceptible to visual influences on "heard" speech ditions). Interestingly, however, Saldana et aI. did find syllables. This finding suggests that, as for auditory some improved recognition performance in dynamic ver­ speech, talker/face-specific information can be retained sus static visual conditions when the same talker was used to bear on subsequent visual speech perception. More re­ across presentations. They concluded that visual artic­ cently, Schweinberger and Soukup (1998) asked subjects ulatory information-versus simple facial information to perform speeded classification oftwo (/i/ and (conveyed statically)-might be encoded in long-term /u/) portrayed in facial photographs. They found that re­ memory. action times were faster when subjects were personally The research by Sheffert and Fowler (1995) and Sal­ familiar with the faces being portrayed (but see Campbell, dana et aI. (1996) suggests that visual talker information Brooks, De Haan, & Roberts, 1996). Although these find­ can be encoded with lexical items, and that this infor­ ings were based on classification ofjust two visual vow­ mation might take an articulatory form. Still, it is unclear els portrayed in static photographs, they are compatible whether this talker-specific information can facilitate rec­ with our findings: To the degree that our single-talker ognition, as it can for auditory speech. Future research condition provided familiarization with a specific talker, should be designed to further address this issue, possibly our findings can be interpreted as evidence that famil­ using conditions that further force subjects to rely on vi­ iarityfacilitated visual speech perception (ofspeechread sual speech information. Potentially, this could be ac­ sentences). 
In fact, performance facilitation via speaker complished with speechreading-as opposed to audio­ familiarization in the single-talker list provides an alter­ visual---eonditions and/or with hearing-impaired subjects, native interpretation to the notion that the multiple-talker who naturally rely more on visual speech. Circumstances lists inhibited visual speech. Still, either interpretation in which subjects are required to recover visual speech allows the possibility that individual talker characteristics information and/or have more experience speechreading playa role in visual speech perception. might reveal stronger evidence for visual talker facilitation There is also some preliminary evidence that visual ofrecognition memory. This research could also help ex­ talker characteristics can be retained for word recogni­ amine whether the similarity in decrement observed for tion tasks (Saldana, Nygaard, & Pisoni, 1996; Sheffert & mixed-speaker stimuli across modalities (see above) is Fowler, 1995). To date, however, the evidence suggests based on similar strategies. EFFECTS OF TALKER VARIABILITY ON SPEECHREADING 1411
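As a back-of-envelope illustration (not part of the original analysis), the asymmetry between the talker and color manipulations can be seen directly in the condition means reported in the Results: the single-versus-mixed decrement is large for talker lists but essentially absent for both color lists.

```python
# Mean percent-correct values reported in the Results above.
single_talker, mixed_talker = 51.01, 42.06           # talker condition
single_color_best, mixed_color_best = 63.74, 64.15   # best speaker, color condition
single_color_worst, mixed_color_worst = 36.6, 36.4   # worst speaker, color condition

# Decrement = single-list performance minus mixed-list performance,
# in percentage points; positive values indicate a mixed-list cost.
talker_decrement = round(single_talker - mixed_talker, 2)
color_decrement_best = round(single_color_best - mixed_color_best, 2)
color_decrement_worst = round(single_color_worst - mixed_color_worst, 2)

print(talker_decrement)       # 8.95  -> sizable mixed-talker cost
print(color_decrement_best)   # -0.41 -> no mixed-color cost
print(color_decrement_worst)  # 0.2   -> no mixed-color cost
```

The roughly 9-percentage-point talker decrement, against near-zero color decrements, is the pattern the significant variability-by-condition interaction reflects.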

To summarize, the few studies that have examined the effects of talker familiarity on visual speech show some evidence that talker-specific information can be retained. Whether this evidence takes the form of inhibition of visual speech influences on auditory syllable identification (Walker et al., 1995), recognition of the visual talker speaking specific words (Sheffert & Fowler, 1995), or some subtle facilitation with articulatory versus general facial information (Saldana et al., 1996), it suggests that talker-specific information can be retained for longer durations than was observed in the present experiment (less than 1 h). Thus, we expect that of the two interpretations borrowed from Mullennix et al. (1989), the correct interpretation of our single versus mixed visual talker results is the second: that is, visual talker-specific information is retained, thereby incurring a greater processing cost for mixed-talker stimuli. Future research will reveal whether any of the newer theories of auditory speech-talker contingencies (e.g., Church & Schacter, 1994; Remez et al., 1997; Sheffert & Fowler, 1995) are also applicable to visual speech (cf. Rosenblum, Yakel, & Green, 2000).

If future research continues to suggest that talker-specific information is retained for visual speech perception, then the more general question of the relationship between visual speech and face perception (and modularity) will need to be reevaluated. Most of what is currently known about this relationship comes from neuropsychological research (see Ellis, 1989, for a review). Initial evidence showed a dissociation of these functions in prosopagnosics who can speechread, and aphasics who have trouble speechreading but no trouble recognizing faces (Campbell, Landis, & Regard, 1986). Additional evidence has come from normal subjects, who often show a left-hemisphere advantage for speechreading (Campbell et al., 1986) and a right-hemisphere advantage for face recognition (e.g., Young, Hay, McWeeney, Ellis, & Barry, 1985; see Ellis, 1989, for a review). However, recent critiques and counterevidence for both the lesion and normal subject research challenge the evidence for a neuropsychological dissociation (e.g., Hillger & Koenig, 1991; Rosenblum & Saldana, 1998; Sergent, 1982). Thus, it is too early to determine exactly how neuropsychological findings will bear on the visual speech/face perception question.

A few recent behavioral studies are also relevant to this issue. It has long been known that inverting the image of a face makes it disproportionately difficult to recognize (see Valentine, 1988, for a review). This facial inversion effect has been interpreted as support for a specialization of facial processing. However, recent studies show that facial inversion also inhibits speechreading and audiovisual speech integration (Green, 1994; Jordan & Bevan, 1997; Massaro & Cohen, 1996). Furthermore, there is evidence that visual speech perception, like face perception, is hindered by idiosyncratic disruptions of wholistic/configural information through the "Margaret Thatcher effect" (Rosenblum et al., 2000; Thompson, 1980; Valentine & Bruce, 1985). Thus, recent behavioral evidence suggests that the processes of visual speech and face perception might not be completely independent. Research is currently under way in our laboratory to further investigate the behavioral and neuropsychological relationship between these functions (e.g., J. A. Johnson & Rosenblum, 1996; Yakel & Rosenblum, 1996).

The results of the present study provide the first demonstration that the effects of talker variability found in the auditory speech literature also occur in visual speech perception. Potentially, the resource-demanding operations used by listeners to compensate for the variability between different talkers are not limited to auditory speech perception, but also incur processing costs during visual speech perception. These findings will need to be incorporated into current theoretical accounts of audiovisual speech perception, which have not explicitly addressed the role of "normalization" of visual speech.

REFERENCES

BENCH, J., & BAMFORD, J. (1979). Speech-hearing tests and the spoken language of hearing-impaired children. London: Academic Press.
BIEDERMAN, I., & JU, G. (1988). Surface versus edge-based determinants of visual recognition. Cognitive Psychology, 20, 38-64.
BRADLOW, A. R., NYGAARD, L. C., & PISONI, D. B. (1999). Effects of talker, rate, and amplitude variation on recognition memory for spoken words. Perception & Psychophysics, 61, 206-219.
BURT, D. M., & PERRETT, D. I. (1995). Perception of age in adult Caucasian male faces: Computer graphic manipulation of shape and colour information. Proceedings of the Royal Society of London: Series B, 259, 137-143.
CAMPBELL, R. T., BROOKS, B., DE HAAN, E., & ROBERTS, T. (1996). Dissociating face processing skills: Decisions about lip-read speech, expression, and identity. Quarterly Journal of Experimental Psychology, 49A, 295-314.
CAMPBELL, R. T., LANDIS, T., & REGARD, M. (1986). Face recognition and lip-reading: A neurological dissociation. Brain, 109, 509-521.
CHURCH, B. A., & SCHACTER, D. L. (1994). Perceptual specificity of auditory priming: Implicit memory for voice intonation and fundamental frequency. Journal of Experimental Psychology: Learning, Memory, & Cognition, 20, 521-533.
CRAIK, F. I., & KIRSNER, K. (1974). The effect of speaker's voice on word recognition. Quarterly Journal of Experimental Psychology, 26, 274-284.
CREELMAN, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
DEMOREST, M. E., & BERNSTEIN, L. E. (1992). Sources of variability in speechreading sentences: A generalizability analysis. Journal of Speech & Hearing Research, 35, 876-891.
EDWARDS, M., & BADCOCK, D. (1996). Global motion perception: Interaction of chromatic and luminance signals. Vision Research, 36, 2423-2431.
ELLIS, H. D. (1989). Processes underlying face recognition. In R. Bruyer (Ed.), Neuropsychology of face perception and facial expression (pp. 41-49). Hillsdale, NJ: Erlbaum.
FODOR, J. A. (1983). Modularity of mind. Cambridge, MA: MIT Press, Bradford Books.
FOWLER, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.
GEGENFURTNER, K. R., & HAWKEN, M. J. (1996). Interaction of motion and color in the visual pathways. Trends in Neurosciences, 19, 394-401.
GREEN, K. P. (1994). The influence of an inverted face on the McGurk effect (abstract). Journal of the Acoustical Society of America, 95, 3014.
HALLE, M. (1985). Speculations about the representation of words in memory. In V. A. Fromkin (Ed.), Phonetic linguistics (pp. 101-104). New York: Academic Press.
HILL, H., BRUCE, V., & AKAMATSU, S. (1995). Perceiving the sex and race of faces: The role of shape and colour. Proceedings of the Royal Society of London: Series B, 261, 367-373.
HILLGER, L. A., & KOENIG, O. (1991). Separable mechanisms in face processing: Evidence from hemispheric specialization [Special issue: Face perception]. Journal of Cognitive Neuroscience, 3, 42-58.
HUMPHREYS, G. K., GOODALE, M. A., JAKOBSON, L. S., & SERVOS, P. (1994). The role of surface information in object recognition: Studies of a visual form agnosic and normal subjects. Perception, 23, 1457-1481.
JOHNSON, J. A., & ROSENBLUM, L. D. (1996). Hemispheric differences in perceiving and integrating dynamic visual speech information. Journal of the Acoustical Society of America, 100, 2570.
JOHNSON, K. (1990). The role of perceived speaker identity in F0 normalization of vowels. Journal of the Acoustical Society of America, 88, 642-654.
JORDAN, T. R., & BEVAN, K. (1997). Seeing and hearing rotated faces: Influences of facial orientation on visual and audio-visual speech recognition. Journal of Experimental Psychology: Human Perception & Performance, 23, 388-403.
KRICOS, P. B., & LESNER, S. A. (1982, May). Differences in visual intelligibility across talkers. The Volta Review, 84, 219-225.
LEE, K. J., & PERRETT, D. (1997). Presentation-time measures of the effects of manipulations in colour space on discrimination of famous faces. Perception, 26, 733-752.
LIBERMAN, A. M., & MATTINGLY, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
LIVINGSTONE, M. S., & HUBEL, D. H. (1987). Psychophysical evidence for separate channels for the perception of form, color, movement, and depth. Journal of Neuroscience, 7, 3416-3468.
MACLEOD, A., & SUMMERFIELD, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131-141.
MASSARO, D. W., & COHEN, M. M. (1996). Perceiving speech from inverted faces. Perception & Psychophysics, 58, 1047-1065.
MCCLELLAND, J. L., & ELMAN, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
MCGURK, H., & MACDONALD, J. W. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
MILLS, A. E. (1987). The development of phonology in the blind child. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 145-162). Hillsdale, NJ: Erlbaum.
MONTGOMERY, A. A., WALDEN, B. E., & PROSEK, R. A. (1987). Effects of consonantal context on vowel lipreading. Journal of Speech & Hearing Research, 30, 50-59.
MULLENNIX, J. W., & PISONI, D. B. (1990). Talker variability and processing dependencies between word and voice (Tech. Rep. No. 13). Bloomington: Indiana University.
MULLENNIX, J. W., PISONI, D. B., & MARTIN, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365-378.
NYGAARD, L. C., & BURT, S. A. (1996). Sources of variability as linguistically relevant aspects of speech. Journal of the Acoustical Society of America, 100, 2572.
NYGAARD, L. C., SOMMERS, M. S., & PISONI, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5, 42-46.
PALMERI, T. J., GOLDINGER, S. D., & PISONI, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 309-328.
POLLACK, I., PICKETT, J. M., & SUMBY, W. H. (1954). On the identification of speakers by voice. Journal of the Acoustical Society of America, 26, 403-406.
PRICE, C. J., & HUMPHREYS, G. W. (1989). The effects of surface detail on object categorization and naming. Quarterly Journal of Experimental Psychology, 41A, 797-828.
RAMACHANDRAN, V. S., & GREGORY, R. L. (1978). Does colour provide an input to human motion perception? Nature, 275, 55-56.
REISBERG, D., MCLEAN, J., & GOLDFIELD, A. (1987). Easy to hear but hard to understand: A lipreading advantage with intact auditory stimuli. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 97-113). Hillsdale, NJ: Erlbaum.
REMEZ, R. E., FELLOWES, J. M., & RUBIN, P. E. (1997). Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception & Performance, 23, 651-666.
ROSENBLUM, L. D., JOHNSON, J. A., & SALDANA, H. M. (1996). Visual kinematic information for embellishing speech in noise. Journal of Speech & Hearing Research, 39, 1159-1170.
ROSENBLUM, L. D., & SALDANA, H. M. (1998). Time-varying information for visual speech perception. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing by eye II: Advances in the psychology of speechreading and auditory-visual speech. Hillsdale, NJ: Erlbaum.
ROSENBLUM, L. D., YAKEL, D. A., & GREEN, K. P. (2000). Face and mouth inversion effects on visual and audiovisual speech perception. Journal of Experimental Psychology: Human Perception & Performance, 26, 806-819.
SALDANA, H. M., NYGAARD, L. C., & PISONI, D. B. (1996). Encoding of visual speaker attributes and recognition memory for spoken words. In D. G. Stork & M. E. Hennecke (Eds.), Speechreading by humans and machines: Models, systems, and applications (NATO ASI Series F: Computers and Systems Sciences, No. 150). New York: Springer-Verlag.
SALDANA, H. M., & ROSENBLUM, L. D. (1994). Voice information in auditory form-based priming. Journal of the Acoustical Society of America, 95, 2870.
SCHWEINBERGER, S. R., & SOUKUP, G. R. (1998). Asymmetric relationships among perceptions of facial identity, emotion, and facial speech. Journal of Experimental Psychology: Human Perception & Performance, 24, 1748-1765.
SERGENT, J. (1982). About face: Left hemisphere involvement in processing physiognomies. Journal of Experimental Psychology: Human Perception & Performance, 8, 1-14.
SHEFFERT, S. M., & FOWLER, C. A. (1995). The effects of voice and visible speaker change on memory for spoken words. Journal of Memory & Language, 34, 665-685.
SOMMERS, M. S., NYGAARD, L. C., & PISONI, D. B. (1994). Stimulus variability and spoken word recognition: I. Effects of variability in speaking rate and overall amplitude. Journal of the Acoustical Society of America, 96, 1314-1324.
STRANGE, W., VERBRUGGE, R. R., SHANKWEILER, D. P., & EDMAN, T. R. (1976). Consonant environment specifies vowel identity. Journal of the Acoustical Society of America, 60, 213-224.
SUMMERFIELD, Q., & HAGGARD, M. P. (1973). Vocal tract normalization and representation. Brain & Language, 28, 12-23.
THOMPSON, P. (1980). Margaret Thatcher: A new illusion. Perception, 9, 483-484.
VALENTINE, T. (1988). Upside-down faces: A review of the effect of inversion upon face recognition. British Journal of Psychology, 79, 471-491.
VALENTINE, T., & BRUCE, V. (1985). What's up? The Margaret Thatcher illusion revisited. Perception, 14, 515-516.
VAN LANCKER, D., KREIMAN, J., & EMMOREY, K. (1985). Familiar voice recognition: Patterns and parameters. Part I: Recognition of backward voices. Journal of Phonetics, 13, 19-38.
VERBRUGGE, R. R., STRANGE, W., SHANKWEILER, D. P., & EDMAN, T. R. (1976). What information enables a listener to map a talker's vowel space? Journal of the Acoustical Society of America, 60, 198-212.
WALKER, S., BRUCE, V., & O'MALLEY, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-1133.
YAKEL, D. A., & ROSENBLUM, L. D. (1996). Face identification using visual speech information. Journal of the Acoustical Society of America, 100, 2570.
YOUNG, A. W., HAY, D. C., MCWEENEY, K. H., ELLIS, A. W., & BARRY, C. (1985). Familiarity decisions for faces presented to the left and right cerebral hemispheres. Brain & Cognition, 4, 439-450.

(Manuscript received July 15, 1998; revision accepted for publication November 12, 1999.)