NEUROBIOLOGY OF LANGUAGE

Edited by GREGORY HICKOK, Department of Cognitive Sciences, University of California, Irvine, CA, USA, and STEVEN L. SMALL, Department of Neurology, University of California, Irvine, CA, USA

SECTION F: PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL

CHAPTER 37: PHONEME PERCEPTION

Jeffrey R. Binder, Department of Neurology, Medical College of Wisconsin, Milwaukee, WI, USA

Human vocalizations include the set of sounds called phonemes, which are the basic linguistic units of speech. Phonemes are approximately equivalent to the set of vowel and consonant sounds of a language. For example, the word cat comprises the three phonemes (here transcribed in International Phonetic Alphabet notation) /k/, /æ/, and /t/. Changing any one of these phonemes, such as substituting /p/ for /k/, results in a different word. Thus, a phoneme is often defined as the smallest contrastive speech unit that brings about a change of meaning. Phonemes are also the product of complex motor acts involving the lungs, diaphragm, larynx, tongue, palate, lips, and so on. The actual sounds produced by this complex apparatus vary from one instance to the next, even for a single speaker repeating the same phoneme. They are also altered by the context in which the act occurs, resulting in allophonic variations such as the difference in the middle "t" sound of night rate and nitrate. Phoneme realizations also show large variations across geographical and social groups, resulting in regional and class accents. Therefore, a phoneme is best viewed as an abstract representation of a set of related speech sounds that are perceived as equivalent within a particular language. The main task of phoneme perception systems in the brain is to enable this perceptual equivalence.

The acoustic analysis of phonemes is a relatively recent science, becoming systematic only with the development of acoustic spectral analysis in the 1940s using analog filter banks (Koenig, Dunn, & Lacy, 1946). Accordingly, the standard classification of phonemes refers mainly to the motor gestures that produce them. "Manner" of articulation refers to the general type of gesture performed. Examples of distinct manners of articulation include: transient complete closure of the vocal tract, called a stop or plosive (example: p, d, k); sustained constriction of the vocal tract resulting in turbulent air flow, called a fricative (example: f, th, s); lowering of the soft palate and relaxation of the upper pharynx during occlusion of the oral tract to allow air flow and resonance in the nasal cavity, called a nasal (example: m, n, ng); and relatively unobstructed phonation using the tongue and lips to shape the resonant properties of the oral tract, resulting in a vowel. "Place" of articulation refers to the location in the vocal tract at which maximal restriction occurs during a gesture. Examples include: restriction at the lips, referred to as bilabial (example: b, p, m); restriction with the tip of the tongue against the teeth or alveolar palate, referred to as coronal (example: th, d, s); and restriction with the body of the tongue against the soft palate, called velar (example: g, k). Finally, the consonants are divided into two "voicing" categories on the basis of whether the vocal folds in the larynx are made to vibrate during or very close to the time of maximal articulation. When phonatory muscles in the larynx contract, the vocal folds vibrate due to periodic build-up of air pressure from the lungs, resulting in an audible pitch or voice. When this vibration is present during maximal articulation, the consonant is "voiced" (example: b, z, g). When no vocal fold vibration is present, the consonant is "unvoiced" (example: p, s, k). For stop consonants followed by a vowel, voicing is largely determined by the time that elapses between release of the stop and the onset of vocal fold vibration, referred to as voice onset time (VOT).

Although derived originally from motor descriptions, each of these distinctions is associated with corresponding acoustic phenomena (Figure 37.1). Vocal cord vibrations produce a periodic sound wave at a particular rate, called the fundamental frequency of the voice (F0), as well as harmonics at multiples of the fundamental. The relative intensity of these harmonics is determined by the shape of the vocal tract, which changes dynamically due to movement of the tongue and other articulators, altering the resonant (and antiresonant) properties of the tract. Resonant harmonics form distinct bands of increased power in the spectrum, called formants (numbered F1, F2, etc.), and the center frequency and bandwidth of these formants change as the vocal tract changes shape.
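The spectral constructs just described are easy to demonstrate computationally. The following is a minimal Python sketch added here for illustration (it is not from any study cited in this chapter and assumes only the numpy and scipy libraries): it synthesizes a crude vowel-like sound from a 120-Hz glottal pulse train shaped by two resonances standing in for F1 and F2, then computes a spectrogram in which the harmonics of F0 and the formant bands appear much as in Figure 37.1.

import numpy as np
from scipy import signal

fs = 16000                        # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)     # 500 ms of signal

# Harmonic source: impulse train at F0 = 120 Hz (the vocal fold pulses).
f0 = 120
source = (np.mod(t, 1 / f0) < 1 / fs).astype(float)

def resonator(x, fc, bw):
    # Second-order all-pole filter standing in for one vocal tract resonance.
    r = np.exp(-np.pi * bw / fs)           # pole radius set by bandwidth
    theta = 2 * np.pi * fc / fs            # pole angle set by center frequency
    return signal.lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], x)

# Two resonances standing in for F1 (~700 Hz) and F2 (~1200 Hz).
vowel = resonator(resonator(source, 700, 130), 1200, 100)

# Spectrogram: formants appear as horizontal bands of increased power,
# and the F0 harmonics as fine striations within them.
freqs, times, Sxx = signal.spectrogram(vowel, fs, nperseg=512, noverlap=384)
top = freqs[np.argsort(Sxx.mean(axis=1))[-3:]]
print("strongest frequency bins (rough formant locations):", np.sort(top), "Hz")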


FIGURE 37.1 Acoustic waveforms (top row in each pair) and time-frequency displays ("spectrograms") showing spectral power at each point in time (bottom row in each pair) for the syllables /bæ/ and /pæ/ (as in "bat" and "pat"). Vertical striations seen in both displays are due to vocal cord vibrations at the fundamental frequency of the speaker's voice. Dark horizontal bands in the time-frequency displays are formants (labeled F1, F2, and F3 on the /bæ/ display) caused by resonances related to the shape of the oral cavity. The first 50–60 ms of /bæ/ is the period of formant transition, during which the jaw opens and the spectral position of the formants changes rapidly. Unvoiced /pæ/ differs from voiced /bæ/ in that the period of formant transition is replaced in /pæ/ by a period of aspiration noise (indicated by "VOT") during which there are no vocal cord vibrations. The absence of voicing also greatly shortens the formant transition duration and raises the starting frequency of F1 for /pæ/.

Stop consonants are characterized by abrupt changes in tract shape as the jaw, tongue, and lips open or close, resulting in rapid changes (typically occurring over 20–50 ms) in the center frequencies of the formants (called formant transitions), as well as rapid changes in overall intensity as air flow is released or stopped. Fricatives are characterized by sustained, high-frequency broadband noise produced by turbulent air flow at the point of vocal tract constriction. Nasal consonants are characterized by a prominent nasal formant at approximately 250–300 Hz (the "nasal murmur") and damping of higher-frequency oral formants.

The acoustic correlates of consonant place of articulation were a central focus of much early research in acoustic phonetics, in large part due to the difficulty in identifying invariant characteristics that distinguished stops in different place categories. Place distinctions among stops are associated with differences in the direction and slope of formant transitions, but it was quickly noticed that these parameters are context-dependent. For example, the second formant rises during articulation of syllable-initial /d/ when the following vowel is "ee," but it falls during articulation of the same segment when the following vowel is "ah" or "oo" (Delattre, Liberman, & Cooper, 1955). Such context dependency, which is ubiquitous in speech, proved challenging to explain and even led to a prominent theory proposing that phoneme identification depends on activation of an invariant motor representation (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967), although exactly how this could occur if the acoustic signal is uninformative was never fully specified. Later studies showed that overall spectral shape at the time of stop release and change in overall shape just after release provide a context-invariant acoustic cue for place of articulation (Kewley-Port, 1983; Stevens & Blumstein, 1981). Similarly, place of articulation for fricatives and nasal consonants is cued primarily by gross spectral shape (Hughes & Halle, 1956; Kurowski & Blumstein, 1987).

Phoneme perception is often based on multiple interacting acoustic cues. The cues signaling stop consonant voicing are a clear case in point. Although VOT is the most prominent cue, with VOT values longer than 25–30 ms generally creating a strong unvoiced percept, unvoiced stops (i.e., p, t, and k) are also characterized by aspiration (i.e., noise from the burst of air that accompanies release of the stop), a higher center frequency of the first formant at the onset of voicing, and a shorter formant transition duration, among other cues (Lisker, 1986) (Figure 37.1). Studies using synthesized speech show that these cues jointly contribute to voicing discrimination in that reduction of one cue can be compensated for by an increase in others (e.g., reduction of VOT can be "traded" for an increase in aspiration amplitude or first formant onset frequency to maintain the likelihood of an unvoiced percept) (Summerfield & Haggard, 1977).
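To make the notion of a synthetic continuum concrete, here is a toy Python sketch (my illustration; the classic studies used formant synthesizers rather than this crude noise-and-pulse approximation) that manipulates the single cue of VOT by inserting a variable period of aspiration noise between the release burst and the onset of voicing:

import numpy as np

fs = 16000  # sampling rate (Hz)

def stop_vowel(vot_ms, dur_ms=300, f0=120):
    # Crude stop-plus-vowel token: a 5-ms release burst, then vot_ms of
    # aspiration noise, then periodic voicing. Sweeping vot_ms from 0 to
    # 60 approximates a /ba/-to-/pa/ continuum (compare Figure 37.1).
    n_vot = int(fs * vot_ms / 1000)
    n_voice = int(fs * dur_ms / 1000) - n_vot
    burst = 0.5 * np.random.randn(int(0.005 * fs))
    aspiration = 0.1 * np.random.randn(n_vot)
    t = np.arange(n_voice) / fs
    voicing = 0.8 * np.sign(np.sin(2 * np.pi * f0 * t))  # harmonic-rich voicing
    return np.concatenate([burst, aspiration, voicing])

# A seven-step continuum with VOT from 0 to 60 ms in 10-ms steps, the
# kind of series used in categorical perception experiments.
continuum = [stop_vowel(vot) for vot in range(0, 70, 10)]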

As mentioned, a major task of the phoneme perception system is to detect these acoustic cues despite extensive variation in their realization. In addition to variation due to phonetic context (i.e., preceding or following phonemes), speaker accent, and a host of purely random factors, speakers vary in overall speed of production and in the length and size of their vocal tracts, producing variation in the absolute value of fundamental and formant center frequencies. Two types of mechanisms are thought to mitigate this problem. The first are normalization processes that adjust for variation in vocal tract characteristics and articulation speed. Evidence suggests, for example, that although absolute formant frequencies for any particular phoneme vary between speakers, the ratios of formants are more constant and therefore provide more invariant cues (Miller, 1989). Similarly, there is evidence that listeners automatically apply a temporal normalization process that adjusts for variation in speed of production within and across speakers (Sawusch & Newman, 2000).
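A hypothetical numeric example of the formant-ratio idea (the F1/F2 values below are invented for illustration, not measured data): absolute formant frequencies shift with vocal tract length, while their ratio stays comparatively stable.

# Hypothetical F1 and F2 values (Hz) for the same vowel produced by a
# speaker with a long vocal tract and one with a short vocal tract.
speakers = {"long_tract": (700, 1220), "short_tract": (850, 1500)}

for name, (f1, f2) in speakers.items():
    # Absolute frequencies differ by roughly 20%, but the F2/F1 ratio is
    # nearly constant, providing a more speaker-invariant cue.
    print(f"{name}: F1={f1} Hz, F2={f2} Hz, F2/F1={f2 / f1:.2f}")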
A second mechanism for dealing with acoustic variation is known as categorical perception. This term refers to a relative inability to perceive variation within a set of items belonging to the same perceptual category. The phenomenon is most strikingly demonstrated using synthesized speech continua that vary in acoustically equidistant steps from one phoneme category to another, such as from /b/ to /p/ (Liberman, Harris, Hoffman, & Griffith, 1957). Listeners fail to perceive changes in VOT from 0 to 20 ms, for example, but can easily perceive a change from 20 to 40 ms. The 0-ms and 20-ms VOT sounds are both perceived as the same voiced stop consonant, whereas the 40-ms VOT sound is perceived as unvoiced and, therefore, a different phoneme. Importantly, the phenomenon is not simply attributable to response bias or limited availability of response labels, because it is observed using alternative forced-matching (e.g., ABX) tasks as well as labeling tasks. Categorical perception indicates that the phoneme perception system has an automatic (i.e., subconscious) mechanism for "suppressing" acoustic detail of a speech sound, leaving behind, in consciousness, mainly the abstract identity of the phoneme. Which specific details are suppressed, however, depends on the phoneme categories to which one is exposed during early linguistic experience. Speakers of English, for example, cannot appreciate distinctions between the dentoalveolar and (postalveolar) retroflex stops of Hindi and other Indo-Aryan languages, whereas these are distinct phonemes with categorical perceptual boundaries for Hindi listeners (Pruitt, Jenkins, & Strange, 2006). It thus appears that the phoneme perceptual system becomes "tuned" during language development through bottom-up statistical learning processes that reflect over-representation of category prototypes in the language environment (Kuhl, 2000; Werker & Tees, 2005). The phoneme categories that result from this tuning process can be thought of as "attractor states" (Damper & Harnad, 2000) or "perceptual magnets" (Kuhl, 1991) that "pull" the neural activity pattern toward a target position in phoneme perceptual space regardless of small variations in the acoustic input.

Before moving on to a consideration of relevant neurobiological data, two additional core aspects of phoneme perception should be mentioned. The first involves the integration of visual and auditory input. Many of the articulatory movements used in speaking are visible, and because this visual information is strongly correlated with the resulting acoustic phenomena, it is useful for phoneme perception. For people with normal hearing, visual information is particularly helpful in noisy environments (Sumby & Pollack, 1954). For people with little or no hearing, visual information alone can often provide enough information for phoneme recognition (Bernstein, Demorest, & Tucker, 2000). Auditory and visual information are "integrated" at a relatively early, preconscious level during phoneme perception, as shown by the McGurk illusion, in which the perceived identity of a heard phoneme is altered by changing the content of a simultaneous video display (McGurk & MacDonald, 1976).

A final point is that phoneme perception should not be thought of as an entirely bottom-up sensory process. As with all perceptual recognition tasks, identification of the sensory input is influenced by knowledge about what it is most likely to be. One illustration of this phenomenon is the influence of lexical status on phoneme category boundaries (Ganong, 1980; Pitt & Samuel, 1993). A sound that is acoustically near the VOT boundary between /b/ and /p/, for example, is likely to be heard as /b/ after the stem /dræ/, thereby forming a word (drab) rather than a nonword (drap), whereas the same sound is likely to be heard as /p/ after the stem /træ/ (again, favoring a real word over a nonword). The connectionist model of phoneme perception called TRACE instantiates this top-down influence using feedback connections from lexical to phoneme representations (McClelland & Elman, 1986). Although feedback from higher levels provides one biologically plausible account of lexical effects on phoneme perception, it may not be strictly necessary (Norris, McQueen, & Cutler, 2000).
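Identification data from such experiments are conventionally summarized with a logistic function, and a lexical bias can be modeled as a shift of the category boundary. The toy Python sketch below (not from any cited study; the parameter values are arbitrary) illustrates both the steep categorical labeling function and the Ganong-style boundary shift:

import numpy as np

def p_voiceless(vot_ms, boundary=25.0, slope=0.5, lexical_shift=0.0):
    # Probability of labeling a stop as voiceless (/p/) given its VOT.
    # A steep slope produces categorical labeling; lexical_shift moves
    # the boundary (negative = bias toward /p/, as after the stem /træ/,
    # where /p/ completes a word).
    return 1.0 / (1.0 + np.exp(-slope * (vot_ms - (boundary + lexical_shift))))

vots = np.arange(0, 65, 10)
print("neutral context:   ", np.round(p_voiceless(vots), 2))
print("/b/ completes word:", np.round(p_voiceless(vots, lexical_shift=+5), 2))
print("/p/ completes word:", np.round(p_voiceless(vots, lexical_shift=-5), 2))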

Another example of "top-down" effects in phoneme perception is the phenomenon of "perceptual restoration," in which a phoneme is perceived within a word despite the fact that the actual phoneme has been replaced by a segment of noise or other nonspeech sound (Samuel, 1997; Warren & Obusek, 1971). An extreme example of such "top-down restoration" is the successful perception of phonemes in sounds constructed of three or four pure tones that represent only the center frequency and intensity of the formants in the original speech. Naïve listeners initially perceive such "sine wave speech" as nonverbal "chirps" or "electronic sounds," but they are then immediately able to hear phonemes on being told that the sounds are speech (Remez, Rubin, Pisoni, & Carrell, 1981). Such phenomena vividly illustrate the general principle that what we perceive is determined as much by what we expect as by what is available from sensory input.
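Sine wave speech is straightforward to approximate digitally: each formant is replaced by a single sinusoid whose frequency follows the formant's center frequency. A minimal Python sketch (an illustration assuming numpy; the formant tracks below are invented stand-ins for tracks that would normally be measured from a natural utterance):

import numpy as np

fs = 16000
n = int(0.4 * fs)              # a 400-ms utterance
t = np.arange(n) / fs

# Invented formant center-frequency tracks (Hz); in practice these would
# be estimated frame by frame from a natural recording.
f1 = np.linspace(400, 750, n)   # rising F1, as after a stop release
f2 = np.linspace(1100, 1500, n)
f3 = np.full(n, 2500.0)

def tone(track):
    # A sinusoid whose instantaneous frequency follows the formant track;
    # the phase is the running integral of frequency.
    return np.sin(2 * np.pi * np.cumsum(track) / fs)

# Three-tone replica (formant amplitude tracks omitted for brevity).
# Naive listeners tend to hear "chirps" until told the sounds are speech.
sws = (tone(f1) + 0.5 * tone(f2) + 0.25 * tone(f3)) / 1.75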
37.1 NEUROPSYCHOLOGICAL STUDIES

As described in the preceding section, phoneme perception in the healthy brain is influenced by stored lexical and phonetic knowledge. However, neuropsychological studies amply demonstrate that phoneme perception and higher lexical processes are often differentially affected by focal lesions. For example, patients with transcortical sensory aphasia have severe impairments of word comprehension but are perfectly able to repeat words they do not understand (Berthier, 1999). Intact repetition indicates preserved perception of phonemes, and thus the word comprehension deficit must be due to either an impairment in mapping the phonemes to a lexical representation or an impairment at the semantic level. Conversely, patients with "pure word deafness" (PWD) are unable to recognize or repeat spoken words but have normal understanding of written language and normal language production, suggesting a relatively specific impairment at the level of phoneme perception (Buchman, Garron, Trost-Cardamone, Wichter, & Schwartz, 1986). This well-documented double dissociation should be kept in mind when encountering ambiguous terms such as "speech comprehension" and "intelligibility," which unnecessarily conflate different levels of processing involving distinct kinds of information.

Early studies of PWD provided the first hard evidence that speech sounds are processed in the temporal lobe. The work of Salomon Henschen, in particular, distinguished lesions causing specific phoneme perception impairment from lesions causing other forms of "word deafness" (at the time, this term was used to designate any speech comprehension impairment), linking the former to damage in the superior temporal gyrus (STG) (Henschen, 1918–1919). Lesion data from patients with PWD strongly suggest some degree of bilateral phoneme processing, because the vast majority of cases have bilateral STG lesions (Buchman et al., 1986; Poeppel, 2001). Particularly compelling in this regard are those who have a unilateral STG lesion with no symptomatic speech perception impairment and then develop PWD only after the contralateral STG is affected (Buchman et al., 1986; Ulrich, 1978). Also relevant to the issue of bilateral representation is evidence from intracarotid amobarbital (Wada) testing that shows that patients maintain much of their ability to discriminate phonemes even after anesthetization of the language-dominant hemisphere (Boatman et al., 1998; Hickok et al., 2008). Taken as a whole, these data suggest that phoneme perceptual processes are much more bilaterally represented than are later stages of speech comprehension. Nevertheless, PWD after unilateral damage is occasionally reported, and the damage in these instances is nearly always on the left side, suggesting a degree of asymmetry in the bilateral representation (Buchman et al., 1986; Poeppel, 2001).

PWD is a misnomer for several reasons. The deficit involves perception of phonemes, not words; therefore, sublexical phoneme combinations and other nonword stimuli are affected at least as much as real words. More appropriate designations include "phoneme deafness" and "auditory verbal agnosia." The "purity" of the deficit is also highly variable. Although patients with PWD have normal pure tone hearing thresholds, most cases show impairments on more complex auditory perceptual tasks that do not involve speech sounds, such as recognition of nonspeech environmental sounds, pitch discrimination and other musical tasks, and various measures of temporal sequence perception (Buchman et al., 1986; Goldstein, 1974). The variety and severity of such nonverbal perceptual impairments are generally greater in patients with bilateral lesions than in those with unilateral left lesions. Two patients with unilateral left temporal lesions, for example, showed relatively circumscribed deficits involving perception of rapid spectral changes, including impaired consonant but spared vowel discrimination (Stefanatos, Gershkoff, & Madigan, 2005; Wang, Peach, Xu, Schneck, & Manry, 2000).

37.2 FUNCTIONAL IMAGING STUDIES

Phoneme perception has been the focus of many functional imaging studies. The loud noises produced by rapid gradient switching present a challenge for fMRI studies of auditory processing, because these noises not only mask experimental stimuli but also activate many of the brain regions of interest, limiting the dynamic range available for hemodynamic responses.

Most auditory fMRI studies avoid these problems by separating successive image volume acquisitions with a silent period during which the experimental stimuli are presented (Hall et al., 1999). If the acquisitions are separated by a sufficient amount of time (i.e., >7 seconds), then the measured BOLD responses primarily reflect activation by the experimental stimuli rather than the noise produced by the previous image acquisition.

Compared with a silent baseline, speech sounds (whether syllables, words, pseudowords, or reversed speech) activate most of the STG bilaterally (Binder et al., 2000; Wise et al., 1991). Although this speech-silence contrast is sometimes advocated as a method of "mapping Wernicke's area" for clinical purposes (Hirsch et al., 2000), the resulting activation is typically symmetric and is unrelated to language lateralization as determined by Wada testing (Lehéricy et al., 2000). The activated regions include Heschl's gyrus and surrounding dorsal STG areas that contain primary and belt auditory cortex (Rauschecker & Scott, 2009), suggesting that much of the activation is due to low-level, general auditory processes. A variety of nonspeech control stimuli (frequency-modulated tones, amplitude-modulated noise) were introduced to identify areas that might be more speech-specific (Binder et al., 2000; Jäncke, Wüstenberg, Scheich, & Heinze, 2002; Mummery, Ashburner, Scott, & Wise, 1999; Zatorre, Evans, Meyer, & Gjedde, 1992). The main finding in these studies was that primary and belt regions on the dorsal STG respond similarly to speech and simpler nonspeech sounds, whereas areas in the ventral STG and superior temporal sulcus (STS) respond more to speech than to simpler sounds (see DeWitt & Rauschecker, 2012 for an activation likelihood estimation [ALE] meta-analysis). The pattern suggests a hierarchically organized processing stream running from the primary auditory cortex on the dorsal surface to higher-level areas in the lateral and ventral STG, a pattern very reminiscent of the core-belt-parabelt organization in monkey auditory cortex (Kaas & Hackett, 2000).

The tone and noise controls used in the studies just mentioned are far less acoustically complex than speech sounds, leaving open the possibility that the observed ventral STG/STS activation reflects general acoustic complexity rather than phoneme-specific processing. The degree to which speech perception requires "special" mechanisms has been a contentious topic for many decades (Diehl & Kluender, 1989; Liberman et al., 1967; Liberman & Mattingly, 1985; Lotto, Hickok, & Holt, 2009). This unresolved issue motivated functional imaging researchers to begin the search for auditory cortical areas that respond selectively to speech phonemes compared with nonphonemic sounds of equivalent spectrotemporal complexity.

Scott, Blank, Rosen, and Wise (2000) addressed this problem using spectrally "rotated" speech. In this manipulation, the instantaneous auditory spectrum at each point in time is rotated about a central frequency axis, such that high frequencies become low frequencies and vice versa (Blesser, 1972). The rotation preserves overall spectrotemporal complexity (though the long-term spectrum and spectral distribution of dynamic features are both greatly altered) while reducing phoneme intelligibility. The contrast between speech and rotated speech produced activation of the left middle and anterior STS. A number of subsequent studies using similar methods have replicated this initial finding (Davis & Johnsrude, 2003; Friederici, Kotz, Scott, & Obleser, 2010; Narain et al., 2003; Okada et al., 2010). One difficulty with interpreting these results arises from the fact that the stimuli were sentences and therefore contained considerable semantic and syntactic information. Phoneme perception would have been perfectly correlated with engagement of semantic and syntactic processes. Thus, what is sometimes referred to as an "acoustically invariant response to speech" may simply reflect semantic or syntactic processing independent of phoneme perception per se. This would explain the somewhat anterior location of the activation compared with prior speech studies, because semantic and syntactic integration at the sentence level is thought to particularly engage the anterior temporal lobe (DeWitt & Rauschecker, 2012; Humphries, Binder, Medler, & Liebenthal, 2006; Humphries, Swinney, Love, & Hickok, 2005; Pallier, Devauchelle, & Dehaene, 2011; Spitsyna, Warren, Scott, Turkheimer, & Wise, 2006; Vandenberghe, Nobre, & Price, 2002; Visser, Jefferies, & Lambon Ralph, 2010).
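Spectral rotation can be approximated digitally with a short-time Fourier transform: flip each frame's spectrum about the center frequency and resynthesize. The sketch below is one simple way to implement the manipulation in Python with scipy (Blesser's original implementation was analog, and published rotated-speech stimuli typically involve additional filtering; this is only an approximation):

import numpy as np
from scipy import signal

def rotate_spectrum(x, fs, nperseg=512):
    # Spectrally "rotate" a signal: high frequencies become low and vice
    # versa, preserving spectrotemporal complexity while largely
    # destroying phoneme identity.
    freqs, times, Z = signal.stft(x, fs, nperseg=nperseg)
    Z_rot = Z[::-1, :]                        # flip each frame's spectrum
    _, x_rot = signal.istft(Z_rot, fs, nperseg=nperseg)
    return x_rot

fs = 16000
speech = np.random.randn(fs)                  # stand-in for a recorded sentence
rotated = rotate_spectrum(speech, fs)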
Subsequent work focused more specifically on phoneme perception using a variety of controls for spectrotemporal complexity. Liebenthal, Binder, Spitzer, Possing, and Medler (2005) created synthetic nonphonemic syllables by inverting the direction of the first formant (F1) transition in a series of stop consonants. Stops are characterized acoustically by a rising F1 trajectory that results from opening the closed vocal tract and sudden enlargement of the oral cavity (Fant, 1973). Inverting the F1 trajectory results in speech-like sounds that have no phoneme identity. Brain activity was measured during ABX discrimination along a synthetic continuum ranging in equal steps from /b/ to /d/ (accomplished by gradual shift of the F2 transition from rising to falling) and along an otherwise similar nonphonemic continuum with inverted F1 transitions. Participants showed typical categorical perception behavior (superior discrimination across the category boundary compared with within category) for the phoneme continuum, but a flat discrimination function for the nonphonemic items, verifying that the latter stimuli failed to evoke a phoneme percept. The conditions were matched on overall accuracy (i.e., discriminability) and reaction time.



FIGURE 37.2 Left posterior temporal activation correlated with perception of phonemes in sine wave speech. The first upper panel (PostPhon–PrePhon) shows left posterior STS areas engaged only after the phonemic stimuli were perceived as /ba/ and /da/ as a result of training. The second upper panel (PostPhon–PostNon) shows selective activation of this region by phonemic relative to nonphonemic stimuli after training. Graphs in the lower panels show group average identification functions for the phonemic (left) and nonphonemic (right) continua, and the selective induction of categorical labeling for the phonemic stimuli after training. The third upper panel shows areas where increased activation for the phonemic stimuli from before (Pre) to after (Post) training was correlated with individual change in slope of the phoneme identification function from Pre to Post (phonemic categorical perception index, CPI). The final upper panel shows areas where increased activation for the nonphonemic stimuli from Pre to Post was correlated with individual change in slope of the nonphonemic identification function from Pre to Post (nonphonemic CPI). Adapted from Desai et al. (2008) with permission from the publisher.

A contrast between the phonemic and nonphonemic conditions showed activation localized in the mid-to-posterior left STS (peaks at Talairach coordinates −60, −28, −3 and −56, −31, 3). The spectrotemporally matched control condition effectively rules out auditory complexity as an explanation for this effect. This portion of the left STS thus appears to be a location where complex auditory information activates phoneme category representations.

Liebenthal et al. (2010) replicated this result using the same stimuli together with a labeling task. Event-related potentials data were obtained simultaneously with fMRI, showing that the difference between phonemic and nonphonemic processing begins at approximately 180 ms after stimulus onset and peaks at approximately 200 ms, manifesting as a larger P2 deflection in the phonemic condition.

Dehaene-Lambertz et al. (2005) and Möttönen et al. (2006) addressed the problem of spectrotemporal complexity confounds by comparing activation to sine wave speech presented before and after training subjects to perceive the sounds as phonemes. In both studies, participants performed a discrimination task in the naïve state during which they reported hearing the sounds as nonspeech. After explicit instruction that the sounds contained particular phonemes and after training in the scanner with explicit phoneme identification tasks, the participants were scanned again. Dehaene-Lambertz et al. included a manipulation of across-category versus within-category discrimination and demonstrated a specific improvement in across-category discrimination after training. In both studies, a left posterior STS region showed stronger activation after training compared with before training. The clusters were somewhat more posterior along the STS than in the Liebenthal studies (Dehaene-Lambertz et al. peaks: −60, −24, 4 and −56, −40, 0; Möttönen et al. peak: −61, −39, 2), although both overlap with the posterior aspect of the Liebenthal et al. clusters.

Desai, Liebenthal, Waldron, and Binder (2008) combined these approaches by comparing sine wave syllables and nonphonemic sine wave stimuli before and after training. The nonphonemic items were created by inverting the initial trajectory of the sine tone representing "F1," analogous to the approach followed by Liebenthal et al. with their speech sounds. The question the authors addressed was whether the expectation of hearing phonemes after training is sufficient in itself to activate the left STS, or if it is necessary for the stimulus to contain phonemic information. The main results (Figure 37.2) suggest that both top-down expectation and bottom-up phoneme input are necessary to activate the left STS region observed in the prior studies, and that the same region is activated whether comparing phoneme sounds after versus before training (replicating the Dehaene-Lambertz et al. and Möttönen et al. studies) or phonemic and nonphonemic sounds after training.

Desai et al. also obtained labeling data for the phonemic and nonphonemic continua before and after training and used a logistic function to model the slope of the boundary region as an index of categorical perception. This index increased significantly for the phoneme continuum after training but not for the nonphonemic continuum (Figure 37.2, bottom), and activation in the left STS was correlated with individual variation in the size of this increase in categorical perception.

It is noteworthy that in all of these studies in which phonemic stimuli were contrasted with nonphonemic stimuli of equal spectrotemporal complexity, the phoneme-specific responses were strongly left-lateralized. In contrast to the data from Wada testing and patients with PWD, discussed previously, these fMRI studies suggest that, at least in the intact brain, the left hemisphere performs most of the computation related specifically to phoneme perception. The issue of lateralization of phoneme perception is discussed further in the final section of this chapter.
Although ence between phonemic and nonphonemic stimuli after the participants, on average, showed no increase in cate- training, and a large region in left posterior middle tem- gorical perception of these sounds after training, the poral gyrus (MTG) and STS showed stronger responses small variation in the categorical perception index across to the nonphonemic sounds after training compared participants actually predicted the level of activation in a with before training (Figure 37.3, bottom). An interaction small region in the left supramarginal gyrus and poste- analysis showed that the left posterior STS was recruited rior STS. This cluster was somewhat posterior and dorsal by training specifically for the nonphonemic sounds to the one activated by phoneme perception (Figure 37.2). (Figure 37.3, bottom right). The authors proposed that the left posterior STS serves a These five studies paint a somewhat complex picture general function of auditory category perception, and of phoneme processing in the left STS, summarized in that small differences in the location of activation for pho- Figure 37.4. Highly familiar phonemes, for which strong nemic and nonphonemic sounds reflect differences in and overlearned perceptual categories exist, selectively familiarity between these types of categories. engage the mid-portion of the left STS compared with Desai et al. (2008) provided only minimal perceptual closely matched acoustic control stimuli. Sine wave training with their nonphonemic continuum. In a speech phonemes engage a slightly more posterior region follow-up study, Liebenthal et al. (2010) asked whether of left STS compared with the same sounds heard as non- more extensive training that induced stronger categori- speech. Finally, perception of newly learned nonphone- cal perception of nonphonemic sounds would result in mic speech categories engages still more posterior a similar recruitment of left posterior STS regions by regions in the left STS, MTG, and supramarginal gyrus.


FIGURE 37.4 Summary of five controlled studies of phoneme and matched nonphoneme perception. Red markers indicate peaks in two studies comparing phonemic (/ba/-/da/) with spectrotemporally matched nonphonemic speech sounds (Liebenthal et al., 2005, 2010). Yellow markers indicate peaks from three studies examining induction of phoneme perception with sine wave speech (Dehaene-Lambertz et al., 2005; Desai et al., 2008; Möttönen et al., 2006). Blue markers indicate peaks in two studies examining acquisition of categorical perception with nonphonemic sine wave stimuli (Desai et al., 2008) and nonphonemic synthetic speech (Liebenthal et al., 2010).

One interpretation of this "familiarity gradient" is that it reflects a greater dependence on short-term auditory memory mechanisms in the case of less familiar sounds. As the mapping from spectrotemporal forms to categories becomes less automatic, the spectrotemporal information needs to be maintained in a short-term memory store. For speech sounds, this memory store likely constitutes the first stage in a supra-sylvian speech production pathway (Hickok & Poeppel, 2007; Wise et al., 2001), although the evidence obtained from nonphonemic categorization tasks (Desai et al., 2008; Liebenthal et al., 2010) indicates that these short-term auditory memory representations are not strictly speech-specific (Figure 37.4).

The perceptual integration of auditory and visual phoneme information has been examined in a number of studies, most of which manipulated the congruency or temporal synchrony of auditory and visual speech inputs (see Calvert, 2001 for an early review). These studies have generally implicated the left posterior STS and superior parietal regions such as the intraparietal sulcus in audiovisual integration of speech. Miller and D'Esposito (2005) used synchrony manipulation together with an event-related analysis of perceptual fusion judgments to separate brain responses due to successful perceptual fusion from responses reflecting the degree of synchrony. The left mid-STS (MNI coordinates −46, −28, 0) and left Heschl's gyrus responded more strongly when the audiovisual stop consonants were successfully fused than when they were perceived as separate events. A number of other regions, mainly nodes in the attention and cognitive control networks, showed the reverse pattern. The left STS response depended on successful perceptual fusion but showed no sensitivity to the degree of actual audiovisual synchrony.

Recent studies also implicate the left mid-STS in audiovisual integration processes underlying the McGurk illusion. Susceptibility to the McGurk effect is known to vary across individuals. Nath and Beauchamp (2012) divided their participant sample into low- and high-susceptibility McGurk perceivers, defined by their probability of perceiving "da" when presented with simultaneous auditory "ba" and visual "ga." High-susceptibility perceivers showed stronger responses to these stimuli in a functionally defined audiovisual left mid-STS region of interest (ROI) (mean center-of-mass Talairach coordinates −58, −28, 4) compared with low-susceptibility perceivers. Across all participants, activation in this ROI was significantly correlated with individual susceptibility scores. In a complementary study, Beauchamp, Nath, and Pasalar (2010) showed that single-pulse transcranial magnetic stimulation (TMS) applied to a functionally defined audiovisual left mid-STS region disrupted the McGurk illusion in high-susceptibility perceivers, causing participants to report only the auditory stimulus. Taken together, these results suggest that a focal region in the left mid-STS integrates auditory and visual phoneme information, but that the visual input to this region is more variable across individuals and more easily disrupted.
37.3 DIRECT ELECTROPHYSIOLOGICAL RECORDINGS

Intracranial cortical electrophysiological recording (electrocorticography, or ECoG) offers a means of recording neural activity directly rather than indirectly via a hemodynamic response (as in positron emission tomography [PET] or fMRI). The technique provides much higher temporal resolution than fMRI and can also provide more precise spatial localization. Recent work has focused on induced changes in high gamma (>60 Hz) power, which is tightly correlated with local neural activity and the BOLD signal (Crone, Miglioretti, Gordon, & Lesser, 1998; Goense & Logothetis, 2008). The method usually involves a regularly spaced grid of electrodes draped on the surface convexity of the brain. One limitation of this approach is that the electrodes make contact with gyral crests and do not record from sulcal depths. Thus, most ECoG studies of phoneme perception provide data regarding the lateral STG but little or none regarding the STS.
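High gamma power is conventionally extracted by band-pass filtering each electrode's signal and taking the amplitude of the analytic signal. A minimal sketch (assuming numpy/scipy; the 70–150 Hz band edges and 1,000-Hz sampling rate are typical choices, not values reported by the studies discussed here):

import numpy as np
from scipy import signal

def high_gamma_power(ecog, fs, band=(70.0, 150.0)):
    # Band-pass one channel in the high gamma range, then take the
    # squared amplitude of the analytic (Hilbert) signal as power.
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    filtered = signal.sosfiltfilt(sos, ecog)
    return np.abs(signal.hilbert(filtered)) ** 2

fs = 1000                                   # a typical ECoG sampling rate (Hz)
channel = np.random.randn(10 * fs)          # stand-in for one electrode's data
hg_power = high_gamma_power(channel, fs)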

ECoG studies suggest that overall spectrotemporal form is relatively well-preserved in the pattern of neural activity on the left lateral mid-STG. The recorded spectrotemporal neural activity is sufficient to reconstruct intelligible speech waveforms (Pasley et al., 2012), indicating that the represented information is primarily, if not exclusively, acoustic rather than linguistic or symbolic in nature. Mesgarani, Cheung, Johnson, and Chang (2014) demonstrated that this region contains patches selective for different kinds of phonemes and phoneme features, with selectivity perhaps most closely reflecting manner cues such as overall spectral shape and dynamic changes in spectral shape. Thus, distinct groups of electrodes respond selectively to the following: plosive consonants, which are characterized by rapid changes over a broad frequency spectrum; sibilant fricatives, which have less dynamic and much more focused high-frequency energy; vowels, which have less dynamic and grossly bimodal frequency spectra; and nasals, which are characterized by prominent low-frequency energy. The fricative manner category includes both sibilant and nonsibilant (anterior) phonemes, and it is notable that the anterior voiced fricatives /v/ and /ð/ (as in "vee" and "thee") did not pattern with the sibilant phonemes. This is further evidence that the patterns reflect spectrotemporal information, which is quite different for sibilant unvoiced fricatives and anterior voiced fricatives, rather than more abstract phonetic features like "fricative."

Another notable finding from these studies is the combinatorial nature of responses in the lateral STG. For example, Mesgarani et al. report that areas sensitive to vowels show negatively correlated sensitivity to first and second formant values, indicating that they integrate information about F1 and F2 and respond optimally to particular F1/F2 combinations. Such combinatorial encoding not only is expected given the essential acoustic properties that distinguish vowels (Peterson & Barney, 1952) but also provides a probable mechanism for perceptual normalization across speakers with different vocal tract properties (Sussman, 1986).

The ECoG data also provide novel information about the timing of activation across the STG and STS during phoneme perception. Canolty et al. (2007) compared spoken word stimuli with nonphonemic sounds created by removing formant detail from the auditory waveform. High gamma responses were larger for words than for the nonphonemic sounds, and these differences emerged at 120 ms after stimulus onset in the posterior STG, at 193 ms in the mid-STG, and at 268 ms in the mid-STS. This timing is in good agreement with the findings of a larger scalp P2 response for phonemic compared with nonphonemic sounds peaking at approximately 200 ms (Liebenthal et al., 2010). The ECoG data furthermore show the spread of activation over time from earlier posterior auditory areas to higher-level anterior and lateral STG cortex and, finally, to mid-STS, consistent with the proposed hierarchical organization of this network (Binder et al., 2000; Kaas & Hackett, 2000).

These data add enormously to our understanding of spectrotemporal encoding in higher-level auditory cortex. They indicate a process of multidimensional acoustic feature representation from which subphonemic features arise through combinatorial processing. The extent to which such representations are specific to phonemes would require further testing with appropriate nonphonemic controls, but presumably identical subphonemic acoustic features would be represented by the same neurons in lateral STG regardless of whether they occurred in phonemes or nonphonemic sounds. As described in the preceding section, it appears likely that encoding of more abstract phonetic category percepts occurs at still higher levels of processing performed in the left STS.

37.4 THE ROLE OF ARTICULATORY REPRESENTATIONS IN PHONEME PERCEPTION

The role of motor speech codes in phoneme perception has been a long and heatedly debated topic. There are theories emphasizing that speech sounds arise from articulatory gestures, and that these gestures, or at least the motor programs that underlie them, are less variable than the sounds they produce (Galantucci, Fowler, & Turvey, 2006; Liberman et al., 1967). According to such theories, motor programs provide an internal template (or "efference copy") of the phoneme against which the auditory input can be matched. On the other side of the debate are purely auditory theories of phoneme perception in which motor templates have no necessary role (Diehl & Kluender, 1989; Lotto et al., 2009). An intermediate view is that motor templates are increasingly engaged and become more necessary as the auditory signal is more unfamiliar or degraded (Callan, Jones, Callan, & Akahane-Yamada, 2004). There seems little doubt that strong connections exist between speech perception and articulation systems, enabling auditory feedback to guide articulation during speech production (Guenther, Hampson, & Johnson, 1998; Tourville, Reilly, & Guenther, 2008). It therefore appears reasonable that these connections might operate in reverse during phoneme perception tasks.

Numerous fMRI and PET studies show activation in the left precentral gyrus (ventral premotor cortex) and inferior frontal gyrus (IFG) pars opercularis during speech perception tasks (Binder et al., 2000; Burton, Small, & Blumstein, 2000; Démonet et al., 1992), and show common activation in these areas during speech perception and production tasks (Callan, Callan, Gamez, Sato, & Kawato, 2010; Callan et al., 2006; Tremblay & Small, 2011; Wilson & Iacoboni, 2006; Wilson, Saygin, Sereno, & Iacoboni, 2004).


A persistent difficulty with interpreting these posterior frontal activations is the possibility that they may in some cases reflect decision, working memory, or other cognitive control processes rather than activation of motor codes (Binder, Liebenthal, Possing, Medler, & Ward, 2004). A more specific activation of motor representations is suggested by TMS studies showing enhancement of motor-evoked potentials in articulatory muscles during passive listening to speech (Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Watkins, Strafella, & Paus, 2003) (see Yuen, Davis, Brysbaert, & Rastle, 2010 for related evidence). These findings suggest that merely listening to speech sounds activates motor speech areas sufficiently to lower the threshold for eliciting an evoked potential. Pulvermuller et al. (2006) presented fMRI evidence that this activation is articulator-specific. Premotor areas controlling tongue and lip movements were determined using motor localizer tasks. The tongue area was differentially activated by hearing the coronal stop /t/ relative to hearing the bilabial stop /p/, and conversely the lip area was more activated by /p/ compared to /t/.

As pointed out by many authors, the common activation in premotor areas by production and auditory perception of speech gestures is analogous to the cross-modal responsiveness of mirror neurons to execution and visual observation of other actions (Rizzolatti & Arbib, 1998; Rizzolatti & Craighero, 2004). In both cases the performer is simultaneously a perceiver and uses perceptual information to guide performance, presumably creating strong functional connections between perceptual and motor networks.

The possibility that motor cortex activation might be merely an epiphenomenon unrelated to the perceptual process itself was countered by several studies showing that repetitive TMS delivered to the articulatory motor cortex to produce a functional "lesion" reduces phoneme discrimination performance (D'Ausilio et al., 2009; Meister, Wilson, Deblieck, Wu, & Iacoboni, 2007; Möttönen & Watkins, 2009). In two of these studies, the effects on speech perception were specific to the place of articulation affected by TMS: identification of bilabial stops ("ba" and "pa") was selectively affected by TMS to the lip representation, whereas identification of lingual stops ("da" and "ta") was selectively affected by TMS to the tongue representation (D'Ausilio et al., 2009; Möttönen & Watkins, 2009). Such results provide rather compelling evidence for at least some contribution from motor codes to phoneme perception.

At issue is the extent to which this contribution is critical under typical listening circumstances. The aforementioned TMS lesion studies all used relatively difficult phoneme discrimination tasks with intelligibility compromised by adding noise or by manipulating the distinctiveness of synthetic formant transitions. Two other motor cortex TMS lesion studies showed no effect of TMS on perception of highly intelligible phonemes (D'Ausilio, Bufalari, Salmas, & Fadiga, 2012; Sato, Tremblay, & Gracco, 2009), suggesting that motor codes may only contribute to phoneme perception under adverse listening conditions. A more recent TMS study by Möttönen, Dutton, and Watkins (2013), however, provides evidence against this interpretation. They examined effects of motor cortex TMS on auditory mismatch negativity (MMN) responses to ignored speech. The MMN is a relatively preattentive auditory cortex response to infrequent deviant stimuli appearing within a sequence of repeating stimuli. Repetitive TMS to the motor lip representation reduced the magnitude of the MMN to deviant "ba" or "ga" syllables in a train of "da" syllables. This effect was specific to lip cortex stimulation and did not appear when the hand motor area was stimulated, when the lip cortex was stimulated and tone trains with intensity or duration deviants were used, or when the lip cortex was stimulated and syllable trains with duration deviants were used. One wrinkle in the results was that stimulation of the lip cortex also reduced the MMN to syllables with intensity deviations, which would be unexpected if the input from motor cortex reflects phoneme identity. Although these results do not imply that activation of motor cortex is necessary for successful conscious phoneme perception, they do demonstrate that input from motor areas involved in speech articulation reduces the ability of auditory cortex to process changes in phoneme input under noise-free listening conditions and in the absence of any attentional task.

Another notable aspect of the Möttönen and Watkins MMN study concerns the locus of motor activation effects. Activation of articulatory representations could affect perception in at least two distinct ways. According to the original Motor Theory of speech perception, articulatory representations are the actual targets of perception (Liberman et al., 1967). That is, perceptual decisions are based on activation of these codes rather than on auditory phoneme representations, and there is no role for feedback from motor to auditory codes. In contrast, most recent work assumes that motor codes provide feedback that assists phoneme recognition by the auditory system through a mechanism variously described as priming, prediction, or constraint. Suppression of the preattentive auditory MMN by motor cortex TMS suggests that a feedback mechanism from motor to auditory cortex plays at least some role.
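The MMN itself is computed simply: epochs time-locked to deviants and to standards are averaged separately and subtracted. A schematic Python sketch (real pipelines add filtering, artifact rejection, and baseline correction; the data here are random placeholders):

import numpy as np

def mismatch_negativity(epochs, labels):
    # epochs: (n_trials, n_samples) EEG segments time-locked to stimuli;
    # labels: 1 for deviants, 0 for standards. The MMN is the deviant-
    # minus-standard difference wave, typically peaking roughly
    # 100-250 ms after deviance onset at frontocentral electrodes.
    epochs, labels = np.asarray(epochs), np.asarray(labels)
    return epochs[labels == 1].mean(axis=0) - epochs[labels == 0].mean(axis=0)

# Placeholder data: 500 trials of 300 samples each, 10% deviants.
rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.1).astype(int)
epochs = rng.standard_normal((500, 300))
mmn = mismatch_negativity(epochs, labels)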


37.5 HEMISPHERIC SPECIALIZATION IN PHONEME PERCEPTION

Speech contains much more than just phoneme information. Speech provides rich information about the age, gender, and specific identity of the speaker (based on fundamental frequency, voice quality and timbre, speed of articulation, accent, and other cues); emotional state of the speaker and affective content of the message (derived from emphasis cues and prosodic variation in fundamental frequency); and a variety of nonphonemic linguistic content (syllabic emphasis, semantic emphasis, sentential prosody). To refer to phoneme perception as "speech perception" is therefore both illogical and confusing, yet this usage is rampant and has contributed to some of the confusion regarding hemispheric lateralization of phoneme perception. As noted, speech sounds induce relatively symmetric bilateral activation of the STG compared with nonspeech sounds (tones and noise), which has contributed to a general belief that phoneme perception is symmetrically represented. Much of this right STG activation from speech, however, likely reflects processing of indexical and prosodic cues (Baum & Pell, 1999; Belin, Zatorre, Lafaille, Ahad, & Pike, 2000; Bonte, Hausfeld, Scharke, Valente, & Formisano, 2014; Latinus, McAleer, Bestelmeyer, & Belin, 2013; Lattner, Meyer, & Friederici, 2005; Ross, Thompson, & Yenkosky, 1997; Van Lancker, Kreiman, & Cummings, 1989; von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003; Zatorre, Belin, & Penhune, 2002; Zatorre et al., 1992). With adequate controls for these cues, activation associated specifically with phoneme perception is much more strongly lateralized to the left STS (Dehaene-Lambertz et al., 2005; Desai et al., 2008; Liebenthal et al., 2005, 2010; Meyer et al., 2005; Möttönen et al., 2006).

Phoneme perception, however, does not depend solely on the left hemisphere. As noted, data from patients with focal brain lesions and patients undergoing Wada testing indicate that the right hemisphere has the capacity to perceive phonemes when the left hemisphere is compromised. Again, however, it is useful to note that unilateral left STG lesions occasionally produce phoneme perception deficits (this could depend on the extent of left STG and STS damage), and phoneme perception by the right hemisphere during Wada testing is not perfect (Hickok et al., 2008). Moreover, Boatman and colleagues demonstrated unequivocal phoneme perception deficits during unilateral electrical stimulation of the left mid-posterior STG (Boatman, Lesser, & Gordon, 1995). Perception was more impaired for consonant than for vowel discriminations (Boatman, Hall, Goldstein, Lesser, & Gordon, 1997). These mixed findings are difficult to explain with models that propose either bilateral symmetric processing of phonemes or unilateral left hemisphere processing, despite the attractiveness of such models. As with many theories of brain function, the weight of the evidence suggests a less dichotomous state of affairs in which phoneme information is processed mainly but not exclusively by the left STG/STS, and lesions to this region cause varying degrees of phoneme perception deficit, depending on the type and extent of the lesion and the premorbid functional capacity of the individual's right hemisphere.

Zatorre et al. proposed a theory of auditory hemispheric specialization in which the left auditory system excels at temporal resolution and the right auditory system excels at spectral resolution (Zatorre & Belin, 2001). Analogous to the trade-off in temporal versus spectral resolution that characterizes any time-frequency analysis, the left auditory system is proposed to integrate information across wide spectral bandwidths but over short time windows, whereas the right auditory system integrates across longer time windows in narrower spectral bands. As a general concept, the theory provides considerable explanatory power, accounting for right hemisphere specializations for prosody, melody, and suprasegmental linguistic perception, and left hemisphere specialization for perception of short-segment (i.e., <50 ms) phoneme cues. Specialization of the right auditory system for frequency discrimination and melody processing is now well-documented in functional imaging studies (Belin, Zilbovicius, Crozier, Thivard, & Fontaine, 1998; Brechmann & Scheich, 2005; Jamison, Watkins, Bishop, & Matthews, 2006; Schonwiesner, Rubsamen, & von Cramon, 2005; Zatorre & Belin, 2001; Zatorre et al., 2002), although the evidence supporting left hemisphere specialization for high temporal resolution is more mixed (Belin et al., 1998; Boemio, Fromm, Braun, & Poeppel, 2005; Jamison et al., 2006; Overath, Kumar, von Kriegstein, & Griffiths, 2008; Schonwiesner et al., 2005; Zaehle, Wustenberg, Meyer, & Jancke, 2004; Zatorre & Belin, 2001). McGettigan and Scott (2012) criticized the proposal on the basis that phoneme perception requires extensive spectral analysis; however, this criticism misses the central fact that the spectral resolution needed for phoneme perception is still very coarse. Discriminating place of articulation for stop consonants, for example, requires only an overall representation of spectral shape across a range of >1,000 Hz. Similarly, discriminating fricative place of articulation and voicing, voiced from unvoiced stops, and nasal from oral stops require only fairly coarse information about spectral shape over bandwidths of several hundred Hz (see Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995 for related evidence). In contrast, listeners can discriminate changes in F0 of 5 Hz or less (Klatt, 1973), which is required for prosodic perception (and for singing in tune!) yet constitutes spectral resolution several orders of magnitude finer than is needed for phoneme discrimination.

F. PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL 458 37. PHONEME PERCEPTION required for prosodic perception (and for singing in References tune!) yet constitutes spectral resolution several orders Baum, S., & Pell, M. (1999). The neural bases of prosody: Insights of magnitude finer than is needed for phoneme discrimi- from lesion studies and neuroimaging. Aphasiology, 13, 581608. nation. Changes of this magnitude in formant center Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI-guided frequencies have no effect on phoneme intelligibility transcranial magnetic stimulation reveals that the superior (Ghitza & Goldstein, 1983). Vowels are a somewhat temporal sulcus is a cortical locus of the McGurk effect. Journal of special case in which phoneme discrimination depends Neuroscience, 30, 2414 2417. on a relatively fine-grained analysis of spectral shape Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice- selective areas in human auditory cortex. Nature, 403, 309312. (although still coarse by musical standards) and involves Belin, P., Zilbovicius, M., Crozier, S., Thivard, L., & Fontaine, A. longer segments. This difference is consistent with broad (1998). Lateralization of speech and auditory temporal processing. evidence for relatively right-lateralized processing of Journal of Cognitive Neuroscience, 10(4), 536540. vowel categories (Boatman et al., 1997; Bonte et al., 2014; Bernstein, L. E., Demorest, M. E., & Tucker, P. E. (2000). Speech per- Shankweiler & Studdert-Kennedy, 1970; Stefanatos et al., ception without hearing. Perception and Psychophysics, 62(2), 233252. 2005; Wang et al., 2000). Berthier, M. L. (1999). Transcortical . Hove: Psychology Press. Working from the principles articulated by Zatorre, Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgowan, P. S. F., Poeppel proposed a relationship between asymmetric Springer, J. A., Kaufman, J. N., et al. (2000). Human temporal time scales and corresponding brain oscillations lobe activation by speech and nonspeech sounds. Cerebral Cortex, (Poeppel, 2003). Gamma frequency oscillations (40 Hz 10, 512 528. Binder, J. R., Liebenthal, E., Possing, E. T., Medler, D. A., & Ward, and higher) were proposed as the physiological B. D. (2004). Neural correlates of sensory and decision correlate of integration over short time windows processes in auditory object identification. Nature Neuroscience, 7(3), (2050 ms) and linked with processing of phonemes, 295301. whereas alphatheta oscillations (410 Hz) were pro- Blesser, B. (1972). Speech perception under conditions of spectral posed as the physiological correlate of integration over transformation: I. Phonetic characteristics. Journal of Speech and longer time windows (150250 ms) and linked with Hearing Research, 15,5 41. Boatman, D., Hall, C., Goldstein, M. H., Lesser, R., & Gordon, B. processing of syllables and suprasegmental frequency (1997). Neuroperceptual differences in consonant and vowel dis- modulations. The theory is otherwise (to this reader, at crimination as revealed by direct cortical electrical interference. least) identical to Zatorre’s. More recent versions of the Cortex, 33(1), 8398. theory emphasize the role of endogenous theta oscilla- Boatman, D., Hart, J., Lesser, R. P., Honeycutt, N., Anderson, N. B., tions, which are proposed to be entrained by ampli- Miglioretti, D., et al. (1998). Right hemisphere speech perception revealed by amobarbital injection and electrical interference. 
tude modulations at the syllable level in ongoing Neurology, 51(2), 458464. speech (Giraud & Poeppel, 2012). Entrainment of an Boatman, D., Lesser, R., & Gordon, B. (1995). Auditory speech endogenous theta rhythm enhances processing of processing in the left temporal lobe: An electrical interference nested gamma oscillations that sample the auditory study. Brain and Language, 51, 269290. input on a finer time scale to support processing of Boemio, A., Fromm, S., Braun, A., & Poeppel, D. (2005). Hierarchical and asymmetric temporal sensitivity in human auditory cortices. phoneme cues. Asymmetry arises because entrainment Nature Neuroscience, 8(3), 389395. of theta oscillations dominates in the right auditory Bonte, M., Hausfeld, L., Scharke, W., Valente, G., & Formisano, E. system, whereas gamma oscillations are more promi- (2014). Task-dependent decoding of speaker and vowel identity nent in the left auditory cortex. Some support for this from auditory cortical response patterns. Journal of Neuroscience, linkage comes from studies showing relative asymme- 34(13), 4548 4557. Brechmann, A., & Scheich, H. (2005). Hemispheric shifts of sound tries of theta and gamma power at rest (Giraud et al., representation in auditory cortex with conceptual listening. 2007; Morillon et al., 2010). The emphasis on syllabic Cerebral Cortex, 15, 578587. rhythms reflects a current renewed interest in the role Buchman, A. S., Garron, D. C., Trost-Cardamone, J. E., of syllabic organization in speech perception (Ghitza, Wichter, M. D., & Schwartz, D. (1986). Word deafness: One 2012; Peelle & Davis, 2012; Stevens, 2002), which in hundred years later. Journal of Neurology, Neurosurgery, and Psychiatry, 49,489499. turn reflects the dominant rate of articulator move- Burton, M. W., Small, S., & Blumstein, S. E. (2000). The role of seg- ments during production. However, entrainment by a mentation in phonological processing: An fMRI investigation. syllabic rhythm cannot be critical for phoneme percep- Journal of Cognitive Neuroscience, 12, 679690. tion because phonemes can be perceived in isolated Callan, D., Callan, A., Gamez, M., Sato, M. A., & Kawato, M. (2010). monosyllabic stimuli, which do not allow enough time Premotor cortex mediates perceptual performance. NeuroImage, for such entrainment (Ghitza, 2013). At present, the 51, 844 858. Callan, D., Tsytsarev, V., Hanakawa, T., Callan, A., Katsuhara, M., links between acoustic phonetic perception and endog- Fukuyama, H., et al. (2006). Song and speech: Brain regions enous neural oscillations is a topic of ongoing involved with perception and covert production. NeuroImage, 31, investigation. 13271342.


Callan, D. E., Jones, J. A., Callan, A. M., & Akahane-Yamada, R. (2004). Phonetic perceptual identification by native- and second-language speakers differentially activates brain regions involved with acoustic phonetic processing and those involved with articulatory-auditory/orosensory internal models. NeuroImage, 22(3), 1182–1194.
Calvert, G. A. (2001). Crossmodal processing in the human brain: Insights from functional neuroimaging studies. Cerebral Cortex, 11, 1110–1123.
Canolty, R. T., Soltani, M., Dalal, S. S., Edwards, E., Dronkers, N. F., Nagarajan, S. S., et al. (2007). Spatiotemporal dynamics of word processing in the human brain. Frontiers in Neuroscience, 1(1), 185–196.
Crone, N. E., Miglioretti, D. L., Gordon, B., & Lesser, R. P. (1998). Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band. Brain, 121, 2301–2315.
Damper, R. I., & Harnad, S. R. (2000). Neural network models of categorical perception. Perception and Psychophysics, 62(4), 843–867.
D'Ausilio, A., Bufalari, I., Salmas, P., & Fadiga, L. (2012). The role of the motor system in discriminating normal and degraded speech sounds. Cortex, 48(7), 882–887.
D'Ausilio, A., Pulvermüller, F., Salmas, P., Bufalari, I., Begliomini, C., & Fadiga, L. (2009). The motor somatotopy of speech perception. Current Biology, 19, 381–385.
Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8), 3423–3431.
Dehaene-Lambertz, G., Pallier, C., Serniclaes, W., Sprenger-Charolles, L., Jobert, A., & Dehaene, S. (2005). Neural correlates of switching from auditory to speech perception. NeuroImage, 24, 21–33.
Delattre, P., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27, 769–773.
Démonet, J.-F., Chollet, F., Ramsay, S., Cardebat, D., Nespoulous, J.-L., Wise, R., et al. (1992). The anatomy of phonological and semantic processing in normal subjects. Brain, 115, 1753–1768.
Desai, R., Liebenthal, E., Waldron, E., & Binder, J. R. (2008). Left posterior temporal regions are sensitive to auditory categorization. Journal of Cognitive Neuroscience, 20, 1174–1188.
DeWitt, I., & Rauschecker, J. P. (2012). Phoneme and word recognition in the auditory ventral stream. Proceedings of the National Academy of Sciences of the United States of America, 109, E505–514.
Diehl, R., & Kluender, K. (1989). On the objects of speech perception. Ecological Psychology, 1, 121–144.
Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15(2), 399–402.
Fant, G. (1973). Speech sounds and features. Cambridge, MA: MIT Press.
Friederici, A. D., Kotz, S. A., Scott, S. K., & Obleser, J. (2010). Disentangling syntax and intelligibility in auditory language comprehension. Human Brain Mapping, 31, 448–457.
Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13(3), 361–377.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–115.
Ghitza, O. (2012). On the role of theta-driven syllabic parsing in decoding speech: Intelligibility of speech with a manipulated modulation spectrum. Frontiers in Psychology, 3, 238.
Ghitza, O. (2013). The theta-syllable: A unit of speech information defined by cortical function. Frontiers in Psychology, 4, 138.
Ghitza, O., & Goldstein, J. L. (1983). JNDs for the spectral envelope parameters in natural speech. In R. Klinke, & R. Hartmann (Eds.), Hearing—Physiological bases and psychophysics (pp. 352–358). Berlin: Springer-Verlag.
Giraud, A.-L., Kleinschmidt, A., Poeppel, D., Lund, T. E., Frackowiak, R. S. J., & Laufs, H. (2007). Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron, 56(6), 1127–1134.
Giraud, A.-L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517.
Goense, J. M., & Logothetis, N. K. (2008). Neurophysiology of the BOLD fMRI signal in awake monkeys. Current Biology, 18(9), 631–640.
Goldstein, M. (1974). Auditory agnosia for speech ("pure word deafness"): A historical review with current implications. Brain and Language, 1, 195–204.
Guenther, F. H., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference frames for the planning of speech movements. Psychological Review, 105(4), 611–633.
Hall, D. A., Haggard, M. P., Akeroyd, M. A., Palmer, A. R., Summerfield, A. Q., Elliott, M. R., et al. (1999). Sparse temporal sampling in auditory fMRI. Human Brain Mapping, 7, 213–223.
Henschen, S. E. (1918). On the hearing sphere. Acta Oto-laryngologica, 1, 423–486.
Hickok, G., Okada, K., Barr, W., Pa, J., Rogalsky, C., Donnelly, K., et al. (2008). Bilateral capacity for speech sound processing in auditory comprehension: Evidence from Wada procedures. Brain and Language, 107, 179–184.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402.
Hirsch, J., Ruge, M. I., Kim, K. H. S., Correa, D. D., Victor, J. D., Relkin, N. R., et al. (2000). An integrated functional magnetic resonance imaging procedure for preoperative mapping of cortical areas associated with tactile, motor, language, and visual functions. Neurosurgery, 47(3), 711–722.
Hughes, G. W., & Halle, M. (1956). Spectral properties of fricative consonants. Journal of the Acoustical Society of America, 28, 303–310.
Humphries, C., Binder, J. R., Medler, D. A., & Liebenthal, E. (2006). Syntactic and semantic modulation of neural activity during auditory sentence comprehension. Journal of Cognitive Neuroscience, 18, 665–679.
Humphries, C., Swinney, D., Love, T., & Hickok, G. (2005). Response of anterior temporal cortex to syntactic and prosodic manipulations during sentence processing. Human Brain Mapping, 26, 128–138.
Jamison, H. L., Watkins, K. E., Bishop, D. V. M., & Matthews, P. M. (2006). Hemispheric specialization for processing auditory nonspeech stimuli. Cerebral Cortex, 16(9), 1266–1275.
Jäncke, L., Wüstenberg, T., Scheich, H., & Heinze, H. J. (2002). Phonetic perception and the temporal cortex. NeuroImage, 15, 733–746.
Kaas, J. H., & Hackett, T. A. (2000). Subdivisions of auditory cortex and processing streams in primates. Proceedings of the National Academy of Sciences of the United States of America, 97, 11793–11799.
Kewley-Port, D. (1983). Time-varying features as correlates of place of articulation in stop consonants. Journal of the Acoustical Society of America, 73, 322–335.
Klatt, D. H. (1973). Discrimination of fundamental frequency contours in synthetic speech: Implications for models of speech perception. Journal of the Acoustical Society of America, 53, 8–16.
Koenig, W., Dunn, H. K., & Lacy, L. Y. (1946). The sound spectrograph. Journal of the Acoustical Society of America, 18, 21–32.


Kuhl, P. K. (1991). Human adults and human infants show a "perceptual magnet effect" for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93–107.
Kuhl, P. K. (2000). A new view of language acquisition. Proceedings of the National Academy of Sciences of the United States of America, 97(22), 11850–11857.
Kurowski, K., & Blumstein, S. E. (1987). Acoustic properties for place of articulation in nasal consonants. Journal of the Acoustical Society of America, 81, 1917–1927.
Latinus, M., McAleer, P., Bestelmeyer, P. E., & Belin, P. (2013). Norm-based coding of voice identity in human auditory cortex. Current Biology, 23, 1075–1080.
Lattner, S., Meyer, M. E., & Friederici, A. D. (2005). Voice perception: Sex, pitch, and the right hemisphere. Human Brain Mapping, 24, 11–20.
Lehéricy, S., Cohen, L., Bazin, B., Samson, S., Giacomini, E., Rougetet, R., et al. (2000). Functional MR evaluation of temporal and frontal language dominance compared with the Wada test. Neurology, 54, 1625–1633.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., & Medler, D. A. (2005). Neural substrates of phonemic perception. Cerebral Cortex, 15, 1621–1631.
Liebenthal, E., Desai, R., Ellingson, M. M., Ramachandran, B., Desai, A., & Binder, J. R. (2010). Specialization along the left superior temporal sulcus for phonemic and non-phonemic categorization. Cerebral Cortex, 20, 2958–2970.
Lisker, L. (1986). "Voicing" in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3–11.
Lotto, A. J., Hickok, G. S., & Holt, L. L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Sciences, 13, 110–114.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McGettigan, C., & Scott, S. K. (2012). Cortical asymmetries in speech perception: What's wrong, what's right and what's left? Trends in Cognitive Sciences, 16(5), 269–276.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
Meister, I. G., Wilson, S. M., Deblieck, C., Wu, A. D., & Iacoboni, M. (2007). The essential role of premotor cortex in speech perception. Current Biology, 17, 1692–1696.
Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006–1010.
Meyer, M., Zaehle, T., Gountouna, V.-E., Barron, A., Jancke, L., & Turk, A. (2005). Spectro-temporal processing during speech perception involves left posterior temporal auditory cortex. Neuroreport, 16(18), 1985–1989.
Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2134.
Miller, L. M., & D'Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience, 25(25), 5884–5893.
Morillon, B., Lehongre, K., Frackowiak, R. S., Ducorps, A., Kleinschmidt, A., Poeppel, D., et al. (2010). Neurophysiological origin of human brain asymmetry for speech and language. Proceedings of the National Academy of Sciences of the United States of America, 107, 18688–18693.
Möttönen, R., Calvert, G. A., Jaaskelainen, I. P., Matthews, P. M., Thesen, T., Tuomainen, J., et al. (2006). Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. NeuroImage, 30, 563–569.
Möttönen, R., Dutton, R., & Watkins, K. E. (2013). Auditory-motor processing of speech sounds. Cerebral Cortex, 23(5), 1190–1197.
Möttönen, R., & Watkins, K. E. (2009). Motor representations of articulators contribute to categorical perception of speech sounds. Journal of Neuroscience, 29, 9819–9825.
Mummery, C. J., Ashburner, J., Scott, S. K., & Wise, R. J. S. (1999). Functional neuroimaging of speech perception in six normal and two aphasic subjects. Journal of the Acoustical Society of America, 106, 449–457.
Narain, C., Scott, S. K., Wise, R. J. S., Rosen, S., Leff, A., Iversen, S. D., et al. (2003). Defining a left-lateralized response specific to intelligible speech using fMRI. Cerebral Cortex, 13, 1362–1368.
Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59, 781–787.
Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299–325.
Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.-H., Saberi, K., et al. (2010). Hierarchical organization of human auditory cortex: Evidence from acoustic invariance in the response to intelligible speech. Cerebral Cortex, 20, 2486–2495.
Overath, T., Kumar, S., von Kriegstein, K., & Griffiths, T. D. (2008). Encoding of spectral correlation over time in auditory cortex. Journal of Neuroscience, 28(49), 13268–13273.
Pallier, C., Devauchelle, A.-D., & Dehaene, S. (2011). Cortical representation of the constituent structure of sentences. Proceedings of the National Academy of Sciences of the United States of America, 108(6), 2522–2527.
Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, N. E., et al. (2012). Reconstructing speech from human auditory cortex. PLoS Biology, 10(1), e1001251.
Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24, 175–184.
Pitt, M. A., & Samuel, A. G. (1993). An empirical and meta-analytic evaluation of the phoneme identification task. Journal of Experimental Psychology: Human Perception and Performance, 19, 699–725.
Poeppel, D. (2001). Pure word deafness and the bilateral processing of the speech code. Cognitive Science, 25, 679–693.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as "asymmetric sampling in time". Speech Communication, 41, 245–255.
Pruitt, J. S., Jenkins, J. J., & Strange, W. (2006). Training the perception of Hindi dental and retroflex stops by native speakers of American English and Japanese. Journal of the Acoustical Society of America, 119(3), 1684–1696.
Pulvermüller, F., Huss, M., Kherif, F., Moscoso Del Prado Martin, F., Hauk, O., & Shtyrov, Y. (2006). Motor cortex maps articulatory features of speech sounds. Proceedings of the National Academy of Sciences of the United States of America, 103, 7865–7870.
Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12(6), 718–724.


Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–950.
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences, 21, 188–194.
Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192.
Ross, E. D., Thompson, R. D., & Yenkosky, J. (1997). Lateralization of affective prosody in brain and the callosal integration of hemispheric language functions. Brain and Language, 56, 27–54.
Samuel, A. G. (1997). Lexical activation produces potent phonemic percepts. Cognitive Psychology, 32, 97–127.
Sato, M., Tremblay, P., & Gracco, V. L. (2009). A mediating role of the premotor cortex in phoneme segmentation. Brain and Language, 111(1), 1–7.
Sawusch, J. R., & Newman, R. S. (2000). Perceptual normalization for speaking rate II: Effects of signal discontinuities. Perception and Psychophysics, 62(2), 285–300.
Schonwiesner, M., Rubsamen, R., & von Cramon, D. Y. (2005). Hemispheric asymmetry for spectral and temporal processing in the human antero-lateral auditory belt cortex. European Journal of Neuroscience, 22(6), 1521–1528.
Scott, S. K., Blank, C., Rosen, S., & Wise, R. J. S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123, 2400–2406.
Shankweiler, D. P., & Studdert-Kennedy, M. (1970). Hemispheric specialization for speech perception. Journal of the Acoustical Society of America, 48(2), 579–594.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Spitsyna, G., Warren, J. E., Scott, S. K., Turkheimer, F. E., & Wise, R. J. S. (2006). Converging language streams in the human temporal lobe. Journal of Neuroscience, 26(28), 7328–7336.
Stefanatos, G. A., Gershkoff, A., & Madigan, S. (2005). On pure word deafness, temporal processing and the left hemisphere. Journal of the International Neuropsychological Society, 11(4), 456–470.
Stevens, K. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111(4), 1872–1891.
Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas, & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 1–38). Hillsdale, NJ: Erlbaum.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, Q., & Haggard, M. (1977). On the dissociation of spatial and temporal cues to the voicing distinction in initial stop consonants. Journal of the Acoustical Society of America, 62, 435–448.
Sussman, H. M. (1986). A neuronal model of vowel normalization and representation. Brain and Language, 28, 12–23.
Tourville, J. A., Reilly, K. J., & Guenther, F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. NeuroImage, 39(3), 1429–1443.
Tremblay, P., & Small, S. L. (2011). On the context-dependent nature of the contribution of the ventral premotor cortex to speech perception. NeuroImage, 57, 1561–1571.
Ulrich, G. (1978). Interhemispheric relationships in auditory agnosia: An analysis of the preconditions and a conceptual model. Brain and Language, 5, 286–300.
Van Lancker, D. R., Kreiman, J., & Cummings, J. (1989). Voice perception deficits: Neuroanatomical correlates of phonagnosia. Journal of Clinical and Experimental Neuropsychology, 11(5), 665–674.
Vandenberghe, R., Nobre, A. C., & Price, C. J. (2002). The response of left temporal cortex to sentences. Journal of Cognitive Neuroscience, 14(4), 550–560.
Visser, M., Jefferies, E., & Lambon Ralph, M. A. (2010). Semantic processing in the anterior temporal lobes: A meta-analysis of the functional neuroimaging literature. Journal of Cognitive Neuroscience, 22(6), 1083–1094.
von Kriegstein, K., Eger, E., Kleinschmidt, A., & Giraud, A. L. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cognitive Brain Research, 17, 48–55.
Wang, E., Peach, R. K., Xu, Y., Schneck, M., & Manry, C. (2000). Perception of dynamic acoustic patterns by an individual with unilateral verbal auditory agnosia. Brain and Language, 73(3), 442–455.
Warren, R. M., & Obusek, C. J. (1971). Speech perception and phonemic restorations. Perception and Psychophysics, 9, 358–362.
Watkins, K., Strafella, A., & Paus, T. (2003). Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia, 41(8), 989–994.
Werker, J. F., & Tees, R. C. (2005). Speech perception as a window for understanding plasticity and commitment in language systems of the brain. Developmental Psychobiology, 46(3), 233–251.
Wilson, S., Saygin, A., Sereno, M., & Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7, 701–702.
Wilson, S. M., & Iacoboni, M. (2006). Neural responses to non-native phonemes varying in producibility: Evidence for the sensorimotor nature of speech perception. NeuroImage, 33, 316–325.
Wise, R., Chollet, F., Hadar, U., Friston, K., Hoffner, E., & Frackowiak, R. (1991). Distribution of cortical neural networks involved in word comprehension and word retrieval. Brain, 114, 1803–1817.
Wise, R. J. S., Scott, S. K., Blank, S. C., Mummery, C. J., Murphy, K., & Warburton, E. A. (2001). Separate neural subsystems within "Wernicke's area". Brain, 124, 83–95.
Yuen, I., Davis, M. H., Brysbaert, M., & Rastle, K. (2010). Activation of articulatory information in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 107, 592–597.
Zaehle, T., Wustenberg, T., Meyer, M., & Jancke, L. (2004). Evidence for rapid auditory perception as the foundation of speech processing: A sparse temporal sampling fMRI study. European Journal of Neuroscience, 20(9), 2447–2456.
Zatorre, R. J., & Belin, P. (2001). Spectral and temporal processing in human auditory cortex. Cerebral Cortex, 11(10), 946–953.
Zatorre, R. J., Belin, P., & Penhune, V. B. (2002). Structure and function of auditory cortex: Music and speech. Trends in Cognitive Sciences, 6(1), 37–46.
Zatorre, R. J., Evans, A. C., Meyer, E., & Gjedde, A. (1992). Lateralization of phonetic and pitch discrimination in speech processing. Science, 256, 846–849.

CHAPTER 38

A Neurophysiological Perspective on Speech Processing in "The Neurobiology of Language"

Luc H. Arnal1,2, David Poeppel2,3 and Anne-Lise Giraud1
1Department of Neurosciences, Biotech Campus, University of Geneva, Geneva, Switzerland; 2Department of Psychology, New York University, New York, NY, USA; 3Max-Planck-Institute for Empirical Aesthetics, Frankfurt, Germany

38.1 OVERVIEW

Of all the signals the human auditory system has to process, the most relevant to the listener is arguably speech. Speech perception is learned and executed with automaticity and great ease even by very young children, but it is handled surprisingly poorly by even sophisticated automatic devices. Therefore, parsing and decoding speech can be considered one of the main challenges of the auditory system. This chapter focuses on how the human auditory cortex uses the temporal structure of the acoustic signal to extract phonemes and syllables, two types of events that need to be identified in connected speech.

Speech is a complex acoustic signal exhibiting quasiperiodic behavior at several timescales. The neural signals recorded from the auditory cortex using EEG or MEG also exhibit a quasiperiodic structure, whether in response to speech or not. In this chapter we present different models grounded on the assumption that the quasiperiodic structure of collective neural activity in auditory cortex represents the ideal mechanical infrastructure to solve the speech demultiplexing problem (i.e., the fractioning of speech into linguistic constituents of variable size). These models remain largely hypothetical. However, they constitute exciting theories that will hopefully lead to new research approaches and incremental progress on this foundational question about speech perception.

The chapter proceeds as follows. First, some of the essential features of natural and speech auditory stimuli are outlined. Next, the properties of auditory cortex that reflect its sensitivity to these features are reviewed, and current ideas about the neurophysiological mechanisms underpinning the processing of connected speech are discussed. The chapter closes with a summary of speech processing models on a larger scale, attempting to capture most of these phenomena in an integrated vision.

38.1.1 Timescales in Auditory Perception

Sounds are audible over a broad frequency range between 20 and 20,000 Hz. They enter the outer ear and travel through the middle ear to the inner ear, where they provoke the basilar membrane to vibrate at a specific location, depending on the sound frequency. Low and high frequencies induce vibrations of the apex and base of the basilar membrane, respectively. The deformation of the membrane on acoustic stimulation provokes the deflection of inner hair cells' ciliae and the emission of a neural signal to cochlear neurons, subsequently transmitted to neurons of the cochlear nucleus in the brainstem. Each cochlear neuron is sensitive to a specific range of acoustic frequencies between 20 Hz and 20 kHz. Because of their regular position along the basilar membrane, the cochlear neurons ensure the place-coding of acoustic frequencies, also called "tonotopy," which is preserved up to the cortex (Moerel et al., 2013; Saenz & Langers, 2014). Acoustic fluctuations below 20 Hz are not audible. They do not elicit place-specific responses in the cochlea.
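The place-coding just described can be made concrete with the classic Greenwood frequency-position function. The short Python sketch below is illustrative only and is not part of the original chapter; the human parameter values (A = 165.4, a = 2.1, k = 0.88) are the ones conventionally quoted for this function, and the function name is our own.

# Greenwood frequency-position function, conventional human parameters.
# x is the normalized basilar-membrane position: 0 = apex (low
# frequencies), 1 = base (high frequencies).
def place_to_frequency(x):
    return 165.4 * (10 ** (2.1 * x) - 0.88)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"position {x:.2f} -> {place_to_frequency(x):8.1f} Hz")
# Spans roughly 20 Hz at the apex to ~20 kHz at the base,
# matching the audible range described in the text.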


[Figure 38.1 appears here. (A) Frequency (Hz) axis from 1 to 20,000 Hz, marking the range of audible frequencies. (B) Sound perceptual category: events (discrete percepts), fluctuations/roughness, and pitch (continuous percept). (C) Speech features: words, syllables, phonemes, and fundamental frequency (men, women, children). (D) Neural oscillations: delta–theta, alpha–beta, gamma.]
FIGURE 38.1 (A) Scale of perceived temporal modulation (modified from Joris, Schreiner, & Rees, 2004 and Nourski & Brugge, 2011). (B) Relevant psychophysical parameters (perceptual changes) of the spectrogram reflect the temporal constraints that superimpose on the structure of linguistic signals. (C) Temporal structure of linguistic features. (D) The length of linguistic features remarkably matches the frequency of oscillations that are observed at rest in the brain. Note that the frequency ranges at which auditory percepts switch from discrete (flutter) to continuous (pitch) roughly match the upper limit at which gamma rhythms can be entrained by the stimulus (~200 Hz).
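The correspondence sketched in panel (D) can be checked with simple arithmetic: a unit of duration T corresponds to a modulation rate of roughly 1/T. A minimal Python sketch (the durations are the approximate values quoted later in this chapter; the band labels are conventional EEG nomenclature):

# Reciprocal-duration check: unit length (s) -> equivalent rate (Hz).
units = {
    "phoneme":  (0.025, 0.080),   # ~25-80 ms
    "syllable": (0.150, 0.300),   # ~150-300 ms
}
for name, (lo, hi) in units.items():
    print(f"{name:8s} -> {1/hi:4.1f} to {1/lo:4.1f} Hz")
# phoneme  -> 12.5 to 40.0 Hz (beta/low-gamma range)
# syllable ->  3.3 to  6.7 Hz (theta range)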

Low frequencies below 300 Hz are present in complex sounds, such as temporal fluctuations of audible frequencies, and are encoded through the discharge rate of cochlear neurons (Zeng, 2002). A range of frequencies from 20 to 300 Hz is both place-coded and rate-coded at the auditory periphery. Temporal modulation of sounds in these frequencies typically elicits a sensation of pitch. Figures 38.1A and 38.1B summarize the correspondence between categories of perceptual attributes and the sound modulation frequency (see also Nourski & Brugge, 2011 for a review). When sounds are modulated at very slow rates (<10 Hz), a sequence of distinct events is perceived. When modulations accelerate from 10 to approximately 100 Hz, distinct events merge into a single auditory stream, and the sensation evolves from fluctuating magnitude to a sensation of acoustic roughness (Figure 38.1B).

Speech sounds are complex acoustic signals that involve only the lower part of audible frequencies (20–8,000 Hz). They are called "complex" because both their frequency distribution and their magnitude vary strongly and quickly over time. In natural speech, amplitude modulations at slow (<20 Hz) and fast (>100 Hz) timescales are coupled (Figure 38.2), and slower temporal fluctuations modulate the amplitude of spectral fluctuations. Slow modulations (<5 Hz) signal word and syllable boundaries (Hyafil, Fontolan, Gutkin, & Giraud, 2012; Rosen, 1992), which are perceived as a sequence of distinct events. Phonemes (speech sounds) are signaled by fast spectrotemporal modulations (<30 Hz). They can be perceived as distinct events when being discriminated from each other. Faster modulations, such as those imposed by the glottal pulse (100–300 Hz), indicate the voice pitch (Figure 38.1C). Figure 38.1D shows how these perceptual events relate to the different frequency ranges of the EEG.

38.1.2 The Temporal Structure of Speech Sounds

Figure 38.2 illustrates two useful ways to visualize the signal: as a waveform (38.2A) and as a spectrogram (38.2B). The waveform represents energy variation over time, the input that the ear actually receives. The outlined "envelope" (thick line) reflects that there is a temporal regularity in the signal at relatively low modulation frequencies. These modulations of signal energy (in reality, spread out across the "cochlear" filterbank) are below 20 Hz and peak at a rate of approximately 4–6 Hz (Elliott & Theunissen, 2009; Steeneken & Houtgast, 1980). From the perspective of what auditory cortex receives as input, namely the modulations at the output of each frequency channel of the filterbank that constitutes the auditory periphery, these energy fluctuations can be characterized by the modulation spectrum (Chi, Ru, & Shamma, 2005; Dau, Kollmeier, & Kohlrausch, 1997; Kanedera, Arai, Hermansky, & Pavel, 1999; Kingsbury, Morgan, & Greenberg, 1998; McDermott & Simoncelli, 2011).


FIGURE 38.2 (A) Waveform and (B) spectrogram of the same sentence uttered by a male speaker. Some of the key acoustic cues in speech comprehension are highlighted in black.
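Both representations in Figure 38.2 are straightforward to compute. The Python sketch below is illustrative only: a synthetic signal stands in for speech, and the Hilbert-plus-low-pass recipe is one common way to extract the slow envelope, not necessarily the one used to draw the figure.

import numpy as np
from scipy.signal import hilbert, butter, filtfilt, spectrogram

fs = 16000                                    # sampling rate (Hz)
t = np.arange(2 * fs) / fs
# Stand-in for speech: a 150-Hz carrier with a 4-Hz (syllable-rate)
# amplitude modulation.
x = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 150 * t)

# (A) Envelope: magnitude of the analytic signal, low-passed below
# 10 Hz so that only the slow energy modulations remain.
b, a = butter(4, 10 / (fs / 2), btype="low")
envelope = filtfilt(b, a, np.abs(hilbert(x)))

# (B) Spectrogram: decomposition into time, frequency, and amplitude.
freqs, times, S = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
print(envelope.shape, S.shape)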

At the shortest timescale (below 1 ms or, equivalently, above 1 kHz), the very fast temporal fluctuations are transformed into a spectral representation at the cochlea, and the neural processing of these features is generally known as spectral processing. At an intermediate timescale (~70 Hz–1 kHz), the temporal fluctuations are usually referred to as the temporal fine structure. The temporal fine structure is critical to the perception of pitch and inter-aural time differences that are important cues for sound source localization (Grothe, Pecka, & McAlpine, 2010; Plack, Oxenham, Fay, & Popper, 2005). Temporal fluctuations on an even longer timescale (~1–10 Hz) are heard as a sequence of discrete events. Acoustic events occurring on this timescale include syllables and words in speech and notes and beats in music. Of course, there are no clear boundaries between these timescales; the timescales are divided here based on human auditory perception.

The second analytic representation, the spectrogram, decomposes the acoustic signal in the frequency, time, and amplitude domains (Figure 38.2B). Although the human auditory system captures frequency information between 20 Hz and 20 kHz (and such a spectrogram is plotted here), most of the information that is extracted for effective recognition is below 8 kHz. It is worth remembering that speech transmitted over telephone landlines contains a much narrower bandwidth (200–3,600 Hz) and is comfortably understood by normal listeners. A number of critical acoustic features can be identified in the spectrogram. The faintly visible vertical stripes represent the glottal pulse, which reflects the speaker's fundamental frequency, F0. This can range from approximately 100 Hz (male adult) to 300 Hz (child; see Figure 38.1D). The horizontal bands of energy show where in frequency space a particular speech sound is carried. The spectral structure thus reflects the articulator configuration. These bands of energy include the formants (F1, F2, etc.) definitional of vowel identity; high-frequency bursts associated, for example, with frication in certain consonants (e.g., /s/, /f/); and formant transitions that signal the change from a consonant to a vowel or vice versa.

Notwithstanding the importance of the spectral fine structure, there is a big caveat: speech can be understood, in the sense of being intelligible in psychophysical experiments, when the spectral content is replaced by noise and only the envelope is preserved. Importantly, this manipulation is done in separate bands across the spectrum, for example, as few as four separate bands (Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). Speech that contains only envelope but no fine structure information is called vocoded speech (Faulkner, Rosen, & Smith, 2000). Compelling demonstrations that exemplify this type of signal decomposition illustrate that the speech signal can undergo radical alterations and distortions and yet remain intelligible (Shannon et al., 1995; Smith, Delgutte, & Oxenham, 2002). Such findings have led to the idea that the temporal envelope, that is, temporal modulations of speech at relatively slow rates, is sufficient to yield speech comprehension (Drullman, Festen, & Plomp, 1994a, 1994b; Giraud et al., 2004; Loebach & Wickesberg, 2008; Rosen, 1992; Scott, Rosen, Lang, & Wise, 2006; Shannon et al., 1995; Souza & Rosen, 2009). When using stimuli in which the fine structure is compromised or not available at all, envelope modulations below 16 Hz appear to suffice for adequate intelligibility.

The remarkable comprehension level reached by most patients with cochlear implants, in whom approximately 15–20 electrodes replace 3,000 hair cells, remains the best empirical demonstration that the spectral content of speech can be degraded with tolerable alteration of speech perception (Roberts, Summers, & Bailey, 2011). A related demonstration showing the resilience of speech comprehension in the face of radical signal impoverishment is provided by sine-wave speech (Remez, Rubin, Pisoni, & Carrell, 1981). In these stimuli both envelope and spectral content are degraded, but enough information is preserved to permit intelligibility. Typically, sine-wave speech preserves the modulations of the three first formants, which are replaced by sine-waves centered on F0, F1, and F2. In summary, dramatically impoverished stimuli remain intelligible insofar as enough information in the spectrum is available to convey temporal modulations at appropriate rates.

Based on this brief and selective summary, two concepts merit emphasis. First, the extended speech signal contains critical information that is modulated at rates below 20 Hz, with the modulation peaking at approximately 5 Hz (Edwards & Chang, 2013). This low-frequency information correlates closely with the syllabic structure of connected speech (Giraud & Poeppel, 2012). Second, the speech signal contains critical information at modulation rates higher than, for example, 50 Hz. This rapidly changing information is associated with fine spectral changes that carry information about the speaker's gender or identity and other relevant speech attributes (Elliott & Theunissen, 2009). Thus, two surprisingly different timescales are concurrently at play in the speech signal. This important issue is described in the text that follows. In this chapter, we discuss the timescales longer than 5 ms (<200 Hz), with a focus on the timescale between 100 ms and 1 s (1–10 Hz). Temporal features that contribute to the spatial localization of sounds are not discussed (see Grothe et al., 2010 for a review).

38.2 CORTICAL PROCESSING OF CONTINUOUS SOUND STREAMS

38.2.1 The Discretization Problem

In natural connected speech, speech information is embedded in a continuous acoustic flow, and sentences are not "presegmented" in perceptual units of analysis. Recent work on sentence-level stimuli (i.e., materials with a duration exceeding 1–2 s) using experimental tasks such as intelligibility suggests that long-term temporal parameters of the acoustic signal are of major importance (Ghitza & Greenberg, 2009; Luo, Liu, & Poeppel, 2010; Luo & Poeppel, 2007; Peelle, Gross, & Davis, 2013). Online segmentation remains a major challenge to contemporary models of speech perception as well as automatic speech recognition.

Interestingly, a large body of psychophysical work studied speech perception and intelligibility using phrasal or sentential stimuli (see Miller, 1951 for a summary of many experiments and Allen, 2005 for a review of the influential work of Fletcher and others). Fascinating findings emerged from that work, emphasizing the role of signal-to-noise ratio in speech comprehension, but perhaps the most interesting feature is that connected speech has principled and useful temporal properties that may play a key role in the problem of speech parsing and decoding. Natural speech usually comes to the listener as a continuous stream and needs to be analyzed online and decoded by mechanisms that are unlikely to be continuous (Giraud & Poeppel, 2012). The parsing mechanism corresponds to the discretization of the continuous input signal into subsegments of speech information that are read out, to a certain extent, independently from each other. The notion that perception is discrete has been extensively discussed and generalized in numerous sensory modalities and contexts (Pöppel & Artin, 1988; VanRullen & Koch, 2003). Here, we discuss the hypothesis that neural oscillations constitute a possible mechanism for discretizing temporally complex speech (Giraud & Poeppel, 2012).

38.2.2 Analysis at Multiple Timescales

Speech is a multiplexed signal; that is, it interlinks several levels of complexity, and organizational principles and perceptual units of analysis exist at distinct timescales. Using data from linguistics, psychophysics, and physiology, Poeppel and colleagues proposed that speech is analyzed in parallel at multiple timescales (Boemio, Fromm, Braun, & Poeppel, 2005; Poeppel, 2001, 2003; Poeppel, Idsardi, & van Wassenhove, 2008). The central idea is that both local-to-global and global-to-local types of analyses are carried out concurrently (multitime-resolution processing). This assumption adds to the notion of reverse hierarchy (Hochstein & Ahissar, 2002; Nahum, Nelken, & Ahissar, 2008) and other hierarchical models in perception, which propose that the hierarchical complexification of sensory information (e.g., the temporal hierarchy) maps onto the anatomo-functional hierarchy of the brain (Giraud et al., 2000; Kiebel, Daunizeau, & Friston, 2008). The principal motivations for extending such a hypothesis are two-fold.

First, a single, short temporal window that forms the basis for hierarchical processing, that is, increasingly larger temporal analysis units as one ascends the processing system, fails to account for the spectral and temporal sensitivity of the speech processing system and is difficult to reconcile with behavioral performance. Second, the computational strategy of analyzing information on multiple scales is widely used in engineering and biological systems, and the neuronal infrastructure exists to support multiscale computation (Canolty & Knight, 2010). According to the view summarized here, speech is chunked into segments of roughly featural or phonemic length and then integrated into larger units, such as segments, diphones, syllables, and words. In parallel, there is a fast global analysis that yields coarse inferences about speech (akin to the "landmarks" hypothesis of Stevens, 2002) and that subsequently refines segmental analysis. Here, we propose that segmental and suprasegmental analyses could be carried out concurrently and "packaged" for parsing and decoding by neuronal oscillations at different rates.

The notion that speech analysis occurs in parallel at multiple timescales justifies moving away from strictly hierarchical models of speech perception (Giraud & Price, 2001). Accordingly, the simultaneous extraction of different acoustic cues permits simultaneous high-order processing of different information from a unique input signal. That speech should be analyzed in parallel at different timescales derives, among other reasons, from the observation that articulatory–phonetic phenomena occur at different timescales. It was noted previously (Figure 38.1C and Figure 38.2) that the speech signal contains events of different durations: short energy bursts and formant transitions occur within a 20- to 80-ms timescale, whereas syllabic information occurs over 150–300 ms. The processing of both types of events could be accounted for either by a hierarchical model in which smaller acoustic units (segments) are concatenated into larger units (syllables) or by a parallel model in which both temporal units are extracted independently and then combined. A degree of independence in the processing of long (slow modulation) and short (fast modulation) units is observed at the behavioral level. For instance, speech can be understood well when it is first segmented into units up to 60 ms and when these local units are temporally reversed (Greenberg & Arai, 2001; Saberi & Perrott, 1999). Because the correct extraction of short units is not a prerequisite for comprehension, this rules out the notion that speech processing relies solely on hierarchical processing of short and then larger units. Overall, there appears to be a grouping of psychophysical phenomena such that some cluster at thresholds of approximately 50 ms and below and others cluster at approximately 200 ms and above (a similar clustering is observed for temporal properties in vision; Holcombe, 2009). Importantly, nonspeech signals are subject to similar thresholds. For example, 15–20 ms is the minimal stimulus duration required for correctly identifying upward versus downward FM sweeps (Luo, Boemio, Gordon, & Poeppel, 2007). By comparison, a 200-ms stimulus duration underlies loudness judgments. In summary, physiological events at related scales form the basis for processing at that level. Therefore, the neuronal oscillatory machinery (together with motor constraints related to speech production; Morillon et al., 2010) presumably imposed strong temporal constraints that might have shaped the size of acoustic features selected to carry speech information. This is consistent with the notion that perception is discrete and that the exogenous recruitment of neuronal populations is followed by refractory periods that temporarily reduce the ability to optimally extract sensory information (Ghitza, 2011; Ghitza & Greenberg, 2009). According to this hypothesis, the temporally limited capacity of gamma oscillations to integrate information over time possibly imposes a lower limit on phoneme length. This also suggests that oscillatory constraints in the delta–theta range possibly constrained the size of syllables to be approximately the size of a delta–theta cycle. Considering that the average lengths of phoneme and syllable are approximately 25–80 and 150–300 ms, respectively (Figure 38.1 and Figure 38.2), the dual timescale segmentation requires two parallel sampling mechanisms, one at approximately 40 Hz (or, more broadly, in the low gamma range) and one at approximately 4 Hz (or in the theta range).
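The dual sampling scheme can be caricatured in a few lines of code: the same input is chunked concurrently into theta-sized and gamma-sized windows. A toy Python sketch, with window durations derived from the rates just quoted (the signal, sampling rate, and helper name are stand-ins of our own, not part of the chapter):

import numpy as np

def chunk(signal, fs, window_s):
    """Cut a signal into consecutive non-overlapping windows."""
    n = int(round(window_s * fs))
    usable = len(signal) - len(signal) % n
    return signal[:usable].reshape(-1, n)

fs = 1000                        # 1-kHz envelope sampling, for illustration
env = np.random.rand(3 * fs)     # stand-in for a 3-s speech envelope

theta_chunks = chunk(env, fs, 0.250)   # ~4 Hz: syllable-sized windows
gamma_chunks = chunk(env, fs, 0.025)   # ~40 Hz: phoneme-sized windows
print(theta_chunks.shape, gamma_chunks.shape)   # (12, 250) (120, 25)

The point of the sketch is only that the two analyses run in parallel over the same signal rather than one being built hierarchically from the other.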

38.2.3 Neural Oscillations as Endogenous Temporal Constraints

Neural oscillations correspond to synchronous activity of neuronal assemblies that are both intrinsically coupled and coupled by a common input. It was proposed that these oscillations reflect modulations of neuronal excitability that temporally constrain the sampling of sensory information (Schroeder & Lakatos, 2009a). The intriguing correspondence between the size of certain speech temporal units and the frequency of oscillations in certain frequency bands (see Figure 38.1) has elicited the intuition that they might play a functional role in sensory sampling. Oscillations are evidenced by means of a spectrotemporal analysis of electrophysiological recordings (see Wang, 2010 for a review). The requirements for measuring oscillations and spiking activity are different. The presentation of an exogenous stimulus typically results in an increase of spiking activity (i.e., an increase of synaptic output relative to baseline spiking activity) in those brain areas that are functionally sensitive to such sensory inputs. Neural oscillations, however, can be observed in local field potential recordings (LFPs), which reflect synchronized dendritic inputs into the observed area, even in the absence of any external stimulation. Exogenous stimulation, however, typically modulates oscillatory activity, resulting either in a reset of their phase and/or in a change (increase or decrease) in the magnitude of these oscillations (Howard & Poeppel, 2012).

Cortical oscillations are proposed to shape spike-timing dynamics and to impose phases of high and low neuronal excitability (Britvina & Eggermont, 2007; Panzeri, Brunel, Logothetis, & Kayser, 2010; Schroeder & Lakatos, 2009a, 2009b). The assumption that it is oscillations that cause spiking to be temporally clustered is derived from the observation that spiking tends to occur in specific phases (i.e., the trough) of oscillatory activity (Womelsdorf et al., 2007). It is also assumed that spiking and oscillations do not reflect the same aspect of information processing. Whereas spiking reflects axonal activity, oscillations are said to reflect mostly dendritic synaptic activity (Wang, 2010). Although both measures are relevant to address how sensory information is encoded in the brain, we believe that the ability of neural oscillations to temporally organize spiking activity supports the functional relevance of neural oscillations to solve the discretization problem and to permit the integration of complex sensory signals across time.

Neuronal oscillations are ubiquitous in the brain, but they vary in strength and frequency depending on their location and the exact nature of their neuronal generators (Mantini, Perrucci, Del Gratta, Romani, & Corbetta, 2007). The notion that neural oscillations shape the way the brain processes sensory information is supported by a wealth of electrophysiological findings in humans and animals. Stimuli that occur in the ideal excitability phase of slow oscillations (<12 Hz) are processed faster and with a higher accuracy (Busch, Dubois, & VanRullen, 2009; Henry & Obleser, 2012; Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008; Ng, Schroeder, & Kayser, 2012; Wyart, de Gardelle, Scholl, & Summerfield, 2012). However, gamma-band 40 Hz activity (low gamma band) can be observed at rest in both monkey (Fukushima, Saunders, Leopold, Mishkin, & Averbeck, 2012) and human auditory cortex. In humans, it can be measured using EEG, MEG, concurrent EEG and fMRI (with more precise localization) (Morillon et al., 2010), and intracranial electroencephalographic recordings (sEEG, ECoG) in patients. Neural oscillations in this range are endogenous in the sense that one can observe a spontaneous spike clustering at approximately 40 Hz even in the absence of external stimulation. This gamma activity is thought to be generated by a ping-pong interaction between pyramidal cells and inhibitory interneurons (Borgers, Epstein, & Kopell, 2005, 2008), or even just among interneurons that are located in superficial cortical layers (Tiesinga & Sejnowski, 2009). Exogenous inputs usually increase gamma-band activity in sensory areas, presumably clustering spiking activity that is propagated to higher hierarchical processing stages (Arnal & Giraud, 2012; Arnal, Wyart, & Giraud, 2011; Bastos et al., 2012; Fontolan, Morillon, Liegeois-Chauvel, & Giraud, 2014). By analogy with the proposal of Elhilali, Fritz, Klein, Simon, and Shamma (2004) that slow responses gate faster ones, it is interesting to envisage this periodic modulation of spiking by oscillatory activity as an endogenous mechanism to optimize the extraction of relevant sensory input in time. Such integration could occur under the patterning of slower oscillations in the delta–theta range.

38.2.4 Alignment of Neuronal Excitability with Speech Timescales

Experimental exploration of how speech parsing and encoding is performed by the brain is nontrivial. One approach has been to explore how neural responses can discriminate different sentences, assuming that the features of neural signals that are sensitive to such differences (e.g., frequency band, amplitude, and phase) should reveal the features that are vital to sentence decoding. Using this approach, it was shown that the phase of theta-band neural activity reliably discriminates different sentences (Luo & Poeppel, 2007). Specifically, when one sentence is repeatedly presented to listeners, the phase of ongoing theta-band activity follows a consistent phase sequence. When different sentences are played, however, different phase sequences are observed. Because the theta band (4–8 Hz) falls around the mean syllabic rate of speech (~5 Hz), the phase of theta-band activity likely tracks syllabic-level features of speech (Doelling, Arnal, Ghitza, & Poeppel, 2014; Edwards & Chang, 2013; Giraud & Poeppel, 2012; Hyafil et al., 2012; Luo & Poeppel, 2007; Peelle et al., 2013). These findings support the notion that the syllabic timescale has adapted to a preexisting cortical preference for temporal information in this frequency range. At this point, however, it is not clear whether the phase-locking between speech inputs and neural oscillations is necessary for speech intelligibility. Sentences played backward (and therefore unintelligible) can similarly be discriminated on the basis of their phase course, which tempers the interpretation that these oscillations play a causal role in speech perception (Howard & Poeppel, 2011).

However, two recent studies using distinct ways of acoustically degrading speech intelligibility demonstrate that the temporal alignment between the stimulus and delta–theta band responses is higher when the stimulus is intelligible (Doelling et al., 2014; Peelle et al., 2013). This, again, supports the notion that those neural oscillations that match the slow (syllabic) speech timescales are useful (if not necessary) for the extraction of relevant speech information.

Neural oscillatory responses can also be entrained at much higher rates in the middle to high (40–200 Hz) gamma band (Brugge et al., 2009; Fishman, Reser, Arezzo, & Steinschneider, 2000). This could suggest that faster speech segments such as phonemic transitions could be extracted using the same encoding principle. High gamma responses in early auditory regions (Ahissar et al., 2001; Mesgarani & Chang, 2012; Morillon, Liegeois-Chauvel, Arnal, Benar, & Giraud, 2012; Nourski et al., 2009) reflect the fast temporal fluctuations in the speech envelope. A recent ECoG study succeeded at reconstructing the original speech input by using a combination of linear and nonlinear methods to decode neural responses from high gamma (70–150 Hz) activity recorded in auditory cortical regions (Pasley et al., 2012). Therefore, the decoding of auditory activity on a large spatial scale (at the population level) demonstrates that the auditory cortex maintains a high-fidelity representation of temporal modulations up to 200 Hz. However, according to psychophysiological findings described previously, speech intelligibility mostly relies on the preservation of the low-frequency (<50 Hz) temporal fluctuations rather than on higher-frequency information. Therefore, whether it is necessary to maintain a representation of such acoustic features to correctly perceive speech remains unclear. The following section aims at clarifying the putative neural mechanisms underpinning the segmentation and the integration of auditory speech signals into an intelligible percept.

38.2.5 Parallel Processing at Multiple Timescales

Schroeder and Lakatos (2009a, 2009b) have argued that oscillations correspond to the alternation of phases of high and low neuronal excitability, which temporally constrain the sampling of sensory information. This means that low gamma oscillations at 40 Hz, which have a period of approximately 25 ms, provide a 10- to 15-ms window for integrating spectrotemporal information (low spiking rate), followed by a 10- to 15-ms window for propagating the output (high spiking rate; see Figure 38.3A for illustration). However, because the average length of a phoneme is approximately 50 ms, a 10- to 15-ms window might be too short for integrating this information.

FIGURE 38.3 The temporal relationship between speech and brain oscillations. (A) Gamma oscillations periodically modulate neuronal excitability and spiking. The hypothesized mechanism is that neurons fire for approximately 12.5 ms and integrate for the rest of the 25-ms time window. Note that these values are approximate; we consider the relevant gamma range for speech to lie between 28 and 40 Hz. (B) Gamma power is modulated by the phase of the theta rhythm (approximately 4 Hz). Theta rhythm is reset by speech, resulting in maintaining the alignment between brain rhythms and speech bursts.
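The theta–gamma nesting depicted in Figure 38.3B can be quantified with a phase–amplitude coupling statistic such as the modulation index of Canolty et al. (2006), cited below. Here is a minimal sketch on simulated data; the sampling rate, band edges, and circular-shift surrogate test are illustrative choices rather than a prescribed pipeline.

```python
# Minimal sketch of the theta-gamma coupling measure of Canolty et al.
# (2006): the modulation index is the magnitude of the mean composite
# signal (gamma amplitude) * exp(i * theta phase). Data are simulated.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 500.0
t = np.arange(0, 20.0, 1 / fs)
theta = np.sin(2 * np.pi * 5 * t)
# Gamma bursts riding on theta peaks -> genuine phase-amplitude coupling.
gamma = (1 + theta) * 0.5 * np.sin(2 * np.pi * 40 * t)
lfp = theta + gamma + 0.5 * np.random.randn(t.size)

def band(x, lo, hi):
    b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

phase = np.angle(hilbert(band(lfp, 4, 8)))        # theta phase
amp = np.abs(hilbert(band(lfp, 30, 50)))          # low-gamma amplitude
mi = np.abs(np.mean(amp * np.exp(1j * phase)))    # modulation index

# Null distribution: break the phase-amplitude relationship by circular
# time shifts of the amplitude trace, then z-score the observed index.
shifts = np.random.randint(int(fs), t.size - int(fs), 200)
null = [np.abs(np.mean(np.roll(amp, s) * np.exp(1j * phase)))
        for s in shifts]
print(f"MI = {mi:.3f}, z vs. null = {(mi - np.mean(null)) / np.std(null):.1f}")
```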

However, because the average length of a phoneme is approximately 50 ms, a 10- to 15-ms window might be too short for integrating this information. Using a computational model of gamma oscillations generated by a pyramidal interneuron network (PING model; Borgers et al., 2005), Shamir, Ghitza, Epstein, and Kopell (2009) showed that the shape of a sawtooth input signal designed to have the typical duration and amplitude modulation of a diphone (~50 ms; typically a consonant–vowel or vowel–consonant transition) can correctly be represented by three gamma cycles, which act as a three-bit code. Such a code has the capacity required to distinguish different shapes of the stimulus and is therefore a plausible means to distinguish between phonemes. That 50-ms diphones could be correctly discriminated with three gamma cycles suggests that phonemes could be sampled with one or two gamma cycles. This issue is critical because the frequency of neural oscillations in the auditory cortex might constitute a strong biophysical determinant of the size of the minimal acoustic unit that can be manipulated for linguistic purposes. In a recent extension of this model, the parsing and encoding capacity of coupled theta and gamma oscillating networks was studied (Hyafil et al., 2012). In combination, these modules succeed in signaling syllable boundaries and in orchestrating spiking within syllabic windows, so that online speech decoding becomes possible with an accuracy similar to experimental findings using intracortical recordings (Kayser, Ince, & Panzeri, 2012).

An important requirement of the computational model mentioned previously (Shamir et al., 2009) is that ongoing gamma oscillations are phase-reset, for example, by a population of onset excitatory neurons. In the absence of this onset signal, the performance of the model declines. Ongoing intrinsic oscillations appear to be effective as a segmenting tool only if they align with the stimulus. Several studies in humans and nonhuman primates have suggested that gamma and theta rhythms work together, and that the phase of theta oscillations determines the power, and possibly also the phase, of gamma oscillations (see Figure 38.3B; Canolty et al., 2006; Csicsvari, Jamieson, Wise, & Buzsaki, 2003; Lakatos et al., 2008, 2005; Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008). This cross-frequency relationship is referred to as "nesting." Electrophysiological recordings suggest that theta oscillations can be phase-reset by several means, such as through multimodal corticocortical pathways (Arnal, Morillon, Kell, & Giraud, 2009; Lakatos, Chen, O'Connell, Mills, & Schroeder, 2007; Thorne, De Vos, Viola, & Debener, 2011) or through predictive top-down modulations, but most probably by the stimulus onset itself. The largest cortical auditory-evoked response measured with EEG and MEG, approximately 100 ms after stimulus onset, corresponds to a phase-reset and magnitude increase of theta activity (Arnal et al., 2011; Howard & Poeppel, 2012; Sauseng et al., 2007; Sayers & Beagley, 1974). This phase-reset would align the speech signal and the cortical theta rhythm, the proposed instrument of speech segmentation into syllable/word units. Because speech is strongly amplitude-modulated at the theta rate, this would result in aligning neuronal excitability with those parts of the speech signal that are most informative in terms of energy and spectrotemporal content (Figure 38.3B). There remain critical computational issues, such as the means to obtain strong gamma activity at the moment of theta reset.

Recent psychophysical research emphasizes the importance of aligning the acoustic speech signal with the brain's oscillatory/quasi-rhythmic activity. Ghitza and Greenberg (2009) demonstrated that while comprehension was drastically reduced by time-compressing speech signals by a factor of 3, comprehension was restored by artificially inserting periods of silence. The mere fact of adding silent periods to speech to restore an optimal temporal rate—which is equivalent to restoring "syllabicity"—improves performance even though the speech segments that remain available are not more intelligible. Optimal performance is obtained when 80-ms silent periods alternate with 40-ms time-compressed speech segments. These time constants allowed the authors to propose a phenomenological model involving three nested rhythms in the theta (5 Hz), beta or low gamma (20–40 Hz), and gamma (80 Hz) domains (for an extended discussion, see Ghitza, 2011).
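The stimulus manipulation of Ghitza and Greenberg (2009) is straightforward to state in code. In the sketch below, plain resampling stands in for the pitch-preserving time compression used in the actual experiments, so only the chunk and gap durations are the substantive part; everything else is an illustrative assumption.

```python
# Sketch of the Ghitza & Greenberg (2009) manipulation: time-compress
# speech by a factor of 3, then alternate 40-ms compressed chunks with
# 80-ms silences, restoring a quasi-syllabic packet rate.
import numpy as np
from scipy.signal import resample

def compress_and_insert_silence(x, fs, factor=3, chunk_ms=40, gap_ms=80):
    """Compress x in time, then insert silent gaps between chunks."""
    compressed = resample(x, int(len(x) / factor))  # naive 3x speed-up
    chunk = int(fs * chunk_ms / 1000)
    gap = np.zeros(int(fs * gap_ms / 1000))
    pieces = []
    for start in range(0, len(compressed), chunk):
        pieces.append(compressed[start:start + chunk])
        pieces.append(gap)
    return np.concatenate(pieces)

fs = 16000
speech = np.random.randn(fs * 3)          # stand-in for a 3-s utterance
out = compress_and_insert_silence(speech, fs)
# 40 ms speech + 80 ms silence = 120-ms packets, i.e., a ~8.3-Hz packet
# rate, even though each audible chunk remains 3x-compressed speech.
print(f"{len(out) / fs:.2f} s output from {len(speech) / fs:.0f} s input")
```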
38.2.6 Parallel Processing in Bilateral Auditory Cortices

There is emerging consensus, based on neuropsychological and imaging data, that speech perception is mediated bilaterally. Poeppel (2003) attempted to integrate and reconcile several of the strands of evidence. First, speech signals contain information on at least two critical timescales, correlating with segmental and syllabic information. Second, many nonspeech auditory psychophysical phenomena fall into two groups, with integration constants of approximately 25–50 and 200–300 ms. Third, both patient and imaging data reveal cortical asymmetries, such that both sides participate in auditory analysis but are optimized for different types of processing in left versus right. Fourth, crucial for the present chapter, neuronal oscillations might relate in a principled way to temporal integration constants of different sizes. Poeppel (2003) proposed that there are asymmetric distributions of neuronal ensembles between hemispheres with preferred shorter versus longer integration constants; these cell groups "sample" the input with different integration constants (Figure 38.4A).


[Figure 38.4, panel content: (A) Series of integration windows: 250 ms (~4 Hz; θ) and 25–50 ms (~40 Hz; γ). (B) Symmetric spectrotemporal patterns up to core primary auditory cortex; asymmetric temporal representations from primary auditory cortex onward. Bar plots show the proportion of neuronal ensembles with 250-ms (4 Hz) versus 25-ms (40 Hz) temporal integration windows in left (LH) and right (RH) auditory cortex; long windows serve analyses requiring high spectral resolution (e.g., intonation contours), short windows serve analyses requiring high temporal resolution (e.g., formant transitions).]

FIGURE 38.4 The AST hypothesis. (A) Temporal relationship between the speech waveform and the two proposed integration timescales (in milliseconds) and associated brain rhythms (in hertz). (B) Proposed mechanisms for asymmetric speech parsing: left auditory cortex (LH) contains a larger proportion of neurons able to oscillate at gamma frequency than the right one (RH).

Specifically, left auditory cortex has a relatively higher proportion of short-term (gamma) integrating cell groups, whereas right auditory cortex has a larger proportion of long-term (theta) integrating cell groups (Figure 38.4B). As a consequence, left hemisphere auditory cortex would be better equipped for parsing speech at the segmental timescale, and right auditory cortex would be better equipped for parsing speech at the syllabic timescale. This hypothesis, referred to as the asymmetric sampling in time (AST) theory, is summarized in Figure 38.4. It accounts for a variety of psychophysical and functional neuroimaging results showing that left temporal cortex responds better to many aspects of rapidly modulated speech content, whereas right temporal cortex responds better to slowly modulated signals, including music, voices, and other sounds (Warrier et al., 2009; Zatorre, Belin, & Penhune, 2002). A difference in the size of the basic integration window between left and right auditory cortices would explain speech functional asymmetry by a better sensitivity of left auditory cortex to information carried in fast temporal modulations that convey, for example, phonetic cues. A specialization of right auditory cortex for slower modulations would grant it better sensitivity to slower and stationary cues, such as harmonicity and periodicity (Rosen, 1992), which are important to identify vowels and syllables, and thereby speaker identity. The AST theory is very close, in kind, to the spectrotemporal asymmetry hypothesis promoted by Zatorre (Zatorre et al., 2002; Zatorre & Gandour, 2008). Although many psychophysical and neurophysiological experiments seem to support this idea (see Giraud & Poeppel, 2012; Poeppel, 2003; Poeppel et al., 2008 for reviews), much work is still in progress on this unresolved question.
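The spectral-versus-temporal trade-off that motivates AST follows directly from the mathematics of windowed analysis: a short window yields fine time steps but coarse frequency bins, and vice versa. The sketch below makes this concrete with a short-time Fourier transform at the two window sizes discussed above; the test signal is an illustrative stand-in for formant-like and harmonicity-like content.

```python
# Analyzing one signal with a ~25-ms (gamma-like) vs. ~250-ms (theta-like)
# window. The short window resolves a fast frequency step; the long
# window resolves two closely spaced spectral peaks.
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
# Fast event: frequency steps 1000 -> 1500 Hz at t = 0.5 s (formant-like).
fast = np.sin(2 * np.pi * np.where(t < 0.5, 1000, 1500) * t)
# Fine spectral detail: two tones only 20 Hz apart (harmonicity-like).
fine = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 520 * t)
sig = fast + fine

for win_ms in (25, 250):
    nper = int(fs * win_ms / 1000)
    f, tt, Z = stft(sig, fs=fs, nperseg=nper)
    df, dt = f[1] - f[0], tt[1] - tt[0]
    print(f"{win_ms:3d}-ms window: {df:6.1f}-Hz frequency bins, "
          f"{1000 * dt:5.1f}-ms time steps")
# 25-ms analysis: 40-Hz bins (the 500/520-Hz tones merge) but 12.5-ms
# time steps; 250-ms analysis: 4-Hz bins (tones separate) but 125-ms steps.
```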

38.2.7 Dysfunctional Oscillatory Sampling

Additional evidence to support the notion that neural oscillations play an instrumental role in speech processing would be to show that dysfunctional oscillatory mechanisms result in speech processing impairments.

Dyslexia, which involves a phonological deficit (i.e., a deficit in processing speech sounds), is presumably a good candidate for testing this hypothesis. Temporal sampling mediated by cortical oscillations has recently been proposed to be a central mechanism in several aspects of dyslexia (Goswami, 2011). This proposal suggests that a deficit involving theta oscillations might impair the tracking of low temporal modulations in the syllabic range. In a complementary way, it was recently proposed that anomalies of gamma oscillations might underlie an auditory phonemic deficit.

Interestingly, the left-dominant phase-locking profile of auditory responses at approximately 30 Hz in MEG (auditory steady-state responses) was only present in subjects with normal reading ability (Lehongre, Ramus, Villiermet, Schwartz, & Giraud, 2011). Because this response is absent in dyslexic participants, the authors suggested that the ability of their left auditory cortex to parse speech at the appropriate phonemic rate was altered. Those with dyslexia had a strong response at this frequency in right auditory cortex and therefore presented an abnormal asymmetry between left and right auditory cortices. Importantly, the magnitude of the anomalous asymmetry correlated with behavioral measures of phonology (such as nonword repetition and rapid automatic naming). Finally, it was also shown that dyslexic readers had stronger resonance than controls in both left and right auditory cortices at frequencies between 50 and 80 Hz. This supports the notion that these participants had a tendency to oversample information in the phonemic range, with this latter effect being positively correlated with a phonological memory deficit.
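The auditory steady-state measure used in such studies boils down to inter-trial phase coherence at the stimulation frequency. The sketch below computes it at 30 Hz on simulated trials; the sampling rate, trial counts, and noise levels are illustrative assumptions, not the parameters of Lehongre et al. (2011).

```python
# Minimal sketch of inter-trial phase coherence (ITC) at 30 Hz: a
# "phase-locked" electrode keeps a consistent stimulus phase across
# trials, a comparison electrode does not. All trials are simulated.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs, n_trials, n_samp = 500.0, 40, 1000
t = np.arange(n_samp) / fs
rng = np.random.default_rng(0)

def itc_at_30hz(trials):
    """Mean resultant length of 30-Hz phase across trials."""
    b, a = butter(4, [28.0, 32.0], btype="bandpass", fs=fs)
    phase = np.angle(hilbert(filtfilt(b, a, trials, axis=-1), axis=-1))
    return np.abs(np.exp(1j * phase).mean(axis=0)).mean()

locked = np.array([np.sin(2 * np.pi * 30 * t) +
                   rng.standard_normal(n_samp) for _ in range(n_trials)])
unlocked = np.array([np.sin(2 * np.pi * 30 * t + rng.uniform(0, 2 * np.pi))
                     + rng.standard_normal(n_samp) for _ in range(n_trials)])

print(f"ITC, phase-locked trials: {itc_at_30hz(locked):.2f}")    # near 1
print(f"ITC, random-phase trials: {itc_at_30hz(unlocked):.2f}")  # near 0
```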
As a consequence, if dyslexia induces speech parsing at a wrong frequency, then phonemic units would be sampled erratically, without necessarily inducing major perceptual deficits (Ramus & Szenkovits, 2008; Ziegler, Pech-Georgel, George, & Lorenzi, 2009). The phonological impairment could accordingly take different forms, with a stronger impact on the acoustic side for undersampling (insufficient acoustic detail per time unit) and on the memory side for oversampling (too many frames to be integrated per time unit).

Although important, the observation that oscillatory anomalies co-occur with atypical phonological representations remains insufficient to establish a causal role of dysfunctional oscillatory sampling. Causal evidence that auditory sampling depends on intrinsic oscillatory properties (and is, as a consequence, determined by cortical columnar organization) could be obtained from knockout animal models by comparing neuronal activity elicited by continuous auditory stimuli in sites with various degrees of columnar disorganization. However, such animal work can only indirectly address a specific relation to speech processing.

38.3 BROADENING THE SCOPE: FUNCTIONAL MODELS

Although the perceptual analysis of speech is rooted in the different anatomic subdivisions of auditory cortex in the temporal lobe, speech processing involves a large network that includes areas in parietal and frontal cortices, the relative activations of which strongly depend on the task performed. Several reviews have synthesized the state of the art of the functional neuroanatomy of speech perception (Hickok & Poeppel, 2000, 2004, 2007; Rauschecker & Scott, 2009; Scott & Johnsrude, 2003). Figure 38.5A summarizes the main consensus based on functional neuroimaging (fMRI, positron emission tomography [PET], MEG/EEG) and lesion data.

Departing from the classical model in which a posterior (Wernicke's) and an anterior (Broca's) area form the anatomic network, it is now argued that speech is processed in parallel in at least two streams, a ventral stream for speech-to-meaning mapping (a "what" stream) and a dorsal stream for speech-to-articulation mapping (a "how" stream). Both streams converge on prefrontal cortex, with a tendency for the ventral pathway to contact ventral prefrontal cortex (BA 44/45, also referred to as Broca's area) and for the dorsal pathway to contact dorsal premotor regions (Hickok & Poeppel, 2007; Rauschecker & Scott, 2009). The dual-path network operates both in a feedforward (bottom-up) and feedback (top-down) manner, highlighting the need for neurophysiologically grounded theories that have appropriate primitives to permit such bidirectional processing in real time.

38.3.1 An Oscillation-Based Model of Speech Processing

This chapter emphasizes a neurophysiological perspective and especially the potential role of neuronal oscillations as "administrative mechanisms" to parse and decode speech signals. Does such a focus converge with the current functional anatomical models? Recent experimental research has begun developing a functional anatomic model solely derived from recordings of neuronal oscillations. Based on analyses of the sources of oscillatory activity, that is, brain regions showing asymmetric theta/gamma activity at rest and under linguistic stimulation, Morillon et al. (2010) proposed a new functional model of speech and language processing (Figure 38.5B) that links to the textbook anatomy (illustrated in Figure 38.5A). This model is grounded in a "core network" showing left oscillatory dominance at rest (no linguistic stimulation, no task), encompassing auditory, somatosensory, and motor cortices, and BA40 in inferior parietal cortex.


FIGURE 38.5 Three functional neuroanatomical models of speech perception. (A) Model based on neuropsychology and functional neuroimaging data (PET and fMRI; after Hickok & Poeppel, 2007). (B) Model based on the propagation of resting oscillatory asymmetry during audiovisual linguistic stimulation (a spoken movie). (C) The predictive (Bayesian) view on speech processing relies on the inversion of the dual-stream model to support the propagation of "what" and "when" predictions toward sensory regions.

The strongest asymmetries are observed in motor cortex and in BA40, which presumably play an important causal role in left hemispheric dominance during language processing. Critically, the proposed core network does not include Wernicke's (BA22) and Broca's (BA 44/45) areas, despite the fact that both are classically related to speech and language processing. Interestingly, whereas these areas show no sign of asymmetry at rest, they possibly "inherit" left-dominant oscillatory activity during linguistic processing from the putative core regions. The model argues that posterior superior temporal cortex (Wernicke's area) inherits its profile from auditory and somatosensory cortices, whereas Broca's area inherits its profile from all posterior regions, including auditory, somatosensory, Wernicke's, and BA40. This model specifies that posterior regions share their oscillatory activity over the whole range of frequencies examined (1–72 Hz), whereas Broca's area inherits only the gamma range of the posterior oscillatory activity. This might reflect that oscillatory activity in Broca's area does not exclusively pertain to language.

Finally, an important feature of the model is the influence of the motor lip and hand areas on auditory cortex oscillatory activity at the delta/theta scale, which underscores the influence of syllable and co-speech gesture production rates on receptive auditory sampling and its asymmetric implementation. This model is compatible with a hardwired alignment of speech perception and production capacities at a syllabic but not at a phonemic timescale, suggesting that sensory/motor alignment at the phonemic timescale is presumably acquired. Using an approach entirely driven by oscillations, this model is largely consistent with the traditional one (Figure 38.5A), but it places a new emphasis on hardwired auditory–motor interactions and on a determinant role of BA40 in language lateralization, which remains to be clarified.

38.3.2 Predictive Models of Speech Processing

When processing continuous speech, the brain needs to simultaneously carry out acoustic and linguistic operations; at every instant there is both acoustic input to be processed and meaning to be calculated from the preceding input. Discretization using phases during which cortical neurons are either highly or weakly receptive to input is one computational principle that could ensure constant alternation between sampling the input and matching this input onto higher-level, more abstract representations. A Bayesian perspective on this issue would indicate that the brain decodes sounds by constantly generating descending (top-down) inferences about what is and will be said on the basis of the quickest and crudest neural representation it can make from an acoustic input (Poeppel et al., 2008). Consistent with this view, recent models of perception suggest that the brain hosts an internal model of the world that is used to generate, test, and update predictions about future sensory events. A proposal in very similar spirit is the "reverse hierarchy theory," a conceptualization developed to meet certain challenges in visual object recognition (Hochstein & Ahissar, 2002) and more recently extended to speech processing (Nahum et al., 2008).

Adding to the dual-stream model of perception that splits the processing of "what" and "how" information (Figure 38.5A), the predictive coding framework (Friston, 2005, 2010) posits that each stream also generates distinct types of predictions with regard to expected sensory inputs (Figure 38.5C). Top-down predictions that propagate throughout the ventral stream relate to the content of expected stimuli (the "what" aspect), whereas the dorsal stream generates predictions that pertain to the timing of events (the "when" aspect). Here we propose that in both cases, top-down expectations predictively modulate neural oscillations to facilitate the sampling of predicted sensory inputs.

Audiovisual speech perception provides an ideal paradigmatic situation to test how the brain can use information from one sensory (visual) stream to derive predictions about upcoming events in another sensory (auditory) stream. During face-to-face conversation, because the onset of visual cues (oro-facial movements) often precedes the auditory onset (syllable) by approximately 150 ms, visual information can be used to predict upcoming auditory syllables (van Wassenhove, Grant, & Poeppel, 2005), speeding up speech perception. Several recent studies have proposed that this predictive cross-modal mechanism relies on a corticocortical phase-reset from visual regions to auditory ones (Arnal et al., 2009, 2011; Luo et al., 2010). According to this hypothesis, a cross-modal visual-to-auditory phase-reset aligns ongoing low-frequency delta–theta oscillations in the auditory cortex to an ideal phase allowing for optimal sampling of expected auditory inputs (Schroeder et al., 2008). Therefore, during the perception of a continuous audiovisual speech stream, visual inputs would predictively entrain ongoing oscillations in the auditory cortex, which in turn facilitates the encoding of syllabic information (Zion Golumbic, Cogan, Schroeder, & Poeppel, 2013a).

When presented with a rhythmic stream of events, the brain can also predict the temporal occurrence of future sensory inputs (Arnal & Giraud, 2012). Because the speech signal exhibits quasiperiodic modulations at the syllabic rate, syllable timing is relatively predictable and could be predictively encoded by the phase of ongoing delta–theta (2–8 Hz) oscillations (Lakatos et al., 2008).
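A toy simulation makes the proposed benefit of such cross-modal phase-resetting explicit: if a visual cue resets the phase of a ~5-Hz oscillation, auditory events arriving a fixed ~150 ms later land at a consistent oscillatory phase instead of a random one. The oscillator, the lag jitter, and the alignment measure below are illustrative assumptions, not a model from the cited studies.

```python
# Toy simulation: a visual cue resets an ongoing 5-Hz oscillation, so an
# auditory event ~150 ms later arrives at a reproducible phase.
import numpy as np

rng = np.random.default_rng(0)
f_theta, lag_s = 5.0, 0.150          # 5-Hz oscillation, 150-ms V->A lead
n_events = 1000

# Without reset: the phase at auditory onset is uniformly random.
phase_no_reset = rng.uniform(0, 2 * np.pi, n_events)

# With reset: phase jumps to 0 at the visual cue, then advances for the
# 150-ms lag (plus a little jitter in the lag across events).
jitter = rng.normal(0, 0.010, n_events)          # 10-ms lag jitter
phase_reset = 2 * np.pi * f_theta * (lag_s + jitter) % (2 * np.pi)

def alignment(ph):
    """Mean resultant length: 1 = perfectly aligned, 0 = random."""
    return np.abs(np.exp(1j * ph).mean())

print(f"alignment without reset: {alignment(phase_no_reset):.2f}")  # ~0
print(f"alignment with reset:    {alignment(phase_reset):.2f}")     # ~1
```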
The alignment of incoming speech signals with slow endogenous cortical activity, and the resonance with neocortical delta–theta oscillations, represents a plausible way to automatize predictive timing at an early, prelexical processing stage (Ding & Simon, 2012; Henry & Obleser, 2012). However, the observation that the dorsal pathway (Figure 38.5A and 38.5C) is recruited during the perception of temporally regular event streams (e.g., during beat perception) suggests that the motor system also plays a role in the predictive alignment of ongoing oscillations in the auditory cortex (Arnal, 2012; Arnal & Giraud, 2012; Fujioka, Trainor, Large, & Ross, 2012). Consistent with the concepts of active sensing or active inference, motor efferent signals that are generated in synchrony with the beat may predictively modulate the activity in the auditory cortex and facilitate the processing of incoming events (Arnal & Giraud, 2012). In other words, the active entrainment of slow endogenous cortical activity via a periodic resetting from the motor system represents a plausible way to facilitate the

processing of expected auditory events at a low processing level. Such a mechanism is possibly at the origin of the attentional selection of the relevant streams in the cocktail party effect (Zion Golumbic, Ding, et al., 2013b).

In summary, predictions play a major role in the encoding of temporal information in the auditory cortex. The ability to extract regularities and generate inferences about upcoming events primarily allows the periodic cortical activity to facilitate sensory processing, regardless of the informational content of forthcoming information. Additionally, when targeting specific neuronal populations, top-down signals could provide content-related priors. These two mechanisms are complementary at the computational level. Whereas temporal predictions ("when") align neuronal excitability by controlling the momentary phase of low-frequency oscillations relative to incoming stimuli, content predictions target neuronal populations specific to the representational content ("what") of forthcoming stimuli. The combination of these two types of mechanisms is again ideally illustrated by speech processing. Speech comprehension has long been argued to rely on cohort models in which each heard word preactivates a pool of other words with the same onset, until a point is reached at which the word is uniquely identified. This model assumes that cognitive resources are used at the lexical level, where predictions are formed. Gagnepain and collaborators (Gagnepain, Henson, & Davis, 2012) recently demonstrated, however, that the predictive mechanisms in word comprehension involve segmental rather than lexical predictions, meaning that each segment is likely used to predict the next. Computationally, this observation supports the view that auditory cortex samples speech into segments using mechanisms that make them predictable in time, and that a representation of these segments is used to test specific predictions in a recurrent, predictable fashion.

38.3.3 Conclusion

Time is an essential feature of speech perception. No speech sound can be identified without integrating the acoustic input over time, and the temporal scale at which such integration operates determines whether we are hearing phonemes, syllables, or words. The central idea of this chapter is that, unlike subcortical processing, which faithfully encodes speech sounds in their precise spectrotemporal structure, processing in primary and association auditory cortices results in the discretization of spectrotemporal patterns, using variable temporal integration scales. It is unlikely that speech representations are precise in both time and space. The limited phase-locking capacity of the auditory cortex thus appears a likely counterpart to its spatial integration properties (across cortical layers and functional regions). Speech processing through and across cortical columns containing complex recurrent circuits bears a cost on the temporal precision of speech representations, and integration at the gamma scale could be a direct consequence of processing at the cortical column scale. In this chapter we argue that the auditory cortex uses gamma oscillations to integrate the auditory speech stream at the phonemic timescale, and uses theta oscillations to signal syllable boundaries and orchestrate gamma activity. Although the generation mechanisms are less well-known for theta than for gamma oscillations, at present we see no alternative computational solution to the online speech segmentation and integration problem than invoking coupled theta and gamma activity. More research is needed to evaluate the detailed neural operations that are necessary to transform the acoustic input into linguistic representations, and it remains possible that other nonoscillatory mechanisms also contribute to these transformations.

References

Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., & Merzenich, M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 98, 13367–13372.
Allen, J. B. (2005). Articulation and intelligibility. Synthesis Lectures on Speech and Audio Processing, 1(1), 1–124.
Arnal, L. H. (2012). Predicting "when" using the motor system's beta-band oscillations. Frontiers in Human Neuroscience, 6, 225.
Arnal, L. H., & Giraud, A. L. (2012). Cortical oscillations and sensory predictions. Trends in Cognitive Sciences, 16, 390–398.
Arnal, L. H., Morillon, B., Kell, C. A., & Giraud, A.-L. (2009). Dual neural routing of visual facilitation in speech processing. The Journal of Neuroscience, 29, 13445–13453.
Arnal, L. H., Wyart, V., & Giraud, A. L. (2011). Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nature Neuroscience, 14, 797–801.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Boemio, A., Fromm, S., Braun, A., & Poeppel, D. (2005). Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nature Neuroscience, 8, 389–395.
Borgers, C., Epstein, S., & Kopell, N. J. (2005). Background gamma rhythmicity and attention in cortical local circuits: A computational study. Proceedings of the National Academy of Sciences of the United States of America, 102, 7002–7007.
Borgers, C., Epstein, S., & Kopell, N. J. (2008). Gamma oscillations mediate stimulus competition and attentional selection in a cortical network model. Proceedings of the National Academy of Sciences of the United States of America, 105, 18023–18028.
Britvina, T., & Eggermont, J. J. (2007). A Markov model for interspike interval distributions of auditory cortical neurons that do not show periodic firings. Biological Cybernetics, 96, 245–264.


Brugge, J. F., Nourski, K. V., Oya, H., Reale, R. A., Kawasaki, H., Steinschneider, M., et al. (2009). Coding of repetitive transients by auditory cortex on Heschl's gyrus. Journal of Neurophysiology, 102, 2358–2374.
Busch, N. A., Dubois, J., & VanRullen, R. (2009). The phase of ongoing EEG oscillations predicts visual perception. The Journal of Neuroscience, 29, 7869–7876.
Canolty, R. T., Edwards, E., Dalal, S. S., Soltani, M., Nagarajan, S. S., Kirsch, H. E., et al. (2006). High gamma power is phase-locked to theta oscillations in human neocortex. Science, 313, 1626–1628.
Canolty, R. T., & Knight, R. T. (2010). The functional role of cross-frequency coupling. Trends in Cognitive Sciences, 14, 506–515.
Chi, T., Ru, P., & Shamma, S. A. (2005). Multiresolution spectrotemporal analysis of complex sounds. The Journal of the Acoustical Society of America, 118, 887–906.
Csicsvari, J., Jamieson, B., Wise, K. D., & Buzsaki, G. (2003). Mechanisms of gamma oscillations in the hippocampus of the behaving rat. Neuron, 37, 311–322.
Dau, T., Kollmeier, B., & Kohlrausch, A. (1997). Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. The Journal of the Acoustical Society of America, 102, 2906–2919.
Ding, N., & Simon, J. Z. (2012). Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. Journal of Neurophysiology, 107, 78–89.
Doelling, K. B., Arnal, L. H., Ghitza, O., & Poeppel, D. (2014). Acoustic landmarks drive delta–theta oscillations to enable speech comprehension by facilitating perceptual parsing. NeuroImage, 85, 761–768.
Drullman, R., Festen, J. M., & Plomp, R. (1994a). Effect of reducing slow temporal modulations on speech reception. The Journal of the Acoustical Society of America, 95, 2670–2680.
Drullman, R., Festen, J. M., & Plomp, R. (1994b). Effect of temporal envelope smearing on speech reception. The Journal of the Acoustical Society of America, 95, 1053–1064.
Edwards, E., & Chang, E. F. (2013). Syllabic (∼2–5 Hz) and fluctuation (∼1–10 Hz) ranges in speech and auditory processing. Hearing Research, 305, 113–134.
Elhilali, M., Fritz, J. B., Klein, D. J., Simon, J. Z., & Shamma, S. A. (2004). Dynamics of precise spike timing in primary auditory cortex. The Journal of Neuroscience, 24, 1159–1172.
Elliott, T. M., & Theunissen, F. E. (2009). The modulation transfer function for speech intelligibility. PLoS Computational Biology, 5, e1000302.
Faulkner, A., Rosen, S., & Smith, C. (2000). Effects of the salience of pitch and periodicity information on the intelligibility of four-channel vocoded speech: Implications for cochlear implants. The Journal of the Acoustical Society of America, 108, 1877–1887.
Fishman, Y. I., Reser, D. H., Arezzo, J. C., & Steinschneider, M. (2000). Complex tone processing in primary auditory cortex of the awake monkey. II. Pitch versus critical band representation. The Journal of the Acoustical Society of America, 108, 247–262.
Fontolan, L., Morillon, B., Liegeois-Chauvel, C., & Giraud, A.-L. (2014). The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nature Communications, 5, 4694.
Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences, 360, 815–836.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11, 127–138.
Fujioka, T., Trainor, L. J., Large, E. W., & Ross, B. (2012). Internalized timing of isochronous sounds is represented in neuromagnetic beta oscillations. The Journal of Neuroscience, 32, 1791–1802.
Fukushima, M., Saunders, R. C., Leopold, D. A., Mishkin, M., & Averbeck, B. B. (2012). Spontaneous high-gamma band activity reflects functional organization of auditory cortex in the awake macaque. Neuron, 74, 899–910.
Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22, 615–621.
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2, 130.
Ghitza, O., & Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: Intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica, 66, 113–126.
Giraud, A. L., Kell, C., Thierfelder, C., Sterzer, P., Russ, M. O., Preibisch, C., et al. (2004). Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing. Cerebral Cortex, 14, 247–255.
Giraud, A. L., Lorenzi, C., Ashburner, J., Wable, J., Johnsrude, I., Frackowiak, R., et al. (2000). Representation of the temporal envelope of sounds in the human brain. Journal of Neurophysiology, 84, 1588–1598.
Giraud, A. L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles. Nature Neuroscience, 15, 511–517.
Giraud, A. L., & Price, C. J. (2001). The constraints functional neuroimaging places on classical models of auditory word processing. Journal of Cognitive Neuroscience, 13, 754–765.
Goswami, U. (2011). A temporal sampling framework for developmental dyslexia. Trends in Cognitive Sciences, 15, 3–10.
Greenberg, S., & Arai, T. (2001). The relation between speech intelligibility and the complex modulation spectrum. Proceedings of the 7th Eurospeech Conference on Speech Communication and Technology (Eurospeech-2001) (pp. 473–476). Aalborg, Denmark.
Grothe, B., Pecka, M., & McAlpine, D. (2010). Mechanisms of sound localization in mammals. Physiological Reviews, 90, 983–1012.
Henry, M. J., & Obleser, J. (2012). Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences of the United States of America, 109, 20095–20100.
Hickok, G., & Poeppel, D. (2000). Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences, 4, 131–138.
Hickok, G., & Poeppel, D. (2004). Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition, 92, 67–99.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Hochstein, S., & Ahissar, M. (2002). View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36, 791–804.
Holcombe, A. O. (2009). Seeing slow and seeing fast: Two limits on perception. Trends in Cognitive Sciences, 13, 216–221.
Howard, M. F., & Poeppel, D. (2011). Discrimination of speech stimuli based on neuronal response phase patterns depends on acoustics but not comprehension. Journal of Neurophysiology, 104, 2500–2511.
Howard, M. F., & Poeppel, D. (2012). The neuromagnetic response to spoken sentences: Co-modulation of theta band amplitude and phase. NeuroImage, 60, 2118–2127.
Hyafil, A., Fontolan, L., Gutkin, B., & Giraud, A.-L. (2012). A theoretical exploration of speech/neural oscillation alignment for speech parsing. FENS Abstract, 6, S4704.
Joris, P. X., Schreiner, C. E., & Rees, A. (2004). Neural processing of amplitude-modulated sounds. Physiological Reviews, 84, 541–577.
Kanedera, N., Arai, T., Hermansky, H., & Pavel, M. (1999). On the relative importance of various components of the modulation spectrum for automatic speech recognition. Speech Communication, 28, 43–55.


Kayser, C., Ince, R. A., & Panzeri, S. (2012). Analysis of slow (theta) oscillations as a potential temporal reference frame for information coding in sensory cortices. PLoS Computational Biology, 8, e1002717.
Kiebel, S. J., Daunizeau, J., & Friston, K. J. (2008). A hierarchy of time-scales and the brain. PLoS Computational Biology, 4, e1000209.
Kingsbury, B. E. D., Morgan, N., & Greenberg, S. (1998). Robust speech recognition using the modulation spectrogram. Speech Communication, 25, 117–132.
Lakatos, P., Chen, C. M., O'Connell, M. N., Mills, A., & Schroeder, C. E. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron, 53, 279–292.
Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment of neuronal oscillations as a mechanism of attentional selection. Science, 320, 110–113.
Lakatos, P., Shah, A. S., Knuth, K. H., Ulbert, I., Karmos, G., & Schroeder, C. E. (2005). An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of Neurophysiology, 94, 1904–1911.
Lehongre, K., Ramus, F., Villiermet, N., Schwartz, D., & Giraud, A. L. (2011). Altered low-gamma sampling in auditory cortex accounts for the three main facets of dyslexia. Neuron, 72, 1080–1090.
Loebach, J. L., & Wickesberg, R. E. (2008). The psychoacoustics of noise vocoded speech: A physiological means to a perceptual end. Hearing Research, 241, 87–96.
Luo, H., Boemio, A., Gordon, M., & Poeppel, D. (2007). The perception of FM sweeps by Chinese and English listeners. Hearing Research, 224, 75–83.
Luo, H., Liu, Z. X., & Poeppel, D. (2010). Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biology, 8, 13.
Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54, 1001–1010.
Mantini, D., Perrucci, M. G., Del Gratta, C., Romani, G. L., & Corbetta, M. (2007). Electrophysiological signatures of resting state networks in the human brain. Proceedings of the National Academy of Sciences of the United States of America, 104, 13170–13175.
McDermott, J. H., & Simoncelli, E. P. (2011). Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron, 71, 926–940.
Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485, 233–236.
Miller, G. A. (1951). Language and communication. New York, NY: McGraw-Hill.
Moerel, M., De Martino, F., Santoro, R., Ugurbil, K., Goebel, R., Yacoub, E., et al. (2013). Processing of natural sounds: Characterization of multipeak spectral tuning in human auditory cortex. The Journal of Neuroscience, 33, 11888–11898.
Morillon, B., Lehongre, K., Frackowiak, R. S., Ducorps, A., Kleinschmidt, A., Poeppel, D., et al. (2010). Neurophysiological origin of human brain asymmetry for speech and language. Proceedings of the National Academy of Sciences of the United States of America, 107, 18688–18693.
Morillon, B., Liegeois-Chauvel, C., Arnal, L. H., Benar, C. G., & Giraud, A. L. (2012). Asymmetric function of theta and gamma activity in syllable processing: An intra-cortical study. Frontiers in Psychology, 3, 248.
Nahum, M., Nelken, I., & Ahissar, M. (2008). Low-level information and high-level perception: The case of speech in noise. PLoS Biology, 6, e126.
Ng, B. S., Schroeder, T., & Kayser, C. (2012). A precluding but not ensuring role of entrained low-frequency oscillations for auditory perception. The Journal of Neuroscience, 32, 12268–12276.
Nourski, K. V., & Brugge, J. F. (2011). Representation of temporal sound features in the human auditory cortex. Reviews in the Neurosciences, 22, 187–203.
Nourski, K. V., Reale, R. A., Oya, H., Kawasaki, H., Kovach, C. K., Chen, H., et al. (2009). Temporal envelope of time-compressed speech represented in the human auditory cortex. The Journal of Neuroscience, 29, 15564–15574.
Panzeri, S., Brunel, N., Logothetis, N. K., & Kayser, C. (2010). Sensory neural codes using multiplexed temporal scales. Trends in Neurosciences, 33, 111–120.
Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, N. E., et al. (2012). Reconstructing speech from human auditory cortex. PLoS Biology, 10, e1001251.
Peelle, J. E., Gross, J., & Davis, M. H. (2013). Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cerebral Cortex, 23, 1378–1387.
Plack, C. J., Oxenham, A. J., Fay, R. R., & Popper, A. N. (2005). Pitch: Neural coding and perception. New York, NY: Springer.
Poeppel, D. (2001). New approaches to the neural basis of speech sound processing: Introduction to special section on brain and speech. Cognitive Science, 25, 659–661.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as "asymmetric sampling in time". Speech Communication, 41, 245–255.
Poeppel, D., Idsardi, W. J., & van Wassenhove, V. (2008). Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences, 363, 1071–1086.
Pöppel, E., & Artin, T. T. (1988). Mindworks: Time and conscious experience. Harcourt Brace Jovanovich.
Ramus, F., & Szenkovits, G. (2008). What phonological deficit? Quarterly Journal of Experimental Psychology, 61, 129–141.
Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12, 718–724.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–949.
Roberts, B., Summers, R. J., & Bailey, P. J. (2011). The intelligibility of noise-vocoded speech: Spectral information available from across-channel comparison of amplitude envelopes. Proceedings. Biological Sciences / The Royal Society, 278, 1595–1600.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences, 336, 367–373.
Saberi, K., & Perrott, D. R. (1999). Cognitive restoration of reversed speech. Nature, 398, 760.
Saenz, M., & Langers, D. R. (2014). Tonotopic mapping of human auditory cortex. Hearing Research, 307, 42–52.
Sauseng, P., Klimesch, W., Gruber, W. R., Hanslmayr, S., Freunberger, R., & Doppelmayr, M. (2007). Are event-related potential components generated by phase resetting of brain oscillations? A critical discussion. Neuroscience, 146, 1435–1444.
Sayers, B. M., & Beagley, H. A. (1974). Objective evaluation of auditory evoked EEG responses. Nature, 251, 608–609.
Schroeder, C. E., & Lakatos, P. (2009a). Low-frequency neuronal oscillations as instruments of sensory selection. Trends in Neurosciences, 32, 9–18.
Schroeder, C. E., & Lakatos, P. (2009b). The gamma oscillation: Master or slave? Brain Topography, 22, 24–26.
Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences, 12, 106–113.
Scott, S. K., & Johnsrude, I. S. (2003). The neuroanatomical and functional organization of speech perception. Trends in Neurosciences, 26, 100–107.


Scott, S. K., Rosen, S., Lang, H., & Wise, R. J. (2006). Neural correlates of intelligibility in speech investigated with noise vocoded speech—A positron emission tomography study. The Journal of the Acoustical Society of America, 120, 1075–1083.
Shamir, M., Ghitza, O., Epstein, S., & Kopell, N. (2009). Representation of time-varying stimuli by a network exhibiting oscillations on a faster time scale. PLoS Computational Biology, 5, e1000370.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87–90.
Souza, P., & Rosen, S. (2009). Effects of envelope bandwidth on the intelligibility of sine- and noise-vocoded speech. The Journal of the Acoustical Society of America, 126, 792–805.
Steeneken, H. J., & Houtgast, T. (1980). A physical method for measuring speech-transmission quality. The Journal of the Acoustical Society of America, 67, 318–326.
Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. The Journal of the Acoustical Society of America, 111, 1872–1891.
Thorne, J. D., De Vos, M., Viola, F. C., & Debener, S. (2011). Cross-modal phase reset predicts auditory task performance in humans. The Journal of Neuroscience, 31, 3853–3861.
Tiesinga, P., & Sejnowski, T. J. (2009). Cortical enlightenment: Are attentional gamma oscillations driven by ING or PING? Neuron, 63, 727–732.
VanRullen, R., & Koch, C. (2003). Is perception discrete or continuous? Trends in Cognitive Sciences, 7, 207–213.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186.
Wang, X. J. (2010). Neurophysiological and computational principles of cortical rhythms in cognition. Physiological Reviews, 90, 1195–1268.
Warrier, C., Wong, P., Penhune, V., Zatorre, R., Parrish, T., Abrams, D., et al. (2009). Relating structure to function: Heschl's gyrus and acoustic processing. The Journal of Neuroscience, 29, 61–69.
Womelsdorf, T., Schoffelen, J. M., Oostenveld, R., Singer, W., Desimone, R., Engel, A. K., et al. (2007). Modulation of neuronal interactions through neuronal synchronization. Science, 316, 1609–1612.
Wyart, V., de Gardelle, V., Scholl, J., & Summerfield, C. (2012). Rhythmic fluctuations in evidence accumulation during decision making in the human brain. Neuron, 76, 847–858.
Zatorre, R. J., Belin, P., & Penhune, V. B. (2002). Structure and function of auditory cortex: Music and speech. Trends in Cognitive Sciences, 6, 37–46.
Zatorre, R. J., & Gandour, J. T. (2008). Neural specializations for speech and pitch: Moving beyond the dichotomies. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences, 363, 1087–1104.
Zeng, F. G. (2002). Temporal pitch in electric hearing. Hearing Research, 174, 101–106.
Ziegler, J. C., Pech-Georgel, C., George, F., & Lorenzi, C. (2009). Speech-perception-in-noise deficits in dyslexia. Developmental Science, 12, 732–745.
Zion Golumbic, E., Cogan, G. B., Schroeder, C. E., & Poeppel, D. (2013a). Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". The Journal of Neuroscience, 33, 1417–1426.
Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., et al. (2013b). Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron, 77, 980–991.

CHAPTER 39

Direct Cortical Neurophysiology of Speech Perception

Matthew K. Leonard and Edward F. Chang
Department of Neurological Surgery, University of California, San Francisco, CA, USA

39.1 INTRODUCTION

Language is among the most complex and distinctive of human behaviors. Understanding how the brain coordinates multiple representations (e.g., for speech: acoustic–phonetic, phonemic, lexical, semantic, and syntactic) to achieve a seemingly unified linguistic experience is among the most important questions in cognitive neuroscience. This is not a new question (Geschwind, 1974; Wernicke, 1874); however, methodological developments in the past decade have provided a new perspective on its biological underpinnings. Combined with theoretical advances in linguistics (Poeppel, Idsardi, & van Wassenhove, 2008; also see Chapters 12–14) and important discoveries in the fundamental mechanisms of neural communication and representation (Friston, 2010; Simoncelli & Olshausen, 2001), we are at a critical moment in the study of the neurobiology of language.

One challenging consequence of language being unique to humans is that we are more limited in our ability to observe the brain during linguistic tasks. Whereas sensory, motor, and even some cognitive abilities like decision-making can be studied in detail in animal models using methods that afford high spatial and temporal resolution, there are only limited opportunities for investigations at the same level of detail in humans. In this chapter, we consider how data acquisition methods that provide unparalleled spatiotemporal resolution, combined with new applications of multivariate statistical and machine learning analysis tools, have significantly advanced our understanding of the neural basis of language. In particular, we focus on a relatively rare and invasive recording method known as electrocorticography (ECoG) and its use in the direct study of the human auditory, motor, and speech systems. ECoG has been in use for much longer than many noninvasive methods, including functional magnetic resonance imaging (fMRI), but recent advances in the design and manufacturing of electrode arrays, in addition to important discoveries on the frequency dynamics of neural signals recorded directly from the cortical surface, have enabled critical new insights into neural processing.

We argue that the ability to study language at the level of neural population dynamics (and in some cases, single neurons) is fundamentally necessary if we wish to understand how sensory input is transformed into the phenomenologically rich linguistic representations that are at the center of human communication and thought. Invasive electrophysiological methods are uniquely situated to examine these questions and, when interpreted in the context of the vast (and growing) knowledge gained from noninvasive approaches, allow us to move beyond the localization questions that have dominated the field for decades and toward an understanding of neural representations.

39.2 INVASIVE NEURAL RECORDING METHODS

39.2.1 Event-Related Neural Responses

Since soon after the discovery of the scalp electroencephalogram (EEG) 85 years ago (Berger, 1929), several aspects of neural population activity have been clear. The coordinated, stimulus-evoked responses of large groups of similarly oriented pyramidal neurons can be closely related to behavior. One of the most important examples in language research is the discovery of the N400 event-related potential (ERP) (Kutas & Federmeier, 2011).


When EEG subjects read or hear sentences that end with either contextually congruent ("I like my coffee with milk and sugar") or contextually incongruent ("I like my coffee with milk and bolts") endings, neural activity between approximately 200 and 600 ms after critical word onset is stronger for the incongruent condition. This finding, replicated hundreds of times in a variety of subject populations (Federmeier & Kutas, 2005) and stimulus modalities (Leonard et al., 2012; Marinkovic et al., 2003), suggests that activity in large groups of cells is sensitive to the linguistic features of the stimulus and to the context in which individual words occur.

Understanding how these effects arise from local neuronal firing requires more detailed information than is typically available from scalp recordings. Electrical currents from the brain are typically measured in microvolts, approximately 1.5 million times smaller than the voltage of a standard AA battery. Furthermore, the electromagnetic inverse problem, whereby signals recorded at the scalp cannot unambiguously be localized in the brain, means that noninvasive recordings of electrical activity are unable to resolve separate neural populations (Pascual-Marqui, 1999). Compounding the lack of an analytic solution to the inverse problem, the biophysical properties of the head, including the various layers of parenchyma, cerebrospinal fluid, meninges, skull, and scalp, mean that there is up to approximately 15 mm of tissue between the signal source and the recording electrode (Beauchamp et al., 2011). This tissue further smears the signal spatially and also acts as a low-pass filter, preventing the observation of oscillatory activity at higher frequencies.

ECoG does not suffer from these problems to the same extent because the recording electrodes are placed directly on the surface of the cortex during a surgical procedure that involves exposing the brain through either a burr hole or a craniotomy (Figure 39.1A) (Penfield & Baldwin, 1952). Obviously, the methods for implanting electrodes limit the populations that can be studied to patients who require such invasive surgery for clinical reasons. In a small portion of the epilepsy patient population, anti-seizure medications are ineffective, and the clinicians determine that the severity of the seizures is sufficiently disabling that the best course of treatment is to identify and resect the epileptogenic tissue.
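Returning to the N400 paradigm described at the start of this section, the analysis it rests on is simple trial averaging. The sketch below simulates congruent and incongruent epochs and compares mean amplitude in the canonical 200–600 ms window; the sampling rate, trial counts, and effect size are illustrative assumptions.

```python
# Minimal sketch of an ERP congruency analysis: average epochs per
# condition and compare mean amplitude 200-600 ms after word onset.
import numpy as np

fs, n_trials = 250, 60                   # 250-Hz EEG, 60 epochs/condition
t = np.arange(-0.2, 1.0, 1 / fs)         # epoch: -200 ms to +1000 ms
rng = np.random.default_rng(0)

def simulate(n400_gain):
    """Epochs with an N400-like negative deflection peaking ~400 ms."""
    n400 = -n400_gain * np.exp(-((t - 0.4) ** 2) / (2 * 0.1 ** 2))
    return n400 + 2.0 * rng.standard_normal((n_trials, t.size))

congruent = simulate(1.0)     # "...milk and sugar"
incongruent = simulate(3.0)   # "...milk and bolts" -> larger N400

win = (t >= 0.2) & (t <= 0.6)            # canonical N400 window
erp_con = congruent.mean(axis=0)         # trial-averaged ERPs
erp_inc = incongruent.mean(axis=0)
print(f"mean 200-600 ms amplitude, congruent:   {erp_con[win].mean():+.2f} uV")
print(f"mean 200-600 ms amplitude, incongruent: {erp_inc[win].mean():+.2f} uV")
```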


FIGURE 39.1 Invasive recording methods. (A) Intraoperative photo showing placement of a high-density (4-mm pitch) ECoG grid on the cortical surface. (B) Brain reconstruction showing coverage of the ECoG grid. (C) Spectrogram of average speech-evoked neural activity from STG, with prominent high-gamma band responses (red box). (D) STG electrode showing high-gamma sensitivity to phonetic features (fricatives, plosives, nasals).


In many cases, localizing these foci can be done through a combination of behavioral and scalp EEG measures. However, in some cases, either because of the depth of the seizure focus or because of the inverse problem, it is only possible to identify the abnormal tissue through electrodes implanted directly in the brain (Crone, Sinai, & Korzeniewska, 2006). In these instances, the electrodes are placed according to the clinicians' best estimates of probable seizure foci, and the patient is kept in the epilepsy monitoring unit for a few days to several weeks, or however long it takes to record enough seizure activity to localize the source. If the patient has consented prior to surgery, then this hospitalization period provides a unique opportunity for researchers to collect neural data directly from the brain during a variety of experimental tasks.

The placement of ECoG electrodes often overlaps with peri-Sylvian areas of the brain that are critical for linguistic functions, particularly for speech. Electrode arrays can routinely obtain high-density coverage of the superior temporal gyrus (STG), middle temporal gyrus (MTG), ventral sensory-motor cortex (vSMC), and the inferior frontal gyrus (IFG), regions that are known to be involved in speech perception and production (Figure 39.1B). Additionally, given that medial temporal lobe epilepsy is common in this patient population, activity is often recorded from hippocampal and peri-hippocampal structures, which are also thought to be involved in the encoding of linguistic information (Halgren, Baudena, Heit, Clarke, & Marinkovic, 1994). These recordings are obtained by implanting penetrating depth electrodes, which reach these deep structures through basal or lateral cortical areas that are likely to be resected (Jerbi et al., 2009). Depth electrode recordings are difficult, both surgically and electrophysiologically; however, several research groups have pioneered these methods and obtained incredibly fruitful results. It is also possible to record the activity of single neurons in the human brain. We present several examples of work that has examined the fundamental currency of neural computation, the spike. Although relatively less common, these studies provide a critical link between human cognitive neuroscience and animal models, where much more is known about the cellular and neurophysiological properties of neuronal function.

39.2.2 High-Frequency Oscillations

In addition to spatial (electrodes) and temporal (milliseconds) dimensions, neural population activity can be decomposed into the frequency dimension. One of Berger's initial observations in the scalp EEG was a dominant oscillation in the 8- to 12-Hz range, which he referred to as the "alpha" band (Berger, 1929). Over the next several decades, more frequency bands were discovered and analyzed using a variety of techniques, including Fourier decomposition, wavelet analysis, and the Hilbert transform (Bruns, 2004). Different frequency ranges have been associated with a variety of behavioral states and stimulus-related responses. For example, in speech, the theta band (~4–7 Hz) is proposed to be a critical range for encoding the temporal structure of spoken input (Edwards & Chang, 2013; Poeppel, 2003). However, the biophysics of the various media in the head prevents some of the higher frequencies from being observed at the scalp. Thus, one major advantage of ECoG is the ability to detect oscillations above approximately 70 Hz. Approximately 15 years ago, these frequencies were discovered in humans to contain highly coherent and reliable information about the stimulus (Crone, Boatman, Gordon, & Hao, 2001; Crone, Miglioretti, Gordon, & Lesser, 1998). Since then, the majority of studies that have examined speech processing using ECoG have focused on the high-gamma range, which is typically between 60 and 200 Hz (Figure 39.1C). Whereas the broadband local field potential ERP is dominated by low-frequency oscillations between approximately 2 and 20 Hz (in part due to the 1/f power law), the temporal resolution of high-gamma activity appears to carry critical information about speech, which is equally dynamic on a millisecond time scale. As we describe, it has been shown that high-gamma evoked activity correlates with acoustic–phonetic variability in primary and secondary auditory areas (Figure 39.1D) and with the coordinated movement of the articulators in motor regions during speech production.
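The standard first step in such analyses is to isolate the high-gamma band and take its amplitude envelope. The following is a minimal sketch of that step, using a band-pass filter plus the Hilbert transform on a simulated trace; the band edges and sampling rate are illustrative assumptions rather than any particular study's settings.

```python
# Minimal sketch of high-gamma envelope extraction: band-pass, then
# take the analytic (Hilbert) amplitude. The recording is simulated.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                              # intracranial sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)
# Simulated electrode trace: a 100-Hz "high-gamma" carrier whose
# amplitude follows a slow 4-Hz "speech envelope", buried in noise.
slow_env = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)
trace = slow_env * np.cos(2 * np.pi * 100 * t) + np.random.randn(t.size)

b, a = butter(4, [70.0, 150.0], btype="bandpass", fs=fs)
hg = filtfilt(b, a, trace)               # isolate the high-gamma band
envelope = np.abs(hilbert(hg))           # analytic amplitude

# The extracted envelope should track the slow amplitude modulation:
corr = np.corrcoef(envelope, slow_env)[0, 1]
print(f"correlation with the true modulation: {corr:.2f}")
```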
Perhaps of great interest to readers be resected (Jerbi et al., 2009). Depth electrode record- of this book is that high-gamma activity also carries ings are difficult, both surgically and electrophysiologi- information about higher level and abstract aspects of cally; however, several research groups have pioneered the speech signal, which suggests that it may be an these methods and obtained incredibly fruitful results. important tool for understanding how both local and It is also possible to record the activity of single neu- long distance neural networks give rise to rich linguis- rons in the human brain. We present several examples tic representations and concepts from incoming sen- of work that have examined the fundamental currency sory signals. Finally, high-gamma activity is well- of neural computation, the spike. Although relatively suited to link animal models with human cognitive less common, these studies provide a critical link studies because it is strongly correlated with multiunit between human cognitive neuroscience and animal spiking activity (Ray & Maunsell, 2011; Steinschneider, models, where much more is known about the cellular Fishman, & Arezzo, 2008) and the blood oxygen and neurophysiological properties of neuronal function. leveldependent (BOLD) response in fMRI (Mukamel et al., 2005; Ojemann, Ojemann, & Ramsey, 2013).

39.2.2 High-Frequency Oscillations 39.2.3 Limitations of Invasive Methods In addition to spatial (electrodes) and temporal (milliseconds) dimensions, neural population activity Before describing the scientific advancements that can be decomposed into the frequency dimension. One have been made using this set of tools, it is important to of Berger’s initial observations in the scalp EEG was a consider the limitations of invasive recording methods. dominant oscillation in the 8- to 12-Hz range, which he First, the data are obtained from patient volunteers who
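To make the signal-processing step concrete, the following minimal Python sketch shows one standard way to extract a band-limited analytic amplitude such as high-gamma from a single-electrode voltage trace, using a zero-phase band-pass filter followed by the Hilbert transform (one of the techniques mentioned above; cf. Bruns, 2004). It assumes NumPy and SciPy; the function name, band edges, and the synthetic trace are illustrative, not taken from any of the studies cited.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_amplitude(trace, fs, low=70.0, high=150.0, order=4):
    """Analytic amplitude of a band-limited signal (e.g., high-gamma)."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="bandpass")
    filtered = filtfilt(b, a, trace)     # zero-phase band-pass filter
    return np.abs(hilbert(filtered))     # Hilbert envelope as a power proxy

# Illustrative use: 2 s of synthetic "ECoG" sampled at 1 kHz
fs = 1000.0
t = np.arange(0, 2.0, 1.0 / fs)
trace = np.sin(2 * np.pi * 10.0 * t) + 0.3 * np.random.randn(t.size)
high_gamma = band_amplitude(trace, fs)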

39.2.3 Limitations of Invasive Methods

Before describing the scientific advancements that have been made using this set of tools, it is important to consider the limitations of invasive recording methods. First, the data are obtained from patient volunteers who are undergoing surgical procedures for serious medical conditions. Therefore, all recordings are dictated by the willingness of the patient to participate in the hospital setting and by the areas of the brain that the clinicians deem relevant to their treatment. A related limitation is that although electrode arrays are becoming more advanced in their density and coverage (Insel, Landis, & Collins, 2013; Viventi et al., 2011), invasive methods necessarily under-sample the brain. Except in cases when it is clinically necessary, recordings are typically obtained only from the exposed surface of the cortex, which limits access to the sulci and deep structures. Additionally, because most epilepsy cases have unilateral seizure foci, simultaneous recordings from both cerebral hemispheres are rare, although there have been some bilateral cases (Cogan et al., 2014).

Finally, it is important to consider that recordings are obtained from patients who have abnormal brains. Many of these surgeries involve removing much of the anterior temporal lobe and hippocampal/amygdala complex, sometimes with only limited behavioral effects on the patients (Tanriverdi, Ajlan, Poulin, & Olivier, 2009). This suggests that these areas had only limited functionality, perhaps due to plastic reorganization around epileptogenic tissue. In practice, electrodes that show any sort of abnormal neurophysiological activity are excluded from analyses; however, this concern warrants caution in interpreting results.

Despite all of these caveats, invasive recording methods offer numerous advantages for advancing our knowledge of the neurobiology of language. Limited sampling of select brain regions precludes an understanding of the dynamics of large-scale inter-regional neural networks, but the frequency and spatiotemporal resolutions of the data obtained from these areas are unparalleled compared with noninvasive tools. Furthermore, results from invasive methods consistently agree with and extend findings from other techniques, despite the nature of the patient populations. As we argue in the rest of this chapter, ECoG, depth electrode, and single unit recordings must be interpreted in the context of the current understanding from other methodologies. Ultimately, it is a rare opportunity and a privilege to work with these patients, and their contributions to neuroscience and the neurobiology of language have been critical over the past several years.

39.3 INTRACRANIAL CONTRIBUTIONS TO THE NEUROBIOLOGY OF LANGUAGE

In this section, we present examples of how invasive recording techniques have provided new insights into the neural basis of language. Although this is not an exhaustive review of the science, we hope to convey the contributions of these studies to the broader view of neurolinguistics. Thus, we consider these studies in the context of fMRI, EEG, and MEG studies, particularly those that are using analogous multivariate statistical and machine learning analysis methods. We also focus primarily on the sensory, acoustic-phonetic, and phonemic aspects of speech perception because implanted electrodes typically cover areas that are most closely associated with these functions. We also briefly argue that invasive methods are already proving useful for examining other aspects of language and linguistic encoding, particularly higher level abstract representations, such as those for lexical and semantic information (Leonard & Chang, 2014).

39.3.1 Sensory Encoding in Primary Auditory Cortex

We begin this section with a brief overview of sensory processing in the primary auditory cortex (A1). As shown in subsequent sections, downstream regions that encode acoustic-phonetic and phonemic information are highly sensitive to the spectrotemporal characteristics of the stimulus. Therefore, it is crucial to understand the inputs to these regions and to recognize that even the earliest cortical auditory areas show selective tuning that facilitates speech perception.

The ascending auditory system sends signals from the tonotopically organized cochlea through the lateral lemniscal pathway to an area on the posteromedial portion of Heschl's gyrus, which is the primary auditory cortex in humans (Hackett, 2011). Like the areas that precede it in the auditory hierarchy, A1 is tonotopically organized along at least one major axis, meaning that different neural populations are sensitive to different acoustic frequencies (Barton, Venezia, Saberi, Hickok, & Brewer, 2012; Baumann, Petkov, & Griffiths, 2013; also refer to Chapter 5). Following initial work by Celesia (Celesia, 1976) and Liégeois-Chauvel (Liegeois-Chauvel, Musolino, & Chauvel, 1991), researchers at the University of Iowa have provided a detailed description of human A1 using multicontact depth electrodes and have found frequency-specific responses that are indicative of both rate and temporal coding of auditory stimuli, both within and across neural populations (Brugge et al., 2009). These different coding mechanisms occur at different stimulus frequencies and may also contribute to the perception of pitch (Griffiths et al., 2010). Furthermore, narrow spectral tuning in human A1 is not just a feature of the population neural activity; it is also reflected in the selective responses of single neurons (Bitterman, Mukamel, Malach, Fried, & Nelken, 2008).
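As a concrete (and deliberately simplified) illustration of what tonotopic tuning means analytically, the sketch below estimates a frequency tuning curve and a best frequency for each recording site from tone-evoked response amplitudes. It is a minimal NumPy example under assumed data shapes (trials x electrodes); it is not the analysis pipeline of any study cited here.

import numpy as np

def tuning_and_best_frequency(responses, tone_freqs):
    """responses: (n_trials, n_electrodes) evoked amplitudes, e.g., mean
    high-gamma in a post-onset window; tone_freqs: (n_trials,) frequency
    of the tone presented on each trial. Returns per-electrode tuning
    curves and the frequency evoking the largest mean response."""
    freqs = np.unique(tone_freqs)
    # Average response to each tone frequency -> one tuning curve per electrode
    tuning = np.stack([responses[tone_freqs == f].mean(axis=0) for f in freqs])
    best = freqs[np.argmax(tuning, axis=0)]   # best frequency per electrode
    return freqs, tuning, best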


These depth electrode and single unit studies in Heschl's gyrus extend important work from noninvasive methods. Using machine learning analysis methods to relate the acoustic properties of natural sounds to changes in the BOLD response in fMRI, Moerel and colleagues have shown that A1 exhibits narrow spectral tuning in addition to tonotopic organization (Moerel, De Martino, & Formisano, 2012). The same researchers have also used high-field fMRI to show that individual neural populations have multipeak tuning at octave intervals, possibly facilitating the binding of complex sound features (Moerel et al., 2013). In agreement with temporal coding mechanisms giving rise to the perceptual phenomenon of pitch, it has also been shown with fMRI that A1 has spectrotemporal modulation transfer functions that track temporal modulations in the stimulus (Schönwiesner & Zatorre, 2009). The combined spatial organization of spectral and temporal modulation selectivity reveals multiple subfields of both primary and secondary auditory regions, suggesting a functional microstructure of the earliest cortical sensory areas that may contribute to the complex nonlinear representations of input like speech (Barton et al., 2012).

Sensitivity to spectral and temporal stimulus features in A1 most likely facilitates our ability to understand dynamic acoustic input like speech. However, it is not necessarily the case that this type of tuning reflects speech-specific or even speech-oriented specialization. In a recent study, Steinschneider and colleagues examined single-unit and multi-unit spiking activity in the primary auditory cortex of human epilepsy patients and macaques. For broadband auditory evoked potentials and high-gamma activity, both species showed similar spectrotemporal selectivity (Steinschneider, Nourski, & Fishman, 2013). Remarkably, this selectivity included acoustic aspects of speech, such as voice onset time and place of articulation (POA), which are critical for making phonetic distinctions. This suggests that speech-specific representations do not reside at the level of A1, although the tuning characteristics of this region allow us to parse spoken input efficiently.

39.3.2 Acoustic-Phonetic Representations in Lateral Superior Temporal Cortex

In humans, Heschl's gyrus is in close proximity and is densely connected to the lateral superior temporal cortex, in particular the STG (Hackett, 2011). In part because of the most common placement of ECoG electrode arrays, and also because of the functions of this area, STG is among the best characterized regions in the speech system. It is well known that STG is involved in phonological representation, because neural activation there is observed for most speech tasks (DeWitt & Rauschecker, 2012; Hickok & Poeppel, 2007; Rauschecker & Scott, 2009). In fMRI, a very productive paradigm has contrasted BOLD responses to speech and to nonspeech sounds that preserve important spectral or temporal aspects of the signal (Davis & Johnsrude, 2003; Scott, Blank, Rosen, & Wise, 2000). Numerous studies have demonstrated stronger STG activation for speech, suggesting that this region encodes aspects of the stimuli that are intelligible and behaviorally relevant to the listener (Obleser, Eisner, & Kotz, 2008). Recently, multivariate pattern analysis methods have been applied to fMRI data and have shown that STG (among other temporal lobe regions, including the superior temporal sulcus) is involved in representing meaningful aspects of the speech signal (Evans et al., 2013).

These results have been advanced using both invasive and noninvasive methods that afford high spatial and temporal resolution. Using MEG in healthy volunteers and ECoG in epilepsy patients, Travis and colleagues showed that STG responds preferentially to speech relative to a noise-vocoded control, which smooths the speech spectrogram along the spectral axis, preserving the temporal and most of the frequency content but destroying intelligibility (Travis et al., 2012). This study also demonstrates that the STG speech-selective responses observed in fMRI most likely reflect stimulus-evoked neural activity between approximately 50 and 150 ms, significantly earlier than activity in the same region that is associated with higher level linguistic encoding of lexical and semantic information. This demonstrates that, at least at the level of nearby cortical columns, STG encodes multiple levels of acoustic-phonetic and abstract linguistic information.

What gives rise to this selectivity for meaningful aspects of the acoustic speech signal? Like A1, distinct STG neuronal populations encode the temporal structure of nonspeech acoustic input differently depending on the frequency content of the input (Nourski et al., 2013). Also like A1, populations are spectrally selective to ranges of frequencies, although they are more broadly tuned than in primary auditory regions (Nourski et al., 2012). These two ECoG studies with nonspeech tones and clicks are complemented by a set of speech ECoG studies using high-density electrode grids over STG. Spectrotemporal selectivity at single electrodes has important consequences for the representation of spoken words, because it suggests that acoustic inputs are represented as a complex pattern of activity across electrodes over time as the input unfolds. Using linear stimulus reconstruction methods, it is possible to generate a spectrogram from population STG activity that corresponds closely to the original stimulus spectrogram (Pasley et al., 2012). This correspondence is particularly strong when the reconstruction model applies stronger weights to the spectrotemporal aspects of the acoustic input that are relevant for speech intelligibility, including temporal modulation rates that correspond to syllable onsets and offsets.
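The logic of linear stimulus reconstruction can be sketched compactly: a regularized linear mapping is learned from time-lagged population activity to each frequency bin of the stimulus spectrogram, and then applied to held-out neural data. The following NumPy sketch assumes matrices of shape (time x electrodes) and (time x frequency bins); the ridge penalty and lag window are illustrative choices, not the exact model of Pasley et al. (2012).

import numpy as np

def lagged_design(neural, n_lags):
    """Stack n_lags time-lagged copies of (n_times, n_elecs) activity
    into a design matrix of shape (n_times, n_elecs * n_lags)."""
    n_t, n_e = neural.shape
    X = np.zeros((n_t, n_e * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_e:(lag + 1) * n_e] = neural[:n_t - lag]
    return X

def fit_reconstruction(neural, spectrogram, n_lags=20, alpha=1.0):
    """Ridge regression from lagged population activity to each
    spectrogram bin; returns weights (n_elecs * n_lags, n_freqs)."""
    X = lagged_design(neural, n_lags)
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ spectrogram)

# Reconstruction on held-out data:
# recon = lagged_design(neural_test, 20) @ weights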

This spectrotemporal representation across electrodes indicates that separate neural populations may each contribute sensitivity to particular features of the speech stimulus. Chang and colleagues presented ECoG participants with hundreds of naturally spoken sentences that contained a large number of examples of all English phonemes (Mesgarani, Cheung, Johnson, & Chang, 2014). This allowed the researchers to examine activity at the closely spaced ECoG electrodes in response to each phoneme. They found that individual electrodes responded selectively to linguistically meaningful phonetic features, such as fricatives, plosives, and nasals. For vowels, there was a similar feature-based representation for low-back, low-front, and high-front vowels, which arose from the encoding of formant frequencies, particularly the ratio of F1 and F2. These results provide crucial information about the role of STG in speech perception. It has been suggested that STG is involved in spectrotemporal (Pasley et al., 2012), acoustic-phonetic (Boatman, 2004), and phonemic (Chang et al., 2010; DeWitt & Rauschecker, 2012; Molholm et al., 2013) processing; however, the functional and spatial specificity of these features had not been described previously. Chang and colleagues showed that STG does represent acoustic-phonetic features (rather than individual phonemes, as single electrodes responded to multiple phonemes sharing a particular feature), but that this selectivity arises from tuning to specific spectrotemporal cues.

Similar results have been obtained at the single neuron level. Chan and colleagues collected neural responses from an epilepsy patient implanted with a Utah array, which consists of 100 penetrating electrodes with 20-micron tips, and recorded spiking activity from approximately 150 single units and multiunits in anterior STG (Chan et al., 2013). They found neurons that responded selectively to sets of phonemes sharing particular phonetic features (e.g., high-front vowels), but not to individual phonemes. Remarkably, they also demonstrated that responses are similarly selective to phonemes whether the words are heard or read. This suggests that written words activate phonological representations, and also that single neurons in the human STG are tuned to phonetic features, independent of the nature of the sensory input.

39.3.3 Population Encoding of Phonemic Information in STG

Spectrotemporally based phonetic tuning at individual neural populations may also give rise to more complex representations of speech. Although single electrodes do not show sensitivity to specific phonemes (Mesgarani et al., 2014), the combined activity across electrodes generates a population code for complex sets of phonetic features. Two important higher level features that help distinguish phonemes are POA and voice onset time (VOT). POA refers to the configuration of the articulators and distinguishes phonemes that share a particular feature, such as plosives (e.g., /ba/ versus /ga/). VOT refers to the temporal delay between plosive closure and the onset of vibrations in the larynx, and distinguishes plosives that have the same POA (e.g., /ba/ versus /pa/).

In an ECoG study, Steinschneider et al. (2011) examined how event-related high-gamma activity differed for sounds that varied parametrically in POA or VOT. They found that activity at posterior STG electrodes correlated most strongly with changes in POA. VOT was also represented in both medial and posterior STG, and it tracked the timing of voice onset through latency shifts in the peaks of the average high-gamma waveforms. This demonstrates that understanding the encoding of complex stimulus features requires the ability to discriminate activity at local neural populations (spatial selectivity, achieved through high-density electrode arrays and high-gamma band filtering) and also to track neural activity on a millisecond level.

Another common method for examining the encoding of individual phonemes is through categorical phoneme perception. In this paradigm, a linear continuum of speech sounds varies in a particular acoustic parameter from one unambiguous extreme to another (e.g., F2 frequency, distinguishing POA for the plosives /b/, /d/, and /g/). The sounds are perceived nonlinearly, such that listeners hear one phoneme up to a particular point on the continuum and then abruptly switch to perceiving a different phoneme; in other words, listeners fit acoustically ambiguous examples into established phoneme categories. This phenomenon has been localized using fMRI primarily to the lateral superior temporal cortex (Joanisse, Zevin, & McCandliss, 2007); however, these studies generally do not reveal how population neural activity enables this important perceptual effect.

ECoG has sufficient spatial and temporal resolution to address how nonlinear perceptual phenomena such as categorical perception are encoded in the brain. Using a continuum of synthesized speech sounds that varied linearly in F2 onset from /ba/ to /da/ to /ga/ (Figure 39.2A), it has been shown that high-gamma activity across electrodes can distinguish these sounds categorically (Chang et al., 2010). Distinct neural populations within the posterior STG generate neural activity patterns that correspond to each phoneme along the POA continuum, even within the space of only a couple of centimeters (Figure 39.2B).


FIGURE 39.2 Speech representation in human STG. (A) Speech sounds synthesized along an acoustic continuum (stimuli 1-14, small increases in the second formant) are perceived categorically rather than linearly in behavior. (B) The spatial topography of evoked potentials recorded directly from the cortical surface of STG using ECoG is highly distributed and complex for each sound class. (C) A neural confusion matrix plots the pair-wise dissimilarity of neural patterns using a multivariate classifier. (D) Multidimensional scaling shows that response patterns are organized in discrete clustered categories along both acoustic and perceptual sensitivities. Adapted from Chang et al. (2010) with permission from the publisher.

Using multivariate classification methods, stimulus-specific discriminability was observed in this activity (Figure 39.2C), which suggests that, at the peak of neural pattern dissimilarity across categories (occurring at ~110 ms after stimulus onset), the perceptual effect arises from the activity of specific groups of electrodes. The timing of this discriminability is sufficiently early to suggest that categorical phoneme perception occurs in situ within the STG, rather than through top-down influences of other brain regions, although the limited sampling of brain areas with ECoG does not rule out the latter possibility. It is also important to note that, consistent with the evidence described for phonetic feature representations at individual STG electrodes (Mesgarani et al., 2014), categorical perception of the linear continuum in this study was organized along acoustic sensitivities. Specifically, the representations of speech tokens were ordered according to F2 in one dimension (see the ordering along the x-axis in Figure 39.2D), but the overall pattern was categorical in two dimensions (Figure 39.2D). This is a clear demonstration that nonlinear perceptual phenomena are encoded nonlinearly in the brain.
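The population-level analyses summarized in Figure 39.2C and D can be sketched in a few lines: average the multi-electrode pattern for each stimulus token, compute pairwise dissimilarities between token patterns, and embed them with multidimensional scaling. This is a generic SciPy/scikit-learn illustration of the analysis style, with assumed input shapes; it is not the authors' code.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def token_geometry(patterns, tokens, n_dim=2):
    """patterns: (n_trials, n_electrodes) responses at the peak window;
    tokens: (n_trials,) stimulus token IDs (e.g., 1-14 on the continuum)."""
    ids = np.unique(tokens)
    # Mean multi-electrode pattern for each token on the continuum
    means = np.stack([patterns[tokens == i].mean(axis=0) for i in ids])
    dissim = squareform(pdist(means, metric="correlation"))   # cf. Fig. 39.2C
    mds = MDS(n_components=n_dim, dissimilarity="precomputed", random_state=0)
    embedding = mds.fit_transform(dissim)                     # cf. Fig. 39.2D
    return dissim, embedding

Categorical structure would appear as discrete clusters of tokens in the embedding, despite the linear acoustic spacing of the continuum.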

39.3.4 Cognitive Influences on Speech in STG

Intracranial recording methods have also recently proven useful for understanding how spectrotemporal, phonetic, and phonemic representations in STG are modulated by cognitive processes. For example, these methods have renewed interest in the "cocktail party" problem, whereby listeners can selectively filter out irrelevant auditory streams in a noisy environment (Cherry, 1953). Several recent studies have examined neural responses to overlapping, multispeaker acoustic stimuli, where listeners were directed to focus their attention on only one speaker (Ding & Simon, 2012; Kerlin, Shahin, & Miller, 2010; Sabri et al., 2008; Zion Golumbic et al., 2013). Using ECoG, one study applied stimulus reconstruction methods (Mesgarani, David, Fritz, & Shamma, 2009) to investigate how attention modulates the spectrotemporal representation of the acoustic input in STG (Mesgarani & Chang, 2012). Participants listened to two speakers simultaneously (Figure 39.3A and B) and were asked to report the content of just one of the speech streams, thus attending to only part of the acoustic input. Remarkably, the spectrotemporal representation of the stimulus in STG reflected only the attended speech stream, as if the ignored speaker had not been heard at all (Figure 39.3C).

FIGURE 39.3 Attention modulates STG representations of spectrotemporal speech content. (A) ECoG participants listened to two speech streams either alone or simultaneously and were cued to focus on a particular call sign ("tiger" or "ringo") and to report the color/number combination (e.g., "green five") associated with that speaker. (B) The acoustic spectrogram of the mixed speech streams shows highly overlapping energy distributions across time. (C) Neural population-based reconstruction of the spectrograms for speaker 1 (blue) and speaker 2 (red), when participants heard each speaker alone (shaded area) or in the mixed condition (outline). Results demonstrate that in the mixed condition, attention to a particular speaker results in a spectrotemporal representation in STG as if that speaker were heard alone. Adapted from Mesgarani and Chang (2012) with permission from the publisher.

Using similar methods, Zion Golumbic and colleagues extended these findings to show that this attentional modulation is reflected not only in high-gamma power changes but also in low-frequency phase over the course of several hundred milliseconds (Zion Golumbic et al., 2013). They also demonstrated that the extent to which the ignored speech stream is represented changes as a function of brain region; specifically, higher level frontal and prefrontal areas show a less robust representation of the ignored stream. There was also a stronger correlation between the attended stimulus and the neural response in STG for high-gamma power, whereas frontal regions showed stronger correlations with the low-frequency phase, implicating different frequency-dependent encoding mechanisms in different brain regions.

As described, there have also been attempts to characterize higher order representations of linguistic input using intracranial recording methods (Sahin, Pinker, Cash, Schomer, & Halgren, 2009). The N400 response, which reflects the degree of semantic contextual mismatch, has been localized to left anterior, ventral, and superior temporal cortex (Halgren et al., 1994). Recent work has further shown that responses within STG are sensitive to semantic context, independent of lower level acoustic features (Travis et al., 2012). ECoG and depth electrode recordings have also revealed that ventral temporal regions (including perirhinal cortex) show semantic category selectivity for words, regardless of whether the words are written or spoken, suggesting that abstract semantic and conceptual representations are resolvable using these methods (Chan et al., 2011). Halgren, Chan, and colleagues have also used penetrating laminar depth electrodes to record multiunit activity in individual cortical layers during these tasks and have shown that the timing of responses in input and output layers suggests an early (~130 ms) first-pass selective response, followed by a later (~400 ms) selective response that is consistent with the N400 recorded at the scalp (Chan et al., 2011; Halgren et al., 2006). This suggests that higher order linguistic representations not only are spatially distributed but also arise from a complex temporal pattern of neural activity.

Together, these studies demonstrate the context-sensitive nature of evoked neural responses. Crucially, they demonstrate that both fine spatial resolution and fine temporal resolution are necessary to understand how cognitive representations influence and modulate lower level sensory and perceptual neural responses. Differences across frequencies in the neural signal also encode important aspects of this modulation, which nicely complements work using noninvasive methods.
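The attended-speaker result lends itself to a simple decoding readout: correlate the spectrogram reconstructed from STG activity with the spectrograms of the two speakers heard alone, and label the trial by whichever correlation is higher. The sketch below assumes pre-computed spectrogram arrays of equal shape; it illustrates the comparison logic rather than the published pipeline.

import numpy as np

def attended_speaker(reconstruction, spec_1, spec_2):
    """All inputs are (n_freqs, n_times) spectrogram arrays; returns the
    inferred attention target plus both correlation values."""
    r1 = np.corrcoef(reconstruction.ravel(), spec_1.ravel())[0, 1]
    r2 = np.corrcoef(reconstruction.ravel(), spec_2.ravel())[0, 1]
    return ("speaker 1" if r1 > r2 else "speaker 2"), r1, r2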

39.4 THE FUTURE OF INVASIVE METHODS IN LANGUAGE RESEARCH

We have argued that methodological advances in recording and analysis tools have allowed invasive neurophysiology to provide a window into the neural basis of speech and language that cannot be seen by other methods. The past several years have been particularly exciting because there has been a fusion of novel engineering approaches (machine learning and multivariate statistics) with linguistic and neuroscientific questions. In general, these fields have met in two ways. In the first, traditional experimental paradigms have been used with increasingly higher density neural recordings, which have in some cases provided new spatial information and often an entirely novel temporal dimension (Edwards et al., 2009, 2010; Molholm et al., 2013; Steinschneider et al., 2011; Travis et al., 2012). In the second approach, researchers have used data mining and machine learning techniques to understand how the activity observed in the paradigms from the first approach reflects the encoding and representation of stimulus and abstract information (Chan et al., 2013; Chang et al., 2010; Cogan et al., 2014; Mesgarani et al., 2014; Pasley et al., 2012; Zion Golumbic et al., 2013). Together, and in the context of noninvasive studies, these investigations are converging on the ultimate goal of neurolinguistics: to understand how neurons and neural networks generate the linguistic representations that are at the core of our daily experience.

Although the machine learning and data mining approaches have provided an unprecedented understanding of the underlying information representations in areas such as STG, many equally fundamental questions remain that invasive recording methods may be well-suited to answer. A basic question for speech perception is how tuning and representations in A1 and STG are integrated over time to form more abstract word-level representations. Surprisingly, many of the studies cited neglect the temporal dimension of the data; it is common for the input to linear classifiers to be an average over a large time window or even a single time point (Mesgarani et al., 2014). Although examining high-gamma band power inherently focuses on an aspect of the data involving fast time scales (e.g., acoustic onsets), a major question remains regarding how information is encoded over time in different brain regions.

One method for examining neural activity over time, originally developed for understanding the dynamics of populations of single units (Churchland et al., 2012), involves characterizing the shared variability across space and plotting neural activity as trajectories through a "state-space" over time. In a recent study on the organization of articulatory representations in vSMC, principal components analysis was used to reduce the dimensionality across ECoG electrodes and showed that activity closely tracks the articulatory features of consonant-vowel syllables on a millisecond time scale (Bouchard, Mesgarani, Johnson, & Chang, 2013). Given that abstract representations such as lexical and semantic features are known to be more spatially distributed (Huth, Nishimoto, Vu, & Gallant, 2012) than lower level sensory features, it may be possible to use high-density neurophysiological recordings to understand how speech is transformed from primary sensory areas to distributed cortical networks. Overall, the field of neurolinguistics faces an exciting period, with a multitude of experimental approaches that contribute unique and complementary information to our understanding of language.
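A state-space analysis of the kind just described can be sketched with ordinary PCA: center the (time x electrode) activity matrix, take its leading principal components, and read the projection as a trajectory over time. This minimal NumPy example follows the general recipe of Churchland et al. (2012) and Bouchard et al. (2013) only in spirit; the input shapes and the choice of three components are assumptions.

import numpy as np

def state_space_trajectory(activity, n_components=3):
    """activity: (n_times, n_electrodes), e.g., trial-averaged high-gamma
    for one consonant-vowel syllable. Returns the trajectory through the
    leading principal components and the variance each one explains."""
    centered = activity - activity.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
    trajectory = centered @ Vt[:n_components].T               # (n_times, n_components)
    explained = (S ** 2) / np.sum(S ** 2)
    return trajectory, explained[:n_components]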


References

Barton, B., Venezia, J. H., Saberi, K., Hickok, G., & Brewer, A. A. (2012). Orthogonal acoustic dimensions define auditory field maps in human cortex. Proceedings of the National Academy of Sciences, 109(50), 20738-20743.
Baumann, S., Petkov, C. I., & Griffiths, T. D. (2013). A unified framework for the organization of the primate auditory cortex. Frontiers in Systems Neuroscience, 7.
Beauchamp, M. S., Beurlot, M. R., Fava, E., Nath, A. R., Parikh, N. A., Saad, Z. S., et al. (2011). The developmental trajectory of brain-scalp distance from birth through childhood: Implications for functional neuroimaging. PLoS One, 6(9), e24981.
Berger, H. (1929). Über das Elektrenkephalogramm des Menschen. European Archives of Psychiatry and Clinical Neuroscience, 87(1), 527-570.
Bitterman, Y., Mukamel, R., Malach, R., Fried, I., & Nelken, I. (2008). Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature, 451, 197-201.
Boatman, D. (2004). Cortical bases of speech perception: Evidence from functional lesion studies. Cognition, 92(1), 47-65.
Bouchard, K. E., Mesgarani, N., Johnson, K., & Chang, E. F. (2013). Functional organization of human sensorimotor cortex for speech articulation. Nature, 495(7441), 327-332.
Brugge, J. F., Nourski, K. V., Oya, H., Reale, R. A., Kawasaki, H., Steinschneider, M., et al. (2009). Coding of repetitive transients by auditory cortex on Heschl's gyrus. Journal of Neurophysiology, 102, 2358-2374.
Bruns, A. (2004). Fourier-, Hilbert- and wavelet-based signal analysis: Are they really different approaches? Journal of Neuroscience Methods, 137(2), 321-332.
Celesia, G. G. (1976). Organization of auditory cortical areas in man. Brain: A Journal of Neurology, 99(3), 403.
Chan, A. M., Baker, J. M., Eskandar, E., Schomer, D., Ulbert, I., Marinkovic, K., et al. (2011). First-pass selectivity for semantic categories in human anteroventral temporal lobe. The Journal of Neuroscience, 31(49), 18119-18129.
Chan, A. M., Dykstra, A. R., Jayaram, V., Leonard, M. K., Travis, K. E., Gygi, B., et al. (2013). Speech-specific tuning of neurons in human superior temporal gyrus. Cerebral Cortex, 24(10), 2679-2693.
Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro, N. M., & Knight, R. T. (2010). Categorical speech representation in human superior temporal gyrus. Nature Neuroscience, 13, 1428-1432.
Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25(5), 975-979.
Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., et al. (2012). Neural population dynamics during reaching. Nature, 487, 51-56.
Cogan, G. B., Thesen, T., Carlson, C., Doyle, W., Devinsky, O., & Pesaran, B. (2014). Sensory-motor transformations for speech occur bilaterally. Nature, 507, 94-98.
Crone, N. E., Boatman, D., Gordon, B., & Hao, L. (2001). Induced electrocorticographic gamma activity during auditory perception. Clinical Neurophysiology, 112, 565-582.
Crone, N. E., Miglioretti, D. L., Gordon, B., & Lesser, R. P. (1998). Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis. II. Event-related synchronization in the gamma band. Brain, 121, 2301-2315.
Crone, N. E., Sinai, A., & Korzeniewska, A. (2006). High-frequency gamma oscillations and human brain mapping with electrocorticography. Progress in Brain Research, 159, 275-295.
Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. The Journal of Neuroscience, 23, 3423-3431.
DeWitt, I., & Rauschecker, J. P. (2012). Phoneme and word recognition in the auditory ventral stream. Proceedings of the National Academy of Sciences, 109, E505-E514.
Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences, 109, 11854-11859.
Edwards, E., & Chang, E. F. (2013). Syllabic (~2-5 Hz) and fluctuation (~1-10 Hz) ranges in speech and auditory processing. Hearing Research, 305, 113-134.
Edwards, E., Nagarajan, S. S., Dalal, S. S., Canolty, R. T., Kirsch, H. E., Barbaro, N. M., et al. (2010). Spatiotemporal imaging of cortical activation during verb generation and picture naming. Neuroimage, 50(1), 291-301.
Edwards, E., Soltani, M., Kim, W., Dalal, S. S., Nagarajan, S. S., Berger, M. S., et al. (2009). Comparison of time-frequency responses and the event-related potential to auditory speech stimuli in human cortex. Journal of Neurophysiology, 102(1), 377.
Evans, S., Kyong, J., Rosen, S., Golestani, N., Warren, J., McGettigan, C., et al. (2013). The pathways for intelligible speech: Multivariate and univariate perspectives. Cerebral Cortex, 24(9), 2350-2361.
Federmeier, K. D., & Kutas, M. (2005). Aging in context: Age-related changes in context use during language comprehension. Psychophysiology, 42(2), 133-141.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11, 127-138.
Geschwind, N. (1974). Disconnexion syndromes in animals and man. The Netherlands: Springer.
Griffiths, T. D., Kumar, S., Sedley, W., Nourski, K. V., Kawasaki, H., Oya, H., et al. (2010). Direct recordings of pitch responses from human auditory cortex. Current Biology, 20, 1128-1132.
Hackett, T. A. (2011). Information flow in the auditory cortical network. Hearing Research, 271, 133-146.
Halgren, E., Baudena, P., Heit, G., Clarke, M., & Marinkovic, K. (1994). Spatio-temporal stages in face and word processing. 1. Depth recorded potentials in the human occipital and parietal lobes. Journal of Physiology - Paris, 88(1), 1-50.
Halgren, E., Wang, C., Schomer, D. L., Knake, S., Marinkovic, K., Wu, J., et al. (2006). Processing stages underlying word recognition in the anteroventral temporal lobe. Neuroimage, 30(4), 1401-1413.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393-402.
Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210-1224.
Insel, T. R., Landis, S. C., & Collins, F. S. (2013). The NIH BRAIN Initiative. Science, 340(6133), 687-688.
Jerbi, K., Ossandón, T., Hamame, C. M., Senova, S., Dalal, S. S., Jung, J., et al. (2009). Task-related gamma-band dynamics from an intracerebral perspective: Review and implications for surface EEG and MEG. Human Brain Mapping, 30(6), 1758-1771.
Joanisse, M. F., Zevin, J. D., & McCandliss, B. D. (2007). Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using fMRI and a short-interval habituation trial paradigm. Cerebral Cortex, 17, 2084-2093.
Kerlin, J. R., Shahin, A. J., & Miller, L. M. (2010). Attentional gain control of ongoing cortical speech representations in a "cocktail party". The Journal of Neuroscience, 30, 620-628.
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62, 621-647.
Leonard, M. K., & Chang, E. F. (2014). Dynamic speech representations in the human temporal lobe. Trends in Cognitive Sciences, 18(9), 472-479.
Leonard, M. K., Ramirez, N. F., Torres, C., Travis, K. E., Hatrak, M., Mayberry, R. I., et al. (2012). Signed words in the congenitally deaf evoke typical late lexicosemantic responses with no early visual responses in left superior temporal cortex. The Journal of Neuroscience, 32(28), 9700-9705.
Liegeois-Chauvel, C., Musolino, A., & Chauvel, P. (1991). Localization of the primary auditory area in man. Brain, 114(1), 139-153.
Marinkovic, K., Dhond, R. P., Dale, A. M., Glessner, M., Carr, V., & Halgren, E. (2003). Spatiotemporal dynamics of modality-specific and supramodal word processing. Neuron, 38(3), 487-497.
Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485, 233-236.
Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006-1010.
Mesgarani, N., David, S. V., Fritz, J. B., & Shamma, S. A. (2009). Influence of context and behavior on stimulus reconstruction from neural activity in primary auditory cortex. Journal of Neurophysiology, 102(6), 3329-3339.
Moerel, M., De Martino, F., & Formisano, E. (2012). Processing of natural sounds in human auditory cortex: Tonotopy, spectral tuning, and relation to voice sensitivity. The Journal of Neuroscience, 32, 14205-14216.
Moerel, M., De Martino, F., Santoro, R., Ugurbil, K., Goebel, R., Yacoub, E., et al. (2013). Processing of natural sounds: Characterization of multipeak spectral tuning in human auditory cortex. The Journal of Neuroscience, 33, 11888-11898. http://dx.doi.org/10.1523/jneurosci.5306-12.2013.
Molholm, S., Mercier, M. R., Liebenthal, E., Schwartz, T. H., Ritter, W., Foxe, J. J., et al. (2013). Mapping phonemic processing zones along human perisylvian cortex: An electro-corticographic investigation. Brain Structure and Function, 1-15.


Mukamel, R., Gelbard, H., Arieli, A., Hasson, U., Fried, I., & Malach, R. (2005). Coupling between neuronal firing, field potentials, and fMRI in human auditory cortex. Science, 309, 951-954.
Nourski, K. V., Brugge, J. F., Reale, R. A., Kovach, C. K., Oya, H., Kawasaki, H., et al. (2013). Coding of repetitive transients by auditory cortex on posterolateral superior temporal gyrus in humans: An intracranial electrophysiology study. Journal of Neurophysiology, 109, 1283-1295.
Nourski, K. V., Steinschneider, M., Oya, H., Kawasaki, H., Jones, R. D., & Howard, M. A. (2012). Spectral organization of the human lateral superior temporal gyrus revealed by intracranial recordings. Cerebral Cortex, 24(2), 340-352.
Obleser, J., Eisner, F., & Kotz, S. A. (2008). Bilateral speech comprehension reflects differential sensitivity to spectral and temporal features. The Journal of Neuroscience, 28, 8116-8123.
Ojemann, G. A., Ojemann, J., & Ramsey, N. F. (2013). Relation between functional magnetic resonance imaging (fMRI) and single neuron, local field potential (LFP) and electrocorticography (ECoG) activity in human cortex. Frontiers in Human Neuroscience, 7. http://dx.doi.org/10.3389/fnhum.2013.00034.
Pascual-Marqui, R. D. (1999). Review of methods for solving the EEG inverse problem. International Journal of Bioelectromagnetism, 1(1), 75-86.
Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, N. E., et al. (2012). Reconstructing speech from human auditory cortex. PLoS Biology, 10, e1001251.
Penfield, W., & Baldwin, M. (1952). Temporal lobe seizures and the technic of subtotal temporal lobectomy. Annals of Surgery, 136(4), 625.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as "asymmetric sampling in time". Speech Communication, 41, 245-255.
Poeppel, D., Idsardi, W. J., & van Wassenhove, V. (2008). Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 1071-1086.
Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12, 718-724.
Ray, S., & Maunsell, J. H. (2011). Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS Biology, 9, e1000610.
Sabri, M., Binder, J. R., Desai, R., Medler, D. A., Leitl, M. D., & Liebenthal, E. (2008). Attentional and linguistic interactions in speech perception. Neuroimage, 39, 1444-1456.
Sahin, N. T., Pinker, S., Cash, S. S., Schomer, D., & Halgren, E. (2009). Sequential processing of lexical, grammatical, and phonological information within Broca's area. Science, 326(5951), 445-449.
Schönwiesner, M., & Zatorre, R. J. (2009). Spectro-temporal modulation transfer function of single voxels in the human auditory cortex measured with high-resolution fMRI. Proceedings of the National Academy of Sciences, 106, 14611-14616.
Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(12), 2400-2406.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193-1216.
Steinschneider, M., Fishman, Y. I., & Arezzo, J. C. (2008). Spectrotemporal analysis of evoked and induced electroencephalographic responses in primary auditory cortex (A1) of the awake monkey. Cerebral Cortex, 18, 610-625.
Steinschneider, M., Nourski, K. V., & Fishman, Y. I. (2013). Representation of speech in human auditory cortex: Is it special? Hearing Research, 305, 57-73.
Steinschneider, M., Nourski, K. V., Kawasaki, H., Oya, H., Brugge, J. F., & Howard, M. A. (2011). Intracranial study of speech-elicited activity on the human posterolateral superior temporal gyrus. Cerebral Cortex, 21, 2332-2347.
Tanriverdi, T., Ajlan, A., Poulin, N., & Olivier, A. (2009). Morbidity in epilepsy surgery: An experience based on 2449 epilepsy surgery procedures from a single institution: Clinical article. Journal of Neurosurgery, 110(6), 1111-1123.
Travis, K. E., Leonard, M. K., Chan, A. M., Torres, C., Sizemore, M. L., Qu, Z., et al. (2012). Independence of early speech processing from word meaning. Cerebral Cortex, 23(10), 2370-2379.
Viventi, J., Kim, D.-H., Vigeland, L., Frechette, E. S., Blanco, J. A., Kim, Y.-S., et al. (2011). Flexible, foldable, actively multiplexed, high-density electrode array for mapping brain activity in vivo. Nature Neuroscience, 14(12), 1599-1605.
Wernicke, C. (1874). Der aphasische Symptomencomplex: Eine psychologische Studie auf anatomischer Basis. Breslau: Cohn.
Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., et al. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron, 77, 980-991.

CHAPTER 40

Factors That Increase Processing Demands When Listening to Speech

Ingrid S. Johnsrude1 and Jennifer M. Rodd2
1Department of Psychology and Centre for Neuroscience Studies, Queen's University, Kingston, ON, Canada; 2Department of Experimental Psychology, University College London, London, UK

For many individuals, speech listening usually feels effortless. However, whenever background noise is present, or when the person is hearing-impaired, perception of speech can be difficult and tiring (Ivarsson & Arlinger, 1994). It has become increasingly evident to applied hearing researchers that "listening effort" is an essential concept (McGarrigle et al., 2014) because it can explain behavior. For example, two hearing-impaired individuals may experience similar improvements in speech perception with an auditory prosthesis, but one may also experience increased listening effort and choose not to wear the listening device, whereas the other, who does not feel that he or she must "listen harder," wears it habitually. Although listening effort is an appealing and intuitive concept, it has not been well elaborated in the cognitive literature. In particular, it has not been differentiated from nonspecific factors such as arousal, vigilance, motivation, or selective attention. We suggest that listening effort can be productively considered as the interaction of two factors: the processing demands, or challenges, imposed by the utterance and the listening situation, and the cognitive resources that an individual brings to the situation and can deploy to compensate for the demand. Different types of processing load draw on different resources in different cognitive domains. Listening effort is manifest only when the resources available to an individual are barely adequate, or are inadequate, given the demand. For example, when the utterance and listening conditions are straightforward and demands are low, effort is necessarily low for everyone. As processing load increases, effort will begin to increase sooner for people with more limited resources in the cognitive domains relevant for that type of load than for people with greater resources. Eventually, effort asymptotes, at which point the cognitive resources are no longer adequate for the conditions, and perception and comprehension suffer (Figure 40.1).

We focus on the different kinds of processing demands that arise for all listeners in different listening situations and discuss the kinds of perceptual and cognitive processes that may be required to meet these challenges. Individual differences in the cognitive resources (such as memory, perceptual learning, processing speed, fluid intelligence, and control processes) that permit one person to cope more efficiently or more successfully than another with the challenges imposed by a listening situation (i.e., that allow them to cope with the processing demands) are outside the remit of this chapter.

The listening conditions of everyday life are highly variable. Sometimes speech is heard in quiet. More often, however, it is degraded or masked by other sounds. Such challenging situations increase processing demand (which we also call processing load) in three different ways. First, an increase in processing load can be driven by perceptual challenges. Processing load can increase, for example, when the stimulus is masked by interfering background noise or by speech from other talkers, or as a result of degradation of the stimulus caused by hearing loss. Second, processing load depends strongly on the linguistic properties of the stimulus. For example, the processing load required to understand a sentence increases with its syntactic (i.e., grammatical) complexity (Gibson & Pearlmutter, 1998; Stewart & Wingfield, 2009), when its meaning is difficult to compute due to the presence of homophones (Rodd, Davis, & Johnsrude, 2005; Rodd, Johnsrude, & Davis, 2012), or when the structure of the utterance places demands on memory (Lewis, Vasishth, & Van Dyke, 2006).


Finally, the overall processing load placed on the listener can increase because of the concurrent demands of another extrinsic task performed at the same time as speech is being processed. These dual-task situations are relatively common in everyday life: we often listen to speech while doing something else (e.g., driving; Wild, Yusuf, et al., 2012). We first review the ways in which processing load is measured, and then we consider each of these different types of processing load in turn.

The processing load associated with any given listening situation is difficult to measure directly: what can be measured is behavior, or performance. Speech perception (and, by inference, comprehension) is often measured using word-report tasks, in which listeners report all the words they can understand from an utterance. Performance on such tasks depends not only on signal quality but also on the cognitive resources the individual can use to compensate for declines in signal quality. In other words, performance on a word-report task is a function both of the signal quality and of the ability of the individual listener to overcome degradations in signal quality.

FIGURE 40.1 Schematic depiction of the interaction between processing demand and cognitive resources and how they manifest in behavioral performance, effort, and brain activity (as measured by, e.g., fMRI). The three different kinds of processing demand reviewed in this chapter will recruit somewhat different cognitive resources, but the general principle is that, for those with more limited cognitive resources, effort begins to be required at a lower level of processing demand and may grow more quickly as demand grows compared with those with greater cognitive resources. Performance is high when processing demand is low and declines as cognitive resources become insufficient to cope with the demand (i.e., as effort reaches an asymptotic maximum). Brain activity in areas sensitive to a particular type of processing demand (e.g., perceptual demand) will be low when that demand is low (e.g., speech is produced clearly, against a quiet background) because cognitive resources are not recruited to cope with the demand. As demand increases (as stimulus quality degrades), effort and brain activity increase. As demand exceeds available cognitive resources, effort and brain activity reach asymptotic levels.

A particular problem in using behavioral measures of speech perception (like word-report scores) to measure processing load is that these often suffer from ceiling effects. Perfect or near-perfect performance (i.e., near 100% intelligibility) may result either from a listening situation that involves a low processing load (comprehension of clearly audible speech in quiet), with which cognitive resources can cope easily, or from a higher-load situation with greater recruitment of cognitive resources (see Figure 40.1). Methods used to measure the processing load associated with different listening situations must therefore be sensitive not only to listeners' overall performance on some measure of speech perception/comprehension but also to the degree to which cognitive resources were taxed to achieve that level of performance. The degree to which cognitive resources are taxed has been, in the literature, operationally defined as listening effort. Effort is high if processing load is nonzero and if the available cognitive resources are not easily or amply sufficient to cope with that load (Figure 40.1). Performance can be measured behaviorally; effort can be measured using either behavioral dual-task methods or physiological (including neuroimaging) methods. The constructs that interact to give rise to such measures (i.e., processing load, which is the topic of this chapter, and cognitive resources, which are not) cannot be measured directly.

Broadly speaking, there have been two approaches to the measurement of listening effort: (i) dual-task methods and (ii) methods that measure changes in physiological responses. Dual-task methods typically measure performance on a second (usually unrelated) task that is being performed at the same time as the speech is being presented.
Reduced performance on this second task is assumed to reflect increased processing load of the first task (the one involving speech perception/comprehension), resulting in a commensurate increase in listening effort. For example, researchers have shown that performance on a secondary task can change as a function either of the perceptual properties of the speech (e.g., its sensory quality; Gosselin & Gagné, 2011; Pals, Sarampalis, & Baskent, 2012) or of the linguistic properties of the speech (e.g., the presence of a semantic ambiguity; Rodd, Johnsrude, & Davis, 2010). Although dual-task approaches have been extensively used, these behavioral measures are, at best, an indirect test of the effort required to process a speech stimulus. Furthermore, claims that dual-task effects are observed only when the two concurrent tasks overlap in the set of cognitive processes they recruit raise the possibility that such methods may be differentially sensitive to different forms of listening effort (Rodd, Johnsrude, et al., 2010).

Physiological responses have also been used to provide a more direct index of effort. For example, some researchers have suggested that pupillometry (i.e., measurement of changes in pupil diameter) can provide an index of cognitive and listening effort (Kahneman & Beatty, 1966; Piquado, Isaacowitz, & Wingfield, 2010; Sevilla, Maldonado, & Shalóm, 2014; Zekveld & Kramer, 2014; Zekveld, Kramer, & Festen, 2011). This approach is limited by difficulties in understanding exactly how and why pupil diameter changes in response to changes in the task being performed. It is still not clear whether this measure indexes effort per se or instead reflects correlated factors such as task engagement (Franklin, Broadway, Mrazek, Smallwood, & Schooler, 2013), aspects of perception/comprehension (Naber & Nakayama, 2013), or arousal (Bradshaw, 1967; Brown, van Steenbergen, Kedar, & Nieuwenhuis, 2014).

Another physiological measure of effort is brain activity. Researchers typically assume that an increase in brain activity in a region (measured, e.g., using electroencephalography [EEG], magnetoencephalography [MEG], or functional magnetic resonance imaging [fMRI]) indexes the extent to which a particular stimulus has recruited that region (i.e., the "effort" being expended within that region) (Figure 40.1). As with all measures of effort, processing load and cognitive resources interact to produce brain activity. Imaging measures have two key advantages over other approaches to measuring effort. First, they can be used for a wide range of listening situations and are potentially sensitive to changes in listening effort across all stages of speech processing, from early perceptual processes (i.e., analysis of sound in primary auditory cortex [PAC]) to higher-level aspects of semantic and syntactic processing involving anterior and posterior temporal, frontal, and parietal regions. Second, imaging measures can be used in conditions in which a participant is not exclusively attending to the critical speech stimulus and therefore can be used to measure changes in processing that arise as a function of attentional modulation. Behavioral reports of unattended speech may be unreliable, and being asked to report something outside the focus of attention may inadvertently reorient attention back to that stimulus. Imaging provides an excellent way to measure processing and how it varies with attention, because imaging data can be acquired both when a participant is attending to a stimulus and when he or she is distracted from it (Heinrich, Carlyon, Davis, & Johnsrude, 2011; Wild, Yusuf, et al., 2012). If one assumes that effort requires attention, then imaging can be used to compare relatively load-free stimulus-driven processing (Wild, Yusuf, et al., 2012) with the more effortful processing observed under full attention, when a processing load is present.

40.1 TYPES OF PROCESSING DEMAND

40.1.1 Perceptual Demands

Perceptual demands are imposed by a target signal that is not perfectly audible, such as hearing an announcement over loudspeakers in a busy train station. The audibility of speech can be compromised in many different ways: through degradation or distortion of the signal, and/or through the presence of concurrent interfering sounds. These different demands are probably met through different cognitive mechanisms. For example, a degraded signal requires that missing bits be "filled in," whereas following a target in the presence of multiple competing sounds requires sound segregation processes and selective attention. In this section, we review the different types of perceptual demand and the cognitive mechanisms that may be required to compensate, as summarized in Table 40.1.

TABLE 40.1 Different Types of Perceptual Demand and the Nature of the Cognitive Processes Required to Cope with Such Demands

Perceptual demand | Imposes a processing load on cognitive mechanisms required for
Target signal degradation; energetic masking of target | Perceptual closure, including linguistically guided perceptual closure; linguistically guided word-segmentation processes
Target signal distortion; informational masking of target due to reverberation or competing, acoustically similar, sounds | Sound source segregation; selective attention; voice identity processing
Hearing impairment | All of the above

Taking the public announcement example, the speech signal itself is likely to be both degraded and distorted because of the limitations of most public-transport loudspeaker systems. (An audio signal is degraded if frequency components have been reduced or removed; it is distorted if components not originally present have been somehow introduced.) Degradation places demands on processes that help to recover the obliterated signal, whereas distortion introduces segregation demands: the listener must be able to perceptually distinguish the components of the original signal from the artefactual components of the distortion to recognize the signal. A common form of distortion is reverberation: the surfaces of indoor spaces, in which so much communication happens, reflect sound, and the reflected sounds must be distinguished from the direct sound source for the sound to be understood.
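The degradation/distortion distinction can be made concrete in a few lines of signal processing. The sketch below is illustrative only (the test signal, filter settings, and clipping level are arbitrary choices, not from any cited study): low-pass filtering removes a frequency component (degradation), whereas hard clipping introduces a harmonic that was not originally present (distortion).

```python
# Degradation removes frequency components; distortion introduces new ones.
# Illustrative sketch with arbitrary parameters (not from any cited study).
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000                       # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
# Test signal: components at 440 Hz and 2500 Hz only.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)

# Degradation: low-pass filtering at 1 kHz removes the 2500-Hz component.
sos = butter(4, 1000, btype="low", fs=fs, output="sos")
degraded = sosfilt(sos, signal)

# Distortion: hard clipping adds harmonics (e.g., 3 x 440 = 1320 Hz)
# that were not present in the original signal.
distorted = np.clip(signal, -0.4, 0.4)

def power_near(x, hz, bw=20):
    """Fraction of spectral power within +/- bw Hz of a target frequency."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return spec[(freqs > hz - bw) & (freqs < hz + bw)].sum() / spec.sum()

for name, x in [("original", signal), ("degraded", degraded),
                ("distorted", distorted)]:
    print(f"{name:9s} power at 2500 Hz: {power_near(x, 2500):.3f}   "
          f"power at 1320 Hz (new component): {power_near(x, 1320):.4f}")
```

Running this shows the degraded version losing the 2500-Hz component, and the distorted version gaining energy at 1320 Hz, a component absent from the original.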

The presence of concurrent sounds (including the competing sounds, such as concurrent speech, in that busy train station) requires listeners to recruit processes that serve to segregate the target signal from the background, to selectively attend to the target, and to recover the obliterated signal (perceptual closure). Masking that physically interrupts or occludes a target signal, so that it is effectively not transduced at the periphery, is called energetic masking. When both target and masker are audible but are difficult to separate perceptually, informational masking occurs (Kidd, Mason, Richards, Gallun, & Durlach, 2007; Leek, Brown, & Dorman, 1991; Pollack, 1975). Compensation for energetic masking (which is typically transient in the dynamic conditions of real-world communicative environments) requires that the occluded signal be "filled in" in some way. Accumulating evidence suggests that such perceptual closure comprises at least two, if not more, physiologically separable phenomena. First, if discrete frequencies are occluded for a brief period, then the brain can complete the partially masked sound so that it is heard as a complete whole (Carlyon, Deeks, Norris, & Butterfield, 2002; Ciocca & Bregman, 1987; Heinrich, Carlyon, Davis, & Johnsrude, 2008). For example, if a steady-state tone is interrupted by a gap filled with noise, the tone is "heard" to continue through the gap. Electrophysiological and MRI studies suggest that, in such cases, the illusion is complete at a relatively early stage of processing.
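A stimulus of the kind just described is easy to construct: a steady tone, a silent gap, and the same gap filled with noise. The sketch below uses assumed parameter values (not those of any cited experiment); with the noise filler, listeners typically hear the tone continue through the gap.

```python
# Continuity-illusion stimulus: a steady tone interrupted by a gap that is
# either silent or filled with noise. Parameter values are illustrative only.
import numpy as np

fs = 16000                                  # sampling rate (Hz)
tone_hz, dur, gap = 1000, 1.0, (0.4, 0.6)   # tone freq, duration s, gap s

t = np.arange(0, dur, 1 / fs)
tone = 0.5 * np.sin(2 * np.pi * tone_hz * t)
gap_mask = (t >= gap[0]) & (t < gap[1])

# Version 1: tone with a silent gap; heard as two separate tone bursts.
interrupted = tone.copy()
interrupted[gap_mask] = 0.0

# Version 2: the same gap filled with louder broadband noise; the tone is
# typically "heard" to continue through the noise (the continuity illusion).
rng = np.random.default_rng(0)
filled = interrupted.copy()
filled[gap_mask] = 0.8 * rng.standard_normal(gap_mask.sum())

print(f"stimulus: {len(filled)} samples; gap: {gap_mask.sum()} samples")
```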
Petkov and colleagues (Petkov, O'Connor, & Sutter, 2007) measured the responses of macaque A1 neurons to steady tones, to noise bursts, and to interrupted tones in which the silent gap could optionally be filled by a noise. In this latter case, they found that the response of approximately half the neurons to the noise was more similar to that elicited by an uninterrupted tone than to that produced by an isolated noise burst. This suggests that the continuity illusion, at least for tones, is complete by the level of auditory cortex. Physiological investigations involving humans also suggest that continuity for tones is complete at relatively early stages of auditory processing and happens obligatorily in the absence of full attention (Heinrich et al., 2011; Micheyl et al., 2003).

The second type of perceptual closure also compensates for transient energetic masking, but works to fill in missing speech sounds, or even missing words, on the basis of linguistic knowledge (Bashford, Riener, & Warren, 1992; Saija, Akyürek, Andringa, & Başkent, 2014; Shahin, Bishop, & Miller, 2009; Warren, 1970). For example, sounds that are distinguished by one phonetic feature, such as /k/ and /g/, are acoustically similar and, if degraded, perhaps acoustically identical. Uncertainty about the identity of a phoneme (/k/ versus /g/) can be resolved through knowledge of words: it will be heard as /k/ in the context "iss" and as /g/ in the context "ift," because "kiss" and "gift" are words and "giss" and "kift" are not (Ganong, 1980). Such perceptual ambiguity can also increase processing load on higher-level linguistic processes, because there are more candidate linguistic representations ("bark" versus "park," for example). This load can be met through effective use of syntactic and semantic context; many studies demonstrate that speech materials embedded in a meaningful context are easier to report than those without a constraining context (Davis, Ford, Kherif, & Johnsrude, 2011a; Dubno, Ahlstrom, & Horwitz, 2000; Kalikow, Stevens, & Elliott, 1977; Miller, Heise, & Lichten, 1951; Nittrouer & Boothroyd, 1990). Degradation would also stress processes involved in segmenting adjacent words, because degradation means that word boundaries are less clearly demarcated. Thus, linguistically guided perceptual closure phenomena draw on prior knowledge at multiple levels: such linguistic priors would include knowledge of the words and morphology of a language and how words can be combined. These phenomena would also draw on word frequency effects, contextually contingent prior probabilities, and knowledge of the world.

Evidence from studies in which speech has been perceptually degraded (e.g., using noise vocoding; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995) suggests that processes involving the use of prior knowledge to enhance speech comprehension (i.e., linguistically guided perceptual closure phenomena) rely on regions of the left inferior frontal gyrus (LIFG) and/or regions along the superior and middle temporal gyri (Davis, Ford, Kherif, & Johnsrude, 2011b; Davis & Johnsrude, 2003; Hervais-Adelman, Carlyon, Johnsrude, & Davis, 2012; Obleser & Kotz, 2010; Scott, Blank, Rosen, & Wise, 2000; Wild, Yusuf, et al., 2012), and possibly also on the left angular gyrus (Golestani, Hervais-Adelman, Obleser, & Scott, 2013; Wild, Davis, & Johnsrude, 2012). Even though stimuli in these studies are carefully controlled, and energetic masking introduces uncertainty at the acoustic level, the widespread patterns of activity in response are probably due to the fact that uncertainty propagates across linguistic levels: lexical, morphological, syntactic, and semantic.
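Noise vocoding in the style of Shannon et al. (1995) divides a signal into frequency bands, extracts each band's amplitude envelope, and uses those envelopes to modulate band-limited noise, preserving temporal cues while discarding spectral detail. The sketch below is a simplified illustration; the band edges, filter order, and envelope-smoothing cutoff are assumptions for this example, not values from the original paper.

```python
# Simplified noise vocoder (after the logic of Shannon et al., 1995):
# band-pass -> amplitude envelope -> envelope-modulated noise, summed.
# Band edges, filter order, and smoothing cutoff are illustrative choices.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(signal, fs, band_edges):
    rng = np.random.default_rng(0)
    out = np.zeros_like(signal, dtype=float)
    lp = butter(4, 30, btype="low", fs=fs, output="sos")  # envelope smoother
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band_sig = sosfilt(band, signal)
        envelope = sosfilt(lp, np.abs(hilbert(band_sig)))  # slow envelope
        carrier = sosfilt(band, rng.standard_normal(len(signal)))
        out += np.clip(envelope, 0, None) * carrier       # modulated noise
    return out

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 1.0, 1 / fs)
    # A crude speech-like test signal: a tone with a 4-Hz amplitude swing.
    speech_like = np.sin(2 * np.pi * 150 * t) * (1 + np.sin(2 * np.pi * 4 * t))
    edges = np.geomspace(100, 6000, 7)        # six log-spaced bands
    vocoded = noise_vocode(speech_like, fs, edges)
    print(f"vocoded {len(vocoded)} samples into {len(edges) - 1} bands")
```

Fewer bands yield a more degraded signal; six-band versions of sentences are typically intelligible with some effort, which is what makes this manipulation useful for studying processing demand.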

Informational masking is defined as any decline in perception or comprehension of a target stimulus (therefore "masking") that cannot be attributed to energetic masking (Cooke, Garcia Lecumberri, & Barker, 2008; Kidd et al., 2007; Leek et al., 1991; Pollack, 1975). This includes everything from difficulty with segregation (i.e., due to perceptual similarity between target and masker) to attentional bias effects (due to the salience or "ignorability" of a target, or of a masker, owing to familiarity or meaningfulness). If knowledge-based factors can produce a "release from masking" (i.e., better performance on a listening task), then the masking was, by definition, informational. However, informational masking should be distinguished from general inattentiveness (Durlach et al., 2003). There may be good reason to consider processing demands resulting from at least some types of informational masking as very different from those resulting from the perceptual demands of energetic masking (Cooke et al., 2008; Mattys, Barden, & Samuel, 2014; Mattys, Brooks, & Cooke, 2009; Mattys & Wiget, 2011). We discuss this idea in the section on Concurrent Task Demands.

Importantly, familiarity and expertise can reduce informational masking. In a recent study, Johnsrude et al. (2013) observed that listeners obtained a substantial release from (informational) masking when they heard their spouse's voice in a two-voice mixture, either when it was the target or when it was the masker, suggesting that familiar-voice information can facilitate segregation. Furthermore, although the ability to report novel-voice targets declined with age, there was no such decline for spouse-voice targets, even though performance on spouse-voice targets was not at ceiling. The decline with age for novel targets could not be due to energetic masking (because this would apply equally to spouse-voice targets, which were acoustically perfectly matched, across participants, to the novel-voice targets) and thus must result from increased informational masking with age, which is reduced (or possibly eliminated) by familiar-voice information.

Perceptual demands imposed by degradation, distortion, and concurrent sounds are exacerbated by hearing impairment, including age-related hearing loss (presbycusis). Hearing impairment commonly involves both diminished sensitivity to sound (degrading the input signal) and diminished frequency acuity. Poorer frequency acuity means that acoustically similar sounds are less discriminable: the signal is degraded and harder to segregate from other sounds. Thus, hearing loss increases perceptual load and, like the other forms of perceptual load, places greater demands on processes involved in perceptual closure, in selective attention, and in segregation (Gatehouse & Noble, 2004).

40.1.2 Linguistic Demands

Even in the case where all words are fully intelligible, the effort required to understand language can vary greatly. For example, cognitive models of sentence comprehension describe in detail the additional processes that are required to comprehend a sentence that has a complex grammatical structure or that contains words with more than one possible meaning (Duffy, Kambe, & Rayner, 2001; Duffy, Morris, & Rayner, 1988; Frazier, 1987; MacDonald, Pearlmutter, & Seidenberg, 1994). These cognitive models have been developed largely on the basis of reading experiments, in which a relatively direct measure of the amount of processing required by different types of sentences can be obtained by measuring the time taken to read them. In contrast, it is more difficult to measure the processing demand imposed by particular types of spoken sentences, because the rate at which words arrive is typically controlled by the speaker (and not the listener).

One form of linguistic challenge that has been relatively well-studied using spoken as well as written language is lexical ambiguity (also known as semantic ambiguity), which arises when words with multiple meanings are included within a sentence. For example, to understand the phrase "the bark of the dog," a listener must use the meaning of the word "dog" to work out that "bark" is probably referring to the noise made by that animal and not to the outer covering of a tree. These forms of ambiguity are ubiquitous in language; at least 80% of the common words in a typical English dictionary have more than one definition (Rodd, Gaskell, & Marslen-Wilson, 2002), and most dictionaries list more than 40 different definitions for the word "run" (e.g., "an athlete runs a race," "a river runs to the sea," "a politician runs for office"). Each time one of these ambiguous words is encountered, the listener must select the appropriate meaning on the basis of its sentential context.

A dual-task experiment (Figure 40.2A) has revealed longer reaction times on an unrelated visual case-judgement task when participants are listening to high-ambiguity, compared with low-ambiguity, sentences (Rodd, Johnsrude, et al., 2010), indicating that high-ambiguity sentences are more effortful to understand. High-ambiguity sentences contained at least two ambiguous words (e.g., "There were dates and pears [homophonous with pairs] in the fruit bowl") and were compared with well-matched low-ambiguity control sentences (e.g., "There was beer and cider on the kitchen shelf"; Rodd, Johnsrude, et al., 2010). A concurrent case-judgement task required participants to indicate, as quickly as possible with a keypress, whether a cue letter on the screen was presented in UPPER case or lower case.
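The logic of this dual-task measure is that any slowing of case-judgement responses during high-ambiguity sentences indexes the extra load of resolving the ambiguities. The sketch below uses simulated reaction times (the numbers are invented, not data from Rodd, Johnsrude, et al., 2010) to show the standard within-participant comparison of the two conditions:

```python
# Within-participant comparison of case-judgement RTs for high- versus
# low-ambiguity sentences. All numbers are simulated, not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants = 20

# Simulated mean RTs (ms): a small slowing in the high-ambiguity condition.
rt_low = rng.normal(520, 30, n_participants)
rt_high = rt_low + rng.normal(20, 15, n_participants)

t, p = stats.ttest_rel(rt_high, rt_low)
cost = (rt_high - rt_low).mean()
print(f"mean dual-task cost: {cost:.1f} ms, "
      f"t({n_participants - 1}) = {t:.2f}, p = {p:.4f}")
```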


Interestingly, this ambiguity effect varies as a function of the sentence structure. In cases when the ambiguous word is preceded by disambiguating sentence context (e.g., "The hunter thought that the HARE/HAIR ..."), there is no observed cost on the case-judgement task, suggesting that this process of selecting a contextually relevant word meaning on the basis of the information that is currently available is relatively undemanding. In contrast, when the disambiguating context is positioned after the ambiguous word (e.g., "The ecologist thought that the PLANT by the river should be closed down"), there is a considerable increase in the time taken to make the case judgements (Rodd, Johnsrude, et al., 2010). The most likely explanation is that, for these sentences, listeners frequently misinterpret the ambiguous word by selecting the more frequent, but contextually inappropriate, meaning, and that the dual-task effect arises because of the increased processing load associated with the reinterpretation process that is triggered by the final word in the sentence.

[Figure 40.2 appears here: (A) a line plot of case-judgment reaction time (ms, approximately 500 to 580) for upper- and lower-case letter cues, with separate lines for high- and low-ambiguity sentences; (B) left and right brain renderings showing the contrasts "Speech > nonspeech (noise)" and "High ambiguity > low ambiguity sentences".]

FIGURE 40.2 (A) Reaction times (in ms) for a case-judgment task presented while participants listened to high- and low-ambiguity sentences, as a function of the case of the presented letter (UPPER case or lower case). Participants were asked to respond as quickly as possible when the letter cue was presented on the screen at the offset of the second homophone (or the homologous word in the matched low-ambiguity sentence). At the end of each sentence, participants were also required to make a semantic relatedness judgment to a word presented on the screen. Case-judgment reaction times were significantly longer for high- compared with low-ambiguity sentences, suggesting that some cognitive process involved in resolving the meaning of the homophones in the high-ambiguity sentences is also involved in case-judgment. (B) Brain regions that are significantly more active for high-ambiguity sentences compared with low-ambiguity sentences in a group of normal young-adult right-handed participants are shown in red/yellow. These regions include the inferior frontal gyri bilaterally and the left inferior temporal gyrus. Brain regions that are significantly more active for all speech compared with unintelligible signal-correlated noise are shown in blue/yellow. These regions include the superior and middle temporal gyri bilaterally. Data from Rodd et al. (2005), with permission from the publisher.

Several neuroimaging studies using a range of different spoken-sentence materials confirm that comprehension of sentences that contain ambiguous words is more demanding than comprehension of sentences that do not. Rodd et al. (2005) compared high-ambiguity sentences with well-matched low-ambiguity sentences and reported a large cluster of LIFG activation with its peak within the pars opercularis, as well as a cluster in the posterior portion of the left inferior temporal gyrus (LITG) (Figure 40.2B). This is broadly consistent with the results of subsequent studies, which consistently find ambiguity-related activation in both posterior and anterior subdivisions of the LIFG (Bekinschtein, Davis, Rodd, & Owen, 2011; Davis et al., 2007; Rodd et al., 2012; Rodd, Longe, Randall, & Tyler, 2010). The precise location of the LITG activation has been more variable, with some studies finding no significant effect (Rodd, Longe, et al., 2010). Again, as with the dual-task studies, the ambiguity-related increases in BOLD signal have been shown to be largest when listeners must reinterpret a sentence that was initially misunderstood (Rodd et al., 2012), suggesting that this aspect of semantic processing is particularly demanding.

These two key brain regions (LIFG and posterior temporal lobe) have also been activated by a number of other linguistic manipulations. Rodd, Longe, et al. (2010) have shown that syntactic ambiguities (e.g., "VISITING RELATIVES is/are boring") produce LIFG activation (compared with unambiguous sentences) in the same region as semantic ambiguities (within the same participants), as well as in posterior left middle temporal gyrus (MTG). Tyler et al. (2011) found similar effects of syntactic ambiguity that extended across both anterior and posterior LIFG, as well as smaller activations in homologous right frontal regions, posterior MTG, and parietal regions. Finally, similar frontal and posterior temporal regions have been shown to produce elevated activation for sentences that have a grammatical structure that is more difficult to process (Mack, Meltzer-Asscher, Barbieri, & Thompson, 2013; Meltzer, McArdle, Schafer, & Braun, 2010; Peelle, McMillan, Moore, Grossman, & Wingfield, 2004; Shetreet & Friedmann, 2014), although there is variability in the precise location of the posterior temporal lobe activations. In addition, studies have also found that increased demands on syntactic processes are associated with activation in the anterior temporal lobe (Brennan et al., 2012; Obleser et al., 2011) and in temporo-parietal regions (Meyer et al., 2012).

Another linguistic challenge to comprehension is a relative lack of contextual constraint. For sentences that lack a strong determinative meaningful context (such as "Her good slope was done in carrot"), a region of LIFG appears more active when compared with sentences that have context (such as "Her new skirt was made of denim") (Davis et al., 2011b). This increased activity is consistent with the idea that a lack of supportive context increases processing demands, possibly because weaker semantic constraint renders the speech less predictable.

Although the precise cognitive roles of the different brain regions associated with higher-level aspects of speech comprehension remain unclear (Rodd et al., 2012), the results are consistent with the view that the LIFG contributes to the integrative processes that are required to combine the meanings and syntactic properties of individual words into higher-level linguistic structures (Hagoort, 2005; Rodd et al., 2005; Willems & Hagoort, 2009). The structures of the LIFG may be required for cognitive control processes that select and maintain task-relevant information while inhibiting information that is not currently required (Novick, Kan, Trueswell, & Thompson-Schill, 2009; Thompson-Schill, D'Esposito, Aguirre, & Farah, 1997). Activation of posterior (and not anterior) aspects of the temporal lobe provides clear support for the view that these regions form part of a processing stream that is critical for accessing the meaning of spoken words (Hickok & Poeppel, 2007), although other authors have suggested that this region, like the LIFG, is associated with control processes that are necessary to select and maintain aspects of meaning that are currently relevant, rather than with the stored representation of semantic/syntactic information (Noonan, Jefferies, Visser, & Lambon Ralph, 2013).

In summary, higher-level aspects of speech comprehension, such as selecting contextually appropriate word meanings and assigning appropriate syntactic roles, are largely associated with frontal and posterior temporal brain regions. Interestingly, these overlap with areas that are recruited to cope with the "lower level" perceptual demands reviewed earlier, suggesting that similar cognitive mechanisms are required to cope with different kinds of processing demand. Overlapping domain-general processing resources are recruited in the face of both perceptual and cognitive demands, raising the possibility that competition for these resources may result in a "double whammy" effect: when both forms of processing resources are required, the listener is disproportionately disadvantaged. For example, it is possible that when a listener needs to deploy additional resources to identify the sounds that are present in the speech stream, the resources available for dealing with any semantic/syntactic demands may be reduced (and vice versa).

Because high-level cognitive processes operate on the output from lower-level perceptual processes, conditions in which the perceptual input is degraded may produce perceptual uncertainty that cascades to higher-level cognitive processes. For example, an "easy" sentence that, when presented with high perceptual clarity, would produce relatively low demand on semantic/syntactic processes might become more demanding at these higher cognitive levels when the input is degraded, such that the listener is unsure (based on perceptual input alone) of the words that were actually present. As reviewed previously, demands on semantic/syntactic processes almost certainly increase as the perceptual clarity of the input declines, even when semantic/syntactic properties are held constant.

In the face of perceptual degradation, higher-level cognitive resources make a key contribution to resolving perceptual ambiguity: information about the meaning, syntactic structure, or identity of a sentence (or a word) plays a key role in providing contextual cues that can be used to resolve perceptual ambiguities (Davis et al., 2011b; Sohoglu, Peelle, Carlyon, & Davis, 2012; Wild, Davis, et al., 2012). For example, Wild, Davis, et al. (2012) presented acoustically degraded speech that was preceded (200 ms earlier) by text primes that either matched each upcoming word or were incongruent. Preceding text primes enhanced the rated subjective clarity of the heard speech, and activity within PAC was sensitive to this manipulation. A region of interest defined using cytoarchitectonic maps of PAC (Tahmasebi et al., 2009) revealed that activity was significantly higher for degraded speech preceded by matching compared with mismatching primes, and that this difference was significantly greater than the difference observed for matching and mismatching primes preceding either clear speech or unintelligible control sounds. Connectivity analyses suggested that the sources of input to PAC are higher-order temporal, frontal, and motor regions, consistent with the extensive network of feedback connections in the auditory system (De la Mothe, Blumell, Kajikawa, & Hackett, 2006; Winer, 2006) that, in humans, may permit higher-level interpretations of speech to be tested against incoming auditory representations.
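The logic of such a region-of-interest (ROI) analysis can be sketched in a few lines: average the contrast estimates across PAC voxels for each participant and condition, then test whether the matching-versus-mismatching prime difference is larger for degraded speech than for clear speech (a 2 x 2 interaction). The code below uses simulated numbers only; it illustrates the analysis logic, not the actual pipeline of Wild, Davis, et al. (2012).

```python
# ROI interaction test: is the (matching - mismatching prime) difference
# larger for degraded than for clear speech? Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 18  # participants

# Simulated mean PAC contrast values per participant and condition.
clear_match = rng.normal(1.0, 0.3, n)
clear_mismatch = rng.normal(1.0, 0.3, n)
degr_match = rng.normal(0.9, 0.3, n)
degr_mismatch = rng.normal(0.6, 0.3, n)  # lower: prime benefit when degraded

# Interaction contrast per participant: prime effect for degraded speech
# minus prime effect for clear speech. A one-sample t-test against zero
# tests the 2x2 interaction in this repeated-measures design.
interaction = (degr_match - degr_mismatch) - (clear_match - clear_mismatch)
t, p = stats.ttest_1samp(interaction, 0.0)
print(f"interaction: mean = {interaction.mean():.2f}, "
      f"t({n - 1}) = {t:.2f}, p = {p:.4f}")
```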

40.1.3 Concurrent Task Demands

A final type of processing load is that imposed by extrinsic tasks that are being performed while speech is heard. Although much research on speech perception and comprehension is conducted under ideal conditions (undivided attention, in quiet environments, and with carefully recorded material), real-world speech perception often involves hearing speech while in the middle of other tasks that divert attention or impose perceptual or cognitive loads (Mattys et al., 2009). Such concurrent processing loads, if they render target speech less intelligible, qualify as "informational masking" according to the definition offered in the previous section on Perceptual Demands. To what extent is speech that is heard under such conditions processed for information, and how is processing affected by the nature of the secondary task and by the acoustic quality of the speech signal?

The degree to which speech is processed when attention is elsewhere is a difficult question to address behaviorally, because it is hard to measure the perception of a stimulus to which a participant is not attending. Wild, Yusuf, et al. (2012) used fMRI to compare processing of degraded (noise-vocoded) and clear speech under full attention and when distracted by other tasks. In every trial, young listeners with normal hearing were directed to attend to one of three simultaneously presented stimuli: a sentence (at one of four acoustic clarity levels), an auditory distracter, or a visual distracter. When attending to the distracters, participants performed a target-monitoring task. A postscan recognition test showed that clear speech was processed even when not attended, but that attention greatly enhanced the processing of degraded speech. Speech-sensitive cortex could be fractionated according to how speech-evoked responses were modulated by attention, and these divisions appeared to map onto the hierarchical organization of the auditory system (Davis & Johnsrude, 2003; Hickok & Poeppel, 2007; Okada et al., 2010; Peelle, Johnsrude, & Davis, 2010). Critically, only in the region of the superior temporal sulcus (STS) and in frontal regions (regions corresponding to the highest stages of auditory processing) was the effect of speech clarity on BOLD activity influenced by attentional state (Figure 40.3). Frontal and temporal regions manifested this interaction in different ways.

FIGURE 40.3 The interaction between Attention (three levels: attention to speech; attention to an auditory distracter; or attention to a visual distracter) and Speech Clarity (four levels: clear speech; 6-band noise-vocoded speech [NVhi]; compressed 6-band noise-vocoded speech [NVlo]; and spectrally rotated noise-vocoded speech [rNV]). Materials were meaningful English sentences such as "His handwriting was very difficult to read." Whereas both NVhi and NVlo are potentially intelligible (word-report scores for the materials in a pilot group of participants were 88% and 68%, respectively), rNV is completely unintelligible. (A) The Speech Clarity by Attentional State interaction F-contrast is shown in blue, thresholded at p < 0.05 uncorrected for multiple comparisons, with voxels surviving family-wise correction for multiple comparisons also indicated. Voxels in red are sensitive to the different levels of Speech Clarity (i.e., demonstrate a significant main effect of speech type) but show no evidence of an interaction with Attention (p < 0.05, uncorrected). (B) Contrast values (i.e., estimated signal relative to baseline; arbitrary units) are plotted for LIFG and (C) for an average of the STS peaks. Error bars represent the SEM suitable for repeated-measures data (Loftus & Masson, 1994). Vertical lines with asterisks indicate significant pairwise comparisons (p < 0.05, Bonferroni corrected for 12 comparisons). Figure taken from Wild, Yusuf, et al. (2012), with permission from the publisher.
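Error bars "suitable for repeated-measures data" remove between-participant variability before computing the SEM. The sketch below (with invented data) shows one common normalization in the spirit of Loftus and Masson (1994): subtract each participant's mean across conditions, then add back the grand mean before computing the per-condition SEM.

```python
# Within-subject (repeated-measures) error bars: remove each participant's
# overall level before computing the SEM. Data are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n_subj, n_cond = 16, 4
# rows = participants, columns = conditions (e.g., four clarity levels);
# large between-subject offsets plus small condition-level noise.
data = rng.normal(2.0, 1.0, (n_subj, 1)) + rng.normal(0, 0.2, (n_subj, n_cond))

subj_means = data.mean(axis=1, keepdims=True)
grand_mean = data.mean()
normalized = data - subj_means + grand_mean   # strip between-subject spread

sem_raw = data.std(axis=0, ddof=1) / np.sqrt(n_subj)
sem_within = normalized.std(axis=0, ddof=1) / np.sqrt(n_subj)
print("raw SEM per condition:     ", np.round(sem_raw, 3))
print("within-subject SEM:        ", np.round(sem_within, 3))
```

Because the between-participant offsets dominate the raw SEM, the within-subject SEM is much smaller and better reflects the reliability of condition differences in a repeated-measures design.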

In the LIFG, when speech was attended, activity was greater for degraded than for clear speech. However, when the listeners attended elsewhere, LIFG activity did not depend on speech type (Figure 40.3B). Increased activity for attended degraded speech may reflect effortful processing that is required to enhance intelligibility, or additional cognitive processes (such as those required for perceptual learning) that are engaged uniquely when a listener is attending to speech (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Eisner, McGettigan, Faulkner, Rosen, & Scott, 2010; Huyck & Johnsrude, 2012).

The pattern of activity around the STS was somewhat different: when speech was attended, degraded and clear speech produced similar amounts of activation, consistent with attention enhancing the intelligibility of degraded speech. When distracted, activity in the STS appeared to be driven by the speech quality of the stimulus (Figure 40.3C). Together, these results suggest that the LIFG only responds to degraded speech when listeners are attending to it, whereas the STS responds to speech intelligibility, regardless of attention or of how that intelligibility is achieved.

A series of studies by Mattys and colleagues (Mattys, Barden, & Samuel, 2014; Mattys et al., 2009; Mattys & Wiget, 2011) explored how additional extrinsic cognitive load alters the processing of simultaneously presented spoken words. They observed that the mechanisms involved in processing speech under conditions of divided attention are not the same as those involved in processing speech under conditions of undivided attention. When listeners were required to perform a concurrent task that involved divided attention (Mattys et al., 2009), a memory load (Mattys et al., 2009), or a visual search task (Mattys & Wiget, 2011), they relied more on lexical semantic structure to aid segmentation of words, and on lexical knowledge to identify phonemes, rather than on acoustic cues conveyed in fine phonetic detail. In fact, this "lexical drift" (an increased reliance on lexical knowledge when distracted) appears to be driven by reduced acuity for acoustic cues to phoneme identity and word boundaries (Mattys et al., 2014). This may at first seem surprising: as load on central cognitive resources increases, listeners rely more, not less, on knowledge-guided factors (which presumably rely on the same central cognitive resources) for speech perception. Mattys and colleagues (2014) attribute this to a re-weighting of perceptual cues, such that acoustic cues to the identity of words and phonemes are weighted less heavily than linguistic cues. Alternatively, a concurrent task may alter the rate at which acoustic information is sampled, causing some samples to be missed while attention is elsewhere. The results of these experiments, indicating that concurrent tasks do not simply impair perception but rather alter it qualitatively by affecting perceptual decision criteria, highlight the importance of considering how different real-world situations might affect the effort required to understand speech.
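The proposed re-weighting can be expressed as a simple weighted cue combination. In the toy model below (entirely illustrative; the weights and cue values are invented, and this is not a model from Mattys and colleagues), the perceived identity of a degraded word depends on an acoustic cue and a lexical-knowledge cue, and cognitive load shifts weight from the acoustic cue to the lexical one:

```python
# Toy cue re-weighting model: under load, the lexical cue gets more weight
# than the acoustic cue. All weights and cue values are invented.

def word_evidence(acoustic_cue: float, lexical_cue: float,
                  acoustic_weight: float) -> float:
    """Combine two cues (each in [0, 1], toward hearing 'gift' over 'kift')
    into a single probability via a weighted average."""
    w = acoustic_weight
    return w * acoustic_cue + (1 - w) * lexical_cue

acoustic_cue = 0.40   # ambiguous acoustics: weak evidence for /g/
lexical_cue = 0.95    # "gift" is a word, "kift" is not

for condition, w_acoustic in [("no load", 0.7), ("concurrent load", 0.3)]:
    p = word_evidence(acoustic_cue, lexical_cue, w_acoustic)
    print(f"{condition:15s} p(hear 'gift') = {p:.2f}")
```

Under load, the lexical cue dominates and the listener drifts toward the word-consistent percept even when the acoustics are equivocal, which is the qualitative pattern the behavioral studies describe.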
40.2 SUMMARY

"Listening effort" is a unitary term that probably reflects several different challenges to speech comprehension, met through a variety of cognitive mechanisms. Our position is that effort is due to an interaction between the perceptual, linguistic, or task challenges imposed by the situation and stimulus, and the cognitive resources that the listener brings to bear. This framework allows us to separate the different cognitive mechanisms that contribute to listening effort. Behavioral measures such as effort ratings, dual-task reaction time, and pupil dilation may not be able to differentiate among different kinds of processing demand, because each measure is one-dimensional. In contrast, neuroimaging methods such as fMRI permit exploration of different kinds of processing demand, because different demands may recruit different regions and networks.

Although this chapter has considered these demands separately, they are not independent. An increase in processing demand at a lower level of representation (e.g., perceptual) can affect processing at a higher level of representation (e.g., semantic/syntactic) in at least three ways: (i) by reducing the domain-general processing resources that are available; (ii) by increasing the ambiguity at later levels of representation; and (iii) by increasing the listener's reliance on these representations to resolve lower-level ambiguities. Increases in processing demands at a higher level of representation may also have consequences for lower-level processing, through the extensive network of feedback connections documented in the auditory system. These complex interactions mean that processing demands cannot be fully understood by studying them in isolation.

References

Bashford, J. A., Riener, K. R., & Warren, R. M. (1992). Increasing the intelligibility of speech through multiple phonemic restorations. Perception and Psychophysics, 51(3), 211–217.
Bekinschtein, T. A., Davis, M. H., Rodd, J. M., & Owen, A. M. (2011). Why clowns taste funny: The relationship between humor and semantic ambiguity. The Journal of Neuroscience, 31(26), 9665–9671.
Bradshaw, J. (1967). Pupil size as a measure of arousal during information processing. Nature, 216(5114), 515–516.
Brennan, J., Nir, Y., Hasson, U., Malach, R., Heeger, D. J., & Pylkkänen, L. (2012). Syntactic structure building in the anterior temporal lobe during natural story listening. Brain and Language, 120(2), 163–173.


Brown, S. B., van Steenbergen, H., Kedar, T., & Nieuwenhuis, S. (2014). Effects of arousal on cognitive control: Empirical tests of the conflict-modulated Hebbian-learning hypothesis. Frontiers in Human Neuroscience, 8, 23.
Carlyon, R. P., Deeks, J., Norris, D., & Butterfield, S. (2002). The continuity illusion and vowel identification. Acta Acustica United with Acustica, 88, 408–415.
Ciocca, V., & Bregman, A. S. (1987). Perceived continuity of gliding and steady-state tones through interrupting noise. Perception and Psychophysics, 42(5), 476–484.
Cooke, M., Garcia Lecumberri, M. L., & Barker, J. (2008). The foreign language cocktail party problem: Energetic and informational masking effects in non-native speech perception. The Journal of the Acoustical Society of America, 123(1), 414–427.
Davis, M. H., Coleman, M. R., Absalom, A. R., Rodd, J. M., Johnsrude, I. S., Matta, B. F., et al. (2007). Dissociating speech perception and comprehension at reduced levels of awareness. Proceedings of the National Academy of Sciences of the United States of America, 104(41), 16032–16037.
Davis, M. H., Ford, M. A., Kherif, F., & Johnsrude, I. S. (2011a). Does semantic context benefit speech understanding through "top-down" processes? Evidence from time-resolved sparse fMRI. Journal of Cognitive Neuroscience, 23(12), 3914–3932.
Davis, M. H., Ford, M. A., Kherif, F., & Johnsrude, I. S. (2011b). Does semantic context benefit speech understanding through "top-down" processes? Evidence from time-resolved sparse fMRI. Journal of Cognitive Neuroscience, 23(12), 3914–3932.
Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8), 3423–3431.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005). Lexical information drives perceptual learning of distorted speech: Evidence from the comprehension of noise-vocoded sentences. Journal of Experimental Psychology: General, 134(2), 222–241.
De la Mothe, L. A., Blumell, S., Kajikawa, Y., & Hackett, T. A. (2006). Cortical connections of the auditory cortex in marmoset monkeys: Core and medial belt regions. The Journal of Comparative Neurology, 496(1), 27–71.
Dubno, J. R., Ahlstrom, J. B., & Horwitz, A. R. (2000). Use of context by young and aged adults with normal hearing. The Journal of the Acoustical Society of America, 107(1), 538–546.
Duffy, S. A., Kambe, G., & Rayner, K. (2001). The effect of prior disambiguating context on the comprehension of ambiguous words: Evidence from eye movements. In D. S. Gorfein (Ed.), On the consequences of meaning selection: Perspectives on resolving lexical ambiguity. Washington, DC: American Psychological Association.
Duffy, S. A., Morris, R. K., & Rayner, K. (1988). Lexical ambiguity and fixation times in reading. Journal of Memory and Language, 27(4), 429–446.
Durlach, N. I., Mason, C. R., Kidd, G., Arbogast, T. L., Colburn, H. S., & Shinn-Cunningham, B. G. (2003). Note on informational masking. The Journal of the Acoustical Society of America, 113(6), 2984–2987.
Eisner, F., McGettigan, C., Faulkner, A., Rosen, S., & Scott, S. K. (2010). Inferior frontal gyrus activation predicts individual differences in perceptual learning of cochlear-implant simulations. Journal of Neuroscience, 30(21), 7179–7186.
Franklin, M. S., Broadway, J. M., Mrazek, M. D., Smallwood, J., & Schooler, J. W. (2013). Window to the wandering mind: Pupillometry of spontaneous thought while reading. Quarterly Journal of Experimental Psychology, 66(12), 2289–2294.
Frazier, L. (1987). Sentence processing: A tutorial review. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading. Lawrence Erlbaum Associates.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6(1), 110–125.
Gatehouse, S., & Noble, W. (2004). The Speech, Spatial and Qualities of Hearing scale (SSQ). International Journal of Audiology, 43(2), 85–99.
Gibson, E., & Pearlmutter, N. J. (1998). Constraints on sentence comprehension. Trends in Cognitive Sciences, 2(7), 262–268.
Golestani, N., Hervais-Adelman, A., Obleser, J., & Scott, S. K. (2013). Semantic versus perceptual interactions in neural processing of speech-in-noise. NeuroImage, 79, 52–61.
Gosselin, P. A., & Gagné, J.-P. (2011). Older adults expend more listening effort than young adults recognizing audiovisual speech in noise. International Journal of Audiology, 50(11), 786–792.
Hagoort, P. (2005). On Broca, brain, and binding: A new framework. Trends in Cognitive Sciences, 9(9), 416–423.
Heinrich, A., Carlyon, R. P., Davis, M. H., & Johnsrude, I. S. (2008). Illusory vowels resulting from perceptual continuity: A functional magnetic resonance imaging study. Journal of Cognitive Neuroscience, 20(10), 1737–1752.
Heinrich, A., Carlyon, R. P., Davis, M. H., & Johnsrude, I. S. (2011). The continuity illusion does not depend on attentional state: FMRI evidence from illusory vowels. Journal of Cognitive Neuroscience, 23(10), 2675–2689.
Hervais-Adelman, A., Carlyon, R., Johnsrude, I., & Davis, M. (2012). Brain regions recruited for the effortful comprehension of noise-vocoded words. Language and Cognitive Processes, 28, 1145–1166.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402.
Huyck, J. J., & Johnsrude, I. S. (2012). Rapid perceptual learning of noise-vocoded speech requires attention. Journal of the Acoustical Society of America, 131(3), EL236–242.
Ivarsson, U. S., & Arlinger, S. D. (1994). Speech recognition in noise before and after a work-day's noise exposure. Scandinavian Audiology, 23(3), 159–163.
Johnsrude, I. S., Mackey, A., Hakyemez, H., Alexander, E., Trang, H. P., & Carlyon, R. P. (2013). Swinging at a cocktail party: Voice familiarity aids speech perception in the presence of a competing voice. Psychological Science, 24(10), 1995–2004.
Kahneman, D., & Beatty, J. (1966). Pupil diameter and load on memory. Science, 154(3756), 1583–1585.
Kalikow, D. N., Stevens, K. N., & Elliott, L. L. (1977). Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 61(5), 1337–1351.
Kidd, G. J., Mason, C. R., Richards, V. M., Gallun, F. J., & Durlach, N. I. (2007). Informational masking. In W. A. Yost & R. R. Fay (Eds.), Auditory perception of sound sources (pp. 143–190). Springer.
Leek, M. R., Brown, M. E., & Dorman, M. F. (1991). Informational masking and auditory attention. Perception and Psychophysics, 50(3), 205–214.
Lewis, R. L., Vasishth, S., & Van Dyke, J. A. (2006). Computational principles of working memory in sentence comprehension. Trends in Cognitive Sciences, 10(10), 447–454.
Loftus, G. R., & Masson, M. E. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review, 1(4), 476–490.
MacDonald, M. C., Pearlmutter, N. J., & Seidenberg, M. S. (1994). The lexical nature of syntactic ambiguity resolution [corrected]. Psychological Review, 101(4), 676–703.
Mack, J., Meltzer-Asscher, A., Barbieri, E., & Thompson, C. (2013). Neural correlates of processing passive sentences. Brain Sciences, 3(3), 1198–1214.


Mattys, S. L., Barden, K., & Samuel, A. G. (2014). Extrinsic cognitive load impairs low-level speech perception. Psychonomic Bulletin and Review, 21(3), 748–754.
Mattys, S. L., Brooks, J., & Cooke, M. (2009). Recognizing speech under a processing load: Dissociating energetic from informational factors. Cognitive Psychology, 59(3), 203–243.
Mattys, S. L., & Wiget, L. (2011). Effects of cognitive load on speech recognition. Journal of Memory and Language, 65(2), 145–160.
McGarrigle, R., Munro, K. J., Dawes, P., Stewart, A. J., Moore, D. R., Barry, J. G., et al. (2014). Listening effort and fatigue: What exactly are we measuring? A British Society of Audiology Cognition in Hearing Special Interest Group "white paper." International Journal of Audiology.
Meltzer, J. A., McArdle, J. J., Schafer, R. J., & Braun, A. R. (2010). Neural aspects of sentence comprehension: Syntactic complexity, reversibility, and reanalysis. Cerebral Cortex, 20(8), 1853–1864.
Meyer, L., Obleser, J., Kiebel, S. J., & Friederici, A. D. (2012). Spatiotemporal dynamics of argument retrieval and reordering: An fMRI and EEG study on sentence processing. Frontiers in Psychology, 3, 523.
Micheyl, C., Carlyon, R. P., Shtyrov, Y., Hauk, O., Dodson, T., & Pullvermüller, F. (2003). The neurophysiological basis of the auditory continuity illusion: A mismatch negativity study. Journal of Cognitive Neuroscience, 15(5), 747–758.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41(5), 329–335.
Naber, M., & Nakayama, K. (2013). Pupil responses to high-level image content. Journal of Vision, 13(6).
Nittrouer, S., & Boothroyd, A. (1990). Context effects in phoneme and word recognition by young children and older adults. The Journal of the Acoustical Society of America, 87(6), 2705–2715.
Noonan, K. A., Jefferies, E., Visser, M., & Lambon Ralph, M. A. (2013). Going beyond inferior prefrontal involvement in semantic control: Evidence for the additional contribution of dorsal angular gyrus and posterior middle temporal cortex. Journal of Cognitive Neuroscience, 25(11), 1824–1850.
Novick, J. M., Kan, I. P., Trueswell, J. C., & Thompson-Schill, S. L. (2009). A case for conflict across multiple domains: Memory and language impairments following damage to ventrolateral prefrontal cortex. Cognitive Neuropsychology, 26(6), 527–567.
Obleser, J., & Kotz, S. A. (2010). Expectancy constraints in degraded speech modulate the language comprehension network. Cerebral Cortex, 20(3), 633–640.
Obleser, J., Meyer, L., & Friederici, A. D. (2011). Dynamic assignment of neural resources in auditory comprehension of complex sentences. NeuroImage, 56(4), 2310–2320.
Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.-H., Saberi, K., et al. (2010). Hierarchical organization of human auditory cortex: Evidence from acoustic invariance in the response to intelligible speech. Cerebral Cortex, 20(10), 2486–2495.
Pals, C., Sarampalis, A., & Baskent, D. (2012). Listening effort with cochlear implant simulations. Journal of Speech, Language, and Hearing Research.
Peelle, J. E., Johnsrude, I. S., & Davis, M. H. (2010). Hierarchical processing for speech in human auditory cortex and beyond. Frontiers in Human Neuroscience, 4, 51.
Peelle, J. E., McMillan, C., Moore, P., Grossman, M., & Wingfield, A. (2004). Dissociable patterns of brain activity during comprehension of rapid and syntactically complex speech: Evidence from fMRI. Brain and Language, 91(3), 315–325.
Petkov, C. I., O'Connor, K. N., & Sutter, M. L. (2007). Encoding of illusory continuity in primary auditory cortex. Neuron, 54(1), 153–165.
Piquado, T., Isaacowitz, D., & Wingfield, A. (2010). Pupillometry as a measure of cognitive effort in younger and older adults. Psychophysiology, 47(3), 560–569.
Pollack, I. (1975). Auditory informational masking. Journal of the Acoustical Society of America, 57, S5.
Rodd, J., Gaskell, G., & Marslen-Wilson, W. (2002). Making sense of semantic ambiguity: Semantic competition in lexical access. Journal of Memory and Language, 46(2), 245–266.
Rodd, J. M., Davis, M. H., & Johnsrude, I. S. (2005). The neural mechanisms of speech comprehension: fMRI studies of semantic ambiguity. Cerebral Cortex, 15(8), 1261–1269.
Rodd, J. M., Johnsrude, I. S., & Davis, M. H. (2010). The role of domain-general frontal systems in language comprehension: Evidence from dual-task interference and semantic ambiguity. Brain and Language, 115(3), 182–188.
Rodd, J. M., Johnsrude, I. S., & Davis, M. H. (2012). Dissociating frontotemporal contributions to semantic ambiguity resolution in spoken sentences. Cerebral Cortex, 22(8), 1761–1773.
Rodd, J. M., Longe, O. A., Randall, B., & Tyler, L. K. (2010). The functional organisation of the fronto-temporal language system: Evidence from syntactic and semantic ambiguity. Neuropsychologia, 48(5), 1324–1335.
Saija, J. D., Akyürek, E. G., Andringa, T. C., & Başkent, D. (2014). Perceptual restoration of degraded speech is preserved with advancing age. Journal of the Association for Research in Otolaryngology, 15(1), 139–148.
Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(Pt 12), 2400–2406.
Sevilla, Y., Maldonado, M., & Shalóm, D. E. (2014). Pupillary dynamics reveal computational cost in sentence planning. Quarterly Journal of Experimental Psychology, 67(6), 1041–1052.
Shahin, A. J., Bishop, C. W., & Miller, L. M. (2009). Neural mechanisms for illusory filling-in of degraded speech. NeuroImage, 44(3), 1133–1143.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Shetreet, E., & Friedmann, N. (2014). The processing of different syntactic structures: fMRI investigation of the linguistic distinction between wh-movement and verb movement. Journal of Neurolinguistics, 27(1), 1–17.
Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2012). Predictive top-down integration of prior knowledge during speech perception. The Journal of Neuroscience, 32(25), 8443–8453.
Stewart, R., & Wingfield, A. (2009). Hearing loss and cognitive effort in older adults' report accuracy for verbal materials. Journal of the American Academy of Audiology, 20(2), 147–154.
Tahmasebi, A. M., Abolmaesumi, P., Geng, X., Morosan, P., Amunts, K., Christensen, G. E., et al. (2009). A new approach for creating customizable cytoarchitectonic probabilistic maps without a template. Medical Image Computing and Computer-Assisted Intervention, 12(Pt 2), 795–802.
Thompson-Schill, S. L., D'Esposito, M., Aguirre, G. K., & Farah, M. J. (1997). Role of left inferior prefrontal cortex in retrieval of semantic knowledge: A reevaluation. Proceedings of the National Academy of Sciences of the United States of America, 94(26), 14792–14797.
Tyler, L. K., Marslen-Wilson, W. D., Randall, B., Wright, P., Devereux, B. J., Zhuang, J., et al. (2011). Left inferior frontal cortex and syntax: Function, structure and behaviour in patients with left hemisphere damage. Brain, 134(Pt 2), 415–431.
Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167(3917), 392–393.


Wild, C. J., Davis, M. H., & Johnsrude, I. S. (2012). Human auditory cortex is sensitive to the perceived clarity of speech. NeuroImage, 60(2), 1–13.
Wild, C. J., Yusuf, A., Wilson, D. E., Peelle, J. E., Davis, M. H., & Johnsrude, I. S. (2012). Effortful listening: The processing of degraded speech depends critically on attention. Journal of Neuroscience, 32(40), 14010–14021.
Willems, R. M., & Hagoort, P. (2009). Broca's region: Battles are not won by ignoring half of the facts. Trends in Cognitive Sciences, 13(3), 101 (author reply 102).
Winer, J. A. (2006). Decoding the auditory corticofugal systems. Hearing Research, 212(1–2), 1–8.
Zekveld, A. A., & Kramer, S. E. (2014). Cognitive processing load across a wide range of listening conditions: Insights from pupillometry. Psychophysiology, 51(3), 277–284.
Zekveld, A. A., Kramer, S. E., & Festen, J. M. (2011). Cognitive load during speech perception in noise: The influence of age, hearing loss, and cognition on the pupil response. Ear and Hearing, 32(4), 498–510.

CHAPTER 41

Neural Mechanisms of Attention to Speech

Lee M. Miller
Center for Mind & Brain, and Department of Neurobiology, Physiology, & Behavior, University of California, Davis, CA, USA

41.1 OVERVIEW AND HISTORY

Speech perception rarely occurs under pristine acoustic conditions. Background noises, competing talkers, and reverberations often corrupt speech signals and confuse the basic organization of our auditory world. In modern life this perceptual challenge is inescapable, pervading the home, workplace, social situations (e.g., restaurants), and modes of transit (cars, subways). Fortunately, our brains have a remarkable ability to filter out undesired sounds and attend to a single talker, dramatically improving our comprehension (Kidd, Arbogast, Mason, & Gallun, 2005).

In this chapter we address selective attention to speech, as distinguished from general arousal or vigilance that is not directed at specific locations, objects, or features. William James described the essential phenomenology of selective attention as "the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought... It implies withdrawal from some things in order to deal effectively with others" (James, 1890). Mechanistically, this comprises two separable components: control and selection. Attentional control refers to the cognitive processes that coordinate and direct our attention to specific aspects of our world, for instance, to the high-pitched female talker on the left. In contrast, attentional selection describes the biasing or filtering operation by which one object is highlighted or processed further and the rest is ignored. Selection is therefore the effect that attentional control exerts on the representations of speech content (e.g., acoustic, phonetic, semantic, affective). In the metaphor of the spotlight (Posner, Snyder, & Davidson, 1980), attentional control points the spotlight, and attentional selection is evident by its illumination.

The modern study of attention to speech began in 1953 with Colin Cherry's behavioral work on speech comprehension in the presence of other distracting talkers, a situation he termed the "cocktail party problem" (Cherry, 1953). His observations and the experimental paradigms he pioneered, such as dichotic listening (different messages to left and right ears) and shadowing (repeating aloud), attracted many investigators to the topic over the ensuing decades. Attention to speech therefore became a driving force behind general cognitive theories of attention through the 1950s and 1960s, when the "early versus late selection" debate arose (Broadbent, 1958; Deutsch & Deutsch, 1963; Treisman, 1960). In contrast to the cognitive theory, the neuroscience of attention to speech began much later. In part, this delay was methodological, awaiting the widespread use of noninvasive human electrophysiology in the 1970s (electroencephalography [EEG]) and the advent of neuroimaging in the 1980s and 1990s (positron emission tomography [PET] and functional magnetic resonance imaging [fMRI]). Additionally, much early neuroscientific work on audition and attention eschewed speech for simpler stimuli, such as isolated pure tones or noise bursts (Miller et al., 1972). Simple, isolated stimuli offer substantial experimental benefits because they are easily parameterized, have consistent neural responses that are straightforward to analyze, and provide a ready comparison between human and nonhuman neurophysiology, where simple stimuli prevail. However, especially when presented in isolation, they cannot approximate a "cocktail party" environment where selective attention is so crucial. Moreover, simple stimuli lack speech's rich spectro-temporal structure, which may be indispensable for understanding the mechanisms of attentional selection.

Consequently, the neuroscience of selective attention to speech (in the tradition of Cherry) began only in 1976, using traditional low-density (here, three electrodes) EEG in humans (Hink & Hillyard, 1976), which showed an increased neural response to speech in the attended ear with a latency of approximately 100 ms.

In other words, attention can "select" a target speech stream by enhancing its neural processing at an early stage, a result that has been replicated many times since. We build on this basic idea, reviewing some of the evidence and conceptual advances that have brought us to our current understanding. Notably, recent years have witnessed a surge of interest in, and increased appreciation for, realistic conditions with high perceptual load (Lavie, 2005), especially using continuous competing speech stimuli, degraded acoustics, and realistic spatial scenes. In the following sections we review the neural networks for attentional control, the levels of processing or representations that attention selects, and current neural theories about how, mechanistically, selective attention brings about such profound improvement in speech perception.

41.2 NEURAL NETWORKS FOR ATTENTIONAL CONTROL

Communicating in daily life requires that we direct the focus of our attention: maintain it on one talker in a noisy background, scan for a friend's voice in a crowd, listen for a certain word, or switch between different talkers in a conversation. All these are aspects of attentional control, whose functional neuroanatomy provides a framework for understanding attention to speech generally. Although attentional control to speech has been studied less than attentional selection, many aspects of control are cross-modal or supramodal (Krumbholz, Nobis, Weatheritt, & Fink, 2009; Macaluso, 2010; Salmi, Rinne, Degerman, Salonen, & Alho, 2007; Shomstein & Yantis, 2004), so several key principles observed in other sensory systems apply to speech as well. One influential model holds that two independent neural networks exert attentional control (Corbetta & Shulman, 2002). An "endogenous" dorsal fronto-parietal network guides top-down or volitional attention. This network consistently includes the intraparietal sulcus (IPS) or superior parietal lobule and the frontal eye fields (FEF). A second, "exogenous" ventral fronto-temporo-parietal control network is proposed to handle reflexive or bottom-up reorienting of attention, as when triggered by stimulus saliency and target detection. This ventral control network consistently includes lateral cortical areas such as the temporo-parietal junction (TPJ).

Adverse speech environments virtually always require top-down, volitional attention to particular talker locations, voice qualities, and speech content. Consistent with this requirement, "cocktail party" paradigms often recruit the dorsal fronto-parietal "endogenous" attention network: particularly IPS and superior parietal lobule, and the superior precentral sulcus in the characteristic location of FEF. Clear evidence for this dorsal network is observed for cued attention (Krumbholz et al., 2009), including when listeners deploy attention to a desired pitch or location of a talker before the speech occurs (Hill & Miller, 2009). Dorsal network involvement is also evident through suppression of alpha (approximately 10 Hz) EEG power over the parietal lobe contralateral to the attended hemifield (Kerlin, Shahin, & Miller, 2010), and later in the fronto-parietal network after perception of degraded speech (Obleser & Weisz, 2012).
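Alpha suppression of this kind is typically quantified as band-limited power. The sketch below uses simulated single-channel EEG (the filter band and simulation parameters are illustrative assumptions, not those of any cited study): it band-pass filters around 10 Hz and compares mean alpha power across two attention conditions.

```python
# Alpha (~10 Hz) band power from simulated single-channel EEG.
# Filter band and simulation parameters are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

fs = 250                                   # EEG sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(4)

def simulated_eeg(alpha_amp):
    """White noise plus a 10-Hz oscillation of the given amplitude."""
    return alpha_amp * np.sin(2 * np.pi * 10 * t) + rng.standard_normal(len(t))

def alpha_power(eeg):
    """Mean power of the 8-12 Hz band, via the analytic-signal envelope."""
    sos = butter(4, [8, 12], btype="band", fs=fs, output="sos")
    analytic = hilbert(sosfilt(sos, eeg))
    return float(np.mean(np.abs(analytic) ** 2))

attend_ipsi = simulated_eeg(alpha_amp=1.0)    # alpha retained
attend_contra = simulated_eeg(alpha_amp=0.4)  # alpha suppressed
print(f"alpha power, ipsilateral:   {alpha_power(attend_ipsi):.2f}")
print(f"alpha power, contralateral: {alpha_power(attend_contra):.2f}")
```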
speech has been studied less than attentional selection, Another area robustly involved in attention to speech many aspects of control are cross-modal or supra- is the inferior frontal gyrus (IFG), especially the dorsal modal (Krumbholz, Nobis, Weatheritt, & Fink, 2009; part and in some cases specifically the inferior frontal Macaluso, 2010; Salmi, Rinne, Degerman, Salonen, & junction (IFJ; posterior part of the inferior frontal sulcus Alho, 2007; Shomstein & Yantis, 2004), so several key where it meets the precentral sulcus). Early suggestions principles observed in other sensory systems apply to for an attentional role of IFG came not from neuroimag- speech as well. One influential model holds that two ing studies on attention per se, but from those that hap- independent neural networks exert attentional control pened to require attention to phonetic attributes of clean (Corbetta & Shulman, 2002). An “endogenous” dorsal speech. For instance, Zatorre reviewed a number of fronto-parietal network guides top-down or volitional reports that used phonetic monitoring and discrimina- attention. This network consistently includes intrapar- tion, all of which show activation in dorsal posterior ietal sulcus (IPS) or superior parietal lobule and the Broca’s area (Zatorre, Meyer, Gjedde, & Evans, 1996). frontal eye fields (FEF). A second, “exogenous” ventral Although Broca’s area is a large region heterogeneous in fronto-temporo-parietal control network is proposed to its cytoarchitectonics, anatomical connectivity, and func- handle reflexive or bottom-up reorienting of attention, tion (Amunts & Zilles, 2012; Rogalsky & Hickok, 2011), as when triggered by stimulus saliency and target evidence suggests that in addition to phonetic and other detection. This ventral control network consistently linguistic processing, this consistency in dorsal IFG is due includes lateral cortical areas such as the temporo- to sustained active monitoring (i.e., attentional control). parietal junction (TPJ). Subsequent studies using selective attention to one of Adverse speech environments virtually always multiple concurrent speech sounds also found left require top-down, volitional attention to particular lateralized dorsal-posterior IFG activity when selective talker locations, voice qualities, and speech content. attention was required (Hashimoto, Homae, Nakajima, Consistent with this requirement, “cocktail party” Miyashita, & Sakai, 2000; Lipschutz, Kolinsky, Damhaut,


Another study approximated the challenge of a cocktail party using degraded speech and a clever manipulation to isolate controlled attention: subjects were first presented with noises, unaware that they could be heard as speech, and then learned to listen to the noises as degraded words (Giraud et al., 2004). The results showed that effortful "auditory search" recruits left dorsal posterior IFG without directly reflecting comprehension. Wild et al. (2012) also reported effects in IFG with similarly degraded sentences, in this case effects of auditory attention broadly but with an interaction such that a portion of dorsal IFG was only active when degraded but potentially comprehensible speech was attended. This is similar to the aforementioned "auditory search" in that listeners use internal expectations of speech to parse it, effortfully, from noise. In a rather different paradigm, left dorsal IFG/IFJ was specifically related to attentional control as opposed to selection of a competing speech stream, because it was shown to be active while directing attention before the speech occurred (Hill & Miller, 2009). Finally, although it is not universally the case (Alho et al., 2003), a majority of speech paradigms show this activation to have a left-hemisphere bias, likely reflecting linguistic hemispheric dominance (Giraud et al., 2004; Hill & Miller, 2009).

Even when attending to nonspeech noise bursts, IFG is active when listening in silence before the stimulus occurs (but here leading to a right-hemisphere activation; Voisin, Bidet-Caulet, Bertrand, & Fonlupt, 2006). Moreover, specifically the dorsal posterior part of IFG is involved in a wide array of challenging tasks—both auditory, such as sequencing speech or nonspeech sounds (Gelfand & Bookheimer, 2003), and nonauditory, from math to cognitive control to working memory (Fedorenko, Duncan, & Kanwisher, 2012). All of these are consistent with a domain-general function of (usually left) dorsal IFG in attentional control or, perhaps more broadly, cognitive control (Novick, Trueswell, & Thompson-Schill, 2005). Interestingly, rather than being squarely part of a ventral reflexive attention network, the IFJ may play an integrative role in both goal-directed and stimulus-driven attention (Asplund, Todd, Snyder, & Marois, 2010). This would clearly be important in a "cocktail party" environment, where momentary attentional goals would be continually updated by the varying salience of corrupted speech attributes.

In addition to IFG, attention to speech may recruit the hypothesized posterior member of the ventral "exogenous" control system—the inferior parietal lobule and especially the TPJ—although here the evidence is not quite as consistent. Nevertheless, in the first neuroimaging study of concurrent continuous speech (as opposed to brief tokens), Alho et al. (2003) observed right-lateralized TPJ activity regardless of whether subjects attended to the left or right ear. Similar activation is seen when attending to one of two concurrent streams of dichotic syllables (Lipschutz et al., 2002) and when switching and reorienting attention to one of two streams of spoken digits, whether simultaneously presented (Larson & Lee, 2012) or not (Krumbholz et al., 2009). Although bilateral TPJ involvement has been shown in a number of auditory attention studies (Huang, Belliveau, Tengshe, & Ahveninen, 2012), a right-hemisphere bias of a ventral attentional network is nevertheless consistent with visual attention and the phenomenon of hemispatial neglect (Corbetta & Shulman, 2011).

Thus, attentional control to speech relies on a supramodal, largely bilateral, dorsal fronto-parietal "volitional" network that directs attention to numerous speech attributes including but not limited to spatial location. Additionally, ventral prefrontal cortex (IFG) appears to be crucial for deploying attention flexibly to comprehend degraded or competing speech, whether attending to pitch or location, monitoring phonemes, or analyzing sound sequences. Like the dorsal network, left IFG appears to be supramodal in the sense that parts of it handle a wide array of attention-demanding perceptual and cognitive functions, both verbal and nonverbal. This agrees with the suggestion that the ventral frontal attention network, the IFJ in particular, may be a kind of attentional nexus playing an integrative role in both goal-directed and stimulus-driven attention (Asplund et al., 2010; Huang et al., 2012). Finally, ventral posterior cortices consistent with a supramodal "orienting network," usually the TPJ and particularly in the right hemisphere, appear to be important for attentional control and switching to speech in certain paradigms. Overall, the evidence shows that attentional control to speech is domain-general, sharing principles and functional neuroanatomy with other modalities.

41.3 LEVELS OF ATTENTIONAL SELECTION

Once attentional control "points the spotlight," its "illumination" then selects or modulates internal representations. Historically, since the time of Cherry, perception of competing speech has been central to the debate about attentional selection.

Cherry and others observed that when attending to and shadowing one of two competing speech streams, listeners may not notice even dramatic changes in the unattended stream, such as a switch from English to German or to backward speech (Cherry, 1953; Wood & Cowan, 1995). Such evidence inspired the "early selection" theory, championed by Broadbent (1958), which held that processing of unattended speech is thoroughly blocked at an early sensory stage. Other investigators advocated "late selection": that some speech information is maintained through the entire sensory processing hierarchy, with attentional selection occurring just before entry into working memory (Deutsch & Deutsch, 1963). Still others developed variants of these, with different stages of attenuation or filtering by attention (Treisman, 1960).

Today the debate is less heated, with most investigators recognizing that attention may act at different levels, and unattended speech may be processed to different depths, depending on the nature of the stimuli and task goals. The neural evidence shows that in some cases attention can modulate the ascending pathway at the thalamus (Frith & Friston, 1996), the brainstem including the inferior colliculus (Galbraith & Arroyo, 1993; Rinne et al., 2008), or even the cochlea (Delano, Elgueda, Hamame, & Robles, 2007; Giraud et al., 1997). However, most evidence for attentional modulation of speech, based on latency and anatomical location, points to auditory cortex and higher-level speech-related cortex. The emerging pattern is that selective attention can modulate acoustic speech representations in primary or early nonprimary auditory cortical areas and at numerous higher levels, evidently depending on the task demands and on the degree of perceptual load or interference.

Our understanding of the neural selection of speech builds on and parallels a large body of research on auditory attention to nonspeech, especially simpler, stimuli. Although this chapter does not aspire to thoroughly review auditory attention, a few notable studies provide some context. The earliest demonstration of auditory attentional selection occurred in cat auditory cortex, where Hubel observed cells that responded only when the animal paid attention to the stimulus (Hubel, Henson, Rupert, & Galambos, 1959). This appeared to be an early and strong form of gating or gain modulation due to auditory attentional selection, but little scientific work pursued the idea for many years. In humans, the "early versus late selection" debate was finally addressed with neural data using EEG: in a dichotic listening task, attention modulated the neural response with an enhanced negativity (at the top of the scalp) to tones in the attended ear at latencies of 80–110 ms, corresponding to the auditory N1 wave (Hillyard, Hink, Schwent, & Picton, 1973). This observation has been replicated many times, and the modulation has been shown in some circumstances to occur earlier, from 20 to 50 ms and from 80 to 130 ms (Woldorff et al., 1993). Thus, early studies demonstrated a gain modulation of neural responses to attended sounds at latencies consistent with low-level auditory cortex.
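The logic of these dichotic ERP comparisons is simple enough to sketch. The following toy example (simulated single-channel data rather than any study's actual pipeline; the sampling rate, noise level, and effect sizes are chosen purely for illustration) averages epochs time-locked to probes in each ear and compares the mean ERP amplitude in the N1 window of 80–110 ms:

import numpy as np

rng = np.random.default_rng(0)
fs = 500                                  # sampling rate (Hz), assumed
t = np.arange(-0.1, 0.4, 1 / fs)          # epoch: -100 to 400 ms around probe onset

def simulate_epochs(n1_gain, n_trials=200):
    """Toy single-channel EEG epochs with an N1-like negativity at ~100 ms."""
    n1 = -n1_gain * np.exp(-((t - 0.1) ** 2) / (2 * 0.02 ** 2))  # Gaussian N1 shape
    noise = rng.normal(0, 2.0, (n_trials, t.size))               # trial-by-trial noise
    return n1 + noise

attended = simulate_epochs(n1_gain=3.0)    # attention modeled as a larger N1 generator
unattended = simulate_epochs(n1_gain=1.5)

# Average across trials (the ERP), then measure the N1 window (80-110 ms).
win = (t >= 0.08) & (t <= 0.11)
erp_att = attended.mean(axis=0)
erp_unatt = unattended.mean(axis=0)
print(f"N1 mean, attended:   {erp_att[win].mean():+.2f} uV")
print(f"N1 mean, unattended: {erp_unatt[win].mean():+.2f} uV")  # less negative

With attention modeled as a larger N1 generator, the attended-ear average is reliably more negative in that window, which is the signature reported by Hillyard et al. (1973).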
These principles appeared to hold for speech sounds as well, beginning with the aforementioned Hink and Hillyard dichotic speech study (Hink & Hillyard, 1976), which showed a larger N1 to stimuli in the attended ear. Subsequent EEG work approximated Cherry's shadowing paradigm, again reporting enhanced negativity to speech probe sounds, again co-temporal with the N1 wave (Woods, Hillyard, & Hansen, 1984). In a significant step toward ecologically valid "cocktail party" perception, Teder et al. first used continuous speech itself to elicit ERPs, rather than independent probe sounds in the attended "channel," finding increased negativity beginning at approximately 40 ms and peaking at 80–90 ms latency (Teder, Kujala, & Naatanen, 1993). This study also used free-field speakers rather than dichotic presentation and highlighted the potential importance of realistic spatial cues, which may influence comprehension more as the scene becomes complex (Yost, Dye, & Sheft, 1996).

Neuroanatomically, the timing of effects in these electromagnetic studies suggests attentional modulation of primary and early nonprimary auditory cortices (Godey, Schwartz, de Graaf, Chauvel, & Liegeois-Chauvel, 2001). This conjecture was supported by a number of neuroimaging studies that used attention to speech in the absence of competing sounds or degradation (Grady et al., 1997; Jancke, Mirzazade, & Shah, 1999), or with brief dichotic speech stimuli spaced out in time (Jancke, Buchanan, Lutz, & Shah, 2001). Many early imaging studies using concurrent speech reported attentional modulation of large areas of the superior temporal lobe but gave mixed results or lacked the spatial specificity to determine whether the modulation extended to presumed primary or early secondary auditory cortex (Alho et al., 2003; Hashimoto et al., 2000; Hugdahl et al., 2000; Lipschutz et al., 2002; Nakai et al., 2005; O'Leary et al., 1996). As a group, these leave unanswered the question of "early selection" but provide powerful evidence for attentional modulation at multiple higher stages of processing.

Meanwhile, studies using nonspeech sounds with more precise mapping techniques demonstrated that even for isolated (nonconcurrent) sounds, attentional modulation tended to be weak or absent in primary auditory cortex and stronger in secondary and tertiary fields (Petkov et al., 2004; Woods et al., 2009).

Recent reports using more ecological "cocktail party" speech tasks also tend to find superior temporal attentional modulations that selectively spare primary auditory cortex but affect neighboring areas (Hill & Miller, 2009). On the basis of timing and source location, MEG demonstrated such modulations in planum temporale at approximately 100 ms latency, but not earlier (Ding & Simon, 2012). This contrasts with attentional enhancements of response phase and coherence to nonspeech tonal stimuli, interpreted to occur in core or primary auditory cortex (Elhilali, Xiang, Shamma, & Simon, 2009). Another recent imaging study addressed selective attention to distorted speech in the presence of additional auditory and visual distractors, in other words, a very demanding and realistic task (Wild et al., 2012). Attentional modulation occurred throughout auditory cortex when subjects attended to audition as opposed to vision. However, attentional enhancement specifically for distorted speech occurred only in speech-sensitive anterior and posterior STS—high-level areas hierarchically distant from primary auditory cortex. In other words, much of auditory cortex showed some nonspecific modulation, but only high-level speech processing benefitted from targeted, selective attentional enhancement.

Attentional selection of speech is thus remarkably flexible and can modulate processing at multiple levels of the cortical auditory and speech hierarchy, even during the same task. In all likelihood, this variable selection is as rapidly deployed and adaptive as that observed in vision (Hopf et al., 2006). However, the specific modulation appears to be constrained by perceptual load or separability at the acoustic-phonetic level (Shinn-Cunningham, 2008). For sounds that differ on basic acoustic dimensions (or for sounds versus stimuli in another sensory modality), attention may be able to act earlier in the hierarchy. For competing speech that shares many physical characteristics, or in cases of high perceptual load, attention may only be able to modulate higher-level, more abstract speech processing. This suggests that the levels of attentional selection are determined both by the processing level at which sensory representations can be readily distinguished and by the amount of competition among those representations.

41.4 SPEECH REPRESENTATIONS THAT ATTENTION SELECTS

To fully understand the levels of attentional selection, we must also characterize the nature or content of the speech representations that are actually selected. Here, manipulating task demands has been particularly informative, showing that selection works on a vast array of perceptually salient speech attributes depending on their momentary relevance. For this reason, attention to speech has most often been used not to study attention but as a tool to identify brain areas devoted to components of speech-perceptual and linguistic processing. The advantage of using attention in such paradigms is that one can equate low-level stimulus attributes among conditions, changing only the task goal and, with it, the attentional focus. These studies range from the well-established "what versus where" streams distinguishing speech identity from location processing (Ahveninen et al., 2006; Alain, Arnott, Hevenor, Graham, & Grady, 2001; Maeder et al., 2001), to recognizing voices versus comprehending them (Alho et al., 2006; von Kriegstein, Eger, Kleinschmidt, & Giraud, 2003), to performing syntactic and semantic operations (Rogalsky & Hickok, 2009). However, although they use attentional selection as a tool, most of these approaches are not intended to reproduce the ecological conditions (e.g., high perceptual load) that may be necessary to strongly engage and reveal the attentional system (Lavie, 2005).

When listeners are challenged to understand degraded or competing speech, one of the key neural representations that correlates with comprehension is a low-frequency response (approximately 2–16 Hz, and especially 4–8 Hz) likely evoked or entrained by the speech amplitude envelope (Ahissar et al., 2001; Ding & Simon, 2013; Giraud & Poeppel, 2012; Luo & Poeppel, 2007; Peelle & Davis, 2012) or by other salient acoustic features time-varying at a syllabic rate (Ghitza, Giraud, & Poeppel, 2012; Obleser, Herrmann, & Henry, 2012). That is, the neural response has substantial power at the same frequencies at which speech acoustic power fluctuates across syllables. And it is phase-locked, in the sense that neural fluctuations in that frequency range follow the acoustic speech events with a consistent latency. This particular response has a latency of approximately 100 ms, appears to be strongly represented in early auditory cortices, and is modulated by attention (Ding & Simon, 2012; Kerlin et al., 2010; Power, Lalor, & Reilly, 2011; Zion Golumbic et al., 2013).
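To make the notion of envelope following concrete, the sketch below (a toy simulation under stated assumptions, not a published method; all parameters are arbitrary) extracts a band-limited 2–16 Hz amplitude envelope with the Hilbert transform, models a cortical signal that follows that envelope at a fixed delay, and recovers the latency by cross-correlation:

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

rng = np.random.default_rng(1)
fs = 250
dur = 60.0
n = int(fs * dur)

def bandpass(x, lo, hi):
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

# Stand-in "speech": a noise carrier modulated at syllabic rates (~4-8 Hz).
modulator = 0.5 * (1 + bandpass(rng.normal(0, 1, n), 4, 8))
speech = modulator * rng.normal(0, 1, n)

# Amplitude envelope of the stimulus, band-limited to the syllabic range.
env = bandpass(np.abs(hilbert(speech)), 2, 16)

# Toy cortical response: envelope following at ~100 ms latency, plus noise.
lag_true = int(0.1 * fs)
neural = np.roll(env, lag_true) + rng.normal(0, 1.0, n)

# Cross-correlate at positive lags to estimate the response latency.
lags = np.arange(0, int(0.3 * fs))
r = [np.corrcoef(env[: n - L], neural[L:])[0, 1] for L in lags]
print(f"estimated latency: {lags[int(np.argmax(r))] / fs * 1000:.0f} ms")

Real analyses in this spirit (cross-correlation or regression between the speech envelope and MEG/EEG) recover latencies near 100 ms, as described above.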

Recordings from the cortical surface in humans furthermore show that not only the temporal envelope but also the spectro-temporal envelope of the attended talker is preferentially modulated by attention, largely along the lateral superior temporal gyrus (STG) (Mesgarani & Chang, 2012). These observations are consistent with reports in nonhuman animals of rapid, attention-dependent changes in the spectro-temporal tuning of single cells and populations in auditory cortex (Fritz, Elhilali, & Shamma, 2007). As noted, other perceptually salient speech features such as pitch and spatial location are also clearly modulated by attention, but compared with the spectro-temporal envelope the specific nature of these selected representations is relatively less apparent or less understood, beyond a well-documented contralateral bias for spatial hemifield (Alho et al., 2003).

The prominent role of dynamic speech attributes for attention and comprehension highlights the unique importance of time. Unlike in vision, all auditory stimuli—and particularly speech—are distinguished by their temporal evolution. Thus, speech perception may rely on attention to specific moments even more than in other perceptual domains (Coull & Nobre, 1998; Lange & Roder, 2006). Behaviorally, temporal expectancy, or attention in time, has a powerful influence on speech comprehension. For instance, knowing both an expected word and the moment of a critical phoneme, but not one or the other, allows focused attention to alter the perceptual continuity of speech (Samuel & Ressler, 1986), even when the attentional state generally has no effect on continuity (Heinrich, Carlyon, Davis, & Johnsrude, 2011). This suggests that attention-in-time to speech may require an internal representation or prediction (e.g., lexical or syntactic) on which it operates (Shahin, Bishop, & Miller, 2009). In adverse environments, such context-guided temporal expectations would be the rule rather than the exception. EEG shows that specific attention in time results in an enhanced N1 response (~100 ms) around the time of word onsets (Astheimer & Sanders, 2008), similar to the enhanced N1 often observed for nontemporal attention (Hillyard et al., 1973). These results complement mounting evidence that context and predictions, whether strictly attentional in nature or not, improve the neural processing of speech (Obleser, Wise, Alex Dresner, & Scott, 2007; Sohoglu, Peelle, Carlyon, & Davis, 2012).

Notice that most of our understanding about the neural basis of attentional selection for speech concerns specific speech features or attributes (e.g., pitch, temporal envelope, spectral attributes; Tuomainen, Savela, Obleser, & Aaltonen, 2013). Perceptually, rather than experiencing a collection of individual features, we perceive talkers as coherent, unified objects or streams. An important theoretical question, then, has been whether attention acts on neural speech representations as objects or as individual features. And if it acts on individual features, does attention then automatically "spread" to other features shared by the object? Although the question is not yet settled, most evidence is consistent with an object-based mechanism of attention to speech (Alain & Arnott, 2000; Ihlefeld & Shinn-Cunningham, 2008; Shinn-Cunningham, 2008). More specifically, it seems that, similar to vision (Kravitz & Behrmann, 2011), auditory-speech object formation causally precedes, but potentially still interacts with, attentional selection. A recent MEG study supports this idea, showing that when the intensity of competing talkers is separately varied, the neural representation of attended speech adapts only to the intensity of that speaker and not to the background object, even though they overlap spectro-temporally (Ding & Simon, 2012). These object representations, related to the speech envelope and evident in the M100 response (~100 ms latency), were localized to planum temporale (i.e., in nonprimary auditory cortex). So even though we may attend to individual components of speech, naturalistic speech perception likely relies on bound object representations beginning early in the processing hierarchy.
41.5 NEURAL MECHANISMS AND TOP-DOWN/BOTTOM-UP INTERACTIONS

In the tradition of the "early versus late selection" debate, most inquiry into the neural bases of attention to speech has focused on neural timing and functional anatomy to determine the hierarchical level of speech processing at which attention acts. Recent years, however, have brought an increased interest in the specific mechanisms by which attention modulates speech representations. The historical, often implicit, concept of attentional selection has been that of a gain model: selection increases neural firing for the attended speech feature or object, and perhaps suppresses it for unattended ones. Prima facie, most empirical evidence is consistent with some type of gain modulation, including single-cell neurophysiology (beginning with Hubel's "attention units"; Hubel et al., 1959) as well as the majority of data from neuroimaging and EEG/MEG. One should of course keep in mind that different approaches may be more sensitive to different mechanisms (e.g., noninvasive neuroimaging techniques such as PET or fMRI might be blind to many non-gain temporal mechanisms), whereas in EEG and MEG increased neural synchrony (without more activity overall) may be interpreted as a "gain" effect. Nevertheless, recent studies using a variety of methods continue to support selectively greater activity for attended speech representations. In several EEG or MEG reports, gain modulation has been observed for the phase-locked, syllable-rate responses noted above, reflecting increased neural following of the attended speech envelope (approximately 4–8 Hz; Ding & Simon, 2012; Kerlin et al., 2010). The envelope response to unattended speech may also be somewhat suppressed (Kerlin et al., 2010), as it is with music (Choi, Rajaram, Varghese, & Shinn-Cunningham, 2013). Moreover, the selection of spectro-temporal speech modulations observed in cortical surface recordings (Mesgarani & Chang, 2012) appears to reflect a gain effect.
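A gain model of selection can be stated almost directly in code. In this hedged sketch (simulated envelopes; the gain values and noise level are arbitrary), the measured signal follows both talkers' envelopes, with a larger weight for the attended one, and least-squares regression reads the two gains back out. It is analogous in spirit, though not in method, to published envelope-reconstruction analyses:

import numpy as np

rng = np.random.default_rng(2)
fs = 100
n = fs * 120  # two minutes of toy data

def syllabic_envelope():
    """Smoothed positive noise as a stand-in syllabic-rate speech envelope."""
    x = rng.normal(0, 1, n)
    kernel = np.hanning(int(0.15 * fs))            # ~150 ms smoothing
    return np.convolve(np.abs(x), kernel, mode="same")

env_att, env_unatt = syllabic_envelope(), syllabic_envelope()

# Toy cortical signal: attended envelope followed with higher gain, the
# unattended envelope partially suppressed (cf. Kerlin et al., 2010).
neural = 1.8 * env_att + 0.6 * env_unatt + rng.normal(0, 2.0, n)

# Recover the two gains by least squares.
X = np.column_stack([env_att, env_unatt, np.ones(n)])
g_att, g_unatt, _ = np.linalg.lstsq(X, neural, rcond=None)[0]
print(f"estimated gain, attended:   {g_att:.2f}")
print(f"estimated gain, unattended: {g_unatt:.2f}")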

Interestingly, the amount of gain modulation in the time-varying speech response, effectively the strength of attentional selection, is predicted by the strength of attentional control exerted. Specifically, gain in the speech representation correlates with the strength of anticipatory alpha power (~8–12 Hz) suppression over the parietal lobe contralateral to the attended speaker hundreds of milliseconds earlier (Kerlin et al., 2010). Greater alpha suppression throughout the superior fronto-parietal control network furthermore predicts the intelligibility of degraded speech (Obleser & Weisz, 2012). Thus, consistent with the visual system, the engagement of the attentional control network and the consequent suppression of alpha oscillations may reflect the lifting of inhibition, followed by increased gain in attended speech representations (Jensen & Mazaheri, 2010). However, the details of such "gating by inhibition" for speech, including any potential relationships between the speech envelope and alpha oscillation phase, need further study.
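The alpha measure in such studies is straightforward to illustrate. A minimal sketch, assuming toy parietal channels and an arbitrary amount of suppression: band-pass at 8–12 Hz, take the Hilbert envelope as instantaneous alpha power in a pre-speech window, and form a lateralization index whose negative values indicate contralateral suppression:

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

rng = np.random.default_rng(3)
fs = 250
t = np.arange(-1.0, 0.0, 1 / fs)          # 1 s of pre-speech "anticipation"

def alpha_power(x):
    """Instantaneous 8-12 Hz power via band-pass + Hilbert envelope."""
    b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="band")
    return np.abs(hilbert(filtfilt(b, a, x))) ** 2

# Toy parietal channels: alpha suppressed contralateral to the attended side.
alpha = np.sin(2 * np.pi * 10 * t)
contra = 0.5 * alpha + rng.normal(0, 0.3, t.size)   # suppressed amplitude
ipsi = 1.0 * alpha + rng.normal(0, 0.3, t.size)

p_contra, p_ipsi = alpha_power(contra).mean(), alpha_power(ipsi).mean()
ali = (p_contra - p_ipsi) / (p_contra + p_ipsi)     # lateralization index
print(f"alpha lateralization index: {ali:+.2f} (negative = contralateral suppression)")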
In addition to modulating the intensity of activity in certain neural populations, attentional selection may also rapidly alter the spectrotemporal tuning of auditory cortical cells and populations to match the target sound features. Using nonspeech stimuli, Fritz and colleagues have shown such effects in nonhuman animals over multiple time scales (Elhilali, Fritz, Chi, & Shamma, 2007; Fritz, Shamma, Elhilali, & Klein, 2003). Evidence that attention can narrow spectral tuning in auditory cortex has also been observed in humans with MEG/EEG and fMRI (Ahveninen et al., 2011). Yet another proposed mechanism, broadly compatible with increased gain and tuning, is that attention aligns intrinsic, ongoing neural oscillations with the temporal evolution of attended speech (Besle et al., 2011; Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008; Lakatos et al., 2009; Schroeder & Lakatos, 2009; Zion Golumbic et al., 2013). This "selective entrainment hypothesis" holds that spontaneous low-frequency neural oscillations (e.g., 1–8 Hz) reflect fluctuations in neural excitability (Bastos et al., 2012) and, therefore, preferred moments for sensory processing. In other words, the brain has natural rhythms, and during different moments in the rhythmic cycle neurons are more or less excitable. Attention, using temporal predictions about precisely when speech information should occur, would align these endogenous oscillations to be in phase with the evoked sensory response, thereby selecting and boosting the attended speech representation. The idea that attention alters the temporal alignment of neural responses has further support from a study using nonspeech sounds that showed greater temporal coherence for an attended rhythm across a bilateral auditory network (Elhilali et al., 2009). Therefore, attention may act through several mechanisms that complement simple gain modulation, or that lead to it via different means. Much work remains to clarify both these additional possibilities and the behavioral circumstances in which they play a role.

41.6 INTERACTIONS BETWEEN ATTENTION, PERCEPTION, AND PREDICTION

As our mechanistic understanding of attention to speech grows, so does the recognition that attention may not simply modulate early sensory activity—boosting or shifting it but leaving the nature of the representations essentially intact. Rather, attention and similar mechanisms may interact with how our perceptual systems fundamentally parse and organize the world, a process known as auditory scene analysis (Bregman, 1990; McDermott, 2009; Shinn-Cunningham, 2008; Snyder, Gregg, Weintraub, & Alain, 2012). Most research on this topic has used nonspeech stimuli, but the principles may readily apply. For instance, it can take time, approximately 1 second, for our auditory systems to segregate or stream multiple objects in a complex scene. This "building up" of streaming may begin without attentional influence (Sussman, Horvath, Winkler, & Orr, 2007) but might require attention in some circumstances (Carlyon, Cusack, Foxton, & Robertson, 2001; Snyder, Alain, & Picton, 2006). These ideas are incorporated in a theory of auditory scene analysis positing that the formation of auditory streams depends primarily on the temporal coherence of neural responses to different object features (Shamma, Elhilali, & Micheyl, 2011). The theory furthermore holds that attention to one object feature is then necessary to bind the other temporally coherent features into a stream. All the other, incoherent, unattended features are left unstreamed, as if you "wrap up all your garbage in the same bundle" (Bregman, 1990, p. 193). This is related to the idea, also shown in vision and cross-modally, that attention "spreads" to encompass the bound features of an object (Cusack, Deeks, Aikman, & Carlyon, 2004). However, this "temporal coherence" theory contradicts some of the evidence mentioned above that attention acts on objects (Alain & Arnott, 2000; Shinn-Cunningham, 2008) and that object formation is largely pre-attentive or automatic (Bregman, 1990). Perhaps, as with the levels of attentional selection, the degree to which attention interacts with basic object formation depends on the intrinsic competition or ambiguity among the sensory representations.
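The core computation of the temporal coherence account is easy to caricature. In the sketch below (a toy stand-in for, not an implementation of, the Shamma, Elhilali, & Micheyl model), six feature channels are driven by one of two independent syllabic-rate modulators; pairwise correlation of the channel envelopes then groups the coherent channels, seeded by "attention" to one of them:

import numpy as np

rng = np.random.default_rng(4)
fs, n = 100, 100 * 30

def smooth(x, ms=150):
    k = np.hanning(int(ms / 1000 * fs))
    return np.convolve(x, k / k.sum(), mode="same")

# Two independent syllabic-rate modulators (two "talkers").
mod_a = smooth(np.abs(rng.normal(0, 1, n)))
mod_b = smooth(np.abs(rng.normal(0, 1, n)))

# Six frequency channels: 0-2 driven by talker A, 3-5 by talker B.
channels = np.array(
    [m + rng.normal(0, 0.02, n) for m in (mod_a, mod_a, mod_a, mod_b, mod_b, mod_b)]
)

# Temporal coherence: pairwise correlation of the channel envelopes.
C = np.corrcoef(channels)

# Grouping seeded by attention to one feature (channel 0): channels whose
# envelopes cohere with it above threshold are bound into the attended stream.
attended_stream = np.flatnonzero(C[0] > 0.5)
print("coherence matrix:\n", np.round(C, 2))
print("channels bound to the attended stream:", attended_stream)

Channels sharing a modulator cohere strongly and are bound together; the incoherent channels are left to the background, as in the "garbage bundle" intuition quoted above.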

The notion that attention interacts fundamentally with speech processing is also consistent with a growing body of research on the perceptual role of context. We mentioned several instances where attention may guide sensory predictions or expectations (Astheimer & Sanders, 2008; Lakatos et al., 2008). Many more studies beyond the scope of this chapter show that the auditory system maintains a continuous, adaptive representation of context at multiple levels. Some expectations are largely automatic and depend on stimulus statistics, as indexed by the well-known MMN EEG response when patterns are violated (Naatanen, Gaillard, & Mantysalo, 1978; Sussman, Ritter, & Vaughan, 1998). However, even this so-called automatic response can be modulated by attention (Nyman et al., 1990; Sussman, Winkler, Huotilainen, Ritter, & Naatanen, 2002; Woldorff, Hillyard, Gallen, Hampson, & Bloom, 1998). Other expectations, still largely unconscious, are more linguistic in nature (Clos et al., 2014) and rely more on long-term memory. These linguistic contextual cues profoundly influence how well speech is processed, particularly when it is degraded (Baumgaertner, Weiller, & Buchel, 2002; Obleser & Kotz, 2009; Pichora-Fuller, Schneider, & Daneman, 1995) and when accompanied by age-related hearing loss, where peripheral impairment has been shown to interact with numerous cognitive and attentional functions (Humes et al., 2012). Still other expectations are conscious, as when anticipating a specific word. Admittedly, not all of these phenomena will use attention-like mechanisms to select the predicted speech content, but some evidently do (e.g., with attention-in-time to emphasize representations of word onsets; Astheimer & Sanders, 2008).

One increasingly prominent perceptual theory, known as "predictive coding," relies inherently on the interaction between sensory input and expectation or context. Bayesian in nature, it holds that predictions formulated at higher levels are fed back to early sensory cortices, which act as comparators between prediction and sensory input, with the ultimate goal of minimizing prediction error. Thus, early sensory activity feeding forward reflects an error signal, or mismatch between expectation and reality. Because the acoustic and linguistic statistics of speech support such rich predictions, this theory is a promising model for characterizing attention to speech as well. Several recent results lend some empirical support to predictive coding. For instance, in a concurrent EEG + MEG study, increased prior knowledge about speech increased activity in IFG but decreased activity in STG, in line with predictive coding (Sohoglu et al., 2012). The timing of these effects suggested that an initial feed-forward draft of the speech allows IFG to activate likely predictions that are then fed back to sensory cortex. In another study, MEG responses localized to STG were consistent with predictive coding when perceiving newly learned words that were similar to known (predictable) words (Gagnepain, Henson, & Davis, 2012). Further clues about predictive coding come from audiovisual speech, suggesting that "top-down" predictions based on vision may be fed back via lower-frequency oscillations (14–15 Hz) and, coupled with them, error signals propagated forward via high-frequency (60–80 Hz) activity (Arnal, Wyart, & Giraud, 2011). Typically, "top-down" implies more abstract representations or rules originating at higher levels and flowing to lower levels of an anatomical or functional hierarchy, whereas "bottom-up" implies information, in this model an error signal, flowing up the hierarchy in the direction of an initial sensory volley. Such mechanistically different roles for different frequency bands agree with recent models of cortical information processing (Bastos et al., 2012; Wang, 2010). It also recalls a seminal nonhuman primate study of volitional versus stimulus-driven attentional control in vision, in which volitional attention began in prefrontal cortex (FEF), exogenous attention initiated in the parietal lobe, and the coherence between them was greater in lower frequencies for volitional control and in higher frequencies for stimulus-driven attention (Buschman & Miller, 2007). Although results from different systems and paradigms should be taken with care, they are at least consistent with the notion that attention to speech could act within a predictive coding framework.
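The comparator arithmetic at the heart of predictive coding can be written in a few lines. This deliberately minimal sketch (one level, one feature, all values simulated) captures only the qualitative pattern reported by Sohoglu et al. (2012): the stronger the top-down prior, the smaller the feed-forward error signal that an STG-like comparator would emit:

import numpy as np

rng = np.random.default_rng(5)
n = 2000
signal = np.sin(np.linspace(0, 40 * np.pi, n))          # the "speech" feature
sensory_input = signal + rng.normal(0, 0.3, n)

def sensory_error(prior_strength):
    """Feed back a prediction; forward activity is the residual error."""
    prediction = prior_strength * signal                 # top-down expectation
    error = sensory_input - prediction                   # comparator output
    return np.mean(error ** 2)                           # "STG-like" activity

for prior in (0.0, 0.5, 1.0):                            # more prior knowledge...
    print(f"prior strength {prior:.1f} -> forward error energy {sensory_error(prior):.2f}")
# ...yields systematically less feed-forward (error) activity.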
41.7 FUTURE DIRECTIONS

Our understanding of attention to speech has advanced considerably in the past half century, but many questions remain. Regarding basic functional neuroanatomy, for instance, how should attention to speech be reconciled with the influential dual-stream theory of speech perception (Hickok & Poeppel, 2007)? The dual-stream model broadly conforms to a dorsal "where/how" pathway for sensorimotor transformation and perception-for-action and a ventral "what" pathway for identifying objects (Goodale & Milner, 1992; Hickok & Poeppel, 2007; O'Reilly, 2010; Scott, 2005). This is consistent with the evidence from attentional selection, with multiple, flexible levels of modulation along the speech hierarchy. However, correspondence with the dual-stream model is less clear for the fronto-temporo-parietal attentional control network. For example, ventral areas for orienting attention overlap those thought to perform sensorimotor (acoustic-articulatory) integration and, particularly in Broca's area, other linguistic perceptual functions. Perhaps the supramodal dorsal fronto-parietal attentional network controls volitional attention to speech as an acoustic object in the world (e.g., to its spatial location) or when using other attributes to segregate the object from others. In contrast, perhaps the inferior frontal attentional system controls volitional attention, as well as stimulus-driven orienting, to identifying attributes of speech (i.e., those that contribute directly to its intelligibility).


Perhaps function is lateralized in the ventral attention network such that right TPJ is devoted to exogenous attention while left TPJ handles acoustic-motor mapping. At present, we lack a framework that integrates these necessary but somewhat incompatible roles.

In terms of mechanisms, most investigators would agree that gain modulation plays a central role in attentional selection of speech. But other plausible mechanisms have not yet been thoroughly vetted. Are some effects that we attribute to gain instead due to increased synchrony or coherence in neural activity, or to altered receptive field tuning in individual auditory cortical cells? Does attentional control selectively entrain endogenous fluctuations to be in phase with incoming speech, thereby boosting it at the expense of distractors? Does alpha activity act as "gating inhibition" and, when suppressed, dis-inhibit the attended speech representations? If so, how would alpha fluctuations relate to the well-known syllabic-rate sensory activity or to behavioral performance (Zoefel & Heil, 2013)? Such interactions between top-down control and sensory selection also lead naturally to the relation between attention and the perceptual segregation of speech streams. Are speech objects formed independently of attention (and then selected by it), or is attention required—and, if so, under which circumstances and at what levels of processing? Is attention organized hierarchically for features and objects, so that attention to any feature always spreads to other features of that speech stream? Does attention to speech operate as "predictive coding," seeking to minimize the error between context-based expectations and reality? Far more empirical evidence is needed, but the proposal fits well with a growing appreciation that the brain often adheres to Bayesian principles (Bastos et al., 2012).

Still other possible attentional mechanisms have hardly been explored with speech, including several already demonstrated in the visual system. For instance, should we conceive of selective attention to speech as a form of biased competition (Desimone & Duncan, 1995; Shinn-Cunningham, 2008) and, if so, how should we think of the high-order "receptive field" within which different stimuli compete? As noted, at the single-neuron and population level, we expect that attention might induce rapid adaptive receptive field changes in how auditory neurons respond to speech. But does it also decorrelate neural activity, which in visual cortex can explain behavioral performance better than gain modulation (Cohen & Maunsell, 2009)? Or does attention perform a kind of "normalizing" computation to yield a winner-take-all selected speech object in certain circumstances (Reynolds & Heeger, 2009)? Already we know that attentional control networks are neuroanatomically largely supramodal, and so, too, could the underlying mechanisms of attention be universal.

As the abundance of open questions makes clear, great effort will be required to characterize the neural bases of attention to speech. But it also demonstrates how our understanding has become more informed, nuanced, and mechanistic over time. We are now at the threshold of characterizing how this profoundly important everyday ability works. After 60 years, we have returned to Cherry's cocktail party.

Acknowledgments

This work was supported by the National Institutes of Health/National Institute on Deafness and Other Communication Disorders (NIH/NIDCD), grant R01-DC008171 (LMM).

References

Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., & Merzenich, M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 98(23), 13367–13372.
Ahveninen, J., Hamalainen, M., Jaaskelainen, I. P., Ahlfors, S. P., Huang, S., Lin, F. H., et al. (2011). Attention-driven auditory cortex short-term plasticity helps segregate relevant sounds from noise. Proceedings of the National Academy of Sciences of the United States of America, 108(10), 4182–4187.
Ahveninen, J., Jaaskelainen, I. P., Raij, T., Bonmassar, G., Devore, S., Hamalainen, M., et al. (2006). Task-modulated "what" and "where" pathways in human auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 103(39), 14608–14613.
Alain, C., & Arnott, S. R. (2000). Selectively attending to auditory objects. Frontiers in Bioscience, 5, D202–D212.
Alain, C., Arnott, S. R., Hevenor, S., Graham, S., & Grady, C. L. (2001). "What" and "where" in the human auditory system. Proceedings of the National Academy of Sciences of the United States of America, 98(21), 12301–12306.
Alho, K., Vorobyev, V. A., Medvedev, S. V., Pakhomov, S. V., Roudas, M. S., Tervaniemi, M., et al. (2003). Hemispheric lateralization of cerebral blood-flow changes during selective listening to dichotically presented continuous speech. Brain Research. Cognitive Brain Research, 17(2), 201–211.
Alho, K., Vorobyev, V. A., Medvedev, S. V., Pakhomov, S. V., Starchenko, M. G., Tervaniemi, M., et al. (2006). Selective attention to human voice enhances brain activity bilaterally in the superior temporal sulcus. Brain Research, 1075(1), 142–150.
Amunts, K., & Zilles, K. (2012). Architecture and organizational principles of Broca's region. Trends in Cognitive Sciences, 16(8), 418–426.
Arnal, L. H., Wyart, V., & Giraud, A. L. (2011). Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nature Neuroscience, 14, 797–801.
Asplund, C. L., Todd, J. J., Snyder, A. P., & Marois, R. (2010). A central role for the lateral prefrontal cortex in goal-directed and stimulus-driven attention. Nature Neuroscience, 13(4), 507–512.
Astheimer, L. B., & Sanders, L. D. (2008). Listeners modulate temporally selective attention during natural speech processing. Biological Psychology, 80(1), 23–34.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76(4), 695–711.


Baumgaertner, A., Weiller, C., & Buchel, C. (2002). Event-related fMRI reveals cortical sites involved in contextual sentence integration. Neuroimage, 16(3 Pt 1), 736–745.
Besle, J., Schevon, C. A., Mehta, A. D., Lakatos, P., Goodman, R. R., McKhann, G. M., et al. (2011). Tuning of the human neocortex to the temporal dynamics of attended events. The Journal of Neuroscience, 31(9), 3176–3185.
Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: Bradford.
Broadbent, D. (1958). Perception and communication. London: Pergamon Press.
Buschman, T. J., & Miller, E. K. (2007). Top-down versus bottom-up control of attention in the prefrontal and posterior parietal cortices. Science, 315(5820), 1860–1862.
Carlyon, R. P., Cusack, R., Foxton, J. M., & Robertson, I. H. (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 27(1), 115–127.
Cherry, E. C. (1953). Some experiments on the recognition of speech with one and with two ears. The Journal of the Acoustical Society of America, 25, 975–979.
Choi, I., Rajaram, S., Varghese, L. A., & Shinn-Cunningham, B. G. (2013). Quantifying attentional modulation of auditory-evoked cortical responses from single-trial electroencephalography. Frontiers in Human Neuroscience, 7, 115.
Clos, M., Langner, R., Meyer, M., Oechslin, M. S., Zilles, K., & Eickhoff, S. B. (2014). Effects of prior information on decoding degraded speech: An fMRI study. Human Brain Mapping, 35(1), 61–74.
Cohen, M. R., & Maunsell, J. H. (2009). Attention improves performance primarily by reducing interneuronal correlations. Nature Neuroscience, 12(12), 1594–1600.
Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3), 201–215.
Corbetta, M., & Shulman, G. L. (2011). Spatial neglect and attention networks. Annual Review of Neuroscience, 34, 569–599.
Coull, J. T., & Nobre, A. C. (1998). Where and when to pay attention: The neural systems for directing attention to spatial locations and to time intervals as revealed by both PET and fMRI. The Journal of Neuroscience, 18(18), 7426–7435.
Cusack, R., Deeks, J., Aikman, G., & Carlyon, R. P. (2004). Effects of location, frequency region, and time course of selective attention on auditory scene analysis. Journal of Experimental Psychology: Human Perception and Performance, 30(4), 643–656.
Delano, P. H., Elgueda, D., Hamame, C. M., & Robles, L. (2007). Selective attention to visual stimuli reduces cochlear sensitivity in chinchillas. The Journal of Neuroscience, 27(15), 4146–4153.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Deutsch, J. A., & Deutsch, D. (1963). Attention: Some theoretical considerations. Psychological Review, 70, 80–90.
Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences of the United States of America, 109(29), 11854–11859.
Ding, N., & Simon, J. Z. (2013). Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. The Journal of Neuroscience, 33(13), 5728–5735.
Elhilali, M., Fritz, J. B., Chi, T. S., & Shamma, S. A. (2007). Auditory cortical receptive fields: Stable entities with plastic abilities. The Journal of Neuroscience, 27(39), 10372–10382.
Elhilali, M., Xiang, J., Shamma, S. A., & Simon, J. Z. (2009). Interaction between attention and bottom-up saliency mediates the representation of foreground and background in an auditory scene. PLoS Biology, 7(6), e1000129.
Fedorenko, E., Duncan, J., & Kanwisher, N. (2012). Language-selective and domain-general regions lie side by side within Broca's area. Current Biology, 22(21), 2059–2062.
Frith, C. D., & Friston, K. J. (1996). The role of the thalamus in "top down" modulation of attention to sound. Neuroimage, 4(3 Pt 1), 210–215.
Fritz, J., Shamma, S., Elhilali, M., & Klein, D. (2003). Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nature Neuroscience, 6(11), 1216–1223.
Fritz, J. B., Elhilali, M., & Shamma, S. A. (2007). Adaptive changes in cortical receptive fields induced by attention to complex sounds. Journal of Neurophysiology, 98(4), 2337–2346.
Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22(7), 615–621.
Galbraith, G. C., & Arroyo, C. (1993). Selective attention and brainstem frequency-following responses. Biological Psychology, 37(1), 3–22.
Gelfand, J. R., & Bookheimer, S. Y. (2003). Dissociating neural mechanisms of temporal sequencing and processing phonemes. Neuron, 38(5), 831–842.
Ghitza, O., Giraud, A. L., & Poeppel, D. (2012). Neuronal oscillations and speech perception: Critical-band temporal envelopes are the essence. Frontiers in Human Neuroscience, 6, 340.
Giraud, A. L., Garnier, S., Micheyl, C., Lina, G., Chays, A., & Chery-Croze, S. (1997). Auditory efferents involved in speech-in-noise intelligibility. Neuroreport, 8(7), 1779–1783.
Giraud, A. L., Kell, C., Thierfelder, C., Sterzer, P., Russ, M. O., Preibisch, C., et al. (2004). Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing. Cerebral Cortex, 14(3), 247–255.
Giraud, A. L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517.
Godey, B., Schwartz, D., de Graaf, J. B., Chauvel, P., & Liegeois-Chauvel, C. (2001). Neuromagnetic source localization of auditory evoked fields and intracerebral evoked potentials: A comparison of data in the same patients. Clinical Neurophysiology, 112(10), 1850–1859.
Goodale, M. A., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in Neurosciences, 15(1), 20–25.
Grady, C. L., Van Meter, J. W., Maisog, J. M., Pietrini, P., Krasuski, J., & Rauschecker, J. P. (1997). Attention-related modulation of activity in primary and secondary auditory cortex. Neuroreport, 8(11), 2511–2516.
Hashimoto, R., Homae, F., Nakajima, K., Miyashita, Y., & Sakai, K. L. (2000). Functional differentiation in the human auditory and language areas revealed by a dichotic listening task. Neuroimage, 12(2), 147–158.
Heinrich, A., Carlyon, R. P., Davis, M. H., & Johnsrude, I. S. (2011). The continuity illusion does not depend on attentional state: fMRI evidence from illusory vowels. Journal of Cognitive Neuroscience, 23(10), 2675–2689.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402.
Hill, K. T., & Miller, L. M. (2009). Auditory attentional control and selection during cocktail party listening. Cerebral Cortex, 20(3), 583–590.
Hillyard, S. A., Hink, R. F., Schwent, V. L., & Picton, T. W. (1973). Electrical signs of selective attention in the human brain. Science, 182(4108), 177–180.
Hink, R. F., & Hillyard, S. A. (1976). Auditory evoked potentials during selective listening to dichotic speech messages. Perception & Psychophysics, 20(4), 236–242.


Hopf, J. M., Luck, S. J., Boelmans, K., Schoenfeld, M. A., Boehler, C. N., Rieger, J., et al. (2006). The neural site of attention matches the spatial scale of perception. The Journal of Neuroscience, 26(13), 3532–3540.
Huang, S., Belliveau, J. W., Tengshe, C., & Ahveninen, J. (2012). Brain networks of novelty-driven involuntary and cued voluntary auditory attention shifting. PLoS One, 7(8), e44062.
Hubel, D. H., Henson, C. O., Rupert, A., & Galambos, R. (1959). Attention units in the auditory cortex. Science, 129(3358), 1279–1280.
Hugdahl, K., Law, I., Kyllingsbaek, S., Bronnick, K., Gade, A., & Paulson, O. B. (2000). Effects of attention on dichotic listening: An 15O-PET study. Human Brain Mapping, 10(2), 87–97.
Humes, L. E., Dubno, J. R., Gordon-Salant, S., Lister, J. J., Cacace, A. T., Cruickshanks, K. J., et al. (2012). Central presbycusis: A review and evaluation of the evidence. Journal of the American Academy of Audiology, 23(8), 635–666.
Ihlefeld, A., & Shinn-Cunningham, B. (2008). Disentangling the effects of spatial cues on selection and formation of auditory objects. The Journal of the Acoustical Society of America, 124(4), 2224–2235.
James, W. (1890). The principles of psychology. New York, NY: Dover Publications.
Jancke, L., Buchanan, T. W., Lutz, K., & Shah, N. J. (2001). Focused and nonfocused attention in verbal and emotional dichotic listening: An fMRI study. Brain and Language, 78(3), 349–363.
Jancke, L., Mirzazade, S., & Shah, N. J. (1999). Attention modulates activity in the primary and the secondary auditory cortex: A functional magnetic resonance imaging study in human subjects. Neuroscience Letters, 266(2), 125–128.
Jensen, O., & Mazaheri, A. (2010). Shaping functional architecture by oscillatory alpha activity: Gating by inhibition. Frontiers in Human Neuroscience, 4, 186.
Kerlin, J. R., Shahin, A. J., & Miller, L. M. (2010). Attentional gain control of ongoing cortical speech representations in a "cocktail party". The Journal of Neuroscience, 30(2), 620–628.
Kidd, G., Jr., Arbogast, T. L., Mason, C. R., & Gallun, F. J. (2005). The advantage of knowing where to listen. The Journal of the Acoustical Society of America, 118(6), 3804–3815.
Kravitz, D. J., & Behrmann, M. (2011). Space-, object-, and feature-based attention interact to organize visual scenes. Attention, Perception, & Psychophysics, 73(8), 2434–2447.
Krumbholz, K., Nobis, E. A., Weatheritt, R. J., & Fink, G. R. (2009). Executive control of spatial attention shifts in the auditory compared to the visual modality. Human Brain Mapping, 30(5), 1457–1469.
Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment of neuronal oscillations as a mechanism of attentional selection. Science, 320(5872), 110–113.
Lakatos, P., O'Connell, M. N., Barczak, A., Mills, A., Javitt, D. C., & Schroeder, C. E. (2009). The leading sense: Supramodal control of neurophysiological context by attention. Neuron, 64(3), 419–430.
Lange, K., & Roder, B. (2006). Orienting attention to points in time improves stimulus processing both within and across modalities. Journal of Cognitive Neuroscience, 18(5), 715–729.
Larson, E., & Lee, A. K. (2012). The cortical dynamics underlying effective switching of auditory spatial attention. Neuroimage, 64, 365–370.
Lavie, N. (2005). Distracted and confused?: Selective attention under load. Trends in Cognitive Sciences, 9(2), 75–82.
Lee, A. K., Rajaram, S., Xia, J., Bharadwaj, H., Larson, E., Hamalainen, M. S., et al. (2012). Auditory selective attention reveals preparatory activity in different cortical regions for selection based on source location and source pitch. Frontiers in Neuroscience, 6, 190.
Lipschutz, B., Kolinsky, R., Damhaut, P., Wikler, D., & Goldman, S. (2002). Attention-dependent changes of activation and connectivity in dichotic listening. Neuroimage, 17(2), 643–656.
Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54(6), 1001–1010.
Macaluso, E. (2010). Orienting of spatial attention and the interplay between the senses. Cortex, 46(3), 282–297.
Maeder, P. P., Meuli, R. A., Adriani, M., Bellmann, A., Fornari, E., Thiran, J. P., et al. (2001). Distinct pathways involved in sound recognition and localization: A human fMRI study. Neuroimage, 14(4), 802–816.
McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19(22), R1024–R1027.
Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), 233–236.
Miller, J. M., Sutton, D., Pfingst, B., Ryan, A., Beaton, R., & Gourevitch, G. (1972). Single cell activity in the auditory cortex of Rhesus monkeys: Behavioral dependency. Science, 177(4047), 449–451.
Naatanen, R., Gaillard, A. W., & Mantysalo, S. (1978). Early selective-attention effect on evoked potential reinterpreted. Acta Psychologica (Amsterdam), 42(4), 313–329.
Nakai, T., Kato, C., & Matsuo, K. (2005). An fMRI study to investigate auditory attention: A model of the cocktail party phenomenon. Magnetic Resonance in Medical Sciences, 4(2), 75–82.
Novick, J. M., Trueswell, J. C., & Thompson-Schill, S. L. (2005). Cognitive control and parsing: Reexamining the role of Broca's area in sentence comprehension. Cognitive, Affective & Behavioral Neuroscience, 5(3), 263–281.
Nyman, G., Alho, K., Laurinen, P., Paavilainen, P., Radil, T., Reinikainen, K., et al. (1990). Mismatch negativity (MMN) for sequences of auditory and visual stimuli: Evidence for a mechanism specific to the auditory modality. Electroencephalography and Clinical Neurophysiology, 77(6), 436–444.
Obleser, J., Herrmann, B., & Henry, M. J. (2012). Neural oscillations in speech: Don't be enslaved by the envelope. Frontiers in Human Neuroscience, 6, 250.
Obleser, J., & Kotz, S. A. (2009). Expectancy constraints in degraded speech modulate the language comprehension network. Cerebral Cortex, 20, 633–640.
Obleser, J., & Weisz, N. (2012). Suppressed alpha oscillations predict intelligibility of speech and its acoustic details. Cerebral Cortex, 22(11), 2466–2477.
Obleser, J., Wise, R. J., Alex Dresner, M., & Scott, S. K. (2007). Functional integration across brain regions improves speech perception under adverse listening conditions. The Journal of Neuroscience, 27(9), 2283–2289.
O'Leary, D. S., Andreason, N. C., Hurtig, R. R., Hichwa, R. D., Watkins, G. L., Ponto, L. L., et al. (1996). A positron emission tomography study of binaurally and dichotically presented stimuli: Effects of level of language and directed attention. Brain and Language, 53(1), 20–39.
O'Reilly, R. C. (2010). The what and how of prefrontal cortical organization. Trends in Neurosciences, 33(8), 355–361.
Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.
Petkov, C. I., Kang, X., Alho, K., Bertrand, O., Yund, E. W., & Woods, D. L. (2004). Attentional modulation of human auditory cortex. Nature Neuroscience, 7(6), 658–663.
Pichora-Fuller, M. K., Schneider, B. A., & Daneman, M. (1995). How young and old adults listen to and remember speech in noise. The Journal of the Acoustical Society of America, 97(1), 593–608.

F. PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL 514 41. NEURAL MECHANISMS OF ATTENTION TO SPEECH

Posner, M. I., Snyder, C. R., & Davidson, B. J. (1980). Attention and Teder, W., Kujala, T., & Naatanen, R. (1993). Selection of speech mes- the detection of signals. Journal of Experimental Psychology, 109(2), sages in free-field listening. Neuroreport, 5(3), 307309. 160174. Treisman, A. (1960). Contextual cues in selective listening. The Power, A. J., Lalor, E. C., & Reilly, R. B. (2011). Endogenous auditory Quarterly Journal of Experimental Psychology, 12(4), 242248. spatial attention modulates obligatory sensory activity in auditory Tuomainen, J., Savela, J., Obleser, J., & Aaltonen, O. (2013). Attention cortex. Cerebral Cortex, 21(6), 12231230. modulates the use of spectral attributes in vowel discrimination: Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of Behavioral and event-related potential evidence. Brain Research, attention. Neuron, 61(2), 168185. 1490, 170183. Rinne, T., Balk, M. H., Koistinen, S., Autti, T., Alho, K., & Sams, M. Voisin, J., Bidet-Caulet, A., Bertrand, O., & Fonlupt, P. (2006). (2008). Auditory selective attention modulates activation of Listening in silence activates auditory areas: A functional mag- human inferior colliculus. Journal of Neurophysiology, 100(6), netic resonance imaging study. The Journal of Neuroscience, 26(1), 33233327. 273278. Rogalsky, C., & Hickok, G. (2009). Selective attention to semantic and von Kriegstein, K., Eger, E., Kleinschmidt, A., & Giraud, A. L. (2003). syntactic features modulates sentence processing networks in Modulation of neural responses to speech by directing attention anterior temporal cortex. Cerebral Cortex, 19(4), 786796. to voices or verbal content. Brain Research Cognitive Brain Research, Rogalsky, C., & Hickok, G. (2011). The role of Broca’s area in sentence 17(1), 4855. comprehension. Journal of Cognitive Neuroscience, 23(7), 16641680. Wang, X. J. (2010). Neurophysiological and computational principles Salmi, J., Rinne, T., Degerman, A., Salonen, O., & Alho, K. (2007). of cortical rhythms in cognition. Physiological Reviews, 90(3), Orienting and maintenance of spatial attention in audition and 11951268. vision: Multimodal and modality-specific brain activations. Brain Wild, C. J., Yusuf, A., Wilson, D. E., Peelle, J. E., Davis, M. H., & Structure & Function, 212(2), 181194. Johnsrude, I. S. (2012). Effortful listening: The processing of Samuel, A. G., & Ressler, W. H. (1986). Attention within auditory degraded speech depends critically on attention. The Journal of word perception: Insights from the phonemic restoration illusion. Neuroscience, 32(40), 1401014021. Journal of Experimental Psychology Human Perception and Woldorff, M. G., Gallen, C. C., Hampson, S. A., Hillyard, S. A., Performance, 12(1), 7079. Pantev, C., Sobel, D., et al. (1993). Modulation of early sensory Schroeder, C. E., & Lakatos, P. (2009). Low-frequency neuronal oscil- processing in human auditory cortex during auditory selective lations as instruments of sensory selection. Trends in attention. Proceedings of the National Academy of Sciences of the Neurosciences, 32(1), 918. United States of America, 90(18), 87228726. Scott, S. K. (2005). Auditory processing—speech, space and auditory Woldorff, M. G., Hillyard, S. A., Gallen, C. C., Hampson, S. R., & objects. Current Opinion in Neurobiology, 15(2), 197201. Bloom, F. E. (1998). Magnetoencephalographic recordings demon- Shahin, A. J., Bishop, C. W., & Miller, L. M. (2009). 
Neural mechan- strate attentional modulation of mismatch-related neural activity isms for illusory filling-in of degraded speech. Neuroimage, 44(3), in human auditory cortex. Psychophysiology, 35(3), 283292. 11331143. Wood, N. L., & Cowan, N. (1995). The cocktail party phenomenon Shamma, S. A., Elhilali, M., & Micheyl, C. (2011). Temporal coher- revisited: Attention and memory in the classic selective listening ence and attention in auditory scene analysis. Trends in procedure of Cherry (1953). Journal of Experimental Psychology Neurosciences, 34(3), 114123. General, 124(3), 243262. Shinn-Cunningham, B. G. (2008). Object-based auditory and visual Woods, D. L., Hillyard, S. A., & Hansen, J. C. (1984). Event-related attention. Trends in Cognitive Sciences, 12(5), 182186. brain potentials reveal similar attentional mechanisms during Shomstein, S., & Yantis, S. (2004). Control of attention shifts between selective listening and shadowing. Journal of Experimental vision and audition in human cortex. The Journal of Neuroscience, Psychology Human Perception and Performance, 10(6), 761777. 24(47), 1070210706. Woods, D. L., Stecker, G. C., Rinne, T., Herron, T. J., Cate, A. D., Snyder, J. S., Alain, C., & Picton, T. W. (2006). Effects of attention on Yund, E. W., et al. (2009). Functional maps of human auditory neuroelectric correlates of auditory stream segregation. Journal of cortex: Effects of acoustic features and attention. PLoS One, 4(4), Cognitive Neuroscience, 18(1), 113. e5183. Snyder, J. S., Gregg, M. K., Weintraub, D. M., & Alain, C. (2012). Yost, W. A., Dye, R. H., Jr., & Sheft, S. (1996). A simulated “cocktail Attention, awareness, and the perception of auditory scenes. party” with up to three sound sources. Perception & Psychophysics, Front Psychology, 3, 15. 58(7), 10261036. Sohoglu, E., Peelle, J. E., Carlyon, R. P., & Davis, M. H. (2012). Zatorre, R. J., Meyer, E., Gjedde, A., & Evans, A. C. (1996). PET stud- Predictive top-down integration of prior knowledge during ies of phonetic processing of speech: Review, replication, and speech perception. The Journal of Neuroscience, 32(25), 84438453. reanalysis. Cerebral Cortex, 6(1), 2130. Sussman, E., Ritter, W., & Vaughan, H. G., Jr. (1998). Attention Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, affects the organization of auditory input associated with the mis- C. A., McKhann, G. M., et al. (2013). Mechanisms underlying match negativity system. Brain Research, 789(1), 130138. selective neuronal tracking of attended speech at a “cocktail Sussman, E., Winkler, I., Huotilainen, M., Ritter, W., & Naatanen, R. party”. Neuron, 77(5), 980991. (2002). Top-down effects can modify the initially stimulus-driven Zoefel, B., & Heil, P. (2013). Detection of near-threshold sounds is auditory organization. Brain Research Cognitive Brain Research, 13 independent of EEG phase in common frequency bands. Front (3), 393405. Psychology, 4, 262. Sussman, E. S., Horvath, J., Winkler, I., & Orr, M. (2007). The role of attention in the formation of auditory streams. Perception & Psychophysics, 69(1), 136152.

CHAPTER 42

Audiovisual Speech Integration: Neural Substrates and Behavior

Michael S. Beauchamp
Department of Neurosurgery and Core for Advanced MRI, Baylor College of Medicine, Houston, TX, USA

Speech is the most important form of human communication. Speech perception is multisensory, with both an auditory component (the talker's voice) and a visual component (the talker's face). When the auditory speech signal is compromised because of auditory noise or hearing loss, the visual speech signal gains in importance (Bernstein, Auer, & Takayanagi, 2004; Ross et al., 2007; Sumby & Pollack, 1954). In the case of profound deafness, lip-reading by itself can be sufficient for speech perception (Suh, Lee, Kim, Chung, & Oh, 2009). Visual speech information is also beneficial in cases of partial hearing loss; older adults with hearing impairments identify visual-only speech better than older adults with normal hearing (Tye-Murray, Sommers, & Spehar, 2007).

42.1 NEUROARCHITECTURE OF AUDIOVISUAL SPEECH INTEGRATION

The most popular technique for examining human brain function is blood-oxygen level-dependent functional magnetic resonance imaging (BOLD fMRI) (Friston, 2009), and neurophysiological studies of multisensory speech are no exception (Beauchamp, Lee, Argall, & Martin, 2004; Blank & von Kriegstein, 2013; Lee & Noppeney, 2011; Miller & D'Esposito, 2005; Nath & Beauchamp, 2011; Noppeney, Josephs, Hocking, Price, & Friston, 2008; Okada, Venezia, Matchin, Saberi, & Hickok, 2013; Wilson, Molnar-Szakacs, & Iacoboni, 2008). The multisensory speech network in a typical subject is easily identifiable with fMRI and includes regions of occipital, temporal, parietal, and frontal cortex (Figure 42.1A: left and right hemisphere in one subject). The most consistent activations (found in every hemisphere in every subject; example subject in Figure 42.1B) are in three regions: (i) visual cortex, especially lateral extrastriate motion-sensitive areas, including areas MT and MST; (ii) auditory cortex and auditory association areas on the superior temporal gyrus; and (iii) multisensory areas in the posterior superior temporal sulcus (STS) and adjacent superior temporal gyrus and middle temporal gyrus (pSTS/G).

A working hypothesis for the function of these areas is that extrastriate visual areas process the complex biological motion signals carried by the talker's facial motion and form a representation of visual speech. Auditory association areas process the complex auditory information necessary to form a representation of auditory speech sounds. These representations must abstract away from the physical stimulus properties to represent the key features of speech to be behaviorally useful; for instance, it is important to integrate auditory and visual speech from a talker even if we have never seen or heard them before. Multisensory areas in the STS then integrate the visual and auditory speech representations to decide the most likely speech percept produced by the talker. A critical feature of speech is that the amount of information carried by the auditory and visual modalities varies tremendously from moment to moment. According to Bayesian models of perception, ideal observers should weight each modality by its reliability (the inverse of its variance). For instance, in a conversation in a very loud room, the visual information is more reliable than the auditory information and should be given more weight; in a dark room, the auditory information is more reliable and should receive more weight. In a BOLD fMRI experiment, Nath and Beauchamp (2011) examined the neural mechanisms for this dynamic weighting of auditory and visual information.
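The reliability-weighting rule can be stated precisely. The sketch below is a minimal illustration, not code from the studies discussed: the function name and all numerical values are invented, and it assumes independent Gaussian noise in each modality, under which inverse-variance weighting is the minimum-variance way to combine the two estimates.

```python
# Minimal sketch of inverse-variance (reliability) weighting of two cues.
# All numbers are hypothetical; in practice var_a and var_v would reflect
# the listener's sensory noise in each modality.

def fuse(est_a, var_a, est_v, var_v):
    """Combine auditory and visual estimates of the same speech feature.

    Each cue is weighted by its reliability (1 / variance), so the noisier
    cue contributes less to the fused estimate.
    """
    w_a = 1.0 / var_a
    w_v = 1.0 / var_v
    fused = (w_a * est_a + w_v * est_v) / (w_a + w_v)
    fused_var = 1.0 / (w_a + w_v)  # fused estimate is less variable than either cue
    return fused, fused_var

# Loud room: auditory cue is noisy, so the fused estimate follows vision.
print(fuse(est_a=0.2, var_a=4.0, est_v=0.8, var_v=0.25))  # close to 0.8

# Dark room: visual cue is noisy, so the fused estimate follows audition.
print(fuse(est_a=0.2, var_a=0.25, est_v=0.8, var_v=4.0))  # close to 0.2
```

Note that the fused variance is always smaller than either single-modality variance, which is one formal statement of the benefit of audiovisual speech.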


FIGURE 42.1 The neural network for processing audiovisual speech measured with BOLD fMRI. (A) Lateral views of the partially inflated left hemisphere (top) and right hemisphere (bottom). Dark gray shows sulcal depths and light gray shows gyral crowns. Colored brain regions showed a significant increase in BOLD fMRI signal to audiovisual speech (recordings of meaningless syllables) compared with fixation baseline. All activations thresholded at F > 5.0, P < 0.05, corrected for false discovery rate. Color scale shows t-values for contrast of all syllables versus fixation. (B) Anatomical-functional regions of interest (ROIs) created from activation map in (A). The STS ROI (green) contains voxels responsive to both auditory and visual speech greater than baseline in the posterior STS. The auditory cortex ROI (blue) contains voxels responsive to auditory speech greater than baseline within Heschl's gyrus. The visual cortex ROI (red) contains voxels responsive to visual speech greater than baseline within extrastriate lateral occipitotemporal cortex. The IFG ROI (purple) contains voxels responsive to both auditory and visual speech greater than baseline within the opercular region of the inferior frontal gyrus and the inferior portion of the precentral sulcus.

FIGURE 42.2 fMRI functional connectivity within the audiovisual speech network. (A) Auditory-reliable speech: undegraded auditory speech (loudspeaker icon) with degraded visual speech (single video frame shown). (B) Visual-reliable speech: undegraded visual speech with degraded auditory speech. (C) Functional connectivity during auditory-reliable speech in a representative subject. The green region on the cortical surface model shows the location of the pSTS/G ROI; the blue region shows the location of the auditory cortex ROI; and the red region shows the location of the lateral extrastriate visual cortex ROI. The top number is the path coefficient between pSTS/G and auditory cortex. The bottom number is the path coefficient between the pSTS/G and visual cortex. (D) Functional connectivity during visual-reliable speech (same subject and color scheme as in C).

Auditory speech information was degraded by adding masking noise, and visual speech information was degraded by blurring the stimulus (across experiments, different levels of degradation were used with similar results). To assess functional connectivity, the psychophysiological interaction (PPI) method was used. Functional connections between STS and auditory association areas and lateral extrastriate visual cortical areas were observed, but not between STS and primary auditory or primary visual cortex. As shown in Figure 42.2, increased functional connectivity between the STS and auditory association cortex was observed when the auditory modality was more reliable, and increased functional connectivity between the STS and extrastriate visual cortex was observed when the visual modality was more reliable, even when the reliability changed rapidly during rapid event-related presentation of successive words. This finding suggests that changes in STS functional connectivity may be an important neural mechanism underlying the perception of noisy speech. Changes in functional connectivity may also be important for audiovisual learning (Driver & Noesselt, 2008; Powers, Hevey, & Wallace, 2012).

fMRI studies have provided a wealth of information about audiovisual speech perception, but they are constrained by the limitations of the BOLD signal. Most critically, BOLD is an indirect measure of neural function that relies on slow changes in the cerebral vasculature. The BOLD response to a stimulus consisting of a single syllable (duration ~0.5 s) extends for 15 s; most fMRI studies have a temporal resolution of only 2 s. This temporal resolution is thousands of times slower than the underlying neural information processing (duration of a sodium-based action potential <1 ms). Because speech is itself a rapidly changing signal, and because the speech information must be processed rapidly by the brain to be behaviorally useful, fMRI is far too slow for direct observation of the underlying neural events.

In contrast, studies that use electrical and magnetic measurements can directly measure neural activity with high temporal precision. In the first approach, event-related potentials are recorded. Transient responses to audiovisual syllables display peaks between 100 and 200 ms after stimulus onset (Bernstein, Auer, Wagner, & Ponton, 2008). In a second approach, induced oscillations are recorded. These have the advantage of allowing for the examination of sustained neuronal activity that is not phase-locked to the stimulus and can extend for several hundred milliseconds after stimulus onset (Schepers, Schneider, Hipp, Engel, & Senkowski, 2013).

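The PPI analysis used in the connectivity results above can be summarized as a regression that asks whether the coupling between a seed region and a target region changes with the psychological context. The sketch below is a schematic illustration under simplifying assumptions (simulated signals, no hemodynamic convolution, no nuisance regressors); it is not the analysis pipeline of the original study.

```python
# Minimal sketch of a psychophysiological interaction (PPI) regression.
# Simulated data; a real PPI would use deconvolved fMRI time series and an HRF.
import numpy as np

rng = np.random.default_rng(0)
n = 200
context = np.repeat([1.0, -1.0], n // 2)  # e.g., auditory-reliable vs. visual-reliable blocks
seed = rng.standard_normal(n)             # pSTS/G seed time course (simulated)
ppi = seed * context                      # the psychophysiological interaction term

# Simulate a target region (e.g., auditory cortex) whose coupling with the
# seed is stronger in the auditory-reliable context (positive PPI effect).
target = 0.2 * seed + 0.5 * ppi + rng.standard_normal(n)

# Design matrix: intercept, context, seed, and the interaction.
X = np.column_stack([np.ones(n), context, seed, ppi])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print(f"context-dependent coupling (PPI coefficient): {beta[3]:.2f}")
```

A significant coefficient on the interaction term, rather than on the seed time course alone, is what licenses the claim that connectivity itself changed with stimulus reliability.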


FIGURE 42.3 Electrocorticography of visual cortex responses to audiovisual speech. (A) A patient being treated for intractable epilepsy was implanted with subdural electrodes (red dots on cortical surface model, right). Brain activity was recorded from electrodes over visual cortex (red dots inside ellipse) as the subject was presented with noisy audiovisual speech (visual speech with auditory white noise, AnoiseV). A time-frequency analysis was used to determine the neural response at different frequencies (y-axis) at each time (x-axis) after stimulus onset (dashed vertical line). Color scale indicates amplitude of change from pre-stimulus baseline. (B) Time-frequency response of visual cortex to audiovisual speech (undegraded visual and auditory). (C) Response to auditory-only speech. (D) Amplitude of high-frequency responses to each stimulus condition averaged across the entire course of the response. Significant difference between AnoiseV and AV speech, P = 0.0016.
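The high-frequency power measure described in the caption can be approximated by band-pass filtering the recorded voltage and taking its amplitude envelope. The following sketch is generic and hypothetical rather than the authors' pipeline: the sampling rate, frequency band, and baseline window are placeholder choices, and the input is simulated noise rather than real electrocorticography data.

```python
# Minimal sketch of extracting high-frequency (>40 Hz) power from an
# electrocorticography channel. Parameter values are illustrative only.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                           # sampling rate in Hz (assumed)
t = np.arange(-0.5, 1.0, 1.0 / fs)    # 0.5 s pre-stimulus baseline, 1 s post
signal = np.random.randn(t.size)      # stand-in for a recorded voltage trace

# Band-pass the signal in a high-gamma range (placeholder band: 70-150 Hz).
b, a = butter(4, [70 / (fs / 2), 150 / (fs / 2)], btype="band")
envelope = np.abs(hilbert(filtfilt(b, a, signal)))  # instantaneous amplitude

# Express post-stimulus power as percent change from the pre-stimulus baseline.
baseline = envelope[t < 0].mean()
percent_change = 100.0 * (envelope - baseline) / baseline
print(percent_change[t > 0].mean())
```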

EEG and MEG have provided important insights into the neural basis of multisensory speech perception. Another technique known as electrocorticography has two important advantages (Allison, Puce, Spencer, & McCarthy, 1999). First, electrocorticography offers superior spatial resolution: activity recorded from each approximately 1-mm electrode on the cortical surface is generated by neurons in close proximity to the electrode. In contrast, EEG and MEG suffer from the inverse problem: for any distribution of electrical or magnetic fields measured outside the head, there are an infinite number of patterns of brain electrical activity that could have resulted in the observed fields. Practically speaking, this limits the spatial resolution, making it difficult to untangle the location of sources that are in close proximity, as are the areas in the audiovisual speech network. Second, electrocorticography signals have much greater signal-to-noise than EEG/MEG, especially for high-frequency responses, because the electrode is much closer to the source, eliminating filtering by the cerebrospinal fluid and skull that lies between the cortex and the EEG/MEG sensors. High-frequency activity is ubiquitous in the human brain (Canolty et al., 2006; Crone, Miglioretti, Gordon, & Lesser, 1998; Schepers, Hipp, Schneider, Roder, & Engel, 2012) and is thought to be an excellent population-level measure of neuronal activity correlated with single-neuron spiking (for a recent review see Lachaux, Axmacher, Mormann, Halgren, & Crone, 2012; Logothetis, Kayser, & Oeltermann, 2007; Nir et al., 2007; Ray & Maunsell, 2011).

Most electrocorticography studies of speech perception have investigated auditory-only speech, either in a single brain area, such as auditory cortex (Chang et al., 2010), or in a more distributed network (Canolty et al., 2007; Pei et al., 2011; Towle et al., 2008). We collected electrocorticography data while subjects were presented with clear and noisy auditory and visual speech. Our hypothesis was that multisensory interactions between auditory and visual speech processing should be most pronounced when the visual speech information is most important, such as when noisy auditory speech information is present. We tested this hypothesis by examining the responses to visual speech in visual cortex.

In the noisy audiovisual and audiovisual speech conditions (Figure 42.3A and B), the subject viewed a video of a talker speaking a word. This is a powerful visual stimulus that evoked a strong response in visual electrodes. As is commonly reported for responses to visual stimuli, the response differed by frequency. For low frequencies (<20 Hz) there was a decrease in power relative to pre-stimulus baseline (blue colors at bottom of plot). For high frequencies (>40 Hz) there was a large increase in power relative to pre-stimulus baseline (yellow and red colors at middle of each plot). These high-frequency responses are thought to reflect spiking activity in neurons underlying the electrode and therefore served as our primary measure of interest. As is expected for visual cortex, there was a negligible response to stimuli without a visual component (auditory-only speech, Figure 42.3C).

Responses to noisy audiovisual speech (video of a speaker accompanied by auditory white noise) were stronger than the responses to AV speech (Figure 42.3D). This supports a simple model in which the goal of the cortical speech network is to extract meaning from AV speech as quickly and efficiently as possible. When the auditory component of the speech input is sufficient to extract meaning, the visual response is not needed and therefore not enhanced. When the auditory stimulus does not contain sufficient information to extract meaning (as is the case for visual-only speech or visual speech with auditory white noise), the visual response is needed to extract meaning and activity in visual cortex is upregulated.

42.2 BEHAVIORAL APPROACHES FOR STUDYING AUDIOVISUAL SPEECH INTEGRATION

There are a number of behavioral approaches to studying the contributions of the auditory and visual modalities to audiovisual speech perception. In the most straightforward method, noisy auditory speech is accompanied by clear visual speech, such as a video of a talker saying the words that match the auditory signal (Figure 42.4A). The information available from viewing the face of the talker improves the ability to understand speech. This improvement is equivalent to an approximately 10-dB increase in signal-to-noise ratio; at increased levels of auditory noise, this can correspond to a dramatic improvement in the number of words understood (Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007b).

At a neural level, we might imagine that there are different pools of neurons that represent different auditory and visual speech features. When presented with a noisy auditory phoneme, many pools representing many different auditory phonological representations are weakly active, with no single pool reaching an activity level sufficient to form a perceptual decision, assuming a Bayesian evidence accumulation model (Beck et al., 2008). When visual information is added, pools of neurons representing the visual speech information respond strongly. Combining the information available in the visual and auditory-responsive pools of neurons allows a perceptual decision to be reached.
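The pooling idea in the preceding paragraph can be made concrete with a toy calculation. All evidence values below are invented for illustration; the point is only that weak, ambiguous auditory evidence becomes decisive once statistically independent visual evidence is added, because log-likelihoods of independent cues sum.

```python
# Minimal sketch of pooling auditory and visual evidence across candidate
# phonemes. The evidence values are invented; in a probabilistic population
# code they would come from the activity of phoneme-selective neural pools.
candidates = ["ba", "da", "ga"]

# Log-likelihood of each candidate under a noisy auditory cue alone:
# evidence is weak and spread across candidates (no clear winner).
aud = {"ba": -1.0, "da": -1.1, "ga": -1.2}

# Visual evidence strongly favors phonemes whose mouth shape matches.
vis = {"ba": -3.0, "da": -0.4, "ga": -0.5}

# For independent cues, log-likelihoods add; the decision is the argmax.
combined = {p: aud[p] + vis[p] for p in candidates}
print(max(combined, key=combined.get))  # "da": only the combined evidence is decisive
```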

FIGURE 42.4 Behavioral responses to incongruent audiovisual speech. (A) Noisy auditory speech with and without congruent visual speech. If a degraded auditory word (e.g., bread; auditory filtering indicated by blurriness of printed word) is presented, then the subject only rarely is able to correctly identify it (left bar in plot). If the same degraded auditory word is accompanied by a video of a talker speaking the word, then the subject is usually able to identify it (right bar in plot). Data replotted from Ross, Saint-Amour, Leavitt, Javitt, and Foxe (2007a) with permission from the publisher. (B) Incongruent audiovisual speech consisting of a visual /ma/ and an auditory /na/ (or vice versa). If the visual component of speech is degraded by blurring (left column), then subjects report a percept corresponding to the auditory component of the stimulus (indicated in plot by solid blue bar and complementary blurry red bar). If the auditory component of speech is degraded by filtering (right column), then subjects report a percept corresponding to the visual component of the stimulus (indicated in plot by blurry blue bar and complementary solid red bar). (C) Incongruent audiovisual speech consisting of a visual /ga/ and an auditory /ba/ (or a visual /pa/ and an auditory /ka/). In 43% of trials (green bar in plot) subjects reported a McGurk fusion percept of /da/ (or /ta/), corresponding to neither the auditory nor the visual components of the stimulus.

In a more complex approach, auditory speech and visual speech are presented together, but the modalities are incongruent; the auditory modality signals one syllable at the same time as the visual modality signals a different one (Figure 42.4B). The advantage of this approach is that it pits the auditory and visual modalities against each other, allowing an evaluation of their relative contributions to perception. If the reported percept matches the auditory modality, then we infer that the auditory modality was most important; if the reported percept matches the visual modality, then we infer that the visual modality was most important. For instance, if a visual /ma/ is paired with an auditory /na/ that is heavily degraded, subjects will often report perceiving /ma/, implying that they used the visual information more than the auditory information in forming their decision (Nath & Beauchamp, 2011). However, it is not possible to directly quantify the contribution of each modality to each response as it is with other perceptual decisions. For continuous variables, such as location, the percept of an incongruent audiovisual stimulus (e.g., a beep in one location and a flash in a different location) can vary anywhere between the locations of the two unisensory stimuli. A report of a percept at a location exactly midway between the locations of the two stimuli indicates that both modalities contributed equally to the percept, with locations closer to one or the other of the two stimuli indicating more influence of that modality. However, speech perception is categorical: any given speech sound is perceived as a particular phoneme, and subjects in general do not make intermediate responses. Therefore, it is only possible to state that one modality or the other was dominant. Subjects are often aware that the auditory and visual modalities are incongruent, but this does not change their conscious percept of a single syllable.

At a neural level, we could imagine that adding auditory noise weakens the response of the pool of neurons representing the auditory speech representation. At high levels of auditory noise, the response of the neurons representing the incongruent visual speech representation is stronger than the response of the neurons representing the auditory speech, and thus the reported percept corresponds to the visual speech representation.

A third approach to studying multisensory integration in speech also involves the presentation of incongruent audiovisual syllables (Figure 42.4C). For most incongruent audiovisual syllables, subjects perceive either the auditory component of the stimulus or the visual component. However, for a few incongruent audiovisual syllables, subjects perceive a third syllable that corresponds to neither the auditory nor the visual component of the stimulus. Two of these unusual incongruent syllables, first discovered serendipitously by McGurk and MacDonald (1976), consist of an auditory /ba/ paired with a visual /ga/ (perceived as /da/) and an auditory /pa/ paired with a visual /ka/ (perceived as /ta/). These illusory percepts have come to be known as the McGurk effect.

The McGurk effect is widely used in research for at least two reasons. First, because the percept corresponds to neither the auditory nor the visual component of the stimulus, the McGurk effect provides strong evidence that speech perception is multisensory. Second, the McGurk effect offers a quick and simple way to measure multisensory integration. A single trial in which the subject perceives the illusion demonstrates that the subject is integrating auditory and visual information. In contrast, measuring audiovisual integration with other approaches requires many trials at different levels of auditory noise, which is time-consuming for normal volunteers and may be impossible for clinical groups such as elderly subjects or children.

42.3 INTERSUBJECT VARIABILITY

The McGurk effect has also been used as a sensitive assay of individual differences in audiovisual integration. Although it is generally appreciated that the ability to see and hear varies across the population (hence the need for eye glasses and hearing aids), differences in central perceptual abilities are less well-appreciated. Healthy young subjects were tested with auditory-only stimuli and congruent audiovisual stimuli and identified them without errors. However, when presented with McGurk stimuli, there were dramatic differences between individuals. Across 14 different McGurk stimuli (recorded by many different speakers), some subjects never reported the illusion, whereas some subjects reported the illusory percept on every trial of every stimulus (Figure 42.5A). However, for most subjects, the illusion was not "all or nothing." Most subjects reported the illusion for some stimuli but not others, and, even for multiple presentations of the same stimulus, a subject might report the illusion on some trials but not others.

To better understand this variability, we created a simple model with three sets of parameters. One parameter described the strength of each stimulus (across subjects, how often this stimulus elicited an illusory percept). Another parameter described each subject's individual threshold for reporting the illusion (any stimulus perceived to lie above this threshold would evoke the illusion).


A third parameter described the sensory noise in each subject's measurement of a stimulus's strength. In any given trial, the subject does not perceive the actual strength of the stimulus, only the subject's noisy internal representation of it. The sensory noise term describes how much variability is added to the (true, external) stimulus strength when creating this internal representation. The sensory noise term is critical to account for the observation that the exact same stimulus can evoke different reports on different trials. If subjects perceived the true stimulus strength, then for any given stimulus they should either always report the illusion (if the stimulus strength falls above their threshold) or never report it (if the stimulus strength falls below their threshold). Because subjects instead perceive a corrupted version of stimulus strength, in some trials, the perceived strength lies above the subject's fusion threshold, and the illusion is reported. In other trials, the perceived strength lies below the subject's fusion threshold and the illusion is not reported. Using these simple sets of parameters (two for each subject and one for each stimulus), we can accurately reproduce the pattern of observed reports of the illusion. Even more importantly, the model can predict responses to stimuli that subjects have never before seen.

FIGURE 42.5 Intersubject variability in the McGurk effect. (A) Each bar represents a single subject (n = 165). The height of each bar corresponds to the proportion of McGurk fusion responses (e.g., percept of /da/ when presented with visual /ga/ and auditory /ba/) out of all responses when the subject was presented with 10 repetitions of up to 14 different McGurk stimuli. Subjects are ordered by rate of McGurk fusion responses. (B) A model of McGurk perception in which each stimulus is fit with a given strength (tick mark labeled S on x-axis) and each subject is fit with a given threshold (vertical dashed line). In each trial, the strength of the stimulus is measured by the subject with a process that includes sensory noise (width of Gaussian curve corresponds to amount of sensory noise). This leads to a distribution of measured strengths across trials. In those trials in which the measured strength exceeds threshold, the McGurk effect is perceived by the subject (shaded regions). The left plot shows a subject with low susceptibility (Pα): the subject threshold is greater than the stimulus strength, leading to few trials on which the illusion is perceived (small red shaded area). A second subject (Pβ, middle plot) has the same sensory noise as Pα (width of yellow curve same as width of red curve) but a lower threshold (dashed yellow line to the left of dashed red line along x-axis), leading to increased perception of the McGurk effect (yellow shaded region) for the identical stimulus. A third subject (Pγ, right plot) has greater sensory noise (width of green curve greater) but the same threshold as Pβ, leading to increased perception of the McGurk effect (green shaded region). (C) The model is fit to the behavioral data from each subject shown in (A). Each black point shows the behavioral data (mean proportion of fusion responses across all stimuli). Each gray bar shows the model prediction for that subject. Each red bar shows the prediction error for that subject.
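The three-parameter model described above and in Figure 42.5B has a compact closed form: if the measured strength is the true strength plus Gaussian sensory noise, the probability of reporting the illusion is a cumulative Gaussian in the distance between stimulus strength and subject threshold. A minimal sketch, with invented parameter values, follows.

```python
# Minimal sketch of the three-parameter model of McGurk perception described
# above: stimulus strength S, subject threshold T, subject sensory noise sigma.
# Parameter values are invented for illustration; in the study they were fit
# to each subject's and each stimulus's behavioral data.
from math import erf, sqrt

def p_fusion(strength, threshold, noise_sd):
    """Probability that the noisy measured strength exceeds the threshold.

    measured ~ Normal(strength, noise_sd), so
    P(measured > threshold) = Phi((strength - threshold) / noise_sd).
    """
    z = (strength - threshold) / noise_sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

stimulus_strength = 0.6
print(p_fusion(stimulus_strength, threshold=0.8, noise_sd=0.1))  # low susceptibility (like P-alpha)
print(p_fusion(stimulus_strength, threshold=0.4, noise_sd=0.1))  # lower threshold: more fusion (like P-beta)
print(p_fusion(stimulus_strength, threshold=0.8, noise_sd=0.4))  # more sensory noise: more fusion (like P-gamma)
```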


42.4 NEURAL SUBSTRATES OF THE McGURK EFFECT

Different individuals with normal hearing and vision report very different percepts when presented with the same McGurk stimulus. The source of this variability must lie within the central nervous system. To examine the neural underpinnings of this behavioral variability, we conducted three parallel BOLD fMRI studies in young adults (Nath & Beauchamp, 2012), children (Nath, Fava, & Beauchamp, 2011), and older adults. Subjects were divided into those who only weakly perceived the McGurk effect and subjects who more strongly perceived it in a behavioral pre-test performed outside the scanner (Figure 42.6A). Then, the STS was identified using an independent set of fMRI data that used different stimuli (audiovisual words) to eliminate bias in voxel selection (Simmons, Bellgowan, & Martin, 2007).

After the STS was identified, responses in the STS to individual McGurk syllables were measured. Only passive viewing of the syllables was used; subjects did not make a behavioral response during MR scanning.

[Figure 42.6 panel data: (C) young adults, r = 0.73, P = 0.003; (D) children, r = 0.50, P = 0.04; (E) older adults, r = 0.29, P = 0.23.]

FIGURE 42.6 Intersubject variability in neural responses to the McGurk effect. (A) Young adult subjects (n = 14) were ordered by proportion of McGurk fusion responses and divided into strong perceivers of the McGurk effect (green squares, proportion of McGurk fusion responses ≥ 0.50) and weak perceivers of the McGurk effect (red squares, proportion of McGurk fusion responses < 0.50). (B) The STS was localized in each young adult subject using independent stimuli (auditory and visual words). Then, the BOLD fMRI response to single McGurk syllables (black bar) was measured in the STS ROI. The average BOLD response across all strong perceivers (green curve) and across all weak perceivers (red curve) is shown. Each square shows the peak response of a single subject who strongly perceived the effect (green square) or weakly perceived it (red square). (C) Each young adult subject's McGurk fusion rate (x-axis) was correlated with the BOLD fMRI signal change in the STS ROI (y-axis). Error bars show the within-subject variability in the fMRI response across trials for each subject (green squares are strong perceivers, red squares are weak perceivers). (D) Children aged 5-12 years were tested using a similar protocol (the STS was localized with an independent localizer; subjects were divided into strong and weak perceivers based on their behavioral responses to the McGurk effect; the fMRI signal in the STS ROI was plotted against the proportion of fusion responses; error bars show within-subject variability). (E) Older adults aged 53-75 years were tested using a similar protocol (the STS was localized with an independent localizer; subjects were divided into strong and weak perceivers based on their behavioral responses to the McGurk effect; the fMRI signal in the STS ROI was plotted against the proportion of fusion responses).

Even though the physical stimulus presented to the two groups was identical, subjects who strongly perceived the McGurk effect had greater BOLD responses to McGurk stimuli. There was a positive correlation between how often they perceived the illusion and the amplitude of the BOLD fMRI response of the STS. No other brain regions showed such a correlation. A possible explanation for this finding is that neurons in the STS perform the computations necessary to integrate the auditory and visual components of the stimulus and create the McGurk illusion. If these neurons are not active, then the illusory percept is not created. It is important to emphasize that this does not indicate anything defective about audiovisual integration in subjects who do or do not perceive the McGurk illusion. In fact, the STS in both groups responded identically to congruent audiovisual syllables. Rather, subjects may be differentially sensitive to the incongruent nature of the McGurk stimuli. If subjects infer that the auditory and visual components of the stimulus are likely to have originated from different speakers, then the optimal solution is to respond with only one component of the stimulus (typically the auditory component, because the auditory modality is usually more reliable for speech perception). If subjects infer that the auditory and visual components of the stimulus are more likely to have originated from the same speaker, then the optimal solution is to fuse the components and respond with the illusory percept, because this percept is most consistent with both the auditory and visual components. This process of Bayesian causal inference may account for audiovisual speech perception under conditions in which the synchrony between the auditory and visual modalities varies (Magnotti, Ma, & Beauchamp, 2013).
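Bayesian causal inference can be sketched as a comparison between two generative hypotheses: one talker generated both cues, or two independent sources did. The code below is a toy version with invented priors and noise levels; it is not the model of Magnotti, Ma, and Beauchamp (2013), which addresses audiovisual asynchrony specifically, but it shows the core computation.

```python
# Minimal sketch of Bayesian causal inference over audiovisual cues.
# All numbers are illustrative; published models are richer.
import math

def normpdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def p_common_cause(x_a, x_v, sd_a=1.0, sd_v=1.0, prior_common=0.5, sd_source=2.0):
    """Posterior probability that auditory and visual cues share one source.

    Under a common cause the two cues are correlated through the shared
    source; under separate causes they are independent. Integrals over the
    source are approximated with a coarse grid for readability.
    """
    grid = [s / 10.0 for s in range(-100, 101)]  # candidate source values
    ds = 0.1
    # Likelihood of both cues given one shared source, marginalized over it.
    like_common = sum(normpdf(x_a, s, sd_a) * normpdf(x_v, s, sd_v) *
                      normpdf(s, 0.0, sd_source) for s in grid) * ds
    # Likelihood given two independent sources, each marginalized separately.
    like_a = sum(normpdf(x_a, s, sd_a) * normpdf(s, 0.0, sd_source) for s in grid) * ds
    like_v = sum(normpdf(x_v, s, sd_v) * normpdf(s, 0.0, sd_source) for s in grid) * ds
    like_separate = like_a * like_v
    num = like_common * prior_common
    return num / (num + like_separate * (1 - prior_common))

print(p_common_cause(0.1, 0.3))  # similar cues: likely one talker, so fuse
print(p_common_cause(0.1, 3.5))  # discrepant cues: likely two sources, so do not fuse
```

When the posterior favors a common cause, fusing the cues (for instance, by inverse-variance weighting) is optimal; otherwise the observer should fall back on a single modality.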
In children aged 5 to 12 years, the correlation between STS responses and McGurk susceptibility was weaker than in adults but still significant (Figure 42.6D). In addition to the STS, the fusiform face area (FFA) on the ventral surface of the temporal lobe was also correlated with McGurk susceptibility.

In older adults aged 53 to 75 years, the correlation between STS responses and McGurk susceptibility was not significant. One possible explanation for the difference is that older adults had significantly greater intrasubject variability than younger adults. Each subject viewed the same McGurk stimuli multiple times, allowing for an estimate both of the mean BOLD fMRI response (used to correlate with the McGurk susceptibility for that individual) and the variability of the BOLD fMRI response. This variability was much larger in older adults, possibly obscuring a true correlation between the mean response and susceptibility. A second possible explanation is that most of the older subjects did not perceive the illusion. Because the number of weak and strong perceivers was unbalanced, this reduced the power available to detect differences between them. In the younger adult study, this imbalance was avoided by pre-screening the subjects so that equal numbers of strong and weak perceivers were included.

The observed positive correlation between left STS activity and perception of the McGurk effect (and no significant correlation in any other brain region) is provocative, but it does not demonstrate that activity in the STS is necessary for the illusion. To demonstrate this, we interfered with activity in the STS using transcranial magnetic stimulation (TMS).

The STS is found in every human brain, but there is substantial variability in its anatomical configuration, as well as in the anatomical location of the functionally defined subregion within the STS that responds to auditory and visual speech (although it is often found at the bend where the STS turns upward into the posterior lobe). Most TMS studies stimulate the brain based on skull landmarks (e.g., 2 cm lateral to the occipital pole) or on functional landmarks (e.g., 2 cm anterior to motor cortex). However, the STS is not easily localized with either of these techniques. Therefore, we used fMRI to identify the multisensory region of the STS in each subject (Beauchamp, Nath, & Pasalar, 2010). Then, frameless stereotaxic registration was used to position the TMS coil over the STS. Single pulses of TMS (<1 ms duration) were delivered as subjects viewed McGurk and control stimuli (Figure 42.7). Subjects were pre-screened so that only those who strongly perceived the illusion were tested. Without TMS, the McGurk percept was reported on nearly every trial. However, when TMS was delivered to the STS, the illusion was reported in only approximately half of trials. Perception of other auditory or visual stimuli was not affected. Stimulating cortex outside the STS did not produce any change in the illusion, demonstrating that it was a specific result of STS stimulation and not related to the auditory click produced by the TMS coil or the scalp tactile stimulation induced by the pulse. The "virtual lesions" introduced with TMS interfered with McGurk perception only in a temporal window just before and after onset of the auditory stimulus (Figure 42.7C), similar to the window in which electrocorticographic activity is observed. The impairment in perception of the McGurk effect caused by TMS stimulation of the STS lasted for <1 s. Across multiple trials of TMS, there were no appreciable changes: TMS interfered with McGurk perception as effectively at the end of the experiment as it did at the beginning. However, permanent damage to the STS might engage cortical mechanisms for plasticity.

To examine this possibility, we tested a patient, SJ, who experienced a stroke that ablated the entirety of her left posterior STS (Figure 42.8); her right hemisphere was unaffected (Baum, Martin, Hamilton, & Beauchamp, 2012).
FIGURE 42.7 Disruption of the McGurk effect with TMS. (A) A TMS coil was used to stimulate the STS or a control location as subjects viewed a McGurk stimulus and reported their percept. fMRI was used to identify target regions in each subject and frameless stereotaxy was used to position the TMS coil over the identified target. A single pulse of TMS was applied in each trial. (B) Behavioral responses to McGurk stimuli with no TMS, with STS TMS, and with control TMS. The lightning bolt icon positioned over the inflated cortical surface model and fMRI activation map of a sample subject shows the location of the applied TMS. Figure adapted from Beauchamp et al. (2010) with permission from the publisher. (C) The time at which the single pulse of TMS was delivered was varied. With no TMS, the McGurk effect was always perceived. Maximal interference with the McGurk effect was observed when the TMS was delivered at times near the onset of the auditory stimulus. Significant reduction in McGurk perception, P < 0.05.

She was severely impaired immediately after the brain injury but underwent extensive daily rehabilitation in the 5 years after her stroke. After recovery, she reported good ability to understand speech but reported that it was very effortful. At testing, her ability to understand auditory syllables was similar to that of age-matched controls. Surprisingly, she also reported experiencing the McGurk illusion. Because her left STS was ablated, we performed a BOLD fMRI study to try and understand the brain circuits active when she was presented with audiovisual speech. No activity was observed in her left STS, but her right STS was highly active when she was presented with speech, including words, congruent syllables, and McGurk syllables. The volume of cortex active in her right STS exceeded that of any age-matched control. This suggests a scenario in which the right STS reorganized to perform the multisensory integration performed in the left STS by normal individuals. This, in turn, suggests substantial plasticity in the circuit for perceiving audiovisual speech. A promising direction for future research is exploring the training parameters that are most effective at modifying the multisensory speech perception network.

Another important direction for future research is a better understanding of how the auditory cortex, visual cortex, and STS interact with other brain areas important for speech perception. A subcortical structure, the superior colliculus (Wallace, Wilkinson, & Stein, 1996), is a brainstem hub for integrating auditory and visual information. Although the role of the colliculus in audiovisual speech perception is unknown, a lesion study suggested that damage to subcortical structures can impair audiovisual speech perception (Chamoun, Takashima, & Yoshor, 2014). Within the cortex, Broca's area is critical for linguistic processing (Nagy, Greenlee, & Kovacs, 2012; Sahin, Pinker, Cash, Schomer, & Halgren, 2009) and may play a role in audiovisual integration and speech comprehension, especially for noisy speech (Abrams et al., 2012; Noppeney, Ostwald, & Werner, 2010; but see Hickok, Costanzo, Capasso, & Miceli, 2011). A common finding is activity in Broca's area for producing (or repeating), but not perceiving, words (Towle et al., 2008). Another important brain area to consider is the FFA, which plays an important role in face processing (Engell & McCarthy, 2010; Grill-Spector & Malach, 2004) and has also been implicated in the processing of voices (von Kriegstein, Kleinschmidt, & Giraud, 2006). Recent evidence from probabilistic tractography suggests that there exist anatomical connections between FFA and STS (Blank, Anwander, & von Kriegstein, 2011).
FIGURE 42.8 Plasticity of the neural substrates of audiovisual speech perception after brain injury. (A) Cortical surface model created from a T1-weighted MRI of patient SJ 5 years after stroke showing extensive lesion to left posterior STS (red circle). Dashed white line shows the location of STS; only the most anterior portion is spared. Adapted from Baum et al. (2012) with permission from the publisher. (B) Cortical surface model of right hemisphere of patient SJ showing undamaged STS (dashed white line). (C) Cortical surface model of right hemisphere of SJ showing BOLD fMRI activity in response to audiovisual speech. Color scale shows significance of activation. Red arrow highlights extensive activity in posterior STS. (D) BOLD fMRI responses to audiovisual speech in the right hemisphere of a healthy age-matched control. Red arrow highlights paucity of activity in posterior STS (same color scale for (C) and (D)). (E) Volume of cortex in left posterior STS responsive to audiovisual speech in patient SJ (yellow square) and 23 healthy age-matched controls (blue squares). (F) Volume of cortex in right posterior STS responsive to audiovisual speech in patient SJ (yellow square) and 23 healthy age-matched controls (blue squares).

High-resolution fMRI and electrocorticography may also make it possible to record responses from small pools of neurons that respond to auditory or visual speech (Chang et al., 2010), allowing a detailed understanding of the neural circuits underlying audiovisual speech perception.

Acknowledgments

This research was supported by NIH 5R01NS065395 and would not have been possible without the research subjects who provided the data and the scientists who made sense of it (in alphabetical order): Sarah Baum, Eswen Fava, Cris Hamilton, Haley Lindsay, Weiji Ma, John Magnotti, Debshila Basu Mallick, Randi Martin, Cara Miekka, Audrey Nath, Xiaomei Pei, Inga Schepers, Muge Ozker, Ping Sun, Siavash Pasalar, Nafi Yasar, and Daniel Yoshor.

References

Abrams, D. A., Ryali, S., Chen, T., Balaban, E., Levitin, D. J., & Menon, V. (2012). Multivariate activation and connectivity patterns discriminate speech intelligibility in Wernicke's, Broca's, and Geschwind's areas. Cerebral Cortex. http://dx.doi.org/10.1093/cercor/bhs165.

Allison, T., Puce, A., Spencer, D. D., & McCarthy, G. (1999). Electrophysiological studies of human face perception. I: Potentials generated in occipitotemporal cortex by face and non-face stimuli. Cerebral Cortex, 9(5), 415-430.

Baum, S. H., Martin, R. C., Hamilton, A. C., & Beauchamp, M. S. (2012). Multisensory speech perception without the left superior temporal sulcus. NeuroImage, 62(3), 1825-1832. http://dx.doi.org/10.1016/j.neuroimage.2012.05.034.

Beauchamp, M. S., Lee, K. E., Argall, B. D., & Martin, A. (2004). Integration of auditory and visual information about objects in superior temporal sulcus. Neuron, 41(5), 809-823.

Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI-guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30(7), 2414-2417. http://dx.doi.org/10.1523/JNEUROSCI.4865-09.2010.

Beck, J. M., Ma, W. J., Kiani, R., Hanks, T., Churchland, A. K., Roitman, J., et al. (2008). Probabilistic population codes for Bayesian decision making. Neuron, 60(6), 1142-1152.

Bernstein, L. E., Auer, E. T., Jr., Wagner, M., & Ponton, C. W. (2008). Spatiotemporal dynamics of audiovisual speech processing. NeuroImage, 39(1), 423-435. http://dx.doi.org/10.1016/j.neuroimage.2007.08.035.

Bernstein, L. E., Auer, J. E. T., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44(1-4), 5-18.

Blank, H., Anwander, A., & von Kriegstein, K. (2011). Direct structural connections between voice- and face-recognition areas. Journal of Neuroscience, 31(36), 12906-12915. http://dx.doi.org/10.1523/JNEUROSCI.2091-11.2011.

Blank, H., & von Kriegstein, K. (2013). Mechanisms of enhancing visual-speech recognition by prior auditory information. NeuroImage, 65, 109-118. http://dx.doi.org/10.1016/j.neuroimage.2012.09.047.

Canolty, R. T., Edwards, E., Dalal, S. S., Soltani, M., Nagarajan, S. S., Kirsch, H. E., et al. (2006). High gamma power is phase-locked to theta oscillations in human neocortex. Science, 313(5793), 1626-1628.

Canolty, R. T., Soltani, M., Dalal, S. S., Edwards, E., Dronkers, N. F., Nagarajan, S. S., et al. (2007). Spatiotemporal dynamics of word processing in the human brain. Frontiers in Neuroscience, 1(1), 185-196.

Chamoun, R., Takashima, M., & Yoshor, D. (2014). Endoscopic extracapsular dissection for resection of pituitary macroadenomas: Technical note. Journal of Neurological Surgery. Part A: Central European Neurosurgery, 75(1), 48-52. http://dx.doi.org/10.1055/s-0032-1326940.

Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro, N. M., & Knight, R. T. (2010). Categorical speech representation in human superior temporal gyrus. Nature Neuroscience, 13(11), 1428-1432. http://dx.doi.org/10.1038/nn.2641.

Crone, N. E., Miglioretti, D. L., Gordon, B., & Lesser, R. P. (1998). Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis. II. Event-related synchronization in the gamma band. Brain, 121(Pt 12), 2301-2315.

Driver, J., & Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on 'sensory-specific' brain regions, neural responses, and judgments. Neuron, 57(1), 11-23.

Engell, A. D., & McCarthy, G. (2010). Selective attention modulates face-specific induced gamma oscillations recorded from ventral occipitotemporal cortex. Journal of Neuroscience, 30(26), 8780-8786.

Friston, K. J. (2009). Modalities, modes, and models in functional neuroimaging. Science, 326(5951), 399-403.

Grill-Spector, K., & Malach, R. (2004). The human visual cortex. Annual Review of Neuroscience, 27, 649-677.

Hickok, G., Costanzo, M., Capasso, R., & Miceli, G. (2011). The role of Broca's area in speech perception: Evidence from aphasia revisited. Brain and Language, 119(3), 214-220. http://dx.doi.org/10.1016/j.bandl.2011.08.001.

Lachaux, J. P., Axmacher, N., Mormann, F., Halgren, E., & Crone, N. E. (2012). High-frequency neural activity and human cognition: Past, present and possible future of intracranial EEG research. Progress in Neurobiology, 98(3), 279-301. http://dx.doi.org/10.1016/j.pneurobio.2012.06.008.

Lee, H., & Noppeney, U. (2011). Physical and perceptual factors shape the neural mechanisms that integrate audiovisual signals in speech comprehension. Journal of Neuroscience, 31(31), 11338-11350.

Logothetis, N. K., Kayser, C., & Oeltermann, A. (2007). In vivo measurement of cortical impedance spectrum in monkeys: Implications for signal propagation. Neuron, 55(5), 809-823.

Magnotti, J. F., Ma, W. J., & Beauchamp, M. S. (2013). Causal inference of asynchronous audiovisual speech. Frontiers in Psychology, 4, 798. http://dx.doi.org/10.3389/fpsyg.2013.00798.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746-748.

Miller, L. M., & D'Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. The Journal of Neuroscience, 25(25), 5884-5893.

Nagy, K., Greenlee, M. W., & Kovacs, G. (2012). The lateral occipital cortex in the face perception network: An effective connectivity study. Frontiers in Psychology, 3, 141. http://dx.doi.org/10.3389/fpsyg.2012.00141.

Nath, A. R., & Beauchamp, M. S. (2011). Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. The Journal of Neuroscience, 31(5), 1704-1714.

Nath, A. R., & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1), 781-787. http://dx.doi.org/10.1016/j.neuroimage.2011.07.024.

Nath, A. R., Fava, E. E., & Beauchamp, M. S. (2011). Neural correlates of interindividual differences in children's audiovisual speech perception. Journal of Neuroscience, 31(39), 13963-13971. http://dx.doi.org/10.1523/JNEUROSCI.2605-11.2011.

Nir, Y., Fisch, L., Mukamel, R., Gelbard-Sagiv, H., Arieli, A., Fried, I., et al. (2007). Coupling between neuronal firing rate, gamma LFP, and BOLD fMRI is related to interneuronal correlations. Current Biology, 17(15), 1275-1285.

Noppeney, U., Josephs, O., Hocking, J., Price, C. J., & Friston, K. J. (2008). The effect of prior visual information on recognition of speech and sounds. Cerebral Cortex, 18(3), 598-609.

Noppeney, U., Ostwald, D., & Werner, S. (2010). Perceptual decisions formed by accumulation of audiovisual evidence in prefrontal cortex. The Journal of Neuroscience, 30(21), 7434-7446. http://dx.doi.org/10.1523/JNEUROSCI.0455-10.2010.

Okada, K., Venezia, J. H., Matchin, W., Saberi, K., & Hickok, G. (2013). An fMRI study of audiovisual speech perception reveals multisensory interactions in auditory cortex. PLoS One, 8(6), e68959.

Pei, X., Leuthardt, E. C., Gaona, C. M., Brunner, P., Wolpaw, J. R., & Schalk, G. (2011). Spatiotemporal dynamics of electrocorticographic high gamma activity during overt and covert word repetition. NeuroImage, 54(4), 2960-2972.

Powers, A. R., III, Hevey, M. A., & Wallace, M. T. (2012). Neural correlates of multisensory perceptual learning. Journal of Neuroscience, 32(18), 6263-6274. http://dx.doi.org/10.1523/JNEUROSCI.6138-11.2012.

Ray, S., & Maunsell, J. H. (2011). Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS Biology, 9(4), e1000610.

Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., & Foxe, J. J. (2007a). Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex, 17(5), 1147-1153. http://dx.doi.org/10.1093/cercor/bhl024.

Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., & Foxe, J. J. (2007b). Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex, 17(5), 1147-1153. http://dx.doi.org/10.1093/cercor/bhl024.

Ross, L. A., Saint-Amour, D., Leavitt, V. M., Molholm, S., Javitt, D. C., & Foxe, J. J. (2007). Impaired multisensory processing in schizophrenia: Deficits in the visual enhancement of speech comprehension under noisy environmental conditions. Schizophrenia Research, 97(1-3), 173-183. http://dx.doi.org/10.1016/j.schres.2007.08.008.

F. PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL 526 42. AUDIOVISUAL SPEECH INTEGRATION: NEURAL SUBSTRATES AND BEHAVIOR

Sahin, N. T., Pinker, S., Cash, S. S., Schomer, D., & Halgren, E. Towle, V. L., Yoon, H. A., Castelle, M., Edgar, J. C., Biassou, N. M., (2009). Sequential processing of lexical, grammatical, and phono- Frim, D. M., et al. (2008). ECoG gamma activity during a lan- logical information within Broca’s area. Science, 326(5951), guage task: Differentiating expressive and receptive speech areas. 445449. Brain, 131(Pt 8), 20132027. Schepers, I. M., Hipp, J. F., Schneider, T. R., Roder, B., & Engel, A. K. Tye-Murray, N., Sommers, M. S., & Spehar, B. (2007). Audiovisual (2012). Functionally specific oscillatory activity correlates between integration and lipreading abilities of older adults with normal visual and auditory cortex in the blind. Brain, 135(Pt 3), 922934. and impaired hearing. Ear and Hearing, 28(5), 656668. Available Schepers, I. M., Schneider, T. R., Hipp, J. F., Engel, A. K., & from: http://dx.doi.org/10.1097/AUD.0b013e31812f7185. Senkowski, D. (2013). Noise alters beta-band activity in superior von Kriegstein, K., Kleinschmidt, A., & Giraud, A. L. (2006). Voice temporal cortex during audiovisual speech processing. recognition and cross-modal responses to familiar speakers’ Neuroimage, 70, 101112. Available from: http://dx.doi.org/ voices in prosopagnosia. Cerebral Cortex, 16(9), 13141322. 10.1016/j.neuroimage.2012.11.066. Available from: http://dx.doi.org/10.1093/cercor/bhj073. Simmons, W. K., Bellgowan, P. S., & Martin, A. (2007). Measuring Wallace, M. T., Wilkinson, L. K., & Stein, B. E. (1996). Representation selectivity in fMRI data. Nature Neuroscience, 10(1), 45. and integration of multiple sensory inputs in primate superior Suh, M. W., Lee, H. J., Kim, J. S., Chung, C. K., & Oh, S. H. (2009). colliculus. Journal of Neurophysiology, 76(2), 12461266. Speech experience shapes the speechreading network and subse- Wilson, S. M., Molnar-Szakacs, I., & Iacoboni, M. (2008). Beyond quent deafness facilitates it. Brain, 132(Pt 10), 27612771. superior temporal cortex: Intersubject correlations in narrative Available from: http://dx.doi.org/10.1093/brain/awp159. speech comprehension. Cerebral Cortex, 18(1), 230242. Available Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech from: http://dx.doi.org/10.1093/cercor/bhm049. intelligibility in noise. Journal of the Acoustical Society of America, 26(2), 212215.

CHAPTER 43

Neurobiology of Statistical Information Processing in the Auditory Domain

Uri Hasson¹ and Pascale Tremblay²,³

¹Center for Mind and Brain Sciences (CIMeC), University of Trento, Mattarello (TN), Italy; ²Centre de Recherche de l'Institut Universitaire en Santé Mentale de Québec, Québec City, QC, Canada; ³Département de Réadaptation, Faculté de Médecine, Université Laval, Québec City, QC, Canada

43.1 INTRODUCTION

In this chapter, we provide an overview of the literature on the neurobiology of statistical information processing that is relevant for studies of language processing. We intend this chapter for those interested in understanding how the coding of statistics between sublexical (nonsemantic) speech elements may impact language acquisition and online speech processing. For this reason, we focus particularly on studies speaking to the sublexical level (syllables), but we also address the processing of statistics within auditory nonspeech stimuli.

As detailed in this chapter, there is increasing evidence that the human brain implements computations that allow it to understand the environment by extracting and analyzing its regularities to derive meaning, simplify the representation of inputs via categorization or compression, and predict upcoming events. The auditory environment is particularly complex, presenting humans with a seemingly infinite combination of sounds with an immensely rich spectral diversity, originating from a wide variety of sources. Among these sounds are speech sounds, that is, syllables and phonemes. Speech sounds occur within rapid and continuous streams that must be parsed into their constituent units for meaning to be extracted. Speech streams are sequentially organized based on complex language-specific (i.e., phonotactic) constraints. These constraints, whose delineation is the focus of linguistic approaches (e.g., Optimality theory; see McCarthy, 2001), are manifested, at the surface level, in language-specific statistics. One such statistic is "transition probability" (TP), which refers to the conditional probability of transitioning from one syllable to the immediately successive one, P(syllable2 | syllable1). Importantly, syllables with high TPs tend to form words, whereas those with lower TPs mark word boundaries. Because, unlike written language, the speech stream does not contain invariant cues marking word boundaries (Cole & Jakimik, 1980; Lehiste & Shockey, 1972), access to statistical information, in particular TP, is considered critical for learning how to segment continuous speech into syllables and words, especially during early childhood when word boundaries are still unknown. There is a large behavioral literature detailing the statistical abilities of infants and children, but also adults, all of whom are sensitive to TP in speech (Newport & Aslin, 2004; Pelucchi, Hay, & Saffran, 2009a, 2009b; Pena, Bonatti, Nespor, & Mehler, 2002; Saffran, Aslin, & Newport, 1996; Saffran, Johnson, Aslin, & Newport, 1999) as well as in nonspeech auditory and visual stimuli (Fiser & Aslin, 2001, 2002; Kirkham, Slemmer, & Johnson, 2002; Saffran et al., 1999; Saffran, Pollak, Seibel, & Shkolnik, 2007).

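To make the TP computation concrete, here is a minimal sketch in Python that estimates forward TPs from bigram counts and posits a word boundary wherever the TP dips below a threshold. It illustrates the segmentation logic described above rather than the procedure of any cited study; the syllable stream, the toy "words," and the 0.6 threshold are invented for the example.

    from collections import Counter

    def transition_probabilities(stream):
        # Estimate P(next | current) from bigram counts in the stream.
        bigrams = Counter(zip(stream, stream[1:]))
        firsts = Counter(stream[:-1])
        return {(a, b): n / firsts[a] for (a, b), n in bigrams.items()}

    def segment(stream, tps, threshold=0.6):
        # Posit a word boundary wherever the forward TP dips below threshold.
        words, current = [], [stream[0]]
        for a, b in zip(stream, stream[1:]):
            if tps[(a, b)] < threshold:
                words.append(current)
                current = []
            current.append(b)
        words.append(current)
        return words

    # Toy stream: the invented "words" tu-pi-ro and go-la-bu repeat in varying order.
    stream = "tu pi ro go la bu tu pi ro tu pi ro go la bu go la bu".split()
    print(segment(stream, transition_probabilities(stream)))
    # Within-word TPs are 1.0, whereas TPs spanning a word boundary are 0.33-0.5,
    # so the stream is cut exactly at the boundaries of the two recurring "words."
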

Importantly, statistics within the auditory domain support more than just language acquisition. In early and late adulthood, statistical information can be used for predicting upcoming syllables and words or for disambiguating partly heard words, especially in contexts in which the speech signal is degraded, such as in noisy environments and for people with hearing loss. It is well known that during speech comprehension the speech system seeks to identify the target lexical item as quickly as possible (Gaskell & Marslen-Wilson, 1997). While processing the auditory input, the cohort of lexical items that form possible lexical continuations becomes accessible, and the lexical item is identified as soon as the "uniqueness point" is reached, that is, the point at which only one continuation is possible and at which it can be discriminated from similar words in the "mental lexicon."¹ This process is further sensitive to the statistical distribution of potential complements: the uncertainty of possible word continuations (the potential cohort) near the recognition point, as quantified by the cohort's entropy, is negatively correlated with word identification, suggesting that the set of potential complements is made accessible during word comprehension itself (Wurm, Ernestus, Schreuder, & Baayen, 2006). Given that interpretations are sought prior to word completion, a probabilistic model of potential completions can optimize recognition by differentially activating more probable completions. Recent work supports the premise that such predictions are made prior to reaching the uniqueness point (Gagnepain, Henson, & Davis, 2012). Beyond accounting for online processing, it has been shown that adults with better capacities for representing input statistics have more efficient language processing capacities, suggesting a tight link between the two mechanisms (Conway, Bauernschmidt, Huang, & Pisoni, 2010; Misyak & Christiansen, 2012).

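The cohort entropy measure used by Wurm et al. can be illustrated with a short sketch. The word list and frequency counts below are invented for illustration; the computation itself is simply Shannon's entropy, H = -Σ p·log₂(p), over the normalized frequencies of the continuations still compatible with the input heard so far.

    import math

    def cohort_entropy(freqs):
        # Shannon entropy (in bits) of the cohort of compatible continuations,
        # with raw frequency counts normalized to probabilities.
        total = sum(freqs.values())
        return -sum((f / total) * math.log2(f / total) for f in freqs.values())

    # Invented frequency counts for words compatible with the fragment "capt-".
    print(round(cohort_entropy({"captain": 60, "captive": 25,
                                "caption": 10, "captor": 5}), 2))    # ~1.49 bits

    # After further input ("capti-"), the cohort narrows and uncertainty drops.
    print(round(cohort_entropy({"captive": 25, "caption": 10}), 2))  # ~0.86 bits

Higher entropy near the recognition point corresponds to greater uncertainty about the continuation, which is the quantity found to correlate negatively with word identification.
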
The studies we review here focus on the neurobiology of statistical information processing. These studies draw on three research traditions: information theory, artificial grammar learning (AGL), and perceptual learning. Therefore, they use different terminologies. Studies grounded in information theory differentiate "regular" from "irregular" or "random" inputs. Here, regular inputs are generated by a process with some level of constraints (e.g., a first- or second-order Markov process²) in which the transitions between tokens are not uniform in the ordered condition but are uniform in the random condition. For instance, given a grammar of N tokens, in the random condition transition probabilities between tokens will be 1/N uniformly for all token-pair transitions, whereas the regular conditions will increase the probability of some transitions and decrease the probability of others. Studies using AGL paradigms, which are grounded in formal learning theories, often refer to the generating process as a transition network and examine responses to sequences that could or could not be generated by the learned grammar. Many of the grammars used in the AGL literature are of the simplest type (regular grammars) that can be represented via a Markov process. The third body of literature, grounded in perceptual learning, uses manipulations in which random token-sequences are contrasted with sequences in which some tokens are always concatenated in a fixed order to form "words." It is important to note that both the random and word-forming conditions can be described by a first-order Markov process: the random condition has a uniform distribution of transitions, whereas the word-forming condition contains several cells with a 100% TP. We further note that these latter studies are different from those grounded in AGL or information theory in that once the "words" are decoded, learners benefit from numerous points in the sequence that completely lack uncertainty (particularly when each word begins with a unique syllable). To illustrate, if words are four syllables long, then after the words have been learned, identifying a word-initial syllable will uniquely determine the next three syllables for the listener. That is, including "words" introduces a measure of determinism that is not present in other studies of statistical processing, and there is evidence that this results in chunking of the words into single units, with less attention given to those syllables in the sequence that are predictable with absolute certainty (Astheimer & Sanders, 2011; Buiatti, Pena, & Dehaene-Lambertz, 2009).

¹The mental lexicon is a concept used in linguistics and in psycholinguistics to refer to a speaker's lexical representations.
²Markov chains are stochastic processes that can be parameterized by estimating the TP between discrete states in a system. In first-order Markov chains, each subsequent state depends only on the immediately preceding one; in second-order Markov chains, it depends on the two preceding ones.

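The three stimulus constructions just reviewed can be written down directly as first-order transition matrices. The sketch below is a schematic illustration under the chapter's definitions, not code from any cited study; the four-token alphabet, the 0.7/0.1 probabilities of the "regular" condition, and the two-token "words" are all invented.

    import random

    TOKENS = ["A", "B", "C", "D"]

    # Random condition: every transition equally likely (TP = 1/N).
    random_tp = {a: {b: 1 / len(TOKENS) for b in TOKENS} for a in TOKENS}

    # Regular condition: some transitions strengthened and others weakened
    # (each row still sums to 1), so the next token is partly predictable.
    regular_tp = {a: {b: 0.7 if b == nxt else 0.1 for b in TOKENS}
                  for a, nxt in zip(TOKENS, ["B", "C", "D", "A"])}

    # Word-forming condition: the "words" AB and CD give within-word cells
    # a TP of 100%, while transitions at word boundaries remain uncertain.
    word_tp = {"A": {"B": 1.0}, "C": {"D": 1.0},
               "B": {"A": 0.5, "C": 0.5}, "D": {"A": 0.5, "C": 0.5}}

    def generate(tp, start, n, rng=random.Random(0)):
        # Sample an n-token sequence from a first-order Markov chain.
        seq = [start]
        for _ in range(n - 1):
            successors, probs = zip(*tp[seq[-1]].items())
            seq.append(rng.choices(successors, weights=probs)[0])
        return "".join(seq)

    print(generate(word_tp, "A", 12))  # e.g., "ABCDABABCDCD": AB and CD never split

All three generators are first-order Markov processes, which is the formal point made above: the word-forming condition differs only in that some cells of its transition matrix are deterministic.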

We also emphasize that the findings we review in this chapter derive from studies examining the acquisition or representation of statistical constraints, a field that only partly overlaps with studies of grammar acquisition and representation or rule learning. The study of transition-probability statistics takes as its inputs sequences that are generated by regular grammars, that is, grammars that can be described via finite state automata (also referred to as finite state systems or Markov processes), which are defined solely by an alphabet, states, and transitions. Finite state systems can model simple grammars, but not grammars that require a memory stack (e.g., AⁿBⁿ grammars). Studies that examine the representation of more complex grammars such as these, which necessitate memory stores, hierarchical representations, or recursion, or that use symbolisms such as phrase structures or unification-based constraints, are outside the scope of this review and may utilize different neural systems (see Fitch & Friederici, 2012, for recent neurobiological studies comparing regular versus context-free grammars, and Bahlmann, Schubotz, & Friederici, 2008, and Folia, Forkstam, Ingvar, Hagoort, & Petersson, 2011, for representative experimental work). Work on formal rule learning and evaluation of stimuli against rules is also outside the scope of the current chapter.

Importantly, whereas much of the data in the studies we discuss are interpreted in terms of the learning of statistical relations (transition probabilities) or the use of systems that represent statistics, these interpretations often ignore the relation between processes associated with the coding of statistics and those related to chunking of tokens. There is a strong relation between sequence statistics, the process of chunking sequences, and the resulting segmentation of the input into a "mental alphabet" consisting of distinct units (Perruchet & Pacton, 2006). Computationally, if two or more tokens always appear jointly, it would be most optimal to chunk (represent) them as a single unit for the purpose of any statistical computation. Depending on interpretive traditions, some of the studies reviewed in this chapter address differences in brain activity for regular and random sequences, not in terms of statistical learning of relations between items, but in terms of chunking. It has been shown that chunking-based accounts can explain many of the findings in the statistical learning literature, and whether the same systems mediate chunking and statistical learning is an ongoing issue (see Perruchet & Pacton, 2006, for an important discussion). Thus, from a neurobiological perspective, it is still difficult to determine whether sensitivity to statistical regularities is dependent on chunking-related processes or on processes more specifically related to statistical learning, such as the updating of associative information and the construction of predictions.

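One simple way to formalize "always appear jointly" is to merge any bigram that is deterministic in both directions, that is, where b always follows a and a always precedes b. The sketch below (an invented toy formalization, not a published chunking model such as PARSER) performs one such merging pass; after it, each glued pair behaves as a single unit in any subsequent statistical computation.

    from collections import Counter

    def merge_deterministic_pairs(stream):
        # A bigram (a, b) is glued when its count equals both the count of a
        # as a predecessor and the count of b as a successor: the two tokens
        # co-occur without exception and are re-represented as one unit.
        pairs = Counter(zip(stream, stream[1:]))
        firsts, seconds = Counter(stream[:-1]), Counter(stream[1:])
        glued = {(a, b) for (a, b), n in pairs.items()
                 if n == firsts[a] and n == seconds[b]}
        out, i = [], 0
        while i < len(stream):
            if i + 1 < len(stream) and (stream[i], stream[i + 1]) in glued:
                out.append(stream[i] + "+" + stream[i + 1])
                i += 2
            else:
                out.append(stream[i])
                i += 1
        return out

    stream = "ba di ku ba di po ku ne ba di po ne".split()
    print(merge_deterministic_pairs(stream))
    # ['ba+di', 'ku', 'ba+di', 'po', 'ku', 'ne', 'ba+di', 'po', 'ne']
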
The structure of the rest of the chapter is as follows. Section 43.2 identifies and discusses the brain regions most commonly involved in processing statistical information: Section 43.2.1 focuses on the cortical regions, and Section 43.2.2 focuses on the subcortical ones. Finally, Section 43.3 discusses the connectivity between the cortical and subcortical structures involved in statistical information processing. We conclude by discussing the potential usefulness of statistical information processing in the context of auditory, language, and systems neuroscience.

43.2 BRAIN SYSTEMS INVOLVED IN STATISTICAL INFORMATION PROCESSING

Not much is known about the anatomy and functional organization of the brain systems that process and use statistical information in the speech signal. From a neurobiological perspective, different systems might be involved in the representation and use of such statistical information. Some might represent associative links between speech elements (e.g., the distribution of all elements that follow what is currently perceived), whereas some may mediate more general auditory-attention processes that enable the use of statistical information. For instance, hearing the syllable "que" [/kwε/] can invoke a representation of a range of possible complements such as "est" or "estion." Processing of acoustic signals uses active "look-ahead" mechanisms that aim to identify target words given minimally sufficient acoustic information, prior to the moment when a word has been presented in its entirety (Gaskell & Marslen-Wilson, 1997). This representation, which can be formally described as a cumulative distribution of possible following syllables, could be instantiated on the fly. The deployment of neural systems to construct and utilize these distributions may further depend on the "diagnosticity" of the auditory cue (i.e., the degree to which it constrains future inputs). Whereas a syllable such as /kwε/ is highly diagnostic because it can be followed by relatively few continuations, a syllable such as "bi" [/bɑɪ/] allows numerous ones (bite, bilateral, bicycle), in which case making predictions may be counterproductive and, instead, paying greater attention to bottom-up input (i.e., to the acoustic details themselves) may be more optimal. When the number of potential complements is large, systems associated with modulation of auditory attention likely play a role to favor bottom-up processing.

As discussed in Sections 43.2.1 and 43.2.2, neuroimaging research has begun to uncover a complex and distributed system, including both cortical and subcortical components, that is important for processing auditory statistical information, either by directly supporting statistical mechanisms or by controlling more general mechanisms such as auditory segmentation, auditory attention, and working memory. The functional organization and specific contributions of the different components of this distributed system, however, remain largely unknown. A range of experimental paradigms has been used to study this question, using either speech or nonspeech stimuli (such as pure tones), or sometimes both, presented in either the auditory or the visual modality. Passive listening tasks have been used to focus on automatic and perhaps more naturally occurring processes, whereas other tasks have targeted deliberate/executive processes, including various forms of learning and pattern recognition paradigms. This range of experimental paradigms may explain the diverse network of regions that have been associated with statistical information processing, which includes parts of the temporal, frontal, and parietal lobes and subcortical structures. That said, in the present chapter we do not review research focusing on decision-making about future outcomes.

43.2.1 Cortical Systems Underlying Statistical Processing of Linguistic Inputs

At the cortical level, perhaps not surprisingly, sensitivity to statistical information in auditory input streams has been found mainly in the supratemporal plane (STP) of healthy adults, including the auditory cortices. This is generally consistent with the idea that representation of statistics in different modalities relies on separate circuits rather than on a general system for coding statistics (Bastos et al., 2012). However, several studies have also implicated the left inferior frontal gyrus (IFG), a region that may play a more domain-general role in this process.

43.2.1.1 Involvement of Temporal Regions

Several studies examining statistical information processing in auditory nonspeech streams, usually using pure-tone streams, have identified temporal and temporal-parietal regions whose activation is modulated by the statistical properties of the input signal. In a magnetoencephalographic (MEG) study (Furl et al., 2011), sensitivity to statistical information, quantified as transition probabilities (second-order Markov chains of tones comprising five possible pitches), was found in the right temporal-parietal junction, including the most posterior part of the STP. Using functional magnetic resonance imaging (fMRI), we found sensitivity to perceived changes in statistical regularity in auditory sequences of pure tones in a region of the left superior temporal gyrus (STG) as well as in several other regions, including the bilateral medial frontal gyrus (Tobia, Iacovella, Davis, & Hasson, 2012). Using near-infrared spectroscopy (NIRS), Abla and Okanoya (2008) examined the processing of tone sequences containing "tone-words" and random sequences. Prior to the experiment, participants were presented with the tone-words during a 10-minute training phase. During the experiment, participants were asked to try to recognize the tone-words they had learned during the training phase, a task that requires continuous segmentation of the input and evaluation against the learned words. The results showed increased activation in both left STG and left IFG. Although the authors argued against the STG being involved in statistical information processing, suggesting instead that the observed task-related modulation reflects auditory selective attention, the finding of greater activation in this area adds to a literature suggesting that the STG is indeed involved in statistical information processing. Comparing auditory melodies to unstructured sequences of tones comparable in terms of pitch and rhythm revealed stronger fMRI signal in bilateral IFG and anterior and posterior STG, including the planum temporale (PT) (Minati et al., 2008). Interestingly, the PT has been shown to respond to the presentation of unexpected silences in auditory sequences of complex sounds (modulated scanner noise), further suggesting that it is sensitive to the structure of sound sequences (Mustovic et al., 2003). In sum, neuroimaging studies, including MEG, fMRI, and NIRS studies, focusing on the processing of auditory sequences robustly find activation along the supratemporal cortex, particularly at the level of the posterior STG and PT, that is modulated by the internal statistical structure of nonspeech auditory sequences.

Neuroimaging studies that have used speech instead of nonspeech stimuli lead to conclusions that are consistent with those found in nonspeech studies. Using fMRI, McNealy, Mazziotta, and Dapretto (2006) examined brain activity of healthy adults during passive listening to: (i) syllable streams from an artificial language containing statistical regularities; (ii) streams containing statistical regularities and speech cues (stressed syllables); and (iii) streams containing no statistical regularities and no speech cues. The comparison of streams containing regularities with those that did not revealed activity in the left posterior STG. Importantly, activity in this region was found to increase over time for the streams containing statistical regularities but not for the random streams. This finding is particularly important for understanding the potential role of the posterior STG in statistical processing. If the posterior STG were involved in the updating of statistical representations (i.e., statistical updating per se), then this process of updating probabilities should not differ in its dynamics for regular versus random series. In contrast, if it were implicated in segmentation of words from the stream, then it would show more activation for regular series, as was found by the authors. A subset of the participants in the McNealy et al. study also completed a word discrimination task in which words, part-words, and nonwords taken from the speech streams were presented to the participants, who were given no explicit task. Results show activity in the left posterior IFG and left middle frontal gyrus (MFG) for words compared with nonwords and part-words, altogether suggesting that posterior frontal regions are involved in word recognition once these words are learned, whereas statistical information and segmentation processes may be accomplished within the STP. This experimental paradigm was reapplied with a group of 10-year-old children; results show essentially the same BOLD activation pattern (McNealy, Mazziotta, & Dapretto, 2010).


In a related fMRI study (Karuza et al., 2013), participants were also exposed to an artificial language and tested on their ability to recognize words (three-syllable combinations with high transitional probabilities) and part-words that occurred during the exposure streams (learning). Exposure streams included forward and backward speech streams. Greater activation was found in the left STG, right posterior middle temporal gyrus (MTG), and right STG when comparing the forward speech stream with the backward speech control condition. Relating each participant's change in learning across postexposure test phases to changes in neural activity during the forward as compared with the backward exposure phase revealed activation in the left IFG pars triangularis and a small portion of the left IFG pars opercularis. Together with the findings of McNealy et al. (2006), this suggests that when statistical dependencies are found in a speech stream, "chunks" of colinear elements are coded as single items ("words") in a representation accessible to left inferior frontal regions. That is, these regions may be specifically involved in word learning or word recognition, whereas more general auditory statistical information processing mechanisms involve the STP.

In our own fMRI work (Tremblay, Baroni, & Hasson, 2012), we used a slightly different approach to study statistical information processing in speech and nonspeech auditory sequences. We exposed participants to meaningless sequences of native syllables and spectrally complex natural nonspeech sounds (bird chirps) that varied in terms of their statistical structure, which was determined by their transition-probability matrices (participants monitored a visual stimulus in an incidental task). The auditory sequences ranged from random to highly structured in three steps: predictable sequences were associated with strong transition constraints between elements in a way that allowed making a strong prediction about the next element, whereas mid-level or random sequences were associated with weaker or no transition constraints. In contrast to prior work, we used a detailed anatomical ROI approach and focused on the STP, which we partitioned into 13 subregions in each hemisphere. Our results emphasize the importance of the posterior STP in the automatic processing of statistical information for both speech and nonspeech sequences. Several areas in the left STP were sensitive to statistical regularities in both speech and nonspeech inputs, including the PT, posterior STG, and primary auditory cortex in the medial transverse temporal gyrus (TTG). The bilateral lateral TTG and transverse temporal sulcus (TTS) were sensitive to statistical regularities in both speech and nonspeech, but patterns of activation were distinct, with earlier activation for speech than nonspeech, suggesting easier segmentation for familiar than unfamiliar sounds. These findings for TTG are also in agreement with work showing sensitivity to regularities in that region using direct recordings in rats (Yaron, Hershenhoren, & Nelken, 2012). Recent data from our group (Deschamps, Hasson, & Tremblay, submitted) show that the cortical thickness of left anterior STG correlates with the ability to perceive statistical structure in speech sequences. An additional exploratory whole-brain analysis revealed significant correlations in the superior parietal lobule (SPL) and intraparietal sulcus (IPS), regions more typically associated with attentional processes. These data suggest that the ability to process statistical information depends on auditory-specialized cortical networks in the STP as well as on a network of cortical regions supporting executive functions.

43.2.1.2 Involvement of Left IFG

As indicated, the left IFG appears to play a role in the segmentation of word or word-like structures in an auditory stream. It is an outstanding question whether the left IFG mediates a modality-general function. In a nonspeech study, Petersson, Folia, and Hagoort (2012) introduced participants to written (visual) sequences generated by a seven-item artificial grammar. In a subsequent fMRI session, these participants were asked to determine whether sequences presented to them were grammatical. The authors found greater activation for accurate nongrammatical versus grammatical judgments in the left IFG, suggesting that these regions generally evaluate the well-formedness of inputs by examining whether they are licensed given the grammar learned, even outside language contexts. However, other fMRI work (Nastase, Iacovella, & Hasson, 2014) argues that the left IFG may mediate low-level auditory statistical processing rather than a modality-independent process. The Nastase et al. study showed that activity in the left IFG was sensitive to regular versus random series of pure tones, but it was not sensitive to regularity when the tokens were substituted with simple visual objects. One possibility is that participants in the Petersson et al. study represented the visual stimuli using a verbal code in preparation for the task required of them. In contrast, the short sequences, passive perception context, and rapid presentation rate used by Nastase et al. do not favor that strategy, which may explain the discrepant findings.

In sum, a review of the literature indicates that the posterior STP, particularly at the level of the PT and posterior STG, as well as the left IFG, are the cortical structures most frequently implicated in the processing of statistical information in the auditory domain. Further studies are needed, however, to delineate the specific role(s) of each region as well as the manner in which information flows between these different regions.


43.2.2 Subcortical Systems Underlying Statistical Information Processing

43.2.2.1 Involvement of Basal Ganglia in Statistical Learning

Relying mainly on neuroimaging and neurophysiology studies, we now consider the potential role of the basal ganglia (BG) in mediating statistically informed aspects of phonemic and syllable-level processing. The BG is involved in several cortico-striatal loops, including limbic, associative, and motor loops, each originating and terminating in distinct cortical areas, and connecting via different parts of the BG (Alexander, DeLong, & Strick, 1986). The basal motor loop connects primary and nonprimary motor areas, including the supplementary motor area (SMA), to the posterior striatum, particularly the putamen, the external and internal pallidum, the subthalamic nucleus, and ventrolateral thalamic nuclei. This loop has been associated with the acquisition, storage, and execution of motor patterns and sequence learning (Graybiel, 1998). Functionally, it has been suggested that the BG is involved in the coding of context (Saint-Cyr, 2003) and in the generation of sequenced outputs via rule application (Lieberman, 2001). Neuropsychological, anatomical, and behavioral evidence for these notions is not discussed here; it has been discussed extensively elsewhere, particularly in relation to populations with BG dysfunction (see Alm, 2004, for dysfunctions related to stuttering; see Saint-Cyr, 2003, for a review of BG dysfunctions such as Parkinson's and Huntington's diseases).

The potential role of BG in representing statistical constraints among short nonsemantic units has been examined in several studies. An early neuroimaging study (Lieberman, Chang, Chiao, Bookheimer, & Knowlton, 2004) provides a convincing demonstration of this point. In this study, participants were presented with letter strings constructed from an artificial grammar as part of a training stage; then, in a test stage, they were tested on novel grammatical and nongrammatical letter strings. Orthogonally to the grammaticality manipulation, the authors constructed test items that either strongly or weakly overlapped with the training items in "chunk strength," which is a measure relating not to the grammaticality of the string, but rather to the frequency with which the "alphabet" of the grammar tended to occur conjointly (co-occurrence frequency). In this way, the authors could separate regions implicated in retention of grammar rules from those involved in retention of chunks of letters. When chunk strength was low, offering little superficial overlap with the training items, the caudate nucleus was sensitive to grammaticality, showing more activity for grammatical items. Examination of the correlates of chunk strength revealed hippocampal and putamen activation, with hippocampal activation interpreted as being related to retrieval of training items.

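Chunk strength admits a simple operationalization: the mean frequency, in the training set, of the letter chunks (typically bigrams and trigrams) that make up a test item. The sketch below uses that common formulation; the letter strings are invented and are not the actual Lieberman et al. stimuli.

    from collections import Counter

    def chunks(s, sizes=(2, 3)):
        # All contiguous bigrams and trigrams of a letter string.
        return [s[i:i + k] for k in sizes for i in range(len(s) - k + 1)]

    def chunk_strength(item, training):
        # Mean training-set frequency of the item's chunks.
        freq = Counter(c for s in training for c in chunks(s))
        item_chunks = chunks(item)
        return sum(freq[c] for c in item_chunks) / len(item_chunks)

    training = ["XVXJ", "VXJJ", "XXVT", "VTXJ"]          # invented training items
    print(chunk_strength("XVXJ", training))   # 2.0: its chunks recur in training
    print(chunk_strength("TJVV", training))   # 0.0: none of its chunks were seen

A test item can therefore be nongrammatical yet high in chunk strength (its pieces are familiar), which is exactly the dissociation that the balanced design exploited.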

The putamen has been linked to the coding of regularity in the study of McNealy et al. (2006) discussed in the previous section. That study reported stronger activity in the putamen during presentation of random syllable strings than during presentation of artificial languages. Furthermore, activation in the region was stronger for unstressed language-like streams than for stressed language-like streams (in the latter, word-strings were easier to disambiguate). Thus, the putamen showed a gradient of activation, with the least activity during the condition in which words were easiest to segment. Interpreting such relatively focal regional effects in the context of a more general approach to understanding BG-thalamic loops is an important goal for future work. The BG is internally connected via inhibitory connections, so that increased activity in the striatum can inhibit or excite the pallidum and, via different pathways, result in either increased or decreased activity in the thalamus. Future work with high temporal and spatial resolution (e.g., time-resolved fMRI or MEG) may identify such interactions. In any case, the involvement of BG in grammar learning is supported by work (De Diego-Balaguer et al., 2008) showing that individuals diagnosed with Huntington's disease, which affects the striatum (i.e., the caudate and the putamen), show deficits in word and morphosyntactic learning, with the former necessitating veridical memory for three-syllable combinations and the latter necessitating the extraction of morphological regularities (e.g., artificial words ending in "mi" always begin with "pa").

A role for the BG in representing statistical relationships between input elements, or their prediction, is also supported by research on auditory processing. For example, Leaver, Van Lare, Zielinski, Halpern, and Rauschecker (2009) linked the BG to auditory sequencing, a process that is closely related to the processing of statistical information, given that statistical information can only be computed and processed if the independent units in a signal have been properly segmented. In that study, participants were scanned while anticipating a familiar or unfamiliar melody. While silently anticipating a familiar melody, the SMA, pre-SMA, left BG (globus pallidus/putamen), and several other regions showed more activity than when anticipating a recently learned melody, suggesting a role in auditory prediction. However, for the BG, this pattern was stronger during the early stages of familiarization with the "familiar" melody than in later stages, suggesting that the involvement of the BG is related to the learning process itself ("learning stimulus associations," ibid., p. 2482) rather than prediction.

Whether the BG mediates pattern learning or the use of statistical information for prediction is more directly addressed by recent work on beat perception, stimulus anticipation, and music perception (Grahn & Rowe, 2013). In that study, the authors presented participants with auditory streams consisting of regular beats that either changed or were the same as the beat patterns in streams presented immediately before. Results showed that the putamen was most strongly active in the condition where the current pattern was the same as the previous one, and showed the least activity when the current pattern was new and did not bear a relation to the prior one. Mid-levels of activity were found for changes in speed. On the basis of these findings, the authors proposed that "the basal ganglia's response profile suggests a role in beat prediction not beat finding" (Grahn & Rowe, 2013, p. 913). Geiser, Notter, and Gabrieli (2012) also linked the putamen to processing of auditory regularities by demonstrating that it shows more activity in the context of regular (versus irregular) beat patterns (the opposite pattern found by McNealy et al., 2006) and, furthermore, that across individuals, greater activity in the putamen was negatively correlated with activity in auditory cortex. We note that the studies of Grahn and Rowe and of Geiser et al. address predictions given temporal regularities rather than regularities of content; establishing the relation between these two kinds of phenomena, in the domain of language, is a topic for future work.

It is still unclear whether activity in BG reflects the coding of regularity, predictive processes, or evaluation of input against predictions. Seger et al. (2013) found that the caudate head and body (though not the putamen) were sensitive to the degree of mismatch in the musical progression of a chord series, suggesting that the caudate is involved in the evaluation of predictions, whereas the putamen is involved in constructing predictions. However, even this interpretation of putamen activity necessitates further work, as Langner et al. (2011) have shown that the putamen shows higher activity when more specific versus less specific predictions are violated (see also Haruno & Kawato, 2006). Furthermore, it is also unclear whether temporal or content-based predictions mediated by BG, such as those reviewed in the aforementioned studies, extend to the lexical level. Interestingly, some work suggests that the role of BG in the construction or evaluation of predictions is limited to the sublexical level. Wahl et al. (2008) examined responses to semantic or syntactic violations in spoken sentences. Deep brain recordings monitored activity in the thalamus and BG. Results showed that the BG were not sensitive to either type of violation, leading the authors to propose that "syntactic and semantic language analysis is primarily realized within cortico-thalamic networks, whereas a cohesive BG network is not involved in these essential operations of language analysis" (p. 697). It is therefore possible that the BG mediates statistical learning, but only when learning applies to sublexical units (phonemic, syllabic, morphemic) rather than to word- or clause-level phenomena.

43.2.2.2 Debates on the Functional Role of Basal Ganglia

It is important to consider the empirical findings reviewed here in the context of the ongoing discussion on the role of BG in language comprehension. Whether the BG's computations during language comprehension are intrinsically linguistic and related to grammatical computations or, alternatively, associated with more general functions is currently being debated. Arguing for the linguistic-computation view, Ullman (2001) has suggested that the BG subserves aspects of mental grammar and that it allows learning of grammars in various domains, including syntax, morphology, and even phonology. Given that grammar learning reflects, at minimum, extraction of regularities or invariants, this would entail sophisticated mechanisms for extracting regularities that can include phrase-structure rules, lexical-level constraints, or rules governing long-distance movements between constituents. This viewpoint has received empirical support from work (Nemeth et al., 2011) showing that in dual-task conditions where one task involves statistically driven motor sequence learning, a parallel sentence comprehension task diminishes learning, whereas mathematical or word-monitoring parallel tasks do not. In contrast, Crosson and colleagues have argued that the BG is not involved in primary language functions, but rather increases signal to noise during action selection. Thus, during speech comprehension, the region may be implicated in increasing the coding precision (activation sharpening) of those phonemes most appropriate in a given context (Ford et al., 2013, p. 6). Understanding the role of BG in statistical contexts containing both auditory and nonauditory language content is important for differentiating these theoretical frameworks.

43.3 CONNECTIONAL ANATOMY OF THE STATISTICAL NETWORK

Our review suggests that the caudal part of the STP, including the posterior STG and PT extending to the TTG, as well as the striatum, and possibly the IFG (pars opercularis, pars triangularis), form the core regions involved in processing statistical information. However, whether these regions form a functional network for segmenting auditory inputs or processing embedded statistical information remains unclear, because no studies thus far have attempted to examine the connectivity patterns associated with processing auditory statistics. We briefly outline reasons for supposing these three areas (IFG, posterior-middle temporal cortex, BG) have the potential for integrative processing of statistics in the auditory stream.

The idea of some degree of anatomical connectivity between the human IFG and posterior temporal areas has a long history, but it is not without controversy (see Dick, Bernal, & Tremblay, 2013, and Dick & Tremblay, 2012, for reviews on the subject and references therein for data in support of and against the presence of such connectivity). There is, however, evidence for functional connectivity between these regions (Hampson et al., 2006).

The other question of interest concerns the connectivity of the striatum with the posterior STP and/or IFG. Although there is extended structural connectivity (monosynaptic) between the cortex and the striatum, evidence of anatomical connectivity between the IFG and the striatum, as well as between the STP and the striatum, is scarce. Schmahmann and Pandya (2006), using an anterograde tract-tracer technique in the macaque monkey, reported some projections from the posterior STG to the caudate nucleus. In humans, a recent diffusion MRI study conducted on 13 healthy adults suggests that the temporal cortex (including its posterior segment) projects to the posterior part of the caudate nucleus (Kotz, Anwander, Axer, & Knosche, 2013), consistent with the macaque evidence. Documentation of structural connectivity between the IFG and the striatum is scarce. A recent DTI study conducted on a relatively small sample (N = 10) and focusing on the IFG reported that both the opercular and triangular convolutions of the IFG connect with the rostral putamen (Ford et al., 2013).

Another way of assessing connectivity between regions is to examine their functional connectivity, that is, the tendency (in the statistical sense) of different regions to be active synchronously, which does not necessarily imply direct (monosynaptic) connectivity. The functional connectivity literature, however, provides a slightly less consistent account. A detailed study of the functional connectivity of the human striatum conducted on a large sample (N = 1000; Choi, Yeo, & Buckner, 2012) reported only weak connectivity between the posterior STP and the striatum and between the IFG and the striatum. Using dynamic causal modeling (DCM) during a phonological (rhyming judgment) task, Booth et al. (2007) reported functional connectivity between the putamen and left IFG as well as between the putamen and the lateral temporal cortex (including both superior and middle temporal areas). Another study, using resting-state connectivity, documented functional connectivity between the putamen and posterior STP (Di Martino et al., 2008).

In sum, there is some evidence of functional connectivity but very limited evidence of anatomical connectivity between the IFG and posterior STP and the striatum. Additional research is needed to further current understanding of how these areas, which appear to form the core of a statistical information processing network, interact either directly or indirectly to extract and process that information.

43.4 RELATED WORK AND FURTHER AFIELD

There are a number of research lines that appear related to the processing of statistical information but address substantially different issues. The first defines statistical relations between inputs not in terms of formal properties such as transition probabilities, but in terms of the degree of physical (metric) change in (sound) frequency space that occurs between tones in more and less ordered sequences. In that literature, regularity is quantified via the serial autocorrelation of tones in the pitch sequence (Overath et al., 2007; Patel & Balaban, 2000), so that regularity is higher when low frequencies dominate or sample entropy is lower. These studies have implicated auditory cortical regions in regularities, but it is important to note that the term "regularity" is used differently and that the explanations for such effects could have a purely physiological basis: when "statistical regularity" is related to the magnitude of transitions in frequency space, the explanation for these effects could be due to low-level repetition suppression effects per se, driven by the tuning curves of auditory neurons.

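To illustrate what "regularity" means in this literature, the sketch below computes the lag-1 serial autocorrelation of a pitch sequence: a sequence ascending in small, consistent steps scores strongly positive, whereas the same tones in shuffled order score near zero. This is only a schematic illustration of serial autocorrelation; it does not reproduce the specific measures of Overath et al. (2007) or Patel and Balaban (2000), and the tone values are invented.

    import random

    def autocorr(x, lag=1):
        # Serial autocorrelation of a numeric sequence at the given lag.
        n = len(x)
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x)
        cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
        return cov / var

    ordered = [220 * 2 ** (k / 12) for k in range(13)]  # ascending semitone steps
    shuffled = random.sample(ordered, len(ordered))     # same tones, order destroyed

    print(round(autocorr(ordered), 2))   # strongly positive: an "ordered" sequence
    print(round(autocorr(shuffled), 2))  # near zero on average: a "random" sequence
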
There is also a body of work examining neural responses to varying levels of uncertainty, but in which uncertainty originates from the degree of uniformity of the distribution from which the tokens are sampled (captured by Shannon's entropy of the variable generating the input), rather than from the strength of transition probabilities. Particularly relevant are studies that did not require directing attention to stimulus statistics (Harrison, Duggins, & Friston, 2006; Strange, Duggins, Penny, Dolan, & Friston, 2005; Tobia, Iacovella, & Hasson, 2012). The studies of Strange et al. (2005) and Harrison et al. (2006) used visual stimuli and linked the hippocampus to coding series uncertainty (quantified via Shannon's entropy). However, one study using passive listening to auditory stimuli (Tobia, Iacovella, & Hasson, 2012) has identified perisylvian regions, including both inferior parietal and lateral temporal cortex, as sensitive to the relative frequency distribution of auditory tokens without hippocampal involvement (see Tobia, Iacovella, & Hasson, 2012, for a discussion of the role of the hippocampus in statistical learning).

Another important and unresolved issue concerns the domain specificity of the neural processes involved in learning and representation of statistical information. Our own work (Nastase et al., 2014) suggests that rapid learning of sequence statistics within auditory or visual streams is largely mediated by separate neural systems, with little overlap and without hippocampal involvement. Others (Bubic, von Cramon, & Schubotz, 2011) have shown that violations of regularity-induced predictions evoke activity in different systems depending on whether the regularity is temporal-, location-, or feature-based, suggesting a domain-specific basis for the construction or evaluation of predictions. However, several studies (Cristescu, Devlin, & Nobre, 2006; Egner et al., 2008) suggest that there are domain-general systems that mediate predictions of items' perceptual features, location, and semantic features.

43.5 CONCLUSION AND FUTURE WORK

The study of the neurobiological substrates underlying the representation and use of statistical information is in its initial stages. Although statistical information may be crucial to language acquisition and online speech processing, experimental work has just begun identifying the neural mechanisms mediating these processes as well as their functional importance. Several core areas identified in this experimental work are regions of the STP and lateral temporal cortex, which appear to be sensitive to the degree of statistical regularity in an input and potentially to initial word segmentation, and left IFG regions, which play a role in word recognition but may not be involved in the coding of statistics per se. The BG appears to have a role in online temporal predictions and anticipation, with a particularly important role for the putamen, a core part of the cortico-basal motor loop. Finally, although there is some evidence for functional and structural connectivity between the IFG, STP, and BG, future work is needed to examine the relation between statistical processing mechanisms and the strength of functional and structural connectivity within this network.

References

Abla, D., & Okanoya, K. (2008). Statistical segmentation of tone sequences activates the left inferior frontal cortex: A near-infrared spectroscopy study. Neuropsychologia, 46(11), 2787–2795. Available from: http://dx.doi.org/10.1016/j.neuropsychologia.2008.05.012.
Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9, 357–381.
Alm, P. A. (2004). Stuttering and the basal ganglia circuits: A critical review of possible relations. Journal of Communication Disorders, 37(4), 325–369. Available from: http://dx.doi.org/10.1016/j.jcomdis.2004.03.001.
Astheimer, L. B., & Sanders, L. D. (2011). Predictability affects early perceptual processing of word onsets in continuous speech. Neuropsychologia, 49(12), 3512–3516. Available from: http://dx.doi.org/10.1016/j.neuropsychologia.2011.08.014.
Bahlmann, J., Schubotz, R. I., & Friederici, A. D. (2008). Hierarchical artificial grammar processing engages Broca's area. Neuroimage, 42(2), 525–534. Available from: http://dx.doi.org/10.1016/j.neuroimage.2008.04.249.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76(4), 695–711. Available from: http://dx.doi.org/10.1016/j.neuron.2012.10.038.
Booth, J. R., Wood, L., Lu, D., Houk, J. C., & Bitan, T. (2007). The role of the basal ganglia and cerebellum in language processing. Brain Research, 1133(1), 136–144. Available from: http://dx.doi.org/10.1016/j.brainres.2006.11.074.
Bubic, A., von Cramon, D. Y., & Schubotz, R. (2011). Exploring the detection of associatively novel events using fMRI. Human Brain Mapping, 32(3), 370–381. Available from: http://dx.doi.org/10.1002/hbm.21027.
Buiatti, M., Pena, M., & Dehaene-Lambertz, G. (2009). Investigating the neural correlates of continuous speech computation with frequency-tagged neuroelectric responses. Neuroimage, 44(2), 509–519. Available from: http://dx.doi.org/10.1016/j.neuroimage.2008.09.015.
Choi, E. Y., Yeo, B. T., & Buckner, R. L. (2012). The organization of the human striatum estimated by intrinsic functional connectivity. Journal of Neurophysiology, 108(8), 2242–2263. Available from: http://dx.doi.org/10.1152/jn.00270.2012.
Cole, R. A., & Jakimik, J. (1980). How are syllables used to recognize words? The Journal of the Acoustical Society of America, 67(3), 965–970.
Conway, C. M., Bauernschmidt, A., Huang, S. S., & Pisoni, D. B. (2010). Implicit statistical learning in language processing: Word predictability is the key. Cognition, 114(3), 356–371. Available from: http://dx.doi.org/10.1016/j.cognition.2009.10.009.
Cristescu, T. C., Devlin, J. T., & Nobre, A. C. (2006). Orienting attention to semantic categories. NeuroImage, 33(4), 1178–1187. Available from: http://dx.doi.org/10.1016/j.neuroimage.2006.08.017.
De Diego-Balaguer, R., Couette, M., Dolbeau, G., Durr, A., Youssov, K., & Bachoud-Levi, A. C. (2008). Striatal degeneration impairs language learning: Evidence from Huntington's disease. Brain, 131(Pt 11), 2870–2881. Available from: http://dx.doi.org/10.1093/brain/awn242.
Deschamps, I., Hasson, U., & Tremblay, P. (submitted). Cortical thickness in perisylvian regions correlates with individuals' sensitivity to statistical regularities in auditory sequences.
Dick, A. S., Bernal, B., & Tremblay, P. (2013). The language connectome: New pathways, new concepts. Neuroscientist. Available from: http://dx.doi.org/10.1177/1073858413513502.
Dick, A. S., & Tremblay, P. (2012). Beyond the arcuate fasciculus: Consensus and controversy in the connectional anatomy of language. Brain, 135(Pt 12), 3529–3550. Available from: http://dx.doi.org/10.1093/brain/aws222.
Di Martino, A., Scheres, A., Margulies, D. S., Kelly, A. M., Uddin, L. Q., Shehzad, Z., et al. (2008). Functional connectivity of human striatum: A resting state fMRI study. Cerebral Cortex, 18(12), 2735–2747. Available from: http://dx.doi.org/10.1093/cercor/bhn041.


Egner, T., Monti, J. M. P., Trittschuh, E. H., Wieneke, C. A., Hirsch, J., & Mesulam, M. M. (2008). Neural integration of top-down spatial and feature-based information in visual search. The Journal of Neuroscience, 28(24), 6141–6151. Available from: http://dx.doi.org/10.1523/JNEUROSCI.3562-08.2008.
Fiser, J., & Aslin, R. N. (2001). Unsupervised statistical learning of higher-order spatial structures from visual scenes. Psychological Science, 12(6), 499–504.
Fiser, J., & Aslin, R. N. (2002). Statistical learning of new visual feature combinations by infants. Proceedings of the National Academy of Sciences of the United States of America, 99(24), 15822–15826. Available from: http://dx.doi.org/10.1073/pnas.232472899.
Fitch, W. T., & Friederici, A. D. (2012). Artificial grammar learning meets formal language theory: An overview. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences, 367(1598), 1933–1955. Available from: http://dx.doi.org/10.1098/rstb.2012.0103.
Folia, V., Forkstam, C., Ingvar, M., Hagoort, P., & Petersson, K. M. (2011). Implicit artificial syntax processing: Genes, preference, and bounded recursion. Biolinguistics, 5, 105–132.
Ford, A. A., Triplett, W., Sudhyadhom, A., Gullett, J., McGregor, K., Fitzgerald, D. B., et al. (2013). Broca's area and its striatal and thalamic connections: A diffusion-MRI tractography study. Frontiers in Neuroanatomy, 7, 8. Available from: http://dx.doi.org/10.3389/fnana.2013.00008.
Furl, N., Kumar, S., Alter, K., Durrant, S., Shawe-Taylor, J., & Griffiths, T. D. (2011). Neural prediction of higher-order auditory sequence statistics. Neuroimage, 54(3), 2267–2277. Available from: http://dx.doi.org/10.1016/j.neuroimage.2010.10.038.
Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22(7), 615–621. Available from: http://dx.doi.org/10.1016/j.cub.2012.02.015.
Gaskell, M. G., & Marslen-Wilson, W. D. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12(5–6), 613–656. Available from: http://dx.doi.org/10.1080/016909697386646.
Geiser, E., Notter, M., & Gabrieli, J. D. (2012). A corticostriatal neural system enhances auditory perception through temporal context processing. The Journal of Neuroscience, 32(18), 6177–6182. Available from: http://dx.doi.org/10.1523/JNEUROSCI.5153-11.2012.
Grahn, J. A., & Rowe, J. B. (2013). Finding and feeling the musical beat: Striatal dissociations between detection and prediction of regularity. Cerebral Cortex, 23(4), 913–921. Available from: http://dx.doi.org/10.1093/cercor/bhs083.
Graybiel, A. M. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of Learning and Memory, 70(1–2), 119–136. Available from: http://dx.doi.org/10.1006/nlme.1998.3843.
Hampson, M., Tokoglu, F., Sun, Z., Schafer, R. J., Skudlarski, P., Gore, J. C., et al. (2006). Connectivity-behavior analysis reveals that functional connectivity between left BA39 and Broca's area varies with reading ability. Neuroimage, 31(2), 513–519. Available from: http://dx.doi.org/10.1016/j.neuroimage.2005.12.040.
Harrison, L. M., Duggins, A., & Friston, K. J. (2006). Encoding uncertainty in the hippocampus. Neural Networks, 19(5), 535–546.
Haruno, M., & Kawato, M. (2006). Different neural correlates of reward expectation and reward expectation error in the putamen and caudate nucleus during stimulus-action-reward association learning. Journal of Neurophysiology, 95(2), 948–959. Available from: http://dx.doi.org/10.1152/jn.00382.2005.
Karuza, E. A., Newport, E. L., Aslin, R. N., Starling, S. J., Tivarus, M. E., & Bavelier, D. (2013). The neural correlates of statistical learning in a word segmentation task: An fMRI study. Brain and Language, 127(1), 46–54. Available from: http://dx.doi.org/10.1016/j.bandl.2012.11.007.
Kirkham, N. Z., Slemmer, J. A., & Johnson, S. P. (2002). Visual statistical learning in infancy: Evidence for a domain general learning mechanism. Cognition, 83(2), B35–B42.
Kotz, S. A., Anwander, A., Axer, H., & Knosche, T. R. (2013). Beyond cytoarchitectonics: The internal and external connectivity structure of the caudate nucleus. PLoS One, 8(7), e70141. Available from: http://dx.doi.org/10.1371/journal.pone.0070141.
Langner, R., Kellermann, T., Boers, F., Sturm, W., Willmes, K., & Eickhoff, S. B. (2011). Modality-specific perceptual expectations selectively modulate baseline activity in auditory, somatosensory, and visual cortices. Cerebral Cortex, 21(12), 2850–2862. Available from: http://dx.doi.org/10.1093/cercor/bhr083.
Leaver, A. M., Van Lare, J., Zielinski, B., Halpern, A. R., & Rauschecker, J. P. (2009). Brain activation during anticipation of sound sequences. The Journal of Neuroscience, 29(8), 2477–2485. Available from: http://dx.doi.org/10.1523/JNEUROSCI.4921-08.2009.
Lehiste, I., & Shockey, L. (1972). On the perception of coarticulation effects in English VCV syllables. Journal of Speech and Hearing Research, 15(3), 500–506.
Lieberman, M. D., Chang, G. Y., Chiao, J., Bookheimer, S. Y., & Knowlton, B. J. (2004). An event-related fMRI study of artificial grammar learning in a balanced chunk strength design. Journal of Cognitive Neuroscience, 16(3), 427–438. Available from: http://dx.doi.org/10.1162/089892904322926764.
Lieberman, P. (2001). On the subcortical bases of the evolution of language. In J. Trabant & S. Ward (Eds.), New essays on the origin of language (pp. 1–20). Berlin/Hawthorne, NY: Mouton de Gruyter.
McCarthy, J. (2001). A thematic guide to optimality theory. Cambridge: Cambridge University Press.
McNealy, K., Mazziotta, J. C., & Dapretto, M. (2006). Cracking the language code: Neural mechanisms underlying speech parsing. The Journal of Neuroscience, 26(29), 7629–7639. Available from: http://dx.doi.org/10.1523/JNEUROSCI.5501-05.2006.
McNealy, K., Mazziotta, J. C., & Dapretto, M. (2010). The neural basis of speech parsing in children and adults. Developmental Science, 13(2), 385–406. Available from: http://dx.doi.org/10.1111/j.1467-7687.2009.00895.x.
Minati, L., Rosazza, C., D'Incerti, L., Pietrocini, E., Valentini, L., Scaioli, V., et al. (2008). FMRI/ERP of musical syntax: Comparison of melodies and unstructured note sequences. Neuroreport, 19(14), 1381–1385. Available from: http://dx.doi.org/10.1097/WNR.0b013e32830c694b.
Misyak, J. B., & Christiansen, M. H. (2012). Statistical learning and language: An individual differences study. Language Learning, 62(1), 302–331. Available from: http://dx.doi.org/10.1111/j.1467-9922.2010.00626.x.
Mustovic, H., Scheffler, K., Di Salle, F., Esposito, F., Neuhoff, J. G., Hennig, J., et al. (2003). Temporal integration of sequential auditory events: Silent period in sound pattern activates human planum temporale. Neuroimage, 20(1), 429–434.
Nastase, S., Iacovella, V., & Hasson, U. (2014). Uncertainty in visual and auditory series is coded by modality-general and modality-specific neural systems. Human Brain Mapping. Available from: http://dx.doi.org/10.1002/hbm.22238.
Nemeth, D., Janacsek, K., Csifcsak, G., Szvoboda, G., Howard, J. H., Jr., & Howard, D. V. (2011). Interference between sentence processing and probabilistic implicit sequence learning. PLoS One, 6(3), e17577. Available from: http://dx.doi.org/10.1371/journal.pone.0017577.

F. PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL REFERENCES 537

Newport, E. L., & Aslin, R. N. (2004). Learning at a distance I. Saint-Cyr, J. A. (2003). Frontal-striatal circuit functions: Context, Statistical learning of non-adjacent dependencies. Cognitive sequence, and consequence. Journal of the International Psychology, 48(2), 127162. Available from: http://dx.doi.org/ Neuropsychological Society, 9(1), 103127. 10.1016/S0010-0285(03)00128-2 [pii]. Schmahmann, J. D., & Pandya, D. N. (2006). Fiber pathways of the Overath, T., Cusack, R., Kumar, S., von Kriegstein, K., Warren, J. D., brain. Oxford/New York, NY: Oxford University Press. Grube, M., et al. (2007). An information theoretic characterisation Seger, C. A., Spiering, B. J., Sares, A. G., Quraini, S. I., Alpeter, C., of auditory encoding. PLoS Biology, 5(11), e288. Available from: David, J., et al. (2013). Corticostriatal contributions to musical http://dx.doi.org/10.1371/journal.pbio.0050288. expectancy perception. Journal of Cognitive Neuroscience, 25(7), Patel, A. D., & Balaban, E. (2000). Temporal patterns of human corti- 10621077. Available from: http://dx.doi.org/10.1162/ cal activity reflect tone sequence structure. Nature, 404(6773), jocn_a_00371. 8084. Available from: http://dx.doi.org/10.1038/35003577. Strange, B. A., Duggins, A., Penny, W., Dolan, R. J., & Friston, K. J. Pelucchi, B., Hay, J. F., & Saffran, J. R. (2009a). Learning in reverse: (2005). Information theory, novelty and hippocampal responses: Eight-month-old infants track backward transitional probabilities. Unpredicted or unpredictable? Neural Networks, 18(3), 225230. Cognition, 113(2), 244247. Available from: http://dx.doi.org/ Tobia, M. J., Iacovella, V., Davis, B., & Hasson, U. (2012). Neural sys- 10.1016/j.cognition.2009.07.011. tems mediating recognition of changes in statistical regularities. Pelucchi, B., Hay, J. F., & Saffran, J. R. (2009b). Statistical learning in Neuroimage, 63(3), 17301742. Available from: http://dx.doi.org/ a natural language by 8-month-old infants. Child development, 80 10.1016/j.neuroimage.2012.08.017. (3), 674685. Available from: http://dx.doi.org/10.1111/j.1467- Tobia, M. J., Iacovella, V., & Hasson, U. (2012). Multiple sensitivity 8624.2009.01290.x. profiles to diversity and transition structure in non-stationary Pena, M., Bonatti, L. L., Nespor, M., & Mehler, J. (2002). Signal-driven input. Neuroimage, 60(2), 9911005. Available from: http://dx. computations in speech processing. Science, 298(5593), 604607. doi.org/10.1016/j.neuroimage.2012.01.041. Available from: http://dx.doi.org/10.1126/science.1072901. Tremblay, P., Baroni, M., & Hasson, U. (2012). Processing of speech Perruchet, P., & Pacton, S. (2006). Implicit learning and statistical and non-speech sounds in the supratemporal plane: Auditory learning: One phenomenon, two approaches. Trends in Cognitive input preference does not predict sensitivity to statistical struc- Sciences, 10(5), 233238. Available from: http://dx.doi.org/ ture. Neuroimage, 66C, 318332. Available from: http://dx.doi. 10.1016/j.tics.2006.03.006. org/10.1016/j.neuroimage.2012.10.055. Petersson, K. M., Folia, V., & Hagoort, P. (2012). What artificial gram- Ullman, M. T. (2001). A neurocognitive perspective on language: The mar learning reveals about the neurobiology of syntax. Brain and declarative/procedural model. Nature Reviews Neuroscience, 2(10), Language, 120(2), 8395. Available from: http://dx.doi.org/ 717726. Available from: http://dx.doi.org/10.1038/35094573. 10.1016/j.bandl.2010.08.003. Wahl, M., Marzinzik, F., Friederici, A. 
D., Hahne, A., Kupsch, A., Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learn- Schneider, G. H., et al. (2008). The human thalamus processes syn- ing by 8-month-old infants. Science, 274(5294), 19261928. tactic and semantic language violations. Neuron, 59(5), 695707. Saffran, J. R., Johnson, E. K., Aslin, R. N., & Newport, E. L. (1999). Available from: http://dx.doi.org/10.1016/j.neuron.2008.07.011. Statistical learning of tone sequences by human infants and Wurm, L. H., Ernestus, M., Schreude, R., & Baayen, R. H. (2006). adults. Cognition, 70(1), 2752. Available from: http://dx.doi. Dynamics of the auditory comprehension of prefixed words: org/S0010-0277(98)00075-4. Cohort entropies and conditional root uniqueness points. Mental Saffran, J. R., Pollak, S. D., Seibel, R. L., & Shkolnik, A. (2007). Dog is Lexicon, 1(1), 125146. a dog is a dog: Infant rule learning is not specific to language. Yaron, A., Hershenhoren, I., & Nelken, I. (2012). Sensitivity to complex Cognition, 105(3), 669680. Available from: http://dx.doi.org/ statistical regularities in rat auditory cortex. Neuron, 76(3), 603615. 10.1016/j.cognition.2006.11.004. Available from: http://dx.doi.org/10.1016/j.neuron.2012.08.025.
