
NEUROBIOLOGY OF LANGUAGE

Edited by GREGORY HICKOK Department of Cognitive Sciences, University of California, Irvine, CA, USA

STEVEN L. SMALL Department of Neurology, University of California, Irvine, CA, USA

Academic Press is an imprint of Elsevier

SECTION C

BEHAVIORAL FOUNDATIONS

CHAPTER 12

Phonology

William J. Idsardi1,2 and Philip J. Monahan3,4
1Department of Linguistics, University of Maryland, College Park, MD, USA; 2Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA; 3Centre for French and Linguistics, University of Toronto Scarborough, Toronto, ON, Canada; 4Department of Linguistics, University of Toronto, Toronto, ON, Canada

12.1 INTRODUCTION

Phonology is typically defined as "the study of speech sounds of a language or languages, and the laws governing them,"1 particularly the laws governing the composition and combination of speech sounds in language. This definition reflects a segmental bias in the historical development of the field, and we can offer a more general definition: the study of the knowledge and representations of the sound system of human languages. From a neurobiological or cognitive neuroscience perspective, one can consider phonology as the study of the mental model for human speech. In this brief review, we restrict ourselves to spoken language, although analogous concerns hold for signed language (Brentari, 2011). Moreover, we limit the discussion to what we consider the most important aspects of phonology. These include: (i) the mappings between three systems of representation: action, perception, and long-term memory; (ii) the fundamental components of speech sounds (i.e., distinctive features); (iii) the laws of combinations of speech sounds, both adjacent and long-distance; and (iv) the chunking of speech sounds into larger units, especially syllables.

To begin, consider the word-form "glark." Given this string of letters, native speakers of English will have an idea of how to pronounce it and what it would sound like if another person said it. They would have little idea, if any, of what it means.2 The meaning of a word is arbitrary given its form, and it could mean something else entirely. Consequently, we can have very specific knowledge about a word's form from a single presentation and can recognize and repeat such word-forms without much effort, all without knowing its meaning. Phonology studies the regularities of form (i.e., "rules without meaning") (Staal, 1990) and the laws of combination for speech sounds and their sub-parts.

Any account needs to address the fact that speech is produced by one anatomical system (the mouth) and perceived with another (the auditory system). Our ability to repeat new word-forms, such as "glark," is evidence that people effortlessly map between these two systems. Moreover, new word-forms can be stored in both short-term and long-term memory. As a result, phonology must confront the conversion of representations (i.e., data structures) between three broad neural systems: memory, action, and perception (the MAP loop; Poeppel & Idsardi, 2011). Each system has further sub-systems that we ignore here. The basic proposal is that this is done through the use of phonological primitives (features), which are temporally organized (chunked, grouped, coordinated) on at least two fundamental time scales: the feature or segment and the syllable (Poeppel, 2003).

12.2 SPEECH SOUNDS AND THE MAP LOOP

The alphabet is an incredible human invention, but its ubiquity overly influences our ideas regarding the

1. Longman Dictionary of Contemporary English.
2. Urban Dictionary (http://www.urbandictionary.com/) states that it means "to slowly grasp the meaning of a word or concept, based on the situation in which it is used" (i.e., almost grokking a concept).

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00012-2 © 2016 Elsevier Inc. All rights reserved.

basic units of speech. This continues to this day and is evident in the influence of the International Phonetic Alphabet (IPA; http://www.langsci.ucl.ac.uk/ipa/) for transcribing speech. Not all writing systems are alphabetic, however. Some languages choose orthographic units larger than single sounds (moras, syllables) and a few, such as Bell's Visible Speech (Bell, 1867) and the Korean orthographic system Hangul (Kim-Renaud, 1997), decompose sounds into their component articulations, all of which constitute important, interconnected representations for speech.

12.2.1 Action or Articulation of Speech

The musculature of the mouth has historically been somewhat more accessible to investigation than audition or memory, and linguistic phonetics has often displayed a bias toward classifying speech sounds in terms of the actions needed to produce them (i.e., the articulation of the speech sounds by the mouth). For example, the standard IPA charts for consonants and vowels (Figure 12.1) are organized by how speech sounds are articulated. The columns in Figure 12.1A arrange consonants with respect to where they are articulated in the mouth (note: right to left corresponds to anterior to posterior position within the oral cavity), and the rows correspond to how they are articulated (i.e., their manner of articulation). The horizontal dimension in Figure 12.1B represents the relative frontness-backness of the tongue, and the vertical dimension represents the aperture of the mouth during production. These are the standard methods for organizing consonant and vowel inventories in languages.

Within the oral cavity, there are several controllable structures used to produce speech sounds. These include the larynx, the velum, the tongue (which is further divided into three relatively independently moveable sections: the tongue blade, the tongue dorsum, and the tongue root), and the lips (see Figure 12.2, reproduced from Bell, 1867; for more detail see Zemlin, 1998).

Each of these structures has some degrees of freedom of movement, which we describe in terms of their deflection from a neutral posture for speaking. The position for the mid-central vowel, [ə], is considered to be the neutral posture of the speech articulators. In most structures, two opposite directions of movement are possible, yielding three stable regions of articulation; that is, the tongue dorsum can be put into a high, mid (neutral), or low position. In the neutral posture the velum is closed, but it can be opened to allow air to flow through the nose, and such speech sounds are classified as nasal (as in English "m" [m]). The lips can deflect from the neutral posture by being rounded (as in English "oo" [u]) or drawn back (as in English "ee" [i]). The tongue tip can be curled concavely or convexly either along its length (yielding retroflex and laminal sounds, respectively) or across its width (yielding grooved and lateral sounds, respectively). The tongue dorsum (as mentioned) can be moved vertically (high or low) and horizontally (front or back), and the tongue root can be moved horizontally (advanced or retracted).

The larynx (Figure 12.3) is particularly complex and can be moved along three different dimensions: modifying its vertical position (raised or lowered), modifying its tilt (rotated forward to slacken the vocal folds or rotated backwards to stiffen them), and changing the degree of separation of the vocal folds (adducted or abducted). Furthermore, the lips and the tongue blade and dorsum can close off the mouth to different degrees (termed the "manner" of production): completely closed (stops), nearly closed with turbulent airflow (fricatives), or substantially open (approximants).

Taken together, these articulatory maneuvers describe how to make various speech sounds. For example, an English [s], as in "sea," is an abducted (voiceless, high glottal airflow) grooved fricative. Furthermore, as described in Section 12.3, the antagonistic relationships between articulator movements serve as the basis for the featural distinctions (whether

FIGURE 12.1 IPA charts for American English (A) consonants and (B) vowels. The consonant chart (A) is organized by place of articulation (bilabial, labiodental, interdental, alveolar, alveopalatal, palatal, velar, glottal) and manner of articulation (stop, fricative, nasal, lateral, retroflex, glide); the vowel chart (B) by frontness (front, central, back) and aperture (close, close-mid, open-mid, open).

monovalent, equipollent, or binary; see Fant, 1973; Trubetzkoy, 1969) that have proven so powerful in understanding not only the composition of speech sounds but also the phonology of human language.

FIGURE 12.2 Speech articulators: 1. The larynx; 2. The pharynx; 3. The soft palate; 4. The action of the soft palate in closing the nasal passage; 5. The back of the tongue; 6. The front of the tongue; 7. The point of the tongue; 8. The lips. Note that the terms in 1-8 are all still in current usage (Zemlin, 1998: 251) and that there are many synonymous terms in common use. In the current chapter, articulator 5 is known as the tongue dorsum, 6 is the tongue blade or corona, and 7 is the tongue tip or apex. From Bell (1867) with permission from the publishers.

12.2.2 Perception or Audition of Speech

A great deal of the literature regarding speech perception deals with how "special" speech is (Liberman, 1996) or is not. Often, this is cast as a debate between the motor theory of speech perception (Liberman & Mattingly, 1985) and speech as an area of expertise within general auditory perception (Carbonnell & Lotto, 2014). The motor theory of speech perception posits speech-specific mechanisms that recover the intended articulatory gestures that produced the physical auditory stimulus. General auditory perception models, however, posit that the primary representational modality of speech perception is auditory and that the mechanisms used during speech perception are the same as those responsible for nonspeech auditory perception. This dichotomy, in some ways, parallels debates about face perception (Rhodes, Calder, Johnson, & Haxby, 2011). Since the development of the sound spectrograph (Potter, Kopp, & Green, 1947) and the Haskins pattern playback machine (Cooper, Liberman, & Borst, 1951), it has been known that it is technologically feasible to analyze and accurately reproduce speech with time-frequency-amplitude analysis techniques, as in the spectrogram in Figure 12.4, where time is on the horizontal axis, frequency is on the vertical axis, and amplitude is illustrated in the relative darkness of the pixels.

FIGURE 12.3 The external view of the larynx, with labeled structures: hyoid bone, thyrohyoid membrane, median and lateral thyrohyoid ligaments, superior cornu of thyroid cartilage, laryngeal incisure, superior laryngeal nerve and artery, thyroid cartilage, oblique line, median cricothyroid ligament, cricothyroid muscle, conus elasticus, inferior cornu of thyroid cartilage, cricothyroid joint, cricoid cartilage, trachea. From Olek Remesz, http://commons.wikimedia.org/wiki/File:Larynx_external_en.svg, with permission from the publisher.


FIGURE 12.4 Sound spectrogram of a male saying "tata." Time (0-0.7 s) is on the horizontal axis; frequency (0-5,000 Hz) is on the vertical axis.

Linking such technologies with neuroscience, Aertsen and Johannesma (1981) demonstrated that auditory neurons display particular spectro-temporal receptive fields (STRFs) to auditory stimuli, akin to a set of building blocks for spectrograms. So, this sets a strong goal for phonology: find lawful relationships between the available articulator movements and their acoustic/auditory consequences (as in Stevens, 1998), especially in terms of STRFs. We return to this question in the discussion of features. More recently, Mesgarani, David, Fritz, and Shamma (2014) have shown that STRFs derived from measuring responses in ferret primary auditory cortex can be used to "clean" speech in various kinds of noise, including reverberation, demonstrating that the auditory system enhances the neural representation of speech events against background noise.

12.2.3 Memory or the Long-Term Storage of Speech

It is a remarkable fact that humans can retain detailed knowledge about a great number of words, arbitrarily pairing forms with meanings across tens of thousands of cases. This, again, is reminiscent of our memory abilities for faces. More remarkably, the landscape of the form-meaning relation is not at all smooth, because small physical changes in form between two words (what linguists call "minimal pairs") can have profound differences in meaning. If the form-meaning relation were smooth, then there would be many more pairs like "ram" and "lamb," which differ only in their first sound and share a great deal in meaning ("a male sheep" and "a young sheep," respectively; definitions from Merriam-Webster's Collegiate Dictionary, 11th edition). Instead, most cases are like "ramp" and "lamp," which differ in the same small sound attribute ("r" versus "l") but share nothing discernible in meaning (see Hinton, Nichols, & Ohala, 1994 for cases of phonaesthesia).

The fundamental question of long-term memory representations in speech is one of abstraction: attending to and storing critical differences between forms while ignoring irrelevant differences. This is homomorphic to a fundamental problem in vision. A primary problem for visual object recognition is to account for "discrimination among diagnostic object features and generalization across nondiagnostic features" (Hoffman & Logothetis, 2009). The traditional linguistic solution to this problem has been to posit a single long-term memory representation (called the underlying representation [UR], notated with / /) and a set of transformations that yield the observed pronunciation variants (called surface representations [SRs], notated with [ ]). Determining URs for word-forms and the nature of the transformations responsible for SRs (given a particular UR) has been the core goal of generative phonology for the past 50 years (Chomsky & Halle, 1968) and survives as the dominant question in modern generative models of phonology, such as Optimality Theory (OT), which posit ranked violable constraints instead of transformations (Prince & Smolensky, 2004; see Idsardi, 2006 for an argument that OT may not be computationally tractable). This is the analog of "view invariant" models of visual object recognition (Booth & Rolls, 1998). The relation between various URs and SRs can become quite complicated, but it boils down to two kinds of basic cases: derived homophones and derived differences.

Consider the two words "intense" and "intents," which can both be pronounced so that their pronunciations end in [nts]. This is a case of derived homophones. Yoo and Blankenship (2003) show that in an experimental setting such pairs have substantially overlapping pronunciations, and also that a statistical analysis of a corpus shows that the [t] in "intense" is somewhat shorter on average. However, another related word-form, "intensive," is much less likely to exhibit an intrusive [t], with the "ns" usually being pronounced [ns]. Ohala (1997) provides a compelling misproduction explanation for this effect due to the difficulty in precisely coordinating the closing of the velum during production in /ns/ with the simultaneous transition from a complete closure in [n] to a narrow incomplete closure in [s]. If these two changes are not completely synchronized, then for a short period there can be a complete closure in the mouth and a closed velum, resulting in the presence of an intrusive [t]. As Ohala (1997: 85) notes: "the emergent stop is purely epiphenomenal and, indeed, such brief unintended stops are often observed in spontaneous

speech." Now the listener faces the following problem: is that [t] a critical aspect of the word-form or is it an irrelevant detail? Presumably relying on statistics and the pronunciations of related word-forms, such as "intensive" and "intent," learners generally do arrive at different long-term memory representations, one with a /t/ ("intents") and one without ("intense"), as the usual English spellings reflect. But, Ohala argues, the listener cannot always reliably reconstruct the speaker's intent, and this is one cause of historical language change.

The other situation, that of derived differences, is already also illustrated by part of the previous example in that "intensive" is less likely to have an intrusive [t] in its pronunciation than "intense" is, so the pronunciations of the "intense" portion diverge. However, much more dramatic examples can be found in other English words. Consider the word "atom" in Standard American English, for instance. On its own, it can be pronounced as a homophone with "Adam" (the /t/ and /d/ both pronounced as a flap, notated as [ɾ]). When the suffix "-ic" is added to "atom" to create "atomic," its pronunciation contains a portion homophonous with "Tom," where the /t/ now has a pronunciation canonically associated with word-initial position, [tʰ]. Adding "-ic" to "Adam," however, gives "Adamic" with a portion pronounced as "damn." So, ideally we would like to know how to recognize /t/ from the speech signal. That is, what invariant aspects are obtained across [ɾ] and [tʰ] (and any other pronunciations of /t/)? Such speech-sound-sized memorized units are known as phonemes, or as archi-phonemes (Trubetzkoy, 1969) when their pronunciation ranges over an even wider variety of pronunciations, as in the final sound of the negative prefix "in-," which is pronounced [l] in "illegal," [ɹ] in "irregular," [m] in "impossible," and [ŋ] in "incomplete."

The goal, then, is to discover representations that abstract away from irrelevant changes in pronunciation but include diagnostic differences and differentiate between word-forms when they do have distinct pronunciations, such as "atomic" and "Adamic." These two words differ in [tʰ] and [d], motivating a /t/ versus /d/ difference in long-term memory. Note that they also differ in their middle vowels, /a/ versus /æ/, a difference also missing in the corresponding portion of "atom" and "Adam."

Most importantly, memory serves as the mediating representation between audition and articulation (Gow, 2012). That is, the repetition of an auditorily presented word-form must first be mapped from a perceptual representation to a memory representation, and only then mapped to an articulatory representation. There is no short-cut directly from audition to articulation.

12.3 FEATURES OR THE INTERNAL COMPOSITION OF SOUNDS

Let us now return to the "glark" example, but this time consider its cursive written form in English, as in Figure 12.5.

The motions necessary to produce cursive "glark" can be accomplished with a finger (as in finger painting), a handheld stylus or pencil, a computer mouse (as was used to produce Figure 12.5), the tip of the nose, the elbow, or other body parts or instruments. This wide variety of available "articulators" to produce a cursive "glark" is a simple demonstration of the notion of motor equivalence (Wing, 2000). Motor equivalence is one kind of many-to-many mapping between causes and effects, and often is a symptom of an ill-posed inverse problem (of reasoning backward from effects to their causes). In contrast, speech exhibits only limited motor equivalence. It is not generally possible to substitute other coordinated sound-producing gestures (such as finger snapping) even when their acoustic effect might be appropriate (for example, as substitutes for the click sounds found in southern African languages, some of which are described as sounding like "sharply snapped fingers"; see http://en.wikipedia.org/wiki/Click_consonant). One stereotypical (and racist) example of a strong form of motor equivalence is the portrayal of Native American war chants with a "woo woo" sound produced by repetitively covering the mouth with a hand (and infants often explore such activities, e.g., https://www.youtube.com/watch?v=tcWmMPNUVb4). In this case, the acoustic effect of closing and opening an acoustic tube at one end can be accomplished either by closing and opening the lips or by placing and removing the palm of the hand from the lips; however, even though infants explore such gestures, no language uses the hand to make the bilabial [w].

However, it is true that some conditions (ventriloquism, speaking with an object held by the teeth) show that some limited, approximate compensation is possible, but other simple "experiments" (such as the children's taunt of saying "I was born on a pirate ship" while holding your tongue with your fingers) show that some speech features can really only be performed by a single articulator, and no adequate compensation is possible. Speakers robustly compensate during production when their vocal tract is obstructed (Guenther, 2006), and vowels exhibit more

FIGURE 12.5 The word "glark" as drawn with a mouse.

compensation possibilities than consonants generally (see Ladefoged & Maddieson, 1996: 300ff on tongue root position, for example), and English /r/ is notorious for its variety of articulations (Zhou, Espy-Wilson, Tiede, & Boyce, 2007). Although we have no solution to offer for the problem of /r/, we simply note that the range of different articulations for /r/ includes flaps, trills, and uvulars, which have a similarly wide range of acoustic manifestations, so this goes well beyond what is meant by motor equivalence (where there are multiple ways to achieve the same acoustic consequence). Whatever the reason for the extreme variability in /r/ realizations, it cannot be just motor equivalence.

Thus, given the limited amount of motor equivalence, we may have a reasonable inverse problem; therefore, we may also have the possibility of finding lawful relationships between articulatory actions and their acoustic/auditory consequences following the lead of Halle (1983), Jakobson, Fant, and Halle (1952), and especially Stevens (1998). The general neurobiological conceit here is that pre-existing auditory feature detectors for ecologically important events for mammals were paired with actions that the mouth could perform and thereby packaged as discrete units (features) for storage in long-term memory.

Returning once again to the traditional articulatory descriptions of speech sounds, we describe them in terms of manner, place of articulation (POA), and laryngeal posture. Importantly, the cues for POA and laryngeal posture are dependent on the manner of articulation, so the cues for manner serve as "landmarks" for the analysis of the speech signal (Juneja & Espy-Wilson, 2008; Stevens, 2002) and constrain the subsequent search for POA and laryngeal attributes. We provide here only a very oversimplified catalog of these relationships.

One result that has persisted throughout theoretical innovation within phonology is the notion that speech sounds are composed of smaller units of representation (i.e., distinctive features; Jakobson et al., 1952). These features have traditionally been binary in nature (i.e., either a positive or negative value) and relate to a speech sound's articulation. For example, a consonant is either produced with the lips (i.e., [+labial]) or not (i.e., [-labial]), or a vowel is either produced with an advanced tongue root (i.e., [+ATR]) or not (i.e., [-ATR]). These binary distinctions incorporate the apparent antagonistic relationship between articulator positions discussed in Section 12.2.1. The power of distinctive features in generative phonology has been their ability to explain why certain cross-linguistic phonological patterns are observed and why others are not. Consider the case of word-final consonant devoicing, which occurs in German and Dutch among other languages. When the sounds /b, d, v, z, g/ occur in the final position of a word, they are pronounced as their devoiced counterparts [p, t, f, s, k] (e.g., bewei[z]en "to prove" is produced as bewei[s] "proof"). The sounds that undergo this process form a natural class (i.e., voiced obstruents) that can be represented as [+voiced, +obstruent]. The expectation, then, is that all consonants of German or Dutch that are [+voiced, +obstruent] participate in this process. Thus, a very straightforward transformation can be formulated to account for this process: /+voiced, +obstruent/ → [-voiced] / ___ # (in word-final position). What we do not observe in natural language is for phonological rules to target non-natural classes of sounds, such as /i, g, θ, j, a/. These sounds do not form a natural class and, consequently, are not predicted to undergo systematic alternations like what we observed in German word-final obstruent devoicing.

Thus, the utility of distinctive features in explaining observed cross-linguistic phonological patterns is obvious. That the features themselves are cast in articulatory terms or are binary in nature have been points of substantial debate. We would argue that neither is necessary to maintain their explanatory power. As a point of fact, distinctive features were initially described in auditory/acoustic terms (Jakobson et al., 1952; see Clements and Hume, 1995 for a discussion of different proposals for feature valences).

There are four broad classes for manner of articulation: stops (plosives), fricatives, nasals, and approximants. Stops have a complete closure in the mouth and are characterized acoustically by a period of very low energy or silence often followed by a burst, similar to percussive environmental events with discontinuities such as twigs snapping or rocks colliding. Fricatives are made with a narrow channel causing turbulent airflow and are characterized acoustically by sustained aperiodic noise (which may be overlaid on a periodic signal), as with wind noise or rushing water. Nasals are made with a closed oral cavity and an open velum, and they are characterized acoustically by having a single, strong frequency resonance, a hum such as some insects produce (but see Pruthi & Espy-Wilson 2004 for a comprehensive discussion of other important attributes for nasals). Finally, approximants are made with a relatively open vocal tract and are characterized by having a rich resonance structure of multiple formants; these sounds are characteristic of animal vocal tracts in particular and would have served as a useful kind of animal detector within the mammalian auditory system. Neurobiologically, this four-way division into "major class" features also seems to be well-reflected in a coherent cortical map. Mesgarani, Cheung, Johnson, and Chang (2014) measured the responses in implanted electrical cortical grids (ECOG) placed along the superior temporal

gyrus (STG) in presurgical epilepsy patients and found remarkable correlations between articulatory characteristics (consistent with phonological classes) and single electrode sites. In the supplemental materials for their article, we can see that electrode e1 responds to stops, e2 responds to fricatives, e5 responds to nasals, and e3 and e4 respond to approximants with dorsal and coronal POA, respectively.

POA is primarily characterized by the most active (most displaced) articulator, as described: the lips (labial), tongue blade (coronal), tongue body (dorsal), tongue root (pharyngeal), or the larynx alone (laryngeal). As already noted, the acoustic correlates of POA are heavily dependent on the manner of articulation. In the case of approximants, POA is signaled by the relative frequencies of the first three formants (Monahan & Idsardi, 2010), in fricatives by the dispersion and center of the frication noise (often termed center of gravity), and in stops and fricatives by the transitions with neighboring approximants.

POA also appears to be topologically organized in cortex. ECOG recordings (Bouchard, Mesgarani, Johnson, & Chang, 2013) find a two-dimensional dorso-ventral by anterior-posterior POA map in STG, and cluster analysis of the electrode responses recapitulates the POA categories labial, coronal, and dorsal. Most importantly, the POA categories cut across the various manner classifications (stop, fricative, nasal, and approximant). Taking the ECOG findings together, this suggests a complex spatially entwined set of multidimensional maps for manner and POA, broadly consistent with other speech sound featural maps found using magnetoencephalography (Scharinger, Idsardi, & Poe, 2012).

Phonetic feature information is encoded in the brain in several different ways. Along with the recent cortical topographic findings, it is also known that the latency of the evoked magnetoencephalographic M100 response tracks vowel height (Roberts, Flagg, & Gage, 2004), manner of articulation of consonants (Gage, Poeppel, Roberts, & Hickok, 1998), and POA of consonants (Gage, Roberts, & Hickok, 2002), and that the electrophysiological mismatch negativity (MMN) response is sensitive to native-language phonetic and phonological category representations (Kazanina, Phillips, & Idsardi, 2006; Näätänen et al., 1997; Sharma & Dorman, 2000). So, although maps are often a satisfying answer to the question of neural coding, the brain seems to be using all the methods at its disposal in coding speech.

Now that we have features at our disposal, we can redefine our intuitive notion of segments as "feature bundles" (Chomsky & Halle, 1968: 64), overlapping features that are phonologically coordinated during a period of time.

12.4 LOCAL SOUND COMBINATIONS AND CHUNKING

Speech sounds can be combined together, but not all combinations are possible. Given the set of sounds {a, g, k, l, r}, some combinations are possible English words (e.g., the now-familiar "glark" and others like "gralk"), but other combinations, such as "rlgka," are not licit. One important constraint on sound sequences in word-forms is (approximately) that they must form a legal sequence of syllables in the language, and within syllables there are a limited set of pre-vowel and post-vowel consonant sequences. Moreover, these local sequence constraints often depend on the manner features of the segments. For example, a stop-approximant sequence is an acceptable pre-vowel sequence in English (as in "blue"), but approximant-stop is not ("lbue"). Berent (2013) summarizes a number of investigations into this preference, which is exhibited even when listeners lack language experience with both kinds of consonant sequences ("bl" and "lb"). Building on this work, Berent et al. (2014) show modulation of the fMRI BOLD response across such contrasts, with anterior portions of BA 45 showing less activation to syllables beginning with "lb" and posterior portions showing more activation relative to that for "bl." Additionally, listeners are sensitive to vowel-consonant sequence restrictions, exhibiting negative deflections in event-related potentials in response to illicit sequences (Steinberg, Truckenbrodt, & Jacobsen, 2011). In short, listeners are clearly sensitive to licit and illicit local sequences within syllables, and these can be detected with various different methods.

But what is the motivation for another layer of organization that groups segments together? Ghitza (2012) and Ghitza, Giraud, and Poeppel (2012) suggest that the dual time-scale organization is related to endogenous brain rhythms in the theta (syllable) and gamma (segment) bands. The idea here is that syllables can be identified from gross, easily tracked properties of the speech signal, namely its envelope. Once syllables have been chunked, the signal can be further analyzed to yield segment and feature information, guided by the syllable parsing. This proposal is similar to the landmarks proposal reviewed above in that an initial coarse coding of the signal is performed and then elaborated into finer distinctions as necessary. Taken together, these proposals suggest another possible indexation method for the mental lexicon, similar to hash tables. If the signal is coarsely processed into manner classes (Plosive, Fricative, Nasal, Approximant) and syllable chunks are identified, then listeners can retrieve the lexical items matching that coarse form (e.g., both "nest" [nεst] and "mashed" [mæʃt] would fall into the


If the signal is coarsely processed into manner classes (Stop, Fricative, Nasal, Approximant) and syllable chunks are identified, then listeners can retrieve the lexical items matching that coarse form (e.g., both "nest" [nεst] and "mashed" [mæʃt] would fall into the same Nasal-Vowel-Fricative-Stop group). We could then do further segment and feature discovery guided by the manner class, the syllabic position, local sequence constraints, and the pool of retrieved candidates, which would significantly reduce the search space, by a couple of orders of magnitude compared with the entire lexicon.

So, then, do we need segments at all, or could we do with just organizing features into syllables, without segmental organization? This has been proposed at least for perception (Hickok, 2014). The strongest argument for segments comes from resyllabification effects. Russian provides a particularly clear example. Consider the name "Ivan." Alone, this name is pronounced with two syllables, indicated here with parentheses: [(i)(van)]. Russian nouns are inflected for case with suffixes, and so the form for "to Ivan" adds a prefix "k-" meaning "to" and the appropriate case suffix "-u," so that we have (abstractly) "k-Ivan-u." But Russian strongly prefers that every consonant-vowel sequence be grouped together into a syllable and, consequently, the syllabification in the pronounced form is [(kɨ)(va)(nu)] (the vowel change from [i] to [ɨ] is symptomatic of the resyllabification). Notice that now none of the original syllables survive into the derived form, but the segment sequence [...ivan...] does (with a slight vowel change). If we store forms as syllables and features without a segment level of representation, then we would have to build large tables of syllable correspondences, such as (kɨ)↔(i), (va)↔(van), to recognize "Ivan" in "k-Ivan-u." But such syllable correspondences are straightforwardly captured if segments are available to us. Moreover, the extent of resyllabification is usually limited to a single segment: for example, Russian [(gart)] "printer type metal nom. sg." resyllabifies to [(gar)(ta)] in the plural, even though syllables can begin with /rt/ (cf. [(rtut')] "mercury nom. sg.").

In languages with small syllable inventories and restricted syllable types (e.g., Hawaiian, Japanese), eschewing segmental representations might suffice; however, once we consider languages with complex syllable structures (e.g., Russian, Polish), frequent resyllabification (e.g., Korean), and complex morphology (e.g., Navajo), it becomes more difficult to maintain segment-less representations during perception. As such, resyllabification remains the greatest conceptual challenge to understanding the appropriate data structure for the organization of the mental lexicon.

12.5 NONLOCAL SOUND COMBINATIONS

The local sound sequence restrictions discussed are widely known and form the conceptual basis for n-gram models in natural language processing. Less well-known are the nonlocal, action-at-a-distance phonological effects, such as vowel and consonant harmony and disharmony. For example, in languages with vowel harmony, the vowels of the language are divided into two classes, and an individual word-form will canonically draw all of its vowels from only one of the two sets. To illustrate vowel harmony, consider the following paradigm from Turkish (Clements & Sezer, 1982: 216), providing the nominative (nom) and genitive (gen) forms for the singular (sg) and plural (pl) versions of representative nouns.

            nom.sg   gen.sg    nom.pl    gen.pl
"rope"      ip       ip-in     ip-ler    ip-ler-in
"girl"      kɨz      kɨz-ɨn    kɨz-lar   kɨz-lar-ɨn
"face"      yüz      yüz-ün    yüz-ler   yüz-ler-in
"stamp"     pul      pul-un    pul-lar   pul-lar-ɨn
"hand"      el       el-in     el-ler    el-ler-in
"stalk"     sap      sap-ɨn    sap-lar   sap-lar-ɨn
"village"   köy      köy-ün    köy-ler   köy-ler-in
"end"       son      son-un    son-lar   son-lar-ɨn

The suffix in the genitive singular forms alternates between [in]/[ɨn] and [ün]/[un]. The suffix is produced as [in]/[ɨn] when the root vowel is [-round] (/i, ɨ, e, a/) and as [un]/[ün] when the root vowel is [+round] (/u, ü, o, ö/). Moreover, the suffix is produced with a front vowel, [in]/[ün], when the root vowel is [-back] (/i, ü, e, ö/) and with a back vowel, [ɨn]/[un], when the root vowel is [+back] (/ɨ, u, a, o/). In short, two dimensions of the Turkish vowel space (i.e., backness and roundedness) participate in the harmony process (and the suffixes then need only contain vowel height information, a classic case of archi-phonemes). Similar analyses account for the nominative plural and genitive plural paradigms. Vowel harmony illustrates "action-at-a-distance" because these patterns hold despite the presence of intervening consonants. That is, these are nonlocal phonological dependencies, distinct from the local assimilation patterns we find in many languages (in English, /æ/ is pronounced as a nasalized [æ̃] before nasal consonants, e.g., [bæ̃n] "ban"). Consonant harmony is similar to vowel harmony, except that the harmony process holds between consonants rather than vowels. Disharmony refers to processes that cause two sounds (at a distance) to become less similar.
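The two-dimensional suffix selection described for the Turkish genitive can be sketched as a small rule. The vowel classes follow the text (backness and roundness copied from the rightmost root vowel, height fixed as high); the spelling uses "ı" for [ɨ], and the function and word list are invented for illustration.

```python
# Illustrative sketch only: choosing the Turkish genitive suffix by
# two-dimensional vowel harmony, per the paradigm in the text. The suffix
# vowel copies backness and roundness from the rightmost root vowel; only
# its height (high) is fixed, as in the archi-phoneme analysis.
FRONT = set("ieöü")          # [-back] vowels
ROUND = set("oöuü")          # [+round] vowels
VOWELS = set("ieöüıaou")     # the eight Turkish vowels ("ı" spells [ɨ])

def genitive(root):
    """Return the root plus the harmonizing genitive suffix."""
    last = [c for c in root if c in VOWELS][-1]   # rightmost root vowel
    if last in ROUND:
        v = "ü" if last in FRONT else "u"
    else:
        v = "i" if last in FRONT else "ı"
    return f"{root}-{v}n"

for noun in ["ip", "kız", "yüz", "pul", "el", "sap", "köy", "son"]:
    print(genitive(noun))
# ip-in, kız-ın, yüz-ün, pul-un, el-in, sap-ın, köy-ün, son-un
```

Note that the rule never inspects intervening consonants, which is exactly the nonlocal character of harmony the text emphasizes.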



A remnant of a process of consonant disharmony in Latin survives statistically in English in the choice of the adjective-forming suffixes "-al" and "-ar," which tend to alternate with a preceding "l" or "r" in the word, so that "circle" derives "circul-ar" but "flower" derives "flor-al," and in the higher frequency of "line-ar" as compared with "line-al" (see Figure 12.6), even though the "l" in "linear" is three syllables apart from the "r."

FIGURE 12.6 Google n-gram frequencies for "linear" (blue) and "lineal" (red).

Heinz and Idsardi (2011, 2013) argue that such effects cannot be reduced to iterated local effects of coarticulation across the intervening sounds and constitute a separate type of phonological generalization with distinct computational properties (perhaps also motivating larger grouping structures, such as feet and phonological phrases, that we are ignoring here). So far, there are few studies examining brain responses to vowel and consonant harmony (see Scharinger, Poe, and Idsardi, 2011 for a preliminary study of Turkish vowel harmony and Monahan, 2013 for Basque sibilant harmony). Even though such action-at-a-distance effects may seem exotic, and even though the experimental materials are more difficult to construct (because they must eventually test the effects across various distances), it is important to examine the neuropsychological properties of these phonological laws to determine how they differ from the local sequence laws (for instance, whether the discontiguous sequence effects decrease with increasing distance).

12.6 SUMMARY

In this chapter, we have provided neurobiological motivations for what we feel are the core concepts of phonology: features, segments, syllables, abstraction, and laws of combination (both local and long-distance). Although we obviously do not yet understand how speech is mentally represented for action, perception, and memory, we feel confident that features, segments, syllables, abstraction, and laws of form will be crucial in explicating the mental representations and computations used in listening and speaking. We have deliberately not attempted to provide a comprehensive review of the neuropsychological findings in speech relevant for phonology. For two reviews in that vein, see Idsardi and Poeppel (2012) and Monahan, Lau, and Idsardi (2013).

References

Aertsen, A. M. H. J., & Johannesma, P. I. M. (1981). Spectro-temporal receptive field: A functional characteristic of auditory neurons. Biological Cybernetics, 42, 133–143.
Bell, A. M. (1867). Visible speech: The science of universal alphabetics. London: Simkin, Marshall & Co.
Berent, I. (2013). The phonological mind. Cambridge: Cambridge University Press.
Berent, I., Pan, H., Zhao, X., Epstein, J., Bennett, M. L., Deshpande, V., et al. (2014). Language universals engage Broca's area. PLoS ONE, 9, e95155.
Booth, M. C. A., & Rolls, E. T. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex, 8, 510–523.
Bouchard, K. E., Mesgarani, N., Johnson, K., & Chang, E. F. (2013). Functional organization of human sensorimotor cortex for speech articulation. Nature, 495, 327–332.
Brentari, D. (2011). Sign language phonology. In J. Goldsmith, J. Riggle, & A. Yu (Eds.), Handbook of phonological theory (2nd ed.) (pp. 692–721). Oxford: Blackwell.
Carbonell, K. M., & Lotto, A. J. (2014). Speech is not special... again. Frontiers in Psychology, 5, 427.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. Cambridge, MA: MIT Press.
Clements, G. N., & Hume, E. V. (1995). The internal organization of speech sounds. In J. A. Goldsmith (Ed.), The handbook of phonological theory (pp. 245–306). Cambridge, MA: Blackwell Publishing.
Clements, G. N., & Sezer, E. (1982). Vowel and consonant disharmony in Turkish. In H. van der Hulst, & N. Smith (Eds.), The


structure of phonological representations, II (pp. 213–254). Dordrecht: Foris.
Cooper, F. S., Liberman, A. M., & Borst, J. M. (1951). The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences, 37, 318–325.
Fant, G. (1973). Speech sounds and features. Cambridge, MA: MIT Press.
Gage, N., Poeppel, D., Roberts, T. P. L., & Hickok, G. (1998). Auditory evoked M100 reflects onset acoustics of speech sounds. Brain Research, 814, 236–239.
Gage, N., Roberts, T. P. L., & Hickok, G. (2002). Hemispheric asymmetries in auditory evoked neuromagnetic fields in response to place of articulation contrasts. Cognitive Brain Research, 14, 303–306.
Ghitza, O. (2012). On the role of theta-driven syllabic parsing in decoding speech: Intelligibility of speech with a manipulated modulation spectrum. Frontiers in Psychology, 3, 238.
Ghitza, O., Giraud, A.-L., & Poeppel, D. (2012). Neuronal oscillations and speech perception: Critical-band temporal envelopes are the essence. Frontiers in Human Neuroscience, 6, 340.
Gow, D. (2012). The cortical organization of lexical knowledge: A dual lexicon model of spoken language processing. Brain and Language, 121, 273–288.
Guenther, F. H. (2006). Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders, 39, 350–365.
Halle, M. (1983). On distinctive features and their articulatory implementation. Natural Language and Linguistic Theory, 1, 91–105.
Heinz, J., & Idsardi, W. J. (2011). Sentence and word complexity. Science, 333, 295–297.
Heinz, J., & Idsardi, W. J. (2013). What complexity differences reveal about domains in language. Topics in Cognitive Science, 5, 111–131.
Hickok, G. (2014). The architecture of speech production and the role of the phoneme in speech processing. Language, Cognition and Neuroscience, 29, 2–20.
Hinton, L., Nichols, J., & Ohala, J. J. (Eds.) (1994). Sound symbolism. Cambridge: Cambridge University Press.
Hoffman, K. L., & Logothetis, N. K. (2009). Cortical mechanisms of sensory learning and object recognition. Philosophical Transactions of the Royal Society B, 364, 321–329.
Idsardi, W. J. (2006). A simple proof that Optimality Theory is computationally intractable. Linguistic Inquiry, 37, 271–275.
Idsardi, W. J., & Poeppel, D. (2012). Neurophysiological techniques in laboratory phonology. In A. Cohn, C. Fougeron, & M. Huffman (Eds.), The Oxford handbook of laboratory phonology (pp. 593–605). Oxford: Oxford University Press.
Jakobson, R., Fant, G., & Halle, M. (1952). Preliminaries to speech analysis. Cambridge, MA: MIT Press.
Juneja, A., & Espy-Wilson, C. (2008). A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition. Journal of the Acoustical Society of America, 123, 1154–1168.
Kazanina, N., Phillips, C., & Idsardi, W. J. (2006). The influence of meaning on the perception of speech sounds. Proceedings of the National Academy of Sciences, 103, 11381–11386.
Kim-Renaud, Y.-K. (1997). The Korean alphabet: Its history and structure. Honolulu: University of Hawaii Press.
Ladefoged, P., & Maddieson, I. (1996). The sounds of the world's languages. Hoboken, NJ: Wiley.
Liberman, A. M. (1996). Speech: A special code. Cambridge, MA: MIT Press.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343, 1006–1010.
Mesgarani, N., David, S. V., Fritz, J. B., & Shamma, S. A. (2014). Mechanisms of noise robust representation of speech in primary auditory cortex. Proceedings of the National Academy of Sciences, 111, 6792–6797.
Monahan, P. J. (2013). Using long-distance harmony to probe prediction in speech perception: ERP evidence from Basque. Presented at the 5th Annual Meeting of the Society for the Neurobiology of Language, San Diego, CA, November 6–8, 2013.
Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalization. Language and Cognitive Processes, 25, 808–839.
Monahan, P. J., Lau, E. F., & Idsardi, W. J. (2013). Computational primitives in phonology and their neural correlates. In C. Boeckx, & K. K. Grohmann (Eds.), The Cambridge handbook of biolinguistics (pp. 233–256). Cambridge: Cambridge University Press.
Näätänen, R., Lehtokoski, A., Lennes, M., Cheour, M., Huotilainen, M., Iivonen, A., et al. (1997). Language-specific phoneme representations revealed by electric and magnetic brain responses. Nature, 385, 432–434.
Ohala, J. J. (1997). Emergent stops. In Proceedings of the 4th Seoul international conference on linguistics (pp. 84–91). Seoul: Linguistic Society of Korea.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as 'asymmetric sampling in time'. Speech Communication, 41, 245–255.
Poeppel, D., & Idsardi, W. J. (2011). Recognizing words from speech: The perception-action-memory loop. In G. Gaskell, & P. Zwitserlood (Eds.), Lexical representation: A multidisciplinary approach (pp. 171–196). Berlin: de Gruyter.
Potter, R. K., Kopp, G. A., & Green, H. C. (1947). Visible speech. New York: van Nostrand.
Prince, A., & Smolensky, P. (2004). Optimality theory: Constraint interaction in generative grammar. Malden, MA: Blackwell Publishing.
Pruthi, T., & Espy-Wilson, C. (2004). Acoustic parameters for automatic detection of nasal manner. Speech Communication, 43, 225–239.
Rhodes, G., Calder, A., Johnson, M., & Haxby, J. V. (2011). Oxford handbook of face perception. Oxford: Oxford University Press.
Roberts, T. P. L., Flagg, E. J., & Gage, N. M. (2004). Vowel categorization induces departure of M100 latency from acoustic prediction. Neuroreport, 15, 1679–1682.
Scharinger, M., Idsardi, W. J., & Poe, S. (2012). A comprehensive three-dimensional cortical map of vowel space. Journal of Cognitive Neuroscience, 24, 3972–3982.
Scharinger, M., Poe, S., & Idsardi, W. J. (2011). Neuromagnetic reflections of harmony and constraint violations in Turkish. Laboratory Phonology, 2, 99–123.
Sharma, A., & Dorman, M. F. (2000). Neurophysiologic correlates of cross-language phonetic perception. Journal of the Acoustical Society of America, 107, 2697–2703.
Staal, F. (1990). Rules without meaning: Ritual, mantras and the human sciences. New York: Peter Lang.
Steinberg, J., Truckenbrodt, H., & Jacobsen, T. (2011). Phonotactic constraint violations in German grammar are detected automatically in auditory speech processing: A human event-related potentials study. Psychophysiology, 48, 1208–1216.
Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.
Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111, 1872–1891.


Trubetzkoy, N. S. (1969). Principles of phonology. Berkeley: University of California Press.
Wing, A. M. (2000). Motor control: Mechanisms of motor equivalence in handwriting. Current Biology, 10, R245–R248.
Yoo, I. H., & Blankenship, B. (2003). Duration of epenthetic [t] in polysyllabic American English words. Journal of the International Phonetic Association, 33, 153–164.
Zemlin, W. R. (1998). Speech and hearing science: Anatomy and physiology (4th ed.). Boston: Allyn and Bacon.
Zhou, X., Espy-Wilson, C. Y., Tiede, M., & Boyce, S. (2007). An articulatory and acoustic study of "retroflex" and "bunched" American English rhotic sound based on MRI. In Proceedings of INTERSPEECH'07 (pp. 54–57).

CHAPTER 13

Morphology

Alec Marantz
Department of Linguistics, New York University, New York, NY, USA; Department of Psychology, New York University, New York, NY, USA; NYUAD Institute, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates

13.1 INTRODUCTION

Within linguistics, morphology is the subdiscipline devoted to the study of the distribution and form of "morphemes," taken to be the minimal combinatorial units languages use to build words and phrases. For example, it is a fact about English morphology that information about whether a sentence is in the past tense occurs at the end of verbs. This fact reduces to a generalization about the distribution of the tense morpheme in English, which is a fact about "morphotactics" (the distribution and ordering of morphemes). It is also a fact about English morphology that the ("regular") past-tense suffix is pronounced /t/ after a class of voiceless consonants (walked, tipped, kissed) and /d/ after a class of voiced consonants and after vowels (gagged, ribbed, fizzed, played). This fact is a fact about "allomorphy" (alternations in the pronunciation of morphemes). Traditionally, then, morphology concerns itself with morphotactics and allomorphy.

Although the division or decomposition of words and phrases into smaller units seems relatively intuitive, linguistic morphologists have repeatedly questioned basic assumptions about morphemes. On one view, instead of dealing with the distribution and pronunciation of small pieces of language, morphology is about the form of words, where, for example, kick, kicks, kicking, and kicked are all forms of the same verb kick (Matthews, 1965) but are not composed of a sequence of morphemes. On this view, languages are claimed to make a strict distinction between words and phrases, with only the latter having an internal structure of organized pieces. From this morpheme-less perspective, kicked is a form of the stem kick, not the combination of kick + PAST TENSE, where PAST TENSE is realized as /t/. Other morphologists also endorse a strict division between words and phrases but still analyze words as consisting of morphemes; on this view, the internal arrangement of morphemes within words falls under a different set of principles than the arrangement of words into sentences. However, in the morphological theory most closely associated with generative grammar in this century, distributed morphology, there is no strict word/phrase distinction (Matushansky & Marantz, 2013). The internal arrangement of morphemes both within words and within phrases and sentences is explained by a single syntactic theory, and morphology provides an account of the way in which these morphemes are realized phonologically (in sound), whether inside words or independently arranged in phrases.

This chapter explains aspects of the theory of morphology with a view to the way that morphology has been explored in neurolinguistics. An important conclusion of the chapter is that although the types of investigation of morphology currently found in neurolinguistics might seem to rely on motivated linguistic distinctions, such as that between derivational and inflectional morphology, linguistic theory itself does not support such distinctions in the manner required to motivate neurolinguistic experiments. Although research in the neurobiology of language does at least sometimes adopt the vocabulary of linguistic morphology in investigating the neural bases of morphology, the attention given to linguistic analysis is often superficial, with the consequence that experimental results are difficult to interpret with respect to central questions of language processing in the brain.
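The allomorphy fact cited above (walked vs. gagged) amounts to a one-line voicing rule. A hedged sketch: the voiceless class below is partial and invented for illustration, and the /t, d/-final case (wanted, needed), which the text does not discuss, is set aside.

```python
# Illustrative sketch only: the regular past-tense allomorphy stated in the
# text -- /t/ after voiceless consonants, /d/ after voiced consonants and
# vowels. The voiceless class is partial and invented; the /t, d/-final
# case (wanted, needed) is deliberately ignored here.
VOICELESS = {"p", "t", "k", "f", "s", "sh", "ch", "th"}

def past_allomorph(stem_final):
    """Pick the past-tense allomorph from the stem-final segment."""
    return "t" if stem_final in VOICELESS else "d"

print(past_allomorph("k"))  # t  (walked-, kissed-type stems)
print(past_allomorph("g"))  # d  (gagged-, ribbed-type stems)
print(past_allomorph("e"))  # d  (played: vowel-final stem)
```

The point of the sketch is that the alternation is conditioned purely by the phonology of the stem's final segment, which is why it counts as allomorphy rather than morphotactics.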

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00013-4 © 2016 Elsevier Inc. All rights reserved.

There is hope that recent advances in psycholinguistics and in computational linguistics may help bridge the gap between linguistic theory and the theory of neurolinguistic processing, such that linguistics' deep understanding of the nature of language can inform neurolinguistics and, in turn, neurolinguistic findings can help shape linguistic theory.

Although some controversies within linguistics over the correct analysis of morphological phenomena are explained in this chapter, in general I adopt the assumptions and results of distributed morphology. As explained in Marantz (2013b), distributed morphology is relatively conservative from a historical perspective, preserving the insights of mainstream linguistics from the 20th century. In experimental work, one can attempt to explicitly test differing predictions made by competing representational theories of language, and so an experimentalist could choose to pit predictions of distributed morphology against available alternatives. However, experimental work related to morphology must make some theoretical commitments; it is not possible to be agnostic over issues such as whether words decompose into morphemes.

13.2 WHY MORPHOLOGY?

There seems to be an obvious need for a theory of syntax that would explain the constraints on how smaller linguistic units combine into and distribute across words and phrases, or for a theory of phonology that would explain the way that the pronunciation of units is conditioned by their environments in sentences, or for a theory of semantics that would explain the way that meanings of smaller units combine to form meanings of larger units. Syntax, phonology, and semantics together represent the essential structure of grammar: an engine of combination (syntax) and two interpretive "components" that translate the combinations into sound (phonology) and meaning (semantics). However, the role of morphology in language presents more of a puzzle. If we think of morphology as exemplified by the stuff we add to English words—things like the past-tense ending or a prefix like re- in repaint—the question arises: why is there morphology in addition to syntax, phonology, and semantics? For the investigation of the nature of language, we can ask why languages appear to add stuff to words and why that stuff takes the particular shapes we observe cross-linguistically. For linguistics, we can ask whether an account of this stuff requires a special (sub-)theory of "morphology" in addition to syntax, phonology, and semantics—a theory of an independent morphological component of grammar—or whether the theory of morphemes can be reduced to these other theories (with the properties of morphemes distributed across the syntactic, phonological, and semantic components, as in distributed morphology). This chapter explains why contemporary morphologists claim that morphology is not a special component of grammar and that the interaction of syntax, phonology, and semantics produces the morphological phenomena we observe and allows for the variation in the expression of morphemes that is exhibited cross-linguistically.

The why morphology? question can be usefully divided into three issues. The first issue concerns the reason why certain information is sometimes indicated by attaching sounds to a word while the same information could be carried by an independent word. Why should the past tense in English be indicated by a suffix on walk in a statement, He walked, but by an auxiliary verb did in a yes/no question, Did he walk? Why should we say repaint when the two words paint again can be used with the same meaning? Why do languages ever use prefixes or suffixes, given that independent words can serve the same function? If every language chose independent words for these functions, there is a sense in which there would be no morphology.

A second, related question is why such diversity exists in the way that morphemes are realized across the world's languages. Languages can signal information like past tense by copying part of a verb stem (reduplication), by tucking phonological material inside a verb stem (infixing), and by other means, as well as by concatenating a stem and a tense prefix or suffix or by using a phonologically independent word. Do these different modes of signaling information correspond to deep grammatical differences between languages?

A third question involves particular types of morphemes, such as agreement and case affixes. For a prefix like re-, it should be clear why English might want to use the morpheme for the expression of a meaning also expressible by the independent word again (John repainted the house = John painted the house again): the meaning contrast between repaint and paint hinges on the presence of the prefix on the first verb. However, the necessity of agreement morphology on verbs (He runs every day, They run every day) and case marking on nouns and pronouns (He saw him, where the subject is "nominative" and the object "accusative") is less obvious, given that many languages lack such markings. Even in languages like English that show some limited agreement and case marking, any help that such morphology might provide in disambiguating word strings is minimal. For example, modal verbs like may in English show no agreement (he may, they may), whereas auxiliary verbs like be do (he is, they are), but the lack of agreement on may does not cause comprehension difficulties.

For each of these questions about the necessity of morphology for language, the answer that comes from the theory of morphology is this: do not be misled to generalize a language-particular choice along a spectrum of possibilities to a universal about the nature of

language. That is, from a general perspective of the structure of grammar, the variations in morphologies cross-linguistically can be seen as superficial variations on strong universal themes.

Taking "morpheme" to be defined as the smallest unit combined by the syntax of a language, morphology makes a strong distinction between the roots of the so-called lexical categories—nouns, verbs, and adjectives—and all other morphemes. This distinction between roots and other morphemes underlies an often cited but less technical distinction between content words and function words and morphemes. The root morphemes, like cat, describe properties of entities, states, and events. Their meanings are strongly connected to our a-linguistic cultural and conceptual knowledge, and the set of root morphemes varies considerably across languages and people. However, nonroot morphemes are sets of grammatical features that operate in a uniform way across languages and are central to the grammatical system. A past-tense morpheme would consist of features available to the grammar of every language and would likely be shared across speakers of the language. Root morphemes need to combine with the lexical category morphemes that create nouns, verbs, and adjectives to be used in phrases and sentences; each root morpheme + category-determining morpheme complex will, in general, anchor a phonological word, part of the "open class" vocabulary of a language (cases of compounding like blackbird and similar phenomena in other languages allow for multiple roots in a single phonological word). Nonroot morphemes, so-called functional morphemes that form the "closed class" of items in a language, find their phonological realization either by joining a root in a phonological word anchored by the root or by forming a phonological word of their own, perhaps with other functional morphemes. In general, then, the answer to the first question about the existence of morphology—why some languages express certain types of information piled up in a single word while other languages might express the same information on separate words—follows from the generalization that language makes a cut between root morphemes and functional morphemes. The general principles about the sound realization of root morphemes demand that they anchor independent phonological words cross-linguistically. The general principles about the sound realization of functional morphemes allow languages plenty of room for variation, giving the appearance that some languages have more morphology—more of the functional morphemes appear as affixes on root-based phonological words—whereas others have less—more of the functional morphemes appear as, or as part of, root-less phonological words (i.e., as function words). In addition, the phonological realization of a functional morpheme is not forced by any general, universal principle; such morphemes often are silent. Thus, the inventory of phonologically realized morphemes also differs across languages, in addition to the differences in how pronounced morphemes are phonologically realized.

The distinction between root and functional morphemes, and the distinction between morphemes as combinable constituents and the phonological (sound) forms of these morphemes, also provide the answer to the second issue, regarding the variety of alternatives cross-linguistically for the sound expression of various morphemes. The organization of morphemes within a word or phrase is determined by the syntax and can be displayed in a hierarchical tree structure. Linguists have developed a general account of the way that individual morphemes are realized that covers not only roots like cat that have the phonological form of a syllable but also suffixes like past tense /d/ that are a single consonant, and the roots of Semitic languages like Arabic, which might consist of just three consonants without vowels or syllable structure. The same generalized account that assigns each morpheme a bit of phonological substance can describe cases of reduplication, where the added phonological material is in a sense borrowed from the stem, and even truncation, where it looks like the addition of a morpheme results in a shortening of the stem. That is, the general picture has each morpheme determining a bit of phonological content (perhaps phonologically null), with the phonological forms of the morphemes combining according to the phonological principles of the language. The exoticness of infixing and reduplication—or the expression of a morpheme as a tone on the vowel of a stem or via gemination of a stem consonant—is relative to the phonological structure of English; from a cross-linguistic standpoint there is no reason to treat these morphemes' expression as more unusual than affixation.

Finally, we may address the issue of why certain types of syntactically dependent morphology should exist, particularly agreement and case marking. The overt realization of agreement between subject and verb (or verb and object, etc.) and the overt realization of case marking on constituents of noun phrases (nouns, adjectives, determiners, numerals) are certainly not necessary for the processing of language; many languages do fine with minimal expression of such morphology. But the general characteristic of such morphology is the reflection of grammatical relations among constituents of a sentence that are present and necessary with or without the morphology. Subject-verb agreement, for example, reflects the computation of a relation between (features of) a subject and (a constituent of) a verb that underlies the grammatical analysis of a sentence. The general picture of the syntax of language, then, is one of the

C. BEHAVIORAL FOUNDATIONS 156 13. MORPHOLOGY

recursive combination of morphemes into hierarchical constituent structures PLUS the computation of certain relations between morphemes—like the subject–verb (tense) relation—that are not completely reducible to constituency (like "sister" relations in a tree) or linear relations (like "next to"). Such grammatical relations (with traditional names like "subject" and "object") involve the transfer or checking of features on functional morphemes. These features are phonologically realized as case and agreement morphology in some languages, but the syntactic operations and computations that underlie this morphology are present in every language.

13.3 WHAT MAKES MORPHOLOGY, MORPHOLOGY

In answering the question of why language includes morphology, we have sketched a picture of a structure of grammar that denies a distinction between the organization of morphemes within words and the organization of words within sentences; that is, the contemporary linguistic perspective on morphology makes no principled distinction between syntax and morphology. Nevertheless, the sound system of a language organizes the pronunciation of sentences into units of different sizes, with smaller units nested inside larger units; that is, "phonological words" combined into "phonological phrases" (Hall, 1999). These phonological units are the locus of certain phonological processes and generalizations. For example, in many European languages, including German and Russian, voiced consonants like /b/ or /d/ are pronounced voiceless—/p/, /t/—at the end of a phonological word. In English, some of the units we write as separate words do not contain enough phonological material to be pronounced as phonological words; for example, the and a pronounced as they usually are before a consonant-initial noun must be joined with the noun in a phonological word. In general, the phonological constraints on these phonological constituents require a certain mismatch between the syntactic structure and the phonological structure, as is well-known from such examples as the queen of England's hat, which is pronounced with the possessive morpheme joined with England, although the hat belongs to the monarch rather than the country. The grammar of a language must describe how the syntactic arrangement of hierarchically organized morphemes is realized as the phonological organization of phonological words and phrases, and which morphemes will come together into single phonological words, as 's and England come together in our example. What we call the morphology of a language includes an account of how the grammar of the language packages some morphemes into phonological words, and how these morphemes are pronounced.

Contemporary generative grammar, as described within the "Minimalist Program" associated with Noam Chomsky (Chomsky, 1995), describes the essence of linguistic structure as involving the recursive "merger" of morphemes. Two morphemes are merged into a constituent, which might undergo further merger with another morpheme or with some previously constructed complex of morphemes. Repeated operation of merger yields familiar hierarchical constituent structures. For morphemes interpreted as semantic functions or operators, the hierarchical structure determines their scope. For example, if morphemes interpreted as make, want, and go are merged into a structure like [make[want[go]]], then the interpretation would involve causing the desire to leave, while a structure that swaps the structural positions of want and make—[want[make[go]]]—would involve the desire to cause leaving. If we imagine that these hierarchical structures built by repeated merge operations hold no implications for the linear order of the two constituents joined by each merge, the usual pronunciation of these syntactic structures must involve a decision, for each merger, regarding the order of each merged pair. Because the decision to order a complex constituent X before a complex constituent Y causes each subconstituent of X to be ordered before each subconstituent of Y, the linearization of a hierarchical syntactic structure will result in morpheme order reflecting the hierarchical structure. Within words, this correspondence between hierarchical structure and linear order of morphemes has come to be called the "Mirror Principle" (Baker, 1985), which is not really a "principle" but an observation about this correspondence.

The interface between the syntax of a language that determines the hierarchical structures of morphemes and the phonology of a language that determines the pronunciation of these structures must decide which morphemes to group together into phonological words and how to pronounce these morphemes. The environment in which a morpheme appears will determine how the morpheme is pronounced to a greater or lesser extent, depending on properties of the language and the particular morpheme itself. Several factors may condition the form of a morpheme. For example, as morphemes are linearized in the phonological interpretation of a hierarchical syntactic structure, a morpheme might show a different form (a different allomorph) depending on the phonological shape or the actual identity of a neighboring morpheme. The indefinite article a/an in English is pronounced a before a consonant-initial word and an before a vowel-initial word—allomorphy conditioned by the phonology of a neighboring morpheme. The plural suffix is pronounced -en after ox but as silence after sheep—allomorphy conditioned by the identity of the morpheme to which it attaches, not its phonological shape. Similarly, an allomorph of a morpheme might be conditioned by general phonological properties of the language and of the morpheme, or the allomorph might be unconnected to phonological properties of the language. The pronunciation of the past-tense morpheme as /t/ after voiceless consonants in English but as /d/ after voiced consonants follows from general phonological facts about English. However, the choice of an before vowel-initial stems in the English indefinite article is not determined by general phonological properties of English; English does not generally insert an /n/ between vowels to avoid a hiatus.

From this short sketch, the reader can already project some of the investigations that occupy morphologists. For example, questions arise regarding whether there are constraints on how close two morphemes must be to influence each other's pronunciation, and whether closeness should be measured in terms of the hierarchical structure of morphemes or in terms of a linear string of their pronunciations. What kinds of information associated with morphemes could serve as triggers for particular phonological alternations on morphemes, and do the constraints on types of contextual information depend on the relative location of the morphemes within a hierarchical tree? Morphologists have discovered that these interactions between morphemes are very local and are very dependent on the hierarchical syntactic structures in which the morphemes appear (Embick, 2010). The locality of so-called contextual allomorphy (the pronunciation of a morpheme triggered by its linguistic context) seems convenient for language processing and would seem to constrain accounts of how language might be acquired.

13.4 TYPES OF MORPHEMES, TYPES OF MORPHOLOGIES, TYPES OF MORPHOLOGICAL THEORIES

Given this general picture of morphology as the exploration of principles governing the organization of morphemes into words and their pronunciation in context, we can turn to certain contrasts between sets of morphemes and between theories of morphemes that hold potential importance for the study of language in the brain. A commonly invoked division between morphemes divides the "inflectional" from the "derivational." On some definitions, inflectional morphology creates different forms of an individual word, while derivation creates new words from words. For example, English tense would be inflectional, because the past-tense form of a verb is arguably a form of the verb, rather than a word with its own distribution and meaning, whereas the suffix -able is derivational, creating an adjective from an input verb, as in knowable from know. Although there is no doubt that tense morphemes and category-changing morphemes like -able differ in many ways, one must ask two general questions about the distinction.

First, does language make a binary distinction between two classes of morphemes such that it is coherent and important that we can identify a given morpheme as either inflectional or derivational, with certain characteristics deducible from the identification (e.g., whether it changes grammatical category or whether it may be realized as reduplication)? Second, given a division into inflection and derivation, do any linguistic generalizations or principles rely on the feature of being inflectional or derivational, or do properties that characterize one or the other class of morphemes follow from specific features of the morphemes themselves, not the inflection versus derivation label? That is, does the theory of linguistics call on the features "inflectional" and/or "derivational"? The emerging answer to both questions from contemporary morphology is "no."

Recall that the syntax of a language places various constraints and requirements on different sets of morphemes. For example, a main clause in English requires a tense morpheme, and a present-tense verb in English must agree with a third-person singular subject, usually with the -s suffix (He walks). However, a prefix like re- in English behaves like an optional modifier—like the independent word again. A suffix like -able parallels to some degree the adjective able: This game is winnable parallels Someone is able to win this game. The question arises whether morphemes split into large classes such that properties of morphemes follow from their membership in these classes. A recurrent proposal has been to divide morphemes into "inflection" versus "derivation," where inflection would include "grammatical" morphemes like tense and agreement that are relevant to the syntax, and derivational morphemes would include those that derive words of one category (noun, verb, adjective) from words of a different category. However, although linguists have discovered some features of morphemes that determine their behavior, no characterization of the inflection versus derivation split has proved relevant within morphological theory. For example, within distributed morphology it has been observed that morphemes that attach directly to roots, such as the morphemes that create nouns, verbs, and adjectives from roots, share properties governing the conditioning of allomorphy on roots that are not shared by morphemes that attach to constituents that already contain these category-determining morphemes (Embick & Marantz, 2008; Marantz, 2013a). Subclasses of functional morphemes, then, share properties that a theory of morphology should explain.
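The two kinds of conditioning surveyed above—allomorphy triggered by the phonological shape of a neighboring morpheme (a/an) versus allomorphy triggered by the identity of a specific morpheme (-en after ox, silence after sheep, -s elsewhere)—can be sketched in a few lines of code. The sketch is mine, not a formalism from the chapter; the function names and the toy exception list are invented, and orthographic vowels are a deliberately crude stand-in for the phonological condition.

```python
VOWEL_LETTERS = set("aeiou")

def indefinite_article(next_word):
    # Conditioned by phonological shape: a/an depends only on whether the
    # following word begins with a vowel, not on which word it is.
    return "an" if next_word[0].lower() in VOWEL_LETTERS else "a"

def plural(stem):
    # Conditioned by morpheme identity: listed stems select a special
    # plural exponent; the default -s applies "elsewhere."
    identity_conditioned = {"ox": "oxen", "sheep": "sheep"}
    return identity_conditioned.get(stem, stem + "s")

print(indefinite_article("apple"), indefinite_article("pear"))  # an a
print(plural("ox"), plural("sheep"), plural("cat"))  # oxen sheep cats
```

Note the asymmetry built into `plural`: the identity-conditioned forms are checked first, and the regular form is computed only when no listed exception applies—exactly the "elsewhere" logic the chapter returns to under blocking and defaults.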

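The merge-and-linearize architecture described in Section 13.3 can be given a similarly minimal sketch: merged constituents are nested pairs, and linearization flattens the hierarchy into morpheme order. Representing constituents as tuples and collapsing the per-merger ordering decisions into one global parameter are simplifications of mine, used only to show that the hierarchy itself says nothing about order.

```python
def merge(x, y):
    # Merge two morphemes (strings) or previously merged constituents
    # into a binary constituent.
    return (x, y)

def linearize(node, first="left"):
    # Turn a hierarchical structure into a morpheme sequence by deciding,
    # at each merger, which member is pronounced first.
    if isinstance(node, str):
        return [node]
    left, right = node
    ordered = (left, right) if first == "left" else (right, left)
    return [m for part in ordered for m in linearize(part, first)]

# [make [want [go]]] vs. [want [make [go]]]: same morphemes, different scope.
causative = merge("make", merge("want", "go"))
desiderative = merge("want", merge("make", "go"))

print(linearize(causative))                 # ['make', 'want', 'go']
print(linearize(causative, first="right"))  # ['go', 'want', 'make']
```

Reversing the ordering decision yields the mirror-image morpheme string of the same hierarchy—the pattern the "Mirror Principle" label describes.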

However, there is no particular evidence that, for example, tense and number morphemes form a coherent class with case and agreement morphemes (the putative class of inflection) as opposed to category-changing morphemes (the putative class of derivation) that might motivate a broad distinction between inflection and derivation.

The search for contrastive properties for inflection versus derivation was motivated by the observation that some morphemes, such as tense, were required by a syntactic environment and selective regarding their host words (e.g., tense is required in main clauses in English and attaches to verbs), whereas other words, while perhaps selective regarding their hosts (-able attaches generally to verbs), were not required by the broader sentential context in any important sense (although one might construct a syntactic environment that requires an adjective, this environment would not require an adjective made with the -able suffix). Historically, linguists have explored two main hypotheses about properties that might follow from the inflection versus derivation distinction, characterized in this way by the morphemes' sensitivity to their syntactic environment. First, it has been claimed that inflection is paradigmatic while derivation is not. Second, and relatedly, it has been claimed that inflection and derivation involve different mechanisms for the phonological realization of the information carried by the morphemes.

The notion of a paradigm should be familiar for readers learning a classical language or a highly "inflected" language like Finnish or Russian. For verbs, the various combinations of tense, aspect, and mood features that modify verbs, along with the features of agreement (usually subject agreement) that can be signaled on the verb, construct a multidimensional grid of feature values such that each cell of the grid is filled by a form of the verb whose paradigm is being displayed. In English, the paradigm for the present tense of the verb to be would have six cells, one for each of the combinations of person and number for the subject of the verb: (I) am; (you) are; (he) is; (we) are; (you PLURAL) are; and (they) are. Two important features of paradigms are that for each verb, noun, or adjective associated with an inflectional paradigm, there is expected to be a form for each cell in the paradigm (although one form may fill multiple cells, as are does for to be in English) and, in general, only a single form fills each cell. The latter property is behind the notion of "blocking": an irregular form that fills a cell specific to a particular stem or class of stems "blocks" the creation of a regular, predicted form for that stem to fill that paradigm cell. For example, the irregular is blocks the creation of regular be for the third-person singular cell of the paradigm for the verb to be.

Although the possible role of paradigms in a speaker's knowledge of language is still a somewhat controversial subject in linguistics (Bachrach & Nevins, 2008), properties of paradigms do not motivate a distinction between inflection and derivation, for two main reasons. First, to the extent that a derivational relation is made available by the general properties of language, derived forms can be displayed paradigmatically such that a form is predicted for each cell for each noun, verb, or adjective stem and such that there is a blocking relation between an irregular form (specific to a stem or set of stems) and a regular form. For example, the function served by the English -er suffix, attaching to verbs to create a noun referring to a person that habitually does what the verb describes, is generally available cross-linguistically, and every verb with the appropriate meaning is predicted to allow an -er formation. A definition of inflection that related to the formation of paradigms would include category-changing -er as inflection. Blocking effects in canonical category-changing derivational morphology are less easy to illustrate with simple English examples for the reasons described in Embick and Marantz (2008); however, such effects are definitely found. For example, the productive phonological realization of the morpheme creating nouns from adjectives with the meaning "the quality or property of being adjective" is the suffix -ness. The suffix -ity is a phonological realization of the same morpheme, but with a more restricted environment. In particular, when a stem ends in the affix -able, -ity is the preferred pronunciation, and -ness is "blocked": we get transferable and transferability, but not transferableness. The function of the morpheme -ness/-ity, then, creates a type of paradigm for adjectives, as well as a blocking relation between forms.

In addition to the fact that canonical derivational morphemes can be paradigmatic and exhibit blocking, a crucial fact that undermines any categorical distinction between inflection and derivation is that apparently inflected forms are often used in languages as category-changing derivation. In English, for example, the present participles of verbs, as -ing forms, are often used to create nouns, as in gerunds (the running of the race), and passive participles of verbs are often used to create adjectives (the closed door, a typed note).

In addition to the possible correlation between paradigms and inflectional morphology, linguists have explored the possibility that the phonological expression of morphemes differs fundamentally between derivation and inflection. For example, in Anderson's A-Morphous Morphology theory (Anderson, 1992), inflection involves phonological alterations to a stem, whereas derivation involves the concatenation of the stem with morphemes with independent phonological form (standard affixation, for example). However, a major outcome of research in the 1970s and 1980s was the demonstration that derivation and inflection cannot be distinguished phonologically. Apparent "processes" altering the forms of stems such as reduplication and truncation are not limited to canonical inflection, and in general it is not possible to predict anything about the phonology of a morpheme based on the inflection versus derivation distinction.

This conclusion about phonological realization is more general: the organization of grammar computes the phonological realization of morphemes and combinations of morphemes without universal constraints based on their syntactic or semantic features. Individual languages may impose constraints on the phonological forms of classes of morphemes; for example, the roots of verbs in Arabic are sequences of two to four consonants, with further constraints observed regarding possible sequences of identical consonants. But the general principles of language do not impose such constraints on roots, nor do they restrict reduplication to any subset of functional morphemes. As already mentioned, these principles do impose locality constraints on what might serve as context for choice of phonological realizations, and other properties of syntactic structures may determine aspects of phonological realization due to the very architecture of the grammar. For example, if phonological realization is restricted to syntactic units of a particular size, as it is in theories in which linguistic computation is "cyclic," then information carried by morphemes outside these cyclic realizational ("spell-out") domains cannot influence the pronunciation of morphemes inside these domains. But grammatical principles governing the phonological realization of morphemes do not seem to refer directly to classes of morphemes.

The uniformity of phonological realization across types of morphemes extends to the distinction between morphemes like case and agreement that reflect grammatical relations between elements in a syntactic structure and all other morphemes. The principles of morphology that govern how root or functional morphemes are pronounced do not single out case and agreement morphemes, although features of such morphemes do require special mechanisms in the syntax above and beyond the merger of morphemes into hierarchical structure. The fact that case and agreement cannot be identified solely on the basis of their phonology holds crucial clues regarding the organization of grammar and the nature of syntax. Although syntactic structure feeds semantic and phonological interpretation, its operation is autonomous and opaque in the phonological forms that language learners encounter.

In addition to investigating the significance of certain pretheoretical distinctions between classes of morphemes, like that between inflection and derivation, linguists have asked whether apparent typological differences between classes of languages, organized by their morphological systems, are theoretically meaningful. For example, languages are sometimes divided among the isolating and synthetic languages, with synthetic languages further divided into the fusional and the agglutinative. Isolating languages, to some degree, seem to lack morphology altogether. Within the morphological framework used in this chapter, these languages would be said to realize each root morpheme and each functional morpheme in an independent phonological word, with no affixation. Synthetic languages place multiple morphemes within a single phonological word, including multiple functional morphemes with a root. Latin is often cited as an example of a synthetic language, because, for example, a single verb might include information about tense (e.g., past), aspect (e.g., perfect), voice (e.g., active), person of the subject (e.g., first person), and number of the subject (e.g., plural): portavimus ("we carried (PERFECT)"). Among synthetic languages, fusional languages seem to use a single phonological piece, such as a single suffix, to express multiple grammatical features simultaneously (as English -s expresses both present-tense and third-person singular subject agreement), whereas agglutinative languages might string together long sequences of functional morphemes, each of which expresses an independent feature. This is exemplified in Turkish çekoslovakyalılaştıramayacaklarımızdanmıydınız, discussed in Lieber (2010) from Inkelas and Orgun (1998: 368):

1. çekoslovakya-lı-laş-tır-ama-yacak-lar-ımız-dan-mı-ydı-nız
   Czechoslovakia-from-become-CAUSE-unable-FUT-PL-1PL-ABL-INTERR-PAST-2PL
   "Were you one of those whom we are not going to be able to turn into Czechoslovakians?"

Even in describing these typological differences we have implied that languages do not fall into pure categories along the dimensions implied by their descriptions (English can illustrate isolation, fusion, and agglutination, for example). That is, languages are more or less isolating, and more or less fusional. Moreover, there do not appear to be any linguistic principles or generalizations that follow from the classification. That is, although being mostly isolating might be statistically correlated with other typological properties at a descriptive level of analysis, there is no direct relationship between the expression of functional morphemes as independent words and any syntactic or morphological properties of a language. The isolating/synthetic continuum, then, is descriptive rather than essential.
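On the view just sketched, the agglutinative/fusional contrast amounts to two different mappings from feature bundles to phonological pieces: one affix per feature versus one affix for the whole bundle. The sketch below makes that contrast concrete; the stems, affixes, and feature labels are invented placeholders (loosely Turkish- and Latin-flavored), not real forms, and the mapping tables are assumptions of mine.

```python
# Invented exponents for illustration only.
AGGLUTINATIVE = {"PAST": "-di", "PL": "-ler", "2": "-niz"}
FUSIONAL = {frozenset({"PAST", "PL", "2"}): "-istis"}

def realize_agglutinative(stem, features):
    # One independent affix per grammatical feature, strung together.
    return stem + "".join(AGGLUTINATIVE[f] for f in features)

def realize_fusional(stem, features):
    # A single affix simultaneously expresses the whole feature bundle,
    # so only the (unordered) set of features matters for lookup.
    return stem + FUSIONAL[frozenset(features)]

bundle = ["PAST", "PL", "2"]
print(realize_agglutinative("stem", bundle))  # stem-di-ler-niz
print(realize_fusional("stem", bundle))       # stem-istis
```

Nothing in this difference touches the syntax: the same feature bundle is present in both cases, and only its spell-out varies—one way of seeing why the isolating/fusional/agglutinative classification is descriptive rather than essential.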


A final distinction between possible approaches to described). For example, -en is the suppletive allo- morphological theory will serve to illustrate another morph of the plural morpheme in English used in the crucial property of the phonological realization of mor- environment of the stem ox; the default plural allo- phemes, that of the “default” realization. Traditional morph is -s. The -en/-s alternation is not an example of generative approaches to morphology have been “lexi- syncretism because the same feature—plural—is real- cal” in the sense that they proposed that the “lexical” ized by both allomorphs. Such suppletion, however, or storage form of morphemes included both the sets is an example of the general asymmetry between a of features they displayed to the syntax and the pho- more specified phonological realization (-en for plural nological form they carried to the sound realization of in the environment of ox) and a default realization sentences. In contrast, this chapter has been describing (-s elsewhere) that covers syncretism in a realizational a “realizational” theory of morphology in which account. In general, the cross-linguistic properties of the syntax combines “formless” (phonology-free) mor- asymmetric syncretism (with less specific phonological phemes and the phonological form is part of the pho- realizations acting as defaults with respect to more nological interpretation of a morpheme, constrained by specific realizations) and of (asymmetric) suppletion the syntactic features the morpheme contains but sepa- have supported realizational theories over lexical rated from these features. For example, a past-tense theories. morpheme would be identified as “past” and “tense” in the syntax, but its pronunciation would be deter- mined after syntactic combination, during phonologi- 13.5 THE VIEW FROM ABOVE cal interpretation. 
A crucial difference between the lexical and realiza- This chapter describes a theory of grammar in tional theories is their approach to “syncretism”: a situ- which morphemes are the minimal units of syntactic ation in which a single phonological form is used combination. Within such a theory, morphemes are across a variety of sets of features. For example, the subject to a recursive merge operation that builds form walk in English is syncretic as a present-tense hierarchical structures of constituents. In addition, cer- verb, being identical across first-person and second- tain syntactic relations between constituents are com- person subjects as well as third-person plural subjects, puted, leading to the features that are realized as case but contrasting with walks for third-person singular and agreement morphology. Languages differ in their subjects. With a realizational approach, a present-tense vocabularies of morphemes, particularly with respect morpheme is realized as -s when it carries third- to the root morphemes that anchor the major syntactic person singular features but is realized as null else- categories of nouns, verbs, and adjectives. Differences where (as a default). For the null suffix to work as in the vocabularies of functional morphemes across a default, the grammar must setup a competition languages directly influence typological differences in between -s and null for the realization of a present- syntax, as described by syntacticians concerned with tense feature with subject agreement features that are the “parameters” of variation between languages. already present in the structure being realized. The -s The linguistic subfield of morphology concerns itself realization wins the competition if the agreement with a number of topics surrounding morphemes and features are third-person singular; otherwise, the null their realization. For example, what is the substantive realization is used. 
With a lexical approach, phonologi- inventory of functional morphemes from which indi- cally identified morphemes must carry syntactic fea- vidual languages must choose for their vocabularies? tures into the syntactic derivation. With this approach, How do features of morphemes interact in the syntax, defaults that clearly behave as “elsewhere” cases in particularly with respect to the computation of case opposition to more featurally specified morphemes and agreement? The bulk of research specific to mor- cannot exist because the notion of “elsewhere” implies phology concerns the phonological realization of mor- the pre-existence of a structure with syntactic features, phemes, both the manner in which sets of morphemes yet on the lexical approach it is the morphemes them- get packaged into phonological words and the com- selves that provide the features for the syntax. Lexical putation of allomorphy—the choice of phonological approaches would postulate five homophonous walks realizations for a morpheme and the ways in which for the various non-third-person singular sets of agree- phonological realization might be influenced by con- ment features and attempt to provide other ways of text and by competition among phonological forms explaining the apparent redundancy here. (as for syncretism, suppletion, and blocking, described The notion of default in phonological realization previously). The theory of morphology is therefore extends to “suppletion,” where the same set of features about the choice of functional morphemes, the way is realized by different phonological forms in different that language-specific choices in morphemes and fea- environments (a type of contextual allomorphy, as tures interact with the syntax, the syntactic principles

C. BEHAVIORAL FOUNDATIONS 13.6 WORDS AND RULES: THE MODERN CONSENSUS ON DECOMPOSITION 161 that distribute and constrain features like case and effects would be observed from the operation of (regu- agreement on morphemes, the principles by which lar) rules.1 Frequency modulates behavior across both morphemes receive their phonological and semantic retrieval of the memorized forms of morphemes and interpretation, and the way that the phonology of a the composition of these forms. This observation does language packages the phonological material of mor- not undermine the essential distinction between the phemes into words and phrases. atoms of linguistic composition—the morphemes—and Morphology presents an account of a speaker’s combinatory operations that produce and modify knowledge of language inconsistent with the idea that structures of morphemes, but it does put pressure on words could be studied in isolation from larger syntac- any distinction between words and phrases. tic constituents. The word itself—to the extent that it From the viewpoint of linguistic morphology, corresponds to anything real in a person’s grammar— experiments using single word processing aimed at is a phonological unit, not a unit of syntactic combina- uncovering properties of a mental or neural lexicon are tion, and even as a phonological unit its properties are choosing a somewhat arbitrary unit for their stimuli. dependent on other phonological words within its Phonological words (from open-class categories) con- phonological phrase. That is, linguistics provides no sist at least of a root and a morpheme carrying syntac- basis for the notion of a mental lexicon consisting only tic category information, as well as the various of stored words with gestalt properties as wholes. functional morphemes from the root’s syntactic envi- The recognition or production of a word—to the ronment that are required to join the root in the same extent that the word is being recognized or produced word. 
A verb in English, for example, as a token of the as belonging to a speaker’s language—necessarily language, would consist of at least three morphemes: involves a sequence of grammatical computations, the root, a morpheme that carries the syntactic cate- including syntactic merger of morphemes and their gory “verb,” and a syntactically required tense mor- phonological realization and packaging. pheme. Single word experiments, then, might be seen as “small syntax” experiments; such experiments are not necessarily misguided, but they should not be pre- 13.6 WORDS AND RULES: THE MODERN sented as somehow in opposition to experiments using CONSENSUS ON DECOMPOSITION sequences of words. Even a simple experimental para- digm like confrontational picture naming requires the Psycholinguistic and neurolinguistic work since the participants to produce minimal units of linguistic 1990s has taken as its starting point the perspective of articulation; these would be phonological words, Pinker’s “words and rules” framework (Pinker, 1999). which are the output of a computation involving the From the viewpoint of linguistics, Pinker’s theory was syntactic combination of morphemes and the phono- based on a fundamental analytic mistake, because it logical interpretation of the resulting structure. postulated a grammatical difference between regular As explained in this chapter, morphology is not an and irregular inflectional morphology. 
For Pinker, it independent “component” of grammar for linguistics, was a syntactic fact that an irregular past-tense form which recognizes a generative syntactic component like taught was a single computational unit, repre- and two “interfaces.” One interface is concerned with sented as a nonbranching tree structure, as compared the realization of syntactic structures in sound (or sign with a regular form like walked, which would consist or orthography), and one is concerned with the seman- of a separate stem and past-tense morpheme in hierar- tic interpretation of those structures. Morphologists chical syntactic structure. In fact, from any consider- study particular aspects of the syntactic component ation of syntax and morphotactics, irregular and and the interfaces that center on the minimal units of regular forms behave identically—the difference is syntactic composition, but their special interests do not entirely within the realm of the realization of mor- pick out a subsystem of grammar with linguistically phemes phonologically (allomorphy). This linguistic significant autonomy. Therefore, neurobiological error of the words and rules framework was paired research aimed at denying the presence of “morphol- with an interesting but empirically falsified separation ogy” in the brain (Devlin, Jamison, Matthews, & between memory for morphemes and experience with Gonnerman, 2004) is not targeting the claim that mor- “rules,” such as the combination of morphemes. For phology is neurologically isolatable—because the claim the retrieval of morphemes, Pinker hypothesized fre- would be incoherent from a linguistic perspective— quency effects, but he proposed that no frequency but rather the claim that the connection between sound

1For some discussion of the behavioral evidence apparently supporting a binary distinction between irregular and regular inflection, from the point of view of morpheme-based linguistics, see Albright and Hayes (2003), Embick and Marantz (2005), Fruchter, Stockall, and Marantz (2013), and the references cited therein.

and meaning involves an autonomous computation of a syntactic structure.

Computational linguists that seriously consider both the linguistic and the experimental evidence also come to the conclusion that there is no principled distinction between the structure and realization of morphemes within (phonological) words and the structure and realization of combinations of words and phrases. However, such linguists may also deny the existence of morphemes, claiming that the appearance of structured units in the mapping between sound and meaning is an emergent property of systems learning such mappings, not an a priori feature of the learner's language acquisition system, as in Baayen, Milin, Đurđević, Hendrix, and Marelli (2011). That is, for a generative linguist, the language learner's task is to learn the grammar—the morphemes, constraints on their combination, rules for their phonological realization, and semantic interpretation—to account for observations about the correlation of sound and meaning. Regarding the opposing view of Baayen et al. (2011), learners would be acquiring unmediated sound/meaning correspondences, and morphemes would reflect general sound/meaning correspondences that converge, for example, on contiguous sequences of sounds.

There are linguists who might question the analysis of words as hierarchical organizations of morphemes. In this chapter, we have seen some reasons why the consensus within generative grammar strongly supports the decomposition of words into such syntactic structures. Despite these disagreements over morphological decomposition within words and the possibility of an a-morphous morphology, no productive research program in linguistics has pursued the idea that sentences are a-syntactic. All competing accounts of the well-formedness of sentences and the connections between sound and meaning at the sentential level assume a syntactic analysis that involves structures of morphemes—both the elements and the relations between the elements—that are relatively abstract with respect to their interpretations. For example, all major theories of syntax assume a set of syntactic categories—like noun, verb, and adjective—that although associated with distributional categories and connected to meanings can be reduced to neither. The absence of a motivated analytic dividing line between words and phrases pushes the linguists' conclusions into the interior of words: the connections between sound and meaning within and between words involve the computation of syntactic structures of morphemes.

Because the most striking claims of morphology involve the decomposition of words into syntactic structures of morphemes, much of the neurolinguistic research in the era of brain imaging and brain monitoring of healthy intact human brains has, along with corresponding psycholinguistic work, centered on demonstrating that speakers decompose words into morphemes in visual and auditory word recognition (see Ettinger, Linzen, & Marantz, 2014; Fruchter et al., 2013; Lewis, Solomyak, & Marantz, 2011; Rastle, Davis, & New, 2004; Solomyak & Marantz, 2010; Zweig & Pylkkänen, 2009, and the references cited therein). These studies provide striking, although from a linguist's perspective inevitable, support for "full decomposition" models of recognition in which readers and listeners recognize morphologically complex forms via recognition and combination of their parts.

As Marcus Taft (2004) has pointed out, the apparent incompatibility of full decomposition models with the observation that the surface frequency of a complex word is the primary predictor of reaction time to the word in lexical decision experiments is tied to the unsupported claim that the frequency of regular computations and their results do not affect behavior. If the frequency of combination of a stem with a regular past-tense ending affected the speed with which this combination could be computed in the future, then surface frequency effects for regular past-tense forms could be attributed to the stage of processing in a full decomposition model in which the morphemes that result from decomposition of a word are recomposed for evaluation as a whole. When brains are monitored with MEG, for example, early correlations of brain activity with properties of morphologically complex words show sensitivity to properties of the constituent morphemes and their relations to other morphemes, not to the surface frequency of the forms, although the latter correlates well with button-pressing responses (Solomyak & Marantz, 2010). The late timing of surface frequency effects supports Taft's interpretation: that they reflect the frequency of the computation of combination of morphemes, not the frequency of the static whole forms themselves.

Once we understand the necessity of composing morphemes via a syntactic derivation to create words, experiments using words provide a testing ground for general theories of language processing. Neurolinguistic work of the next decades should uncover how the brain accesses representations of morphemes, combines these representations into hierarchical syntactic structures, and realizes these structures in form and meaning. Theories of morphology in linguistics make explicit the types of knowledge that must be manipulated in these computations, as well as specifics about the computations and their constraints and detailed phenomena that any account of language processing must explain. Linguistic morphology, then, should be a key element in the neurobiology of language enterprise.
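The full decomposition architecture discussed in this section—morphemes are accessed first, and the parts are then recomposed, with surface frequency plausibly affecting the late recompositional stage—can be illustrated with a toy simulation. This sketch is our own illustration, not a model from the literature; the lexicon, the frequency counts, and the cost function are all invented for the example.

```python
# Toy sketch of a full decomposition pipeline: decompose a word into
# morphemes, then recompose them. All frequency counts are invented.

MORPHEME_FREQ = {"walk": 500, "taught": 40, "-ed": 10000}   # hypothetical counts
COMBINATION_FREQ = {("walk", "-ed"): 300}                    # hypothetical counts

def decompose(word):
    """Strip a regular suffix when the remainder is a known stem."""
    if word.endswith("ed") and word[:-2] in MORPHEME_FREQ:
        return [word[:-2], "-ed"]
    return [word]            # irregulars and monomorphemic forms stay whole

def recognition_cost(word):
    """Early stage: access each morpheme (cost falls with morpheme frequency).
    Late stage: recompose the morphemes (cost falls with the frequency of the
    combination) -- the stage at which apparent surface frequency effects
    could arise, on Taft's interpretation."""
    parts = decompose(word)
    cost = sum(1.0 / MORPHEME_FREQ[p] for p in parts)
    if len(parts) > 1:
        cost += 1.0 / COMBINATION_FREQ.get(tuple(parts), 1)
    return cost

print(decompose("walked"))   # ['walk', '-ed']
print(decompose("taught"))   # ['taught'] -- no decomposition for the irregular
```

Because the recomposition term depends on the frequency of the stem plus affix combination rather than on a stored whole form, a model of this shape can show "surface frequency" sensitivity without storing complex words as units.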


Acknowledgments

This work was supported in part by grant G1001 from the NYUAD Institute, New York University Abu Dhabi. I thank Tal Linzen and Greg Hickok for comments regarding an earlier draft of this chapter, and Phoebe Gaston for editorial assistance.

References

Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119–161.
Anderson, S. R. (1992). A-morphous morphology. Cambridge: Cambridge University Press.
Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438–481.
Baker, M. (1985). The mirror principle and morphosyntactic explanation. Linguistic Inquiry, 16(3), 373–415.
Bobaljik, J. (2008). Paradigms (optimal and otherwise): A case for skepticism. In A. Bachrach & A. Nevins (Eds.), Inflectional identity (pp. 29–54). Oxford: OUP.
Chomsky, N. (1995). The minimalist program. Cambridge, MA: MIT Press.
Devlin, J. T., Jamison, H. L., Matthews, P. M., & Gonnerman, L. M. (2004). Morphology and the internal structure of words. Proceedings of the National Academy of Sciences of the United States of America, 101(41), 14984–14988.
Embick, D. (2010). Localism versus globalism in morphology and phonology. Cambridge, MA: MIT Press.
Embick, D., & Marantz, A. (2005). Cognitive neuroscience and the English past tense: Comments on the paper by Ullman et al. Brain and Language, 93(2), 243–247.
Embick, D., & Marantz, A. (2008). Architecture and blocking. Linguistic Inquiry, 39(1), 1–53.
Ettinger, A., Linzen, T., & Marantz, A. (2014). The role of morphology in phoneme prediction: Evidence from MEG. Brain and Language, 129, 14–23.
Fruchter, J., Stockall, L., & Marantz, A. (2013). MEG masked priming evidence for form-based decomposition of irregular verbs. Frontiers in Human Neuroscience, 7.
Hall, T. A. (1999). The phonological word: A review. In T. A. Hall & U. Kleinhenz (Eds.), Studies on the phonological word (pp. 1–22). Philadelphia, PA: John Benjamins.
Inkelas, S., & Orhan Orgun, C. (1998). Level (non)ordering in recursive morphology: Evidence from Turkish. In S. Lapointe, D. Brentari, & P. Farrell (Eds.), Morphology and its relation to phonology and syntax (pp. 360–410). Stanford, CA: CSLI.
Lewis, G., Solomyak, O., & Marantz, A. (2011). The neural basis of obligatory decomposition of suffixed words. Brain and Language, 118(3), 118–127.
Lieber, R. (2010). Introducing morphology. Cambridge: Cambridge University Press.
Marantz, A. (2013a). Locality domains for contextual allomorphy across the interfaces. In O. Matushansky & A. Marantz (Eds.), Distributed morphology today (pp. 95–115). Cambridge, MA: MIT Press.
Marantz, A. (2013b). No escape from morphemes in morphological processing. Language and Cognitive Processes, 28(7), 905–916.
Matthews, P. H. (1965). The inflectional component of a word-and-paradigm grammar. Journal of Linguistics, 1(2), 139–171.
Matushansky, O., & Marantz, A. (2013). Distributed morphology today. Cambridge, MA: MIT Press.
Pinker, S. (1999). Words and rules: The ingredients of language. New York, NY: Basic Books.
Rastle, K., Davis, M. H., & New, B. (2004). The broth in my brother's brothel: Morpho-orthographic segmentation in visual word recognition. Psychonomic Bulletin & Review, 11(6), 1090–1098.
Solomyak, O., & Marantz, A. (2010). Evidence for early morphological decomposition in visual word recognition. Journal of Cognitive Neuroscience, 22(9), 2042–2057.
Taft, M. (2004). Morphological decomposition and the reverse base frequency effect. Quarterly Journal of Experimental Psychology Section A, 57(4), 745–765.
Zweig, E., & Pylkkänen, L. (2009). A visual M170 effect of morphological complexity. Language and Cognitive Processes, 24(3), 412–439.

CHAPTER 14

Syntax and the Cognitive Neuroscience of Syntactic Structure Building

Jon Sprouse1 and Norbert Hornstein2

1Department of Linguistics, University of Connecticut, Storrs, CT, USA; 2Department of Linguistics, University of Maryland, College Park, MD, USA

14.1 INTRODUCTION

One goal of cognitive neuroscience, if not the goal of cognitive neuroscience, is to uncover how neural systems can give rise to the computations that underlie human cognition. Assuming, as most do, that the relevant biological description can be found at the level of neurons, then another way of stating this is that cognitive neuroscience is (at least) the search for the neuronal computations that underlie human cognition (Carandini, 2012; Carandini & Heeger, 2012). To the extent that this is an accurate formulation of the goal(s) of the field, any research program in cognitive neuroscience will have three components: (i) a cognitive theory that specifies the potential computations that underlie cognition; (ii) a neuroscientific theory that specifies how neurons (or populations of neurons) perform different types of computations; and (iii) a linking theory that maps between the cognitive theory and the neuroscientific theory (Marantz, 2005; Poeppel, 2012; Poeppel & Embick, 2005). We take all of this to be relatively uncontroversial; however, we mention it explicitly because we believe that modern syntactic theories, under a certain conception, are well-positioned to provide the first component (a theory of computations) for a cognitive neuroscientific theory of syntactic structure building. Our goal in this chapter is to make a case for this belief. We hope to demonstrate that the potential for a productive cross-fertilization exists between theoretical syntacticians and neuroscientists, and we suggest that developments in syntactic theory over the past two decades make this an optimal time to engage seriously in this collaboration.

For ease of exposition, we call our view that the theory of syntax can be, and should be, viewed as a theory of syntactic structure-building computations the computational view of syntax. This view is simply that the syntactic operations that have been proposed in syntactic theory (e.g., merge in Minimalism, substitution in Tree-Adjoining Grammar [TAG]) are a plausible cognitive theory of the structure-building computations that neurons must perform to process language. Therefore, a plausible research program for cognitive neuroscience would be to search for a theory of: (i) how (populations of) neurons could perform these computations and (ii) which (populations of) neurons are performing these computations during any given language processing event. As syntacticians, this strikes us as the natural evolution of the goals of the cognitive revolution of the 1950s in general, and of the goals of generative linguistics in particular. However, we are also aware that this is not how many would describe current syntactic theory. Therefore, we attempt to make our case in a series of steps. In Section 14.2, we provide a brief history of the field of syntax. The goal of this section is to contextualize modern syntactic theories such that it becomes clear that modern theories are not simply lists of grammatical rules (although older theories were), but instead theories of cognitive computations. In Section 14.3, we present two concrete examples of potential structure-building computations (from two distinct contemporary syntactic theories) to illustrate the computational view of syntax. In Section 14.4, we lay out several of the properties of modern syntactic theories that we believe make them well-suited for the computational view of syntax. We believe that these properties will be easily recognizable to all cognitive neuroscientists as the properties of a theory of cognitive computations. In Section 14.5, we discuss the large-scale collaboration between syntacticians, psycholinguists, and neuroscientists that will be necessary to construct a cognitive

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00014-6 © 2016 Elsevier Inc. All rights reserved.

neuroscience of syntactic structure building. In Section 14.6, we discuss some of the challenges that this collaboration might face. Section 14.7 concludes.

Before making our case for the computational view of syntax, a small clarification about the scope of this chapter is in order. We have explicitly chosen to focus on the issue of why syntactic theories will be useful for a cognitive neuroscience of language, and not how syntactic theorizing is conducted today. In other words, this chapter is intended to lay out arguments in favor of a large-scale collaboration between syntacticians and neuroscientists and is not intended to be a review chapter on syntax. We assume that if our arguments are successful, then syntacticians within these collaborations can carry the burden of doing the syntax. That being said, for readers interested in reviews of topics in contemporary syntax, we can recommend the review chapters in the recently published Cambridge Handbook of Generative Syntax (den Dikken, 2013), which contains 26 excellent review chapters covering everything from the history and goals of syntactic theory, to overviews of several major contemporary theories, to reviews of specific phenomena in syntax.

14.2 A BRIEF HISTORY OF SYNTACTIC THEORY

Syntactic theory starts from two critical observations. The first is that there is no upper bound on the number of possible phrases/sentences within any given language (i.e., languages are, for all practical purposes, "infinite"). This implies that successful language learning is not only the memorization of a set of expressions (otherwise infinity would be impossible) but also the acquisition of a grammar, which is just a finite specification of a recursive set of combinatory rules. The second observation is that any child can acquire any language (e.g., a child born to US citizens living in Kenya will successfully learn Swahili if exposed to Swahili speakers during childhood). Given that the first observation suggests that languages should be viewed as grammars, the second observation translates as any child can acquire any grammar. These two observations lead to the two driving questions for the field of syntax:

1. What are the properties of the grammars of all of the world's languages?
2. What are the mental mechanisms that allow humans to learn human languages?

The goal of Generative Syntax (GS) over the past 60 years has been to explore the properties of human grammars (question 1) in such a way to make it possible to explore the mental mechanisms that are required for successful language acquisition (question 2). As with any specialized science, the pursuit of these dual driving questions has led to the development of specific research programs and technical terminology, both of which have at times been opaque to other cognitive scientists working outside of syntax. Our goal in this section is to provide a brief history of the way the field has pursued these driving questions (to contextualize the modern syntactic theories discussed in Section 2.2) and to clarify some of the major points of miscommunication that have historically arisen between syntacticians and other cognitive scientists.

GS began by describing specific rules found in particular languages (and so contained in the grammars of these languages). This is not surprising; if one is interested in the kinds of rules natural language grammars contain, then a good way to begin is by looking for particular examples of such rules. Thus, in the earliest period of GS, syntacticians built mini-grammars describing how various constructions in particular languages were built (e.g., relative clauses in Chamorro, questions in English, topic constructions in German, reflexivization in French, etc.) and how they interacted with one another to generate a reasonably robust "fragment" of the language.

With models of such grammars in hand, the next step was to factor out the common properties of these language particular grammars and organize them into rule types (e.g., movement rules, phrase structure rules, construal rules). This more abstract categorization allowed for the radical simplification of the language particular rules investigated in the prior period, with constructions reducing to congeries of simpler operations (although analogies are dangerous, this seems similar to the way other sciences often discover that seemingly distinct phenomena are in fact related, such as the unification of planetary motion, projectile motion, and tidal motion as instances of gravitational attraction in physics). By the mid 1980s there were several reasonably well-articulated candidate theories of syntax (e.g., Government and Binding Theory, Lexical-Functional Grammar, Tree-Adjoining Grammar), each specifying various rule types and their properties and each illuminating commonalities across constructions and across languages.

The simplification of grammatical rule types also led to progress on the second driving question. By reducing syntactic theories to only a few rule types, syntacticians could reduce the number of learning mechanisms required to learn human grammars (here we use the term "learning mechanisms" as a cover term for all of the components of learning theories: biases to attend to certain input, specifications of hypothesis spaces, algorithms for searching hypothesis spaces, etc.). With fewer learning mechanisms in the theory, syntacticians

were able to investigate (and debate) the nature of the learning mechanisms themselves. Although there are a number of dimensions along which learning mechanisms might vary, syntactic theory has often focused on two in particular. The first is specificity: the learning mechanisms either can be domain-general, meaning that they are shared by several (or all) cognitive domains, or can be domain-specific, meaning that they are specific to language learning. The second dimension is nativity: the learning mechanisms either can be innate, meaning that they arise due to the genetic makeup of the organism, or can be derived, meaning that they are constructed from the combination of experience and other innate mechanisms. This leads to a 2 x 2 grid that can be used to classify any postulated learning mechanism (see also Pearl & Sprouse, 2013):

                        Specificity
                        Domain-specific          Domain-general
Nativity   Innate       Universal Grammar        e.g., statistical learning
           Derived      e.g., learning to read   e.g., n-grams

What is particularly interesting about this grid is that it helps to clarify some of the miscommunications that have often arisen between syntacticians and other cognitive scientists surrounding terms like "innate," "domain-specific," and, worst of all, "Universal Grammar (UG)." This grid highlights the fact that the classification of any given learning mechanism is an empirical one. In other words, given a rule type X and a learning mechanism Y that could give rise to X, which cell does Y occupy in the grid? It may be the case that one or more of the cells are never used. Second, this grid highlights the fact that a complete specification of all of the rule types underlying human grammars and all of the learning mechanisms deployed to learn human grammars could involve any combination of the four types of learning mechanisms. As cognitive scientists, syntacticians are interested in all of the mechanisms underlying human syntax, not just the ones that get all of the attention in debates. Finally, this grid clarifies exactly what syntacticians mean when they use the term "Universal Grammar." Universal Grammar is just a special term for potential learning mechanisms that are simultaneously domain-specific and innate. Despite this rhetorical flourish, we hope it is clear that syntacticians view UG mechanisms (if they exist at all) as only a subset of the learning mechanisms that give rise to human language.1

The progress made in the 1980s regarding simplifying the rule types in human grammars also laid the foundation for the current research program within modern GS: to distill the computational commonalities found among the various kinds of rules (i.e., the computational features common to movement rules, phrase building rules, and construal rules). Here, again, the dimension of domain-generality and domain-specificity plays a role in theoretical discussions, but this time at the level of cognitive computation rather than at the level of learning mechanisms. As syntacticians have made progress distilling the computational properties of grammatical rules, they have found that some of the suggested computations appear similar to computations in other domains of cognition (e.g., the binding, or concatenation, of two mental representations), whereas others still retain some amount of domain-specificity (see Section 14.3 for a concrete example). Current GS work is pursuing this program in full force: attempting to identify the basic computations and determine which are specific to the syntax and which are shared with other cognitive domains.

Note the odyssey described: the field of syntax moved from the study of very specific descriptions of particular rules in particular languages to very general descriptions of the properties of linguistic computations and their relationship, and finally to cognitive computation more generally. This shift in the "grain" of linguistic analysis (in the sense of Poeppel & Embick, 2005) has had two important effects. First, it has reduced the special "linguistic" character of syntactic computations, making them more similar to the cognitive computations we find in other domains. Second, it has encouraged investigation of

1As a quick side note on Universal Grammar, the reason that UG receives so much attention, both within the syntax literature and across cognitive science, is that the other three types of learning mechanisms are generally uncontentious. For example, it is widely assumed that learning cannot occur in a blank slate (i.e., every learning system needs some built in biases if there is to be any generalization beyond the input); therefore, at least one learning mechanism must be innate. Nearly every postulated neural architecture (both symbolic and subsymbolic) assumes some form of statistical learning, which is presumably a learning mechanism (or set of mechanisms) that is domain- general and innate. The domain-general/derived cell is likely filled with the more complex statistical learning mechanisms required by different domains of cognition, such as the ability to track the probabilities of different sized sequences (n-grams). Similarly, the domain- specific/derived cell could potentially contain the learning mechanisms tailored to specific areas of higher-order cognition, such as reading (or maybe even language itself), but built from cognitive mechanisms available more broadly. It is the final cell, domain-specific/innate, that is the most contentious (and therefore, to some, the most interesting). In syntax, we call learning mechanisms that potentially fall into this cell Universal Grammar to highlight their significance. Currently, as we note here, a very hot area of syntactic investigation aims to reduce these domain-specific innate mechanisms to a minimum without losing explanations for the linguistic phenomena and generalizations that syntacticians have discovered over the past 60 years of syntactic research.
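The first observation of Section 14.2—that a finite, recursive set of combinatory rules generates an unbounded set of phrases/sentences—can be made concrete with a deliberately tiny rewrite system. This is our own toy example; the rules and words are invented for illustration and are not a grammar proposed in the chapter.

```python
# A finite set of combinatory rules that generates an unbounded set of
# sentences. The recursion lives in VP, which can embed another S.

RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["John"], ["Mary"]],
    "VP": [["left"], ["thinks", "S"]],   # recursive expansion: VP -> thinks S
}

def generate(symbol="S", depth=0, max_depth=3):
    """Expand a symbol, choosing the recursive VP rule while depth allows."""
    if symbol not in RULES:
        return [symbol]                  # terminal word
    options = RULES[symbol]
    expansion = options[-1] if (symbol == "VP" and depth < max_depth) else options[0]
    out = []
    for s in expansion:
        out.extend(generate(s, depth + 1, max_depth))
    return out

print(" ".join(generate(max_depth=0)))  # John left
print(" ".join(generate(max_depth=4)))  # John thinks John thinks John left
```

Raising max_depth yields ever-longer sentences from the same three rules, which is the sense in which a grammar is a finite specification of an "infinite" language.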

how syntactic computations might be used in real-time tasks such as parsing, production, and learning. Both these effects have had the consequence of bringing syntactic theory much closer to the empirical interests of others working in cognitive neuroscience.

Unfortunately, this shift in syntactic theory and its implications for cognitive neuroscience has not always been widely appreciated. Although the field of syntax was a central player in the cognitive revolution of the 1950s, in the intervening decades, syntax and the other domains of cognitive science have drifted apart. Some of this drift is the inevitable consequence of scientific specialization, and some of it reflects the internal logic of the different research programs (i.e., that the rule-based theories of the past were a necessary step in the evolution of syntactic theories). However, some of the drift reflects the view that syntactic theory has little to contribute to other domains of language research (including cognitive neuroscience). We worry that part of this problem may be that syntacticians have done a less-than-adequate job of conveying the general computational character of modern syntactic theories. In the absence of such discussions, it would not be surprising to learn that some cognitive neuroscientists still view syntax in terms of the phrase structure rules and transformations that typified syntactic theory in the 1950s and 1960s (and in varying forms up through the 1980s), rather than the more cognitively general computations common in current practice.2 In the following sections, we provide two examples of how contemporary syntax might fruitfully make contact with cognitive neuroscience.

14.3 TWO CONCRETE EXAMPLES OF SYNTACTIC STRUCTURE-BUILDING COMPUTATIONS

Although early formulations of syntactic theories postulated complex rules that applied to entire constructions (often permuting, adding, or deleting multiple words in different positions in the constructions), as noted in Section 2.1, there has been a steady evolution toward theories that postulate a small number of structure-building operations that can be applied mechanistically (or derivationally) to construct more elaborate syntactic structures in a piecewise fashion. With very few exceptions, the primitives of contemporary syntactic theories are units and the computations that apply to those units. The following are two concrete examples3:

The syntactic theory known as Minimalism (or the Minimalist Program) postulates a single structure-building computation called merge, which takes two units and combines them to form a third. The units in Minimalism are lexical and sublexical items (something akin to the notion of word or morpheme, although the details can vary by analysis). Merge applies to these units directly, and also applies recursively to the output of previous instances of merge. In this way, merge can be used to iteratively construct complex syntactic structures from a basic inventory of lexical atoms. Of course, merge cannot freely concatenate any two units together. This means that restrictions on merge must be built into the lexical items themselves (only certain lexical items are compatible with each other), and in the case of merging units with the output of previous merges, this means that the outputs of merge must also contain restrictive properties. This is accomplished through a labeling computation, let us call it label, that applies a label to the output of merge, which can then be used to determine what that output can be merged with in the future.

The goal of syntactic theory is to capture the major properties of human syntactic structures with the proposed units and computations. For concreteness, we illustrate how merge and label succeed in capturing two such properties. The first is the distinction between local dependencies and nonlocal dependencies. A local dependency is simply the relationship between two adjacent items in a sentence. Local dependencies are

2The rule and transformation view of syntax has other problems as well. This conception of syntax is considered problematic for the computational view of syntax, because there are well-known empirical results from the 1950s and 1960s that appear to demonstrate that rule-based syntactic theories of that sort are poor models for real-time sentence processing (or, more specifically, poor predictors of complexity effects in language processing, as captured by the Derivational Theory of Complexity; for reviews see Fodor, Bever, & Garrett, 1974; but see Phillips, 1996 for a useful reevaluation of these claims). This problem is compounded by the fact that syntactic theories are at best only theories of syntactic structure building, with little to nothing to say about other components that are necessary for a complete theory of sentence processing, such as ambiguity resolution, memory/resource allocation, semantic structure building, and discourse structure building. Therefore, if one views syntactic theory as a rule-based theory, then it might appear to be a poor theory of only one small corner of language processing. Even as syntacticians, we understand why other cognitive scientists might find this version of syntactic theory difficult to engage with.

3There are, of course, a number of other syntactic theories that propose different types of computations (and different types of units). For example, Head-Driven Phrase Structure Grammar (HPSG) proposes a computation similar to merge, but without the possibility of internal merge (nonlocal dependencies involve a special slash unit instead). Construction grammar proposes a tree-unification computation similar to substitution in TAG, but operating over much larger units (entire constructions) and with the possibility of multiple unification points in a single construction. We assume that a full-fledged research program on the computational view of syntax would investigate all of these possible theories.

captured by merge by concatenating two distinct elements together. A nonlocal dependency is a relationship between two elements that are not adjacent in a sentence, such as the word what and buy in the question What did John buy? Nonlocal dependencies can be modeled by merge by concatenating a phrase with an element that is already properly contained within that phrase. Syntacticians call the former instantiation external merge, because the two elements are external to each other, and call the latter instantiation internal merge, because one element is properly contained within the other (Chomsky, 2004). The second property is the distinction between structures that contain verbs and their arguments (e.g., eat bananas) and structures that contain modifiers (e.g., eat quickly). The former, which we can call nonadjunction structures, are built from a combination of merge and label; the latter, which we can call adjunction structures, are built from merge alone (no label) (Hornstein, 2009). In this way, the two primitive computations merge and label can be used to construct syntactic structures capable of modeling the variety of structures one finds within natural language.

The syntactic theory known as Tree-Adjoining Grammar postulates two structure-building computations called substitution and adjunction. The units in TAG are small chunks of syntactic structure, or trees (hence the name of the theory). The substitution computation allows two elementary trees to be concatenated into locally dependent, nonadjunction structures. The adjunction computation, as the name implies, allows two trees to be concatenated into locally dependent, adjunction structures. TAG captures nonlocal dependencies that are only a single clause in length with a single elementary tree (so, What did John buy? is a single tree without any application of substitution or adjunction). For dependencies that are more than one clause in length, the adjunction computation is applied to a special type of tree called an auxiliary tree to extend the dependency length. In this way, the two primitive computations substitution and adjunction can be used to construct syntactic structures from elementary and auxiliary trees, and they give rise to the important distinctions of human syntax (for accessible introductions to TAG, see Frank, 2002, 2013) (Box 14.1).

Although both theories capture the same set of phenomena in human syntax, and although both theories postulate structure-building computations, they do so using different computations, different units, and different combinations of computations for each phenomenon. For nonadjunction structures that involve only local dependencies, Minimalism uses external merge and label, whereas TAG uses substitution with two elementary trees. For adjunction structures, Minimalism uses external merge alone, whereas TAG uses adjunction with two elementary trees. For nonlocal dependencies, Minimalism uses internal merge and label, whereas TAG uses adjunction with one elementary tree and one auxiliary tree. The similarities between these two syntactic theories (i.e., both use two basic computations to capture a wide range of characteristics of human syntax) suggest that both are

BOX 14.1

STRUCTURE-BUILDING COMPUTATIONS IN MINIMALISM AND TAG

The structure-building computation in Minimalism is called merge. It takes two syntactic objects and concatenates them into a third object. When the two syntactic objects are distinct, it is called external merge. When one of the objects is contained within the other, it is called internal merge:

External merge: [eat] + [bananas] = [[eat] [bananas]]
Internal merge: [did John buy what] + [what] = [[what] [did John buy what]]

The label computation determines the properties of the new syntactic object constructed by merge by applying a label based on the properties of one of the merged objects (the head). Label is mandatory for the merge of argument relationships (e.g., verbs and their arguments), but it appears to be optional for the merge of adjuncts (e.g., verbs and modifiers):

Merge with label: [V eat] + [NP bananas] = [VP [V eat] [NP bananas]]
Merge without label: [VP [V run]] + [AdvP quickly] = [[VP [V run]] [AdvP quickly]]

TAG proposes two structure-building operations. Substitution combines two elementary trees to form argument relationships, whereas adjunction combines elementary trees and adjunct trees to form adjunction structures:

Substitution: [DP John] + [TP [DP ] [VP eats bananas]] = [TP [DP John] [VP eats bananas]]
Adjunction: [TP [DP John] [VP [V runs]]] + [VP [VP ] quickly] = [TP [DP John] [VP [VP [V runs]] quickly]]
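The computations in Box 14.1 can be given a toy implementation over bracketed trees. The following Python sketch is our own illustrative encoding (tuples for trees; all function names are ours, not from any published formalization), reproducing the box's external merge, label, and internal merge examples:

```python
# Toy encoding of Box 14.1's Minimalist computations. Trees are tuples:
# (label, child1, child2) for phrases, (category, word) for words.

def leaf(cat, word):
    return (cat, word)

def merge(a, b, label=None):
    """Concatenate two syntactic objects into a new one, optionally
    projecting a label (label=None models merge without label)."""
    return (label, a, b)

def contains(tree, part):
    """True if `part` occurs properly inside `tree`."""
    return isinstance(tree, tuple) and any(
        child == part or contains(child, part) for child in tree
    )

def internal_merge(phrase, subpart, label=None):
    """Re-merge an element already contained within the phrase
    (the source of nonlocal dependencies)."""
    assert contains(phrase, subpart), "internal merge requires containment"
    return merge(subpart, phrase, label=label)

# Merge with label: [V eat] + [NP bananas] = [VP [V eat] [NP bananas]]
vp = merge(leaf("V", "eat"), leaf("NP", "bananas"), label="VP")
assert vp == ("VP", ("V", "eat"), ("NP", "bananas"))

# Merge without label: [VP [V run]] + [AdvP quickly]
adjoined = merge(("VP", leaf("V", "run")), leaf("AdvP", "quickly"))
assert adjoined[0] is None  # no label projected

# Internal merge: [did John buy what] + [what]
q = internal_merge(
    (None, leaf("T", "did"), (None, leaf("DP", "John"),
     (None, leaf("V", "buy"), leaf("DP", "what")))),
    leaf("DP", "what"),
)
assert q[1] == ("DP", "what")  # the moved element sits at the edge
```

The containment check is what distinguishes internal from external merge: the same concatenation operation applies in both cases, differing only in whether one input already sits inside the other.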

14. SYNTAX AND THE COGNITIVE NEUROSCIENCE OF SYNTACTIC STRUCTURE BUILDING

tapping into deeper truths about human structure-building computations. However, the subtle differences in the character of the proposed computations suggest that one might be able to derive competing predictions from each theory about the presence or absence of computations in different constructions. This combination of abstract similarities and subtle differences strikes us as a potentially fruitful starting point for a search for neuronal structure-building computations.

14.4 ADDITIONAL PROPERTIES OF SYNTACTIC THEORIES THAT ONE WOULD EXPECT FROM A THEORY OF COGNITIVE COMPUTATIONS

In addition to focusing on structure-building computations, there are a number of additional properties of contemporary syntactic theories that make them ideal candidates for the computational view of syntax. Here, we review three.

First, contemporary syntactic theories attempt to minimize the number of computations while maximizing the number of phenomena captured by the theory. This is a general desideratum of scientific theories (it is sometimes called unification, or reductionism, or just Occam's razor), and syntax, as a science, has adopted it as well. In fact, the name Minimalism was chosen to reflect the fact that years of investigations using earlier theories had yielded enough information about the properties of language as a cognitive system that it was finally possible to fruitfully incorporate unification/reduction/Occam's razor as a core principle of the research program. Other syntactic theories have been less blunt about this in their naming conventions, but the principles are obvious in the shape of the theories. Commitment to Occam has led to syntactic theories based on simple computations with wide applicability across the thousands of syntactic constructions in human languages. One nice side benefit of the ratio of computations to constructions is that it may make the search for neurophysiological correlates of these computations more fruitful, especially given concerns about spurious correlations in high-dimensional neurophysiological data.

Second, syntactic theories attempt to minimize the number of domain-specific computations and maximize the number of domain-general computations (to the extent possible given the overall minimization of the number of computations). This is an important, and often overlooked, point within syntax. The merge computation in Minimalism and the substitution computation in TAG are both plausibly domain-general computations similar to the binding computations that occur in multiple cognitive domains (vision, hearing, etc.), albeit operating over language-specific representations. The formulation of these plausibly domain-general computations stems directly from the premium that syntactic theories now place on unification/reductionism. In contrast, the label computation and the adjunction computation are potentially domain-specific, because there are no obvious correlates in other cognitive domains, although that could just be a consequence of our current state of knowledge. The question of whether plausibly domain-specific computations like label and adjunction can be learned or must be innate is an open area of research in language acquisition.

Finally, syntactic theories have mapped a sizable portion of the potential hypothesis space of syntactic structure-building computations. As we have mentioned, with few exceptions, every contemporary syntactic theory has the potential to serve as a theory of cognitive structure-building computations. Although the sheer number of competing theories may seem daunting from outside of syntax, from inside of syntax we believe this is a necessary step in the research. We need to explore every possible combination of unit-size and computation type that captures the empirical facts of human languages (and to be clear, not every combination does) to provide neuroscientists with a list of possible cognitive computations. To be sure, there is more work to be done on this front. And it goes without saying that syntacticians actively debate the empirical coverage of the different theories, and also how well each theory can achieve empirical coverage without inelegant stipulations. But from the perspective of cognitive neuroscience, the value is in the hypothesis space: each theory represents a different hypothesis about the types of fundamental structure-building computations (and the distribution of those functions across different sentences in any given language).4
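The distribution of computations across phenomena that distinguishes Minimalism from TAG can be written down as a small lookup table. The sketch below is our own gloss of the chapter's comparison (the labels and encoding are illustrative, not a formal implementation of either theory); it treats each theory as a hypothesis about which computations occur where, and picks out the phenomena where the theories make diverging predictions:

```python
# Each theory is encoded as a hypothesis about which structure-building
# computations occur for which phenomenon (our own illustrative gloss).
HYPOTHESES = {
    "Minimalism": {
        "local nonadjunction": ["external merge", "label"],
        "adjunction": ["external merge"],          # merge without label
        "nonlocal dependency": ["internal merge", "label"],
    },
    "TAG": {
        "local nonadjunction": ["substitution"],   # two elementary trees
        "adjunction": ["adjunction"],              # two elementary trees
        "nonlocal dependency": ["adjunction"],     # elementary + auxiliary tree
    },
}

def diverging_phenomena():
    """Phenomena where the theories predict different computations,
    i.e., where competing neuronal predictions could be derived."""
    return sorted(
        p for p in HYPOTHESES["Minimalism"]
        if HYPOTHESES["Minimalism"][p] != HYPOTHESES["TAG"][p]
    )

print(diverging_phenomena())
# ['adjunction', 'local nonadjunction', 'nonlocal dependency']
```

Under this encoding every phenomenon diverges at the level of computation names, which is precisely the property the chapter suggests exploiting: each row of the table is a candidate contrast for a neurophysiological experiment.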

4 Inside of the field of syntax there is a recurring debate about whether different syntactic theories (e.g., Minimalism and TAG) are in some sense notational variants of one another. There are various mathematical proofs demonstrating that many theories are identical in terms of weak generative capacity (i.e., the ability to create certain strings of symbols and not others; e.g., Joshi, Vijay-Shanker, & Weir, 1991; Michaelis, 1998; Stabler, 1997). However, it is an open question whether these theories are equivalent in other terms, such as strong generative capacity (the types of structures that they can generate) or empirical adequacy for human languages. It is interesting to note that inside of syntax this debate is often couched in terms of theoretical "elegance" (i.e., how elegantly one theory captures a specific phenomenon relative to another theory). However, the research program suggested here would make such debates purely empirical: the "correct" syntactic theory would be the one that specifies the correct distribution of syntactic computations (and therefore their neuronal instantiations) across all of the constructions of a given language.
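The weak/strong distinction in this footnote can be made concrete: two grammars are weakly equivalent if they generate the same strings, even when they assign those strings different structures. A toy Python sketch (entirely our own illustration, not any of the cited formalisms) builds the same three-word string under two different bracketings:

```python
# Two "grammars" that yield the same strings (weak equivalence) but
# assign different trees (not strong equivalence). Purely illustrative.

def right_branching(words):
    """[a [b c]]-style structure."""
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching(words[1:]))

def left_branching(words):
    """[[a b] c]-style structure."""
    if len(words) == 1:
        return words[0]
    return (left_branching(words[:-1]), words[-1])

def yield_of(tree):
    """Flatten a tree back into its string of words."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree for w in yield_of(child)]

s = ["the", "old", "man"]
t1, t2 = right_branching(s), left_branching(s)
assert yield_of(t1) == yield_of(t2) == s  # same strings: weakly equivalent
assert t1 != t2                           # different structures assigned
```

The research program sketched in the main text would adjudicate at the stronger level: not whether two theories generate the same strings, but whether they assign the same computations (and hence structures) at the same points.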


14.5 THE COLLABORATION NECESSARY TO ENGAGE IN THIS PROGRAM

The research program that the computational view of syntax suggests will require close collaboration between different types of researchers. The first step is for syntacticians to identify the structure-building computations that are deployed at each point in constructions from human syntax. From these analyses, syntacticians could identify two types of interesting cases. The first interesting case would be constructions that predict the same type of structure-building computation at the same location in all theories (e.g., Minimalism predicts merge at the same location in the construction as TAG predicts substitution). These areas of convergence may be fruitful places to begin the search for neuronal computations. A second interesting case would be constructions that require diverging computations across theories (e.g., Minimalism predicts merge but TAG predicts adjunction). If these analyses could be identified across a large number of constructions, then it should be possible to construct a type of comparison/subtractive logic that could uncover neuronal correlates of these computations. It seems to us that phenomena that vary along the major dimensions of human syntax, such as nonadjunction versus adjunction structures, or local versus nonlocal dependencies, will be most likely to lead to these types of convergences and divergences. But over the long term, every phenomenon of syntax should be investigated (to the extent possible given some of the challenges discussed in Section 14.6).

The second step is for syntacticians and theoretical neuroscientists to figure out how neurons deploy the structure-building computations that underlie the phenomenon in each theory. In practice, this step might require several substeps. For example, the typical form of syntactic theories is "bottom-up": the most deeply embedded constituents are constructed first, followed by the next most deeply embedded, and so on. This is largely the reverse order from sentence comprehension and production. Given that the empirical studies required by later steps will be based on comprehension (and perhaps production), it may be necessary to convert the "bottom-up" computations of syntactic theories into the "left-to-right" or "top-down" computations of parsing theories. There exist several computational models for how to relate bottom-up grammars with left-to-right parsers. This step will most likely involve collaboration among mathematical linguists (to rigorously formalize the syntactic computations; Collins & Stabler, 2011; Stabler, 1997), mathematical psycholinguists (to convert those computations into parsing computations; e.g., Berwick & Weinberg, 1984; Marcus, 1980; Stabler, 2011, 2013; and for issues beyond structure-building: Hale, 2003; Kobele, Gerth, & Hale, 2013), and neuroscientists (to identify candidate neurocomputational systems). Although this sounds straightforward, it is likely that the space of possible computations will expand at each step, from syntactic computations to mathematically formalized computations, from formalized computations to parsing computations, and from parsing computations to neuronal computations. It is quite possible that this step will result in hypothesis spaces for the possible neuronal computations for each syntactic theory relevant to each phenomenon.

Once the structure-building computations have been translated into potential neuronal computations (or hypothesis spaces of potential neuronal computations), the final step is to look for evidence of those computations in neural systems. Again, although we state this as a single step in principle, we assume that it will be a multifaceted process in practice, drawing on neuroscientists of all types: electrophysiologists (EEG/MEG), neuroimagers (fMRI), and even neurosurgeons (ECoG). As syntacticians, this step is the furthest beyond our area of expertise, but we could imagine a process like the following. First, (extracranial) electrophysiological work (either EEG or MEG) could be used to identify the gross neuronal signatures in either the amplitude domain (ERP/ERF) or frequency domain (oscillations) that occur at the critical regions in the sentences of interest. Depending on the similarities and differences predicted by the different syntactic theories, and the different classes of neuronal populations that follow from the formalization of those theories in the previous step, the neuronal signatures (ERPs/ERFs or oscillations) may be useful in eliminating competing computations from consideration. Recently, there have been exciting examples of work of this type in both syntax and semantics research. For example, Pylkkänen and colleagues have been searching for neurological correlates of fundamental semantic combinatory processes in the time–amplitude domain using MEG, with results pointing to increased activity in the left anterior temporal lobe (LATL) and ventromedial prefrontal cortex (vmPFC) (e.g., Bemis & Pylkkänen, 2011; Brennan & Pylkkänen, 2008; and many others). As another example, Bastiaansen and colleagues have been searching for neurological correlates of both syntactic and semantic combinatory processes in the time–frequency domain using EEG, with results pointing to the gamma frequency band (>30 Hz) for semantic processes and the lower beta frequency band (13–18 Hz) for syntactic processes (e.g., Bastiaansen, Magyari, & Hagoort, 2010; Bastiaansen, Van Berkum, & Hagoort, 2002; for a review see Bastiaansen, Mazaheri, & Jensen, 2012). Once electrophysiological correlates have been identified, localization studies, either with MEG (if the

orientation of the generators is appropriate) or concurrent EEG and fMRI, could be used to identify cortical areas associated with the neuronal activity of interest. There is a large and ever-growing literature on localization in language processing, and we are sure the other chapters in this volume provide more enlightening reviews of that literature. However, we would like to point to Pallier, Devauchelle, and Dehaene (2011) as an example of localization work that shares the same spirit as the program advocated here. Pallier et al. searched for brain areas that respond to the size of the syntactic constituent being processed (from 1 to 12 words), in essence using the number of syntactic computations deployed as a measure of complexity, and finding activity in a number of regions, including the left inferior frontal gyrus (LIFG), left anterior superior temporal sulcus (LaSTS), and left posterior superior temporal sulcus (LpSTS). Finally, when suitable location and electrophysiological hypotheses are established, intracranial recordings (ECoG) could be used to identify the single-unit information necessary to begin to identify the specific neuronal computation and observe its implementation.

We admit that the brief sketch of the collaboration suggested by the computational view is based on our incomplete understanding of the various fields that would be part of the collaboration. We also admit that the space of possible neuronal computations is likely much larger than the space of extant structure-building operations, making the search for the actual neuronal computations that much more difficult. But it seems to us that the size of the hypothesis space is irrelevant to the question of how to move the fields of syntax and neuroscience forward (and together). This is either the right hypothesis space to be searching or it is not. It seems to us that multiple domains of cognition are converging on both the need for identifying neuronal computations and the plausibility of conducting such a search in the 21st century (Carandini, 2012; Poeppel, Emmorey, Hickok, & Pylkkänen, 2012). We believe that the wider field of syntax is ready to join the search that researchers such as Bastiaansen, Dehaene, Pallier, Pylkkänen, and colleagues have begun.

14.6 CHALLENGES TO THIS RESEARCH PROGRAM

Beyond the obvious challenge of engaging in the interdisciplinary work presented in Section 14.5, there are numerous smaller challenges that need to be addressed for the collaboration to be successful. In this section we discuss five, in some cases in an attempt to dispel the challenge and in others simply to raise the issue for future work.

One obvious challenge is the concern from some cognitive scientists that syntactic theories are not built on solid empirical foundations. This concern has been expressed since the earliest days of syntactic theorizing (Hill, 1961), and recently with several high-profile publications (Gibson & Fedorenko, 2010, 2013). This concern is driven by the idea that the typical data collection methods are too informal to provide reliable data; therefore, the theories built on those data are themselves unreliable. The persistence of this concern speaks to a fundamental failure on the part of syntacticians to make the argument either that the data type that they are collecting (acceptability judgments) is robust enough that the informality of the collection methods has no impact, or that there are unreported safeguards in the informal methods to prevent the kind of unreliability that they are concerned about (see Marantz, 2005; Phillips, 2009 for discussions of these issues). Sprouse and Almeida (2012) and Sprouse, Schütze, and Almeida (2013) have begun to address this concern directly by exhaustively retesting all of the phenomena in a popular Minimalist textbook using formal methods, and by retesting a large random sample of phenomena from a popular syntax journal using formal methods. These retests have replicated 98% and 95% of the phenomena, respectively, suggesting that the informal methods used in syntax have the reliability that syntacticians claim. Given recent concerns about replicability inside of some areas of psychology, it is heartening to see that large-scale replications inside of syntax yield potential error rates at or below the conventional type I error rate of 5%.

Despite the substantial evidence that the acceptability judgments that form the basis of syntactic theory are reliable, one could imagine potential collaborators being concerned that a theory built on offline data (like acceptability judgments) would be irrelevant for a theory built on real-time language processing data (like the electrophysiological data required by the research program proposed here). We agree that this could be a reasonable concern a priori. However, there is also a growing body of research in the sentence processing literature demonstrating that real-time sentence processing behavior respects grammatical conditions on well-formedness. For example, several studies have shown that complex constraints on the formation of nonlocal dependencies (called island constraints in the syntax literature) are respected by the parsing mechanisms that form these dependencies in real time (Stowe, 1986; Traxler & Pickering, 1996). In addition, several studies have demonstrated that these same processing mechanisms respect the sophisticated exceptions to these constraints postulated by syntactic theories


(Phillips, 2006; Wagers & Phillips, 2009). Similarly, several studies have demonstrated that complex constraints on the dependencies that give pronouns their referents (called binding constraints in the syntax literature) are also respected by real-time referential processing mechanisms (Kazanina, Lau, Lieberman, Yoshida, & Phillips, 2007; Sturt, 2003; Van Gompel & Liversedge, 2003). Several recent studies also show these effects to be the result of grammatical constraints and not the consequences of nongrammatical processing mechanisms (Dillon & Hornstein, 2013; Kush, Omaki, & Hornstein, 2013; Sprouse, Wagers, & Phillips, 2012; Yoshida, Kazanina, Pablos, & Sturt, 2014). In sum, there is a growing body of convincing evidence that syntactic theories capture structure-building properties that are relevant for real-time sentence processing, despite having initially been empirically based on offline data.

A third potential challenge for the computational view of syntax is that not every syntactician agrees that syntactic theories should serve as a theory of cognitive structure-building computations. The potential for a logical distinction between theories of syntax and theories of cognitive structure-building is clearest in examples of nonmentalistic, or Platonic, linguistic theories, which seek to study the mathematical properties of language without making any claims about how those properties are instantiated in a brain. Even within GS, which is mentalistic, it is not uncommon to hear theories of syntax described as theories of knowledge (or competence) and not theories of use (or performance). The computational view of syntax goes beyond simple knowledge description. The computational view sees syntactic theories as making substantive claims about how syntactic structure building is instantiated in the human brain. It may be the case that there is a one-to-many relationship between syntactic theories and neuronal structure-building computations, but the relationship is there (see Lewis & Phillips, 2014 for a deeper discussion of this challenge).

A final challenge to the computational view of syntax is the problem of isolating structure-building computations from other sentence processing computations in real-time processing data. Real-time language processing data are going to contain signals from both structure-building computations and all of the nonstructure-building computations that syntactic theory abstracts away from (parsing strategies, resource allocation, task-specific strategies in the sense of Rogalsky & Hickok, 2011, etc.). This means that the actual construction of the neurophysiological experiments discussed in Section 14.5 will require quite a bit of ingenuity to isolate the structure-building computations, especially given the high dimensionality of neural data and the likelihood of spurious correlations. And even assuming that logically isolating a computation of interest is possible in the experimental stimuli, physically isolating a neuronal computation in human neural systems is probably orders of magnitude more difficult. To our knowledge, there are no existing neuronal computations that can be used as a guide (a Rosetta stone of sorts) to mark the beginning or end of a computation being physically performed. We assume that as more and more computations are investigated, combining them in novel ways will eventually allow the physical boundaries of computations to be mapped, but this is currently a promissory note. In summary, the narrow focus of syntactic theories on structure-building computations is in some ways positive, because it provides a hypothesis space for a problem that is potentially tractable, but it is also negative, because the computations left out of that hypothesis space may be either confounds or necessary additions to solve the physical localization problem.

14.7 CONCLUSION

We believe that modern syntactic theory is well-suited to serve as a cognitive theory of syntactic structure-building computations, and that the time is right for a large-scale collaboration between syntacticians, mathematical linguists and psycholinguists, and theoretical and experimental neuroscientists to identify the neuronal instantiations of those computations. Such a research program will be a collaborative project of unprecedented scope and will face numerous theoretical and technological challenges, but in the histories of cognitive science, linguistics, and neuroscience, there has never been a better time to try.

References

Bastiaansen, M. C. M., Magyari, L., & Hagoort, P. (2010). Syntactic unification operations are reflected in oscillatory dynamics during online sentence comprehension. Journal of Cognitive Neuroscience, 22, 1333–1347.

Bastiaansen, M. C. M., Mazaheri, A., & Jensen, O. (2012). Beyond ERPs: Oscillatory neuronal dynamics. In S. J. Luck, & E. S. Kappenman (Eds.), The Oxford handbook of event-related potential components (pp. 31–50). New York, NY: Oxford University Press.

Bastiaansen, M. C. M., Van Berkum, J. J. A., & Hagoort, P. (2002). Syntactic processing modulates the θ rhythm of the human EEG. NeuroImage, 17, 1479–1492.

Bemis, D. K., & Pylkkänen, L. (2011). Simple composition: An MEG investigation into the comprehension of minimal linguistic phrases. Journal of Neuroscience, 31(8), 2801–2814.

Berwick, R. C., & Weinberg, A. S. (1984). The grammatical basis of linguistic performance. Cambridge, MA: MIT Press.

Brennan, J., & Pylkkänen, L. (2008). Processing events: Behavioral and neuromagnetic correlates of aspectual coercion. Brain and Language, 106, 132–143.


Carandini, M. (2012). From circuits to behavior: A bridge too far? Nature Neuroscience, 15(4), 507–509.

Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 51–62.

Chomsky, N. (2004). Beyond explanatory adequacy. In A. Belletti (Ed.), The cartography of syntactic structure vol. 3: Structures and beyond (pp. 104–131). Oxford: Oxford University Press.

Collins, C., & Stabler, E. (2011). A formalization of minimalist syntax. <http://ling.auf.net/lingbuzz/001691>.

den Dikken, M. (Ed.) (2013). The Cambridge handbook of generative syntax. Cambridge, UK: Cambridge University Press.

Dillon, B., & Hornstein, N. (2013). On the structural nature of island constraints. In J. Sprouse, & N. Hornstein (Eds.), Experimental syntax and island effects (pp. 208–222). Cambridge, UK: Cambridge University Press.

Fodor, J., Bever, T., & Garrett, M. (1974). The psychology of language. New York, NY: McGraw Hill.

Frank, R. (2002). Phrase structure composition and syntactic dependencies. Cambridge, MA: MIT Press.

Frank, R. (2013). Tree adjoining grammar. In M. den Dikken (Ed.), The Cambridge handbook of generative syntax (pp. 226–261). Cambridge, UK: Cambridge University Press.

Gibson, E., & Fedorenko, E. (2010). Weak quantitative standards in linguistics research. Trends in Cognitive Sciences, 14, 233–234.

Gibson, E., & Fedorenko, E. (2013). The need for quantitative methods in syntax and semantics research. Language and Cognitive Processes, 28, 88–124.

Hale, J. (2003). Grammar, uncertainty, and sentence processing. Baltimore, MD: Johns Hopkins University.

Hill, A. A. (1961). Grammaticality. Word, 17, 1–10.

Hornstein, N. (2009). A theory of syntax: Minimal operations and universal grammar. Cambridge, UK: Cambridge University Press.

Joshi, A. K., Vijay-Shanker, K., & Weir, D. J. (1991). The convergence of mildly context-sensitive grammar formalisms. In P. Sells, S. Shieber, & T. Wasow (Eds.), Foundational issues in natural language processing (pp. 31–81). Cambridge, MA: MIT Press.

Kazanina, N., Lau, E. F., Lieberman, M., Yoshida, M., & Phillips, C. (2007). The effect of syntactic constraints on the processing of backwards anaphora. Journal of Memory and Language, 56, 384–409.

Kobele, G. M., Gerth, S., & Hale, J. T. (2013). Memory resource allocation in top-down minimalist parsing. In G. Morrill, & M.-J. Nederhof (Eds.), FG 2012/2013, volume 8036 of Lecture Notes in Computer Science (pp. 32–51). Heidelberg: Springer.

Kush, D., Omaki, A., & Hornstein, N. (2013). Microvariation in islands? In J. Sprouse, & N. Hornstein (Eds.), Experimental syntax and island effects (pp. 239–264). Cambridge, UK: Cambridge University Press.

Lewis, S., & Phillips, C. (2014). Aligning grammatical theories and language processing models. Journal of Psycholinguistic Research. doi:10.1007/s10936-014-9329-z

Marantz, A. (2005). Generative linguistics within the cognitive neuroscience of language. The Linguistic Review, 22, 429–445.

Marcus, M. (1980). A theory of syntactic recognition for natural language. Cambridge, MA: MIT Press.

Michaelis, J. (1998). Derivational minimalism is mildly context sensitive. In M. Moortgat (Ed.), Logical aspects of computational linguistics, Lecture Notes in Artificial Intelligence volume 1 (pp. 179–198). Heidelberg: Springer.

Pallier, C., Devauchelle, A.-D., & Dehaene, S. (2011). Cortical representation of the constituent structure of sentences. Proceedings of the National Academy of Sciences, 108(6), 2522–2527.

Pearl, L., & Sprouse, J. (2013). Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23–68.

Phillips, C. (1996). Order and structure. Cambridge, MA: Massachusetts Institute of Technology.

Phillips, C. (2006). The real-time status of island phenomena. Language, 82, 795–823.

Phillips, C. (2009). Should we impeach armchair linguists? In S. Iwasaki, H. Hoji, P. Clancy, & S.-O. Sohn (Eds.), Proceedings from Japanese/Korean linguistics 17. Stanford, CA: CSLI Publications.

Poeppel, D. (2012). The maps problem and the mapping problem: Two challenges for a cognitive neuroscience of speech and language. Cognitive Neuropsychology, 29, 34–55.

Poeppel, D., & Embick, D. (2005). The relation between linguistics and neuroscience. In A. Cutler (Ed.), Twenty-first century psycholinguistics: Four cornerstones (pp. 103–118). Mahwah, NJ: Lawrence Erlbaum Associates.

Poeppel, D., Emmorey, K., Hickok, G., & Pylkkänen, L. (2012). Towards a new neurobiology of language. Journal of Neuroscience, 32(41), 14125–14131.

Rogalsky, C., & Hickok, G. (2011). The role of Broca's area in sentence comprehension. Journal of Cognitive Neuroscience, 23, 1664–1680.

Sprouse, J., & Almeida, D. (2012). Assessing the reliability of textbook data in syntax: Adger's Core Syntax. Journal of Linguistics, 48, 609–652.

Sprouse, J., Schütze, C. T., & Almeida, D. (2013). A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua, 134, 219–248.

Sprouse, J., Wagers, M., & Phillips, C. (2012). A test of the relation between working memory capacity and syntactic island effects. Language, 88(1), 82–123.

Stabler, E. (1997). Derivational minimalism. In C. Retoré (Ed.), Logical aspects of computational linguistics (pp. 68–95). New York, NY: Springer.

Stabler, E. (2011). Top-down recognizers for MCFGs and MGs. In Proceedings of the second workshop on cognitive modeling and computational linguistics (CMCL '11) (pp. 39–48). Stroudsburg, PA: Association for Computational Linguistics.

Stabler, E. (2013). Two models of minimalist, incremental syntactic analysis. Topics in Cognitive Science, 5(3), 611–633.

Stowe, L. A. (1986). Evidence for on-line gap-location. Language and Cognitive Processes, 1, 227–245.

Sturt, P. (2003). The time-course of the application of binding constraints in reference resolution. Journal of Memory and Language, 48, 542–562.

Traxler, M. J., & Pickering, M. J. (1996). Plausibility and the processing of unbounded dependencies: An eye-tracking study. Journal of Memory and Language, 35, 454–475.

Van Gompel, R. P. G., & Liversedge, S. P. (2003). The influence of morphological information on cataphoric assignment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 128–139.

Wagers, M., & Phillips, C. (2009). Multiple dependencies and the role of the grammar in real-time comprehension. Journal of Linguistics, 45, 395–433.

Yoshida, M., Kazanina, N., Pablos, L., & Sturt, P. (2014). On the origin of islands. Language and Cognitive Processes, 29, 761–770.

C. BEHAVIORAL FOUNDATIONS

CHAPTER 15

Speech Perception as a Perceptuo-Motor Skill

Carol A. Fowler
Department of Psychology, University of Connecticut, Storrs, CT, USA

15.1 INTRODUCTION

Among theories of phonetic perception, there are "general auditory approaches" (Diehl, Lotto, & Holt, 2004) that contrast with "gesture theories" (Fowler, 1986; Liberman & Mattingly, 1985). These approaches are contrasted in many publications (Diehl et al., 2004; Fowler & Iskarous, 2013). The present chapter focuses on gesture theories, in which the integrality of speech production and perception is central to the accounts. The best known of the gesture approaches to speech perception, the motor theory (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985), claims that speech perceivers perceive linguistically significant ("phonetic") gestures of the vocal tract as immediate perceptual objects (rather than auditory transforms of the acoustic speech signal). Another claim is that perceiving speech necessarily involves speech motor system recruitment. A final claim is that, in these respects, speech perception is special, particularly in relation to other auditorily perceived events in which auditory transforms of acoustic signals are perceptual objects. The present chapter suggests that the first two claims are accurate (except possibly the necessity of motor recruitment). However, the third claim is not. Perception of distal events (gestures, for speech) is generally what perceptual systems achieve (Gibson, 1966), and motor system recruitment is widespread in perception and cognition.

An alternative gesture theory, direct realism (Fowler, 1986, 1996), agrees that listeners to speech perceive the distal events of speaking, phonetic gestures, but disagrees that, in regard to perceiving distal events rather than proximal (e.g., acoustic) stimulation, speech perception is special (see Carello, Wagman, & Turvey, 2005 for a review of "ecological acoustics"; see also Rosenblum, 2008). In direct realism, recruitment of the motor system is not expected for speech perception, because information for gestures is available in the acoustic speech signal. Required or not, however, evidence shows that recruitment is widespread.

In both theories, speech perception is a perceptuo-motor skill. Understanding why it is and why that does not make it special requires embedding its study in a larger context of investigations of the ecology of perceiving and acting. Therefore, the context for the literature reviewed in this chapter is not that of processing in the brain. Rather, it is about language users in their world, in which perceiving and acting are inextricably intertwined. Presumably, the brains of language users, like the rest of language users, will be adapted to such a world. Therefore, findings of speech motor system activation in the brain during ordinary speech perception perhaps should be unsurprising. For an alternative perspective, see Lotto and Holt (Chapter 16).

15.1.1 Perception and Action Are Inextricably Integrated

In the econiche, animals' activities are necessarily perceptually guided. For example, locomotion usually involves visual guidance so that walkers can move along a route toward a goal location while avoiding collision with obstacles (Warren, 2006). Sometimes, however (for example, when crossing a street at a curve so that oncoming traffic is not visible, or when crossing anywhere for those texting while walking), locomotion may involve detecting approaching vehicles by listening. It also involves maintaining postural balance with the help of vestibular systems and proprioceptive detection of the forces exerted by the walker on the support surface and on the walker by the support surface. Walking, like other activities, is a multimodal perceptuo-motor activity.

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00015-8 © 2016 Elsevier Inc. All rights reserved.

Sometimes animals' goals are more exploratory than performatory. For example, human sightseers may walk around (as it were) to enable their perceptual systems to intercept new sights or sounds or feels or smells or tastes. In that case, complementarily to performatory actions whereby perception serves action, action systems serve primarily perceptual aims. Either way, acting and perceiving both are perceptuo-motor skills.

Aside from being perceptuo-motor in nature, however, animals' actions and perceptions share something else that is crucial to their lives, namely the econiche. The world in which they act and the world in which they obtain perceptual information is the same world. Because survival depends on felicitous acting, it also depends on perceptual systems that accurately expose properties of the econiche relevant to their actions (the "affordances" of the econiche; Gibson, 1979). In short, perceptual systems as well as action systems have to be adapted to the animals' "way of life" (Gibson, 1994). There has to be a strong likelihood of a relation of parity between properties of the econiche implied by an animal's actions and properties of the econiche perceived in support of those actions.

Most analogous to language use in the world are coordinative social activities (e.g., moving a piece of furniture together, paddling a canoe, playing a duet). The foregoing characterizations are true of these activities as well, but now parity has an additional cross-person dimension. Participants have to perceive social affordances (Marsh, Richardson, Baron, & Schmidt, 2006), and their actions generally should be true to them. In addition, co-participants' perceptions should be shared, and co-participants should coordinate their actions in relation to those shared social affordances. Speech perception and production in the econiche are coordinative social activities.

15.1.2 Parity in Speech

Because speaking and listening are social activities, much of this discussion applies to them. But they are different from some nonlinguistic, nonsocial actions in a notable way. At one level of description, an aim of speaking is to cause patterning in the air. That is how speaking can get the attention of a listener and how it can get a chance to have its intended impact on him or her. This is different from the activity of locomoting, for example, which typically is done to get somewhere, not to cause patterning in light or air. However, it is not so different in that respect from performances meant to be seen or heard, such as ballet or competitive diving or musical performances.

Regarding perceiving and acting generally, Liberman and Mattingly (1989) and Liberman and Whalen (2000) have remarked that parity is central to speaking and perceiving speech. For Liberman and Whalen, parity in language use has three essential aspects. One relates to the observation that the same language or languages are involved in a language user's dual roles of talking and of listening to the speech of others. There must be a relation of parity (sometimes called a "common code"; Schütz-Bosbach & Prinz, 2007) between language forms produced and perceived by the same person. That is how perception of one's own speech can guide its production (e.g., Houde & Jordan, 1998).

A second component of parity in language relates to between-person language use. For language use to serve its communicative role in a between-person exchange, there has to be a relation of sufficient parity1 between forms uttered by a talker and forms intercepted by listeners. For many theorists (Pierrehumbert, 1990), the "common code" within a speaker-hearer and between them is "mental" and not physical. However, for speech perception not to be special, perceptual objects have to be physical. Only physical things can causally structure informational media such as light and air and therefore can have effects that can be perceived. That language forms are physical events does not prevent their being psychological in nature as well (Ryle, 1949). In the account of Liberman and Mattingly (1985) and others (Browman & Goldstein, 1986; Goldstein & Fowler, 2003), the smallest language forms are phonetic gestures of the vocal tract. They are physical actions that have linguistic and, hence, psychological significance.

The third component of parity for Liberman and Whalen (2000) is that brain systems for production and perception of language forms must have co-evolved, because each is specialized for the unique problems to which coarticulation gives rise, and neither specialized system would be useful without the other. This component is not addressed further here beyond commenting that coarticulation and its effects are not special to speech.

1The hedge "sufficient" is meant to forestall misunderstanding (Remez & Pardo, 2006). Talkers and listeners do not have to share their dialect, and listeners do not have to detect every phone, or even every word, produced by talkers for language to "work" in public use. However, sharing of forms has to be sufficient for the talker's message to get across in events involving talking. Relatedly, at a slower time scale, in a language community, language forms and structures serve as conventions (Millikan, 2003) that are conventional just because they are reproduced (with the same hedge: reproduced with sufficient fidelity to count as being reproduced) across community members. The capacity to reproduce perceived forms implies perception-production parity.


The foregoing discussion is meant to underscore that, in the econiche, life, including linguistic life, is perceptuo-motor in nature. Nothing discussed here logically requires that mechanisms involved in action must be incorporated in perceptual systems or vice versa. However, it would be surprising if they were not, and in both the linguistic and nonlinguistic domains, they appear to be. A brief review is provided within the domain of speech and then outside of it, showing that evidence for the perceptuo-motor nature of perception and cognition is quite general.

15.2 RESEARCH FINDINGS

15.2.1 Speech

Liberman and Mattingly (1985) identify their motor theory of speech perception as motor in two respects. First, in the theory, listeners perceive speech gestures. Second, to achieve perception of gestures, they recruit their own speech motor system.

The first claim does not, in fact, make the theory a motor theory. No one would identify a theory of visual perception as a motor theory if it made the (uncontroversial) claim that, when a person walks by in a perceiver's line of sight, the perceiver sees someone walking. Perceiving a motor event when a motor event occurs in the econiche does not require a theoretical account deserving the descriptor "motor." The first claim of motor theorists should imply that there is nothing special about speech perception in regard to perceptual objects (Rosenblum, 2008). Listeners perceive speech gestures as they hear people knocking on doors and see walkers walking, because perceivers intercept structure in media, such as structured air and light, that informs about what they need to perceive: the objects and events that compose the econiche.

Although the claim that speech listeners perceive phonetic gestures is controversial in the field of speech perception, within the context of the foregoing discussion of perception in the econiche, phonetic gestures are expected perceptual objects. Compatibly, there is converging evidence for gesture perception. A sampling is offered here.

Speech perception is a perceptuo-motor skill with respect to its objects of perception. In articles published in 1952 and 1954 (Liberman, Delattre, & Cooper, 1952; Liberman, Delattre, Cooper, & Gerstman, 1954), Liberman and colleagues reported their first findings that led them to develop a motor theory. Suggestions, such as that of Lotto, Hickok, and Holt (2009), that the claims of the motor theory were motivated by a theoretical issue, the problem of coarticulation, are mistaken. The theory was motivated by research findings, the first of which were published in 1952. The claim of motor system involvement was rationalized in terms of coarticulation later. The first (Liberman et al., 1952) was a finding that the same acoustic pattern (a stop burst centered at 1440 Hz) that had to have been produced by different consonantal gestures coarticulated with different vowels was heard as different consonants (the "/pi/-/ka/-/pu/" phenomenon). The second (Liberman et al., 1954) was a finding that different acoustic patterns (second formant transitions) that were produced by the same (alveolar) consonant constriction gesture of the vocal tract coarticulated with different vowel gestures were heard as the same consonants (the "/di/-/du/" phenomenon). In both cases, when listeners' perceptual tracking of acoustic "cues" could be dissociated from their tracking of gesture production, listeners were found to track gestures.2

These findings imply that listeners somehow "parse" (Fowler & Smith, 1986) the acoustic signal along gestural lines. That is, to perceive the same consonant, /d/, from different formant transitions for the syllables /di/ and /du/, they must extract acoustic information that supports perceptual separation of temporally overlapping consonantal and vocalic gestures. Many distinct findings support that parsing occurs (Fowler & Brown, 1997; Fowler & Smith, 1986; Pardo & Fowler, 1997; Silverman, 1986, 1987). Perceivers do the same kind of parsing in the visual domain (e.g., in perception of walking and other biological motions; Johansson, 1973; Runeson & Frykholm, 1981).

2Intuitively, these findings can be explained as pattern learning instead of gesture perception. A pattern learner can learn to classify the very different acoustic patterns for /di/ and /du/ into the same /d/ category while learning to classify the same stop bursts in /pi/ and /ka/ differently. Learning these classifications requires a systematic basis, of course, and the basis must be the articulatory sameness of the consonantal gestures in /di/ and /du/ and their differentness in /pi/ and /ka/. By this account, pattern learners acquire the acoustic-articulatory links when they hear the acoustic consequences of their own speech. No presumption that listeners perceive articulation is required. This, in fact, was the earliest motor theory (Liberman, 1957), with the proposed learning underlying "acquired similarity" of the /d/s in /di/ and /du/ and "acquired distinctiveness" of the stop bursts of /pi/ and /ka/. The account fails, however, as argued by early opponents of the motor theory, because individuals exist who perceive speech successfully without being able to produce what they perceive (MacNeilage, Rootes, & Chase, 1967). Information about articulation has to come from information in the acoustic signals of others' speech as well as one's own when that is possible.


One such line of investigation is that regarding compensation for coarticulation. This is a finding that listeners make speech judgments that reflect sensitivity to acoustic consequences of coarticulatory gestural overlap. The research line has a long and controversial history. In a seminal finding by Mann (1980), members of a /da/-/ga/ continuum differing only in the third formant (F3) onset frequency of the initial consonant (high for /da/, low for /ga/) were identified differently after a precursor /al/ (high ending F3) than after an /ar/ (low ending F3) syllable. Listeners identified more continuum syllables as "ga" after /al/ than after /ar/. As Mann (1980) explained, this can be interpreted as perceptual compensation for the acoustic consequences of gestural overlap, that is, the coarticulatory fronting/backing pulls that /l/ and /r/ would exert, respectively, on a following /da/ or /ga/ in natural speech production.

However, Mann (1980) also remarked that her findings can be interpreted in another way. They can be seen as evidence for spectral contrast rather than for listeners' perceptual parsing of coarticulatory gestural overlap. In the contrast account, frequencies in a context segment render the perceptual system temporarily insensitive to frequencies in neighboring speech. For example, a high F3 transition in /al/ makes frequencies in a following syllable that is ambiguous between /da/ (high F3) and /ga/ (low F3) sound lower and therefore more /ga/-like. A preceding /ar/, with a low F3, has the opposite contrastive effect, leading the syllable to sound more /da/-like. This mimics true perceptual parsing of the coarticulatory effects of /l/ and /r/ on /d/ and /g/. In support of this view, investigators have reported that nonspeech contexts can yield compensation-like perceptual judgments (Kingston et al., 2014; Lotto & Kluender, 1998). For example, in research by Lotto and Kluender (1998), high- and low-frequency tones that replaced /al/ and /ar/ syllables had qualitatively the same effect on /da/-/ga/ judgments as the context syllables had.

However, other findings oppose that account and favor an interpretation that listeners, in fact, track gestural overlap in speech. For example, compensation is achieved perceptually when contrast is ruled out because compensation is cross-modal (Mitterer, 2006). In this study, context syllables were distinguished only visually (in audiovisual presentations), whereas the continuum syllables were distinguished only acoustically. Because a visible speech gesture cannot be the source of spectral contrast on a following acoustic syllable, contrast is not a viable account of the perceptual compensation that occurred in this study. Contrast is also ruled out when the gestural overlap for which listeners compensate has simultaneous, rather than successive, acoustic consequences (Silverman, 1987), because contrast affects perceptual sensitivity of neighbors of the source of contrast, not of the source itself. Finally, it is ruled out when gestural overlap is both cross-modal and simultaneous (Fowler, 2006). Moreover, in the only two (difficult-to-find) instances in which predictions of gestural parsing and contrast accounts have been dissociated in speech stimuli (Johnson, 2011; Viswanathan, Magnuson, & Fowler, 2010), results supported gestural parsing, not contrast.

As noted, findings most supportive of the contrast account show that nonspeech contexts trigger compensation-like responses in speech (Lotto & Kluender, 1998). However, Viswanathan, Magnuson, and Fowler (2013) distinguished a contrast account from a masking account of nonspeech effects experimentally and found that the nonspeech contexts used in these studies induced energetic masking rather than contrast. Masking cannot explain the speech effects, however.

In a different line of investigation, perceivers are shown to integrate cross-modal information about speech gestures. A striking and seminal finding by McGurk and MacDonald (1976; MacDonald & McGurk, 1978) showed that appropriately selected pairings of acoustic consonant-vowel (CV) syllables and synchronously dubbed, visible mouthings of different CVs led listeners to report hearing a syllable that integrates information across the modalities. For example, acoustic /ma/ dubbed onto mouthed /da/ leads listeners to hear /na/, an integration of the visually specified place of articulation of the consonant with its acoustically specified nasal and voicing properties. A gestural account of the finding is that listeners integrate information about gestures that are specified cross-modally. That is, they experience an event of talking and integrate cross-modal information about that event. Although there can be other accounts of the finding that invoke past experience associating the sights and sounds of talking (Diehl & Kluender, 1989; Stephens & Holt, 2010), these accounts are challenged by findings of cross-modal integration, among others. For example, Gick and Derrick (2009) showed that puffs of air against the neck of listeners transformed their reports of acoustic /ba/ to /pa/ and of /da/ to /ta/. The puff of air is evidence of the aspiration or breathiness in production of voiceless (/p/, /t/) stops. The gestural account is that puffs of air, acoustic signals, reflected light, and more (Fowler & Dekle, 1991) have impacts on perceivers in ways that specify their ecological source.

This is a subset of research that provides converging evidence for gesture perception in speech. This work is consistent with the framework in which perceivers must perceive the econiche as it is specified multimodally (Stoffregen & Bardy, 2001) for perception to support action and for action to support life.


However, nothing in this review indicates that perception of speech gestures reflects recruitment of the motor system as the motor theory of Liberman and colleagues proposes (Liberman et al., 1967; Liberman & Mattingly, 1985; see Scott, McGettigan, & Eisner, 2009). Listeners perceive speech gestures because acoustic signals, having been lawfully and distinctively structured by speech gestures, specify the gestures (Fowler, 1986, 1996),3 just as reflected light that has been lawfully and distinctively structured by the act of walking specifies walking. Even so, there is considerable evidence that the motor system is active in effective ways during phonetic perception. The following review is meant to show only that there are effective perceptuo-motor links in speech perception. It does not show that motor involvement is required to extract phonetic (gestural) primitives from speech signals.

Speech perception is a perceptuo-motor skill with respect to the mechanisms that support it. The following review is restricted to behavioral evidence and is only illustrative. Moreover, summaries of evidence of brain activation patterns that support speech perception, and evidence complementary to that provided here showing that perceptual information changes speech production (Houde & Jordan, 1998; Tremblay, Shiller, & Ostry, 2003), are omitted.

A direct connection between the speech perception and action systems is reported by Yuen, Davis, Brysbaert, and Rastle (2010). These investigators collected electropalatographic data as talkers produced syllables starting with /k/ (produced with a constriction gesture of the tongue dorsum against the velum) or /s/ (produced with a narrow constriction between the tongue tip and the alveolar region of the palate). While articulating either of these syllables, the talkers heard either congruent syllables or /t/-initial syllables (/t/, like /s/, is an alveolar consonant; however, because it is a stop, not a fricative like /s/, there is more alveolar contact for /t/). The remarkable finding was that the heard syllable left traces in the production of /k/- and /s/-initial syllables in the form of increased alveolo-palatal contact by the tongue when the distractor was /t/-initial. The effect was absent when distractor syllables were presented in print form. This finding is important because acoustically presented syllables were distractors to which participants did not explicitly respond. The explicit task involved talking, not listening; even so, listening had an impact on articulation, presumably because, in this experimental set-up, listening automatically and unintentionally involves motor system activation.

The findings are consistent with those of Fadiga and colleagues (Fadiga, Craighero, Buccino, & Rizzolatti, 2002). They found that transcranial magnetic stimulation (TMS) of the tongue region of the speech motor system of the brain leads to more activation of tongue muscles when words or nonwords being perceived include lingual as compared with labial intervocalic consonants. That is, motor activation that occurred during perception of speech was specific to the gestural properties of the words or nonwords to which the listener was exposed.

Compatibly, D'Ausilio and colleagues (D'Ausilio et al., 2009) applied TMS either to the tongue or to the lip region of the motor system of the brain as listeners identified stop-initial syllables in noise. Response times were faster and accuracy was higher for identifying lingual consonants (/d/, /t/) than labial consonants (/b/, /p/) when TMS was applied to the lingual region. The pattern reversed when stimulation was applied to the labial region (see Hickok [2009, 2014] for an alternative interpretation).

Finally, Ostry and colleagues (Ito, Tiede, & Ostry, 2009; Nasir & Ostry, 2009) have shown that changes in the way that talkers produce a vowel also lead to changes in the way that they perceive it. A striking finding in that regard is reported by Nasir and Ostry (2009). In their study, talkers produced monosyllables including the vowel /æ/. Talkers' jaws were perturbed in the direction of protrusion as they produced the monosyllables. This perturbation did not have measurable or audible acoustic consequences, but, despite that, most participants compensated for it. That is, their jaw trajectories after compensation were closer to the preperturbation path than they were before compensation. Before and after the perturbation experience, participants classified vowels perceptually along a head-to-had acoustic continuum (where /æ/ is the vowel in had). They showed a boundary shift after compensation, identifying fewer vowels as /æ/ than they did before compensation to the jaw perturbations. No shifts in identification were found for control participants who followed the same protocol except that perturbations were never applied during the perturbation phase of the experiment. In addition, among participants in the experimental group, the size of the compensation to perturbation was correlated with the size of the perceptual shift.

3Claims that acoustic signals necessarily lack the required specificity for gestural recovery (Diehl, Lotto, & Holt, 2004, p. 172: "[T]he inverse problem appears to be intractable") are overstated. Most approaches to the inversion problem make simplifying assumptions about the vocal tract that are false and consequential (Iskarous, 2010; Iskarous, Fowler, & Whalen, 2010). Moreover, approaches are characteristically designed to recover individual vocal tract states, not gestures, and so cannot take advantage of constraints provided by the tract's history, an analog of the retinal image fallacy in visual perception (Gibson, 1966).


Therefore, clearly, the perceptual shift was tied to the adaptive learning. As talkers changed the way they produced a vowel (in response to perturbations that had no measurable or audible acoustic consequences), they also changed how they extracted acoustic information about that vowel in perception.

As noted, there have been proposals that motor system activation during speech perception may only occur under special circumstances, such as when the signal is noisy or only when special kinds of tasks are being performed (Osnes, Hugdahl, & Specht, 2011). However, the foregoing review suggests that motor activation occurs whether or not the signal is noisy or distorted and in a variety of tasks. Stimuli are sometimes meaningless syllables and sometimes meaningful words; they are sometimes presented in the clear and sometimes in noise. Moreover, the review that follows should lead skeptics to question whether motor activation during speech perception should be limited only to special circumstances. The review shows that motor system recruitment is widespread in other domains of perception and elsewhere in the realm of language. Its pervasiveness likely reflects humans' adaptation to the fundamentally perceptuo-motor nature of life in the econiche.

15.2.2 Nonspeech

15.2.2.1 Nonlanguage

As noted, evidence that perceivers perceive a motor event when one occurs in their vicinity is not in itself evidence for a motor theory. Even so, there is evidence for motor activation during visual perception of walking in research by Takahashi, Kamibayashi, Nakajima, Akai, and Nakazawa (2008). They applied TMS to the motor systems of observers' brains to potentiate muscles of their legs as the observers watched actors either walking or standing on a treadmill. Activity in the observers' leg muscles was measured. Findings were analogous to those of Fadiga et al. (2002) for speech perception. Greater muscle activity occurred in muscles of the leg as observers watched walking as contrasted with standing. As it does during speech perception, muscle activation during perception of biological motion occurs that is specific to the event being perceived.

Motor system activation consequent on observing action is not restricted to visual observation. Activation also occurs as listeners hear sounds (in a study by Caetano, Jousmäki, & Hari, 2007, the sound of a drum membrane being tapped) that is comparable in some, but not all, respects with the activation that occurs when listeners produce the same sounds themselves or see someone else producing them with or without sound.

A finding that is conceptually analogous to that of D'Ausilio et al. (2009) but in the visual domain has been reported as well. D'Ausilio et al. (2009) showed that potentiation of lip or tongue muscles facilitated perception of labial or lingual consonants, respectively, in noise. Blaesi and Wilson (2010) had observers classify facial expressions that had been morphed along a continuum between a smile and a frown. In half of the trials, the observers clenched a pen lengthwise in their mouth to enforce an expression similar to a smile without (directly) evoking the associated emotion. Findings were that "smile" judgments increased in those trials as compared with trials in which the observers' facial expression was not manipulated. Findings similar to those reported by Blaesi and Wilson (2010) abound in the literature. Barsalou (2008) provides a summary of some of them in the domain of "embodied" or "grounded" cognition.

The studies reviewed so far reveal motor activation in perception. Research by Goldin-Meadow and colleagues (see Goldin-Meadow & Beilock, 2010, for a review) uncovered a role for motor recruitment in problem-solving and thought processes more generally. In their review they show that the manual gesturing that accompanies language use not only reflects thought but also can guide thought. Research with children acquiring Piagetian conservation (Ping & Goldin-Meadow, 2008) showed that children learned more from instruction involving both gestures and speech than from instruction involving speech only, and that the advantage accrued whether the containers of liquid used in the conservation problems were present or absent (so that gestures were not points to the critical properties of the objects).

In a study of adults, Beilock and Goldin-Meadow (2010) presented participants with variants of the tower of Hanoi problem.4 Participants solved the problem and then were videotaped describing how they had solved it. After that, they solved a variant of the same problem. An important finding was that gestures produced during the description phase that were appropriate to the initial solution of the problem but inappropriate for the solution of the variant were associated with poorer performance on the variant.

4. The tower of Hanoi tasks present solvers with four disks of different sizes and three pegs on which they may be placed. At the beginning, the four disks are stacked on the left-most peg in order of size, with the smallest disk on top. The solver's task is to shift the disks one at a time so that, eventually, they are stacked in the same size order on the last peg. A constraint is that a larger disk cannot sit above a smaller one.
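The footnote describes the puzzle procedurally; for readers who want its algorithmic content, the standard recursive solution can be sketched in a few lines. This is an illustrative sketch only — the function and peg names below are ours and are not part of Beilock and Goldin-Meadow's materials:

```python
def hanoi(n, source, spare, target, moves):
    """Move the top n disks from `source` to `target`, using `spare` as the
    auxiliary peg; each move is recorded as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    hanoi(n - 1, source, target, spare, moves)  # clear the n-1 smaller disks out of the way
    moves.append((n, source, target))           # move the largest remaining disk
    hanoi(n - 1, spare, source, target, moves)  # restack the smaller disks on top of it

moves = []
hanoi(4, "left", "middle", "right", moves)      # the four-disk version described in the footnote
print(len(moves))  # 2**4 - 1 = 15 moves
```

Any legal solution of the four-disk task requires at least 2**4 - 1 = 15 moves; this exponential minimum is part of what makes the puzzle a convenient laboratory problem, since small variants change the required actions without changing the abstract solution.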


15.2.2.2 Language, Not Speech

This review shows that motor activation occurs and is effective in perception and cognition outside of language. It is not special to speech perception. One conclusion from this is that motor activation is pervasive in perception and cognition. A second is that the view of Liberman and colleagues (Liberman et al., 1967) that motor recruitment in speech perception solves a problem that is special to speech is not particularly suggested by findings of motor activation in that domain. An explanation that is more likely to be valid for the occurrence of motor activation in speech perception will be one that is shared with explanations for motor activation elsewhere in perception and cognition.

Motor activation within the domain of language is not special to speech perception either. It occurs in word recognition and in language understanding more generally. The generality of motor activation to larger chunks of language than consonants and vowels reflects the fact that language use is a perceptuo-motor activity. It is fundamentally a between-person activity in the world; as such, it is inherently and pervasively perceptuo-motor in its nature (Fowler, 2013). Some examples of findings are presented.

Regarding word recognition, Pulvermüller and Fadiga (2010) reviewed evidence that words with action-related meanings (grasp or kick) activate the associated part of the motor system (Hauk, Johnsrude, & Pulvermüller, 2004) and do so with a sufficiently short latency that the activation is likely integral to word understanding, not consequent of it (Pulvermüller, Shtyrov, & Ilmoniemi, 2005). Compatibly, a TMS study (Pulvermüller, Hauk, Nikulin, & Ilmoniemi, 2005) showed that stimulation of the arm-hand motor region of the left hemispheres of right-handed participants facilitated lexical decisions (made as lip-movement responses) to words with arm-hand-related meanings compared with words with leg-related meanings, whereas stimulation of leg motor regions had a complementary effect on lexical decision times.

de Zubicaray, Arciuli, and McMahon (2013) challenged these findings, in part, by showing that localizations of motor activations in response to linguistic stimuli in the literature are questionable. They also show that some findings of motor activation to words (and nonwords) reflect sensitivity, not to the words' content but rather to their orthographic and phonological properties that statistically distinguish words by syntactic class. They show that nonwords having the statistical properties of verbs activate the motor system despite being mostly meaningless.

Despite these findings, there is clear behavioral evidence for motor activation specific to actions implied by sentence meanings (see Taylor & Zwaan, 2009, for a review). These findings are not subject to concerns about where in the brain motor activation occurs, because they show specificity in the motor actions themselves that are primed by linguistic meanings. Moreover, most are not subject to reinterpretation in terms of the orthographic or phonological properties of the stimuli. For example, Glenberg and Kaschak (2002) presented listeners with sentences such as Andy delivered the pizza to you or You delivered the pizza to Andy (two sentences with identical orthographic and phonological properties but describing different actions). Participants made a speeded response whether each sentence made sense. For one participant group, the yes response was a motion toward the body from a home button, whereas the no response was a motion in the opposite direction from the same home button. In a second group of participants, the mapping was opposite. Findings were that latencies to respond yes to sentences like the first sentence were faster for participants whose responses were toward the body, the same direction as the pizza's motion. Latencies to respond yes to sentences like the second sentence were faster for participants whose responses were away from the body.

Compatibly, Zwaan and Taylor (2006) presented participants with sentences visually in sequential groups of one to three words (separated here by slashes):

To quench/his/thirst/the/marathon/runner/eagerly/opened/the/water bottle.

Participants turned a knob to see each new word or word sequence. In one block of trials, they turned the knob counterclockwise; in another block, they turned it in the opposite direction. Half of the critical sentences described a clockwise motion; half (as in the example) described a counterclockwise motion. Findings were that reading times for the critical region of the sentence (opened in the example) were faster when the direction of the knob turn matched the rotation direction implied by the sentence.

15.3 CONCLUSION

Set in the context of the many recent research findings showing motor system activation and effective involvement in perception, cognition, and language generally, the previously highly controversial claim of Liberman's motor theory (Liberman et al., 1967; Liberman & Mattingly, 1985) that there is motor system recruitment in speech perception appears quite plausible, even mundane. Even so, the associated claim of motor theorists that speech motor system recruitment evolved to solve a perceptual problem that is special to speech recedes in plausibility. Whether acoustic speech

signals present an especially difficult obstacle to perception because of coarticulation (and, most likely it does not; Fowler, 1986, 1996; Fowler & Iskarous, 2013), motor recruitment occurs pervasively in instances in which this obstacle, if it is one, is absent.

The present review suggests that motor recruitment occurs generally in perception and cognition, including in language perception and comprehension. This is likely because life in the econiche is pervasively perceptuo-motor in nature, and animals, including humans, are adapted to that kind of life. Perception generally incorporates exploratory activity as an essential part and performatory actions are perceptually guided. Moreover, for activities of either sort to be felicitous requires both acting and perceiving to be true to the nature of the econiche. Actions have to be appropriate to the affordances of the econiche, and perception has to reveal the affordances. The econiche has to be shared (there must be a relation of parity) between perceiving and acting.

This kind of action-perception parity is required for interpersonal action in which participants in joint activities have to coordinate. Participants in joint actions have to perceive accurately their own participation in the action and their partner's; complementarily, their actions have to be true to the joint situation and their perception of it. This is no less true for language use than it is for activities such as dancing, paddling a canoe, or playing a duet (Clark, 1996). Perceivers of linguistic utterances produced by others have, in general, to perceive accurately what has been said by themselves and by interlocutors. There has to be a relation of (sufficient) parity between utterances produced and perceived on the parts of all participants in a linguistic interchange. In this case, the shared part of the econiche is the utterance composed at the level of language forms of appropriately sequenced linguistic actions of the vocal tract.

Where does this leave Liberman's motor theory? Although Liberman was by no means the first motor theorist, he should be recognized as among the earliest theorists to recognize the perceptuo-motor link in the domain of speech (Galantucci, Fowler, & Turvey, 2006). A task for motor theorists of speech perception now is to bring the theory into alignment with developments outside the domain of speech such as those reviewed here.

References

Barsalou, L. (2008). Grounded cognition. Annual Review of Psychology, 59, 617-645.
Beilock, S., & Goldin-Meadow, S. (2010). Gesture changes thought by grounding it in action. Psychological Science, 21, 1605-1610.
Blaesi, S., & Wilson, M. (2010). The mirror reflects both ways: Action influences perception of others. Brain and Cognition, 72, 306-309.
Browman, C., & Goldstein, L. (1986). Towards an articulatory phonology. Phonology Yearbook, 3, 219-252.
Caetano, G., Jousmäki, V., & Hari, R. (2007). Actor's and observer's primary motor cortices stabilize similarly after seen or heard motor actions. Proceedings of the National Academy of Sciences of the United States of America, 104, 9058-9062.
Carello, C., Wagman, J. B., & Turvey, M. T. (2005). Acoustic specification of object properties. In J. Anderson, & B. Anderson (Eds.), Moving image theory: Ecological considerations (pp. 79-104). Carbondale, IL: Southern Illinois Press.
Clark, H. (1996). Using language. Cambridge: Cambridge University Press.
D'Ausilio, A., Pulvermüller, F., Selmas, P., Bufalari, I., Begliomini, C., & Fadiga, L. (2009). The motor somatotopy of speech perception. Current Biology, 19, 381-385.
de Zubicaray, G., Arciuli, J., & McMahon, K. (2013). Putting an "end" to the motor cortex representations of action words. Journal of Cognitive Neuroscience, 25, 1957-1974. Available from: http://dx.doi.org/10.1162/jocn_a_00437.
Diehl, R., & Kluender, K. (1989). On the objects of speech perception. Ecological Psychology, 1, 121-144.
Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149-179.
Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15, 399-402.
Fowler, C. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.
Fowler, C. A. (1996). Listeners do hear sounds, not tongues. Journal of the Acoustical Society of America, 99, 1730-1741.
Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception and Psychophysics, 68, 161-177.
Fowler, C. A. (2013). An ecological alternative to a "sad response": Public language use transcends the boundaries of the skin. Behavioral and Brain Sciences, 36, 356-357.
Fowler, C. A., & Brown, J. (1997). Intrinsic f0 differences in spoken and sung vowels and their perception by listeners. Perception and Psychophysics, 59, 729-738.
Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Crossmodal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17, 816-828.
Fowler, C. A., & Iskarous, K. (2013). Speech perception and production. In A. F. Healey, R. W. Proctor, & I. B. Weiner (Eds.), Handbook of psychology, Vol. 4: Experimental psychology (2nd ed., pp. 236-263). Hoboken, NJ: John Wiley & Sons Inc.
Fowler, C. A., & Smith, M. (1986). Speech perception as "vector analysis": An approach to the problems of segmentation and invariance. In J. Perkell, & D. Klatt (Eds.), Invariance and variability of speech processes (pp. 123-136). Hillsdale, NJ: Lawrence Erlbaum Associates.
Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13, 361.
Gibson, E. J. (1994). Has psychology a future? Psychological Science, 5, 69-76.
Gibson, J. J. (1966). The senses considered as perceptual systems. Boston, MA: Houghton Mifflin.
Gibson, J. J. (1979). The ecological approach to visual perception. Boston, MA: Houghton Mifflin.


Gick, B., & Derrick, D. (2009). Aero-tactile integration in speech perception. Nature, 462, 502-504.
Glenberg, A. M., & Kaschak, M. P. (2002). Grounding language in action. Psychonomic Bulletin and Review, 9, 558-565.
Goldin-Meadow, S., & Beilock, S. (2010). Action's influence on thought: The case of gesture. Perspectives on Psychological Science, 5, 664-674.
Goldstein, L., & Fowler, C. A. (2003). Articulatory phonology: A phonology for public language use. In N. Schiller, & A. Meyer (Eds.), Phonetics and phonology in language comprehension and production: Differences and similarities (pp. 159-207). Berlin: Mouton de Gruyter.
Hauk, O., Johnsrude, I., & Pulvermüller, F. (2004). Somatotopic representation of action words in human motor and premotor cortex. Neuron, 41, 301-307.
Hickok, G. (2009). Speech perception does not rely on motor cortex. Available from: http://www.cell.com/current-biology/comments/S0960-9822(09)00556-9.
Hickok, G. (2014). The myth of mirror neurons: The real neuroscience of communication and cognition. New York, NY: Norton.
Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279, 1213-1216.
Iskarous, K. (2010). Vowel constrictions are recoverable from formants. Journal of Phonetics, 38, 375-387.
Iskarous, K., Fowler, C. A., & Whalen, D. H. (2010). Locus equations are an acoustic signature of articulatory synergy. Journal of the Acoustical Society of America, 128, 2021-2032.
Ito, T., Tiede, M., & Ostry, D. J. (2009). Somatosensory function in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 106, 1245-1248.
Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14, 201-211.
Johnson, K. (2011). Retroflex versus bunched [r] in compensation for coarticulation. UC Berkeley Phonology Lab Annual Report, 2011, 114-127.
Kingston, J., Kawahara, S., Chambless, D., Key, M., Mash, D., & Watsky, S. (2014). Context effects as auditory contrast. Attention, Perception, and Psychophysics, 76, 1437-1464. Available from: http://dx.doi.org/10.3758/s13414-013-0593-z.
Liberman, A. M. (1957). Some results of research on speech perception. Journal of the Acoustical Society of America, 29, 117-123.
Liberman, A., Cooper, F. S., Shankweiler, D., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A., Delattre, P., & Cooper, F. S. (1952). The role of selected stimulus variables in the perception of the unvoiced-stop consonants. American Journal of Psychology, 65, 497-516.
Liberman, A. M., Delattre, P., Cooper, F. S., & Gerstman, L. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68, 1-13.
Liberman, A. M., & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Liberman, A. M., & Mattingly, I. (1989). A specialization for speech perception. Science, 243, 489-494.
Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in Cognitive Sciences, 4, 187-196.
Lotto, A., Hickok, G., & Holt, L. L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Sciences, 13, 110-114.
Lotto, A., & Kluender, K. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception and Psychophysics, 60, 602-619.
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception. Perception and Psychophysics, 24, 253-257.
MacNeilage, P. F., Rootes, T. A., & Chase, T. P. (1967). Speech production and perception in a patient with severe impairment of somesthetic perception and motor control. Journal of Speech and Hearing Research, 10, 449-467.
Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407-412.
Marsh, K., Richardson, M., Baron, R., & Schmidt, R. (2006). Contrasting approaches to perceiving and acting with others. Ecological Psychology, 18, 1-36.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Millikan, R. G. (2003). In defense of public language. In L. M. Antony, & N. Hornstein (Eds.), Chomsky and his critics (pp. 215-237). Malden, MA: Blackwell Publishing, Ltd.
Mitterer, H. (2006). On the causes of compensation for coarticulation: Evidence for phonological mediation. Perception and Psychophysics, 68, 1227-1240.
Nasir, S., & Ostry, D. J. (2009). Auditory plasticity and speech motor learning. Proceedings of the National Academy of Sciences of the United States of America, 106, 20470-20475.
Osnes, B., Hugdahl, K., & Specht, K. (2011). Effective connectivity demonstrates involvement of premotor cortex during speech perception. Neuroimage, 54, 2437-2445.
Pardo, J., & Fowler, C. A. (1997). Perceiving the causes of coarticulatory acoustic variation: Consonant voicing and vowel pitch. Perception and Psychophysics, 59, 1141-1152.
Pierrehumbert, J. (1990). Phonological and phonetic representations. Journal of Phonetics, 18, 375-394.
Ping, R., & Goldin-Meadow, S. (2008). Hands in the air: Using ungrounded iconic gestures to teach children conservation of quantity. Developmental Psychology, 44, 1277-1287.
Pulvermüller, F., & Fadiga, L. (2010). Active perception: Sensorimotor circuits as a cortical basis for language. Nature Reviews Neuroscience, 11, 351-360.
Pulvermüller, F., Hauk, O., Nikulin, V. V., & Ilmoniemi, R. J. (2005). Functional links between motor and language systems. European Journal of Neuroscience, 21, 793-797.
Pulvermüller, F., Shtyrov, U., & Ilmoniemi, R. J. (2005). Brain signatures of meaning access in action word recognition. Journal of Cognitive Neuroscience, 17, 884-892.
Remez, R. E., & Pardo, J. S. (2006). The perception of speech. In M. Traxler, & M. A. Gernsbacher (Eds.), The handbook of psycholinguistics (pp. 201-248). New York, NY: Academic Press.
Rosenblum, L. (2008). Primacy of multimodal speech perception. In D. B. Pisoni, & R. E. Remez (Eds.), Handbook of speech perception (pp. 51-78). Malden, MA: Blackwell Publishing.
Runeson, S., & Frykholm, G. (1981). Visual perception of lifted weight. Journal of Experimental Psychology: Human Perception and Performance, 7, 733-740.
Ryle, G. (1949). The concept of mind. New York, NY: Barnes and Noble.
Schütz-Bosbach, S., & Prinz, W. (2007). Perceptual resonance: Action-induced modulation of perception. Trends in Cognitive Sciences, 11, 349-355.
Scott, S. K., McGettigan, C., & Eisner, F. (2009). A little more conversation, a little less action—candidate roles for the motor cortex in speech perception. Nature Reviews Neuroscience, 10, 295-302.
Silverman, K. (1986). F0 cues depend on : The case of the rise after voiced stops. Phonetica, 43, 76-92.
Silverman, K. (1987). The structure and processing of fundamental frequency contours. Unpublished Ph.D. dissertation, Cambridge University.
Stephens, J. D. W., & Holt, L. (2010). Learning to use an artificial visual cue in speech identification. Journal of the Acoustical Society of America, 128, 2138-2149.


Stoffregen, T. A., & Bardy, B. G. (2001). On specification and the senses. Behavioral and Brain Sciences, 24, 195-261.
Takahashi, M., Kamibayashi, K., Nakajima, T., Akai, J., & Nakazawa, K. (2008). Changes in corticospinal excitability during observation of walking. Neuroreport, 19, 727-731.
Taylor, L. J., & Zwaan, R. A. (2009). Action in cognition: The case of language. Language and Cognition, 1, 45-58.
Tremblay, S., Shiller, D. M., & Ostry, D. J. (2003). Somatosensory basis of speech production. Nature, 423, 866-869.
Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2010). Compensation for coarticulation: Disentangling auditory and gestural theories of perception of coarticulatory effects in speech. Journal of Experimental Psychology: Human Perception and Performance, 35, 1005-1015.
Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2013). Similar response patterns do not imply identical origins: An energetic masking account of nonspeech effects in compensation for coarticulation. Journal of Experimental Psychology: Human Perception and Performance, 39, 1181-1192.
Warren, W. H. (2006). The dynamics of perception and action. Psychological Review, 113, 358-389.
Yuen, I., Davis, M. H., Brysbaert, M., & Rastle, K. (2010). Activation of articulatory information in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 107, 592-597.
Zwaan, R. A., & Taylor, L. J. (2006). Seeing, acting, understanding: Motor resonance in language understanding. Journal of Experimental Psychology: General, 135, 1-11.

CHAPTER 16

Speech Perception: The View from the Auditory System

Andrew J. Lotto1 and Lori L. Holt2

1Speech, Language, & Hearing Sciences, University of Arizona, Tucson, AZ, USA; 2Department of Psychology and the Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00016-X © 2016 Elsevier Inc. All rights reserved.

16.1 INTRODUCTION

For much of the past 50 years, the main theoretical debate in the scientific study of speech perception has focused on whether the processing of speech sounds relies on neural mechanisms that are specific to speech and language or whether general perceptual/cognitive processes can account for all of the relevant phenomena. Starting with the first presentations of the Motor Theory of Speech Perception by Alvin Liberman and colleagues (Liberman, Cooper, Harris, MacNeilage, & Studdert-Kennedy, 1964; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Studdert-Kennedy, Liberman, Harris, & Cooper, 1970) and the critical reply from Harlan Lane (1965), many scientists defended "all-or-none" positions on the necessity of specialized speech processes, and much research was dedicated to demonstrations of phenomena that were purported to require general or speech-specific mechanisms (see Diehl, Lotto, & Holt, 2004 for a review of the theoretical commitments behind these positions). Whereas the "speech-is-special" debate continues to be relevant (Fowler, 2008; Lotto, Hickok, & Holt, 2009; Massaro & Chen, 2008; Trout, 2001), the focus of the field has moved toward more subtle distinctions concerning the relative roles of perceptual, cognitive, motor, and linguistic systems in speech perception and how each of these systems interacts in the processing of speech sounds. The result has been an opportunity to develop more plausible and complete models of speech perception/production (Guenther & Vladusich, 2012; Hickok, Houde, & Rong, 2011).

In line with this shift in focus, in this chapter we concentrate not on whether the general auditory system is sufficient for speech perception but rather on the ways that human speech communication appears to be constrained and structured on the basis of the operating characteristics of the auditory system. The basic premise is simple, with a long tradition in the scientific study of speech perception: the form of speech (at the level of phonetics and higher) takes advantage of what the auditory system does well, resulting in a robust and efficient communication system. We review here three aspects of auditory perception—discriminability, context interactions, and effects of experience—and discuss how the structure of speech appears to respect these general characteristics of the auditory system.

It should be noted that we include in our conception of the "auditory system" processes and constructs that are often considered to be "cognition," such as memory, learning, categorization, and attention (Holt & Lotto, 2010). This is in contrast to previous characterizations of "Auditorist" positions in speech perception that appeared to constrain explanations of speech phenomena to peculiarities of auditory encoding at the periphery. Most researchers who have advocated for general auditory accounts of speech perception actually propose explanations within a larger general auditory cognitive science framework (Holt & Lotto, 2008; Kluender & Kiefte, 2006). Recent findings in auditory neuroscience provide support for moving beyond simple dichotomies of perception versus cognition or top-down versus bottom-up or peripheral versus central. There have been demonstrations that manipulation of attention may affect the earliest stages of auditory encoding in the cochlea (Froehlich, Collet, Chanal, & Morgon, 1990; Garinis, Glattke, & Cone, 2011; Giard, Collet, Bouchet, & Pernier, 1994; Maison, Micheyl, & Collet, 2001) and experience with music and language changes the neural representation of sound in the brain stem (Song, Skoe, Wong, & Kraus, 2008; Wong, Skoe, Russo, Dees, & Kraus, 2007). In line with these findings, we treat attention, categorization, and learning as intrinsic aspects of auditory processing.

16.2 EFFECTS OF AUDITORY DISTINCTIVENESS ON THE FORM OF SPEECH

At the most basic level, the characteristics of the auditory system must constrain the form of speech because the information-carrying aspects of the signal must be encoded by the system and must be able to be discriminated by listeners. Given the remarkable ability of normal-hearing listeners to discriminate spectral-temporal changes in simple sounds such as tones and noises, the resolution of the auditory system does not appear to provide much of a constraint on the possible sounds used for speech communication. The smallest discriminable frequency change for a tone of 1,000 Hz is just over 1 Hz (Wier, Jesteadt, & Green, 1977), and an increment in intensity of 1 dB for that tone will likely be detected by the listener (Jesteadt, Wier, & Green, 1977). However, it is a mistake to make direct inferences from discriminability of simple acoustic stimuli to the perception of complex sounds, such as speech. Speech perception is not a simple detection or discrimination task; it is more similar to a pattern recognition task in which the information is carried through changes in relative patterns across a complex multidimensional space. These patterns must be robustly encoded and perceptually discriminable for efficient speech communication.

To the extent that some patterns are more readily discriminable by the auditory system, they will presumably be more effective as vehicles for communication. Liljencrants and Lindblom (1972) demonstrated that one could predict the vowel inventories of languages relatively well by maximizing intervowel distances within a psychophysically scaled vowel space defined by the first two formant frequencies (in Mel scaling). For example, /i/, /a/, and /u/ are correctly predicted to be the most common set of vowels for a three-vowel language system based on the presumption that they would be most auditorily discriminable given that their formant patterns are maximally distinct in the vowel space. Vowel inventory predictions become even more accurate as one more precisely models the auditory representation of each vowel (Diehl, Lindblom, & Creeger, 2003; Lindblom, 1986). These demonstrations are in agreement with proposals that languages tend to use sounds that maximize auditory distinctiveness in balance with the value of reducing articulatory effort, such as Stevens' (1972, 1989) Quantal Theory, Lindblom's (1991) H&H Theory (which we return to below), and Ohala's (1993) models of sound change in historical linguistics.

The proposal that auditory distinctiveness is important for effective speech communication was pushed even further by the Auditory Enhancement Theory from Diehl and colleagues (Diehl & Kluender, 1987, 1989; Diehl, Kluender, Walsh, & Parker, 1991). According to Auditory Enhancement, speakers tend to combine articulations that result in acoustic changes that mutually enhance distinctiveness of the resulting sounds for the listener. For example, in English the voicing contrast between /b/ and /p/ when spoken between two vowels, such as rabid versus rapid, is signaled in part by the duration of a silent interval that corresponds to the lip closure duration, which is shorter for /b/. However, speakers also tend to lengthen the duration of the preceding vowel when producing a /b/. Kluender, Diehl, and Wright (1988) demonstrated that preceding a silent gap with a long-duration sound results in the perception of a shorter silent gap, even for nonspeech sounds; this can be considered a kind of durational contrast. Thus, when talkers co-vary short lip closure durations with longer preceding vowels and vice versa, they produce a clearer auditory distinction between /b/ and /p/. This is just one of numerous examples appearing to indicate that the need for auditory distinctiveness drives the phonetic structure of languages (Diehl, Kluender, & Walsh, 1990; Kingston & Diehl, 1995).

In addition to providing constraints on the global structure of spoken languages, there is good evidence that the individual behavior of speakers is influenced by the local needs of listeners for auditory distinctiveness. According to Lindblom's (1991) H(yper) & H(ypo) Theory of speech communication, speakers vary their productions from hyperarticulation to hypoarticulation depending on the contextual needs of the listener. Spoken utterances that are redundant with other sources of information or with prior knowledge may be spoken with reduced effort, resulting in reduced auditory distinctiveness. However, novel information or words that are likely to be misperceived by a listener are produced with greater clarity or hyperarticulation. In accordance with this theory, there have been many demonstrations that speakers modulate productions when speaking to listeners who may have perceptual challenges, such as hearing-impaired listeners or non-native language learners (Bradlow & Bent, 2002; Picheny, Durlach, & Braida, 1985, 1986).

Despite the continued success of the theories described, it remains a challenge to derive a valid metric of "auditory distinctiveness" for complex time-varying signals like speech (and equally difficult to

quantify "articulatory effort"). The classic psychophysical measures of frequency, intensity, and temporal resolution are simply not sufficient. The pioneering work of David Green regarding auditory profile analysis, in which listeners discriminate amplitude pattern changes across a multitonal complex (Green, 1988; Green, Mason, & Kidd, 1984), was a step in the right direction because it could conceivably be applied to measuring the ability to discriminate steady-state vowel acoustics. However, vowel acoustics in real speech are much more complex and it is not clear that these measures scale up to predict intelligibility of speech at even the level of words. The future prospects of understanding how the operating characteristics of the auditory system constrain the acoustic elements used in speech communication are brighter given more recent approaches to psychoacoustic research that investigate the roles of context, attention, learning, and memory in general auditory processing (Kidd, Richards, Streeter, Mason, & Huang, 2011; Krishnan, Leech, Aydelott, & Dick, 2013; Ortiz & Wright, 2010; Snyder & Weintraub, 2013).

16.3 EFFECTS OF AUDITORY INTERACTION ON THE FORM OF SPEECH

The patterns of acoustic change that convey information in speech are notoriously complex. Speech sounds like /d/ and /g/ are not conveyed by a necessary or sufficient acoustic cue and there is no canonical acoustic template that definitively signals a linguistic message. Furthermore, variability is the norm. The detailed acoustic signature of a particular phoneme, syllable, or word varies a great deal across different contexts, utterances, and talkers. The inherent multidimensionality of the acoustic signatures that convey speech sounds and the variability along these dimensions presents a challenge for understanding how listeners readily map the continuous signal to discrete linguistic representations. This has been the central issue of speech perception research. Although some researchers have suggested that acoustic variability may serve useful functions in speech communication (Elman & McClelland, 1986; Liberman, 1996), the prevailing approach has been to explore how listeners accommodate or compensate for the messy physical acoustic signal to align it with native-language linguistic knowledge.

Although this framing of speech perception has dominated empirical research and theory, the focus on acoustic variability may lead us to pursue answers to the wrong questions. Like all perceptual systems, the auditory system transforms sensory input; it is not a linear system. It is possible that the nature of auditory perceptual transformations is such that the challenge of acoustic variability is mitigated when analyzed through the lens of auditory perception. Some of the more daunting mysteries about the ability of humans to accommodate acoustic variability in speech may arise from a lack of understanding of how the auditory system encodes complex sounds, generally.

Coarticulation is a case in point. As we talk, the mouth, jaw, and other articulators move very quickly, but not instantaneously, from target to target. Consequently, at any point in time the movement of the articulators is a function of the articulatory demands of previous and subsequent phonetic sequences as well as the "current" intended production. As a direct result, the acoustic signature of a speech sound is context-dependent. When /al/ precedes /ga/, for example, the tongue must quickly move from anterior to posterior occlusions to form the consonants. The effect of coarticulation is to draw /ga/ to a more anterior position (toward /al/). This context-sensitive shift in production impacts the resultant acoustic realization, making it more "da"-like because the place of tongue occlusion slides forward in the mouth toward the articulation typical of "da." Likewise, when /da/ is spoken after the more posteriorly articulated /ar/, the opposite pattern occurs; the acoustics of /da/ become more "ga"-like. This means that, due to coarticulation, the acoustic signature of the second syllables in "alga" and "arda" can be highly similar (Mann, 1980).

Viewed from the perspective of acoustic variability, this issue seems intractable. If the second consonant of "alga" and "arda" is signaled by highly similar acoustics, then how is it that we hear the distinct syllables "ga" and "da"? The answer lies in the incredible context dependence of speech perception; perception appears to compensate for coarticulation. This can be demonstrated by preceding a perceptually ambiguous syllable between /ga/ and /da/ with /al/ or /ar/. Whereas the acoustics of /ga/ produced after /al/ are more "da"-like, a preceding /al/ shifts perception of the ambiguous sound toward "ga." Similarly, /ar/ shifts perception of the same ambiguous sound toward "da." This pattern opposes the coarticulatory effects in speech production. In this example and many replications with other tasks and stimuli, coarticulation assimilates speech acoustics, but perception "compensates" in the opposing direction (Mann, 1980; Mann & Repp, 1980).

The traditional interpretation of these findings highlights that theoretical approaches have tended to discount what the auditory system can contribute to the challenges of speech perception. The flexibility of speech perception to make use of so many acoustic dimensions to signal a particular speech sound and the dependence of this mapping on context has

C. BEHAVIORAL FOUNDATIONS 188 16. SPEECH PERCEPTION suggested to many that it is infeasible for these effects have coarticulation and still be an effective communi- to arise from auditory processing. This challenge is cation signal because the operating characteristics of part of what led to the proposal that motor representa- the auditory system include exaggeration of spectral tions might be better suited to serve as the basis of and temporal contrast. Lotto et al. (1997) argue that the speech communication. But, by virtue of being sound, symmetry of assimilated speech production and con- acoustic speech necessarily interfaces with early audi- trastive perception is not serendipitous, but rather is a tory perceptual operations. As noted, these operations consequence of organisms having evolved within natu- are not linear; they do not simply convey raw acoustic ral environments in which sound sources are physi- input, they transform it. Thus, although acoustics are cally constrained in the sounds they can produce. readily observable and provide a straightforward Because of mass and inertia, natural sound sources means of estimating input to the linguistic system, tend to be assimilative, like speech articulators. this representation is not equivalent to the auditory Perceptual systems, audition included, tend to empha- information available to the linguistic system. What size signs of change, perhaps because in comparison might be gained by considering auditory—rather than with physical systems’ relative sluggishness rapid acoustic—information? change is ecologically significant information. 
Having Lotto and Kluender (1998) approached this question evolved like other perceptual systems to respect regu- by examining whether perceptual compensation for larities of the natural environment, auditory processing coarticulation like that described for “alga” and “arda” transforms coarticulated acoustic signals to exaggerate really requires information about speech articulation, contrast and, thus, eliminates some of the apparent or whether the context sounds need to be speech at all. challenges of coarticulation. We can communicate effi- They did this by creating nonspeech sounds that had ciently as our relatively sluggish articulators perform some of the acoustic energy that distinguishes /al/ acrobatics across tens of milliseconds to produce from /ar/. These nonspeech signals do not carry infor- speech, in part because our auditory system evolved to mation about articulation, talker identity, or any other use acoustic signals from natural sound sources that speech-specific details. The result was two nonspeech face the same physical constraints. tone sweeps, one with energy like /al/ and the other These results also highlight the importance of con- with energy mimicking /ar/. Lotto and Kluender sidering higher-level auditory processing in constrain- found that when these nonspeech acoustic signals pre- ing models of speech perception. Subsequent research ceded /ga/ and /da/ sounds, the tone sweeps had the has shown that the auditory system exhibits spectral same influence as the /al/ and /ar/ sounds they mod- and temporal contrast for more complex sound input eled. So-called perceptual compensation for coarticula- (Holt, 2005, 2006a, 2006b; Laing, Liu, Lotto, & Holt, tion is observed even for nonspeech contexts that 2012). These studies indicate that the auditory system convey no information about speech articulation. 
tracks the long-term average spectra (or rate; Wade & This finding has been directly replicated (Fowler, Holt, 2005a) of sounds, and that subsequent perception 2006; Lotto, Sullivan, & Holt, 2003) and extended to other is relative to, and contrastive with, these distributional stimulus contexts (Coady, Kluender, & Rhode, 2003; characteristics of preceding acoustic signals (Watkins, Fowler, Brown, & Mann, 2000; Holt, 1999; Holt & Lotto, 1991; Watkins & Makin, 1994). These effects, described 2002) many times. Across these replications, the pattern graphically in Figure 16.1, cannot be explained by low- of results reveals that a basic characteristic of auditory level peripheral auditory processing; effects persist perception is to exaggerate contrast. Preceded by a high- over more than a second of silence or intervening frequency sound (whether speech or nonspeech), subse- sound (Holt, 2005) and require the system to track quent sounds are perceived to be lower-frequency. This distributional regularity across acoustic events (Holt, is also true in the temporal domain; preceded by longer 2006a). These findings are significant for understand- sounds or sounds presented at a slower rate, subsequent ing talker and rate normalization, which refer to the sounds are heard as shorter (Diehl & Walsh, 1989; challenges introduced to speech perception by acoustic Wade & Holt, 2005a, 2005b). Further emphasizing the variability arising from different speakers and different generality of these effects, Japanese quail exhibit the rates of speech. What is important is that the preceding pattern of speech context dependence that had been context of sounds possess acoustic energy in the spec- thought to be indicative of perceptual compensation for tral (Laing et al., 2012) or temporal (Wade & Holt, coarticulation (Lotto, Kluender, & Holt, 1997). 
2005a, 2005b) region distinguishing the target pho- This example underscores the fact that acoustic and nemes, and not that the context carries articulatory or auditory are not one and the same. Whereas there is speech-specific information. Here, too, some of the considerable variability in speech acoustics, some of challenges apparent from speech acoustics may be this variability is accommodated by auditory percep- resolved in the transformation from acoustic to tual processing. In this way, the form of speech can auditory.
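The dispersion computation behind Liljencrants and Lindblom's (1972) prediction, discussed at the start of this section, can be sketched in a few lines. The sketch below follows their criterion of minimizing the summed inverse-squared distances between vowels in a Mel-scaled (F1, F2) space; the formant values are illustrative textbook averages chosen for this example, not data from the studies discussed here.

```python
import itertools, math

def mel(hz):
    """Convert frequency in Hz to the Mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

# Illustrative (F1, F2) values in Hz for five vowels; rough averages
# in the spirit of classic formant measurements, assumed for this sketch.
VOWELS = {
    "i": (270, 2290), "e": (530, 1840), "a": (730, 1090),
    "o": (570, 840),  "u": (300, 870),
}

def dispersion_energy(subset):
    """Liljencrants & Lindblom-style criterion: sum of 1/d^2 over all
    vowel pairs, with distance measured in Mel-scaled (F1, F2) space.
    Lower energy means a more dispersed inventory."""
    pts = {v: (mel(f1), mel(f2)) for v, (f1, f2) in subset.items()}
    energy = 0.0
    for v1, v2 in itertools.combinations(pts, 2):
        d2 = sum((x - y) ** 2 for x, y in zip(pts[v1], pts[v2]))
        energy += 1.0 / d2
    return energy

def best_inventory(size):
    """Return the vowel set of a given size with minimal energy."""
    return min(itertools.combinations(VOWELS, size),
               key=lambda s: dispersion_energy({v: VOWELS[v] for v in s}))

print(sorted(best_inventory(3)))  # → ['a', 'i', 'u']
```

With these assumed inputs, the maximally dispersed three-vowel system comes out as /i/, /a/, /u/, matching the typological prediction described above.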

FIGURE 16.1 Precursor contexts and their effect on adult /ga/-/da/ categorization. Manipulation of the Long-Term Average Spectrum (LTAS) of both speech (top) and nonspeech (bottom) has a strong, contrastive influence on speech categorization. From Laing, Liu, Lotto, and Holt (2012) with permission from the publishers.
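The contrastive influence of a precursor's long-term average spectrum (LTAS), as manipulated in Figure 16.1, can be illustrated with a toy model. The subtraction used here is only a stand-in for whatever normalization the auditory system actually performs, and the spectra and channel spacing are invented for illustration; the point is simply that an identical target is represented as relatively lower after a high-frequency-biased context, and vice versa.

```python
import numpy as np

freqs = np.linspace(100, 5000, 50)              # analysis channels (Hz)
target = np.exp(-((freqs - 2000) / 600) ** 2)   # same target sound in both contexts

# Hypothetical precursors: one with energy emphasized at high frequencies,
# one at low frequencies (loosely analogous to the LTAS manipulation above).
prec_high = np.exp(-((freqs - 3500) / 800) ** 2)
prec_low = np.exp(-((freqs - 800) / 800) ** 2)

def contrastive(target_spec, precursor_spec):
    """Toy contrast model: the effective representation of the target is
    its spectrum minus the precursor's long-term average spectrum."""
    return np.clip(target_spec - precursor_spec, 0, None)

def centroid(spec):
    """Spectral centroid of a representation, as a one-number summary."""
    return float(np.sum(freqs * spec) / np.sum(spec))

c_high = centroid(contrastive(target, prec_high))
c_low = centroid(contrastive(target, prec_low))
print(c_high < c_low)  # True: a high-frequency context pushes the percept lower
```

The direction of the effect is contrastive, mirroring the behavioral pattern in Figure 16.1: the same acoustics yield a lower effective representation after the high-LTAS precursor than after the low-LTAS precursor.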

16.4 EFFECTS OF LEARNABILITY ON THE FORM OF SPEECH

Auditory representations are influenced greatly by both short-term and long-term experience. Categorical perception, the classic textbook example among speech perception phenomena, exemplifies this. When native-language speech varying gradually in its acoustics is presented to listeners, the patterns of identification change abruptly, not gradually, from one phoneme (or syllable or word) to another. Likewise, there is a corresponding discontinuity in discrimination such that pairs of speech sounds are more discriminable if they lie on opposite sides of the sharp identification boundary than if they lie on the same side of the identification curve's slope, even when they are matched in acoustic difference. Said another way, acoustically distinct speech sounds identified with the same label are difficult to discriminate, whereas those with different labels are readily discriminated. Despite the renown of categorical perception for speech, it is now understood that it is not specific to speech (Beale & Keil, 1995; Bimler & Kirkland, 2001; Krumhansl, 1991; Livingston, Andrews, & Harnad, 1998; Mirman, Holt, & McClelland, 2004), and that even speech is not entirely "categorical" (Eimas, 1963; Harnad, 1990; Liberman, Harris, Hoffman, & Griffith, 1957; Pisoni, 1973). Infants (Kuhl, 1991; McMurray & Aslin, 2005) and adults (Kluender, Lotto, Holt, & Bloedel, 1998; McMurray, Aslin, Tanenhaus, Spivey, & Subik, 2008) remain sensitive to within-category acoustic variation. Speech categories exhibit graded internal structure such that instances of a speech sound are treated as relatively better or worse exemplars of the category (Iverson & Kuhl, 1995; Iverson et al., 2003; Johnson, Flemming, & Wright, 1993; Miller & Volaitis, 1989).

We have argued that it may be more productive to consider speech perception as categorization, as opposed to categorical (Holt & Lotto, 2010). This may seem like a small difference in designation, but it has important consequences. Considering speech perception as an example of general auditory categorization provides a means of understanding how the system comes to exhibit relative perceptual constancy in the face of acoustic variability, and does so in a native-language-specific manner. The reason for this is that although there is a great deal of variability in speech acoustics, there also exist underlying regularities in the distributions of experienced native-language speech sounds. This is the computational challenge of categorization: discriminably different exemplars come to be treated as functionally equivalent. A system that can generalize across variability to discover underlying patterns and distributional regularities—a system that can categorize—may cope with the acoustic variability inherent in speech without need for invariance. Seeking invariance in the acoustic signatures of speech becomes less essential if we take a broader view that extends beyond pattern matching to consider active auditory processing that involves higher-order and multimodal perception, categorization, attention, and learning.

From this perspective, learning about how listeners acquire auditory categories can constrain behavioral and neurobiological models of speech perception. Whereas the acquisition of first and second language phonetic systems provides an opportunity to observe the development of complex auditory categories, our ability to model these categorization processes is limited because it is extremely difficult to control, or even accurately measure, a listener's history of experience with speech sounds. However, we are beginning to develop insights into auditory categorization from experiments using novel artificial nonspeech sound categories that inform our understanding about how speech perception and acquisition are constrained by general perceptual learning mechanisms (Desai, Liebenthal, Waldron, & Binder, 2008; Guenther, Husain, Cohen, & Shinn-Cunningham, 1999; Holt & Lotto, 2006; Holt, Lotto, & Diehl, 2004; Ley et al., 2012; Liebenthal et al., 2010).

One example of what this approach can reveal about how auditory learning constrains speech relates to a classic early example of the "lack of invariance" in speech acoustics. If one examines the formant frequencies corresponding most closely with /d/ as it precedes different vowels, then it is impossible to define a single acoustic dimension that uniquely distinguishes the sound as a /d/; the acoustics are greatly influenced by the following vowel (Liberman, Delattre, Cooper, & Gerstman, 1954). This kind of demonstration fueled theoretical commitments that speech perception is accomplished via the speech motor system, in the hopes that this would provide a more invariant mapping than acoustics (Liberman et al., 1967). Viewed from the perspective of acoustics, perceptual constancy for /d/ seemed an intractable problem for auditory processing.

Wade and Holt (2005a, 2005b) modeled this perceptual challenge with acoustically complex nonspeech sound exemplars that formed categories signaled only by higher-order acoustic structure and not by any invariant acoustic cue (see Figure 16.2 for a representation of the stimulus set). Naïve participants experienced these sounds in the context of a videogame in which learning sound categories facilitated advancement in the game but was never explicitly required or rewarded. Within just a half-hour of game play, participants categorized the sounds and generalized their category learning to novel exemplars. This learning led to an exaggeration of between-category discriminability (of the sort traditionally attributed to categorical perception) as measured with electroencephalography (EEG; Liu & Holt, 2011). The seemingly intractable lack of acoustic invariance is, in fact, readily learnable even in an incidental task.

This is proof that the auditory system readily uses multimodal environmental information (modeled in the videogame as sound-object links, as in natural environments) to facilitate discovery of the distributional regularities that define the relations between category exemplars while generalizing across acoustic variability within categories. More than this, however, the approach can reveal details of auditory processing that constrain behavioral and neurobiological models of speech perception. Using the same nonspeech categories and training paradigm, Leech, Holt, Devlin, and Dick (2009) discovered that the extent to which participants learn to categorize nonspeech sounds is strongly correlated with the pretraining to post-training recruitment of the left posterior superior temporal sulcus (pSTS) during presentation of the nonspeech sound category exemplars. This is unexpected because left pSTS has been described as selective for specific acoustic and informational properties of speech signals (Price, Thierry, & Griffiths, 2005). In recent work, Lim, Holt, and Fiez (2013) have found that left pSTS is recruited online in the videogame category training task in a manner that correlates with behavioral measures of learning. These results also demonstrate that recruitment of left pSTS by the nonspeech sound categories cannot be attributed to their superficial acoustic similarity to speech or to mere exposure. When highly similar nonspeech sounds are sampled such that category membership is random instead of structured, left pSTS activation is not related to behavioral performance.

As in the examples from the preceding sections, this series of studies demonstrates that there is danger in presuming that speech is fundamentally different from other sounds in either its acoustic structure or in the basic perceptual processes it requires. The selectivity of left pSTS for speech should not be understood to be selectivity for intrinsic properties of acoustic speech signals, such as the articulatory information that speech may carry. Instead, this region seems to meet the computational demands presented by learning to treat structured distributions of acoustically variable sounds as functionally equivalent.
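The computational point made above, that structured distributions rather than invariant cues can support categorization, can be made concrete with a minimal distributional learner. The one-dimensional cue, the category means, and the Gaussian form are all assumptions for illustration, not the stimuli used in the studies discussed; the sketch simply shows how a learner that tracks distributional statistics treats discriminably different exemplars as functionally equivalent and generalizes to novel ones.

```python
import random, statistics

random.seed(1)

# Hypothetical 1-D acoustic cue (e.g., a spectral value) for two categories.
# Exemplars vary: no single invariant value marks category membership.
cat_A = [random.gauss(1000, 150) for _ in range(200)]
cat_B = [random.gauss(1800, 150) for _ in range(200)]

# A distributional learner: summarize each category by its mean and spread.
model = {lab: (statistics.mean(x), statistics.stdev(x))
         for lab, x in [("A", cat_A), ("B", cat_B)]}

def classify(cue):
    """Label a novel exemplar by which learned distribution it fits best
    (a standardized-distance comparison)."""
    return min(model, key=lambda lab: abs(cue - model[lab][0]) / model[lab][1])

# Generalization: novel exemplars never encountered during learning.
print(classify(950), classify(1900))  # A B
```

Because the decision depends on whole distributions, within-category variation is tolerated while identification still changes sharply near the boundary between the two distributions, a pattern reminiscent of the "categorization, not categorical" view described above.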

FIGURE 16.2 Schematic spectrograms showing the artificial nonspeech auditory categories across time and frequency. The dashed gray lines show the lower-frequency spectral peak, P1. The colored lines show the higher-frequency spectral peak, P2. The six exemplars of each category are composed of P1 and one of the colored P2 components pictured. Note that unidimensional categories are characterized by an offset glide that increases (top left) or decreases (top right) in frequency across all exemplars. No such unidimensional cue differentiates the multidimensional categories. From Wade and Holt (2005a, 2005b) with permission from the publishers.
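A minimal sketch can show why categories like the multidimensional ones in Figure 16.2 resist any single-cue rule. In the toy categories below (dimensions, ranges, and noise levels invented for illustration), each dimension alone is uninformative because both categories span the same range on it, but the relation between the two dimensions separates the categories almost perfectly.

```python
import random

random.seed(0)

def exemplar(category):
    """Draw a hypothetical two-dimensional exemplar. Neither dimension
    alone separates the categories; only their relation does, loosely
    analogous to the multidimensional categories of Figure 16.2."""
    x = random.uniform(500, 1500)              # dimension 1 (Hz)
    offset = 200 if category == "A" else -200  # relational, higher-order cue
    y = x + offset + random.gauss(0, 50)       # dimension 2 (Hz)
    return (x, y)

train = [(exemplar(c), c) for c in "AB" for _ in range(100)]

# A unidimensional rule (threshold on dimension 2 alone) performs poorly,
# because both categories cover the same span on each single dimension.
uni = sum((y > 1000) == (c == "A") for (x, y), c in train) / len(train)

# A relational (multidimensional) rule separates the categories easily.
multi = sum((y - x > 0) == (c == "A") for (x, y), c in train) / len(train)

print(uni < 0.8 and multi > 0.95)  # True: only the relational rule works
```

This is the sense in which the Wade and Holt categories are "signaled only by higher-order acoustic structure": category membership is carried by a relation among dimensions that no single acoustic threshold can capture.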

Likewise, caution is warranted in presuming that the transformation from acoustic to auditory involves only a static mapping to stable, unchanging linguistic representations. The recruitment of putatively speech-selective left pSTS was driven by category learning in less than an hour (Lim et al., 2013). Thus, the behavioral relevance of the artificial, novel auditory categories drove reorganization of their transformations from acoustic to auditory. The examples we present here illustrate the facile manner by which auditory categories can be acquired. On an even shorter time scale, there is considerable evidence that the mapping of speech acoustics to linguistic representations is "tuned" by multiple information sources in an adaptive manner, such as may be required to adapt to foreign-accented speech or to speech in adverse, noisy environments (Kraljic, Brennan, & Samuel, 2008; Mehler et al., 1993; Vitela, Carbonell, & Lotto, 2012). The active, flexible nature of auditory processing puts learning in the spotlight and positions questions of speech perception in greater contact with other neurobiological approaches to understanding perception, cognition, and language.

16.5 MOVING FORWARD

The preceding sections provide a few brief examples of how general auditory processing may influence the perception of speech sounds as well as the structure of phonetic systems. These examples demonstrate that, at the very least, human speech communication appears to take advantage of the things that the auditory system does well—phonetic inventories tend to include sounds whose differences are well-encoded in the auditory system. The acoustic effects of coarticulation are just the types of interactions that the auditory system can accommodate, and the multidimensional structure of speech sounds forms just the kinds of categories that are easily learned by the auditory system. Whether or not there are additional specialized processes required for speech perception, it is likely that the auditory system constrains the way we talk to and perceive each other to a greater extent than has been acknowledged.

One of the beneficial outcomes of the fact that the auditory system plays a strong role in speech perception is that there is the opportunity for synergy between research on speech and on general auditory processing. Speech perception phenomena shine a light on auditory processes that have remained unilluminated by research with simpler acoustic stimuli. The theories regarding the auditory distinctiveness of speech sounds have inspired the search for better models of auditory encoding of complex stimuli and better functions for computing distinctiveness (Lotto et al., 2003). The existence of perceptual compensation for coarticulation and talker normalization provides evidence for spectral and temporal interactions in general auditory processing that are not evident when presenting stimuli in isolation (Holt, 2006a, 2006b; Holt & Lotto, 2002; Watkins & Makin, 1994). The complexity of speech categories, along with the ease with which humans learn them, is the starting point for most of the current work on auditory categorization (Goudbeek, Smits, Swingley, & Cutler, 2005; Lotto, 2000; Maddox, Molis, & Diehl, 2002; Smits, Sereno, & Jongman, 2006; Wade & Holt, 2005a, 2005b). The vitality of auditory and speech cognitive neuroscience depends on continuing this trend of using speech and auditory phenomena to mutually inform and inspire each field.

References

Beale, J. M., & Keil, F. C. (1995). Categorical effects in the perception of faces. Cognition, 57(3), 217–239.
Bimler, D., & Kirkland, J. (2001). Categorical perception of facial expressions of emotion: Evidence from multidimensional scaling. Cognition and Emotion, 15(5), 633–658.
Bradlow, A., & Bent, T. (2002). The clear speech effect for non-native listeners. Journal of the Acoustical Society of America, 112(1), 272–284.
Coady, J. A., Kluender, K. R., & Rhode, W. S. (2003). Effects of contrast between onsets of speech and other complex spectra. Journal of the Acoustical Society of America, 114, 2225.
Desai, R., Liebenthal, E., Waldron, E., & Binder, J. R. (2008). Left posterior temporal regions are sensitive to auditory categorization. Journal of Cognitive Neuroscience, 20(7), 1174–1188.
Diehl, R., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.
Diehl, R. L., & Kluender, K. R. (1987). On the categorization of speech sounds. In S. Harnad (Ed.), Categorical perception: The groundwork of cognition (pp. 226–253). London: Cambridge University Press.
Diehl, R. L., & Kluender, K. R. (1989). On the objects of speech perception. Ecological Psychology, 1(2), 121–144.
Diehl, R. L., Kluender, K. R., & Walsh, M. A. (1990). Some auditory bases of speech perception and production. Advances in Speech, Hearing and Language Processing, 1, 243–268.
Diehl, R. L., Kluender, K. R., Walsh, M. A., & Parker, E. M. (1991). Auditory enhancement in speech perception and phonology. In R. Hoffman, & D. Palermo (Eds.), Cognition and the symbolic process: Analytical and ecological perspectives (pp. 59–75). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Diehl, R. L., Lindblom, B., & Creeger, C. P. (2003). Increasing realism of auditory representations yields further insights into vowel phonetics. Proceedings of the fifteenth international congress of phonetic sciences (Vol. 2, pp. 1381–1384). Adelaide: Causal Publications.
Diehl, R. L., & Walsh, M. A. (1989). An auditory basis for the stimulus-length effect in the perception of stops and glides. Journal of the Acoustical Society of America, 85(5), 2154–2164.
Eimas, P. D. (1963). The relationship between identification and discrimination along speech and non-speech continua. Language and Speech, 6(4), 206–217.
Elman, J. L., & McClelland, J. L. (1986). Exploiting lawful variability in the speech wave. In J. S. Perkell, & D. H. Klatt (Eds.), Invariance and variability of speech processes (pp. 360–385). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception and Psychophysics, 68(2), 161–177.
Fowler, C. A. (2008). The FLMP STMPed. Psychonomic Bulletin and Review, 15(2), 458–462.
Fowler, C. A., Brown, J. M., & Mann, V. A. (2000). Contrast effects do not underlie effects of preceding liquids on stop-consonant identification by humans. Journal of Experimental Psychology: Human Perception and Performance, 26(3), 877–888.
Froehlich, P., Collet, L., Chanal, J.-M., & Morgon, A. (1990). Variability of the influence of a visual task on the active micromechanical properties of the cochlea. Brain Research, 508(2), 286–288.
Garinis, A. C., Glattke, T., & Cone, B. K. (2011). The MOC reflex during active listening to speech. Journal of Speech, Language, and Hearing Research, 54(5), 1464–1476.
Giard, M.-H., Collet, L., Bouchet, P., & Pernier, J. (1994). Auditory selective attention in the human cochlea. Brain Research, 633(1), 353–356.
Goudbeek, M., Smits, R., Cutler, A., & Swingley, D. (2005). Acquiring auditory and phonetic categories. In H. Cohen, & C. Lefebvre (Eds.), Handbook of categorization in cognitive science (pp. 497–513). Amsterdam: Elsevier.
Green, D. M. (1988). Profile analysis: Auditory intensity discrimination. New York: Oxford University Press.
Green, D. M., Mason, C. R., & Kidd, G. (1984). Profile analysis: Critical bands and duration. The Journal of the Acoustical Society of America, 75(4), 1163–1167.
Guenther, F. H., Husain, F. T., Cohen, M. A., & Shinn-Cunningham, B. G. (1999). Effects of categorization and discrimination training on auditory perceptual space. Journal of the Acoustical Society of America, 106(5), 2900–2912.
Guenther, F. H., & Vladusich, T. (2012). A neural theory of speech acquisition and production. Journal of Neurolinguistics, 25(5), 408–422.
Harnad, S. R. (1990). Categorical perception: The groundwork of cognition. New York: Cambridge University Press.
Hickok, G. S., Houde, J., & Rong, F. (2011). Sensorimotor integration in speech processing: Computational basis and neural organization. Neuron, 69(3), 407–422.
Holt, L., & Lotto, A. (2006). Cue weighting in auditory categorization: Implications for first and second language acquisition. The Journal of the Acoustical Society of America, 119, 3059.
Holt, L., & Lotto, A. (2010). Speech perception as categorization. Attention, Perception, and Psychophysics, 72(5), 1218–1227.
Holt, L., Lotto, A., & Diehl, R. (2004). Auditory discontinuities interact with categorization: Implications for speech perception. The Journal of the Acoustical Society of America, 116, 1763.
Holt, L., & Lotto, A. J. (2002). Behavioral examinations of the level of auditory processing of speech context effects. Hearing Research, 167, 156–169.


Holt, L. L. (1999). Auditory constraints on speech perception: An examination of spectral contrast. Ph.D. thesis, University of Wisconsin-Madison.
Holt, L. L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science, 16(4), 305–312.
Holt, L. L. (2006a). The mean matters: Effects of statistically defined nonspeech spectral distributions on speech categorization. Journal of the Acoustical Society of America, 120, 2801–2817.
Holt, L. L. (2006b). Speech categorization in context: Joint effects of nonspeech and speech precursors. Journal of the Acoustical Society of America, 119(6), 4016–4026.
Holt, L. L., & Lotto, A. J. (2008). Speech perception within an auditory cognitive neuroscience framework. Current Directions in Psychological Science, 17(1), 42–46.
Iverson, P., Kuhl, P., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., et al. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition, 87(1), B47–B57.
Iverson, P., & Kuhl, P. K. (1995). Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling. Journal of the Acoustical Society of America, 97(1), 553–562.
Jesteadt, W., Wier, C. C., & Green, D. M. (1977). Intensity discrimination as a function of frequency and sensation level. Journal of the Acoustical Society of America, 61(1), 169–177.
Johnson, K., Flemming, E., & Wright, R. (1993). The hyperspace effect: Phonetic targets are hyperarticulated. Language, 69(3), 505–528.
Kidd, G. J., Richards, V. M., Streeter, T., Mason, C. R., & Huang, R. (2011). Contextual effects in the identification of nonspeech auditory patterns. Journal of the Acoustical Society of America, 130(6), 3926–3938.
Kingston, J., & Diehl, R. L. (1995). Intermediate properties in the perception of distinctive feature values. Papers in Laboratory Phonology, 4, 7–27.
Kluender, K. R., Diehl, R. L., & Wright, B. A. (1988). Vowel length differences before voiced and voiceless consonants: An auditory explanation. Journal of Phonetics, 16, 153–169.
Kluender, K. R., & Kiefte, M. (2006). Speech perception within a biologically realistic information-theoretic framework. In M. A. Gernsbacher, & M. Traxler (Eds.), Handbook of psycholinguistics (2nd ed., pp. 153–199). London: Elsevier.
Kluender, K. R., Lotto, A. J., Holt, L. L., & Bloedel, S. L. (1998). Role of experience for language-specific functional mappings of vowel sounds. Journal of the Acoustical Society of America, 104(6), 3568–3582.
Kraljic, T., Brennan, S. E., & Samuel, A. G. (2008). Accommodating variation: Dialects, idiolects, and speech processing. Cognition, 107(2), 54–81.
Krishnan, S., Leech, R., Aydelott, J., & Dick, F. (2013). School-age children's environmental object identification in natural auditory scenes: Effects of masking and contextual congruence. Hearing Research, 300, 46–55.
Krumhansl, C. L. (1991). Music psychology: Tonal structures in perception and memory. Annual Review of Psychology, 42(1), 277–303.
Kuhl, P. K. (1991). Human adults and human infants show a "perceptual magnet effect" for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50(2), 93–107.
Laing, E. J. C., Liu, R., Lotto, A. J., & Holt, L. L. (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3, 203–227.
Lane, H. (1965). The motor theory of speech perception: A critical review. Psychological Review, 72(4), 275–309.
Leech, R., Holt, L. L., Devlin, J. T., & Dick, F. (2009). Expertise with artificial nonspeech sounds recruits speech-sensitive cortical regions. The Journal of Neuroscience, 29(16), 5234–5239.
Ley, A., Vroomen, J., Hausfeld, L., Valente, G., de Weerd, P., & Formisano, E. (2012). Learning of new sound categories shapes neural response patterns in human auditory cortex. The Journal of Neuroscience, 32(38), 13273–13280.
Liberman, A., Harris, K., Hoffman, H., & Griffith, B. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368.
Liberman, A. M. (1996). Speech: A special code. Cambridge: MIT Press.
Liberman, A. M., Cooper, F. S., Harris, K. S., MacNeilage, P. F., & Studdert-Kennedy, M. (1964). Some observations on a model for speech perception. Proceedings of the AFCRL symposium on models for the perception of speech and visual form. Cambridge: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Liberman, A. M., Delattre, P. C., Cooper, F. S., & Gerstman, L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68(8), 1–13.
Liebenthal, E., Desai, R., Ellingson, M. M., Ramachandran, B., Desai, A., & Binder, J. R. (2010). Specialization along the left superior temporal sulcus for auditory categorization. Cerebral Cortex, 20(12), 2958–2970.
Liljencrants, J., & Lindblom, B. (1972). Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48(4), 839–862.
Lim, S., Holt, L. L., & Fiez, J. A. (2013). Context-dependent modulation of striatal systems during incidental auditory category learning. Poster presented at the 43rd Annual Conference of the Society for Neuroscience, San Diego, CA.
Lindblom, B. (1986). Phonetic universals in vowel systems. In J. Ohala, & J. Jaeger (Eds.), Experimental phonology (pp. 13–44). Orlando, FL: Academic Press.
Lindblom, B. (1991). The status of phonetic gestures. In I. G. Mattingly, & M. Studdert-Kennedy (Eds.), Modularity and the motor theory of speech perception (pp. 7–24). Hillsdale, NJ: Lawrence Erlbaum Associates.
Liu, R., & Holt, L. L. (2011). Neural changes associated with nonspeech auditory category learning parallel those of speech category acquisition. Journal of Cognitive Neuroscience, 23(3), 683–698.
Livingston, K. R., Andrews, J. K., & Harnad, S. (1998). Categorical perception effects induced by category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(3), 732–753.
Lotto, A. J. (2000). Language acquisition as complex category formation. Phonetica, 57, 189–196.
Lotto, A. J., Hickok, G. S., & Holt, L. L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Sciences, 13(3), 110–114.
Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception and Psychophysics, 60(4), 602–619.
Lotto, A. J., Kluender, K. R., & Holt, L. L. (1997). Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). Journal of the Acoustical Society of America, 102, 1134–1140.
Lotto, A. J., Sullivan, S. C., & Holt, L. L. (2003). Central locus for nonspeech context effects on phonetic identification. Journal of the Acoustical Society of America, 113(1), 53–56.
Maddox, W. T., Molis, M. R., & Diehl, R. L. (2002). Generalizing a neuropsychological model of visual categorization to auditory categorization of vowels. Perception and Psychophysics, 64(4), 584–597.
Maison, S., Micheyl, C., & Collet, L. (2001). Influence of focused auditory attention on cochlear activity in humans. Psychophysiology, 38(1), 35–40.

C. BEHAVIORAL FOUNDATIONS 194 16. SPEECH PERCEPTION

Mann, V., & Repp, B. (1980). Influence of vocalic context on perception of the [s]-[ʃ] distinction. Attention, Perception, and Psychophysics, 28(3), 213–228.
Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception and Psychophysics, 28(5), 407–412.
Massaro, D. W., & Chen, T. H. (2008). The motor theory of speech perception revisited. Psychonomic Bulletin and Review, 15(2), 453–457.
McMurray, B., & Aslin, R. N. (2005). Infants are sensitive to within-category variation in speech perception. Cognition, 95(2), B15–B26.
McMurray, B., Aslin, R. N., Tanenhaus, M. K., Spivey, M. J., & Subik, D. (2008). Gradient sensitivity to within-category variation in words and syllables. Journal of Experimental Psychology: Human Perception and Performance, 34(6), 1609–1631.
Mehler, J., Sebastian, N., Altmann, G., Dupoux, E., Christophe, A., & Pallier, C. (1993). Understanding compressed sentences: The role of rhythm and meaning. Annals of the New York Academy of Sciences, 682(1), 272–282.
Miller, J. L., & Volaitis, L. E. (1989). Effect of speaking rate on the perceptual structure of a phonetic category. Perception and Psychophysics, 46(6), 505–512.
Mirman, D., Holt, L., & McClelland, J. (2004). Categorization and discrimination of nonspeech sounds: Differences between steady-state and rapidly-changing acoustic cues. Journal of the Acoustical Society of America, 116(2), 1198–1207.
Ohala, J. J. (1993). Sound change as nature's speech perception experiment. Speech Communication, 13(1–2), 155–161.
Ortiz, J. A., & Wright, B. A. (2010). Differential rates of consolidation of conceptual and stimulus learning following training on an auditory skill. Experimental Brain Research, 201(3), 441–451.
Picheny, M., Durlach, N., & Braida, L. (1985). Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech. Journal of Speech, Language, and Hearing Research, 28, 96–103.
Picheny, M. A., Durlach, N. I., & Braida, L. (1986). Speaking clearly for the hard of hearing II: Acoustic characteristics of clear and conversational speech. Journal of Speech, Language, and Hearing Research, 29, 434–446.
Pisoni, D. B. (1973). Auditory and phonetic memory codes in the discrimination of consonants and vowels. Perception and Psychophysics, 13, 253–260.
Price, C., Thierry, G., & Griffiths, T. (2005). Speech-specific auditory processing: Where is it? Trends in Cognitive Sciences, 9(6), 271–276.
Smits, R., Sereno, J., & Jongman, A. (2006). Categorization of sounds. Journal of Experimental Psychology: Human Perception and Performance, 32(3), 733–754.
Snyder, J. S., & Weintraub, D. M. (2013). Loss and persistence of implicit memory for sound: Evidence from auditory stream segregation context effects. Attention, Perception, and Psychophysics, 75, 1056–1074.
Song, J. H., Skoe, E., Wong, P., & Kraus, N. (2008). Plasticity in the adult human auditory brainstem following short-term linguistic training. Journal of Cognitive Neuroscience, 20(10), 1892–1902.
Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David, & P. B. Denes (Eds.), Human communication: A unified view (pp. 51–66). New York, NY: McGraw-Hill.
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3–45.
Studdert-Kennedy, M., Liberman, A. M., Harris, K. S., & Cooper, F. S. (1970). Motor theory of speech perception: A reply to Lane's critical review. Psychological Review, 77(3), 234–249.
Trout, J. D. (2001). The biological basis of speech: What to infer from talking to the animals. Psychological Review, 108(3), 523–549.
Vitela, A. D., Carbonell, K. M., & Lotto, A. J. (2012). Predicting the effects of carrier phrases in speech perception. Poster presentation at the 53rd meeting of the Psychonomics Society. Minneapolis, MN.
Wade, T., & Holt, L. (2005a). Incidental categorization of spectrally complex non-invariant auditory stimuli in a computer game task. Journal of the Acoustical Society of America, 118(4), 2618–2633.
Wade, T., & Holt, L. L. (2005b). Perceptual effects of preceding nonspeech rate on temporal properties of speech categories. Perception and Psychophysics, 67(6), 939–950.
Watkins, A. J. (1991). Central, auditory mechanisms of perceptual compensation for spectral-envelope distortion. Journal of the Acoustical Society of America, 90(6), 2942–2955.
Watkins, A. J., & Makin, S. J. (1994). Perceptual compensation for speaker differences and for spectral-envelope distortion. Journal of the Acoustical Society of America, 96(3), 1263–1282.
Wier, C. C., Jesteadt, W., & Green, D. M. (1977). Frequency discrimination as a function of frequency and sensation level. Journal of the Acoustical Society of America, 61(1), 178–184.
Wong, P., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420–422.

CHAPTER 17

Understanding Speech in the Context of Variability

Shannon Heald, Serena Klos and Howard Nusbaum

Department of Psychology, The University of Chicago, Chicago, IL, USA

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00017-1 © 2016 Elsevier Inc. All rights reserved.

In listening to spoken language, speech perception subjectively seems to be a simple, direct pattern-matching system (cf. Fodor, 1983) because of the immediacy with which we understand what is said. This subjective simplicity is misleading given the difficulty of developing speech recognition devices with human-level performance over the range of conditions under which we have little difficulty understanding speech. A typical approach to understanding the mechanisms of human speech perception is to treat the system as if composed of two parts, a simple acoustic–linguistic pattern matcher and some kind of noise-reducing filter. However, we argue that this simple form of speech recognition, which is based largely on subjective impression, misconstrues the nature of noise-robust processing in speech. Rather than construe noise robustness as a separate system, we argue that it is intrinsic to the definition of the human speech perception system and suggestive of the processes that are necessary to explain how we understand spoken language. We argue here that understanding the mechanisms of speech perception depends on adaptive processing that can respond to contextual variability spanning a wide range of noise and distortion. Further, what counts as signal and what counts as noise (or contextual variability) may well depend on the listener's goals at the moment of listening.

17.1 SPEECH AND SPEAKERS

In perceiving speech, we typically listen to understand what someone is saying (the content of their message), as well as to understand something about who is saying it (speaker identity). Of course the message (word by word) changes much faster than who is delivering the message. But this means that much of speech understanding takes place in the context of a particular speaker, and if the speaker changes, the context changes. Although what is being said changes more frequently in a conversation, there can also be changes between speakers, and such changes are important for the listener to recognize. A shift between talkers can pose a perceptual challenge to a listener due to an increase in the variability of how acoustic patterns map onto phonetic categories. This perceptual challenge is often referred to as the problem of talker variability. For different talkers, a given specific acoustic pattern may correspond to different phonemes perceptually, whereas conversely, across talkers, a given phoneme may be represented in speech by different acoustic patterns (Dorman, Studdert-Kennedy, & Raphael, 1977; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Peterson & Barney, 1952). For this reason, as well as others, changes in talker are important because they may mark significant changes in how acoustic patterns map onto phonetic categories (cf. Nusbaum & Magnuson, 1997). Additionally, recognizing a change in speaker may be important because a listener's attitudes and behavior toward a speaker are often informed by what a listener knows about a speaker (e.g., Thackerar & Giles, 1981). For example, indirect requests are understood in the context of a speaker's status (Holtgraves, 1994). More directly relevant to speech perception, however, a listener's belief about the speaker's social group can alter the perceived intelligibility of the speech (Rubin, 1992). Additionally, dialect (Niedzielski, 1999) and gender (Johnson, Strand, & D'Imperio, 1999) expectations can meaningfully alter phoneme perception, highlighting that social knowledge about a speaker can affect relatively "low-level" perceptual processing of a speaker's message, much in the same way that knowledge of vocal tract characteristics can (Ladefoged & Broadbent, 1957; although see Huang & Holt, 2012; Laing, Lotto, & Holt, 2012 for an auditory explanation of the mechanism that could underlie this).

To understand speech, it is important to know something about the speaker's speech. Of course, speaker recognition is important in its own right, over and above the way that it informs message understanding, although speaker identification for its own sake is much less frequent as a typical listener goal.

In general, there have been two broad views regarding how talker information is used during perception. One account, often called "talker normalization" (Nearey, 1989; Nusbaum & Magnuson, 1997), suggests that listeners use talker information to calibrate or frame the interpretation of a given message to overcome the considerable uncertainty (e.g., acoustic variability, reference resolution) that arises from talker differences. An alternative view suggests that talker information is not used as a context to frame message understanding at all. Rather, the alternative view is that speaker recognition is something listeners need to do for social information or just to identify the speaker as a known individual (see Goldinger, 1998). However, this then suggests that there are two general, independent processes for spoken language understanding—word (or phoneme or syllable) recognition and speaker identification.

The kind of knowledge and information that is used during talker normalization is different from the knowledge used to account for phoneme recognition. To carry out talker normalization, it is necessary to derive information about the talker's vocal characteristics. For example, in Gerstman's (1968) model, the point vowels are used to scale the location of the F1-F2 space to infer all the other vowels produced by a given talker. Because the point vowels represent the extremes of a talker's vowel space, they can be used to characterize the talker's vocal tract extremes and therefore bound the recognition space. Similarly, Syrdal and Gopal's (1986) model scales F1 and F2 using the talker's fundamental frequency and F3 because these are considered to be more characteristic of the talker's vocal characteristics than vowel quality (e.g., Fant, 1973; Peterson & Barney, 1952). Thus, talker normalization models use information about the talker rather than information about the specific message or phonetic context, as in models of phoneme perception such as Trace (McClelland & Elman, 1986), Motor Theory (Liberman, Cooper, Harris, & MacNeilage, 1962; Liberman, Cooper, Harris, MacNeilage, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985), analysis-by-synthesis (Stevens & Halle, 1967), or the Fuzzy Logical Model of Perception (Massaro, 1987; Massaro & Oden, 1980).

Traditionally, speech perception has been described as classifying linguistic units (e.g., phonemes, words) from a mixture of detailed acoustic information that contains both phonetically relevant and irrelevant information. In other words, there is an assumed (typical or idealized) pattern that corresponds to linguistic information combined with noise or distortion of that pattern as a consequence of the process of speaking, including aspects of speech production that are specific to the speaker. Given that the acoustic information about a talker (as opposed to the message) might be viewed as noise in relation to the canonical linguistic units on which speech perception relies, it is sometimes assumed that talker information is lost or stripped away during this process of message recognition (e.g., Blandon, Henton, & Pickering, 1984; Disner, 1980; Green, Kuhl, Meltzoff, & Stevens, 1991). Although it is possible that talker information, even in a talker normalization theory, is preserved in parallel representational structures for other listening goals¹ (e.g., Hasson, Skipper, Nusbaum, & Small, 2007), the concern about losing talker-specific information and recognition of the need for this information for other perceptual goals prompted the alternative view. In this view all auditory information in an utterance is putatively represented in a more veridical fashion that maintains talker-specific auditory information along with phonetically relevant auditory information (e.g., Goldinger, 1998) as well as any environmental noise or distortion. In the details of such theories, there is separate coding that represents both talker-specific auditory information, such as fundamental frequency, and acoustic-phonetic information.² However, because this is an auditory-trace model, there are no specific provisions for the representation or processing of other aspects of talker information such as knowledge about the social group of the talker, the dialect of the talker, or the gender of the talker that might come from glottal waveform or other source information. Further, the echoic encoding account does not explain how talker-specific information that is not acoustic (e.g., visual talker information) can affect speech processing or how other kinds of auditory information (e.g., noise or competing talkers) is filtered out.
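The scaling step in the talker normalization models described above can be sketched in a few lines of code. The fragment below is an illustrative toy, not Gerstman's or Syrdal and Gopal's published algorithm: it rescales a talker's F1-F2 values by the extremes of that talker's point vowels (Gerstman-style) and, alternatively, expresses formants relative to F0 and F3 (in the spirit of Syrdal and Gopal's model, but using simple ratios rather than their bark-scale differences). All frequency values are invented placeholders, not measurements.

```python
# Toy sketch of extrinsic talker normalization (not the published algorithms).
# Gerstman-style: rescale F1/F2 to [0, 1] using the talker's point-vowel extremes.

def gerstman_normalize(f1, f2, point_vowels):
    """point_vowels: list of (F1, F2) pairs for this talker's point vowels (Hz)."""
    f1s = [v[0] for v in point_vowels]
    f2s = [v[1] for v in point_vowels]
    n1 = (f1 - min(f1s)) / (max(f1s) - min(f1s))
    n2 = (f2 - min(f2s)) / (max(f2s) - min(f2s))
    return n1, n2

# Syrdal & Gopal-inspired: code F1 relative to F0 and F2 relative to F3, so
# vowel quality is expressed in talker-relative terms (ratios stand in here
# for the bark-difference dimensions of the actual model).
def ratio_normalize(f0, f1, f2, f3):
    return f1 / f0, f2 / f3

# Two hypothetical talkers whose raw /i/ formants differ, yet whose
# normalized /i/ coordinates coincide after point-vowel rescaling.
talker_a_points = [(270, 2290), (730, 1090), (300, 870)]   # /i/, /a/, /u/ (Hz)
talker_b_points = [(310, 2790), (850, 1220), (370, 950)]

print(gerstman_normalize(270, 2290, talker_a_points))  # -> (0.0, 1.0)
print(gerstman_normalize(310, 2790, talker_b_points))  # -> (0.0, 1.0)
```

The point of the sketch is the one made in the text: the normalization uses only information about the talker (vowel-space extremes, F0, F3), never information about the message being recognized.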

¹There are many examples of parallel representations for other sensory systems: there are multiple somatosensory maps (e.g., Kaas, 2004), visual maps (e.g., Bartels & Zeki, 1998), and auditory maps (e.g., Hackett & Kaas, 2004) in the brain.

²As specified, there is no process to determine what is talker-specific information (e.g., glottal waveshape or idiolectal cue combinations) versus phonetically relevant information, nor is there a way of separating noise, although the model suggests that there are separate codes for talker-specific and phonetic-specific information (Goldinger, 1998).
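The "simple acoustic–linguistic pattern matcher" introduced at the start of the chapter can also be made concrete. The sketch below is illustrative only (the feature vectors are invented placeholders, not acoustic measurements): stored templates are compared with the input by city-block distance and the nearest label wins. A many-to-one mapping is unproblematic, since it is just several one-to-one detectors sharing a label, but a pattern consistent with two phonetic categories, as the text notes can happen across talkers, yields a tie that the matcher cannot resolve without context.

```python
# Minimal distance-based recognizer: nearest stored template wins.
# Distances use the city-block (Minkowski p=1) metric.

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Stored templates: (label, feature vector). Values are hypothetical.
# A many-to-one mapping is just several templates sharing one label.
templates = [
    ("b", (1.0, 0.2)),          # e.g., a rising-formant-transition cue
    ("b", (0.1, 0.9)),          # e.g., a diffuse release-spectrum cue
    ("bit-vowel", (0.5, 0.5)),
    ("bet-vowel", (0.5, 0.5)),  # same pattern, different label: one-to-many
]

def recognize(pattern):
    best = min(cityblock(pattern, vec) for _, vec in templates)
    # return every label found at the minimum distance
    return sorted({lab for lab, vec in templates
                   if cityblock(pattern, vec) == best})

print(recognize((1.0, 0.2)))  # -> ['b']  (many-to-one: no problem)
print(recognize((0.5, 0.5)))  # -> ['bet-vowel', 'bit-vowel']  (ambiguous)
```

The two-element result in the second call is the one-to-many situation: no amount of distance computation disambiguates it, which is why the chapter argues that additional constraining information, and a mechanism for finding it, is required.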

Moreover, and perhaps more importantly, this view privileges speaker differences as a problem for speech perception. But variation in speaking rate within a single talker can have the same kind of effect (Francis & Nusbaum, 1996) such that one acoustic pattern corresponds to different phonemes and one phoneme can be produced with different acoustic patterns (Miller, O'Rourke, & Volaitis, 1997; Miller & Volaitis, 1989). For example, at slow rates of speech, the acoustic pattern of /b/ is similar to /w/ produced at a fast speaking rate, meaning that one acoustic pattern can correspond to either a /b/ or /w/, depending on the rate of speech. Conversely, any phoneme can be produced at a fast or slow speaking rate, resulting in different acoustic patterns. Of course, changes in phonetic and lexical context can restructure the relationship between acoustic patterns and phonetic categories to produce the lack of invariance problem (Liberman, Cooper, Shankweiler, et al., 1967; Nusbaum & Magnuson, 1997). Thus, talker differences are simply one example of the different kinds of variability that can affect the pattern properties of speech.

Noise can also come from environmental sources such as conversations or machinery, and distortion can be introduced by transmission (e.g., cell phone) or room acoustics. These are all modifications of the speech signal that are extrinsic to the utterance that was emitted from the lips of the person talking. However, some signal modifications (that can affect the pattern structure) of the speech signal arise within the speaker during speech production, such as varying speaking rate or voice amplitude or fundamental frequency changes. This distinction between extrinsic and intrinsic modifications of speech assumes that there is an idealized acoustic pattern corresponding to a linguistic message and that these kinds of signal modifications impair the ability of the listener to recover that putative idealized form from within the noise and distortions. In addition, there is an assumption that the idealized forms of linguistic messages are sufficiently distinctive in differentiating among messages such that once the noise is eliminated, the similarity of a cleaned-up acoustic pattern to known representations of linguistic messages can be determined. While these assumptions underlie almost all theories of speech perception, the problems entailed by these assumptions and how they shape our understanding of the neurobiology of language are seldom examined explicitly.

17.2 THE LACK OF INVARIANCE PROBLEM

In speech perception, the lack of invariance between acoustic patterns in speech and the linguistic interpretation of those patterns is a core challenge to theories of speech perception. Many simple recognition systems assume that, given some pattern as input, the features or structure of the input pattern can be compared mathematically to a set of stored representations of patterns and the distance between the input and each can be computed in some Minkowski metric (e.g., city-block or Euclidean space). This distance can then serve as the basis for the decision criterion for selecting the recognized interpretation. In other words, when stimulus patterns are sufficiently different, recognition is simply a comparison process between the input pattern and the stored representations, with recognition determined by the stored representation that is most similar to the input. However, this kind of approach has traditionally failed for speech recognition.

The problem of lack of invariance in the relationship between the acoustic patterns of speech and the linguistic interpretation of those patterns is a fundamental one. Although the many-to-many mapping between acoustic patterns of speech and perceptual interpretations is a longstanding, well-known issue (e.g., Liberman, Cooper, Shankweiler, et al., 1967), there are two aspects of this problem—many-to-one versus one-to-many mappings—that are not usually distinguished, but may be important to understanding the neural architecture of the language processing system. The core computational problem associated with the many-to-many mapping only truly emerges when a particular pattern has many different interpretations or can be classified in many different ways. Nusbaum and Magnuson (1997) argued that a many-to-one mapping can be understood with a simple deterministic class of mechanisms, whereas a one-to-many mapping can only be solved by nondeterministic mechanisms. In essence, a deterministic system establishes one-to-one mappings between inputs and outputs and thus can be computed by passive mechanisms such as feature detectors. To achieve many-to-one simply requires a set of one-to-one detectors for different input signals. In other words, a many-to-one mapping (e.g., rising formant transitions signaling a labial stop and a diffuse consonant release spectrum signaling a labial stop) can be instantiated as a collection of one-to-one mappings. However, in the case of a one-to-many mapping (e.g., a formant pattern that could signal either the vowel in BIT or BET) there is ambiguity about the interpretation of the input without additional information. One solution is that additional context or information could eliminate some alternative interpretations, such as talker information (Nusbaum & Magnuson, 1997). However, this leaves the problem of determining the nature of the constraining context and how it is processed, which together are arguably contingent on the nature of the ambiguity itself. This suggests that there is no automatic or passive means of identifying and using the constraining information. Thus, an active mechanism that tests hypotheses about interpretations and tentatively identifies sources of constraining information (see Nusbaum & Schwab, 1986) is needed.

In spoken language understanding, active cognitive processing is vital to achieve flexibility and generativity (Nusbaum & Magnuson, 1997; Nusbaum & Schwab, 1986). Active cognitive processing is contrasted with passive processing in terms of the control mechanisms that organize the nature and sequence of cognitive operations (Nusbaum & Schwab, 1986). A passive process is one in which inputs map directly to outputs with no hypothesis testing or information-contingent operations, as in the simple distance-based recognition system already described. Automatized cognitive systems (see Shiffrin & Schneider, 1977) behave as though passive, in that stimuli are mandatorily mapped onto responses without flexibility or any demand on cognitive resources. However, it is important to note that cognitive automatization does not have strong implications for the nature of the mediating control system, such that different mechanisms have been proposed to account for the appearance of automatic processing (e.g., Logan, 1988). By comparison, active cognitive systems have a control structure that permits "information contingent processing," or the ability to change the sequence or nature of operations in the context of new information or uncertainty. In principle, active systems can generate hypotheses to be tested as new information arrives or is derived (Nusbaum & Schwab, 1986) and thus provide substantial cognitive flexibility to respond to novel situations and demands. Understanding how and why such active cognitive processes are involved in speech perception is fundamental to the development of a theory of speech perception. However, what is important for understanding the neurobiology of speech perception is the notion that an active control system could, in principle, implicate brain regions that are outside the traditional perisylvian language processing regions. This assumes that the active control of speech perception requires changes in attention to pattern information, as well as the recruitment of brain regions involved in long-term memory for nonlinguistic knowledge and of working memory systems to maintain alternative linguistic interpretations.

When there are multiple alternative interpretations for a particular acoustic pattern, the information needed to constrain the selection depends on the source of variability that produced the nondeterminism, and this could arise due to variation in speaking rate, or talker, or linguistic context, or other signal modifications. Whether the system uses articulatory or linguistic information or other contextual knowledge as a constraint, the perceptual system needs to use context flexibly as a guide in determining the relevant properties needed for recognition (Nusbaum & Schwab, 1986). The process of eliminating or weighing potential interpretations may involve working memory and changes in attention as alternative interpretations are considered, as well as adapting to new sources of lawful variability in context (Elman & McClelland, 1986). Similar mechanisms may be implicated at higher levels of linguistic processing in spoken language comprehension, although the neural implementation of such mechanisms might well differ depending on whether the nondeterminism occurred at the level of the speech signal, the lexical level, or the sentential level.

The involvement of cognitive mechanisms (e.g., working memory, attention) in speech perception remains controversial. In particular, one such mechanism, adaptability or plasticity in processing, has long been a point of theoretical contention. Although much of the controversy about learning in language processing has focused on syntax, there is also some disagreement about the plasticity of speech processing. At the center of this debate is how the long-term memory structures that guide speech processing are modified to allow for this plasticity while at the same time maintaining and protecting previously learned information from being expunged. This is especially important because newly acquired information may often represent irrelevant information to the system in a long-term sense (Born & Wilhelm, 2012; Carpenter & Grossberg, 1988).

17.3 ADAPTIVE PROCESSING AND PERCEPTUAL LEARNING

To overcome this problem, researchers have proposed various theories, and although there is no consensus, a hallmark characteristic of these accounts is that learning occurs in two stages. In the first stage, the memory system is able to use fast learning and temporary storage to achieve adaptability, and in a subsequent stage, during an offline period such as sleep, this information is consolidated into long-term memory structures if the information is germane (Ashby, Ennis, & Spiering, 2007; Marr, 1971; McClelland, McNaughton, & O'Reilly, 1995). However, this kind of mechanism does not figure into speech recognition theories despite its arguable importance. Traditionally, theories of speech recognition focus less on the formation of category representations and the need for plasticity during recognition than on the stability and structure of the categories (e.g., phonemes) to be recognized. Theories of speech perception often avoid the plasticity–stability trade-off problem by proposing that the basic categories of speech are established early in life, tuned by exposure, and subsequently only operate as a passive detection system (e.g., Abbs & Sussman, 1971; Fodor, 1983; McClelland & Elman, 1986). However, even these kinds of theories suggest that early exposure to a phonological system has important effects on speech processing.

Research has established that adult listeners can learn a variety of new phonetic contrasts from outside their native language (Best, McRoberts, & Sithole, 1988; Lively, Logan, & Pisoni, 1993; Logan, Lively, & Pisoni, 1991; Pisoni, Aslin, Perey, & Hennessy, 1982; Yamada & Tohkura, 1992). For example, Francis and Nusbaum (2002) demonstrated that listeners are able to learn to direct perceptual attention to acoustic cues that were not previously used to form phonetic distinctions in their native language. This change in perceptual processing can be described as a shift in attention (Nusbaum & Schwab, 1986), although other descriptions are used as well. Auditory receptive fields may be tuned (e.g., Cruikshank & Weinberger, 1996; Wehr & Zador, 2003; Weinberger, 1998; Znamenskiy & Zador, 2013) or reshaped as a function of appropriate feedback (cf. Moran & Desimone, 1985) or context (Asari & Zador, 2009). This is consistent with theories of category learning (e.g., Schyns, Goldstone, & Thibaut, 1998) in which category structures are related to corresponding sensory patterns (Francis, Kaganovich, & Driscoll-Huber, 2008; Francis, Nusbaum, & Fenn, 2007). This learning could also be described as cue weighting as observed in the development of phonetic categories (e.g., McMurray & Jongman, 2011; Nittrouer & Lowenstein, 2007; Nittrouer & Miller, 1997). Yamada and Tohkura (1992) describe native Japanese listeners as typically directing attention to acoustic properties of /r/-/l/ stimuli that are not the dimensions used by English speakers, and as such they are not able to discriminate between these categories. This is because Japanese and English listeners distribute attention in the acoustic pattern space for /r/ and /l/ differently, as determined by the phonological function of this space in their respective languages. Perceptual learning of these categories by Japanese listeners suggests a shift of attention to the phonetically relevant cues of English. Recently, McMurray and Jongman (2011) proposed the C-CuRE model of phoneme classification in which the relative importance of cues varies with context. Although the model does not specify a neural mechanism by which such plasticity is implemented, there are a number of possibilities. This kind of approach to learning or modifying phonetic categories provides a mechanism that can support adaptation to contextual variability such as talker variability or other kinds of speech distortion.

Given that learning specific phonetic contrasts depends on changes in perceptual attention, learning the phonetic properties of a particular talker (distinct from more typical talkers) is relevant to the kinds of variability that reflect talker differences or might be produced by variation in speaking rate. For this reason, research on the way listeners learn to recognize low-quality synthetic speech produced by rule should be informative about the processes that underlie adaptation to talker variability and variation in speaking rate (Schwab, Nusbaum, & Pisoni, 1985). Synthetic speech learning has been demonstrated to generalize beyond the training exemplars to novel spoken words and contexts (Greenspan, Nusbaum, & Pisoni, 1988). Thus, listeners can learn the acoustic-phonetics that are idiosyncratic to a particular talker (Dorman et al., 1977). Furthermore, this kind of perceptual learning of the acoustic phonetics of a particular talker results in changes in attention to the speech signal (Francis, Baldwin, & Nusbaum, 2000). Moreover, this shift in attention between acoustic cues produces a restructuring of the perceptual space for that talker's speech (Francis & Nusbaum, 2002). Perceptual learning of a novel talker's speech results in a reduction in the cognitive load on working memory for recognizing the speech (Francis & Nusbaum, 2009). In other words, systematic experience listening to a novel talker allows a listener to learn the acoustic-phonetic mapping for a given talker. This learning increases the intelligibility of the talker's speech, which results from shifting attention to phonetically more relevant cues, which in turn lowers the working memory demands of speech perception. This is a hallmark of an active processing system—information about perceptual classification can be used to direct attention to improve performance and reduce the demands on working memory. However, although this demonstrates the operation of active cognitive processes during speech perception, it is more about learning the categories of speech or modifying those categories to be specific to improving recognition of that talker's speech.

Taken together, this work demonstrates that listeners are able to detect variance with known acoustic-phonetic patterns and to shift attention to appropriate cues, given feedback, in order to reduce uncertainty in interpretation, and it provides a computational solution to the one-to-many lack of invariance problem. Given a set of possible interpretations of a particular acoustic pattern, listeners may shift attention to the cues that discriminate among the alternative interpretations. The question, then, is whether such a mechanism seems to operate in circumstances of contextual (e.g., speaker or speaking rate) variability. If it is the case that sources

of variability impose a nondeterministic computational structure on the problem of perception, then it must follow that we can reject all theories that have an inherently passive control structure. In other words, it is our contention that phonetic constancy must be achieved by an active computational system (e.g., Nusbaum & Morin, 1992; Nusbaum & Schwab, 1986).

17.4 EMPIRICAL EVIDENCE FOR ACTIVE PROCESSING IN TALKER NORMALIZATION

Active control systems use a feedback loop structure to systematically modify computation to converge on a single, stable interpretation (MacKay, 1951, 1956). By comparison, passive control structures represent invariant computational mappings between inputs and outputs. In consideration of this distinction, there are two general patterns of behavioral performance that can be taken as empirical evidence for the operation of an active control system (see Nusbaum & Schwab, 1986, for a discussion). First, evidence of load sensitivity in processing should provide an argument for active processing. There are several ways to justify this claim. For example, automatized processing in perception occurs when there is an invariant mapping between targets and responses, whereas controlled—load-sensitive—processing occurs when there is uncertainty regarding the identity of targets and distractors over trials or when there is no simple single feature difference to distinguish targets and distractors (e.g., Shiffrin & Schneider, 1977; Treisman & Gelade, 1980). In other words, when there are multiple possible interpretations of a stimulus pattern, processing shows load sensitivity, which may be manifest as an increase in processing time, a decrease in recognition accuracy, or an interaction with an independent manipulation of cognitive load (Navon & Gopher, 1979) such as digit preload (e.g., Baddeley, 1986; Logan, 1979).

Second, the appearance of processing flexibility, as demonstrated by the effects of listener expectations, context effects, learning, or other forms of online strategic processing, should indicate active processing. Although an active process need not demonstrate this kind of flexibility, a passive process by virtue of its invariant computational mapping certainly cannot. This means, for example, that evidence for the effects of higher-order linguistic knowledge on a lower-level perceptual task, such as lexical influence on phonetic recognition (e.g., Samuel, 1986), should implicate an active control system in processing.

There is definitely a great deal of evidence arguing that speech perception is load-sensitive under conditions of talker variability. For example, the accuracy of word recognition in noise and word recall is reduced when there is talker variability (speech produced by several talkers) compared with a condition in which a single talker produced the speech (Creelman, 1957; Martin, Mullennix, Pisoni, & Summers, 1989; Mullennix, Pisoni, & Martin, 1989). Talker variability also slows recognition time for vowels, consonants, and spoken words in a number of different experiments using a range of different paradigms (Mullennix & Pisoni, 1990; Nusbaum & Morin, 1992; Summerfield & Haggard, 1975). This provides some basic evidence that perception of speech is sensitive to talker variability, but it does not really indicate why this occurs.

Our view is that the evidence regarding the load sensitivity of the human listener when there is talker variability provides strong evidence that speech perception is performed by an active process. Furthermore, evidence of the flexibility of human listeners in processing speech, given talker variability, provides additional support. For example, we have found that listeners shift attention to different acoustic cues when there is a single talker and when there is talker variability (Nusbaum & Morin, 1992). In one condition, subjects monitored a sequence of spoken vowels for a specified target vowel, and the vowels were produced by one talker. In a second condition, a mix of different talkers produced the vowels. Both of these conditions were given with four different sets of vowels that were produced by LPC resynthesis of natural vowels used in our other experiments (Nusbaum & Morin, 1992). One set consisted of intact, four-formant voiced vowels. A second set consisted of the same vowels with voicing turned off to produce whispered counterparts. A third set was produced by filtering out all information above F2. A fourth set combined whispering with filtering to eliminate F0 and formant information above F2.

If listeners recognize vowels using a mechanism similar to the one described by Syrdal and Gopal (1986), then fundamental frequency and F3 information (although see Johnson, 1989, 1990a) should be necessary to recognition under all circumstances, because their view is that this information provides a talker-independent specification of vowel identity. This predicts that in both the single-talker and mixed-talker conditions, the intact voiced vowels should be recognized most accurately, with whispering or filtering reducing performance somewhat and the combination reducing performance the most, because these modifications eliminate critical information for vowel recognition. Our results showed that in the single-talker condition, recognition performance was uniformly high across all four sets of stimuli. In the mixed-talker condition, however, accuracy dropped systematically as a function of the modifications of the stimuli, with the voiced, intact vowels recognized most accurately and the whispered,

filtered vowels recognized least accurately (Nusbaum & Morin, 1992). If vowel recognition were performed by a passive, talker-independent mechanism (e.g., Syrdal & Gopal, 1986), then the same pattern of results should have been obtained in both the single-talker and mixed-talker conditions. The results we obtained suggest that listeners only direct attention to F0 and F3 when there is talker variability (cf. Johnson, 1989, 1990a). This kind of strategic flexibility in recognition is strong evidence of an active mechanism. Furthermore, it suggests that the reason for the increase in cognitive load given talker variability may be that the listener must distribute attention over more cues in the signal than when there is a single talker. Wong, Nusbaum, and Small (2004) demonstrated that talker variability increased brain activity consistent with an increase in cognitive load and mobilization of attention in superior parietal cortex.

Listener expectations affect talker normalization processes as well. In a previous study, we found that not all talker differences increase recognition time in a mixed-talker condition (Nusbaum & Morin, 1992; also see Johnson, 1990a). When the vowel spaces of talkers are sufficiently similar and their fundamental frequencies are similar, there may be no difference in recognizing targets when speech from these talkers is presented in separate blocks or in the same block of trials. Magnuson and Nusbaum (2007) performed a study designed to investigate more specifically under what conditions talker variability increases recognition time. In this study, two sets of monosyllabic words were synthesized with two different mean F0s differing by 10 Hz. In one condition, a small passage was played to subjects in which two synthetic talkers, differing in F0 by 10 Hz, have a short dialogue. In a second condition, another group of subjects heard a passage in which one synthetic talker used a 10-Hz pitch increment to accent certain words. Both groups then listened to exactly the same set of single-pitch and mixed-pitch recognition trials using the monosyllabic stimuli. The subjects who listened to the dialogue between two talkers showed longer recognition times when there was a mix of the two different F0s in a trial compared with trials that consisted of words produced at a single F0. By comparison, subjects who expected that the 10-Hz pitch difference was not a talker difference showed no difference in recognition times or accuracy between the single-pitch and mixed-pitch trials. This demonstrates two things. First, the effect of increased recognition time in trials with a mix of F0s cannot be attributed to a simple contrast effect (see Johnson, 1990b) because both groups received exactly the same stimuli. Instead, the increased recognition times in the mixed-pitch trials seem to reflect processing specific to the attribution of the pitch difference to a talker difference and not something about the pitches themselves. Second, and perhaps more important for the present argument, the listeners' expectations affected whether they showed any processing sensitivity to pitch variability. This kind of processing flexibility cannot be accounted for by a simple passive computational system and argues strongly for an active perceptual mechanism (Nusbaum & Schwab, 1986).

17.5 TOWARD AN ACTIVE THEORY OF CONTEXTUAL NORMALIZATION

First and foremost, our view is that contextual normalization—using contextual variability as a perceptual frame for phoneme recognition—is carried out by the normal process of speech perception. In other words, talker or rate normalization is not carried out by a separate module or computational system; it is a consequence of the basic computational structure of the normal operations of speech perception. This stands in sharp contrast to most previous approaches to normalization, which emphasized the problem of computing talker vocal tract limits and scaling vowel spaces or base rates of speaking. It may be more productive to treat the processing of lawful variation as a single perceptual problem and focus on the commonalities rather than separating these problems based on the specific sources of information and knowledge needed to support normalization and recognition.

Second, the effects of talker or rate variability on perceptual processing directly reflect the computational operations needed to achieve phonetic constancy. Increased recognition times and interactions of varying cognitive load with recognition reflect the increased processing demands on capacity that are incurred by talker variability. Contextual variability increases the number of possible alternative interpretations of the signal, thereby increasing the processing demands on the listener. As a corollary of our first point, we predict that the same kinds of processing demands will be observed whenever there is any nondeterministic relationship between acoustic cues and linguistic categories during perceptual processing.

Furthermore, even though there may be some relationship between the information used in talker identification and talker normalization, we claim that the perceptual effects of talker variability are not a consequence of talker identification processes competing with speech understanding. This is likely true of any aspect of contextual variability, although listeners are not typically called on to explicitly identify speaking rate or other forms of contextual variability.

Third, to achieve phonetic constancy, given a nondeterministic relationship between cues and categories, different sources of information and knowledge

beyond the immediate acoustic pattern to be recognized must be brought to bear on the recognition problem. For example, if the F1 and F2 extracted from an utterance could have been intended as either of two different vowels given talker variability, information about the vocal tract that produced the vowels (e.g., from F0 and F3) will be used to provide the context for interpretation. Whenever there is a one-to-many mapping between a particular acoustic pattern and linguistic categories, listeners will have to use information outside the specific pattern to resolve the uncertainty. This information could come from other parts of the signal, previous utterances, linguistic knowledge, or subsequent parts of the utterance.

To realize the kind of computational flexibility required for this approach, it is important to reconceptualize the basic process of speech perception. The standard view of speech perception is that phoneme recognition or auditory word recognition is a process of comparing auditory patterns extracted from an utterance with stored mental representations of pattern information associated with linguistic categories. Our view is that speech perception, as an active process, is basically a cognitive process as described by Neisser (1967) and is more akin to hypothesis testing than pattern matching (cf. Nusbaum & Schwab, 1986). Nusbaum and Henly (1992) have argued that linguistic categories need to be represented by structures that are much more flexible than have been previously proposed. They claimed that a particular linguistic category such as the phoneme /b/ might be better represented by a theory of what a /b/ is. This view is an extension of the argument of Murphy and Medin (1985) regarding more consciously processed, higher-order categories. From this perspective, a theory is a set of statements that provide an explanation that accounts for membership in a category. Rather than view a theory as a set of explicit verbal statements, our view is that a theory representation of a linguistic category is an abstract, general specification regarding the identity and function of that linguistic category. Although this could be couched as a set of features, it is more reasonable to think of a theory as something that would generate a set of features given particular contextual constraints.

Recognizing a particular phoneme or word is a process of generating a set of candidate hypotheses regarding the classification of the pattern structure of an utterance. Conjectures about possible categories that could account for a section of utterance are proposed based on the prior context, listener expectations, and information in the signal. Given a set of alternative classifications for a stretch of signal information, the perceptual system may then carry out tests that are intended to diagnose the specific differences among the alternative classifications. Cognitive load increases as a function of the number of alternatives to be considered and the number of diagnostic tests that must be carried out.

By this view, phonetic constancy is the result of a process of testing hypotheses that have been tailored to distinguish between alternative linguistic interpretations of an utterance. An active control system mediates this process of hypothesis formation and testing. An abstract representation of linguistic categories in terms of theories provides the flexibility to apply diverse forms of evidence to this classification process, allowing the perceptual system to resolve the nondeterministic structure produced by talker variability. These components taken together form a complex inferential system that has much in common with conceptual classification (Murphy & Medin, 1985) and other cognitive processes.

17.6 NEUROBIOLOGICAL THEORIES OF SPEECH PERCEPTION

There are a number of recently proposed theories of speech perception that have been framed in terms of brain regions identified as active during speech processing. To the extent that such theories are described as purely bottom-up recognition systems, wherein auditory coding leads to phonetic coding and then word recognition, it is difficult to reconcile that kind of architecture with evidence suggesting active processing in speech perception. Thus, neurobiological theories that are candidates for explaining recognition given contextual variability need to incorporate feedback or possibly feedforward information, although the difference in such models could well be testable in speech perception experiments. Moreover, such models seldom explicitly address the problem of talker or rate normalization, leaving the issues of the lack of invariance out of the domain of explanation. However, recent models have explicitly divided the neural processing of speech into dorsal and ventral streams, following the neurobiology of visual processing models (Ungerleider & Mishkin, 1982). The difference among the models is typically in the functions attributed to these streams and their relationship. In considering visual perception, Bar (2003) explicitly proposed that the visual dorsal stream, typically conceived of as functional for object location or use (Milner & Goodale, 1995), may also serve as a fast pathway for coarse object classification, projecting through prefrontal cortex to ultimately connect with the ventral stream for object recognition. This proposal of interacting dorsal and ventral streams, interacting with prefrontal mechanisms for working memory, attention control, memory encoding, and goal and value

maintenance is quite different from some of the neurobiological models of speech perception because it explicitly incorporates both feedback and feedforward active processing using neural networks that are not typically viewed as "perceptual." By contrast, neurobiological models of speech perception typically stay close to the perisylvian language areas, even when taking into account task effects in speech processing, albeit not explicitly considering active processing involving more general cognitive systems.

For example, Hickok and Poeppel (2007) have proposed a neurobiological model that explicitly separates ventral and dorsal speech processing streams, identifying these largely with speech object recognition (ventral) and speech perception-production (dorsal). There is a somewhat unusual distinction made in this theory between speech perception and speech recognition as processes that double dissociate both functionally and cortically. Hickok and Poeppel define speech perception (as dissociated from speech recognition) as any sublexical task that involves the discrimination or categorization of auditory input. It is an active process that requires both working memory and executive control, but it does not necessarily lead to the lexical-sentential understanding of the speech signal. One could posit that such a network could play an important role in resolving the lack of invariance problem. Auditory-phonological representations that are "ambiguous" (in the ventral stream), that is, mappings with more than one linguistic interpretation, could, in principle, be resolved using the dorsal projections into auditory working memory and adaptive processing to shift attention between cues. However, this is not how Hickok and Poeppel describe the dorsal stream, which seems more functionally focused on word learning and metalinguistic task performance in speech perception experiments but is not identified as having any role in recognizing spoken language, although the connections are present within the model for this possibility. By contrast, utterances are recognized and understood by the process of speech recognition, which takes place solely within the ventral stream, transforming acoustic signals into mental lexicon representations.

This dual-stream neural network reflects a passive approach to speech recognition. An active process model of speech recognition would suggest that contextual influences processed in the dorsal stream may contribute to this comparison of multiple acoustic cues. However, because the cortical model proposed by Hickok and Poeppel does not explain how the ventral and dorsal streams interact, nor does it make clear how additional conceptual networks cited in the illustration of the model may influence these two pathways, the role of contextually based attentional changes cannot be explained by this model. Elsewhere, Hickok (2012) has argued that the speech motor system (within the dorsal stream) does not play a causal role in speech perception, despite evidence demonstrating that activity within the putative dorsal stream affects speech perception (Davis & Johnsrude, 2007; Skipper, van Wassenhove, Nusbaum, & Small, 2007).

By contrast, other models (e.g., Davis & Johnsrude, 2007; Friederici, 2012; Rauschecker & Scott, 2009) propose a more direct interaction between different pathways and brain regions. Rauschecker and Scott (2009) argue for a forward mapping ventral stream and an inverse mapping dorsal stream, which provides more explicitly for active processing. In the forward mapping pathway, the speech signal is decoded into linguistic categories in the inferior frontal cortex (IFC), which are then translated into articulatory/motor movements in premotor cortex (PMC). These articulatory representations of the speech signal are then sent to the inferior parietal lobe (IPL) as an efference copy. The inverse mapping stream essentially follows the same pathway in the reverse direction. Attentional and intentional demands originating in the IPL moderate the context-dependent motor plans that are activated in PMC and prefrontal cortex. These predictive motor plans are then compared with the sensory input processed by the IFC. Rauschecker and Scott posit that these two processing streams are active simultaneously, as the ventral stream solves the lack of invariance problem of the speech signal while the dorsal stream engages in domain-general linguistic processing beyond the maintenance of the phonological-articulatory loop.

Friederici (2012) argues for a more complex system in which four pathways are involved in speech processing. In many respects, this approach adds complexity to the dual-pathway model to account for sentence-level effects and explicit top-down processing as well as cognitive control mechanisms from prefrontal cortex. In doing so, this model goes beyond the more traditional perisylvian networks, but this is largely to accommodate the demands of syntactic complexity and sentence processing rather than the fundamental problems of lack of invariance in speech. This contrasts with Davis and Johnsrude (2007), who focus more on the role of active processing in basic speech perception. They argue that in the ventral pathway, multiple interpretations of the speech input at various levels of representation must be activated in the IFC so that they may be compared with an echoic record of the incoming acoustic signal in the temporal cortex. This constant maintenance of the speech signal at multiple levels of representation allows top-down projections from the IFC to retune the perception of the acoustic signal at lower levels of the auditory pathway. Similarly, somato-motor representations of the acoustic input in the dorsal stream are projected both to the


IFC and downstream to areas of the temporal cortex to further influence perception of the signal at both upper and lower levels of the pathway. In this way, Davis and Johnsrude have created a neural model that definitively depicts speech perception as an active process.

Despite this improvement over the interpretation of speech recognition as a passive process, and the increased specificity and breadth of brain region interactions, perceptual learning of speech is not sufficiently explained. These models also fail to go beyond the cortico-cortical connections of speech processing. The auditory pathway begins well before the acoustic signal even reaches the primary auditory cortex, and it has been well-established that there are more descending projections to these components of the peripheral nervous system than ascending projections (Huffman & Henson, 1990). Such evidence would suggest that a complete model of the neural substrates of speech perception should include the interaction between the cortical structures of the network and lower-level areas of the nervous system such as the thalamus, the auditory brainstem, and even the cochlea. To fully understand how speech perception adapts to the many variant cues in the speech signal, all components of the auditory pathway must be included in the neural model.

17.7 SUBCORTICAL STRUCTURES AND ADAPTIVE PROCESSING

The restriction of neural models of speech perception to cortical systems is in sharp contrast to the more cognitive-neurobiological models, such as the Ashby and Maddox (2005) model, in which thalamus and striatum play important roles in fast mapping of category representations before slower sensorimotor cortical learning occurs, or the complementary learning systems model (McClelland et al., 1995), in which fast learning occurs in the hippocampus and slower learning occurs in neocortical circuits. Acoustic input travels from the cochlea through the cochlear nucleus, the superior olive, the inferior colliculus, and the medial geniculate nucleus in the thalamus before reaching the primary auditory cortex. The connections between these structures contain twice as many descending projections from cortex as ascending projections to cortex. Evidence from animal models suggests that cortical structures utilize the corticofugal system to engage in egocentric selection, whereby they improve their own input from the brainstem through feedback and lateral inhibition (Suga, Gao, Zhang, Ma, & Olsen, 2002). Such processes allow for rapid readjustment of subcortical processing and long-term adjustments in cortex to facilitate associative learning.

In humans, higher-level cognitive functions clearly have an effect on subcortical structures as low as the cochlea, because selective attention has been shown to enhance the spectral peaks of evoked otoacoustic emissions (Giard, Collet, Bouchet, & Pernier, 1994; Maison, Micheyl, & Collet, 2001) and discrimination training has been directly related to enhanced suppression of click-evoked otoacoustic emissions (de Boer & Thornton, 2008). Despite the growing evidence that the corticofugal system plays an important role in audition, researchers of speech perception continue to overlook this network when delineating the neurobiological underpinnings of speech. If speech perception utilizes the same basic categorization processes as auditory perception in general, then the subcortical structures that play a large role in audition must be included in these neural networks.

The influence of top-down cortical processes on subcortical structures is most apparent in the auditory brainstem. Electrophysiological recordings of the auditory brainstem response have demonstrated that the frequency following response (FFR), a sustained response phase-locked to the fundamental frequency of a periodic stimulus and/or the envelope of the stimulus (Krishnan, 2007), reflects changes in higher-level cognitive processes such as attention and learning. Galbraith and Arroyo (1993) determined that the FFR is modulated by selective attention to dichotic tones, with the attended tone eliciting larger peak amplitudes in the FFR than the ignored tone. The FFR is also affected by the reallocation of attentional resources to another modality. When listeners are presented with auditory and visual stimuli simultaneously but instructed to only attend to the visual stimulus, the signal-to-noise ratio of the FFR to the auditory stimulus decreases compared with when attention is directed to the auditory stimulus (Galbraith, Olfman, & Huffman, 2003). Although in both of these examples the auditory input consisted of tones rather than speech, they clearly establish an attentional influence on the activity of the auditory brainstem, which may modify the signal that ultimately reaches the auditory cortex.

When the FFR is examined in response to speech stimuli, top-down influences of the linguistic categorization processes that occur at the level of the cortex can also be seen. The FFR to synthetic English vowels contains prominent spectral peaks at the first formant harmonics of the signal and smaller peaks at the harmonics between formants (Krishnan, 2002). The enhanced peaks found in the FFR at the first formant suggest that some form of categorization is already occurring at the level of the auditory brainstem. The representation of the input is modified to strengthen the important cues so that they are more prominent than the rest of the signal, indicating that the

translation of signal to lexical representations that occurs in the cortico-cortical connections of the speech network may influence the way in which the auditory brainstem represents the signal as it transfers it to higher points along the auditory pathway. Krishnan, Xu, Gandour, and Cariani (2005) compared the FFRs of Mandarin speakers and English speakers with four lexical tones used in Mandarin. They determined that Mandarin speakers had stronger pitch representations and smoother pitch tracking in their FFRs than did English speakers. They also had stronger representations of the second harmonic (F2) for all four tones. Based on these results, the researchers concluded that language experience may induce changes in the subcortical transfer of auditory input to enhance the representation of relevant linguistic features that are transmitted in the signal. The interaction between experience and brainstem activity is not exclusive to language experience. Musically trained individuals show earlier and larger FFRs and better phase-locking to the fundamental frequency in response to music stimuli as well as speech stimuli from their native language (Musacchia, Sama, Skoe, & Kraus, 2007; Wong, Skoe, Russo, Dees, & Kraus, 2007).

These studies demonstrate the influence of perceptual experience and training on the responses of the auditory brainstem, a relatively low-level neural structure. However, such effects are currently outside the domain of neurobiological theories of speech perception. These data do, however, demonstrate descending and experiential effects on the processing of speech and other acoustic stimuli and can be taken to reflect an active processing system. Speech input varies in many ways, both between talkers and within a single talker. The top-down control of subcortical structures allows the system to adapt to these changes in the signal by enhancing the most relevant spectral cues in the auditory input before it even reaches the cortical speech recognition networks.

17.8 CONCLUSION

With the increase in neuroimaging methods and studies of speech perception, there are a number of new theories, grounded in brain regions, attempting to explain speech perception. In some respects these theories can be viewed as modifications of the longstanding theory of speech perception proposed in the 1800s by Wernicke, focusing mostly on perisylvian brain regions and incorporating aspects of neural architecture from the dorsal-ventral distinction in vision. Although these theories differ in the specificity of phenomena accounted for, there are two aspects of speech processing that remain unaccounted for.

First, the basic problem of lack of invariance in mapping acoustic patterns onto perceptual interpretations is not directly addressed. This problem has hindered the development of computer speech recognition systems, and yet neurobiological models of speech perception do not seem to recognize the need for such explanations. Moreover, this is a general problem of language understanding, not just speech perception. A similar many-to-many mapping can also be found between patterns at the syllabic, lexical, prosodic, and sentential level in speech and the interpretations of those patterns as linguistic messages. This is due to the fact that across linguistic contexts, speaker differences (idiolect, dialect, etc.), and other contextual variations, there are no patterns (acoustic, phonetic, syllabic, prosodic, lexical, etc.) in speech that have an invariant relationship to the interpretation of those patterns. For this reason, it could be beneficial to consider how these phenomena of acoustic perception, phonetic perception, syllabic perception, prosodic perception, lexical perception, and others are related computationally to one another and understand the computational similarities among the mechanisms that may subserve them (Marr, 2010).

Second, the plasticity of human speech perception and language processing, which is likely tied closely to the solution to the lack of invariance problem, is also not taken seriously. Such adaptive processing is at the core of human speech understanding rather than some kind of added-on system and is necessary to explain how listeners cope with talker and rate variability, as well as environmental noise and distortion. Explaining the kinds of neural mechanisms that mediate this kind of adaptive processing is key to explaining human speech perception as well as explaining language understanding more generally. Such explanations are unlikely to reside in exclusively cortical systems and need to take into account the fact that speech perception is carried out within the auditory pathway that has descending projections all the way down to the cochlea. Simply recognizing the need to develop a broader view of speech perception that incorporates brain regions outside the traditional perisylvian network (e.g., prefrontal attention-working memory regions, striatum, thalamus), and the need to explain speech perception in the context of an intrinsically active auditory system that has descending innervation to the cochlea, is an advance over theories derived from 1800s neurology. However, now there is a need to develop theories that
While these theories may differ in are explicit and testable based on this broader view that the degree to which there are feedback connections treats the perceptual processing of contextual vari- among regions, incorporation of brain regions outside ability in speech as central to understanding speech the “traditional” Wernicke language areas, and recognition rather than as a separable system.
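The lack-of-invariance problem can be made concrete with a toy sketch. In the Python fragment below, the use of voice-onset time (VOT) as the sole cue and the specific boundary values are illustrative assumptions only (rate-dependent category boundaries of this general kind are reported by Miller & Volaitis, 1989); no real classifier is this simple:

```python
# Toy illustration of the lack-of-invariance problem: the same acoustic
# cue value maps onto different phoneme categories depending on context
# (here, speaking rate), and one category spans many cue values.
# All numbers are invented for illustration; they are not measured data.

def classify_stop(vot_ms: float, speech_rate: str) -> str:
    """Classify a voice-onset time (ms) as /b/ or /p/.

    The /b/-/p/ boundary is treated as context-dependent: listeners
    accept longer VOTs as voiced when speech is slow (cf. Miller &
    Volaitis, 1989). Boundary values here are invented.
    """
    boundary = {"fast": 20.0, "slow": 35.0}[speech_rate]
    return "/b/" if vot_ms < boundary else "/p/"

# The identical acoustic pattern (VOT = 28 ms) receives two different
# interpretations, so no single acoustic criterion is invariant:
print(classify_stop(28.0, "fast"))  # -> /p/
print(classify_stop(28.0, "slow"))  # -> /b/

# Conversely, one interpretation (/p/) maps onto many acoustic values:
print({classify_stop(v, "fast") for v in (25.0, 40.0, 60.0)})  # -> {'/p/'}
```

Because the mapping runs both ways (one pattern, many interpretations; one interpretation, many patterns), no fixed lookup table from acoustics to phonemes can be invariant, which is the computational point at issue here.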

17. UNDERSTANDING SPEECH IN THE CONTEXT OF VARIABILITY

Acknowledgments

Preparation of this manuscript was supported in part by an ONR grant DoD/ONR N00014-12-1-0850, and in part by the Division of Social Sciences at the University of Chicago.

References

Abbs, J. H., & Sussman, H. M. (1971). Neurophysiological feature detectors and speech perception: A discussion of theoretical implications. Journal of Speech and Hearing Research, 14, 23–36.
Asari, H., & Zador, A. M. (2009). Long-lasting context dependence constrains neural encoding models in rodent auditory cortex. Journal of Neurophysiology, 102, 2638–2656.
Ashby, F. G., Ennis, J. M., & Spiering, B. J. (2007). A neurobiological theory of automaticity in perceptual categorization. Psychological Review, 114, 632–656.
Ashby, F. G., & Maddox, W. T. (2005). Human category learning. Annual Review of Psychology, 56, 149–178.
Baddeley, A. D. (1986). Working memory. Oxford: Oxford Science Publications.
Bar, M. (2003). A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience, 15, 600–609.
Bartels, A., & Zeki, S. (1998). The theory of multistage integration in the visual brain. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 265, 2327–2332.
Best, C. T., McRoberts, G. W., & Sithole, N. M. (1988). Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance, 14, 345.
Blandon, R. A. W., Henton, C. G., & Pickering, J. B. (1984). Towards an auditory theory of speaker normalization. Language and Communication, 4, 59–69.
Born, J., & Wilhelm, I. (2012). System consolidation of memory during sleep. Psychological Research, 76(2), 192–203.
Carpenter, G. A., & Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21, 77–88.
Creelman, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Cruikshank, S. J., & Weinberger, N. M. (1996). Receptive-field plasticity in the adult auditory cortex induced by Hebbian covariance. Journal of Neuroscience, 16, 861–875.
Davis, M. H., & Johnsrude, I. S. (2007). Hearing speech sounds: Top-down influences on the interface between audition and speech perception. Hearing Research, 229, 132–147.
de Boer, J., & Thornton, A. R. D. (2008). Neural correlates of perceptual learning in the auditory brainstem: Efferent activity predicts and reflects improvement at a speech-in-noise discrimination task. The Journal of Neuroscience, 28, 4929–4937.
Disner, S. F. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67, 253–261.
Dorman, M. F., Studdert-Kennedy, M., & Raphael, L. J. (1977). Stop consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception & Psychophysics, 22, 109–122.
Elman, J., & McClelland, J. (1986). Exploiting lawful variability in the speech wave. In J. S. Perkell, & D. H. Klatt (Eds.), Invariance and variability in speech processes (pp. 360–380). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fant, G. (1973). Speech sounds and features. Cambridge: MIT Press.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA: MIT Press.
Francis, A., & Nusbaum, H. C. (2009). Effects of intelligibility on working memory demand for speech perception. Attention, Perception, & Psychophysics, 71, 1360–1374.
Francis, A. L., Baldwin, K., & Nusbaum, H. C. (2000). Learning to listen: The effects of training on attention to acoustic cues. Perception & Psychophysics, 62, 1668–1680.
Francis, A. L., Kaganovich, N., & Driscoll-Huber, C. J. (2008). Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English. Journal of the Acoustical Society of America, 124, 1234–1251.
Francis, A. L., & Nusbaum, H. C. (1996). Paying attention to speaking rate. Proceedings of the International Conference on Spoken Language Processing, Philadelphia.
Francis, A. L., & Nusbaum, H. C. (2002). Selective attention and the acquisition of new phonetic categories. Journal of Experimental Psychology: Human Perception and Performance, 28, 349–366.
Francis, A. L., Nusbaum, H. C., & Fenn, K. (2007). Effects of training on the acoustic phonetic representation of synthetic speech. Journal of Speech, Language and Hearing Research, 50, 1445–1465.
Friederici, A. D. (2012). The cortical language circuit: From auditory perception to sentence comprehension. Trends in Cognitive Sciences, 16, 262–268.
Galbraith, G. C., & Arroyo, C. (1993). Selective attention and brainstem frequency-following responses. Biological Psychology, 37, 3–22.
Galbraith, G. C., Olfman, D. M., & Huffman, T. D. (2003). Selective attention affects human brain stem frequency-following response. Neuroreport, 14(15), 735–738.
Gerstman, L. J. (1968). Classification of self-normalized vowels. IEEE Transactions on Audio Electroacoustics, AU-16, 78–80.
Giard, M., Collet, L., Bouchet, P., & Pernier, J. (1994). Auditory selective attention in the human cochlea. Brain Research, 633, 353–356.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279.
Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modalities: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–536.
Greenspan, S. L., Nusbaum, H. C., & Pisoni, D. B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 421–433.
Hackett, T. A., & Kaas, J. H. (2004). Auditory cortex in primates: Functional subdivisions and processing streams. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge: MIT Press.
Hasson, U., Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2007). Abstract coding of audiovisual speech: Beyond sensory representation. Neuron, 56, 1116–1126.
Hickok, G. (2012). The cortical organization of speech processing: Feedback control and predictive coding in the context of a dual-stream model. Journal of Communication Disorders, 45, 393–402.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Holtgraves, T. M. (1994). Communication in context: The effects of speaker status on the comprehension of indirect requests. Journal of Experimental Psychology: Learning, Memory and Cognition, 20, 1205–1218.
Huang, J., & Holt, L. L. (2012). Listening for the norm: Adaptive coding in speech categorization. Frontiers in Perception Science, 3, 10. Available from: http://dx.doi.org/10.3389/fpsyg.2012.00010, PMC3078024.
Huffman, R. F., & Henson, O. W. (1990). The descending auditory pathway and acousticomotor systems: Connections with the inferior colliculus. Brain Research Reviews, 15(3), 295–323.
Johnson, K. (1989). Higher formant normalization results from integration of F2 and F3. Perception & Psychophysics, 46, 174–180.


Johnson, K. (1990a). The role of perceived speaker identity in F0 normalization of vowels. Journal of the Acoustical Society of America, 88, 642–654.
Johnson, K. (1990b). Contrast and normalization in vowel perception. Journal of Phonetics, 18, 229–254.
Johnson, K., Strand, E. A., & D'Imperio, M. (1999). Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics, 27, 359–384.
Kaas, J. H. (2004). Somatosensory system. In G. Paxinos, & J. K. Mai (Eds.), The human nervous system (2nd ed., pp. 1059–1092). New York, NY: Elsevier Academic Press.
Krishnan, A. (2002). Human frequency-following responses: Representation of steady-state synthetic vowels. Hearing Research, 166, 192–201.
Krishnan, A. (2007). Frequency-following response. In R. F. Burkard, M. Don, & J. J. Eggermont (Eds.), Auditory evoked potentials: Basic principles and clinical application (pp. 313–333). Baltimore, MD: Lippincott, Williams & Wilkins.
Krishnan, A., Xu, Y., Gandour, J. T., & Cariani, P. A. (2005). Encoding of pitch in the human brainstem is sensitive to language experience. Cognitive Brain Research, 25, 161–168.
Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104.
Laing, E. J. C., Lotto, A. J., & Holt, L. L. (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3, 203. Available from: http://dx.doi.org/10.3389/fpsyg.2012.00203, PMC3381219.
Liberman, A. M., Cooper, F. S., Harris, K. S., & MacNeilage, P. F. (1962). A motor theory of speech perception. Proceedings of the speech communication seminar (Vol. 2). Stockholm: Royal Institute of Technology.
Liberman, A. M., Cooper, F. S., Harris, K. S., MacNeilage, P. F., & Studdert-Kennedy, M. (1967). Some observations on a model for speech perception. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form (pp. 68–87). Cambridge: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America, 94, 1242–1255.
Logan, G. D. (1979). On the use of a concurrent memory load to measure attention and automaticity. Journal of Experimental Psychology: Human Perception and Performance, 5, 189–207.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527.
Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify English /r/ and /l/: A first report. The Journal of the Acoustical Society of America, 89, 874–886.
MacKay, D. M. (1951). Mindlike behavior in artefacts. British Journal for the Philosophy of Science, 2, 105–121.
MacKay, D. M. (1956). The epistemological problem for automata. In C. E. Shannon, & J. McCarthy (Eds.), Automata studies. Princeton, NJ: Princeton University Press.
Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance, 33, 391–409.
Maison, S., Micheyl, C., & Collet, L. (2001). Influence of focused auditory attention on cochlear activity in humans. Psychophysiology, 38, 35–40.
Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society B: Biological Sciences, 262, 23–81.
Marr, D. (2010). Vision: A computational investigation into the human representation and processing of visual information. Cambridge, MA: The MIT Press.
Martin, C. S., Mullennix, J. W., Pisoni, D. B., & Summers, W. V. (1989). Effects of talker variability on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory and Cognition, 15, 676–684.
Massaro, D. W. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. Hillsdale, NJ: LEA.
Massaro, D. W., & Oden, G. C. (1980). Speech perception: A framework for research and theory. In N. J. Lass (Ed.), Speech and language: Advances in basic research and practice (Vol. 3, pp. 129–165). New York, NY: Academic Press.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118, 219–246.
Miller, J. L., O'Rourke, T. B., & Volaitis, L. E. (1997). Internal structure of phonetic categories: Effects of speaking rate. Phonetica, 54, 121–137.
Miller, J. L., & Volaitis, L. E. (1989). Effect of speaking rate on the perceptual structure of a phonetic category. Perception & Psychophysics, 46, 505–512.
Milner, A. D., & Goodale, M. A. (1995). The visual brain in action. Oxford: Oxford University Press.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Mullennix, J. W., & Pisoni, D. B. (1990). Stimulus variability and processing dependencies in speech perception. Perception & Psychophysics, 47, 379–380.
Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365–378.
Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289–316.
Musacchia, G., Sama, M., Skoe, E., & Kraus, N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences, 104, 15894–15898.
Navon, D., & Gopher, D. (1979). On the economy of the human-processing system. Psychological Review, 86, 214–255.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088–2113.
Neisser, U. (1967). Cognitive psychology. New York, NY: Appleton-Century-Crofts.
Niedzielski, N. (1999). The effects of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18, 62–85.
Nittrouer, S., & Lowenstein, J. H. (2007). Children's weighting strategies for word-final stop voicing are not explained by auditory capacities. Journal of Speech, Language and Hearing Research, 50, 58–73.
Nittrouer, S., & Miller, M. E. (1997). Predicting developmental shifts in perceptual weighting schemes. Journal of the Acoustical Society of America, 101, 2253–2266.


Nusbaum, H. C., & Henly, A. S. (1992). Listening to speech through an adaptive window of analysis. In B. Schouten (Ed.), The processing of speech: From the auditory periphery to word recognition (pp. 339–348). Berlin: Mouton-De Gruyter.
Nusbaum, H. C., & Magnuson, J. (1997). Talker normalization: Phonetic constancy as a cognitive process. In K. Johnson, & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 109–132). San Diego, CA: Academic Press.
Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, Y. Sagisaka, & E. Vatikiotis-Bateson (Eds.), Speech perception, production, and linguistic structure (pp. 113–134). Tokyo: OHM Publishing Company.
Nusbaum, H. C., & Schwab, E. C. (1986). The role of attention and active processing in speech perception. In E. C. Schwab, & H. C. Nusbaum (Eds.), Pattern recognition by humans and machines: Vol. 1. Speech perception (pp. 113–157). San Diego, CA: Academic Press.
Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184.
Pisoni, D. B., Aslin, R. N., Perey, A. J., & Hennessy, B. L. (1982). Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants. Journal of Experimental Psychology: Human Perception and Performance, 8, 297–314.
Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12, 718–724.
Rubin, D. L. (1992). Nonlanguage factors affecting undergraduates' judgments of nonnative English-speaking teaching assistants. Research in Higher Education, 33, 511–531.
Samuel, A. G. (1986). The role of the lexicon in speech perception. In E. C. Schwab, & H. C. Nusbaum (Eds.), Pattern recognition by humans and machines: Vol. 1. Speech perception (pp. 89–112). San Diego, CA: Academic Press.
Schwab, E. C., Nusbaum, H. C., & Pisoni, D. B. (1985). Effects of training on the perception of synthetic speech. Human Factors, 27, 395–408.
Schyns, P. G., Goldstone, R. L., & Thibaut, J. P. (1998). The development of features in object concepts. Behavioral and Brain Sciences, 21(1), 1–17.
Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84, 127–190.
Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17(10), 2387–2399.
Stevens, K. N., & Halle, M. (1967). Remarks on analysis by synthesis and distinctive features. In W. Walthen-Dunn (Ed.), Models for the perception of speech and visual form (pp. 88–102). Cambridge: MIT Press.
Suga, N., Gao, E., Zhang, Y., Ma, X., & Olsen, J. F. (2002). The corticofugal system for hearing: Recent progress. Proceedings of the National Academy of Sciences, 97, 11807–11814.
Summerfield, Q., & Haggard, M. (1975). Vocal tract normalization as demonstrated by reaction times. In G. Fant, & M. Tatham (Eds.), Auditory analysis and perception of speech (pp. 115–141). London: Academic Press.
Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100.
Thackerar, J. N., & Giles, H. (1981). They are – so they spoke: Noncontent speech stereotypes. Language and Communication, 1, 255–261.
Treisman, A. M., & Gelade, G. (1980). A feature integration theory of attention. Cognitive Psychology, 12, 97–136.
Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge, MA: MIT Press.
Wehr, M., & Zador, A. (2003). Balanced inhibition underlies tuning and sharpens spike timing in auditory cortex. Nature, 426, 442–446.
Weinberger, N. M. (1998). Tuning the brain by learning and by stimulation of the nucleus basalis. Trends in Cognitive Sciences, 2, 271–273.
Wong, P. C. M., Nusbaum, H. C., & Small, S. (2004). Neural bases of talker normalization. Journal of Cognitive Neuroscience, 16, 1173–1184.
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420–422.
Yamada, R. A., & Tohkura, Y. (1992). The effects of experimental variables on the perception of American English /r/ and /l/ by Japanese listeners. Perception & Psychophysics, 52, 376–392.
Znamenskiy, P., & Zador, A. (2013). Corticostriatal neurons in auditory cortex drive decisions during auditory discrimination. Nature, 497, 482–486.

CHAPTER 18

Successful Speaking: Cognitive Mechanisms of Adaptation in Language Production

Gary S. Dell and Cassandra L. Jacobs
Beckman Institute, University of Illinois, Urbana, IL, USA

The language production system works. If a person is older than the age of 4, has no major brain pathology, and has been exposed to linguistic input that accords with their perceptual and motor abilities, then they will have developed a production system that transmits what they want to say. It works when the goal is only to say "hi," and when the speaker attempts to communicate a complicated novel thought that takes several sentences to convey.

Successful linguistic communication is achieved by a division of labor between the speaker and the listener (Ferreira, 2008). Both the production and comprehension systems have to do their job. The speaker has to say something apt and understandable, and the listener must do the rest, which can include compensating for any of the speaker's errors or other infelicities. In this chapter, we focus on how the production system keeps up its end so that the listener is not overly burdened. Our central claim is that the production system benefits from a number of what we call speaker tuning mechanisms. Speaker tuning mechanisms are properties of the system that adapt it to current circumstances and to circumstances that are generally more likely. These include implicit learning mechanisms that create long-term adaptive changes in the production system, and a variety of short-term adaptive devices, including error monitoring, availability-based retrieval, information-density sensitivity, and, finally, audience design. Although we characterize these mechanisms in cognitive rather than neural terms, we include some pointers to relevant neurobiological data and mechanisms. In the following, we describe the production system generally and then focus on the long-term and then the short-term speaker tuning mechanisms.

18.1 LANGUAGE PRODUCTION

The production system turns thoughts into sequences of words, which can be spoken aloud, inwardly spoken, or written down. Traditionally (Levelt, 1989), the production process consists of determining the semantic content of one's utterance (conceptualization), translating that content into linguistic form (formulation), and articulation, as illustrated in Figure 18.1. Here, we focus on the second of these stages, which describes how intended meaning, sometimes called the message, is turned into an ordered set of words that are specified for their phonological content. That is, the formulation stage describes how CHASE (CAT1, RAT1, past) becomes /ðə.kæt.tʃest.ðə.ræt/. (The "1" in "CAT1" represents a particular definite CAT.) Much of the psycholinguistic and neuroscience research on formulation has concerned three subprocesses: (i) lexical access, the retrieval of appropriate words; (ii) grammatical encoding, the specification of the order and grammatical forms of those words; and (iii) phonological encoding, determining the pronunciation of the sequence of words. These are discussed in turn.

18.1.1 Lexical Access

Most experimental, clinical, and theoretical research on production has concerned lexical access and focuses on the production of single-word utterances. When given a picture that has been identified as the concept CAT, how does the speaker retrieve the word "cat"? Lexical access has been characterized as a two-step process (Garrett, 1975). First, the concept is mapped

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00018-3 © 2016 Elsevier Inc. All rights reserved.
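As a purely illustrative sketch of the formulation stage described in this chapter, the Python fragment below walks a message through the two lexical-access steps (lemma access, then phonological access) inside a rigid toy frame. The lexicon entries, the fixed "Det N V Det N" frame, and the transcriptions are invented stand-ins for exposition, not the machinery of any published model:

```python
# Toy sketch of formulation: a message such as CHASE (CAT1, RAT1, past)
# is turned into an ordered, phonologically specified word sequence.
# The lexicon, frame, and transcriptions below are invented for
# illustration only.

LEMMAS = {  # concept -> (lemma, syntactic category); pronunciation absent
    "CAT": ("cat", "noun"),
    "RAT": ("rat", "noun"),
    "CHASE": ("chase", "verb"),
}

PHONOLOGY = {  # word form -> phoneme string (retrieved in the second step)
    "the": "ðə", "cat": "kæt", "rat": "ræt", "chased": "tʃest",
}

def formulate(message):
    """Map a message tuple like ('CHASE', 'CAT1', 'RAT1', 'past')
    to a phonologically specified word sequence."""
    verb, agent, patient, tense = message
    # Step 1: lemma access retrieves word identities plus grammatical
    # category; a rigid toy frame then orders them: Det N V Det N.
    v_lemma, v_cat = LEMMAS[verb]
    subj, subj_cat = LEMMAS[agent.rstrip("0123456789")]  # CAT1 -> CAT
    obj, obj_cat = LEMMAS[patient.rstrip("0123456789")]
    assert v_cat == "verb" and subj_cat == obj_cat == "noun"
    verb_form = v_lemma.rstrip("e") + "ed" if tense == "past" else v_lemma
    words = ["the", subj, verb_form, "the", obj]
    # Step 2: phonological access spells out each selected word's form.
    return ".".join(PHONOLOGY[w] for w in words)

print(formulate(("CHASE", "CAT1", "RAT1", "past")))
# -> ðə.kæt.tʃest.ðə.ræt
```

The point of the separation is visible in the code: step 1 manipulates only grammatical information (category, order, tense), while pronunciation enters only in step 2, mirroring the lemma/word-form distinction discussed in Section 18.1.1.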

[Figure 18.1 (diagram): Conceptualization yields the message, e.g., CHASE (CAT1, RAT1, past). Formulation comprises grammatical encoding with lexical access (1. Lemma access: cat, noun; chase, verb; rat, noun; arranged in a syntactic tree with determiners and past tense) and phonological encoding (2. Phonological access: syllabified, stress-marked forms of "the cat chased the rat"). Articulation follows.]

FIGURE 18.1 Components of the language production system.

onto an abstract lexical representation, variously called the lemma, the L-level representation, or simply the word node. This abstraction identifies the grammatical properties of the word, such as its syntactic category (e.g., noun) and other grammatically relevant features (e.g., number, grammatical gender). Importantly, this level does not specify anything about pronunciation. That comes in the second step, where the word's phonological form, most often viewed as a sequence of phonemes, is retrieved. Intuitive support for the two-step notion comes from speech errors (Fromkin, 1971). Slips can profitably be divided into those that might have arisen during the first step (e.g., semantic errors such as "dog" for "cat") and those that could have happened in the second step (e.g., "cap" for "cat"). Furthermore, the tip-of-the-tongue state ("I know that word! It's on the tip of my tongue") can be characterized as getting stuck between the steps.

Much of the research on lexical access has concerned just how separate the two steps are. For example, the modular discrete-step view states that the first step must be completed before the second step can begin (Levelt, Roelofs, & Meyer, 1999). Alternatively, one could allow for cascading, which blurs the boundaries between the steps by allowing for phonological properties of potential word candidates to be retrieved before a single lexical item has been settled on in the first step. Or, one could allow for interaction, which blurs the steps even further by allowing for relevant representations at each step to influence one another through the interactive spread of activation (see Dell, Nozari, & Oppenheim, 2014 for a recent review of the evidence for interaction between the steps, and see Dell, Schwartz, Nozari, Faseyitan, & Coslett, 2013 and Ueno, Saito, Rogers, & Lambon Ralph, 2011 for proposals regarding the neural correlates of the steps).

18.1.2 Grammatical Encoding

Although most production research concerns single-word utterances, the hallmark of production is the ability to construct multiword utterances, particularly those that the speaker has never said or even heard before. For example, William Blake famously used the phrase "fearful symmetry" to characterize the tiger. And it is not just the poets who are linguistically inventive. Since Chomsky (1959) emphasized the creativity of language, it is a psycholinguistic cliché that most of what speakers say is novel. Regardless of whether this claim is strictly true, there is no doubt that theories must explain the production of novel utterances. The usual explanation is that the production system uses

syntactic-sequential abstractions that specify how word categories can combine to express structured messages. For Blake's phrase, the relevant abstractions would dictate that, in English, adjectives (fearful) precede nouns (symmetry). Production models of grammatical encoding (Bock, 1982; Chang, Dell, & Bock, 2006; Kempen & Hoenkamp, 1987) differ considerably, but all recognize the distinction between categorically specified abstractions and lexical items. Typically, the abstractions are characterized as frames, that is, structures that specify the sequence and phrasal membership of syntactically categorized word-sized slots (Dell, 1986; Garrett, 1975). So, there might be a noun-phrase frame with slots for a singular indefinite determiner, an adjective, and a singular count noun. And this frame may occupy a larger slot in a clausal frame, and so on. Because of the separation between words and their slots, the system has the means to encode new phrases (e.g., "a poetic tiger") by putting known words into known frames in new combinations. Evidence for such a system comes from dissociations in aphasia between individuals with lexical retrieval deficits and those with deficits in syntactic-sequential processes (e.g., see Gordon & Dell, 2003, for review), from functional imaging data that identify different brain areas for word-retrieval and word-combination mechanisms (e.g., Hagoort, 2013), and from structural priming studies, which are reviewed later.

18.1.3 Phonological Encoding

The retrieval of the phonological form of a word results in a sequence of phonological segments: k æ t. The segments are then put together with the segments of surrounding words, and the resulting sequence must be characterized in terms of its syllables and how those syllables are stressed (Levelt et al., 1999). These processes must respect the phonological properties of the language being spoken, including how segments combine to make syllables (phonotactic knowledge), how syllables are organized into higher-level prosodic structures, and how timing, pitch, and intensity vary as a function of those structures. Ultimately, this organized phonological structure guides the articulatory process. The phonological encoding process has been studied by assessing the response time to produce words and syllables (e.g., Cholin, Dell, & Levelt, 2011; Meyer, 1991), by examining phonological speech errors (Warker & Dell, 2006), by measuring the articulatory and acoustic details of utterances (e.g., Goldrick & Blumstein, 2006; Goldstein, Pouplier, Chen, Saltzman, & Byrd, 2007; Lam & Watson, 2010), and more recently by event-related brain potentials and other imaging techniques (e.g., Qu, Damian, & Kazanina, 2012).

18.2 LONG-TERM SPEAKER TUNING: IMPLICIT LEARNING

The production system does its job because it has learned to do so, and the basis for that learning is experience in comprehending and speaking (Chang et al., 2006). Learning, however, is not just something that children do. The typical adult speaker says approximately 16,000 words per day (Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007) and hears and reads many more. This experience adapts the production system so that it is able to make effective choices in the particular circumstances in which it finds itself. We refer to this continual process of adaptation as implicit learning. We claim that this adaptation is a kind of learning because the changes induced are not short-lived, and that the learning is implicit because it is an automatic consequence of linguistic experience that occurs without any intention to learn or awareness of what has been learned. In the remainder of this section, we review implicit-learning research in each of the three production subprocesses mentioned previously. For lexical access, we consider mechanisms of lexical repetition priming and frequency effects, and the possibility of phrasal frequency effects. For grammatical encoding, we discuss the hypothesis that structural priming in production is a form of implicit learning. And, for phonological encoding, we review studies that find implicit learning of novel phonotactic patterns.

18.2.1 Implicit Learning of Words and Phrases

The production system adapts to make the words that it is most likely to use easier to retrieve and articulate. In particular, words that we have said recently are easier to say than words that we have not said recently (repetition priming; e.g., Mitchell & Brown, 1988). We are also, in general, faster and more accurate at producing words that we have more experience saying, that is, frequent words (Caramazza, Costa, Miozzo, & Bi, 2001; Jescheniak & Levelt, 1994). Both repetition priming and frequency effects are thought to arise from, and can be explained by, implicit learning, which optimizes the production system for situations that are more likely to happen.

One manifestation of implicit learning in word production is cumulative semantic interference (e.g., Howard, Nickels, Coltheart, & Cole-Virtue, 2006). When we have to name a picture of the same thing twice (e.g., crow), we benefit from repetition priming. But, if instead of repeating the picture's name we next have to name something that is similar in meaning, but not the same word (e.g., finch), then we produce this word more slowly and have a greater chance of

error (e.g., Schnur, Schwartz, Brecher, & Hodgson, 2006). This negative effect is semantic interference. Oppenheim, Dell, and Schwartz (2010) investigated the "dark" (semantic interference) and "light" (repetition priming) sides of word production using a computational model, aptly called the "dark-side" model. In the model, each experience with a word tunes the production system by prioritizing words that are recently used and, importantly, deprioritizing their competitors, that is, semantically similar words. This tuning consists of the strengthening of connections to words when they are used, but weakening of connections to these words' competitors. As a result, when a word is repeated it becomes relatively more active in the lexical network, effectively by leeching activation from similar words. In this way, repeating the word crow becomes easier, whereas naming different, but semantically similar, words in a sequence (e.g., crow, finch, gull) becomes increasingly difficult. This effect shows that the production system is adaptive, because words that are used and will likely be used again become easier to say, whereas words that could potentially interfere with those words are rendered less accessible and, hence, less disruptive.

We described how lexical access in production involves two steps, retrieval of the abstract lexical item and then retrieval of the item's phonological form. We also noted that semantic errors such as "dog" for "cat" can occur at the first step, but phonological errors such as "cap" or "dat" for "cat" occur during the second step. We also said that recently spoken or high-frequency words (e.g., "cat" as opposed to "feline") are less vulnerable to error because an implicit learning process enhances their retrieval. But does the greater ease associated with common or repeated words apply to both steps or just one of them? Jescheniak and Levelt (1994) proposed that frequency effects in word retrieval are felt largely during the second step. Others (e.g., Knobel, Finkbeiner, & Caramazza, 2008) claim that both steps benefit when the target word is frequent, because implicit learning should have an effect throughout the retrieval process. Kittredge, Dell, Verkuilen, and Schwartz (2008) addressed this question by looking at how target-word frequency affects semantic and phonological errors during picture naming. They presented aphasic participants with pictures to name that varied, among other factors, in their word frequency. They found, as expected, that the odds of saying the right word increased with the frequency of the target, demonstrating that common words are "protected" by their frequency. This protective power was found to prevent both semantic and phonological errors, suggesting that both steps of lexical retrieval benefit from frequency and, more generally, that the production system keeps track of likely events at all levels.

The production system also seems to keep track of and adapts to the degree to which words combine. Janssen and Barber (2012) explored this by looking at whether the frequency of the combination of two words (e.g., red car or red hammer) predicted how easily that phrase was generated. In particular, they presented participants with pictures of colored objects and had them name the object and its color with an appropriate phrase. They found that frequent phrases had faster naming latencies than would be predicted just by the frequency of the first or second word. This suggests that the production system tunes itself to probable events beyond the word level by keeping track of word combinations as well.

18.2.2 Structural Priming

One of the classic findings in psycholinguistics is structural priming, also known as syntactic priming or structural repetition. Structural priming is the tendency for speakers to reuse recently experienced structures. Bock (1986a) gave experimental participants pictures that can be described with either of the two kinds of dative structures (double objects, "The woman handed the boy the paint brush," versus prepositional datives, "The woman handed the paint brush to the boy"). Participants described these pictures after saying an unrelated prime sentence that used either a double-object or prepositional dative structure. Priming was seen in the tendency for speakers to use the structure of the prime when describing the picture. Similar effects were seen for other structural alternations such as active transitive sentences ("Lightning is striking the church") versus passives ("The church is struck by lightning"). The important aspect of this priming is that it appears to be the persistence of an abstract syntactically characterized structure (e.g., the frame: Noun_phrase Auxiliary_verb Main_verb Prepositional_phrase for a full passive), and not the lexical content of the utterance, its meaning, or its intonational properties (Bock & Loebell, 1990). As such, structural priming provides evidence for a production process that uses structural abstractions during grammatical encoding.

Bock and Griffin (2000) claimed that structural priming is not just a temporary change to the system, but instead is a form of implicit learning, akin to the connection weight changes that characterize learning in connectionist models. They provided evidence for this claim by showing that the effect of a prime persists undiminished over at least 10 unrelated sentences (several minutes). If priming were due to temporary activation of a structure, then the prime's influence
would rapidly decay. The evidence that the learning is implicit is that it occurs in brain-damaged speakers who have no explicit memory of the prime sentence (Ferreira, Bock, Wilson, & Cohen, 2008).

Chang et al. (2006) created a computational model that reflected the idea that structural priming is implicit learning. They trained a connectionist model to simulate a child experiencing sentences one word at a time. The model was also given a representation of the intended meaning of some of the sentences that it experienced, with this meaning presumably having been inferred by the child from context. The model learned by "listening" to each sentence and trying to predict each word. When the actual next word was heard, the model then compared its prediction to that word, thus generating a prediction error signal. This error signal was the impetus for the model to change its connection weights so that its future predictions were more accurate. By using prediction error, the model learned the linguistic patterns in the language (e.g., syntactic structures) and how those patterns mapped onto meaning (e.g., Elman, 1993). After this learning, the model was able to produce because "prediction is production" (Dell & Chang, 2014); generating the next word from a representation of previous words and intended meaning is, computationally, a production process. When given a representation of intended meaning, the model's sequence of word predictions constituted the production of a sentence. The key aspect of this model, for our purposes, is that it accounted for structural priming through learning. Even after the model attained "adult" status, it continued to learn. When a prime sentence was experienced, the model's connection weights were changed ever so slightly to favor the subsequent production of sentences with the same structure. Experiencing, for example, a double-object dative inclined the model to produce that structure later. Because the priming was based on weight change, it is a form of learning, thus accounting for Bock and Griffin's finding that structural priming is undiminished over time. Also, the evidence that the implicit learning that characterizes structural priming is based on prediction error comes from demonstrations that less common, and hence more surprising, prime structures lead to more priming than common ones (e.g., Jaeger & Snider, 2013).

18.2.3 Phonotactic Learning

Young children implicitly learn the phonotactic patterns of their language through experience. Such patterns include knowledge about where certain consonants can occur in the syllables in their language; for example, in English, /h/ only occurs at the beginning of a syllable (the onset) and /ng/ occurs only at the end (the coda). Because of their phonotactic knowledge, English speakers can readily produce the phonotactically legal nonword "heng," but not the illegal "ngeh." Evidence that the production system actively uses this knowledge comes from the phonotactic regularity effect on speech errors: slips tend to be phonotactically legal. One might mistakenly produce "nun" as "nung," a phonotactically legal nonword, but not as "ngun" (Wells, 1951).

Warker and Dell (2006) and Dell, Reed, Adams, and Meyer (2000) created an experimental analogue to the phonotactic regularity effect in which participants recited four-syllable tongue twisters such as "hes feng kem neg" at a fast pace. Unbeknownst to the participants, the syllables followed artificial phonotactic patterns that were present only in the experimental materials. For example, a participant's syllables might follow the pattern: During the experiment, /f/ can only be a syllable onset and /s/ can only be a syllable coda (as in the example four-syllable sequence above). Participants would recite several hundred of these sequences in each of four experimental sessions on consecutive days. Because of the fast speech rate, slips were reasonably common. Most often, these involved movements of consonants from one place to another, such as "hes feng kem neg" being spoken as "fes feng kem neg," in which /f/ moved to the first syllable. The crucial feature of the study was whether these slips respected the phonotactics of the experienced syllables. As expected, slips of /h/ and /ng/ respected English phonotactics; /h/ always moves to an onset position and /ng/ always moves to a coda position. The crucial finding, though, was that slips of the artificially restricted consonants (/f/ and /s/ in our example) also respected the local phonotactics of the experiment. Notice in the example that /f/ slips to an onset position, that is, the slip is "legal" with regard to the experimental phonotactic patterns. And this was not just a small statistical tendency; 98% of the slips of experimentally restricted consonants were "legal" in this respect, whereas consonants that were not experimentally restricted in the way that /f/ and /s/ were, often slipped from onset to coda or vice versa (Dell et al., 2000).

Finding that slips respected the experimental distributions of consonants suggests that participants implicitly learned these distributions, and this learning affected their slips. But is this effect truly one of learning, as opposed to some very temporary priming of preexisting knowledge (e.g., priming of a rule that /f/ can be an onset in English)? Evidence that true learning is occurring comes from exposing participants to more complex "second-order" constraints such as: if the vowel is /ae/, then /f/ must be an onset and /s/ must be a coda, but if the vowel is /I/, then /s/ must be an onset and
/f/ must be a coda. Warker and Dell (2006) found that participants' slips did not follow this vowel-dependent second-order constraint on the first day of a 4-day experiment. On the second and subsequent days, though, the slips did obey the constraint (e.g., more than 90% of slips were legal). This suggests that the effect requires consolidation, a period of time (possibly involving sleep; Warker, 2013) in which the results of the experience are registered in a relatively permanent way in the brain. After consolidation, the effects appear to remain at least for 1 week (Warker, 2013). Because the effect requires consolidation and is persistent in time, it appears to be a form of learning. Thus, phonotactic-like knowledge and its expression in speech production errors can be tuned by an implicit learning process.

18.3 SHORT-TERM SPEAKER TUNING

Implicit learning is not the only mechanism that allows the production system to fluently generate appropriate, grammatically correct utterances that listeners can easily interpret. There are several adaptive phenomena in production that involve immediate processing, rather than long-term learning. These short-term tuning mechanisms include error monitoring, availability-based production choices, sensitivity to information density, and audience design.

18.3.1 Error Monitoring

Speakers help their listeners by avoiding making speech errors or, when an error occurs, by attempting to correct it. Catching slips before they happen or fixing them after they do requires that speakers do error monitoring. Studies of monitoring suggest that we detect at least half of our overt slips after they happen, and that we detect and block potential errors before they can occur (Baars, Motley, & MacKay, 1975; Levelt, 1983). Evidence that errors can be detected before they are spoken comes from the existence of very rapid detections of overt errors. Levelt (1983) gave the example of "v—horizontal." The speaker started to say "vertical," but quickly stopped and replaced it with the correct "horizontal." The fact that speech was stopped right away (within 100 msec of the onset of the erroneous /v/) demonstrates that the error was almost certainly detected before articulation began. How is this possible? There are two theories of error detection. One is that speakers detect errors by comprehending their own speech and noting if there is a mismatch with what they intended (Hartsuiker & Kolk, 2001; Levelt, 1983). This view—the perceptual loop theory—allows for the comprehension of internal speech before it is produced to explain the fact that errors can be detected before articulation. The alternative is that error detection occurs within the production system itself. An example of this is the conflict detection theory of Nozari, Dell, and Schwartz (2011), which proposes that the production system can assess the extent to which its decisions are conflicted and assumes, when conflict is high, that an error is likely. For example, suppose that during word access, the word CAT was selected during the first lexical-access step, but DOG was also nearly as activated as CAT. That can be taken as a sign that there was a possible error during that step. Similarly, if a particular speech sound, for example, /d/, is selected while another, /k/, is almost as active, again that can be a signal that there may have been a mis-selection, this time during the second access step. Nozari et al. used a computational model to demonstrate that the association between high conflict and error likelihood is a strong one, but also that the association no longer holds when the production system is functioning very poorly. Thus, for some aphasic individuals, conflict would not be an effective predictor of error and such individuals would be expected to have trouble detecting their own errors.

To test the conflict detection theory of monitoring and the competing perceptual loop theory, Nozari et al. (2011) examined how successful aphasic individuals were at detecting their own errors in a picture-naming task. The perceptual loop account predicts that good error detection should be associated with good comprehension because detection is performed by the comprehension system in that theory. In contrast, the conflict detection theory expects good detection to be associated with production rather than comprehension skill. The results supported the conflict detection account. The aphasic patients with higher rates of error detection had relatively good production skills, and comprehension ability was unrelated to error detection rate. Furthermore, Nozari and colleagues showed that patients who were relatively better at the first step of lexical access, but poor at the second step, could detect their first-step errors (e.g., semantic errors) but not their second-step errors (phonological errors). The complement was true as well—doing better on the second step in production implied better detection of second-step errors in particular. These results show that dissociations in production abilities for the lexical-access steps are mirrored in differential abilities to detect errors at the two steps, exactly as expected by the conflict detection theory.

Do the results of Nozari et al. (2011) mean that we do not detect errors by comprehending our own speech? No. These results only point to another possible mechanism for error detection, particularly a mechanism that
can detect errors before they happen. It seems likely that many overtly spoken slips are detected simply by hearing them, as proposed in the perceptual loop theory. In support of this claim, Lackner and Tuller (1979) found that using noise to mask a speaker's speech diminished the speaker's ability to detect their overt phonological errors, demonstrating that perception of the auditory signal plays a role in detection. It is therefore likely that multiple mechanisms contribute to the monitoring process. For example, speakers appear to guard their speech against slips that create taboo words (Motley, Camden, & Baars, 1982; see also Severens, Kühn, Hartsuiker, & Brass, 2012 for an fMRI study of frontal brain regions involved in taboo-word monitoring).

18.3.2 Availability-Based Production

The adage "think before you speak" advises speakers to fully plan their utterances before saying them. The fact that the adage exists suggests that speakers do not routinely do this. Instead, language production involves some degree of incrementality (Kempen & Hoenkamp, 1987): utterances are often constructed and spoken in a piecemeal fashion, with the result that one might start talking before having planned the entire sentence. Because production can be incremental, the retrievability of the various parts of the utterance can influence its structure. For example, when attempting to produce the message illustrated in Figure 18.1, suppose that we are able to retrieve "rat," but have not yet retrieved "cat." We can start the utterance as "The rat..." and then, because English allows for a passive structure, can continue with "was chased by the cat" as we eventually retrieve the other words. Thus, the production system may opportunistically take advantage of the words that are retrieved first and may start with those words. This is the essence of availability-based production. What is retrieved first tends to be said first. More generally, what is available tends to be spoken as soon as it can be. Although this strategy occasionally results in false starts, it makes for an efficient production system (Bock, 1982).

Bock (1986b) provided support for availability-based production by asking speakers to describe pictures such as one in which lightning is striking a church. This can be described with either an active ("Lightning is striking a church") or a passive ("The church is struck by lightning") structure. Earlier in this chapter, we showed how this structural choice can be influenced by structural priming. It turns out that this choice is also sensitive to the relative availability of the words "lightning" and "church." Bock found that participants who had recently experienced the word "thunder," which presumably makes "lightning" more available, were more likely to describe the picture with the active form, making the primed word come out earlier. Similarly, priming "church" made the passive more likely.

One can also see availability-based production at work in choices about optional words. A sentence such as "The coach knew that you missed practice" can be produced with no "that," without changing the meaning of the sentence. So, what determines whether you include the "that"? One possibility is that speakers engage in audience design: when faced with a production choice, they choose what will make the sentence easier for their listener to understand. Notice that if the "that" is missing, then the sentence has a temporary ambiguity when "you" is heard. The "you" can be either the direct object of "knew" or the subject of a new embedded clause. Including the "that" removes the ambiguity. Ferreira (2008) and Ferreira and Dell (2000) suggested an alternative explanation for when "that" is present in these sentences. It has to do with the availability of the material after "that." If "you" has already been retrieved and is ready to go at the point in the sentence after "The coach knew...," then the speaker is more likely to omit "that." But if the speaker is not quite ready with "you," then including "that" is a convenient way to pause and buy time. As described in the subsequent section on information-density sensitivity, there is evidence that speakers do attempt to stretch time out at certain points in a sentence, and including "that" can be thought of as an example of this. Here the issue is whether speakers produce "that" to disambiguate the utterance for their listeners, or because their production systems naturally produce whatever is available.1 If the "you" is immediately available after "The coach knew," then the sentence can grammatically continue without the "that." Ferreira and Dell tested these ideas by comparing the production of four kinds of sentences:

I knew (that) I missed practice. (embedded pronoun is repeated and unambiguously nominative)
You knew (that) you missed practice. (embedded pronoun is repeated and ambiguous)
I knew (that) you missed practice. (embedded pronoun is not repeated and ambiguous)
You knew (that) I missed practice. (embedded pronoun is not repeated and unambiguously nominative)

The sentences were presented and then recalled in situations in which the participants could not remember whether there had been a "that" in the sentence

1. Of course, the grammar does not always allow you to eliminate "that" as a complementizer; "the girl that saw the boy is here" must have it in Standard American English.

and, hence, they tended to use their natural inclinations about whether to include "that." The key variable was the percentage of recalled sentences with "that." The hypothesis that speakers include "that" to help their listeners predicts that the two sentences with the ambiguous embedded "you" will include more instances of "that" than the unambiguous conditions that have the clearly nominative pronoun, "I," as the embedded subject. The availability hypothesis predicts that because of repetition priming, the embedded pronoun (I or you) will be more available if it had just been said as the subject of the main clause. Because their embedded pronouns should be quite available, the two conditions with repeated pronouns are expected to have fewer instances of "that." Across several experiments, there was no tendency for more "that"s in the ambiguous sentences, but repeating the pronoun caused the percentage of "that"s to decrease by approximately 9%. The results clearly supported the availability hypothesis, providing another demonstration that the production system's decisions are opportunistically guided by what is easily retrieved. In the next section, we approach the question of the production of optional words like "that" from another angle.

18.3.3 Information-Density Sensitivity

One way that people may alter production in the short-term is by monitoring for and adjusting the probabilistic characteristics of what they are about to say. Taking a cue from information theory (Shannon, 1948), it has been proposed that speakers control the rate of information conveyed in their utterances so that there are as few points as possible at which the rate is extremely high or extremely low. Recall that, on a formal level, words or structures that are less likely contain more information and that, in the reverse case, redundant or predictable items are associated with less information. The idea is that keeping the information rate constant at a level that listeners can handle maximizes the effective transmission to the listener. Too fast a rate leads to loss of transmission, and too slow a rate wastes time. The hypothesized information constancy in production is termed the smooth signal redundancy hypothesis or, alternatively, uniform information density (UID). This tendency can be assumed to apply at all levels of language production, including lexical choice (Mahowald, Fedorenko, Piantadosi, & Gibson, 2013), syntactic structure (Jaeger, 2006, 2010), and phonetic and phonological output (Aylett & Turk, 2004).

Lexical, syntactic, phonological, and pragmatic predictability and given-ness, as constrained by the discourse or experiment, strongly influence the durations of individual words, as would be expected from the UID. For example, the word nine in "A stitch in time saves nine" is shorter than in the phrase "I'd like nine" (Lieberman, 1963) because the nine in the first example is highly predicted by the previous words. Speakers moderate duration and other prosodic cues in response to these linguistic factors, as has been demonstrated experimentally and in the wild, and this effect is robust even when a large number of other factors are taken into account (Jurafsky, Bell, Gregory, & Raymond, 2001).

Aylett and Turk (2004) examined the relationship between reduction and redundancy, or the contribution of statistical predictability to the short-term manifestation of phonetic output. They modeled the durations of syllables as a function of the degree to which they were predicted by the preceding information and the predictability of the word itself in discourse as well as its frequency of occurrence. They found evidence that individuals regulate the distribution of information in the signal (modulating various prosodic cues like duration, volume, and pitch) in that these cues represent a tradeoff between predictability and acoustic prominence. So, when a word is less predictable, it will carry more information in the sense that it is unexpected, but speakers take this into account by providing additional cues as to the identity of an upcoming word or syllable, such as articulating the word more loudly.

Jaeger (2006, 2010) identified analogous behavior in syntactic flexibility as a function of information density. Using evidence from the optional that structure that we introduced in the previous section, he demonstrated that the choice of whether to include a "that" provides the language production system with a means of redistributing information so that information density is more uniform across the utterance. For example, in sentences such as "My boss confirmed/thinks (that) we were absolutely crazy," speakers were more likely to include "that" when the presence of a complement clause (e.g., "...we were absolutely crazy") is unexpected given the main verb (e.g., "confirmed"). This is because confirm is a verb that most often takes a noun as its argument (e.g., "...confirmed the result"), and so the presence of a complement clause is less probable and therefore more surprising. In contrast, the verb think often takes a complement clause and, because that is more expected, "that" was less likely to be included in the utterance. In general, including a "that" when the complement clause is unexpected makes upcoming linguistic material less surprising for the listener, because "that" very commonly signals a complement clause. The resulting overall structure is much more even in its syntactic surprisal. In this way, the speaker's choices about optional that are sensitive to the goal of minimizing peaks and valleys in the information conveyed during the incremental production of the sentence.

These choices presumably translate into less effort for the listener. It is also possible that producing sentences with more UID directly aids the fluency of the production process, because what may be highly surprising to a listener may be relatively more difficult for speakers to create.

When we say that speakers "monitor" and "adjust" information density, this implies active online control. But control of information rate is not necessarily the result of an active short-term adaptation. Instead, the mechanisms that achieve good information rates may be learned as speakers gain experience about what production choices lead to effective comprehension (Jaeger & Ferreira, 2013). For example, speakers may consistently include "that" in their complement clauses introduced by main verbs such as "confirm" because they have learned that failure to do so leads to misunderstanding. With this view, there is no active control of information density; only the retention of successful speaking habits.

18.3.4 Audience Design

The language production system adapts to one's partner, not only by avoiding high information rates but also by considering the partner's specific needs and abilities; a speaker uses syntax, words, and phonology that the partner will likely understand. As we have outlined throughout this chapter, the production system can adapt to internal moment-by-moment demands, and it can change itself in the long term as a function of experience. In this final section, we consider how the production system goes beyond what is easy for it to do, and instead considers what might best help the other person understand. This consideration is known as audience design. We discuss two examples of such design. First, we consider the way individuals use words that result in more effective communication on a cooperative task via a process called entrainment (Brennan & Hanna, 2009; Clark & Wilkes-Gibbs, 1986), and then how talkers can change their own pronunciation of words to facilitate understanding in phonetic convergence (Pardo, 2006).

Entrainment, or the convergence on a single term between two talkers in a conversation, is a necessary part of communication. It is estimated that approximately 50% of discourse entities are mentioned multiple times in a conversation or text (Recasens, de Marneffe, & Potts, 2013). Given this degree of repetition, it would be useful if speakers could agree on labels for those entities. If one party to a conversation referred to a particular plant as a "bush" and the other called it a "tree," then confusion is likely. Agreement on terms through entrainment removes the confusion. In experimental settings, entrainment has been examined by looking at how participants in a cooperative task describe an object and how that description changes as the participants continue to interact. In Clark and Wilkes-Gibbs (1986), participants had to cooperate to sort a set of abstract visual shapes (made up of "tangrams") often resembling people or animals. Over the course of several turns, both partners came to use similar, eventually convergent, terms or short phrases to describe the items. Early on, a speaker might recognize that the other person does not understand their initial description (e.g., "The next one is the rabbit." "Huh?"), requiring that the speaker elaborate (e.g., "That's asleep, you know, it looks like it's got ears and a head pointing down."). As the experiment continues, the objects' labels become increasingly shorter and the listener's errors in interpreting what is said become rare, suggesting that talkers have optimized label length and form for communicative efficiency. The description of such a figure can go from the very complex on the first exchange ("looks like a person who's ice skating, except they're sticking two arms out in front") to shorter, multiphrasal ("the person ice skating, with two arms") to finally a single noun phrase ("the ice skater"). Thus, the entrainment process adapts the production systems of both participants in such a manner that communication success, rather than production ease, is the goal.

Phonetic convergence is a phenomenon whereby individuals adopt the phonetic and phonological representations of the other talker during a conversation. A person may adopt features of another's accent, such as the famous US southern "pin-pen" merger or a northern cities vowel shift (e.g., "dawg" becomes "dahg"), or even more subtle features such as differences in voice-onset time. Pardo (2006) demonstrated such convergence experimentally. Participants completed a map task in which one partner's map (the receiver's) needs to be drawn to look like the other's (the giver's). Like the tangram task used by Clark and Wilkes-Gibbs, the map task requires cooperative communication. There were many places on the map with standard names provided to the talkers (e.g., abandoned monastery, wheat field, etc.). Phonetic convergence of the speech of the talker pairs was assessed by naïve participants who were asked to judge the degree of similarity of the pairs' pronunciations for these place names as they did the task. Not only did all conversation partners show some degree of convergence, but these effects also arose after very little interaction time: many partners showed convergent phonetics before the halfway point in their dialogue, with convergence persisting into the second half as well. This convergence demonstrates that individuals engage in audience design by adopting the phonetic features of their conversation partner during a cooperative task. Because it is presumably easier for each speaker to use

his or her own accent, phonetic convergence counts as another example in which the adaptation suits the goal of communication, rather than the immediate ease of the production systems of the individual speakers.

18.4 CONCLUSION

Language production is, in one sense, difficult. The speaker has to decide on something worth saying, choose words (out of a vocabulary of 40,000), appropriate syntax, morphology, and phonology, and ultimately has to articulate at the rate of two to three words per second. In another sense, production is easy. We think it takes little effort. Particularly when we are talking about familiar topics, we can at the same time walk, drive, or even play the piano (Becic et al., 2010). The seeming paradox that something so difficult is yet so easy is resolved when we consider the mechanisms presented in this chapter. The production system is continually being tuned by the extraordinary amount of experience we have. We say 16,000 words per day and hear and read a lot more. The implicit learning that results from this input effectively trains the system and tunes it well to its current circumstances. But implicit learning is not the whole story. The production system also makes use of a variety of moment-by-moment mechanisms to compensate for and prevent errors, to promote fluency, and to make the job of the listener easier.

Acknowledgments

Preparation of this chapter was supported by NIH DC000191 and by an NSF fellowship to Cassandra Jacobs.

References

Aylett, M., & Turk, A. (2004). The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47, 31–56.
Baars, B. J., Motley, M. T., & MacKay, D. G. (1975). Output editing for lexical status from artificially elicited slips of the tongue. Journal of Verbal Learning and Verbal Behavior, 14, 382–391.
Becic, E., Dell, G. S., Bock, K., Garnsey, S. M., Kubose, T., & Kramer, A. F. (2010). Driving impairs talking. Psychonomic Bulletin & Review, 17, 15–21.
Bock, J. K. (1982). Towards a cognitive psychology of syntax: Information processing contributions to sentence formulation. Psychological Review, 89, 1–47.
Bock, K. (1986a). Syntactic persistence in language production. Cognitive Psychology, 18, 355–387.
Bock, K. (1986b). Meaning, sound, and syntax: Lexical priming in sentence production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 575–586.
Bock, K., & Griffin, Z. M. (2000). The persistence of structural priming: Transient activation or implicit learning? Journal of Experimental Psychology: General, 129, 177–192.
Bock, K., & Loebell, H. (1990). Framing sentences. Cognition, 35, 1–39.
Brennan, S. E., & Hanna, J. E. (2009). Partner-specific adaptation in dialog. Topics in Cognitive Science, 1, 274–291.
Caramazza, A., Costa, A., Miozzo, M., & Bi, Y. (2001). The specific-word frequency effect: Implications for the representation of homophones in speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 1430–1450.
Chang, F., Dell, G. S., & Bock, K. (2006). Becoming syntactic. Psychological Review, 113, 234–272.
Cholin, J., Dell, G. S., & Levelt, W. J. M. (2011). Planning and articulation in incremental word production: Syllable-frequency effects in English. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 109–122.
Chomsky, N. (1959). A review of Skinner's Verbal Behavior. Language, 35, 26–58.
Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.
Dell, G. S. (1986). A spreading activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.
Dell, G. S., & Chang, F. (2014). The P-chain: Relating sentence production and its disorders to comprehension and acquisition. Philosophical Transactions of the Royal Society B, 369, 20120394.
Dell, G. S., Nozari, N., & Oppenheim, G. M. (2014). Word production: Behavioral and computational considerations. In M. Goldrick, V. S. Ferreira, & M. Miozzo (Eds.), The Oxford handbook of language production. Oxford, UK: Oxford University Press.
Dell, G. S., Reed, K. D., Adams, D. R., & Meyer, A. S. (2000). Speech errors, phonotactic constraints, and implicit learning: A study of the role of experience in language production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1355–1367.
Dell, G. S., Schwartz, M. F., Nozari, N., Faseyitan, O., & Coslett, H. B. (2013). Voxel-based lesion-parameter mapping: Identifying the neural correlates of a computational model of word production. Cognition, 128, 380–396.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.
Ferreira, V. S. (2008). Ambiguity, availability, and a division of labor for communicative success. Psychology of Learning and Motivation, 49, 209–246.
Ferreira, V. S., Bock, K., Wilson, M. P., & Cohen, N. J. (2008). Memory for syntax despite amnesia. Psychological Science, 19, 940–946.
Ferreira, V. S., & Dell, G. S. (2000). The effect of ambiguity and lexical availability on syntactic and lexical production. Cognitive Psychology, 40, 296–340.
Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 47, 27–52.
Garrett, M. F. (1975). The analysis of sentence production. Psychology of Learning and Motivation, 9, 133–177.
Goldrick, M., & Blumstein, S. (2006). Cascading activation from phonological planning to articulatory processes: Evidence from tongue twisters. Language and Cognitive Processes, 21, 649–683.
Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., & Byrd, D. (2007). Dynamic action units slip in speech production errors. Cognition, 103, 386–412.
Gordon, J. K., & Dell, G. S. (2003). Learning to divide the labor: An account of deficits in light and heavy verb production. Cognitive Science, 27, 1–40.
Hagoort, P. (2013). MUC (Memory, Unification, Control) and beyond. Frontiers in Psychology, 4, 416.


Hartsuiker, R. J., & Kolk, H. H. J. (2001). Error monitoring in speech production: A computational test of the perceptual loop theory. Cognitive Psychology, 42, 113–157.
Howard, D., Nickels, L., Coltheart, M., & Cole-Virtue, J. (2006). Cumulative semantic inhibition in picture naming: Experimental and computational studies. Cognition, 100, 464–482.
Jaeger, T. F. (2006). Redundancy and syntactic reduction in spontaneous speech. Ph.D. dissertation, Stanford University.
Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61, 23–62.
Jaeger, T. F., & Ferreira, V. S. (2013). Seeking predictions from a predictive framework. Behavioral and Brain Sciences, 36, 359–360.
Jaeger, T. F., & Snider, N. E. (2013). Alignment as a consequence of expectation adaptation: Syntactic priming is affected by the prime's prediction error given both prior and recent experience. Cognition, 127, 57–83.
Janssen, N., & Barber, H. A. (2012). Phrase frequency effects in language production. PLoS ONE, 7, e33202.
Jescheniak, J. D., & Levelt, W. J. M. (1994). Word frequency effects in speech production: Retrieval of syntactic information and of phonological form. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 824–843.
Jurafsky, D., Bell, A., Gregory, M., & Raymond, W. D. (2001). Probabilistic relations between words: Evidence from reduction in lexical production. Typological Studies in Language, 45, 229–254.
Kempen, G., & Hoenkamp, E. (1987). An incremental procedural grammar for sentence formulation. Cognitive Science, 11, 201–258.
Kittredge, A. K., Dell, G. S., Verkuilen, J., & Schwartz, M. F. (2008). Where is the effect of frequency in word production? Insights from aphasic picture-naming errors. Cognitive Neuropsychology, 25, 463–492.
Knobel, M., Finkbeiner, M., & Caramazza, A. (2008). The many places of frequency: Evidence for a novel locus of the frequency effect in word production. Cognitive Neuropsychology, 25, 256–286.
Lackner, J. R., & Tuller, B. H. (1979). Role of efference monitoring in the detection of self-produced speech errors. In W. E. Cooper & E. C. T. Walker (Eds.), Sentence processing: Psycholinguistic studies presented to Merrill Garrett. Hillsdale, NJ: Erlbaum.
Lam, T. Q., & Watson, D. G. (2010). Repetition is easy: Why repeated referents have reduced prominence. Memory & Cognition, 38, 1137–1146.
Levelt, W. J. M. (1983). Monitoring and self-repair in speech. Cognition, 14, 41–104.
Levelt, W. J. M. (1989). Speaking: From intention to articulation. Cambridge, MA: MIT Press.
Levelt, W. J. M., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22, 1–38.
Lieberman, P. (1963). Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech, 6, 172–187.
Mahowald, K., Fedorenko, E., Piantadosi, S. T., & Gibson, E. (2013). Info/information theory: Speakers choose shorter words in predictive contexts. Cognition, 126, 313–318.
Mehl, M. R., Vazire, S., Ramirez-Esparza, N., Slatcher, R. B., & Pennebaker, J. W. (2007). Are women really more talkative than men? Science, 317, 82.
Meyer, A. S. (1991). The time course of phonological encoding in language production: Phonological encoding inside a syllable. Journal of Memory and Language, 30, 69–89.
Mitchell, D. B., & Brown, A. S. (1988). Persistent repetition priming in picture naming and its dissociation from recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 213–222.
Motley, M. T., Camden, C. T., & Baars, B. J. (1982). Covert formulation and editing of anomalies in speech production: Evidence from experimentally elicited slips of the tongue. Journal of Verbal Learning and Verbal Behavior, 21, 578–594.
Nozari, N., Dell, G. S., & Schwartz, M. F. (2011). Is comprehension necessary for error detection? A conflict-based account of monitoring in speech production. Cognitive Psychology, 63, 1–33.
Oppenheim, G. M., Dell, G. S., & Schwartz, M. F. (2010). The dark side of incremental learning: A model of cumulative semantic interference during lexical access in speech production. Cognition, 114, 227–252.
Pardo, J. S. (2006). On phonetic convergence during conversational interaction. The Journal of the Acoustical Society of America, 119, 2382–2393.
Qu, Q., Damian, M. F., & Kazanina, N. (2012). Sound-sized segments are significant for Mandarin speakers. Proceedings of the National Academy of Sciences, 109, 14265–14270.
Recasens, M., de Marneffe, M. C., & Potts, C. (2013). The life and death of discourse entities: Identifying singleton mentions. In Proceedings of NAACL-HLT (pp. 627–633). Atlanta, GA.
Schnur, T. T., Schwartz, M. F., Brecher, A., & Hodgson, C. (2006). Semantic interference during blocked-cyclic naming: Evidence from aphasia. Journal of Memory and Language, 54, 199–227.
Severens, E., Kühn, S., Hartsuiker, R. J., & Brass, M. (2012). Functional mechanisms involved in the internal inhibition of taboo words. Social, Cognitive, and Affective Neuroscience, 7, 431–435.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Ueno, T., Saito, S., Rogers, T. T., & Lambon Ralph, M. A. (2011). Lichtheim 2: Synthesizing aphasia and the neural basis of language in a neurocomputational model of the dual dorsal-ventral language pathways. Neuron, 72, 385–396.
Warker, J. A. (2013). Investigating the retention and time course for phonotactic constraint learning from production experience. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 96–109.
Warker, J. A., & Dell, G. S. (2006). Speech errors reflect newly learned phonotactic constraints. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 387–398.
Wells, R. (1951). Predicting slips of the tongue. Yale Scientific Magazine, 3, 9–30.

CHAPTER 19

Speech Motor Control from a Modern Control Theory Perspective

John F. Houde1 and Srikantan S. Nagarajan1,2

1Department of Otolaryngology–Head and Neck Surgery, University of California, San Francisco, CA, USA; 2Department of Radiology and Biomedical Imaging, University of California, San Francisco, CA, USA

19.1 INTRODUCTION

Speech motor control is unique among motor behaviors in that it is a crucial part of the language system. It is the final neural processing step in speaking, where intended messages drive articulator movements that create sounds conveying those messages to a listener (Levelt, 1989). Many questions arise concerning this neural process we call speech motor control. What is its neural substrate? Is it qualitatively different from other motor control processes? Recently, research into other areas of motor control has benefited from a vigorous interplay between people who study the psychophysics and neurophysiology of motor control and engineers who develop mathematical approaches to the abstract problem of control. One of the key results of these collaborations has been the application of state feedback control (SFC) theory to modeling the role of the higher central nervous system (i.e., cortex, the cerebellum, thalamus, and basal ganglia; hereafter referred to as "the CNS") in motor control (Arbib, 1981; Guigon, Baraduc, & Desmurget, 2008b; Shadmehr & Krakauer, 2008; Todorov, 2004; Todorov & Jordan, 2002). SFC postulates that the CNS controls motor output by estimating the current state of the thing (e.g., the arm) being controlled and by generating controls based on this estimated state. SFC has successfully predicted a great range of the phenomena seen in nonspeech motor control, but as yet it has not received attention in the speech motor control community. Here, we review some of the key characteristics of how sensory feedback appears to be used during speaking and what this says about the role of the CNS in the speech motor control process. Along the way, we discuss prior efforts to model this role, but ultimately we argue that such models can be seen as approximating characteristics best modeled by SFC. We conclude by presenting an SFC model of the role of the CNS in speech motor control and discuss its neural plausibility.

19.2 THE ROLE OF THE CNS IN PROCESSING SENSORY FEEDBACK DURING SPEAKING

It is not controversial that the CNS plays a role in speech motor output: cortex appears to be a main source of motor commands in speaking. In humans, the speech-relevant areas of motor cortex (M1) make direct connections with the motor neurons of the lips, tongue, and other speech articulators (Jürgens, 1982, 2002; Ludlow, 2004). Damage to these M1 areas causes mutism and dysarthria (Duffy, 2005; Jürgens, 2002). However, it is much less clear what the role of the CNS is in processing sensory feedback from speaking.
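The SFC idea introduced above (estimate the current state of the controlled plant, then generate controls from the estimate) can be sketched in a few lines. The point-mass dynamics, gains, and 5 ms tick below are invented purely for illustration and are not taken from any published speech model:

```python
# State feedback control (SFC) in miniature: a 1-D point mass with state
# x = [position, velocity] is driven to a goal by u = -K @ (x_est - goal).
# All constants are illustrative, not from any published speech model.
import numpy as np

dt = 0.005                      # 5 ms control ticks
A = np.array([[1.0, dt],        # discretized point-mass dynamics
              [0.0, 1.0]])
B = np.array([0.0, dt])         # control input changes velocity
K = np.array([80.0, 12.0])      # hand-tuned feedback gains

x = np.array([0.0, 0.0])        # true state: at rest, away from the goal
goal = np.array([1.0, 0.0])     # reach position 1.0 and stop

for _ in range(600):            # 3 s of simulated movement
    x_est = x                   # idealized estimator: perfect state knowledge
    u = -K @ (x_est - goal)     # control law computed from the estimated state
    x = A @ x + B * u

print(round(float(x[0]), 3))    # settles at the goal position
```

In a full SFC model the line `x_est = x` is replaced by an observer that must infer the state from noisy, delayed feedback, which is exactly the problem the rest of this chapter develops.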
Sensory feedback, especially auditory feedback, is critically important for children learning to speak (Borden, Harris, & Raphael, 1994; Levitt, Stromberg, Smith, & Gold, 1980; Oller & Eilers, 1988; Osberger & McGarr, 1982; Ross & Giolas, 1978; Smith, 1975). However, once learned, the control of speech has the characteristics of being both responsive to, yet not completely dependent on, sensory feedback. In the absence of sensory feedback, speaking is only selectively disrupted. Somatosensory nerve block impacts only certain aspects of speech (e.g., lip rounding, fricative constrictions) and, even for these, the impact is not sufficient to prevent intelligible speech (Scott & Ringel, 1971). In postlingually deafened speakers, the control of pitch and loudness degrades rapidly after hearing loss, yet their speech will remain intelligible for decades

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00019-5 © 2016 Elsevier Inc. All rights reserved.

(Cowie & Douglas-Cowie, 1992; Lane et al., 1997). Normal speakers also produce intelligible speech with their hearing temporarily blocked by loud masking noise (Lane & Tranel, 1971; Lombard, 1911).

But this does not mean speaking is largely a feedforward control process that is unaffected by feedback. Delaying auditory feedback (DAF) by approximately a syllable's production time (100–200 ms) is very effective at disrupting speech (Fairbanks, 1954; Lee, 1950; Yates, 1963). Masking noise feedback causes increases in speech loudness (Lane & Tranel, 1971; Lombard, 1911), whereas amplifying feedback causes compensatory decreases in speech loudness (Chang-Yit, Pick, Herbert, & Siegel, 1975). Speakers compensate for mechanical perturbations of their articulators (Abbs & Gracco, 1984; Nasir & Ostry, 2006; Saltzman, Lofqvist, Kay, Kinsella-Shaw, & Rubin, 1998; Shaiman & Gracco, 2002; Tremblay, Shiller, & Ostry, 2003), and compensatory changes in speech production are seen when auditory feedback is altered in its pitch (Burnett, Freedland, Larson, & Hain, 1998; Elman, 1981; Hain et al., 2000; Jones & Munhall, 2000a; Larson, Altman, Liu, & Hain, 2008), loudness (Bauer, Mittal, Larson, & Hain, 2006; Heinks-Maldonado & Houde, 2005), formant frequencies (Houde & Jordan, 1998, 2002; Purcell & Munhall, 2006), or, in the case of fricative production, when the center of spectral energy is shifted (Shiller, Sato, Gracco, & Baum, 2007).

Taken together, such phenomena reveal a complex role for feedback in the control of speaking: a role not easily modeled as simple feedback control. Beyond this, however, there are also more basic difficulties with modeling the control of speech as being based on sensory feedback. In biological systems, sensory feedback is noisy due to environment noise and the stochastic firing properties of neurons (Kandel, Schwartz, & Jessell, 2000). Furthermore, when considering the role of the CNS in particular, an even more significant problem is that sensory feedback is delayed. There are several obvious reasons why sensory feedback to the CNS is delayed (e.g., axon transmission times and synaptic delays; Kandel et al., 2000), but a less obvious reason involves the time needed to process raw sensory feedback into features useful in controlling speech. For example, in the auditory domain, there are several key features of the acoustic speech waveform that are important for discriminating between speech utterances. For some of these features, like pitch, spectral envelope, and formant frequencies, signal processing theory dictates that the accuracy with which the features are estimated from the speech waveform depends on the duration of the time window used to calculate them (Parsons, 1987). In practice, this means such features are estimated from the acoustic waveform using sliding time windows approximately 30–100 ms in duration. Such integration-window-based feature estimation methods are slow to respond to changes in the speech waveform, and thus they effectively introduce additional delays in the detection of such changes. Consistent with this theoretical account, studies show that response latencies of auditory areas to changes in higher-level auditory features can range from 30 ms to more than 100 ms (Cheung, Nagarajan, Schreiner, Bedenbaugh, & Wong, 2005; Godey, Atencio, Bonham, Schreiner, & Cheung, 2005; Heil, 2003). A particularly relevant example is the long (~100 ms) response latency of neurons in a recently discovered area of pitch-sensitive neurons in auditory cortex (Bendor & Wang, 2005). As a result, while auditory responses can be seen within 10–15 ms of a sound at the ear (Heil & Irvine, 1996; Lakatos et al., 2005), there are important reasons to suppose that the features needed for controlling speech are not available to the CNS until a significant time (~30–100 ms) after they are peripherally present. This is a problem for feedback control models, because direct feedback control based on delayed feedback is inherently unstable, particularly for fast movements (Franklin, Powell, & Emami-Naeini, 1991).

19.3 THE CNS AS A FEEDFORWARD SOURCE OF SPEECH MOTOR COMMANDS

Given these problems with controlling speech via sensory feedback control, it is not surprising that, in some models of speech motor control, the role of the CNS has been relegated to being a pure feedforward source, outputting desired trajectories for the lower motor system to follow (Ostry, Flanagan, Feldman, & Munhall, 1991, 1992; Payan & Perrier, 1997; Perrier, Ostry, & Laboissiere, 1996; Sanguineti, Laboissiere, & Ostry, 1998; Sanguineti, Laboissiere, & Payan, 1997). In these models, it is the lower motor system (e.g., brainstem and spinal cord) that implements feedback control and responds to feedback perturbations. The inspiration for these models comes from consideration of biomechanics and neurophysiology. A muscle has mechanical spring-like properties that naturally resist perturbations (Hill, 1925; Zajac, 1989), and these spring-like properties are further enhanced by somatosensory feedback to the motor neurons in the brainstem and spinal cord that control the muscle (e.g., for the jaw: Pearce, Miles, Thompson, & Nordstrom, 2003; see also the stretch reflex: Hulliger, 1984; Matthews, 1931; Merton, 1951). This local feedback control of the muscle makes it look, on first approximation, like a spring with an adjustable rest-length that can be set by control descending from the higher levels of the CNS (Asatryan & Feldman, 1965). The muscles affecting an articulator's position (e.g., the muscles controlling the position of the tongue tip) always come in opposing
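The instability of direct control through delayed feedback can be demonstrated with a toy simulation. Assuming, purely for illustration, a one-dimensional integrator plant, a proportional controller with a hypothetical gain, and a 100 ms feedback delay:

```python
def track_error(delay_steps: int, gain: float = 30.0, dt: float = 0.005) -> float:
    """Drive a 1-D integrator plant toward target 1.0 using feedback that is
    delay_steps control ticks old; return the largest |error| in the final
    second of a 5 s run. All constants are illustrative only."""
    x = 0.0
    history = [0.0] * (delay_steps + 1)   # buffer of past observations
    errors = []
    for _ in range(1000):                 # 5 s of simulated control
        observed = history[0]             # stale feedback from delay_steps ago
        u = gain * (1.0 - observed)       # proportional correction
        x += dt * u                       # integrator plant dynamics
        history = history[1:] + [x]
        errors.append(abs(1.0 - x))
    return max(errors[-200:])

print(track_error(delay_steps=0))    # fresh feedback: error shrinks to ~0
print(track_error(delay_steps=20))   # 100 ms old feedback: error blows up
```

With fresh feedback the same controller converges smoothly; delaying the very same feedback signal past the stability limit (gain times delay exceeding pi/2 for this plant) turns the corrections into growing oscillations, which is the engineering point made above.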

pairs, agonists and antagonists, whose contractions have opposite effects on articulator position. Thus, for any given set of muscle activations, an articulator will always come to rest at an equilibrium point where the muscle forces are balanced. In response to perturbations from its current equilibrium point, the articulator will naturally generate forces that return it to the equilibrium point, without any higher-level intervention. This characteristic was the inspiration for models of motor control based on equilibrium point control (EPC) (Bizzi, Accornero, Chapple, & Hogan, 1982; Feldman, 1986; Polit & Bizzi, 1979). EPC models postulate that to control an articulator's movement, the higher-level CNS need only provide the lower motor system with a sequence of desired equilibrium points to specify the trajectory of that articulator. The lower motor system handles responses to perturbations.

In speech, EPC models can explain the phenomenon of "undershoot," or "carryover," coarticulation (Lindblom, 1963). This can be seen when a speaker produces a vowel in a consonant-vowel-consonant (CVC) context: as the duration of the vowel segment is made shorter, the formants of the vowel do not reach (i.e., they undershoot) their normal steady-state values. This undershoot is easily explained by supposing that successive equilibrium points are generated faster than they can be achieved. In the case of a rapidly produced CVC syllable, undershoot of vowel formants would happen if, while it was still moving toward the equilibrium point for the vowel, the tongue was retargeted to the equilibrium point of the following consonant.

There are, however, several problems with the EPC account of the lower motor system being solely responsible for feedback control. First, although both somatosensory (Jürgens, 2002; Kandel et al., 2000) and auditory (Burnett et al., 1998; Jürgens, 2002) pathways make subcortical connections with descending motor pathways, the latencies of responses to somatosensory and auditory feedback perturbations (approximately 50–150 ms) are longer than would be expected for subcortical feedback loops (Abbs & Gracco, 1983). Instead, such response delays appear sufficiently long for neural signals to go to and come from cortex (Kandel et al., 2000). By themselves, such timing estimates do not prove involvement of cortex, but a study by Ito and Gomi using transcranial magnetic stimulation (TMS) gives further evidence (Ito, Kimura, & Gomi, 2005). The authors examined the facilitatory effect of applying a subthreshold TMS pulse to mouth motor cortex on two oral reflexes: the compensatory response by the upper lip to a jaw-lowering perturbation during the production of /ph/ (a soft version of /f/ in Japanese made only with the lips), and a response to upper lip stimulation known to be subcortically mediated, called the perioral reflex. The TMS pulse was applied approximately 10 ms before the time of the reflex response (i.e., at the time motor cortex would be activated if it governed the response). The authors found that motor TMS facilitated only the response to jaw perturbation during /ph/, implicating cortex involvement specifically in the task-dependent perturbation response during speaking.

Perhaps a larger problem with ascribing feedback control to only subcortical levels is that responses to sensory feedback perturbations in speaking often look task-specific. For example, perturbation of the upper lip will induce compensatory movement of the lower lip, but only in the production of bilabials. The upper lip is not involved in the production of /f/, and perturbation of the upper lip before /f/ in /afa/ induces no lower lip response. However, the upper lip is involved in the production of /p/ and, here, perturbation of the upper lip before /p/ in /apa/ does induce compensatory movement of the lower lip (Shaiman & Gracco, 2002). Task-dependence is also seen in responses to auditory feedback. The production of vowels in stressed syllables appears to be more sensitive to immediate auditory feedback than vowels in unstressed syllables (Kalveram & Jancke, 1989; Natke, Grosser, & Kalveram, 2001; Natke & Kalveram, 2001), responses to pitch perturbations are modulated by how fast the subject is changing pitch (Larson, Burnett, Kiran, & Hain, 2000), and responses to loudness perturbations appear to be modulated by syllable emphasis (Liu, Zhang, Xu, & Larson, 2007). Such task-dependent perturbation responses cannot be simply explained with pure feedback control by setting stiffness levels (i.e., muscle impedance) for individual articulators (e.g., upper lip or lower lip), and instead suggest that, depending on the task (i.e., the particular speech target being produced), the higher-level CNS uses sensory feedback to couple the behavior of different articulators in ways that accomplish a higher-level goal (e.g., closing of the lip opening) (Bernstein, 1967; Kelso, Tuller, Vatikiotis-Bateson, & Fowler, 1984; Saltzman & Munhall, 1989).

There is also evidence that the CNS is sensitive to the dynamics of the articulators. In controlling fast movements, the CNS behaves as if it anticipates that the articulators will have dynamical responses to its motor commands. For example, arm movement studies have shown that fast movements are characterized by a "three-phase" muscle activation sequence whereby an initial burst of activation of the agonist muscle accelerates the articulator quickly toward its target, followed by, at approximately mid-movement, a "braking" burst of antagonist muscle activation that decelerates the articulator, causing it to come to rest near the target (followed, in turn, by a weaker agonist burst to further correct the articulator's position) (Hallett, Shahani, & Young, 1975; Shadmehr & Wise, 2005; Wachholder & Altenburger, 1926). Such activation patterns appear to take advantage of the momentum of the arm. When
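The EPC explanation of undershoot can be illustrated with a toy mass-spring "articulator" whose equilibrium point jumps from a consonant position (0.0) to a vowel position (1.0) and back; the stiffness, damping, and timing values are invented for illustration, not fitted to any speech data:

```python
def peak_during_vowel(vowel_ms: float) -> float:
    """Unit mass on a damped spring tracking equilibrium points C -> V -> C.
    Returns the closest approach actually made to the vowel target (1.0).
    All parameters are illustrative only."""
    k, c, dt = 200.0, 20.0, 0.001          # stiffness, damping, 1 ms time steps
    pos, vel = 0.0, 0.0                    # start at the consonant equilibrium
    peak, t_ms = 0.0, 0.0
    for _ in range(400):                   # simulate 400 ms of movement
        # Equilibrium point held at the vowel, then retargeted to the consonant.
        target = 1.0 if t_ms < vowel_ms else 0.0
        acc = k * (target - pos) - c * vel  # spring pulls toward equilibrium
        vel += dt * acc
        pos += dt * vel
        peak = max(peak, pos)
        t_ms += 1.0
    return peak

print(peak_during_vowel(300.0))  # long vowel: the target is essentially reached
print(peak_during_vowel(60.0))   # short vowel: retargeted early, so undershoot
```

When the vowel's equilibrium point is held long enough, the articulator reaches it; when the next equilibrium point arrives early, the articulator turns back before getting there, mirroring the formant undershoot described above.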

equilibrium points are determined for such muscle activations, they appear to follow complex trajectories, initially racing far ahead of the target position before finally converging back to it (Gomi & Kawato, 1996). Yet, in such cases, the actual trajectory of the arm is always a smooth path to the target that greatly differs from the complex equilibrium point trajectory. This mismatch suggests that even if the CNS were outputting "desired" articulatory trajectories to the lower motor system, it does so by taking into account dynamical responses to these trajectory requests, such that a fast smooth motion is achieved.

This ability of the CNS to take articulator dynamics into account can also be seen in speech production. A series of experiments has shown that speakers will learn to compensate for perturbations of jaw protrusion that are dependent on jaw velocity (Nasir & Ostry, 2008, 2009; Tremblay, Houle, & Ostry, 2008; Tremblay et al., 2003). In learning to compensate for such altered articulator dynamics, speakers show that they are formulating articulator movement commands that anticipate and cancel out the effects of those altered dynamics. Thus, the ability to anticipate articulator dynamics is not only a theoretically desirable property of a model of speech motor control but also a property required to account for real experimental results.

Taken together, these several lines of evidence suggest that, rather than simply instructing the lower motor system on what its goals are, the CNS instead likely plays an active role in responding to sensory information about deviations from task goals.

19.4 CURRENT MODELS OF THE ROLE OF THE CNS IN SPEECH MOTOR CONTROL

Current models of speech motor control can trace their lineage back to Fairbanks' early model. With the advent of cybernetic theory (Wiener, 1948) and the discovery of the effects of DAF soon after (Lee, 1950), it was natural that, at a conference in 1953, Fairbanks would propose a model of speech motor control based in large part on the principles of feedback control (Fairbanks, 1954). A key element of Fairbanks' model was a "comparator" that subtracted sensory feedback (including auditory feedback) from a target "input signal," creating an "error signal" that was used in the control of the vocal tract articulators. However, given the aforementioned phenomena concerning auditory feedback, it is not surprising that current models of speech motor control are significantly more complicated than simple feedback control models. Even in Fairbanks' model, the feedback control subsystem does not drive the vocal tract directly. Instead, it is first combined with the "input signal" (the output of a feedforward control subsystem) by a "mixer" element to create the "effective driving signal" that directly controls the vocal tract. This combination of feedback and feedforward control subsystems is similar in design to that of the current Directions into Velocities of Articulators (DIVA) model of speech motor control (Guenther, 1995; Guenther, Ghosh, & Tourville, 2006; Guenther, Hampson, & Johnson, 1998; Guenther & Vladusich, 2012), although the feedforward control subsystem in DIVA is implemented as an internal feedback loop, which we describe further.

Feedback control models can be considered the most extreme implementation of the efference copy hypothesis, where the motor-derived prediction functions as the target output, and comparison with this target/prediction results in a prediction error that directly drives the motor control output. In current speech motor control models, the efference-copy/feedback prediction process is still used to create a correction, but that correction does not directly generate output controls. Instead, it is a contributing factor in the generation of output controls. These models retain the concept of feedback control but put the feedback loop inside the CNS, where processing delays are minimal, with actual sensory feedback forming a slower, possibly delayed and intermittent, external loop that updates the internal feedback loop (Guenther & Vladusich, 2012; Hickok, Houde, & Rong, 2011; Houde & Nagarajan, 2011; Price, Crinion, & Macsweeney, 2011; Tian & Poeppel, 2010). It turns out that such models can be described as variations on the general theory of SFC, developed in the domain of modern control engineering theory, which is based on the concept of a dynamic state.

19.5 THE CONCEPT OF DYNAMICAL STATE

A key feature of these current models of speech motor control is that their outer sensory processing loops are based on comparing incoming feedback with a prediction of that feedback. To make such sensory predictions, the CNS would ideally base them not on what the current articulatory target was, but instead on the actual articulatory commands currently being sent to the articulators (i.e., a true efference copy of the descending motor commands output to the motor units of the articulators). However, without a model of how these motor commands affect articulator dynamics, accurate feedback predictions cannot be made, because it is only through their effects on the dynamics of the articulators
Even Fairbanks appeared to that motor commands affect articulator positions and hedge on proposing a model completely based on feed- velocities, and thus acoustic output and somatosensory back control. In his model, the “error signal” output of feedback from the vocal tract.
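The contrast between a pure Fairbanks-style comparator loop and the trouble that delayed feedback causes for it can be made concrete with a toy simulation. The one-variable plant, the gain, and the delay values below are illustrative assumptions, not parameters from any of the cited models:

```python
# Minimal sketch of a pure feedback-control loop: a comparator subtracts
# (possibly delayed) sensory feedback from a target "input signal," and the
# resulting error signal drives a single articulatory variable.
# The first-order plant and all numbers are invented for illustration.

def run_feedback_loop(target, steps=50, gain=0.4, delay=5):
    """Drive one articulatory variable with delayed error feedback."""
    position = 0.0
    history = [position]
    for t in range(steps):
        # Feedback arrives 'delay' steps late, as with auditory feedback.
        fed_back = history[t - delay] if t >= delay else 0.0
        error = target - fed_back          # comparator output
        position += gain * error           # error signal drives the plant
        history.append(position)
    return history

immediate = run_feedback_loop(target=1.0, delay=0)
delayed = run_feedback_loop(target=1.0, delay=5)
```

With zero delay the loop settles smoothly on the target, but with the five-step delay the same gain produces growing oscillations. This is one intuition for why pure feedback control sits poorly with slow, delayed auditory feedback, and why current models move the fast feedback loop inside the CNS.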

But how can we model the effects of motor commands on articulator dynamics? To say that vocal tract articulators have "dynamics" is another way of saying that how they will move in the future and how they will react to applied controls depend on their immediate history (e.g., the direction in which they were last moving). The past can only affect the future via the present; in engineering terms, the description of the present sufficient to predict how a system's past affects its future is called the dynamical state of the system. It is this concept of dynamical state that is the basis for engineering models of systems and how they respond to applied controls.

Based on these ideas, Figure 19.1 illustrates how the problem of controlling speaking can be phrased in terms of the control of vocal tract state. This discrete-time description represents a snapshot of the speech motor control process at time t, where the controls u_{t-1} formulated at the previous timestep t-1 have now been applied to the muscles of the vocal tract, changing its dynamic state to x_t, which in turn results in the vocal tract outputting y_t. In this process, x_t represents an instantaneous dynamical description of the vocal tract (e.g., positions and velocities of various parts of the tongue, lips, or jaw) sufficient to predict its future behavior, and vtdyn(u_{t-1}, x_{t-1}) expresses the physical processes (e.g., inertia) that dictate what next state x_t will result from controls u_{t-1} being applied to prior state x_{t-1}. The next state x_t is also partly determined by random disturbances w_{t-1} (called state noise). A key part of this formulation is that x_t is not directly observable from sensory feedback. Instead, the output function vtout(x_t) represents all the physical and biophysical processes causing x_t to generate sensory consequences y_t. y_t is also corrupted by noise v_t and delayed by z^{-N}, where N is a vector of time delays representing the time taken to neurally transmit each element of y_t to the higher CNS and process it into a control-usable form (e.g., into pitch, formant frequencies, tongue height). Furthermore, certain elements of y_t can be intermittently unavailable, as when auditory feedback is blocked by noise.

FIGURE 19.1 The control problem in speech motor control. The figure shows a snapshot at time t, when the vocal tract has produced output y_t in response to the previously applied control u_{t-1}.

Therefore, from this description, the control of vocal tract state can be summarized as follows: how can the higher CNS correctly formulate the next controls u_t to be applied to the vocal tract, given access only to previously applied controls u_{t-1} and noisy, delayed, and possibly intermittent feedback y_{t-N}?

19.6 A MODEL OF SPEECH MOTOR CONTROL BASED ON STATE FEEDBACK

An approach to this problem is based on the following idealization, shown in Figure 19.2: if the state x_t of the vocal tract were available to the CNS via immediate feedback, then the CNS could control vocal tract state directly via feedback control. For this reason, this control approach is referred to as SFC. However, as discussed, because x_t is not directly observable from any type of sensory feedback, and because the sensory feedback that does come to the higher CNS is both noisy and delayed, the scheme as shown is unrealizable.

FIGURE 19.2 Ideal SFC. If the controller in the CNS had access to the full internal state x_t of the vocal tract system (red path), then it could ignore feedback y_{t-N} and formulate an SFC law U_t(x_t) that would optimally guide the vocal tract articulators to produce the desired speech output y_t. However, as discussed in the text, the internal vocal tract state x_t is, by definition, not directly available.

As a result, a fundamental principle of SFC is that control must instead be based on a running internal estimate of the state x_t (Jacobs, 1993). The first step toward obtaining this estimate is another idealization. Suppose, as shown in Figure 19.3, the higher CNS had an internal model of the vocal tract, which had accurate forward models of the dynamics v̂tdyn(u_{t-1}, x̂_{t-1}) and output function v̂tout(x̂_t) (i.e., its acoustics, and auditory and somatosensory transformations) of the actual vocal tract. Such an internal model could mimic the response of the real vocal tract to applied controls and provide an estimate x̂_t of the actual vocal tract state. In this situation, the controller could permanently ignore the feedback y_{t-N} of the actual vocal tract and perform ideal SFC U_t(x̂) based only on x̂_t. The controls u_t thus generated would correctly control both the internal model and the actual vocal tract.

But this situation is still idealized. The vocal tract state x_t is subject to disturbances w_{t-1}, and the forward
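As a concrete (and deliberately oversimplified) illustration of this discrete-time picture, the sketch below steps a one-dimensional toy "vocal tract" forward in time. The scalar maps vtdyn and vtout, the noise levels, and the delay are invented for illustration; in the chapter's formulation x_t, u_t, and y_t are vectors and N is a vector of delays:

```python
import random

random.seed(0)
A, B, C = 0.9, 1.0, 2.0   # toy dynamics and output coefficients (assumptions)

def vtdyn(u_prev, x_prev):
    # Articulator dynamics (e.g., inertia): next state from control + prior state.
    return A * x_prev + B * u_prev

def vtout(x):
    # Physical/biophysical processes mapping state to sensory consequences.
    return C * x

def simulate(controls, n_delay=3, w_sd=0.0, v_sd=0.0):
    """x[t] = vtdyn(u[t-1], x[t-1]) + w[t-1];  y[t] = vtout(x[t]) + v[t].
    The higher CNS only ever sees the delayed feedback y[t - N]."""
    x, xs, ys = 0.0, [], []
    for u in controls:
        x = vtdyn(u, x) + random.gauss(0.0, w_sd)      # state noise w[t-1]
        xs.append(x)
        ys.append(vtout(x) + random.gauss(0.0, v_sd))  # sensor noise v[t]
    delayed = [None] * n_delay + ys[:-n_delay]         # y as the CNS receives it
    return xs, ys, delayed

xs, ys, delayed = simulate([0.1] * 100)   # noise defaults to zero: deterministic
```

With a constant control of 0.1 and no noise, the toy state settles at B·u/(1-A) = 1.0, while the CNS's view of the output lags three steps behind and is simply absent at first. The point is only structural: x_t is never seen directly; only y_{t-N} is.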

models v̂tdyn(u_{t-1}, x̂_{t-1}) and v̂tout(x̂_t) could never be assumed to be perfectly accurate. Furthermore, the internal model could not be assumed to start out in the same state as the actual vocal tract. Thus, without corrective help, x̂_t will not, in general, track x_t. Unfortunately, only noisy and delayed sensory feedback y_{t-N} is available to the controller, and y_{t-N} is not tightly correlated with the current vocal tract state x_t. Nevertheless, because y_{t-N} is not completely uncorrelated with x_t, it carries some information about x_t that can be used to correct x̂_t. Figure 19.4 shows how this can be done by augmenting the idealization shown in Figure 19.3 to include the following prediction/correction process. First, in the prediction (green) direction, efference copy of the previous vocal tract control u_{t-1} is input to the forward dynamics model v̂tdyn(u_{t-1}, x̂_{t-1}) to generate a prediction x̂_{t|t-1} of the next vocal tract state. x̂_{t|t-1} is then delayed by z^{-N̂} to match the actual sensory delays. The resulting delayed state estimate x̂_{(t|t-1)-N̂} is input to the forward output model v̂tout(x̂_t) to generate a prediction ŷ_{t-N̂} of the expected sensory feedback y_{t-N}. The resulting sensory feedback prediction error ỹ_{t-N̂} = y_{t-N} − ŷ_{t-N̂} is a measure of how well x̂_t is currently tracking x_t (note, for example, that if x̂_t were perfectly tracking x_t, then ỹ_{t-N̂} would be approximately zero). Next, in the correction (red) direction, the feedback prediction error ỹ_{t-N̂} is converted into a state estimate correction ê_t by the function K_t(ỹ). Finally, ê_t is added to the original next state prediction x̂_{t|t-1} to derive the corrected state estimate x̂_t. By this process, therefore, an accurate estimate of the true vocal tract state x_t can be derived in a feasible way and used by the SFC law U_t(x̂_t) to determine the next controls u_t output to the vocal tract.

FIGURE 19.3 A more realizable model of SFC based on an estimate x̂_t of the true internal vocal tract state x_t. If the CNS had an internal model of the vocal tract, comprising dynamics model v̂tdyn(u_{t-1}, x̂_{t-1}) and sensory feedback model v̂tout(x̂_t), then it could send efference copy (green path) of vocal tract controls u_{t-1} to the internal model, whose state x̂_t is accessible and could be used in place of x_t in the controller's feedback control law U_t(x̂) (red path). However, this scheme only works if x̂_t always closely tracks x_t, which is not a realistic assumption.

As Figure 19.4 indicates, the combination of the internal vocal tract model plus this feedback-based correction process is called an observer (Jacobs, 1993; Stengel, 1994; Tin & Poon, 2005; Wolpert, 1997), which in this case, because it includes allowances for feedback delays, is also a variant of a Smith Predictor (Mehta & Schaal, 2002; Miall, Weir, Wolpert, & Stein, 1993; Smith, 1959). Within the observer, K_t(ỹ) converts changes in feedback to changes in state. When it is optimally determined, K_t(ỹ) is a feedback gain proportional to how correlated the feedback prediction error ỹ_{t-N̂} is with the state prediction error (x_t − x̂_{t|t-1}). Thus, if ỹ_{t-N̂} is highly uncorrelated with (x_t − x̂_{t|t-1}), as happens with large feedback delays or with feedback being blocked, K_t(ỹ) largely attenuates the influence of feedback prediction errors on correcting the current state estimate. When K_t(ỹ) is so optimally determined, it is referred to as the Kalman gain function and the observer is referred to as a Kalman filter (Jacobs, 1993; Kalman, 1960; Stengel, 1994; Todorov, 2006). We also refer to K_t(ỹ) as the Kalman gain function because we assume the speech motor control system would seek an optimal value for this function.

Therefore, SFC is the combination of a control law acting on a state estimate provided by an observer. This is a relatively new way to model speech motor control, but SFC models are well-known in other areas of motor control research. Interest in SFC models of motor control has a long history that can trace its roots all the way back to Nikolai Bernstein, who suggested that the CNS would need to take into account the current state of the body (both the nervous system and articulatory biomechanics) to know the sensory outcomes of motor commands it issued (Bernstein, 1967; Whiting, 1984). Since then, the problem of motor control has been formulated in state-space terms like those discussed here (Arbib, 1981), and observer-based SFC models of reaching motor control have been advanced to explain how people optimize their movements (Guigon et al., 2008b; Shadmehr & Krakauer, 2008; Todorov, 2004; Todorov & Jordan, 2002).

19.7 SFC MODELS MOTOR ACTIONS AS AN OPTIMAL CONTROL PROCESS

Experiments show that people appear to move optimally, not just on average, but in each movement, making optimal responses to perturbations of their movements that take advantage of task constraints (Guigon, Baraduc, & Desmurget, 2008a; Guigon et al., 2008b; Izawa, Rane, Donchin, & Shadmehr, 2008; Kording, Tenenbaum, & Shadmehr, 2007; Li, Todorov, & Pan, 2004; Liu & Todorov, 2007; Shadmehr & Krakauer, 2008; Todorov, 2007; Todorov & Jordan, 2002).
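The observer's predict/correct cycle described in Section 19.6 can be sketched in a few lines. The toy plant below reuses simple linear maps, a fixed scalar gain K stands in for the optimal Kalman gain function K_t(ỹ), and the delay handling is simplified; all names and numbers are illustrative assumptions:

```python
A, B, C = 0.9, 1.0, 2.0   # toy linear dynamics/output maps (assumed known)
N = 3                      # sensory feedback delay, in timesteps
K = 0.2                    # fixed stand-in for the optimal Kalman gain

def run_observer(controls, y_observed):
    """Track the plant from efference copy of controls plus delayed feedback.
    y_observed[t] is the plant output y[t]; the observer may only use it
    N steps late, as y[t - N]."""
    x_hat = 0.0
    predictions = []   # history of predictions, so they can be delayed
    estimates = []
    for t, u_prev in enumerate(controls):
        x_pred = A * x_hat + B * u_prev            # forward dynamics model
        predictions.append(x_pred)
        if t >= N:
            y_hat = C * predictions[t - N]         # predicted feedback
            y_tilde = y_observed[t - N] - y_hat    # feedback prediction error
            correction = K * y_tilde               # state estimate correction
        else:
            correction = 0.0                       # no feedback has arrived yet
        x_hat = x_pred + correction                # corrected state estimate
        estimates.append(x_hat)
    return estimates

# A plant with the same dynamics but an initial state the observer doesn't know.
controls = [0.2] * 100
x_true, true_states = 2.0, []
for u in controls:
    x_true = A * x_true + B * u
    true_states.append(x_true)
y_observed = [C * x for x in true_states]

estimates = run_observer(controls, y_observed)
```

Even though the observer never sees the state directly and its feedback arrives three steps late, the prediction-error corrections pull the estimate from its wrong initial guess onto the true value. Replacing the fixed K with an optimally scheduled, delay-aware gain is what would make this a Kalman filter in the sense described above.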

FIGURE 19.4 SFC model of speech motor control. The model is similar to that depicted in Figure 19.3 (i.e., the forward models v̂tdyn(u_{t-1}, x̂_{t-1}) and v̂tout(x̂_t) constitute the internal model of the vocal tract shown in Figure 19.3), but here sensory feedback y_{t-N} is used to keep the state estimate x̂_t tracking the true vocal tract state x_t. This is accomplished with a prediction/correction process in which, in the prediction (green) direction, efference copy of vocal motor commands u_{t-1} is passed through the dynamics model v̂tdyn(u_{t-1}, x̂_{t-1}) to generate the next state prediction x̂_{t|t-1}, which is delayed by z^{-N̂}; z^{-N̂} outputs the next state prediction x̂_{(t|t-1)-N̂} from N seconds ago to match the sensory transduction delay of N seconds. x̂_{(t|t-1)-N̂} is passed through the sensory feedback model v̂tout(x̂_t) to generate feedback prediction ŷ_{t-N̂}. Then, in the correction (red) direction, incoming sensory feedback y_{t-N} is compared with prediction ŷ_{t-N̂}, resulting in sensory feedback prediction error ỹ_{t-N̂}. ỹ_{t-N̂} is converted by the Kalman gain function K_t(ỹ) into state correction ê_t, which is added to x̂_{t|t-1} to make the corrected state estimate x̂_t. Finally, as in Figure 19.3, x̂_t is used by SFC law U_t(x̂_t) in the controller to generate the controls u_t that will be applied at the next timestep to the vocal tract.

Furthermore, people quickly reoptimize their movements as task requirements change. They flexibly discover and adaptively adjust control of different aspects of their movements (e.g., contact force, final velocity) to take advantage of any aspect of the task that lets them reduce control effort (e.g., reaching to a target they must stop in front of versus reaching to a target when they can use impact with the target to slow their reach; Liu & Todorov, 2007).

19.8 SPEAKING BEHAVES LIKE AN OPTIMAL CONTROL PROCESS

Like learning to reach, the process of learning to speak could be described as an optimization process, with the speaker attempting to learn articulatory controls that strike a balance between the idiosyncrasies of his/her own vocal tract and the sounds demanded by his/her language. The reason speakers can, in general, find such an optimal balance is that the speaking task, if defined only as "be understood by the listener," is underspecified with respect to the available articulatory degrees of freedom. This is especially true because of the many-to-one nature of the articulatory-acoustic relationship. For example, during the initial closure portion of /b/, the lips are closed and sound output has not yet begun; the position of the tongue at this time is therefore acoustically irrelevant and thus unconstrained by the task.

Many studies have shown that speakers appear to systematically take advantage of this underspecification in their articulation of speech. This is manifest in the trading relations seen in the production of /u/, with tongue position being made similar to the surrounding phonetic context and lip rounding accommodating the resulting tongue position (Perkell, Matthies, Svirsky, & Jordan, 1993). This same context-dependent choice of tongue position is also seen in the production of /r/, with the bunched articulation used in velar contexts (e.g., /grg/) and the retroflex articulation used in alveolar contexts (e.g., /drd/) (Espy-Wilson & Boyce, 1994; Guenther et al., 1998; Guenther et al., 1999; Zhou, 2008). These effects of phonetic context on the articulation of a speech sound are broadly referred to as coarticulation, a term introduced in the discussion of undershoot as a phenomenon that the equilibrium point hypothesis can explain. However, coarticulation is often more complicated than this simple undershoot. In their running speech, speakers appear to anticipate the future need of currently noncritical articulators by moving them in advance to their ultimately needed positions: in the production of /ba/, the tongue is already moved to the position for /a/ during the production of /b/ (Farnetani & Recasens, 1999).

How critical an articulator is for a given speech target is also often language-dependent; for example, in English, nasalization of vowels is not a critical perceptual distinction, leaving speakers free in the production of /am/ to begin in advance the nasalization needed for /m/ (i.e., the opening of the velo-pharyngeal port) during the production of /a/. However, in French, where nasalization of vowels is a critical distinction, this advance nasalization of /m/ is not seen (Clumeck, 1976). Coarticulation has also been shown to vary widely across different native speakers of the same language (Kühnert & Nolan, 1999; Lubker & Gay, 1982; Perkell & Matthies, 1992). Even within the same speaker, the instruction to speak "clearly" reduces undershoot coarticulation (Moon & Lindblom, 1994), showing that speaking style controls some types of coarticulation (but not all types; see Matthies, Perrier, Perkell, & Zandipour, 2001).

These observations suggest that coarticulation is partly a learned phenomenon, but what exactly is learned is a matter of debate that distinguishes theories of coarticulation. One theory explains coarticulation purely in terms of "phonological" rules (i.e., rules based on the assumption that speech sounds are stored in memory as groups of discrete-valued features) (Chomsky & Halle, 1968). For example, the representation of /a/ includes the binary tongue features [+high] and [+back], whereas these features are left unspecified in /b/. In the look-ahead model of coarticulation, there is a feature-spreading process that considers the full utterance to be spoken and fills all unspecified features with any values they take on in the future (Henke, 1966). Thus, in the case of /ba/, the unspecified [high] and [back] features of /b/ would be set to [+high] and [+back], based on looking ahead to the features specified for /a/. Unfortunately, not all coarticulation phenomena can be accounted for as an all-or-nothing spreading of features to unspecified positions (remember the example of partial undershoot coarticulation described above), and rule-based theories have been expanded to include learning of continuously variable, target-specific coarticulation resistance values (Bladon & Al-Bamerni, 1976). Another attempt to explain the continuous nature of coarticulation dismisses the idea of discrete feature targets in speech, instead postulating that speakers learn whole trajectories (timecourses) for different features (e.g., lip opening or tongue-tip height) called gestures; coarticulation then naturally results when these gestures overlap in time (Browman & Goldstein, 1986; Fowler & Saltzman, 1993). Unfortunately, not all coarticulation can be modeled as the linear overlap of such feature timecourses, necessitating once again the supposition that speakers learn a resistance to coarticulation (in this case, called blending strength) for different speech targets.

An alternative to explaining coarticulation with such explicit rules is to model it as resulting from an optimization process, like that postulated for nonspeech movements. This was first suggested by Lindblom in 1983 (Lindblom, 1983), and later more fully elaborated in his "H&H" theory of speech production (Lindblom, 1990). In it, Lindblom explains phenomena observed in speech production as variations between "hyperspeech" (speech determined by demanding constraints on acoustic output) and "hypospeech" (speech more determined by production system [e.g., minimal effort] constraints). Acoustic output demands are determined by two things: (i) acoustic distinctiveness (i.e., how confusable a given speech sound is with its nearest neighbors in "acoustic space") and (ii) how easily the listener can predict the next sound to be produced based on any number of sources of information the listener has available (e.g., semantic, linguistic, or phonetic contextual constraints). These factors can be approximately summarized by the more general constraint, mentioned previously, that what the speaker says should be understood by the listener. In this way, the complexities of coarticulation are explained as a "tug-of-war" between acoustic output and production system constraints, with coarticulated speech resulting when acoustic distinctiveness constraints are sufficiently lax that production system constraints can determine a minimal effort articulation. The acoustic output constraint accounts for the language-dependent nature of coarticulation, whereas the production system constraint accounts for variations in coarticulation across speakers, because what counts as "minimal effort" depends on an individual speaker's vocal tract geometry and musculature.

A variant of this idea that makes the listener-oriented acoustic distinctiveness constraints more explicit is Keating's window theory of coarticulation (Keating, 1990). The theory postulates that speech targets are not single-valued but instead are specified as windows: permissible ranges for different speech features, where these permissible feature ranges are learned by a speaker from listening to other speakers of the language. Coarticulation happens because articulatory trajectory planning in speaking is a process of satisfying both a language constraint (that the feature bounds for speech targets must be respected) and a minimal effort constraint.

Historically, the problem with explaining the control of speaking as resulting from an optimization process (i.e., that speaking is an optimal control process) is that, by itself, this is only a descriptive theory; qualitatively, speakers do behave as if they are working to minimize different movement cost terms, but this does not explain how this minimization is accomplished. This incompleteness of the theory makes it difficult to use it as a model of the role of the CNS in speaking.
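The all-or-nothing character of look-ahead feature spreading can be seen in a toy implementation. The segment representation and feature names are illustrative; the /ba/ values mirror the example in the text:

```python
def look_ahead_fill(segments):
    """Fill each segment's unspecified (None) features by scanning ahead to
    the next segment that specifies a value, as in a Henke-style look-ahead
    model. Segments are dicts of feature name -> True/False/None."""
    filled = [dict(seg) for seg in segments]
    for i, feats in enumerate(filled):
        for name, value in feats.items():
            if value is None:
                for later in filled[i + 1:]:
                    if later.get(name) is not None:
                        feats[name] = later[name]   # spread future value back
                        break
    return filled

# /ba/: /b/ specifies only lip closure; its tongue features are unspecified.
ba = [
    {"labial": True, "high": None, "back": None},   # /b/
    {"labial": None, "high": True, "back": True},   # /a/
]
filled = look_ahead_fill(ba)
```

The fill is all-or-nothing: a feature is either copied wholesale from a later segment or left alone. That is exactly why, as the text notes, graded effects like partial undershoot force extensions such as learned coarticulation resistance.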
As a result, for example, in early versions of the DIVA model, one of the best current attempts at providing a mechanistic account of speech motor control, the optimization process was implemented by adopting a specific version of Keating's window theory; the phonetic feature ranges on speech targets provide an explicit rule for specifying how targets are achieved (i.e., the articulatory trajectory must pass within the target's feature ranges), and the minimal effort constraint is replaced with an explicit rule that says that minimal-distance trajectories in phonetic feature space will be followed between successive phonetic targets (Guenther, 1995).

Without resorting to such explicit rules, however, the more general question about speech motor control remains unanswered: just how could the CNS choose the next articulatory control to be output in ongoing speech, such that, ultimately, some overall movement cost constraints are satisfied? How does a distal goal, whose achievement is only known after a word is produced, guide the selection of the next articulatory control at a given point in the ongoing production of the word? Optimal control theory provides a solution: the algorithms at the heart of the theory provide mechanisms for translating an overall movement goal into moment-by-moment controls (Stengel, 1994). In this framework, overall movement goals are expressed as cost functions to minimize (Todorov, 2004, 2006; Todorov & Jordan, 2002; Scott, 2004). These cost functions are a composite of terms reflecting the competing constraints governing movements; for example, one term (accuracy) could express the constraint that the listener should understand what was said (e.g., it could be obtained by an evaluation of the probability that the listener confuses what was said with a different utterance during the state corresponding to "movement finished"). Another term (effort) could express the fact that actions incur a metabolic cost depending on the forces the articulator muscles are commanded to generate (e.g., cost as the sum of all force magnitudes over the whole movement). The total cost is then a weighting of these cost terms, where the weighting is determined by the current task (e.g., high weighting of accuracy and low weighting of effort for clear speech, and low weighting of accuracy and high weighting of effort for casual speech).

To find the control law that minimizes a given cost function, the concept of state is crucial. Intuitively, if you knew exactly how the system being controlled would respond to your commands (i.e., if you knew its state), you could choose the command that minimized the need for future corrective commands and thus minimize control effort. This intuitive idea is the principle behind the algorithms that determine optimal controls. Perhaps the most understandable versions of these algorithms are those based on dynamic programming and reinforcement learning, in which each state has a cost-to-go, which is the movement cost incurred if only optimal control actions are taken after that state. Control actions also have a cost (as discussed) and, in one optimal control algorithm (dynamic programming), the optimal next control action for the current state is chosen to minimize the cost of that action plus the cost-to-go of the state it leads to (Bellman, 1957; Bertsekas, 2000). That minimal cost also becomes the new cost-to-go for the current state. Thus, over repeated utterance variations, costs-to-go for later states are propagated backward to earlier states, with the cost-to-go for the end state defined as the probability that the listener will misunderstand what was said. And, in this process of back-propagating costs-to-go, an optimal next control action is chosen for every state visited. In this way, the complete control law mapping states to output controls is eventually learned.

This process for learning a control law is of particular interest because there appear to be neurophysiological processes in the CNS that mimic it. The basal ganglia (BG) in particular appear to represent several of the needed quantities, which is significant because the BG are thought to be involved in action selection. Dopaminergic neurons have been implicated in both reward prediction (Hollerman & Schultz, 1998; Schultz, Dayan, & Montague, 1997) and the detection of novel sensory outcomes of actions (Redgrave & Gurney, 2006). Interestingly, these neurons also display a back-propagation of their responses: they initially respond vigorously to delivery of a reward but soon habituate and instead respond to earlier sensory cues that predict the reward (Schultz, 1998). If it can be considered rewarding to improve the outcome of a movement (i.e., to minimize its cost), then the behavior of dopaminergic neurons suggests that a back-propagation of state cost-to-go values would be a way for the CNS to learn control laws (Daw & Doya, 2006; Doya, 2000).

Several studies have shown that neurons in the BG striatum appear to represent and retain expected reward returns for different possible future actions (Samejima, Ueda, Doya, & Kimura, 2005; Wang, Miura, & Uchida, 2013), characteristics that are well-suited for representing the costs-to-go of different states. In addition, other studies have shown that GPi output appears to represent movement effort (Desmurget, Grafton, Vindras, Grea, & Turner, 2003; Grafton, Vindras, Grea, & Turner, 2004; Turner, Desmurget, Grethe, Crutcher, & Grafton, 2003). And although many of the detailed studies of the role of the BG in action learning have been done at a relatively high level of discrete action choice, the BG are also likely involved in the learning of new "lower level" sensorimotor skills. The subthalamic nucleus (STN) has been shown to react more strongly to more successful movement outcomes in the production of simple movements (Brown et al., 2006).
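The backward propagation of costs-to-go described above can be made concrete with a miniature dynamic-programming example. The three-state "utterance," its action costs, and the terminal misunderstanding costs are invented for illustration:

```python
def cost_to_go(transitions, terminal_costs, n_iters=50):
    """transitions: state -> list of (action_cost, next_state).
    Terminal states appear only in terminal_costs. Repeated sweeps stand in
    for the repeated utterance variations described in the text."""
    J = dict(terminal_costs)                       # end-state costs-to-go
    J.update({s: float("inf") for s in transitions})
    policy = {}
    for _ in range(n_iters):
        for s, options in transitions.items():
            # Optimal action: minimize action cost + cost-to-go of successor.
            best = min(options, key=lambda o: o[0] + J[o[1]])
            J[s] = best[0] + J[best[1]]            # new cost-to-go for s
            policy[s] = best
    return J, policy

# Toy task: a "clear" route costs more effort but ends in a state the
# listener is less likely to misunderstand.
transitions = {
    "start": [(1.0, "casual"), (2.0, "clear")],    # (effort cost, next state)
    "casual": [(0.0, "done_casual")],
    "clear": [(0.0, "done_clear")],
}
terminal = {"done_casual": 3.0, "done_clear": 0.5}  # misunderstanding costs
J, policy = cost_to_go(transitions, terminal)
```

After the sweeps, the end-state misunderstanding costs have propagated back to "start" (cost-to-go 2.5), and the stored policy picks the higher-effort but more intelligible route; this is the same accuracy-versus-effort trade-off the weighted cost function in the text expresses.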

C. BEHAVIORAL FOUNDATIONS 230 19. SPEECH MOTOR CONTROL FROM A MODERN CONTROL THEORY PERSPECTIVE role as a “critic” that uses cost-to-go values to learn compensated for the lip tubes, suggesting they worked movement control laws (Barto, 1995; Berthier, to maintain an acoustic /u/ representation, whereas Rosenstein, & Barto, 2005). others did not (Savariaux, Perrier, & Orliaguet, 1995). Because these other speakers nevertheless had normal speech and hearing, what explains their results? 19.9 SFC EXPLAINS THE TASK-SPECIFIC It could be that unknown experimental factors ROLE OF THE CNS IN SPEECH confounded the results for these speakers, but another FEEDBACK PROCESSING distinct possibility is that there are many viable solu- tions to the speech motor control problem, and that Besides providing a mechanistic explanation for how different speakers learn different solutions. Thus, it optimal control laws could be learned, the SFC frame- may be that the control solution based on an acoustic work also provides an explanation for how and why representation of /u/ is only discovered and used by the CNS would process sensory feedback during ongo- some speakers. An optimal SFC model of speaking can ing speaking. This is because, in its most general form, explain such variability across speakers because it spe- the process of estimating the current state of the system cifies only the way that speakers learn task-dependent being controlled relies on more than just tracking the perturbation responses rather than the specific repre- sequence of controls sent to the system. Crucially, it sentations of the task that different speakers learn. also relies on sensory feedback to correct errors in the state estimate, as is described here. This full state esti- mation process not only serves as a model of the role of 19.10 IS SFC NEURALLY PLAUSIBLE? 
CNS in feedback processing but also explains how task- specific responses to feedback perturbations would For speech, the SFC model suggests not only that occur: In SFC, such perturbations cause corrections to auditory processing is used by the CNS for compre- the current state estimate, and the corrected state, if it hension during listening, but also that the CNS uses has been visited before, has a task-specific optimal con- auditory information in a distinctly different way dur- trol law associated with it. If the state has not been ing speech production; it is compared with a predic- visited before, then the process updating the cost-to-go tion derived from efference copy of motor output, with for that state will begin the process of learning a task- the resulting prediction error used to keep an internal optimal control response in that state. model tracking the state of the vocal tract. There are a In this way, a state estimation process that includes number of lines of evidence supporting the neural sensory feedback explains how task-specific responses to plausibility of this second, production-specific mode of feedback perturbations could be learned without sensory processing. First, even in other primates, there recourse to assuming speech is perceived in terms of appear to be at least two distinct pathways, or streams, certain specialized features. This is important because of auditory processing. The concept of multiple sensory experiments that test whether speakers use task-relevant processing streams in both the auditory (Deutsch & feature representations often have mixed results, with Roll, 1976; Evans & Nelson, 1973; Poljak, 1926)and some speakers behaving as if they use a certain represen- visual (Held, 1968; Ingle, 1967) systems of the brain has tation and others behaving as if they do not. 
To return to been around for decades, but the idea gained most an earlier example, it is often reported that speakers attention when it was advanced as an organizational exhibit trading relations in the production of /u/, with principle of the cortical regions involved in visual infor- variations in tongue height being compensated by covar- mation processing. A dorsal “where” stream leading to iations in lip extension (and vice versa) such that an parietal cortex that was concerned with object location acoustic representation of their /u/ production (i.e., its and a ventral “what” stream leading to the temporal formant pattern) is preserved. And when such covaria- pole was concerned with object recognition were tion has been looked for in experiments, it is observed in hypothesized (Mishkin, Ungerleider, & Macko, 1983; many speakers (Perkell et al., 1993), consistent with their Ungerleider & Mishkin, 1982). Subsequently, studies of use of an acoustic representation to constrain their /u/ the auditory system found a match to this visual system productions. Critically, however, it is not observed in organization (Romanski et al., 1999). Neurons respond- other speakers. Related experiments have looked for ing to auditory source location were found in a dorsal acoustic /u/ representations by examining how speak- pathway leading up to parietal cortex, and neurons ers produce /u/ when required to hold tubes at their responding to auditory source type were found in a lips. These tubes function as artificial perturbations of ventral pathway leading down toward the temporal a speaker’s lip extension, requiring compensatory adjust- pole (Rauschecker & Tian, 2000). More recent evidence, ment of other articulators like the tongue to maintain however, has refined the view of the dorsal stream’s the original /u/ formant pattern. Some speakers task to be one of sensorimotor integration. The dorsal
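The task-specific perturbation handling described in Section 19.9 (a corrected state estimate indexes a per-state control law, and cost-to-go updates begin for states never visited before) can be sketched in a few lines. This is our toy illustration, not the chapter's model: the table layout, learning rate, and function names are illustrative assumptions.

```python
# Sketch of state-indexed control: each (discretized) state has a cost-to-go
# value and, once visited, an associated control. A feedback perturbation
# that shifts the state estimate simply indexes a different entry; unvisited
# entries trigger new learning. All names and values are illustrative.

ALPHA = 0.1  # assumed learning rate for cost-to-go updates

cost_to_go = {}    # state -> estimated remaining movement cost
control_law = {}   # state -> learned control response for that state

def respond(state, default_control=0.0):
    """Return the task-specific control for a state, initializing (and so
    beginning to learn) entries for states never visited before."""
    if state not in control_law:
        cost_to_go[state] = 0.0
        control_law[state] = default_control
    return control_law[state]

def update_cost_to_go(state, incurred_cost, next_state):
    """Critic-style update: move this state's cost-to-go toward the observed
    cost plus the successor state's cost-to-go."""
    target = incurred_cost + cost_to_go.get(next_state, 0.0)
    cost_to_go[state] += ALPHA * (target - cost_to_go[state])

# A perturbation that corrects the state estimate from state 3 to state 4
# automatically selects the control for state 4, learning it if it is new.
u = respond(4)
update_cost_to_go(4, incurred_cost=1.0, next_state=5)
```

In this scheme, task specificity falls out of the lookup itself: no specialized perceptual features are needed, because the corrected state estimate alone determines which learned response is applied.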

The dorsal visual stream was found to be closely linked with nonspeech motor control systems (e.g., reaching, head, and eye movement control) (Andersen, 1997; Rizzolatti, Fogassi, & Gallese, 1997), while in humans the dorsal auditory stream was found to be closely linked with the vocal motor control system. In particular, a variety of studies have implicated the posterior superior temporal gyrus (STG) (Zheng, Munhall, & Johnsrude, 2010) and the superior parietal temporal area (Spt) (Buchsbaum, Hickok, & Humphries, 2001; Hickok, Buchsbaum, Humphries, & Muftuler, 2003) as serving auditory feedback processing specifically related to speech production. Consistent with this, studies of stroke victims have shown a double dissociation, depending on lesion location (dorsal versus ventral stream lesions), between the ability to perform discrete production-related perceptual judgments and the ability to understand continuous speech (Baker, Blumstein, & Goodglass, 1981; Miceli, Gainotti, Caltagirone, & Masullo, 1980). This has led to refined looped and "dual stream" models of speech processing (Hickok et al., 2011; Hickok & Poeppel, 2007; Rauschecker & Scott, 2009), with a ventral stream serving speech comprehension and a dorsal stream serving feedback processing related to speaking. This two-stream model is in fact a close match with the one originally proposed by Wernicke more than 100 years earlier (Wernicke, 1874/1977).

When the production-oriented auditory processing of the dorsal stream is disrupted, a number of speech sensorimotor disorders appear to result (Hickok et al., 2011). Conduction aphasia is a neurological condition resulting from stroke in which production and comprehension of speech are preserved but the ability to repeat speech sound sequences just heard is impaired (Geschwind, 1965). Conduction aphasia appears to result from damage to area Spt in the dorsal auditory processing stream (Buchsbaum et al., 2011). Consistent with this, the impairment is particularly apparent in the task of repeating nonsense speech sounds, because when the sound sequences do not form meaningful words, the intact speech comprehension system (the ventral stream) cannot aid in remembering what was heard. More speculatively, stuttering may also result from impairments in auditory feedback processing in the dorsal stream. It is well-known that altering auditory feedback (e.g., altering pitch (Howell, El-Yaniv, & Powell, 1987), masking feedback with noise (Maraist & Hutton, 1957), and delayed auditory feedback (DAF) (Soderberg, 1968)) can make many persons who stutter speak fluently. Evidence for dorsal stream involvement in these fluency enhancements comes from a study relating DAF-induced fluency to structural MRIs of the brains of persons who stutter (Foundas et al., 2004). The planum temporale (PT) is an area of temporal cortex encompassing dorsal stream areas like Spt, and the study found that the right PT was aberrantly larger than the left PT in those stutterers whose fluency was enhanced by DAF. Several other anatomical studies have also implicated dorsal stream dysfunction in stuttering, including studies showing impaired white matter connectivity in this region (Cykowski, Fox, Ingham, Ingham, & Robin, 2010) as well as aberrant gyrification patterns (Foundas, Bollich, Corey, Hurley, & Heilman, 2001).

19.11 SFC ACCOUNTS FOR EFFERENCE COPY PHENOMENA

There are a number of studies that have found evidence that production-specific feedback processing involves comparison of incoming feedback with a feedback prediction derived from motor efference copy. Nonspeech evidence for this is seen when a robot creates a delay between the tickle action subjects produce and when they feel it on their own hand (Blakemore, Wolpert, & Frith, 1998, 1999, 2000). With increasing delay, subjects report a more ticklish sensation, as expected if the delay created a mismatch between a sensory prediction derived from the tickle action and the actual somatosensory feedback. Using a variety of neuroimaging techniques, an analogous effect can be seen in speech production: the response of a subject's auditory cortices to his or her own self-produced speech is significantly smaller than the response to similar but externally produced speech (e.g., tape playback of the subject's previous self-productions). This effect, which we call speaking-induced suppression (SIS), has been seen using positron emission tomography (PET) (Hirano et al., 1996; Hirano, Kojima, et al., 1997; Hirano, Naito, et al., 1997), electroencephalography (EEG) (Ford et al., 2001; Ford & Mathalon, 2004), and magnetoencephalography (MEG) (Curio, Neuloh, Numminen, Jousmaki, & Hari, 2000; Heinks-Maldonado, Nagarajan, & Houde, 2006; Houde, Nagarajan, Sekihara, & Merzenich, 2002; Numminen & Curio, 1999; Numminen, Salmelin, & Hari, 1999; Ventura, Nagarajan, & Houde, 2009). An analog of the SIS effect has also been seen in nonhuman primates (Eliades & Wang, 2003, 2005, 2008). Our own MEG experiments have shown that the SIS effect is only minimally explained by a general suppression of auditory cortex during speaking, and that this suppression does not arise in the more peripheral parts of the CNS (Houde et al., 2002). We have also shown that the observed suppression goes away if the subject's feedback is altered to mismatch his or her expectations (Heinks-Maldonado et al., 2006; Houde et al., 2002), which is consistent with some of the PET findings. Finally, if SIS depends on a precise match between feedback and prediction, then precise time alignment of the prediction with the feedback would be critical for complex, rapidly changing productions (e.g., rapidly speaking "ah-ah-ah").
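The timing argument can be made concrete with a toy calculation (our illustration, not the authors' model): hold the prediction/feedback misalignment fixed and compare the resulting mismatch for a slowly versus a rapidly changing feedback signal.

```python
# Illustrative sketch: with a fixed alignment error between an
# efference-copy prediction and the feedback it is compared against,
# the prediction error (and hence the loss of suppression) grows with
# how fast the feedback signal changes. All parameters are assumptions.

import math

def prediction_error(signal, delay_steps):
    """Mean absolute mismatch between feedback and a prediction that is
    misaligned by `delay_steps` samples."""
    pred = signal[:-delay_steps] if delay_steps else signal
    feedback = signal[delay_steps:]
    return sum(abs(f - p) for f, p in zip(feedback, pred)) / len(pred)

n = 1000
slow = [math.sin(2 * math.pi * 1 * t / n) for t in range(n)]   # sustained "ah"-like
fast = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]  # "ah-ah-ah"-like

# The same 5-sample misalignment yields a much larger error for the
# rapidly changing signal, predicting weaker SIS for fast productions.
assert prediction_error(fast, 5) > prediction_error(slow, 5)
```

The fixed misalignment stands in for a given level of neural timing inaccuracy; only the rate of change of the signal differs between the two cases.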

Time alignment should be less critical for slow or static productions (e.g., speaking a sustained "ah"). Assuming a given level of time-alignment inaccuracy, the prediction/feedback match should therefore be better (and SIS stronger) for slower, less dynamic productions, which is what we found in a recent study (Ventura et al., 2009).

By itself, evidence of feedback being compared with a prediction derived from efference copy implies the existence of predictive forward models within the CNS, but another line of evidence for forward models comes from sensorimotor adaptation experiments (Ghahramani, Wolpert, & Jordan, 1996; Wolpert & Ghahramani, 2000; Wolpert, Ghahramani, & Jordan, 1995). Such experiments have been conducted with speech production, where subjects are shown to alter, and then retain, compensatory production changes in response to extended exposure to artificially altered audio feedback (Houde & Jordan, 1997, 1998, 2002; Jones & Munhall, 2000a, 2000b, 2002, 2003, 2005; Jones, Munhall, & Vatikiotis-Bateson, 1998; Purcell & Munhall, 2006; Shiller, Sato, Gracco, & Baum, 2009; Villacorta, Perkell, & Guenther, 2007) or altered somatosensory feedback (Nasir & Ostry, 2006, 2008, 2009; Tremblay et al., 2003; Tremblay et al., 2008). For example, in the original speech sensorimotor adaptation experiment, subjects produced the vowel /ε/ (as in "head"), first hearing normal audio feedback and then hearing their formants shifted toward /i/ (as in "heed"). Over repeated productions while hearing the altered feedback, subjects gradually shifted their productions of /ε/ in the opposite direction (i.e., they shifted their produced formants toward /ɑ/, as in "hot"). This had the effect of making the altered feedback sound more like /ε/ again. These changes in the production of /ε/ were retained even when feedback was subsequently blocked by noise (Houde & Jordan, 1997, 1998, 2002). The retained production changes are consistent with the existence of a forward model making feedback predictions that are modified by experience. In addition to providing evidence for forward models, such adaptation experiments also allow investigation of the organization of forward models in the speech production system. By examining how compensation trained in the production of one phonetic task (e.g., the production of /eh/) generalizes to another untrained phonetic task (e.g., the production of /ah/), such experiments can determine whether there are shared representations, like forward models, used in the control of both tasks. Some of these experiments have found generalization of adaptation across speech tasks (Houde & Jordan, 1997, 1998; Jones & Munhall, 2005), but other experiments have not found such generalization (Pile, Dajani, Purcell, & Munhall, 2007; Tremblay et al., 2008), suggesting that, in many cases, forward models used in the control of different speech tasks are perhaps not shared across tasks.

19.12 NEURAL SUBSTRATE OF THE SFC MODEL

Based partly on the discussion here, Figure 19.5 suggests a putative neural substrate for the SFC model. Basic neuroanatomical facts dictate the neural substrates on both ends of the SFC prediction/correction processing loop. On one end of the loop, motor cortex (M1) is the likely area where the feedback control law Ût(x̂t) generates the neuromuscular controls applied to the vocal tract. Motor cortex is the main source of the motor fibers of the pyramidal tract, which synapse directly with motor neurons in the brainstem and spinal cord and enable fine motor movements (Kandel et al., 2000). As mentioned, damage to the vocal tract areas of motor cortex often results in mutism (Duffy, 2005; Jürgens, 2002). On the other end of the loop, auditory and somatosensory information first reaches the higher CNS in the primary auditory (A1) and somatosensory (S1) cortices, respectively (Kandel et al., 2000). Based on our SIS studies, we hypothesize that this end of the loop is where the operation comparing the feedback prediction with incoming feedback occurs. Between these endpoints, the model also predicts the need for an additional area that mediates the prediction (green) and correction (red) processes running between the motor and sensory cortices. The premotor cortices are ideally placed for such an intermediary role: premotor cortex is both bidirectionally well-connected to motor cortex (Kandel et al., 2000) and, via the arcuate and longitudinal fasciculi (Glasser & Rilling, 2008; Schmahmann et al., 2007; Upadhyay, Hallock, Ducros, Kim, & Ronen, 2008), bidirectionally connected to the higher-order somatosensory (S2/inferior parietal lobe [IPL]) and auditory (Spt/PT) cortices. In this way, the key parts of the SFC model are a good fit for a known network of sensorimotor areas that are, in turn, well-placed to receive task-dependent, modulatory connections (blue dashed arrows in Figure 19.5) from other frontal areas.

What evidence is there for premotor cortex playing such an intermediary role in speech production? First, reciprocal connections with sensory areas suggest the possibility that premotor cortex could also be active during passive listening to speech, and this appears to be the case. Wilson et al. found that the superior ventral premotor area (svPMC), bilaterally, was activated by both listening to and speaking meaningless syllables, but not by listening to nonspeech sounds (Wilson, Saygin, Sereno, & Iacoboni, 2004). In a follow-up study, Wilson et al. found that this area, bilaterally, showed greater activation when subjects heard non-native speech sounds than when they heard native sounds. In this same study, auditory areas were also activated more for the speech sounds rated least producible.

[Figure 19.5: block diagram of the SFC model, linking the vocal tract, motor cortex, premotor cortex (vPMC), sensory cortex, and other premotor/frontal areas through "Predict," "Correct," and "Task-dependent modulation" pathways.]
FIGURE 19.5 SFC model of speech motor control with putative neural substrate. The figure depicts the same operations as those shown in the earlier SFC schematic, but with suggested cortical locations of the operations (motor areas are in yellow, sensory areas are in pink). The current model is largely agnostic regarding hemispheric specialization for these operations. Also, for diagrammatic simplicity, the operations in the auditory and somatosensory cortices are depicted in a single area marked "sensory cortex," with the understanding that it represents analogous operations occurring in both of these sensory cortices. That is, the delayed state estimate x̂(t|t−1)−N is sent to both higher-order somatosensory and auditory cortex, each with its own feedback prediction module (one predicting auditory feedback in higher-order auditory cortex, the other predicting somatosensory feedback in higher-order somatosensory cortex). The feedback prediction errors ỹt−N generated in auditory and somatosensory cortex are converted into separate auditory-based and somatosensory-based state corrections êt by the auditory and somatosensory Kalman gain functions K̃t(ỹ) in the respective higher-order cortices. These state corrections are then added to x̂t|t−1 in premotor cortex to make the next state estimate x̂t. Finally, the key operations depicted in blue are all postulated to be modulated by the current speech task goals (e.g., what speech sound is currently meant to be produced) that are expressed in other areas of frontal cortex.

The same study also found that svPMC was functionally connected to these auditory areas during listening (Wilson & Iacoboni, 2006). This activation of premotor cortex when speech is heard has also been seen in other functional imaging studies (Skipper, Nusbaum, & Small, 2005) and in studies based on TMS (Watkins & Paus, 2004).

Second, altering sensory feedback during speech production should create feedback prediction errors in sensory cortices, increasing activations in these areas, and the resulting state estimate corrections should be passed back to premotor cortex, increasing its activation as well. A study that tested this prediction was performed by Tourville et al., who used fMRI to examine how cortical activations changed when subjects spoke with their auditory feedback altered (Tourville, Reilly, & Guenther, 2008). In the study, subjects spoke simple CVC words, with the frequency of the first formant occasionally altered in the audio feedback of some of their productions. When they looked for areas more active in altered-feedback versus nonaltered trials, Tourville et al. found auditory areas (pSTG, including Spt, in both hemispheres), and they also found areas in the right frontal cortex: a motor area (vMC), a premotor area (vPMC), and an area (IFt) in the inferior frontal gyrus, pars triangularis (Broca's) region. When they looked at the functional connectivity of these right frontal areas, they found that the presence of the altered feedback significantly increased the functional connectivity only of the left and right auditory areas, as well as the functional connectivity of these auditory areas with vPMC and IFt. The result suggests that the auditory feedback correction information from higher auditory areas has a bigger effect on premotor/pars triangularis regions than on motor cortex regions, which is consistent with our SFC model if we expand the neural substrate of our state estimation process beyond premotor cortex to also include Broca's area. The results of Tourville et al. are partly confirmed by another fMRI study. Toyomura et al. had subjects continuously phonate a vowel and, in some trials, briefly perturbed the pitch of the subjects' audio feedback higher or lower by two semitones (Toyomura et al., 2007). In examining the contrast between perturbed and unperturbed trials, Toyomura et al. found premotor activation in the left hemisphere and a number of activations in the right hemisphere, including auditory cortex (STG) and frontal area BA9, which is near the IFt activation found by Tourville et al.
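The prediction/correction operations enumerated in the Figure 19.5 caption can be sketched as a scalar observer loop. This is an illustrative sketch with assumed dynamics and a fixed Kalman-style gain: the chapter specifies the operations, not these numbers, and the sensory delay (z−N) is omitted for brevity.

```python
# Scalar sketch of the SFC predict/correct loop: an efference copy of the
# control advances an internal model, a sensory consequence is predicted,
# and the prediction error, scaled by a Kalman-like gain, corrects the
# state estimate. Dynamics, gain, and noise level are all assumptions.

import random

A, B = 0.9, 1.0   # assumed vocal-tract-like dynamics: x' = A*x + B*u
C = 1.0           # assumed sensory output map: y = C*x
K = 0.5           # fixed Kalman-style gain (illustrative)
random.seed(0)

x = 0.0           # true state of the "vocal tract"
x_hat = 0.0       # CNS state estimate

for t in range(200):
    u = 1.0 - x_hat                            # feedback control law U(x_hat)
    x = A * x + B * u + random.gauss(0, 0.05)  # true dynamics plus noise
    y = C * x                                  # incoming sensory feedback
    x_pred = A * x_hat + B * u                 # predict: efference copy drives model
    y_pred = C * x_pred                        # predicted sensory feedback
    e = K * (y - y_pred)                       # correction from prediction error
    x_hat = x_pred + e                         # corrected state estimate

# The estimate tracks the noisy true state closely.
assert abs(x_hat - x) < 0.2
```

Altering the feedback (adding a perturbation to y) would inflate the prediction error e and hence the correction traffic, which is the mechanism behind the increased sensory and premotor activations discussed above.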

19.13 CONCLUSION

In this review, the applicability of SFC to modeling speech motor control has been explored. The phenomena related to the role of the CNS in speech production, especially its role in processing sensory feedback, are complex and suggest that speech motor control is not an example of pure feedback control or pure feedforward control. The task-specificity of responses to feedback perturbations in speech further argues that feedback control is not only a function of the lower motor system but also one in which the CNS plays an active role in the online processing of sensory feedback during speaking. Current models of this role are described as variations of the concept of SFC from engineering control theory. Thus, SFC is put forth as an appropriate and neurally plausible model of how the CNS processes feedback and controls the vocal tract.

References

Abbs, J. H., & Gracco, V. L. (1983). Sensorimotor actions in the control of multi-movement speech gestures. Trends in Neurosciences, 6, 391.
Abbs, J. H., & Gracco, V. L. (1984). Control of complex motor gestures: Orofacial muscle responses to load perturbations of lip during speech. Journal of Neurophysiology, 51(4), 705–723.
Andersen, R. A. (1997). Multimodal integration for the representation of space in the posterior parietal cortex. Philosophical Transactions of the Royal Society of London Series B: Biological Sciences, 352(1360), 1421–1428.
Arbib, M. A. (1981). Perceptual structures and distributed motor control. In J. M. Brookhart, V. B. Mountcastle, & V. B. Brooks (Eds.), Handbook of physiology, Section 1: The nervous system, Volume 2: Motor control, Part 2 (pp. 1449–1480). Bethesda, MD: American Physiological Society.
Asatryan, D. G., & Feldman, A. G. (1965). Biophysics of complex systems and mathematical models. Functional tuning of nervous system with control of movement or maintenance of a steady posture. I. Mechanographic analysis of the work of the joint on execution of a postural task. Biophysics, 10, 925–935.
Baker, E., Blumstein, S. E., & Goodglass, H. (1981). Interaction between phonological and semantic factors in auditory comprehension. Neuropsychologia, 19(1), 1–15.
Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. Davis, & D. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press.
Bauer, J. J., Mittal, J., Larson, C. R., & Hain, T. C. (2006). Vocal responses to unanticipated perturbations in voice loudness feedback: An automatic mechanism for stabilizing voice amplitude. The Journal of the Acoustical Society of America, 119(4), 2363–2371.
Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bendor, D., & Wang, X. (2005). The neuronal representation of pitch in primate auditory cortex. Nature, 436(7054), 1161–1165.
Bernstein, N. A. (1967). The co-ordination and regulation of movements. Oxford: Pergamon Press.
Berthier, N. E., Rosenstein, M. T., & Barto, A. G. (2005). Approximate optimal control as a model for motor learning. Psychological Review, 112(2), 329–346.
Bertsekas, D. P. (2000). Dynamic programming and optimal control (2nd ed.). Belmont, MA: Athena Scientific.
Bizzi, E., Accornero, N., Chapple, W., & Hogan, N. (1982). Arm trajectory formation in monkeys. Experimental Brain Research, 46(1), 139–143.
Bladon, R. A. W., & Al-Bamerni, A. (1976). Coarticulation resistance in English /l/. Journal of Phonetics, 4, 137–150.
Blakemore, S. J., Wolpert, D. M., & Frith, C. D. (1998). Central cancellation of self-produced tickle sensation. Nature Neuroscience, 1(7), 635–640.
Blakemore, S. J., Wolpert, D. M., & Frith, C. D. (1999). The cerebellum contributes to somatosensory cortical activity during self-produced tactile stimulation. Neuroimage, 10(4), 448–459.
Blakemore, S. J., Wolpert, D. M., & Frith, C. D. (2000). Why can't you tickle yourself? Neuroreport, 11(11), R11–R16.
Borden, G. J., Harris, K. S., & Raphael, L. J. (1994). Speech science primer: Physiology, acoustics, and perception of speech (3rd ed.). Baltimore, MD: Williams & Wilkins.
Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology Yearbook, 3, 219–252.
Brown, P., Chen, C. C., Wang, S., Kuhn, A. A., Doyle, L., Yarrow, K., et al. (2006). Involvement of human basal ganglia in offline feedback control of voluntary movement. Current Biology, 16(21), 2129–2134.
Buchsbaum, B. R., Baldo, J., Okada, K., Berman, K. F., Dronkers, N., D'Esposito, M., et al. (2011). Conduction aphasia, sensory-motor integration, and phonological short-term memory: An aggregate analysis of lesion and fMRI data. Brain and Language, 119(3), 119–128.
Buchsbaum, B. R., Hickok, G., & Humphries, C. (2001). Role of left posterior superior temporal gyrus in phonological processing for speech perception and production. Cognitive Science, 25(5), 663–678.
Burnett, T. A., Freedland, M. B., Larson, C. R., & Hain, T. C. (1998). Voice F0 responses to manipulations in pitch feedback. Journal of the Acoustical Society of America, 103(6), 3153–3161.
Chang-Yit, R., Pick, J., Herbert, L., & Siegel, G. M. (1975). Reliability of sidetone amplification effect in vocal intensity. Journal of Communication Disorders, 8(4), 317–324.
Cheung, S. W., Nagarajan, S. S., Schreiner, C. E., Bedenbaugh, P. H., & Wong, A. (2005). Plasticity in primary auditory cortex of monkeys with altered vocal production. Journal of Neuroscience, 25(10), 2490–2503.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper & Row.
Clumeck, H. (1976). Patterns of soft palate movement in six languages. Journal of Phonetics, 4, 337–351.
Cowie, R., & Douglas-Cowie, E. (1992). Postlingually acquired deafness: Speech deterioration and the wider consequences. Hawthorne, NY: Mouton de Gruyter.
Curio, G., Neuloh, G., Numminen, J., Jousmaki, V., & Hari, R. (2000). Speaking modifies voice-evoked activity in the human auditory cortex. Human Brain Mapping, 9(4), 183–191.
Cykowski, M. D., Fox, P. T., Ingham, R. J., Ingham, J. C., & Robin, D. A. (2010). A study of the reproducibility and etiology of diffusion anisotropy differences in developmental stuttering: A potential role for impaired myelination. Neuroimage, 52(4), 1495–1504.
Daw, N. D., & Doya, K. (2006). The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16(2), 199–204.
Desmurget, M., Grafton, S. T., Vindras, P., Grea, H., & Turner, R. S. (2003). Basal ganglia network mediates the control of movement amplitude. Experimental Brain Research, 153(2), 197–209.
Desmurget, M., Grafton, S. T., Vindras, P., Grea, H., & Turner, R. S. (2004). The basal ganglia network mediates the planning of movement amplitude. European Journal of Neuroscience, 19(10), 2871–2880.
Deutsch, D., & Roll, P. L. (1976). Separate "what" and "where" decision mechanisms in processing a dichotic tonal sequence. Journal of Experimental Psychology: Human Perception and Performance, 2(1), 23–29.

Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10(6), 732–739.
Duffy, J. R. (2005). Motor speech disorders: Substrates, differential diagnosis, and management (2nd ed.). Saint Louis, MO: Elsevier Mosby.
Eliades, S. J., & Wang, X. (2003). Sensory-motor interaction in the primate auditory cortex during self-initiated vocalizations. Journal of Neurophysiology, 89(4), 2194–2207.
Eliades, S. J., & Wang, X. (2005). Dynamics of auditory-vocal interaction in monkey auditory cortex. Cerebral Cortex, 15(10), 1510–1523.
Eliades, S. J., & Wang, X. (2008). Neural substrates of vocalization feedback monitoring in primate auditory cortex. Nature, 453(7198), 1102–1106.
Elman, J. L. (1981). Effects of frequency-shifted feedback on the pitch of vocal productions. The Journal of the Acoustical Society of America, 70(1), 45–50.
Espy-Wilson, C., & Boyce, S. (1994). Acoustic differences between "bunched" and "retroflex" variants of American English /r/. The Journal of the Acoustical Society of America, 95(5), 2823.
Evans, E. F., & Nelson, P. G. (1973). On the functional relationship between the dorsal and ventral divisions of the cochlear nucleus of the cat. Experimental Brain Research, 17(4), 428–442.
Fairbanks, G. (1954). Systematic research in experimental phonetics: 1. A theory of the speech mechanism as a servosystem. Journal of Speech and Hearing Disorders, 19(2), 133–139.
Farnetani, E., & Recasens, D. (1999). Coarticulation models in recent speech production theories. In W. Hardcastle & N. Hewlett (Eds.), Coarticulation: Theory, data, and techniques (pp. 31–65). Cambridge, UK: Cambridge University Press.
Feldman, A. G. (1986). Once more on the equilibrium-point hypothesis (lambda model) for motor control. Journal of Motor Behavior, 18(1), 17–54.
Ford, J. M., & Mathalon, D. H. (2004). Electrophysiological evidence of corollary discharge dysfunction in schizophrenia during talking and thinking. Journal of Psychiatric Research, 38(1), 37–46.
Ford, J. M., Mathalon, D. H., Heinks, T., Kalba, S., Faustman, W. O., & Roth, W. T. (2001). Neurophysiological evidence of corollary discharge dysfunction in schizophrenia. American Journal of Psychiatry, 158(12), 2069–2071.
Foundas, A. L., Bollich, A. M., Corey, D. M., Hurley, M., & Heilman, K. M. (2001). Anomalous anatomy of speech-language areas in adults with persistent developmental stuttering. Neurology, 57(2), 207–215.
Foundas, A. L., Bollich, A. M., Feldman, J., Corey, D. M., Hurley, M., Lemen, L. C., et al. (2004). Aberrant auditory processing and atypical planum temporale in developmental stuttering. Neurology, 63(9), 1640–1646.
Fowler, C. A., & Saltzman, E. (1993). Coordination and coarticulation in speech production. Language and Speech, 36(Pt 2–3), 171–195.
Franklin, G. F., Powell, J. D., & Emami-Naeini, A. (1991). Feedback control of dynamic systems (2nd ed.). Reading, MA: Addison-Wesley.
Geschwind, N. (1965). Disconnexion syndromes in animals and man, Part II. Brain, 88(3), 585–644.
Ghahramani, Z., Wolpert, D. M., & Jordan, M. I. (1996). Generalization to local remappings of the visuomotor coordinate transformation. The Journal of Neuroscience, 16(21), 7085–7096.
Glasser, M. F., & Rilling, J. K. (2008). DTI tractography of the human brain's language pathways. Cerebral Cortex, 18(11), 2471–2482.
Godey, B., Atencio, C. A., Bonham, B. H., Schreiner, C. E., & Cheung, S. W. (2005). Functional organization of squirrel monkey primary auditory cortex: Responses to frequency-modulation sweeps. Journal of Neurophysiology, 94(2), 1299–1311.
Gomi, H., & Kawato, M. (1996). Equilibrium-point control hypothesis examined by measured arm stiffness during multijoint movement. Science, 272(5258), 117–120.
Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102(3), 594–621.
Guenther, F. H., Espy-Wilson, C. Y., Boyce, S. E., Matthies, M. L., Zandipour, M., & Perkell, J. S. (1999). Articulatory tradeoffs reduce acoustic variability during American English /r/ production. The Journal of the Acoustical Society of America, 105(5), 2854–2865.
Guenther, F. H., Ghosh, S. S., & Tourville, J. A. (2006). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, 96(3), 280–301.
Guenther, F. H., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference frames for the planning of speech movements. Psychological Review, 105(4), 611–633.
Guenther, F. H., & Vladusich, T. (2012). A neural theory of speech acquisition and production. Journal of Neurolinguistics, 25(5), 408–422.
Guigon, E., Baraduc, P., & Desmurget, M. (2008a). Computational motor control: Feedback and accuracy. The European Journal of Neuroscience, 27(4), 1003–1016.
Guigon, E., Baraduc, P., & Desmurget, M. (2008b). Optimality, stochasticity, and variability in motor behavior. Journal of Computational Neuroscience, 24(1), 57–68.
Hain, T. C., Burnett, T. A., Kiran, S., Larson, C. R., Singh, S., & Kenney, M. K. (2000). Instructing subjects to make a voluntary response reveals the presence of two components to the audio-vocal reflex. Experimental Brain Research, 130(2), 133–141.
Hallett, M., Shahani, B. T., & Young, R. R. (1975). EMG analysis of stereotyped voluntary movements in man. Journal of Neurology, Neurosurgery and Psychiatry, 38(12), 1154–1162.
Heil, P. (2003). Coding of temporal onset envelope in the auditory system. Speech Communication, 41(1), 123–134. Available from: http://dx.doi.org/10.1016/S0167-6393(02)00099-7.
Heil, P., & Irvine, D. R. (1996). On determinants of first-spike latency in auditory cortex. Neuroreport, 7(18), 3073–3076.
Heinks-Maldonado, T. H., & Houde, J. F. (2005). Compensatory responses to brief perturbations of speech amplitude. Acoustics Research Letters Online, 6(3), 131–137.
Heinks-Maldonado, T. H., Nagarajan, S. S., & Houde, J. F. (2006). Magnetoencephalographic evidence for a precise forward model in speech production. Neuroreport, 17(13), 1375–1379.
Held, R. (1968). Dissociation of visual functions by deprivation and rearrangement. Psychologische Forschung, 31(4), 338–348.
Henke, W. L. (1966). Dynamic articulatory model of speech production using computer simulation. Unpublished Ph.D. dissertation, Cambridge, MA: MIT.
Hickok, G., Buchsbaum, B., Humphries, C., & Muftuler, T. (2003). Auditory-motor interaction revealed by fMRI: Speech, music, and working memory in area Spt. Journal of Cognitive Neuroscience, 15(5), 673–682.
Hickok, G., Houde, J. F., & Rong, F. (2011). Sensorimotor integration in speech processing: Computational basis and neural organization. Neuron, 69(3), 407–422.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402.
Hill, A. V. (1925). Length of muscle, and the heat and tension developed in an isometric contraction. The Journal of Physiology, 60(4), 237–263.
Hirano, S., Kojima, H., Naito, Y., Honjo, I., Kamoto, Y., Okazawa, H., et al. (1996). Cortical speech processing mechanisms while vocalizing visually presented languages. Neuroreport, 8(1), 363–367.
Hirano, S., Kojima, H., Naito, Y., Honjo, I., Kamoto, Y., Okazawa, H., et al. (1997). Cortical processing mechanism for vocalization with auditory verbal feedback. Neuroreport, 8(9–10), 2379–2382.
Hirano, S., Naito, Y., Okazawa, H., Kojima, H., Honjo, I., Ishizu, K., et al. (1997). Cortical activation by monaural speech sound stimulation demonstrated by positron emission tomography. Experimental Brain Research, 113(1), 75–80.

C. BEHAVIORAL FOUNDATIONS 236 19. SPEECH MOTOR CONTROL FROM A MODERN CONTROL THEORY PERSPECTIVE

Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an Keating, P. (1990). The window model of coarticulation: Articulatory error in the temporal prediction of reward during learning. evidence. In J. Kingston, & M. Beckman (Eds.), Papers in laboratory Nature Neuroscience, 1(4), 304309. phonology I: Between the grammar and physics of speech (pp. 451470). Houde, J. F., & Jordan, M. I. (1997). Adaptation in speech motor control. Cambridge: Cambridge University Press. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. information processing systems (Vol. 10, pp. 3844). Cambridge, MA: (1984). Functionally specific articulatory cooperation following MIT Press. jaw perturbations during speech: Evidence for coordinative struc- Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in tures. Journal of Experimental Psychology: Human Perception and speech production. Science, 279(5354), 12131216. Performance, 10(6), 812832. Houde, J. F., & Jordan, M. I. (2002). Sensorimotor adaptation of speech I: Kording, K. P., Tenenbaum, J. B., & Shadmehr, R. (2007). The dynam- Compensation and adaptation. Journal of Speech, Language, and ics of memory as a consequence of optimal adaptation to a chang- Hearing Research, 45(2), 295310. ing body. Nature Neuroscience, 10(6), 779786. Houde, J. F., & Nagarajan, S. S. (2011). Speech production as state Ku¨hnert, B., & Nolan, F. (1999). The origin of coarticulation. In W. J. feedback control. Frontiers in Human Neuroscience, 5, 82. Hardcastle, & N. Hewlett (Eds.), Coarticulation: Theory, data and tech- Houde, J. F., Nagarajan, S. S., Sekihara, K., & Merzenich, M. M. niques (pp. 730). Cambridge, UK: Cambridge University Press. (2002). Modulation of the auditory cortex during speech: An Lakatos, P., Pincze, Z., Fu, K. M., Javitt, D. C., Karmos, G., & MEG study. Journal of Cognitive Neuroscience, 14(8), 11251138. Schroeder, C. E. (2005). 
Timing of pure tone and noise-evoked Howell, P., El-Yaniv, N., & Powell, D. J. (1987). Factors affecting flu- responses in macaque auditory cortex. Neuroreport, 16(9), ency in stutterers when speaking under altered auditory feed- 933937. back. In H. F. Peters, & W. Hulstijn (Eds.), Speech motor dynamics Lane, H., & Tranel, B. (1971). The lombard sign and the role of hearing in stuttering (pp. 361369). New York, NY: Springer Press. in speech. Journal of Speech and Hearing Research, 14(4), 677709. Hulliger, M. (1984). The mammalian muscle spindle and its central con- Lane, H., Wozniak, J., Matthies, M., Svirsky, M., Perkell, J., trol. Reviews of Physiology Biochemistry & Pharmacology, 101,1110. O’Connell, M., et al. (1997). Changes in sound pressure and fun- Ingle, D. (1967). Two visual mechanisms underlying the behavior of damental frequency contours following changes in hearing status. fish. Psychologische Forschung, 31(1), 4451. The Journal of the Acoustical Society of America, 101(4), 22442252. Ito, T., Kimura, T., & Gomi, H. (2005). The motor cortex is involved Larson, C. R., Altman, K. W., Liu, H. J., & Hain, T. C. (2008). in reflexive compensatory adjustment of speech articulation. Interactions between auditory and somatosensory feedback for Neuroreport, 16(16), 17911794. voice F0 control. Experimental Brain Research, 187(4), 613621. Izawa, J., Rane, T., Donchin, O., & Shadmehr, R. (2008). Motor adap- Larson, C. R., Burnett, T. A., Kiran, S., & Hain, T. C. (2000). Effects of tation as a process of reoptimization. Journal of Neuroscience, 28 pitch-shift velocity on voice F-0 responses. Journal Of The (11), 28832891. Acoustical Society Of America, 107(1), 559564. Jacobs, O. L. R. (1993). Introduction to control theory (2nd ed.). Oxford, Lee, B. S. (1950). Some effects of side-tone delay. Journal of the UK: Oxford University Press. Acoustical Society of America, 22, 639640. Jones, J. A., & Munhall, K. G. (2000a). Perceptual calibration of F0 Levelt, W. J. M. 
(1989). Speaking: From intention to articulation. production: Evidence from feedback perturbation. Journal of the Cambridge, MA: The MIT Press. Acoustical Society of America, 108(3 Pt 1), 12461251. Levitt, H., Stromberg, H., Smith, C., & Gold, T. (1980). The structure Jones, J.A., & Munhall, K.G. (2000b). Perceptual contributions to of segmental errors in the speech of deaf children. Journal of fundamental frequency production. Paper presented at the 5th Communication Disorders, 13(6), 419441. Seminar on Speech Production: Models and Data, Kloster Seeon, Li, W., Todorov, E., & Pan, X. (2004). Hierarchical optimal control of Germany. redundant biomechanical systems. Conference Proceedings IEEE Jones, J. A., & Munhall, K. G. (2002). The role of auditory feedback Engineering in Medicine and Biology Society, 6, 46184621. during phonation: Studies of mandarin tone production. Journal Lindblom, B. (1963). Spectrographic study of vowel reduction. The of Phonetics, 30(3), 303320. Journal of the Acoustical Society of America, 35(11), 17731781. Jones, J. A., & Munhall, K. G. (2003). Learning to produce speech Lindblom, B. (1983). Economy of speech gestures. In P. F. with an altered vocal tract: The role of auditory feedback. Journal MacNeilage (Ed.), The production of speech (pp. 217245). New of the Acoustical Society of America, 113(1), 532543. York, NY: Springer-Verlag. Jones, J. A., & Munhall, K. G. (2005). Remapping auditory-motor Lindblom, B. (1990). Explaining phonetic variation: A sketch of the representations in voice production. Current Biology, 15(19), H&H theory. In W. J. Hardcastle, & A. Marchal (Eds.), Speech pro- 17681772. duction and speech modelling (Vol. 55, pp. 403439). Dordrecht, Jones,J.A.,Munhall,K.G.,&Vatikiotis-Bateson,E.(1998).Adaptationto Netherlands: Kluwer Academic Publishers. altered feedback in speech. Paper presented at The 136th Meeting of Liu, D., & Todorov, E. (2007). 
Evidence for the flexible sensorimotor the Acoustical Society of America, Norfolk, VA. strategies predicted by optimal feedback control. Journal of Ju¨rgens, U. (1982). Afferents to the cortical larynx area in the mon- Neuroscience, 27(35), 93549368. key. Brain Research, 239(2), 377389. Liu, H., Zhang, Q., Xu, Y., & Larson, C. R. (2007). Compensatory Ju¨rgens, U. (2002). Neural pathways underlying vocal control. responses to loudness-shifted voice feedback during production Neuroscience and Biobehavioral Reviews, 26(2), 235258. of mandarin speech. The Journal of the Acoustical Society of America, Kalman, R. E. (1960). A new approach to linear filtering and prediction 122(4), 24052412. problems. Transactions of the ASME-Journal of Basic Engineering, 82 Lombard, E. (1911). Le signe de l’elevation de la voix. Annales des (Series. D), 3545. Maladies de l’oreille, du Larynx, du Nez et du Pharynx, 37, 101119. Kalveram, K. T., & Jancke, L. (1989). Vowel duration and voice onset Lubker, J., & Gay, T. (1982). Anticipatory labial coarticulation: time for stressed and nonstressed syllables in stutterers under Experimental, biological, and linguistic variables. The Journal of delayed auditory feedback condition. Folia Phoniatrica, 41(1), the Acoustical Society of America, 71(2), 437448. 3042. Ludlow, C. L. (2004). Recent advances in laryngeal sensorimotor con- Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (2000). Principles of neu- trol for voice, speech and swallowing. Current Opinion in ral science (4th ed.). New York, NY: McGraw-Hill. Otolaryngology & Head and Neck Surgery, 12(3), 160165.

C. BEHAVIORAL FOUNDATIONS REFERENCES 237

Maraist, J. A., & Hutton, C. (1957). Effects of auditory masking upon Pearce, S. L., Miles, T. S., Thompson, P. D., & Nordstrom, M. A. the speech of stutterers. The Journal of Speech and Hearing Disorders, (2003). Is the long-latency stretch reflex in human masseter trans- 22(3), 385389. cortical? Experimental Brain Research, 150(4), 465472. Matthews, B. H. (1931). The response of a single end organ. Journal of Perkell, J. S., & Matthies, M. L. (1992). Temporal measures of antici- Physiology, 71(1), 64110. patory labial coarticulation for the vowel /u/: Within- and cross- Matthies, M., Perrier, P., Perkell, J. S., & Zandipour, M. (2001). subject variability. The Journal of the Acoustical Society of America, Variation in anticipatory coarticulation with changes in clarity and 91(5), 29112925. rate. Journal of Speech, Language, and Hearing Research, 44(2), 340353. Perkell, J. S., Matthies, M. L., Svirsky, M. A., & Jordan, M. I. (1993). Mehta, B., & Schaal, S. (2002). Forward models in visuomotor con- Trading relations between tongue-body raising and lip rounding trol. Journal of Neurophysiology, 88(2), 942953. in production of the vowel /u/: a pilot “motor equivalence” Merton, P. A. (1951). The silent period in a muscle of the human study. Journal of the Acoustical Society of America, 93(5), 29482961. hand. Journal of Physiology, 114(12), 183198. Perrier, P., Ostry, D. J., & Laboissiere, R. (1996). The equilibrium Miall, R. C., Weir, D. J., Wolpert, D. M., & Stein, J. F. (1993). Is the point hypothesis and its application to speech motor control. cerebellum a smith predictor? Journal of Motor Behavior, 25(3), Journal of Speech and Hearing Research, 39(2), 365378. 203216. Pile, E.J.S., Dajani, H.R., Purcell, D.W., & Munhall, K.G. (2007, Miceli, G., Gainotti, G., Caltagirone, C., & Masullo, C. (1980). Some August 610). Talking under conditions of altered auditory feed- aspects of phonological impairment in aphasia. 
Brain and Language, back: Does adaptation of one vowel generalize to other vowels? 11(1), 159169. Paper presented at the International Congress of Phonetic Mishkin, M., Ungerleider, L. G., & Macko, K. A. (1983). Object vision Sciences, Saarland University, Saarbru¨ cken, Germany. and spatial vision: Two cortical pathways. Trends in Neurosciences, Polit, A., & Bizzi, E. (1979). Characteristics of motor programs under- 6, 414417. lying arm movements in monkeys. Journal of Neurophysiology, 42 Moon, S. J., & Lindblom, B. (1994). Interaction between duration, (1 Pt 1), 183194. context, and speaking style in English stressed vowels. Journal of Poljak, S. (1926). The connections of the acoustic nerve. Journal of the Acoustical Society of America, 96(1), 4055. Anatomy, 60(4), 465469. Nasir, S. M., & Ostry, D. J. (2006). Somatosensory precision in speech Price, C. J., Crinion, J. T., & Macsweeney, M. (2011). A generative production. Current Biology, 16(19), 19181923. model of speech production in Broca’s and Wernicke’s areas. Nasir, S. M., & Ostry, D. J. (2008). Speech motor learning in pro- Frontiers in Psychology, 2, 237. foundly deaf adults. Nature Neuroscience, 11(10), 12171222. Purcell, D. W., & Munhall, K. G. (2006). Adaptive control of vowel Nasir, S. M., & Ostry, D. J. (2009). Auditory plasticity and speech formant frequency: Evidence from real-time formant manipula- motor learning. Proceedings of the National Academy of Sciences, 106 tion. Journal of the Acoustical Society of America, 120(2), 966977. (48), 2047020475. Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the Natke, U., Grosser, J., & Kalveram, K. T. (2001). Fluency, fundamen- auditory cortex: Nonhuman primates illuminate human speech tal frequency, and speech rate under frequency-shifted auditory processing. Nature Neuroscience, 12(6), 718724. feedback in stuttering and nonstuttering persons. Journal of Rauschecker, J. P., & Tian, B. (2000). 
Mechanisms and streams for Fluency Disorders, 26(3), 227241. processing of “what” and “where” in auditory cortex. Proceedings Natke, U., & Kalveram, K. T. (2001). Effects of frequency-shifted of the National Academy of Sciences of the United States of America, 97 auditory feedback on fundamental frequency of long stressed and (22), 1180011806. unstressed syllables. Journal of Speech, Language, and Hearing Redgrave, P., & Gurney, K. (2006). The short-latency dopamine signal: Research, 44(3), 577584. A role in discovering novel actions? Nature Reviews Neuroscience, 7 Numminen, J., & Curio, G. (1999). Differential effects of overt, covert (12), 967975. and replayed speech on vowel- evoked responses of the human Rizzolatti, G., Fogassi, L., & Gallese, V. (1997). Parietal cortex: From auditory cortex. Neuroscience Letters, 272(1), 2932. sight to action. Current Opinion in Neurobiology, 7(4), 562567. Numminen, J., Salmelin, R., & Hari, R. (1999). Subject’s own speech Romanski, L. M., Tian, B., Fritz, J., Mishkin, M., Goldman-Rakic, reduces reactivity of the human auditory cortex. Neuroscience P. S., & Rauschecker, J. P. (1999). Dual streams of auditory affer- Letters, 265(2), 119122. ents target multiple domains in the primate prefrontal cortex. Oller, D. K., & Eilers, R. E. (1988). The role of audition in infant bab- Nature Neuroscience, 2(12), 11311136. bling. Child Development, 59(2), 441449. Ross, M., & Giolas, T. G. (1978). Auditory management of hearing-impaired Osberger, M. J., & McGarr, N. S. (1982). Speech production character- children: Principles and prerequisites for intervention. Baltimore, MD: istics of the hearing-impaired. In N. J. Lass (Ed.), Speech and lan- University Park Press. guage: Advances in basic research and practice (pp. 221284). New Saltzman, E. L., Lofqvist, A., Kay, B., Kinsella-Shaw, J., & Rubin, P. York, NY: Academic Press. (1998). 
Dynamics of intergestural timing: A perturbation study Ostry,D.J.,Flanagan,J.R.,Feldman,A.G.,&Munhall,K.G.(1991). of lip-larynx coordination. Experimental Brain Research, 123(4), Humanjawmotioncontrolinmastication and speech. In J. Requin, & 412424. G. E. Stelmach (Eds.), Tutorials in motor neuroscience. NATO ASI series; Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to Series D: Behavioral and social sciences (Vol. 62, pp. 535543). New gestural patterning in speech production. Ecological Psychology, 1 York, NY: Kluwer Academic/Plenum Publishers. (4), 333382. Ostry, D. J., Flanagan, J. R., Feldman, A. G., & Munhall, K. G. (1992). Samejima, K., Ueda, Y., Doya, K., & Kimura, M. (2005). Human jaw movement kinematics and control. In G. E. Stelmach, & Representation of action-specific reward values in the striatum. J. Requin (Eds.), Tutorials in motor behavior, 2. Advances in psychology Science, 310(5752), 13371340. (Vol. 87, pp. 647660). Oxford, England: North-Holland. Sanguineti, V., Laboissiere, R., & Ostry, D. J. (1998). A dynamic bio- Parsons, T. W. (1987). Voice and speech processing. New York, NY: mechanical model for neural control of speech production. The McGraw-Hill Book Company. Journal of the Acoustical Society of America, 103(3), 16151627. Payan, Y., & Perrier, P. (1997). Synthesis of V-V sequences with a 2D Sanguineti, V., Laboissiere, R., & Payan, Y. (1997). A control model of biomechanical tongue model controlled by the equilibrium point human tongue movements in speech. Biological Cybernetics, 77(1), hypothesis. Speech Communication, 22(23), 185205. 1122.

C. BEHAVIORAL FOUNDATIONS 238 19. SPEECH MOTOR CONTROL FROM A MODERN CONTROL THEORY PERSPECTIVE

Savariaux, C., Perrier, P., & Orliaguet, J.-P. (1995). Compensation Tremblay, S., Houle, G., & Ostry, D. J. (2008). Specificity of speech strategies for the perturbation of the rounded vowel [u] using a motor learning. Journal of Neuroscience, 28(10), 24262434. lip tube: A study of the control space in speech production. The Tremblay, S., Shiller, D. M., & Ostry, D. J. (2003). Somatosensory Journal of the Acoustical Society of America, 98(5), 2428. basis of speech production. Nature, 423(6942), 866869. Schmahmann, J. D., Pandya, D. N., Wang, R., Dai, G., D’Arceuil, Turner, R. S., Desmurget, M., Grethe, J., Crutcher, M. D., & Grafton, H. E., de Crespigny, A. J., et al. (2007). Association fibre pathways S. T. (2003). Motor subcircuits mediating the control of movement of the brain: Parallel observations from diffusion spectrum imag- extent and speed. Journal of Neurophysiology, 90(6), 39583966. ing and autoradiography. Brain, 130(Pt 3), 630653. Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. Schultz, W. (1998). Predictive reward signal of dopamine neurons. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis Journal of Neurophysiology, 80(1), 127. of visual behavior (pp. 549586). Cambridge, MA: MIT Press. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate Upadhyay, J., Hallock, K., Ducros, M., Kim, D.-S., & Ronen, I. (2008). of prediction and reward. Science, 275(5306), 15931599. Diffusion tensor spectroscopy and imaging of the arcuate fascicu- Scott, C. M., & Ringel, R. L. (1971). Articulation without oral sensory lus. Neuroimage, 39(1), 19. control. Journal of Speech and Hearing Research, 14(4), 804818. Ventura, M. I., Nagarajan, S. S., & Houde, J. F. (2009). Speech target Scott, S. H. (2004). Optimal feedback control and the neural basis of modulates speaking induced suppression in auditory cortex. volitional motor control. Nature Reviews Neuroscience, 5(7), 534546. BMC Neuroscience, 10, 58. 
Shadmehr, R., & Krakauer, J. W. (2008). A computational neuro- Villacorta, V. M., Perkell, J. S., & Guenther, F. H. (2007). anatomy for motor control. Experimental Brain Research, 185(3), Sensorimotor adaptation to feedback perturbations of vowel 359381. acoustics and its relation to perception. Journal of The Acoustical Shadmehr, R., & Wise, S. P. (2005). The computational neurobiology of Society of America, 122(4), 23062319. reaching and pointing: A foundation for motor learning. Cambridge, Wachholder, K., & Altenburger, H. (1926). Beitra¨ge zur physiologie MA: MIT Press. der willku¨ rlichen bewegung X. Mitteilung. Einzelbewegungen. Shaiman, S., & Gracco, V. L. (2002). Task-specific sensorimotor inter- Pflugers Archiv fur die gesamte Physiologie des Menschen und der actions in speech production. Experimental Brain Research, 146(4), Tiere, 214(1), 642661. 411418. Wang, A. Y., Miura, K., & Uchida, N. (2013). The dorsomedial stria- Shiller, D.M., Sato, M., Gracco, V.L., & Baum, S.R. (2007). Motor tum encodes net expected return, critical for energizing perfor- and sensory adaptation following auditory perturbation of /s/ mance vigor. Nature Neuroscience, 16(5), 639647. production. Paper presented at the 154th Meeting of the Watkins, K., & Paus, T. (2004). Modulation of motor excitability dur- Acoustical Society of America, New Orleans, LA. ing speech perception: The role of Broca’s area. Journal of Shiller, D. M., Sato, M., Gracco, V. L., & Baum, S. R. (2009). Perceptual Cognitive Neuroscience, 16(6), 978987. recalibration of speech sounds following speech motor learning. Wernicke, C. (1874). Der aphasische symptomencomplex: Eine psy- Journal of the Acoustical Society of America, 125(2), 11031113. chologische studie auf anatomischer basis. In G. H. Eggert (Ed.), Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2005). Listening to talk- Wernicke’s works on aphasia: A sourcebook and review (pp. 91145). ing faces: Motor cortical activation during speech perception. 
The Hague: Mouton. Neuroimage, 25(1), 7689. Whiting, H. T. A. (Ed.), (1984). Human motor actions: Bernstein reas- Smith, C. R. (1975). Residual hearing and speech production in deaf sessed. Amsterdam, NL: North-Holland. children. Journal of Speech and Hearing Research, 18(4), 795811. Wiener, N. (1948). Cybernetics: Control and communication in the animal Smith, O. J. M. (1959). A controller to overcome deadtime. ISA and the machine. New York, NY: John Wiley & sons, Inc. Journal, 6,2833. Wilson, S. M., & Iacoboni, M. (2006). Neural responses to non-native Soderberg, G. A. (1968). Delayed auditory feedback and stuttering. phonemes varying in producibility: Evidence for the sensorimotor Journal of Speech and Hearing Disorders, 33(3), 260267. nature of speech perception. Neuroimage, 33(1), 316325. Stengel, R. F. (1994). Optimal control and estimation. Mineola, NY: Wilson, S. M., Saygin, A. P., Sereno, M. I., & Iacoboni, M. (2004). Dover Publications, Inc. Listening to speech activates motor areas involved in speech Tian, X., & Poeppel, D. (2010). Mental imagery of speech and move- production. Nature Neuroscience, 7(7), 701702. Available from: ment implicates the dynamics of internal forward models. http://dx.doi.org/10.1038/nn1263. Frontiers in Psychology, 1, 166. Wolpert, D. M. (1997). Computational approaches to motor control. Tin, C., & Poon, C.-S. (2005). Internal models in sensorimotor integra- Trends in Cognitive Sciences, 1(6), 209. tion: Perspectives from adaptive control theory. Journal of Neural Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of Engineering, 2(3), S147S163. movement neuroscience. Nature Neuroscience, 3(Suppl.), 12121217. Todorov, E. (2004). Optimality principles in sensorimotor control. Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal Nature Neuroscience, 7(9), 907915. model for sensorimotor integration. Science, 269(5232), 18801882. Todorov,E.(2006).Optimalcontroltheory.InK.Doya,S.Ishii, Yates, A. J. 
(1963). Delayed auditory feedback. Psychological Bulletin, A. Pouget, & R. P. N. Rao (Eds.), Bayesian brain: Probabilistic approaches 60(3), 213232. to neural coding (pp. 269298). Cambridge, MA: MIT Press. Zajac, F. E. (1989). Muscle and tendon: Properties, models, scaling, Todorov, E. (2007). Mixed muscle-movement representations emerge from and application to biomechanics and motor control. Critical optimization of stochastic sensorimotor transformations. Unpublished Reviews in Biomedical Engineering, 17(4), 359411. manuscript, UCSD, La Jolla, CA. Zheng, Z. Z., Munhall, K. G., & Johnsrude, I. S. (2010). Functional Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a the- overlap between regions involved in speech perception and in ory of motor coordination. Nature Neuroscience, 5(11), 12261235. monitoring one’s own voice during speech production. Journal of Tourville, J. A., Reilly, K. J., & Guenther, F. H. (2008). Neural Cognitive Neuroscience, 22(8), 17701781. mechanisms underlying auditory feedback control of speech. Zhou, X., Espy-Wilson, C. Y., Boyce, S., Tiede, M., Holland, C., & Neuroimage, 39(3), 14291443. Choe, A. (2008). A magnetic resonance imaging-based articulatory Toyomura, A., Koyama, S., Miyamaoto, T., Terao, A., Omori, T., and acoustic study of “retroflex” and “bunched” American English Murohashi, H., et al. (2007). Neural correlates of auditory feed- /r/. The Journal of the Acoustical Society of America, 123(6), back control in human. Neuroscience, 146(2), 499503. 44664481. Available from: http://dx.doi.org/10.1121/1.2902168.

CHAPTER 20

Spoken Word Recognition: Historical Roots, Current Theoretical Issues, and Some New Directions

David B. Pisoni1 and Conor T. McLennan2
1Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN, USA; 2Department of Psychology, Cleveland State University, Cleveland, OH, USA

20.1 INTRODUCTION

This is an exciting time to be working in the field of human spoken word recognition (SWR). Many long-standing assumptions about spoken language processing are being reevaluated in light of new experimental and computational methods, empirical findings, and theoretical developments (Gaskell, 2007a, 2007b; Hickok & Poeppel, 2007; McQueen, 2007; Pisoni & Levi, 2007). Moreover, recent findings on SWR also have direct applications to issues related to hearing impairment in deaf children and adults, non-native speakers of English, bilinguals, and older adults.

The fundamental problems in the field of SWR, such as invariance and variability, neural coding, representational specificity, and perceptual constancy in the face of diverse sensory input, are similar to the perceptual problems studied in other areas of cognitive psychology and neuroscience. Although these well-known theoretical problems have occupied speech research since the early 1950s, until recently research and theory on spoken language processing have been intellectually isolated from mainstream developments in neurobiology and cognitive science (Arlinger, Lunner, Lyxell, & Pichora-Fuller, 2009; Dahan & Magnuson, 2006; Hickok & Poeppel, 2007; Magnuson, Mirman, & Harris, 2012; Magnuson, Mirman, & Myers, 2013; Rönnberg et al., 2013). The isolation of speech communication evolved because speech scientists and communication engineers relied heavily on linguistically motivated theoretical assumptions about the core properties of speech and the computational processes underlying spoken language processing (see Chomsky & Miller, 1963). These assumptions embodied the conventional segmental linguistic view, which assumes that speech signals consist of a linear sequence of abstract, idealized, context-free segments ordered temporally in time, much like the discrete letters of the alphabet or bricks on a wall (Halle, 1985; Hockett, 1955; Licklider, 1952; Peterson, 1952). The assumption that the continuously varying speech signal can be represented as a sequence of discrete units has played a central role in all theoretical accounts of spoken language research (Lindgren, 1965a, 1965b).

The present chapter is organized into four sections. First, we briefly review the historical roots of the field of SWR. Second, we discuss the principal theoretical issues and contemporary models of SWR. Third, we contrast the conventional segmental view of speech and SWR with an alternative proposal that moves beyond abstract linguistic representations. Finally, we briefly consider how basic research in SWR has led to several new research directions and additional challenges.

20.2 HISTORICAL ROOTS AND PRECURSORS TO SWR

This is a chapter about human SWR. However, before jumping right into our discussion of SWR, some brief historical background is necessary to place the current research and theory into a broader historical context. By discussing some older theoretical issues and empirical findings from research on speech and

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00020-1 © 2016 Elsevier Inc. All rights reserved.

hearing, we are able to illustrate important changes in the way researchers think about representational and processing issues in SWR today. There are, of course, also some connections and parallels with studies of visual word recognition (Carreiras, Armstrong, Perea, & Frost, 2014) and neuroimaging methods of language processing (Price, 2012; Sharp, Scott, & Wise, 2004), which are discussed in other chapters of this book.

The field of speech and hearing science has a well-documented history dating back to the end of the 19th century, when researchers began to use electrically recorded audio signals to assess hearing loss, supplementing traditional clinical measures that relied on simple acoustical signals (Flanagan, 1965; Fletcher, 1929; Miller, 1951; Wilson & McArdle, 2005). Except for the seminal, but otherwise obscure, early findings reported by Bagley (1900) on SWR using a novel experimental methodology involving mispronunciation detection (see Cole, 1973; Cole & Rudnicky, 1983), most of what we currently know about the basic acoustical and perceptual foundations of speech and hearing comes from pioneering research performed at Bell Telephone Laboratories in the 1920s (Flanagan, 1965; Fletcher, 1929, 1953; Fletcher & Galt, 1950). This extensive body of research established the minimal necessary and sufficient acoustical conditions for effective and highly reliable speech transmission and reception over conventional telephone circuits and provided an enormous body of empirical data and acoustic measurements on the foundations of hearing and speech communication under limited telephone bandwidth conditions (Allen, 1994, 2005; Fletcher, 1953).

20.2.1 Speech Intelligibility

Most of the quantitative experimental methods developed for assessing speech intelligibility that are routinely used today can be traced directly back to these early empirical studies (Fletcher, 1929, 1953; Hirsh, 1947; Konkle & Rintelmann, 1983; Wilson & McArdle, 2005). The primary focus of this research was on speech intelligibility; there was little interest in describing the human listener's perceptual and cognitive abilities to recognize spoken words (Allen, 1994, 2005). Further applied research on speech communication in noise was performed at the Psycho-Acoustic Laboratory at Harvard University during World War II (WW II) (see Hudgins, Hawkins, Karlin, & Stevens, 1947; Licklider & Miller, 1951; Rosenzweig & Stone, 1948 for reviews). Although these two applied research programs provided much of the core knowledge about hearing and speech communication, almost all of these investigations were focused on practical telephone and military-related communications issues. Little effort was devoted to broader theoretical and conceptual issues in SWR. One exception was a brief theoretical work by Licklider (1952) on process models of human speech perception. Surprisingly, several ideas proposed by Licklider are still relevant to current theoretical issues today. After WW II, speech and hearing scientists and acoustical engineers turned their attention to the human listener. Several research programs were initiated to understand how SWR is performed so efficiently under highly impoverished conditions (Cooper, Delattre, Liberman, Borst, & Gerstman, 1952). Efforts were also begun to develop methods for speech synthesis-by-rule that could be used in reading machines for blind veterans returning from the war (Allen, Hunnicutt, & Klatt, 1987; Cooper et al., 1952; Klatt, 1987).

20.2.2 Source-Filter Theory and Speech Cues

The development of the source-filter theory of speech acoustics at MIT (Stevens, Kasowski, & Fant, 1953) and the well-known pattern playback studies of speech cues using highly simplified hand-painted spectrographic patterns of synthetic speech at Haskins Laboratories (Cooper et al., 1952) provided the foundations of modern speech science and acoustic-phonetics. The modeling work at MIT focused on the acoustic nature of the speech signal (Stevens, 1998); research at Haskins investigated listeners' perceptual skills in making efficient use of minimal acoustic-phonetic cues as central components in the speech chain (Denes & Pinson, 1963; Liberman, 1996; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Moore, 2007a, 2007b). These early studies were directly responsible for uncovering many core theoretical problems in spoken language processing, especially problems related to the articulatory dynamics of speech production and the context-dependent nature of the acoustic cues to speech perception, which have remained among the major theoretical issues in the field (see Klatt, 1979; Liberman, 1996; Pisoni, 1978; Studdert-Kennedy, 1974; and see chapters in Gaskell, 2007a, and Pisoni & Remez, 2005 for additional reviews and discussion).

20.3 PRINCIPAL THEORETICAL ISSUES IN SWR

In this section, we discuss the principal theoretical issues in SWR and then briefly review several contemporary models of SWR (for more detailed reviews, see Jusczyk & Luce, 2002; Magnuson et al., 2012, 2013; Marslen-Wilson, 1989; Pisoni & Levi, 2007). The fundamental problem in SWR is to understand how listeners recover the talker's intended message from the complex time-varying speech waveform.
This problem is typically broken down into a series of more manageable subquestions (Pisoni, 1978; Studdert-Kennedy, 1974). What stages of perceptual analysis intervene between the presentation of the speech signal and recognition of the talker's intended linguistic message? What types of processing operations occur at each stage of analysis? What are the primary processing units of speech? What is the nature of the neural and cognitive representations of spoken words? Finally, what specific perceptual, neurocognitive, and linguistic processing operations are used in SWR, and how are they coordinated into an integrated system?

Although many of these questions have remained basically the same since the early 1950s, the answers have changed, reflecting new theoretical and methodological developments and novel ways of thinking about the sensory and neurocognitive processes used to recognize spoken words (Hickok & Poeppel, 2007; Luce & Pisoni, 1998; Moore, 2007a, 2007b; Scott & Johnsrude, 2003). As discussed, applied research on hearing and speech performed at Bell Labs and Harvard was primarily concerned with assessing the adequacy of telephone communication equipment and investigating factors that affected speech intelligibility in noise. Other research focused on methods to improve speech intelligibility in military combat conditions (Black, 1946). Although these applied research programs were created to address practical real-world problems, many theoretically important empirical findings were uncovered.

The significance of these discoveries for theories of human SWR was discussed only briefly in numerous research reports from Harvard in the early 1940s (Abrams et al., 1944; Karlin, Abrams, Sanford, & Curtis, 1944; Wiener & Miller, 1946). Many of these empirical observations played substantial roles in theoretical accounts and models of perceptual and cognitive processes (Lindgren, 1965a, 1965b, 1967). Next, we briefly consider three theoretically significant findings originally uncovered by speech scientists at Harvard: (i) word frequency effects; (ii) word length effects; and (iii) sentence context effects. These findings, among others, later played central roles in theoretical discussions of SWR (Broadbent, 1967; Morton, 1979), have shaped the direction of the field, and need to be accounted for in any model of SWR.

20.3.1 Word Frequency, Word Length, and Sentence Context Effects

Numerous investigations have reported that high-frequency words presented in noise are identified more accurately than low-frequency words (Howes, 1954, 1957; Savin, 1963). At the time the word frequency effect was first discovered in the late 1940s, researchers believed that word frequency was equivalent to experienced frequency and that word counts of printed text of English (Francis & Kucera, 1964; Thorndike-Lorge, 1944) could serve as a good proxy for word frequency. Experienced frequency reflects how often a listener has encountered a specific word form. Significant theoretical work was performed by Broadbent (1967) and many others to understand the basis of frequency effects. The word frequency effect is one of the distinctive hallmarks of SWR and has played a central role in research and theory development for many years (Forster, 1976; Morton, 1979; Oldfield, 1966).

Computational analyses and theoretical work performed by Landauer and Streeter (1973) suggested that although experienced frequency may play a role in word recognition processes, frequency effects may also reflect more subtle underlying differences in the structural properties of high- and low-frequency words, and that experienced frequency may simply be a byproduct of the statistical regularities of the sound patterns of words in the language. In an unpublished seminal study using phonotactically legal nonwords, Eukel (1980) demonstrated frequency effects for novel sound patterns. These results suggested that phonotactics (the frequency and patterning of sound segments and syllables within words) may be responsible for the robust perceptual differences observed between high- and low-frequency words in English (Pisoni, Nusbaum, Luce, & Slowiaczek, 1985; Vitevitch & Luce, 1999).

Computational and behavioral studies performed by Luce and Pisoni (1998) revealed that the sound similarity neighborhoods of high- and low-frequency words differed significantly, and that spoken words are recognized relationally in the context of other phonetically similar words in the mental lexicon. Spoken words are not recognized left-to-right, segment-by-segment, in serial order, as traditionally assumed by the conventional linguistic view of speech perception. Instead, spoken words are recognized by processes involving activation and competition among word form candidates or lexical neighbors of spoken words (Pisoni & Luce, 1986).

Results from one of the earliest studies on word length effects are illustrated in Figure 20.1. The data from this study, described by Wiener and Miller (1946), demonstrate effects of word length on SWR scores. In marked contrast to word length effects observed in visual perception and memory, which consistently show that longer words are more difficult to recognize and recall, word length effects in SWR show precisely the opposite result: longer words are easier to perceive and recognize (Rosenzweig & Postman, 1957; Savin, 1963). Wiener and Miller's results also demonstrate that the longer a spoken word, the less often it will be confused with phonetically similar sounding words (Savin, 1963).
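The sound similarity neighborhoods studied by Luce and Pisoni are conventionally defined by a one-phoneme substitution, deletion, or addition rule. As a rough illustration, that computation might be sketched in Python as follows; this is a minimal sketch, not the implementation used in the published studies: the toy lexicon, its transcriptions, and the function names are hypothetical, and real analyses use a full phonemically transcribed dictionary weighted by word frequency.

```python
# Toy lexicon of phonemic transcriptions (hypothetical; illustrative only).
LEXICON = {
    "cat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "cut": ("k", "ah", "t"),
    "cast": ("k", "ae", "s", "t"),
    "at":  ("ae", "t"),
    "dog": ("d", "ao", "g"),
}

def one_phoneme_apart(a, b):
    """True if b can be formed from a by substituting, deleting, or
    adding exactly one phoneme (the one-phoneme rule of Greenberg &
    Jenkins, 1964, as used in neighborhood analyses)."""
    la, lb = len(a), len(b)
    if abs(la - lb) > 1 or a == b:
        return False
    if la == lb:
        # Same length: exactly one substitution allowed.
        return sum(x != y for x, y in zip(a, b)) == 1
    if la > lb:
        a, b = b, a  # make `a` the shorter transcription
    # Deletion/addition: the shorter form must match the longer one
    # with a single phoneme skipped.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    return a[i:] == b[i + 1:]

def neighborhood(word):
    """All lexical neighbors of `word` under the one-phoneme rule."""
    target = LEXICON[word]
    return sorted(w for w, p in LEXICON.items()
                  if w != word and one_phoneme_apart(target, p))

print(neighborhood("cat"))  # → ['at', 'bat', 'cast', 'cut']
```

In the Neighborhood Activation Model discussed later in the chapter, neighbors computed this way are additionally weighted by their frequencies, so that a word in a dense, high-frequency neighborhood faces more competition than one in a sparse neighborhood.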

C. BEHAVIORAL FOUNDATIONS 242 20. SPOKEN WORD RECOGNITION: HISTORICAL ROOTS, CURRENT THEORETICAL ISSUES, AND SOME NEW DIRECTIONS

FIGURE 20.1 The word length effect in SWR illustrating the improvement in speech articulation as the average number of speech sounds per word is increased. Adapted from Wiener and Miller (1946) with permission from the publisher.

FIGURE 20.2 The effects of sentence context on the intelligibility of spoken words in noise as a function of signal-to-noise ratio in decibels. The filled circles show percentage of words correct in meaningful English sentences. The open circles show percentage of words correct for the same words presented in isolation. Adapted from Miller et al. (1951) with permission from the publisher.

Word frequency and word length effects suggest that spoken words are recognized relationally in the context of other words in lexical memory and are not processed in a left-to-right fashion, segment-by-segment, as many theorists had assumed. In open-set speech intelligibility tests in which no response alternatives are given to the listener, spoken words are recognized in relation to other perceptually similar words the listener knows that serve as potential lexical candidates for responses (Pollack, Rubenstein, & Decker, 1959, 1960). When listeners receive compromised sensory information, they make use of sophisticated guessing strategies, generating lexical candidates and strongly biased responses in a systematic manner reflecting the sound similarity relations among words in lexical memory (Broadbent, 1967; Morton, 1969; Savin, 1963; Treisman, 1978a, 1978b).

When words occur in meaningful sentences, listeners also make use of additional knowledge and linguistic constraints that are unavailable when the same words are presented in isolation (Marks & Miller, 1964; Miller, Heise, & Lichten, 1951; Miller & Isard, 1963). SWR is an active and highly automatized process that takes place very rapidly with little conscious awareness of the underlying sensory, cognitive, and linguistic processes. Many different sources of knowledge are used to recognize words depending on the context, test materials, and specific task demands (Jenkins, 1979). One of the most important and powerful sources of contextual constraint comes from sentences. Figure 20.2 shows the pioneering speech intelligibility results obtained by Miller et al. (1951). In one condition, Miller et al. presented sentences containing key words mixed in noise. In a second condition, the same words were presented in isolation. The results in Figure 20.2 illustrate that spoken words are much more difficult to recognize in isolation than in meaningful sentences.

These results establish that sentence context constrains word recognition and demonstrate the contributions of the listener's prior linguistic knowledge to SWR (Miller, 1962; Miller & Selfridge, 1950). When words are encoded in sentences, multiple sources of information automatically become available and their associated brain circuits are recruited to support the recognition process (Scott, Blank, Rosen, & Wise, 2000).

20.3.2 Contemporary Approaches to SWR

One of the most important changes in research on spoken language processing over the past 40 years has been a dramatic shift from a focus on the perception of individual speech sounds in isolated nonsense syllables to the study of the underlying cognitive and linguistic processes involved in SWR. For many speech scientists, the domain of speech perception was narrowly confined to the study of the perception of speech features and phonetic segments in highly controlled experimental contexts using simplified synthesized nonsense syllables (Liberman, 1996). The widespread view at the time was that speech perception was a necessary prerequisite for SWR. In most discussions in the early 1950s, the linguistic construct of the phoneme was considered to be the central elementary building block of speech (Peterson, 1952). The theoretical assumption was that if we can understand the processes used to

recognize individual phonemes in nonsense syllables, then this knowledge could be scaled up to SWR. Such a narrow reductionist research strategy is not surprising; researchers in all areas of science typically work on tractable problems that can be studied with existing paradigms and experimental methodologies. Despite the voluminous literature on isolated phoneme perception in the 1950s, 1960s, and 1970s, until the mid-1970s very little was known about how listeners use acoustic-phonetic information in the speech signal to support SWR (Marslen-Wilson, 1975).

Several reasons can be identified for the shift in research efforts from the study of phonetic segments to SWR. First, the number of speech scientists and psycholinguists increased. Second, the cost of performing speech research decreased significantly with the widespread availability of low-cost digital computers and high-powered sophisticated digital signal processing techniques. Third, with more interest in the field of speech perception and more powerful research tools, younger investigators were able to turn their attention and creative efforts to a wider range of challenges related to how acoustic-phonetic information in the speech signal makes contact with stored lexical representations in memory. Importantly, new studies on SWR also used novel experimental paradigms and techniques that required listeners to actively use phonological, lexical, syntactic, and semantic knowledge in assigning a meaningful interpretation to the input. The shift in emphasis from perception of phonetic segments to SWR was also motivated by the belief that by investigating SWR, new insights would be obtained about the role of context, the lack of acoustic-phonetic invariance, linearity, and the interaction of multiple sources of knowledge in spoken language processing (Miller, 1962).

20.3.3 Theoretical Accounts of SWR

Early theories of SWR were based on models and research findings in visual word recognition. Three basic families of models have been proposed to account for the mapping of speech waveforms onto lexical representations. One approach, represented by the Autonomous Search Model developed by Forster (1976, 1989), is based on the assumption that words are accessed using a frequency-ordered search process. In this model, the initial search is performed based on frequency, with high-frequency words searched before low-frequency words. Search theories are no longer considered viable models of SWR and are not considered any further in this chapter.

The second family of models assumes that words are recognized through processes of activation and competition. Early pure activation models like Morton's Logogen Theory assumed that words are recognized based on sensory evidence in the input signal (Morton, 1969). Passive sensing devices called logogens were associated with individual words in the lexicon. These word detectors collected information from the input; once a logogen reached its threshold, it became activated. To account for frequency effects, common high-frequency words had lower thresholds than rare low-frequency words. There were a number of problems with the Logogen model. It failed to specify precisely the perceptual units used to map acoustic-phonetic input onto logogens or how different sources of linguistic information are combined to alter the activation levels of individual logogens. Finally, the Logogen model was also unable to account for lexical neighborhood effects and the effects of lexical competition among phonetically similar words, because the logogens for individual words are activated independently and have no input from other phonetically similar words in memory.

The third family of models combined assumptions from both search and activation models. One example of a hybrid model of SWR is Klatt's Lexical Access From Spectra (LAFS) model (Klatt, 1979), which relies extensively on real-speech input in the form of power spectra that change over time, unlike other models of SWR that rely on preprocessed coded speech signals as input. Klatt argued that earlier models failed to acknowledge the important role of fine phonetic detail because they uniformly assumed the existence of an intermediate abstract level of representation that eliminated potentially useful acoustic information from the speech signal (Klatt, 1986). Based on a detailed analysis of the design architecture of the HARPY speech recognition system (Lowerre & Reddy, 1980), Klatt suggested that intermediate representations may not be optimal for human or machine SWR because they are always potentially error-prone, especially in noise (Klatt, 1977). Instead, Klatt suggested that spoken words could be recognized directly from an analysis of the input power spectrum using a large network of diphones combined with a "backward beam search" technique like the one originally incorporated in HARPY that eliminated weak lexical candidates from further processing (Klatt, 1979). LAFS is the only model of SWR that attempted to deal with fine phonetic variation in speech, which in recent years has come to occupy the attention of many speech and hearing scientists as well as computer engineers who are interested in designing psychologically plausible models of SWR that are robust under challenging conditions (Moore, 2005, 2007b).

20.3.4 Activation and Competition

Almost all current models of SWR assume two fundamental processes: activation and competition


(Gaskell & Marslen-Wilson, 2002; Luce & Pisoni, 1998; McClelland & Elman, 1986; Norris, 1994). Although there is widespread agreement that acoustic-phonetic input activates a set of lexical candidates that are subsequently selected as a response, the precise details of activation and competition remain a matter of continuing debate (Magnuson et al., 2012). As we discuss similarities and differences between activation-competition models of SWR, it is important to emphasize that all of these SWR models deal with somewhat different theoretical issues and empirical findings, making it difficult to draw direct comparisons among specific models. This is a problem that needs to be addressed in the future because all of the current models target different problems and often focus on specific issues (i.e., the role of top-down feedback, competition dynamics, or representational specificity; see Magnuson et al., 2013 for further discussion). Moreover, none of the current models of SWR deal satisfactorily, if at all, with the indexical channel of information encoded in the speech signal, especially vocal source information about the speaker's voice quality, vocal tract transfer function, and environmental context conditions.

Logogen, TRACE, Shortlist, PARSYN, and the Distributed Cohort Model (DCM) all assume that form-based lexical and sublexical representations can be activated at any point in the speech signal, referred to as radical activation (Luce & McLennan, 2005). Radical activation differs from the earlier proposal of constrained activation in the original Cohort Theory, in which the initial activation of a set of lexical candidates was strictly limited to word-initial onsets (Marslen-Wilson & Welsh, 1978). The original Cohort Theory was based on the hypothesis that the beginnings of words played a special role in activating a set of word-initial lexical candidates, or cohorts. As more sensory information is acquired, words that become inconsistent with the input signal are dropped from the cohort until only one word remains. Although the first version of the Cohort Theory was quite influential and generated many novel studies that shaped the field of SWR (see papers in Altman, 1990), the model has been significantly revised and updated in the DCM described below.

In current models of SWR, a central defining feature is the assumption of competition among multiple activated lexical candidates. Lexical competition is one of the major areas of current research and theory on SWR (see Hannagan, Magnuson, & Grainger, 2013; Scharenborg & Boves, 2010). Although there is now considerable evidence for competition in SWR, debate continues over the precise cognitive and neural mechanisms underlying lexical competition dynamics (see Magnuson et al., 2013). In TRACE, Shortlist, and PARSYN, competition involves lateral inhibition among lexical representations. Lateral inhibition refers to competition within the same level (e.g., words competing with words or segments competing with segments). In contrast, in the DCM, lateral inhibition among local units is replaced by an active process that results from the blending of multiple representations distributed across processing levels (Gaskell & Marslen-Wilson, 2002).

20.3.5 TRACE

TRACE is a highly influential interactive-activation localist connectionist model of SWR (McClelland & Elman, 1986). Localist models of processing assume the existence of discrete stand-alone representations or processing units that have meaning and can be interpreted directly, whereas distributed models make use of patterns of activation across a collection of representations that are dependent on each other and cannot be interpreted by looking at individual units in the network alone. TRACE contains three types of processing units corresponding to features, phonemes, and words (Elman & McClelland, 1986; McClelland & Elman, 1986). Connection weights raise or lower activation levels of the nodes at each level depending on the input and activity of the system. Although TRACE has had considerable influence in the field, the model has several weaknesses and relies extensively on a psychologically and neurally implausible processing architecture. One weakness is that TRACE is only concerned with the recognition of isolated spoken words and has little to say about the recognition of spoken words in connected fluent speech. Furthermore, nodes and connections in TRACE are reduplicated to deal with the temporal dynamics of SWR. Recently, Hannagan et al. (2013) developed a new model called TISK that combines time-specific representations with higher-level representations based on string kernels. TISK reduces the number of units and connections by several orders of magnitude relative to TRACE.

20.3.6 Shortlist, Merge, and Shortlist B

Shortlist is another localist connectionist model of SWR (Norris, 1994). First, a short list of lexical candidates is activated, consisting of word forms that match the speech signal. In the second stage, a subset of hypothesized lexical items enters a smaller network of word units. Lexical units then compete with one another for recognition via lateral inhibition. Shortlist also attempts to account for the segmentation of words in fluent speech via lexical competition. Shortlist simulates the temporal dynamics of SWR without having to rely on the unrealistic processing architecture of


TRACE. Shortlist is also an autonomous model of SWR. Unlike TRACE, Shortlist does not allow for any top-down lexical influences to affect the initial activation of its phoneme nodes. In Merge, an extension of Shortlist, the flow of information between phoneme and word levels is unidirectional and strictly bottom-up (Norris, McQueen, & Cutler, 2000). A revised version of the Shortlist model, Shortlist B, retains many key assumptions of the original model but differs radically in two ways (Norris & McQueen, 2008). First, Shortlist B is based on Bayesian principles. Second, input to Shortlist B is a sequence of phoneme probabilities obtained from human listeners, rather than discrete phonemes. Two other closely related models, SpeM and Fine-Tracker, have been developed recently to accept real speech waveforms (Weber & Scharenborg, 2012).

20.3.7 NAM and PARSYN

One of the most successful models of SWR is the Neighborhood Activation Model (NAM) developed by Luce and Pisoni (1998). NAM was designed to confront the acoustic-phonetic invariance problem in speech perception. NAM assumes that listeners recognize spoken words relationally in the context of other phonetically similar words rather than by strictly bottom-up processing of a sequence of abstract phonetic segments. NAM uses a simple similarity metric for estimating phonological distances of spoken words based on the one-phoneme deletion, addition, or substitution rule developed by Greenberg and Jenkins (1964). This computational method of assessing lexical similarity provides an efficient and powerful way of quantifying the relations between spoken words.

The approach embodied in NAM avoids the long-standing intractable problem of trying to recognize individual context-free, abstract, idealized sound segments (e.g., phonemes) from bottom-up linguistic analysis of invariant acoustic-phonetic properties in the speech waveform. The search for unique acoustic-phonetic invariants for phonemes is no longer a necessary prerequisite when the primary recognition problem is viewed as lexical discrimination and selection among similar sounding words in memory rather than context-independent identification of phonetic segments. Perceptual constancy and lexical abstraction emerge naturally in NAM and are an automatic byproduct of processing interactions between initial sensory information in the signal and lexical knowledge the listener has about possible words and phonological contrasts (see also Grossberg, 2003; Grossberg & Myers, 2000).

PARSYN, another localist model of SWR, is a connectionist instantiation of the design principles originally incorporated in NAM. The model contains three levels of interconnected processing units: (i) input allophones; (ii) pattern allophones; and (iii) words (Luce, Goldinger, & Vitevitch, 2000). Lateral connections between nodes are mutually inhibitory. PARSYN was originally developed to account for lexical competition and probabilistic phonotactics in SWR studies motivated by predictions based on NAM (Vitevitch & Luce, 1999). Unlike TRACE and Shortlist, however, PARSYN assumes the existence of an intermediate allophonic level of representation in the network that encodes fine context-dependent phonetic details (see Wickelgren, 1969).

20.3.8 Distributed Cohort Model

In the DCM, unlike the original Cohort Theory, activation corresponding to a specific word is distributed over a large set of simple processing units (Gaskell & Marslen-Wilson, 2002). In contrast to the three localist SWR models discussed, the DCM assumes distributed representations in which featural input is projected onto simple semantic and phonological units. Because the DCM is a distributed model of SWR, there are no intermediate or sublexical units of representation. Moreover, in contrast to the lateral inhibition used by TRACE, Shortlist, and PARSYN, lexical competition in the DCM is expressed as a blending of multiple lexical items that are consistent with the input.

The differences that exist among the four models of SWR reviewed here are relatively modest. Some unresolved issues need to be explored further, such as the segmentation of words in sentences and connected fluent speech, the contributions of sentence and discourse context, and the interaction of other sources of knowledge used in spoken language understanding. These issues are known to affect SWR in normal-hearing listeners and in clinical populations with hearing loss, language delay, and cognitive aging. Because the current group of SWR models are all very similar, it is doubtful that these issues will prove to be critically important in deciding which model provides the most realistic and valid account of SWR.

In many ways, it appears that the field of SWR research has reached a modeling plateau. Although there remain numerous unresolved empirical issues, it is also clear that substantial progress has already been made in dealing with many long-standing foundational issues surrounding SWR and how information in the speech signal is mapped onto lexical representations. Given recent findings documenting the increasingly important role of fine phonetic variation and representational specificity in SWR, especially in hearing-impaired populations, it is likely that several of the unconventional design features of the LAFS architecture may find their way into revised versions

of these four basic models of SWR in the near future (see recent models developed by Moore, 2007a, 2007b; Rönnberg et al., 2013).

20.4 SWR AND THE MENTAL LEXICON

One solution to long-standing problems associated with the lack of acoustic-phonetic invariance, segmentation, and the context-conditioned nature of speech has been to reframe the perceptual invariance issue by proposing that the primary function of speech perception is the recognition of spoken words rather than the recognition and identification of phonetic segments (Luce & Pisoni, 1998). The proposal to recast research on speech perception and SWR as a lexical selection problem has had a significant influence in the field because it drew attention away from traditional studies of speech cues in isolated nonsense syllables to somewhat broader theoretical and empirical issues related to SWR, the mental lexicon, and the organization of spoken words in lexical memory (Marslen-Wilson, 1975; Marslen-Wilson & Welsh, 1978). It also emphasized the central role of SWR in language comprehension and production, topics that had been ignored by previous work focused exclusively on phonetic perception of speech sounds.

20.4.1 The Conventional View

The conventional view of speech assumes a bottom-up, sensory-based approach to speech perception and SWR in which segments are first recognized from elementary cues and distinctive features in the speech signal and are then parsed into words (see Lindgren, 1965a, 1965b for reviews of early theories). Historically, within the conventional segmental linguistic approach, variability in speech was treated as an undesirable source of noise that needed to be reduced in order to reveal the hidden idealized linguistic message (Chomsky & Miller, 1963; Halle, 1985; Miller & Chomsky, 1963). Many factors known to produce acoustic-phonetic variability were deliberately eliminated or systematically controlled for by speech scientists in the experimental protocols used to study speech. As a consequence, very little basic research was specifically devoted to understanding how variability in the speech signal and listening environment is encoded, processed, and stored, especially in noise and under adverse listening conditions.

20.4.2 Linearity, Invariance, and Segmentation

Several aspects of the conventional view of speech are difficult to reconcile with the continuous nature of the acoustic waveform produced by a speaker. Importantly, the acoustic consequences of coarticulation, as well as other sources of contextually conditioned variability, result in the failure of the acoustic signal to meet two formal conditions: linearity and invariance. This failure gives rise to a third problem: the absence of explicit segmentation of the acoustic speech signal into discrete units (Chomsky & Miller, 1963). The linearity condition requires that each segment correspond to a stretch of sound in the utterance. The linearity condition is not met in speech because extensive coarticulation and other contextual effects smear acoustic features across adjacent segments. The smearing, or parallel transmission, of acoustic features results in stretches of the speech waveform in which acoustic features of more than one segment are present simultaneously (Liberman, 1996). Acoustic-phonetic invariance means that every segment must have a specific set of defining acoustic attributes in all contexts. Because of coarticulatory effects in speech production, the acoustic properties of a particular speech sound vary as a function of the phonetic environment in which it is embedded. Acoustic-phonetic invariance is absent due to within-speaker variation as well as when we look across different speakers (Peterson & Barney, 1952). The absence of acoustic-phonetic invariance is inconsistent with the theoretical assumption that speech can be represented as an idealized, context-free, linear sequence of discrete linguistic segments. A large body of research over the past 60 years demonstrates clearly that the speech signal cannot be reliably segmented into discrete acoustically defined units; in fluent connected speech, especially casual or reduced speech, it is impossible to identify where one word ends and another begins using acoustic criteria alone. Precisely how the continuous speech signal is mapped onto discrete segmental representations by the listener still remains one of the most important and challenging problems in speech research today and suggests the existence of additional representations that encode and process the graded, continuous properties of the speech signal (McMurray & Jongman, 2011).

20.4.3 An Alternative Proposal

Theoretical developments in neurobiology, cognitive science, and brain modeling, along with the availability of powerful new computational tools and large digital speech databases, have led researchers to reconceptualize the major theoretical problems in speech perception and how they should be approached in light of what we now know about neural development and brain function (Sporns, 1998, 2003; Sporns, Tononi, & Edelman, 2000). In particular, several new exemplar-based approaches to SWR have emerged in recent years from independent developments in the field of human

categorization (Kruschke, 1992; Nosofsky, 1986), phonetics and laboratory phonology (Johnson, 2002, 2006), and frequency-based phonology in linguistics (Bybee, 2001; Pierrehumbert, 2001). These novel approaches to SWR offer new insights into many traditional problems related to variability and the lack of acoustic-phonetic invariance in speech perception (Moore, 2007a, 2007b; Pisoni, 1997).

Research findings over the past 25 years provide converging support for an approach to SWR that is compatible with a large and growing body of literature in cognitive science dealing with episodic models of categorization and multiple-trace models of human memory (Erickson & Kruschke, 1998; Hintzman, 1986, 1988; Kruschke, 1992; Nosofsky, 1986; Shiffrin & Steyvers, 1997). This theoretical approach emphasizes the importance of temporal context and the encoding of specific instances in memory. Such accounts of SWR assume that highly detailed stimulus information in the speech signal and listening environment is encoded, processed, and stored by the listener and subsequently becomes an inseparable component of rich and highly detailed lexical representations of spoken words (Port, 2010a, 2010b).

A critical assumption of this approach to speech perception and SWR is that variability in speech is useful and highly informative to the listener, rather than being a source of noise that degrades the underlying idealized abstract linguistic representations (Elman & McClelland, 1986). Exemplar-based views assume that listeners encode and store highly detailed records of episodic experiences, rather than prototypes or abstractions (Kruschke, 1992; Nosofsky, 1986). According to these accounts, abstraction and categorization occur, but they are assumed to emerge from computational processes that take place at retrieval, not at encoding. Thus, the fine-grained continuous acoustic-phonetic and indexical details of speech and episodic contextual information are not discarded as a consequence of early sensory processing and perceptual encoding of spoken words into lexical representations.
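The claim that abstraction and categorization emerge at retrieval rather than at encoding can be made concrete with a minimal multiple-trace sketch in the spirit of Hintzman's (1986, 1988) MINERVA 2. This is a toy illustration rather than the published model: the four-dimensional "word" vectors, the noise level, and the cubing of similarities are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoding: every experienced token is stored as its own trace, complete
# with token-specific ("indexical") variation. No prototype is ever
# computed at study.
prototype_a = np.array([1.0, 1.0, -1.0, -1.0])   # hypothetical category A
prototype_b = np.array([-1.0, -1.0, 1.0, 1.0])   # hypothetical category B
traces = np.vstack(
    [prototype_a + 0.4 * rng.standard_normal(4) for _ in range(50)]
    + [prototype_b + 0.4 * rng.standard_normal(4) for _ in range(50)]
)

def echo(probe, traces, power=3):
    """Retrieval: each trace is activated in proportion to its similarity
    to the probe (raised to an odd power, which preserves sign, as in
    MINERVA 2); the 'echo' is the activation-weighted sum of all traces."""
    sims = traces @ probe / (np.linalg.norm(traces, axis=1) * np.linalg.norm(probe))
    activations = sims ** power
    return activations @ traces

# Probe memory with a new, noisy category-A token: the echo is pulled
# toward the central tendency of the stored A-traces, an abstraction
# that emerges only at retrieval, although only episodes were stored.
probe = prototype_a + 0.4 * rng.standard_normal(4)
e = echo(probe, traces)
print(np.corrcoef(e, prototype_a)[0, 1] > np.corrcoef(e, prototype_b)[0, 1])  # True
```

Because every stored trace retains its token-specific detail, the same memory supports both episodic specificity (any individual trace remains retrievable) and prototype-like behavior (the echo), which is exactly the division of labor the exemplar accounts described above require.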

20.4.4 Indexical Properties of Speech

Numerous studies on the perception of talker variability in SWR have shown that indexical information in the speech signal (Abercrombie, 1967), including details about the vocal sound source and optical information about the speaker's face, as well as highly detailed contextual information about speaking rate, speaking mode, and other detailed episodic properties, such as the acoustic properties of the listening environment, are encoded into lexical representations (Brandewie & Zahorik, 2010; Lachs, McMichael, & Pisoni, 2003). These additional sources of information are assumed to become an integral part of the long-term representations that a listener stores about the sound patterns of spoken words in his or her language and the talkers he or she has been exposed to (Pisoni, 1997; Port, 2010a, 2010b; Remez, Fellows, & Nagel, 2007; Remez, Rubin, Pisoni, & Carrell, 1981). Viewed from this approach, speech variability is an important theoretical problem to study and understand because it has been ignored by speech scientists over the years despite its central role in all aspects of speech communication, cognition, learning, and memory (Jacoby & Brooks, 1984; Klatt, 1986, 1989; Stevens, 1996). Until recently, very little was known about the contribution of the indexical properties of speech to speech perception and the role these complementary attributes play in SWR (see Van Lancker & Kreiman, 1987).

An example of the parallel encoding of linguistic and indexical information in speech is displayed in Figure 20.3. The absolute frequencies of the vowel formants, shown by peaks in the spectrum, provide cues to speaker identification (A), whereas the relative differences among the formants specify information used for vowel identification (B). Both channels are carried simultaneously by the same acoustic signal, and both sources of information are encoded by the peripheral and central auditory mechanisms used to process speech signals.

FIGURE 20.3 (A) Vowel projections at the auditory periphery reveal that information for speaker identification and (B) perception of vowel quality are carried simultaneously and in parallel by the same acoustic signal. The tonotopic organization of the absolute frequencies using a bark scale provides reliable cues to speaker identification, whereas the relations among the formant (F1, F2, and F3) patterns in terms of difference from F0 in barks provide reliable cues to vowel identification. Adapted from Hirahara and Kato (1992) with permission from the publisher.
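The dual coding illustrated in Figure 20.3 can be sketched numerically. The snippet below uses one common approximation to the bark scale, Traunmüller's (1990) formula; the F0 and formant values are illustrative textbook-style numbers, not Hirahara and Kato's measurements. Absolute bark-transformed formant frequencies (the tonotopic code) shift substantially between the two talkers, whereas the F1−F0 and F2−F0 bark differences (the relational code) shift considerably less:

```python
def bark(f_hz):
    """Traunmüller's (1990) approximation to the bark (critical-band) scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Hypothetical /a/-like tokens from a low-pitched and a high-pitched
# talker; the specific values are illustrative, not measurements.
male   = {"F0": 120.0, "F1": 730.0, "F2": 1090.0}
female = {"F0": 220.0, "F1": 850.0, "F2": 1220.0}

for talker in (male, female):
    z0, z1, z2 = (bark(talker[k]) for k in ("F0", "F1", "F2"))
    # Absolute bark values (tonotopic place) carry talker information;
    # the F1-F0 and F2-F0 bark differences carry vowel information.
    print(f"absolute: {z1:.2f}, {z2:.2f}   relative: {z1 - z0:.2f}, {z2 - z0:.2f}")
```

Running the sketch shows the absolute F1 and F2 bark values moving up with the higher-pitched talker while the bark-difference coordinates stay comparatively stable, which is the sense in which the two kinds of information travel in parallel on the same signal.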
Despite recent evidence in favor of episodic approaches to speech perception and SWR, we believe a hybrid account of SWR that also incorporates the abstract representations espoused in the conventional view is a promising direction for future research (see proposals by Moore, 2007a, 2007b; Rönnberg et al., 2013). Substantial evidence has been reported for both abstract and episodic coding of speech by human listeners. The challenge for the future is to identify the conditions under which these types of representations are used in SWR. For example, there is mounting evidence in support of the time course hypothesis of SWR (Luce & Lyons, 1998; Luce & McLennan, 2005), which posits that during the process of recognizing spoken words, abstract linguistic information is typically processed prior to vocal sound source information and other types of indexical and episodic information in speech (Krestar & McLennan, 2013; Mattys & Liss, 2008; McLennan & Luce, 2005; Vitevitch & Donoso, 2011). Time course data on SWR are also consistent with hemispheric differences demonstrating that abstract information is processed more efficiently in the left hemisphere, whereas more specific episodic details of speech, including indexical information, are processed more efficiently in the right hemisphere. There are data in support of such hemispheric differences in the visual domain as well, both for visual words (Marsolek, 2004) and nonlinguistic information (Burgund & Marsolek, 2000), and in the auditory domain for spoken words (González & McLennan, 2007), talker identification (González, Cervera-Crespo, & McLennan, 2012), and nonlinguistic environmental sounds (González & McLennan, 2009). Given that hemispheric differences have been observed in the visual and auditory domains, and for both linguistic and nonlinguistic stimuli, these results suggest that having both abstract and more detailed specific representations may be a general property of cognition and the human information processing system. Finally, controlled attention and other neurocognitive and linguistic factors also affect listeners' processing of abstract and indexical information in speech (Maibauer, Markis, Newell, & McLennan, 2014; McLennan, 2006; Theodore & Blumstein, 2011).

20.5 SOME NEW DIRECTIONS AND FUTURE CHALLENGES

After more than 50 years of research on human speech perception, many of the foundational assumptions of the conventional segmental linguistic view of speech have turned out to be misguided, given our current understanding of how the peripheral and central auditory systems work and how the brain and nervous system function to recognize spoken language (Davis & Johnsrude, 2003; Hickok & Poeppel, 2007; Scott et al., 2000; Scott & Johnsrude, 2003). Long-standing assumptions about SWR are being critically reevaluated (Luce & McLennan, 2005). Deeper theoretical insights have also emerged in recent years, encouraging further empirical research on normal-hearing populations, as well as clinical populations with hearing loss (Niparko et al., 2009; Pichora-Fuller, 1995) and language delays (Beckage, Smith, & Hills, 2011).

We have suggested that speech should no longer be viewed as just a linear sequence of idealized abstract context-free linguistic segments (see Port, 2010a, 2010b). It is also becoming clear that any principled, theoretically motivated account of SWR will have to be compatible with what we currently know about human information processing, including episodic memory and learning (Pisoni, 2000). SWR does not take place in a vacuum isolated from the rest of cognition; core computational processes used in SWR are inseparable from basic memory, learning, and cognitive control processes that reflect the operation of many separate neurobiological components working together as a functionally integrated system (Belin, Zatorre, Lafaille, Ahad, & Pike, 2000). As Nauta said more than 50 years ago, "no part of the brain functions on its own but only through the other parts of the brain with which it is connected" (Nauta, 1964, p. 125). In other words, it takes a whole brain to recognize words and understand spoken language—the ear is connected to the brain and the brain is connected to the ear.

Research on how early sensory information in speech is mapped onto lexical representations of spoken words and how these detailed representations are accessed in SWR tasks has numerous clinical implications for listeners with hearing loss, especially hearing-impaired children and adults who have received cochlear implants (Kirk, 2000; Kirk & Choi, 2009; Kronenberger, Pisoni, Henning, & Colson, 2013; Niparko et al., 2009; Pisoni, 2005).

In addition to clinical implications related to hearing impairment, basic research in SWR has led to other new directions. Research on the recognition of foreign-accented speech contributes to our understanding of lexical activation and selection processes (e.g., Chan & Vitevitch, 2015) and provides new insights into the circumstances under which variability in indexical information affects listeners' ability to recognize words produced by non-native speakers (McLennan & González, 2012). SWR in bilinguals represents another promising area of research. Studies with bilinguals make important contributions to our understanding of basic representational and processing issues in SWR (Vitevitch, 2012).


For example, a recent study with bilinguals demonstrates that it is easier to learn to recognize the voices of previously unfamiliar talkers in a language learned early in life (Bregman & Creel, 2014). Studies of bilingual SWR can also be used as a tool to investigate differences in cognitive control processes between monolinguals and bilinguals (Kroll & Bialystok, 2013; however, see de Bruin, Treccani, & Della Sala, 2015).

Finally, scientists have extended SWR studies to investigations of the aging lexicon and the decline in language processes (e.g., Ben-David et al., 2011; Meister et al., 2013; Sommers, 2005; Yonan & Sommers, 2000). Many aspects of language processing are less likely to show age-related declines—and some show improvements—as a consequence of normal aging compared with many other perceptual and cognitive domains (Taler, Aaron, Steinmetz, & Pisoni, 2010).
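The work on lexical organization cited in this chapter (e.g., Vitevitch, 2008; Altieri, Gruenenfelder, & Pisoni, 2010) typically represents the lexicon as a graph in which words are nodes and an edge connects any two word-forms that differ by a single phoneme substitution, addition, or deletion. A minimal sketch of that representation follows; the toy lexicon and the one-symbol-per-phoneme transcription are illustrative assumptions, not data from those studies.

```python
from itertools import combinations

def one_phoneme_apart(w1, w2):
    """True if w2 can be reached from w1 by substituting, adding, or
    deleting exactly one phoneme (one symbol per phoneme here)."""
    if len(w1) == len(w2):
        return sum(a != b for a, b in zip(w1, w2)) == 1
    if abs(len(w1) - len(w2)) != 1:
        return False
    short, long_ = sorted((w1, w2), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

# Toy lexicon, one ASCII symbol per phoneme (illustrative only).
lexicon = ["kat", "bat", "kab", "at", "kats", "dog"]

# Build the adjacency structure of the phonological network.
neighbors = {w: set() for w in lexicon}
for w1, w2 in combinations(lexicon, 2):
    if one_phoneme_apart(w1, w2):
        neighbors[w1].add(w2)
        neighbors[w2].add(w1)

def clustering_coefficient(word):
    """Proportion of a word's neighbors that are also neighbors of one
    another (the quantity studied by Altieri et al., 2010)."""
    ns = neighbors[word]
    if len(ns) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(ns, 2) if b in neighbors[a])
    return links / (len(ns) * (len(ns) - 1) / 2)

print(sorted(neighbors["kat"]))       # the phonological neighborhood of "kat"
print(clustering_coefficient("kat"))  # how interconnected that neighborhood is
```

Neighborhood density (the size of `neighbors[w]`) and clustering coefficients computed this way can then be related to recognition accuracy and latency in typical and atypical populations.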
20.6 SUMMARY AND CONCLUSIONS

Listeners bring an enormous amount of prior knowledge to every spoken language task they are asked to perform in the research laboratory, clinic, or daily life. In this chapter, we have argued that it is important to keep these broad observations in mind in understanding how spoken words are recognized so efficiently and how listeners manage to reliably recover the talker's intended linguistic message from highly degraded sensory inputs under challenging conditions. The findings reviewed in this chapter suggest that SWR processes are highly robust because listeners are able to make use of multiple sources of information encoded in the speech signal: the traditional linguistic pathway that encodes acoustic-phonetic information specifying the talker's intended message; the indexical pathway that encodes and carries detailed episodic contextual attributes specifying the vocal sound source, such as the talker's gender, regional dialect, and mental and physical states; as well as other downstream sources of linguistic knowledge that support word prediction strategies, sentence parsing, and linguistic interpretation. Variability in the speech signal was once considered an undesirable source of noise and signal degradation that needed to be eliminated or normalized away to recover the idealized abstract segmental content of the talker's intended linguistic message. We now realize that this long-standing conventional view of speech perception and SWR is fundamentally incorrect and that variability in speech is highly informative and an extremely valuable source of contextual information that listeners encode, process, and routinely make use of in recognizing spoken words, especially in noise and under other adverse listening conditions.

We suggested that the field of SWR appears to have reached a plateau in terms of major theoretical or modeling advancements. However, there have been a number of important new empirical and methodological contributions. Among these are time course findings, due in large part to innovative techniques such as eye-tracking using the visual world paradigm (Allopenna, Magnuson, & Tanenhaus, 1998) and, more recently, mouse-tracking (Spivey, Grosjean, & Knoblich, 2005). Moving forward, these new methodologies should contribute to improved theories and more precise models of SWR. We also expect to see increases in the number of studies investigating the recognition of casually spoken, phonetically reduced words (e.g., Ernestus, Baayen, & Schreuder, 2002) and novel approaches to modeling SWR, including the integration of current computational models with existing hybrid frameworks that incorporate the roles of both abstract and indexical information in SWR (Luce et al., 2000).

Finally, there is a rapidly growing body of research on lexical organization and lexical connectivity of words using theory and methodology developed in the field of complex networks and graph theory. This work has provided additional new insights into spoken language processing and holds promise for dealing with more global aspects of SWR in typical and atypical populations (Altieri, Gruenenfelder, & Pisoni, 2010; Beckage et al., 2011; Kenett, Wechsler-Kashi, Kenett, Schwartz, & Ben-Jacob, 2013; Vitevitch, 2008; Vitevitch, Chan, & Goldstein, 2014).

Acknowledgments

Preparation of this chapter was supported, in part, by research grants R01 DC-000111 and R01 DC-009581 to Indiana University from NIH NIDCD. We are grateful to Luis Hernandez for his dedication, help, and continued assistance and advice over many years in maintaining our research laboratory and contributing to the creation of a highly productive and stimulating research environment at Indiana University in Bloomington. We also thank Terren Green for her help and assistance in the preparation of the present manuscript. Finally, D.B.P. expresses his thanks and deepest appreciation to Professor Kenneth N. Stevens, his postdoctoral mentor in the Speech Communications Group at the Research Laboratory of Electronics, MIT, who passed away last year. As everyone in our field knows, Ken Stevens was one of the early pioneers and major architects in the field of speech communications research, speech acoustics, acoustic-phonetics, and speech perception, and he provided wonderful guidance and advice to many graduate students and postdocs during his more than 50 years as a member of the faculty at MIT. Ken—this one's for you! We'll miss you.

References

Abercrombie, D. (1967). Elements of general phonetics. Edinburgh: Edinburgh University.
Abrams, M. H., Goffard, S. J., Kryter, K. D., Miller, G. A., Miller, J., & Sanford, F. H. (1944). Speech in noise: A study of the factors determining its intelligibility. OSRD Report 4023. Cambridge, MA: Research on Sound Control, Psycho-Acoustic Laboratory, Harvard University.


Allen, J., Hunnicutt, M. S., & Klatt, D. (1987). From text to speech: The MITalk system. Cambridge, UK: Cambridge University Press.
Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2(4), 567–577.
Allen, J. B. (2005). Articulation and intelligibility. San Rafael, CA: Morgan & Claypool Publishers.
Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419–439.
Altieri, N., Gruenenfelder, T., & Pisoni, D. B. (2010). Clustering coefficients of lexical neighborhoods. The Mental Lexicon, 5(1), 1–21.
Altman, G. T. M. (Ed.). (1990). Cognitive models of speech processing: Psycholinguistic and computational perspectives. Cambridge, MA: MIT Press.
Arlinger, S., Lunner, T., Lyxell, B., & Pichora-Fuller, M. K. (2009). The emergence of cognitive hearing science. Scandinavian Journal of Psychology, 50, 371–384.
Bagley, W. C. (1900). The apperception of the spoken sentence: A study in the psychology of language. American Journal of Psychology, 12(1), 80–130.
Beckage, N., Smith, L., & Hills, T. (2011). Small worlds and semantic network growth in typical and late talkers. PLoS One, 6(5), e19348.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice selective areas in human auditory cortex. Nature, 403, 309–312.
Ben-David, B. M., Chambers, C. G., Daneman, M., Pichora-Fuller, M. K., Reingold, E. M., & Schneider, B. A. (2011). Effects of aging and noise on real-time spoken word recognition: Evidence from eye movements. Journal of Speech, Language, and Hearing Research, 54, 243–262.
Black, J. W. (1946). Effects of voice communication training. Speech Monographs, 13, 64–68.
Brandewie, E., & Zahorik, P. (2010). Prior listening in rooms improves speech intelligibility. Journal of the Acoustical Society of America, 128(1), 291–299.
Bregman, M. R., & Creel, S. C. (2014). Gradient language dominance affects talker learning. Cognition, 130, 85–95.
Broadbent, D. E. (1967). Word-frequency effect and response bias. Psychological Review, 74(1), 1–15.
Burgund, E. D., & Marsolek, C. J. (2000). Viewpoint-invariant and viewpoint-dependent object recognition in dissociable neural subsystems. Psychonomic Bulletin & Review, 7, 480–489.
Bybee, J. (2001). Phonology and language use. Cambridge, UK: Cambridge University Press.
Carreiras, M., Armstrong, B. C., Perea, M., & Frost, R. (2014). The what, when, where, and how of visual word recognition. Trends in Cognitive Sciences, 18, 90–98.
Chan, K. Y., & Vitevitch, M. S. (2015). The influence of neighborhood density on the recognition of Spanish-accented words. Journal of Experimental Psychology: Human Perception and Performance, 41(1), 65–85.
Chomsky, N., & Miller, G. A. (1963). Introduction to the formal analysis of natural languages. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (pp. 269–321). New York, NY: Wiley.
Cole, R. A. (1973). Listening to mispronunciations: A measure of what we hear during speech. Perception & Psychophysics, 13, 153–156.
Cole, R. A., & Rudnicky, A. I. (1983). What's new in speech perception? The research and ideas of William Chandler Bagley, 1874–1946. Psychological Review, 90, 94–101.
Cooper, F. S., Delattre, P. C., Liberman, A. M., Borst, J. M., & Gerstman, L. J. (1952). Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America, 24, 597–606.
Dahan, D., & Magnuson, J. S. (2006). Spoken-word recognition. In M. J. Traxler & M. A. Gernsbacher (Eds.), Handbook of psycholinguistics (pp. 249–283). Amsterdam: Academic Press.
Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23, 3423–3431.
de Bruin, A., Treccani, B., & Della Sala, S. (2015). Cognitive advantage in bilingualism: An example of publication bias? Psychological Science, 26(1), 99–107.
Denes, P. B., & Pinson, E. N. (1963). The speech chain: The physics and biology of spoken language. New York, NY: Bell Telephone Laboratories.
Elman, J. L., & McClelland, J. L. (1986). Exploiting lawful variability in the speech waveform. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processing (pp. 360–385). Hillsdale, NJ: Erlbaum.
Erickson, M. A., & Kruschke, J. K. (1998). Rules and exemplars in category learning. Journal of Experimental Psychology: General, 127, 107–140.
Ernestus, M., Baayen, H., & Schreuder, R. (2002). The recognition of reduced word forms. Brain and Language, 81, 162–173.
Eukel, B. (1980). Phonotactic basis for word effects: Implications for lexical distance metrics. Journal of the Acoustical Society of America, 68(S1), S33.
Flanagan, J. L. (1965). Speech analysis synthesis and perception. Heidelberg: Springer-Verlag.
Fletcher, H. (1929). Speech and hearing (1st ed.). New York, NY: Van Nostrand Company, Inc.
Fletcher, H. (1953). Speech and hearing in communication. Huntington, NY: Krieger.
Fletcher, H., & Galt, R. H. (1950). The perception of speech and its relation to telephony. Journal of the Acoustical Society of America, 22(2), 89–151.
Forster, K. I. (1976). Accessing the mental lexicon. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms. Amsterdam: North Holland.
Forster, K. I. (1989). Basic issues in lexical processing. In W. Marslen-Wilson (Ed.), Lexical representation and process (pp. 75–107). Cambridge, MA: MIT Press.
Francis, W. N., & Kucera, H. (1964). A standard corpus of present-day edited American English. Providence, RI: Department of Linguistics, Brown University.
Gaskell, M. G. (Ed.). (2007a). The Oxford handbook of psycholinguistics. New York, NY: Oxford University Press.
Gaskell, M. G. (2007b). Statistical and connectionist models of speech perception and word recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 55–69). New York, NY: Oxford University Press.
Gaskell, M. G., & Marslen-Wilson, W. M. (2002). Representation and competition in the perception of spoken words. Cognitive Psychology, 45, 220–266.
González, J., Cervera-Crespo, T., & McLennan, C. T. (2012). Hemispheric differences in specificity effects in talker identification. Attention, Perception, & Psychophysics, 72, 2265–2273.
González, J., & McLennan, C. T. (2007). Hemispheric differences in indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 33, 410–424.
González, J., & McLennan, C. T. (2009). Hemispheric differences in the recognition of environmental sounds. Psychological Science, 20, 887–894.
Greenberg, J. H., & Jenkins, J. J. (1964). Studies in the psychological correlates of the sound system of American English. Word, 20, 157–177.
Grossberg, S. (2003). Resonant neural dynamics of speech perception. Journal of Phonetics, 31, 423–445.


Grossberg, S., & Myers, C. W. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, 107, 735–767.
Halle, M. (1985). Speculations about the representation of words in memory. In V. A. Fromkin (Ed.), Phonetic linguistics (pp. 101–104). New York, NY: Academic Press.
Hannagan, T., Magnuson, J. S., & Grainger, J. (2013). Spoken word recognition without a trace. Frontiers in Psychology, 4, 563.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528–551.
Hirahara, T., & Kato, H. (1992). The effect of F0 on vowel identification. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 89–112). Tokyo: Ohmsha Publishing.
Hirsh, I. J. (1947). Clinical application of two Harvard auditory tests. Journal of Speech and Hearing Disorders, 12, 151–158.
Hockett, C. F. (1955). A manual of phonology. Baltimore, MD: Waverly Press.
Howes, D. (1954). On the interpretation of word frequency as a variable affecting speed of recognition. Journal of Experimental Psychology, 48(2), 106–112.
Howes, D. (1957). On the relation between the intelligibility and frequency of occurrence of English words. The Journal of the Acoustical Society of America, 29(2), 296–305.
Hudgins, C. V., Hawkins, J. E., Karlin, J. E., & Stevens, S. S. (1947). The development of recorded auditory tests for measuring hearing loss for speech. The Laryngoscope, 57(1), 57–89.
Jacoby, L. L., & Brooks, L. R. (1984). Nonanalytic cognition: Memory, perception, and concept learning. In G. Bower (Ed.), The psychology of learning and motivation (pp. 1–47). New York, NY: Academic Press.
Jenkins, J. J. (1979). Four points to remember: A tetrahedral model of memory experiments. In L. S. Cermak & F. I. M. Craik (Eds.), Levels of processing in human memory (pp. 429–446). Hillsdale, NJ: Erlbaum Associates.
Johnson, K. (2002). Acoustic and auditory phonetics (2nd ed.; 1st ed., 1997). Oxford: Blackwell.
Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34, 485–499.
Jusczyk, P. W., & Luce, P. A. (2002). Speech perception and spoken word recognition: Past and present. Ear & Hearing, 23, 2–40.
Karlin, J. E., Abrams, M. H., Sanford, F. H., & Curtis, J. F. (1944). Auditory tests of the ability to hear speech in noise. OSRD Report 3516. Cambridge, MA: Research on Sound Control, Psycho-Acoustic Laboratory, Harvard University.
Kenett, Y. N., Wechsler-Kashi, D., Kenett, D. Y., Schwartz, R. G., & Ben-Jacob, E. (2013). Semantic organization in children with cochlear implants: Computational analysis of verbal fluency. Frontiers in Psychology, 4(543), 1–11.
Kirk, K. I. (2000). Communication skills in early-implanted children. Washington, DC: American Speech-Language-Hearing Association.
Kirk, K. I., & Choi, S. (2009). Clinical investigations of cochlear implant performance. In J. K. Niparko (Ed.), Cochlear implants: Principles & practices (2nd ed., pp. 191–222). Philadelphia, PA: Lippincott Williams & Wilkins.
Klatt, D. H. (1977). Review of the ARPA speech understanding project. Journal of the Acoustical Society of America, 62, 1345–1366.
Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7, 279–312.
Klatt, D. H. (1986). The problem of variability in speech recognition and in models of speech perception. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processing (pp. 300–319). Hillsdale, NJ: Erlbaum.
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3), 737–793.
Klatt, D. H. (1989). Review of selected models of speech perception. In W. Marslen-Wilson (Ed.), Lexical representation and process (pp. 169–226). Cambridge, MA: MIT Press.
Konkle, D. F., & Rintelmann, W. F. (1983). Principles of speech audiometry (Perspectives in Audiology Series). Baltimore, MD: University Park Press.
Krestar, M. L., & McLennan, C. T. (2013). Examining the effects of emotional tone of voice on spoken word recognition. The Quarterly Journal of Experimental Psychology, 66, 1793–1802.
Kroll, J. F., & Bialystok, E. (2013). Understanding the consequences of bilingualism for language processing and cognition. Journal of Cognitive Psychology, 25. Available from: http://dx.doi.org/10.1080/20445911.2013.799170.
Kronenberger, W. G., Pisoni, D. B., Henning, S. C., & Colson, B. G. (2013). Executive functioning skills in long-term users of cochlear implants: A case control study. Journal of Pediatric Psychology, 38(8), 902–914.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22–44.
Lachs, L., McMichael, K., & Pisoni, D. B. (2003). Speech perception and implicit memory: Evidence for detailed episodic encoding. In J. Bowers & C. Marsolek (Eds.), Rethinking implicit memory (pp. 215–235). Oxford: Oxford University Press.
Landauer, T. K., & Streeter, L. A. (1973). Structural differences between common and rare words: Failure of equivalence assumptions for theories of word recognition. Journal of Verbal Learning and Verbal Behavior, 7, 291–295.
Liberman, A. M. (1996). Speech: A special code. Cambridge, MA: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Licklider, J. C. R. (1952). On the process of speech perception. Journal of the Acoustical Society of America, 24, 590–594.
Licklider, J. C. R., & Miller, G. A. (1951). The perception of speech. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1040–1074). Oxford, England: Wiley.
Lindgren, N. (1965a). Machine recognition of human language part I: Automatic speech recognition. IEEE Spectrum, 2(3), 114–136.
Lindgren, N. (1965b). Machine recognition of human language part II: Theoretical models of speech perception and language. IEEE Spectrum, 2(4), 45–59.
Lindgren, N. (1967). Speech—Man's natural communication. IEEE Spectrum, 4, 75–86.
Lowerre, B., & Reddy, R. (1980). The Harpy speech understanding system. In W. A. Lea (Ed.), Trends in speech recognition. Englewood Cliffs, NJ: Prentice-Hall.
Luce, P. A., Goldinger, S. D., & Vitevitch, M. S. (2000). It's good ... but is it ART? Behavioral and Brain Sciences, 23, 336.
Luce, P. A., & Lyons, E. A. (1998). Specificity of memory representations for spoken words. Memory & Cognition, 36, 708–715.
Luce, P. A., & McLennan, C. T. (2005). Spoken word recognition: The challenge of variation. In D. B. Pisoni & R. E. Remez (Eds.), Handbook of speech perception (pp. 591–609). Malden, MA: Blackwell.
Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear & Hearing, 19, 1–36.
Magnuson, J. S., Mirman, D., & Harris, H. D. (2012). Computational models of spoken word recognition. In M. Spivey, K. McRae, & M. Joanisse (Eds.), The Cambridge handbook of psycholinguistics. Cambridge, UK: Cambridge University Press.


Magnuson, J. S., Mirman, D., & Myers, E. (2013). Spoken word recognition. In D. Reisberg (Ed.), The Oxford handbook of cognitive psychology (pp. 412–441). New York, NY: Oxford University Press.
Maibauer, A. M., Markis, T. A., Newell, J., & McLennan, C. T. (2014). Famous talker effects in spoken word recognition. Attention, Perception, & Psychophysics, 76, 11–18.
Marks, L. E., & Miller, G. A. (1964). The role of semantic and syntactic constraints in the memorization of English sentences. Journal of Verbal Learning and Verbal Behavior, 3, 1–5.
Marslen-Wilson, W. D. (1975). Sentence perception as an interactive parallel process. Science, 189(4198), 226–228.
Marslen-Wilson, W. D. (1989). Lexical representation and process. Cambridge, MA: MIT Press.
Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29–63.
Marsolek, C. J. (2004). Abstractionist versus exemplar-based theories of visual word priming: A subsystems resolution. Quarterly Journal of Experimental Psychology: Section A, 57, 1233–1259.
Mattys, S. L., & Liss, J. M. (2008). On building models of spoken-word recognition: Where there is as much to learn from natural "oddities" as artificial normality. Perception & Psychophysics, 70, 1235–1242.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McLennan, C. T. (2006). The time course of variability effects in the perception of spoken language: Changes across the lifespan. Language and Speech, 49, 113–125.
McLennan, C. T., & González, J. (2012). Examining talker effects in the perception of native- and foreign-accented speech. Attention, Perception, & Psychophysics, 74, 824–830.
McLennan, C. T., & Luce, P. A. (2005). Examining the time course of indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 306–321.
McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118(2), 219–246.
McQueen, J. M. (2007). Eight questions about spoken-word recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 37–53). Oxford: Oxford University Press.
Meister, H., Schreitmüller, S., Grugel, L., Ortmann, M., Beutner, D., Walger, M., et al. (2013). Cognitive resources related to speech recognition with a competing talker in younger and older listeners. Neuroscience, 232, 74–82.
Miller, G., & Isard, S. (1963). Some perceptual consequences of linguistic rules. Journal of Verbal Learning and Verbal Behavior, 2, 217–228.
Miller, G. A. (1951). Language and communication. New York, NY: McGraw-Hill.
Miller, G. A. (1962). Decision units in the perception of speech. IRE Transactions on Information Theory, 8(2), 81–83.
Miller, G. A., & Chomsky, N. (1963). Finitary models of language users. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 2, pp. 419–491). New York, NY: Wiley.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test material. Journal of Experimental Psychology, 41, 329–335.
Miller, G. A., & Selfridge, J. A. (1950). Verbal context and the recall of meaningful material. American Journal of Psychology, 63, 176–185.
Moore, R. K. (2005). Towards a unified theory of spoken language processing. In Proceedings of the 4th IEEE International Conference on Cognitive Informatics (pp. 8–10). Irvine, CA, USA.
Moore, R. (2007a). Spoken language processing: Piecing together the puzzle. Speech Communication, 49, 418–443.
Moore, R. K. (2007b). Presence: A human-inspired architecture for speech-based human-machine interaction. IEEE Transactions on Computers, 56(9), 1176–1188.
Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76, 165–178.
Morton, J. (1979). Word recognition. In J. Morton & J. D. Marshall (Eds.), Psycholinguistics 2: Structures and processes (pp. 107–156). Cambridge, MA: MIT Press.
Nauta, W. J. H. (1964). Discussion of 'Retardation and facilitation in learning by stimulation of frontal cortex in monkeys'. In J. M. Warren & K. Akert (Eds.), The frontal granular cortex and behavior (p. 125). New York, NY: McGraw-Hill.
Niparko, J. K., Kirk, K. I., McConkey-Robbins, A., Mellon, N. K., Tucci, D. L., & Wilson, B. S. (2009). Cochlear implants: Principles & practices (2nd ed.). Philadelphia, PA: Lippincott Williams & Wilkins.
Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189–234.
Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357–395.
Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23(3), 299–325.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39–57.
Oldfield, R. C. (1966). Things, words, and the brain. The Quarterly Journal of Experimental Psychology, 18(4), 340–353.
Peterson, G. E. (1952). The information-bearing elements of speech. Journal of the Acoustical Society of America, 24(6), 629–637.
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24(2), 175–184.
Pichora-Fuller, M. K. (1995). How young and old adults listen to and remember speech in noise. Journal of the Acoustical Society of America, 97(1), 593–608.
Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 137–158). Amsterdam: John Benjamins.
Pisoni, D. B. (1978). Speech perception. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 6, pp. 167–233). Hillsdale, NJ: Erlbaum Associates.
Pisoni, D. B. (1997). Some thoughts on "normalization" in speech perception. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9–32). San Diego, CA: Academic Press.
Pisoni, D. B. (2000). Cognitive factors and cochlear implants: Some thoughts on perception, learning, and memory in speech perception. Ear & Hearing, 21, 70–78.
Pisoni, D. B. (2005). Speech perception in deaf children with cochlear implants. In D. B. Pisoni & R. E. Remez (Eds.), Handbook of speech perception (pp. 494–523). Oxford: Blackwell Publishers.
Pisoni, D. B., & Levi, S. V. (2007). Representations and representational specificity in speech perception and spoken word recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). Oxford: Oxford University Press.
Pisoni, D. B., & Luce, P. A. (1986). Speech perception: Research, theory and the principal issues. In E. C. Schwab & H. C. Nusbaum (Eds.), Pattern recognition by humans and machines: Speech perception (Vol. 1, pp. 1–50). New York, NY: Academic Press.
Pisoni, D. B., Nusbaum, H. C., Luce, P. A., & Slowiaczek, L. M. (1985). Speech perception, word recognition and the structure of the lexicon. Speech Communication, 4, 75–95.
Pisoni, D. B., & Remez, R. E. (2005). The handbook of speech perception. Oxford: Blackwell.


Pollack, I., Rubenstein, H., & Decker, L. (1959). Intelligibility of known and unknown message sets. Journal of the Acoustical Society of America, 31, 273–279.
Pollack, I., Rubenstein, H., & Decker, L. (1960). Analysis of incorrect responses to an unknown message set. Journal of the Acoustical Society of America, 32, 454–457.
Port, R. F. (2010a). Language is a social institution: Why phonemes and words do not have explicit psychological form. Ecological Psychology, 22, 304–326.
Port, R. F. (2010b). Rich memory and distributed phonology. Language Sciences, 32(1), 43–55.
Price, C. J. (2012). A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. NeuroImage, 62, 816–847.
Remez, R. E., Fellows, J. M., & Nagel, D. S. (2007). On the perception of similarity among talkers. Journal of the Acoustical Society of America, 122, 3688–3696.
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212(4497), 947–950.
Rönnberg, J., Lunner, T., Zekveld, A., Sörqvist, P., Danielsson, H., Lyxell, B., et al. (2013). The ease of language understanding (ELU) model: Theoretical, empirical, and clinical advances. Frontiers in Systems Neuroscience, 7, Article 31. http://dx.doi.org/10.3389/fnsys.2013.00031.
Rosenzweig, M. R., & Postman, L. (1957). Intelligibility as a function of frequency of usage. Journal of Experimental Psychology, 54(6), 412–422.
Rosenzweig, M. R., & Stone, G. (1948). Wartime research in psycho-acoustics. Review of Educational Research Special Edition: Psychological Research in the Armed Forces, 18(6), 642–654.
Savin, H. B. (1963). Word-frequency effect and errors in the perception of speech. Journal of the Acoustical Society of America, 35, 200–206.
Scharenborg, O., & Boves, L. (2010). Computational modelling of spoken word recognition processes. Pragmatics & Cognition, 18(1), 136–164.
Scott, S. K., & Johnsrude, I. S. (2003). The neuroanatomical and functional organization of speech perception. Trends in Neurosciences, 26, 100–107.
Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(12), 2400–2406.
Sharp, D. J., Scott, S. K., & Wise, R. J. S. (2004). Retrieving meaning after temporal lobe infarction: The role of the basal language area. Annals of Neurology, 56, 836–846.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM – retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166.
Sommers, M. S. (2005). Age-related changes in spoken word recognition. In D. B. Pisoni, & R. E. Remez (Eds.), Handbook of speech perception (pp. 469–493). Malden, MA: Blackwell.
Spivey, M. J., Grosjean, M., & Knoblich, G. (2005). Continuous attraction toward phonological competitors. PNAS, 102, 10393–10398.
Sporns, O. (1998). Biological variability and brain function. In J. Cornwell (Ed.), Consciousness and human identity (pp. 38–56). Oxford: Oxford University Press.
Sporns, O. (2003). Network analysis, complexity, and brain function. Complexity, 8, 56–60.
Sporns, O., Tononi, G., & Edelman, G. M. (2000). Connectivity and complexity: The relationship between neuroanatomy and brain dynamics. Neural Networks, 13, 909–922.
Stevens, K. N. (1996). Understanding variability in speech: A requisite for advances in speech synthesis and recognition. Journal of the Acoustical Society of America, 100, 2634.
Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.
Stevens, K. N., Kasowski, S., & Fant, G. (1953). An electrical analog of the vocal tract. Journal of the Acoustical Society of America, 25, 734–742.
Studdert-Kennedy, M. (1974). The perception of speech. In T. A. Sebeok (Ed.), Current trends in linguistics (Vol. 12, pp. 2349–2385). The Hague: Mouton.
Taler, V., Aaron, G. P., Steinmetz, L. G., & Pisoni, D. B. (2010). Lexical neighborhood density effects on spoken word recognition and production in healthy aging. Journal of Gerontology: Psychological Sciences, 65, 551–560.
Theodore, R. M., & Blumstein, S. E. (2011). Attention modulates the time-course of talker-specificity effects in lexical retrieval. Poster presented at the 162nd meeting of the Acoustical Society of America, San Diego, CA.
Thorndike, E. L., & Lorge, I. (1944). The teacher's word book of 30,000 words. New York: Teachers College Bureau of Publications, Columbia University.
Treisman, M. (1978a). Space or lexicon? Journal of Verbal Learning and Verbal Behavior, 17, 37–59.
Treisman, M. (1978b). A theory of the identification of complex stimuli with an application to word recognition. Psychological Review, 85, 525–570.
Van Lancker, D., & Kreiman, J. (1987). Voice discrimination and recognition are separate abilities. Neuropsychologia, 25, 829–834.
Vitevitch, M. S. (2008). What can graph theory tell us about word learning and lexical retrieval? Journal of Speech, Language, and Hearing Research, 51, 408–422.
Vitevitch, M. S. (2012). What do foreign neighbors say about the mental lexicon? Bilingualism: Language and Cognition, 15, 167–172.
Vitevitch, M. S., Chan, K. Y., & Goldstein, R. (2014). Insights into failed lexical retrieval from network science. Cognitive Psychology, 68, 1–32.
Vitevitch, M. S., & Donoso, A. (2011). Processing of indexical information requires time: Evidence from change deafness. The Quarterly Journal of Experimental Psychology, 64, 1484–1493.
Vitevitch, M. S., & Luce, P. A. (1999). Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory & Language, 40, 374–408.
Weber, A., & Scharenborg, O. (2012). Models of spoken-word recognition. WIREs Cognitive Science, 3, 387–401. http://dx.doi.org/10.1002/wcs.1178.
Wickelgren, W. A. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76(1), 1–15.
Wiener, F. M., & Miller, G. A. (1946). Some characteristics of human speech. In C. E. Waring (Ed.), Transmission and reception of sounds under combat conditions (Vol. 3, pp. 58–68). Summary Technical Report of NDRC Division 17. Washington, DC: NDRC.
Wilson, R. H., & McArdle, R. (2005). Speech signals used to evaluate functional status of the auditory system. Journal of Rehabilitation Research & Development, 42(4 Suppl. 2), 79–94.
Yonan, C. A., & Sommers, M. S. (2000). The effects of talker familiarity on spoken word identification in younger and older listeners. Psychology & Aging, 15, 88–99.

CHAPTER 21

Visual Word Recognition

Kathleen Rastle
Department of Psychology, Royal Holloway, University of London, Egham, Surrey, UK

Reading is one of the most remarkable of our language abilities. Skilled readers are able to recognize printed words and compute their associated sounds and meanings with astonishing speed and a great deal of accuracy. Yet, unlike our inborn capacity for spoken language, reading is not a universal part of the human experience. Reading is a cultural invention and a learned skill, acquired only through years of instruction and practice. Understanding the functional mechanisms that underpin reading and learning to read has been a question of interest since the beginnings of psychology as a scientific discipline (Cattell, 1886; Huey, 1908) and remains a central aim of modern psycholinguistics. This chapter considers one aspect of the reading process, visual word recognition: the process whereby we identify a printed letter string as a unique word and compute its meaning. I focus on the understanding of this process that we have gained through the analysis of behavior and draw particular attention to those aspects of this process that have been the object of recent debate.

21.1 THE ARCHITECTURE OF VISUAL WORD RECOGNITION

Although the earliest theories of visual word recognition claimed that words were recognized as wholes on the basis of their shapes (Cattell, 1886), there is a strong consensus among modern theories that words are recognized in a hierarchical manner on the basis of their constituents, as in the interactive-activation model (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) shown in Figure 21.1 and its subsequent variants (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Grainger & Jacobs, 1996; Perry, Ziegler, & Zorzi, 2007).

Information from the printed stimulus maps onto stored representations of the visual features that make up letters (e.g., horizontal bar), and information from this level of representation then maps onto stored representations of letters. Some theories assert that letter information goes on to activate higher-level sub-word representations at increasing levels of abstraction, including orthographic rimes (e.g., the -and in "band"; Taft, 1992), morphemes (Rastle, Davis, & New, 2004), and syllables (Carreiras & Perea, 2002), before activating stored representations of the spellings of known whole words in an orthographic lexicon. Representations in the orthographic lexicon can then activate information about their respective sounds and/or meanings. The major theories of visual word recognition posit that word recognition is achieved when a unique representation in the orthographic lexicon reaches a critical level of activation (Coltheart et al., 2001; Grainger & Jacobs, 1996; Perry et al., 2007).

In recent years, a different class of theory based on distributed-connectionist principles has made a substantial impact on our understanding of processes involved in mapping orthography to phonology (Plaut, McClelland, Seidenberg, & Patterson, 1996) and mapping orthography to meaning (Harm & Seidenberg, 2004). This chapter highlights some of the most important insights that these models have offered to our understanding of reading. However, although these models have been very effective in helping us to understand the acquisition of quasi-regular mappings (as in spelling-to-sound relationships in English), they have been less successful in describing performance in the most frequently used visual word recognition tasks. They offer no coherent account of the most elementary of these tasks: deciding whether a letter string is a known word (i.e., visual lexical decision). Therefore, this chapter assumes a theoretical perspective based on the interactive-activation model and its

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00021-3 © 2016 Elsevier Inc. All rights reserved.

subsequent variants but directs the reader to further discussion of this issue in relation to distributed-connectionist models (Coltheart, 2004; Rastle & Coltheart, 2006).

[FIGURE 21.1 The interactive-activation model of visual word recognition (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). Levels of processing, from input upward: Print; Feature representations; Letter representations; Orthographic lexicon.]

21.2 ORTHOGRAPHIC REPRESENTATION

21.2.1 Letters and Letter Position

There is widespread agreement that stored representations of letters are abstract letter identities, meaning that they are activated independently of font, size, case, color, or retinal location (Bowers, 2000). This abstraction is a key part of skilled reading because it permits rapid recognition of words presented in unfamiliar surface contexts (e.g., handwriting, typeface). In addition to encoding information about abstract letter identities, letter representations must also encode information about letter position. Otherwise, the visual word recognition system would not be able to distinguish words like SALT and SLAT, which share the same letters but in different positions. The classical solution to this problem implemented in the interactive-activation model (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) and its subsequent variants (Coltheart et al., 2001; Grainger & Jacobs, 1996; Perry et al., 2007) involves slot-based coding. In this scheme, there are slots for each position in a letter string, and a full set of letters within each of those positions. Thus, SALT is coded as S1A2L3T4 and is therefore easy to distinguish from SLAT (S1L2A3T4), because these stimuli overlap only in the initial and final letters (S1T4).

However, recent research has demonstrated convincingly that information about letter position is not represented through this type of slot-based coding. The general problem with slot-based coding is that words such as SLAT and SALT are judged by skilled readers to be perceptually very similar, despite the fact that their slot-based codes overlap by only 50% (S1T4). In fact, the slot-based codes for SLAT and SALT overlap to the same degree as do those for SPIT and SALT (S1T4), which are judged to be much less similar. The following e-mail message, which was circulated globally some years ago, demonstrates this principle very well:

Aoccdrnig to rseearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.

Indeed, the reason that we can read this passage so easily is that words with transposed letters are perceived as being very similar to their base words. This issue has been studied experimentally using the masked form priming technique (Forster & Davis, 1984) in which a target stimulus presented for recognition is preceded by a consciously imperceptible prime stimulus. For example, Schoonbaert and Grainger (2004) demonstrated that recognition of a target stimulus like SERVICE is faster when it is preceded by a masked transposed-letter prime like sevrice than when it is preceded by a masked substitution prime like sedlice. This result is important because, according to slot-based coding, the transposed-letter prime sevrice and the substitution prime sedlice have equivalent perceptual overlap with the target SERVICE, and thus should speed target recognition to the same degree. Similar results are observed when transpositions are nonadjacent; for example, the recognition of CASINO is speeded by the prior masked presentation of the prime caniso relative to the prime caviro (Perea & Lupker, 2003). Finally, these kinds of results extend to even more extreme modifications; for example, the recognition of SANDWICH is speeded by the prior masked presentation of the prime snawdcih relative to the prime skuvgpah (Guerrera & Forster, 2008).

These and other findings highlight an intriguing problem in visual word recognition. Readers are clearly able to distinguish anagram stimuli such as SNAWDCIH and SANDWICH, so letter representations must be coded for position. However, classical theories of how this is achieved (McClelland & Rumelhart, 1981) are clearly inadequate. The evidence now seems to suggest that orthographic representations must code position in a relative rather than absolute manner, and probably with some degree of uncertainty or sloppiness, as in the spatial coding scheme used by the SOLAR model (Davis, 2010). Further, while these kinds of effects motivating

position uncertainty have been reported across a variety of alphabetic languages, it is also important to observe that they are not universal. Primes with transposed letters do not facilitate recognition of their base words (relative to substitution primes) in Hebrew, for example (Velan & Frost, 2009). The reasons for this are not yet well understood, but it seems likely that the greater the density of the orthographic space, the greater the pressure to develop very precise orthographic representations (Frost, 2012). Further research is necessary to determine the exact nature of orthographic coding and why position uncertainty appears to vary as a function of the nature of the writing system.

21.2.2 Frequency, Cumulative Frequency, and Age of Acquisition

There is a broad consensus that an individual's previous experience with a word is the most powerful determinant of how rapidly that word is identified. But what is meant by "an individual's previous experience"? The most common proxy for this is word frequency: the number of times a particular word occurs in some large corpus of text (Baayen, Piepenbrock, & van Rijn, 1993; New, Brysbaert, Veronis, & Pallier, 2007). Effects of word frequency have been reported in lexical decision (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Forster & Chambers, 1973) along with every other speeded task thought to reflect access to orthographic representations, including perceptual identification (Broadbent, 1967), reading aloud (Balota & Chumbley, 1984), and eye-fixation times in sentence reading (Schilling, Rayner, & Chumbley, 1998). Provided frequency estimates are derived from a suitably large corpus of text (approximately half of the frequency effect occurs for words between 0 and 1 occurrences per million; van Heuven, Mandera, Keuleers, & Brysbaert, 2014), word frequency estimates can explain more than 40% of the variance in lexical decision time (Brysbaert & New, 2009). In light of these data, there is wide agreement that one's experience with words is somehow encoded in the orthographic representations of those words and influences the ease with which they can be identified. One long-standing theory is that orthographic representations for high-frequency words have higher resting levels of activation than those for lower-frequency words, allowing them to reach a critical recognition threshold more easily (McClelland & Rumelhart, 1981).

Recently, an interesting debate has emerged over whether the age at which a word is acquired might also be an important aspect of lexical experience, with words acquired earlier processed more easily in visual word recognition tasks. Although several studies claim to have observed independent effects of frequency and age-of-acquisition on visual word recognition when these factors are manipulated orthogonally (Gerhand & Barry, 1999; Morrison & Ellis, 1995), these claims have been very difficult to assess for a number of reasons. For one, the age-of-acquisition metrics used in these studies are typically subjective estimates given by adults of the age at which they acquired particular words. It is not unlikely that the frequency with which a word occurs influences those subjective estimates provided by adults. Further, word frequency and age-of-acquisition are very tightly correlated (i.e., high-frequency words are typically the ones acquired earliest; r = −.68; Carroll & White, 1973), making it extremely difficult to design experiments that examine independent effects of these variables, particularly given that there are multiple corpora from which to draw both of these metrics. Finally, it has been proposed that word frequency and age-of-acquisition are just two dimensions of a single variable: cumulative frequency (i.e., the frequency with which a word occurs over an individual's lifetime; e.g., Zevin & Seidenberg, 2002).

Although it is now fairly well accepted that cumulative frequency provides a better description of our experience with words than printed word frequency (Brysbaert & Ghyselinck, 2006), the more difficult question is whether the age-of-acquisition effects observed on visual word recognition can be accounted for by cumulative frequency. It now seems that the answer is "no." Recent empirical work has demonstrated that the impact of age-of-acquisition on a number of word processing tasks is greater than would be predicted by cumulative frequency (Ghyselinck, Lewis, & Brysbaert, 2004). Further, work using connectionist models has shown that age-of-acquisition effects may be a fundamental property of models that learn incrementally over time (Monaghan & Ellis, 2010). This computational work has also suggested that age-of-acquisition effects may be more prevalent when input-to-output mappings are less systematic, the reason being that the solution space for early-acquired items will be less helpful for later-acquired items when the mapping is more arbitrary (Monaghan & Ellis, 2010). This observation is consistent with the empirical literature that finds particularly robust effects of age-of-acquisition in tasks that require semantic involvement, such as object naming (Ghyselinck, Lewis et al., 2004), translation judgment (Izura & Ellis, 2004), and living versus nonliving decisions (Ghyselinck, Custers, & Brysbaert, 2004).

21.2.3 Morphology

The majority of words in English, and in virtually all of the world's languages, are built by combining

and recombining a finite set of morphemes. These combinatorial processes in English start with a small number of stem morphemes (e.g., trust) and pair them with other stem morphemes to form compound words (e.g., trustworthy), or with derivational (e.g., trusty, distrust) or inflectional (e.g., trusted, trusts) affixes to form the much larger proportion of the words that we use. Despite the fact that words with just a single morpheme (e.g., trust) are in the extreme minority, the major computational models of visual word recognition (Coltheart et al., 2001; Grainger & Jacobs, 1996; Perry et al., 2007) have focused on those. Even when affixed words are included in these models, they are treated in exactly the same way as are nonaffixed words. This treatment is likely to be inadequate, because there is now substantial evidence that words comprising more than one morpheme are recognized in terms of their morphemic constituents.

One of the main sources of evidence for this claim comes from studies that investigate whether the frequency of the stem in a morphologically complex word (e.g., the trust in distrust) plays any role in the time taken to recognize that morphologically complex word. The answer is virtually unequivocal that it does. For example, Taft and Ardasinski (2006) demonstrated that visual lexical decisions for prefixed words with high-frequency stems (e.g., rediscover) were significantly faster than those for prefixed words with low-frequency stems (e.g., refuel), despite the fact that these two sets of words were matched on whole-word frequency (e.g., Ford, Davis, & Marslen-Wilson, 2010). This result appears to indicate that participants access the stems of these words during the recognition process, or alternatively that the representation for morphologically complex words like rediscover is somehow strengthened during acquisition by experience with their constituent stems (in this case, discover). Interestingly, the findings of Taft and Ardasinski (2006) held even in cases in which the nonword fillers for the lexical decision task comprised a prefix attached to a real-word stem (e.g., relaugh), which, if anything, should have biased participants against segmenting the stimuli into their morphemic constituents.

The other major source of evidence for the claim that printed words are recognized in terms of their morphemic constituents comes from masked priming data. Multiple studies have now demonstrated that the recognition of a stem target (e.g., DARK) is speeded by the prior masked presentation of a morphologically related prime (e.g., darkness). The locus of this facilitation appears to reside in orthographic representations, because the recognition of stem targets is speeded to the same degree by the prior presentation of morphologically simple masked primes that have the appearance of morphological complexity (e.g., the prime corner speeds recognition of CORN). Critically, this facilitation cannot be ascribed to overlap in letter representations between primes and targets, because masked primes that share letters but no apparent morphological relationship with targets (e.g., brothel-BROTH; -el is not a possible suffix in English) yield no facilitation (Rastle et al., 2004; see Rastle & Davis, 2008 for review and relevant neural evidence). These data again indicate that readers activate representations of the stems of morphologically structured words during the recognition process.

21.3 PROCESSING DYNAMICS AND MECHANISMS OF SELECTION

The discussion thus far has described a hierarchical theory of visual word recognition that involves multiple layers of orthographic representation. Representations of the features that are used to make letters map onto representations that code abstract letter identity and letter position. There is evidence that these representations then activate representations of sublexical units (e.g., morphemes) before activating representations of the spellings of known words in an orthographic lexicon. This section considers how information flows through this architecture, and how, amidst activation of multiple candidates, a unique word representation reaches a recognition threshold. This discussion is based largely on principles of the interactive-activation model (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982), which many still consider to be the cornerstone of our understanding of visual word recognition.

21.3.1 Interactive Processing

One of the key principles of the interactive-activation model and its subsequent variants is interactive, or bidirectional, processing. In the hierarchical model described, there are assumed to be connections between adjacent levels of representation, which are both excitatory and inhibitory. Information flows in a bidirectional manner across these connections (e.g., letter representations activate word representations, and word representations activate letter representations), and this is what allows the model to explain how higher-level knowledge can influence processing at a lower level.

Two empirical effects were particularly important in identifying the role of bidirectional processing in visual word recognition: the word superiority effect (Reicher, 1969; Wheeler, 1970) and the pseudoword superiority effect (Carr, Davidson, & Hawkins, 1978; McClelland & Johnston, 1977). In the Reicher–Wheeler

word superiority experiments, a letter string was flashed very briefly and then replaced by a pattern mask. Participants were then asked to decide which of two letters, positioned below or above the previous target letter, had been in the original target stimulus. The key manipulation was whether the original target was a word (e.g., WORD) or a nonword (e.g., OWRK). Results revealed that letter identification was far superior when the original flashed target was a word. These data suggest that letter representations receive top-down support through bidirectional connections from whole-word representations activated on presentation of the target stimulus. Intriguingly, a similar letter identification benefit is observed when the stimulus is a pronounceable pseudoword (e.g., TARK) as opposed to a nonpronounceable string (e.g., ATRK) (Carr et al., 1978; McClelland & Johnston, 1977). Even though pseudowords like TARK are not represented in the orthographic lexicon, this result indicates that they may activate whole-word representations for similar words (e.g., DARK, TALK, PARK), which then feed activation back to the letter level, thus explaining the letter identification benefit observed. Overall, these effects support the notion of interactive processing, because they suggest that a decision based on activation at the letter level is influenced by higher-level information from the orthographic lexicon. More recent research has revealed top-down influences of semantic and phonological variables on visual lexical decision, which can only be explained through interactive processing in the reading system. These semantic and phonological effects are discussed in Section 21.4.

21.3.2 Competition as a Mechanism for Selection

The explanation of the pseudoword superiority effect suggests that printed letter strings activate multiple candidates at the word level. Thus, a letter string like WORD may activate the whole-word representation for WORD, along with whole-word representations for WORK, WARD, CORD, LORD, and others. In the interactive-activation model, the activation of multiple candidates is achieved through cascaded processing (McClelland, 1979). Representations at every level excite and inhibit representations at adjacent levels continuously, without having to reach some threshold (as was the case in the "logogen" model; Morton, 1969). However, the situation in which multiple candidates are activated from a single printed stimulus raises the question of how the recognition system selects a unique representation corresponding to the target. The interactive-activation model solves this problem through competition. In addition to connections between levels of representation, the interactive-activation model posits intralevel inhibitory connections. These lateral inhibitory connections between whole-word representations allow the most active unit (typically that of the target) to drive down the activation of its competitors. Of course, representations for any competing alternative candidates will also be exerting inhibition, which will serve to drive down activation of other competitors as well as the representation of the target, making it more difficult for the target to reach a recognition threshold.

21.3.2.1 Neighborhood Effects

One way in which this prediction has been tested is by looking at the impact of lexical similarity on visual word recognition. If a letter string is similar to many words (and thus activates multiple candidates), then it should be more difficult to recognize than a letter string that is similar to few words (and thus does not activate multiple candidates). Before describing the literature around this prediction, it is important to consider what is meant by the phrase "similar to many words." This is a key point: what counts as similar depends entirely on the nature of the scheme adopted for coding letter position (see Section 21.2.1). Two stimuli that have large orthographic overlap according to one scheme for coding letter position may have much less overlap according to another scheme. This principle is nicely illustrated by considering the example words BLAND and LAND. In the slot-based coding scheme described in Section 21.2.1, these stimuli share no overlap whatsoever. However, in a letter coding scheme based on relative position, such as the spatial coding scheme of the SOLAR model (Davis, 2010), these stimuli share substantial overlap. Thus, research on the consequences of lexical similarity for visual word recognition is impeded by the lack of consensus around the nature of orthographic input coding.

Until recently, much of the work regarding the impact of lexical similarity on visual word recognition has been based on a metric known as Coltheart's N (Coltheart, Davelaar, Jonasson, & Besner, 1977). N is defined as the number of words of the same length that can be created by changing one letter of a stimulus, such that a word like CAKE has a very large neighborhood (e.g., BAKE, LAKE, CARE, CAVE) and a word like TUFT has no neighbors. Coltheart et al. (1977) reported that participants rejected high-N nonwords (e.g., PAKE) more slowly than low-N nonwords (e.g., PLUB) in a lexical decision task, an effect replicated several times (Forster & Shen, 1996; McCann, Besner, & Davelaar, 1988). High-N nonwords are thought to be more difficult to reject in lexical decision because they activate many units at the word level,

and this activation makes it difficult to decide that the stimulus is not a word. The situation is more complicated for word targets, however. In contrast to the predictions of competitive models, Coltheart et al. (1977) reported no effect of N on lexical decisions to word targets. Andrews (1989) then went on to report that words with many neighbors are responded to more quickly than words with few neighbors. The same year, however, Grainger, O'Regan, Jacobs, and Segui (1989) reported that words with at least one higher-frequency neighbor are recognized more slowly than words with no higher-frequency neighbors. This latter result makes the Andrews (1989) findings particularly perplexing given that words with many neighbors will almost certainly have at least one higher-frequency neighbor.

Although some investigators have continued to report facilitatory effects of N on recognition latency (Balota et al., 2004; Forster & Shen, 1996), most reports are in line with the prediction from competitive models (i.e., inhibitory effects of N) (Carreiras, Perea, & Grainger, 1997; Grainger & Jacobs, 1996; Perea & Pollatsek, 1998). Grainger and Jacobs (1996) put forward one of the most compelling explanations for these divergent effects, arguing that the inhibitory pattern is the "true" pattern and that facilitatory effects might be the result of strategic processes involved in making lexical decisions. Specifically, they argued that participants in the lexical decision task may be able to make a fast "YES" response if the total activation in the orthographic lexicon is high. The idea is that a large neighborhood is likely to lead to high total activation (i.e., because of the large number of word units activated), and hence it is through this fast guess mechanism that facilitatory effects of neighborhood are deemed to arise. This explanation has received support from studies showing that the direction of the neighborhood size effect can be influenced by instructions that emphasize speed or accuracy (De Moor, Verguts, & Brysbaert, 2005; Grainger & Jacobs, 1996). When participants need to be very accurate, inhibitory neighborhood effects are observed, presumably because their decisions are based on the activation of a single orthographic unit (which will be influenced by lateral inhibition). Conversely, when participants need to be very fast, facilitatory effects are observed, presumably because participants' decisions can be based on the fast guess mechanism that does not require access to an individual orthographic unit.

21.3.2.2 Masked Form Priming Effects

Masked form priming effects have been another powerful source of evidence in support of models that use competition as the mechanism for selection. In the interactive-activation model, priming is conceptualized as a balance between facilitation and inhibition. Primes activate visually similar targets, thus producing savings in the time taken for those targets to reach a recognition threshold. However, primes can also activate orthographic units for whole words, which compete with targets for recognition. Davis (2003) therefore argued that the lexical status of a prime should be an important factor in determining the magnitude of form priming effects. Nonword primes (e.g., azle-AXLE) should yield facilitation, because primes activate units for their corresponding targets without also activating any units for competing words. Conversely, word primes (e.g., able-AXLE) should yield inhibition, because although the prime will still activate the target, it will activate the orthographic unit for itself much more strongly, which will compete with the target for recognition. Results strongly favor competitive models. Nonword masked primes always facilitate recognition of visually similar targets (e.g., azle-AXLE), whereas word masked primes typically inhibit or yield no effect on the recognition of visually similar targets (e.g., able-AXLE; see Davis & Lupker, 2006 for review). The only exception is when word primes appear morphologically related to targets (e.g., darker-DARK); in these cases, primes clearly facilitate rather than inhibit recognition of their targets (Rastle et al., 2004; see Section 21.2).

21.4 VISUAL WORD RECOGNITION AND THE READING SYSTEM

This chapter has put forward an understanding of visual word recognition based on a hierarchical analysis of visual features, letters, subword units (e.g., morphemes), and, ultimately, orthographic representations of whole words. Although visual word recognition is typically regarded in modern theories as based on the analysis of orthography, this system is embedded in a larger reading system that comprises processes to compute the sounds and meanings associated with known spellings. Further, although visual word recognition remains possible in the face of severe semantic and/or phonological impairment due to brain damage (Coltheart, 2004; Coltheart & Coltheart, 1997), it is undisputed that semantic and phonological information can contribute to visual word recognition.

One computational model that may help us to understand semantic and phonological influences on visual word recognition is the DRC model (Coltheart et al., 2001) shown in Figure 21.2. This model postulates three processing pathways: (i) one pathway in which a printed letter string is translated to sound in the absence of lexical information; (ii) one pathway in which the phonological form of a word is retrieved directly after its activation in the orthographic lexicon; and (iii) one pathway in which

FIGURE 21.2 The DRC model (Coltheart et al., 2001). [Diagram: print is coded as feature representations and then letter representations, which feed the orthographic lexicon; semantic representations, rule-based translation, the phonological lexicon, and phoneme representations link these codes to speech.]

the phonological form of a word is retrieved via its meaning representation. The architecture of this model maps fairly well to our understanding of the neural underpinnings of reading (see Taylor, Rastle, & Davis, 2013 for review).

The activation of orthographic whole-word units in the DRC model is based on the interactive-activation model (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). Because information about the printed stimulus flows through all three pathways in cascade, it is entirely possible in this model that information about the semantic and phonological characteristics of a letter string will be activated before any unit in the orthographic lexicon reaches a critical recognition threshold. Further, and critically, there are bidirectional connections between semantic, phonological, and orthographic bodies of knowledge, which make it possible for semantic and phonological information to impact on the activation of units in the orthographic lexicon.

21.4.1 Phonological Influences on Visual Word Recognition

It has been apparent for more than 40 years (actually, going all the way back to Huey, 1908) that the sounds associated with a printed letter string can influence its recognition. Huey (1908) described reading as involving auditory imagery or a "voice in the head," and empirical effects reported during the cognitive renaissance (Rubenstein, Lewis, & Rubenstein, 1971) led theorists of reading to consider that phonological representations may be not only involved in visual word recognition but also a requirement of it (Frost, 1998; Lukatela & Turvey, 1994). Although this strong phonological theory of visual word recognition (Frost, 1998) has fallen out of favor in more recent years, the evidence is unequivocal that sound-based representations are computed as a matter of routine during reading (see Rastle & Brysbaert, 2006 for review).

The use of homophones and pseudohomophones (i.e., nonwords that sound like words; e.g., BRANE) has proven especially useful in delineating the role of phonological representations in visual word recognition. Rubenstein et al. (1971) observed that YES responses in lexical decision were slower for homophones than for nonhomophones (e.g., recognition of MAID slower than PAID), and that NO responses were slower for pseudohomophones than for nonpseudohomophones (e.g., KOAT slower than FOAT). Both of these effects have been replicated and are well-accepted (e.g., homophone effect: Pexman, Lupker, & Hino, 2002; pseudohomophone effect: Ziegler, Jacobs, & Kluppel, 2001). Both of these effects are easy to understand in the light of bidirectional connections in the reading system. The homophone effect arises because after presentation of the stimulus MAID, activation of the phonological unit corresponding to MAID goes on to activate the competitor MADE in the orthographic lexicon, thus slowing recognition of MAID. The pseudohomophone effect arises because the stimulus KOAT will be translated nonlexically to a phonological representation that will activate a unit in the phonological lexicon. This phonological unit will then send activation back to the orthographic unit for COAT, making it difficult to classify the stimulus as a nonword.

Pseudohomophones have also been used extensively in the context of masked form priming to elucidate the role of phonology in visual word recognition. There are now a number of studies showing that the recognition of a target stimulus (e.g., COAT) is facilitated by the prior presentation of a masked pseudohomophone prime (e.g., KOAT) relative to an orthographic control prime (e.g., POAT; Ferrand & Grainger, 1992). This masked phonological priming effect arises in lexical decision, reading aloud, perceptual identification, and eye-movement paradigms, although a meta-analysis conducted by Rastle and Brysbaert (2006) revealed that the effect is small. These effects suggest not only that phonological representations can play a role in visual word processing but also that phonology is activated remarkably quickly in the recognition process. It was these demonstrations of "fast phonology" that led to excitement around the strong phonological theory of reading (Frost, 1998), although modern theorizing suggests that these effects can be explained within weak phonological theories like the DRC model (Figure 21.2), in which visual word recognition is characterized by an orthographic analysis that can be influenced by phonological representations (Rastle & Brysbaert, 2006).
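The nonlexical rule-based route just described can be sketched in miniature. The rule table and function below are purely illustrative assumptions (DRC's actual grapheme-phoneme rule set is far larger and position-sensitive, and the phoneme codes here are arbitrary); the sketch shows only why a pseudohomophone such as KOAT assembles the same phonology as COAT, whereas the orthographic control FOAT does not.

```python
# Toy grapheme-to-phoneme rules: a hypothetical, drastically simplified
# stand-in for a nonlexical translation route.
GP_RULES = [("OA", "oU"), ("C", "k"), ("K", "k"), ("T", "t"), ("F", "f"), ("P", "p")]

def to_phonology(letters):
    """Assemble phonology left to right, preferring longer graphemes."""
    rules = sorted(GP_RULES, key=lambda rule: -len(rule[0]))
    phonemes, i = [], 0
    while i < len(letters):
        for grapheme, phoneme in rules:
            if letters.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this letter; skip it in this toy sketch
    return ".".join(phonemes)

# KOAT assembles the same phonology as COAT; the control FOAT does not.
print(to_phonology("KOAT") == to_phonology("COAT"))  # True
print(to_phonology("FOAT") == to_phonology("COAT"))  # False
```

In the account given in the text, the phonological code assembled for KOAT would then activate the phonological lexicon and feed activation back to the orthographic unit for COAT, which is what makes the nonword hard to reject.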

21.4.2 Semantic Influences on Visual Word Recognition

The DRC model (Coltheart et al., 2001) asserts that skilled readers can recognize printed words in the absence of semantic information, and this claim is backed by evidence that brain-damaged patients with severe semantic impairments can nevertheless recognize printed words accurately (Coltheart, 2004). However, as in the case of phonological information, there is a broad consensus that semantic information can influence visual word recognition. Multiple studies now suggest that words that have particularly rich semantic representations are recognized more quickly than words with more impoverished semantic representations. Visual word recognition is speeded by high imageability (e.g., Balota et al., 2004), high semantic neighborhood density (Locker, Simpson, & Yates, 2003), a large number of meanings (Hino & Lupker, 1996) and related meanings (Azuma & Van Orden, 1997), and a large number of related senses (Rodd, Gaskell, & Marslen-Wilson, 2002). Here, again, bidirectional connections provide the mechanism for explaining these kinds of effects. Printed words activate their semantic representations, and activation at this level feeds back to support orthographic representations for word targets.

Priming studies also reveal semantic influences on visual word recognition. In a seminal study, Meyer and Schvaneveldt (1971) observed that lexical decisions to words (e.g., DOCTOR) were speeded by the prior presentation of semantically related primes (e.g., NURSE) relative to unrelated primes (e.g., BREAD). This finding has been replicated numerous times and has motivated a literature all of its own (e.g., see Hutchison, 2003; Lucas, 2000 for reviews). Semantic priming is usually conceptualized in terms of spreading activation between localist units (Collins & Loftus, 1975), overlap in distributed featural representations (McRae, de Sa, & Seidenberg, 1997), or Euclidean distance between high-dimensional vectors derived from lexical co-occurrence matrices (Landauer & Dumais, 1997). Bidirectional connections between semantic and orthographic levels of representation also play a role in explaining the semantic priming effect. If semantic information about the target is activated after presentation of the prime (e.g., information about doctor activated after presentation of nurse), then this activation can feed back to the orthographic unit for the target (e.g., DOCTOR), thus speeding recognition time.

21.5 CONCLUSION

Humans are born to speak, but they have to learn to read. This fact is part of what makes visual word recognition in the literate adult such an astonishing ability. Readers are faced with considerable variability in the forms of the symbols presented to them, and the density of the orthographic space (particularly in writing systems such as Hebrew) renders words highly confusable. Further, the reading system must develop in such a way that it is closely linked to phonological and semantic bodies of knowledge, and there is substantial evidence that this stored knowledge is activated very soon after presentation of a printed letter string. Research over the past 40 years on the functional mechanisms that underpin visual word processing has been a great success story. This research provides a sound basis from which to discover how the brain supports the mind in regard to this remarkable human achievement.

Acknowledgment

This work was supported by research grants from the ESRC (RES-000-62-2268, ES/L002264/1) and Leverhulme Trust (RPG-2013-024).

References

Andrews, S. (1989). Frequency and neighborhood effects on lexical access: Activation or search? Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(5), 802–814.
Azuma, T., & Van Orden, G. (1997). Why safe is better than fast: The relatedness of a word's meaning affects lexical decision times. Journal of Memory and Language, 36(4), 484–504.
Baayen, R. H., Piepenbrock, R., & van Rijn, H. (1993). The CELEX lexical database (CD-ROM). Linguistic Data Consortium. Philadelphia, PA: University of Pennsylvania.
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10(3), 340–357.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283–316.
Bowers, J. S. (2000). In defense of abstractionist theories of repetition priming and word identification. Psychonomic Bulletin & Review, 7(1), 83–99.
Broadbent, D. E. (1967). Word frequency effects and response bias. Psychological Review, 74(1), 1–15.
Brysbaert, M., & Ghyselinck, M. (2006). The effect of age-of-acquisition: Partly frequency-related, partly frequency-independent. Visual Cognition, 13(7–8), 992–1011.
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Carr, T. H., Davidson, B. J., & Hawkins, H. L. (1978). Perceptual flexibility in word recognition: Strategies affect orthographic computation but not lexical access. Journal of Experimental Psychology: Human Perception and Performance, 4(4), 674–690.
Carreiras, M., & Perea, M. (2002). Masked priming effects with syllabic neighbors in the lexical decision task. Journal of Experimental Psychology: Human Perception and Performance, 28(5), 1228–1242.


Carreiras, M., Perea, M., & Grainger, J. (1997). Effects of the orthographic neighborhood in visual word recognition: Cross-task comparisons. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4), 857–871.
Carroll, J. B., & White, M. N. (1973). Word frequency and age of acquisition as determiners of picture naming latencies. Quarterly Journal of Experimental Psychology, 24(1), 85–95.
Cattell, J. (1886). The time it takes to see and name objects. Mind, 11, 63–65.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
Coltheart, M. (2004). Are there lexicons? Quarterly Journal of Experimental Psychology, 57A(7), 1153–1171.
Coltheart, M., & Coltheart, V. (1997). Reading comprehension is not exclusively reliant upon phonological representation. Cognitive Neuropsychology, 14(1), 167–175.
Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and Performance VI (pp. 535–555). Hillsdale, NJ: Erlbaum.
Coltheart, M., Rastle, K., Perry, C., Langdon, R., & Ziegler, J. (2001). DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108(1), 204–256.
Davis, C. J. (2003). Factors underlying masked priming effects in competitive network models of visual word recognition. In S. Kinoshita, & S. J. Lupker (Eds.), Masked Priming: The State of the Art (pp. 121–170). Hove, UK: Psychology Press.
Davis, C. J. (2010). The spatial coding model of visual word identification. Psychological Review, 117(3), 713–758.
Davis, C. J., & Lupker, S. J. (2006). Masked inhibitory priming in English: Evidence for lexical inhibition. Journal of Experimental Psychology: Human Perception and Performance, 32(3), 668–687.
De Moor, W., Verguts, T., & Brysbaert, M. (2005). Testing the "multiple" in the multiple read-out model of visual word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(6), 1502–1508.
Ferrand, L., & Grainger, J. (1992). Phonology and orthography in visual word recognition: Evidence from masked non-word priming. Quarterly Journal of Experimental Psychology, 45A(3), 353–372.
Ford, M. A., Davis, M. H., & Marslen-Wilson, W. D. (2010). Derivational morphology and base morpheme frequency. Journal of Memory and Language, 63(1), 117–130.
Forster, K. I., & Chambers, S. (1973). Lexical access and naming time. Journal of Verbal Learning and Verbal Behavior, 12(6), 627–635.
Forster, K. I., & Davis, C. (1984). Repetition priming and frequency attenuation in lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(4), 680–689.
Forster, K. I., & Shen, D. (1996). No enemies in the neighborhood: Absence of inhibitory neighborhood effects in lexical decision and semantic categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(3), 696–713.
Frost, R. (1998). Toward a strong phonological theory of visual word recognition: True issues and false trails. Psychological Bulletin, 123(1), 71–99.
Frost, R. (2012). Towards a universal model of reading. Behavioral and Brain Sciences, 35(5), 263–279.
Gerhand, S., & Barry, C. (1999). Age of acquisition, word frequency, and the role of phonology in the lexical decision task. Memory and Cognition, 27(4), 592–602.
Ghyselinck, M., Custers, R., & Brysbaert, M. (2004). The effect of age of acquisition in visual word processing: Further evidence for the semantic hypothesis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2), 550–554.
Ghyselinck, M., Lewis, M. B., & Brysbaert, M. (2004). Age of acquisition and the cumulative-frequency hypothesis: A review of the literature and a new multi-task investigation. Acta Psychologica, 115(1), 43–67.
Grainger, J., & Jacobs, A. M. (1996). Orthographic processing in visual word recognition: A multiple read-out model. Psychological Review, 103(3), 518–565.
Grainger, J., O'Regan, J. K., Jacobs, A. M., & Segui, J. (1989). On the role of competing word units in visual word recognition: The neighborhood frequency effect. Perception and Psychophysics, 51(1), 49–56.
Guerrera, C., & Forster, K. I. (2008). Masked form priming with extreme transpositions. Language and Cognitive Processes, 23(1), 117–142.
Harm, M., & Seidenberg, M. S. (2004). Computing the meanings of words in reading: Cooperative division of labor between visual and phonological processes. Psychological Review, 111(3), 662–720.
Hino, Y., & Lupker, S. J. (1996). Effects of polysemy in lexical decision and naming: An alternative to lexical access accounts. Journal of Experimental Psychology: Human Perception and Performance, 22(6), 1331–1356.
Huey, E. B. (1908). The Psychology and Pedagogy of Reading. Repr. 1968. Cambridge, MA: MIT Press.
Hutchison, K. A. (2003). Is semantic priming due to association strength or feature overlap? A micro-analytic review. Psychonomic Bulletin & Review, 10(4), 785–813.
Izura, C., & Ellis, A. W. (2004). Age of acquisition effects in translation judgement tasks. Journal of Memory and Language, 50(2), 165–181.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Locker, L., Simpson, G. B., & Yates, M. (2003). Semantic neighborhood effects on the recognition of ambiguous words. Memory and Cognition, 31(4), 505–515.
Lucas, M. (2000). Semantic priming without association: A meta-analytic review. Psychonomic Bulletin & Review, 7(4), 618–630.
Lukatela, G., & Turvey, M. T. (1994). Visual lexical access is initially phonological: 1. Evidence from associative priming by words, homophones, and pseudohomophones. Journal of Experimental Psychology: General, 123(2), 107–128.
McCann, R. S., Besner, D., & Davelaar, E. (1988). Word recognition and identification: Do word-frequency effects reflect lexical access? Journal of Experimental Psychology: Human Perception and Performance, 14(4), 693–706.
McClelland, J. L. (1979). On the time relations of mental processes: A framework for analyzing processes in cascade. Psychological Review, 86(4), 287–330.
McClelland, J. L., & Johnston, J. C. (1977). The role of familiar units in perception of words and nonwords. Perception and Psychophysics, 22(3), 249–261.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88(5), 375–407.
McRae, K., de Sa, V. R., & Seidenberg, M. S. (1997). On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126(2), 99–130.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90(2), 227–234.
Monaghan, P., & Ellis, A. W. (2010). Modeling reading development: Cumulative, incremental learning in a computational model of word naming. Journal of Memory and Language, 63(4), 506–525.
Morrison, C. M., & Ellis, A. W. (1995). Roles of word frequency and age of acquisition in word naming and lexical decision. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(1), 116–133.


Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76(2), 165–178.
New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28(4), 661–677.
Perea, M., & Lupker, S. J. (2003). Does jugde activate COURT? Transposed-letter similarity effects in masked associative priming. Memory and Cognition, 31(6), 829–841.
Perea, M., & Pollatsek, A. (1998). The effects of neighborhood frequency in reading and lexical decision. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 767–779.
Perry, C., Ziegler, J. C., & Zorzi, M. (2007). Nested incremental modeling in the development of computational theories: The CDP+ model of reading aloud. Psychological Review, 114(2), 273–315.
Pexman, P. M., Lupker, S. J., & Hino, Y. (2002). The impact of feedback semantics in visual word recognition: Number of features effects in lexical decision and naming tasks. Psychonomic Bulletin & Review, 9(3), 542–549.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103(1), 56–115.
Rastle, K., & Brysbaert, M. (2006). Masked phonological priming effects in English: Are they real? Do they matter? Cognitive Psychology, 53, 97–145.
Rastle, K., & Coltheart, M. (2006). Is there serial processing in the reading system; and are there local representations? In S. Andrews (Ed.), From inkmarks to ideas: Current issues in lexical processing. Hove: Psychology Press.
Rastle, K., & Davis, M. H. (2008). Morphological decomposition based on the analysis of orthography. Language and Cognitive Processes, 23(7–8), 942–971.
Rastle, K., Davis, M., & New, B. (2004). The broth in my brother's brothel: Morpho-orthographic segmentation in visual word recognition. Psychonomic Bulletin & Review, 11(6), 1090–1098.
Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81(2), 274–280.
Rodd, J., Gaskell, G., & Marslen-Wilson, W. (2002). Making sense of semantic ambiguity: Semantic competition in lexical access. Journal of Memory and Language, 46(2), 245–266.
Rubenstein, H., Lewis, S. S., & Rubenstein, M. A. (1971). Evidence for phonemic recoding in visual word recognition. Journal of Verbal Learning and Verbal Behavior, 10(6), 645–657.
Rumelhart, D. E., & McClelland, J. L. (1982). An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89(1), 60–94.
Schilling, H. E. H., Rayner, K., & Chumbley, J. I. (1998). Comparing naming, lexical decision, and eye fixation times: Word frequency effects and individual differences. Memory and Cognition, 26(6), 1270–1281.
Schoonbaert, S., & Grainger, J. (2004). Letter position coding in printed word perception: Effects of repeated and transposed letters. Language and Cognitive Processes, 19(3), 333–367.
Taft, M. (1992). The body of the BOSS: Sub-syllabic units in the lexical processing of polysyllabic words. Journal of Experimental Psychology: Human Perception and Performance, 18(4), 1004–1014.
Taft, M., & Ardasinski (2006). Obligatory decomposition in reading prefixed words. The Mental Lexicon, 1(2), 183–199.
Taylor, J., Rastle, K., & Davis, M. H. (2013). Can cognitive models explain brain activation during word and pseudoword reading? A meta-analysis of 36 neuroimaging studies. Psychological Bulletin, 139(4), 766–791.
van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
Velan, H., & Frost, R. (2009). Letter-transposition effects are not universal: The impact of transposing letters in Hebrew. Journal of Memory and Language, 61(3), 285–302.
Wheeler, D. D. (1970). Processes in visual word recognition. Cognitive Psychology, 1(1), 59–85.
Zevin, J. D., & Seidenberg, M. S. (2002). Age of acquisition effects in word reading and other tasks. Journal of Memory and Language, 47(1), 1–29.
Ziegler, J. C., Jacobs, A. M., & Kluppel, D. (2001). Pseudohomophone effects in lexical decisions: Still a challenge for current word recognition models. Journal of Experimental Psychology: Human Perception and Performance, 27(3), 547–559.

CHAPTER 22
Sentence Processing
Fernanda Ferreira1 and Derya Çokal2
1Department of Psychology and Center for Mind and Brain, University of California, Davis, CA, USA; 2Institute for Brain and Mind, University of South Carolina, Columbia, SC, USA

The existence of a field called "sentence processing" attests to the implicit agreement among most psycholinguists that the sentence is a fundamental unit of language. In addition, by convention, the term "processing" in this context tends to refer to comprehension rather than production, and thus the topic of this chapter is people's interpretations of sentences. Our goal is to provide an overview of the findings, theories, and debates that are discussed in more detail in the chapters in this volume comprising the section on "Sentence Processing" (Chapters 47–52). The relevant issues include syntactic and semantic processing, the time-course of interpretation, and the role of other cognitive systems such as working memory in forming sentence interpretations. In this chapter, we begin by examining the sources of information that are used during sentence processing. We then review the major theoretical controversies and debates in the field: the incremental nature of interpretation, serial versus parallel processing, and the extent of interaction among information sources during online processing. Then, we go over the major models of sentence processing, including syntax-based models, constraint-based models, the good-enough approach, and the very recent rational analysis approaches. We end with a few conclusions and speculations concerning future research directions.

22.1 SOURCES OF INFORMATION FOR SENTENCE PROCESSING

Since the 1980s, when psycholinguistics experienced a renaissance (Clifton, 1981) and returned to the question of how to relate formal and psychological approaches to language, the field of sentence processing has been associated with a commitment to the idea that syntactic information is critical to successful language comprehension. Not all theorists agree on the nature of those syntactic representations or the relative importance of information sources that are nonsyntactic, but almost all assume that structure-building operations are essential for successful comprehension (Fodor, Bever, & Garrett, 1974; Frazier & Rayner, 1990). One key component is phrase-structure parsing, which refers to the process of identifying constituents and grouping them into a hierarchical structure. For example, in a sentence such as While Mary bathed the baby played in the crib, the parser must create a structural analysis that postulates the existence of a subordinate and a main clause; moreover, the main verb of the subordinate clause must be analyzed as intransitive and reflexive, and the subject of the main clause must be identified as the baby. With this analysis, the correct meaning can be derived, which is that Mary is bathing herself, and the baby is the agent of playing.

As the same example makes clear, one of the challenges to the parser is syntactic ambiguity. At various points in a sentence, a sequence of words can be given more than one grammatical analysis. In the example, the phrase the baby appears to be the object of bathed, but in fact it turns out to be the subject of played. The result is a so-called "garden-path." The parser first builds an incorrect analysis, and reanalysis processes are triggered on receipt of a constituent that cannot be incorporated into the existing structure. Because the parser obeys the rules of the grammar, including the rule mandating overt subjects, the parse will fail at played, and the sentence processing system must locate the alternative analysis in which the baby is a subject. How this happens is another point of divergence between competing sentence processing models, as is discussed in Section 22.2.

An additional complication regarding the syntactic analysis of a sentence is that the grammar allows

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00022-5 © 2016 Elsevier Inc. All rights reserved.

constituents to be moved from their canonical positions. One classic example is the passive, in which the theme of an action is also the sentential subject, contrary to the general preference to align agency and subjecthood (Fillmore, 1968; Grimshaw, 1990; Jackendoff, 1990). Another type of moved constituent is wh-phrases; in English, as in many other languages, wh-phrases must be moved from their canonical position to a position at the beginning of the clause, leaving behind a trace or “gap.” For example, in Which man did the dog bite?, the phrase which man receives its thematic role from bite. The job of the parser is to find the gap and relate it to the wh-phrase so that the sentence can receive a correct interpretation. This task is made difficult by two challenges. First, the gap is a phonetically null element in the string, and therefore the parser must identify the gap based on the application of a range of linguistic constraints. The second challenge concerns ambiguity. Because many verbs have multiple argument structures, the parser may end up postulating a gap incorrectly. The result is so-called decoy gaps, as illustrated in Who will the zombie eat with? The parser initially assumes that who was moved from the direct object position after eat, and then must reanalyze that structure when with is encountered.

Studies investigating the processing of filler-gap dependencies have found evidence for a filled-gap effect, which is closely related to decoy gaps. Consider the example Which patient did the doctor expect the nurse to call? Most comprehenders will assume that which patient is the object of expect, but the noun phrase (NP) the nurse occupies that position, which means that the parser must look further along for the correct gap (located after call). The existence of filled-gap effects has led researchers to postulate two parsing preferences for creating filler-gap dependencies. One is that the parser adopts an active or early filler strategy (Frazier, Clifton, & Randall, 1983; Frazier & Flores D’Arcais, 1989), according to which a gap is postulated at the first syntactically permissible location. The second is that the parser makes use of verb argument structure information to guide the postulation of gaps. If a verb has a strong intransitive bias, then the parser is less likely to postulate a gap after it; if the verb is strongly transitive, then a postverbal gap will be more compelling.

As we have been discussing the importance of syntactic information for parsing, we have had numerous occasions to refer to lexical information as well. This is because lexical information is the fundamental bottom-up information source for sentence processing. In lexicalist theories, syntactic information is attached to specific words so that when a word is retrieved, its associated structural possibilities become available as well (Joshi & Schabes, 1997; MacDonald, Pearlmutter, & Seidenberg, 1994). For example, retrieval of the verb bathe would bring up not only information associated with the syntactic category and meaning of that word but also the word’s syntactic dependents in the form of what are known as argument structures. An optionally transitive verb like bathe would have at least two argument structures, one specifying an agent and a patient and the other specifying an agent and an obligatory reflexive null element. Nonlexicalist theories also assume a major role for this type of information; however, in contrast with lexicalist theories, argument structures are used not to generate a parse, but rather to filter or reinforce a particular analysis and to facilitate recovery from a garden-path.

Another type of lexical information that can be critical for parsing relates to semantic features such as number and animacy. Number information can affect how an ambiguous phrase is attached during online processing; for example, in a sentence such as While John and Mary kissed the baby slept, the verb kissed is interpreted as intransitive because the plural subject triggers a reciprocal reading of kissed. A singular subject does not license this reciprocal interpretation (Ferreira & McClure, 1997; Patson & Ferreira, 2009). Similarly, animacy can help the parser avoid a garden-path, or help it recover more easily (Ferreira & Clifton, 1986; Trueswell, Tanenhaus, & Kello, 1993). Specifically, if a subject is inanimate, then it is unlikely to be an agent, and that analysis, in turn, might lead the parser to adopt a less frequent passive or reduced relative parse (e.g., The evidence examined by the lawyer). These examples also show how word properties such as number and animacy interact with lexical argument structures, because those features can lead the parser to select one argument structure (e.g., a reciprocal one for a verb such as kiss) over another.

Next, let us consider the question of how prosodic information might influence sentence processing. The starting point for most studies published on this topic is that syntactic and prosodic structures are related and, in particular, major syntactic boundaries such as those separating clauses are usually marked by phrase-final lengthening and changes in pitch (Ferreira, 1993). Some clause-internal phrasal boundaries are also marked, although much less reliably (Allbritton, McKoon, & Ratcliff, 1996); for example, in the sentence John hit the thief with the bat, the higher attachment of with the bat, which supports the instrument interpretation, is sometimes (but not always) associated with lengthening of thief. The logic of the research enterprise is to see whether prosodic “cues” can signal syntactic structure and help the parser to avoid going down a garden-path. One of the earliest
studies to consider this question was conducted by Beach (1991), who demonstrated that meta-linguistic judgments about sentence structure are influenced by the availability of durational and pitch information linked to the final structures of the sentences. A few decades later, more sensitive online techniques including recording of event-related potentials (ERPs) and eye-tracking have yielded a wealth of information about the comprehension of spoken sentences, and one of the ideas on which there is now a general consensus is that prosody does influence the earliest stages of parsing (Nakamura, Arai, & Mazuka, 2012).

Another potentially influential source of information for sentence processing is context, both discourse and visual. An early analysis of the role of discourse context is known as Referential Theory (Crain & Steedman, 1985). It has been observed that many of the sentence forms identified as syntactically dispreferred by the two-stage model are also presuppositionally more complex. For example, the sentence John hit the thief with the bat allows for two interpretations: the with-phrase may be interpreted as an instrument or a modifier; the latter interpretation requires a more complex structure (on some theories of syntax). The “confound” here is that the more complex structure also involves modification, whereas the simpler analysis does not. Moreover, a modified phrase such as the thief with the bat presupposes the existence of more than one thief, and thus the difficulty of the more complex structure might not be due to its syntax, but rather to the lack of a context to motivate the modified phrase. Crain and Steedman predicted that sentences processed in presuppositionally appropriate contexts would be easy to process, a prediction that Ferreira and Clifton (1986) examined using eye movement monitoring in reading. Their data were consistent with the idea that context did not affect initial parsing decisions: Supportive contexts led to shorter global reading times and more accurate question-answering behavior, but early measures of processing revealed that processing times were longer for structurally complex sentences compared with their structurally simpler counterparts.

The potential role of visual context became a topic of intense interest in the 1990s with the emergence of the Visual World Paradigm (VWP) for studying sentence processing. The idea behind the paradigm is simple. From reading studies, it was known that fixations are closely tied to attention and processing (Rayner, 1977). The VWP extends this logic to spoken language processing by pairing spoken utterances with simple displays containing mentioned and unmentioned objects. The “linking hypothesis” (Tanenhaus, Magnuson, Dahan, & Chambers, 2000) is that as a word is heard, its representation in memory becomes activated, which triggers eye movements toward the named object as well as objects semantically and even phonologically associated with it (Huettig & McQueen, 2007). The widespread adoption of the VWP occurred in part because the idea of multimodal processing was also catching on, with many cognitive scientists wanting to understand the way different cognitive systems might work together—in this case, the auditory language processing system and the visuo-attention system associated with object recognition (Henderson & Ferreira, 2004; Jackendoff, 1996). There was also a growing interest in auditory language processing generally, and in the investigation of how prosodic information might be used during comprehension, as discussed previously. By now, hundreds of studies have been reported making use of it in one way or another (Ferreira, Foucart, & Engelhardt, 2013; Huettig, Olivers, & Hartsuiker, 2011; Huettig, Rommers, & Meyer, 2011).

The reports that triggered the widespread use of the VWP are those by Spivey, Tanenhaus, Eberhard, and Sedivy (2002) and Tanenhaus, Spivey-Knowlton, Eberhard, and Sedivy (1995). These studies adapted the ideas of Crain and Steedman (1985) concerning presuppositional support to the domain of visual contexts and spoken sentences that could be evaluated against them. To illustrate, consider the sentence Put the apple on the towel in the box. At the point at which the listener hears on the towel, two interpretations are possible: either on the towel is the location to which the apple should be moved or it is a modifier of apple. The phrase into the box forces the latter interpretation because it is unambiguously a location. Referential Theory specifies that speakers should provide modifiers only when modification is necessary to establish reference. It follows that if two apples are present in the visual world and one of them is supposed to be moved, then right from the earliest stages of processing, the phrase on the towel will be taken to be a modifier, because the modifier allows a unique apple to be identified. The listener faced with this visual world containing two referents should therefore immediately interpret the phrase as a modifier and avoid being garden-pathed (Farmer, Cargill, Hindy, Dale, & Spivey, 2007; Novick, Thompson-Schill, & Trueswell, 2008; Spivey et al., 2002; Tanenhaus et al., 1995). Recently, however, the interpretation of these findings has been challenged. Ferreira et al. (2013) conducted three experiments manipulating properties of the utterances and the visual worlds. They concluded that listeners engage in a fairly atypical mode of processing in VWP experiments with simple visual worlds and utterances that are highly similar to each other in all experimental trials. Rather than processing utterances normally, they instead form a skeleton, underspecified
representation of what they are likely to hear based on the content of the display, and then they evaluate that prediction against the utterance itself. These issues concerning the use of the VWP require additional investigation.

In summary, a range of sources of information is used for successful sentence processing. Lexical and syntactic constraints are central for defining the structural alternatives considered by the language processing system, and information associated with the prosody of the sentence as well as the discourse and visual context in which the sentence occurs helps to reinforce some interpretations and flesh out the full meaning of the sentence. In the following section, we consider some of the theoretical controversies concerning the architecture of the language system and the way these sources of information are coordinated. This discussion sets the stage for our discussion of theoretical models of sentence processing.

22.2 THEORETICAL CONTROVERSIES

In this section, we consider four issues that help distinguish among competing models of sentence processing: (i) incremental interpretation; (ii) serial versus parallel processing; (iii) interactivity versus modularity; and (iv) sources of complexity in comprehension, including those that arise due to working memory constraints.

Incremental interpretation refers to whether the sentence processing system builds the meaning of a sentence word-by-word, as the input unfolds, or whether the system either falls behind or gets ahead of the input. Falling behind the input would indicate delays in interpretation; getting ahead would indicate anticipation or prediction. Essentially all current models of processing assume that interpretations are built incrementally and, in particular, that there are no delays in incorporating new words into the ongoing representation of sentence meaning. In addition, there is some evidence that comprehenders engage in prediction (Levy, 2008; Rayner, Li, Juhasz, & Yan, 2005; Van Berkum, Brown, Zwitserlood, Kooijman, & Hagoort, 2005). The classic demonstration of prediction comes from Altmann and Kamide (1999), who used the VWP and semantically constrained sentences such as The boy will eat the cake. They observed that listeners made anticipatory eye movements to a depicted cake prior to hearing the word cake, indicating that they predicted that continuation. In the structural domain, Staub and Clifton (2006) found that when readers processed a clause beginning with the word either, they predicted an upcoming or-clause based on the syntactic constraint that the latter must follow the former. These and other studies have been taken as evidence that the sentence processing system is not just incremental but actually predictive, anticipating structure and even specific lexical content.

At the same time, there is some evidence that additional processing takes place at major syntactic boundaries. So-called end-of-sentence wrap-up refers to the finding that reading times at the ends of clauses and sentences are longer than in other sentential positions (Aaronson & Scarborough, 1976; Just & Carpenter, 1980; Rayner, Kambe, & Duffy, 2000; Rayner, Sereno, Morris, Schmauder, & Clifton, 1989). Wrap-up effects indicate that some elements of meaning are computed over a more global domain. In addition, clause boundaries might be the locations where the comprehension system evaluates the entire structure to ensure that all relevant constraints are satisfied, for example, to check that a verb has all its obligatory arguments.

Evidence for underspecified representations also suggests some tendency on the part of the processing system to delay interpretations (for an excellent summary, see Frisson, 2009). Words with multiple senses (e.g., book as an object versus its content) seem to be processed by initially activating an underspecified meaning, and then by filling out the semantics once contextually disambiguating information becomes available. Some syntactic ambiguities may also be handled in a similar manner; for example, comprehenders leave open the interpretation of ambiguous relative clauses (the servant of the actress who was on the balcony), making a specific attachment decision only once it is necessary to do so (Swets, Desmet, Clifton, & Ferreira, 2008). Pronouns are also often not assigned specific antecedents (McKoon, Greene, & Ratcliff, 1993).

The second theoretical issue in which theories of sentence processing differ is serial versus parallel processing, which typically refers to assumptions about whether the system considers only one interpretation at a time or multiple interpretations. For example, consider The defendant examined by the lawyer turned out to be unreliable (Ferreira & Clifton, 1986). The sequence the defendant examined could mean that the defendant examined something or that the defendant is the thing being examined (the ultimately correct analysis). The issue is whether only one of these interpretations is built and evaluated at any one time, or whether all the interpretations are simultaneously activated and assessed. In the serial view, first the system considers one analysis—in most theories, the one that assumes that the defendant is the agent of examining, given that this analysis is syntactically simpler and more frequent—and then reanalyzes it if a revision signal is encountered. The sentence processing system then goes into “reanalysis mode,” attempting to adjust
the syntactic structure that has been built to create a grammatical analysis (Ferreira & Henderson, 1991; Fodor & Ferreira, 1998; Fodor & Inoue, 1994). Ease of reanalysis depends on the extent to which the sentence processing system can find lexical and grammatical information that motivates an alternative structure.

The parallel view assumes that the sentence processing system activates all grammatically licensed analyses simultaneously. Considering our example, both the incorrect and the ultimately correct interpretations of the defendant examined would be available in parallel, initially weighted by their frequency. The agent analysis of defendant is more frequent; therefore, at first, it will be stronger than the ultimately correct analysis. But when the word by is encountered, the sentence processing system must shift to the other activated interpretation. Ease of reanalysis depends on the relative activation levels of the two interpretations. If the ultimately correct interpretation is infrequent, then it will be difficult to retrieve and reanalysis might even fail. If the right interpretation has some strength based on the extent to which it conforms to a wide range of linguistic and nonlinguistic constraints, then reanalysis will be easier, and so will overall comprehension of the sentence.

A careful reader might have noticed subtle differences in the terminology used in our discussion of serial versus parallel processing. For the former, interpretations are typically described as being “built,” whereas for the latter they are often referred to as being “activated” or “retrieved.” These different terms reflect fundamentally different ideas about how interpretations are stored in memory and accessed during sentence processing. The serial view tends to assume that syntactic rules are stored in memory and then used online to create a structural representation bit by bit. Reanalysis processes are a matter of editing the structure. The parallel view tends to assume that structures are stored in chunks, typically corresponding to an argument-taking word such as a verb and its arguments. Online processing involves not so much building a structure as activating one. These issues are raised again when we consider models of sentence processing.

The third issue in which theories of processing differ is interactivity versus modularity. Almost since the earliest days of psycholinguistics, debate has centered around the issue of whether the system considers only linguistic (and possibly even only syntactic) information when parsing a sentence versus a system that considers all potentially relevant sources of information. Modular models assume sentence structures are assigned to words at least initially without any consideration of whether the structure will map to a sentence interpretation that makes sense given prior knowledge or given the contents of the immediate linguistic, visual, or social context. For example, the sentence processing system would be garden-pathed not only by the defendant examined by the lawyer but also by the evidence examined by the lawyer, even though evidence is inanimate and therefore cannot engage in an act of examination. In contrast, interactive models assume the immediate use of all relevant constraints. At this stage, there is widespread belief in the field that the preponderance of evidence supports interactive models, although it is possible to argue that this conclusion goes somewhat beyond the evidence (Ferreira & Nye, in progress).

Thus far we have mainly focused on structural ambiguity, which is certainly one source of difficulty or complexity in processing. But structure-building processes independent of ambiguity resolution are also a potential source of complexity, as first argued by Gibson (1991). One source of complexity is structural frequency. All things being equal, a structure encountered more frequently will be easier to comprehend than one that is rare (Ferreira, 2003; Gibson, 1991, 1998; MacDonald et al., 1994). The demands that structures place on working memory are an additional source of processing difficulty for both ambiguous and unambiguous structures (Chomsky & Miller, 1963; Gibson, 1991, 1998, 2000; Lewis & Vasishth, 2005; Yngve, 1960). For example, nested structures (The reporter who the senator who John met attacked disliked the editor) are harder to process than right-branching structures (John met the senator who attacked the reporter who disliked the editor), a generalization that holds across typologically different languages (e.g., English, which is a subject-verb-object [SVO] language, and Japanese, which is subject-object-verb [SOV]). This contrast between nested and right-branching structures can be explained by appealing to the greater demands the former structures place on working memory. More specifically, two kinds of demands increase processing complexity for unambiguous as well as ambiguous structures: (i) storage costs and (ii) distance-based integration costs. Storage costs are incurred when incomplete materials must be held in working memory, for example, a verb that needs its arguments (Chen, Gibson, & Wolf, 2005; Gibson, 1998; Nakatani & Gibson, 2010). Distance-based costs are those that arise from attempts to integrate a word into the structure already built and seem to be proportional to the difficulty of reactivating an earlier word, for example, an argument that must be linked back to its verb (Gibson, 1998, 2000; Gordon, Hendrick, & Johnson, 2001, 2004; Grodner & Gibson, 2005; Lewis & Vasishth, 2005; Lewis, Vasishth, & Van Dyke, 2006). Distance-based costs also account for the well-known preference for subject-extracted over object-extracted relative clauses (Grodner & Gibson, 2005) and arise in part due to similarity-based interference in working memory (Gordon et al., 2001).
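The contrast between nested and right-branching structures described above can be made concrete with a toy cost metric. The sketch below is our own illustration rather than Gibson's actual dependency locality calculation: the particular head-dependent links, the word indices, and the decision to sum raw linear distances are all simplifying assumptions made for exposition.

```python
def integration_cost(dependencies):
    """Sum, over head-dependent links, of the linear distance between the
    two words; a rough stand-in for distance-based integration cost."""
    return sum(abs(head - dependent) for dependent, head in dependencies)

# "The reporter who the senator who John met attacked disliked the editor"
#    0    1      2   3     4      5   6    7     8        9       10   11
# Each subject must wait for a distant verb (the links are our simplification).
nested = [(1, 9), (4, 8), (6, 7)]

# "John met the senator who attacked the reporter who disliked the editor"
#    0    1   2     3     4     5      6     7      8     9      10   11
# Every link is resolved locally.
right_branching = [(0, 1), (4, 5), (8, 9)]

costs = (integration_cost(nested), integration_cost(right_branching))  # (13, 3)
```

However crude, the sketch captures the generalization in the text: the nested structure forces the parser to reactivate words across long stretches of intervening material, whereas the right-branching structure never does.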


22.3 CLASSES OF MODELS OF SENTENCE PROCESSING

We begin with the so-called two-stage model or garden-path model, first developed by Lyn Frazier (Ferreira & Clifton, 1986; Frazier & Fodor, 1978; Rayner, Carlson, & Frazier, 1983). The model assumes that a single parse is constructed for any sentence based on the operation of Minimal Attachment, which constrains the parser to construct no potentially unnecessary syntactic nodes, and Late Closure, which causes the parser to attach new linguistic input to the current constituent. In addition, the model assumes that the only information that the parser has access to when building a syntactic structure is its database of phrase-structure rules; therefore, the parser cannot consult information associated with lexical items. For example, in the sequence Mary knew Bill, the noun phrase Bill would be assigned the role of direct object because that analysis is simpler than the alternative subject-of-complement-clause analysis, and the information that know takes sentence complements more frequently than direct objects could not be used to inform the initial parse.

The two-stage model has evolved over the past three decades to take into account changes in linguistic theory and significant findings in psycholinguistics. One important addition is the notion of “Construal” (Frazier & Clifton, 1997; Frisson & Pickering, 2001), which allows some constituents to be merely associated with a specific thematic domain in a sentence rather than definitively attached to the structure. Evidence for Construal comes from the finding that readers process sentences with ambiguous relative clauses more quickly than those that have a unique attachment (e.g., the servants of the actress who was on the balcony), unless the sentence is followed by a question that forces the reader to provide a specific interpretation; in that case, readers take longer to read the ambiguous versions, presumably because they are trying to choose between the attachment options. Another important revision of the two-stage model is that, now, prosody plays an essential role in determining how parsing proceeds from the earliest stages of processing (Millotte, Wales, & Christophe, 2007; Nakamura et al., 2012; Price, Ostendorf, Shattuck-Hufnagel, & Fong, 1991). Pitch and durational information associated with different kinds of prosodic and intonational phrasing are used to constrain the parser’s syntactic analyses and assist in the construction of semantic meanings such as focus and presupposition. Nonetheless, the essential features of the two-stage model remain. The model assumes that (i) information is used incrementally to build an interpretation; (ii) different possible interpretations are built and evaluated serially, rather than in parallel; and (iii) only certain kinds of information can be used during the initial stages of sentence processing, particularly information stated in the syntactic and prosodic vocabulary of the sentence processing module.

The two-stage model was soon challenged by researchers in sentence processing who were strongly influenced by the connectionist architectures popular in the 1980s and 1990s (Rumelhart & McClelland, 1985, 1986; Seidenberg & McClelland, 1989). These architectures contrast with the assumptions of the two-stage model in two defining ways. First, in connectionist systems, alternative possibilities are activated and evaluated in parallel; second, any relevant source of information can be used to modulate the activation levels and allow one possible analysis to win at the expense of the others (MacDonald et al., 1994). Applying these ideas to sentence processing, the connectionist alternative assumed the following principles. First, rather than analyses being built with the help of grammatical rules, a great deal of the burden of syntactic representation is put into the lexicon. Adapting ideas that were then timely in linguistic theory (Pesetsky, 1995), lexical representations were assumed to activate not only words and word meanings but also syntactic frames. In this view, syntactic rules are redundant because almost all the necessary information is already stated in the lexicon. Thus, with syntactic structures being stored rather than built, it is easy to imagine an architecture in which all possible analyses are considered in parallel, weighted by their frequency of use. Lexical, contextual, and pragmatic constraints can be used to further modulate the activation levels. With this approach the sentence processing system is incremental, but different possible interpretations are activated in parallel. In addition, any potential source of information can be used at any stage of sentence processing, making the system interactive rather than modular.

Other classes of models emphasize the role of complexity in sentence processing (Gibson, 1991, 1998). The significance of these models is two-fold. First, they highlight sources of information that lead to processing costs for both ambiguous and unambiguous structures and thus capture well-known findings such as the preference for subject- over object-relative clauses (Gibson, 1991, 1998, 2000; Grodner & Gibson, 2005; Lewis & Vasishth, 2005; Yngve, 1960). Second, these models are specifically designed to take into account the role of working memory in sentence processing, with an emphasis on how costs associated with maintaining and integrating items in working memory affect complexity and, therefore, processing difficulty. These models make important predictions about phenomena in sentence processing
C. BEHAVIORAL FOUNDATIONS 22.3 CLASSES OF MODELS OF SENTENCE PROCESSING 271

that are of particular interest to researchers attempting to uncover the neural mechanisms that underlie sentence comprehension, including structures such as passives and relative clauses.

In the past 15 years or so, a new class of models has emerged with roots in all the approaches that have been described thus far. There are many variants with important distinctions among them, but what they share is the idea that comprehenders sometimes end up with an interpretation that differs from the actual input received—the interpretation is simpler (Construal), somewhat distorted (Late Assignment of Syntax Theory (LAST); Good-Enough Processing), or outright inconsistent (Noisy Channel Approaches) with the sentence's true content. Let us begin with the models that assume representations that reduce the input in some way. One implementation is to allow representations to be underspecified (Sanford & Sturt, 2002). Consider Construal. As mentioned, this model assumes that syntactic structures are not always fully connected, and adjunct phrases in particular (e.g., relative clauses, modifying prepositional phrases) may instead simply get associated with a certain processing domain, "floating" until disambiguating information arrives. The parser thus remains uncommitted (Pickering, McElree, Frisson, Chen, & Traxler, 2006) concerning the attachment of the relative clause and the interpretation that would follow from any particular attachment (Frisson & Pickering, 2001; Sanford & Graesser, 2006; Sturt, Sanford, Stewart, & Dawydiak, 2004). Other studies support the idea of underspecified representations for global syntactic structures (Tyler & Warren, 1987), semantic information (Frazier & Rayner, 1990), and coercion structures (Pickering et al., 2006).

More radical variants of shallow processing models allow the comprehension system to generate an interpretation that is even more discrepant from the input. Researchers in the field of text processing have shown that readers are sometimes remarkably insensitive to contradictions in text (Otero & Kintsch, 1992), and they also often fail to update their interpretations when later information undermines a fact stated previously (Albrecht & O'Brien, 1993). These ideas from text processing were exported to the sentence processing literature in a series of experiments showing that people do not seem to fully recover from garden-paths (Christianson, Hollingworth, Halliwell, & Ferreira, 2001). Participants read sentences such as While the woman bathed the baby played in the crib and then answered a question such as Did the woman bathe the baby? The surprising finding was that most people answered "yes," even though the meaning of the reflexive verb bathe requires that the object be interpreted as coreferential with the subject (see also Slattery, Sturt, Christianson, Yoshida, & Ferreira, 2013). It appears that comprehenders are not entirely up to the task of syntactic reanalysis and sometimes fail to revise either all pieces of the syntactic structure or all elements of the semantic consequences of the initial, incorrect parse. And the more semantically compelling the original misinterpretation, the more likely people are to want to retain it.

Townsend and Bever's (2001) model implements an architecture similar to what has been suggested for decision-making (Gigerenzer, 2004; Kahneman, 2003), where researchers sometimes distinguish between so-called System 1 and System 2 (or Type 1 and Type 2) reasoning. System 1 is fast, automatic, and operates via the application of simple heuristics. System 2, however, is slow and attention-demanding, and it consults a wide range of beliefs—essentially anything the organism knows and has stored in memory. In Townsend and Bever's (2001) LAST, sentences are essentially processed twice. First, heuristics are accessed that yield a quick meaning, and then syntactic computations are performed on the same word string to yield a fully connected syntactic analysis. The second process ensures that the meaning that is obtained for a sentence is consistent with its actual form. Townsend and Bever also assume that the first stage is nonmodular and that the second is modular; this is to account for the use of semantics in the first stage and the use of essentially only syntactic constraints in the second.

Two models similar in spirit to LAST but that assume a modular architecture for the first stage have been proposed by Ferreira (2003) and by Garrett (2000). The Ferreira model assumes that the first stage consults just two heuristics—a version of the noun-verb-noun (NVN) strategy, in which people assume an agent-patient mapping of semantic roles to syntactic positions, and an animacy heuristic, in which animate entities are biased toward subjecthood. The 2003 Ferreira model explains comprehenders' tendencies to misinterpret passive sentences, particularly when they express an implausible event with reversible semantic roles, as in the dog was bitten by the man. The application of heuristics in the first stage yields the dog-bit-man interpretation; a proper syntactic parse will deliver the opposite, correct interpretation, but the model assumes that it is fragile and susceptible to interference. Garrett (2000) offers a more explicitly analysis-by-synthesis model that incorporates the production system to generate what are generally thought of as top-down effects. A first-pass, bottom-up process uses syntactic information to create a simple parse that, in turn, allows for a rudimentary interpretation. Then, the language production system takes over and uses that representation to generate the detailed syntactic structure that would support the initial parse and interpretation.


Finally, a family of models has recently been proposed that assume people engage in rational behavior over a noisy communication channel. The channel is noisy because listeners sometimes mishear or misread due to processing error or environmental contamination, and because speakers sometimes make mistakes when they communicate. Thus, a rational comprehender whose goal is to recover the intention behind the utterance will normalize the input according to Bayesian priors. A body of evidence from research using ERPs helped to motivate these ideas (Kim & Osterhout, 2005; Van Herten, Kolk, & Chwilla, 2005). In these experiments, it is reported that subjects who encounter a sentence such as The fox that hunted the poachers stalked through the woods experience a P600 rather than an N400 on encountering the semantically anomalous word, even though an N400 might be expected given that it is presumed to reflect problems related to meaning. There is still not a great deal of consensus regarding what triggers P600s, but an idea that has been gaining traction is that it reflects a need to engage in some type of structural reanalysis or revision. The conclusion, then, is that when a person encounters a sentence that seems to say that the fox hunted the poachers, that person "fixes" it so it makes sense, resulting in a P600. Other models have taken this idea and developed it further (Gibson, Bergen, & Piantadosi, 2013; Levy, 2011; Levy, Bicknell, Slattery, & Rayner, 2009). These models are generally interactive, because the information that is accessed to establish the priors can range from biases related to structural forms all the way to beliefs concerning speaker characteristics (Van Berkum, Van den Brink, Tesink, Kos, & Hagoort, 2008). However, these noisy channel models have not yet been rigorously tested using a methodology that allows early processes to be distinguished from later ones. For example, it remains possible that comprehenders create a simple parse in a manner compatible with modularity and then consult information outside the module to revise that interpretation, right down to actually normalizing the input. Models designed to explain the comprehension of sentences containing self-repairs and other disfluencies (turn left uh right at the light) assume mechanisms that allow input to be deleted so that the speaker's intended meaning can be recovered (Ferreira, Lau, & Bailey, 2004).

22.4 CONCLUSION

The field of sentence processing has changed significantly since the 1980s. Current models emphasize more detailed, context-specific information such as speaker, and there is a great deal of interest in mechanisms that allow the input to be rationally evaluated and corrected. Future work will continue to make use of behavioral techniques as well as methods from neuroscience to expand our understanding of these topics. The critical next stage is to determine how the processes assumed in models of sentence processing are actually implemented in the human brain. Our view is that the field is well-positioned for this challenge given the sophistication of extant sentence processing models.

References

Aaronson, D., & Scarborough, H. S. (1976). Performance theories for sentence coding: Some quantitative evidence. Journal of Experimental Psychology: Human Perception and Performance, 2(1), 56–70.
Albrecht, J. E., & O'Brien, E. J. (1993). Updating a mental model: Maintaining both local and global coherence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(5), 1061–1070.
Allbritton, D. W., McKoon, G., & Ratcliff, R. (1996). Reliability of prosodic cues for resolving syntactic ambiguity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(3), 714–735.
Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264.
Beach, C. M. (1991). The interpretation of prosodic patterns at points of syntactic structure ambiguity: Evidence for cue trading relations. Journal of Memory and Language, 30(6), 644–663.
Chen, E., Gibson, E., & Wolf, F. (2005). Online syntactic storage costs in sentence comprehension. Journal of Memory and Language, 52, 144–169.
Chomsky, N., & Miller, G. A. (1963). Introduction to the formal analysis of natural languages. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (pp. 269–321). New York, NY: Wiley.
Christianson, K., Hollingworth, A., Halliwell, J. F., & Ferreira, F. (2001). Thematic roles assigned along the garden path linger. Cognitive Psychology, 42(4), 368–407.
Clifton, C., Jr. (1981). Psycholinguistic renaissance? Contemporary Psychology, 26, 919–921.
Crain, S., & Steedman, M. (1985). On not being led up the garden path: The use of context by the psychological parser. In D. Dowty, L. Karttunen, & A. Zwicky (Eds.), Natural language parsing: Psychological, computational, and theoretical perspectives (pp. 320–358). Cambridge, UK: Cambridge University Press.
Farmer, T. A., Cargill, S. A., Hindy, N. C., Dale, R., & Spivey, M. J. (2007). Tracking the continuity of language comprehension: Computer mouse trajectories suggest parallel syntactic processing. Cognitive Science, 31(5), 889–909.
Ferreira, F. (1993). The creation of prosody during sentence production. Psychological Review, 100, 233–253.
Ferreira, F. (2003). The misinterpretation of noncanonical sentences. Cognitive Psychology, 47(2), 164–203.
Ferreira, F., & Clifton, C. (1986). The independence of syntactic processing. Journal of Memory and Language, 25(3), 348–368.
Ferreira, F., Foucart, A., & Engelhardt, P. E. (2013). Language processing in the visual world: Effects of preview, visual complexity, and prediction. Journal of Memory and Language, 69(3), 165–182.


Ferreira, F., & Henderson, J. M. (1991). Recovery from misanalyses of garden-path sentences. Journal of Memory and Language, 30, 725–745.
Ferreira, F., Lau, E. F., & Bailey, K. G. (2004). Disfluencies, language comprehension, and tree adjoining grammars. Cognitive Science, 28(5), 721–749.
Ferreira, F., & McClure, K. (1997). Parsing of garden-path sentences with reciprocal verbs. Language and Cognitive Processes, 12, 273–306.
Ferreira, F., & Nye, J. (in press). The modularity of sentence processing reconsidered. In R. G. De Almeida & L. Gleitman (Eds.), Minds on language and thought. Oxford, UK: Oxford University Press.
Fillmore, C. J. (1968). The case for case. In E. Bach & R. T. Harms (Eds.), Universals in linguistic theory (pp. 1–88). New York, NY: Holt, Rinehart and Winston.
Fodor, J. A., Bever, T. G., & Garrett, M. (1974). The psychology of language: An introduction to psycholinguistics and generative grammar. New York, NY: McGraw-Hill.
Fodor, J. D., & Ferreira, F. (1998). Reanalysis in sentence processing. Dordrecht, The Netherlands: Kluwer.
Fodor, J. D., & Inoue, A. (1994). The diagnosis and cure of garden paths. Journal of Psycholinguistic Research, 23, 407–434.
Frazier, L., & Clifton, C., Jr. (1997). Construal: Overview, motivation, and some new evidence. Journal of Psycholinguistic Research, 26(3), 277–295.
Frazier, L., Clifton, C., & Randall, J. (1983). Filling gaps: Decision principles and structure in sentence comprehension. Cognition, 13, 187–222.
Frazier, L., & Flores D'Arcais, G. B. (1989). Filler driven parsing: A study of gap filling in Dutch. Journal of Memory and Language, 28, 331–344.
Frazier, L., & Fodor, J. D. (1978). The sausage machine: A new two-stage parsing model. Cognition, 6(4), 291–325.
Frazier, L., & Rayner, K. (1990). Taking on semantic commitments: Processing multiple meanings vs. multiple senses. Journal of Memory and Language, 29(2), 181–200.
Frisson, S. (2009). Semantic underspecification in language processing. Language and Linguistics Compass, 3(1), 111–127.
Frisson, S., & Pickering, M. J. (2001). Obtaining a figurative interpretation of a word: Support for underspecification. Metaphor and Symbol, 16(3–4), 149–171.
Garrett, M. (2000). Remarks on the architecture of language processing systems. In Y. Grodzinsky & L. Shapiro (Eds.), Language and the brain: Representation and processing (pp. 31–69). San Diego, CA: Academic Press.
Gibson, E. (1991). A computational theory of human linguistic processing: Memory limitations and processing breakdown. Pittsburgh, PA: Carnegie Mellon University.
Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1), 1–76.
Gibson, E. (2000). The dependency locality theory: A distance-based theory of linguistic complexity. In Y. Miyashita, A. Marantz, & W. O'Neil (Eds.), Image, language, brain (pp. 95–126). Cambridge, MA: MIT Press.
Gibson, E., Bergen, L., & Piantadosi, S. T. (2013). Rational integration of noisy evidence and prior semantic expectations in sentence interpretation. Proceedings of the National Academy of Sciences of the United States of America, 110(20), 8051–8056.
Gigerenzer, G. (2004). Fast and frugal heuristics: The tools of bounded rationality. In D. J. Koehler & N. Harvey (Eds.), Handbook of judgment and decision making (pp. 62–88). Oxford, UK: Blackwell.
Gordon, P. C., Hendrick, R., & Johnson, M. (2001). Memory interference during language processing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 1411–1423.
Gordon, P. C., Hendrick, R., & Johnson, M. (2004). Effects of noun phrase type on sentence complexity. Journal of Memory and Language, 51(1), 97–114.
Grimshaw, J. (1990). Argument structure. Cambridge, MA: MIT Press.
Grodner, D., & Gibson, E. (2005). Consequences of the serial nature of linguistic input. Cognitive Science, 29, 261–291.
Henderson, J. M., & Ferreira, F. (2004). Scene perception for psycholinguists. In J. M. Henderson & F. Ferreira (Eds.), The interface of language, vision, and action: Eye movements and the visual world (pp. 1–58). New York, NY: Psychology Press.
Huettig, F., & McQueen, J. M. (2007). The tug of war between phonological, semantic and shape information in language-mediated visual search. Journal of Memory and Language, 57(4), 460–482.
Huettig, F., Olivers, C. N., & Hartsuiker, R. J. (2011). Looking, language, and memory: Bridging research from the visual world and visual search paradigms. Acta Psychologica, 137(2), 138–150.
Huettig, F., Rommers, J., & Meyer, A. S. (2011). Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica, 137(2), 151–171.
Jackendoff, R. (1996). The architecture of the linguistic-spatial interface. In P. Bloom, M. A. Peterson, L. Nadel, & M. F. Garrett (Eds.), Language and space. Language, speech, and communication (pp. 1–30). Cambridge, MA: MIT Press.
Jackendoff, R. S. (1990). Semantic structures. Cambridge, MA: MIT Press.
Joshi, A. K., & Schabes, Y. (1997). Tree-adjoining grammars. In G. Rozenberg & A. Salomaa (Eds.), Handbook of formal languages (pp. 69–123). Berlin: Springer.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354.
Kahneman, D. (2003). Maps of bounded rationality: Psychology for behavioral economics. The American Economic Review, 93(5), 1449–1475.
Kim, A., & Osterhout, L. (2005). The independence of combinatory semantic processing: Evidence from event-related potentials. Journal of Memory and Language, 52(2), 205–225.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.
Levy, R. (2011). Probabilistic linguistic expectations, uncertain input, and implications. Studies of Psychology and Behavior, 9(1), 52–63.
Levy, R., Bicknell, K., Slattery, T., & Rayner, K. (2009). Eye movement evidence that readers maintain and act on uncertainty about past linguistic input. Proceedings of the National Academy of Sciences of the United States of America, 106(50), 21086–21090.
Lewis, R. L., & Vasishth, S. (2005). An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29, 375–419.
Lewis, R. L., Vasishth, S., & Van Dyke, J. A. (2006). Computational principles of working memory in sentence comprehension. Trends in Cognitive Sciences, 10, 44–54.
MacDonald, M. C., Pearlmutter, N. J., & Seidenberg, M. S. (1994). The lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4), 676–703.
McKoon, G., Greene, S., & Ratcliff, R. (1993). Discourse models, pronoun resolution, and the implicit causality of verbs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 1–13.
Millotte, S., Wales, R., & Christophe, A. (2007). Phrasal prosody disambiguates syntax. Language and Cognitive Processes, 22(6), 898–909.
Nakamura, C., Arai, M., & Mazuka, R. (2012). Immediate use of prosody and context in predicting a syntactic structure. Cognition, 125(2), 317–325.
Nakatani, K., & Gibson, E. (2010). An on-line study of Japanese nesting complexity. Cognitive Science, 34, 94–112.
Novick, J. M., Thompson-Schill, S. L., & Trueswell, J. C. (2008). Putting lexical constraints in context into the visual-world paradigm. Cognition, 107(3), 850–903.


Otero, J., & Kintsch, W. (1992). Failures to detect contradictions in a text: What readers believe versus what they read. Psychological Science, 3(4), 229–235.
Patson, N. D., & Ferreira, F. (2009). Conceptual plural information is used to guide early parsing decisions: Evidence from garden-path sentences with reciprocal verbs. Journal of Memory and Language, 60, 464–486.
Pesetsky, D. (1995). Zero syntax: Experiencers and cascades. Cambridge, MA: MIT Press.
Pickering, M. J., McElree, B., Frisson, S., Chen, L., & Traxler, M. J. (2006). Underspecification and aspectual coercion. Discourse Processes, 42(2), 131–155.
Price, P. J., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). The use of prosody in syntactic disambiguation. The Journal of the Acoustical Society of America, 90, 2956–2970.
Rayner, K. (1977). Visual attention in reading: Eye movements reflect cognitive processes. Memory and Cognition, 5(4), 443–448.
Rayner, K., Carlson, M., & Frazier, L. (1983). The interaction of syntax and semantics during sentence processing: Eye movements in the analysis of semantically biased sentences. Journal of Verbal Learning and Verbal Behavior, 22(3), 358–374.
Rayner, K., Kambe, G., & Duffy, S. A. (2000). The effect of clause wrap-up on eye movements during reading. The Quarterly Journal of Experimental Psychology, 53(4), 1061–1080.
Rayner, K., Li, X., Juhasz, B., & Yan, G. (2005). The effect of word predictability on the eye movements of Chinese readers. Psychonomic Bulletin and Review, 12(6), 1089–1093.
Rayner, K., Sereno, S. C., Morris, R. K., Schmauder, A. R., & Clifton, C. (1989). Eye movements and on-line language comprehension processes. Language and Cognitive Processes, 4, 21–49.
Rumelhart, D. E., & McClelland, J. L. (1985). Levels indeed! A response to Broadbent. Journal of Experimental Psychology: General, 114(2), 193–197.
Rumelhart, D. E., & McClelland, J. L. (1986). PDP models and general issues in cognitive science. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing (pp. 111–146). Cambridge, MA: MIT Press.
Sanford, A. J., & Graesser, A. C. (2006). Shallow processing and underspecification. Discourse Processes, 42(2), 99–108.
Sanford, A. J., & Sturt, P. (2002). Depth of processing in language comprehension: Not noticing the evidence. Trends in Cognitive Sciences, 6(9), 382–386.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96(4), 523–568.
Slattery, T. J., Sturt, P., Christianson, K., Yoshida, M., & Ferreira, F. (2013). Lingering misinterpretations of garden path sentences arise from competing syntactic representations. Journal of Memory and Language, 69(2), 104–120.
Spivey, M. J., Tanenhaus, M. K., Eberhard, K. M., & Sedivy, J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4), 447–481.
Staub, A., & Clifton, C. (2006). Syntactic prediction in language comprehension: Evidence from either... or. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 425–436.
Sturt, P., Sanford, A. J., Stewart, A., & Dawydiak, E. (2004). Linguistic focus and good-enough representations: An application of the change-detection paradigm. Psychonomic Bulletin and Review, 11(5), 882–888.
Swets, B., Desmet, T., Clifton, C., & Ferreira, F. (2008). Underspecification of syntactic ambiguities: Evidence from self-paced reading. Memory and Cognition, 36(1), 201–216.
Tanenhaus, M. K., Magnuson, J. S., Dahan, D., & Chambers, C. (2000). Eye movements and lexical access in spoken-language comprehension: Evaluating a linking hypothesis between fixations and linguistic processing. Journal of Psycholinguistic Research, 29(6), 557–580.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
Townsend, D. J., & Bever, T. G. (2001). Sentence comprehension: The integration of habits and rules. Cambridge, MA: MIT Press.
Trueswell, J. C., Tanenhaus, M. K., & Kello, C. (1993). Verb-specific constraints in sentence processing: Separating effects of lexical preference from garden-paths. Journal of Experimental Psychology: Learning, Memory and Cognition, 19(3), 528–553.
Tyler, L. K., & Warren, P. (1987). Local and global structure in spoken language comprehension. Journal of Memory and Language, 26(6), 638–657.
Van Berkum, J., Van den Brink, D., Tesink, C., Kos, M., & Hagoort, P. (2008). The neural integration of speaker and message. Journal of Cognitive Neuroscience, 20(4), 580–591.
Van Berkum, J. A., Brown, C. M., Zwitserlood, P., Kooijman, V., & Hagoort, P. (2005). Anticipating upcoming words in discourse: Evidence from ERPs and reading times. Journal of Experimental Psychology: Learning, Memory and Cognition, 31(3), 443–467.
Van Herten, M., Kolk, H. H., & Chwilla, D. J. (2005). An ERP study of P600 effects elicited by semantic anomalies. Cognitive Brain Research, 22(2), 241–255.
Yngve, V. H. (1960). A model and an hypothesis for language structure. Proceedings of the American Philosophical Society, 104(5), 444–466.

CHAPTER 23

Gesture's Role in Learning and Processing Language

Özlem Ece Demir1 and Susan Goldin-Meadow2

1Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL, USA; 2Department of Psychology, University of Chicago, Chicago, IL, USA

In all cultures and at all ages, speakers move their hands when they talk—they gesture. Even congenitally blind individuals who have never seen anyone gesture move their hands when they talk (Iverson & Goldin-Meadow, 1998), suggesting that gesturing is a robust part of speaking. Moreover, gesture and speech form an integrated system for expressing meaning. Gesture conveys the visual component of the meaning and uses imagistic and analog devices to do so; speech conveys the linguistic component and uses the linear-segmented, hierarchical devices characteristic of language (McNeill, 1992, 2008). Our goal in this chapter is to introduce neuroscientists interested in the neurobiology of language to gesture and the role it plays in language learning and language processing. We focus primarily on behavioral studies and include findings from neuroimaging studies only when relevant (we direct the reader interested in the neurobiology of gesture to Chapter 32 by Dick and Broce). We begin by briefly reviewing evidence showing that gesture can provide a unique window into the mind of a speaker—not only does gesture reflect a speaker's thoughts but also it can play a role in changing those thoughts. We then explore in detail the role gesture plays in how language is learned and how it is processed.

23.1 GESTURE NOT ONLY REFLECTS THOUGHT, IT CAN PLAY A ROLE IN CHANGING THOUGHT

Although gesture may seem like handwaving, it in fact conveys substantive information, often information that is not found in the speaker's words. For example, consider a child who is shown two rows of checkers. The child is first asked to verify that the two rows have the same number of checkers and is then asked whether the rows still have the same number after one row is spread out. The child says "no" and justifies his response by saying, "They are different because you moved them." But at the same time, the child produces the following gestures—he moves his finger between the first checker in row 1 and the first checker in row 2, then the second checker in rows 1 and 2, and so on. In his gestures, the child is demonstrating an understanding of one-to-one correspondence, a central concept underlying the conservation of number that does not appear in his speech (Church & Goldin-Meadow, 1986).

Two additional points are worth noting about gesture. The information conveyed uniquely in a speaker's gestures is often accessible only to gesture, that is, it is encapsulated knowledge not yet accessible to speech (Goldin-Meadow, Alibali, & Church, 1993). Speakers who produce gestures that convey information not found in their speech when explaining a task are ready to learn that task—when given instruction in the task, they are more likely to profit from that instruction than speakers whose gestures convey the same information as their speech, whether the speakers are children (Church & Goldin-Meadow, 1986; Perry, Church, & Goldin-Meadow, 1988; Pine, Lufkin, & Messer, 2004) or adults (Perry & Elder, 1997; Ping, Decatur, Larson, Zinchenko, & Goldin-Meadow, under review).

Gesture can thus reflect the state of a speaker's knowledge. But there is now good evidence that gesture can do more than display what speakers know and can play a role in changing what they know. Gesture can change thinking in at least two ways.

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00023-7 © 2016 Elsevier Inc. All rights reserved.

First, the gestures we see others produce can change our minds. Learners are more likely to profit from instruction when it is accompanied by gesture than when that same instruction is not accompanied by gesture (Perry, Berch, & Singleton, 1995; Valenzeno, Alibali, & Klatzky, 2003), even when the gestures are not directed at objects in the immediate environment (Ping & Goldin-Meadow, 2008). Gesture has been found to be particularly helpful in instruction when it conveys a correct strategy for solving a math problem that is different from the (also correct) strategy conveyed in the accompanying speech (Singer & Goldin-Meadow, 2005).

Second, the gestures that we ourselves produce can change our minds. To determine whether gesture can bring about change, we need to teach speakers to gesture in particular ways. If speakers can extract meaning from their gestures, then they should be sensitive to the particular movements in those gestures and change their minds accordingly. Alternatively, all that may matter is that speakers move their hands. If so, then they should change their minds regardless of which gestures they produce. To investigate these alternatives, Goldin-Meadow, Cook, and Mitchell (2009) manipulated gesturing during a math lesson. They found that children required to produce correct gestures learned more than children required to produce partially correct gestures, who in turn learned more than children required to produce no gestures. After the lesson, the children who had gestured were able to express in their own words the information that they had conveyed only in their gestures during the lesson (and that the teacher had not conveyed at all), that is, they had learned from their hands. These findings suggest that the gestures speakers produce can have an impact on what they learn.

Having found that gesture has cognitive significance in many contexts, we are now in a position to explore the role of gesture in learning and processing language.

23.2 ROLE OF GESTURE IN LANGUAGE LEARNING

Just like adults, children gesture as they speak. Children start communicating through gestures even before they are able to speak. In this section, we review the role gesture plays in developing vocabulary, syntax, and discourse skills in language comprehension and production.

23.2.1 Vocabulary

23.2.1.1 Vocabulary Comprehension

Children rely on gesture to help them comprehend words starting from approximately 8 to 12 months of age (Bates, 1976). Gesture affects children's language comprehension over both short and long periods of time. When seeing an experimenter label an object, infants look longer if the named object is not at the location indicated by the experimenter's gesture but instead is on the other side of the display, suggesting that infants expect concurrently occurring labels and deictic gestures to indicate the same referent (Gliga & Csibra, 2009). Not surprisingly, then, gesture can help children learn new object labels (Namy & Waxman, 1998). Infants are more likely to associate a label with an object if the label is accompanied by a pointing gesture to the object than if it is not (Woodward, 2004). When the newly learned object label needs to be retrieved, less scaffolding by pictures or gestures is needed if the labels were initially taught with accompanying gestures than if they were taught without gestures (Capone & McGregor, 2004).

Parental gesture thus has the potential to facilitate children's language comprehension by providing nonverbal support. In naturalistic situations, parents frequently gesture when talking to their children, and most of these gestures reinforce the information conveyed in the accompanying speech (Iverson, Capirci, Longobardi, & Caselli, 1999). Moreover, when infants misunderstand their parents, their parents often provide additional gesture cues (e.g., through pointing) that repair the misunderstanding and enable the dyad to reach a consensus (Zukow-Goldring, 1996).

Looking over longer periods of time, early parent gesture use has been found to predict the size of children's comprehension vocabularies years later. Rowe and colleagues (Rowe, Özçalışkan, & Goldin-Meadow, 2008) found that the number of different meanings parents convey in their gestures to a child at age 14 months is a significant predictor of that child's vocabulary comprehension at age 42 months. However, early parent gesture did not have a direct effect on later child comprehension—early parent gesture was related to early child gesture, which, in turn, was related to later child comprehension. This study suggests that parent gesture might be indirectly related to children's vocabulary development through encouraging the child's own gestures. The relation between parent gesture and child vocabulary has been replicated by Pan and colleagues in a low-income sample and extended to language production—early parent pointing predicted their children's vocabulary production growth between 14 and 36 months of age (Pan, Rowe, Singer, & Snow, 2005). In an experimental manipulation, Goodwyn, Acredolo, and Brown (2000) trained a group of parents to use baby signs in addition to words when talking to their children. The children showed greater gains in vocabulary and used more gestures themselves than children of parents who were encouraged to use only words or who did not receive any training at all.


23.2.1.2 Vocabulary Production

In the earliest stages of language learning, infants produce few, if any, words. Their referential communication is primarily through gestures. Children start using gesture to communicate even before they say their first words, which are usually accompanied by meaningless vocalizations (Bates, 1976). For example, children point to places, people, or objects, they hold up an object to show it to others, or they extend their hands to request an object (Bates, 1976). Children in the United States are also often taught to communicate using "baby signs" (Acredolo, Goodwyn, & Gentieu, 2002).

Not much is known about whether producing a gesture helps children to actually produce a word (although this is an interesting question). But we do know that early gesture use is a strong predictor of later vocabulary production. Children's use of gesture for specific objects (e.g., pointing to a ball) predicts the appearance of verbal labels for these objects in their lexicon (e.g., producing the word "ball") (Iverson & Goldin-Meadow, 2005). More remarkably, the number of meanings children convey through gesture at 14 months predicts not only the size of their comprehension vocabularies but also the size of their production vocabularies at 54 months (Rowe & Goldin-Meadow, 2009a). Moreover, early gestures can also be used to predict the developmental trajectories of clinical populations. Sauer and colleagues showed that children with unilateral focal brain injury whose gesture production at 18 months is within the typical range (but whose speech production is below the range) will catch up to their peers and achieve production (and comprehension) spoken vocabularies later in development that are within the typical range. Importantly, children with brain injury whose gesture rate is below the typical range at 18 months continue to display delays in both their production and comprehension vocabularies (Sauer, Levine, & Goldin-Meadow, 2010).

23.2.2 Syntax

23.2.2.1 Syntactic Comprehension

Children rely on gesture to comprehend sentences from a very early age. Morford and Goldin-Meadow (1992) showed that comprehension of simple sentences, such as "give the clock," is facilitated either by a pointing gesture at the clock or by a give gesture (hand extended, palm up) in 15- to 29-month-olds. In fact, several children who were unable to produce two-word utterances were able to understand a "two-word idea" if one of those ideas was presented in gesture. For example, children responded appropriately to "give" plus a point at a clock significantly more often than they responded to the message when it was produced entirely in speech (i.e., "give the clock").

The role of gesture varies depending on the complexity of the spoken message—gesture is most beneficial when the listeners are young and the message is complex. For example, in a referential communication game, preschool children were given instructions to select a certain set of blocks from an array of blocks. Instructions were accompanied by a reinforcing gesture (saying "up" and producing an up gesture), a contradicting gesture (saying "up" and producing a down gesture), or no gesture. The spoken messages varied in complexity, and that complexity influenced the impact the gestures had on comprehension. Reinforcing gestures facilitated comprehension only when the spoken message was complex. Interestingly, no gesture and contradicting gesture had a similar impact on comprehension and did not facilitate comprehension as much as reinforcing gestures (McNeil, Alibali, & Evans, 2000).

23.2.2.2 Syntactic Production

Gesture paves the way for children's ability to form sentences. Starting from approximately 10 months of age, children produce gestures along with single words. Children use three types of gesture-speech combinations during the one-word period. Gestures are used to reinforce the meanings conveyed in speech (e.g., pointing to a ball and saying "ball"), to disambiguate the meanings conveyed in speech (e.g., pointing to a ball and saying "it"), or to add to the meanings conveyed in speech (e.g., pointing to a ball and saying "want").

Interestingly, using gesture-speech combinations to convey sentence-like ideas seems to set the stage for children's earliest sentences. The age at which children first produce combinations in which gesture conveys one idea and speech conveys another (e.g., point at bird + "nap" to describe a sleeping bird) predicts the age at which they first produce their two-word combinations (e.g., "bird sleep") (Iverson & Goldin-Meadow, 2005). Moreover, the number of these gesture-speech combinations that children produce at 18 months selectively predicts their syntactic skill, as measured by the Index of Productive Syntax (Scarborough, 1990), at 42 months (the measure does not predict vocabulary size at this age; Rowe & Goldin-Meadow, 2009b).

Gesture thus gives children a means to convey complex ideas before they are able to convey the same ideas entirely in speech. In fact, particular constructions produced across gesture and speech predict the emergence of the same constructions entirely in speech several months later. For example, saying "bird" in speech and producing a flying gesture (an argument + verb combination) precedes and predicts the onset of argument + verb combinations in speech (i.e., saying "bird fly") (Özçalışkan & Goldin-Meadow, 2005). Once a construction is produced in speech, children do not seem to rely on gesture to further expand that construction.


For example, once children acquire the ability to produce a verb and one argument in a single utterance, they do not produce additional arguments first in gesture; that is, a two-argument + verb construction is just as likely to appear first in speech alone as in speech + gesture (Özçalışkan & Goldin-Meadow, 2009).

23.2.3 Discourse

23.2.3.1 Discourse Comprehension

With age, children face increasingly complex language tasks, such as understanding indirect requests or listening to stories. Gesture continues to support children's language comprehension in later stages of language learning. For example, 3- to 5-year-old children understand indirect requests better if those requests are accompanied by a gesture: saying "It's going to get loud in here" is more easily understood as a request to close the door if the words are accompanied by a pointing gesture to the door than if they are not accompanied by gesture (Kelly, 2001).

Gesture can also have an impact on children's comprehension of longer stretches of discourse. In a recent study, Demir and colleagues (Demir, Fisher, Goldin-Meadow, & Levine, 2014) compared children's ability to retell a story presented to them under different conditions: a wordless cartoon; an audio recording of a storyteller (like listening to a story on the radio); an audiovisual presentation of a storyteller who does not produce cospeech gestures (like listening to a storyteller holding a book while reading it); and an audiovisual presentation of a storyteller producing cospeech gestures while talking (like listening to an oral storyteller). Children told better-structured narratives in the gesture condition than in any of the other three conditions, consistent with findings that cospeech gesture can scaffold comprehension of complex language. The gestures were particularly beneficial to children who had difficulty telling a well-structured narrative, suggesting that gesture might be most helpful when language skill is low.

23.2.3.2 Discourse Production

With age, children start producing longer and more frequent utterances that need to be sensitive to more sophisticated discourse-pragmatic principles. As in earlier stages of language learning, gesture reveals that children have more understanding of discourse-pragmatics than they express in speech. For example, when a referent is new to the discourse, or is not perceptually available, proficient speakers know that the referent must be explicitly expressed, and they use a noun to do so; conversely, if the referent is available in the perceptual context or retrievable from discourse, then the referent can be expressed in a pronoun or omitted entirely. Although English-, Chinese-, and Turkish-speaking 4-year-olds do not have control of this discourse principle in speech, they do display an understanding of the principle in gesture—they produce more gestures when referring to referents that are new to the perceptual or discourse context, particularly when those referents are underspecified or ambiguous in speech (Demir, So, Özyürek, & Goldin-Meadow, 2012; So, Demir, & Goldin-Meadow, 2010).

The gestures children produce in complex discourse early in their development have been found to predict their discourse skills in speech later in development. For example, age 5 to 6 years marks a transitional stage in narrative development. Children produce narratives on their own but rarely include the goal of the story or the perspective of the story characters in those narratives. However, some 5- to 6-year-olds use gesture to portray a character from a first-person perspective (e.g., moving the arms back and forth to describe a character who is running, from the character's point of view) as opposed to a third-person perspective (e.g., wiggling the fingers to describe the character from an observer's point of view). Whether children use character-viewpoint gestures in their narratives at age 5 predicts the structure of their spoken narratives 3 years later, controlling for early narrative structure in speech and language skill (Demir, Levine, & Goldin-Meadow, under review). Thus, gesture continues to be a harbinger of change in speech even in later stages of language learning.

23.2.4 Does Gesture Play a Causal Role in Language Learning?

We have seen that the early gestures children produce reflect their cognitive potential for learning particular aspects of language. But early gesture could be doing more—it could be helping children realize their potential. Child gesture could have an impact on language learning in at least two ways.

First, gesture gives children an opportunity to practice producing particular meanings by hand at a time when those meanings are difficult to express by mouth. To accurately determine whether child gesture is playing a causal role in language learning, we need to manipulate the gestures children produce. LeBarton, Goldin-Meadow, and Raudenbush (2013) studied 15 toddlers (beginning at 17 months) in an 8-week at-home intervention study (6 weekly training sessions plus a follow-up 2 weeks later) in which all children were exposed to object words, but only some were told to point at the named objects. Before each training session and at follow-up, children interacted naturally with their parents to establish a baseline against which changes in communication were measured. Children who were told to gesture

increased the number of gesture meanings they conveyed not only when interacting with the experimenter during training but also when later interacting naturally with their parents. Critically, these experimentally induced increases in gesture led to larger spoken repertoires at follow-up, suggesting that gesturing can have an impact on language learning through the cognitive effect it has on the learner.

The second way in which child gesture could play a role in language learning is more indirect—child gesture could elicit timely speech from listeners. Supporting this view, Goldin-Meadow and colleagues (Goldin-Meadow, Goodrich, Sauer, & Iverson, 2007) found that when their children are in the one-word stage, mothers translate the child's gestures into speech. For example, on seeing her child point to a ball and say "kick," a mother might say, "Do you want to kick the ball?," thus modeling for the child a two-word sentence that expresses the ideas the child conveyed in gesture + speech. Importantly, these maternal translations are reliable predictors of children's subsequent word and sentence learning, suggesting that gesturing can have an impact on language through the communicative effect it has on the learning environment.

23.3 ROLE OF GESTURE IN LANGUAGE PROCESSING

Children continue to use gesture long after they become proficient users of their native language(s). The tight relation between speech and gesture increases with development (Thompson & Massaro, 1986). In this section, we first describe the role gesture plays in how language is processed once language has been mastered, and then we describe the functions gesture serves for both listeners and speakers.

23.3.1 Gesturing is Involved in Language Processing at Every Level

23.3.1.1 Phonology

Gesture is linked to spoken language at every level of analysis. At the phonological level, producing gestures influences the voice spectra of the accompanying speech for deictic gestures (Chieffi, Secchi, & Gentilucci, 2009), emblem gestures (Barbieri, Buonocore, Dalla Volta, & Gentilucci, 2009; Bernardis & Gentilucci, 2006), and beat gestures (Krahmer & Swerts, 2007). When phonological production breaks down, as in stuttering or aphasia, gesture production stops as well (Mayberry & Jaques, 2000; McNeill, Pedelty, & Levy, 1990). There are also phonological costs to producing gestures with speech—producing words and pointing gestures together leads to longer initiation times for the accompanying speech, relative to producing speech alone (Feyereisen, 1997; Levelt, Richardson, & Laheij, 1985). Viewing gesture also affects voicing in listeners' vocal responses to audiovisual stimuli (Bernardis & Gentilucci, 2006).

23.3.1.2 Lexicon

At the lexical level, gesturing increases when the speaker is searching for a word (Morsella & Krauss, 2004). More generally, gestures reflect and compensate for gaps in a speaker's verbal lexicon. Gestures can package information in the same way that information is packaged in the lexicon of the speaker's language. For example, when speakers of English, Japanese, and Turkish are asked to describe a scene in which an animated figure swings on a rope, English speakers overwhelmingly use the verb "swing" along with an arced gesture (Kita & Özyürek, 2003). In contrast, speakers of Japanese and Turkish, languages that do not have single verbs that express an arced trajectory, use generic motion verbs along with the comparable gesture, that is, a straight gesture (Kita & Özyürek, 2003). But gesture can also compensate for gaps in the speaker's lexicon by conveying information that is not encoded in the accompanying speech. For example, complex shapes that are difficult to describe in speech can be conveyed in gesture (Emmorey & Casey, 2002).

23.3.1.3 Syntax

At the syntactic level, gestures are influenced by the structural properties of the accompanying speech. For example, English expresses manner and path within the same clause, whereas Turkish expresses the two in separate clauses. The gestures that accompany manner and path constructions in these two languages display a parallel structure—English speakers produce a single gesture combining manner and path (a rolling movement produced while moving the hand forward), whereas Turkish speakers produce two separate gestures (a rolling movement produced in place, followed by a moving-forward movement) (Kita & Özyürek, 2003; Kita, Özyürek, Allen, Brown, Furman, & Ishizuka, 2007). A recent event-related potential (ERP) study illustrates how gesture can influence syntactic processing (Holle, Obermeier, Schmidt-Kassow, Friederici, Ward, & Gunter, 2012). Listeners were presented with two types of sentences, one with a less-preferred syntactic structure. Less-preferred syntactic structures commonly elicit P600 waves, which are usually associated with syntactic reanalysis. When the less-preferred syntactic structures were accompanied by rhythmic beat gestures, the P600 was eliminated, suggesting that gesture can reduce the processing cost associated with hearing a syntactically

complex sentence. Supporting this finding, gesture has been found to play a greater role in language comprehension when the spoken message is syntactically complex than when it is simple (McNeil et al., 2000). Gesture production also reflects the amount of information encoded in a syntactic structure. Speakers gesture more when producing an unexpected (and, in this sense, more informative) syntactic structure than when producing an expected structure (Cook, Jaeger, & Tanenhaus, 2009).

23.3.1.4 Discourse

At the discourse level, speakers use recurring gestural features (e.g., the same hand shape or location) throughout a narrative when referring to a particular character, thus creating linkages across the narrative (McNeill, 2000). In terms of narrative comprehension, when listening to stories containing gestures, listeners activate regions that are responsive to semantic manipulations in speech (the triangular and opercular portions of the left inferior frontal gyrus and the left posterior middle temporal gyrus). However, the level of activation in these areas differs as a function of the semantic relation between gesture and speech—stories in which gesture conveys information that differs from, but complements, the information conveyed in speech (e.g., a flying gesture produced along with a story about a "pet") activate the regions more than stories in which gesture conveys the same information as speech (e.g., a flying gesture produced along with a story about a "bird") (Dick, Mok, Beharelle, Goldin-Meadow, & Small, 2014).

23.3.2 Gesture Serves a Function for Both Listeners and Speakers

23.3.2.1 Impact of Gesture on Listeners

Speakers' gestures reveal their thoughts. Accordingly, one function that gesture could serve is to convey those thoughts to listeners. There is, in fact, considerable evidence that listeners can use gesture as a source of information about the speaker's thinking (e.g., Goldin-Meadow & Sandhofer, 1999; Graham & Argyle, 1975; McNeil et al., 2000). The ability of listeners to glean information from a speaker's gestures can be seen most clearly when the gestures convey information that cannot be found anywhere in the speaker's words (e.g., Cook & Tanenhaus, 2009). Gesture can even affect the information listeners glean from the accompanying speech. Listeners are quicker to identify a speaker's referent when speech is accompanied by gesture than when it is not (Silverman, Bennetto, Campana, & Tanenhaus, 2010). Moreover, listeners are more likely to glean the message from speech when that speech is accompanied by gesture conveying the same information than when the speech is accompanied by no gesture (Beattie & Shovelton, 1999, 2002; Graham & Argyle, 1975; McNeil et al., 2000; Thompson & Massaro, 1994). Conversely, listeners are less likely to glean the message from speech when the speech is accompanied by gesture conveying different information than when the speech is accompanied by no gesture (Goldin-Meadow & Sandhofer, 1999; Kelly & Church, 1998; McNeil et al., 2000). In addition, more incongruent gestures lead to greater processing difficulty than congruent gestures (Kelly, Özyürek, & Maris, 2010). The effect that gesture has on listeners' processing is thus linked to the meaning relation between gesture and speech. Moreover, listeners cannot ignore gesture even when given explicit instructions to do so (Kelly, Özyürek, & Maris, 2010; Langton, O'Malley, & Bruce, 1996), suggesting that the integration of gesture and speech is automatic.

23.3.2.2 Impact of Gesture on Speakers

But gesture can also have an impact on the speakers themselves. Gestures have long been argued to help speakers "find" words, that is, to facilitate lexical access (Rauscher, Krauss, & Chen, 1996). Studies supporting this view show that gesture production increases when lexical retrieval is made difficult (Morsella & Krauss, 2004), and speakers are more successful in resolving tip-of-the-tongue states when they are permitted to gesture than when they are prevented from gesturing (Frick-Horbury & Guttentag, 1998). Gestures have also been hypothesized to reduce demands on conceptualization, and speakers have been found to gesture more on problems that are conceptually difficult, even when lexical demands are equated (Alibali, Kita, & Young, 2000; Hostetter, Alibali, & Kita, 2007; Melinger & Kita, 2007). For example, adults were asked to describe complex geometric shapes under two different conditions. In the easy condition, the shapes that the adults were supposed to describe were outlined by dark lines; in the hard condition, the shapes were obscured by lines outlining alternative organizations. The adults produced more gestures in the hard condition than in the easy condition (Kita & Davies, 2009).

Although findings of this sort are consistent with the idea that gesturing reduces demands on conceptualization, to be certain that gesturing is playing a causal role in reducing demands (as opposed to merely reflecting those demands), we need to manipulate gesture and demonstrate that the manipulation reduces the demands on conceptualization. This type of manipulation has been done in some cases, and gesturing has been found to reduce demands on speakers' working memory (Goldin-Meadow, Nusbaum, Kelly, & Wagner, 2001; Wagner, Nusbaum, & Goldin-Meadow, 2004), to activate knowledge that speakers have but do not express


(Broaders, Cook, Mitchell, & Goldin-Meadow, 2007), and to build new knowledge (Goldin-Meadow, Cook, & Mitchell, 2009).

23.4 IMPLICATIONS FOR THE NEUROBIOLOGY OF LANGUAGE

Our review reveals that gesture can play a role in language learning and processing. Going forward, we suggest that the right question to ask is not whether gesture helps language processing, but rather when and how it does so, and to explore the mechanisms by which gesture exerts its influence on communication. In this regard, neural levels of analysis have the potential to provide insight into unanswered questions that behavioral analyses cannot. Behavioral studies reflect the combined influence of multiple processes in how gesture affects communication and cognition. By localizing the brain networks underlying different cognitive functions and examining their contributions during various language tasks, neuroimaging studies may be able to help us tease apart the contributions of different cognitive processes.

There are a number of important neurobiological questions that can be raised about gesture's role in communication and cognition. For example, does the neural basis for gesture-speech integration vary as a function of the content of speech or the linguistic skills of the listener? In this regard, it is important to point out that many neuroimaging studies examine the effects of gestures conveying information that contradicts the information conveyed in speech (e.g., Willems, Özyürek, & Hagoort, 2007). Although these studies can offer important insights into what is possible, we note that contradictory gestures (i.e., gestures that convey information that contradicts, and therefore cannot under any conditions be integrated with, the information conveyed in speech) are not commonly observed in naturalistic, spontaneous conversation. We therefore encourage researchers to include in their studies gestures conveying information that is different from, but has the potential to be integrated with, the information conveyed in speech (as in Dick et al., 2014).

In sum, the gestures that we produce when we talk are not mindless handwaving. Gesture takes on significance because it can convey information about a speaker's thoughts that is not found in the speaker's words. Moreover, the information conveyed in gesture forecasts subsequent changes in a speaker's thinking and can even play a causal role in changing that thinking. Gesture thus offers a unique lens through which we can explore the mechanisms that underlie language learning and processing at the behavioral and the neurobiological levels.

References

Acredolo, L. P., Goodwyn, S., & Gentieu, P. (2002). My first baby signs. New York, NY: HarperFestival.
Alibali, M. W., Kita, S., & Young, A. J. (2000). Gesture and the process of speech production: We think, therefore we gesture. Language and Cognitive Processes, 15(6), 593–613.
Barbieri, F., Buonocore, A., Dalla Volta, R., & Gentilucci, M. (2009). How symbolic gestures and words interact with each other. Brain and Language, 110(1), 1–11.
Bates, E. (1976). Language and context: The acquisition of pragmatics. New York, NY: Academic Press.
Beattie, G., & Shovelton, H. (1999). Mapping the range of information contained in the iconic hand gestures that accompany spontaneous speech. Journal of Language and Social Psychology, 18(4), 438–462.
Beattie, G., & Shovelton, H. (2002). An experimental investigation of some properties of individual iconic gestures that mediate their communicative power. British Journal of Psychology, 93(2), 179–192.
Bernardis, P., & Gentilucci, M. (2006). Speech and gesture share the same communication system. Neuropsychologia, 44(2), 178–190.
Broaders, S. C., Cook, S. W., Mitchell, Z., & Goldin-Meadow, S. (2007). Making children gesture brings out implicit knowledge and leads to learning. Journal of Experimental Psychology: General, 136(4), 539.
Capone, N. C., & McGregor, K. K. (2004). Gesture development: A review for clinical and research practices. Journal of Speech, Language, and Hearing Research, 47(1), 173–186. doi:10.1044/1092-4388(2004/015).
Chieffi, S., Secchi, C., & Gentilucci, M. (2009). Deictic word and gesture production: Their interaction. Behavioural Brain Research, 203(2), 200–206.
Church, R. B., & Goldin-Meadow, S. (1986). The mismatch between gesture and speech as an index of transitional knowledge. Cognition, 23(1), 43–71.
Cook, S. W., Jaeger, T. F., & Tanenhaus, M. (2009). Producing less preferred structures: More gestures, less fluency. In The 31st annual meeting of the Cognitive Science Society (CogSci09) (pp. 62–67).
Cook, S. W., & Tanenhaus, M. K. (2009). Embodied communication: Speakers' gestures affect listeners' actions. Cognition, 113(1), 98–104.
Demir, Ö. E., Fisher, J. A., Goldin-Meadow, S., & Levine, S. C. (2014). Narrative processing in typically developing children and children with early unilateral brain injury: Seeing gesture matters. Developmental Psychology, 50(3), 815.
Demir, Ö. E., So, W.-C., Özyürek, A., & Goldin-Meadow, S. (2012). Turkish- and English-speaking children display sensitivity to perceptual context in the referring expressions they produce in speech and gesture. Language and Cognitive Processes, 27(6), 844–867. doi:10.1080/01690965.2011.589273.
Dick, A. S., Mok, E. H., Beharelle, A. R., Goldin-Meadow, S., & Small, S. L. (2014). Frontal and temporal contributions to understanding the iconic co-speech gestures that accompany speech. Human Brain Mapping, 34, 900–917. doi:10.1002/hbm.22222.
Emmorey, K., & Casey, S. (2002). Gesture, thought, and spatial language. In Spatial language (pp. 87–101). Netherlands: Springer.
Frick-Horbury, D., & Guttentag, R. E. (1998). The effects of restricting hand gesture production on lexical retrieval and free recall. The American Journal of Psychology, 43–62.
Gliga, T., & Csibra, G. (2009). One-year-old infants appreciate the referential nature of deictic gestures and words. Psychological Science, 20(3), 347–353. doi:10.1111/j.1467-9280.2009.02295.x.


Goldin-Meadow, S., Alibali, M. W., & Church, R. B. (1993). Transitions in concept acquisition: Using the hand to read the mind. Psychological Review, 100(2), 279.
Goldin-Meadow, S., Cook, S. W., & Mitchell, Z. A. (2009). Gesturing gives children new ideas about math. Psychological Science, 20(3), 267–272.
Goldin-Meadow, S., Goodrich, W., Sauer, E., & Iverson, J. (2007). Young children use their hands to tell their mothers what to say. Developmental Science, 10(6), 778–785. doi:10.1111/j.1467-7687.2007.00636.x.
Goldin-Meadow, S., Nusbaum, H., Kelly, S. D., & Wagner, S. (2001). Explaining math: Gesturing lightens the load. Psychological Science, 12(6), 516–522.
Goldin-Meadow, S., & Sandhofer, C. M. (1999). Gestures convey substantive information about a child's thoughts to ordinary listeners. Developmental Science, 2(1), 67–74.
Goodwyn, S. W., Acredolo, L. P., & Brown, C. A. (2000). Impact of symbolic gesturing on early language development. Journal of Nonverbal Behavior, 24(2), 81–103.
Graham, J. A., & Argyle, M. (1975). A cross-cultural study of the communication of extra-verbal meaning by gestures. International Journal of Psychology, 10(1), 57–67.
Holle, H., Obermeier, C., Schmidt-Kassow, M., Friederici, A. D., Ward, J., & Gunter, T. C. (2012). Gesture facilitates the syntactic analysis of speech. Frontiers in Psychology, 3.
Hostetter, A. B., Alibali, M. W., & Kita, S. (2007). I see it in my hands' eye: Representational gestures reflect conceptual demands. Language and Cognitive Processes, 22(3), 313–336.
Iverson, J. M., Capirci, O., Longobardi, E., & Caselli, M. C. (1999). Gesturing in mother–child interactions. Cognitive Development, 57–75.
Iverson, J. M., & Goldin-Meadow, S. (1998). Why people gesture when they speak. Nature, 396(6708), 228.
Iverson, J. M., & Goldin-Meadow, S. (2005). Gesture paves the way for language development. Psychological Science, 16(5), 367–371.
Kelly, S. D. (2001). Broadening the units of analysis in communication: Speech and nonverbal behaviours in pragmatic comprehension. Journal of Child Language, 28(2), 325–349. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11449942.
Kelly, S. D., & Church, R. B. (1998). A comparison between children's and adults' ability to detect conceptual information conveyed through representational gestures. Child Development, 69(1), 85–93.
Kelly, S. D., Özyürek, A., & Maris, E. (2010). Two sides of the same coin: Speech and gesture mutually interact to enhance comprehension. Psychological Science.
Kita, S., & Davies, T. S. (2009). Competing conceptual representations trigger co-speech representational gestures. Language and Cognitive Processes, 24(5), 761–775.
Kita, S., & Özyürek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48(1), 16–32.
Kita, S., Özyürek, A., Allen, S., Brown, A., Furman, R., & Ishizuka, T. (2007). Relations between syntactic encoding and co-speech gestures: Implications for a model of speech and gesture production. Language and Cognitive Processes, 22(8), 1212–1236.
Krahmer, E., & Swerts, M. (2007). The effects of visual beats on prosodic prominence: Acoustic analyses, auditory perception and visual perception. Journal of Memory and Language, 57(3), 396–414.
Langton, S. R., O'Malley, C., & Bruce, V. (1996). Actions speak no louder than words: Symmetrical cross-modal interference effects in the processing of verbal and gestural information. Journal of Experimental Psychology: Human Perception and Performance, 22(6), 1357.
LeBarton, E. S., Goldin-Meadow, S., & Raudenbush, S. (2013). Experimentally-induced increases in early gesture lead to increases in spoken vocabulary. Journal of Cognition and Development.
Mayberry, R. I., & Jaques, J. (2000). Gesture production during stuttered speech: Insights into the nature of gesture-speech integration. Language and Gesture, 2, 199.
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago, IL: University of Chicago Press.
McNeill, D. (Ed.). (2000). Language and gesture (Vol. 2). Cambridge University Press.
McNeill, D. (2008). Gesture and thought. Chicago, IL: University of Chicago Press.
McNeill, D., Pedelty, L. L., & Levy, E. T. (1990). Speech and gesture. Advances in Psychology, 70, 203–256.
McNeil, N. M., Alibali, M. W., & Evans, J. L. (2000). The role of gesture in children's comprehension of spoken language: Now they need it, now they don't. Journal of Nonverbal Behavior, 24(2), 131–150.
Melinger, A., & Kita, S. (2007). Conceptualisation load triggers gesture production. Language and Cognitive Processes, 22(4), 473–500.
Morford, M., & Goldin-Meadow, S. (1992). Comprehension and production of gesture in combination with speech in one-word speakers. Journal of Child Language, 19(3), 559–580.
Morsella, E., & Krauss, R. M. (2004). The role of gestures in spatial working memory and speech. The American Journal of Psychology, 117(3), 411–424. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15457809.
Namy, L. L., & Waxman, S. R. (1998). Words and gestures: Infants' interpretations of different forms of symbolic reference. Child Development, 69(2), 295–308. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/9586206.
Özçalışkan, Ş., & Goldin-Meadow, S. (2005). Gesture is at the cutting edge of early language development. Cognition, 96(3), B101–B113. doi:10.1016/j.cognition.2005.01.001.
Özçalışkan, Ş., & Goldin-Meadow, S. (2009). When gesture-speech combinations do and do not index linguistic change. Language and Cognitive Processes, 24(2), 190. doi:10.1080/01690960801956911.
Pan, B. A., Rowe, M. L., Singer, J. D., & Snow, C. E. (2005). Maternal correlates of growth in toddler vocabulary production in low-income families. Child Development, 76(4), 763–782. doi:10.1111/j.1467-8624.2005.00876.x.
Perry, M., Berch, D., & Singleton, J. (1995). Constructing shared understanding: The role of nonverbal input in learning contexts. Journal of Contemporary Legal Issues, 6, 213.
Perry, M., Church, R. B., & Goldin-Meadow, S. (1988). Transitional knowledge in the acquisition of concepts. Cognitive Development, 3(4), 359–400.
Perry, M., & Elder, A. D. (1997). Knowledge in transition: Adults' developing understanding of a principle of physical causality. Cognitive Development, 12(1), 131–157.
Pine, K. J., Lufkin, N., & Messer, D. (2004). More gestures than answers: Children learning about balance. Developmental Psychology, 40(6), 1059.
Ping, R., Decatur, M., Larson, S. W., Zinchenko, E., & Goldin-Meadow, S. (revision under review). Unpacking the gestures of chemistry learners: What the hands can tell us about correct and incorrect conceptions of stereochemistry.
Ping, R. M., & Goldin-Meadow, S. (2008). Hands in the air: Using ungrounded iconic gestures to teach children conservation of quantity. Developmental Psychology, 44(5), 1277.
Rauscher, F. H., Krauss, R. M., & Chen, Y. (1996). Gesture, speech, and lexical access: The role of lexical movements in speech production. Psychological Science, 7(4), 226–231.

C. BEHAVIORAL FOUNDATIONS REFERENCES 283

Rowe, M. L., & Goldin-Meadow, S. (2009a). Differences in early ges- principles in early childhood. Applied Psycholinguistics, 31(1), ture explain SES disparities in child vocabulary size at school 209224. Available from: http://dx.doi.org/doi:10.1017/S014271 entry. Science, 323, 951953. 6409990221.When. Rowe, M. L., & Goldin-Meadow, S. (2009b). Early gesture selectively Thompson, L. A., & Massaro, D. W. (1986). Evaluation and integra- predicts later language learning. Developmental Science, 323(1), tion of speech and pointing gestures during referential under- 182187. Available from: http://dx.doi.org/doi:10.1111/j.1467- standing. Journal of Experimental Child Psychology, 42(1), 144168. 7687.2008.00764.x. Thompson, L. A., & Massaro, D. W. (1994). Children’ s integration of Rowe, M. L., O¨ zc¸alı¸skan, S., & Goldin-Meadow, S. (2008). Learning speech and pointing gestures in comprehension. Journal of words by hand: Gesture’s role in predicting vocabulary develop- Experimental Child Psychology, 57(3), 327354. ment. First Language, 28(2), 182199. Available from: http://dx. Valenzeno, L., Alibali, M. W., & Klatzky, R. (2003). Teachers’ ges- doi.org/doi:10.1177/0142723707088310.Learning. tures facilitate students’ learning: A lesson in symmetry. Sauer, E., Levine, S. C., & Goldin-Meadow, S. (2010). Early gesture Contemporary Educational Psychology, 28(2), 187204. predicts language delay in children with pre- or perinatal brain Wagner, S. M., Nusbaum, H., & Goldin-Meadow, S. (2004). Probing lesions. Child Development, 81(2), 528539. Available from: the mental representation of gesture: Is handwaving spatial? http://dx.doi.org/doi:10.1111/j.1467-8624.2009.01413.x. Journal of Memory and Language, 50(4), 395407. Scarborough, H. S. (1990). Very early language deficits in dyslexic Willems, R. M., Ozyu¨rek, A., & Hagoort, P. (2007). When language children. Child Development, 61(6), 17281743. meets action: The neural integration of gesture and speech. Silverman, L. 
B., Bennetto, L., Campana, E., & Tanenhaus, M. K. Cerebral Cortex, 17(10), 23222333. Available from: http://dx.doi. (2010). Speech-and-gesture integration in high functioning autism. org/doi:10.1093/cercor/bhl141. Cognition, 115(3), 380393. Woodward, A. L. (2004). Infants’ use of action knowledge to get a Singer, M. A., & Goldin-Meadow, S. (2005). Children learn when grasp on words. Weaving a Lexicon149172. their teacher’s gestures and speech differ. Psychological Science, 16 Zukow-Goldring, P. (1996). Sensitive caregiving fosters the compre- (2), 8589. hension of speech: When gestures speak louder than words. Early So, W. C., Demir, O¨ . E., & Goldin-Meadow, S. (2010). When speech is Development and Parenting, 5(4), 195211. ambiguous gesture steps in: Sensitivity to discourse-pragmatic

C. BEHAVIORAL FOUNDATIONS