NEUROBIOLOGY OF LANGUAGE

Edited by
GREGORY HICKOK, Department of Cognitive Sciences, University of California, Irvine, CA, USA
STEVEN L. SMALL, Department of Neurology, University of California, Irvine, CA, USA

Academic Press is an imprint of Elsevier

SECTION F
PERCEPTUAL ANALYSIS OF THE SPEECH SIGNAL

CHAPTER 37
Phoneme Perception

Jeffrey R. Binder
Department of Neurology, Medical College of Wisconsin, Milwaukee, WI, USA

Neurobiology of Language. DOI: http://dx.doi.org/10.1016/B978-0-12-407794-2.00037-7 © 2016 Elsevier Inc. All rights reserved.

Human vocalizations include the set of sounds called phonemes, which are the basic linguistic units of speech. Phonemes are approximately equivalent to the set of vowel and consonant sounds of a language. For example, the word cat comprises the three phonemes (here transcribed in International Phonetic Alphabet notation) /k/, /æ/, and /t/. Changing any one of these phonemes, such as substituting /p/ for /k/, results in a different word. Thus, a phoneme is often defined as the smallest contrastive speech unit that brings about a change of meaning. Phonemes are also the product of complex motor acts involving the lungs, diaphragm, larynx, tongue, palate, lips, and so on. The actual sounds produced by this complex apparatus vary from one instance to the next, even for a single speaker repeating the same phoneme. They are also altered by the context in which the act occurs, resulting in allophonic variations such as the difference in the middle “t” sound of night rate and nitrate. Phoneme realizations also show large variations across geographical and social groups, resulting in regional and class accents. Therefore, a phoneme is best viewed as an abstract representation of a set of related speech sounds that are perceived as equivalent within a particular language. The main task of phoneme perception systems in the brain is to enable this perceptual equivalence.

The acoustic analysis of phonemes is a relatively recent science, becoming systematic only with the development of acoustic spectral analysis in the 1940s using analog filter banks (Koenig, Dunn, & Lacy, 1946). Accordingly, the standard classification of phonemes refers mainly to the motor gestures that produce them. “Manner” of articulation refers to the general type of gesture performed. Examples of distinct manners of articulation include: transient complete closure of the vocal tract, called a stop or plosive (example: p, d, k); sustained constriction of the vocal tract resulting in turbulent air flow, called a fricative (example: f, th, s); lowering of the soft palate and relaxation of the upper pharynx during occlusion of the oral tract to allow air flow and resonance in the nasal cavity, called a nasal (example: m, n, ng); and relatively unobstructed phonation using the tongue and lips to shape the resonant properties of the oral tract, resulting in a vowel. “Place” of articulation refers to the location in the vocal tract at which maximal restriction occurs during a gesture. Examples include: restriction at the lips, referred to as bilabial (example: b, p, m); restriction with the tip of the tongue against the teeth or alveolar palate, referred to as coronal (example: th, d, s); and restriction with the body of the tongue against the soft palate, called velar (example: g, k). Finally, the consonants are divided into two “voicing” categories on the basis of whether the vocal folds in the larynx are made to vibrate during or very close to the time of maximal articulation. When phonatory muscles in the larynx contract, the vocal folds vibrate due to periodic build-up of air pressure from the lungs, resulting in an audible pitch or voice. When this vibration is present during maximal articulation, the consonant is “voiced” (example: b, z, g). When no vocal fold vibration is present, the consonant is “unvoiced” (example: p, s, k). For stop consonants followed by a vowel, voicing is largely determined by the time that elapses between release of the stop and the onset of vocal fold vibration, referred to as voice onset time (VOT).

Although derived originally from motor descriptions, each of these distinctions is associated with corresponding acoustic phenomena (Figure 37.1). Vocal cord vibrations produce a periodic sound wave at a particular rate, called the fundamental frequency of the voice (F0), as well as harmonics at multiples of the fundamental. The relative intensity of these harmonics is determined by the shape of the vocal tract, which changes dynamically due to movement of the tongue and other articulators, altering the resonant (and antiresonant) properties of the tract. Resonant harmonics form distinct bands of increased power in the spectrum, called formants (numbered F1, F2, etc.), and the center frequency and bandwidth of these formants change as the vocal tract changes shape. Stop consonants are characterized by abrupt changes in tract shape as the jaw, tongue, and lips open or close, resulting in rapid changes (typically occurring over 20–50 ms) in the center frequencies of the formants (called formant transitions), as well as rapid changes in overall intensity as air flow is released or stopped. Fricatives are characterized by sustained, high-frequency broadband noise produced by turbulent air flow at the point of vocal tract constriction. Nasal consonants are characterized by a prominent nasal formant at approximately 250–300 Hz (the “nasal murmur”) and damping of higher-frequency oral formants.

FIGURE 37.1 Acoustic waveforms (top row in each pair) and time-frequency displays (“spectrograms”) showing spectral power at each point in time (bottom row in each pair) for the syllables /bæ/ and /pæ/ (as in “bat” and “pat”). Vertical striations seen in both displays are due to vocal cord vibrations at the fundamental frequency of the speaker’s voice. Dark horizontal bands in the time-frequency displays are formants (labeled F1, F2, and F3 on the /bæ/ display) caused by resonances related to the shape of the oral cavity. The first 50–60 ms of /bæ/ is the period of formant transition, during which the jaw opens and the spectral position of the formants changes rapidly. Unvoiced /pæ/ differs from voiced /bæ/ in that the period of formant transition is replaced in /pæ/ by a period of aspiration noise (indicated by “VOT”) during which there are no vocal cord vibrations. The absence of voicing also greatly shortens the formant transition duration and raises the starting frequency of F1 for /pæ/. [Figure axes: time in ms, 0 to 500; spectrogram frequency axis marked 0k to 3k.]

The acoustic correlates of consonant place of articulation were a central focus of much early research in acoustic phonetics, in large part due to the difficulty in identifying invariant characteristics that distinguished stops in different place categories. Place distinctions among stops are associated with differences in the direction and slope of formant transitions, but it was quickly noticed that these parameters are context-dependent. For example, the second formant rises during articulation of syllable-initial /d/ when the following vowel is “ee,” but it falls during articulation of the same segment when the following vowel is “ah” or “oo” (Delattre, Liberman, & Cooper, 1955). Such context dependency, which is ubiquitous in speech, proved challenging to explain and even led to a prominent theory proposing that phoneme identification depends on activation of an invariant motor representation (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967), although exactly how this could occur if the acoustic signal is uninformative was never fully specified. Later studies showed that overall spectral shape at the time of stop release and change in overall shape just after release provide a context-invariant acoustic cue for place of articulation (Kewley-Port, 1983; Stevens & Blumstein, 1981). Similarly, place of articulation for fricatives and nasal consonants is cued primarily by gross spectral shape (Hughes & Halle, 1956; Kurowski & Blumstein, 1987).

Phoneme perception is often based on multiple interacting acoustic cues. The cues signaling stop consonant voicing are a clear case in point. Although VOT is the most prominent cue, with VOT values longer than 25–30 ms generally creating a strong unvoiced percept, unvoiced stops (i.e., p, t, and k) are also characterized by aspiration (i.e., noise from the burst of air that accompanies release of the stop), a higher center frequency of the first formant at the onset of voicing, and a shorter formant transition duration, among other cues (Lisker, 1986) (Figure 37.1). Studies using synthesized speech show that these cues jointly contribute to voicing discrimination in that reduction of one cue can be compensated for by an increase in others (e.g., reduction of VOT can be “traded” for an increase in aspiration amplitude or first formant onset frequency to maintain the likelihood of an unvoiced percept) (Summerfield & Haggard, 1977).

As mentioned, a major task of the phoneme perception system is to detect these acoustic cues despite extensive variation in their realization.

[...] early linguistic experience. Speakers of English, for example, cannot appreciate distinctions between the dentoalveolar and (postalveolar) retroflex stops of Hindi and other Indo-Aryan languages, whereas these are distinct phonemes with categorical perceptual boundaries for Hindi listeners (Pruitt, Jenkins, & Strange, 2006). It thus appears that the phoneme perceptual system becomes “tuned” during language development through bottom-up statistical learning processes that reflect over-representation of category prototypes in the language environment (Kuhl, 2000; Werker & Tees, 2005). The phoneme categories that result from this tuning process [...]