Phonetics HCS 7367 Speech Perception Anatomical, acoustic, neural properties Production and perception
Focuses on continuous/physical properties Dr. Peter Assmann Basic unit: phone (=speech sound) h Fall 2014 Narrow transcription: [ p ]
Phonology Phonology
Abstract representations mediate between Phonemes: minimum meaning- the speech stream and its interpretation as differentiating units in a language
meaningful words Phonemic inventory: list of the phonemes Discrete entities (phonemes, syllables, ...) of a given language
Basic unit: phoneme (smallest contrastive American English has ~12 vowel unit of a language) phonemes, 3 diphthongs, ~24 consonants
Broad transcription: / p /
Phonology Phonology
Minimal pairs test: criterion for deciding Allophones: variants (pronunciations) of a whether two sounds belong to the same given phoneme
phonemic category. In English, the “p” sound in “pot” is aspirated h If two words or phrases in a given language (pronounced [ p ɔt ]) while the “p” in “spot” is differ only in one phonological segment (e.g. unaspirated (pronounced [ spɔt ]). English “pin” and “bin”) but are perceived as [ph] and [p] are allophones of the phoneme /p/ distinct words, then that sound contrast contains two distinct phonemes (/p/ and /b/).
1 Consonant Perception Phonology What is a consonant? Preliminaries to Jakobson, Fant and Halle (1961). Consonants are sounds produced with a major Speech Analysis. constriction somewhere in the vocal tract. Distinctive feature theory: universal set of (binary) features, combining articulation and perception Chomsky and Halle (1968). The Sound Pattern of English Phonological representations as sequences of segments (phonemes) composed of distinctive features. Two levels: underlying abstract representation and surface phonetic representation
English Consonants Consonant production
Labio‐ Place of articulation Bilabial Dental Alveolar Palatal Velar Glottal dental American English has 24 consonant phonemes Voicing –+–+–+–+–+–+– The standard phonetic classification system Stop p b t d k g uses three main features to classify the Nasal m n ŋ articulation
of consonants: Fricative f v θ ð s z ∫ ʒ h Affricate t∫ ʤ Place of articulation
Manner Approximant (w) r jw Lateral l Manner of articulation
Ladefoged, P. (1993). A Course in Phonetics. Harcourt Brace: Fort Worth. 3rd Ed. Voicing
Consonant classification Stop consonants in English
Place of articulation Place of articulation bilabial alveolar velar bilabial, labiodental, dental, alveolar, palatal, velar, glottal /ba/ /da/ /ga/ Manner of articulation voiced
stops, nasal, fricatives, affricates, approximant, lateral Voicing /pa/ /ta/ /ka/ Voicing voiceless
voiced, voiceless Manner of articulation = stop consonants
2 Consonant production Source‐filter theory for consonants
Speech production theory Consonants: sounds produced with a major
Source‐filter concept still applies constriction somewhere in the vocal tract.
Relatively narrow constriction or closure Source‐filter theory still applies, but there are
Secondary sound source in the upper vocal tract complications:
1. secondary sound source within the vocal tract
2. secondary source is aperiodic, turbulent noise (frication) rather than quasi‐periodic like the glottal source
3. secondary source may combine with the glottal source to produce mixed excitation (voiced fricatives)
Acoustic tube perturbation and vowel formants Filter properties of stop consonants Source: Stevens (1999, p. 286) After Stevens (1997)
velar alveolar 2.0
1.5
labial F2 frequency (kHz) F2 frequency
1.0 2.0 2.5 3.0 F3 frequency (kHz)
Area functions: cylindrical tube Source‐filter theory for consonants approximations of the vocal tract
Obstruents (fricatives, affricates, stops) Sonorants (nasals, liquids, glides) In obstruents, the constriction can divide the resonating vocal tract into two glottis lips separate cavities.
coupled vs. uncoupled
interactions with the source
3 Schematic model for fricative /s/
After Kent, Dembowski & Lass (1996)
Voiceless Obstruction (maxillary incisors) unaspirated stop consonant
Glottis Jet Lips [pɑ] Eddies
Back Cavity Front Cavity
Anterior constriction
Time (ms)
Place of articulation Place of articulation
What are the acoustic cues for place of What are the acoustic cues for place of articulation (i.e., what acoustic properties do articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)? /b/, /d/, /g/) from each other)?
1. Frequency of burst
2. Formant transitions (F2 and F3) from the burst to the vowel
Acoustic cues for place of Formant transitions articulation in stops
Burst cues During the closure for a stop consonant the
Release burst, frication noise, aspiration noise vocal tract is completely closed, and no sound is emitted. At the moment of release Spectral shape the resonances change rapidly. These Static vs. dynamic changes are called formant transitions. Formant transition cues
F2, F3 onset vs. target (vowel) frequency
4 Formant transitions Burst cues
The first formant (F1) always increases in The burst is a brief acoustic transient that frequency following the release of a stop occurs at the moment of consonant release. closure. F2 and F3 increases or decreases, Burst intensity is higher for voiceless stops than depending on the place of constriction and for voiced stops the flanking vowels. Burst frequency varies as a function of place
voicing onset burst
Burst cues Stop consonants (from Kent and Read, 1992)
Is consonant place information available when the burst is removed? Closure (stop gap) Release Aspirated Syllable-initial (noise burst) prevocalic stop Unaspirated Transition (formant transitions)
/ pɑ /
Alveolar stop Velar stop ata aka
4 4 Closure Release Formant (stop gap) (noise burst) transitions / ada / Closure Release Formant /aga/ (stop gap) (noise burst) transitions 3 3 •high F2 •high F2 F3 •high F3 •low F3
2 2 F3 F2 F2 Frequency (kHz) Frequency Frequency (kHz) Frequency 1 1
F1 F1 0 0 0 100 200 300 400 500 0 100 200 300 400 500 Time (ms) Time (ms)
5 Bilabial stop Stop consonants apa (from Kent and Read, 1992)
4 Closure Release Formant Transition (stop gap) (noise burst) transitions /aba/ (formant transitions) Syllable-final 3 postvocalic stop Closure •low F2 (stop gap) •low F3 F3 Release (noise burst) 2 or No release (no burst) Frequency (kHz) Frequency 1 F2
F1 / /
0 0 100 200 300 400 500 Time (ms)
Alveolar stop Bilabial stop ap
4 4 /ad/ /ab/ 3 3 •high F2 •low F2 F3 •high F3 F3 •low F3 2 2 F2
Frequency (kHz) F2
Frequency (kHz) 1 1
0 0 0 50 100 150 200 250 0 100 200 300 400 Time (ms) Time (ms)
Burst cues: /ba/ and /pa/ Burst cues: /da/ and /ta/
/ba/ /pa/ /da/ /ta/
4 4 4 4
3 3 3 3
2 2 2 2 Frequency (kHz) Frequency (kHz) 1 1 1 1
0 0 0 0 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 0 50 100 150 200 250 300 350 Time (ms) Time (ms) Time (ms) Time (ms)
• Bilabials: Burst energy is concentrated at low frequencies • Alveolars: Burst energy is concentrated at high frequencies
6 Burst cues: /ga/ and /ka/ Place of articulation
/ga/ /ka/ What are the acoustic cues for place of 4 4 articulation (i.e., what acoustic properties do
3 3 listeners use to distinguish /p/, /t/, /k/ (and
2 2 /b/, /d/, /g/) from each other)? Frequency (kHz) Frequency (kHz) 1 1
0 0 0 50 100 150 200 250 300 0 50 100 150 200 250 300 350 Time (ms) Time (ms)
• Velars: Burst energy is concentrated in the mid range
Place of articulation Acoustic cues for place of articulation in stops
What are the acoustic cues for place of Burst cues
articulation (i.e., what acoustic properties do Release burst, frication noise, aspiration noise listeners use to distinguish /p/, /t/, /k/ (and Spectral shape /b/, /d/, /g/) from each other)? Static vs. dynamic
1. Frequency of burst Formant transition cues
2. Formant transitions (F2 and F3) from the F2, F3 onset vs. target (vowel) frequency burst to the vowel
Formant transitions Formant transitions
The first formant (F1) always increases in During the closure for a stop consonant the vocal tract is completely closed, and no frequency following the release of a stop sound is emitted. At the moment of release closure. F2 and F3 increases or decreases, the resonances change rapidly. These depending on the place of constriction and changes are called formant transitions. the flanking vowels.
7 Burst cues Burst cues
The burst is a brief acoustic transient that Is consonant place information available when occurs at the moment of consonant release. the burst is removed?
Burst intensity is higher for voiceless stops than for voiced stops
Burst frequency varies as a function of place
voicing onset burst
Invariance problem for stop consonants Invariance problem for stop consonants Liberman, Delattre, and Cooper (1952)
NOISE VOWELS BURSTS B C t A 3.5 4K 3.0
3K 2.5
2K 2.0 ppkpk p 1K 1.5
Burst frequency(kHz) i eu c 1.0 burst + transitionless vowels 0.5
pp Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606 ie a c ou Majority responses
Invariance problem for stop consonants Invariance problem for stop consonants
Transition + vowel without burst
8 Categorical Perception Categorical Perception of stop consonants Phoneme boundary (50%) Equal acoustic changes unequal auditory percepts Identification place of articulation of stops: /b/ vs /d/ vs /g/ (labeling) function
Discrimination function b d g
Liberman, Harris, Hoffman, and Griffith (1957) Journal of Experimental Psychology 54, 358-368
Categorical Perception Segmentation problem for stops
Where are the segments? 1. Identification function shows steep "bag" slope (cross‐over) between categories /bæg/ 2. Good between‐category discrimination F3 (near perfect) when the members of the F2
pair belong to different categories F1
3. Poor within‐category discrimination time (near chance) when they are perceived to belong to the same category
Same noise ‐ different consonant Different transition ‐ same consonant
F 2
1400 Hz 1400 Hz
F 1
pea ka dee da
9 Motor Theory K.N. Stevens
Invariance problem Quantal theory Certain sounds are favored in the Segmentation problem languages of the world because their Categorical perception acoustic properties can be produced Mapping from continuous to discrete with a wide range of articulations. Encodedness of speech
Units of perception (segments, syllables..) Acoustic Articulatory consequences Linguistic units, acoustics and articulation variations
K.N. Stevens K.N. Stevens JASA 2002
Distinctive feature theory Acoustic Landmarks
syllables /bi/, /du/, … An early stage in the processing of speech identifies acoustic landmarks in the signal segments /b/, /d/, … such as syllable nuclei and acoustic features [+voiced], [-rounded], … discontinuities corresponding to consonantal closures and releases.
Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.
K.N. Stevens JASA 2002 K.N. Stevens JASA 2002
Lexical access from features Lexical access from features
The acoustic signal in the vicinity of these From these landmarks and features, lexical landmarks is processed by a set of (word) hypotheses are evaluated.
modules, each of which identifies a This is done using analysis-by-synthesis, phonetic feature. in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.
10 Burst spectrum Static spectral shape model
Step 1: Find consonant burst onset Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)
The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.
Temporal window Acoustic landmarks
Step 2: Position a 25.6 ms half-Hamming window Step 3: Compute spectrum of windowed segment
10
0 LPC spectrum -10 FFT spectrum -20 Amplitude in dB in Amplitude
-30
-40
-50 0 0.5 1 1.5 2 2.5 33.5
Frequency (kHz)
Static spectral shape model Static spectral shape model
Diffuse falling Diffuse rising Compact Predictions: 10 10 10 0 0 0 1) spectral shape for a given place of -10 -10 -10
-20 -20 -20 articulation is the same for different talkers Amplitude in dB Amplitude in dB Amplitude in dB Amplitude in
-30 -30 -30
-40 -40 -40 and in different phonetic contexts.
-50 -50 -50 0 0.5 1 1.5 2 2.5 3 3.5 0 0.5 1 1.5 2 2.5 3 3.5 0 0.5 1 1.5 2 2.5 3 3.5 Frequency (kHz) Frequency (kHz) Frequency (kHz) 2) acoustic modifications that distort other / ba / / da / / ga / aspects of the syllable but preserve the spectral shape near the burst do not affect Burst spectra of syllable-initial stop consonants consonant identity.
11 Dynamic properties Locus Equations
Kewley‐Port (1983) hypothesized that the Delattre et al. (1954) described the locus as identification of stop consonants is based on the frequency location of F2 extrapolated time‐varying changes in the spectrum from the back in time to the consonant release. onset of the burst into the transition region. F2 of /i/ locus Onset of Onset of voiced formant transition vowel target F2 of /u/
Problems with the locus concept Locus equations
No physical energy exists at the time/frequency position of the locus; actual extrapolation of Sussman et al. (1991, 1993) proposed the formant transition may lead to a change in that locus equations provide invariant consonant identity. relational cues for the perception of place of articulation in stop consonants. Locus equations describe the relationship locus Onset of between the frequency of F2 at burst Onset of voiced formant transition vowel target onset and F2 in the vowel.
Locus equations Locus equations
ba da
4 4
3 3
2 2 Frequency (kHz) Frequency (kHz) 1 1
0 0 0 100 200 300 400 0 50 100 150 200 250 300 Time (ms) Time (ms)
12 Locus equations Locus equations
ga
4
3
2 Sussman (1991) Frequency (kHz) 1
0 0 50 100 150 200 250 300 Time (ms)
Smits, ten Bosch, and Collier (1996) Cues for voicing
Gross cues • VOT – voice onset time spectral features that are distributed across frequency or time Short lag (0-20 ms) = voiced sounds /b,d,g/
overall spectral shape or its relative change over time Long lag (80-100 ms) = unvoiced sounds /p,t,k/ Detailed cues
features that are narrowly localized in frequency or time
center frequency of prominent spectral peaks and their dynamic transitions.
short lag long lag bi pi
4 4
3 3
2 2 Frequency (kHz) Frequency (kHz) 1 1
0 0 0 50 100 150 200 250 300 0 100 200 300 400 Time (ms) Time (ms)
13 long lag short lag di ti
4 4
3 3
2 2 Frequency (kHz) Frequency (kHz) 1 1
0 0 0 50 100 150 200 250 300 0 50 100 150 200 250 300 350 Time (ms) Time (ms)
short lag long lag gi ki
4 4
3 3
2 2 Frequency (kHz) Frequency (kHz) 1 1
0 0 0 50 100 150 200 250 300 350 0 100 200 300 400 Time (ms) Time (ms)
aga aka
4 4
3 3
2 2 Frequency (kHz) Frequency (kHz) 1 1
0 0 0 100 200 300 400 500 0 100 200 300 400 500 Time (ms) Time (ms)
14 VOT in Dutch Voicing Prevoiced and unaspirated stops • Lisker and Abramson (1970) Studied voice onset time (VOT) in several /bɛn/ /pɛn/ languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g. Thai, Hindi, Korean) • Prevoiced • Voiced • Voiceless aspirated • Voiceless unaspirated
VOT in an English-Dutch bilingual child Voicing
E. Simon (2010). • Identification Child L2 development: A longitudinal case study on functions for Voice Onset Times in word- initial stops. English VOT J. Child Lang. 37 159–173. continuum. • Frequency histograms of measured VOT in English stops.
Identification and discrimination of Voicing English VOT continuum • Identification 100 Discrimination functions for Identification function Thai VOT 80 function
continuum. 60 • Frequency histograms of 40 measured VOT in Thai stops. Percent /ba/ responses 20 Hypothetical data 0 -150 -100 -50 0 50 100 150 Voice onset time (ms)
15 Voicing cues in stop consonants Categorical Perception / a d a / • Relationship between VOT and 4 identification functions is nonlinear, with 4. FUNDAMENTAL steep transitions and a clearly defined FREQUENCY 3 boundary. 3. BURST AMPLITUDE
• Discrimination functions show a peak near 2 the phoneme boundary; near chance 1. VOICE ONSET TIME Frequency (kHz) Frequency levels elsewhere. 1 5. VOWEL DURATION 6. PREVOICING
2. F1 CUTBACK 0 0 100 200 300 400 500 Time (ms)
Voicing cues in stop consonants
/ a t a /
4 4. FUNDAMENTAL FREQUENCY 3 3. BURST AMPLITUDE
2 1. VOICE ONSET TIME Frequency (kHz) Frequency 1 5. VOWEL DURATION 6. PREVOICING
2. F1 CUTBACK 0 0 100 200 300 400 500 Time (ms)
16