Course web page
HCS 7367 • http://www.utdallas.edu/~assmann/hcs6367 Speech Perception – Course information – Lecture notes – Speech demos – Assigned readings Dr. Peter Assmann – Additional resources in speech & hearing Fall 2010
Course materials Course requirements http://www.utdallas.edu/~assmann/hcs6367/readings • Class presentations (15%)
• No required text; all readings online • Written reports on class presentations (15%) • Midterm take-home exam (20%) • Background reading: • Term paper (50%) http://www.speechandhearing.net/library/speech_science.php • Recommended books: Kent, R.D. & Read, C. (2001). The Acoustic Analysis of Speech. (Singular). Stevens, K.N. (1999). Acoustic Phonetics (Current Studies in Linguistics). M.I.T. Press.
Class presentations and reports Suggested Topics
• Pick two broad topics from the field of speech • Speech acoustics • Vowel production and perception perception. For each topic, pick a suitable (peer- • Consonant production and perception reviewed) paper from the readings web page or from • Suprasegmentals and prosody • Speech perception in noise available journals. Your job is to present a brief (10-15 • Auditory grouping and segregation minute) summary of the paper to the class and • Speech perception and hearing loss • Cochlear implants and speech coding initiate/lead discussion of the paper, then prepare a • Development of speech perception written report. • Second language acquisition • Audiovisual speech perception • Neural coding of speech • Models of speech perception
1 Finding papers Finding papers PubMed search engine: Journal of the Acoustical Society of America: http://www.ncbi.nlm.nih.gov/entrez/ http://scitation.aip.org/jasa/
The evolution of speech: The evolution of speech: a comparative review a comparative review Primate vocal tract W. Tecumseh Fitch Primate vocal tract W. Tecumseh Fitch Trends in Cognitive Sciences 4(7) July 2000 Trends in Cognitive Sciences 4(7) July 2000
air sac orangutan chimpanzee human tongue body larynx
Source-filter theory of speech production Human vocal tract
Vibrating Output Lungs Vocal Vocal Sound folds Tract
Power supply Oscillator Resonator
2 Organs of speech
• Lungs: apply pressure to generate Acoustics air stream (power supply) • Larynx: air forced through the of speech glottis, a small opening between the vocal folds (sound source) • Vocal tract: pharynx, oral and ● Articulation nasal cavities serve as complex ● Phonation resonators (filter)
Source-Filter Theory Audio demo: the source signal
• Source signal for an adult male voice • Source signal for an adult female voice • Source signal for a 10-year child
From Fitch, W.T. (2000). Trends in Cognitive Sciences
Vocal fold oscillation Vocal fold oscillation
• One-mass model • Three-Mass Model – Air flow through the – One large mass glottis during the closing (representing the phase travels at the thyroarytenoid muscle) same speed because of and two smaller masses, inertia, producing M1 and M2 (representing lowered air pressure the vocal fold surface) . above the glottis. All three masses are connected by springs and damping constants.
Source: http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/model.html Source: http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/model.html
3 Source-Filter Theory: Vowels Source-Filter Theory: Vowels
G. Fant (1960). Acoustic Theory of Speech Time domain version: Production U( t ) T( t ) R( t ) = P( t ) Linear systems theory Assumpt ions: (1) linear ity (2) t ime-iiinvariance Vowels can be decomposed into two primary
components: a source (input signal) and a filter Amplitude (modulates the input). Frequency Frequency domain version: U( f ) · T( f ) · R( f ) = P( f )
Source properties
In voiced sounds the glottal source spectrum contains Demo: harmonic synthesis a series of lines called harmonics. The lowest one is called the fundamental frequency Additive harmonic synthesis: vowel /i/ (F0). F Cumulative sum of harmonics: vowel /i/ 0 0 Additive synthesis: “wheel” -10 Cumulative sum of partials: Amplitude -20
Spectrum -30
-40 Relative Amplitude (dB) -50 0 200 400 600 800 1000 Frequency (Hz)
Filter properties
The vocal tract resonances (called formants) produce source filter radiation= output sound peaks in the spectrum envelope. Formants are labelled F1, F2, F3, ... in order of increasing frequency. / i / F1 F2 F3 0
-10 F4 Amplitude -20 Spectrum Amplitude / A /
(with superimposed -30 LPC spectral envelope) Amplitude in dB in Amplitude -40
-50 0 1 2 3 4 Frequency Frequency (kHz)
4 Source: J. Hillenbrand Source-filter theory Source-filter theory
Robert Mannell, Macquarie University http://clas.mq.edu.au/phonetics/phonetics/vowelartic/vowel_artic.gif Source: J. Hillenbrand
Source: J. Hillenbrand Source: J. Hillenbrand Source-filter theory Source-filter theory
Source: J. Hillenbrand Source-filter theory Speech terminology…
/s/ and /S/ SF Models Overlaid Fundamental frequency (F0): lowest frequency component in voiced speech sounds, linked to vocal fold vibration. Formants: resonances of the vocal tract.
Amplitude
Formant F0
Frequency
5 Source properties: Pitch Source properties: Pitch
Fundamental frequency (F0) is determined by the F0 can be removed by filtering (as in telephone circuits) rate of vocal fold vibration, and is responsible for the and the pitch remains the same. perceived voice pitch. This is the problem of the missing fundamental, one of the oldest problems in hearing science. Pitch is determined by the frequency pattern of the harmonics (or their equivalent in the time domain, the periodicities in the waveform).
Harmonicity and Periodicity Harmonicity and Periodicity
Period: regularly repeating pattern in the Harmonic: regularly repeating peak in the waveform amplitude spectrum
Period duration, T0 = 6 ms
20 F0 = 1000 / 6 = 166 Hz
waveform 0 F0 = 1 / T0 -20
Amplitude (dB) amplitude -40 spectrum 0 0.5 1 1.5 2 2.5 F(kH)
Harmonicity and Periodicity Voicing irregularities • Period: regularly repeating pattern in Shimmer: variation in amplitude from one the waveform Period duration T = 6 ms 0 Waveform cycle to the next. Jitter: variation in frequency (period duration) from one cycle to the next.
20 F0 = 1000 / 6 = 166 Hz Harmonics are F0 = 1 / T0 0 integer multiples of F0 and are -20 evenly spaced in Amplitude Amplitude (dB) frequency -40 Spectrum
0 0.5 1 1.5 2 2.5 Frequency (kHz)
6 Vocal tract properties Voicing irregularities Resonating tube model Breathy voice is associated with a glottal – approximation for neutral vowel (schwa), [´] waveform with a steeper roll-off than modal – closed at one end (glottis); open at the other (lips) voice. As a result there is less energy in the – uniform cross-sectional area higher harmonics (steeper slope in the – curvature is relatively unimportant spectrum).
Glottis Lips
length, L
Uniform tube model (schwa) Vocal tract model
Quarter-wave resonator: /´/ Fn = ( 2n – 1 ) c / 4 L
– Fn is the frequency of formant n in Hz
– c is the v elocity of sou nd (abou t 35000 cm/sec)
– L is the length of the vocal tract (17.5 for adult male)
Acoustic vowel space Vocal tract model 3000 Quarter-wave resonator: i “heed” Fn = ( 2n – 1 ) c / 4 L 2000 – F1 = (2(1) –1)*35000/(4*17.5) = 500 Hz uency (Hz) q – F2 = (2(2) –1)*35000/(4*17. 5) = 1500 Hz Ə fre
– F3 = (2(3) –1)*35000/(4*17.5) = 2500 Hz F2 1000
ɑ “hod” u “who’d”
0
Second formant, 0 200 400 600 800 1000
First formant, F1 frequency (Hz)
7 Vocal tract model Vocal tract model
Quarter-wave resonator: Quarter-wave resonator:
Fn = ( 2n – 1 ) c / 4 L Fn = ( 2n – 1 ) c / 4 L
– Fn is the frequency of formant n in Hz – F1 = (2(1) –1)*35000/(4*17.5) = 500 Hz
– c is the v elocity of sou nd in air (abou t 35000 cm/sec) – F2 = (2(2) –1)*35000/(4*17. 5) = 1500 Hz
– L is the length of the vocal tract (17.5 for adult male) – F3 = (2(3) –1)*35000/(4*17.5) = 2500 Hz
L L
Helium speech Helium speech
The speed of sound in a helium/oxygen mixture Using Matlab as a calculator, find the at 20°C is about 93000 cm/s, compared to frequencies of F1, F2 and F3 for a 17.5 cm 35000 cm/s in air. This increases the resonance vocal tract producing the vowel /ə/ in a
frequencies but has relatively little effect on F0. helium/air mixture (velocity c ≈ 93000 cm/s) In helium speech, the formants are shifted up Fn = ( 2n – 1 ) c / 4 L but the pitch stays the same. F1 = (2(1) –1)*93000/(4*17.5) =
F2 = (2(2) –1)*93000 /(4*17.5) =
F3 = (2(3) –1)*93000 /(4*17.5) =
Speech in air Helium speech
Audio demos 4
– Speech in air
– Speech in helium 3 Hz)
– Pitch in air k 2 – Pitch in helium Frequency ( 1
0 http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html 0 100 200 300 400 500 600 700 800 900 Time (ms)
http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html
8 Speech in helium Perturbation Theory
4 The first formant (F1) frequency is lowered by a constriction in the front half of the vocal tract (/u/ and /i/), and raised when the constriction is 3 in the back of the vocal tract, as in /A/. Hz) k
2 Frequency ( 1 delta F1
0 glottis lips 0 100 200 300 400 500 600 700 800 Time (ms)
http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html
Perturbation Theory Perturbation Theory
The second formant (F2) is lowered by a The third formant (F3) is lowered by a constriction near the lips or just above the constriction at the lips or at the back of the pharynx; in /u/ both of these regions are mouth or in the upper pharynx. This occurs in constricted. F2 is raised when the constriction is /r/ and /r/-colored vowels like American be hin d the lips an d tee th, as in the vowe l /i/. English / ´’ /.
delta delta F2 F3
glottis lips glottis lips
Perturbation Theory Perturbation Theory
F3 is raised when the constriction is behind All formants tend to drop in frequency when the lips and teeth or near the upper pharynx. the vocal tract length is increased or when a constriction is formed at the lips.
delta F3
glottis lips
glottis lips
9 Perturbation Theory Perturbation Theory
F1 frequency is correlated with jaw F2 frequency is correlated with tongue opening (and inversely related to tongue advancement (front-back dimension) height ). 0 0 -10 -10 -20 -20 -30
-30 dB in Amplitude amplitude -40 Amplitude in dB in Amplitude spectrum -40 amplitude spectrum -50 0 1 2 3 4 -50 0 1 2 3 4 Frequency (kHz) Frequency (kHz)
Spectral analysis
Amplitude spectrum: sound pressure levels associated with different frequency components of a signal Power or intensity Amplitude or magnitude Log units and decibels (dB) Phase spectrum: relative phases associated with different frequency components Degrees or radians
Spectral analysis of speech Spectral analysis of speech
Why perform a frequency analyses of speech? But: the ear is not a spectrum analyzer.
– Auditory frequency selectivity is best at low – E+biEar+brain carry ou t a f orm of ff frequency anal liysis frequencies and gets progressively worse at higher – Relevant features of speech are more readily visible frequencies. in the amplitude spectrum than in the raw waveform
10 Formant Estimation Measuring 60 amplitude Formants formants 40 spectrum F1 F2 F3 20 60 Vowel spectra have peaks 0 0 500 1000 1500 2000 2500 3000 3500 4000 corresponding to the center Click here to play vowel /a/ frequencies of formants 60 40 ) 40
B
Formant frequency peak 20
0 20 estimation requires an 0 500 1000 1500 2000 2500 3000 3500 4000 interpolation process. Click here to play vowel /u/
60 Amplitude (d 0 40 Spectrum of 20 natural vowel / / 0 Q 0 500 1000 1500 2000 2500 3000 3500 4000 -20 Frequency (Hz) 0 2 4 6 Frequency (kHz)
Formant Estimation
harmonics But: harmonics also Children’s speech F1 generate spectral peaks; 60 formant frequencies do not necessarily coincide with Children’s voices have high F0s. 50 harmonic frequencies
) When F0 is 400 Hz (not unusual for 3-year B olds), only 4 harmonics appear in the 40 frequency range between 0-1600 Hz.
Amplitude (d 30
20 Amplitude 0 0.5 1 1.5 2 Frequency (kHz) 0 Frequency (kHz) 1.5
LPC spectrum Sparce sampling problem 70
60 Vowel identity is dependent on the frequencies of formant peaks. 50 LPC spectrum Formants are difficult to estimate when (dB) 40 fdfundament tlfal frequency i ihihs high. e 30
Amplitud FFT spectrum 20
10
0 Amplitude Amplitude 0 0.5 1 1.5 2 2.5 3 3.5 0 1.5 Frequency (kHz)
11 Formants sometimes appear to merge
LPC spectrum
Short-term amplitude spectrum Speech spectrogram
running amplitude spectra (codes amplitude changes in different frequency bands over time).
60 F1 = 281 Hz 50 F2 = 2196 Hz 40
30 F3 = 2755 Hz
20
Amplitude (dB) 10
0
-10 0 1 2 3 4 Frequency (kHz)
Speech spectrograms Speech spectrograms
What is a speech spectrogram? Why are speech spectrograms useful? – Display of amplitude spectrum at successive – Shows dynamic properties of speech instants in time ("running spectra") – Incorporates frequency analysis – How can 3 dimensions be represented on a two- – Related to speech production dimensional display? – Helps to visually identify speech cues Gray-scale spectrogram Waterfall plots Animation
12 “The watchdog” waveform
spectrogram
5
F3
F2 Frequency (kHz)
F1 0
Digital representations of signals
4 E F3 F2
time amplitude Frequency (kHz) F1
sampling quantization 0 0
TrackDraw: a graphical speech synthesizer TrackDraw: a graphical speech synthesizer
13 American English vowel space
Advancement ← F2 Peterson and Barney (1952) F1 front center back
→ i “heed” Acoustic measurements (made from spectrograms) u“who’d” of formant frequencies (F1, F2, F3) in vowels spoken high ɩ “hid” ʊ “hood” by 76 men, women and children. vowel space: projection of a given talker’s vowels in Height a F1 x F2 plane e“hayed” o“hoed” Ə “schwa” Simple target model: vowels are differentiated mid ɛ “head” ɔ “hawed” (perceptually) by F1 and F2 frequencies measured in the middle of the vowel (vowel target).
low ʌ “hut” æ“had” ɑ “hod”
Vowel formant space: F1 x F2 Assmann & Katz JASA 2000 Peterson andPeterson Barney and Barney (1952) (1952) 3500 i Children 3000 3.0 i Women Females i 2500 æ i 2.5 I Men æ i E 2000 Q æ 2.0 I y (kHz) F2 (Hz) F2 cy (Hz) cy c
f E
n u Q Males U √ 1500 o
F2 Freque 1.5 √ A
u u c U F2 Frequen Frequency o Frequency
o A 1000 u c u c 1.0
250 500 1000 1500 200 300 400 600 800 1000 200 F1 Frequency (Hz) F1 frequency (Hz) F1 Frequency (Hz)
14