HCS 7367
Dr. Peter Assmann
Fall 2010

Course web page

• http://www.utdallas.edu/~assmann/hcs6367
– Course information
– Lecture notes
– Speech demos
– Assigned readings
– Additional resources in speech & hearing

Course materials

• http://www.utdallas.edu/~assmann/hcs6367/readings
• No required text; all readings online
• Background reading: http://www.speechandhearing.net/library/speech_science.php
• Recommended books:
– Kent, R.D. & Read, C. (2001). The Acoustic Analysis of Speech. (Singular).
– Stevens, K.N. (1999). Acoustic Phonetics (Current Studies in Linguistics). M.I.T. Press.

Course requirements

• Class presentations (15%)
• Written reports on class presentations (15%)
• Midterm take-home exam (20%)
• Term paper (50%)

Class presentations and reports

• Pick two broad topics from the field of speech perception. For each topic, pick a suitable (peer-reviewed) paper from the readings web page or from available journals. Your job is to present a brief (10-15 minute) summary of the paper to the class and initiate/lead discussion of the paper, then prepare a written report.

Suggested Topics

• Speech production and perception
• production and perception
• Suprasegmentals and
• Speech perception in noise
• Auditory grouping and segregation
• Speech perception and hearing loss
• Cochlear implants and speech coding
• Development of speech perception
• Second language acquisition
• Audiovisual speech perception
• Neural coding of speech
• Models of speech perception

Finding papers

• PubMed search engine: http://www.ncbi.nlm.nih.gov/entrez/
• Journal of the Acoustical Society of America: http://scitation.aip.org/jasa/

The evolution of speech: a comparative review
W. Tecumseh Fitch
Trends in Cognitive Sciences 4(7) July 2000

Primate vocal tract
[Figure: vocal tracts of orangutan, chimpanzee and human, with air sac, tongue body and larynx labelled]

Source-filter theory of Human vocal tract

Lungs (power supply) → Vibrating vocal folds (oscillator) → Vocal tract (resonator) → Output sound

Organs of speech

• Lungs: apply pressure to generate air stream (power supply)
• Larynx: air forced through the glottis, a small opening between the vocal folds (sound source)
• Vocal tract: pharynx, oral and nasal cavities serve as complex resonators (filter)

Acoustics of speech

● Articulation

Source-Filter Theory
[Figure] From Fitch, W.T. (2000). Trends in Cognitive Sciences

Audio demo: the source signal

• Source signal for an adult male voice
• Source signal for an adult female voice
• Source signal for a 10-year-old child

Vocal fold oscillation

• One-mass model
– Air flow through the glottis during the closing phase travels at the same speed because of inertia, producing lowered air pressure above the glottis.

• Three-mass model
– One large mass (representing the thyroarytenoid muscle) and two smaller masses, M1 and M2 (representing the vocal fold surface). All three masses are connected by springs and damping constants.

Source: http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/model.html

Source-Filter Theory

• G. Fant (1960). Acoustic Theory of Speech Production
• Linear systems theory
• Assumptions: (1) linearity (2) time-invariance
• Vowels can be decomposed into two primary components: a source (input signal) and a filter (modulates the input).

Source-Filter Theory: Vowels

Time domain version: U( t ) * T( t ) * R( t ) = P( t ), where * denotes convolution
Frequency domain version: U( f ) · T( f ) · R( f ) = P( f )
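The time-domain and frequency-domain versions are linked by the convolution theorem: convolving signals in time multiplies their spectra. A minimal Python sketch follows. The short sequences standing in for the source, vocal tract filter and radiation, and the helper names `convolve` and `dft`, are illustrative assumptions and not values from Fant's model.

```python
import cmath

def convolve(x, h):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def dft(x, n):
    """n-point discrete Fourier transform of x, zero-padded to length n."""
    padded = list(x) + [0.0] * (n - len(x))
    return [
        sum(padded[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Toy stand-ins (assumed values) for source u, vocal tract filter t_imp, radiation r.
u = [1.0, 0.0, 0.0, 1.0]
t_imp = [1.0, 0.5, 0.25]
r = [1.0, -1.0]

# Time domain: output is the source convolved with filter and radiation.
p = convolve(convolve(u, t_imp), r)

# Frequency domain: the spectrum of p equals the product of the three spectra.
n = len(p)
P_time = dft(p, n)
P_freq = [a * b * c for a, b, c in zip(dft(u, n), dft(t_imp, n), dft(r, n))]
max_diff = max(abs(a - b) for a, b in zip(P_time, P_freq))
print(max_diff)  # differs from zero only by rounding error
```

Zero-padding every sequence to the length of the output is what makes the ordinary (linear) convolution match the product of DFTs.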

Source properties

• In voiced sounds the glottal source spectrum contains a series of lines called harmonics.
• The lowest one is called the fundamental frequency (F0).
[Figure: amplitude spectrum with F0 labelled; axes: Relative Amplitude (dB), -50 to 0, vs Frequency (Hz), 0 to 1000]

Demo: harmonic synthesis

• Additive harmonic synthesis: vowel /i/
• Cumulative sum of harmonics: vowel /i/
• Additive synthesis: “wheel”
• Cumulative sum of partials:
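Additive harmonic synthesis can be sketched in a few lines of Python: sinusoids at integer multiples of F0 are summed one at a time. The F0, sampling rate, 1/k harmonic amplitudes and function name below are assumptions chosen for illustration; they do not reproduce the /i/ or “wheel” demos.

```python
import math

def additive_synthesis(f0, n_harmonics, fs=8000, duration=0.02):
    """Return samples of a sum of harmonics k*f0, each with amplitude 1/k."""
    n_samples = int(round(fs * duration))
    signal = [0.0] * n_samples
    for k in range(1, n_harmonics + 1):          # add one harmonic at a time
        for n in range(n_samples):
            signal[n] += math.sin(2 * math.pi * k * f0 * n / fs) / k
    return signal

wave = additive_synthesis(f0=100.0, n_harmonics=5)
period = int(8000 / 100.0)   # samples in one period of F0
print(len(wave), period)
```

Because every component is a multiple of F0, the summed waveform repeats once per F0 period regardless of how many harmonics are added.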

Filter properties

• The vocal tract resonances (called formants) produce peaks in the spectrum envelope.
• Formants are labelled F1, F2, F3, ... in order of increasing frequency.
• source × filter × radiation = output sound
[Figure: amplitude spectrum (with superimposed LPC spectral envelope), with F1 to F4 labelled; axes: Amplitude in dB, -50 to 0, vs Frequency (kHz), 0 to 4]

Source-filter theory
[Figures: source-filter illustrations. Sources: J. Hillenbrand; Robert Mannell, Macquarie University, http://clas.mq.edu.au/phonetics/phonetics/vowelartic/vowel_artic.gif]

Source-filter theory
/s/ and /ʃ/ SF Models Overlaid
Source: J. Hillenbrand

Speech terminology…

• Fundamental frequency (F0): lowest frequency component in voiced speech sounds, linked to vocal fold vibration.
• Formants: resonances of the vocal tract.
[Figure: spectrum with F0 and a formant labelled; axes: Amplitude vs Frequency]

Source properties: Pitch

• Fundamental frequency (F0) is determined by the rate of vocal fold vibration, and is responsible for the perceived voice pitch.
• F0 can be removed by filtering (as in circuits) and the pitch remains the same.
• This is the problem of the missing fundamental, one of the oldest problems in hearing science.
• Pitch is determined by the frequency pattern of the harmonics (or their equivalent in the time domain, the periodicities in the waveform).
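The missing fundamental can be illustrated numerically. The sketch below builds a signal from harmonics 2, 3 and 4 of 200 Hz, with no energy at 200 Hz itself, and recovers the 200 Hz periodicity from the waveform using autocorrelation. The sampling rate, search range and function names are assumptions, and autocorrelation is only one of several ways to measure periodicity.

```python
import math

FS = 8000               # sampling rate in Hz (assumed)
F0 = 200.0              # fundamental that is absent from the signal
HARMONICS = (2, 3, 4)   # only 400, 600 and 800 Hz are present

def missing_fundamental_signal(n_samples):
    return [
        sum(math.sin(2 * math.pi * k * F0 * n / FS) for k in HARMONICS)
        for n in range(n_samples)
    ]

def estimate_f0(x, min_hz=125.0, max_hz=500.0, window=400):
    """Pick the lag with the largest autocorrelation and convert it to Hz."""
    min_lag = int(FS / max_hz)
    max_lag = int(FS / min_hz)
    best_lag, best_r = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[n] * x[n + lag] for n in range(window))
        if r > best_r:
            best_lag, best_r = lag, r
    return FS / best_lag

x = missing_fundamental_signal(400 + 64)
f0_estimate = estimate_f0(x)
print(f0_estimate)
```

The waveform repeats every 5 ms (40 samples) even though no component lies at 200 Hz, which is the time-domain counterpart of the harmonic pattern described above.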

Harmonicity and Periodicity

• Period: regularly repeating pattern in the waveform
• Harmonic: regularly repeating peak in the amplitude spectrum
• Harmonics are integer multiples of F0 and are evenly spaced in frequency
• F0 = 1 / T0. Example: period duration T0 = 6 ms, so F0 = 1000 / 6 ≈ 167 Hz
[Figure: waveform (top) and amplitude spectrum (bottom); spectrum axes: Amplitude (dB), -40 to 20, vs Frequency (kHz), 0 to 2.5]

Voicing irregularities

• Shimmer: variation in amplitude from one cycle to the next.
• Jitter: variation in frequency (period duration) from one cycle to the next.
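One common way to quantify jitter and shimmer, assumed here rather than taken from the slides, is the mean absolute cycle-to-cycle difference divided by the mean value. The period and amplitude lists are made-up numbers for illustration.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements for six consecutive glottal cycles.
periods_ms = [6.0, 6.1, 5.9, 6.0, 6.2, 5.8]
peak_amplitudes = [1.00, 0.95, 1.02, 0.98, 1.01, 0.97]

jitter = relative_perturbation(periods_ms)         # period-to-period variation
shimmer = relative_perturbation(peak_amplitudes)   # amplitude variation
print(f"jitter = {100 * jitter:.2f}%, shimmer = {100 * shimmer:.2f}%")
```

A perfectly periodic voice would give zero for both measures.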

Voicing irregularities

• is associated with a glottal waveform with a steeper roll-off than modal voice. As a result there is less energy in the higher harmonics (steeper slope in the spectrum).

Vocal tract properties

• Resonating tube model
– approximation for neutral vowel (schwa), [ə]
– closed at one end (glottis); open at the other (lips)
– uniform cross-sectional area
– curvature is relatively unimportant
[Figure: uniform tube of length L, closed at the glottis and open at the lips]

Uniform tube model (schwa)

Vocal tract model

• Quarter-wave resonator for /ə/: Fn = ( 2n – 1 ) c / 4 L
– Fn is the frequency of formant n in Hz
– c is the velocity of sound in air (about 35000 cm/sec)
– L is the length of the vocal tract (17.5 cm for adult male)

Vocal tract model

• Quarter-wave resonator: Fn = ( 2n – 1 ) c / 4 L
– F1 = (2(1) – 1) * 35000 / (4 * 17.5) = 500 Hz
– F2 = (2(2) – 1) * 35000 / (4 * 17.5) = 1500 Hz
– F3 = (2(3) – 1) * 35000 / (4 * 17.5) = 2500 Hz

Acoustic vowel space
[Figure: second formant, F2 frequency (Hz), 0 to 3000, plotted against first formant, F1 frequency (Hz), 0 to 1000, showing i “heed”, ə, ɑ “hod” and u “who’d”]
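The quarter-wave formula is easy to check in code. The Python sketch below reproduces the three values for a 17.5 cm tract; the function name and the second, shorter tract length (14 cm) are assumptions added for illustration.

```python
def quarter_wave_formants(length_cm, c_cm_per_s=35000.0, n_formants=3):
    """Resonances of a uniform tube closed at one end: Fn = (2n - 1) c / 4L."""
    return [
        (2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
        for n in range(1, n_formants + 1)
    ]

formants = quarter_wave_formants(17.5)
print(formants)   # [500.0, 1500.0, 2500.0]

# A shorter tract (14 cm is a hypothetical value) raises every resonance.
shorter = quarter_wave_formants(14.0)
print(shorter)
```

The resonances fall at odd multiples of F1, so they are spaced 2 × F1 apart.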


Helium speech

• The speed of sound in a helium/oxygen mixture at 20°C is about 93000 cm/s, compared to 35000 cm/s in air. This increases the formant frequencies but has relatively little effect on F0. In helium speech, the formants are shifted up but the pitch stays the same.

Helium speech

• Using Matlab as a calculator, find the frequencies of F1, F2 and F3 for a 17.5 cm vocal tract producing the vowel /ə/ in a helium/air mixture (velocity c ≈ 93000 cm/s)
Fn = ( 2n – 1 ) c / 4 L
• F1 = (2(1) – 1) * 93000 / (4 * 17.5) =
• F2 = (2(2) – 1) * 93000 / (4 * 17.5) =
• F3 = (2(3) – 1) * 93000 / (4 * 17.5) =
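The slide suggests Matlab as a calculator; the same arithmetic is sketched here in Python as one possible way to work the exercise. The constant and function names are assumptions; the sound speeds and tract length are the ones given above.

```python
C_AIR = 35000.0      # cm/s, speed of sound in air (from the slides)
C_HELIUM = 93000.0   # cm/s, speed of sound in the helium mixture (from the slides)
LENGTH = 17.5        # cm, vocal tract length

def formant(n, c, length=LENGTH):
    """Quarter-wave resonance n for sound speed c: (2n - 1) c / 4L."""
    return (2 * n - 1) * c / (4 * length)

for n in (1, 2, 3):
    air = formant(n, C_AIR)
    helium = formant(n, C_HELIUM)
    print(f"F{n}: air {air:.0f} Hz, helium {helium:.0f} Hz, ratio {helium / air:.2f}")
```

Since c appears only as a multiplier, every formant is scaled by the same factor, 93000 / 35000 ≈ 2.66, while F0 (set by the vocal folds) is left out of the formula entirely.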

Helium speech

• Audio demos
– Speech in air
– Speech in helium
– Pitch in air
– Pitch in helium
http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html

Speech in air
[Figure: spectrogram; axes: Frequency (kHz), 0 to 4, vs Time (ms), 0 to 900]
http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html

Speech in helium
[Figure: spectrogram; axes: Frequency (kHz), 0 to 4, vs Time (ms), 0 to 800]
http://phys.unsw.edu.au/phys_about/PHYSICS!/SPEECH_HELIUM/speech.html

Perturbation Theory

• The first formant (F1) frequency is lowered by a constriction in the front half of the vocal tract (/u/ and /i/), and raised when the constriction is in the back of the vocal tract, as in /ɑ/.
[Figure: delta F1 as a function of constriction location, from glottis to lips]

Perturbation Theory

• The second formant (F2) is lowered by a constriction near the lips or just above the pharynx; in /u/ both of these regions are constricted. F2 is raised when the constriction is behind the lips and teeth, as in the vowel /i/.
[Figure: delta F2 as a function of constriction location, from glottis to lips]

• The third formant (F3) is lowered by a constriction at the lips or at the back of the mouth or in the upper pharynx. This occurs in /r/ and /r/-colored vowels like American English /ɚ/.
[Figure: delta F3 as a function of constriction location, from glottis to lips]

Perturbation Theory

• F3 is raised when the constriction is behind the lips and teeth or near the upper pharynx.
[Figure: delta F3 as a function of constriction location, from glottis to lips]

• All formants tend to drop in frequency when the vocal tract length is increased or when a constriction is formed at the lips.
[Figure: vocal tract from glottis to lips]

Perturbation Theory

• F1 frequency is correlated with jaw opening (and inversely related to tongue height).
• F2 frequency is correlated with tongue advancement (front-back dimension).
[Figures: amplitude spectra; axes: Amplitude in dB, -50 to 0, vs Frequency (kHz), 0 to 4]

Spectral analysis

• Amplitude spectrum: sound pressure levels associated with different frequency components of a signal
– Power or intensity
– Amplitude or magnitude
– Log units and decibels (dB)
• Phase spectrum: relative phases associated with different frequency components
– Degrees or radians
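As a small worked example of these quantities, the Python sketch below takes the discrete Fourier transform of a single cosine and reads off its amplitude, its level in dB (from amplitude and from power) and its phase. The sampling rate, test signal and the reference amplitude of 1 are assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform computed directly from its definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

FS = 1000   # Hz, assumed sampling rate
N = 100     # samples, so DFT bins are FS / N = 10 Hz apart

# 50 Hz cosine with amplitude 2 and a starting phase of 45 degrees.
x = [2.0 * math.cos(2 * math.pi * 50 * n / FS + math.pi / 4) for n in range(N)]

X = dft(x)
k = 5                                          # bin centred on 50 Hz
magnitude = 2 * abs(X[k]) / N                  # amplitude of the component
level_db = 20 * math.log10(magnitude)          # dB from amplitude
power_db = 10 * math.log10(magnitude ** 2)     # same level from power
phase_deg = math.degrees(cmath.phase(X[k]))    # phase in degrees
print(round(magnitude, 3), round(level_db, 2), round(phase_deg, 1))
```

The factor 20 for amplitude and 10 for power give the same decibel value because power is proportional to amplitude squared.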

Spectral analysis of speech

• Why perform a frequency analysis of speech?
– Ear + brain carry out a form of frequency analysis
– Relevant features of speech are more readily visible in the amplitude spectrum than in the raw waveform

• But: the ear is not a spectrum analyzer.
– Auditory frequency selectivity is best at low frequencies and gets progressively worse at higher frequencies.

Measuring formants

• Vowel spectra have peaks corresponding to the center frequencies of formants.
[Figures: amplitude spectra with F1, F2 and F3 labelled, including vowels /a/ and /u/; axes: Amplitude (dB), 0 to 60, vs Frequency (Hz), 0 to 4000]

Formant Estimation

• Formant frequency peak estimation requires an interpolation process.
[Figure: spectrum of natural vowel /æ/; axes: Amplitude (dB) vs Frequency (kHz), 0 to 6]
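The slides do not say which interpolation is meant. One widely used option, assumed here, is to fit a parabola through the highest spectral sample and its two neighbours and take the vertex as the peak frequency. The envelope values in this Python sketch are made up so that the true peak (520 Hz) falls between samples.

```python
def parabolic_peak(levels_db, bin_hz):
    """Refine the highest sample of a dB spectrum with a three-point parabola fit."""
    # Index of the largest interior sample (so both neighbours exist).
    k = max(range(1, len(levels_db) - 1), key=lambda i: levels_db[i])
    a, b, c = levels_db[k - 1], levels_db[k], levels_db[k + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # vertex position in bins
    return (k + offset) * bin_hz

# Samples of a hypothetical envelope peaking at 520 Hz, taken every 100 Hz.
spectrum = [-((f - 520.0) / 100.0) ** 2 for f in range(0, 1100, 100)]
peak_hz = parabolic_peak(spectrum, bin_hz=100.0)
print(round(peak_hz, 1))
```

Without interpolation the estimate would be limited to the sample grid (500 Hz here); the sketch does not guard against a perfectly flat triple of samples, where the denominator would be zero.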

Formant Estimation

• But: harmonics also generate spectral peaks; formant frequencies do not necessarily coincide with harmonic frequencies.
[Figure: spectrum with harmonics and F1 labelled; axes: Amplitude (dB), 20 to 60, vs Frequency (kHz), 0 to 2]

Children’s speech

• Children’s voices have high F0s.
• When F0 is 400 Hz (not unusual for 3-year-olds), only 4 harmonics appear in the frequency range between 0 and 1600 Hz.
[Figure: spectrum; axes: Amplitude vs Frequency (kHz), 0 to 1.5]
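The harmonic count can be verified directly. In this Python sketch the function name and the 100 Hz comparison voice are assumptions; the 400 Hz and 1600 Hz figures come from the slide.

```python
def harmonics_in_range(f0_hz, max_hz):
    """Frequencies of the harmonics of f0 that fall at or below max_hz."""
    return [k * f0_hz for k in range(1, int(max_hz // f0_hz) + 1)]

child = harmonics_in_range(400, 1600)
adult = harmonics_in_range(100, 1600)   # 100 Hz is an assumed adult male F0
print(child)                            # [400, 800, 1200, 1600]
print(len(child), len(adult))
```

With harmonics 400 Hz apart, the spectrum envelope is sampled at only a few points, so a formant peak lying between two harmonics is poorly defined.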

Sparse sampling problem

• Vowel identity is dependent on the frequencies of formant peaks.
• Formants are difficult to estimate when fundamental frequency is high.
[Figure: spectrum; axes: Amplitude vs Frequency (kHz), 0 to 1.5]

LPC spectrum
[Figure: LPC spectrum superimposed on FFT spectrum; axes: Amplitude (dB), 0 to 70, vs Frequency (kHz), 0 to 3.5]

Formants sometimes appear to merge
[Figure: LPC spectrum]

Short-term amplitude spectrum
[Figure: spectrum with F1 = 281 Hz, F2 = 2196 Hz, F3 = 2755 Hz; axes: Amplitude (dB), -10 to 60, vs Frequency (kHz), 0 to 4]

Speech

• running amplitude spectra (codes amplitude changes in different frequency bands over time).

Speech spectrograms

• What is a speech spectrogram?
– Display of amplitude spectrum at successive instants in time ("running spectra")
– How can 3 dimensions be represented on a two-dimensional display?
• Gray-scale spectrogram
• Waterfall plots
• Animation

• Why are speech spectrograms useful?
– Shows dynamic properties of speech
– Incorporates frequency analysis
– Related to speech production
– Helps to visually identify speech cues
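The idea of "running spectra" can be sketched in plain Python: cut the signal into overlapping frames and compute one amplitude spectrum per frame. The frame length, hop size, Hann window, sampling rate and two-tone test signal are assumptions chosen to keep the example small, not settings from any particular spectrogram program.

```python
import cmath
import math

def frame_spectrum_db(frame):
    """Hann-windowed DFT magnitudes in dB for bins 0 .. N/2."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    levels = []
    for k in range(n // 2 + 1):
        coeff = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        levels.append(20 * math.log10(abs(coeff) + 1e-12))  # avoid log of zero
    return levels

def spectrogram(signal, frame_len=64, hop=32):
    """One spectrum per frame: rows are time, columns are frequency."""
    return [frame_spectrum_db(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]

def tone(freq_hz, n_samples, fs):
    return [math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n_samples)]

FS = 8000
signal = tone(1000, 256, FS) + tone(2000, 256, FS)   # 1 kHz, then 2 kHz
frames = spectrogram(signal)
peak_bins = [max(range(len(row)), key=lambda k: row[k]) for row in frames]
print(peak_bins[0] * FS / 64, peak_bins[-1] * FS / 64)   # peak frequency, first and last frame
```

Each row could be drawn as one column of a gray-scale spectrogram, with darker marks for higher dB values.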

“The watchdog”
[Figure: waveform and spectrogram with F1, F2 and F3 labelled; axis: Frequency (kHz), 0 to 5]

Digital representations of signals
[Figure: sampling along the time axis and quantization along the amplitude axis]
[Figure: spectrogram with F1, F2 and F3 labelled; axis: Frequency (kHz), 0 to 4]
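Sampling and quantization can be shown in a few lines of Python. Sampling reads the signal at regular instants in time; quantization rounds each amplitude to one of a fixed set of integer codes. The sampling rate, bit depth and function names are illustrative assumptions.

```python
import math

def sample(signal_fn, fs, duration):
    """Evaluate a continuous-time function at multiples of 1/fs seconds."""
    return [signal_fn(n / fs) for n in range(int(round(fs * duration)))]

def quantize(samples, n_bits):
    """Round samples in [-1, 1] to signed integer codes with n_bits of resolution."""
    max_code = 2 ** (n_bits - 1) - 1
    return [int(round(s * max_code)) for s in samples]

def sine_100hz(t):
    return math.sin(2 * math.pi * 100 * t)

samples = sample(sine_100hz, fs=8000, duration=0.01)   # 80 samples (one period)
codes = quantize(samples, n_bits=8)                    # integers in -127 .. 127
print(len(samples), min(codes), max(codes))
```

Raising `fs` gives finer resolution in time; raising `n_bits` gives finer resolution in amplitude.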

TrackDraw: a graphical speech synthesizer

American English vowel space

Advancement (← F2):   front                   center        back
Height (→ F1)
high                  i “heed”, ɩ “hid”                     u “who’d”, ʊ “hood”
mid                   e “hayed”, ɛ “head”     ə “schwa”     o “hoed”, ɔ “hawed”
low                   æ “had”                 ʌ “hut”       ɑ “hod”

Peterson and Barney (1952)

• Acoustic measurements (made from spectrograms) of formant frequencies (F1, F2, F3) in vowels spoken by 76 men, women and children.
• Vowel space: projection of a given talker’s vowels in an F1 x F2 plane
• Simple target model: vowels are differentiated (perceptually) by F1 and F2 frequencies measured in the middle of the vowel (vowel target).
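The simple target model can be sketched as nearest-target classification in the F1 x F2 plane. In this Python sketch the target values are approximate adult male means as commonly quoted from Peterson and Barney (1952) and should be treated as illustrative; the Euclidean distance in Hz and the function name are assumptions, not part of the original model description.

```python
import math

# Approximate (F1, F2) targets in Hz for adult male talkers; illustrative values.
TARGETS = {
    "heed": (270.0, 2290.0),
    "hod": (730.0, 1090.0),
    "who'd": (300.0, 870.0),
}

def classify(f1, f2):
    """Return the word whose vowel target is nearest to (f1, f2) in Hz."""
    return min(
        TARGETS,
        key=lambda word: math.hypot(f1 - TARGETS[word][0], f2 - TARGETS[word][1]),
    )

print(classify(300.0, 2200.0))   # measured mid-vowel formants of an unknown token
print(classify(700.0, 1150.0))
```

A fuller version would include all the vowels in the chart and would likely need talker normalization, since men, women and children occupy different regions of the F1 x F2 plane.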

Vowel formant space: F1 x F2
[Figures: vowels plotted by F1 frequency (Hz) against F2 frequency, with separate regions for men, women and children. Sources: Peterson and Barney (1952); Assmann & Katz, JASA 2000]