Lecture 3: and spoken

Pierre Lison, Language Technology Group (LTG) Department of Informatics

Fall 2012, September 3 2012

fredag 31. august 2012 Outline

• Introduction to phonetics • Spoken language syntax • Summary

@ 2012, Pierre Lison - INF5820 course 2 fredag 31. august 2012 Outline

•Introduction to phonetics • Articulatory phonetics • Pronunciation variation • Acoustic phonetics • Spoken language syntax • Summary

@ 2012, Pierre Lison - INF5820 course 3 fredag 31. august 2012 Phonetics

• A basic knowledge of phonetics is useful when working on spoken dialogue systems • How does recognition work? • Why is speech recognition difficult? • How can we detect the intonation of an utterance? • How does speech synthesis work? • The next slides will introduce some of the key concepts we will use later

@ 2012, Pierre Lison - INF5820 course 4 fredag 31. august 2012 Phonetics and phonology

• Phonetics = scientific study of speech sounds • Articulatory phonetics: how are sounds produced by the vocal organs? • Acoustic phonetics: what are the acoustic properties of speech sounds? • Auditory phonetics: how are sounds perceived by the human ears and brain? (we’ll skip that part) • Phonology: scientific study of how sounds are organised in a given language

@ 2012, Pierre Lison - INF5820 course 5 fredag 31. august 2012 Phonetics

• The basic phonetic unit is the phone, which is a distinctive speech sound • The International Phonetic is a standard for transcribing the sounds of all human • Currently 107 letters, 52 diacritics, and four prosodic marks • A specific language (e.g. English or Norwegian) will only use a subset of these sounds • Other compatible transcription codes, such as ARPAbet for U.S. English

@ 2012, Pierre Lison - INF5820 course 6 fredag 31. august 2012 IPA transcription

Nordavinden og sola kranglet om hvem av dem som var den sterkeste. Da kom det en mann gående med en varm frakk på seg. De blei enige om at den som først kunne få mannen til å ta av seg frakken skulle gjelde for sterkere enn den andre. Så blåste nordavinden av all si Bokmål makt, men dess mer han blåste, dess tettere trakk mannen frakken rundt seg, og til sist gav nordavinden opp. Da skinte sola fram så godt og varmt, og straks tok mannen av seg frakken. Og så måtte nordavinden innrømme at sola var den sterkeste av dem.

[ ˈˈnuːɾɑˌʋinˑn ̩ ɔ ˈsuːln ̩ ˈˈkɾɑŋlət ɔm ˈʋem ɑ dem sɱ̩ ˈʋɑː ɖɳ̩ ˈˈstæɾ̥kəstə ˌdɑˑ ˈkʰɔmː de n ˈmɑnː ˌgɔˑənə me n ˈʋɑɾm ˈfɾɑkː pɔ ˌsæ di ble ˈˈeːnjə ɔm ɑt ˈdenː sɱ̩ ˈføʂt̠ ˌkʰʉnˑə fɔ ˈmɑnːn̩ tɔ ˈˈtʰɑː IPA ɑ sæ ˈfɾɑkːən ˌskʉlˑə ˈˈjelːə fɔ ɖɳ̩ ˈˈstæɾkəstə ɑ ˌdemˑ ˈsoː ˈˈbloːstə ˈˈnuːɾɑˌʋinːn̩ ɑ ˈʔɑlː sin (Oslo) ˈmɑkʰtʰ men ju ˈmeːɾ ɦɑm ˈˈbloːstə ju ˈˈtʰetːəɾə ˌtɾɑkˑ ˈmɑnːn̩ ˈfɾɑkːən ˈɾʉnt sæ ɔ tə ˈsist ˌmɔtˑə ˈˈnuːɾɑˌʋinˑn ̩ ˈˈjiː ˌɔpʰˑ ˌdɑː ˈˈʂintə ˈsuːln ̩ ˈfɾem ˌsɔˑ ˈgɔtʰː ɔ ˈʋɑɾmtʰ ɑt̚ ˈmɑnːən ˈstɾɑks ˌmɔtˑə ˈˈtɑː ɑ sæ ˈfɾɑkːən ɔ ˈsoː ˌmɔtˑə ˈˈnuːɾɑˌʋinˑn̩ ˈinːˌɾømˑə ɑt ˈsuːln̩ ˈʋɑːɾ n̩ ˈˈstæɾ̥kəstə ʔɑ ˈdemː]

[ ˈˈnuːɾɑˌʋinˑn̩ ɔ ˈsuːla ˈˈkɾɑŋlɑ um ˈkʋenː ɑʋ ˌd̠æi sɔɱ ˈʋɑː d̠ən ˈˈstæɾ̥kɑstə ˈdoː ˈkɔmː də ɛn ˈmɑnː ˈˈgɑŋgɑndə mə æn ˈʋɑɾmə ˈfɾɑkːə pɔ ˌseˑg dæi blæɪ ˈˈʔeːnigə um ɑt dæn sɔɱ ˈfysː IPA ˌkʰunˑə fɔ ˈmɑnːən tə ɔ ˈˈtʰɑːk ɑʋ sə ˈfɾɑçːən ˌskʉlːə ˈˈɾeknɑs feɾ ɛ̃n ˈˈstæɾkɑst ɑʋ ˌd̠ɛɪ ˈsoː ˈbluːs ˈˈnuːɾɑˌʋinˑn̩ ɑʋ ˈɑlː si ˈmɑkʰtʰ mən tɪ ˈmæi hɑn ˈbluːs tɪ ˈˈtʰetːɑɾə ˈdɾuːg ˈmɑnːən ˈfɾɑçːən (Fyresdal) ˈɾunt seɰ̥ ɔ tɪ ˈs̠l̥ʉtː ˈˈmɔtːə ˈˈnuːɾɑˌʋinˑn̩ʔ ˌjeˑ ˈupːʰ ˈd̥oː ˈʂæin ˈsuːla ˈfɾɑmː sɔ ˈgøʰtː ɔ ˈʋɑɾmt ɑt ˈmɑnːn̩ me ˈʔæi ˈgɔŋ ˈˈmɔtːə ˈˈtʰɑːkə ˈɑːʋ sə ˈfɾɑçːən ɔ ˈsoː ˈˈmɔtːə ˈˈnuːɾɑˌʋinˑn̩ ˈʋeːˌçɛn̠ːə ʔɑt ˈsuːla ˈʋɑː dən ˈˈstæɾ̥kɑ̰st ɑʋ ˈðæ̰ɪ̥]

[Nordavinden og sola, en norsk dialektprøvedatabase på nettet, NTNU, http://www.ling.hf.ntnu.no/nos/] @ 2012, Pierre Lison - INF5820 course 7 fredag 31. august 2012 The speech chain

[Denes and Pinson (1993), «The speech chain»]

@ 2012, Pierre Lison - INF5820 course 8 fredag 31. august 2012

• Sounds are variations in air pressure • How are they produced? • An air supply: the lungs (we usually speak by breathing out) • A sound source setting the air in motion (e.g. vibrating) in ways relevant to speech production: the larynx, in which the vocal folds are located • A set of 3 filters modulating this sound: the pharynx, the oral tract (teeth, tongue, palate,lips, etc.) and the nasal tract

@ 2012, Pierre Lison - INF5820 course 9 fredag 31. august 2012 Speech production

• Some languages also rely on sounds not produced by the vibration of the vocal folds, such as click languages (e.g. Khoisan family in southern Africa):

@ 2012, Pierre Lison - INF5820 course 10 fredag 31. august 2012 Speech production

@ 2012, Pierre Lison - INF5820 course 11 fredag 31. august 2012 Vocal tract

visualisation of the vocal tract via magnetic resonance imaging [MRI]:

[Speech Production and Articulation Knowledge Group, University of Southern California. http://sail.usc.edu/span/ ]

@ 2012, Pierre Lison - INF5820 course 12 fredag 31. august 2012 Sound classes

• Voiced sounds are made when the vocal folds are vibrating • e.g. [b], [d], [g], [v], [z], and all the vowels • Voiceless sounds are made without such vibration • e.g. [p], [t], [k], [f], [s]

@ 2012, Pierre Lison - INF5820 course 13 fredag 31. august 2012 Sound classes

• Phones divided in 2 (+1) main classes: • Consonants: may be voiced or voiceless, narrow to complete closure of vocal tract • Vowels: nearly always voiced, little obstruction of vocal tract, often louder and longer than consonants • Glides (or semivowels) between consonants and vowels, e.g. [y] and [w]

@ 2012, Pierre Lison - INF5820 course 14 fredag 31. august 2012 Sound classes: consonants

• Consonants are realised by restricting the airflow • They can differ in three ways: • The place of articulation: where does the obstruction occur in the vocal tract? • The manner of articulation: how does this restriction occur? • Whether they are voiced or voiceless

@ 2012, Pierre Lison - INF5820 course 15 fredag 31. august 2012 Sound classes: consonants

• Place of articulation: • Labial: with the lips, such as [p],[b],[m] • Coronal: with tip or blade of tongue, such as [s], [t], [d] • Guttural: with the back of the oral cavity, e.g. [k], [g],[ŋ] • Manner of articulation: • Stop: airflow completely blocked then released, such as [b], [t], [k] • Nasal: air passing through nasal cavity, such as [n],[m],[ŋ] • Fricatives: airflow constricted but not cut off, such as [f],[v], [s] • Approximants: articulators close together, but not enough to create turbulent airflow, such as [w],[y],[l],[r]

@ 2012, Pierre Lison - INF5820 course 16 fredag 31. august 2012 Sound classes: consonants (English)

Coronal «Guttural» Stop followed by fricative Approximants

@ 2012, Pierre Lison - INF5820 course 17 fredag 31. august 2012 Sound classes: vowels

• Relevant parameters for vowels: • vowel height: height of highest part of the tongue • vowel backness: location of this high point in the oral tract • Shape of the lips • In some languages, distinction between short and long vowels

@ 2012, Pierre Lison - INF5820 course 18 fredag 31. august 2012 Pronunciation variation

can be pronounced very differently due to various phonetic processes: • E.g. aspirations, assimilations, deletions • Co-articulation: anticipation of next phone by articulators, or perseverance of previous movement from last phone • Influence of various contextual factors: age, gender, environment, rate of speech, dialect, register

@ 2012, Pierre Lison - INF5820 course 19 fredag 31. august 2012 Pronunciation variation

• Concept of : • Abstraction over a set of phones/speech sounds • But regarded as a single «sound» within a particular language, in opposition to other sounds • «the smallest segmental unit of sound employed to form meaningful contrasts between utterances» • Example: the English /t/ can regroup [th], [ɾ] and [t]

@ 2012, Pierre Lison - INF5820 course 20 fredag 31. august 2012 Pronunciation variation

@ 2012, Pierre Lison - INF5820 course 21 fredag 31. august 2012 Acoustic phonetics

• A (speech) sound is a variation of air pressure • This variation originates from the speaker’s speech organs • We can plot a wave showing the changes in air pressure over time (zero value being the normal air pressure) • A wave can be defined using sine functions such as:

y(t)=A sin(2πft) ∗

output signal (in our time variable (continuous) case, air pressure) amplitude as a function of time frequency of the signal

@ 2012, Pierre Lison - INF5820 course 22 fredag 31. august 2012 Waveforms

• The frequency f of a wave is the number of times a second that the wave repeats itself (i.e. the number of cycles per second) 1 The period T is the duration of one cycle, with T = • f • The amplitude A corresponds to the maximum value reached by the wave untitled 3 Basic example: • 2

• Wave drawn for a 1 duration of 2 seconds T = 0.5 s 0 Amplitude A = 3 • -1 • Frequency f = 2 -2

-3 0 0.5 1 1.5 2 Time (s) @ 2012, Pierre Lison - INF5820 course 23 fredag 31. august 2012 Speech waveforms

• Of course, speech is more complex than a simple sine function • But it can still be described using the same mathematical apparatus, in terms of frequency, amplitude etc. untitled 0.251

0

-0.2938 0.04047 1.479 Time (s) @ 2012, Pierre Lison - INF5820 course 24 fredag 31. august 2012 Speech waveforms • If we zoom on the waveform from previous slide, and look at the time betweenuntitled 1.126 and 1.157 s:

0.136

0

-0.1201 1.126 1.157 Time (s) • We can see about 4 cycles in the waveform, which means a frequency of about 4/0.03 ≈129 Hz

@ 2012, Pierre Lison - INF5820 course 25 fredag 31. august 2012 Speech waveforms

• Important measures of a speech signal:

1. The fundamental frequency F0: lowest frequency of the sound wave, corresponding to the speed of vibration of the vocal folds

For the human voice, F0 ranges from 85-180 Hz for males, and 165-255 Hz for females

@ 2012, Pierre Lison - INF5820 course 26 fredag 31. august 2012 Speech waveforms

2. Its intensity: the signal power normalised to the human auditory threshold, measured in dB (decibels):

1 N Power = y(t )2 N i i=1 ￿ Power 1 N Intensity = 10 log = 10 log y(t )2 10 P 10 NP i 0 0 i =1 ￿

for a sample of N time points t1,... tN -5 P0 is the human auditory threshold, = 2 x 10 Pa

Note: dB scale is logarithmic, not linear!

@ 2012, Pierre Lison - INF5820 course 27

fredag 31. august 2012

Speech waveforms

• Why are F0 and the signal intensity important measures?

• F0 correlates with the pitch of the voice, and the analysis of the pitch movement for an utterance will give us its intonation • The signal intensity gives us an indication of the loudness of the speech sound

@ 2012, Pierre Lison - INF5820 course 28

fredag 31. august 2012 Pitch

«The ballslice1 is red» «Is the ballslice red?» 0.3681 0.2002

0 0

-0.3904 -0.2713 0.3906 slice1 1.284 0.3449 slice 1.187 Time (s) Time (s) 300 300 ) ) z z H H ( (

h h c c t t i i P P

75 75 0.3906 1.284 0.3449 1.187 Time (s) slice Time (s) 0.7604 Interrogative utterance slice 0 → rising intonation at the end 0.2366 @ 2012, Pierre Lison - INF5820 course 29 fredag 31. august 2012 -1 0 3.806 slice 4.904 Time (s) 100 90 ) B

-0.2549 d Loudness ( 80

1.848 slice y 3.012 t i

Time (s) s n 70 e

100 t n I 90 60 ) B

d «low intensity» «high intensity» ( 50 slice 80 slice

y 3.806 4.904 t i s Time (s) n 70

0.2366e 0.7604 t n I 60 50 1.848 3.012 0 0 Time (s)

-0.2549 -1 1.848 slice 3.012 3.806 slice 4.904 Time (s) Time (s) 100 100 90 90 ) ) B B d d ( (

80 80 y y t t i i s s n n 70 70 e e t t n n I I 60 60 50 50 1.848 3.012 3.806 4.904 Time (s) Time (s)

@ 2012, Pierre Lison - INF5820 course 30 fredag 31. august 2012 Spectral analysis

• Possible to derive basic phonetic features (such as pitch or loudness) directly from the waveform • But usually, the actual phones cannot be recognised • For this, we need to use a different representation, in terms of the signal’s component frequencies (spectral analysis) • Phones often have a distinctive «spectral signature»

@ 2012, Pierre Lison - INF5820 course 31 fredag 31. august 2012 Spectral analysis

• Basic idea: a wave can be represented as a superposition of many sine waves with distinct frequencies • This is called Fourier analysis (operation: Fourier transform) sineWithNoise Example: the wave on the left 4.492 4 is a combination of two sine: 3 2 y(t)=3sin(2π2t)+1.5sin(2π21t) 1 0 • The first sine has a slow frequency -1 (f=2 Hz) and an amplitude A = 3.0 -2 • The second sine has a faster -3 frequency (f = 21 Hz) and -4 -4.492 amplitude A = 1.5 0 0.5 1 1.5 2 Time (s)

@ 2012, Pierre Lison - INF5820 course 32 fredag 31. august 2012 Spectral analysis

• We can then draw a spectrum showing the component frequencies of the signal and their corresponding amplitudes • Spectrum for the waveform from the previous slide:

100 )

z We clearly see two H /

B peaks, one at frequency d (

l

e 80

v 2 Hz, and one at e l

e

r frequency 21 Hz u s s e r p 60 d

n (as expected from the u o S formula)

0 5 10 15 20 25 30 Frequency (Hz)

@ 2012, Pierre Lison - INF5820 course 33 fredag 31. august 2012 Spectral analysis

• The spectral analysis transforms a signal encoded in the time domain into one in the frequency domain • Goal is to determine the frequency content of a signal

sineWithNoise

4.492 4 100 ) 3 z H /

2 B d (

l

1 e 80 v e l 0 e r u

-1 s s e r p -2 60 d n

-3 u o S -4 -4.492 0 0.5 1 1.5 2 0 5 10 15 20 25 30 Time (s) Frequency (Hz)

@ 2012, Pierre Lison - INF5820 course 34 fredag 31. august 2012 Spectral analysis

• Below is the waveform for the phone [e] in the «Here», associated with its spectral analysis • We notice that some frequencies are more important than others untitled 0.2817 60 ) z H / B d (

40 l e v

0 e l

e r u s s e r p 20 d n u o S

-0.3454 0.4082 0.4394 0 6000 Time (s) Frequency (Hz)

@ 2012, Pierre Lison - INF5820 course 35 fredag 31. august 2012 Spectral analysis

@ 2012, Pierre Lison - INF5820 course 36 fredag 31. august 2012 Spectral analysis

untitled • Finally, we can draw a 0.4157 spectrogram, 0 which shows how the -0.4086 0 untitled 1.707 different frequencies Time (s) making up a waveform 5000 4000 ) z

change over time H ( 3000 y c n e

u 2000

horizontal axis shows time q

• e r F while vertical axis shows 1000 frequencies 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.61.707 • The darkness of a point Time (s) corresponds to the For instance, this frequency range amplitude of the frequency (±1800-4000 Hz) is high at time 0.2 s component at that time

@ 2012, Pierre Lison - INF5820 course 37 fredag 31. august 2012 Spectral analysis • From the spectrogram, we can view the formants of the speech sound, shown as dark bars

• The formants are denoted F1, F 2, ... • We can e.g. distinguish vowels based on these formants [a]slice [ae] 5000 5000 5000 5000 ) ) z 4000 z 4000 H H ( (

y y c c 3000 3000 n n e e u u

q 2000

q 2000 e e r r F F 1000 1000 0 0 1.81 2.53 13.57 14.39 Time (s) Time (s)

@ 2012, Pierre Lison - INF5820 course 38 fredag 31. august 2012 Outline

• Introduction to phonetics •Spoken language syntax • Summary

@ 2012, Pierre Lison - INF5820 course 39 fredag 31. august 2012 Spoken language syntax

• The structure of spoken language is markedly different than written language: • Presence of numerous disfluencies (filled pauses, repetitions, corrections, repairs) • Meta-communicative dialogue acts, where the speaker comments on his own «performance» • Non-sentential utterances (NSUs) which need to be interpreted from the context

@ 2012, Pierre Lison - INF5820 course 40 fredag 31. august 2012 Disfluencies

• Speakers construct their utterances «as they go», incrementally • Production leaves a trace in the speech stream • Presence of multiple disfluencies • Pauses, fillers («øh», «um», «liksom») • Fragments • repetitions («the the ball»), corrections («the ball err mug»), repairs («the bu/ ball»)

@ 2012, Pierre Lison - INF5820 course 41

fredag 31. august 2012 Shriberg’s disfluency model

• Internal structure of a disfluency:

Book a ticket to Boston uh I mean to Denver reparandum interregnum repair ￿ ￿￿ ￿ ￿ ￿￿ ￿ ￿ ￿￿ ￿ • reparandum: part of the utterance which is edited out • interregnum: (optional) filler • repair: part meant to replace the reparandum

[Shriberg (1994), «Preliminaries to a Theory of Speech Disfluencies», Ph.D thesis]

@ 2012, Pierre Lison - INF5820 course 42

fredag 31. august 2012 Basic examples of disfluencies

• Repetitions robot now go to the hallway the hallway

reparandum repair • Corrections: ￿ ￿￿ ￿ ￿ ￿￿ ￿ ok and then turn right no sorry I mean left

reparandum interregnum repair • Rephrasing/completion:￿ ￿￿ ￿ ￿ ￿￿ ￿ ￿￿￿￿ robot please give me the ball yes the red one on your left exactly

reparandum interregnum repair ￿ ￿￿ ￿ ￿ ￿￿ ￿ ￿ ￿￿ ￿ @ 2012, Pierre Lison - INF5820 course 43

fredag 31. august 2012 General remarks on disfluencies

• All parts of a disfluency may carry meaning relevant for interpretation • Even filled pauses such as «uh» and «um» • Levelt: reparandum and repair are of syntactic types that could be joined by a conjunction • Pervasive phenomena: about 6% of the words in spontaneous speech are «edited»

[Levelt W. (1983), « Monitoring and self-repair in speech», in Cognitive Science.]

@ 2012, Pierre Lison - INF5820 course 44

fredag 31. august 2012 More complex disfluencies

så gikk jeg e flytta vi til Nesøya da begynte jeg på barneskolen der og så har jeg gått på Landøya ungdomsskole # som ligger ## rett over broa nesten # rett med Holmen

jeg gikk på Bryn e skole som lå rett ved der vi bodde den gangen e barneskole videre på Hauger ungdomsskole

da hadde alle hele på skolen skulle liksom # spise julegrøt og det va- det var bare en mandel og da var jeg som fikk den da ble skikkelig sånn " wow # jeg har fått den " ble så glad

@ 2012, Pierre Lison - INF5820 course 45 fredag 31. august 2012 Non-sentential utterances

• Non-sentential utterances = utterances lacking an overt predicate • Also called fragments or elliptical utterances • Examples: • «Should I take the ball?» → «yes indeed» • «Please go the kitchen» → «go where?» • «Task completed» → «brilliant!» • «First take left after the corner» → «and afterwards?»

[J. Ginzburg (2008), Semantics for conversation]

@ 2012, Pierre Lison - INF5820 course 46 fredag 31. august 2012 Non-sentential utterances

• Pervasive in spoken dialogue: • ± 30% of the utterances, depending on the corpus • Why are they so common? • Their «full» meaning can often be easily retrieved from the context, with very little mental effort • They make the turns both shorter and more effective (cf. Grice’s Maxim of Quantity)

@ 2012, Pierre Lison - INF5820 course 47 fredag 31. august 2012 Non-sentential utterances

• Example from Harold Pinter’s play «Betrayal»: (a) Emma: We have a flat. (b) Robert: Ah, I see. (Pause) Nice? (Pause) A flat. It’s quite well established then, your . . . uh . . . afair? (c) Emma: Yes. (d) Robert: How long? (e) Emma: Some time. (f) Robert: But how long exactly? (g) Emma: Five years. (h) Robert: Five years?

[J. Ginzburg (2008), Semantics for conversation]

@ 2012, Pierre Lison - INF5820 course 48 fredag 31. august 2012 Non-sentential utterances • The meaning of non-sentential utterances is (practically always) context-dependent • Their meaning arises through the interaction itself • This can lead to ambiguities in the resolution: 1. A: When are they going to open the new main station? B: Tomorrow (short answer) 2. A: They are going to open the station today. B: Tomorrow (correction) 3. A: They are going to open the station tomorrow. B: Tomorrow (acknowledgement)

[R. Fernandez (2006), «Non sentential utterances in dialogue: classification, resolution and use», PhD thesis] @ 2012, Pierre Lison - INF5820 course 49 fredag 31. august 2012 Outline

• Introduction to phonetics • Spoken language syntax •Summary

@ 2012, Pierre Lison - INF5820 course 50 fredag 31. august 2012 Summary

• We introduced some basic concepts of phonetics: • Speech sounds are produced by a complex interaction of several organs (lungs, vocal folds, vocal tract) • The same vowel or consonant can be pronounced very differently depending on the context • We can perform acoustic analysis of the speech signal to determine properties such as pitch or loudness • Using spectral analysis, we can also distinguish different phones (looking at the formants on the spectrogram)

@ 2012, Pierre Lison - INF5820 course 51 fredag 31. august 2012 Summary

• We have also seen that the structure of spoken language can be very different from written language: • Presence of disfluencies (filled pauses, repairs, repetitions, corrections) • Non-sentential utterances are omnipresent, and they require contextual interpretation

@ 2012, Pierre Lison - INF5820 course 52 fredag 31. august 2012 This Friday

• On Friday, we’ll describe the software architectures for spoken dialogue systems • What are the main components of a dialogue system? • How are these components interconnected? • We’ll also introduce our focus domain: human-robot interaction

@ 2012, Pierre Lison - INF5820 course 53 fredag 31. august 2012