NGSLT 2006 – Björn Granström – Multimodal speech synthesis

Multimodal speech synthesis
(N)GSLT – Speech Technology course
Björn Granström
CTT, KTH – Kungliga Tekniska Högskolan

School of Computer Science and Communication
Department of Speech, Music and Hearing

Multimodal speech synthesis – (N)GSLT 2006 [ 1–2 ]

CTT – Centre for Speech Technology

Research areas
• Speech production
• Speech perception
• Communication aids
• Multimodal speech synthesis
• Speech understanding
• Speaker characteristics
• Interactive dialogue systems
• Affective computing
• Language learning

CTT – modes of operation
• Long-term research projects
• Participation by CTT personnel in international projects
• Research exchange with leading foreign groups
• Graduate-level research training
• Dissemination of competence through the exchange of researchers within Sweden and collaboration with other Swedish research groups
• Special information-dissemination efforts

CTT partners, phase 4 (2004–2006)
Babel-Infovox, EnglishTown, GN Resound, Hjälpmedelsinstitutet, HoneySoft, IcePeak, Dolphin Audio Publishing, LingTek, STTS, Phoneticom, Polycom Technologies, SaabTech, SpeechCraft, Sveriges Radio, Sveriges Television, TeliaSonera, TPB – Talboks- och punktskriftsbiblioteket, Vattenfall, VoiceProvider

The work presented in this lecture is the result of many researchers' efforts at the Department of Speech, Music and Hearing. Reports and further information can be found on our home page: www.speech.kth.se

Why Speech Tech?
[User–computer interaction]
• Chapanis, "Interactive Human Communication", Scientific American, 1975!
• "Computer", February 1998

Speech is fast and verbose

More efficient and convenient!
Average time for solving diverse problems using different combinations of means of communication

Speech technology is making money!
Classic example: AT&T and Lucent Technologies VRCP, the "Voice Recognition Call Processing" service
• Selection of payment method
• Vocabulary of only five words: collect, calling card, third number, person, operator
• More than 5,000,000 calls per day
• Earnings already (1999?) more than AT&T/Bell Labs' total investment in speech research

The first pen phone – 1 April 1998
Why no commercial success?
• Too small keys?
• Too small display!?

Speech synthesis developments at KTH

The KTH speech group – early days
[Timeline: 1953, 1961, 1982, 1993, 1998, 2002]
Gunnar Fant and OVE I, 1953

OVE I, in the WaveSurfer tool
• Interface is based around WaveSurfer, a general-purpose tool for speech and audio viewing, editing and labelling
• TTS and Talking Head functionality is added as plug-ins
• WaveSurfer (presently without TTS&TH) works on all common platforms and is freely available as open source
• Modules from Waves available – formants and F0 in present release – thanks to Microsoft and AT&T
http://www.speech.kth.se/wavesurfer

OVE II, 1958
[Photos: 1961, 1962]

KTH/TTS history
• 1967: digitally controlled OVE III
• 1974: rule-based system RULSYS – transformation rules
• 1979: mobile text-to-speech system – with rules for prosody – one speaker – used by a non-vocal child
• 1982: portable TTS (ICASSP, Paris) – multilingual – MC 68000, NEC 7720
• 1983: founding of Infovox Inc.

What do we mean by speech synthesis?
• Recorded speech
– words or phrases (telephone banking)
– fixed vocabulary – maintenance problems…
• Concatenative speech synthesis
– diphones or larger units (unit selection)
– LPC: source–filter model (too simple?)
– PSOLA/MBROLA/HNM
• Parametric synthesis
– formant synthesis – flexible
– but lower quality – today
• Multimodal synthesis

The synthesis space
Knowledge – Flexibility – Intelligibility – Naturalness – Processing needs – Bit rate – Vocabulary – Cost – Units – Complexity

TEXT vs SPEECH
• Parallel vs. sequential
• Permanent vs. disappearing
• Text as transcription of speech?
• Example – WaveSurfer demo:
– Palindrome
– "Var sak har två sidor" ("Every thing has two sides")

Phonology, Morphology vs. Phonetics

hov#mäst-ar-n ("the head waiter")
[Spectrogram: frequency (kHz) vs. time (ms)]

Non-phonetic writing

Signs vs. technology

How is a Chinese lexicon organized?

Is the sign what you think?

Text-to-speech (TTS)
"abcd" text → language identification → morphological analysis (lexicon and rules) → linguistic/syntax analysis → prosodic analysis (rules and lexicon) → phonetic description (rules and unit selection) → concatenation (rules) → sound generation

Synthesis methods
[Figure, labels translated from Swedish: articulators; shape of the vocal tract; tubes; lips; electrical network; source; resonances]

Source–filter theory
Time domain: source (glottal flow) → filter (shape of the vocal tract) → radiation (lips)
Frequency domain: the output spectrum is the product of the source, filter and radiation spectra

RULSYS rules – features
• Is an IPA synthesiser possible?
• Based on generative phonology
• Language-specific definitions
• Rules for contextual modifications
• Interactive rule manipulations
• Examples
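The source–filter chain can be sketched in a few lines of code. This is a minimal illustration, not the KTH implementation: an impulse-train glottal source passed through cascaded second-order resonators, one per formant. All function names, formant values and bandwidths below are invented for the example.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Second-order IIR resonator (one formant):
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2], unity gain at DC."""
    r = math.exp(-math.pi * bw_hz / fs)
    B = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    C = -r * r
    A = 1.0 - B - C
    return A, B, C

def filter_formant(signal, freq_hz, bw_hz, fs):
    """Run the signal through one formant resonator."""
    A, B, C = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = A * x + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0, formants, dur_s=0.1, fs=16000):
    """Impulse-train source cascaded through formant resonators."""
    n = int(dur_s * fs)
    period = int(fs / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    y = source
    for freq, bw in formants:
        y = filter_formant(y, freq, bw, fs)
    return y

# A rough /a/-like vowel at 110 Hz (illustrative formant values)
wave = synthesize_vowel(110, [(700, 80), (1100, 90), (2500, 120)])
```

Cascading the resonators multiplies their transfer functions, which is exactly the frequency-domain product described above.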


GLOVE

(a) Rule example: a → e / __ e # (graph tier: a e; phoneme tier: e; morph tier: root)
(b) [Synthesizer block diagram: voice source (F0, RK, RG, FA, DI), noise source (AH, AK), formant filters F1–F7 with bandwidths B1–B7, nasal branch (FN, BN, AN, N2, NZ), amplitude controls A0, NA, and cascade/parallel sections C1, C2]

"Pia odlar blå violer" ("Pia grows blue violets")

Evaluation of synthesis – VCV test
Consonant errors (percent):
• KTH text-to-speech, 1983: 41.7
• KTH text-to-speech, 1987: 26.1
• KTH text-to-speech, 1989: 12.8
• KTH text-to-speech, 1991: 8.7
• One human speaker: 5.6
(Original data: Goldstein & Till, 1992)
Carlson, R., Granström, B., and Karlsson, I. (1990): "Experiments with voice modeling in speech synthesis"

Speaker characteristics
• Speaker: dialect, sex, social background, education, age
• Situation: formality, style, inter-speaker relation
• Complex description – many dependent variables

Speaker characteristics – the TIDE/VAESS project (Voices, Attitudes and Emotions in Synthetic Speech)
• "Voice fitting"
• Software: GLOVE synthesis
• 10 user-controlled parameters
• Rule-specified connection to synthesis parameters


Emotional/ubiquitous computing – do we want it?
Early BBC vision – the conversational toaster
(Thanks to Mark Huckvale, UCL, for the video clip)

Emotions – natural vs. synthesis examples
• Neutral
• Happy
• Sad
• Angry

Result from listening test
[Chart: percent responses (neutral, angry, happy, sad) for the synthetic stimuli Happy, Sad and Angry]

Emotions intended
Happiness, Anger, Fear, Surprise, Disgust, Sadness

Voice source simulation
[Two-mass-style vocal fold model: translational and rotational movement (transverse + axial, 2D); mechanical model; geometric model with two elastic rods; adduction force and rest position; subglottal pressure Ps, supraglottal pressure Po; air flow through the glottis]

Parameters (translated from Swedish):
• Sampling interval: 1 parameter
• Vocal fold anatomy: 18 parameters, e.g. length, mass, damping factor
• Vocal fold articulation: 7 parameters, e.g. residual gap, lung pressure
• High-pressure articulation: 4 parameters, e.g. pitch, loudness
• Vocal tract articulation: 8 parameters, e.g. area, damping factor

Examples (translated):
• Original utterance
• Doubled vocal fold mass
• Vocal fold lengthening, 13 → 17 mm
• Asymmetry change, 1.0 → 1.8
• Asymmetry change, 1.0 → 2.0
• Residual gap, 0.1 → 1.0 mm


Predictable word accents
• Word accent and stress
– 'anden (bird)
– 'ànden (spirit)
– 'ànk-pas'ta, 'ànk-pas'tej, 'ànk-pas'tejen, 'ànk-pastejs-mål-tid
• Name pronunciation
– karl-'èrik (karl-'erik) + 'ànne-ma'rie
– Onomastica/EU
• Reduction
– be'tòng-väg(g)s-konstruk'tion (C-C Elert)

Letter-to-sound vs. lexicon
• Size vs. precision
• Maintenance – "half of the words are unique"
• Name pronunciation (Onomastica project)
• Misspellings
• Use rules, morphology, analogy… as fall-back
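The lexicon-first, rules-as-fallback strategy can be sketched as below. The tiny lexicon, its transcriptions and the rule list are invented stand-ins; real systems hold hundreds of thousands of entries and far richer context-dependent rules.

```python
# Hypothetical mini lexicon: orthography -> hand-checked transcription.
LEXICON = {
    "anden": "'an-den",
}

# Ordered letter-to-sound rules; longer graphemes are tried first.
LTS_RULES = [
    ("ck", "k"),
    ("a", "a"), ("n", "n"), ("d", "d"), ("e", "e"), ("r", "r"),
]

def transcribe(word):
    """Lexicon first; fall back to letter-to-sound rules for
    out-of-vocabulary words (names, misspellings, rare compounds)."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    out, i = [], 0
    while i < len(word):
        for graph, phon in LTS_RULES:
            if word.startswith(graph, i):
                out.append(phon)
                i += len(graph)
                break
        else:
            out.append(word[i])  # unknown letter: pass through
            i += 1
    return "-".join(out)
```

In-vocabulary words get the curated pronunciation; everything else goes through the rules, which is exactly the maintenance trade-off the slide points at.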

Research trends in speech synthesis
• 1950 – Synthesis by analysis: already Peterson et al. (1958)
• 1960 – Phonetic rules: Dixon and Maxey (1968)
• 1970 – Linguistic processing: "Diadic Units" (Olive, 1977)
• 1980 – Concatenation: "PSOLA", Charpentier and Stella (1986)
• 1990 – Automatic procedures: review, Möbius (2000)
• 2000 – Unit selection: concatenation cost; quality/size of database; prosody/speaking styles

Speak & Spell – Texas Instruments, Christmas 1978

Concatenative synthesis – signal manipulations
• Prosodic modifications
– possibility to modify F0
– possibility to lengthen or shorten segments
• Spectral modifications
– interpolation of spectrum at joints
• Early technique: LPC
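The "lengthen or shorten segments" operation can be sketched as frame repetition and deletion. This is an illustrative stand-in for proper pitch-synchronous processing, and the function name is mine.

```python
def change_duration(frames, factor):
    """Lengthen (factor > 1) or shorten (factor < 1) a recorded segment
    by repeating or dropping whole frames, preserving their order.
    A crude stand-in for pitch-synchronous methods such as PSOLA."""
    n_out = round(len(frames) * factor)
    return [frames[min(int(i / factor), len(frames) - 1)]
            for i in range(n_out)]
```

For example, `change_duration(seg, 2.0)` plays each frame twice; `change_duration(seg, 0.5)` keeps every second frame.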


PSOLA
• Minimal processing
• Pitch pulses moved in time to fit the F0 contour
• Conceptually simple and computationally efficient
– Need for precise pitch-pulse marking
– Could not handle spectral interpolation

Unit selection
• Large databases of recorded natural speech
• Annotation of the database – what information is needed?
• Synthesis defaults to a transcription-and-search problem
• Few cuts → maximally long units selected (but context and prosody must fit well)
• Target and concatenation costs
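Target plus concatenation costs turn unit selection into a shortest-path problem over the candidate units. A toy dynamic-programming search (all numbers, cost weights and names invented for illustration) might look like:

```python
def select_units(targets, inventory, target_cost, concat_cost):
    """Viterbi-style search: pick one candidate unit per position so
    that the summed target + concatenation cost is minimal."""
    prev = None
    for t, spec in enumerate(targets):
        cur = []
        for u in inventory[t]:
            tc = target_cost(spec, u)
            if t == 0:
                cur.append((tc, [u]))
            else:
                cost, path = min(
                    ((c + concat_cost(p[-1], u), p) for c, p in prev),
                    key=lambda x: x[0])
                cur.append((cost + tc, path + [u]))
        prev = cur
    return min(prev, key=lambda x: x[0])

# Toy example: units described only by their pitch in Hz.
targets = [100, 120, 110]                        # desired pitch per position
inventory = [[90, 105], [118, 140], [100, 111]]  # candidates per position
tcost = lambda spec, u: abs(spec - u)            # mismatch to the target
ccost = lambda a, b: 0.1 * abs(a - b)            # discontinuity at the join
cost, units = select_units(targets, inventory, tcost, ccost)
```

The concatenation term is what rewards long contiguous stretches from the database: consecutive units cut from the same recording join with near-zero cost, giving the "few cuts" behaviour above.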

Synthesis methods
• Unit selection – minimal processing
– CHATR (ATR), weather (CSTR) – good, less good
• Diphone synthesis
– Swedish (MBROLA), French (ICP, CNET)
• Formant synthesis
– Swedish (KTH), French (KTH)

Unit selection – BrightSpeech
• Swedish
• Norwegian

To make it work today: hybrid systems
E.g. "Who has number nn…?" – key input; synthesis mixed with recorded speech (call +46 118 999)

Examples of synthesized speech
Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung:
[German] [English] [French] [Dutch] [Spanish] [Italian] [Portuguese] [Swedish] [Norwegian] [Finnish] [Estonian] [Icelandic] [Czech] [Russian] [Greek] [Croatian] [Romanian] [Japanese] [Chinese] [Korean] [Hebrew] [Arabic]
http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html

Also, e.g.:
http://www.naturalvoices.att.com/
http://www.acapela-group.com/demos/demos.asp
http://www.nextup.com/


Hybrid methods, cont.
• Rolf Carlson, Tor Sigvardson and Arvid Sjölander (2002): Data-driven formant synthesis. Proc. of Fonetik 2002, TMH-QPSR
• David Öhlin and Rolf Carlson: Data-driven formant synthesis. Proc. of Fonetik 2004
• + four MSc theses (Sigvardson, Sjölander, Vinet, Öhlin)
• All available on www.speech.kth.se

Aim
• Keep the flexibility of formant synthesis
• More natural-sounding than rule-driven synthesis
• Speaker adaptation

Rule-driven formant synthesis
• Parameters are generated by rule (RULSYS, Carlson et al.)
• Formant values are generated by interpolating between target frequencies
• Parameters are fed to a synthesizer (GLOVE, Carlson et al.)
[Spectrogram: "M O B I: L sil", 0–4000 Hz, 0–0.9 s]

Data-driven formant synthesis
• Some parameters (namely, the first four formants) are replaced by data
• Same synthesizer
[Spectrogram: "M O B I: L sil", 0–4000 Hz, 0–0.9 s]

Data-driven formant synthesis
• Formants are replaced through unit selection from a formant diphone library
• Formant trajectories are scaled and interpolated to fit the rule-generated durations
[Spectrogram: "M O B I: L sil", 0–4000 Hz, 0–0.9 s]

Synthesis comparison
[Spectrograms: data-driven vs. rule-driven]


Cost function
• Designed to promote probable formant candidates
• Penalizes:
– large bandwidths
– large frequency deviations (given the current phoneme)
– large frequency jumps (promotes smooth trajectories)

Voiceless consonants
• Replace the voiceless fricatives and plosives with recorded versions
• Voiceless fricatives in Standard Swedish: /f/, /s/, /sj/, /tj/ and /rs/
• Voiceless plosives: /k/, /p/, /t/, /rt/
• (/h/ is excluded)
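The three penalties can be combined into one weighted cost per formant candidate. The weights below are invented for illustration, not the published ones.

```python
def formant_cost(cand_freq, cand_bw, target_freq, prev_freq,
                 w_bw=0.5, w_dev=1.0, w_jump=2.0):
    """Penalty for one formant candidate: large bandwidth, large
    deviation from the phoneme's expected frequency, and a large jump
    from the previous frame (which promotes smooth trajectories).
    Weights are illustrative."""
    cost = w_bw * cand_bw
    cost += w_dev * abs(cand_freq - target_freq)
    if prev_freq is not None:
        cost += w_jump * abs(cand_freq - prev_freq)
    return cost

def pick_formant(candidates, target_freq, prev_freq):
    """candidates: (freq_hz, bandwidth_hz) pairs measured in one frame."""
    return min(candidates,
               key=lambda c: formant_cost(c[0], c[1], target_freq, prev_freq))
```

With a 700 Hz target and the previous frame at 640 Hz, a narrow candidate near 650 Hz beats a broad one at 900 Hz, as intended.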

Text-to-speech synthesis
[System overview: input text → text-to-parameter (rule-generated parameters) → synthesizer → output sound; in parallel, database → extracted parameters → unit library → unit selection → concatenation, replacing rule-generated parameters before synthesis]

Listening test evaluation 1
• 15 subjects, 20 sentences, continuous scale
• Data-driven synthesis with non-corrected formant data was judged more natural-sounding than rule-driven synthesis

Sound samples
• Strindberg: rule-driven
• Strindberg: data-driven
• Strindberg: MBROLA
• "Pytteliten": rule-driven
• "Pytteliten": data-driven

Listening test evaluation 2
• 12 subjects, 10 sentences, binary scale
• Data-driven synthesis with manually corrected formant data was preferred over rule-driven synthesis in 73% of the cases


Child-directed speech synthesis
• Increase prosodic variation in synthesis
• How do children between the ages of 9 and 11 react to:
– default, F0, duration?
– diphone synthesis vs. formant synthesis?

Mean ratings
[Chart: ratings 1–5 of "Natural" and "Fun" for diphone vs. formant synthesis, children vs. adults]

Articulatory synthesis (/usu/)

Articulatory parameters
• Jaw opening
• Lip rounding
• Lip protrusion
• Tongue position
• Tongue height
• Tongue tip
• Velum
• Hyoid

Why a 3D model?
• 2D articulatory models, e.g. Mermelstein, Maeda
• Imposing a third dimension: area = a · (width)^b
• A 3D model: direct calculation – laterals etc.

Articulatory synthesis – potential use
• Articulatory synthesis
– calculations directly from cross-sectional areas
– fluid-dynamics calculations
• Visual synthesis
– articulation training
• Demonstrations and research


From areas to formants

A transfer function is determined from the cross-sectional areas
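For the simplest area function, a uniform tube closed at the glottis and open at the lips, the transfer function's poles fall at the quarter-wavelength resonances F_k = (2k − 1)·c / 4L, which a two-line sketch can compute (the function name is mine):

```python
def uniform_tube_formants(length_m, n=3, c=350.0):
    """Resonances of a uniform tube closed at one end (glottis) and
    open at the other (lips): F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

# A 17.5 cm vocal tract gives the classic neutral-vowel formants,
# approximately 500, 1500 and 2500 Hz.
print(uniform_tube_formants(0.175))
```

Non-uniform area functions need the full chain-matrix (or reflection-line) computation, but the uniform tube already shows how tract length sets the formant scale.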


Ranges of parameter activations
Jaw height – tongue body – tongue dorsum

Multimodal speech synthesis

Talking heads – a new paradigm for human-computer interaction
• Shift from the desktop metaphor to a person metaphor
• Spoken dialogue as well as non-verbal communication (listening & thinking)
• Take advantage of the user's social skills
• Strive for believability, but not necessarily realism

Applications
• Improved speech synthesis
• Human-computer interface in spoken dialogue systems
• Aid for the hearing-impaired
• Educational software
• Stimuli for perceptual experiments
• Entertainment: games, virtual reality, movies etc.


Conventions? – use the same as for person-to-person communication
(SONY SDR)

Tasks of an animated agent
• Provide intelligible synthetic speech
• Indicate emphasis and focus in utterances
• Support turn-taking
• Give spatial references (gaze, pointing etc.)
• Provide non-verbal back-channelling
• Indicate the system's internal state

Animated character – architecture
Requests → command dispatcher → multi-modal text-to-speech synthesiser (speech waveform, facial parameters), face & body parameters, motion scripts → speech playback & animation

Parameters used for articulatory control of the face
• Jaw rotation
• Upper lip raise
• Lip rounding
• Lower lip depression
• Lip protrusion
• Apex
• Tongue length
• Mouth width
• Bilabial closure
• Labiodental closure
• + more for prosody, attitude, emotions, turn-taking, back-channelling, pointing…

The WaveSurfer tool – demo
• Interface is based around WaveSurfer, a general-purpose tool for speech and audio viewing, editing and labelling
• TTS and Talking Head functionality is added as plug-ins
• WaveSurfer (presently without TTS&TH) works on all common platforms and is freely available as open source
http://www.speech.kth.se/wavesurfer


How to obtain data?
Qualisys recordings in Linköping

Combining model and data
Re-synthesis using speech movements recorded with Qualisys

Vision from audio
Mimic original expressions from audio
(Fónagy, 1967, "Hörbare Mimik" ["Audible facial expression"], Phonetica)

EU project PF-STAR
1. Technologies for speech-to-speech translation
2. Detection of emotional states
3. Core speech technologies for children
• Start October 2002, duration 2 years
• ITC-IRST (Trento) coordinates + 3 × Germany + Italy + UK + Sweden
http://pfstar.itc.it/

Measurement points for lip coarticulation analysis
• Vertical distance
• Lateral distance


Interactions: emotion and articulation
(from the AV speech database – EU/PF_STAR project)
• All vowels (sentences)
– Encouraging
– Happy
– Angry
– Sad
– Neutral

The expressive mouth
["left mouth corner"]

Combining motion capture techniques
EMA & Qualisys

Examples (automatic) – example of resynthesis
• Happy
• Angry
• Surprised
• Sad

Collection of audio-visual databases: interactive spontaneous dialogues
• Eliciting technique: information-seeking scenario
• Focus on the speaker who has the role of information giver
• The speaker sits facing 4 infrared cameras, a digital video camera and a microphone; the other person is only video-recorded

Recording and model


Eyebrow vs. intonation – conversation with agent
1. No eyebrow motion
2. Eyebrow motion controlled by the fundamental frequency of the voice
3. Eyebrow motion at focal accents
4. Eyebrow motion at the first focal accent

"Jag heter Axel, inte Axell" (translation: "My name is Axel, not Axell"). In Sweden, Axel is a first name, as opposed to Axell, which is a family name.

Experiment
• Speech material
– "När pappa fiskar stör, piper Putte" (When dad is fishing sturgeon, Putte is whimpering)
– "När pappa fiskar, stör Piper Putte" (When dad is fishing, Piper disturbs Putte)
• 6 versions: 1 static, 5 with eyebrow raising on successive content words
• 20 stimuli (6 × 3) plus first and last
• Subjects: 21 students at KTH – 14 native Swedish, 7 non-Swedish

Eyebrow movement
• Hand-edited with a synthesis parameter editor
• 500 ms in total:
– 100 ms dynamic rise
– 200 ms static raised
– 200 ms dynamic lowering
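The 100/200/200 ms rise–hold–fall pattern is straightforward to generate as a piecewise-linear parameter track. The frame rate, amplitude scale and function name below are my own choices for the sketch.

```python
def eyebrow_track(onset_ms, rise_ms=100, hold_ms=200, fall_ms=200,
                  amplitude=1.0, frame_ms=10, total_ms=1000):
    """Piecewise-linear eyebrow-raise parameter track: linear rise,
    static hold, linear fall (the 100/200/200 ms pattern above),
    sampled every frame_ms milliseconds."""
    track = []
    for t in range(0, total_ms, frame_ms):
        dt = t - onset_ms
        if dt < 0 or dt >= rise_ms + hold_ms + fall_ms:
            v = 0.0                                   # before / after the gesture
        elif dt < rise_ms:
            v = amplitude * dt / rise_ms              # dynamic rise
        elif dt < rise_ms + hold_ms:
            v = amplitude                             # static raised
        else:
            v = amplitude * (1 - (dt - rise_ms - hold_ms) / fall_ms)  # lowering
        track.append(v)
    return track
```

Aligning `onset_ms` with the onset of a chosen content word reproduces the stimuli design: one such gesture per version, moved across successive words.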

Two examples
• No eyebrow movement (neutral)
• Eyebrow movement


Prominence increase due to eyebrow movement
[Chart: influence on judged prominence by eyebrow movement – % prominence due to eyebrow movement, for Swedish, foreign and all listeners]
Granström, House & Swerts (2002)

Feedback experiment
• Mini-dialogues (two turns)
• Travel-agent application
• Both visual and acoustic feedback cues
• Affirmative cues – the agent understands/accepts the request
• Negative cues – the agent is unsure about the request (seeks confirmation)
• Six cues hypothesised

Cue strength – demonstration
Human: "Jag vill åka från … till Linköping" ("I want to go from … to Linköping")
Agent: "Linköping"

Parameter settings to create different stimuli (affirmative setting / negative setting):
• Smile: head smiles / head has neutral expression
• Head movement: head nods / head leans back
• Eyebrows: eyebrows rise / eyebrows frown
• Eye closure: eyes close a bit / eyes open widely
• F0 contour: declarative intonation / interrogative intonation
• Delay: immediate reply / slow reply

Affirmative cues tested: none; all; smile (only); F0 contour (only); no brow frown (only); head movement (nod) (only); eye closure (only); delay (only)

Cue strength
[Chart: average response value per cue – smile, eyebrows, head movement, eye closure, F0 contour, delay]

Conclusions
• Eyebrow movement can be an independent cue to prominence
• Non-native Swedish listeners rely more on the visual cues
• Interaction:
– visual and acoustic cues
– visual cues and prominence expectation
• Further work on interaction: prominence, mood and attitude (demo)


Different characters

Examples of the use of eyebrow and head motion (from the August dialogue system)
Translation: "Symmetrical works of art easily become dull, just like symmetrical beauties; impeccable or flawless people are often unbearable." (Strindberg, 1907)

Talking heads at the Science Museum, Stockholm (exhibition: Fritt Fram)

Speech synthesis applications
• Dialogue systems (at the next NGSLT at KTH)
• Translators
• Aids for the disabled
• Mediated communication
• New HMI
• Language learning
• …

Targeted audio & talking head for personal announcements (EU/CHIL project)

Dialogue systems at KTH


Directionality

Speech technology for visually impaired persons
• First synthesis application
• Intelligibility vs. naturalness
• Screen reader vs. GUI
• Talking books and newspapers
• "Design for all" or special demands
– e.g. rapid speech, 500 wpm

Aid for the speech-impaired – the EU/VAESS project
VAESS – Voices, Attitudes and Emotions in Speech Synthesis
The VAESS text-to-speech converter enables speech-impaired persons to communicate by individualised speech
• Danish, British English, Spanish and Swedish speech synthesis
• Development of new voices
• Development of different speaking styles
• Uses an extension of the KTH/Infovox speech synthesizer

Synthesis software development
• Improved speech synthesis for disabled and elderly people
• Experiments with new synthesis strategies for emotive speech
• Based on analysis of human speech databases

User-controlled "voice fitting"
• Direct access to selected voices
• Individual settings, easy to use
• Phonetic rules use slide buttons as inputs
• Synthesizer implemented with great flexibility
• Examples of possible adjustments:
– vocal tract size
– voice source characteristics
– pitch dynamics
– degree of clear or reduced speech

System architecture
Text → pronunciation rules + lexicon → sentence-level and articulation rules → speech production model (with voice control) → speech


Speech synthesis for non-vocal persons
Cameleon CV – a speech prosthesis (EU/VAESS project)

Speech synthesis for hard-of-hearing persons
The Teleface application

"The TELEFACE project" (simulated)
Multi-modal speech communication for the hearing-impaired

Continues in the EU project SYNFACE – coordinated by KTH – aiming at a real-time demonstrator
http://www.speech.kth.se/synface/

Instruction video

Formal intelligibility test
• Material: VCV (symmetric vowel context)
– 2 vowels: /ʊ, a/
– 17 consonants: /p, b, m, f, v, t, d, n, s, l, r, k, g, ŋ, ɧ, ç, j/
• Task: consonant identification
• Synthetic face with human speech
• Hard-of-hearing subjects (or KTH students)
• Additive white noise, –3 dB SNR (if normal hearing)


Results for VCV words (hearing-impaired subjects)
[Chart: % correct aCa for audio alone, audio + synthetic face, and audio + natural face; natural voice]

Better than humans?
[Confusion matrices (percent) over place of articulation (bilabial, labiodental, dental, palatal, velar) for the synthetic face + rule synthesis vs. the natural face; e.g. bilabials were identified correctly 100% of the time with the synthetic face and 96.3% with the natural face]

Possible improvements to "lip readability"
• Great variation among human speakers due to, for example:
– speaking rate
– extent of articulatory movements (the hypo–hyper dimension)
– anatomy, facial hair…
– light, distance, viewing angle…

Hypo- to hyper-articulation – pilot study stimuli
25%, 50%, 75%, 100%, 125%, 150%, 175%, 200%

The Synface telephone

Real user tests

Jonas Beskow – recipient of the Chester Carlson research prize, 2 February 2006

Talking faces for speech-reading support in TV

Text type – man or machine?

Language learning
• Oral proficiency training
• Possible display of internal articulations
• Exploiting the hyper/hypo dimension
• Training in a dialogue context
• Always-available conversational partner
• Untiring model of pronunciation – everything from phonemes to prosody

Automatic tutor simulation
• No gestures
• Some gestures

Different representations


Speech training using intonation models

Student speech → speech database; teacher pronunciation → speech synthesis → model pronunciation
Speech modifications: F0 extraction, DTW (dynamic time warping), PSOLA

Auditory and visual displays:
• Student pronunciation
• Model pronunciation
• Modified student presentation
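The DTW step aligns the student's F0 contour with the teacher's before PSOLA transplants the model prosody. A textbook dynamic-programming implementation (sequence values and the distance function are illustrative):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping: minimal cumulative alignment cost between
    two sequences (e.g. student vs. teacher F0 contours), allowing
    stretching and compression of either sequence."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because the warp absorbs timing differences, a student who says the same contour more slowly aligns at zero cost; only genuine pitch deviations contribute.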

Demo of prototype
Pohlman – weatherman from the south
"Sen drar hela det här moln- och regnområdet i alla fall vidare österut" (~"then, this whole cloud and rain system moves eastward")
1. Original recording
2. Stockholm
3. South Swedish
4. Synthesis
"Teacher" (sound only) – original – modified

Articulatory training
• Stylized (Reiko Yamada, ATR, 1999)
• Program "Fonem" – Johan Liljencrants


ARTUR: the articulation tutor – a new national project

What? Automatic articulatory feedback using face and vocal tract models ("Hally Pottel")
For whom? Hearing-impaired children, second-language learners, speech therapy patients
How? Contrasting the user's articulation with a correct one

[System diagram: video/image → computer vision → relation between face and vocal tract display → articulatory inversion → mispronunciation detection → feedback display → VT model]

3D tongue movements
• [s], [k], vowels
• The tongue movements can be presented from different views
• "OK" 3D tongue movements (difficult to define an objective evaluation criterion – suggestions?)

CTT Virtual Language Tutor
• Practice dialogues
• Correct your pronunciation
• Keep track of your improvements
• Tailor lessons based on your interaction

CTT Virtual Language Tutor – different types of users
• Swedish children learning English
• Adult immigrants learning Swedish
• Adult Swedes wanting to improve aspects of English (e.g. corporate English, technical English)
• Native Swedes with language disabilities wanting to improve their Swedish
• Economy of language learning
• Could a virtual tutor help?

Discriminate acceptable from unwanted variation
• How to do it (automatically)?
• What are the aims of L2 learning?
– Less accentedness
– Comprehensibility
– Intelligibility
– Acceptability in context
– More?


The virtual teacher vision – already at STiLL, Marholmen, 1998!
ISCA STiLL/InSTIL SIG revival?
(video)

The cost of pronunciation errors

Unacceptable pronunciation needs to be identified
(from a Vinnova video on CTT's web pages)

First tried at the VISPP summer school in Palmse, August 2005 (for Estonian)

Using VILLE – the CTT Virtual Language Tutor
Nordic project – NordPlus Sprog

Further reading (MIT Press, the HAL book):
Chapter 6, "The Talking Computer": Text-to-Speech Synthesis – Joseph P. Olive
Chapter 7, When Will HAL Understand What We Are Saying? Computer Speech Recognition and Understanding – Raymond Kurzweil
http://mitpress.mit.edu/e-books/Hal/


Beyond 2001 – ready for the Odyssey?