Multimodal Speech Synthesis


Multimodal speech synthesis
(N)GSLT - Speech Technology course, NGSLT 2006
Björn Granström
CTT, KTH - Kungliga Tekniska Högskolan
School for Computer Science and Communication, Department of Speech, Music and Hearing

CTT - Centre for Speech Technology

Research areas
• Speech production
• Speech perception
• Communication aids
• Multimodal speech synthesis
• Speech understanding
• Speaker characteristics
• Interactive dialog systems
• Affective computing
• Language learning

CTT - modes of operation
• Long-term research projects
• Participation by CTT personnel in international projects
• Research exchange with leading foreign groups
• Graduate-level research training
• Dissemination of competence through the exchange of researchers within Sweden and collaboration with other Swedish research groups
• Special information dissemination efforts

CTT partners, phase 4 (2004-2006)
Babel-Infovox, Dolphin Audio Publishing, EnglishTown, GN Resound, Hjälpmedelsinstitutet, HoneySoft, IcePeak, LingTek, Phoneticom, Polycom Technologies, SaabTech, SpeechCraft, STTS, Sveriges Radio, Sveriges Television, TeliaSonera, TPB - Talboks- och punktskriftsbiblioteket, Vattenfall, VoiceProvider

The work presented in this lecture is the result of many researchers' efforts at the Department of Speech, Music and Hearing. Reports and further information can be found on our home page, www.speech.kth.se.

Why Speech Tech?
"Interactive Human Communication", Chapanis, Scientific American, 1975!
"Computer", February 1998

Speech is fast and verbose - more efficient and convenient!
(Figure: average time for solving diverse problems using different combinations of means of communication.)

Speech technology is making money!
Classic example: the AT&T and Lucent Technologies VRCP ("Voice Recognition Call Processing") service
• Selection of payment method
• Vocabulary of only five words: collect, calling card, third number, person, operator
• More than 5,000,000 calls per day
• Earnings already (1999?) more than AT&T/Bell Labs' total investment in speech research

The first pen phone (1 April 1998)
Why no commercial success?
• Too small keys?
• Too small display!?

Speech synthesis developments at KTH
The KTH speech group, early days: Gunnar Fant and OVE I, 1953. Milestones: 1953, 1961, 1982, 1993, 1998, 2002.

OVE I in the WaveSurfer tool; OVE II, 1958
• The interface is based on WaveSurfer, a general-purpose tool for speech and audio viewing, editing and labelling
• TTS and Talking Head functionality is added as plug-ins
• WaveSurfer (presently without TTS & TH) works on all common platforms and is freely available as open source
• Modules from Waves available - formants and F0 in the present release - thanks to Microsoft and AT&T
http://www.speech.kth.se/wavesurfer
KTH/TTS history
• 1967: digitally controlled OVE III
• 1974: rule-based system RULSYS - transformation rules
• 1979: mobile text-to-speech system - one speaker - used by a non-vocal child
• 1982: portable TTS (ICASSP, Paris) - multilingual - MC 68000, NEC 7720
• 1983: founding of Infovox Inc.

What do we mean by speech synthesis?
• Recorded speech
 – Words or phrases (telephone banking)
 – Fixed vocabulary - maintenance problems…
• Concatenative speech synthesis
 – Diphones or larger units (unit selection)
 – LPC: source-filter model (too simple?)
 – PSOLA/MBROLA/HNM - and rules for prosody
• Parametric synthesis
 – Formant synthesis
 – Articulatory synthesis
 – Flexible, but lower quality - today
• Multimodal synthesis

The synthesis space
Trade-offs along many dimensions: knowledge, flexibility, intelligibility, naturalness, processing needs, bit rate, vocabulary, cost, units, complexity.

TEXT vs SPEECH
• Parallel vs. sequential
• Permanent vs. disappearing
• Text as transcription of speech?
• Example - WaveSurfer demo: palindrome, "Var sak har två sidor" ("Every thing has two sides")

Phonology, morphology, phonetics: hov#mäst-ar-n; "två" (spectrogram, kHz/ms).

Non-phonetic writing; signs vs. technology. How is a Chinese lexicon organized? Is the sign what you think?
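Of the methods listed above, the concatenative idea is the simplest to sketch: store recorded units (diphones or larger), then join them with a short crossfade to hide discontinuities. A toy Python illustration - the sine-wave "units" and the crossfade length are invented for the example, and real systems additionally apply prosody modification (e.g. PSOLA) at the joins:

```python
import numpy as np

def crossfade_concat(units, fade=64):
    """Concatenate waveform units, linearly crossfading `fade` samples
    at every join to reduce discontinuities."""
    ramp = np.linspace(0.0, 1.0, fade)
    out = units[0].astype(float)
    for u in units[1:]:
        u = u.astype(float)
        out[-fade:] = out[-fade:] * (1 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out

# Two stand-in "units": 50 ms tones at 16 kHz (a real system would
# instead select recorded diphones from a unit database).
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
u1 = np.sin(2 * np.pi * 100 * t)
u2 = np.sin(2 * np.pi * 120 * t)
y = crossfade_concat([u1, u2])
```

Unit selection then becomes a search problem - choosing, among many candidate units, the sequence that minimizes join and target costs - while the join itself stays as simple as above.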
Text-to-speech (TTS) - synthesis methods
Pipeline: text ("abcd") → language identification → morphological analysis (lexicon and rules) → syntax analysis → linguistic analysis → prosodic analysis → phonetic description (rules and lexicon; rules and unit selection) → concatenation/rules → sound generation.
(Figure: articulatory synthesis - articulators → shape of the vocal tract → tubes, lips; formant synthesis - an electrical network, source → resonances.)

Source-filter theory
Source (glottal flow) → filter (shape of the vocal tract) → radiation (lips), shown in both the time and the frequency domain.

RULSYS rules - features
• Is an IPA synthesiser possible?
• Based on generative phonology
• Language-specific definitions
• Rules for contextual modifications
• Examples
• Interactive rule manipulations

Glove
(Figure: the Glove formant synthesizer - a voice source and a noise source feeding formant filters F1-F7 with bandwidths B1-B7; rule notation example: a → e / __ <cons> e #.)

"Pia odlar blå violer"

Evaluation of synthesis - VCV test
Consonant errors in percent (chart after Goldstein & Till, 1992):
• Synthesis 1983: 41.7
• Synthesis 1987: 26.1
• Synthesis 1989: 12.8
• Synthesis 1991: 8.7
• One human speaker: 5.6
Carlson, R., Granström, B., and Karlsson, I. (1990): "Experiments with voice modeling in speech synthesis" (the KTH text-to-speech system).

Speaker characteristics
• Speaker: dialect, sex, social background, education, age
• Situation: formality, style, interspeaker relation
• A complex description with many dependent variables

Speaker characteristics
• TIDE/VAESS project (Voices, Attitudes and Emotions in Synthetic Speech)
• "Voice fitting"
• Software: Glove synthesis
• 10 user-controlled parameters
• Rule-specified connection to the synthesis parameters

Emotional/ubiquitous computing - do we want it?
Early BBC vision - the conversational toaster. Thanks to Mark Huckvale, UCL, for the video clip.

Emotions
Natural and synthesized examples: Neutral, Happy, Sad, Angry.

Result from listening test
(Chart: percent response - neutral, angry, happy, sad - to synthetic stimuli intended as Happy, Sad and Angry.)
Emotion set: Happiness, Anger, Fear, Surprise, Disgust, Sadness.

Voice source simulation
• Sampling interval: 1 parameter
• Vocal fold anatomy: 18 parameters, e.g. length, weight, damping factor
• Vocal fold articulation: 7 parameters, e.g. rest gap, lung pressure
• High-pressure articulation: 4 parameters, e.g. pitch, loudness
• Vocal tract articulation: 8 parameters, e.g. area, damping factor
(Figure: vocal fold model - adduction force, rest position, translational and rotational movement, transverse + axial, pharynx and trachea ends.)
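Source-filter theory translates almost line for line into code: a glottal source filtered through a cascade of second-order formant resonators. The sketch below uses a Rosenberg-style pulse train as the source - a classic simple approximation, not the mechanical vocal fold simulation above, and not the Glove implementation. All constants (sample rate, F0, pulse timing, formant values for a rough /a/) are invented for illustration:

```python
import numpy as np

def rosenberg_pulse(n_rise, n_fall, n_total):
    """One glottal flow cycle: smooth opening, faster closing, closed phase."""
    rise = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    fall = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
    closed = np.zeros(n_total - n_rise - n_fall)
    return np.concatenate([rise, fall, closed])

def resonator(x, freq, bw, sr):
    """Second-order IIR resonator: one formant at `freq` with bandwidth `bw`."""
    r = np.exp(-np.pi * bw / sr)
    b = 2 * r * np.cos(2 * np.pi * freq / sr)
    c = -r * r
    a = 1 - b - c                     # unity gain at DC
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = a * x[n]
        if n >= 1:
            y[n] += b * y[n - 1]
        if n >= 2:
            y[n] += c * y[n - 2]
    return y

sr = 16000
period = sr // 110                    # ~110 Hz fundamental
source = np.tile(rosenberg_pulse(period // 2, period // 4, period), 40)
vowel = source
for freq, bw in [(730, 60), (1090, 90), (2440, 120)]:  # rough /a/ formants
    vowel = resonator(vowel, freq, bw, sr)             # filter: vocal tract
```

Radiation at the lips is often approximated by a first difference of the output. The appeal of this parametric view, as the slides note, is flexibility: F0, formant frequencies and source shape can be varied independently by rule.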
Voice source simulation - examples
(Figure: mechanical model; geometric model with two elastic rods; air flow through the glottis.)
• Original utterance
• Doubled vocal fold mass
• Lengthening of the vocal folds, 13 -> 17 mm
• Asymmetry change, 1.0 -> 1.8
• Asymmetry change, 1.0 -> 2.0
• Rest gap, 0.1 -> 1.0 mm

Predictable word accents
• Word accent and stress
 – 'anden (the bird)
 – 'ànden (the spirit)
 – 'ànk-pasta
 – 'ànk-pastej

Letter-to-sound vs. lexicon
• Size vs. precision
• Maintenance - "half of the words
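The size-versus-precision trade-off is usually resolved by combining both: look the word up in an exception lexicon first, and fall back to letter-to-sound rules for everything else. A toy sketch - the lexicon entry, the rule set and the transcription notation are all invented for illustration:

```python
# Exception lexicon: precise, but every entry must be maintained by hand.
LEXICON = {
    "anden": "'an-den",   # invented transcription of a lexical exception
}

# Letter-to-sound rules, longest grapheme first so greedy matching works.
LTS_RULES = [("sch", "S"), ("ng", "N"), ("a", "a"), ("d", "d"),
             ("e", "e"), ("n", "n")]

def to_phonemes(word):
    """Lexicon lookup with rule-based fallback (greedy longest match)."""
    if word in LEXICON:
        return LEXICON[word]
    out, i = [], 0
    while i < len(word):
        for grapheme, phoneme in LTS_RULES:
            if word.startswith(grapheme, i):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:                  # no rule matched: pass the letter through
            out.append(word[i])
            i += 1
    return "".join(out)

print(to_phonemes("anden"))    # lexicon hit
print(to_phonemes("ande"))     # rule fallback
```

The maintenance point on the slide falls out directly: distinctions the rules cannot predict from spelling alone - such as the 'anden/'ànden word-accent pair - have to live in the lexicon, and that lexicon then has to be kept up to date.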