Speech Technology
Total Page:16
File Type:pdf, Size:1020Kb
Specialization Module Speech Technology Timo Baumann [email protected] Universität Hamburg, Department of Informatics Natural Language Systems Group not „just“ Text-to-Speech Synthesis Synthesis examples ● first singing !digital) computer (IB$% 1961) hand-tuned vocoding ● extension of the same techni*ue today+ espeak → rule-based vocoding system ● based on natural speech: DreSS-F.% $brola → diphone-synthesis ● a more modern system: MaryTTS → general concatenati)e speech synthesis ● smaller memory footprint of the abo)e → HM$-based speech synthesis (to be co)ered in 2 weeks" → Input and Output for Spoken Dialogue Systems ● .ecognition ● Synthesis – .eduction of the signal to 1ords themsel)es only 1ords insu3ciently describe the signal ➔ abstraction from details ➔ naturalness only 1ith addition of details 2 s o rd rd o s 2 Sound Sound Speech Recognition Speech Synthesis 1hat is missing in 1ritten language4 Written vs. Spoken Language Timo's list ● 5bbreviations% dates% numbers% currencies% … ● /omographs+ Bass ● Text does not ha)e any melody or rhythm! – prosody is important to con)ey meaning – 8unctuation only partially helpful Homographs "#aɪs$ "#%s$ Bass information structure Information Structure The linguistic means of structuring information, in order to optimize information transfer within discourse – Topic / Focus – :i)en / New information ● not directly con)eyed in textual representation – but to a certain degree by prosody ● to reconstruct the structure% listeners also use – context of the utterance in the whole con)ersation – 1orld kno1ledge Sonderforschungsbereich 0<<=-0<&>+ http+99111.s?(=0.uni-potsdam.de &rosody supra-segmental properties of speech ● phenomena+ – pitch (i.e.% melody / fundamental fre*uency) – loudness / intensity – duration, pauses ● phonetically+ accentuation and phrasing ● phonologically+ (word)stress% intonation, juncture &rosody' &honology ( &honetics ( &henomena )ocus and *ccentuation )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy )ocus and *ccentuation ● “# didn@t say 1e should kill him.A – someone else said we should kill him – # am denying that # said 1e should kill him – # 1rote it do1n or implied it% but # didn't say it – # said someone else should do the job – # said that 1e absolutely must kill him – getting him a little ner)ous 1ould ha)e been enough – 1e got the wrong guy Information Structure ● information structure is an acti)e area of research+ – unkno1n ho1 exactly to represent IS !cross-linguistically% cross-genre% in dialogue% 6" – unkno1n ho1 (exactly" IS inBuences speech ● problem of premature implementation+ can we really expect a computer to successfully perform speech synthesis even before the basic research has been done? What a computer can do ● problems that are well understood+ – nd solutions based on a model – use lists of exceptions if model is faulty ● problems that are some1hat understood+ – use heuristics to get details right – try to a)oid taking a stand ● problems that aren't yet understood+ – re*uire additional instructions in the input – guess What a computer can do: focus ● human listeners are predicti)e !and forgiving): – it's 1orse to be )ery wrong occasionally than to say e)erything a little bit wrongly – human listeners 1ill select the correct interpretation !using their 1orld kno1ledge) from a)ailable options ● solution: – put a small accentuation on all possible focus points ● ho1e)er – system does not take a stand% it sounds indiCerent, bored &rocess diagram of Speech Synthesis ✓ pro#lem' text is missing detail &rocess diagram of Speech Synthesis ✓ pro#lem' text is missing detail phones D their durations D pitch cur)e !support )alues" 1a)eform synthesis Waveform Synthesis from the target se*uence !phonesDduration+pitch) 1.formant-based+ rules to determine target formants and other parts of the signal rules to determine transitions 2.pattern-based+ database of many short speech segments segments are concatenated one aEer the other 3.model-based approach in 2 weeks Diphone Synthesis ● Foncatenation of short speech snippets ● units from center of a phone to center of the next+ Gh+ha+Da:lDlo+Do:GDG)Dvi+Di:g+ge+De:t+tsDsG – concatenation 1ithin “stableA phase of the phone – coarticulation is (largely) co)ered ● 40 phones ~1600 diphones7 – recorded from one speaker → one voice → – additional signal processing for durationDpitch change +eneral ,oncatenative Synthesis ● alternati)es for the mapping target speech snippets – more speech material in database → – selection of material that better ts the target se*uence ● selection becomes a search of best concatenation – costs of t of concatenation between snippets – costs of t of snippets to target se*uence ● computationally expensi)e (search" – )ery high memory demands (5<<MBD per voice) ● results can be very natural sounding 1hat do you like better+ formant-based or pattern-based synthesis4 Summary Kank you. [email protected] https://nats-111.informatik.uni-hamburg.de/SL8&( Universität Hamburg, Department of Informatics Natural Language Systems Group )urther Reading ● Speech Synthesis in :eneral+ – 8. Taylor !0<<'"+ Text-to-Speech Synthesis. Fambridge Mni) 8ress. #SB;+ 'NO- <>0&O''0NN. #nfBib+ 5 T5P H=<N<. ● Ke $aryTTS Speech Synthesis System+ – SchrQder R Trou)ain !0<<="+ “Ke :erman Text-to-Speech Synthesis System $5.P+ 5 Tool for .esearch, ,e)elopment and TeachingA% Int. of Speech Technology 6!=". -oti.en Desired Learning Outcomes ● speech synthesis goal is to add variation for naturalness (this is opposite from 5S." ● problems/ambiguities in linguistic pre-processing – prosody and pitch: ToB#% information structure – major synthesis techni*ues+ formants% diphone% – !8SSL5 techni*ue".