2010-05-19

TextText--toto--SpeechSpeech (TTS) Synthesis • GOAL: convert arbitrary textual messages to intelligible and natural sounding synthetic Digital Speech ProcessingProcessing—— speech so as to transmit information from a machine to a person Lecture 18 abstract input Text undliderlying Speech synthetic text Analysis linguistic Synthesizer speech output TextText--toto--SpeechSpeech (TTS) description • pronunciation of text => phonemes, stress, intonation, duration Synthesis Systems • syntactic structure of sentence (pauses, rate of speaking, emphasis) • semantic focus, ambiguity resolution (duration, intonation) – rules for etymology (especially names, foreign terms)

Text Analysis Components Document Structure Raw English Input Text • end of sentence marked by ‘.?!’ is not infallible Basic Text Processing Document Structure Detection – “The car is 72.5 in. long” Text Normalization • ee--mailmail and web pages need special processing Linguistic Analysis Dictionary – Larry: Tagged Text Phonetic Analysis Sure. I'll try to do it before Thursday :-) Homograph disambiguation GraphemeGrapheme--toto--PhonemePhoneme Conversion Ed Tagged Phones • multiple languages Prosodic Analysis – insertion of foreign , unusual accent and Pitch and Duration Rules diacritical marks, etc. Stress and Pause Assignment Synthesis Controls (Sequence of Sounds, Durations, Pitch)

Text Normalization Text Normalization • abbreviations and acronyms – Dr. is pronounced either as Doctor or drive depending on context (Dr. Smith lives on Smith Dr.) • proper names – St. is pronounced either as ‘street’ or ‘Saint’ depending on context (I live on Bourbon St. in St. Louis) – Rudy Prpch is pronounced as ‘Rudy Perpich’ – DC is either direct current or District of Columbia – MIT is pronounced as either ‘M I T’ or ‘Massachusetts Institute of – Sorin Ducan is pronounced as ‘Sorin Duchan’ Technology’ but never as ‘mitt’ • part of speech – DEC is pronounced as either ‘deck’ or ‘Diggyital Equipment Company’ but never as ‘D E C’ – readid is pronounce d as ‘ree d’ or ‘re d’ • numbers – record is pronounced as ‘rec-ard’ or ‘ri-cord’ – 370-1111 can be either ‘three seven oh …’ or ‘three seventy-model 1111’ for the IBM 370 computer • word decomposition – 1920 is either ‘nineteen-twenty’ or ‘one thousand, nine hundred, twenty’ • dates, times currency, account numbers, ordinals, cardinals, – need to decompose complex words into base forms math (morphemes) to determine pronunciation – Feb. 15, 1983 needs to convert to ‘February fifteenth, nineteen eighty- (‘indivisibility’ needs to be decomposed into ‘in-di- three’ – $10.50 is pronounced as ‘ten dollars and fifty cents’ visible-ity’ to determine pronunciation) – part # 10-50 needs to be pronounced as ‘part number 10 dash fifty’ rather than ‘part pound sign ten to fifty’

1 2010-05-19

Text Normalization Knowledge Source Confusions

• proper handling of special symbols • Let us pray ------Lettuce spray in text • Pay per view ------Paper view – punctuation, e.g., . , : ; ’ ” - -- _ * & ( ) ^ % @ ! ~ < > ? / \ = + • Meet her on Main St. ------• string resolution Meter on Main St. – 10:20 can be pronounced as either • It is easy to recognize speech ----- ‘twenty after 10 (as a time)’ or ‘ten to ----- It is easy to wreck a nice twenty’ (as a sequence) beach 8

Linguistic Analysis Homograph Disambiguation

• part-of-speech (POS) • “an absent boy” versus “do you choose to • word sense absent yourself?” • phrases • “they will abuse him” versus “they won’t • anaphora take abuse” • emphasis •style • “an overnight bag” versus “are you staying overnight?” a conventional parser could be used, but • “he is a learned man” versus “he learned typically a simple shallow analysis is done for to play piano” speed (parsers are not real-time!) • “El Camino Real road” versus “real world”

LetterLetter--toto--SoundSound (LTS) Conversion Prosody Input Word •CART “Whole Word” Dictionary Probe • pauses (Classification no – to indicate phrases and to avoid running out Affix Stripping of breath and Regression no Tree))y analysis yes • pitch • conventional no “Root” Dictionary Probe – fundamental frequency (F0) as a function of yes time dictionary LetterLetter--toto--SoundSound and yes search with Stress Rules • rate/relative duration – phoneme durations, timing, and rhythm letter-to-sound Affix Reattachment rules • loudness This is only ONE way of doing phonemes, stress, LTS. FSMs are another. partsparts--ofof--speechspeech – relative amplitude/volume

2 2010-05-19

Full TTS System Word Concatenation Synthesis

TextText/Language Face Sync

Text Letter-to- Synthesis • words in sentences are much shorter than Analysis Sound Rules Backend Speech in isolation (up to 50% shorter) (see next

- text and name dictionaries Synthesis model impacts TTS, page) as well as unit representation. - Numerical expansion - timing and duration (dates, times, $$$, …) [rate, stress (position, context)] choice of units: •wordtds cannot preserve sen tence-lllevel - words - abbreviations, acronyms - intonation/pitch - phones (type of phrase[list,statement, ??]) - proper name id - diphones stress, rhythm or intonation patterns - phonetic substitution - dyads “Dr. Smith lives at 23 (vowel reduction, flapping rules) - syllables Lakeshore Dr.” Prosody • too many words to store (1.7 million - loudness/amplitude choice of parameters: (phrasal amp., local reduction) -LPC surnames), extended words using prefixes : - formants - sentences, phrase and breath groups - waveform templates and suffixes - semantic and syntactic accent; stress of compounds - articulatory parameters - intonation types; parts of speech - sinusoidal parameters method of computation: - rules - concatenation

Word Concatenation Proper Name Statistics

Number of Names Coverage 10 4.9% 100 16.3% 5,000 59.1% 50,000 83.2% 100,000 88.6% 200,000 93.0%

Statistics of Proper Name Concatenative Word Synthesis Coverage

several examples of concatenated words and full sentencessentences----SBRSBR

name pronunciation wordword--basedbased based on etymology synthesis does not work!

3 2010-05-19

Speech Synthesis Methods The VODER

• 1939—the VODER (Voice Operated DEmonstratoR)—Homer Dudley – based on a simple model of speech sound production – select voicing source (with foot pedal control of pitch) or noise source – ten filters shaped the source to produce vocal or noise-excited sounds—controlled by finger motions – separate keys for stop sounds – wrist bar control of signal energy

Articulatory Synthesis Articulatory Model • in theory — can create more natural and more realistic motions of the articulators (rather than Articulatory Parameters of formant parameters), thereby leading to more Model: natural sounding synthetic speech • lip openingopening——WW – utilize physical constraints of articulator movements • lip protrusion lengthlength——L – use X-ray data to characterize individual speech • tongue body height and sounds lengthlength——Y,X – model how articulatory parameters move smoothly • velar closure—closure—K between sounds • tongue tip height and • direct method: solve wave equation for sound pressure at lengthlength——B,R lips • jaw raising (dependent • indirect method: convert to formants or LPC parameters for parameter) final synthesis in order to utilize existing synthesizers • velum openingopening——N – use highly constrained motions of articulatory parameters

Articulatory Synthesis Using Articulatory Synthesis by Copying Formant Synthesizer Backend Measured Vocal Tract Data

• fully automatic closed-loop optimization – initialized from articulatory codebooks, neural nets – Schroeter and Sondhi, 1987 Cecil CokerCoker----teachingteaching computers to talk

Articulatory Synthesis of One example: original: re-synthesis: SpeechSpeech——CecilCecil Coker

4 2010-05-19

Articulatory Synthesis Issues SourceSource--FilterFilter Synthesis Models • cascade/serial (formant) synthesis model

• requires highly accurate models of glottis and vocal tract • requires rules for dynamics of the articulators

Parallel Synthesis Model Serial/Formant Synthesis Model A serial synthesizer is a good approach for open, non-nasal vocal tracts (vowels, liquids). For obstruents and nasals, we need to control the amplitudes of each resonance, and to introduce zeros in addition to the poles. • flaws in the serial/formant synthesis model: 4 12−+rrcos(θ ) 2 – can’t handle voiced fricatives Vz()= kkk ∏12−+rzrzcos(θ ) −122− – no zeros for nasal sounds k=1 kk k 4 ABz− −1 – no precise control for stop consonants = kk ∑12−+rzrzcos(θ ) −122+ − – pitch pulse shape fixedfixed——independentindependent of pitch k=1 kk k i use the approximation – spectral compensation is inadequate 4 A Vz() k ≈ ∑ −122− k =1 12−+rzrzkkcos(θ ) k

parallel synthesizer provides more flexibility in matching spectrum levels at formant frequencies (via gain OVE 11----FantFant SPASS JSRU “To Be …”…”-- We Wish DaisyDaisy-- controls)controls)——however,however, zeros are synthesis Synthesis Bell Labs You … Daisy with introduced into the spectrum. music

Parallel Synthesis Continuing Evolution (1959(1959--1987)1987) • issues: – need individual resonance amplitudes • Haskins, 1959 (A1,…,A4))——ifif resonances are close, this is a • KTH – Stockholm, 1962 messy calculation • Bell Labs, 1973 – ppghasing of resonances neg lected ( the B z--11 terms) k • MIT, 1976 – synthetic speech has both resonances and zeros (at frequencies between the resonances) that may • MIT-talk, 1979 be perceptible • Speak ‘N spell, 1980 – better reproduction of complex consonants • BELL Labs, 1985 • Dec talk, 1987

Parallel synthesis-synthesis-HolmesHolmes Parallel synthesis from BYU

5 2010-05-19

TextText--toto--SpeechSpeech Synthesis (TTS) Speech SynthesisSynthesis——thethe 90’s Evolution Good Intelligibility; Poor Good Customer • what changed? Intelligibility; Intelligibility; Quality Poor Poor Naturalness – TTS was highly intelligible but extremely unnatural Naturalness Naturalness (Limited sounding Context) – a decade of work had not changed the naturalness FtFormant LPC-BdBased UitSlUnit Selec tion subtbstanti tillally Synthesis Diphone/Dyad Synthesis Synthesis – computation and memory grew with Moore’s law, enabling highly complex concatenative systems to be ATR in Japan; Bell Labs; Bell Labs; CNET; CSTR in Scotland; created, implemented and perfected Joint Speech BT in England; Research Unit; Bellcore; AT&T Labs (1998); – concatenative systems showed themselves capable MIT (DEC-Talk); Berkeley L&H in Belgium Haskins Lab Speech of producing (in some cases) extremely natural Technology sounding synthetic speech 1962 1967 1972 1977 1982 1987 1992 1997 Year

Concatenation TTS Systems Concatenation Units

• key idea: use segments of recorded speech for • choice of units: synthesis • data driven approach Î more segments give better – Words—there are an infinite number of them synthesis Î using an infinite number of segments leads – Syllables—there are about 10K in English to perfect synthesis • key issues: – Phonemes—there are about 45 in English – what units to use – Demi-syllables—there are about 2500 in – how to select units from natural speech English – how to label and extract consistent units from a large database – what signal representation should be used for spectrally – Diphones—there are about 1500-2500 in smoothing units (at junctures) and for prosody modification (pitch, duration, amplitude) English

Choice of Units Corpus Coverage by Unit Type

# Rules, Necessary Length Unit # Units 50000 2-Syllable 1.0 NOTE: depends on domain (English) Unit Modifications VC*V (here SURNAMES) Triphones .83 Quality Many Syllables Short Low 40000 Demisyllables Allophone 60-80 Diphones 2 2 Slope Diphone <40 -65 30000 Words s unit) t Triphone <403-653 / Demisyllable 2K # Uni 20000 Syllable 11K .23 (1 token (1 .15 VC*V .11 2-syllable <11 K2 10000 Word 100K-1.5M .02 Phrase 0 .003 8 Long Sentence Few High 1 10K 20K 30K 40K 50K 2 M

Top N Surnames (rank)

6 2010-05-19

Block Diagram of a Concatenation Units Concatenative TTS System • Words: – no complete coverage for broad domains Î words Dictionary Store of have to be supplemented with smaller units and Rules Sound – limited ability to modify pitch, amplitude and duration Units without losing naturalness and intelligibility – needhd huge da ta base to ex trac t mu ltilltiple vers ions o f each word Text Assemble Speech Speech Message Text Analysis, Units that Waveform • Sub-Words: Letter-to- Modification – hard to isolate in context due to co-articulation Match Alphabetic Sound, Input and – need allophonic variations to characterize units in all Characters Prosody Targets Synthesis contexts – puts large burden on signal processing to smooth at unit join points Phonetic Symbols, Prosody Targets Female Male

Procedure for Concatenative Synthesis Unit Selection Synthesis • off-line inventory preparation: – record and process with coding method • need to optimally match units at boundaries, e.g., of choice fundamental frequency (pitch), and spectrum – determine location of speech units and store units in • need to automatically and efficiently select optimal inventory sequence of units from database • issues in Unit Selection Synthesis 6 • on-lineesyt synthes essis fr om tettext – severallihitt(f10t10)l examples in each unit category (from 10 to 10 ) – normalize input text (expand abbreviations, etc.) – waveform modification used sparingly (leads to perceived distortions) – letter-to-sound (pronunciation dictionary and rules) – high intelligibility must be maintained – prosody (melody/pitch&durations, stress patterns/amplitudes, …) – “customer quality” attained with reasonable training set – select appropriate sequence of units from inventory (1-10 hours) – “natural quality” attained with large training set (10’s of hours) – modify units (smooth at boundaries, match desired – “unit selection” algorithm must run in a fraction of real time on a prosody) state-of-the-art processor – synthesize and output speech signal

(On(On--line)line) Unit Selection: Viterbi Unit Selection Measures Search #-a a-u u-# • USD—Unit Segmental Distortion Î differences between desired spectral pattern of target and #-a a-u u-# (1) (1) (1) that of candidate unit, throughout whole unit •UCD—Unit Concatenative Distortion Î spectral #-a a-u u-# #-#(2) (2) (2) #-# discontinuity across boundaries of the concatenated units #-a a-u u-# (3) (3) (3) Example: source context—want—/w ah n t/ a-u (4) target context—cart—/k ah r t/ • transitional (concatenation) costs are based on acoustic distances USDÎ(ahÆn versus ahÆr) • node (target) costs are based on linguistic identity of unit

7 2010-05-19

Modern TTS Systems (Natural Modern COMMERCIAL Systems Voices from AT&T)

• Soliloquy from Hamlet— NV2005 • Lucent • Gettysburg Address— NV2005 •AcuVoice • Bob Story— •Festival • German female— • L&H RealSpeak • UK British female— • SpeechWorks female • Spanish female— • SpeechWorks male • Korean female— • Cselt (Actor) - Italian • French male—

TTS Future Needs Business Drivers of TTS

• TTS needs to know how things should be said • cost reduction • context-sensitive pronunciations of words – TTS as a dialog component for customer care • prosody predictionÎ emphasis – TTS for delivering messages – I gave the book to John (not someone else) – TTS to replace expensive recorded IVR prompts – I gave the book to John (not the photos) • new products and services – I gave the book to John (I did it , not someone else) – location-based services • unit selection process Î target cost captures mismatch – providing information in cars (e.g., driving directions, between predicted unit specification (phoneme name, traffic reports) duration, pitch, spectral properties) and actual features of a candidate recorded unit Î need better spectral – unified Messaging (reading e-mail, fax) distance measures that incorporate human perception – voice Portals (enterprise, home, phone access to Web-based services) • better signal processing Î compress units for small – e-commerce (automatic information agents) footprint devices – customized News, Stock Reports, Sports Scores – devices

Reading Email Reading Email (final) Example: Old TTS, No Filter Example: Enhanced TTS

From: Marilyn Walker From: Marilyn Walker To: David Ross To: David Ross Subject: Re: Today's Meeting Subject: Re: Today's Meeting Date: Tuesday, December 01, 1998 4:25 PM Date: Tuesday, December 01, 1998 4:25 PM ------4:30 is fine for me. See you at the meeting. ------4:30 is fine for me. See you at the meeting. Marilyn Marilyn Walker -----Original Message----- From: David Ross -----Original Message----- To: Marilyn Walker From: David Ross Date: Tuesday, December 01, 1998 2:25 PM To: Marilyn Walker Subject: Today's Meeting Date: Tuesday, December 01, 1998 2:25 PM Subject: Today's Meeting Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected]. Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected]. Thanks, Thanks, david ross david ross

8 2010-05-19

TTS Application Categories

Devices • PDAs, cellphones, gaming, talking appliances

Automotive • driving directions, city and restaurant guides, Connectivity location services (e.g., “Macys has a sale!”) Consumer voice control of cell phones, VCRs, TVs • home information access over telephone (“Home Communications Voice Portals”)

Enterprise • information access over the phone such as sales Communications information, HR, internal phonebook, messaging

Voice-assisted • E-commerce, customer care (e.g., friendly E-Commerce automated talking web agents, FAQs, product information) Call center Automation • next-gen “HMIHY” automated operator services

9