CS 224S / LINGUIST 285 Spoken Language Processing

Andrew Maas Stanford University Winter 2021 Lecture 12: Text-to-Speech. Text Normalization. Prosody

Original slides by Dan Jurafsky, Alan Black, & Richard Sproat Outline — History and architectural overview — Text analysis — Text normalization — Letter-to-sound (grapheme-to-phoneme) — Prosody and intonation History of TTS — Pictures and some text from Hartmut Traunmüller’s web site: http://www.ling.su.se/staff/hartmut/kemplne.htm — Von Kempeln 1780 b. Bratislava 1734 d. Vienna 1804 — Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (, glides, nasals) — Bellows provided air stream, counterweight provided inhalation — Vibrating reed produced periodic pressure wave Von Kempelen: — Small whistles controlled consonants — Rubber mouth and nose; nose had to be covered with two fingers for non-nasals — Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site Homer Dudley 1939 — Synthesizing speech by electrical means — 1939 World’s Fair Homer Dudley’s VODER

•Manually controlled through complex keyboard •Operator training was a problem An aside on demos That last slide exhibited:

Rule 1 of playing a demo: Always have a human say what the words are right before you have the system say them ’s OVE synthesizer — Of the Royal Institute of Technology, — Formant Synthesizer for vowels — F1 and F2 could be controlled

From Traunmüller’s web site Cooper’s — Haskins Labs for investigating — Works like an inverse of a spectrograph — Light from a lamp goes through a rotating disk then through spectrogram into photovoltaic cells — Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band Cooper’s Pattern Playback Pre-modern TTS systems

— 1960’s first full TTS: Umeda et al (1968) — 1970’s —Joe Olive 1977 concatenation of linear- prediction diphones —Texas Instruments Speak and Spell, — June 1978 — Paul Breedlove Types of Synthesis — : — Model movements of articulators and acoustics of vocal tract — Formant Synthesis: — Start with acoustics, create rules/filters to create each formant — : — Use databases of stored speech to assemble new utterances. — Diphone — Unit Selection — Parametric Synthesis — Learn parameters on databases of speech Text modified from Richard Sproat slides 1980s: Formant Synthesis — Were the most common commercial systems when were slow and had little memory. — 1979 MIT MITalk (Allen, Hunnicut, Klatt) — 1983 DECtalk system based on Klatttalk —“Perfect Paul” (The voice of Stephen Hawking) —“Beautiful Betty” 1990s: 2nd Generation Synthesis Diphone Synthesis — Units are diphones; middle of one phone to middle of next. — Why? Middle of phone is steady state. — Record 1 speaker saying each diphone — ~1400 recordings — Paste them together and modify prosody. 3rd Generation Synthesis Unit Selection Systems — Most current commercial systems. — Larger units of variable length — Record one speaker speaking 10 hours or more, —Have multiple copies of each unit — Use search to find best sequence of units Parametric Synthesis — Train a statistical model on large amounts of data. Learn text à sound mapping. — Very active area of research — Previously associated with HMM Synthesis — Deep learning approaches (Wavenet, Tacotron) are the latest generation of parametric Some modern examples + Quiz — https://canvas.stanford.edu/courses/132145/quizzes/90223 — Word of the day: synthesis Pick your favorite voice below: (Intelligibility + Naturalness)

— Sample 1

— Sample 2

— Sample 3

— Sample 4 The two stages of TTS PG&E will file schedules on April 20.

1. Text Analysis: Text into intermediate representation:

2. Waveform Synthesis: From the intermediate representation into waveform

Text Normalization

Analysis of raw text into pronounceable words:

— Sentence Tokenization — Text Normalization — Identify tokens in text — Chunk tokens into reasonably sized sections — Map tokens to words — Tag the words Text Processing — He stole $100 million from the bank — It’s 13 St. Andrews St. — The home page is http://www.stanford.edu — Yes, see you the following tues, that’s 11/12/01 — IV: four, fourth, I.V. — IRA: I.R.A. or Ira — 1750: seventeen fifty (date, address) or one thousand seven… (dollars) Text Normalization Steps 1. Identify tokens in text 2. Chunk tokens 3. Identify types of tokens 4. Convert tokens to words Step 1+2: identify tokens and chunk — Whitespace can be viewed as separators — Punctuation can be separated from the raw tokens — For example, Festival converts text into —ordered list of tokens —each with features: — its own preceding whitespace — its own succeeding punctuation Important issue in tokenization: end-of-utterance detection — Relatively simple if utterance ends in ?! — But what about ambiguity of “.”? — Ambiguous between end-of-utterance, end- of-abbreviation, and both —My place on Main St. is around the corner. —I live at 123 Main St. —(Not “I live at 151 Main St..”) Steps 3+4: Identify Types of Tokens, and Convert Tokens to Words — Pronunciation of numbers often depends on type. Three ways to pronounce 1776: Date: seventeen seventy six Phone number: one seven seven six Quantifier: one thousand seven hundred (and) seventy six

—Also: — 25 Day: twenty-fifth Rules/features for end-of-utterance detection

— A dot with one or two letters is an abbrev — A dot with 3 cap letters is an abbrev. — An abbrev followed by 2 spaces and a capital letter is an end-of-utterance — Non-abbrevs followed by capitalized word are breaks — Modern approaches use deep learning sequence transduction How common are NSWs? — Word not in lexicon, or with non-alphabetic characters (Sproat et al 2001, before SMS/Twitter)

Text Type % NSW novels 1.5% press wire 4.9% e-mail 10.7% recipes 13.7% classified 17.9% How common are non-standard words (NSWs)? Word not in gnu aspell dictionary (Han, Cook, Baldwin 2013) not counting @mentions, #hashtags, urls,

Twitter: 15% of tweets have 50% or more OOV 256 Chapter 8. Speech Synthesis

French or German, for example, in addition to the above issues, normalization may de- 256 Chapterpend 8. on Speech morphological Synthesis properties. In French, the phrase 1 fille (‘one girl’) is normal- ized to une fille,but1garc¸on (‘one boy’) is normalized to un garc¸on. Sim- ilarly,French in or German, German, forHeinrich example, IV in addition(‘Henry to IV’) the above can issues, be normalized normalization to mayHeinrich de- der Viertepend on, morphologicalHeinrich properties. des Vierten In French,, Heinrich the phrase 1 fille dem(‘one Vierten girl’) is normal-,orHeinrich denized Vierten to une filledepending,but1garc on¸on the(‘one grammatical boy’) is normalized case of the to un noun garc (Demberg,¸on. Sim- 2006). ilarly, in German, Heinrich IV (‘Henry IV’) can be normalized to Heinrich der 256 Chapter 8. Speech Synthesis Vierte, Heinrich des Vierten, Heinrich dem Vierten,orHeinrich 8.1.3den Vierten Homographdepending Disambiguation on the grammatical case of the noun (Demberg, 2006). French or German, for example, in additionThe goal to the of above our NSW issues, normalization algorithms inmay the de- previous section was to determine which se- pend on morphological properties. In French,8.1.3 the Homograph phrase 1 fille (‘one Disambiguation girl’) is normal- ized to une fille,but1garc¸onquence(‘one boy’) of is standard normalized words to un to garc pronounce¸on. Sim- for each NSW. But sometimes determining 256 Chapter 8. Speech Synthesis The goal of our NSW algorithms in the previous section was to determine which se- ilarly, in German, Heinrich IV (‘Henryhow IV’) to pronounce can be normalized even standard to Heinrich words der is difficult. This is particularly true for homo- quence of standard words to pronounce for each NSW. But sometimes determining Vierte, Heinrich des Vierten, Heinrich dem Vierten,orHeinrich Homograph graphshow to,whicharewordswiththesamespellingbutdifferentpronunciations.Hereare pronounce even standard words is difficult. This is particularly true for homo- den ViertenFrench ordepending German, on for the example, grammatical in addition case of the to noun the above (Demberg, issues, 2006). normalization may de- Homograph somegraphs examples,whicharewordswiththesamespellingbutdifferentpronunciations.Hereare of the English homographs use, live,andbass: pend on morphological properties.some examples In French, of the the English phrase homographs1 fille (‘oneuse girl’), live,and is normal-bass: 8.1.3ized Homograph to une fille Disambiguation,but(8.6)1garc¸on It’s(‘one no use boy’)(/y uw is normalized s/) to ask to to useun(/y garc uw¸on z/).the Sim- telephone. (8.6) It’s no use (/y uw s/) to ask to use (/y uw z/) the telephone. The goalilarly, of our in NSW German, algorithmsHeinrich in(8.7) the IV previous(‘Henry Do you section IV’) live was(/l can ih to be v/) determine normalizednear a which zoo to with se-Heinrich live (/l ay der v/) animals? (8.7) Do you live (/l ih v/) near a zoo with live (/l ay v/) animals? quenceVierte of standard, Heinrich words to pronounce des(8.8) Vierten for I prefer each NSW., Heinrich bass But(/b sometimes ae s/) demfishing Vierten determining to playing,orHeinrich the bass (/b ey s/) guitar. how toden pronounce Vierten even standarddepending words on(8.8) is the diffi Igrammaticalcult. prefer This bass is(/b particularly case ae s/) offi theshing true noun forto playinghomo- (Demberg, the bass 2006).(/b ey s/) guitar. Homograph graphs,whicharewordswiththesamespellingbutdifferentpronunciations.HereareFrenchFrench homographs homographs include includefils fi(whichls (which has two has pronunciations two pronunciations [fis] ‘son’ [ versusfis] ‘son’ versus some examples of the English homographs use, live,andbass: 8.1.3 Homograph Disambiguation[fil][fil] ‘thread’]), ‘thread’]), or the the multiple multiple pronunciations pronunciations for fier for(‘proud’fier (‘proud’ or ‘to trust’), or ‘to and trust’),est and est (8.6) It’s no use (/y uw s/) to ask to(‘is’ use(‘is’(/y or or uw ‘East’) ‘East’) z/) the (Divay telephone. and and Vitale, Vitale, 1997). 1997). (8.7)The Do you goal live of(/l our ih v/) NSWnear a algorithms zoo withLuckilyLuckily live in(/l the ay for for previous v/) theanimals? task task section of of homograph homograph was to disambiguation, determine disambiguation, which the two se- the forms two of forms homographs of homographs in English (as well as in similar languages like French and German) tend to have dif- (8.8)quence I prefer bass of standard(/b ae s/) fi wordsshingin to to playing English pronounce the (as bass well for(/b each ey as s/) in NSW.guitar. similar But languages sometimes like determining French and German) tend to have dif- how to pronounce even standardferent words parts of is speech. difficult. For This example, is particularly the two forms true for of usehomo-above are (respectively) a French homographs include filsferent(which parts has two of pronunciations speech. For [fi example,s] ‘son’ versus the two forms of use above are (respectively) a Homograph graphs,whicharewordswiththesamespellingbutdifferentpronunciations.Herearenoun and a verb, while the two forms of live are (respectively) a verb and a noun. Fig- [fil] ‘thread’]), or the multiple pronunciationsnoun and for a verb,fier (‘proud’ while the or ‘to two trust’), forms and ofestlive are (respectively) a verb and a noun. Fig- some examplesHomograph of the Englishure homographs 8.5disambiguation shows someuse interesting, live,and systematicbass: relations between the pronunciation of some (‘is’ or ‘East’) (Divay and Vitale, 1997).urenoun-verb 8.5 shows and some adj-verb interesting homographs. systematic Indeed, relations Liberman between and Church the (1992) pronunciation showed of some Luckily for the task of homograph disambiguation, the two forms of homographs (8.6) It’s no use (/y uw s/)noun-verbtothat ask many to use and of(/y the adj-verb uw most z/) frequentthe homographs. telephone. homographs Indeed, in 44 million Liberman words of and AP Church newswire (1992) are showed in English (as well as in similar languagesdisambiguatable like French and just German) by using tend part to of have speech dif- (the most frequent 15 homographs in or- ferent(8.7) parts of Do speech. you live For(/l example, ih v/)thatnear the many two a zoo forms of with the of liveuse mostabove(/l ay frequent are v/) (respectively)animals? homographs a in 44 million words of AP newswire are der are use, increase, close, record, house, contract, lead, live, lives, protest, survey, noun and a verb, while the two forms of live are (respectively) a verb and a noun. Fig- (8.8) I prefer bass (/b ae s/)disambiguatableprojectfishing, separate to playing, justpresent the by bass, usingread(/b). partey s/) ofguitar. speech (the most frequent 15 homographs in or- ure 8.5 shows some interesting systematicder are relationsuse, increase between the, close pronunciation, record of, somehouse, contract, lead, live, lives, protest, survey, noun-verbFrench and adj-verb homographs homographs. include Indeed,fils Liberman(which has and two Church pronunciations (1992) showed [fis] ‘son’ versus Final voicing Stress shift -ate final that many[fil] of ‘thread’]), the most frequent or the multiplehomographsproject pronunciations, inseparate 44 million, present words for fi ofer, read AP(‘proud’ newswire). or are ‘to trust’), and est N (/s/) V (/z/) N (init. stress) V (fin. stress) N/A (final /ax/) V (final /ey/) disambiguatable(‘is’ or ‘East’) just by (Divay using part and of Vitale, speech 1997). (the most frequent 15 homographs in or- der are use, increaseFinaluse, close voicing,yuwsrecord, yuwzhouse, contractrecord, leadreh1kaxr0d, liveStress, lives shift,rix0kao1rdprotest, survey, estimate eh s t ih m ax-ate t ehfinal s t ih vowel m ey t Luckily forclose theklows task ofklowz homographinsult disambiguation,ih1 n s ax0 l t ix0 the n two s ah1 forms l t separate of homographssehpaxraxt sehpaxreyt project, separate, present, read). in English (ashouseN well(/s/)haws as inV similar(/z/)hawz languagesobject aa1N like(init. b j French eh0 stress) k t andax0V b German)( jfi eh1n. k stress) t tendmoderate to havemaadaxraxt dif-N/A (finalmaadaxreyt /ax/) V (final /ey/) ferent partsuse ofyuws speech.yuwz For example,record the tworeh1kaxr0d forms of userix0kao1rdabove are (respectively)estimate aeh s t ih m ax t eh s t ih m ey t Final voicing FigureStress shift 8.5 Some systematic relationships-ate betweenfinal vowel homographs: final consonant (noun /s/ versus verb /z/), stress N (/s/) V (/z/) noun andcloseN a(init. verb,shiftklows stress) (noun whileV initial( theklowzfin. stress stress) two forms versusinsult verb of livefinalih1N/Aare stress), n(fi s (respectively)nal ax0 and /ax/) lfi t nalVix0( vowelfinal n a /ey/) s verb weakening ah1 land t a inseparate noun.-ate noun/adjs. Fig-sehpaxraxt sehpaxreyt use yuws yuwz urerecord 8.5house showsreh1kaxr0d somehaws interestingrix0kao1rdhawz systematicobjectestimate relationsaa1eh s b t ih j eh0 m between ax k t t ehax0 s the t ih b pronunciationm j eh1 ey t k t moderate of somemaadaxraxt maadaxreyt close klows klowz noun-verbinsult Figureih1 and n s 8.5 ax0 adj-verb l t Someix0 n homographs. s systematic ah1 l tBecauseseparate relationships Indeed, knowledgesehpaxraxt Liberman between of partsehpaxreyt of and homographs: speech Church is suf (1992)fifinalcient consonant showed to disambiguate (noun /s/ many versus homo- verb /z/), stress house haws hawz thatobject manyshiftaa1 of (noun b j the eh0 most initialk t ax0 frequent stress b j eh1graphs, versus k t homographsmoderate in verb practicefinalmaadaxraxt in we stress), 44 perform million and homographmaadaxreytfinal words vowel of disambiguation APweakening newswire in by-ate are storingnoun/adjs. distinct pronun- Figure 8.5 Some systematicdisambiguatable relationships between just homographs: by usingciations partfinal of forconsonant speech these (noun homographs (the /s/most versus frequent labeled verb /z/), by 15 stress part homographs of speech in and or- then running a part-of- shift (noun initial stress versus verb final stress), and final vowel weakening in -ate noun/adjs. der are use, increase, close,speechrecord tagger, house to, choosecontract the, pronunciationlead, live, lives for, aprotest given homograph, survey, in context. BecauseThere are knowledge a number of of homographs, part of speech however, is for suf whichficient both to pronunciations disambiguate have many homo- project, separate, present, read). Because knowledge of part of speechgraphs,the same is in suf partpracticeficient of tospeech. disambiguate we perform We saw many two homograph pronunciations homo- disambiguation for bass (fish versus by storing instrument) distinct pronun- graphs, in practice we perform homographciations disambiguation for these homographs by storing distinct labeled pronun- by part of speech and then running a part-of- Final voicingciations for these homographsStress labeled shift by part of speech and then running-ate afi part-of-nal vowel N (/s/)speechV (/z/) tagger to choose theN (init. pronunciation stress)speechV for(fi taggern. a given stress) to homograph choose the in context. pronunciationN/A (final /ax/) forV (fi anal given /ey/) homograph in context. use yuws Thereyuwz are arecord number ofreh1kaxr0d homographs,There however,rix0kao1rd are for a which numberestimate both of pronunciations homographs,eh s t ih m have ax t however,eh s t ih m for ey which t both pronunciations have close klowsthe sameklowz part ofinsult speech.ih1 We n saw s ax0 twothe l t pronunciations sameix0 n s part ah1 l of for t speech.bassseparate(fish We versussehpaxraxt saw instrument) two pronunciationssehpaxreyt for bass (fish versus instrument) house haws hawz object aa1 b j eh0 k t ax0 b j eh1 k t moderate maadaxraxt maadaxreyt Figure 8.5 Some systematic relationships between homographs: final consonant (noun /s/ versus verb /z/), stress shift (noun initial stress versus verb final stress), and final vowel weakening in -ate noun/adjs.

Because knowledge of part of speech is sufficient to disambiguate many homo- graphs, in practice we perform homograph disambiguation by storing distinct pronun- ciations for these homographs labeled by part of speech and then running a part-of- speech tagger to choose the pronunciation for a given homograph in context. There are a number of homographs, however, for which both pronunciations have the same part of speech. We saw two pronunciations for bass (fish versus instrument) Letter to Sound Rules — AKA Grapheme to Phoneme (G2P) — Generally machine learning, induced from a dictionary — Pick your favorite machine learning tool and go for it — Earlier work: (Black et al. 1998) — Two steps: alignment and (CART-based) rule-induction — Modern seq2seq approaches might handle tokenization and G2P jointly as a single text normalization module Prosody and Intonation in TTS Prominence/Accent: Decide which words are accented, which syllable has accent, what sort of accent Boundaries: Decide where intonational boundaries are Duration: Specify length of each segment F0: Generate F0 contour from these Stress vs. accent — Stress: structural property of a word —Fixed in the lexicon: marks a potential (arbitrary) location for an accent to occur, if there is one — Accent: property of a word in context —Context-dependent. Marks important words in the discourse (x) (x) (accented syll) x x stressed syll x x x full vowels x x x x x x x syllables vi ta mins Ca li for nia Same ‘tune’, different alignment

400

350

300

250

200

150

100

50 LEGUMES are a good source of vitamins

The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti Same ‘tune’, different alignment

400

350

300

250

200

150

100

50 Legumes are a GOOD source of vitamins

The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti Same ‘tune’, different alignment

400

350

300

250

200

150

100

50 legumes are a good source of VITAMINS

The main rise-fall accent (= “I assert this”) shifts locations.

Slide from Jennifer Venditti Levels of prominence

— Most phrases have more than one accent — Nuclear Accent: Last accent in a phrase, perceived as more prominent — Plays semantic role like indicating a word is contrastive or focus. — Modeled via ***s in IM, or capitalized letters — ‘I know SOMETHING interesting is sure to happen,’ she said — Can also have reduced words that are less prominent than usual (especially function words) — Sometimes use 4 classes of prominence: — Emphatic accent, pitch accent, unaccented, reduced Predicting Boundaries: Full || versus intermediate |

Ostendorf and Veilleux. 1994 “Hierarchical Stochastic model for Automatic Prediction of Prosodic Boundary Location”, Computational Linguistics 20:1

Computer phone calls, || which do everything | from selling magazine subscriptions || to reminding people about meetings || have become the telephone equivalent | of junk mail. ||

Doctor Norman Rosenblatt, || dean of the college | of criminal justice at Northeastern University, || agrees.||

For WBUR, || I’m Margo Melnicove. A single intonation phrase

400

350

300

250

200

150

100

50 legumes are a good source of vitamins

Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit).

Slide from Jennifer Venditti Multiple phrases

400

350

300

250

200

150

100

50 legumes are a good source of vitamins

Utterances can be ‘chunked’ up into smaller phrases in order to signal the importance of information in each unit.

Slide from Jennifer Venditti Phrasing sometimes helps disambiguate

400

350

300 Mary & Elena’s mother 250 mall

200

150

100

50 I met Mary and Elena’s mother at the mall yesterday

One intonation phrase with relatively flat overall pitch range.

Slide from Jennifer Venditti Phrasing sometimes helps disambiguate

400

350 Elena’s mother 300 mall Mary 250

200

150

100

50 I met Mary and Elena’s mother at the mall yesterday

Separate phrases, with expanded pitch movements.

Slide from Jennifer Venditti More prosodic tunes

Slide from Jennifer Venditti Yes-No question tune

550 500 450 400 350 300 250 200 150 100 50 are LEGUMES a good source of vitamins

Rise from the main accent to the end of the sentence.

Slide from Jennifer Venditti Yes-No question tune

550 500 450 400 350 300 250 200 150 100 50 are legumes a GOOD source of vitamins

Rise from the main accent to the end of the sentence.

Slide from Jennifer Venditti Yes-No question tune

550 500 450 400 350 300 250 200 150 100 50 are legumes a good source of VITAMINS

Rise from the main accent to the end of the sentence.

Slide from Jennifer Venditti WH-questions

[I know that many natural foods are healthy, but ...] 400

350

300

250

200

150

100

50 WHAT are a good source of vitamins

WH-questions typically have falling contours, like statements.

Slide from Jennifer Venditti Broad focus

“Tell me something about the world.” 400

350

300

250

200

150

100

50 legumes are a good source of vitamins In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents.

Slide from Jennifer Venditti Rising statements

“Tell me something I didn’t already know.” 550 500 450 400 350 300 250 200 150 100 50 legumes are a good source of vitamins [... does this statement qualify?] High-rising statements can signal that the speaker is seeking approval. Slide from Jennifer Venditti Yes-No question

550 500 450 400 350 300 250 200 150 100 50 are legumes a good source of VITAMINS

Rise from the main accent to the end of the sentence.

Slide from Jennifer Venditti ‘Surprise-redundancy’ tune

[How many times do I have to tell you ...] 400

350

300

250

200

150

100

50 legumes are a good source of vitamins

Low beginning followed by a gradual rise to a high at the end.

Slide from Jennifer Venditti ‘Contradiction’ tune

“I’ve heard that linguini is a good source of vitamins.”

400

350

300

250

200

150

100

50 linguini isn’t a good source of vitamins [... how could you think that?] Sharp fall at the beginning, flat and low, then rising at the end.

Slide from Jennifer Venditti Appendix: Synthesis tools Synthesis tools (older) — I want my to talk —Festival Speech Synthesis — I want my computer to talk in my voice —FestVox Project — I want it to be fast and efficient —Flite

Text from Richard Sproat Festival — Open source speech synthesis system — Designed for development and runtime use — Use in many commercial and academic systems — Hundreds of thousands of users — Multilingual — No built-in language — Designed to allow addition of new languages — Additional tools for rapid voice development — Statistical learning tools — Scripts for building models

Text from Richard Sproat Festival as software — http://festvox.org/festival/ — General system for multi-lingual TTS — C/C++ code with Scheme scripting language — General replaceable modules: — Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing — General tools — Intonation analysis (f0, Tilt), signal processing, CART building, N-gram, SCFG, WFST

Text from Richard Sproat CMU FestVox project — Festival is an engine, how do you make voices? — Festvox: building synthetic voices: — Tools, scripts, documentation — Discussion and examples for building voices — Example voice databases — Step by step walkthroughs of processes — Support for English and other languages — Support for different waveform synthesis methods — Diphone — Unit selection

Text from Richard Sproat 1/5/07 Dictionaries — CMU dictionary: 127K words —http://www.speech.cs.cmu.edu/cgi- bin/cmudict — Unisyn dictionary — Significantly more accurate, includes multiple dialects — http://www.cstr.ed.ac.uk/projects/unisyn/ Dictionaries aren’t always sufficient: Unknown words — Go up with square root of # of words in text — Mostly person, company, product names — From a Black et al analysis — 5% of tokens in a sample of WSJ not in the OALD dictionary 77%: Names 20% Other Unknown Words 4%: Typos — So commercial systems have 3-part system: — Big dictionary — Special code for handling names — Machine learned LTS system for other unknown words Names — Big problem area is names — Names are common — 20% of tokens in typical newswire text — Spiegel (2003) estimate of US names: — 2 million surnames — 100,000 first names — Personal names: McArthur, D’Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen — Company/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe Methods for Names — Can do morphology (Walters -> Walter, Lucasville) — Can write stress-shifting rules (Jordan -> Jordanian) — Rhyme analogy: Plotsky by analogy with Trostsky (replace tr with pl) — Liberman and Church: for 250K most common names, got 212K (85%) from these modified- dictionary methods, used LTS for rest. — Can do automatic country detection (from letter trigrams) and then do country-specific rules Homograph disambiguation • 19 most frequent homographs, from Liberman and Church 1992 • Counts are per million, from an AP news corpus of 44 million words. • Not a huge problem, but still important

use 319 survey 91 increase 230 project 90 close 215 separate 87 record 195 present 80 house 150 read 72 contract 143 subject 68 lead 131 rebel 48 live 130 lives 105 finance 46 protest 94 estimate 46 POS Tagging for homograph disambiguation — Many homographs can be distinguished by POS use y uw s y uw z close k l ow s k l ow z house h aw s h aw z live l ay v l ih v REcord reCORD INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT — POS tagging also useful for CONTENT/FUNCTION distinction, which is useful for phrasing The 1936 UK Speaking Clock

From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm The UK Speaking Clock — July 24, 1936 — Photographic storage on 4 glass disks — 2 disks for minutes, 1 for hour, one for seconds. — Other words in sentence distributed across 4 disks, so all 4 used at once. — Voice of “Miss J. Cain” A technician adjusts the amplifiers of the first speaking clock

From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm Closer to a natural vocal tract: Riesz 1937