Speech and Language Processing
Chapter 8 of SLP: Speech Synthesis

Outline
1) Arpabet
2) TTS Architectures
3) TTS Components
   • Text Analysis
   • Text Normalization
   • Homograph Disambiguation
   • Grapheme-to-Phoneme (Letter-to-Sound)
   • Intonation
   • Waveform Generation
   • Unit Selection
   • Diphones

Dave Barry on TTS
"And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By 'they', I mean computers; I doubt scientists will ever be able to talk to us.)"

ARPAbet Vowels
     b_d         ARPA          b_d       ARPA
 1   bead        iy        9   bode      ow
 2   bid         ih       10   booed     uw
 3   bayed       ey       11   bud       ah
 4   bed         eh       12   bird      er
 5   bad         ae       13   bide      ay
 6   bod(y)      aa       14   bowed     aw
 7   bawd        ao       15   Boyd      oy
 8   Budd(hist)  uh

Brief Historical Interlude
• Pictures and some text from Hartmut Traunmüller's web site:
  http://www.ling.su.se/staff/hartmut/kemplne.htm
• Von Kempelen, 1780 (b. Bratislava 1734, d. Vienna 1804)
• Leather resonator manipulated by the operator to copy the vocal tract configuration during sonorants (vowels, glides, nasals)
• Bellows provided the air stream; a counterweight provided inhalation
• A vibrating reed produced the periodic pressure wave

Von Kempelen (cont.)
• Small whistles controlled consonants
• Rubber mouth and nose; the nose had to be covered with two fingers for non-nasals
• Unvoiced sounds: mouth covered, auxiliary bellows driven by a string provided the puff of air
(From Traunmüller's web site)

Modern TTS systems
• 1960's: first full TTS: Umeda et al. (1968)
• 1970's:
  - Joe Olive 1977: concatenation of linear-prediction diphones
  - Speak and Spell
• 1980's:
  - 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1990's-present:
  - Diphone synthesis
  - Unit selection synthesis

2. Overview of TTS: Architectures of Modern Synthesis
• Articulatory Synthesis: model the movements of the articulators and the acoustics of the vocal tract
• Formant Synthesis: start with the acoustics; create rules/filters to create each formant
• Concatenative Synthesis: use databases of stored speech to assemble new utterances
(Text from Richard Sproat slides)

Formant Synthesis
• Were the most common commercial systems while computers were relatively underpowered
• 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1983 DECtalk system
• The voice of Stephen Hawking

Concatenative Synthesis
• All current commercial systems
• Diphone Synthesis:
  - Units are diphones: middle of one phone to middle of the next
  - Why? The middle of a phone is its steady state
  - Record one speaker saying each diphone
• Unit Selection Synthesis:
  - Larger units
  - Record 10 hours or more, so there are multiple copies of each unit
  - Use search to find the best sequence of units (a search sketch follows the demo links below)

TTS Demos (all are Unit-Selection)
• Festival: http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
• Cepstral: http://www.cepstral.com/cgi-bin/demos/general
• IBM: http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml
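The "search to find the best sequence of units" is standardly described (as in Hunt and Black's formulation) as a Viterbi search combining a target cost (how well a database unit matches the target specification) with a join cost (how smoothly adjacent units splice together). The sketch below is a minimal illustration in Python: `select_units`, both cost callables, and the candidate lists are hypothetical stand-ins for a real unit database, not any particular system's API.

```python
# A toy Viterbi search over candidate units. `target_cost(spec, unit)` and
# `join_cost(prev_unit, unit)` are hypothetical callables standing in for
# the real spectral/prosodic cost functions of a unit-selection system.

def select_units(targets, candidates, target_cost, join_cost):
    """Return the cheapest unit sequence for the target specs.

    targets:    list of target specifications (e.g. diphone labels)
    candidates: candidates[i] = list of database units matching targets[i]
    """
    # best[i][j] = (cost of best path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

With exactly one candidate per target the search degenerates to plain diphone synthesis; a 10-hour database offers many candidates per target, which is what makes the search worthwhile.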
Architecture
• The three types of TTS:
  - Concatenative
  - Formant
  - Articulatory
• These only cover the segments + f0 + duration to waveform part; a full system needs to go all the way from arbitrary text to sound.

Two steps
• Example input: "PG&E will file schedules on April 20."
• TEXT ANALYSIS: text into an intermediate representation
• WAVEFORM SYNTHESIS: from the intermediate representation into a waveform

The Hourglass
[figure]

1. Text Normalization
• Analysis of raw text into pronounceable words:
• Sentence Tokenization
• Text Normalization:
  - Identify tokens in the text
  - Chunk tokens into reasonably sized sections
  - Map tokens to words
  - Identify types for words

Rules for end-of-utterance detection
• A dot with one or two letters is an abbreviation
• A dot with 3 capital letters is an abbreviation
• An abbreviation followed by 2 spaces and a capital letter is an end-of-utterance
• Non-abbreviations followed by a capitalized word are breaks
• This fails for:
  - Cog. Sci. Newsletter
  - Lots of cases at the end of a line
  - Badly spaced/capitalized sentences
(From Alan Black lecture notes)

Decision Tree: is a word end-of-utterance?
[figure]

Learning Decision Trees
• DTs are rarely built by hand
• Hand-building is only possible for very simple features and domains
• There are lots of algorithms for DT induction

Next Step: Identify Types of Tokens, and Convert Tokens to Words
• Pronunciation of numbers often depends on type:
  - 1776 as a date: "seventeen seventy six"
  - 1776 as a phone number: "one seven seven six"
  - 1776 as a quantifier: "one thousand seven hundred (and) seventy six"
  - 25 as a day: "twenty-fifth"

Classify token into 1 of 20 types
• EXPN: abbreviations, contractions (adv, N.Y., mph, gov't)
• LSEQ: letter sequence (CIA, D.C., CDs)
• ASWD: read as word, e.g. CAT, proper names
• MSPL: misspelling
• NUM: number (cardinal) (12, 45, 1/2, 0.6)
• NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
• NTEL: telephone (or part), e.g. 212-555-4523
• NDIG: number as digits, e.g. Room 101
• NIDE: identifier, e.g. 747, 386, I5, PC110
• NADDR: number as street address, e.g. 5000 Pennsylvania
• NZIP, NTIME, NDATE, NYER, MONEY, BMONEY, PRCT, URL, etc.
• SLNT: not spoken (KENT*REALTY)

More about the types
• 4 categories for alphabetic sequences (the fourth, MSPL, appears above):
  - EXPN: expand to a full word or word sequence (fplc for fireplace, NY for New York)
  - LSEQ: say as a letter sequence (IBM)
  - ASWD: say as a standard word (either OOV or acronyms)
• 5 main ways to read numbers:
  - Cardinal (quantities)
  - Ordinal (dates)
  - String of digits (phone numbers)
  - Pair of digits (years)
  - Trailing unit: serial until the last non-zero digit: 8765000 is "eight seven six five thousand" (some phone numbers, long addresses)
• But there are still exceptions: (947-3030, 830-7056)

Finally: expanding NSW Tokens
• Type-specific heuristics (a number-reading sketch follows this slide):
  - ASWD expands to itself
  - LSEQ expands to a list of words, one for each letter
  - NUM expands to the string of words representing the cardinal
  - NYER expands to 2 pairs of NUM digits…
  - NTEL: string of digits with silence for punctuation
• Abbreviations:
  - Use the abbreviation lexicon if it's one we've seen
  - Else use the training set to learn how to expand
  - Cute idea: if "eat in kit" occurs in a text, "eat-in kitchen" will also occur somewhere
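Since the slides' own example ("1776") reads differently by type, here is a minimal sketch of type-dependent expansion, assuming a classifier has already assigned the NSW type. The word tables are simplified (cardinals only up to 99) and the function names are illustrative, not from any real normalizer.

```python
# Toy type-dependent number reading. Assumes the token classifier has
# already assigned an NSW type; tables are simplified and all names here
# are illustrative.

ONES  = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen sixteen "
         "seventeen eighteen nineteen").split()
TENS  = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digit(n):
    """Read 0-99 as a cardinal, e.g. 76 -> 'seventy six'."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def expand(token, nsw_type):
    """Expand a numeric token according to its NSW type."""
    if nsw_type == "NTEL":        # phone number: string of digits
        return " ".join(ONES[int(d)] for d in token if d.isdigit())
    if nsw_type == "NYER":        # year: pair of two-digit numbers
        n = int(token)
        return two_digit(n // 100) + " " + two_digit(n % 100)
    if nsw_type == "NUM":         # cardinal quantity (toy: 0-99 only)
        n = int(token)
        return two_digit(n) if n < 100 else token
    return token                  # other types not sketched here

# The same token reads differently depending on its assigned type:
#   expand("1776", "NYER") -> "seventeen seventy six"
#   expand("1776", "NTEL") -> "one seven seven six"
```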
2. Homograph disambiguation
• The 19 most frequent homographs, from Liberman and Church:

    use       319      survey     91
    increase  230      project    90
    close     215      separate   87
    record    195      present    80
    house     150      read       72
    contract  143      subject    68
    lead      131      rebel      48
    live      130      finance    46
    lives     105      estimate   46
    protest    94

• Not a huge problem, but still important

POS Tagging for homograph disambiguation
• Many homographs can be distinguished by POS:
  - use:   y uw s  vs.  y uw z
  - close: k l ow s  vs.  k l ow z
  - house: h aw s  vs.  h aw z
  - live:  l ay v  vs.  l ih v
• Stress-shifting pairs: REcord/reCORD, INsult/inSULT, OBject/obJECT, OVERflow/overFLOW, DIScount/disCOUNT, CONtent/conTENT

3. Letter-to-Sound: Getting from words to phones
• Two methods:
  - Dictionary-based
  - Rule-based (letter-to-sound = LTS)
• Early systems were all LTS
• MITalk was radical in having a huge 10K-word dictionary
• Now systems use a combination

Pronunciation Dictionaries: CMU
• CMU dictionary: 127K words
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict
• Some problems:
  - Has errors
  - Only American pronunciations
  - No syllable boundaries
  - Doesn't tell us which pronunciation to use for which homographs (no POS tags)
  - Doesn't distinguish case: the word "US" has 2 pronunciations, [AH1 S] and [Y UW1 EH1 S]

Pronunciation Dictionaries: UNISYN
• UNISYN dictionary: 110K words (Fitt 2002)
  http://www.cstr.ed.ac.uk/projects/unisyn/
• Benefits:
  - Has syllabification, stress, and some morphological boundaries
  - Pronunciations can be read off in General American, RP British, Australian, etc.
• (Other dictionaries like CELEX are not used because they are too small or British-only)

Dictionaries aren't sufficient
• Unknown words (= OOV = "out of vocabulary") increase with (the square root of) the number of words in unseen text
• Black et al. (1998), OALD on the 1st section of the Penn Treebank:
  - Out of 39,923 word tokens, 1,775 were OOV (4.6%; 943 unique types):

      names        1360    76.6%
      unknown       351    19.8%
      typos/other    64     3.6%

• So commercial systems have a 4-part system (a lookup-cascade sketch closes this section):
  - Big dictionary
  - Names handled by special routines
  - Acronyms handled by special routines (previous lecture)
  - Machine-learned g2p algorithm for other unknown words

Names
• The big problem area is names
• Names are common: 20% of tokens in typical newswire text will be names
• The 1987 Donnelly list (72 million households) contains about 1.5 million names
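To make the 4-part system concrete, here is a minimal sketch under stated assumptions: the lexicon is a plain word-to-phones dict (CMU-style), and `name_model`, `acronym_phones`, and `g2p` are hypothetical placeholders for the special-purpose routines the slides mention.

```python
# A toy lookup cascade: big dictionary, then name routines, then acronym
# spelling, then a machine-learned g2p fallback. Every component here is a
# hypothetical placeholder supplied by the caller, not a real library API.

def to_phones(word, lexicon, name_model, acronym_phones, g2p):
    """Return a phone string for `word`, trying resources in order."""
    w = word.lower()
    if w in lexicon:                       # 1. big dictionary (CMU-style)
        return lexicon[w]
    if word.istitle():                     # 2. capitalized: try name routines
        guess = name_model(word)
        if guess is not None:
            return guess
    if word.isupper() and len(word) <= 5:  # 3. short all-caps: spell as letters
        # (toy LSEQ rule; a real system would classify LSEQ vs. ASWD first,
        # since e.g. CAT should be read as a word, not spelled out)
        return " ".join(acronym_phones[ch] for ch in word)
    return g2p(w)                          # 4. machine-learned g2p fallback

# e.g. with lexicon = {"house": "HH AW1 S"}:
#   to_phones("house", lexicon, ...) -> "HH AW1 S"
#   an unseen "IBM" would be spelled letter by letter via acronym_phones
```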