
Machine-Readable Dictionaries in Text-to-Speech Systems

Judith L. Klavans* and Evelyne Tzoukermann**
* Columbia University, Department of Computer Science, New York, New York 10027 [email protected]
** A.T.&T. Bell Laboratories, 600 Mountain Avenue, Murray Hill, N.J. 07974 [email protected]

Abstract

This paper presents the results of an experiment using machine-readable dictionaries (MRDs) and corpora for building concatenative units for text to speech (TTS) systems. Theoretical questions concerning the nature of phonemic data in dictionaries are raised; phonemic data is viewed as a representative corpus over which to extract n-gram phonemic frequencies in the language. Dictionary data are compared to corpus data, and inventories are evaluated for coverage. A methodology is defined to compute phonemic n-grams for incorporation into a TTS system.

1 Introduction

The majority of speech synthesis systems use two techniques: concatenation and formant-synthesis. Building a comprehensive and intelligible concatenative-based speech synthesis system relies heavily on the successful choice of concatenative units. Our results contribute to the task of developing an efficient and effective methodology for reducing the potentially large set of concatenative units to a manageable size, and to choosing the optimal set for recording and storage.

The paper is aimed primarily at two audiences: one consists of those concerned with research on the automatic use of MRD data; the other consists of TTS system designers who require linguistic and lexicographic resources to improve and streamline system-building. Issues of morphological analysis and generation, as well as stress assignment based on dictionary data, are discussed.

2 Using MRDs in Text to Speech

Several problems are addressed in this paper; one concerns the subtle complexities and idiosyncrasies involved in parsing dictionaries and extracting data. Added to this is the lack of consistency both within the same dictionary and across dictionaries, which often requires ad hoc procedures for each resource. Another issue relates to the structure of the modules of a TTS system, specifically in the grapheme-to-phoneme component; dictionary lookup depends on several factors including size, machine power and storage, factors that have important consequences for the extraction of concatenative units. Another consideration concerns the nature of the language itself: a language with irregular grapheme-to-phoneme mapping and lexically determined stress assignment (such as English) benefits most from the large exception list which a dictionary can provide. There is also the practical issue of dictionary availability, and of pronunciation field accuracy within an available dictionary. Thus, decisions on the use of MRD data depend on many factors, and can significantly impact efficiency and accuracy of a speech system.

Since a dictionary entry consists of several fields of information, naturally, each will be useful for different applications [1]. Among the standard fields are pronunciation, etymology, subject field notes, definition fields, synonym and antonym cross references, semantic and syntactic comments, run-on forms, conjugational class and inflectional information where relevant, and translation for the bilingual dictionaries. Each of these fields has proven useful for different applications, such as for building semantic taxonomies [3], [13] and machine translation [12]. The most directly useful for TTS is the pronunciation field [4], [11]. Equally useful for TTS, but less directly accessible, are data from run-on fields, conjugational class information, and part-of-speech.¹

¹ Notice, however, that the full Collins Spanish-English dictionary [7], as opposed to the other bilinguals, does not contain any pronunciation information. Although this is rather surprising taking into account that the smaller versions such as the paperback and gem ([8], [10]) do list a phonetic field, it could be attributed to the fact that pronunciation rules in Spanish are relatively predictable.

To illustrate, the following partial entries from Webster's Seventh (W7) [15] show typical pronunciation, definition, and run-on fields:

(1) ha.ven /'hā-vən/ n 1 : HARBOR, PORT 2 : a place of safety : ASYLUM haven vt
(2) bi.son /'bīs-ən, 'bīz-/ n ...
(3) ho.mo.ge.neous /-'jē-nē-əs, -nyəs/ ...
(4) den.tic.u.late /den-'tik-yə-lət/ or den.tic.u.lat.ed /-,lāt-əd/ adj

The entry for "haven" contains one full pronunciation. The entry for "bison" has one alternative, but the user must figure out that the /ən/ should be appended after /'bīz-/, as in the first pronunciation, in order to obtain the correct variation. Correct pronunciation for "homogeneous" relies on the pronunciation of the previous entry, "homogeneity", and requires the user to separate and bring the prefix "homo-" from one entry to another. To complicate matters, the alternative pronunciation for the suffix, /nē-əs/ or /nyəs/, must also be correctly interpreted by the user. Finally, "denticulate" has a morphologically related run-on form "denticulated" in the early part of the entry, and the pronunciation of that run-on is related to the main entry, but the user must decide how to strip and append the given syllables. While these types of reasoning are not difficult for humans, for whom the dictionary was written, they are quite difficult for programs, and thus are not straightforward to perform automatically.

2.1 Using the MRD pronunciation field

Extracting the pronunciation field from an MRD is one of the most obvious uses of a dictionary. Nevertheless, parsing dictionaries in general can be a very complex operation ([16]) and even the extraction of one field, such as pronunciation, can pose problems. Similar to W7, in the Robert French dictionary [9], which contains about 89,000 entries, several pronunciations can be given for a headword and the choice of one must be made. Moreover, because of the rich morphology of French, which has a rough ratio of eight morphologically inflected forms for one baseform, Robert lists only the non-inflected forms of the lexical entries. However, if pronunciation varies during inflection of nouns and adjectives, the pronunciation field reflects that variation, which makes the information difficult to extract automatically. For example, in (5) and (6), one needs to know the nature of the rule to apply in order to relate both forms of the adjective.

(5) blanc, blanche /blɑ̃, blɑ̃ʃ/ adj. et n.
(6) vif, vive /vif, viv/ adj. et n.

In (5), the masculine /blɑ̃/ is obtained by removing the phoneme /ʃ/ from the feminine /blɑ̃ʃ/ (blanche, "white"). In (6), the masculine form /vif/ ("sharp, quick") is formed by stripping the affix /ve/ and substituting the phoneme /f/. Notice that the rules are different in nature, the first being an addition/deletion relation, and the second being a substitution.

In this project, the dictionary pronunciation field was used to start building the phonetic inventory of a speech synthesis system. For the French TTS system [?], the set of diphones was established by taking most of the thirty-five phonemes for French and coupling them with each other (35² = 1225 pairs). Then, the diphones were extracted from the pronunciation field for [...] in the Robert dictionary.² A program was written to search through the dictionary phonetic field and select the longest word where the phoneme pairs would be in mid-syllable position. For example, the phonemic pair /lo/ was found in the pronunciation field /zooloʒik/ corresponding to the headword zoologique "zoologic."

² [2] reports on the need to resyllabify entries already syllabified in LDOCE [18], since syllable boundaries for written forms usually reflect hyphenation conventions, rather than phonologically motivated syllabification conventions necessary for pronunciation.

Out of 1225 phonemic pairs, 874 words were found with at least one occurrence of the pair. The pair [headword_orth, headword_phon] was extracted and headword_orth was placed in a carrier sentence for recording. For instance, the speaker would utter the following sentence: "C'est zoologique que je dis" where "C'est ... que je dis" is the carrier sentence. Due to the lack of explicit inflectional information for nouns and adjectives, only the non-inflected forms of the entries were extracted during dictionary lookup for building the diphone table. Similarly for verbs, only the infinitive forms were used since the dictionary does not list the inflected forms as headwords. This exemplifies the simplest way to use pronunciation field data, which we have completed. A pronunciation list of around 85,796 phonetic words was obtained from the original list of almost 89,000 entries, i.e. 96% of the entries. The remaining 4% consist primarily of prefixes and suffixes, which are listed in the dictionary without pronunciations.
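The selection procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in the project: the names all_pairs, select_carrier_words, and carrier_sentence are hypothetical, each phoneme is assumed to be written as a single character (with Z standing for the fricative in zoologique), the three-entry lexicon is toy data, and "mid-syllable position" is approximated by requiring only that the pair not touch either edge of the word, since no syllabifier is assumed.

```python
from itertools import product

# Toy lexicon of (headword_orth, headword_phon) pairs; one character per phoneme.
LEXICON = [
    ("zoologique", "zooloZik"),
    ("abaca", "abaka"),
    ("abduction", "abdyksjO"),
]


def all_pairs(phonemes):
    """Couple every phoneme with every phoneme (n phonemes give n * n pairs)."""
    return [first + second for first, second in product(phonemes, repeat=2)]


def select_carrier_words(phonemes, lexicon):
    """For each phoneme pair, keep the longest entry containing it word-internally.

    Assumption: "word-internal" (the pair touches neither edge of the
    transcription) is a crude stand-in for mid-syllable position.
    """
    chosen = {}
    for pair in all_pairs(phonemes):
        candidates = [
            (orth, phon) for orth, phon in lexicon if pair in phon[1:-1]
        ]
        if candidates:
            chosen[pair] = max(candidates, key=lambda entry: len(entry[1]))
    return chosen


def carrier_sentence(headword_orth):
    """Embed the headword in the carrier sentence used for recording."""
    return f"C'est {headword_orth} que je dis"


if __name__ == "__main__":
    selection = select_carrier_words(["l", "o", "b", "a"], LEXICON)
    for pair, (orth, phon) in sorted(selection.items()):
        print(pair, phon, carrier_sentence(orth))
```

With a thirty-five symbol inventory, all_pairs yields the 1225 candidate pairs mentioned above; the fraction of pairs that end up as keys of the returned dictionary is the coverage figure.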

The prefixes and suffixes listed without pronunciations should not be used in isolation in any case.

2.2 Using the MRD for morphology

Even though an MRD may not list complete inflectional paradigms, it contains useful inflectional information. For example in the Collins Spanish-English dictionary, verb entries are listed with an index pointing to the conjugation class and table, listed at the end of the dictionary. Using this information, a finite-state transducer for morphological analysis and generation was built for Spanish [20]. From the original list of over 50,000 words, a few million words have been generated. These forms can then be used as the input to the grapheme-to-phoneme conversion module, in a Spanish TTS system.

2.3 Using Run-on's

A run-on is defined as a morphological variant of a headword, included in the entry. Run-on's are problematic data in MRDs [16], and they can be found nearly anywhere in the entry. In example (4), the run-on occurs at the beginning of the entry, and consists of a full form with suffix. More commonly, run-on's occur towards the end of the entry, and tend to consist of predictable suffixation, that is, class II or neutral suffixes [19], such as -ness, -ly, or -er, as in:

(7) sharp adj .... sharp.ly adv sharp.ness n
(8) suc.ces.sion n .... suc.ces.sion.al adj suc.ces.sion.al.ly adv

In cases where stress is changed with class I non-neutral suffixes, a separate pronunciation is given as in:

(9) gy.ro.scope /'jī-rə-,skōp/ n .... gy.ro.scop.ic /jī-rə-'skäp-ik/ adj - gy.ro.scop.i.cal.ly /-i-k(ə-)lē/ adv

The run-on form with part-of-speech is given inside the entry, so it could be used for morphological analysis. However, since pronunciation is usually predictable from the headword (i.e. there is usually no stress change, and if there is a change, this is explicitly indicated) the run-on pronunciation often consists of a truncated form, requiring some logic for reconstruction of the entire pronunciation. Again, this may be obvious to the human user, but rather complex to figure out by program. Thus, the run-on may be useful for morphology, but is not as useful for automatic pronunciation extraction.

3 Methodology and Results

3.1 Collecting Data

As stated above, out of almost 89,000 headwords in the dictionary, 874 phonemic pairs, which represent 71% of the total, were found. This is due to the fact that (a) the lookup occurs only on non-inflected words, thus a limited sample of the language, and (b) because the dictionary consists of a list of isolated words, it does not account for inter-word boundary phenomena. Since French liaison plays such an important role in the phonology of French, a look at phonetic data from a corpus must be given in order to achieve full coverage. A portion of the Hansard French corpus (over 2.3 million words) was used for this purpose. Grapheme-to-phoneme software [14] was utilized in order to convert French orthography into phonemes. For the sake of comparison, both the phonetic transcription from the corpus and the one from the MRD were converted into a unique set of phonemes. Typical output from the dictionary looks like:

ABACA [abaka] n. m.
ABASOURDIR [abazuRdiR]; [abasuRdiR]
ABASOURDISSANT, ANTE [abazuRdisA, At]
ABATTEUR, EUSE [abat8R, 7z] n.
ABCE'S [absE; apsE] n. m.
ABDOMINAL, ALE, AUX [abd>minal, o] adj.
ABDOMINO-
ABDUCTION [abdyksjO] n. f.

A small sample of the Hansard followed by the ascii transcription is shown below:

Pre'sident de la Compagnie d'Ame'nagement du barreau de X Monsieur X De'pute' ancien Ministre Pre'sident du Conseil.

prezidA d& la kOpaNi d amenaZmA dy bare d& iks m&sju dis depyte Asjl ministr prezid dy kOsEj

As an experiment, we compared triphones extracted from dictionary data and corpora. A greedy algorithm³ to locate the most common cooccurrences between orthography and transcription was run on the data sets. A sample of the corpus and dictionary results is given in Table 1. The table shows in the leftmost two columns the top twenty triphones and occurring frequencies extracted from the Hansard corpus, whereas the righthand columns show dictionary results. Notice the discrepancy between these lists for the top twenty triphones.

³ We thank Jan van Santen for providing this software.
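The triphone comparison can be sketched in Python as follows. This is an illustrative sketch rather than the software used for the experiment: the function names triphone_counts and top_n_overlap are hypothetical, each phoneme is assumed to be one character, the symbol * is assumed to mark the edge of each transcribed unit (a word for the dictionary, an utterance with spaces removed for the corpus), and "commonality" is assumed to mean the number of shared triphones divided by the list length n.

```python
from collections import Counter


def triphone_counts(transcriptions, boundary="*"):
    """Count every three-phoneme window, padding each unit with a boundary mark."""
    counts = Counter()
    for phones in transcriptions:
        padded = boundary + phones + boundary
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts


def top_n_overlap(counts_a, counts_b, n):
    """Return the triphones shared by both top-n lists and their share of n."""
    top_a = {triphone for triphone, _ in counts_a.most_common(n)}
    top_b = {triphone for triphone, _ in counts_b.most_common(n)}
    shared = top_a & top_b
    return shared, len(shared) / n


if __name__ == "__main__":
    # Toy input: two dictionary transcriptions and one corpus utterance.
    dictionary = triphone_counts(["abaka", "abdyksjO"])
    corpus = triphone_counts(["prezidAd&lakOpaNi"])
    print(top_n_overlap(corpus, dictionary, 5))
```

Running top_n_overlap with n set to 20, 100, and 1000 on the two full count tables would give the kind of commonality figures reported in this section.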

Among the top twenty triphones of each list, there are only two overlaps, sjO and jO*. The levels of commonality between the triphones of the Hansard and the dictionary (5% of commonality for the top 100 triphones and 15% of commonality for the top 1000 triphones) are interesting to observe.

Hansard data                      Robert data
54580  ~O     38745  *z&          3636  mA*    1725  i81H
53948  jO*    38707  set          3324  ik*    1554  ite
47339  par    35052  k>m          3223  jO*    1492  je*
44328  asj    30389  *mE          2823  sjO    1462  8111"
44065  prL    39722  re*          2597  te*    1405  bl*
43288  tr~    29093  &pr          2202  *de    1391  Mi
41356  &la    28784  ist          2105  5sj    1389  abl
40877  put    26766               2086  EFt*   1376  tik
39122  ~mA    25997  Eat          2067  aZ*    1341
38707  d&l    25378  HIA*         1789  ist    1321  st*

Table 1: Twenty most frequent triphones

The preliminary results indicate that the coarticulatory effects derived from the corpus data will be useful, in particular for languages like French where liaison plays a major role. This remains to be tested in the TTS system.

3.2 Related Work

Although the statistical analysis of MRDs has focussed primarily on definitions and translations, [5] used the pronunciation field as data. A dictionary of over 110,000 entries, containing 51,219 common words and 59,625 proper nouns [17], was used for selecting candidate units that were further utilized in the set of concatenative units (diphones, triphones, and longer units) for synthesis. The phonemic string was split according to ten language-dependent segmentation principles. For example, the word "abacus" ['ab-ə-kəs] was first transformed into cuttable units as follows: [#'a, 'ab, bə, ək, kə, əs, s#]. Once each dictionary word was split, the duplicates were removed and the remaining units formed the set of concatenative units. At the end of this operation, a rather long list was obtained that was pruned by methods such as reduction of secondary and primary stress into one stress in order to keep only one +stress/-stress distinction. Techniques were shown that allow the selection of a minimal set of word pairs for inter-word junctures; every candidate unit inside and across word sequence was included. The same strategy was replicated on the Collins Spanish-English dictionary by [6]. In this fashion, the dictionary was used as a sample of the language in the sense that it assumes that most of the phonemic combinations of the language were present.

4 Limitations of MRDs

The most straightforward way to use an MRD, but in the long run not the most flexible, is to parse the phonetic information out of the pronunciation field. The pronunciation field information can generally be consulted by a TTS system within the grapheme-to-phoneme module. Additional rules for processes such as inter-word assimilation, juncture, and prosodic contouring need to be added, since isolated word pronunciation could already be handled by look-up table. Although appealing, there are two major drawbacks to this approach:

(a) dictionary pronunciation fields are often not phonetically fine-grained enough for acceptable speech output. For example, the pronunciation for "inquest" is given in W7 as /'in-,kwest/, but of course the nasal will assimilate in place to the velar, giving /iŋ-kwest/. Without assimilation, the perceptual effect is of two words, "in quest", and would be misleading. Again, the human user will assimilate naturally, but a text to speech system must figure out such details, since articulatory ease is not a factor in most synthesis systems. One way to solve this problem is to impose such assimilation on input from the pronunciation field by a set of post-processing rules. Although this solution would be correct in the majority of cases, blanket application of such rules is not always appropriate for lexical exceptions. For example, assimilation is optional for words like "uncaring", in this case related to the morphological structure of the lexical item. A TTS system will probably already have such rules since they are inherent in the grapheme-to-phoneme approach. Thus, it could be argued that there is no need for the dictionary pronunciation, since with a complete and comprehensive grapheme-to-phoneme conversion system, a list which requires post-processing is simply inadequate and unnecessary. This is the approach taken, for example, by [14], who makes use of small word lists (the main dictionary being 25K stored forms) and several affix tables to recognize graphemic forms, which are then transformed into phonemic representations;

(b) only a small percentage of possible words are listed with pronunciations in a dictionary. For example, Webster's Seventh contains about 70,000 headwords, but is missing words like "computerize" and "computerization" since they came into frequent use in the language after the 1963 publication date. Two solutions to this problem present themselves. One is to expand the word list from the dictionary to include run-on's, as illustrated in examples (3) and (4), and discussed in Section 2.3. The other is to build a morphological generator.
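The two remedies for drawback (b) can be read as a lookup order: consult the headword list first, then the run-on forms, and only then a generator. The Python sketch below illustrates that order under stated assumptions. All names (HEADWORDS, RUN_ONS, SUFFIXES, generate_forms, find_word) are hypothetical, the word lists are toy data drawn from the examples in this paper, and the "generator" is a deliberately naive stand-in that only concatenates a headword with a suffix; a real morphological generator would be far richer.

```python
# Toy data: headwords, run-ons from examples (7) and (8), and two suffixes.
HEADWORDS = {"sharp", "succession", "computer"}
RUN_ONS = {
    "sharply": "sharp",
    "sharpness": "sharp",
    "successional": "succession",
    "successionally": "succession",
}
SUFFIXES = ("ize", "ization")


def generate_forms(headword, suffixes=SUFFIXES):
    """Naive stand-in for a morphological generator: headword plus suffix."""
    return {headword + suffix: (headword, suffix) for suffix in suffixes}


def find_word(word):
    """Return (source, base form), trying the cheapest resource first."""
    if word in HEADWORDS:
        return ("headword", word)
    if word in RUN_ONS:
        return ("run-on", RUN_ONS[word])
    # Invoked only when the word is in neither list.
    for headword in sorted(HEADWORDS):
        if word in generate_forms(headword):
            return ("generated", headword)
    return None


if __name__ == "__main__":
    for query in ("sharp", "sharpness", "computerization", "inquest"):
        print(query, find_word(query))
```

The ordering reflects the cost argument made in this paper: stored forms are consulted before anything is generated, so the generator runs only for words missing from the dictionary.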

Such a morphological generator would use headwords, part of speech, and other information as input, as discussed in Section 2.2, and would be invoked when the word does not figure in the headword list.

5 Final Remarks

Although limitations clearly constrain the use of MRDs in TTS, we have demonstrated in this paper that it is more cost efficient to post-process underspecified dictionary information such as inflection, pronunciation, and part-of-speech, rather than generate rules from scratch to arrive at the same end point. For speech synthesis, the data is not always perfect, and often must be post-processed. This paper has demonstrated ways we have successfully used dictionary data in TTS systems, ways we have post-processed data to make it more useful, and ways data cannot be easily post-processed or used.

Of course, for any TTS system, the power of the dictionary data can be found at the lexical, phrasal, and [...] level. Although any word list such as a dictionary is by definition closed, whereas language is open-ended, dictionary data has proven to be useful from both a theoretical and practical point of view.

References

[1] Branimir Boguraev, Roy Byrd, Judith Klavans, and Mary Neff. From machine readable dictionaries to a lexical knowledge base. Detroit, Michigan, 1989. First International Lexical Acquisition Workshop.
[2] David Carter. LDOCE and speech recognition. In Branimir Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, chapter 6, pages 135-152. Longman, Burnt Mill, Harlow, Essex, 1989.
[3] Martin S. Chodorow, Roy J. Byrd, and George E. Heidorn. Extracting semantic hierarchies from a large on-line dictionary. In Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, pages 299-304. Association for Computational Linguistics, 1985.
[4] Paul Cohen. [...] to sound conversion for text to speech. 1982.
[5] John Coleman. Computation of candidate synthesis units. In 11222-930719-07TM, Murray Hill, N.J., USA, 1993. Technical Memorandum, AT&T Bell Laboratories.
[6] John Coleman and Pilar Prieto. Accurate pronunciation rules for American Spanish text-to-speech. In 11222-930719-06TM, Murray Hill, N.J., USA, 1993. Technical Memorandum, AT&T Bell Laboratories.
[7] Collins Spanish Dictionary: Spanish-English. Collins Publishers, Glasgow, 1989.
[8] P-H. Cousin, L. Sinclair, J-F. Allain, and C. E. Love. The Collins Paperback French Dictionary: French-English. English-French. Collins Publishers, London, 1989.
[9] Alain Duval et al. Robert Encyclopedic Dictionary (CD-ROM). Hachette, Paris, 1992.
[10] M. Gonzales. Collins Gem Spanish Dictionary: French-English. English-French. Harper Collins Publishers, London, 1990.
[11] Judith Klavans and Sara Basson. Documentation of letter to sound components of the WALRUS text to speech system. 1984.
[12] Judith Klavans and Evelyne Tzoukermann. The BICORD system: Combining lexical information from bilingual corpora and machine readable dictionaries. In Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, Finland, 1990.
[13] Judith L. Klavans, Martin S. Chodorow, and Nina Wacholder. From dictionary to knowledge base via taxonomy. Centre for the New Oxford English Dictionary and Text Research: Electronic Text Research, University of Waterloo, Canada, 1990. Proceedings of the Sixth Conference of the University of Waterloo.
[14] F. Marty. Trois systèmes informatiques de transcription phonétique et graphémique. Le Français Moderne, LX, 2:179-197, 1992.
[15] Merriam. Webster's Seventh New Collegiate Dictionary. G. & C. Merriam, Springfield, Mass., 1963.
[16] M. Neff and B. Boguraev. Dictionaries, dictionary grammars and dictionary entry parsing. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 1989. Association for Computational Linguistics.
[17] Joe P. Olive and Mark Y. Liberman. A set of concatenative units for speech synthesis. In J. J. Wolf and D. H. Klatt, editors, Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America, pages 515-518, New York: American Institute of Physics, 1979.
[18] Paul Procter, editor. Longman Dictionary of Contemporary English. Longman Group, Burnt Mill, Harlow, Essex, 1978.
[19] Elisabeth O. Selkirk. The Syntax of Words. MIT Press, Cambridge, Mass., 1982.
[20] Evelyne Tzoukermann and Mark Y. Liberman. A finite-state morphological processor for Spanish. In Proceedings of Coling90, Helsinki, Finland, 1990. International Conference on Computational Linguistics.
