Machine-Readable Dictionaries in Text-To-Speech Systems
Total Page:16
File Type:pdf, Size:1020Kb
Machine-Readable Dictionaries in Text-to-Speech Systems Judith L. Klavans~and Evclyne Tzoukcrmann * "~Colnmbia University, l)epartment of Computer Science, New York, New York 10027 [email protected] ~* A.T.&T. Bell Laboratories, 600 Monntain Avenue, Murray llill, N.J. 07974 [email protected] Abstract 2 Using MRDs in Text to This paper presents the results of an experiment Speech usiug machine-readable dictionaries (Mill)s) and Several problems are addressed in this paper; corpora for building concatenativc units for text one concerns tile subtle comple×itics and idiosyn- to speech (T'PS) systems. Theoretical questions crasies ilwolved iu parsing dictionaries and ex- concerning the nature of t)honemic data in dic- tracting data. Added to this is the lack of consis- tionaries are raised; phonemic dictionary data is tency both within the same dictionary and across viewed as a representative corpus over which to dictionaries which often requires ad hoc proce- extract n- gram phonemic frequencies in the lan- dures for each- resource. Another issue relates to guage. Dictionary data are compared to corpus tile structure of the modules of a TTS system, data, and phoneme inventories arc evaluated for specifically ill the grapheme-to-phoneme compo- coverage. A methodology is defined to compute nent; dictionary lookup depends on several factors I)honemic n-grams for incorporation into a TTS including size, machine power and storage, factors system. that have important consequences for the extrac- tion ofconcatenative nnits. Another consideration 1 Introduction concerns tile nature of the language itself: a lan- guage with irregular graphcme4o-phoneme map- The majority of speech synthesis systems use ping and lexically determined stress assignment two techniques: concatenation and formant- (such as English) benefits rnost from the large ex- synthesis. Building a comprehensive and intelli- ception list which a dictionary can provide, There gible concatenative-based speech synthesis system is also the practical issue of dictionary availabil- relies heavily on the successfid choice of concate- ity, and of pronunciation field accuracy within an native units. Our results contribute to the t~sk of available dictionary. Thus, decisions on the use developing an eificient and elfective methodology of MRD data depend on many factors, and can for reducing the potentially large set of concaten- significantly impact efficiency and accuracy of a live units to a manageable size, and to chosing the speech system. optimal set for recording and storage. Since a dictionary entry consists of several The paper is aimed primarily at two audiences: fields of information, naturally, each will bc use- one consists of those concerned with research on rid for different applications [1]. Among the stan- the automatic use of MR.D data; the other are dard fields are prommciation, etymology, sub- TTS system designers who require linguistic and jcct field notes, definition fields, synonym and lcxicographic resources to improve and streamline antonym cross references, semantic and syntactic system-building. Issues of morphological analysis comments, run-on forms, conjugational class and and generation, as well as stress assigmnent based inflectional information where relevant, and trans- on dictiona.ry data, are discussed. lation for the I)ilingual dictionaries. Each of these fields has proven usefifl for different applications, such as for building semantic taxouomies [3], [13] and machine translation [12]. The most directly useflfl for TTS is the pronunciation field [4], [11]. Equally usefifl for TTS, but less dir6ctly acces- 971 sible, are data from run-on fields, conjugational which has a rough ratio of eight morphologically class information, and part-of-speech. 1 inflected words for one baseform, Robert lists only To illustrate, the following partial entries from the non-inflected forms of the lexical entries. Itow- Webster's Seventh (W7) [15] illustrate typical pro- ever, if pronunciation varies during inflection of nunciation, definition, and run-on fields: nouns and adjectives, the pronunciation field re- flects that variation which makes the information (l) ha.yen/'h.~-v0n/ n 1: IIAnBOR, POLO' 2 : a difficult to extract automatically. For example, in place of safety : ASYLUM haven vt (5) and (6), one needs to know the nature of the (2) bi.son/'brs-on, 't>iz-/ n ... rule to apply in order to relate both forms of the (3) ho.m,,.ge.neous /-'j~-ne-0s,-ny0s/ ... adjective. (4) den.tic.u.late/den-'tik-y0-1~t/ or den.tic.u.lat.ed/-,lat-od/ adj (5) blanc, blanche/bl~, blbJ'/adj, et n. The entry for "haven" contains one fnll pronun- (6) vif, rive/vif, viv/adj, et n. ciation. The entry for "bison" has one alterna- In (5), the masculine/bl~/is obtained by remov- tive, but the user must figure out that the /on/ ing the phoneme /J'/ from the feminine /bl~,j'/ should be appended after 't)]z-/, as in the first pro- (blanche, "white" ). In (6), the form masculine mmciation, in order to obtain the correct varia- fornr/vif/("sharp, qnick") is formed by stripping tion. Correct pronunciation for "homogeneous" the affix /ve/ and substituting the phoneme /f/. relies on the pronunciation of the previous en- Notice that tile rules are different in nature, the try, "homogeneity" , and requires the user to sep- first being a addition/deletion relation, and the arate and bring the prefix "homo-" from one en- second being a substitution. try to another. To complicate matters, the alter- In this project, the dictionary pronunciation native pronunciation for the suffix /n6.-as/-nyas/ field was used to start building the phonetic inven- must also be correctly interpreted by the user. Fi- tory of a speech synthesis system. For the French nally, "dentieulate" has a morphologically related TTS system [?], the set of diphones was estab- run-on form "denticulated" in tile early part of lished by taking most of the thirty-flve phonemes the entry, and the pronunciation of that run-on is for French and coupling them with each other (352 related to the main entry, but the user must de- = 1225 pairs). Then, the diphones were extracted cide how to strip and append the given syllables. from the pronunciation field for headwords in the 2 While these types of reasoning are not difficult Robert dictionary. A program was written to for humans, for whom the dictionary was written, search through the dictionary phonetic field and they are quite difficult for programs, and thus are select the longest word where the phoneme pairs not straighforward to perform automatically. would be in mid-syllable position. For example, the phonemic pair/lo/was found in the pronun- 2.1 Using the MRD pronunciation ciation field/zoolo3ik/corresponding to the head- field word zoologiquc "zoologic." Out of 1225 phonemic pairs, 874 words were Extracting the prommciation field from an MRD is fonnd with at least one occurence of the pair. one of the most obvious uses of a dictionary. Nev- The pair [headword_orth, headword_phon] was ex- ertheless, parsing dictionaries in general can be a tracted and headword_orth was placed in a carrier very complex operation ([16]) and even the extrac- sentence for recording. For instance, the speaker tion of one field, such as prommciation, can pose would utter the following sentence: "C'est zo- problems. Similar to W7, in the Robert French ologique que je dis" where "C'est ... que je dis" dictionary [9], which contains about 89,000 entries, is the carrier sentence. Due to the lack of explicit several pronunciations can be given for a head- inflectional information for nmms and adjectives, word and the choice of one must be made. More- only the non-inflected forms of the entries were ex- over, because of the rich morphology of French tracted during dictionary lookup for building tile 1Notice, however, that the fifll Collins Spanish-English diphone table. Similarly for verbs, only the infini- dictionary [7], as opposed to the other bilinguals, does not tive forms were used since the dictionary does not contain any prommciatlon information. Although this is list the inflected forms as headwords. This exem- rather surprising taking into account that the smaller ver- plifies the most simple way to use pronunciation si...... h as the paperback and g.... ([8], [lO]) do 1..... phonetic field, it could be attributed to the fact that pro- field data, which we have completed. A pronun- mmciation miles in Spanish are relatively predictable. ciation list of around 85,796 phonetic words was [2] reports on the need to resyllablfy entries already syl- obtained from the original list of ahnost 89,000 en- labified in LDOCE [18], since syllable boundaries for writ- tries, i.e. 96% of the entries. The remaining 4% ten forms usually reflect hyphenation conventions, rather than phonologically motivated syllabification conventions consist primarily of prefixes and suffixes which are necessary for pronunciation. listed in the dictionary without pronunciations, 972 and which should not be used in isolation in arty 3 Methodology and Results ease. 3.1 Collecting Data 2.2 Using the MRD for morphology As stated al)ove, out of ahnost 89,000 headwords Even though an MRI) may not list complete in- in the dictionary, 874 phonemic pairs,which repre- tlectional paradigms, it contains useful inflectional sents 71% of the total, were found. This is due to information. For example in the Collins Spanish- the fact that (a) the lookup occurs only on non: English dictionary, verb entries are listed with an inflected words, thus a limited sample of the lan- index pointing to the conjugation chess and table, guage, (b) because the dictionary consists of a list listed at the end of the dictionary. Using this infer of isolated words, it does not aceonnt for inter: word boundary phenomena. Sitme French liaison mation, a finite-state transducer for morphological analysis and generation was built for Spanish [20].