Tool for Czech Pronunciation Generation Combining Fixed Rules with Pronunciation Lexicon and Lexicon Management Tool
Total Page:16
File Type:pdf, Size:1020Kb
Tool for Czech Pronunciation Generation Combining Fixed Rules with Pronunciation Lexicon and Lexicon Management Tool Petr PollaÂk, VaÂclav HanzÏl Czech Technical University in Prague CVUT FEL K331 Technicka 2, 166 27 Praha 6 - Dejvice, Czech Republic [email protected], [email protected] Abstract This paper presents two different tools which may be used as a support of speech recognition. The tool ªtranscº is the ®rst one and it generates the phonetic transcription (pronunciation) of given utterance. It is based mainly on ®xed rules which can be de®ned for Czech pronunciation but it can work also with speci®ed list of exceptions which is de®ned on lexicon basis. It allows the usage of ªtranscº for unknown text with high probability of correct phonetic transcription generation. The second part is devoted to lexicon management tool ªlexeditº which may be useful in the phase of generation of pronunciation lexicon for collected corpora. The presented tool allows editing of pronunciation, playing examples of pronunciation, comparison with reference lexicon, updating of reference lexicon, etc. 1. Introduction 2. Phonetical alphabet for Czech One of the most important tasks within speech reco- It is fact that any of®cially de®ned standard of Czech gnition is the generation of the pronunciation for given phoneme inventory has not been de®ned yet. We can ®nd utterance. Both in training or in recognition phase, the in- several attempts to its de®nition, e.g. in (Nouza et al., formation about the pronunciation of given or supposed 1997) or (http://www.phon.ucl.ac.uk/home/sampa/czech- utterance must be available. The pronunciation can be obta- uni.htm), but any alphabet has not been accepted ined either from a pronunciation lexicon or using a tool for as the standard. The latest Czech SAMPA proposal automated conversion of orthographic transcription to the is described in (http://noel.feld.cvut.cz/sampa/CZECH- phonetic one. This paper is devoted to the solution of this SAMPA.html) That's reason why we must brie¯y describe problem for Czech language. the alphabets used for phonetic transcription in this work. Czech has quite regular pronunciation, so the rules for We work with two different phonetical alphabets which automated generation of phonetic transcription (pronunci- differ only in used symbols for de®ned inventory of phone- ation) can be de®ned excepting some words, mainly the mes with only one exception. It will be discussed later. words of foreign origin. A special tool solving this pro- The ®rst one is IPA CTU (Internal Phonetic Alphabet blem ªtranscº was developed in Speech Processing Group created at Czech Technical University in Prague) which at Czech Technical University in Prague. When the automa- is very close to the alphabet published in (Nouza et al., ted generation of phonetic transcription cannot be applied 1997). It is used historically by many tools written at our for a word with some pronunciation irregularities (due a laboratory and it seems to be very convenient especially con¯ict with the general rule), the special simple syntax for the following reasons. Each phoneme is always repre- of orthographic transcription can be used to re-de®ne au- sented by single character and the symbols are very close tomatically generated regular pronunciation. This syntax is to the transcription of the pronunciation using standard described in section 3. Czech orthography. Consequently, it mainly respects stan- The second approach of utterance pronunciation gene- dard Czech pronunciation rules, i.e. applying ªtranscº to ration is based on pronunciation lexicon. This approach is the word in this phonetic transcription we obtain the same not optimal for Czech due to high number of in¯ections result. The exception from this rule can be ®nd for spe- in the language because it leads to growing size of the le- cial phonemes which are represented by upper case letters. xicon. On the other hand, irregular pronunciation can be These phonemes are quite rare and they may be represen- easily involved in the lexicon and any special orthography ted usually by equivalents with two phonemes written by transcription isn't required in this case. This is the main lower case letters (ªE - euº, ªA - auº, ªZ - dzº, etc.), see advantage of the lexicon approach. Because the creation of tab. 1. Many times it can be done because the threshold be- pronunciation lexicon is really hard work with necessary tween these two pronunciation is often quite soft, i.e. we can manual checks we have designed lexicon management tool observe smooth passing of two phonemes ªeuº into single ªlexeditº with quite simple control which is described in the phoneme ªEº. section 4. Finally, we use small lexicon as a part of tool for auto- As the ®rst step to standardization, CZECH SPEECH- mated pronunciation generation (ªtranscº), where the words DAT SAMPA was de®ned. It originates from IPA CTU but with pronunciation exceptions are stored. it uses already existing symbols in SAMPA alphabets for equivalent phonemes from other languages. This alphabet were de®ned within SpeechDat(E) project, see (CÏernocky et al., 2000). The correspondence between these two alphabets is in the tab. 1. There is only one small difference in pho- neme inventory. CZECH SPEECHDAT SAMPA doesn't of context sensitive grammars (which permits any order of resolve voiced and unvoiced allophones of ªrϺ. rules) but leads to much reduced complexity and also ena- CZECH SPEECHDAT SAMPA is the alphabet which bles us to use smaller set of symbols ± each 8-bit character is usually used at the international level. But the transcrip- may serve as both terminal and non-terminal symbol with tion of Czech pronunciations is very strange. That is reason slightly different meanings in various stages of processing. why for our private purposes we use IPA CTU. The Czech reader may appreciate very convenient phonetical transcrip- 3.1. Input format tion which is very close to Czech orthography and which is The orthographic transcription at the input of this generally easily understandable to Czech people, see tab 1. tool must obey conventions which can be summarized as When we annotated Czech SpeechDat database on the pro- follows: nunciation level we made the experience that people without any special education in this area could easily understand The convertible text must be written in regular words this transcription after short training period. with standard Czech syntax. The transcription can con- tain punctuation. 3. transc - tool for pronunciation generation The words with irregular pronunciation are typed with Czech language has generally regular pronunciation following syntax: ª(orthography/pronunciation)º. which corresponds closely with written form of the text. The pronunciation is typed using standard Czech Some irregularities appear mainly within the words of fore- alphabet in any way which yields the correct result ign origin. The most frequent case are personal and company according to regular pronunciation rules, i.e. according names, scienti®c terminology, technical terms, etc. Some of to rule ªwrite what you hearº. these irregularities are quite systematic and they can be handled by regular rules. Others can only be covered by Ex: (monitor/monytor) exception lexicon lookup. Czech, similarly as other Slavic languages, has highly in- Known irregularities are included in internal lexicon ¯ective nature. Typically, nouns, adjectives, pronouns, and of exceptions. This list can be extended by user de®ned numerals are declined. They may have up to 7 different external lexicon which includes transcriptions accor- forms both in plural and singular. Verbs may be conjugated ding to previously de®ned rule. The growing database in 6 different forms for each tense, sometimes the forms dif- of irregularities continuously decreases the probabi- fer according to gender. Using the pronunciation lexicon, its lity of bad orthographic-to-phonetic conversion, i.e. it size rapidly increases. Moreover, many forms of the words minimizes the necessity of special syntax in the ortho- differ only in terminations with regular pronunciation while graphy transcription of new utterances. their bases remain unchanged. That's the reason why the The transcription is generally case insensitive. Whole usage of ®xed pronunciation rules in a conversion algori- text is in the beginning converted to lower case form. thm seems to be more ef®cient for the generation of written However sometimes it may be useful to distinguish text pronunciation than the lexicon approach. between two words which differ just in the case The most complicated part of ªregularº pronunciation of the letters. These words may be included in generation is proper handling of voiced/voiceless assimilati- the external lexicon. Order of lexicon lookup and ons in consonant sequences. Consonant sequences may be conversion to lowercase is problematic to decide, quite long in Czech words and most often the sequence assi- each possibility has some disadvantages ± therefore milates to be either voiced or unvoiced as a whole. However we repeat the lookup both before and after the con- individual consonants behave in different ways during this version to lowercase to enable case sensitive handling assimilation ± they may be more or less dominant and force where needed and case insensitive handling otherwise. its neighbor to share voiced/voiceless quality with them, they may be more or less likely to be dominated, these re- Ex: Detektor rÏecÏi znacÏÂmeõ VAD. lations may differ in forward and backward directions and Detektor rÏecÏi znacÏõÂme ve Âa deÂ. may depend on context. Chains of assimilations may cross detektor rÏecÏy znacÏÂmey ve Âa de word boundaries. On the other hand, certain consonants may split consonant sequences into parts which differ in Ex: VyÂrobek ma mnoho vad. voiced/voiceless quality. Overall number of combinations VyÂrobek ma mnoho vad. is immense and software realization of these ªcommon and vyÂrobek ma mnoho vat simpleº assimilations gets rather complicated.