Tool for Czech Pronunciation Generation Combining Fixed Rules with Pronunciation Lexicon and Lexicon Management Tool

Petr Polla´,Va´clav Hanzˇl

Czech Technical University in Prague CVUT FEL K331 Technicka 2, 166 27 Praha 6 - Dejvice, Czech Republic [email protected], [email protected]

Abstract This paper presents two different tools which may be used as a support of speech recognition. The tool “transc” is the first one and it generates the phonetic transcription (pronunciation) of given utterance. It is based mainly on fixed rules which can be defined for Czech pronunciation but it can work also with specified list of exceptions which is defined on lexicon basis. It allows the usage of “transc” for unknown text with high probability of correct phonetic transcription generation. The second part is devoted to lexicon management tool “lexedit” which may be useful in the phase of generation of pronunciation lexicon for collected corpora. The presented tool allows editing of pronunciation, playing examples of pronunciation, comparison with reference lexicon, updating of reference lexicon, etc.

1. Introduction 2. Phonetical for Czech One of the most important tasks within speech reco- It is fact that any officially defined standard of Czech gnition is the generation of the pronunciation for given inventory has not been defined yet. We can find utterance. Both in training or in recognition phase, the in- several attempts to its definition, .. in (Nouza et al., formation about the pronunciation of given or supposed 1997) or (http://www.phon.ucl.ac.uk/home/sampa/czech- utterance must be available. The pronunciation can be obta- uni.htm), but any alphabet has not been accepted ined either from a pronunciation lexicon or using a tool for as the standard. The latest Czech SAMPA proposal automated conversion of orthographic transcription to the is described in (http://noel.feld.cvut.cz/sampa/CZECH- phonetic one. This paper is devoted to the solution of this SAMPA.html) That’ reason why we must briefly describe problem for . the used for phonetic transcription in this work. Czech has quite regular pronunciation, so the rules for We work with two different phonetical alphabets which automated generation of phonetic transcription (pronunci- differ only in used symbols for defined inventory of phone- ation) can be defined excepting some words, mainly the mes with only one exception. It will be discussed later. words of foreign origin. A special tool solving this pro- The first one is IPA CTU (Internal Phonetic Alphabet blem “transc” was developed in Speech Processing Group created at Czech Technical University in Prague) which at Czech Technical University in Prague. When the automa- is very close to the alphabet published in (Nouza et al., ted generation of phonetic transcription cannot be applied 1997). It is used historically by many tools written at our for a word with some pronunciation irregularities (due a laboratory and it seems to be very convenient especially conflict with the general rule), the special simple syntax for the following reasons. Each phoneme is always repre- of orthographic transcription can be used to re-define au- sented by single character and the symbols are very close tomatically generated regular pronunciation. This syntax is to the transcription of the pronunciation using standard described in section 3. Czech . Consequently, it mainly respects stan- The second approach of utterance pronunciation gene- dard Czech pronunciation rules, i.e. applying “transc”to ration is based on pronunciation lexicon. This approach is the word in this phonetic transcription we obtain the same not optimal for Czech due to high number of inflections result. The exception from this rule can be find for spe- in the language because it leads to growing size of the le- cial which are represented by upper case letters. xicon. On the other hand, irregular pronunciation can be These phonemes are quite rare and they may be represen- easily involved in the lexicon and any special orthography ted usually by equivalents with two phonemes written by transcription isn’ required in this case. This is the main lower case letters (“E - eu”, “A - au”, “ - dz”, etc.), see advantage of the lexicon approach. Because the creation of tab. 1. Many times it can be done because the threshold be- pronunciation lexicon is really hard work with necessary tween these two pronunciation is often quite soft, i.e. we can manual checks we have designed lexicon management tool observe smooth passing of two phonemes “eu” into single “lexedit” with quite simple control which is described in the phoneme “E”. section 4. Finally, we use small lexicon as a part of tool for auto- As the first step to standardization, CZECH SPEECH- mated pronunciation generation (“transc”), where the words DAT SAMPA was defined. It originates from IPA CTU but with pronunciation exceptions are stored. it uses already existing symbols in SAMPA alphabets for equivalent phonemes from other languages. This alphabet were defined within SpeechDat(E) project, see (Cˇernocky´et al., 2000). The correspondence between these two alphabets is in the tab. 1. There is only one small difference in pho- neme inventory. CZECH SPEECHDAT SAMPA doesn’t of context sensitive (which permits any order of resolve voiced and unvoiced allophones of “rˇ”. rules) but leads to much reduced complexity and also ena- CZECH SPEECHDAT SAMPA is the alphabet which bles us to use smaller set of symbols – each 8-bit character is usually used at the international level. But the transcrip- may serve as both terminal and non-terminal symbol with tion of Czech pronunciations is very strange. That is reason slightly different meanings in various stages of processing. why for our private purposes we use IPA CTU. The Czech reader may appreciate very convenient phonetical transcrip- 3.1. Input format tion which is very close to Czech orthography and which is The orthographic transcription at the input of this generally easily understandable to Czech people, see tab 1. tool must obey conventions which can be summarized as When we annotated Czech SpeechDat database on the pro- follows: nunciation level we made the experience that people without

any special education in this area could easily understand ¯ The convertible text must be written in regular words this transcription after short training period. with standard Czech syntax. The transcription can con- tain punctuation. 3. transc - tool for pronunciation generation

¯ The words with irregular pronunciation are typed with Czech language has generally regular pronunciation following syntax: “(orthography/pronunciation)”. which corresponds closely with written form of the text. The pronunciation is typed using standard Czech Some irregularities appear mainly within the words of fore- alphabet in any way which yields the correct result ign origin. The most frequent case are personal and company according to regular pronunciation rules, i.e. according names, scientific terminology, technical terms, etc. Some of to rule “write what you hear”. these irregularities are quite systematic and they can be handled by regular rules. Others can only be covered by Ex: (monitor/monytor) exception lexicon lookup.

Czech, similarly as other , has highly in- ¯ Known irregularities are included in internal lexicon flective nature. Typically, nouns, adjectives, , and of exceptions. This list can be extended by user defined numerals are declined. They may have up to 7 different external lexicon which includes transcriptions accor- forms both in plural and singular. Verbs may be conjugated ding to previously defined rule. The growing database in 6 different forms for each tense, sometimes the forms dif- of irregularities continuously decreases the probabi- fer according to gender. Using the pronunciation lexicon, its lity of bad orthographic-to-phonetic conversion, i.e. it size rapidly increases. Moreover, many forms of the words minimizes the necessity of special syntax in the ortho- differ only in terminations with regular pronunciation while graphy transcription of new utterances. their bases remain unchanged. That’s the reason why the

¯ The transcription is generally case insensitive. Whole usage of fixed pronunciation rules in a conversion algori- text is in the beginning converted to lower case form. thm seems to be more efficient for the generation of written However sometimes it may be useful to distinguish text pronunciation than the lexicon approach. between two words which differ just in the case The most complicated part of “regular” pronunciation of the letters. These words may be included in generation is proper handling of voiced/voiceless assimilati- the external lexicon. Order of lexicon lookup and ons in consonant sequences. Consonant sequences may be conversion to lowercase is problematic to decide, quite long in Czech words and most often the sequence assi- each possibility has some disadvantages – therefore milates to be either voiced or unvoiced as a whole. However we repeat the lookup both before and after the con- individual consonants behave in different ways during this version to lowercase to enable case sensitive handling assimilation – they may be more or less dominant and force where needed and case insensitive handling otherwise. its neighbor to share voiced/voiceless quality with them, they may be more or less likely to be dominated, these re- Ex: Detektorˇec ˇi znacˇı«me VAD. lations may differ in forward and backward directions and Detektorˇec r ˇi znacˇı«me ve«a«de«. may depend on context. Chains of assimilations may cross detektorˇec r ˇy znacˇy«me ve«a«de« word boundaries. On the other hand, certain consonants may split consonant sequences into parts which differ in Ex: Vy«robek ma« mnoho vad. voiced/voiceless quality. Overall number of combinations Vy«robek ma« mnoho vad. is immense and software realization of these “common and vy«robek ma« mnoho vat simple” assimilations gets rather complicated. See (Palkova´, ¯ All numbers should be transcribed in words. 1984) for extensive list of details.

Most transformations leading from orthographic text to ¯ There are special characters and symbols with different list of phonetic symbols are accomplished using sets of meanings: context-sensitive rules. Each rule replaces certain $xx spelled letter sequence of symbols by other sequence of symbols and is applied as many times as possible. Once the rule can do disjunctor no more replacements, next rule in set is applied. All ru- [yyy] non-speech events and special noises les in the set are applied in order from the first rule in the *, ˜ , % special marks for mispronunciation, word set to the last rule. This differs slightly from usual concept truncation, etc. (SpeechDat syntax) CTU IPA Czech SpeechDat Ortography CTU IPA Czech SpeechDat Ortography SAMPA SAMPA a ta´ta a ta:ta ta´ta nˇ konˇe koJe koneˇ a´ ta´ta a: ta:ta ta´ta kolo o kolo kolo A Ato a u a uto auto O pOze o u po uze pouze ba´ba b ba:ba ba´ba p pupen p pupen pupen cesta t s t sesta cesta o´ o´da o: o:da o´da

cˇ cˇyHa´ t S t Sixa: cˇicha´ r bere r bere bere Ò

jeden d jeden jeden ˇr morˇe PÒ moP e morˇe

Ò Ò Ò d’ d’elat J J elat deˇlat Rˇ keRˇ PÒ keP kerˇ e lef e lef lev s sut s sut sud e´ me´nˇe e: me:Je me´neˇ sˇ dusˇe S duSe dusˇe E Ero e u e uro euro t duty´ t duti: duty´ fAna f fa una fauna t’ kut’yl c kucil kutil

g guma g guma guma u dusˇe u duSe dusˇe Ò hat hÒ h at had u´ ku´ u: ku:l ku˚l H Hudy´ xudi: chudy´ la´va v la:va la´va j dojat j dojat dojat byl, byl i bil bil, byl k kupec k kupet s kupec y´ vy´tr, ly´ko i: vi:tr, li:ko vı´tr, ly´ko

l d’ela´ l JÒela: deˇla´ z koza z koza koza ma´ma m ma:ma ma´ma Z leZgde d z led zgde leckde M traMvaj F traFvaj tramvaj zˇ ru´zˇe Z ru:Ze ru˚zˇe vy´no n vi:no vı´no Zˇ ra´Zˇ a d Z ra:d Za ra´dzˇa N baNka N baNka banka

Table 1: Correspondence between CTU IPA and SAMPA

3.2. Stages of processing ¯ Transcription of certain patterns often found in foreign Individual transformations performed during conver- word (internal lexicon). Foreign words not covered by sion of orthographic text (with certain special marks) to dictionary lookup can still be processed right in this pronunciation can be approximately summarized as the stage. (Conversion of “” should precede as some of following steps (throughout these steps, CTU IPA is used these patterns ending in “c” would cause errors here.)

as internal representation):

! ! ¯ Expansions “x” “” and “q” “kv”.

¯ Process marks for non-speech events, special noises,

¯ Assimilation exceptions – this stage handles marginal mispronunciation and word truncation (“[yyy]”, “*”, but yet regular patterns not covered by rules below. “˜”, “%”. These marks just go through unchanged and they appear unchanged at corresponding places of out- ¯ Replacement of “d”, “t”, “n” by “d’”, “t’”, “nˇ” when put. followed by one of “i”, “ı´”,“eˇ”. (This is not true assimilation, just quirk of Czech orthography.)

¯ Replace “(orthography/pronunciation)”by “pronunci- ation” only. Including this step rather early in proces-

¯ Handling of “b”, “p”, “v”, “f”, “m” followed by “eˇ”. sing makes possible formats of pronunciation much more readable – user does not have to care about tiny ¯ Ends of words made voiceless (unless interword influ- details like certain mandatory allophones; following ence happens). stages will most likely do a better job than the user

could do. It is also possible to change level of details ¯ Backward assimilation – certain phones made voiced, ! (using or not using certain allophones) without chan- e.g. “sb” ! “zb”, “kd” “gd”. ging the input text.

¯ Backward assimilation – certain phones made -

¯ Do some normalizations of text – delete interpunction, less. This is the most common type of assimilation shrink sequences of spaces etc. which happens inside words. Dominant phones are “k”, “cˇ”, “s”, “sˇ”, “p”, “t”, “H” and they influence

¯ First exceptions lexicon lookup and replacement. preceding phones “h”, “v”, “z”, “d”, “zˇ”, “b”.

¯ Conversion of text to lowercase.

¯ Forward assimilation – much less frequent, most no- ! ¯ Second exceptions lexicon lookup and replacement. tably “sh” “sH” (but even this depends on dialect). This stage covers capitalized words at the beginning

of sentences. ¯ “dzˇ”, “dz”, “au” which survived till this point are con-

verted to “Zˇ ”, “Z”, “A”.This replacement is in principal ! ¯ Conversion “ch” “H”. At this stage we started reuse similar to replacement of “ch” made earlier but is much of capital letters as additional phone symbols. less sure and can be prevented by rules between. ¯ Handling of certain allophones, e.g. “M” (allophone of 4.1. Generation of lexicon “m”) and “N” (allophone of “n”). The creation of pronunciation lexicon is usually the work for an expert in phonetics or linguistics. Concerning the ¯ Remove disjunctors “ ”. Disjunctors may be used in dictionary and “(/)” to prevent false assimilations and fact that many people don’t pronounce words correctly, real false interpretation of letter couples, e.g. “d zˇ” where pronunciation of recorded speech data can be quite diffe- “d” and “zˇ” should be separate phones. rent from the regular one. That’s the reason why we hand- checked the difference between automatically generated and

¯ Conversion of CTU IPA symbols to SAMPA if requi- real pronunciation during the annotations. Observed diffe- red. rences were marked according to above described rule for The above set of transcription rules explains the logic orthographic transcription. When speech database is anno- behind “transc” pronunciation generation but is in no means tated this way, it allows us to generate lexicon of correct real exhaustive; program itself covers much wider range of slight pronunciations from orthographic transcription of recorded dependencies and irregularities than we can describe here. utterances. This procedure seems to be quite efficient for generation 3.3. Processing options available to user of pronunciation variants of some words because the annota- An updated version of “transc” contains many options tor can always compare automatically generated pronunci- which allow effective usage of this tool for different ortho- ation by a tool (“transc” in our case) and real pronunciation graphic transcriptions. uttered by speaker. It reads input data from stdin stream and writes output to stdout stream so it can be easily included in a pipe for 4.2. Lexicon structure text processing during speech recognition. The information about word pronunciation is stored in These options can tailor “transc” output for particular the file which has very simple structure. It is simple text use: file where the information for single item is a block of lines separated by empty line. Each significant line begins with

¯ The output can be either in CTU IPA alphabet or in a keyword which resolve the kind of the information. The SAMPA. structure is similar to SAM label file format, used for anno- ˇ ¯ Use of capital letters as additional phone symbols may tation of SpeechDat family databases (Polla´k and Cernocky´, be disabled, thereby making “transc” output suitable 1999). as “transc” input. This option may be used to generate example pronunciations usable in “(/)” format. WRD: word PRN: pronun1;pronun3;pronun3

¯ External exceptions lexicon can be enabled (and selec- OCW: 10 ted) or disabled. OCP: 3;5;2 SND: f11.wav,f12.wav,f13.wav;;f3.wav

¯ Internal foreign words patterns can be disabled.

¯ Interword assimilations can be considered instead of WRD: another word simple word-end unvoicing. .. 4. Pronunciation lexicon - generation and The first two lines of the block for one item are mandatory management tools and these two lines represents also the principle part of the Although fixed conversion rules seems to be more ef- lexicon. Other information are confidential. This structure ficient to obtain phonetic transcription, many approaches allows easy work when some information is missing and require lexicon of pronunciation for given corpus. Typi- also it is open to possible extensions of lexicon with other cally, when we use the algorithms derived from English information. Finally, the file remains readable by simple based recognition where the usage of pronunciation lexicon view and also it can be processed by other general scripts. is the standard. From this point of view, pronunciation lexi- 4.3. Lexicon management tools con should be the part of internationally distributed speech corpora, e.g. databases of SpeechDat family (Polla´k and Updating pronunciation lexicon with new words is usu- Cˇernocky´, 1999). ally provided in several steps. However it requires some Main advantages and disadvantages of lexicon solution manual checks, some other steps may be done automati- for generation of phonetic transcription can be summarized cally. in following points: At the early beginning, the first lexicon had to be always - the size of lexicon increases (due to repetition of very manually checked and corrected. Then the update of chec- similar words), ked lexicon by new words can be slightly automatize, i.e. it may be done in following three steps: - different pronunciation variants of one word cannot be resolved from text orthography, 1. automated comparison with checked lexicon (i.e. fin- + it doesn’t need special orthography for irregular pronun- ding new items), ciation, 2. manual check of new items, + it is standard approach in many working speech recogni- tion systems for other languages. 3. update of final lexicon by corrected new items. Figure 1: Lexicon management tool “lexedit” (pronunciation in IPA CTU)

This procedure is very efficient because after several updates and which uses above described tools. The control is very the number of new words which must be checked is rapidly intuitive and it is done by standard menus, hot keys, decreasing. etc. Java programming language allows efficient imple- Concerning very simple structure of lexicon file, it can mentation of structured lexicon management and it gives be processed by simple scripts for text processing. Our the support for visualisation and some additional functi- scripts representing particular tools are written in Perl. The ons as playing of associated acoustic files etc. The tools most important tools for lexicon management are: can be used independently on the platform, only stan- dard Java virtual machine, runtime class libraries, and Java

¯ lexicon differentiate.pl - it finds new items or items application launcher must be installed on the platform. with another unknown pronunciation, These components are parts of Java Runtime Environment

¯ lexicon update.pl - it updates final lexicon from correc- v1.4 (http://java.sun.com/j2se/1.4/download.html). The ba- ted differential one, bad items in difference lexicon sic window of this tool with an example of one lexicon is should be manually marked and these items are not on fig. 1. involved to the final lexicon, The first version of lexicon management tool was rea- lized under TCL/TK environment but especially complex

¯ lexicon count occurence.pl - computes the rates of editing and processing structured data was more complica- occurrences above given corpus, ted. That was the reason for final choice of Java Program- ming Language. ¯ lexicon find word.pl, lexicon find pron.pl - find given words (pronunciations) from lexicon in given source Very important property of our tool is the possibility of corpus, playing associated audio file which represents the utterance where the word with given pronunciation was spoken. It is

¯ lexicon sampify.pl, lexicon desampify.pl - this pair ma- very important especially during manual check of particular kes conversion between IPA CTU and CZECH SPE- pronunciations. Items with associated audio file are marked ECHDAT SAMPA. with simple icon, see fig. 1. The usage of above described tools is very simple. They 5. Applications can be processed on command-line basis and output is writ- ten to stdout stream. The redirection into specified file 5.1. Analysis of pronunciation variants of digits can be easily provided using operating system syntax. As a by-product of this work we can present the rates of occurrence for pronunciation variants of digits

4.4. User friendly interface

¤ 9 ¼ over Czech database of numerals (database FI- To increase user convenience, we present friendly in- XED2CS - “CˇI´SLOVKY”) which is available from ELRA terface which is written in Java programming language (http://www.icp.grenet.fr/ELRA/home.html). The results presented in tab. 2 are easily derived by computation of Acknowledgement occurrences over given corpus by “lexedit” tool. This paper is the part of the research supported by Czech Grant Agency as grant “Voice Technologies for Support of Digit Rate of variants Ortho Information Society” with No. GACR-102/02/0124.

Also the work of Miloslav Brada, student of Czech Tech- :

0 nula - 1221 ( =½¼¼±) nula nical University in Prague, should be appreciated. He has :

1 jedna - 1219 ( =½¼¼±) jedna created graphical user’s interface for lexicon management

jeden - 2 jeden tool. :

2 dva - 1037 ( = 85±) dva :

dvje - 186 ( =½5±) dveˇ 7. References

: =½¼¼± 3 tPÒi - 1220 ( ) trˇi

: J. Nouza, J. Psutka, and J. Uhlı´rˇ. 1997. Phonetic alpha- =79± 4 t StiPÒi - 965 ( ) cˇtyrˇi

: bet for speech recognition of Czech. Radioengineering,

t Stiri - 47 ( =4±) cˇtyry

: 6(4):16–20, December. = ½½± StiPÒi - 137 ( ) sˇtyrˇi

: Z. Palkova´. 1984. Fonetika a fonologie cˇesˇtiny. Univerzita

Stiri - 72 ( =6±) sˇtyry

: Karlova, vydavatelstvı´ Karolinum. 5 pjet - 1224 ( =½¼¼±) peˇt ˇ

: P. Polla´k and J. Cernocky´. 1999. Specification of speech

6 Sest - 1222 ( =½¼¼±) sˇest

: database interchange format. Technical report, Speech-

7 sedm - 280 ( =¾¿±) sedm

: Dat(E), August. Deliverable ED1.3, workpackage WP1.

sedum - 942 ( = 77±) sedum ˇ

: J. Cernocky´, P. Polla´k, and V. Hanzˇl. 2000. Czech recor-

8 osm - 269 ( = ¾¾±) osm

: dings and annotations on CD’s - Documentation on the osum - 952 ( =78±) osum Czech database and database access. Technical report,

vosm - 0 (¼±) vosm

: SpeechDat(E), November. Deliverable ED2.3.2, work-

vosum - 1 ( =¼±) vosum : package WP2. 9 devjet - 1220 ( =½¼¼±) deveˇt http://noel.feld.cvut.cz/sampa/CZECH-SAMPA.html - La-

test Czech SAMPA proposal.

¤ 9 Table 2: Rates of pronunciation variants of digits ¼ http://www.icp.grenet.fr/ELRA/home.html - ELRA (Euro- (pronunciation written in SAMPA) pean Languange Resources Association) home page. http://www.phon.ucl.ac.uk/home/sampa/czech-uni.htm - Unofficial Czech SAMPA. 5.2. Quality check of database annotation Another application where we have used this tools was in annotation of large corpora. It seemed to be useful to cre- ate the pronunciation lexicon continuously by small steps after annotations of small blocks of data. Firstly, the check of smaller amount of new lexicon items is more pleasant. It brings consequently higher concentration on this work which yields to less amount of errors in the final lexicon. Se- condly, many errors in orthographic transcription are easily detected which increase the quality of database annotation.

6. Conclusions The most important part of this presentation can be sum- marized in following points:

¯ Two phonetic alphabets for Czech were described, IPA CTU and CZECH SPEECHDAT SAMPA.

¯ The updated version of tool “transc” for automated ge- neration of phonetic transcription was described. Fixed rules of conversion are completed by extensible lexi- con of irregularities. The usage of this lexicon increase the probability of correct pronunciation generation by this tool for new text.

¯ Lexicon management tool “lexedit” was created. It works with structured pronunciation lexicon file with possible additional information. It is written in Java programming language which allows user friendly edi- ting or updating of given lexicon.