A Czech Morphological Lexicon

Hana Skoumalová
Institute of Theoretical and Computational Linguistics
Charles University
Celetná 13, Praha 1
Czech Republic
hana.skoumalova@ff.cuni.cz

Abstract

In this paper, a treatment of Czech phonological rules in the two-level morphology approach is described. First the possible phonological alternations in Czech are listed, and then their treatment in a practical application of a Czech morphological lexicon is presented.

1 Motivation

In this paper I want to describe the way in which I treated the phonological changes that occur in Czech conjugation, declension and derivation. My work concerned the written language, but as the spelling of Czech is based on phonological principles, most statements will be true about the spoken language, too.

My task was to encode an existing Czech morphological dictionary (Hajič, 1994) as a finite state transducer. The existing lexicon was originally designed for simple programs that only attach "endings" to the "stems". The quotation marks in the previous sentence mean that the terms are not used in the linguistic meaning but rather technically: Stem means any part of a word that is not changed in declension/conjugation. Ending means the real ending and possibly also another part of the word that is changed. When I started the work on converting this lexicon to a two-level morphology system, the first idea was that it should be linguistically more elegant and accurate. This required me to redesign the set of patterns and their corresponding endings. From the original number of 219 paradigms I got 159 that use 116 sets of endings. Under the term paradigm I mean the set of endings that belong to one lemma (e.g. noun endings for all seven cases in both numbers) and possible derivations with their corresponding endings (e.g. possessive adjectives derived from nouns in all possible forms). That is why the number of paradigms is higher than the number of endings.

In this approach, it is necessary to deal with the phonological changes that occur at boundaries between the stem and the suffix/ending or between the suffix and the ending. There are also changes inside the stem (e.g. přítel 'friend' x přátelé 'friends', or hnát 'to chase' x ženu 'I chase'), but I will not deal with them, as they are rather rare and irregular. They are treated in the lexicon as exceptions. I also will not deal with all the changes that may occur in a verb stem; this would require reconstructing the forms of the verbs back in the 14th century, which is outside the scope of my work. Instead, I work with several stems of these irregular verbs. For example the verb hnát ('to chase') has three different stems, hná- for the infinitive, žen- for the present tense, imperative and present participles, and hna- for the past participles. The verb vést ('to lead') has two stems, vés- for the infinitive and ved- for all finite forms and participles. The verb tít ('to cut') has the stem tn- in the present tense, and the stem ťa- in the past tense; the participles can be formed both from the present and the past stem. For practical reasons we work either with one verb stem (for regular verbs) or with six stems (for irregular verbs). These six stems are stems for infinitive, present indicative, imperative, past participle, transgressive and passive participle. In fact, there is no verb in Czech with six different stems, but this division is made because of various combinations of endings with the stems.

2 Types of phonological alternations in Czech

We will deal with three types of phonological alternations: palatalization, assimilation and epenthesis. Palatalization occurs mainly in declension and partly also in conjugation. Assimilation occurs mainly in conjugation. Epenthesis occurs both in declension and in conjugation.

2.1 Epenthesis

An epenthetic e occurs in a group of consonants before a ∅-ending. The final group of consonants can consist of a suffix (e.g. -k or -b) and a part of the stem; in this case the epenthesis is obligatory (e.g. kousek x kousku 'piece', malba x maleb 'painting'). In cases when the group is morphologically inseparable, the application of epenthesis depends on whether the group of consonants is phonetically admissible at word end. In loan words, the epenthetic e may occur if the final group of consonants resembles a Czech suffix (e.g. korek x korku 'cork', but alba x alb 'alb'). In declension, two situations can occur:

• The base form contains an epenthetic e; the rule has to remove it, if the form has a non-∅ ending, e.g. chlapec 'boy', chlapci dative/locative sg or nominative pl.

• The base form has a non-∅ ending; the rule has to insert an epenthetic e, if the ending is ∅, e.g. chodba 'corridor', chodeb genitive pl.

In conjugation, an epenthetic e occurs in the past participle, masculine sg of the verb jít 'to go' (and its prefixed derivations): šel 'he-gone', šla 'she-gone', šlo 'it-gone'. The rule has to insert an epenthetic e if the form has a ∅-ending.

2.2 Palatalization and assimilation

Palatalization or assimilation at the morpheme boundaries occurs when an ending/suffix starts with a soft vowel. The alternations are different for different types of consonants. The types of consonants and vowels are as follows:

• hard consonants: d, (g,) h, ch, k, n, r, t
• soft consonants: c, č, ď, j, ň, ř, š, ť, ž
• neutral consonants: b, l, m, p, s, v, z
• hard vowels: a, á, e, é, o, ó, u, ů, y, ý and the diphthong ou
• soft vowels: ě, i, í

The vowel ú cannot occur in the ending/suffix so it will not be interesting for us. I also will not discuss what happens with 'foreign' consonants f, q, w and x; they would be treated as v, k, v and s, respectively. The only borrowing from foreign languages that I included in the above lists is g. This sound existed in Old Slavonic but in Czech it changed into h. However, when later new words with g were adopted from other languages, this sound behaved phonologically as h (e.g. hloh, hlozích, from Common Slavonic glog 'hawthorn', and katalog, katalozích 'catalog').

The phonological alternations are reflected in writing, with one exception: if the consonants d, n and t are followed by a soft vowel, they are palatalized, but the spelling is not changed:

    spelling    phonology
    dě, di      /ďe/, /ďi/
    ně, ni      /ňe/, /ňi/
    tě, ti      /ťe/, /ťi/

In other cases the spelling reflects the phonology. In the further text I will use { } for the morpho-phonological level, / / for the phonological level and no brackets for the orthographical level. In the cases where the orthography and phonology are the same I will only use the orthographical level. Let us look at the possible types of alternation of consonants:

• Soft consonant and ě: The soft consonant is not changed, the soft ě is changed to e.
    {číčě} → číče 'pussycat' dative sg

• Soft or neutral consonant and i/í: No alternations occur.
    {číči} → číči 'pussycat' genitive sg

• Hard consonant and a soft vowel: The alternations differ depending on when and how the soft vowel originated.

  Assimilation:
  - {kj} → č
    tlak 'pressure' → tlačen 'pressed'
  - {hj} → ž
    mnoho 'much, many' → množení 'multiplying'
  - {gj} → ž
    It is not easy to find an example of this sort of alternation, as g only occurs in loan words that do not use the old types of derivation. In colloquial speech it would perhaps be possible to create the following form:
    pedagog 'teacher' → pedagožení 'working as a teacher'
  - {dj} → z
    sladit 'to sweeten' → slazení 'sweetening'
    This sort of alternation is not productive any more; in newer words palatalization applies:
    sladit 'to tune up' → sladění 'tuning up'
    In some cases both variants are possible, or the different variants exist in different dialects: the east (Moravian) dialects tend to keep this phonological alternation, while the west (Bohemian) dialects often abandoned it.
  - {tje} → ce
    platit 'to pay' → placení 'paying'
    This alternation is also not productive any more. The newest word that I found which shows this sort of phonological alternation is the word fotit 'to take a photo' → focení 'taking a photo'.

  Palatalization:
  During the historical development of the language several sorts of palatalization occurred: the first and second Slavonic palatalization and further Czech palatalizations.
  - {kě/ki} → če/či (1st palat.)
    matka 'mother' → matčin possessive adjective
  - {kě/ki} → ce/ci (2nd palat.)
    matka → matce dative/locative sg
  - {hě/hi} → že/ži (1st palat.)
    Bůh 'God' → Bože vocative sg
  - {hě/hi} → ze/zi (2nd palat.)
    Bůh → Bozi nominative/vocative pl
  - {gě/gi} → že/ži (1st palat.)
    Jaga 'a witch from Russian tales' → Jažin possessive adjective
  - {gě/gi} → ze/zi (2nd palat.)
    Jaga → Jaze dative/locative sg
  - {dě} → /ďe/ → dě
    rada 'council' → radě dative/locative sg
  - {tě} → /ťe/ → tě
    teta 'aunt' → tetě dative/locative sg

  Both palatalization and assimilation yield the same result:
  - {ch} → š
    moucha 'fly' → mouše dative/locative sg, muší derived adjective
  - {n} → /ň/
    hon 'chase' → honit 'to chase', honěný 'chased'
  - {r} → ř
    var 'boil' → vařit 'to cook', vaření 'cooking'

• Neutral consonant and ě: The alternations differ depending on when and how ě originated.

  Assimilation:
  - {bje} → be
    zlobit 'to irritate' → {zlobjení} → zlobení 'irritating'
  - {mje} → me
    zlomit 'to break' → {zlomjený} → zlomený 'broken'
  - {pje} → pe
    kropit 'to sprinkle' → {kropjení} → kropení 'sprinkling'
  - {vje} → ve
    lovit 'to hunt' → {lovjení} → lovení 'hunting'
  - {sje} → še
    prosit 'to ask' → {prosjení} → prošení 'asking'
    This type of assimilation is not productive any more. In newer derivations {sje} → se (e.g. kosit 'to mow' → kosení 'mowing').
  - {zje} → že
    kazit 'to spoil' → {kazjení} → kažení 'spoiling'
    This type of assimilation is also not productive any more. In newer derivations {zje} → ze (e.g. řetězit 'to concatenate' → řetězení 'concatenating').

  Palatalization:
  With b, m, p and v no alternation occurs ({vrbě} 'willow' dative/locative sg → vrbě).
  - {sě} → se
    vosa 'wasp' → {vosě} → vose dative/locative sg
  - {zě} → ze
    koza 'goat' → {kozě} → koze dative/locative sg

  Both palatalization and assimilation yield the same result:
  - {lje} → le
    školit 'to school' → {školjení} → školení 'schooling'
  - {lě} → le
    škola 'school' → {školě} → škole dative/locative sg

• Group of hard consonants and a soft vowel: Here again either palatalization or assimilation occurs.

  Assimilation:
  - {stj} → /šť/
    čistit 'to clean' → čištění 'cleaning'
  - {slj} → šl
    myslit 'to think' → myšlení 'thinking'

  Palatalization:
  - {sk} → /šť/
    kamarádský 'friendly' → kamarádští masculine animate, nominative pl, kamarádštější 'more friendly'
  - {ck} → /čť/
    čacký 'brave' → čačtí masculine animate, nominative pl, čačtější 'braver'
  - {čk} → /čť/
    žluťoučký 'yellowish' → žluťoučtější 'more yellowish', but žluťoučcí masculine animate, nominative pl

The alternations also affect the vowel ě. When it causes palatalization or assimilation of the previous consonant, it loses its 'softness', i.e. ě → e:

    {matkě} → matce
    {sestrě} → sestře
    {školě} → škole

3 Phenomena treated by two-level rules in the Czech lexicon

As the Czech lexicon should serve practical applications I did not try to solve all the problems that occur in Czech phonology. I concentrated on dealing with the alternations that occur in declension and regular conjugation, and the most productive derivations. The rest of the alternations occurring in conjugation are treated by inserting several verb stems in the lexicon. The list of alternations and other changes covered by the rules:

• epenthesis
• palatalization in declension
• palatalization in conjugation
• palatalization in derivation of feminine nouns from masculines
• palatalization in derivation of possessive adjectives
• palatalization in derivation of adverbs
• palatalization in derivation of comparatives of adjectives and adverbs
• palatalization or assimilation in derivation of passive participles
• shortening of the vowel in suffixes -ík (in derivation of feminine noun from masculine) and -ův (in declension of possessive adjectives)

For the Czech lexicon I used the software tools for two-level morphology developed at Xerox (Karttunen and Beesley, 1992; Karttunen, 1993). The lexical forms are created by attaching the proper ending/suffix to the base form in a separate program. To help the two-level rules to find where they should operate, I also marked morpheme boundaries by special markers. These markers have two further functions:

• They bear the information about the length of the ending (or suffix and ending) of the base form, i.e. how many characters should be removed before attaching the ending.
• They bear the information about the kind of alternation.

Beside the markers for morpheme boundaries I also use markers for an epenthetic e. As I said before, e is inserted before the last consonant of a final consonant group, if the last consonant is a suffix, or if the consonant group is not phonetically admissible. However, as I do not generally deal with derivation nor with the phonetics, I am not able to recognize what is a suffix and what is phonetically admissible. That is why I need these special markers.

Another auxiliary marker is used for marking the suffix -ík, which needs a special treatment in derivation of feminine nouns and their possessive adjectives. The long vowel í must be shortened in the derivation, and the final k must be palatalized even if the ∅-ending follows. I need a special marker, as -ík- allows two realizations for both the sounds in the same contexts:

Two realizations of í:
    úředník 'clerk' → úřednice 'she-clerk', but rybník 'pond' → rybníce locative sg
Two realizations of k:
    úředník x úřednic (genitive pl of the derived feminine)

In the previous section, I described all possible alternations concerning single consonants. When I work with the paradigms or with the derivations, it is necessary to specify the kind of the alternation for all consonants that can occur at the boundary. For this purpose I introduced four types of markers:

^1P: 1st palatalization for g, h and k, or the only possible (or no) palatalization for other consonants. I use this marker also for palatalization c → č in vocative sg of the paradigm chlapec. The final c is in fact a palatalized k, so there is even a linguistic motivation for this.

^2P: 2nd palatalization for g, h and k, or the only possible (or no) palatalization for other consonants.

^A: Assimilation (or nothing).

^N: No alternation.

These markers are followed by a number that denotes how many characters of the base form should be removed before attaching the ending/suffix. Thus there are markers ^1P0, ^2P0, ^1P1, etc. The markers starting with ^N only denote the length of the ending of the base form, and instead of using ^N0 I attach the suffix/ending directly to the base form. Fortunately, nearly all paradigms and derivations cause at most one type of alternation, so it is possible to use one marker for the whole paradigm.

The markers for an epenthetic e are ^E1 (for e that should be deleted) and ^E2 (for e that should be inserted). The marker for the suffix -ík in derivations is ^IK.

Here are some examples of lexical items and the rules that transduce them to the surface form:

(1) doktorka^1P1in^2P0ých

The base form is doktorka 'she-doctor'. The marker ^1P1 denotes that the possible alternation at this morpheme boundary is (first) palatalization and that the length of the ending of the base form is 1 (it means that a must be removed from the word form and the possible alternation concerns k). The marker ^2P0 means that the derived possessive adjective has a ∅-ending and the possible alternation at this morpheme boundary is palatalization. If we rewrite this string to a sequence of morphemes we get the following string: doktork-in-ých. The sound k in front of i is palatalized, so the correct final form is doktorčiných, which is genitive plural of the possessive adjective derived from the word doktorka.

Let us look now at the two-level rules that transduce the lexical string to the surface string. We need four rules in this example: two for deleting the markers, one for deleting the ending -a, and one for palatalization. The rules for deleting auxiliary markers are very simple, as these markers should be deleted in any context. The rules can be included in the definition of the alphabet of symbols:

    Alphabet
      %^1P0:0 %^1P1:0
      %^2P0:0 %^2P1:0 %^2P2:0 %^2P3:0 %^A2:0
      %^N1:0 %^N2:0 %^N3:0 %^N4:0
      %^E1:0 %^E2:0 %^IK:0

This notation means that the auxiliary markers are always realized as zeros on the surface level.

The rule for deleting the ending -a looks as follows:

    "Deletion of the ending -a-"
    a:0 <=> _ [ %^N1: | %^1P1: | %^2P1: ] ;
            _ t: [ %^N2: | %^N4: ] ;

The first line of the rule describes the context of a one-letter nominal ending a, and the second line describes the context of an infinitive suffix with ending -at or -ovat.

The rule for palatalization k → č looks as follows:

    "First palatalization k -> č"
    k:č <=> _ (%^IK:) [ a: | e: ] %^1P1: i ;
            NonCCS: _ (End) %^1P1: ě: ;

The first line describes two possible cases: either the derivation of a possessive adjective from a feminine noun (doktorka → doktorčin), or the derivation of a possessive adjective from a feminine noun derived from a masculine that ends with -ík (úředník → (úřednice →) úředničin).

The second context describes a comparative of an adjective, or a comparative of an adverb derived from that adjective (hořký → hořčejší, hořčeji). The set NonCCS contains all characters except c, č and s and it is defined in a special section. This context condition is introduced because the groups of consonants ck, čk and sk have a different 1st palatalization.

The label End denotes any character that can occur in an ending and that is removed from the base form.

(2) korek^2P0^E1em

The base form of this word form is korek 'cork'; the marker ^2P0 means that the possible alternation is (second) palatalization and that the length of the ending of the base form is 0. The marker ^E1 means that the base form contains an epenthetic e, and em is the ending of instrumental singular. The correct final form is korkem. The rule for deleting an (epenthetic) e follows:

    "Deletion of e"
    e:0 <=> Cons _ c: %^N2: ;
            _ [ %^1P1: | %^2P1: | %^N1: | %^N2: ] ;
            Cons _ Cons: ([ %^1P0: | %^2P0: ]) %^E1: Vowel: ;
            _ t:0 [ %^2P2: | %^N2: ] ;

The first line describes the context for deletion of the suffix -ec in the derivation of the type vědec 'scientist' → vědkyně 'she-scientist'.

The second context is the context of the ending -e or the suffix -ce. This suffix must be removed in the derivation of the type soudce 'judge' → soudkyně 'she-judge'.

The third context is the context of an epenthetic e that is present in the base form and must be removed from a form with a non-∅ ending. The sets Cons and Vowel contain all consonants and all vowels, respectively.

The fourth line describes the context for deletion of the infinitive ending -et.

The whole program contains 35 rules. Some of the rules concern morphology rather than phonology, namely the rules that remove endings or suffixes. One rule is purely technical; it is one of the two rules for the alternation ch → š, as c and h must be treated separately (though ch is considered one letter in the Czech alphabet). Six rules are forced by the Czech spelling rules (e.g. rules for treating /ď/, /ť/ and /ň/ in various contexts, or a rule for rewriting y → i after soft consonants). 18 rules deal with the actual phonological alternations and they cover the whole productive phonological system of Czech. The lexicon using these rules was tested on a newspaper text containing 2,978,320 word forms, with the result of more than 96% analyzed forms.
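The marker mechanism described in this section can be illustrated with a small Python sketch. It is not the Xerox two-level rule set. It is a simplified sequential approximation that handles only boundary markers of the form ^1P/^2P/^A/^N plus a digit, the epenthesis marker ^E1, and a handful of single-consonant alternations taken from the lists in Section 2.2. The names `realize`, `FIRST`, `SECOND` and `TOKEN` are hypothetical and do not come from the lexicon itself. The digraph ch, the consonant groups sk/ck/čk, and the markers ^E2 and ^IK are deliberately left out.

```python
import re

# Toy subsets of the alternations from Section 2.2 (assumption: only these
# single consonants are covered; "ch" and consonant groups are not handled).
FIRST = {"k": "č", "h": "ž", "g": "ž", "r": "ř"}    # 1st palatalization
SECOND = {"k": "c", "h": "z", "g": "z", "r": "ř"}   # 2nd palatalization
SOFT_VOWELS = "ěií"
VOWELS = "aáeéěiíoóuúůyý"

# A token is a boundary marker (kind + digit), an auxiliary marker, or plain text.
TOKEN = re.compile(r"\^(1P|2P|A|N)(\d)|\^(E1|E2|IK)|([^\^]+)")


def realize(lexical: str) -> str:
    """Turn a marked lexical string into a surface form (simplified sketch)."""
    surface = ""
    pending = None    # alternation kind announced by the last boundary marker
    drop_e = False    # set by ^E1: the base form contains an epenthetic e
    for kind, count, aux, text in TOKEN.findall(lexical):
        if kind:
            n = int(count)
            if n:
                surface = surface[:-n]   # remove the ending of the base form
            pending = kind
        elif aux == "E1":
            drop_e = True
        elif text:
            # Delete the epenthetic e before a non-zero (vowel-initial) ending.
            if (drop_e and text[0] in VOWELS
                    and len(surface) >= 2 and surface[-2] == "e"):
                surface = surface[:-2] + surface[-1]
            drop_e = False
            table = {"1P": FIRST, "2P": SECOND}.get(pending, {})
            if surface and text[0] in SOFT_VOWELS and surface[-1] in table:
                surface = surface[:-1] + table[surface[-1]]
                if text[0] == "ě":       # ě loses its softness after alternation
                    text = "e" + text[1:]
            pending = None
            surface += text
        # ^E2 and ^IK are recognized by TOKEN but ignored in this sketch.
    return surface


print(realize("doktorka^1P1in^2P0ých"))
print(realize("korek^2P0^E1em"))
```

The two calls correspond to examples (1) and (2) from the text. Unlike the two-level rules, which constrain lexical/surface pairs in parallel, this sketch applies the changes left to right, so it only approximates the behaviour for simple cases.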

4 Acknowledgements

My thanks to Ken Beesley, who taught me how to work with the Xerox tools, and to my father, Jan Skoumal, for fruitful discussions on the draft of this paper.

References

Jan Hajič. 1994. Unification Morphology Grammar. Ph.D. dissertation, Faculty of Mathematics and Physics, Charles University, Prague.

Josef Holub and Stanislav Lyer. 1978. Stručný etymologický slovník jazyka českého (Concise etymological dictionary of Czech language). SPN, Prague.

Lauri Karttunen and Kenneth R. Beesley. 1992. Two-Level Rule Compiler. Xerox Palo Alto Research Center, Palo Alto.

Lauri Karttunen. 1993. Finite-State Lexicon Compiler. Xerox Palo Alto Research Center, Palo Alto.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publication No. 11, University of Helsinki.

Arnošt Lamprecht, Dušan Šlosar, and Jaroslav Bauer. 1986. Historická mluvnice češtiny (Historical Grammar of Czech). SPN, Prague.

Jan Petr et al. 1986. Mluvnice češtiny (Grammar of Czech). Academia, Prague.

Jana Weisheitelová, Květa Králíková, and Petr Sgall. 1982. Morphemic Analysis of Czech. No. VII in Explizite Beschreibung der Sprache und automatische Textbearbeitung, Faculty of Mathematics and Physics, Charles University, Prague.
