SBORNlK PRACl FILOZOFICKE FAKULTY BRNENSKE UN1VERZ1TY STIIDIA MINORA FAC.t ILTATIS PHILOSUPHICAE UNIVER.SITAT1S BRUNENSIS A 41. l'J'j:l

KAREL PA LA. KLARA OSOLSOBE

CZECH STEM DICTIONARY

1. INTRODUCTION In our contribution we would like to present the results of the research performed at the Dept. of , Philosophical Faculty of Masaryk University, Brno in the course of the last three years (Pala, Osolsobe, Franc, 1987. Halasova-Osolsobe. 1989-90, Osolsobe, Pala, 1990).

We have set the following tasks:

1. to prepare as complete a machine dictionary of Czech stems as pos­ sible (at present containing about 175 000 entries);

2. to work out an algorithmic description of Czech morphology based on a complete list of pattern (model) words now containing 510 pattern words (Osolsobe, 1991);

3. to prepare a complete machine glossary of Czech words containing 190 000 entries and based on the glossary of the representative Czech dictionary (Dictionary of Literary Czech Language) published in 1960 and reprinted in 1989;

4. to build a collection of computer programs for the automatic morpho­ logical parsing and generating Czech word forms containing also a Czech spelling checker and other programs for creating, updating and mantaining machine dictionaries on personal computers (Pala, Osolsobe, Franc, 1987, Osolsobe, Pala, 1990).

2. CZECH MORPHOLOGY AND DICTIONARIES As our first step we have decided to develop a dictionary of Czech word stems. The basic structure of the Czech stem dictionary can be divided into three parts: 52 KAREL PA LA. KLAKA OS< >LSOBE

1. those words that cannot be inflected, i. e. adverbs, prepositions, con­ junctions, particles and interjections;

2. forms of words with irregular , i. e. personal, demonstrative, negative and indefinite and also a few irregular , as eg. byt (to be), mit (to have), chtit (to want). Jit (togo),

3. a dictionary of stems of regularly inflected parts of speech, i. e. , , numerals and all regular verbs.

Uninflected words and irregularly inflected words represent the part of our dictionary that can be called the „fixed dictionary". It is a full list of the noninflected and irregularly inflected word forms that are found in Czech, together with the information about their part of speech and res­ pective grammatical categories associated to them.

As a result we now have a dictionary of stems having the following structure:

1. on each line there is an entry consisting of a basic , or form segmented as a stem and an ending, particularly, - as a stem without a final consonant, final consonant and ending, - as a root, word forming and ending, - as a stem, word forming suffix and ending. - as a stem, stem forming suffix and ending.

From a formal point of view a stem can be defined as a string of charac­ ters (lower case letters) ending in a space or nonalphabetic character (eg. by hyphen „-"), followed either by an ending or a final consonant, word forming suffix, stem forming suffix and ending or followed by a space.

Examples: pan-O (lord) ten-a (woman) vl-k-O (wolf) mat-k-a (mother) sidl-ist-e (settlement, housing estate) pekn-y (nice) del-a-t (to do)

2. a symbol denoting which part of speech the basic form belongs to. With nouns we also associate the respective four genders (see also Sgall. 1960), eg. @1Z - animate masculine @1W - inanimate masculine @1F - feminine @1N - neuter. CZECH STEM DICTIONARY 53

Further we distinguish hard adjectives - @2T, and soft adjectives - @2M.

Verbs as a whole are denoted as @5A.

For all uninflected parts of speech we use the symbol @NO and they have unified inflection pattern denoted as >dummy.

3. An inflection pattern (see Sect. 3) is introduced by the character >, e.g. pan @1Z >pdn kld-r-a ©IF >sd sidl-ist-e ©IN >sidl pekn-y @2T >nov del-a-t ®5A >del

The following components of the entry are not obligatory:

4. the character _ expresses that the word contained in the entry can have a negative form obtained by adding the prefix ne-, (not).

5. the character " is used with adjectives to express the fact they may form the superlative with the prefix ne/-, (most),

6. the character A shows that the word contained in the entry starts in uppercase (e.g., proper noun),

7. all uninflected words contain information about the word category they belong to in the form „/kat:V, /kat:Y, /kat:X, /kat:W, /kat:Z", eg-,

pan @1Z >pdn kld-r-a ©IF >sd A sidl-ist-e ©IN >sidl pikn-y @2T >nov ~" del-a-t @5A >del~ tarn ©NO >dummy /kat:V

8. an entry also can contain ..switches", i.e. strings of characters begin­ ning with / and followed by a string of lowercase letters of the Czech alphabet. A switch defines the word forming structure of a word or a word deriving suffix. Thus a dictionary where entries contain these switches yields a classification of the entries according to word for­ mation classes. In this way the switches represent a base for more detailed classification of the words according to the respec­ tive word formation types. An example:

anarchist-k-a ©IF >mat /istkaa (type a)) 54 KAHEL PALA. KLARA OSOLSOBE

kod-k-a ©IF >mat /kaa (type a)) pojist-k-a ©IF >alja$ /kab (type b)) ucitel-k-a @1F >mat / telkaa (type a))

3. THE LIST OF PATTERNS AND ITS STRUCTURE

The list of patterns displays a two level structure:

1. definitions of the sets of endings,

2. definitions of the patterns.

3.1. DEFINITIONS OF THE SETS OF ENDINGS

A set of endings is a collection of endings, and there are 166 of them in our system, that for all words belonging to one inflection type expresses the same grammatical meaning. At the same time the requirement has to be fulfilled that all the endings belonging to one set combine with just one variant of the stem. This, of course, means that the endings that combine with the forms in which alternations of stem occur belong to a special set.

An example:

The endings connected with the typical pattern pan are:

-O, -a, -u, -ovi, -em, -i, -ove, -e, -il, -urn, -y, -ech, -ich, -e. The can be divided into the following subsets:

=V1 [Qsm] (a,2) (u,3) (ovi,3) (a.4) (u,6) (ovi,6) (em. 7) [Qpm] (u,2) (um,3) (y.4) (y.7) CZECH STEM DICTIONARY 55

=VI [Qpm] 0.1) (i.5)

=VOVE [Qpm] (ove.l) (ove,5)

=VE [Qpm] (6.1) (e.5)

=V13X [Qsm] (_.D

=WE [Qsy] (e.5)

=WU [Qsyl (u.5)

The definition of the set of endings has a fixed format. It begins with a character „=", followed by the name of the defined set consisting of uppercase letters and digits. On the folowing line (in the third column) there is a string of characters in square brackets denoting the part of speech, number, and gender for words that arise as a combination of the stems of a certain type and the ending(s) of the defined set. The next line(s) contains the definition of single endings belonging to the set. In the brackets one can gradualy find the respective endings separated by a comma „," from a digit (1, 2, 3, 4, 5. 6, 7) denoting case or person. The following symbols are used:

Q - noun V - adverb R - adjective W - preposition S - X - conjunction T - numeral Y - particle U - verb Z - interjection s - singular P " plural

1....7 case 56 KAREL PALA, KLARA OSOLSCIBE m gender, animate masculine w gender, inanimate masculine f genden, feminine n genden, neuter z m + w (animate masculine + inanimate masculine) y m + w + n (animate masculine + inanimate masculine + neuter) q w + f (inanimate masculine + feminine) u w + n (inanimate masculine + neuter) X m + w + f + n (animate masculine + inanimate masculine + femi­ nine + neuter) a positive of adverbs b comparative of adverbs

A 1st person C 3rd person B 2nd person G infinitive

The above mentioned combinations of genders and other categories are quite freguent, therefere it is very useful to have the presented abb­ reviations for them.

3.2. PATTERN DEFINITIONS

The number of patterns is quite large if cmpared with standard gram­ mars of Czech (Havranek, Jedlicka, 1981, Petr a kol, 1986). There are 510 patterns in our list, and one half are noun patterns, one third - verb patterns, one tenth - adjective patterns, and the rest - numeral patterns (Osolsobe, 1991). The reason for having this number of patterns is that we want to capture all the doublets and exceptions in Czech and conjugation. A pattern is defined as a pattern word and sets of endings - their combinations capture all the word forms associated with a given pattern. Pattern words can be segmented in several ways:

• stem + <> + set (sets) of endings, • stem + < word forming suffix > + set (sets) of endings, • stem + < final consonant > + set (sets) of endings, • stem + < final consonant + word forming suffix > + set (sets) of endings, • stem + < comparative suffix > + set (sets) of endings, • stem + < final consonant + comparative suffix > + set (sets) of endings, • stem + < stem forming suffix > + set (sets) of endings, • stem + < final consonant + stem forming suffix > + set (sets) of endings.

The strings in angle brackets can be called intersegments and there are 241 of them in our system. They include: CZECH STEM DICTH )NAHY 57

1. final consonants where the following alternations take place: fc-c-c, h-z-z, g-z-i, c/i-5, r-f,

2. final consonants with the ..graphic" alternations: d-d', t-f, n-n,

3. final consonant groups with epenthetical -e that is omitted in the cases with full ending: eb-b, el-l, em-m, en-n, er-r, et-t, eu-v,

4. adjectives with the word forming -iiv/-ov, -in,

5. suffixes used for forming the comparative, e.g., -ejs-,

6. alternating word forming suffixes like -ost, -sivi,

7. stem forming suffixes as e.g., -ova-,

8. alternating consonants of verb stems + stem forming suffixes (z- > -z),

9. passive suffix for deriving verbal nouns and adjectival (-m, -ny, -icQ,

10. combinations of the types mentioned above.

Examples:

+slon <> V1.V13X.VOVE.WE.VZ.VI PRIVL1X PRTVL1 +/medvid V13X Vl.VOVE.WU x PRIVL1X PRIVL1 VI, VQ +/nov <> PRT1,PRT2.6B,6H,6C PRK +/uc <> W1E.W2A.W4A W1D W3A.W5A VI3 58 KAHEL PALA. KLAKA OSOLSODE

4. AN ALGORITHMIC DESCRIPTION OF CZECH MORPHOLOGY

It can be seen that the dictionary of stems and list of patterns together represent an algorithmic description of Czech morphology. For each stem we know the pat­ tern according to which the stem is inflected, thus we are able not only to recog­ nize any word form for which there is a stem in the dictionary, we can generate it as well. The links between the patterns and stems represent the core of the algo­ rithm of Czech morphological analysis and synthesis.

The dictionary of stems and list of patterns exist in two forms:

1. as text files that are readable for a linguist and can be edited; here relati­ ons between the patterns and stems are visible and fully accessible and can be easily changed, checked and corrected. 2. as program (exe) files in an internal machine code; here all the relations are hidden in a machine code and their correctness can be tested only by analyzing or generating particular Czech word forms.

Preprocessing programs have also been developed that translate the dictio­ naries as text files into machine files. In this way a user can create new versions of dictionaries and update them. They do not represent a direct part of the algo- rithmical description of Czech morphology but we have found them useful for building, updating and maintaining the whole system.

5. COMPUTER PROGRAMS

Presented linguistic description in the form of algorithms enabled us to develop linguistic software consisting of three groups of programs:

1. A set analyzers, i. e. programs that are able to recognize Czech word forms taken from an arbitrary Czech text file and in this way to check their cor­ rectness, i. e. a Czech spelling checkers. They are implemented as a RAM resident (screen) checker and also as a file checker working with complete Czech text files. Both programs perform full morphological analysis of Czech inflected word forms and they say yes if a checked word is correct or no if a checked word is wrong or if the respective stem cannot be found in the dictionary.

2. This group of programs can for each inflected word form occuring in a Czech text file find its basic form. i. e. nominative, singular for nouns and adjectives and infinitive for verbs (a lemmatizer). Each recognized word form can also be associated with its corresponding grammatical categories and if there is a homonymy all the existing possibilities are offered to a user. It is obvious that both these programs can serve as part of an integra­ ted parser. There is also a possibility to generate all word forms that can be UZECH STEM OICTIONARY 59

derived from a selected stem. These programs in fact represent a machine dictionary of Czech and can become a module within a larger natural lan­ guage understanding system.

3. Programs for building dictionaries from any text file:

- dictionaries of word forms - freguency dictionaries of word forms - inverted (a tergo) dictionaries of word forms sorted according to their endings;

Their limitation is that they are worked out just for Czech language and it is not possible to offer to them easily other East European alphabets (which can be done without trouble with eg. MICRO OCP (Oxford, 1991) or WORDCRUN- CHER.

Related to these are programs for editing and updating, maintaining and creating a machine stem dictionary and a list of patterns. It is also possible to merge dictionaries and assign patterns to the respective stems by means of a special editor developed just for this purpore.

6. CONCLUSIONS

The system of stems and patterns is designed in such a way that it captures not only Czech declension and conjugation but also basic word forming types in Czech. Within the system a user can get:

• all word forms that can be derived from a given stem by declension and conjugation and the respective grammatical information associated with them.

• all words that can be formed from a given stem by regular word formation processes, i. e. the system is able to produce verbal nouns from verbs, participles from verb stems and adverbs from adjectives plus the respective comparatives and superlatives.

In this sense our machine dictionaries are of some interest not only for a Czech user but also for foreign students of Czech and linguists interested in Slavonic languages.

In further research it is our aim to use the present system and programs for modelling more complicated word formation processes in Czech. For this, however, we also have to solve the problems with homonymy in Czech (e.g.fenu - 60 KAREL PALA. KLARA OSOLSOBE verb, 1st pers. sg or noun, fem, acc. sg) as well as the Czech verb prefixation. The machine dictionaries also have become a core of the Czech lemmatizer used as a part of the Czech computer thesaurus containing now about 20 000 entries (Pala, Sevecek, Vsiansky, 1992).

BIBLIOGRAPHY

IIALASOVA-OSOLSOBE. K.: Algoritmlcky popis ceske foniiaini morfologic substantiv a adjekliv, SPFFBU. A 37-38. 1989-90. S. 83-97. HAVRANEK, B.. JEDUCKA. A.: Ceska inluvnlce. SPN, Pralia. 1981. OSOLSOBE, K.: Popls systeniu ceskyeh suhstanltvuich a slovesnych vzoru. ruknpls, Dmo, 1991. OSOLSOBE. K.. PALA, K.: Czech Stem Dictionary for IBM PC XT/AT. Conference on Computer Lexicography. Balatonfured, Septeinl>er 1990. PALA. K.. OSOLSOBE. K., FRANC, S.: Ceska morfologle a syntax v PROLOGU, sh. scinlnafe SOFSEM'87. VUSEIAR Bratislava. 1987. PALA, K.. SEVECEK. P.. VSIANSKY, J.: Ccsky pixjtacnvy synonymlcky slovnik. softwarovy produkt a rukopis. Brno 1992. PETR. J., A KOLEKnV AUTOR0. Akaclemicka lnluvnice cesiiny 1. 2, Acidemia. Praha. 1986. RUSfNOVA. Z.: Soucasna ceska morfologle. SPN. Pralia. 1984. SGALL, P.: Souslava paclovydi koncovek v cesline. AUC - Slavica Pragensla 2. s. 05-84. 1960. SLOVNiK SPISOVNEHO JAZYKA CESKEHO. Aeacleinla, Pralia. 1960. 1989.