A New Language for Constraint Grammar: Estonian∗

Kaili Müürisep
Institute of Cybernetics
Tallinn Technical University
12168 Tallinn, Estonia
[email protected]

Tiina Puolakainen
Institute of
10136 Tallinn, Estonia
[email protected]

Kadri Muischnek
Department of Estonian and Finno-Ugric Linguistics
University of Tartu
50409 Tartu, Estonia
[email protected]

Mare Koit, Tiit Roosmaa and Heli Uibo
Institute of Computer Science
University of Tartu
50409 Tartu, Estonia
{koit,roosmaa,heli u}@ut.ee

∗ This work was supported by the Estonian Science Foundation under the grants No. 3314 and No. 4605.

Abstract

The Constraint Grammar of Estonian presented in the paper is the first attempt at automatic syntactic analysis of Estonian. The grammar consists of 1,240 morphological disambiguation rules, 47 clause boundary detection rules, 180 morphosyntactic mapping rules and 1,118 syntactic constraints. The rules have been devised using a training corpus of 20,300 words and have been tested on a benchmark corpus of 10,000 words. As the result of tests, 86.6% of words become morphologically unambiguous, and the error rate of the morphological disambiguator is 1.8%. The results of the full analysis demonstrate the ambiguity rate of 83% and error rate of 3.5%.

1 Introduction

The Estonian language is a Finno-Ugric language that has a rich structure of declensional and conjugational forms, and also a relatively free order of sentence constituents. In these respects it differs considerably from English. There are 14 cases in Estonian, but due to the free word order it is difficult to determine the syntactic functions of these cases. Furthermore, there is no grammatical gender. The person agreement (1st, 2nd, and 3rd person in singular and plural) is common in finite verbs in all forms and tenses. The majority of grammatical categories are implemented by means of morphology.

Estonian is characterised by a wide extent and variety of grammatical homonymy that makes the automatic analysis of Estonian a difficult task. In the case of English, the main difficulty lies in determining the correct part of speech. The same problem exists also in Estonian, but the number of choices is much greater due to the richness of forms.

The ratio of ambiguous words varies greatly from language to language: for example, in English, Swedish, and Finnish the ratio of words with multiple morphological interpretations is 40%, over 60%, and 11%, respectively (Karlsson et al. 95). In Estonian literary texts more than 45% of words are ambiguous. Estonian, unlike , is not subject-centered. There are a number of non-elliptical sentences in Estonian with no subject.

Before we started our project, an automatic morphological analyzer for Estonian had already been created (Kaalep 96). It was our task to elaborate a grammar suitable for the automatic syntactic analysis of Estonian, and to compile the program for the syntactic analysis. To accomplish this task, we had to choose a suitable grammar model for Estonian, and analyze the available Estonian texts, primarily from the Corpus of Written Estonian Texts (Hennoste et al. 98), in order to use the established regularities for wording the rules and writing the syntactic parser.

Our grammar has been composed on the formalism of the Constraint Grammar (Karlsson et al. 95). The main idea of the Constraint Grammar (CG) is that it determines the surface-level syntactic analysis of the text, which has gone through prior morphological analysis. The process of syntactic analysis consists of three stages: morphological disambiguation, identification of clause boundaries, and identification of syntactic functions of words.

The underlying principle in determining both the morphological interpretation and the syntactic functions is the same: first all the possible labels are attached to words, and then the ones that do not fit the context are removed by applying special rules called constraints. Constraint Grammar consists of hand-written rules, which by checking the context decide whether an interpretation is correct or has to be removed.

The Constraint Grammar parser of Estonian exists as two separate programs: the morphological disambiguator (Puolakainen 01) and the syntactic analyzer in a narrower sense (Müürisep 00). The basic differences of our grammar from the standard one are the following:

• the assumed clause boundaries are also used;
• the referenced context conditions can be represented in two ways: the appropriate cohort is searched either up to the very end of the possible context, or it is searched up to the first appropriate elements/filler;
• it is possible to remove morphological interpretations during the syntactic analysis.

The next sections of the paper will provide an overview of the Constraint Grammar of Estonian (EstCG) and problems that cropped up in its creation. We think that other researchers who set the

(came)
    tuli+0 // S com sg nom // (the light)
$.

Let us note that the word forms in this sentence can be sequenced in 3! = 6 different ways, and all the resulting sentences will be correct and understandable for a native speaker of Estonian due to the free word order.

In analyzing this sentence, the correct interpretation of the word form tuli is found by applying the following constraint: remove the finite form of the verb from the cohort (in the present case tule+i // V main indic impf ps3 sg ps af #Intr //) if the given word is immediately preceded by a finite verb form that is the only interpretation of its word form (in the present case kustu+s // V main indic impf ps3 sg ps af #Intr //).

After having added the syntactic tags we have the sentence in the following form:

Aknas
    aken+s // S com sg in **CLB // @ADVL @
kustus
    kustu+s // V main indic impf ps3 sg ps af #Intr // @+FMV
tuli
    tuli+0 // S com sg nom // @SUBJ @OBJ @ADVL @NN> @