A New Language for Constraint Grammar: Estonian∗

Kaili Müürisep
Institute of Cybernetics, Tallinn Technical University
12168 Tallinn, Estonia
[email protected]

Tiina Puolakainen
Institute of Estonian Language
10136 Tallinn, Estonia
[email protected]

Kadri Muischnek
Department of Estonian and Finno-Ugric Linguistics, University of Tartu
50409 Tartu, Estonia
[email protected]

Mare Koit and Tiit Roosmaa and Heli Uibo
Institute of Computer Science, University of Tartu
50409 Tartu, Estonia
{koit,roosmaa,heli u}@ut.ee
Abstract

The Constraint Grammar of Estonian presented in the paper is the first attempt in automatic syntactic analysis of Estonian. The grammar consists of 1,240 morphological disambiguation rules, 47 clause boundary detection rules, 180 morphosyntactic mapping rules and 1,118 syntactic constraints. The rules have been devised using a training corpus of 20,300 words and have been tested on a benchmark corpus of 10,000 words. As the result of tests, 86.6% of words become morphologically unambiguous, and the error rate of the morphological disambiguator is 1.8%. The results of the full analysis demonstrate the ambiguity rate of 83% and error rate of 3.5%.

∗ This work was supported by the Estonian Science Foundation under the grants No. 3314 and No. 4605.

1 Introduction

The Estonian language is a Finno-Ugric language that has a rich structure of declensional and conjugational forms, and also a relatively free order of sentence constituents. In these respects it differs considerably from English. There are 14 cases in Estonian, but due to the free word order it is difficult to determine the syntactic functions of these cases. Furthermore, there is no grammatical gender. Person agreement (1st, 2nd, and 3rd person in singular and plural) is common in finite verbs in all forms and tenses. The majority of grammatical categories are expressed by means of morphology.

Estonian is characterised by a wide extent and variety of grammatical homonymy, which makes the automatic analysis of Estonian a difficult task. In the case of English, the main difficulty lies in determining the correct part of speech. The same problem also exists in Estonian, but the number of choices is much greater due to the richness of forms.

The ratio of ambiguous words varies greatly from language to language: for example, in English, Swedish, and Finnish the ratio of words with multiple morphological interpretations is 40%, over 60%, and 11%, respectively (Karlsson et al. 95). In Estonian literary texts more than 45% of words are ambiguous. Estonian, unlike Germanic languages, is not subject-centered. There are a number of non-elliptical sentences in Estonian with no subject.

Before we started our project, an automatic morphological analyzer for Estonian had already been created (Kaalep 96). It was our task to elaborate a grammar suitable for the automatic syntactic analysis of Estonian, and to compile the program for the syntactic analysis. To accomplish this task, we had to choose a suitable grammar model for Estonian, and analyze the available Estonian texts, primarily from the Corpus of Written Estonian Texts (Hennoste et al. 98), in order to use the established regularities for wording the rules and writing the syntactic parser.

Our grammar is based on the formalism of the Constraint Grammar (Karlsson et al. 95). The main idea of the Constraint Grammar (CG) is that it determines the surface-level syntactic analysis of a text that has gone through prior morphological analysis. The process of syntactic analysis consists of three stages: morphological disambiguation, identification of clause boundaries, and identification of syntactic functions of words.

The underlying principle in determining both the morphological interpretation and the syntactic functions is the same: first all the possible labels are attached to words, and then the ones that do not fit the context are removed by applying special rules called constraints. A Constraint Grammar consists of hand-written rules that check the context to decide whether an interpretation is correct or has to be removed.

The Constraint Grammar parser of Estonian exists as two separate programs: the morphological disambiguator (Puolakainen 01) and the syntactic analyzer in a narrower sense (Müürisep 00).

The basic differences of our grammar from the standard one are the following:

• the assumed clause boundaries are also used;

• the referenced context conditions can be represented in two ways: the appropriate cohort is searched either up to the very end of the possible context, or it is searched up to the first appropriate elements/filler;

[...]

(came)
    tuli+0 // S com sg nom // (the light)
$.

Let us note that the word forms in this sentence can be sequenced in 3! = 6 different ways, and all the resulting sentences will be correct and understandable for a native speaker of Estonian due to the free word order.

In analyzing this sentence, the correct interpretation of the word form tuli is found by applying the following constraint: remove the finite form of the verb from the cohort (in the present case tule+i // V main indic impf ps3 sg ps af #Intr //) if a given word is immediately preceded by a finite form of a verb which is the only interpretation of the word form (in the present case kustu+s // V main indic impf ps3 sg ps af #Intr //).

After having added the syntactic tags we have the sentence in the following form:

Aknas
    aken+s // S com sg in **CLB // @ADVL @

[...]

comparison (there are no such words as nagu /as/, kui /as/, otsekui /as if/, justkui /as if/ to its left).

5 Results

5.1 The corpora used

To compile and test the EstCG, a 20,314 word corpus of Estonian was used that was manually tagged for morphological and syntactic features (henceforth referred to as the training corpus). The corpus consists of six original fiction texts, each from a different author (12,223 words altogether, derived from the Corpus of Written Estonian), a 6,373 word translation of a fiction text ("1984" by G. Orwell), and a 1,719 word newspaper text.

To assess the effectiveness of the parser, a manually tagged 9,663 word test corpus was used that had not been applied for optimizing or evaluating the grammar rules.

5.2 Results of morphological disambiguation

The use of the disambiguator reduced the percentage of ambiguous word forms by a factor of four in both the training and the test corpus.

The results of morphological disambiguation in the training corpus are the following: precision is 83.39-89.68%, recall is 97.87-99.16%, and the percentage of words with one reading is 88.67-91.74%.

The results of morphological disambiguation in the test corpus are the following: precision is 85.49-89.16%, recall is 97.95-98.36%, and the percentage of words with one reading is 88.67-91.96%.

Most of the mistakes occurred when the cases of nouns, adjectives, and pronouns were determined: the difficulties occur in differentiating the nominative, genitive, partitive, and aditive cases. This was also the leading ambiguity type in the initial text and among the ambiguities remaining after disambiguation. The second most frequent error type was the problem of determining whether past participles were used as adjectives or verbs, and the third is closely related to the second one: determining whether the form of the verb olema /to be/ is the main or auxiliary verb.

The grammar of the syntactic analyzer was tested on two types of corpora. The first one was manually morphologically disambiguated, i.e. all the errors of previous stages of analysis were fixed and morphological ambiguities removed by linguists. This type of corpus helped to determine the problems that are specific to syntactic disambiguation only. The second corpus consists of the same texts, but the preceding analysis is achieved automatically.

While compiling the grammar the goal was set to obtain as unambiguous an analysis as possible on the training corpus, at the same time maintaining a high rate of recall (at least 98.5%). The recall of all texts did not fall below 98.9% in the training corpus, whereas the ratio of unambiguous analyses varies by more than 7 percentage points, from 86.1% to 93.4% (precision is 89.61% for the whole corpus). It is lowest for the newspaper text, which is stylistically very different from the others.

The results of syntactic disambiguation in the test corpus were a bit less promising but the difference was not too big: 0.7% reduction in recall and 2% reduction in precision. When analyzing the results of the test corpus it should be noted that the journalistic text is more difficult for the parser. The remaining ambiguity mostly consists of two tags, but single words may have from 3 to 5 syntactic tags. The largest class of ambiguities is formed by adverbials and adverbial attributes. This is almost the same problem as PP-attachment in English, but additionally it is possible to use both premodifying and postmodifying adverbial attributes in Estonian. Of course, the PP-attachment problem also exists. Another complicated problem is distinguishing genitive attributes from objects that are followed by another noun. To make things even worse, the morphological disambiguator often fails to solve the morphological ambiguities in the same position: a noun in the genitive case is frequently ambiguous with the nominative or partitive case.

So the most difficult problem in the Estonian language appears to be determining the borders of noun phrases. It is often hard to decide which adjacent nouns belong to a common noun phrase, and which form separate noun phrases.

Having analyzed the errors, we found that there were no systematic misinterpretations in the grammar. The errors are mostly caused by ellipsis, some errors occurred during determination of apposition, and the third biggest group of errors exists in sentences where one clause divides the other into two parts. Most of the errors can be avoided by refining the contextual conditions of the rules. The rate of erroneous analyses is also below the 2% that was set as the goal in compiling the grammar.

The results obtained from the automatically morphologically disambiguated corpora are, as expected, worse. In the test corpus, the recall of syntactic disambiguation was 95.50-96.85%, precision 76.36-79.22%, and the unambiguity 81.34-83.63%.

The analysis of errors on that type of corpus demonstrates that the majority of the additional errors have directly been caused by the errors of the morphological disambiguator and by words unrecognised by the morphological analyzer.

6 Conclusion

Constraint grammars have been written for the Basque, English, Norwegian, Portuguese, Swedish, Turkish, and now for the Estonian language.

The computational grammar and the parser elaborated during our project are the first attempts to automate the syntactic analysis of Estonian. The 2,500-odd rules of the Constraint Grammar of Estonian have been formed on the basis of the standard linguistic grammar of Estonian and the texts from the corpus of written Estonian. The parser has been successfully used in two prototypes of practical applications: recognizing noun phrases and the automatic generation of text summaries. The promising fields of application include information retrieval (taking into account the typical patterns of sentences while dealing with a specific topic, e.g., (Bangalore 97)), text-to-speech generation (partial syntactic analysis for determining the intonation of the sentence), grammar checking in text editing, and translator aids (e.g., search for standard sentence constructions from a parallel corpus).

The directions of our future work are as follows.

• The share of the lexicon in the grammar must be increased. We need a lexicon containing certain semantic information covering, first of all, noun quantifiers and nouns that can fulfill the functions of adverbials of time and manner occurring in other cases than the nominative, genitive, and partitive.

• The problem of complex and phrasal verbs needs solving as well. At the moment their nominal and adverbial components are analyzed according to their morphological form, but it would be necessary to relate them to the verb. To do so, we need exhaustive lists of complex and phrasal verbs (this work has already been launched, cf. (Kaalep & Muischnek 02)).

• The volume of the tagged training corpus has to be increased.

• We plan to experiment with different statistical methods for the automatic generation of the grammar on a bigger training corpus.

• The set of syntactic tags has to be expanded. A more detailed tagging should considerably increase the efficiency of the object analysis.

• In the longer perspective we foresee a need for transition to a deeper description of the syntax, e.g., to take over the principles of the Functional Dependency Grammar (Järvinen & Tapanainen 97).

References

(Bangalore 97) Srinivas Bangalore. Complexity of Lexical Descriptions and its Relevance to Partial Parsing. Ph.D. diss., University of Pennsylvania, 1997.

(Erelt et al. 93) Mati Erelt, Reet Kasik, Helle Metslang, Henno Rajandi, Kristiina Ross, Henn Saari, Kaja Tael, and Silvi Vare. Eesti keele grammatika. II Süntaks. Eesti TA Keele ja Kirjanduse Instituut, Tallinn, 1993.

(Hennoste et al. 98) Tiit Hennoste, Mare Koit, Tiit Roosmaa, and Madis Saluveer. Structure and usage of the Tartu University corpus of written Estonian. International Journal of Corpus Linguistics, 3(2):279–304, 1998.

(Järvinen & Tapanainen 97) Timo Järvinen and Pasi Tapanainen. A dependency parser for English. TR 1, Department of General Linguistics, University of Helsinki, 1997.

(Kaalep & Muischnek 02) Heiki-Jaan Kaalep and Kadri Muischnek. Using the text corpus to create a comprehensive list of phrasal verbs. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 101–105, Las Palmas, 2002.

(Kaalep 96) Heiki-Jaan Kaalep. Estmorf. A morphological analyzer for Estonian. In Estonian in the Changing World, pages 43–98. Tartu, 1996.

(Karlsson et al. 95) Fred Karlsson, Atro Voutilainen, Juha Heikkilä, and Arto Anttila. Constraint Grammar: a Language Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin and New York, 1995.

(Müürisep 00) Kaili Müürisep. Eesti keele arvutigrammatika: süntaks. Ph.D. diss., University of Tartu, 2000.

(Puolakainen 01) Tiina Puolakainen. Eesti keele arvutigrammatika: morfoloogiline ühestamine. Ph.D. diss., University of Tartu, 2001.
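
Illustrative sketch. The disambiguation constraint described in the Introduction for the word form tuli (remove the finite verb reading when the immediately preceding word is unambiguously a finite verb) can be mimicked in a few lines of Python. This is only a toy model under stated assumptions, not the rule formalism or the parser used in the project: the class names `Reading` and `Cohort`, the function name, the test that treats a reading as a finite verb when it carries both the `V` and `indic` tags, and the safeguard that the last remaining reading is never removed are all choices made for this sketch. The readings themselves are copied from the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    """One morphological interpretation of a word form (hypothetical type)."""
    lemma: str
    tags: tuple[str, ...]

    def is_finite_verb(self) -> bool:
        # Assumption of this sketch: "V" together with "indic" marks a finite form.
        return "V" in self.tags and "indic" in self.tags


@dataclass
class Cohort:
    """A word form together with all of its remaining readings."""
    word: str
    readings: list[Reading] = field(default_factory=list)


def remove_finite_after_unambiguous_finite(sentence: list[Cohort]) -> None:
    """Drop finite-verb readings of a word whose left neighbour has exactly
    one reading and that reading is a finite verb."""
    for prev, cur in zip(sentence, sentence[1:]):
        prev_is_sure_finite = (
            len(prev.readings) == 1 and prev.readings[0].is_finite_verb()
        )
        if not prev_is_sure_finite:
            continue
        kept = [r for r in cur.readings if not r.is_finite_verb()]
        # Assumption: a constraint never deletes the last remaining reading.
        if kept:
            cur.readings = kept


VERB_TAGS = ("V", "main", "indic", "impf", "ps3", "sg", "ps", "af", "#Intr")

sentence = [
    Cohort("Aknas", [Reading("aken+s", ("S", "com", "sg", "in"))]),
    Cohort("kustus", [Reading("kustu+s", VERB_TAGS)]),
    Cohort("tuli", [
        Reading("tule+i", VERB_TAGS),                    # (came)
        Reading("tuli+0", ("S", "com", "sg", "nom")),    # (the light)
    ]),
]

remove_finite_after_unambiguous_finite(sentence)
print([r.lemma for r in sentence[2].readings])  # ['tuli+0']
```

After the call only the noun reading tuli+0 is left for tuli, which is the outcome the text describes; the cohorts of Aknas and kustus are unchanged because neither is preceded by an unambiguous finite verb.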