A New Language for Constraint Grammar: Estonian∗

Kaili Müürisep
Institute of Cybernetics
Tallinn Technical University
12168 Tallinn, Estonia
[email protected]

Tiina Puolakainen
Institute of Estonian Language
10136 Tallinn, Estonia
[email protected]

Kadri Muischnek
Department of Estonian and Finno-Ugric Linguistics
University of Tartu
50409 Tartu, Estonia
[email protected]

Mare Koit and Tiit Roosmaa and Heli Uibo
Institute of Computer Science
University of Tartu
50409 Tartu, Estonia
{koit,roosmaa,heli u}@ut.ee

∗ This work was supported by the Estonian Science Foundation under the grants No. 3314 and No. 4605.

Abstract

The Constraint Grammar of Estonian presented in the paper is the first attempt in automatic syntactic analysis of Estonian. The grammar consists of 1,240 morphological disambiguation rules, 47 clause boundary detection rules, 180 morphosyntactic mapping rules and 1,118 syntactic constraints. The rules have been devised using a training corpus of 20,300 words and have been tested on a benchmark corpus of 10,000 words. As the result of tests, 86.6% of words become morphologically unambiguous, and the error rate of the morphological disambiguator is 1.8%. The results of the full analysis demonstrate the ambiguity rate of 83% and error rate of 3.5%.

1 Introduction

The Estonian language is a Finno-Ugric language that has a rich structure of declensional and conjugational forms, and also a relatively free order of sentence constituents. In these respects it differs considerably from English. There are 14 cases in Estonian, but due to the free word order it is difficult to determine the syntactic functions of these cases. Furthermore, there is no grammatical gender. The person agreement (1st, 2nd, and 3rd person in singular and plural) is common in finite verbs in all forms and tenses. The majority of grammatical categories are implemented by means of morphology.

Estonian is characterised by a wide extent and variety of grammatical homonymy that makes the automatic analysis of Estonian a difficult task. In the case of English, the main difficulty lies in determining the correct part of speech. The same problem exists also in Estonian, but the number of choices is much greater due to the richness of forms.

The ratio of ambiguous words varies greatly from language to language: for example, in English, Swedish, and Finnish the ratio of words with multiple morphological interpretation is 40%, over 60%, and 11%, respectively (Karlsson et al. 95). In Estonian literary texts more than 45% of words are ambiguous. Estonian, unlike Germanic languages, is not subject-centered. There are a number of non-elliptical sentences in Estonian with no subject.

Before we started our project, an automatic morphological analyzer for Estonian had already been created (Kaalep 96). It was our task to elaborate a grammar suitable for the automatic syntactic analysis of Estonian, and to compile the program for the syntactic analysis. To accomplish this task, we had to choose a suitable grammar model for Estonian, and analyze the available Estonian texts, primarily from the Corpus of Written Estonian Texts (Hennoste et al. 98), in order to use the established regularities for wording the rules and writing the syntactic parser.

Our grammar has been composed on the formalism of the Constraint Grammar (Karlsson et al. 95). The main idea of the Constraint Grammar (CG) is that it determines the surface-level syntactic analysis of the text, which has gone through prior morphological analysis. The process of syntactic analysis consists of three stages: morphological disambiguation, identification of clause boundaries, and identification of syntactic functions of words.
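The ambiguity ratios quoted in this section can be read as the share of word forms that receive more than one morphological interpretation. The following is a minimal sketch of that calculation, not the authors' evaluation code. The `Cohort` class and the `ambiguity_ratio` function are hypothetical names introduced here for illustration, and the toy input is the three-word sentence "Aknas kustus tuli" that the paper analyzes in Section 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cohort:
    """A word form plus every reading proposed by the morphological analyzer.

    Hypothetical helper type for this sketch, not part of EstCG.
    """
    word_form: str
    readings: List[str] = field(default_factory=list)

    @property
    def is_ambiguous(self) -> bool:
        # A word form is ambiguous when more than one reading is attached.
        return len(self.readings) > 1


def ambiguity_ratio(cohorts: List[Cohort]) -> float:
    """Return the share of word forms that carry more than one reading."""
    if not cohorts:
        return 0.0
    return sum(c.is_ambiguous for c in cohorts) / len(cohorts)


# Toy data: the example sentence from Section 2 before disambiguation.
sentence = [
    Cohort("Aknas", ["aken+s // S com sg in //"]),
    Cohort("kustus", ["kustu+s // V main indic impf ps3 sg ps af #Intr //"]),
    Cohort("tuli", [
        "tule+i // V main indic impf ps3 sg ps af #Intr //",
        "tuli+0 // S com sg nom //",
    ]),
]

print(f"{ambiguity_ratio(sentence):.2f}")  # prints 0.33
```

In this toy sentence only tuli has two readings, so one word form out of three is ambiguous. Punctuation is left out of the count here, which is an assumption of the sketch rather than something the paper specifies.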
The underlying principle in determining both the morphological interpretation and the syntactic functions is the same: first all the possible labels are attached to words, and then the ones that do not fit the context are removed by applying special rules called constraints. Constraint Grammar consists of hand-written rules, which decide by checking the context whether an interpretation is correct or has to be removed.

The Constraint Grammar parser of Estonian exists as two separate programs: the morphological disambiguator (Puolakainen 01) and the syntactic analyzer in a narrower sense (Müürisep 00). The basic differences of our grammar from the standard one are the following:

• the assumed clause boundaries are also used;
• the referenced context conditions can be represented in two ways: the appropriate cohort is searched either up to the very end of the possible context, or it is searched up to the first appropriate elements/filler;
• it is possible to remove morphological interpretations during the syntactic analysis.

The next sections of the paper will provide an overview of the Constraint Grammar of Estonian (EstCG) and problems that cropped up in its creation. We think that other researchers who set the aim of elaborating an automatic syntactic analysis of a language can benefit from our experience, especially if the language is different from English, possesses rich morphology and/or free word order.

2 Motives for Selecting Constraint Grammar Formalism

According to Constraint Grammar, after the morphological analysis of a sentence the following steps are performed: morphological disambiguation, determination of sentence-internal clause boundaries, adding of syntactic tags, and finally, syntactic disambiguation.

As an example, let us consider the morphologically analyzed Estonian sentence "Aknas kustus tuli" (The light went out in the window):

    Aknas (window)
        aken+s // S com sg in //
    kustus (go out)
        kustu+s // V main indic impf ps3 sg ps af #Intr //
    tuli (light)
        tule+i // V main indic impf ps3 sg ps af #Intr // (came)
        tuli+0 // S com sg nom // (the light)
    $.

Let us note that the word forms in this sentence can be sequenced in 3! = 6 different ways, and all the resulting sentences will be correct and understandable for a native speaker of Estonian due to the free word order.

In analyzing this sentence, the correct interpretation of the word form tuli is found by applying the following constraint: remove the finite form of the verb from the cohort (in the present case, the verb tule+i // V main indic impf ps3 sg ps af #Intr //) if the given word is immediately preceded by a finite verb form that is the only interpretation of its word form (in the present case, kustu+s // V main indic impf ps3 sg ps af #Intr //).

After the syntactic tags have been added, the sentence has the following form:

    Aknas
        aken+s // S com sg in **CLB // @ADVL @<NN @NN>
    kustus
        kustu+s // V main indic impf ps3 sg ps af #Intr // @+FMV
    tuli
        tuli+0 // S com sg nom // @SUBJ @OBJ @ADVL @NN> @<NN
    $.

In this sentence, the noun tuli in the nominative case singular may be either the subject (@SUBJ), the object (@OBJ), an adverbial (@ADVL), a premodifying attribute (@NN>), or a postmodifying attribute (@<NN).

During the last stage, syntactic constraints are applied that remove the syntactic tags unsuitable for the context:

    Aknas
        aken+s // S com sg in **CLB // @ADVL
    kustus
        kustu+s // V main indic impf ps3 sg ps af #Intr // @+FMV
    tuli
        tuli+0 // S com sg nom // @SUBJ
    $.

The word form aknas was analyzed as an adverbial, the word form tuli was analyzed as the subject, and the verb kustus received the tag of a finite predicate.

In 1995, when we launched preparatory activities for the automatic syntactic parsing of Estonian, the Constraint Grammar was beyond doubt the most efficient grammar model for morphological disambiguation. The syntactic description of the CG was not as deep as in the case of other rule-based grammar models, but the CG output contained far fewer mistakes. CG has maintained that leading position from its introduction to the present day.

3 Method of Elaborating Rules

To elaborate the morphological disambiguation constraints, we established the more frequent groups of ambiguities. We found both the more frequent ambiguous word forms and the ambiguous grammatical categories (the past participle with the interpretation of the adjective either in singular or plural, the noun or the verb; noun in the nominative, genitive and partitive case; noun in the genitive, partitive and aditive; adverb and adjective in the ablative case, etc.) (Puolakainen 01). This frequency table indicated which phenomena needed to be handled first of all. For each case, samples were collected from text corpora, and tentative rules were compiled on the basis of these observations.

[...] and write new rules that would reduce the remaining ambiguities. The easiest to compile were rules establishing the complements of quantifiers and adpositions. For example, a word in the genitive case is a complement to a postposition if the postposition is immediately next to it and it requires the genitive case.

Among the attribute rules, the simplest are those seeking whether in the left or right context there exists at all a word they may complement, as well as numerous rules checking the agreement or non-agreement. A number of rules are clearly of heuristic nature – the rule might not be 100% true but its proficiency rate is very high, compared to the number of errors. Several rules have been compiled solely on the basis of statistical information. While observing the word order in the sentence, it became obvious that such combinations as 1) object in the nominative or genitive case – predicate – subject in the nominative case, or 2) object in the nominative or genitive case – subject in the nominative case – predicate occur very rarely.

We tried to group the rules in such a way that the most reliable ones or those that cause least errors are in the main part of the grammar;
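To make the constraint mechanism concrete, the following sketch re-implements the single disambiguation constraint described in Section 2 for the sentence "Aknas kustus tuli". It is a simplified illustration and not the EstCG rule formalism or parser: the `Reading` and `Cohort` classes, the function name, and the test for finiteness (tags `V` and `indic` together) are all assumptions made for this example. The sketch also assumes that a constraint never deletes the last remaining reading of a word form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One morphological interpretation (hypothetical type for this sketch)."""
    base: str        # lemma plus ending, e.g. "tule+i"
    tags: List[str]  # morphological tags, e.g. ["V", "main", "indic", ...]

    def is_finite_verb(self) -> bool:
        # Simplifying assumption: "V" together with "indic" is enough
        # to recognise a finite verb form in this toy example.
        return "V" in self.tags and "indic" in self.tags


@dataclass
class Cohort:
    """A word form with all of its candidate readings."""
    word_form: str
    readings: List[Reading]


def remove_finite_verb_after_finite_verb(cohorts: List[Cohort]) -> List[Cohort]:
    """Remove finite-verb readings from a word that is immediately preceded
    by a word whose only reading is a finite verb. Modifies cohorts in place."""
    for prev, cur in zip(cohorts, cohorts[1:]):
        prev_is_sure_finite = (
            len(prev.readings) == 1 and prev.readings[0].is_finite_verb()
        )
        if not prev_is_sure_finite:
            continue
        kept = [r for r in cur.readings if not r.is_finite_verb()]
        if kept:  # assumption: keep the last remaining reading
            cur.readings = kept
    return cohorts


VERB_TAGS = "V main indic impf ps3 sg ps af #Intr".split()

sentence = [
    Cohort("Aknas", [Reading("aken+s", "S com sg in".split())]),
    Cohort("kustus", [Reading("kustu+s", VERB_TAGS)]),
    Cohort("tuli", [
        Reading("tule+i", VERB_TAGS),              # "came"
        Reading("tuli+0", "S com sg nom".split()),  # "the light"
    ]),
]

remove_finite_verb_after_finite_verb(sentence)
for cohort in sentence:
    print(cohort.word_form, [r.base for r in cohort.readings])
```

Run on the example sentence, the sketch leaves Aknas and kustus unchanged and keeps only the noun reading tuli+0 for tuli, which matches the outcome reported in Section 2. A real grammar would express such a rule declaratively and apply more than two thousand of them, so this function should be read only as a picture of what one constraint does.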
