
Parsing a Polysynthetic Language

Petr Homola Codesign, s.r.o. [email protected]

Abstract

We present the results of a project of building a lexical-functional grammar of Aymara, an Amerindian language. There has been almost no research on Aymara in computational linguistics to date. The goal of the project is two-fold: First, we want to provide a formal description of the language. Second, NLP resources (lexicon and grammar) are being developed that could be used in machine translation and other NLP tasks. The paper presents a formal description of selected properties of Aymara which are uncommon in well-researched Western languages. Furthermore, we present an experimental machine translation system into Spanish and English.

1 Introduction

Aymara is an Amerindian language spoken in Bolivia, Chile and Peru by approx. two million people. It is a language that has many lexical and structural similarities with Quechua but the often suggested genetic relationship between these languages is still disputed.

The only research on Aymara in the field of computational linguistics we know about is the project described in (Beesley, 2006). The presented project uses Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001) to formally describe the lexicon, morphology and syntax of Aymara in a manner suitable for natural language processing (NLP). The grammar we have implemented is capable of parsing complex sentences with embedded clauses. We have also done experiments with machine translation (MT) into Spanish and English; the results are presented in Section 4.

Aymara is a polysynthetic language with a very complicated system of polypersonal agreement (see Section 2.3 for a brief description). A rare property of Aymara is the so-called vowel elision (sometimes called ‘subtractive morphology’) which is quite hard to describe formally. We show how vowel elision can be dealt with in the lexicon.

The paper is organized as follows: Section 2 presents selected properties of Aymara, many of them absent from well-researched languages such as English, and their formal analysis in LFG. Section 3 introduces a dependency-based abstraction of f-structures which brings formal grammars closer cross-linguistically. Section 4 describes our experiment with MT from Aymara into Spanish and English. Finally, we conclude in Section 5 and give an outlook for further research.

2 Some Properties of Aymara

In this section, we focus on some properties of Aymara at the level of morphology and syntax which are mostly absent from Western languages such as English, and sketch their analysis in LFG. A detailed description of the language can be found in (Hardman et al., 2001; Adelaar and Muysken, 2007; Cerrón-Palomino and Carvajal, 2009; Briggs, 1976).

2.1 Agglutinative Morphology

Aymara has a very rich inflection. Suffixes of various categories can be chained to build up long words that would be expressed by a sentence in languages like English. For example, alanxarusksmawa (ala-ni-xaru-si-ka-sma-wa) means “I am preparing myself to go and buy it for you”.

In accordance with the principle of lexical integrity (Bresnan, 2001), we deal with morphology in the lexicon. Ishikawa (1985) has suggested using word-internal (sublexical) rules to analyze structurally complex words in agglutinative languages. We have adopted this analysis.
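The example word in Section 2.1 already shows the subtractive effect: the surface form alanxarusksmawa is shorter than the plain concatenation of its segments ala-ni-xaru-si-ka-sma-wa. The following Python sketch is only an illustration and is not part of the grammar described in this paper. It aligns a segmentation with a surface form and reports which segments have lost their final vowel. The function name `align_elision` and the vowel inventory in `VOWELS` are assumptions made for the sketch, and the result is read off this single example rather than derived from elision rules.

```python
# Hypothetical helper, not taken from the described system.
# Assumed vowel inventory (short and long vowels as written in the examples).
VOWELS = set("aiuäïü")


def align_elision(morphemes, surface):
    """Align a segmentation with a surface form.

    Returns a list of (morpheme, elided) pairs such that concatenating
    every morpheme, minus its final vowel where `elided` is True,
    yields `surface`. Returns None if no such alignment exists.
    """
    def search(i, pos):
        if i == len(morphemes):
            # All morphemes consumed: succeed only at the end of the word.
            return [] if pos == len(surface) else None
        m = morphemes[i]
        # Try the full form first, then the form without its final vowel.
        candidates = [(m, False)]
        if len(m) > 1 and m[-1] in VOWELS:
            candidates.append((m[:-1], True))
        for form, elided in candidates:
            if surface.startswith(form, pos):
                rest = search(i + 1, pos + len(form))
                if rest is not None:
                    return [(m, elided)] + rest
        return None

    return search(0, 0)


segments = "ala-ni-xaru-si-ka-sma-wa".split("-")
analysis = align_elision(segments, "alanxarusksmawa")
elided = [m for m, was_elided in analysis if was_elided]
```

For this word, `elided` contains the segments ni, si and ka, which surface as n, s and k. A full analysis would have to state when elision applies, which is what the lexical treatment in Section 2.2 addresses.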

Proceedings of Recent Advances in Natural Language Processing, pages 562–567, Hissar, Bulgaria, 12-14 September 2011.

2.2 Vowel elision

Aymara uses vowel elision as morphosyntactic marking, as illustrated in (1) and (2).¹

(1) aycha manq’ani
    meat  eater
    “who eats much meat”

(2) aych     manq’ani
    meat-ELI eat-FUT3→3
    “(s)he will eat meat”

There are three types of vowel elision that interact with each other. Object elision marks a noun or pronoun as direct object, such as in (3) (as opposed to (4)).

(3) khits    uñji
    whom-ELI see-NFUT3→3
    “Whom does he/she see?”

(4) khitis uñji
    who    see-NFUT3→3
    “Who sees him/her?”

Noun compound elision occurs in NPs. The final vowel of noun attributes gets elided if they have three or more syllables, as illustrated in (5) and (6).

(5) aymar      aru
    Aymara-ELI language
    “the Aymara language”

(6) qala  uta
    stone house
    “stone house”

Complement elision is applied to all words that are arguments or adjuncts of a verb except for the final word of a clause.²

Whereas object elision concerns the nucleus of a word (the stem with an optional possessive and/or […] suffix), noun compound and complement elisions concern the final vowel of a word (the vowel of the last suffix or the stem if there are no suffixes). Vowel elision is dealt with in the lexicon. As for noun compound elision, all nouns with more than two syllables get (↑ COMPEL) = + if the final vowel of the word nucleus is elided and (↑ COMPEL) = − if it is not. Nouns with two vowels do not define this attribute, i.e., it can be unified with both values. The corresponding rule for compound nouns is given in (7).

(7) N′ → (N′)             N
         (↑ MOD) = ↓      ↑ = ↓
         (↓ COMPEL) = +

2.3 Polypersonal Agreement

Being a polysynthetic language, Aymara has polypersonal conjugation, i.e., the finite verb agrees with the subject and with another argument which may be the object (direct or indirect) or an oblique argument. An example is given in (8).

(8) Uñjsma
    see-NFUT1→2
    “I see/saw you.”

The morpholexical entry for uñjsma is given in (9).³ Note that the PRED value for both subject and object is optional.⁴

(9) uñjsma  V  (↑ PRED) = ‘uñjaña⟨(↑ SUBJ)(↑ OBJ)⟩’
               (↑ TAM TENSE) = NON-FUT
               (↑ TAM MOOD) = INDIC
               ((↑ SUBJ PRED) = ‘PRO’)
               (↑ SUBJ PERS) = 1
               ((↑ OBJ PRED) = ‘PRO’)
               (↑ OBJ PERS) = 2

The verb agrees with the subject and with the most animate argument which may be a […], addressee or source, e.g., um churäma-FUT1→2 “I will give you water” (addressee), aych aläma-FUT1→2 “I will buy meat from you” (source) etc. However, there are verbal suffixes which can make the verb agree with other arguments, such as the beneficiary, e.g., aych churarapitäta-BEN,FUT2→1 “You will give him/her meat for me” (the verb agrees with the beneficiary instead of the addressee). All these agreement rules are encoded in the lexicon.

2.4 Free Word Order

At the clause level, the word order in Aymara is not restricted although SOV is preferred. There is also no evidence for a VP, thus we assume a flat phrase structure. The rules for matrix clauses are given in (10).⁵

¹ In the glosses, FUT3→3 means future tense. The numbers express the person of the subject and an additional argument, mostly object.
² Object and noun compound elision has the gloss ELI in our examples.
³ TAM means Tense-Aspect-Mood.
⁴ Both arguments can be dropped.
⁵ In the functional annotation, κ is either ‘−’ (no case) or a semantic case and GF is the corresponding grammatical function.
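The lexical treatment of noun compound elision in Section 2.2 (the COMPEL attribute and the constraint in rule (7)) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the project: syllables are approximated by counting vowel letters, only bare stems without suffixes are considered (so the word nucleus is the whole word), and all function names are hypothetical.

```python
# Hypothetical sketch of the COMPEL attribute from Section 2.2.
# Assumption: one syllable per vowel letter; only suffix-less nouns.
VOWELS = set("aiuäïü")


def syllables(word):
    """Approximate the syllable count by the number of vowel letters."""
    return sum(ch in VOWELS for ch in word)


def compel(surface, citation):
    """Return the COMPEL value of a noun form: '+', '-' or None.

    None stands for an undefined attribute, which unifies with any value.
    `citation` is the full form of the noun, `surface` the form in the text.
    """
    if syllables(citation) <= 2:
        return None  # short nouns do not define the attribute
    return "+" if surface == citation[:-1] else "-"


def unifies(value, required):
    """An undefined value unifies with anything; otherwise values must match."""
    return value is None or value == required


def can_modify(surface, citation):
    """Constraint from rule (7): the modifier must satisfy (COMPEL) = +."""
    return unifies(compel(surface, citation), "+")


ok_elided = can_modify("aymar", "aymara")     # as in example (5)
ok_unelided = can_modify("aymara", "aymara")  # three syllables, vowel kept
ok_short = can_modify("qala", "qala")         # as in example (6)
```

Under these assumptions the elided form aymar and the two-syllable noun qala are accepted as modifiers, while the unelided three-syllable form aymara is rejected, which mirrors examples (5) and (6).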

(10) S → X+
     where X is V (annotated ↑=↓) or NP/CP (annotated (↓ CASE) = κ ⇒ (↑ GF) = ↓)

     CP → (C)  ,  S
          ↑=↓     ↑=↓

As can be seen, word order in a clause is free with the exception of an optional complementizer (see (11) and (12)) which can be placed at the beginning of the clause or at its end.

(11) Ukat juti
     then come-NFUT3→3
     “Then (s)he came.”

(12) Jutät       ukaxa...
     come-FUT2→3 if
     “If you will come...”

There are no discontinuous constituents and complement clauses can be embedded in the matrix sentence. Since Aymara is not discourse-configurational (see the next subsection), the word order, despite being free, is usually unmarked (SOV) and if it is different then mostly for stylistic reasons.

2.5 Topic-Focus Articulation

We have adopted the approach proposed by King (1997). Thus we use an i(nformation)-structure to approximate topic-focus articulation (TFA).⁶ A simple example of two sentences which differ only in TFA is given in (13) (the word qillqirï is a verbalized noun).

(13) Jumax      qillqirïtawa
     you-SG,TOP be-a-writer-NFUT2→3,FOC
     “You are a writer.”

     Jumaw      qillqirïtaxa
     you-SG,FOC be-a-writer-NFUT2→3,TOP
     “It is you who is the writer.”

The morpholexical entries for jumax and jumaw and the corresponding i-structures for the sentences in (13) are given in (14) and (15), respectively.

(14) jumax  PRON  (↑ PRED) = ‘PRO’
                  (↑ PERS) = 2
                  (↑ PRED FN) ∈ (↑ᵢ TOP)

     [ TOP  { ‘jumax’ }
       FOC  { ‘qillqiri’ } ]

(15) jumaw  PRON  (↑ PRED) = ‘PRO’
                  (↑ PERS) = 2
                  (↑ PRED FN) ∈ (↑ᵢ FOC)

     [ TOP  { ‘qillqiri’ }
       FOC  { ‘jumax’ } ]

The i-structure is very important for correct translation. For example, the sentence Chachax liwrw liyi would be translated as “The man read(s) a book” whereas Chachaw liwrx liyi would be better translated as “The book is/was read by a man”.⁷

3 Lexical Mapping Theory and D-Structures

Although f-structures abstract to some extent from language specific features (such as differential object marking, see (16) where the Spanish dative phrase and the Polish genitive phrase would be in accusative in German), there are still many differences even between relatively closely related languages.⁸

(16) Ayer      visité         a  Juan
     yesterday visit-PAST,1SG to Juan
     “I visited Juan yesterday.”

     Nie mam           samochodu
     NEG have-PRES,1SG car-SG,GEN
     “I don’t have a car.”

Wong and Hancox (1998) examine the use of a(rgument)-structures in machine translation (MT). In LFG, a-structures are another level of linguistic representation which provides the lexico-syntactic interface. The mapping between a-structures and f-structures is defined by the so-called Lexical Mapping Theory (LMT; see (Bresnan, 2001)). We will give a brief overview of LMT here.

LFG assumes that there is a prominence hierarchy of semantic roles. We use the hierarchy shown in (17) (proposed by Bresnan (2001)):

(17) agent ≻ beneficiary/maleficiary ≻ experiencer/goal ≻ instrument ≻ patient/theme ≻ locative

Argument grammatical functions (GF) are assigned the features objective and restricted as in (18). The hierarchy of GFs is given in (19).

(18)        −r      +r
     −o     SUBJ    OBLθ
     +o     OBJ     OBJθ

(19) SUBJ ≻ OBJ, OBLθ ≻ OBJθ

Verbs in LFG have an a-structure that expresses their valence. The arguments of each verb are ordered according to the hierarchy in (17) and annotated with −o, −r, +o, +r. General LMT principles determine how the arguments are mapped onto GFs. The initial role is mapped onto SUBJ if classified with [−o]. Otherwise, the leftmost role classified [−r] is mapped onto SUBJ. Other roles are mapped onto the lowest compatible GF according to the hierarchy in (19). There are two other constraints: Every verb must have a SUBJ and each role must be associated with a unique function, and conversely.

Bresnan (2001) argues that LMT allows for natural treatment of passives, ditransitives and other constructions which have been handled by lexical rules in earlier versions of LFG.

We use the information provided by f-structures, i-structures, c-structures and a-structures to create a dependency-based representation of parsed sentences (a tectogrammatical tree in the terminology of Sgall et al. (1986)). The main reason is that we already have a module that generates English and Spanish sentences from (tectogrammatical) syntax trees.

In the following, we will use the term d(ependency)-structure to refer to dependency trees induced by LFG structures. Table 1 gives a brief overview of which information at different levels of linguistic representation in LFG is used in d-structures.

LFG layer      information in d-structures
c-structure    original word order
f-structure    dependencies and coreferences
i-structure    topic-focus articulation
a-structure    valence

Table 1: Information provided by LFG layers to d-structures

The skeleton of a d-structure is provided by the f-structure. According to a generally accepted principle of deep syntax (tectogrammatics) only autosemantic (content) words are represented by nodes in d-structures. In LFG, autosemantic words are associated with projections of lexical categories, i.e., f-structures with the PRED attribute (see (Bresnan, 2001) for a detailed discussion of lexical and functional categories and the so-called ‘coheads’). Thus a d-structure derived from (20) would have three nodes for the words dog, chases and cat.

(20) [ PRED   ‘chase⟨(↑ SUBJ)(↑ OBJ)⟩’
       TENSE  PRES
       SUBJ   [ PRED ‘dog’
                SPEC [ DEF + ] ]
       OBJ    [ PRED ‘cat’
                SPEC [ DEF − ] ] ]

The edges are labelled with semantic roles. This is possible due to the bi-uniqueness of the mapping between roles and GFs (see above). However, there is one exception: The initial role is assigned a special label which we call ‘actor’ (ACT, which is equivalent to what Bresnan (2001) marks θ̂ and calls ‘logical subject’). This partially reflects the shifting of actants in tectogrammatics as defined by Sgall et al. (1986).⁹

So far, we have an unordered tree (f-structures are unordered by definition).¹⁰ We define an ordering based on information structure, as proposed for deep syntax by Sgall et al. (1986). Thus we use the i-structure to define a partial ordering on the nodes of the d-structure (TOP ≺ ‘discourse-unspecified’ ≺ FOC). The nodes in each of the three topic-focus domains are ordered according to their original ordering in the sentence (which is captured by c-structures).¹¹

The resulting d-structure is given in (21).¹²

(21)        chases
        ACT /    \ PAT
         dog      cat

Let us briefly point out some properties of d-structures as defined above. Most of them directly correspond to properties of deep syntax (tectogrammatical) trees.

1. There is a bi-unique mapping between d-structure nodes and autosemantic (content) words. Synsemantic (auxiliary/function) words are represented as attributes of nodes. This is a direct consequence of LFG ‘coheads’.
2. ‘Dropped’ words (e.g., subject and/or object pronouns in so-called pro-drop languages) are re-established in d-structures as a consequence of the LFG Principle of Completeness since PRED attributes are instantiated in the lexicon if needed (cf. (Bresnan, 2001)).
3. Edge labels in d-structures reflect semantic relations rather than the GFs which are more language specific.
4. The ordering of d-structure nodes is partially determined by topic-focus articulation.

However, there are several differences. For example, d-structures can be non-projective (tectogrammatical trees are projective by definition (Sgall et al., 1986)) which is a direct consequence of how long-distance dependencies are represented in f-structures. Furthermore, one word can be represented by more than one d-structure node (such as in languages with incorporation).

Butt et al. (1999) give a detailed description of the process of parallel grammar development. In our approach, the correspondence between original LFG structures and d-structures poses some (mostly technical) limitations on grammar writers. For example, f-structures of synsemantic words (functional categories) must be ‘coheads’ of their functional categories (however, this is a general requirement in modern LFG according to Bresnan (2001)). Also, GFs must conform to the strict constraints imposed by LMT.

Table 2 shows how many c-structures, f-structures and d-structures are identical (two d-structures are identical if they have the same structure and edge labels) in a parallel Aymara-Spanish corpus of 1,000 sentences.

level          identical representation
c-structure    7.8%
f-structure    38.3%
d-structure    69.5%

Table 2: Identical c-, f- and d-structures in a parallel corpus

4 Machine Translation

In this section, we briefly present the results of an MT experiment from Aymara into Spanish and English. All modules of the system were developed in SWI Prolog (Wielemaker, 2003).

It is obvious (cf. Section 2) that there are very few structural similarities between Aymara and Spanish or English, thus a ‘direct’ or ‘shallow’ approach to MT, as proposed by Dyvik (1995), would not lead to quality translation. As has been said above, we have developed an LFG grammar for Aymara. Kaplan and Wedekind (2000) have shown that the generation of sentences out of an f-structure according to an LFG grammar yields a context-free language. However, an LFG grammar developed for parsing may not be suitable for generation (due to overgeneration). That is why we use d-structures as defined in Section 3.

⁶ The difference is that we use only two discourse functions, TOP or FOC, with the possibility for words being discourse-unspecified (the term ‘discourse-neutral’ is used sometimes). This is exactly how morphological marking of TFA works in Aymara.
⁷ Unlike some other languages with morphological topic and/or focus markers, such as Japanese (cf. examples from (Kroeger, 2004): Taroo-wa-TOP sono hon-o-ACC yondeiru “Taroo is reading that book.” vs. Sono hon-wa-TOP Taroo-ga-NOM yondeiru “That book, Taroo is reading”), Aymara allows their co-occurrence with case suffixes without limitation.
⁸ For example, the East Baltic language Latvian has only agent-less passives (i.e., in LFG, it completely lacks OBLag, cf. (Forssman, 2001)), whereas its closest and partially mutually intelligible relative Lithuanian has and frequently uses agents in passives.
⁹ The edge labels are theory specific and somewhat arbitrary. For example, Butt et al. (1999) distinguish between ‘semantic’ and ‘non-semantic’ prepositions. As a consequence, the complement in He relies on the book is an OBJ and therefore PAT in the corresponding d-structure although on the book is not a direct object in traditional dependency grammar.
¹⁰ Generally, the skeleton rendered by f-structures may contain a cycle, i.e., a node with more than one mother node. This is how LFG handles coreferences, such as in the sentence I want to go home where the complement clause is an open complement (XCOMP) in the f-structure of ‘want’ and (↑ SUBJ) = (↑ XCOMP SUBJ). To obtain a well-formed tree, we reflect the path of length 1 in the f-structure as an edge and the remaining (conflicting) functional paths as coreferences.
¹¹ In free word-order languages, NPs and PPs usually have more rigid word order than clause arguments and adjuncts, thus in an MT system, the module for syntactic synthesis of the target language would reorder the d-structure according to language specific word-order rules.
¹² The attributes associated with nodes can be obtained from corresponding f-structures (in LFG, all linguistic levels are interlinked).
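The node ordering used for d-structures in Section 3 (TOP ≺ ‘discourse-unspecified’ ≺ FOC, with the original word order preserved inside each domain) can be illustrated with a short Python sketch. It is not the SWI Prolog code of the system; the class `Node`, the table `RANK` and the function `order_nodes` are hypothetical names, the sketch treats a single flat clause rather than a full tree, and it assumes that the unmarked verb liyi is discourse-unspecified.

```python
from dataclasses import dataclass
from typing import Optional

# Rank of the three topic-focus domains: TOP < unspecified (None) < FOC.
RANK = {"TOP": 0, None: 1, "FOC": 2}


@dataclass
class Node:
    """A d-structure node of one clause (hypothetical, simplified)."""
    form: str                 # word form as it appears in the sentence
    position: int             # original word order, taken from the c-structure
    discourse: Optional[str]  # "TOP", "FOC" or None, taken from the i-structure


def order_nodes(nodes):
    """Order nodes by topic-focus domain, then by original position."""
    return sorted(nodes, key=lambda n: (RANK[n.discourse], n.position))


# "Chachaw liwrx liyi": -w marks focus and -x marks topic, as in example (13).
focused_subject = [
    Node("Chachaw", 0, "FOC"),
    Node("liwrx", 1, "TOP"),
    Node("liyi", 2, None),  # assumption: the unmarked verb is unspecified
]
# "Chachax liwrw liyi": the markers are swapped.
topical_subject = [
    Node("Chachax", 0, "TOP"),
    Node("liwrw", 1, "FOC"),
    Node("liyi", 2, None),
]

order_a = [n.form for n in order_nodes(focused_subject)]
order_b = [n.form for n in order_nodes(topical_subject)]
```

With these inputs the first sentence is ordered liwrx, liyi, Chachaw and the second Chachax, liyi, liwrw, which lines up with the translations “The book is/was read by a man” and “The man read(s) a book” given in Section 2.5.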

Evaluation results are given in Table 3.

language pair     WER
Aymara-Spanish    22.3%
Aymara-English    24.8%

Table 3: Evaluation of MT into Spanish and English

While the error rate is not low, it is acceptable given the fact that the source language is structurally very different from the target language. Most translation errors can be traced to diverging […] of verbs in both languages.

5 Conclusions and Further Research

We have presented a formal grammar for Aymara and pointed out some interesting properties of the language and how they can be dealt with in the LFG framework.

As can be seen, the LFG framework can be easily used to develop formal grammars of polysynthetic languages such as Aymara. While the rules we have developed cover a large part of the Aymara syntax, the lexicon we have now needs to be expanded. Currently, we are focusing on refining sublexical rules.

We have chosen LFG for our grammar because it has a solid formal foundation while providing grammars that can be used directly in NLP. However, we are developing the grammar for use in MT and LFG’s f-structures are still relatively language-specific. To overcome this limitation, we have developed a fully automatic procedure which induces d(ependency)-structures (deep syntax trees) that are at a higher level of abstraction. Our d-structures are not only more suitable for cross-lingual NLP tasks such as MT but they also disclose that LFG is, in its core, a dependency-based formalism.

References

Willem Adelaar and Pieter Muysken. 2007. The Languages of the Andes. Cambridge University Press.

Kenneth R. Beesley. 2006. Finite-state Morphological Analysis and Generation for Aymara. In Proceedings of the Global Symposium on Promoting the Multilingual Internet.

Joan Bresnan. 2001. Lexical-Functional Syntax. Blackwell Textbooks in Linguistics, New York.

Lucy Therina Briggs. 1976. Dialectal Variation in the Aymara Language of Bolivia and Peru. Ph.D. thesis, University of Florida.

Miriam Butt, Tracy Holloway King, María-Eugenia Niño, and Frédérique Segond. 1999. A Grammar Writer’s Cookbook. CSLI Publications.

R. Cerrón-Palomino and J. Carvajal Carvajal. 2009. Aimara. In M. Crevels and P. Muysken, editors, Lenguas de Bolivia. Plural Editores, La Paz, Bolivia.

Helge Dyvik. 1995. Exploiting Structural Similarities in Machine Translation. Computers and Humanities, 28:225–245.

Berthold Forssman. 2001. Lettische Grammatik. Verlag J.H. Roell, Dettelbach.

Martha Hardman, J. Vásquez, and J. Yapita de Dios. 2001. Aymara. Compendio de estructura fonológica y gramatical. Instituto de Lengua y Cultura Aymara.

Akira Ishikawa. 1985. Complex Predicates and Lexical Operations in Japanese. Ph.D. thesis, Stanford University.

Ronald M. Kaplan and Joan Bresnan. 1982. Lexical-Functional Grammar: A formal system for grammatical representation. In Joan Bresnan, editor, Mental Representation of Grammatical Relations. MIT Press, Cambridge.

Ronald M. Kaplan and Jürgen Wedekind. 2000. LFG Generation Produces Context-free Languages. In Proceedings of COLING-2000, Saarbrücken.

Tracy Holloway King. 1997. Focus Domains and Information Structure. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the LFG Conference.

Paul R. Kroeger. 2004. Analyzing Syntax. Cambridge University Press.

Petr Sgall, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel Publishing Company.

Jan Wielemaker. 2003. An overview of the SWI-Prolog programming environment. In Fred Mesnard and Alexander Serebenik, editors, Proceedings of the 13th International Workshop on Logic Programming Environments, pages 1–16, Heverlee, Belgium, December. Katholieke Universiteit Leuven. CW 371.

Shun Ha Sylvia Wong and Peter Hancox. 1998. An Investigation into the Use of Argument Structure and Lexical Mapping Theory for Machine Translation. In Proceedings of the 12th Pacific Asia Conference on Linguistics, Information and Computation, Singapore.
