Parsing a Polysynthetic Language

Parsing a Polysynthetic Language Petr Homola Codesign, s.r.o. [email protected] Abstract (see Section 2.3 for a brief description). A rare We present the results of a project of build- property of words in Aymara is the so-called vowel ing a lexical-functional grammar of Ay- elision (sometimes called ‘subtractive morphol- mara, an Amerindian language. There was ogy’) which is quite hard to describe formally. We almost no research on Aymara in compu- show how vowel elision can be dealt with in the tational linguistics to date. The goal of lexicon. the project is two-fold: First, we want to The paper is organized as follows: Section 2 provide a formal description of the lan- presents selected properties of Aymara, many of guage. Second, NLP resources (lexicon them absent from well-researched languages such and grammar) are being developed that as English, and their formal analysis in LFG. could be used in machine translation and Section 3 introduces a dependency-based abstrac- other NLP tasks. The paper presents for- tion of f-structures which brings formal grammars mal description of selected properties of closer cross-linguistically. Section 4 describes Aymara which are uncommon in well- our experiment with MT from Aymara into Span- researched Western languages. Further- ish and English. Finally, we conclude in Section 5 more, we present an experimental machine and give an outlook for further research. translation system into Spanish and En- 2 Some Properties of Aymara glish. 1 Introduction In this section, we focus on some properties of Aymara at the level of morphology and syntax Aymara is an Amerindian language spoken in Bo- which are mostly absent from Western languages livia, Chile and Peru by approx. two million peo- such as English, and sketch their analysis in LFG. ple. It is a polysynthetic language that has many A detailed description of the language can be lexical and structural similarities with Quechua found in (Hardman et al., 2001; Adelaar and but the often suggested genetic relationship be- Muysken, 2007; Cerrón-Palomino and Carvajal, tween these languages is still disputed. 2009; Briggs, 1976). The only research on Aymara in the field of computational linguistics we know about is 2.1 Agglutinative Morphology the project described in (Beesley, 2006). The Aymara has a very rich inflection. Suffixes presented project uses Lexical-Functional Gram- of various categories can be chained to build mar (LFG) (Kaplan and Bresnan, 1982; Bresnan, up long words that would be expressed by a 2001) to formally describe the lexicon, morphol- sentence in languages like English. For ex- ogy and syntax of Aymara in a manner suitable ample, alanxarusksmawa (ala-ni-xaru-si-ka-sma- for natural language processing (NLP). The gram- wa) means “I am preparing myself to go and buy mar we have implemented is capable of parsing it for you”. complex sentences with embedded clauses. We In concordance with the principle of lexical in- have also done experiments with machine transla- tegrity (Bresnan, 2001), we deal with morphol- tion (MT) into Spanish and English; the results are ogy in the lexicon. Ishikawa (1985) has suggested presented in Section 4. to use word-internal (sublexical) rules to analyze Aymara is a polysynthetic language with a very structurally complex words in agglutinative lan- complicated system of polypersonal agreement guages. We have adopted this analysis. 562 Proceedings of Recent Advances in Natural Language Processing, pages 562–567, Hissar, Bulgaria, 12-14 September 2011. 2.2 Vowel elision the final vowel of the word nucleus is elided and Aymara uses vowel elision as morphosyntacic (↑ COMPEL) = − if it is not. Nouns with two marking, as illustrated in (1) and (2).1 vowels do not define this attribute, i.e., it can be unified with both values. The corresponding rule (1) aycha manq’ani for compound nouns is given in (7). meat eater N0 → (N0)N “who eats much meat” (7) (↑ MOD) = ↓ ↑=↓ (↓ COMPEL) = + (2) aych manq’ani 2.3 Polypersonal agreement meat-ELI eat-FUT 3→3 Being a polysynthetic language, Aymara has “(s)he will eat meat” polypersonal conjugation, i.e., the finite verb There are three types of vowel elision that inter- agrees with the subject and with another argument act with each other. Object elision marks a noun which may be the object (direct or indirect) or an or pronoun as direct object, such as in (3) (as op- oblique argument. An example is given in (8). posed to (4)). (8) Uñjsma see-NFUT1→2 (3) khits uñji “I see/saw you.” whom-ELI see-NFUT3→3 “Whom does he/she see?” The morpholexical entry for uñjsma is given in (9).3 Note that the PRED value for both subject (4) khitis uñji and object is optional.4 who see-NFUT 3→3 (9) “Who does see him/her?” uñjsma V(↑PRED) =‘uñjañah(↑SUBJ)(↑OBJ)i’ (↑TAM TENSE) = NON-FUT Noun compound elision occurrs in NPs. The (↑TAM MOOD) = INDIC ((↑SUBJ PRED) = ‘PRO’) final vowel of noun attributes gets elided if they (↑SUBJ PERS) = 1 have three or more syllables, as illustrated in (5) ((↑OBJ PRED) = ‘PRO’) and (6). (↑OBJ PERS) = 2 (5) aymar aru The verb agrees with the subject and with the Aymara-ELI language most animate argument which may be a patient, “the Aymara language” addressee or source, e.g., um churäma-FUT1→2 “I will give you water” (addresse), aych aläma- (6) qala uta FUT1→2 “I will buy meat from you” (source) etc. stone house However, there are verbal suffixes which can make “stone house” the verb agree with other arguments, such as the beneficiary, e.g., aych churarapitäta-BEN,FUT2→1 Complement elision is applied to all words that “You will give him/her bread for me” (the verb are arguments or adjuncts of a verb except for the agrees with the beneficiary instead of the ad- final word of a clause.2 dressee). All these agreement rules are encoded Whereas object elision concerns the nucleus in the lexicon. of a word (the stem with an optional possessive and/or plural suffix), noun compound and comple- 2.4 Free Word Order ment elisions concern the final vowel of a word At the clause level, the word order in Aymara is (the vowel of the last suffix or the stem if there are not restricted although SOV is preferred. There is no suffixes). Vowel elision is dealt with in the lexi- also no evidence for a VP, thus we assume a flat con. As for noun compound elision, all nouns with phrase structure. The rules for matrix clauses are more than two syllables get (↑ COMPEL) = + if given in (10).5 1 3 In the glosses, FUT3→3 means future tense. The numbers TAM means Tense-Aspect-Mood. express the person of the subject and an additional argument, 4Both arguments can be dropped. mostly object. 5In the functional annotation, κ is either ‘−’ (no case) 2Object and noun compound elision has the gloss ELI in or a semantic case and GF is the corresponding grammatical our examples. function. 563 S → X + jumax PRON (↑PRED) = ‘PRO’ (10) (14) (↑PERS) = 2 (↑PRED FN) ∈ (↑iTOP) where X is V or NP/CP ↑=↓ (↓ CASE) = κ ⇒ 2 n o3 (↑ GF) =↓ TOP ‘jumax’ 6 7 4 n o5 FOC ‘qillqiri’ CP → (C) , S ↑=↓ ↑=↓ jumaw PRON (↑PRED) = ‘PRO’ As can be seen, word order in a clause is free (15) (↑PERS) = 2 with the exception of an optional complementizer (↑PRED FN) ∈ (↑iFOC) (see (11) and (12)) which can be placed at the be- 2 n o3 ginning of the clause or at its end. TOP ‘qillqiri’ 6 7 4 n o5 FOC ‘jumax’ (11) Ukat juti then come-NFUT 3→3 The i-structure is very important for correct “Then (s)he came.” translation. For example, the sentence Chachax li- wrw liyi would be translated as “The man read(s) a (12) Jutät ukaxa. book” whereas Chachaw liwrx liyi would be better come-FUT if 2→3 translated as “The book is/was read by a man”.7 “If you will come. ” 3 Lexical Mapping Theory and There are no discontinuous constituents and D-Structures complement clauses can be embedded in the matrix sentence. Since Aymara is not discourse- Although f-structures abstract to some extent from configurational (see the next subsection), the word language specific features (such as differential ob- order, despite of being free, is usually unmarked ject marking, see (16) where the Spanish dative (SOV) and if it is different then mostly for stylis- phrase and the Polish genitive phrase would be in tic reasons. accusative in German), there are still many differ- ences even between relatively closely related lan- 2.5 Topic-Focus Articulation guages.8 We have adopted the approach proposed by King (1997). Thus we use an i(nformation)-structure to (16) Ayer visité a Juan approximate topic-focus articulation (TFA).6 yesterday visit-PAST,1SG to Juan A simple example of two sentences which differ “I visited Juan yesterday.” only in TFA is given in (13) (the word qullqirï is a verbalized noun). Nie mam samochodu NEG have-PRES,1SG car-SG,GEN (13) Jumax qillqirïtawa “I don’t have a car.” you-SG,TOP be-a-writer-NFUT2→3,FOC “You are a writer.” Wong and Hancox (1998) examine the use Jumaw qillqirïtaxa of a(rgument)-structures in machine translation you-SG,FOC be-a-writer-NFUT2→3,TOP 7Unlike some other languages with morphological topic “It is you who is the writer.” and/or focus markers, such as Japanese (cf. examples from (Kroeger, 2004): Taroo-wa-TOP sono hon-o-ACC yon- deiru “Taroo is reading that book.” vs.

Parsing a Polysynthetic Language

Computational Challenges for Polysynthetic Languages

Linguistics 203 - Languages of the World Language Study Guide

What Are the Limits of Polysynthesis?

Lara Mantovan Nominal Modification in Italian Sign Language Sign Languages and Deaf Communities

Proceedings of the LFG13 Conference

Stress Chapter

The Dative-Ergative Connection Miriam Butt

Nez Perce Verb Morphology Phillip Cash Cash University of Arizona, 2004

Mighty Morpheme Tagging Rangers

Multi-Lingual Dependency Parsing: Word Representation and Joint

Journal of South Asian Languages and Linguistics 2(2)

Peter Arkadiev & Yury Lander October 2018 Version. Submitted to Maria