Improving Coverage of an Inuktitut Morphological Analyzer Using a Segmental Recurrent Neural Network
Jeffrey C. Micher
US Army Research Laboratory
2800 Powder Mill Road
Adelphi, MD 20783
[email protected]

Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 101–106, Honolulu, Hawai‘i, USA, March 6–7, 2017. © 2017 Association for Computational Linguistics

Abstract

Languages such as Inuktitut are particularly challenging for natural language processing because of polysynthesis, abundance of grammatical features represented via morphology, morphophonemics, dialect variation, and noisy data. We make use of an existing morphological analyzer, the Uqailaut analyzer, and a dataset, the Nunavut Hansards, and experiment with improving the analyzer via bootstrapping of a segmental recurrent neural network onto it. We present results of the accuracy of this approach, which works better for a coarse-grained analysis than a fine-grained analysis. We also report on accuracy of just the “closed-class” suffix parts of the Inuktitut words, which is better than the overall accuracy on the full words.

1 Introduction

Inuktitut is a polysynthetic language, and is part of the Inuit language dialect continuum spoken in Arctic North America.¹ It is one of the official languages of the territory of Nunavut, Canada, and is used widely in government and educational documents there. Despite its elevated status as an official language, very little research on natural language processing of Inuktitut has been carried out. On the one hand, the complexity of Inuktitut makes it challenging for typical natural language processing applications. On the other hand, Inuktitut may only be spoken by around 23,000 people as a first language,² and as a result, receives minimal commercial attention. The National Research Council of Canada has produced a very important morphological analyzer for Inuktitut. We make use of this analyzer and propose a neural network approach to enhancing its output.

This paper is organized as follows: first we talk about the nature of Inuktitut, focusing on polysynthesis, abundance of grammatical suffixes, morphophonemics, and spelling. Second, we describe the Uqailaut morphological analyzer for Inuktitut. Third, we describe the Nunavut Hansard dataset used in the current experiments. Fourth, we describe an approach to enhancing the morphological analysis based on a segmental recurrent neural network architecture, with character sequences as input. Finally, we present ongoing experimental research results.

¹ https://en.wikipedia.org/wiki/Inuit_languages
² http://nunavuttourism.com/about-nunavut/welcome-to-nunavut

2 Nature of Inuktitut

2.1 Polysynthesis

The Eskimo-Aleut language family, to which Inuktitut belongs, is the canonical language family for exemplifying what is meant by polysynthesis. Peter Stephen DuPonceau coined the term in 1819 to describe the structural characteristics of languages in the Americas, and it further became part of Edward Sapir’s classic linguistic typology distinctions.³ Polysynthetic languages show a high degree of synthesis, more so than other synthetic languages, in that single words in a polysynthetic language can express what is usually expressed in full clauses in other languages. Not only are these languages highly inflected, but they show a high degree of incorporation as well. In Figure 1 we see an example of a sentence in Inuktitut, Qanniqlaunngikkalauqtuqlu aninngittunga., consisting of two words, and we break those words down into their component morphemes.

    Qanniqlaunngikkalauqtuqlu
    qanniq-lak-uq-nngit-galauq-tuq-lu
    snow-a_little-frequently-NOT-although-3.IND.S-and
    “And even though it’s not snowing a great deal,”

    aninngittunga
    ani-nngit-junga
    go_out-NOT-1.IND.S
    “I’m not going out”

Figure 1: Inuktitut Words Analyzed

In this example, two Inuktitut words express what is expressed by two complete clauses in English. The first Inuktitut word shows how many morphemes representing a variety of grammatical and semantic functions (quantity “a_little”, frequency “frequently”, negation, and concession), as well as grammatical inflection (3rd person indicative singular), can be added onto a root (“snow”), in addition to a clitic (“and”), and the second word shows the same, but to a lesser degree.

³ https://en.wikipedia.org/wiki/Polysynthetic_language

2.2 Abundance of grammatical suffixes

Morphology in Inuktitut is used to express an abundant variety of grammatical features. Among those features expressed are: eight verb “moods” (declarative, gerundive, interrogative, imperative, causative, conditional, frequentative, dubitative); two distinct sets of subject and subject-object markers, per mood; four persons; three numbers (singular, dual, plural); eight “cases” on nouns: nominative (absolutive), accusative, genitive (ergative), dative (“to”), ablative (“from”), locative (“at, on, in”), similaris (“as”), and vialis (“through”); and noun possessors (with number and person variations). In addition, demonstratives show a greater variety of dimensions than in most languages, including location, directionality, specificity, and previous mention. The result of a language that uses morphology (usually suffixes) to express such a great number of variations is an abundance of word types and very sparse data sets.

2.3 Morphophonemics

In addition to the abundance of morphological suffixes that Inuktitut roots can take on, the morphophonemics of Inuktitut are quite complex. Each morpheme in Inuktitut dictates the possible sound changes that can occur to its left, and/or to itself. As a result, morphological analysis cannot proceed as mere segmentation, but rather, each surface segmentation must map back to an underlying morpheme. In this paper, we refer to these different morpheme forms as ‘surface’ morphemes and ‘deep’ morphemes. The example below demonstrates some of the typical morphophonemic alternations that can occur in an Inuktitut word, using the word mivviliarumalauqturuuq, ‘he said he wanted to go to the landing strip’:

    Romanized Inuktitut word:   mivviliarumalauqturuuq
    Surface segmentation:       miv-vi-lia-ruma-lauq-tu-ruuq
    Deep form segmentation:     mit-vik-liaq-juma-lauq-juq-guuq
    Gloss:                      land-place-go_to-want-PAST-3.IND.S-he_says

Figure 2: Morphophonemic Example

We proceed from the end to the beginning to explain the morphophonemic rules. ‘guuq’ is a regressive assimilator, acting on the point of articulation, and it also deletes. So ‘guuq’ changes to ‘ruuq’ and deletes the preceding consonant ‘q’ of ‘juq.’ ‘juq’ shows an alternation in its first consonant, which appears as ‘t’ after a consonant, and ‘j’ otherwise. ‘lauq’ is neutral after a vowel, so there is no change. ‘juma’ is like ‘guuq’, a regressive assimilator on the point of articulation, and it deletes. So ‘juma’ becomes ‘ruma,’ and the ‘q’ of the preceding morpheme is deleted. ‘liaq’ is a deleter, so the preceding ‘vik’ becomes ‘vi.’ Finally, ‘vik’ regressively assimilates a preceding ‘t’ completely, so ‘mit’ becomes ‘miv.’⁴

⁴ http://www.inuktitutcomputing.ca/Technocrats/ILFT.php#morphology

2.4 Dialect differences / spelling variation

The fourth aspect of Inuktitut which contributes to the challenge of processing it with a computer is the abundance of spelling variation seen in the electronically available texts. Three aspects of spelling variation must be taken into account. First, Inuktitut, like all languages, can be divided into a number of different dialects. Dorais (1990) lists ten: Uummarmiutun, Siglitun, Inuinnaqtun, Natsilik, Kivallirmiutun, Aivilik, North Baffin, South Baffin, Arctic Quebec, and Labrador. The primary distinction between the dialects is phonological, which is reflected in spelling. See Dorais (1990) for a discussion of dialect variation. Second, a notable error on the part of the designers of the Romanized transcription system has produced a confusion between ‘r’s and ‘q’s. It is best explained in a quote by Mick Mallon:⁵

    It's a long story, but I'll shorten it. Back in 1976, at the ICI standardization conference, because of my belief that it was a good idea to mirror the Assimilation of Manner in the orthography, it was decided to use q for the first consonant in voiceless clusters, and r for the first consonant in voiced and nasal clusters. That was a mistake. That particular distinction does not come natural to Inuit writers, (possibly because of the non-phonemic status of [ɴ].) Public signs, newspaper articles, government publications, children's literature produced by the Department of Education, all are littered with qs where there should be rs, and rs where there should be qs. Kativik did the right thing in switching to the use of rs medially, with qs left for word initial and word final. When things settle down, maybe Nunavut will make that change. It won't affect the keyboard or the fonts, but it will reduce spelling errors among

3 Related work

The work on morphological processing is abundant, so here we simply present a selected subset. Creutz and Lagus (2006) present an unsupervised approach to learning morphological segmentation, relying on a minimum description length principle (Morfessor). Kohonen et al. (2010) use a semi-supervised version of Morfessor on English and Finnish segmentation tasks to make use of unlabeled data. Narasimhan et al. (2015) use an unsupervised method to learn morphological “chains” (which extend base forms available in the lexicon) and report results on English, Arabic, and Turkish. Neural network approaches to morphological processing include a number of neural model types. Among them, Wang et al. (2016) use Long Short Term Memory (LSTM) neural morphological analyzers to learn segmentations from unsupervised text, by exploiting windows of characters to capture context, and report results on Hebrew and Arabic. Malouf (2016) uses LSTM neural networks to learn morphological paradigms for several morphologically complex languages (Finnish, Russian, Irish, Khaling, and Maltese). One recent work which may come close to the challenge of segmentation and labeling for Inuktitut is presented by Morita et al.