Development of Natural Language Processing Tools for Cook Islands Maori¯
Total Page:16
File Type:pdf, Size:1020Kb
Development of Natural Language Processing Tools for Cook Islands Maori¯ Rolando Coto-Solano1, Sally Akevai Nicholas2, and Samantha Wray3 1School of Linguistics and Applied Language Studies Victoria University of Wellington Te Whare Wananga¯ o te Upoko¯ o te Ika a Maui¯ [email protected] 2School of Language and Culture Auckland University of Technology [email protected] 3Neuroscience of Language Lab New York University Abu Dhabi, United Arab Emirates [email protected] Abstract 1.1 Minority Languages and NLP Lack of resources makes it difficult to train data- This paper presents three ongoing projects driven NLP tools for smaller languages. This is for NLP in Cook Islands Maori:¯ Un- compounded by the difficulty in generating input trained Forced Alignment (approx. 9% er- for Indigenous and endangered languages, where ror when detecting the center of words), dwindling numbers of speakers, non-standardized automatic speech recognition (37% WER writing systems and lack of resources to train spe- in the best trained models) and automatic cialist transcribers and analysts create a vicious part-of-speech tagging (92% accuracy for cycle that makes it even more difficult to take ad- the best performing model). These new re- vantage of NLP solutions. Amongst the hundreds sources fill existing gaps in NLP for the of languages of the Americas, for example, very language, including gold standard POS- few have large spoken and written corpora (e.g. tagged written corpora, transcribed speech Zapotec from Mexico, Guaran´ı from Paraguay and corpora, and time-aligned corpora down Quechua from Bolivia and Peru),´ some have spo- to the phoneme level. These are part of ken and written corpora, and only a handful pos- efforts to accelerate the documentation of sess more sophisticated tools like spell-checkers Cook Islands Maori¯ and to increase its vi- and machine translation (Mager et al., 2018). tality amongst its users. Overcoming these limitations is an important part of accelerating language documentation. Cre- 1 Introduction ating NLP resources also enhances the profile of endangered languages, creating a symbolic impact Cook Islands Maori¯ has been moderately well to generate positive associations towards the lan- documented with two dictionaries (Buse et al., guage and attract new learners, particularly young 1996; Savage, 1962), a comprehensive description members of the community who might not other- (Nicholas, 2017), and a corpus of audiovisual ma- wise see their language in a digital environment terials (Nicholas, 2012). However these materials (Aguilar Gil, 2014; Kornai, 2013). are not yet sufficiently organized or annotated so As for Polynesian languages, te reo Maori,¯ as to be machine readable and thus maximally use- the Indigenous language spoken in Aotearoa New ful for both scholarship and revitalization projects. Zealand, is the one that has received the most at- The human resources needed to achieve the de- tention from the NLP community. It has multiple sired level of annotation are not available, which corpora and Google provides machine-translations has encouraged us to take advantage of NLP meth- for it as part of Google Translate, and tools such ods to accelerate documentation and research. as speech-to-text, text-to-speech and parsing are Rolando Coto Solano, Sally Akevai Nicholas and Samantha Wray. 2018. Development of Natural Language Processing Tools for Cook Islands Maori.¯ In Proceedings of Australasian Language Technology Association Workshop, pages 26−33. under development (Bagnall et al., 2017). Other (Nicholas, 2018, 46). Overall its vitality is be- languages in the family have also received some tween a 7 (shifting) and an 8a (moribund) on attention. For example, Johnson et al., (2018) have the Expanded Graded Intergenerational Disrup- worked on forced alignment for Tongan. tion Scale (Lewis and Simons, 2010, 2). Among the diaspora population and that on the island 1.2 Cook Islands Maori¯ of Rarotonga there has been a shift to English Southern Cook Islands Maori¯ 1 (ISO 639-3 rar, and there is very little intergenerational transmis- or glottology raro1241) is an endangered East sion of Cook Islands Maori.¯ The only contexts Polynesian language indigenous to the realm of where the vitality is strong is within the small New Zealand. It originates from the southern populations of the remaining islands of the south- Cook Islands (see figure1). Today however, ern Cook Islands, where Cook Islands Maori¯ still most of its speakers reside in diaspora popu- serves as the lingua franca (see Nicholas (2018) lations in Aotearoa New Zealand and Australia for a full discussion). (Nicholas, 2018). The languages most closely related to Cook Islands Maori¯ are the languages 1.3 Grammatical Structure of Rakahanga/Manihiki (rkh, raka1237) and Penrhyn (pnh, penr1237) which originate from The following is a selection of grammatical fea- the northern Cook Islands. Te reo Maori¯ (mri, tures of Cook Islands Maori¯ as described in maor1246) and Tahitian (tah, tahi1242), Nicholas (2017). Cook Islands Maori¯ is of the both East Polynesian languages, are also closely isolating type with very few productive morpho- related. There is some degree of mutual intelligi- logical processes. There is widespread polysemy, bility between these languages but they are gen- particularly within the grammatical particles. For erally considered to be separate languages by lin- example, in the following sentence, glossed using guists and community members alike. the Leipzig Glossing Rules (Bickel et al., 2008), there are four homophones of the particle i: the past tense marker, the cause preposition, the loca- tive preposition and the temporal locative preposi- tion. (1) I mate a Mere i te mango¯ PST be-dead DET Mere CAUSE the shark i roto i te moana i LOC inside LOC the ocean LOC.TIME te Tapati. the Sunday ‘Mere was killed by the shark in the ocean on Sunday.’ Furthermore, nearly every lexical item that can occur in a verb phrase can also occur in a noun Figure 1: Cook Islands (CartoGIS Services et al., phrase without any overt derivation, making tasks 2017) like POS tagging more difficult (see section 2.3). The unmarked constituent order is predicate ini- Cook Islands Maori¯ is severely endangered tial. There are verbal and non-verbal predicate 1Southern Cook Islands Maori¯ (henceforth Cook Islands types. In sentences with verbal predicates the un- Maori)¯ has historically been called Rarotongan by non- marked order is VSO. The phoneme paradigm is Indigenous scholars. However, this name is disliked by the small, as is typical for Polynesian languages, with speech community and should not be used to refer to South- ern Cook Islands Maori¯ but rather to the specific variety orig- 9 consonants and 5 vowels which have a phonemic inating from the island of Rarotonga (Nicholas, 2018:36). length distinction. 27 2 CIM NLP Projects Under Development CIM Arpabet kite K IY1 T EH1 There is a need to accelerate the documentation nga‘i¯ NG AE1 T IY1 of the endangered Cook Islands Maori¯ language, so that it can be revitalized in Rarotonga and Table 1: Arpabet conversion of CIM data Aotearoa New Zealand and its usage domains can be expanded in the islands where it is still the lingua franca. We have begun working on to align previously transcribed recordings of Cook this through three projects: (i) We have used un- Islands Maori¯ speech. Figure2 shows the output trained forced speech alignment to generate corre- of this process, a time-aligned transcription of the spondences between transcriptions and their audio audio in the Praat (Boersma et al., 2002) TextGrid files, with the purpose of improving phonetic and format. phonological documentation. (ii) We are training speech-to-text models to automatize the transcrip- tion of both legacy material and recordings made during linguistic fieldwork. (iii) We are develop- ing an interface-accompanied part-of-speech tag- ger as a first step towards full automatic parsing of the language. The following subsections provide details about the current state of each project. 2.1 Untrained Forced Alignment Figure 2: Praat TextGrid for CIM forced aligned text Forced alignment is a technique that matches the sound wave with its transcription, down to the After the automatic TextGrids were generated, word and phoneme levels (Wightman and Talkin, 1045 phonemic tokens (628 vowels, 298 conso- 1997). This makes the work of generating align- nants, 119 glottal stops) and the words containing ment grids up to 30 times faster than manual pro- them were hand-corrected to verify the accuracy cessing (Labov et al., 2013). In theory, forced of the automatic system. Table2 shows a summary alignment needs a specific language model to of the results. The alignment showed error rates of function (e.g. an English language model aligning 9% for the center of words and 20% for the center English text). However, untrained forced align- of vowels2. This error is higher than that observed ment, where for example Cook Islands Maori¯ au- for other instances of untrained forced alignment dio recordings and transcriptions are processed us- (2%, 7% and 3% than that observed for the Central ing an English language model, has been proven American languages Bribri, Cabecar´ and Malecu to be useful for the alignment of text and audio respectively (Coto-Solano and Solorzano, 2016; in Indigenous and under-resourced languages (Di- Coto-Solano and Solorzano´ , 2017)), but it pro- Canio et al., 2013). vides significant improvements in speed over hand In order to use this technique, a dictionary of alignment. Cook Islands Maori¯ to English Arpabet was built, Error (relative to the so that the Cook Islands Maori¯ words could be in- Type of interval troduced as new English words into the the align- duration of the interval) ment tool. Some examples are shown in table1. Words 9% ± 12% The phones of Cook Islands Maori¯ words were Vowels 20% ± 25% matched with the closest English phone in the Arpabet system (e.g.