A Survey of Part-Of-Speech Tagging Approaches Applied to K'iche'

A survey of part-of-speech tagging approaches applied to K’iche’ Francis Tyersy} Nick Howell} y Department of Linguistics } School of Linguistics Indiana University HSE University Bloomington, IN Moscow [email protected] [email protected] Abstract lemmata [qʼojomaj, le, qʼojom] and the set of feature value pairs for each of the forms.1 We study the performance of several popular neural part-of-speech taggers from the Uni- (1) Kinqʼojomaj le qʼojom. versal Dependencies ecosystem on Mayan lan- k-∅-in-qʼojomaj le qʼojom. guages using a small corpus of 1435 anno- IMP-B3SG-A1SG-play the marimba. tated K’iche’ sentences consisting of approximately 10,000 tokens, with encouraging re- ‘I play the marimba.’ sults: F1 scores 93%+ on lemmatisation, part- of-speech and morphological feature assign- A brief reading guide: prior work, on Mayan ment. The high performance motivates a cross- language part-of-speech tagging study, where and other languages of the Americas and on cross- K’iche’-trained models are evaluated on two language part-of-speech tagging, is reviewed in sec- other Mayan languages, Kaqchikel and Uspan- tion 2. Our experimental design including the math- teko: performance on Kaqchikel is good, 63- ematical model used for analysing performance are 85%, and on Uspanteko modest, 60-71%. Sup- given section 3. Universal dependencies annotation porting experiments lead us to conclude the rel- for K’iche’ and the systems tested are described in ative diversity of morphological features as a section 4, and results are presented and analysed in plausible explanation for the limiting factors in cross-language tagging performance, providing section 5. some direction for future sentence annotation and collection work to support these and other 2 Prior work Mayan languages. Palmer et al. (2010) explore morphological segmentation and analysis for the purpose of generating in- 1 Introduction terlinearly glossed texts. They work with Uspan- This paper presents a survey of approaches to part- teko, a language of the Greater Quichean branch, of-speech tagging for K’iche’, a Mayan language and the closest language to K’iche’ we were able spoken principally in Guatemala. The Mayan lan- to identify with published studies of computational guages are a group of related languages spoken morphology. They explore several different sys- throughout Mesoamerica. K’iche’ belongs to the tems: inducing morphology from parallel texts, an Eastern branch, which contains 14 other languages, unsupervised segmentation+clustering strategy, and including Kaqchikel in the Quichean subgroup and an interactive training strategy with a linguist. Uspanteko which belongs to its own subgroup. In Sachse and Dürr (2016), a set of preliminary Part-of-speech tagging has wide usage in corpus annotation conventions for Mayan languages in gen- and computational linguistics and natural language eral, and K’iche’ in particular, are proposed. processing, and is often considered part of a toolkit A maximum-entropy part-of-speech tagger is for basic natural language processing. presented in Kuhn and Mateo-Toledo (2004) for Q’anjob’al, which, like K’iche’, is a Mayan language In the definition of part-of-speech tagging we of Guatemala. They work with a custom selection subsume the tasks of determining the part of speech, morphological analysis and lemmatisation. That is, 1For example for the VERB it would re- given a sentence such as in (1) part-of-speech tag- turn Aspect=Imp, Number[obj]=Sing, Number[subj]=Sing, Person[obj]=3, ging would return both the sequence of part-of- Person[subj]=1, Subcat=Tran, speech tags [VERB, DET, NOUN] but also the VerbForm=Fin. 44 Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 44–52 June 11, 2021. ©2021 Association for Computational Linguistics of 60 tags, and trained on an annotated corpus of seen in Table 1. 4100 words (no lemmatisation is performed). In We studied the performance of several popular contrast to the systems we will study, Kuhn and part-of-speech taggers within the Universal Depen- Mateo-Toledo (2004) perform feature engineering dencies ecosystem; these are reviewed in section 4. and end up with F1 scores between 63% and 78%, Performance was computed as F1 scores for lem- depending on the features chosen. matisation, universal part-of-speech (UPOS), and There is much work on part-of-speech tagging universal morphological features (UFeats). We perfor languages of the Americas outside of the Mayan formed 10-fold cross validation to obtain mean and family: statistical lemmatisation and part-of-speech standard deviation of F1. We also recorded training tagging systems are described by Pereira-Noriega time and model size to compare the resource con- et al. (2017) and a finite-state morphological anal- sumption of the models in the training process. yser by Cardenas and Zeman (2018) for Shipibo- We selected the best-performing system and per- Konibo, a Panoan language of the Amazonian re- formed a convergence study (see section 5.3 for re- gion of Peru. sults). We decimated the training data of one of the In Rios (2010) and Rios (2015), respec- test-train splits from the cross-validation, and plot- tively, finite-state morphology and support vec- ted the performance of models trained on the deci- tor machine-based tagging+parsing systems are de- mations. scribed for Quechua. The latter uses a corpus that We make the following assumption about the comprises 2k sentences. performance: additional training data provides ex- Cross-language part-of-speech tagging through ponentially decreasing performance improvement. parallel corpora, sometimes called annotation pro- Under this assumption, we obtain the formula: jection, is well-studied; in Mayan languages, Palmer 1 − · −n/k et al. (2010) use a parallel corpus as a bridge to F1(n) = F1( ) ∆F1 e . (1) a higher-resourced language for which a part-of- Here F (n) is the performance of a model trained speech tagger already exists. 1 on n tokens, F (1) is the asymptotic performance, In the absence of such a corpus, so-called “zero- 1 and ∆F is the gap between F (1) (estimated shot” methods are created from other (presumably 1 1 maximum performance) and F (0) (zero-shot per- higher-resourced) languages and applied to the tar- 1 formance). get language. The main balance to strike is be- The parameter k is the characteristic number of tween specificity of resources (how closely-related tokens; each additional k tokens of training data are the other languages) and quantity of resources causes the gap ∆F = F (1) − F (n) to shrink (how much linguistic data is accessible). UDify 1 1 1 by a factor of 1/e ≈ 36%. This can be used to es- of Kondratyuk and Straka (2019) is an example timate the training data n required to meet a given of preferring the latter: a deep neural architec- performance target F target: ture is trained on all of the Universal Dependen- 1 cies treebanks. The former strategy can be seen · ∆F1 in Huck et al. (2019), where in addition to anno- n = k log target (2) F (1) − F tation projection, authors attempt zero-shot tagging 1 1 of Ukrainian with a model trained on Russian. We fit this curve against our convergence data and estimate peak performance and characteristic num- 3 Methodology ber. Error propagation is used with the error in pa- We used a corpus of K’iche’2 annotated with part- rameter estimation to compute the error bands in of-speech tags and morphological features (Tyers the graph: and Henderson, 2021). The corpus consisted of ( ) X @F 2 1,435 sentences comprising approximately 10,000 (δF )2 = 1 δx (3) 1 @x tokens from a variety of text types and was annotated according to the guidelines of the Universal Here x runs over the parameters of F1(n): F1(1), Dependencies (UD) project (Nivre et al., 2020). ∆F1 and k. An example of a sentence from the corpus can be We also studied the best-performer in cross- 2https://github.com/ language tagging on the related Kaqchikel and Us- UniversalDependencies/UD_Kiche-IU panteko languages. The 10 models trained in 45 cross-validation were all evaluated on small part-of- the reported hardware. Error could be introduced speech-tagged corpora of 157 (Kaqchikel) and 160 into these estimates from many sources: only the (Uspanteko) sentences. For results and overviews reported device is considered, ignoring many other of the languages, see section 6. components of the machine; devices are assumed to run at their TDP the entire runtime; the UDify num- 4 Systems bers as reported by Kondratyuk and Straka (2019) We tested morphological analysis on three systems are approximate. designed for Universal Dependencies treebanks: UDPipe (Straka et al., 2016), UDPipe 2 (Straka, 5.2 Task performance 2018), and UDify (Kondratyuk and Straka, 2019). We evaluated the performance of the models on five Of these, only UDPipe had a working tokeniser. tasks: tokenisation (Tokens), word segmentation For other taggers we trained, we trained the UD- (Words), lemmatisation (Lemmas), part-of-speech Pipe tokeniser and other tagger together. We thus tagging (UPOS) and morphological tagging (Fea- present combined tokeniser-tagger systems. tures). The difference between tokenisation and UDPipe (Straka et al., 2016) is a language- word segmentation can be explained with reference independent trainable tokeniser, lemmatiser, POS to Table 1. The word chqawach ‘to us’ counts as a tagger, and dependency parser designed to train on single token, but two syntactic words. So the perfor- and produce Universal Dependencies-format tree- mance of tokenisation is recovering the tokens, and banks. It uses gated linear units for tokenisation, av- the performance of word segmentation is recover- eraged perceptrons for part-of-speech tagging, and ing the words.

Load more