ThamizhiUDp: A Dependency Parser for Tamil

Kengatharaiyer Sarveswaran, University of Moratuwa, Sri Lanka, [email protected]
Gihan Dias, University of Moratuwa, Sri Lanka, [email protected]

Abstract

This paper describes how we developed a neural-based dependency parser, namely ThamizhiUDp, which provides a complete pipeline for the dependency parsing of Tamil text using the Universal Dependencies formalism. We have considered the phases of the dependency parsing pipeline and identified tools and resources for each of these phases to improve the accuracy and to tackle data scarcity. ThamizhiUDp uses Stanza for tokenisation and lemmatisation, ThamizhiPOSt and ThamizhiMorph for generating Part of Speech (POS) and morphological annotations, and uuparser with multilingual training for dependency parsing. ThamizhiPOSt is our POS tagger, which is based on Stanza and trained with the Amrita POS-tagged corpus. It is the current state of the art in Tamil POS tagging, with an F1 score of 93.27. Our morphological analyser, ThamizhiMorph, is a rule-based system with very good coverage of Tamil. Our dependency parser, ThamizhiUDp, was trained using multilingual data. It shows a Labelled Attachment Score (LAS) of 62.39, 4 points higher than the current best achieved for Tamil dependency parsing. Therefore, we show that breaking up the dependency parsing pipeline to accommodate existing tools and resources is a viable approach for low-resource languages.

1 Introduction

Applying neural-based approaches to Tamil, like other Indic languages, is challenging due to a lack of quality data (Bhattacharyya et al., 2019) and the language's structure (Sarveswaran and Butt, 2019; Butt, Rajamathangi, and Sarveswaran, 2020). Although there is a large volume of electronic unstructured and partially structured text available on the Internet, not many language processing tools are publicly available for even fundamental tasks like part of speech (POS) tagging or parsing.

Nowadays, neural-based approaches are the state of the art for most natural language processing tasks. These approaches require a significant amount of quality data for training and evaluation. On the other hand, techniques like transfer learning and multilingual learning may be used to overcome data scarcity. This paper discusses how we developed a neural-based dependency parser for Tamil with the aid of data orchestration and multilingual training.

2 Background and Motivation

Tamil is a Southern Dravidian language spoken by more than 80 million people around the world. However, it still lacks enough tools and quality annotated data to build good Natural Language Processing (NLP) applications.

2.1 Universal Dependency

Treebanks are collections of texts with various levels of annotation, including Part of Speech (POS) and morpho-syntactic annotations. There are different formalisms used to mark syntactic annotations (Marcus et al., 1993; Böhmová et al., 2003; Nivre et al., 2016; Kaplan and Bresnan, 1982). Among the available formalisms, the dependency grammar formalism is useful for languages like Tamil, which are morphologically rich and whose word order is relatively variable and less bound (Bharati et al., 2009).

The Universal Dependencies formalism (Nivre et al., 2016) is nowadays widely used to create Universal Dependency Treebanks (UD). The current release, UDv2.7, has 183 annotated treebanks of various sizes from 104 languages (Zeman et al., 2020). UD captures information such as Parts of Speech (POS), morphological features, and syntactic relations in the form of dependencies. All these annotations are defined with multilingual language processing in mind, and the format used to specify the annotation is called the CoNLL-U format (https://universaldependencies.org/format.html).

Only three Indic languages, namely Hindi, Urdu, and Sanskrit, have relatively large datasets in UDv2.7, with 375K, 138K, and 28K tokens, respectively. All the other six Indic languages, including Tamil and Telugu, have less than 12K tokens in UDv2.7.
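To make the CoNLL-U representation just described concrete, the following minimal Python sketch splits one token line into its ten standard columns. The sample line is an invented illustration (a transliterated Tamil pronoun with made-up annotation values), not an excerpt from TTB.

    # Illustrative sketch: the ten tab-separated columns of a CoNLL-U token line.
    CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                     "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

    def parse_token_line(line):
        """Split one CoNLL-U token line into a field-name -> value dictionary."""
        return dict(zip(CONLLU_FIELDS, line.rstrip("\n").split("\t")))

    # Hypothetical token line for the pronoun 'naan' ("I"); values are made up.
    sample = "1\tnaan\tnaan\tPRON\t_\tCase=Nom|Number=Sing|Person=1\t3\tnsubj\t_\t_"
    token = parse_token_line(sample)
    print(token["UPOS"], token["DEPREL"])   # PRON nsubj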
2.2 Tamil Universal Dependency Treebanks

Tamil has been included in UD treebank releases since 2015. The data was initially populated from the Prague-style Tamil treebank of Ramasamy and Žabokrtský (2012), and since then the dataset has been part of UD without many alterations or corrections. The Tamil TTB in UDv2.6 has some inaccuracies and inconsistencies. For instance, numbers are marked as both NUM and ADJ, while only the former tag is correct. The first author of this paper has corrected some of these issues and made the result available in UDv2.7. However, there are still more issues that need to be solved. Tamil TTB in UDv2.7 has altogether 600 sentences for training, development, and testing. UDv2.7 (Zeman et al., 2020) also contains another, newly added Tamil treebank of 536 sentences, namely MWTT. MWTT is based on the Enhanced Universal Dependencies annotation (https://universaldependencies.org/u/overview/enhanced-syntax.html), where complex concepts like elision, relative clauses, propagation of conjuncts, raising and control constructions, and extended case marking are captured. Therefore, there are slight variations between TTB and MWTT. Further, MWTT has very short sentences, while TTB has relatively much longer ones. In this paper, we have mainly used and discussed Tamil TTB.

2.3 Dependency parsers

A dependency parser is a type of syntactic parser which is useful to elicit lexical, morphological, and syntactic features, and the inter-connections of tokens in a given sentence. Linguistically, this is useful for syntactic analyses and comparative studies. Computationally, it is a key resource for natural language understanding (Dozat and Manning, 2018). Different approaches are employed when developing dependency parsers. However, neural-based parsers are the latest state of the art.

There are several off-the-shelf neural-based parsers available that are built around Universal Dependency Treebanks (UD), including Stanza (Qi et al., 2020) and uuparser (de Lhoneux et al., 2017) and its derivatives. Both of these are open-source tools. Stanza is a Python NLP library which includes a multilingual neural NLP pipeline for dependency parsing using the Universal Dependencies formalism. uuparser is a tool developed specifically for UD parsing. These neural-based tools need a large amount of quality data on which to be trained. On the other hand, several approaches are being used to overcome the issue of data scarcity, including multilingual training. There has been an attempt to build a multilingual parser for several low-resource languages, and it is reported that multilingual training significantly improves the parsing accuracy of low-resource languages (Smith et al., 2018).

3 ThamizhiUDp

Considering all available resources and approaches as outlined in Section 2, we decided to develop a Universal Dependency parser (UDp) for Tamil, called ThamizhiUDp, using existing open-source tools, namely Stanza, ThamizhiMorph, and uuparser. However, since we do not have enough data to train a neural-based parser end-to-end, we have broken the pipeline up into different phases. We have then orchestrated data from different sources for each of these phases, and used different tools in different phases, as shown in Table 1. The following sub-sections discuss each of the stages of the pipeline, and how we went about developing them.

Our dependency parsing pipeline has several stages, as shown in Figure 1. As mentioned, we used different datasets and tools, as shown in Table 1, in the different stages of the pipeline. Currently, Stanza does not have support for multilingual training. Therefore, for dependency parsing, we used uuparser with multilingual training. Each of the phases within the pipeline is explained in the following respective sub-sections.

Figure 1: Phases of the Parsing Pipeline (image source: https://stanfordnlp.github.io/stanza)

Step                    Tool            Dataset
Tokenisation            Stanza          Tamil UDT
Multi-word tokeniser    Stanza          Tamil UDT
Lemmatisation           Stanza          Tamil UDT
POS tagging             ThamizhiPOSt    Amrita data
Morphological tagging   ThamizhiMorph   Rule-based
Dependency parsing      uuparser        UDT of various languages

Table 1: ThamizhiUDp process pipeline
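For reference, the sketch below shows how the off-the-shelf Stanza pipeline mentioned in Section 2.3 is invoked for Tamil. It assumes that the installed Stanza release ships pretrained Tamil ('ta') models with the listed processors; it is not ThamizhiUDp itself, whose stages replace or retrain several of these components.

    import stanza

    # Download the pretrained Tamil models once (assumed to be the TTB-trained
    # models shipped with the installed Stanza release).
    stanza.download('ta')

    # Tokenisation, multi-word token expansion, POS tagging, lemmatisation,
    # and dependency parsing with the default Tamil pipeline.
    nlp = stanza.Pipeline('ta', processors='tokenize,mwt,pos,lemma,depparse')

    doc = nlp('...')  # pass raw Tamil text here
    for sent in doc.sentences:
        for word in sent.words:
            print(word.text, word.lemma, word.upos, word.head, word.deprel)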
3.1 Tokenisation

First, the given texts were Unicode normalised, then tokenised and broken up into sentences. We developed a script (https://github.com/sarves/thamizhi-validator) to do Unicode normalisation. Because of different input methods or other reasons, the same surface form of a character is at times stored using different Unicode sequences. These therefore need to be normalised; otherwise, a computer would consider them to be different characters. Once this normalisation was done, we moved on to tokenisation. To do this, we trained Stanza with the texts available in TTB. During this phase, punctuation marks were separated from words, and the given texts were broken up into sentences.
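The idea behind the normalisation step can be illustrated with Python's standard unicodedata module. This is a minimal sketch of the concept, not our normalisation script, and the code points shown are one assumed pair of equivalent encodings of the same syllable.

    import unicodedata

    def normalise(text):
        # Map canonically equivalent Unicode sequences to one normal form (NFC),
        # so the same Tamil syllable typed via different input methods compares
        # as equal.
        return unicodedata.normalize('NFC', text)

    # The vowel sign of 'po' may arrive either precomposed or as two code points.
    precomposed = '\u0BAA\u0BCA'        # TAMIL LETTER PA + VOWEL SIGN O
    decomposed = '\u0BAA\u0BC6\u0BBE'   # TAMIL LETTER PA + VOWEL SIGN E + VOWEL SIGN AA
    print(precomposed == decomposed)                        # False
    print(normalise(precomposed) == normalise(decomposed))  # True after normalisation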
3.2 Multi-word tokenisation using Stanza

After the initial tokenisation, syntactically compound words, or multi-word tokens, were broken into syntactic units as proposed by the UD guidelines (https://universaldependencies.org/u/overview/tokenization.html), so that syntactic dependencies can be marked precisely. Such syntactically compound constructions are common in Tamil. For instance, words with the -um clitic will be tokenised, like naanum = naan + um 'I + and', so that the coordinating conjunctive dependency can be shown easily. In the current TTB UDv2.7, there are 520 instances of multi-word tokens among the 400 sentences in the training set. We used this TTB training set to train our multi-word tokeniser using Stanza. However, multi-word tokens are not properly divided in TTB. We are in the process of improving this multi-word tokeniser with the use of more data.

3.3 Lemmatisation using Stanza

UD treebanks also have lemmas marked in their CoNLL-U annotation. This is useful for language processing applications, such as a machine translator. We trained Stanza using TTB UDv2.6 to do lemmatisation. However, the current TTB has several inaccuracies in identifying lemmas, specifically due to improper multi-word tokenisation. Since a lemma is identified for each multi-word tokenised unit, multi-word tokenisation has an effect on lemmatisation. Since we are still in the process of improving our multi-word tokeniser, lemmatisation will also be improved in the future.

3.4 POS tagging using ThamizhiPOSt

Part of Speech (POS) tagging is an important phase in the parsing process, where each word in a sentence is assigned its POS tag (or lexical category) information. Several attempts have been made to define POS tagsets for Tamil, based on different theories and levels of granularity; Sarveswaran and Mahesan (2014) give an account of the different tagsets. Among these, Amrita (Anand Kumar et al., 2010) and BIS (http://www.tdil-dc.in/tdildcMain/articles/134692DraftPOSTagstandard.pdf) are two popular tagsets. In addition to the tagsets, Amrita and BIS POS-tagged data are also available. The corpus tagged using the BIS tagset (http://www.au-kbc.org/nlp/corpusrelease.html) is taken from a historical novel, while the corpus tagged using Amrita is taken from news websites. Furthermore, we found that the Amrita data is cleaner and more consistent when it comes to POS tagging. We also harmonised the tags found in the BIS, Amrita, and UPOS (https://universaldependencies.org/u/pos/all.html) tagsets.

Though there have been several attempts to develop a POS tagger for Tamil, these are either not available or have not given convincing results. Moreover, only a few neural-based approaches for Tamil POS tagging have been developed. Therefore, we decided to develop a POS tagger, namely ThamizhiPOSt, using Stanza, and to publish it as an open-source tool. We used the Amrita POS-tagged corpus to train this tagger. The development process is outlined briefly below.

First, we mapped Amrita's 32 POS tags to Universal POS (UPOS) tags; see Table 4 in Appendix A for the Amrita-to-UPOS mapping. In doing so, we converted the annotations from Amrita POS tags to UPOS tags. We then divided the Amrita POS-tagged corpus of 17K sentences into 11K, 5K, and 1K sentences for training, development, and testing, respectively. Thereafter, we converted these datasets into the CoNLL-U format so that they could be fed to Stanza. Following that, we trained and evaluated ThamizhiPOSt, which is a Stanza instance trained on the Amrita data. During the training, we also used a fastText model (Bojanowski et al., 2016) to capture context in POS tagging, as specified in Stanza.

We also evaluated ThamizhiPOSt using the Tamil UDv2.6 test data. The F1 score of the evaluation was 93.27, which is higher than the results reported for existing neural-based POS taggers, as shown in Table 2.

Neural-based POS taggers       F1 Score
PVS and Karthik (2007)         87.0*
Mokanarangan et al. (2016)     87.4
Qi et al. (2020)               82.6**
ThamizhiPOSt                   93.27

Table 2: Scores of neural-based POS taggers for Tamil POS tagging
* github.com/avineshpvs/indic_tagger
** stanfordnlp.github.io/stanza/performance.html
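As a rough sketch of this data preparation, the code below writes POS-annotated sentences as minimal CoNLL-U records, with only the FORM and UPOS columns filled, which is the general shape of the training input expected by Stanza's tagger. The tag dictionary is abridged (the full mapping is Table 4 in Appendix A) and the toy sentence is invented.

    # Sketch: turn (word, Amrita tag) sentences into minimal CoNLL-U records for
    # POS-tagger training. Only FORM and UPOS are filled; other columns stay "_".
    AMRITA_TO_UPOS = {"NN": "NOUN", "NNP": "PROPN", "PRP": "PRON", "VF": "VERB"}  # abridged

    def to_conllu(sentences):
        lines = []
        for sent in sentences:
            for i, (form, amrita_tag) in enumerate(sent, start=1):
                upos = AMRITA_TO_UPOS.get(amrita_tag, "X")  # X for unmapped tags
                lines.append("\t".join([str(i), form, "_", upos, "_", "_", "_", "_", "_", "_"]))
            lines.append("")  # a blank line ends each sentence
        return "\n".join(lines) + "\n"

    # Invented transliterated example: "naan vanthen" ("I came").
    print(to_conllu([[("naan", "PRP"), ("vanthen", "VF")]]))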
3.5 Morphological tagging

We used an open-source morphological analyser, ThamizhiMorph (Sarveswaran et al., 2019, 2018; https://github.com/sarves/thamizhi-morph), which we developed as part of our project on a computational grammar for Tamil, to generate morphological features according to the UD specification (https://universaldependencies.org/u/feat/all.html). Since we developed this rule-based analyser for grammar development purposes, it gives a very detailed analysis for each given word. We used ThamizhiMorph in this process: at this stage, we fed the tokenised, lemmatised, and POS-tagged data in the CoNLL-U format to ThamizhiMorph to do the morphological analyses.

As a morphological analyser, ThamizhiMorph gives us all the possible morphological analyses for a given word. In addition to the morphological analysis, it also gives the POS tag and lemma information. When the lemma of a given surface form is not found in the ThamizhiMorph lexicon, it uses a rule-based guesser to predict the lemma; sometimes this fails too, especially when there is a foreign word.

However, for our parsing purpose, we wanted to get the single correct morphological analysis based on the context. This was challenging. To tackle this challenge, we used a disambiguation process to generate a single analysis. However, we still failed at times, especially where multiple analyses arise because of the way some people write. When this was the case, we manually picked the correct analysis, even after our disambiguation process. We are now in the process of training a Stanza-based morphological analyser using the data generated by ThamizhiMorph. We hope this will improve the robustness, especially when there are out-of-vocabulary tokens.

3.6 Dependency parsing

When we looked for a dependency parser, we found that none existed that was specifically trained for Tamil. For the TTB test data, in their default configurations, the off-the-shelf Stanza and uuparser give Labelled Attachment Scores (LAS) of 57.64 and 55.76, respectively.

We wanted to improve the accuracy; however, we could not find any datasets with dependency annotations other than TTB UDv2.6 at the time of development. To overcome this data scarcity, we tried multilingual training for Tamil along with Hindi HDTB (https://github.com/UniversalDependencies/UD_Hindi-HDTB/tree/master), Turkish (https://github.com/UniversalDependencies/UD_Turkish-IMST/tree/master), Arabic (https://github.com/UniversalDependencies/UD_Arabic-PADT/tree/master), and Telugu (https://github.com/UniversalDependencies/UD_Telugu-MTG/tree/master), which we found to be relevant and available in UDv2.6. We did this multilingual training using uuparser. The experiments gave good results when compared with what is reported by Stanza and uuparser, as shown in Table 3.

As shown in Table 3, we got a LAS of 62.39 when training with Hindi HDTB UDv2.6, but, surprisingly, not when training with Telugu, which is also a Dravidian language like Tamil. We also trained the parser with the whole Telugu and Hindi treebanks along with Tamil; however, the score with Telugu was lower than what we got when training with Hindi data. For all these experiments, we used the Tamil testing set available in TTB UDv2.6.

Languages (# of sent.)                 Accuracy (LAS)
with Telugu (100)                      58.91
with Telugu (1050)                     59.22
with Hindi (1600)                      62.39
with Telugu (100) and Arabic (100)     58.04
with Telugu (100) and Turkish (100)    58.43
with Telugu (100) and Hindi (100)      59.07

Table 3: LAS of multilingual parsing
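For clarity, the LAS values in Table 3 are the percentage of words whose predicted head and dependency label both match the gold annotation. A minimal sketch of the metric (not the official CoNLL evaluation script) is:

    # Minimal sketch of the Labelled Attachment Score (LAS): a word counts as
    # correct only if both its predicted head and its dependency label match gold.
    def las(gold, predicted):
        """gold and predicted are equal-length lists of (head, deprel) pairs."""
        correct = sum(1 for g, p in zip(gold, predicted) if g == p)
        return 100.0 * correct / len(gold)

    # Invented three-word example: the third word gets the wrong head.
    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
    print(round(las(gold, pred), 2))   # 66.67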
4 Discussion

Tamil TTB has not undergone any major revisions or corrections since its initial release. It has several issues in POS tagging, multi-word tokenisation, and dependency tagging. Altogether, we only have 600 sentences for training, development, and testing, and some of these sentences are very long. All of these made the training of a UD parser a difficult task. We tried to overcome some of these issues using other data and tools available online. However, we still depend on this dataset for some parts of the training, such as for dependency parsing.

Only one other treebank, Telugu MTG UDv2.6, which is the closest to Tamil in terms of linguistic structure, is available as of today. We observed that Telugu UDv2.6 is small in size: it has only around 1,050 sentences, compared to Hindi HDTB UDv2.6. Moreover, Telugu has very short sentences without any morphological feature information. On the other hand, some sentences in TTB UDv2.6 have up to 40 tokens. Because of all these varied factors, we could not achieve much improvement when using Telugu MTG UDv2.6 in multilingual training. However, Hindi, which belongs to a different language family, showed better performance when used for multilingual training. We additionally noticed that the accuracy of the dependency parsing improved when we increased the Hindi data size during training.

Another challenge we faced was finding quality test data or benchmark datasets for evaluation. In the current practice, everyone evaluates their tools using their own dataset. Therefore, it is always a challenging task to reproduce or compare results. In our case, for dependency parsing, we used the UD test data. However, it is not a clean and error-free dataset for evaluation. For this reason, we have now started working on a Tamil dependency treebank which can soon be used as an evaluation dataset.

We used our personal computers without any Graphical Processing Units (GPUs) to carry out all these experiments. However, high-performance computing resources would save time, and we might need to go for such resources when we increase the size of the datasets.

5 Conclusion

We have implemented a Universal Dependency parser for Tamil, ThamizhiUDp, which annotates a Tamil sentence with POS, lemma, morphology, and dependency information in the CoNLL-U format. We have developed a parsing pipeline using several open-source tools and datasets to overcome data scarcity. We have also used multilingual training to overcome the scarcity of dependency-annotated data. ThamizhiPOSt, a POS tagger for Tamil, has been implemented using Stanza and the Amrita POS-tagged dataset. ThamizhiPOSt outperforms existing neural-based POS taggers, and gives an F1 score of 93.27. Further, we obtained the best dependency parsing accuracy, a LAS of 62.39, in a multilingual training setting with Hindi HDTB, using uuparser. More importantly, we have made our tools ThamizhiPOSt (http://nlp-tools.uom.lk/thamizhi-pos, https://github.com/sarves/thamizhi-pos/), ThamizhiMorph (http://nlp-tools.uom.lk/thamizhi-morph, https://github.com/sarves/thamizhi-morph/), and ThamizhiUDp (http://nlp-tools.uom.lk/thamizhi-udp, https://github.com/sarves/thamizhi-udp/), along with the relevant models, datasets, and scripts, available as open source for others to use and extend.

Acknowledgements

We would like to express our appreciation to Maris Camilleri from the University of Essex for her support in language editing, and to the three anonymous reviewers for their valuable comments and inputs to improve this manuscript.

This research was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka, funded by the World Bank.

References
M Anand Kumar, V Dhanalakshmi, KP Soman, and S Rajendran. 2010. A sequence labeling approach to morphological analyzer for Tamil language. International Journal on Computer Science and Engineering (IJCSE), 2(06):1944–195.

Akshar Bharati, Mridul Gupta, Vineet Yadav, Karthik Gali, and Dipti Misra Sharma. 2009. Simple parser for Indian languages in a dependency framework. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 162–165.

Pushpak Bhattacharyya, Hema Murthy, Surangika Ranathunga, and Ranjiva Munasinghe. 2019. Indic language computing. Communications of the ACM, 62(11):70–75.

Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2003. The Prague Dependency Treebank. In Treebanks, pages 103–127. Springer.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Miriam Butt, S. Rajamathangi, and K. Sarveswaran. 2020. Mixed categories in Tamil via complex categories. In Proceedings of the LFG20 Conference, Stanford. CSLI Publications.

Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to Universal Dependencies - look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 207–217.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics.

Ron Kaplan and Joan Bresnan. 1982. Lexical functional grammar: a formal system for grammatical representation. In The Mental Representation of Grammatical Relations, J. Bresnan (ed.), pages 173–281. Cambridge, MA: MIT Press.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

T Mokanarangan, T Pranavan, U Megala, N Nilusija, Gihan Dias, Sanath Jayasena, and Surangika Ranathunga. 2016. Tamil morphological analyzer using support vector machines. In International Conference on Applications of Natural Language to Information Systems, pages 15–23. Springer.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Avinesh PVS and G Karthik. 2007. Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Shallow Parsing for South Asian Languages, 21:21–24.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.

Loganathan Ramasamy and Zdeněk Žabokrtský. 2012. Prague dependency style treebank for Tamil. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 1888–1894, İstanbul, Turkey.

K Sarveswaran and S Mahesan. 2014. Hierarchical tag-set for rule-based processing of Tamil language. International Journal of Multidisciplinary Studies (IJMS), 1(2):67–74.

K Sarveswaran, Gihan Dias, and Miriam Butt. 2018. ThamizhiFST: A morphological analyser and generator for Tamil verbs. In 2018 3rd International Conference on Information Technology Research (ICITR), pages 1–6. IEEE.

K Sarveswaran, Gihan Dias, and Miriam Butt. 2019. Using meta-morph rules to develop morphological analysers: A case study concerning Tamil. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing, pages 76–86, Dresden, Germany. Association for Computational Linguistics.

Kengatharaiyer Sarveswaran and Miriam Butt. 2019. Computational challenges with Tamil complex predicates. In Proceedings of the LFG19 Conference, Australian National University, pages 272–292, Stanford. CSLI Publications.

Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, and Sara Stymne. 2018. 82 treebanks, 34 models: Universal Dependency parsing with multi-treebank models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2020. Universal Dependencies 2.7. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Appendix A: Harmonisation of Amrita and UPOS tagsets

Amrita   UPOS     Amrita   UPOS
NN       NOUN     CVB      CCONJ
NNC      NOUN     PPO      ADP
NNP      PROPN    CNJ      CCONJ
NNPC     PROPN    DET      DET
ORD      NUM      COM      CCONJ
CRD      NUM      EMP      PART
PRP      PRON     ECH      PART
PRIN     PRON     RDW      ADP
ADJ      ADJ      QW       VERB
ADV      ADV      QM       PUNCT
VNAJ     VERB     INT      ADJ
VNAV     VERB     NNQ      NUM
VINT     VERB     QTF      NUM
VBG      NOUN     COMM     PUNCT
VF       VERB     DOT      PUNCT
VAX      VAUX

Table 4: Harmonisation of Amrita and UPOS tagsets
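For use in code, the mapping in Table 4 can be written directly as a Python dictionary; the rendering below simply restates the table.

    # Table 4 as a Python dictionary: Amrita POS tags mapped to UPOS tags.
    AMRITA_TO_UPOS = {
        "NN": "NOUN",   "NNC": "NOUN",   "NNP": "PROPN",  "NNPC": "PROPN",
        "ORD": "NUM",   "CRD": "NUM",    "PRP": "PRON",   "PRIN": "PRON",
        "ADJ": "ADJ",   "ADV": "ADV",    "VNAJ": "VERB",  "VNAV": "VERB",
        "VINT": "VERB", "VBG": "NOUN",   "VF": "VERB",    "VAX": "VAUX",
        "CVB": "CCONJ", "PPO": "ADP",    "CNJ": "CCONJ",  "DET": "DET",
        "COM": "CCONJ", "EMP": "PART",   "ECH": "PART",   "RDW": "ADP",
        "QW": "VERB",   "QM": "PUNCT",   "INT": "ADJ",    "NNQ": "NUM",
        "QTF": "NUM",   "COMM": "PUNCT", "DOT": "PUNCT",
    }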