Thamizhiudp: a Dependency Parser for Tamil

ThamizhiUDp: A Dependency Parser for Tamil Kengatharaiyer Sarveswaran Gihan Dias University of Moratuwa / Sri Lanka University of Moratuwa / Sri Lanka [email protected] [email protected] Abstract state of the art for most natural language process- This paper describes how we developed ing tasks. These approaches require a significant a neural-based dependency parser, namely amount of quality data for training and evaluation. ThamizhiUDp, which provides a complete On the other hand techniques like transfer learning, pipeline for the dependency parsing of the and multilingual learning may be used to overcome Tamil language text using Universal Depen- data scarcity. This paper discusses how we devel- dency formalism. We have considered the oped a neural-based dependency parser for Tamil phases of the dependency parsing pipeline and with the aid of data orchestration and multilingual identified tools and resources in each of these training. phases to improve the accuracy and to tackle data scarcity. ThamizhiUDp uses Stanza for tokenisation and lemmatisation, ThamizhiPOSt 2 Background and Motivation and ThamizhiMorph for generating Part of Speech (POS) and Morphological annotations, Tamil is a Southern Dravidian language spoken and uuparser with multilingual training for de- by more than 80 million people around the world. pendency parsing. ThamizhiPOSt is our POS However, it still lacks enough tools and quality tagger, which is based on the Stanza, trained annotated data to build good Natural Language with Amrita POS-tagged corpus. It is the cur- Processing (NLP) applications. rent state-of-the-art in Tamil POS tagging with an F1 score of 93.27. Our morphological ana- 2.1 Universal Dependency Treebank lyzer, ThamizhiMorph is a rule-based system with a very good coverage of Tamil. Our de- Treebanks are a collection of texts with various lev- pendency parser ThamizhiUDp was trained us- els of annotations, including Part of Speech (POS) ing multilingual data. It shows a Labelled As- and morpho-syntactic annotations. There are differ- signed Score (LAS) of 62.39, 4 points higher ent formalisms used to mark syntactic annotations than the current best achieved for Tamil depen- (Marcus et al., 1993;B ohmov¨ a´ et al., 2003; Nivre dency parsing. Therefore, we show that break- ing up the dependency parsing pipeline to ac- et al., 2016; Kaplan and Bresnan, 1982). Among commodate existing tools and resources is a vi- the available formalisms, the dependency grammar able approach for low-resource languages. formalism is useful for languages like Tamil which are morphologically rich, and whose word order 1 Introduction arXiv:2012.13436v1 [cs.CL] 24 Dec 2020 is relatively variable and less bound (Bharati et al., Applying neural-based approaches to Tamil, like 2009). other Indic languages, is challenging due to a lack The Universal Dependency formalism (Nivre of quality data (Bhattacharyya et al., 2019), and the et al., 2016) is nowadays used widely to create language’s structure (Sarveswaran and Butt, 2019; Universal Dependency Treebanks (UD) with an- Butt, Miriam, Rajamathangi, S. and Sarveswaran, notations. The current release of UDv2.7 has 183 K., 2020). Although there is a large volume of elec- annotated treebanks of various sizes from 104 lan- tronic unstructured/partially-structured text avail- guages (Zeman et al., 2020). UD captures informa- able on the Internet, not many language processing tion such as Parts of Speech (POS), morphological tools are publicly available for even fundamental features, and syntactic relations in the form of de- tasks like part of speech (POS) tagging or pars- pendencies. All these annotations are defined with ing. Nowadays, neural-based approaches are the multilingual language processing in mind, and the present format used to specify the annotation is called CoNLL-U format.1 There are only three Indic languages, namely, Hindi, Urdu, and Sanskrit that have relatively large datasets 375K, 138K, and 28K tokens, respectively in UDv2.7. All the other six Indic languages, including Tamil and Telugu, have less than 12K tokens in UDv2.7. 2.2 Tamil Universal Dependency Treebanks Tamil has been included in UD treebank releases since 2015. Initially it was populated from the Prague Style Tamil treebank by (Ramasamy and Figure 1: Phases of the Parsing Pipeline Zabokrtskˇ y´, 2012), and since then the dataset has Image source: https://stanfordnlp.github.io/stanza been part of the UD without much alterations or corrections. Tamil TTB in UDv2.6 has some inac- Dependency Treebanks (UD), including Stanza (Qi curacies, and inconsistencies. For instance, num- et al., 2020), and uuparser (de Lhoneux et al., 2017) bers are marked as NUM and ADJ, while only the and its derivatives. Both of these are open source former tag is correct. The first author of this pa- tools. Stanza is a Python NLP library which in- per has corrected some of these issues and made cludes a multilingual neural NLP pipeline for de- it available in UDv2.7. However, there are still pendency parsing using Universal Dependency for- more issues that need to be solved. Tamil TTB malism. uuparser is a tool developed specifically in UDv2.7 has altogether 600 sentences for train- for UD parsing. These neural-based tools need ing, development and testing. In (Zeman et al., large amount of quality data on which to be trained. 2020), there is another Tamil treebank with 536 On the other hand, several approaches are being sentences, namely MWTT, which has been newly used to overcome the issue with data scarcity, in- added. MWTT is based on the Enhanced Universal cluding multilingual training. There is an attempt 2 Dependency annotation, where complex concepts to create a multilingual parsing for several low- like elision, relative clauses, propagation of con- resource languages, and it is reported that multi- juncts, raising and control constructions, and ex- lingual training significantly improves the parsing tended case marking are captured. Therefore, there accuracy of low-resource languages (Smith et al., are slight variations in TTB and MWTT. Further, 2018). MWTT has very short sentences, while TTB has relatively very longer ones. In this paper, we have 3 ThamizhiUDp mainly used and discussed Tamil TTB. By considering all available resources and ap- 2.3 Dependency parsers proaches as outlined in section2, we decided to A Dependency parser is a type of syntactic parser develop a Universal Dependency parser (UDp) for which is useful to elicit lexical, morphological, and Tamil called ThamizhiUDp using existing open syntactic features, and the inter-connections of to- source tools, namely Stanza, ThamizhiMorph, and kens in a given sentence. Linguistically, this would uuparser. However, since we do not have enough be useful for syntactic analyses, and comparative data to train a neural-based parser end-to-end, we studies. Computationally, this is a key resource for have broken up the pipeline to different phases. We natural language understanding (Dozat and Man- have then orchestrated data from different sources ning, 2018). Different approaches are employed for each of these phases, and used different tools in when developing dependency parsers. However, different phases, as shown in Table1. The follow- neural-based parsers are the latest state of the art. ing sub-sections discuss each of the stages of the There are several off-the-shelf neural-based pipeline, and how we went about developing them. parsers available that are built around Universal Our dependency parsing pipeline has several 1https://universaldependencies.org/ stages as shown in Figure1. As mentioned, we format.html used different datasets and tools, as shown in Table 2https://universaldependencies.org/u/ overview/enhanced-syntax.html 1, in the different stages of the pipeline. Currently, Step Tool Dataset in TTB. We are in the process of improving this Tokenisation Stanza Tamil UDT Multi-word Stanza Tamil UDT multi-word tokeniser with the use of more data. tokeniser Lemmatisation Stanza Tamil UDT 3.3 Lemmatisation using Stanza POS tagging ThamizhiPOSt Amrita Data Morphological ThamizhiMorph Rule-based UD Treebanks also have lemmas marked in their tagging Dependency pars- uuparser UDT of various CoNLL-U format annotation This is useful for ing languages language processing applications, such as a Ma- chine Translator. We trained Stanza using the TTB Table 1: ThamizhiUDp process pipeline UDv2.6 to do lemmatisation. However, the current TTB has several inaccuracies in identifying Stanza does not have support for multilingual train- lemmas, specifically due to improper multi-word ing. Therefore, for dependency parsing, we used tokenisation. Since a lemma is identified for multi- uuparser with the multilingual training. Each of word tokenised words, multi-word tokenisation has the phases within the pipeline is explained in the an effect on lemmatisation. Since we are still in following respective sub-sections. the process of improving our multi-word tokeniser, lemmatisation will also be improved in the future. 3.1 Tokenisation 3.4 POS tagging using ThamizhiPOSt First, the given texts have been Unicode normalised, and then tokenised, and broken up in to Part of Speech (POS) tagging is an important sentences. We developed a script3 to do Unicode phase in the parsing process where each word normalisation. Because of different input meth- in a sentence is assigned with its POS tag (or ods or other reasons, at times the same surface lexical category) information. Several attempts form of a character has been stored using different have been made to define POS tagsets for Tamil, Unicode sequences. Therefore, this needed to be based on different theories, and level of granular- normalised, otherwise, a computer would consider ity; (Sarveswaran and Mahesan, 2014) gives an them as different characters. Once this normalisa- account of different tagsets. Among these, Am- tion was done, we moved on to tokenisation. To do rita (Anand Kumar et al., 2010) and BIS5 are two this, we trained Stanza with the texts available in popular tagsets.

Load more