Experimental Evaluation of Deep Learning Models for Marathi Text Classification

Atharva Kulkarni¹, Meet Mandhane¹, Manali Likhitkar¹, Gayatri Kshirsagar¹, Jayashree Jagdale¹ and Raviraj Joshi²
¹Pune Institute of Computer Technology, Pune
²Indian Institute of Technology Madras, Chennai
{k.atharva4899,meetmandhanemnm,manalil1806,gayatrimohan7}@gmail.com
{jayashree.jagdale,ravirajoshi}@gmail.com

arXiv:2101.04899v2 [cs.CL] 14 Jan 2021

Abstract

The Marathi language is one of the prominent languages used in India. It is predominantly spoken by the people of Maharashtra. Over the past decade, the usage of the language on online platforms has tremendously increased. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and uses a variant of the Devanagari script in the written form. This work aims to provide a comprehensive overview of available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi FastText word embeddings by Facebook and IndicNLP are used in conjunction with word-based models. We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.

1 Introduction

The Marathi language is spoken by more than 83 million people in India. In terms of the number of speakers, it ranks third in India after Hindi and Bengali. It is native to the state of Maharashtra and also spoken in Goa and some regions of western India. Although Maharashtra is a leading state in education and the economy, NLP research in Marathi has not received much attention there. In contrast, research in Hindi has been much more significant (Arora, 2013; Akhtar et al., 2016; Joshi et al., 2016, 2019; Patra et al., 2015), followed by Bengali (Patra et al., 2018; Al-Amin et al., 2017; Pal et al., 2015; Sarkar and Bhowmick, 2017). Recently, to enable cross-lingual NLP, the focus has shifted to building single multilingual models instead of models for individual languages (Pires et al., 2019; Conneau and Lample, 2019).

English is the most widely used language globally, due to which the majority of NLP research is done in English. There is a large scope for research in other regional languages in India. These regional languages are low-resource and morphologically rich, which is the main factor limiting research (Patil and Patil, 2017).

Morphologically rich languages have greater complexity in their grammar as well as sentence structure. Grammatical relations like subject, predicate, object, etc., are indicated by changes to the words. Also, the structure of a sentence in Marathi can change without affecting the meaning. For example, "To khelayla yenar nahi", "Khelayla nahi yenar to", and "To nahi yenar khelayla" all carry the same meaning. The absence of large annotated datasets is a major issue due to which new methods cannot be properly tested and documented, thus hampering research. In this work, we are concerned with Marathi text classification. We evaluate different deep learning models for Marathi text and provide a comprehensive review of publicly available models and datasets.

Text classification is the process of categorizing text into different classes, grouped according to content (Kowsari et al., 2019). It has been used in a variety of applications, from optimizing searches in search engines to analyzing customer needs in businesses. With the increase in Marathi textual content on online platforms, it becomes important to build text processing systems for the Marathi language. The text classification module requires the application of various pre-processing techniques to the text before running the classification model. These tasks involve steps viz. tokenizing, stop word removal, and stemming the words to their root form. Tokenization is a way of separating a piece of text into smaller units called tokens. The tokens can be words, characters, or sub-words. Tokens and punctuation marks that do not contribute to classification are removed using stop word removal techniques. Stemming is the process of reducing a word to its root form. These root forms of words then represent the sentence of the document to which they belong and are passed on to classifiers. In this work, we are not concerned with stemming and stop word removal. More recent techniques based on sub-words and neural networks implicitly mitigate problems associated with morphological variations and stop words to a large extent (Joulin et al., 2016).

This paper explores and summarizes the efficiency of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and Transformer based approaches on two datasets. The two datasets contrast in terms of the length of records, the grammatical complexity of sentences, and the vocabulary itself. We also evaluate the effect of using pre-trained FastText word embeddings and explicit sub-word embeddings on the aforementioned architectures. Finally, we evaluate pre-trained language models such as Universal Language Model Fine-tuning (ULMFiT) and multilingual variations of the Bidirectional Encoder Representations from Transformers (BERT), mBERT and IndicBERT, for the task of Marathi text classification (Howard and Ruder, 2018; Devlin et al., 2018). We use the publicly available ULMFiT and BERT based models and re-run the experiments on the classification datasets. The evaluation of other models and the effect of using pre-trained word embeddings on them is specific to this work and not covered in previous literature. The main contributions of this work are:

• We provide an overview of publicly available classification datasets, monolingual corpora, and deep learning models useful for the Marathi language. We emphasize that Marathi is truly a very low resource language and even lacks a simple sentiment classification dataset.

• We evaluate the effectiveness of publicly available FastText word embeddings and sub-word tokenizers for Marathi text classification. We present a comparative analysis of CNN, LSTM, and Transformer based models. The analysis shows that simple CNN and LSTM based models along with FastText embeddings perform as well as currently available pre-trained multilingual BERT based models.

2 Related Work

There has been very little research done on Marathi NLP. Recently, Kakwani et al. (2020) introduced NLP resources for 11 major Indian languages, with Marathi as one of them. They collected sentence-level monolingual corpora for all the languages from web sources. The monolingual corpus contains 8.8 billion tokens across multiple languages. This corpus was used to pre-train word embeddings and multilingual language models. The pre-trained language models, termed IndicBERT, are based on the compact ALBERT model. FastText embeddings were trained because FastText is better at handling morphological variants. The pre-trained models and embeddings were evaluated on text classification and generation tasks. In most of their experiments, IndicBERT models outperformed XLM-R and mBERT models. They have also created other datasets for various NLP tasks like Article Genre Classification, Headline Prediction, etc. This work is referred to as IndicNLP throughout the paper.

An NLP library for Indian languages, iNLTK, is presented by Arora (2020). It consists of pre-trained models, support for word embeddings, textual similarity, and other important components of NLP for 13 Indian languages including Marathi. They evaluate pre-trained models like ULMFiT and TransformerXL. The ULMFiT model and other pre-trained models are shown to perform well on small datasets as compared to raw models. Work is being done to expand iNLTK support to other languages like Telugu, Maithili, and some code-mixed languages.

Previously, Bolaj and Govilkar (2016) presented supervised learning methods and ontology-based Marathi text classification. Marathi text documents were mapped to output labels like festivals, sports, tourism, literature, etc. The steps proposed for predicting the labels were preprocessing, feature extraction, and finally applying supervised learning methods. Methods based on the Label Induction Clustering Algorithm (LINGO) to categorize Marathi documents were explored in (Vispute and Potey, 2013; Patil and Bogiri, 2015). A small custom dataset containing 100-200 documents was used for classification in the respective works.

3 Data Resources

3.1 Datasets

This section summarizes publicly available classification datasets used for experimentation.

IndicNLP News Article Dataset: This dataset consists of news articles in Marathi categorized into 3 classes viz. sports, entertainment, and lifestyle (Kakwani et al., 2020). The dataset contains 4779 records with predefined train, test, and validation splits, which contain 3823, 479, and 477 records respectively. The average length of a record is approximately 250 words.

3.3 Monolingual Corpora

Although we have not explicitly used a monolingual corpus in this work, we list the publicly available Marathi monolingual corpora for the sake of completeness. These individual corpora were used to pre-train FastText word embeddings, ULMFiT, and BERT based models by the respective authors.

Wikipedia text corpus: The Marathi Wikipedia article monolingual dataset consists of 85k cleaned articles. This is a small corpus which consists of comparatively fewer tokens.

CC-100 Monolingual Dataset: The dataset is a huge collection of crawled websites for 100+ languages (Wenzek et al., 2019). It was created by processing January-December 2018 Commoncrawl snapshots. However, for Marathi as well as most other Indian languages, the dataset consists of just about 50 million tokens each.

OSCAR: Open Super-large Crawled ALMAnaCH coRpus (OSCAR) is obtained by filtering and language classifying the Common Crawl corpus (Suárez et al., 2019). After de-duplication, the size of the Marathi corpus comes to 82 million tokens.
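
The corpus sizes above are reported in tokens, so the figure depends on how the text is de-duplicated and tokenized. The following is a minimal sketch of one way such a count could be computed. It assumes exact line-level de-duplication and whitespace tokenization, neither of which is confirmed by the cited corpora, and the function names `deduplicate_lines`, `word_tokens`, and `count_tokens` are hypothetical. The sample uses the transliterated sentences from the introduction.

```python
def deduplicate_lines(lines):
    """Drop empty lines and exact repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def word_tokens(text):
    """Whitespace tokenization (an assumption; real pipelines may differ)."""
    return text.split()


def count_tokens(lines):
    """Total number of word tokens after line-level de-duplication."""
    return sum(len(word_tokens(line)) for line in deduplicate_lines(lines))


# Toy corpus: one repeated sentence, one reordered variant, one empty line.
corpus = [
    "To khelayla yenar nahi",
    "To khelayla yenar nahi",
    "Khelayla nahi yenar to",
    "",
]

print(count_tokens(corpus))  # 8
```

The reordered variant is kept because it differs as a string, even though it carries the same meaning. Exact-match de-duplication does not collapse the word-order variation described in the introduction.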