arXiv:2101.04899v2 [cs.CL] 14 Jan 2021

Experimental Evaluation of Deep Learning models for Marathi Text Classification

Atharva Kulkarni (1), Meet Mandhane (1), Manali Likhitkar (1), Gayatri Kshirsagar (1), Jayashree Jagdale (1), and Raviraj Joshi (2)
(1) Pune Institute of Computer Technology, Pune
(2) Indian Institute of Technology Madras, Chennai
{k.atharva4899,meetmandhanemnm,manalil1806,gayatrimohan7}@gmail.com
{jayashree.jagdale,ravirajoshi}@gmail.com

Abstract

The Marathi language is one of the prominent languages used in India. It is predominantly spoken by the people of Maharashtra. Over the past decade, the usage of the language on online platforms has tremendously increased. However, research on Natural Language Processing (NLP) approaches for Marathi text has not received much attention. Marathi is a morphologically rich language and uses a variant of the Devanagari script in the written form. This work aims to provide a comprehensive overview of available resources and models for Marathi text classification. We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets and present a comparative analysis. The pre-trained Marathi FastText word embeddings by Facebook and IndicNLP are used in conjunction with word-based models. We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets. We hope our paper aids focused research and experiments in the area of Marathi NLP.

1 Introduction

The Marathi language is spoken by more than 83 million people in India. In terms of the number of speakers, it ranks third in India after Hindi and Bengali. It is native to the state of Maharashtra and is also spoken in Goa and some regions of western India. Despite the region leading in education and economy, NLP research in Marathi has not received much attention. In contrast, research in Hindi has been much more significant (Arora, 2013; Akhtar et al., 2016; Joshi et al., 2016, 2019; Patra et al., 2015), followed by Bengali (Patra et al., 2018; Al-Amin et al., 2017; Pal et al., 2015; Sarkar and Bhowmick, 2017). Recently, to enable cross-lingual NLP, the focus has shifted to building single multilingual models instead of individual language models (Pires et al., 2019; Conneau and Lample, 2019).

English is the most widely used language globally, due to which the majority of NLP research is done in English. There is a large scope for research in other regional languages in India. These regional languages are low-resource and morphologically rich, which is the main reason for the limited research (Patil and Patil, 2017). Morphologically rich languages have greater complexity in their grammar as well as their sentence structure. Grammatical relations like subject, predicate, object, etc., are indicated by changes to the words. Also, the structure of a sentence in Marathi can change without affecting its meaning. For example, "To khelayla yenar nahi", "Khelayla yenar nahi to", and "To yenar nahi khelayla" all carry the same meaning. The absence of large annotated datasets is a major issue, due to which new methods cannot be properly tested and documented, thus hampering research. In this work, we are concerned with Marathi text classification. We evaluate different deep learning models for Marathi text and provide a comprehensive review of publicly available models and datasets.

Text classification is the process of categorizing text into different classes, grouped according to content (Kowsari et al., 2019). It has been used in a variety of applications, from optimizing searches in search engines to analyzing customer needs in businesses. With the increase in Marathi textual content on online platforms, it becomes important to build text processing systems for the Marathi language. The text classification module requires the application of various pre-processing techniques to the text before running the classification model. These tasks involve steps viz. tokenizing, stop word removal, and stemming the words to their root form. Tokenization is a way of separating a piece of text into smaller units called tokens. The tokens can be words, characters, or subwords. Tokens and punctuation that do not contribute to classification are removed using stop word removal techniques. Stemming is the process of reducing a word to its root form. These root forms of words then represent the sentence or the document to which they belong and are passed on to classifiers. In this work, we are not concerned with stemming and stop word removal. More recent techniques based on sub-words and neural networks implicitly mitigate problems associated with morphological variations and stop words to a large extent (Joulin et al., 2016).

This paper explores and summarizes the efficiency of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), and Transformer based approaches on two datasets. The two datasets are contrasting in terms of length of records, grammatical complexity of sentences, and the vocabulary itself. We also evaluate the effect of using pre-trained FastText word embeddings and explicit sub-word embeddings on the aforementioned architectures. Finally, we evaluate pre-trained language models such as Universal Language Model Fine-tuning (ULMFiT) and multilingual variations of the Bidirectional Encoder Representations from Transformers (BERT), namely mBERT and IndicBERT, for the task of Marathi text classification (Howard and Ruder, 2018; Devlin et al., 2018). We use the publicly available ULMFiT and BERT based models and re-run the experiments on the classification datasets. The evaluation of other models and the effect of using pre-trained word embeddings on them is specific to this work and not covered in previous literature. The main contributions of this work are

• We provide an overview of publicly available classification datasets, monolingual corpora, and deep learning models useful for the Marathi language. We emphasize that Marathi is truly a very low resource language and even lacks a simple sentiment classification dataset.

• We evaluate the effectiveness of publicly available FastText word embeddings and sub-word tokenizers for Marathi text classification. We present a comparative analysis of CNN, LSTM, and Transformer based models. The analysis shows that simple CNN and LSTM based models along with FastText embeddings perform as well as currently available pre-trained multilingual BERT based models.

2 Related Work

There has been very little research done on Marathi NLP. Recently, Kakwani et al. (2020) introduced NLP resources for 11 major Indian languages, with Marathi as one of them. They collected sentence-level monolingual corpora for all the languages from web sources. The monolingual corpus contains 8.8 billion tokens across multiple languages. This corpus was used to pre-train word embeddings and multilingual language models. The pre-trained models are based on the compact ALBERT model and are termed IndicBERT. FastText embeddings were trained because FastText is better at handling morphological variants. The pre-trained models and embeddings were evaluated on text classification and generation tasks. In most of their experiments, IndicBERT models outperformed XLM-R and mBERT models. They have also created other datasets for various NLP tasks like Article Genre Classification, Headline Prediction, etc. This work is referred to as IndicNLP throughout the paper.

An NLP library for Indian languages, iNLTK, is presented in (Arora, 2020). It consists of pre-trained models, support for word embeddings, textual similarity, and other important components of NLP for 13 Indian languages including Marathi. They evaluate pre-trained models like ULMFiT and TransformerXL. The ULMFiT model and other pre-trained models are shown to perform well on small datasets as compared to raw models. Work is being done to expand iNLTK support to other languages like Telugu, Maithili, and some code mixed languages.

Previously, Bolaj and Govilkar (2016) presented supervised learning methods and ontology-based Marathi text classification. Marathi text documents were mapped to output labels like festivals, sports, tourism, literature, etc. The steps proposed for predicting the labels were preprocessing, feature extraction, and finally applying supervised learning methods. Methods based on the Label Induction Clustering Algorithm (LINGO) to categorize Marathi documents were explored in (Vispute and Potey, 2013; Patil and Bogiri, 2015). A small custom dataset containing 100-200 documents was used for classification in the respective works.

3 Data Resources

3.1 Datasets

This section summarizes the publicly available classification datasets used for experimentation.

IndicNLP News Article Dataset: This dataset consists of news articles in Marathi categorized into 3 classes viz. sports, entertainment, and lifestyle (Kakwani et al., 2020). The dataset contains 4779 records with predefined train, test, and validation splits. They contain 3823, 479, and 477 records respectively. The average length of a record is approximately 250 words.

iNLTK Headline Dataset: This dataset consists of Marathi news article headlines of 3 different classes viz. entertainment, sports, and state (Arora, 2020). It consists of 12092 records. The dataset made available under the IndicNLP catalog consists of 9672 train, 1210 test, and 1210 validation samples. The average length of a record is 7 words.

3.2 Embedding

The input to the classification models can be word, sub-word, or character embeddings. In this work, we have experimented with word and sub-word embeddings. For word embeddings, random initialization and FastText initialization are explored. For sub-word embeddings, we train a unigram based SentencePiece tokenizer and randomly initialize the sub-word embeddings (Kudo and Richardson, 2018). A vocabulary size of 12k is used for sub-word tokens. There are two versions of pre-trained Marathi FastText embedding models available, trained by Facebook and IndicNLP. We use both of these in static and trainable mode. In static mode, the word-embedding layer is frozen, and in trainable mode, the embedding layer is kept trainable.

3.3 Monolingual Corpora

Although we have not explicitly used a monolingual corpus in this work, we list the publicly available Marathi monolingual corpora for the sake of completeness. These individual corpora were used to pre-train FastText word embeddings, ULMFiT, and BERT based models by the respective authors.

Wikipedia: The Marathi Wikipedia article monolingual dataset consists of 85k cleaned articles. This is a small corpus which consists of comparatively fewer tokens.

CC-100 Monolingual Dataset: The dataset is a huge collection of crawled websites for 100+ languages (Wenzek et al., 2019). It was created by processing January-December 2018 Commoncrawl snapshots. However, for Marathi as well as most other Indian languages, the dataset consists of just about 50 million tokens each.

OSCAR: The Open Super-large Crawled ALMAnaCH coRpus (OSCAR) is obtained by filtering and language classifying the Common Crawl corpus (Suárez et al., 2019). After deduplication, the size of the Marathi corpus comes to 82 million tokens.

IndicNLP Corpus: This is a multi-domain corpus that spans over 10 Indian languages and contains over 2.7 billion tokens. For Marathi, the dataset consists of 9.9 million sentences and 142 million tokens.

4 Architecture

Traditionally, LSTM based models are used for NLP tasks. However, recent experiments have shown that CNN based models give encouraging results (Joshi et al., 2019; Kim, 2014). We have explored a variety of models, which give insights into their applications according to the data used.

CNN: An embedding layer converts inputs to word embeddings of length 300. This sequence of word embeddings is passed on to a Conv1D layer with 300 filters and kernel size 3. This layer is followed by a GlobalMaxPool layer and a dense layer of size 100. A final dense layer of size equal to the number of target labels completes the model.

LSTM + GlobalMaxPool: An embedding layer outputs embeddings of length 300. This is followed by an LSTM layer with cell size 300 and then a GlobalMaxPool layer. The output is then fed to a dense layer of size 100, and finally a dense layer of size equal to the total number of target classes.

BiLSTM + GlobalMaxPool: The word vectors from the embedding layer are passed on to a Bidirectional LSTM layer with cell size 300. The output of the BiLSTM layer is max pooled over time. This is followed by a dense layer of size 100 and a final dense layer.

ULMFiT: Universal Language Model Fine-tuning for Text classification uses transfer learning and can be used for various NLP tasks. The language model pretraining allows it to work well on small datasets. The pre-trained model made publicly available by iNLTK is fine-tuned for our target datasets.

BERT: Various transformer-based BERT language models pre-trained on text corpora are publicly available. We tested some of the versions and fine-tuned them for the Marathi text classification task, viz.

• Multilingual-BERT

• Indic-BERT trained by IndicNLP

For each of these models, we experimented with two variations: a CLS-based architecture for sequence classification; and fine-tuning the final layers of the model and passing them to a classifier layer.

Model     Variant                         News Articles   News Headlines
CNN       random-word                     98.95           89.01
          random-subword                  98.53           88.18
          FBFastText-Trainable            98.74           91.49
          FBFastText-Static               96.65           90.50
          IndicNLP FastText-Trainable     99.16           94.88
          IndicNLP FastText-Static        99.37           94.13
LSTM      random-word                     98.74           88.51
          random-subword                  97.69           86.36
          FBFastText-Trainable            98.95           91.49
          FBFastText-Static               88.22           89.75
          IndicNLP FastText-Trainable     99.16           94.79
          IndicNLP FastText-Static        98.54           94.55
BiLSTM    random-word                     98.32           89.09
          random-subword                  96.86           86.78
          FBFastText-Trainable            97.70           91.24
          FBFastText-Static               94.98           89.83
          IndicNLP FastText-Trainable     99.16           94.63
          IndicNLP FastText-Static        99.16           94.13
ULMFiT    (iNLTK)                         99.37           92.39
BERT      mBERT                           97.48           90.70
          IndicBERT (INLP)                99.40           94.50

Table 1: Classification accuracies (%) over different architectures

5 Results

We evaluated different CNN, LSTM, and BiLSTM based models along with language models such as ULMFiT and BERT on two datasets viz. the Marathi News Headline Dataset and the Marathi News Articles Dataset. Two input representations, based on words and sub-words, are used. For word-based representations, we compare random initialization and FastText based initialization. The FastText based initialization is then used in trainable and non-trainable mode. Table 1 shows a comparison of the various approaches. The variant column indicates the variation used for input representation. The random initialization of embeddings is indicated as random-word and random-subword. The original FastText embedding released by Facebook is termed FBFastText.
The version released by IndicNLP is termed IndicNLP FastText. The suffix static indicates an un-trainable embedding layer while trainable indicates a trainable one.

For both datasets, IndicBERT and the models that use IndicNLP FastText embeddings give the best results. For the articles dataset, the increase in accuracy after using IndicNLP embeddings over a simple model is small, but the increase is significant for the headline dataset. Also, the FBFastText embeddings do not match up to the results of the IndicNLP embeddings. This is in line with the observations in the original work. The CNN based models have a slight edge over LSTM and BiLSTM based models. The ULMFiT model performs better than models using random and FBFastText word embeddings.

Finally, for the BERT-based models, we experimented with two approaches: fine-tuning the final layers of the BERT model and passing them on to a BiLSTM model, and extracting only the sentence embeddings, i.e., the CLS tokens, and using them for classification. Both approaches give almost similar results. IndicBERT works best for both datasets. The results for mBERT and models using FBFastText are in the same range. Moreover, IndicBERT and models using IndicNLP FastText also perform in the same range, but roughly 4% better than mBERT in absolute terms. ULMFiT is somewhere between mBERT and IndicBERT. Overall, models utilizing IndicNLP FastText embeddings perform competitively with IndicBERT. The higher accuracy of these models can be attributed to the fact that a major chunk of the data used to pre-train them came from news sources. The target datasets also come from the news domain.

CNN + IndicNLP FastText performs best on the Headlines dataset, leaving behind IndicBERT by a small margin. However, IndicBERT is multilingual in nature and serves more applications. The results in Table 1 show that for applications only concerned with Marathi text, basic models will be preferable.

6 Conclusion

In this paper, we have experimented with various deep learning models and evaluated them for the task of Marathi text classification. We present a comparative analysis of different input representations and model combinations. We show that the IndicNLP FastText embeddings perform better than random initialization and the original Facebook FastText embeddings when used with CNN, LSTM, and BiLSTM models. In fact, they perform competitively with more complex BERT based models. The difference in accuracies between using a static and a trainable embedding layer is very small, where the trainable approach has a slight advantage in accuracy. The ULMFiT model performs better than basic CNN and LSTM based models. IndicBERT has the upper hand over the mBERT model and all other smaller models utilizing randomly initialized embeddings. In the end, we highlight the fact that Marathi NLP has not received enough attention. There are only a couple of pre-trained models and relevant publicly available datasets for Marathi text classification. Both datasets are based on the news domain.

Acknowledgments

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.

References

Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2016. Aspect based sentiment analysis in Hindi: Resource creation and evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2703–2709.

Md Al-Amin, Md Saiful Islam, and Shapan Das Uzzal. 2017. Sentiment analysis of Bengali comments with word2vec and sentiment information of words. In 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), pages 186–190. IEEE.

Gaurav Arora. 2020. iNLTK: Natural language toolkit for Indic languages. arXiv preprint arXiv:2009.12534.

Piyush Arora. 2013. Sentiment analysis for Hindi language. MS by Research in Computer Science.

Pooja Bolaj and Sharvari Govilkar. 2016. Text classification for Marathi documents using supervised learning methods. International Journal of Computer Applications, 155(8):6–10.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. 2016. Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2482–2491.

Ramchandra Joshi, Purvi Goel, and Raviraj Joshi. 2019. Deep learning for Hindi text classification: A comparison. In International Conference on Intelligent Human Computer Interaction, pages 94–101. Springer.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. Findings of EMNLP.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information, 10(4):150.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Alok Ranjan Pal, Diganta Saha, and Niladri Sekhar Dash. 2015. Automatic classification of Bengali sentences based on sense definitions present in Bengali WordNet. arXiv preprint arXiv:1508.01349.

Harshali B Patil and Ajay S Patil. 2017. MarS: A rule-based stemmer for morphologically rich language Marathi. In 2017 International Conference on Computer, Communications and Electronics (Comptelix), pages 580–584. IEEE.

Jaydeep Jalindar Patil and Nagaraju Bogiri. 2015. Automatic text categorization: Marathi documents. In 2015 International Conference on Energy Systems and Applications, pages 689–694. IEEE.

Braja Gopal Patra, Dipankar Das, and Amitava Das. 2018. Sentiment analysis of code-mixed Indian languages: An overview of SAIL code-mixed shared task @ ICON-2017. arXiv preprint arXiv:1803.06745.

Braja Gopal Patra, Dipankar Das, Amitava Das, and Rajendra Prasath. 2015. Shared task on sentiment analysis in Indian languages (SAIL) tweets: An overview. In International Conference on Mining Intelligence and Knowledge Exploration, pages 650–655. Springer.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.

Kamal Sarkar and Mandira Bhowmick. 2017. Sentiment polarity detection in Bengali tweets using multinomial naïve Bayes and support vector machines. In 2017 IEEE Calcutta Conference (CALCON), pages 31–36. IEEE.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Sushma R Vispute and MA Potey. 2013. Automatic text categorization of Marathi documents using clustering technique. In 2013 15th International Conference on Advanced Computing Technologies (ICACT), pages 1–5. IEEE.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
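
As a rough illustration of the CNN classifier described in the Architecture section, the sketch below implements its two core operations in plain Python: a 1-D convolution over the sequence of word embeddings, followed by max pooling over time. The function names (conv1d_valid, global_max_pool), the toy sizes (2-dimensional embeddings and 2 filters in place of the 300 and 300 used in the paper), the ReLU activation, and the "valid" padding are assumptions made for illustration; the paper does not state them, and this is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def conv1d_valid(seq: List[Vector], filters: List[List[Vector]], biases: Vector) -> List[Vector]:
    """1-D convolution over time with 'valid' padding and ReLU (both assumed).

    seq:     T word vectors, each of length D
    filters: F filters, each a list of K weight vectors of length D
    returns: T - K + 1 feature vectors, each of length F
    """
    kernel = len(filters[0])
    out = []
    for t in range(len(seq) - kernel + 1):
        window = seq[t:t + kernel]  # K consecutive word vectors
        feats = []
        for f, b in zip(filters, biases):
            # Dot product between the filter and the window, plus bias.
            s = b + sum(w * x for wv, xv in zip(f, window) for w, x in zip(wv, xv))
            feats.append(max(0.0, s))  # ReLU
        out.append(feats)
    return out


def global_max_pool(features: List[Vector]) -> Vector:
    """Max over the time axis: one value per filter, independent of sequence length."""
    return [max(column) for column in zip(*features)]


# Toy input: 4 tokens with 2-dimensional embeddings (hypothetical values).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
# Two filters of kernel size 3: one sums the first embedding dimension
# over the window, the other sums the second dimension.
filters = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
]
biases = [0.0, -1.0]

conv_out = conv1d_valid(seq, filters, biases)
print(conv_out)                   # [[2.0, 1.0], [1.0, 3.0]]
print(global_max_pool(conv_out))  # [2.0, 3.0]
```

Because the pooling step keeps one maximum per filter, the pooled vector has a fixed length whatever the input length, which is what lets the same dense layers handle both 7-word headlines and 250-word articles.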