Arxiv:2102.01051V2 [Cs.CL] 12 Mar 2021

Home , Danda, Malayalam

arXiv:2102.01051v2 [cs.CL] 12 Mar 2021 oTb r o eim hr ag num- large a where mediums now are YouTube ( aims identification language offensive of task The od rmto(rmr)lnugsaeused are languages more) (or two from words volved. github.com/murali1996/eacl2021-OffensEval-Dravidian English in language offensive fo- identifying on have do- cused identification the language in offensive tasks of shared main Previous so- relevant tasks. the media able for cial are that data models code-switched build process to interac- to vital media is social it Thus for tions. used commonly very is in- data of stream large the to due auto- best methods and is matic models which computational building monitoring by media done social is of detection form language a Offensive social on platform. common media opinions. are abuses their target express and and Conflicts interact people of Facebook, ber Twitter, like platforms threats, of media form Social ( the insults in language, be abusive me- could social which in texts, used dia language offensive identify to Introduction 1 ntesm etne( sentence same the in where communities multilingual and bilingual in apeie al. et Zampieri SJ Code-Switching 2 1 oeadperie oesaeaalbeat available are code-mixing term models the with used Interchangeably pretrained and Code s o and,2dfrMlylmad3rd and Malayalam for 2nd Kannada, for 1st o Tamil. for ranked was system language Our masked a objective. modeling with models BERT gual multilin- of of which pre-training ensemble task-adaptive models leverage an XLM-RoBERTa is and system mBERT final Our Offen- submission lan- on guages. Dravidian our in Task Identification present 2021-Shared Language sive EACL we the paper for this In JDaiinagehEC22:Ts-dpiePre-Trai Task-Adaptive AJ@DravidianLangTech-EACL2021: utlnulBR oesfrOfnieLnug Identifica Language Offensive for models BERT Multilingual agaeTcnlge Institute Technologies Language 1 [email protected] angeMlo University Mellon Carnegie a uaihrJayanthi Muralidhar Sai , 2019 2 Abstract sapeoeo common phenomenon a is ,Aai,Gek aihand Danish Greek, Arabic, ), iaa tal. et Sitaram apeie al. et Zampieri , 2019 , 2020 and ) . ). L-oET ( XLM-RoBERTa ihaMse agaeMdln betv and objective Modeling Language Masked a with Task Shared the for submission our present we h ls-ietann aae ttsisare statistics dataset training class-wise The ( ( uks ( Turkish hs oeshv lobe sdt ul best built to used been also have models These h aae ttsisfrtets fOf- of task the for statistics dataset The hw nTable in shown hrdtssfrcd-ie aaes( 2020 datasets code-mixed for various tasks tasks. shared in systems multiple analysis in sentiment performing results art the of state duced are languages Dravidian in code-mixing. by task characterized shared this for https://competitions.codalab.org/competitions/27654 are datasets the that see and We imbalanced. procedure highly collection data baselines. the some about brief a n hi utlnulvrin–mETand mBERT versions– multilingual their and esv agaeIetfiaini Malay- in ( Identification alam Language fensive Dataset 2 XLM- submission. and final our mBERT for models RoBERTa from predictions ensemble pre-training task-adaptive employ We languages. Dravidian in Identification Language Offensive on lcrcladCmue Engineering Computer and Electrical ad tal. et Hande al. et Chakravarthi akshatgu@andrew.cmu.edu rtandBR oes( models BERT Pretrained 3 oedtiso h akcnb on at found be can task the of details More ; Malayalam angeMlo University Mellon Carnegie Language Kannada hkaatie al. et Chakravarthi Tamil apeie al. et Zampieri hkaatie al. et Chakravarthi khtGupta Akshat , al :Dtststatistics Dataset 1: Table 2020 2 ona tal. et Conneau . 3 , 16010 35139 r hw nTable in shown are ) Train 6217 ad tal. et Hande 2020b , 2020 , 2020c , Split n Kannada and ) 1999 4388 .Tedtst used datasets The ). Dev 777 elne al. et Devlin 2020a , .I hspaper this In ). 2019 ( 2020 2001 4392 tion aw tal. et Patwa ,Tamil ), Test 778 aepro- have ) igof ning outlines ) , 2018 1 ) , . Language Class Kannada Tamil Malayalam XLM-RoBERTa, on the other hand, is also a Not-{language} 1522 1454 1287 transformer based model but is pretrained on Com- Not offensive 3544 25425 14153 6 Offensive Targeted Insult Individual 487 2343 239 mon Crawl data . The common crawl data con- Offensive Targeted Insult Group 329 2557 140 sists of textual data both in native scripts as well Offensive Targeted Insult Other 123 454 0 Offensive Untargetede 212 2906 191 as in romanized script for some of the languages. Total Training Data 6217 35139 16010 Among the three languages in this task, only Tamil has Romanized data in the Common Crawl. Table 2: Class-wise training data statistics Given a stream of text as input, mBERT splits the text into sub-words by using WordPiece tok- 3 Modeling enization, whereas XLM-RoBERTa uses Senten- cePiece tokenization strategy. The sub-tokens In this section, we describe the different tech- are then appended with terminal markers such as niques and models that were used in developing [CLS] and [SEP], indicating the start and end offensive language classifiers. We use two pop- of the input stream, as shown below. ular transformer architecture based models- Mul- tilingual BERT (Devlin et al., 2019) and XLM- [CLS], w1, w2, ..., wn, [SEP ] RoBERTa (Conneau et al., 2020) as our backbone models. (Gupta et al., 2021) showed that Popularized by BERT, the representation corre- these multilingual BERT models achieve state sponding to the [CLS] token in the output layer of the art performance on Dravidian language is utilized for multi-class classification via a soft- code-mixed sentiment analysis datasets. Inspired max layer. We utilize the same methodology in by some of the recent works on Hindi-English this work. While there are alternative choices for codemixed datasets, such as Khanuja et al. (2020) obtaining sentence-level representations, such as 7 and Aguilar et al. (2020), we develop solutions sentence-transformers , we leave their evaluation based on language model pretraining and translit- for future work. We obtain results with this setup eration. and consider them as our baselines. In an attempt to understand the usefulness of 3.1 Architectures character embeddings when augmented with representations from BERT-based models, we develop Motivated by the successes of BERT model and its a fusion architecture. In this architecture, we underlying Masked Language Modeling (MLM) first obtain sub-word level representations from pretraining strategy (Devlin et al., 2019), several mBERT or XLM-RoBERTa models. We then con- derivatives evolved over the last couple of years. vert them back to word-level representations by In this task, since the datasets contain texts in En- averaging the representations of sub-tokens of a glish, Tamil, Kannada and Malayalam in their na- given word. If a word is not split into sub-tokens, tive scripts as well as in Romanized form, we iden- we use its representation as the word’s representa- tify two suitable BERT based models for creating tions. For every batch of training data, we keep offensive identification classifiers. track of words and their corresponding sub-tokens 4 Multilingual-BERT , aka. mBERT, is based on for the purpose of averaging. In a parallel, we also BERT architecture but pretrained with Wikipedias pass character-level embedding sequences through of 104 different languages including English, a Bidirectional-LSTM for every word. The hope is Tamil, Kannada and Malayalam. In this work, that the character-level BiLSTM helps to capture 5 we specifically use the cased version of mBERT . different variations in word patterns due to Roman- mBERT is pretrained on textual data from differ- ization, especially useful in the context of social ent languages with a shared architecture and a media text. shared vocabulary, due to which there is a possible Once we obtain word-level representations from representation sharing across native and translit- both character-level BiLSTM as well as BERT- erated text forms. Corroborating it, Pires et al. based models, we concatenate the representations (2019) presented its superior zero-shot abilities for at word-level and pass them through a word-level an NER task in Hinglish. Bidirectional-LSTM. The representation of the

4https://github.com/google-research/bert/multilingual.md 6http://data.statmt.org/cc-100/ 5https://huggingface.co/transformers/multilingual.html 7https://github.com/UKPLab/sentence-transformers last word at the output of BiLSTM is then used due to scarcity of relevant public datasets. Thus, for classification. we resort to only task-adaptive pretraining. Some ideas towards curating corpora for 3.2 Training Methodologies domain-adaptive pretraining could be using ma- In this section, we discuss some techniques to tai- chine translation datasets which have fair amounts lor mBERT and XLM-ROBERTa models for the of code-mixing or by adopting semi-supervised task of offensiveness identification in Dravidian code-mixed data creation techniques (Gupta et al., languages. 2020). Alternate approaches could be developing techniques for efficient representation-sharing MLM Pretraining: Both the backbone models between words from native script and their are pretrained with their respective corpora using Romanized forms, similar to works such as Masked Language Modeling (MLM) objective– Chaudhary et al. (2020). We leave these explo- given a sentence, the model randomly masks rations to future work. 15% of the tokens in the input stream, runs the masked stream through several transformer Transliteration: In order to evaluate the perfor- layers, and then has to predict the masked to- mance of mBERT and XLM-RoBERTa models kens. Doing so helps the model to learn when inputted with only Romanized script, we bi-directional contextual representations, unlike first transliterate the task datasets. To this end, LSTM (Hochreiter and Schmidhuber, 1997) based we identify indic-trans8 as the transliterator. language modeling. While task-adaptive pretraining could also be con- Recent works such as Gururangan et al. (2020) ducted with transliterated texts, we leave this ex- showed that a second phase of pretraining (called ploration to future work. domain-adaptive pretraining) with in-domain data can lead to performance gains. Relevant in the 4 Experiments realm of codemixing NLP, Khanuja et al. (2020) and Aguilar et al. (2020) have shown that such a Dataset and Metrics: The number of training domain-adaptive pretraining can improve perfor- and validation examples for each language are pre- mances for classification tasks in Hinglish. They sented in Table 1. Table 2 presents the class-wise curated a corpora with millions of entries consist- distribution of the training data. As observed, the ing of codemixed Hinglish in Romanized form. distribution is skewed with the dominant class be- Once the multilingual BERT-based models are ing Not offensive. In this work, we experiment trained with such data, they utilized those mod- with all three languages presented in the shared els for downstream tasks. Their downstream tasks task. We use F1 and Accuracy as evaluation met- generally contained only texts in Romanized form. rics9. Gururangan et al. (2020) have further shown that adapting the models to the task’s unlabeled data Implementation details: For the purpose of (called task-adaptive pretraining) improves perfor- training and validation, the corresponding data mance even after domain-adaptive pretraining. splits are utilized. Since gold labels for test sets However, we identify few challenges in order to were not previously available, different techniques directly adopt ideologies from the related works were compared using evaluations on the validation to our task at hand. Firstly, the datasets provided set itself. For the purpose of task-adaptive pre- for each language in this task consists of texts training using MLM objective, we combine the in English, native script of that language and its train and validation splits of the task data for train- transliterated form. Thus, in order to convert ev- ing BERT models, and utilize test split of the task erything into Romanized form, we need an accu- data for testing MLM pretraining. Due to compute rate transliterator tool. Moreover, even if such a or- resource limitations, we trim longer sentences to acle transliterator exists, there could be challenges 300 characters (including spaces) and always use related to word normalization due to peculiarity of a batch size of 8. We do not perform any prepro- a given language’s usage in social media platforms. cessing on the text so as to allow for a generalized Secondly, obtaining large amounts of Malayalam- evaluation across the different datasets in the task. English or Kannada-English code-mixed data for 8https://github.com/libindic/indic-trans the purpose of pretraining is not readily feasible 9sklearn.metrics.classification report.html F1 / Acc (Dev Split) Model Input Text Kannada Tamil Malayalam Baselines mBERT 67.30/70.79 76.79/79.01 95.77/96.10 as-is XLM-RoBERTa 68.70/71.56 76.28/78.19 94.12/94.85 Task-adaptive pretraining mBERT 68.40/72.07 76.95/78.87 96.58/96.75 as-is XLM-RoBERTa 69.66/72.72 77.10/78.83 96.28/96.55 Transliteration mBERT 68.84/72.07 75.16/78.49 95.72/95.80 romanized XLM-RoBERTa 67.73/71.17 75.73/78.08 94.18/95.00 Fusion Architecture (w/ char-BiLSTM) XLM-RoBERTa as-is 69.79/72.97 75.51/77.87 95.78/96.05 Ensemble (Task-adaptive pretraining)) mBERT as-is 70.30/74.00 76.94/79.67 96.76/97.00

Table 3: Results on all three languages of the task. as-is implies the text is inputted without any preprocessing.

F1 / Acc (Dev Split) Model Input Text Kannada Tamil Malayalam mBERT (run-1) 69.88/73.49 76.04/78.30 96.48/96.65 as-is XLM-RoBERTa (run-1) 69.72/72.59 76.80/79.40 95.80/96.25 mBERT (run-2) 68.67/72.72 76.58/79.17 96.50/96.70 as-is XLM-RoBERTa (run-2) 68.39/71.69 76.20/79.03 96.30/96.70 mBERT (run-3) 66.28/70.91 76.30/79.54 96.04/96.55 as-is XLM-RoBERTa (run-3) 69.08 / 71.94 77.86 /78.71 96.17/96.30 Ensemble (Mode) as-is 70.30 / 74.00 76.94 / 79.67 96.76 / 97.00

Table 4: Ensemble results on all three languages. as-is implies the text is inputted without any preprocessing.

We use Huggingface library (Wolf et al., Discussion: We observe that task-adaptive pre- 2020) for implementation of our back- training improves performance (1-2% absolute) in bone models. Specifically, we use case of both the BERT-based models, with major bert-base-multilingual-cased and gains obtained for Malayalam dataset (upto 2% ab- xlm-roberta-base models. For experiments solute improvement). We also observe that the related to representation fusion, we implement baseline performances are retained when the input the BiLSTM models in PyTorch (Paszke et al., text is fully Romanized, thereby opening up scope 2019). During task specific training for offensive for future research along the direction of domain- classification, we optimize using the BertAdam10 adaptive pretraining. We also observe that the rep- optimizer for models with a BERT component resentation fusion helps in some datasets and de- and with Adam (Kingma and Ba, 2014) optimizer grades performance in others. However, we be- for the remainder. These optimizers are used lieve that more evaluation needs to be conducted with default parameter settings. All experiments to ascertain the usefulness of such fusion. are conducted for 5 epochs and we pick the best model using the dev set. We use a hidden size of 5 Conclusions 128 for character-level BiLSTM and a size of 256 for word-level BiLSTM. We add dropouts with In this paper, we present our submission for 0.40 rate at the outputs of both BiLSTMs. EACL-2021 Shared Task on Offensive Language Identification in Dravidian languages. Our system Ensemble: For the final submission, we create was an ensemble of 6 different mBERT and XLM- a majority voting ensemble of 6 models– 3 of RoBERTa models fine-tuned for the task of offen- mBERT and 3 of XLM-RoBERTa– based on task- sive language identification. The BERT models adaptive pretraining technique. Table 4 shows the were pre-trained on given datasets in Malayalam, results of various runs and the ensemble scores Tamil and Kannada. Our system ranked 1st for across all three languages. Kannada, 2nd for Malayalam and 3rd for Kannada. 10github.com/cedrickchee/pytorch-pretrained-BERT We also proposed a fusion-architecture to leverage character-level, subword-level and word-level em- bidirectional transformers for language understand- bedding to improve performance. ing. arXiv preprint arXiv:1810.04805. Jacob Devlin, Ming-Wei Chang, Ken- ton Lee, and Kristina Toutanova. 2019. References BERT: Pre-training of deep bidirectional transformers for language understanding. Gustavo Aguilar, Sudipta Kar, In Proceedings of the 2019 Conference of the North and Thamar Solorio. 2020. American Chapter of the Association for Computa- LinCE: A Centralized Benchmark for Linguistic Code-switchtionaling Evaluation Linguistics:. Human Language Technologies, In Proceedings of The 12th Language Resources Volume 1 (Long and Short Papers), pages 4171– and Evaluation Conference, pages 1803–1813, 4186, Minneapolis, Minnesota. Association for Marseille, France. European Language Resources Computational Linguistics. Association. Akshat Gupta, Sai Krishna Rallabandi, and Alan Black. Bharathi Raja Chakravarthi, Navya Jose, 2021. Task-specific pre-training and cross lingual Shardul Suryawanshi, Elizabeth Sherly, transfer for code-switched data. arXiv preprint and John Philip McCrae. 2020a. arXiv:2102.12407. A sentiment analysis dataset for code-mixed Malayalam-English. In Proceedings of the 1st Joint Workshop on Spoken Deepak Gupta, Asif Ekbal, and Language Technologies for Under-resourced lan- Pushpak Bhattacharyya. 2020. guages (SLTU) and Collaboration and Computing A semi-supervised approach to generate the code-mixed text using pre-trained encoder and transfer learning. for Under-Resourced Languages (CCURL), pages In Findings of the Association for Computational 177–184, Marseille, France. European Language Linguistics: EMNLP 2020, pages 2267–2280, Resources association. Online. Association for Computational Linguistics. Bharathi Raja Chakravarthi, Vignesh- Suchin Gururangan, Ana Marasović, Swabha waran Muralidaran, Ruba Priyad- Swayamdipta, Kyle Lo, Iz Beltagy, Doug harshini, and John Philip McCrae. 2020b. Downey, and Noah A. Smith. 2020. Corpus creation for sentiment analysis in code-mixed TamilDon’t-English stop text pretraining:. Adapt language models to domains and tasks. In Proceedings of the 1st Joint Workshop on Spoken Proceedings of the 58th Annual Meeting of the Asso- Language Technologies for Under-resourced lan- ciation for Computational Linguistics. guages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages Adeep Hande, Ruba Priyadharshini, and 202–210, Marseille, France. European Language Bharathi Raja Chakravarthi. 2020. Resources association. KanCMD: Kannada CodeMixed dataset for sentiment analysis and offensive language detection. In Proceedings of the Third Workshop on Computa- BR Chakravarthi, R Priyadharshini, V Muralidaran, tional Modeling of People’s Opinions, Personality, S Suryawanshi, N Jose, E Sherly, and JP McCrae. and Emotion’s in Social Media, pages 54–63, 2020c. Overview of the track on sentiment analysis Barcelona, Spain (Online). Association for Compu- for dravidian languages in code-mixed text. In Work- tational Linguistics. ing Notes of the Forum for Information Retrieval Evaluation (FIRE 2020). CEUR Workshop Proceed- Sepp Hochreiter and Jürgen Schmidhuber. 1997. ings. In: CEUR-WS. org, Hyderabad, India. Long short-term memory. Neural Computation, 9(8):1735–1780. Aditi Chaudhary, Karthik Raman, Kr- ishna Srinivasan, and Jiecao Chen. 2020. Simran Khanuja, Sandipan Danda- Dict-mlm: Improved multilingual pre-training using bilingualpat, dictionaries Anirudh. Srinivasan, Sunayana Sitaram, and Monojit Choudhury. 2020. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, GLUECoS: An evaluation benchmark for code-switched NLP. Vishrav Chaudhary, Guillaume Wenzek, Francisco In Proceedings of the 58th Annual Meeting of the Guzmán, Edouard Grave, Myle Ott, Luke Zettle- Association for Computational Linguistics, pages moyer, and Veselin Stoyanov. 2019. Unsupervised 3575–3585, Online. Association for Computational cross-lingual representation learning at scale. arXiv Linguistics. preprint arXiv:1911.02116. Diederik P. Kingma and Jimmy Ba. 2014. Alexis Conneau, Kartikay Khandelwal, Naman Adam: A method for stochastic optimization. Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Adam Paszke, Sam Gross, Francisco Massa, Adam Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Lerer, James Bradbury, Gregory Chanan, Trevor Unsupervised cross-lingual representation learning at scale.Killeen, Zeming Lin, Natalia Gimelshein, Luca Proceedings of the 58th Annual Meeting of the Asso- Antiga, Alban Desmaison, Andreas Kopf, Edward ciation for Computational Linguistics. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Kristina Toutanova.2018. Bert: Pre-training of deep Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc. Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints, pages arXiv–2008. Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Pro- ceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 4996– 5001, Florence, Italy. Association for Computa- tional Linguistics. Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Kr- ishna Rallabandi, and Alan W Black. 2019. A sur- vey of code-switched speech and language processing. arXiv preprint arXiv:1904.00784. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Can- wen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Semeval-2019 task 6: Identifying and cate- gorizing offensive language in social media (offenseval). arXiv preprint arXiv:1903.08983. Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Ça˘grı Çöltekin. 2020. Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020). arXiv preprint arXiv:2006.07235.