Sentiment Analysis of Dravidian Code Mixed Data
Asrita Venkata Mandalam
Department of CSIS
BITS Pilani, Pilani Campus
Rajasthan, India
f20171179@pilani.bits-pilani.ac.in

Yashvardhan Sharma
Department of CSIS
BITS Pilani, Pilani Campus
Rajasthan, India
yash@pilani.bits-pilani.ac.in

Abstract

This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity. With datasets of code-mixed Tamil and Malayalam available, three methods are proposed: a sub-word level model, a word embedding based model and a machine learning based architecture. The sub-word and word embedding based models utilized a Long Short Term Memory (LSTM) network along with language-specific preprocessing, while the machine learning model used term frequency–inverse document frequency (TF-IDF) vectorization along with a Logistic Regression model. The sub-word level model was submitted to the track 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' proposed by the Forum of Information Retrieval Evaluation in 2020 (FIRE 2020). Although it received ranks of 5 and 12 for the Tamil and Malayalam tasks respectively in the FIRE 2020 track, this paper improves upon those results to attain final weighted F1-scores of 0.65 for the Tamil task and 0.68 for the Malayalam task. The former score is equivalent to that attained by the highest ranked team of the Tamil track.

1 Introduction

Code-mixing usually involves two languages to create another language that consists of elements of both in a structurally understandable manner. It has been noted that bilingual and multilingual societies use multiple languages together in informal speech and text. Dravidian code-mixed languages, including but not limited to Malayalam and Tamil, are increasingly used by younger generations in advertising, entertainment and social media. These languages are commonly written in Roman script. With the rise in the number of non-English and multilingual speakers using social media, there is an interest in analysing the sentiment of the content posted by them. As code-mixed data does not belong to one language and is often written using Roman script, identifying its polarity cannot be done using traditional sentiment analysis models (Puranik et al., 2021; Hegde et al., 2021; Yasaswini et al., 2021; Ghanghor et al., 2021b,a). In social media, low-resourced languages such as Tamil and Malayalam have been increasingly used along with English (Thavareesan and Mahesan, 2019, 2020a,b). The Indian government declared Tamil a classical language in 2004, meaning that it meets three criteria: its sources are old, it has an independent tradition, and it has a significant body of ancient literature. The most ancient non-Sanskritic Indian literature is found in Tamil. The earliest existing Tamil literary works and their commentaries commemorate the organisation of long-term Tamil Sangams (600 BCE to 300 CE) by the Pandiyan rulers; these Sangams studied, established and made improvements in the Tamil language. Identifying the sentiment of this data can prove to be useful in social media monitoring and in gauging users' feedback towards other online content.

The 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' task proposed by FIRE 2020 (Chakravarthi et al., 2020c; Chakravarthi, 2020) contains a code-mixed dataset consisting of comments from social media websites for both Tamil and Malayalam. Each team had to submit a set of predicted sentiments for the Tamil-English and Malayalam-English mixed test sets (Chakravarthi et al., 2020d).
Along with language-specific preprocessing techniques, the implemented model made use of sub-word level representations to incorporate features at the morpheme level, the smallest meaningful unit of any language. Evaluated by weighted average F1-score, the sub-word level approach achieved the 5th highest score in the Tamil task and the 12th rank in the Malayalam task (Sharma and Mandalam, 2020).

This paper presents the aforementioned model and two more models - a word embedding based model and a machine learning model - in an attempt to improve upon the submitted weighted F1-scores. The highest scoring model for the Tamil dataset received a score equivalent to that of the highest scoring model presented at the Sentiment Analysis for Dravidian Languages in Code-Mixed Text track of FIRE 2020. Although that team achieved it using a Bidirectional Encoder Representations from Transformers (BERT) based model, the model presented in this paper uses Word2Vec and FastText embeddings to achieve the same score. The implemented models are available on GitHub (https://github.com/avmand/SA_Dravidian).

2 Related Work

Analysing the sentiment of code-mixed data is important as traditional methods fail when given such data. Barman et al. (2014) concluded that n-grams proved to be useful in their experiments, which involved multiple languages written in Roman script. Bojanowski et al. (2017) used character n-grams in their skip-gram model. The lack of preprocessing resulted in a shorter training time, and the model outperformed baselines that did not consider sub-word information. Joshi et al. (2016) also outperformed existing systems by using a sub-word based LSTM architecture. Their dataset consisted of 15% negative, 50% neutral and 35% positive comments. As their dataset was imbalanced like the one used in this paper, the submitted approach involved morpheme extraction, as it would help in identifying the polarity of the dataset. Using a hybrid architecture, Lal et al. (2019) achieved an F1-score of 0.827 on the Hindi-English dataset released by Joshi et al. (2016). After generating the sub-word level information of the input data, they used one Bidirectional LSTM to determine the sentiment of the given sentence and another to select the sub-words that contributed to the overall sentiment of the sentence.

In more recent work, Jose et al. (2020) surveyed publicly available code-mixed datasets. They noted statistics about each dataset, such as vocabulary size and sentence length. Priyadharshini et al. (2020) used embeddings of languages closely related to those of the code-mixed corpus to predict Named Entities in the same corpus. Yadav and Chakraborty (2020) used multilingual embeddings to identify the sentiment of code-mixed text. As they used an unsupervised approach, they performed sentiment analysis by using the embeddings to transfer knowledge from monolingual text to their code-mixed input. Yadav et al. (2020) designed an ensemble model consisting of four classifiers: Naive Bayes, Support-Vector Machine (SVM), Linear Regression, and Stochastic Gradient Descent (SGD) classifiers. They tested it on a Hindi-English code-mixed dataset and managed to surpass the scores attained by the baseline and state-of-the-art systems.

The proposed models test three approaches: sub-word analysis, word-level embeddings and machine learning algorithm based classification. The top few submissions at the 'Sentiment Analysis for Dravidian Languages in Code-Mixed Text' task proposed by FIRE 2020 (Chakravarthi et al., 2020c) used BERT-based models. This work aims to attain the same F1-score with the help of non-BERT-based approaches. As Dravidian code-mixed languages are under-resourced, this paper also describes the preprocessing methods used to aid in the sentiment classification of comments that contain both Roman and Tamil/Malayalam characters.

3 Dataset

The models have been trained, validated and tested using the Tamil (Chakravarthi et al., 2020b) and Malayalam (Chakravarthi et al., 2020a) datasets provided by the organizers of the Dravidian CodeMix FIRE 2020 task. The Tamil code-mixed dataset consists of 11,335 comments in the train set, 1,260 in the validation set and 3,149 comments for testing the model. In the Malayalam code-mixed dataset, there are 4,851 comments for training, 541 for validation and 1,348 for testing the model. Table 1 gives the distribution of each sentiment in each dataset.

Dataset     Positive   Negative   Mixed Feelings   Unknown State   Other Languages
Tamil       10,559     2,037      1,801            850             497
Malayalam   2,811      738        403              1,903           884

Table 1: Distribution of data in the Tamil and Malayalam datasets

4 Proposed Technique

4.1 Sub-word level model

The submitted approach used a sub-word level model as it accounts for words that share a similar morpheme. For example, in the Tamil dataset, Ivan, Ivanga and Ivana have similar meanings due to their root word Ivan.

First, the dataset is preprocessed to replace all emojis with their corresponding descriptions in English. As the dataset contains both Roman and Tamil (or Malayalam) characters, the latter are replaced with their corresponding Roman script representations.

From the preprocessed data, a set of characters was obtained. The input to the model is a set of character embeddings. The sub-word level representation is generated through a 1-D convolution layer with ReLU activation, a convolutional window of size 5 and 128 output filters. After getting a morpheme-level feature map, a 1-D maximum pooling layer is used to obtain its most prominent features.
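To make the architecture concrete, the following is a minimal sketch of this sub-word model, not the released implementation: it assumes the Keras API, and the character vocabulary size (CHAR_VOCAB), maximum comment length (MAX_LEN), embedding dimension and LSTM width are placeholders. Only the convolution settings (128 filters, window of 5, ReLU) and the 1-D max pooling come from the description above; the LSTM classifier is drawn from the abstract.

from tensorflow.keras import layers, models

CHAR_VOCAB = 70   # hypothetical: distinct characters after preprocessing
MAX_LEN = 200     # hypothetical: maximum comment length in characters
NUM_CLASSES = 5   # sentiment labels of Table 1

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # Character embeddings form the model input (dimension is an assumption).
    layers.Embedding(input_dim=CHAR_VOCAB, output_dim=64),
    # 1-D convolution: 128 output filters, window of 5 characters, ReLU,
    # yielding the morpheme-level feature map described above.
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    # 1-D max pooling keeps the most prominent sub-word features.
    layers.MaxPooling1D(),
    # LSTM classifier, as stated in the abstract (width is an assumption).
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])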
4.2 Word embedding based model

The Word2Vec and FastText word embeddings were used to create an embedding matrix. For the classification of the input data, two LSTM layers were used after the embedding layer. The dimensionalities of the output spaces of the first and second LSTM layers were 64 and 32 respectively. Similar to the previous model, batch normalization and early stopping were used, with the model automatically stopping training after 5 epoch cycles.

4.3 Machine Learning model

Till now, all of the tested models used deep learning.
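The abstract describes this classifier as TF-IDF vectorization feeding a Logistic Regression model. As a rough sketch of that pipeline, assuming scikit-learn and illustrative hyperparameters (the vectorizer settings and iteration cap are not the authors' reported values):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features over the preprocessed comments, classified with
# Logistic Regression into the five sentiment labels of Table 1.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Hypothetical usage: train_texts/train_labels are the preprocessed
# training comments and their sentiment labels.
# clf.fit(train_texts, train_labels)
# predicted = clf.predict(test_texts)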