
Emotion Classification in a Resource Constrained Language Using Transformer-based Approach

Avishek Das, Omar Sharif, Mohammed Moshiul Hoque and Iqbal H. Sarker
Department of Computer Science and Engineering
Chittagong University of Engineering & Technology, Chittagong-4349, Bangladesh
[email protected], {omar.sharif, moshiul_240, iqbal}@cuet.ac.bd

Abstract

Although research on emotion classification has progressed significantly in high-resource languages, it is still in its infancy for resource-constrained languages like Bengali. Moreover, the unavailability of necessary language processing tools and the deficiency of benchmark corpora make the emotion classification task in Bengali more challenging and complicated. This work proposes a transformer-based technique to classify Bengali text into one of six basic emotions: anger, disgust, fear, joy, sadness, and surprise. A Bengali emotion corpus consisting of 6243 texts is developed for the classification task. Experimentation is carried out using various machine learning (LR, RF, MNB, SVM), deep neural network (CNN, BiLSTM, CNN+BiLSTM) and transformer (Bangla-BERT, m-BERT, XLM-R) based approaches. Experimental outcomes indicate that XLM-R outdoes all other techniques by achieving the highest weighted f1-score of 69.73% on the test data. The dataset is publicly available at https://github.com/omar-sharif03/NAACL-SRW-2021.

1 Introduction

Classification of emotion in text signifies the task of automatically attributing an emotion category to a textual document selected from a set of predetermined categories. With the growing number of users on virtual platforms generating online content steadily at a fast pace, interpreting emotion or sentiment in online content is vital for consumers, enterprises, business leaders, and other parties concerned. Ekman (Ekman, 1993) defined six basic emotions: happiness, fear, anger, sadness, surprise, and disgust based on facial features. These primary types of emotions can also be extracted from text expressions (Alswaidan and Menai, 2020).

The availability of vast amounts of online data and the advancement of computational processes have accelerated the development of emotion classification in high-resource languages such as English, Arabic, Chinese, and French (Plaza del Arco et al., 2020). However, there is no notable progress in low-resource languages such as Bengali, Tamil and Turkish. The proliferation of the Internet and digital technology usage produces enormous textual data in the Bengali language. The analysis of these massive amounts of data to extract the underlying emotions is a challenging research issue in the realm of Bengali language processing (BLP). The complexity arises due to various limitations, such as the lack of BLP tools, scarcity of benchmark corpora, complicated language structure, and limited resources. By considering the constraints of emotion classification in the Bengali language, this work aims to contribute the following:

• Develop a Bengali emotion corpus consisting of 6243 text documents with manual annotation to classify each text into one of six emotion classes: anger, disgust, fear, joy, sadness, surprise.

• Investigate the performance of various ML, DNN and transformer-based approaches on the corpus.

• Propose a benchmark system to classify emotion in Bengali text with experimental validation on the corpus.

2 Related Work

Substantial research activities have been carried out on emotion analysis in high-resource languages like English, Arabic, and Chinese (Alswaidan and Menai, 2020). Multi-label, multi-target emotion detection of Arabic tweets was accomplished using decision trees, random forest, and KNN, where random forest provided the highest f1-score of 82.6% (Alzu'bi et al., 2019). Lai et al. (2020) proposed a graph convolution network architecture for emotion classification from Chinese microblogs, and their proposed system achieved an F-measure of 82.32%. Recently, a few works employed transformer-based models (i.e., BERT) to analyse emotion in texts. Huang et al. (2019) and Al-Omari et al. (2020) used a pre-trained BERT for embedding purposes on top of LSTM/BiLSTM to get improved f1-scores of 76.66% and 74.78%, respectively.

Although emotion analysis in limited-resource languages like Bengali is at a preliminary stage, a few studies have already been conducted on emotion analysis using ML and DNN methods. Irtiza Tripto and Eunus Ali (2018) proposed an LSTM-based approach to classify multi-label emotions from Bengali and English sentences. This system considered only YouTube comments and achieved 59.23% accuracy. Another work on emotion classification in Bengali text was carried out by Azmin and Dhar (2019) concerning three emotional labels (i.e., happy, sadness and anger). They used Multinomial Naïve Bayes, which outperformed other algorithms with an accuracy of 78.6%. Pal and Karn (2020) developed a logistic regression-based technique to classify four emotions (joy, anger, sorrow, suspense) in Bengali text and achieved 73% accuracy. Das and Bandyopadhyay (2009) conducted a study to identify emotions in Bengali blog texts. Their scheme attained 56.45% accuracy using the conditional random field. A recent work used SVM to classify six raw emotions on 1200 Bengali texts and obtained 73% accuracy (Ruposh and Hoque, 2019).

3 BEmoC: Bengali Emotion Corpus

Due to the unavailability of a standard corpus, we developed a corpus (hereafter called 'BEmoC') to classify emotion in Bengali text. The development procedure is adopted from the guidelines stated in Dash and Ramamoorthy (2019).

3.1 Data Collection and Preprocessing

Five human crawlers were assigned to accumulate data from various online/offline sources. They manually collected 6700 text documents over three months (September 10, 2020 to December 11, 2020). The crawlers accumulated data selectively, i.e., when a crawler found a text that supports the definition of any of the six emotion classes according to Ekman (1993), the content was collected, otherwise ignored. Raw accumulated data needs the following pre-processing before the annotation:

• Removal of non-Bengali words, punctuation, emoticons and duplicate data.

• Discarding data of fewer than three words to get an unerring emotional context.

After pre-processing, the corpus holds 6523 text documents. The processed texts are eligible for manual annotation. The details of the preprocessing modules can be found in the project repository: https://github.com/omar-sharif03/NAACL-SRW-2021/tree/main/Code%20Snippets
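The cleaning steps above are straightforward to reproduce. The sketch below is a minimal illustration (not the authors' released code) that keeps only Bengali tokens, removes duplicates, and discards texts shorter than three words; the regex-based tokenizer and the `preprocess` helper are our own assumptions.

```python
import re

# Bengali characters fall in the Unicode block U+0980-U+09FF.
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")

def preprocess(texts):
    """Keep Bengali tokens only, drop duplicates and texts shorter than three words."""
    seen, cleaned = set(), []
    for text in texts:
        tokens = BENGALI_TOKEN.findall(text)      # strips punctuation, emoticons, non-Bengali words
        normalized = " ".join(tokens)
        if len(tokens) < 3 or normalized in seen:  # discard short or duplicate samples
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```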
3.2 Data Annotation and Quality

Five postgraduate students working on BLP were assigned for the initial annotation. To choose the initial label, a majority voting technique is applied (Magatti et al., 2009). Initial labels were scrutinized by an expert who has several years of research expertise in BLP. The expert corrected the labelling if any initial annotation was done inappropriately. The expert discarded 163 texts with neutral emotion and 117 texts with mixed emotions for the intelligibility of this research. To minimize bias during annotation, the expert finalized the labels through discussions and deliberations with the annotators (Sharif and Hoque, 2021). We evaluated the inter-annotator agreement to ensure the quality of the annotation using the coding reliability (Krippendorff, 2011) and Cohen's kappa (Cohen, 1960) scores. An inter-coder reliability of 93.1% with a Cohen's kappa score of 0.91 reflects the quality of the corpus.

3.3 Data Statistics

BEmoC contains a total of 6243 text documents after the preprocessing and annotation process. The amount of data included in BEmoC varies with the sources. For example, among online sources, Facebook contributed the highest amount (2796 texts) whereas YouTube (610 texts), blogs (483 texts), and news portals (270 texts) contributed smaller amounts. Offline sources contributed a total of 2084 texts, including storybooks (680 texts), novels (668 texts), and conversations (736 texts). Data are partitioned into a train set (4994 texts), a validation set (624 texts) and a test set (625 texts) to evaluate the models. Table 1 represents the amount of data in each class according to the train-validation-test split.

Class      Train   Validation   Test
Anger        621           67     71
Disgust     1233          155    165
Fear         700           89     83
Joy          908          120    114
Sadness      942          129    119
Surprise     590           64     73

Table 1: Number of instances in the train, validation, and test sets

Since the classifier models learn from the training set instances, we further analyzed this set. Table 2 shows several statistics of the training set.

Class      Total words   Unique words   Avg. words per text
Anger            14914           5852                 24.02
Disgust          27192           7212                 22.35
Fear             14766           5072                 21.09
Joy              20885           7346                 23.40
Sadness          22727           7398                 24.13
Surprise         13833           5675                 23.45
Total           114317          38555                     -

Table 2: Statistics of the train set of BEmoC

The sadness class contains the most unique words (7398), whereas the fear class contains the fewest (5072). On average, all the classes have more than 20 words per text document. However, a text document in the sadness class contained the maximum number of words (107), whereas a fear-class document contained the minimum number of words (4). Figure 1 represents the distribution of the number of texts versus text length for each class of the corpus. Investigating this figure reveals that most of the data have a length between 15 and 35 words. Interestingly, most of the texts of the disgust class have a length of less than 30, while the joy and sadness classes seem to have almost similar numbers of texts across all length distributions.

Figure 1: Corpus distribution concerning number of texts vs length of texts

For quantitative analysis, the Jaccard similarity among the classes has been computed. We used the 200 most frequent words from each emotion class, and the similarity values are reported in Table 3. The Anger-Disgust and Joy-Surprise pairs hold the highest similarities of 0.58 and 0.51, respectively. These scores indicate that more than 50% of the frequent words are common between these pairs of classes. On the other hand, the Joy-Fear pair has the lowest similarity index, which clarifies that this pair's frequent words are more distinct than those of other classes. These similarity issues can substantially affect the emotion classification task. Some sample instances of BEmoC are shown in Table 9 (Appendix B).
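The class-overlap analysis is easy to reproduce. The following sketch (our own illustration, assuming the training texts are grouped per label in a `class_texts` dictionary) computes the Jaccard similarity between the 200 most frequent words of each pair of classes, as reported in Table 3.

```python
from collections import Counter
from itertools import combinations

def top_words(texts, k=200):
    """Return the k most frequent words over all texts of one class."""
    counts = Counter(word for text in texts for word in text.split())
    return {word for word, _ in counts.most_common(k)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def pairwise_jaccard(class_texts, k=200):
    # class_texts: dict mapping each emotion label to its list of training texts
    vocab = {label: top_words(texts, k) for label, texts in class_texts.items()}
    return {(c1, c2): round(jaccard(vocab[c1], vocab[c2]), 2)
            for c1, c2 in combinations(vocab, 2)}
```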
      C1    C2    C3    C4    C5    C6
C1  1.00  0.58  0.40  0.43  0.45  0.47
C2     -  1.00  0.41  0.45  0.47  0.44
C3     -     -  1.00  0.37  0.45  0.46
C4     -     -     -  1.00  0.47  0.51
C5     -     -     -     -  1.00  0.48
C6     -     -     -     -     -  1.00

Table 3: Jaccard similarity between the emotion class pairs: anger (C1), disgust (C2), fear (C3), joy (C4), sadness (C5), surprise (C6).

4 Methodology

Figure 2 shows an abstract view of the strategies used. Various feature extraction techniques such as TF-IDF, Word2Vec, and FastText are used to train the ML and DNN models. Moreover, we also investigate the emotion classification performance of transformer-based models on Bengali text. All the models are trained and tuned on the identical dataset.

Figure 2: Abstract process of emotion classification

4.1 Feature Extraction

ML and DNN algorithms are unable to learn from raw texts. Therefore, feature extraction is required to train the classifier models.

TF-IDF: Term frequency-inverse document frequency (TF-IDF) is a statistical measure that determines the importance of a word to a document in a collection of documents. Uni-gram and bi-gram features are extracted from the 20000 most frequent words of the corpus.

Word2Vec: It utilizes neural networks to find the semantic similarity of the context of the words in a corpus (Mikolov et al., 2013). We trained Word2Vec with Skip-Gram, a window size of 7, a minimum word count of 4, and an embedding dimension of 100.

FastText: This technique uses subword information to find the semantic relationships (Bojanowski et al., 2017). We trained FastText with Skip-Gram, character n-grams of length 5, a window size of 5, and an embedding dimension of 100.

For both Word2Vec and FastText, pre-trained vectors are available for the Bengali language trained on a generalized Bengali wiki dump (Sarker, 2021). We observed that the deep learning models perform better on vectors trained on our developed BEmoC rather than on the pre-trained vectors.

4.2 ML Approaches

We started our investigation of the emotion detection system with ML models. Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF) and Multinomial Naive Bayes (MNB) techniques are employed using the TF-IDF text vectorizer. For LR, the 'lbfgs' solver and 'l1' penalty are chosen and the C value is set to 1. The same C value with a 'linear' kernel is used for SVM. Meanwhile, for RF 'n_estimators' is set to 100, and 'alpha=1.0' is chosen for MNB. A summary of the parameters chosen for the ML models is provided in Table 6 (Appendix A).
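To make the ML setup concrete, the sketch below shows one way to wire the TF-IDF features of Section 4.1 to the four classifiers with the hyperparameters reported above and in Table 6 (Appendix A). It is an illustrative reconstruction rather than the released code; names such as `run_ml_baselines` are our own, and the LR solver is switched to 'liblinear' because scikit-learn's 'lbfgs' does not support the l1 penalty.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Uni-gram and bi-gram TF-IDF features over the 20000 most frequent words
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20000)

classifiers = {
    # 'liblinear' is used here so the l1 penalty of Table 6 can actually be applied
    "LR": LogisticRegression(solver="liblinear", penalty="l1", C=1, max_iter=400),
    "SVM": SVC(kernel="linear", C=1, gamma="scale", tol=0.001, random_state=0),
    "RF": RandomForestClassifier(criterion="gini", n_estimators=100),
    "MNB": MultinomialNB(alpha=1.0),
}

def run_ml_baselines(train_texts, y_train, test_texts, y_test):
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        score = f1_score(y_test, clf.predict(X_test), average="weighted")
        print(f"{name}: weighted F1 = {score:.4f}")
```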

4.3 DNN Approaches

Variations of deep neural networks (DNN) such as CNN, BiLSTM and a combination of CNN and BiLSTM (CNN+BiLSTM) are investigated for the emotion classification task in Bengali. To train all the DNN models, the 'adam' optimizer with a learning rate of 0.001 and a batch size of 16 is used for 35 epochs, and 'sparse categorical crossentropy' is selected as the loss function.

CNN: A Convolutional Neural Network (CNN) (LeCun et al., 2015) is tuned over the emotion corpus. The trained weights from the Word2Vec/FastText embeddings are fed to the embedding layer to generate a sequence matrix. The sequence matrix is then passed to the convolution layer having 64 filters of size 7. The convolution layer's output is max-pooled over time and then transferred to a fully connected layer with 64 neurons. 'ReLU' activation is used in the corresponding layers. Finally, an output layer with softmax activation is used to compute the probability distribution of the classes.

BiLSTM: Bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) is a variation of the recurrent neural network (RNN). The developed BiLSTM network consists of an embedding layer similar to CNN, a BiLSTM layer with 32 hidden units, and a fully connected layer having 16 neurons with 'ReLU' activation. An output layer with 'softmax' activation is used.

CNN+BiLSTM: An embedding layer followed by a 1D convolutional layer with 64 filters of size three and a 1D max-pool layer is employed on top of two BiLSTM layers with 64 and 32 units. Outputs of the BiLSTM layer are fed to an output layer with 'softmax' activation.

Table 7 (Appendix A) illustrates the details of the hyperparameters used in the DNN models.
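As a concrete reading of the CNN and BiLSTM descriptions above, the following Keras sketch builds both architectures with the hyperparameters of Table 7 (embedding dimension 100, 64 filters of size 7, 32 BiLSTM units, etc.). It is an illustrative reconstruction under our own assumptions (e.g., the `build_embedding_matrix` helper that trains Word2Vec with gensim on BEmoC), not the authors' exact code.

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras
from tensorflow.keras import layers

EMB_DIM, NUM_CLASSES = 100, 6

def build_embedding_matrix(tokenized_texts, word_index):
    """Train Word2Vec on BEmoC (skip-gram, window 7, min count 4, 100 dimensions)."""
    w2v = Word2Vec(tokenized_texts, vector_size=EMB_DIM, sg=1, window=7, min_count=4)
    matrix = np.zeros((len(word_index) + 1, EMB_DIM))
    for word, idx in word_index.items():
        if word in w2v.wv:
            matrix[idx] = w2v.wv[word]
    return matrix

def embedding_layer(matrix):
    # Embedding layer initialized with the Word2Vec (or FastText) vectors
    return layers.Embedding(matrix.shape[0], EMB_DIM,
                            embeddings_initializer=keras.initializers.Constant(matrix))

def build_cnn(matrix):
    return keras.Sequential([
        embedding_layer(matrix),
        layers.Conv1D(64, 7, activation="relu"),   # 64 filters of size 7
        layers.GlobalMaxPooling1D(),               # max-pooling over time
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_bilstm(matrix):
    return keras.Sequential([
        embedding_layer(matrix),
        layers.Bidirectional(layers.LSTM(32)),     # BiLSTM with 32 hidden units
        layers.Dense(16, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

# Training setup stated in Section 4.3:
# model.compile(optimizer=keras.optimizers.Adam(1e-3),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=16, epochs=35)
```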
4.4 Transformer Models

We used three transformer models on BEmoC: m-BERT, Bangla-BERT, and XLM-R. In recent years, transformers have been used extensively for classification tasks to achieve state-of-the-art results (Chen et al., 2021). The models are culled from the Huggingface transformers library (https://huggingface.co/transformers/) and fine-tuned on the emotion corpus by using the ktrain (Maiya, 2020) package.

m-BERT: m-BERT (Devlin et al., 2019) is a transformer model pre-trained over 104 languages with more than 110M parameters. We employed the 'bert-base-multilingual-cased' model and fine-tuned it on BEmoC with a batch size of 12.

Bangla-BERT: Bangla-BERT (Sarker, 2020) is a pre-trained BERT masked language model trained on a sizeable Bengali corpus. We used the 'sagorsarker/bangla-bert-base' model and fine-tuned it to update the pre-trained model fitted for BEmoC. A batch size of 16 is used to provide better results.

XLM-R: XLM-R (Liu et al., 2019) is a sizeable multilingual language model trained on 100 different languages. We implemented the 'xlm-roberta-base' model on BEmoC with a batch size of 12.

All the transformer models have been trained for 20 epochs with a learning rate of 2e-5. By using checkpoints, the best intermediate model is stored to predict on the test data. Table 8 (Appendix A) shows the list of parameters used for the transformer models.
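The ktrain-based fine-tuning loop is compact; the sketch below shows a plausible reconstruction for XLM-R with the hyperparameters of Table 8 (max sequence length 70, learning rate 2e-5, 20 epochs, 'autofit', batch size 12). Variable names such as `train_texts`/`train_labels` are our own assumptions; swapping `MODEL_NAME` reproduces the m-BERT and Bangla-BERT runs (with a batch size of 16 for the latter).

```python
import ktrain
from ktrain import text

MODEL_NAME = "xlm-roberta-base"   # or "bert-base-multilingual-cased", "sagorsarker/bangla-bert-base"
class_names = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

t = text.Transformer(MODEL_NAME, maxlen=70, class_names=class_names)
trn = t.preprocess_train(train_texts, train_labels)
val = t.preprocess_test(val_texts, val_labels)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=12)
learner.autofit(2e-5, 20, checkpoint_folder="checkpoints")   # lr 2e-5, up to 20 epochs

predictor = ktrain.get_predictor(learner.model, preproc=t)
predictions = predictor.predict(test_texts)
```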

5 Results and Analysis

This section presents a comprehensive performance analysis of the various ML, DNN, and transformer-based models for classifying emotion in Bengali texts. The superiority of the models is determined based on the weighted f1-score. However, the precision (Pr), recall (Re) and accuracy (Acc) metrics are also considered. Table 4 reports the evaluation results of all models.

Among the ML approaches, LR achieved a higher f1-score (60.75%) than RF (52.78%), MNB (48.67%) and SVM (59.54%). LR also performed better in Pr, Re and Acc than the other ML models. Among the DNN models, BiLSTM with FastText outperformed the other approaches on all evaluation parameters, achieving an f1-score of 56.94%. However, BiLSTM (FastText) achieved an f1-score about 4% lower than the best ML method (i.e., LR).

After employing the transformer-based models, we observed a significant increase in all scores. Among the transformer-based models, Bangla-BERT achieved the lowest f1-score of 61.91%. However, this model still outperformed the best ML and DNN approaches (56.94% for BiLSTM (FastText) and 60.75% for LR). Meanwhile, m-BERT shows an almost 3% higher f1-score (64.39%) than Bangla-BERT (61.91%). The XLM-R model shows an immense improvement of about 6% compared to Bangla-BERT and 5% compared to m-BERT, respectively. It achieved an f1-score of 69.73%, which is the highest among all models.

Method        Classifier                Pr(%)   Re(%)   F1(%)   Acc(%)
ML models     LR                        61.07   60.64   60.75   60.64
              RF                        55.91   54.72   52.78   54.72
              MNB                       60.23   54.08   48.67   54.08
              SVM                       61.12   60.10   59.54   60.00
DNN models    CNN (Word2Vec)            53.20   52.12   51.84   52.12
              CNN (FastText)            54.54   53.45   52.52   53.48
              BiLSTM (Word2Vec)         56.81   55.78   53.45   57.12
              BiLSTM (FastText)         57.30   58.08   56.94   58.08
              CNN+BiLSTM (Word2Vec)     56.48   56.64   56.39   56.64
              CNN+BiLSTM (FastText)     55.74   55.68   55.41   55.68
Transformers  Bangla-BERT               62.08   62.24   61.91   62.24
              m-BERT                    64.62   64.64   64.39   64.63
              XLM-R                     70.11   69.61   69.73   69.61

Table 4: Comparison of various approaches on the test set. Here Acc, Pr, Re, F1 denote accuracy, weighted precision, recall, and f1-score.
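For reference, the weighted scores reported in Table 4 can be computed from the test-set predictions as in the small illustrative snippet below (assuming `y_test` and `y_pred` label arrays).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_test, y_pred):
    pr, re, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
    acc = accuracy_score(y_test, y_pred)
    print(f"Pr={pr:.4f}  Re={re:.4f}  F1={f1:.4f}  Acc={acc:.4f}")
```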

5.1 Error Analysis

It is evident from Table 4 that XLM-R is the best performing model for classifying emotion in Bengali texts. A detailed error analysis is therefore performed using the confusion matrix. Figure 3 illustrates the class-wise proportion of predicted labels.

Figure 3: Confusion matrix of the XLM-R model

It is observed from the matrix that some data are classified wrongly. For example, 8 instances out of 71 of the anger class are predicted as disgust. In the fear class, 6 instances out of 83 are mistakenly classified as sadness. In the sadness class, the misclassification ratio is the highest (15.13%): 18 instances out of 119 are misclassified as disgust. Moreover, among 73 instances of the surprise class, 9 are predicted as sadness. The error analysis reveals that the fear class achieved the highest rate of correct classification (77.15%) while surprise gained the lowest (61.64%).

The possible reason for the incorrect predictions might be the class-imbalanced nature of the corpus. However, the high values of the Jaccard similarity (Table 3) also reveal some interesting points. A few words are used multi-purposely in multiple classes. For instance, hate words can be used to express both anger and disgust. Moreover, emotion classification is highly subjective, depends on the individual's perception, and people may contemplate a sentence in many ways (LeDoux and Hofmann, 2018). Thus, by developing a balanced dataset with more diverse data, incorrect predictions might be reduced to some extent.
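The per-class figures above come straight from a confusion matrix; a minimal sketch (our own illustration, assuming test labels and XLM-R predictions as string arrays) is shown below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def per_class_rates(y_test, y_pred):
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    correct = np.diag(cm) / cm.sum(axis=1)     # correct-classification rate per class
    for name, rate in zip(labels, correct):
        print(f"{name}: {rate:.2%}")
    return cm
```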
5.2 Comparison with Recent Works

The analysis of the results revealed that XLM-R is the best model for classifying emotion in Bengali texts. Thus, we compare the performance of XLM-R with the existing techniques to assess its effectiveness. We implemented previous methods (Irtiza Tripto and Eunus Ali, 2018; Azmin and Dhar, 2019; Pal and Karn, 2020; Ruposh and Hoque, 2019) on BEmoC and report the outcomes in terms of f1-score. Table 5 shows a summary of the comparison. The results show that XLM-R outperformed the past techniques by achieving the highest f1-score (69.73%).

Methods                                               F1(%)
Word2Vec + LSTM (Irtiza Tripto and Eunus Ali, 2018)   53.54
TF-IDF + MNB (Azmin and Dhar, 2019)                   48.67
TF-IDF + LR (Pal and Karn, 2020)                      60.75
BOW + SVM (Ruposh and Hoque, 2019)                    59.17
XLM-R (Proposed)                                      69.73

Table 5: Performance comparison. Here F1 denotes weighted f1-score.

6 Conclusion

This paper investigated various ML, DNN and transformer-based techniques to classify emotion in Bengali texts. Due to the scarcity of a benchmark corpus, we developed a corpus (i.e., BEmoC) containing 6243 Bengali texts labelled with six basic emotion classes. A Cohen's kappa score of 0.91 reflects the quality of the corpus. Performance analysis on BEmoC illustrated that XLM-R, a transformer model, provided a superior result among all the methods. Specifically, XLM-R achieved the highest weighted f1-score (69.73%) and accuracy (69.61%), an accuracy improvement of 8.97% over the best ML model (LR) and 11.53% over the best DNN model (BiLSTM with FastText). Although XLM-R exhibited the highest scores, other techniques (such as an ensemble of the transformer models) can also be investigated to enhance performance. Additional emotion categories (such as hate) can also be included for better generalization. Moreover, transformer-based models can be investigated further by extending the corpus with texts containing sarcasm or irony, comparison, and mixed emotions.

Acknowledgements

We sincerely acknowledge the anonymous reviewers and the pre-submission mentor for their insightful suggestions, which helped improve the work. This work was supported by the Directorate of Research & Extension, CUET.

References

H. Al-Omari, M. A. Abdullah, and S. Shaikh. 2020. EmoDet2: Emotion detection in English textual dialogue using BERT and BiLSTM models. In 2020 11th International Conference on Information and Communication Systems (ICICS), pages 226–232.

Nourah Alswaidan and Mohamed El Bachir Menai. 2020. A survey of state-of-the-art approaches for emotion recognition in text. Knowledge and Information Systems, pages 1–51.

S. Alzu'bi, O. Badarneh, B. Hawashin, M. Al-Ayyoub, N. Alhindawi, and Y. Jararweh. 2019. Multi-label emotion classification for Arabic tweets. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 499–504.

S. Azmin and K. Dhar. 2019. Emotion detection from Bangla text corpus using Naïve Bayes classifier. In 2019 4th International Conference on Electrical Information and Communication Technology (EICT), pages 1–5.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ben Chen, Bin Chen, Dehong Gao, Qijin Chen, Chengfu Huo, Xiaonan Meng, Weijun Ren, and Yang Zhou. 2021. Transformer-based language model fine-tuning methods for COVID-19 fake news detection. arXiv preprint arXiv:2101.05509.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Dipankar Das and Sivaji Bandyopadhyay. 2009. Word to sentence level emotion tagging for Bengali blogs. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 149–152.

Niladri Sekhar Dash and L. Ramamoorthy. 2019. Utility and Application of Language Corpora. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Paul Ekman. 1993. Facial expression and emotion. American Psychologist, 48(4):384.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Chenyang Huang, Amine Trabelsi, and Osmar R. Zaïane. 2019. ANA at SemEval-2019 task 3: Contextual emotion detection in conversations through hierarchical LSTMs and BERT. CoRR, abs/1904.00132.

N. Irtiza Tripto and M. Eunus Ali. 2018. Detecting multilabel sentiment and emotions from Bangla YouTube comments. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1–6.

Klaus Krippendorff. 2011. Agreement and information in the reliability of coding. Communication Methods and Measures, 5(2):93–112.
Yuni Lai, Linfeng Zhang, Donghong Han, Rui Zhou, and Guoren Wang. 2020. Fine-grained emotion classification of Chinese microblogs based on graph convolution networks. World Wide Web, 23(5):2771–2787.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Joseph E. LeDoux and Stefan G. Hofmann. 2018. The subjective experience of emotion: a fearful view. Current Opinion in Behavioral Sciences, 19:67–72. Emotion-cognition interactions.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

D. Magatti, S. Calegari, D. Ciucci, and F. Stella. 2009. Automatic labeling of topics. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 1227–1232.

Arun S. Maiya. 2020. ktrain: A low-code library for augmented machine learning. arXiv preprint arXiv:2004.10703.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Aditya Pal and Bhaskar Karn. 2020. Anubhuti–an annotated dataset for emotional analysis of Bengali short stories. arXiv preprint arXiv:2010.03065.

Flor Miriam Plaza del Arco, Carlo Strapparava, L. Alfonso Urena Lopez, and Maite Martin. 2020. EmoEvent: A multilingual emotion corpus based on different events. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1492–1498, Marseille, France. European Language Resources Association.

H. A. Ruposh and M. M. Hoque. 2019. A computational approach of recognizing emotion from Bengali texts. In 2019 5th International Conference on Advances in Electrical Engineering (ICAEE), pages 570–574.

Sagor Sarker. 2020. BanglaBERT: Bengali mask language model for Bengali language understanding.

Sagor Sarker. 2021. BNLP: Natural language processing toolkit for Bengali language. arXiv preprint arXiv:2102.00405.

Omar Sharif and Mohammed Moshiul Hoque. 2021. Identification and classification of textual aggression in social media: Resource creation and evaluation. In Combating Online Hostile Posts in Regional Languages during Emergency Situation, pages 9–20, Cham. Springer International Publishing.

Appendices

A Model Hyperparameters

Classifier   Parameters
LR           solver = 'lbfgs', max_iter = 400, penalty = 'l1', C = 1
SVM          kernel = 'linear', random_state = 0, gamma = 'scale', tol = 0.001
RF           criterion = 'gini', n_estimators = 100
MNB          alpha = 1.0, fit_prior = True, class_prior = None

Table 6: Optimized parameters for ML models

Hyperparameters          Hyperparameter Space                                   CNN      BiLSTM   CNN+BiLSTM
Filter Size              3, 5, 7, 9                                             7        -        3
Pooling Type             'max', 'average'                                       'max'    -        'max'
Embedding Dimension      30, 35, 50, 70, 90, 100, 150, 200, 250, 300            100      100      100
Number of Units          16, 32, 64, 128, 256                                   64       32       64, 64, 32
Neurons in Dense Layer   16, 32, 64, 128, 256                                   64       16       -
Batch Size               16, 32, 64, 128, 256                                   16       16       16
Activation Function      'relu', 'tanh', 'softplus', 'sigmoid'                  'relu'   'relu'   'relu'
Optimizer                'RMSprop', 'Adam', 'SGD', 'Adamax'                     'Adam'   'Adam'   'Adam'
Learning Rate            0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001     0.001    0.001    0.001

Table 7: Hyperparameters for DNN methods

Hyperparameter        Value
Fit method            'autofit'
Learning rate         2e-5
Epochs                20
Batch size            12, 16
Max sequence length   70

Table 8: Optimized hyperparameters for transformers models