Multilingual Toxic Comment Classification Kaggle Competition

Gabriela Ožegović, Graz University of Technology, [email protected]
Sven Celin, Graz University of Technology, [email protected]

ABSTRACT
The great expansion of the internet has brought a wide range of people online. These different cultures differ in many things, from political views, religion and economics to their favourite singer or actress, and because of these differences people start to act irrationally and fight online. This happens on a daily basis, in the form of internet bullying, online harassment or personal attacks. The idea of this project was to limit toxic comments written by users and flag them as inappropriate. In this work, we show how we flagged toxic comments in the large dataset provided by the Kaggle competition on this topic.

Author Keywords
Text classification, Text mining, Toxic text classification, Word embeddings, word2vec, Feature extraction

INTRODUCTION
Every platform that serves a lot of people will, at some point of its existence, see disagreements, abuse and harassment from groups of people who disagree with its ideas. A single comment is enough to derail an online discussion. To counter that flow of non-constructive comments, this competition was created to find the best algorithm for flagging toxic comments. As platforms struggle to effectively enable conversations in their comment sections, many limit or completely shut down user comments. This project focuses machine learning models on identifying toxicity in online conversations and flagging comments that are rude, disrespectful or likely to make someone leave the discussion. If such comments can be identified, discussions become safer and more collaborative.

Related work
In 2018, a competition called the "Toxic Comment Classification Challenge" [3] was held on Kaggle. Competitors were asked to build models that not only recognize toxicity, but also detect several types of it. The types of toxicity to be detected were: severe toxicity, obscene, threat, insult and identity hate. The goal was to enable users to select specific types of toxicity and focus on them, since some sites might be fine with one type of toxicity (e.g. severe toxicity) and not others.

In 2019, another competition on a similar topic was held. In the "Unintended Bias in Toxicity Classification Challenge" [4], the focus was on detecting toxicity while minimizing unintended model bias. Unintended model bias arises when detecting the toxicity of comments that contain the names of identities which are frequently attacked or used in offensive ways. A model trained on this type of data ends up classifying a text as toxic just because it mentions a certain identity, even though the text itself is not toxic.

KAGGLE
Kaggle is one of the biggest data science and machine learning communities, where users publish datasets and kernels for everyone to see [11]. This web-based system allows data scientists and machine learning engineers to enter competitions in which they try to solve data science challenges. In its newest instance, Kaggle allows challengers to use TPU and GPU cores on its cloud servers. That sped up many machine learning algorithms and brought data science closer to people who have no GPU or TPU cores in their own computers.

Competitions
The sole purpose of Kaggle competitions is to improve data science and machine learning in a particular field or on a particular topic from industry [1]. Every competition consists of an overview, where the competition maker describes the work and the data, the evaluation, the timeline, the prizes and the code requirements. Second, and most important, is the dataset provided by the competition maker; there contestants can access all the data the competition requires for building their models and assumptions. For contestants, the notebooks section is the go-to place when submitting their results. There, a contestant uploads (or creates in the online editor) a Python notebook and runs it on Kaggle servers with all the provided data. After the notebook has been executed on the cloud servers, the user is placed on the leaderboard according to their score. Competitions often attract more than a few thousand teams, with a maximum of 5 contestants per team, so it is fair to say that these competitions attract a large number of people from all over the world.
JIGSAW MULTILINGUAL TOXIC COMMENT CLASSIFICATION
In this year's competition, the main challenge was to build multilingual models for toxicity classification with English-only training data [5].

Jigsaw developed the Perspective API, which uses machine learning models to determine the impact a comment could have on a conversation [8]. It is served in multiple languages (English, Spanish, French, German, Portuguese and Italian). Recent work has shown impressive multilingual capabilities, including few- and zero-shot learning, and the goal is to see whether these results are applicable to toxicity classification.

The host team shared two notebooks to help competitors start running BERT. BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations developed by Google, which produced state-of-the-art results in a wide variety of NLP tasks.

The competitors had a week to work on their models and try to get the best solution. The expected result is the probability that a given comment is toxic, a value between 0 and 1.
Figure 1. Competition

Dataset
The provided training data is English-only, and it is the same data that was provided for the previous two competitions (mentioned in Related work). Both training datasets are given in unprocessed and processed format. The unprocessed format is simply the data as originally given, while the processed format is ready to be used as input to BERT. The data consists of columns such as id, comment text, toxic (whether the comment is toxic or not, or the probability of the comment being toxic), and the types of toxicity, each in a separate column.

The test data consists of comments from Wikipedia talk pages in several different languages (Italian, Spanish, Turkish, Portuguese, Russian and French). It has only a few columns: id, comment content and language of the content.

In addition, competitors were provided with validation data, which is, just like the test data, only in non-English languages. It consists of the columns: id, comment text, language and toxic.

Problem approach
The first step was to inspect the data; it is important to know what the data we are working on looks like. Combining both training datasets, the number of provided comments exceeds 2 million. After counting the toxic and non-toxic comments, it turned out that only around 6% of the provided comments are flagged as toxic. To lower the possibility of false negatives after training, we decided to balance the dataset and downsample it. That seemed like the best solution, since an algorithm trained on such an unbalanced training set would most likely not learn to discriminate based on the features. The ratio we opted for is 1:2, so for each toxic comment there are two non-toxic ones.

After that, we ran several feature extraction processes, which are explained in the next chapter. In the end, we used a classifier of our choice to train the model and predict the toxicity probabilities of the test data.
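As a rough illustration of this step, the snippet below sketches how the two training files could be combined and downsampled with pandas to the 1:2 ratio described above. It is a minimal sketch rather than our exact notebook code: the file names are placeholders for the competition input files, and rounding the probabilistic labels to 0/1 anticipates the fix discussed in the Comment classification section.

```python
import pandas as pd

# Load the two English-only training sets provided by the competition
# (file names are placeholders for the Kaggle input files).
train1 = pd.read_csv("toxic-comment-train.csv")
train2 = pd.read_csv("unintended-bias-train.csv")
train = pd.concat([train1, train2], ignore_index=True)

# The second dataset stores toxicity as a probability; round it to 0/1
# (see the Comment classification section for the reasoning).
train["toxic"] = (train["toxic"] >= 0.5).astype(int)

# Downsample the non-toxic majority class to a 1:2 toxic/non-toxic ratio.
toxic = train[train["toxic"] == 1]
non_toxic = train[train["toxic"] == 0].sample(n=2 * len(toxic), random_state=42)
balanced = pd.concat([toxic, non_toxic]).sample(frac=1, random_state=42)  # shuffle
```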

FEATURE EXTRACTION
Feature extraction is the process of distilling characteristics from the raw dataset and reducing it to manageable groups for processing [2]. A characteristic of datasets is that the features they provide are not always appropriate for the task the dataset is used for. The datasets provided for this competition contain no features which could directly help determine the toxicity of a comment: the only provided features are the comment itself, its language and whether or not the comment is toxic. For this reason, it is important to find characteristics which are closely related to the toxicity level and extract them as separate features. We brainstormed about how a user would type a comment if they are, e.g., angry. The extracted features and their significance are explained in the following sections.

Comment length
When thinking about toxic comments and toxic behaviour in writing, we thought about the length of each comment. We assumed that people who are willing to be toxic are angrier and have less patience than the ones who did not plan on being toxic. We can assume that they put less thought into their comments and, often in the heat of an argument, rush into them.

For this reason we computed the length of each comment and found that there is a difference between the average length of toxic and non-toxic comments: toxic comments indeed ended up being shorter than the non-toxic ones. The correlation is negative because the longer the comment, the lower the toxicity.

Figure 2. Point-biserial correlation with toxicity and comment length
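A short sketch of this feature, assuming the hypothetical `balanced` DataFrame from the previous snippet, could look as follows; the column names follow the dataset description above.

```python
from scipy.stats import pointbiserialr

# New feature: length of each comment in characters.
balanced["comment_length"] = balanced["comment_text"].str.len()

# Point-biserial correlation between the binary toxic label and the length;
# in our data this comes out negative (longer comments are less often toxic).
corr, p_value = pointbiserialr(balanced["toxic"], balanced["comment_length"])
print(f"correlation: {corr:.3f}, p-value: {p_value:.3g}")
```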

Punctuation count
Similar to the previous idea about the length of comments, our idea here was that people who are toxic are more prone to using multiple punctuation signs, like "???" or "!!!". This is something that makes people seem displeased, and in general people do not write that much punctuation.

Here we simply count all occurrences of consecutive punctuation marks, i.e. runs of more than one punctuation mark, and save the resulting value in a new column. This also turned out to have some impact on how the comments were classified. In addition, we added a separate column holding the ratio of each comment's punctuation count relative to the other comments, so that its correlation with toxicity can easily be checked.

What we concluded from these features is that there is a correlation between using many punctuation marks and toxicity: in general, comments with multiple punctuation marks have a higher chance of being toxic.

Figure 3. Point-biserial correlation with toxicity and punctuation count
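A possible implementation of the punctuation-run count, again continuing from the hypothetical `balanced` DataFrame, is sketched below. The exact normalisation behind the ratio column is not spelled out above, so dividing by the comment length is only one plausible reading.

```python
import re
from scipy.stats import pointbiserialr

# Count runs of two or more consecutive punctuation marks, e.g. "???" or "!!!".
def punctuation_runs(text: str) -> int:
    return len(re.findall(r"[!?.,]{2,}", text))

balanced["punct_count"] = balanced["comment_text"].apply(punctuation_runs)

# A normalised companion column (assumed normalisation: divide by length).
balanced["punct_ratio"] = balanced["punct_count"] / balanced["comment_length"].clip(lower=1)

corr, _ = pointbiserialr(balanced["toxic"], balanced["punct_count"])
print(f"punctuation-run correlation with toxicity: {corr:.3f}")
```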

Uppercase words
Our next thought was that people who are more toxic generally use more capitalized words. Writing in capitalized words is understood as if the person writing them is yelling, so it can be seen as negative.

Because of this, we counted all the capitalized words in each comment. The only condition is that we ignore capitalized words that are only one letter long; with this we eliminated words like "I", which is written capitalized in English.

We realised that the correlation between uppercase words and toxicity is not particularly high, but we still decided to use the acquired feature in our classifier.

Figure 4. Point-biserial correlation with toxicity and uppercase words

Count of bad words
Word embeddings, or word2vec, are a type of machine learning algorithm that lets words with similar meaning have a similar representation in vector format [6]. In this way, words such as "Man" and "Woman" are similar, just like "King" and "Queen". One of the most famous examples is "King" - "Man" + "Woman" = "Queen", which makes these algorithms easier to understand for a novice developer. Using this paradigm, the idea for this feature is that if we need to flag bad words in a dataset, the easiest thing to do is to build a list of "bad words" derived from that exact dataset. In this case the algorithm only uses words that actually appear in the comments, and it can also be multilingual.

With these findings we applied word2vec to this dataset. As we varied the parameters, we found that a list of 100 words works best on a dataset of this size, and we trained models for all six languages. After correlating the "bad words" feature with toxicity using the point-biserial correlation function, there was a 45% correlation between the two columns.

Figure 5. Word vectorization
Figure 6. In-depth view of word vectorization
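The exact procedure for assembling the 100-word list per language is not spelled out above, so the snippet below is only one plausible reading of it: train word2vec (gensim 4) on the comments of one language, expand a small seed list of offensive words via vector similarity, and count how often the resulting "bad words" appear in each comment. The seed words, list size and model parameters are illustrative assumptions.

```python
from gensim.models import Word2Vec

# Very rough tokenisation of the comments of one language
# (continuing from the hypothetical `balanced` DataFrame used above).
sentences = [text.lower().split() for text in balanced["comment_text"]]

# Train a small word2vec model on those comments.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Expand a small seed list of offensive words into a ~100-word "bad words" list
# via vector similarity; seed words and topn are illustrative only.
seed_words = ["idiot", "stupid", "hate"]
bad_words = set(seed_words)
for word in seed_words:
    if word in w2v.wv:
        bad_words.update(w for w, _ in w2v.wv.most_similar(word, topn=35))

# Feature: number of "bad words" occurring in each comment.
balanced["bad_word_count"] = balanced["comment_text"].apply(
    lambda t: sum(tok in bad_words for tok in t.lower().split())
)
```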
Sentiment
Sentiment analysis is the classification of emotions in text data using text analysis techniques. It allows a data scientist to identify whether a user has positive, negative or neutral thoughts about a certain topic. In our case, sentiment was used on the comments to figure out which comments deviate from neutral into the negative spectrum, so that they can be flagged as negative in nature. For this task we used TextBlob [10], a Python library that offers simple access to methods for basic NLP tasks. With its sentiment function, TextBlob returns two properties: polarity and subjectivity.

Polarity
Polarity is a number that ranges from -1 to 1 and represents the emotion with which a text is written, from -1 for very negative to 1 for very positive; values around 0 are neutral, meaning neither positive nor negative. After classifying the data with this tool, there was a negative 26% correlation between polarity and toxicity. From this it can be said that the more toxic a comment is, the more negative its connotation will be. This value also gave us some insight into how people structure their comments and what could be a red flag when encountering toxic comments.

Figure 7. Point-biserial correlation with toxicity and polarity

Subjectivity
Subjectivity is the other part of sentiment analysis; besides polarity, it tells us in which way the sentences were written. For example, subjective sentences generally refer to personal opinion, emotion or judgement, whereas objective ones refer to factual information. This would be very useful information in our case, because toxic comments are usually written in a more subjective manner and could be flagged in this way. That was the idea, but in practice it was not the case: on this dataset there was a correlation of only -1.5% between toxicity and subjectivity. This was an unexpected result, and since the correlation was so small, this feature could not be used in the further development of the prediction algorithms.

Figure 8. Dataset with Polarity and subjectivity
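A minimal TextBlob sketch of this sentiment step, again operating on the hypothetical `balanced` DataFrame, could look like this; note that TextBlob's sentiment analysis works on English text, which matches the English-only training data.

```python
from textblob import TextBlob
from scipy.stats import pointbiserialr

# Polarity in [-1, 1] and subjectivity in [0, 1] for every comment.
sentiments = balanced["comment_text"].apply(lambda t: TextBlob(t).sentiment)
balanced["polarity"] = sentiments.apply(lambda s: s.polarity)
balanced["subjectivity"] = sentiments.apply(lambda s: s.subjectivity)

# Correlations with the toxic label: clearly negative for polarity,
# close to zero for subjectivity in our experiments.
print(pointbiserialr(balanced["toxic"], balanced["polarity"]))
print(pointbiserialr(balanced["toxic"], balanced["subjectivity"]))
```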

COMMENT CLASSIFICATION
After extracting all the wanted features, our next move was to classify the comments. The classifier we chose for this task is the Random Forest Classifier [9]. Random Forests came out among the best classifiers in the JMLR study "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?" [7], which is the reason we chose it.

First, we dropped the columns which are not important for the prediction itself: id, comment_text, toxic and lang. We separated the features (all the calculated features) and the labels (toxic), so that we have the X and y for the classifier.

Figure 9. View of the dataset

After splitting the data into train and test sets and trying to fit a model, we ran into a problem. One of the training datasets we used has the probability of toxicity as the value of the toxic column. That was producing an error, since the model cannot be trained and tested on labels (classes) which are decimal numbers. Our solution was to round the toxicity probabilities in the training dataset: all values below 0.5 were rounded down, and all others were rounded up. With this changed dataset we repeated the process and trained the model.

After training the model, we used the predict_proba method, which predicts the probability of each class. The output is an array of arrays, each with two values: one is the probability that the comment is toxic, the other the probability that it is not. From there, we only needed the toxic probabilities, so we selected just those.

To check how good our classifier is, we computed the score using the sklearn.metrics.roc_auc_score method, which computes the area under the receiver operating characteristic curve (ROC AUC) from prediction scores. The score we got was 0.61, which is not bad, but certainly not the best. The score was probably affected by our rounding of the toxic values up and down, but we feel the model is decent.
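The classification pipeline described above can be sketched as follows, continuing from the hypothetical `balanced` DataFrame with the engineered feature columns; the train/test split and forest hyperparameters are illustrative defaults rather than the exact values we used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Keep only the engineered features; drop id, comment_text, lang and the label.
X = balanced.drop(columns=["id", "comment_text", "lang", "toxic"])
y = balanced["toxic"].round().astype(int)  # probabilities rounded to 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# predict_proba returns one column per class; keep the probability of class 1 (toxic).
toxic_proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, toxic_proba))
```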

RESULTS
With this approach we scored 220th place out of 1621 competing teams. This result sounds better when put into context: our score was 0.9459 and the best score was 0.9536, a difference of 0.0077, so less than 1 percent. In this competition there were a lot of contestants who simply copied the first shared notebook and submitted it as their final solution, so the competition is saturated with teams that have the same score, as can be seen on the leaderboard. In conclusion, with this approach we found out a lot about the dataset in the previously described ways and learned a lot about the NLP field. Its broad spectrum of different ways of approaching the problem is sometimes overwhelming, but really interesting to learn.

Alternative approach
There are a lot of things in our work that could be improved, starting with the previously mentioned problem of training the machine learning model on labels which are not integers but float numbers. Secondly, we could have improved the optimization of the word vectorization, where we should have tuned the list size for every language separately; in our approach we only had lists of 100 "bad words" used for classifying the toxic comments. Thirdly, we could have created more training data by using translation libraries, but there was not enough time to implement this feature well. Finally, we could have tested more libraries for calculating sentiment (not only TextBlob), and we could have tried already premade lists of "bad words" and their iterations.

REFERENCES
[1] Analytics Vidhya. 2020. Get Started Kaggle Competitions. https://www.analyticsvidhya.com/blog/2020/06/get-started-kaggle-competitions/
[2] DeepAI. 2020. Feature Extraction. https://deepai.org/machine-learning-glossary-and-terms/feature-extraction
[3] Kaggle. 2018. Toxic Comment Classification Challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
[4] Kaggle. 2019. Jigsaw Unintended Bias in Toxicity Classification. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
[5] Kaggle. 2020. Jigsaw Multilingual Toxic Comment Classification. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
[6] Kawin Ethayarajh. 2019. Word Embedding Analogies: Understanding King - Man + Woman = Queen. https://kawine.github.io/blog/nlp/2019/06/21/word-analogies.html
[7] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf
[8] Perspective API. 2020. Perspective API. https://perspectiveapi.com/#/start
[9] Scikit-learn. 2020. RandomForestClassifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[10] Towards Data Science. 2019. Introducing TextBlob. https://towardsdatascience.com/having-fun-with-textblob-7e9eed783d3f
[11] Wikipedia. 2020. Kaggle. https://en.wikipedia.org/wiki/Kaggle