Examining the Effects of Preprocessing on the Detection of Offensive Language in German Tweets

Examining the Effects of Preprocessing on the Detection of Offensive Language in German Tweets Sebastian Reimann Daniel Dakota Uppsala University Uppsala University Department of Linguistics Department of Linguistics [email protected] [email protected] Abstract representations to a classifier (e.g. a sentence embedding), that embedding still represents the textual Preprocessing is essential for creating more ef- representation, simply in an alternative way. Thus, fective features and reducing noise in classifi- it too is influenced by individual alterations to the cation, especially in user-generated data (e.g. text. With languages that show large variation in di- Twitter). How each individual preprocessing decision changes an individual classifier’s be- alect preferences and orthographic representations, havior is not universal. We perform a series this has been shown to be particularly important of ablation experiments in which we examine (Husain, 2020). how classifiers behave based on individual pre- Twitter has proven to be a typical source for not processing steps when detecting offensive lan- only research on offensive language, but also neces- guage in German. While preprocessing deci- sitating additional preprocessing approaches given sions for traditional classifier approaches are its different style of communication and lexicon. not as varied, we note that pre-trained BERT models are far more sensitive to each decision In this work we look to perform a set of ablation and do not behave identically to each other. experiments in which we evaluate how different We find that the cause of much variation be- preprocessing techniques impact classifier behav- tween classifiers has to do with the interactions ior over three different approaches to classification specific preprocessing steps have on the over- when detecting offensive language in German Twit- all vocabulary distributions, and, in the case ter. We seek to answer the following questions: of BERT models, how this interacts with the WordPiece tokenization. 1. How do different preprocessing techniques 1 Introduction influence performance across different classifiers? The task of abusive language detection has be- come increasingly popular for a variety of lan- 2. Can we identify features within different pre- guages (Zampieri et al., 2019; Basile et al., 2019; proccesing techniques that can help explain Al-Khalifa et al., 2020). German specifically has specific classifier behaviors? had two shared tasks on the topic, one in 2018 (Wiegand et al., 2018) and a second in 2019 (Struß 2 Related Work et al., 2019). 2.1 Data Not only is offensive language detection some- what subjective in nature, particularly in the need Ross et al.(2016) introduced a Twitter corpus of for contextual requirements, but is often examined offensive language detection, examining the 2015 through user generated mediums, creating another refugee crisis. They predominantly focused on layer of complexity to successfully identify possi- user’s perceptions of hate speech and the reliability ble abusive language. Often, in order to create more of annotations. They found that agreement among useful features out of the text for the classifier, we annotators was relatively low and that also the opin- must first treat the text to reduce the noise. While ions of users asked in a survey diverge greatly for standard feature generation via count vectors and thus stress the necessity of specific guidelines. the impact is far more obvious (e.g. reduction of However, even with such guidelines, annotators can feature space), even when we feed a dense vector still show large differences (Nobata et al., 2016). Task 2 of the GermEval 2018 shared task (Wie- processing methods of replacement of emoticons gand et al., 2018) focused on detecting offensive with a text representation, replacing negation con- and non-offensive Tweets and was further exam- tractions such as don’t with do not, detection of ined in GermEval 2019 (Struß et al., 2019). spelling errors, stemming, and removal of stop- A different approach was taken by Zufall et al. words for general sentiment analysis on Twitter (2019) who instead label offensive Tweets based on data. Using a Naive Bayes classifier to classify whether they may be punishable by law or not. This whether the sentiment was positive, neutral or neg- decision is based on two criteria: the type of target ative, most techniques yielded slight improvements and the type of offense. A Tweet may be punishable over the baseline with little preprocessing. if it is targeted at either a living individual or a While Risch et al.(2019) had a minimalistic ap- specific group of people, and if it expresses either proach to preprocessing and only normalized user a wrong factual claim, abusive insults, or abusive names, Paraschiv and Cercel(2019), whose contri- criticism. bution performed best in the GermEval 2019 shared 2.2 Classifiers task, made use of a wide range of preprocessing methods when fine-tuning BERT. They replaced Abusive language detection in German has shown a emojis with spelled-out representations; removed great deal of variation across classifiers and feature the #-character at the beginning of hashtags and thresholds (Steimel et al., 2019). In the 2018 shared split hashtags into words; transformed usernames, tasks, SVMs were a popular choice (Wiegand et al., weblinks, newline markers, numbers, dates and 2018), achieving effective results. Popular features timestamps to standard tokens; and manually cor- include pre-trained word embeddings, mostly ei- rected spelling errors. They however do not explic- ther fastText (Bojanowski et al., 2017) or word2vec itly state how much this contributed to achieving a (Mikolov et al., 2013), and lexical features based higher performance. on polarity lexicons or lexicons on offensive language and effective results were achieved with only Schmid et al.(2019) lowercased and lemmatized a few hundred features (De Smedt and Jaki, 2018). words, while also removing the #-character of the Other classifiers included standard Decision Trees hashtag and stop words when creating features for or Boosted Classifiers, but tended to yield slightly their SVM. Sentiment scores were also obtained worse performance (Scheffler et al., 2018). for emojis through the sentiment ranking for emo- The most effective approaches tended to use en- jis by Kralj Novak et al.(2015) and added to the semble classifiers: CNNs with logit averaging (von sentiment scores obtained trough SentiWS (Remus Grunigen¨ et al., 2018), a combination of RNNs and et al., 2010) for all words in the sentence. Both CNNs (Stammbach et al., 2018), or combination of scores were treated as separate features and ranged Random Forest classifiers (Montani and Schuller¨ , range from -1 to 1. Scheffler et al.(2018) also 2018). lemmatized and removed stop words, but did not With the introduction of BERT (Devlin et al., explicitly state their treatment of hashtags and cap- 2019), the 2019 shared task saw a different trend, italization for their experiments involving SVMs, with many participants submitting fine-tuned mod- decision tree, and boosted classifiers. Moreover, els (Struß et al., 2019). Paraschiv and Cercel(2019) they did not include emojis when modeling a senti- pre-trained a BERT model on German Twitter data, ment score as one of their features. obtaining the best reported macro F-score of 76.95. Other approaches included fine-tuning an ensem- ble of BERT models trained on different German 3 Methodology textual sources (Risch et al., 2019). SVMs continued to be a popular choice however, 3.1 Data with some systems achieving results almost equal to BERT-based approaches by using word embed- For all experiments, we use the the dataset from the dings pre-trained on German Tweets and lexical GermEval 2019 Task 2 (Struß et al., 2019). Tweets features (Schmid et al., 2019). were sampled from a range of political spectrums and labeled as either OFFENSE or OTHER for 2.3 Preprocessing the binary classification task (see Table1 for data Angiani et al.(2016) experimented with the pre- splits). OFFENSE OTHER Total 2003). Train 1287 2709 3996 Truecasing the test and training data is per- Test 970 2061 3031 formed by using the truecasing scripts from the Moses system (Koehn et al., 2007), which are nor- Table 1: Train and Test Data Splits mally used for statistical machine translation. We create a truecasing model by training on a large, 3.2 Preprocessing cleaned, preprocessed German Wikipedia Text Cor- pus.3 We use the SoMaJo tokenizer for German Base Methods Lemmatization is a relatively social media data (Proisl and Uhrig, 2016) to tok- common preprocessing step applied in the shared enize the Twitter data. task of Wiegand et al.(2018), examples include Scheffler et al.(2018) and Schmid et al.(2019), 3.3 Classifiers on which we base our experimental setup for our All hyperparameter optimization is performed us- SVM and AdaBoost classifiers. Consequently, we ing a 5-fold cross validation and results for all lemmatize all words1 for our AdaBoost and SVM experiments are reported using macro-averaged F experiments. A second base step, carried out in scores since the dataset is imbalanced and we wish all experiments, including those when fine-tuning to give equal weight to both the minority and ma- BERT for classification, is replacing user names jority classes. with the token USER. SVM The features for the SVM (Boser et al., Emojis We try the approaches in the contribu- 1992) are similar to the ones used in the second tions

Load more