Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model
Alyssa Lees and Jeffrey Sorensen and Ian Kivlichan
Google Jigsaw
New York, NY
(alyssalees|sorenj|kivlichan)@google.com

Abstract

The Google Jigsaw team produced submissions for two of the EVALITA 2020 (Basile et al., 2020) shared tasks, based in part on the technology that powers the publicly available PerspectiveAPI comment evaluation service. We present a basic description of our submitted results and a review of the types of errors that our system made in these shared tasks.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The HaSpeeDe2 shared task consists of Italian social media posts that have been labeled for hate speech and stereotypes. As Jigsaw's participation was limited to the A and B tasks, we will be limiting our analysis to that portion. The full details of the dataset are available in the task guidelines (Bosco et al., 2020).

The AMI task includes both raw (natural Twitter) and synthetic (template-generated) datasets. The raw data consists of Italian tweets manually labelled and balanced according to misogyny and aggressiveness labels, while the synthetic data is labelled only for misogyny and is intended to measure the presence of unintended bias (Elisabetta Fersini, 2020).

2 Background

Jigsaw, a team within Google, develops the PerspectiveAPI machine learning comment scoring system, which is used by numerous social media companies and publishers. Our system is based on distillation and uses a convolutional neural-network to score individual comments according to several attributes using supervised training data labeled by crowd workers. Note that PerspectiveAPI actually hosts a number of different models that each score different attributes. The underlying technology and performance of these models has evolved over time.

While Jigsaw has hosted three separate Kaggle competitions relevant to this competition (Jigsaw, 2018; Jigsaw, 2019; Jigsaw, 2020), we have not traditionally participated in academic evaluations.
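For context, attribute scores of this kind can be obtained from the public PerspectiveAPI. The snippet below is a minimal sketch rather than the authors' code: the endpoint, request shape, and attribute names follow the public API documentation as we understand it, and the API key and the `score_comment` helper are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): scoring one comment with the public
# PerspectiveAPI. Endpoint and attribute names follow the public documentation;
# the API key and helper name are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def score_comment(text, lang="it"):
    """Request attribute scores for a single comment."""
    payload = {
        "comment": {"text": text},
        "languages": [lang],
        "requestedAttributes": {
            "TOXICITY": {}, "SEVERE_TOXICITY": {}, "IDENTITY_ATTACK": {},
            "INSULT": {}, "PROFANITY": {}, "THREAT": {},
        },
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=payload)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    # Each attribute returns a summary score in [0, 1].
    return {name: attr["summaryScore"]["value"] for name, attr in scores.items()}
```

Each attribute score is a probability-like value in [0, 1]; any thresholding into binary decisions is left to the caller.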
3 Related Work

The models we build are based on the popular BERT architecture (Devlin et al., 2019) with different pre-training and fine-tuning approaches. In part, our submissions explore the importance of pre-training (Gururangan et al., 2020) in the context of toxicity and the various competition attributes. A core question is to what extent these domains overlap. Jigsaw's customized models (used for the second HaSpeeDe2 submission, and both AMI submissions) are pretrained on a set of one billion user-generated comments: this imparts statistical information to the model about comments and conversations online. This model is further fine-tuned on various toxicity attributes (toxicity, severe toxicity, profanity, insults, identity attacks, and threats), but it is unclear how well these should align with the competition attributes. The descriptions of these attributes and how they were collected from crowd workers can be found in the data descriptions for the Jigsaw Unintended Bias in Toxicity Classification (Jigsaw, 2019) website.

A second question studied in prior work is to what extent training generalizes across languages (Pires et al., 2019; Wu and Dredze, 2019; Pamungkas et al., 2020). The majority of our training data is English comment data from a variety of sources, while this competition is based on Italian Twitter data. Though multilingual transfer has been studied in general contexts, less is known about the specific cases of toxicity, hate speech, misogyny, and harassment. This was one of the focuses of Jigsaw's recent Kaggle competition (Jigsaw, 2020); i.e., what forms of toxicity are shared across languages (and hence can be learned by multilingual models) and what forms are different.

4 Submission Details

As Jigsaw has already developed toxicity models for the Italian language, we initially hoped that these would provide a preliminary baseline for the competition despite the independent nature of the development of the annotation guidelines. Our Italian models score comments for toxicity as well as five additional distinct toxicity attributes: severe toxicity, profanity, threats, insults, and identity attacks. We might expect some of these attributes to correlate with the HaSpeeDe2 and AMI attributes, though it is not immediately clear whether any of these correlations should be particularly strong.

The current Jigsaw PerspectiveAPI models are typically trained via distillation from a multilingual teacher model (that is too large to practically serve in production) to a smaller CNN. Using this large teacher model, we initially compared the EVALITA hate speech and stereotype annotations against the teacher model's scores for different attributes. The results are shown in Figure 1 for the training data. Perspective is a reasonable detector for the hate speech attribute, but performs less well for the stereotype attribute, with the identity attack model performing the best.

Using these same models on the AMI task to detect misogyny, shown in Figure 2, proved even more challenging. In this case, the aggressiveness attribute was evaluated only on the subset of the training data labeled misogynous. Here, the most popular attribute, "toxicity", is actually counter-indicative of the misogyny label. The best detector for both of these attributes appears to be the "threat" model.

As can be seen, the existing classifiers are all poor predictors of both attributes for this shared task. Due to errors in our initial analysis, we did not end up using any of the models used for PerspectiveAPI in our final submissions.

Figure 1: ROC curves for the PerspectiveAPI multilingual teacher model attributes compared to the HaSpeeDe2 attributes (hate speech and stereotype). Legend values, hate speech panel: identity_attack 83.6%, severe_toxicity 82.6%, toxicity 82%, insult 80.8%, profanity 79%, threat 68.1%; stereotype panel: identity_attack 71.2%, severe_toxicity 70.9%, insult 70.5%, toxicity 70.4%, profanity 69.9%, threat 63.6%.

Figure 2: ROC curves for PerspectiveAPI multilingual teacher model attributes compared to the AMI attributes (misogyny and aggressiveness). Legend values, misogynous panel: threat 61.5%, identity_attack 56.5%, insult 45.2%, severe_toxicity 44.4%, toxicity 39.8%, profanity 28.8%; aggressiveness panel: threat 66%, identity_attack 61.8%, severe_toxicity 53.9%, insult 53.3%, toxicity 48.4%, profanity 37%.
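The comparison summarized in Figures 1 and 2 amounts to treating each attribute score as a ranking of the task's positive class. A minimal sketch of that computation follows; it is not the authors' code, and it assumes the teacher-model scores and binary task labels have already been collected into arrays with illustrative names.

```python
# Sketch: compare each attribute's scores against a binary task label via ROC/AUC.
# `attribute_scores` and `labels` are assumed, illustratively named inputs.
from sklearn.metrics import roc_curve, roc_auc_score

def compare_attributes(attribute_scores, labels):
    """attribute_scores: dict mapping attribute name -> array of scores per comment.
    labels: binary array with the task annotation (e.g. hate speech) per comment."""
    results = {}
    for name, scores in attribute_scores.items():
        fpr, tpr, _ = roc_curve(labels, scores)
        results[name] = {"auc": roc_auc_score(labels, scores), "fpr": fpr, "tpr": tpr}
    # List attributes from best to worst separator of the task label.
    for name, res in sorted(results.items(), key=lambda kv: -kv[1]["auc"]):
        print(f"{name}: AUC = {res['auc']:.3f}")
    return results
```

An AUC near 0.5 indicates no useful signal, and an AUC below 0.5 indicates a counter-indicative attribute, which is how the toxicity attribute behaves with respect to the misogyny label above.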
Category  Submission  Hate speech  Stereotype
news      1           0.68         0.64
news      2           0.64         0.68
tweets    1           0.72         0.67
tweets    2           0.77         0.74

Table 1: Macro-averaged F1 scores for Jigsaw's HaSpeeDe2 submissions.

4.1 HaSpeeDe2

The Jigsaw team submitted two separate submissions that were independently trained for Tasks A and B.

4.1.1 First Submission

Our first submission, one that did not perform very well, was based on a simple multilingual BERT model fine-tuned on 10 random splits of the training data. For each split, 10% of the data was held out to choose an appropriate equal-error-rate threshold for the resulting model.

The BERT fine-tuning system used the 12-layer model (Tensorflow Hub, 2020), a batch size of 64, and a sequence length of 128. A single dense layer connects to the two output sigmoids, which are trained with a binary cross-entropy loss using stochastic gradient descent and early stopping based on the AUC metric computed on the 10% held-out slice. The model is implemented using Keras (Chollet and others, 2015).

To create the final submission, the decisions of the ten separate classifiers were combined in a majority voting scheme (if 5 or more models produced a positive detection, the attribute was assigned true).

4.1.2 Second Submission

Our second submission was based on a similar approach of fine-tuning a BERT-based model, but one based on a more closely matched training set. The underlying technology we used is the same as the Google Cloud AutoML for natural language processing product that had been employed in similar labeling applications (Bisong, 2019).

The remaining models built for this competition and in the subsequent section are based on a customized BERT 768-dimension 12-layer model pretrained on 1B user-generated comments using MLM for 125 steps. This model was then fine-tuned on supervised comments in multiple languages for six attributes: toxicity, severe toxicity, obscene, threat, insult, and identity hate.

Figure 3: ROC plots for HaSpeeDe2 test set labels. Legend values: Tweets Hatespeech 85.5%, Tweets Stereotype 82.6%, News Stereotype 77.3%, News Hatespeech 75.7%.

[…]point and custom wordpiece vocabulary from Section 4.1.2. However, a larger batch size of 128 was specified. All models were fine-tuned simultaneously on misogynous and aggressive labels using the provided data, where zero aggressiveness weights were assigned to data points with no misogynous labels.

Both submissions were based on ensembles of partitioned models evaluated on a 10% held-out
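To make the fine-tuning setup above concrete, the sketch below combines the two-sigmoid Keras model described in Section 4.1.1 with the per-example masking of the aggressiveness loss described for the AMI labels. It is a sketch under assumptions rather than the authors' implementation: the TF Hub handles are publicly available multilingual BERT models chosen for illustration (not necessarily the exact checkpoint used), and the learning rate and patience values are placeholders.

```python
# Sketch only: BERT fine-tuning with two sigmoid outputs and a masked loss.
# TF Hub handles, learning rate, and patience are illustrative assumptions.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the BERT preprocessor)

PREPROCESS = "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3"
ENCODER = "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4"

def masked_bce(y_true, y_pred):
    """Binary cross-entropy over two labels, zeroing the second (aggressiveness)
    term for examples whose first label (misogynous) is 0, as in the AMI setup."""
    elementwise = tf.keras.backend.binary_crossentropy(y_true, y_pred)
    weights = tf.stack([tf.ones_like(y_true[:, 0]), y_true[:, 0]], axis=1)
    return tf.reduce_mean(elementwise * weights, axis=-1)

def build_model():
    text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    encoder_inputs = hub.KerasLayer(PREPROCESS)(text_in)  # sequence length 128 by default
    pooled = hub.KerasLayer(ENCODER, trainable=True)(encoder_inputs)["pooled_output"]
    # A single dense layer connecting to the two output sigmoids.
    outputs = tf.keras.layers.Dense(2, activation="sigmoid")(pooled)
    model = tf.keras.Model(text_in, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=2e-5),  # assumed value
        loss=masked_bce,
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# Early stopping on the held-out AUC, as described in Section 4.1.1:
# model.fit(train_texts, train_labels, validation_data=(val_texts, val_labels),
#           batch_size=64, epochs=10,
#           callbacks=[tf.keras.callbacks.EarlyStopping(
#               monitor="val_auc", mode="max", patience=2, restore_best_weights=True)])
```

For the HaSpeeDe2 first submission the same structure applies without the masking term, with the two sigmoids predicting the hate speech and stereotype labels instead.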