An Expert Annotated Dataset for the Detection of Online Misogyny Ella Guest Bertie Vidgen Alexandros Mittos The Alan Turing Institute The Alan Turing Institute Queen Mary University of London University of Manchester
[email protected] University College London
[email protected] [email protected] Nishanth Sastry Gareth Tyson Helen Margetts University of Surrey Queen Mary University of London The Alan Turing Institute King’s College London The Alan Turing Institute Oxford Internet Institute The Alan Turing Institute
[email protected] [email protected] [email protected] Abstract spite social scientific studies that show online miso- gyny is pervasive on some Reddit communities, to Online misogyny is a pernicious social prob- date a training dataset for misogyny has not been lem that risks making online platforms toxic created with Reddit data. In this paper we seek and unwelcoming to women. We present a new hierarchical taxonomy for online miso- to address the limitations of previous research by gyny, as well as an expert labelled dataset to presenting a dataset of Reddit content with expert enable automatic classification of misogynistic labels for misogyny that can be used to develop content. The dataset consists of 6,567 labels more accurate and nuanced classification models. for Reddit posts and comments. As previous Our contributions are four-fold. First, we de- research has found untrained crowdsourced an- velop a detailed hierarchical taxonomy based on notators struggle with identifying misogyny, existing literature on online misogyny. Second, we hired and trained annotators and provided them with robust annotation guidelines. We we create and share a detailed codebook used to report baseline classification performance on train annotators to identify different types of miso- the binary classification task, achieving accu- gyny.