On the fairness of crowdsourced training data and models for the prediction of subjective properties. The case of sentence toxicity.

To be or not to be #$&%*! toxic? To be or not to be fair?

Agathe Balayn

On the fairness of crowdsourced training data and Machine Learning models for the prediction of subjective properties. The case of sentence toxicity. To be or not to be #$&%*! toxic? To be or not to be fair?

by

Agathe Balayn

to obtain the degree of Master of Science at the Delft University of Technology, to be defended publicly on Thursday September 27, 2018 at 11:00 AM.

Student number: 4620844
Project duration: November 1, 2017 – September 27, 2018

Thesis committee:

Chair: Prof. dr. G.-J. Houben, Faculty EEMCS, TU Delft
University supervisor: Dr. A. Bozzon, Faculty EEMCS, TU Delft
Committee member: Dr. G. Gousios, Faculty EEMCS, TU Delft
Company supervisor: B. Timmermans, Center for Advanced Studies, IBM Benelux

An electronic version of this thesis is available at http://repository.tudelft.nl/.

Preface

Training machine learning (ML) models for natural language processing usually requires large amounts of data, which are often acquired through crowdsourcing. In crowdsourcing, crowd workers annotate data samples according to one or more properties, such as the sentiment of a sentence, the violence of a video segment, or the aesthetics of an image. To ensure the quality of the annotations, several workers annotate the same sample, and their annotations are combined into one unique label using aggregation techniques such as majority voting. When the property to be annotated is subjective, the workers' annotations for the same sample might differ, yet all be valid. The way the annotations are aggregated can affect the fairness of the outputs of the trained model: only accounting for the majority vote ignores the workers' opinions which differ from the majority, and is consequently discriminative towards certain workers. Also, ML models are not always designed to account for individual opinions, for simplicity's or performance's sake. Finally, to the best of our knowledge, no method exists to assess the fairness of an ML algorithm predicting a subjective property.

In this thesis we address these limitations by seeking an answer to the following research question: how can targeted crowdsourcing be used to increase the fairness of ML algorithms trained for subjective properties' prediction? We investigate how annotation aggregation via majority voting creates a dataset bias towards the majority opinion, and how this dataset bias, in combination with the current limits of ML models, leads to an algorithmic bias of the ML models trained with this dataset and to unfairness in the models' outputs. We assume that an ML model able to return each annotation of each user is a fair model. We propose a new method to evaluate the fairness of ML models, and a methodology to highlight and mitigate potential unfairness based on the creation of adapted training datasets and ML models. Although our work is applicable to any kind of label aggregation for any data subject to multiple interpretations, we focus on the effects of the bias introduced by majority voting for the task of predicting sentence toxicity.

Our results show that the fairness evaluation method we create makes it possible to identify unfair algorithms and compare algorithmic fairness, and that the final fairness metric is usable in the training process of ML models. The experiments on the models point out that we can mitigate the biases resulting from majority voting and increase the fairness towards the minority opinions. This is provided that the workers' individual information and each of their annotations are taken into account when training adapted models, rather than relying only on the aggregated annotations, and that the dataset is resampled on criteria matching the favoured aspect of fairness. We also highlight that more work is needed to develop crowdsourcing methods that collect high-quality annotations of subjective properties, possibly at low cost.

Agathe Balayn
Delft, September 2018


Acknowledgement

I would like to acknowledge every person and institution with whom I spent time during this year. Without the support of many people, and without the Delft University of Technology and the Center for Advanced Studies of IBM Benelux, the completion of this thesis would not have been possible. In particular, I would like to thank my supervisors, Alessandro Bozzon from the university and Benjamin Timmermans from the company, for their frequent feedback. They always listened to my explanations about my work and answered questions about my research. They also read my long thesis report and gave valuable comments about its structure and the scientific writing. I would also like to thank the other people with whom I had detailed discussions about my work, mainly Zoltán Szlávik, who took the time to listen to my questions about the fairness metric, Panagiotis Mavridis, who brought a lot of insights into the crowdsourcing work and into the writing of the CrowdBias paper, Lora Aroyo, who shared her point of view on the use of her CrowdTruth framework, and Pavel Kucherbaev for the initial supervision of the project. I also wish to acknowledge the other members of the committee, Geert-Jan Houben and Georgios Gousios, without whom I would not be able to defend my thesis.

Furthermore, working at the CAS of IBM Benelux was undoubtedly a great opportunity to interact with more people over the course of the thesis project. On this note, I wish to acknowledge the other interns and my friends from university who made the thesis time fun and motivating all along the way. Finally, I must express my profound gratitude to my parents and siblings for their support and continuous encouragement throughout the thesis year at TU Delft. This accomplishment would not have been possible without them.

Thank you.

Agathe Balayn
Delft, the Netherlands, September 2018


Contents

List of Figures 1
List of Tables 3
1 Introduction 5
1.1 Problem studied in the thesis: Machine Learning, Crowdsourcing, need of data, biases and fairness 6
1.1.1 Limitations of current systems for the prediction of subjective properties 7
1.1.2 Consequences from the combination of Machine Learning and Crowdsourcing 7
1.2 Main research question, challenges, hypothesis and use-case 8
1.3 Research questions of the research project 9
1.4 Thesis contributions 10
1.5 Thesis outline 11
2 Literature review 13
2.1 Definitions of toxicity-related speeches in the psychology literature 13
2.1.1 Term definition 13
2.1.2 Methodology to search for the papers 14
2.1.3 The different targets of hate speech 14
2.1.4 The different variables influencing how toxicity-related speeches are judged 14
2.1.5 Methodologies of the different studies 16
2.1.6 Discussion 17
2.2 Computational methods to detect toxic speech 18
2.2.1 Methodology to search for the papers 18
2.2.2 Techniques used for toxicity detection 18
2.2.3 Dataset gathering methods 22
2.2.4 Discussion 24
2.3 Machine Learning and Deep Learning for subjective tasks 27
2.3.1 Methodology to search for the papers 27
2.3.2 Machine Learning and subjective tasks 27
2.3.3 Deep Learning models integrating subjectivity into their predictions 27
2.3.4 Discussion 28
2.4 Definition and evaluation method of algorithmic fairness 29
2.4.1 Methodology to search for the papers 29
2.4.2 Definitions of algorithmic fairness 29
2.4.3 Metrics to characterize and evaluate algorithmic fairness 31
2.4.4 Methods to mitigate algorithmic unfairness 31
2.4.5 Discussion 32
2.5 Crowdsourcing and subjectivity 33
2.5.1 Methodology to search for the papers 33
2.5.2 Crowdsourcing methodology for corpus annotations 33
2.5.3 Identification of variables influencing subjective annotations 33
2.5.4 Techniques to ensure crowdsourcing quality 33
2.5.5 Metrics to evaluate crowdsourcing quality 35
2.5.6 Discussion 36
2.6 Summary 36


3 Dataset to study sentence toxicity as a subjective property 37
3.1 Introduction 37
3.2 Toxicity in psychology and dataset choice 38
3.2.1 Sentence toxicity as a subjective property in the psychology literature 38
3.2.2 Subjectivity in Computer Science datasets of sentence toxicity 38
3.3 Crowdsourcing treatment of dataset annotations of subjective properties 40
3.3.1 Design of the crowdsourcing task 40
3.3.2 Crowdsourcing treatment of the dataset: annotation cleaning, low-quality annotators and annotations removal 43
3.4 Highlight of the presence of different subjectivities in a Computer Science toxicity dataset 47
3.4.1 Experimental set-up 47
3.4.2 Results 47
3.4.3 Discussion 48
3.4.4 Conclusions 49
3.5 Clustering data for crowdsourcing cost-reduction 50
3.5.1 Experimental set-up 50
3.5.2 Results and discussion about the clustering algorithm 51
3.5.3 Conclusions 52
3.6 Summary 52
4 Method to evaluate algorithmic fairness for subjective properties classification 53
4.1 Introduction 53
4.2 Definition of algorithmic fairness, necessity to propose new metrics 53
4.2.1 Definition of algorithmic fairness for classification of subjective properties 54
4.2.2 On the necessity of creating an adapted fairness evaluation method 55
4.3 Characterization of algorithmic fairness 56
4.3.1 Formulation of fairness-characterization hypotheses 56
4.3.2 Experimental set-up 58
4.3.3 Results and discussions 58
4.3.4 Conclusions 59
4.4 Metrics for fairness evaluation 60
4.4.1 Formulation of hypotheses 60
4.4.2 Experimental set-up, results and discussions for the experiments on the variables of the metrics 62
4.4.3 Experimental set-up, results and discussion on the metrics' significance 64
4.4.4 Analysis of the extreme cases 64
4.4.5 Conclusion 64
4.5 Summary 65
5 Increasing the fairness of algorithms for subjective properties classification 67
5.1 Introduction 67
5.2 Formulation of hypotheses 67
5.2.1 Current methods for toxicity prediction 67
5.2.2 Design of Machine Learning architectures for fairer models 68
5.2.3 Design of training processes for fairer models 69
5.2.4 Adaptation of the training dataset for fairer models 69
5.3 Experimental set-up 69
5.3.1 Evaluation of the models 70
5.3.2 The training process of the models (H3) 71
5.3.3 The datasets to train the models on (H4) 71
5.3.4 The Machine Learning models to compare (H1, H2) 71
5.4 Results and discussion 72
5.4.1 Training process (H3) 72
5.4.2 Dataset resamplings (H4) 73
5.4.3 Machine Learning models (H1, H2) 75
5.5 Conclusions, summary 77

6 Conclusion 79
6.1 Discussion of current work 79
6.1.1 Focus of the work 79
6.1.2 Interdependence of the three entities in the system 79
6.1.3 Computation and validity of the fairness metric 80
6.2 Conclusions 80
6.3 Proposition of future work 81
6.3.1 Application to different use-cases 81
6.3.2 Creation of a benchmark dataset 81
6.3.3 Adaptation of the Machine Learning algorithms 82
6.3.4 Introduction of an Active Learning process 82
A Analysis of the Jigsaw dataset 83
A.1 Example sentences contained in the Jigsaw dataset 83
A.1.1 Examples of sentences in the dataset ordered by toxicity annotator agreement 83
A.1.2 Examples of sentences in the dataset ordered by Unit Quality Score 84
A.2 General analysis of the Jigsaw dataset 85
B Algorithmic fairness evaluation method 89
B.1 Expected behaviours of the different models 89
B.2 Characterization of algorithmic fairness 90
B.2.1 First batch of experiments 90
B.2.2 Second batch of experiments 91
B.3 Investigation of the possible fairness metrics 95
B.4 Investigation of the significance of the performance of the clusters 96
B.5 Application of the algorithmic fairness metrics to the models 96
C Experimentations on the Machine Learning and Deep Learning algorithms 101
C.1 Evaluation of the baselines 101
C.1.1 Machine Learning 101
C.1.2 Evaluation of other classifiers 102
C.2 Hypotheses to mitigate unfairness 102
C.2.1 Additional hypotheses for future work 102
C.2.2 Summary of the hypotheses and experiments 103
C.3 Determination of the grid search performance metric 104
C.4 Method to resample the dataset 107
C.5 Examples of correct and incorrect classifications 108
D Ethical debate related to the computational automation of hate speech detection frameworks 111
D.1 Introduction 111
D.2 Description of the technology 112
D.3 On the ethics of the system pipeline 112
D.3.1 On the possible breach of privacy 112
D.3.2 On the justifiability of sentence hatefulness classification 113
D.4 On the justification of the technology applications 113
D.4.1 On the usefulness of the technology 113
D.4.2 On the criticisms of the technology applications 114
D.5 Possible regulations to make the use of the system more controlled 115
D.6 Conclusion 115
E Scientific publication 117
Bibliography 123

List of Figures

1.1 The three limitations related to the fairness of current Machine Learning systems for the prediction of subjective properties 6
1.2 Usual Machine Learning pipeline combined with crowdsourcing 8
1.3 Overview of the thesis project 11

2.1 Summary of the variables influencing the assessment of sentence toxicity...... 15

3.1 The question asked to the crowdworkers for the dataset creation 40
3.2 Set-up 1. Results of the CrowdTruth metrics with binary labels: (-2, -1) = toxic, (0, 1, 2) = non-toxic 45
3.3 Set-up 2. Results of the CrowdTruth metrics with binary labels: (-2, -1, 0) = toxic, (1, 2) = non-toxic 45
3.4 Set-up 3. Results of the CrowdTruth metrics with three labels 45
3.5 Set-up 4. Results of the CrowdTruth metrics with labels between (-2;2) 45
3.6 Worker Average Disagreement Rate with the majority for binary labels 48
3.7 Worker Average Disagreement Rate with the majority for the full labels 48
3.8 Average Average Disagreement Rate and number of annotators removed as a function of the minimum Worker Quality Score (WQS) allowed in the dataset 49
3.9 Search of the optimal K for the K-means algorithm 50
3.10 Search of the optimal K for the K-means algorithm 51
3.11 Results of the clustering algorithms with the TFIDF features 51

4.1 Distribution of the performance of the predictions per user 59
4.2 Visualization of the accuracy on class 0 and class 1 based on the ADR clustering criteria 60
4.3 Experimentations on the popularity-based fairness computed with different evaluation metrics 66

5.1 Presentation of the constitution of one fold for each dataset resampling. Each resampled dataset contains the same number of annotations 72
5.2 Training of the Logistic Regression with continuous user model, on the balanced dataset along the annotation popularity percentage, evaluated with different performance metrics 73
5.3 Comparison of the performance of the models trained on the different balanced resamplings 75
5.4 Comparison of the error rates of the predictions for the models without and with one-hot encoded user model, trained on the aggregated annotations and all the separate annotations 77

A.1 Background information of the workers and their distribution in the pool of workers 85
A.2 Multi-dimensional distribution of the crowd workers along their background information 86
A.3 Analysis of the annotation distribution 86
A.4 Distribution of the toxicity judgements 86
A.5 Analysis of the annotation agreement distribution 87
A.6 Distribution for each demographic category of the average disagreement with the majority voting 88

B.1 User-level binning: average disagreement with the majority-vote bins 92
B.2 Visualization based on the sentence-level performances 93
B.3 Visualization of the accuracy and F1-score based on the ADR clustering criteria 93
B.4 Fairness characterization, based on the demographic clustering criteria 94
B.5 Experimentations on the discrimination-based fairness 95
B.6 Histogram of the user's positive class accuracy of the lowest disagreement bin for model (1) 96
B.7 Comparison of the models' performance, based on both classes accuracies 98
B.8 Comparison of models' performance, with variation of the disagreement rate in the test set 99

C.1 Grid search over the hyperparameters of the Logistic Regression 105
C.2 Grid search with multiple combinations of the dispersion and general performance values 106


List of Tables

1.1 Example samples of a Machine Learning dataset for the task of predicting sentence toxicity. When a sentence has multiple valid annotations about its toxicity, aggregating them into one unique label results in ignoring certain opinions...... 7

2.1 The factors influencing hate speech perception and/or offensive speech perception, ordered in 3 categories: internal belief (people's individual characteristics), sentence characteristics (how the sentence in itself is constructed), and sentence context (the surroundings of the sentence) 16
2.2 List of the Machine Learning methods for undesirable speech classification 19
2.3 Comparison of the performances of the Machine Learning methods for undesirable speech classification 20
2.4 List of Deep Learning methods for undesirable speech classification 21
2.5 Comparison of Deep Learning methods for undesirable speech classification 21
2.6 Summary of available datasets 23
2.7 Dataset gathering in papers for undesirable speech detection using Machine Learning 25
2.8 Dataset gathering in papers for undesirable speech detection using Deep Learning 26

3.1 Example sentences from the Jigsaw dataset which lack context indications 41
3.2 Example sentences from the Jigsaw dataset which receive differing judgements because of a lack of precisions in the crowdsourcing task 42
3.3 Example sentences from the Jigsaw dataset, whose interpretation is mainly influenced by the annotators' individual characteristics 43

4.1 The different models on which to make the observations 56
4.2 Efficiency performance of the models on the ambiguity balanced dataset 56
4.3 Threshold values for the different clustering criteria 63

5.1 Significance tests between model 1 trained with accuracy-based training process, and model 2 trained with ADR-based training process 73
5.2 Significance tests between model 1 trained with aggregated annotations, and model 2 trained with individual annotations with the continuous user model 76
5.3 Significance tests between model 1 trained with aggregated annotations, and model 2 trained with individual annotations with the OH user model 76
5.4 Significance tests between model 1 trained with continuous user model, and model 2 trained with the OH user model 78

A.1 Example sentences ordered by agreement between annotators 83
A.2 Example sentences ordered by Unit Quality Score 84
A.3 Example sentences ordered by Unit Quality Score 84

B.1 T-value of the significance test between each ADR bin of configuration (1) on the positive class 97
B.2 p-value of the significance test between each ADR bin of configuration (1) on the positive class 97

C.1 Logistic Regression classifier performances 101
C.2 Summary of the hypotheses to make the models fairer 103
C.3 Theoretical list of experiments 103


1 Introduction

Machine Learning (ML) aims at predicting properties of new data by learning correlations between available data samples and their known properties. For example, certain ML algorithms are built to classify radiology images depending on whether they show a cancerous tumour, by learning the existing correlations from available images and their labels (existence or not of a cancerous tumour). This is a traditional classification task for Machine Learning, whose objective is well defined since it is clear that each image belongs to only one possible class (with or without tumour). More and more ML algorithms are now used to address tasks whose objectives are highly disputable because the predictions cannot be verified by a human [80], [96]. For example, an ML algorithm made to predict whether an individual convicted of a crime might be a repeat offender is disputable because the prediction concerns the future and is therefore not verifiable by a human. The purpose of these predictions is to classify human beings in order to decide how these persons will be treated, which can have a negative impact on their lives [5]. Thus, accurate predictions are important in order not to harm someone wrongly.

Most research papers are considered as progress in their field when they report high accuracies for their ML algorithms, whereas their outputs might be biased, unfair or discriminative towards certain categories of the population [52]. For example, the COMPAS system (Correctional Offender Management Profiling for Alternative Sanctions), made to predict a defendant's risk of re-offending, was proved to be discriminative because it labels Black people as potential reoffenders twice as often as White people, whereas this is not the case in reality 1 [114]. It is claimed that it is unethical not to analyse the outputs' errors, and that systematic error analysis would improve the understanding, transparency and accountability of the algorithms, instead of simply reporting accuracy performance [52]. Therefore, investigating the outputs of Machine Learning algorithms made for predicting disputable properties of samples, and their potential unfairness, is becoming an increasingly important task.

In this thesis project, we identified a subset of the disputable tasks on which to focus: the group of tasks interested in classifying subjective properties of samples, that is, properties for which there is no consensus between the judgements of different people. We made this choice because we consider it important to tackle these tasks properly, since we assume that low performance on these tasks translates into a high risk of harmful consequences.

ML models usually require a lot of data to be trained on, and the methods to obtain training data often involve crowdsourcing. "Crowdsourcing is the act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call" [61]. For example, crowdsourcing is used to answer queries unanswerable by computers (database systems or search engines) by having the crowd answer them (CrowdDB framework [48]), to assess visualization designs [56], to create labels for natural language tasks [97], and to perform user studies [68]. One of the main research questions for crowdsourcing is: how to ensure that the annotations provided by the annotators are correct? [8, 42] The quality of the annotations (often measured with the agreement rate) depends on three factors: the expertise of the annotators, the quality of the samples to annotate, and the way the task is presented.

1https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Annotators might not provide accurate annotations due to a lack of expertise or a lack of consideration for the task (spammers, or annotators who make mistakes out of inattention). The task might also not be described precisely enough, the samples to annotate might be ambiguous, or the property to annotate might be subjective, which leads the annotators to give different answers that they all consider valid. The annotation quality problem is addressed using methods directly implemented on the crowdsourcing platform and post-processing methods of crowdsourcing treatment. For example, it has been shown that, in order to create datasets for Machine Learning tasks using crowdsourcing, a large crowd of non-expert annotators enables the training of classifiers with a higher accuracy than using unique annotations from experts, provided that specific metrics are used to aggregate the annotations of multiple annotators into a unique label [41] and that the annotators' quality is accounted for [97]. Research on crowdsourcing for quality assessment experiments also shows results similar to laboratory experiments [67]. Several frameworks were created to realize subjective multimedia evaluations and Quality of Experience evaluations [26, 60, 83, 84, 88], with promising results in many domains such as video quality evaluation [47].

These methods all return one unique label per sample, which might not be representative of the annotated property in cases where several interpretations of the sample are valid (our selected case of the classification of subjective properties); this might lead to unfairness since certain annotators' judgements are ignored. The annotators might express different but all valid judgements about the property for several reasons: the task or the samples to annotate might be ambiguous, leading to different interpretations from different annotators; or the property to annotate might be subjective (for example the annotators might be asked to express their feelings, perception, opinion, or aesthetic judgement about a sample), and because the annotators have different backgrounds, they have diverging points of view on the property of the sample. Although the first two causes of diversity in judgements for the same sample might be eliminated by refining the crowdsourcing task, the third cause (subjectivity of the property to annotate and the related subjectivity of people) cannot, because it is intrinsic to the property of the data sample and to the workers. Consequently, it is important to investigate the field of crowdsourcing for the annotation of subjective properties.
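As an illustration of the aggregation step discussed above, the minimal sketch below contrasts plain majority voting with a quality-weighted variant in which each vote is scaled by an externally estimated worker quality score. The worker identifiers and quality values are hypothetical placeholders; frameworks such as CrowdTruth, used later in this thesis, compute much richer quality metrics (e.g. the Worker Quality Score).

from collections import Counter

def majority_vote(labels):
    """Aggregate one sample's annotations into a single label (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def quality_weighted_vote(annotations, worker_quality):
    """Aggregate (worker, label) pairs, weighting each vote by the worker's estimated quality."""
    scores = {}
    for worker, label in annotations:
        scores[label] = scores.get(label, 0.0) + worker_quality.get(worker, 1.0)
    return max(scores, key=scores.get)

# Hypothetical example: four workers annotate the toxicity of one sentence.
votes = [("w1", "toxic"), ("w2", "non-toxic"), ("w3", "non-toxic"), ("w4", "non-toxic")]
print(majority_vote([label for _, label in votes]))   # "non-toxic": the "toxic" opinion is discarded
print(quality_weighted_vote(votes, {"w1": 0.9, "w2": 0.4, "w3": 0.4, "w4": 0.3}))  # still "non-toxic"

Whatever the weighting, such aggregation returns one label per sample, which is exactly the information loss discussed in the next section.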

In this chapter, we identify the problem that we tackle in the thesis project and the challenges attached to it. We turn them into a set of research questions, and present the contributions that we choose to make while answering these questions.

1.1. Problem studied in the thesis: Machine Learning, Crowdsourcing, need of data, biases and fairness

In this section, we make explicit the three main limitations (summarized in Fig. 1.1) related to the fairness of ML systems which aim at predicting subjective properties of samples.

Figure 1.1: The three limitations related to the fairness of current Machine Learning systems for the prediction of subjective properties.

1.1.1. Limitations of current systems for the prediction of subjective properties

The prediction of subjective properties of samples is one of the disputable tasks that ML algorithms are applied to. In these applications, it is not always possible to consider that a unique label is enough to describe a sample correctly, because different people might express different but all valid judgements about the sample.

ML algorithms are not adapted to the task at stake because they are made to be trained on unique labels and consequently cannot deal with the subjectivity of the property, possibly making them unfair (limitation 1). These algorithms would benefit from being adapted to the use of multiple labels. For example, certain systems classify video segments into categories such as violent and non-violent [93], yet violence is perceived differently depending on the age, gender, etc. of the person watching a video, so the labels might differ depending on who judges the video. Similarly, the sentiment of sentences is subjective: sentences are interpreted differently depending on who reads them, and consequently sentiment labels might show diversity. Certain researchers are interested in evaluating the aesthetics of images [11], while humans are shown to feel differently about the same image, and thus the collected judgements about an image diverge. For all these tasks, using one unique label (violent/non-violent, positive/negative, aesthetic/not aesthetic) to describe the samples (video segment, sentence, image) would not take into account the subjectivity of the property to classify (violence, sentiment, aesthetics) and the subjectivity of people.

In all these example applications, it is implicitly assumed that current ML models ignore certain judgements because they only output one unique judgement, and thus they are considered unfair towards the people who did not emit these judgements. However, this specific notion of algorithmic unfairness is not studied in the literature: there is neither a definition of fairness for these models, nor an adapted evaluation method of algorithmic fairness. This lack of research and understanding about algorithmic fairness does not enable the systematic investigation of the fairness of ML systems (limitation 2).

In order to train these algorithms, crowdsourcing tasks are set up to collect annotations that are used as labels for the samples in the dataset, so that the complete dataset can be used in the training process. The crowdsourcing techniques which aggregate the annotations into unique labels to increase their quality are not valid for our specific domain of application: all the valid perceptions available for each sample should be taken into account, but this information is lost during the aggregation, which leads to unfairness towards the ignored perceptions (limitation 3). That is why it is important to investigate whether it is possible to collect high-quality annotations via crowdsourcing without using traditional aggregation methods to filter out low-quality annotations. For example, toxicity is a subjective property of sentences, and consequently, if several annotators are asked to annotate the toxicity of sentences, they might have different perceptions of the toxicity of some or all of the sentences. Examples of sentences and judgements about their toxicity are given in Table 1.1.
Aggregating the annotations into unique labels would mean considering only one perception as valid, which is unfair to the other perceptions.

sample | annotations | label

Is there perhaps enough newsworthy information to make an article about the Bundy family as a whole, that the various family members can be redirected to? Or does that violate a guideline I'm not aware of? | non-toxic (100%) | non-toxic

What shit u talk to me, communist rat? | toxic (100%) | toxic

"Please relate the ozone hole to increases in cancer, and provide figures. Otherwise, this article will be biased toward the environmentalist anti-CFC point of view instead of being neutral. Ed Poor" | toxic (20%), non-toxic (80%) | non-toxic?

The article is true, the Israeli policies are killing Arab children. | toxic (50%), non-toxic (50%) | ?

Table 1.1: Example samples of a Machine Learning dataset for the task of predicting sentence toxicity. When a sentence has multiple valid annotations about its toxicity, aggregating them into one unique label results in ignoring certain opinions.

1.1.2. Consequences from the combination of Machine Learning and Crowdsourcing

According to Dr. Yoshua Bengio, "Machine learning research is part of research on artificial intelligence, seeking to provide knowledge to computers through data, observations and interacting with the world. That acquired knowledge allows computers to correctly generalize to new settings."2.

2https://www.techemergence.com/what-is-machine-learning/

This definition highlights the need for data (data samples and their corresponding labels) in the field of ML, a need which is even more important for Deep Learning algorithms. Consequently, the training of most Machine and Deep Learning algorithms starts with a crowdsourcing phase to constitute large datasets. The pipeline is depicted in Fig. 1.2.

Figure 1.2: Usual Machine Learning pipeline combined with crowdsourcing. The training dataset is collected via crowdsourcing by aggregating the annotations of multiple annotators into labels. The Machine Learning model is then trained on this dataset.

The three above-cited limitations of ML systems for the prediction of subjective properties, put together, can make certain systems highly unfair. The annotation aggregation creates a dataset bias towards one of the perceptions of the property. The algorithms, not adapted to the task and trained on these inappropriate labels, consequently exhibit an algorithmic bias, usually towards the majority perception, resulting in algorithmic unfairness towards the other perceptions, usually those of the minorities. For example, in the last example of Table 1.1, 50% of the annotations have the label "toxic" while the other 50% are labelled "non-toxic". This might be because part of the population agrees with the statement of the sentence while the other part disagrees. If one unique label such as "toxic" were selected, the judgements of 50% of the population would be ignored (dataset bias). The Machine Learning model trained on this sample would then be biased towards the first type of judgement (algorithmic bias) and its outputs would ignore the judgements of 50% of the population (unfairness towards a subset of the population). Finally, the lack of a proper definition and evaluation method of this algorithmic unfairness does not help researchers to identify and tackle the issues related to unfairness when predicting subjective properties.

In order to make the predictions of algorithms closer to reality and consequently fairer, we stress that it is necessary to investigate how to integrate the subjectivity of the property to annotate and the subjectivity of the crowd in crowdsourcing tasks, in subsequent ML algorithms, and in the evaluation methods of the performance of the systems. According to the above pipeline, Machine Learning and Crowdsourcing are currently two separate fields of research that are used sequentially. However, we consider that for algorithms fulfilling the disputable task of predicting subjective properties of samples, the two fields have to be brought together in order to increase the fairness of the predictions. Since the labels in the training set cannot be defined by a simple aggregation of annotations anymore, we hypothesize that the multiple annotations should be directly used by adapted ML algorithms. We argue for model architectures able to use multiple labels per sample: a Machine Learning model able to return different annotations for the same sample depending on the judgement of the person currently using the model would be fair towards each user.
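To make this idea more concrete, the sketch below shows the general shape of such a user-conditioned classifier: the input of an off-the-shelf Logistic Regression is augmented with features describing the annotator, so that each individual annotation (rather than an aggregated label) can be used as a training target and the prediction depends on both the sentence and the user. The toy sentences, the two-dimensional user features and the labels are hypothetical placeholders, not the thesis's actual data or user models; Chapter 5 compares richer variants of this idea (continuous and one-hot encoded user models).

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: each row is one annotation, i.e. (sentence, annotator features, that annotator's label).
sentences  = ["you write like a genius", "you write like a genius", "what shit u talk to me"]
user_feats = np.array([[1, 0],    # two hypothetical background attributes per annotator
                       [0, 1],
                       [1, 1]])
labels     = [0, 1, 1]            # 0 = non-toxic, 1 = toxic, as given by each annotator

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(sentences)
X = hstack([X_text, csr_matrix(user_feats)])      # sentence features concatenated with a user model

clf = LogisticRegression().fit(X, labels)

# Predictions for the same sentence, conditioned on two different hypothetical users.
for feats in ([1, 0], [0, 1]):
    x = hstack([vectorizer.transform(["you write like a genius"]), csr_matrix([feats])])
    print(feats, clf.predict(x))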

1.2. Main research question, challenges, hypothesis and use-case

Making algorithms which classify subjective properties fairer raises several main challenges in three fields:

• CH1: Dataset collection by crowdsourcing: How to ensure crowdsourced labels' quality without aggregating the annotations into unique labels?

• CH2: Machine Learning model:

– CH2.1: How to adapt Machine Learning algorithms' architecture to return multiple labels depending on the user of the algorithms?
– CH2.2: How to build datasets to train these algorithms?

• CH3: Performance evaluation: How to measure the fairness of the algorithms quantitatively?

"Targeted crowdsourcing" is introduced by Ipeirotis et al. [63] as crowdsourcing where the crowd workers with the needed expertise are identified during the task. Specifically, we call in the rest of the thesis "tar- geted crowdsourcing" crowdsourcing tasks which take into account the available properties of the annotators’ background during annotation collection. We believe that these properties are important to make algorithms fairer at classifying subjective properties since subjectivity is intrinsic to each annotator.

The main research question (RQ) that follows is:

RQ: How can targeted crowdsourcing be used to increase the fairness of Machine Learning algo- rithms trained for subjective properties’ prediction?

The main hypothesis we test is the following:

H: Even if annotation aggregation makes it possible to eliminate annotation mistakes and spammers, when the annotations of a subjective property differ but are all considered valid, aggregation also causes a loss of information that decreases the fairness of ML results. Therefore, using disaggregated labels to train adapted algorithms on adapted datasets should increase their fairness.

We could study different prediction tasks, as long as they involve subjective properties with several valid interpretations of a single sample. We chose the use-case of predicting the toxicity of sentences for the following reasons.

• Toxicity is a subjective property which depends on people's perception of a sentence's characteristics and context, as well as on people's own subjectivity (mainly influenced by their background).

• Prediction of sentence toxicity is useful for several purposes, such as building well-behaved chatbots or filtering offensive Web content, since the amount of hate speech online has increased with the growth of the Internet. A short reflection about the ethical issues related to the automatic prediction of sentence toxicity is proposed in Appendix D.

• No previous study has investigated the collection of toxicity annotations using crowdsourcing, nor the automatic prediction of toxicity depending on people's subjectivity.

• In contrast to other tasks like sentiment analysis or subjectivity annotation (determining whether a snippet is subjective or not) [62], it is hard to define a category of people who are experts at judging toxicity; thus crowdsourcing appears to be a suitable way of collecting a toxicity dataset.

Therefore, detecting and understanding toxicity and its subjectivities is an interesting use case for our experiments on the prediction of subjective properties using crowdsourcing and Machine Learning.

1.3. Research questions of the research project

In order to answer the main RQ, we divide it into the following research questions and their sub-questions, each one corresponding to a specific challenge cited previously. We describe the methods employed to answer each of the questions and the results we hope to find.

• RQ1: How can a dataset be built to train algorithms for the prediction of subjective properties? This question aims at answering the first challenge. First, it is required to choose an existing dataset which contains subjectivity. The analysis of the psychology literature about sentence toxicity shows that toxicity is a subjective property in theory, and enables us to identify the major variables which influence the perception of sentence toxicity.

This leads to the creation of a list of requirements to select the Computer Science dataset for the study of toxicity in practice. Experiments on the selected dataset (the Jigsaw dataset) prove that it is suited to the study of subjective property prediction. Second, we study how to collect datasets containing subjectivities using crowdsourcing, while ensuring high-quality data at low cost. A literature review of existing crowdsourcing methods, compared against the requirements of the selected task and dataset, and experiments on the retained method, which does not aggregate annotations into unique labels, show that applying different crowdsourcing methods filters out part of the invalid collected annotations, but that more work is needed to identify the remaining annotator mistakes. Experiments on automatic clustering of the data samples open a possibility, to be refined, of grouping samples and having a reduced number of samples evaluated by selected annotators to decrease the cost of the crowdsourcing task.

• RQ2: How to evaluate algorithmic fairness when predicting subjective properties? This question aims at answering the third challenge. A literature review focused on algorithmic fairness in ML shows that current definitions and evaluation methods are not adapted to algorithms made to classify subjective properties. This leads us to propose a new definition which generalizes the concepts mentioned in the literature, and to investigate new evaluation methods. For that, we set up a list of algorithms with different expected fairness-related behaviours and propose different ways to characterize and visualize their potential unfairness, by clustering the dataset according to multiple criteria and evaluating the algorithms' performance on each cluster separately (a minimal sketch of such a per-cluster evaluation is given after this list). We select the characterizations which enable us to observe the expected fairness-related behaviours and make several hypotheses to summarize these characterizations into unique fairness measurements. We again select the ones which highlight the expected behaviours.

• RQ3: How to build and train algorithms whose outputs are fair when predicting subjective properties of samples? This question aims at answering the second challenge. A completely fair algorithm is assumed to be an algorithm which accurately outputs the perceptions of the current judge of the samples. An extensive review of the literature on classifying toxic (or related) speeches is conducted to identify potential algorithms to adapt, and a search for ML algorithms made to predict different outputs for one same sample depending on certain criteria is done to find potential ways to adapt the outputs to the different crowd workers. Hypotheses are formulated concerning the training processes and architectures of the models, as well as resamplings of the training dataset, in order to make their predictions fairer. They are evaluated by applying the hypotheses to a default ML model and comparing the performance of the default model to this new model using the previously defined fairness evaluation method. It is concluded that augmenting the algorithms' inputs with variables describing the users' background information, and resampling the training dataset according to the fairness criteria which we want to optimize, make the algorithms fairer. However, the predictions are not totally accurate and additional modifications of the models should be made to increase the fairness even more.
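As announced in RQ2, the fairness characterization relies on clustering the test annotations and evaluating the model on each cluster separately. The minimal sketch below illustrates this idea under simplifying assumptions: the clusters hypothetically group workers by low versus high average disagreement rate (ADR) with the majority vote, and the gap between the best- and worst-served cluster is used as one possible summary of unfairness. The data, cluster labels and summary statistic are illustrative placeholders, not the metrics finally selected in Chapter 4.

import numpy as np

def per_cluster_accuracy(y_true, y_pred, cluster_ids):
    """Accuracy of the predictions, computed separately for each cluster of annotations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {c: float(np.mean(y_true[cluster_ids == c] == y_pred[cluster_ids == c]))
            for c in np.unique(cluster_ids)}

def accuracy_spread(per_cluster):
    """One possible unfairness summary: the gap between the best- and worst-served cluster."""
    return max(per_cluster.values()) - min(per_cluster.values())

# Hypothetical test annotations, clustered by the annotators' disagreement with the majority vote.
y_true      = np.array([1, 0, 1, 1, 0, 1])
y_pred      = np.array([1, 0, 0, 1, 1, 0])
cluster_ids = np.array(["low_adr", "low_adr", "high_adr", "low_adr", "high_adr", "high_adr"])

accs = per_cluster_accuracy(y_true, y_pred, cluster_ids)
print(accs, accuracy_spread(accs))   # the high-ADR cluster is served much worse in this toy example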

1.4. Thesis contributions

In this thesis we set out to create a methodology aimed at evaluating the fairness of the outputs of classifiers for subjective properties, and at making them fairer. The methodology tackles the metrics to evaluate fairness, as well as fairness related to crowdsourced data and ML algorithms. It consists of three main steps: the dataset creation phase, the algorithm design and training phase, and the evaluation of the fairness of the predictions. We bring five main contributions, which we list here.

• CO1: The first contribution of the thesis is an extensive literature review to study the fairness of ML algorithms trained for subjective properties prediction, with the use-case of sentence toxicity. We investigate the existing literature on each of the relevant sub-topics, which enables us to highlight current limitations as well as possible directions to improve the fairness of subjective properties classification algorithms and of their training data (study of crowdsourcing for subjective property annotations). It answers the first sub-questions of each research question (RQ).

• CO2: The second contribution answers the first research question (RQ1). It consists of a list of recommendations on the collection and cleaning of a toxicity dataset for further training of ML algorithms.

• CO3: The third contribution is an evaluation method to measure the fairness of ML algorithms made to predict subjective properties of samples. This contribution is the answer to the second research question (RQ2).

• CO4: The fourth contribution is a modification of current ML and Deep Learning algorithms' architectures so that their fairness, as defined in the second contribution, is improved. This is the answer to the third research question (RQ3).

• CO5: Finally, the last contribution which answers part of the third research question (RQ3) is a set of dataset resampling methods to modify training sets and increase the fairness of the ML algorithms which are trained on these sets.

The first contribution leads the reflection on the different aspects of the main RQ and enables us to formulate hypotheses to answer it. Contributions 2) and 5) tackle the targeted crowdsourcing aspect of our main RQ: they identify how to use crowdsourcing to create datasets to train fair Machine Learning algorithms made to predict subjective properties of samples (a minimal sketch of one such dataset resampling step is given below). Contribution 4) tackles the ML aspect of the RQ. Combining contributions 2), 4) and 5) forms the main elements of the methodology to make the algorithms' outputs fairer, while contribution 3) enables a quantitative evaluation of algorithmic fairness, so that future research improving the current methodology can be objectively compared to our proposition.
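To give a flavour of the dataset resampling methods of contribution 5), the sketch below balances the training annotations of each sentence so that the majority and minority opinions are equally represented, by downsampling the most popular opinion. The annotation format and the balancing rule are simplifying assumptions made for illustration; the resamplings actually evaluated in Chapter 5 (for example, balancing along the annotation popularity percentage) are defined there.

import random
from collections import defaultdict

def balance_opinions_per_sentence(annotations, seed=0):
    """Downsample so that, for each sentence, every expressed opinion keeps the same number of annotations.
    `annotations` is a list of (sentence_id, worker_id, label) tuples (hypothetical format)."""
    rng = random.Random(seed)
    per_sentence = defaultdict(list)
    for sentence_id, worker_id, label in annotations:
        per_sentence[sentence_id].append((sentence_id, worker_id, label))
    balanced = []
    for anns in per_sentence.values():
        per_label = defaultdict(list)
        for ann in anns:
            per_label[ann[2]].append(ann)
        keep = min(len(group) for group in per_label.values())   # size of the least popular opinion
        for group in per_label.values():
            balanced.extend(rng.sample(group, keep))
    return balanced

# Hypothetical example: one sentence with three "non-toxic" annotations and one "toxic" annotation.
anns = [("s1", "w1", "toxic"), ("s1", "w2", "non-toxic"),
        ("s1", "w3", "non-toxic"), ("s1", "w4", "non-toxic")]
print(balance_opinions_per_sentence(anns))   # keeps one annotation of each opinion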

1.5. Thesis outline

The thesis is organized as follows. We first conduct a literature review of the different fields concerned with our research questions (mainly psychology, Machine Learning, crowdsourcing and algorithmic fairness) in order to answer the first sub-questions of each research question (RQ) (Chapter 2). Then we tackle the first research question (RQ1): we show the validity of studying sentence toxicity to work on the automatic prediction of subjective properties, and investigate the crowdsourcing processing steps to create datasets adapted to train algorithms for subjective property prediction (Chapter 3). Afterwards, we focus on the problem of creating an algorithmic fairness evaluation method (Chapter 4). In the next chapter, we work on the creation of Machine Learning algorithms to realize the automatic classification of subjective properties, and we investigate dataset resampling to improve the performance of these algorithms (Chapter 5). The fairness metric and Machine Learning steps are entangled chronologically, since we do not have a baseline for our task and thus creating an adapted evaluation metric and testing it on adapted algorithms is an iterative process. Finally, we discuss the overall results and give suggestions for future work (Chapter 6). Fig. 1.3 presents an overview of the thesis work and organization.

Figure 1.3: Overview of the thesis project.

2 Literature review

To tackle our main research question, we start by conducting a literature survey to investigate the current state of the art in the area of toxicity detection, both in the psychology and Computer Science fields, as well as in the areas of fairness of Machine and Deep Learning algorithms and of crowdsourcing for subjective tasks. It answers the first sub-question of each research question and corresponds to contribution CO1. We begin by looking at how toxicity (RQ1) and its related topics are studied in the field of psychology. The aim of this first part is twofold: 1) to verify that sentence toxicity is a subjective property and identify which variables influence its perception, and 2) to understand which elements a training dataset for toxicity prediction should contain, and how to build such a dataset. Next, we are interested in the Machine Learning (RQ3) part of the work. On the one hand, we review the existing Machine Learning approaches for toxicity prediction and the Machine Learning algorithms which target the prediction of subjective tasks, in order to find ways to perform subjective toxicity prediction; on the other hand, we study the methods used to create the datasets employed in these works, in order to reuse them or create our own dataset (RQ1). Additionally, we investigate the evaluation metrics used to measure the performance of these algorithms, and the Machine Learning metrics aiming at evaluating algorithmic fairness and algorithms' discrimination power, with the purpose of defining an evaluation method adapted to evaluating the fairness of toxicity classification algorithms (RQ2). Finally, we study how crowdsourcing is currently used to collect data related to subjective tasks, in order to devise a methodology to collect annotations for our task and evaluate the quality of our dataset (RQ1).

2.1. Definitions of toxicity-related speeches in the psychology literature

Several studies show the necessity of detecting hate speech on the Internet [105]. With the increasing use of the Internet and of websites where user comments are enabled, the quantity of user-posted messages is too large for human moderators alone to filter all of them. Thus, it becomes more and more important to be able to detect hateful comments in an automatic way. However, the studies do not all tackle exactly the same problem, or at least do not use the same words to describe it. This is why we first define the different expressions found in the literature to qualify the different types of undesirable speeches found on the Internet. Afterwards, we give a description of the variables which influence how these speeches are perceived. Finally, we examine how the studies that identified these variables were conducted, because this might give insights for creating a dataset for our experiments.

2.1.1. Term definition

Looking at the psychology and Computer Science literatures, we noticed that several words or expressions are considered equivalent or are used without a precise definition, which results in a blurry distinction between them. These words are "offensiveness", "aggression", "toxicity", "hateful" or "harmful" speeches and, more rarely, "flaming". To refer to all of them, we name them undesirable speeches in the following sections. From psychology, Archard [6] explains that offensive speech "offends the other in as much as it is directed at some property of the other (a personal characteristic, belief, relationship, membership of a group, etc.) and causes offence by the manner in which the other is represented. Some offensive speech can be hateful but need not be." He investigates the wrongfulness of such speeches.

The main key point he highlights is the wrong aim of these speeches, which "attempt to denigrate, humiliate, diminish, dishonor, or disrespect the other". He also makes a clear distinction between harmful and hurtful speeches, and consequently offensiveness: "Harmful acts need not be hurtful. I can damage your interests without you being aware of the fact. Equally but conversely hurtful acts need not be harmful. I can occasion you intense but short-lived mental distress without setting back your interests. Offense is thus hurtful but need not be harmful". In a similar way, the difference between offensive speech and hate speech is explained in some Computer Science papers [36] by the fact that hate speech uses "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group.", while offensive language does not always aim at hurting people. Finally, toxicity is differentiated from these previous concepts in the sense that it does not deal with the ideas expressed in the speech, but only with how these ideas are communicated 1: the ideas could be hateful but expressed in a clearly argued way without using abusive language, and so would not be toxic.

Flaming is a concept directly related to the Web, since it is defined only for online language. It refers to "the use of offensive language such as swearing, insulting and providing hateful comments through an online medium" [71]. When studying YouTube comments, flaming was found to be related to "political attack and racial attack" with the use of "stereotypes, speculation, comparison, degrading comments, slander/defame, sedition, sarcasm, threaten, challenge, criticism, name-calling, and sexual harassments". Thus flaming seems to be a larger group of speeches containing both offensive and hateful speeches.

In the Computer Science literature, these different concepts are sometimes blurry, which is why in the following sections we look at the psychology literature related to each of these concepts. No studies of toxicity or flaming perception were found, whereas studies related to the perception of offensiveness, harmfulness and hate speech exist. Thus, we focus on these speeches.

2.1.2. Methodology to search for the papers

To survey the psychology literature concerning the perception of toxicity-related speeches, we performed a search on Google Scholar using the combination of the following keywords: "psychology", "speech", and ("offensiveness" or "hatefulness" or "hateful" or "harmful" or "harmfulness" or "aggressivity" or "toxicity" or "cyberbullying"). We specifically selected the papers focusing on the study of the variables influencing the perception of these speeches, and on the different characteristics of the speeches (mainly the different targets). Then, we added the two keywords "perception" and "variable", which enabled us to find more specific papers. Finally, we searched whether there is existing literature which deals with these speeches on the Web by adding "Web" and "Internet" to the initial query, but no results were found.

2.1.3. The different targets of hate speech Several papers identify different classes of hate speech depending on the target of the speech. In an analysis of hate speeches on the Internet [95], several different "hate" topics are identified: race, religion, disability, sexual orientation, ethnicity, or gender, as well as behavioural and physical aspects that are not necessarily crimes. Hate categories are defined: Race, Behaviour, Physical, Sexual orientation, Class, Gender, Ethnicity, Disability, and Religion, (and other). In the paper analyzing hate speech on Instagram in In- donesia [78], 6 classes are identified: race/ethnics, religion, ability, social status, moral, and appearance look. Henry et al. [57] distinguish between 5 target groups: "groups representing different races (Black/African- American, Latino(a), White/European-American, Asian/Asian-American, Arab/Arab-American), genders (male, female), sexual orientations (gay people, straight people), mental status (highly intelligent people, mentally ill people, mentally disabled people), religious affiliation (Jewish people), age (elderly people), and physical status (obese people)".

2.1.4. The different variables influencing how toxicity-related speeches are judged Theoretical papers, not specific to hate speech on the Internet, study what are the variables that influence how hate and offensive speeches are received by different people. A summary of these different factors is written in Table 2.1. The different groups of variables which influence sentence offensiveness perception are summarized in Fig. 2.1.

1https://www.forbes.com/sites/kalevleetaru/2017/02/23/fighting-words-not-ideas-googles-new-ai-powered-toxic-speech-filter-is-the-right-approach

Figure 2.1: Summary of the variables influencing the assessment of sentence toxicity.

People's internal characteristics

Guberman et al. [54] underline the "variation in individual perceptions of malicious content". In their study attempting to rate the aggressiveness of tweets, they show a difference depending on gender: women rate tweets as aggressive more often than men do. They additionally mention other possible factors influencing how aggressive a sentence is perceived to be: the tendency that some people have "to interpret ambiguous stimuli as being intentionally aggressive" (named "individuals' attributions of intent"), as well as the dispositions of people to become angry and anxious (named individuals' angry and anxious dispositions), because these dispositions make people more prone to judge sentences as aggressive. Downs et al. [40] find that two main factors influence how harmful a hate speech is perceived to be: gender and liberalism inclination.

Cowan et al. [32, 33] conducted a detailed study of the different variables influencing the perceived offensiveness of hate speech. Their results are consistent with the other papers. They show that "the nature of the observer - his or her ethnicity, gender, education, and age - plays a significant role". In [34], they make a distinction between the perceived offensiveness of a speech and its perceived harmfulness. For example, they show that ethnicity is a main factor in the perceived harmfulness but not in the perceived offensiveness.

Several works specifically study racial hate speech. O'Dean et al. [79] find two different factors influencing how hate speeches are perceived: the frequency with which people are subject to racial prejudice, and people's "beliefs about the appropriateness of expressing racial prejudice". Indeed, they show that racial hate speech is judged more offensive by people less often subject to racial prejudice and by people who strongly believe that expressing racial prejudice is inappropriate. Williams et al. [109] find similar variables, and they additionally highlight the difference of perception resulting from different ethnicities. The participants of their experiments, which evaluate the perceived offensiveness of Internet memes, are "People of Color" and "White" people. They observe that people in the first category, when they were more often subject to racial "microaggressions", perceive the memes as more offensive, whereas this phenomenon was not observed for the White participants. Still focusing on racial hate speeches, Boeckmann et al. [14] show that they trigger different kinds of emotional responses (fear, anger, sadness, outrage). Furthermore, participants with high membership esteem reacted more strongly to threats to their group than low identifiers.

Other variables: speech context (speech target and author, ...) and speech characteristics
Several studies [32, 35, 57] investigate which characteristics of a sentence influence how offensive it is perceived to be. They look at which categories of hate and which targets are expressed in the sentence, and at the syntactic and semantic properties of the sentence (length, usage of profanity, ...). Cowan et al. [32, 33] find that properties of the speech itself and of its context influence how it is perceived: mainly, the targeted group of the speech, whether it is a public or private speech, and whether it received a response. Another study finds that the perceived offensiveness is also influenced by how the direct target of the speech behaved and felt [34]. Boeckmann et al. [14] also underline a difference between the targets of the speeches: a speech directed at a person is seen as more offensive than a speech directed at the group this person belongs to. Moreover, other studies investigate whether the author of the sentence influences how the receiver of the sentence or a third person evaluates its offensiveness [35]. Besides, Sood et al. [98] study the use of profane words in different Yahoo Web communities (political or not, in the case of the paper). They underline that different communities not only use profanity with different frequencies, but also in different ways and contexts, and judge the words differently. For them, the perception depends on the community and on its domain.

Variable category | Variable | Measure | Paper
Internal characteristic | gender | question | [54], [40], [32, 33]
Internal characteristic | ethnicity | question | [32, 33], [109]
Internal characteristic | education (level of educational attainment) | question | [32, 33]
Internal characteristic | age | question | [32, 33]
Internal characteristic | liberalism inclination | question with 7-point scale | [40]
Internal characteristic | "individuals' attributions of intent", individuals' angry and anxious dispositions | not investigated | [54]
Internal characteristic | frequency with which people are subject to racial prejudice, people's "beliefs about the appropriateness of expressing racial prejudice" | question with scale | [79], [109]
Internal characteristic | membership esteem to the offended group | question with several scales | [14]
Sent. charact. / context | targeted group or person | invented scenario | [32, 33], [14], [57]
Sent. charact. | category of hate speech | info in dataset | [57]
Sent. charact. | subtle or blatant prejudice, properties of the sentences | in the dataset | [37], [32]
Sent. context | public or private speech | invented scenario | [32, 33]
Sent. context | speech received a response or not, what type of response | invented scenario | [32, 33], [34]
Sent. context | author of the speech | invented scenario | [35]
Sent. context | Internet community in which the speech is published | info in dataset | [98]

Table 2.1: The factors influencing hate speech perception and/or offensive speech perception, ordered in 3 categories: internal belief (people's individual characteristics), sentence characteristics (how the sentence in itself is constructed), and sentence context (the surroundings of the sentence).

2.1.5. Methodologies of the different studies
Computer Science study
Guberman et al. [54] investigate the perceived violence of tweets and give recommendations on the tweet annotation process. They had the tweets rated according to an adapted version of the Buss-Perry Aggression Questionnaire (BPAQ). This questionnaire consists of several propositions that annotators rate according to whether or not these are characteristic of the tweets. The workers on Mechanical Turk had to go through 14 gold questions, and 12 correct answers were required to proceed to the real task. The authors found 30% disagreement among the six workers, and explain it with several factors: the questionnaire might not be representative of tweet violence, no context around the tweets is specified, it is hard to judge whether tweets are published for promotion and whether they are written by or targeting individuals or organizations, and the perception of violence depends on people's own beliefs, which differ among annotators.
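To make the gold-question screening used in [54] concrete, the sketch below shows a minimal admission check with their 14-question/12-correct threshold; the gold labels, data structures and function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of gold-question screening as described in [54]: a worker
# must answer at least 12 of 14 gold questions correctly before being
# admitted to the real annotation task. Labels and names are illustrative.
GOLD_ANSWERS = {f"gold_{i}": "aggressive" if i % 2 == 0 else "not_aggressive"
                for i in range(14)}  # hypothetical gold labels

def admit_worker(worker_answers: dict, gold=GOLD_ANSWERS, threshold=12) -> bool:
    """Return True if the worker may proceed to the real task."""
    correct = sum(1 for q, a in gold.items() if worker_answers.get(q) == a)
    return correct >= threshold

# Example: a worker who answers 13 of the 14 gold questions correctly is admitted.
example = dict(GOLD_ANSWERS)
example["gold_3"] = "aggressive"      # one wrong answer
print(admit_worker(example))          # True
```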

Psychology study
In the theoretical studies, the participants are presented with scenarios and asked to rate them according to some propositions, the ratings being averaged in the end. Cowan et al. [32] ask their participants to rate the scenarios according to their agreement with the proposition "the message is offensive", from 1 ("strongly disagree") to 7 ("strongly agree"). At the end of the experiment, they ask the participants for their background information. Similarly, in [34], they present scenarios consisting of sentences and possible responses, and ask the participants to rate the propositions "how offensive is the message?" from 1 ("not at all offensive") to 12 ("extremely offensive"), "how serious is the offense?" from 1 ("not at all serious") to 12 ("extremely serious"), and "how harmed was the receiver of the sentence?" from 1 ("not at all harmed") to 12 ("extremely harmed"). Since the first two questions had highly correlated responses, they were aggregated. O'Dean et al. [79] defined 10 propositions to rate from 1 ("disagree very strongly") to 9 ("agree very strongly") and averaged the scores to evaluate the perceived offensiveness of the scenarios. In the same way, Boeckmann et al. [14] measure how harmful a scenario is by asking participants to rate 6 propositions on a 1 to 6 scale and averaging the ratings. To measure offensiveness, Williams et al. [109] ask the participants to rate images along "how comfortable (reverse scored), acceptable (reverse scored), offensive, hurtful, and annoying they were on a 7-point Likert scale ranging from 1 (Strongly Disagree) to 7 (Strongly Agree)", and average the scores for each image. Cunningham et al. [35] employ a different method: they propose 4 scenarios to participants and ask them to select the most offensive one. The scenarios consist in asking the participants to imagine being in a specific situation, such as attending a men's basketball game, and to select one out of four possible situations such as "A Caucasian, female said: "Of course we lost. We played like a bunch of girls.""
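To make the rating-aggregation procedure concrete, the sketch below shows how multi-item Likert ratings such as those of [109] can be combined into a single offensiveness score, with reverse-scored items flipped before averaging; the item names and ratings are illustrative assumptions.

```python
# Minimal sketch of turning multi-item 7-point ratings (as in [109]) into a
# single offensiveness score: reverse-scored items are flipped (x -> 8 - x)
# and all items are averaged. Item names and values are illustrative.
import numpy as np

REVERSE_SCORED = {"comfortable", "acceptable"}   # flipped before averaging
SCALE_MAX = 7

def offensiveness_score(ratings: dict) -> float:
    values = [(SCALE_MAX + 1 - v) if item in REVERSE_SCORED else v
              for item, v in ratings.items()]
    return float(np.mean(values))

# One participant's ratings for one image (hypothetical values).
ratings = {"comfortable": 2, "acceptable": 3, "offensive": 6,
           "hurtful": 5, "annoying": 6}
print(offensiveness_score(ratings))   # 5.6
```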

2.1.6. Discussion
RQ1. In this section, we looked at the psychology literature on the different kinds of toxicity-related speech. Considering that their definitions are blurry, we have to identify one clear type of speech to study in case we need to build a dataset of toxicity-related speech. The perception is influenced by three kinds of variables: sentence characteristics, sentence context, and individuals' internal characteristics. If the sentence context is made clear and the sentence characteristics are intrinsic to the data sample, the only variables influencing sample perception are individuals' internal characteristics. Depending on the study, different internal-characteristics variables are claimed to influence the offensiveness judgement. Cowan et al. [32–34] highlight gender, age, ethnicity and education as the main influencing factors. We will first focus on these factors, which are the easiest to measure. In other studies, factors on the "psychological" side are claimed important: liberalism inclination, sense of belonging to a community, the frequency of being subject to prejudice, the belief of appropriateness to express racial prejudice, angry and anxious dispositions, attribution of intent. These factors, which are more complicated to measure, will be left out, but could be investigated in future work.
RQ1. To collect the data (perceptions of sentence toxicity), we have to take into account how the psychological studies are conducted. For each sample sentence, we should consider only one possible context - which might have to be explained to the annotators - so that this factor does not influence the annotations they give. If we could run several crowdsourcing experiments, we could compare the crowdsourcing quality of experiments with more or fewer indications, for example where no context information is given.
RQ3. The perception of toxicity-related speech depends on many different variables, therefore it seems a good approach to simplify the prediction task as much as possible (depending on the available datasets) by restricting the properties of the speeches we investigate. Mainly, we can select one specific category of speech (for example, racist or sexist speech). Moreover, the psychology literature identifies different target groups of hateful speech. We could use them to create different datasets, or as additional information to help the Machine Learning classifiers learn which sentences are toxic in which toxic speech category. We could additionally keep the sentence context fixed in the experiments. A dataset containing sentences with similar characteristics could help the learning process since there would be fewer variations to learn.

2.2. Computational methods to detect toxic speech
A 2015 literature survey [75] lists scientific papers dealing with hatefulness detection on the Web, explaining that the authors searched for papers with the keywords "online hate speech", "offensive language online", "cyber-bullying", and "hateful language" because hate speech is not precisely defined in Computer Science. The survey investigates the preprocessing techniques, the extracted features, the feature selection techniques and the classification algorithms employed. However, it addresses neither the question of constituting a dataset and evaluating it to train such algorithms, nor the recently developed Deep Learning methods. The most recent survey on the topic [94], dating from 2017, mentions these Deep Learning algorithms. It mentions the amount of data the different papers use, but it does not specifically address how the data are annotated - probably because this topic is not developed extensively in the papers. That is why this section focuses not only on the existing algorithms but also on their evaluation and training datasets.

2.2.1. Methodology to search for the papers
To search for the Computer Science papers dealing with the detection of undesirable speech, we used the queries "Machine Learning + [name of an undesirable speech] (+ "speech")", "Deep Learning + [name of an undesirable speech] (+ "speech")", and "automatic prediction + [name of an undesirable speech]". We selected all the papers which deal with datasets and with algorithms for undesirable speech prediction. The MANDOLA project2 did not appear in the search results but is worth mentioning because its aim is to monitor and report hate speech on the Internet. Unfortunately, few detailed publications are available about the project, but a general overview is given. It focuses on finding a definition of hate speech, collecting a Web dataset annotated by experts to classify different hate speech categories, developing (Naive Bayes) classifiers to automatically predict these classes3, and finally visualizing the hate speech distribution on a map. Ethical and legal reflections on the use and control of hate speech on the Web are also available.

2.2.2. Techniques used for toxicity detection
"Traditional" Machine Learning techniques
Until 2016, most of the techniques used for the detection of hatefulness or insults were Machine Learning algorithms without Deep Learning. They performed feature extraction and selection, and then used classification algorithms [75]. These methods are listed in Table 2.2 and compared in Table 2.3, and details on the datasets employed are listed in Table 2.7. The most used algorithm is the Support Vector Machine (SVM). Sood et al. [99] use lists of profane words and Support Vector Machines; Chandrasekharan et al. [20] aggregate data from several Internet communities and classify sentences from another community with a Support Vector Machine. Chen et al. [25] classify comments into abusive, not abusive and undecided (to take into account the subjectivity of the judgments) and rate their harmfulness from 1 to 4, using Support Vector Machines. They do not obtain good results for the ratings. Davidson et al. [36] classify between hate speech, offensive but not hate, and neither of the two, while Burnap et al. [19] simply classify between hateful or not. Warner et al. [107] define several categories of hate speech ("anti-semitic, anti-black, anti-Asian, anti-woman, anti-muslim, anti-immigrant or other-hate") and suggest using distinct classifiers for each of them. They explain that different categories of hate speech make use of different stereotypes with distinct lexical fields, and therefore it should be easier for each classifier to learn one unique type of speech. Sood et al. [100], in another paper, explore insult detection using Support Vector Machines and test different kinds of features. Their experiments suggest that classifying insults in a general domain or training separate SVMs for different categories of comments (politics, news, entertainment, business, world) might not lead to much difference in performance, but they also suggest that it might depend on the categories, certain ones employing more specific language than others. Dinakar et al. [38] aim at detecting cyberbullying by classifying sentences into three different topics (sexuality, race and culture, and intelligence); they test different classifiers trained on separate topic datasets and on a common dataset, and achieve better performances with topic-specific classifiers. Certain works only focus on using lists of profane words, which Rojas et al. [89] augment with genomics-inspired techniques. Other works use language trees for detecting and filtering offensive comments [113], logistic regression [108] - classification based on text features and additional commenter features (gender, age, location) -, [39] - classification of aggressiveness based on word embeddings -, and regression [77] with the Vowpal Wabbit regression model4.

2http://mandola-project.eu/
3http://mandola-project.eu/m/filer_public/06/b9/06b92efd-cce2-4204-a2a9-5eb2c34912f7/mandola-d31.pdf

Chen et al. [24] test both the Naive Bayes classifier and Support Vector Machines. Finally, Chen et al. [27] directly compute an offensiveness score from features extracted from the sentences and the authors' comment history. Chatzakou et al. [21] are interested in classifying Twitter users into bully, aggressive or normal users, with tree-based classifiers. Wulczyn et al. [111] use a Multi-Layer Perceptron (MLP) and Logistic Regression. They published the only paper which addresses the problem of subjective judgments. They test two different models: a first one where the annotations are aggregated into a single one-hot encoded label, and a second one where the labels are represented as empirical distributions over the annotations. They make the assumption that comments with high annotator agreement are different from the ones with a lower agreement rate, and that empirical distributions should therefore better represent the labels and help the algorithms learn. They obtain better performances with the empirical-distribution model than with the single-label model.
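As an illustration of the "traditional" pipeline shared by many of these works (hand-crafted n-gram features fed to a linear classifier), the following minimal sketch uses character n-gram TF-IDF features and Logistic Regression; it is not the exact configuration of any cited paper, and the comments and labels are invented.

```python
# Minimal sketch of the "traditional" pipeline: character n-gram features
# fed to a linear classifier. Data and hyperparameters are illustrative.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["you are a wonderful person", "you are an idiot",
            "have a nice day", "shut up, loser"]
labels = [0, 1, 0, 1]  # 1 = toxic, 0 = not toxic (hypothetical annotations)

model = Pipeline([
    ("features", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("classifier", LogisticRegression()),
])
model.fit(comments, labels)
print(model.predict_proba(["what an idiot"])[0, 1])  # estimated toxicity
```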

Papers | Algorithms | Features | Task
[99] | Combination of SVM and lists | bigrams and stems | profanity or not
[20] | Bag of communities with NB, SVM, LR | Bag of Words, n-grams (1,2,3), feature selection with ANOVA F-values | abusive or not
[25] | SVM | n-grams, syntactic and semantic features, context (reply to a previous comment or new comment, news category, Twitter or Facebook account, number of comments for the article) | abusive, not abusive, undecided; severity between 1 and 4
[36] | LR, NB, decision trees, random forests, SVMs | n-grams (1,2,3), Penn Part-of-Speech tag, tweet quality, sentiment, syntactic, semantic; feature selection with logistic regression | offensive, hate but not offensive, neither
[19] | Bayesian LR, random forest decision tree, SVM, ensemble classifier | BoW n-grams (1,2,3), typed dependencies, feature selection by LR | hate speech or not
[107] | SVM | template-based strategy to generate features | anti-semitic or not
[100] | SVM, multi-step classification | BoW n-grams (1,2), stems | insult or not, in or not categories, target (third party or previous comment), malicious intent or not
[38] | JRip rule-learner, decision tree, SVM | TF-IDF weighted unigrams, the Ortony lexicon of words denoting negative connotation, a list of profane words and frequently occurring POS bigram tags | labels (sexuality, race, intelligence): multi-class, one-vs-rest
[89] | distance matrix | characters of the sentence | presence or not of obscenity
[113] | relation tree construction | Part-of-Speech tags and typed dependency relations | identification of offensive sentence sections
[108] | LR | character n-grams, outside information about tweets and author (gender) | hate speech or not (sexist, racist, neither)
[39] | LR | Paragraph2vec embedding of comments and words | hateful / clean comment
[77] | Vowpal Wabbit's regression model | character n-grams, token unigrams and bigrams, linguistic and syntactic features, embeddings | clean / abusive comments
[24] | NB, SVM | BoW and n-grams with term frequency values, text structural features; dim. reduction by document frequency reduction, chi-square, SVD | harassment or not, cyberbullying or not
[111] | LR, Multi-Layer Perceptron | BoW with word- or character-level n-grams | personal attack or not

Table 2.2: List of the Machine Learning methods for undesirable speech classification. LR (logistic regression), SVM (Support Vector Machine), NB (Naive Bayes)

Deep Learning for toxicity detection
The most recent research uses Deep Learning for the classification of the sentences; we list the corresponding works in Table 2.4, their performance metrics in Table 2.5, and the associated datasets in Table 2.8. They usually train a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) or its variant the Long Short-Term Memory network (LSTM), or combine several networks. All the following papers working on the classification of different types of undesirable speech using deep neural networks claim to achieve better performances than traditional Machine Learning techniques.

4http://hunch.net/~vw/

Algo. cat. | Algorithm details | Performances | Comparison
SVM | SVM with lists [99] | P(0.90), R(0.20), maximal F1(0.32) and maximal A(0.92) | list-based methods: P(0.64), R(0.20), F1(0.30), A(0.91)
SVM | SVM [25] | R: 0.64 for abuse classification, 0.2 for severity classification | list-based: 0.57
SVM | SVM [107] | A(0.94), P(0.68), R(0.60), F1(0.63) |
SVM | SVM [24] | average class A (= average R)(0.8), positive R(0.61) | NB, baseline without data balancing and dim. reduction: 0.67, 0.34
SVM | SVM [38] | A(0.667) and kappa statistic (0.653) for SVM | JRip rule-learner, decision tree: lower performances
SVM | SVM and LR [36] | P(0.91), R(0.90), F1(0.90), confusion matrix | NB, decision trees, random forests: lower performances
SVM | SVM [19] | P(0.89), R(0.69), F1(0.77) |
SVM | multi-step classification [100] | P-R breakeven point(0.5009), maximum F1(0.5038), maximum A(0.9082) | SVM: 0.3781, 0.3953, 0.8980
LR | LR [108] | F1(0.7393), P(0.7293), R(0.7774) |
LR | LR [39] | AUC(0.8007) | methods using BoW with tf: 0.7889, or tf-idf: 0.6933
LR | LR [111] | AUC(96.24), Spearman rank correlation(66.68) |
LR | Bayesian LR [19] | P(0.89), R(0.69), F1(0.77) |
regression | Vowpal Wabbit's regression model [77] | P(0.773), R(0.794), F1(0.783), AUC(0.9055) | 0.8007 AUC with LR [39]
BoC | Bag of communities with NB, SVM, LR [20] | P, R, A(0.9118) | list-based: A(0.55), SVM: A(0.51)
ensemble | ensemble classifier [19] | P(0.89), R(0.69), F1(0.77) |
dist. mat. | distance matrix [89] | P(0.80), R(0.93), hit rate(0.86) | Levenshtein edit distance: 0.67 hit rate
random forest | random forest decision tree [19] | P(0.89), R(0.66), F1(0.77) | ensemble classifier [19]
tree | relation tree construction [113] | % of exact, excessive and insufficient filtering, A(0.9094), speed |
NN | Multi-Layer Perceptron [111] | AUC(96.59), Spearman rank correlation(68.17) |

Table 2.3: Comparison of the performances of the Machine Learning methods for undesirable speech classification. LR (logistic regression), SVM (Support Vector Machine), NB (Naive Bayes). The evaluation metrics are the accuracy (A), precision (P), recall (R), the F1-score (F1), and the Area Under the Curve (AUC).

Gao et al. [51] perform two-class classification by bootstrapping a slur-term learner and an LSTM Deep Learning algorithm. In another paper [50], they compare Logistic Regression, an LSTM, and an ensemble model. They introduce context (username and title of the commented news article), suggesting that the perceived offensiveness depends on the context of the sentence. This way of integrating context could be adapted to integrate the annotator belief profile to refine the prediction. Sax [92] also compares an LSTM to traditional Machine Learning algorithms (mainly Logistic Regression), using only 4921 training samples (the Kaggle dataset of insulting/non-insulting sentences). He finds that the LSTM obtains a higher F1-score but a slightly lower AUC score for a small architecture (adapted to the dataset). Badjatiya et al. [9] classify between racist, sexist or neutral sentences. They test three different types of networks (CNN, LSTM, FastText) and achieve better accuracies than previous Machine Learning techniques. Similarly, Chu et al. [29] test three different types of neural networks (RNN, CNN with character embeddings, CNN with word embeddings) to classify whether a Wikipedia comment is an attack or not. They achieve higher performances than linear regression and a multi-layer perceptron (MLP). Gamback et al. [49] use a CNN to classify sentences between racist, sexist, both or neither. Similarly, Pavlopoulos et al. [81] use an RNN and compare it with a CNN. Finally, Zhang et al. [119] combine a CNN and a Gated Recurrent Unit (GRU) recurrent neural network. They evaluate their architecture on seven different datasets, comparing the performances with baseline models (SVM and CNN). They find that their model obtains a higher F1-score than the other algorithms on six of the seven datasets. The results of all these studies show that it is feasible to use Deep Learning to classify hate speech, and they suggest that it might be more accurate than using "traditional" Machine Learning algorithms.
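As an illustration of the kind of recurrent classifier used in these works, the following minimal Keras sketch maps padded word indices through an embedding and an LSTM to a toxicity probability; the hyperparameters and the dummy batch are illustrative assumptions, not those of any cited paper.

```python
# Minimal sketch of an LSTM toxicity classifier:
# word indices -> embedding -> LSTM -> sigmoid toxicity probability.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 20000, 100   # assumed vocabulary size and padded length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),   # P(toxic)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Dummy batch: integer-encoded, padded comments with binary labels.
x = np.random.randint(1, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, verbose=0)
```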

Papers | Algorithms | Features | Task
[51] | Co-learning of a slur-term learner and an LSTM | Word2vec embeddings of the sentences | hateful or not
[50] | bi-LSTM with attention mechanism and with context, ensemble model (LSTM, LR) | word-level and character-level n-gram features, lexicon-derived features, emotion lexicon features | hateful or not
[9] | CNN, LSTM, FastText as networks, and as word embeddings for classifiers | word embeddings, either random embeddings or GloVe embeddings | racist, sexist, neither
[29] | LSTM, CNN | GloVe word embeddings, character-level embeddings | attack or not
[49] | CNN | word embeddings with Word2vec and random vectors, n-grams | racism, sexism, both, neither
[81] | GRU RNN with and without attention mechanism, CNN | Word2vec and GloVe word embeddings | reject or accept a user comment
[119] | CNN+GRU | Word2vec | racism, sexism, neutral / hate, non-hate

Table 2.4: List of Deep Learning methods for undesirable speech classification. LR (logistic regression), SVM (Support Vector Machine), NB (Naive Bayes)

Cat. | Algorithms | Performances | Comparison
LSTM | Co-learning of slur-term learner and LSTM [51] | P(0.422), R(0.580), F1(0.489) | LR (P:0.088, R:0.328, F1:0.139), LSTM (P:0.791, R:0.132, F1:0.228)
LSTM | bi-LSTM with attention mechanism and with context [50] | A(0.766), P(0.614), R(0.499), F1(0.548), AUC(0.760) | LR (A:0.750, P:0.572, R:0.516, F1:0.542, AUC:0.778)
LSTM | ensemble model (LSTM, LR) [50] | A(0.779), P(0.650), R(0.496), F1(0.560), AUC(0.804) | see above
LSTM | LSTM+Random Embedding+GBDT [9] | P(0.930), R(0.930), F1(0.930) | char n-grams, BoW, tf-idf with LR, Random Forest, SVM, Gradient Boosted Decision Trees (P:0.816, R:0.816, F1:0.816)
LSTM | LSTM [29] | F1(0.70), A(0.94) | LR, feed-forward neural network (F1:0.54, A:0.91)
RNN | GRU RNN with and without attention mechanism [81] | AUC(80.41) with majority labels, Spearman correlation with human probabilistic gold labels (52.51) | word-list (AUC:64.19, Sp:24.33); [111] method (AUC:75.67, Sp:43.80)
CNN | word2vec+CNN [49] | P(0.8566), R(0.7214), F1(0.7829) | LR with character n-grams (P:0.7287, R:0.7775, F1:0.7389)
CNN | CNN+GloVe [9] | P(0.839), R(0.840), F1(0.839) | see above
CNN | CNN with character embeddings [29] | F1(0.73), A(0.94) | see above
CNN | CNN [81] | AUC with majority labels (76.03), Spearman correlation with human probabilistic gold labels (42.88) | see above
FastText | FastText+Random Embedding+GBDT [9] | P(0.886), R(0.887), F1(0.886) | see above
Combination | CNN + GRU [119] | F1(0.94) | SVM (0.89), CNN (0.90)

Table 2.5: Comparison of Deep Learning methods for undesirable speech classification. LR (logistic regression), SVM (Support Vector Machine), NB (Naive Bayes)

Evaluation of the algorithms
To evaluate and compare the algorithms, the labels collected by crowdsourcing are considered as ground truth. The datasets are separated into a training and a test set: the algorithms are trained on the first subset and tested on the second one. The papers usually measure the accuracy, precision and recall of the prediction results (see Tables 2.3, 2.5).
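As an illustration of this evaluation protocol, the following sketch splits a dataset, trains a classifier, and computes the metrics reported in Tables 2.3 and 2.5; the data and the model are stand-ins chosen for brevity.

```python
# Minimal sketch of the usual evaluation protocol: split, train, and compute
# the metrics used in Tables 2.3 and 2.5. Data and model are illustrative.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:, 1]

print("A  :", accuracy_score(y_test, pred))
print("P  :", precision_score(y_test, pred))
print("R  :", recall_score(y_test, pred))
print("F1 :", f1_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, scores))
```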

Limitations of current techniques
We saw in the previous section 2.1 that three types of variables influence how hate speech is perceived (sentence characteristics, sentence context, and people's background). In all these studies, a unique label per sentence is taken into account, which is predicted based on the extracted sentence characteristics. In addition to the sentence features, research is done to judge whether a sentence is offensive based on who wrote the sentence and in which context; we can consider that the first two variables of the theoretical studies are thus taken into account. However, no research is done to predict how people perceive a sentence depending on their background and beliefs, i.e. using additional context about the people besides the sentence context. This is a simplification which merits to be addressed. In the recent survey [94] on hate speech detection listing the employed methods, not taking into account people's beliefs is highlighted as one main limitation of current research: "Unlike other tasks in NLP, hate speech may have strong cultural implications, that is, depending on one's particular cultural background, an utterance may be perceived as offensive or not". Therefore, addressing the subjectiveness of the perceived offensiveness of sentences on the Web is a new and useful task. Additionally, Montoyo et al. [76], in their literature survey on text sentiment analysis and subjectivity, highlight an important limitation of current studies: they mention that sentiment and subjectivity depend on social and cultural aspects, but neither the writer's nor the reader's interpretations and personal backgrounds are considered in current studies. They also mention that depending on the news source "(i.e. in terms of bias, reputation, trust)", the sentiment perception of the reader might be different, within the same or different countries and cultures.

2.2.3. Dataset gathering methods
The dataset specifications for the previously cited papers are listed in Tables 2.7 and 2.8.

Available datasets
Table 2.6 lists the datasets which were made public by their authors. Most of the available datasets are too small to train Deep Learning algorithms. However, a few of them should be large enough, and they could also be aggregated together. We could also use resampling methods such as SMOTE [23] to artificially upsample the data and obtain larger and more balanced datasets.
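As an illustration, the following sketch applies SMOTE [23] through the imbalanced-learn package to rebalance a skewed dataset; the feature matrix is synthetic and only stands in for vectorized comments.

```python
# Minimal sketch of oversampling with SMOTE [23] via imbalanced-learn,
# assuming vectorized comments X and binary toxicity labels y.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],  # ~10% "toxic"
                           random_state=0)
print(Counter(y))                       # e.g. {0: ~900, 1: ~100}

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                   # both classes now equally represented
```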

Data collection
Most of the datasets consist of data scraped from websites (YouTube video comments, Twitter posts and their comments, Wikipedia article comments, news article comments, question-answering forums). These data are then annotated using crowdsourcing. Two different collection methods are used. In some papers, all comments/QA are collected. In other papers, a filter is used to only collect data which contain specific words (lists of words usually found in hate speech), and other random data are collected to constitute the "false" class of the dataset. The data are then annotated to precisely divide them into two classes. The problem which can be encountered here is an unbalanced dataset: usually, more negative samples than positive samples are collected. This issue of the unbalanced dataset will have to be addressed carefully in our experiments. Certain papers perform data augmentation simply by duplicating some samples of the positive class to solve the issue.

Data annotation
The data are annotated by a few selected people (students or professors at a university) or by crowdsourcing on crowdsourcing platforms, mainly Mechanical Turk or Crowdflower. For example, Sood et al. [100] collect comments from Yahoo! Buzz and have them annotated using Mechanical Turk. In all the papers, non-expert workers are asked to annotate the samples. To label the data, majority voting is usually used to resolve the disparity between annotator opinions. For example, Reynolds et al. [87] ask three annotators to label the data and classify a label as positive if at least two of the three annotations are positive. In some cases, a consensus threshold is decided, like in [100], and only the labels which reach a value higher than the threshold are kept in the dataset.

Provenance | Source | Dataset size | Labels
[36] | Twitter | 24802 tweets (and the number of annotations for each tweet and category - 3 or more) | hate, offensive but not hate, none of the two
Kaggle competition a | - | 3947 sentences | neutral, insulting
[77] | Yahoo finance, Yahoo news | 759402 finance, 1390774 news | neutral, abusive (further classified into hate speech, profanity, derogatory language)
[108] | Twitter | 16914 | racist, sexist, neutral
[112], [87], [73] | - | 684 sentences from the Web, 13652 posts (question answering), 626 tweets | cyberbullying or not
[50] | FoxNews comments | 1528 | hateful, not hateful
[81] | Greek news website | 1.6M | aggressive, not aggressive
Kaggle competition 2 b | Wikipedia comments | 95851 sentences, 95851 labels | binary labels: toxic, severe toxic, obscene, threat, insult, identity hate
[111] c (same data as Kaggle competit. 2) | Wikipedia comments d | between 100k and 160k samples, 1598289 judgments (about 10 per sample) e | not/personal attack; not/toxic with a score in [-2, 2]; binary labels with each annotator and personal background (gender, first language, age group, education)
[119] | Twitter | 2435 tweets (414 hate, 2021 non-hate) | hate/non-hate for tweets about Muslims and refugees

a https://www.kaggle.com/c/detecting-insults-in-social-commentary/data
b https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
c https://meta.wikimedia.org/wiki/Research:Detox/Data_Release
d https://conversationai.github.io/
e https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973

Table 2.6: Summary of available datasets

Sood et al. [100] had a minimum of three annotators per sample, and only kept the labels (6009) for which the consensus was higher than 60%. Chen et al. [25] study the agreement of the judgments and highlight the low agreement even with only 3 categories (abusive, not abusive, undecided). This supports the idea that the labels should be "parametrized" by people's "beliefs". In [77], a short comparison of labels given by trained annotators and by crowdsourcing suggests that annotations by crowd workers have a lower quality; hence the use of user modeling (an expertise model) to choose which worker annotates which sample.
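The two aggregation schemes mentioned above can be sketched as follows: plain majority voting (as in [87]) and a consensus threshold that discards low-agreement samples (as in [100]); the example annotations are invented.

```python
# Minimal sketch of majority voting and of a consensus-threshold aggregation
# that discards low-agreement samples. Annotations are illustrative.
from collections import Counter

def majority_vote(annotations):
    """Return the most frequent label among the annotations."""
    return Counter(annotations).most_common(1)[0][0]

def consensus_label(annotations, threshold=0.6):
    """Return the majority label only if its share exceeds the threshold,
    otherwise discard the sample (return None)."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) > threshold else None

print(majority_vote(["toxic", "toxic", "not_toxic"]))          # toxic
print(consensus_label(["toxic", "not_toxic", "not_toxic"]))    # not_toxic (2/3 > 0.6)
print(consensus_label(["toxic", "not_toxic"]))                 # None (0.5 <= 0.6)
```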

Questions asked for the annotation
Usually, a simple yes/no question is asked. Reynolds et al. [87] ask "1) Does this post contain cyberbullying (Yes or No)?". They additionally asked "2) On a scale of 1 (mild) to 10 (severe) how bad is the cyberbullying in this post (enter 0 for no cyberbullying)? 3) What words or phrases in the post(s) are indicative of the cyberbullying (enter n/a for no cyberbullying)? 4) Please enter any additional information you would like to share about this post.", to collect data for future research, but have not exploited them yet. Similarly, Burnap et al. [19] ask the question "is this text offensive or antagonistic in terms of race, ethnicity or religion?" with the answers yes, no, undecided. Sood et al. [99] work on profanity detection and, after a pilot study, define their crowdsourcing question as follows: "Does this message contain any content you would describe as 'profanity?' (including profanity that is disguised or modified such as @ss, s***, and biatch) (Yes/No)". They conducted a pilot study [100] to identify how to formulate the questions for annotation, and report that asking to rate comments on a valence scale is too difficult and vague for the annotators. They annotate the use of profanity, insults, and the target of the insult. Gamback et al. [49] also ran several crowdsourcing tests to set up their crowdsourcing questions, and chose the question with the highest inter-annotator agreement. The final question is "Does the comment contain a personal attack or harassment? Targeted at the recipient of the message (i.e. you suck). Targeted at a third party (i.e. Bob sucks). Being reported or quoted (i.e. Bob said Henri sucks). Another kind of attack or harassment. This is not an attack or harassment." Chen et al. [25] ask "Is the comment abusive or not?" with the answers yes/no/undecided. If a comment is said to be abusive, the annotator is asked to rate it on a scale from 1 to 4, in which 1 is very slightly abusive and 4 is very strongly abusive. Disagreement between workers is explained by several facts: the workers make mistakes or spam, or the samples are ambiguous. After removing the low-quality workers, [25] highlights the subjectivity of the task by explaining that less than 40% of the samples got total agreement between the annotators. Interestingly, non-abusive comments have 41% of unanimous labels while abusive comments only have 27%, which suggests that subjectivity is even more important for abusive content. Waseem et al. [108] give a precise definition of offensive comments following several criteria: "A tweet is offensive if it 1) uses a sexist or racial slur. 2) attacks a minority. 3) seeks to silence a minority. 4) criticizes a minority (without a well-founded argument). 5) promotes, but does not directly use, hate speech or violent crime. 6) criticizes a minority and uses a straw man argument. 7) blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims. 8) shows support of problematic hashtags. E.g. "#BanIslam", "#whoriental", "#whitegenocide" 9) negatively stereotypes a minority. 10) defends xenophobia or sexism. 11) contains a screen name that is offensive, as per the previous criteria, the tweet is ambiguous (at best), and the tweet is on a topic that satisfies any of the above criteria." They say that potential disagreements come from people's subjectivity and from the lack of context surrounding the tweets, which makes people unable to determine whether the message is hateful or not.
Gao et al. [51] also give a detailed description of offensiveness in their crowdsourcing task: "tweets that explicitly or implicitly propagate stereotypes targeting a specific group whether it is the initial expression or a meta-expression discussing the hate speech itself". Similarly, in [50] they define hate speech as "the language which explicitly or implicitly threatens or demeans a person or a group based upon a facet of their identity such as gender, ethnicity, or sexual orientation".

2.2.4. Discussion
In this section, we were interested in the current automatic prediction algorithms used for the classification of toxicity-related speech and in the creation of datasets to fulfil this task.
RQ3. Concerning the prediction of toxicity-related speech, the studies use different datasets and do not always use the same performance metrics; therefore, it is not easy to compare the algorithms' performance. In Machine Learning, several algorithms are consistently used: Logistic Regression, Support Vector Machine and Multi-Layer Perceptron. For Deep Learning, it might be necessary to test both CNN and RNN neural networks in the experiments.
RQ1. Considering the collection and annotation of the dataset, there are already datasets available related to undesirable speech. However, in all these datasets, there is no mention of the subjectivity of the task (a sentence being undesired by one person but not by another), except in [111]. Thus, these datasets could be used as samples to be annotated by the crowd in order to estimate how subjective the data are, and the current labels could serve as "ground truth" labels to investigate how the opinions of several workers differ from the majority opinion. Based on [25], two different tasks could be addressed. The first one is to classify sentences as undesirable, not undesirable and undecided based on people's beliefs. The second task - which might be more complex since the disagreement between people is probably higher - is to predict the undesirability rating (from 1 to 4) that people would attribute to the sentences based on their beliefs. All the studies identify labels by asking only one question per sample. However, the psychological studies used several questions and aggregated their results to determine whether a sentence is offensive or not. It might be useful to investigate whether the answer to one question and the aggregation of several judgements give similar results, or whether the algorithms learn better with one or the other method.
Table 2.7: Dataset gathering in papers for undesirable speech detection using Machine Learning. For each paper, the table reports the dataset source, the data collection method, the annotating crowd, the annotation and aggregation method, the data augmentation or post-processing applied, and the number of data samples.

Table 2.8: Dataset gathering in papers for undesirable speech detection using Deep Learning. For each paper, the table reports the dataset source, the data collection method, the annotating crowd, the annotation and aggregation method, the data augmentation or post-processing applied, and the number of data samples.

2.3. Machine Learning and Deep Learning for subjective tasks
Little research has been performed on taking into account the subjectivity of samples in prediction tasks, although it is an important issue since subjective data cannot always be described with one unique label. Instead of considering unique ground-truth labels for subjective NLP tasks (which would not be representative of reality), it has been suggested to consider several labels as acceptable for each sentence [2]. This new approach to dataset creation for the training of Machine Learning algorithms is presented as an approach which merits attention [2]. We investigate the current research related to it.

2.3.1. Methodology to search for the papers
To search for algorithms related to subjective data prediction or to predictions adapted to different conditions, we used two kinds of searches. Using the keyword "subjective" with different algorithm names returns algorithms which classify speech as subjective or not, or which classify the sentiment of speech, but these do not investigate the different interpretations of one speech. Thus, we used more specific queries by adding "user model" or "user modelling" to the initial queries, and also "chatbot" or "conversation" because the chatbot field has literature on response adaptation. We selected all the Deep Learning and Machine Learning papers which integrate some kind of modelling of an external condition into the training process.

2.3.2. Machine Learning and subjective tasks
Beigman et al. [10] consider that there are "easy" and "hard" to label instances. They suggest that Machine Learning classifiers should be trained on the easily annotated samples so that at least these samples are well predicted. Similarly, the subjectivity of the samples can be quantified as the variance over the annotations of each sample, and algorithms are reported to have higher performances when trained only on the less subjective data [110]. Moreover, it was shown that training algorithms with soft labels corresponding to the annotators' agreement over the sample label gives slightly higher performances than training with all the data, and that filtering out the samples with the lowest agreement rates enables training algorithms with higher performances than when using soft labels [64]. In all these set-ups, however, the differing opinions on the samples are "ignored". Alonso et al. [3] relax this assumption by considering that a ground-truth label exists for each sample, but that for the samples where annotators do not agree with it, the algorithms are "allowed" to make errors (if these errors are the same as human errors). When training the Machine Learning algorithm, they use a cost-sensitive loss which takes into account these possible human errors. Finally, Reidsma et al. [85, 86] differentiate between annotation tasks of manifest content (directly observable) and of "projective latent content" (the annotations depend on the annotators' mental conception of the samples and the possible categories). They explain that the latter tasks exhibit higher disagreement rates since the annotations rely on the subjective interpretation of the sample by each annotator, but they do not try to resolve this disagreement. To keep the subjectivity present in the Machine Learning algorithms, they define two different architectures and training procedures: 1) they propose to train algorithms solely on the data which have a high agreement rate; 2) or they train one classifier per annotator and use an additional voting classifier which returns an output only when the annotator-specific classifiers agree. Their aim is to investigate relations between the voting classifier's decisions and the annotators' disagreement. Their setup was tested on datasets with only three annotators, whereas we have around 3000 annotators. We differ from this study because we want to output predictions tuned to each person, and since we have a large number of annotators with differing numbers of annotations, it is not feasible to learn one classifier per annotator.
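The soft-label idea of [64] can be illustrated as follows: each sample is trained against the fraction of annotators who judged it toxic rather than against the aggregated 0/1 label; the features and annotation counts below are illustrative assumptions, and binary cross-entropy accepts such soft targets directly.

```python
# Minimal sketch of training with soft labels derived from annotator
# agreement: the target is the fraction of annotators who judged the sample
# toxic, instead of the majority-vote 0/1 label. Data is illustrative.
import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(200, 50)                        # pre-computed text features
n_toxic = np.random.randint(0, 6, size=200)        # toxic votes out of 5
soft_y = (n_toxic / 5.0).reshape(-1, 1)            # agreement-based soft label

model = models.Sequential([
    layers.Input(shape=(50,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
# Binary cross-entropy accepts soft targets in [0, 1] directly.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, soft_y, epochs=2, verbose=0)
```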

2.3.3. Deep Learning models integrating subjectivity into their predictions
We did not find literature on Deep Learning in which each annotation in the dataset is used instead of labels resulting from annotation aggregation.

Deep Learning architectures integrating user modelling
Some research is interested in adapting model outputs to each user specifically. Liu et al. [72] investigate user modelling for response ranking in chatbots and propose a new method to build the user profile and integrate it into neural networks. As they present it, related works which perform user modelling usually take the conversation histories to infer implicit features about the users. However, we are interested in using the explicit features that the psychology literature mentions, thus we will need to replace these implicit feature models with explicit feature representations. Liu et al. [72] make use of implicit features. First, they learn a post embedding and a user model embedding with a user modelling network. Then, they integrate this model into the ranking neural network as an input. They propose two different methods: 1) the first one simply concatenates the user model, the post and the response to rank as input; 2) the second method separately computes post and response representations conditioned on the user profile, by inputting the user profile and the post or response into two different fully connected layers. Finally, these two new representations are combined by another fully connected layer. The weights of the network are learned by back-propagation using a cross-entropy loss function. In a similar way, Li et al. [70] learn a user embedding at the same time as they train their neural network architecture on conversation histories. However, they integrate the embedding differently: it is added as an additional component of each cell in the LSTM decoder network, so that it is considered a "parameter" of the network. Tang et al. [104] integrate the user model into their neural network for review ratings by learning a user matrix for each user, and multiplying each sentence input by this matrix. It is learned by back-propagation with a cross-entropy loss function which includes the matrices as a regularization term. The user matrices are composed of a low-rank approximation specific to each user and a global diagonal common to all users, which enables using this common part for previously unknown users. If we consider using this method, we could additionally add our explicit user features in another way. We could compare having one user matrix per annotator or one user matrix per demographic category.
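As a minimal sketch of how such a user model could be integrated in our setting, the following Keras model concatenates a sentence encoding with an explicit annotator-profile vector before predicting toxicity, in the spirit of the concatenation-based method of [72]; the layer sizes and the profile encoding are assumptions.

```python
# Minimal sketch of concatenating an explicit annotator profile (e.g. one-hot
# encoded gender, age group) with the sentence encoding before prediction,
# in the spirit of the concatenation-based integration of a user model [72].
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, PROFILE_DIM = 100, 20000, 8   # assumed sizes

sentence_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="sentence")
profile_in = layers.Input(shape=(PROFILE_DIM,), name="annotator_profile")

sentence_vec = layers.LSTM(64)(layers.Embedding(VOCAB_SIZE, 128)(sentence_in))
profile_vec = layers.Dense(16, activation="relu")(profile_in)

merged = layers.Concatenate()([sentence_vec, profile_vec])
output = layers.Dense(1, activation="sigmoid", name="toxicity")(merged)

model = models.Model(inputs=[sentence_in, profile_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```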

Deep Learning architectures with conditioned inputs
We also searched for research interested in conditioning the model outputs on different kinds of information, since it could inspire the design of algorithms conditioned on annotator specificities. In [102], a Deep Learning architecture is defined to memorize prior information contained in sentences before answering questions. For that, the informational sentences are entered into the network as memory cells, while the input question is, on one side, transformed by these memory cells to form one output, and on the other side kept unchanged to form a second output. Then, these two outputs are added and passed through a softmax layer to produce the answer to the question. Similarly, Joshi et al. [65] input a user profile into the memory network so that the network outputs are conditioned on this profile. The profile is passed to the network in the same way as a conversation history would be, and is defined by words such as the gender, age, and favourite food. Zhang et al. [117] are interested both in giving a personality to chatbots and in adapting the chatbot utterances to the person it is interacting with. For that, they define the personality as a set of sentences describing a person's interests (for example, one persona in [117] is defined as: "RPGs are my favorite genre. I also went to school to work with technology. The woman who gave birth to me is a physician. I am not a social person. I enjoy working with my hands."). Then, they define new Deep Learning models for response generation taking into account the two sets of personalities (chatbot and interlocutor). The first model simply extends the input by concatenating the persona with the sentence input, and feeds them to the LSTM neural network. The second one (called generative profile memory network) enters the persona into memory cells on which the output is conditioned. Another method used to personalize Deep Learning outputs to each user is to choose some parameters of the network and make them user-dependent. The networks are then trained by back-propagating the error on the user-specific training data to the user-specific parameters. This is done, for example, to train speech recognition neural networks personalized for each speaker [103]. Finally, in the field of recommendation systems, some examples can be found of ways to integrate user models into Deep Learning models. Mainly, the users are represented as a set of features which is concatenated to the other inputs; for example, the features can be represented as normalized continuous values between -1 and 1, as in [31].
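In contrast to the explicit-profile sketch above, the per-user parameters and user embeddings discussed here can be approximated by learning an embedding per annotator identifier, without any explicit profile; the following sketch is such an approximation, with all sizes chosen arbitrarily.

```python
# Minimal sketch of conditioning the prediction on a learned per-annotator
# embedding looked up from the annotator ID only (no explicit profile),
# in the spirit of the user embeddings / user-dependent parameters above.
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, N_ANNOTATORS = 100, 20000, 3000   # assumed sizes

sentence_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="sentence")
annotator_in = layers.Input(shape=(1,), dtype="int32", name="annotator_id")

sentence_vec = layers.LSTM(64)(layers.Embedding(VOCAB_SIZE, 128)(sentence_in))
annotator_vec = layers.Flatten()(
    layers.Embedding(N_ANNOTATORS, 16)(annotator_in))  # learned per-annotator vector

merged = layers.Concatenate()([sentence_vec, annotator_vec])
output = layers.Dense(1, activation="sigmoid")(merged)

model = models.Model([sentence_in, annotator_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```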

2.3.4. Discussion
RQ3. The literature where algorithm outputs are adapted to specific users is mainly related to chatbot personalization. We can investigate how to adapt these neural network architectures to our problem in order to integrate a user model into the learning process. Another possibility to make the outputs user-specific is to use the identifier of each user, without additional user modelling, in order to model the correlations between annotations and users without considering the properties of these annotators. A future extension of our work could be to use techniques for user modelling based on previous conversations and to integrate them into our offensiveness prediction network, in order to take into account the context of the sentences.

2.4. Definition and evaluation method of algorithmic fairness
The fairness of Machine Learning algorithms is an important property to take into account, considering that they are now used for human-related tasks such as credit scoring in mortgage or consumer credit, racial profiling, ... [90]. Usually, the performance of algorithms is evaluated by computing a metric such as the error between the algorithms' outputs and the expected outputs on a test dataset. However, these metrics are not informative about the fairness of the algorithms. That is what Chouldechova et al. [28] show on the task of predicting recidivism: they compare the performance of two algorithms using global metrics like the accuracy and the Area Under the Curve, and show that even though these values are very similar for the two, their fairness performances are very different. This is why it is necessary to create specific metrics to assess the fairness of algorithms. However, still little research, usually found in the FAT* conference5 (Fairness, Accountability, and Transparency) and the FAT/ML conference6 (Fairness, Accountability, and Transparency in Machine Learning), is interested in defining and evaluating the fairness of Machine Learning algorithms' outputs [46], mainly in investigating whether they discriminate against certain categories of the population. In a majority of the papers, the definition of fairness is confused with the metrics to measure it, and sometimes with the solutions to make the algorithms fairer.

2.4.1. Methodology to search for the papers
To search for definitions and evaluation metrics dealing with the fairness of Machine Learning algorithms, we did a Google Scholar search and went through all the papers of the new FAT* and FAT/ML conferences, as well as the references of the retained papers. We selected the papers which mention evaluation metrics of learning algorithms related to bias or fairness, because bias is often related to the fairness of the algorithms: for example, a gender bias can be considered unfair towards one gender.

2.4.2. Definitions of algorithmic fairness
Dictionary definitions
According to the dictionary, fairness is "the quality of treating people equally or in a way that is right or reasonable"7, or is also defined as an "impartial and just treatment or behaviour without favouritism or discrimination"8. The definition of fairness is therefore not strict. Certain definitions target equality between individuals, while others do not specify which entities fairness relates to. Similarly, certain definitions deal with equality while others do not, but instead mention impartiality of treatment and the absence of discrimination.

Definitions from a Machine Learning perspective
Certain Machine Learning algorithms are used to classify people, or use information related to people to perform classification tasks. People are represented by two types of features: protected features (features such as race, gender, or religion on which people should not be discriminated) and non-protected features (other features used to describe people, such as their age, the number of prior convictions, ...). These features might constitute the inputs or part of the inputs of the system which makes a prediction, or correlations might exist between the actual inputs of the system and these features. Several essays [12, 106] investigate fairness and discrimination in machine learning from a philosophical point of view. Most papers agree on the definition of algorithmic fairness: a fair algorithm is an algorithm whose outputs do not discriminate between different classes of people. The papers only differ in the details of the formulation of the definition. For example, Binns [12] sees ML algorithms' unfairness as the "differences in treatment between protected and non-protected groups". This means that a chosen metric to evaluate the performance of an algorithm should have equal values when evaluated for different categories of the population, these categories being defined by differentiating between people whose protected features' values are the same or different. Similarly, Kamishima et al. [66] define fair predictions as "unbiased and non-discriminatory in relation to sensitive features such as gender, religion, race, ethnicity, handicaps, political convictions, and so on" (sensitive features being another name for protected features).

5https://fatconference.org/
6https://www.fatml.org/
7https://dictionary.cambridge.org/dictionary/english/fairness
8https://en.oxforddictionaries.com/definition/fairness

Zliobaité [120] and Agarwal [1] present a similar but more precise definition of ML algorithms' fairness, still assimilating fairness to non-discrimination. They consider an algorithm fair when it verifies the demographics-parity definition: the predictions of the algorithm are independent of any protected attribute. Precisely, for Zliobaité [120], "(1) people that are similar in terms of non-protected characteristics should receive similar predictions, and (2) differences in predictions across groups of people can only be as large as justified by non-protected characteristics", which means that 1) the predictions should not depend on the protected characteristics but only on the non-protected ones, and 2) if the protected characteristics are correlated with the non-protected ones, the outputs should be analysed and justified. Kamishima et al. [66], on the contrary, see these two cases not as actual definitions of fair predictions but as potential causes of unfairness. They describe three possible causes of unfairness: direct or indirect prejudice (corresponding respectively to cases (1) and (2) of Zliobaité [120]), underestimation (the difference between the real distribution of the classified attribute conditioned on the protected feature and the distribution learned by a Machine Learning model trained on a finite dataset), and negative legacy (the unfair sampling or labelling of the training data, which results in biases of the trained models). These three aspects of unfairness, contrary to the other definitions, do not refer directly to the difference in treatment between different categories of the population, but investigate the unfairness for each category of the population without comparing them explicitly.
Corbett-Davies et al. [30] also consider fairness as equal treatment of protected and non-protected categories of people, but they highlight three trends in the definitions and corresponding evaluation methods of fairness for the case of algorithms made to select defendants to detain: statistical parity (equal proportion of defendants detained in each protected group), conditional statistical parity (equal proportion of defendants detained in each protected group, conditioned on some reasonable attribute of the defendants, such as the number of prior convictions), and predictive equality (equal accuracy of decisions across race groups, measured with the false positive rate).
Zemel et al. [116] highlight that there are two different goals related to fairness: group fairness (which is equivalent to statistical parity) and individual fairness. Group fairness is limited because it might have unwanted and unfair consequences (for example when ensuring statistical parity by choosing unqualified individuals in the protected group); individual fairness aims at correcting this by ensuring that individuals who are similar with respect to a particular task are classified similarly.
Zafar et al. [115] object that there is a trade-off between increasing the fairness of algorithms according to the current definitions of fairness (parity of treatment or impact between categories of the population) and ensuring a high global accuracy of the algorithms. To avoid this trade-off, they propose a relaxed definition of fairness, which they call preference-based fairness: "under preferred treatment, no group of users (e.g., men or women, blacks or whites) would feel that they would be collectively better off by switching their group membership (e.g., gender, race)."
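To make these group-level criteria concrete, the following minimal sketch computes the statistical-parity gap and the predictive-equality gap (difference in false positive rates) between two groups. The DataFrame layout, column names and toy data are our own illustrative assumptions, not taken from the cited works.

```python
import pandas as pd

def statistical_parity_gap(df, group_col, pred_col):
    """Difference in positive-prediction rates between the two groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return abs(rates.iloc[0] - rates.iloc[1])

def predictive_equality_gap(df, group_col, pred_col, label_col):
    """Difference in false positive rates between the two groups."""
    negatives = df[df[label_col] == 0]
    fpr = negatives.groupby(group_col)[pred_col].mean()
    return abs(fpr.iloc[0] - fpr.iloc[1])

# Hypothetical toy data: binary protected attribute, prediction and ground truth.
toy = pd.DataFrame({
    "group":      [0, 0, 0, 1, 1, 1],
    "prediction": [1, 0, 1, 0, 0, 1],
    "label":      [1, 0, 0, 0, 0, 1],
})
print(statistical_parity_gap(toy, "group", "prediction"))
print(predictive_equality_gap(toy, "group", "prediction", "label"))
```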
Hardt et al. [55] object to the demographics-parity definition, which they claim is not appropriate for fairness because it only aims at getting equal positive-classification percentages across groups but does not make sure that the positive classifications in each group correspond to the ground truth classifications. Instead they propose a definition relying on equalized odds: "a predictor Yp satisfies equalized odds with respect to protected attribute A and outcome Y, if Yp and A are independent conditional on Y", Y being the ground truth. This is equivalent to the equality of true positive and true negative rates across groups for binary classification problems. They also propose a relaxed definition called equal opportunity: "a binary predictor Yp satisfies equal opportunity with respect to A and Y if Pr(Yp = 1 | A = 0, Y = 1) = Pr(Yp = 1 | A = 1, Y = 1)", which is only the equality of the true positive rates. Zafar et al. [114] propose a similar definition that they call disparate mistreatment: a model suffers from disparate mistreatment when the misclassification rates differ for groups of people from the different protected and non-protected categories. The only difference lies in the metric chosen for the calculation of the misclassification rates, which can be the overall misclassification rate, the false positive rate, the false negative rate, ...
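Restating the main criteria above in a common notation, with Yp the prediction, Y the ground truth and A a binary protected attribute (our notation, following the quoted definitions):

```latex
% Demographics parity: predictions independent of the protected attribute
\Pr(\hat{Y}=1 \mid A=0) = \Pr(\hat{Y}=1 \mid A=1)

% Equalized odds (Hardt et al. [55]): independence conditional on the ground truth,
% i.e. equal true positive rates and equal false positive rates across groups
\Pr(\hat{Y}=1 \mid A=0,\, Y=y) = \Pr(\hat{Y}=1 \mid A=1,\, Y=y), \qquad y \in \{0,1\}

% Equal opportunity: the relaxation that only constrains the true positive rates (y = 1)
\Pr(\hat{Y}=1 \mid A=0,\, Y=1) = \Pr(\hat{Y}=1 \mid A=1,\, Y=1)
```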

Most papers are interested in fairness as related to the discriminatory power of the algorithms' predictions (the relationship between the outputs of the algorithms and implicit features) due to the trained prediction models or the dataset used to train them. However, there is also another, rarer, direction to study fairness. Certain researchers such as Binns et al. [13] study the bias introduced by the persons who participated in the creation of the training dataset, and its effects on the fairness of the algorithms' outputs, which is close to the negative legacy [66]. For them, a fair algorithm is an algorithm which exhibits equal performance for the different categories of the population who participated in the dataset annotation (the categories of the population being, again, usually based on protected features).

2.4.3. Metrics to characterize and evaluate algorithmic fairness
Zliobaité [120] surveys the traditional metrics to measure discrimination in algorithms by distinguishing between statistical, conditional, structural and absolute measures. These metrics all compare the performance or distributions computed over the predictions of the algorithms across groups of people. For example, Chouldechova et al. [28] define "fairness metrics" as metrics which correspond to differences in a particular classification metric across groups, and they take the example of the difference between the false positive rates calculated for one category of the population and the other as an indication of possible unfairness of the algorithm. Binns [12] lists these current metrics for the fairness of algorithms, and points out that none of them is preferable to the others, but that combining them to optimize the algorithms is mathematically impossible, and thus one metric has to be chosen depending on the task at stake. It is advised not to aim at equalizing the percentages of positive/negative classifications between the identified groups because this would not take into account legitimate discriminations. Thus, more "nuanced" measures are mentioned, such as equalizing the "accuracy equity" (the accuracy of the classifier for each group). Kamishima et al. [66], on the contrary, define indexes which enable the evaluation of the different causes of unfairness, without explicitly comparing the different groups.
Binns et al. [13] take a different angle in their evaluation of algorithmic fairness because they investigate fairness towards the dataset's annotators, by comparing the performance of algorithms for data corresponding to different categories of the population. They show on a toxicity dataset that the implicit norms of the annotators of the data samples lead to discriminating biases in the dataset and consequently in the automatic prediction systems. Specifically, they show that the gender of the annotators influences the final labels of the samples in the dataset, and therefore training one unique classifier on the majority-vote labels leads to higher prediction performance for one of the genders' collective judgements (majority vote). This quantification of the inequality of accuracy performance on the majority vote for different populations is what they define as unfairness of the system.
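As a sketch of the "accuracy equity" idea transposed to annotator groups (the setting of Binns et al. [13]), the code below computes a model's accuracy against each group's own majority vote; the long-format annotation table, its column names and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical long-format annotations: one row per (sentence, worker) pair.
annotations = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "gender":      ["f", "m", "f", "m", "f", "m"],
    "toxic":       [1, 0, 1, 0, 0, 1],
})
predictions = {1: 1, 2: 0}  # assumed model output per sentence

def per_group_accuracy(annotations, predictions, group_col):
    """Accuracy of the model against each group's own majority vote."""
    scores = {}
    for group, sub in annotations.groupby(group_col):
        majority = sub.groupby("sentence_id")["toxic"].agg(
            lambda s: int(s.mean() >= 0.5))
        correct = [predictions[sid] == lab for sid, lab in majority.items()]
        scores[group] = sum(correct) / len(correct)
    return scores

print(per_group_accuracy(annotations, predictions, "gender"))
```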

Several papers are interested in automatically identifying for which sub-groups of the population the algorithms are unfair. Zhang et al. [118] investigate the automatic identification of the sub-groups of protected characteristics for which the algorithms are biased, and propose an iterative algorithm which outputs the most biased sub-groups. Zliobaité et al. [121] are particularly interested in fairness for algorithms which classify samples that are partly described by sensitive features. They propose measures to quantify how much the outputs of the algorithms are biased towards certain feature values, and how much lower the accuracy for certain categories of the population is than the accuracy for the most frequent demographics. Chouldechova et al. [28] work on the comparison of the fairness of different algorithms. They propose a method to automatically find the sub-groups of the population for which the difference in fairness performance between two algorithms is the largest.

2.4.4. Methods to mitigate algorithmic unfairness
Certain research aims at making the algorithms independent of the sensitive features [121] to make them fairer. It has been shown that the unfairness is partly due to datasets which are not balanced with respect to these sensitive features, and different methods are used to resolve the unfairness. Certain papers propose to resample the datasets. For example, Buolamwini et al. [18] investigate biases present in automated facial analysis algorithms and datasets with respect to phenotypic subgroups (based on gender and skin color). They divide the population present in the dataset into categories and measure how well represented they are in the full dataset. They propose a fairer dataset by balancing these categories. Additionally, they define a non-biased evaluation of current algorithms by computing their accuracy separately for each category. Other papers define new objective functions [116, 121], or regularization terms for the objective functions of the classifiers [66], which take the discrimination into account. Others propose to train different classifiers for different demographics, such as in [13, 45]. Binns et al. [13] train classifiers separately for each category of the population defined with the protected attributes (gender in their case) to make the predictions fairer. They discover that the outputs' performance remains unfair for one of the genders. Second, they highlight that even within each gender category, there is disagreement on the annotations. Although they consider fairness as the equality of the algorithms' performance on collective judgements between different population categories, since there is disagreement even within these categories, it seems that considering fairness along this criterion might not be justified, and that each annotator's judgement should be considered separately.
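In the spirit of the dataset-balancing approach of [18], a minimal resampling sketch is given below, assuming a pandas DataFrame with one demographic column; the function and column names are placeholders of ours.

```python
import pandas as pd

def balance_by_category(df, category_col, random_state=0):
    """Downsample every demographic category to the size of the smallest one,
    so that all categories are equally represented in the training data."""
    smallest = df[category_col].value_counts().min()
    return (df.groupby(category_col, group_keys=False)
              .apply(lambda g: g.sample(n=smallest, random_state=random_state)))

# Example usage (hypothetical): balanced = balance_by_category(train_df, "gender")
```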

2.4.5. Discussion RQ2
The increase in research tackling fairness and accountability of Machine Learning algorithms is very recent. Although there are multiple definitions of fairness, we noticed that unfairness is usually considered as discriminative behaviour of the algorithms, discrimination being related to protected categories of the population. Similarly, we could define protected categories of the population for our task of toxicity prediction, mainly categories based on demographic information like age, gender and education level, since we saw previously that these variables influence the perception of toxicity. We could evaluate our models on the different categories separately (with accuracy or other metrics) and aim at balancing the error rates across groups to mitigate the potential unfairness.
Most definitions of fairness proposed in the literature are related to distinct groups of people. However, our final aim is not fairness as a discrimination-related property of algorithms identified by looking at specific categories defined by demographic information. We aim at achieving algorithmic fairness as a property related to the equity of the opinions' representation, by respecting all the different legitimate opinions in the dataset. Moreover, we are not exactly considering the same task in our problem, since our input samples are not the persons to classify and consequently do not depend on sensitive features (the demographic features); instead, the samples to classify are sentences whose associated labels depend on the persons who judge them and on their possible descriptive "sensitive" features.
Despite these differences, we can draw inspiration from these works to measure whether our algorithms are fair in representing the different opinions on a same sample. In order not to focus on possible demographic discrimination and to equally represent all the different judgements on one sample, we could evaluate the fairness of our models on an individual level instead of a group level, by reusing the metrics proposed at the group level in the fairness literature. For example, we could separate the dataset into bins of annotators or sentences divided according to specific criteria, such as the average agreement rate of the annotators with the opinion of the majority (equality of representation between annotators' opinions) or the sentence ambiguity (equality of representation between clear and ambiguous sentences), and compare the performance of the algorithms' predictions on these different bins.
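A minimal sketch of this bin-based evaluation, assuming one row per (prediction, annotation) pair together with the annotator's agreement rate with the majority vote (all column names are placeholders for how we might store the data):

```python
import pandas as pd

def accuracy_per_agreement_bin(df, n_bins=4):
    """Bin annotators by their agreement rate with the majority vote and report
    the prediction accuracy against the individual annotations inside each bin."""
    df = df.copy()
    df["bin"] = pd.qcut(df["worker_agreement"], q=n_bins, duplicates="drop")
    per_bin = df.groupby("bin").apply(
        lambda g: (g["prediction"] == g["annotation"]).mean())
    # The spread across bins can serve as an indicator of unfairness
    # towards the minority opinions.
    return per_bin, per_bin.max() - per_bin.min()
```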

2.5. Crowdsourcing and subjectivity
In the following section, we study the techniques employed to crowdsource subjective properties of data samples. The aim is to understand how to proceed to collect and evaluate a training dataset.

2.5.1. Methodology to search for the papers
In this section, we search for the literature dealing with the annotation of subjective tasks. We started by using the query "crowdsourcing subjective + (task or annotation)". Later on, we also found out that Quality of Experience experiments were related to subjective tasks, so we added the query "crowdsourcing quality of experience". First, we retained the papers explaining how to ensure the quality of the crowdsourcing outputs. We also chose the papers which explain methods or metrics to evaluate the quality of crowdsourcing tasks related to subjective data.

2.5.2. Crowdsourcing methodology for corpus annotations
Sabou et al. [91] present guidelines on how to operate crowdsourcing tasks for corpus annotations. They identify four steps in the crowdsourcing process and give recommendations for each: the project preparation, the data preparation, the project execution, and finally the data evaluation and aggregation. The project preparation consists in defining the specificities of the crowdsourcing task and how it will be presented to the workers. This is important when using crowdsourcing for subjective evaluations because techniques to ensure the annotation quality must be set up at this step; they should be supplemented by pre-filtering the crowd workers, for example based on their language proficiency or on their answers to training questions. The last step also merits investigation, since it might not be meaningful to aggregate all the annotations for one sample together when we want to leverage the subjectiveness of the worker judgements. These two steps are reviewed in the following sections.

2.5.3. Identification of variables influencing subjective annotations
Hossfeld et al. [59] perform a study to determine the parameters which have an influence on the evaluation of YouTube video quality. For that, they gather a number of parameters and compute metrics indicating correlations between these parameters and the annotations. First, they compute the Spearman rank-order correlation coefficient between the subjective user ratings and the variables. Then, they also use Support Vector Machines to classify between high and low-quality videos, and check which features (corresponding to the variables) obtain larger weights, which suggests more importance.
Ghadiyaram et al. [53] separately look at the influence of gender, age and experiment set-up, the last variable having more influence on image quality appreciation. For that, they divide the annotations of the workers belonging to the different demographic categories and compare whether they are similar or different.
Wulczyn et al. [111]9 investigate biases when crowdsourcing a dataset of toxicity in Wikipedia comments. They compare using binary labels ("probability that the majority of annotators would consider the comment an attack") and empirical distribution labels ("predicted fraction of annotators who would consider the comment an attack") to train Machine Learning classifiers, which they evaluate with the Spearman rank correlation and the Area Under the Curve. They show that aggregating the annotations as empirical distributions leads to better classifier performance. However, they do not investigate which crowd worker characteristics influence the ratings, even though their dataset records some of the workers' background information.
9https://github.com/conversationai/unintended-ml-bias-analysis
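For instance, a correlation analysis in the spirit of [59] could be run on the worker background variables of such a dataset roughly as follows; the column names are assumptions of ours, and the rank correlation is only meaningful for variables with a natural order, such as age group or education level.

```python
import pandas as pd
from scipy.stats import spearmanr

def rank_correlations(df, rating_col, variable_cols):
    """Spearman rank correlation between each worker variable and the rating."""
    results = {}
    for col in variable_cols:
        codes = pd.factorize(df[col], sort=True)[0]  # ordinal encoding of the variable
        rho, pval = spearmanr(codes, df[rating_col])
        results[col] = (rho, pval)
    return results

# Example usage (hypothetical column names):
# rank_correlations(annotations, "toxicity_score", ["age_group", "education"])
```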

2.5.4. Techniques to ensure crowdsourcing quality
General methods for ensuring quality in crowdsourcing
First, we look at the literature which investigates crowdsourcing for Quality of Experience (QoE) evaluation, which is a subjective task since this evaluation depends on people's appreciation of the samples based on their internal beliefs.
Hossfeld et al. [59] study the QoE of YouTube users. To collect subjective evaluations from the annotators while ensuring a certain annotation quality level, they use several techniques. First, they make use of gold standard questions over samples whose evaluation is objective. Additionally, they perform consistency tests: they ask workers the same question several times (formulated slightly differently) and eliminate the workers whose answers vary a lot. They also ask content questions (questions about the content of a sample) with objective answers, and design mixed answers: they vary the presentation of the answer scales so that a worker always clicking on the same answers will give inconsistent results. Finally, they suggest that user monitoring can be employed, for example by checking that the users spend at least a minimal amount of time on each question, which they do by monitoring the focus time on the crowdsourcing webpage. In summary, to filter the crowd workers, they proceed in three steps: they eliminate users who provide wrong answers to 1) content questions, mixed answers or consistency questions; 2) gold standard questions; 3) user monitoring verifications. They show that their method is efficient since each step eliminates 25% of the workers. Similarly, Redi et al. [84] combine several methods for ensuring quality, such as content questions, and target countries with a large English-speaking population. They also add a mandatory training for each worker. They filter the workers based on their answers to the content questions and on their attention time, and they remove the workers who are considered as outliers. Ghadiyaram et al. [53] pre-filter workers by selecting only the ones who have a high confidence value on Amazon Mechanical Turk. Additionally, they show the same samples several times and exclude workers whose annotation difference on these samples exceeds a certain threshold. Moreover, other studies [16, 74] perform worker selection by employing the workers who possess the specific skills required for the task (for example by modelling the workers based on their social media data).
Ribeiro et al. [88] choose not to perform pre-filtering of workers but do post-filtering. Once enough data are collected, they compute the Pearson sample correlation coefficient between the MOS estimates from a worker and the global MOS estimate. They eliminate the annotations from the workers whose correlation value is lower than a threshold. They additionally pay bonuses to the best performing workers. Snow et al. [97] recalibrate the data to correct the individual biases, by comparing each worker's annotations to the gold standard examples and modifying the labels according to these comparisons (detecting noisy annotators and anticorrelated workers).
Alonso et al. [4], although focusing on relevance assessment rather than subjective QoE evaluation, investigate the importance of different crowdsourcing aspects. They propose to have the workers take a qualification test (general questions about the task topic) alongside the gold questions, in order to filter them. Additionally, they show that the quality of the user interface (instruction clarity, text presentation) impacts the quality of the annotations. Finally, they ask one open-ended question to get user feedback, which enables them not only to get useful insights into their task design, but also to detect spammers. These studies do not aggregate the annotations of each worker.
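The kind of multi-step worker filtering described above could be sketched as follows; the thresholds and the per-worker statistics are illustrative choices of ours, not those of the cited studies.

```python
def filter_workers(worker_stats,
                   min_gold_accuracy=0.8,
                   min_consistency=0.7,
                   min_seconds_per_question=3.0):
    """worker_stats: dict worker_id -> dict with keys 'gold_accuracy',
    'consistency' (agreement between repeated questions) and
    'median_seconds_per_question'. Returns the ids of the workers kept."""
    kept = []
    for worker_id, stats in worker_stats.items():
        if stats["gold_accuracy"] < min_gold_accuracy:            # gold standard questions
            continue
        if stats["consistency"] < min_consistency:                # consistency / content questions
            continue
        if stats["median_seconds_per_question"] < min_seconds_per_question:  # user monitoring
            continue
        kept.append(worker_id)
    return kept
```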

Methods specifically interested in ensuring the subjectivity of the workers
Speck et al. [101] aim at collecting a dataset for the classification of music pieces. They ask crowd workers to choose arousal and valence values while listening to a music piece. Since this task is very subjective, they are specifically interested in removing incorrect workers while making sure they account for the workers' subjectivity. For that, they train a one-class SVM on the labels given by experts (music Information Retrieval researchers) on a few data samples; later, they exclude workers whose annotations on the verification samples are on average outside the decision boundary of the trained SVM. They assume that the expert labels differ enough to represent the different possible judgements over one sample, and therefore that the crowd workers' annotations should not differ much from them. They also add a second stage of filtering, simply based on the consistency of the answers. Even though they report high precision, recall and F-measure with this method to automatically classify workers, they do not compare it with other baseline methods.
Brew et al. [17] perform sentiment classification of news content and collect data via crowdsourcing to classify news as having positive or negative sentiment, as well as being relevant or irrelevant (dealing with an economics topic or not). They show that certain samples are more subjective than others, depending on the worker agreement rate, and that their learning algorithms learn better using high-agreement samples. Moreover, they show that coverage (using more data samples with fewer annotators) enables better learning than consensus (using fewer data samples with more annotators) in cases where the consensus rate is globally high. They conclude that it is necessary to identify which workers are close to the consensus opinion and to select them for annotation.
Dumitrache et al. [42] are interested in crowdsourcing for medical relation extraction (for a Watson service), since it is not possible to collect enough data with expert annotators. They compare semi-manually labelled annotations to 1) an automatic annotation method, 2) expert annotations, 3) single crowdsourced annotations, and 4) aggregated crowdsourced annotations. They show that with appropriately-tuned aggregated crowdsourced annotations, they obtain a higher annotation quality than with the other annotation methods.

Moreover, they show that when choosing the correct parameters, these non-binary aggregated crowdsourced annotations enable training classifiers with a higher accuracy than the other constituted datasets. When investigating the number of crowd workers required, they find that the annotation quality is maximal with 10 annotators. Their annotation aggregation method is as follows: 1) first, they add all the workers' annotations into one vector (the sentence vector); 2) then, they compute for each dimension the cosine similarity between the sentence vector and the corresponding unit vector; 3) finally, they rescale the resulting positive and negative values into [0.85;1] and [-1;-0.85] respectively, according to a determined threshold defining which values are positive or negative. In [43], they first use a sentence metric to eliminate the sentences which are too ambiguous, before using a worker disagreement metric [7] to eliminate low-quality workers. They train models on the cleaned data.
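Our reading of the first two steps of this aggregation scheme can be sketched as follows; this is a simplification (the final rescaling into [0.85;1] and [-1;-0.85] is omitted) and the names are ours, not the authors' code.

```python
import numpy as np

def relation_scores(worker_vectors):
    """worker_vectors: list of per-worker binary vectors over the candidate
    relations for one sentence. Returns one score per relation, equal to the
    cosine similarity between the summed sentence vector and the unit vector
    of that relation."""
    sentence_vector = np.sum(worker_vectors, axis=0).astype(float)
    norm = np.linalg.norm(sentence_vector)
    if norm == 0:
        return np.zeros_like(sentence_vector)
    # cos(v, e_i) = v_i / ||v|| for the i-th unit vector e_i
    return sentence_vector / norm

# Example: relation_scores([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
```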

2.5.5. Metrics to evaluate crowdsourcing quality
Hossfeld et al. [59] consider two different types of reliability of the user studies: intra-rater and inter-rater reliability. The first considers the consistency of the annotations of one unique annotator (averaged over all the annotators), by computing the Spearman rank correlations over the answers of each user. This is possible because each user rates three videos for which only one known parameter varies, with a known effect on the subjective judgement. The second measures the agreement between annotators by computing the Spearman rank-order correlation coefficient between all user ratings and the varied stalling parameter for all user ratings in a campaign. Ghadiyaram et al. [53] also compute inter- and intra-subject consistency with gold standard data. They additionally employ gold standard questions to compute the correlation between the annotation Mean Opinion Score (MOS) and the laboratory data MOS, to check the reliability of the annotations. Similarly, Snow et al. [97] compare the inter-annotator agreement of individual expert annotations to that of single non-expert and averaged non-expert annotations. They also compute how many non-experts are needed to obtain a performance similar to that of experts.
Moreover, to evaluate whether the disparity between user judgements is too high to be realistic or low enough to suggest high-quality annotations, Hossfeld et al. [59] use the Standard deviation of Opinion Scores (SOS) hypothesis [58], which determines a relationship between the SOS and the MOS. It is specified that this relationship holds only for Quality of Experience studies, and that these studies must consist of ratings on a K-point scale.
Ribeiro et al. [88] explain a method to compute the MOS score and the confidence interval of the crowdsourcing experiment. Redi et al. [84] employ normalized MOS scores. They compute the correlation between the MOS scores of the crowdsourcing experiment and of the laboratory experiment. Similarly, Keimel et al. [67] and Speck et al. [101] compute the correlation between the results obtained by crowdsourcing and in lab experiments.
Hsueh et al. [62] define three quality metrics. They compute the deviation of each worker annotation from the gold standard (what they call the noise level), the gold standard being the majority-voting annotations. Secondly, they compute the ambiguity of each sample based on the sample's annotations. Finally, a third metric called confusion combines the previous two metrics. These measures are used to remove certain annotations from the dataset and improve the accuracy, especially with the confusion metric. Thus, not only do they show that it is useful to remove low-quality annotations for training, but they also point out that active learning would then considerably reduce the amount of data needed to train classifiers.
Chen et al.'s [26] crowdsourcing approach is different, since they ask the workers to compare only two samples at a time to simplify the task. To measure the quality of the annotations, they check the individual consistency and the overall consistency of the annotations.
Dumitrache et al. [42] highlight as factors which create disagreement between workers not only the expertise of the crowd workers, but also the way the task is defined and the ambiguity of each sample.
Indeed, interpretation and annotation rely on three related concepts known as the "triangle of reference" [8] (the interpreter, the sign and the referent): "the interpreter perceives the sign (e.g. a word, a sound, an image, a sentence) and through some cognitive process attempts to find the referent of that sign (e.g. an object, an idea, a class of things)". The CrowdTruth framework proposes several metrics to evaluate the three corners of the triangle: the crowdsourcing workers (low quality or spam), the sentences (clarity), and the relations (similarity). Representing the annotations as vectors, it iterates over several cosine similarities in order to obtain quality metrics.
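For reference, the SOS hypothesis [58] mentioned in this section relates the variance of the opinion scores to the MOS value x through a single study-dependent parameter a, for ratings on a scale from 1 to K; this is the form of the relationship as we understand it from the cited work.

```latex
\mathrm{SOS}(x)^2 = a \left( -x^2 + (K+1)\,x - K \right), \qquad a \in [0,1]
```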

2.5.6. Discussion RQ1
In this section, we investigated the best-practice methods for crowdsourcing annotations, with special attention to the methods dealing with subjective data.
First, considering that we want to use demographic information to make the subjective predictions, we should investigate whether there are correlations between the different features and the labels in our dataset, in order to identify whether some demographics are more important than others for predicting offensiveness perception. This can be done using one of the methods cited above.
In order to run a crowdsourcing task to collect high-quality data on the subjective perception of undesired speech, we should filter the crowd workers. For that, we should use one or several of the techniques cited above in order to have a way to measure the crowd workers' quality. Mainly, we should ask some questions with objective answers about the sentence sample context, and we could also find data samples with obvious offensiveness judgements, in order to assess which crowd workers input correct answers. Moreover, considering that in our dataset the workers did not all annotate the same samples, we cannot use the usual crowdsourcing quality metrics. We could instead use the CrowdTruth framework to identify the low-quality workers, the ambiguous sentences, and possibly the annotation classes which are too similar. Since we aim at keeping the annotators and annotations which might differ from the consensus opinion but which are still valid, we can use one or several of the above methods to filter out the annotations of totally wrong users. However, we have to pay attention not to eliminate workers who give correct but unusual answers.
Finally, it would be useful to detect the most ambiguous samples with the annotator agreement rate or the metrics of [43], and request more annotations for these samples, while also eliminating the low-quality workers. This way, we would have more data to train the algorithms on the subjective samples and improve their performance on these types of data.

2.6. Summary
With this chapter, we produce the first contribution of the thesis: an extensive literature review about Machine Learning, crowdsourcing, fairness and subjectivity, with a use-case in sentence toxicity prediction.
Related to toxicity and related speeches (RQ1), we found that toxicity is a subjective property of sentences, and that toxicity perception depends on three main types of variables (the sentence characteristics, the sentence context, and the internal characteristics of the individual judge).
For the Machine Learning algorithms to perform sentence toxicity prediction (RQ3), we identified several algorithms which are traditionally used. We will experiment with these since it is not possible to deduce the most performing ones from the literature. To adapt the algorithms' predictions to each user, we identified two main possibilities: adapting certain parameters of the algorithms to each user, or feeding additional user-related inputs to the algorithms.
Concerning the evaluation of the algorithms (RQ2), we found that the usual evaluation methods do not enable conclusions concerning the fairness of the algorithms, and that we consequently need adapted evaluation metrics. Research currently interested in fairness looks at the discriminatory power of the algorithms. However, we want to look at a more general kind of fairness, fairness towards each user of the algorithms, and therefore we will adapt the usual definitions to our purpose.
Concerning the creation of the dataset (RQ1), we found that the toxicity-related speeches have blurry definitions and that it is advised to use a precise definition to collect toxicity annotations via crowdsourcing. Several methods can be employed to ensure a high quality of the crowdsourced data. Moreover, in order to decrease the amount of variation in the crowdsourced annotations, we should fix as many as possible of the variables which influence the perception of toxicity-related speeches, mainly the variables related to the sentence context.

3. Dataset to study sentence toxicity as a subjective property

3.1. Introduction
In this section, we are interested in the creation of datasets for the training of algorithms for the classification of subjective properties, and aim at answering the first challenge through the first research question (RQ1): how can a dataset be built to train algorithms for the prediction of subjective properties? There are three main directions in this study. We show that our main research question is justified for the use-case of toxicity classification, by investigating whether the chosen dataset (the Jigsaw toxicity dataset) contains multiple subjectivities. We investigate the crowdsourcing techniques to harness the different subjectivities in the dataset while creating it. We reflect on how to collect the annotations of subjective properties at a low cost. Each of these aspects corresponds to a different sub-question.

1. How can the prediction of subjective properties be studied from a Computer Science perspective? (RQ1.1)

(a) Is toxicity a subjective property of sentences? What are the human-related variables which influence sentence toxicity perception?
→ We search the psychology literature about offensiveness (and possibly toxicity) to show that it is a subjective property of sentences, and to find a set of variables influencing toxicity perception.
(b) Are there different valid opinions in available toxicity datasets?
→ We show that the Jigsaw toxicity dataset exhibits different valid toxicity judgements per sentence, and therefore that sentence toxicity is a valid use-case to study the prediction of subjective properties. We assume that disagreement between annotations of a same sample is a sign of different possible opinions. After cleaning the selected dataset, we compute the distribution of the disagreement rate of the workers' annotations with the majority-voting label and show that there are different valid opinions for each sentence.

2. How to collect crowdsourcing annotations of high quality when the property to annotate is subjective? How to identify and remove spammers' annotations while keeping the valid annotations? (RQ1.2)
→ We review the crowdsourcing literature and list the methods to create clear crowdsourcing tasks and to filter low-quality crowd workers and annotations during these tasks and during the annotation post-processing step. We investigate to which extent these techniques are applicable to subjective classification tasks, with a special focus on the CrowdTruth method, for which we manually verify that it is applicable to our use-case.

3. How to collect crowdsourcing annotations on a large dataset while maintaining a low cost? (RQ1.3)
→ We hypothesize that certain clustering methods enable clustering the samples on which to collect annotations, so that annotating only a reduced number of samples inside each cluster and spreading the annotations inside the clusters would provide correct annotations for the whole dataset. We reject this hypothesis by conducting experiments on the retained dataset.


3.2. Toxicity in psychology and dataset choice
In this section, we focus on the first sub-question: how can the prediction of subjective properties be studied from a Computer Science perspective? (RQ1.1) Our aim is to study the prediction of properties which are subjective, in other words properties on which there is disagreement between the judgements of different people. Consequently, supposing that sentence toxicity is subjective, we hypothesize that:

H1: the use-case of sentence toxicity is a valid domain of application for the study of subjective properties from a Computer Science perspective.

Here, we validate this hypothesis by verifying whether sentence toxicity is a subjective property in theory, and we later verify it in practice by investigating whether the toxicity datasets found in Computer Science research contain disagreement on the judgements of sentence toxicity. The quality of these datasets should first be evaluated and possibly improved before being able to judge whether several valid subjectivities are contained in them. That is why we proceed here only to a theoretical investigation of sentence toxicity and choose a Computer Science dataset to investigate. In Section 3.3 we proceed to the quality evaluation, so that in Section 3.4 the presence of different subjectivities can be judged.

3.2.1. Sentence toxicity as a subjective property in the psychology literature
First we look at the theory about sentence toxicity found in the psychology literature, to investigate its subjective character and answer: is toxicity a subjective property of sentences? What are the human-related variables which influence sentence toxicity perception? "Subjective" is defined as "based on or influenced by personal feelings, tastes, or opinions."1 From our literature review on the toxicity of Web content and related concepts (hateful speech, abusive language and offensiveness) (Section 2.1), we conclude that offensiveness is a subjective property of sentences. Although toxicity and offensiveness might be two different concepts, since there is no precise definition of toxicity we claim they are closely related, and consequently we consider that toxicity is also a subjective property of sentences.
The literature enables us to draw a list of the variables influencing sentence toxicity perception. These variables are listed in Table 2.1 (Section 2.1) and are divided into three categories: the variables related to the internal characteristics of the person reading the sentence (the most important ones being gender, age, ethnicity and education level), the variables related to the sentence characteristics, and the variables related to the sentence context. According to the definition of subjectivity, a dataset which exhibits subjective judgements should contain variations along the first category of variables, while variations in the other categories would contribute to the differences in toxicity perception, but not because of the subjective character of toxicity. Variations of the sentence characteristics are intrinsic to the sentences themselves; they are common to every reader of a sentence and thus do not contribute to the variation of opinions about one sentence. Variations due to the sentence context do not contribute to the variation of opinions if the context is explicit and thus common to all the readers. If the context is implicit, it might lead to different interpretations of a same sentence by different persons, and consequently the dataset would exhibit multiple judgements, but these would not be directly related to the subjective character of toxicity. Thus, a dataset to study the prediction of the subjective property of sentence toxicity should be constituted of different sentences and different types of readers of these sentences, and ideally should present the sentence context explicitly.
1https://en.oxforddictionaries.com/definition/subjective

3.2.2. Subjectivity in Computer Science datasets of sentence toxicity
We now choose a Computer Science dataset of sentence toxicity, in order to analyse in the following sections whether the subjectivities of toxicity can be found in practice in the Computer Science domain.

Requirements for the dataset
We wish to study the toxic property of sentences and how its subjective character influences crowdsourcing tasks and Machine Learning models. Based on the findings of the previous subsection, we define a list of requirements that a dataset should respect to enable this study.
In order to train Machine Learning models, the dataset should contain several sentences and toxicity judgements of these sentences. To compare multiple algorithms (traditional Machine Learning classifiers or Deep Learning neural networks) which require more or less data to be trained on, we need a large dataset of many sentences. In previous papers working with Deep Learning for hate speech detection, the size of the datasets varies greatly, between 6655 and 10 million samples (Section 2.2), but the performance reported for the different algorithms trained on these different datasets is similar. Therefore, we assume that 100000 comments should be enough to predict binary labels, and it might also be sufficient to predict empirical density labels, since this is done successfully in [111] with a dataset of this size.
To study the subjective character of sentence toxicity, we need to make the different variables which influence toxicity perception vary independently. Consequently, we need different sentences to investigate how judgements differ according to the variables related to the sentence characteristics, and multiple toxicity judgements per sentence by different people whose internal variables vary. In order for the sentence context variable not to vary simultaneously with the other two variables, it should always be the same, or it could vary but should be explicit, so that the annotators all base their judgements on the same context.
Existing datasets are collected via crowdsourcing with multiple annotators per sample. Having access to the different annotations that each annotator gave during the crowdsourcing task, and not only to the final labels after post-processing of the crowdsourcing results, would enable us to study the different toxicity judgements emitted on a same sentence. The more annotations available, the higher the chance to collect differing judgements due to differing annotator internal variables. In the crowdsourcing literature, it is explained that collected annotations might be incorrect, and it is reported that using 10 annotations per sample leads to the highest algorithm accuracies with distributed labels when training Machine Learning models on these annotations [42]. However, this number of annotators is adapted to the case where one unique label or density-estimated labels are predicted for each sample. In our case, we aim at considering several labels correct depending on the annotators of the sentence, as long as the annotators gave their true opinion on the sample. Thus we cannot affirm that 10 annotations per sample are sufficient for our problem.

Dataset choice
From the literature review, we found multiple available datasets related to sentence toxicity, hatefulness or flames, but none of the datasets completely fulfils the requirements. Most of them do not make the individual annotations of the annotators available but only give access to the final label. We eliminate these datasets since they do not allow studying the differing judgements. The Jigsaw toxicity dataset [111] is a large, freely available dataset from the Google Jigsaw team, with around 100000 Wikipedia comments rated for toxicity on a [-2;2] scale (a more detailed description of the dataset is given in Appendix A.2). It presents around 10 annotations per sample (1598289 annotations in total) collected on CrowdFlower 2, with information about part of the annotators (gender, first language, age group, education level). It should make it possible to infer labels of high quality, and to study correlations between people's internal variables and toxicity judgements. It is the only dataset with these characteristics, which is why we set out to conduct our experiments on it. The toxicity score ranges from very toxic (-2), to neutral (0), to very healthy (2). The Jigsaw team then aggregated the scores into toxicity labels: 1 for toxic and 0 for non-toxic.
The main internal variables of people cited in the psychology literature are age, gender, education and ethnicity. The Jigsaw dataset contains neither information about the annotators' ethnicity nor about the sentence context. This leads to limitations that we could overcome by collecting a new set of annotations based on the Jigsaw dataset. We suppose for now that the dataset should enable the study of the subjective character of toxicity, since different opinions for the available data samples should be found in the dataset due to the different internal characteristics of the annotators. Subjectivity is only related to the people's internal characteristics and not to the sentence context, but because we cannot eliminate this second variable we decide to ignore it for now. For the whole project, we consider that the crowd workers' background varies along 3 dimensions (gender, age, education level), since the first language was not reported in the psychology literature as an influencing factor for offensiveness perception.
The crowdsourcing task shown in Fig. 3.1 was presented to the crowd workers when they were asked to label the comments. In order to ensure annotation quality, the workers were pre-filtered according to the correctness of their answers to 10 golden questions, and additionally the annotations of workers who gave opposite answers to the same question were removed. Although it is not feasible within the time frame of this master thesis, rerunning a crowdsourcing task on the sentences with the lowest agreement rates or on the least frequent demographics, and capturing additional information about the annotators, would enable us to investigate how large the dataset should optimally be for our task.

2https://www.figure-eight.com/
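Since we will repeatedly need the majority-vote label of a comment and the amount of disagreement with it, a minimal sketch of that computation on a long-format annotation table is given below; the column names are placeholders of ours, to be mapped to the actual fields of the Jigsaw release.

```python
import pandas as pd

def disagreement_with_majority(annotations, id_col="comment_id", label_col="toxic"):
    """annotations: one row per (comment, worker) pair with a binary toxicity label.
    Returns, for each comment, the fraction of annotators who disagree with the
    majority-vote label of that comment."""
    def per_comment(labels):
        majority = int(labels.mean() >= 0.5)
        return (labels != majority).mean()
    return annotations.groupby(id_col)[label_col].apply(per_comment)
```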

Figure 3.1: The question asked to the crowd workers for the dataset creation. Source: the Jigsaw team's GitHub repository, https://github.com/ewulczyn/wiki-detox/blob/master/src/modeling/toxicity_question.png

3.3. Crowdsourcing treatment of dataset annotations of subjective properties
The limitations of crowdsourcing, explained in the introduction and the literature review, are mainly that the crowdsourcing techniques used to solve annotation quality issues rely on annotation filtering and annotation aggregation based on the idea that one unique judgement of the sample is correct, whereas we are interested in keeping several different but all valid judgements about a same sample. Consequently, we need to investigate how to collect annotations of high quality without using current crowdsourcing techniques, which is the second sub-question of this chapter: How to collect crowdsourcing annotations of high quality on subjective labelling tasks? How to identify and remove low-quality annotators' annotations while keeping the valid subjectivities? (RQ1.2)
To answer these questions, we proceed in two steps. First, we investigate whether the collection method of the Jigsaw dataset is on par with the methods mentioned in the literature for designing crowdsourcing tasks to collect high-quality annotations, by searching the literature from three different fields (crowdsourcing, Machine Learning and psychology), and we create a list of possible improvements to the crowdsourcing task. Second, we investigate the post-processing methods used to clean the collected crowdsourcing annotations, and check whether they are applicable to annotation tasks for subjective properties.

3.3.1. Design of the crowdsourcing task
We investigate how to design a crowdsourcing task to collect annotations of high quality on subjective properties. Current research about the design of crowdsourcing tasks aims at making the task as clear as possible, in order to eliminate as much ambiguity as possible and get annotations that are as similar as possible, and at facilitating the understanding of the task so that the annotators are fast and accurate at providing annotations. These are also the objectives of task design for the annotation of subjective properties; the elimination of ambiguity is of special importance since only the variables related to the annotators' individual characteristics should vary. This leads us to formulate the following hypothesis:

H2: general crowdsourcing research on task design is applicable to the collection of annotations of subjective properties.

In order to verify this hypothesis, we could compare the quality of the annotations returned by two crowdsourcing tasks on a same dataset: one task following the design recommendations given in the crowdsourcing literature, and one task without a careful design (baseline). However, due to cost limitations, we cannot run a new crowdsourcing task and thus cannot verify the hypothesis experimentally. Instead, we compare the recommendations of the crowdsourcing literature on task design to 1) the observations of the psychology literature interested in studying subjective properties related to toxicity, 2) the suggestions given by researchers who created datasets for sentence toxicity prediction, and 3) our observations of the sentences in the Jigsaw dataset. We verify whether the methods proposed in the crowdsourcing literature follow the suggestions from the other fields of research and could solve the limitations observed in the task of annotating subjective properties.

Identification of limitations and suggestions in the literature
From Section 2.1.5, where we look at how psychology studies conduct experiments to collect judgements over hate speech, we see that they use tests consisting of multiple questions over a sample, whose answers are averaged to obtain one unique judgement. This enables getting the correct perception of the test subject about the sentence. The questions asked are part of psychology questionnaires, or are grouped around a same concept, or are sets of propositions that the annotators must rate depending on their validity for the samples. The samples to judge are usually scenarios, which provide a context for the sentence to judge.
In Sections 2.2.3 and 2.1.5, we investigated how the Machine Learning researchers interested in toxicity prediction collected their datasets and what recommendations they gave for future work. The Machine Learning papers which analyse the collection of annotations to build datasets related to human judgement of tweets or other social media posts highlight the presence of disagreement between annotators despite the use of psychology tests to obtain the labels. They propose several explanations for the disagreement: the psychology questionnaires are not adapted to social web data, the sentences lack context (such as the writer, the target and the context of the potential discussion), and the annotators have different perceptions depending on their subjectivities. In this work we aim at highlighting the different judgements annotators have about one same sample; therefore we should eliminate the disparities due to the lack of context but keep the ones due to the annotators' subjectivities.
The crowdsourcing literature (Section 2.5) highlights that the questions asked to the annotators should be as clear as possible in order to collect high-quality annotations. For that, the questions should be as simple as possible, the property to annotate can possibly be explained before starting the task, and examples can be given of the expected annotations for some samples.

Analysis of the design of the Jigsaw dataset crowdsourcing task
We analyse the task created to collect the Jigsaw dataset (Fig. 3.1) based on the recommendations from the literature and on our observations of the data. The aim of the task is to collect high-quality annotations, which we define as annotations which reflect the annotators' judgements of the data samples on one property, excluding any possible judgement ambiguity coming from the task design and the data samples.
Design of the crowdsourcing question. In the case of the Jigsaw dataset, only one proposition is given to the annotators to rate, which differs from psychology studies where multiple propositions are given to the participants to rate, with the results averaged. Using multiple propositions would enable collecting more accurate judgements from the annotators and would help detect possible spammers (in cases where the answers do not match each other), leading to higher-quality labels. Moreover, the scale has 5 points, whereas the scales in the psychology literature have upper ranges varying from 6 to 12, which possibly changes the perception of the annotators.
Design of the presented samples to annotate. From the psychology literature we identified the three main types of variables which influence the perception of offensiveness and, by extension, toxicity. 1) The annotators' individual characteristics are human-related variables and so they do not depend on the design of the crowdsourcing task; these are the variables we do not interfere with. 2) The sentence characteristics are what makes the data sample and therefore cannot be modified. 3) The sentence context (target, author, place of the speech in the overall discussion) consists of several variables which influence the perception of the speech. These variables are not represented in the crowdsourcing task, but they are also pointed out by the Machine Learning papers, and they are specified in the psychology literature. Therefore we propose for further crowdsourcing tasks to give information specifying the sentence context of each data sample. This would eliminate one dimension of the causes of ambiguity. We give examples of sentences where the lack of context leads to disagreement between annotators in Table 3.1.

"Ummm... you are very narcisistic. You The target of the sentence is unknown. Subjective perception of the -1(4), 0(6) wrote an article about yourself." adjective narcissistic. "Everywhere, you were also very disrup- -2(1), -1(2), No context to identify whether the truth or a criticism is reported. tive as well." 0(5), 1(2) -1(5), 0(3), Lack of context about the actors of the discussion: it is difficult to "Shush sweetie, the adults are talking." 1(2) judge whether the sentence is a mockery or not.

Table 3.1: Example sentences from the Jigsaw dataset which lack context indications. The annotation notation s(n) indicates that n annotators gave the toxicity score s.

Design of the explanations of the question. Another cause of misunderstanding and ambiguity in the task is the design of the explanations surrounding the question asked to the annotators (namely the explanations of the terms of the question). Sentence toxicity is not explained at the beginning of the task. It is requested to rate the toxicity of a sentence on a [-2;2] scale, but no example is given for each scale value, only descriptions (an accumulation of different adjectives which do not refer to the same concepts), which makes it difficult, for example, to distinguish between neutral, healthy or very healthy sentences. The lack of instructions also makes the range of valid answers broad, for instance for sentences which convey a positive message but also use abusive language or criticize a person or a group. It is not specified how annotators should judge these sentences with two opposite toxicity aspects. A sentence might be healthy for the target of the speech (e.g. the target is thanked by the person writing the sentence) but toxic for the persons mentioned in the speech (certain persons are criticized, possibly using aggressive language). This leads to differing interpretations of the question by the annotators. We cite examples of such sentences with their associated annotations in Table 3.2.

Sentence: "Oh... I.. yeah i feel like an ass now...."
Annotations: -2(1), -1(7), 0(2)
Causes of disagreement: Self-blaming and the rudeness of the vocabulary might be judged differently by the annotators without instruction.

Sentence: "so wait, mr porcupine, when did you say you were planning on blocking yourself?"
Annotations: -1(5), 0(3), 1(2)
Causes of disagreement: No instruction says how to resolve the lack of information on the context (interpretation of the name used to qualify the target and of the joke).

Sentence: "I hereby wish to thank you for your continuous efforts in protecting our templates from those gutless vandals who get their sick kicks off removing them. May the Force be with you !"
Annotations: -1(4), 0(2), 1(4)
Causes of disagreement: First and last parts are positive while the middle might be toxic, since it is disrespectful towards a specific target. No instruction to deal with two opposite judgements.

Sentence: "[...] including your friends and buddies maybe you can do that 100 times a day (let me explain it in your IQ level. Say you have 33 buddies like you. 33*3 = 99 edits.) [...] Your childish desperations trying to humiliate me with suggestions regarding the sandbox are just matching the low ethics and discrimination policy of a person like yourself. [...] You just have a good sense of humour. I suggest you read Merchant of Venice of Shakespeare. [...] that is going to help you be cleansed from your racist feelings against the fellows of other nations. [...] Sincerely."
Annotations: -2(1), -1(1), 0(5), 1(3)
Causes of disagreement: The author exposes his rights and uses some respectful language, but also mocks the target of the speech. No instruction given for "mixed" sentences. Lack of context does not enable to know whether the respectful words are sarcastic.

Sentence: "Did you even bother to read what it said? He said he wanted me to delete it. Wake up."
Annotations: -1(2), 0(6), 1(2)
Causes of disagreement: First and second sentences are neutral but the last one is negative. Lack of instructions for "mixed" sentences.

Sentence: "The transition between the first two paragraphs is horrible. I'm not sure how to fix it; the content in para 1 does not appear related to para 2."
Annotations: -1(3), 0(1), 1(6)
Causes of disagreement: The sentence gives constructive criticism, but it might be considered negative with the use of the adjective "horrible".

Table 3.2: Example sentences from the Jigsaw dataset which receive differing judgements because of a lack of precision in the crowdsourcing task.

Sentences with multiple interpretations due to annotators' subjectivities. If the crowdsourcing task were totally clear, the disagreement would only come from the perceptions that annotators have of the toxicity of the elements of the sentence, and not from the ambiguity created by the lack of information about the sentence or the task. We cite in Table 3.3 example sentences which are simply perceived differently depending on people's subjective perception (and possibly some interpretation of the context).

Discussion, conclusion and recommendations for our task
From the above observations, we were able to answer RQ2.2 by identifying the main limitations of current crowdsourcing tasks for the collection of toxicity annotations: the explanation of the task lacks details (explanations of the terms of the question and of the scale, citation of examples, instructions about extreme cases), the design of the question might not be adapted to the collection of judgements of subjective properties (multiple-question questionnaires are used in psychology but single questions in Computer Science), and the samples lack context (the conversation surrounding a sentence is absent in Computer Science but present in psychology experiments). On the contrary, the crowdsourcing literature advises disambiguating and clarifying the question, for example by giving examples of annotations. It does not specifically address the issue of choosing the question to ask the annotators based on the psychology literature (a point which would merit more investigation), but it does address the other current limitations. Therefore we conclude that hypothesis H2 is partly verified: general crowdsourcing research on task design is applicable to the collection of annotations of subjective properties, but the choice of the question asked to the annotators should be investigated in more detail in future research. We now propose a list of recommendations to run a new crowdsourcing task on the same dataset.

Sentence: "if you want to leave the article a bunch of lies, controlled by a child. okay. you better be able to prove what you say. it is not vandalising when i correct lies. you better contact someone who knows the truth before printing lies. there was never a porsche motor in a yugo, etc. i guess this 13 year old knows more about cars than i do."
Annotations: -1(3), 0(5), 1(2)
Cause of the disagreement: Sarcasm is subjective (perceived toxic or not). Potential criticisms are judged differently.

Sentence: "WHY CAN I NOT POST? PLEASRE UNBLOCK ME I DO NOTHIN WRONG FRENS I DO NOT COMPREHNDE WHAT IS GONE HAPPEN"
Annotations: -1(2), 0(4), 1(3), 2(1)
Cause of the disagreement: Lack of instructions when no explicit target appears. Neutral sentence, but the use of upper-case letters appears aggressive.

Sentence: "I already asked. He pretty much told me to stick my head in a bucket of lava."
Annotations: -1(6), 0(3), 1(1)
Cause of the disagreement: Joke or real attack. Possible lack of context or annotator's subjective perception.

Sentence: "What the hell is wrong with this thing ? why are my changes not showing?"
Annotations: -2(1), -1(5), 0(3), 1(1)
Cause of the disagreement: No target. Annotator's subjective perception, or lack of instructions about sentences without a precise target.

Table 3.3: Example sentences from the Jigsaw dataset, whose interpretation is mainly influenced by the annotators' individual characteristics.

• Add context information to each sentence sample. This context can consist of the surrounding conversation of the sentence, information about the writer or the target of the sentence, ...

• Clarify what toxicity and the different labels ([-2;2] range) are, possibly with examples.

• Give instructions on how to judge extreme cases such as sentences which contain both negative and positive aspects, sentences without direct target or sentences with self-criticism.

• Adapt psychology questionnaires to the task of judging toxicity in order to aggregate several answers from each annotator into one judgement (this could give more accurate and consistent judgements than asking to rate the toxicity on the [-2;2] scale, which is difficult to interpret).

With these restrictions on the crowdsourcing task, and especially if we choose to focus on sentences of one specific type of hate speech, we could aim at reaching less than 30% disagreement, since this is the value reported by most papers which collect annotations about subjective quality of experience.

3.3.2. Crowdsourcing treatment of the dataset: annotation cleaning, low-quality annotators and annotations removal
In this sub-section, we are interested in crowdsourcing methods to filter out wrong annotations collected from crowdsourcing tasks. The research sub-question we answer is: how to identify and remove low-quality annotators' annotations while keeping the valid subjectivities? (RQ1.3) The study of the literature on crowdsourcing result processing (Section 2.5.4) leads us to formulate a hypothesis concerning crowdsourcing for subjective properties. The literature is divided into two main directions. On the one hand, some research recommends filtering annotators during the crowdsourcing task. The crowdsourcing literature advocates using several types of questions combined with varied question presentation to check whether the annotators answer randomly or consistently: gold questions, content questions (which are not subjective), and consistency tests. User monitoring, the use of Amazon Mechanical Turk user scores, and the use of qualification tests are also proposed to filter low-quality workers. In the Jigsaw dataset, the workers are filtered using 10 golden questions and consistency tests over the annotators' answers. Additional methods could be used if a new crowdsourcing task were to be run. On the other hand, other studies propose to remove the low-quality workers after collecting all the data. To do so, they compute annotators' scores by investigating the correlations between annotators' annotations (for example the MOS score [88] or the CrowdTruth worker quality score [43]). When some expert annotations are available, techniques to compare the annotators' annotations to the expert ones and eliminate the outliers are also investigated. Considering that we cannot run a new crowdsourcing task, we focus on the second kind of annotator-filtering methods. We decide to study the CrowdTruth framework because 1) it does not require any expert annotations, which we do not have, and since the Jigsaw dataset is large, annotating a sufficient amount of samples would be time-consuming, and 2) it does not aim at transforming the annotations into one unique label but at harnessing disagreement. The hypothesis we investigate is:

H3: the CrowdTruth framework makes it possible to filter out annotators' low-quality annotations while keeping the annotations which correspond to valid judgements different from the majority.

Experimental set-up
To verify the hypothesis, we apply the CrowdTruth framework to our dataset and investigate whether the CrowdTruth results make sense by comparing the framework's output scores to our manual evaluation of the annotations or annotators. The CrowdTruth framework takes as input a set of annotations on different samples together with the annotators' identifiers, and outputs three scores: the Unit Quality Score (UQS), the Worker Quality Score (WQS), and the Annotation Quality Score (AQS) [44]. For one sample, the UQS is computed as the average cosine similarity between all worker vectors, weighted by the WQS and AQS. It represents the degree of agreement for each sample. The WQS is computed as the product of the worker-worker agreement (WWA) and the worker-media unit agreement (WUA), and assigns to each worker one score measuring the correctness of her annotations. The WWA is the average cosine distance between the annotations of a worker and those of all other workers who have worked on the same media units, weighted by the worker and annotation qualities. The metric gives an indication as to whether there are consistently like-minded workers, which is useful for identifying communities of thought. The WUA is the average cosine distance between the annotations of a worker and all annotations from the rest of the workers for the media unit, weighted by the media unit and annotation quality. It measures how much a worker disagrees with the crowd on a media unit basis. The AQS is the average, weighted by the WQS, of the probability that when an annotator i selects label a for a sample, annotator j selects the same label. It represents the agreement for each label over all the annotations given in the crowdsourcing task.
The Jigsaw dataset is constituted of samples and their annotations on a scale between -2 and 2. This scale is criticized in the crowdsourcing literature because it might be unclear to the annotators. Consequently, we apply the CrowdTruth framework to 4 different data set-ups with 4 different scales obtained with 4 different label aggregations. For the WQS, we randomly sample 30 annotators from those whose CrowdTruth scores on the binary labels are very low or very high, and we manually compute for each of them a quality score by averaging the number of their annotations which could be considered correct (according to our appreciation of the sample) over the total number of annotations they made. Then we compute the mean-squared error between the manually computed quality scores and the CrowdTruth WQS for each set-up. The smaller the error, the better the CrowdTruth results should highlight the correct or incorrect annotations, because the manually computed score simply corresponds to the fraction of correct annotations. To evaluate the UQS, we randomly sample 100 different sentences from the dataset, selecting sentences in the whole range of CrowdTruth results on the binary labels. For each of them, we give an ambiguity score (0: ambiguous, 1: non-ambiguous) depending on whether the sentence can be evaluated clearly or whether it is subject to multiple interpretations. Because we give binary labels to each sample but the results returned by the framework are continuous between 0 (low-quality unit) and 1 (high-quality unit), we compute the Area Under the Receiver Operating Characteristic Curve (AUROC) score so that the difference of score type is taken into account.
The higher the score, the more the manual score and the CrowdTruth score are in agreement, and the more accurate the CrowdTruth framework is at identifying more or less ambiguous sentences. We cannot give a manual score to evaluate each label and compare it to the AQS because it is hard to quantify how clear each label is; we only qualitatively compare the AQS results with our intuition of the labels. The mean-squared error and the AUROC score are computed on the same data samples for each of the set-ups.
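The comparison between the manual scores and the CrowdTruth scores can be sketched as follows. This is a minimal illustration assuming the CrowdTruth outputs have been exported as Python dictionaries; all variable and function names are hypothetical.

```python
# Sketch of the WQS and UQS evaluation described above; the dictionaries are
# assumed to map worker / sentence identifiers to scores.
from sklearn.metrics import mean_squared_error, roc_auc_score

def evaluate_wqs(manual_wqs, crowdtruth_wqs):
    """Mean-squared error between the manually assigned worker quality scores
    (fraction of annotations judged correct) and the CrowdTruth WQS,
    for the ~30 sampled workers."""
    workers = sorted(manual_wqs)
    y_manual = [manual_wqs[w] for w in workers]
    y_ct = [crowdtruth_wqs[w] for w in workers]
    return mean_squared_error(y_manual, y_ct)

def evaluate_uqs(manual_ambiguity, crowdtruth_uqs):
    """AUROC between the binary manual ambiguity labels (0: ambiguous,
    1: non-ambiguous) and the continuous CrowdTruth UQS, for the ~100
    sampled sentences."""
    sentences = sorted(manual_ambiguity)
    y_true = [manual_ambiguity[s] for s in sentences]
    y_score = [crowdtruth_uqs[s] for s in sentences]
    return roc_auc_score(y_true, y_score)
```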

Results
We compute the CrowdTruth scores on the 4 set-ups listed below (a code sketch of the corresponding label aggregations is given after the list) and report the resulting plots.

• Set-up 1: 2 labels: -2 and -1 labels are considered as toxic and 0, 1, 2 as non-toxic. (Fig. 3.2)

• Set-up 2: 2 labels: -2, -1 and 0 labels are considered as toxic and 1, 2 as non-toxic. (Fig. 3.3)

• Set-up 3: 3 labels: -2, -1 labels are considered as toxic, 0 as neutral, and 1, 2 as non-toxic. (Fig. 3.4)

• Set-up 4: 5 labels: the -2 to 2 labels are considered separately. (Fig. 3.5)
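As a minimal sketch, the four label aggregations could be implemented as follows; the string label names are illustrative assumptions, not the encoding used by the framework.

```python
# Illustrative mapping of the original [-2;2] annotations onto the labels of
# the four set-ups.
def aggregate_label(annotation, setup):
    """annotation: integer in [-2, 2]; setup: 1, 2, 3 or 4."""
    if setup == 1:                       # binary: (-2,-1) toxic, (0,1,2) non-toxic
        return "toxic" if annotation <= -1 else "non-toxic"
    if setup == 2:                       # binary: (-2,-1,0) toxic, (1,2) non-toxic
        return "toxic" if annotation <= 0 else "non-toxic"
    if setup == 3:                       # ternary: toxic / neutral / non-toxic
        if annotation <= -1:
            return "toxic"
        return "neutral" if annotation == 0 else "non-toxic"
    return annotation                    # set-up 4: the 5 labels are kept as-is
```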

We observe that for the first set-up (Fig. 3.2), most sentences and most annotators have a high quality but some have a lower quality. This is probably because the labels to annotate are easily interpretable and thus there are fewer annotation errors.

(a) Unit Quality Score. (b) Worker Quality Score. (c) Annotation Quality Score: -1 (0.609), 0 (0.944).

Figure 3.2: Set-up 1. Results of the CrowdTruth metrics with binary labels: (-2, -1) = toxic, (0, 1, 2) = non-toxic. Most annotators and samples are of high quality, with a long tail of very low-quality samples and annotators.

(a) Unit Quality Score. (b) Worker Quality Score. (c) Annotation Quality Score: -1 (0.799), 0 (0.427).

Figure 3.3: Set-up 2. Results of the CrowdTruth metrics with binary labels: (-2, -1, 0) = toxic, (1, 2) = non-toxic.

(a) Unit Quality Score. (b) Worker Quality Score. (c) Annotation Quality Score: -1 (0.625), 0 (0.663), 1 (0.455).

Figure 3.4: Set-up 3. Results of the CrowdTruth metrics with three labels: (-2, -1) = toxic, 0 = neutral, (1, 2) = healthy.

(a) Unit Quality Score. (b) Worker Quality Score. (c) Annotation Quality Score: -2 (0.337), -1 (0.422), 0 (0.702), 1 (0.395), 2 (0.039).

Figure 3.5: Set-up 4. Results of the CrowdTruth metrics with labels between (-2;2). The distributions of UQS and WQS are very different from the distributions for set-up 1, with lower average UQS and WQS.

For the other set-ups, the distributions of scores are concentrated on lower quality scores, the distributions for set-ups 3 and 4 having similar shapes, because there are more errors since the labels are more complex to understand and there are more different perceptions of the same labels for a same sample. We report in Appendix A.1.2 the Wikipedia comments with the lowest and highest UQS calculated for set-up 1 and set-up 4, and we manually study the annotations of the annotators with the lowest and highest WQS. We observe that the low-UQS samples for set-up 1 are sentences in foreign languages and sentences which seem ambiguous in their interpretation, while the high-UQS samples are usually long sentences which give constructive comments on the Wikipedia articles. For set-up 4, the low-UQS samples are more difficult to interpret and do not seem to be more ambiguous than other samples, while the high-UQS samples are also not very clear compared to the samples returned for set-up 1. For set-up 1, we observe that low-WQS annotators seem to provide completely random annotations, while the high-WQS annotators have mostly valid annotations. For set-up 4, the behaviour of the low-WQS annotators is not always easy to classify as either spam or simply valid but different judgements. For the study of the WQS, we compute the following mean-squared errors in the order of the set-ups: [0.0103, 0.1826, 0.2679, 0.3133]. For the UQS, we find the following AUROC scores in the order of the set-ups: [0.9452, 0.4792, 0.8607, 0.7906].

Discussion
The WQS and UQS returned for set-up 1 are much closer to our judgement of the workers' annotations than for the other set-ups, since the error is much lower and the AUROC is the highest. Set-up 1 is the most intuitive way to make sense of the possible annotation labels, which explains why the CrowdTruth results are closer to our judgement of the samples. This supports the idea of selecting its results to filter the annotators' annotations. The other set-ups are further from what we expected, for several reasons. The aggregation of the labels in set-up 2 is not meaningful since we differentiate between toxic and non-toxic samples and the neutral label is not usually confused with the toxic label. For set-ups 3 and 4, there are too many possible labels and the annotators cannot identify precisely which ones to use, whereas having binary labels makes the task clearer. When applying the CrowdTruth framework to these 3 or 5 labels, there appears to be a higher disagreement between workers, which we do not want to take into account since it is not related to the subjectivity of the property to annotate but only to the difficulty for a human to interpret the labels.
The AQS for set-up 1 exhibits high scores for both types of annotation, with non-toxic being the clearest. For set-up 2, the scores are a little lower, the toxic label having a higher score than the non-toxic label. For set-up 3, the scores decrease slightly again, with the neutral label being the clearest. For set-up 4, the scores decrease again; the neutral label seems to be very clear to the annotators since its score is very high compared to the other scores, and the other labels have similar scores. The very-toxic label has a very low score, which means that there is very little agreement over it. This trend in the scores among the different set-ups supports our previous explanations: the more possible labels there are, the less clear the task is for the annotators and the more disagreement there is, thus the scores decrease. Consequently, we again conclude that the CrowdTruth framework provides sound results over the Jigsaw dataset for our subjective property annotation task.
Although we do not compute the mean-squared error and the AUROC scores using all the annotations in the dataset, we assume that the results are representative of the whole dataset because our own observations on the data also correspond to the evolution of the scores returned by the experiments. The mean-squared error being very close to 0 and the AUROC score being close to 1 for the first set-up, we consider it reasonable to use the CrowdTruth results to eliminate the low-quality annotators' annotations.

Conclusion
We conclude that hypothesis H3 is verified: we can use the CrowdTruth framework with binary labels to eliminate the low-quality annotations, by eliminating the annotations of the annotators with a low WQS (RQ1.2). We assume that there are three kinds of annotators whose annotations differ from the majority: 1) spammers who randomly pick labels, 2) annotators who make infrequent mistakes, and 3) annotators whose interpretations of samples differ from the majority interpretation but who still give "correct" labels. To obtain a dataset of high quality without eliminating the different judgements, we need to remove the first type of annotators while making sure not to remove the second and third types (or possibly only the wrong annotations of the second type of annotators). In order to select a threshold on the Worker Quality Score to filter out low-quality annotators from the dataset, we manually check the annotators with the lowest scores and define a threshold below which the annotators seem to be spammers. We identify spammers as annotators who give completely random wrong answers or who always use the same label. The annotators with a quality score between 0 and 0.5 all seem to give annotations randomly. This can be seen from the fact that objective sentences receive wrong annotations and the annotations of the surrounding samples in the crowdsourcing task are similar (probably the annotators simply always use the same label). This represents around 50 of the 4301 annotators (1.16%) and 18000 of the 1598289 annotations (1.13%). Annotators with a WQS between 0.5 and 0.65 also give random annotations, but not for all their annotations: a majority of their annotations is correct, so we choose not to remove these annotators. There are around 100 annotators (2.33%) in this category, who provided around 32000 annotations (2.00%). A more precise treatment of each annotation of each annotator should be made later, but this cannot be realized with the CrowdTruth framework. Annotators with a higher quality score usually correspond to annotators who make very occasional mistakes, have some judgements different from the majority, or do not make mistakes at all. The CrowdTruth framework can be used to filter the spammers' annotations. However, it is not possible to differentiate the annotators who make occasional mistakes from the annotators who express valid judgements different from the majority, and it is not possible to filter out only the wrong annotations of the annotators who make occasional mistakes. These points remain to be investigated in future work.
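A minimal sketch of this filtering step is given below, assuming the annotations are stored in a pandas DataFrame; the column names and the dictionary of per-worker WQS values are hypothetical.

```python
# Sketch of the spammer-filtering step.
import pandas as pd

WQS_THRESHOLD = 0.5  # workers below this score were judged to annotate randomly

def filter_spammers(annotations: pd.DataFrame, wqs_per_worker: dict) -> pd.DataFrame:
    """Remove every annotation made by a worker whose WQS is below the threshold."""
    # Workers missing from the dictionary map to NaN and are dropped as well.
    keep = annotations["worker_id"].map(wqs_per_worker) >= WQS_THRESHOLD
    return annotations[keep]
```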

3.4. Highlight of the presence of different subjectivities in a Computer Science toxicity dataset
We now verify hypothesis H1 of the chapter from a Computer Science point of view, by investigating whether the subjectivities highlighted in the psychology literature appear in the available Computer Science datasets of sentence toxicity. We answer the question: are there different valid judgements in available toxicity datasets? Our hypothesis is that part of the samples are associated with different valid judgements. We assume that different valid subjectivities correspond to different valid annotations on one same sample, which is what the crowdsourcing literature calls "disagreement", and we check for disagreement in the annotations.

3.4.1. Experimental set-up
Disagreement is measured by computing the Average Disagreement Rate (ADR) with the majority vote (MV) for each annotator. We define the ADR as the number of annotations of an annotator which differ from the MV labels, divided by the total number of annotations of the annotator. It quantifies whether the annotators agree on the judgements of data samples, because the MV represents the common perception of toxicity on each sample (calculated as the most frequent annotation for each sample); it makes it possible to verify whether all the annotators follow the same line of thought or whether they have different perceptions. We plot the distribution of ADR over the whole population to check how the disagreement is distributed.
Because the wrong annotations of certain annotators could contribute to the disagreement measures without being indications of subjectivities, we remove the wrong annotations with the CrowdTruth framework (Section 3.3.2), and then we study the disagreement in the dataset. Since it is difficult to distinguish between annotators who give wrong annotations and annotators whose perceptions differ from the MV, we repeat the process of removing the annotations of the low-quality annotators, computing the ADR with the MV and plotting the ADR distribution, with the annotators of 0 to 0.6 CrowdTruth WQS removed, so that we see whether removing spammers or valid but uncommon judgements has an influence on the disagreement.
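The ADR computation can be sketched as follows, again assuming a pandas DataFrame of annotations with hypothetical column names.

```python
# Sketch of the ADR computation over a DataFrame with columns
# "worker_id", "sentence_id" and a binary "annotation".
import pandas as pd

def average_disagreement_rate(annotations: pd.DataFrame) -> pd.Series:
    """ADR per worker: fraction of a worker's annotations that differ from
    the per-sentence majority vote."""
    # Majority vote per sentence (the most frequent annotation).
    mv = annotations.groupby("sentence_id")["annotation"].agg(
        lambda labels: labels.mode().iloc[0])
    disagrees = annotations["annotation"] != annotations["sentence_id"].map(mv)
    return disagrees.groupby(annotations["worker_id"]).mean()
```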

3.4.2. Results
The histograms of ADR are computed using the ADR over binary labels (Fig. 3.6) and over the full range of labels (Fig. 3.7). We also plot the number of annotations removed and the average ADR over the whole dataset as a function of the minimum Worker Quality Score allowed in the dataset (Fig. 3.8).
For the binary labels, we observe that the more annotations are removed, the more the disagreement with the MV increases, until it stabilizes and reaches a specific distribution with very few annotators always agreeing with the MV but most annotators only disagreeing around 15% of the time. The CrowdTruth WQS is computed as a combination of the sentence ambiguity and the disagreement rate of the annotator with the other annotators. It does not only take into account the disagreement with the majority, and that is why the evolution of the ADR distribution with the number of low-quality annotators removed is not linear. Since many samples are not ambiguous and incorrect annotators are incorrect only for a portion of their annotations, the more annotators (and their partially correct annotations) there are, the more agreement can be found, and decreasing the number of annotations makes the annotators with slight disagreement appear. When decreasing the number of annotations, the annotators with very high disagreement are filtered out by the framework, since they make many mistakes (it is not only that they always have a different opinion on sentences).

Figure 3.6: Worker Average Disagreement Rate with the majority after removing 0 to 300 lowest quality workers (Worker Quality Score between 0 and 0.7). The histograms are normalized, and the labels considered are binary. The more low quality annotators are removed, the more disagreement with the majority-vote is observed.

Figure 3.7: Worker Average Disagreement Rate with the majority after removing 0 to 300 lowest-quality workers (Worker Quality Score between 0 and 0.7). The histograms are normalized, and the labels considered are within [-2;2]. Removing low-quality annotators almost does not influence the distribution of disagreement with the majority-vote.

This is supported by the plot of the CrowdTruth UQS (Fig. 3.2), which shows that most samples are non-ambiguous and therefore people should not have many opinions which differ from the MV. The plot of the CrowdTruth framework results (Fig. 3.2) also shows that the UQS is high and that the WQS follows a distribution similar to our plot of the ADR with the MV, which is a good indication of the validity of the experiments.
For the calculations over the full range of labels, the histograms remain similar when removing any number of low-quality annotators' annotations, and the average disagreement is much higher than for the binary labels. This is explained by the fact that the disagreement is always very high since it is difficult for workers to understand the difference between labels such as toxic and very toxic, or healthy and neutral, and these differences are subjective, so there are several causes of differing perceptions.

3.4.3. Discussion
Since the disagreement has many causes (the labels are not well defined) when studying the full range of labels, we prefer studying the results on the binary labels, which reduce the sources of variation in the perceptions of toxicity. With the binary labels and approximately 50 low-quality annotators removed, we obtain a stable ADR distribution with only a small percentage of workers who always agree with the MV. This shows that the Jigsaw dataset presents different subjectivities, since all the other annotators partly disagree with the MV and thus disagree among each other.
With the binary labels, only 10.5% of the annotators (around 400 annotators) always agree with the MV when 50 or more low-quality annotators are removed.

Figure 3.8: Average Average Disagreement Rate (ADR) and number of annotators removed as a function of the minimum Worker Quality Score (WQS) allowed in the dataset. The average ADR computed on the available set of annotators with the binary labels is 5 times lower than with the 5 labels. The number of low-quality annotators removed as a function of the annotators' quality follows a similar curve as the number of annotations removed as a function of the quality of the annotators of these annotations, but with different steepening of the curves. The more low-quality annotators are removed, the lower the average ADR of the available annotators.

This shows that MV annotation aggregation is not representative of most individuals' lines of thought but only of a sentence-level common opinion (the majority vote). Therefore, current algorithms which are trained to output this MV are unfair towards most people, since they are not in line with their lines of thought. This supports our claim that the Jigsaw dataset is adapted to study the fairness of the prediction of subjective properties.
From this analysis, we additionally conclude that to study the subjective character of toxicity, it is important to remove at least the 50 lowest-quality annotators (the spammers) and their annotations. Otherwise a majority of annotators seem to agree with the MV, which hides the unfairness present in the true data. Besides, removing only 50 annotators among the lowest-quality annotators makes it possible to keep those who disagree very often with the MV but express their true opinions and do not make many mistakes.

3.4.4. Conclusions
From the literature and the previous experiments, we conclude that sentence toxicity is a subjective property, and that the Jigsaw dataset enables us to study the effects of the presence of different subjectivities on the fairness of datasets and Machine Learning models, with certain pre-requisites and limitations (RQ1.1). Thus hypothesis H1 is verified: sentence toxicity is a valid use-case to study the prediction of subjective properties. Choosing a subjective property such as sentence toxicity, with an available Computer Science dataset which exhibits disagreement between annotators, is the main requirement to investigate the fairness of datasets and Machine Learning models made to predict subjective properties.
With the Jigsaw dataset, we identified that around 51% of the annotators (around 2500 annotators) disagree 15% of the time with the majority vote, 34% of the annotators (1445 annotators) disagree 20% of the time, and 4.5% of the annotators (192 annotators) disagree more than 20% of the time. A Machine Learning model which would be trained on the majority-vote labels resulting from the aggregation of the annotations, assuming that its accuracy is very high, would consequently give predictions which would suit the line of thought of at most 10% of its users. That could be considered unfair towards the rest of the users, and that shows the necessity of further studying the unfairness of the predictions of Machine Learning systems.
The low-quality annotators' annotations should be removed from the dataset to make the different subjectivities more obvious. In the Jigsaw dataset, the disagreement among annotators is partly due to these subjectivities but also to the lack of context information about the sentences. In our work we are forced to conflate these two causes of disagreement and cannot distinguish them in the study of unfairness. However, future work could make sure to remove this second source of ambiguity and consequently of disagreement.

3.5. Clustering data for crowdsourcing cost-reduction
In order to create a dataset of sentences and judgements of the perception of toxicity which is large enough to train Deep Learning algorithms, we cannot hope to obtain high-quality crowdsourcing results on the full dataset, because running the crowdsourcing task enough times would be too expensive. The number of samples to annotate is too large to keep the cost of the crowdsourcing task low, and so we cannot have the full dataset annotated. That is why we attempt to answer the third sub-question: how to collect crowdsourcing annotations on a large dataset while maintaining a low cost? We hypothesize that:

H4: automatically clustering data based on the sentence properties, annotating the cluster centroids and propagating the annotations to the rest of the cluster provide accurate labels for the dataset.

This hypothesis is divided into three cases: 1) the clusters each reflect different toxicity judgements; 2) the clusters are more or less homogeneous in the toxicity judgements and only homogeneous clusters should be annotated; 3) the clusters reflect different types of sentences whose toxicity interpretation is more or less ambiguous, and consequently more or fewer annotations per cluster or centroid should be collected.

3.5.1. Experimental set-up
To test the hypothesis, we run different clustering algorithms on our dataset, and investigate whether the clusters have a human-interpretable meaning beyond sentence properties only, that is, whether they reflect one of the three cases cited above.
Data preparation. First we clean the available data. We turn all the letters into lower case and remove all the formatting elements which could be in the sentences. We perform tokenization and remove all the English stop words. Then we encode the sentences. We experiment with two types of features. First we use the term frequency-inverse document frequency (TFIDF) computed over each sample. The second type of features is the Paragraph Vector representation of sentences. It was introduced by Le et al. [69], with the idea that common feature representations of text lose the information about the order of words, while this information might be useful for further use in other algorithms. We train this representation on our whole corpus of sentences, using the Doc2Vec implementation of the gensim Python library 3.
Training of the K-Means algorithm. We train the K-Means algorithm on the dataset, using between 1 and 30 clusters for the value of K. We perform different measurements (both intrinsic metrics and metrics using the binary and true labels of the samples) and plot these measurements as a function of K, in order to determine the optimal K (Fig. 3.9, 3.10). It seems that a K between 2 and 5 is sufficient.
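A minimal sketch of the feature extraction and K-Means sweep is given below. The hyper-parameter values and variable names are illustrative assumptions rather than the exact settings used in the experiments (gensim ≥ 4 API for Doc2Vec), and the two indices reported here (inertia and completeness) are only one possible choice among the metrics mentioned above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import completeness_score

def tfidf_features(sentences):
    """TF-IDF features with lowercasing and English stop-word removal."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    return vectorizer.fit_transform(sentences)

def doc2vec_features(sentences):
    """Paragraph Vector features trained on the whole corpus (gensim Doc2Vec)."""
    docs = [TaggedDocument(words=s.lower().split(), tags=[i])
            for i, s in enumerate(sentences)]
    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=20)
    return [model.dv[i] for i in range(len(sentences))]

def kmeans_sweep(features, binary_labels, k_values=range(2, 31)):
    """Fit K-Means for each K; report the inertia (intrinsic index) and the
    completeness against the binary majority-vote labels."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=0).fit(features)
        scores[k] = {"inertia": km.inertia_,
                     "completeness": completeness_score(binary_labels, km.labels_)}
    return scores
```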

Figure 3.9: Search of the optimal K for the K-means algorithm. Computed for the TFIDF features. The completeness and Calinski- Harabaz indices show a clear change of the curve slope for a number of clusters K equal to respectively 2 and 3.

3 https://radimrehurek.com/gensim/models/doc2vec.html

Figure 3.10: Search of the optimal K for the K-means algorithm. Computed for the Paragraph Vector features. All the scores exhibit a clear slope change for a number of clusters K equal to a value between 3 and 5.

Evaluation of the clustering algorithms. For each cluster, we collect separately, for each demographic category, the annotations corresponding to the sentences in the cluster, and aggregate the annotations into a majority-vote (binary) label per demographic category. Then we compute the mean and standard deviation over these aggregated labels. The mean makes it possible to compare whether the labels are similar among different demographics inside a cluster: the closer the mean is to 0 or 1, the more similar all the labels in the set are. The standard deviation makes it possible to compare whether the labels among annotators of a demographic category in one cluster are similar. Consequently, we can observe whether the clusters' annotations are homogeneous or not.
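This per-cluster, per-demographic evaluation can be sketched as follows, assuming a pandas DataFrame of annotations joined with the cluster assignment of each sentence; all column names are hypothetical, and ties in the per-demographic majority vote are resolved arbitrarily.

```python
# Sketch of the cluster homogeneity evaluation over a DataFrame with columns
# "sentence_id", "worker_demographic", a binary "annotation" (0/1) and
# the cluster id of each sentence in "cluster".
import pandas as pd

def cluster_homogeneity(annotations: pd.DataFrame) -> pd.DataFrame:
    """For each (cluster, demographic) pair, aggregate the annotations of each
    sentence into a per-demographic majority vote, then return the mean and
    standard deviation of these votes over the sentences of the cluster."""
    mv = (annotations
          .groupby(["cluster", "worker_demographic", "sentence_id"])["annotation"]
          .agg(lambda labels: int(labels.mean() >= 0.5))   # per-demographic MV
          .reset_index())
    return (mv.groupby(["cluster", "worker_demographic"])["annotation"]
              .agg(["mean", "std"]))
```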

3.5.2. Results and discussion about the clustering algorithm
The results for the TFIDF features are plotted in Fig. 3.11 (the other configurations give similar plots). We can see that the values are consistent among the different demographic categories.

Figure 3.11: Results of the clustering algorithms with the TFIDF features, clustering with K = 5. The demographic categories are ordered from most frequent (bottom) to least frequent (top) in the dataset. Most demographic categories exhibit a similar mean and standard deviation within a same cluster: a same number of annotations per category would have to be collected. The values differ across clusters, with more (clusters 1 and 4) or less (clusters 0 and 2) homogeneity in the annotations of each demographic category. Possibly more or fewer annotations should be collected for the samples of the different clusters.

For the TFIDF features, there is little difference between the results on the majority vote and on the density-estimated labels. The clusters are quite homogeneous, with a standard deviation around 0.3, each of the clusters having similar standard deviations. They do not make it possible to identify the clusters for which the labels would all be the same for one category of population, since the most frequent demographic categories (with the most data samples, at the bottom of the y-axis) have a standard deviation of approximately 0.2 to 0.45, which indicates that both types of labels (positive and negative classes) exist in the clusters.
The Paragraph Vector features result in larger value differences between the clusters, with a standard deviation between 0 and 0.5 across clusters when using 5 clusters. When increasing the number of clusters, the differences between the clusters increase. The clusters with a standard deviation around 0.5 can be used to indicate the samples on which to collect many annotations. The clusters with a smaller standard deviation would correspond to samples which all have the same label, and therefore only a few annotations on a few of these samples would be needed. However, for these clusters the standard deviation remains quite high (around 0.2), and therefore it might not be accurate to use only one label for these clusters.

3.5.3. Conclusions
In conclusion, we consider that the third case of hypothesis H4 is verified. A few clusters seem homogeneous in their labels for specific demographic categories, and thus few of the samples in these clusters need to be annotated by these categories. A larger number of clusters have a higher standard deviation; in these clusters each sample is annotated differently, and consequently they would require more annotations by a same demographic population.
Since there are many more non-homogeneous clusters, the clustering algorithm combined with the features that we use does not seem adapted for the selection of a few samples to annotate (RQ1.3). However, future work could investigate whether it is possible to set different minimum numbers of annotations to collect for different clusters by different demographic populations, in order to obtain a set of annotations large enough to represent all the different possible valid judgements of the samples. Possibly we could order the clusters by standard deviation values, and study how the number of annotations per cluster type influences the performance of the models trained with these datasets. We expect that there would be a trade-off between the model's performance and the number of annotations. This study could make it possible to define a threshold on the number of annotations per cluster type, or a relationship between this number and the performance of the models, in order to obtain the best performance allowed by the trade-off.

3.6. Summary
In this chapter we investigated three aspects of the collection of annotations via crowdsourcing to create datasets of subjective properties used to train Machine Learning algorithms for the automatic classification of data samples on subjective properties.
First we verified that sentence toxicity is a valid use-case for the study of the creation of these datasets, by reviewing the psychology literature about sentence toxicity and analysing the disagreement in an available Computer Science dataset of sentence toxicity (RQ1.1).
Then we focused on how to collect high-quality annotations of subjective properties and identified the current limitations of crowdsourcing techniques. We gave a list of recommendations for the crowdsourcing task design, and pointed out that the use of questionnaires inspired by the psychology literature would merit being investigated to collect high-quality annotations. Furthermore, we investigated whether the CrowdTruth framework is applicable to crowdsourcing tasks for subjective properties. We showed that although the framework is usable to eliminate spammers' annotations among the annotations of all the annotators, it does not make it possible to distinguish the incorrect annotations from the valid annotations which differ from the judgement of the majority on a sample. Thus we concluded that future work could propose new methods for the filtering of annotations of subjective properties (RQ1.2). This constitutes contribution 3).
Finally, because Machine Learning algorithms require a lot of data to be trained on, we attempted to propose unsupervised clustering methods of data samples to reduce the amount of data to annotate and consequently to decrease the cost of collecting dataset annotations via crowdsourcing (RQ1.3). However, these propositions did not lead to satisfactory results, and that is why we identify the following research direction for future work: how to collect crowdsourcing annotations of subjective properties on a large dataset while maintaining a low cost?

4 Method to evaluate algorithmic fairness for subjective properties classification

4.1. Introduction
In this chapter, we investigate the second research question RQ2: how to evaluate algorithms' fairness when predicting subjective properties? The aim is to find an objective and quantitative way to measure the fairness of algorithms for our task of predicting subjective properties of samples, which means predicting properties for which one unique judgement does not exist, but for which different persons would have different interpretations of one same sample. In order to do so, we divide the question into several sub-questions.

1. How to define algorithmic fairness when predicting subjective properties? (RQ2.1)
→ We review the literature about bias and fairness of Machine Learning algorithms and adapt previous definitions to the specific case of algorithms made to predict subjective properties. The proposed definition is based on the equality of the algorithm's performance for each user.

2. How to characterize possible unfairness of the algorithms when predicting subjective properties? (RQ2.2)
→ We define a set of possible clustering criteria on which to divide the dataset, and performance metrics to measure the performance of the algorithm on each subset of the dataset. After defining different algorithms with different expected fairness-related behaviours, we check whether the clustering criteria and performance metrics make it possible to exhibit the expected fairness-related behaviour of each algorithm. We propose to cluster the dataset on an annotation, annotator or sample level, and to measure true positive and true negative rates.

3. How to translate the fairness characterizations into a fairness measure? (RQ2.3)
→ We propose metrics which summarize the previous characterizations, apply them to the different algorithms, and verify whether they return significant results. We decide to compute the standard deviation of the true positive and true negative rates across data clusters.

After defining what algorithmic fairness is for the classification of subjective properties, we need to find ways to observe potential unfairness and to summarize the observations into values. For this, we propose several characterizations of unfairness (Section 4.3) and then several computations to summarize them (Section 4.4). At the end of the chapter, we answer the main question by selecting one of these computations as a method to evaluate algorithmic fairness. This is the second contribution of the thesis.

4.2. Definition of algorithmic fairness, necessity to propose new metrics
In this section, we answer the first sub-question: how to define algorithmic fairness when predicting subjective properties? (RQ2.1) To define what fairness is for our task of predicting subjective properties, we investigate the literature about Fairness, Accountability and Transparency of ML algorithms, and adapt our findings to the task. Additionally, we look at the Machine Learning literature to determine which metrics are usually employed to evaluate algorithms, and investigate whether they can be used to measure algorithmic fairness.


4.2.1. Definition of algorithmic fairness for classification of subjective properties
Fairness in the literature
According to our literature review (Section 2.4), there are multiple definitions of fairness in the literature, but they all refer to possible discriminations between groups of people. For example, a popular mathematical definition of fairness is error rate parity, saying that for an algorithm to be fair, the false positive rates should be equal across groups of people. These definitions are motivated by the fact that the algorithms directly or indirectly make use of protected properties characterizing the persons who are classified in order to classify them according to one criterion, for instance classifying people according to whether they will commit a crime or whether a bank should give them credit.
We do not aim at classifying people, but sentences, depending on both the sentences' and people's properties: we classify sentences conditioned on people. The aim of our task and its impact are different. In the other papers, the outcomes of the algorithms' outputs can be harmful to certain people and categories of people, since people are defined by a set of properties which make them members of a category. Thus it is possible to talk about a potential discriminative power. In our task we might not only use sets of properties to distinguish between people but also user-specific identifiers, and we do not aim at creating algorithms which generalize over categories of population, but at algorithms with different outputs for each different user. Thus there might be discrimination if we put the users into categories to evaluate the algorithms and we find inequalities, but these "discriminations" would not be harmful to the population categories directly, and this way of analysing the algorithms would not be justified since it does not make sense to categorize the users by their protected properties only.
Binns et al. [13] analyse the fairness of their algorithms to predict toxicity on the Jigsaw dataset. Although they use a traditional definition of fairness, measuring the accuracy differences of the algorithm to predict the majority vote for different categories of population (male or female), they note one limit of their approach. Taking the majority vote as ground truth for each category of population they define is not representative of the opinion of each member of the population, and thus their measure of fairness for the categories of population is itself not fair towards each member of the category. This is another study which supports our idea not to define fairness as the discriminative impact of algorithms.
From these observations, we claim that we cannot use the traditional definitions of algorithmic fairness to evaluate the fairness of our algorithms, but that we must propose a new definition which is directly related to the task of predicting subjective properties.

Fairness related to our task
A danger with current Web systems is the filter bubble: it represents a threat to democracy since the opinions of minorities remain unheard [15]. That is why the objective of our work is to predict subjective properties accurately, which means that we want to predict how each user would judge each sentence's toxicity instead of predicting the majority vote only. An unfair algorithm would be an algorithm for which the prediction accuracy is not equal for each user, but which returns only the opinions of the majority or of certain groups in the population. We propose to orient our definition of algorithmic fairness in the following direction: fairness is when an algorithm returns accurate predictions on samples for each user, disregarding whether the user's opinions are part of the minority or the majority. This generalizes previous definitions which focused on discrimination towards protected categories of population. The definition (RQ2.1) we choose is:

Definition: fairness is when an algorithm's prediction performance is equal for each user.

That means that each user sees his/her opinions taken into account. If an algorithm is 100% fair, it will accurately predict, for each sample and each user, the judgement (annotation) that the specific user would make on the sample. When mentioning prediction performance, we do not refer to structural performance such as the speed or the memory consumption of the algorithms, but to performance related to the accuracy of the outputs compared to the ground truth.
To additionally investigate whether our algorithms are discriminative towards certain categories of population, we use a second definition of fairness related to the protected features. The algorithm is discrimination-related fair if its prediction performance is equal for each category of population defined by every possible combination of protected features' values.
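A possible formalisation of these two definitions is the following; the notation below is illustrative, with $P_u(M)$ denoting the chosen prediction-performance metric of model $M$ restricted to the annotations of user $u$, and $P_g(M)$ the same metric restricted to the annotations of population category $g$.

```latex
\begin{align*}
\text{user-level fairness:} \quad
  & P_{u_i}(M) = P_{u_j}(M) \quad \forall \, u_i, u_j \in \mathcal{U} \\
\text{discrimination-related fairness:} \quad
  & P_{g_i}(M) = P_{g_j}(M) \quad \forall \, g_i, g_j \in \mathcal{G}
\end{align*}
```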

4.2.2. On the necessity of creating an adapted fairness evaluation method
Here, we show that it is necessary to propose a new algorithmic fairness evaluation method. As pointed out in the literature review (Section 2.4), usual methods to evaluate algorithms might exhibit high performance but hide biases in the outputs. We verify that this is the case for our task, making the hypothesis that traditional methods to evaluate algorithms' efficiency performance do not account for algorithmic fairness. From the literature review about Machine Learning algorithms for toxicity prediction (Section 2.2), we constitute a list of metrics usually used to evaluate the algorithms: precision, recall, F1-score, accuracy, Area Under the Curve, Spearman correlation, and the confusion matrix to distinguish between the classes. These metrics serve to compute a performance score of the algorithms, which is considered as their evaluation. We verify whether these evaluations give indications about algorithmic fairness.

• The accuracy is the ratio of the number of correctly classified instances over the total number of tested instances.

• The Area Under the Curve (AUC) is the area under the Receiver Operator Characteristic (ROC) curve, which is the plot of the true positive rate (TPR) against the false positive rate (FPR) at various prediction probability thresholds. Contrary to the accuracy, it makes it possible to take into account the possible class imbalance when evaluating the performance of the algorithm.

• The Spearman correlation measures the strength and direction of association between two ranked variables1. We compute it between our targeted labels and the predicted labels from the algorithms. The formula is the following: $\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, with $x$ and $y$ the two ordered variables we study.

• The precision is the ratio of true positives over the total number of instances predicted as positive.

• The recall is the ratio of true positives over the total number of positive instances.

• The F1-score combines both the precision and the recall: $2 \cdot \frac{precision \cdot recall}{precision + recall}$.

Experimental set-up
We set up four different Machine Learning models for which we expect to observe different fairness-related behaviours and apply the usual evaluation metrics to them. We study whether these behaviours are observable with these evaluation metrics. The four Machine Learning models on which to make the observations are summarized in Table 4.1, together with the number of each model. There are 2 dimensions of the models which make their expected behaviours change: the classifier's architecture and the training data. The classifiers' architectures are the following:

• Traditional ML: The traditional Machine Learning algorithms for which the inputs are sentence sam- ples and the outputs are toxicity labels.

• Input-augmented ML: The traditional algorithms whose inputs are augmented with demographic information about the readers of the sentences. The outputs are also toxicity labels. The demographics are encoded in two different ways (continuous or one-hot encodings, explained in Chapter 5).

• User-specific ML: Algorithms which have reader-specific entities: they distinguish between known users by learning user-specific parameters.

The training data are divided into three main categories:

• MV data: The dataset is constituted of samples and their majority-vote (MV).

• Disaggregated data: The dataset is constituted of samples and their annotations (identical samples appear several times in the dataset with different corresponding annotations).

• User-specific data: The dataset is constituted of samples, their annotations and their annotators' information (annotator identifier and demographic information).

We expect that the models made to distinguish between annotators and trained with adapted datasets (models 3 and 4) are fairer, because a completely fair algorithm should return the individual annotations of each annotator instead of aggregated labels, which models 3 and 4 should be better at doing.

1 https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php

ML architecture / training data: MV | disaggregated | user-specific
traditional ML: 1 | 2 | NA
input-augmented ML (continuous representation): NA | NA | 3.1
input-augmented ML (one-hot representation): NA | NA | 3.2
user-specific ML: NA | NA | 4

Table 4.1: The different models on which to make the observations

Results, discussion and conclusions
We evaluate the models with the above-cited evaluation metrics. The algorithm evaluated here is a simple classifier, Logistic Regression, because it is fast to train and test. Other algorithms should lead to similar observations since their general behaviour is the same. Since we can make similar observations for each of these metrics, we only report in Table 4.2 the results concerning the accuracy and F1-score computed on test data where the annotators' information is available.

model | F1-score | accuracy | accuracy positive class | accuracy negative class
(1)   | 0.6644   | 0.6885   | 0.6164                  | 0.7607
(2)   | 0.6663   | 0.6897   | 0.6194                  | 0.7600
(3.1) | 0.6740   | 0.6899   | 0.6410                  | 0.6899
(3.2) | 0.6727   | 0.6892   | 0.6387                  | 0.7398

Table 4.2: Efficiency performance of the models on the ambiguity-balanced dataset. Along the rows: the different training models; along the columns: the different evaluation metrics. The performance measures are similar for the four models, whereas they are expected to have different levels of fairness.

We observe that for each column, the accuracy and F1-score measures are similar, whereas the models are expected to behave differently. For example, configuration (3) is expected to have higher accuracies than the others on the disaggregated dataset, but the results do not show this. This has two possible explanations: the test dataset is unbalanced with few sentences which have different associated opinions, so the use of a complex algorithm does not change the output performance; or the performance over a certain type of samples decreases while the performance on another type increases, which hides potential changes in the behaviours of the models. From Chapter 3, we know that there are different subjectivities in the dataset, and thus the first explanation does not hold, but the second does. This is an indication that using traditional metrics is not meaningful to investigate algorithmic fairness for the prediction of the different subjectivities, because these metrics average over the whole test dataset, whereas the samples with differing opinions do not constitute the whole dataset, and so the performance differences over these samples are not visible in the evaluation. Therefore, our hypothesis is verified: usual evaluation methods of algorithm performance are not adapted to evaluate algorithmic fairness.

4.3. Characterization of algorithmic fairness
Usual evaluation methods do not make it possible to observe potential unfairness and understand it, so we focus on: how to characterize possible unfairness of the algorithms? (RQ2.2) We make several hypotheses to characterize algorithmic fairness. For that, we list the behaviours that Machine Learning classifiers designed (or not) to resolve unfairness should exhibit, so that we have a list of properties that a fairness metric should transmit. We list possible test-set splitting criteria, which we hypothesize should make it possible to observe these behaviours (characterization hypotheses). We apply these characterization hypotheses to our trained algorithms to evaluate whether they make it possible to exhibit the previously listed properties. The characterizations which are meaningful and human-interpretable are the foundations to propose evaluation metrics in the next section.

4.3.1. Formulation of fairness-characterization hypotheses
List of the expected behaviours of the Machine Learning models
Algorithmic fairness corresponds to performance equality at the user level. Accurate predictions correspond to the judgements of each user (the annotations of each annotator in our case). A 100% accurate model is also a 100% fair model, but a 100% fair model does not imply a 100% accurate model, because its prediction performance could be very low (low accuracy) but equal (high fairness) for each user. Considering a model fair and accurate when each annotation is correctly returned, algorithmic fairness can be studied from different angles. An unfair model returns incorrect annotations for certain users and consequently for certain samples, thus the annotation-level and sample-level study of the algorithm predictions might give indications of the potential causes of unfairness. We list the behaviours that we expect the four previously introduced Machine Learning models (Table 4.1) to exhibit, divided into four classes depending on what aspect of fairness is investigated (user-, sample-, annotation-, or discrimination-related fairness). The detailed list of expected behaviours can be found in Appendix B.1.
Models (1) and (2) are respectively trained on the labels aggregated by majority voting (MV) and on all the annotations, and neither is able to make any distinction between the different annotators of the data. Consequently, we expect them to output predictions which correspond to the opinion of the majority. That is, they would perform better for annotators who often agree with the MV, for annotations which correspond to the MV, and for samples whose annotations present high agreement (most of the annotations are equal to the MV), compared to data corresponding to low consensus (with little agreement and consequently different from the MV). Thus, these two models are expected to be mostly unfair on the user, sample and annotation levels. The analysis of the dataset (Appendix A.2) showed that the quantities of data of low or high consensus are similar across all the demographic categories, and consequently these models, which do not distinguish between the categories, should be fair on a discrimination level.
Models (3) and (4) are expected to output predictions which more often correspond to the exact annotations of each annotator instead of the MV. Consequently they should be fairer than models (1) and (2), but not 100% fair, because there are not enough training data and not enough known features describing each annotator to learn the line of thought of each annotator. On the discrimination level, model (3) is expected to be less fair than the others because it will learn distinctions between demographic categories with different accuracies, depending on whether there are many or few training data corresponding to each category, and consequently its performance will be unequal between categories. Model (4) should be fairer on this aspect since the performance of its predictions should not depend directly on the demographic categories.

Formulation of hypotheses
We propose possible ways to check whether these behaviours hold when evaluating the models. First we define several scores to compute on the data in the dataset, that we use in the chapter.

• Annotators’ average disagreement rate with the majority vote (ADR) score: general disagreement of one selected annotator with the majority vote, computed as the number of annotations of the annotator which differ from the MV labels divided by the total number of annotations of the annotator.

• Sentence ambiguity (AS) score: agreement of the annotators on the label of the selected sentence, computed as the number of identical annotations divided by the total number of annotations for the sentence (higher values thus correspond to lower ambiguity).

• Annotation popularity (AP) score: popularity of one selected annotation among the other annotations of a sample, computed as the percentage of the sample’s annotations which are in agreement with the selected annotation. A sketch of the computation of these three scores is given below.
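To make the three scores concrete, the following sketch computes them from a table of disaggregated annotations. The column names, the toy data, and the exact normalisations (e.g., AS computed as the fraction of annotations equal to the sentence's majority label) are our own illustrative reading of the definitions above, not code or conventions from the thesis.

```python
import pandas as pd

# Toy disaggregated annotations: one row per (sentence, worker) judgement.
ann = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "worker_id":   ["w1", "w2", "w3", "w1", "w2", "w3"],
    "label":       [1, 1, 0, 0, 0, 0],   # 1 = toxic, 0 = non-toxic
})

# Majority-vote label per sentence.
mv = ann.groupby("sentence_id")["label"].agg(lambda s: s.mode().iloc[0])
ann["mv"] = ann["sentence_id"].map(mv)

# ADR: per worker, fraction of annotations that differ from the majority vote.
adr = (ann["label"] != ann["mv"]).groupby(ann["worker_id"]).mean()

# AS: per sentence, fraction of annotations equal to the majority label
# (a high value means high agreement, i.e. low ambiguity under this reading).
as_score = (ann["label"] == ann["mv"]).groupby(ann["sentence_id"]).mean()

# AP: per annotation, fraction of the sentence's annotations that agree with it.
def popularity(group):
    shares = group["label"].value_counts(normalize=True)
    return group["label"].map(shares)

ann["ap"] = ann.groupby("sentence_id", group_keys=False).apply(popularity)

print(adr, as_score, ann[["sentence_id", "worker_id", "label", "ap"]], sep="\n")
```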

We formulate the following hypotheses.

H1: Comparing the performance of the model for each user shows whether the model is fair or not.

Indeed, if the performances differ between users, the definition of algorithmic fairness is invalidated.

H2: Clustering the test data into several clusters and computing the efficiency performance for each cluster enables accounting for different algorithmic fairness aspects.

Comparing the performance of the model on the annotation, user or sample level would not explain any possible cause of unfairness, whereas clustering annotations, users or samples according to an interpretable criterion and comparing the cluster performances would. We hypothesise that the following clustering criteria should highlight potential causes of unfairness:

• User clustering criteria:

– The workers’ average disagreement rate with the majority vote (ADR): differences of performance between these clusters would help explain the unfairness. We expect a linear trend of performance across clusters ordered from low- to high-disagreement workers (a sketch of this per-cluster computation is given after this list).
– The CrowdTruth Worker Quality Score (WQS): this score partly represents the disagreement between workers, so we expect the same behaviours as for the previous clustering criterion.
– The demographic categories: should show whether the models are discriminative.

• Sample clustering criteria:

– The sentence ambiguity (AS): we expect low-ambiguity sentences to receive higher performance since their annotations would be closer to the majority vote.
– The sentence Unit Quality Score of CrowdTruth (UQS): this score is also made to represent sentence ambiguity, thus we expect the same observations as for the above clustering criterion.

• Annotation clustering criteria: The annotation popularity (AP): we expect the annotations which are less popular (minority opinion) to have lower performance.
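As an illustration of the user-level clustering criterion, the sketch below bins annotations by the annotating worker's ADR score into ten equal-width bins and reports the model's accuracy per bin; the table layout and the random toy data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical evaluation table: one row per annotation, with the annotating
# worker's ADR score, the worker's own annotation and the model's prediction.
eval_df = pd.DataFrame({
    "adr": rng.random(500),
    "annotation": rng.integers(0, 2, 500),
    "prediction": rng.integers(0, 2, 500),
})

# Ten equal-width ADR bins over [0, 1]; the (in)equality of the per-bin accuracy
# is what the user-level characterization reads as (un)fairness.
eval_df["adr_bin"] = pd.cut(eval_df["adr"], bins=np.linspace(0, 1, 11), include_lowest=True)
eval_df["correct"] = eval_df["annotation"] == eval_df["prediction"]
print(eval_df.groupby("adr_bin", observed=False)["correct"].mean())
```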

H3: Different performance metrics make the potential unfairness more or less observable.

The traditional algorithm evaluation metrics usually exhibit different aspects of algorithmic performance and therefore they should also influence the observed behaviours about fairness.

4.3.2. Experimental set-up

We apply the proposed evaluations on models (1) to (3) and verify whether the expected behaviours are observed. We separate the dataset into clusters with the different clustering criteria, compute the usual evaluation metrics on these clusters, and check whether the results allow interpretations about the fairness of the models. After selecting only the most meaningful characterizations (details in Appendix B), we focus on the analysis and exploitation of the retained characterizations with the accuracy and F1-score. We only study the differences between models (1) and (3), because models (1) and (2) have similar behaviours. We plot the performance of the different models on the same heatmap so that they can be compared on an identical scale.
To investigate the accuracy, we not only cluster the samples, users or annotations according to the previously defined criteria, but also according to the binary toxicity judgement that each user gave. This is equivalent to computing the true positive and true negative rates over the annotations inside the clusters. It shows whether one label is more easily predicted than the other. It is not possible to do the same with the F1-score because it requires both classes to be computed.

4.3.3. Results and discussions

H1, H3: Fairness at the user level
The distributions of the performance of models (1) and (3) for each user are reported in Fig. 4.1. We observe inequalities in the treatment of the different users, with many users having very low or very high accuracy performance; the F1-score presents mostly low performance. This representation gives an indication about the fairness of the models: a fair model would produce a histogram with a unique bar, while an unfair algorithm shows several bars of performance for different users. This characterization however does not reveal the possible causes of unfairness.
The performance distribution over the samples is reported in Fig. B.2. The total accuracy distribution is almost linear, with more sentences with a high accuracy prediction rate. Computing the distribution only on the positive or negative class returns different types of results, mainly with similar numbers of sentences with 100% or 0% accuracy. This representation shows whether the sentences receive equal performance, which is not the case here. However, it does not give any indication to understand and check for the causes of the potential unfairness.
The user- and sample-performance distributions are highly dependent on the performance metric used and sensitive to the dataset employed to compute them. The distributions are affected by the test dataset constitution, since datasets containing few or many hard-to-classify sentences would show very different distributions. Therefore, although the distributions show whether the algorithm is fair towards each user, this evaluation depends on the dataset and cannot be used to compare models evaluated on different datasets.

[Figure 4.1 panels: (a) Global accuracy. (b) Accuracy for the negative class. (c) Accuracy for the positive class. (d) Global F1-score.]

Figure 4.1: Distribution of the performance of the predictions per user, evaluated with several performance metrics. Comparison of models (1) (in red named "A-no") and (3) (in green named "D-cont"). On each plot, several bars are present: different annotators receive predictions of different performance. Model (1) shows slightly higher bars corresponding to low-performance than model (3), but the comparison is not clear.

H2, H3: Performance on clusters
A detailed description of the results is given in Appendix B.2.2. Clustering on the user level with the ADR score, the WQS or the demographic categories and computing the performance on the clusters makes the expected behaviours of the models observable. The same procedure on the annotation level leads to the same observations; the behaviours are observed more clearly when computing the accuracy separately on the positive and negative classes than with the average accuracy. The same procedure on the sample level (AS or UQS scores) also shows the expected behaviours, however the performance trend across clusters is not as linear as expected. The lowest-consensus data show very low performance even with model (3), probably because there are not enough data to learn the exact annotations.
An example characterization is given in Fig. 4.2. The horizontal separations correspond to different clusters of data depending on the values of the ADR score (reported on the y-axis), and the vertical separations to the evaluations of different models on the positive and negative classes. Each cell corresponds to the accuracy of one model on one cluster. For the accuracy over the two classes, it confirms that high-consensus data (bottom cells) receive higher-performance predictions than low-consensus data (top cells), and that model (1) performs worse than model (3) on medium-consensus data (middle cells). That allows concluding about the relative fairness of the models: model (3) is fairer than model (1) since its performances are more equal across clusters. For the accuracy on the two classes separately, model (1) seems to perform better. The chosen characterization should depend on which performance metric to focus on.

4.3.4. Conclusions

We discuss whether the different hypotheses are verified.
H1. Plotting the distribution of the model performance for each user shows performance inequality between users and thus shows unfairness in the model. However, no clear difference between the models appears, and the distributions are dependent on the constitution of the test set. Thus hypothesis H1 is verified, but it does not seem to be a convenient characterization on which to build a fairness evaluation metric.

[Figure 4.2 panels: (a) Characterization using the accuracy. (b) Characterization with the true positive and true negative rates (separate accuracy on the two classes).]

Figure 4.2: Visualization of the accuracy, and of the true negative and positive rates (class 0: non-toxic, class 1: toxic) based on the ADR clustering criterion (intervals reported on the y-axis). Comparison of models (1) and (3). Model (3) performs better than model (1) on data of very low consensus (0.13 to 0.39 ADR) and of high consensus (0 to 0.23 ADR) according to the accuracy values. These observations are inverted for the true positive rate. The performance metric used changes the observations: it should be chosen depending on the aim of the model (optimization of the accuracy or of the true positive rate).

H2. The different clustering criteria proposed on the different levels highlight different causes or explanations of unfairness: hypothesis H2 is verified. We select four criteria whose interpretations do not overlap: on the user level 1) the workers' average disagreement rate with the majority vote and 2) the demographic categories; on the sample level 3) the sentence ambiguity; on the annotation level 4) the annotation popularity. We select these criteria over the CrowdTruth UQS and WQS because the resulting characterizations are similar but the selected criteria are faster to compute than the CrowdTruth scores.
H3. Hypothesis H3 is verified since different metrics gave different interpretations of the models (RQ2.2). We decide to work with the accuracy metric, since it is the easiest to interpret, and with the F1-score, so that we can compare the results of the two metrics. In order to get meaningful information about the accuracy in case the test dataset is unbalanced over classes, we decide to compute the accuracy separately on the positive and negative classes (true positive and true negative rates).
We conclude that algorithmic fairness can be characterized on several levels which explain different causes of unfairness. To highlight the fair or unfair character of a model, the annotations of the evaluation set can be clustered according to criteria depending on the level on which to focus, and the performance of the model on each cluster should be compared to identify potential inequalities.

4.4. Metrics for fairness evaluation

We now answer RQ2.3: how to translate the fairness characterizations into a fairness measure? The characterizations show different aspects of unfairness of the predictions of the models, thus we propose different metrics based on them, experiment on the different parameters of the metrics, and evaluate their significance on our models. We conclude with a final choice of metrics to evaluate fairness.

4.4.1. Formulation of hypotheses

Requirements for a new fairness metric
The algorithmic fairness metric should return a score computed for each algorithm to be evaluated. We list the properties that the fairness value should respect:

1. The metric gives an indication about the fairness of the Machine Learning models.

2. The metric gives an indication about the causes of potential unfairness.

3. The metric also gives an indication about the global performance of the algorithms. Otherwise, an algorithm with low but equal performance for each user would be seen as highly fair, although it is also completely inefficient.

4. The metric is independent of the dataset on which it is evaluated. In this way, algorithms evaluated on different datasets would still be comparable.

Proposition of different metrics
Based on these requirements and the previously proposed characterizations, we formulate hypotheses on possible ways to evaluate algorithmic fairness.
Requirement 1. The previous section showed that the metric cannot be an average over the whole dataset, since the dataset constitution biases the computation, which goes against requirement 4. For example, a model evaluated on an unbalanced dataset with many high-agreement samples would exhibit a high accuracy since it is able to return the majority vote, while a dataset with a majority of low-agreement samples would lead to low accuracy. The previously proposed characterizations mainly consist in dividing the dataset into clusters, computing the average algorithm performance per cluster, and comparing the algorithmic performance between clusters. We can quantify the dispersion between the values of each cluster to account for the comparison. If the clusters' properties are identical across datasets, the performance values should be comparable. Considering that our definition of fairness is equality of performance across users, we hypothesize that:

H1: Main fairness indicator: quantifying the dispersion between the cluster performances obtained with the user-level clustering on the average disagreement rate with the majority vote serves as the main measure of fairness (requirement 1).

Requirement 2. The values obtained with the other clustering criteria of the selected characterizations would be secondary measures to explain potential unfairness.

H2: Side aspects of unfairness: we could interpret unfairness based on inequality of certain properties of the users (demographics for discrimination-related fairness), samples (ambiguity of the sentence) or annotations (popularity of the annotations).

To measure the gap between the different clusters, we propose to

H3: Cluster dispersion: compute the standard deviation across the performance values associated with each cluster, or compute the range (absolute value) between the highest- and lowest-performing clusters (requirement 2).

Requirement 3. We need to give one measurement of the performance of the model. However, as for the previous considerations, we cannot compute the accuracy over the whole dataset since it is not independent of the test set (against requirement 4). We propose to report the:

H4: General performance: the average of the performances of the different clusters, or the lowest performance among all the clusters (requirement 3).

For all these hypotheses, the performance metric on which the computations are based is:

H5: Performance metric: the F1-score, or the true positive and true negative rates, whose dispersions and general performance values are averaged over the two classes.
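To illustrate the candidate summaries of H3 and H4, the sketch below reduces a set of hypothetical per-cluster scores to the two dispersion options (standard deviation, range) and the two general-performance options (mean, minimum); the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical per-cluster true positive rates (one value per ADR cluster).
cluster_tpr = np.array([0.91, 0.88, 0.80, 0.72, 0.65, 0.55, 0.52, 0.48, 0.45, 0.40])

dispersion_std   = cluster_tpr.std(ddof=1)                 # H3: sample standard deviation across clusters
dispersion_range = cluster_tpr.max() - cluster_tpr.min()   # H3: absolute range between best and worst cluster
general_mean     = cluster_tpr.mean()                      # H4: average cluster performance
general_min      = cluster_tpr.min()                       # H4: worst-case cluster performance

print(dispersion_std, dispersion_range, general_mean, general_min)
```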

Requirement 4. Different datasets cover different ranges of values of the clustering criteria, and so the constitution of clusters differs across datasets. Small datasets would not allow computing the performance over the whole range of clusters, and the fairness values would be biased towards the properties of the available clusters (low performance if the clusters corresponding to easily-predicted high-agreement data are missing, or conversely high performance if low-agreement clusters are missing). To counter this, the datasets used to evaluate algorithmic fairness should fulfil minimum requirements such as a minimum number of data points to form each cluster, or:

H6: Dataset: The computations could be spanned over different ranges of clusters.

Instead of computing the metrics over all the clusters spanning the whole range of values the clustering criteria can take, the metrics could be computed with a parameter which represents the extent of the clustering criteria range used. The range starts with clustering criteria values which correspond to data with high agreement between annotators (for example, a low ADR corresponds to high agreement of the worker with the majority vote), and ends with clusters of lower-agreement data (a high ADR reflects low agreement of the worker's annotations with the other annotations). In this way the evaluation can be made on any dataset with the specification of the parameter.

4.4.2. Experimental set-up, results and discussions for the experiments on the variables of the metrics

Experimental set-up
To investigate whether the hypotheses are valid, we apply them to models (1) to (3) and check whether the value trends correspond to the expected behaviours of these different models. The impact of the different variables which intervene in the computation of the metrics (clustering criterion, evaluation metric, number of clusters used for the computations, and range of clustering criterion values based on the disagreement of the data) is analysed by varying these variables and comparing the returned values to the expected behaviours of the models. The range of clustering criterion values spans between 0 and 1, and the number of clusters between 1 and 13. The evaluation metrics used for the computation are the F1-score, the accuracy, the accuracy of the positive class, and the average of the true positive and negative rates.

Results
Popularity-related fairness. The results of the experiments are plotted in Fig. 4.3. For the other clustering criteria, the figures are not included in the report for space considerations and because the results are similar to these. For the standard-deviation-based values, as expected, using only one cluster gives a score of 0 on fairness (maximum fairness). Increasing the number of clusters does not make the score vary significantly. Considering the clustering criterion range, the smaller the range is, the less disagreement remains in the data, the more the final score varies for the F1-score, but the less it varies for the other evaluation metrics. For the accuracy-based metrics, the smaller the range is, the lower the value is, because there are fewer variations across clusters since more predictions are correct in total (the models are better at making predictions for high-agreement data). The smaller the range is, the more unbalanced the data are, so the F1-score varies in the opposite direction to the other metrics. Compared to the standard deviation, the absolute-value measurements exhibit a larger range of values. The other observations are identical to the standard deviation observations. We select a number of clusters of 10 because this value presents few variations when varying the clustering criterion range. We consider that the default choice of clustering criterion range should be the full range, and if the test set does not allow it, it should be changed to a smaller range.
Using the average of the clusters' performance, we observe that decreasing the clustering criterion range increases the average performance since only high-agreement data are considered. Using only one cluster represents the average accuracy over the entire dataset. Varying the number of clusters does not have a large influence because the average is in the end computed over the same data, only with different weights. Similarly, using the clusters' lowest value, we observe that the smaller the cluster range is, the higher the minimum performance value is. This is because the models perform better on high-agreement data. The number of clusters used has a larger influence on the resulting values than for the average performance value: the larger the number of clusters, the lower the minimum value, because there are more chances that some clusters have lower performance (especially clusters where there is high disagreement on the annotations). Therefore, it is better to use the average of the clusters' performance, since it is less sensitive to one of the parameters of the metric, which makes it easier to compare across clusters.
Other clustering criteria. Concerning the experiments using the Average Disagreement Rate with the majority vote, the bound of the clustering criterion range which varies is switched to the upper bound, since the upper bound corresponds to data of higher disagreement. The observations are the same as the ones for the popularity clustering criterion, and we draw the same conclusions. Concerning the experiments with the ambiguity score, the observations are similar to the previous cases. For the experiments with the Worker Quality Score and Unit Quality Score, we make the same observations as previously. However, we note that varying the clustering criterion range does not influence the performance value much, probably because the differences among the low ranges of the WQS are not significant enough to make a difference in the performance values. For this reason we decide not to base our fairness computation on these clustering criteria.
The results of the study of fairness over the demographic categories are plotted in Fig. B.5. We observe that the more demographic categories are removed, the fairer the algorithms are, until they are totally accurate when there is only one demographic category. Therefore, the proposed metrics seem to be an appropriate way to measure fairness related to discrimination.

Discussion on the metrics parameters
H1, H2: Choice of the clustering criteria. The AS-based computation shows whether there are differences between sentences subject to multiple interpretations and clear sentences. The ADR-based computation shows differences between the annotators whose judgements usually differ from the majority and the annotators who always follow the majority. The AP-based computation represents an annotation-level point of view on fairness, showing possible performance differences between judgements which belong to the minority and judgements followed by the majority; it gives a direct indication on whether the minority is represented in the output predictions or is ignored in favour of the majority. Therefore, these three computations represent three different aspects of fairness and highlight three potential causes of unfairness, since they all exhibit the expected fairness-related behaviours. That is why we decide to retain the three criteria and to present all three of them when evaluating fairness. Although discrimination-related fairness is not our targeted fairness, the metric proposed for it seems appropriate and does not overlap with the other three metrics. Therefore we choose to use it in the rest of the thesis to check for potential discrimination.
H3, H4, H5: Choice of the computation method. We select the standard deviation value over the absolute value because the behaviours observed with these two metrics are similar, but the standard deviation takes into account each of the cluster values, and thus the influence of outlier clusters is decreased. The absolute value only computes the difference between the highest and lowest performance values, so if one cluster exhibits unexpected values (for example if there are not enough data to compute a significant value), this value will have a large influence on the resulting fairness value. We choose the cluster-average performance (mean value) as the indication of the performance of the algorithms because it is less sensitive to the metric variables and because the combination of the average and the standard deviation gives a good approximation of the range of performance the algorithms have on the dataset, so there is no loss of information. The behaviours observed for the four different metrics are similar. We choose the computation of accuracies over the positive and negative classes as the metric on which to make the computations because it takes into account the classification on the two classes, and therefore highlights possible gaps in performance in case the training dataset is unbalanced over classes. The F1-score could also be used for this purpose, however it is not as easily interpretable as the accuracy.
H6: Choice of the metric parameters. From the results of the experiments, we observe that there is a threshold on the range of the clustering criterion values where the fairness value changes much. The high-agreement clusters (clusters under the threshold) have a higher corresponding prediction performance with model (3) than with model (1), since model (3) is adapted to distinguish between different annotators.
The low-agreement clusters (clusters over the threshold) do not exhibit a large performance difference between the two models (sometimes model (1) even performs better than model (3)), because the algorithms might not have enough data to learn accurate predictions over very high-disagreement annotations, or the accuracy values computed for the high-disagreement data are not significant (lack of data to make the calculations). For the thesis, we propose to set a threshold to differentiate between these two behaviours for each clustering criterion and to keep two fairness values, one before the threshold and one after, so that the two behaviours are exhibited. We list in Table 4.3 the two minimal clustering criterion range values.

binning criterion | range min 0 | range min 1
AP                | 0           | 0.4
ADR               | 0.39        | 0.30
AS                | 0.5         | 0.7
demographic       | 0           | 100

Table 4.3: Threshold values for the different clustering criteria.

We choose a number of clusters of 10 because the returned values around this number of clusters are stable and the metrics are not very sensitive to this parameter.

4.4.3. Experimental set-up, results and discussion on the metrics' significance

From our previous conclusions we selected a subset of the proposed metrics as the final fairness metric. We apply significance tests on the selected metrics between the clusters to check whether the differences in performance between the clusters are significant. We assume that if the value differences between the clusters are significant, it is appropriate to create a measure of this dispersion to measure fairness. The experimental set-up and results are given in Appendix B.4. The differences found between the clusters' performance are not always significant. This does not mean that the metric is not valid, because it might be that the models simply do not perform differently on these specific clusters, which are usually consecutive clusters, which justifies that their associated performances are similar. We cannot reject any of the hypotheses we made previously based on these results.

4.4.4. Analysis of the extreme cases

We investigate the extreme cases to check the validity of the metrics. For both the standard deviation and the absolute-value range, the fairness value ranges between 0 and 1. A fair model presents a value of 0 because the performance of each cluster is equal. The less fair the model is, the more the value increases, the maximal unfairness value being equal to 1. The absolute value of the difference and the standard deviation remain under 1 in any case and are equal to 1 when the performances of the clusters span the whole range of possible values (for example when there are two clusters, one with an accuracy of 0 and one with an accuracy of 1). Since it might be confusing that the highest fairness corresponds to 0, we change the computation by adding 1 to the opposite of the previous calculation, so that the highest fairness score corresponds to 1 and the lowest score to 0.
We cannot use the global accuracy because, if one class is better predicted than the other, the average accuracy might show a good prediction quality although the accuracies of the two classes are very different. The global accuracy would hide potential low performance. If the standard deviation is computed over both the true positive and true negative rate values, the metric might not be a valid indication of fairness: if the accuracies across the clusters of the positive class are equal, and the same holds for the negative class, but the values for the positive and negative classes differ, the model is totally fair but the computation will return a low fairness value. Thus, we need to distinguish the fairness of the two classes. Possible ways to combine the two could be to take the average of the two scores or the minimum of the two. In order to have consistent indicators between the dispersion indication and the general performance indication, we choose the same combination method for these two scores. With the minimum, the general performance and dispersion could be highly underestimated, thus we choose the combination by simple averaging of the fairness of each class. A simple average without class weights gives a value independent from the dataset used to evaluate the algorithms (no preference is given to one of the two classes).

4.4.5. Conclusion

H1, H2: Considering the previous observations, our final fairness metric (RQ2.3) is divided into 4 aspects:

• User average disagreement rate with the majority vote -based computation (ADR). It gives a direct indication about fairness, since it divides the dataset on the user level and our definition of fairness focuses on performance equality across users.

• Annotation popularity -based computation (AP). It informs on whether the potential unfairness is related to more or less accurate predictions on minority or majority opinions.

• Sample ambiguity -based computation (AS). It informs on whether the potential unfairness is related to more or less accurate predictions on samples with high or low agreement.

• Demographic-based computation. It informs on whether potential unfairness is related to discriminative behaviours between categories of population.

H3, H4, H5: Each of these aspects consists in two values computed using the true positive and negative rates of each cluster: the standard deviation (dispersion metric) and the cluster average (general performance metric).
H6: In this thesis we use 10 clusters for each metric and decide to report the metrics computed on two different clustering criterion ranges (the full range of data, and a smaller subset of higher-agreement data), but we advise reporting measures on the whole range for more completeness later.

Mathematical formalization of the fairness evaluation method
We formalize the computation of the four fairness aspects in a mathematical way.

Definition 4.1 (Fairness metric). $\forall i \in [\![1; n_G]\!]$, we note $G_{i0}$ and $G_{i1}$ each cluster of negative and positive class data in the partition of the dataset into (two times) $n_G$ clusters. Each $G_{i0}$ and $G_{i1}$ consists of $n_{G_{i0}}$ and $n_{G_{i1}}$ samples $s_{i0,j}$ and $s_{i1,j}$ with $j \in [\![1; n_{G_{i0}}]\!]$ or $j \in [\![1; n_{G_{i1}}]\!]$. For each $G_{i0}$ and $G_{i1}$ an average score $S_{i0}$ and $S_{i1}$ is computed, such that $S_{i1} = \frac{\sum_{j=1}^{n_{G_{i1}}} s_{i1,j}}{n_{G_{i1}}}$, and similarly for $i0$. The average score in our case is the average accuracy of the model's predictions evaluated in the cluster on its corresponding level (user, annotation or sample level), since the $s_{i,j}$ are accuracies. The fairness indication is defined for each class as $F_0^* = \sqrt{\frac{\sum_{i=1}^{n_G} (S_{i0} - \bar{S}_0)^2}{n_G - 1}}$ with $\bar{S}_0 = \frac{\sum_{i=1}^{n_G} S_{i0}}{n_G}$, and similarly for the other class $i1$. $F_0^*$ corresponds to the sample standard deviation of the average score of each negative cluster, chosen because it is a classical measure of statistical dispersion of a distribution. The general performance indication is given by $S_0^* = \bar{S}_0$ for the negative class, and similarly for the positive class. Then, the global fairness indication is computed as $F^* = \frac{F_0^* + F_1^*}{2}$, and the general performance indication as $S^* = \frac{\bar{S}_0 + \bar{S}_1}{2}$.
Depending on the clustering criterion considered, the scores $s_{i,j}$ have different interpretations.

• Demographics; Average Disagreement Rate with the Majority Vote. The $G_i$ are clusters of annotators with their annotations. The samples $s_{i0,j}$ (respectively $s_{i1,j}$) represent the average accuracy of the predictions of an algorithm for one annotator for the negative (respectively positive) class.

• Annotation Popularity Score. The $G_i$ are clusters of annotations based on their ground truth label (two classes) and on their popularity score (separated into 10 equal-length ranges), i.e. 20 clusters in total. The samples $s_{i,j}$ are binary labels, 1 if the prediction of the algorithm is equal to the ground truth annotation, 0 otherwise. Consequently, $S_i$ represents the average accuracy of the algorithm predictions for the annotations of cluster $G_i$, for each class.

• Ambiguity Score. The $G_i$ are clusters of sentences and their associated annotations. The samples $s_{i0,j}$ (respectively $s_{i1,j}$) represent the average accuracy of the predictions of an algorithm for one sentence for the negative (respectively positive) class.

N.B.: Here, each $S_i$ follows a Beta distribution $\mathrm{Beta}(\alpha_i, \beta_i)$. The Beta distribution takes values in $[0; 1]$, like the true positive and negative rates. It is the conjugate prior of the success probability of $n_{G_i}$ Bernoulli trials, the trials here being the different predictions of the model on the different annotations. The $S_i$ are independent but not identically distributed, since each distribution's parameters are different (the scores might vary a lot between clusters when the model is unfair). Thus, $F^*$ is not a real standard deviation of one unique distribution, and cannot be computed analytically.
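A minimal sketch of Definition 4.1, assuming the per-item accuracies $s_{i0,j}$ and $s_{i1,j}$ have already been grouped per cluster and per class; this is our own translation of the formulas above, not the thesis' code.

```python
import numpy as np

def fairness_indication(neg_clusters, pos_clusters):
    # neg_clusters / pos_clusters: lists of arrays; one array of per-item accuracies
    # (the s_{i0,j} / s_{i1,j}) per cluster G_{i0} / G_{i1}.
    S0 = np.array([np.mean(c) for c in neg_clusters])   # S_{i0}
    S1 = np.array([np.mean(c) for c in pos_clusters])   # S_{i1}
    F0, F1 = S0.std(ddof=1), S1.std(ddof=1)             # per-class sample standard deviations F*_0, F*_1
    F_star = (F0 + F1) / 2                               # global dispersion indication (0 = perfectly fair)
    S_star = (S0.mean() + S1.mean()) / 2                 # global general performance indication
    return F_star, S_star

# Toy usage: three clusters per class, each holding a few 0/1 per-annotation accuracies.
neg = [np.array([1, 1, 1, 0]), np.array([1, 0, 1, 1]), np.array([0, 0, 1, 0])]
pos = [np.array([1, 1, 0, 1]), np.array([1, 1, 1, 1]), np.array([0, 1, 0, 0])]
F_star, S_star = fairness_indication(neg, pos)
# Section 4.4.4 rescales the dispersion as 1 - F_star so that 1 corresponds to maximal fairness.
print(F_star, S_star)
```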

Application to the different models
In Fig. B.7 and B.8, we give the fairness performance of models (1) to (3). As expected, from the popularity score, ADR and ambiguity score points of view on fairness, model (3) shows an improvement over models (1) and (2). This suggests that the metrics we propose are valid, since they show the expected trends.

4.5. Summary

We focused on defining a new evaluation method of algorithmic fairness. We first proposed a definition of algorithmic fairness adapted to the task of classifying samples on subjective properties: an algorithm is fair when its prediction performance is equal for each user. (RQ2.1) Afterwards, we investigated how to characterize potential unfairness of the algorithms by proposing different ways to cluster the dataset and measure the algorithm's performance on each cluster. (RQ2.2) Finally, we proposed different ways to quantify fairness by summarizing the observations enabled by the different characterizations, and investigated the validity of these different measures. We concluded that clustering the dataset on the user, sample and annotation levels, computing the true positive and true negative rates of each cluster, and measuring the standard deviation and the average across clusters give a valid measure of algorithmic fairness and of the general performance of the algorithm. This process is the evaluation method we propose for the algorithmic fairness of models working on subjective properties; the main characteristics of the resulting metrics are that they are almost independent of the test set, and that they inform on different causes of unfairness. (RQ2.3)

[Figure 4.3 panels: (a) Experimentations on the F1-score with model (1). (b) Experimentations on the F1-score with model (3). (c) Experimentations on the accuracy with model (1). (d) Experimentations on the accuracy with model (3). (e) Experimentations on the accuracy of the two classes with model (1). (f) Experimentations on the accuracy of the two classes with model (3). (g) Experimentations on the accuracy of the positive class with model (1). (h) Experimentations on the accuracy of the positive class with model (3).]

Figure 4.3: Experimentations on the popularity-based fairness computed with different evaluation metrics (F1-score, global accuracy ('acc'), accuracy of the positive class ('acc_1'), or the average of the true positive and negative rates (['acc_0', 'acc_1'])). The y-axis represents the number of clusters on which the metric is computed, and the x-axis represents the clustering criterion low-agreement limit. The results on the F1-score and the other metrics are different. The standard deviation and mean value show more robustness to the variations of the parameters than the absolute and minimum values. A number of clusters and the clustering criterion ranges can be decided from these plots.

5. Increasing the fairness of algorithms for subjective properties classification

5.1. Introduction

In this chapter, we create algorithms for toxicity prediction which are fairer according to the definitions established in Chapter 4, and answer RQ3: how to build and train algorithms whose outputs are fair when predicting subjective properties of samples?, by answering the following sub-questions:

1. What are the current algorithms to perform sentence toxicity classification? (RQ3.1)
→ We review the Computer Science literature to find the Machine Learning and Deep Learning algorithms currently used to perform toxicity prediction, and highlight their limitations.

2. How to integrate the subjectivity of the property to predict into the algorithms' training process? (RQ3.2)
→ We hypothesize that using the annotations instead of aggregated labels enables taking each opinion into account, which is tested by comparing algorithms trained on aggregated and non-aggregated annotations.

3. How to integrate the user subjectivities into the classifiers? (RQ3.3)
→ We test the hypothesis that modifying the algorithms' architectures enables taking the different users into account, by comparing the performance of traditional algorithms with that of the proposed algorithms.

4. How to model the psychology-related variables about toxicity perception to build a user profile and integrate it into the model architectures? (RQ3.4)
→ We propose several encoding methods of the available variables and add them as inputs to the algorithms. We test these encodings by comparing the performance of these input-augmented algorithms with the performance of traditional Machine Learning algorithms.

5. How to resample the dataset to enable the proposed algorithms to learn? (RQ3.5)
→ We propose several criteria to resample the dataset. We show that balancing the dataset on these criteria increases the performance of the algorithms.

5.2. Formulation of hypotheses

In this section, we formulate hypotheses to answer RQ3. We investigate what the current algorithms for sentence toxicity prediction are (RQ3.1) (subsection 5.2.1). Then, we propose modifications of these baselines to make the models fairer than they currently are (RQ3.2 to RQ3.5). We are interested both in proposing new architectures and training processes or modifying current ones, and in new dataset resamplings.

5.2.1. Current methods for toxicity prediction

Here, we answer RQ3.1: what are the current algorithms to perform sentence toxicity classification? From the literature, we identify the Machine Learning algorithms employed for toxicity prediction, and quickly evaluate them since they will serve as baselines against which to compare our new algorithms.


There are three main directions employed to perform toxicity prediction (Section 2.2): using lists of words and rule-based algorithms against which to compare the sentences, using Machine Learning classifiers, and using Deep Learning neural networks. We are not interested in the first technique because it exhibits the worst performance, it is not modular (including the users in the process would be very memory-consuming since it would need word lists for each annotator), and it is infeasible with the available data (collecting lists of words for each annotator by performing crowdsourcing would be very expensive). Consequently, we study Machine Learning algorithms and deep neural networks.
From the Machine Learning area, we choose three baselines: the Logistic Regression (LR) classifier, the Support Vector Machine (SVM) classifier and the Multi-Layer Perceptron (MLP), because they are the most researched classifiers and are shown to obtain higher performance than the list-of-words classifiers. Moreover, the LR and MLP are the two classifiers used in the paper working on toxicity prediction from which we take the Jigsaw dataset [111]. Although they present performance results on a different property annotated on the dataset (aggressiveness and not toxicity), we assume that the performance will vary within small ranges, and thus it is one more point of comparison to verify the correct training of the algorithms.
Since it is time-consuming to train Deep Learning models, we focus on one unique Deep Learning model; the study could easily be extended to other models later. The code of papers [50] 1 and [9] 2 is available. In [50], it is shown that a biLSTM with attention performs better than simple LSTM models. In [9], it is shown that a CNN with GloVe embeddings performs better than an LSTM. They do not compare with a biLSTM with attention, so it is not possible to compare the two and make a decision. However, [50] uses less training data (1528 comments, 435 labelled as hateful) than [9] (16K tweets, 3383 labelled as sexist, 1972 as racist), and their data might be more similar to ours (Fox News comments for [50], tweets for [9]). Therefore, we choose to use a biLSTM with attention for offensiveness prediction, based on the implementation of [50].
We evaluated these baselines (Appendix C.1) and found results similar to those in the papers. As expected, the performance of the more complex classifiers is higher than that of the simpler ones (in increasing order of performance: Logistic Regression, Multi-Layer Perceptron and LSTM-RNN), because the Jigsaw dataset has many data that complex classifiers can learn more easily, while the simpler classifiers might underfit.
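As a minimal sketch of the Logistic Regression baseline trained on majority-vote labels (in the spirit of model (1)), assuming TF-IDF word n-gram features; the toy corpus and hyperparameter values are illustrative, not the thesis' exact set-up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with majority-vote toxicity labels (1 = toxic, 0 = non-toxic).
sentences = ["you are wonderful", "you are an idiot", "have a nice day",
             "shut up, loser", "thanks for the help", "what a stupid comment"]
mv_labels = [0, 1, 0, 1, 0, 1]

# Word unigram/bigram TF-IDF features, as commonly used for toxicity classification.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(sentences), mv_labels)

# The baseline returns a single label per sentence, regardless of who is asking.
print(clf.predict(vectorizer.transform(["you are a nice idiot"])))
```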

5.2.2. Design of Machine Learning architectures for fairer models

Since, to be fair, the algorithms should output each annotation of the users of the system, we hypothesize that:

H1: we cannot use the aggregated labels, since the algorithms would then be given a unique label per sample; we have to use the disaggregated annotations.

Simply feeding the algorithms with the annotations will not help them learn the users' opinions, since they have no means to distinguish the users, as seen in Appendix C.1. We devise a way to integrate the users' subjectivities in the models (RQ3.2, RQ3.3). The literature review identified three ways to adapt the outputs of algorithms to each user. Machine Learning researchers build one classifier per user in order to deal with different interpretations of a same sample. This approach is neither generalizable to unseen users, nor scalable if the dataset is constituted of many users. The Deep Learning literature employs different approaches to personalize algorithms. The algorithms are usually taught a representation of the users which is integrated into the neural networks in different places, be it the input or cells of certain layers; for example, certain neural-network-based recommender systems use users' individual features as additional features to the inputs of the classifiers. Tang et al. [104] transform the network inputs into a user-specific representation by multiplying them with a user matrix learned for each user. This matrix has parameters common to all users and user-specific parameters. This method is meaningful for our use case since it would transform the input sentences differently depending on the users' interpretation. We formulate hypotheses from these ideas.
We propose to evaluate whether augmenting the usual input features with annotator information leads to fair predictions in the case of sentence toxicity prediction. The psychology literature provided us with a set of human features which influence toxicity perception; among them, gender, age, and education level are available in the Jigsaw dataset. That is why we use these features. We formulate the following hypothesis:

H2: adding as input to the classifiers the users' information that the psychology literature defines as influencing variables for toxicity perception enables the models to output the opinions of each user.

1. https://github.com/sjtuprog/hateful-speech-detection
2. https://github.com/pinkeshbadjatiya/twitter-hatespeech

The demographic information to input should be encoded in a way the classifiers can exploit. We propose two encodings to model the demographic variables and input them into the models: 1) one-hot encoding of the three variables and concatenation of these three representations (H2.1), or 2) continuous representation (between [0; 1] for example) of each variable according to the available ranges in the dataset and concatenation of these three representations (H2.2) (RQ3.4).
This adaptation of the classifiers should return fair predictions using only a few features describing the users. It can be used with any kind of Machine and Deep Learning classifier, which makes it adaptable. It is also a fast way to make predictions, since adding a few features does not slow down the learning and prediction processes much compared to only using the features describing the samples.
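A minimal sketch of the input augmentation proposed above: each training example is a (sentence, annotator) pair labelled with that annotator's own annotation, and the sentence features are concatenated with a small annotator profile vector. The toy data, the three-dimensional continuous profile and the feature choices are illustrative assumptions, not the thesis' exact implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One row per (sentence, annotator) pair with the annotator's own judgement:
# the same sentence can receive different labels from different users.
sentences = ["you are an idiot", "you are an idiot", "have a nice day", "have a nice day"]
user_profiles = np.array([[0.2, 1.0, 0.6],   # hypothetical continuous (age, gender, education) encoding
                          [0.8, 0.5, 0.3],
                          [0.2, 1.0, 0.6],
                          [0.8, 0.5, 0.3]])
annotations = [1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(sentences)
X = hstack([X_text, csr_matrix(user_profiles)])   # sentence features + user features

clf = LogisticRegression(max_iter=1000).fit(X, annotations)
# At prediction time, the same sentence can now be classified differently
# depending on the profile vector it is paired with.
```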

5.2.3. Design of training processes for fairer models

The hyperparameters of the different classifiers are usually chosen by performing a grid search over the ranges of values in which they are most likely to perform best. The grid search consists in training on a training dataset and evaluating on an evaluation dataset a classifier with its hyperparameters set to different values, and choosing the set of hyperparameters which exhibits the highest performance on the evaluation dataset. The performance metrics usually employed are the accuracy, precision, recall or F1-score, but they do not exhibit any information about the fairness of the model, so we hypothesise that using fairness measures instead of these usual metrics would improve the fairness of the models.
During the training process, the number of training data and features is chosen in order to reduce overfitting (caused by too few training data or too many features describing the data) and underfitting (caused by too few features to describe the data). This is tested by plotting the learning and feature curves (model performance on the training and test sets as a function of, respectively, the number of training data and the number of features used to represent the data samples) using the usual performance metrics. If the training and test curves are both low, the model is underfitting; if the training curve is high but the test curve is low, the model is overfitting. Consequently we hypothesise that:

H3: using the fairness measures as the performance metric to optimize when tuning the hyperparameters of the models with grid search, and when choosing the number of data features and training data according to the learning and feature curves, increases the fairness of the final model.

5.2.4. Adaptation of the training dataset for fairer models

We assume that a raw dataset for which we have samples, annotations, and information about the annotators is not optimized to train Machine Learning models, for two reasons. We observed in Appendix A.2 that the dataset is highly class-unbalanced, which could hinder the accurate training of the models, and thus we assume it is necessary to balance the toxicity classes. Moreover, we identified several aspects to understand potential unfairness of the models, and proposed to cluster the dataset on these aspects by uniformly dividing the dataset into equally-sized ranges of clustering criterion values (corresponding to the studied fairness aspect) to identify the unfairness. The largest clusters, mostly the clusters whose annotations correspond to high agreement between the annotators, receive predictions of higher performance than the other clusters. Since the disparity of cluster sizes is related to this unfairness, to answer RQ3.5 we hypothesize that:

H4: balancing the training dataset over one of the clustering criteria used to study the multiple fairness aspects increases the fairness performance of the models trained with this resampled dataset.

There are not enough data in the dataset to balance it on the four clustering criteria simultaneously. As future work, we could quantify how much the resamplings improve the models' fairness and investigate several combinations of resampling methods using a subset of the four resampling methods.

5.3. Experimental set-up

In this section we explain the experimental protocol used to test the hypotheses of the previous section. Mainly, we build the new models proposed in the hypotheses, evaluate their performance, and compare them with the performance of the baseline models (Section 5.2.1). Because of a lack of time due to the long training process of the deep neural networks, we only experiment on one Machine Learning algorithm.

5.3.1. Evaluation of the models

Evaluation process
We evaluate the models with the fairness evaluation method proposed in Chapter 4. Additionally, in order to compare the models to their baselines, we evaluate the models using the traditional Machine Learning performance metrics (accuracy, precision, recall, and F1-score).
In Machine Learning, it is common practice to divide the dataset into a training set on which to train the models, and a test set on which to evaluate them. Having multiple samples on which to test the models enables computing the mean of the performance value (a better approximation of the real performance of the models) and the standard deviation of the performance (a measure of the variation of the performance), so that an eventual comparison of the models can be done with significance tests.
In our case, the fairness evaluation method requires the whole set of test data to compute the fairness measures, because they consist of both the mean and standard deviation over data from several clusters. Consequently, in order to still be able to perform significance tests when comparing the performance of several models, we perform a cross-validated evaluation. We divide the dataset into 10 folds, and 10 times we train the model on 9 of the folds and evaluate it on the remaining fold. Then we compute the average and standard deviation of the fairness performance over the 10 folds. We choose 10 folds because fewer folds would not allow computing a significant mean and standard deviation over the performance measures, and more folds would be too time-consuming for evaluating the models.

Significance tests
Once the average and standard deviation of the fairness scores are computed over the 10 folds, we compare the scores between different models. This is equivalent to comparing the means of two populations, the means being the averages of the fairness scores, and the populations being the scores retrieved from each fold (each population is constituted of 10 subjects). The null hypothesis for the difference between the means of models 1 and 2, $\mu_1$ and $\mu_2$, is $H_0: \mu_1 = \mu_2$. The tests consist in comparing a theoretical statistic $t_{th}$ and an experimental statistic $t_{exp}$. If $|t_{exp}| > t_{th}$, the null hypothesis is rejected; otherwise, the means are considered equal with the confidence level $\alpha$ used to compute $t_{th}$. We choose $\alpha = 0.01$ and $\alpha = 0.05$.
In our case, the standard deviations and means of the two compared populations are not known but only estimated with the sample standard deviation and sample mean. Consequently, we choose the two-sample t statistic as our test statistic. There are two versions of this test depending on whether the variances of the two populations are considered equal or different. The equality of variances can be tested with a homogeneity-of-variance test such as Levene's test; we did not do this because of time constraints, but we computed the two versions of the test to check whether there are differences in the results. We note $\bar{X}_i$ the sample average, $s_i$ the sample standard deviation and $n_i$ the number of individuals in population $i$, with $i$ taking values 1 and 2 for the two populations. In the case where the variances are not assumed equal (Welch's test), the test statistic is:

$$ t_{exp} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \qquad (5.1) $$

and the number of degrees of freedom is:

$$ df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \frac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}} \qquad (5.2) $$

In the case where the variances are assumed equal (pooled test), the test statistic is:

$$ t_{exp} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad (5.3) $$

and the number of degrees of freedom is:

$$ df = n_1 + n_2 - 2 \qquad (5.4) $$

For these tests to be applicable, because the number of samples per population is $n_1 < 30$ and $n_2 < 30$, the data must be normally distributed (an assumption we make). $t_{exp}$ is computed considering $\mu_1 - \mu_2 = 0$. The theoretical statistic is found in the t-table by choosing $t_{th} = t\left(1 - \frac{\alpha}{2}, df\right)$.
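In practice, the two versions of the test can be computed with scipy, which returns a p-value instead of requiring a t-table lookup (rejecting H0 when the p-value is below α is equivalent to comparing |t_exp| with t_th). The fold scores below are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical fairness scores of two models over the 10 cross-validation folds.
model1_scores = np.array([0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.74, 0.70, 0.71, 0.69])
model2_scores = np.array([0.75, 0.77, 0.74, 0.78, 0.76, 0.75, 0.79, 0.74, 0.77, 0.76])

# Pooled two-sample t-test (variances assumed equal) and Welch's test (variances not assumed equal).
t_pooled, p_pooled = stats.ttest_ind(model1_scores, model2_scores, equal_var=True)
t_welch,  p_welch  = stats.ttest_ind(model1_scores, model2_scores, equal_var=False)

# Reject H0: mu1 = mu2 when the p-value is below the chosen alpha (0.01 or 0.05 here).
for alpha in (0.01, 0.05):
    print(alpha, p_pooled < alpha, p_welch < alpha)
```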

5.3.2. The training process of the models (H3)

In cases where the hyperparameters of the models have to be tuned, k-fold cross-validation or one of its variants is employed, the k evaluations being averaged in the end to choose the hyperparameters which return the highest average performance. In our case, training and testing the algorithms is very time-consuming. Consequently, we tune the algorithms simply by dividing the training dataset of the first iteration of the cross-validation into a training subset (80% of the data in the training dataset) and a validation subset (the remaining 20% of the data), training the models with the different hyperparameters on the training subset, and evaluating them on the validation subset. The selected hyperparameters are the ones which return the highest performance on the validation subset. For the Logistic Regression classifier, we optimize the regularization parameter $clf_C$ and the tolerance for the stopping criterion $clf_{tol}$.
As explained previously, it is necessary to plot the learning curve and feature curve to select the number of training data and the number of features which allow the models neither to overfit nor to underfit. Here the curves are plotted by evaluating the models on the training and test sets after tuning the hyperparameters.

To study hypothesis 3 (H3), we compare the performance of the models trained with a number of features and a number of training data selected either by computing the accuracy as the performance metric, or by computing an average of the two measures (dispersion and general performance) of the main fairness performance metric (based on the annotation clustering using the ADR). We also study the influence of the metric choice for the grid search by comparing the performance of models whose hyperparameters are selected based on the accuracy or based on a combination of the two main measures of fairness performance. The choice of the method to combine the two fairness measures for the grid search is detailed in Appendix C.3: the chosen combination is a weighted average of the two measures, each normalized by the standard deviation of the range in which it varies during the grid search. The models to compare are the ones that are adapted to increase fairness in H1 and H2, in order to be able to identify potential increases of fairness.
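A minimal sketch of the H3 idea of selecting hyperparameters with a fairness-based criterion on a held-out validation subset rather than with the accuracy. The candidate grid, the combination weights and the simplified fairness_scores helper (which ignores the per-class split for brevity) are placeholders, not the thesis' exact procedure from Appendix C.3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fairness_scores(model, X_val, y_val, val_adr_bins):
    # Placeholder for the ADR-based fairness metric of Chapter 4:
    # general performance = mean per-bin accuracy, dispersion = std of per-bin accuracies.
    preds = model.predict(X_val)
    per_bin = [np.mean(preds[val_adr_bins == b] == y_val[val_adr_bins == b])
               for b in np.unique(val_adr_bins)]
    return float(np.mean(per_bin)), float(np.std(per_bin, ddof=1))

def select_hyperparameters(X_tr, y_tr, X_val, y_val, val_adr_bins, w_perf=0.5, w_disp=0.5):
    best, best_score = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):        # hypothetical candidate grid
        for tol in (1e-4, 1e-3):
            model = LogisticRegression(C=C, tol=tol, max_iter=1000).fit(X_tr, y_tr)
            perf, disp = fairness_scores(model, X_val, y_val, val_adr_bins)
            score = w_perf * perf - w_disp * disp   # favour high performance and low dispersion
            if score > best_score:
                best, best_score = (C, tol), score
    return best

# Toy usage on random data.
rng = np.random.default_rng(0)
X, y, bins = rng.random((200, 5)), rng.integers(0, 2, 200), rng.integers(0, 5, 200)
print(select_hyperparameters(X[:150], y[:150], X[150:], y[150:], bins[150:]))
```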

5.3.3. The datasets to train the models on (H4)

The dataset used is the Jigsaw dataset, which we clean by removing the annotations of spammers using the CrowdTruth framework (Section 3.3.2). We filter out the annotations of annotators who did not provide their demographic information and use them only for testing the models, not for training.
In order to test hypothesis 4 (H4), we prepare 8 resampled datasets: 4 datasets balanced on the two classes and balanced on the repartition of the clustering criterion values, and 4 datasets balanced on the two classes but following the original distribution of the clustering criterion values, the 4 clustering criteria being the annotators' demographic information, the annotators' average disagreement with the majority vote (ADR), the annotation popularity (AP) and the sample ambiguity (AS). We train the models proposed in H2, H3, H4 on the disaggregated annotations of each resampled dataset, and compare the performance.
In order to compare the performance among the different datasets, we make them equally sized. According to the previous sections, we need to divide each dataset into 10 folds; consequently, we perform the same resampling on each fold so that each of them has a similar constitution. The resampling method is explained in Appendix C.4. The resamplings are presented in Fig. 5.1.
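A minimal sketch of one of the resamplings: downsampling so that every (toxicity class, criterion bin) cell keeps the same number of annotations, here using a hypothetical AP bin column; the thesis' exact resampling procedure is the one described in Appendix C.4, of which this is only an approximation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy annotation table: toxicity label and, e.g., the AP bin of each annotation.
df = pd.DataFrame({
    "label": rng.integers(0, 2, 2000),
    "ap_bin": rng.integers(0, 10, 2000),
})

# Balance on the two classes and on the bins of the clustering criterion:
# keep the same number of annotations in every (label, ap_bin) cell.
cell_size = df.groupby(["label", "ap_bin"]).size().min()
balanced = (df.groupby(["label", "ap_bin"], group_keys=False)
              .apply(lambda g: g.sample(n=cell_size, random_state=0)))
print(balanced.groupby(["label", "ap_bin"]).size())
```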

5.3.4. The Machine Learning models to compare (H1, H2)
In order to verify hypothesis 1 (H1), we compare the performance of a model trained on the aggregated labels (baseline) and a model trained on all the annotations. To test H2, we train the adapted models on the annotations, and compare their performance to their baselines (the simple models without additional input features), trained both on the individual annotations and on the aggregated annotations. The SVM and MLP classifiers and the Deep Learning models are time-consuming to tune and train, so we leave their evaluation as future work and instantiate the models with the Logistic Regression classifier.
Each demographic information takes values corresponding to bins of possible values in the crowdsourcing task: age ('Under 18', '18-30', '30-45', '45-60', 'Over 60'), gender ('female', 'male', 'other'), education level ('none', 'some', 'hs', 'bachelors', 'masters', 'doctorate', 'professional'). The one-hot encoding (OH) considers each of these bin values as one feature, to which we add one feature per information category to represent the cases where the demographic information is unknown for the specific user. That constitutes 6 + 4 + 8 = 18 features. For the continuous encoding, we attribute one value in [0;1] to each possible bin of each demographic category (also considering the additional unknown-information bin values). For the age and education level, which are ordinal variables, the attributed values follow the order of the bins, with 0 representing the unknown-information case. The continuous encoding consists of 3 features.
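The two encodings can be sketched as follows; the bin lists follow the crowdsourcing task above, while the exact numerical values of the continuous encoding are illustrative (any fixed mapping respecting the ordinal order, with 0 reserved for unknown information, fits the description):

```python
import numpy as np

AGE_BINS = ['Under 18', '18-30', '30-45', '45-60', 'Over 60']
GENDER_BINS = ['female', 'male', 'other']
EDU_BINS = ['none', 'some', 'hs', 'bachelors', 'masters', 'doctorate', 'professional']

def one_hot_user(age, gender, education):
    """18-dimensional encoding: one slot per bin plus one 'unknown' slot per category
    (6 + 4 + 8 = 18 features)."""
    def encode(value, bins):
        vec = np.zeros(len(bins) + 1)                      # last slot = unknown
        vec[bins.index(value) if value in bins else len(bins)] = 1.0
        return vec
    return np.concatenate([encode(age, AGE_BINS),
                           encode(gender, GENDER_BINS),
                           encode(education, EDU_BINS)])

def continuous_user(age, gender, education):
    """3-dimensional encoding; bins mapped to evenly spaced values in (0, 1], 0.0 for
    unknown information. Age and education follow their ordinal order; the gender
    mapping is arbitrary but fixed."""
    def encode(value, bins):
        return 0.0 if value not in bins else (bins.index(value) + 1) / len(bins)
    return np.array([encode(age, AGE_BINS),
                     encode(gender, GENDER_BINS),
                     encode(education, EDU_BINS)])
```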

(a) Following the original distribution of annotations per demographic.
(b) Balanced on the distribution of annotations per demographic.
(c) Following the original distribution of annotations per bin of annotators clustered on their ADR.
(d) Balanced on the distribution of annotations per bin of annotators clustered on their ADR.
(e) Following the original distribution of annotations per bin of annotations clustered on their AP.
(f) Balanced on the distribution of annotations per bin of annotations clustered on their AP.
(g) Following the original distribution of annotations per bin of samples clustered on their AS.
(h) Balanced on the distribution of annotations per bin of samples clustered on their AS.

Figure 5.1: Presentation of the constitution of one fold for each dataset resampling. Each resampled dataset contains the same number of annotations.

5.4. Results and discussion In this section, we present and discuss the results of the experiments.

5.4.1. Training process (H3)
In Fig. 5.2, we show an example of combined feature and learning curves for the two training processes. The learning curve computed on the accuracy shows an increasing trend for the accuracy-based training process, while for the ADR-based training process this evolution is not as linear. Conversely, the dispersion and general fairness performance measures evolve more linearly for the ADR-based training process than for the accuracy-based one. This is directly explained by the choice of the variable optimized in each training process. Both training processes lead to the same conclusion: the more data (here up to 7000 training instances) and the more features (here 10000) are used, the higher the performance.
The significance tests on the performance of the models trained with the different training processes are reported in tables such as Table 5.1. These tables contain the evaluation of the two models on the fairness and traditional measures (X̄_1, X̄_2), the difference of performance between the models (X̄_1 - X̄_2), the experimental test statistic (t_exp) and the theoretical test statistics for the two retained values of α (t_th), and the result of the significance test in the last column (+ indicates that the null hypothesis is rejected, i.e. that the performance of the two models differs significantly). For all the fairness metrics, but not for the accuracy metric, there is a significant difference between the models trained with the accuracy- and the ADR-related training processes. The dispersion between the per-cluster performances is lower for the ADR-related training process, while the fairness metrics measuring the general performance across clusters show higher values for the accuracy-related training process. This is because the ADR-based training process takes the dispersion into account, contrary to the accuracy-based training process. The accuracy-based training process, on the other hand, chooses the hyperparameter values that increase the accuracy of the classifier, which might simultaneously increase the performance of the models on the data of most clusters; that is why the general performance values are higher for the models trained with the accuracy-based training process.

(a) Learning and feature curves for the accuracy-based training process. (b) Learning and feature curves for the ADR-based training process.

Figure 5.2: Training of the Logistic Regression with continuous user model, on the balanced dataset along the annotation popularity percentage, evaluated with different performance metrics. The different curves correspond to the different numbers of features from 100 features (purple) to 10000 features (red), and evaluation data (dashed lines for training data and plain lines for test data). The shapes of the curves are inverted for the two training processes.

metric  X̄_1  X̄_2  X̄_1 - X̄_2  t_exp  t_th  signi.
ADR_discr_0 0.9263 0.9386 -0.0124 10.9525 2.201, 3.1058 (+, +)
ADR_perf_0 0.6122 0.5679 0.0444 40.4613 2.1314, 2.9467 (+, +)
ADR_discr_1 0.9470 0.9551 -0.0081 14.2168 2.1448, 2.9768 (+, +)
ADR_perf_1 0.6378 0.5939 0.0438 43.6362 2.1448, 2.9768 (+, +)
AS_discr_0 0.8811 0.9006 -0.0195 20.8404 2.1098, 2.8982 (+, +)
AS_perf_0 0.5794 0.5635 0.0159 47.1665 2.1098, 2.8982 (+, +)
AS_discr_1 0.8700 0.8883 -0.0184 13.8746 2.1448, 2.9768 (+, +)
AS_perf_1 0.5828 0.5677 0.0151 31.3883 2.1098, 2.8982 (+, +)
AP_discr_0 0.8391 0.8618 -0.0227 18.1327 2.1314, 2.9467 (+, +)
AP_perf_0 0.5559 0.5452 0.0107 30.5314 2.201, 3.1058 (+, +)
AP_discr_1 0.8980 0.9098 -0.0118 13.6751 2.1098, 2.8982 (+, +)
AP_perf_1 0.6281 0.6051 0.0231 20.8987 2.1098, 2.8982 (+, +)
demog_discr_0 0.6446 0.6960 -0.0514 9.3540 2.201, 3.1058 (+, +)
demog_perf_0 0.6118 0.5925 0.0193 14.3345 2.1098, 2.8982 (+, +)
demog_discr_1 0.8483 0.8839 -0.0356 8.0386 2.1199, 2.9208 (+, +)
demog_perf_1 0.6878 0.6359 0.0519 56.3902 2.1098, 2.8982 (+, +)
A 0.7715 0.7714 0.0001 0.0159 2.2281, 3.1693 (-, -)
P 0.8232 0.8065 0.0168 30.8088 2.1098, 2.8982 (+, +)
R 0.7715 0.7714 0.0001 0.0159 2.2281, 3.1693 (-, -)
F1 0.7923 0.7866 0.0058 1.5559 2.2281, 3.1693 (-, -)

Table 5.1: Significance tests between model 1, trained with the accuracy-based training process, and model 2, trained with the ADR-based training process. Computed on all the test data. Models: LR - continuous user model - balanced bin popularity percentage. The performance comparisons on the different metrics are reported vertically. Most metric comparisons show a significant difference between the performance of the two models.

Consequently, we choose the ADR-based training process for the rest of the experiments. The difference is however small, and we conclude that although hypothesis H3 is verified, the increase of fairness performance it brings is limited and usual training processes could also be employed. (H3)
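The significance tests reported in tables such as Table 5.1 can be reproduced with a paired t-test over the per-fold metric values of the two models; the sketch below assumes one metric value per cross-validation fold for each model, with the degrees of freedom following the number of folds (the exact test statistic used in the thesis may be computed slightly differently).

```python
import numpy as np
from scipy import stats

def paired_significance(scores_1, scores_2, alphas=(0.05, 0.01)):
    """Paired t-test on the per-fold values of one metric for two models.
    Returns t_exp, the critical values t_th for each alpha, and whether the
    null hypothesis of equal mean performance is rejected at each level."""
    diff = np.asarray(scores_1, dtype=float) - np.asarray(scores_2, dtype=float)
    n = len(diff)
    t_exp = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
    t_th = [stats.t.ppf(1 - a / 2, df=n - 1) for a in alphas]
    rejected = [abs(t_exp) > t for t in t_th]
    return t_exp, t_th, rejected
```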

5.4.2. Dataset resamplings (H4) Here we study the effect of the different dataset resamplings on the performance of the models.

Annotation-popularity dataset resampling
For the AP resampling, the fairness measures based on the demographic categories exhibit much higher performance with the original distribution of the dataset (around 0.2 difference of dispersion for values between

0 and 1), while the measures based on the AS score and the AP score show higher performance with the balanced distribution, with higher values for the dispersion than for the general performance. The measures based on the ADR do not exhibit large differences. In the resampling, there are as many annotations representative of the majority vote as annotations which differ a lot from it, which makes the models learn both the highly popular and the less popular annotations well. That is why the dispersion measure based on the AP score is higher with the balanced dataset. The measure based on the AS of the samples captures the dispersion of performance between samples whose annotations are all similar and samples whose annotations differ a lot. These clusters have a distribution in the training dataset similar to that of the balanced AP clusters, so the dispersion measure based on the ambiguity of the samples is also higher with the balanced dataset.
Concerning the annotator ADR-based fairness measure, the performance is very similar for each model, with slightly higher general performance on the dataset with the original distribution. This might be because the clusters on which the computations are based contain as many annotations from annotators who often disagree with the majority vote as annotations from annotators who often agree with it. The annotators who disagree the most still do not disagree on more than 30% of their annotations according to Chapter 3. Consequently, all the clusters contain data with generally high consensus, and training a model on the dataset with the original distribution leads to higher performance on this criterion because that dataset mostly contains high-consensus data.
On the contrary, the measures based on the demographic categories show an important decrease of performance, probably because the distribution of the AP score within each demographic category is similar, with more popular than unpopular annotations. Consequently, learning with more annotations leads to higher (and possibly similar) performance within each category, and that is why the dispersion and general performance measures are higher with the dataset of the original distribution. Thus we conclude that the annotation-popularity-based balanced resampling increases the performance of the models on two aspects of fairness, the ambiguity-score and annotation-popularity based fairness measures, but not on the other aspects.

Sample agreement dataset resampling
The trends between the models trained on the datasets with the original or balanced distribution based on the AS differ from the trends observed for the resamplings based on the AP. We observe that, for all the aspects of fairness, the general performance measures are higher for the models trained on the balanced dataset, while the dispersion measures are higher for the models trained on the dataset following the original distribution. In the training dataset, there are as many annotations about samples which exhibit high consensus as annotations about samples with low consensus. Consequently, there are more annotations representative of the majority vote, because the clusters corresponding to the low-consensus samples contain both annotations equal to and different from the majority vote, while the high-consensus sample clusters almost only contain majority-vote annotations. There must then be a more equal number of higher-consensus annotations in each cluster of the different clustering criteria; consequently, the dispersion measure decreases (the lowest-consensus annotations are less "learned" by the model), while the general performance increases since the data in most of the clusters are more evenly "learned". Consequently, the balanced resampling based on the sample agreement can be used when the general performance on the different fairness aspects should be maximized.

Conclusion
We conclude that, depending on the fairness aspect to maximize, different resampling methods making the training dataset balanced over one criterion can be chosen. (H4) In Fig. 5.3, a comparison of the performance of the models trained on the different resamplings, measured with different metrics, is given. It is useful for selecting which resampling to use in order to maximize one of the performance metrics. For example, if one wishes to maximize the dispersion measure of the algorithmic fairness metric, one could choose to balance the dataset at the annotation level over the annotation popularity; this is however at the expense of the general performance measure of the algorithmic fairness metric, which is generally lower than for the other models. Alternatively, one could choose to balance the dataset at the sentence level on the sentence ambiguity, or at the annotator level on the ADR score or the demographic categories, in order to achieve a better balance between the two aspects of the algorithmic fairness metric, as well as to reach higher values of the traditional performance metrics. Each of these resamplings presents slightly higher or lower values on the different performance metrics, but their general trends are the same.

Figure 5.3: Comparison of the performance of the models trained on the different balanced resamplings with different user models. Logistic Regression. The performance trends are the same for the different user models, but not for the different resamplings. The algorithmic fairness performance metrics reported are indicated with "discr" for the dispersion aspect and with "perf" for the general performance; "0" indicates a computation on the whole dataset, while "1" indicates a computation on a smaller subset of the dataset (as explained in Chapter 4).

5.4.3. Machine Learning models (H1, H2)
First, we analyse the results on the dataset balanced over the AP. We compare the performance of the models not using any user model, trained on the aggregated annotations and on all the annotations. The evaluation on the usual metrics and on the dispersion fairness measures exhibits higher performance for the model trained with all the annotations. This is because there are more annotations on which to train the models, and consequently the global performance and the performance for each cluster increase. The general-performance-based fairness measures however decrease, because the annotators are not differentiated and consequently all the clusters which do not correspond to a total consensus between annotations see their average performance decrease, due to the upper limit on accuracy when no distinction between annotators is possible.
In Tables 5.2 and 5.3, we report the results of the comparison between the models trained with the aggregated annotations and with all the annotations combined with a user model (continuous and one-hot encoded respectively). The demographic-based fairness measures decrease when using the user models, while most of the other measures, especially the ones focusing on the discrepancy of performance across clusters, increase. We give in Appendix C.5 examples of high- and low-consensus data samples which are correctly or wrongly classified by each model. We notice that the models with augmented input features learn to predict the annotations of two types of sentences better than the usual models without additional inputs. They are able to predict correctly 1) the uncommon annotations of long sentences which are informative but might also be interpreted as toxic, and 2) the annotations of short non-toxic sentences whose grammar is incorrect.
We additionally analyse the comparison between the models trained with all the separate annotations, without or with a user model. The general performance fairness measures increase when using the user model, because the models are adapted to learn the users' specificities and distinguish between them. The dispersion-based performance decreases: the very low consensus annotations are not learned by the models, while the higher-consensus data are learned better than without a user model, which contributes to a higher dispersion than without a user model.
This second batch of observations nuances our previous conclusions. Part of the improvement observed when using the individual annotations might only be due to the fact that the datasets of annotations contain more data, so that the models are able to learn better. However, it also enables us to conclude that using adapted architectures (here, adapted inputs) increases certain aspects of fairness (especially the general performance across clusters), but that it does not make the dispersion performance increase, because the very low consensus data are too difficult to learn. Having more data would probably enable the models to learn the low-consensus data better.
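A sketch of how the augmented-input models are instantiated, assuming one row per (sentence, annotator) pair with the annotator encoded by one of the two user models above; the feature extraction shown (TF-IDF) is only a stand-in for the actual feature pipeline, and the names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_user_aware_model(sentences, user_features, labels, max_features=10000):
    """Train on disaggregated annotations: one row per (sentence, annotator) pair,
    the sentence features being concatenated with the annotator's encoded profile.
    `user_features` must be a 2-D array aligned with `sentences` (one row per annotation)."""
    vec = TfidfVectorizer(max_features=max_features)
    X_text = vec.fit_transform(sentences)
    X = hstack([X_text, csr_matrix(np.asarray(user_features))]).tocsr()
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return vec, clf
```

The baseline models are obtained from the same sketch by dropping the user-feature block and, for the aggregated setting, by replacing the individual labels with the majority vote of each sentence.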

Table 5.2: Significance tests between model 1 trained with aggregated annotations, and model 2 trained with individual annotations with the continuous user model. Computed on all the test data. Models: LR, trained with balanced sampling of cluster annotation popularity.
Table 5.3: Significance tests between model 1 trained with aggregated annotations, and model 2 trained with individual annotations with the OH user model. Computed on all the test data. Models: LR, trained with balanced sampling of cluster annotation popularity.

Table 5.2:
metric  X̄_1  X̄_2  X̄_1 - X̄_2  t_exp  t_th  signi.
ADR_discr_0 0.9166 0.9386 -0.022 18.4933 2.201, 3.1058 (+, +)
ADR_perf_0 0.6413 0.5679 0.0735 61.9636 2.1448, 2.9768 (+, +)
ADR_discr_1 0.9424 0.9551 -0.0127 28.1282 2.1098, 2.8982 (+, +)
ADR_perf_1 0.6858 0.5939 0.0918 80.2558 2.1604, 3.0123 (+, +)
AS_discr_0 0.8524 0.9006 -0.0483 55.5514 2.1098, 2.8982 (+, +)
AS_perf_0 0.5664 0.5635 0.0028 8.0738 2.1098, 2.8982 (+, +)
AS_discr_1 0.8465 0.8883 -0.0418 33.0378 2.1604, 3.0123 (+, +)
AS_perf_1 0.5727 0.5677 0.005 11.2934 2.1199, 2.9208 (+, +)
AP_discr_0 0.7913 0.8618 -0.0704 77.7297 2.1098, 2.8982 (+, +)
AP_perf_0 0.5332 0.5452 -0.012 33.2595 2.1788, 3.0545 (+, +)
AP_discr_1 0.8677 0.9098 -0.0421 58.1453 2.1199, 2.9208 (+, +)
AP_perf_1 0.6535 0.6051 0.0484 51.5479 2.1098, 2.8982 (+, +)
demog_discr_0 0.9316 0.6960 0.2356 44.3116 2.2622, 3.2498 (+, +)
demog_perf_0 0.6927 0.5925 0.1002 65.2484 2.1199, 2.9208 (+, +)
demog_discr_1 0.9877 0.8839 0.1038 28.6858 2.2622, 3.2498 (+, +)
demog_perf_1 0.7090 0.6359 0.0731 75.5703 2.1098, 2.8982 (+, +)
A 0.7055 0.7714 -0.0659 10.9203 2.201, 3.1058 (+, +)
P 0.8387 0.8065 0.0322 65.3002 2.1199, 2.9208 (+, +)
R 0.7055 0.7714 -0.0659 10.9203 2.201, 3.1058 (+, +)
F1 0.7478 0.7866 -0.0387 9.7651 2.1604, 3.0123 (+, +)

Table 5.3:
metric  X̄_1  X̄_2  X̄_1 - X̄_2  t_exp  t_th  signi.
ADR_discr_0 0.9166 0.9264 -0.0098 5.3677 2.1199, 2.9208 (+, +)
ADR_perf_0 0.6413 0.6127 0.0286 19.8926 2.1098, 2.8982 (+, +)
ADR_discr_1 0.9424 0.9496 -0.0072 9.5522 2.1604, 3.0123 (+, +)
ADR_perf_1 0.6858 0.6503 0.0355 27.4301 2.1098, 2.8982 (+, +)
AS_discr_0 0.8524 0.8795 -0.0271 29.2343 2.1199, 2.9208 (+, +)
AS_perf_0 0.5664 0.5794 -0.0131 36.7404 2.1098, 2.8982 (+, +)
AS_discr_1 0.8465 0.8701 -0.0236 23.7801 2.1314, 2.9467 (+, +)
AS_perf_1 0.5727 0.5844 -0.0118 27.6040 2.1098, 2.8982 (+, +)
AP_discr_0 0.7913 0.8298 -0.0385 25.6621 2.1788, 3.0545 (+, +)
AP_perf_0 0.5332 0.5510 -0.0178 62.2921 2.1448, 2.9768 (+, +)
AP_discr_1 0.8677 0.8946 -0.0269 29.9545 2.1604, 3.0123 (+, +)
AP_perf_1 0.6535 0.6378 0.0157 12.3177 2.1448, 2.9768 (+, +)
demog_discr_0 0.9316 0.6877 0.2438 84.4621 2.201, 3.1058 (+, +)
demog_perf_0 0.6927 0.6358 0.0569 36.3021 2.1199, 2.9208 (+, +)
demog_discr_1 0.9877 0.7892 0.1984 70.7631 2.2622, 3.2498 (+, +)
demog_perf_1 0.7090 0.6798 0.0292 24.4836 2.1098, 2.8982 (+, +)
A 0.7055 0.7376 -0.0321 9.0759 2.1098, 2.8982 (+, +)
P 0.8387 0.8263 0.0124 29.9003 2.1098, 2.8982 (+, +)
R 0.7055 0.7376 -0.0321 9.0759 2.1098, 2.8982 (+, +)
F1 0.7478 0.7698 -0.022 8.2811 2.1098, 2.8982 (+, +)

In Fig. 5.4, we show, on data clustered over different fairness aspects, an example of the performance difference between a model trained with no user model and a model trained with a one-hot encoded user model, with or without aggregation of the annotations into the majority vote. As expected, the clusters corresponding to the lowest-consensus data receive predictions of lower performance than the clusters constituted of higher-consensus data. For the data clustered according to the demographic categories of the annotators, the model which does not distinguish between users exhibits similar performance for each category of population, because they all have similar distributions of consensus over the annotations (Appendix A.2). When employing a user model, the performance of the predictions becomes different across the categories of population, because the populations for which more training data are available see their predictions better adapted than the other populations. This is an indication that the training set should be more balanced if this aspect of fairness is to be optimized. For the other clustering criteria, the model with a user model performs better on the low-consensus data than the models without user model, because it distinguishes between annotators and consequently makes better predictions on data different from the majority vote. On high-consensus data, however, the model using a user model performs worse than the model without user model trained on all the annotations, because it lacks training data to perform as well on this data while also performing better on the low-consensus data. It still performs better than the model without user model trained on the majority vote, which is an indication that hypothesis H1 is verified. Consequently, employing a user model makes the predictions fairer on our targeted aspects of fairness, but, due to the limited size of the training dataset, it also makes them more discriminative. Therefore, we conclude that hypotheses H1 and H2 are verified: when using the individual annotations instead of the aggregated ones, with adapted models, it is possible to increase the different targeted fairness measures (and the traditional evaluation performance) (H1, H2), but it might not be possible to increase the discrimination-oriented fairness due to the limited size of the current training dataset.
We compare the two user property encodings in Table 5.4. The continuous encoding achieves higher fairness performance in terms of dispersion, while the one-hot encoding achieves better fairness performance in terms of general performance across clusters. Consequently, depending on the metrics to maximize, one might select one of the two encodings according to these observations. (H2.1, H2.2)
We perform the same tests on the other resamplings of the dataset. For the dataset with the original distribution of AP, the difference between the models trained without user model and the models trained with a continuous or one-hot encoded user model is not large. However, when there is a difference, it shows a slight improvement of the fairness measures for the models including a user model. The performance differences between the continuous and one-hot encoded models are also small. For the dataset with the original distribution of AS, we make the same observations as in the previous case.

Figure 5.4: Comparison of the error rates of the predictions for the models without and with a one-hot encoded user model, trained on the aggregated annotations and on all the separate annotations. Logistic Regression trained on the dataset balanced over the annotation popularity. Low-consensus clusters receive predictions of poorer performance than higher-consensus clusters. Models with a user model perform better on these low-consensus clusters and approximately equally well on the high-consensus data.

For the dataset with the balanced distribution of AS, using user models makes a significant difference compared to the performance of the models without user model, especially for the one-hot encoded user model, for which the fairness performance differences are important. Generally, the same observations are made with the other resamplings. When the original distribution of the clustering criterion is kept in the dataset, the differences between the models are smaller, maybe because it is not easy for the models to learn both high- and low-consensus data with these datasets.
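For reference, the per-cluster error rates of Fig. 5.4 and the two aggregate fairness measures can be computed along the following lines; the exact formulas are those of Chapter 4, and the simple standard-deviation-based dispersion used here is only illustrative.

```python
import pandas as pd

def cluster_fairness(y_true, y_pred, cluster_ids):
    """Per-cluster accuracy, plus the two aggregate fairness measures discussed in the text:
    a dispersion measure (1.0 when every cluster is served equally well) and the mean
    performance across clusters."""
    df = pd.DataFrame({"y": y_true, "p": y_pred, "cluster": cluster_ids})
    per_cluster = df.groupby("cluster").apply(lambda g: (g["y"] == g["p"]).mean())
    dispersion = 1.0 - per_cluster.std(ddof=0)
    general = per_cluster.mean()
    return per_cluster, dispersion, general
```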

5.5. Conclusions, summary
In this chapter, we investigated how to modify current Machine Learning models (RQ3.1) made to predict subjective properties of samples in order to make their outputs fairer, and we focused on the task of automatically predicting the toxicity of sentences. We based our hypotheses on findings of the psychology literature.

metric  X̄_1  X̄_2  X̄_1 - X̄_2  t_exp  t_th  significance
ADR_discr_0 0.9386 0.9264 0.0122 8.1394 2.2281, 3.1693 (+, +)
ADR_perf_0 0.5679 0.6127 -0.0448 38.2222 2.1448, 2.9768 (+, +)
ADR_discr_1 0.9551 0.9496 0.0055 7.5812 2.1788, 3.0545 (+, +)
ADR_perf_1 0.5939 0.6503 -0.0563 59.2549 2.1314, 2.9467 (+, +)
AS_discr_0 0.9006 0.8795 0.0211 20.8367 2.1098, 2.8982 (+, +)
AS_perf_0 0.5635 0.5794 -0.0159 47.9042 2.1098, 2.8982 (+, +)
AS_discr_1 0.8883 0.8701 0.0183 13.0858 2.1199, 2.9208 (+, +)
AS_perf_1 0.5677 0.5844 -0.0167 34.3179 2.1098, 2.8982 (+, +)
AP_discr_0 0.8618 0.8298 0.032 20.9824 2.1604, 3.0123 (+, +)
AP_perf_0 0.5452 0.5510 -0.0058 14.1759 2.1199, 2.9208 (+, +)
AP_discr_1 0.9098 0.8946 0.0152 15.4435 2.1199, 2.9208 (+, +)
AP_perf_1 0.6051 0.6378 -0.0327 24.8570 2.1314, 2.9467 (+, +)
demog_discr_0 0.6960 0.6877 0.0083 1.4151 2.1604, 3.0123 (-, -)
demog_perf_0 0.5925 0.6358 -0.0433 32.9756 2.1098, 2.8982 (+, +)
demog_discr_1 0.8839 0.7892 0.0947 20.7016 2.1199, 2.9208 (+, +)
demog_perf_1 0.6359 0.6798 -0.0439 39.8457 2.1314, 2.9467 (+, +)
A 0.7714 0.7376 0.0338 5.4418 2.1604, 3.0123 (+, +)
P 0.8065 0.8263 -0.0198 40.5951 2.1199, 2.9208 (+, +)
R 0.7714 0.7376 0.0338 5.4418 2.1604, 3.0123 (+, +)
F1 0.7866 0.7698 0.0168 4.1217 2.1448, 2.9768 (+, +)

Table 5.4: Significance tests between model 1 trained with the continuous user model, and model 2 trained with the OH user model. Computed on all the test data. Models: LR, trained with balanced sampling of cluster annotation popularity, with individual annotations.

Several variables corresponding to the personal background of each person, namely the age, gender, education level, and ethnicity, influence how somebody perceives the toxicity of a sentence. Consequently, we hypothesized that 1) augmenting the inputs of usual algorithms with this background information, encoded in a continuous or one-hot encoded way, and 2) training the corresponding models on a dataset which consists of sentences, judgements by different people and individual information about these people, instead of sentences and unique labels resulting from the aggregation of multiple annotations, make the models fairer. We further hypothesized that 3) optimizing the hyperparameters of the models on the fairness performance instead of the accuracy performance also makes the models fairer. We also proposed to 4) resample the training dataset by balancing it over one fairness aspect to help the models learn.
We verified the hypotheses by comparing the fairness performance of traditional models to the fairness performance of our proposed models trained on the Jigsaw dataset, using significance tests, the models being instantiated with a Logistic Regression. We concluded that our hypotheses are correct: adapting Machine Learning models to take into account the personal background of the person judging a sentence makes the outputs of these models fairer than usual models (RQ3.2, RQ3.3, RQ3.4). Moreover, different resampling methods enable optimizing different aspects of algorithmic fairness (RQ3.5). Even though this might not greatly improve the accuracy of the models, for example it is only improved by 0.02 for the proposed model trained on the annotations in the AP-balanced dataset (0.68) compared to the traditional model trained on the majority votes in the same dataset (0.66), it improves the fairness of their predictions. For people who very often agree with the majority vote (64% of the users of our models), the accuracy is only improved by 3%, going from 0.71 to 0.74; but for the people who disagree between 30% and 40% of the time with the majority vote (0.38% of the users in our case) and are usually ignored by the models (the accuracy of traditional models is very low, around 0.59), the accuracy of the predictions is improved by 5%, going up to 0.64, which is an important improvement for them. However, we observed a trade-off between the fairness of the models towards their different users and the accuracy of their predictions. These limitations are due to several reasons. The accuracy of the predictions is limited by the prediction power of the classifiers used (trade-off between the complexity of the data to learn, the number of features and the number of training data available), and investigating more complex models such as Deep Neural Networks could help improve the predictions. Moreover, the trade-off is also due to the fact that the employed dataset does not contain ethnicity information, whereas it is claimed to be an important variable in the perception of sentence toxicity, and that even if it did contain this information, this set of available background information would not be enough to account for all the different subjectivities of the different users.
To overcome this limitation, it would be appropriate to investigate how to use identifiers of the different users without employing a limited set of background information to describe them (Appendix C.2.1).

6 Conclusion

In this chapter, we first discuss the work done in the thesis and draw conclusions to answer the main research question. Finally, we propose future work in light of these conclusions.

6.1. Discussion of current work In this section, we analyse the pipeline of our system to highlight its potential strong points and limitations.

6.1.1. Focus of the work
Machine Learning (ML) is traditionally employed to perform classification tasks on properties which present a high consensus, like the objective task of predicting whether a radiography shows a cancerous tumour or not. Nowadays, ML is also increasingly used to predict subjective properties of content, such as predicting whether a video segment is perceived as violent, whether an image is perceived as being of aesthetic quality, or whether a sentence written on the Web is judged toxic or not. We chose to specifically study the fairness of the predictions of these ML models. To the best of our knowledge, this is a completely new focus in the field of algorithmic fairness, since other research focuses on ML for the classification of people.
Although not all the hypotheses we proposed to answer the research questions and overcome current limitations of the predictions were verified by the experiments, we laid the groundwork for the study of algorithmic fairness in the case of subjective property classification. We identified the areas which require more work, and constituted baselines on which to further experiment.

6.1.2. Interdependence of the three entities in the system
The thesis work is organized in three major parts: the dataset collection via crowdsourcing, the algorithmic fairness evaluation method, and the adaptation of ML algorithms to the task of predicting subjective properties. We believe that our approach combining these three areas is a strong and novel point of our project.
Most research tackles only one of the three areas in order to study and/or mitigate the fairness of the predictions of ML models, which only enables small progress. These three areas are interdependent, and consequently if one of them is of poor quality in the pipeline, the quality of the others is affected. For example, if the dataset collection step provides few or low-quality labels, the evaluation might be inaccurate because there is not enough data to compute significant measures, or the ground truth data might not be correct or might not contain all the possible perceptions of the subjective property. Researchers interested in improving the fairness of the outputs would report small increases in the accuracy of the predictions of their models, but the models might output predictions which, although they correspond to the available ground truth data, are actually invalid since this ground truth is incorrect or constitutes a biased evaluation set.
On the contrary, we decided to study the three areas together to overcome these shortcomings, with careful attention to reducing the influence of the possibly low-quality areas. We used a simple classifier whose behaviour is well known in order to investigate new evaluation methods of algorithmic fairness. We chose a state-of-the-art dataset to conduct our experiments, which is assumed to be of high quality. We analysed it in detail in order to identify its possible limitations and improve it. We analysed extensively our proposed evaluation method and the algorithmic fairness of our proposed models applied to the post-processed dataset, with many experiments, visualizations and significance tests on the outcomes of the models and evaluations.


6.1.3. Computation and validity of the fairness metric
The process of evaluating ML models with the new algorithmic fairness evaluation method using cross-validation is very time- and computing-power-consuming, but it is necessary when applying significance tests because, contrary to the accuracy, the standard error of the measures cannot be directly estimated from the calculation of the metric over one group of data samples. It would however be relevant to investigate whether this standard error (or standard deviation) can be derived statistically from the mathematical definition of the metric, so that it could be calculated without running the experiments multiple times. It would also be relevant to investigate whether current fairness evaluation methods could be adapted to our task, and, in the case where they could, whether they would provide interpretations about the fairness of our models similar to those given by our measures of algorithmic fairness.
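One way to approximate this standard error without re-running the cross-validation would be to bootstrap the evaluation annotations while keeping the trained model fixed, as sketched below; whether this captures the same variability as repeated training runs remains to be verified, since it ignores the variance introduced by the training itself.

```python
import numpy as np

def bootstrap_std_error(metric_fn, y_true, y_pred, cluster_ids, n_boot=1000, seed=0):
    """Estimate the standard error of a clustered fairness metric by resampling the
    evaluation annotations with replacement, keeping the trained model (y_pred) fixed."""
    rng = np.random.default_rng(seed)
    y_true, y_pred, cluster_ids = map(np.asarray, (y_true, y_pred, cluster_ids))
    n = len(y_true)
    values = [metric_fn(y_true[idx], y_pred[idx], cluster_ids[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return float(np.std(values, ddof=1))
```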

6.2. Conclusions
The predictions of ML models made to classify subjective properties of samples might be unfair towards certain people, which can have harmful consequences for them. That is why we set out to study how fair these ML models currently are and how to increase their fairness (RQ). From the analysis of the models, we identified three problems which contribute to their unfairness. 1) Current crowdsourcing methods to collect the datasets on which to train and evaluate the algorithms create biased datasets, which make the trained models unfair (Chapter 3). 2) There is a lack of a definition of fairness adapted to our task, and consequently a lack of methods to evaluate algorithmic fairness (Chapter 4). 3) The architectures and training processes of the algorithms are not adapted to account for users' properties related to the subjectivity of the sample property in the domain at stake, but only optimize accuracy, which does not enable taking potential unfairness into account (Chapter 5). We tackled all these limitations in the thesis project.
Contribution 1: literature review. We proposed a comprehensive literature review of the different fields related to the three limitations, in order to identify the exact causes of unfairness and formulate new hypotheses. The review led us to study algorithms for the prediction of sentence toxicity, because psychology showed that sentence toxicity is a subjective property, and this task is very important to limit the use of abusive language on the Internet. We selected to focus on dataset biases resulting from the aggregation of the crowdsourced annotations by majority voting, because it is the most common crowdsourcing method to ensure the quality of the dataset. We found through the literature what influences the perception of a content as toxic on a personal level, namely the age, gender and education level of the person looking at the content. Additionally, we found a state-of-the-art dataset (the Jigsaw dataset) and ML model (the Logistic Regression) which are adapted to studying the fairness of toxicity prediction models.
Contribution 2: identification of current limitations of crowdsourcing methods, and of solutions. We analysed the Jigsaw dataset, and critically addressed the problem of dataset bias resulting from crowdsourcing tasks. We showed that the aggregation of annotations ignores a majority of people's lines of thought (RQ1.1), with 51% of the annotators disagreeing around 15% of the time with the majority vote and 4.5% of the annotators disagreeing at least 20% of the time. This justifies that the predictions of the ML models are unfair towards most people if they only represent the majority vote. We searched how to collect valid perceptions of sentence toxicity by crowdsourcing, while mitigating the majority-vote bias and keeping the cost of the task low. Concerning the design of the crowdsourcing task, it was shown that although general crowdsourcing methods are applicable to the collection of annotations of subjective properties, techniques from psychology could be included to improve the results (RQ1.2).
Concerning the post-processing of crowdsourcing results, it was shown that the methods proposed in the literature are not appropriate to the current task, and that although the methods which compute quality scores of annotators to identify wrong annotations (CrowdTruth framework) enable removing spammers (around 1.5% of the annotators), they do not enable distinguishing between the occasional mistakes of some annotators (estimated at 2.5% of the annotators) and the valid annotations which reflect a minority opinion (estimated at around 10% of the annotations) (RQ1.2). Concerning the cost of the annotation collection, we proposed to select the samples to annotate and the annotators by clustering the samples in meaningful ways, but we did not find a proper algorithm to obtain such clusters. It was concluded that more research is needed to collect the different perceptions of a subjective property via crowdsourcing with high quality and at low cost (RQ1.3). (RQ1)
Contribution 3: evaluation method for algorithmic fairness. Existing definitions and evaluation methods of algorithmic fairness are interested in algorithms which classify people, and not in algorithms made to classify samples according to the perception people have of the properties of the sample. Consequently, we worked on a new definition focusing on the problem of accounting for personal perceptions instead of only presenting the perception of the majority. We defined an algorithm as fair when its prediction performance is equal for each user (RQ2.1). Also, we proposed a method to characterize and evaluate algorithmic fairness through clustering of the evaluation dataset into meaningful clusters at an annotation, user or sample level, and verified it in the context of toxicity (RQ2.2, RQ2.3). (RQ2)
Contribution 4: adaptation of ML models. We then turned our attention to ML models, and showed how to adapt them in order to make them output the different perceptions people have of a sample's property. We proposed to train the models with all the annotations instead of the majority vote, and to augment the inputs of current algorithms with the properties of the user for whom the algorithm is asked to make a prediction (RQ3.1-3.4). Although this only increases the global accuracy of the models by 2%, it identifies the users who are usually ignored (because they often disagree with the majority vote) and who consequently usually receive predictions of very low accuracy (59% accurate), and it improves the accuracy of their predictions by 7%. It additionally helps reducing the disparities between the accuracy of the predictions made for users who often agree with the majority vote and the annotators who often disagree, which reduces the unfairness of the predictions: the users' accuracies span between 59% and 72% for traditional models (13% range), against 66% and 74% for our models (8% range). It is concluded that these modifications of the architecture and training process of ML models increase the fairness performance according to our definition. (RQ3)
Contribution 5: resampling of the training dataset. Additionally, we listed a choice of different resamplings of the training set into balanced datasets over different fairness-related criteria (RQ3.5), to help the models learn correlations in the data and consequently maximize their performance according to the desired aspects of algorithmic fairness.
(RQ1, RQ3)

To summarize, we proposed a new evaluation metric of algorithmic fairness specifically for algorithms which classify samples according to subjective properties. We concluded that targeted crowdsourcing can be used to collect properly balanced datasets of valid perceptions of subjective properties of samples, possibly at low cost, provided that the task is well designed and that the results are post-processed appropriately. Finally, we showed that ML models can be fairer according to our definition if their architecture and training process are modified, possibly using the fairness metric, and if they are trained on these datasets collected via targeted crowdsourcing. Therefore, the main hypothesis of the thesis, which answers the main research question RQ, is verified. Current ways of building datasets using crowdsourcing and of training ML models lead to unfairness when classifying subjective properties; this unfairness can be mitigated by adapting both the methods to collect the training set and the models trained on it. A portion of the thesis work was published after the CrowdBias workshop of HCOMP 2018 (Appendix E).

6.3. Proposition of future work In this section, we propose future work in order to make the three parts of our system even more effective.

6.3.1. Application to different use-cases
The complete thesis work is based on systems for sentence toxicity prediction. However, we aim at studying the prediction of subjective properties of samples in general. Consequently, future work should also address the generalization of our method to other tasks which involve subjective properties. Mainly, the performance of each entity in the system could be investigated in other domains. The creation via crowdsourcing of other datasets, as well as the adaptation of the ML algorithms with features representing the different users' characteristics relevant to the chosen task, for example video segment violence prediction, should be investigated in different domains to check for the generalization of the pipeline. The fairness evaluation method should however remain the same; we could simply check its validity by comparing its expected and observed behaviour.

6.3.2. Creation of a benchmark dataset
A benchmark dataset could be constituted to make the evaluation of the rest of the system independent of the crowdsourcing method used to create the dataset. This is difficult because it is not possible to define "expert annotators" when the property to annotate is subjective, but well-trained and trusted annotators could be asked to provide the annotations. There is no available dataset containing both data samples and annotations of the samples according to one subjective property, as well as information about the annotators. Building such a dataset, possibly large enough to train ML and Deep Learning algorithms, would be a useful contribution for both the Crowdsourcing and ML communities. Two similar challenges would be to ensure that 1) every perception of the property on one sample and 2) all the different types of annotators with their different lines of thought are contained in the dataset. Possibly, one could investigate how many annotations per sample, and how many different annotators with what kind of characteristics, are necessary in a dataset to train algorithms with sufficient performance.
Additionally, several questions remain on how to perform crowdsourcing tasks to gather data about subjective properties (Chapter 3). For the case of a dataset of sentence toxicity judgements, we should investigate how to make the crowdsourcing task design as clear as possible, in order to eliminate all the causes of disagreement among annotators except the subjectivity of each annotator. For example, the type of context information to specify for each sample to annotate, the selection of only one category of speech (racism, sexism, ...) and the questions to ask the annotators could be investigated based on methods employed in psychology. We could also collect annotations with more information about their annotators (ethnicity is an important variable according to psychology). For the post-processing of crowdsourcing results, how to separate invalid annotations from valid but minority annotations remains a research question. For the collection of annotations at low cost, we could investigate how to cluster the samples and/or annotators meaningfully, in order to spread the collected annotations of certain samples or annotators to the rest of the elements of the cluster.

6.3.3. Adaptation of the Machine Learning algorithms
In our task, the psychology literature points out variables which influence the perception of the property to annotate. However, in prediction tasks of subjective properties where the variables influencing the judgements are not known, it is not possible to make use of the characteristics of the annotators in order to adapt the predictions to each of them. Consequently, it is worth investigating how to adapt the ML algorithms in these cases. It is also useful in cases where these variables are known, but not specified for certain users of the system. To address these questions, we propose to investigate hypotheses H5 and H6 formulated in Appendix C.2.1.
Concerning the training process of the ML models, another way to improve fairness could be to modify their training loss function in order to take into account the different fairness metrics. For example, the fairness metrics measuring the dispersion of cluster performance and/or the average performance could be added as regularization terms on which to optimize the models, so that fairness is taken into account in the training process and not only in the selection of their hyperparameters and/or architecture.
Concerning the information used to train the models, we used features extracted from the sentence samples and features representing a user's characteristics. However, the psychology literature highlights the importance of the sentence context in the perception of toxicity. Consequently, it would be interesting to evaluate whether modifying the architectures and/or input features of the models to take the sample context into account could improve the accuracy and the fairness of the predictions. This would require creating adapted datasets containing information about the context of the samples next to the samples and annotations.
Finally, we did not investigate feature engineering to encode the sentence samples in detail. However, extracting additional features which give more information, such as the category of speech of the sample, might help the training of the algorithms. This could be another direction to look into to increase the performance of the algorithms specifically made for sentence toxicity prediction.
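Assuming a differentiable model trained by gradient descent (for example the MLP or Deep Learning models left as future work), such a fairness-regularised loss could take the following form; the variance penalty and the weight lam are illustrative choices, not a proposal verified in this thesis.

```python
import torch
import torch.nn.functional as F

def fairness_regularised_loss(logits, targets, cluster_ids, lam=0.1):
    """Binary cross-entropy plus a penalty on the variance of the mean loss per cluster,
    so that no cluster of annotations is systematically served worse.
    targets: float tensor in {0., 1.}; cluster_ids: integer tensor aligned with the batch."""
    per_sample = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    cluster_means = torch.stack(
        [per_sample[cluster_ids == c].mean() for c in torch.unique(cluster_ids)])
    return per_sample.mean() + lam * cluster_means.var(unbiased=False)
```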

6.3.4. Introduction of an Active Learning process
Active Learning is a particular type of training process used when a large part of the dataset is unlabelled. It consists in requesting annotators to provide labels for selected data samples, possibly selected to maximize the information that the labelled samples would bring to the training process. Considering that we aim for the ML models to learn the subjectivities of each user, we need more data than the quantity required when having one unique label per sample. The data are costly to obtain, and we need to target specific users to collect their individual lines of thought. Thus, Active Learning is an adapted training process for our task. We could investigate how to adapt current methods for the selection of the samples to annotate to the selection of the pairs of samples and users to obtain annotations on, which would enable optimizing the trade-off between the cost of the crowdsourcing task and the performance of the models; a simple selection criterion is sketched below.
From Chapter 5, we identified dataset resampling methods to efficiently train the models. We could use this information for the selection of the samples and annotators to ask for annotations, so that the dataset is resampled not after but during the crowdsourcing task. That might enable sparing annotations which would otherwise have been removed, and using this annotation budget for missing classes. Finally, a last direction of research could be to investigate how this Active Learning process would perform online. For example, we could investigate whether it is feasible in a given time period to collect the annotations needed to train the models with sufficient performance, especially considering that certain annotator types might rarely be available because they have rare characteristics, such as a person of 15 to 18 years old (age) with a doctoral education (education level).
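As a starting point, the selection of the (sample, annotator) pairs to send to the crowd could rely on uncertainty sampling with the current user-aware model, as sketched below; the acquisition criterion and the names are illustrative, and the selection could further be constrained by the resampling criteria of Chapter 5.

```python
import numpy as np

def select_pairs_to_annotate(model, candidate_features, candidate_pairs, budget=100):
    """Pick the (sample, annotator) pairs whose current prediction is the most uncertain
    (predicted toxicity probability closest to 0.5), one simple acquisition criterion.
    candidate_features: augmented features (sentence + user encoding), one row per pair."""
    proba = model.predict_proba(candidate_features)[:, 1]
    uncertainty = -np.abs(proba - 0.5)          # higher value = more uncertain pair
    chosen = np.argsort(uncertainty)[::-1][:budget]
    return [candidate_pairs[i] for i in chosen]
```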

A Analysis of the Jigsaw dataset

Here we list example sentences contained in the Jigsaw dataset, ordered according to the percentage of agreement between the annotators (Table A.1).

A.1. Example sentences contained in the Jigsaw dataset
A.1.1. Examples of sentences in the dataset ordered by toxicity annotator agreement

Table A.1: Example sentences ordered by agreement between annotators.

% agreement   Majority vote   Sentence

50% "I HAVE A VANDAL BOT AND I AM WILLING TO USE IT IF I HAVE TO! I WILL BAN EVERYONE!" "‘ ==Tesla stuff== Notice how all those “ugly American“ comments are merely part of Irismeister’s rant. ‘" "****Wow, you’re so clever. So smooth. Stop being an ass so we can compromise!" "The article is true, the Israeli policies are killing Arab children." "==Hi!== How dare you mollify moderate modify my comments? I do not modify your comments to pages! I am not a shoe sock puppet, I am a well-intentioned troll! I’m stone half the day, you know. By the way, what’s a sock puppet? Is that an American thing? " "After the wasted bit on his sexuality, I haven’t read the above diatribe but one thing for ABSOLUTE certain: Never in the history of this 60% toxic planet has Umberto ever been referred to as Humbert. It is this kind of lunacy that renders Wikipedia as a “fringe“ place for idiots to play. Get serious....DW‘" "Prior to Quickpolls, he would have been perma-blocked by now. Guess I’ll remove the block again. As soon as I get time, which will probably be about November. If you’ve got any brains, you will let that particular troll sleep on. Best." non- 60% "::If I pick enough holes in you, will you turn into swiss cheese? ‘Ed" toxic "I will tell you to stop.You think you a famouces.I Know you,I Know you Working to ISIS.I Send to CIA To Capture you." "== Get lost vandal! == Stop removing the notability tag from Steve Rackman. He’s not notable and you know he’s not notable! Stop making excuses and leave the tag alone! Better still, send it to AfD!" "Better believe it. At least I earned a Barnstar, Wikipedia’s highest honor, because I’ve been doing nothing but cleaning up your utter nonsense. In a way, I have you to thank, but I’ve gone through hell helping to maintain this site because of you. You drop your pathetic 70% toxic entries and run, never bothering to come back and fix anything. Your intelligence is suspect as is your taste in entertainment. Loser. I STILL dare you to answer me. -" "is the most evil wikipedian in the world I’ve ever seen." "I fixed the link; I also removed “homeopathy“ as an exampleit’s not anything like a legitimate protoscience, or even half-legit. It’s non- 70% total pseudoscientific nonsense, and not taken seriously as many protosciences are. I’m willing to tolerate a sympathetic and historical toxic treatment of it on its own page, but pages about real science shouldn’t be littered with frauds." ":::Yeah, it’s kind of silly. Ha, no good deed goes unpunished." 80% toxic "“‘Nazi filth“ is impolite‘" "Master Fag Mind your own damn biz those pages had no quarrel with you" "Delete my 200 articles Whhore" "I am a wanker who molests goats" non- "Please relate the ozone hole to increases in cancer, and provide figures. Otherwise, this article will be biased toward the environmen- 80% toxic talist anti-CFC point of view instead of being neutral. Ed Poor" 90% toxic "== renault == you sad little bpy for driving a renault clio which has no vaa voom so there and the mcflurry is made of shit" "This user is the child of a whore." "This: :One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard non- 90% deviations of the mean opinion. sounds arbitrary and ad hoc. Does it really belong in n encyclopedia article? I don’t see that it adds toxic anything useful. 
The paragraph that follows seems much more useful. Are there any political theorists out there who can clarify the issues? It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated... SR" "== Why oh why... == You removed the trolls ANI section about Drmies.....WHILE I WAS EDITING IT, and I had collected almost 10 pictures from commons to use there. It was going to be such hilarious drama, you owe me my morning laughs....."


100% toxic: "==You’re a motherfucking idiot== You’re stupid as all hell and know *nothing*! Butt out of the business of your superiors!" / "**Fuck you, you psychotic fanboy retard."
100% non-toxic: "== Japanese Scene == The largely neoclassical Japanese power metal scene should be mentioned somewhere."

A.1.2. Examples of sentences in the dataset ordered by Unit Quality Score
We give examples of sentences in the dataset ordered by Unit Quality Score computed on the binary labels (Table A.2) and on the full labels (Table A.3).

Table A.2: Example sentences ordered by Unit Quality Score.

Unit Quality Score   Sentence

0.427 ‘::::Ce type est un sauvage , , ; et stupide ! (Livonian extinct in one article, living in another one.) ( ) .
0.428 lolololololol ur ghey
0.428 YOU WILL BE ASSIMALATED
0.429 == Nlu == Mayne ya knoe Nlu ain’t shyt,
0.429 == u smell 2 == its contaygius
0.429 ==Hmmmm== Cozy, ain’it? Rumors are you two chumps have go’n yer mits on some treasha, and I wanna piece of dat poie! s
0.433 == WHY DID YOU EDIT THE ARTICLE ON KURDISTAN? == DID YOU READ THIS DISCUSSIONS BEFORE YOU MADE THE CHANGES? They are adding silly things to the article.
0.433 Serendipodous by definition means...(ha!!) == diff Noticed your increasingly abusive comments at the Planetary Habitiability article, thought I would mention something you missed: There an new article on Serendipodousosity and guess what, it simply means loser. Hey can you be any more of a loser?

1.0 Regardless of how relevant or not relevant climbing the mountain is, that paragraph constitutes important historical background material as to previous occupations on the mountain. If you move it around you are going to severely mess up the flow of the section, for fairly little benefit. Yeah, it focuses much on the summit and upper flanks of Mauna Kea. In this case, there is no escaping it. There are sources to one, and not to the other; moving everything around has so far only messed up flow and made it hard to read; Nowiki’s idea of verifiability based on other Wikipedia articles is a confligation of WP:SOURCE (do not use Wikipedia as a source) and would be immediatly removed come the FAC. All in all, the best course of action is to simply agknowledge the biase and move on from there. The section is hardly the most important one in the article anyway.
1.0 I think the structure’s fine. What the word means should lead, then the core tenets, then other facets and aspects, then examples.
1.0 == hang in there == As a lurker on the FU pages I appreciate your attempt to bring a rational discussion to the pages. I am very impressed by your ability to withstand the invitation to sniping and to keep your temper. It was more than I could do. thanks.
1.0 ‘==The Redirect== I’m restoring the article per Talk:Tim Nordwind. Someone claimed Dan has no notability on his own, well, if that’s true then the Larry Mullen Jr. & Adam Clayton articles should be redirects to the main U2 article too. ‘

Table A.3: Example sentences ordered by Unit Quality Score computed on the full labels.

Unit Quality Score   Sentence

0.132   ==Racial Policies/White Discrimination?== “There has been increasing discrimination against white running backs, defensive backs, and receivers, who have been less and less visible for the last 25 years. In 2005, a minimal majority of offensive linemen are white. Most quarterbacks, punters, and kickers are white, while almost all running backs, wide receivers, defensive backs, defensive linemen, safeties, punt returners, and kickoff returners are black.“ These sentences are awkward, and fairly innacurate. There are a number of very talented running backs and wide receivers of all ethnicities. There are also many punt returners and kickoff returners of all ethnicities - I’d like to see where this person got their information. Moreover, to say that there is discrimination against one group simply because it’s mostly filled by another ethnic group is innacurate. For us to write in an encyclopedic article that there is discrimination, we should have some, you know, evidence of it. *If there are no sources, it is best to remove it.
0.134   ==Merry Christmas!== LOL! I’m a dog person myself, but what the heck, we can all get along! . . .and a Happy New Year too!
0.142   == You smartass == You read the Introduction to editing!!!
0.144   If it makes you feel better, we can reword the article to read “Lehi justified acts of terror against their enemies“ or somesuch, without specifically using the word “terrorism“ in this context. (Btw, I’m puzzled as to how a defunct organization can still justify acts of terror.)
0.145   ==Quadell’s Sham Nomination== Thank you for having the balls to stand up to Quadell and tell him “No!, we don’t want convicted sex offenders to have bureaucrat powers¡‘ Keep up the great work!
1.0   == media == Isn’t there an example in Wikimedia Commons that can be added here?
1.0   This page needs a bot to fix the ISBN numbers.
1.0   == Now do you see why I quit? ==
1.0   ::I think it’s OK.
1.0   O.K. I forgot myself at Zilina.
1.0   == November 2009 == subst:test1|Iraq national football team -
1.0   REDIRECT Talk:Munir Hussain (cricketer)
1.0   == nonsense == are you a boy or a girl
1.0   people use for simplification reason
1.0   == image === Replace http://en.wikipedia.org/wiki/File:Yohkohimage.gif with http://commons.wikimedia.org/wiki/File:TheSun.png

A.2. General analysis of the Jigsaw dataset

Crowd worker information
In total, 4301 different annotators annotated the samples, of which 3591 have their background information available. In Fig. A.1, we present the bins into which the demographic information is classified, and the distribution of the workers over these bins. We plot in Fig. A.2 the distribution of the workers over the three retained demographics. The demographic categories are not represented equally in the dataset, which is expected since the crowdsourcing population is not representative of the whole population. However, we have to account for this imbalance if we consider that toxicity perception differs across the categories.

(a) Age distribution (b) Gender distribution

(c) Education distribution (d) Distribution of native English speakers

Figure A.1: Presentation of the background information of the workers and their distribution in the pool of workers. The different demographic categories are highly unbalanced.

Annotation information
In total, 1598289 annotations are available for 159686 unique comments. Most of the comments have 10 annotations, and never fewer than 8 (Fig. A.3a). The distribution of the number of annotations each worker gave is almost uniform for 0 to 400 annotations per worker, but a large proportion of the workers made more than 400 annotations (Fig. A.3b). The distributions of the toxicity score and toxicity labels are plotted in Fig. A.4. We observe that these distributions are very unbalanced, which might be an obstacle to training the Machine Learning models accurately since they would be biased towards the most represented class (non-toxic). We investigate in the next chapters the effect of resampling the dataset to help the algorithms learn.

Analysis of the agreement between workers
For each sample, we compute an agreement score on the binary labels (toxic / non-toxic), the score being the largest number of annotations sharing a same label divided by the total number of annotations for the sample. We present in Fig. A.5a the distribution of agreement scores. Most comments have a high agreement rate, the average agreement over all the annotations being 0.91761. However, there are still around 60000 comments with less than 100% agreement. It is for these comments that we want to enable the algorithm predictions to be tuned to each individual crowd worker. We further investigate in Fig. A.5b the agreement rate for the toxic and non-toxic comments separately (the toxicity label considered is the label that the majority of annotators voted for, choosing toxic in case of an equal number of annotations for each class).
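As a minimal sketch (assuming the annotations are stored in a pandas DataFrame with hypothetical columns rev_id and toxicity), the per-sample agreement score described above could be computed as follows; the implementation used in the thesis may differ.

import pandas as pd

def agreement_score(labels):
    # Fraction of the sample's annotations that carry the most frequent binary label.
    counts = labels.value_counts()
    return counts.max() / counts.sum()

# annotations: one row per (sample, worker) judgement; column names are assumptions.
annotations = pd.DataFrame({
    "rev_id":   [1, 1, 1, 1, 2, 2, 2, 2],
    "toxicity": [1, 1, 0, 1, 0, 0, 0, 0],
})

per_sample_agreement = annotations.groupby("rev_id")["toxicity"].apply(agreement_score)
print(per_sample_agreement)         # e.g. sample 1 -> 0.75, sample 2 -> 1.0
print(per_sample_agreement.mean())  # average agreement over the dataset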

Figure A.2: Multi-dimensional distribution of the crowd workers along their background information. The different demographic categories are highly unbalanced.

(a) Distribution of the number of annotations per sample (b) Distribution of the number of annotations per crowd worker

Figure A.3: Analysis of the annotation distribution. Most samples have 10 annotations. Most annotators (around 450) provided a high number of annotations (around 2400).

(a) Distribution of the toxic/non-toxic comment annotations (b) Distribution of the toxicity judgements

Figure A.4: Distribution of the toxicity judgements. The classes are highly unbalanced.

The judgements for the comments which are seen as toxic by a majority of people have an agreement rate which is more evenly distributed between 0.5 and 1 than for the non-toxic comments. Comments which are not toxic are probably less ambiguous, or the definition of non-toxic comments might be clearer, while toxicity is a subjective property. Distinguishing between the annotators’ perceptions would make the algorithms fairer since it would represent each annotator’s point of view. Examples of sentences contained in the dataset are reported in Appendix A.1.1, ordered by the percentage of agreement over the toxicity labels. We observe that the higher the agreement on toxicity, the more insulting words the sentences contain. The more non-toxic a sentence is judged, the longer it tends to be, probably because it contains substantive comments.

(a) Agreement percentage distribution over the Jigsaw dataset (b) Agreement percentage distribution over the Jigsaw dataset, divided among the two toxicity categories

Figure A.5: Analysis of the annotation agreement distribution. Most annotations for each sample have high agreement. However, the annotations for the samples judged toxic by the majority are more prone to disagreement than those of the non-toxic samples.

We also investigate the agreement rate distribution for each demographic category. First, we intended to plot the distribution of the standard deviations over the annotations of a same sample for each demographic category. However, these results are not significant because the dataset only contains one to three annotations by workers of the same demographic category for each sample. Instead, we decided to investigate how close the annotations of each individual are to the majority-vote label of the whole population of the demographic category he/she belongs to. We plotted for each demographic category the distribution of the agreement rate between the annotations and the majority vote. However, this does not bring insights: we only observe that most annotations of each demographic category are equal to the majority vote, because there are few annotations per sample per demographic category. We finally decided to compute for each worker the average rate at which his/her annotations differ from the majority vote of the whole population (the worker average disagreement rate (ADR)). Then, we plotted in Fig. A.6 the normalized distribution of this average within each demographic category (from top to bottom, left to right, the most frequent to the least frequent categories). We observe that for most demographic categories, the distributions are similar, with most of the workers always agreeing with the majority, and around 20% of workers disagreeing 10% or more of the time with the majority vote. This shows that no specific demographic category disagrees with the majority, but part of the workers in each category does. Therefore, only using these demographics as features to predict toxicity perception might not be enough to represent the workers.
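As a minimal sketch (using the same hypothetical DataFrame layout as above, with an additional worker_id column), the worker ADR could be computed as follows; ties are labelled toxic, as in the rest of this analysis.

import pandas as pd

def worker_adr(annotations):
    # Majority-vote label per sample, with ties labelled toxic (1).
    majority = (
        annotations.groupby("rev_id")["toxicity"]
        .mean()
        .apply(lambda p: 1 if p >= 0.5 else 0)
        .rename("majority_label")
    )
    merged = annotations.join(majority, on="rev_id")
    # Average Disagreement Rate: fraction of a worker's annotations differing from the majority vote.
    disagrees = merged["toxicity"] != merged["majority_label"]
    return disagrees.groupby(merged["worker_id"]).mean()

# adr = worker_adr(annotations)   # one value in [0, 1] per worker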

Figure A.6: Normalized distribution for each demographic category of the average disagreement each worker has from the whole population majority voting. The x-axis corresponds to the value of the ADR with the majority-vote, the y-axis corresponds to the proportion of the population within the demographic category for which the annotators have a specific ADR value. Only the 25 most frequent demographic categories are represented, the other ones show similar results or only one bar of full agreement between the annotators. All the demographic categories exhibit a similar distribution of ADR with the majority-vote.

B Algorithmic fairness evaluation method

B.1. Expected behaviours of the different models
In this section, we list the behaviours related to fairness that we expect to observe in the predictions of the different models.
Behaviours for user-level fairness. We expect the fairness of models (1) and (2) to be low. They do not distinguish between the annotations but are fed with majority-vote (MV) labels or the whole dataset (with a majority of MV annotations), and consequently are expected to return the MV label. For annotators who often agree with the MV, the performance of the models will be higher than for the others, which is unfair towards the annotators whose opinions belong to the minority. On the contrary, for models (3) and (4), which aim at distinguishing between users, we expect the algorithms to be better than models (1) and (2) at returning predictions corresponding to each annotation and not to the MV. Since model (3) only uses demographic information, and we saw in the previous chapter that this is probably not sufficient information to distinguish between the opinions of the annotators of a same category, some unfairness should remain. Especially for the annotators whose annotations often differ from the majority vote of their category, the performance will still be low. We expect model (4) to be the fairest for annotators who are known during training, since it should return their specific annotations accurately, while the performance is expected to be the same as that of models (1) and (2) for the unknown users since no distinction between them is made. The proportion of high-accuracy predictions for each user should increase from model (1) to model (4).
Behaviours for sample-level and annotation-level fairness. Sample-level fairness is studied by comparing the prediction performance of the model for each sample’s set of annotations. Annotation-level fairness is studied by comparing the prediction performance of the model for each annotation. If we identify on which type of annotation and/or type of sample the model performs well or poorly, it would also give indications on the causes of unfairness. The expected observations for the user-level, sample-level and annotation-level fairness are similar. For models (1) and (2), we expect the samples for which the agreement between the annotations is high to receive high performance, while the ones for which the agreement is low should receive low performance, since only one unique label is output by the models whereas several different annotations are expected. If we group the annotations according to their percentage of "popularity" among all the annotations of one sample (the fraction of the sample’s annotations that are equal to the studied annotation), we expect the two models to be accurate for the high-popularity annotations since they correspond to the opinions of the majority, and those are the labels the algorithms are trained on. For models (3) and (4), we expect the performance for each sample to increase, and the performance among the different bins of annotations to become more similar, since the models should be able to distinguish between annotators and predict the individual annotations (including the annotations of the minorities). For model (3), samples for which the annotators disagree the most should obtain higher performance; for model (4) this performance is expected to be even higher for samples on which known users gave annotations.
For samples annotated by unknown users, we do not expect changes compared to the first two models. The proportion of high-performance predictions for each sample should increase from model (1) to model (4).
Behaviours for discrimination-related fairness. A model is considered unfair from a discrimination point of view when its prediction performance differs for different categories of the population, i.e. for the different demographic categories (by age, gender and education level) in the Jigsaw dataset.


For models (1) and (2), we expect the performance to be similar for most of the demographic categories, since it was observed in the previous chapter that almost every demographic category has a similar distribution of disagreement with the majority vote. For models (3) and (4), the outputs are expected to be closer to the expected annotations. The demographic categories which have more data in the dataset are expected to obtain higher performance than the other ones, since the algorithms should learn better how to distinguish between the annotators. However, for model (3) there should be a maximum-accuracy limit since it is only able to return one label per demographic category, and the difference with the outputs of models (1) and (2) might not be large since the majority vote of each category might already represent accurately enough the annotations of each individual in the category. For model (4), we expect the performance for the most frequent demographic categories to be higher than for the other set-ups since the known users should receive more accurate predictions. If the training dataset is balanced over the demographic categories, each category should have similar performance. Otherwise, the most frequent categories are expected to have higher performance.

B.2. Characterization of algorithmic fairness
In this section, we present the experimental results and figures we used to investigate the different possible characterizations of the fairness of Machine Learning algorithms.

B.2.1. First batch of experiments
We first investigated general characterizations of algorithmic fairness. These first experiments enabled us to refine the characterizations in the next subsection.

Scope of the experiments
For the different fairness aspects, we divide the test set according to the different clustering criteria, and we plot the performance of the model on the different clusters for each performance metric usually used to evaluate Machine Learning algorithms. We only run the experiments on models (1), (2) and (3) because model (4) is not compatible with the Logistic Regression.
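As a minimal sketch of this clustered evaluation (assuming binary labels and a clustering criterion whose values lie in [0, 1], such as the annotators' ADR; function and variable names are illustrative and not the thesis implementation):

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def per_cluster_performance(y_true, y_pred, criterion_values, n_bins=10):
    # Bin the test annotations on one clustering criterion and evaluate each bin separately.
    df = pd.DataFrame({"true": y_true, "pred": y_pred, "crit": criterion_values})
    df["bin"] = pd.cut(df["crit"], bins=np.linspace(0.0, 1.0, n_bins + 1), include_lowest=True)
    rows = []
    for interval, group in df.groupby("bin", observed=True):
        rows.append({
            "cluster": str(interval),
            "n_annotations": len(group),
            "accuracy": accuracy_score(group["true"], group["pred"]),
            "f1": f1_score(group["true"], group["pred"], zero_division=0),
        })
    return pd.DataFrame(rows)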

Results
The results of the first batch of experiments are grouped into plots similar to Fig. B.1. Since there are many plots, and some are very similar to each other, we only describe the results instead of reporting all the plots. The heatmap plots represent the model performance computed taking the individual annotations as ground truth. Each column corresponds to an evaluation metric on the training or test set. Each heatmap is divided into two parts vertically: on the left the metrics are computed on the annotations whose annotators’ demographic information is known, and on the right on the annotations whose annotators’ information is unknown. The horizontal divisions represent the different clusters obtained with the different clustering criteria.

User-related fairness
In Fig. B.1 the clusters correspond to clusters of annotators depending on their average disagreement with the majority vote. The top clusters correspond to high disagreement and the bottom clusters to low disagreement, so the top clusters are expected to get lower performance than the lower clusters. As expected, the workers with a higher agreement with the majority vote receive higher performance than the workers with low agreement with the majority vote, for each model. The performance of model (3) for the low-quality workers increases, which was also expected. Similar observations are made when using the CrowdTruth Worker Quality Score. Therefore, we conclude that these visualizations are meaningful to interpret one possible cause of unfairness of the models. These clustering criteria are "human-interpretable".

Sample-related fairness
We proceed to the same experiments with clusters of samples depending on the ambiguity score. The top clusters correspond to low ambiguity and the bottom clusters to high ambiguity (low agreement over the labels of the sample), so the top clusters are expected to get higher performance than the lower clusters. This is the behaviour we observe on the F1-score and the accuracy plots, for example. However, the F1-score, precision, recall and AUC values exhibit very small variations between the clusters in comparison to the performance computed with the accuracy, and thus these metrics seem to be less adapted to studying fairness. Therefore, to study sample-level fairness, we decide to investigate only the accuracy (and the F1-score, only to verify that it is not significant mathematically). The difference in performance observed between the models distinguishing or not between users is very small. It could be that model (3) does not improve fairness regularly according to this clustering criterion, or simply that the difference in performance across clusters is low and nothing can be observed. The observations are similar on the UQS clustering criterion.

Discrimination-related fairness
Finally, we draw the heatmap of clusters of the demographic categories present in the dataset. In general there do not seem to be large differences in performance between the clusters, except for a few clusters which differ by 30% on the accuracy. This characterization shows that, as expected, the discrimination power of the models is low but still existent for a few categories.

Conclusion: choice of a small number of evaluation metrics
First of all, we note that the observations appear clearer on the accuracy evaluations than on the other metrics. For example, the accuracy value decreases along the bins from the high-agreement to the low-agreement workers, whereas for the other metrics the evolution of the values is not linear. Therefore, we choose to work on the accuracy in all the following chapters, even though the fairness metrics proposed afterwards could be implemented using any of the metrics above (it depends on the purpose of the person implementing the algorithm). Moreover, the plots for the F1-score, precision, recall, and AUC are similar. We do not study the Spearman correlation because it is not meaningful for evaluating binary labels (with a binary ground truth). We decide to retain the F1-score since it is a combination of the precision and recall and thus should give information on both, and it takes class imbalance into account more than the AUC does.

B.2.2. Second batch of experiments
Characterization on a sample level
Fig. B.2 presents the performance of the models on a sentence level.

Clustering on the user-level (ADR, WQS, demographic)
The characterization based on the average disagreement rate with the majority vote is plotted in Fig. B.3. As expected, low disagreement (bottom of the y-axis) leads to higher performance compared to the higher-disagreement clusters. Both the F1-score and accuracy heatmaps show differences between the models: model (3) tends to perform better on middle-disagreement data than model (1). Concerning the characterization based on clusters formed according to the Worker Quality Score, the workers with a high quality (top of the y-axis) get higher performance compared to the low-quality workers, because the quality of the workers is computed in relation to the agreement of their annotations with the other workers’ annotations. As expected, both accuracy and F1-score show differences between the models: model (3) performs better than model (1) for all types of workers except the lowest-quality clusters. The expected behaviours are also observed on the demographic-related plots (Fig. B.4). Except for a few of the demographic categories, most of the categories exhibit similar performance, as expected, so this evaluation method is valid to highlight discrimination-related aspects of unfairness.

Clustering on the annotation-level (AP)
As expected, the clusters of the most popular annotations have higher performance than the ones with lower popularity, for both the accuracy and the F1-score. The F1-score shows improvements on the less popular clusters with model (3); consequently it is a meaningful characterization. Analysing the total accuracy does not exhibit differences between models (1) and (3), contrary to the analysis of the accuracy on each class. On the negative class, which is dominant in the dataset, there are few differences between the two models, but on the positive class we observe that model (3) has a slightly higher accuracy on both popular and unpopular annotations. We are mainly interested in recognizing the positive class, therefore this clustering criterion seems to be a valid characterization.
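As a minimal sketch (using the same hypothetical DataFrame layout as in Appendix A), the annotation popularity (AP) used as a clustering criterion here could be computed as the fraction of a sample's annotations that carry the same label as the studied annotation:

def annotation_popularity(annotations):
    # For each annotation, the fraction of the sample's annotations with the same label.
    return annotations.groupby("rev_id")["toxicity"].transform(
        lambda labels: labels.map(labels.value_counts(normalize=True))
    )

# annotations["popularity"] = annotation_popularity(annotations)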

Clustering on the sample-level (UQS and AS)
We expected high-UQS data to receive higher performance than low-UQS data, because a high quality is a sign of high agreement among the annotations and therefore of labels that are easier to learn. However, the performance in each cluster does not follow a linear evolution, which means that the UQS might not represent disagreement in a way that we can interpret easily. Concerning the performance evaluation based on the ambiguity score, although we observe a general trend of higher performance for clearer sentences (sentences with high agreement on the label), the evolution between the clusters is not exactly linear for low-agreement samples. The observations are similar for both the AS and the UQS, which are computed similarly; consequently the representation might give valid insights into the potential unfairness, even if they slightly differ from the expected behaviours (it might be that certain annotations on certain samples are less present in the dataset and so harder to learn).

(a) Model 1

(b) Model 2

(c) Model 3

Figure B.1: User-level binning: average disagreement with the majority-vote bins.

(a) Global accuracy (b) Accuracy for the negative class.

(c) Accuracy for the positive class. (d) Global F1-score.

Figure B.2: Visualization based on the sentence-level performances. Comparison of models (1) (red) and (3) (green).

Figure B.3: Visualization of the accuracy and F1-score based on the ADR clustering criteria. Comparison of models (1) (left) and (3) (right).

Figure B.4: Visualization of the accuracy on class 0 (non-toxic) and class 1 (toxic) data, based on the demographic clustering criteria. Comparison of models 1 and 3.

B.3. Investigation of the possible fairness metrics
Fig. B.5 shows the results of the experiments on the variables of the discrimination-related fairness metric.

(a) Experimentations on the F1-score with model (1). (b) Experimentations on the F1-score with model (3).

(c) Experimentations on the accuracy with model (1). (d) Experimentations on the accuracy with model (3).

(e) Experimentations on the accuracy of the two classes with model (1). (f) Experimentations on the accuracy of the two classes with model (3).

(g) Experimentations on the accuracy of the positive class with model (1). (h) Experimentations on the accuracy of the positive class with model (3).

Figure B.5: Experimentations on the discrimination-based fairness computed with different evaluation metrics (F1-score, global accuracy, accuracy on the positive class, separated accuracy on the two classes). The x-axis of these plots represents the minimum number of annotations that a demographic category should have to be maintained in the test set.

B.4. Investigation of the significance of the performance of the clusters
Experimental set-up
We compute the significance test over the performance of model (1) since it is the model which should exhibit the most unfair behaviours. The metrics showing the most unfairness should be the ones chosen, since model (1) is a baseline not adapted to solve possible unfairness and we aim at highlighting its unfairness. The ADR-related fairness is based on the average performance value for each cluster, the average being computed over the performance of each user of the cluster. We consider that each user's performance is a sample of a population (the population being all the members of a cluster), and we pose the null hypothesis H0: "two populations from two different clusters are the same". We perform a significance test between the populations in order to verify or refute the null hypothesis. Unfairness corresponds in our case to different performance among the different clusters and consequently to rejection of the null hypothesis. Since the populations are not normally distributed (as shown in Fig. B.6), we choose to use the Kolmogorov-Smirnov test. It usually requires at least around 20 samples per population to compute significant results. We choose the common alpha value of α = 0.05.
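As a minimal sketch of this test (using scipy; the per-user performance lists in the usage example are made-up placeholders):

from scipy.stats import ks_2samp

ALPHA = 0.05

def compare_clusters(perf_cluster_a, perf_cluster_b):
    # Two-sample Kolmogorov-Smirnov test between the per-user performance values of two clusters.
    # Rejecting H0 (p-value < ALPHA) indicates that the clusters perform differently, i.e. unfairness.
    statistic, p_value = ks_2samp(perf_cluster_a, perf_cluster_b)
    return statistic, p_value, p_value < ALPHA

# Example with hypothetical per-user true positive rates of two ADR bins:
# stat, p, reject = compare_clusters([0.90, 0.85, 0.95, 0.80], [0.60, 0.55, 0.70, 0.65])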

Figure B.6: Histogram of the user’s positive class accuracy of the lowest disagreement bin for model (1). These accuracies are not normally distributed.

Results of the significance tests
ADR-related fairness. We divide the dataset into 10 clusters. Since the populations are not large enough for the cluster ranges between 0.5 and 1, we decide to group all of the users of these clusters together and redo the computations with these new clusters, using the true positive rate as the performance metric of each cluster. The results are reported in Tables B.1 (statistic value) and B.2 (p-value). We see that most of the p-values are under α, which leads us to reject the null hypothesis and to say that for most of the clusters the average accuracy difference is significant. We conclude that it makes sense to compute the dispersion between clusters in order to measure unfairness. Using 10 clusters with the true positive rate gives significant comparisons. We do the same computations on the negative class. All the returned p-values are higher than the α value. We conclude that we cannot use only the accuracy on this class to study fairness. We replicate the calculations with the total accuracy over the two classes. It leads to the same conclusion as for the accuracy on the positive class. The calculations on the F1-score include more values which do not reject the null hypothesis. Therefore we conclude that we can base our calculations on a combination of the true positive and true negative rates.
Ambiguity score-related fairness. We apply the same significance tests on the ambiguity score clustering criterion, using 5 clusters because one cluster out of 2 has fewer than 10 samples. Only a few of the cluster comparisons refute the null hypothesis using the true positive rate. This might be explained by the fact that computing only the accuracy on the positive class is not significant when there is only one positive annotation for the sample. For the accuracy over the two classes, all the results refute the null hypothesis.

B.5. Application of the algorithmic fairness metrics to the models
We apply the final fairness metrics to models (1) to (3) to make a final verification of whether the fairness-related behaviours expected for the different models are exhibited by the metrics. The results are reported in Fig. B.7 and B.8.

bins   0.1      0.2      0.3      0.4      0.5      0.6

0.1    0.       0.0358   0.122    0.2024   0.274    0.373
0.2    0.036    0.       0.097    0.167    0.249    0.348
0.3    0.122    0.097    0.       0.080    0.152    0.252
0.4    0.202    0.167    0.080    0.       0.094    0.193
0.5    0.274    0.249    0.152    0.094    0.       0.107
0.6    0.373    0.348    0.252    0.193    0.107    0.

Table B.1: Statistic value of the significance test between each ADR bin of configuration (1) on the positive class, on the reorganized bins. NA corresponds to tests where one of the two bins is empty (the computation is not performed). The bin values indicated on the axes are the upper value of the range (each bin having a size of 0.1 average disagreement).

bins   0.1         0.2         0.3         0.4         0.5         0.6

0.1    1.0         8.761e-1    1.578e-3    7.675e-6    3.297e-7    2.932e-3
0.2    8.761e-1    1.0         2.173e-2    4.107e-4    4.947e-6    6.862e-3
0.3    1.578e-3    2.173e-2    1.0         3.238e-1    2.081e-2    1.063e-1
0.4    7.675e-6    4.107e-4    3.238e-1    1.0         4.756e-1    3.831e-1
0.5    3.297e-7    4.947e-6    2.081e-2    4.756e-1    1.0         9.713e-1
0.6    2.932e-3    6.862e-3    1.063e-1    3.831e-1    9.713e-1    1.0

Table B.2: p-value of the significance test between each ADR bin of configuration (1) on the positive class, on the reorganized bins. The bin values indicated on the axes are the upper value of the range (each bin having a size of 0.1 average disagreement). The values in bold are those for which the null hypothesis is rejected (p-value below α).

Figure B.7: Comparison of the performance of the 4 models, based on the accuracies of the two separate classes and of both classes together. The models using the users’ information as additional features perform better than the other models for most algorithmic fairness aspects.

Figure B.8: Comparison of the performance of the 4 models, with variation of the disagreement rate in the test set (number of demographic categories = [64, 17, 13, 8, 5, 5, 3, 3, 2, 2]). For datasets containing disagreement, the models adapted to the different users perform better on each algorithmic fairness aspect than the models which do not use a users’ model.

C Experimentations on the Machine Learning and Deep Learning algorithms

C.1. Evaluation of the baselines
To verify the efficiency of the algorithms, we compare the results of our implementation on the Jigsaw dataset to the performance given in the literature. We divide the Jigsaw dataset into a training set (80% of the data) and a test set (20% of the data), and respectively train and test the algorithms on these sets. Similarly to the Jigsaw team [111], we compute for each of the models the standard two-class area under the receiver operating characteristic curve (AUC), and the Spearman rank correlation. The AUC is evaluated between the model's predicted probability that the comment is toxic and the majority-vote label (MV) of the comment. The Spearman rank correlation is computed between the model's predicted probability of the comment being toxic and the fraction of annotators who considered it toxic (called the ED label in the tables). Additionally, we evaluate the models' performance using traditional metrics (precision, recall, F1-score and accuracy) using the majority-vote labels as ground truth.
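As a minimal sketch of this evaluation (array names such as y_score, mv_label and ed_label are assumptions, not the thesis code):

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate_baseline(y_score, mv_label, ed_label, threshold=0.5):
    # y_score: predicted probability that the comment is toxic
    # mv_label: majority-vote label; ed_label: fraction of annotators who judged the comment toxic
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    auc = roc_auc_score(mv_label, y_score)
    spearman = spearmanr(y_score, ed_label).correlation
    precision, recall, f1, _ = precision_recall_fscore_support(mv_label, y_pred, average="binary")
    return {"AUC": auc, "Spearman": spearman, "precision": precision,
            "recall": recall, "F1-score": f1, "accuracy": accuracy_score(mv_label, y_pred)}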

C.1.1. Machine Learning
We present in Table C.1 the performance of the Logistic Regression classifier trained with the binary labels aggregated by majority voting.

                     word embed.           char. embed.
                     train      test       train      test
precision  total     0.9368     0.9198     0.8407     0.8420
           class0    0.9599     0.9583     0.9430     0.9435
           class1    0.9368     0.9198     0.8407     0.8420
recall     total     0.6112     0.5906     0.4396     0.4390
           class0    0.9956     0.9946     0.9911     0.9913
           class1    0.6112     0.5906     0.4396     0.4390
F1-score   total     0.7398     0.7193     0.5773     0.5771
           class0    0.9774     0.9761     0.9664     0.9668
           class1    0.7398     0.7193     0.5773     0.5771
accuracy             0.9585     0.9559     0.9378     0.9385

Table C.1: Logistic Regression classifier performance (metrics in rows) on the training and test sets, trained with two different feature vector types (word embedding and character embedding). The performance is given on the full data as well as computed at class level (class0: non-toxic, class1: toxic).

With the Jigsaw toxicity dataset, we draw the same conclusions as the Jigsaw team, which reported results on the aggressiveness dataset. The order of magnitude of the performance that we find is the same as that of the performance reported in the Jigsaw paper [111].

The performance is higher than that reported in [50], but they used a different dataset. Besides, we notice that using character-level features leads to higher performance than word-level features, probably because character-level features capture a finer level of detail.

C.1.2. Evaluation of other classifiers
We evaluated two additional models identified in the literature review. We trained the Multi-Layer Perceptron and the LSTM Recurrent Neural Network presented in [50] with the Word2Vec data embedding on three different types of training data: majority-vote labels (MV), ED-aggregated labels (ED) and the disaggregated annotations (DA). The ground truth used to evaluate the algorithms is the majority-vote labels (and the ED labels to compute the Spearman correlation). We find higher performance than in the paper [50], but the performance is computed on different datasets. As was shown for the traditional Machine Learning algorithms by the Jigsaw team, training the algorithms on the ED data increases the performance. On the contrary, training the neural networks on the disaggregated annotations decreases the performance, probably because it "confuses" the algorithm, which does not specifically learn the majority-vote label. Training the same neural network on a dataset with balanced classes decreases the performance, probably because it reduces the amount of available training data too much. Using three labels for training also does not help the algorithm to learn; the performance is lower than for the initial set-up, probably because there are not enough data to learn the three labels accurately. These two models would merit being investigated in future work because they present higher performance than the Logistic Regression.

C.2. Hypotheses to mitigate unfairness
C.2.1. Additional hypotheses for future work
Transformation of the input samples using the annotators' unique identifiers
As seen in the literature, research is done to learn matrices that transform the inputs of neural networks depending on an additional variable related to the input. In our case we formulate the following hypothesis: learning annotator-specific parameters to transform the inputs of traditional algorithms enables the algorithms to output the different opinions on one same sample. (H5)
Precisely, we propose the following modification of the input. The input is usually a sentence $s_i$ and possibly information about the person annotating the sentence, $a_j$. The sentence is a succession of words $s_i = [w_1, w_2, ..., w_N]$. Each word $w_k$ is represented with a continuous representation (a word embedding is chosen) so that $w_k \in \mathbb{R}^d$, $d$ being the dimension of the word representation. We represent the annotator $a_j$ as a matrix $A_j \in \mathbb{R}^{d \times d}$. As explained in [104], $A_j$ is too large in practice even if $d = 50$ only. Thus, we proceed to the same decomposition as they perform: $A_j = A_{j1} \times A_{j2} + diag(a^*)$, with $A_{j1} \in \mathbb{R}^{d \times r}$, $A_{j2} \in \mathbb{R}^{r \times d}$, $diag(a^*) \in \mathbb{R}^{d \times d}$ a diagonal matrix common to each annotator, and $r$ a chosen dimension with $r < d$. Finally, we transform each word $w_k$ for the specific annotator $a_j$ into a new word $t_{kj}$ with the following calculation: $t_{kj} = \tanh(A_j \times w_k) = \tanh((A_{j1} \times A_{j2} + diag(a^*)) \times w_k)$.
These new inputs are fed to the Deep Learning model presented in the previous section. The matrices $A_j$ are initialized with specific values or randomly, and they are trained during the training process of the model by backpropagation. We use regularization with the norms of $A_{j1}$, $A_{j2}$ and $diag(a^*)$ in the loss function. In [104], the hyperparameters are set to the following dimensions: $d = 100$, $r = 3$.
Running the experiments explained in Chapter 5, we quickly run into memory issues when training the model with over 3000 different users. Therefore we propose to group the users into bins and learn matrices specific to these groups instead of to the individual users. This model should perform better on the known users with many example judgements since it does not require specific information (which might not be enough) to return output adapted to each user. However, for unknown users it is not able to return user-adapted judgements and only makes use of the general diagonal matrix to transform the inputs. Moreover, it is more memory- and time-consuming to train compared to the previously presented algorithms.
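A minimal PyTorch sketch of this transformation is given below (the thesis experiments are not necessarily implemented in PyTorch; the class and parameter names are illustrative, and annotators are assumed to be already mapped to group indices as proposed above).

import torch
import torch.nn as nn

class AnnotatorWordTransform(nn.Module):
    # Low-rank annotator-specific transformation of word embeddings:
    # t_kj = tanh((A_j1 @ A_j2 + diag(a*)) @ w_k), with A_j1 in R^{d x r} and A_j2 in R^{r x d}.
    def __init__(self, n_annotator_groups, d=100, r=3):
        super().__init__()
        self.A1 = nn.Parameter(torch.randn(n_annotator_groups, d, r) * 0.01)
        self.A2 = nn.Parameter(torch.randn(n_annotator_groups, r, d) * 0.01)
        self.a_star = nn.Parameter(torch.zeros(d))  # diagonal part shared by all groups

    def forward(self, words, group_ids):
        # words: (batch, seq_len, d) word embeddings; group_ids: (batch,) annotator-group indices
        A = self.A1[group_ids] @ self.A2[group_ids] + torch.diag(self.a_star)
        return torch.tanh(words @ A.transpose(1, 2))

    def regularization(self):
        # Norm penalty on A_j1, A_j2 and diag(a*), to be added to the training loss.
        return self.A1.norm() + self.A2.norm() + self.a_star.norm()

# transformed = AnnotatorWordTransform(n_annotator_groups=50)(word_embeddings, group_ids)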

Combination of the augmented features and annotators' identifiers
Seeing the drawbacks of the previous two propositions, we propose to combine them so that the disadvantages of each are balanced by the advantages of the other model. Therefore, we make the following hypothesis:

combining the two above hypotheses, input augmentation with the users' demographic information (H2) and input transformation with the users' identifiers (H5), enables the algorithms to output the different opinions. (H6) This model should generalize better 1) to unseen users than when only using the users' input-transformation matrices, and 2) to users without demographic information than when only using this information. However, it remains memory- and time-consuming to train and evaluate.

Regularizing the loss function with the fairness measures
The parameters of the Machine Learning models are optimized by minimizing the value of a loss function. The loss function is parametrized by these parameters, and consists of a computation over the training data samples and labels. For example, the Logistic Regression measures the gap between the predicted labels (computed by the model) and the expected labels of the data samples in the training set, and the Support Vector Machine makes computations involving both the data samples and the labels. During the training process, we could possibly add the weighted sum of the fairness values computed by the model over the training data as a regularization term in the loss function. This would make it possible to take fairness into account when optimizing the parameters of the model, and not only the prediction accuracy. As it takes time to implement the new loss function in the different classifiers, this is out of the scope of the thesis project, but it could be experimented on in future work.
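As a minimal sketch of such a regularized objective (in PyTorch, with a differentiable stand-in for the fairness measure, namely the dispersion of the mean per-group loss; none of this is the thesis implementation):

import torch
import torch.nn.functional as F

def fairness_regularized_loss(logits, targets, group_ids, lam=0.1):
    # Cross-entropy plus lam times a fairness term: the standard deviation of the
    # average loss per group (e.g. per demographic category or per ADR bin).
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    group_means = [per_sample[group_ids == g].mean() for g in torch.unique(group_ids)]
    dispersion = torch.stack(group_means).std()
    return per_sample.mean() + lam * dispersion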

C.2.2. Summary of the hypotheses and experiments
We summarize in Table C.2 the hypotheses made with the purpose of making the Machine Learning models fairer.

Pipeline step Identifier Hypothesis

Data aggregation, H1: Using the disaggregated annotations instead of the aggregated labels to train the models makes them fairer, since they would otherwise not be able to learn the judgements of the different users.
Model architecture, H2: Adding as input to traditional classifiers the annotators' demographic information that the psychology literature defines as influencing variables for toxicity perception enables the models to output the judgements of each annotator of the samples. The annotators' demographic information can be encoded with the following methods:
Model architecture, H2.1: One-hot encoding of the three variables and concatenation of these 3 representations.
Model architecture, H2.2: Continuous representation (e.g. between [0;1]) of each variable according to the available variable ranges in the dataset, and concatenation of these 3 representations.
Model architecture, H5: Learning annotator-specific parameters to transform the inputs of traditional algorithms enables the models to output the different judgements on one same sample.
Model architecture, H6: Combining hypotheses (H2) (input-augmentation with users' demographic information) and (H5) (input-transformation with users' identifiers) enables the models to output the different judgements.
Model training process, H3: Tuning the hyperparameters of the models with grid search using the fairness measures as the performance metric to optimize, and choosing the number of data features and training data according to the learning and feature curves plotted using the fairness measures, increases the fairness of the models compared to models whose hyperparameters and training data are chosen using the classical performance metrics.
Dataset, H4: Balancing the training dataset over one of the clustering criteria used to study the multiple fairness aspects increases the fairness performance of the models trained with this resampled dataset.

Table C.2: Summary of the hypotheses to make the models fairer

There are many experiments to run considering that we selected several Machine Learning algorithms to investigate at the beginning of the chapter. We summarize in Table C.3 the experiments that we should proceed with.

Annotation aggregation: aggregated with majority-vote; disaggregated
Grid search metrics: accuracy; combined fairness
Dataset distribution clustering criteria: ADR; AS; AP; demographic
Dataset resampling: balanced; original
Machine Learning classifier: Logistic Regression; Support Vector Machine; Multi-Layer Perceptron; neural network
Model input type: sample features; sample features + one-hot demographic; sample features + continuous demographic; sample features + annotators' identifiers

Table C.3: Theoretical list of experiments.

C.3. Determination of the grid search performance metric
To get an idea of how the different available performance metrics (accuracy, F1-score, precision, recall, and dispersion and general performance for the four fairness aspects) would influence the choice of the hyperparameters during grid search, we plotted in Fig. C.1 the performance measured during the grid search over the hyperparameters of the Logistic Regression trained on disaggregated data with additional one-hot encoded demographic inputs. We noticed that accuracy, F1-score, precision and recall point to similar hyperparameters; the dispersion side of the fairness metrics shows similar performance trends, and the general performance side of the fairness metrics generally exhibits opposite trends. These similarities lead us to only study the accuracy as a baseline against hypothesis 3, and to only select the fairness measures based on the annotators' Average Disagreement Rate (ADR) with the majority-vote as a basis for H3's computations (it is also the main aspect of fairness). The dispersion and general performance results found with the annotators' ADR would lead to very different choices of hyperparameters. We aim at increasing the fairness of the models, but the models' accuracy should not be close to zero because they would then not have any use. Consequently, we combine both measures to perform the selection of hyperparameters. There are different ways to compute the mean of two values:

• Average mean.

• Harmonic mean.

• Average of the differences of the measured performance to the average performance over the grid. This would quantify how much improvement or decrease compared to the average each hyperparameter enables.

• The dispersion and general performance metrics do not represent the same notions although they are both in the range [0;1]. Consequently, it might not be meaningful to combine their values by using the mean. In order to get values independent of the notion represented by the metric, we propose to transform the values in each grid with the following computations. After computing the average mean and standard deviation of the grid measures, we calculate for each grid value the number of standard deviations in the absolute difference between the value and the mean ($\frac{|x_{value} - \mu|}{\sigma}$, with $x_{value}$ the grid value, $\mu$ the grid mean and $\sigma$ the grid standard deviation). The number of standard deviations is independent of the studied metric. We investigate the following variants to combine the transformed values (a sketch is given after this list):

– Average of the two values. Since the values are numbers without units, adding them together gives meaningful results.
– Normalization of the values by dividing them by the maximum (computed on absolute values) number of standard deviations in the grid, and average of the values. The normalization might make the dispersion and general performance values more comparable in case they generally have very different standard deviations.
– Weighted average of the normalized numbers of standard deviations. Several weights can be tested to give more importance to the dispersion or to the general performance aspect of the metric.
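As a minimal sketch of the last two variants (numpy, with illustrative function names; the default weights correspond to the 2/5 and 3/5 chosen further below, and the sign convention, i.e. whether a lower dispersion should be preferred, would need to be adapted to the exact definition of the metrics):

import numpy as np

def normalized_std_distance(grid_values):
    # Number of standard deviations between each grid value and the grid mean,
    # normalized by the grid maximum so that the two metrics become comparable.
    grid_values = np.asarray(grid_values, dtype=float)
    z = np.abs(grid_values - grid_values.mean()) / grid_values.std()
    return z / np.max(np.abs(z))

def combined_grid_score(dispersion_grid, general_perf_grid, w_dispersion=0.4, w_general=0.6):
    # Weighted average of the transformed dispersion and general performance grids.
    return (w_dispersion * normalized_std_distance(dispersion_grid)
            + w_general * normalized_std_distance(general_perf_grid))

# best_index = np.argmax(combined_grid_score(dispersion_values, general_perf_values))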

We show in Fig. C.2 the results obtained with these different combinations of the dispersion and general performance measures. We observe that the results of the first three propositions exhibit the same trends, but that the values computed with the harmonic mean exhibit larger ranges of variation than the others. The variants of the last proposition also show similar trends, except when the weights of the weighted average are too close to 0 or 1, because then only one of the two metrics influences the results. It is hard to define what the best combination is because it depends mainly on which aspect (dispersion or general performance) we prioritize. We decide to use the last proposed combination of values with a weight of 2/5 for the dispersion and 3/5 for the general performance, because further experiments showed that these weights are suitable to avoid choosing hyperparameters which return models with high fairness but very low accuracy. We additionally plotted separately the grid search results for the combination of the two fairness measures on the four different aspects of fairness. The hyperparameters selected with the four different aspects are all different. Consequently, as future work we could investigate the combination of several aspects of fairness to select the hyperparameters during cross-validation, in order to take all these aspects into account by order of preference.

Figure C.1: Grid search over the hyperparameters of the Logistic Regression, with the model evaluated with the usual performance metrics. The different metrics lead to the selection of very different hyperparameters. However, the influence of the hyperparameters on the performance of the models is shown later to be small.

Figure C.2: Combinations of the dispersion and general performance values based on the annotators' Average Disagreement Rate with the majority-vote, on the results of the grid search over the hyperparameters of the Logistic Regression. The different combinations lead to the selection of very different sets of hyperparameters.

C.4. Method to resample the dataset
The pseudo-code is given in Algorithm C.1. First, we define the total number of annotations $N_a$ that each fold should contain, by maximizing this number considering the available annotations and their associated clustering criteria values. Since some demographic categories have very few annotations, we decide to remove the least frequent categories in order to obtain a large enough dataset, and we keep the 39 largest demographic populations (populations for which the number of toxic and non-toxic annotations is above 500). We choose $N_a = 1029$ because we estimate that this should be enough to constitute large enough training and test sets, since 9 times this quantity of annotations is used in the training set. Then we resample the datasets according to the different hypotheses so as to have $N_a$ annotations per new dataset.

Algorithm C.1 Dataset resampling

function COMPUTE_DISTRIBUTION(dataset, clustering_criterion)
    if clustering_criterion = demographic then
        dataset ← remove less frequent demographic populations
        dataset_clusters ← cluster dataset's annotations according to the annotator's demographic
    else
        dataset_clusters ← cluster dataset's annotations on clustering_criterion into 5 bins
    end if
    dataset_clusters ← divide bins of dataset_clusters into 2 bins for the 2 classes
    return distribution ← compute the distribution of the bins in dataset_clusters
end function

function COMPUTE_FOLD_SIZE(dataset)
    min_bin_size_list ← empty list
    for clustering_criterion in list of clustering criteria do
        distribution ← COMPUTE_DISTRIBUTION(dataset, clustering_criterion)
        min_bin_size_list ← append minimum bin size of distribution among all the bins
    end for
    return minimum(min_bin_size_list) / 10
end function

function ORIGINAL_DISTRIBUTION_RESAMPLING(dataset, clustering_criterion, fold_size)
    distribution ← COMPUTE_DISTRIBUTION(dataset, clustering_criterion)
    total_available_data ← sum values of distribution
    list_data_fold ← empty list
    for cell in distribution do
        number_data_fold ← fold_size * size(cell) / total_available_data
        for fold_id in range(10) do
            data_fold ← get number_data_fold from cell
            cell ← remove data_fold from cell
            list_data_fold ← append data_fold
        end for
    end for
    return list_data_fold
end function

function BALANCED_DISTRIBUTION_RESAMPLING(dataset, clustering_criterion, fold_size)
    distribution ← COMPUTE_DISTRIBUTION(dataset, clustering_criterion)
    if clustering_criterion = demographic then
        number_data_fold ← fold_size / (number_demographic * 2)
    else
        number_data_fold ← fold_size / 10
    end if
    list_data_fold ← empty list
    for cell in distribution do
        for fold_id in range(10) do
            data_fold ← get number_data_fold from cell
            cell ← remove data_fold from cell
            list_data_fold ← append data_fold
        end for
    end for
    return list_data_fold
end function

C.5. Examples of correct and incorrect classifications
We give examples (annotation: 0 = non-toxic, 1 = toxic) of high-consensus data which are:
• well predicted by the model without user model (M1) but not by the model which uses the one-hot encoded users' model (M2):

comment annotation

added tags files help editors know whats going thanks   0
good call thanks   0

• well predicted by M2 but not M1:

comment annotation

note need member project order edit page whether edit minor per wikipedia bold et al 0 fairly certain henrik swedish thus worth exposing parents however 0 clarification zundark right checked wikipedia bugs page first bug code makes wikipedia work means line code may error small extra space analogous simplified way trying make something bold html forgetting put end see something like words bold 0 instead words bold like virus code somebody deliberately wrote order infect computer damage files go around jhk elected electoral jhk 0 template inuse check templates template inuse use top articles working avert speedies edit conflicts 0

• incorrectly predicted by M1 and M2:

comment annotation

really think much historic gnosticism 0

• well predicted by M1 and M2:

comment annotation

september 2010 please stop adding unreferenced controversial content biographical articles content nature could regarded 0 defamatory violation wikipedia policy continue blocked editing wikipedia clarification zundark right checked wikipedia bugs page first bug code makes wikipedia work means line code may error small extra space analogous simplified way trying make something bold html forgetting put end see something like words bold 0 instead words bold like virus code somebody deliberately wrote order infect computer damage files go around jhk

We give examples of low-consensus data which are:
• well predicted by the model without user model (M1) but not by the one-hot encoded (OH) model (M2):

comment annotation

hey please stop vandalizing stop 1 please stop continue vandalize wikipedia blocked editing shared ip address make unconstructive edits consider creating 1 account avoid irrelevant warnings warning honesty warning person like get 1 apology necessary probably dislike 1 knew better would know bullshit gets published people like try delete truth 1

• well predicted by M2 but not M1:

comment annotation

like said countless times today mulatto movement internet mulatto movement lol lack better phrase draws esteem counterproductive denigration black people nothing else put black quotes united states mulatto black necessarily mutually exclusive concepts go far say power self identifying mulattos hope always relation black people mulattos fail realize existence promotes race mixing encouraging 6 8 mandingo date marry impregnate 5 3 mary sue whites matter much cozy try delink 1 blacks innately reject perennially frown upon sure black mulatto men would take advantage mulatto movement sleep many white women possible white women accessible interestingly people helm mulatto movement likely turn quadroons octoroons fascinating hierarchical paradigm followed funny imagine hexadecaroon forming quadroon movement denigrated black people mulatto people well gotta love stuff lol certainly better world service called simple negro months ago sources exist material nollen dr wright 1 fyrom live toghether ethnical point view albanians turks slavs makedonji armanji national point view fyromian macedonians fyromians change flag another sun fyromian sun wich reprezent freedom liberty action makedonji armanji regaine official right use ancient symbol vergina sun sun use makedonji armanji everywhere planet must tell taht still dont know somebody use yhese sun moon another planet dont know universal symbol sun official northern greece ask karolos papulias makedon arman used epir thessaly kavala makedonia thrakia regions live 1 5 milion makedonji armanji speak greek makedon arman nodody planet use vergina sun makedonji armanji also called vlasi valaq cingari cobani aromani pomak maybe tommorow chinese 1 political interest misterjapa dont understand wy dont know cancelling every day ethimology makedonji armanji vergina sun balkans houndred books find names officccccccciiiiiiiaaaaaaallllliiiii add bibliography someone delete every day respect wiki principles tell mister adyjapan makedonji armânjiljii ancient macedonians romanized millenary distinguished language romanian greeks ethnical point view truth 2 thousands years commun greeks romanians nothing masive imigrantions explosion moscopole gramostea still continue mizerably propaganda stupid tendence assimilation still wikipedia delete native names sincerly tell seleting action ilegal project must stop accept native names symbol go anyway add native names every day trying explain neil armstrong landing moon fictitious santa claus appearing xmas 007 submarine cars dare pollute cyberspace outrageous lies suggest otherwise deadly diseased celebrity liars like armstrong aldrin nixon telling truth young impressionable 1 children looked apollo wikipedia see nothing deadly mistaken rehashing late 60 hoaxes presented scientific achievements dont think equally bad muslims rioting destroying property islam religion peace tolerance acts violence equally reprehensible 1 islam cartoons realize world watching act like muslims like undisciplined individuals C.5. Examples of correct and incorrect classifications 109

• well predicted by M1 and M2:

comment annotation

bother evidently feel examples disgusting attitude towards others nothing harassment offending individual 1 april 2008 please stop continue vandalize wikipedia blocked editing ip using similar another used time vandalise page 1 vandalislisilisilism gee thanks 1 oh sorry must watch straaaangle death omid djalili mummy 1999 1 game year wolf need stop changing blood moon year wolf studio album compilation album game said many interviews album 4 0 tracks game thanks dont tell cite source cited shit either

• incorrectly predicted by M1 and M2:

comment annotation

arg shit credible 0 ok please explain page newspaper called antelope valley press notable page newspaper called av news publications cover area los angeles county allowing stay pasted deleted content first attempt create page deleted thought might format issue copied entire av press page template replaced text av press text av news website link thinking would resolve issue use outrage page deleted almost posted worded better work av news asked see getting page others think people looking wiki aware media outlets 1 area av press sure references need bloody website perhaps could head ass attitude give example two lord god almighty bwilkins finds acceptable question wiki getting information playing favorites much av press donate wonderous exclusivity simply see reasonable explanation av news keeps getting deleted email sent admin blocked ip permanently without explanation claimed ip going wild trying avoid scrutiny nonsense first done anything could remotely considered going wild edint articles making badly needed edits even tried identify edits considered objectionable made attacking statement expect others believe second certainly trying aovid scrutiny signing everyone certainly knows ip using means lots people watching third claim ani uncivil fourth claim edit warring edit warring took place made new edits revert one two blind reverts changes made people clearly ignoring talk page discussion putting back wanted edit 1 comments explaining disagreed reverts went reverted would ones edit warring revert certainly use one revert without abusive edit warring fifth claim ani page log make edits well ok rules saying anyone sign sign edit blocked ip address use edit ip blocks account well given justification block therefore expect take block would nice apologize flagrant violation several wikipedia policies assume good faith editor clearly intentionally causing trouble thanks waste time 1 problem mate professional registered website true story ronald ryan almost completed purrum able contribute trash lies allegations accusations opinions views ryan case purrum infamous promotional book profits hanged man pages 221 222 confirms purrum compulsive manipulative liar fact book states discrepencies eyewitnesses evidence substancial wide ranging fourteen eyewitnesses testified different accounts saw eyewitnesses testified seeing ryan east hodson eyewitnesses testified seeing ryan west hodson eyewitnesses testified seeing smoke coming ryan rifle although established expert forensic ballistics 1 senior sergeant colin letherbarrow testified cartridges used smokeless variety eleven eyewitnesses testified saw ryan armed rifle fourteen eyewitnesses testified hearing one single shot hodson fell ground significantly four eyewitnesses testified actually seeing ryan fire shot contest evidence eyewitness contradictory purrum infamous book goes describe discrepencies prosecution case eg lack scientific forensic evidence vital missing pieces evidence downward trajectory angle fatal bullet prison officer paterson testifying fired single shot heard eyewitnesses thank good luck ongoing lies

D Ethical debate related to the computational automation of hate speech detection frameworks

This chapter is an adapted version of the final essay written for the course UD2010 Critical Reflection on Technology (part of the TU Delft Honours Program), revised to fit the thesis report.

Abstract Recently, Machine Learning and the study of social Big Data have been brought together by computer scientists to discover correlations and make inferences on new human-related data, such as the sentiment classification of comments posted on the Web. Among this research, we are interested here in algorithms performing automatic classification of Web sentences into classes like hateful or not hateful, or offensive or not offensive. Even though human workers already perform such tasks to filter posts on social media and comments on forums, the recent development of these computational systems has raised many criticisms. We discuss the arguments of the opponents to this technology in two steps: first we investigate the issues related to the justifiability of the technology's implementation and its possible limits, and then the potential negative consequences of applying such a technology. We argue that this technology is not harmful by itself, but the applications that people could find for it could be, and therefore it should be adopted only under certain conditions that are drafted in the last section.

D.1. Introduction
In Computer Science, two fields of research have recently been brought together: Machine Learning -the development of algorithms to automatically classify data samples-, and the study of social big data -the large amount of social media data produced every day by people's interactions with the Internet and now available to company workers. Combining these two areas, researchers aim to discover correlations in the data and to automatically classify previously unseen data, for instance the sentiment classification of comments posted on the Web or the recognition of objects in social media pictures. One of these technologies is the automated classification of Web sentences into different classes related to human judgment: hateful or not hateful, toxic or not toxic, or offensive or not offensive (these classes have similar definitions that researchers do not differentiate clearly) [94]. Several studies show the desirability of detecting hate speech on the Internet [105]. With the increasing use of the Internet and of websites where comments are enabled, the quantity of posted messages is too large for human moderators alone to filter them. Thus, it becomes more and more important to be able to detect hateful comments automatically.

The development of this technology is very recent, but already criticized by many. First, its opponents claim that sentence toxicity classification cannot be justified, mainly because the technology would not accurately represent people's opinions, since several persons may have different views on one sentence. Second, even if we consider that the technology classifies in an acceptable way, its negative consequences would outweigh the positive ones: it would be used to filter the Web's content, and thus the accessible information would be biased towards certain opinions considered as morally correct by a majority of people or by the creators of the algorithms. We argue conversely that this technology is not harmful by itself, but the applications that people could find for it could be, and therefore it should be adopted only under certain conditions. First, we investigate the possible objections to the technology by looking at how it is developed, after explaining how it works. Then, we reflect on its intended purpose. Finally, we propose rules to make the technology less controversial.

D.2. Description of the technology
The technology is a system which takes a sentence as input and returns a judgment over the toxicity, offensiveness or hatefulness of the sentence (to simplify, we mention toxicity in the rest of the essay, but we mean any of these categories). Two main versions of this technology are considered. The first one is currently the most accurate: the output is a binary label saying whether the sample is toxic or not, or a percentage estimating for what proportion of the population the sentence would be considered toxic. However, a newly emerging research proposition (version 2) is to adapt these systems to each specific reader of the sentence: instead of returning a label corresponding to the majority's opinion, the output judgment would be different for each reader.

In order to build current systems, it is first required to build a dataset consisting of example sentences and their toxicity judgment or judgment percentage. Then, a mathematical classification model is designed and implemented. This model is trained to solve its intended task by being fed the dataset multiple times and adapting its parameters to it. If the technology is adapted to each individual, the dataset contains example sentences and their corresponding judgments by multiple persons. Then, the model is trained by inputting the sentences, their judgments, the identifiers of the individuals, and possibly features describing them, such as their age, gender, education level or ethnicity. These features are investigated to help the classification because the psychology literature about offensiveness and hatefulness perception [32] identifies them as major variables in the judgments. It is hypothesized that the more properties characterizing the individuals are available, the more accurate the algorithms would be at outputting the individuals' judgments.
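To make the second, individually-adapted version of this pipeline concrete, the sketch below trains a reader-adapted classifier on disaggregated (sentence, demographics, annotation) tuples. The representation (tf-idf plus Logistic Regression) mirrors the setup of the workshop paper reproduced in Appendix E, but the column names and toy data are purely illustrative assumptions, not the exact configuration used in the thesis experiments.

```python
# Minimal sketch of a reader-adapted toxicity classifier trained on disaggregated
# annotations. Column names and toy data are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One row per (sentence, worker, annotation) tuple, *not* one row per sentence.
annotations = pd.DataFrame({
    "sentence":  ["you are great", "you are an idiot", "you are an idiot"],
    "age":       ["18-30", "18-30", "45-60"],
    "gender":    ["female", "male", "female"],
    "education": ["bachelor", "high-school", "master"],
    "is_toxic":  [0, 1, 0],   # each worker's own judgment, no majority voting
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "sentence"),              # sentence representation
    ("user", OneHotEncoder(handle_unknown="ignore"),
     ["age", "gender", "education"]),                     # worker demographics
])

model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(annotations.drop(columns="is_toxic"), annotations["is_toxic"])

# The same sentence can now receive different predictions for different readers.
print(model.predict(pd.DataFrame({
    "sentence": ["you are an idiot"] * 2,
    "age": ["18-30", "45-60"], "gender": ["male", "female"],
    "education": ["high-school", "master"],
})))
```

With such a model, the predicted judgment depends on the (assumed) demographic profile of the reader as well as on the sentence, which is exactly what version 2 of the technology requires.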

D.3. On the ethics of the system pipeline
In this section, we investigate which aspects of the implementation could be considered harmful by looking at each step of the technology pipeline, and we reflect on whether they are avoidable.

D.3.1. On the possible breach of privacy
There are several issues with the dataset, the first one being privacy. Most researchers create their dataset by scraping sentences from the Web: social media posts, chat histories, posts flagged by users as undesirable content... Besides, certain researchers link the sentences' authors to their social media profiles to collect additional information about them, like their gender, nationality, etc. Certain persons argue that storing their data is an invasion of privacy. However, these posts and profiles are publicly available on the Internet because the users previously decided to publish them. Therefore, it is debatable whether this is truly an infringement of privacy. When creating crowdsourcing tasks to collect toxicity annotations on the sentences, the researchers may ask the workers to give personal information such as their age, gender, ... This is sometimes considered an infringement of privacy, whereas the workers are not forced to give this information. Moreover, it depends on how the data are used. If they are made anonymous, the users are not identifiable, which is an additional reason not to consider it a breach of privacy. Researchers could also store the data only for the time spent exploiting them and then delete them, so that there is no permanent trace of the users. Finally, if the users wish to benefit from the individually-tuned algorithms, they would be made aware of the necessity of accessing their personal data and asked beforehand to share them; consequently, the data of unwilling users would remain unused. If not sufficient, an additional consent step could be required from the social media users (in addition to the usually ignored terms of use of the social media platforms) to make them more conscious of their public data being accessed.

D.3.2. On the justifiability of sentence hatefulness classification
The second issue is related to the actual content of the data and its potential discriminative power. As highlighted in the new Machine Learning conference FAT* 1 (Fairness, Accountability, and Transparency), datasets may contain implicit biases which turn the algorithms into discriminative systems. If the system is adapted to each individual but these individuals' opinions are not represented equally in the dataset, the algorithm may perform more accurately for certain people than others, and thus it could be considered fairer towards certain individuals than towards others. Is this inequality an issue? The system cannot be perfect and errors may have more or less severe consequences, but claiming that the system itself or its creators are discriminative sounds absurd as long as it is not intended. Besides, the more data is available, the more these issues can be avoided.

Moreover, the second version of the system aims at returning outputs tuned to each individual by generalizing over data describing only a group of individuals. That questions the conception of the human mind that researchers implicitly have when designing these systems: every individual can be assimilated to a group of representative persons in the dataset, and can be described by a few features. However, if researchers did not make this first assumption, they would not be able to create a system serving their goal, so this is unavoidable. The second assumption, that a group of features would enable learning individuals' preferences, is debatable: for example, more personal variables also have an influence on offensiveness judgments. Thus, only using the first group of properties can be considered discriminatory, since people are seen simply as members of specific demographic groups. A solution is to explain that the algorithms are made to classify toxicity for categories of users instead of individual users, or to evaluate the algorithms and display warnings about the possible inaccuracies. Moreover, if researchers manage to create a system with high accuracy, their method would be proved legitimate, but it implies that they first have to test their assumption.

Certain persons could object that there is no objective rule to define when the system is effective enough to be used: there is no criterion to choose an accuracy threshold to reach before using the system. It is also impossible to claim that the data contain every kind of opinion, and thus that the algorithm is evaluated against every possible configuration, because sentence toxicity perception depends on many parameters of the sentence context -not only who judges it, but also what was the aim of the person writing it, to whom it was addressed... However, at least the most frequent opinions are represented in the dataset. If the limits and characteristics of the dataset used are explained to the users of the algorithm, they would be aware of where errors are possible and of the limitations of the algorithm.

Finally, many claim that the decision process is not explicit, so the outputs are not verifiable and consequently should not be trusted and used. However, even for human thinking not every decision can be explained rationally; as Daniel Dennett says, "We also don't know how we take decisions". Thus this objection does not hold, and the rationale could be claimed here to be the statistics behind the system.

We presented the objections related to the practical implementation of the technology, and showed that, under certain conditions on its development and use, they do not hold.

D.4. On the justification of the technology applications
We first explain the advantages of building such a technology, and then tackle the opposing arguments related to its purpose by examining the views of the different stakeholders (Internet users, website companies, and the researchers and engineers who create and build the technologies).

D.4.1. On the usefulness of the technology
The advantages are directed towards the users. The first application is the filtering of human content on the Web. Hateful speech filtering is an important task according to the regulations established in certain countries; for example, in Germany hate speech was spread over the Internet to influence the elections' outcome 2 and regulations are enforced to have companies remove these posts 3. Posts on the Internet (on forums or social media such as Facebook) are already filtered by humans who read each post and emit a judgment according to criteria defined by the people responsible for the application. However, facing the increasing number of Internet users, human filterers are not able anymore to process all the posts, which makes automatized classification systems useful since they can be used to quickly identify the offensive posts, as Instagram started doing recently 4.

Moreover, a system can be argued to provide impartial judgments since it always has the same behavior, whereas human judgments are subjective even if the filtering criteria are made as explicit as possible -several human judges may have different views, and even one unique judge's perception may vary over time. Additionally, the content can be filtered based on different criteria, such as differentiating content adapted to children or to different Internet communities of readers, for example with different political views. Another similar use is to advise Internet users: when they write a post, the system can be used to warn them whether their text may be perceived as offensive, and by which part of the population.

Finally, the filtering could also serve as pre-processing of Web social data for the training of further computational systems. Certain systems, like the chatbot Tay [82], are trained by collecting Internet posts and feeding them to the system. If the content is not filtered, the systems may return undesirable outputs, such as Tay which became racist in less than one day.

1 https://fatconference.org/
2 http://www.spiegel.de/international/germany/trolls-in-germany-right-wing-extremists-stir-internet-hate-a-1166778.html
3 http://www.bbc.com/news/technology-42510868
4 https://instagram-press.com/blog/2018/05/01/protecting-our-community-from-bullying-comments-2/

Thus, the algorithms have many advantages, but one could argue that they cannot be totally accurate and might therefore not be useful. We object that they can be designed to make more of their errors on either false negatives or false positives. When tuned to minimize the false positives, the errors are not harmful: if too few posts are removed, the users can simply ignore the remaining toxic comments. When tuned to minimize the false negatives, human judges could be required to read and filter only the posts flagged as positive by the algorithm, which would considerably reduce their amount of work while helping ensure that non-toxic statements are not removed.
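As an illustration of this tuning, the sketch below shows how a single probabilistic classifier could be operated at two different decision thresholds: a conservative one for automatic removal and a permissive one for routing posts to human moderators. It is a minimal sketch under stated assumptions: `model` is any fitted classifier exposing `predict_proba` (for instance the pipeline sketched in Section D.2), `posts_df` is the data to score, and the threshold values are purely illustrative.

```python
def filter_posts(model, posts_df, threshold):
    """Flag posts whose predicted toxicity probability reaches `threshold`.

    `model` is assumed to be a fitted probabilistic classifier; `posts_df`
    holds the posts to score (same feature columns as at training time).
    """
    toxic_proba = model.predict_proba(posts_df)[:, 1]
    return toxic_proba >= threshold


def moderation_queues(model, posts_df, high=0.9, low=0.2):
    """Illustrative two-threshold moderation policy.

    - above `high`: remove automatically only what the model is very sure
      about, so few non-toxic posts are deleted (few false positives);
    - between `low` and `high`: send to human moderators, so few toxic posts
      slip through unseen (few false negatives), while moderators only have
      to review a pre-filtered subset instead of every post;
    - below `low`: publish directly.
    """
    flagged_low = filter_posts(model, posts_df, low)
    flagged_high = filter_posts(model, posts_df, high)
    return {
        "auto_remove": posts_df[flagged_high],
        "human_review": posts_df[flagged_low & ~flagged_high],
        "publish": posts_df[~flagged_low],
    }
```

The point of the sketch is only that the false-positive/false-negative trade-off is a deployment choice made on top of the trained model, not a property of the model itself.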

D.4.2. On the criticisms of the technology applications
Now that we have stated the foreseen uses of the technology, we examine the criticisms about the purpose of the algorithm. We consider a system which would work perfectly, in order not to conflate these criticisms with implementation issues. Although data filtering appears to be an advantage, its morality merits investigation. For websites' users, the reduced access to information and the selection of the accessible information are the two main issues, even though human filtering raises the same debate.

A first question is whether abusive content should be filtered at all. Abusive speech has been used constantly, for example in political campaigns, far before the Internet was invented, but it has not always been forbidden. So, why are people suddenly interested in filtering it on the Web? Possibly because, thanks to their anonymity on the Web, people are more carefree and abusive, which hinders the peaceful use of the Web as a means of information access and communication, and that would legitimate setting up barriers to prevent abuse. However, should freedom of expression be limited? For certain persons, netizens should be able to express themselves without using abusive language: although there is freedom of expression, people are free only up to a certain extent, and therefore should not be offensive. That is what supports the laws against racist and negationist statements in the public debate in several countries, and on the Web with the Council of Europe cybercrime convention. For others, freedom of expression is more important than not offending others, and therefore these laws are not justified and no comment should be deleted -these persons could simply not use the filter. From the point of view of the persons posting the data, removing their content also amounts to negating their freedom of expression, if we do not believe the laws are justified. Having the polemic statements available on the websites, with the possibility for users to hide them, could be a solution: the posts would remain there, but would simply not be read by the people who decide not to have them displayed.

Second, filtering the available content on the Internet could have unwanted consequences. It may decrease the amount of information and the range of opinions Internet users are exposed to, reducing their reflection and producing one unique way of thinking (the filter bubble phenomenon). People would become more close-minded. However, this highly depends on the exact filtering criterion: if only the comments using abusive language or hateful speech are filtered, then the posts of Internet users who are able to expose their ideas without diminishing others -even though their ideas go against the majority's ideas- would be kept. Consequently, the only issue arises when posts both give information and use hateful speech; only there is the use of the algorithm questionable, and this comes back to the question of free expression.

Additionally, information selection might lead to a potential hidden censorship: the algorithms' creators or their influencers (companies or possibly other stakeholders) could use specific filtering criteria which would direct users' opinions toward certain ideologies; but this is already the case with other media like television, radio, ... Moreover, if the algorithm's criteria are made public, there should not be such a danger.

We saw that the filter bubble and the censorship resulting from misusing the technology are the two main dangers opposed to its development and use, along with a possible breach of free expression. However, we countered these arguments by proposing to disclose the technology's implementation.

D.5. Possible regulations to make the use of the system more controlled
From the previous sections, we conclude that the technology itself would not be harmful to its users as long as some rules are set up to control its inaccuracies and applications. We now explain these regulations.

The doubts concerning the justifiability of sentence toxicity classification and the possible censorship deviation are solvable by making the algorithm pipeline and its performances transparent, so that people have the possibility to become aware of, and alert others about, the limits and misuse of the technology they use. Shifting the choice of the filtering criterion from the companies to organizations, by making the criterion universal, would put the algorithms out of reach of possible influencers.

Second, we identified issues concerning the morality of Web content filtering, especially if the system were used over the whole Internet by default -which would be a large scale compared to only implementing it on some websites. We suggest that, instead of automatically filtering posts, a mask should be proposed that users could choose to activate, so that they become involved in the decision process and decide consciously what information they are exposed to. For example, it could be used for children but not for their parents, if the parents consider that children should not learn abusive language. Besides, in order not to have users forget the filter and become less critical, a warning should be displayed to remind them about it. Laws currently require companies to remove hateful content within a few days, but certain persons are against this for free speech reasons. Thus the rules could possibly be relaxed, and the filtering could be made mandatory only for certain people characterized as "sensitive", such as children, and available as a choice for the others. However, the question of defining these sensitive people would be controversial.

Finally, we explained previously that the perception of toxicity is subjective. Having websites apply the algorithm automatically could tend toward a general judging system not adapted to each individual's preferences, which would thus ignore the minority's opinions when filtering the data. Thus we believe that if the system were not activated by the application providers but installed by each user, who would choose their own filtering parameters, then it could become inclusive of the different opinions.

D.6. Conclusion
In conclusion, although the automatic filtering of toxic content on the Web is a useful technology considering the increasing number of possibly abusive Internet users, it is contested by many because it does not seem justifiable to develop an individually-tuned system from general data, and because it could have several negative consequences. However, we showed that if they want to create such a technology, researchers have no other possibility than to test their algorithm design, while clearly making the users aware of its limitations. Moreover, we argued that the objections to the technology's adoption do not hold if certain regulations are set up to make users employ it consciously.

E Scientific publication

Following is the paper that was published for the CrowdBias workshop co-located with the sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP2018).

Characterising and Mitigating Aggregation-Bias in Crowdsourced Toxicity Annotations

Agathe Balayn1,2, Panagiotis Mavridis1, Alessandro Bozzon1, Benjamin Timmermans2, and Zoltán Szlávik2

1 TU Delft, Web Information Systems, [email protected], {p.mavridis, a.bozzon}@tudelft.nl
2 IBM Netherlands, Center for Advanced Studies, {b.timmermans, zoltan.szlavik}@nl.ibm.com

Abstract. Training machine learning (ML) models for natural language processing usually requires large amounts of data, often acquired through crowdsourcing. The way this data is collected and aggregated can have an effect on the outputs of the trained model, such as ignoring the labels which differ from the majority. In this paper we investigate how label aggregation can bias the ML results towards certain data samples, and propose a methodology to highlight and mitigate this bias. Although our work is applicable to any kind of label aggregation for data subject to multiple interpretations, we focus on the effects of the bias introduced by majority voting on toxicity prediction over sentences. Our preliminary results point out that we can mitigate the majority-bias and get increased prediction accuracy for the minority opinions if we take into account the different labels from annotators when training adapted models, rather than rely on the aggregated labels.

Keywords: dataset bias · Machine Learning fairness · crowdsourcing · annotation aggregation.

1 Introduction

When using crowdsourcing to gather training data for Machine Learning (ML) algorithms, several workers work with the same input samples and their annotations are aggregated into a unique one, like the majority vote (MV), to ensure its correctness (mainly the elimination of annotation mistakes and spammers). Although this data collection method is designed to get high-quality data, we expect that certain tasks involving subjectivity, such as image aesthetics prediction, hate speech detection, detection of violent video segments, or sentence sentiment analysis, cannot be tackled this way: samples should not be described with unique labels only, since they are interpreted differently by different persons.

The use of hate/toxic speech has increased with the growth of the Internet [5]. Predicting whether a sentence is toxic is highly subjective because of its multitude of possible interpretations. The sentence "I agree with that and the fact that the article needs cleaning. Some of these paragraphs [..] seem like they were written by 5 year olds." is judged negative or positive by different readers, but this diversity of perceptions is ignored when selecting one unique label, as done in recent research [4]. [3] studied the existence of identity term biases resulting from the imbalance of a toxicity dataset's content; we show with the example of MV-aggregation that crowdsourcing processing methods on the same dataset also create an algorithmic bias, here towards the majority opinion. When annotations differ but are all valid for certain annotators, aggregation loses information and leads to a decrease of accuracy and to unfairness in the ML results; thus we hypothesize that the bias can be mitigated by using disaggregated data. In this study, we first exhibit the presence of the majority-bias and its consequences, then we propose a methodology to expose and counter its algorithmic effects.

2 Majority-biased dataset and consequences

We show on the toxicity dataset [6] that, in usual crowdsourcing aggregations of annotations, certain worker contributions are ignored in favour of the majority, and that this affects the fairness of the ML algorithms' results. The dataset consists of 159686 Wikipedia page comments for which 10 annotations per sample are available. A large number of annotators (4301), for whom we have personal information, rate the phrases with 5 labels of toxicity ranging from -2 (very toxic) to 2 (very healthy), with 0 being neutral.

Subjectivities in the dataset. For each worker, we compute the average disagreement rate (ADR) with the ground truth (here, the percentage of annotations different from the MV), and plot the distribution over the dataset after removing the annotations of the lowest quality workers (spammers) (fig. 1). The quality score for each worker (WQS) is computed with the CrowdTruth framework [1] using binary labels ([-2;-1]: toxic, [0;2]: non-toxic), along with a unit quality score (UQS) to represent the clarity of each sentence. Without removing low-quality workers, the proportion of high agreement is high because most spammers constantly use one positive label and the dataset is unbalanced, with more samples with a non-toxic MV. The more possible spammers are removed, the more the disagreement increases, until the distributions stabilize. Only 0.09% of the workers always agree with the MV when 50 spammers are removed: MV-aggregation is not representative of most individuals but only of a sentence-level common opinion.

Algorithmic effect of the bias. We consider the task of predicting binary labels. When training traditional algorithms to predict the MV, the annotations of at most 0.09% of the annotators would be entirely correct: the majority-bias is not consistent with the workers' individual opinions. We evaluate traditional models (sec. 3) trained and tested on aggregated and disaggregated labels (table 1). In both cases accuracy is higher when measured on aggregated data, which shows that the classical treatment of the input data makes usual models' predictions biased towards one type of opinion, here the majority opinion, instead of representing each subjectivity.
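A minimal sketch of how the two quantities above can be computed from the raw annotations is given below. The column names and toy values are illustrative assumptions; the actual pipeline additionally relies on the CrowdTruth worker and unit quality scores, which are not reproduced here.

```python
# Minimal sketch: majority vote (MV) per sentence and each worker's average
# disagreement rate (ADR) with that MV. `annotations` has one row per
# (sentence, worker) pair; column names and values are illustrative.
import pandas as pd

annotations = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 2],
    "worker_id":   ["w1", "w2", "w3", "w1", "w2", "w3"],
    "toxicity":    [-2, 0, 1, -1, -2, -2],   # original scale: -2 (very toxic) .. 2 (very healthy)
})

# Binarise as described above: [-2, -1] -> toxic (1), [0, 2] -> non-toxic (0).
annotations["is_toxic"] = (annotations["toxicity"] < 0).astype(int)

# Majority vote per sentence.
mv = (annotations.groupby("sentence_id")["is_toxic"]
      .agg(lambda labels: int(labels.mean() >= 0.5))
      .rename("mv")
      .reset_index())

# ADR: for each worker, the fraction of their annotations that differ from the MV.
merged = annotations.merge(mv, on="sentence_id")
adr = (merged["is_toxic"] != merged["mv"]).groupby(merged["worker_id"]).mean()
print(adr)   # workers who always agree with the MV get ADR = 0
```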

Fig. 1. Normalized distribution comparison of the ADR with the MV with and without low quality worker filtering.

Table 1. Accuracy performance of the model on the ambiguity-balanced dataset.

                               agg. testing   disagg. testing
agg. training                      0.76            0.70
disagg. training                   0.77            0.71
disagg. training with user         0.77            0.70

3 Method to measure and mitigate the bias

We claim that a fairer algorithm should return different outputs for a same sample depending on its reader. Here, we propose measures of the majority-bias' algorithmic effect and a method to counter its unfairness.

Bias measure. Global metrics are usually used to optimize the algorithms' parameters and to evaluate them. However, they do not inform on the bias' effects since most samples' labels have a high agreement: the slight improvement when training on disaggregated data hints only lightly at label disaggregation (table 1, fig. 2). To identify the effects, we propose to measure sentence-level and worker-level accuracies on the annotations spread in the following bins: we divide the sentences along their ambiguity score (AS) (percentage of agreement in annotations) or UQS, and the workers along their ADR, WQS or demographics categories; we also plot histograms of the per-user and per-sentence errors to identify potential unfairness among all workers or sentences.

Bias mitigation: ML. To account for the full range of valid opinions, we propose to modify the inputs to the ML models. After removing low-quality workers, instead of the aggregated labels we feed them with the annotations augmented with the available worker demographics (age, gender, education, with a continuous or one-hot encoded representation) that the psychology literature [2] gives as the most influencing factors of offensiveness perception (along with ethnicity, not available here). Each (sentence, demographics, annotation) tuple is considered as one data sample. We employ the Logistic Regression (LR) classifier, and encode sentences with term frequency-inverse document frequency (tf-idf). The optimal hyperparameters for each set-up are chosen by performing a grid search.

Bias mitigation: dataset balancing. We define 4 data set-ups to help the algorithms learn the individual annotations. Sentence AS and MV-toxicity are computed, and we resample the dataset following the original distribution or balancing the distribution on these 2 criteria, to obtain a dataset whose majority-bias is decreased by equally representing samples with high and low agreement between workers. We also resample the annotations along the MV-toxicity and demographics categories (removing the least frequent ones) into one dataset following the distributions and a balanced one, to foster performance fairness between populations.

Results. Binned metrics like the user-level ADR-binned accuracy (fig. 2, with bins along the y-axis) make it possible to show that models are more suited to workers who agree with the MV (bottom of the y-axis), and highlight the benefit of using disaggregated data with adapted ML models. On the AS-balanced dataset (left part of the x-axis), the user representation increases accuracy for workers with a high disagreement with the majority, over using aggregated data or no user-model. The resampling choice also helps in understanding and mitigating the bias' effects: balancing on demographics neither clearly shows the performance gap between minority and high-ADR workers nor improves accuracy with the user representation, contrary to the AS dataset in which the MV-consensus' presence is reduced.
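The binned evaluation can be expressed compactly; below is a minimal sketch of the worker-level, ADR-binned accuracy under assumed variable names (per-annotation labels and predictions, plus a precomputed ADR per worker). The bin edges and toy values are illustrative, not the exact configuration used in the experiments.

```python
# Minimal sketch: accuracy computed separately for groups of workers with a
# similar disagreement rate with the MV, so that a model that only fits the
# majority opinion becomes visible. Variable names are illustrative:
# y_true/y_pred are per-annotation labels and predictions, worker_ids the
# annotator of each row, worker_adr a precomputed ADR per worker.
import numpy as np
import pandas as pd

def adr_binned_accuracy(y_true, y_pred, worker_ids, worker_adr, n_bins=5):
    df = pd.DataFrame({
        "correct": np.asarray(y_true) == np.asarray(y_pred),
        "worker": worker_ids,
    })
    df["adr"] = df["worker"].map(worker_adr)
    # Equal-width ADR bins; workers in the lowest bins mostly agree with the MV.
    df["adr_bin"] = pd.cut(df["adr"], bins=n_bins)
    return df.groupby("adr_bin")["correct"].mean()

# Example usage with toy values:
acc_per_bin = adr_binned_accuracy(
    y_true=[1, 0, 1, 0], y_pred=[1, 0, 0, 0],
    worker_ids=["w1", "w1", "w2", "w3"],
    worker_adr={"w1": 0.05, "w2": 0.45, "w3": 0.30},
)
print(acc_per_bin)   # a fair model keeps accuracy comparable across ADR bins
```

A model that is fair in the sense used here should show comparable accuracy across all ADR bins, rather than high accuracy only for the workers who agree with the majority.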

Fig. 2. Average and ADR-binned accuracies for two resamplings of the dataset.

4 Conclusion and Discussion

Disaggregating the annotations decreases the majority-bias' effects, with adapted ML models' inputs and dataset resamplings. Binning the evaluation metrics makes it possible to understand and verify the existence of these effects. We only reported results using the LR classifier, but we are now investigating adaptations of Deep Learning architectures, which are better suited to the large dataset (10 times more annotations than labels) and to the size of the ML inputs.

Acknowledgements

This research is supported by the Capture Bias project 3, part of the VWData Research Programme funded by the Startimpuls programme of the Dutch National Research Agenda, route "Value Creation through Responsible Access to and use of Big Data" (NWO 400.17.605/4174).

References

1. Aroyo, L., Welty, C.: Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013. ACM 2013 (2013)
2. Cowan, G., Hodge, C.: Judgments of hate speech: The effects of target group, publicness, and behavioral responses of the target. Journal of Applied Social Psychology 26(4), 355–374 (1996)
3. Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L.: Measuring and mitigating unintended bias in text classification (2017)
4. Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural language processing. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. pp. 1–10 (2017)
5. Tsesis, A.: Hate in cyberspace: Regulating hate speech on the internet. San Diego L. Rev. 38, 817 (2001)
6. Wulczyn, E., Thain, N., Dixon, L.: Ex machina: Personal attacks seen at scale. In: Proceedings of the 26th International Conference on World Wide Web. pp. 1391–1399. International World Wide Web Conferences Steering Committee (2017)

3 https://capturebias.eu/

Bibliography

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018.

[2] Cecilia Ovesdotter Alm. Subjective natural language problems: Motivations, applications, characterizations, and implications. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 107–112. Association for Computational Linguistics, 2011.

[3] Hector Martinez Alonso, Anders Johannsen, and Barbara Plank. Supersense tagging with inter-annotator disagreement. In Linguistic Annotation Workshop 2016, pages 43–48, 2016.

[4] Omar Alonso and Ricardo Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In European Conference on Information Retrieval, pages 153–164. Springer, 2011.

[5] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.

[6] David Archard. Insults, free speech and offensiveness. Journal of Applied Philosophy, 31(2):127–141, 2014.

[7] Lora Aroyo and Chris Welty. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013. ACM, 2013, 2013.

[8] Lora Aroyo and Chris Welty. The three sides of crowdtruth. Journal of Human Computation, 1:31–34, 2014.

[9] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee, 2017.

[10] Beata Beigman Klebanov and Eyal Beigman. From annotator agreement to noise models. Computational Linguistics, 35(4):495–503, 2009.

[11] Simone Bianco, Luigi Celona, Paolo Napoletano, and Raimondo Schettini. Predicting image aesthetics with deep learning. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 117–125. Springer, 2016.

[12] Reuben Binns. Fairness in machine learning: Lessons from political philosophy. In Conference on Fairness, Accountability and Transparency, pages 149–159, 2018.

[13] Reuben Binns, Michael Veale, Max Van Kleek, and Nigel Shadbolt. Like trainer, like bot? inheritance of bias in algorithmic content moderation. In International Conference on Social Informatics, pages 405–415. Springer, 2017.

[14] Robert J Boeckmann and Jeffrey Liew. Hate speech: Asian american students’ justice judgments and psychological responses. Journal of Social Issues, 58(2):363–381, 2002.

[15] Engin Bozdag and Jeroen van den Hoven. Breaking the filter bubble: democracy and design. Ethics and Information Technology, 17(4):249–265, 2015.

[16] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci. Choosing the right crowd: expert finding in social networks. In Proceedings of the 16th International Conference on Extending Database Technology, pages 637–648. ACM, 2013.

[17] Anthony Brew, Derek Greene, and Pádraig Cunningham. Using crowdsourcing and active learning to track sentiment in online media. In ECAI, pages 145–150, 2010.


[18] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.

[19] Peter Burnap and Matthew Leighton Williams. Hate speech, machine classification and statistical modelling of information flows on twitter: Interpretation and communication for policy decision making. 2014.

[20] Eshwar Chandrasekharan, Mattia Samory, Anirudh Srinivasan, and Eric Gilbert. The bag of communities: Identifying abusive behavior online with preexisting internet data. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 3175–3187. ACM, 2017.

[21] Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and Athena Vakali. Mean birds: Detecting aggression and bullying on twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM, 2017.

[22] Nitesh V Chawla. Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, pages 875–886. Springer, 2009.

[23] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.

[24] Hao Chen, Susan Mckeever, and Sarah Jane Delany. Harnessing the power of text mining for the detection of abusive content in social media. In Advances in Computational Intelligence Systems, pages 187–205. Springer, 2017.

[25] Hao Chen, Susan Mckeever, and Sarah Jane Delany. Presenting a labelled dataset for real-time detection of abusive user posts. In Proceedings of the International Conference on Web Intelligence, pages 884–890. ACM, 2017.

[26] Kuan-Ta Chen, Chen-Chi Wu, Yu-Chun Chang, and Chin-Laung Lei. A crowdsourceable qoe evaluation framework for multimedia content. In Proceedings of the 17th ACM international conference on Multimedia, pages 491–500. ACM, 2009.

[27] Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 71–80. IEEE, 2012.

[28] Alexandra Chouldechova and Max G’Sell. Fairer and more accurate, but for whom? arXiv preprint arXiv:1707.00046, 2017.

[29] Theodora Chu, Kylie Jue, and Max Wang. Comment abuse classification with deep learning.

[30] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.

[31] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.

[32] Gloria Cowan and Cyndi Hodge. Judgments of hate speech: The effects of target group, publicness, and behavioral responses of the target. Journal of Applied Social Psychology, 26(4):355–374, 1996.

[33] Gloria Cowan and Désirée Khatchadourian. Empathy, ways of knowing, and interdependence as mediators of gender differences in attitudes toward hate speech and freedom of speech. Psychology of Women Quarterly, 27(4):300–308, 2003.

[34] Gloria Cowan and Jon Mettrick. The effects of target variables and setting on perceptions of hate speech. Journal of Applied Social Psychology, 32(2):277–299, 2002.

[35] George B Cunningham, Mauricio Ferreira, and Janet S Fink. Reactions to prejudicial statements: The influence of statement content and characteristics of the commenter. Group Dynamics: Theory, Research, and Practice, 13(1):59, 2009.

[36] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009, 2017.

[37] Karen R Dickson. All prejudices are not created equal: Different responses to subtle versus blatant expressions of prejudice. 2012.

[38] Karthik Dinakar, Roi Reichart, and Henry Lieberman. Modeling the detection of textual cyberbullying. The Social Mobile Web, 11(02), 2011.

[39] Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, pages 29–30. ACM, 2015.

[40] Daniel M Downs and Gloria Cowan. Predicting the importance of freedom of speech and the perceived harm of hate speech. Journal of applied social psychology, 42(6):1353–1375, 2012.

[41] Anca Dumitrache, Oana Inel, Benjamin Timmermans, and Lora Aroyo. Crowdsourcing ambiguity-aware ground truth.

[42] Anca Dumitrache, Lora Aroyo, and Chris Welty. Achieving expert-level annotation quality with crowdtruth. In Proc. of BDM2I Workshop, ISWC, 2015.

[43] Anca Dumitrache, Lora Aroyo, and Chris Welty. Crowdtruth measures for language ambiguity. In Proc. of LD4IE Workshop, ISWC, 2015.

[44] Anca Dumitrache, Oana Inel, Lora Aroyo, and Chris Welty. Metrics for capturing ambiguity in crowdsourcing by interlinking workers, annotations and input data. 2018.

[45] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Mark DM Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, pages 119–133, 2018.

[46] Ziv Epstein, Blakeley H Payne, Judy Hanwen Shen, Abhimanyu Dubey, Bjarke Felbo, Matthew Groh, Nick Obradovich, Manuel Cebrian, and Iyad Rahwan. Closing the ai knowledge gap. arXiv preprint arXiv:1803.07233, 2018.

[47] Óscar Figuerola Salas, Velibor Adzic, Akash Shah, and Hari Kalva. Assessing internet video quality using crowdsourcing. In Proceedings of the 2nd ACM international workshop on Crowdsourcing for Multimedia, pages 23–28. ACM, 2013.

[48] Michael J Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. Crowddb: answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 61–72. ACM, 2011.

[49] Björn Gambäck and Utpal Kumar Sikdar. Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pages 85–90, 2017.

[50] Lei Gao and Ruihong Huang. Detecting online hate speech using context aware models. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 260–266, 2017.

[51] Lei Gao, Alexis Kuppersmith, and Ruihong Huang. Recognizing explicit and implicit hate speech using a weakly supervised two-path bootstrapping approach. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 774–782, 2017.

[52] Eva García-Martín and Niklas Lavesson. Is it ethical to avoid error analysis? In 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2017). arXiv, 2017.

[53] Deepti Ghadiyaram and Alan C Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, 25(1):372–387, 2016.

[54] Joshua Guberman and Libby Hemphill. Challenges in modifying existing scales for detecting harassment in individual tweets. In Proceedings of the 50th Hawaii International Conference on System Sciences, 2017.

[55] Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.

[56] Jeffrey Heer and Michael Bostock. Crowdsourcing graphical perception: using mechanical turk to assess visualization design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 203–212. ACM, 2010.

[57] PJ Henry, Sarah E Butler, and Mark J Brandt. The influence of target group status on the perception of the offensiveness of group-based slurs. Journal of Experimental Social Psychology, 53:185–192, 2014.

[58] Tobias Hoßfeld, Raimund Schatz, and Sebastian Egger. Sos: The mos is not enough! In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pages 131–136. IEEE, 2011.

[59] Tobias Hoßfeld, Michael Seufert, Matthias Hirth, Thomas Zinner, Phuoc Tran-Gia, and Raimund Schatz. Quantification of youtube qoe via crowdsourcing. In Multimedia (ISM), 2011 IEEE International Symposium on, pages 494–499. IEEE, 2011.

[60] Tobias Hoßfeld, Matthias Hirth, Pavel Korshunov, Philippe Hanhart, Bruno Gardlo, Christian Keimel, and Christian Timmerer. Survey of web-based crowdsourcing frameworks for subjective quality assessment. In Multimedia Signal Processing (MMSP), 2014 IEEE 16th International Workshop on, pages 1–6. IEEE, 2014.

[61] Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.

[62] Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. Data quality from crowdsourcing: a study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 workshop on active learning for natural language processing, pages 27–35. Association for Computational Linguistics, 2009.

[63] Panagiotis G Ipeirotis and Evgeniy Gabrilovich. Quizz: targeted crowdsourcing with a billion (potential) users. In Proceedings of the 23rd international conference on World wide web, pages 143–154. ACM, 2014.

[64] Emily Jamison and Iryna Gurevych. Noise or additional information? leveraging crowdsource annotation item agreement for natural language tasks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 291–297, 2015.

[65] Chaitanya K Joshi, Fei Mi, and Boi Faltings. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503, 2017.

[66] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2012.

[67] Christian Keimel, Julian Habigt, Clemens Horch, and Klaus Diepold. Qualitycrowd—a framework for crowd-based quality evaluation. In Picture Coding Symposium (PCS), 2012, pages 245–248. IEEE, 2012.

[68] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 453–456. ACM, 2008.

[69] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[70] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 994–1003, 2016.

[71] Revathy Amadera Lingam and Norizah Aripin. Comments on fire! classifying flaming comments on youtube videos in malaysia. Jurnal Komunikasi, Malaysian Journal of Communication, 33(4), 2017.

[72] Bingquan Liu, Zhen Xu, Chengjie Sun, Baoxun Wang, Xiaolong Wang, Derek F Wong, and Min Zhang. Content-oriented user modeling for personalized response ranking in chatbots. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.

[73] Amrita Mangaonkar, Allenoush Hayrapetian, and Rajeev Raje. Collaborative detection of cyberbullying behavior in twitter data. In Electro/Information Technology (EIT), 2015 IEEE International Conference on, pages 611–616. IEEE, 2015.

[74] Panagiotis Mavridis, David Gross-Amblard, and Zoltán Miklós. Using hierarchical skills for optimized task assignment in knowledge-intensive crowdsourcing. In Proceedings of the 25th International Conference on World Wide Web, pages 843–853. International World Wide Web Conferences Steering Committee, 2016.

[75] Maw Maw and Vimala Balakrishnan. An analysis on the hateful contents detection techniques on social media. 2016.

[76] Andrés Montoyo, Patricio Martínez-Barco, and Alexandra Balahur. Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments, 2012.

[77] Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee, 2016.

[78] Ika Nurfarida and Laudetta Dianne Fitri. Mapping and defining hate speech in instagram’s comments: A study of language use in social media.

[79] Conor J O'Dea, Stuart S Miller, Emma B Andres, Madelyn H Ray, Derrick F Till, and Donald A Saucier. Out of bounds: factors affecting the perceived offensiveness of racial slurs. Language Sciences, 52:155–164, 2015.

[80] Cathy O’Neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016.

[81] John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. Deep learning for user comment moderation. ACL 2017, page 25, 2017.

[82] Rob Price. Microsoft is deleting its ai chatbot’s incredibly racist tweets. Business Insider, 2016.

[83] Benjamin Rainer, Markus Waltl, and Christian Timmerer. A web based subjective evaluation platform. In Quality of Multimedia Experience (QoMEX), 2013 Fifth International Workshop on, pages 24–25. IEEE, 2013.

[84] Judith Alice Redi, Tobias Hoßfeld, Pavel Korshunov, Filippo Mazza, Isabel Povoa, and Christian Keimel. Crowdsourcing-based multimedia subjective evaluations: a case study on image recognizability and aesthetic appeal. In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia, pages 29–34. ACM, 2013.

[85] Dennis Reidsma. Annotations and subjective machines of annotators, embodied agents, users, and other humans. 2008.

[86] Dennis Reidsma et al. Exploiting 'subjective' annotations. In Proceedings of the Workshop on Human Judgements in Computational Linguistics, pages 8–16. Association for Computational Linguistics, 2008.

[87] Kelly Reynolds, April Kontostathis, and Lynne Edwards. Using machine learning to detect cyberbullying. In Machine learning and applications and workshops (ICMLA), 2011 10th International Conference on, volume 2, pages 241–244. IEEE, 2011.

[88] Flávio Ribeiro, Dinei Florencio, and Vítor Nascimento. Crowdsourcing subjective image quality evaluation. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 3097–3100. IEEE, 2011.

[89] Sergio Rojas-Galeano. On obstructing obscenity obfuscation. ACM Transactions on the Web (TWEB), 11(2):12, 2017.

[90] Andrea Romei and Salvatore Ruggieri. A multidisciplinary survey on discrimination analysis. The Knowledge Engineering Review, 29(5):582–638, 2014.

[91] Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. Corpus annotation through crowdsourcing: Towards best practice guidelines. In LREC, pages 859–866, 2014.

[92] Sasha Sax. Flame wars: Automatic insult detection, 2016.

[93] Markus Schedl, Mats Sjöberg, Ionuţ Mironică, Bogdan Ionescu, Vu Lam Quang, Yu-Gang Jiang, and Claire-Hélène Demarty. Vsd2014: a dataset for violent scenes detection in hollywood movies and web videos. In Content-Based Multimedia Indexing (CBMI), 2015 13th International Workshop on, pages 1–6. IEEE, 2015.

[94] Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics, Valencia, Spain, pages 1–10, 2017.

[95] Leandro Araújo Silva, Mainack Mondal, Denzil Correa, Fabrício Benevenuto, and Ingmar Weber. Analyzing the targets of hate in online social media. In ICWSM, pages 687–690, 2016.

[96] Michael Skirpan and Micha Gorelick. The authority of "fair" in machine learning. arXiv preprint arXiv:1706.09976, 2017.

[97] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008.

[98] Sara Sood, Judd Antin, and Elizabeth Churchill. Profanity use in online communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1481–1490. ACM, 2012.

[99] Sara Owsley Sood, Judd Antin, and Elizabeth F Churchill. Using crowdsourcing to improve profanity detection. In AAAI Spring Symposium: Wisdom of the Crowd, volume 12, page 06, 2012.

[100] Sara Owsley Sood, Elizabeth F Churchill, and Judd Antin. Automatic identification of personal insults on social news sites. Journal of the Association for Information Science and Technology, 63(2):270–285, 2012.

[101] Jacquelin A Speck, Erik M Schmidt, Brandon G Morton, and Youngmoo E Kim. A comparative study of collaborative vs. traditional musical mood annotation. In ISMIR, pages 549–554, 2011.

[102] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.

[103] Pawel Swietojanski and Steve Renals. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 171–176. IEEE, 2014.

[104] Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. User modeling with neural network for review rating prediction. In IJCAI, pages 1340–1346, 2015.

[105] Alexander Tsesis. Hate in cyberspace: Regulating hate speech on the internet. San Diego L. Rev., 38:817, 2001.

[106] Michael Veale and Reuben Binns. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2):2053951717743530, 2017.

[107] William Warner and Julia Hirschberg. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, pages 19–26. Association for Computational Linguistics, 2012.

[108] Zeerak Waseem and Dirk Hovy. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In SRW@HLT-NAACL, pages 88–93, 2016.

[109] Amanda Williams, Clio Oliver, Katherine Aumer, and Chanel Meyers. Racial microaggressions and perceptions of internet memes. Computers in Human Behavior, 63:424–432, 2016.

[110] Ou Wu, Yunfei Chen, Bing Li, and Weiming Hu. Evaluating the visual quality of web pages using a computational aesthetic approach. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 337–346. ACM, 2011.

[111] Ellery Wulczyn, Nithum Thain, and Lucas Dixon. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, pages 1391–1399. International World Wide Web Conferences Steering Committee, 2017.

[112] Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore. Learning from bullying traces in social media. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies, pages 656–666. Association for Computational Linguistics, 2012.

[113] Zhi Xu and Sencun Zhu. Filtering offensive language in online communities using grammatical relations. In Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, pages 1–10, 2010.

[114] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.

[115] Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, pages 228–238, 2017.

[116] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333, 2013.

[117] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243, 2018.

[118] Zhe Zhang and Daniel B Neill. Identifying significant predictive bias in classifiers. arXiv preprint arXiv:1611.08292, 2016.

[119] Ziqi Zhang, David Robinson, and Jonathan Tepper. Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In Lecture Notes in Computer Science. Springer Verlag, 2018.

[120] Indrė Žliobaitė. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148, 2015.

[121] Indrė Žliobaitė. Measuring discrimination in algorithmic decision making. Data Mining and Knowledge Discovery, 31(4):1060–1089, 2017.