Characterizing Abhorrent, Misinformative, and Mistargeted Content on YouTube
arXiv:2105.09819v1 [cs.CY] 20 May 2021
Kostantinos Papadamou

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of Cyprus University of Technology

Department of Electrical Engineering, Computer Engineering and Informatics
Cyprus University of Technology
May 16, 2021

Abstract

YouTube has revolutionized the way people discover and consume video content. Although YouTube facilitates easy access to hundreds of well-produced educational, entertaining, and trustworthy news videos, abhorrent, misinformative and mistargeted content is also common. The platform is plagued by various types of inappropriate content including: 1) disturbing videos targeting young children; 2) hateful and misogynistic content; and 3) pseudoscientific and conspiratorial content. While YouTube’s recommendation algorithm plays a vital role in increasing user engagement and YouTube’s monetization, its role in unwittingly promoting problematic content is not entirely understood. In this thesis, we shed some light on the degree of abhorrent, misinformative, and mistargeted content on YouTube and the role of the recommendation algorithm in the discovery and dissemination of such content.

Following a data-driven quantitative approach, we analyze thousands of videos posted on YouTube. Specifically, we devise various methodologies to detect problematic content, and we use them to simulate the behavior of users casually browsing YouTube to shed light on: 1) the risks of YouTube media consumption by young children; 2) the role of YouTube’s recommendation algorithm in the dissemination of hateful and misogynistic content, by focusing on the Involuntary Celibates (Incels) community; and 3) user exposure to pseudoscientific misinformation on various parts of the platform and how this exposure changes based on the user’s watch history.
In a nutshell, our analysis reveals that young children are likely to encounter disturbing content when they randomly browse the platform starting from benign videos relevant to their interests, and that YouTube’s currently deployed countermeasures are ineffective at detecting such content in a timely manner. By analyzing the Incel community on YouTube, we find that not only is Incel activity increasing over time, but platforms may also play an active role in steering users towards extreme content. Finally, when studying pseudoscientific misinformation, we find, among other things, that YouTube suggests more pseudoscientific content for traditional pseudoscientific topics (e.g., flat earth) than for emerging ones (e.g., COVID-19), and that these recommendations are more common on the search results page than on a user’s homepage or in the video recommendations (up-next) section.

Acknowledgments

First and foremost, I am grateful to my advisor, Michael Sirivianos, for his continuous support and valuable feedback throughout my PhD journey. His support and guidance were instrumental in turning me into an independent and competent researcher. He was there to guide me when the research seemed fuzzy and disheartening. More importantly, he has shown me how to analyze an important and complex problem and divide it into small, manageable problems that can be addressed in a more practical way.

Second, I would like to thank Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and Savvas Zannettou. Their expertise and valuable feedback complemented that of my advisor, helping me further expand my research and writing skills. I also want to thank various colleagues from the Cyprus University of Technology, Telefonica Research, and Max Planck Institute. Their help and feedback were vital to undertaking the studies presented in this thesis.
Furthermore, I would like to thank my family for their support, patience, and encouragement, which ensured that I was mentally strong enough to overcome all the obstacles I faced during my PhD journey. Finally, I owe gratitude to the European Commission’s Horizon 2020 program and the Cyprus University of Technology for funding my research.

In Memoriam of Prof. Vassos Soteriou

I would like to pay tribute to the memory of Professor Vassos Soteriou. I am grateful for having had him as a professor during my bachelor’s and master’s studies, and I am thankful for the valuable feedback he gave me as one of my PhD proposal examiners. His pressing questions during my proposal examination helped to improve this work and to make me a better presenter and researcher.

Contents

1 Introduction
  1.1 Contributions
  1.2 Research Papers
  1.3 Thesis Organization
2 Background
  2.1 YouTube
    2.1.1 YouTube
    2.1.2 YouTube Kids
  2.2 Reddit
    2.2.1 Remarks
3 Literature Review
  3.1 Inappropriate Content for Children
    3.1.1 Characterizing Inappropriate Content for Children
    3.1.2 Detection and Containment of Inappropriate Content for Children
    3.1.3 Inappropriate Content for Children - Remarks
  3.2 Misogyny and Other Harmful Activity
    3.2.1 Misogyny and Other Harmful Activity on YouTube
    3.2.2 Misogyny and Hateful Activity on the Rest of the Web
    3.2.3 Misogyny and Other Harmful Activity - Remarks
  3.3 Pseudoscientific Misinformation and Conspiracy Theories
    3.3.1 Pseudoscientific Misinformation and Conspiracy Theories on YouTube
    3.3.2 Pseudoscientific Misinformation and Conspiracy Theories on the Web
    3.3.3 Pseudoscientific Misinformation and Conspiracy Theories - Remarks
  3.4 Recommendation Algorithms and Audits
    3.4.1 Understanding User Personalization on the Web
    3.4.2 Auditing YouTube’s Recommendation Algorithm
    3.4.3 Recommendation Algorithms and Audits - Remarks
  3.5 Other Related Work
    3.5.1 Understanding YouTube Characteristics
    3.5.2 YouTube in Science and Education
4 Characterizing and Detecting Inappropriate Videos Targeting Young Children on YouTube
  4.1 Motivation
  4.2 Datasets
    4.2.1 Data Collection
    4.2.2 Manual Annotation Process
  4.3 General Characterization
  4.4 Detection of Disturbing Videos
    4.4.1 Dataset and Feature Description
    4.4.2 Model Architecture
    4.4.3 Experimental Evaluation
  4.5 Analysis
    4.5.1 Recommendation Graph Analysis
    4.5.2 How likely is it for a toddler to come across inappropriate videos?
  4.6 Challenges and Limitations
    4.6.1 Challenges
    4.6.2 Limitations
  4.7 Remarks
5 Characterizing Hateful and Misogynistic Content on YouTube Through the Lens of the Incel Community
  5.1 Motivation
  5.2 Background
  5.3 Datasets
    5.3.1 Data Collection
    5.3.2 Video Annotation
    5.3.3 Ethics
  5.4 Analysis
    5.4.1 Evolution of Incel Community on YouTube
    5.4.2 Does YouTube’s Recommendation Algorithm steer users towards Incel-related videos?
  5.5 Challenges and Limitations
    5.5.1 Challenges
    5.5.2 Limitations
  5.6 Remarks
6 Assessing the Effect of Watch History on YouTube’s Pseudoscientific Video Recommendations
  6.1 Motivation
  6.2 Datasets
    6.2.1 Data Collection
    6.2.2 Crowdsourcing Data Annotation
    6.2.3 Ethics
  6.3 Detection of Pseudoscientific Videos
    6.3.1 Classifier Architecture
    6.3.2 Experimental Evaluation
  6.4 Analysis
    6.4.1 Experimental Design
    6.4.2 Pseudoscientific Content on Homepage, Search Results, and Video Recommendations
    6.4.3 Temporal Sensitivity
    6.4.4 Experiments Summary
  6.5 Limitations
  6.6 Remarks
7 Discussion & Conclusions
  7.1 Characterizing and Detecting Inappropriate Videos Targeting Young Children on YouTube
  7.2 Characterizing Hateful and Misogynistic Content on YouTube Through the Lens of the Incel Community
  7.3 Assessing the Effect of Watch History on YouTube’s Pseudoscientific Video Recommendations
  7.4 Challenges and Limitations
  7.5 Conclusion

List of Figures

4.1 Examples of disturbing videos, i.e. inappropriate videos that target toddlers.
4.2 Per class proportion of videos for top 15 stems found in titles of (a) Elsagate-related; (b) other child-related; (c) random; and (d) popular videos.
4.3 Per class proportion of videos for the top 15 stems found in video tags of (a) Elsagate-related; (b) other child-related; (c) random; and (d) popular videos.
4.4 Per class proportion of videos for the top 15 labels found in thumbnails of (a) Elsagate-related; (b) other child-related; (c) random; and (d) popular videos.
4.5 Per class proportion of videos that their thumbnail contains spoofed, adult, medical, violent, and/or racy content for (a) Elsagate-related; (b) other child-related; (c) random; and (d) popular videos.
4.6 CDF of the number of views per class for (a) Elsagate-related (b) other child-related, (c) random, and (d) popular videos.
4.7 CDF of the fraction of likes to dislikes per class for (a) Elsagate-related (b) other child-related, (c) random, and (d) popular videos.
4.8 CDF of the number of comments/views per class for (a) Elsagate-related (b) other child-related, (c) random, and (d) popular videos.
4.9 Architecture of our deep learning model for detecting.