A longitudinal analysis of YouTube’s promotion of conspiracy videos

Marc Faddoul (1), Guillaume Chaslot (3), and Hany Farid (1,2)

(1) School of Information, University of California, Berkeley
(2) Electrical Engineering & Computer Sciences, University of California, Berkeley
(3) Mozilla Foundation

Corresponding author: Hany Farid, School of Information, UC Berkeley
Email: [email protected]

arXiv:2003.03318v1 [cs.CY] 6 Mar 2020

Abstract

Conspiracy theories have flourished on social media, raising concerns that such content is fueling the spread of disinformation, supporting extremist ideologies, and in some cases, leading to violence. Under increased scrutiny and pressure from legislators and the public, YouTube announced efforts to change their recommendation algorithms so that the most egregious conspiracy videos are demoted and demonetized. To verify this claim, we have developed a classifier for automatically determining if a video is conspiratorial (e.g., the moon landing was faked, the pyramids of Giza were built by aliens, end-of-the-world prophecies, etc.). We coupled this classifier with an emulation of YouTube’s watch-next algorithm on more than a thousand popular informational channels to obtain a year-long picture of the videos actively promoted by YouTube. We also obtained trends of the so-called filter-bubble effect for conspiracy theories.

Keywords
Online Moderation, Disinformation, Algorithmic Transparency, Recommendation Systems

Introduction

By allowing for a wide range of opinions to coexist, social media has allowed for an open exchange of ideas. There have, however, been concerns that the recommendation engines which power these services amplify sensational content because of its tendency to generate more engagement. The algorithmic promotion of conspiracy theories by YouTube’s recommendation engine, in particular, has recently been of growing concern to academics [1–7], legislators [8], and the public [9–14]. In August 2019, the FBI introduced fringe conspiracy theories as a domestic terrorist threat, due to the increasing number of violent incidents motivated by such beliefs [15].

Some 70% of watched content on YouTube is recommended content [16], in which YouTube algorithms promote videos based on a number of factors including optimizing for user-engagement or view-time. Because conspiracy theories generally feature novel and provoking content, they tend to yield higher than average engagement [17]. The recommendation algorithms are thus vulnerable to sparking a reinforcing feedback loop [18] in which more conspiracy theories are recommended and consumed [19].

YouTube has, however, contested this narrative with three main counter-arguments [20]: (1) According to YouTube’s Chief Product Officer Neal Mohan, “it is not the case that extreme content drives a higher version of engagement”; (2) The company claims that view-time is not the only metric accounted for by the recommendation algorithm; and (3) Recommendations are made within a spectrum of opinions, leaving users the option to engage or not with specific content.

We are skeptical that these counter-arguments are consistent with what we and others qualitatively have been seeing play out on YouTube for the past several years. In particular: (1) according to Facebook’s CEO Mark Zuckerberg, extreme content does drive more engagement on social media [21]; (2) Although view-time might not be the only metric driving the recommendation algorithms, YouTube has not fully explained what the other factors are, or their relative contributions. It is unarguable, nevertheless, that keeping users engaged remains the main driver for YouTube’s advertising revenues [22,23]; and (3) While recommendations may span a spectrum, users preferentially engage with content that conforms to their existing world view [24].

Nonetheless, in January of 2019 YouTube announced efforts to reduce “recommendations of borderline content and content that could misinform users in harmful ways – such as videos promoting a phony miracle cure for a serious illness, claiming the earth is flat, or making blatantly false claims about historic events like 9/11” [25]. This effort complemented a previous initiative to include direct links to Wikipedia with videos related to conspiratorial topics [26]. In June of 2019, YouTube announced that their efforts led to a reduction of view-time from these recommendations by over 50% [27]. In December of 2019, YouTube updated this estimate to 70% [28]. Our analysis aims to better understand the nature and extent of YouTube’s promotion of conspiratorial content.

Materials & Methods

Recommendations

YouTube makes algorithmic recommendations in several different places. We focus on the watch-next algorithm, which is the system that recommends a video to be shown next when auto-play is enabled. YouTube distinguishes between two types of recommendations: recommended-for-you videos are computed based on the user’s previous viewing history, whereas recommended videos are not individualized. Our requests are made with a U.S.-based IP address, without any identifying cookie. There are, therefore, no recommended-for-you videos.

Our method to emulate the recommendation engine is a two-step process: we start by gathering a list of seed channels, and then generate recommendations starting from the videos posted by these channels.

The list of seed channels is obtained with a snowball method. We start with an initial list of 250 of the most subscribed English YouTube channels. The last video posted by each of these seed channels is retrieved and the next 20 watch-next recommendations are extracted. The channels associated with these recommendations are ranked by number of occurrences. The channel that has the largest number of recommendations, and is not part of the initial seed set, is added to the set of seed channels. This process is repeated until 12,000 channels are gathered.

To focus our computational resources on the parts of YouTube that are relevant to information and disinformation, we perform a cluster analysis [29] on these 12,000 channels. We retain a single cluster of 1103 channels which corresponds to news and information channels (e.g., BBC, CNN, FOX...). Since the unsupervised clustering is not perfect, we manually added 43 channels that we considered to be consistent with the other information channels. This yielded a final list of 1146 seed channels, then reduced to 1080 by the end of the analysis after some channels were deleted or stalled.

We then gathered the 20 first recommendations from the watch-next algorithm starting from the last video uploaded by each of the seed channels every day from October 2018 to February 2020. The top 1000 most recommended videos on a given day were retained and used in our analysis. As described below, these videos were analyzed to determine which were predicted to be conspiratorial in nature.

Training Set

We collected a training set of conspiracy videos in an iterative process. An initial set of 200 videos was collected from a book referencing top conspiracy theories on YouTube [30], and a set of videos harvested on 4chan and on the sub-reddits r/conspiracy, r/conspiracyhub, and r/HealthConspiracy. A comparable set of 200 non-conspiratorial videos was collected by randomly scraping YouTube videos. These videos were manually curated to remove any potentially conspiratorial videos. As we began our analysis, we augmented these initial videos by adding any obviously mis-classified videos into the appropriate conspiratorial or non-conspiratorial training set, yielding [...]

[...] activities or accidents; (2) Holds a view of the world that goes against scientific consensus; (3) Is not backed by evidence, but instead by information that was claimed to be obtained through privileged access; (4) Is self-filing or unfalsifiable.

Text Classification

A key component of our video classifier is fastText, a text-based classifier [31]. This classifier takes a text sample as input, and predicts the probability that the sample belongs to a given class (e.g., a conspiratorial video).

The classifier begins by parsing the training data to define a vocabulary. Input text samples are then represented by a concatenation of a bag-of-words and a bag of n-grams, as defined by the vocabulary. An embedding matrix projects this representation into a lower-dimensional space, after which a linear classifier is used to classify the text into one of two (or more) classes.

Video Classification

Our video classifier analyzes various text-based components of a video using individual classifier modules for each. These modules, described next, are followed by a second layer that combines their outputs to yield a final conspiracy likelihood.

1. The transcript of the video, also called subtitles, can be uploaded by the creator or auto-generated by YouTube, and captures the content of the video. The transcript is scored by a fastText classifier.
2. The video snippet is the concatenation of the title, the description, and the tags of the video. The snippet renders the language used by the content creator to describe their video. The snippet is also scored by a fastText classifier.
3. The content of the 200 top comments defined by YouTube’s relevance metric (without replies). Each comment is individually scored by a fastText classifier. The score of a video is the median score of all its comments.
4. The perceived impact of the comments. We use Google’s Perspective API [32] to score each comment on the following properties: (1) toxicity; (2) spam; (3) unsubstantial; (4) threat; (5) incoherent; (6) profanity; and (7) inflammatory. This set of seven perspective scores for each comment is converted into a 35-D feature vector for the whole video by taking the median value and standard deviation of each property (14 features) as well as the median value of the pair-wise products of each property (21 features). A logistic regression classifier is trained to predict the conspiracy likelihood of the video from this 35-D feature vector.

The output of these four modules is then fed into a final logistic regression layer to yield a prediction for the entire video.
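The snowball method described under Recommendations can be illustrated with a short sketch. This is not the authors' code: the callback `recommended_channels` is a hypothetical stand-in for whatever retrieves the channels behind the watch-next recommendations of a channel's latest video, and the sketch assumes that occurrence counts are recomputed over the whole current seed set at each step, with ties broken alphabetically.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set


def snowball_channels(
    initial_seeds: Iterable[str],
    recommended_channels: Callable[[str], List[str]],
    target_size: int,
) -> Set[str]:
    """Grow a seed set by one channel per iteration.

    `recommended_channels(channel)` is a hypothetical callback returning the
    channels associated with the watch-next recommendations of the channel's
    latest video (the paper extracts 20 recommendations per video).
    """
    seeds = set(initial_seeds)
    cache: Dict[str, List[str]] = {}  # fetch each channel's recommendations once
    while len(seeds) < target_size:
        counts: Counter = Counter()
        for channel in seeds:
            if channel not in cache:
                cache[channel] = recommended_channels(channel)
            # Only channels outside the current seed set are candidates.
            counts.update(c for c in cache[channel] if c not in seeds)
        if not counts:
            break  # nothing new is being recommended
        # Most recommended candidate; alphabetical tie-break (an assumption).
        best = min(counts, key=lambda c: (-counts[c], c))
        seeds.add(best)
    return seeds
```

With the paper's numbers, `initial_seeds` would hold 250 channels and `target_size` would be 12,000. Adding one channel per iteration follows the description literally, although it is slow at that scale.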
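The daily selection step (keeping the 1000 most recommended videos of a given day) amounts to counting occurrences across all seed channels. The following is a minimal sketch under the assumption that "most recommended" means "appearing most often among the day's collected watch-next lists". The input mapping and the function name are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List


def top_recommended(daily_recs: Dict[str, List[str]], n: int = 1000) -> List[str]:
    """Return the n video ids recommended most often on one day.

    `daily_recs` maps each seed channel to the video ids recommended from
    its latest upload (an assumed data layout).
    """
    counts: Counter = Counter()
    for recs in daily_recs.values():
        counts.update(recs)
    # Sort by descending count; break ties by video id for determinism.
    ranked = sorted(counts, key=lambda video: (-counts[video], video))
    return ranked[:n]
```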
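The feature count in module 4 can be checked directly: 7 medians + 7 standard deviations + C(7,2) = 21 pair-wise product medians gives 35 features. The sketch below builds such a vector from per-comment scores. It assumes that each pair-wise product is computed per comment before taking the median over comments, and that the population standard deviation is used; the paper does not specify either detail. The property names are used here as plain dictionary keys and are not claimed to be the exact Perspective API attribute identifiers.

```python
from itertools import combinations
from statistics import median, pstdev
from typing import Dict, List

PROPERTIES = [
    "toxicity", "spam", "unsubstantial", "threat",
    "incoherent", "profanity", "inflammatory",
]


def comment_features(scores: List[Dict[str, float]]) -> List[float]:
    """Turn per-comment property scores into one 35-D vector per video.

    `scores` holds one dict per comment, keyed by the names in PROPERTIES.
    Raises statistics.StatisticsError if `scores` is empty.
    """
    columns = {p: [s[p] for s in scores] for p in PROPERTIES}
    medians = [median(columns[p]) for p in PROPERTIES]   # 7 features
    spreads = [pstdev(columns[p]) for p in PROPERTIES]   # 7 features
    products = [                                         # 21 features
        median([a * b for a, b in zip(columns[p], columns[q])])
        for p, q in combinations(PROPERTIES, 2)
    ]
    return medians + spreads + products
```

The resulting vector would then be the input to the logistic regression classifier of module 4.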
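The second layer can be pictured as a logistic function applied to a weighted sum of the four module scores. In the paper the weights are learned from the training set; in the sketch below they are placeholders supplied by the caller, and both function names are hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def video_comment_score(comment_scores: Sequence[float]) -> float:
    """Module 3: the video's score is the median of its comment scores."""
    return median(comment_scores)


def combine_modules(
    scores: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic layer: map the module scores to a likelihood in (0, 1).

    `weights` and `bias` are placeholders; real values would come from
    fitting a logistic regression on labelled videos.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))
```

A call such as `combine_modules([transcript, snippet, comments, perspective], weights, bias)` would return the conspiracy likelihood for one video, where the four inputs are the module outputs listed above.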