Public Health Community Mining in YouTube
Scott Burton, Richard Morris, Michael Dimond, Joshua Hansen, Christophe Giraud-Carrier
Department of Computer Science
Brigham Young University, Provo, UT 84602
{sburton, rmorris}@byu.edu, {dimondm, hansen.joshuaa}@gmail.com, [email protected]

Joshua West, Carl Hanson, Michael Barnes
Department of Health Science
Brigham Young University, Provo, UT 84602
{joshua.west, carl_hanson, michael_barnes}@byu.edu

ABSTRACT
YouTube has become a vast repository of not only video content, but also of rich information about the reactions of viewers and relationships among users. This meta-data offers novel ways for public health researchers to increase their understanding of, and ultimately to more effectively shape, people's attitudes and behaviors as both consumers and producers of health. We illustrate some of the possibilities here by showing how communities of videos, authors, subscribers and commenters can be extracted and analyzed. Tobacco use serves as a case study throughout.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; J.3 [Life and Medical Sciences]: Health

General Terms
Algorithms, Experimentation

Keywords
Community Mining, Public Health, YouTube

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IHI’12, January 28–30, 2012, Miami, Florida, USA.
Copyright 2012 ACM 978-1-4503-0781-9/12/01 ...$10.00.

1. INTRODUCTION
One of the difficulties of research in public health is determining (and ultimately influencing) the perception and reaction of the public with regard to health-related issues. Typical approaches, such as questionnaires (e.g., NHANES, HINTS), can be difficult and costly to administer. Furthermore, processing results and preparing them for analysis are tedious activities that cause studies based on questionnaires to be delayed and thus to lag behind important, relevant, and detectable, social media health communications, which arise spontaneously and much faster.

Social networking websites, such as Facebook, MySpace, Twitter and YouTube, contain vast amounts of content that is constantly being updated by the public as a whole. One of the great advantages this offers is the ability to observe, in a timely manner, the attitudes and behaviors of people in their natural interactions with others. There has thus naturally been a growing focus on the importance of online social media in public health research [7, 31]. Recent studies have, for example, shown that online social interactions may carry enough positive peer pressure to encourage healthy behavior [22]. It has also been found that, while in some cases anonymity may promote increased antagonism [20], adolescents generally feel more comfortable discussing potentially embarrassing topics with some degree of anonymity, as afforded by social media [12, 30]. For public health practitioners, social media offer yet another significant advantage in that the interactive nature of Web 2.0 applications facilitates not only observations, but more importantly intervention, such as through tweets or chats [9].

In this paper, we focus our attention on YouTube. With the ability to easily post, view, and comment on videos, YouTube allows the ideas of a single user to be seen by millions in a matter of days. In addition to the obvious video content, YouTube is also a repository of rich meta-data giving relationships among related videos, users, and comments. Indeed, while it was designed primarily as a video-sharing platform, each entry or submission to YouTube goes far beyond the video content and author's name alone, to include such things as a list of author-defined tags or keywords to describe the video, a list of subscribers (i.e., people who "follow" the video's author), comments left by viewers, ratings left by users (now simplified to like or dislike), statistics collected by YouTube, such as number of views and viewing history, and a ranked list of links to 20 related videos, as determined by YouTube's proprietary algorithm based on viewers' clickstream data, recency, etc.

Unlike some who have argued that medical research based on statistics from YouTube may lend false credence to a "conduit of popular culture" [13], we believe that YouTube remains a valuable medium both for observing and for interacting with the public. Additionally, it has been noted that health topics are already being discussed in social networks, and in many cases the associated communications are dominated by businesses that have vested commercial interests [31]. It would seem not only reasonable, but in fact desirable, for the public health community to take advantage of YouTube, and other social media, to ensure that accurate, constructive and health-promoting viewpoints are widely represented and adequately expressed.

While much of what is presented here extends in principle to other social media platforms, there are several features of YouTube that make it particularly well suited to our applications. In particular, 1) YouTube is an open forum, 2) YouTube's data is rich in natural relationships among its elements (e.g., friends, comments, related videos), and 3) YouTube possesses a rich application programming interface (API) that makes almost all of its data available for easy consumption by data mining tools.

We take advantage of these features here and show how social network analysis tools can be used to build and analyze communities of videos, authors and comments from YouTube's rich data. We give examples of the value of these communities with regard to public health, with specific emphasis on tobacco usage as a relevant case study.

2. RELATED WORK
There is certainly no way for us to be exhaustive here about work in social network analysis or the use of social media in public health. However, we highlight several pieces of work most relevant to our own in the context of YouTube and community mining.

The increase of YouTube's popularity and the accessibility of its audio/visual material, textual comments, and friendships, is leading public health researchers to leverage this oracle to public perception and interaction. Most of the studies so far have focused exclusively on the content or message of videos returned by certain keywords. For example, videos have been examined for their potential role in implicitly influencing normative beliefs formation or for their explicit attempts at eliciting positive or negative sentiment in areas as varied as vaccinations/immunizations [16], recreational partial asphyxiation (i.e., the choking game) [21], and tanning beds [14], with a significant body of studies specifically targeted at smoking behavior [1, 11, 10, 27, 18]. Our research goes beyond content. It is interested in how videos and authors are connected to each other, and how such networks can inform our understanding of health issues.

While others have studied structural properties of YouTube as a general social network (e.g., see [28, 29]), our work focuses on the unique notion of community. Indeed, social networks differ from other types of networks, such as technological or computer networks, in many ways that can be traced to the fact that they are inherently composed of

the graph [6]. An alternative method still views the videos as nodes, but defines the edges based on "video responses" posted by users in response to an original video [2]. Once a social network of videos is defined, a social network of the users that authored those videos can also be derived rather straightforwardly [2].

We capitalize on the richness of implicit relationships embedded in YouTube's data to build on these ideas. Indeed, while YouTube is essentially an extensive network of videos, the additional information available in tags, friends' lists, subscribers' lists, and comment trails can be used to build various focused communities of videos, authors and commenters relevant to public health research. In what follows, we present a basic analysis of several of these communities and illustrate their value in the context of tobacco usage.

3.1 Video Communities
As stated above, YouTube does not make available a complete list of its videos, and even if it did, that list would be unmanageable due to its sheer size. Hence, it is not possible to use most community building algorithms, which require a knowledge of the complete social network. Instead, an iterative approach to building communities must be employed, beginning with one or more (seed) videos and expanding from that point. A mechanism for this iterative expansion consists in exploiting the set of related videos provided by YouTube alongside each video, and also available through the public API. While the details of the algorithm used by YouTube to produce the set of related videos are proprietary, the related videos are a valuable resource for understanding behavior in that they represent what users see and click on when navigating the site.

We describe two complementary ways of building communities of videos. The first tries to capture the general behavior of viewers. The second is more directed and focuses the community on a specific topic.

3.1.1 Breadth-first Search
Perhaps the simplest way of iteratively producing a community of videos is to begin with a specific seed video and to proceed with a breadth-first search of related videos. Breadth-first search consists of going from the seed video to its related videos, followed by their related videos, and so on, until all videos have been visited or a certain number of iterations has been reached, as detailed in Algorithm 1 [19].
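As an illustration only, the breadth-first expansion described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the function name `bfs_community`, the `get_related` callback, the `max_depth` parameter, and the toy `TOY_RELATED` graph are all hypothetical. In practice `get_related` would wrap whatever call returns the related-video list for a given video; here it reads from an in-memory dictionary so that the example runs on its own.

```python
from collections import deque


def bfs_community(seed, get_related, max_depth=2):
    """Expand outward from `seed` through related-video links, level by level.

    `get_related` is a hypothetical callback that returns the list of
    related video IDs for a given video ID. Returns the set of videos
    reached and the list of directed (video, related_video) edges seen.
    """
    visited = {seed}
    edges = []
    queue = deque([(seed, 0)])  # (video ID, distance from the seed)
    while queue:
        video, depth = queue.popleft()
        if depth >= max_depth:
            continue  # iteration limit reached: do not expand this video
        for related in get_related(video):
            edges.append((video, related))
            if related not in visited:
                visited.add(related)
                queue.append((related, depth + 1))
    return visited, edges


# Toy related-video graph standing in for real data (hypothetical IDs).
TOY_RELATED = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": ["d"],
    "d": [],
}


def toy_get_related(video_id):
    """Look up related videos in the toy graph."""
    return TOY_RELATED.get(video_id, [])


videos, edges = bfs_community("seed", toy_get_related, max_depth=2)
```

With `max_depth=2`, the video `"c"` is reached but not expanded, so `"d"` stays outside the community. The depth limit plays the role of the "certain number of iterations" stopping condition; dropping it would continue the search until no unvisited related videos remain.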