
Public Health Community Mining in YouTube

Scott Burton, Richard Morris, Michael Dimond, Joshua Hansen, Christophe Giraud-Carrier
Department of Computer Science
Brigham Young University, Provo, UT 84602
{sburton, rmorris}@byu.edu, {dimondm, hansen.joshuaa}@gmail.com, [email protected]

Joshua West, Carl Hanson, Michael Barnes
Department of Health Science
Brigham Young University, Provo, UT 84602
{joshua.west, carl_hanson, michael_barnes}@byu.edu

ABSTRACT is constantly being updated by the public as a whole. One YouTube has become a vast repository of not only video of the great advantages this offers is the ability to observe, content, but also of rich information about the reactions of in a timely manner, the attitudes and behaviors of people in viewers and relationships among users. This meta-data of- their natural interactions with others. There has thus natu- fers novel ways for public health researchers to increase their rally been a growing focus on the importance of online social understanding of, and ultimately to more effectively shape, media in public health research [7, 31]. Recent studies have, people’s attitudes and behaviors as both consumers and pro- for example, shown that online social interactions may carry ducers of health. We illustrate some of the possibilities here enough positive peer pressure to encourage healthy behav- by showing how communities of videos, authors, subscribers ior [22]. It has also been found that, while in some cases and commenters can be extracted and analyzed. Tobacco anonymity may promote increased antagonism [20], adoles- use serves as a case study throughout. cents generally feel more comfortable discussing potentially embarrassing topics with some degree of anonymity, as af- forded by social media [12, 30]. For public health practition- Categories and Subject Descriptors ers, social media offer yet another significant advantage in H.4 [Information Systems Applications]: Miscellaneous; that the interactive nature of Web 2.0 applications facilitates J.3 [Life and Medical Sciences]: Health not only observations, but more importantly intervention, such as through tweets or chats [9]. General Terms In this paper, we focus our attention on YouTube. With the ability to easily post, view, and comment on videos, Algorithms, Experimentation YouTube allows the ideas of a single user to be seen by millions in a matter of days. 
In addition to the obvious Keywords video content, YouTube is also a repository of rich meta-data Community Mining, Public Health, YouTube giving relationships among related videos, users, and com- ments. Indeed, while it was designed primarily as a video- 1. INTRODUCTION sharing platform, each entry or submission to YouTube goes far beyond the video content and author’s name alone, to in- One of the difficulties of research in public health is deter- clude such things as a list of author-defined tags or keywords mining —and ultimately influencing— the perception and to describe the video, a list of subscribers (i.e., people who reaction of the public with regard to health-related issues. “follow” the video’s author), comments left by viewers, rat- Typical approaches, such as questionnaires (e.g., NHANES, ings left by users (now simplified to like or dislike), statistics HINTS), can be difficult and costly to administer. Further- collected by YouTube, such as number of views and viewing more, processing results and preparing them for analysis are history, and a ranked list of links to 20 related videos, as tedious activities that cause studies based on questionnaires determined by YouTube’s proprietary algorithm based on to be delayed and thus to lag behind important, relevant, viewers’ clickstream data, recency, etc. and detectable, social media health communications, which Unlike some who have argued that medical research based arise spontaneously and much faster. on statistics from YouTube may lend false credence to a Social networking websites, such as , MySpace, “conduit of popular culture” [13], we believe that YouTube Twitter and YouTube, contain vast amounts of content that remains a valuable medium both for observing and for in- teracting with the public. 
Additionally, it has been noted that health topics are already being discussed in social net- Permission to make digital or hard copies of all or part of this work for works, and in many cases the associated communications personal or classroom use is granted without fee provided that copies are are dominated by businesses that have vested commercial not made or distributed for profit or commercial advantage and that copies interests [31]. It would seem not only reasonable, but in bear this notice and the full citation on the first page. To copy otherwise, to fact desirable, for the public health community to take ad- republish, to post on servers or to redistribute to lists, requires prior specific vantage of YouTube, and other social media, to ensure that permission and/or a fee. accurate, constructive and health-promoting viewpoints are IHI’12, January 28–30, 2012, Miami, Florida, USA. Copyright 2012 ACM 978-1-4503-0781-9/12/01 ...$10.00. widely represented and adequately expressed. While much of what is presented here extends in principle the graph [6]. An alternative method still views the videos to other social media platforms, there are several features as nodes, but defines the edges based on “video responses” of YouTube that make it particularly well suited to our ap- posted by users in response to an original video [2]. Once a plications. In particular, 1) YouTube is an open forum, 2) social network of videos is defined, a social network of the YouTube’s data is rich in natural relationships among its users that authored those videos can also be derived rather elements (e.g., friends, comments, related videos), and 3) straightforwardly [2]. YouTube possesses a rich application programming inter- We capitalize on the richness of implicit relationships em- face (API) that makes almost all of its data available for bedded in YouTube’s data to build on these ideas. Indeed, easy consumption by data mining tools. 
while YouTube is essentially an extensive network of videos, We take advantage of these features here and show how the additional information available in tags, friends’ lists, social network analysis tools can be used to build and an- subscribers’ lists, and comment trails can be used to build alyze communities of videos, authors and comments from various focused communities of videos, authors and com- YouTube’s rich data. We give examples of the value of these menters relevant to public health research. In what follows, communities with regard to public health, with specific em- we present a basic analysis of several of these communities phasis on tobacco usage as a relevant case study. and illustrate their value in the context of tobacco usage. 3.1 Video Communities 2. RELATED WORK As stated above, YouTube does not make available a com- There is certainly no way for us to be exhaustive here plete list of its videos, and even if it did, that list would be about work in social network analysis or the use of social unmanageable due to its sheer size. Hence, it is not possible media in public health. However, we highlight several pieces to use most community building algorithms, which require a of work most relevant to our own in the context of YouTube knowledge of the complete social network. Instead, an iter- and community mining. ative approach to building communities must be employed, The increase of YouTube’s popularity and the accessibility beginning with one or more (seed) videos and expanding of its audio/visual material, textual comments, and friend- from that point. A mechanism for this iterative expansion ships, is leading public health researchers to leverage this or- consists in exploiting the set of related videos provided by acle to public perception and interaction. Most of the stud- YouTube alongside each video, and also available through ies so far have focused exclusively on the content or message the public API. 
While the details of the algorithm used by of videos returned by certain keywords. For example, videos YouTube to produce the set of related videos are proprietary, have been examined for their potential role in implicitly in- the related videos are a valuable resource for understanding fluencing normative beliefs formation or for their explicit at- behavior in that they represent what users see and click on tempts at eliciting positive or negative sentiment in areas as when navigating the site. varied as vaccinations/immunizations [16], recreational par- We describe two complementary ways of building commu- tial asphyxiation (i.e., the choking game) [21], and tanning nities of videos. The first tries to capture the general behav- beds [14], with a significant body of studies specifically tar- ior of viewers. The second is more directed and focuses the geted at smoking behavior [1, 11, 10, 27, 18]. Our research community on a specific topic. goes beyond content. It is interested in how videos and au- thors are connected to each other, and how such networks 3.1.1 Breadth-first Search can inform our understanding of health issues. Perhaps the simplest way of iteratively producing a com- While others have studied structural properties of YouTube munity of videos is to begin with a specific seed video and as a general social network (e.g., see [28, 29]), our work to proceed with a breadth-first search of related videos. focuses on the unique notion of community. Indeed, so- Breadth-first search consists of going from the seed video cial networks differ from other types of networks, such as to its related videos, followed by their related videos, and so technological or computer networks, in many ways that can on, until all videos have been visited or a certain number of be traced to the fact that they are inherently composed of iterations has been reached, as detailed in Algorithm 1 [19]. communities [26]. 
Understanding these communities with regard to health concerns can lead to valuable research in- Algorithm 1 Breadth-first Search sights, yet discovering these communities within the context of YouTube is a non-trivial computational problem. At least Require: An initial video v0, and for each video v, a set of part of the difficulty arises from the fact that many commu- related videos defined by v.relatedV ideos() nity mining algorithms depend on a complete enumeration Ensure: The set C contains the community of the network [5, 17, 24, 32, 25, 33]. Yet, YouTube does not Initialize set C and queue Q to be empty make available a complete list of its videos, and even if it Q.enqueue(v ) did that list would be much too large for its enumeration to 0 repeat be computationally feasible. Recently, new algorithms have v ← Q.dequeue() begun to emerge that perform community discovery through C.add(v) a controlled iterative process [4]. We follow and extend this for all related videos r in v.relatedV ideos() do latter approach to community building here. Q.enqueue(r) end for 3. YOUTUBE COMMUNITIES until Q is empty or terminating condition reached One approach to defining a social network from YouTube return C data is to 1) consider videos as nodes, and 2) use the re- lated video list provided by YouTube to define the edges of Because there are 20 related videos provided by YouTube for each video, the size of communities discovered using this technique would grow very quickly, on the order of O(20d), Table 1: Subset of Titles of the Beam Search- where d is the depth or number of iterations performed. An generated Community of Anti-smoking Videos alternative in such contexts is to constrain the breadth-first Video Depth Title search to a beam search [34], where, for each video consid- 1 0 Tobacco Free Florida: Kid Tossing Ball ered, only the first b most related videos, as per YouTube’s 2 1 Tobacco Free Florida: Kid Tossing Ball rankings, are added to the queue. 
The size of the commu- (CC) nity now only grows on the order of O(bd). If b = 20, the 3 1 Tobacco Free Florida: Mirror result of the beam search is identical to the result of the 4 1 Tobacco Free Florida: 31 Flavors traditional breadth-first search. 5 1 Tobacco Free Florida: Buckle Up (en As an example, Figure 1 shows the community of videos Espanol) discovered around an anti-smoking seed video using a beam 6 1 The Sexiest Commercial Ever. search with beam size b = 5, run to a depth of d = 3 from 7 2 Tobacco Free Florida: Buckle Up the initial video. Table 1 shows a subset of the titles of these 8 2 Grey Poupon Original Commercial videos. For the sake of space, only the first four titles are 9 2 Bounty Paper Towel Ads with Captions included for depths 2 and 3. 10 2 Gray Bright, Jack In The Box Taco Ad- venture (from Sydney Australia to USA for Taco’s) 82 92 35 85 91 36 ...... 83 34 93 37 26 3 Tobacco Free Florida: Video Game (en 84 44 33 29 24 88 Espanol) 22 9 28 86 47 87 90 30 32 27 3 Tobacco Free Florida: Light It Up 43 46 95 28 3 Wayne’s World - Grey Poupon (Parody) 78 23 8 31 25 89 45 11 41 29 3 Grey Poupon “Son Of Rolls” 30 Sec 81 21 6 38 80 10 Commercial 94 2 39 1 42 ...... 79 77 40 26 71 49 7 74 27 48 12 19 75 Table 2: Statistics on the Titles of the Beam Search- 69 5 20 3 76 generated Community of Anti-smoking Videos 70 18 57 55 73 Depth Unique Videos Smoking-related Sex-related 66 15 14 72 4 0 1 1 0 68 13 67 16 54 56 1 5 4 1 63 17 52 51 2 19 9 5 60 53 62 3 70 18 17 64 58 50 61 4 268 41 42 65 59 Total 363 73 65 Figure 1: Beam Search-generated Community of Anti-smoking Videos (b = 5, d = 3) links, even if they start watching an anti-smoking video, it is very likely that they will end up viewing content with sexual As may be expected, many of the videos in this commu- or humorous appeal, rather than continuing to view multiple nity, even a small number of links away from the starting anti-smoking productions. 
From a public health standpoint, video, are about very different topics. Specifically, many of there are several follow-up questions one may consider: the videos at a distance of three and four steps from the first video are not about smoking behavior at all, but rather 1. Is the current observation representative of a more gen- focus on humorous or sexual content (top left-hand side of eral human behavior? In other words, is it true that Figure 1). A reasonable way to assess the likely subject whatever the first video is (i.e., whatever the reason a matter of these videos is to observe keywords in the titles. user was drawn to a specific video on YouTube), users After surveying the titles, we noted that a large subset of the quickly (i.e., 2 or 3 hops) drift away to gravitate around videos could be identified as very likely to focus on smoking videos with sexual content? behavior or sexual appeal based on a few specific keywords. We recognize that there are clearly some tobacco and many 2. As far as conveying health-promoting messages is con- sexual related videos that do not contain these specific key- cerned, should content be packed into the first videos words, but using them gives an objective way of summariz- users are most likely to watch? ing the list of titles and illustrating the point. As shown in Table 2, this community of 363 unique videos has only 73 3. If viewers do indeed tend to be distracted by other whose titles contain the smoking-related words “tobacco,” content, is it possible to design health-related videos “smoke,” or “smoking,” whereas 65 contain the sex-related that are more likely to cause viewers to stick with the words “hot,”“sex,”“ass,” or “Megan Fox.” topic? How? It is interesting to note that such findings would be diffi- cult, if not impossible, to bring out without building commu- Answers to these questions, and other related ones, would nities. 
One valuable insight gained from this finding is that if help the preventive and intervention efforts of public health a user is navigating through content using the related videos practitioners within social media. 3.1.2 Multiple Sub-community Expansion Chen et al. do suggest that their algorithm could be ap- As shown, a breadth-first search or even a beam search us- plied iteratively to eventually build communities covering ing the related videos provided by YouTube quickly diverges the whole graph, by selecting random starting nodes from to many different topics. While this discloses possibly inter- those not in a community [4]. These potential starting esting aspects of human behavior, alternative methods must nodes consist of those linked to by a boundary node, but be employed to discover communities of videos that are more outside a community, and are referred to as the shell of the interrelated and therefore closer to the same topic. community. This random selection approach would result Chen et al. recently introduced Iterative Local Expansion in assigning additional videos to communities, but as with a (ILE), a community discovery process designed for iterative beam search, it would quickly diverge to more diverse topics. expansion in large networks [4]. The first part of this pro- If we consider each of these communities as a sub-community cess is a local community identification algorithm, which at- of a larger set of videos related to the desired topic, this it- tempts to identify communities with a “sharp” boundary to erative process could be used to identify additional starting the rest of the network. A community is considered in two nodes and subsequently additional sub-communities. How- parts: 1) nodes in the core, which only link to other nodes in ever, to discover additional sub-communities about the same the community; and 2) nodes on the boundary which link to overall topic, the selection process must be guided. 
other nodes in the community but also to those outside the We propose an extension to ILE, called Multiple Sub- community. The local modularity factor R is used to eval- community Expansion (MSCE), that implements an alter- uate the quality of a community [8]. It is specified in terms native selection process to identify starting videos that are of the boundary nodes and is defined as R = Bin , where more closely related to the original topic, composed of two Btotal components. First, each node in the shell set S is given Bin represents the number of links from the boundary nodes a community link score L of the number of unique sub- that stay inside the community and Btotal is the total num- ber of links from the boundary nodes. The local community communities that link to the node. On the first iteration, identification algorithm begins with a single node and adds this will result in a score of L = 1 for each node in S because nodes to the community in a greedy fashion in order of most they are each linked to by the single existing sub-community. improvement to R, until R can no longer be increased. On subsequent iterations, when more sub-communities have ILE can, of course, be applied to videos on YouTube by been identified, videos that are linked to by more than one using the set of related videos to define the nodes to which sub-community will receive higher L scores. a particular video links. One of the limitations of this ap- The second component consists of a keyword score K. proach however is that, while a small set of videos (typically These keywords are related to the overall topic and are sup- between 10 and 30) is discovered that are related around a plied by the user at the beginning of the process. The K certain topic, the topic may not be the exact one desired. score of a video is determined by the number of keywords For example, beginning with an anti-smoking commercial contained in that video’s title. 
Using these two components, featuring a superhero, a community may be discovered that an overall expansion selection score E can be determined is focused on tobacco, or alternatively a superhero related as the weighted sum of these, i.e., E = L + αK, where α community may be brought out. As an illustration, we have denotes a constant that can be defined to indicate the impor- run ILE starting with ten different anti-smoking videos, and tance of keyword score. The node with the highest E score observed that in many instances the communities are, in is then selected as the starting node for the next community. fact, closely centered on tobacco, but in many instances Details of MSCE are shown in Algorithm 2. the communities tend to focus closely on other topics, as One of the benefits of MSCE is that even when the first summarized in Table 3. As above, tobacco-related videos sub-community is not as related to the central topic, subse- are designated by titles containing the keywords “tobacco,” quent sub-communities are likely to return back to the de- “smoke,” or “smoking.”1 sired topic. For example, Figure 2 shows the composite com- munity that is the set of ten sub-communities discovered by MSCE, using the single keyword “smoking,” and beginning with the same single seed video used for our earlier beam Table 3: Tobacco Relatedness of ILE-generated search (“Tobacco Free Florida: Kid Tossing Ball”). In this Communities of Videos Community Videos Smoking-related Percent case, the first sub-community (highlighted by the rectangu- 1 18 4 22.2 lar region on the top right part of the figure) contains some 2 16 2 12.5 anti-smoking videos featuring superheroes, which results in 3 33 4 12.1 also including several videos that are solely about super- 4 9 0 0.0 heroes. Despite the fact that this sub-community is not com- 5 17 1 5.9 pletely focused on smoking, subsequent sub-communities are 6 12 1 8.3 much more focused on the topic, as shown in Table 4. 
Note 7 11 9 81.8 that because sub-communities can overlap, the total values 8 13 12 92.3 are computed with regard to the total number of unique 9 29 1 3.4 videos, not as sums of the corresponding columns. While 10 30 27 90.0 the first sub-community consists of only 22.2% (4/18) videos Total 188 61 32.4 containing the words “tobacco,” “smoke,” and “smoking,” when considering all ten sub-communities, 84.4% (157/186) of unique videos contain these words. 1We have found that even running ILE with the same start- To further demonstrate the robustness of the MSCE al- ing anti-smoking commercial in successive weeks can result gorithm in finding subsequent sub-communities that return in rather different behaviors because the greedy algorithm to the desired topic, we have run the algorithm for ten it- is highly influenced by the selection of the first few nodes. Algorithm 2 Multiple Sub-community Expansion 18

17

13 Require: An initial video v0, a set of keywords, and a C D 14 15 8 weight parameter α 16 11

12 Ensure: The set C contains a set of sub-communities 10 5

7 Initialize Set C to be empty 4

Let s = v0 6

repeat 3 121

119 2 9 Run ILE starting with s to produce sub-community Ci 120

123 122 129 130 126 and shell set S 115 125 62 124 118 141 73 C.add(C ) 64 74 i 127 72 61 75 113

128 60 144 117 for all videos v in S do 56 59 51 110 52 54 53 139 Let L = 0 and K = 0 77 76 71 136 132 55 66 138 for all sub-communities c in C do 50 133 68 58 57 112 140 137 63 111 135 142 if v is connected to any nodes in c then 79 70 82 1 143 78 67 L = L + 1 114 81 80 69 116

83 86 134 end if 96 65 107 84 85 87 94 end for 95 88

for all keywords key in keywords set do 93 92 109 90 97 131 89 if v.title contains key then 91 98 108 20 K = K + 1 99 19 152 147 end if A 181 106 102 end for 100 145 153 151 150 101 104 160 Let E = L + αK 156 103 154 105 155 end for 161 146 149 162 158 166

148 s = arg maxv E 157 163 44 34 40 159 22 30 until S is empty or terminating conditions reached 48 25 29 35 28 31 return C 183 184 21 178 164 46 36 39 41 175 182 170 23 173 171 24 27 B 185 26 169 38 33 32 177 47 174 165 167 42 45 179 37 180 168 43 176

186 erations (i.e., building a community composed of ten sub- 172 49 communities) beginning with each of the anti-smoking com- mercials used above. Thus, where before we built a single whole community for each of these videos (as shown in Ta- Figure 2: MSCE-generated Community of Anti- ble 3), we now build a composite community (made up of ten smoking Videos in 10 Sub-communities sub-communities) for each of the ten anti-smoking videos. Table 5 shows statistics regarding these ten communities. 8 and 10 experienced a reduction in the percentage of videos Even when the first sub-community (as shown in Table 3) on the topic (from 92.3% to 71.2% and 90.0% to 69.8%, re- was not on topic, the subsequent nine sub-communities in- spectively), where in each case the expansion included sub- cluded in the final composite community bring the overall communities that focused more on humorous commercials community back on topic (as shown in Table 5). The per- rather than strictly tobacco centered ones. centage of smoking-related videos in the first sub-community From the point of view of public heath practitioners, dis- compared to the percentage for the entire community, for covering a community of related videos using MSCE may each of the ten videos, are depicted in Figure 3. These re- prove useful in at least a couple of important ways. sults show that on average the percentage of smoking-related videos increases significantly from 32.4% to 69.4%, when in- 1. As discussed later, obtaining a community of videos cluding the additional nine sub-communities. This suggests is the first step in many other types of analysis, such that MSCE can be successful at returning to the desired as considering the relatedness of the authors or com- topic even when the initial community was further away. menters of videos. In addition, insight can be gained Additionally, we observe that videos that are more archetyp- by examining the community of videos directly. 
For ical of the topic and focus solely on the message rather example, nodes with a very high degree are likely to than including other themes or personalities from popular be very central to the topic and could represent those culture are more likely to remain centered on the original with higher social capital among the set. Nodes that topic. However, even in these cases the keyword compo- are bridges between different sub-communities are in- nent of MSCE is able to guide some of the subsequent sub- teresting because they represent a clickstream that a communities back toward the topic. This is demonstrated user may follow to transition between topics. For ex- with regard to videos 4, 5, and 6, where, as shown in Table 3, ample, a video that bridges a sub-community of anti- the first sub-communities for each had as little as 0%, 5.9%, smoking commercials and unrelated humorous com- and 8.3% of the videos containing the smoking-related words mercials could represent the point a which a user stops in the title, but after iterating to ten sub-communities, the consuming the health related content. in Figure 2, percentage of smoking-related videos in the corresponding node A has a very high degree which is the video final communities had risen to 69.0%, 49.1%, and 58.6%, re- “Graphic Australian Anti-Smoking Ad” that has been spectively. Interestingly, the communities built from videos viewed 2.5 million times and is very central to the topic Table 4: Tobacco Relatedness of MSCE-generated First Sub-community 100 All Sub-communities Sub-communities for a Single Anti-smoking Video Community Sub-community Videos Smoking-related Percent 80 1 18 4 22.2 2 31 29 93.5 60 3 15 14 93.3 4 33 32 97.0 40 5 24 23 95.8 Percent Smoking Related 6 35 29 83.0 20 7 16 15 94.8

8 15 12 80.0 0 9 22 21 95.5 1 2 3 4 5 6 7 8 9 10 10 15 13 86.7 Community Total (unique) 186 157 84.4 Figure 3: Percentage of Smoking-related Videos in the First Sub-community vs. in the Complete Com- Table 5: Tobacco Relatedness of MSCE-generated posite Community for Ten MSCE Communities Communities for Ten Different Anti-smoking Videos Community Videos Smoking-related Percent has shown that users rarely look beyond the first few 1 186 157 84.4 pages of results: 41% are reported as continuing their 2 163 144 88.3 search by changing keywords when the desired content 3 165 87 52.7 is not found on the first page of results and 88% as 4 145 100 69.0 changing their keywords when they do not find it on 5 161 79 49.1 the first three pages [15]. Thus, performing analysis on 6 145 85 58.6 over 900 videos retrieved from a search does not match 7 111 100 90.1 user behavior. Our own intuition and experience sug- 8 139 99 71.2 gests that users often hop from one video to the next 9 161 104 64.6 by way of the related video links. 10 149 104 69.8 Total 1525 1059 69.4 3.2 User Communities While YouTube is well-known for its video content, users are also at the heart of YouTube. Users are part of a larger of anti-smoking commercials. Also, node B, entitled community of friends, subscribers, subscriptions, videos, and “How to quit smoking,” has high degree and is the authors. Models of these communities offer researchers ways bridge between three sub-communities focusing specif- to identify important authors and their characteristics, in- ically on “the effects of smoking,”“do you still want to fluential videos, and interesting users. 
YouTube also acts, in smoke,” and “how to quit smoking.” Nodes C (“Star some fashion, as a social networking service, allowing users Wars Anti Smoking Ad”) and D (“Anti-Smoking : Su- to identify other users as friends, subscribe to authors, per- perman Versus Nick O’Teen (1981)”) are examples of sonalize a page with user info and videos posted by the user, bridge videos that are about tobacco, but could also and exchange messages with other users. represent a clickstream taking a user to more super- hero or movie related videos than health ones. 3.2.1 Author-Friend Community An author-friend community is an example of the commu- 2. MCSE could be used as an alternative sampling method nities that can be built based on the YouTube users. This for other studies. Almost all previous public health community is built starting with a set of videos (such as work involving YouTube has the researchers choose a those obtained using the community mining algorithm men- set of keywords to search through YouTube’s website tioned above) and identifying the author of each video in the and using the resultant videos as the sample for their set. Then each of the friends of these authors is identified, work [1, 11, 14, 16]. While this approach has a higher and a graph is built with each of these authors and their chance of returning videos that are well on topic, it friends as nodes, and edges denoting the friendship relation also presents a number of drawbacks. In particular, between users. Anomalous users can be identified from this finding adequate keywords is notoriously difficult,2 and graph, such as those with an unusual number of friends or in the case of YouTube (as many other online search those who are friends with an unusual amount of other au- systems) the number of results returned per query is thors in the community. limited to 1,000. Alternatively, the MSCE approach As an illustration, Figure 4 shows the author-friend com- can retrieve any number of videos. 
Another interest- munity built from the authors of the same MSCE anti- ing aspect of building a sample based on MSCE is that smoking community discussed earlier, showing only those it more closely matches actual user activity. Research users who are friends of at least four authors in the set. 2Ambiguous words, mismatch between practitioners’ vocab- Nodes corresponding to authors are shown in black, while ulary (e.g., smoking cessation) and layman’s terms (e.g., quit nodes corresponding to friends are shown in white. The size smoking), etc. of each node is proportional to its degree. 812

Figure 4: Community of Authors and their Friends

Figure 5 shows the distribution of number of authors per number of friends. Not surprisingly, the distribution follows a kind of power law, with most authors having a small number of friends and few authors having a very large number of friends. The maximum number of friends for an author in this community is 6,249. Note that we did not distinguish between authors with 0 friends and authors who choose to keep their list of friends private, which may bias our results.

Figure 5: Number of Authors vs. Number of Friends (x-axis: Total Friends; y-axis: Authors)

The users of this author-friend community represent those who are likely to have some affinity toward the topic, because they have either authored a video on the topic themselves or are friends with at least four authors of videos on the topic. Nodes with a high degree could represent users of higher social capital who potentially have influence in this community. One of the reasons that some of the users in this graph have a high degree is that they try to become friends with many others in an attempt to increase their own exposure, as a means of advertising. For example, the three users with the highest degree (appearing in the center of Figure 4) are a health and beauty company, an online fitness company, and a documentary film maker. This may have interesting ramifications for governmental or non-profit public health producers, in that it may not be sufficient to simply produce content and upload it to YouTube. Authors likely need to become involved in the community so as to gain social capital and thereby gain exposure for their content. Such involvement can be built from the ground up, or it could take advantage of the novel understanding of the target community provided by the foregoing community mining approach. Indeed, rather than waiting to acquire the needed social capital, authors of public health videos may benefit from tying into established users with high social capital, getting them to upload and/or promote their content.

Additionally, users who are friends but not authors in such communities, and who have high degree, may be highlighted as the prime consumers of the community’s video content. These consumers, in turn, could be observed in terms of their susceptibility to specific messages, or targeted with such messages.

3.2.2 Commenter Community

Another rich source of metadata in YouTube is found in the comments made by viewers on the videos. A commenter community can be discovered by, for example, identifying those users that leave comments on the same videos. Specifically, this commenter community is built by beginning with a set of videos and identifying all users that have made comments on each one.3 Then, a link is made between users that commented on the same videos, where the strength of the link (or the weight of the edge) between two users is the number of videos in the sample on which both users commented. Additionally, thresholds can be used to indicate a link only if the users have commented on at least some number of common videos. This graph can also be restricted to users whose comments occur within a certain distance of each other in the list of comments.

3 In the current API, YouTube returns a maximum of 1,000 comments per video. Even with this limitation, valuable insights can be found, but this limitation should be considered when attempting to generalize from this data.

Due to the number of comments per video, the commenter community can quickly become difficult to visualize if the set of videos is large and the threshold parameters are set low. Figure 6 shows the community of commenters for the anti-smoking videos in Figure 2, restricted to users who commented on at least four common videos. The thickness of an edge is proportional to the number of common videos on which the adjacent users commented.

Figure 7 gives the distribution of the number of comments made by users in this set, as well as the number of unique videos on which these users commented. Again, unsurprisingly, the distribution follows a power law, with most commenters leaving only very few comments behind.

It should be noted that commenter communities do not imply that users feel the same way about an issue, but rather that they are both interested in the issue, and may in reality have opposite views on the topic. The two nodes from Figure 6 with the highest degree are both users promoting their own stop smoking programs, leaving almost identical comments on many videos in the set, encouraging others to follow a profile link. Because these users left comments on so many videos in the set, they have an implicit relationship with a large number of other commenters.
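
The co-commenting construction of Section 3.2.2 can be illustrated with a short Python sketch. It is offered only as an assumption-laden example, not as the code behind Figure 6: `commenters_by_video` is a hypothetical mapping from a video identifier to the names of users who commented on it, and `min_common` plays the role of the threshold on shared videos (four in Figure 6).

```python
from collections import Counter
from itertools import combinations


def commenter_graph(commenters_by_video, min_common=4):
    """Build a weighted co-commenter graph.

    commenters_by_video -- dict mapping a video id to an iterable of user
                           names who commented on it (hypothetical input).
    min_common          -- keep an edge only if the two users commented on
                           at least this many common videos.
    Returns a dict mapping (user_a, user_b), with user_a < user_b, to the
    number of videos on which both users commented.
    """
    weights = Counter()
    for users in commenters_by_video.values():
        # A user counts once per video, however many comments they left.
        for pair in combinations(sorted(set(users)), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w >= min_common}
```

The number of pairs grows quadratically with the number of distinct commenters on a video, which is one practical reason to apply the threshold, or the restriction to nearby comments, before attempting to visualize the graph.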

535 290

1623 16000 781

1645 862 14000 975 1758 2109 1140 766

1733 576 309 1154 259 708 12000 1132 1805

102 175 1683 772 2113 490 389 10000 1978 1927 1595 2013 2019 1587 Users 1062 506 2070 360 994 1279 1870 422 8000 1386 2103 1142 1902 1876 1750 1048 295 90 1091 35 6000 1184 1147 470 858 2091 584 611 1648 1277 461 31 624 30 1084 353 316 4000 431 524 648 2042

1520 68 2065 928 900 1309 2000 1569

49 1443 2176 1900

623 150 0 1821 317 1424 1047 0 5 10 15 20 25 30 1175 987 1502 1864 Total Comments or Unique Videos

1406 1890 2062 1238 1873 2100 621 683 1466 1489 439 747 Figure 7: Distribution of the Number of Unique

122 54

1110 Videos Commented on by Users 1555 2074 1407 1155 1380

1127 838 142 1225 1009 944 1397 2214 415 1860 589 1379 track the comments of a specific user through time to ob- 16 1975 1528 1194 1106 368 321 1628 837 1934 serve a type of path followed by the user. Because the cur-

2043 1314 212 629 609 1695 1240 rent version of the YouTube API does not provide all the 860 33 1880 182 2087 763 comments of a user, this information must be acquired by 508 369 1552 1777 18 1514 1598 736 first identifying a set of videos and then considering all com- 970 334 ments left on those videos. The set of comments can then be sorted by user and comment time, to show the trail of 1074 users through the set of videos.4 This type of user trail can be valuable in two ways. First it can help to further identify characteristics of a single user of Figure 6: Community of Users Who Commented on interest. Second, and perhaps more importantly, it can help at Least Four Common Videos in the Set of Anti- to show trends of what videos users are seeing and how they smoking Videos move through these. An obvious limitation in identifying the trails of users is that not all users leave comments, and Another, more explicit commenter community can also be those that do do not leave them on each video they watch. built by considering directed edges between users that direct Despite this limitation, these trails may still be valuable in comments at one another using the conventional “@user- discovering overall trends and relationships among videos. name” syntax. Figure 8 shows the resulting community of Also, we submit that those that do leave comments are in commenters over the same set of videos as above. Links many cases the ones with the most extreme views on either between users appear only when at least two directed com- side of an issue. Depending on the topic being studied, this ments have been made. The thickness of the links is propor- may actually be more valuable as a way of identifying those tional to the number of times users referenced each other. that are more interested or passionate about the issue. 
The central node in this figure with a disproportionately Using the same set of comments (for the community of high degree is a spammer similar to those found in Fig- anti-smoking videos) discussed above, we identify users’ trails. ure 6. However, in this case, the user left multiple directed In the sample of 186 videos, the maximum number of unique comments to others promoting a political and ideological videos commented on by a single commenter is 27. Ta- agenda, which in many cases elicited antagonistic responses. ble 6 shows the trail of a prototypical commenter through Alternatively, the pairs of users with disproportionately high our community of anti-smoking videos as defined by the weight on the edges between them represent users that main- date/time the comments were authored. The author en- tained long-lasting conversations with one another. gaged in a conversation with other users on videos B and D Additionally, because the explicit commenter network is resulting in returning to leave additional comments on that directed, users can be identified that have a high in-degree, video. The fact that a user returns to the same video to con- representing those at whom many others direct comments. tinue a conversation may result in additional exposure to its These users may have higher social capital in that they have content. Thus, there may be a correlation between high- attracted the attention of many others. In the case of this impact videos and increased conversation in their respective network of anti-smoking videos, the user with the highest comments, either because the video itself drew increased dis- in-degree made a single comment asking the question: “if cussion, or because the increased conversation led to more smoking is so bad why isn’t it illegal?” This elicited the exposure of the message. responses of over 30 other users.

3.2.3 Comment Trails 4This analysis is also subject to the limitation of only being Finally, it is also possible to utilize video comments to able to consider the first 1,000 comments on a video. 540 264 722 285 520 388 596 78 495 The number of topics K is set to 10 to give a high-level 231 705 585 270 688 368 493 450 536 699 731 275 717 309 169 505 354 476 594 237 389 329 sense of the themes dealt with in the corpus, and to simplify 670 556 179 192 749 403 465 468 266 464 746 383 512 218 289 698 730 521 414 641 494 457 86 90 226 71 637 2 277 579 611 351 8 712 analysis. The topics discovered, represented by their most 593 280 366 753 522 685 61 694 549 80 109 326 553 54 736 592 72 132 570 527 639 478 404 146 376 206 154 94 697 577 475 656 48 10 728 427 392 617 551 295 190 prominent words, are given in Table 7. 241 440 429 652 750 117 530 219 49 590 597 305 131 150 755 664 515 418 406 323 174 107 203 224 554 580 439 257 77 98 88 213 668 186 45 411 538 183 308 531 194 422 612 501 709 304 738 254 572 477 210 121 347 557 602 5 247 444 584 635 185 108 103 235 665 115 420 646 4 647 279 25 1 532 104 299 249 666 369 452 714 690 506 209 471 64 571 412 413 340 657 40 18 576 684 537 396 123 46 497 288 721 702 39 Table 7: Ten Topics Inferred on the “Quit Smoking” 357 509 3 189 511 517 455 519 469 616 402 526 291 222 600 63 242 675 541 461 544 633 481 569 330 489 12 661 161 672 628 256 187 346 245 456 6 246 135 654 645 9 14 566 695 410 Videos and Comments 204 482 125 52 674 659 547 92 229 421 350 75 681 581 752 315 603 198 87 342 238 507 435 290 718 486 341 361 488 58 278 148 160 251 582 59 255 561 165 # Weight Top Words 378 575 426 244 463 22 550 292 723 325 158 181 631 658 742 618 43 142 201 419 667 400 605 739 472 560 555 76 568 448 314 261 73 548 321 215 733 642 630 391 533 447 172 16 751 732 0 0.08614 scary videos ur die life dont read fake 362 293 745 331 122 32 322 355 363 598 620 377 614 147 634 26 140 673 609 703 606 573 387 443 682 128 437 234 636 373 543 212 503 230 
156 417 114 395 312 510 217 453 451 648 479 33 175 166 119 586 754 364 post press video ghost lol works com- 352 91 269 711 265 504 676 653 502 663 424 608 446 467 200 473 744 294 483 386 233 134 11 588 632 701 105 704 607 367 178 69 644 258 144 253 484 381 303 15 660 208 385 496 124 384 756 262 307 724 433 82 28 610 333 100 344 574 ment love 145 127 337 686 335 375 56 260 268 202 434 273 345 216 539 399 394 458 182 57 287 498 283 301 349 643 687 741 239 60 559 629 459 252 678 297 95 300 29 188 55 601 320 327 168 118 595 604 725 649 328 379 1 0.13785 video lol watch thumbs videos 460 228 67 83 21 74 700 129 719 441 110 563 113 747 106 528 143 151 693 564 474 227 689 683 220 36 225 627 708 180 19 332 50 276 177 638 153 524 359 720 748 70 306 562 197 726 716 491 358 274 706 425 amir check xd love channel remember 248 565 640 296 196 236 44 27 282 621 514 223 37 691 599 334 65 214 552 431 284 546 655 529 263 401 454 430 62 587 525 336 97 622 518 184 740 101 567 justin 516 259 324 462 371 680 240 650 141 432 66 737 470 583 727 130 360 195 311 313 17 93 626 162 38 542 734 487 137 416 139 466 445 221 338 30 286 438 281 84 545 302 2 0.17116 don people game real time lol make 692 393 534 743 42 68 623 409 267 316 138 485 715 111 152 436 81 34 96 271 415 707 120 159 7 348 207 149 136 20 729 578 710 23 good car video fake guy batman man 232 499 157 99 679 353 47 317 24 651 205 85 310 535 13 508 193 250 112 173 662 319 79 155 405 735 390 523 133 211 380 318 343 677 356 167 171 102 243 298 dont thing 428 116 500 398 51 41 163 339 480 126 31 191 89 370 624 374 696 53 397 589

713 558 382 3 0.02474 de la el es en se si lo por una los video 591 442 164 513 407 408 372 615 619 272 176 613 671 35 199 170 mi le con xd di che tu 365 492 669 423 490 449 625 4 0.25734 people video love baby good life don time god sad feel im man make wow girl Figure 8: Community of Users Defined by the kid dont “@username” Syntax 5 0.10693 people god don world jesus life make re- ligion human time truth things country good Table 6: A User Comment Trail Sorted by Time 6 0.08751 smoking smoke people cancer don weed Date/Time Video quit dont good cigarettes years stop day 06/02/10 10:39 AM A bad 06/19/10 06:47 AM B 7 0.45418 lol funny f? s? f? xd people guy a? haha 06/19/10 06:54 AM B stupid im dont ur man dude gay 06/19/10 07:03 AM C 8 0.01 allah bu ha bir ve ne fap mart de wal ya 06/19/10 07:07 AM C da bean ama mr bi sen ben var 06/19/10 07:19 AM C 9 0.10818 movie love song great good film 06/19/10 07:28 AM D watch movies trailer awesome amazing 07/17/10 05:48 PM E watched music 07/17/10 05:49 PM E 08/03/10 08:32 AM B A clear “quit smoking” topic emerges—Topic 6. How- 09/08/10 12:32 PM D ever, its weight is relatively small (0.08751) indicating that the discussion has diverged substantially from the topic of the starting video. A near-universal of topic modeling on YouTube comments is the presence of an expletives topic. 3.3 Comment Communities In this case, Topic 7 combines expletives with other collo- In addition to considering the users that made comments quial forms such as “lol”, “ur”, and “wtf”. on videos, the text of the comments themselves can be valu- able in discovering the views of content consumers. We turn 4. CONCLUSIONS AND FUTURE WORK to Latent Dirichlet Allocation (LDA) to exploit this text data and gauge the feeling, perception, and reaction of the We have illustrated ways in which community mining tech- public to the messages that are presented. 
LDA is a prob- niques may be applied to YouTube to inform public health abilistic model that, when applied to documents, hypothe- practice. We recognize that we have only scratched the sizes that each document in a collection has been generated surface and that, while tobacco usage provides an intuitive as a mixture of unobserved (latent) topics, where a topic is case study, we have not here produced any significantly new defined as a categorical distribution over words [3]. While knowledge in this area. However, we have showed the poten- not strictly the case, we can usefully regard the set of topics tial and highlighted a number of relevant follow up questions as a community of comments over the video community. that the approach presented here brings to light naturally, As an illustration, we consider the text from video titles, and can help answer in more thorough and focused analyses. descriptions, and comments in the set of 4, 407 videos gath- ered by breadth-first search (d = 3), starting from the video 5. REFERENCES titled “Quit Smoking.” We assemble the title, description, [1] C. L. Backinger, A. M. Pilsner, E. M. Augustson, and comment data as a corpus of documents consisting of A. Frydl, T. Phillips, and J. Rowden. YouTube as a one document for each video containing both its title and source of quitting smoking information. Tobacco its description (if any), plus one document for each video Control, 20(2):119–122, 2011. comment. We use the popular MALLET implementation [2] F. Benevenuto, F. Duarte, T. Rodrigues, V. Almeida, of LDA [23] to automatically discover topics in this corpus. J. Almeida, and K. Ross. Understanding video interactions in YouTube. In Proceeding of the 16th implications for tobacco control. Health ACM International Conference on Multimedia, pages Communications, 25(2):97–106, 2010. 761–764, 2008. [19] D. Knuth. The Art of Computer Programming: [3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Fundamental Algorithms (Vol. 1). 
Boston: dirichlet allocation. Journal of Machine Learning Addison-Wesley, 3rd edition, 1997. Research, 3:993–1022, March 2003. [20] P. Lange. Commenting on comments: Investigating [4] J. Chen, O. Za¨ıane, and R. Goebel. Detecting responses to antagonism on YouTube. In Proceedings communities in large networks by iterative local of the 70th Annual Conference of the Society for expansion. In Proceedings of the International Applied Anthropology, page 2007, 2007. Conference on Computational Aspects of Social [21] M. Linkletter, K. Gordon, and J. Dooley. The choking Networks, pages 105–112, 2009. game and YouTube: A dangerous combination. [5] J. Chen, O. Za¨ıane, and R. Goebel. Detecting Clinical Pediatrics, 49(3):274–279, 2009. communities in social networks using max-min [22] X. Ma, G. Chen, and J. Xiao. Analysis of an online modularity. In Proceedings of the 9th SIAM health social network. In Proceedings of the 1st ACM International Conference on Data Mining, pages International Health Informatics Symposium, pages 978–989. SIAM, 2009. 297–306. ACM, 2010. [6] X. Cheng, C. Dale, and J. Liu. Statistics and social [23] A. McCallum. MALLET: A machine learning for network of YouTube videos. In Proceedings of the 16th language toolkit. http://mallet.cs.umass.edu, 2002. International Workshop on Quality of Service, pages [24] M. Newman. Finding community structure in 229–238, 2008. networks using the eigenvectors of matrices. Physical [7] D. Chiu, P. Ande, R. Coward, and A. Woywodt. The Review E, 74(3):036104, 2006. times they are a changin’—the internet and how it [25] M. Newman and M. Girvan. Finding and evaluating affects daily practice in nephrology. NDT Plus, community structure in networks. Physical review E, 2(4):273, 2009. 69(2):026113, 2004. [8] A. Clauset. Finding local community structure in [26] M. Newman and J. Park. Why social networks are networks. Physical Review E, 72(2):026132, 2005. different from other types of networks. Physical [9] R. Crutzen and J. 
De Nooijer. Intervening via chat: an Review E, 68(3):036122, 2003. opportunity for adolescents’ mental health promotion? [27] H.-J. Paek, K. Kim, and T. Hove. Content analysis of Health Promotion International, 26(2):238–243, 2011. antismoking videos on youtube: Message sensation [10] S. Forsyth and R. Malone. “I’ll be your value, message appeals, and their relationships with cigarette—light me up and get on with it”: Examining viewer responses. Health Education Research, smoking imagery on youtube. Nicotine & Tobacco 25(6):1085–1099, 2010. Research, 12(8):810–816, 2010. [28] J. Paolillo. Structure and network in the youtube core. [11] B. Freeman and S. Chapman. Is “YouTube” telling or In Proceedings of the 41st Annual Hawaii International selling you something? tobacco content on the Conference on System Sciences, pages 156–165, 2008. YouTube video-sharing website. Tobacco Control, [29] R. Santos, B. Rocha, C. Rezende, and A. Loureiro. 16(3):207, 2007. Characterizing the youtube video-sharing community. [12] M. Gould, J. Munfakh, K. Lubell, M. Kleinman, and Available online at http://security1.win.tue.nl/ S. Parker. Seeking help from the internet during ∼bpontes/pdf/yt.pdf, 2006. adolescence. Journal of the American Academy of [30] L. Suzuki and J. Calzo. The search for peer advice in Child & Adolescent Psychiatry, 41(10):1182–1189, cyberspace: An examination of online teen bulletin 2002. boards about health and sexuality. Journal of Applied [13] A. Hayanga and H. Kaiser. Medical information on Developmental Psychology, 25(6):685–698, 2004. YouTube. Journal of the American Medical [31] K. Vance, W. Howe, and R. Dellavalle. Social internet Association, 299(12):1424, 2008. sites as a source of public health information. [14] E. Hossler and M. Conroy. YouTube as a source of Dermatologic Clinics, 27(2):133–136, 2009. information on tanning bed use. Archives of [32] S. White and P. Smyth. A spectral clustering Dermatology, 144(10):1395–1396, 2008. 
approach to finding communities in graphs. In [15] iProspect.com. iProspect search engine user behavior. Proceedings of the 5th SIAM International Conference Technical report, iProspect.com, Inc., April 2006. on Data Mining, pages 274–285, 2005. [16] J. Keelan, V. Pavri-Garcia, G. Tomlinson, and [33] X. Xu, N. Yuruk, Z. Feng, and T. Schweiger. SCAN: K. Wilson. YouTube as a source of information on A structural clustering algorithm for networks. In immunization: a content analysis. Journal of the Proceedings of the 13th ACM SIGKDD International American Medical Association, 298(21):2482, 2007. Conference on Knowledge Discovery and Data Mining, [17] R. Khorasgani, J. Chen, and O. Za¨ıane. Top leaders pages 824–833. ACM, 2007. community detection approach in information [34] W. Zhang. State-space search: Algorithms, complexity, networks. In Proceedings of the 4th Workshop on extensions, and applications. Springer: New York, Social Network Mining and Analysis, 2010. 1999. [18] K. Kim, H.-J. Paek, and J. Lynn. A content analysis of smoking fetish videos on YouTube: Regulatory