Characterizing the YouTube Video-Sharing Community∗
Rodrygo L. T. Santos, Bruno P. S. Rocha, Cristiano G. Rezende, Antonio A. F. Loureiro
Department of Computer Science
Federal University of Minas Gerais
Belo Horizonte, MG 31270-901 Brazil
{rodrygo,bpontes,rezende,loureiro}@dcc.ufmg.br

ABSTRACT
The YouTube video-sharing community is a recent and successful phenomenon that provides an expressive representation of a social network. Despite its accelerated growth, a deep study of YouTube’s topology has not yet been made available. For this work, we have collected a representative sample of YouTube using our Crawlanga tool and analyzed both its structural properties and its social relationships among users, among videos, and between users and videos. We analyze properties such as the profile of users and the popularity of videos in order to highlight the impact of social relationships on a content-sharing network.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining; J.4 [Computer Applications]: Social and behavioral sciences

General Terms
Human factors, Measurement

Keywords
Virtual communities, network sampling, network analysis

1. INTRODUCTION
The last decade has witnessed the emergence of several popularity phenomena through the word-of-mouth and self-publishing made feasible by the World Wide Web. This is true for people, the content they produce, and the vehicles that distribute their production. Some of these phenomena have declined or have been replaced as rapidly as they rose, while others have retained a steady pace of growth.

TIME’s Invention of the Year for 2006 [4], the YouTube¹ video-sharing website is one of the most recent and astonishing examples of such a Web phenomenon. Founded in February 2005, YouTube was officially launched in December of the same year and has not stopped growing since then. By July 2006, the site reported serving 100 million videos per day, with a daily upload of more than 65,000 videos and nearly 20 million unique visitors per month, which amounted to a 29% share of the US multimedia entertainment market and 60% of all videos watched online [12]. Its storage demands were estimated at around 45 terabytes, with bandwidth expenses of several million dollars per month [3]. Within one year of its launch, YouTube was purchased by Google for US$1.65 billion in stock.

YouTube’s success can be seen as an example of the “wisdom of crowds” [14]: the site exerts no control over its users’ freedom for publishing², in such a way that users not only share their videos with a few friends, but also participate in a huge decentralized community by creating and consuming terabytes of video content, ranging from home-made stand-up performances to eyewitness footage of news events as they occur anywhere in the world.

Despite its enormous popularity and the sums of money involved, it is rather surprising that (at least to our knowledge) no study has been carried out to unveil the virtual community behind YouTube.

In this paper, we present an analysis of the YouTube network, based on a sample of it that we were able to collect using a crawler tool. In our analysis we focus on users and videos, and on the attributes of and relationships between them. We observe attributes such as the number of video views, user subscriptions, users’ favorite lists, commenting, and others. We also model the collected network as different networks, including specific views such as a friendship network between users and a network of videos connected by edges that represent being part of the same user’s favorite list.

This paper is organized as follows. In Section 2 we present work on similar networks and virtual communities. We present some background on the YouTube video-sharing community in Section 3. Sections 4 and 5 detail the crawling process and tool, as well as the data sample we used, respectively. Our analysis of attributes and relationships is discussed in Section 6. Finally, we present our conclusions in Section 7.

∗ Data set will be made available in the camera ready version.
¹ http://www.youtube.com
² According to the site policy, copyrighted or inappropriate content is reviewed after being flagged by the community.

2. RELATED WORK
The analysis of structural properties of large networks has received much attention in recent years. Typically, studies include network properties such as degree distribution, diameter, clustering coefficient, betweenness centrality, network resilience, mixing patterns, degree correlations, community structure, network navigation, etc. In this section, we briefly outline some publications on the analysis of large-scale virtual communities, organized as social networks and information networks [11].

Anh et al. [1] compare structural properties of sampled friendship networks from two social networking services (SNSs), namely MySpace³ and Orkut⁴, and the entire topology of the Cyworld⁵ SNS. They uncover a two-period scaling behavior in Cyworld’s degree distribution, in which the exponent of each period corresponds to the exponent of the degree distribution of MySpace and Orkut, respectively. Also, they show how Cyworld’s testimonial network (a subset of its friendship network) presents a degree correlation similar to that of real-life social networks. Friendship network properties are also studied by Kumar et al. [6]. They present measurements on two Yahoo! SNSs: Flickr⁶, a one-million-node photo-sharing community, and Yahoo! 360⁷, a five-million-node social networking website. They present a model of network growth by classifying users in these networks as (1) passive, loner members; (2) inviters, who bring offline friends to form isolated communities; or (3) linkers, who play the role of bridging a large fraction of the entire network in the network evolution.

Liben-Nowell et al. [8] study the formation of friendship links in the LiveJournal⁸ blogging community. They show that, among the nearly 500,000 LiveJournal users with mappable geographic locations (at the level of towns and cities), the probability of two people being friends is inversely proportional to the number of people geographically close to them. Also, they find that this property influences the formation of two thirds of the friendship links among these users and prove analytically that short paths can be discovered in every network in which it is present. Link formation is also investigated by Backstrom et al. [2]. They study the influence of network structural properties on the establishment of community membership links in two large sources of data: the LiveJournal social networking and blogging service, with several million members and explicitly defined membership links, and DBLP⁹, a publication database with several hundred thousand authors, with conferences regarded as proxies for communities. They show that the tendency of individuals to join a community is influenced by both the number of friends they have within the community and how connected these friends are to one another.

Some works on information network analysis have also employed data sets from virtual communities. Newman [10] compares structural properties of the co-authorship networks in publication databases from different areas, including biomedical research, physics, and computer science. He presents results on the mean and distribution of co-authorship degrees and clustering coefficients for these networks and shows the presence of the small world effect in all of them. Kumar et al. [5] characterize the profile of more than one million LiveJournal users with regard to three main dimensions: age, geography, and interests. They show how over 70% of friendship links among these users can be explained by combining these three dimensions. They also investigate the cultural aspect of highly dynamic, local, informal community formation in the blogospace, through the establishment of short-lived reading, posting, and listing relationships among small groups of users.

³ http://www.myspace.com
⁴ http://www.orkut.com
⁵ http://www.cyworld.com
⁶ http://www.flickr.com

3. THE YOUTUBE VIDEO-SHARING COMMUNITY
The YouTube video-sharing community can be seen as a heterogeneous graph¹⁰ with basically two types of node: user and video.

Users can upload, view, and share video clips. Videos can be rated, and the average rating and the number of times a video has been watched are both published. Unregistered users can watch most videos on the site; registered users have the ability to upload an unlimited number of videos. Related videos, determined by the title and tags, appear to the right of the video. In the site’s second year new functions were added, providing the ability to post video ‘responses’ and subscribe to content feeds for a particular user or users. YouTube had (and still has) a lot of traffic coming to the site to view videos, but far fewer users actually creating and posting content [15].

Among all the potential relationships present in the YouTube community, we consider the following in this paper:

• user-user friendship: two users mutually regard each other as a friend;

• user-user subscription: a user subscribes to video feeds from another user;

• user-video favoring: a user adds a video to his/her list of favorites;

• video-video relatedness: a video is regarded as related to another one by YouTube’s search engine.

4. CRAWLING YOUTUBE
Due to the amount of data required to analyze YouTube, using a tool like a web crawler to collect data is a necessity. A web crawler needs to visit the web pages of videos and user profiles. It must be able to follow links representing relationships, like user friendship or commenting, and store the information on visited nodes and followed edges in a format which can be further analyzed. As a large amount of data is needed, the tool must be efficient and scalable.