A Large-Scale Study of MySpace: Observations and Implications for Online Social Networks
James Caverlee
Department of Computer Science
Texas A&M University
College Station, TX 77843-3112

Steve Webb
College of Computing
Georgia Tech
Atlanta, GA 30332-0280

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We study the characteristics of large online social networks through an extensive analysis of over 1.9 million MySpace profiles in an effort to understand who is using these networks and how they are being used. We study MySpace through a comparative study over three different, but related, facets: (i) the sociability of users in MySpace based on relationship, messaging, and group participation; (ii) the demographic characteristics of MySpace users in terms of age, gender, and location, and a study of how these factors correlate with their privacy preferences; and (iii) the text artifacts of MySpace users, which can be used to construct language models that distinguish MySpace users not just by who they say they are but also by the language model they employ. We find a number of surprising results and conjecture several potential research directions based on our observations.

Introduction

Online communities are the fastest growing phenomenon on the Web, enabling millions of users to discover and explore community-based knowledge spaces and engage in new modes of social interaction. Sites like Bebo, Facebook, MySpace, Orkut, and LinkedIn have grown tremendously in the past few years, garnering increased media and popular awareness.

As online social networks continue to grow, evolve, and develop, an important challenge we face is how to maintain the incredible success of Web 2.0 going forward. There is a growing demand for understanding this new social phenomenon, understanding the process by which communities come together, attract new members, and develop over time, and understanding what it takes to empower online communities with the ability to attract and retain a core of members who participate actively (Backstrom & others 2006; Coleman 1990).

As a step toward these goals, we present in this paper the results of a large-scale study over MySpace, the largest and most active online social network. By studying the characteristics of MySpace, we hope to provide insight into the types of users using these online social networks, how the network is organized, and the important text artifacts that distinguish these users. In particular, we study over 1.9 million real social network profiles, with an emphasis on:

• The sociability of users in MySpace based on relationship, messaging, and group participation.
• The demographic characteristics of MySpace users in terms of age, gender, and location, and a study of how these factors correlate with their privacy preferences.
• The text artifacts of MySpace users, which can be used to construct emergent language models that distinguish MySpace users not just by who they say they are but also by the language model they employ.

By studying how MySpace users participate in the social network (sociability), how they describe themselves (demographics), and how they communicate their personal interests and feelings (language model), we hope to encourage the development of new models, algorithms, and approaches for the further enhancement and continued success of online social networks. The core findings of our study are:

• Nearly half of the profiles on MySpace have been abandoned, meaning that the overall growth and explosive rate of user interest in social networks may need to be tempered; but we also identify a large core of active users within MySpace who account for the vast majority of friends, comments, and group activity.
• While young users (in their teens and 20s) are most prevalent on MySpace, women are most prevalent at the youngest ages (14 to 20), whereas men are most prevalent for all other ages (21 and up).
• There are clear patterns of language use for users based on their age, location, and gender, which are useful for both text mining and characterization applications. We identify class-specific distinguishing terms and language model clusters that could be used to identify deceptive users who misrepresent their demographics.
• Overall, the fraction of private profiles is increasing with time, indicating that new adopters of social networks may be more attuned to the inherent privacy risks of adopting a public Web presence. We find that women favor private profiles 2-to-1 over men, and that (perhaps counterintuitively) younger users are more likely to adopt a private profile than older users. We also find that the more connected a user is in the social network, the more likely she is to adopt a private profile.

Related Work

The study of social networks has a rich history (Milgram 1967), and the recent rise of online social networks has seen renewed interest in this area. For example, a number of previous studies have examined the nature and structure of online social networks, including social networks derived from blogspaces (Backstrom & others 2006; Liben-Nowell et al. 2005), email networks (Adamic & Adar 2005), online forums (Zhang, Ackerman, & Adamic 2007), and photo sharing sites (Kumar, Novak, & Tomkins 2006), among many others.

With respect to online social networks like MySpace and Facebook, there has been some research interest, but most studies have been on a smaller scale. In one study, researchers analyzed the relationship between a user's profile and friendships over 31,000 Facebook profiles (Lampe, Ellison, & Steinfeld 2007). Social capital has been studied over several hundred Facebook users in (Ellison, Steinfield, & Lampe 2006), and the privacy attitudes of 7,000 Facebook users were studied in (Acquisti & Gross 2006). (Dwyer, Hiltz, & Passerini 2007) surveyed a number of trust-related issues of over 100 MySpace and Facebook users. Personal information revelation among 10,000 young people on MySpace was studied in (Hinduja & Patchin 2008). One study considered membership formation for 200,000 Orkut members (Spertus, Sahami, & Buyukkokten 2005) and another looked exclusively at the messaging characteristics of 4 million Facebook users (Golder, Wilkinson, & Huberman 2007).

In comparison with previous work, we provide the first large-scale demographic study over millions of real social network profiles with respect to age, gender, and location, and we study how these factors correlate with their privacy preferences. We compare two sampling approaches for extracting social network data, and we provide a unique analysis of text artifacts to distinguish users.

Data and Setup

To study the characteristics of large online social networks, we selected MySpace as our target social network. MySpace is the largest social networking site, the 6th most visited Web destination according to Compete.com, and one that has received a tremendous amount of media coverage. In addition to these appealing characteristics, MySpace is one of the few online social networks that provide open access to user profiles. Many other sites require a user account and, even then, access to the entire social network can be restricted.

default option) or private. If a user's profile is designated as private, only the user's friends are allowed to view the profile's detailed personal information (e.g., the user's interests, blog entries). However, a private profile still reveals the user's name, picture, headline, gender, age, location, and last login date.1

Since extracting and analyzing all 250 million MySpace profiles would place resource and network burdens on both MySpace and our local infrastructure, we adopted a sampling-based approach to extract representative samples from MySpace for further study. We consider two approaches, random sampling and relationship-based (or snowball) sampling:

• Random Sampling: MySpace profiles are sequentially numbered and made publicly Web accessible by constructing a URL containing the profile's unique profile ID. Hence, we can randomly sample from the space of all MySpace profiles by randomly generating profile IDs. By construction, we expect a random sample of MySpace profiles to provide perspective on the global characteristics of the entire MySpace social network.
• Relationship-Based Sampling: Unlike random sampling, the second approach leverages the relationship structure of the social network to select profiles from the social network. We begin by generating a set of randomly selected seed profiles. We extract the IDs of their friends, add these friend IDs to the queue of profiles to sample, and continue in a breadth-first traversal of the social network. When the queue is empty, we generate a new random profile ID and continue the process. In contrast to the random sampling approach, we expect the profiles extracted through this sampling approach to provide a more focused perspective on the active portion of the social network.

In practice, we collected two representative datasets from MySpace: the Random Dataset using Random Sampling and the Connected Dataset using Relationship-Based Sampling. We wrote two MySpace-specific crawlers (based on Perl's LWP::UserAgent and HTML::Parser modules). Both crawlers disregarded invalid profile IDs (i.e., profiles that were deleted or undergoing maintenance at the time of the crawl) and entertainment profile IDs (i.e., profiles that were associated with bands, comedians, etc.), to focus our collections on active profiles that belong to regular individuals. In June 2006, we deployed ten instances of the Random Sampling crawler in parallel across ten different servers, collecting profiles for about a week. We repeated this setup in September 2006 with the Relationship-Based Sampling crawler. Summary statistics for each dataset are listed in Table 1.

We wrote a custom MySpace parser to extract the name,