
A Large-Scale Study of MySpace: Observations and Implications for Online Networks

James Caverlee, Department of [...], Texas A&M University, College Station, TX 77843-3112
Steve Webb, College of Computing, Georgia Tech, Atlanta, GA 30332-0280

Abstract

We study the characteristics of large online social networks through an extensive analysis of over 1.9 million MySpace profiles in an effort to understand who is using these networks and how they are being used. We study MySpace through a comparative study over three different, but related, facets: (i) the sociability of users in MySpace based on relationship, messaging, and group participation; (ii) the demographic characteristics of MySpace users in terms of age, gender, and location, and a study of how these factors correlate with their privacy preferences; and (iii) the text artifacts of MySpace users, which can be used to construct language models that distinguish MySpace users not just by who they say they are but also by the language model they employ. We find a number of surprising results and conjecture several potential research directions based on our observations.

Introduction

Online social networks are the fastest growing phenomenon on the Web, enabling millions of users to discover and explore [...]-based knowledge and engage in new modes of social interaction. Sites such as MySpace and LinkedIn, among others, have grown tremendously in the past few years, garnering increased media and popular awareness.

As online social networks continue to grow, evolve, and develop, an important challenge we face is how to maintain the incredible success of Web 2.0 going forward. There is a growing demand for understanding this new social phenomenon, understanding the process by which communities come together, attract new members, and develop over time, and understanding what it takes to empower online communities with the ability to attract and retain a core of members who participate actively (Backstrom & others 2006; Coleman 1990).

As a step toward these goals, we present in this paper the results of a large-scale study over MySpace, the largest and most active online social network. By studying the characteristics of MySpace, we hope to provide insight into the types of users using these online social networks, how the network is organized, and the important text artifacts that distinguish these users. In particular, we study over 1.9 million real profiles, with an emphasis on:

• The sociability of users in MySpace based on relationship, messaging, and group participation.
• The demographic characteristics of MySpace users in terms of age, gender, and location, and a study of how these factors correlate with their privacy preferences.
• The text artifacts of MySpace users, which can be used to construct emergent language models that distinguish MySpace users not just by who they say they are but also by the language model they employ.

By studying how MySpace users participate in the social network (sociability), how they describe themselves (demographics), and how they communicate their personal interests and feelings (language model), we hope to encourage the development of new models, algorithms, and approaches for the further enhancement and continued success of online social networks.

The core findings of our study are:

• Nearly half of the profiles on MySpace have been abandoned, meaning that the overall growth and explosive rate of user interest in social networks may need to be tempered; but we also identify a large core of active users within MySpace who account for the vast majority of friends, comments, and group activity.
• While young users (in their teens and 20s) are most prevalent on MySpace, women are most prevalent at the youngest ages (14 to 20), whereas men are most prevalent for all other ages (21 and up).
• There are clear patterns of language use for users based on their age, location, and gender, which is useful both for text mining and characterization applications. We identify class-specific distinguishing terms and language model clusters that could be used to identify deceptive users who misrepresent their demographics.
• Overall, the fraction of private profiles is increasing with time, indicating that new adopters of social networks may be more attuned to the inherent privacy risks of adopting a public Web presence. We find that women favor private profiles 2-to-1 over men, and that (perhaps counterintuitively) younger users are more likely to adopt a private profile than older users. We also find that the more connected a user is in the social network, the more likely she is to adopt a private profile.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

The study of social networks has a rich history (Milgram 1967), and the recent rise of online social networks has seen renewed interest in this area. For example, a number of previous studies have examined the nature and structure of online social networks, including social networks derived from blogspaces (Backstrom & others 2006; Liben-Nowell et al. 2005), email networks (Adamic & Adar 2005), online forums (Zhang, Ackerman, & Adamic 2007), and photo sharing sites (Kumar, Novak, & Tomkins 2006), among many others.

With respect to online social networks like MySpace and Facebook, there has been some research interest, but most studies have been on a smaller scale. In one study, researchers analyzed the relationship between a user's profile and [...] over 31,000 Facebook profiles (Lampe, Ellison, & Steinfield 2007). Social [...] has been studied over several hundred Facebook users in (Ellison, Steinfield, & Lampe 2006), and the privacy attitudes of 7,000 Facebook users were studied in (Acquisti & Gross 2006). (Dwyer, Hiltz, & Passerini 2007) surveyed a number of [...]-related issues of over 100 MySpace and Facebook users. Personal information revelation among 10,000 young people on MySpace was studied in (Hinduja & Patchin 2008). One study considered membership formation for 200,000 Orkut members (Spertus, Sahami, & Buyukkokten 2005) and another looked exclusively at the messaging characteristics of 4 million Facebook users (Golder, Wilkinson, & Huberman 2007).

In comparison with previous work, we provide the first large-scale demographic study over millions of real social network profiles with respect to age, gender, and location, and we study how these factors correlate with their privacy preferences. We compare two sampling approaches for extracting social network data, and we provide a unique analysis of text artifacts to distinguish users.

Data and Setup

To study the characteristics of large online social networks, we selected MySpace as our target social network. MySpace is the largest social networking site, the 6th most visited Web destination according to Compete.com, and one that has received a tremendous amount of media coverage. In addition to these appealing characteristics, MySpace is one of the few online social networks that provides open access to user profiles. Many other sites require a user account and, even then, access to the entire social network can be restricted.

On MySpace, as on most online social networks, the most basic element is a profile. A profile is a user-controlled Web page that includes some descriptive information about the person it represents. Profiles connect to other profiles through explicitly declared friend relationships and numerous messaging mechanisms. MySpace allows users to choose between making their profiles publicly viewable (the default option) or private. If a user's profile is designated as private, only the user's friends are allowed to view the profile's detailed personal information (e.g., the user's interests, blog entries). However, a private profile still reveals the user's name, picture, headline, gender, age, location, and last login date.[1]

Since extracting and analyzing all 250 million MySpace profiles would place [...] and network burdens on both MySpace and our local [...], we adopted a sampling-based approach to extract representative samples from MySpace for further study. We consider two approaches, random sampling and relationship-based (or snowball) sampling:

• Random Sampling: MySpace profiles are sequentially numbered and made publicly Web accessible by constructing a URL containing the profile's unique profile ID. Hence, we can randomly sample from the space of all MySpace profiles by randomly generating profile IDs. By construction, we expect a random sample of MySpace profiles to provide perspective on the global characteristics of the entire MySpace social network.
• Relationship-Based Sampling: Unlike random sampling, the second approach leverages the relationship structure of the social network to select profiles. We begin by generating a set of randomly selected seed profiles. We extract the IDs of their friends, add these friend IDs to the queue of profiles to sample, and continue in a breadth-first traversal of the social network. When the queue is empty, we generate a new random profile ID and continue the process. In contrast to the random sampling approach, we expect the profiles extracted through this sampling approach to provide a more focused perspective on the active portion of the social network.

In practice, we collected two representative datasets from MySpace: the Random Dataset using Random Sampling and the Connected Dataset using Relationship-Based Sampling. We wrote two MySpace-specific crawlers (based on Perl's LWP::UserAgent and HTML::Parser modules). Both crawlers disregarded invalid profile IDs (i.e., profiles that were deleted or undergoing maintenance at the time of the crawl) and entertainment profile IDs (i.e., profiles that were associated with bands, comedians, etc.), to focus our collections on active profiles that belong to regular individuals. In June 2006, we deployed ten instances of the Random Sampling crawler in parallel across ten different servers, collecting profiles for about a week. We repeated this setup in September 2006 with the Relationship-Based Sampling crawler. Summary statistics for each dataset are listed in Table 1.

We wrote a custom MySpace parser to extract the name, age, and other pertinent information from each collected profile. Some of these features, like age and gender, are self-described by the owner of the profile, and so these features may or may not be truthful. Other features, like number of friends, are maintained by MySpace and are expected to be correct.

[1] MySpace also provides a few finer-grained privacy mechanisms for limiting IMs, comments, and for blocking specific users.

Dataset      Public Profiles   Private Profiles   Total Profiles   Size
Random       859,347           101,158            960,505          52 GB
Connected    717,337           173,830            891,167          98 GB

Table 1: Summary Statistics for the Two MySpace Datasets
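As a rough illustration of the two sampling strategies described above, the following Python sketch mirrors the procedure in miniature. It is not the crawler used in this study (which was written in Perl): the function names, the `get_friends` callback standing in for fetching and parsing a profile page, and the `seeds`, `budget`, and `max_id` parameters are all assumptions introduced here for illustration.

```python
import random
from collections import deque


def random_sample(max_id, k, rng):
    """Random Sampling: draw k distinct profile IDs uniformly from 1..max_id."""
    return rng.sample(range(1, max_id + 1), k)


def snowball_sample(get_friends, seeds, budget, max_id, rng):
    """Relationship-Based Sampling: breadth-first traversal from seed profiles.

    get_friends(pid) is a hypothetical callback returning the friend IDs
    visible on profile pid (an empty list for private or invalid profiles).
    """
    budget = min(budget, max_id)  # cannot sample more profiles than exist
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    sampled = []
    while len(sampled) < budget:
        if not queue:
            # Queue exhausted: restart from a fresh random profile ID.
            pid = rng.randint(1, max_id)
            if pid in seen:
                continue
            seen.add(pid)
            queue.append(pid)
        pid = queue.popleft()
        sampled.append(pid)
        for friend in get_friends(pid):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return sampled


# Toy friend graph: profiles 5 and 6 have no friends.
friends = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2], 5: [], 6: []}
lookup = lambda pid: friends.get(pid, [])
print(snowball_sample(lookup, [1], 4, 6, random.Random(0)))
```

In the toy graph the traversal visits the connected component of the seed first and only falls back to random IDs once the queue drains, which is why this style of sampling is expected to favor the well-connected part of the network.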

Note that MySpace has a limited validation process; thus, we have no assurances that a self-described 20-year old male from Texas is really who he says he is. Having said that, we do believe there is significant value in studying demographics in the aggregate, and as we will see in the following section, certain text artifacts specific to certain groups on the social network could be used to mitigate deceptive profiles.

Results and Observations

In this section we present the main findings of our study through a series of characterizations: sociability, demographics, language models, and privacy preferences.

Sociability Characterization

We begin our characterization of MySpace by examining the social aspects of users in the network. Since online social networks derive their value from users actively participating in relationships with other users, we are interested to observe to what extent users actually take advantage of these social aspects. To examine this sociability over both datasets, we measure the number of friends, the number of comments, and the number of groups a user participates in. Note that these values are only available for public profiles.

Figure 1: Distribution of Friends: The x-axis is the number of friends a user may have; the y-axis is a count of the number of users with a particular number of friends.

In Figure 1, we present the distribution of the number of friends for both datasets on a log-log scale. For the Random dataset, we see a heavy-tailed distribution: most users have very few friends, but a few users have many friends. Such a heavy-tailed distribution has been observed in a number of related domains, and observing it here is no surprise. What is surprising is the number of users with zero or only one friend: 426,926, or 50% of the public profiles in the Random dataset. Since MySpace provides each new user with a single default friend, we surmise that roughly half of MySpace users created an account and subsequently abandoned it.

In contrast, we see that for the Connected dataset, most users have many friends and are actively participating in the social network. Only 2.5% of the public profiles in the Connected Dataset have zero or one friend. By construction, the Connected dataset favors users with many friends.

Figure 2: Distribution of Comments: The x-axis is the number of comments a user may have posted on her profile; the y-axis is a count of the number of users with a particular number of comments posted on her profile.

To further validate the sociability divide, we show in Figure 2 the distribution of the number of comments posted to a user's profile for both datasets. The commenting feature of MySpace is one of several avenues for users to communicate with other users; comments written to a particular user are posted on that user's profile, so we would anticipate that users with many comments are well-known and active in the social network. Again, we see the heavy-tailed distribution for the Random dataset, whereas the Connected dataset shows more skew, since it is by construction more connected.

Group participation is another metric of the sociability of a social network. While over 80% of the users in the Random dataset (and hence, we can extrapolate, for MySpace as a whole) participate in no groups, we find that slightly less than half of the users in the Connected dataset belong to at least one group and that nearly 20% of users in the Connected dataset belong to at least 8 groups. This evidence further confirms what we observed with the friend and comment measures of sociability: most MySpace users have effectively abandoned their online profiles, but there is a large core of active users within MySpace who account for the vast majority of friends, comments, and group activity.

But who are these active users? In an effort to understand if some users are more likely to be active than others, we considered a number of features, including the age, location, gender, and length of time a profile had existed on MySpace. We find that California and other western U.S. states dominate the total number of profiles on MySpace, but that these differences are minor across the Connected and Random datasets, which means location is not a strong indicator of sociability. Likewise, we find little evidence that a profile's self-declared age or gender impacts its relative sociability. In contrast, we find that the length of time a user has participated is a strong indicator of sociability. To measure participation length on MySpace, we augment our original sampling process. The profile creation date is listed for each profile on a separate blog page linked off the public profile Web page. This requires accessing one additional page per profile sampled. In an effort to avoid burdening MySpace with a doubling of page requests, we sampled a handful of profiles (e.g., profile 10,000, profile 100,000) and their creation dates to create a time series. Since MySpace profile IDs are assigned sequentially, we can interpolate the date of creation for each profile sampled.

Figure 3: Sampling By Date: The x-axis shows buckets of profiles organized by the date of their creation; the y-axis shows the fraction of all profiles created within a bucket's range that were sampled.

Hence, in Figure 3 each point represents a bucket of all profiles created before that date back to the previous point. The y-axis measures the rate of sampling for each bucket. As expected, the random sampling approach samples nearly uniformly from each bucket (one caveat: we see a hiccup at the beginning since the bucket is so small, and at the end because the sampling periods are slightly different). In contrast, the relationship-based sampling used to create the Connected dataset identifies users who joined overwhelmingly at an earlier date. These long-lived users are presumably more plugged-in and active participants in the social network.

Demographic Characterization

In the previous section, we studied the sociability of MySpace users: how active are they and to what degree are they connecting to other users? In this section, we expand our analysis of MySpace to consider the demographics of participants. How old are they? Are they predominantly male or female? Where are they located? The answers to these questions can provide us with added insight into how a social network grows, what features are attractive to certain participants, and other interesting avenues.

Recall that both public and private profiles on MySpace list basic demographic information. We find that nearly all MySpace users (> 99.8%) provide some age, gender, or location information. Only 1,311 profiles in the Random Dataset declare no age, gender, or location; in the Connected Dataset, only 1,203 profiles declare nothing.

Figure 4: Distribution of Ages: The x-axis is the self-reported age on a profile; the y-axis is the fraction of all profiles declaring a particular age.

Figure 4 shows the distribution of ages in both datasets. As expected, MySpace is dominated by the young, with a peak at 17 years of age for the Random Dataset. Nearly 85% of the users on MySpace are 30 or younger. Interestingly, we observe that the Random dataset skews slightly younger than the Connected dataset, indicating that the most active users on MySpace may in fact be users in their 20s. We also observe a peak at the age of 69, presumably either a joke age or an age intentionally selected by users interested in sex to find one another through the age-based search facility available on MySpace (Scalet 2007). We also observe a peak around 100, but we can presume that most of these self-reported ages are false.

          Random     Connected
Male      505,357    440,330
Female    452,240    448,920
Other     2,908      1,917

Table 2: Gender breakdown for each dataset.

In Table 2 we show the gender breakdown for each dataset. The split between male and female is nearly even: 52% male and 48% female in the Random dataset versus 50% male and 50% female in the Connected dataset. The "Other" gender is a placeholder for profiles that list either no gender information or non-standard gender information. In Figure 5 we consider the gender distribution across both datasets. The results are intriguing: women are more prevalent at the youngest ages, whereas men are more prevalent for all other ages (barring a few hiccups at the older end where the data is sparser).

Figure 5: Gender Breakdown by Age, for (a) the Random Dataset and (b) the Connected Dataset: The x-axis is the self-reported age on a profile; the y-axis is the fraction of all profiles of a particular age declaring a particular gender.

Why are women more active participants at younger ages? Perhaps women intentionally self-report a younger age, or men intentionally self-report an older age. Perhaps there are clear gender differences in how users participate in a social network, so that younger women are more attracted to certain social aspects than their male counterparts? These are interesting and open questions that deserve further exploration.

Finally, we studied the self-reported location information for each profile. MySpace users hail from all fifty U.S. states, and a significant fraction come from other countries. Not all profiles list an intelligible location (e.g., "Somewhere Over the Rainbow"), and some list multiple locations (e.g., "Honolulu and Metro DC"), so we built a best-effort parser. Based on our initial analysis, we find that the U.S. is by far the most prevalent location, followed by the United Kingdom and Canada; thus, we shall focus solely on U.S. states for the rest of this study. For the Random Dataset, we find that 77% list a U.S. state in the location, and for the Connected Dataset, we find that 87% list a U.S. state.

In Table 3, we report the top-5 states that are over-represented on MySpace relative to their actual population as well as the top-5 most under-represented states. We measure the relative presence of a state $i$ on MySpace versus its relative share of the U.S. population as:

$$rel_i = \frac{pop_{i,MySpace} \,/\, \sum_j pop_{j,MySpace}}{pop_{i,US} \,/\, \sum_j pop_{j,US}} - 1$$

where $pop_{i,US}$ is the population of state $i$ based on the latest U.S. Census data and $pop_{i,MySpace}$ is the number of profiles in our dataset that declare state $i$ as their location, so that a positive value means the state is over-represented on MySpace (as reported in Table 3). For the Random dataset, we see in Table 3 that California and other western U.S. states are the most over-represented on MySpace relative to their actual population. Southern and mid-west states tend to lag relative to their actual population.

Most Over-represented       Most Under-represented
Hawaii        [+115%]       [...]            [-58%]
California    [+61%]        West Virginia    [-53%]
Washington    [+41%]        Arkansas         [-52%]
Alaska        [+40%]        Missouri         [-49%]
Nevada        [+39%]        South Dakota     [-48%]

Table 3: The states that are most over-represented and most under-represented on MySpace relative to their actual U.S. Census population. [Random Dataset]

We attribute much of this geographic discrepancy to MySpace's initial launch by a California-based company and success with Los Angeles area bands (Rosenbush 2005). Although California accounts for only 12% of the U.S. population, users from California dominate the early adopters of MySpace.

Characterizing Language Models

In our study so far, we have characterized how users participate in the social network (e.g., friendships, comments) and how users describe themselves (e.g., male, 24, from California). In this section, we examine what users are saying on their profiles through an analysis of the "language models" of social network users. Our goal is to understand how language use varies by class. For example, do women express themselves differently from men? Do older MySpace users describe themselves differently from younger MySpace users?

We begin with some definitions. We treat each profile as a sequence of terms drawn from a vocabulary set $V = \{t_1, t_2, ..., t_{|V|}\}$. We consider all terms on a profile that are generated by the user (e.g., "About Me", "Interests"), and we exclude all terms most likely generated by other users (e.g., terms in comments). Following the standard information retrieval approach, we can describe the language model of all profiles as a probability distribution over the terms in the profiles according to a unigram language model:

$$\{p(t)\}_{t \in V} \quad \text{s.t.} \quad \sum_{t \in V} p(t) = 1$$

Terms with high probability are more likely to be observed on a profile than low probability terms. We can compute $p(t)$ as a function of the count $count(t)$ of profiles containing term $t$ relative to the total number of profiles $n$: $p(t) = count(t)/n$. For example, the top-5 most probable terms in the Connected Dataset are: the, and, straight, friends, with. These common terms provide little insight, and hence, we augment the basic language model by identifying class-specific distinguishing terms for classes based on age, gender, and location. Our goal is to identify terms that are more likely to be generated by a certain class of users (e.g., by women).

To identify class-specific distinguishing terms, we rely on an information theoretic measure, Mutual Information, for assessing the importance of a term to a particular class.[2] Mutual Information between a term and a class is defined as:

$$MI(t, c) = p(t \mid c)\, p(c) \log \frac{p(t \mid c)}{p(t)}$$

where $p(t \mid c)$ is the probability that a profile contains term $t$ given that it belongs to class $c$, $p(c)$ is the probability that a profile belongs to class $c$, and $p(t)$ is the unigram language model described above for the probability of term $t$ across all profiles. Letting $count(c)$ denote the count of profiles belonging to class $c$ and letting $count(c, t)$ denote the count of profiles containing term $t$ that belong to class $c$, we have:

$$p(t \mid c) = \frac{count(c, t)}{count(c)} \quad \text{and} \quad p(c) = \frac{count(c)}{n}$$

Mutual Information measures how much information a particular term $t$ tells us about class $c$. Higher MI values indicate stronger association. In this raw form, however, rare terms that by chance happen to occur only in profiles belonging to a particular class will score highly by Mutual Information. Hence, a natural correction is to replace $p(t \mid c)$ with a "smoothed" version that gives every term a non-zero probability of occurrence across all classes:

$$p^*(t \mid c) = \alpha\, p(t \mid c) + (1 - \alpha)\, p(t)$$

where $0 \le \alpha \le 1$. In practice we select a smoothing factor of $\alpha = 0.9$. We can interpret $\{p^*(t \mid c)\}_{t \in V}$ s.t. $\sum_{t \in V} p^*(t \mid c) = 1$ as a class-specific language model.

[2] Since this work is exploratory in nature, we choose to keep all common words (like "the" and "and") that are often eliminated for text mining. In the same spirit, we perform no stemming.

Class-Specific Distinguishing Terms

Given the Mutual Information measure for identifying distinguishing terms, we next explore the language models of MySpace users according to three characteristics: gender, location, and age. Since we are primarily interested in users who are actively using the social network, we report results from the Connected Dataset. Superficially, we see many similarities with the Random Dataset in the presence of distinguishing terms. Note that only public profiles are included in this analysis since the contents of private profiles are hidden.

First, we consider the class distinction by gender (male and female). In Table 4, we report the top-16 class-specific distinguishing terms for profiles declared to be male and for profiles declared to be female. The differences are stark.

Second, we consider class distinction by location for all fifty U.S. states. In Table 5 we report representative results from three states representing distant geographic regions of the U.S.: the south, pacific northwest, and northeast. We see an interesting mix of [...]-specific identifiers (e.g., protestant in Alabama versus catholic in Connecticut), interests (e.g., football in Alabama versus camping in Oregon), and word constructions (e.g., yall versus rad versus sneakers).

Alabama            Oregon          Connecticut
christian          camping         catholic
african-descent    pdx             yankees
tide               hiking          nyc
jesus              northwest       uconn
football           pixies          hispanic
bama               snowboarding    bronx
church             coast           boston
christ             rafting         sox
protestant         floater         nas
gospel             rad             italian
yall               wine            goodfellas
nascar             vegan           sneakers

Table 5: Distinguishing Terms for Three Representative Locations (Ranked by MI): Popular location names (e.g., Birmingham, Portland) within each state are excluded.
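To make the scoring concrete, the following minimal Python sketch ranks terms for one class by smoothed Mutual Information, following the definitions of $p(t)$, $p(t \mid c)$, $p(c)$, and $p^*(t \mid c)$ given earlier. It is an illustration under stated assumptions, not the code used to produce the tables: the function name, the input format (a list of (class label, terms) pairs), and the toy profiles are hypothetical, and we assume the smoothed estimate replaces both occurrences of $p(t \mid c)$ in the MI formula.

```python
import math
from collections import Counter


def distinguishing_terms(profiles, target, alpha=0.9, top_k=5):
    """Rank terms for class `target` by smoothed Mutual Information.

    profiles: list of (class_label, iterable_of_terms) pairs (assumed format).
    Counts are per profile: count(t) is the number of profiles containing t.
    """
    n = len(profiles)
    count_t = Counter()    # count(t): profiles containing term t
    count_ct = Counter()   # count(c, t): profiles of the class containing t
    count_c = 0            # count(c): profiles belonging to the class
    for label, terms in profiles:
        unique_terms = set(terms)
        count_t.update(unique_terms)
        if label == target:
            count_c += 1
            count_ct.update(unique_terms)
    if count_c == 0:
        raise ValueError("no profiles belong to the target class")

    p_c = count_c / n
    scores = {}
    for term, cnt in count_t.items():
        p_t = cnt / n
        p_t_given_c = count_ct[term] / count_c
        # Smoothed estimate: p*(t|c) = alpha * p(t|c) + (1 - alpha) * p(t)
        smoothed = alpha * p_t_given_c + (1 - alpha) * p_t
        if smoothed == 0.0:
            scores[term] = 0.0  # only reachable when alpha == 1
        else:
            scores[term] = smoothed * p_c * math.log(smoothed / p_t)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented for illustration only.
toy_profiles = [
    ("female", ["love", "dance", "the"]),
    ("female", ["love", "the"]),
    ("male", ["guitar", "the"]),
    ("male", ["football", "guitar", "the"]),
]
print(distinguishing_terms(toy_profiles, "female", top_k=2))
```

A term such as "the" that appears in every profile scores close to zero because its class-conditional probability matches its overall probability, which is the behavior that motivates using MI rather than raw term probability.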

Male                          Female
dating           sport        love        people
networking       metal        dancing     life
serious          football     shopping    can
relationships    s***         girl        family
single           wars         hearts      being
straight         band         have        notebook
video            f***         are         dance
guitar           gay          favorite    things

Table 4: Distinguishing Terms by Gender (Ranked by MI)

Finally, we consider how the language model of MySpace users varies by age. In Table 6, we report the distinguishing terms for ages ranging from 16 to 100. We see how the language model shifts in focus with age based on education (e.g., from high school to college to graduate to retired). Also, older members use terms like married, parent, and proud, whereas younger members use terms like single, friend, and love.

16        18        20          25              30              40            60           69            80           100
high      high      college     graduate        networking      parent        parent       networking    scudda       swinger
school    school    someday     college         graduate        proud         proud        swinger       mortenson    our
hearts    someday   student     networking      parent          married       president    sex           gable        kids
junior    love      love        grad            proud           networking    swinger      a**           jeane        capricorn
single    best      straight    professional    married         kids          his          f***          showgirl     networking
best      boy       caucasian   relationship    grad            great         married      rock          asphalt      virgo
hair      ever      white       traveling       professional    our           kids         islander      dimaggio     artists
friend    hair      like        some            art             divorced      united       real          dougherty    their
lol       lol       girl        reading         cure            daughter      began        our           harlow       please
play      single    know        working         travel          years         retired      night         actress      official

Table 6: Distinguishing Terms by Age (Ranked by MI)

We next consider a few notes about the older (and perhaps less truthful) ages. The 69-year olds have a clearly expressed interest in sex. The odd language model of 80-year olds is skewed by the presence of many Marilyn Monroe tribute profiles (who would have been 80 at the time); all of the terms are relevant to her movie career and relationships. The 100-year olds display a less coherent language model, perhaps due to the diversity of users declaring such an age.

Identifying Language Model Clusters

In the previous section, we saw how certain classes of MySpace users can be described by distinguishing terms that are relatively strong indicators of class membership. In this section, we continue this analysis by considering clusters of related classes. For example, given that most self-declared 100-year old members of MySpace are not actually 100, what is their true age? MySpace has made some effort to remove self-declared older members (Scalet 2007) through manual inspection. Can the language models provide us with a scalable solution?

We begin with the class-specific language models of interest (e.g., by age: $\{p^*(t \mid c = 16)\}_{t \in V}$, $\{p^*(t \mid c = 17)\}_{t \in V}$, and so on). Are there clusters of language models by age or by location? In this initial study, we consider a similarity measure for determining the "closeness" of two language models based on the Kullback-Leibler divergence (or relative entropy). KL-divergence measures the difference between two probability distributions $p$ and $q$ over an event space $X$:

$$KL(p, q) = \sum_{x \in X} p(x) \cdot \log \frac{p(x)}{q(x)}$$

Intuitively, the KL-divergence indicates the inefficiency (in terms of wasted bits) of using the $q$ distribution to encode the $p$ distribution. In this case, we can measure the divergence of two class-specific language models (i.e., $p = \{p^*(t \mid c = 16)\}_{t \in V}$ and $q = \{p^*(t \mid c = 17)\}_{t \in V}$). Note that KL-divergence is not symmetric, so we will typically find $KL(p, q) \ne KL(q, p)$.

Figure 6: KL-Divergence by Age: We compare the class-specific language models using KL-divergence (lower means closer).

First, we report the KL-divergence in Figure 6 for 16-year olds versus other ages, for 20-year olds versus other ages, and for 30-year olds versus other ages. Since there are very few profiles listing an older age, we omit these from the graph.

Note that the KL-divergence of 16-year olds is lowest for profiles closest in age, which means the language model of a 16-year old is closest to a 17-year old, then an 18-year old, and so on. A similar pattern holds for the 20-year old language model and for the 30-year old language model. There are clear clusters based on age.

What do we observe when we consider profiles that are more likely to be deceptive about their true age? As an illustration, we show in Table 7 the closest language models for profiles listing an age of 69 and profiles with an age of 100.

Rank    Age 69         Age 100
1.      100 [0.017]    99 [0.047]
2.      99 [0.021]     101 [0.103]
3.      101 [0.047]    30 [0.105]
4.      33 [0.068]     31 [0.105]
5.      31 [0.072]     29 [0.106]

Table 7: Identifying Outliers: Which language model most closely matches the language model of the self-described 69-year olds? And of the 100-year olds? [KL-divergence]

For the 69-year olds, we see that the closest matches are other outlier ages: 100, 99, and 101. This gives us some evidence that the type of user who lies about his age is bound by some common language model cues. The next two closest matches are in their 30s. This is a bit surprising; we would have expected teenagers to be more likely to engage in such behavior. For the 100-year olds, we see a similar pattern: close matches with other outlier ages (99 and 101) and then close matches with younger profiles that are presumably more likely to declare true ages. We believe this line of inquiry could be extended along a number of fruitful directions.

Privacy Preferences

Finally, we turn our attention in this study to the important issue of privacy in social networks. A number of researchers have examined some of the aspects impacting privacy on social networks (e.g., (Barnes 2006; Boyd 2007; Nussbaum 2007)) in an effort to understand users' understanding of privacy, the limits of privacy controls, and so on. In this section, we examine the privacy choices of members of MySpace through the lens of our demographic study. Recall that MySpace users can elect to declare their profile as public or private. A private profile displays only limited information like name, age, gender, and location. Especially young members of MySpace (14 and 15-year olds) are required to have a private profile.

           Random             Connected
Private    101,158 (10.5%)    173,830 (19.5%)
Public     859,347 (89.5%)    717,337 (80.5%)
Total      960,505            891,167

Table 8: Privacy preferences for each dataset.

First, we report in Table 8 the privacy preferences of the randomly selected MySpace users of the Random Dataset (which is intended to reflect the overall MySpace population) versus the privacy preferences of the more sociable members of the Connected Dataset. Members of the Connected Dataset select private profiles by nearly 2-to-1 over the average MySpace user. These findings are especially surprising since the relationship-based sampling technique used to extract the Connected Dataset relies on the friendships declared on public profiles to identify profiles to sample; private profiles reveal no friendships and so the sampling terminates when it arrives at a private profile. We further examined the private profiles in each dataset and found that nearly all (99.9%) of the private profiles in the Random Dataset belong to 14 and 15-year olds (see Table 9). In contrast, we find that 73.7% of the private profiles in the Connected Dataset are of the age 16 or higher.

Random    Connected
[...]

[...] social network, we see a much larger fraction. These results also lend credence to the hypothesis that more sociable members tend to be more likely to choose private profiles.

Figure 7: Privacy Breakdown by Age: The x-axis is the self-reported age on a profile; the y-axis is the fraction of all profiles declaring a particular age that are private. [Connected]

To further explore the impact of demographics, we present in Figure 7 the fraction of private profiles in the Connected Dataset by age and gender. We truncate the graph over the age of 60 since there are very few profiles at those ages and hence we see more noise. We find that women favor private profiles 2-to-1 over men and that (perhaps counterintuitively) younger users are more likely to adopt a private profile than older users. Why is this? Perhaps older users are less technically savvy and have difficulty understanding how to set up the privacy setting; perhaps younger users are more attuned to the privacy and security concerns of social networks. We believe this is an area deserving more attention.

Finally, we consider how privacy preferences have changed over time. In Figure 8 each point represents a bucket of all profiles created before that date back to the previous point. The y-axis measures the fraction of profiles created within that bucket that are private (again, relying on MySpace's use of sequential IDs to interpolate profile creation dates).[3] After an initial drop in privacy rate, we see a fairly steady growth of privacy adoption for new members. Overall, the fraction of private profiles is increasing with time, indicating that new adopters of social networks tend to be more attuned to the inherent privacy risks of adopting a public Web presence. We also investigated privacy preferences by location, but find no dramatic swings from state to state.
14/15 Years Old 101,017 (99.9%) 45,633 (26.3%) All Other Ages (16+) 141 (00.1%) 128,197 (73.7%) Conclusions Total 101,158 173,337 In this paper, we have presented a large-scale study over MySpace in an effort to better understand this new social Table 9: Privacy preferences by age for each dataset. phenomenon. Our comparative study differs from previous

Overall, very few users elect private profiles when given 3We assume the choice of public/private is a one-time decision. the opportunity (00.1%), but of users who actively use the In practice, users can modify their privacy settings at any time. Ellison, N.; Steinfield, C.; and Lampe, C. 2006. Spatially bounded online social networks and . In In- ternational Communication Association. Golder, S. A.; Wilkinson, D.; and Huberman, B. 2007. Rhythms of social interaction: messaging within a mas- sive online network. In Third International Conference on Communities and Technologies. Hinduja, S., and Patchin, J. W. 2008. Personal informa- tion of adolescents on the Internet: A quantitative content analysis of MySpace. Journal of Adolescence. Kumar, R.; Novak, J.; and Tomkins, A. 2006. Structure and evolution of online social networks. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). Lampe, C.; Ellison, N.; and Steinfeld, C. 2007. Profile ele- Figure 8: Privacy Over Time: The x-axis shows buckets of ments as signals in an online social network. In Conference profiles organized by the date of their creation; the y-axis on Factors in Computing Systems. shows the fraction of all profiles created within a bucket’s Liben-Nowell, D.; Novak, J.; Kumar, R.; Raghavan, P.; range that are private. [Connected Dataset] and Tomkins, A. 2005. Geographic routing in social net- works. Proceedings of the National Academy of Sciences work in its scale (over 1.9 million profiles) and in its breadth. 102(33):11623–1162. In particular, we have examined how MySpace users par- Milgram, S. 1967. The small-world problem. ticipate in the social network (sociability), how they de- Today 60 – 67. scribe themselves (demographics), and how they communi- Nussbaum, E. 2007. Kids, the Internet, and the end of cate their personal interests and feelings (language model). privacy. New York Magazine. We have identified a number of surprising and interesting Rosenbush, S. 
2005. News Corp.’s place in MySpace. features that motivate our continuing research. In particular, Business Week. we are interested in augmenting and extending models of social network growth to incorporate the demographic vari- Scalet, S. D. 2007. MySpace cracks down on 69-year-old ations we have observed. Along this line, we believe finer- members. CSO Online. grained language models that move beyond age, gender, and Spertus, E.; Sahami, M.; and Buyukkokten, O. 2005. location to capture user interest and user expectations of the Evaluating similarity measures: A large-scale study in the social network (e.g., for business-development networking, Orkut social network. In Proceedings of the ACM Confer- for making friends) could be beneficial. ence on Knowledge Discovery and Data Mining (KDD). Zhang, J.; Ackerman, M.; and Adamic, L. 2007. Expertise References networks in online communities: Structure and algorithms. Acquisti, A., and Gross, R. 2006. Imagined communities: In Proceedings of the International World Wide Web Con- Awareness, information sharing, and privacy on the Face- ference (WWW). book. In 6th Workshop on Privacy Enhancing Technologies (PET). Adamic, L. A., and Adar, E. 2005. How to search a social network. Social Networks 27(3):187–203. Backstrom, L., et al. 2006. Group formation in large so- cial networks. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). Barnes, S. B. 2006. A privacy paradox: Social networking in the United States. First Monday 11(9). Boyd, D. 2007. Social network sites: Public, private, or what? The Knowledge Tree: An e-Journal of Learning Innovation. Coleman, J. 1990. Foundations of Social . Harvard University Press. Dwyer, C.; Hiltz, S. R.; and Passerini, K. 2007. Trust and privacy concern within social networking sites. In Proceed- ings of the Thirteenth Americas Conference on Information Systems.
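Supplementary sketch. The KL-divergence comparison used to identify language model clusters (ranking other age classes by their divergence from a target class, as in Table 7) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names (language_model, kl_divergence, closest_models) are hypothetical, add-alpha smoothing stands in for whatever smoothing defines p*, the logarithm is taken base 2 to match the "wasted bits" reading, and the toy token lists are invented purely for illustration.

```python
import math
from collections import Counter


def language_model(tokens, vocabulary, alpha=1.0):
    """Estimate p(t | c) over a fixed vocabulary.

    Add-alpha smoothing is an assumption made here so that every term
    has non-zero probability and the KL-divergence stays finite.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """KL(p, q) = sum over x of p(x) * log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


def closest_models(target, models):
    """Rank every other class by KL(target model, other model), lowest first."""
    p = models[target]
    scores = [(c, kl_divergence(p, q)) for c, q in models.items() if c != target]
    return sorted(scores, key=lambda pair: pair[1])


# Toy data (hypothetical, not drawn from the MySpace datasets).
vocab = ["school", "prom", "work", "kids"]
toy_profiles = {
    16: ["school", "prom", "school", "prom", "school"],
    17: ["school", "prom", "school", "work"],
    30: ["work", "kids", "work", "kids", "school"],
}
models = {age: language_model(toks, vocab) for age, toks in toy_profiles.items()}
ranking = closest_models(16, models)

for age, divergence in ranking:
    print(f"age {age}: KL = {divergence:.3f}")
```

Because KL-divergence is not symmetric, closest_models always places the target class in the p position; swapping the arguments would generally give a different ranking score for the same pair of classes.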