PROXIMITY, INTERACTIONS, AND COMMUNITIES IN SOCIAL NETWORKS: PROPERTIES AND APPLICATIONS.

By

Tommy Nguyen

A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY Major Subject: COMPUTER SCIENCE

Examining Committee:

Boleslaw K. Szymanski, Thesis Adviser

Sibel Adalı, Member

James A. Hendler, Member

Gyorgy Korniss, Member

Mohammed J. Zaki, Member

Rensselaer Polytechnic Institute Troy, New York

October 2014 (For Graduation December 2014)

© Copyright 2014 by Tommy Nguyen
All Rights Reserved

CONTENTS

LIST OF TABLES ...... vi

LIST OF FIGURES ...... vii

ACKNOWLEDGMENT ...... ix

ABSTRACT ...... x

1. INTRODUCTION ...... 1
1.1 Ranking Information in Social Networks ...... 2
1.2 Small Worlds and Social Stratification ...... 4
1.3 Summary of Contributions & Organization ...... 6
1.3.1 Organization ...... 7

2. LITERATURE REVIEW ...... 10
2.1 Ranking Techniques ...... 10
2.1.1 Web Conceptualization ...... 10
2.1.2 User Data & Trust Models ...... 11
2.1.3 Learning to Rank ...... 13
2.2 Small-world Problem ...... 15
2.2.1 Six Degrees of Separation ...... 15
2.2.2 Social Stratification ...... 16

3. ANALYSIS ...... 18
3.1 Geography, Co-Appearance, & Interactions ...... 19
3.1.1 Data Collection ...... 19
3.1.2 Notations & Definitions ...... 20
3.1.3 Data Analysis & Results ...... 21
3.1.4 Limitations ...... 24
3.2 Incorporating Geography into Community Detection ...... 24
3.2.1 Percolation Method ...... 25
3.2.2 Modularity Maximization ...... 26
3.2.3 Speaker-Label Propagation (GANXiS) ...... 27
3.3 Contrasting Communities to Null Models ...... 28
3.3.1 Techniques for Generating Covers ...... 29
3.3.2 Measuring Covers & Communities ...... 29
3.3.3 Examining Covers in Gowalla ...... 31
3.4 Examining Detected Communities ...... 33
3.4.1 Network Community Profile (NCP) ...... 34
3.4.2 Link Connectivity Measurements ...... 35
3.4.3 Face-to-Face Interactions Measurements ...... 35
3.5 Application: Social Relationships & Human Mobility ...... 39
3.5.1 Network Congestion in MANETs ...... 41
3.5.2 Mobility Generation ...... 41
3.5.3 Experimental Congestion Design ...... 42
3.5.4 Congestion Simulation Results ...... 43
3.6 Application: Long Ties & Economic Development ...... 44
3.6.1 A Stochastic Model of Economic Development ...... 47
3.6.2 Experimental Results & Discussion ...... 48
3.7 Summary of Results ...... 54

4. SOCIAL RANKING TECHNIQUES ...... 57
4.1 Google Buzz & Twitter ...... 57
4.1.1 Categories of URLs ...... 59
4.1.2 Spreaders & Affected Sets ...... 60
4.1.3 Information Distances ...... 61
4.1.4 Geographical Distances ...... 62
4.1.5 Densities of Social Relationships ...... 64
4.1.6 Keyword Similarity ...... 65
4.2 Social Ranking Techniques ...... 66
4.2.1 PageRank on Social Network ...... 66
4.2.2 HITS on Social Network ...... 67
4.2.3 Ranking with Maximum Flow ...... 68
4.2.4 Variants of Maximum Flow ...... 70
4.3 Social Ranking Experiments ...... 70
4.3.1 Comparing PageRank & HITS ...... 70
4.3.2 Flow Ranking ...... 71
4.3.3 Rank Differences ...... 74
4.3.4 Rank Distributions ...... 76
4.3.5 Rank Validation ...... 77
4.4 Summary of Results ...... 78

5. SOCIAL SEARCHING EXPERIMENTS ...... 81
5.1 Attrition, Geography, & Communities ...... 82
5.1.1 Modeling Attrition ...... 82
5.1.2 Geographical Analysis ...... 84
5.1.3 Detecting Communities ...... 86
5.2 Experimental Design ...... 86
5.2.1 Routing Strategies ...... 87
5.2.2 Starter & Target Selections ...... 88
5.3 Experimental Results ...... 89
5.3.1 Selection & Routing Combinations ...... 89
5.3.2 Friends-of-Friends Knowledge Densities ...... 90
5.3.3 Distributions of Successful Chains ...... 91
5.3.4 Effects of Hubs and Connectors ...... 92
5.3.5 Individual and Community Prominence ...... 93
5.4 Summary of Results ...... 95

6. CONCLUSION AND FUTURE WORK ...... 97

REFERENCES ...... 99

LIST OF TABLES

1.1 Aspects of SNA & applications...... 7

3.1 Data summary of Gowalla network...... 20

3.2 Six techniques for generating covers...... 29

3.3 Measurements for cover C of the size k...... 31

3.4 Detected communities and their sizes...... 34

3.5 Measuring spatial ...... 36

3.6 Measuring face-to-face interactions...... 36

3.7 Network simulator ns-2 parameters...... 43

3.8 Measuring economic development (Gowalla)...... 52

3.9 Measuring economic development (FourSquare)...... 53

4.1 Data summary of Google Buzz...... 59

4.2 Data summary of Twitter...... 59

4.3 Google Buzz (left) & Twitter (right) with geography...... 59

4.4 Social relationships densities in Google Buzz...... 64

4.5 Social relationships densities in Twitter...... 65

4.6 Ranking results of 30 popular URLs in Google Buzz...... 74

4.7 Ranking results of 30 random URLs in Google Buzz...... 75

4.8 Avg. ranking differences in Google Buzz...... 76

4.9 Avg. ranking differences in Twitter...... 76

5.1 Summaries of online social networks datasets...... 81

5.2 Communities detected by GANXiS...... 86

5.3 Prominence of individuals and communities...... 88

5.4 Experimental results for Gowalla...... 88

5.5 Experimental results for FourSquare...... 89

6.1 Aspects of SNA & applications...... 97

LIST OF FIGURES

3.1 Geographical spread of 100K checkins in Gowalla...... 19

3.2 Friendship is bounded by geographical distance...... 21

3.3 Densities of pairs as a function of geographical distance...... 22

3.4 Measuring face-to-face interactions (t=30mins, d=1km)...... 23

3.5 Generating CTA & FTA covers...... 30

3.6 Intra-edge count, boundary-edge count, and geographic diameter of covers...... 32

3.7 Contraction, expansion, conductance, and geographic distance of covers...... 33

3.8 Communities detected by Clique Percolation Method...... 36

3.9 Communities detected by Inference Algorithm...... 37

3.10 Communities detected by GANXiS...... 38

3.11 Measuring face-to-face interactions among members...... 39

3.12 Generating a Markov Model using checkins...... 41

3.13 Design of simulation overview...... 43

3.14 Traffic congestion in FMM and RWP...... 44

3.15 Frequency of pauses using the RWP...... 45

3.16 Scaling laws of short and long ties...... 49

3.17 Face-to-face interactions of short ties and long ties...... 49

3.18 The collective strength of long ties in a simple contagion model...... 50

3.19 Distribution of long ties for adopters and non-adopters...... 51

3.20 Economic development as a function of idea flow (Gowalla)...... 52

3.21 Economic development as a function of idea flow (FourSquare)...... 53

3.22 Speedy idea flow as a function of social diversity...... 53

4.1 Conceptualization of social ranking...... 57

4.2 Categories of popular (a,c) and random (b,d) URLs...... 60

4.3 Shortest paths to URLs in Google Buzz (a) and Twitter (b)...... 61

4.4 Ultra small-world property from starters to information...... 62

4.5 Densities of shortest path lengths from starters to URLs...... 62

4.6 Two degrees of spatial concentration...... 63

4.7 Four dimensions of social relationships...... 64

4.8 CKS for friendship, following, peers, and random pairs...... 65

4.9 Graph G′p for ranking URLs {u1, u2} with respect to node p...... 69

4.10 Ranking URLs on Google Buzz...... 71

4.11 Ranking URLs on Twitter...... 72

4.12 Social ranking with popular URLs on Google Buzz...... 72

4.13 Social ranking with random URLs on Google Buzz...... 73

4.14 Social ranking with popular URLs on Twitter...... 73

4.15 Social ranking with random URLs on Twitter...... 73

4.16 Densities of rank correlation coefficient...... 77

4.17 Ranking quality results...... 77

5.1 Stratification graph of communities in Gowalla...... 83

5.2 Distributions of shortest path lengths & average path lengths...... 84

5.3 Densities of geographical distances...... 85

5.4 Friends-of-friends knowledge densities...... 90

5.5 Path length of successful chains & drop rates...... 92

5.6 Effects of routing to connectors & hubs...... 93

5.7 Prominence of individuals & communities on reachability...... 94

5.8 Prominence of individuals & communities correlations...... 95

ACKNOWLEDGMENT

I would like to thank everyone who mentored me during my undergraduate and graduate studies. This dissertation would not have been possible without their guidance. First, I would like to thank my dissertation chair for his guidance, ideas, and intellectual contributions to this dissertation. From seeking research problems to career planning, he was always encouraging and supportive throughout my graduate studies. To quote a previous graduate student, “his pleasant and friendly personality made this graduate study more enjoyable.” I would also like to thank the committee members for providing their feedback and helping me organize the structure of this thesis.

Second, I would like to thank the entire staff of the CS department. Ms. Coonrad and Ms. Hayden are always responsive to my questions regarding classes, graduation requirements, etc., even when there are hundreds of questions from other students. Mr. Lindsay is always around and ready to help whenever a server crashes. It was always a pleasure to interact with them throughout my graduate studies.

Last but not least, I would like to acknowledge the graduate students and postdocs in our center and in the computer science department. Some of them are talented scientists and experts in their areas of research; others are going to become experts one day. They make me feel proud of being a member of our center and an alumnus of the university.

ABSTRACT

Social network analysis, in the form of graph theory, where nodes represent humans and edges represent social relationships between humans, has a wide range of applications in information science, political science, social science, economics, etc. The availability of data from location-based social media such as Gowalla and FourSquare has helped scientists model and analyze human relationships and their interactions. In this thesis, we use such data to analyze multiple dimensions of social relationships in terms of three specific aspects: geographical proximity of nodes, their face-to-face interactions, and the structure of their communities. Then we incorporate these three aspects of social relationships into the following applications.

First, we propose techniques for analyzing human relationships in terms of geographical proximity, face-to-face interactions, and communities. We show how geographical proximity shapes the structure of the social network by limiting face-to-face interactions among distant users. We also incorporate geographical locations that users visited into a few community detection algorithms for the purpose of detecting communities whose members are on average separated by a few friendship links, are close to each other geographically, and are likely to interact with each other face-to-face. These aspects of social network analysis allowed the study of the first two applications: human mobility patterns and the spread of ideas.

Second, we use URLs that people share with their followers on social media to personalize the ranking of information by looking at who follows whom, the geographical location of the users, and the structure of their detected communities. This allows us to analyze how social media tunnels the flow of information in the network. More importantly, personalized ranking based on these aspects allows users to see information through the eyes of other users whom they consider important (neighbors, friends, peers, etc.) and provides an opportunity for them to interact with information that was used by the people they care about, resulting in the third application studied in this thesis.

Finally, we replicate the small-world experiment by emulating the process of searching for targets by routing a folder among their acquaintances. Geographical information and community structure allow us to selectively choose starters and targets based on the knowledge of where users are located and to which community they belong. In addition, we examine various routing strategies based on geographical proximity and community structure that were likely used by participants in the small-world experiment to reach a target. In doing so, we discover which combinations of routing strategies and selection techniques are likely to make the small-world experiment successful in terms of the small number of hops required to reach the target and the percentage of such successful chains, resulting in the last application studied in this thesis.

CHAPTER 1 INTRODUCTION

Social network analysis examines human relationships in terms of graph theory, where nodes represent humans and edges represent their social relationships. In addition, social network analysis can also examine the geographical proximity of the nodes, their face-to-face interactions, and the structure of their detected communities. This thesis examines these three aspects of social network analysis in detail.

Within the last five years, the proliferation of smartphones has provided a new type of social networking where people can share their current location with their friends and tag the activities that they are doing. This new type of social networking has provided a much richer dataset of human behavior because geographical locations and face-to-face interactions were not previously available. More importantly, this new type of social networking connects the digital world with the physical world: physical aspects of human behavior such as proximity and face-to-face interactions are recorded and shared instantly.

Before location-based social media, scientists used CDRs (call detail records) of telephone companies to study spatial properties, infer friendship topology, and guess face-to-face interactions. However, a problem with CDRs is that call volume is not a good proxy for friendship because people can make phone calls to order food, request technical support, seek medical help, and so on. More importantly, using calling patterns to infer friendship is biased towards strong ties, since weak ties are by definition those that are contacted infrequently; hence using CDRs to infer friendship leaves out an important dimension of social relationships in the study of social network analysis.

Therefore, location-based social media is valuable for the study of social network analysis because it provides a network that is embedded into physical space (the surface of the earth) and whose nodes (humans) are constantly moving. In addition, the links have different characteristics depending on the frequency of interactions. The questions that immediately arise are what the ramifications of this type of embedding in physical space are, and what roles ties (weak/strong, long/short) play in human behavior. The collection of data from Gowalla and FourSquare allows the investigation of these issues, which are studied in detail in this thesis.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55. Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).

Chapter 3 addresses the issue of face-to-face interactions and finds that friendship still requires both face-to-face interactions and geographical proximity. Moreover, the desire to interact face-to-face motivates strong ties to travel together, which affects human mobility patterns, with ramifications for transportation traffic and wireless bandwidth infrastructure management (one of the applications studied in Chapter 3). However, this does not mean that weak ties are unimportant. The last section of Chapter 3 shows that weak ties that are geographically distant tunnel the flow of ideas and are a strong predictor of economic development in the US in terms of GDP, patents, and startups.

Chapter 4 returns to strong ties and examines the social influence that people have on each other in terms of interests, geographical distance, and communities. Chapter 4 explores this influence to improve the relevancy of responses to queries by individualizing them for the users based on the ranking of web pages shared on social networks. The evidence of increased relevancy reported in this thesis may indicate the level of influence that friends exert on the interests of others.

Chapter 5 expands the last section of Chapter 3 by examining how the spatial embedding of social networks, long-distance ties, and communities underlie strategies of social search. These aspects are used to examine whether social networks are small-world, stratified, or both simultaneously. Results show that while social networks have small topological path lengths, there is no evidence that people with limited knowledge can find a designated target within a small number of hops when attrition is completely eliminated.

1.1 Ranking Information in Social Networks

Over the last decade, scientists examined the structure of the web [1]-[4] and proposed algorithms to rank web pages based on significance and relevance to a given query [5]-[9]. A conceptualization of the web is to look at patterns in the topology of hyperlinks connecting web pages to separate prominent websites that serve as authorities for trusted information from malicious pages created by spammers [1]. This conceptualization of the web eliminates the complexity of textual analysis and creates a potpourri of information that gets incorporated into search engines or other information retrieval systems for the purpose of finding information on personal computers, mobile devices, and any other computing platforms [10]. In the case of a search engine, billions of web pages containing rich context of information are organized so that end users can find their target quickly. Thus, this need for speed makes ranking crucial in information retrieval systems. Ranking also has many other applications in the social sciences, such as the citation analysis of legal and scientific documents [11].

Advances in social network analysis and the proliferation of online social media have provided a different perspective for examining ranking [12]-[18]. Algorithms used for ranking and organizing information in hybrid networks such as social search engines show promising improvements when social network analysis is incorporated into them; for example, incorporating personal information containing social relationships on G+ for personalizing search results on Google. As social media continues to expand, we want to be able to use techniques from social network analysis to personalize the ranking of information for a given user. This is important because social relevance allows users to see information through the eyes of other users whom they consider important and provides an opportunity for them to interact with the information accessed by the people about whom they care.

Social media such as Twitter and Google Buzz can be characterized as web services that allow users to share information with their followers. While a lot of research has been devoted to examining text in hashtags and messages [19]-[21], we focus on URLs because information contained in URLs is not restricted by length limitation, is less likely to be informally written, and contains less slang and fewer abbreviations. Analyzing URLs provides a unique opportunity to infer the interests of users based on their reading habits. We assume that URLs shared by people concentrate on selected topics of interest to them. It is important to notice that our purpose here is not to rank a set of URLs based on a given query but instead to rank a set of URLs based on whether we think a user is likely to engage with the information contained within the URLs. Such engagement could be clicking, commenting, re-sharing, and spending time reading them.

The problem we want to solve is to provide a framework for ranking URLs shared on social media based on social relationships, where some of the URLs are ranked higher if they are shared via certain types of social relationships. The social relationships we examine for ranking URLs include but are not limited to neighbors (nodes that are within geographical proximity [22]) and peers (nodes that are within a detected community [23]). The literature review on this subject is provided in Chapter 2 (Section 1) and the contribution is discussed in Chapter 6.

Some data-driven questions that we examine are whether pairs of users that are geographically close are more likely to have similar interests than pairs that are distant, and whether reciprocal relationships have higher keyword similarity in web pages than non-reciprocal relationships. Other related questions that we explore concern the densities of friends, peers, neighbors, and people with similar interests, since these social relationships are the building blocks for understanding social relevance.

1.2 Small Worlds and Social Stratification

Data scientists have recently calculated the distribution of the shortest path lengths between randomly selected pairs of users in online social networking sites and confirmed that the majority of people are on average within six degrees of separation (e.g., 4.7 in Facebook [24], 2.7 in MySpace [25], 4.2 in Twitter [26], and so on [27]). However, empirical research in social stratification, such as racial segregation and income inequality, undermines the premise that we live in a small world where there are short paths connecting people with culturally and economically diverse backgrounds. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful in replicating the small-world experiment with high success rates when they attempted to reach a high-income target starting from a low-income person, suggesting that the world we live in is divided by wealth as a result of income inequality.

Before the availability of data from online social networking sites, Milgram and his colleagues performed an experiment to demonstrate the small-world phenomenon by recruiting randomly selected starters from Nebraska and Oklahoma to reach a broker in Boston [29]. In their experiment, starters were asked to mail a folder to an acquaintance who was known to them on a first-name basis and who would be likely to reach the target using the least number of hops. The process repeats until the folder reaches the target or its current holder drops out of the experiment for lack of qualified acquaintances or unwillingness to participate. Hence, the expected number of hops required for a starter to successfully reach a target is an upper bound on, and also a loose estimate of, the length of the shortest path connecting them. Travers and Milgram reported that 64% of the chains successfully reached the designated target within 5.2 hops [29], suggesting that the diameter of the network of social connections is small.

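
The statement that the length of a completed chain is an upper bound on, and only a loose estimate of, the shortest path length can be illustrated with a small sketch. The toy network, the one-dimensional positions, and the forwarding rule (pass the folder to the unvisited acquaintance geographically closest to the target) are assumptions made purely for illustration; they are not the data or the routing strategies studied in this thesis.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Breadth-first search: fewest hops from source to target, or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None


def forward_chain(graph, position, source, target):
    """Hops used when each holder forwards the folder to the unvisited
    acquaintance closest to the target (a hypothetical rule).
    Returns None if the chain stops before reaching the target."""
    holder, visited, hops = source, {source}, 0
    while holder != target:
        candidates = [n for n in graph[holder] if n not in visited]
        if not candidates:
            return None  # no qualified acquaintance left: the chain is dropped
        holder = min(candidates,
                     key=lambda n: abs(position[n] - position[target]))
        visited.add(holder)
        hops += 1
    return hops


# Hypothetical toy network: f is a distant acquaintance of both s and t.
graph = {
    "s": ["a", "f"], "a": ["s", "b"], "b": ["a", "c"],
    "c": ["b", "t"], "t": ["c", "f"], "f": ["s", "t"],
}
position = {"s": 0, "a": 4, "b": 6, "c": 8, "t": 10, "f": -5}

chain_hops = forward_chain(graph, position, "s", "t")
shortest = shortest_path_length(graph, "s", "t")
```

In this toy example the forwarded chain takes four hops, although a two-hop path exists through the distant acquaintance f, which a holder with only local geographical knowledge has no reason to choose.
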
Travers and Milgram reported that 64% of the chains successfully reached the designated target within 5.2 hops [29], suggesting that the diameter of the network of social connections is small. The problem we want to solve is finding out whether the network of our so- cial connections is small, stratified, or both simultaneously. We want to investigate this problem by replicating the process of routing a folder from selected starters to randomly chosen targets by using data containing geographical locations and so- cial relationships of hundreds of thousands of users from location-based social media. The advantage of incorporating large-scale and multi-dimensional data into the small- world experiment is that many aspects of the experiment can be controlled such as determining how to strategically route a folder between acquaintances and having real data on who is actually connected to whom for hundreds of thousands of users. Un- like other social experiments requiring incentives for human subjects to participate, we can control the effect of participation by supposing that everyone who receives a chain letter participates in the experiment once, since long chains are not likely to exist when the average participant rate is 37% [30] (e.g., 0.375 < 0.01) reported by Dodds et al. These advantages from the data help us focus on how two factors of the experiment, geographical locations and community structure of users’s connections, make it possible for social networks to be either small-world, stratified, or both simul- taneously. These aspects of geographical proximity and community structures allows us to strategically route a folder between their acquaintances and also select starters and targets based on geographical distance or by a fixed number of community hops connecting them. We used community detection algorithms to partition a social network so that 6 starters and targets can be selected in the following ways. 
We define the network distance from community of the starter Cs to the community of the target Ct as the length of the shortest path connecting nodes from Cs to Ct. The question we ask is how many hops does it take to reach a target t originating from a starter s if the length of the shortest path connecting their communities is fixed at k? When k ≈ 0, we expect to capture the small-world phenomenon where it is easy to find short paths connecting people together. On the other hand, when k >> 0, we expect that while there might exist short paths connecting people together, it is much harder to find them with limited information available to the participants due to the stratified nature of society where some people have little compare to others, making it difficult for people to reach targets outside of their communities and social class. Beside the debate between whether we live in a small world or stratified one, the techniques that were used by the participants in the experiment to select an acquain- tance have practical applications in rescue and search operations [31] and job searching via personal contacts [32]. Dodds et al. reported that such successful techniques used by the participants including forwarding the folder to a selected acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%), and others (6%) based on geographical proximity and occupation “for at least half of the decisions” [30]. In addition, the results from the small-world experiment led to an avalanche of network models that have certain properties resembling real social networks such as the short diameter and high clustering coefficient [33]. The literature review on this subject is included in Chapter 2 (Section 2) and the contribution is discussed in Chapter 6.
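
The network distance between the community of the starter and the community of the target can be computed with a multi-source breadth-first search. The sketch below is one possible reading of the definition above, in which the distance is the fewest friendship links on any path from a node of Cs to a node of Ct; the function name, the toy graph, and the three communities are hypothetical and serve only as an illustration.

```python
from collections import deque


def community_distance(graph, source_community, target_community):
    """Fewest links on any path from a node in source_community to a node
    in target_community; 0 if the communities share a node, None if no
    path exists."""
    targets = set(target_community)
    # Every member of the source community starts at distance 0.
    dist = {node: 0 for node in source_community}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if node in targets:
            return dist[node]
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None


# Hypothetical toy network: a chain of six users split into three communities.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
community_a, community_b, community_c = [1, 2], [3, 4], [5, 6]

k_same = community_distance(graph, community_a, community_a)  # k = 0
k_near = community_distance(graph, community_a, community_b)  # adjacent communities
k_far = community_distance(graph, community_a, community_c)   # distant communities
```

Fixing k in this way makes it possible to group starter and target pairs by how far apart their communities are before any folder is routed.
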

1.3 Summary of Contributions & Organization

First, this thesis collects terabytes of data that users shared on social media and analyzes their relationship dynamics in terms of three specific aspects: geography, face-to-face interactions, and communities. Such data allows us to analyze human behavior in terms of social network analysis, such as the interplay between interactions, geographical proximity, and community structure. An example of an interesting behavior we notice is that the creation of a friendship between two people is more likely to occur when they are geographically close, and friends-of-friends are also more likely than not to be within proximity of each other. Also, geography has an effect by limiting users' face-to-face interactions as well as their interests in terms of what they read on social media. For more details on data analysis of human behavior and their social relationships, see Chapter 3.

Second, this thesis proposes techniques for incorporating social relevance into the process of ranking URLs. Personalized ranking results using variants of network flow are highly independent from PageRank. The four dimensions of social relationships that we use for ranking URLs are friends, neighbors, peers, and users with similar interests. Results from the experiments show that social relevance can improve ranking quality by up to 19% compared to the baseline and 5% compared to PageRank. For more details on the personalization of information, see Chapter 4.

Third, this thesis examines the effects of social stratification in the small-world problem. Results show that while using geographical and community information in modeling social routing for the small-world problem is more realistic than using either one alone, average path lengths are 3 times longer than in the Travers-Milgram experiments when attrition is eliminated. Community distance is more effective and robust at predicting the probability of reaching targets than geographical distance in terms of average path lengths and percentage of successful chains. Finally, results show that prominent targets and targets in prominent communities can be reached much quicker than average. Our results can be summarized as follows: the small-world property holds for the prominent, but everyone else is lost in the crowd except when reached by members of their own community. For more details on the effects of stratification in searching for people, see Chapter 5.

1.3.1 Organization

Table 1.1: Aspects of SNA & applications.

                      Geography        Interactions      Communities
Human Mobility        Congestion       Communication     Group
Spreading Ideas       Long Ties        Weak Ties         Bridge Ties
Personalized Ranking  Geo. Influence   Peer Influ.       Collective Influ.
Small-world           Selection        Cognitive Biases  Routing

The organization of this thesis can be summarized by using Table 1.1. The three aspects of social network analysis are geographical proximity of nodes (Chapter 3 Section 1), their face-to-face interactions (Chapter 3 Section 1), and the structure of their communities (Chapter 3 Section 2). The four applications studied in this thesis are human mobility & congestion modeling (Chapter 3 Section 5), spreading ideas & economic development (Chapter 3 Section 6), personalized ranking (Chapter 4), and the small-world experiment (Chapter 5). Each element in Table 1.1 describes how the corresponding aspect of social network analysis can be used to analyze the corresponding application.

For the first application (human mobility), geography in terms of the geographical proximity of friends shows that human mobility traces can be used to study wireless bandwidth infrastructure management and, as we see later, network congestion in mobile ad-hoc networks is concentrated in a few geographical locations, which affects throughput. Later, in Chapter 3 Section 5, face-to-face interactions are treated as analogous to establishing wireless connections, since the purpose of establishing connections in wireless networks is to communicate, and establishing a connection is only possible when nodes are within geographical proximity, just like face-to-face interactions. Last but not least, this can be extended to incorporate communities, where mobility traces are simulated for groups of nodes that belong to the same community and move together.

For the second application (spreading ideas), geography plays a role in distinguishing between short and long ties, where the effects of long ties are examined in simple contagion models for the purpose of measuring the economic development of large geographical areas. The analysis of face-to-face interactions shows that long ties are especially weak. In addition to long ties, ties that connect different communities are also examined in Chapter 3 Section 6.

For the third application (personalized ranking), three elements are incorporated into the process of ranking URLs. Geography allows selecting users based on geographical distance (neighbors). Reciprocal interactions in terms of social relationships (friends instead of followers) allow us to select nodes based on their interactions. Last but not least, community structures allow us to select nodes that belong to the same community.

For the last application (small-world), geography allows selecting a starter and a target in the simulations based on their geographical distance. Face-to-face interactions could affect the statistics of average path lengths because the folder holder is likely to pass the folder to the next holder based on the number of their interactions, independent of the target. Finally, community structures allow the nodes in the simulations to pass the folder based on community awareness.

CHAPTER 2 LITERATURE REVIEW

This chapter provides a literature review on ranking techniques and the small-world problem.

2.1 Ranking Techniques

The literature review on ranking techniques is broken down into three parts. The first part looks at the conceptualization of the web (Sec. 2.1.1), the second part looks at incorporating more sources of data and modeling trust (Sec. 2.1.2), and the third part looks at data mining techniques for learning how to rank (Sec. 2.1.3).

2.1.1 Web Conceptualization

Early search engines rated information on the web by using the text embedded in the page rather than the hypertext, which contains information invisible to end users. Previous work in the ranking of web pages incorporated both text and hypertext to determine the rank of a page, since hypertext by itself does not contain information related to the query, and a large amount of information in the text does not mean the page is authoritative [34]. In a sense, ranking pages by counting the number of inlinks is like voting, where the number of inlinks is the number of votes for a page, and additional textual analysis can be applied to a query to retrieve a subset of related pages ranked by the number of votes.

Advances came from Page and Brin when they devised an algorithm now known as PageRank to capture not only the number of inlinks, as in voting, but also the quality of those links [5]. The initial score of a web page is equal to 1/n0, where n0 is the number of pages containing a link to that page. At the first iteration, each page sends its score divided by the number of its links pointing to other pages. Then each page replaces its current score with the sum of the scores that were sent to it by the pointing links. The process of sending and updating scores repeats until convergence

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE, (under review).

or a pre-defined number of iterations is reached. The final scores determined by PageRank are used to rank pages across the web graph.

Kleinberg proposed a ranking algorithm known as HITS (Hypertext-Induced Topic Search) based on the idea that good hubs point to good authoritative pages and vice versa [35]. This query-dependent algorithm first retrieves a subset of pages that are related to a query. Then it applies an update technique to recalculate the scores of hubs and authorities, and the algorithm uses the scores of the authorities to rank the pages. Initially, the score of an authority is the number of backlinks coming from hubs, and the score of a hub is the sum of the scores of the authorities that it points to. At the second iteration, the algorithm updates the score of an authority by taking the sum of the scores of the hubs pointing to it. The score-updating process is then repeated, and the algorithm stops after reaching some number of iterations.

The Stochastic Approach for Link-Structure Analysis, abbreviated SALSA, was proposed by Lempel and Moran; in it, two independent random walks are applied to a bipartite graph consisting of hubs and authorities [2]. Instead of repeatedly calculating and updating scores for hubs and authorities as is done in HITS, the number of times a page is visited by the surfer in the random walk is used to infer the quality of the pages. The TKC (tightly knit community) effect is shown to occur when every hub points to every authority, forming a tightly knit community of hubs and authorities: such communities of web pages are scored relatively high even though some of their pages are not authoritative or relevant to the topic.
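The HITS update described above can be illustrated with a minimal sketch. It is not the implementation from [35]: the dictionary-based graph, the all-ones initial hub scores, the fixed iteration count, and the per-iteration normalization are assumptions made here for illustration.

```python
from math import sqrt

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}    # assumed starting point: every hub scores 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages pointing to q.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum of the authority scores of the pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Normalization (an assumption of this sketch) keeps scores bounded.
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Toy graph: three hubs, two authorities.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1"]}
hub, auth = hits(toy_links)
```

With all hub scores starting at 1, the first authority update equals the number of backlinks, which matches the initial authority score described in the text. In the toy graph, a1 ends with a higher authority score than a2, and h1 ends with a higher hub score than h2.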

2.1.2 User Data & Trust Models

While link analysis of the web structure is a powerful tool for ranking pages, further algorithms and ideas emerged from different sources of data in which additional information about end users is taken into consideration. For instance, how long on average do users stay on a page, and how often are two pages visited consecutively? BrowseRank was proposed to capture the number of page visits and the amount of time a user stays on a page, modeled as a continuous-time Markov process [8]. Another technique relies on the principle of isolation, that is, the disconnection of trustworthy pages from spam pages, where trust is propagated from trustworthy pages to other trustworthy pages [6]. EdgeRank was proposed by researchers from Facebook to consider the interactions of two people or social associates when ranking updated messages, photos, URLs, etc. on the news feed [36]. Last but not least, the annotations of web pages created by users on Delicious are used to rank pages in SocialSimRank by considering the structure of annotators and annotated pages [12].

BrowseRank, proposed by Liu et al., uses personal data to rank pages through the browsing graph, in which vertices represent visited pages and edges between vertices represent transitions from one page to another [8]. The novelty in BrowseRank is that it incorporates the amount of time an average user stays on a page, which is an indicator of the page's quality that cannot be captured by discrete-time link analysis techniques such as PageRank, HITS, and SALSA. As the authors also mention, the web graph is not the most reliable source of data because of its large size and decentralized architecture: spammers create link farms to increase the visibility of their pages, and webmasters constantly change the content of their pages. Empirical results suggest that BrowseRank outperforms PageRank when independently hired researchers evaluate the ranked pages according to a linear combination of relevance and importance.

The TrustRank algorithm proposed by Gyongyi et al. relies on the principle of isolation, under the assumption that trustworthy pages are unlikely to link to spam pages [6]. Seed detection is a process that determines a small set of pages to be evaluated, where these pages are likely to point to other trustworthy pages. First, the small set of seed pages is evaluated by using an oracle function to determine whether a page is trustworthy or not. In practice, the oracle function represents human judgment and would be too costly to use on a large set of pages. Second, each trustworthy page propagates its trust to the pages that it points to, and the value of the trust is divided equally among all pointed pages. The propagation process repeats until convergence or until some predefined number of iterations is reached.

Additional advances came from the interest of Facebook in ranking items such as photos, messages, URLs, etc. on each individual news feed. In EdgeRank, the affinity score of two users, the weight of the posted item, and time decay are taken into consideration when ranking items on personalized news feeds [36]. The affinity score of the viewing user and the item creator is calculated by looking at their online interactions; the more they have interacted, the more likely the item is to be shown or ranked higher. Time decay decreases the relevance of a posted item as time goes on, and the edge weight increases the score of items that have a high level of potential interaction, such as photo albums, messages embedded with URLs, etc.

In addition to EdgeRank, Bao et al. proposed SocialSimRank, which uses social annotations on Delicious to rank pages according to the observation that popular pages are annotated by up-to-date users and up-to-date users annotate popular pages [12]. The novelty of SocialSimRank comes from using the annotations of users to match search queries to the corresponding annotated pages and applying the PageRank algorithm to the annotated pages as a means to rank pages according to the view of the annotator.
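The trust propagation step described for TrustRank above can be sketched as follows. This is an illustrative sketch, not the implementation from [6]: the `damping` factor that mixes propagated trust with the seed distribution, the toy graph, and the function name `propagate_trust` are assumptions introduced here.

```python
def propagate_trust(links, good_seeds, damping=0.85, iterations=20):
    """links: dict page -> list of pages it points to.
    good_seeds: non-empty set of pages the oracle judged trustworthy."""
    pages = set(links) | {q for ts in links.values() for q in ts} | set(good_seeds)
    # Seed distribution: trust mass spread uniformly over the good seeds.
    seed = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(seed)
    for _ in range(iterations):
        received = {p: 0.0 for p in pages}
        for p, targets in links.items():
            if targets:
                share = trust[p] / len(targets)   # split equally among out-links
                for q in targets:
                    received[q] += share
        # Assumed damping: part of the trust always flows back to the seeds.
        trust = {p: damping * received[p] + (1 - damping) * seed[p] for p in pages}
    return trust

# Toy graph: a trusted seed links to a and b; two spam pages link only to each other.
toy_links = {"good": ["a", "b"], "a": ["b"], "spam": ["spam2"], "spam2": ["spam"]}
trust = propagate_trust(toy_links, good_seeds={"good"})
```

Because no trusted page links into the spam pair, both spam pages keep a trust of zero, which is the isolation principle in miniature.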

2.1.3 Learning to Rank

Learning to rank lies at the intersection of information retrieval and machine learning, where machine learning techniques are used to model the process of ranking documents. Techniques are based on the idea of computing a function that maximizes quality measures in ranking or minimizes the sum of differences between the computed function and human-defined ratings. The advantage of using machine learning techniques is that parameters in the proposed learning models are tuned automatically. In pointwise comparison, the objective is to minimize the difference between the calculated score of a document and its human-defined rating. In pairwise comparison, the objective is to determine whether the first document in a pair of documents is ranked higher than the second document or vice versa. One of the challenges in learning to rank is to go from pointwise to pairwise comparison, where the goal is to predict the ranking positions of two given documents. Another challenge is to optimize non-continuous and non-differentiable objective functions. Fortunately, previous work in the machine learning literature has developed techniques to handle such cases. RankNet learns how to rank pages by using a neural network with pairwise comparison [37], SoftRank approximates the non-continuous and non-differentiable objective function [9], and SVMRank uses support vector machines to minimize pairwise inconsistency [38].

In RankNet, Burges et al. proposed using a two-layer neural network to learn the process of ranking pages [37]. Given a pair of pages represented as vectors, the ranking problem that the authors posed is to compute the probability that the first page is ranked higher than or equal to the second page. One advantage in the learning stage is that pairs of ranks need not be complete or even consistent, which accommodates missing pieces of information in the data or the noise contained in them. First, they proposed using the cross-entropy cost function, where ranking probabilities are modeled by the logistic function. Second, they proposed using the backpropagation algorithm to calculate the weights and offsets in a two-layer neural network such that the difference between the computed function and human-defined ratings is minimized. They conducted their learning, testing, and validation experiments by using data from a proprietary search engine consisting of 17,000 search queries, where each query contains the top 1,000 ranked pages. A page is represented as a vector consisting of 569 features. Query-dependent features are extracted from the anchor text, URL representations, title, and content. The remaining features are taken from log files in the proprietary search engine [37]. Empirical results suggested that RankNet outperformed the other learning models (RankProp [39], PRank [40]) in the validation stage.

Taylor et al. proposed SoftRank, where the idea is to consider ranking scores as random variables, map score distributions to rank distributions, calculate the expected SoftNDCG (normalized discounted cumulative gain), and use gradient techniques to optimize parameters in a two-layer neural network with SoftNDCG as the cost function. While it is possible to use the cost function proposed in RankNet, there are many other metrics in information retrieval, such as MAP (mean average precision), precision, and NDCG, that reflect the experience of end users. As mentioned, using these metrics as objective functions for training is challenging because small parameter changes alter the scores, yet ranking positions change only when one score passes another, which makes the function non-differentiable. SoftNDCG is a proposed metric based on approximating NDCG by mapping scores to random variables. As in RankNet, backpropagation uses gradient techniques to optimize parameters in a two-layer neural network, where the cost function is the approximated NDCG metric.

Last but not least, SVMRank is an algorithm proposed by Joachims based on the idea of using SVM (support vector machines) to construct a function that maximizes the empirical Kendall's tau between the target function determined from click-through data and the system function computed by SVM [38]. Click-through data provides constructive feedback on the ranking system, where a clicked URL implies an estimate of relevancy relative to the query. While a clicked link does not represent absolute judgment, it provides useful insights about the ranking positions of the unclicked items. For instance, clicking on the link that is ranked 7th implies that the 7th link is more relevant to the query than the unclicked links ranked one to six. This motivates the use of pairwise comparison, where the objective is to minimize pairwise inconsistency between a computed function and the target function derived from click-through data.
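The click-through reasoning above (a clicked link is preferred over the unclicked links ranked above it) can be turned into pairwise training data. The sketch below is illustrative only; the function names `preference_pairs` and `pairwise_inconsistency` and the document identifiers are hypothetical and are not taken from [38].

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, other) pairs: each clicked item is preferred over
    every unclicked item that was ranked above it."""
    clicked = set(clicked)
    pairs = []
    for pos, item in enumerate(ranking):
        if item in clicked:
            pairs.extend((item, above) for above in ranking[:pos]
                         if above not in clicked)
    return pairs

def pairwise_inconsistency(scores, pairs):
    """Fraction of preference pairs that a scoring function orders incorrectly."""
    if not pairs:
        return 0.0
    wrong = sum(1 for better, worse in pairs if scores[better] <= scores[worse])
    return wrong / len(pairs)

# The example from the text: the user clicks only the link ranked 7th.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
pairs = preference_pairs(ranking, clicked=["d7"])
```

A learned scoring function can then be evaluated by the fraction of these pairs it violates, which is the pairwise inconsistency that the pairwise objective tries to minimize.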

2.2 Small-world Problem

This literature review on the small-world problem is broken down into two parts. The first part provides an overview of the small-world phenomenon in terms of six degrees of separation (Sec. 2.2.1). The second part looks at effects of inequality and stratification that undermine the small-world property (Sec. 2.2.2).

2.2.1 Six Degrees of Separation

Milgram and his colleagues proposed an experiment to demonstrate the small-world property by recruiting starters from Nebraska and Oklahoma to reach a broker in Boston [29]. Starters in the experiments were asked to mail a folder to an acquaintance who would be likely to reach the target quickly. Previous folder holders were recorded in the folder roster so that they would not be selected twice in a mail-forwarding chain. The process repeats until the chain stops, either when the folder reaches the target or when the current holder drops out of the experiment for various reasons. The expected number of hops required for a starter to successfully reach a target is an upper bound on the length of the shortest path connecting them. Travers and Milgram reported that 64% of the chains successfully reached the designated target within 5.2 hops [29], which gave rise to the name six degrees of separation. The idea of six degrees of separation is that if we pick any two people on this planet, there are on average 5 unique individuals who are connected in such a way that the first person knows the second person, who knows the third person, who eventually knows the last person.

Besides the debate over whether we live in a small world or a stratified one, the techniques that were used by the participants in the experiment to select an acquaintance have practical applications in search and rescue operations [31] and job searching via personal contacts [32]. Dodds et al. reported that the successful techniques used by the participants included forwarding the folder to a selected acquaintance such as a friend (67%), relative (10%), co-worker (9%), sibling (5%), significant other (3%), or miscellaneous tie (6%), chosen based on geographical proximity and occupation "for at least half of the decisions" [30]. In addition, the results from the small-world experiment led to an avalanche of network models that have certain properties resembling real social networks, such as a short diameter and a high clustering coefficient [33].

2.2.2 Social Stratification

Research on stratification, such as racial segregation in neighborhoods and income inequality, undermines the premise that we live in a small world. For instance, are there really short paths connecting random people together? What about people who are isolated from the rest of the world? Clearly, isolated people are much harder to reach than prominent individuals such as politicians, CEOs, religious leaders, celebrities, etc. In [28], Kleinfeld mentioned that Beck and Cadamagnani were unsuccessful in replicating the small-world experiment with high success rates when they attempted to reach a high-income target starting from a low-income person. This suggests that one cause of stratification is income inequality, where people are segregated into economic classes. This leads to several questions. What are the elements that cause stratification? What attributes do we associate with other people? Since people have an inclination to associate with people of the same ethnicity, cultural heritage, and economic class, how do such tendencies affect the small-world property?

The small-world property has been accepted in the research literature because possible routing strategies have been proposed to show how people strategically make routing decisions. A routing strategy proposed by Kleinberg relies on participants passing the folder to the acquaintance who is geographically closest to the target [41]. This makes sense since people have the cognitive ability to remember where their acquaintances live. Also, it is common to have a few acquaintances who are geographically close and a few acquaintances who are distant due to relocation for a new job, studying at a university, retiring, etc.

CHAPTER 3 SOCIAL NETWORK ANALYSIS

Typically, social network analysis examines relationships among people in terms of graph theory, where nodes represent actors and edges represent their relationships. In this chapter, we examine three important aspects of social network analysis. The first is understanding the effect of geography, in terms of the location of actors, on the structure of the social network. The second is measuring the face-to-face interactions of the actors and their social relationships. The third is detecting hidden communities that are well-connected in terms of social relationships and highly active in terms of face-to-face interactions. We examine these three aspects of social network analysis in detail using data collected from a location-based social network called Gowalla. Besides ranking and searching, these three aspects of social network analysis can also be used to model human mobility in mobile ad-hoc networks (see Sec. 3.5) and predict the economic development of large geographical areas (see Sec. 3.6).

In section 3.1, we examine the geography, co-appearance, and interactions of users in Gowalla, focusing on the effect of geography on the structure of the network and on face-to-face interactions. In section 3.2, we incorporate geographical information of users into three selected community detection algorithms, consisting of a modified version of the Clique Percolation Method (CPM), the Inference Algorithm (IA), and GANXiS, to detect disjoint and overlapping communities that are well-connected in terms of social relationships and highly active in terms of face-to-face interactions. In section 3.3, we design an experiment in which we generate different types of covers by using a combination of social and geographic information. In section 3.4, we use quality measurements based on the link connectivity, geographical proximity, and physical interactions among members to examine detected communities as a function of their sizes, with covers as a baseline. We conclude this chapter in section 3.7 with a summary of the results and potential applications that might benefit from the analysis of geography and spatially-aware community detection.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Using Location-Based Social Networks to Validate Human Mobility and Relationships Models,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Istanbul, 2012, pp. 1247-1253.
This chapter previously appeared as: T. Nguyen et al., “Analyzing the Proximity and Interactions of Friends in Communities in Gowalla,” in Proc. IEEE/ACM Int. Conf. Advances on Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.

Figure 3.1: Geographical spread of 100K checkins in Gowalla.

3.1 Geography, Co-Appearance, & Interactions

3.1.1 Data Collection

We collected data from a location-based social networking provider called Gowalla that allowed people to use their internet-enabled and sensing-capable mobile phones to record and share their current location with their friends. By using Gowalla's API, we were able to retrieve 391,223 users with public profiles (friends and checkins) from mid-September 2011 to late October of that year. Unfortunately, Gowalla has been purchased by Facebook and is no longer operating by itself. The data for FourSquare, Twitter, and Google Buzz were collected in a similar manner by using breadth-first search.

To collect the data, we start with a randomly chosen user and process all the public information available about that user. Then we store the ids of all the user's friends and put them into a processing queue in FIFO order. After that, we retrieve the next user from the queue and repeat the process. In other words, we crawled Gowalla breadth-first, a standard technique in the social networking literature often referred to as Breadth First Search (BFS) sampling.

As shown in Table 3.1, the users accumulated a total of around 26 million checkins and 8 million friendship links. The average day of the checkins is 3.14, which represents Wednesday. The earliest checkin is on Jan. 21, 2009. The average distance between two consecutive checkins of a user is 128.72 km. The average time interval between two consecutive checkins of a user is 6.41 days with a standard deviation of 13.29. The geographical spread of the checkins is shown in Fig. 3.1. The checkins from Gowalla allow us to measure the face-to-face interactions between friends by inferring how often friends checked into the same location at approximately the same time.

Table 3.1: Data summary of Gowalla network.

                 x̄         σX        Σ
Users            −         −         391,223
Checkins         164.64    636.68    26,303,580
Friends          11.13     67.03     2,176,384
Weekday          3.14      2.01      Jan. 21, 2009
Distance (km)    128.72    356.51    20,565,644
Time (days)      6.41      13.29     -
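The breadth-first crawl described in this subsection can be sketched as follows. The sketch assumes a hypothetical callable `fetch_profile(user_id)` that returns the checkins and friend ids of a public profile; it stands in for the provider's API and is not the actual Gowalla interface. The `max_users` cap is likewise an assumption added for illustration.

```python
from collections import deque

def bfs_crawl(seed_user, fetch_profile, max_users=None):
    """Breadth-first sampling starting from seed_user.
    fetch_profile(user_id) -> (checkins, friend_ids) is a hypothetical callable."""
    queue = deque([seed_user])
    seen = {seed_user}
    profiles = {}
    while queue and (max_users is None or len(profiles) < max_users):
        user = queue.popleft()            # FIFO order gives breadth-first traversal
        checkins, friends = fetch_profile(user)
        profiles[user] = {"checkins": checkins, "friends": list(friends)}
        for friend in friends:
            if friend not in seen:        # enqueue each user at most once
                seen.add(friend)
                queue.append(friend)
    return profiles

# Toy friendship lists standing in for the remote service.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
crawled = bfs_crawl("a", lambda u: ([], toy[u]))
```

Users are visited in order of their hop distance from the seed, so the sample is biased toward the neighborhood of the starting user, which is a known property of BFS sampling.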

3.1.2 Notations & Definitions

Given a set of users $U$, let $u \in U$ be a particular user, $L_u$ be the set of its shared locations, known as checkins, and $F_u$ be the set of its friends. A shared location $l \in L_u$ of the user $u$ is a tuple of three elements denoted $l^1$, $l^2$, and $l^3$, corresponding to the latitude, longitude, and timestamp of the location $l$, respectively. The friendship network, denoted $F = (U, E_U)$, is an undirected and unweighted graph where an edge represents reciprocal friendship; that is, $e = (u, u') \in E_U$ means $u' \in F_u$ and $u \in F_{u'}$. The geographic distance $d(u, u')$ between two users $u$ and $u'$ is estimated by averaging the locations in $L_u$ and $L_{u'}$ and using the haversine formula to calculate arc distances. The checkin similarity $CS(u, u')$ of users $u$ and $u'$ is defined as:

$$CS(u, u') = \frac{|L_u \cap L_{u'}|}{|L_u \cup L_{u'}|}. \tag{3.1}$$

The level of physical interaction between users $u$ and $u'$, denoted $I(u, u')$, is calculated from their shared locations as follows. Two locations $l \in L_u$ and $l' \in L_{u'}$ are equivalent if they are within geographic proximity, $d(l, l') < d$, and occurred within a time interval, $|l^3 - l'^3| < t$. Having two such equivalent locations $l_u$ and $l_{u'}$ means we infer that $u$ and $u'$ have gone to the place $l$ together.
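The definitions above can be made concrete with a short sketch. Several details are assumptions made here for illustration: the Earth radius constant, the representation of a checkin as a (latitude, longitude, timestamp in seconds) tuple, the use of place identifiers for the set operations in Eq. (3.1), and a plain arithmetic mean of coordinates for a user's average location. The default thresholds follow the values t = 30 mins and d = 1 km quoted in the caption of Fig. 3.4.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance in km between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))

def mean_location(checkins):
    """Naive average of (lat, lon); inaccurate near the antimeridian."""
    n = len(checkins)
    return (sum(c[0] for c in checkins) / n, sum(c[1] for c in checkins) / n)

def user_distance_km(checkins_u, checkins_v):
    """d(u, u'): haversine distance between the users' average locations."""
    return haversine_km(*mean_location(checkins_u), *mean_location(checkins_v))

def checkin_similarity(places_u, places_v):
    """Eq. (3.1) on sets of place identifiers (assumed representation)."""
    union = places_u | places_v
    return len(places_u & places_v) / len(union) if union else 0.0

def equivalent(l, l2, d_km=1.0, t_sec=1800):
    """True if two checkins are within d_km of each other and within t_sec in time."""
    return haversine_km(l[0], l[1], l2[0], l2[1]) < d_km and abs(l[2] - l2[2]) < t_sec
```

For example, `checkin_similarity({1, 2, 3}, {2, 3, 4})` shares two of four distinct places, and two checkins at the same coordinates ten minutes apart are treated as equivalent under the default thresholds.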

Figure 3.2: Friendship is bounded by geographical distance. (Axes: Checkin Similarity on the x-axis, Distance Similarity in log(km) on the y-axis; legend: Not Friends, Friends.)

The maximum pairwise equivalence between $L_u$ and $L_{u'}$ is defined as the longest sequence of equivalent location pairs $((l_1, l'_1), \ldots, (l_k, l'_k))$ such that for each $1 \le i \le k$, $l_i \in L_u$, $l'_i \in L_{u'}$, and $l_i$ is equivalent to $l'_i$. The level of physical interaction $I(u, u')$ is defined as the length $k$ of the maximum pairwise equivalence divided by the size of the smaller location set:

$$I(u, u') = \frac{k}{\min(|L_u|, |L_{u'}|)}. \tag{3.2}$$

Finding the maximum pairwise equivalence can be reduced to a network flow problem where polynomial running time algorithms such as Ford-Fulkerson can be used to calculate the maximum number of matches.
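A minimal sketch of this computation is given below. It uses augmenting paths on the bipartite graph of equivalent location pairs, which is the unit-capacity special case of a Ford-Fulkerson style flow computation. It assumes that each location may appear in at most one matched pair, and it takes the equivalence test as a caller-supplied predicate; the function name `interaction_level` is hypothetical.

```python
def interaction_level(locs_u, locs_v, is_equivalent):
    """I(u, u') = k / min(|L_u|, |L_u'|), where k is the size of a maximum
    matching of equivalent location pairs (each location used at most once)."""
    if not locs_u or not locs_v:
        return 0.0
    # For each location of u, the indices of the equivalent locations of u'.
    adj = [[j for j, b in enumerate(locs_v) if is_equivalent(a, b)] for a in locs_u]
    match_of_v = [None] * len(locs_v)   # index in locs_u matched to each entry of locs_v

    def augment(i, visited):
        # Try to match locs_u[i], re-routing earlier matches where necessary.
        for j in adj[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of_v[j] is None or augment(match_of_v[j], visited):
                match_of_v[j] = i
                return True
        return False

    k = sum(1 for i in range(len(locs_u)) if augment(i, set()))
    return k / min(len(locs_u), len(locs_v))

# Toy checkins as (place, minute); equivalent if same place and under 30 minutes apart.
same_visit = lambda a, b: a[0] == b[0] and abs(a[1] - b[1]) < 30
level = interaction_level([("cafe", 0), ("cafe", 10), ("gym", 100)],
                          [("cafe", 5), ("park", 50)], same_visit)
```

In the toy example only one checkin of the second user can be matched, so k = 1 and the smaller location set has size 2. The recursive search is adequate for small location sets; a production version would likely use an iterative or library-based matching routine.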

3.1.3 Data Analysis & Results

In Fig. 3.2, there are 701 blue points, each representing two randomly selected users who are friends, and 620 red points, each representing two randomly selected users who are not friends within the dataset. The shaded region is drawn by using the k-nearest neighbor algorithm to classify whether two users are friends given their average distance apart and checkin similarity.

In Fig. 3.2, we notice that co-appearance, represented by checkin similarity, is a poor indicator of friendship; that is, people who are temporarily in the same place at the same time are not likely to be friends. Intuitively, co-appearance happens often at popular spots, like concerts and cafes, that attract people living in a great variety of locations. Even if a group of a few friends goes to a concert together, they would not be friends with the thousands of other attendees; hence, the chance that a random pair of attendees are friends is low. Occasional co-appearances are not sufficient, but geo-proximity helps in establishing and maintaining friendship, as seen in Fig. 3.2.

Figure 3.3: Densities of pairs as a function of geographical distance. (Panels: (a) Hop=1-3, (b) Hop=4-6; x-axis: Avg. Distance of Separation (km); y-axis: Fraction.)

In Fig. 3.3, we plotted the density of friends (hop=1), friends-of-friends (hop=2), and pairs of users up to six degrees of separation as a function of the average geographic distance between two users in km. For each level 1 ≤ k ≤ 6 of indirection (measured in the number of hops), we randomly selected 5,000 non-cyclic paths of length k from the Gowalla dataset and created 5,000 pairs from the ends of these paths, each pair with k levels of friendship indirection. We analyzed pairs that were within 4,000 km of each other. In Fig. 3.3(a), the density of direct friends (4,317 total) reaches its highest value of 0.35 (in other words, 1,511 pairs) at the lowest geographic separation, in the range from 0 to 160 km (each point at distance x represents users with distances from x-160 km to x+160 km), and continues to decrease as the distance between them increases. At the second level of indirection, the density of friends-of-friends (3,464 total) reaches its highest value of 0.19 in the range from 0 to 160 km and continues to decrease as the geographic distance between them increases.

Geographic proximity has an effect: friends (hop=1) and friends-of-friends (hop=2) are more likely, but not necessarily required, to be within proximity of each other. For instance, 61% of friends are within 480 km and 47% of friends-of-friends are within 640 km of each other. Another way of looking at the results is that people who are separated by three or more hops are unlikely to be within geographic proximity of each other. In Fig. 3.3(b), we plotted pairs of users who are separated by four, five, and six hops. We noticed that they are not likely to be within geographic proximity of each other. The density of those pairs reaches its highest value of 0.07 at the 160 km range centered at 1,200 km and continues to decrease regardless of their degrees of separation.

In Fig. 3.4, we plotted the average level of face-to-face interactions $I(u, u')$ of friends (hop=1) and friends-of-friends (hop=2) as a function of their geographic distance in km. The larger the geographic distance between friends, the less likely they are to physically interact by going to the same places together. The level of interaction peaks (0.027) at the lowest geographic separation, from 0 to 266 km, and gradually decreases (with some small fluctuations) as the distance between them increases. For friends-of-friends, the physical interactions reflect the probability that they happened to be together.

Figure 3.4: Measuring face-to-face interactions (t=30mins, d=1km). (Panels: Hop=1, Hop=2; x-axis: Avg. Distance of Separation (km); y-axis: Level of Interaction.)

3.1.4 Limitations

We would like to mention that it is possible that the locations of some users are irrelevant to their distant friends. This may be a source of potential bias, where the geographic proximity of friends may be exaggerated by a friendship selection process in Gowalla in which users subjectively add friends who are within their geographic proximity. However, we noticed that 38% of friends are geographically separated by more than 520 km. Also, the Gowalla data and other social media indicate that distant friends are selected, perhaps for the purpose of keeping in contact [42]. In addition, Mislove et al. mentioned that the population of users who tweet on Twitter is unbalanced [43]. Therefore, we believe that the users who check in on Gowalla do not form a representative sample of the entire population, as shown by the concentration of checkins in Fig. 3.1.

3.2 Incorporating Geography into Community Detection

A common approach in community detection is to divide a network into multiple partitions by maximizing the number of edges within each partition and minimizing the number of edges between them. The most often used quality measure for the partitions is modularity, which compares the difference between the fraction of edges inside and the fraction of edges across a partition with the expected value of that difference if edges in the network were randomly distributed [44]. Greedy approaches like hierarchical clustering [45] and spectral approaches such as minimum cuts [46] divide a network into disjoint partitions by combining or separating clusters of nodes so that modularity is maximized at every step. As studied by the authors in [47], [48], a problem with this modularity maximization approach is that it tends to merge two separate communities together, which increases the value of modularity but creates a merger that does not reflect the ground truth.

Another approach to community detection is to divide a network into multiple partitions so that the majority of members within each partition share a common attribute [49]. A proposed attribute is based on friendship similarity, defined as the density of common friends between pairs of nodes [49]. A problem with this proposed attribute is that it allows for a community consisting of people who have a lot of friends in common but are not friends with each other. However, this imperfect definition works well in practice because people who have a lot of friends in common are likely to be friends themselves.

Since community detection is an active area of research, our goal is not to provide another technique that detects communities (many have been proposed) but to incorporate the spatial information of nodes into existing algorithms for analyzing Gowalla and to propose a null model (generating covers) to benchmark the detected communities. We combine these two approaches to community detection by incorporating the location information of users and the geographic distances between them into three selected algorithms taken from the rich literature. First, we want to minimize the number of edges between communities and maximize the number of edges within them. Second, we want members inside a community to be within spatial proximity, which we achieve by giving geographically correlated friends more weight than distant friends during the detection process. This combined approach applies a natural interpretation of a friendship community, where members are well connected and also likely to be geographically close. Also, geographically correlated nodes are more likely to interact with each other face-to-face, as seen previously.

We selected three community detection algorithms based on their popularity (CPM), promising experimental results (IA), and ability to scale to millions of nodes and edges (GANXiS) for the purpose of capturing and measuring the interactions of users inside a community. In the following subsections, we summarize the selected algorithms and describe how we incorporated geographic information of users into the process of detecting friendship communities in Gowalla, since the level of interactions is correlated with distance, as seen previously.

3.2.1 Clique Percolation Method

The CPM algorithm was proposed to detect overlapping communities by combining cliques, that is, fully connected subgraphs [50]. Given an undirected graph $F = (U, E_U)$, let $H_m$ denote the set of all cliques in $F$ of size $m$. The clique-graph $G = (H_m, E)$ consists of the cliques in $H_m$ represented as nodes, with an edge between a pair of cliques if they have $m - 1$ overlapping members. Each connected component of the graph $G$ is a community consisting of many fully connected subgraphs of $F$. A problem of the CPM algorithm is its lack of scalability, because the number of cliques explodes as $m$ increases for large networks. Unfortunately, the problem of finding the largest clique in a given graph is NP-hard [51], which prevents the algorithm from using cliques of near-largest size.

We modified CPM to incorporate the geographic information of nodes and made the algorithm scalable as follows. Instead of finding cliques of large sizes, we find triangles ($m = 3$), since they can be identified efficiently in parallel using map-reduce. To limit the number of triangles, we select a subset of disjoint triangles from all possible triangles by using the geographic distances between pairs of nodes. The average geographic distance of a triangle $t$ is defined as $\frac{1}{3} \sum_{u \neq u' \in t} d(u, u')$, where the sum runs over the three pairs of distinct nodes in $t$. We take triangles one at a time from the list of triangles sorted by this average distance until all possible disjoint triangles have been taken. If a user is not part of any disjoint triangle, we assign it to the triangle that maximizes the number of edges between this user and the triangle, and we use geographic distances to break ties by assigning the user to the geographically closest triangle.

The modified clique-graph $G'$ is defined as $G' = (T, E_T)$, where $T$ is the set of modified triangles and $E_T$ is the set of edges between triangles, assigned as follows. For each triangle, we create a single clique edge from this triangle to the one that maximizes the number of friendship edges between them, and we use geographic distances to break ties if necessary. As in the original CPM algorithm, each connected component of $G'$ is a community consisting of geographically correlated and well connected subgraphs of $F$.
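The triangle-selection step can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: the function name `select_disjoint_triangles` and the `dist` callable are hypothetical, and the sketch assumes the list is sorted in ascending order (geographically tightest triangles first), which the text does not state explicitly.

```python
from itertools import combinations

def select_disjoint_triangles(triangles, dist):
    """Greedily pick node-disjoint triangles, smallest average distance first.

    triangles: iterable of 3-tuples of node ids
    dist:      hypothetical callable (u, v) -> geographic distance
    """
    def avg_distance(t):
        # (1/3) * sum of d(u, u') over the three pairs of distinct nodes in t
        return sum(dist(u, v) for u, v in combinations(t, 2)) / 3.0

    used, chosen = set(), []
    # Assumption: ascending order, so compact triangles are taken first.
    for t in sorted(triangles, key=avg_distance):
        if used.isdisjoint(t):
            chosen.append(t)
            used.update(t)
    return chosen
```

Users left out of every chosen triangle would then be attached to the triangle they share the most edges with, as described above; that step is omitted here for brevity.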

3.2.2 Modularity Maximization

Modularity maximization is a popular technique for finding communities, proposed in [44], [45]. Given a graph $F = (U, E_U)$ and a set $P$ containing disjoint partitions (subsets) of $U$, the modularity $Q$ of the partitions in $P$ is defined as:

$$Q = \sum_{p_i \in P} \left( e_{ii} - a_i^2 \right) \qquad (3.3)$$

where $e_{ij}$ is the fraction of edges between nodes in the partitions $p_i$ and $p_j$, and $a_i = \sum_j e_{ij}$ is the fraction of edges that connect to nodes in the partition $p_i$ [44]. A positive value of $Q$ indicates that the density of edges inside the partitions is higher than expected under a null model in which edges are placed at random.

To maximize modularity, a greedy approach based on hierarchical clustering was proposed in [45], [52]. Initially, every node in $U$ belongs to its own community. Then the pair of communities whose merger yields the highest increase in modularity is merged. The merging process repeats $n - 1$ times, where $n = |U|$. The clustering with the highest overall value of modularity across all iterations is taken as the set of communities.

For weighted networks, Newman proposed a simple technique that maps integer weights to multigraphs [53]. For every edge of weight $w_{ij}$, $w_{ij} - 1$ additional unweighted edges are added between nodes $i$ and $j$, and the weight $w_{ij}$ is set to 1. The definition of modularity remains the same, since the fraction of edges $e_{ij}$ between partitions $p_i$ and $p_j$ can simply incorporate multiple edges between nodes.

We incorporated geographic information about users into the Inference Algorithm (IA) by assigning weights to edges based on spontaneousness and typical means of travel: walking up to 1.6 km, biking or using public transportation up to 25 km, a short car or train ride up to 100 km, a long car or train ride up to 500 km, and a plane flight above 500 km. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on.
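To make the weighting and Eq. 3.3 concrete, the sketch below maps a distance to a travel-band weight and evaluates modularity on an integer-weighted graph, treating a weight $w$ as $w$ parallel edges. The names `travel_weight` and `modularity` are illustrative. Two assumptions are made: the weights are read as powers of two ($2^4$ down to $2^0$), with the last two bands extrapolated from "and so on", and $a_i$ is computed in the standard way as the fraction of edge ends attached to partition $p_i$.

```python
from collections import defaultdict

# Upper bound (km) of each travel band and its weight. The weights for the
# last two bands (2**1 and 2**0) are an assumption extrapolated from the text.
BANDS = [(1.6, 2**4), (25.0, 2**3), (100.0, 2**2), (500.0, 2**1)]

def travel_weight(distance_km):
    """Edge weight for two friends separated by distance_km."""
    for limit, weight in BANDS:
        if distance_km <= limit:
            return weight
    return 2**0  # plane flight, above 500 km

def modularity(weighted_edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for undirected (u, v, w) edges listed once."""
    total = sum(w for _, _, w in weighted_edges)
    inside = defaultdict(float)  # edge weight with both ends in community i
    ends = defaultdict(float)    # weight of edge ends attached to community i
    for u, v, w in weighted_edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += w
        ends[cv] += w
        if cu == cv:
            inside[cu] += w
    return sum(inside[c] / total - (ends[c] / (2 * total)) ** 2 for c in ends)
```

For two triangles joined by a single edge, with each triangle as its own partition, this gives $Q = 2(3/7 - 1/4) = 5/14$, while placing all six nodes in one partition gives $Q = 0$.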

3.2.3 Speaker-Label Propagation (GANXiS)

GANXiS was proposed in [54] based on a probabilistic propagation process that spreads labels between speakers and listeners. Given a graph $F = (U, E_U)$, each node $u_i \in U$ initially carries a unique label $i$ in its pocket $p_i = \{i\}$. When a node $u$ is randomly selected to speak, it requests all members of its neighborhood, that is, the nodes adjacent to $u$, to randomly send a label from their pocket to $u$. The probability of a label being chosen by a neighbor $u'$ from its pocket $p_{u'}$ is proportional to the number of times the label was added; the more times a label was added, the more likely it is to be chosen. The probability of a speaker $u_i$ choosing a label from a listener $u_j$ is based on the weight $w_{ij}/w_i$, where $w_i$ is the sum of the weights of all edges coming out of $u_i$. For unweighted networks, $w_{ij} = 1$.

The algorithm repeats until the maximum number of iterations is completed, where in each iteration everyone gets to speak exactly once in a random order. At the end, labels whose probability of being chosen to send to a speaker is less than a threshold $r$ are deleted. Finally, the labels that a node carries determine the communities to which it belongs. For instance, nodes that carry a label $i$ belong to the community $c_i$. Time to live (TTL) has recently been proposed to limit the number of labels that nodes propagate. TTL defines the number of times a label can be sent, so that it reaches only a limited number of nodes within TTL hop distance.

The advantage of GANXiS is that it scales linearly with the number of edges; the disadvantage is that the relationship between convergence and the number of iterations is as yet unknown. GANXiS is capable of discovering overlapping communities, but we selected its running parameters in such a way that the results included only disjoint communities, to make them compatible with the results of the other algorithms.

We incorporated the geographic information of users into GANXiS by assigning weights based on spontaneousness and typical means of travel, as in weighted IA. Friends who are within walking distance (1.6 km) get the highest weight of $2^4$. Friends who are within biking distance (25 km) get the second highest weight of $2^3$. Friends who are within driving distance get a weight of $2^2$, and so on. This extends the interpretation of the speaker-listener propagation algorithm: a listener is more likely to be able to hear a speaker if they are within spatial proximity.
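The propagation process described above can be illustrated with a small toy implementation. This is a sketch under stated assumptions and not the GANXiS code: the function name `speaker_listener`, its parameters, and the fallback that keeps a node's most frequent label when no label reaches the threshold are all introduced here for illustration, and TTL is not modeled.

```python
import random
from collections import Counter

def speaker_listener(neighbors, weight, iterations=20, r=0.3, seed=0):
    """Toy weighted label propagation following the description above.

    neighbors: dict node -> list of adjacent nodes
    weight:    callable (u, v) -> positive edge weight (1 if unweighted)
    Returns dict node -> set of surviving labels.
    """
    rng = random.Random(seed)
    memory = {u: [u] for u in neighbors}  # each pocket starts with its own label
    nodes = list(neighbors)
    for _ in range(iterations):
        rng.shuffle(nodes)  # everyone takes exactly one turn per iteration
        for u in nodes:
            nbrs = neighbors[u]
            if not nbrs:
                continue
            # Each neighbor offers one label, drawn in proportion to how
            # often that label appears in its pocket.
            offers = [rng.choice(memory[v]) for v in nbrs]
            # u accepts one offer with probability w_uv / w_u.
            picked = rng.choices(offers, weights=[weight(u, v) for v in nbrs], k=1)[0]
            memory[u].append(picked)
    result = {}
    for u, labels in memory.items():
        counts = Counter(labels)
        keep = {lab for lab, c in counts.items() if c / len(labels) >= r}
        # Assumption: never leave a node unlabeled; keep its top label.
        result[u] = keep or {counts.most_common(1)[0][0]}
    return result
```

Passing a `weight` callable built from geographic distance bands would make nearby friends more likely to be heard, which is the intent of the weighted variant.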

3.3 Contrasting Communities to Null Models

We propose to integrate the spatial and friendship information of nodes into a process for generating covers. The purpose of the covers is to serve as a baseline for analyzing the performance of various community detection algorithms under a quality measurement. Section 3.3.1 describes how we generated six covers by using a combination of spatial and friendship information in traversing the network. Section 3.3.2 selects a few quality measurements for examining covers and detected communities. Section 3.3.3 examines the covers using the selected quality measurements.

Table 3.2: Six techniques for generating covers.

Algorithm               Abbreviation   Spatial Info.?   Social Info.?
Completely Random       CR             no               no
Random Walk             RW             no               yes
Closest Friend First    CFF            yes              yes
Farthest Friend First   FFF            yes              yes
Closest to All          CTA            yes              yes
Farthest to All         FTA            yes              yes

3.3.1 Techniques for Generating Covers

Given a graph $F = (U, E_U)$, a cover $C \subset U$ of size $k$ is a subgraph of $F$ with $k$ nodes selected in a specific way. A completely random cover CR is one where each user $u \in U$ has the same probability of being added during the selection. In a random walk cover RW, we first randomly add a seed into the cover, then randomly select a friend of the most recently added user, and continue selecting friends until the cover reaches the size $k$. The closest-friend-first cover CFF is similar to RW, but instead of adding a random friend, we add the spatially closest friend of the last added user who is not yet in the cover. If all of that user's friends have already been added into the cover, we go back one step to the previously added user and branch out from there. We call this the roll-back mechanism. The farthest-friend-first cover FFF is similar to CFF, except that we take the spatially farthest friend instead of the closest one. The closest-to-all cover CTA is similar to CFF, but instead of adding the friend who is spatially closest to the last added user, we add the friend who is spatially closest with respect to all members already in the cover. Finally, the farthest-to-all cover FTA is one where we take the spatially farthest friend with respect to all members already in the cover. The CTA and FTA cover generation algorithms are described in Fig. 3.5, without the roll-back mechanism for simplicity. We list the covers and their details in Table 3.2.
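The CFF and FFF covers, including the roll-back mechanism, can be sketched as below. This is an illustrative reading of the description, not the original implementation: `friend_first_cover`, its parameters, and the data layout (`friends` as an adjacency dict, `location` as a dict of latitude and longitude pairs in degrees) are assumptions, and the sketch simply stops early if the seed's connected component has fewer than `k` users.

```python
import math
import random

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def friend_first_cover(friends, location, k, farthest=False, seed=None, rng=random):
    """CFF cover (FFF when farthest=True) with the roll-back mechanism."""
    if seed is None:
        seed = rng.choice(list(friends))  # random seed user
    cover, in_cover = [seed], {seed}
    path = [seed]  # stack of added users we can roll back to
    while len(cover) < k and path:
        last = path[-1]
        candidates = [f for f in friends[last] if f not in in_cover]
        if not candidates:
            path.pop()  # roll back one step and branch out from there
            continue
        choose = max if farthest else min
        pick = choose(candidates,
                      key=lambda f: haversine_km(location[last], location[f]))
        cover.append(pick)
        in_cover.add(pick)
        path.append(pick)
    return cover
```

Keeping the visited users on a stack is one simple way to realize "go back one step"; other bookkeeping schemes would also satisfy the description.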

    procedure CoverGeneration(k)
        F = (U, E_U)
        seed = rand(1, |U|); cover = [seed]
        while len(cover) < k do
            distances = [ ]; m = len(cover)
            for u in F_seed do                  // friends of seed
                // Compute haversine distance from u to cover[i].
                d_u = (1/m) * sum_{i=1..m} d(cover[i], u)
                distances.append((u, d_u))
            end for
            // sort d_u from least to greatest (CTA) or vice versa (FTA)
            distances = sort(distances, key = x: x[1])
            for (u, d_u) in distances do
                if u not in cover then
                    cover.append(u)
                    seed = u
                    break                       // add one friend per iteration
                end if
            end for
        end while
        return cover
    end procedure

Figure 3.5: Generating CTA & FTA covers.

3.3.2 Measuring Covers & Communities

We use three types of quality measurements, based on the link connectivity and location of members, to measure covers and communities.

The first type of measurement is based on the intra-edge count IEC, defined as the number of edges with both ends inside the cover. The contraction CONT of a cover is computed by dividing the intra-edge count by the size of the cover. The intra-density IND of a cover is calculated by dividing the intra-edge count by the intra-edge count of a completely connected cover of the same size. For these three measures (IEC, CONT, IND), the higher the value, the better formed the community.

The second type of measurement is based on the boundary-edge count BEC, defined as the number of edges with one end inside the cover and the other outside. This metric is useful for taking into account the effect of adding high-degree users into covers of large sizes, since such users are likely to increase both the intra- and boundary-edge counts. The expansion EXP of a cover is computed by dividing the boundary-edge count by the size of the cover. The conductance COND of a cover is defined as $\mathrm{COND}(C) = \frac{\mathrm{BEC}(C)}{2\,\mathrm{IEC}(C) + \mathrm{BEC}(C)}$. For these three measures (BEC, EXP, COND), the lower the value, the better formed the community.

The third type of measurement is based on pair-similarity, which measures a given metric such as friendship similarity among pairs of nodes. This is applicable to the definition of a community whose members have a lot in common [49]. We replace the friendship similarity ratio with three additional measurements based on the geographic proximity and location of nodes. The first is the geographic diameter of a cover, GDI, defined as the geographic distance between the two farthest nodes. The second is the average geographic distance AGD among pairs of nodes. For these two measures (GDI and AGD), the lower the value, the better formed the community. The third is the sum of the levels of physical interactions SLI among pairs of nodes, for which the higher the value, the better formed the community.

Table 3.3: Measurements for a cover C of size k.

Measurement      Definition
IEC [55]         $|\{(v_i, v_j) \in E \mid v_i \in C \wedge v_j \in C\}|$
BEC [56]         $|\{(v_i, v_j) \in E \mid v_i \in C \vee v_j \in C\}| - \mathrm{IEC}$
CONT             $\mathrm{IEC} / k$
EXP [57]         $\mathrm{BEC} / k$
IND [55]         $\mathrm{IEC} / (0.5\,k(k - 1))$
COND [56, 57]    $\mathrm{BEC} / (2\,\mathrm{IEC} + \mathrm{BEC})$
GDI              $\max d(u, u') \;\; \forall u, u' \in C$
AGD              $\sum_{u \neq u' \in C} d(u, u') / (0.5\,k(k - 1))$
SLI              $\sum_{u \neq u' \in C} I(u, u')$
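The link-based and geographic definitions in Table 3.3 translate directly into code. The sketch below is illustrative only: the function names and the `dist` callable are hypothetical, edges are assumed to be undirected pairs listed once with no self-loops, and the geographic measures assume a cover with at least two members.

```python
from itertools import combinations

def cover_measurements(edges, cover):
    """IEC, BEC, CONT, EXP, IND and COND for one cover."""
    members = set(cover)
    k = len(members)
    # Both ends inside the cover.
    iec = sum(1 for u, v in edges if u in members and v in members)
    # Exactly one end inside the cover.
    bec = sum(1 for u, v in edges if (u in members) != (v in members))
    return {
        "IEC": iec,
        "BEC": bec,
        "CONT": iec / k,
        "EXP": bec / k,
        "IND": iec / (0.5 * k * (k - 1)) if k > 1 else 0.0,
        "COND": bec / (2 * iec + bec) if (iec or bec) else 0.0,
    }

def geographic_measurements(cover, dist):
    """GDI (largest pairwise distance) and AGD (mean pairwise distance)."""
    pair_distances = [dist(u, v) for u, v in combinations(cover, 2)]
    return {
        "GDI": max(pair_distances),
        "AGD": sum(pair_distances) / len(pair_distances),
    }
```

For a triangle attached to the rest of the network by a single edge, this yields IEC = 3, BEC = 1, IND = 1 and COND = 1/7.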

3.3.3 Examining Covers in Gowalla

For each technique, we generated covers of fixed sizes from 5 to 100 in increments of 1. For each cover size, we generated 100 covers and calculated the average intra-edge count, boundary-edge count, geographic distance, and geographic diameter. We then derived the remaining measurements.

In Fig. 3.6(a), we noticed that FFF outgrows the other techniques in terms of intra-edge count as the cover size increases. In Fig. 3.6(b), we noticed that FFF and FTA outgrow the other techniques in terms of boundary-edge count by a great margin, suggesting that they strategically add users with very large degrees. While RW is decent at generating covers with high intra-edge counts, as seen in Fig. 3.6(a), it is also biased, since users with high degrees are more likely to be added, which increases the intra-edge count as the cover continues to grow. However, FFF and FTA are even more biased than RW, and FFF outgrows the other five techniques because the radius of the farthest friend would cover everyone, including common friends in between. On the other hand, we noticed that CFF and CTA are the most effective of the six techniques at increasing the intra-edge count while minimizing the boundary-edge count at the same time.

In Fig. 3.6(c), we measure the geographic diameter of a cover as a function of its size. As expected from how the covers are generated, FFF and FTA are the most effective at maximizing the geographic diameter, while CFF and CTA are the most effective at minimizing it. The geographic diameter of FFF and FTA reaches its limit within 20 iterations, while the diameter for CTA and CFF continues to grow slowly. A similar trend is seen in Fig. 3.7(c), which shows the average geographic distance, in contrast to the growth rate of the intra- and boundary-edge counts seen in Fig. 3.7(a).

Last but not least, conductance is a measurement used to determine the quality of a community by considering both the intra- and boundary-edge counts. As seen in Fig. 3.7(b), CFF is the most effective of the six covers at minimizing conductance, since it preserves some geographic structure of the social network by traversing the edges based on who is the geographically closest friend and by adding friends who are likely to be friends with the members already in the cover. CTA is not as effective as CFF because geographic distances get diluted as the size of the cover increases. FFF and FTA are worse than RW at minimizing conductance. We later use the physical interactions of users to compare and contrast the results generated by the CFF cover with the results detected by the community detection algorithms.

Figure 3.6: Intra-edge count, boundary-edge count, and geographic diameter of covers. [Plots omitted. Panels, each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA: (a) intra-edge count; (b) boundary-edge count; (c) geographic diameter (km).]

Figure 3.7: Contraction, expansion, conductance, and geographic distance of covers. [Plots omitted. Panels, each plotted against cover size for CR, RW, CFF, FFF, CTA, and FTA: (a) contraction and expansion; (b) conductance; (c) average geographic distance (km).]

3.4 Examining Detected Communities

We first examined the results by looking at the total number of communities detected and the number of members in each one. The modified CPM algorithm with geographic information detected 2.6K communities with an average size of 60, the largest having 69K members. We did not run the original CPM algorithm because of the long execution time required to generate the clique graph. IA without geographic information detected 1.2K communities with an average size of 134, the largest having 52K members. IA with geographic information detected 349 communities with an average size of 442, the largest having 45K members. GANXiS without geographic information detected 7.2K communities with an average size of 21, the largest having 3K members. Finally, GANXiS with geographic information detected 4.6K communities with an average size of 33, the largest having 48K members. Additional information relating to community sizes is listed in Table 3.4.

Table 3.4: Detected communities and their sizes.

                              Community Size
Algorithm                Avg.    Std.     Smallest   Largest   Total
CPM                      60      1,356    6          68,671    2,572
IA                       134     1,935    2          52,315    1,151
IA w (w for weighted)    442     2,954    2          45,242    349
GANXiS TTL               21      87       3          3,139     7,236
GANXiS TTL w             33      767      3          48,290    4,636

3.4.1 Network Community Profile (NCP)

We used the network community profile (NCP) proposed in [56] to examine detected communities as a function of their size. The authors proposed to take, for a given community size, the best partition as defined by a quality feature, because it represents the potential of a partition in a community detection algorithm. By inspecting all communities of the same size, we find for this set the lowest conductance or the highest intra-density among its members, one quality metric at a time.

For intra-density and conductance without geographic information, we use the classical definitions from Table 3.3 and include all existing intra- and boundary-edges in the counts. For intra-density and conductance with geographic information, we include only edges that are within a geographic proximity of 160 km, or roughly 2 hours of driving. A low value of conductance is good because it means that the fraction of edges leading outside the community is low, but a value of 0 is rare since it would indicate that the community is isolated. For conductance with geographic information, however, a value of 0 means that no edges connect to other communities that are geographically close, so all bridge edges are long. It also means that, on seeing a short bridge edge, the community detection algorithm tends to merge the communities connected by that edge, following the insight that neighbors tend to be friends.

The potential issues resulting from using this approach are discussed below. First, in many situations, taking the average value of a community quality gives a more representative picture and is probably less sensitive to outliers. Second, the number of communities for a given size might vary from a large number of small communities to very few large communities. Last but not least, there might be no communities of a particular size, and taking the average quality might give a smooth function that is easier to extrapolate at the missing points, as seen with the covers. Fig. 3.8-3.10 present the results for communities detected by CPM, IA, and GANXiS respectively.
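The per-size selection behind the NCP can be sketched in a few lines. The helper names below are hypothetical, `quality` stands for any per-community measure such as conductance, and `dist` is an assumed callable returning the distance in km between two users.

```python
from collections import defaultdict

def network_community_profile(communities, quality, best=min):
    """Map each community size to the best quality value at that size.

    Use best=min for conductance and best=max for intra-density.
    """
    by_size = defaultdict(list)
    for community in communities:
        by_size[len(community)].append(quality(community))
    return {size: best(values) for size, values in sorted(by_size.items())}

def nearby_edges(edges, dist, limit_km=160.0):
    """Keep only edges whose endpoints are within limit_km of each other."""
    return [(u, v) for u, v in edges if dist(u, v) <= limit_km]
```

Computing the quality measure on the output of `nearby_edges` instead of the full edge list gives the spatial variants of intra-density and conductance described above. Sizes with no detected community are simply absent from the returned profile.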

3.4.2 Link Connectivity Measurements

First, intra-density rapidly decreases as the size of the community increases, because adding another member to a large community requires everyone already in it to be connected with this new member, as seen in Fig. 3.8-3.10(a). Second, unlike intra-density, conductance is not correlated with the community size, because both small and large communities take varying values, as seen in Fig. 3.8-3.10(b). Third, GANXiS and IA are a little better than CPM at maximizing intra-edges that are within geographic proximity, as seen in Fig. 3.8-3.10(c). IA is the best at minimizing boundary-edges that are within geographic proximity, as seen in Fig. 3.9(d). Last but not least, GANXiS and IA benefited from incorporating the geographic information of users, as seen in Fig. 3.9-3.10(d), where geographically correlated friends are captured in the community detection process.

3.4.3 Face-to-Face Interactions Measurements

Comparing Fig. 3.8-3.10(d) to Fig. 3.8-3.10(b), we noticed that some detected communities had a conductance value of 0. This means that every potential node within geographic proximity of a community has already been included in it. For IA without geographic information, 20 of the 78 community sizes have a geographic conductance of 0, yielding a ratio of 20/78 ≈ 0.26. For IA with geographic information, 19 of the 84 community sizes have a geographic conductance of 0, yielding a ratio of 19/84 ≈ 0.23. The remaining values are listed in Table 3.5. The results in Table 3.5 show that GANXiS has the highest ratio of communities with a spatial conductance of 0 to communities detected. From this perspective, a good community detection algorithm yields many communities with a spatial conductance of 0, as a result of merging connected and geographically close communities.

Figure 3.8: Communities detected by Clique Percolation Method. [Plots omitted. Panels, each plotted against community size (log-scale): (a) intra-density; (b) conductance; (c) intra-density with spatial info.; (d) conductance with spatial info.]

Figure 3.9: Communities detected by Inference Algorithm. [Plots omitted. Panels, each plotted against community size (log-scale) for the weighted and unweighted networks: (a) intra-density; (b) conductance; (c) intra-density with spatial info.; (d) conductance with spatial info.]

Figure 3.10: Communities detected by GANXiS. [Plots omitted. Panels, each plotted against community size (log-scale) for the weighted and unweighted networks: (a) intra-density; (b) conductance; (c) intra-density with spatial info.; (d) conductance with spatial info.]

Table 3.5: Measuring spatial conductance.

Algorithm                # Spatial Cond. of 0   Total   Ratio
CPM                      21                     175     0.12
IA                       20                     78      0.26
IA w (w for weighted)    19                     84      0.23
GANXiS TTL               48                     126     0.38
GANXiS TTL w             47                     155     0.30

We examined small communities because humans have limited resources and cognitive abilities to keep and maintain social relationships, resulting in a limited number of friendships known as Dunbar's number [58]. We measured the NCP of the level of physical interactions in communities and covers by summing the level of physical interactions among pairs, and plotted it in Fig. 3.11. From the plots, we observed that CPM detects small communities whose members are statistically more likely than members of covers to physically interact with each other by going to the same places together. In Fig. 3.11(a), out of 95 communities of size up to 100 detected by CPM, 84 have a higher amount of physical interaction among members than the null model, CFF. In Fig. 3.11(b), out of 41 communities of size under 100 detected by IA, 38 have a higher amount of physical interaction among members than CFF. The remaining values are listed in Table 3.6.

While CPM is the most effective at detecting communities that are intrinsically small (95 in total) and in which the physical interaction among members is likely to be higher than in CFF (88%), IA has the highest share of communities (93%) with a higher amount of physical interaction than the null model, as seen in Table 3.6. Incorporating geographic information into GANXiS improves its overall performance (91% with geography vs. 69% without).

Table 3.6: Measuring face-to-face interactions.

Algorithm                Count   Total   Ratio
CPM                      84      95      0.88
IA                       38      41      0.93
IA w (w for weighted)    28      30      0.93
GANXiS TTL               60      87      0.69
GANXiS TTL w             77      85      0.91

Figure 3.11: Measuring face-to-face interactions among members. [Plots omitted. Panels, each showing the amount of physical interaction against community size (up to 100): (a) CPM, compared with CFF; (b) IA, weighted and unweighted networks compared with CFF; (c) GANXiS TTL, weighted and unweighted networks compared with CFF.]

3.5 Application: Social Relationships & Human Mobility

Random mobility models have been popular among applied researchers for generating synthetic movements. Random walk is commonly used for graph traversals, clustering analysis, and many other applications to model unpredictable behavior. Random waypoint (RWP) is a mobility model on Cartesian coordinate systems, where two dimensions are commonly used in simulations and higher dimensions are used for theoretical analysis and generalization. Not only are these random models useful in applications, but they are also powerful tools for the analytical understanding of many networking applications, such as routing in decentralized architectures, where mobility plays a large role.

A typical ad-hoc network is a decentralized network formed by mobile agents in a dynamic process without any fixed infrastructure. It is dynamic because the topology of who is connected to whom is constantly changing due to the mobility and connection preferences of the agents and the physical limitations of communication devices. If two mobile agents are outside of transmission range, the connection is dropped. If they are within transmission range, the connection can be established. Hence, the topology of an ad-hoc network depends on a complex combination of agent mobility, connection preferences, and environmental factors that could disrupt services or enhance communication.

Some of these networks are uncoordinated, with each agent acting selfishly on its own behalf, while others are coordinated, with all agents collaborating to accomplish a particular goal, task, or mission. For instance, peer-to-peer networks are uncoordinated networks whose architecture is designed for robustness, to reduce the damage caused by selfish users who participate but are reluctant to contribute, and whose anti-choking algorithms are designed to distribute pieces of a file effectively to maximize throughput and efficiency. On the other hand, military ad-hoc networks are coordinated networks in which soldiers communicate through a network channel to rescue innocent civilians or capture fugitives in a mission.

Outside of computer networks, human mobility is important for studying the spread of contagious diseases, traffic engineering, methods of large-scale emergency evacuation, and so on [59]. While human mobility is important at the micro level, it also serves as a building block for population mobility, which has many potential applications in studying the population at large scale. Using data to observe statistical patterns that capture, characterize, and predict the trajectories of human movements during daily activities is important for health organizations, civil engineers, and national interests. For instance, health organizations may want to study the spread of transmitted diseases, while traffic and civil engineers may want to incorporate human mobility analysis into their transportation models, where travellers can use a transportation system consisting of bikes, buses, and subways to get from one place to another. Understanding population mobility allows the design of effective transportation systems in which traffic congestion is controlled and reduced. Last but not least, national security agencies might be interested in knowing how social relationships impact population mobility, so that guidelines can be provided during emergency evacuations in natural disasters like Hurricane Irene and the Japan nuclear meltdown of 2011, where evacuating 45,000 people within a six-mile radius of two malfunctioning nuclear power plants required optimal efficiency, since every second could potentially count toward saving a life.

Figure 3.12: Generating a Markov Model using checkins. [Diagram omitted. States: Home, Work, Lunch, School, and Mall, connected by transitions labelled $P_{ij}$.]

3.5.1 Network Congestion in MANETs

The backoff timer in the MAC 802.11 protocol is an algorithm designed to prevent collisions of wireless signals. If two or more concurrent wireless transmissions are within radio range, one will randomly back off to let the other one transmit. Suppose we are interested in measuring the throughput of a wireless network where people are working on their laptops and moving from location to location with some hidden attributes. Since human beings do not move randomly, we know that there will be more congestion at popular locations. If we use the random waypoint (RWP) model, most of the congestion occurs in the middle of the area due to the stationary distribution, as shown in Fig. 3.15.

3.5.2 Mobility Generation

We propose the following algorithm for generating mobility traces using social networking data from Gowalla. For our Friendship Mobility Model (FMM), which uses a Markov Model as its underpinning, we first randomly select a user from the dataset and include his or her friends in the selected group of users. For each user selected, we calculate the patterns of checkin activities from the dataset. To define the set of locations, we look at how many unique places this user has checked in at. For each pair of subsequent locations, we calculate the shortest haversine route. For the probability in the Markov Model of moving from location a to location b, we count how many times the user checks in at location b immediately after checking in at location a and divide by the number of times the user checks in at location a. Finally, we calculate the time it takes for a given user to go from one checkin to another. The entire process is depicted in Fig. 3.12.

After the empirical Markov Model is built for each user, we use Miller's coordinate projection to convert geographic space into a Cartesian coordinate system that preserves the triangle law of distances. Finally, for the mobility simulation, each node is randomly assigned to one of its checkins. Each node then randomly picks, with the assigned probability, the location of the next checkin and moves directly to it along a straight-line trajectory. Once the node reaches the new checkin, it repeats the process until the end of the simulation. Hence, the difference between the RWP mobility model and our FMM is that in the latter the space of travel is limited to the area of the checkins of each individual node. Moreover, each node moves differently based on its training set of checkins. For instance, an adult might be inclined to check in at work more often than a student.
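The empirical Markov Model for a single user can be sketched as follows. The function names are illustrative, and the input is assumed to be the user's checkin locations in chronological order. One deliberate simplification is made: each count is normalized by the number of departures from a location rather than by the number of checkins there, so that every row sums to 1; the two differ only for the location of the final checkin.

```python
import random
from collections import Counter, defaultdict

def transition_probabilities(checkins):
    """Estimate P(next = b | current = a) from an ordered checkin sequence."""
    pair_counts = defaultdict(Counter)
    for a, b in zip(checkins, checkins[1:]):
        pair_counts[a][b] += 1  # b observed immediately after a
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in pair_counts.items()
    }

def next_location(probabilities, current, rng=random):
    """Draw the next checkin; stay put if `current` has no outgoing data."""
    nexts = probabilities.get(current)
    if not nexts:
        return current
    return rng.choices(list(nexts), weights=list(nexts.values()), k=1)[0]
```

In a simulation, a node would start at one of its checkins, call `next_location` to choose a destination, travel to it in a straight line, and repeat until the simulation ends.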

3.5.3 Experimental Congestion Design

We designed a controlled MANET experiment using ns-2 to compare the traffic congestion under the RWP and the FMM. In the experiment, 15 mobile nodes constantly send out packets to their neighbors within transmission range. Other simulation parameters are listed in Table 3.7. When two or more nodes are within radio range of each other, at most one can make a successful transmission and the rest have to pause. We measure the overall congestion of the network by counting how many times a node needed to pause, given its current geographic location during the simulation.

Fig. 3.13 outlines how a simulated node moves and how it causes congestion. Suppose a node starts at p1 and travels to p2 with some speed dictated by the mobility model. A mobile node cannot transmit if there is already a concurrent transmission within some nearby range. Therefore, it pauses until it detects no concurrent transmissions. The pause time duration in a subarea is the total amount of time that all nodes spend pausing or suspending their transmissions due to the backoff timer of the MAC 802.11 protocol. During the trip from p1 to p2, the node pauses in 3 subareas, (1,2), (2,2), and (3,3), represented by the dashed line, meaning that the transmission was suspended for some time. The length of the dashed line in a subarea represents the duration of pause time for that particular trip.

We use “user” when referring to the dataset and “node” when referring to the simulation. A node is built from the social network data provided by the users.

Figure 3.13: Design of simulation overview. [Diagram omitted.]

3.5.4 Congestion Simulation Results

Table 3.7: Network simulator ns-2 parameters.

Parameter             RWP          FMM
Simulation Time (t)   10,000 s     10,000 s
MAC Layer             802.11Ext    802.11Ext
Width (x)             2000 m       2000 m
Length (l)            2000 m       2000 m
Nodes (n)             15           15
Pause Time            0            0
Min Speed             0            5
Max Speed             5            5
Total Backoffs        598,316      1,654,967

With the FMM (see [22]), we were surprised to find 2.77 times as many backoffs as with the RWP (Table 3.7). However, this agrees with our intuition that in the FMM, friends like to maintain their relationships by being closer to each other. Economic factors such as the cost of transportation and mobility have a great impact on how we choose with whom to be friends.

Fig. 3.14 displays the simulation results of network congestion in a controlled MANET. We took a sample of locations with traffic congestion. The points represent places where at least one node had to back off during the simulation. Notice how traffic congestion is dispersed for the RWP and clustered for the FMM. Note that this graph only shows places of congestion, not the density or total volume of communications. Fig. 3.15 displays the frequency of pauses caused by the backoff timer in the MAC 802.11 protocol using the RWP. Congestion is concentrated in the middle of the area, which is correlated with the stationary distribution of the RWP.

Figure 3.14: Traffic congestion in FMM and RWP. [Scatter plot; x-axis: X (m), y-axis: Length (m), both from 0 to 2000; markers: FMM, RWP.]

Figure 3.15: Frequency of pauses using the RWP.

3.6 Application: Long Ties & Economic Development

A number of results in economic sociology suggest that human relationships affect economic opportunities because information often spreads between people [60]-[65]. In addition, information coming from interpersonal relationships is often richer than information from traditional broadcast media such as television, newspapers, and radio because acquaintances can interact face-to-face and influence one another in adopting new behavior and ideas [66]. Therefore, social networks can be portrayed as a transportation system where individuals are drivers for generating ideas and the links between people are vehicles for transporting ideas from one person to another. Metaphorically, some links are faster at transporting ideas to a larger number of people than others because not all vehicles are created equal.

It has been argued that information coming from weak ties is often richer than information arriving via strong ties because “those to whom we are weakly tied are more likely to move in circles different from our own ... and have access to information different from what we [usually] receive” [65]. Weak ties have been shown to be valuable sources of information because individuals can use them to find jobs [32], [60], solicit feedback on starting new ventures [63], and search for people, as in the small-world experiment [31], [41], [67], [68]. In other settings such as workplaces, structural holes can affect the productivity and innovation of employees and can lead to higher compensation, more promotion opportunities, and better performance evaluations [61]-[64]. Structural holes are those social relationships that connect non-redundant contacts together [61]. An example of a structural hole is a bridge that connects non-redundant contacts from two communities. The effect of weak ties on economic opportunities [69] suggests that information coming from weak ties might also be used for measuring economic development on a larger scale.

Contemporary developments in the science of urbanization have provided scaling laws for innovation and wealth creation as a power function of the population size: $y(t) = c\,x(t)^m$, where x(t) is the population size and y(t) is the metric of innovation at time t [70]. These results show that as the population size increases, GDP, wages, patents, and private research & development employment increase at superlinear rates, with 1.03 ≤ m ≤ 1.46 [70]. A plausible explanation for the superlinear scaling of wealth creation is that as the population size increases, the number of social relationships between people increases because there are more choices for establishing relationships. This increases the connectivity between people and decreases the time for ideas to spread, as long as the rate of establishing connections is faster than the rate of population growth.

Following this line of thinking, recent results in [71] suggest that a generative model for tie formation as a function of population density yields results very similar to the model based on population size [70]. Results show that algorithmically generated social ties based on population density, assuming that nodes are distributed uniformly in a Euclidean space and establish connections similar to the rank friendship model [67], can be used to model urban characteristics of cities such as GDP, HIV transmissions, and communication volume. Here we extend this line of thinking by focusing on characteristics of economic development as a function of speedy idea flow emulated on real social relationships, using long ties as the main component enabling such flow. This was accomplished by using data containing geographical locations and friendship information of hundreds of thousands of people from location-based social media such as Gowalla and FourSquare [22]. More importantly, these datasets allow us to infer face-to-face interactions [23] and measure the strength of ties in terms of not only interactions but also geographical distance (i.e., short or long ties [72], [73]).

Other approaches for measuring economic development of large geographical areas include examining the diversity of social contacts (i.e., call records as a proxy for social relationships), since more contacts imply more channels for receiving information [74]. However, using calling patterns to infer social contacts is biased towards strong ties, since weak ties are by definition those that are contacted infrequently. While these approaches [71], [74] vary in their complexity, ranging from mathematically oriented to data-driven, what they share is the use of social network analysis to predict innovation, wealth creation, and even patterns of complex human behavior. The novelty of our approach lies at the intersection of economic sociology (i.e., the interplay of weak ties and economic opportunities) and simple contagion models (i.e., the spread of good ideas from one place to another). Results show that the speed of access to ideas is a near perfect measure of social diversity and also a signature of economic development in the US, without needing to tune parameters or incorporate secondary factors such as the level of educational attainment and internal transportation infrastructure.
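The scaling law y(t) = c x(t)^m discussed in this section can be estimated from data by a straight-line fit on logarithmic axes. The sketch below assumes an ordinary least-squares fit of log y against log x, which is a common choice but is our assumption here rather than a documented detail of the cited studies; the function name and the synthetic data are hypothetical.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of log y = log c + m * log x; returns (c, m, r)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    m = sxy / sxx                       # scaling exponent (slope on log-log axes)
    c = math.exp(mean_y - m * mean_x)   # prefactor recovered from the intercept
    r = sxy / math.sqrt(sxx * syy)      # Pearson correlation of the logged values
    return c, m, r


# Synthetic example: a noiseless superlinear relation with c = 2 and m = 1.2.
population = [10, 100, 1000, 10000]
metric = [2 * p ** 1.2 for p in population]
c_hat, m_hat, r_hat = fit_power_law(population, metric)
```

An exponent above 1 indicates superlinear scaling, an exponent near 1 indicates linear scaling, and the correlation r indicates how closely the logged data follow a straight line.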

3.6.1 A Stochastic Model of Economic Development

We propose a simple stochastic model that uses long ties as the main component for measuring economic development of large geographical areas. Let G = (V, E, L) be a social network where V is the set of nodes, E is the set of their undirected relationships, and L is the mapping of users to the locations of their residences. Let $A_i$ denote the set of nodes that reside in area i; i.e., $A_i = \{v \in V \mid L(v) = i\}$. The flow of ideas matrix is denoted as $F = (f_{ij})$, where $f_{ij}$ is the probability of an idea going from $A_i$ to $A_j$ in one step, defined as the number of long ties connecting nodes from $A_i$ to $A_j$ divided by the number of long ties originating from $A_i$; i.e.,

$$f_{ij} = \frac{LT(A_i, A_j)}{\sum_{k=1}^{m} LT(A_i, A_k)} \tag{3.4}$$

where m is the total number of areas and $LT(A_i, A_j)$ ($1 \le i \ne j \le m$) denotes the number of long ties connecting nodes from $A_i$ to $A_j$; i.e.,

$$LT(A_i, A_j) = \left|\{(s, t) \in E \mid (s \in A_i \land t \in A_j) \lor (t \in A_i \land s \in A_j)\}\right| \tag{3.5}$$

If we assume that innovative ideas travel randomly between areas, and that the probability of an idea spreading from $A_i$ to $A_j$ depends only on the present area and not on the previous areas, then $\{X_t, t \ge 0\}$ is a discrete-time Markov chain where $X_t$ denotes where the idea is located at time t.

Let $H_{ij}$ denote the expected time it takes for the idea originating at $A_i$ to arrive at $A_j$. Then the average expected time for the idea originating from anywhere to arrive at $A_i$, denoted as $\phi_i$, is defined as:

$$\phi_i = \frac{1}{m-1} \sum_{k=1}^{m} H_{ki} \tag{3.6}$$

where $H_{ii}$ is 0. Hence, we expect $\phi_i$ to be inversely correlated with economic development since areas that receive information quicker can act faster.

Suppose an innovative idea travels indefinitely; then the fraction of time the idea stays in $A_i$ is denoted as:

$$\lambda_i = P(X_t = A_i) \tag{3.7}$$

$\lambda = (\lambda_1, \lambda_2, ..., \lambda_m)$ is known as the stationary distribution, and there exists a unique stationary distribution of $X_t$ since the chain is irreducible [24]. If $\lambda_i$ denotes the fraction of time the idea spends in area i, then $1/\lambda_i$ denotes the expected time needed for the idea to come back to i; therefore, $\phi_i \approx \frac{1}{\lambda_i}$.
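To make the model concrete, the following sketch builds the flow matrix from a small matrix of long-tie counts, approximates the stationary distribution, and computes the mean expected arrival time at each area. The counts are hypothetical, the function names are ours, and the simple fixed-point iterations are chosen for readability on a toy example; they are an illustration under these assumptions and not the implementation used for the experiments.

```python
def flow_matrix(lt):
    """Row-normalise long-tie counts LT(A_i, A_j) into F = (f_ij), as in Eq. 3.4."""
    return [[count / sum(row) for count in row] for row in lt]


def stationary_distribution(F, iters=5000):
    """Approximate lambda (lambda = lambda F) by lazy power iteration."""
    m = len(F)
    lam = [1.0 / m] * m
    for _ in range(iters):
        step = [sum(lam[k] * F[k][j] for k in range(m)) for j in range(m)]
        # Averaging with the previous vector avoids oscillation on periodic
        # chains and leaves the stationary distribution unchanged.
        lam = [0.5 * a + 0.5 * b for a, b in zip(lam, step)]
    return lam


def expected_arrival_times(F, target, iters=5000):
    """Expected number of steps H[k] for an idea starting in k to first reach target."""
    m = len(F)
    h = [0.0] * m
    for _ in range(iters):
        # Fixed-point iteration of h_k = 1 + sum_j f_kj * h_j, with h_target = 0.
        h = [0.0 if k == target else 1.0 + sum(F[k][j] * h[j] for j in range(m))
             for k in range(m)]
    return h


def mean_arrival_times(F):
    """phi_i as in Eq. 3.6: average expected arrival time at area i from the other areas."""
    m = len(F)
    return [sum(expected_arrival_times(F, i)) / (m - 1) for i in range(m)]


# Hypothetical long-tie counts between three areas (symmetric, zero diagonal).
LT = [[0, 8, 2],
      [8, 0, 1],
      [2, 1, 0]]
F = flow_matrix(LT)
lam = stationary_distribution(F)
phis = mean_arrival_times(F)
```

In this toy example the area with the largest stationary probability also has the smallest mean arrival time, which matches the inverse relation between the two quantities described above.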

3.6.2 Experimental Results & Discussion

We extracted users and their social relationships in Gowalla and FourSquare and kept those that are confined to the US. We partitioned the US into 51 areas where each area corresponds to a federal state. Figure 3.16 shows scaling laws of the number of short and long ties as a function of the population size. Short ties are defined as relationships where both users live in the same state, while long ties are relationships where the users live in separate states. The total number of ties (i.e., all ties) is the sum of the number of short and long ties. A point is a state, where the x-axis corresponds to the number of users that live there and the y-axis corresponds to the number of their ties.

Figure 3.16: Scaling laws of short and long ties. [Axes: Population Size (log) vs. Number of Ties (log). Fits for Gowalla (a): short ties m=1.34, r=0.97; long ties m=0.95, r=0.98; all ties m=1.02, r=0.99. Fits for FourSquare (b): short ties m=1.43, r=0.95; long ties m=1.00, r=0.94; all ties m=1.07, r=0.96.]

Results show that as the population size increases, the number of short ties increases at superlinear rates, with m ≈ 1.34 for Gowalla (a) and m ≈ 1.43 for FourSquare (b). This result supports the claim that increasing the population size increases the number of relationships between people and decreases their path lengths so ideas can spread quicker. However, long ties do not increase at superlinear rates but instead at approximately linear rates, with m ≈ 0.95 for Gowalla (a) and m ≈ 1.00 for FourSquare (b). Therefore, long ties do not explain the superlinear scaling of innovation and wealth creation as a function of population size.

Figure 3.17: Face-to-face interactions of short ties and long ties. [(a) Number of face-to-face interactions for short ties and long ties; (b) P(k > K) (log-scale) vs. K (log-scale) for short ties and long ties.]

Figure 3.17 shows that most long ties are weak because face-to-face interactions occur more often when people are geographically close. In this experiment, we selected all pairs of long and short ties and calculated the number of their face-to-face interactions by matching their checkins. The average number of interactions for short ties is 3.95 (std=43.20), while for long ties it is 0.73 (std=9.19). While not all short ties are strong, most long ties are weak since 90% of them have no more than two interactions. The x-axis in (b) represents the number of interactions K, and the y-axis represents the probability that a tie has more than K interactions. We did not repeat the same experiment for FourSquare because its API did not provide access to users' checkins.

We emulated a simple contagion process using social relationships of users to examine the effects of short and long ties on adopting versus not adopting a contagion (similar to the process of spreading ideas in [71]). Using Rogers' work on the diffusion of innovations [75], we assume that 2.5% of the population, randomly selected, is responsible for generating innovative ideas (i.e., the seed set). In each step, they randomly select one of their acquaintances to propagate the contagion, and that acquaintance decides whether to adopt it with some fixed probability $p_c$. If the acquaintance decides to adopt the contagion, then it later becomes an initiator for spreading it. The process stops when 13.5% of the population has adopted the contagion. Those 13.5% of the population would be considered early adopters in the diffusion of innovations [75].

Figure 3.18: The collective strength of long ties in a simple contagion model. [Bar charts of Number of Ties for Adopters and Non-Adopters, by tie type, panels (a) and (b).]

Figure 3.18 shows that early adopters have on average more long than short ties. For Gowalla (a), the average adopter has 17.77 (std=38.67) short ties and 23.90 (std=111.57) long ties compared to 3.81 (std=6.58) short ties and 2.99 (std=6.20) long ties for non-adopters. For FourSquare (b), the average adopter has 16.36 (std=27.67) short ties and 25.14 (std=54.11) long ties compared to 1.62 (std=3.45) short ties and 1.63 (std=6.74) long ties for non-adopters. For the distribution of short and long ties of adopters and non-adopters, see Fig. 3.19 for Gowalla (a,b) and FourSquare (c,d). Since nodes in the social networks are more likely to adopt if they have more acquaintances, the point is that a job source, valuable idea, or even a social contagion is more likely to come from a weak tie because people have a limited number of strong ties but many more weak ties [61]. This experiment shows the collective strength of long ties: people have a higher chance of adopting a new idea if they have more long ties.

Figure 3.19: Distribution of long ties for adopters and non-adopters. [Panels: (a) Adopters, (b) Non-Adopters, (c) Adopters, (d) Non-Adopters; axes: Long Ties (log-scale) vs. Fraction.]
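The simple contagion process described in this section can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: the 13.5% stopping threshold is taken to include the seed set, every current adopter contacts one acquaintance per step, and a step limit guards against networks where the threshold cannot be reached. The function name, the value of p_c, and the ring-shaped toy network are hypothetical.

```python
import random


def simulate_contagion(adj, p_c=0.5, seed_frac=0.025, stop_frac=0.135,
                       max_steps=10_000, rng=None):
    """Spread a contagion over adjacency lists `adj` until stop_frac of nodes adopt."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    n_seed = max(1, round(seed_frac * len(nodes)))
    target = max(n_seed, round(stop_frac * len(nodes)))
    adopters = set(rng.sample(nodes, n_seed))  # randomly selected seed set
    for _ in range(max_steps):
        if len(adopters) >= target:
            break
        for u in list(adopters):
            if not adj[u]:
                continue
            v = rng.choice(adj[u])  # one randomly chosen acquaintance
            if v not in adopters and rng.random() < p_c:
                adopters.add(v)     # v later acts as an initiator itself
                if len(adopters) >= target:
                    break
    return adopters


# Toy network: a ring of 200 nodes, each linked to its two neighbours.
ring = {i: [(i - 1) % 200, (i + 1) % 200] for i in range(200)}
adopters = simulate_contagion(ring, p_c=0.5)
```

With the set of adopters in hand, the numbers of short and long ties of adopters and non-adopters can be compared by counting, for each node, how many of its acquaintances live in the same area and how many live elsewhere.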

We generate the flow matrix $F = (f_{ij})$ and calculate $\lambda_i$ as a proxy for $\phi_i$. Figures 3.20 and 3.21 show the economic development of US states as a function of the speed of access to ideas for Gowalla and FourSquare, respectively. The metrics we used for economic development are GDP [76], the number of patents issued [77], and the number of startups, defined as non-profit firms with fewer than 20 employees [78].

Overall, results show that $\phi_i$ is highly correlated with economic development in the US. Tables 3.8 and 3.9 show results using other techniques that have been proposed in the literature for measuring economic development. The population density of a state is defined as the number of residents [79] divided by the state's land area in sq. mi (excluding water) [80]. The social diversity of a state i, denoted as $D_i$, is defined as:

$$D_i = \frac{\sum_{j=1}^{m} p_{ij} \log(p_{ij})}{\log(m-1)} \tag{3.8}$$

where $p_{ij}$ is the number of edges connecting $A_i$ and $A_j$ divided by the number of edges leaving $A_i$ [74].

Figure 3.20: Economic development as a function of idea flow (Gowalla). [x-axis: φ_i (log-scale). (a) GDP (log-scale): 2009 m=-0.67, r=-0.92; 2010 m=-0.67, r=-0.92; 2011 m=-0.66, r=-0.92; 2012 m=-0.67, r=-0.91. (b) Patents (log-scale): 2009 m=-0.81, r=-0.76; 2010 m=-0.82, r=-0.77; 2011 m=-0.83, r=-0.77; 2012 m=-0.83, r=-0.79. (c) Startups (log-scale): 2009, 2010, and 2011 m=-0.59, r=-0.86.]

Table 3.8: Measuring economic development (Gowalla).

Measure              GDP        Patents    Startups
Population Density   r = 0.50   r = 0.45   r = 0.38
Social Diversity     r = 0.88   r = 0.74   r = 0.83
Ideas Flow           r = 0.92   r = 0.77   r = 0.86

In Table 3.8, results show that the speed of access to ideas $\phi_i$ in Gowalla is more correlated with economic development than population density and social diversity are.

Figure 3.21: Economic development as a function of idea flow (FourSquare). [x-axis: φ_i (log-scale). (a) GDP (log-scale): 2009 to 2012 m=-0.59, r=-0.88. (b) Patents (log-scale): 2009 m=-0.70, r=-0.71; 2010 m=-0.71, r=-0.72; 2011 m=-0.73, r=-0.74; 2012 m=-0.72, r=-0.74. (c) Startups (log-scale): 2009 m=-0.51, r=-0.80; 2010 m=-0.51, r=-0.81; 2011 m=-0.51, r=-0.81.]

Table 3.9: Measuring economic development (FourSquare).

Measure              GDP        Patents    Startups
Population Density   r = 0.50   r = 0.45   r = 0.38
Social Diversity     r = 0.88   r = 0.74   r = 0.83
Ideas Flow           r = 0.92   r = 0.77   r = 0.86

Figure 3.22: Speedy idea flow as a function of social diversity. [x-axis: Social Diversity D_i; y-axis: Speedy Idea Flow φ_i. Linear fit y = -0.9x + 13 in both panels, with r=-0.98 in (a) and r=-0.99 in (b).]

In Table 3.9, there are two instances where social diversity is more correlated with economic development in FourSquare, but still less correlated than the results in Table 3.8.

Results show that the speed of access to ideas is correlated with economic development in the US from 2009 to 2012 because it is a near perfect measure of social diversity, as shown in Fig. 3.22 for Gowalla (a) and FourSquare (b). However, the causality between the two is still unknown. The results suggest that combining long ties and the spread of ideas might be an important indicator of economic development in addition to population size, density, and social diversity. Aggregating and normalizing hundreds of thousands of long ties across the US removes the potential effect of ideas not traveling randomly. Unlike social diversity, population density did not perform as well as the other measures because it was designed to measure characteristics of cities and not geographical areas with diverse ranges of population densities (e.g., New York consists of dense NYC and sparse NYS, which limits its predictive power).

Finally, we focus only on a very specific dimension of social relationships (i.e., long ties) and ignore other ties that could lead to better correlations with economic development. While there are many more dimensions of human relationships (e.g., short ties, strong ties, friends from different communities, etc.), one particular dimension that could lead to better results within a geographical area is friends with different interests or skills, since they would complement each other in collaborations such as solving a difficult problem. Perhaps understanding the interplay of human relationships and economic development can suggest radical socially-driven alternatives in addition to the traditional stimulus packages for growing the economy [74] and a direction for studying urban growth [71].

3.7 Summary of Results

Contrary to the belief in the death of the distance barrier to forming social ties [81], we find that the creation of friendship between two people in Gowalla is more likely to occur when they are geographically closer, and the likelihood of users being friends rapidly decreases as the geographic distance between them increases. Such geographic effects may help in designing spatially-aware community detection algorithms where on average every two people in a community are separated by a few hops and are also likely to be within spatial proximity.

First, our data analysis of the Gowalla friendship network reveals two degrees of geographical concentration, where friends and friends-of-friends are more likely to be within geographic proximity. Conversely, pairs of users who are separated by three or more hops of friendship relation are unlikely to be within geographic proximity. Also, friends who are within geographic proximity are more likely to physically interact by going to the same places together than distant friends. Yet, the likelihood of physical interactions among friends-of-friends is minuscule even though they are geographically concentrated.

Second, we showed that covers can serve as a null model for examining community structures. For most quality metrics, small communities are more likely to outperform large ones because it is much easier to find a small group that maximizes a particular metric. Therefore, comparing detected communities to covers tells us how much better the algorithm performs than a proposed null model for a given community size.

Finally, we used the results from the covers and compared them to the communities detected by modified CPM, unweighted and weighted IA, and GANXiS. By incorporating spatial information into CPM to make the algorithm scalable, it detected meaningful communities of a large online social network where members are more likely to physically interact than members of a cover used as a null model. From the NCP plots, we noticed the importance of small-size communities in large social networks, in which it is much harder to find a large community because humans have limited resources to create and maintain relationships. We used the level of physical interactions among members in a community as the final quality measure to compare and validate the performance of the community detection algorithms against the closest-friend-first cover.

Other applications that we foresee might benefit from such spatial effects include recommendation systems and link prediction, by designing systems based on the knowledge of users' geographical locations, their social connections, and the structure of their friendship communities. For instance, recommendation systems could be enriched by incorporating geographical information of users, their friends, and location-based ratings to increase the quality of the recommended items [82]. Link prediction could be enriched by using pairs of users that are geographically close and belong to the same community to predict how likely they are to become friends or connected in the future [83].

CHAPTER 4 SOCIAL RANKING TECHNIQUES

Figure 4.1: Conceptualization of social ranking. [Top panel: social graph with users P1 to P6; bottom panel: web graph with pages Yahoo, Digg, ABC, CNN, Fox, and MSNBC.]

Previous work on the ranking of pages conceptualized the web as a network in which pages are nodes and links are directed edges, as illustrated in Fig. 4.1. Advances in social networks enabled a different perspective on ranking pages, from a relationship point of view. For simplicity, the social network of users illustrated in the top rectangular box in Fig. 4.1 consists of nodes P1, P2, ..., P6, where an undirected edge between P1 and P2 represents a social relationship between the two users and an edge from P1 to CNN represents P1 broadcasting a CNN URL to its ties P2, P3, and P4. Note that the edge from P1 to CNN is not a part of the social network, but a connection between the web and the social network.

4.1 Google Buzz & Twitter

We collected data from two networks on the web. The first is Google Buzz, a platform that combines social relationships and mini-blogging for information dissemination. The second is Twitter, where users choose to follow sources of information. Both networks have messages containing URLs, which give us clues into how users would rank the quality of the information coming from URLs by using the techniques we describe later.

Portions of this chapter previously appeared as: T. Nguyen and B. Szymanski, “Social Ranking Techniques for the Web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Ontario, 2013, pp. 49-55.

We collected the Google Buzz data from early September 2011 to the middle of October of the same year. There were around 2.5M users who shared approximately 100M messages, of which about 30M had URLs embedded in them. We collected the Twitter data from early September 2011 to late December of that year. There were around 1M users who shared approximately 300M messages, of which about 50M had URLs embedded in them. Additional details of the datasets for Google Buzz and Twitter are provided in Table 4.1 and Table 4.2. "All URLs" refers to all representations of URLs embedded in messages; two different representations could be the same URL when they are masked by redirect services. "*URLs" refers to the final destinations of URLs that have been shared by at least two users within the network. In addition, we reduced the size of the datasets by keeping only users whose geographical locations were known. To pinpoint the geographical location of a user, we extracted locations from their geo-tagged messages and used the most frequent location as the location of their residence. The reduced networks are shown in Table 4.3.

First, parsing URLs from messages is prone to errors because humans have multiple ways of writing supposedly the same link. Examples are URLs containing typos and spelling mistakes, URLs masked by redirect services, and so on. Second, with limits on hardware resources, bandwidth sharing, and data access, we attempted to collect as much data as we could for the purpose of ranking URLs on social media. Third, we were able to collect the entire connected component with BFS sampling for Google Buzz, which resulted in the sum of indegree being equal to the sum of outdegree. Twitter is a much larger network that consists of hundreds of millions of accounts [26]. When calculating the data summary of Twitter, we look at users who have been processed in terms of collecting their information and not users who are waiting to be processed, which resulted in the sum of indegree not being equal to the sum of outdegree.
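The residence heuristic used to reduce the datasets, taking the most frequent geo-tagged location of a user as the location of their residence, can be sketched in a few lines. The function name and the sample coordinates are hypothetical, and the sketch assumes locations have already been mapped to comparable labels or rounded coordinates so that repeated visits to the same place compare as equal.

```python
from collections import Counter


def home_location(geotags):
    """Return the most frequent geo-tagged location, or None without any geotags."""
    if not geotags:
        return None  # users without a known location are dropped from the reduced network
    return Counter(geotags).most_common(1)[0][0]


# Hypothetical geotags (latitude, longitude) extracted from one user's messages.
tags = [(42.73, -73.68), (40.71, -74.01), (42.73, -73.68)]
residence = home_location(tags)
```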

Table 4.1: Data summary of Google Buzz.

            x̄          σ_X          Σ
Users       -          -            2,522,109
Inlinks     7.36       115.04       18,566,607
Outlinks    7.36       58.39        18,566,607
Messages    42.94      1,067.21     108,439,019
All URLs    11.67      21,706.36    34,472,205
*URLs       3.85       174.80       2,647,561

Table 4.2: Data summary of Twitter.

            x̄           σ_X          Σ
Users       -           -            1,057,163
Inlinks     17,675.58   334,127.10   18.69B
Outlinks    520.66      7,676.48     550,421,023
Messages    280.84      1,005.09     277,310,683
All URLs    44.26       45,359.19    46,532,403
*URLs       8.19        57.59        2,294,077

4.1.1 Categories of URLs

Figure 4.2 shows the categories of the 100 most popular and 100 randomly selected URLs for Google Buzz (a,b) and Twitter (c,d). Popular URLs are defined by the number of spreaders, that is, the users who shared or re-shared a given URL. In Google Buzz (a), 24% of popular URLs are from social media, 16% are about technological products such as Apple, 15% are videos from YouTube, and so on. In Twitter (c), 41% of popular URLs are from social media, 19% are videos from YouTube, 11% are image related, and so on. Google Buzz has more popular URLs relating to technological products, while Twitter has more popular URLs relating to social media. For random URLs, Google Buzz has 27% of URLs from social media, while for Twitter this number is 53%.

Table 4.3: Google Buzz (left) & Twitter (right) with geography.

                  Google Buzz                 Twitter
                  x̄        σ_X      Σ         x̄        σ_X      Σ
Users             -        -        24,813    -        -        15,036
Inlinks           8.30     39.92    206K      102.60   279.35   1.5M
Outlinks          8.30     33.45    206K      102.60   278.77   1.5M
Extracted URLs    260.66   978.30   6.5M      227.93   305.39   3.4M

Figure 4.2: Categories of popular (a,c) and random (b,d) URLs. [Pie charts: (a) Google Buzz popular, (b) Google Buzz random, (c) Twitter popular, (d) Twitter random.]

4.1.2 Spreaders & Affected Sets

From each of the Google Buzz and Twitter datasets, we randomly chose 2,000 URLs with equal probability, denoted as the random set of URLs. We also chose the top 2,000 shared URLs, denoted as the popular set of URLs. There are two sets of URLs in each network, giving us four sets of URLs in total. For each URL, we calculated the size of the affected set, which consists of nodes that received the URL from the spreaders but chose not to spread it further. We also computed the average length of all shortest paths from 10 randomly chosen users to members of a random subset of spreaders. The results are shown in Fig. 4.3(a) for Google Buzz and Fig. 4.3(b) for Twitter. A point on the plot is a URL, where the x-axis corresponds to the size of the affected set in logarithmic scale and the y-axis corresponds to the average length of shortest paths from randomly chosen users to the spreaders. A red point is a URL from the random set, and a blue star is a URL from the popular set. The black line is a linear classifier that separates popular URLs from random URLs, and crosses are points that have been misclassified. We substitute the entire spreader set with a randomly selected subset simply as a matter of efficiency, because shortest-path computations are expensive in large networks, as mentioned by the authors in [84].

Figure 4.3: Shortest paths to URLs in Google Buzz (a) and Twitter (b). [Axes: Size of Affected Set (log-scale) vs. Avg. Distance; markers: Random, Popular.]

In Fig. 4.3, we noticed that as the size of the affected set increases, the average distance from randomly selected users to the information on the web page decreases for both random and popular sets of URLs in Google Buzz. This is because very large affected sets increase the likelihood that a randomly chosen user has a path through an affected user reaching a spreader. This agrees with our intuition that information collectively shared by users with high outdegrees has a greater coverage of dissemination. However, this correlation is weaker in Twitter due to the celebrity effect of some users having millions of followers and creating large affected sets; an example is a URL that was shared in the network only by a celebrity. More importantly, affected sets influence our social ranking techniques, where the structure of the network instead of the web topology is used to rank pages or URLs.

4.1.3 Information Distances

Figure 4.4 shows the ultra small-world property of the distance from a randomly selected starter to popular and random URLs in Google Buzz (a) and Twitter (b). For each URL, we randomly selected 100 starters and calculated the length of the shortest path from each starter to the closest spreader of the URL. The densities of the number of hops are shown in Fig. 4.5 and the average shortest path lengths in Fig. 4.4.

Figure 4.4: Ultra small-world property from starters to information. [Panels (a) and (b): Avg. Path Length by URL category, labeled YO, TW, YF, FA, IM, NE, TE, RA.]

Figure 4.5: Densities of shortest path lengths from starters to URLs. [Panels (a) and (b): Hop vs. Density; legend: Facebook, Images, News, Tech, Twitter, Youtube, Random.]

Results show that a randomly selected starter in Google Buzz is about one hop away from a popular URL, compared to a distance of 2.5 hops from a random URL. For Twitter, a randomly selected starter is about 2 hops away from a popular URL and slightly farther from a random URL. These average shortest path lengths to popular and random URLs are much shorter than the six degrees of separation in the Travers-Milgram small-world experiment [29], demonstrating that the distance from human to information is sometimes shorter than the distance from human to human.
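The starter-to-information distance used in this subsection can be sketched with a breadth-first search. This is only a minimal illustration, not the code used for the experiments: the function names, the adjacency-list input `following`, and the assumption that paths follow directed "follows" edges from the starter toward a spreader are all hypothetical choices made here.

```python
from collections import deque

def hops_to_closest_spreader(following, starter, spreaders):
    """Return the number of hops from `starter` to the nearest node in
    `spreaders`, or None if no spreader is reachable.
    `following` maps a user to the users it follows (assumed direction)."""
    if starter in spreaders:
        return 0
    seen = {starter}
    queue = deque([(starter, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in following.get(node, ()):
            if nxt in seen:
                continue
            if nxt in spreaders:
                return dist + 1          # BFS guarantees this is the shortest
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

def average_distance(following, starters, spreaders):
    """Average hop count over the starters that can reach a spreader."""
    hops = [hops_to_closest_spreader(following, s, spreaders) for s in starters]
    reachable = [h for h in hops if h is not None]
    return sum(reachable) / len(reachable) if reachable else None

# Toy network (hypothetical): a -> b -> c, d -> c, and c shared the URL.
toy_following = {"a": ["b"], "b": ["c"], "c": [], "d": ["c"]}
toy_average = average_distance(toy_following, ["a", "d"], {"c"})
```

In the toy network, starter `a` is two hops and starter `d` is one hop from the spreader `c`, so the average is 1.5 hops.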

4.1.4 Geographical Distances

Figure 4.6 shows the geographical concentration of pairs of users who are separated by a fixed number of hops in Google Buzz and Twitter, and in two additional networks: Gowalla and FourSquare. We noticed that these four social networks have two degrees

[Figure 4.6 panels: a) Hop 1, b) Hop 2, c) Hop 3, d) Hop 4, e) Hop 5, f) Hop 6; x-axis: Geographic Distances (km); y-axis: Density; series: B, T, G, F.]

Figure 4.6: Two degrees of spatial concentration.

of spatial concentration where users who are separated by one or two hops are more geographically concentrated than pairs who are separated by three hops or more. For instance, the fraction of pairs within 560 km is 69% for friendship pairs (hops=1, shown in a), 47% for friends-of-friends pairs (hops=2, shown in b), 25% for pairs with hops=3 (shown in c), 20% for pairs with hops=4 (shown in d), 17% for pairs with hops=5 (shown in e), and 17% for pairs with hops=6 (shown in f). One explanation for these two degrees of concentration is the effect of the local clustering coefficient of a user, defined as the fraction of the user's friends who are friends with each other. For the probability that two people with a friend in common are friends themselves to be high, they need to be within some geographical proximity; otherwise the opportunity for them to interact is small. The average local clustering coefficients of 104 randomly selected pairs of users in Google Buzz, Twitter, Gowalla, and FourSquare are 0.31, 0.36, 0.30, and 0.34, respectively.

Figure 4.7: Four dimensions of social relationships.

4.1.5 Densities of Social Relationships

Four dimensions of social relationships are visualized in Fig. 4.7. Friends are defined as reciprocal following relationships. Neighbors are users that are geographically close. Peers are users that belong to the same community. Interests are users that have similar interests, measured by the keyword similarity of the URLs they share. The intersection of circles represents pairs of users with multiple dimensions of social relationships. The label "Two" represents pairs of users with two dimensions of social relationships, such as being friends and neighbors.

Table 4.4: Social relationships densities in Google Buzz.

Buzz              Friends   Peers   Interests   Neighbors
Among Friends     -         0.99    0.09        0.58
Among Peers       0.26      -       0.25        0.41
Among Interests   0.01      0.32    -           0.06
Among Neighbors   0.05      0.50    0.13        -
Among Random      0.01      0.27    0.06        0.03

Tables 4.4-4.5 show the densities of friends, peers, neighbors, and users with similar interests. The left column gives the relationship of the pairs and the top row gives the relationship whose density is reported. For example, among friends in Table 4.4 for Google Buzz, 99% of them are also peers, 9% of them have similar interests, and 58% of

Table 4.5: Social relationships densities in Twitter.

Twitter           Friends   Peers   Interests   Neighbors
Among Friends     -         0.85    0.11        0.30
Among Peers       0.32      -       0.12        0.29
Among Interests   < 0.01    0.19    -           0.03
Among Neighbors   0.01      0.36    0.09        -
Among Random      < 0.01    0.13    0.04        0.02

[Figure 4.8 panels: a) and b), y-axis: Avg. CKS; x-axis: Geographical Distance (km); series: Friends, Followings, Peers, Random.]

Figure 4.8: CKS for friendship, following, peers, and random pairs.

them are neighbors. For Twitter, among friends, 85% of them are peers, 11% have similar interests, and 30% are neighbors. The densities of friends, peers, interests, and neighbors are consistent in Google Buzz and Twitter. For example, most of the friends are among peers, most of the peers are among friends, most of the people with similar interests are among peers, and most of the neighbors are among friends.

4.1.6 Keyword Similarity

Figure 4.8 shows the cosine keyword similarity (CKS) of selected friendship, following, peers, and random pairs of users in Google Buzz (a) and Twitter (b). The CKS of two users is the cosine of the angle between the two vectors consisting of keyword frequencies extracted from web pages shared by these two users. Let $W_v$ and $W_{v'}$ be lists of words in web pages that users $v$ and $v'$ have shared. Let $A_v$ be a vector of word frequencies where the $i$th index in $A_v$ represents the number of times the word $w_i$ appears in $W_v$. The cosine keyword similarity for $v$ and $v'$ is defined as:

\[ \cos(v, v') = \frac{A_v \cdot A_{v'}}{\|A_v\| \, \|A_{v'}\|}. \tag{4.1} \]

A pair of nodes $(v, v')$ represents friendship if they follow each other, following if $v$ follows $v'$ but not vice versa, and is a random pair if there is no following in either direction. We calculated the average CKS of friendship, following, peers, and random pairs as a function of the geographical distance separating members of these pairs. For random pairs, we noticed that CKS decreases as the geographical distance increases. On the other hand, the effect of geography on cosine keyword similarity is negligible for friendship, peer, and following pairs, although these pairs have a higher cosine keyword similarity than random pairs.
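The CKS definition above can be illustrated with a short sketch. It assumes that each user's shared pages have already been reduced to a flat list of keywords; the function name `cks` and the handling of empty word lists (returning 0) are choices made here, not taken from the thesis.

```python
import math
from collections import Counter

def cks(words_v, words_v2):
    """Cosine keyword similarity of two users, given the keyword lists
    extracted from the pages each user shared."""
    a, b = Counter(words_v), Counter(words_v2)   # word-frequency vectors
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                               # assumption: no keywords, no similarity
    return dot / (norm_a * norm_b)
```

Because word frequencies are non-negative, the value lies between 0 (no shared keywords) and 1 (proportional frequency vectors).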

4.2 Social Ranking Techniques

Let $G_U = (V, E)$ be a directed multi-labeled graph where $V$ is the set of nodes, $E$ is the set of edges where $e = (v_i, v_j)$ represents a directed edge from node $v_i$ to node $v_j$, and $U$ is the set of URLs, subsets of which label the nodes in $V$. For a URL $u \in U$, let $S(u)$ denote the set of all spreaders of $u$; in other words, all nodes in $V$ that have posted $u$.

4.2.1 PageRank on Social Network

We extend the PageRank algorithm to rank URLs on a social network (PRSN) as follows. Given a multi-labeled graph $G_U = (V, E)$, let $F = (f_{ij})$ be an $n \times n$ weighted adjacency matrix where $n$ is the number of nodes (i.e., $n = |V|$), $f_{ij} = 0$ if there is no directed edge from $v_i$ to $v_j$, and $f_{ij} = 1/\deg(i)$ otherwise. Let $R$ be a vector consisting of $n$ elements where the $i$th element of $R$, denoted as $r_i$, corresponds to the PageRank score of the $i$th node. Let $k$ be the maximum number of iterations that the PageRank algorithm runs. At the first iteration, every node sends its score, divided by the number of links pointing from this node to other nodes, through each outgoing link. After that, each node updates its score to the sum of the scores that it has received:

\[ r_i = f_{1i} r_1 + f_{2i} r_2 + \dots + f_{ni} r_n. \tag{4.2} \]

If there is an edge from node $j$ to node $i$, then $f_{ji} > 0$ and node $j$ will send the fraction $f_{ji} = 1/\deg(j)$ of its score $r_j$ to node $i$. Equation 4.2 can be compactly written as $R^{<1>} = F^T R^{<0>}$, where $F^T$ is the transpose of the matrix $F$, the superscript $<1>$ denotes the scores of all nodes after the first iteration, and $R^{<0>}$ is the initial vector. Let $R^{<k>}$ be the scores of nodes at the last iteration $k > 0$, defined by induction as:

\[ R^{<k>} = F^T R^{<k-1>} \tag{4.3} \]

If there are sinks in the graph $G$, that is, nodes without outgoing edges, then for large enough $k$ they will absorb all scores, since the scores can enter but cannot leave the sinks. One way to fix this problem is to scale the strength of links by a constant factor $0 < \sigma < 1$ and to compensate for this scaling by adding an artificial flow between any two nodes with the weight $\frac{1-\sigma}{n}$. This solution is known as the scaled version of PageRank [85]. The score of the $i$th node is then denoted as $r'_i$ and is defined as:

\[ r'_i = \sum_{j=1}^{n} \left( \sigma f_{ji} + \frac{1-\sigma}{n} \right) r'_j. \tag{4.4} \]

Equation 4.4 can be compactly written using the matrix $\tilde{F} = \sigma F + \frac{1-\sigma}{n}$. By the Perron-Frobenius Theorem [85], the scaled PageRank scores converge to a stable solution:

\[ R'^{<i>} = \tilde{F}^T R'^{<i-1>} \quad \text{where } 0 < i \le k. \tag{4.5} \]

Given a subset of URLs $U' \subset U$, the PageRank score of a URL $u \in U'$ on a social network (PRSN) is defined as:

\[ PRSN(u) = \frac{\sum_{v_i \in S(u)} r'^{<k>}_i}{\sum_{u' \in U'} \sum_{v_i \in S(u')} r'^{<k>}_i}. \tag{4.6} \]
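The scaled PageRank iteration and the PRSN normalization can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name `prsn_scores`, the uniform starting vector, the default $\sigma = 0.85$, and reading $\deg(i)$ as the out-degree of node $i$ are all choices made here.

```python
def prsn_scores(nodes, edges, spreaders, sigma=0.85, k=50):
    """Scaled PageRank on the social graph, followed by the PRSN ratio.
    nodes: list of node ids; edges: (src, dst) pairs;
    spreaders: maps each URL in U' to the set of nodes that posted it."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    r = {v: 1.0 / n for v in nodes}        # initial vector (assumed uniform)
    for _ in range(k):
        # artificial flow: each node sends (1 - sigma)/n of its score to every node
        base = (1.0 - sigma) * sum(r.values()) / n
        nxt = {v: base for v in nodes}
        for j in nodes:
            for i in out[j]:
                # scaled link strength sigma * f_ji with f_ji = 1/deg(j)
                nxt[i] += sigma * r[j] / len(out[j])
        r = nxt
    # score of a URL: spreader mass normalized over all URLs in U'
    mass = {u: sum(r[v] for v in who) for u, who in spreaders.items()}
    total = sum(mass.values())
    return {u: (m / total if total else 0.0) for u, m in mass.items()}

# Toy graph (hypothetical): a and b follow each other, c follows a.
toy_prsn = prsn_scores(
    nodes=["a", "b", "c"],
    edges=[("a", "b"), ("b", "a"), ("c", "a")],
    spreaders={"u1": {"a"}, "u2": {"b"}},
)
```

In the toy graph, node `a` receives score from both `b` and `c`, so the URL it posted (`u1`) ends up with the larger share. Because the final scores are ratios, any overall loss of score mass at sinks does not affect the ranking.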

4.2.2 HITS on Social Network

The HITS algorithm used to rank URLs on a social network (HSN) is defined as follows [35], [85]. Given $G_U = (V, E)$, let $M = (m_{ij})$ be an $n \times n$ adjacency matrix where $n$ is the number of nodes, $m_{ij} = 1$ if there is a directed edge from node $v_i$ to node $v_j$, and $m_{ij} = 0$ otherwise. Let $k$ be the maximum number of iterations. Given a set of URLs $U' \subset U$, let $H$ and $A$ be vectors of scores for hubs and authorities, respectively. Authorities are the URLs (i.e., $u \in U'$) and hubs are nodes that share these URLs. The $i$th element of the vector $H$ represents the score of the $i$th hub, and the $j$th element of the vector $A$ represents the score of the $j$th authority. At the first iteration, the score $h_i$ of a hub is set to the number of authorities to which it points, and the score $a_j$ of an authority is set to the sum of the scores of the hubs pointing to it. More formally, $h_i$ and $a_j$ are defined as:

\[ h_i^{<0>} = m_{i1} + m_{i2} + \dots + m_{in}, \tag{4.7} \]
\[ a_j^{<0>} = m_{1j} h_1^{<0>} + m_{2j} h_2^{<0>} + \dots + m_{nj} h_n^{<0>}. \tag{4.8} \]

Let $H^{<l>}$ and $A^{<l>}$ be the scores of hubs and authorities at iteration $l$. The HITS algorithm [85] can then be written as:

\[ H^{<l>} = (M M^T)^l H^{<0>} \quad \text{where } 0 < l \le k, \tag{4.9} \]
\[ A^{<l>} = (M^T M)^{l-1} M^T H^{<0>} \quad \text{where } 0 < l \le k. \tag{4.10} \]

Finally, the score of a URL in the authorities is the value $a_j$ normalized by the sum of scores in the vector $A$.
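The hub and authority updates can be sketched with plain lists. This is an illustrative sketch only: the function name `hits_authorities` is hypothetical, the per-iteration rescaling is added here purely to keep the numbers small (it does not change the normalized result), and the sketch stops at normalized authority scores per index because the mapping from these scores onto URLs is not spelled out in the text.

```python
def hits_authorities(m, k):
    """Authority scores after k >= 1 iterations, following
    A<l> = (M^T M)^(l-1) M^T H<0>, normalized to sum to 1.
    m is an n x n 0/1 adjacency matrix given as a list of lists."""
    n = len(m)
    h = [sum(row) for row in m]                       # initial hub scores: out-link counts
    a = [sum(m[i][j] * h[i] for i in range(n)) for j in range(n)]   # a = M^T h
    for _ in range(k - 1):
        h = [sum(m[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = M a
        a = [sum(m[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = M^T h
        scale = sum(a)
        if scale:
            a = [x / scale for x in a]                # rescale to avoid growth
    total = sum(a)
    return [x / total for x in a] if total else a

# Toy matrix (hypothetical): node 0 points to 1 and 2, node 1 points to 2.
toy_m = [[0, 1, 1],
         [0, 0, 1],
         [0, 0, 0]]
toy_auth = hits_authorities(toy_m, 1)
```

For the toy matrix and a single iteration, the initial hub scores are 2, 1, and 0, so the unnormalized authority scores are 0, 2, and 3.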

4.2.3 Ranking with Maximum Flow

We define the following maximum flow algorithm to rank URLs on a social network. Given a graph $G_U = (V, E)$ and a subset of URLs $U' \subset U$, let $p$ represent a node. We want to rank the URLs in $U'$ with respect to $p$ and $G$ by constructing a directed flow graph denoted as $G'_p = (V', E')$.

The first part of the construction requires copying the social structure of $G$ to $G'_p$. For every node $v_i$ that $p$ follows, we add $v_i$ to $V'$ and the edge $e = (p, v_i)$ into $E'$. At the subsequent iteration, we repeat the same process for every node that has been added into $V'$ from the previous iteration; that is, if $v_i$ was added into $V'$ and

[Figure 4.9 diagram: source p; information and social network nodes P1 to P5; web pages u1 and u2; super sink t.]

Figure 4.9: Graph $G'_p$ for ranking URLs $\{u_1, u_2\}$ with respect to node $p$.

there is an edge $e = (v_i, v_j)$, then we add $v_j$ to $V'$ if $v_j$ has not been added before. The edge $e = (v_i, v_j)$ will still be added into $E'$ if $v_j$ has been added before. This process of constructing the graph $G'_p$ continues until all possible nodes from $V$ that are reachable from $p$ have been added into $V'$. For practical reasons, it is wise to stop when the diameter of $G'_p$ is small (e.g., three) to reflect the influence of nodes that are within network proximity. At the end of the process, an edge originating from node $v$ gets a weight equal to the inverse of the node degree in $G'_p$.

The second part of constructing $G'_p$ introduces some additional nodes and edges. For every URL $u' \in U'$, we add $u'$ into $V'$. For every spreader $s \in S(u')$ of the URL $u'$, we add an edge $e = (s, u')$ with a weight of 1 into $E'$ if $s \in V'$. We add a super sink denoted $t$ into $V'$ and add an edge $e = (u', t)$ with an edge weight of 1 for every URL $u'$ in $U'$.

The maximum flow of the graph $G'_p$ from source $p$ to super sink $t$ is a function $F$ that assigns a non-negative value to each edge so that it maximizes the total flow coming from the source $p$ to the super sink $t$ while satisfying two conditions: first, it does not exceed the weight of an edge, i.e., $F(e) \le c_e$; and second, it obeys the conservation of flow law except for the source $p$ and the super sink $t$, i.e.,

\[ F_{out}(v) = \overbrace{\sum c_e}^{\text{Flow out to social ties}} + \overbrace{\sum c'_e}^{\text{Flows out to pages}} = F_{in}(v) \tag{4.11} \]

where $c_e$ is the assigned flow for the edge $e = (v_i, v_j)$ between two nodes, and $c'_e$ is the assigned flow for the edge $e = (v_i, u_j)$ for the node $v_i$ and the URL $u_j$. The construction of the graph $G'_p$ is illustrated in Fig. 4.9. Polynomial running time algorithms for finding the maximum flow, such as the Edmonds-Karp algorithm with running time $O(V'E'^2)$, can be found in [85], [86].
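The two-part construction of $G'_p$ and the maximum flow computation can be sketched as below. Several points are assumptions made for this sketch and are not confirmed by the text: "node degree" is read as the number of outgoing social edges of a node inside $G'_p$, the depth limit plays the role of the small diameter, a URL's score is taken to be the flow on its edge into the super sink, and all function names are hypothetical. Exact fractions are used so that capacities such as 1/3 do not leave floating-point residue.

```python
from collections import deque
from fractions import Fraction

SINK = ("sink",)                      # super sink t (tuple avoids clashing with user ids)

def build_flow_graph(follows, p, spreaders, max_depth=3):
    """Capacities cap[u][v] of the flow graph for source p."""
    depth, queue, social = {p: 0}, deque([p]), []
    while queue:                      # part 1: copy social structure reachable from p
        v = queue.popleft()
        if depth[v] == max_depth:
            continue
        for w in follows.get(v, ()):
            social.append((v, w))
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
    out_deg = {}
    for v, _ in social:
        out_deg[v] = out_deg.get(v, 0) + 1
    cap = {}
    for v, w in social:               # weight = inverse of the (out-)degree, assumed
        cap.setdefault(v, {})[w] = Fraction(1, out_deg[v])
    for url, who in spreaders.items():  # part 2: URL nodes and super sink
        for s in who:
            if s in depth:
                cap.setdefault(s, {})[("url", url)] = Fraction(1)
        cap.setdefault(("url", url), {})[SINK] = Fraction(1)
    return cap

def max_flow_residual(cap, source, sink):
    """Edmonds-Karp: returns the residual capacities after a maximum flow."""
    res = {source: {}, sink: {}}
    for u, nbrs in cap.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)   # reverse edge for undoing flow
    while True:
        parent, queue = {source: None}, deque([source])
        while queue and sink not in parent:   # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return res
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[a][b] for a, b in path)
        for a, b in path:
            res[a][b] -= push
            res[b][a] += push

def flow_scores(follows, p, spreaders, max_depth=3):
    """Flow reaching the sink through each URL node (assumed ranking score)."""
    cap = build_flow_graph(follows, p, spreaders, max_depth)
    res = max_flow_residual(cap, p, SINK)
    return {u: cap[("url", u)][SINK] - res[("url", u)][SINK] for u in spreaders}
```

A maximum flow is generally not unique, so individual URL scores can depend on the order in which augmenting paths are found even though the total flow does not. For example, with `follows = {"p": ["a", "b"], "a": ["c"], "b": []}` and `spreaders = {"u1": {"a", "b"}, "u2": {"c"}}`, the total flow out of `p` is 1 and `u1` receives at least as much flow as `u2`.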

4.2.4 Variants of Maximum Flow

The second variant of network flow incorporates social relationships and geography by assigning weights to edges based on the geographical distance between the nodes. We assign the edge weight for nodes $v_i$ and $v_j$ as:

\[ w_{ij} = \frac{g_d(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} g_d(v_i, v_k)^{-1}}. \tag{4.12} \]

where $g_d(v_i, v_j)$ is the geographical distance from $v_i$ to $v_j$. The third variant uses cosine keyword similarity to assign the weights. The edge weight for nodes $v_i$ and $v_j$ is defined as:

\[ w_{ij} = \frac{CKS(v_i, v_j)^{-1}}{\sum_{v_k \in v_i^{out}} CKS(v_i, v_k)^{-1}}. \tag{4.13} \]

The last variant of network flow uses community structure by replacing the social network with the community group and connecting the source to all members of the community. The (binary) weights for the edges in the community do not take into account geography or cosine keyword similarity, so their values are 1.

4.3 Social Ranking Experiments

4.3.1 Comparing PageRank & HITS

We selected 30 URLs from each of the popular and random URL sets. For each selected URL, we calculated its score by using PageRank and HITS, and ranked the URLs (i.e., 1st, 2nd, 3rd, etc.) with respect to the set. We compared the ranking results of PageRank and HITS for popular and random URLs, shown in Fig. 4.10 for Google Buzz and Fig. 4.11 for Twitter. Ranking results for Google Buzz are listed in Table 4.6 and Table 4.7.

The rankings of popular URLs using PageRank and HITS are more consistent than those of the random URLs. We measured the ranking consistency as the average difference of two ranking algorithms on a set of URLs (i.e., $\frac{1}{w} \sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|$) and the sum of differences (i.e., $\sum_{u \in U'} |P_{HSN}(u) - P_{PRSN}(u)|$), where $P_x(u)$ is the position of the URL $u$ determined by the algorithm $x$ and $w$ is the number of URLs.

The average difference is more appropriate than the sum of differences when ranking a large number of pages, for example 1000 pages instead of 5 pages. The average gives the per-page difference between the two ranking algorithms over the 1000 pages, whereas the sum gives the total difference in ranks between the two algorithms. For a smaller number of pages, the sum might be more appropriate for quantifying the difference between two ranking algorithms. For the popular URLs in Google Buzz, the average difference was 2.9, meaning that on average HITS and PageRank were off by 3 positions, and the sum of differences between them was 86. For the random URLs in Google Buzz, the average difference was 9.6 and the sum of differences was 288. For the popular URLs in Twitter, the average difference was 5.9 and the sum of differences was 178. For random URLs in Twitter, the average difference was 7.2 and the sum of differences was 216. In both networks, popular URLs are ranked more consistently than random URLs, which makes the HITS algorithm more suitable than PageRank when ranking viral information because it is computationally more efficient.
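The two consistency measures can be computed directly from the rank positions. The sketch below uses the first five rows of Table 4.6 (PRSN and HSN positions) purely as sample input; the function name `rank_differences` is hypothetical.

```python
def rank_differences(pos_a, pos_b):
    """Return (average difference, sum of differences) between two rankings,
    each given as a dict mapping a URL to its rank position."""
    diffs = [abs(pos_a[u] - pos_b[u]) for u in pos_a]
    return sum(diffs) / len(diffs), sum(diffs)

# First five rows of Table 4.6: positions under PRSN and HSN.
prsn_pos = {"abcnews.go": 1, "youtube": 2, "yahoo": 3, "businessweek": 4, "bloomberg": 5}
hsn_pos = {"abcnews.go": 1, "youtube": 2, "yahoo": 10, "businessweek": 14, "bloomberg": 9}
sample_avg, sample_sum = rank_differences(prsn_pos, hsn_pos)
```

For these five rows the absolute differences are 0, 0, 7, 10, and 4, giving a sum of 21 and an average of 4.2. These values describe only the five-row sample, not the full 30-URL result reported above.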

[Figure 4.10 panels: (a) Popular URLs, (b) Random URLs; x-axis: PageRank on Social Network; y-axis: HITS on Social Network; points are labeled with URL domains.]

Figure 4.10: Ranking URLs on Google Buzz.

4.3.2 Flow Ranking

We noticed that the ranking results determined for each individual user using maximum flow are less correlated with one another than the results computed by PageRank and HITS are. First, we compared the ranking results of maximum flow with

[Figure 4.11 panels: (a) Popular URLs, (b) Random URLs; x-axis: PageRank on Social Network; y-axis: HITS on Social Network; points are labeled with URL domains.]

Figure 4.11: Ranking URLs on Twitter.

PageRank and HITS using popular and random URLs for Google Buzz shown in Fig. 4.12 for popular URLs and Fig. 4.13 for random URLs. The first and second plots on the left are ranking results of popular URLs and the third and fourth plots on the right are ranking results of random URLs labelled by their sub-captions. A point on the graph is a URL where the x-axis is the ranking position of the URL determined by maximum flow and the y-axis is the ranking position determined by either PageRank or HITS labelled on the y-axis. The identical layout for Twitter is shown in Fig. 4.14 for popular URLs and Fig. 4.15 for random URLs.

[Figure 4.12 panels: (a) Max. Flow vs. PageRank, (b) Max. Flow vs. HITS; x-axis: Personalized Ranking with Maximum Flow; y-axes: PageRank on Social Graph, HITS on Social Graph; series: Person 1 to Person 4, with reference line y=x.]

Figure 4.12: Social ranking with popular URLs on Google Buzz.

[Figure 4.13 panels: (a) Max. Flow vs. HITS, (b) Max. Flow vs. PageRank; x-axis: Personalized Ranking with Maximum Flow; y-axes: HITS on Social Graph, PageRank on Social Graph; series: Person 1 to Person 4, with reference line y=x.]

Figure 4.13: Social ranking with random URLs on Google Buzz.

[Figure 4.14 panels: (a) Max. Flow vs. HITS, (b) Max. Flow vs. PageRank; x-axis: Personalized Ranking with Maximum Flow; y-axes: HITS on Social Graph, PageRank on Social Graph.]

Figure 4.14: Social ranking with popular URLs on Twitter.

[Figure 4.15 panels: (a) Random URLs, (b) Random URLs; x-axis: Personalized Ranking with Maximum Flow; y-axes: HITS on Social Graph, PageRank on Social Graph; series: Person 1 to Person 4, with reference line y=x.]

Figure 4.15: Social ranking with random URLs on Twitter.

Table 4.6: Ranking results of 30 popular URLs in Google Buzz.

URLs                     PRSN   HSN   MF
abcnews.go               1      1     9/12/10/15
youtube                  2      2     5/7/5/6
yahoo                    3      10    1/2/2/4
businessweek             4      14    10/14/12/14
bloomberg                5      9     10/14/13/12
wordpress                6      7     5/5/7/9
nytimes                  7      4     10/14/6/10
appleinsider             8      3     10/14/13/16
facebook                 9      8     1/1/1/1
wired                    10     5     9/14/13/15
lockerz                  11     6     4/6/6/6
apple                    12     11    6/8/9/8
pcworld                  13     15    8/13/10/7
guardian                 14     12    10/14/8/10
reuters                  15     19    10/14/10/16
ted                      16     13    9/13/7/10
amazon                   17     21    8/9/8/10
techcrunch               18     17    8/13/9/14
engadget                 19     16    9/13/7/7
reddit                   20     23    10/13/8/11
empireavenue             21     22    9/14/11/15
boston                   22     25    3/3/3/3
xkcd                     23     24    2/4/8/2
whitehouse               24     18    9/14/11/14
gizmodo                  25     20    7/10/12/12
pingchat                 26     27    9/12/12/14
thesocialnetwork-movie   27     28    9/14/13/14
bbc                      28     29    10/11/4/13
photofocus               29     26    8/14/13/16
stackoverflow            30     30    6/11/12/12

4.3.3 Rank Differences

For personalized ranking, we measured the ranking consistency as the average difference of a pair of users with respect to a URL set. For instance, in Table 4.8, the left column and the top row are the four selected users, where the element $a_{ij}$ corresponds to the average difference of users $i$ and $j$. Note that the upper triangle (elements above the diagonal) refers to the random URLs and the lower triangle (elements below the diagonal) refers to the popular URLs. The right column refers to the outdegree of users in the random URLs, and the last row refers to the outdegree

Table 4.7: Ranking results of 30 random URLs in Google Buzz.

URLs                     PRSN   HSN   MF
networkedblogs           1      28    6/5/7/2
picasaweb.google         2      29    1/3/1/5
ping.fm                  3      1     5/4/4/4
thenextweb               4      3     8/7/8/3
twitter                  5      18    12/17/13/10
income4free              6      17    2/1/2/1
fastestwaylosebellyfat   7      19    10/9/10/10
digg                     8      25    12/19/12/5
sports.espn.go           9      4     4/6/6/6
wired                    10     5     12/21/9/9
businessinsider          11     13    3/2/3/8
forbes                   12     12    7/12/12/9
foxnews                  13     27    11/13/5/9
behance                  14     11    11/23/13/8
huffingtonpost           15     23    12/20/11/7
entrepreneur             16     2     12/21/13/10
puntogov                 17     15    12/23/13/10
addictivefonts           18     6     10/14/13/9
theprism                 19     30    12/20/13/10
telegraph                20     22    9/10/13/10
npr                      21     7     10/19/13/10
popsci                   22     16    10/11/13/10
economist                23     10    12/16/13/10
marketwatch              24     8     8/8/13/10
opencog                  25     9     12/23/13/8
dslreports               26     26    12/15/13/10
last.fm                  27     24    12/23/13/10
tech.slashdot            28     20    12/22/13/10
wimp                     29     21    12/18/13/10
socialturns              30     14    12/18/13/10

of users in the popular URLs. For Twitter, the ranking results in the same format are given in Table 4.9.

For random URLs in Google Buzz, we noticed that persons $p_1$ and $p_3$ have an average difference of 1.7, whereas $p_2$ and $p_4$ have an average difference of 6.7. For popular URLs, the variability is smaller: $p_4$ and $p_2$ have an average difference of 2.0, and $p_1$ and $p_2$ have an average difference of 3.2. Outdegree measures the number of people a user follows, since the ranking results are based on them. Finally, ties are expected when using maximum flow since the number of URLs shared among friends is minuscule compared to the number of pages in the deep Web. Therefore, we simply use PageRank or HITS to break ties among pages when necessary.

Table 4.8: Avg. ranking differences in Google Buzz.

           p1    p2    p3    p4    outdegree
p1         -     5.1   1.7   2.4   369
p2         3.2   -     4.8   6.7   4,505
p3         2.5   2.6   -     3.1   1,125
p4         3.2   2.0   2.5   -     102
out deg.   159   355   503   340

Table 4.9: Avg. ranking differences in Twitter.

           p1    p2    p3    p4      outdegree
p1         -     1.5   2.0   4.0     203
p2         3.7   -     3.0   3.8     122
p3         3.3   3.3   -     4.6     426
p4         3.7   3.8   5.2   -       119
out deg.   324   158   129   1,731

4.3.4 Rank Distributions

We examine variants of flow ranking as follows. We selected a user in Twitter and then selected the top 25 URLs, in terms of CKS, shared by people that this user is following. These 25 URLs contain keywords similar to those in the URLs that this user has previously shared. Once we have the candidate URLs, we use network flow to re-rank them, taking into account social relationships, the effect of geography, and community structure. Results show a re-ordering in which geography reduces the number of URLs with positive scores by considering spreaders of URLs who are geographically close. On the other hand, community structure distributes the scores of URLs more evenly since more spreaders are taken into consideration. This flexibility allows users to select information that is locally relevant when appropriate, or to select information of potential interest from their community members. Figure 4.16 shows the rank correlation coefficient of URLs between variants of network flow and PageRank. For a selected user, we selected 25 URLs from its neighborhood and ranked these URLs using variants of network flow: without geography

[Figure 4.16 panels: a), b), c); x-axis: Tau; y-axis: P(x); series: Twitter, Buzz.]

Figure 4.16: Densities of rank correlation coefficient.

[Figure 4.17 panels: a) and b), y-axis: Avg. NCDG; x-axis categories: Flow O, Flow G, Flow I, Flow C, PR, BL.]

Figure 4.17: Ranking quality results.

(a), with geography (b), and with community (c). Given a set of URLs $U$, let $R_u(v)$ and $R_u(v')$ be the ranking results for nodes $v$ and $v'$. The rank correlation coefficient, denoted as $\tau$, is defined as:

\[ \tau = \frac{n_c - n_d}{0.5\,k(k-1)} \tag{4.14} \]

where $k = |U|$, $n_c$ is the number of concordant pairs, and $n_d$ is the number of discordant pairs in $R_u(v)$ and $R_u(v')$. We then calculated the rank correlation coefficient $\tau$, where a value of 1 means the ranking results are identical, -1 means they are in reverse order, and 0 means they are independent. Results show that personalized ranking using network flow is largely independent of PageRank.
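The rank correlation coefficient can be computed by counting concordant and discordant pairs directly. In this sketch the function name `rank_tau` is hypothetical, tied pairs are counted as neither concordant nor discordant (an assumption, since the text does not say how ties enter the formula), and at least two URLs are required.

```python
from itertools import combinations

def rank_tau(rank_a, rank_b):
    """Rank correlation of two rankings of the same URLs,
    each given as a dict mapping a URL to its rank position."""
    urls = list(rank_a)
    k = len(urls)
    n_c = n_d = 0
    for x, y in combinations(urls, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            n_c += 1          # both rankings order the pair the same way
        elif sign < 0:
            n_d += 1          # the rankings disagree on the pair
    return (n_c - n_d) / (0.5 * k * (k - 1))
```

Two identical rankings give 1.0 and a fully reversed ranking gives -1.0. The pairwise loop is quadratic in the number of URLs, which is unproblematic for sets of 25 URLs.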

4.3.5 Rank Validation

Fig. 4.17 shows ranking quality results for Google Buzz (a) and Twitter (b) using the four variants of network flow, PageRank applied to the social/information network, and the baseline. The y-axis is the normalized discounted cumulative gain (NDCG, labeled NCDG in Fig. 4.17), which is used to benchmark the quality of ranking results and is defined below. For this experiment, we selected 50 users and 100 URLs from each user's neighborhood. Then we ranked these URLs by using the six ranking techniques. NDCG is defined as follows. Let $p$ be a source node and $R$ a list of ranked URLs for $p$. The discounted cumulative gain (DCG) for $R$ with respect to $p$ is:

\[ \delta(R_1, p) + \sum_{i=2}^{w} \frac{\delta(R_i, p)}{\log(i)} \tag{4.15} \]

where $\delta(R_i, p)$ is 1 if $R_i$ is relevant to $p$ and 0 otherwise, and $w$ is the number of pages to be ranked. We assume $R_i$ is relevant to $p$ if $p$ has shared $R_i$ before. The NDCG is the DCG divided by the DCG of the optimal ordering of $R$ with respect to $p$. The optimal ordering is defined by using the pages that the user later shared. To capture any effect of social relevance, we randomly rank these URLs and use this random ranking as the baseline. Results shown in Fig. 4.17 confirm that social relevance can improve ranking results by up to 19% in Google Buzz and 17% in Twitter. The improvement is defined as the difference between the average NDCG of PageRank and that of flow rank, divided by the average NDCG of PageRank (see Fig. 4.17). It is interesting that peers in the community have a stronger effect on ranking quality than friends in Google Buzz. This is consistent with the densities of social relationships in Table 4.4, where 25% of peers have similar interests compared to 9% of friends. For Twitter, the densities in Table 4.5 align with the ranking quality results in Fig. 4.17(b), where the densities of interests among friends and peers are almost identical. Recall that PageRank is calculated by using the social network and not by using the web graph.
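The gain computation can be sketched as follows. The base of the logarithm is not stated in the text; base 2 is assumed here because it is the common convention and makes the discount at position 2 equal to 1. The function names are hypothetical, and the optimal ordering is approximated by simply placing all relevant URLs first.

```python
import math

def dcg(ranked, relevant):
    """Discounted cumulative gain of a ranked URL list with binary relevance."""
    gains = [1.0 if u in relevant else 0.0 for u in ranked]
    if not gains:
        return 0.0
    # first position is undiscounted; position i >= 2 is divided by log2(i)
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(ranked, relevant):
    """DCG divided by the DCG of the best possible ordering of the same list."""
    ideal = sorted(ranked, key=lambda u: u not in relevant)   # relevant URLs first
    best = dcg(ideal, relevant)
    return dcg(ranked, relevant) / best if best else 0.0
```

With a single relevant URL placed third out of three, the score is $1/\log_2 3$, roughly 0.63. Note that under this formula the first and second positions carry the same weight.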

4.4 Summary of Results

Information shared between users in online social networks, such as URLs, provides a unique perspective on the ranking of web pages. In our approach, humans instead of pages are the ones who rank the URLs by sharing them, and the social network of the users instead of the web graph topology is used to propagate the ranking.

First, we collected two large-scale information networks of online users to study how users in these networks share URLs, which impacts the distance between a person and a URL. For instance, researchers in [3] estimated the number of hops between any two pages to be 19 on average, while Milgram estimated that the number of hops between any two people is no more than 6 [87]. Since information propagates differently in social networks, the social structure bounds how far a person is from a shared URL.

Second, we reinterpreted the ranking techniques of PageRank and HITS and proposed to use maximum network flow to personalize the ranking of pages for each individual user. Maximum flow detects the popularity of a shared URL among friends, but popularity does not necessarily reflect endorsement, which could impact ranking because one could share something that was not meant to be positive (e.g., sad news). We expected that each individual would rank the URLs differently, since no two people on a social network are the same. Interestingly, the ranking results of popular URLs using PageRank and HITS are more correlated than those of random URLs, suggesting that the overall view of users on ubiquitous information is more consistent, but everyone has their own opinion in the end. Instead of attempting to socially rank the entire web, we re-ranked a selected set of URLs to make the approach scalable and efficiently executable for search engines. If the size of the web doubles in the next few years, it would not affect our approach since only a subset of URLs that users shared are actually re-ranked.
Third, experimental results show that personalization can improve ranking quality by up to 19% compared to the baseline and 5% compared to PageRank in Google Buzz. For Twitter, personalization improves ranking quality by up to 17% compared to the baseline, but it is not better than PageRank. More importantly, we believe that personalizing the ranking is useful for social searching because it provides a mechanism for interaction between the searcher and the sharer, where the searcher can discuss with the sharer the item relating to a query on a search engine. For instance, the item could be a new product that the sharer posted from appleinsider.com or a piece of political news from nytimes.com. This potential interaction between the searcher and the sharer is valuable because, in many (but not all) non-technical and social situations, the influence of the sharer on the searcher is stronger than the influence coming from the authorities detected by HITS and PageRank. This feature could be implemented in search engines where pages returned for a given query are re-ranked via social networks if there are pages shared among friends or other associates of the searcher that are related to the query.

CHAPTER 5 SOCIAL SEARCHING EXPERIMENTS

We collected friendship, checkin, and location data from two location-based social media, Gowalla and FourSquare, that allowed people to use their internet-enabled and sensing-capable smart phones to record and share their current location. Gowalla is no longer operating by itself since it has been integrated into Facebook. Unlike Gowalla, FourSquare does not provide an automated mechanism for collecting publicly shared checkins through its API. We also collected two additional social networks containing social relationships, Flickr and Last.fm, but without the geographical locations of their users. The reason for collecting data from these four diverse networks is that we can directly calculate the hop length of the shortest path between randomly selected pairs of users and use these path lengths as an estimate of the ground truth in the small-world experiment. We use Gowalla and FourSquare for the emulation of the small-world experiment, in which knowing the geographical distance between users is essential. Even though the data collected from online social media is not a representative sample of the entire population, it still provides “one of the best estimates of social distance” [88] and one of the best environments for analyzing the small-world experiment at large scale.

Table 5.1: Summaries of online social networks datasets.

Social Networks   Number of Users   Number of Edges   Period
Gowalla           154,557           1,139,110         Sept. 11 - Oct. 12
FourSquare        251,621           800,201           Jun. 13 - Aug. 13
Flickr            2,435,257         155,110,479       Jun. 13 - Aug. 13
Last.fm           4,355,516         30,325,890        Jun. 13 - Aug. 13

In Table 5.1, we list the number of users and edges collected for each network over the specified time period. For Gowalla and FourSquare, these numbers refer to the subset of the collected network that remained after data cleaning. In Gowalla, we removed users that did not have any publicly shared checkins. In FourSquare, we kept only users that were successfully geocoded by Google Maps. This subtle difference between Gowalla and FourSquare is important because checkins in Gowalla directly pinpoint users' locations, making connections between users in Gowalla denser than in FourSquare. However, the advantage of FourSquare is that it provides a different perspective on some of the questions being asked, such as the effect of network sparsity on the small-world problem.

Portions of this chapter have been submitted as: T. Nguyen et al., “Small Worlds and Social Stratification,” PLoS ONE (under review).

5.1 Attrition, Geography, & Communities

Let $G = (V, E)$ be a social network where $V$ is the set of users and $E$ is the set of edges representing undirected relationships among users. The great-circle distance between two users $s$ and $t$ is denoted as $gd(s, t)$ and estimated based on the users' self-entered location of residence (FourSquare) or the most frequent checkin that they have shared (Gowalla). The network distance between $s$ and $t$ is denoted as $nd(s, t)$ and defined as the smallest number of hops needed to reach $t$ starting from $s$. Let $A$ be a community detection algorithm that groups the nodes in $G$ into $m$ overlapping clusters denoted as $\{C_1, C_2, \ldots, C_m\}$. An edge-bridge is an edge $e = (u, u')$ such that $u \in C_i$ and $u' \in C_j$ for $i \neq j$. A node-bridge is a node $u$ such that, for certain $i \neq j$, $u \in C_i$ and $u \in C_j$. The stratification graph of $G$, denoted as $S = (s_{ij})$, is defined as:

\[
s_{ij} = \frac{eb(i, j) + nb(i, j)}{\sum_{k=1}^{m} \big( eb(i, k) + nb(i, k) \big)} \tag{5.1}
\]

where $eb(i, j)$ and $nb(i, j)$ are the numbers of edge- and node-bridges connecting communities $i$ and $j$, respectively. We extend the definition of network distance from users to communities, denote it as $nd(C_i, C_j)$, and define it as the smallest number of node- or edge-bridges needed to reach $C_j$ starting from $C_i$. We later use $s_{ij}$ to define the prominence of community $C_i$. Fig. 5.1 shows the stratification graph of communities for Gowalla.
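As an illustration of Eq. (5.1), the following minimal sketch builds the entries $s_{ij}$ from an edge list and a list of overlapping communities. The function name `stratification_graph` and the input layout are assumptions made for this example, not the implementation used in the thesis. The sketch also assumes that each edge contributes at most one edge-bridge per unordered pair of communities.

```python
from collections import defaultdict
from itertools import combinations


def stratification_graph(edges, communities):
    """Return {(i, j): s_ij} following Eq. (5.1).

    edges: iterable of undirected (u, v) pairs.
    communities: list of node sets; list index i stands for community C_i.
    """
    membership = defaultdict(set)  # node -> indices of the communities it is in
    for idx, members in enumerate(communities):
        for node in members:
            membership[node].add(idx)

    bridges = defaultdict(int)  # (i, j) -> eb(i, j) + nb(i, j)

    def count_bridge(i, j):
        bridges[(i, j)] += 1
        bridges[(j, i)] += 1

    # Edge-bridges: edge (u, u') with u in C_i and u' in C_j for i != j.
    for u, v in edges:
        pairs = {
            frozenset((i, j))
            for i in membership.get(u, ())
            for j in membership.get(v, ())
            if i != j
        }
        for pair in pairs:
            count_bridge(*sorted(pair))

    # Node-bridges: a node that belongs to both C_i and C_j for i != j.
    for idxs in membership.values():
        for i, j in combinations(sorted(idxs), 2):
            count_bridge(i, j)

    # Normalize each row i by the total number of bridges leaving C_i.
    row_totals = defaultdict(int)
    for (i, _), count in bridges.items():
        row_totals[i] += count

    return {(i, j): count / row_totals[i] for (i, j), count in bridges.items()}
```

Each row of the result sums to 1, so $s_{ij}$ can be read directly as a one-step transition probability between communities.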

5.1.1 Modeling Attrition

Let $p_k$ denote the probability of getting from a source to a target in $k$ hops in chains that are of length at least $k$, and let $p$ denote the probability of dropping out of the experiment for nodes that are not adjacent to a target. Let $N$ denote the number of folders sent, $D_k$ the number of folders delivered to the target at the $k$-th hop, and $C_k$ the number of chains continuing for at least $k$ hops. If participants do not drop out of the experiment, then the number of deliveries in $k$ hops is $E_k = p_k \big( N - \sum_{i=1}^{k-1} E_i \big)$. The expected number of deliveries for one-hop targets is $D_1 = N p_1$, and the number of chains continuing for two or more hops is $C_2 = N (1 - p_1) p$. For $k > 1$, $D_k = p_k C_k$ and $C_{k+1} = C_k (1 - p_k) p$. In Travers-Milgram's experiment, we know $N$, $C_k$, and $D_k$. Then, the number of deliveries including drops for $k > 1$ is:

\[
E_k = D_k + \Big( N - C_k - \sum_{i=1}^{k-1} E_i \Big) p_k. \tag{5.2}
\]

Figure 5.1: Stratification graph of communities in Gowalla.

Figure 5.2: Distributions of shortest path lengths & average path lengths. Panel (a): density versus path length; panel (b): average path length for GW, FS, FR, LFM, TMO, and TMA.

With these formulas, we can compute the average number of hops while taking into account the effect of participants dropping out of the experiment. In the original Travers-Milgram experiment, the reported path length was 6.2; accounting for dropping adds 2 more hops, so the path length should be reported as at least 8. An element of novelty here is that we can apply the effect of attrition to our experimental results from the opposite point of view. Suppose a participant does not drop out in our emulations unless it has sent the folder, one at a time, to all of its acquaintances. Since we know $E_k$, $p_k$, and $C_k$ from the social routing emulations, we can calculate $D_k$ and report the average path length in $D_k$ as a function of the dropping rate.
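The recurrences of Section 5.1.1 can be checked numerically with a short sketch such as the one below. The function names are illustrative. One assumption is worth stating explicitly: since $p$ multiplies $C_k (1 - p_k)$ in the recurrence for $C_{k+1}$, the sketch treats it as the fraction of undelivered chains that carry on and names it `p_continue`.

```python
def attrition_chains(n_folders, hop_probs, p_continue):
    """Return (D, C): deliveries D_k and continuing chains C_k for k = 1..K.

    hop_probs[k - 1] is p_k; p_continue is the factor p in
    C_{k+1} = C_k * (1 - p_k) * p (an interpretation, see the lead-in).
    """
    deliveries, chains = [], []
    c_k = float(n_folders)  # C_1 = N: every folder starts a chain
    for p_k in hop_probs:
        chains.append(c_k)
        deliveries.append(p_k * c_k)            # D_k = p_k * C_k
        c_k = c_k * (1.0 - p_k) * p_continue    # C_{k+1}
    return deliveries, chains


def deliveries_without_attrition(n_folders, hop_probs, deliveries, chains):
    """E_k from Eq. (5.2): E_k = D_k + (N - C_k - sum_{i<k} E_i) * p_k."""
    expected = []
    for p_k, d_k, c_k in zip(hop_probs, deliveries, chains):
        expected.append(d_k + (n_folders - c_k - sum(expected)) * p_k)
    return expected


def average_hops(counts):
    """Mean path length when counts[k - 1] folders arrive at hop k."""
    return sum(k * c for k, c in enumerate(counts, start=1)) / sum(counts)
```

With made-up inputs (100 folders, hop probabilities 0.1, 0.5, 1.0, and `p_continue = 0.5`), the average over $D_k$ is shorter than the average over $E_k$, which is the bias toward short chains that the text attributes to attrition.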

5.1.2 Geographical Analysis

In Fig. 5.2(a,b), we compare the shortest path length distributions of the four online social networks (a) and their average shortest path lengths with one standard deviation (b) with two results from Travers and Milgram. The average shortest path lengths with one standard deviation are 4.91 ± 0.78 in Gowalla, 7.74 ± 1.99 in FourSquare, 4.90 ± 0.78 in Flickr, and 5.98 ± 0.99 in Last.fm. The average path length reported by Travers and Milgram is within one standard deviation of the ground truth in FourSquare and Last.fm but not in Gowalla and Flickr, while the average path length adjusted for the impact of attrition is out of range for Last.fm as well.

Figure 5.3: Densities of geographical distances. Panels (a) and (b): density $f(d)$ versus distance $d$ (km) on log-log axes for pairs of friends, with fitted curves $19/d$ and $9580/d^2$ in (a) and $5/d$ and $4608/d^2$ in (b).

In Fig. 5.3(a,b), we plot the probability density (log-scale) as a function of geographical distance (log-scale) between pairs of friends for Gowalla (a) and FourSquare (b). The probability density function $f(d)$ is defined as the fraction of friends whose geographic distance is $d \pm \epsilon$. We fitted the data for each network with two models, one assuming inverse proportionality to the distance, $c/d$, and the other to the square of the distance, $c'/d^2$. We found the two constants $c$ and $c'$ by minimizing the difference between the model and the data:

\[
\min_{c} \Big\{ \sum_{d} \frac{(f(d) - c/d)^2}{c/d} \Big\}, \qquad \min_{c'} \Big\{ \sum_{d} \frac{(f(d) - c'/d^2)^2}{c'/d^2} \Big\}. \tag{5.3}
\]

For Gowalla, the error is 0.15 for $19.26/d$ and 1.64 for $9580.42/d^2$. For FourSquare, the error is 0.55 for $4.78/d$ and 16.76 for $4608.35/d^2$. In other words, $c/d$ fits the distribution of geographical distances roughly 11 times better than $c'/d^2$ in Gowalla and 30 times better in FourSquare, or about 20 times better on average. An explanation of this difference is that online social media distorts the physical dimension (approximately a 2-dimensional surface) by allowing people from anywhere to establish a connection. An even better fit is a model $c''/d^{\delta}$ with $1 < \delta < 2$, which means that the social network space is fractal. This observation is in agreement with Liben-Nowell et al. [67]. Kleinberg's theoretical results in [41], which bound the expected delivery time by $O(\log n)$, assume that the distribution of distances is $d^{-2}$. Hence, the empirical results do not satisfy the assumptions made in the mathematical models, and therefore the theoretical bounds in those models cannot be universally applied.
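The minimization in Eq. (5.3) has a closed form, which the sketch below uses. Writing $g(d) = d^{-a}$, the error expands to $\frac{1}{c}\sum_d f(d)^2/g(d) - 2\sum_d f(d) + c\sum_d g(d)$, and setting its derivative with respect to $c$ to zero gives $c = \sqrt{\sum_d f(d)^2/g(d) \,/\, \sum_d g(d)}$. The function name and the binned inputs are assumptions for illustration; the thesis does not state which optimizer was used.

```python
import math


def fit_constant(distances, densities, exponent):
    """Return (c, error) minimizing sum_d (f(d) - c/d^a)^2 / (c/d^a) over c > 0.

    distances: bin centers d (km); densities: f(d) for each bin;
    exponent: a = 1 for the c/d model, a = 2 for the c'/d^2 model.
    """
    shape = [d ** -exponent for d in distances]  # g(d) = d^(-a)
    # Closed-form minimizer obtained from d(error)/dc = 0.
    c = math.sqrt(
        sum(f * f / g for f, g in zip(densities, shape)) / sum(shape)
    )
    error = sum((f - c * g) ** 2 / (c * g) for f, g in zip(densities, shape))
    return c, error
```

Calling `fit_constant` once with `exponent=1` and once with `exponent=2` on the same binned densities yields the two errors that are compared in the text.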

Table 5.2: Communities detected by GANXiS.

                          Gowalla   FourSquare
Average Size                16.60        15.86
Weighted Average Size      591.81       399.77
Total Communities          10,562       16,495
Avg. Link Density            0.35         0.37
Edge-Bridges            1,004,964      574,748
Node-Bridges               20,137        9,492

5.1.3 Detecting Communities

We selected GANXiS to detect overlapping communities based on its promising experimental results and its ability to scale to millions of nodes and edges [54]. The intuition behind GANXiS is that there should be many edges within a community, and an important feature is that it can detect either disjoint or overlapping communities. This intuition is consistent with the stratified nature of society, in which members of a community such as a family, workplace team, religious congregation, or sports club are more likely to be connected with each other than with casual acquaintances. Consequently, once a folder reaches a person who belongs to the same community as the target, only a few more hops are needed to reach the target.

In Table 5.2, we list several measurements of the communities detected by GANXiS: the average community size; the weighted community size, defined as the average size of a community as observed by each member; the total number of communities detected; the average link density, defined as the number of edges inside a community divided by the maximum number of possible edges; and the number of edge- and node-bridges.
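The size and density measurements described for Table 5.2 can be sketched as follows. This is an illustrative reading of the definitions, not the code used in the thesis. It assumes that the size "as observed by each member" counts a community of size $s$ once per member, which gives $\sum s^2 / \sum s$, and that communities with fewer than two members are skipped when averaging link density.

```python
def community_summary(edges, communities):
    """Summarize a cover given undirected edges and a list of node sets."""
    sizes = [len(members) for members in communities]
    average_size = sum(sizes) / len(sizes)
    # Member's-eye view: a community of size s is observed by s members.
    weighted_average_size = sum(s * s for s in sizes) / sum(sizes)

    densities = []
    for members in communities:
        n = len(members)
        if n < 2:
            continue  # link density is undefined for a single node
        internal = sum(1 for u, v in edges if u in members and v in members)
        densities.append(internal / (n * (n - 1) / 2))

    return {
        "average_size": average_size,
        "weighted_average_size": weighted_average_size,
        "total_communities": len(communities),
        "average_link_density": sum(densities) / len(densities),
    }
```

Under this reading, the weighted average is dominated by large communities, which would explain why it can be much larger than the plain average.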

5.2 Experimental Design

Two strategies define our emulation of social routing. The first describes the process of routing a folder by specifying, at each step, which acquaintance of the current folder holder receives the folder; the second defines the process of selecting starters and targets.

5.2.1 Routing Strategies

The first routing strategy, denoted as GEOGREEDY in [67], is to pass the folder to the acquaintance who is geographically closest to the target, that is, to pick the acquaintance $u$ with the smallest $gd(u, t)$. The second routing strategy, denoted as COMGREEDY, is to pass the folder to the acquaintance who is closest to the target in terms of community distance, that is, to pick the acquaintance $u$ with the smallest $nd(C_u, C_t)$. For overlapping communities, COMGREEDY selects the communities of $u$ and $t$ so that $nd(C_u, C_t)$ is minimized. Such information may not always be available to the current folder holder, so GEOGREEDY is more realistic than COMGREEDY; the purpose of introducing COMGREEDY is to understand which property of the network, geography or community, is more useful for reaching the target in the majority of cases.

The third routing strategy, denoted as GEOCOM, combines knowledge of geography and community when selecting an acquaintance. In GEOCOM, a node gives the highest preference to acquaintances who belong to the same community as the target (i.e., $nd(C_u, C_t) = 0$) and breaks ties among them by selecting the acquaintance who is geographically closest to the target (i.e., GEOGREEDY). If a node has no acquaintances who belong to the same community as the target, then the node uses GEOGREEDY. An element of novelty in combining geography and community is that it seems more realistic than using either one alone.

In all strategies, routing stops either when the folder has reached the target or when the current holder has no more acquaintances to whom it can pass the folder because all of them have already been chosen. If the current holder does not have any acquaintances who belong to the same community as the target, then $s_{ij}$ defines the probability of going from $C_i$ to $C_j$ in one step. The implication is that $s_{ij}$ influences the routing in GEOGREEDY and GEOCOM but not in COMGREEDY, and this influence does not depend on the target but on how communities are interconnected. Therefore, we define the prominence of community $C_i$, denoted as $\lambda_i$, as $P(X_t = C_i)$, where $X_t$ denotes the community reached in a random walk at step $0 < t < \infty$. The idea is that the more prominent a community is, the more likely it is to be reached in a random walk process.
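A single GEOCOM routing decision, as described above, might be sketched as follows. All names (`geocom_next_hop`, `friends`, `location`, `membership`, `already_tried`) are hypothetical, and the haversine formula with a mean Earth radius of 6371 km is an assumed stand-in for the great-circle distance $gd$.

```python
import math


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def geocom_next_hop(holder, target, friends, location, membership, already_tried):
    """Choose the next folder holder under GEOCOM, or None if routing stops.

    friends: node -> list of acquaintances; location: node -> (lat, lon);
    membership: node -> set of community ids; already_tried: set of nodes
    the current holder has already passed the folder to.
    """
    candidates = [u for u in friends[holder] if u not in already_tried]
    if not candidates:
        return None  # no acquaintances left: the chain ends here
    # Prefer acquaintances sharing a community with the target.
    same_community = [
        u for u in candidates if membership[u] & membership[target]
    ]
    # Otherwise fall back to GEOGREEDY over all remaining acquaintances.
    pool = same_community or candidates
    return min(pool, key=lambda u: great_circle_km(location[u], location[target]))
```

Restricting `pool` to `candidates` alone gives GEOGREEDY, so the same function can be used to compare the two strategies on identical inputs.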

Table 5.3: Prominence of individuals and communities.

                    Gowalla                  FourSquare
Percentile     PageRank   Steady State   PageRank   Steady State
Top 1%         0.000121   0.005421       0.000065   0.003652
Top 20%        0.000014   0.000134       0.000010   0.000094
60th-80th%     0.000005   0.000036       0.000003   0.000016
40th-60th%     0.000003   0.000021       0.000002   0.000008
Bottom 40%     0.000002   0.000021       0.000001   0.000003

Table 5.4: Experimental results for Gowalla.

Average Number of Hops in Successful Chains
Selection          GEOGREEDY   COMGREEDY   GEOCOM
Random                 29.43       20.57    19.08
nd(Cs,Ct) = 0           5.61        3.61     3.61
nd(Cs,Ct) = 1          26.13       16.06    16.23
nd(Cs,Ct) = 2          27.78       18.71    21.13
nd(Cs,Ct) = 3          29.06       19.76    24.36

Percentage of Successful Chains
Selection          GEOGREEDY   COMGREEDY   GEOCOM
Random                  0.30        0.44     0.50
nd(Cs,Ct) = 0           0.71        0.87     0.91
nd(Cs,Ct) = 1           0.34        0.57     0.58
nd(Cs,Ct) = 2           0.27        0.46     0.44
nd(Cs,Ct) = 3           0.18        0.38     0.27

5.2.2 Starter & Target Selections

First, we select a starter and a target that are separated by a fixed number of communities, i.e., $nd(C_s, C_t) = k$. For example, when $k = 0$, the starter $s$ and target $t$ are selected from within the same community. Then, we select a target $t$ based on its prominence as measured by its PageRank score, and next a random target from a prominent community as measured by the steady state of the random walk on the stratification graph $S$. The percentiles of individual and community prominence, measured by PageRank and by the steady state $\lambda_i$, are listed in Table 5.3. Finally, we select starters and targets randomly to mimic the most unbiased way in which participants could be selected for a Travers-Milgram-like experiment.
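The steady state of the random walk on the stratification graph can be approximated by power iteration, as in the sketch below. The function name and the dense row-list representation are assumptions for illustration. The sketch uses a lazy walk (stay put with probability 1/2), a swapped-in variant that has the same stationary distribution but still converges when the graph is bipartite; it also assumes that the graph is connected and that every row of the matrix sums to 1.

```python
def steady_state(transition, iterations=10000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix.

    transition[i][j] plays the role of s_ij. A lazy step
    (half identity, half transition) is used to avoid oscillation
    on periodic graphs; the stationary distribution is unchanged.
    """
    m = len(transition)
    dist = [1.0 / m] * m  # start from the uniform distribution
    for _ in range(iterations):
        stepped = [
            sum(dist[i] * transition[i][j] for i in range(m)) for j in range(m)
        ]
        nxt = [0.5 * d + 0.5 * s for d, s in zip(dist, stepped)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist
```

The entries of the returned vector correspond to the community prominence values $\lambda_i$, from which percentiles like those in Table 5.3 can be derived.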

Table 5.5: Experimental results for FourSquare.

Average Number of Hops in Successful Chains
Selection          GEOGREEDY   COMGREEDY   GEOCOM
Random                 18.19       16.01    16.52
nd(Cs,Ct) = 0           1.93        2.06     1.99
nd(Cs,Ct) = 1           7.81        7.37     6.21
nd(Cs,Ct) = 2          15.36       12.96    12.10
nd(Cs,Ct) = 3          18.02       15.14    15.81

Percentage of Successful Chains
Selection          GEOGREEDY   COMGREEDY   GEOCOM
Random                  0.01        0.22     0.04
nd(Cs,Ct) = 0           0.75        0.86     0.88
nd(Cs,Ct) = 1           0.13        0.51     0.38
nd(Cs,Ct) = 2           0.04        0.37     0.12
nd(Cs,Ct) = 3           0.02        0.28     0.04

5.3 Experimental Results

5.3.1 Selection & Routing Combinations

Table 5.4 contains the experimental results for Gowalla. The upper section of Table 5.4 displays the average number of hops it takes to successfully reach a target using the five selection techniques listed in the left column and the three routing strategies listed in the second row. The lower section of Table 5.4 refers to the percentage of successful chains, defined as the number of times the target was reached divided by the number of trials. For each selection process and routing strategy, we ran $N = 10^4$ trials. The experimental results for FourSquare are displayed in Table 5.5.

Tables 5.4 and 5.5 show that selecting a starter and a target from the same community makes it likely for the target to be reached in a few hops, about 4 hops in Gowalla and 2 hops in FourSquare, with a high success rate of approximately 83% for both networks. The percentage of successful chains decreases as the community distance between the starter and target increases. On average, for community distances ranging from 1 to 3, it takes approximately 22 hops to reach a target with a success rate of 39% for Gowalla, and 12 hops to reach a target with a success rate of 21% for FourSquare. As the community distance between the starter and target increases, the percentage of successful chains decreases to about 19% for Gowalla and 24% for FourSquare.

Figure 5.4: Friends-of-friends knowledge densities. Panels (a)-(c): Gowalla; panels (d)-(f): FourSquare. Each panel plots the average path length against the friends-of-friends knowledge density (0 to 1) for dropping rates of 5%, 15%, and 30%.

Also, Tables 5.4 and 5.5 show that COMGREEDY is much more effective than GEOGREEDY in terms of average path length and percentage of successful chains in both networks. On average, COMGREEDY reaches the target about 8 hops sooner than GEOGREEDY in Gowalla and 2 hops sooner in FourSquare. Moreover, the success rate of COMGREEDY is 18 percentage points higher than that of GEOGREEDY in Gowalla and 26 percentage points higher in FourSquare. Hence, using community distances is more effective at reaching targets than using geographical distances.
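The comparison above can be reproduced from the table entries with a few lines of arithmetic. The sketch assumes that the reported figures are unweighted means over the five selection rows (Random and community distances 0 to 3); the values are copied from Tables 5.4 and 5.5, and the names are illustrative.

```python
# Rows in order: Random, nd = 0, nd = 1, nd = 2, nd = 3.
HOPS = {
    "Gowalla": {
        "GEOGREEDY": [29.43, 5.61, 26.13, 27.78, 29.06],
        "COMGREEDY": [20.57, 3.61, 16.06, 18.71, 19.76],
    },
    "FourSquare": {
        "GEOGREEDY": [18.19, 1.93, 7.81, 15.36, 18.02],
        "COMGREEDY": [16.01, 2.06, 7.37, 12.96, 15.14],
    },
}
SUCCESS = {
    "Gowalla": {
        "GEOGREEDY": [0.30, 0.71, 0.34, 0.27, 0.18],
        "COMGREEDY": [0.44, 0.87, 0.57, 0.46, 0.38],
    },
    "FourSquare": {
        "GEOGREEDY": [0.01, 0.75, 0.13, 0.04, 0.02],
        "COMGREEDY": [0.22, 0.86, 0.51, 0.37, 0.28],
    },
}


def mean(values):
    return sum(values) / len(values)


def comgreedy_gain(table, network):
    """Mean COMGREEDY value minus mean GEOGREEDY value over the five rows."""
    rows = table[network]
    return mean(rows["COMGREEDY"]) - mean(rows["GEOGREEDY"])


for network in ("Gowalla", "FourSquare"):
    print(
        network,
        round(comgreedy_gain(HOPS, network), 2),     # negative: fewer hops
        round(comgreedy_gain(SUCCESS, network), 3),  # positive: more successes
    )
```

Under that assumption the hop differences come out near 8 and 2, and the success-rate differences near 0.18 and 0.26, matching the rounded figures in the text.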

5.3.2 Friends-of-Friends Knowledge Densities

To make GEOCOM more realistic, we introduce the probability that the current holder has some relevant clues about its acquaintances. A possible clue is friends-of-friends knowledge, where a holder might know some of its friends' friends, where they are geographically located, and to which communities they belong. In Fig. 5.4, we plot the average path length as a function of friends-of-friends knowledge density for Gowalla (a-c) and FourSquare (d-f). The x-axis represents the probability that the current holder knows the geographical location and community information of a friend-of-friend. A value of 0 means the holder uses only its friends to make a routing decision, and a value of 1 means the holder knows all the friends of its friends. In addition, we examined three levels of attrition in this experiment. Subfigures (a) and (d) refer to a 5% dropping rate, subfigures (b) and (e) to a 15% dropping rate, and subfigures (c) and (f) to a 30% dropping rate. Regions within one standard deviation of the ground truth in terms of average path length are shaded in blue.

In Gowalla, results show that with a 5% dropping rate, friends-of-friends knowledge is insufficient to bring the average path length within one standard deviation of the ground truth. However, with a 15% dropping rate, a knowledge level of about 20% is sufficient to come within one standard deviation of the ground truth, and no friends-of-friends knowledge is needed when the dropping rate is 30% or higher. In FourSquare, results show that with a 5% dropping rate, no friends-of-friends knowledge is needed to be within one standard deviation of the ground truth, while average path lengths are very short and not within one standard deviation when the dropping rate is 15% or more. The reason for the contrasting behavior is that increasing attrition makes the path length of successful chains smaller than the ground truth (i.e., 5 in Gowalla vs. 8 in FourSquare).

A difference between Gowalla and FourSquare is that Gowalla is much more connected in terms of the density of relationships between nodes. The percentage of targets found successfully is overall higher in Gowalla than in FourSquare. Recall that nodes drop out of the simulations when they do not have any more acquaintances to pass the folder to. Since there are more relationships in Gowalla, participants stay longer in the simulations, which increases the path length of successful chains. In FourSquare, participants have fewer social relationships, so they drop out sooner; therefore, successful chains are shorter in FourSquare than in Gowalla.
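One way to emulate a friends-of-friends knowledge density is to reveal each friend-of-friend to the holder independently with the given probability, as in the sketch below. This sampling scheme, the function name, and the choice to give a friend-of-friend one chance per connecting friend are assumptions made for illustration; the text does not specify these details.

```python
import random


def known_candidates(holder, friends, knowledge_density, rng):
    """Return the set of nodes whose location and community the holder knows.

    friends: node -> list of acquaintances.
    knowledge_density: probability of knowing a given friend-of-friend
    (0 means friends only, 1 means all friends of friends).
    rng: a random.Random instance, passed in for reproducibility.
    """
    known = set(friends[holder])  # direct friends are always known
    for friend in friends[holder]:
        for fof in friends[friend]:
            if fof == holder or fof in known:
                continue
            if rng.random() < knowledge_density:
                known.add(fof)
    return known
```

A routing rule can then rank the nodes in this set instead of the direct friends alone, passing the folder to the friend through whom the best-ranked node is reached.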

5.3.3 Distributions of Successful Chains

In Fig. 5.5, we plot the distribution of the lengths of successful chains in (a) and (c) and the modified average path length as a function of the dropping rate in (b) and (d) for Gowalla and FourSquare, respectively. Results show that it is difficult to find targets when $nd(C_s, C_t) > 0$, but the average path length still decreases as the dropping rate increases. For instance, as the dropping rate increases from 0.2 to 0.4, the average path length of successful chains falls on average from about 6 to 2 for Gowalla and from about 7 to 2 for FourSquare. More importantly, the variances of the distributions for $nd(C_s, C_t) > 0$ are large compared to the ground truth, as seen in (a) and (c), meaning that some targets are easier to reach than others. This leads us to measure the reachability of a target by examining its individual prominence.

Figure 5.5: Path length of successful chains & drop rates. Panels (a) and (c): percentage of successful chains versus path length for $nd(C_s, C_t) = 0, 1, 2, 3$; panels (b) and (d): average path length versus drop rate, with the ground truth and the TM drop rate marked.

5.3.4 Effects of Hubs and Connectors

In Fig. 5.6, we examine the effects of routing the folder to connectors and hubs discussed in the literature [89]. The first experiment passes the folder to the connector, defined as the acquaintance who has the highest number of connections to other nodes within the community. Results show an improvement in the delivery rates in Gowalla and FourSquare, as seen in Fig. 5.6(a,b). For this connector experiment, we did not select starters and targets randomly because connectors would be flooded with requests, making the routing strategy impractical in reality. Perhaps a setting where passing the folder to a connector would not be too unrealistic is when the starter and target are from the same community.

Figure 5.6: Effects of routing to connectors & hubs. Panels (a) and (b): percentage of successful chains for GEO, COM, GCOM, and CON.; panels (c) and (d): density versus path length of successful chains for radii R = 80, 241, 400, and 563 km.

Another setting that would reduce the flooding of requests is selecting a hub within some geographical radius of the target. For this experiment, we modified GEOCOM to incorporate indegree into the routing decision. First, if the holder has multiple friends who belong to the same community as the target, then it breaks ties by selecting the connector. If the connector does not exist, then it selects the group of acquaintances who are within some radius of the target and picks a hub from this group, defined as the friend who has the highest degree. If the hub does not exist, then it uses GEOGREEDY. As the radius increases by 161 km, the delivery rates for Gowalla and FourSquare increase by approximately 2%, and the average path length of successful chains increases by about 5 hops in Gowalla, as seen in Fig. 5.6(c), and by about 10 hops in FourSquare, as seen in Fig. 5.6(d).
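The hub step of the modified strategy can be sketched in a few lines. The function below covers only the case in which no connector is available; its name, its arguments, and the `distance_km` callback are hypothetical and stand in for whatever distance and degree data an emulation has at hand.

```python
def hub_next_hop(candidates, target, degree, distance_km, radius_km):
    """Pick the highest-degree acquaintance within radius_km of the target.

    candidates: acquaintances the holder may still choose.
    degree: node -> number of connections.
    distance_km: function (node, target) -> geographical distance in km.
    Falls back to the geographically closest acquaintance (GEOGREEDY)
    when no acquaintance lies inside the radius. Returns None when
    there are no candidates left.
    """
    if not candidates:
        return None
    nearby = [u for u in candidates if distance_km(u, target) <= radius_km]
    if nearby:
        return max(nearby, key=lambda u: degree[u])  # the hub
    return min(candidates, key=lambda u: distance_km(u, target))
```

Widening `radius_km` admits more distant but better connected acquaintances, which is consistent with the longer successful chains reported for larger radii.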

5.3.5 Individual and Community Prominence

In Fig. 5.7, we calculate the average path length of finding a target as a function of its PageRank for Gowalla (a) and FourSquare (b). As the PageRank score increases, the average path length decreases from 16 to 4 in Gowalla and from 15 to 5 in FourSquare. The routing algorithm used in this particular experiment is GEOCOM with an 8% friends-of-friends knowledge level and with starters and targets randomly selected. Hence, results from this experiment show that the small-world property holds for the highly prominent while everyone else is lost in the crowd. In addition, we calculate the average path length of finding a target as a function of its community prominence measured by $\lambda_i$ for Gowalla (c) and FourSquare (d). Results from this experiment also show that targets selected from prominent communities are reached more quickly than targets from non-prominent communities. Correlation coefficients of the linear relationship between prominence and average path length are displayed in each subfigure.

Figure 5.7: Prominence of individuals & communities on reachability. Panels (a) and (b): average path length versus PageRank ($\times 10^{-5}$) with linear fits ($r = -0.71$ and $r = -0.44$); panels (c) and (d): average path length versus $\lambda_i$ (log-scale) with linear fits ($r = -0.54$ and $r = -0.65$).

Finally, we examine the correlation between the individual prominence of targets measured by PageRank and community prominence measured by a random walk process in Fig. 5.8. Results show that these two measurements are highly correlated and consistent in the sense that prominent users are in prominent communities and prominent communities contain prominent users. For each community, we calculate the collective prominence of its users measured by the total, average, and maximum PageRank of its users. Subfigures (a)-(c) refer to communities in Gowalla and subfigures (d)-(f) refer to communities in FourSquare. Each point in a subfigure is a community; the x-axis of every subfigure refers to the community prominence, and the y-axis refers to the sum of PageRank in (a) and (d), the average PageRank in (b) and (e), and the maximum PageRank in (c) and (f). Correlation coefficients of the linear relationship between community and individual prominence are shown in each subfigure.

Figure 5.8: Prominence of individuals & communities correlations. Panels (a)-(c), Gowalla: sum, average, and maximum PageRank (log-scale) versus $\lambda_i$ (log-scale), with $r = 0.95$, $0.79$, and $0.81$; panels (d)-(f), FourSquare: the same quantities, with $r = 0.91$, $0.61$, and $0.82$.
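The aggregation and correlation described for Fig. 5.8 can be sketched as below. The helper names are illustrative, and the Pearson coefficient is written out in plain Python as an assumed stand-in for whatever statistics tool produced the reported $r$ values. Since both axes in the figure are log-scaled, the inputs would be logarithms of the prominence values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collective_prominence(communities, pagerank):
    """Per community: (sum, average, maximum) PageRank of its members."""
    rows = []
    for members in communities:
        scores = [pagerank[u] for u in members]
        rows.append((sum(scores), sum(scores) / len(scores), max(scores)))
    return rows
```

For example, correlating the logarithm of the first component of each row with the logarithm of $\lambda_i$ gives the kind of coefficient shown for the sum-PageRank panels.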

5.4 Summary of Results

By analyzing data recently available from location-based social media, we draw three conclusions from our social routing experiments. First, results show that while using both geographical and community information in modeling social routing for the small-world problem is more realistic than using either one alone, average path lengths are 3 times longer when attrition is eliminated and not even within two standard deviations of the ground truth, defined as the calculated average shortest path length. Second, COMGREEDY is more effective and robust at reaching targets than GEOGREEDY in terms of average path lengths and percentage of successful chains. It is quite plausible that participants could use COMGREEDY cognitively. For example, a holder can select an acquaintance whose occupation is mortgage insurance as being ‘closer’ to a commodity broker than a social science teacher. Third, results from the data show that prominent targets and targets in prominent communities can be reached much more quickly than average. This leads us to ask what the results would be if Travers and Milgram had not selected a broker but instead a much less prominent target such as a homeless man. To conclude, our results show that the small-world property holds for the prominent while everyone else is lost in the crowd, except when being reached by members of their own community.

CHAPTER 6
CONCLUSION AND FUTURE WORK

Table 6.1: Aspects of SNA & applications.

                       Geography        Interactions       Communities
Human Mobility         Distance         Communication      Group
Spreading Ideas        Long Ties        Weak Ties          Bridge Ties
Personalized Ranking   Geo. Influence   Peer Influ.        Collective Influ.
Small-world            Selection        Cognitive Biases   Routing

In Chapter 3, we examined human dynamics in online social networks in terms of geographical proximity, face-to-face interactions, and communities, and found several valuable insights. For instance, the creation of friendship between two people is more likely to occur when they are close, and friends and friends-of-friends, but not more distant connections, are more likely to be within geographic proximity. Geography has an effect on limiting face-to-face interactions as well as keyword similarity in terms of what users read on social media. One possible direction for future research is to investigate social influence as a function of geographical distance. For instance, if a user checks in at a location, how likely is a friend to check in at the same location in the future? The two applications we studied in Chapter 3 are human mobility & congestion modeling and ideas spreading & economic development. Geography shows how friends are likely to be close in terms of moving together (human mobility). Face-to-face interactions could be used in establishing connections in the wireless simulations, where nodes that interact frequently are more likely to establish a connection, and communities can be used to simulate a group of nodes moving together. For ideas spreading, geography can be used to measure the length of short and long ties, face-to-face interactions can be used to measure the strength of ties, and communities can be used to distinguish between bridge and non-bridge ties.

In Chapter 4, we proposed to personalize the ranking of URLs by using public information that users shared in social media. We incorporated the following two important aspects of social networks into the process of ranking URLs: geographical distance and community structure. Personalized ranking results from three variants of network flow are highly independent from PageRank, meaning that each individual has their own unique way to rank information. Experimental results show that personalization can improve ranking quality by up to 19% when compared to the baseline and 5% when compared to PageRank in Google Buzz. For Twitter, personalization improves ranking quality by up to 17% compared to the baseline, but it does not outperform PageRank. Future work could incorporate calculating the novelty of a piece of information by examining its keywords [90] and determining the popularity of information in terms of burstiness [91]. These filters allow users to see or filter information on the web through the eyes of the world.

In Chapter 5, results show that average path lengths in social searching are 3 times longer when attrition is eliminated and not even within two standard deviations of the ground truth. COMGREEDY is more effective than GEOGREEDY at reaching targets, and it is plausible that participants could use COMGREEDY cognitively. In addition, prominent targets can be reached much more quickly. The small-world property holds for the prominent while everyone else is lost in the crowd, except when being reached by members of their own community. Future work could incorporate face-to-face interactions for measuring potential cognitive biases in selecting the next acquaintance. In addition, instead of assuming a fixed probability for attrition, participants could drop out based on interactions, in the sense that the next folder holder has a higher chance of participating if he interacts frequently with the previous holder. To summarize, this thesis collects terabytes of information that users share on social networks and analyzes their social dynamics in terms of geography, face-to-face interactions, and community structures.

REFERENCES

[1] J. Kleinberg and S. Lawrence, “The structure of the web,” Sci., vol. 294, no. 5548, pp. 1849-1850, Nov. 2001.

[2] R. Lempel and S. Moran, “SALSA: The stochastic approach for link-structure analysis,” ACM Trans. Inf. Syst., vol. 19, no. 2, pp. 131-160, Apr. 2001.

[3] R. Albert et al., “The diameter of the world wide web,” Nature, vol. 401, no. 6749, pp. 130-131, Sept. 1999.

[4] T. Berners-Lee et al., “The semantic web,” Sci. Amer., vol. 284, no. 5, pp. 34-43, May 2001.

[5] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Comput. Networks and ISDN Syst., vol. 30, no. 1, pp. 107-117, Apr. 1998.

[6] Z. Gyongyi et al., “Combating web spam with trustrank,” in Proc. 30th Int. Conf. Very Large Data Bases, Toronto, Canada, 2004, pp. 576-587.

[7] J. Xu and H. Li, “Adarank: A boosting algorithm for information retrieval,” in Proc. 30th Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval, Amsterdam, Netherlands, 2007, pp. 391-398.

[8] Y. Liu et al., “Browserank: Letting web users vote for page importance,” in Proc. 31st Int. ACM SIGIR Conf. Res. and Develop. in Inform. Retrieval, Singapore, Republic of Singapore, 2008, pp. 451-458.

[9] M. Taylor et al., “Softrank: Optimizing non-smooth rank metrics,” in Proc. 1st Int. Conf. Web Search and Data Mining, Palo Alto, CA, 2008, pp. 77-86.

[10] H. Yan et al., “Architectural design and evaluation of an efficient web-crawling system,” J. Syst. Softw., vol. 60, no. 3, pp. 185-193, Feb. 2002.

[11] E. Leicht et al., “Large-scale structure of time evolving citation networks,” Eur. Phys. J. B, vol. 59, no. 1, pp. 75-83, Oct. 2007.


[12] S. Bao et al., “Optimizing web search using social annotations,” in Proc. 16th Int. Conf. World Wide Web, Alberta, Canada, 2007, pp. 501-510.

[13] J. Davitz et al., “ilink: Search and routing in social networks,” in Proc. 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Jose, CA, 2007, pp. 931-940.

[14] D. Carmel et al., “Personalized social search based on the user’s social network,” in Proc. 18th ACM Conf. Inform. and Knowledge Manage., Hong Kong, China, 2009, pp. 1227-1236.

[15] D. Horowitz and S. Kamvar, “The anatomy of a large-scale social search engine,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 431-440.

[16] A. Dong et al., “Time is of the essence: Improving recency ranking using Twitter data,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 331-340.

[17] B. Bahmani and A. Goel, “Partitioned multi-indexing: Bringing order to social search,” in Proc. 21st Int. Conf. World Wide Web, Lyon, France, 2012, pp. 399-408.

[18] T. Nguyen and B. Szymanski, “Social ranking techniques for the web,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Canada, 2013, pp. 49-55.

[19] D. Romero et al., “Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter,” in Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 695-704.

[20] A. Ritter et al., “Open domain event extraction from Twitter,” in Proc. 18th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Beijing, China, 2012, pp. 1104-1112.

[21] P. Bogdanov et al., “The social media genome: Modeling individual topic-specific behavior in social media,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Canada, 2013, pp. 236-242.

[22] T. Nguyen and B. Szymanski, “Using location-based social networks to validate human mobility and relationships models,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining (SNAA Workshop), Istanbul, Turkey, 2012, pp. 1247-1253.

[23] T. Nguyen, M. Chen, and B. Szymanski, “Analyzing the proximity and interactions of friends in communities in Gowalla,” in Proc. IEEE 13th Int. Conf. Data Mining Workshops, Dallas, TX, 2013, pp. 1036-1044.

[24] L. Backstrom et al., “Four degrees of separation,” in Proc. 4th ACM Int. Conf. Web Science, Evanston, IL, 2012, pp. 33-42.

[25] Y. Ahn et al., “Analysis of topological characteristics of huge online social networking services,” in Proc. 16th Int. Conf. World Wide Web, Alberta, Canada, 2007, pp. 835-844.

[26] H. Kwak et al., “What is Twitter, a social network or a news media?,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 591-600.

[27] A. Mislove et al., “Measurement and analysis of online social networks,” in Proc. 7th ACM SIGCOMM Conf. Internet Measurement, San Diego, CA, 2007, pp. 29-42.

[28] J. Kleinfeld, “Could it be a big world after all? The ‘six degrees of separation’ myth,” Soc., vol. 39, no. 2, pp. 61-66, Apr. 2002.

[29] J. Travers and S. Milgram, “An experimental study of the small world problem,” Sociometry, vol. 32, no. 4, pp. 425-443, Dec. 1969.

[30] P. Dodds et al., “An experimental study of search in global social networks,” Sci., vol. 301, no. 5634, pp. 827-829, Aug. 2003.

[31] D. Watts et al., “Identity and search in social networks,” Sci., vol. 296, no. 5571, pp. 1302-1305, May 2002.

[32] M. Granovetter, Getting a Job: A Study of Contacts and Careers. Chicago, IL: University of Chicago Press, 1995.

[33] D. Watts, “Networks, dynamics, and the small-world phenomenon,” AJS, vol. 105, no. 2, pp. 493-527, Sept. 1999.

[34] M. Marchiori, “The quest for correct information on the web hyper search engines,” Comput. Networks and ISDN Syst., vol. 29, no. 8, pp. 1225-1235, Sept. 1997.

[35] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604-632, Sept. 1999.

[36] C. Warden. (2010, April 22) EdgeRank: The Secret Sauce That Makes Facebook’s News Feed Tick [Blog]. Available: http://www.techcrunch.com/2010/04/22/facebook-edgerank/ (Date Last Accessed, September, 22, 2014).

[37] C. Burges et al., “Learning to rank using gradient descent,” in Proc. 22nd Int. Conf. Mach. Learning, Bonn, Germany, 2005, pp. 89-96.

[38] T. Joachims, “Optimizing search engines using clickthrough data,” in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Edmonton, Canada, 2002, pp. 133-142.

[39] R. Caruana et al., “Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation,” in Proc. Advances in Neural Inform. Processing Syst., Denver, CO, 1995, pp. 959-965.

[40] K. Crammer and Y. Singer, “Pranking with ranking,” in Proc. Advances in Neural Inform. Processing Syst., Vancouver, Canada, 2001, pp. 641-647.

[41] J. Kleinberg, “The small-world phenomenon: An algorithmic perspective,” in Proc. 32nd Ann. ACM Symp. Theory Computing, Portland, OR, 2000, pp. 163-170.

[42] M. Burke et al., “Social capital on Facebook: Differentiating uses and users,” in Proc. SIGCHI Conf. Human Factors in Computing Syst., Vancouver, Canada, 2011, pp. 571-580.

[43] A. Mislove et al., “Understanding the demographics of Twitter users,” presented at 2011 5th Int. AAAI Conf. Weblogs and Social Media, Barcelona, Spain, 2011.

[44] M. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, no. 6, doi: 10.1103/PhysRevE.69.066133, June 2004.

[45] A. Clauset et al., “Finding community structure in very large networks,” Phys. Rev. E, vol. 70, no. 6, doi: 10.1103/PhysRevE.70.066111, Dec. 2004.

[46] M. Newman, “Modularity and community structure in networks,” Proc. Nat. Academy Sci., vol. 103, no. 23, pp. 8577-8582, May 2006.

[47] S. Fortunato and M. Barthelemy, “Resolution limit in community detection,” Proc. Nat. Academy Sci., vol. 104, no. 1, pp. 36-41, Dec. 2006.

[48] M. Chen et al., “A new metric for quality of network community structure,” ASE Human J., vol. 1, no. 4, pp. 226-240, 2013.

[49] M. Goldberg et al., “Finding overlapping communities in social networks,” in Proc. 4th ASE/IEEE Int. Conf. Social Computing, Minneapolis, MN, 2010, pp. 37-54.

[50] G. Palla et al., “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814-818, Apr. 2005.

[51] M. Sipser, Introduction to the Theory of Computation. Boston, MA: PWS, 1997.

[52] B. Good et al., “The performance of modularity maximization in practical contexts,” Phys. Rev. E, vol. 81, no. 4, doi: 10.1103/PhysRevE.81.046106, 2010.

[53] M. Newman, “Analysis of weighted networks,” Phys. Rev. E, vol. 70, no. 5, doi: 10.1103/PhysRevE.70.056131, Apr. 2004.

[54] J. Xie and B. Szymanski, “Towards linear time overlapping community detection in social networks,” in Proc. 16th Pacific-Asia Conf. Knowledge Discovery and Data Mining PAKDD, Kuala Lumpur, Malaysia, 2012, pp. 25-36.

[55] S. Fortunato, “Community detection in graphs,” Phys. Rep., vol. 486, no. 3, pp. 75-174, Feb. 2010.

[56] J. Leskovec et al., “Empirical comparison of algorithms for network community detection,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 631-640.

[57] R. Kannan et al., “On clusterings: Good, bad and spectral,” J. ACM, vol. 51, no. 3, pp. 497-515, May 2004.

[58] R. Dunbar, “Neocortex size as a constraint on group size in primates,” J. Human Evolution, vol. 22, no. 6, pp. 469-493, June 1992.

[59] M. Gonzalez et al., “Understanding individual human mobility patterns,” Nature, vol. 453, no. 7196, pp. 779-782, June 2008.

[60] E. Boxman et al., “The impact of social and human capital on the income attainment of Dutch managers,” Social Networks, vol. 13, no. 1, pp. 51-73, Mar. 1991.

[61] R. Burt, Structural Holes: The Social Structure of Competition. Cambridge, MA: Harvard University Press, 1992.

[62] R. Reagans and E. Zuckerman, “Networks, diversity, and productivity: The social capital of corporate R&D teams,” Organ. Sci., vol. 12, no. 4, pp. 502-517, Aug. 2001.

[63] M. Ruef, “Strong ties, weak ties and islands: Structural and cultural predictors of organizational innovation,” ICC, vol. 11, no. 3, pp. 427-449, Jun. 2002.

[64] R. Burt, “Structural holes and good ideas,” AJS, vol. 110, no. 2, pp. 349-399, Sept. 2004.

[65] M. Granovetter, “The impact of social structure on economic outcomes,” JEP, vol. 19, no. 1, pp. 33-50, Dec. 2005.

[66] A. Pentland, Social Physics: How Good Ideas Spread Lessons From a New Science. London, UK: Penguin Press, 2014.

[67] D. Liben-Nowell et al., “Geographic routing in social networks,” Proc. Nat. Academy Sci., vol. 102, no. 33, pp. 11623-11628, June 2005.

[68] L. Adamic and E. Adar, “How to search a social network,” Social Networks, vol. 27, no. 3, pp. 187-203, Jul. 2005.

[69] M. Granovetter, “The strength of weak ties,” AJS, vol. 78, no. 6, pp. 1360-1380, May 1973.

[70] L. Bettencourt et al., “Growth, innovation, scaling, and the pace of life in cities,” Proc. Nat. Academy Sci., vol. 104, no. 17, pp. 7301-7306, Mar. 2007.

[71] W. Pan et al., “Urban characteristics attributable to density-driven tie formation,” Nat. Commun., vol. 4, no. 1, doi: 10.1038/ncomms2961.

[72] G. Ghasemiesfeh et al., “Complex contagion and the weakness of long ties in social networks: Revisited,” in Proc. 14th ACM Conf. Electronic Commerce, Philadelphia, PA, 2013, pp. 507-524.

[73] D. Centola and M. Macy, “Complex contagions and the weakness of long ties,” AJS, vol. 113, no. 3, pp. 702-734, Nov. 2007.

[74] N. Eagle et al., “Network diversity and economic development,” Sci., vol. 328, no. 5981, pp. 1029-1031, May 2010.

[75] E. Rogers, Diffusion of Innovations. New York: Free Press, 2003.

[76] (2014, August 27) Gross Domestic Product by State [Online]. Available: http://www.bea.gov/regional/gsp/ (Date Last Accessed, September, 22, 2014).

[77] (2014, August 27) Patents By Country, State, and Year - Utility Patents [Online]. Available: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cstutl.htm (Date Last Accessed, September, 22, 2014).

[78] (2014, August 27) Statistics of U.S. Businesses [Online]. Available: http://www.census.gov/econ/susb/ (Date Last Accessed, September, 22, 2014).

[79] (2014, August 27) Annual Estimates of the Population for the United States, Regions, States, and Puerto Rico [Online]. Available: http://www.census.gov/popest/index.html (Date Last Accessed, September, 22, 2014).

[80] (2014, August 27) Census of Population and Housing 2010 [Online]. Available: https://www.census.gov/prod/www/decennial.html (Date Last Accessed, September, 22, 2014).

[81] F. Cairncross, The Death of Distance: How the Communications Revolution is Changing Our Lives. Cambridge, MA: Harvard Business Review Press, 2001.

[82] J. Levandoski et al., “Lars: A location-aware recommender system,” in Proc. 28th Int. Conf. Data Eng., Washington, DC, 2012, pp. 450-461.

[83] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” in Proc. 12th Int. Conf. Inform. and Knowledge Manage., New Orleans, LA, 2003, pp. 556-559.

[84] A. Sarma et al., “A sketch-based distance oracle for web-scale graphs,” in Proc. 3rd ACM Int. Conf. Web Search and Data Mining, New York, NY, 2010, pp. 401-410.

[85] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge, England: Cambridge University Press, 2010.

[86] A. Goldberg et al., “Network flow algorithms,” in Paths, Flows, and VLSI-Design, Berlin, Heidelberg: Springer, 1990, pp. 101-164.

[87] S. Milgram, “The small world problem,” Psychology Today, vol. 2, no. 1, pp. 60-67, May 1967.

[88] S. Schnettler, “A structured overview of 50 years of small-world research,” Social Networks, vol. 31, no. 3, pp. 165-178, July 2009.

[89] H. P. Thadakamalla et al., “Search in spatial scale-free networks,” New J. Phys., vol. 9, no. 6, doi: 10.1088/1367-2630/9/6/190, June 2007.

[90] S. Sreenivasan, “Quantitative analysis of the evolution of novelty in cinema through crowdsourced keywords,” Scientific Rep., vol. 3, no. 1, doi: 10.1038/srep02758, Apr. 2013.

[91] A. Hoonlor et al., “Trends in computer science research,” Commun. ACM, vol. 56, no. 10, pp. 74-83, Oct. 2013.

[92] V. Lolla et al., “Detecting MAC layer back-off timer violations in mobile ad hoc networks,” in Proc. 26th IEEE Int. Conf. Distributed Comput. Syst., Lisboa, Portugal, 2006, p. 63.

[93] Q. Chen et al., “Overhaul of IEEE 802.11 modeling and simulation in ns-2,” in Proc. 10th ACM Symp. Modeling, Analysis, and Simulation Wireless and Mobile Syst., Chania, Greece, 2007, pp. 159-168.

[94] H. Zhang et al., “Bootstrapping deny-by-default access control for mobile ad-hoc networks,” in IEEE Military Commun. Conf., San Diego, CA, 2008, pp. 1-7.

[95] J. Broch et al., “A performance comparison of multi-hop wireless ad hoc network routing protocols,” in Proc. 4th Ann. ACM/IEEE Int. Conf. Mobile Computing and Networking, Dallas, TX, 1998, pp. 85-97.

[96] P. Erdos and A. Renyi, “On random graphs,” Publ. Math. Debrecen, vol. 6, no. 1, pp. 290-297, 1959.

[97] F. Simini et al., “A universal model for mobility and migration patterns,” Nature, vol. 484, no. 7392, pp. 96-100, Apr. 2012.

[98] P. Boldi et al., “Ubicrawler: A scalable fully distributed web crawler,” Software: Practice and Experience, vol. 34, no. 8, pp. 711-726, July 2004.

[99] T. Camp et al., “A survey of mobility models for ad hoc network research,” Wireless Commun. and Mobile Computing, vol. 2, no. 5, pp. 483-502, Aug. 2002.

[100] C. Bettstetter et al., “The node distribution of the random waypoint mobility model for wireless ad hoc networks,” IEEE Trans. Mobile Computing, vol. 2, no. 3, pp. 257-269, July 2003.

[101] W. Navidi and T. Camp, “Stationary distributions for the random waypoint mobility model,” IEEE Trans. Mobile Comput., vol. 3, no. 1, pp. 99-108, Jan. 2004.

[102] M. Kurant et al., “Towards unbiased BFS sampling,” Computing Res. Repository, vol. 29, no. 9, pp. 1799-1809, Oct. 2011.

[103] C. Foh and M. Zukerman, “Performance analysis of the IEEE 802.11 MAC protocol,” in Proc. Eur. Wireless Conf., Florence, Italy, 2002, pp. 184-190.

[104] S. Geyik et al., “PCFG based synthetic mobility trace generation,” in Proc. IEEE Global Telecommun. Conf., Miami, FL, 2010, pp. 1-5.

[105] M. Chen et al., “On measuring the quality of a network community structure,” in Proc. ASE/IEEE Int. Conf. Social Computing, Washington, DC, 2013, pp. 122-127.

[106] K. Kuzmin et al., “Parallel overlapping community detection with SLPA,” in Proc. ASE/IEEE Int. Conf. Social Computing, Washington, DC, 2013, pp. 204-212.

[107] D. Wang et al., “Human mobility, social ties, and link prediction,” in Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, CA, 2011, pp. 1100-1108.

[108] E. Cho et al., “Friendship and mobility: User movement in location-based social networks,” in Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, CA, 2011, pp. 1082-1090.

[109] T. Razafindralambo and F. Valois, “Performance evaluation of backoff algorithms in 802.11 ad-hoc networks,” in Proc. 3rd ACM Int. Performance Evaluation Wireless Ad hoc, Sensor and Ubiquitous Networks, Torremolinos, Spain, 2006, pp. 82-89.

[110] J. Yoo et al., “Random waypoint considered harmful,” in Proc. 22nd Ann. Joint Conf. IEEE Comput. and Commun., San Francisco, CA, 2003, pp. 1312-1321.

[111] L. Katzir et al., “Estimating sizes of social networks via biased sampling,” in Proc. 20th Int. Conf. World Wide Web, Hyderabad, India, 2011, pp. 597-606.

[112] C. Boldrini et al., “Users mobility models for opportunistic networks: The role of physical locations,” in Proc. Wireless Rural and Emergency Commun., Rome, Italy, 2007, pp. 255-267.

[113] X. Hong et al., “A group mobility model for ad hoc wireless networks,” in Proc. 2nd ACM Int. Workshop Modeling, Analysis and Simulation of Wireless and Mobile Syst., Seattle, WA, 1999, pp. 53-60.

[114] H. Hsu, Schaum’s Outline of Probability, Random Variables, and Random Processes. New York: McGraw-Hill, 2010.

[115] A. Langville et al., “Deeper inside pagerank,” Internet Math., vol. 1, no. 3, pp. 335-380, Jan. 2004.

[116] W. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton, NJ: Princeton University Press, 1994.

[117] J. A. Rice, Mathematical Statistics and Data Analysis. Stamford, CT: Cengage Learning, 2006.

[118] A. Banerjee and S. Basu, “A social query model for decentralized search,” in Proc. 2nd ACM Workshop on Social Network Mining and Analysis, Las Vegas, NV, 2008.

[119] A. Bozzon et al., “Answering search queries with crowdsearcher,” in Proc. 21st Int. Conf. World Wide Web, Lyon, France, 2012, pp. 1009-1018.

[120] S. Sahay et al., “Social ranking for spoken web search,” in Proc. 20th ACM Int. Conf. Inform. and Knowledge Manage., Glasgow, Scotland, 2011, pp. 1835-1840.

[121] A. Agarwal et al., “Learning to rank networked entities,” in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Philadelphia, PA, 2006, pp. 14-23.

[122] S. Chakrabarti et al., “Focused crawling: A new approach to topic-specific web resource discovery,” Comput. Networks, vol. 31, no. 11, pp. 1623-1640, May 1999.

[123] A. Maiya and T. Wolf, “Expansion and search in networks,” in Proc. 19th ACM Int. Conf. Inform. and Knowledge Manage., Toronto, Canada, 2010, pp. 239-248.

[124] J. Kleinberg and E. Tardos, Algorithm Design. London, UK: Pearson, 2006.

[125] M. Girvan and M. Newman, “Community structure in social and biological networks,” Proc. Nat. Academy Sci., vol. 99, no. 12, pp. 7821-7826, Apr. 2002.

[126] J. Xie et al., “Overlapping community detection in networks: The state of the art and comparative study,” ACM Comput. Surveys, vol. 45, no. 4, Aug. 2013.

[127] S. Scellato et al., “Distance matters: Geo-social metrics for online social networks,” in Proc. 3rd Conf. Online Social Networks, Boston, MA, 2010, p. 8.

[128] M. Newman, “Communities, modules and large-scale structure in networks,” Nature Physics, vol. 8, no. 1, pp. 25-31, Dec. 2011.

[129] L. Backstrom et al., “Find me if you can: Improving geographical prediction with social and spatial proximity,” in Proc. 19th Int. Conf. World Wide Web, Raleigh, NC, 2010, pp. 61-70.

[130] S. Scellato et al., “Socio-spatial properties of online location-based social networks,” in Proc. 5th Int. AAAI Conf. Weblogs and Social Media, Barcelona, Spain, 2011, pp. 329-336.

[131] M. Allamanis et al., “Evolution of a location-based online social network: Analysis and models,” in Proc. 2012 ACM Conf. Internet Measurement, Boston, MA, 2012, pp. 145-158.

[132] S. Adali et al., “Deconstructing centrality: Thinking locally and ranking globally in networks,” in Proc. IEEE/ACM Int. Conf. Advances in Social Network Analysis and Mining, Niagara Falls, Canada, 2013, pp. 418-425.

[133] P. Expert et al., “Uncovering space-independent communities in spatial networks,” Proc. Nat. Academy Sci., vol. 108, no. 19, pp. 7663-7668, Aug. 2011.

[134] M. McPherson et al., “Birds of a feather: Homophily in social networks,” Ann. Review Sociol., vol. 27, no. 1, pp. 415-444, Aug. 2001.

[135] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in Proc. ACM SIGKDD Workshop Mining Data Semantics, Beijing, China, 2012, pp. 31-38.

[136] M. Deutsch and H. Gerard, “A study of normative and informational social influences upon individual judgment,” J. Abnormal & Social Psychology, vol. 51, no. 3, pp. 629-636, Sept. 1955.

[137] E. Bulut and B. Szymanski, “Exploiting friendship relations for efficient routing in mobile social networks,” IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 12, pp. 2254-2265, Dec. 2012.

[138] M. Cha et al., “A measurement-driven analysis of information propagation in the flickr social network,” in Proc. 18th Int. Conf. World Wide Web, Madrid, Spain, 2009, pp. 721-730.

[139] A. Hannak et al., “Measuring personalization of web search,” in Proc. 22nd Int. Conf. World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 527-538.

[140] J. Leskovec and E. Horvitz, “Planetary-scale views on a large instant-messaging network,” in Proc. 17th Int. Conf. World Wide Web, Beijing, China, 2008, pp. 915-924.

[141] D. Watts and S. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440-442, June 1998.

[142] J. Onnela et al., “Geographic constraints on social network groups,” PloS One, vol. 6, no. 4, doi: 10.1371/journal.pone.0016939, Apr. 2011.

[143] E. Garfield, “It is a small world after all,” Essays of an Inform. Scientist, vol. 4, no. 43, pp. 299-304, Oct. 1978.

[144] S. Adali et al., “Attentive betweenness centrality (ABC): Considering options and bandwidth when measuring criticality,” in Proc. ASE/IEEE Int. Conf. Social Computing, Amsterdam, Netherlands, 2012, pp. 358-367.

[145] E. Daly and M. Haahr, “Social network analysis for routing in disconnected delay-tolerant MANETs,” in Proc. 8th ACM Int. Symp. on Mobile Ad Hoc Networking and Computing, Montreal, Canada, 2007, pp. 32-40.

[146] M. Newman, “Models of the small world,” J. Stat. Phys, vol. 101, no. 4, pp. 819-841, Nov. 2000.

[147] B. Uzzi and J. Spiro, “Collaboration and creativity: The small world problem,” AJS, vol. 111, no. 2, pp. 447-504, Sept. 2005.

[148] N. Hodas and K. Lerman, “The simple rules of social contagion,” Sci. Rep., vol. 4, no. 434, doi: 10.1038/srep04343.

[149] L. Bettencourt and G. West, “A unified theory of urban living,” Nature, vol. 467, no. 7318, pp. 912-913, Oct. 2010.