Extraction and Analysis of Facebook Friendship Relations
Chapter 12

Salvatore Catanese, Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti

S. Catanese, P. De Meo, G. Fiumara: Department of Physics, Informatics Section, University of Messina, Messina, Italy
E. Ferrara: Department of Mathematics, University of Messina, Messina, Italy
A. Provetti: Department of Physics, Informatics Section, University of Messina, Messina, Italy; Oxford-Man Institute, University of Oxford, Oxford, UK

In A. Abraham (ed.), Computational Social Networks: Mining and Visualization, DOI 10.1007/978-1-4471-4054-2_12, © Springer-Verlag London 2012

Abstract
Online social networks (OSNs) are a unique web and social phenomenon, affecting the tastes and behaviors of their users and helping them to maintain or create friendships. It is interesting to analyze the growth and evolution of online social networks both from the point of view of marketing and the offer of new services and from a scientific viewpoint, since their structure and evolution may share similarities with real-life social networks. In the social sciences, several techniques for analyzing (off-line) social networks have been developed to evaluate quantitative properties (e.g., defining metrics and measures of structural characteristics of the networks) or qualitative aspects (e.g., studying the attachment model for network evolution, binary trust relationships, and the link prediction problem). However, OSN analysis poses novel challenges to both computer and social scientists. We present our long-term research effort in analyzing Facebook, the largest and arguably most successful OSN today: it gathers more than 500 million users. Access to data about Facebook users and their friendship relations is restricted; thus, we acquired the necessary information directly from the front end of the website in order to reconstruct a subgraph representing anonymous interconnections among a significant subset of users. We describe our ad hoc, privacy-compliant crawler for Facebook data extraction. To minimize bias, we adopt two different graph mining techniques: breadth-first search (BFS) and rejection sampling. To analyze the structural properties of samples consisting of millions of nodes, we developed a specific tool for analyzing quantitative and qualitative properties of social networks, adopting and improving existing Social Network Analysis (SNA) techniques and algorithms.

Introduction
The increasing popularity of online social networks (OSNs) is witnessed by the huge number of users that MySpace, Facebook, etc. acquired in a short amount of time. The growing accessibility of the web, through several media, gives most users a 24/7 online presence and encourages them to build an online mesh of relationships. As OSNs become the tools of choice for connecting people, we expect that their structure will increasingly mirror real-life society and relationships. At the same time, with an estimated 13 million transactions per second (at peak), Facebook is one of the most challenging computer science artifacts, posing several optimization, scalability, and robustness challenges.

The essential feature of Facebook is the friendship relation between participants. It consists mainly of a permission to consult each other’s friend lists and posted content (news, photos, links, blog posts, etc.); such permission is mutual. In this chapter, we consider the Facebook friendship graph as the (undirected) graph having FB users as vertices and edges representing their friendship relations.

The analysis of OSN connections is a fascinating topic on multiple levels. First, a complete study of the structure of large real (i.e., off-line) communities was previously impossible or at least very expensive, even at fractions of the scale considered in OSN analysis. Second, data is clearly defined by some structural constraints, usually provided by the OSN structure itself, w.r.t.
real-life relations, which are often hard to identify. The interpretation of these data opens up new fascinating research issues, e.g., is it possible to study OSNs with the tools of traditional Social Network Analysis, as in Wasserman-Faust [89] and [69]? To what extent is the behavior of OSN users comparable to that of people in real-life social networks [39]? What are the topological characteristics of the relationship network (friendship, in the case of FB) of an OSN [4]? And what about their structure and evolution [58]?

To address these questions, further computer science research is needed to design and develop the tools to acquire and analyze data from massive OSNs. First, proper social metrics need to be introduced in order to identify and evaluate properties of the considered OSN. Second, scalability is an issue faced by anyone who wants to study a large OSN independently from the commercial organization that owns and operates it. For instance, last year Gjoka et al. [42] estimated the crawling overhead needed to collect the whole Facebook graph at 44 TB of data. Moreover, even when such data could be acquired and stored locally (which, however, raises storage issues related to social network compression [16, 17]), it is nontrivial to devise and implement effective functions that traverse and visit the graph or even evaluate simple metrics. In the literature, extensive research has been conducted on sampling techniques for large graphs; only recently, however, have studies shed light on the bias that those methodologies may introduce. That is, depending on the method by which the graph has been explored, certain features may be over- or underrepresented with respect to the actual graph.

Our long-term research on these topics is presented in this chapter.
We describe in detail the architecture and functioning modes of our ad hoc Facebook crawler, by which, even on modest computational resources, we can extract large samples containing several million nodes. Two recently collected samples of about eight million nodes each are described and analyzed in detail. To comply with the FB end-user licence, data is made anonymous upon extraction; hence, we never store users’ sensitive data. Next, we describe our newly developed tool for graph analysis and visualization, called LogAnalysis. LogAnalysis may be used to compute the metrics that are most pertinent to OSN graph analysis, and can be adopted as an open-source, multiplatform alternative to the well-known NodeXL tool.

Background and Related Literature
The task of extracting and analyzing data from Online Social Networks has attracted the interest of many researchers, e.g., in [7, 39, 93]. In this section, we review some relevant literature directly related to our approach. In particular, we first discuss techniques to crawl large social networks and collect data from them (see section “Data Collection in OSN”). Collected data are usually mapped onto graph data structures (and sometimes hypergraphs) with the goal of analyzing their structural properties.

The ultimate goal of these efforts is perhaps best laid out by Kleinberg [56]: topological properties of graphs may be reliable indicators of human behaviors. For instance, several studies show that the node degree distribution follows a power law, both in real and online social networks. That feature points to the fact that most social network participants are often inactive, while a few key users generate a large portion of data/traffic. As a consequence, many researchers leverage tools from graph theory to analyze the social network graph with the goal, among others, of better interpreting personal and collective behaviors on a large scale.
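To make the power-law remark concrete, the following minimal sketch computes node degrees and a degree histogram from an undirected edge list, and gives a rough estimate of a power-law exponent. It is an illustration only and not part of the chapter's tooling: the function names and the toy edge list are hypothetical, and the exponent uses the standard continuous maximum-likelihood estimate, which is only an approximation for integer degrees.

```python
import math
from collections import Counter


def node_degrees(edges):
    """Degree of each node in an undirected graph given as (u, v) pairs.

    Duplicate edges and self-loops are ignored, since friendship is
    mutual and a user is not their own friend.
    """
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    deg = Counter()
    for edge in unique:
        for node in edge:
            deg[node] += 1
    return deg


def degree_histogram(deg):
    """Map each degree value k to the number of nodes having degree k."""
    return Counter(deg.values())


def powerlaw_exponent(values, k_min):
    """Continuous MLE: alpha = 1 + n / sum(ln(k / k_min)) over k >= k_min.

    Assumption: the tail above k_min really follows a power law; this
    function does not test that hypothesis.
    """
    tail = [k for k in values if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above k_min")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Hypothetical toy data: one hub ("a") and several low-degree users.
    toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
    deg = node_degrees(toy_edges)
    print(dict(degree_histogram(deg)))
    print(powerlaw_exponent(deg.values(), k_min=1))
```

On a real sample the estimate is meaningful only for the tail of the distribution, so the choice of `k_min` matters and should be checked rather than fixed at 1 as in the toy usage.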
The list of potential research questions arising from the analysis of OSN graphs is very long; in the following, we shall focus on three themes which are directly relevant to our research:
1. Node Similarity Detection, i.e., the task of assessing the degree of similarity of two users in an OSN (see section “Similarity Detection”)
2. Community Detection, i.e., the task of finding groups of users (called communities) who frequently interact with each other but seldom with those outside their community (see section “Community Detection”)
3. Influential User Detection, i.e., the task of identifying users capable of stimulating other users to join activities/discussions in their OSN (see section “Influential User Detection”).

Data Collection in OSN
Most works focusing on data collection adopt techniques of web information extraction [34] to crawl the front end of websites; this is because OSN datasets are usually not publicly accessible: data rests in back-end databases that are accessible only through the web interface. The problem of sampling from large graphs with several graph mining techniques is discussed in [63], in order to establish whether it is possible to avoid bias when acquiring a subset of the whole graph of a social network. The main outcome of the analysis in [63] is that a sample of 15% of the whole graph preserves most of its properties. In [69], the authors crawled data from large online social networks like Orkut, Flickr, and LiveJournal. They carried out an in-depth analysis of OSN topological properties (e.g., link symmetry, power-law node degrees, group formation) and discussed challenges arising from large-scale crawling of OSNs. Ye et al. [93] considered the problem of crawling OSNs, analyzing quantitative aspects like the efficiency of the adopted visiting algorithms and the bias of data produced by different crawling approaches. The work by Gjoka et al.
[42] on OSN graphs is perhaps the most similar to our current research (e.g., [22]). Gjoka et al. have sampled and analyzed the Facebook friendship graph with different visiting algorithms, namely BFS, Random Walk, and Metropolis-Hastings Random Walk. Our objectives differ from those of Gjoka et al. because their goal is to produce a consistent sample of the Facebook graph. A sample is said to be consistent when some of its key structural properties (i.e., node degree distribution, assortativity, and clustering coefficient) approximate fairly well the corresponding properties of the original Facebook graph.
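As a rough illustration of two of the visiting algorithms named above, the sketch below draws a BFS sample and a Metropolis-Hastings random walk sample from a small synthetic graph and compares their mean degree with that of the full graph. Everything here is an assumption made for illustration: the graph is a toy preferential-attachment model rather than Facebook data, the function names are hypothetical, and the code reproduces neither the crawler described in this chapter nor the implementation of Gjoka et al. The random walk uses the textbook acceptance rule min(1, deg(u)/deg(v)), which targets a uniform distribution over nodes.

```python
import random
from collections import deque


def toy_graph(n, m, seed=0):
    """Grow an undirected graph by preferential attachment (synthetic stand-in)."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    ends = []  # every node appears once per incident edge

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)
        ends.extend((u, v))

    for u in range(m + 1):        # seed clique on m + 1 nodes
        for v in range(u):
            link(u, v)
    for u in range(m + 1, n):     # each newcomer picks m distinct targets
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for v in sorted(targets):
            link(u, v)
    return adj


def bfs_sample(adj, start, budget):
    """Return the first `budget` distinct nodes discovered by BFS from `start`."""
    visited, seen, queue = [start], {start}, deque([start])
    while queue and len(visited) < budget:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in seen:
                seen.add(v)
                visited.append(v)
                queue.append(v)
                if len(visited) == budget:
                    break
    return visited


def mhrw_sample(adj, start, steps, rng):
    """Metropolis-Hastings random walk; rejected moves repeat the current node."""
    current, walk = start, [start]
    for _ in range(steps - 1):
        candidate = rng.choice(sorted(adj[current]))
        # Accept with probability min(1, deg(current) / deg(candidate)).
        if rng.random() <= len(adj[current]) / len(adj[candidate]):
            current = candidate
        walk.append(current)
    return walk


def mean_degree(adj, nodes):
    """Average degree over `nodes` (repeated nodes are counted each time)."""
    nodes = list(nodes)
    return sum(len(adj[u]) for u in nodes) / len(nodes)


if __name__ == "__main__":
    g = toy_graph(n=5000, m=3, seed=1)
    print("full graph :", mean_degree(g, g))
    print("BFS sample :", mean_degree(g, bfs_sample(g, 0, 500)))
    print("MHRW sample:", mean_degree(g, mhrw_sample(g, 0, 500, random.Random(2))))
```

Comparing the three printed averages gives a first, informal check of sample consistency in the sense defined above; a fuller check would also compare assortativity and clustering coefficient.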