Link Prediction in Social Networks Using Computationally Efficient Topological Features
Total Page:16
File Type:pdf, Size:1020Kb
Link Prediction in Social Networks using Computationally Efficient Topological Features Michael Fire, Lena Tenenboim, Ofrit Lesser, Rami Puzis, Lior Rokach and Yuval Elovici Deutsche Telekom Laboratories at Ben-Gurion University of the Negev, Department of Information Systems Engineering, Ben Gurion University, Be’er Sheva, 84105, Israel Email: mickyfi@bgu.ac.il, [email protected], [email protected], [email protected], [email protected], and [email protected] Abstract—Online social networking sites have become increas- heuristic techniques for uncovering hidden links missed, by the ingly popular over the last few years. As a result, new interdisci- WEB crawler, are useful for completing the network structure. plinary research directions have emerged in which social network Detection of hidden links is also very practical in friend- analysis methods are applied to networks containing hundreds millions of users. Unfortunately, links between individuals may suggestion mechanisms used by online social networks. In be missing either due to imperfect acquirement processes or such cases, the hidden links may consist of existing social because they are not yet reflected in the online network (i.e., ties that have not yet been established in the particular social friends in real-world did not form a virtual connection.) Existing network. Chen et al. [1] depict several algorithms used by IBM link prediction techniques lack the scalability required for full on their internal social network, which enable its employees application on a continuously growing social network. The primary bottleneck in link prediction techniques is ex- to connect with each other. The problem of predicting the tracting structural features required for classifying links. In this existence of hidden links or of creating new ones in social paper we propose a set of simple, easy-to-compute structural networks is commonly referred to as the Link Prediction prob- features, that can be analyzed to identify missing links. We show lem. Link prediction has also many applications outside the that by using simple structural features, a machine learning domain of social networks. For example, in bioinformatics link classifier can successfully identify missing links, even when applied to a hard problem of classifying links between individuals prediction can be used to find interactions between proteins with at least one common friend. A new friends measure that [2]; in e-commerce it can help build recommendation systems we developed is shown to be a good predictor for missing [3]; and in the security domain link prediction can assist in links. An evaluation experiment was performed on five large identifying hidden groups of terrorists or criminals [4]. Since Social Networks datasets: Facebook, Flickr, YouTube, Academia the Link Prediction problem is relevant to different scenarios, and TheMarker. Our methods can provide social network site operators with the capability of helping users to find known, several algorithms have been proposed in recent years to solve offline contacts and to discover new friends online. They may also it. Most of the solutions are generally based on supervised be used for exposing hidden links in an online social network. machine learning and selecting relevant features, Bayesian Index Terms—Link Prediction, HiddenLinks, Social Networks, probabilistic models, relational Bayesian networks, or linear Supervised Learning algebraic methods. Further details on these approaches can be found in a thorough survey written by Mohammed J. Zaki and I. INTRODUCTION AND BACKGROUND Mohammad Al Hasan [5]. In recent years, online social networks have been growing One common approach for solving the Link Prediction prob- exponentially. Offering individuals with similar personal and lem is using supervised learning algorithms. This approach business interests the possibility of meeting and networking. was introduced by Liben-Nowell and Kleinberg in 2003 [6], Social Networks are creating new opportunities to develop who studied the usefulness of graph topological features by friendships, share ideas and conduct business. Online social testing them on bibliographic data sets. In 2006 Hasan et. al [4] networking services such as Facebook, Twitter and Flickr, extended their work, and since then several other researchers just to name a few, have become part of the daily life of have implemented this approach. Most of the solutions that millions of people around the world. The enormous growth these researchers proposed were tested on bibliographic or on of these networks has resulted in several research directions co-authorship data sets [4], [6], [7], [8]. In 2009, Song et al. that examine the structural and behavioral properties of large- used matrix factorization to estimate the similarity between scale social networks. Typically researchers in this area collect nodes in real life social networks such as Facebook and information by using WEB crawler software. Such crawlers, MySpace [9]. Recently, in 2011, W J. Cukierski et. al. [10] however, may sometimes collect only partial information. This extracted 94 distinct graph features. Using a Random Forests is due to various causes such as the attempt to access broken classifier, they achieved impressive results in predicting links Web links, efforts by the social network operator to block on Flickr datasets. various subscribers for one reason or another, communication In order to build an efficient classifier for link prediction, failures etc. Consequently, those depending upon the Web it is crucial to define and calculate a set of graph structural crawler might find themselves with only partial information features. When dealing with large scale graphs that may about the set of links within a social network. In such cases, include millions of nodes and edges, one of the challenges is the computationally intensive extraction of such features. YouTube is a popular video-sharing site that includes a For example, Facebook has nearly 700 million registered users social network. YouTube switched from directed links to a and each month an average 20 million new users are added to two-phase symmetric link-creation process in 2007. In this the network[11]. In addition the topologies of social networks paper we use the dataset published by [22], which was have several well-known characteristics such as power law collected while YouTube was still a directed graph. degree distribution [12] and the small world phenomenon The following two datasets where obtained for this research [13]. The power law degree distribution in social networks using a web-crawler. Academia.edu is a platform for aca- suggests that there are individuals with a large number of demics to share and follow research underway in a particular connections (hubs). Computing local topological features on a field or discipline. Academics upload their papers to share subgraph consisting only of friends of these individuals may them with other academics in over 100,000 research areas. be computationally intensive. An Academia social network member may choose to follow In this paper, we present a solution to the Link Prediction any of the other members in this network, hence the directed problem based on a machine learning classifier trained on a nature of the links within this network. small set of easy-to-compute topological features. These clas- TheMarker Cafe is an Israelis online social network site sifiers are evaluated using several social network datasets such that allows its member to connect and interact. Since most of as Facebook[14], Flickr[15], Youtube[16], Academia[17] and its members are Israeli, most interaction and communication TheMarker[18]. Further information on these social networks among members is done in Hebrew. Due to the geographic is presented in Section II. In addition, in Section III-A, we and demographic nature of this network, it is smaller in introduce a topological feature, friends-measure. This simple scale compared to the other networks. TheMarker friendship feature, a variation of the well-known Katz measure, estimates connection is reciprocal, hence its underlying social structure how well the friends of two users know each other. In Section may be represented as an undirected graph. IV we show that this measure is more valuable for link In summary, Facebook and TheMarker are undirected net- prediction than the well-known common-friends feature and works while Flickr, YouTube and Academia are directed Jaccard’s-coefficient [19]. networks. The rest of the paper is organized as follows: In Section II Details of the datasets are summarized in Table I. we give a brief overview of the online social networks whose TABLE I structures were used in this study. Section III describes the SOCIAL NETWORKS DATASETS experimental framework and methods used to develop and evaluate a link predictor. The topological features used in Network Is Directed Node Number Edge Number Date this study are formally defined in Section III-A. Section IV Academia Yes 200,169 1,398,063 2011 Facebook No 63,731 1,545,686 2009 presents our numeric results including the area under the ROC Flickr Yes 1,133,54 7,237,983 2010 curve (AUC) and classification accuracy, the contribution of TheMarker No 65,953 1,572,684 2011 each feature set, and the information gain value for the differ- YouTube Yes 1,138,499 4,945,382 2007 ent features. Finally, in Section V we present our conclusions. III. METHODS AND EXPERIMENTS II. ONLINE SOCIAL NETWORKS DATASETS Since our goal was to identify and predict a set of hidden In this paper we evaluate link prediction classifiers in links within a social network structure, we chose to use relation to five social network datasets: Facebook, Flickr, machine learning methods and develop a link classifier which YouTube, Academia.edu, and TheMarker. Facebook is a social can predict the likelihood of a link existence in the social networking service and website launched in February 2004. graph. We collected several social network datasets for this As of June 2011, Facebook had nearly 700 million active purpose.