Link Prediction in Social Networks Via Bounded Local Path Traversal
Total Page:16
File Type:pdf, Size:1020Kb
Friendlink: Link Prediction in Social Networks via Bounded Local Path Traversal Alexis Papadimitriou Panagiotis Symeonidis Yannis Manolopoulos Computer Science Department Computer Science Department Computer Science Department Aristotle University Aristotle University Aristotle University Thessaloniki, Greece Thessaloniki, Greece Thessaloniki, Greece Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Online social networks (OSNs) like Facebook, may know : (i) user U7 because you have two common Myspace, and Hi5 have become popular, because they allow friends (user U5 and user U6) (ii) user U9 because you have users to easily share content or expand their social circle. one common friend (user U )...”.Thelistofrecommended OSNs recommend new friends to registered users based on 8 local graph features (i.e. based on the number of common friends is ranked based on the number of common friends friends that two users share). However, OSNs do not exploit each candidate friend has with the target user. all different length paths of the network. Instead, they consider only pathways of maximum length 2 between a user and U U his candidate friends. On the other hand, there are global 9 8 approaches, which detect the overall path structure in a network, being computationally prohibitive for huge-size social U2 U5 networks. In this paper, we provide friend recommendations, also known as the link prediction problem, by traversing all U4 U1 U7 paths of a bounded length, based on the “algorithmic small world hypothesis”. As a result, we are able to provide more U U accurate and faster friend recommendations. We perform an 3 6 extensive experimental comparison of the proposed method against existing link prediction algorithms, using two real data sets (Hi5 and Epinions). Our experimental results show that Figure 1. Social Network Example. our FriendLink algorithm outperforms other approaches in terms of effectiveness and efficiency in both real data sets. Keywords-Social Networks; Link Prediction; A. Motivation Compared to approaches which are based on local features I. INTRODUCTION of a network (i.e. Friend of a Friend (FOAF) algorithm or Common Neighbors, Adamic/Adar index, Jaccard Coeffi- Online social networks (OSNs) such as Facebook.com1, cient, etc. - for more details see Section II), we provide Myspace2,Hi5.com3, etc. contain gigabytes of data that can friend recommendations, exploiting paths of greater length. be mined to make predictions about who is a friend of In contrast, they consider only pathways of maximum length whom. OSNs recommend other people to users based on 2 between a target user and his candidate friends. In Fig- their common friends. The reason is that there is a significant ure 1, which will be used as our running example, according possibility that two users are friends, if they share a large to existing OSNs, U would get as friend recommendation number of common friends. 1 with equal probability U or U . However, if we take into In this paper, we focus on recommendations based on 4 7 account also paths of length 3, then U should have a higher links that connect the nodes of an OSN, known as the 4 probability to be recommended as a friend to U . In our Link Prediction problem, where there are two main ap- 1 approach, we assume that a person can be connected to proaches [5] that handle it. The first one is based on local another with many paths of different length (through human features of a network, focusing mainly on the nodes struc- chains). Thus, two persons that are connected with many ture; the second one is based on global features, detecting the unique pathways of different length have a high probability overall path structure in a network. For instance, as an exam- to know each other, proportionally to the length of the ple of a local approach, as shown in Figure 1, Facebook.com pathways they are connected with. or Hi5.com use the following style of recommendation for Compared to global approaches (i.e Katz status index, recommending new friends to a target user U : “People you 1 RWR algorithm, SimRank algorithm etc.), which detect the 1http://www.facebook.com overall path structure in a network, our method is more 2http://www.myspace.com efficient. This means, that our method, which is based 3http://www.hi5.com on a bounded path traversal, requires less time and space 978-1-4577-1133-6/11/$26.00 c 2011 IEEE 66 complexity than the global based algorithms. The reason is to be close to 1. On the other hand, if the two nodes are that we traverse only paths of length l in a network based dissimilar, we expect the value sim(vi,vj ) to be close to 0. on the “algorithmic small world hypothesis”, whereas global Our method assumes that persons in an OSN can use approaches detect the overall path structure. (for more details all the pathways connecting them, proportionally to the see Section Related Work). pathway lengths. Thus, two persons who are connected The rest of this paper is organized as follows. Section II with many unique pathways have a high possibility to summarizes the related work. Section III defines a new node know each, proportionally to the length of the pathways similarity measure in OSNs. The proposed approach, its they are connected with. For example, referring back to complexity analysis, and its possible extensions to other Figure 1, if we consider only length-2 paths, then U1 would networks, are described in Section IV. Experimental results get as friend recommendation with equal probability U4 or are given in Section V. Finally, Section VI concludes this U7. However, if we consider also length-3 paths, then U4 paper. should have a higher probability to be recommended as a friend to U1. II. RELATED WORK There is a variety of local similarity measures [5] (i.e. Definition 1. The similarity sim(vx,vy) between two graph FOAF algorithm, Adamic/Adar index, Jaccard Coefficient, nodes vx and vy is defined as the counts of paths of varying etc.) for analyzing the “proximity” of nodes in a network. length from vx to vy: FOAF [2] is adopted by many popular OSNs, such as facebook.com and hi5.com for the friend recommendation pathsi 1 vx,vy task. FOAF is based on the common sense that two nodes sim(vx,vy)= · (2) v ,v i − 1 i x y are more likely to form a link in the future, if i=2 (n − j) they have many common neighbors. In addition to FOAF j=2 algorithm, there are also other local-based measures such as Jaccard Coefficient [5] and Adamic/Adar index [1]. Adamic where and Adar proposed a distance measure to decide when two • n is the number of vertices in a graph G, v personal home pages are strongly “related”. In particular, • is the length of a path between the graph nodes x and v they computed features of the pages and defined the similar- y (excluding paths with cycles). By the term “paths x, y 1 with cycles” we mean that a path can not be closed ity between two pages as follows: z log(frequency(z)) , where z is a feature shared by pages x, y. This refines (cyclic). Thus, a node can exist only one time in a path v → v → v →v → v the simple counting of common features by weighting rarer (e.g. path 1 2 3 1 5 is not acceptable v features more heavily. because 1 is traversed twice), • 1 There is a variety of global approaches [5]. In this i−1 is an “attenuation” factor that weights paths accord- paper, as comparison partners of global approaches, we ing to their length . Thus, a 2-step path measures the 1 consider Katz status index [4], and Random Walk with non-attenuation of a link with value equals to 1 ( 2−1 Restart algorithm [9], [7] (RWR) algorithm. Katz defines a =1).A3-step path measures the attenuation of a link 1 1 1 measure that directly sums over all paths between any pair of with value equals to 2 ( 3−1 = 2 )etc.Inthissense,we nodes in graph G, exponentially damped by length to count use appropriate weights to allow the lower effectiveness short paths more heavily. RWR considers a random walker of longer path chains. Notice that we have also tested experimentally other possible attenuation factors such that starts from node vx, and chooses randomly among the β 1 available edges every time, except that, before he makes a as Katz’s original exponential , the logarithmic log(i) , choice, with probability α, he goes back to node vx (restart). etc. and, as will be shown later, the attenuation factor 1 The similarity matrix (i.e. Kernel) between nodes of a graph, − attains the best accuracy results. i 1 can be computed by Equation 1: • paths v vx,vy is the set of all length- paths from x to vy, −1 KernelRW R =(I − αP ) (1) i • (n − j) is the set of all possible length- paths from where I is the identity matrix and P is the transition- j=2 probability matrix. vx to vy, if each vertex in graph G was linked with pathsv ,v III. DEFINING A NODE SIMILARITY MEASURE all other vertices. By using the fraction x y , i In this section, we define a new similarity measure to (n − j) determine a way of expressing the proximity among graph j=2 nodes. Let vi and vj be two graph nodes and sim(vi,vj ) a our similarity measure is normalized and takes values function that expresses their similarity in the range [0,1].