Friendlink: Link Prediction in Social Networks via Bounded Local Traversal

Alexis Papadimitriou Panagiotis Symeonidis Yannis Manolopoulos Computer Science Department Computer Science Department Computer Science Department Aristotle University Aristotle University Aristotle University Thessaloniki, Greece Thessaloniki, Greece Thessaloniki, Greece Email: [email protected] Email: [email protected] Email: [email protected]

Abstract—Online social networks (OSNs) like , may know : (i) user U7 because you have two common Myspace, and Hi5 have become popular, because they allow friends (user U5 and user U6) (ii) user U9 because you have users to easily share content or expand their social circle. one common friend (user U )...”.Thelistofrecommended OSNs recommend new friends to registered users based on 8 local graph features (i.e. based on the number of common friends is ranked based on the number of common friends friends that two users share). However, OSNs do not exploit each candidate friend has with the target user. all different length paths of the network. Instead, they consider only pathways of maximum length 2 between a user and U U his candidate friends. On the other hand, there are global 9 8 approaches, which detect the overall path structure in a network, being computationally prohibitive for huge-size social U2 U5 networks. In this paper, we provide friend recommendations, also known as the link prediction problem, by traversing all U4 U1 U7 paths of a bounded length, based on the “algorithmic small world hypothesis”. As a result, we are able to provide more U U accurate and faster friend recommendations. We perform an 3 6 extensive experimental comparison of the proposed method against existing link prediction algorithms, using two real data sets (Hi5 and Epinions). Our experimental results show that Figure 1. Example. our FriendLink algorithm outperforms other approaches in terms of effectiveness and efficiency in both real data sets. Keywords-Social Networks; Link Prediction; A. Motivation Compared to approaches which are based on local features I. INTRODUCTION of a network (i.e. Friend of a Friend (FOAF) algorithm or Common Neighbors, Adamic/Adar index, Jaccard Coeffi- Online social networks (OSNs) such as Facebook.com1, cient, etc. - for more details see Section II), we provide Myspace2,Hi5.com3, etc. contain gigabytes of data that can friend recommendations, exploiting paths of greater length. be mined to make predictions about who is a friend of In contrast, they consider only pathways of maximum length whom. OSNs recommend other people to users based on 2 between a target user and his candidate friends. In Fig- their common friends. The reason is that there is a significant ure 1, which will be used as our running example, according possibility that two users are friends, if they share a large to existing OSNs, U would get as friend recommendation number of common friends. 1 with equal probability U or U . However, if we take into In this paper, we focus on recommendations based on 4 7 account also paths of length 3, then U should have a higher links that connect the nodes of an OSN, known as the 4 probability to be recommended as a friend to U . In our Link Prediction problem, where there are two main ap- 1 approach, we assume that a person can be connected to proaches [5] that handle it. The first one is based on local another with many paths of different length (through human features of a network, focusing mainly on the nodes struc- chains). Thus, two persons that are connected with many ture; the second one is based on global features, detecting the unique pathways of different length have a high probability overall path structure in a network. For instance, as an exam- to know each other, proportionally to the length of the ple of a local approach, as shown in Figure 1, Facebook.com pathways they are connected with. or Hi5.com use the following style of recommendation for Compared to global approaches (i.e Katz status index, recommending new friends to a target user U : “People you 1 RWR algorithm, SimRank algorithm etc.), which detect the 1http://www.facebook.com overall path structure in a network, our method is more 2http://www.myspace.com efficient. This means, that our method, which is based 3http://www.hi5.com on a bounded path traversal, requires less time and space

978-1-4577-1133-6/11/$26.00 c 2011 IEEE 66 complexity than the global based algorithms. The reason is to be close to 1. On the other hand, if the two nodes are that we traverse only paths of length l in a network based dissimilar, we expect the value sim(vi,vj ) to be close to 0. on the “algorithmic small world hypothesis”, whereas global Our method assumes that persons in an OSN can use approaches detect the overall path structure. (for more details all the pathways connecting them, proportionally to the see Section Related Work). pathway lengths. Thus, two persons who are connected The rest of this paper is organized as follows. Section II with many unique pathways have a high possibility to summarizes the related work. Section III defines a new node know each, proportionally to the length of the pathways similarity measure in OSNs. The proposed approach, its they are connected with. For example, referring back to complexity analysis, and its possible extensions to other Figure 1, if we consider only length-2 paths, then U1 would networks, are described in Section IV. Experimental results get as friend recommendation with equal probability U4 or are given in Section V. Finally, Section VI concludes this U7. However, if we consider also length-3 paths, then U4 paper. should have a higher probability to be recommended as a friend to U1. II. RELATED WORK There is a variety of local similarity measures [5] (i.e. Definition 1. The similarity sim(vx,vy) between two graph FOAF algorithm, Adamic/Adar index, Jaccard Coefficient, nodes vx and vy is defined as the counts of paths of varying etc.) for analyzing the “proximity” of nodes in a network. length  from vx to vy: FOAF [2] is adopted by many popular OSNs, such as     facebook.com and hi5.com for the friend recommendation  pathsi  1 vx,vy task. FOAF is based on the common sense that two nodes sim(vx,vy)= · (2) v ,v i − 1 i x y are more likely to form a link in the future, if i=2 (n − j) they have many common neighbors. In addition to FOAF j=2 algorithm, there are also other local-based measures such as Jaccard Coefficient [5] and Adamic/Adar index [1]. Adamic where and Adar proposed a distance measure to decide when two • n is the number of vertices in a graph G,  v personal home pages are strongly “related”. In particular, • is the length of a path between the graph nodes x and v they computed features of the pages and defined the similar- y (excluding paths with cycles). By the term “paths x, y 1 with cycles” we mean that a path can not be closed ity between two pages as follows: z log(frequency(z)) , where z is a feature shared by pages x, y. This refines (cyclic). Thus, a node can exist only one time in a path v → v → v →v → v the simple counting of common features by weighting rarer (e.g. path 1 2 3 1 5 is not acceptable v features more heavily. because 1 is traversed twice), • 1 There is a variety of global approaches [5]. In this i−1 is an “attenuation” factor that weights paths accord- paper, as comparison partners of global approaches, we ing to their length . Thus, a 2-step path measures the 1 consider Katz status index [4], and Random Walk with non-attenuation of a link with value equals to 1 ( 2−1 Restart algorithm [9], [7] (RWR) algorithm. Katz defines a =1).A3-step path measures the attenuation of a link 1 1 1 measure that directly sums over all paths between any pair of with value equals to 2 ( 3−1 = 2 )etc.Inthissense,we nodes in graph G, exponentially damped by length to count use appropriate weights to allow the lower effectiveness short paths more heavily. RWR considers a random walker of longer path chains. Notice that we have also tested experimentally other possible attenuation factors such that starts from node vx, and chooses randomly among the β 1 available edges every time, except that, before he makes a as Katz’s original exponential , the logarithmic log(i) , choice, with probability α, he goes back to node vx (restart). etc. and, as will be shown later, the attenuation factor 1 The similarity matrix (i.e. Kernel) between nodes of a graph, − attains the best accuracy results. i 1  can be computed by Equation 1: • paths   v vx,vy is the set of all length- paths from x to vy, −1 KernelRW R =(I − αP ) (1) i • (n − j) is the set of all possible length- paths from where I is the identity matrix and P is the transition- j=2 probability matrix. vx to vy, if each vertex in graph G was linked with   pathsv ,v  III. DEFINING A NODE SIMILARITY MEASURE all other vertices. By using the fraction x y , i In this section, we define a new similarity measure to (n − j) determine a way of expressing the proximity among graph j=2 nodes. Let vi and vj be two graph nodes and sim(vi,vj ) a our similarity measure is normalized and takes values function that expresses their similarity in the range [0,1]. If in [0,1]. If two nodes are similar we expect the value the two nodes are similar, we expect the value sim(vi,vj ) sim(vx,vy) to be close to 1. On the other hand, if

2011 International Conference on Computational Aspects of Social Networks (CASoN) 67 the two nodes are dissimilar, we expect the value Algorithm FriendLink (G, A, n, ) sim(v ,v ) Input i j to be close to 0. G: an undirected and unweighted graph Notice that the similarity is computed for nodes that are A: adjacency matrix of graph G,  ≥ 2 n: number of nodes of graph G, connected with paths of length . This is because when : maximum length of paths explored in G, there is a path between two nodes of length 1 they are m: the length of a path Output already friends. sim(i, j): similarity between node i and node j in G

IV. THE PROPOSED APPROACH 1. Main Program 2. for i =1ton In this section we first provide a detailed explanation 3. for j =1ton 4. if A(i, j) =1then of our approach, named FriendLink. Then, we perform a 5. A(i, j) = j complexity analysis comparison of FriendLink with other 6. else 7. A(i, j)=0 approaches and finally we extend Friendlink for other types 8. end if of networks. 9. end for j 10. end for i 11. for m =2to A. The FriendLink Algorithm 12. Combine Paths() 13. Compute Similarity(m) Friendlink computes node similarity between any two 14. end for m nodes in a graph G. The initial input of Friendlink is the 15. End Main Program n G A number of nodes of , the adjacency matrix ,andthe 16. Function Combine Paths() length  of paths that will be explored in G. To enumerate all 17. for i =1ton G 18. for j =1ton simple paths in , Rubin’s algorithm [8] can be employed. 19. for k =1ton However, Rubin’s algorithm uses O(n3) matrix operations to 20. if A(i, k) <> 0 and A(k, j) <> 0 then 21. A(i, j) = concatenate(A(i, k),A(k, j)) find all paths of different length between any pair of nodes, 22. end if where n is the number of nodes in G. In the following, we 23. end for k 24. end for j customize Rubin’s algorithm to create only paths of length 25. end for i up to  for our purpose. 26. return A(i, j) As shown in Figure 2, FriendLink consists of a main 27. End Function program and two functions. In the main program, we modify 28. Function Compute Similarity() 29. for i =1ton the adjacency matrix so instead of holding 0/1 values, the 30. for j =1ton (i, j) entry of the matrix A is a list of paths from i to j. Then, 31. denominator =1 32. for k =2tom in the function Combine Paths(), we perform the matrix 33. denominator = denominator *(n - k) multiplication algorithm. However, instead of multiplying 34. end for k    m  pathsi,j  and adding entries, we concatenate pairs of paths together. 35. sim(i, j) = sim(i, j) + 1 · m−1 denominator Finally, in the function Compute Similarity(), we update the 36. end for j i j  37. end for i similarity between nodes and , for each length- path we 38. return sim(i, j) find, where i is the start node and j is the destination node 39. End Function (i.e all paths of length [2..]). For the update of the similarity value between nodes i and j we use Equation 2. Notice that, we do not take into account cyclic paths in our similarity measure. Figure 2. The FriendLink algorithm. For our running example, in Figure 3, we present the U1 U2 U3 U4 U5 U6 U7 U8 U9 node similarity matrix of graph G calculated by FriendLink U1 0 0 0 0.2975 0 0 0.2856 0 0.167 U2 0 0 0.286 0 0146 0.146 0.024 0.156 0.156 algorithm. New friends can be recommended according to U3 0 0.286 0 0 0.146 0.146 0.024 0.156 0.156 U4 0.298 0 0 0 0.025 0.025 0 0.167 0 their total weight, which is computed by aggregating all U5 0 0.146 0.146 0.025 0 0.286 0 0.144 0.015 U6 0 0.146 0.146 0.025 0.286 0 0 0.144 0.015 paths connecting them with the target user, and by weighting U7 0.286 0.024 0.024 0 0 0 0 0.024 0 them proportionally to the length of each path. As shown in U8 0 0.156 0.156 0.167 0.144 0.144 0.024 0 0 U9 0.167 0.156 0.156 0 0.015 0.015 0 0 0 Figure 3, user U1 would receive user U4 as friend recommen- dation. The resulting recommendation is reasonable, because Figure 3. Node Similarity Matrix. It presents the possibility of two users being friends. U1 is connected with more paths to user U4 than those that connect U1 and U7. That is, the FriendLink approach is able to capture the associations among the graph data objects. graphs, because they require the inversion of a matrix. B. Complexity Analysis For instance, the time complexity of Katz index is mainly RWR [7] and Katz index [4] as representatives of the determined by the matrix inversion operator, which is O(n3). global approaches, are computationally prohibitive for large RWR algorithm also requires a matrix inversion, which

68 2011 International Conference on Computational Aspects of Social Networks (CASoN) can be, however, pre-computed [9] and stored for faster data set with the same users by only preserving 16512 new on-line operations. In the same direction, there is also a emerged edges connecting them. The graph data from the faster version [3] of Katz status index that reduces the first crawl are used to predict the new links emerging in the computational complexity from time O(n3) to O(n + m), second crawl. where m is the number of edges. We also use in our experiments the Epinions5 data set, FOAF as representative of the local-based methods, con- which is a who-trusts-whom social network. In particular, siders very small paths (only paths of length 2) between any users of Epinions.com express their Web of Trust, i.e. pair of nodes in G. In particular, for each vx node, FOAF reviewers whose reviews and ratings they have found to be traverses all its neighbors and then traverses the neighbors of valuable. The Epinions data set is a directed network and, each of vx’s neighbor. Since the time complexity to traverse thus, we treat it by simply disregarding the directions of the neighborhood of a node is simply h (h is the average links [10]. It contains 49K users and 487K edges among nodes degree in a network) and our graph G is sparse, it pairs of users. holds that h<

4http://delab.csd.auth.gr/∼symeon 5http://www.trustlet.org/wiki/

2011 International Conference on Computational Aspects of Social Networks (CASoN) 69 performance of each algorithm. Epinions data sets does not In Section IV-A, one of the required input values for the have time stamps of the edges. The performance of the FriendLink algorithm is the length  of paths considered algorithms is evaluated by applying double cross-validation in a graph. To improve our recommendations, it is impor- (internal and external). Each data set was divided into 10 tant to fine-tune the  variable. Based on Milgram’s [6] subsets. Each subset (EP ) was in turn used for performance “small-world hypothesis”,  should take integer values in estimation in the external cross-validation. The 9 remaining the interval [2,6]. Figures 5c and 5d illustrate precision subsets (ET ) were used for the internal cross-validation. In for varying  values for the Epinions 49K, and Hi5 63K particular, we performed an internal 9-fold cross-validation data sets, respectively. As expected, precision decreases as to determine the best values of the algorithms’ needed the number of recommended friends is increased. The best parameters. In particular, for RWR we set parameter α= precision is attained by  =3in both data sets. Notice that 0.001, whereas for Katz we set parameter β=0.005. We we omit to show results for  =6because precision follows chose as values for the parameters those providing the best a degraded performance for  =4and  =5, respectively. performance on the internal 9-fold cross-validation. Then, In the following, we keep the  =3as the default maximum their performance is averaged on the external 10-fold cross- length of paths of the FriendLink algorithm. validation. The presented results, based on two-tailed t-test, are statistically significant at the 0.05 level. D. Accuracy Comparison of FriendLink with other methods We use the classic precision/recall metric as performance We proceed with the comparison of FriendLink with measure for friend recommendations. For a test user receiv- RWR, Katz, and FOAF algorithms, in terms of precision ing a list of k recommended friends (top-k list), precision and recall. This reveals the robustness of each algorithm in and recall are defined as follows: Precision is the ratio of attaining high recall with minimal losses in terms of preci- the number of relevant users in the top-k list (i.e., those in sion. We examine the ranked list, which is recommended to the top-k list that belong in the probe set EP of friends of a target user, starting from the top friend. In this situation, the target user) to k. Recall is the ratio of the number of the recall and precision vary as we increase the number relevant users in the top-k list to the total number of relevant of recommended friends. For the Epinions 49K data set, users (all friends in the probe set EP of the target user). as shown in Figure 5e, our FriendLink algorithm again attains the best precision value of 55% when we recommend C. Sensitivity Analysis for the FriendLink Algorithm a top-1 list of friends. The precision of FriendLink is impressive in this specific data set. The main reason is the In this Section, we study the sensitivity of FriendLink topological characteristics of Epinions 49K data set (i.e. accuracy performance in real networks (i) with different high LCC and small ASD). Thus, Epinions 49K can be possible attenuation factors, (ii) with different controllable considered as a small-world network. This experiment shows sparsity, (iii) with different  values for path traversal. that FriendLink is more robust in finding relevant users In section III, we presented the definition of our similarity for the test user. The reason is that FriendLink exploits measure (see Equation 2). The attenuation factor that was proportionally the -length pathways between two nodes mentioned, weights paths according to their length .Inthis in the graph. In contrast, RWR and Katz traverse globally section, we test other possible attenuation factors in order the social network, missing to capture adequately the local to discover the best precision value that we can attain. In characteristics of the graph. Moreover, FOAF fails to provide particular, we have tested the following possible attenuation accurate recommendations because it exploits only length-2 1 1 1 1 factors: (i) m−1 (ii) 2·m (iii) m2 (iv) log(m) and (v) the paths. Katz’s index attenuation factor βm,wherem is the path For the Hi5 63K data set, as shown in Figure 5f, our length. The attenuation factors performance can be seen FriendLink algorithm attains the best results. However, the in Figure 5a, for both real data set. As shown, the best overall performance of FriendLink, RWR and Katz algo- 1 performance in both data sets is attained by m−1 .Inthe rithms is significantly decreased compared with the results 1 following, we keep the m−1 as the default attenuation factor in the Epinions data set. The first reason is the high sparsity of the FriendLink algorithm. (i.e. ADEG equal to 2.78) of the Hi5 63K data set. The Next, we measure the accuracy that FriendLink attains, second reason is the fact that Hi5 cannot be considered as with different controllable sparsity. To examine the accuracy a small-world network. performance of FriendLink in terms of different network sparsity, we have created 5 different sparsity cases, by chang- E. Time Comparison of FriendLink with other Methods ing the fraction of observed edges, as shown in Figure 5b. In this Section, we compare FriendLink against RWR, As expected, as the fraction of edges observed increases, Katz and FOAF algorithms in terms of efficiency using the precision increases too. This is reasonable, since every the 2 real data sets. We measured the clock time for the prediction algorithm is expected to give higher accuracy for off-line parts of all algorithms. The off-line part refers to a denser network. the building of the similarity matrix between any pair of

70 2011 International Conference on Computational Aspects of Social Networks (CASoN) 60 100 Epinions Le ngt h 2 Le ngt h 3 Le ngt h 4 Le ngt h 5 50 60 Hi5 80 Epinions Hi5 40 50 60 40 30 30

% Precision 20 40 Precision 20 % Precision %

10 10 20 0 0 1/(m-1) 1/(2m) 1/m^2 1/log(m) b^m 0 1 2 3 4 0 0.2 0.4 0.6 0.8 1 Attenuation Factor # Recommended Friends Fraction of Edges observed (a) (b) (c) FriendLink RWR Katz FOAF FriendLink RWR Katz FOAF 40 60 35 50 30 40 25 20 30 15 % Precision % Precision 20 10 10 5 0 0 02468101214 0 5 10 15 20 25 30 % Recall % Recall (d) (e) (f) Figure 5. (a) Precision vs. Attenuation factors diagram, (b) Precision vs. fraction of edges observed diagram. (c) Precision vs. number of recommended friends for Epinions 49K data set, (d) Precision vs. number of recommended friends for Hi5 63K data sets. (e) Precision vs. Recall diagram for Epinions 49K data set, (f) Precision vs. Recall diagram for Hi5 63K data set. nodes in a graph. The results are presented in Table I. [2] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy. Make As shown, FriendLink outperforms RWR and Katz, since new friends, but keep the old: recommending people on they calculate the inverse of an n × n matrix. As expected, social networking sites. In CHI ’09: Proceedings of the 27th international conference on Human factors in computing FOAF algorithm, outperforms the other algorithms due to its systems, pages 201–210, 2009. simpler complexity. Notice that the results depict the time needed to compute the whole similarity matrix. On the other [3] K. C. Foster, S. Q. Muth, J. J. Potterat, and R. B. Rothenberg. hand, if we were to calculate the similarity matrix of only A faster katz status score algorithm. Comput. Math. Organ. one user, then the computation would require only part of a Theory, 7:275–285, December 2001. second to produce a recommendation. [4] L. Katz. A new status index derived from sociometric Data Sets FriendLink RWR Katz FOAF analysis. Psychometrika, 18(1):39–43, 1953. Epinions 49K 245 sec 380 sec 460 sec 55 sec Hi5 63K 340 sec 520 sec 617 sec 221 sec [5] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Proceedings of the 12th In- Table I ternational Conference on Information and Knowledge Man- TIME COMPARISON OF TESTED ALGORITHMS FOR BOTH DATA SETS. agement (CIKM), 2003.

VI. CONCLUSIONS [6] S. Milgram. The small world problem. Psychology Today, In this paper, we introduced a framework to provide friend 22:61–67, 1967. recommendations in OSNs. We define a new node similarity measure that exploits local and global characteristics of a [7] J. Pan, H. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In KDD ’04: network. We performed extensive experimental comparison Proceedings of the 10th ACM SIGKDD international confer- of the proposed method against existing link prediction ence on Knowledge discovery and data mining, pages 653– algorithms showing that our FriendLink algorithm provides 658, 2004. more efficient and more accurate friend recommendations. In the future, we want to examine other ways of improving [8] F. Rubin. Enumerating all simple paths in a graph. IEEE Transactions on Circuits and Systems, 25(8):641–642, 1978. friend recommendations based on other features that OSNs offer, such as photo and video tagging, groups and common [9] H. Tong, C. Faloutsos, and J. Pan. Fast random walk with applications. The combination of such features can provide restart and its applications. In ICDM ’06: Proceedings of the information on different ways that users are connected and 6th International Conference on Data Mining, pages 613– therefore yield to more accurate friend recommendations. 622. IEEE Computer Society, 2006. REFERENCES [10] S. Wasserman and K. Faust. Social network analysis: Methods [1] L. Adamic and E. Adar. How to search a social network. and applications. 1994. Social Networks, 27(3):187–203, 2005.

2011 International Conference on Computational Aspects of Social Networks (CASoN) 71