Link Prediction Using Supervised Learning ∗
Total Page:16
File Type:pdf, Size:1020Kb
Link Prediction using Supervised Learning ∗ Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki Rensselaer Polytechnic Institute, Troy, New York 12180 falhasan, chaojv, salems, [email protected] Abstract parameters. But, a comparatively easier problem is to Social network analysis has attracted much attention in re- understand the association between two specific nodes. cent years. Link prediction is a key research direction within Several variations of the above problems make interest- this area. In this paper, we study link prediction as a su- ing research topics. For instance, some of the interesting pervised learning task. Along the way, we identify a set of questions that can be posed are { how does the associa- features that are key to the performance under the super- tion pattern change over time, what are the factors that vised learning setup. The identified features are very easy to drive the associations, how is the association between compute, and at the same time surprisingly effective in solv- two nodes affected by other nodes. The specific prob- ing the link prediction problem. We also explain the effec- lem instance that we address in this research is to pre- tiveness of the features from their class density distribution. dict the likelihood of a future association between two Then we compare different classes of supervised learning al- nodes, knowing that there is no association between the gorithms in terms of their prediction performance using var- nodes in the current state of the graph. This problem ious performance metrics, such as accuracy, precision-recall, is commonly known as the Link Prediction problem. F-values, squared error etc. with a 5-fold cross validation. We use the coauthorship graph from scientific pub- Our results on two practical social network datasets shows lication data for our experiments. We prepare datasets that most of the well-known classification algorithms (deci- from the coauthorship graphs, where each data point sion tree, k-NN, multilayer perceptron, SVM, RBF network) corresponds to a pair of authors, who never coauthored can predict links with comparable performances, but SVM in training years. Depending on the fact whether they outperforms all of them with narrow margin in all perfor- coauthored in the testing year or not, the data point mance measures. Again, ranking of features with popular has either a positive label or a negative label. We ap- feature ranking algorithms shows that a small subset of fea- ply different types of supervised learning algorithms to tures always plays a significant role in link prediction. build binary classifier models that distinguish the set of authors who will coauthor in the testing year from the 1 Introduction and Background rest who will not coauthor. Predicting prospective links in coauthorship graph Social networks are a popular way to model the interac- is an important research direction, since it is identical, tion among the people in a group or community. They both conceptually and structurally to many practical can be visualized as graphs, where a vertex corresponds social network problems. The primary reason is that to a person in some group and an edge represents some a coauthorship network is a true example of social net- form of association between the corresponding persons. work, where the scientists in the community collaborate The associations are usually driven by mutual interests to achieve a mutual goal. Researchers [20] have shown that are intrinsic to a group. However, social networks that this graph also obeys the power-law distribution, are very dynamic objects, since new edges and vertices an important property of a typical social network. To are added to the graph over the time. Understanding name some practical problems that very closely match the dynamics that drives the evolution of social network with the one we study in this research, we consider is a complex problem due to a large number of variable the task of analyzing and monitoring terrorist networks. The objective in analyzing terrorist networks is to con- ∗This material is based upon work funded in whole or in part jecture that particular individuals are working together by the US Government and any opinions, findings, conclusions, even though their interactions cannot be identified from or recommendations expressed in this material are those of the the current information base. Intuitively, we are pre- author(s) and do not necessarily reflect the views of the US dicting hidden links in a social network formed by the Government. This work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, group of terrorists. In general, link prediction provides and NSF grants EIA-0103708 and EMT-0432098. a measure of social proximity between two vertices in a social group, which, if known, can be used to optimize ment in the results. Moreover, we compare different an objective function over the entire group, especially machine learning algorithms for the link prediction task. in the domain of collaborative filtering [22], Knowledge Another recent work by Faloutsos et al. [10], al- Management Systems [8], etc. It can also help in mod- though does not directly perform link prediction, is eling the way a disease, a rumor, a fashion or a joke, or worth mentioning in this context. They introduced an an Internet virus propagates via a social network [13]. object, called connection subgraph, which is defined as Our research has the following contributions: a small subgraph that best captures the relationship between two nodes in a social network. They also pro- 1. We explain the procedural aspect of constructing posed efficient algorithm based on electrical circuit laws a machine learning dataset to perform link predic- to find the connection subgraph from large social net- tion. work efficiently. Connection subgraph can be used to ef- fectively compute several topological feature values for 2. We identify a short list of features for link pre- supervised link prediction problem, especially when the diction in a particular domain, specifically, in the network is very large. coauthorship domain. These features are powerful There are many other interesting recent efforts [11, enough to provide remarkable accuracy and general 3, 5] related to social network, but none of these were enough to be applicable in other social network do- targeted explicitly to solve the link prediction problem. mains. They are also very inexpensive to obtain. Nevertheless, experiences and ideas from these papers 3. We experiment with a set of learning algorithms to were helpful in many aspects of this work. Goldenberg et al. [11] used Bayesian Networks to analyze the evaluate their performance in link prediction prob- social network graphs. Baumes et al. [3] used graph lem and perform a comparative analysis among them. clustering approach to identify sub-communities in a social network. Cai et al. [5] used the concept of relation 4. We evaluate each feature; first visually, by compar- network, to project a social network graph into several ing their class density distribution and then algo- relation graphs and mine those graphs to effectively rithmically through some well known feature rank- answer user's queries. In their model, they extensively ing algorithms. used optimization algorithms to find the most optimal combination of existing relations that best match the 2 Related Work user's query. Although most of the early research in social network 3 Data and Experimental Setup has been done by social scientists and psychologists [19], numerous efforts have been made by computer scien- Consider a social network G = hV; Ei in which each edge tists recently. Most of the work has concentrated on e = hu; vi 2 E represents an interaction between u and analyzing the social network graphs [2, 9]. Few efforts v at a particular time t. In our experimental domain have been made to solve the link prediction problem, the interaction is defined as coauthoring a research specially for social network domain. The closest match article. Each article bears, at least, author information with our work is that of D. Liben, et al. [17], where and publication year. To predict a link, we partition the authors extracted several features from the network the range of publication years into two non-overlapping topology of a coauthorship network. Their experiments sub-ranges. The first sub-range is selected as training evaluated the effectiveness of these features for the link years and the later one as the testing years. Then, prediction problem. The effectiveness was judged by we prepare the classification dataset, by choosing those the factor by which the prediction accuracy was im- author pairs, that appeared in the training years, but proved over a random predictor. This work provides an did not publish any papers together in those years. excellent starting point for link prediction as the fea- Each such pair either represents a positive example or a tures they extracted can be used in a supervised learning negative example, depending on whether those author framework to perform link prediction in a more system- pairs published at least one paper in the testing years atic manner. But, they used features based on network or not. Coauthoring a paper in testing years by a pair topology only. We, on the other hand, added several of authors, establishes a link between them, which was non-topological features and found that they improve not there in the training years. Classification model of the accuracy of link prediction substantially. In prac- link prediction problem needs to predict this link by tice, such non-topological data are available (for exam- successfully distinguishing the positive classes from the ple, overlap of interest between two persons) and they dataset. Thus, link prediction problem can be posed should be exploited to achieve a substantial improve- as a binary classification problem, that can be solved by employing effective features in a supervised learning arbitrary scientists x and y from the social network.