Link Prediction Using Supervised Learning ∗
Total Page:16
File Type:pdf, Size:1020Kb
Link Prediction using Supervised Learning ∗ Mohammad Al Hasan Vineet Chaoji Saeed Salem Mohammed Zaki† Abstract two specific nodes. Several variations of the above Social network analysis has attracted much attention in re- problems make interesting research topics. For instance, cent years. Link prediction is a key research directions some of the interesting questions that can be posed are within this area. In this research, we study link prediction – how does the association pattern change over time, as a supervised learning task. Along the way, we identify what are the factors that drive the associations, how a set of features that are key to the superior performance is the association between two nodes affected by other under the supervised learning setup. The identified features nodes. The specific problem instance that we address are very easy to compute, and at the same time surpris- in this research is to predict the likelihood of a future ingly effective in solving the link prediction problem. We association between two nodes, knowing that there is also explain the effectiveness of the features from their class no association between the nodes in the current state density distribution. Then we compare different classes of of the graph. This problem is commonly known as the supervised learning algorithms in terms of their prediction Link Prediction problem. performance using various performance metrics, such as ac- We use the co-authorship graph from scientific pub- curacy, precision-recall, F-values, squared error etc. with a lication data for our experiments. We prepare datasets 5-fold cross validation. Our results on two practical social from the co-authorship graphs, where each data point network datasets shows that most of the well-known classifi- corresponds to a pair of authors, who never co-authored cation algorithms (decision tree, k-nn,multilayer perceptron, in train years. Depending on the fact, whether they co- SVM, rbf network) can predict link with surpassing perfor- authored in the test year or not, the data point has mances, but SVM defeats all of them with narrow margin in either a positive label or a negative label. We apply dif- all different performance measures. Again, ranking of fea- ferent types of supervised learning algorithms to build tures with popular feature ranking algorithms shows that a binary classifier models that distinguish the set of au- small subset of features always plays a significant role in the thors who will coauthor in the test year from the rest link prediction job. who will not coauthor. Predicting prospective links in co-authorship graph 1 Introduction and Background is an important research direction, since it is identical, both conceptually and structurally, with many practical Social network is a popular way to model the interaction social network problems. The primary reason is that a among the people in a group or community. It can co-authorship network is a true example of social net- be visualized as a graph, where a vertex corresponds work, where the scientists in the community collaborate to a person in that group and an edge represents to achieve a mutual goal. Researchers [15] have shown some form of association between the corresponding that this graph also obeys the power-law distribution, persons. The associations are usually driven by mutual an important property of a typical social network. To interests that are intrinsic in that group. However, name some practical problems that very closely match social networks are very dynamic objects, since new with the one we study in this research, we consider the edges and vertices are added to the graph over the time. task of analyzing and monitoring terrorist networks. Understanding the dynamics that drives the evolution The objective while analyzing terrorist networks is to of social network is a complex problem due to a large conjecture that particular individuals are working to- number of variable parameters. But, a comparatively gether even though their interactions cannot be identi- easier problem is to understand the association between fied from the current information base. Intuitively, we are predicting hidden links in a social network formed ∗This material is based upon work funded in whole or in part by the US Government and any opinions, findings, conclusions, by the group of terrorists. In general, link prediction or recommendations expressed in this material are those of the provides a measure of social proximity between two ver- author(s) and do not necessarily reflect the views of the US tices in a social group, which, if known, can be used to Government. optimize an objective functions over the entire group, † Rensselaer Polytechnic Institute, Troy, New York 12180, especially in the domain of collaborative filtering [18], {alhasan, chaojv, salems, zaki}@cs.rpi.edu Knowledge Management Systems [10], etc. It can also Another recent work by Faloutsos et. al. [3], al- help in modeling the way a disease, a rumor, a fashion though not directly performs link prediction, is worth or a joke, or an Internet virus propagates via a social mentioning in this context. They introduced an object, network [11]. called connection subgraph, which is defined as a small Our research has the following contributions: subgraph that best captures the relationship between two nodes in a social network. They also proposed ef- 1. We explained the procedural aspect of constructing ficient algorithm based on electrical circuit laws to find a machine learning dataset to perform link predic- the connection subgraph from large social network ef- tion. ficiently. Connection subgraph can be used to effec- 2. We identified a short list of features for link pre- tively compute several topological feature values for su- diction in a particular domain, specifically, in the pervised link prediction problem, especially when the co-authorship domain. These features are powerful network is very large. enough to provide remarkable accuracy and general There are many other interesting recent efforts enough to be applicable in other social network do- [7, 4, 14] related to social network, but none of these mains. They are also very inexpensive to obtain. was targeted explicitly to solve the link prediction problem. Nevertheless, experiences and ideas from 3. We experimented with a set of learning algorithms these papers were helpful in many aspects of this work. to evaluate their performance in link prediction Goldenberg et.al. used Bayesian Networks to analyze problem and made a comparative analysis among the social network graphs. Baumes et. al. used graph those. clustering approach to identify sub-community in a 4. We evaluate each feature; first visually, by compar- social network. Cao et. al. used the concept of relation ing their class density distribution and then algo- network, to project a social network graph into several rithmically through some well known feature rank- relation graph and mined those graph to effectively ing algorithms. answers user’s query. In their model, they extensively used optimization algorithm to find the most optimal 2 Related Work combination of existing relations that best matches the user’s query. Although, most of the early research in social network has been done by social scientists and psychologist [17], 3 Data and Experimental Setup numerous efforts has been made by Computer Scientists recently. Most of the work has concentrated on analyz- Consider a social network G = hV, Ei in which each edge ing the social network graphs [1, 2]. Few efforts has e = hu, vi ∈ E represents an interaction between u and made to solve the link prediction problem, specially for v at a particular time t. In our experimental domain the social network domain. The closest matches with our interaction is defined as coauthoring a research article. work is the work by D. Liben, et.all [6], where the au- Each article bears, at least, its author information and thors extracted several features from the network topol- publication year. To predict a link, we partition the ogy of a co-authorship network. Their experiments eval- range of publication year into two non-overlapping sub- uated the effectiveness of these features for the link pre- ranges. The first sub-range is selected as train years and diction problem. The effectiveness was judged by the the later one as the test years. Then, we prepare the factor by which the prediction accuracy was improved classification dataset, by choosing those author pairs, over a random predictor. This work provides an ex- that appeared in the train years, but did not publish any cellent start-up for link prediction as the features they papers together in those years. Each such pair either extracted can be used in a supervised learning frame- represent a positive example or a negative example, work to perform link prediction in a more systematic depending on whether those author pairs published at manner. But, they used features based on network least one paper in the test years or not. Coauthoring a topology only. We, on the other hand, added several paper in test years by a pair of authors, establishes a link non-topological features and found that they improve between them, which was not there in the train years. the accuracy of link prediction substantially. In prac- Classification model of link prediction problem needs tice, such non-topological data are available (for exam- to predict this link by successfully distinguishing the ple, overlap of interest between two persons) and they positive classes from the dataset. Thus, link prediction should be exploited to achieve a substantial improve- problem can be posed an a binary classification problem, ment in the result. Moreover, we employed different that can be solved by employing effective features in a machine learning algorithms to perform the link predic- supervised learning framework.