Local Probabilistic Models for Link Prediction∗

Seventh IEEE International Conference on Data Mining Local Probabilistic Models for Link Prediction∗ Chao Wang Venu Satuluri Srinivasan Parthasarathy Department of Computer Science and Engineering, The Ohio State University Abstract networks. This has been a hotbed of recent research activ- ity where much of the focus has been on encapsulating the One of the core tasks in social network analysis is to topological and/or semantic information embedded in such predict the formation of links (i.e. various types of relation- networks to address the link prediction problem. In contrast ships) over time. Previous research has generally repre- in this article we explore the realm of probabilistic models sented the social network in the form of a graph and has derived from frequency statistics and use the resulting pre- leveraged topological and semantic measures of similar- dictions from the probabilistic models as additional features ity between two nodes to evaluate the probability of link to further enhance predictions made by topological-based formation. Here we introduce a novel local probabilistic and semantic-based link prediction algorithms. graphical model method that can scale to large graphs to Specifically our probabilistic model is driven by two as- estimate the joint co-occurrence probability of two nodes. pects. First, given the candidate link (say between nodes Such a probability measure captures information that is not X and Y ) whose probability is to be estimated, we identify captured by either topological measures or measures of se- the central neighborhood set (say W; X; Y; Z), which are mantic similarity, which are the dominant measures used the nodes that are deemed germane to the estimation pro- for link prediction. We demonstrate the effectiveness of the cedure. The identification of the central neighborhood set co-occurrence probability feature by using it both in isola- is governed by the local topology of the social network as tion and in combination with other topological and seman- viewed from the perspective of the two nodes whose link tic features for predicting co-authorship collaborations on probability is to be estimated. three real datasets. Second, once the central neighborhood set(W; X; Y; Z) is identified we learn a maximum entropy Markov ran- dom field model that estimates the joint probability of 1 Introduction the nodes comprising the central neighborhood set, i.e., p(W; X; Y; Z). In this context one can leverage the fact that In recent times, there has been a lot of interest in under- most co-authorship networks are computed from an event standing and characterizing the properties of large scale net- log (an event corresponding to a publication). Multi-way works or graphs. Part of this interest is because of the gener- statistics (e.g. non-derivable frequent itemsets [4] whose ality of the graph model: many domains, such as social net- elements are drawn from (W; X; Y; Z)) on these event logs works, gene regulatory networks and the World Wide Web can be used to constrain and learn the model parameters can be naturally thought of as graphs. An important prob- efficiently [19]. The resulting model can then be used to lem in this context is that of link prediction. Informally link estimate the link probability between X and Y which we prediction is concerned with the problem of predicting the henceforth denote as the co-occurrence probability. (future) existence of links among nodes in a graph. Link prediction is useful in many application domains, ranging In our empirical results we demonstrate that the co- from recommender systems to the detection of unseen links occurrence probabilities inferred from the resulting model in terrorism networks, from protein interaction networks to can be computed in a scalable manner and is highly dis- the prediction of collaborations among scientists, and from criminatory for link prediction when compared with state- prediction of friendship formations to the prediction of web of-the-art topological and semantic features on several real hyperlinks. world datasets. Moreover, we demonstrate that the result- In this article we focus on the problem of link predic- ing co-occurrence probability can also be effectively com- tion particularly in the context of evolving co-authorship bined with these other features and then one can employ any classification algorithm to predict if a link will be formed ∗ This work is supported in part by the following research grants: DOE Award between two nodes. Specifically, we employ a simple yet No. DE-FG02-04ER25611; NSF CAREER Grant IIS-0347662; Contact email - [email protected] novel variant of the Katz score as a topological feature, one 1550-4786/07 $25.00 © 2007 IEEE 322 DOI 10.1109/ICDM.2007.108 that scales reasonably well at some cost to accuracy. Ad- values in the result set of the join queries (such as count and ditionally we describe and use straightforward state-of-the- average). A more complex feature such as cosine similar- art methods to measure the semantic overlap among nodes ity between bag-of-words representations cannot be easily based on the topics they work on, to further enhance the expressed using simple SQL aggregation functions. feature vector and improve overall link prediction perfor- Researchers have also examined the use of probabilistic mance. models for solving the link prediction problem. Taskar et al. [18] use discriminatively trained relational Markov net- 2 Related Work works to define a joint probabilistic model over the entire graph (i.e. over the links as well as the content attributes of the nodes). The trained model is used to collectively clas- The seminal work of Liben-Nowell and Kleinberg [12] sify the test data. Kashima and Abe [10] propose a parame- was the first comprehensive study on the utility of topolog- terized probabilist model of network evolution and then use ical features derived from graphs for predicting links in so- it for link prediction. They assume the network structure cial networks. They examine various topological features, is in a stationary state and propose an EM algorithm to es- including graph shortest distance, common neighbors, pref- timate model parameters. They report encouraging results erential attachment, Adamic-Adar, Jaccard, SimRank, hit- on two small biological datasets. However, both collective ting time, rooted PageRank, and Katz. They find that topo- classification and training global probabilistic models can logical information is quite useful when compared to a ran- be expensive to compute and typically do not scale well to dom predictor. In particular, Adamic-Adar and the Katz medium and large scale networks. measure appear to be more effective than the other topolog- In related work, Rattigan and Jensen [17] argue that the ical features. Recently, Huang [7] proposes to use another link prediction problem is too hard to solve because of the topological feature – generalized clustering coefficient – to extreme class skew problem. Social networks are usually solve the link prediction problem. very sparse and positive links only hold a very small amount An important limitation of these works is that they only of all possible pairs of nodes. As an alternative, they pro- use a single (topological) feature for the link prediction task. pose a simpler problem – anomalous link discovery. Specif- Intuitively, it seems that one can achieve better performance ically, they constrain their focus on the links that have been by utilizing the other sources of information, such as the formed and infer their anomaly (surprisingness) scores from content or semantic attributes of the nodes. A natural way to the previous data. do this would be to use the multiple sources of information as features to be fed into a classifier that is trained to dis- criminate between positive instances (i.e. links that form) 3 Methods and negative instances (links that do not form) by making use of all the features. This is the approach adopted by We consider three sources of information of the network Hasan et al. [6] and O’Madadhain et al. [14]. Hasan et data for link prediction. We have a large number of lo- al. [6] have used topological features (such as the short- cal events that are accumulated along time, where by lo- est distance between the two nodes), aggregated features cal event we mean an interaction among a set of objects in (such as the sum of neighbors) and semantic features (such the network. For example, the publication of a paper would as the number of matching keywords). They report keyword represent a local event involving all the authors of the pa- match count to be their most useful feature on one dataset, per. We refer to the collection of such local events as the which indicates that a lot is to be gained by taking into con- event log. This is the raw format of the network data and sideration the semantic similarity in the publications of the provides the first source of information for link prediction. two authors. O’Madadhain et al. [14] also have investi- Such an event log is typically converted to a graph repre- gated the use of content-based attributes such as the KL- sentation, in which nodes represent objects in the network divergence of the topic distributions of the two nodes, their and two nodes are connected to each other if they co-occur geographic proximity, and similarity of journal publication in at least one local event. This graph provides the second patterns. source of information for link prediction. Finally, we have The work of Popescul et al. [16] is another interesting access to other attributes of an object in the network, such as approach to integrating different kinds of information. They the research areas of the authors in an author collaboration represent the data in a relational format, generate candidates network, usually referred to as content or semantic informa- for features through database join queries, select features tion.

Local Probabilistic Models for Link Prediction∗

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support