Homophily-Based Link Prediction in the Facebook Online Social Network: a Rough Sets Approach
J. Intell. Syst. 2015; 24(4): 491–503

Islam Elkabani* and Roa A. Aboo Khachfeh*

Abstract: Online social networks are highly dynamic and sparse. One of the main problems in analyzing these networks is predicting the existence of links between their users: the link prediction problem. Many studies have been conducted to predict links using a variety of techniques, such as decision trees and logistic regression. In this work, we illustrate the use of rough set theory in predicting links over the Facebook social network based on homophilic features. Other supervised learning algorithms are also employed in our experiments and compared with the rough set classifier: naive Bayes, J48 decision tree, support vector machine, logistic regression, and multilayer perceptron neural network. Moreover, we studied the influence of the “common groups” and “common page likes” homophilic features on predicting friendship between users of Facebook, and also studied the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes compared with using the overlap coefficient. We conducted our experiments on two different datasets obtained from the Facebook online social network, where users in each dataset live within the same geographical region. The results showed that the rough set classifier significantly outperformed the other classifiers in all experiments. The results also demonstrated that the common groups and the common page likes features have a significant influence on predicting the friendship between users of Facebook. Finally, the results revealed that using the overlap coefficient homophilic features provided better results than using the Jaccard coefficient features.

Keywords: Link prediction, homophily, online social networks, rough set theory.
DOI 10.1515/jisys-2014-0031
Received February 24, 2014; previously published online January 10, 2015.

*Corresponding authors: Islam Elkabani, Faculty of Science, Mathematics and Computer Science Department, Beirut Arab University, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon; and Faculty of Science, Mathematics and Computer Science Department, Alexandria University, Alexandria, 21526, Egypt, e-mail: [email protected]; and Roa A. Aboo Khachfeh, Faculty of Science, Mathematics and Computer Science Department, Beirut Arab University, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon, e-mail: [email protected]

1 Introduction

An online social network (e.g., Facebook, LiveJournal, and Google+) is a social structure that models the interaction between people in a group or community [2]. These online social networks allow members to maintain user profiles with basic information (geographic location, gender, etc.), interests, and friends. The user profile is the basis for grouping users and recommending friendship links between them. An online social network can be represented as a graph in which the nodes are the actors (together with their information) and the edges represent some type of association between these actors, such as friendship or interaction links.

The rapid growth of these social networks and the fact that new nodes are added to the graph over time define a basic problem underlying social network evolution: the link prediction problem. It is the problem of predicting the likelihood of a future association (e.g., friendship, interaction) between two nodes, knowing that there is no association between the nodes in the current state of the graph [2]. Link prediction is a subfield of social network analysis in which the target is to deduce some unobservable links based on the existing observations and relationships.
It also has many applications in different domains, such as e-commerce (building recommendation systems), bioinformatics (protein-protein interactions), and many security-related applications (identifying hidden groups of criminals) [10].

Link prediction methods can generally be grouped into two approaches: those that use only the link structure of the network and those that also use the attributes of the nodes. In the former approach, topological features are used to measure the connectivity of nodes in the network. The latter approach incorporates additional similarity features that measure the correspondence among the attributes of the nodes, based on the sociological principle of “homophily” [12]. Homophily is the tendency of individuals with similar interests, culture, or language to associate and bond with each other [17].

Many algorithms have been proposed for predicting links in social networks [10]. Some of these algorithms are based on Bayesian probabilistic models and probabilistic relational models. Others are based on computing a similarity score between a pair of nodes so that a supervised learning method, such as decision trees or neural networks, can be employed.

In this work, we study the prediction of friendship links on the Facebook online social network based on the users’ homophilic features. We experiment with the rough set-based algorithm and five other supervised learning algorithms. Their performance in the link prediction problem is evaluated, and a comparative analysis among them is performed. We also study the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes compared with using the overlap coefficient. Finally, we study the effect of using two additional homophilic features (common groups and common page likes) on predicting friendship links in the Facebook online social network.
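To make the comparison between the two similarity measures concrete, the following is a minimal sketch of the standard set-based Jaccard and overlap coefficients. It assumes each homophilic attribute is represented as a set of items (here, hypothetical page identifiers); the function names, the sample data, and the convention of returning 0.0 for empty sets are illustrative assumptions, not the definitions used in the article's later sections.

```python
def jaccard(a: set, b: set) -> float:
    """Standard Jaccard coefficient: |A ∩ B| / |A ∪ B|.

    Returning 0.0 when both sets are empty is an assumed convention.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: set, b: set) -> float:
    """Standard overlap coefficient: |A ∩ B| / min(|A|, |B|).

    Returning 0.0 when either set is empty is an assumed convention.
    """
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical page-like sets for two users (not data from the article).
likes_u = {"p1", "p2", "p3", "p4"}
likes_v = {"p3", "p4"}

jaccard_uv = jaccard(likes_u, likes_v)  # 2 shared / 4 in the union
overlap_uv = overlap(likes_u, likes_v)  # 2 shared / 2 in the smaller set
```

In this example the two measures disagree noticeably: the overlap coefficient is not penalized when one user has far more items than the other, whereas the Jaccard coefficient is.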
The remainder of the article is structured as follows: Section 2 describes related works. In Section 3, the rough set-based classification is introduced. Section 4 describes the data collection process from the Facebook online social network. In Section 5, the set of homophilic features used in our experiments is outlined. Section 6 introduces our new approach and describes the experiments. In Section 7, the results of the experiments are presented and discussed. Finally, Section 8 concludes the article and briefly outlines our future work.

2 Related Works

A considerable amount of work has recently been conducted to investigate the problem of link prediction in social networks [10]. Some approaches are based on Bayesian probabilistic models and probabilistic relational models. Others are posed as a binary classification model where a similarity score is computed between a pair of nodes, based on a set of features: graph topological features (network structure) or edge/vertex attributes.

Liben-Nowell and Kleinberg [16] formalized the link prediction problem as follows: given a snapshot of a social network, can we infer new interactions among its members? They developed an approach for link prediction based on measuring and analyzing the proximity of nodes in a network (e.g., common neighbors, SimRank).

Hasan and Zaki [10] extended this work in two ways. First, they showed that using external data outside the scope of graph topology can significantly improve the prediction result. Second, they used various similarity metrics as features in a supervised learning setup.

Steurer and Trattner [24] studied the extent to which interactions between users in online social networks can be predicted by looking at features (topological and homophilic) from social network and position data. They used the binomial logistic regression algorithm for link prediction.
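The common-neighbors proximity measure mentioned in this section can be illustrated with a short sketch. It assumes the network snapshot is an undirected graph stored as a dictionary mapping each node to the set of its neighbors; the function name and the toy graph are hypothetical, and the code is only an illustration of the general idea, not the procedure of any cited work.

```python
from itertools import combinations


def common_neighbor_scores(adj: dict) -> dict:
    """Score every currently unlinked pair by its number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked in the snapshot, nothing to predict
        scores[(u, v)] = len(adj[u] & adj[v])
    return scores


# Hypothetical undirected snapshot (adjacency sets are symmetric).
snapshot = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

scores = common_neighbor_scores(snapshot)
# Candidate links, highest score first.
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
```

A score like this can be used directly to rank candidate links, or it can serve as one feature among several in a supervised classifier.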
Backstrom and Leskovec [4] developed an algorithm based on supervised random walks that naturally combines the information from the network structure with node- and edge-level attributes.

In [8], Golder and Yardi found, by experimenting on the microblogging service Twitter, that two structural characteristics, transitivity and mutuality, are significant predictors of the desire to form new ties. They investigated the link prediction problem from the natural topological perspective.

Gilbert and Karahalios [7] proposed a regression model to predict friendship links on Facebook based on a set of homophilic features consisting of users’ demographics and interactions but not including their interests. A data set consisting of 35 Facebook users was used in their study.

Patil et al. [20] considered the problem of predicting friendships between actors in a social network from only their sociodemographic attributes. Two different kinds of social networks were studied: a popular online friendship network and an offline network of mutual partner choices in speed dating. A decision tree was used as the classifier. It achieved an accuracy exceeding 85% on the online social network and exceeding 90% on the speed-dating network.

In [1], Aiello et al. formalized the link prediction problem as follows: “Given a subset of users Ut ⊆ U, we want to predict the presence or absence of a social link for every pair (u, v) ∈ {Ut × Ut | u ≠ v}.” They studied the correlation between social and semantic features in three online social networks (Flickr, Last.fm, and aNobii) based on four different types of features: groups, tags, library, and tagged items. The J48 decision tree was used as the binary classifier in their work.

Most of the previous