J. Intell. Syst. 2015; 24(4): 491–503

Islam Elkabani* and Roa A. Aboo Khachfeh* Homophily-Based Link Prediction in The Facebook Online : A Rough Sets Approach

Abstract: Online social networks are highly dynamic and sparse. One of the main problems in analyzing these networks is the problem of predicting the existence of links between users on these networks: the link prediction problem. Many studies have been conducted to predict links using a variety of techniques like the decision tree and the logistic regression approaches. In this work, we will illustrate the use of the rough set theory in predicting links over the Facebook social network based on homophilic features. Other algorithms are also employed in our experiments and compared with the rough set classifier, such as naive Bayes, J48 decision tree, support vector machine, logistic regression, and multilayer perceptron neural network. Moreover, we studied the influence of the “common groups” and “common page likes” homophilic features on predicting friendship between users of Facebook, and also studied the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes compared with using the overlap coefficient. We conducted our experiments on two different datasets obtained from the Facebook online social network, where users in each dataset live within the same geographical region. The results showed that the rough set classifier significantly outperformed the other classifiers in all experi- ments. The results also demonstrated that the common groups and the common page likes features have a significant influence on predicting the friendship between users of Facebook. Finally, the results revealed that using the overlap coefficient homophilic features provided better results than that of the Jaccard coef- ficient features.

Keywords: Link prediction, homophily, online social networks, rough set theory.

DOI 10.1515/jisys-2014-0031 Received February 24, 2014; previously published online January 10, 2015.

1 Introduction

An online social network (e.g., Facebook, LiveJournal, and Google+) is a social structure that models the interaction between people in a group or community [2]. These online social networks allow members to maintain user profiles with basic information (geographic location, gender, etc.), interests, and friends. This user profile represents the basis for grouping users and recommending friendship links between them. An online social network can be represented as a graph where nodes are the set of actors (together with their information) and the edges represent some type of associations between these actors, such as friendship or interaction links. The rapid growth of these social networks and the fact that new nodes are added to the

*Corresponding authors: Islam Elkabani, Faculty of Science, Mathematics and Computer Science Department, Beirut Arab Uni- versity, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon; and Faculty of Science, Mathematics and Computer Science Department, Alexandria ­University, ­Alexandria, 21526, Egypt, e-mail: [email protected]; and Roa A. Aboo Khachfeh, Faculty of Science, Mathematics and Computer­ Science Department, Beirut Arab University, P.O. Box 11-5020 Riad El Solh 11072809, Beirut, Lebanon, e-mail: [email protected] 492 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook graph over time defines a basic problem underlying social network evolution: the link prediction problem. It is the problem of predicting the likelihood of a future association (e.g., friendship, interaction) between two nodes, knowing that there is no association between the nodes in the current state of the graph [2]. Link prediction is a subfield of analyzing social network in which the target is to deduce some unobserv- able links based on the existent observations and relationships. It also has many applications in different domains, like in e-commerce (building recommendation systems), bioinformatics (protein-protein interac- tions), and many security-related applications (identifying hidden groups of criminals) [10]. Link prediction methods can generally be grouped into two approaches: those that just use the link struc- ture of the network and those that use both the attributes and nodes of the network. In the former approach, topological features are used to measure the connectivity of nodes in the network. However, the latter approach incorporates additional similarity features that measure the correspondence among the attributes of the nodes based on the sociological principle of “homophily” [12]. Homophily is the tendency of individu- als with similar interests, culture, or language to associate and bond with each other [17]. Many algorithms were proposed for predicting links in social networks [10]. Some of these algorithms were based on Bayesian probabilistic models and probabilistic relational models. Some other algorithms were based on computing a similarity score between a pair of nodes so that a supervised learning method, such as decision trees or neural networks, can be employed. In this work, we study predicting friendship links on the Facebook online social network based on the users’ homophilic features. We experiment using the rough set-based algorithm and five other supervised learning algorithms. Their performance in the link prediction problem is evaluated, and a comparative analysis among them is performed. We also study the effect of using the Jaccard coefficient in measuring the similarity between users’ homophilic attributes com- pared with using the overlap coefficient. Finally, we study the effect of using two additional homophilic features – common groups and common page likes – on predicting friendship links in the Facebook online social network. The remainder of the article is structured as follows: Section 2 describes related works. In Section 3, the rough set-based classification is introduced. Section 4 describes the data collection process from the Facebook online social network. In Section 5, the set of homophilic features used in our experiment are outlined. Section 6 introduces our new approach and describes the experiments. In Section 7, the results of the experiments are presented and discussed. Finally, Section 8 concludes the article and provides a brief about our future work.

2 Related Works

A considerable amount of work has been recently conducted to investigate the problem of link prediction in social networks [10]. Some approaches are based on Bayesian probabilistic models and probabilistic relational models. Others are posed as a binary classification model where a similarity score is computed between a pair of nodes, based on a set of features: graph topological features (network structure) or edge/ vertex attributes. Liben-Nowell and Kleinberg [16] formalized the link problem as follows: given a snapshot of a social network, can we infer new interactions among its members? They developed an approach for link predic- tion based on measuring and analyzing the proximity of nodes in a network (e.g., common neighbors, SimRank). Hasan and Zaki [10] extended this work in two ways. First, they showed that using external data outside the scope of graph topology can significantly improve the prediction result. Second, they used various simi- larity metrics as features in a supervised learning setup. Steurer and Trattner [24] studied the extent to which interactions between users in online social networks can be predicted by looking at features (topological and homophilic) from social network and position data. They used the binomial logistic regression algorithm for link prediction. Backstrom and Leskovec developed in [4] an algorithm based on supervised random walks that naturally combines the information from the network structure with node and edge level attributes. I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 493

In [8], Golder and Yardi found, by experimenting on the microblogging service Twitter, that two struc- tural characteristics – transitivity and mutuality – are significant predictors of the desire to form new ties. They investigated the link prediction problem from the natural topological perspective. Gilbert and Karahalios proposed in [7] a regression model to predict friendship links on Facebook based on a set of homophilic features consisting of users’ demographic and interactions but not including their interests. A data set consisting of 35 users of Facebook were used in their study. Patil et al. [20] considered the problem of predicting friendships between actors in a social network from only their sociodemographic attributes. Two different kinds of social networks were studied: a popular online friendship network and an offline network of mutual partner choices in speed dating. The decision tree was used as the classifier. It achieved an accuracy exceeding 85% on the online social network and exceeding 90% on the speed-dating network.

In [1], Aiello et al. formalized the link prediction problem as follows: “Given a subset of users Ut ⊆ U, we want to predict the presence or absence of a social link for every pair (u, v)∈{Ut × Ut|u ≠ v}.”. They studied the correlation between social and semantic features in three online social networks – Flickr, Last.fm, and aNobii – based on four different types of features: groups, tags, library, and tagged items. The J48 decision tree was used as the binary classifier in their work. Most of the previous studies applied various classification algorithms such as naive Bayes, binomial logistic regression, decision tree, support vector machine (SVM), and multilayer perceptron (MLP) to predict links between users in online social networks based on topological or homophilic features. In our work, we investigate the link prediction problem from the network homophilic features perspective in the Facebook online social network. The rough set-based classification algorithm [14] is used for link prediction, and its performance is compared with that of other various classification algorithms.

3 Rough Set-Based Classification

The rough set theory (RST) was developed by Zdizslaw Pawlak in the early 1980s [21]. It has been used in the classification of imprecise, uncertain, or incomplete information [15]. RST has been successfully applied in many real-life problems in medicine, banking, toxicity predictions, and financial markets [13, 15]. One of the important features of RST is its ability to identify the importance of different attributes and to reduce them using one of its reduction algorithms. Finding minimal subsets (reducts) of attributes that are efficient for rule making is a central part of its process, which consequently improves the efficiency of the predictor [22]. In RST, the structure of data is represented in the form of a decision system (DS) [3]: DS = (U, A∪{d}), where U is a non-empty finite set of observations objects called the universe, whereas A is a non-empty finite set of attributes. Every attribute a∈A is a total function of the form a: U → Va, where Va is the set of allowable values for the attribute a. The attributes belonging to A are called conditional attributes, whereas d is called a decision attribute. A reduct (R) of A refers to the minimal selection of attributes that can be used to represent all classes of the decision system, thus preserving the original classification defined by A. A reduct has two main properties: 1. C(R) = C(A), i.e., R produces the same classification (C) of objects as the collection A of all attributes. 2. For any attribute a∈R, C(R – {a}) ≠ C(R), i.e., a reduct is a minimal subset with respect to property 1.

A method of computing reducts is given by Skowron and Rauszer [23] and described as follows. First, a dis- cernibility matrix is constructed from the data set. This matrix shows the differences between each pair of objects with respect to attributes. A discernibility matrix is defined as an n × n matrix, where n is the number of objects in the dataset. The general (i,j)-th term cij of this matrix is the set of attributes for which the i-th (xi) and j-th(xj) object of the dataset differ: 494 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook

Caij =∈{|Ax∈≠Ua and ()xaij()xi}, for ,1jn= ,,… .

Second, a Boolean discernibility function f is constructed. This function describes constraints that must hold to preserve the discernibility between all pairs of discernible objects in the information system, and requires keeping at least one attribute from each non-empty element of the discernibility matrix correspond- ing to any pair of discernible objects. The discernibility function is parallel to the discernibility matrix and represented as follows:

f (,aa1 … mi)V=≤{|Λφcjji1,≤≤in c j ≠ }, where a̅1…am̅ are Boolean variables corresponding to the m attributes in A, and cij̅ = {a̅|a∈ij}.c Moreover, in case of classifying data with respect to the decision attribute d, decision-relative discernibil- ity reducts are computed and a decision is made. Thus, a decision-relative discernibility matrix is constructed similar to the discernibility matrix stated above. This matrix shows the differences between each pair of objects, with respect to attributes and decision classes. In this case, the two objects xi and xj of the dataset take different values and are also in different decision classes:

{|aA∈∈xU and ax()≠≠ax()} if dx()dx() c′ =  ij ij. ij φ =  if dx()ijdx() From this decision-relative discernibility matrix, the decision-relative discernibility function is con- structed in the same way as the discernibility function stated above:

f ′′(,aa1 …,)mi=≤V{ Λφc|ji1,ji≤≤nc ′j ≠ }. As a result, the set of prime implicants of this function determines the set of all reducts of A. One of the most efficient rough set reduction algorithms is the Johnson algorithm [18]. It is a typical greedy algorithm that tends to find a single prime implicant of minimal length. It provides a reduction method that is fast and efficient with the minimum overhead. Algorithm 1 represents the Johnson algorithm.

Algorithm 1

Johnson(A, fD) A, the set of conditional attributes fD, the discernibility function initialize R←φ; bestc = 0; while (fD not empty) do for each a∈A that appears in fD c = heuristic (a) if (c > bestc) bestc = c; bestAttr←a R←R∪a fD←removeClauses (fD, a) return R

The algorithm starts by initializing the current reduct candidate R to the empty set. Then, each condi- tional attribute appearing in the discernibility function is evaluated according to the heuristic measure rep- resenting the count of the number of appearances an attribute makes within clauses. Attributes that appear more frequently are more significant. The attribute with the highest heuristic value is added to R, and all clauses in the discernibility function containing this attribute are removed. The algorithm terminates when all clauses have been removed, and the reduct R is obtained [11]. The Johnson algorithm is implemented in the ROSETTA software [18] used in our rough set-based link prediction algorithm. In this work, we used rough sets for classification to exploit two of its main advantages over other traditional classifiers: (i) selecting an optimal subset of attributes quickly and effectively, and (ii) dealing effectively with the missing attribute values problem. I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 495

4 Data Collection

We conducted our experiment on two different datasets obtained from the online social network Facebook, where its structure can be represented as an unweighted, undirected graph. Two common methods are used by researchers to collect data from online social networks. First, the API (application programming interface) calls if they are supported by the online social network operator. This method allows third parties of an online social network to access some of their specified data precisely. The second method is to perform data scraping straight from the online social network website by requesting the specific URLs and fetching the web page as if we are using a web browser, and then parsing the returned HTML page and extracting the needed information. In Facebook, the API calls are more restrictive than HTML scraping. For our work, we are only interested in public data that are available for any logged-in user. Thus, we have implemented our Facebook crawler, by HTML scraping, using the breadth-first search (BFS) algorithm [5]. This algorithm is a classic graph traversal algorithm that starts from a seed node and then iteratively explores the neighboring nodes. In each iteration, the next but not yet visited node is selected. Algorithm 2 shows the general process of crawling an online social network based on the BFS method.

Algorithm 2

Initialize u←u0 (where u0 is the seed node) Create a queue Q Enqueue u onto Q Mark u as visited while Q is not empty||NOT FINISHED do Dequeue an item from Q into u Fetch data for u and obtain F(u), the list of friends of u for each user, v belongs to F(u): if v is not marked as visited: Mark v as visited Enqueue v onto Q

Our crawler has been implemented in Java and SQL Server 2008. We run it on a machine with 2-GB RAM and a data storage space up to 5 GB. It can be described as follows: –– The Facebook server is contacted through an HTTP request providing credentials required for the authen- tication through the cookies that Facebook maintains for each request and response. –– Once logged in, HTTP requests are sent to open the seed profile to be processed to retrieve the needed data. For example, to retrieve the personal information of a node, the corresponding URI used is https://m. facebook.com/nodeID. –– The list of friends for this node is also retrieved through the following URI: https://m.facebook.com/ profile.php?v = friends&id = nodeID. –– The collected HTML pages are parsed using regular expressions, and the relevant information is extracted. –– The friends’ IDs of the visited node are inserted into a FIFO queue to be visited cyclically.

The information obtained about each user profile visited during crawling are as follows: ID (which is a 32-bit number that uniquely identifies a user), name, gender, birthday, activities, athletes, education, games, inter- ests, languages, movies, music, sport, teams, TV shows, work, books, religion, political view, city, home- town, pages liked, groups, and the list of friends. We downloaded the profile data of 5038 users for the first dataset and 1430 users for the second dataset. For each dataset, all of the Facebook users have a common feature – they live in the same geographical region. This is because socially connected pairs of users are more likely co-residents of the same geographical location [20]. Both data sets are generated in a way that yielded two balanced data sets to ensure the accuracy of the classification process. 496 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook

5 Homophilic Feature Sets

As already mentioned, the aim of this work is to predict friendships between users in the Facebook online social networks based on their homophilic features. We can consider the social network as an undirected graph GF (VF, EF), with VF representing the users and e = (u, v)∈EF is the friendship relation existing between the users u and v, where this relationship is based on a mutual agreement that both users confirm. Users of Facebook can enhance their social network profile by adding information such as interests, activi- ties, language, and favorite movies. Additionally, users may join common-interest groups, organized by work- place, school or college, or other characteristics; they can also like other pages on the network. We define a set of homophilic features based on the attributes collected from the users’ profiles after discarding those attrib- utes that have missing values for most of the users. The attributes selected for use in prediction are activities, athletes, education, games, interests, languages, movies, music, teams, TV shows, work, books, groups, and page likes. The set of homophilic features is calculated as the attributes’ similarity between users u, v. The Jaccard coefficient is the first homophilic feature used to measure the similarity, and it is defined as the size of the intersection divided by the size of the union of the sample sets [6]. –– Jaccard coefficient for interests: This homophilic feature is a combination of total interests and common interests between two users: ||I(u)∩I(v) JCI_(nterestsu,)v = , ||I(u)∪I(v) where I(u) and I(v) are the set of interests for the users u and v, respectively. The Jaccard coefficient is also applied to the rest of the selected user profile attributes. However, we intuitively figured out that a better homophilic feature for measuring the similarity, especially of multivalued attributes, can be found by using the overlap coefficient defined as the size of the intersection between two sets divided by the minimum size of the two sets.

–– Overlap coefficient for interests: ||I(u)∩I(v) OC _(Interestsu,v),= MinI (| (u),|| I(v)|) The overlap coefficient is also applied to the rest of the selected user profile attributes.

6 Homophily-Based Link Prediction

Figure 1 describes the steps followed in our approach to predict friendship links in the Facebook online social network based on the users’ homophilic features. The first step was the data collection by crawling the online social network Facebook for users’ profiles as described in Section 4. After collecting the data, the following data preprocessing steps were applied:

Crawling Facebook Pairing users’ Data preprocessing for profiles profiles

Homophilic features Link prediction Training and testing extraction

Figure 1. Homophily Based Link Prediction Process. I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 497

–– The attributes having missing values for the most of users’ profiles were discarded. –– The isolated users, whose friends’ list is missing owing to privacy settings, were removed from our col- lected profiles. –– The users’ profiles having missing values for most of their attributes also were removed from our col- lected profiles.

Table 1 shows the remaining attributes used in our work and the percentage of availability for each attribute in the collected profiles after the preprocessing step. The next step was the pairing of profiles; 10,054 user pairs were generated from the first dataset collected and 3000 user pairs from the second one. For each pair of profiles, two data sets were generated: the first dataset is the set of homophilic features generated based on the Jaccard coefficient similarity, and the second dataset is the set of homophilic features generated based on the overlap coefficient similarity. In both cases, the score obtained has a value between 0 and 1. Thus, we chose a parameter (threshold) as a cutoff point to decide whether the pair of profiles matches on a given attribute or not. We varied the threshold in steps of 0.1 (ranging from 0.1 to 1.0), with a threshold of 0.1 indicating a loose matching and a threshold of 1.0 indicating a tight matching. Each dataset consists of a set of matching along with the ID of the first user and that of the second user. Each single matching is a t bit vector, with each bit denoting whether the pair of users matches on a particular attribute At. A single bit is also used to indicate whether a link exists between a pair of users in the network. A value of 0 (FALSE) indicates that there is no link or friendship relation between the pair of users, and a value of 1 (TRUE) indicates that the users are friends. Two experiments were conducted on the two Facebook datasets described earlier to study the extent to which friendship between users in online social networks can be predicted on the basis of homophilic features. In the first experiment, we predicted links by employing five different classification algorithms adopted in most of the previous related works. In the second experiment, the rough set classification algorithm was employed for predicting links between users. Moreover, two cases were taken into consideration for both experiments. In the first case, we excluded from the attributes that we have listed earlier the groups and page likes. In the second case, these two attributes were included, and the set of homophilic features were aug- mented with two more features to measure the common groups and the common page liked by pair of users. These two attributes were rarely studied in the literature for predicting links between users based on homo- philic features. In our experiments, we decided to study their influence on predicting friendship between users on the Facebook online social network.

Table 1. Percentage of Available Data for Each Attribute.

Attribute % of Available Data

First Dataset Second Dataset

Activities 58 54 Athletes 57 58 Education 64 59 Games 42 40 Interests 50 55 Languages 21 25 Movies 64 67 Music 78 79 Teams 50 58 TV Shows 68 73 Work 56 61 Books 52 49 Groups 61 86 Page Likes 90 92 498 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook

For both experiments, we considered the social network as an undirected graph for the computation of the edge features (i.e., the existence of a link between two users). In the first dataset, of the 10,054 user pairs used in our experiments, there were 5106 pair of edges (u, v) in the undirected graph (i.e., u and v are cur- rently friends on the social network) and 4948 pairs of users without edges between them (i.e., they are not currently friends on the social network). As for the second dataset, of the 3000 user pairs used in our experi- ments, there were 1590 pairs of edges (u, v) in the undirected graph and 1410 pairs of users without edges between them. This will yield two balanced datasets ready for training and testing. In the classification process, 70% of user pairs in our dataset were used for training and 30% were used for testing. Furthermore, we used a 10-fold cross-validation approach to justify our findings. We trained and tested our classifier at each of the thresholds mentioned before (0.1, 0.2,…, 1.0).

6.1 Predicting Using Traditional Classifiers

Five different classification algorithms used in previous related works were employed: naive Bayes, SVM, J48 decision tree, simple logistic regression, and multilayer perceptron (MLPClassifier). The datasets were avail- able as ARFF files. WEKA, a well-known library [9], was used.

6.2 Predicting Using the Rough Set Approach

In this experiment, the dataset was available as an Excel sheet and loaded in ROSETTA toolkit [18] as an ODBC. The Johnson reduced algorithm was employed to reduce the number of measured attributes and gen- erate the reduct and rules.

7 Results

In this work, we used the F-measure (F1) for an equal weighting of precision and recall as an evaluation metric for the classification performance [19]. Tables 2–5 show the results (precision, recall, and F1) obtained for the different classifiers used at threshold 0.8 from the first dataset (10,054 user pairs).

Tables 6–9 show the results (precision, recall, and F1) obtained for the different classifiers used at thresh- old 0.8 from the second dataset (3000 user pairs). In all experiments, two testing modes were adopted, which are the 10-fold cross-validation mode and the testing-training mode. Tables 2 and 6 show the results obtained by using the Jaccard coefficient homophilic features and applying the six different classifiers on the first and second dataset, respectively. As for the first

Table 2. Homophilic Features Calculated Using the Jaccard Coefficient: Without Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.71 0.702 0.7 0.707 0.702 0.701 J48 decision tree 0.655 0.654 0.654 0.662 0.661 0.661 Logistic regression 0.715 0.668 0.647 0.721 0.669 0.649 SVM 0.714 0.665 0.642 0.72 0.668 0.646 MLPClassifier 0.72 0.67 0.648 0.721 0.673 0.655 Rough set 0.876 0.834 0.833 0.877 0.841 0.837

Average F1 measure for the six classifiers 0.687 0.691 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 499

Table 3. Homophilic Features Calculated Using the Overlap Coefficient: Without Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.675 0.675 0.675 0.678 0.677 0.677 J48 decision tree 0.671 0.671 0.671 0.634 0.63 0.627 Logistic regression 0.818 0.816 0.815 0.82 0.817 0.816 SVM 0.812 0.807 0.807 0.814 0.809 0.808 MLPClassifier 0.824 0.824 0.824 0.821 0.821 0.821 Rough set 0.945 0.939 0.938 0.915 0.924 0.925

Average F1 measure for the six classifiers 0.788 0.779

Table 4. Homophilic Features Calculated Using the Jaccard Coefficient: With Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.682 0.669 0.664 0.707 0.702 0.701 J48 decision tree 0.659 0.658 0.658 0.662 0.661 0.661 Logistic regression 0.716 0.669 0.648 0.721 0.669 0.649 SVM 0.716 0.665 0.642 0.72 0.668 0.646 MLPClassifier 0.718 0.67 0.649 0.721 0.673 0.655 Rough set 0.946 0.941 0.941 0.937 0.928 0.927

Average F1 measure for the six classifiers 0.701 0.706

Table 5. Homophilic Features Calculated Using the Overlap Coefficient: With Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.667 0.666 0.666 0.665 0.665 0.665 J48 decision tree 0.676 0.676 0.676 0.693 0.693 0.693 Logistic regression 0.828 0.826 0.825 0.829 0.826 0.825 SVM 0.819 0.815 0.814 0.825 0.821 0.821 MLPClassifier 0.83 0.83 0.83 0.832 0.831 0.83 Rough set 0.948 0.944 0.943 0.942 0.943 0.94

Average F1 measure for the six classifiers 0.792 0.796

dataset, in both testing modes, the least F1 values were obtained from applying the SVM classifier (between

0.642 and 0.646) and the highest F1 values were obtained from applying the rough set classifier (between

0.833 and 0.837). For the second dataset, in both testing modes, the least F1 values were obtained from apply- ing the naive Bayes classifier (0.576) and the highest F1 values were obtained from applying the rough set classifier (between 0.726 and 0.822). Tables 3 and 7 show the results obtained by using the overlap coefficient homophilic features and applying the six different classifiers on the first and second datasets, respectively.

As for the first dataset, in both testing modes, the least F1 values were obtained from applying the J48 decision tree classifier (between 0.627 and 0.671), whereas the highest F1 values were obtained from applying the rough 500 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook

Table 6. Homophilic Features Calculated Using the Jaccard Coefficient: Without Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.624 0.624 0.576 0.588 0.601 0.576 J48 decision tree 0.697 0.683 0.655 0.69 0.678 0.651 Logistic regression 0.706 0.646 0.576 0.702 0.664 0.616 SVM 0.713 0.655 0.59 0.705 0.669 0.624 MLPClassifier 0.707 0.649 0.583 0.703 0.668 0.623 Rough set 0.824 0.823 0.822 0.773 0.745 0.726

Average F1 measure for the six classifiers 0.634 0.636

Table 7. Homophilic Features Calculated Using the Overlap Coefficient: Without Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.713 0.649 0.581 0.702 0.671 0.63 J48 decision tree 0.712 0.646 0.574 0.704 0.668 0.622 Logistic regression 0.715 0.655 0.591 0.702 0.666 0.619 SVM 0.699 0.685 0.657 0.691 0.679 0.652 MLPClassifier 0.709 0.656 0.594 0.704 0.668 0.622 Rough set 0.828 0.817 0.812 0.869 0.861 0.858

Average F1 measure for the six classifiers 0.635 0.667

Table 8. Homophilic Features Calculated Using the Jaccard Coefficient: With Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.666 0.652 0.611 0.666 0.651 0.611 J48 decision tree 0.716 0.651 0.582 0.706 0.668 0.621 Logistic regression 0.709 0.669 0.62 0.708 0.663 0.612 SVM 0.71 0.668 0.619 0.705 0.662 0.611 MLPClassifier 0.709 0.668 0.619 0.706 0.661 0.608 Rough set 0.88 0.846 0.874 0.859 0.854 0.852

Average F1 measure for the six classifiers 0.654 0.652

set classifier (between 0.925 and 0.938). For the second dataset, the least F1 values were obtained from apply- ing the J48 decision tree classifier (between 0.574 and 0.622), whereas the highest F1 values were obtained from applying the rough set classifier (between 0.812 and 0.858). Tables 4 and 8 show the effect of including the groups and page likes attributes as the Jaccard coefficient homophilic features in predicting friendship on the first and second dataset, respectively. For both datasets, the F1 values increased for most of the classifiers except for the naive Bayes classifier in case of the first dataset and the J48 decision tree in case of the second dataset. Finally, Tables 5 and 9 show the effect of adding the groups and page likes attributes as the overlap coefficient homophilic features in predicting friendship on the first and second dataset, respectively. Again, in this experiment, for both datasets, the F1 values increased for most of the classifiers except for the naive Bayes classifier in case of the first dataset and the SVM classifier in case of the second dataset. I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 501

Table 9. Homophilic Features Calculated Using the Overlap Coefficient: With Groups and Page Likes Attributes.

Algorithm Test Mode

10-fold cross-validation Testing and Training

Precision Recall F1 Precision Recall F1 Naive Bayes 0.7 0.691 0.669 0.706 0.696 0.674 J48 decision tree 0.714 0.666 0.612 0.705 0.659 0.604 Logistic regression 0.723 0.701 0.674 0.723 0.702 0.677 SVM 0.71 0.649 0.582 0.704 0.669 0.624 MLPClassifier 0.727 0.703 0.676 0.723 0.704 0.681 Rough set 0.909 0.907 0.906 0.892 0.887 0.886

Average F1 measure for the six classifiers 0.687 0.691

On the basis of these results, we can conclude the following major findings: 1. The rough set classifier outperformed the other classifiers in all experiments. The results of the four different experiments conducted on the first dataset (presented in Tables 2–5)

show that the average F1 measure obtained for the rough set classifier in the four experiments is 0.914 for the 10-fold cross-validation mode and 0.907 for the testing-training mode. These averages exceed all the

average F1 measures of the other five classifiers. The highest average F1 measure among the five classifiers is that obtained from the MLP classifier, which is 0.738 for the 10-fold cross-validation mode and 0.740 for the testing-training mode. As for the results of the four experiments conducted on the second dataset (shown in Tables 6–9), the

average F1 measure for the rough set classifier in the four experiments is 0.853 for the 10-fold cross-valida-

tion mode and 0.831 for the testing-training mode. Again, these averages exceed the average F1 measure of

the five other classifiers. The highest average F1 measure among these classifiers is that obtained from the MLP classifier, which is 0.618 for the 10-fold cross-validation mode and 0.634 for the testing-training mode.

Moreover, by comparing the values of the F1 measure resulting from the experiment conducted on the small dataset (3000 pairs of user profiles) and the larger dataset (10,000 pairs of user profiles), it can be seen that all classifiers perform better when the size of the dataset increases. This indicates the con- sistency of the results obtained from the different classifiers, especially the rough set classifier. Thus, on the basis of the aforementioned results, it is obvious that the rough set classifier signifi- cantly outperforms the other classifiers in homophily-based link prediction in the Facebook online social network. Although the homophily-based link prediction in online social networks is always associated with many difficulties owing to the missing attributes or incomplete attributes of users’ profiles, the rough sets classification proved a distinguished performance in homophily-based link prediction in the Facebook online social network. This is due to the two main features of rough sets: first, the Johnson reduction algorithm, which provides an effective method for identifying the minimal and optimal subset of attributes. Second, its ability of handling missing valued attributes effectively. 2. The common groups and the common page likes features have a significant influence on predicting the friendship between users. The results of the two experiments conducted on the first dataset presented in Tables 2 and 4 show that

the value of the average F1 measure for all classifiers obtained using the Jaccard coefficient homophilic features without the common groups and page likes is 0.687 for the 10-fold cross-validation mode and 0.691 for the testing-training mode. These results increased after including the common groups and page likes to 0.701 for the 10-fold cross-validation mode and 0.706 for the testing-training mode. Also, the results of the two experiments conducted on the second dataset, and presented in Tables 6 and 8, show

that the value of the average F1 measure for all classifiers obtained using the Jaccard ­coefficient homo- philic features without the common groups and page likes is 0.634 for the 10-fold cross-validation mode and 0.636 for the testing-training mode. These results increased after including the common groups and 502 I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook

page likes to 0.654 for the 10-fold cross-validation mode and 0.652 for the testing-training mode. The same results were obtained after using the overlap coefficient homophilic features instead of the Jaccard coef-

ficient for both datasets. The average F1 measure also increased after including the common groups and page likes in the case of using the overlap coefficient homophilic features. These results demonstrate that the common groups and the common page likes features have a significant influence on predicting the friendship between users. In other words, users sharing the same groups and page likes are more likely to be friends on the Facebook online social network. 3. Using the overlap coefficient homophilic features provided better results than that of the Jaccard coefficient features. The results of the experiments conducted on the first dataset and presented in Tables 4 and 5 show that

the value of the average F1 measure using the overlap coefficient homophilic features is 0.792 for the 10-fold cross-validation mode and 0.796 for the testing-training mode. These averages exceed the value

of the average F1 measure using the Jaccard coefficient homophilic features, which is 0.701 for the 10-fold cross-validation mode and 0.706 for the testing-training mode. Also, the results of the experiments con-

ducted on the second dataset and presented in Tables 8 and 9 show that the value of the average F1 measure using the overlap coefficient homophilic features is 0.687 for the 10-fold cross-validation mode

and 0.691 for the testing-training mode. These averages exceed the value of the average F1 measure using the Jaccard coefficient homophilic features, which is 0.654 for the 10-fold cross-validation mode and 0.652 for the testing-training mode. These results show that using the overlap coefficient homophilic features provides better performance than using the Jaccard coefficient features. This is because most of the user profile attributes used in this study were multivalued attributes.

8 Conclusion and Future Work

Link prediction is an important task in analyzing and understanding social networks. Some approaches for link prediction are based on topological features, and others integrate these features with the homophilic fea- tures. In this work, we investigated the link prediction problem from a homophilic perspective based on the homophily principle, which states that a contact between similar people occurs at a higher rate than among dissimilar people. For this purpose, datasets from the Facebook online social network were collected. Various classifiers were applied for prediction. The results of the rough set algorithm outperformed the other tradi- tional classifying algorithms. Moreover, including the common groups shared among pairs of users and the common pages they like improved the results of the link prediction, which means that they have a significant influence on predicting friendship between users on online social networks. As for near-future work, we will incorporate some topological features in link prediction, such as the Jaccard coefficient for common neighbors and the Adamic-Adar measure for link prediction using the differ- ent classifiers stated in this work and the rough sets classifier. We will compare the influence of each category of features (homophilic and topological) separately in link prediction, and then we will investigate what will be the results of combining both categories together when predicting links in online social network.

Bibliography

[1] L. M. Aiello, A. Barrat, R. Schifanella, C. Cattuto, B. Markines and F. Menczer, Friendship prediction and homophily in social media, ACM Trans. Web 6 (2012), 1–33. [2] M. Al Hasan, V. Chaoji, S. Salem and M. Zaki, Link prediction using supervised learning, in: Proceedings of SDM Workshop of Link Analysis, Counterterrorism and Security, 2006. [3] Q. Al-Radaideh, The impact of classification evaluation methods on rough sets based classifiers, in: Proceedings of the 2008 International Arab Conference on Information Technology (ACIT2008), University of Sfax, Tunisia, December 2008. [4] L. Backstrom and J. Leskovec, Supervised random walks: Predicting and recommending links in social networks, in: ACM WSDM, 2011. I. Elkabani and R. A. Aboo Khachfeh: Homophily-Based Link Prediction in Facebook 503

[5] S. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti, Crawling Facebook for purposes, in: Proc. International Conference on Web Intelligence, Mining and Semantics, p. 52, ACM, Sogndal, Norway, 2011. [6] S. Cesare and Y. Xiang, Software Similarity and Classification, Springer Briefs in Computer Science, 2012. [7] E. Gilbert and K. Karahalios, Predicting tie strength with social media, in: Proc. of CHI, 2009. [8] S. A. Golder and S. Yardi, Structural predictors of tie formation in Twitter: transitivity and mutuality, in: Proc. SocialCom, pp. 88–95, 2010. [9] S. A. Hall M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. Witten, The WEKA software: an update, in: ACM SIGKDD Explorations Newsletter, pp. 10–18, 2009. [10] M. Hasan and M. Zaki, A survey of link prediction in social networks, in: C. Aggarwal, ed., Social Network Data Analytics, 1st Ed., pp. 243–276, Springer, New York, 2011. [11] R. Jensen, Rough set-based feature selection: a review, in: Rough Computing: Theories, Technologies and Applications, pp. 70–107, 2008. [12] I. Kahanda and J. Neville, Using transactional information to predict link strength in online social networks, in: Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM’09), pp. 74–81, 2009. [13] A. Khan, G. Saha, S. Dasgupta and S. K. Datta, Prediction of dominant genes responsible for lung adenocarcinoma using rough set theory, in: E-Health and Bioengineering Conference (EHB), pp. 24–26, November 2011. [14] L. P. Khoo, S. B. Tor and L. Y. Zhai, A rough-set-based approach for classification and rule induction, Int. J. Adv. Manuf. Technol. 4 (1999), 438–444. [15] S. J. Lee, J. J. Ahn and K. J. Oh, Using rough set to support investment strategies of real-time trading in futures market, Appl. Intell. 32 (2010), 364–377. [16] D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks, in: CIKM ’03: Proceedings of the Twelfth International Conference on Information and Knowledge Management, New York, pp. 556–559, 2004. [17] M. McPherson, L. Smith-Lovin and J. M. Cook, Birds of a feather: homophily in social networks, Annu. Rev. Sociol. 27 (2001), 415–444. [18] A. Øhrn, ROSETTA Technical Reference Manual, Knowledge Systems Group, Department of Computer and Information Sci- ence, NTNU, Trondheim, Norway, November 1999. [19] D. L. Olson and D. Delen, Advanced Data Mining Techniques, Springer Publishing Company Incorporated, 2008. [20] A. Patil, J. Gao and A. van de Rijt, Homophily based link prediction in friendship networks, in: INSNA Sunbelt’10: The Inter- national Sunbelt Social Networks Conference XXX, Trento, Italy, June 29–July 4, 2010. [21] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982), 341–346. [22] M. Z. Rashad, A rough-neuro model for classifying opponent behavior in real-time strategy games, Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 4 (2012), 185–196. [23] A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems, Theor. Decis. Lib. 11 (1992), 331–362. [24] M. Steurer and C. Trattner, Predicting interactions in online social networks: an experiment in second life, in: MSM’13, Paris, France, May 1, 2013.