User Similarity Determination in Social Networks

Sadia Tariq 1,*, Muhammad Saleem 2 and Muhammad Shahbaz 2

1 Department of Computer Science & Engineering, Narowal Campus, University of Engineering & Technology Lahore, Lahore 51600, Pakistan
2 Department of Computer Science & Engineering, University of Engineering & Technology Lahore, Lahore 54890, Pakistan; [email protected] (M.S.); [email protected] (M.S.)
* Correspondence: [email protected]; Tel.: +92-333-8870846

Received: 8 March 2019; Accepted: 3 April 2019; Published: 15 April 2019

Abstract: Online social networks have provided a promising communication platform for an activity inherently dear to the human heart: finding friends. People are recommended to each other as potential future friends by comparing their profiles, which requires numerical quantifiers to determine the extent of user similarity. From similarity-based methods to artificial-intelligence machine learning methods, several metrics enable us to characterize social networks from different perspectives. This research focuses on the collaborative employment of neighbor-based and graphical distance-based similarity measurement methods with text classification tools such as the feature matrix and feature vector. Like-minded nodes are predicted accurately and effectively as compared to other methods.

Keywords: neighbor-based similarity metrics; Euclidean distance; distance-based similarity metrics; similarity score vector; data objects; adjacency matrix; link formation; node profile attributes

1. Introduction

Social network analysis is a measurement of relationships and the flow of information between the people of an organization, portraying them as actors of a network or as inter-connected, knowledge-sharing entities. The actors in the network and the relationships between them are modeled in the form of mathematical graphs having nodes and links. The links or edges between people or actors carry information in the form of emails, telephonic conversations, and text and picture messages, etc., counted as the weight of the link. Similarly, links can be directed to show the initiator of a conversation. Models determine whether localized network measures are sufficient to explain the global properties of the network or not. Conclusively, social network analysis provides a visual representation in the form of network graphs and a statistical representation in the form of network measures and network similarity measures about human relationships.

With the passage of time, dynamic social networks tend to shift properties. There exists a finite set of unique features for social networks. The properties exhibited by a random communication network at a given point in time are merely a subset of this superset. The superset of salient characteristics identified to be present in social networks is given below:

• Directed ties
• Weighted ties
• Disconnected components
• Isolated nodes
• Bi-directional ties
• Non-normally distributed data

Technologies 2019, 7, 36; doi:10.3390/technologies7020036 www.mdpi.com/journal/technologies

• Near complete node information
• Near complete link information
• Missing node information
• Missing link information

The temporal evolution undergone by dynamic social networks is classified into two kinds: expansion (when nodes are being added) and shrinking (when nodes are being removed). Networks might grow stronger (when the number of ties or tie weights increases) or weaker (when the number of ties or tie weights decreases). The shift in size and shape corresponds to the properties they possess at a specific time.

A second concept utilized in this paper, and needing an introduction, is that of the feature matrix and feature vectors. A feature matrix, as shown in Table 1, is a set of features that characterize a given set of linguistic units for a finite set of properties. In lexical semantics, feature matrices can be used to determine the meaning of specific word fields.

Table 1. Feature Matrix.

            Feature A   Feature B   Feature C   Feature D
Feature A       ✓           ✓           ✓
Feature B       ✓           ✓           ✓
Feature C       ✓           ✓           ✓           ✓
Feature D       ✓           ✓           ✓

The approaches and metrics used for accurate similarity detection are large in number. Many combinations have been tried and tested. Therefore, we attempt to reap the benefits of a rarely used idea by jointly employing two well-known categories of similarity metrics. This paper presents the theoretical justification of loosely integrating neighborhood information and geographic distance information for precise identification of similar users within a social network. The fundamental idea is to formulate a feature matrix (an adjacency matrix in the case of social networks) from neighbor-based similarity scores and to work out distance-based similarity metrics for two feature vectors (referred to as similarity score vectors in this study) at a time.

To accomplish our goal, we selected numerous more or less commonly known similarity measurement methods to be applied to a social network graph. First of all, we provide a brief survey of the similarity metrics under consideration. Secondly, we perform social network analysis by populating an adjacency matrix for the weighted social network graph using neighbor-based similarity metrics. Every row in the matrix represents the similarity scores of a particular node with its neighbors; hence it is named the similarity score vector. Lastly, we bring into action vector-based similarity metrics for measuring the pairwise distance between vectors on a two-dimensional plane.

The study aims to elaborate and reveal the productivity of various combinations of similarity-based methods in constructing a user similarity determination system without additional information about the suggested content.

In the following section, a consolidated literature survey of the studies related to the proposed scheme is presented.

2. Related Work

Social networks have become an integral part of our daily lives, enabling us to keep in touch with our friends and families [1]. The existing models incorporate either node profile attributes or link formation information or both [2]. Most of them represent the members and relationships among them in the form of a mathematical graph with nodes as vertices and ties as edges.

In the 1960s, Stanley Milgram presented the term "small-world phenomenon" [3]. It stated that we are all linked by short chains of acquaintances and was the pioneer in quantifying the phenomenon, allowing people to speak of the

"six degrees of separation" between any two people in the United States. Since then, many network models and frameworks were proposed to analytically study the phenomenon.

The concept of similarity has been mathematically explained and formally defined in reference [4], and a graphical overview of applications of similarity algorithms and similarity metrics is presented in Figure 1.

(Figure 1 contrasts similarity determination algorithms: profile matching, link prediction, sequence alignment, classification/clustering, and information retrieval, with the kinds of similarity metrics: neighbor based, node based, graphical distance based, and random walk based.)

Figure 1. Algorithms and kinds of similarity metrics.

2.1. Profile Matching

A hybrid similarity measure that combines network similarity with node profile similarity has been proposed in reference [5]. A brief survey in reference [6] illustrates the variety of similarity measures developed for social networks and the difficulty of selecting a similarity measure for problems such as link prediction or community detection.

The friendship recommendation system proposed in reference [2] employed different similarity metrics for various features. When applied to real world social network data, it suggested that interest, age, gender, and location contribute to friendship formation.

The hierarchical arrangement of the network in the form of a tree with nodes and leaves relies on the belief that a pair of nodes is likely to be similar if they are equally close to a specific common predecessor. Kleinberg in reference [3] proposed that there is a decentralized algorithm capable of finding short paths with high probability. Decentralized search algorithms using combinations of node degree parameters were presented in reference [7]. The work in reference [8] presents an extension of the decentralized search algorithm which enables it to search in a completely unconscious mode, when the user is unaware of his or her position as well as that of others within the network. Also, community associations and common keywords play a vital role in the identification of like-minded users [9].

Link prediction is mostly used to determine the most liked item by a user based on his/her buying patterns in the past, and recommender systems and link prediction go side by side. In reference [10], a link prediction system has been constructed. Tested on an online judge, it recommended programming problems and tasks to users to be solved, similar to a test-session or practice-session. Based on performance in one problem, a next, more difficult problem is presented to the user. Such automatic judgment systems are used for onsite training for programming contests.

Adamic and Adar in reference [11] developed functions to analyse similarities between users as a function of the frequency of a shared item. Reference [12] highlights the construction of a measure of similarity of whole graphs based on the similarities of nodes.


User interest patterns have been studied chronologically to understand the underlying dynamics of the similarity determination process [13]. It has been proved that both path-based and neighbor-based approaches can significantly outperform the standard user-based algorithms [14]. Most of the studies on link prediction focus on un-weighted networks. Extensive work has been done to widen the information-carrying capacity of similarity metrics by creating their weighted versions [15]. However, their performance has been questionable when tested on real world networks. Experimental studies have been conducted to determine the role of weak links in improving the quality of results. Many other methods have been proposed to determine the real purpose of tie weights. Other research works, such as reference [16], use a projection mechanism to convert bi-partite graphs into non-bipartite ones and vice versa for the application of suitable mathematical models. Many kinds of research jointly utilize similarity-based methods along with other approaches. The work in reference [17] chooses to enhance famous similarity-based metrics with network centrality measures such as betweenness, closeness, and degree.

2.2. Distance between Feature Vectors

Though loosely based on references [18–20], our proposed methodology is distinguished by its attempt to incorporate local structural information by giving weight to the distances between similarity scores with neighbors. Two nodes are considered similar based on the nearness between their similarity scores with their neighbors. Various combinations of metrics were tried and their results were compared. The ultimate goal was to precisely predict future friends for a particular node in its neighborhood. A similar approach has been attempted for graph matching in reference [21], which uses the structural similarity of local neighborhoods to derive pair-wise similarity scores for the nodes of two different graphs, and presents a related similarity measure that uses a linear update to generate both node and edge similarity scores.

2.3. Similarity Measurement Methods

Similarity-based methods compute the pairwise similarity between nodes for the sake of predicting possible future links among them [10]. Some similarity metrics are in common use for analyzing the affinity between two data objects, be it a pair of vectors in a two-dimensional plane, two nodes in a graph, or, more precisely, two actors in a social network. These can also be classified broadly into similarity-based methods and learning-based methods [22]:

• Node-based metrics, which incorporate node feature information for computing similarity.
• Neighbor-based metrics, which incorporate a node's neighborhood information for computing similarity.
• Path-based metrics, which utilize the knowledge of various alternative paths between two nodes.
• Random walk-based metrics, which use transition probabilities from a node to its neighbors and non-connected nodes to simulate social interactions.

In the proposed model the neighbor-based metrics and vector or geographical distance-based metrics will be jointly employed. Therefore, in the upcoming sections we give a brief overview of the chosen metrics from among these two categories.

3. Neighbor Based Similarity Metrics

Some of the most commonly used neighborhood-based similarity metrics [10] are briefly introduced in the following paragraphs:

Edge Weight (EW), which measures the weight of the edge that links a pair of nodes in a social network graph. The presence of an edge between the two problem nodes is mandatory; therefore, EW cannot be used for an alikeness prediction between unconnected nodes.

EW(x, y) = Axy, where Axy is the weight of the edge that connects nodes x and y.

Common Neighbors (CN), which measures the number of neighbors that a pair of nodes have in common. Common neighbors are found using the intersection operation on the two sets. The greater the intersection of the neighbor sets of any two nodes, the greater the chance of future association between them, and vice versa.

CN(x, y) = |N(x) ∩ N(y)|, where |N(x)| represents the number of neighbors of node x, i.e., its degree.

Jaccard Neighbors (JN), which compares the number of common neighbors of x and y with respect to the total neighbors of x and y. The total number of neighbors is found using the union operation on the two sets. This metric is based on set overlapping and uses the ratio of the intersection to the union. An empty set in the denominator will cause numerical problems such as division by zero.

JN(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|

Adamic/Adar (AA), which evaluates the likelihood that a node x is linked to a node y as a sum over the neighbors they have in common. It measures the intersection of the neighbor-sets of two nodes in the graph while emphasizing the smaller overlap. This metric is dependent upon commonality among the two problem nodes.

AA(x, y) = Σ_{z ∈ N(x) ∩ N(y)} 1 / log|N(z)|

Preferential Attachment (PA), which calculates the product of the degrees of the nodes x and y, so the higher the degree of both nodes, the higher the similarity between them. This metric has the drawback of leading to high similarity values for highly connected nodes to the detriment of the less connected ones in the network.

PA(x, y) = |N(x)| · |N(y)|

In the upcoming section we discuss in detail the concept of the feature matrix and feature vector, named here the similarity score vector. The similarity score vector (SSV) is computed from the neighbor-based metrics discussed in the sections above.
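As a rough sketch, the five neighbor-based metrics above can be computed over a small weighted graph stored as an adjacency map. The graph below mirrors the sample network tabulated in Section 5; the function names are our own, not the paper's.

```python
import math

# Weighted graph as an adjacency map (mirrors the sample network of Section 5).
graph = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 4, "D": 1, "E": 2},
    "C": {"A": 2, "B": 4},
    "D": {"B": 1},
    "E": {"B": 2, "F": 1},
    "F": {"E": 1},
}

def neighbors(x):
    return set(graph[x])

def edge_weight(x, y):
    """EW: weight of the edge linking x and y (0 if no edge exists)."""
    return graph[x].get(y, 0)

def common_neighbors(x, y):
    """CN: |N(x) ∩ N(y)|."""
    return len(neighbors(x) & neighbors(y))

def jaccard_neighbors(x, y):
    """JN: |N(x) ∩ N(y)| / |N(x) ∪ N(y)|; 0 when the union is empty."""
    union = neighbors(x) | neighbors(y)
    return len(neighbors(x) & neighbors(y)) / len(union) if union else 0.0

def adamic_adar(x, y):
    """AA: sum of 1/log|N(z)| over common neighbors z (assumes |N(z)| > 1)."""
    return sum(1 / math.log(len(neighbors(z)))
               for z in neighbors(x) & neighbors(y))

def preferential_attachment(x, y):
    """PA: |N(x)| * |N(y)|."""
    return len(neighbors(x)) * len(neighbors(y))

print(common_neighbors("A", "C"))  # B is their only common neighbor -> 1
print(preferential_attachment("A", "B"))  # 2 * 4 -> 8
```

Note that AA is undefined when a common neighbor has exactly one neighbor (log 1 = 0), which echoes the division-by-zero caveat given for JN.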

4. The Similarity Score Vector

In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. Two data objects, or more familiarly, feature vectors, are comparable on the basis of their numerical attributes or dimensions, provided they are equal in number. When simulated on a two-dimensional plane, the distances between these vectors can be used to measure the farness or nearness between them. The lesser the difference between the values of their attributes, the more alike they are.

Utilization of the adjacency matrix for the mathematical representation of social networks is not an unfamiliar phenomenon. For example, to build a movie recommender system, a particular critic's review score (c) for a movie (m) will be represented by an element (c, m) in an adjacency matrix called the feature matrix [23].

     m1  m2  m3  m4  m5  m6
c1    3   7   4   9   9   7
c2    7   5   5   3   8   8
c3    7   5   5   0   8   4
c4    5   6   8   5   9   8
c5    5   8   8   8  10   9
c6    7   7   8   4   7   8

Each row in the matrix represents the ratings of a particular critic c for all the movies under consideration in the review/rating space. The difference in opinion of two critics happens to be equal to the farness between their corresponding rating vectors on a two-dimensional plane. If we want a feature vector per critic, then we can just take the rows of the matrix:

x1 = (3, 7, 4, 9, 9, 7)
...
x6 = (7, 7, 8, 4, 7, 8)

where x1 corresponds to the feature vector of Critic 1, x2 corresponds to the feature vector of Critic 2, and so on. Now, we propose that instead of measuring dissimilarity between the critics' opinions, we should measure the dissimilarity between movies. We compute the distance between movies in the space of the critics' ratings. Transposing the matrix, we get feature vectors for movies m1 to m6:

w1 = (3, 7, 7, 5, 5, 7)T
...
w6 = (7, 8, 4, 8, 9, 8)T

In our proposed model, each row of the adjacency matrix is composed of the similarity scores of a particular node x with all of its neighboring nodes. Since a square matrix possesses a uniform number of rows and columns, each row represents the similarity score vector of the node x. A pictorial representation of similarity score vectors for various nodes from a particular network snapshot is shown in Figure 2. It shows that the PA-SSV for node B grows linearly from node A to D, whereas that for node C remains a straight line. Thus, node C holds equal similarity with respect to the PA measure with all of its neighbors from A to D, and node B's similarity with respect to the PA measure increases gradually from node A to D.
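The row and column extraction described above can be sketched in plain Python; the transpose via zip is just one of several equivalent ways to obtain the movie vectors.

```python
# Critic-by-movie rating matrix from the text (rows c1..c6, columns m1..m6).
M = [
    [3, 7, 4, 9, 9, 7],
    [7, 5, 5, 3, 8, 8],
    [7, 5, 5, 0, 8, 4],
    [5, 6, 8, 5, 9, 8],
    [5, 8, 8, 8, 10, 9],
    [7, 7, 8, 4, 7, 8],
]

# Rows are critic feature vectors; transposing yields movie feature vectors.
MT = [list(col) for col in zip(*M)]

x1 = M[0]    # critic 1: (3, 7, 4, 9, 9, 7)
w1 = MT[0]   # movie m1: (3, 7, 7, 5, 5, 7)
w6 = MT[5]   # movie m6: (7, 8, 4, 8, 9, 8)
```

The same row extraction applies verbatim to our adjacency matrix: row x of the matrix is the similarity score vector of node x.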


Figure 2. Preferential Attachment (PA) similarity score vectors (SSV) of nodes B and C with neighbors A, B, C, D, respectively.


This approach can be tested to perform link prediction using similarity scores on an experimental dataset split into two halves at a particular timestamp t. The former set contains the data collected before time t and the latter set is used for evaluation purposes. Our work, however, purely generates theoretical results and needs to be tested on real world social network data.
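A minimal sketch of such a temporal split, assuming a hypothetical list of timestamped interactions; the record layout (u, v, weight, timestamp) and the timestamps themselves are illustrative, not from the paper.

```python
# Hypothetical timestamped interactions: (u, v, weight, timestamp).
interactions = [
    ("A", "B", 4, 100), ("A", "C", 2, 120),
    ("B", "C", 4, 150), ("B", "D", 1, 210),
    ("B", "E", 2, 230), ("E", "F", 1, 250),
]

t = 200  # split timestamp

# Edges observed before t train the similarity model; the rest are held out.
train = [e for e in interactions if e[3] < t]
test = [e for e in interactions if e[3] >= t]

# Links predicted from `train` would then be checked against `test`.
print(len(train), len(test))  # 3 3
```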

Vector Based Similarity Metrics

The pair-wise distance between two nodes is computed by applying a variety of distance-based similarity metrics on their respective similarity score vectors:

Euclidean distance, which is the difference between individual attributes of two data objects, provided the number of attributes and the scale used to normalize them are uniform for the pair [23]; mathematically,

d(p, q) = √((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²) = √(Σ_{i=1}^{n} (pi − qi)²)

Pearson's coefficient, a linear similarity measure which uses mean centering and normalization of profiles. It uses a best fit line that runs through the attributes of the two data objects, plotted as data points on a two-dimensional plane. It is calculated by dividing the covariance of the two data objects by the product of their standard deviations. It performs better than Euclidean distance for non-normally distributed data.

ρ(x, y) = Cov(x, y) / (σx σy)

Cosine similarity, a linear similarity measure that uses normalization of profiles. It measures the similarity between two data objects, treated as vectors, as the angle between them. This method is useful when finding the similarity between two text documents whose attributes are word frequencies. A perfect similarity will have a score of 1 (or an angle of 0) and no similarity will have a score of zero (or an angle of 90 degrees).

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

Tanimoto coefficient, a generalized version of the Jaccard coefficient, a set overlap measure. Here A and B are data objects represented by vectors. The similarity score is the dot product of A and B divided by the squared magnitudes of A and B minus the dot product. A perfect similarity score would be one.

T(A, B) = (A · B) / (‖A‖² + ‖B‖² − A · B)

Using the grocery store example, the Tanimoto coefficient ensures that a customer who buys five apples and one orange will be different from a customer who buys five oranges and an apple. All of these measures perform well for non-bipartite graphs. The social network graph under consideration is strictly non-bipartite: its adjacency matrix comprises nodes along rows as well as along columns, thus producing a people-people network or social network.

The Euclidean distance and its variations are easily computed using built-in functions provided by mathematical and statistical tools such as Matlab and R. Hamming, Manhattan, Minkowski, and City-block distances are a few famous variations of this universal metric. Similarity is further worked out as a decreasing function of the geographic or Euclidean distance with the formula given below [23]:

sim(x, y) = 1 / (1 + d(x, y))
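The four vector-based metrics and the distance-to-similarity conversion can be sketched as follows; the implementations simply transcribe the formulas above, and the sample vectors reuse two critic rating rows from Section 4.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def pearson(p, q):
    """Pearson's coefficient: covariance over the product of std deviations."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((pi - mp) * (qi - mq) for pi, qi in zip(p, q)) / n
    sp = math.sqrt(sum((pi - mp) ** 2 for pi in p) / n)
    sq = math.sqrt(sum((qi - mq) ** 2 for qi in q) / n)
    return cov / (sp * sq)

def cosine(p, q):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (math.sqrt(sum(pi * pi for pi in p)) *
                  math.sqrt(sum(qi * qi for qi in q)))

def tanimoto(p, q):
    """Tanimoto coefficient: A.B / (|A|^2 + |B|^2 - A.B)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (sum(pi * pi for pi in p) + sum(qi * qi for qi in q) - dot)

def distance_to_similarity(d):
    """sim = 1 / (1 + d): a distance of 0 maps to a similarity of 1."""
    return 1 / (1 + d)

# Two critic rating vectors from Section 4.
p, q = [3, 7, 4, 9, 9, 7], [7, 5, 5, 3, 8, 8]
print(distance_to_similarity(euclidean(p, q)))
```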

The Euclidean distance is thus converted into a measure of similarity. It compares the pair-wise affinity of nodes using their similarity score vectors. The higher the Euclidean distance score, the lesser will be their similarity score, and vice versa. Hence a distance score of zero corresponds to a similarity of one (the largest value it can take), and the similarity approaches zero (the smallest value it can take) as the distance score grows large.

Pearson's coefficient uses a trend line to decide whether the data points are correlated positively or negatively. It is also possible that two sets of data points are simply uncorrelated. An evaluation chart which replicates a scatter plot of data points on a two-dimensional plane serves the purpose.

5. The Proposed Model

The proposed model followed a simple set of steps in order to achieve its objectives. The first step was to compute various neighbor-based similarity metrics for the sample social network graph shown in Figure 3 and craft an adjacency matrix.

Figure 3. Weighted network graph.

The adjacency matrix for the edge weight measure is tabulated below.

     A    B    C    D    E    F    mean EW
A    0    4    2    0    0    0    1
B    4    0    4    1    2    0    1.83
C    2    4    0    0    0    0    1
D    0    1    0    0    0    0    0.167
E    0    2    0    0    0    1    0.5
F    0    0    0    0    1    0    0.167

The social network graph simulates actors as nodes, communications as edges, and the number of messages transmitted or received as edge weights. The average similarity score per node was obtained from the adjacency matrix. Secondly, we selected two similarity score vectors from the adjacency matrix and computed vector-based similarity metrics for them. We repeated the procedure for node A's vector with all of its neighboring nodes' vectors in order to find the most similar and least similar nodes for node A. Next, we employed line charts to illustrate the average neighbor-based metrics and bar charts to depict the similarity rank for vector-based metrics. Lastly, we compared the results of both charts. A high-level diagram of the whole course of action is shown in Figure 4 [2].
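Under the assumption that the "mean EW" column is simply the row average of the adjacency matrix (which the tabulated values are consistent with), the column can be reproduced as a sketch:

```python
# Edge-weight adjacency matrix of the sample graph, nodes A..F.
nodes = ["A", "B", "C", "D", "E", "F"]
EW = [
    [0, 4, 2, 0, 0, 0],
    [4, 0, 4, 1, 2, 0],
    [2, 4, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 2, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0],
]

# Mean edge weight per node: the row average over all six columns.
mean_ew = {n: sum(row) / len(row) for n, row in zip(nodes, EW)}
print(mean_ew)  # A -> 1.0, B -> 1.83..., D -> 0.167 (rounded), etc.
```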
The network graph under consideration shows that for node A, the most similar node is node C, with the same number of ties, tie weights, and same connected neighbors. The output variables used for evaluation of the generated results are listed below:

• most similar node for node A
• least similar node for node A

In the next section we discuss the empirical setup, the experiment, and the analysis of the results of the proposed model.

Line chart from Bar chart from neighbour based vector based similarity scores similarity scores

Repeat for other Comparison of metrics charts

Figure 4. High level view of the proposed model.

The network graph under consideration shows that for node A, the most similar node is node C, with the same number of ties, the same tie weights and the same connected neighbors. The output variables used for the evaluation of the generated results are listed below:
• most similar node for node A
• least similar node for node A

In the next section we discuss the empirical setup, the experiment and the analysis of the results of the proposed model.


Figure 3. Weighted network graph.




6. Empirical Setup and Evaluation

In this section we evaluate the performance of the previously discussed neighbor-based similarity metrics in recommending new potential friends, in comparison to the joint incorporation of neighbor-based and vector-based similarity metrics.

Figure 5 indicates the mean similarity score of all nodes. The most and least similar nodes for node A, as determined by the various neighbor-based metrics, are tabulated in Table 2. According to the presented results, node C is the most similar node for node A, agreed upon by all metrics except AA; for AA, node B happens to be the most similar.

[Figure 5 (chart): mean similarity per node A–F for the PA, JCN, CN, EW and AA metrics.]

Figure 5. Neighbor based similarity scores for all nodes.

Table 2. Most and least similar nodes for node A determined by neighbor-based metrics.

        Most Similar to Node A      Least Similar to Node A
PA      C                           D,F
JCN     C                           F
CN      C                           F
EW      C                           D,F
AA      B                           F
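The neighbor-based metrics compared in Table 2 have compact set-based definitions. The sketch below shows the standard unweighted forms of CN, JCN (Jaccard), PA and AA on the example graph's neighbor sets; the paper's experiment uses the weighted graph, so these values are illustrative rather than a reproduction of the table:

```python
import math

# Unweighted neighbor sets read off the network graph (Figure 3).
N = {
    "A": {"B", "C"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B"},
    "D": {"B"},
    "E": {"B", "F"},
    "F": {"E"},
}

def cn(x, y):   # common neighbors
    return len(N[x] & N[y])

def jcn(x, y):  # Jaccard coefficient: shared neighbors over all neighbors
    return len(N[x] & N[y]) / len(N[x] | N[y])

def pa(x, y):   # preferential attachment: product of degrees
    return len(N[x]) * len(N[y])

def aa(x, y):   # Adamic/Adar: rarer common neighbors weigh more
    return sum(1 / math.log(len(N[z])) for z in N[x] & N[y])

print(cn("A", "C"), round(jcn("A", "C"), 3), pa("A", "C"), round(aa("A", "C"), 3))
# 1 0.333 4 0.721
```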

The least similar node is obviously node F, agreed upon by all metrics. It is evident that only the JCN and CN metrics determined similarity closest to reality; the most precise proximity prediction is accomplished by JCN and CN. The work in reference [10] compared various neighbor-based similarity measures and discovered that edge weight is the best among its competitors.

Figures 6–8 illustrate the most similar and least similar nodes for node A as identified by the vector-based similarity metrics. The results obtained for the most similar node for node A are summarized in Table 3. The work in reference [12] compares two commonly used distance measures in vector models, namely the Euclidean distance and the cosine angle distance, for nearest neighbor determination in high-dimensional data spaces. Through theoretical analysis as well as experimental evaluation, it was revealed that retrieval results based on the Euclidean distance are similar to those based on the cosine angle distance when the dimension is high.
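The vector-based measures have equally compact definitions. The sketch below gives standard formulations; converting the Euclidean distance to a similarity via 1/(1 + d) is one common convention and an assumption on our part, not necessarily the paper's exact choice:

```python
import math

def euclidean_sim(x, y):
    # similarity derived from Euclidean distance via 1 / (1 + d)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1 / (1 + d)

def cosine_sim(x, y):
    # normalized inner product: angular orientation, not magnitude
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def tanimoto_sim(x, y):
    # extended Jaccard: dot product over "union" of squared norms
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# Example: rows A and C of the edge-weight adjacency matrix.
a = [0, 4, 2, 0, 0, 0]
c = [2, 4, 0, 0, 0, 0]
print(round(euclidean_sim(a, c), 3),
      round(cosine_sim(a, c), 3),
      round(tanimoto_sim(a, c), 3))
# 0.261 0.8 0.667
```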

[Figure 6 (chart): Euclidean similarity scores for vector pairs A&B through A&F under the PA, JCN, CN, EW and AA metrics.]

Figure 6. Similarity for neighbor-based metrics.

[Figure 7 (chart): Tanimoto similarity scores for vector pairs A&B through A&F under the PA, JCN, CN, EW and AA metrics.]

Figure 7. Coefficient similarity for neighbor-based metrics.


[Figure 8 (chart): cosine similarity scores for vector pairs A&B through A&F under the PA, JCN, CN, EW and AA metrics.]

Figure 8. Similarity for neighbor-based metrics.

Table 3. Most similar node for node A.

                    Msn (PA)    Msn (JCN)    Msn (CN)    Msn (EW)    Msn (AA)
Sim(x,y) Euclidean  C,E         C            C,D         C           C
Sim(x,y) Tanimoto   C,E         C            C,D         C           C
Sim(x,y) Cosine     D,F         C            C,D         D           C

• Table 3 clearly shows that, when teamed up with vector-based measures, JCN and AA determined node similarity most precisely.
• Furthermore, the table tells us that, among the vector-based measures, the Euclidean distance and the Tanimoto coefficient produced analogous results.

The value of Pearson's coefficient computed for the various neighbor-based metrics is presented in Table 4 and is evaluated using Figure 9.
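The cosine column of Table 3 for EW can be checked directly from the edge-weight vectors of the adjacency matrix. In the sketch below (a standard cosine formulation, not code from the paper), node D indeed comes out most similar to node A under cosine, and node F least similar:

```python
import math

def cosine(x, y):
    # normalized inner product; 0.0 for a zero vector
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

rows = {  # edge-weight vectors (rows of the adjacency matrix)
    "A": [0, 4, 2, 0, 0, 0], "B": [4, 0, 4, 1, 2, 0], "C": [2, 4, 0, 0, 0, 0],
    "D": [0, 1, 0, 0, 0, 0], "E": [0, 2, 0, 0, 0, 1], "F": [0, 0, 0, 0, 1, 0],
}
scores = {n: cosine(rows["A"], v) for n, v in rows.items() if n != "A"}
most, least = max(scores, key=scores.get), min(scores, key=scores.get)
print(most, least)  # D F
```

D scores highest because its single tie (to B) aligns exactly with A's strongest tie, illustrating that cosine rewards angular alignment rather than magnitude.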
The key rules for evaluation are listed below:

1. A value above 0 represents positive correlation
2. A value below 0 represents negative correlation
3. A value of 0 represents that the two vectors in question are uncorrelated

Table 4. Pearson's coefficient for neighbor-based metrics.

        ρ (PA)    ρ (JCN)    ρ (CN)    ρ (EW)    ρ (AA)
A & B   −0.4      −0.6       −0.2      −0.2      0
A & C   0.2       0.4        0.2       0.6       0
A & D   0.5       −0.1       0.2       0.7       −0.8
A & E   0.2       0.3        0.0       0.6       −0.8
A & F   0.5       0.0        0.0       −0.2      0.8

Figure 9. Chart for Pearson's coefficient.
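The three evaluation rules above can be checked against a direct implementation of Pearson's coefficient (a standard formulation of the normalized covariance, not code from the paper):

```python
import math

def pearson(x, y):
    # normalized covariance between two equal-length vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0  (rule 1: positive correlation)
print(round(pearson([1, 2, 3], [6, 4, 2]), 6))  # -1.0 (rule 2: negative correlation)
```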


• According to the tabulated results, all metrics agree on the similarity, or positive correlation, of node C with node A, except for Adamic/Adar, which evaluates them to be uncorrelated.
• It is evident that Pearson's coefficient did not work well with AA.
• From the metric perspective, JCN appears to be the best predictor, with CN standing in second position.

In the next section we justify our submitted results by comparing them with the latest research on the subject. Our list of conclusions is also supported by many cited references.

7. Analysis of Results

In order to deeply envision the results produced, let us take a quick glance at the performance and robustness of neighbor-based and vector-based metrics in existing research. Liben-Nowell and Kleinberg in reference [24] used a co-authorship network of scientists to prove that an evolutionary, dynamic spatio-temporal network is affected more by endogenous changes than by exogenous ones. They concluded that a similarity determination algorithm for link prediction based only on network structure outperforms direct or neighbor-based measures. The correctness of prediction rises by a factor of 40–50 for network-structure-based measures as compared to random prediction, as elaborated by Table 5. Graph distance predictors lie between the two extremes, with mediocre performance.

Table 5. Improvement factor of various techniques over correct random prediction [24].

Technique                     Probability of Correct Random Prediction    Improvement Factor
Graphical distance metrics    0.153%                                      29.2
CN                            0.153%                                      47.2
PA                            0.207%                                      15.2
AA                            0.147%                                      54.8
Jaccard/Tanimoto              0.147%                                      42.3

Xu and Zhang in reference [25] attempted to lift the performance of four chosen graph distance descriptors by using time as an active factor. The idea was to convert existing structural similarity measures into temporal similarity measures. They used two datasets: the first was a citation network derived from citation data for high-energy physics papers posted on arXiv.org between 1992 and 2001, while the second was a cooperation network extracted from paper metadata in the high-energy physics theory section between 1991 and 2012. They concluded that PA was enhanced more by the active factor than the other measures and produced more accurate results. Sharma, Sharma and Khatri in reference [26] compared twelve graphical distance metrics using the Epinions dataset, which comes from a general consumer review website covering a large range of products; it belongs to the who-trusts-whom category of online social networks. They concluded that node neighbor behaved best in terms of the precision of the results produced, whereas PA behaved worst, as indicated by Table 6.

Table 6. Precision of various metrics [26].

Technique                   Precision
Node Neighbor               16.4
Jaccard/Tanimoto            15.8
Adar/Adamic                 15.1
Preferential attachment     7.1

The results produced by our theoretical analysis and evaluation are summarized in Table 7:

Table 7. The best metrics as proposed by our experiment, in comparison to recent research.

Best Metric                         Experimental    Cited
JCN/CN                              Table 2         Tables 5 and 6
JCN + Euclidean/Tanimoto/Cosine     Table 3         Table 5
AA + Euclidean/Tanimoto/Cosine      Table 3         Table 6
Pearson + JCN/CN                    Table 4         Table 8

Table 8. Properties and limitations of distance-based measures.

Euclidean
Properties: geometric distance measure; alignment of vectors
Limitations: less optimal in higher-dimensional space; sparse vectors; sensitive to data distortion and outliers

Cosine
Properties: normalized inner product; symmetric; incorporates semantics; uses relative frequencies; best for sparse vector spaces; same for a superset and its subset; multiplicative and additive identity invariant
Limitations: less optimal in lower-dimensional space; scale and origin shift variant; affected by linear transformation

Tanimoto
Properties: intersection of objects over union of objects; extension of Jaccard; best for asymmetric binary variables; range between 0 and 1; origin and scale shift invariant; symmetric
Limitations: less optimal in lower molecular-complexity space; not deliberately used as a distance measure

Pearson
Properties: range between −1 and +1; normalized covariance; different for a superset and its subset; multiplicative and additive identity invariant
Limitations: neglects the role of a third variable; assumption of linearity; time consuming; sensitive to extreme values

The case of distance or vector-based measures is different from that of the neighbor-based measures. Euclidean distance is affected by every term which is non-zero in either vector, and it tends to suffer from the curse of dimensionality: as the dimension grows higher and the vector space grows sparser, its performance slowly degrades. This means it is not reliable for larger social network datasets. A possible solution to this problem is normalization of the outlier/noisy feature, provided its minimum and maximum possible values are known; scaling can also be applicable for certain datasets. Cosine, on the other hand, depicts the angular orientation of the vectors rather than their magnitudes. It is affected by the terms which two vectors have in common and captures the meaningful semantics of similarity. The properties and limitations of the distance metrics are tabulated in Table 8.
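The min–max normalization suggested above can be sketched as follows, assuming the feature's minimum and maximum possible values are known in advance (the function name is ours, for illustration only):

```python
def min_max_normalize(v, lo, hi):
    # rescale each component of v into [0, 1] using the known bounds
    return [(x - lo) / (hi - lo) for x in v]

# Example: an edge-weight vector with known bounds 0..4.
print(min_max_normalize([0, 4, 2, 0, 0, 0], 0, 4))
# [0.0, 1.0, 0.5, 0.0, 0.0, 0.0]
```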

8. Conclusions and Future Work

Incorporation of a useful and accurate user similarity determination system is an integral part of social networks. In this paper, we have considered a simple network graph and used social network analysis and mathematical techniques for the determination of similarity between nodes. The data analysis showed that JCN and AA, when vigilantly combined with the Euclidean distance, the Tanimoto coefficient or cosine similarity, provided the most accurate results. However, one of the biggest anomalies of the neighbor-based metrics is their absolute dependence upon the existence of a tie between a pair of nodes, which is possible only if the nodes are friends already; they are therefore of little use for similarity prediction. Among the distance-based measures, Pearson's coefficient performed admirably well when paired with JCN and CN but produced surprising results with AA.

A theoretical foundation for the proposed model has been established. Practical evaluation on a real-world social network dataset is an appealing consideration for future work. Though the results produced are fairly well justified by similar studies, there is no clear winner among the metrics discussed. It is quite likely that similarity metrics, when paired differently, work well together under certain circumstances and behave erratically under others, so trying various combinations of metrics might be promising. Weighted versions of the metrics could be tried to produce more precise results for weighted networks. The limitations of certain metrics, such as edge weight (the existence of a tie is mandatory between two nodes, for which they have to be friends already), could be taken into account and studied. Also, the reasons behind the poor performance of AA with Pearson need to be explored. The roots of the variation in results lie in the underlying dynamics and randomly shifting characteristics of a real-world social network. However, it can safely be concluded that the more node and network structure properties a model reveals, the better the results produced will be. To achieve this goal, better and newer methods and metrics should be incorporated and evaluated.
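As one concrete direction for the weighted metrics mentioned above, the common-neighbors score can be extended to sum the edge weights through each shared neighbor. This is one possible weighted variant, sketched here as an illustration rather than the paper's formulation:

```python
# One possible weighted common-neighbors variant:
# score(x, y) = sum over common neighbors z of w(x, z) + w(y, z).
W = {  # symmetric edge weights taken from the adjacency matrix
    ("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 4,
    ("B", "D"): 1, ("B", "E"): 2, ("E", "F"): 1,
}

def w(x, y):
    return W.get((x, y)) or W.get((y, x)) or 0

def neighbors(x):
    return {b for a, b in W if a == x} | {a for a, b in W if b == x}

def weighted_cn(x, y):
    return sum(w(x, z) + w(y, z) for z in neighbors(x) & neighbors(y))

print(weighted_cn("A", "C"))  # common neighbor B: w(A,B) + w(C,B) = 4 + 4 = 8
```

Unlike plain CN, this score distinguishes strong shared ties from weak ones, which matters in a weighted communication network.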

Author Contributions: Conceptualization, S.T. and M.S. (Muhammad Shahbaz); methodology, S.T.; software, M.S. (Muhammad Saleem).; validation, M.S. (Muhammad Shahbaz) and M.S. (Muhammad Saleem); formal analysis, M.S. (Muhammad Shahbaz) and M.S. (Muhammad Saleem); investigation, S.T. and M.S. (Muhammad Saleem).; resources, S.T. and M.S. (Muhammad Shahbaz); data curation, M.S. (Muhammad Saleem); writing—original draft preparation, S.T.; writing—review and editing, M.S. (Muhammad Shahbaz); visualization, M.S. (Muhammad Shahbaz) and M.S. (Muhammad Saleem); supervision, S.T. and M.S. (Muhammad Shahbaz); funding; S.T. and M.S. (Muhammad Saleem). Funding: This research received no external funding and the APC was funded by [Sadia Tariq and Muhammad Saleem]. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Liang, X.; Li, X.; Zhang, K.; Lu, R.; Lin, X.; Shen, X.S. Fully anonymous profile matching in mobile social networks. IEEE J. Sel. Areas Commun. 2013, 31, 641–655. [CrossRef]
2. Mazhari, S.; Fakhrahmad, S.M.; Sadeghbeygi, H. A user-profile-based friendship recommendation solution in social networks. J. Inf. Sci. 2015, 41, 284–295. [CrossRef]
3. Kleinberg, J. The small-world phenomenon: An algorithmic perspective. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 21–23 May 2000.
4. Lin, D. An Information-Theoretic Definition of Similarity. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.1832&rep=rep1&type=pdf (accessed on 15 April 2019).
5. Gurcan Akcora, C.; Carminati, B.; Ferrari, E. User similarities on social networks. Soc. Netw. Anal. Min. 2013, 3, 475–495. [CrossRef]
6. Rawashdeh, A.; Ralescu, A.L. Similarity Measure for Social Networks—A Brief Survey. In Proceedings of the Modern AI and Cognitive Science Conference (MAICS), Greensboro, NC, USA, 25–26 April 2015; Curran Associates, Inc.: Greensboro, NC, USA, 2015.
7. Şimşek, Ö.; Jensen, D. Navigating networks by using homophily and degree. Proc. Natl. Acad. Sci. USA 2008, 105, 12758–12762. [CrossRef] [PubMed]
8. Sandberg, O. The Structure and Dynamics of Navigable Networks; Division of Mathematical Statistics, Department of Mathematical Sciences, Chalmers University of Technology and Göteborg University: Göteborg, Sweden, 2007.
9. Banks, L.; Bhattacharyya, P.; Spear, M.; Wu, S.F. Davis social links: Leveraging social networks for future internet communication. In Proceedings of the Ninth Annual International Symposium on Applications and the Internet (SAINT'09), Bellevue, WA, USA, 20–24 July 2009.
10. Jimenez-Diaz, G.; Gómez-Martín, P.P.; Gómez-Martín, M.A.; Sánchez-Ruiz, A.A. Similarity metrics from social network analysis for content recommender systems. AI Commun. 2017, 30, 223–234. [CrossRef]
11. Adamic, L.A.; Adar, E. Friends and neighbors on the web. Soc. Netw. 2003, 25, 211–230. [CrossRef]
12. Nikolić, M. Measuring similarity of graph nodes by neighbor matching. Intell. Data Anal. 2012, 16, 865–878. [CrossRef]

13. Ricci, F.; Rokach, L.; Shapira, B. Recommender systems: Introduction and challenges. In Recommender Systems Handbook; Springer: Boston, MA, USA, 2015; pp. 1–34.
14. Chen, H.; Li, X.; Huang, Z. Link prediction approach to collaborative filtering. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'05), Denver, CO, USA, 7–11 June 2005.
15. Lü, L.; Zhou, T. Link prediction in weighted networks: The role of weak ties. EPL (Europhys. Lett.) 2010, 89, 18001.
16. Zhou, T.; Ren, J.; Medo, M.; Zhang, Y.-C. Bipartite network projection and personal recommendation. Phys. Rev. E 2007, 76, 046115. [CrossRef] [PubMed]
17. Liu, H.; Hu, Z.; Haddadi, H.; Tian, H. Hidden link prediction based on node centrality and weak ties. EPL (Europhys. Lett.) 2013, 101, 18004. [CrossRef]
18. Mafteiu-Scai, L.O. A new dissimilarity measure between feature-vectors. Int. J. Comput. Appl. 2013, 64, 39–44.
19. Balmachnova, E.; Florack, L.; ter Haar Romeny, B. Feature vector similarity based on local structure. Presented at the International Conference on Scale Space and Variational Methods in Computer Vision, Ischia, Italy, 30 May–2 June 2007.
20. Qian, G.; Sural, S.; Gu, Y.; Pramanik, S. Similarity between Euclidean and cosine angle distance for nearest neighbor queries. In Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, 14–17 March 2004.
21. Zager, L.A.; Verghese, G.C. Graph similarity scoring and matching. Appl. Math. Lett. 2008, 21, 86–94. [CrossRef]
22. Wang, P.; Xu, B.; Wu, Y.; Zhou, X. Link prediction in social networks: The state-of-the-art. Sci. China Inf. Sci. 2015, 58, 1–38. [CrossRef]
23. Shimodaira, H. Similarity and Recommender Systems; School of Informatics, The University of Edinburgh: Edinburgh, UK, 2014; p. 21.
24. Liben-Nowell, D.; Kleinberg, J. The link prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 2007, 58, 1019–1031. [CrossRef]
25. Xu, H.H.; Zhang, L.J. Application of link prediction in temporal networks. Adv. Mater. Res. 2013, 756, 2231–2236. [CrossRef]
26. Sharma, D.; Sharma, U.; Khatri, S.K. An experimental comparison of the link prediction techniques in social networks. Int. J. Model. Optim. 2014, 4, 21. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).