RECOMMENDING COLLABORATIONS

USING

LINK PREDICTION

A thesis submitted in partial fulfillment of the requirements for

the degree of Master of Science

By

NIKHIL CHENNUPATI

B. Tech., Gandhi Institute of Technology and Management,

India, 2016

2021

Wright State University

WRIGHT STATE UNIVERSITY
GRADUATE SCHOOL

April 21, 2021

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Nikhil Chennupati ENTITLED Recommending Collaborations Using Link Prediction BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

______Tanvi Banerjee, Ph.D. Thesis Director

______Mateen M. Rizki, Ph.D. Chair, Department of Computer Science and Engineering

Committee on Final Examination

______Tanvi Banerjee, Ph.D.

______Krishnaprasad Thirunarayan, Ph.D.

______Michael L Raymer, Ph.D.

______Barry Milligan, Ph.D.
Vice Provost for Academic Affairs
Dean of the Graduate School

ABSTRACT

Chennupati, Nikhil. M.S., Department of Computer Science and Engineering, Wright State University, 2021. Recommending Collaborations Using Link Prediction.

Link prediction in the domain of scientific collaborative networks refers to exploring and determining whether a connection between two entities in an academic network may emerge in the future. This study aims to analyse the relevance of academic collaborations and identify the factors that drive co-author relationships in a heterogeneous bibliographic network. Using topological, semantic, and graph representation learning techniques, we measure the authors' similarities with respect to their structural and publication data to identify the reasons that promote co-authorships.

Experimental results show that the proposed approach successfully infers co-author links by identifying authors with similar research interests. Such a system can be used to recommend potential collaborations among authors.


Table of Contents

1. Introduction ...... 1
1.1. Overview ...... 1
1.2. Link Prediction for Recommending Author Collaborations ...... 4
1.3. Research Questions and Contributions ...... 5
1.4. Thesis Outline ...... 7
2. Related Work ...... 8
2.1. Feature Extraction Based Methods ...... 9
2.1.1. Similarity-based Metrics ...... 9
2.1.2. Probabilistic and Maximum-Likelihood Models ...... 20
2.2. Feature Learning Methods ...... 25
2.2.1. Matrix Factorization Methods ...... 26
2.2.2. Random Walk Based Methods ...... 29
2.2.3. Neural Network-based Methods ...... 33
3. Methods ...... 37
3.1. Feature Extraction Methods ...... 37
3.1.1. Feature Extraction Based on Topology ...... 37
3.1.2. Feature Extraction Based on Node Attributes (Semantic Similarity) ...... 41
3.2. Network Embedding Based Approach for Link Prediction ...... 45
3.2.1. Homogeneous Network Embedding ...... 45
3.2.2. Heterogeneous Network Embedding ...... 46
3.2.3. Weighted Meta-path Biased Random Walks ...... 46
3.2.4. Heterogeneous Skip-gram Model ...... 49
3.3. Supervised Algorithms ...... 51
3.3.1. Logistic Regression ...... 51
3.3.2. Support Vector Machines ...... 52
3.3.3. Random Forests ...... 53
3.3.4. AdaBoost ...... 54
3.4. Evaluation Metrics ...... 55
3.4.1. Precision ...... 55
3.4.2. Recall ...... 56
3.4.3. F-measure ...... 56
3.4.4. AUC Score ...... 56
4. Data and Experimental Setup ...... 58
4.1. Data ...... 58
4.1.1. Microsoft Academic Graph ...... 59
4.1.2. Data Collection ...... 60
4.1.3. Building a Collaboration Graph ...... 61
4.2. Link Prediction Problem ...... 62
4.2.1. Case 1: Experiment with Negative Samples as Nodes n-hop Away ...... 63
4.2.2. Case 2: Experiment with Randomly Chosen Negative Samples ...... 64
4.3. Generating Link Prediction Features ...... 65
4.4. Choosing a Binary Classifier ...... 65
4.5. Network Embedding Based Approach for Predicting Future Collaborations ...... 65
4.5.1. Generating Node Embeddings ...... 66
4.5.2. Prediction Pipeline ...... 68
5. Results and Discussion ...... 72
5.1. Feature Extraction Based Approach Results ...... 72
5.1.1. Results of Experiments with Negative Samples as Nodes n-hop Away ...... 73
5.1.2. Results of Experiments with Randomly Chosen Negative Samples ...... 75
5.1.3. Comparing Results of Case-1 and Case-2 ...... 76
5.2. Network Embedding Based Approach Results ...... 76
5.2.1. Author's Node Embedding Visualizations ...... 77
5.2.2. Weighted Meta-path Based Results ...... 78
5.3. Case Study: Relevant Author Search ...... 81
5.4. Comparison of Feature Extraction Based and Network Embedding Based Approach ...... 82
6. Conclusion and Future Work ...... 83
References ...... 84


List of Figures

Figure 1. Trending authors in machine learning (adapted from academic.microsoft.com) ...... 2
Figure 2. Trending topics in all fields (adapted from academic.microsoft.com) ...... 3
Figure 3. A sample collaboration graph of authors from different institutes ...... 5
Figure 4. Pipeline of the feature extraction and learning-based approach ...... 6
Figure 5. Overarching block diagram of weighted meta-path-based network embedding method ...... 7
Figure 6. Taxonomy of link prediction approaches ...... 9
Figure 7. Local probabilistic model ...... 21
Figure 8. Frequency of common authors vs percentage of collaborations ...... 38
Figure 9. Weighted meta-path approach using supervised learning ...... 48
Figure 10. Weighted meta-paths and their importance scores ...... 49
Figure 11. MAG schema and MAG entity data (adapted from academic.microsoft.com) ...... 59
Figure 12. Hierarchy describing the 5 different entity types and the properties of each entity type ...... 61
Figure 13. Heterogeneous graph schema ...... 62
Figure 14. Training data frame ...... 64
Figure 15. Pipeline for meta-path-based network embedding approach ...... 66
Figure 16. Author - Paper - Venue - Paper - Author meta-path ...... 66
Figure 17. Weighted meta-path learning pipeline ...... 68
Figure 18. ROC-AUC curve with topological and semantic features using Random Forests ...... 74
Figure 19. t-SNE visualizations of node embeddings learned from MAG data ...... 77
Figure 20. Author node embeddings and link (co-author) embeddings using PCA ...... 78


List of Tables

Table 1. Sample author pairs with paper title similarity ...... 43

Table 2. Meta-paths and their semantic meaning ...... 47

Table 3. Distribution of the count of each node type ...... 60

Table 4. Description of relationships in the graph ...... 62

Table 5. Choice of binary operator ...... 70

Table 6. Different classifier performances for predicting co-author relations using Logistic, SVMs, Random Forest, and AdaBoost techniques ...... 73

Table 7. Classifiers performance with randomly chosen negative samples ...... 75

Table 8. AUC scores of weighted meta-path approach with different classifiers ...... 80

Table 9. A Case study of relevant author search ...... 81


Acknowledgment

This work was supported by the US Air Force, Grant-APEX.

I would firstly like to thank my advisor, Dr. Tanvi Banerjee, for providing this opportunity to work along with her team. Not only has she been a great advisor to me, but a great mentor as well. She provided me with constant support both personally and academically through my graduate journey. I have learned a lot from my interactions with her and will carry these valuable lessons throughout my life.

I would like to thank Dr. T.K. Prasad and Dr. Michael Raymer for the insightful discussions, brainstorming sessions, support, and guidance. I will continue to be in awe of their attention to detail and ability to identify significant problems that made this work more meaningful.

My graduate program has been a fulfilling one, and I am excited for whatever is in store next.

Finally, I must express my deepest gratitude to my parents and my family for their constant support and encouragement throughout my life. This accomplishment would not have been possible without them.


1. Introduction

1.1. Overview

Collaborative research is an important and growing aspect of scientific study. These associations among researchers often yield positive results. Researchers can share thoughts, get inspired, and, most importantly, avoid redundant effort; working together is often easier and more productive than working alone. These partnerships may help investigators understand how methods from various disciplines can be integrated to address a problem and contribute to creating novel ideas. Wilbur and Orville Wright (Goldblatt et al., 2002) were first successful at repairing bicycles before revolutionizing aircraft technology, developing the three-axis controls that allowed fixed-wing aircraft to fly. Their collaborative effort turned manufacturing and commerce into something that people had not even imagined, and their joint work culminated in one of the formative 20th-century projects that gave birth to a whole new world. Another remarkable achievement in the scientific world is the International Space Station Program, which involves space agencies from the USA, Japan, Europe, and Russia. This is not just a meeting of minds but a meeting of physical components as well. Identifying and retaining important alliances is essential for research groups because collaborators can complement work on the same research challenge and yield more effective outcomes. On the other hand, seeking a suitable collaborator can be complex and time-consuming.

These research collaborations can be represented as networks or graphs. According to Wikipedia, a collaboration graph is a graph modeling some social network where the vertices represent participants of that network (usually individual people) and two distinct participants are joined by an edge whenever there is a collaborative relationship between them, such as co-authoring a paper, publishing at the same venue, or being affiliated with the same organization. Collaboration graphs enable measurement of the closeness of collaborative relationships between the network participants. However, graphs can be used for more than just organized information repositories; they can also be used to conduct analytics and gain insights into science, such as the top authors in a field, the research importance of institutes, and author citation growth rates. Figure 1 and Figure 2 below illustrate the nature of insights that could be established from structured graph data.

Figure 1. Trending authors in machine learning (adapted from academic.microsoft.com)


Figure 2. Trending topics in all fields (adapted from academic.microsoft.com)

The feature information obtained from graph-structured data can be used in many machine learning use-cases (Hamilton et al., 2017b). For example, a graph can be used to predict a protein's location in a biological interaction graph, predict an individual's place in a collaboration network, recommend new friends to a social network user, or predict new therapeutic applications of established drug molecules.

Using machine learning algorithms, we can gain insights on top of the features extracted from these graphs.

Similarly, academic collaboration graphs are constructed using different entities in the data, i.e., ‘Author’, ‘Paper’, ‘Conference’, ‘Journal’, ‘Affiliation’, and the relationships among them. The idea of constructing a co-authorship network is not new. To the best of our knowledge, (Newman, 2000) was the first to study the structure of scientific collaboration networks, where “two scientists were considered connected if they have co-authored a paper”. Following that, many studies in the literature work on the time evolution of academic partnership networks by forecasting new connections between researchers. Accurate prediction of new partnerships among participants of a collaboration network may aid in the development of creative solutions through collaborative efforts, the creation of opportunities, and the increase of productivity.

1.2. Link Prediction for Recommending Author Collaborations

Predicting connections or associations between entities or nodes in a network is critical in network analysis. (Liben-Nowell & Kleinberg, 2007) formally define Link Prediction as the problem of predicting or identifying the existence of a link between two entities in a network. In a co-authorship network, there is a chance that two authors might collaborate in the future if one of them changes the organization with which he or she is associated. These kinds of collaborations can be difficult to foresee. The network topology, on the other hand, could help us identify a significant number of new collaborations: two authors who are close in the network will have colleagues in common and travel in similar circles, meaning that they are more likely to collaborate in the near future. Many studies in the literature have tried to make this intuitive notion precise by introducing various proximity, semantic, and graph representation learning methods that lead to more accurate link predictions.


1.3. Research Questions and Contributions

Providing recommendations can be as simple as identifying a set of similar items or entities in a network; in other words, it amounts to inferring the non-existing links in the network by analyzing past collaborations.


Figure 3. A sample collaboration graph of authors from different institutes

Given a collaboration network as above, where nodes represent authors and a “co-author” relation exists between two authors if they have at least one paper in common, we would like to answer the following questions:

1. Given an author in the network (e.g., Dr. Chandler), who are the most similar or top-k similar scientists whose research interests match those of Dr. Chandler?

2. Will Dr. Kudva, affiliated with Northrop Grumman Corporation, and Dr. Chandler, affiliated with WPAFB, collaborate in the future?


In a homogeneous network, the structural distance between nodes plays a prominent role in measuring similarity: the greater the distance, the lower the relevance between the two. In comparison, a neighbor node's relevance in a HIN (Heterogeneous Information Network) depends not just on structural distance but also on the semantics shared among those nodes. The dictionary definition of semantics is the meaning of a word, phrase, or sentence. In this study, by semantics we mean knowing and understanding the meaning of the attributes tagged to the nodes in our network, such as authors' abstracts and paper titles.

Figure 4. Pipeline of the feature extraction and learning-based approach

In this thesis, we identify similar authors and recommend future collaborations for an author in a heterogeneous network that represents Authors, Papers, Venues, Journals, Affiliations, and the relationships among them, using link prediction techniques. We have implemented two different approaches in this study. The first extracts the network's structural and semantic features and uses machine learning algorithms to learn the authors' associations. Figure 4 explains the end-to-end pipeline of this first, feature-extraction-based approach.

In the second approach, we implemented a network-embedding-based graph representation learning method where the learning is done using different meta-paths to capture the similarities among the authors. These meta-paths are also weighted based on their importance towards co-authorship prediction. Figure 5 provides the overarching block diagram of the network embedding model we built.


Figure 5. Overarching block diagram of weighted meta-path-based network embedding method

1.4. Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 surveys the related work in link prediction, discussing and comparing different approaches and techniques. Chapter 3 introduces the methodology used in implementing our approach for link prediction and the evaluation metrics used to analyze the system's performance. Chapter 4 describes the process of data collection and the experimental steps followed. In Chapter 5, we present the results and discuss the system's outcome to gain more insights about the co-author prediction task performed using link prediction approaches. In Chapter 6, we present the conclusions and directions for future work.


2. Related Work

Multiple approaches have been proposed in the literature to address link prediction.

These approaches can be categorized based on the type of information used from the network or how they learn from existing relationships among the nodes to predict a non-existing link. In this work, we classify the link prediction techniques according to their approach to extracting network features. Accordingly, these models can be divided into two categories based on how they learn the relationships among the nodes. As described in Figure 6, the first group is the feature extraction based approach, which extracts and analyses features using similarity between node attributes, graph topology, and probabilistic and maximum-likelihood based approaches that define a likelihood function. The second group is the feature learning based approach, with matrix factorization, random walk, and neural network based methods that encode a node's neighborhood by mapping it into a latent space. We discuss these approaches in detail in the following sub-sections.


Figure 6. Taxonomy of link prediction approaches

2.1. Feature Extraction Based Methods

2.1.1. Similarity-based Metrics

To recommend or predict links between two nodes, a simple initial step is to compute the similarity between them. To predict links with maximum likelihood, it is essential to consider the similarity between the nodes. This is consistent with the observation that users tend to create relationships or connections with others who share similar interests, education, location, and background. Sometimes, these connections could be due to complementary interests of the users, for example, interdisciplinary collaboration between two or more scientists from different professional fields to achieve common goals. In this study, however, we limit our analysis to identifying collaborations of authors in the same field using similarity-based metrics. The available node attributes and the network topology help calculate this similarity among the nodes. A defined similarity function helps us calculate the relevance of two nodes, where the greater the similarity score between the two, the more likely it is that the two nodes form a link.

In a practical sense, an online social network contains node attributes such as personal information, email, and location. Similarly, in a research collaboration network, the author nodes have characteristics such as research interests, citation count, affiliations, and conferences attended. In the majority of cases, the textual information from the properties of the nodes shows how similar nodes are to each other.

(Bhattacharyya et al., 2011) define a forest model for categorizing keywords with the hypothesis that similarity among people in the network is determined by the relevance of their keywords. They believe that these keywords correspond to users' interests (such as movies and games) and passions (such as art, music, and research). The authors answer these questions by understanding different keywords based on their usage patterns and how similarity between user interests influences friendship. They analysed the keyword usage patterns using a dataset containing 1265 Facebook profiles, including 1301 unique keywords. Then, they built a forest model to connect keywords based on their context and meaning. To assess the performance of their model, the authors visualize the output of their study with respect to keyword-matching pairs. Finally, they pictorially present the results, showing the variation between weak similarity and substantial similarity for users with different node degrees and numbers of keywords. The two major findings of their study were that correlations between direct friends are high regardless of the number of hops between them, and that individuals who are already friends have a higher similarity score than any other pair of users in the friendship network.


(Akcora et al., 2013) proposed a novel similarity measure for online social networks that combines both network and profile similarity. To test how well the considered measures could forecast the creation of new relationships, the authors conducted experiments on different kinds of graph data (Facebook, YouTube, Epinions, and DBLP data sets). They examined how different scenarios and graph sizes affect performance. Based on their creation dates, they sampled 480,274 edges from the Facebook dataset to reduce computation costs. Of these 480,274 edges, 400,000 (83%) were created between users and their two-hop strangers (i.e., friends of friends), and only 80,274 (17%) were created between pairs three or more hops apart. Then, they created the graph state at the time instant before the creation of each edge e = ⟨u, x⟩, denoted GTe, and computed similarity measures between u and all its strangers in GTe, including x.

Their results show that network similarity measures have the best performance.

Similarly, they performed experiments on the YouTube and Epinions datasets to compare their similarity measures to others in the literature. In an online social network (e.g., Twitter.com), a user enters only a little profile information during the registration process, at which point the system has no preferences or recommendations about possible new friends for the user. Once he or she starts creating new friendships or followings, Twitter.com offers a list of people to follow based on the user's profile information and his or her neighborhood. This shows how network types can impact interactions between users of social networks and what kind of social network information can be used with various types of networks.

(Anderson et al., 2012) study users' interest overlap to measure similarity. This study estimates the similarity between users using two different types of characteristics: similarity of interests, using a distance metric capturing overlap in the types of content they produce, and similarity of social ties, using a measure of the overlap in the sets of people they have evaluated. They performed experiments on three datasets (Wikipedia, Stack Overflow, and Epinions), and their results show the effect of user similarity on how users evaluate each other online. For instance, on Stack Overflow, where questions are annotated with tags (to represent relevant topics), the authors find the similarity between users by defining tag similarity as the cosine between two tag vectors. Similarly, they characterize a user by a vector of the other users that the user has evaluated and call this social similarity. To prove the efficiency of the above two similarity measures, they plot the fraction of positive evaluations as a function of these two similarity measures (showing how probable it is that a user u1 positively evaluates user u2 based on how similar they are). Predicting future links between users was not their goal, but the way they measured the similarity among users in a social network is worth mentioning in this category of node-based similarity measures: we could indeed train a classifier on top of these similarity measures and use the model to predict future links.
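To illustrate the idea of measuring interest overlap as the cosine between two tag vectors, a minimal Python sketch is given below; the tag vocabulary and counts are hypothetical and not taken from the Stack Overflow data.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two tag-count vectors
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical tag-count vectors over the vocabulary [python, nlp, graphs, sql]
user1 = np.array([5, 2, 0, 1])
user2 = np.array([3, 0, 1, 0])
print(cosine_similarity(user1, user2))  # higher values indicate more similar interests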

Graph topological properties may also be used to describe empirical similarities among the nodes. In other terms, similarity metrics for any two nodes may be measured using a variety of network properties, including structural properties. (Newman, 2001) explored several features extracted using the graph's topology. Depending on the topological features used, these metrics can be grouped into local and global approaches.

Local Approaches

In a social network or collaboration graph, actors or nodes prefer to make new links with nodes that are closer to them rather than with nodes far away in the network. Therefore, researchers have designed many local indices or neighborhood-based metrics for link prediction. These have the drawback that using only local information restricts node similarity to being computed for neighbors. Some of the popular metrics are Common Neighbors (Newman, 2001), Adamic/Adar (Adamic & Adar, 2003), Preferential Attachment (Barabási et al., 2001), Resource Allocation (Zhou et al., 2009), and the Jaccard Coefficient (Jaccard & Zurich, 1901).

Common Neighbors

Introduced by (Newman, 2001), this method provides a measure of similarity by calculating the intersection of the sets of neighbors of the two nodes to predict future linkage. The intuition behind common neighbors is that two authors who have a colleague in common are more likely to be introduced than those who don't have any colleagues in common. It is defined as follows:

CN(x, y) = |Γ(x) ∩ Γ(y)|

where x and y represent nodes in the graph and Γ(·) denotes the set of neighbors of a node.

Adamic/Adar algorithm

(Adamic & Adar, 2003) presented a metric where the similarity between two nodes is measured based on their shared neighbors. It is the sum of the inverse logarithmic degree of the neighbors shared by the two nodes, defined as:

AA(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / log|Γ(z)|

where Γ(·) denotes the set of neighbors of a node and z is a common neighbor of nodes x and y. The higher the value, the more similar the two nodes are to each other.


Preferential Attachment

The basic premise here (Barabási et al., 2001) is that the probability that a new edge involves a node x is proportional to the size of x's neighborhood. So, the likelihood of co-authorship between authors a1 and a2 is correlated with the product of the numbers of collaborators of a1 and a2. It is defined as:

PA(x, y) = |Γ(x)| · |Γ(y)|

where Γ(·) denotes the set of neighbors of a node.

Resource Allocation

This metric (Zhou et al., 2009) measures the fraction of a resource that a node can send to another node through their common neighbors:

RA(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / |Γ(z)|

where Γ(·) denotes the set of neighbors of a node and z is a common neighbor of nodes x and y.

Jaccard Coefficient

This metric (Jaccard & Zurich, 1901) measures the ratio of shared neighbors to the complete set of neighbors of the two nodes:

JC(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|
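The local indices defined above can be computed directly from the adjacency structure of a graph. The following sketch uses a toy NetworkX graph (not the collaboration data used in this thesis) to compute Common Neighbors, Adamic/Adar, Preferential Attachment, Resource Allocation, and the Jaccard coefficient for a candidate pair of authors.

import math
import networkx as nx

# Toy co-authorship graph; a1..a4 are authors, edges are past collaborations
G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "a4"), ("a2", "a4")])

def local_scores(G, x, y):
    nbr_x, nbr_y = set(G[x]), set(G[y])
    common = nbr_x & nbr_y
    return {
        "CN": len(common),
        "AA": sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1),
        "PA": len(nbr_x) * len(nbr_y),
        "RA": sum(1.0 / G.degree(z) for z in common),
        "JC": len(common) / len(nbr_x | nbr_y) if (nbr_x | nbr_y) else 0.0,
    }

print(local_scores(G, "a1", "a4"))  # scores for the non-existing link (a1, a4)

NetworkX also ships equivalent generators (jaccard_coefficient, adamic_adar_index, preferential_attachment, resource_allocation_index) that operate on batches of node pairs.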

In use cases such as social networks and bibliographic networks, these local methods can be used efficiently to address the link prediction task. Common Neighbors, the Jaccard Coefficient, and Preferential Attachment can be used when past collaborations and the number of connections of each node need to be weighted more heavily. Capturing local neighborhood information helps identify past and mutual relationships among nodes, and it also allows us to associate or predict links among nodes that have a similar neighborhood even when they are far apart in the network.

Global Approaches

Structural information of the whole network can be used to calculate similarity; such measures are called global similarity indices. As opposed to local approaches, these methods have a greater computational complexity. Examples of global similarity measures include, but are not limited to, the following:

Katz Index

This index, introduced by (Katz et al., 1970), counts all paths between two nodes, with shorter paths counting more heavily. In other words, a new link between two nodes is more likely to exist when there are many short paths between the two.
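In matrix form, the Katz index can be written as S = (I − βA)⁻¹ − I = Σ_{l ≥ 1} β^l A^l, where A is the adjacency matrix and the damping factor β must be smaller than the reciprocal of A's largest eigenvalue for the series to converge. The small sketch below uses an example graph bundled with NetworkX and an arbitrarily chosen β; it is a generic illustration, not the computation used in this thesis.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()                 # example graph bundled with NetworkX
A = nx.to_numpy_array(G)
beta = 0.05                                # illustrative damping factor
assert beta < 1.0 / max(abs(np.linalg.eigvals(A)))   # convergence condition

# Katz similarity matrix: S = (I - beta*A)^-1 - I = sum over l >= 1 of beta^l * A^l
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I
print(S[0, 33])                            # Katz score between node 0 and node 33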

Shortest Path

The distance between two nodes in a graph is a direct measure of how close the two nodes are (Liben-Nowell & Kleinberg, 2007). The negated length of the shortest path is used as a score for the likelihood of a connection, capturing the inverse relationship between shortest-path length and connection likelihood.

SimRank

This metric (Jeh G et al., 2002) is based on the premise that two nodes are similar if they are related to other nodes that are themselves similar. In the context of co-authorship networks, we could say that two scientists are considered similar if they are connected to similar scientists in the network.

Rooted Page Rank

(Chung & Zhao, 2010) combined PageRank with random walks for link prediction. Rooted PageRank for a pair (x, y) is based on a random walk that starts at x: at each step, the walker moves with probability α to a randomly chosen adjacent vertex and returns to the root x with probability (1 − α). The resulting stationary probability of reaching y is used as the similarity score.

Hitting Time

In (Fouss et al., 2007), the hitting time HT(x, y) is the expected number of steps taken by a random walk starting at x to reach y, for two vertices x and y in a graph. The smaller the hitting time, the more similar the two nodes are to each other.

Usually, a combination of node-based and topological metrics is used for predicting missing or future links in a network. We divide the similarity-based literature we reviewed according to the type of information used in building the models, i.e., topology-based or content-based.

Measuring similarity using graph topology

(Pavlov & Ichise, 2007) solved a similar problem of finding and predicting collaborations among authors in a Japanese co-authorship network by extracting structural attributes from the network and using them to train a set of predictors of future collaborations. The structural attributes used in this study were popular metrics such as Shortest Path, Common Neighbors, Jaccard's coefficient, Katz, and PageRank. The authors then formally define a predictor that maps feature vectors to the binary space by training several learning algorithms such as Decision Trees and Support Vector Machines. They also argue that by analysing the algorithmic structure of predictors built for specific networks, we can gain valuable information about which attributes are most informative for the link prediction problem, and that this knowledge can serve as a basis for specifying vocabularies for expert description, which in turn helps make predictions even better because we come to know to what extent each researcher is an expert in each field. The data set contains 111,210 published articles by 86,696 authors collected from 1993 to 2006. They compared the results of various classifiers (AdaBoost, Decision Trees, SVMs) and achieved a precision of 0.75 using the AdaBoost algorithm.
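The general recipe behind such feature-extraction approaches is to turn each candidate node pair into a feature vector and train a binary classifier on known links versus non-links. The minimal sketch below illustrates that idea on a toy graph with a toy feature set and simplified negative sampling; it is not the setup of the cited paper or of this thesis.

import networkx as nx
from sklearn.ensemble import AdaBoostClassifier

G = nx.karate_club_graph()                           # toy stand-in for a collaboration graph
positives = list(G.edges())                          # existing links
negatives = list(nx.non_edges(G))[: len(positives)]  # a sample of non-links

def pair_features(G, u, v):
    # Three simple topological features for the node pair (u, v)
    cn = len(list(nx.common_neighbors(G, u, v)))
    jc = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    pa = G.degree(u) * G.degree(v)
    return [cn, jc, pa]

# Note: a rigorous setup would hide the positive edges from G before computing their features.
X = [pair_features(G, u, v) for u, v in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(clf.predict_proba([pair_features(G, 0, 33)])[0, 1])  # estimated link probability for (0, 33)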

In (Aiello et al., 2012), the authors find new connections among friends using topical similarity among users who lie close to each other in a social network. They introduced a null model that preserves user activity while removing local correlations, allowing them to disentangle the actual local similarity between users from statistical effects due to user activity and centrality in the social network. They conducted experiments on three popular online social networks: Flickr, Last.FM, and aNobii. Combining all the proposed social and topological features achieved an accuracy of 91% using a decision tree classifier on a balanced set of 10,000 positive and negative samples extracted from the aNobii data set.

Measuring similarity based on graph topology and attributes

(al Hasan et al., 2006) used topological features such as the clustering index and shortest path, aggregate features such as the sum of papers and sum of neighbors, and semantic features such as keyword match count. The authors believe that predicting links in a co-authorship network could potentially be applied to many online social network problems. In this work, the authors used two bibliographic datasets, Elsevier BIOBASE and DBLP, that contain information about research publications in the biology and computer science fields. For DBLP, 15 years of data were used, of which 11 years were used for training and four years for testing the model. They performed experiments with the feature set mentioned above, training many classification models such as SVM, Decision Trees, RBF Network, Bagging, and KNN. Using evaluation metrics like accuracy, precision, recall, and F-1 score, they achieved an accuracy of around 85 to 90% for each classification algorithm. Our work is similar to their study regarding the topological and semantic methods used for conducting the experiments. In this work, we capture the semantic similarity among authors using more promising techniques available today in NLP, such as SciBERT and Word2Vec, by leveraging the authors' lists of paper titles and abstracts.

(Almansoori et al., 2012) apply link prediction to the domain of healthcare and gene expression networks. The authors argue that the chance that links could disappear in the future is as significant as the likelihood of the formation of new connections, so they present a novel model that tackles both the emerging-link and shrinking-link problems. Their experimental results show that the proposed model can support gene-gene interaction analysis and the effective analysis of network patterns. They also apply the link prediction method to the medical referral problem, i.e., the process of referring patients to physicians with a specific area of specialty, by building a classification model on top of extracted features such as Ethnicity, Professional Activity Match, Sum of Patients (the total number of patients that the pair of physicians had), Sum of Neighbors, and Jaccard Similarity. The authors address the problem of both positive and negative link prediction (links that are likely to be removed in the future). They achieved a prediction accuracy of 92% using SVM on the medical referral data set and an accuracy of 72% using a Decision Tree on a real-world gene data set.

(Sachan & Ichise, 2010) introduced semantic and event-based approaches and used the graph's structure to improve the predictors' accuracy. They believe that the researchers' semantic descriptions might help find researchers with compatible expertise in each field and thus suggest collaborations. Further, they used an event-based approach to consider common venue and journal information to identify future collaborations more precisely. Some of the non-structural and event-based attributes they used were common words in the title, common conference venues, and common journals. They conducted experiments on a subset of the DBLP data set with 17,623 authors and 18,820 papers, which they split into 16 partitions based on a timeframe (1987 to 2002). They were able to achieve an F-1 score of 60% by using a combination of topological, semantic, and event-based features along with sampling methods (SMOTE) and training with Decision Trees.

Along this line, (Sun, Y. et al., 2019) proposed the PathPredict model to study the co-author relationship prediction problem. Here, the authors considered meta-path-based topological features to capture the similarity among the author nodes and then learned the importance of each feature in predicting potential collaborations among the authors. Experiments were conducted on a real bibliographic network, DBLP. By defining meta-paths over the network schema, their intuition is that both neighbor-set features and topological features can be generalized to heterogeeous information networks. To conduct the experiments, the authors partition the network based on the publication year associated with each paper, i.e., from 1989 to 2009. The proposed model can achieve an AUC of around 0.83 for the test interval using the proposed meta-path-based topological features. They also conducted a case study to evaluate the significance level of different meta-paths, which shows that the common venue and common author meta-paths help identify co-author relationships better. Although they handled the heterogeneous network using multiple meta-paths, the semantic relations among the authors were not considered in this study.

Remarks

Similarity-based methods mostly focus on structural and node-attribute-level information to compute a similarity score, which helps assess the probability of forming non-existing links between nodes. Comparing local and global approaches based on run time, local approaches take less time since only adjacent-node information is considered, which is an advantage when dealing with large-scale networks; this is not the case for global approaches. In terms of how well they capture the similarity among the nodes, global methods perform well because they take into account the complete topological information of the given graph.

2.1.2. Probabilistic and Maximum-Likelihood Models

“Statistical and probabilistic principles have been used effectively to characterize several network formation-based models” (Goldenberg et al., 2009). Given a network, these models optimize an objective function composed of many parameters. This method is based on the assumption that the network has a well-defined structure. Mathematical techniques are used to construct a model that matches the system and to estimate the model parameters. The obtained parameters are then used to compute the likelihood of the formation of non-existing connections. As we have seen in the previous similarity-based approach, these likelihood values can be used to rank possible connections among the nodes.

(C. Wang et al., 2007) proposed a novel local probabilistic graphical method that can be scaled up to large graphs to estimate the joint co-occurrence probability of two nodes. They argue that such a probability measure captures information that is not captured by either topological or semantic similarity measures, which are the most used methods for link prediction. This method is described in Figure 7. To derive the co-occurrence probability (the link probability between two nodes), a local probabilistic graph model using Markov Random Fields (MRF) is proposed.

Figure 7. Local probabilistic model

There are three steps in predicting whether two nodes x and y will be connected: (1) define the central neighborhood set of x and y using topological facts; (2) choose an itemset that is fully contained within this set and use it as training data to train a local probabilistic model, where the training process is converted into a maximum entropy optimization task; (3) using inference over the local model, estimate the co-occurrence likelihood features. To test their theory, they ran tests on three different data sets: DBLP, Genetics, and Biochemistry. On DBLP, the co-occurrence probability features alone yielded an AUC score of 0.82 compared to 0.75 for the Katz measure. Their model did fairly well on the other two data sets, obtaining an AUC of around 0.80. They also demonstrated the effectiveness of the co-occurrence probability feature combined with other topological and semantic features for predicting co-authorship collaborations on DBLP, which yielded an AUC score of 0.83.


Probabilistic relational models (PRMs) are a language for describing statistical models for relational domains. Consider the issue of link prediction in a co-authorship network to further explain PRMs. Non-relational link prediction frameworks accept only one entity type as a node and one relationship type; relational frameworks (PRMs), in contrast, can handle more than one entity and relationship type. In this way, the link prediction problem may be reduced to an attribute prediction task.

(Friedman et al., 1999) extended this framework by modeling the interaction between node attributes and the link structure itself. The authors presented a precise semantics of these extensions and proposed learning such models from a relational database. To demonstrate the proposed model's predictive capability, they performed experiments on the Cora (McCallum et al., 2000) and WebKB (Craven et al., 1998) data sets and evaluated their model by computing log-likelihood. Training was done on nine-tenths of the data, and performance was evaluated by computing the log-likelihood of the held-out test subset. They obtained a log-likelihood of -210,044 compared to -213,798 for the baseline model using existence uncertainty. Similarly, using reference uncertainty, they got a log-likelihood of -149,705 compared to -152,280 for the baseline model. These numbers indicate the likelihood of the test data given the learned model. Thus, the model they built is more predictive than the baseline when the relational structure is correlated with attribute values.

(Taskar et al., 2003) tackled the problem of predicting link existence using the Relational Markov Network (RMN) framework. They defined a single probabilistic model over the entire graph, including object labels and links between objects. The model parameters are trained discriminatively to maximize the probability of the object and link labels given the known attributes. The learned model is then applied, using probabilistic inference, to predict and classify links from any observed attributes and links. Experiments were performed on a customized version of the WebKB data set containing computer science web pages from three schools. First, 2,954 pages were labeled into one of eight categories such as faculty, student, and research staff. Then, they established a set of candidate links based on evidence of a relation between them, so the task becomes predicting the relation type for all the candidate links. Further, they experimented on a large university portal's social network data using “friendship” links between students, using personal information and using observed links in the test data as evidence for predicting the remaining links.

(Guimerà & Sales-Pardo, 2009) provide a general mathematical and statistical basis for identifying missing and erroneous interactions in a noisy network. Theirs is a model with cluster structures or communities. In the stochastic block model, nodes in the network are grouped into classes, and nodes in the same class have the same status. This means that each node has a fixed community membership, which helps determine the likelihood of a non-existing link between two nodes. A stochastic block model M = (P, Q) is made up of two parts: a partition P of the nodes into groups and a matrix Q of connection probabilities between groups. Let Qαβ denote the probability of a link between a node in group α and a node in group β, and let A be the observed network. Then the probability of the network structure is:

p(A | P, Q) = ∏_{α ≤ β} Qαβ^(lαβ) · (1 − Qαβ)^(γαβ − lαβ)

where lαβ is the number of existing links between groups α and β, and γαβ is the maximum number of potential connections between groups α and β. Let Ω be the set of all feasible partitions. Bayes' theorem can then be used to measure the reliability of a certain link between two nodes x and y in the network:

p(Axy = 1 | A) = (1/Z) Σ_{P ∈ Ω} ∫₀¹ p(Axy = 1 | P, Q) · p(A | P, Q) · p(P, Q) dQ

where Z is a normalization factor, which can be written as Z = Σ_{P ∈ Ω} exp[−H(P)], and H(P) is a function of the partition. This kind of system needs enormous computing time to identify all missing and spurious connections.
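As a rough illustration only, the sketch below scores a candidate link under a single fixed partition by estimating Qαβ = lαβ / γαβ from the observed links; the full method of (Guimerà & Sales-Pardo, 2009) additionally sums over the partition space Ω, which this simplification omits. The adjacency matrix and partition are toy values.

import numpy as np

# Toy observed adjacency matrix A and a fixed partition P (group id per node)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
P = np.array([0, 0, 0, 1, 1])

def block_link_probability(A, P, x, y):
    # Estimate Q_ab = l_ab / gamma_ab for the groups of x and y
    a, b = P[x], P[y]
    ia, ib = np.where(P == a)[0], np.where(P == b)[0]
    links = A[np.ix_(ia, ib)].sum()
    if a == b:                       # within-group pairs would be counted twice
        links = links / 2
        pairs = len(ia) * (len(ia) - 1) / 2
    else:
        pairs = len(ia) * len(ib)
    return links / pairs if pairs else 0.0

print(block_link_probability(A, P, 0, 4))  # reliability estimate for the missing link (0, 4)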

To demonstrate that their model reveals structural features, they attempt to locate missing interactions, or, in other terms, determine the likelihood that a connection exists given the observed network. The authors evaluate the performance on five networks; for example, on a network of interactions between people in a karate club, they predict missing interactions with 85% accuracy.

In the real world, many complex systems or networks have an inherently hierarchical organization: the nodes in the network are divided into multiple layers organized as ordered sets. In most cases, the resulting hierarchy depends on several factors, such as the different roles of the nodes, their significance, and their history.

(Clauset et al., 2008) built a model that captures the topological relationships behind the hierarchical structure and predicts the connections among the nodes in the hierarchy. The hierarchical structure is described by a tree or dendrogram in which closely connected pairs of vertices have their lowest common ancestors lower in the tree than those of more distantly related pairs. They expect the probability of a connection between two vertices to depend on their degree of relatedness. They detect and analyse the real-world hierarchical structure by fitting the hierarchical model to observed network data using statistical inference tools, combining a maximum likelihood approach with a search over the space of all possible dendrograms. The authors first sampled a group of dendrograms with probability proportional to their likelihood, then averaged the corresponding connection probabilities over the sampled dendrograms to determine the mean probability.

In this study, the authors prove the efficiency of their approach for link prediction using three example networks (a terrorist association network, a metabolic network, and a grassland species network) and evaluate the success of their approach by showing the degree to which the AUC exceeds 1/2, indicating how much better their predictions are than chance. They also plot the AUC for the three networks as a function of the fraction of connections known to the algorithm.

They further show that this knowledge of hierarchical structure can predict missing connections in partially known networks with better accuracy than standard topology-based techniques. However, the space and time complexity of this approach is high, which limits its applicability to use cases involving large-scale networks.

2.2. Feature Learning Methods

Feature learning is made possible by graph embedding and network representation learning techniques, in which the original graph is projected into a low-dimensional space. The goal is to automatically project nodes or objects in a homogeneous or heterogeneous network into a latent embedding space in which both the structural and relational properties of the network are encoded and preserved. The idea behind these representation learning approaches (Hamilton et al., 2017b) is to learn a mapping that embeds nodes or entire (sub)graphs as points in a low-dimensional vector space R^d, and then to optimize this mapping so that geometric relationships in the learned space reflect the original graph's structure. After optimizing the embedding space, the learned embeddings can be used as feature inputs for downstream machine learning tasks. To put it another way, computing the cosine similarity or Euclidean distance between the encoded nodes corresponds to measuring the similarity between them in the input network. In the earlier approaches we used manually computed features to find the similarity among the nodes, but in graph representation learning methods the feature learning is done automatically.
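As a small sketch of how learned embeddings are consumed downstream, random vectors below stand in for learned author embeddings (the author names are hypothetical), and similar authors are ranked by cosine similarity, which is the kind of top-k query used later in this thesis.

import numpy as np

rng = np.random.default_rng(0)
authors = ["author_1", "author_2", "author_3", "author_4"]
embeddings = {a: rng.normal(size=128) for a in authors}   # random stand-ins for learned vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank candidate collaborators for a query author by embedding similarity (top-k search)
query = "author_1"
scores = {a: cosine(embeddings[query], z) for a, z in embeddings.items() if a != query}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2])   # top-2 most similar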

There are many network embedding methods in the literature. Most of these methods can be classified as 1) matrix factorization based models, 2) random walk based models, or 3) deep neural network based models.

2.2.1. Matrix Factorization Methods

A matrix contains a set of rows and columns. Mapping the nodes of a given input graph to the rows and columns of a matrix is a way of denoting whether two nodes are connected in the original network. In these approaches, vector representations of the nodes of the initial network are obtained by representing them in a low-dimensional space using their structural attributes. The aim is to reduce the dimensionality of this space while maintaining non-linearity and locality. For over a decade, matrix factorization methods have been used in many papers to solve the link prediction task. The two main approaches to matrix factorization are singular value decomposition (SVD) and non-negative matrix factorization.


Laplacian Eigenmaps

(Belkin & Niyogi, 2001) proposed a geometrically motivated algorithm for constructing a representation for data sampled from a low-dimensional manifold embedded in a higher-dimensional space. In the first step, a graph is constructed to model local neighborhood relations among the data points. Next, the graph is embedded using the graph Laplacian; its eigenvectors directly give the embedding of the data points. The decoder within the Laplacian Eigenmaps encoder-decoder view is described as:

DEC(z_i, z_j) = ||z_i − z_j||₂²

and the loss function weights each node pair by its similarity in the graph:

L = Σ_{(v_i, v_j) ∈ D} DEC(z_i, z_j) · s_G(v_i, v_j)
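A compact sketch of Laplacian Eigenmaps on a small example graph is shown below, using a dense eigendecomposition with NumPy for simplicity; this is a generic illustration, not the embedding method used later in this thesis. The eigenvectors of the graph Laplacian associated with the smallest non-zero eigenvalues serve as the node coordinates.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
L = nx.laplacian_matrix(G).toarray().astype(float)   # graph Laplacian L = D - A

# Eigen-decomposition; eigenvalues come back in ascending order
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:3]      # skip the constant eigenvector (eigenvalue 0), keep 2 dimensions
print(embedding.shape)        # (34, 2)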

Graph Factorization

In Graph Factorization (Schwabe et al., 2013), the same loss function as in Laplacian Eigenmaps is used, and it is optimized using stochastic gradient descent. The gist of this approach is that it aims to represent the graph through its graph Laplacian matrix: under this representation, the positive values correspond to node degrees and the negative values are edge weights. The goal of the objective function used here is to minimize the error compared to the graph Laplacian matrix.


GraRep

GraRep (Cao et al., 2015) also used the same loss function as Laplacian Eigenmaps and optimized it using stochastic gradient descent. While reducing the dimensionality of the vector space, the model also incorporates global topological structure knowledge into the learning. The authors conducted experiments on multi-label classification tasks on the BlogCatalog dataset, where the results show a slight improvement, with a Micro-F1 score of around 40%.

HOPE

For directed graph embeddings, an inner-product-based approach (Ou et al., 2016) retains asymmetric transitivity. When factoring the graph vertices into a vector space, this property is important for capturing the graph structure, and it may also be used to decode properties of an embedded graph. First, the high-order proximity of nodes in the given network is estimated; SVD is then applied on top of it, and the resulting factors give the vector representation of each node in a low-dimensional setting.

Empirical experiments over several synthetic data sets demonstrate that HOPE approximates higher-order proximities significantly better than existing state-of-the-art algorithms. Experiments were conducted on the SN-Weibo and SN-Twitter data sets with mean average precision as the evaluation metric; HOPE achieves a precision of over 0.80 compared to baselines such as DeepWalk, Adamic/Adar, and LINE.

Similarly, for structural relation estimation, a prominent work by (Menon & Elkan, 2011) is notable. By framing link prediction as a matrix completion problem, the authors implemented it using a matrix factorization approach. (Chen et al., 2017) factorized the extracted structural matrix and the matrix of corresponding node properties using a non-negative matrix factorization technique instead of relying on SVD.

2.2.2. Random Walk Based Methods

Graph exploration and sampling with random walks or search algorithms is used to investigate node characteristics such as node centrality and node similarity. We can view the sampling of neighborhoods as a form of local search using random walks and techniques such as Breadth-First Search and Depth-First Search; in terms of the search space they explore, these two search techniques represent extreme scenarios.

The results of the studies below show that nodes that co-occur on short random walks tend to have similar embeddings in the latent space. Thus, instead of using a deterministic measure of graph proximity, these random walk methods employ a flexible, stochastic measure of graph proximity, which has led to superior performance in many settings (Goyal & Ferrara, 2017).

Deep Walk

(Perozzi et al., 2014) address graph representation learning as a natural language problem. DeepWalk learns latent representations by treating walks as the equivalent of sentences and relying on local information gleaned from truncated random walks. Initially, in academia and industry, skip-gram was used mostly for language processing use cases, but using it as an objective function here helps represent networks in a low-dimensional embedding space.
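A condensed sketch of the DeepWalk recipe is shown below, assuming NetworkX and gensim are available; the walk length, number of walks, and skip-gram parameters are illustrative choices, not those of the original paper.

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(G, start, length=10):
    # Truncated random walk starting at 'start'
    walk = [start]
    while len(walk) < length:
        nbrs = list(G[walk[-1]])
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]          # Word2Vec expects string "tokens"

walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]   # 20 walks per node
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=2, epochs=5)
print(model.wv["0"][:5])                   # first few dimensions of node 0's embedding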

To demonstrate the algorithm's potential, the authors evaluate its performance on a multi-label classification problem. They conducted experiments on the BlogCatalog, Flickr, and YouTube data sets. First, they sample a portion of labelled nodes and use them as training data, with the rest of the nodes as a test set. Then, they repeat the process ten times and report the results by averaging the Macro-F1 and Micro-F1 scores. The results show that DeepWalk outperforms the existing baseline methods, i.e., Spectral Clustering, Edge Clustering, Modularity, and Majority, on the multi-label classification task. They also report parameter sensitivity by varying the dimensionality and sampling frequency, which shows that the results are fairly consistent over the three data sets.

Node2vec

This technique (Grover & Leskovec, 2016) extends DeepWalk by integrating BFS and DFS graph traversal strategies to learn the features. The algorithm employs a scalable second-order random walk to investigate various aspects of the network structure. Nodes with a strong relation density that belong to the same group or cluster, for example, are embedded close together.

The authors also demonstrate the efficacy of the proposed approach on multi-label classification and link prediction tasks. In the link prediction task, given a network with a certain fraction of edges removed, they try to predict these missing edges. The experiments were conducted on three popular data sets: Facebook, PPI, and arXiv. The results show that node2vec outperforms DeepWalk with respect to AUC score, with a gain of up to 3.8%.

Metapath2vec

This approach (Dong et al., 2017) attempts to solve the problem of representation learning in heterogeneous networks (containing multiple node and edge types). Similar to word2vec, where sequences of words are used to understand the context of a word, here a heterogeneous skip-gram model run on top of walks generated from pre-defined meta-paths produces the required node embeddings. Given a heterogeneous network, the goal is to capture both the structural and semantic relations among the nodes.
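A rough sketch of a meta-path-guided walk for an Author-Paper-Author pattern is shown below; the toy heterogeneous graph and node types are hypothetical, and the full metapath2vec model would additionally feed such walks into the heterogeneous skip-gram.

import random
import networkx as nx

# Toy heterogeneous graph: every node carries a 'type' attribute ('A' author, 'P' paper)
H = nx.Graph()
H.add_nodes_from(["a1", "a2", "a3"], type="A")
H.add_nodes_from(["p1", "p2"], type="P")
H.add_edges_from([("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")])

def metapath_walk(H, start, metapath=("A", "P", "A"), walk_len=7):
    # Random walk that only steps to neighbors whose type matches the meta-path pattern
    walk, pattern = [start], list(metapath[1:]) * walk_len
    for next_type in pattern[: walk_len - 1]:
        candidates = [n for n in H[walk[-1]] if H.nodes[n]["type"] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

print(metapath_walk(H, "a1"))   # e.g. ['a1', 'p1', 'a2', 'p2', 'a3', ...]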

DeepWalk and node2vec were designed for homogeneous networks, while metapath2vec and metapath2vec++ learn low-dimensional representations of nodes when more than one node type is involved. The authors clearly define the heterogeneous network embedding approach built using a heterogeneous skip-gram and meta-path-based random walks. They also propose a heterogeneous negative sampling method for the metapath2vec++ variant, in which the softmax function is normalized with respect to the node type of the context. They perform experiments on two heterogeneous graphs, DBIS and AMiner, and validate the approach using node clustering and multi-label classification tasks. By varying the percentage of training samples, they achieved a Micro-F1 score of 0.92 on the AMiner dataset for the multi-class node classification task.

GraphSAGE

Unlike most existing methods, which require all nodes in the graph to be present while the embeddings are trained, the authors present a general inductive framework that uses node feature information to efficiently produce embeddings for previously unseen nodes. GraphSAGE (Hamilton et al., 2017a), with the goal of gathering local features and representing them as vectors in a latent space, samples a fixed number of adjacent nodes. Using a context-based similarity assumption similar to the previous approaches built on the idea of word2vec, GraphSAGE assumes that nodes with similar neighborhoods have similar embeddings. Toward this goal, the approach uses an information aggregation function and a loss function adapted to an end-to-end learning procedure.
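A bare-bones sketch of the mean-aggregation idea at the heart of GraphSAGE is given below, using a single layer with random, untrained weights and random input features; the actual model learns these weights end-to-end with the neighborhood sampling and loss described in the paper.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
rng = np.random.default_rng(0)
X = rng.normal(size=(G.number_of_nodes(), 16))      # input node features (random stand-ins)
W_self, W_neigh = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))  # untrained weights

def sage_layer(G, X, node, num_samples=5):
    # One mean-aggregation step: combine a node's features with a sampled neighborhood mean
    nbrs = list(G[node])
    sampled = rng.choice(nbrs, size=min(num_samples, len(nbrs)), replace=False)
    h_neigh = X[sampled].mean(axis=0)
    h = np.concatenate([X[node] @ W_self, h_neigh @ W_neigh])
    return np.maximum(h, 0)                          # ReLU non-linearity

print(sage_layer(G, X, 0).shape)                     # (16,) embedding for node 0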

They demonstrated their approach's efficacy by classifying nodes in evolving information graphs: citation data and Reddit data. In the citation data set, the authors picked papers in six biology-related fields for the years 2000-2005, of which 2000-2004 was used for training the algorithms and 2005 for testing. This method improved the classification F1-scores by 51% compared to using node features alone and showed a significant gain in running time (~100×).

Watch Your Step (WYS)

(Abu-El-Haija et al., 2018) is a learned attention model based on the power series of the transformation matrix G. With just a few manually tunable hyperparameters, the weight that should be assigned to a node’s context is determined. Their experiments on link prediction produce embeddings that best preserve the graph structure, generalizing to unseen information.

Here, the authors evaluated the quality of the embeddings produced when random walks are augmented with attention through experiments on link prediction. They assessed their method by removing a fraction of edges from the network, learning embeddings from the remaining edges, and measuring how well the algorithm can recover the removed edges. They conducted experiments on the wiki-Vote, ego-Facebook, ca-AstroPh, ca-HepTh, and PPI data sets. This graph attention method substantially outperforms the baselines, reducing error by up to 45%.


PathSim

(Sun Y. et al., 2011) proposed the PathSim model to study the importance of similarity search in large-scale heterogeneous networks such as bibliographic networks. The intuition behind this study is that “two objects are similar if they are linked by many meta-paths in the network”, which means nodes connected to each other through different meta-paths hold some kind of similarity. The authors introduced the concept of meta-path-based similarity, where a meta-path is a path consisting of a sequence of relations defined between different object types. Experiments were conducted on the DBLP and DBIS data sets, and performance was evaluated through similarity search case studies for retrieving similar venues and similar authors, where the query result accuracy was around 0.74 and 0.65, respectively.

2.2.3. Neural Network-based Methods

Because of neural networks' ability to capture highly non-linear relations, graph representation learning methods adopt neural networks with multiple layers. Aggregating knowledge from the network topology together with the node attributes is one of the functions of these models. The intuition behind these approaches is that a node's local neighborhood can be learned using an aggregation function instead of traversing the whole graph, which can be computationally expensive. In (Harada et al., 2018), for example, link prediction in a graph network of molecules is investigated using a combination of two convolutional neural networks. The molecules are modelled as having a hierarchical structure for their internal and external relationships. To represent the network in a latent space, each node representation is learned using a convolutional layer trained with backpropagation. The encoded node representations are then fed into the external convolutions to learn the external network. Finally, the link prediction model is built using a multilayer neural network to capture the final representations, and the interactions among the molecules are modelled using a softmax function.

Graph Auto-encoders

The authors of (Tran, 2018) demonstrated how an auto-encoder architecture can learn a joint representation of both the local graph structure and the available node features. The model can learn descriptive latent features for nodes from the topology of sparse, bipartite graphs and is flexible enough to incorporate optional side features about the nodes to increase predictive performance.

To demonstrate the same, they perform experiments on link prediction tasks by recovering the status of missing or unknown links in the input graph. The results show that their approach outperforms the baselines with an AUC of 0.89 and an average precision of 0.91 on the Cora data set.

Large Scale Information Network Embedding (LINE)

This model (Tang et al., 2015) integrates two encoder-decoder architectures to preserve first-order and second-order node proximities in the vector space. Their goal is for the proposed model to scale to massive, arbitrary types of networks: undirected, directed, and/or weighted. Using a novel edge sampling algorithm, the model overcomes the limitations of classical stochastic gradient descent.

The authors conducted experiments on word analogy and document classification tasks to demonstrate the approach's efficacy. In word analogy, given a word pair (a, b) and a word c, the task aims to find a word d such that the relation between c and d is similar to the relation between a and b. They report results on the Wikipedia dataset and show that their approach can outperform the baselines DeepWalk, GF, and SkipGram with a Micro-F1 score of around 0.80-0.82.

Deep Neural Networks for Learning Graph Representations (DNGR)

This approach (Cao et al., 2016) uses a random surfing method to capture node-local neighborhood information and learns a single embedding per node rather than pair-wise transformations. Using an autoencoder architecture, both dynamic features and non-linearity can be captured from the input network. The model illustrates the use of stacked denoising autoencoders in extracting meaningful representations.

To demonstrate their embeddings' effectiveness, they perform experiments on a clustering task on the 20 Newsgroups data set, a visualization task on the Wine data set, and a word similarity task. In the word similarity task, where the goal is to learn word representations from large, linearly structured graph data, DNGR achieved an accuracy of 74.84%, significantly higher than SVD.

Structural Deep Network Embedding (SDNE)

This method (D. Wang et al., 2016) is a representation learning model that, with a few exceptions, is close to DNGR in that it uses an encoder-decoder framework with an objective function designed to capture the similarity among the nodes. The authors argue that capturing the non-linear network structure while maintaining global and local information is a serious challenge. As a result, they attempt to preserve the network structure by using both first-order and second-order proximity, jointly optimizing the local and global structure in a semi-supervised deep model.


Here, the authors used the arXiv GR-QC data set for link prediction. The results show that this method learns network representations better than the baselines even when 80% of the links are removed. Using precision@k as the evaluation metric for predicting the hidden links, the precision is higher than 0.90 when k = 1000, whereas the other methods dropped below 0.80.


3. Methods

This section describes the topological and semantic-based methods implemented in the feature extraction process. Later, we explain the techniques used in capturing the similarity among author nodes using a dimensionality reduction-based approach.

Finally, we provide a brief description of each of the supervised machine learning algorithms and evaluation metrics used in this study.

3.1. Feature Extraction Methods

We classify the feature extraction-based methods into two categories. The first is based on graph topology and uses structural attributes to extract the features. The second extracts semantic features from the graph to understand the authors' interests, capturing similarity better.

3.1.1. Feature Extraction Based on Topology

A graph's topology has a significant role in computing similarity among the author pairs. People in every network prefer to form relationships with people in their immediate vicinity. Here, considering the graph topology or structure, we measure the similarity between nodes.

Common Neighbors

People who have more mutual acquaintances are more likely to get acquainted than those who have few or none. (Newman, 2001) makes this intuition precise by conducting experiments on scientific co-authorship networks. That work indicates the importance of scientists introducing their collaborators to one another in the development of scientific communities. In this study, we believe that finding common neighbors between a pair of authors is significant in computing the similarity between them. For two authors, a1 and a2, CN is defined as the number of adjacent nodes that a1 and a2 have in common. A connection between a1 and a2 is more likely to be created when there are more common nodes between the two.

This metric is defined using the following formula.

CN(a_1, a_2) = |N(a_1) \cap N(a_2)|

N(a1) is the set of adjacent nodes of a1 and N(a2) is the set of adjoining nodes of a2.

Figure 8. Frequency of common authors vs Percentage of collaborations

To add more weight to this, we plotted a bar graph of the frequency of common authors vs. the percentage of collaborations that occurred, using the data loaded into Neo4j, a graph database. It shows that around 90% of author pairs who haven't collaborated to date have either no common author or just one author in common. On the other hand, more than 90% of author pairs who did collaborate had at least one author in common, which shows how meaningful mutual connections are in establishing non-existing co-authorship relations.

Adamic/Adar Algorithm

Introduced by (Adamic & Adar, 2003), this measure is beneficial in computing author nodes' closeness based on their shared neighbors. This metric builds on common neighbors, but instead of simply counting them, Adamic/Adar sums the inverse log of the degree of each shared neighbor, where the degree of a node is the number of neighbors it has. It is defined as:

AA(a_1, a_2) = \sum_{z \in N(a_1) \cap N(a_2)} \frac{1}{\log |N(z)|}

where N(z) is the set of nodes adjacent to z.

“When it comes to closing triangles, the intuition is those author nodes with low degrees are more likely to be prominent” (Adamic & Adar, 2003). In our scenario, the likelihood of two authors being introduced by a shared author is inversely related to the number of other neighbor pairs or relations that shared author has. As a result, an unpopular author is more likely to introduce a pair of co-authors. The intuition is that common authors with very large neighborhoods are less significant when predicting a connection between two authors than authors shared between a small number of authors.

Preferential Attachment

As the name suggests, preferential attachment (PA) (Barabási et al., 2001) uses the product of the degrees of authors a1 and a2 in the co-authorship network as the proximity measure, considering that new co-author links are more likely to appear between authors who have a large number of connections. We compute the preferential attachment between a pair of authors using the following formula:

PA(a_1, a_2) = |N(a_1)| \cdot |N(a_2)|

where |N(a1)| is the number of adjacent nodes of a1 and |N(a2)| is the number of adjacent nodes of author a2. The intuition behind this method is that authors with many neighboring nodes will gain more relationships. Hence, in the context of link prediction, an author a1 is more likely to be introduced to a new author than an author a2, given that a2 has fewer neighbors than a1.

Jaccard's Coefficient

The Jaccard coefficient (Jaccard & Zurich, 1901) normalizes for the different neighborhood sizes. It gives a higher score to author pairs that have a larger proportion of shared neighbors relative to their total number of neighbors. Two authors with the same number of common neighbors might still differ considerably. Consider a scenario: authors a1 and a2 have 20 common neighbors, and those two authors do not have any other co-author relationships; authors a3 and a4 also have 20 common neighbors, but a3 also has 50 co-author links besides these 20 common ones. It is easy to see that a1 and a2 are more similar to each other than the a3, a4 pair. Jaccard's coefficient captures this idea by normalizing the common neighbors. This way of computing similarity between the authors is just another way of counting the common connections between the two, with a penalty for each neighbor that is not shared. The Jaccard coefficient is defined as:

JC(a_1, a_2) = \frac{|N(a_1) \cap N(a_2)|}{|N(a_1) \cup N(a_2)|}

where N(a1) and N(a2) represent the sets of adjacent nodes of authors a1 and a2, respectively.

Using the above four structural or topological features, we extract the connectivity properties of a pair of nodes to infer future connectivity by leveraging the network's current connectivity. Given a pair of author nodes, these methods return a similarity score between the two.
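To illustrate how these scores can be obtained in practice, the minimal sketch below computes all four metrics for an author pair with NetworkX on a small toy graph. It is only an illustration of the metrics defined above, not the exact implementation used in this thesis (where the graph is stored in Neo4j); the graph and node IDs are hypothetical.

import networkx as nx

def topological_features(G, a1, a2):
    # Common Neighbors: count of shared adjacent nodes
    cn = len(list(nx.common_neighbors(G, a1, a2)))
    # Adamic/Adar, Preferential Attachment, and Jaccard; each NetworkX helper
    # yields (u, v, score) tuples for the requested pairs
    _, _, aa = next(nx.adamic_adar_index(G, [(a1, a2)]))
    _, _, pa = next(nx.preferential_attachment(G, [(a1, a2)]))
    _, _, jc = next(nx.jaccard_coefficient(G, [(a1, a2)]))
    return {"CN": cn, "AA": aa, "PA": pa, "JC": jc}

# Toy example: a1 and a2 share the collaborator x
G = nx.Graph([("a1", "x"), ("a2", "x"), ("a1", "y")])
print(topological_features(G, "a1", "a2"))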

3.1.2. Feature extraction based on Node Attributes (Semantic similarity)

In any social network, it is natural to extract the structural features among the nodes. In our use case of academic collaboration networks, the properties of authors describe their research areas, location, affiliation, etc. Extracting such properties helps us determine the similarity among the authors more easily and accurately. Hence, in this study, we compute the similarities among authors using their publication data. By semantics, we mean computing similarity between the authors by building a model that understands the meaning of words or phrases instead of relying only on statistical methods. Hence, in this study, we calculate two authors' similarity from the semantic similarity of the keywords in the abstracts of their published papers and of their paper titles. In addition to these two, we believe that analyzing the authors' citation graphs and venues or conferences could help us capture the similarity between two authors in a better and more meaningful way.

A few works in the literature along this line have taken authors' semantic information into account when calculating similarity. (al Hasan et al., 2006) introduced a keyword matching degree measure alongside existing topological features, which gave better results. (Yamaguchi, 2008) introduces two semantic information-based metrics: “keyword matching number” and “common event number”. (Sachan & Ichise, 2010) represents an author's research interests using the “paper's title” and “abstract information”. (Bartal et al., 2009) derives textual knowledge from the paper's title in order to make predictions. When measuring semantic similarity, though, these works compute the intersection of two sets of author properties or use Jaccard's coefficient (treating all terms equally). Hence, in this thesis, without relying on just the matching count or TF-IDF vector similarity, we calculate the semantic similarity between the publication data (such as paper abstracts and paper titles) of authors using SciBERT (Beltagy et al., 2019), a pre-trained language model trained on a sizeable multi-domain corpus of scientific publications.

Spacy

Spacy, introduced by (Honnibal et al., 2017), is a free, open-source library for Natural Language Processing in Python which provides a one-stop solution for the tasks we encounter in any NLP project, such as tokenization, lemmatization, POS tagging, word-to-vector transformation, and entity recognition. Using Spacy, we can quickly create custom pipelines that include the tasks mentioned above. In this study, we use Spacy to build a pipeline that performs tokenization and removes stop words from paper abstracts and paper titles, and then load a language model to compute the similarity between two authors' abstracts/keywords/paper titles. In this way, Spacy makes measuring semantic similarity between a set of sentences/keywords easy and helps us compute the similarity of a pair of authors independently of their topological properties.
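As an illustration of this kind of pipeline, the sketch below uses Spacy to strip stop words from two paper titles and compare them. The general-purpose model name (en_core_web_md) and the sample titles are assumptions used only for the example; in this work, a SciBERT-based language model is plugged into the pipeline instead.

import spacy

# Assumed general-purpose English model with word vectors; the thesis loads a
# SciBERT-based model into the pipeline instead.
nlp = spacy.load("en_core_web_md")

def clean(text):
    # Tokenize and drop stop words and punctuation before comparing
    doc = nlp(text)
    return nlp(" ".join(t.text for t in doc if not t.is_stop and not t.is_punct))

title_a = "Link prediction in heterogeneous bibliographic networks"
title_b = "Predicting co-author relationships in academic graphs"
print(clean(title_a).similarity(clean(title_b)))  # vector-based similarity score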

SciBERT

SCIBERT (Beltagy et al., 2019) is based on the BERT architecture; it is BERT pretrained on scientific corpora. BERT uses the original WordPiece vocabulary, termed BASEVOCAB, whereas SCIBERT uses the SentencePiece library to construct a new WordPiece vocabulary (SCIVOCAB) from scientific corpora. It is trained on 1.14M papers from Semantic Scholar, using the complete text, including abstracts. In this work, we used the SCIBERT language model to measure the similarity between a pair of authors using their abstracts and paper titles.

Paper Title Similarity

A good paper title uses as few terms as possible to accurately capture the substance and intent of the work. In MAG, “paper title” is one of the attributes or properties of each “Paper” node in the co-authorship graph. First, we extract the corresponding paper titles for each author node and store them as Python lists. Then, we create a pipeline using Spacy and remove the stop words from the paper title lists. We added the pretrained language model SCIBERT, i.e., scibert_scivocab_uncased.tar, to the pipeline to measure the similarity between the author pair. So, given the node IDs of a pair of authors, it returns a similarity score on a scale of 0 to 1. Table 1 below is a snippet of sample rows from the data set for paper title similarity.

Table 1. Sample author pairs with paper title similarity
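A minimal sketch of how such a SciBERT-based title similarity could be computed is given below. It assumes the Hugging Face checkpoint allenai/scibert_scivocab_uncased and simple mean pooling of the token embeddings followed by cosine similarity; the exact pooling and pipeline wiring used in this thesis may differ.

import torch
from transformers import AutoTokenizer, AutoModel

# SciBERT checkpoint published on the Hugging Face hub (assumed here)
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(sentences):
    # Mean-pool SciBERT's last hidden states into one vector per sentence
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)            # masked mean pooling

def title_similarity(titles_a, titles_b):
    # Cosine similarity between the averaged title embeddings of two authors
    va = embed(titles_a).mean(0)
    vb = embed(titles_b).mean(0)
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()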

Abstract (Keyword) Similarity

MAG doesn't publish the raw author-supplied keywords. Instead, we leverage the available inverted index of abstracts to get the abstracts of the papers published by an author. We then extract keywords from these abstracts and use them to measure the semantic similarity between an author pair. There are many readily available keyphrase or keyword extraction techniques, such as RAKE (Rose et al., 2010) and YAKE (Campos et al., 2020). However, we found that the results were not satisfactory, since these models focus on a text's statistical properties rather than its semantics. Hence, we use BERT (Devlin et al., 2018), a bidirectional transformer model that allows us to transform phrases and documents into vectors that capture their meaning. There are many methods for generating BERT embeddings, such as Flair and Hugging Face transformers. Here, we used the sentence-transformers package, as it allows us to create high-quality embeddings that work well at the sentence and document level. Using a pretrained model, we extract the keywords from the set of abstracts we have for each author in the data set. Finally, we measure the semantic similarity between the extracted keywords using the SCIBERT language model, which gives us a similarity score on a scale of 0 to 1.
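The sketch below illustrates the kind of embedding-based keyword extraction described above: candidate n-grams from an abstract are ranked by their cosine similarity to the abstract's embedding. The sentence-transformers checkpoint name is an assumption for the example only; the thesis does not pin a specific pretrained model here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint; any sentence-transformers model could be used
model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_keywords(abstract, top_k=10):
    # Candidate uni- and bi-grams, with English stop words removed
    candidates = CountVectorizer(ngram_range=(1, 2), stop_words="english") \
        .fit([abstract]).get_feature_names_out()
    doc_vec = model.encode([abstract])
    cand_vecs = model.encode(list(candidates))
    # Rank candidates by similarity to the whole abstract
    scores = cosine_similarity(doc_vec, cand_vecs)[0]
    return [candidates[i] for i in scores.argsort()[::-1][:top_k]]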

Co-citations

Co-citation is defined as the frequency with which two papers are cited together by other articles or publications. The more co-citations two publications receive, the higher their co-citation strength, and the more likely they are to be semantically related. In this thesis, we compute the co-citation strength of a pair of authors by counting the number of papers that cite papers by both authors in the pair, treating it as another proxy for semantic similarity. The intuition behind adding this to our feature set is that, in most cases, co-citation among papers happens when they belong to similar research areas. Hence, this metric helps us capture the similarity between authors whose research interests overlap.
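A minimal sketch of this co-citation count is shown below, assuming the citation relationships have been exported into a simple Python mapping; this hypothetical structure stands in for the Neo4j queries actually used in this work.

def co_citation_strength(papers_a, papers_b, citations):
    """Count papers that cite at least one paper of each author.

    papers_a / papers_b : sets of paper IDs written by each author
    citations           : dict mapping a citing paper ID -> set of cited paper IDs
    """
    return sum(
        1
        for cited in citations.values()
        if cited & papers_a and cited & papers_b
    )

# Toy example: paper p3 cites one paper from each author
citations = {"p3": {"p1", "p2"}, "p4": {"p1"}}
print(co_citation_strength({"p1"}, {"p2"}, citations))  # -> 1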


3.2. Network Embedding Based Approach for Link Prediction

“Network embedding is considered as a dimensionality reduction technique in which higher D dimensional nodes (vertices) in the graphs are mapped to a lower d (d << D) dimensional representation (embedding) space by preserving the node neighborhood structures” (Alexandru Cristian Mara, 2020). Thus, similar nodes in the original network have similar embeddings in the representation space. Analogous to the previous approach, where we measured the similarity among the authors using structural as well as semantic features of the heterogeneous graph, here we map the nodes of the network to vectors in an embedding space such that these representations can be used to approximate any notion of similarity or proximity between pairs of author nodes in the network. Standard machine learning methods are then applied to these embeddings to learn and estimate the likelihood of edges between nodes that are not linked in the input network.

3.2.1. Homogeneous Network Embedding

(Mikolov et al., 2013) introduced word2vec, in which vector representations of words are learned from a large corpus of text. DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016) were inspired by this architecture: they use a random walk-based sampling technique to generate node sequences that capture a node's neighborhood characteristics, similar to how a sentence captures the relational interaction between words. A skip-gram model is then used to learn the representation of a node, which aids in predicting its structural context within these node sequences. The aim is to maximize the network probability in terms of local structures given a network G = (V, E), which is expressed as:


\arg\max_{\theta} \prod_{v \in V} \prod_{c \in N(v)} p(c \mid v; \theta)

where N(v) is the neighborhood of node v in the network G, which can be defined in different ways, such as v's one-hop neighbors, and p(c | v; θ) represents the conditional probability of having a context node c given a node v.

3.2.2. Heterogeneous Network Embedding

(Dong et al., 2017) introduces the heterogeneous skip-gram approach to learn the semantics of heterogeneous graphs, where more than one node type is present. In this study, inspired by metapath2vec, we propose weighted meta-path-based random walks in heterogeneous networks. We built a pipeline with the following steps to achieve our aim of obtaining link predictions: 1) define a series of meta-path-based random walks to turn the structure of the network into skip-gram input; 2) from the node sequences obtained in step 1, learn node representations for the heterogeneous network; 3) recast the problem as a binary classification task by computing a node-pair embedding, which is then fed into a binary classifier to predict links.

3.2.3. Weighted Meta-path Biased Random Walks

How do we efficiently transform a graph's topology into skip-gram input? In word2vec, we train a neural network to predict the nearby words given a specific word. To train this model, we need a corpus of text from which to build a vocabulary of words; using these sentences, the model learns the vector representation of each word in the vocabulary. Similarly, the goal in our use case is to learn the vector representations of the nodes in our network. For our model to learn the context of each author node, it needs a set of node sequences analogous to the sentences used in training word2vec. For generating these node sequences, we could have used graph traversal techniques like BFS and DFS, but those are computationally expensive and might also lead to many irrelevant meta-paths. Inspired by metapath2vec, we design weighted meta-path-based random walks that can capture both the topological and semantic relations among the nodes in the network. Similar to the skip-gram in word2vec, here we use a heterogeneous skip-gram model that distinguishes the node types and generates the embeddings or vector representations for each of the author nodes.

Meta-path: (Sun & Han, 2013) formally defines a meta-path scheme P as a path denoted in the form

V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l

where R = R_1 \circ R_2 \circ \cdots \circ R_{l-1} defines the composite relation between node types V_1 and V_l. To capture structural and semantic features, we design multiple meta-paths, where each meta-path exploits a different semantic meaning of the relation. Table 2 below describes the meta-paths we have used and their semantic meaning.

Table 2. Meta-paths and their semantic meaning

Meta-Path            Semantic Meaning
A – P – A            a1 and a2 are co-authors – they publish together
A – P – V – P – A    a1 and a2 publish at the same venue
A – P – J – P – A    a1 and a2 publish in the same journal
A – P – A – P – A    a1 and a2 publish with the same author
A – I – A            a1 and a2 are affiliated with the same institution
A – P – P – P – A    the same papers cite a1 and a2

We use these meta-paths to guide heterogeneous random walkers. Given a heterogeneous academic network G = (V, E, T) and a meta-path scheme

\mathcal{P}: V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l,

the transition probability at step i is defined as follows:

p(v^{i+1} \mid v_t^i, \mathcal{P}) =
\begin{cases}
\dfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\
0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \neq t+1 \\
0 & (v^{i+1}, v_t^i) \notin E
\end{cases}

where v_t^i \in V_t and N_{t+1}(v_t^i) denotes the V_{t+1}-type neighborhood of the node v_t^i. In other words, we ask the random walker to traverse to the next node only if the adjacent node is of the node type specified in the pre-defined meta-path scheme. Hence, in the equation above, the transition probability at step i is 0 when the adjacent node type is not the one specified in the meta-path or when there is no edge between the two nodes.
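The following sketch shows one way such a meta-path-biased walk could be implemented, assuming simple Python dictionaries for the adjacency lists and node types; this is a simplification of the actual heterogeneous graph used in this work.

import random

def metapath_walk(neighbors, node_type, start, metapath, walk_length):
    """One meta-path-biased random walk.

    neighbors : dict node -> list of adjacent nodes
    node_type : dict node -> type label, e.g. 'A', 'P', 'V'
    metapath  : list of type labels, e.g. ['A', 'P', 'V', 'P', 'A'] (symmetric,
                starting with the type of the start node)
    """
    walk = [start]
    # Cycle through the meta-path (dropping the repeated first label) until the
    # walk reaches the desired length or gets stuck.
    pattern = metapath[1:]
    i = 0
    while len(walk) < walk_length:
        wanted = pattern[i % len(pattern)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:          # all transition probabilities are 0
            break
        walk.append(random.choice(candidates))   # uniform over valid neighbors
        i += 1
    return walk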

Each of these pre-defined meta-paths has its own importance in predicting the co-authorship relation. So, we rank these meta-paths by generating node embeddings for each of them and training a classifier to predict the non-existing links. In this way, we obtain each meta-path's importance, which decides how many walks of each meta-path type are passed into our heterogeneous skip-gram model.

Figure 9. Weighted meta-path approach using supervised learning


As described in Figure 9, similar to the feature extraction approach implemented earlier, we extract the neighborhoods of author node pairs using a set of features, each of which corresponds to a meta-path. For example, 'common co-authors' corresponds to the 'APAPA' meta-path of our pipeline. Using this set of six features, we train a supervised classification model to learn the co-author links among the authors. The assumption here is that the feature importance scores of these features reflect the importance of each corresponding meta-path.

Figure 10. Weighted meta-paths and their importance scores

Figure 10 describes each meta-path we used and its corresponding importance score. From the adjacent table, we see that the 'APVPA', 'APAPA', and 'APPPA' meta-paths contribute more to predicting the co-author relationships than the other three.
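A minimal sketch of this ranking step is shown below, assuming a feature matrix X with one column of similarity scores per meta-path and binary co-author labels y; the choice of a Random Forest and its hyperparameters here is illustrative, not the exact setup used in this work.

from sklearn.ensemble import RandomForestClassifier

META_PATHS = ["APA", "APVPA", "APJPA", "APAPA", "AIA", "APPPA"]

def rank_metapaths(X, y):
    # One feature column per meta-path; feature importances act as weights
    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    ranked = sorted(zip(META_PATHS, clf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    return ranked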

3.2.4. Heterogeneous Skip-gram Model

We use the heterogeneous skip-gram model introduced by (Dong et al., 2017) to learn node embeddings from the node sequences generated by the meta-path-based walks above. The next question is: what is the difference between the original skip-gram model used in word2vec and the heterogeneous skip-gram model?

In word2vec, we train a neural network to do the following: given a specific word in the middle of a sentence, pick one word at random from the words nearby. The network tells us, for every word in our vocabulary, the probability of it being the “nearby word” that we chose. The “nearby words” are nothing but the context of the given input word. Similarly, in subsection 3.2.3 we discussed how we built the context using pre-defined meta-paths. The intuition behind the original skip-gram is that if two different words have very similar “contexts”, then the model needs to output very similar results for these two words, and one way for the network to output similar context predictions for the two words is for their word vectors to be similar. So, if two words have similar contexts, our network is motivated to learn similar word vectors for them. Now, let us map this intuition to our use case of co-author prediction. Here, instead of the words in the word2vec model, we have the different nodes in the network. The intuition remains the same, i.e., two authors with similar contexts (created using meta-paths) need to have similar vector representations or embeddings. In other words, the model we built learns the embedding of each author node, capturing its heterogeneous context with other node types and the relationships among them.

Given a heterogeneous network G = (V, E, T) with |T_V| > 1, the difference is that we have multiple node types. So, we add a new term, N_t(v), t \in T_V, to maximize the likelihood of each node type given a node v:

\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta)

where N_t(v) denotes v's neighborhood with the t-th type of nodes, c_t is the context of the given input node, and p(c_t \mid v; \theta) is commonly defined as a softmax function, that is,

p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}}

where X_v is the v-th row of X, representing the embedding vector for node v.


3.3. Supervised Machine Learning Algorithms

Supervised learning is the most common technique for classification problems, since the aim is to make the system learn the labels we have assigned to each of the input samples. When input data is fed into the model, it adjusts its weights during training to ensure that the model fits the data appropriately. The key challenge is to build a mapping function that can predict the label of a given sample based on the extracted features. Supervised machine learning can be split into two types depending on the task being handled, i.e., classification and regression.

Classification uses an algorithm to correctly assign the input test data into specific classes whereas Regression is used to understand the relationship between independent and dependent variables.

In this work, we have trained our features using four different types of classification algorithms, i.e., Logistic Regression (Walker & Duncan, 1967), Support Vector Machines (Hearst et al., 1998), Random Forests (Quinlan, 1986), and AdaBoost (Freund & Schapire, 1996). In the following subsections, we describe each technique in detail with its pros and cons.

3.3.1. Logistic Regression

Logistic regression (Walker & Duncan, 1967) operates by extracting a series of weighted features from the data, taking logs, and combining them linearly, so that each feature is multiplied by a weight and the products are added together. It is a form of regression that predicts the likelihood of an event occurring by fitting the data to a logistic function. Like every other form of regression analysis, logistic regression employs a number of predictor variables that may be numerical or categorical. The logistic function, also known as the sigmoid function, is an S-shaped curve that takes a real-valued number as input and maps it to a value between 0 and 1, but never exactly 0 or 1. In binary logistic regression, the categorical output has only two cases (e.g., whether a connection or link exists or not); in multinomial logistic regression, there are three or more possible outcomes.

Using a decision boundary, a threshold can be set to predict which class a data point belongs to. Based on this threshold, the estimated likelihood is mapped to a class or category. For example, if the predicted probability is greater than or equal to 0.5, we classify the co-author relationship as existing; otherwise, we do not.

3.3.2. Support Vector Machines

SVM (Hearst et al., 1998), which stands for support vector machine, can be used for both regression and classification tasks, though it is most commonly used for classification. Many users choose SVMs because they achieve good accuracy while using relatively few computing resources. The SVM algorithm's aim is to locate a hyperplane in an N-dimensional space. SVM is a binary classifier at its core, and hyperplanes are used to separate the two types of data points. We aim to find the hyperplane with the largest margin, i.e., the greatest distance between the data points of the two groups. Maximizing the margin provides some reinforcement, enabling future data points to be classified with greater confidence. Hyperplanes are decision boundaries that aid in the classification of data points: different classes may be assigned to data points that land on either side of the hyperplane. Furthermore, the hyperplane's dimension is determined by the number of features. When there are two input features, the hyperplane is simply a line; with three input features, it becomes a two-dimensional plane; beyond three features, it becomes hard to visualize.

Support vectors are the data points closest to the hyperplane, and they affect the hyperplane's position and orientation. Using these support vectors, we maximize the classifier's margin; the location of the hyperplane will shift if the support vectors are removed. In SVM, we take the output of the linear function and assign one class if it is greater than 1 and the other class if it is less than -1. Hinge loss is a loss function that aids in margin maximization: if the predicted and actual values have the same sign and the margin is respected, the cost is negligible; if not, we compute a loss value. We also introduce a regularization parameter into the cost function, whose goal is to balance margin maximization and loss. The kernel trick is a strategy used by the SVM algorithm.

Specifically, the SVM kernel is a function that takes a low-dimensional input space and translates it to a higher-dimensional space, converting a non-separable problem to a separable problem. It is most effective when dealing with non-linear separation issues.

Simply stated, it performs some incredibly complicated data transformations before determining how to isolate the data depending on the labels or outputs you've specified.

SVMs do not perform as well on large data collections, since the expected training time becomes long, and they also struggle when the data set is noisy, i.e., when the target classes overlap.

In our use case, we implemented a linear SVM, which uses a linear kernel to find a hyperplane that separates the relationships based on the selected features.

3.3.3. Random Forests

As the name suggests, a random forest is made up of many individual decision trees (Quinlan, 1986) that function together as an ensemble. Each tree in the random forest generates a class prediction, and the class with the most votes becomes the prediction of our model. The wisdom of the crowd is at the heart of random forests. In data science terms, the random forest paradigm performs well because many relatively uncorrelated models (trees) acting as a committee will outperform each of the constituent models individually. The key point is the low correlation between models. Uncorrelated models can generate ensemble forecasts that are more reliable than any of the individual predictions, comparable to how portfolios of assets with low correlations (such as stocks and bonds) create a better-performing portfolio in the aggregate. The explanation for this result is that the trees shield each other from their mistakes (as long as they don't all err in the same direction all of the time): while some trees will be incorrect, many others will be accurate, so as a group the trees move in the correct direction.

A decision tree is a specific type of flow chart used to visualize the process of decision making, where internal nodes represent selected input features, edges represent possible values of those features, and leaves represent possible outcomes. The idea behind decision trees is that the samples are repeatedly divided into subsets until reaching smaller collections of data points that share a single class label. A decision tree is built top down from a root node; at each level, using a cost function and trying different split points, we move from complete uncertainty towards complete certainty. For a decision tree to be effective, it should cover all the possibilities, i.e., all possible pathways and event sequences.

3.3.4. AdaBoost

Boosting is an ensemble strategy that aims to construct a powerful classifier from a set of weak classifiers. This is accomplished by first building a model from the training data and then building a second model that aims to correct the errors of the first. Models are added until the training set is predicted perfectly or the maximum number of models is reached. AdaBoost (Freund & Schapire, 1996) was the first truly successful boosting algorithm for binary classification and is a natural starting point for learning about boosting.

AdaBoost is best used to improve the accuracy of decision trees on binary classification problems. Freund and Schapire, the technique's developers, initially referred to AdaBoost as AdaBoost.M1. It is also known as discrete AdaBoost, since it is used for classification rather than regression. Many machine learning algorithms can benefit from AdaBoost's performance enhancements, but it works best with weak learners: models that achieve accuracy only slightly above random chance on a classification problem. The most common choice is a one-level decision tree; since these trees are so small and make only one classification decision, they are referred to as decision stumps.

3.4. Evaluation Metrics

The performance of any trained model is assessed using evaluation metrics. For a particular problem or use case, these help determine the best-performing model for the selected set of features. They measure how well our model labels the classes or estimates the outcome. There are several measures to evaluate a model's performance, such as precision, recall, F1 score, and AUC score.

3.4.1. Precision

Precision is a performance metric that calculates the fraction of instances returned by the classifier that are correct. It is also known as the positive predictive value. It is the ratio of true positives (correctly retrieved instances) to the total number of true positives and false positives. High precision means our model has a low false positive rate.


3.4.2. Recall

Recall is another performance measure used in machine learning; it gives the fraction of actual positive instances that are correctly retrieved by the classifier. It is also known as the sensitivity of the classifier. It is the ratio of correctly predicted positive observations to all observations in the actual positive class. We can think of recall as accuracy over just the positives.

3.4.3. F- measure

The F1 score is the harmonic mean of precision and recall. It comes into the picture when we need a balance between precision and recall. Also, compared with accuracy, F1 can be a better measure when there is an uneven class distribution.

3.4.4. AUC Score

A ROC curve (Receiver Operating Characteristic curve) is a graph that depicts a classification model's performance over all classification thresholds. This curve plots two parameters:

• True Positive Rate

• False Positive Rate

TPR vs. FPR is plotted on a ROC curve at various classification thresholds. Lowering the classification threshold causes more items to be classified as positive, which increases both the false positives and the true positives.


AUC is an abbreviation for "Area Under the ROC Curve." It measures the entire two-dimensional area under the ROC curve and offers a consolidated indicator of performance across all classification thresholds. AUC can be interpreted as the likelihood that the model scores a random positive sample higher than a random negative sample.

The AUC scale runs from 0 to 1. A model with 100% incorrect predictions has an AUC of 0.0; one with 100% accurate predictions has an AUC of 1.0.
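For reference, a minimal sketch of how these four metrics can be computed with scikit-learn for a fitted binary classifier is shown below; the function and variable names are illustrative rather than taken from the thesis code.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(clf, X_test, y_test):
    # Report the four metrics used in this study for a fitted binary classifier.
    y_pred = clf.predict(X_test)
    # For classifiers without predict_proba (e.g., linear SVMs),
    # decision_function scores can be used for the AUC instead.
    y_prob = clf.predict_proba(X_test)[:, 1]
    return {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }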


4. Data and Experimental Setup

This section describes the data collection and pre-processing steps we followed in building the link prediction model. Then, we explain the implementation steps of the two approaches we explored in this work.

4.1. Data

We studied the problem of predicting potential future collaborations (link prediction) among the authors in an academic collaboration network. In other words, our goal was to predict the collaborations that might occur in the future by looking at past collaborations using various link prediction techniques. There are two types of networks along this line: homogeneous and heterogeneous. In a homogeneous network, only one type of object (authors) and one type of link (co-authorship) exist. In contrast, a heterogeneous network consists of multiple types of objects (e.g., venues, papers, fields of study) and various types of links (e.g., co-author, write, belongs to). Although we could make predictions using information from homogeneous networks by taking into account just the available “co-author” relationship, heterogeneous bibliographic networks can generate more accurate predictions by using the heterogeneous context of an author on top of the topological structure of the graph. Towards this objective, there are many publicly available bibliographic data sets, such as DBLP, Cora, Microsoft Academic Graph, CiteSeer, PubMed, and arXiv. Here, we use the Microsoft Academic Graph (MAG) (Sinha et al., 2015) for our co-author prediction task because of its heterogeneous nature and the way it is built: documents discovered by the Bing crawler are processed to extract scholarly entities and their relationships, forming a knowledge base. Currently, MAG is one of the most significant academic content indexes next to Google Scholar.

4.1.1. Microsoft Academic Graph

The MAG (Sinha et al., 2015) is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. Figure 11 shows the count of each entity type in MAG and its heterogeneous structure.

Figure 11. MAG schema and MAG entity data(adapted from academic.microsoft.com)

Each entity in the MAG schema has a set of attributes attached. For example, the Author entity consists of details such as Author Id, Rank, DisplayName, Paper Count, Citation Count, etc., and similarly, the Papers entity consists of attributes such as Paper Id, Doi, Doc Type, Paper Title, Year, Date, Conference Id, Citation Count, Publisher, etc.


4.1.2. Data Collection

We queried MAG for authors with a citation count greater than 100 and publications that belong to the field of study (FoS) Computer Science in the timeframe 2005 to 2020. The corresponding papers, conferences, journals, and affiliations were also retrieved using the above Author IDs. To perform network analysis and link prediction, we need to construct a heterogeneous graph from the available data in CSV. Hence, we loaded the data into a NoSQL graph database, Neo4j (neo4j.com).

Table 3. Distribution of the count of each node type

Node Label                   Count
Author                       82421
Papers                       209525
Conference Series (Venues)   62491
Journals                     20972
Affiliations                 1842

Each entity type has its own set of attributes, which we call properties in Neo4j. This study created five distinct entities – Author, Papers, Affiliation, Conference Series (Venue), and Journals. Figure 12 shows each of the entities and their related properties.


Figure 12. Hierarchy describing 5 different entities and their properties of each entity type

4.1.3. Building a Collaboration Graph

The dataset doesn't contain edges among the entities or nodes describing their relationships. We therefore created a set of edges among these entities representing the nature of the relationships among the nodes. The schema below shows the nodes and their corresponding relationships. Each node in the graph has its own ID, which acts as a primary key when creating connections between entities.

We could infer the authors' collaborations by finding the papers authored by multiple people. A “CO_AUTHOR” relationship between the author nodes is created when they have collaborated at least once. A heterogeneous graph is constructed by creating relationships among other node types.
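A sketch of how such CO_AUTHOR edges could be created from the WROTE relationships with the Neo4j Python driver is shown below; the connection details are placeholders, and the exact Cypher used in this work may differ.

from neo4j import GraphDatabase

# Placeholder connection details; node labels and relationship types follow
# the schema described above (Author and Paper nodes linked by WROTE).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_CO_AUTHOR = """
MATCH (a1:Author)-[:WROTE]->(p:Paper)<-[:WROTE]-(a2:Author)
WHERE id(a1) < id(a2)
MERGE (a1)-[:CO_AUTHOR]-(a2)
"""

with driver.session() as session:
    session.run(CREATE_CO_AUTHOR)   # infer co-author edges from shared papers
driver.close()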


Figure 13. Heterogeneous graph schema

Table 4 below briefly describes each of the relationship types created using Cypher queries in Neo4j.

Table 4. Description of relationships in the graph

Node 1   Node 2        Relationship           Description of relationship type
Author   Author        [:CO_AUTHOR]           An edge between a pair of authors if they have at least one Paper in common
Author   Affiliation   [:AFFILIATED_WITH]     Author's current affiliation with an institute
Author   Paper         [:WROTE]               An author wrote the Paper
Paper    Conference    [:PUBLISHED_AT]        Paper published at a venue/conference
Paper    Journal       [:PUBLISHED_IN]        Paper published in a journal
Paper    Paper         [:CITED]               Paper cited/referenced

4.2. Link Prediction Problem

The link prediction problem can be defined as inferring new links that are likely to occur in the near future among a subset of nodes in the network. We consider a network graph G = (V, E) in which each edge e ∈ E represents an interaction between its endpoints at a particular time t(e). To formulate the link prediction problem, we choose a training interval [t_0, t_0'] and a test interval [t_1, t_1'] where t_0' < t_1; given access to the network G[t_0, t_0'], the algorithm must predict whether or not an edge exists between a pair of authors in the network G[t_1, t_1']. We used the collaboration network constructed from MAG to predict the links between author pairs that are likely to occur in the future using the topological and semantic methods defined in Chapter 3.

To evaluate the efficacy of our method, we need to train a machine learning model to learn the co-author relationships among the author nodes in the network.

Towards this goal, the first step is to divide our data into train and test sets: using the author pairs in the train set, we train a classifier, and we evaluate the trained classifier's performance using the test set of author pairs. One way of dividing the train and test sets is to use the authors' collaboration year, with existing co-author edges as positive samples and author pairs n hops apart (usually n = 2, 3, 4) as negative samples. The other way is to randomly remove a fraction of edges and predict those. Our study experimented with both ways and compared the results of the two. The sub-sections below explain the two experiments conducted using the feature extraction approach.

4.2.1. Case 1: Experiment with Negative Samples as Nodes n-hop Away

Inspired by (Mark, 2019), we built a binary classifier to predict future collaborations from the co-authorship graph. We first need to create our train and test datasets. However, there is a risk of data leakage in the case of graph data, since pairs of nodes in our training set may be linked to those in the test set. So, we split our graph into training and test subgraphs using the time information we have, i.e., the first year in which co-authors collaborated. We found that 2015 acts as a good year to split the data, as it gives a reasonable number of samples in each of our subgraphs. We used everything from 2015 and earlier as our training graph and everything from 2016 to 2020 as our test graph. In other words, by training our classifier on collaboration data from 2005 to 2015, we predict the collaborations that might occur in the next five years, i.e., 2016 to 2020. This split leaves us with 224,096 relationships in the early graph and 103,582 in the later one.

The relationships in these subgraphs act as the positive samples in our train and test sets, but we also need negative examples so that our model can learn to distinguish node pairs that could have a link between them from those that could not. Instead of using all possible pairs as negative samples, we used pairs of nodes between 2 and 3 hops away from each other. This gave us 600k+ negative samples compared to 224,096 positive samples. To overcome the class imbalance issue, we applied a down-sampling technique to the generated negative samples. Figure 14 shows what the training and test data look like:

Figure 14. Training data frame
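For illustration, the sketch below shows one way to sample such negative pairs 2-3 hops apart with NetworkX; the thesis generated these samples from the graph database, so this is only an equivalent in-memory approximation with hypothetical parameter names.

import random
import networkx as nx

def sample_negatives(G, n_samples, min_hops=2, max_hops=3):
    # Sample author pairs that are 2-3 hops apart (no existing co-author edge)
    negatives = set()
    nodes = list(G.nodes())
    while len(negatives) < n_samples:
        u = random.choice(nodes)
        # Shortest-path lengths from u, limited to max_hops
        lengths = nx.single_source_shortest_path_length(G, u, cutoff=max_hops)
        candidates = [v for v, d in lengths.items() if min_hops <= d <= max_hops]
        if candidates:
            v = random.choice(candidates)
            negatives.add(tuple(sorted((u, v))))
    return list(negatives)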

4.2.2. Case 2: Experiment with Randomly Chosen Negative Samples

In the second set of experiments, instead of relying on the collaboration year, we consider a network with randomly sampled negative examples. The labeled dataset of edges is generated as follows: ensuring that the residual network remains connected, we randomly choose 40% of the existing edges from the network and label them as positive examples, and to generate negative examples, we randomly sample an equal number of node pairs from the network that have no co-author edge between them. This split leaves us with 131,071 positive and 131,071 negative examples. We then proceed to extract features from the data and train the classifiers on it.

4.3. Generating Link Prediction Features

We then used the methods listed in Chapter 3 to create the topological and semantic similarity features used for link prediction.

4.4. Choosing a Binary Classifier

Our feature set contains both strong and weak features. We train and evaluate our models using Random Forests, Logistic Regression, AdaBoost, and SVM.
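A minimal sketch of this comparison with scikit-learn is shown below; the hyperparameters are illustrative defaults, not the exact settings used in this work, and X and y stand for the extracted features and the co-author labels.

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# X: extracted topological + semantic features, y: co-author labels (0/1)
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "Linear SVM": LinearSVC(max_iter=5000),
}

def compare(X, y):
    # 5-fold cross-validated AUC for each candidate classifier
    return {
        name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        for name, clf in classifiers.items()
    }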

4.5. Network Embedding Based Approach for Predicting Future Collaborations

This section explains the implementation details of the network embedding-based approach. We train a binary classifier on top of the node embeddings obtained from the skip-gram model, which takes as input the heterogeneous context of the nodes built through pre-defined random walks. To describe each step in detail, we constructed a pipeline (Figure 15) with our heterogeneous collaboration graph as input and co-author link predictions as output.


Figure 15. Pipeline for meta-path-based network embedding approach

We divide the pipeline into two major components: generating node embeddings for each node and building a prediction pipeline for learning the co-author relationships. Let us first look at obtaining the node embeddings. The input is our heterogeneous collaboration graph with Author, Paper, Venue, Journal, and Affiliation nodes and the relationships among them.

4.5.1. Generating Node Embeddings

To model the heterogeneous neighborhood, we use the heterogeneous skip-gram model.

We have applied pre-defined meta-path-based random walks to the heterogeneous collaboration network to incorporate the heterogeneous network structure into the skip-gram model. We could obtain the heterogeneous context either by using graph traversal strategies, i.e., BFS and DFS, or by using pre-defined meta-paths. The former is computationally expensive, since random walks also generate many irrelevant meta-paths, while the latter requires domain knowledge. We believe that, for our use case of co-authorship

prediction, building a neighborhood for each of the author nodes using pre-defined symmetric meta-paths makes more sense than training a skip-gram using random walks from automatic meta-path generation methods. In this work, to capture the heterogeneous context of author nodes w.r.t. other node types, i.e., Papers, Venues, Journals, and Affiliations, we construct six types of meta-paths, which help us measure the similarity among the authors in each of these contexts. These symmetric meta-paths were defined so that each meta-path starts and ends with an Author node, because our goal is to obtain the low-dimensional representations of author nodes.

Figure 16 below shows a snippet of the Author-Paper-Venue-Paper-Author meta-path.

Figure 16. Author - Paper - Venue - Paper - Author meta-path

Using these meta-paths as input, we have trained a heterogeneous skip-gram model using the following set of parameters to obtain the node embedding for each node.

1. The number of walks per node w: 1000

2. The walk length l: 100

3. The vector dimension d: 128

4. The neighborhood size (window size) k:5
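Once the walks themselves are constrained by the meta-paths, the skip-gram training can be approximated with a standard skip-gram implementation. The sketch below uses gensim's Word2Vec with the parameters listed above; this tooling choice is an assumption rather than the exact implementation used here.

from gensim.models import Word2Vec

# `walks` is a list of node-ID sequences produced by the weighted meta-path
# random walks described above (walk length 100, up to 1000 walks per node).
def train_skipgram(walks):
    model = Word2Vec(
        sentences=walks,
        vector_size=128,   # embedding dimension d
        window=5,          # neighborhood (context) size k
        sg=1,              # skip-gram
        negative=5,        # negative sampling
        min_count=0,
        workers=4,
    )
    return model.wv       # keyed vectors: author ID -> 128-d embedding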


Weighted Meta-paths: Initially, we computed node embeddings using these multiple meta-paths, where each type had equal importance in obtaining each node's embedding. Then, we implemented a weighted meta-path-based approach to assign higher weights to those meta-paths that help make more meaningful co-authorship predictions. This pre-processing step not only gives higher importance to meaningful meta-paths but also helps eliminate irrelevant ones. Our intuition here is that the number-of-walks parameter used in training our skip-gram model decides the importance of each meta-path type as a whole in obtaining the node embeddings. To weight the meta-paths differently, we first rank them by training a classifier on a set of features describing each of the meta-path schemes and use the resulting feature importance scores as the meta-path weights. Figure 17 illustrates the modified pipeline.

Figure 17. Weighted meta-path learning pipeline

4.5.2. Prediction Pipeline

To obtain link predictions, different network embedding strategies require different pipelines. Some embedding approaches calculate link probabilities explicitly (Kang et al., 2018) (Zhang et al., 2018); for others, these need to be learned on top of the node embeddings. There are two typical approaches: (i) estimating the similarities among the nodes using a distance metric like Euclidean distance or cosine similarity to predict the existence of a link, and (ii) treating the problem as a supervised machine learning problem by assigning labels to each of the data samples. The latter, which is generally more successful (Gurukar et al., 2019), requires a node-pair embedding pre-computation stage: at this point we only have the node representations, so to transform the problem into a classification task we need to compute an edge embedding by applying an operator to each pair of node embeddings (Alexandru Cristian Mara, 2020). In this work, we used the latter to build our prediction pipeline on top of the node embeddings computed in the previous step.

Train/Test Split: We have split the data in the following way to avoid any data leakage and make sure that algorithms are evaluated correctly (CSIROData61., 2018).

• Train Graph: For computing node embeddings.

• Training Set: A collection of positive and negative edges used to train the classifiers on the node embeddings computed from the Train Graph.

• Model Selection Set: A collection of positive and negative edges that were not used for computing node embeddings or training the classifier, used to choose the best classifier.

• Test Graph: For computing test node embeddings.

• Test Set: A collection of positive and negative edges that are not included in the computation of the test node embeddings, classifier training, or model selection.


Choice of Binary Operator (Grover & Leskovec, 2016): Given two author nodes u and v, we define a binary operator ◦ over the corresponding feature vectors f(u) and f(v) to generate a representation g(u, v) such that g: V × V → R^{d'}, where d' is the representation size (128 in our case) for the author pair (u, v). Table 5 below shows the set of binary operators we used for learning edge features.

Table 5. Choice of binary operator
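Since the operators follow (Grover & Leskovec, 2016), the sketch below lists the four standard choices from that work (average, Hadamard, weighted-L1, weighted-L2) and shows how an edge embedding would be formed from two author embeddings; the function names are illustrative, and the mapping to the entries of Table 5 is assumed.

import numpy as np

# The four operators defined in (Grover & Leskovec, 2016); the best-performing
# one is selected on the model-selection set.
def average(u, v):  return (u + v) / 2.0
def hadamard(u, v): return u * v
def l1(u, v):       return np.abs(u - v)
def l2(u, v):       return (u - v) ** 2

def edge_embedding(emb, a1, a2, op=hadamard):
    # Combine two 128-d author embeddings into one edge feature vector
    return op(np.asarray(emb[a1]), np.asarray(emb[a2]))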

We have used 56,387 examples as the Training Set, 23,986 examples for Model Selection, and 82,541 as the Test Set for our data set. Below are the steps we implemented to train and evaluate the link prediction model.

Training:

1. Apply a binary operator to the embeddings of the source and target author nodes of each sampled edge to calculate the link/edge embeddings for the positive and negative edge samples.

2. Train a classifier to predict whether or not a co-authorship relation between two authors should occur, based on the embeddings of the positive and negative cases.

3. Select the best classifier by evaluating the link classifier's output on the training data for each of the four operators, using the node embeddings computed on the Train Graph.


Testing:

1. The best-performing classifier is used to compute scores on the test data, with the node embeddings computed on the Test Graph.


5. Results and Discussion

This section evaluates and discusses the performance of the supervised machine learning algorithms used for both the feature extraction-based and network embedding-based approaches implemented in this thesis. Apart from the evaluation metrics, we also performed a case study – “Relevant Author Search” – to demonstrate the effectiveness of our weighted meta-path-based network embedding approach in recommending similar authors, and we compare our feature extraction-based and network embedding-based approaches against the baseline from the literature, metapath2vec.

5.1. Feature Extraction Based Approach Results

We extract the node pairs from the train and test subgraphs, and a set of classifiers is trained on a data split with an equal number of positive and negative examples.

Supervised learning is the process of learning a mapping function with an algorithm such that, given new input data (x), you can predict the output variable (y). We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected. When the algorithm reaches an acceptable level of performance, the learning process ends. Since our use case deals with labels or classes, we used classification algorithms. In this study, the inputs are the extracted topological and semantic features, whereas the output is the actual co-author relationship between the author node pairs.

In the following two sub-sections, 5.1.1 and 5.1.2, we show our experiments' performance for the two cases described in Chapter 4.


5.1.1. Results of Experiments with Negative Samples as Nodes n-hop Away

We experimented with the set of topological and semantic features extracted using the methods described in Chapter 3. Table 6 below shows the performance of the different classifiers evaluated on 120k samples of author pairs (the test set) for finding co-author relationships that might occur in the future. Along with Logistic Regression, we also trained the set of features using SVM, Random Forest, and AdaBoost models. For this subset of data, the models performed well, with AUC values ranging from 0.80 to 0.89.

Table 6. Performance of different classifiers (Logistic Regression, SVM, Random Forest, and AdaBoost) for predicting co-author relations.

Classifier            Accuracy   Precision   Recall   AUC    F1 Score
Random Forest         0.91       0.88        0.81     0.89   0.84
SVM                   0.85       0.81        0.72     0.80   0.76
AdaBoost              0.86       0.87        0.86     0.84   0.86
Logistic Regression   0.86       0.88        0.78     0.82   0.87

In our data collection, we ensured that the counts of positive and negative samples were almost equal. For the reported findings, we used 5-fold cross-validation for all of the algorithms. According to the table above, all of the models we tested reached an accuracy greater than 80%, which indicates that the features we chose have high discriminating capacity. On the accuracy metric, Random Forest performed best, with an accuracy of 91%. Although the subset of data we extracted from MAG covers the previous 15 years of published articles, the accuracy of link prediction did not deteriorate over this longer time range, even though authors' institutional affiliations, co-authors, and research interests may vary over time. The semantic features of paper-title similarity and keyword (abstract) similarity might be a reason for this. Apart from accuracy, we can also consider the F1 score, the harmonic mean of precision and recall; on this metric, Logistic Regression performed better than Random Forest, which has the higher accuracy. To better evaluate the performance at different classification thresholds, we plotted the ROC curve for the test set of author node pairs (Figure 18).

Figure 18. ROC-AUC Curve with topological and semantic features using Random Forests

AUC Interpretation:

1. At the lowest point of the curve, (0, 0), the threshold is 1.0. Here the model classifies every author pair as negative, so no co-author links are predicted.

2. At the highest point, (1, 1), the threshold is 0.0. Here the model predicts every node pair as co-authors.

3. The remainder of the curve shows the False Positive Rate and True Positive Rate for threshold values between 0 and 1. At one point, a False Positive Rate close to zero is achieved together with a True Positive Rate close to one (around 0.9); this is where the model most accurately recovers the existing co-author relationships.

In this experiment, we obtained an AUC of 0.89. In simple terms, this means that if we pick a random existing co-author pair and a random non-co-author pair, the model ranks the existing pair higher 89% of the time.
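The ROC curve and AUC reported here can be computed from a classifier's test-set scores roughly as follows; the labels and scores below are random placeholders rather than our actual Random Forest outputs.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder labels and scores; in practice these are the true co-author labels
# and the classifier's predicted probabilities on the held-out author pairs.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
y_score = np.clip(0.6 * y_test + 0.5 * rng.random(500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # FPR/TPR at every threshold
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```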

5.1.2. Results of Experiments with Randomly chosen Negative Samples

As described in Chapter 4, we also conducted experiments by randomly choosing a fraction of positive and negative edges from the existing network and predicting them. By splitting the data this way, we test the efficacy of our extracted features. In the previous approach, the negative samples were author nodes 2 or 3 hops away, which might lead the model to rely more on structural similarity than on the semantic similarity of research areas that we computed. The results in Table 7 below show that, for the feature extraction approach with a random train/test split, performance was comparatively poor relative to the previous case.

Table 7. Classifier performance with randomly chosen negative samples

Classifier            Accuracy   Precision   Recall   AUC
Random Forest         0.71       0.68        0.66     0.76
SVM                   0.62       0.62        0.53     0.68
AdaBoost              0.64       0.57        0.58     0.65
Logistic Regression   0.68       0.69        0.63     0.73


In this case, the counts of positive and negative samples are equal, and we again used 5-fold cross-validation for the reported results. Based on the table above, the models' performance was poorer than in the previous approach. On the accuracy metric, Random Forest again performed best, with an accuracy of 71%. The AUC scores range from 0.65 to 0.76, roughly a 10% decrease in the models' performance.

5.1.3. Comparing Results of Case-1 and Case-2

In the first set of experiments, the negative samples were author nodes 2 to 3 hops away from the candidate author. In the second set of experiments, we randomly sampled the positive and negative examples, so both the train and test sets contain pairs of authors that are far apart in the network, which was not the case earlier. The reason for randomly sampling the author pairs is to test the efficacy of the semantic features extracted from the authors' node attributes; a sketch of the two sampling schemes is given below.
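A small NetworkX sketch of the two negative-sampling schemes, assuming G is a (hypothetical) co-author graph; it is meant only to illustrate the difference between the cases, not to reproduce our exact sampling code.

```python
import random
import networkx as nx

def n_hop_negatives(G, n_samples, min_hop=2, max_hop=3):
    """Case 1: non-connected author pairs that are 2-3 hops apart in the graph."""
    negatives, nodes = set(), list(G.nodes())
    while len(negatives) < n_samples:
        u = random.choice(nodes)
        # Shortest-path lengths from u, truncated at max_hop.
        lengths = nx.single_source_shortest_path_length(G, u, cutoff=max_hop)
        candidates = [v for v, d in lengths.items() if min_hop <= d <= max_hop]
        if candidates:
            negatives.add((u, random.choice(candidates)))
    return list(negatives)

def random_negatives(G, n_samples):
    """Case 2: uniformly random author pairs with no existing co-author edge."""
    negatives, nodes = set(), list(G.nodes())
    while len(negatives) < n_samples:
        u, v = random.sample(nodes, 2)
        if not G.has_edge(u, v):
            negatives.add((u, v))
    return list(negatives)
```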

The results show that the model's performance was comparatively poor when the train and test sets were randomly sampled. This suggests that our feature extraction approach depends more on the author nodes' structural similarity than on the semantic features; the model was unable to identify similar authors that are far apart in the network.

Thus, our feature extraction approach could not capture the various semantics behind the authors' similarity.

5.2. Network Embedding Based Approach Results

We show the node embeddings learned by our heterogeneous skip-gram model in the following sub-sections by projecting them onto a 2D plane using PCA. Then, we compare and analyze the results of the weighted meta-path-based supervised learning methods, followed by a case study, "Relevant Author Search," to demonstrate the effectiveness of our methods compared to the feature extraction approach and the baselines in the literature.

5.2.1. Author’s Node Embedding Visualizations

We plotted the author nodes trained using multiple meta-path schemes onto a two-dimensional plane (Figure 19) by applying Principal Component Analysis (n_components = 2) to the original 128-dimensional embeddings; a short sketch of this projection follows the figure caption.

Figure 19. t-SNE visualizations of node embeddings learned from MAG data
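The projection behind this visualization can be reproduced roughly as follows; author_embeddings is a placeholder for the learned 128-dimensional vectors, and we use the PCA projection described above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

author_embeddings = np.random.rand(500, 128)   # placeholder for the learned 128-d vectors

coords = PCA(n_components=2).fit_transform(author_embeddings)  # 128-d -> 2-d
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("Author node embeddings projected to 2D")
plt.show()
```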

In Figure 19, two authors who appear close to each other in the 2D space either share similar research circles or tend to share similar research interests. For example, consider the two authors "Ross T Whitaker," affiliated with the University of Utah, and "Jing Hua," affiliated with Wayne State University, who are close to each other in the embedding space. These two computer scientists share common research interests such as Computer Vision, Image Analysis, and Medical Image Processing. On the contrary, "Samuel Madden," affiliated with MIT, works on Database Systems, Mobile Computing, and Distributed Systems, while "Timothy W Simpson," affiliated with Pennsylvania State University, works on Engineering Design; hence, these two authors appear far apart in the low-dimensional embedding space. This example shows that our intuition of capturing semantic relationships among the authors through other node types (Papers, Venues, Affiliations, Journals) with multiple meta-paths is meaningful. However, our goal in this work is to learn the co-author relationships among these author nodes from the existing/past author collaboration graph and to predict collaborations that might occur in the near future.

5.2.2. Weighted Meta-path Based Supervised Learning Results

As described in Chapter 3, multiple meta-paths were defined to encode the author nodes' structural and semantic relations. A machine learning model trained on top of these low-dimensional representations can learn the co-author edge embeddings and predict future co-author relationships, which shows how well our prediction pipeline makes the intuitive notion above concrete.

Figure 20. Author node embeddings and link (co-author) embeddings using PCA

As described in Chapter 4, we apply a set of binary operators to the node embeddings to generate edge representations, which are then used to train machine learning algorithms to learn the co-author relationships. The right-hand side of Figure 20 shows a two-dimensional projection of the edge (co-author) embeddings obtained by applying the Hadamard operator to the 128-dimensional author node embeddings. To see how well the supervised model has learned the co-author relationships, we color the link embeddings according to the MAG ground truth: blue dots represent true positives and red dots represent true negatives. The few blue dots lurking among the red dots are incorrectly classified samples, since these are projections of our test set. Table 8 below compares the performance of the different classifiers for each operator.


Table 8. AUC scores of weighted meta-path approach with different classifiers

Operator      Classifier            AUC Score
Average       Logistic Regression   0.7209
              Random Forests        0.6589
              SVM                   0.7852
Hadamard      Logistic Regression   0.8711
              Random Forests        0.8120
              SVM                   0.7903
Weighted L1   Logistic Regression   0.7118
              Random Forests        0.6106
              SVM                   0.6236
Weighted L2   Logistic Regression   0.6765
              Random Forests        0.7099
              SVM                   0.6292

Among the supervised machine learning algorithms used, Logistic Regression generally performed well compared to Random Forests and SVMs across the binary operators. Looking at the operators individually, the best AUC of 0.87 was obtained with the Hadamard operator, while performance was comparatively poor when edge embeddings were generated using the Weighted L1 and Weighted L2 operators.


5.3. Case Study: Relevant Author Search

To evaluate the quality of our features and of the author node embeddings generated using machine learning and dimensionality-reduction techniques, we conducted a case study comparing the results of our two approaches against the baseline method in the literature, i.e., Metapath2vec.

Table 9. A Case study of relevant author search

Table 9 lists the top-5 authors returned for the query author "Thomas Dietterich" by the three methods: the feature extraction method (topology and semantics), Metapath2vec (baseline), and the weighted meta-path approach. (a) When we queried the top-25 similar authors using the feature extraction approach (a combination of topology and semantics), most of the returned authors (19 out of 25) were already co-authors of Thomas, meaning that this approach is highly dependent on structural closeness and is unable to find authors who work on similar research areas but are far away in the network. (b) Metapath2vec uses the APVPA meta-path to determine the node representations, and different research areas can be published in the same venue. Hence, Metapath2vec returned several authors (11 out of 25) whose particular research interests differ from Thomas's, showing that APVPA meta-path walks (used by Metapath2vec) can collect contextual nodes that are unrelated to the target author node. (c) Our approach does not rely on a single meta-path, so W-metapaths2vec not only returned authors who are already co-authors of Thomas but also found additional authors (e.g., Manish Raghavan) who share similar research interests with Thomas, showing that our methodology captures both topological and semantic relationships among the authors when learning the author embeddings. In this case, 18 of the 25 authors returned for Thomas were not previously co-authors but have similar research interests, indicating that 72% of the co-author recommendations produced by our method are driven by similarity in research areas rather than just structural closeness in the graph. To determine an author's areas of interest, we used their Google Scholar page and personal home page, and we say that two authors' research interests match when at least two of their listed interests overlap. A sketch of the similarity query itself is given below.
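The query can be sketched as a nearest-neighbour search over the learned embeddings by cosine similarity; the author identifiers and embedding matrix below are placeholders rather than our MAG data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(query_author, embeddings, author_ids, k=25):
    """Return the k authors whose embeddings are closest to the query author's."""
    idx = author_ids.index(query_author)
    sims = cosine_similarity(embeddings[idx:idx + 1], embeddings)[0]
    sims[idx] = -np.inf                      # exclude the query author itself
    order = np.argsort(-sims)[:k]
    return [(author_ids[i], float(sims[i])) for i in order]

# Example with placeholder data.
author_ids = [f"author_{i}" for i in range(100)]
embeddings = np.random.rand(100, 128)
print(top_k_similar("author_0", embeddings, author_ids, k=5))
```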

5.4. Comparison of Feature Extraction Based and Network Embedding Based Approach

In the first approach, we achieved an AUC score of 0.79 using just the topological and semantic features, compared to 0.87 using the network embedding approach. This study identifies the factors that drive the network's co-authorship links, and the results presented above show that we successfully determined the similarity among the authors in the network. The feature extraction approach's performance was poor when the data examples were randomly sampled, whereas our weighted meta-path method still captured the semantics among the authors. From the case study above, we infer that the network embedding approach can recover missing links in the network and identify authors with similar research interests who are far away in the collaboration graph.


6. Conclusion and Future Work

Analyzing and harnessing academic collaboration networks is vital for identifying authors with similar research interests. This study utilizes the Microsoft Academic Graph to locate potential co-authorships by identifying the semantics and factors that drive a successful collaboration among researchers. We extracted topological and semantic features from the network to measure the similarities among the authors and thereby recommend co-authors with similar research interests. We then built a pipeline to better understand the semantics behind co-author links using a network embedding approach. Using weighted meta-paths, we could assign higher weights to meta-paths that improve co-author prediction and give less weight to irrelevant meta-paths. Such a system successfully locates authors with similar research interests, allowing us to recommend collaborations, to foster targeted collaborations under possible geographic constraints, and to target grant applications in a specific area of research. Although the performance of our approach was below state-of-the-art results, it offers a new way of looking at the problem and capturing the similarities among the nodes in a network. As an extension of this study, the weighting mechanism could be implemented using Graph Attention Networks, which can combine the preprocessing and supervised learning within a single model.


References

1. Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., & Alemi, A. A. (2018). Watch your step: Learning node embeddings via graph attention. In Proceedings of NeurIPS (pp. 9197-9207).

2. Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.

3. Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., & Smola, A. J. (2013, May). Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web (pp. 37-48).

4. Aiello, L. M., Barrat, A., Schifanella, R., Cattuto, C., Markines, B., & Menczer, F. (2012). Friendship prediction and homophily in social media. ACM Transactions on the Web (TWEB), 6(2), 1-33.

5. Akcora, C. G., Carminati, B., & Ferrari, E. (2013). User similarities on social networks. Social Network Analysis and Mining, 3(3), 475-495.

6. Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006, April). Link prediction using supervised learning. In SDM06: Workshop on Link Analysis, Counter-terrorism and Security (Vol. 30, pp. 798-805).

7. Asadi, K., & Littman, M. L. (2017, July). An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning (pp. 243-252). PMLR.

8. Mara, A. C., Lijffijt, J., & De Bie, T. (2020, October). Benchmarking network embedding models for link prediction: Are we making progress? In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 138-147). IEEE.

9. Almansoori, W., Gao, S., Jarada, T. N., Elsheikh, A. M., Murshed, A. N., Jida, J., ... & Rokne, J. (2012). Link prediction and classification in social networks and its application in healthcare and systems biology. Network Modeling Analysis in Health Informatics and Bioinformatics, 1(1-2), 27-36.

10. Anderson, A., Huttenlocher, D., Kleinberg, J., & Leskovec, J. (2012, February). Effects of user similarity in social media. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 703-712).

11. Barabási, A. L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3-4), 590-614.

12. Bartal, A., Sasson, E., & Ravid, G. (2009, July). Predicting links in social networks using text mining and SNA. In 2009 International Conference on Advances in Social Network Analysis and Mining (pp. 131-136). IEEE.

13. Belkin, M., & Niyogi, P. (2001, December). Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS (Vol. 14, No. 14, pp. 585-591).

14. Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

15. Bhattacharyya, P., Garg, A., & Wu, S. F. (2011). Analysis of user keyword similarity in online social networks. Social Network Analysis and Mining, 1(3), 143-158.

16. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257-289.

17. Cao, S., Lu, W., & Xu, Q. (2016, February). Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).

18. Cao, S., Lu, W., & Xu, Q. (2015, October). GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (pp. 891-900).

19. Chen, B., Li, F., Chen, S., Hu, R., & Chen, L. (2017). Link prediction based on non-negative matrix factorization. PloS One, 12(8), e0182968.

20. Chung, F., & Zhao, W. (2010). PageRank and random walks on graphs. In Fete of Combinatorics and Computer Science (pp. 43-62). Springer, Berlin, Heidelberg.

21. Clauset, A., Moore, C., & Newman, M. E. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191), 98-101.

22. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., & Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth Conference of the American Association for Artificial Intelligence (pp. 509-516). Madison, Wisconsin.

23. CSIRO's Data61 (2018). StellarGraph Machine Learning Library. GitHub.

24. Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4), 18-28.

25. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

26. Dong, Y., Chawla, N. V., & Swami, A. (2017, August). metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 135-144).

27. Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. In ICML (Vol. 96, pp. 148-156).

28. Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999, August). Learning probabilistic relational models. In IJCAI (Vol. 99, pp. 1300-1309).

29. Renstrom, A. G., Goldblatt, R. W., Minkus, C., Berube, K. L., & Launius, R. (2002). Wilbur and Orville Wright: A Bibliography Commemorating the One-Hundredth Anniversary of the First Powered Flight, December 17, 1903. Revised.

30. Goldenberg, A., Zheng, A. X., Fienberg, S. E., & Airoldi, E. M. (2010). A survey of statistical network models.

31. Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78-94.

32. Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).

33. Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

34. Gurukar, S., Vijayan, P., Srinivasan, A., Bajaj, G., Cai, C., Keymanesh, M., ... & Parthasarathy, S. (2019). Network representation learning: Consolidation and renewed bearing. arXiv preprint arXiv:1905.00987.

35. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.

36. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

37. Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

38. Harada, S., Akita, H., Tsubaki, M., Baba, Y., Takigawa, I., Yamanishi, Y., & Kashima, H. (2018). Dual convolutional neural network for graph of graphs link prediction. arXiv preprint arXiv:1810.02080.

39. Société Vaudoise des Sciences Naturelles. (1864). Bulletin de la Société vaudoise des sciences naturelles (Vol. 7). F. Rouge.

40. Jeh, G., & Widom, J. (2002, July). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 538-543).

41. Kang, B., Lijffijt, J., & De Bie, T. (2018). Conditional network embeddings. arXiv preprint arXiv:1805.07544.

42. Katz, S., Downs, T. D., Cash, H. R., & Grotz, R. C. (1970). Progress in development of the index of ADL. The Gerontologist, 10(1_Part_1), 20-30.

43. Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

44. McCallum, A., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval, 3(2), 127-163.

45. Menon, A. K., & Elkan, C. (2011, September). Link prediction via matrix factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 437-452). Springer, Berlin, Heidelberg.

46. Needham, M. (2019, March 28). Link Prediction with Neo4j [Blog post]. Retrieved from https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c/

47. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

48. Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404-409.

49. Newman, M. E. (2001). Clustering and preferential attachment in growing networks. Physical Review E, 64(2), 025102.

50. Fouss, F., Pirotte, A., Renders, J. M., & Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3), 355-369.

51. Ou, M., Cui, P., Pei, J., Zhang, Z., & Zhu, W. (2016, August). Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1105-1114).

52. Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42-55.

53. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 701-710).

54. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

55. Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1-20.

56. Sachan, M., & Ichise, R. (2010, February). Using abstract information and community alignment information for link prediction. In 2010 Second International Conference on Machine Learning and Computing (pp. 61-65). IEEE.

57. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B. J., & Wang, K. (2015, May). An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web (pp. 243-246).

58. Sun, Y., & Han, J. (2013). Mining heterogeneous information networks: A structural analysis approach. ACM SIGKDD Explorations Newsletter, 14(2), 20-28.

59. Sun, Y., Barber, R., Gupta, M., Aggarwal, C. C., & Han, J. (2011, July). Co-author relationship prediction in heterogeneous bibliographic networks. In 2011 International Conference on Advances in Social Networks Analysis and Mining (pp. 121-128). IEEE.

60. Sun, Y., Han, J., Yan, X., Yu, P. S., & Wu, T. (2011). PathSim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11), 992-1003.

61. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015, May). LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (pp. 1067-1077).

62. Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. Advances in Neural Information Processing Systems, 16, 659-666.

63. Tran, P. V. (2018, October). Learning to make predictions on graphs with autoencoders. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 237-245). IEEE.

64. Walker, S. H., & Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54(1-2), 167-179.

65. Wang, C., Satuluri, V., & Parthasarathy, S. (2007, October). Local probabilistic models for link prediction. In Seventh IEEE International Conference on Data Mining (ICDM 2007) (pp. 322-331). IEEE.

66. Wang, D., Cui, P., & Zhu, W. (2016, August). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1225-1234).

67. Yamaguchi, T. (2008). Practical aspects of knowledge management. Yokohama: Springer Science & Business Media.

68. Zhang, Z., Cui, P., Wang, X., Pei, J., Yao, X., & Zhu, W. (2018, July). Arbitrary-order proximity preserved network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2778-2786).

69. Zhou, T., Lü, L., & Zhang, Y. C. (2009). Predicting missing links via local information. The European Physical Journal B, 71(4), 623-630.