RECOMMENDING COLLABORATIONS

USING

LINK PREDICTION

A thesis submitted in partial fulfillment of the requirements for

the degree of Master of Science

By

NIKHIL CHENNUPATI

B. Tech., Gandhi Institute of Technology and Management,

India, 2016

2021

Wright State University

WRIGHT STATE UNIVERSITY
GRADUATE SCHOOL

April 21, 2021

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Nikhil Chennupati ENTITLED Recommending Collaborations Using Link Prediction BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

______Tanvi Banerjee, Ph.D. Thesis Director

______Mateen M. Rizki, Ph.D. Chair, Department of Computer Science and Engineering

Committee on Final Examination

______Tanvi Banerjee, Ph.D.

______Krishnaprasad Thirunarayan, Ph.D.

______Michael L Raymer, Ph.D.

______Barry Milligan, Ph.D.
Vice Provost for Academic Affairs
Dean of the Graduate School

ABSTRACT

Chennupati, Nikhil. M.S., Department of Computer Science and Engineering, Wright State University, 2021. Recommending Collaborations Using Link Prediction.

Link prediction in the domain of scientific collaborative networks refers to exploring and determining whether a connection between two entities in an academic network may emerge in the future. This study aims to analyse the relevance of academic collaborations and identify the factors that drive co-author relationships in a heterogeneous bibliographic network. Using topological, semantic, and graph representation learning techniques, we measure the authors' similarities with respect to their structural and publication data to identify the reasons that promote co-authorships.

Experimental results show that the proposed approach successfully infers co-author links by identifying authors with similar research interests. Such a system can be used to recommend potential collaborations among authors.


Table of Contents

1. Introduction ...... 1
1.1. Overview ...... 1
1.2. Link Prediction for Recommending Author Collaborations ...... 4
1.3. Research Questions and Contributions ...... 5
1.4. Thesis Outline ...... 7
2. Related Work ...... 8
2.1. Feature Extraction Based Methods ...... 9
2.1.1. Similarity-based Metrics ...... 9
2.1.2. Probabilistic and Maximum-Likelihood Models ...... 20
2.2. Feature Learning Methods ...... 25
2.2.1. Matrix Factorization Methods ...... 26
2.2.2. Random Walk Based Methods ...... 29
2.2.3. Neural Network-based Methods ...... 33
3. Methods ...... 37
3.1. Feature Extraction Methods ...... 37
3.1.1. Feature Extraction Based on Topology ...... 37
3.1.2. Feature Extraction Based on Node Attributes (Semantic Similarity) ...... 41
3.2. Network Embedding Based Approach for Link Prediction ...... 45
3.2.1. Homogeneous Network Embedding ...... 45
3.2.2. Heterogeneous Network Embedding ...... 46
3.2.3. Weighted Meta-path Biased Random Walks ...... 46
3.2.4. Heterogeneous Skip-gram Model ...... 49
3.3. Supervised Algorithms ...... 51
3.3.1. Logistic Regression ...... 51
3.3.2. Support Vector Machines ...... 52
3.3.3. Random Forests ...... 53
3.3.4. AdaBoost ...... 54
3.4. Evaluation Metrics ...... 55
3.4.1. Precision ...... 55
3.4.2. Recall ...... 56
3.4.3. F-measure ...... 56
3.4.4. AUC Score ...... 56
4. Data and Experimental Setup ...... 58
4.1. Data ...... 58
4.1.1. Microsoft Academic Graph ...... 59
4.1.2. Data Collection ...... 60
4.1.3. Building a Collaboration Graph ...... 61
4.2. Link Prediction Problem ...... 62
4.2.1. Case 1: Experiment with Negative Samples as Nodes n-hop Away ...... 63
4.2.2. Case 2: Experiment with Randomly Chosen Negative Samples ...... 64
4.3. Generating Link Prediction Features ...... 65
4.4. Choosing a Binary Classifier ...... 65
4.5. Network Embedding Based Approach for Predicting Future Collaborations ...... 65
4.5.1. Generating Node Embeddings ...... 66
4.5.2. Prediction Pipeline ...... 68
5. Results and Discussion ...... 72
5.1. Feature Extraction Based Approach Results ...... 72
5.1.1. Results of Experiments with Negative Samples as Nodes n-hop Away ...... 73
5.1.2. Results of Experiments with Randomly Chosen Negative Samples ...... 75
5.1.3. Comparing Results of Case-1 and Case-2 ...... 76
5.2. Network Embedding Based Approach Results ...... 76
5.2.1. Author's Node Embedding Visualizations ...... 77
5.2.2. Weighted Meta-path Based Results ...... 78
5.3. Case Study: Relevant Author Search ...... 81
5.4. Comparison of Feature Extraction Based and Network Embedding Based Approach ...... 82
6. Conclusion and Future Work ...... 83
References ...... 84


List of Figures

Figure 1. Trending authors in machine learning (adapted from academic.microsoft.com) ...... 2
Figure 2. Trending topics in all fields (adapted from academic.microsoft.com) ...... 3
Figure 3. A sample collaboration graph of authors from different institutes ...... 5
Figure 4. Pipeline of the feature extraction and learning-based approach ...... 6
Figure 5. Overarching block diagram of weighted meta-path-based network embedding method ...... 7
Figure 6. Taxonomy of link prediction approaches ...... 9
Figure 7. Local probabilistic model ...... 21
Figure 8. Frequency of common authors vs percentage of collaborations ...... 38
Figure 9. Weighted meta-path approach using supervised learning ...... 48
Figure 10. Weighted meta-paths and their importance scores ...... 49
Figure 11. MAG schema and MAG entity data (adapted from academic.microsoft.com) ...... 59
Figure 12. Hierarchy describing the 5 different entity types and the properties of each entity type ...... 61
Figure 13. Heterogeneous graph schema ...... 62
Figure 14. Training data frame ...... 64
Figure 15. Pipeline for meta-path-based network embedding approach ...... 66
Figure 16. Author - Paper - Venue - Paper - Author meta-path ...... 66
Figure 17. Weighted meta-path learning pipeline ...... 68
Figure 18. ROC-AUC curve with topological and semantic features using Random Forests ...... 74
Figure 19. t-SNE visualizations of node embeddings learned from MAG data ...... 77
Figure 20. Author node embeddings and link (co-author) embeddings using PCA ...... 78


List of Tables

Table 1. Sample author pairs with paper title similarity ...... 43

Table 2. Meta-paths and their semantic meaning ...... 47

Table 3. Distribution of the count of each node type ...... 60

Table 4. Description of relationships in the graph ...... 62

Table 5. Choice of binary operator ...... 70

Table 6. Different classifier performances for predicting co-author relations using Logistic, SVMs, Random Forest, and AdaBoost techniques ...... 73

Table 7. Classifiers performance with randomly chosen negative samples ...... 75

Table 8. AUC scores of weighted meta-path approach with different classifiers ...... 80

Table 9. A Case study of relevant author search ...... 81


Acknowledgment

This work was supported by the US Air Force, Grant-APEX.

I would firstly like to thank my advisor, Dr. Tanvi Banerjee, for providing this opportunity to work along with her team. Not only has she been a great advisor to me, but a great mentor as well. She provided me with constant support both personally and academically through my graduate journey. I have learned a lot from my interactions with her and will carry these valuable lessons throughout my life.

I would like to thank Dr. T.K. Prasad and Dr. Michael Raymer for the insightful discussions, brainstorming sessions, support, and guidance. I will continue to be in awe of their attention to detail and ability to identify significant problems that made this work more meaningful.

My graduate program has been a fulfilling one, and I am excited for whatever is in store next.

Finally, I must express my deepest gratitude to my parents and my family for their constant support and encouragement throughout my life. This accomplishment would not have been possible without them.


1. Introduction

1.1. Overview

Collaborative research is an important and growing aspect of scientific study. These associations among researchers often yield positive results. Researchers can share thoughts, get inspired, and, most importantly, avoid redundant effort; working together is often easier and more productive than working alone. These partnerships may help investigators understand how methods from various disciplines can be integrated to address a problem and contribute to creating novel ideas. Wilbur and Orville Wright (Goldblatt et al., 2002) were first successful at repairing bicycles before revolutionizing aircraft technology, developing the three-axis controls that allowed fixed-wing aircraft to fly. Their collaborative effort turned manufacturing and commerce into something that people had not even imagined, and their joint work culminated in one of the formative 20th-century projects that gave birth to a whole new world. Another remarkable achievement in the scientific world is the International Space Station Program, which involves space agencies from the USA, Japan, Europe, and Russia. This is not just a meeting of minds but a meeting of physical components as well. Identifying and retaining important alliances is essential for research groups because collaborators can complement work on the same research challenge and yield more effective outcomes. On the other hand, seeking a suitable collaborator can be complex and time-consuming.

These research collaborations can be represented as networks or graphs. According to Wikipedia, a collaboration graph is a graph modeling some social network where the vertices represent participants of that network (usually individual people) and two distinct participants are joined by an edge whenever there is a collaborative relationship between them, such as co-authoring a paper, publishing at the same venue, or being affiliated with the same organization. Collaboration graphs enable measurement of the closeness of collaborative relationships between the network participants. However, graphs can be used for more than just organized information repositories; they can also be used to conduct analytics and gain insights into science, such as the top authors in a field, the research importance of institutes, and author citation growth rates. Figure 1 and Figure 2 below illustrate the nature of insights that could be established from structured graph data.

Figure 1. Trending authors in machine learning (adapted from academic.microsoft.com)


Figure 2. Trending topics in all fields (adapted from academic.microsoft.com)

The feature information obtained from graph-structured data can be used in many machine learning use-cases (Hamilton et al., 2017b). For example, a graph can be used to predict a protein's location in a biological interaction graph, predict an individual's place in a collaboration network, recommend new friends to a social network user, or predict new therapeutic applications of established drug molecules.

Using machine learning algorithms, we can gain insights on top of the features extracted from these graphs.

Similarly, academic collaboration graphs are constructed using different entities in the data, i.e., ‘Author’, ‘Paper’, ‘Conference’, ‘Journal’, ‘Affiliation’, and the relationships among them. The idea of constructing a co-authorship network is not new. To the best of our knowledge, (Newman, 2000) was the first to study the structure of scientific collaboration networks, where “two scientists were considered connected if they have co-authored a paper”. Following that, many studies in the literature work on the time evolution of academic partnership networks by forecasting new connections between researchers. Accurate prediction of new partnerships among participants of a collaboration network may aid in the development of creative solutions through collaborative efforts, the creation of opportunities, and the increase of productivity.

1.2. Link Prediction for Recommending Author Collaborations

Predicting connections or associations between entities or nodes in a network is critical in network analysis. (Liben-Nowell & Kleinberg, 2007) formally define Link Prediction as the problem of predicting or identifying the existence of a link between two entities in a network. In a co-authorship network, there is a chance that two authors might collaborate in the future if one of them changes the organization with which he or she is associated. These kinds of collaborations can be difficult to foresee. The network topology, on the other hand, could help us identify a significant number of new collaborations: two authors who are close in the network will have colleagues in common and travel in similar circles, meaning that they are more likely to collaborate in the near future. Many studies in the literature have tried to make this intuitive notion precise by introducing various proximity, semantic, and graph representation learning methods that lead to more accurate link predictions.


1.3. Research Questions and Contributions

Providing recommendations can be as simple as identifying a set of similar items or entities in a network; in other words, it amounts to inferring the non-existing links in the network by analyzing past collaborations.


Figure 3. A sample collaboration graph of authors from different institutes

Given a collaboration network as above, where nodes represent authors and a “co-author” relation exists between two authors if they have at least one paper in common, we would like to answer the following questions:

1. Given an author in the network (e.g., Dr. Chandler), who are the most similar or top-k similar scientists whose research interests match those of Dr. Chandler?

2. Will Dr. Kudva, affiliated with Northrop Grumman Corporation, and Dr. Chandler, affiliated with WPAFB, collaborate in the future?


In a homogeneous network, the structural distance between nodes plays a prominent role in measuring similarity: the greater the distance, the lower the relevance between the two. In comparison, a neighbor node's relevance in a HIN (Heterogeneous Information Network) depends not just on structural distance but also on the semantics shared among those nodes. The dictionary definition of semantics is the meaning of a word, phrase, or sentence. In this study, by semantics we mean knowing and understanding the meaning of the attributes tagged to the nodes in our network, such as authors' abstracts and paper titles.

Figure 4. Pipeline of the feature extraction and learning-based approach

In this thesis, we identify similar authors and recommend future collaborations for an author in a heterogeneous network that represents Authors, Papers, Venues, Journals, Affiliations, and the relationships among them, using link prediction techniques. We have implemented two different approaches in this study. The first extracts the network's structural and semantic features and uses machine learning algorithms to learn the authors' associations. Figure 4 explains the end-to-end pipeline of this first, feature-extraction-based approach.

In the second approach, we implemented a network-embedding-based graph representation learning method where the learning is done using different meta-paths to capture the similarities among the authors. These meta-paths are also weighted based on their importance towards co-authorship prediction. Figure 5 provides the overarching block diagram of the network embedding model we built.


Figure 5. Overarching block diagram of weighted meta-path-based network embedding method

1.4. Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 surveys the related work in link prediction, discussing and comparing different approaches and techniques. Chapter 3 introduces the methodology used in implementing our approach for link prediction and the evaluation metrics used to analyze the system's performance. Chapter 4 describes the process of data collection and the experimental steps followed. In Chapter 5, we present the results and discuss the system's outcome to gain more insights about the co-author prediction task performed using link prediction approaches. In Chapter 6, we present the conclusions and directions for future work.


2. Related Work

Multiple approaches have been proposed in the literature to address link prediction.

These approaches can be categorized based on the type of information used from the network or how they learn from existing relationships among the nodes to predict a non-existing link. In this work, we classify the link prediction techniques according to their approach to extracting network features. Accordingly, these models can be divided into two categories based on how they learn the relationships among the nodes. As described in Figure 6, the first group is the feature extraction based approach, which extracts and analyses features using similarity between node attributes, graph topology, and probabilistic and maximum-likelihood based approaches that define a likelihood function. The second group is the feature learning based approach, with matrix factorization, random walk, and neural network based methods that encode a node's neighborhood by mapping it into a latent space. We discuss these approaches in detail in the following sub-sections.


Figure 6. Taxonomy of link prediction approaches

2.1. Feature Extraction Based Methods

2.1.1. Similarity-based Metrics

To recommend or predict links between two nodes, a simple initial step is to compute the similarity between them. To predict links with maximum likelihood, it is essential to consider the similarity between the nodes. This is consistent with the observation that users tend to create relationships or connections with others who share similar interests, education, location, and background. Sometimes, these connections could be due to complementary interests of the users, for example, interdisciplinary collaboration between two or more scientists from different professional fields to achieve common goals. In this study, however, we limit our analysis to identifying collaborations of authors in the same field using similarity-based metrics. The available node attributes and the network topology help calculate this similarity among the nodes. A defined similarity function helps us calculate the relevance of two nodes, where the greater the similarity score between the two, the more likely it is that the two nodes form a link.

In a practical sense, an online social network contains node attributes such as personal information, email, and location. Similarly, in a research collaboration network, the author nodes have characteristics such as research interests, citation count, affiliations, and conferences attended. In the majority of cases, the textual information from the properties of the nodes shows how similar nodes are to each other.

(Bhattacharyya et al., 2011) define a forest model for categorizing keywords with the hypothesis that similarity among people in the network is determined by the relevance of their keywords. They believe that these keywords correspond to users' interests (such as movies and games) and passions (such as art, music, and research). The authors answer these questions by understanding different keywords based on their usage patterns and how similarity between user interests influences friendship. They analysed the keyword usage patterns using a dataset containing 1265 Facebook profiles, including 1301 unique keywords. Then, they built a forest model to connect keywords based on their context and meaning. To assess the performance of their model, the authors visualize the output of their study with respect to keyword-matching pairs. Finally, they pictorially present the results, showing the variation between weak similarity and substantial similarity for users with different node degrees and numbers of keywords. The two major findings of their study were that correlations between direct friends are high regardless of the number of hops between them, and that individuals who are already friends have a higher similarity score than any other pair of users in the friendship network.


(Akcora et al., 2013) proposed a novel similarity measure for online social networks that combines both network and profile similarity. To test how well the considered measures could forecast the creation of new relationships, the authors conducted experiments on different kinds of graph data (Facebook, YouTube, Epinions, and DBLP data sets). They examined how different scenarios and graph sizes affect performance. Based on their creation dates, they sampled 480,274 edges from the Facebook dataset to reduce computation costs. Of these 480,274 edges, 400,000 (83%) were created between users and their two-hop strangers (i.e., friends of friends), and only 80,274 (17%) were created between pairs three or more hops apart. Then, they created the graph state at the time instant before the creation of each edge e = ⟨u, x⟩, denoted GTe, and computed similarity measures between u and all its strangers in GTe, including x.

Their results show that network similarity measures have the best performance.

Similarly, they performed experiments on the YouTube and Epinions datasets to compare their similarity measures to others in the literature. In an online social network (e.g., Twitter.com), a user enters only a little profile information during the registration process, at which point the system has no preferences or recommendations about possible new friends for the user. Once he or she starts creating new friendships or followings, Twitter.com offers a list of people to follow based on the user's profile information and his or her neighborhood. This shows how network types can impact interactions between users of social networks and what kind of social network information can be used with various types of networks.

(Anderson et al., 2012) study users' interest overlap to measure similarity. This study estimates the similarity between users using two different types of characteristics: similarity of interests, using a distance metric capturing overlap in the types of content they produce, and similarity of social ties, using a measure of the overlap in the sets of people they have evaluated. They performed experiments on three datasets (Wikipedia, Stack Overflow, and Epinions), and their results show the effect of user similarity on how users evaluate each other online. For instance, on Stack Overflow, where questions are annotated with tags (to represent relevant topics), the authors find the similarity between users by defining tag similarity as the cosine between two tag vectors. Similarly, they characterize a user by a vector of the other users that the user has evaluated and call this social similarity. To prove the efficiency of the above two similarity measures, they plot the fraction of positive evaluations as a function of these two similarity measures (showing how probable it is that a user u1 positively evaluates user u2 based on how similar they are). Predicting future links between users was not their goal, but the way they measured the similarity among users in a social network is worth mentioning in this category of node-based similarity measures: we could indeed train a classifier on top of these similarity measures and use the model to predict future links.
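To illustrate the idea of measuring interest overlap as the cosine between two tag vectors, a minimal Python sketch is given below; the tag vocabulary and counts are hypothetical and not taken from the Stack Overflow data.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two tag-count vectors
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical tag-count vectors over the vocabulary [python, nlp, graphs, sql]
user1 = np.array([5, 2, 0, 1])
user2 = np.array([3, 0, 1, 0])
print(cosine_similarity(user1, user2))  # higher values indicate more similar interests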

Graph topological properties may also be used to describe empirical similarities among the nodes. In other terms, similarity metrics for any two nodes may be measured using a variety of network properties, including structural properties. (Newman, 2001) explored several features extracted using the graph's topology. Depending on the topological features used, these metrics can be grouped into local and global approaches.

Local Approaches

In a social network or collaboration graph, actors or nodes prefer to make new links with nodes that are closer to them rather than with nodes far away in the network. Therefore, researchers have designed many local indices or neighborhood-based metrics for link prediction. These have the drawback that using only local information restricts node similarity to being computed for neighbors. Some of the popular metrics are Common Neighbors (Newman, 2001), Adamic/Adar (Adamic & Adar, 2003), Preferential Attachment (Barabási et al., 2001), Resource Allocation (Zhou et al., 2009), and the Jaccard Coefficient (Jaccard & Zurich, 1901).

Common Neighbors

Introduced by (Newman, 2001), this method provides a measure of similarity by calculating the intersection of the sets of neighbors of the two nodes to predict future linkage. The intuition behind common neighbors is that two authors who have a colleague in common are more likely to be introduced than those who don't have any colleagues in common. It is defined as follows:

CN(x, y) = |Γ(x) ∩ Γ(y)|

where x and y represent nodes in the graph and Γ(·) denotes the set of neighbors of a node.

Adamic/Adar algorithm

(Adamic & Adar, 2003) presented a metric where the similarity between two nodes is measured based on their shared neighbors. It is the sum of the inverse logarithmic degree of the neighbors shared by the two nodes, defined as:

AA(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / log|Γ(z)|

where Γ(·) denotes the set of neighbors of a node and z is a common neighbor of nodes x and y. The higher the value, the more similar the two nodes are to each other.


Preferential Attachment

The basic premise here (Barabási et al., 2001) is that the probability that a new edge involves a node x is proportional to the size of x's neighborhood. So, the likelihood of co-authorship between authors a1 and a2 is correlated with the product of the numbers of collaborators of a1 and a2. It is defined as:

PA(x, y) = |Γ(x)| · |Γ(y)|

where Γ(·) denotes the set of neighbors of a node.

Resource Allocation

This metric (Zhou et al., 2009) measures the fraction of a resource that a node can send to another node through their common neighbors:

RA(x, y) = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / |Γ(z)|

where Γ(·) denotes the set of neighbors of a node and z is a common neighbor of nodes x and y.

Jaccard Coefficient

This metric (Jaccard & Zurich, 1901) measures the ratio of shared neighbors to the complete set of neighbors of the two nodes:

JC(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|
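The local indices defined above can be computed directly from the adjacency structure of a graph. The following sketch uses a toy NetworkX graph (not the collaboration data used in this thesis) to compute Common Neighbors, Adamic/Adar, Preferential Attachment, Resource Allocation, and the Jaccard coefficient for a candidate pair of authors.

import math
import networkx as nx

# Toy co-authorship graph; a1..a4 are authors, edges are past collaborations
G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "a4"), ("a2", "a4")])

def local_scores(G, x, y):
    nbr_x, nbr_y = set(G[x]), set(G[y])
    common = nbr_x & nbr_y
    return {
        "CN": len(common),
        "AA": sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1),
        "PA": len(nbr_x) * len(nbr_y),
        "RA": sum(1.0 / G.degree(z) for z in common),
        "JC": len(common) / len(nbr_x | nbr_y) if (nbr_x | nbr_y) else 0.0,
    }

print(local_scores(G, "a1", "a4"))  # scores for the non-existing link (a1, a4)

NetworkX also ships equivalent generators (jaccard_coefficient, adamic_adar_index, preferential_attachment, resource_allocation_index) that operate on batches of node pairs.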

In use cases such as social networks and bibliographic networks, these local methods can be used efficiently to address the link prediction task. Common Neighbors, the Jaccard Coefficient, and Preferential Attachment can be used when past collaborations and the number of connections of each node need to be weighted more heavily. Capturing local neighborhood information helps identify past and mutual relationships among nodes, and it also allows us to associate or predict links among nodes that have a similar neighborhood even when they are far apart in the network.

Global Approaches

Structural information of the whole network can be used to calculate similarity; such measures are called global similarity indices. As opposed to local approaches, these methods have a greater computational complexity. Examples of global similarity measures include, but are not limited to, the following:

Katz Index

This index, introduced by (Katz et al., 1970), counts all paths between two nodes, with shorter paths counting more heavily. In other words, a new link between two nodes is more likely to exist when there are many short paths between the two.
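In matrix form, the Katz index can be written as S = (I − βA)⁻¹ − I = Σ_{l ≥ 1} β^l A^l, where A is the adjacency matrix and the damping factor β must be smaller than the reciprocal of A's largest eigenvalue for the series to converge. The small sketch below uses an example graph bundled with NetworkX and an arbitrarily chosen β; it is a generic illustration, not the computation used in this thesis.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()                 # example graph bundled with NetworkX
A = nx.to_numpy_array(G)
beta = 0.05                                # illustrative damping factor
assert beta < 1.0 / max(abs(np.linalg.eigvals(A)))   # convergence condition

# Katz similarity matrix: S = (I - beta*A)^-1 - I = sum over l >= 1 of beta^l * A^l
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I
print(S[0, 33])                            # Katz score between node 0 and node 33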

Shortest Path

The distance between two nodes in a graph is a direct measure of how close the two nodes are (Liben-Nowell & Kleinberg, 2007). The negated length of the shortest path is used as a score for the likelihood of a connection, capturing the inverse relationship between shortest-path length and connection likelihood.

SimRank

This metric (Jeh G et al., 2002) is based on the premise that two nodes are similar if they are related to other nodes that are themselves similar. In the context of co-authorship networks, we could say that two scientists are considered similar if they are connected to similar scientists in the network.

Rooted Page Rank

(Chung & Zhao, 2010) combined PageRank with random walks for link prediction. Rooted PageRank for a pair (x, y) is based on a random walk that starts at x: at each step, the walker moves with probability α to a randomly chosen adjacent vertex and returns to the root x with probability (1 − α). The resulting stationary probability of reaching y is used as the similarity score.

Hitting Time

In (Fouss et al., 2007), the hitting time HT(x, y) is the expected number of steps taken by a random walk starting at x to reach y, for two vertices x and y in a graph. The smaller the hitting time, the more similar the two nodes are to each other.

Usually, a combination of node-based and topological metrics is used for predicting missing or future links in a network. We divide the similarity-based literature we reviewed according to the type of information used in building the models, i.e., topology-based or content-based.

Measuring similarity using graph topology

(Pavlov & Ichise, 2007) solved a similar problem of finding and predicting collaborations among authors in a Japanese co-authorship network by extracting structural attributes from the network and using them to train a set of predictors of future collaborations. The structural attributes used in this study were popular metrics such as Shortest Path, Common Neighbors, Jaccard's coefficient, Katz, and PageRank. The authors then formally define a predictor that maps feature vectors to the binary space by training several learning algorithms such as Decision Trees and Support Vector Machines. They also argue that by analysing the algorithmic structure of predictors built for specific networks, we can gain valuable information about which attributes are most informative for the link prediction problem, and that this knowledge can serve as a basis for specifying vocabularies for expert description, which in turn helps make predictions even better because we come to know to what extent each researcher is an expert in each field. The data set contains 111,210 published articles by 86,696 authors collected from 1993 to 2006. They compared the results of various classifiers (AdaBoost, Decision Trees, SVMs) and achieved a precision of 0.75 using the AdaBoost algorithm.
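The general recipe behind such feature-extraction approaches is to turn each candidate node pair into a feature vector and train a binary classifier on known links versus non-links. The minimal sketch below illustrates that idea on a toy graph with a toy feature set and simplified negative sampling; it is not the setup of the cited paper or of this thesis.

import networkx as nx
from sklearn.ensemble import AdaBoostClassifier

G = nx.karate_club_graph()                           # toy stand-in for a collaboration graph
positives = list(G.edges())                          # existing links
negatives = list(nx.non_edges(G))[: len(positives)]  # a sample of non-links

def pair_features(G, u, v):
    # Three simple topological features for the node pair (u, v)
    cn = len(list(nx.common_neighbors(G, u, v)))
    jc = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    pa = G.degree(u) * G.degree(v)
    return [cn, jc, pa]

# Note: a rigorous setup would hide the positive edges from G before computing their features.
X = [pair_features(G, u, v) for u, v in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(clf.predict_proba([pair_features(G, 0, 33)])[0, 1])  # estimated link probability for (0, 33)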

In (Aiello et al., 2012), the authors find new connections among friends using topical similarity among users who lie close to each other in a social network. They introduced a null model that preserves user activity while removing local correlations, allowing them to disentangle the actual local similarity between users from statistical effects due to user activity and centrality in the social network. They conducted experiments on three popular online social networks: Flickr, Last.FM, and aNobii. Combining all the proposed social and topological features achieved an accuracy of 91% using a decision tree classifier on a balanced set of 10,000 positive and negative samples extracted from the aNobii data set.

Measuring similarity based on graph topology and attributes

(al Hasan et al., 2006) used topological features such as the clustering index and shortest path, aggregate features such as the sum of papers and sum of neighbors, and semantic features such as keyword match count. The authors believe that predicting links in a co-authorship network could potentially be applied to many online social network problems. In this work, the authors used two bibliographic datasets, Elsevier BIOBASE and DBLP, that contain information about research publications in the biology and computer science fields. For DBLP, 15 years of data were used, of which 11 years were used for training and four years for testing the model. They performed experiments with the feature set mentioned above, training many classification models such as SVM, Decision Trees, RBF Network, Bagging, and KNN. Using evaluation metrics like accuracy, precision, recall, and F-1 score, they achieved an accuracy of around 85 to 90% for each classification algorithm. Our work is similar to their study regarding the topological and semantic methods used for conducting the experiments. In this work, we capture the semantic similarity among authors using more promising techniques available today in NLP, such as SciBERT and Word2Vec, by leveraging the authors' lists of paper titles and abstracts.

(Almansoori et al., 2012) apply link prediction to the domain of healthcare and gene expression networks. The authors argue that the chance that links could disappear in the future is as significant as the likelihood of the formation of new connections, so they present a novel model that tackles both the emerging-link and shrinking-link problems. Their experimental results show that the proposed model can support gene-gene interaction analysis and the effective analysis of network patterns. They also apply the link prediction method to the medical referral problem, i.e., the process of referring patients to physicians with a specific area of specialty, by building a classification model on top of extracted features such as Ethnicity, Professional Activity Match, Sum of Patients (the total number of patients that the pair of physicians had), Sum of Neighbors, and Jaccard Similarity. The authors address the problem of both positive and negative link prediction (links that are likely to be removed in the future). They achieved a prediction accuracy of 92% using SVM on the medical referral data set and an accuracy of 72% using a Decision Tree on a real-world gene data set.

(Sachan & Ichise, 2010) introduced semantic and event-based approaches and used the graph's structure to improve the predictors' accuracy. They believe that the researchers' semantic descriptions might help find researchers with compatible expertise in each field and thus suggest collaborations. Further, they used an event-based approach to consider common venue and journal information to identify future collaborations more precisely. Some of the non-structural and event-based attributes they used were common words in the title, common conference venues, and common journals. They conducted experiments on a subset of the DBLP data set with 17,623 authors and 18,820 papers, which they split into 16 partitions based on a timeframe (1987 to 2002). They were able to achieve an F-1 score of 60% by using a combination of topological, semantic, and event-based features along with sampling methods (SMOTE) and training with Decision Trees.

Along this line, (Sun, Y. et al., 2019) proposed the PathPredict model to study the co-author relationship prediction problem. Here, the authors considered meta-path-based topological features to capture the similarity among the author nodes and then learned the importance of each feature in predicting potential collaborations among the authors. Experiments were conducted on a real bibliographic network, DBLP. By defining meta-paths over the network schema, their intuition is that both neighbor-set features and topological features can be generalized to heterogeeous information networks. To conduct the experiments, the authors partition the network based on the publication year associated with each paper, i.e., from 1989 to 2009. The proposed model can achieve an AUC of around 0.83 for the test interval using the proposed meta-path-based topological features. They also conducted a case study to evaluate the significance level of different meta-paths, which shows that the common venue and common author meta-paths help identify co-author relationships better. Although they handled the heterogeneous network using multiple meta-paths, the semantic relations among the authors were not considered in this study.

Remarks

Similarity-based methods mostly focus on structural and node-attribute-level information to compute a similarity score, which helps assess the probability of forming non-existing links between nodes. Comparing local and global approaches based on run time, local approaches take less time since only adjacent-node information is considered, which is an advantage when dealing with large-scale networks; this is not the case for global approaches. In terms of how well they capture the similarity among the nodes, global methods perform well because they take into account the complete topological information of the given graph.

2.1.2. Probabilistic and Maximum-Likelihood Models

“Statistical and probabilistic principles have been used effectively to characterize several network formation-based models” (Goldenberg et al., 2009). Given a network, these models optimize an objective function composed of many parameters. This method is based on the assumption that the network has a well-defined structure. Mathematical techniques are used to construct a model that matches the system and to estimate the model parameters. The obtained parameters are then used to compute the likelihood of the formation of non-existing connections. As we have seen in the previous similarity-based approach, these likelihood values can be used to rank possible connections among the nodes.

(C. Wang et al., 2007) proposed a novel local probabilistic graphical method that can be scaled up to large graphs to estimate the joint co-occurrence probability of two nodes. They argue that such a probability measure captures information that is not captured by either topological or semantic similarity measures, which are the most used methods for link prediction. This method is described in Figure 7. To derive the co-occurrence probability (the link probability between two nodes), a local probabilistic graph model using Markov Random Fields (MRF) is proposed.

Figure 7. Local probabilistic model

There are three steps in predicting whether two nodes x and y will be connected: (1) define the central neighborhood set of x and y using topological facts; (2) choose an itemset that is fully contained within this set and use it as training data to train a local probabilistic model, where the training process is converted into a maximum entropy optimization task; (3) using inference over the local model, estimate the co-occurrence likelihood features. To test their theory, they ran tests on three different data sets: DBLP, Genetics, and Biochemistry. On DBLP, the co-occurrence probability features alone yielded an AUC score of 0.82 compared to 0.75 for the Katz measure. Their model did fairly well on the other two data sets, obtaining an AUC of around 0.80. They also demonstrated the effectiveness of the co-occurrence probability feature combined with other topological and semantic features for predicting co-authorship collaborations on DBLP, which yielded an AUC score of 0.83.


Probabilistic relational models (PRMs) are a language for describing statistical models for relational domains. Consider the issue of link prediction in a co-authorship network to further explain PRMs. Non-relational link prediction frameworks accept only one entity type as a node and one relationship type; relational frameworks (PRMs), in contrast, can handle more than one entity and relationship type. In this way, the link prediction problem may be reduced to an attribute prediction task.

(Friedman et al., 1999) extended this framework by modeling the interaction between node attributes and the link structure itself. The authors presented a precise semantics of these extensions and proposed learning such models from a relational database. To demonstrate the proposed model's predictive capability, they performed experiments on the Cora (McCallum et al., 2000) and WebKB (Craven et al., 1998) data sets and evaluated their model by computing log-likelihood. Training was done on nine-tenths of the data, and performance was evaluated by computing the log-likelihood of the held-out test subset. They obtained a log-likelihood of -210,044 compared to -213,798 for the baseline model using existence uncertainty. Similarly, using reference uncertainty, they got a log-likelihood of -149,705 compared to -152,280 for the baseline model. These numbers indicate the likelihood of the test data given the learned model. Thus, the model they built is more predictive than the baseline when the relational structure is correlated with attribute values.

(Taskar et al., 2003) tackled the problem of predicting link existence using the Relational Markov Network (RMN) framework. They defined a single probabilistic model over the entire graph, including object labels and links between objects. The model parameters are trained discriminatively to maximize the probability of the object and link labels given the known attributes. The learned model is then applied, using probabilistic inference, to predict and classify links from any observed attributes and links. Experiments were performed on a customized version of the WebKB data set containing computer science web pages from three schools. First, 2,954 pages were labeled into one of eight categories such as faculty, student, and research staff. Then, they established a set of candidate links based on evidence of a relation between them, so the task becomes predicting the relation type for all the candidate links. Further, they experimented on a large university portal's social network data using “friendship” links between students, using personal information and using observed links in the test data as evidence for predicting the remaining links.

(Guimerà & Sales-Pardo, 2009) provide a general mathematical and statistical basis for identifying missing and erroneous interactions in a noisy network. Theirs is a model with cluster structures or communities. In the stochastic block model, nodes in the network are grouped into classes, and nodes in the same class have the same status. This means that each node has a fixed community membership, which helps determine the likelihood of a non-existing link between two nodes. A stochastic block model M = (P, Q) is made up of two parts: a partition P of the nodes into groups and a matrix Q of connection probabilities between groups. Let Qαβ denote the probability of a link between a node in group α and a node in group β, and let A be the observed network. Then the probability of the network structure is:

p(A | P, Q) = ∏_{α ≤ β} Qαβ^(lαβ) · (1 − Qαβ)^(γαβ − lαβ)

where lαβ is the number of existing links between groups α and β, and γαβ is the maximum number of potential connections between groups α and β. Let Ω be the set of all feasible partitions. Bayes' theorem can then be used to measure the reliability of a certain link between two nodes x and y in the network:

p(Axy = 1 | A) = (1/Z) Σ_{P ∈ Ω} ∫₀¹ p(Axy = 1 | P, Q) · p(A | P, Q) · p(P, Q) dQ

where Z is a normalization factor, which can be written as Z = Σ_{P ∈ Ω} exp[−H(P)], and H(P) is a function of the partition. This kind of system needs enormous computing time to identify all missing and spurious connections.
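As a rough illustration only, the sketch below scores a candidate link under a single fixed partition by estimating Qαβ = lαβ / γαβ from the observed links; the full method of (Guimerà & Sales-Pardo, 2009) additionally sums over the partition space Ω, which this simplification omits. The adjacency matrix and partition are toy values.

import numpy as np

# Toy observed adjacency matrix A and a fixed partition P (group id per node)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
P = np.array([0, 0, 0, 1, 1])

def block_link_probability(A, P, x, y):
    # Estimate Q_ab = l_ab / gamma_ab for the groups of x and y
    a, b = P[x], P[y]
    ia, ib = np.where(P == a)[0], np.where(P == b)[0]
    links = A[np.ix_(ia, ib)].sum()
    if a == b:                       # within-group pairs would be counted twice
        links = links / 2
        pairs = len(ia) * (len(ia) - 1) / 2
    else:
        pairs = len(ia) * len(ib)
    return links / pairs if pairs else 0.0

print(block_link_probability(A, P, 0, 4))  # reliability estimate for the missing link (0, 4)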

To demonstrate that their model reveals structural features, they attempt to locate missing interactions, or, in other terms, determine the likelihood that a connection exists given the observed network. The authors evaluate the performance on five networks; for example, on a network of interactions between people in a karate club, they predict missing interactions with 85% accuracy.

In the real world, many complex systems or networks have an inherently hierarchical organization: the nodes in the network are divided into multiple layers organized as ordered sets. In most cases, the resulting hierarchy depends on several factors, such as the different roles of the nodes, their significance, and their history.

(Clauset et al., 2008) built a model that captures the topological relationships behind the hierarchical structure and predicts the connections among the nodes in the hierarchy. The hierarchical structure is described by a tree or dendrogram in which closely connected pairs of vertices have their lowest common ancestors lower in the tree than those of more distantly related pairs. They expect the probability of a connection between two vertices to depend on their degree of relatedness. They detect and analyse the real-world hierarchical structure by fitting the hierarchical model to observed network data using statistical inference tools, combining a maximum likelihood approach with a search over the space of all possible dendrograms. The authors first sampled a group of dendrograms with probability proportional to their likelihood, then averaged the corresponding connection probabilities over the sampled dendrograms to determine the mean probability.

In this study, the authors prove the efficiency of their approach for link prediction using three example networks (a terrorist association network, a metabolic network, and a grassland species network) and evaluate the success of their approach by showing the degree to which the AUC exceeds 1/2, indicating how much better their predictions are than chance. They also plot the AUC for the three networks as a function of the fraction of connections known to the algorithm.

They further show that this knowledge of hierarchical structure can predict missing connections in partially known networks with better accuracy than standard topology-based techniques. However, the space and time complexity of this approach is high, which limits its applicability to use cases involving large-scale networks.

2.2. Feature Learning Methods

Feature learning is made possible by graph embedding and network representation learning techniques, in which the original graph is projected into a low-dimensional space. The goal is to automatically project nodes or objects in a homogeneous or heterogeneous network into a latent embedding space in which both the structural and relational properties of the network are encoded and preserved. The idea behind these representation learning approaches (Hamilton et al., 2017b) is to learn a mapping that embeds nodes or entire (sub)graphs as points in a low-dimensional vector space R^d, and then to optimize this mapping so that geometric relationships in the learned space reflect the original graph's structure. After optimizing the embedding space, the learned embeddings can be used as feature inputs for downstream machine learning tasks. To put it another way, computing the cosine similarity or Euclidean distance between the encoded nodes corresponds to measuring the similarity between them in the input network. In the earlier approaches we used manually computed features to find the similarity among the nodes, but in graph representation learning methods the feature learning is done automatically.
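As a small sketch of how learned embeddings are consumed downstream, random vectors below stand in for learned author embeddings (the author names are hypothetical), and similar authors are ranked by cosine similarity, which is the kind of top-k query used later in this thesis.

import numpy as np

rng = np.random.default_rng(0)
authors = ["author_1", "author_2", "author_3", "author_4"]
embeddings = {a: rng.normal(size=128) for a in authors}   # random stand-ins for learned vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank candidate collaborators for a query author by embedding similarity (top-k search)
query = "author_1"
scores = {a: cosine(embeddings[query], z) for a, z in embeddings.items() if a != query}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2])   # top-2 most similar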

There are many network embedding methods in the literature. Most of these methods can be classified as 1) matrix factorization based models, 2) random walk based models, or 3) deep neural network based models.

2.2.1. Matrix Factorization Methods

A matrix contains a set of rows and columns. Mapping the nodes of a given input graph to the rows and columns of a matrix is a way of denoting whether two nodes are connected in the original network. In these approaches, vector representations of the nodes of the initial network are obtained by representing them in a low-dimensional space using their structural attributes. The aim is to reduce the dimensionality of this space while maintaining non-linearity and locality. For over a decade, matrix factorization methods have been used in many papers to solve the link prediction task. The two main approaches to matrix factorization are singular value decomposition (SVD) and non-negative matrix factorization.


Laplacian Eigenmaps

(Belkin & Niyogi, 2001) proposed a geometrically motivated algorithm for constructing a representation for data sampled from a low-dimensional manifold embedded in a higher-dimensional space. In the first step, a graph is constructed to model local neighborhood relations among the data points. Next, the graph is embedded using the graph Laplacian; its eigenvectors directly give the embedding of the data points. The decoder within the Laplacian Eigenmaps encoder-decoder view is described as:

DEC(z_i, z_j) = ||z_i − z_j||₂²

and the loss function weights each node pair by its similarity in the graph:

L = Σ_{(v_i, v_j) ∈ D} DEC(z_i, z_j) · s_G(v_i, v_j)
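A compact sketch of Laplacian Eigenmaps on a small example graph is shown below, using a dense eigendecomposition with NumPy for simplicity; this is a generic illustration, not the embedding method used later in this thesis. The eigenvectors of the graph Laplacian associated with the smallest non-zero eigenvalues serve as the node coordinates.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
L = nx.laplacian_matrix(G).toarray().astype(float)   # graph Laplacian L = D - A

# Eigen-decomposition; eigenvalues come back in ascending order
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:3]      # skip the constant eigenvector (eigenvalue 0), keep 2 dimensions
print(embedding.shape)        # (34, 2)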

Graph Factorization

In Graph Factorization (Schwabe et al., 2013), the same loss function as in Laplacian Eigenmaps is used, and it is optimized using stochastic gradient descent. The gist of this approach is that it aims to represent the graph through its graph Laplacian matrix: under this representation, the positive values correspond to node degrees and the negative values are edge weights. The goal of the objective function used here is to minimize the error compared to the graph Laplacian matrix.


GraRep

GraRep (Cao et al., 2015) also used the same loss function as Laplacian Eigenmaps and optimized it using stochastic gradient descent. While reducing the dimensionality of the vector space, the model also incorporates global topological structure knowledge into the learning. The authors conducted experiments on multi-label classification tasks on the BlogCatalog dataset, where the results show a slight improvement, with a Micro-F1 score of around 40%.

HOPE

For directed graph embeddings, an inner-product-based approach (Ou et al., 2016) retains asymmetric transitivity. When factoring the graph vertices into a vector space, this property is important for capturing the graph structure, and it may also be used to decode properties of an embedded graph. First, the high-order proximity of nodes in the given network is estimated; SVD is then applied on top of it, and the resulting factors give the vector representation of each node in a low-dimensional setting.

Empirical experiments over several synthetic data sets demonstrate that HOPE approximates higher-order proximities significantly better than existing state-of-the-art algorithms. Experiments were conducted on the SN-Weibo and SN-Twitter data sets with mean average precision as the evaluation metric; HOPE achieves a precision of over 0.80 compared to baselines such as DeepWalk, Adamic/Adar, and LINE.

Similarly, for structural relation estimation, a prominent work by (Menon & Elkan, 2011) is notable. By framing link prediction as a matrix completion problem, the authors implemented it using a matrix factorization approach. (Chen et al., 2017) factorized the extracted structural matrix and the matrix of corresponding node properties using a non-negative matrix factorization technique instead of relying on SVD.

2.2.2. Random Walk Based Methods

Graph exploration and sampling with random walks or search algorithms is used to investigate node characteristics such as node centrality and node similarity. We can view the sampling of neighborhoods as a form of local search using random walks and techniques such as Breadth-First Search and Depth-First Search; in terms of the search space they explore, these two search techniques represent extreme scenarios.

The results of the studies below show that nodes that co-occur on short random walks tend to have similar embeddings in the latent space. Thus, instead of using a deterministic measure of graph proximity, these random walk methods employ a flexible, stochastic measure of graph proximity, which has led to superior performance in many settings (Goyal & Ferrara, 2017).

Deep Walk

(Perozzi et al., 2014) address graph representation learning as a natural language problem. DeepWalk learns latent representations by treating walks as the equivalent of sentences and relying on local information gleaned from truncated random walks. Initially, in academia and industry, skip-gram was used mostly for language processing use cases, but using it as an objective function here helps represent networks in a low-dimensional embedding space.
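A condensed sketch of the DeepWalk recipe is shown below, assuming NetworkX and gensim are available; the walk length, number of walks, and skip-gram parameters are illustrative choices, not those of the original paper.

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

def random_walk(G, start, length=10):
    # Truncated random walk starting at 'start'
    walk = [start]
    while len(walk) < length:
        nbrs = list(G[walk[-1]])
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]          # Word2Vec expects string "tokens"

walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]   # 20 walks per node
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=2, epochs=5)
print(model.wv["0"][:5])                   # first few dimensions of node 0's embedding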

To demonstrate the algorithm's potential, the authors evaluate its performance on a multi-label classification problem. They conducted experiments on the BlogCatalog, Flickr, and YouTube data sets. First, they sample a portion of labelled nodes and use them as training data, with the rest of the nodes as a test set. Then, they repeat the process ten times and report the results by averaging the Macro-F1 and Micro-F1 scores. The results show that DeepWalk outperforms the existing baseline methods, i.e., Spectral Clustering, Edge Clustering, Modularity, and Majority, on the multi-label classification task. They also report parameter sensitivity by varying the dimensionality and sampling frequency, which shows that the results are fairly consistent over the three data sets.

Node2vec

This technique (Grover & Leskovec, 2016) extends DeepWalk by integrating BFS and DFS graph traversal strategies to learn the features. The algorithm employs a scalable second-order random walk to investigate various aspects of the network structure. Nodes with a strong relation density that belong to the same group or cluster, for example, are embedded close together.

The authors also demonstrate the efficacy of the proposed approach on multi-label classification and link prediction tasks. In the link prediction task, given a network with a certain fraction of edges removed, they try to predict these missing edges. The experiments were conducted on three popular data sets: Facebook, PPI, and arXiv. The results show that node2vec outperforms DeepWalk with respect to AUC score, with a gain of up to 3.8%.

Metapath2vec

This approach (Dong et al., 2017) attempts to solve the problem of representation learning in heterogeneous networks (containing multiple node and edge types). Similar to word2vec, where sequences of words are used to understand the context of a word, here a heterogeneous skip-gram model run on top of walks generated from pre-defined meta-paths produces the required node embeddings. Given a heterogeneous network, the goal is to capture both the structural and semantic relations among the nodes.
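A rough sketch of a meta-path-guided walk for an Author-Paper-Author pattern is shown below; the toy heterogeneous graph and node types are hypothetical, and the full metapath2vec model would additionally feed such walks into the heterogeneous skip-gram.

import random
import networkx as nx

# Toy heterogeneous graph: every node carries a 'type' attribute ('A' author, 'P' paper)
H = nx.Graph()
H.add_nodes_from(["a1", "a2", "a3"], type="A")
H.add_nodes_from(["p1", "p2"], type="P")
H.add_edges_from([("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")])

def metapath_walk(H, start, metapath=("A", "P", "A"), walk_len=7):
    # Random walk that only steps to neighbors whose type matches the meta-path pattern
    walk, pattern = [start], list(metapath[1:]) * walk_len
    for next_type in pattern[: walk_len - 1]:
        candidates = [n for n in H[walk[-1]] if H.nodes[n]["type"] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

print(metapath_walk(H, "a1"))   # e.g. ['a1', 'p1', 'a2', 'p2', 'a3', ...]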

DeepWalk and node2vec were designed for homogeneous networks, while metapath2vec and metapath2vec++ learn low-dimensional representations of nodes when more than one node type is involved. The authors clearly define the heterogeneous network embedding approach built using a heterogeneous skip-gram and meta-path-based random walks. They also propose a heterogeneous negative sampling method for the metapath2vec++ variant, in which the softmax function is normalized with respect to the node type of the context. They perform experiments on two heterogeneous graphs, DBIS and AMiner, and validate the approach using node clustering and multi-label classification tasks. By varying the percentage of training samples, they achieved a Micro-F1 score of 0.92 on the AMiner dataset for the multi-class node classification task.

GraphSAGE

Unlike most existing methods, which require all nodes in the graph to be present while the embeddings are trained, the authors present a general inductive framework that uses node feature information to efficiently produce embeddings for previously unseen nodes. GraphSAGE (Hamilton et al., 2017a), with the goal of gathering local features and representing them as vectors in a latent space, samples a fixed number of adjacent nodes. Using a context-based similarity assumption similar to the previous approaches built on the idea of word2vec, GraphSAGE assumes that nodes with similar neighborhoods have similar embeddings. Toward this goal, the approach uses an information aggregation function and a loss function adapted to an end-to-end learning procedure.
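A bare-bones sketch of the mean-aggregation idea at the heart of GraphSAGE is given below, using a single layer with random, untrained weights and random input features; the actual model learns these weights end-to-end with the neighborhood sampling and loss described in the paper.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
rng = np.random.default_rng(0)
X = rng.normal(size=(G.number_of_nodes(), 16))      # input node features (random stand-ins)
W_self, W_neigh = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))  # untrained weights

def sage_layer(G, X, node, num_samples=5):
    # One mean-aggregation step: combine a node's features with a sampled neighborhood mean
    nbrs = list(G[node])
    sampled = rng.choice(nbrs, size=min(num_samples, len(nbrs)), replace=False)
    h_neigh = X[sampled].mean(axis=0)
    h = np.concatenate([X[node] @ W_self, h_neigh @ W_neigh])
    return np.maximum(h, 0)                          # ReLU non-linearity

print(sage_layer(G, X, 0).shape)                     # (16,) embedding for node 0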

They demonstrated their approach's efficacy by classifying nodes in evolving information graphs: citation data and Reddit data. In the citation data set, the authors picked papers in six biology-related fields for the years 2000-2005, of which 2000-2004 was used for training the algorithms and 2005 for testing. This method improved the classification F1-scores by 51% compared to using node features alone and showed a significant gain in running time (~100×).

Watch Your Step (WYS)

(Abu-El-Haija et al., 2018) is a learned attention model based on the power series of the transformation matrix G. With just a few manually tunable hyperparameters, the weight that should be assigned to a node’s context is determined. Their experiments on link prediction produce embeddings that best preserve the graph structure, generalizing to unseen information.

Here, the authors evaluated the quality of the embeddings produced when random walks are augmented with attention through experiments on link prediction. They assessed their method by removing a fraction of edges from the network, learning embeddings from the remaining edges, and measuring how well the algorithm can recover the removed edges. They conducted experiments on the wiki-Vote, ego-Facebook, ca-AstroPh, ca-HepTh, and PPI data sets. This graph attention method substantially outperforms the baselines, reducing error by up to 45%.


PathSim

(Sun Y. et al., 2011) proposed the PathSim model to study the importance of similarity search in large-scale heterogeneous networks such as bibliographic networks. The intuition behind this study is that “two objects are similar if they are linked by many meta-paths in the network”, which means nodes connected to each other through different meta-paths hold some kind of similarity. The authors introduced the concept of meta-path-based similarity, where a meta-path is a path consisting of a sequence of relations defined between different object types. Experiments were conducted on the DBLP and DBIS data sets, and performance was evaluated through similarity search case studies for retrieving similar venues and similar authors, where the query result accuracy was around 0.74 and 0.65, respectively.

2.2.3. Neural Network-based Methods

Because of neural networks' ability to capture highly non-linear relations, graph representation learning methods adopt neural networks with multiple layers. Aggregating knowledge from the network topology together with the node attributes is one of the functions of these models. The intuition behind these approaches is that a node's local neighborhood can be learned using an aggregation function instead of traversing the whole graph, which can be computationally expensive. In (Harada et al., 2018), for example, link prediction in a graph network of molecules is investigated using a combination of two convolutional neural networks. The molecules are modelled as having a hierarchical structure for their internal and external relationships. To represent the network in a latent space, each node representation is learned using a convolutional layer trained with backpropagation. The encoded node representations are then fed into the external convolutions to learn the external network. Finally, the link prediction model is built using a multilayer neural network to capture the final representations, and the interactions among the molecules are modelled using a softmax function.

Graph Auto-encoders

The authors of (Tran, 2018) demonstrated how an auto-encoder architecture can learn a joint representation of both the local graph structure and the available node features. The model can learn descriptive latent features for nodes from the topology of sparse, bipartite graphs and is flexible enough to incorporate optional side features about the nodes to increase predictive performance.

To demonstrate the same, they perform experiments on link prediction tasks by recovering the status of missing or unknown links in the input graph. The results show that their approach outperforms the baselines with an AUC of 0.89 and an average precision of 0.91 on the Cora data set.

Large Scale Information Network Embedding (LINE)

This model (Tang et al., 2015) integrates two encoder-decoder architectures to preserve first-order and second-order node proximities in the vector space. Their goal is for the proposed model to scale to massive, arbitrary types of networks: undirected, directed, and/or weighted. Using a novel edge sampling algorithm, the model overcomes the limitations of classical stochastic gradient descent.

The authors conducted experiments on word analogy and document classification tasks to demonstrate the approach's efficacy. In word analogy, given a word pair (a, b) and a word c, the task aims to find a word d such that the relation between c and d is similar to the relation between a and b. They report results on the Wikipedia dataset and show that their approach can outperform the baselines DeepWalk, GF, and SkipGram with a Micro-F1 score of around 0.80-0.82.

Deep Neural Networks for Learning Graph Representations (DNGR)

This approach (Cao et al., 2016) uses a random surfing method to capture node-local neighborhood information and learns a single embedding per node rather than pair-wise transformations. Using an autoencoder architecture, both dynamic features and non-linearity can be captured from the input network. The model illustrates the use of stacked denoising autoencoders in extracting meaningful representations.

To demonstrate their embeddings' effectiveness, they perform experiments on a clustering task on the 20 Newsgroups data set, a visualization task on the Wine data set, and a word similarity task. In the word similarity task, where the goal is to learn word representations from large, linearly structured graph data, DNGR achieved an accuracy of 74.84%, significantly higher than SVD.

Structural Deep Network Embedding (SDNE)

This method (D. Wang et al., 2016) is a representation learning model that, with a few exceptions, is close to DNGR in that it uses an encoder-decoder framework with an objective function designed to capture the similarity among the nodes. The authors argue that capturing the non-linear network structure while maintaining global and local information is a serious challenge. As a result, they attempt to preserve the network structure by using both first-order and second-order proximity, jointly optimizing the local and global structure in a semi-supervised deep model.


Here, the authors used the arXiv GR-QC data set for link prediction. The results show that this method learns network representations better than the baselines even when 80% of the links are removed. Using precision@k as the evaluation metric for predicting the hidden links, the precision is higher than 0.90 when k = 1000, whereas the other methods dropped below 0.80.


3. Methods

This section describes the topological and semantic-based methods implemented in the feature extraction process. Later, we explain the techniques used in capturing the similarity among author nodes using a dimensionality reduction-based approach.

Finally, we provide a brief description of each of the supervised machine learning algorithms and evaluation metrics used in this study.

3.1. Feature Extraction Methods

We classify the feature extraction-based methods into two categories. The first is based on graph topology and uses structural attributes to extract the features. The second extracts semantic features from the graph to understand the authors' interests, capturing similarity better.

3.1.1. Feature Extraction Based on Topology

A graph's topology has a significant role in computing similarity among the author pairs. People in every network prefer to form relationships with people in their immediate vicinity. Here, considering the graph topology or structure, we measure the similarity between nodes.

Common Neighbors

People who have more mutual acquaintances are more likely to get acquainted than those who have few or none. (Newman, 2001) makes this intuition precise by conducting experiments on scientific co-authorship networks. That work indicates the importance of scientists introducing their collaborators to one another in the development of scientific communities. In this study, we believe that finding common neighbors between a pair of authors is significant in computing the similarity between them. For two authors, a1 and a2, CN is defined as the number of adjacent nodes that a1 and a2 have in common. A connection between a1 and a2 is more likely to be created when there are more common nodes between the two.

This metric is defined using the following formula.

CN(a_1, a_2) = |N(a_1) \cap N(a_2)|

N(a1) is the set of adjacent nodes of a1 and N(a2) is the set of adjoining nodes of a2.

Figure 8. Frequency of common authors vs Percentage of collaborations

To add more weight to this, we plotted a bar graph of the frequency of common authors vs. the percentage of collaborations that occurred, using the data loaded into Neo4j, a graph database. It shows that around 90% of author pairs who haven't collaborated to date have either no common author or just one author in common. On the other hand, more than 90% of author pairs who did collaborate had at least one author in common, which shows how meaningful mutual connections are in establishing non-existing co-authorship relations.

Adamic/Adar Algorithm

Introduced by (Adamic & Adar, 2003), this measure is beneficial in computing author nodes' closeness based on their shared neighbors. This metric builds on common neighbors, but instead of simply counting them, Adamic/Adar sums the inverse log of the degree of each shared neighbor, where the degree of a node is the number of neighbors it has. It is defined as:

AA(a_1, a_2) = \sum_{z \in N(a_1) \cap N(a_2)} \frac{1}{\log |N(z)|}

where N(z) is the set of nodes adjacent to z.

“When it comes to closing triangles, the intuition is those author nodes with low degrees are more likely to be prominent” (Adamic & Adar, 2003). In our scenario, the likelihood of two authors being introduced by a shared author is inversely related to the number of other neighbor pairs or relations that shared author has. As a result, an unpopular author is more likely to introduce a pair of co-authors. The intuition is that common authors with very large neighborhoods are less significant when predicting a connection between two authors than authors shared between a small number of authors.

Preferential Attachment

As the name suggests, preferential attachment (PA) (Barabási et al., 2001) uses the product of the degrees of authors a1 and a2 in the co-authorship network as the proximity measure, considering that new co-author links are more likely to appear between authors who have a large number of connections. We compute the preferential attachment between a pair of authors using the following formula:

PA(a_1, a_2) = |N(a_1)| \cdot |N(a_2)|

where |N(a1)| is the number of adjacent nodes of a1 and |N(a2)| is the number of adjacent nodes of author a2. The intuition behind this method is that authors with many neighboring nodes will gain more relationships. Hence, in the context of link prediction, an author a1 is more likely to be introduced to a new author than an author a2, given that a2 has fewer neighbors than a1.

Jaccard's Coefficient

The Jaccard coefficient (Jaccard & Zurich, 1901) normalizes for the different neighborhood sizes. It gives a higher score to author pairs that have a larger proportion of shared neighbors relative to their total number of neighbors. Two authors with the same number of common neighbors might still differ considerably. Consider a scenario: authors a1 and a2 have 20 common neighbors, and those two authors do not have any other co-author relationships; authors a3 and a4 also have 20 common neighbors, but a3 also has 50 co-author links besides these 20 common ones. It is easy to see that a1 and a2 are more similar to each other than the a3, a4 pair. Jaccard's coefficient captures this idea by normalizing the common neighbors. This way of computing similarity between the authors is just another way of counting the common connections between the two, with a penalty for each neighbor that is not shared. The Jaccard coefficient is defined as:

JC(a_1, a_2) = \frac{|N(a_1) \cap N(a_2)|}{|N(a_1) \cup N(a_2)|}

where N(a1) and N(a2) represent the sets of adjacent nodes of authors a1 and a2, respectively.

Using the above four structural or topological features, we extract the connectivity properties of a pair of nodes to infer future connectivity by leveraging the network's current connectivity. Given a pair of author nodes, these methods return a similarity score between the two.
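To illustrate how these scores can be obtained in practice, the minimal sketch below computes all four metrics for an author pair with NetworkX on a small toy graph. It is only an illustration of the metrics defined above, not the exact implementation used in this thesis (where the graph is stored in Neo4j); the graph and node IDs are hypothetical.

import networkx as nx

def topological_features(G, a1, a2):
    # Common Neighbors: count of shared adjacent nodes
    cn = len(list(nx.common_neighbors(G, a1, a2)))
    # Adamic/Adar, Preferential Attachment, and Jaccard; each NetworkX helper
    # yields (u, v, score) tuples for the requested pairs
    _, _, aa = next(nx.adamic_adar_index(G, [(a1, a2)]))
    _, _, pa = next(nx.preferential_attachment(G, [(a1, a2)]))
    _, _, jc = next(nx.jaccard_coefficient(G, [(a1, a2)]))
    return {"CN": cn, "AA": aa, "PA": pa, "JC": jc}

# Toy example: a1 and a2 share the collaborator x
G = nx.Graph([("a1", "x"), ("a2", "x"), ("a1", "y")])
print(topological_features(G, "a1", "a2"))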

3.1.2. Feature extraction based on Node Attributes (Semantic similarity)

In any social network, it is natural to extract the structural features among the nodes. In our use case of academic collaboration networks, the properties of authors describe their research areas, location, affiliation, etc. Extracting such properties helps us determine the similarity among the authors more easily and accurately. Hence, in this study, we compute the similarities among authors using their publication data. By semantics, we mean computing similarity between the authors by building a model that understands the meaning of words or phrases instead of relying only on statistical methods. Hence, in this study, we calculate two authors' similarity from the semantic similarity of the keywords in the abstracts of their published papers and of their paper titles. In addition to these two, we believe that analyzing the authors' citation graphs and venues or conferences could help us capture the similarity between two authors in a better and more meaningful way.

A few works in the literature along this line have taken authors' semantic information into account when calculating similarity. (al Hasan et al., 2006) introduced a keyword matching degree measure alongside existing topological features, which gave better results. (Yamaguchi, 2008) introduces two semantic information-based metrics: “keyword matching number” and “common event number”. (Sachan & Ichise, 2010) represents an author's research interests using the “paper's title” and “abstract information”. (Bartal et al., 2009) derives textual knowledge from the paper's title in order to make predictions. When measuring semantic similarity, though, these works compute the intersection of two sets of author properties or use Jaccard's coefficient (treating all terms equally). Hence, in this thesis, without relying on just the matching count or TF-IDF vector similarity, we calculate the semantic similarity between the publication data (such as paper abstracts and paper titles) of authors using SciBERT (Beltagy et al., 2019), a pre-trained language model trained on a sizeable multi-domain corpus of scientific publications.

Spacy

Spacy, introduced by (Honnibal et al., 2017), is a free, open-source library for Natural Language Processing in Python which provides a one-stop solution for the tasks we encounter in any NLP project, such as tokenization, lemmatization, POS tagging, word-to-vector transformation, and entity recognition. Using Spacy, we can quickly create custom pipelines that include the tasks mentioned above. In this study, we use Spacy to build a pipeline that performs tokenization and removes stop words from paper abstracts and paper titles, and then load a language model to compute the similarity between two authors' abstracts/keywords/paper titles. In this way, Spacy makes measuring semantic similarity between a set of sentences/keywords easy and helps us compute the similarity of a pair of authors independently of their topological properties.
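As an illustration of this kind of pipeline, the sketch below uses Spacy to strip stop words from two paper titles and compare them. The general-purpose model name (en_core_web_md) and the sample titles are assumptions used only for the example; in this work, a SciBERT-based language model is plugged into the pipeline instead.

import spacy

# Assumed general-purpose English model with word vectors; the thesis loads a
# SciBERT-based model into the pipeline instead.
nlp = spacy.load("en_core_web_md")

def clean(text):
    # Tokenize and drop stop words and punctuation before comparing
    doc = nlp(text)
    return nlp(" ".join(t.text for t in doc if not t.is_stop and not t.is_punct))

title_a = "Link prediction in heterogeneous bibliographic networks"
title_b = "Predicting co-author relationships in academic graphs"
print(clean(title_a).similarity(clean(title_b)))  # vector-based similarity score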

SciBERT

SCIBERT (Beltagy et al., 2019) is based on the BERT architecture; it is BERT pretrained on scientific corpora. BERT uses the original WordPiece vocabulary, termed BASEVOCAB, whereas SCIBERT uses the SentencePiece library to construct a new WordPiece vocabulary (SCIVOCAB) from scientific corpora. It is trained on 1.14M papers from Semantic Scholar, using the complete text, including abstracts. In this work, we used the SCIBERT language model to measure the similarity between a pair of authors using their abstracts and paper titles.

Paper Title Similarity

A good paper title uses as few terms as possible to accurately capture the substance and intent of the work. In MAG, “paper title” is one of the attributes or properties of each “Paper” node in the co-authorship graph. First, we extract the corresponding paper titles for each author node and store them as Python lists. Then, we create a pipeline using Spacy and remove the stop words from the paper title lists. We added the pretrained language model SCIBERT, i.e., scibert_scivocab_uncased.tar, to the pipeline to measure the similarity between the author pair. So, given the node IDs of a pair of authors, it returns a similarity score on a scale of 0 to 1. Table 1 below is a snippet of sample rows from the data set for paper title similarity.

Table 1. Sample author pairs with paper title similarity
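A minimal sketch of how such a SciBERT-based title similarity could be computed is given below. It assumes the Hugging Face checkpoint allenai/scibert_scivocab_uncased and simple mean pooling of the token embeddings followed by cosine similarity; the exact pooling and pipeline wiring used in this thesis may differ.

import torch
from transformers import AutoTokenizer, AutoModel

# SciBERT checkpoint published on the Hugging Face hub (assumed here)
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(sentences):
    # Mean-pool SciBERT's last hidden states into one vector per sentence
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)            # masked mean pooling

def title_similarity(titles_a, titles_b):
    # Cosine similarity between the averaged title embeddings of two authors
    va = embed(titles_a).mean(0)
    vb = embed(titles_b).mean(0)
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()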

Abstract (Keyword) Similarity

MAG doesn't publish the raw author-supplied keywords. Instead, we leverage the available inverted index of abstracts to get the abstracts of the papers published by an author. We then extract keywords from these abstracts and use them to measure the semantic similarity between an author pair. There are many readily available keyphrase or keyword extraction techniques, such as RAKE (Rose et al., 2010) and YAKE (Campos et al., 2020). However, we found that the results were not satisfactory, since these models focus on a text's statistical properties rather than its semantics. Hence, we use BERT (Devlin et al., 2018), a bidirectional transformer model that allows us to transform phrases and documents into vectors that capture their meaning. There are many methods for generating BERT embeddings, such as Flair and Hugging Face transformers. Here, we used the sentence-transformers package, as it allows us to create high-quality embeddings that work well at the sentence and document level. Using a pretrained model, we extract the keywords from the set of abstracts we have for each author in the data set. Finally, we measure the semantic similarity between the extracted keywords using the SCIBERT language model, which gives us a similarity score on a scale of 0 to 1.
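The sketch below illustrates the kind of embedding-based keyword extraction described above: candidate n-grams from an abstract are ranked by their cosine similarity to the abstract's embedding. The sentence-transformers checkpoint name is an assumption for the example only; the thesis does not pin a specific pretrained model here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint; any sentence-transformers model could be used
model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_keywords(abstract, top_k=10):
    # Candidate uni- and bi-grams, with English stop words removed
    candidates = CountVectorizer(ngram_range=(1, 2), stop_words="english") \
        .fit([abstract]).get_feature_names_out()
    doc_vec = model.encode([abstract])
    cand_vecs = model.encode(list(candidates))
    # Rank candidates by similarity to the whole abstract
    scores = cosine_similarity(doc_vec, cand_vecs)[0]
    return [candidates[i] for i in scores.argsort()[::-1][:top_k]]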

Co-citations

Co-citation is defined as the frequency with which two papers are cited together by other articles or publications. The more co-citations two publications receive, the higher their co-citation strength, and the more likely they are to be semantically related. In this thesis, we compute the co-citation strength of a pair of authors by counting the number of papers that cite papers by both authors in the pair, treating it as another proxy for semantic similarity. The intuition behind adding this to our feature set is that, in most cases, co-citation among papers happens when they belong to similar research areas. Hence, this metric helps us capture the similarity between authors whose research interests overlap.
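A minimal sketch of this co-citation count is shown below, assuming the citation relationships have been exported into a simple Python mapping; this hypothetical structure stands in for the Neo4j queries actually used in this work.

def co_citation_strength(papers_a, papers_b, citations):
    """Count papers that cite at least one paper of each author.

    papers_a / papers_b : sets of paper IDs written by each author
    citations           : dict mapping a citing paper ID -> set of cited paper IDs
    """
    return sum(
        1
        for cited in citations.values()
        if cited & papers_a and cited & papers_b
    )

# Toy example: paper p3 cites one paper from each author
citations = {"p3": {"p1", "p2"}, "p4": {"p1"}}
print(co_citation_strength({"p1"}, {"p2"}, citations))  # -> 1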


3.2. Network Embedding Based Approach for Link Prediction

“Network embedding is considered as a dimensionality reduction technique in which higher D dimensional nodes (vertices) in the graphs are mapped to a lower d (d << D) dimensional representation (embedding) space by preserving the node neighborhood structures” (Alexandru Cristian Mara, 2020). Thus, similar nodes in the original network have similar embeddings in the representation space. Analogous to the previous approach, where we measured the similarity among the authors using structural as well as semantic features of the heterogeneous graph, here we map the nodes of the network to vectors in an embedding space such that these representations can be used to approximate any notion of similarity or proximity between pairs of author nodes in the network. Standard machine learning methods are then applied to these embeddings to learn and estimate the likelihood of edges between nodes that are not linked in the input network.

3.2.1. Homogeneous Network Embedding

(Mikolov et al., 2013) introduced word2vec, in which vector representations of words are learned from a large corpus of text. DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016) were inspired by this architecture: they use a random walk-based sampling technique to generate node sequences that capture a node's neighborhood characteristics, similar to how a sentence captures the relational interaction between words. A skip-gram model is then used to learn the representation of a node, which aids in predicting its structural context within these node sequences. The aim is to maximize the network probability in terms of local structures given a network G = (V, E), which is expressed as:


\arg\max_{\theta} \prod_{v \in V} \prod_{c \in N(v)} p(c \mid v; \theta)

where N(v) is the neighborhood of node v in the network G, which can be defined in different ways, such as v's one-hop neighbors, and p(c | v; θ) represents the conditional probability of having a context node c given a node v.

3.2.2. Heterogeneous Network Embedding

(Dong et al., 2017) introduces the heterogeneous skip-gram approach to learn the semantics of heterogeneous graphs, where more than one node type is present. In this study, inspired by metapath2vec, we propose weighted meta-path-based random walks in heterogeneous networks. We built a pipeline with the following steps to achieve our aim of obtaining link predictions: 1) define a series of meta-path-based random walks to turn the structure of the network into skip-gram input; 2) from the node sequences obtained in step 1, learn node representations for the heterogeneous network; 3) recast the problem as a binary classification task by computing a node-pair embedding, which is then fed into a binary classifier to predict links.

3.2.3. Weighted Meta-path Biased Random Walks

How do we efficiently transform a graph's topology into skip-gram input? In word2vec, we train a neural network to predict the nearby words given a specific word. To train this model, we need a corpus of text from which to build a vocabulary of words; using these sentences, the model learns the vector representation of each word in the vocabulary. Similarly, the goal in our use case is to learn the vector representations of the nodes in our network. For our model to learn the context of each author node, it needs a set of node sequences analogous to the sentences used in training word2vec. For generating these node sequences, we could have used graph traversal techniques like BFS and DFS, but those are computationally expensive and might also lead to many irrelevant meta-paths. Inspired by metapath2vec, we design weighted meta-path-based random walks that can capture both the topological and semantic relations among the nodes in the network. Similar to the skip-gram in word2vec, here we use a heterogeneous skip-gram model that distinguishes the node types and generates the embeddings or vector representations for each of the author nodes.

Meta-path: (Sun & Han, 2013) formally defines a meta-path scheme P as a path denoted in the form

V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l

where R = R_1 \circ R_2 \circ \cdots \circ R_{l-1} defines the composite relation between node types V_1 and V_l. To capture structural and semantic features, we design multiple meta-paths, where each meta-path exploits a different semantic meaning of the relation. Table 2 below describes the meta-paths we have used and their semantic meaning.

Table 2. Meta-paths and their semantic meaning

Meta-Path            Semantic Meaning
A – P – A            a1 and a2 are co-authors – they publish together
A – P – V – P – A    a1 and a2 publish at the same venue
A – P – J – P – A    a1 and a2 publish in the same journal
A – P – A – P – A    a1 and a2 publish with the same author
A – I – A            a1 and a2 are affiliated with the same institution
A – P – P – P – A    the same papers cite a1 and a2

We use these meta-paths to guide heterogeneous random walkers. Given a heterogeneous academic network G = (V, E, T) and a meta-path scheme

\mathcal{P}: V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l,

the transition probability at step i is defined as follows:

p(v^{i+1} \mid v_t^i, \mathcal{P}) =
\begin{cases}
\dfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\
0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \neq t+1 \\
0 & (v^{i+1}, v_t^i) \notin E
\end{cases}

where v_t^i \in V_t and N_{t+1}(v_t^i) denotes the V_{t+1}-type neighborhood of the node v_t^i. In other words, we ask the random walker to traverse to the next node only if the adjacent node is of the node type specified in the pre-defined meta-path scheme. Hence, in the equation above, the transition probability at step i is 0 when the adjacent node type is not the one specified in the meta-path or when there is no edge between the two nodes.
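The following sketch shows one way such a meta-path-biased walk could be implemented, assuming simple Python dictionaries for the adjacency lists and node types; this is a simplification of the actual heterogeneous graph used in this work.

import random

def metapath_walk(neighbors, node_type, start, metapath, walk_length):
    """One meta-path-biased random walk.

    neighbors : dict node -> list of adjacent nodes
    node_type : dict node -> type label, e.g. 'A', 'P', 'V'
    metapath  : list of type labels, e.g. ['A', 'P', 'V', 'P', 'A'] (symmetric,
                starting with the type of the start node)
    """
    walk = [start]
    # Cycle through the meta-path (dropping the repeated first label) until the
    # walk reaches the desired length or gets stuck.
    pattern = metapath[1:]
    i = 0
    while len(walk) < walk_length:
        wanted = pattern[i % len(pattern)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:          # all transition probabilities are 0
            break
        walk.append(random.choice(candidates))   # uniform over valid neighbors
        i += 1
    return walk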

Each of these pre-defined meta-paths has its own importance in predicting the co-authorship relation. So, we rank these meta-paths by generating node embeddings for each of them and training a classifier to predict the non-existing links. In this way, we obtain each meta-path's importance, which decides how many walks of each meta-path type are passed into our heterogeneous skip-gram model.

Figure 9. Weighted meta-path approach using supervised learning


As described in Figure 9, similar to the feature extraction approach implemented earlier, we extract the neighborhoods of author node pairs using a set of features, each of which corresponds to a meta-path. For example, 'common co-authors' corresponds to the 'APAPA' meta-path of our pipeline. Using this set of six features, we train a supervised classification model to learn the co-author links among the authors. The assumption here is that the feature importance scores of these features reflect the importance of each corresponding meta-path.

Figure 10. Weighted meta-paths and their importance scores

Figure 10 describes each meta-path we used and its corresponding importance score. From the adjacent table, we see that the 'APVPA', 'APAPA', and 'APPPA' meta-paths contribute more to predicting the co-author relationships than the other three.
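A minimal sketch of this ranking step is shown below, assuming a feature matrix X with one column of similarity scores per meta-path and binary co-author labels y; the choice of a Random Forest and its hyperparameters here is illustrative, not the exact setup used in this work.

from sklearn.ensemble import RandomForestClassifier

META_PATHS = ["APA", "APVPA", "APJPA", "APAPA", "AIA", "APPPA"]

def rank_metapaths(X, y):
    # One feature column per meta-path; feature importances act as weights
    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    ranked = sorted(zip(META_PATHS, clf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    return ranked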

3.2.4. Heterogeneous Skip-gram Model

We use the heterogeneous skip-gram model introduced by (Dong et al., 2017) to learn node embeddings from the node sequences generated by the meta-path-based walks above. The next question is: what is the difference between the original skip-gram model used in word2vec and the heterogeneous skip-gram model?

In word2vec, we train a neural network to do the following: given a specific word in the middle of a sentence, pick one word at random from the words nearby. The network tells us, for every word in our vocabulary, the probability of it being the “nearby word” that we chose. The “nearby words” are nothing but the context of the given input word. Similarly, in subsection 3.2.3 we discussed how we built the context using pre-defined meta-paths. The intuition behind the original skip-gram is that if two different words have very similar “contexts”, then the model needs to output very similar results for these two words, and one way for the network to output similar context predictions for the two words is for their word vectors to be similar. So, if two words have similar contexts, our network is motivated to learn similar word vectors for them. Now, let us map this intuition to our use case of co-author prediction. Here, instead of the words in the word2vec model, we have the different nodes in the network. The intuition remains the same, i.e., two authors with similar contexts (created using meta-paths) need to have similar vector representations or embeddings. In other words, the model we built learns the embedding of each author node, capturing its heterogeneous context with other node types and the relationships among them.

Given a heterogeneous network G = (V, E, T) with |T_V| > 1, the difference is that we have multiple node types. So, we add a new term, N_t(v), t \in T_V, to maximize the likelihood of each node type given a node v:

\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta)

where N_t(v) denotes v's neighborhood with the t-th type of nodes, c_t is the context of the given input node, and p(c_t \mid v; \theta) is commonly defined as a softmax function, that is,

p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}}

where X_v is the v-th row of X, representing the embedding vector for node v.


3.3. Supervised Machine Learning Algorithms

Supervised learning is the most common technique for classification problems, since the aim is to make the system learn the labels we have assigned to each of the input samples. When input data is fed into the model, it adjusts its weights during training to ensure that the model fits the data appropriately. The key challenge is to build a mapping function that can predict the label of a given sample based on the extracted features. Supervised machine learning can be split into two types depending on the task being handled, i.e., classification and regression.

Classification uses an algorithm to correctly assign the input test data into specific classes whereas Regression is used to understand the relationship between independent and dependent variables.

In this work, we have trained our features using four different types of classification algorithms, i.e., Logistic Regression (Walker & Duncan, 1967), Support Vector Machines (Hearst et al., 1998), Random Forests (Quinlan, 1986), and AdaBoost (Freund & Schapire, 1996). In the following subsections, we describe each technique in detail with its pros and cons.

3.3.1. Logistic Regression

Logistic regression (Walker & Duncan, 1967) operates by extracting a series of weighted features from the data, taking logs, and combining them linearly, so that each feature is multiplied by a weight and the products are added together. It is a form of regression that predicts the likelihood of an event occurring by fitting the data to a logistic function. Like every other form of regression analysis, logistic regression employs a number of predictor variables that may be numerical or categorical. The logistic function, also known as the sigmoid function, is an S-shaped curve that takes a real-valued number as input and maps it to a value between 0 and 1, but never exactly 0 or 1. In binary logistic regression, the categorical output has only two cases (e.g., whether a connection or link exists or not); in multinomial logistic regression, there are three or more possible outcomes.

Using a decision boundary, a threshold can be set to predict which class a data point belongs to. Based on this threshold, the estimated likelihood is mapped to a class or category. For example, if the predicted probability is greater than or equal to 0.5, we classify the co-author relationship as existing; otherwise, we do not.

3.3.2. Support Vector Machines

SVM (Hearst et al., 1998), which stands for support vector machine, can be used for both regression and classification tasks, though it is most commonly used for classification. Many users choose SVMs because they achieve good accuracy while using relatively few computing resources. The SVM algorithm's aim is to locate a hyperplane in an N-dimensional space. SVM is a binary classifier at its core, and hyperplanes are used to separate the two types of data points. We aim to find the hyperplane with the largest margin, i.e., the greatest distance between the data points of the two groups. Maximizing the margin provides some reinforcement, enabling future data points to be classified with greater confidence. Hyperplanes are decision boundaries that aid in the classification of data points: different classes may be assigned to data points that land on either side of the hyperplane. Furthermore, the hyperplane's dimension is determined by the number of features. When there are two input features, the hyperplane is simply a line; with three input features, it becomes a two-dimensional plane; beyond three features, it becomes hard to visualize.

Support vectors are the data points closest to the hyperplane, and they affect the hyperplane's position and orientation. Using these support vectors, we maximize the classifier's margin; the location of the hyperplane will shift if the support vectors are removed. In SVM, we take the output of the linear function and assign one class if it is greater than 1 and the other class if it is less than -1. Hinge loss is a loss function that aids in margin maximization: if the predicted and actual values have the same sign and the margin is respected, the cost is negligible; if not, we compute a loss value. We also introduce a regularization parameter into the cost function, whose goal is to balance margin maximization and loss. The kernel trick is a strategy used by the SVM algorithm.

Specifically, the SVM kernel is a function that takes a low-dimensional input space and translates it to a higher-dimensional space, converting a non-separable problem to a separable problem. It is most effective when dealing with non-linear separation issues.

Simply stated, it performs some incredibly complicated data transformations before determining how to isolate the data depending on the labels or outputs you've specified.

SVMs do not perform as well on large data collections, since the expected training time becomes long, and they also struggle when the data set is noisy, i.e., when the target classes overlap.

In our use case, we implemented a linear SVM, which uses a linear kernel to find a hyperplane that separates the relationships based on the selected features.

3.3.3. Random Forests

As the name suggests, a random forest is made up of many individual decision trees (Quinlan, 1986) that function together as an ensemble. Each tree in the random forest generates a class prediction, and the class with the most votes becomes the prediction of our model. The wisdom of the crowd is at the heart of random forests. In data science terms, the random forest paradigm performs well because many relatively uncorrelated models (trees) acting as a committee will outperform each of the constituent models individually. The key point is the low correlation between models. Uncorrelated models can generate ensemble forecasts that are more reliable than any of the individual predictions, comparable to how portfolios of assets with low correlations (such as stocks and bonds) create a better-performing portfolio in the aggregate. The explanation for this result is that the trees shield each other from their mistakes (as long as they don't all err in the same direction all of the time): while some trees will be incorrect, many others will be accurate, so as a group the trees move in the correct direction.

A decision tree is a specific type of flow chart used to visualize the process of decision making, where internal nodes represent selected input features, edges represent possible values of those features, and leaves represent possible outcomes. The idea behind decision trees is that the samples are repeatedly divided into subsets until reaching smaller collections of data points that share a single class label. A decision tree is built top down from a root node; at each level, using a cost function and trying different split points, we move from complete uncertainty towards complete certainty. For a decision tree to be effective, it should cover all the possibilities, i.e., all possible pathways and event sequences.

3.3.4. AdaBoost

Boosting is an ensemble strategy that aims to construct a powerful classifier from a set of weak classifiers. This is accomplished by first building a model from the training data and then building a second model that aims to correct the errors of the first. Models are added until the training set is predicted perfectly or the maximum number of models is reached. AdaBoost (Freund & Schapire, 1996) was the first truly successful boosting algorithm for binary classification and is a natural starting point for learning about boosting.

AdaBoost is best used to improve the accuracy of decision trees on binary classification problems. Freund and Schapire, the technique's developers, initially referred to AdaBoost as AdaBoost.M1. It is also known as discrete AdaBoost, since it is used for classification rather than regression. Many machine learning algorithms can benefit from AdaBoost's performance enhancements, but it works best with weak learners: models that achieve accuracy only slightly above random chance on a classification problem. The most common choice is a one-level decision tree; since these trees are so small and make only one classification decision, they are referred to as decision stumps.

3.4. Evaluation Metrics

The performance of any trained model is assessed using evaluation metrics. For a particular problem or use case, these help determine the best-performing model for the selected set of features. They measure how well our model labels the classes or estimates the outcome. There are several measures to evaluate a model's performance, such as precision, recall, F1 score, and AUC score.

3.4.1. Precision

Precision is a performance metric that calculates the fraction of instances returned by the classifier that are correct. It is also known as the positive predictive value. It is the ratio of true positives (correctly retrieved instances) to the total number of true positives and false positives. High precision means our model has a low false positive rate.


3.4.2. Recall

Recall is another performance measure used in machine learning; it gives the fraction of actual positive instances that are correctly retrieved by the classifier. It is also known as the sensitivity of the classifier. It is the ratio of correctly predicted positive observations to all observations in the actual positive class. We can think of recall as accuracy over just the positives.

3.4.3. F- measure

The F1 score is the harmonic mean of precision and recall. It comes into the picture when we need a balance between precision and recall. Also, compared with accuracy, F1 can be a better measure when there is an uneven class distribution.

3.4.4. AUC Score

A ROC curve (Receiver Operating Characteristic curve) is a graph that depicts a classification model's performance over all classification thresholds. This curve plots two parameters:

• True Positive Rate

• False Positive Rate

TPR vs. FPR is plotted on a ROC curve at various classification thresholds. Lowering the classification threshold causes more items to be classified as positive, which increases both the false positives and the true positives.


AUC is an abbreviation for "Area Under the ROC Curve." It measures the entire two-dimensional area under the ROC curve and offers a consolidated indicator of performance across all classification thresholds. AUC can be interpreted as the likelihood that the model scores a random positive sample higher than a random negative sample.

The AUC scale runs from 0 to 1. A model with 100% incorrect predictions has an AUC of 0.0; one with 100% accurate predictions has an AUC of 1.0.
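For reference, a minimal sketch of how these four metrics can be computed with scikit-learn for a fitted binary classifier is shown below; the function and variable names are illustrative rather than taken from the thesis code.

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(clf, X_test, y_test):
    # Report the four metrics used in this study for a fitted binary classifier.
    y_pred = clf.predict(X_test)
    # For classifiers without predict_proba (e.g., linear SVMs),
    # decision_function scores can be used for the AUC instead.
    y_prob = clf.predict_proba(X_test)[:, 1]
    return {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }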


4. Data and Experimental Setup

This section describes the data collection and pre-processing steps we followed in building the link prediction model. Then, we explain the implementation steps of the two approaches we explored in this work.

4.1. Data

We studied the problem of predicting potential future collaborations (link prediction) among the authors in an academic collaboration network. In other words, our goal was to predict the collaborations that might occur in the future by looking at past collaborations using various link prediction techniques. There are two types of networks along this line: homogeneous and heterogeneous. In a homogeneous network, only one type of object (authors) and one type of link (co-authorship) exist. In contrast, a heterogeneous network consists of multiple types of objects (e.g., venues, papers, fields of study) and various types of links (e.g., co-author, write, belongs to). Although we could make predictions using information from homogeneous networks by taking into account just the available “co-author” relationship, heterogeneous bibliographic networks can generate more accurate predictions by using the heterogeneous context of an author on top of the topological structure of the graph. Towards this objective, there are many publicly available bibliographic data sets, such as DBLP, Cora, Microsoft Academic Graph, CiteSeer, PubMed, and arXiv. Here, we use the Microsoft Academic Graph (MAG) (Sinha et al., 2015) for our co-author prediction task because of its heterogeneous nature and the way it is built: documents discovered by the Bing crawler are processed to extract scholarly entities and their relationships, forming a knowledge base. Currently, MAG is one of the most significant academic content indexes next to Google Scholar.

4.1.1. Microsoft Academic Graph

The MAG (Sinha et al., 2015) is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. Figure 11 shows the count of each entity type in MAG and its heterogeneous structure.

Figure 11. MAG schema and MAG entity data(adapted from academic.microsoft.com)

Each entity in the MAG schema has a set of attributes attached. For example, the Author entity consists of details such as Author Id, Rank, DisplayName, Paper Count, Citation Count, etc., and similarly, the Papers entity consists of attributes such as Paper Id, Doi, Doc Type, Paper Title, Year, Date, Conference Id, Citation Count, Publisher, etc.


4.1.2. Data Collection

We queried MAG for authors with a citation count greater than 100 and publications that belong to the field of study (FoS) Computer Science in the timeframe 2005 to 2020. The corresponding papers, conferences, journals, and affiliations were also retrieved using the above Author IDs. To perform network analysis and link prediction, we need to construct a heterogeneous graph from the available data in CSV. Hence, we loaded the data into a NoSQL graph database, Neo4j (neo4j.com).

Table 3. Distribution of the count of each node type

Node Label                   Count
Author                       82421
Papers                       209525
Conference Series (Venues)   62491
Journals                     20972
Affiliations                 1842

Each entity type has its own set of attributes, which we call properties in Neo4j. This study created five distinct entities – Author, Papers, Affiliation, Conference Series (Venue), and Journals. Figure 12 shows each of the entities and their related properties.


Figure 12. Hierarchy describing 5 different entities and their properties of each entity type

4.1.3. Building a Collaboration Graph

The dataset doesn't contain edges among the entities or nodes describing their relationships. We therefore created a set of edges among these entities representing the nature of the relationships among the nodes. The schema below shows the nodes and their corresponding relationships. Each node in the graph has its own ID, which acts as a primary key when creating connections between entities.

We could infer the authors' collaborations by finding the papers authored by multiple people. A “CO_AUTHOR” relationship between the author nodes is created when they have collaborated at least once. A heterogeneous graph is constructed by creating relationships among other node types.
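A sketch of how such CO_AUTHOR edges could be created from the WROTE relationships with the Neo4j Python driver is shown below; the connection details are placeholders, and the exact Cypher used in this work may differ.

from neo4j import GraphDatabase

# Placeholder connection details; node labels and relationship types follow
# the schema described above (Author and Paper nodes linked by WROTE).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_CO_AUTHOR = """
MATCH (a1:Author)-[:WROTE]->(p:Paper)<-[:WROTE]-(a2:Author)
WHERE id(a1) < id(a2)
MERGE (a1)-[:CO_AUTHOR]-(a2)
"""

with driver.session() as session:
    session.run(CREATE_CO_AUTHOR)   # infer co-author edges from shared papers
driver.close()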


Figure 13. Heterogeneous graph schema

Table 4 below briefly describes each of the relationship types created using Cypher queries in Neo4j.

Table 4. Description of relationships in the graph

Node 1   Node 2        Relationship           Description of relationship type
Author   Author        [:CO_AUTHOR]           An edge between a pair of authors if they have at least one Paper in common
Author   Affiliation   [:AFFILIATED_WITH]     Author's current affiliation with an institute
Author   Paper         [:WROTE]               An author wrote the Paper
Paper    Conference    [:PUBLISHED_AT]        Paper published at a venue/conference
Paper    Journal       [:PUBLISHED_IN]        Paper published in a journal
Paper    Paper         [:CITED]               Paper cited/referenced

4.2. Link Prediction Problem

The link prediction problem can be defined as inferring new links that are likely to occur in the near future among a subset of nodes in the network. We consider a network graph G = (V, E) in which each edge e ∈ E represents an interaction between its endpoints at a particular time t(e). To formulate the link prediction problem, we choose a training interval [t_0, t_0'] and a test interval [t_1, t_1'] where t_0' < t_1; given access to the network G[t_0, t_0'], the algorithm must predict whether or not an edge exists between a pair of authors in the network G[t_1, t_1']. We used the collaboration network constructed from MAG to predict the links between author pairs that are likely to occur in the future using the topological and semantic methods defined in Chapter 3.

To evaluate the efficacy of our method, we need to train a machine learning model to learn the co-author relationships among the author nodes in the network.

Towards this goal, the first step is to divide our data into train and test sets: using the author pairs in the train set, we train a classifier, and we evaluate the trained classifier's performance using the test set of author pairs. One way of dividing the train and test sets is to use the authors' collaboration year, with existing co-author edges as positive samples and author pairs n hops apart (usually n = 2, 3, 4) as negative samples. The other way is to randomly remove a fraction of edges and predict those. Our study experimented with both ways and compared the results of the two. The sub-sections below explain the two experiments conducted using the feature extraction approach.

4.2.1. Case 1: Experiment with Negative Samples as Nodes n-hop Away

Inspired by (Mark, 2019), we built a binary classifier to predict future collaborations from the co-authorship graph. We first need to create our train and test datasets. However, there is a risk of data leakage in the case of graph data, since pairs of nodes in our training set may be linked to those in the test set. So, we split our graph into training and test subgraphs using the time information we have, i.e., the first year in which co-authors collaborated. We found that 2015 acts as a good year to split the data, as it gives a reasonable number of samples in each of our subgraphs. We used everything from 2015 and earlier as our training graph and everything from 2016 to 2020 as our test graph. In other words, by training our classifier on collaboration data from 2005 to 2015, we predict the collaborations that might occur in the next five years, i.e., 2016 to 2020. This split leaves us with 224,096 relationships in the early graph and 103,582 in the later one.

The relationships in these subgraphs act as the positive samples in our train and test sets, but we also need negative examples so that our model can learn to distinguish node pairs that could have a link between them from those that could not. Instead of using all possible pairs as negative samples, we used pairs of nodes between 2 and 3 hops away from each other. This gave us 600k+ negative samples compared to 224,096 positive samples. To overcome the class imbalance issue, we applied a down-sampling technique to the generated negative samples. Figure 14 shows what the training and test data look like:

Figure 14. Training data frame
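For illustration, the sketch below shows one way to sample such negative pairs 2-3 hops apart with NetworkX; the thesis generated these samples from the graph database, so this is only an equivalent in-memory approximation with hypothetical parameter names.

import random
import networkx as nx

def sample_negatives(G, n_samples, min_hops=2, max_hops=3):
    # Sample author pairs that are 2-3 hops apart (no existing co-author edge)
    negatives = set()
    nodes = list(G.nodes())
    while len(negatives) < n_samples:
        u = random.choice(nodes)
        # Shortest-path lengths from u, limited to max_hops
        lengths = nx.single_source_shortest_path_length(G, u, cutoff=max_hops)
        candidates = [v for v, d in lengths.items() if min_hops <= d <= max_hops]
        if candidates:
            v = random.choice(candidates)
            negatives.add(tuple(sorted((u, v))))
    return list(negatives)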

4.2.2. Case 2: Experiment with Randomly Chosen Negative Samples

In the second set of experiments, instead of relying on the collaboration year, we consider a network with randomly sampled negative examples. The labeled dataset of edges is generated as follows: ensuring that the residual network remains connected, we randomly choose 40% of the existing edges from the network and label them as positive examples, and to generate negative examples, we randomly sample an equal number of node pairs from the network that have no co-author edge between them. This split leaves us with 131,071 positive and 131,071 negative examples. We then proceed to extract features from the data and train the classifiers on it.

4.3. Generating Link Prediction Features

We then used the methods listed in Chapter 3 to create the topological and semantic similarity features used for link prediction.

4.4. Choosing a Binary Classifier

Our feature set contains both strong and weak features. We train and evaluate our models using Random Forests, Logistic Regression, AdaBoost, and SVM.
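A minimal sketch of this comparison with scikit-learn is shown below; the hyperparameters are illustrative defaults, not the exact settings used in this work, and X and y stand for the extracted features and the co-author labels.

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# X: extracted topological + semantic features, y: co-author labels (0/1)
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "Linear SVM": LinearSVC(max_iter=5000),
}

def compare(X, y):
    # 5-fold cross-validated AUC for each candidate classifier
    return {
        name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        for name, clf in classifiers.items()
    }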

4.5. Network Embedding Based Approach for Predicting Future Collaborations

This section explains the implementation details of the network embedding-based approach. We train a binary classifier on top of the node embeddings obtained from the skip-gram model, which takes as input the heterogeneous context of the nodes built through pre-defined random walks. To describe each step in detail, we constructed a pipeline (Figure 15) with our heterogeneous collaboration graph as input and co-author link predictions as output.


Figure 15. Pipeline for meta-path-based network embedding approach

We divide the pipeline into two major components: generating node embeddings for each node and building a prediction pipeline for learning the co-author relationships. Let us first look at obtaining the node embeddings. The input is our heterogeneous collaboration graph with Author, Paper, Venue, Journal, and Affiliation nodes and the relationships among them.

4.5.1. Generating Node Embeddings

To model the heterogeneous neighborhood, we use the heterogeneous skip-gram model.

We have applied pre-defined meta-path-based random walks to the heterogeneous collaboration network to incorporate the heterogeneous network structure into the skip-gram model. We could obtain the heterogeneous context either by using graph traversal strategies, i.e., BFS and DFS, or by using pre-defined meta-paths. The former is computationally expensive, since random walks also generate many irrelevant meta-paths, while the latter requires domain knowledge. We believe that, for our use case of co-authorship

prediction, building a neighborhood for each of the author nodes using pre-defined symmetric meta-paths makes more sense than training a skip-gram using random walks from automatic meta-path generation methods. In this work, to capture the heterogeneous context of author nodes w.r.t. other node types, i.e., Papers, Venues, Journals, and Affiliations, we construct six types of meta-paths, which help us measure the similarity among the authors in each of these contexts. These symmetric meta-paths were defined so that each meta-path starts and ends with an Author node, because our goal is to obtain the low-dimensional representations of author nodes.

Figure 16 below shows a snippet of the Author-Paper-Venue-Paper-Author meta-path.

Figure 16. Author - Paper - Venue - Paper - Author meta-path

Using these meta-paths as input, we have trained a heterogeneous skip-gram model using the following set of parameters to obtain the node embedding for each node.

1. The number of walks per node w: 1000

2. The walk length l: 100

3. The vector dimension d: 128

4. The neighborhood size (window size) k:5
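Once the walks themselves are constrained by the meta-paths, the skip-gram training can be approximated with a standard skip-gram implementation. The sketch below uses gensim's Word2Vec with the parameters listed above; this tooling choice is an assumption rather than the exact implementation used here.

from gensim.models import Word2Vec

# `walks` is a list of node-ID sequences produced by the weighted meta-path
# random walks described above (walk length 100, up to 1000 walks per node).
def train_skipgram(walks):
    model = Word2Vec(
        sentences=walks,
        vector_size=128,   # embedding dimension d
        window=5,          # neighborhood (context) size k
        sg=1,              # skip-gram
        negative=5,        # negative sampling
        min_count=0,
        workers=4,
    )
    return model.wv       # keyed vectors: author ID -> 128-d embedding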


Weighted Meta-paths: Initially, we computed node embeddings using these multiple meta-paths, where each type had equal importance in obtaining each node's embedding. Then, we implemented a weighted meta-path-based approach to assign higher weights to those meta-paths that help make more meaningful co-authorship predictions. This pre-processing step not only gives higher importance to meaningful meta-paths but also helps eliminate irrelevant ones. Our intuition here is that the number-of-walks parameter used in training our skip-gram model decides the importance of each meta-path type as a whole in obtaining the node embeddings. To weight the meta-paths differently, we first rank them by training a classifier on a set of features describing each of the meta-path schemes and use the resulting feature importance scores as the meta-path weights. Figure 17 illustrates the modified pipeline.

Figure 17. Weighted meta-path learning pipeline

4.5.2. Prediction Pipeline

To obtain link predictions, different network embedding strategies require different pipelines. Some embedding approaches calculate link probabilities explicitly (Kang et al., 2018) (Zhang et al., 2018); for others, these need to be learned on top of the node embeddings. There are two typical approaches: (i) estimating the similarities among the nodes using a distance metric like Euclidean distance or cosine similarity to predict the existence of a link, and (ii) treating the problem as a supervised machine learning problem by assigning labels to each of the data samples. The latter, which is generally more successful (Gurukar et al., 2019), requires a node-pair embedding pre-computation stage: at this point we only have the node representations, so to transform the problem into a classification task we need to compute an edge embedding by applying an operator to each pair of node embeddings (Alexandru Cristian Mara, 2020). In this work, we used the latter to build our prediction pipeline on top of the node embeddings computed in the previous step.

Train/Test Split: We have split the data in the following way to avoid any data leakage and make sure that algorithms are evaluated correctly (CSIROData61., 2018).

• Train Graph: For computing node embeddings.

• Training Set: A collection of positive and negative edges used to train the classifiers on the node embeddings computed from the Train Graph.

• Model Selection Set: A collection of positive and negative edges that were not used for computing node embeddings or training the classifier, used to choose the best classifier.

• Test Graph: For computing test node embeddings.

• Test Set: A collection of positive and negative edges that are not included in the computation of the test node embeddings, classifier training, or model selection.


Choice of Binary Operator (Grover & Leskovec, 2016): Given two author nodes u and v, we define a binary operator ◦ over the corresponding feature vectors f(u) and f(v) to generate a representation g(u, v) such that g: V × V → R^{d'}, where d' is the representation size (128 in our case) for the author pair (u, v). Table 5 below shows the set of binary operators we used for learning edge features.

Table 5. Choice of binary operator
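Since the operators follow (Grover & Leskovec, 2016), the sketch below lists the four standard choices from that work (average, Hadamard, weighted-L1, weighted-L2) and shows how an edge embedding would be formed from two author embeddings; the function names are illustrative, and the mapping to the entries of Table 5 is assumed.

import numpy as np

# The four operators defined in (Grover & Leskovec, 2016); the best-performing
# one is selected on the model-selection set.
def average(u, v):  return (u + v) / 2.0
def hadamard(u, v): return u * v
def l1(u, v):       return np.abs(u - v)
def l2(u, v):       return (u - v) ** 2

def edge_embedding(emb, a1, a2, op=hadamard):
    # Combine two 128-d author embeddings into one edge feature vector
    return op(np.asarray(emb[a1]), np.asarray(emb[a2]))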

We have used 56,387 examples as the Training Set, 23,986 examples for Model Selection, and 82,541 as the Test Set for our data set. Below are the steps we implemented to train and evaluate the link prediction model.

Training:

1. Apply a binary operator to the embeddings of the source and target author nodes of each sampled edge to calculate the link/edge embeddings for the positive and negative edge samples.

2. Train a classifier to predict whether or not a co-authorship relation between two authors should occur, based on the embeddings of the positive and negative cases.

3. Select the best classifier by evaluating the link classifier's output on the training data for each of the four operators, using the node embeddings computed on the Train Graph.


Testing:

1. The best-performing classifier is used to compute scores on the test data, with the node embeddings computed on the Test Graph.


5. Results and Discussion

This section evaluates and discusses the performance of the supervised machine learning algorithms used for both the feature extraction-based and network embedding-based approaches implemented in this thesis. Apart from the evaluation metrics, we also performed a case study – “Relevant Author Search” – to demonstrate the effectiveness of our weighted meta-path-based network embedding approach in recommending similar authors, and we compare our feature extraction-based and network embedding-based approaches against the baseline from the literature, metapath2vec.

5.1. Feature Extraction Based Approach Results

We extract the node pairs from the train and test subgraphs, and a set of classifiers is trained on a data split with an equal number of positive and negative examples.

Supervised learning is the process of learning a mapping function with an algorithm such that, given new input data (x), you can predict the output variable (y). We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected. When the algorithm reaches an acceptable level of performance, the learning process ends. Since our use case deals with labels or classes, we used classification algorithms. In this study, the inputs are the extracted topological and semantic features, whereas the output is the actual co-author relationship between the author node pairs.

In the following two sub-sections, 5.1.1 and 5.1.2, we show our experiments' performance for the two cases described in Chapter 4.


5.1.1. Results of Experiments with Negative Samples as Nodes n-hop Away

We experimented with the set of topological and semantic features extracted using the methods described in Chapter 3. Table 6 below shows the performance of the different classifiers evaluated on 120k samples of author pairs (the test set) for finding co-author relationships that might occur in the future. Along with Logistic Regression, we also trained the set of features using SVM, Random Forest, and AdaBoost models. For this subset of data, the models performed well, with AUC values ranging from 0.80 to 0.89.

Table 6. Performance of different classifiers (Logistic Regression, SVM, Random Forest, and AdaBoost) for predicting co-author relations.

Classifier            Accuracy   Precision   Recall   AUC    F1 Score
Random Forest         0.91       0.88        0.81     0.89   0.84
SVM                   0.85       0.81        0.72     0.80   0.76
AdaBoost              0.86       0.87        0.86     0.84   0.86
Logistic Regression   0.86       0.88        0.78     0.82   0.87

In our data collection, we ensured that the counts of positive and negative samples were almost equal. For the reported findings, we used 5-fold cross-validation for all of the algorithms. According to the table above, all of the models we tested reached an accuracy greater than 80%, which indicates that the features we chose have high discriminating capacity. On the accuracy metric, Random Forest performed best, with an accuracy of 91%. Although the subset of data we extracted from MAG covers the previous 15 years of published articles, the accuracy of link prediction did not deteriorate over this longer time range, even though authors' institutional affiliations, co-authors, and research interests may vary over time. The semantic features of paper-title similarity and keyword (abstract) similarity might be a reason for this. Apart from accuracy, we can also consider the F1 score, the harmonic mean of precision and recall; on this metric, Logistic Regression performed better than Random Forest, which has the higher accuracy. To better evaluate the performance at different classification thresholds, we plotted the ROC curve for the test set of author node pairs (Figure 18).

Figure 18. ROC-AUC Curve with topological and semantic features using Random Forests

AUC Interpretation:

1. At the lowest point of the curve, (0, 0), the threshold is 1.0. Here the model classifies every author pair as negative, so no co-author links are predicted.

2. At the highest point, (1, 1), the threshold is 0.0. Here the model predicts every node pair as co-authors.

3. The remainder of the curve shows the False Positive Rate and True Positive Rate for threshold values between 0 and 1. At one point, a False Positive Rate close to zero is achieved together with a True Positive Rate close to one (around 0.9); this is where the model most accurately recovers the existing co-author relationships.

In this experiment, we obtained an AUC of 0.89. In simple terms, this means that if we pick a random existing co-author pair and a random non-co-author pair, the model ranks the existing pair higher 89% of the time.
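The ROC curve and AUC reported here can be computed from a classifier's test-set scores roughly as follows; the labels and scores below are random placeholders rather than our actual Random Forest outputs.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder labels and scores; in practice these are the true co-author labels
# and the classifier's predicted probabilities on the held-out author pairs.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
y_score = np.clip(0.6 * y_test + 0.5 * rng.random(500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # FPR/TPR at every threshold
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```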

5.1.2. Results of Experiments with Randomly chosen Negative Samples

As described in Chapter 4, we also conducted experiments by randomly choosing a fraction of positive and negative edges from the existing network and predicting them. By splitting the data this way, we test the efficacy of our extracted features. In the previous approach, the negative samples were author nodes 2 or 3 hops away, which might lead the model to rely more on structural similarity than on the semantic similarity of research areas that we computed. The results in Table 7 below show that, for the feature extraction approach with a random train/test split, performance was comparatively poor relative to the previous case.

Table 7. Classifier performance with randomly chosen negative samples

Classifier            Accuracy   Precision   Recall   AUC
Random Forest         0.71       0.68        0.66     0.76
SVM                   0.62       0.62        0.53     0.68
AdaBoost              0.64       0.57        0.58     0.65
Logistic Regression   0.68       0.69        0.63     0.73


In this case, the counts of positive and negative samples are equal, and we again used 5-fold cross-validation for the reported results. Based on the table above, the models' performance was poorer than in the previous approach. On the accuracy metric, Random Forest again performed best, with an accuracy of 71%. The AUC scores range from 0.65 to 0.76, roughly a 10% decrease in the models' performance.

5.1.3. Comparing Results of Case-1 and Case-2

In the first set of experiments, the negative samples were author nodes 2 to 3 hops away from the candidate author. In the second set of experiments, we randomly sampled the positive and negative examples, so both the train and test sets contain pairs of authors that are far apart in the network, which was not the case earlier. The reason for randomly sampling the author pairs is to test the efficacy of the semantic features extracted from the authors' node attributes; a sketch of the two sampling schemes is given below.
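A small NetworkX sketch of the two negative-sampling schemes, assuming G is a (hypothetical) co-author graph; it is meant only to illustrate the difference between the cases, not to reproduce our exact sampling code.

```python
import random
import networkx as nx

def n_hop_negatives(G, n_samples, min_hop=2, max_hop=3):
    """Case 1: non-connected author pairs that are 2-3 hops apart in the graph."""
    negatives, nodes = set(), list(G.nodes())
    while len(negatives) < n_samples:
        u = random.choice(nodes)
        # Shortest-path lengths from u, truncated at max_hop.
        lengths = nx.single_source_shortest_path_length(G, u, cutoff=max_hop)
        candidates = [v for v, d in lengths.items() if min_hop <= d <= max_hop]
        if candidates:
            negatives.add((u, random.choice(candidates)))
    return list(negatives)

def random_negatives(G, n_samples):
    """Case 2: uniformly random author pairs with no existing co-author edge."""
    negatives, nodes = set(), list(G.nodes())
    while len(negatives) < n_samples:
        u, v = random.sample(nodes, 2)
        if not G.has_edge(u, v):
            negatives.add((u, v))
    return list(negatives)
```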

The results show that the model's performance was comparatively poor when the train and test sets were randomly sampled. This suggests that our feature extraction approach depends more on the author nodes' structural similarity than on the semantic features; the model was unable to identify similar authors that are far apart in the network.

Thus, our feature extraction approach could not capture the various semantics behind the authors' similarity.

5.2. Network Embedding Based Approach Results

We show the node embeddings learned by our heterogeneous skip-gram model in the following sub-sections by projecting them onto a 2D plane using PCA. Then, we compare and analyze the results of the weighted meta-path-based supervised learning methods, followed by a case study, "Relevant Author Search," to demonstrate the effectiveness of our methods compared to the feature extraction approach and the baselines in the literature.

5.2.1. Author’s Node Embedding Visualizations

We plotted the author nodes trained using multiple meta-path schemes onto a two-dimensional plane (Figure 19) by applying Principal Component Analysis (n_components = 2) to the original 128-dimensional embeddings; a short sketch of this projection follows the figure caption.

Figure 19. t-SNE visualizations of node embeddings learned from MAG data
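The projection behind this visualization can be reproduced roughly as follows; author_embeddings is a placeholder for the learned 128-dimensional vectors, and we use the PCA projection described above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

author_embeddings = np.random.rand(500, 128)   # placeholder for the learned 128-d vectors

coords = PCA(n_components=2).fit_transform(author_embeddings)  # 128-d -> 2-d
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("Author node embeddings projected to 2D")
plt.show()
```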

In Figure 19, two authors who appear close to each other in the 2D space either share similar research circles or tend to share similar research interests. For example, consider the two authors "Ross T Whitaker," affiliated with the University of Utah, and "Jing Hua," affiliated with Wayne State University, who are close to each other in the embedding space. These two computer scientists share common research interests such as Computer Vision, Image Analysis, and Medical Image Processing. On the contrary, "Samuel Madden," affiliated with MIT, works on Database Systems, Mobile Computing, and Distributed Systems, while "Timothy W Simpson," affiliated with Pennsylvania State University, works on Engineering Design; hence, these two authors appear far apart in the low-dimensional embedding space. This example shows that our intuition of capturing semantic relationships among the authors through other node types (Papers, Venues, Affiliations, Journals) with multiple meta-paths is meaningful. However, our goal in this work is to learn the co-author relationships among these author nodes from the existing/past author collaboration graph and to predict collaborations that might occur in the near future.

5.2.2. Weighted Meta-path Based Supervised Learning Results

As described in Chapter 3, multiple meta-paths were defined to encode the author nodes' structural and semantic relations. A machine learning model trained on top of these low-dimensional representations can learn the co-author edge embeddings and predict future co-author relationships, which shows how well our prediction pipeline makes the intuitive notion above concrete.

Figure 20. Author node embeddings and link (co-author) embeddings using PCA

As described in Chapter 4, we apply a set of binary operators to the node embeddings to generate edge representations, which are then used to train machine learning algorithms to learn the co-author relationships. The right-hand side of Figure 20 shows a two-dimensional projection of the edge (co-author) embeddings obtained by applying the Hadamard operator to the 128-dimensional author node embeddings. To see how well the supervised model has learned the co-author relationships, we color the link embeddings according to the MAG ground truth: blue dots represent true positives and red dots represent true negatives. The few blue dots lurking among the red dots are incorrectly classified samples, since these are projections of our test set. Table 8 below compares the performance of the different classifiers for each operator.


Table 8. AUC scores of weighted meta-path approach with different classifiers

Operator      Classifier            AUC Score
Average       Logistic Regression   0.7209
              Random Forests        0.6589
              SVM                   0.7852
Hadamard      Logistic Regression   0.8711
              Random Forests        0.8120
              SVM                   0.7903
Weighted L1   Logistic Regression   0.7118
              Random Forests        0.6106
              SVM                   0.6236
Weighted L2   Logistic Regression   0.6765
              Random Forests        0.7099
              SVM                   0.6292

Among the supervised machine learning algorithms used, Logistic Regression generally performed well compared to Random Forests and SVMs across the binary operators. Looking at the operators individually, the best AUC of 0.87 was obtained with the Hadamard operator, while performance was comparatively poor when edge embeddings were generated using the Weighted L1 and Weighted L2 operators.


5.3. Case Study: Relevant Author Search

To evaluate the quality of our features and of the author node embeddings generated using machine learning and dimensionality-reduction techniques, we conducted a case study comparing the results of our two approaches against the baseline method in the literature, i.e., Metapath2vec.

Table 9. A Case study of relevant author search

Table 9 lists the top-5 authors returned for the query author "Thomas Dietterich" by the three methods: the feature extraction method (topology and semantics), Metapath2vec (baseline), and the weighted meta-path approach. (a) When we queried the top-25 similar authors using the feature extraction approach (a combination of topology and semantics), most of the returned authors (19 out of 25) were already co-authors of Thomas, meaning that this approach is highly dependent on structural closeness and is unable to find authors who work on similar research areas but are far away in the network. (b) Metapath2vec uses the APVPA meta-path to determine the node representations, and different research areas can be published in the same venue. Hence, Metapath2vec returned several authors (11 out of 25) whose particular research interests differ from Thomas's, showing that APVPA meta-path walks (used by Metapath2vec) can collect contextual nodes that are unrelated to the target author node. (c) Our approach does not rely on a single meta-path, so W-metapaths2vec not only returned authors who are already co-authors of Thomas but also found additional authors (e.g., Manish Raghavan) who share similar research interests with Thomas, showing that our methodology captures both topological and semantic relationships among the authors when learning the author embeddings. In this case, 18 of the 25 authors returned for Thomas were not previously co-authors but have similar research interests, indicating that 72% of the co-author recommendations produced by our method are driven by similarity in research areas rather than just structural closeness in the graph. To determine an author's areas of interest, we used their Google Scholar page and personal home page, and we say that two authors' research interests match when at least two of their listed interests overlap. A sketch of the similarity query itself is given below.
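The query can be sketched as a nearest-neighbour search over the learned embeddings by cosine similarity; the author identifiers and embedding matrix below are placeholders rather than our MAG data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(query_author, embeddings, author_ids, k=25):
    """Return the k authors whose embeddings are closest to the query author's."""
    idx = author_ids.index(query_author)
    sims = cosine_similarity(embeddings[idx:idx + 1], embeddings)[0]
    sims[idx] = -np.inf                      # exclude the query author itself
    order = np.argsort(-sims)[:k]
    return [(author_ids[i], float(sims[i])) for i in order]

# Example with placeholder data.
author_ids = [f"author_{i}" for i in range(100)]
embeddings = np.random.rand(100, 128)
print(top_k_similar("author_0", embeddings, author_ids, k=5))
```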

5.4. Comparison of Feature Extraction Based and Network Embedding Based Approach

In the first approach, we achieved an AUC score of 0.79 using just the topological and semantic features, compared to 0.87 using the network embedding approach. This study identifies the factors that drive the network's co-authorship links, and the results presented above show that we successfully determined the similarity among the authors in the network. The feature extraction approach's performance was poor when the data examples were randomly sampled, whereas our weighted meta-path method still captured the semantics among the authors. From the case study above, we infer that the network embedding approach can recover missing links in the network and identify authors with similar research interests who are far away in the collaboration graph.


6. Conclusion and Future Work

Analyzing and harnessing academic collaboration networks is vital for identifying authors with similar research interests. This study utilizes the Microsoft Academic Graph to locate potential co-authorships by identifying the semantics and factors that drive a successful collaboration among researchers. We extracted topological and semantic features from the network to measure the similarities among the authors and thereby recommend co-authors with similar research interests. We then built a pipeline to better understand the semantics behind co-author links using a network embedding approach. Using weighted meta-paths, we could assign higher weights to meta-paths that improve co-author prediction and give less weight to irrelevant meta-paths. Such a system successfully locates authors with similar research interests, allowing us to recommend collaborations, to foster targeted collaborations under possible geographic constraints, and to target grant applications in a specific area of research. Although the performance of our approach was below state-of-the-art results, it offers a new way of looking at the problem and capturing the similarities among the nodes in a network. As an extension of this study, the weighting mechanism could be implemented using Graph Attention Networks, which can combine the preprocessing and supervised learning within a single model.


References

1. Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., & Alemi, A. A. (2018). Watch your step: Learning node embeddings via graph attention. In Proceedings of NeurIPS (pp. 9197-9207).

2. Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.

3. Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., & Smola, A. J. (2013, May). Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web (pp. 37-48).

4. Aiello, L. M., Barrat, A., Schifanella, R., Cattuto, C., Markines, B., & Menczer, F. (2012). Friendship prediction and homophily in social media. ACM Transactions on the Web (TWEB), 6(2), 1-33.

5. Akcora, C. G., Carminati, B., & Ferrari, E. (2013). User similarities on social networks. Social Network Analysis and Mining, 3(3), 475-495.

6. Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006, April). Link prediction using supervised learning. In SDM06: Workshop on Link Analysis, Counter-terrorism and Security (Vol. 30, pp. 798-805).

7. Asadi, K., & Littman, M. L. (2017, July). An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning (pp. 243-252). PMLR.

8. Mara, A. C., Lijffijt, J., & De Bie, T. (2020, October). Benchmarking network embedding models for link prediction: Are we making progress? In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 138-147). IEEE.

9. Almansoori, W., Gao, S., Jarada, T. N., Elsheikh, A. M., Murshed, A. N., Jida, J., ... & Rokne, J. (2012). Link prediction and classification in social networks and its application in healthcare and systems biology. Network Modeling Analysis in Health Informatics and Bioinformatics, 1(1-2), 27-36.

10. Anderson, A., Huttenlocher, D., Kleinberg, J., & Leskovec, J. (2012, February). Effects of user similarity in social media. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 703-712).

11. Barabási, A. L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3-4), 590-614.

12. Bartal, A., Sasson, E., & Ravid, G. (2009, July). Predicting links in social networks using text mining and SNA. In 2009 International Conference on Advances in Social Network Analysis and Mining (pp. 131-136). IEEE.

13. Belkin, M., & Niyogi, P. (2001, December). Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS (Vol. 14, No. 14, pp. 585-591).

14. Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

15. Bhattacharyya, P., Garg, A., & Wu, S. F. (2011). Analysis of user keyword similarity in online social networks. Social Network Analysis and Mining, 1(3), 143-158.

16. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257-289.

17. Cao, S., Lu, W., & Xu, Q. (2016, February). Deep neural networks for learning graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).

18. Cao, S., Lu, W., & Xu, Q. (2015, October). GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (pp. 891-900).

19. Chen, B., Li, F., Chen, S., Hu, R., & Chen, L. (2017). Link prediction based on non-negative matrix factorization. PloS One, 12(8), e0182968.

20. Chung, F., & Zhao, W. (2010). PageRank and random walks on graphs. In Fete of Combinatorics and Computer Science (pp. 43-62). Springer, Berlin, Heidelberg.

21. Clauset, A., Moore, C., & Newman, M. E. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191), 98-101.

22. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., & Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth Conference of the American Association for Artificial Intelligence (pp. 509-516). Madison, Wisconsin.

23. CSIRO's Data61 (2018). StellarGraph Machine Learning Library. GitHub.

24. Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4), 18-28.

25. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

26. Dong, Y., Chawla, N. V., & Swami, A. (2017, August). metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 135-144).

27. Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. In ICML (Vol. 96, pp. 148-156).

28. Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999, August). Learning probabilistic relational models. In IJCAI (Vol. 99, pp. 1300-1309).

29. Renstrom, A. G., Goldblatt, R. W., Minkus, C., Berube, K. L., & Launius, R. (2002). Wilbur and Orville Wright: A Bibliography Commemorating the One-Hundredth Anniversary of the First Powered Flight, December 17, 1903. Revised.

30. Goldenberg, A., Zheng, A. X., Fienberg, S. E., & Airoldi, E. M. (2010). A survey of statistical network models.

31. Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78-94.

32. Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864).

33. Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

34. Gurukar, S., Vijayan, P., Srinivasan, A., Bajaj, G., Cai, C., Keymanesh, M., ... & Parthasarathy, S. (2019). Network representation learning: Consolidation and renewed bearing. arXiv preprint arXiv:1905.00987.

35. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216.

36. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

37. Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

38. Harada, S., Akita, H., Tsubaki, M., Baba, Y., Takigawa, I., Yamanishi, Y., & Kashima, H. (2018). Dual convolutional neural network for graph of graphs link prediction. arXiv preprint arXiv:1810.02080.

39. Société Vaudoise des Sciences Naturelles. (1864). Bulletin de la Société vaudoise des sciences naturelles (Vol. 7). F. Rouge.

40. Jeh, G., & Widom, J. (2002, July). SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 538-543).

41. Kang, B., Lijffijt, J., & De Bie, T. (2018). Conditional network embeddings. arXiv preprint arXiv:1805.07544.

42. Katz, S., Downs, T. D., Cash, H. R., & Grotz, R. C. (1970). Progress in development of the index of ADL. The Gerontologist, 10(1_Part_1), 20-30.

43. Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

44. McCallum, A., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval, 3(2), 127-163.

45. Menon, A. K., & Elkan, C. (2011, September). Link prediction via matrix factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 437-452). Springer, Berlin, Heidelberg.

46. Needham, M. (2019, March 28). Link Prediction with Neo4j [Blog post]. Retrieved from https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c/

47. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

48. Newman, M. E. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404-409.

49. Newman, M. E. (2001). Clustering and preferential attachment in growing networks. Physical Review E, 64(2), 025102.

50. Fouss, F., Pirotte, A., Renders, J. M., & Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3), 355-369.

51. Ou, M., Cui, P., Pei, J., Zhang, Z., & Zhu, W. (2016, August). Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1105-1114).

52. Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42-55.

53. Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 701-710).

54. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

55. Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1-20.

56. Sachan, M., & Ichise, R. (2010, February). Using abstract information and community alignment information for link prediction. In 2010 Second International Conference on Machine Learning and Computing (pp. 61-65). IEEE.

57. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B. J., & Wang, K. (2015, May). An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web (pp. 243-246).

58. Sun, Y., & Han, J. (2013). Mining heterogeneous information networks: A structural analysis approach. ACM SIGKDD Explorations Newsletter, 14(2), 20-28.

59. Sun, Y., Barber, R., Gupta, M., Aggarwal, C. C., & Han, J. (2011, July). Co-author relationship prediction in heterogeneous bibliographic networks. In 2011 International Conference on Advances in Social Networks Analysis and Mining (pp. 121-128). IEEE.

60. Sun, Y., Han, J., Yan, X., Yu, P. S., & Wu, T. (2011). PathSim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11), 992-1003.

61. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., & Mei, Q. (2015, May). LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web (pp. 1067-1077).

62. Taskar, B., Wong, M. F., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. Advances in Neural Information Processing Systems, 16, 659-666.

63. Tran, P. V. (2018, October). Learning to make predictions on graphs with autoencoders. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 237-245). IEEE.

64. Walker, S. H., & Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54(1-2), 167-179.

65. Wang, C., Satuluri, V., & Parthasarathy, S. (2007, October). Local probabilistic models for link prediction. In Seventh IEEE International Conference on Data Mining (ICDM 2007) (pp. 322-331). IEEE.

66. Wang, D., Cui, P., & Zhu, W. (2016, August). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1225-1234).

67. Yamaguchi, T. (2008). Practical aspects of knowledge management. Yokohama: Springer Science & Business Media.

68. Zhang, Z., Cui, P., Wang, X., Pei, J., Yao, X., & Zhu, W. (2018, July). Arbitrary-order proximity preserved network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2778-2786).

69. Zhou, T., Lü, L., & Zhang, Y. C. (2009). Predicting missing links via local information. The European Physical Journal B, 71(4), 623-630.