Recommending Collaborations Using Link Prediction


RECOMMENDING COLLABORATIONS USING LINK PREDICTION

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

By NIKHIL CHENNUPATI
B. Tech., Gandhi Institute of Technology and Management, India, 2016

2021
Wright State University

WRIGHT STATE UNIVERSITY GRADUATE SCHOOL
April 21, 2021

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Nikhil Chennupati ENTITLED Recommending Collaborations Using Link Prediction BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

Tanvi Banerjee, Ph.D., Thesis Director
Mateen M. Rizki, Ph.D., Chair, Department of Computer Science and Engineering

Committee on Final Examination
Tanvi Banerjee, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
Michael L Raymer, Ph.D.

Barry Milligan, Ph.D., Vice Provost for Academic Affairs, Dean of the Graduate School

ABSTRACT

Chennupati, Nikhil. M.S., Department of Computer Science and Engineering, Wright State University, 2021. Recommending Collaborations Using Link Prediction.

Link prediction in the domain of scientific collaborative networks refers to exploring and determining whether a connection between two entities in an academic network may emerge in the future. This study aims to analyse the relevance of academic collaborations and identify the factors that drive co-author relationships in a heterogeneous bibliographic network. Using topological, semantic, and graph representation learning techniques, we measure the authors' similarities with respect to their structural and publication data to identify the reasons that promote co-authorships. Experimental results show that the proposed approach successfully infers the co-author links by identifying authors with similar research interests. Such a system can be used to recommend potential collaborations among the authors.

Table of Contents

1. Introduction
  1.1. Overview
  1.2. Link Prediction for Recommending Author Collaborations
  1.3. Research Questions and Contributions
  1.4. Thesis Outline
2. Related Work
  2.1. Feature Extraction Based Methods
    2.1.1. Similarity-based Metrics
    2.1.2. Probabilistic and Maximum-Likelihood Models
  2.2. Feature Learning Methods
    2.2.1. Matrix Factorization Methods
    2.2.2. Random Walk Based Methods
    2.2.3. Neural Network-based Methods
3. Methods
  3.1. Feature Extraction Methods
    3.1.1. Feature Extraction Based on Topology
    3.1.2. Feature Extraction Based on Node Attributes (Semantic Similarity)
  3.2. Network Embedding Based Approach for Link Prediction
    3.2.1. Homogeneous Network Embedding
    3.2.2. Heterogeneous Network Embedding
    3.2.3. Weighted Meta-path Biased Random Walks
    3.2.4. Heterogeneous Skip-gram Model
  3.3. Supervised Machine Learning Algorithms
    3.3.1. Logistic Regression
    3.3.2. Support Vector Machines
    3.3.3. Random Forests
    3.3.4. AdaBoost
  3.4. Evaluation Metrics
    3.4.1. Precision
    3.4.2. Recall
    3.4.3. F-measure
    3.4.4. AUC Score
4. Data and Experimental Setup
  4.1. Data
    4.1.1. Microsoft Academic Graph
    4.1.2. Data Collection
    4.1.3. Building a Collaboration Graph
  4.2. Link Prediction Problem
    4.2.1. Case 1: Experiment with Negative Samples as Nodes n-hop Away
    4.2.2. Case 2: Experiment with Randomly Chosen Negative Samples
  4.3. Generating Link Prediction Features
  4.4. Choosing a Binary Classifier
  4.5. Network Embedding Based Approach for Predicting Future Collaborations
    4.5.1. Generating Node Embeddings
    4.5.2. Prediction Pipeline
5. Results and Discussion
  5.1. Feature Extraction Based Approach Results
    5.1.1. Results of Experiments with Negative Samples as Nodes n-hop Away
    5.1.2. Results of Experiments with Randomly Chosen Negative Samples
    5.1.3. Comparing Results of Case-1 and Case-2
  5.2. Network Embedding Based Approach Results
    5.2.1. Author’s Node Embedding Visualizations
    5.2.2. Weighted Meta-path Based Supervised Learning Results
  5.3. Case Study: Relevant Author Search
  5.4. Comparison of Feature Extraction Based and Network Embedding Based Approach
6. Conclusion and Future Work
References

List of Figures

Figure 1. Trending authors in machine learning (adapted from academic.microsoft.com)
Figure 2. Trending topics in all fields (adapted from academic.microsoft.com)
Figure 3. A sample collaboration graph of authors from different institutes
Figure 4. Pipeline of the feature extraction and learning-based approach
Figure 5. Overarching block diagram of weighted meta-path-based network embedding method
Figure 6. Taxonomy of link prediction approaches
Figure 7. Local probabilistic model
Figure 8. Frequency of common authors vs Percentage of collaborations
Figure 9. Weighted meta-path approach using supervised learning
Figure 10. Weighted
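The abstract describes measuring author similarity from network topology in order to infer likely co-author links. As a rough illustration only, the sketch below scores non-adjacent author pairs with two standard neighborhood measures, common neighbors and the Jaccard coefficient. The toy graph, the author labels, and the function names are hypothetical, and this excerpt does not state which topological features the thesis actually uses, so this should not be read as the author's method.

```python
from itertools import combinations

# Hypothetical toy co-authorship graph (not thesis data):
# each author maps to the set of authors they have published with.
coauthors = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}


def common_neighbors(graph, u, v):
    """Number of co-authors that u and v share."""
    return len(graph[u] & graph[v])


def jaccard(graph, u, v):
    """Shared co-authors divided by all distinct co-authors of u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def rank_candidate_links(graph):
    """Score every pair that is not yet connected, best candidates first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
    ]
    scored = [
        (u, v, common_neighbors(graph, u, v), jaccard(graph, u, v))
        for u, v in candidates
    ]
    # Sort by common-neighbor count, then by Jaccard coefficient, descending.
    return sorted(scored, key=lambda row: (row[2], row[3]), reverse=True)


if __name__ == "__main__":
    for u, v, cn, jc in rank_candidate_links(coauthors):
        print(f"{u}-{v}: common neighbors={cn}, jaccard={jc:.2f}")
```

In a supervised setup of the kind the table of contents outlines, scores like these would typically serve as input features to a binary classifier rather than as final predictions.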
Recommended publications
  • Discover the Golden Paths, Unique Sequences and Marvelous Associations out of Your Big Data Using Link Analysis in SAS® Enterprise Miner™
    MWSUG 2016, Paper AA04. Discover the golden paths, unique sequences and marvelous associations out of your big data using Link Analysis in SAS® Enterprise Miner™. Delali Agbenyegah, Alliance Data Systems, Columbus, OH; Candice Zhang, Alliance Data Systems, Columbus, OH. ABSTRACT: The need to extract useful information from large amounts of data to positively influence business decisions is on the rise, especially with the rapid expansion of retail data collection and storage and the advancement in computing capabilities. Many enterprises now have well-established databases to capture omni-channel customer transactional behavior at the product or Stock Keeping Unit (SKU) level. Crafting a robust analytical solution that uses these rich transactional data sources to create customized marketing incentives and product recommendations in a timely fashion, and so meet the expectations of today's sophisticated shopper, can be daunting. Fortunately, the Link Analysis node in SAS® Enterprise Miner™ provides a simple yet powerful analytical tool to extract, analyze, discover and visualize the relationships or associations (links) and sequences between items in a transactional data set, and to develop item-cluster induced segmentation of customers as well as next-best offer recommendations. In this paper, we discuss the basic elements of Link Analysis from a statistical perspective and provide a real-life example that leverages Link Analysis within SAS Enterprise Miner to discover transactional paths, sequences and links. INTRODUCTION: A financial fraud investigator may be interested in exploring the relationship between the financial transactions of suspicious customers; the BNI may want to analyze the social network of individuals identified as terrorists to conduct further investigation; or a medical doctor may be interested in understanding the association between different medical treatments on patients and their corresponding results.
  • Analysis of Social Networks with Missing Data (Draft: Do Not Cite)
    Analysis of social networks with missing data (Draft: do not cite). G. Kossinets, Department of Sociology, Columbia University, New York, NY 10027. (Dated: February 4, 2003) We perform sensitivity analyses to assess the impact of missing data on the structural properties of social networks. The social network is conceived of as being generated by a bipartite graph, in which actors are linked together via multiple interaction contexts or affiliations. We discuss three principal missing data mechanisms: network boundary specification (non-inclusion of actors or affiliations), survey non-response, and censoring by vertex degree (fixed choice design). Based on the simulation results, we propose remedial techniques for some special cases of network data. I. INTRODUCTION: Social network data is often incomplete, which means that some actors or links are missing from the dataset. In a normal social setting, much of the incompleteness arises from the following main sources: the so-called boundary specification problem (BSP); respondent inaccuracy; non-response in network surveys; or may be inadvertently introduced via study design (Table I). Although missing data is abundant in empirical studies, little research has been conducted on the possible effect of missing links or nodes on the measurable properties of networks at large. In particular, a revision of the original work done primarily in the 1970-80s [4, 17, 21] seems necessary in the light of recent advances that brought new classes of networks to the attention of the interdisciplinary research community [1, 3, 30, 37, 40, 41]. Let us start with a few examples from the literature to illustrate different incarnations of missing data in network research.
  • Link Analysis Using SAS Enterprise Miner
    Link Analysis Using SAS® Enterprise Miner™. Ye Liu, Taiyeong Lee, Ruiwen Zhang, and Jared Dean, SAS Institute Inc. ABSTRACT: The newly added Link Analysis node in SAS® Enterprise Miner™ visualizes a network of items or effects by detecting the linkages among items in transactional data or the linkages among levels of different variables in training data or raw data. This node also provides multiple centrality measures and cluster information among items so that you can better understand the linkage structure. In addition to the typical linkage analysis, the node also provides segmentation that is induced by the item clusters, and uses weighted confidence statistics to provide next-best-offer lists for customers. Examples that include real data sets show how to use the SAS Enterprise Miner Link Analysis node. INTRODUCTION: Link analysis is a popular network analysis technique that is used to identify and visualize relationships (links) between different objects. The following questions could be nontrivial: Which websites link to which other ones? What linkage of items can be observed from consumers’ market baskets? How is one movie related to another based on user ratings? How are different petal lengths, width, and color linked by different, but related, species of flowers? How are specific variable levels related to each other? These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques cannot take direct advantage of. In today’s ever-more-connected world, understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need. In SAS Enterprise Miner, the new Link Analysis node can take two kinds of input data: transactional data and non-transactional data (training data or raw data).
  • Evolving Networks and Social Network Analysis Methods And
    DOI: 10.5772/intechopen.79041. Chapter 7: Evolving Networks and Social Network Analysis Methods and Techniques. Mário Cordeiro, Rui P. Sarmento, Pavel Brazdil and João Gama. Additional information is available at the end of the chapter. http://dx.doi.org/10.5772/intechopen.79041 Abstract: Evolving networks by definition are networks that change as a function of time. They are a natural extension of network science since almost all real-world networks evolve over time, either by adding or by removing nodes or links over time: elementary actor-level network measures like network centrality change as a function of time, popularity and influence of individuals grow or fade depending on processes, and events occur in networks during time intervals. Other problems such as network-level statistics computation, link prediction, community detection, and visualization gain additional research importance when applied to dynamic online social networks (OSNs). Due to their temporal dimension, rapid growth of users, velocity of changes in networks, and amount of data that these OSNs generate, effective and efficient methods and techniques for small static networks are now required to scale and deal with the temporal dimension in case of streaming settings. This chapter reviews the state of the art in selected aspects of evolving social networks presenting open research challenges related to OSNs. The challenges suggest that significant further research is required in evolving social networks, i.e., existent methods, techniques, and algorithms must be rethought and designed toward incremental and dynamic versions that allow the efficient analysis of evolving networks. Keywords: evolving networks, social network analysis 1.
  • Graph Theory and Social Networks Spring 2014 Notes
    Graph Theory and Social Networks, Spring 2014 Notes. Kimball Martin. April 30, 2014. Introduction: Graph theory is a branch of discrete mathematics (more specifically, combinatorics) whose origin is generally attributed to Leonhard Euler's solution of the Königsberg bridge problem in 1736. At the time, there were two islands in the river Pregel, and 7 bridges connecting the islands to each other and to each bank of the river. As legend goes, for leisure, people would try to find a path in the city of Königsberg which traversed each of the 7 bridges exactly once (see Figure 1). Euler represented this abstractly as a graph*, and showed by elementary means that no such path exists. Figure 1: The Seven Bridges of Königsberg (Source: Wikimedia Commons). Intuitively, a graph is just a set of objects which are connected in some way. The objects are called vertices or nodes. Pictorially, we usually draw the vertices as circles, and draw a line between two vertices if they are connected or related (in whatever context we have in mind). These lines are called edges or links. Here are a few examples of abstract graphs. This is a graph with 8 vertices connected in a circle. [Figure: vertices 1 to 8 arranged in a circle] This is a graph on 5 vertices, where all pairs of vertices are connected. (*In this course, graph does not mean the graph of a function, as in calculus. It is unfortunate, but these two very basic objects in mathematics have the same name.)
  • Graph Theory and Social Networks - Part I
    Graph Theory and Social Networks - part I. EE599: Social Network Systems. Keith M. Chugg, Fall 2014. © Keith M. Chugg, 2014. Overview: • Summary • Graph definitions and properties • Relationship and interpretation in social networks • Examples. References: • Easley & Kleinberg, Ch 2: focus on relationship to social nets with little math • Barabasi, Ch 2: general networks with some math • Jackson, Ch 2: social network focus with more formal math. Graph Definition: • G = (V, E) • V = set of vertices • E = set of edges. Modeling of networks: • Vertex is a person (or entity) • Edge represents a relationship. [Slide excerpt from Easley & Kleinberg, Chapter 2, Graphs, p. 24] Figure 2.1: Two graphs: (a) an undirected graph on 4 nodes, and (b) a directed graph on 4 nodes. ... will be undirected unless noted otherwise. Graphs as Models of Networks. Graphs are useful because they serve as mathematical models of network structures. With this in mind, it is useful before going further to replace the toy examples in Figure 2.1 with a real example. Figure 2.2 depicts the network structure of the Internet (then called the Arpanet) in December 1970 [214], when it had only 13 sites. Nodes represent computing hosts, and there is an edge joining two nodes in this picture if there is a direct communication link between them. Ignoring the superimposed map of the U.S. (and the circles indicating blown-up regions in Massachusetts and Southern California), the rest of the image is simply a depiction of this 13-node graph using the same dots-and-lines style that we saw in Figure 2.1.
  • On Some Aspects of Link Analysis and Informal Network in Social Network Platform
    Available Online at www.ijcsmc.com. International Journal of Computer Science and Mobile Computing, A Monthly Journal of Computer Science and Information Technology, ISSN 2320-088X. IJCSMC, Vol. 2, Issue 7, July 2013, pg. 371-377. RESEARCH ARTICLE: On Some Aspects of Link Analysis and Informal Network in Social Network Platform. Subrata Paul, Department of Computer Science and Engineering, M.I.T.S. Rayagada, Odisha 765017, INDIA. Abstract: This paper presents a review of two important aspects of social networks, namely Link Analysis and Informal Network. Both of these characteristics play a vital role in analyzing the direction of information flow and the reliability of the information passed or received. They can be easily visualized or studied using the concepts of graph theory. We have deliberately omitted general definitions of social networks and the representation of relationships between actors as a graph with nodes and edges. This paper starts with a formal definition of Informal Network and continues with some of its major aspects. Moreover, its application is explained with a real-life example of a college scenario. In the later part of the paper, some aspects of Informal Network are presented. A similar real-life scenario is also drawn out here to represent the application of informal networks. Lastly, this paper ends with a general conclusion on these two topics. Key Terms: Social Network; Link Analysis; Information Overload; Formal Network; Informal Network. I. INTRODUCTION: Link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes. Relationships may be identified among various types of nodes (objects), including organizations, people and transactions.
  • A Note on the Importance of Collaboration Graphs
    Int. J. of Mathematical Sciences and Applications, Vol. 1, No. 3, September 2011. Copyright Mind Reader Publications, www.journalshub.com. A Note on the Importance of Collaboration Graphs. V. Yegnanarayanan (1) and G. K. Umamaheswari (2). (1) Senior Professor, Department of Mathematics, Velammal Engineering College, Ambattur-Red Hills Road, Chennai - 600 066, India. (2) Research Scholar, Research and Development Centre, Bharathiar University, Coimbatore-641046, India. Abstract: Numerous challenging problems in graph theory have attracted the attention and imagination of researchers from physics, computer science, engineering, biology, social science and mathematics. If we put all these different branches into one basket, what evolves is a new science called “Network Science”. It calls for a solid scientific foundation and vigorous analysis. Graph theory in general, and collaboration graphs in particular, are well suited for this task. In this paper, we give an overview of the importance of collaboration graphs with their interesting background. We also study one particular type of collaboration graph and list a number of open problems. Keywords: collaboration graph, network science, Erdős number. AMS subject classification: 05XX, 68R10. 1. Introduction: In the past decade, graph theory has gone through a remarkable shift and a profound transformation. The change is in large part due to the humongous amount of information that we are confronted with. A main way to sort through massive data sets is to build and examine the network formed by interrelations. For example, Google’s successful web search algorithms are based on the www graph, which contains all web pages as vertices and hyperlinks as edges.
  • Graph Theory and Social Networks Spring 2014 Notes
    Graph Theory and Social Networks, Spring 2014 Notes. Kimball Martin. March 14, 2014. Introduction: Graph theory is a branch of discrete mathematics (more specifically, combinatorics) whose origin is generally attributed to Leonhard Euler’s solution of the Königsberg bridge problem in 1736. At the time, there were two islands in the river Pregel, and 7 bridges connecting the islands to each other and to each bank of the river. As legend goes, for leisure, people would try to find a path in the city of Königsberg which traversed each of the 7 bridges exactly once (see Figure 1). Euler represented this abstractly as a graph*, and showed by elementary means that no such path exists. Figure 1: The Seven Bridges of Königsberg (Source: Wikimedia Commons). Intuitively, a graph is just a set of objects which are connected in some way. The objects are called vertices or nodes. Pictorially, we usually draw the vertices as circles, and draw a line between two vertices if they are connected or related (in whatever context we have in mind). These lines are called edges or links. Here are a few examples of abstract graphs. This is a graph with 8 vertices connected in a circle. [Figure: vertices 1 to 8 arranged in a circle] This is a graph on 5 vertices, where all pairs of vertices are connected. (*In this course, graph does not mean the graph of a function, as in calculus. It is unfortunate, but these two very basic objects in mathematics have the same name.)
  • The Following Pages Contain Scaled Artwork Proofs and Are Intended Primarily for Your Review of figure/Illustration Sizing and Overall Quality
    P1: SBT cuus984-net CUUS984-Easley 0 521 19533 1 November 30, 2009 16:2. The following pages contain scaled artwork proofs and are intended primarily for your review of figure/illustration sizing and overall quality. The bounding box shows the type area dimensions defined by the design specification for your book. The figures are not necessarily placed as they will appear with the text in page proofs. Design specification and page makeup parameters determine actual figure placement relative to callouts on actual page proofs, which you will review later. If you are viewing these art proofs as hard-copy printouts rather than as a PDF file, please note the pages were printed at 600 dpi on a laser printer. You should be able to adequately judge if there are any substantial quality problems with the artwork, but also note that the overall quality of the artwork in the finished bound book will be better because of the much higher resolution of the book printing process. You will be reviewing copyedited manuscript/typescript for your book, so please refrain from marking editorial changes or corrections to the captions on these pages; defer caption corrections until you review copyedited manuscript. The captions here are intended simply to confirm the correct images are present. Please be as specific as possible in your markup or comments to the artwork proofs. [Figure 1.1: network diagram with numbered nodes]
  • Link Analysis
    Link Analysis, from Bing Liu, “Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data”, Springer, and other material. Data and Web Mining - S. Orlando. Contents: • Introduction • Network properties • Social network analysis • Co-citation and bibliographic coupling • PageRank • HITS • Summary. Introduction: • Early search engines mainly compared the content similarity of the query and the indexed pages, i.e., they used information retrieval methods, cosine, TF-IDF, ... • From 1996, it became clear that content similarity alone was no longer sufficient. The number of pages grew rapidly in the mid-to-late 1990s: try the query “Barack Obama”; Google estimates about 140,000,000 relevant pages. How to choose only 30-40 pages and rank them suitably to present to the user? Content similarity is also easily spammed: a page owner can repeat some words (the TF component of ranking) and add many related words to boost the rankings of his pages and/or to make the pages relevant to a large number of queries. Introduction (cont.): • Starting around 1996, researchers began to work on the problem. They resorted to hyperlinks. In Feb. 1997, Yanhong Li (Scotch Plains, NJ) filed a hyperlink-based search patent. The method uses words in the anchor text of hyperlinks. • Web pages, on the other hand, are connected through hyperlinks, which carry important information. Some hyperlinks organize information at the same site. Other hyperlinks point to pages from other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to.
  • Book Review: Graph-Based Natural Language Processing and Information Retrieval
    Book Reviews. Graph-Based Natural Language Processing and Information Retrieval. Rada Mihalcea and Dragomir Radev (University of North Texas and University of Michigan). Cambridge, UK: Cambridge University Press, 2011, viii+192 pp; hardbound, ISBN 978-0-521-89613-9, $65.00. Reviewed by Chris Biemann, Technische Universität Darmstadt. Graphs are ubiquitous. There is hardly any domain in which objects and their relations cannot be intuitively represented as nodes and edges in a graph. Graph theory is a well-studied sub-discipline of mathematics, with a large body of results and a large number of efficient algorithms that operate on graphs. Like many other disciplines, the fields of natural language processing (NLP) and information retrieval (IR) also deal with data that can be represented as a graph. In this light, it is somewhat surprising that only in recent years the applicability of graph-theoretical frameworks to language technology became apparent and increasingly found its way into publications in the field of computational linguistics. Using algorithms that take the overall graph structure of a problem into account, rather than characteristics of single objects or (unstructured) sets of objects, graph-based methods have been shown to improve a wide range of NLP tasks. In a short but comprehensive overview of the field of graph-based methods for NLP and IR, Rada Mihalcea and Dragomir Radev list an extensive number of techniques and examples from a wide range of research papers by a large number of authors.