Link Prediction Using Graph Neural Networks for Master Data Management

Link Prediction using Graph Neural Networks for Master Data Management Balaji Ganesan Srinivasa Parkala Neeraj R Singh Sumit Bhatia IBM Research, India IBM Data and AI IBM Data and AI IBM Research, India [email protected] [email protected] [email protected] [email protected] Gayatri Mishra Matheen Ahmed Pasha Hima Patel Somashekhar Naganna IBM Data and AI IBM Data and AI IBM Research, India IBM Data and AI [email protected] [email protected] [email protected] [email protected] Abstract—Learning graph representations of n-ary relational data has a number of real world applications like anti-money laundering, fraud detection, and customer due diligence. Contact tracing of COVID19 positive persons could also be posed as a Link Prediction problem. Predicting links between people using Graph Neural Networks requires careful ethical and privacy considerations than in domains where GNNs have typically been applied so far. We introduce novel methods for anonymizing data, model training, explainability and verification for Link Prediction in Master Data Management, and discuss our results. Index Terms—link prediction, master data management, graph neural networks, explainability, graphsheets Fig. 1: Contact tracing of COVID19 positive persons could be I. INTRODUCTION posed as a watchlist link prediction problem in MDM Relational Data consists of tuples where each tuple is a set of attribute/value pairs. A set of tuples that all share the same attributes is a relation. Such relational data can be presented terms constitute the attributes of the nodes. Links between in a table, json arrays among other forms. people or organizations in the real world constitute the edges In Master Data Management (MDM), one or more tuples of the graph. The type of people to people relations form the in relational data can be resolved to an entity. Typically, an link type and details of the relationship like duration constitute entity in this setting is a person or an organization, but there the edge properties. can be other types too. These entities may share explicit and Link Prediction on people graphs presents a unique set of implicit relations between them. A common way to represent challenges in terms of model performance, data availability, these entities and relationships is a graph, where each entity fairness, privacy, and data protection. Further, Link Prediction is a node, and the relationships are links (edges) between the among people has a number of societal implications irrespec- nodes. tive of the use-cases. While link prediction of financial fraud Master Data Management includes tasks like Entity Res- detection, law enforcement, and advertisement go through olution, Entity Matching, Non-Obvious Relation Extraction. ethical scrutiny, the recent use of social networks for political arXiv:2003.04732v2 [cs.SI] 28 Aug 2020 While the enterprise customer data is predominantly stored as targeting and job opportunities also require ethical awareness relational data, graph stores are widely used for visualization on the part of the model developers and system designers. and analytics. Watchlist is a use-case typical in Master Data Management. Link Prediction is the task of finding missing links in a Given a set of nodes in a graph, the task is to find links to graph. These links could be typed or untyped. Given a graph Watchlist nodes from other nodes of the graph. For example, with several nodes and links, a model can be trained to learn we can assume people who have tested positive for COVID19 embeddings of the nodes and links, and predict missing links as people on a Watchlist. We may want to find people who may in the graph. In recent years, Graph Neural Networks (GNN) have potentially come in contact with them, commonly known are being used for link prediction and node classification tasks. as contact tracing. This can however be a time consuming and Graphs in MDM could be considered as Property Graphs potentially controversial process that impacts privacy. which differ from Knowledge Graphs and Social Networks. In In this work we discuss link prediction on property graphs particular, MDM is focused on entities like people, organiza- involving people, their specific dataset requirements, ethical tion, and location, which constitute nodes in the graph. Other considerations, and practical insights in designing the infras- entities like numerical ids, demographic information, business tructure for industrial scale deployment. Fig. 2: Predicting Links to Watch-List Nodes. A list of COVID19 persons can be uploaded as Watch-List and their links to people in a master data could be predicted. Note: We need informed consent of people for this contact tracing. The main contributions of this work are: introduced the idea of inductive representation of nodes in a • We discuss the dataset requirements and ethical con- graph. [3] added anchor nodes to improve the representation siderations to train neural models for many real-world of nodes in the graph. applications in Master Data Management. [7] introduced many aspects of entity resolution using a • We present our results on training Graph Neural Networks probabilistic matching engine. DeepMatcher [8] presents a on property graphs typical of MDM workloads. neural model for Entity Matching. [9] presented an end-to- • We present 3 easily understandable explainability solu- end use case for Entity Matching. [10] described an integrated tions in addition to interpretability, to explain the GNN neural model to find non-obvious relations. model predictions. The Protein-Protein Interaction (PPI) dataset introduced by • We present GraphSheets, a method to help increase [6] consists of 24 human tissues and hence has 24 subgraphs accountability in the development and deployment of our of roughly 2400 nodes each and their edges. Having similar solution. subgraphs helps to average the performance of the model II. RELATED WORK across subgraphs. Open Graph Benchmark [11] initiative by SNAP Group at [1] showed that a Relational Graph Convolutional Network Stanford is trying to come up with large benchmark datasets can outperform direct optimization of the factorization (ex: for research in Graph Neural Models. DistMult). They used an autoencoder model consisting of an encoder an R-GCN producing latent feature representations of Much of the recent work on explanations are based on entities and a decoder a tensor factorization model exploiting post-hoc models that try to approximate the prediction of these representations to predict labeled edges. complex models using interpretable models. [12] present post- [2] introduced GraphSAGE (SAmple and AggreGatE) an hoc explanations of the links predicted by a Graph Neural inductive framework that leverages node feature informa- Network by treating it as a classification problem. They present tion (ex: text attributes, node degrees) to efficiently generate explanations using LIME [13] and SHAP [14]. [15] introduced node embeddings for previously unseen data or entirely new the AIX360 toolkit which has a number of explainability (sub)graphs. In this inductive framework, we learn a func- solutions that can be used for post-hoc explanation of graph tion that generates embeddings by sampling and aggregating models, if they can posed as approximated as interpretable features from a nodes local neighborhood. [3] proposed a models. Position Aware Graph Neural Network that significantly im- Over the years, Attention has been understood to provide an proves performance on the Link Prediction task over the Graph important way to explain the workings of neural models, par- Convolutional Networks. ticularly in the field of NLP. [16] challenged this understanding [4] introduced the WikiPeople dataset based on Wikidata. by showing that the assumptions for accepting Attention as However, Wikidata does not have contact details and can explanation do not hold. [17] however argued that Attention be incomplete like the DBPedia (UDBMS) dataset. [5] also also contributions to explainability. introduced a dataset based on Wikidata. Again, for afore More recently, [18] introduced Neural Additive Models mentioned reasons, its not different from DBPedia dataset. which learn a model for each feature to increase the inter- [6] introduced the Protein-Protein Interaction dataset which pretability. In this work, we focus on solutions that are more has been used in a number of recent works in graph neural suited to Graph Neural Models and in particular, explaining networks. [cite] presented Graph Convolutional networks. [2] Link Prediction models. III. ETHICAL CONSIDERATIONS IN MDM LINK Batch and Online Link Prediction PREDICTION The Watch-List Links Prediction is a typical use-case in Predicting links between people and organizations in the Master Data Management as shown in Figure 2. When the Master Data of any company, requires lot more care than master data is initially created or updated in batch, there is a innocuous use-cases like predicting links in protein-protein- need for batch mode link prediction. Later when the model is interactions. In this section we describe a number of properties deployed, and a new node or an update to nodes and edges is and ethical considerations that are specific to Link Prediction performed, there is a need for online link prediction. Consid- in Master Data Management. ering an enterprise property graph is likely to be updated quite frequently, the model needs to be frequently re-trained which is Explainability expensive or we have to accept the drift in model performance. Explainability methods in Graph Neural Networks tend to The inductive approach to learning the node representations follow similar methods in text and images, namely identifying in GNNs helps to make the models more robust to unseen features that are most significant for the predictions. However, data, but the inductive approach will not be sufficient if the in enterprise applications, there is a need for explanations distribution of the unseen nodes is different from the graph for non-technical users. Path-based explanations that help to created in batch mode.

Link Prediction Using Graph Neural Networks for Master Data Management

Biased Edge Dropout for Enhancing Fairness in Graph Representation

Combining Collective Classification and Link Prediction

Finding Experts by Link Prediction in Co-Authorship Networks

Learning to Make Predictions on Graphs with Autoencoders

Predicting Disease Related Microrna Based on Similarity and Topology

Representation Learning on Graphs

Khanamk2021m-1A.Pdf (2.656Mb)

Neural Variational Inference for Embedding Knowledge Graphs

A Scalable and Distributed Actor-Based Version of the Node2vec Algorithm

Application of Machine Learning to Link Prediction

Neural Graph Representations and Their Application to Link Prediction

Link Prediction in Large-Scale Complex Networks (Application to Bibliographical Networks)