Scale-Free, Attributed and Class-Assortative Graph Generation to Facilitate Introspection of Graph Neural Networks

Neil Shah
Snap Inc.
[email protected]

ABSTRACT

Semi-supervised node classification on graphs is a complex interplay between graph structure, node features and class-assortative (homophilic) properties, and the flexibility of a model to capture these nuances. Modern datasets used to push the frontier for such tasks exhibit diverse properties across these aspects, making it challenging to study how these properties individually and jointly influence the performance of modern methods like graph neural networks (GNNs). In this work, we propose an intuitive and flexible scale-free graph generation model, CaBaM, which enables simulation of class-assortative and attributed graphs via the well-known Barabasi-Albert model. We show empirically and theoretically how our model can easily describe a variety of graph types, while imbuing the generated graphs with the necessary ingredients for attribute-, topology-, and label-aware semi-supervised node classification. We hope our work illustrates the need for graph generation and provides a stepping stone compensating for the lack of manipulability offered in common public graph dataset benchmarks. We also hope this inspires future work towards (a) more principled evaluation and study of GNNs, specifically their sensitivity to varying class-assortativity and attribute distributions, and (b) development of GNN architectures which facilitate graph context-awareness in line with these properties.

KEYWORDS

graph generation, graph neural networks, assortativity, network embedding

ACM Reference Format:
Neil Shah. 2020. Scale-Free, Attributed and Class-Assortative Graph Generation to Facilitate Introspection of Graph Neural Networks. In MLG '20, August 24, 2020, San Diego, CA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Semi-supervised learning on graphs (SSL) is a well-known task which has gained renewed interest in recent years with the advances of neural node embedding methods, particularly graph neural networks (GNNs) [14, 15, 22, 33, 39, 40, 45]. In modern instances of such tasks, one is typically given a graph G(V, E) (with adjacency A), node features H and labels y (potentially many of which are undefined). The task is to correctly infer the status of nodes which have undefined labels. Modern graph-based machine learning methods for this problem generally involve learning an embedding function f : V → R^d which maps each node into a high-dimensional space, where it can be subsequently classified. Most advances on this task in recent years have arisen from various novelties in parameterizing and learning f. While understanding of the importance of architectural, loss-based and situational choices for f has improved substantially in recent years [27, 43], there is comparatively little work on understanding the importance of G (A), H and y to the performance of various methods.

This is in large part due to convention and limitations in existing benchmark datasets for evaluating performance on this task. The graph-based machine learning community commonly utilizes several datasets to demonstrate outperformance of a model. These benchmark datasets include citation networks (Cora, Citeseer [22]), protein-protein interactions (PPI [14]), social networks (Flickr, BlogCatalog [18]), air traffic (Air-USA [42]) and more. Recently, [17] curated and released several additional benchmark datasets to improve representation of other domains and standardize method comparisons. Nonetheless, these benchmark datasets have non-homogeneous properties that are not well-characterized or typically considered, making performance analysis across different methods and graph types challenging. While the graphs may have similar structure in a skewed, power-law topology sense, they may have (a) very different attribute distributions across nodes (conditional on class), and (b) varying assortative (homophilic) tendencies between nodes of the same class. Both of these could influence the inherent difficulty of the learning task, and limit or facilitate different models.

To facilitate this analysis, we turn to graph generation models, a staple in the network science community [4, 9, 23, 25, 34, 36, 37, 41]. Graph generation models aim to simulate graphs which match (in a statistical sense) various observed processes. For example, [4, 25] aim to model the scale-freeness and power-law degree distributions evident in many social networks, [36] aims to preserve local and global topological motifs, and [9] produces random graphs with interesting mathematical properties. Unfortunately, most of these models only generate topology and not attributes, with the exception of [34, 37], which are used in the context of mimicking a given graph, rather than flexibly manipulating the generation process to handle different graph settings. None of the above models explicitly facilitates analysis of the previously mentioned aspects.

In lieu of these, we propose CaBaM. Our work builds on the Barabasi-Albert (BA) [4] generation model for generating scale-free networks via preferential attachment. Since the BA model only produces G (A), we extend it in two key ways which facilitate investigation of SSL methods: by allowing nodes to flexibly (a) belong to classes, and be associated with an associated (arbitrary) attribute distribution, and (b) vary their class-assortativity (for example, based on degree), thereby enabling flexible designation of H and y choices. Moreover, we show empirically and theoretically both that our extensions preserve the natural degree distribution of the original BA model [4], and that they have derivable class-assortativity dynamics in terms of the expected number of intra-class and inter-class edges in the generated graph.
We hope our model helps facilitate analysis of the relative strengths and weaknesses of graph-based machine learning methods, and of the impact of graph structure, attribute distribution and assortativity on performance, particularly across GNN models which have become prominent in recent years. Moreover, we hope that graph generation via our model appeals to practitioners and researchers who work on the development of GNNs, to evaluate their models on well-specified and, importantly, tunable benchmark datasets. We make our code and model publicly available at https://github.com/nshah171/cabam-graph-generation.

2 RELATED WORK

We discuss related work in two settings: graph-based semi-supervised learning, and graph generation models.

Graph-based semi-supervised learning (SSL). Graph-based SSL has a rich history, stemming from early methods such as label spreading [48], label propagation [49], belief propagation [46], and various random-walk and guilt-by-association approaches [24], many of which utilize only graph structure and no attribute information. It has many applications, including stereo matching in vision [38], top-N recommendation [13], fraud and misinformation detection [1, 12, 32], general node classification tasks [45] and more. In recent years, node representation learning techniques [11, 33, 39, 45] and more recently convolutional graph neural networks (GNNs) [14, 22, 40, 43] have dominated the landscape, achieving remarkable performance by proposing neural architectures for the SSL task. Unlike our work-in-progress, most recent work in this space focuses on model architecture improvements, rather than on building an understanding of model sensitivity to data. However, some works touch on relevant spaces: [35, 43] give careful analysis of GNN expressivity given the aggregation operator. [2, 3] aim to incorporate flexible attention to varying k-hop graph contexts during representation learning. [42] explicitly incorporates node degree information into embeddings. [47] bootstraps node classification with self-supervised edge prediction, inherently leveraging structural predictability in the graph. However, none of these works investigate GNN performance contextually across various class, attribute and assortativity settings.

Graph Generation Models. Numerous graph generation models have been proposed to capture, generate, and simulate network structure. [9] introduce the Erdos-Renyi model, which generates graphs with independently drawn edges. [23] proposes a block, two-level extension which can be tuned to capture real-world degree distributions and clustering coefficients. Several other works [19, 31] extend Erdos-Renyi-inspired models to non-binary cases and multi-view graph settings. [25] proposes a model to generate edge structure via self-similarity induced by Kronecker products, and to infer seed matrices given an input graph. [5, 41] enable production of various graphs with prescribed degree sequences which meet certain structural properties, like the existence of a hub or connected components. [36] proposes an approach to generate graphs using concepts from context-free grammars. All the above methods are applicable only to graph data without attributes.
[4] proposes a preferential-attachment model, by which nodes joining a graph attach to existing nodes of higher degree with higher probability. [21, 26, 34] discuss models which enable inference and mimicking of connectivity given attributes from an input graph, but not flexible simulation of a new graph. The graph generation process discussed in our work differs from these by focusing on flexible simulation of graphs with class-imbued nodes and attributes, without an input graph to mimic.

Several works also tackle producing assortativity in generated graphs, but mainly in the context of joint degree distributions. [30] discusses this concept: namely, that nodes tend to connect to others with similar degrees. [44] produces degree assortativity via an edge rewiring process from nodes in an existing graph. [28] uses accept-reject sampling to only keep edges from the model of [41] which satisfy a binned joint degree distribution. [7] modifies the Barabasi-Albert (BA) model [4] for assortative mixing, but for degree-sequence assortativity (nodes connect to others with similar degree) and with a different motive. Closest to our work is [20], which studies an extension to the BA model for class-assortativity-aware preferential attachment (nodes tend to connect to other nodes of the same class), but differs from ours in both (a) model setup: their model does not discuss varying specifications of degree-dependent assortativity within a network, graphs with more than 2 classes, nor node attribute-aware settings, and (b) motivation: their work exposes the impact of class-assortativity on degree rankings between nodes of two classes, without any consideration of generation for evaluation of graph neural networks. [3] leverages [20]'s generation process to examine the performance of their own model under various amounts of assortativity.
In the same vein, our work here aims to provide a standard and general setup for flexible graph generation with varying numbers of classes, attributes and assortativity properties, to facilitate introspection and analysis of GNNs and other graph-based SSL models. To provide further context and motivate these needs, we additionally illustrate huge differences in properties of benchmark datasets used in such tasks, which may be of independent interest to practitioners who design and evaluate such models.

2.1 The case for graph generation

Graphs used for SSL benchmarks share some topological similarities, but have stark differences in their inherent properties. This makes it challenging to compare and reason across SSL-based methods, which implicitly rely on assumptions about feature distributions and assortativity. For example, GNNs typically make node-level label predictions by convolving features from neighboring nodes and thereby smoothing decision boundaries [14]. Clearly, these predictions may be impacted by the characteristics of the node's feature distribution (impacting variance in the convolution result), whether the node's degree is low or high (impacting the size of the receptive field), and how diverse or homogeneous the node's neighborhood is (impacting the smoothness of resulting boundaries).
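For reference, the layer-wise propagation rule of the GCN [22] makes this convolution intuition concrete (the restatement here is ours, using the standard notation of that work, with adjacency A, added self-loops via the identity I, and degree matrix D̃):

$$\mathbf{H}^{(l+1)} = \sigma\left(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{H}^{(l)}\,\mathbf{W}^{(l)}\right), \qquad \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}, \quad \tilde{\mathbf{D}}_{ii} = \sum_j \tilde{\mathbf{A}}_{ij}.$$

Each node's new representation is a degree-normalized average over its (self-inclusive) neighborhood; when class assortativity is low, this average pools features across classes and directly blurs decision boundaries.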

                        Cora     CoraML   Citeseer  Pubmed   DBLP     Air-USA  Flickr   BlogCatalog
# Nodes                 19,793   2,995    4,230     19,717   17,716   1,190    7,575    5,196
# Edges                 146,635  19,311   14,904    108,365  123,450  28,388   487,051  348,682
# Features              8,710    2,879    602       500      1,639    238      12,047   8,189
# Classes               70       7        6         3        4        4        9        6
Feature-degree assoc.   .628     .594     .524      .668     .571     1.0      .619     .666
Class assortativity     .670     .845     .972      .863     .871     .722     .262     .419
Cross-class MMD         .233     .102     .151      .156     .070     .491     .023     .139
GCN [22] accuracy       .404     .834     .876      .841     .834     .510     .197     .504

Table 1: Characteristic differences in properties of commonly used benchmark datasets.

Figure 1: Class-assortativity varies substantially across commonly used graph datasets, and further varies even within datasets for nodes with different degrees. Boxplots show class-assortativity (y-axis) computed for nodes which are logarithmically binned by degree (x-axis).

Studying these properties and how they influence various methods can hopefully lead to further research into, and new insights about, method design, situational relevance, and an increased emphasis on the graph in the context of model architecture.

To illustrate diverse properties across existing datasets, we compute a few properties over a small benchmark suite: we use citation networks (Cora, CoraML, Citeseer, Pubmed, DBLP), airport traffic (Air-USA), and social networks (Flickr, BlogCatalog). Table 1 details, in addition to summary statistics, three properties which capture some notable differences across the datasets relating to the aforementioned properties.

Feature-degree association. We take a simple Random Forest classifier and try to predict whether a node's degree is higher or lower than the graph's average degree using the input features. We use 5-fold cross-validation with stratification, reporting the AUC (area under the ROC curve). High values imply that features (and associated classes) are correlated with the graph topology, effectively creating distinct convolution dynamics for nodes of one class versus another.

Class assortativity. We count the total number of intra-class (within one class) and inter-class (between different classes) edges in the graph and report the intra-class edge fraction

$$r_{e_w} = \frac{e_w}{e_w + e_b}$$

where e_w, e_b denote intra- and inter-class edge counts respectively. This is effectively a measure of homophily. High values indicate that most edges in the graph are between nodes of the same class, with implications for the smoothness of decision boundaries. Low values indicate that convolution can easily blur decision boundaries between classes.

Between-class MMD. We report a (simplified) maximum pairwise maximum mean discrepancy (MMD) statistic [10] across classes. We calculate the simplified cross-class MMD as

$$\max_{\forall x,y \in \mathcal{C}} \; \sup\left(\mathbb{E}[x] - \mathbb{E}[y]\right)$$

where E[x] (wlog) refers to the empirical expected value of (unit-normalized) samples from H in class x, and x, y are classes in the set C. This is bounded on [0, 1]. High values generally indicate that features from different classes are more easily separable (i.e. that classes x and y have very different feature distributions), whereas low values indicate less distinction.
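As a reference for how these three properties can be computed, the following is a minimal sketch in Python (our own illustration assuming numpy, scikit-learn, a scipy-sparse adjacency A, feature matrix X and label vector labels; function names are ours, not from the released repository):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def feature_degree_assoc(X, A):
    """AUC of predicting above-average degree from node features X."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    target = (deg > deg.mean()).astype(int)
    clf = RandomForestClassifier(n_estimators=100)
    # 5-fold CV (stratified by default for classifiers), scored by ROC AUC.
    return cross_val_score(clf, X, target, cv=5, scoring='roc_auc').mean()

def class_assortativity(A, labels):
    """Intra-class edge fraction r_ew = e_w / (e_w + e_b)."""
    src, dst = A.nonzero()
    return (labels[src] == labels[dst]).sum() / len(src)

def cross_class_mmd(X, labels):
    """Simplified pairwise MMD: largest coordinate-wise gap between
    class means of unit-normalized features, maximized over class pairs."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    means = {c: Xn[labels == c].mean(axis=0) for c in np.unique(labels)}
    classes = list(means)
    return max((means[x] - means[y]).max()
               for x in classes for y in classes if x != y)
```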

Figure 2: Degree distributions for 8 common benchmark datasets. Despite feature and class-assortativity-related differences, most common benchmark datasets have skewed, power-law-esque degree distributions, making this property desirable for a graph generator to simulate.

As is evident from Table 1, the properties vary considerably across datasets. Notably, features are perfectly predictive of above-average degree in Air-USA (but much less so in other datasets). Citeseer has a markedly high class-assortativity, with 97.2% of the edges in the dataset being intra-class ones, while Flickr only has 26.2% such edges, suggesting huge differences in assortativity between the two datasets.

Figure 1 further shows that assortativity can vary significantly even within a graph for nodes with different degrees. MMD differences suggest that features across classes from Flickr and DBLP are less discriminative, whereas they are much more so for Air-USA and Cora. Despite all these variations, Figure 2 shows some marked topological similarities across datasets. Namely, all of them have skewed degree distributions, resembling power-law-esque or lognormal distributions. As numerous works [6, 8, 29] discuss, such degree distributions are commonly observed in empirical, and especially social, datasets, where preferential attachment-like phenomena are common and rich-get-richer effects prevail. In Table 1, we further show standard 2-layer GCN [22] model performance in terms of accuracy (trained for 200 epochs with a 10/10/80 train/validation/test split, 0.2 dropout, 5e-4 weight decay, 128 hidden units and the Adam optimizer), simply to illustrate the marked differences in performance of one such GNN model on these datasets. One can observe that the performances are not trivially correlated with the above-mentioned metrics in Table 1; it is exactly this opacity we hope graph generation can help untangle.

We note that our work here does not aim to address all of the above points and reconcile differences across datasets, but rather aims to expose the significant variations across these datasets which are typically ignored, encourage the reader to consider how these variations may influence downstream SSL model performance, and consider how they may be studied. We propose graph generation as a tool to enable this systematic study.

3 GRAPH GENERATION WITH CABAM

Our expository analysis in the previous section shows that while most realistic graph datasets used for SSL tasks are topologically similar (at least in the degree distribution context), they vary considerably in their attribute/feature distributions and assortativity properties. Thus, we consider producing a graph generation model which is not only (a) able to produce graphs with power-law degree distributions, but also (b) flexible in class and feature distributions for nodes, and (c) allows variation in class-assortativity. We focus on the Barabasi-Albert (BA) model, which is simple, analytically shown to produce power-law degree distributions, and appealing due to its explicit time-aware and preferential generative process.

For context, the BA model [4] proceeds as follows:

(1) Graph initialization. Initialize a graph G(V, E) with m nodes and no edges.

(2) Growth. At each timestep t, add one node v with m edges to existing nodes. The probability of connecting to node w is given by

$$P_{v \to w} = \frac{k_w}{\sum_{y \in \mathcal{V}} k_y} \qquad (1)$$

where k_w indicates w's degree (wlog).

Via this preferential attachment form, the BA model has been shown to have a degree distribution P(k) ∝ k^{-3}, and considerable past analysis has been done on the model [4, 16], making it appealing for use. However, the original BA model does not account for any node classes or attributes, nor their assortativity.
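For concreteness, a minimal sketch of this growth process (our own illustrative code; the seed-node bootstrapping choice is ours, since Eqn. 1 is undefined while the seed nodes have no edges):

```python
import numpy as np

def ba_graph(n, m, rng=np.random.default_rng(0)):
    """Grow a BA graph to n nodes; each new node attaches m edges (Eqn. 1)."""
    edges = []
    deg = np.zeros(n)
    # Bootstrapping choice (ours): the model starts with m nodes and no
    # edges, leaving Eqn. 1 undefined at t = 0, so seeds get unit weight.
    deg[:m] = 1.0
    for v in range(m, n):
        p = deg[:v] / deg[:v].sum()        # attach proportionally to degree
        for w in rng.choice(v, size=m, replace=False, p=p):
            edges.append((v, w))
            deg[w] += 1
        deg[v] = m
    return edges
```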
Given these shortcomings, we propose our extended model, CaBaM. Our model has the following generative process:

(1) Definitions. Define a multinomial class distribution M over the |C| classes. For each class c ∈ C, define a class-specific distribution D_c, such that a node v belonging to class c will have its features h_v ∼ D_c. Lastly, define p_c ∈ [0, 1] as a within-class assortativity factor.

(2) Graph initialization. Initialize a graph with m nodes and no edges. For each node, draw class and attributes as per (1).

(3) Growth. At each timestep t, add one node v. Assign it a class c and draw its attributes as per (1). Attach it to m existing nodes, with the probability of connecting to a node w given by

$$P_{v \to w} = \frac{k_w \cdot \left(q \cdot p_c + (1-q) \cdot (1-p_c)\right)}{\sum_{y \in \mathcal{V}} k_y \cdot \left(q \cdot p_c + (1-q) \cdot (1-p_c)\right)} \qquad (2)$$

where q = 1(c_w = c_v) is an indicator reflecting whether the two nodes belong to the same class.
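A sketch extending the BA code above to CaBaM's full process (again our own illustration with hypothetical names such as cabam_graph; not the released implementation): classes are drawn from M, features from D_c, and Eqn. 2 reweights each candidate's degree by p_c or 1 - p_c depending on class agreement.

```python
import numpy as np

def cabam_graph(n, m, class_probs, feat_dists, p_c,
                rng=np.random.default_rng(0)):
    """Grow a CaBaM graph to n nodes (Eqn. 2).

    class_probs: class distribution M (probabilities summing to 1).
    feat_dists:  feat_dists[c]() draws features h_v ~ D_c for class c.
    p_c:         assortativity factor; a scalar, or a function of degree k.
    """
    p_fn = p_c if callable(p_c) else (lambda k: p_c)
    labels = list(rng.choice(len(class_probs), size=m, p=class_probs))
    feats = [feat_dists[c]() for c in labels]
    deg = np.zeros(n)
    deg[:m] = 1.0                  # seed bootstrapping, as in the BA sketch
    edges = []
    for v in range(m, n):
        c_v = rng.choice(len(class_probs), p=class_probs)
        labels.append(c_v)
        feats.append(feat_dists[c_v]())
        # Eqn. 2: reweight each candidate's degree by p_c if it shares
        # v's class (q = 1), and by 1 - p_c otherwise (q = 0).
        q = np.array(labels[:v]) == c_v
        pc = np.array([p_fn(k) for k in deg[:v]])
        w = deg[:v] * np.where(q, pc, 1.0 - pc)
        for t in rng.choice(v, size=m, replace=False, p=w / w.sum()):
            edges.append((v, t))
            deg[t] += 1
        deg[v] = m
    return edges, np.array(labels), np.array(feats)
```

Note that a constant p_c = 0.5 weights every candidate by 1/2 regardless of class, so the process reduces exactly to the BA sketch above.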

While simple, this model admits a number of desirable properties. Firstly, it generates scale-free graphs with power-law degree distributions. Secondly, it allows flexible specification of class-conditional node attributes, with the choice of D_c left open. Finally, it allows flexible specification of class-assortativity via p_c. In fact, in some cases, p_c need not be a constant, but can be a function defined on each node (for example, based on its degree, to emulate effects like those in Figure 1) while still retaining all these properties.

Our model admits the following theoretical analysis:

Theorem 3.1 (Degree distribution). As t → ∞, a graph G generated by CaBaM has a power-law degree distribution with P(k) ∝ k^{-3} and minimum degree m, given by

$$P(k) = \frac{2 \cdot m \cdot (m+1)}{k \cdot (k+1) \cdot (k+2)} \qquad (3)$$

if either (a) p_c is a constant in [0, 1], or (b) E[q] = 1/2.

Proof. Let g = E[q], the probability of a new node v and an existing node w being of the same class. Then, the change in the degree of a node with degree k_w can be written as

$$\frac{dk_w}{dt} = m \cdot \frac{k_w \cdot \left(g \cdot p_c + (1-g) \cdot (1-p_c)\right)}{\sum_{y \in \mathcal{V}} k_y \cdot \left(g \cdot p_c + (1-g) \cdot (1-p_c)\right)}$$

by taking the expectation over the indicator, and multiplying the quantity by m given the m edges added per timestep. Notice that g only depends on the native class distribution and thus does not depend on y ∈ V. If p_c is a constant, then g · p_c + (1-g) · (1-p_c) can be moved outside the sum and cancels between numerator and denominator, so that dk_w/dt takes on the form in Eqn. 1. If p_c is not a constant but a function (for example, depending on the node y ∈ V), then the quantity cannot be moved outside the sum. However, if g = 1/2, then the quantity reduces to the constant 1/2 even inside the sum, which likewise cancels, again yielding the form in Eqn. 1. Next, notice that $\sum_{y \in \mathcal{V}} k_y = 2mt$, replacing the degree sum with the (doubled) total edge count. Thus, we have

$$\frac{dk_w}{dt} = \frac{k_w}{2t}, \quad \text{so} \quad k_w \propto t^{1/2}$$

integrating both sides in the last step. Thus, a node's degree grows proportionally to the square root of time. This leads us precisely to the degree dynamics of the BA model (see [4]), which derives the exact degree distribution (and exponent) as above from this point. Also note that a node must have degree ≥ m given the growth step. □

Our results show that if a non-constant p_c is desired, then g = 1/2 is mandated. This functionally constrains the choice of M. If |C| = 2, then M = Multinomial([1/2, 1/2]) suffices. If |C| = 3, then M = Multinomial([2/3, 1/6, 1/6]) suffices. The choice of parameters given |C| can generally be derived by equating the algebraic expressions for choosing two intra-class nodes and two inter-class nodes, and constraining all class probabilities to sum to 1. Formally, the parametric solutions for class probabilities P_1 ... P_k are given by

$$\sum_{i=1}^{k} P_i = 1 \quad \text{and} \quad \sum_{i=1}^{k} P_i^2 = 1/2$$

Conversely, if a constant p_c is used, then any well-defined choice of parameters (P_1 ... P_k) which sums to 1 for M suffices.
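As a quick check of this constraint (our own worked example): since each node's class is drawn independently from M, we have $g = \mathbb{E}[q] = \sum_i P_i^2$, and the suggested 3-class solution indeed satisfies both conditions:

$$g = \left(\tfrac{2}{3}\right)^2 + \left(\tfrac{1}{6}\right)^2 + \left(\tfrac{1}{6}\right)^2 = \tfrac{16}{36} + \tfrac{1}{36} + \tfrac{1}{36} = \tfrac{18}{36} = \tfrac{1}{2}, \qquad \tfrac{2}{3} + \tfrac{1}{6} + \tfrac{1}{6} = 1.$$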
Theorem 3.2 (Class assortativity). As t → ∞, the intra-class edge fraction r_{e_w} of a graph G generated by CaBaM is

$$r_{e_w} = \frac{p_{e_w}}{p_{e_w} + p_{e_b}} \qquad (4)$$

where

$$p_{e_w} = \sum_{k=m}^{\infty} \frac{g \cdot p_c \cdot (m+1)}{(k+1) \cdot (k+2)} \qquad (5)$$

$$p_{e_b} = \sum_{k=m}^{\infty} \frac{(1-g) \cdot (1-p_c) \cdot (m+1)}{(k+1) \cdot (k+2)} \qquad (6)$$

if either (a) p_c is a constant in [0, 1], or (b) E[q] = 1/2.

Proof. Our derivation relies on the closed-form exact degree distribution from Thm. 3.1, thus inheriting its assumptions. The probability of a link from a newly introduced node v connecting to a degree-k node is given by

$$\frac{m+1}{(k+1) \cdot (k+2)} \qquad (7)$$

since the probability of v linking to a particular node with degree k is $\frac{k}{2mt}$, and we multiply this by the number of nodes with degree k, or t · P(k) (see Eqn. 3), since there are t nodes in the graph at time t. The probability of v having the same class as the recipient node is g = E[q], and the assortativity multiplier probability for that link is given by p_c, accounting for the two factors in the numerator of Eqn. 5. Likewise, the probability of v having a different class from the recipient node is 1 - g, and the dissortativity multiplier probability is 1 - p_c, accounting for the two factors in the numerator of Eqn. 6. Summing over k = m ... ∞, p_{e_w} and p_{e_b} denote the (unnormalized) probabilities of v making an intra-class (inter-class) edge, over nodes of all degrees. Indeed,

$$\sum_{k=m}^{\infty} \frac{m+1}{(k+1) \cdot (k+2)} = 1$$

Note that if p_c is a constant, we can further reduce the (full) form of Eqn. 4 to

$$r_{e_w} = \frac{g \cdot p_c}{(g \cdot p_c) + (1-g) \cdot (1-p_c)}$$

but this cannot be said when p_c depends on the node. Eqn. 4 normalizes the intra-class probability, yielding the result. □

In summary, our CaBaM model extends the BA model to simulate not only G (A), but also H and y, while offering flexible designation of assortativity along with several useful theoretical guarantees which can facilitate analysis.
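For intuition, a worked example of the constant-p_c reduction (ours, not from the paper's experiments): with two equally likely classes, g = 1/2 and the expected intra-class edge fraction equals the assortativity factor itself,

$$r_{e_w} = \frac{\tfrac{1}{2} \, p_c}{\tfrac{1}{2} \, p_c + \tfrac{1}{2} (1 - p_c)} = p_c,$$

so p_c = 0.7 yields r_{e_w} = 0.7. With an imbalanced M = Multinomial([0.7, 0.3]) (so g = 0.7² + 0.3² = 0.58), the same p_c = 0.7 yields r_{e_w} = (0.58 · 0.7) / (0.58 · 0.7 + 0.42 · 0.3) ≈ 0.76.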

Figure 3: CaBaM simulated graphs with constant p_c values. Panels: (a) degree distribution, (b) p_c = 0.5, (c) p_c = 0.70, (d) p_c = 0.99. All simulations (estimated with |V| = 10000) obey the same underlying power-law degree distribution (a). (b-d) show graph snapshots at |V| = 75, demonstrating the varying assortative tendencies with varying constant p_c.

Figure 4: CaBaM simulated graphs with degree-dependent p_c values, defined as p_c(k) = tanh(k/τ). Panels: (a) degree distribution, (b) p_c(k) = tanh(k/30), (c) p_c(k) = tanh(k/15), (d) p_c(k) = tanh(k/5). All simulations (estimated with |V| = 10000) obey the same underlying power-law degree distribution (a). (b-d) show graph snapshots at |V| = 75, demonstrating the varying assortative tendencies with varying choices of temperature τ which parameterize p_c.

Figure 5: CaBaM simulated graphs converge to limit estimates (as per Theorems 3.1-3.2) in both constant (p_c = 0.5, left) and degree-dependent (p_c(k) = tanh(k/15), right) settings. Panels: (a) α for p_c = 0.5, (b) r_ew for p_c = 0.5, (c) α for p_c(k) = tanh(k/15), (d) r_ew for p_c(k) = tanh(k/15). Estimates in both cases depend on graph size and are derived via MLE. Red lines indicate parameter estimates, and blue dashed lines the theoretical limit quantities.

4 EVALUATION

We next evaluate CaBaM's utility in flexibly simulating graphs with varying properties for investigation. Our results show 3 key points:

• CaBaM produces scale-free graphs which admit a power-law degree distribution.
• CaBaM can be used to simulate graphs with varying class-assortativity specifications.
• CaBaM's generated graphs empirically match the "in-the-limit" properties with quite small graphs, in practice.

4.1 Experimental Setup

We write our simulation code using Python 3.7, and show graph visualizations with the networkx package. All experiments are conducted on a single n1-standard-16 Google Cloud Platform virtual machine.

Unless otherwise specified, we set parameters for CaBaM to

• |V| = 5000 maximum (grow the graph to 5,000 nodes),
• m = 5 (initialize the graph with 5 nodes, and add 5 edges per new node),
• c = 2 (let nodes belong to two classes),
• M = Multinomial([1/2, 1/2]) (let both classes be equally prevalent).

We show results in two different contexts of p_c: constant, and degree-dependent. Constant p_c implies that p_c is simply a scalar in [0, 1] which can be tuned from low to high prioritization of assortativity. For degree-dependence, we use p_c(k) = tanh(k/τ), where k is a node's degree and τ is a temperature parameter controlling the steepness of the tanh function. This allows nodes with different degrees to have different assortativity behaviors. We note that any suitably defined function which outputs scalars suitable for interpretation as probabilities could be used instead of our choice.
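For concreteness, the default configuration and the two p_c regimes can be written as follows (a minimal sketch with our own variable names, compatible with the illustrative cabam_graph sketch in Section 3):

```python
import numpy as np

# Default CaBaM parameters from Section 4.1 (variable names are ours).
N_MAX = 5000                 # grow the graph to 5,000 nodes
M_EDGES = 5                  # seed nodes, and edges added per new node
CLASS_PROBS = [0.5, 0.5]     # M = Multinomial([1/2, 1/2]): two equal classes

# Constant regime: p_c is a scalar in [0, 1].
p_c_constant = 0.7

# Degree-dependent regime: p_c(k) = tanh(k / tau), with temperature tau
# controlling the steepness of the tanh.
def p_c_degree(k, tau=15):
    return np.tanh(k / tau)
```

Passing either form as the p_c argument of the earlier sketch reproduces the two regimes.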

4.2 Power-law degree distribution

Figures 3a and 4a show the degree distributions for 3 constant p_c and 3 degree-dependent p_c specifications, respectively. Notice that the distributions are indistinguishable despite the varying assortativity properties, which agrees with Theorem 3.1. The power-law degree distribution exponent α is evidently 3.0, showing typical linear behavior on logarithmic scales. Figures 5a/c show that this approximate characteristic degree exponent is recovered even for small graphs with hundreds of nodes (along the x-axis), for constant and degree-dependent p_c respectively; red lines show maximum likelihood estimates for α when fitting the degree distribution on the (growing) graph, and blue dashed lines show the limit quantity according to Theorem 3.1.

4.3 Class-assortativity

Figures 3b-d and 4b-d show visualizations of growing networks with 3 constant p_c and 3 degree-dependent p_c specifications, respectively. The visualization is conducted with the graph at |V| = 75 nodes to prevent overplotting effects. Nodes are colored red or green according to their class designation. Note that with varying specifications of p_c in both cases, we can achieve roughly equally assortative/dissortative graphs in (b) with roughly equal intra/inter-class edges, moderately assortative graphs in (c), and highly assortative graphs in (d), all while preserving the degree distribution. These nodes could be further imbued with any class-specific attribute distribution D_c, allowing the simulation of a wide class of graph types. Figures 5b/d show that the estimated class assortativity quantities (in red) approximately match the limit quantity from Theorem 3.2 (in blue, dashed) for small graphs with hundreds of nodes (along the x-axis), for constant and degree-dependent p_c respectively. Moreover, Figure 6 shows that two different parameterizations of degree-dependent p_c can yield varying degree-dependent assortativities for simulated graphs, matching observations from Figure 1 on real datasets. Careful designation of these p_c's enables study of arbitrarily assortative or dissortative graphs.
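The convergence checks in Figure 5 require estimating α and r_ew on the growing graph. A minimal sketch of both estimators (our own illustration; the exponent estimator is the standard continuous MLE of [8]):

```python
import numpy as np

def alpha_mle(degrees, k_min):
    """Continuous MLE for a power-law exponent (Clauset et al. [8]):
    alpha = 1 + n / sum(ln(k_i / k_min)) over all degrees k_i >= k_min."""
    k = np.asarray([d for d in degrees if d >= k_min], dtype=float)
    return 1.0 + len(k) / np.log(k / k_min).sum()

def intra_class_fraction(edges, labels):
    """Empirical intra-class edge fraction r_ew = e_w / (e_w + e_b)."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)
```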

Figure 6: CaBaM can generate graphs with custom, degree-dependent class-assortativity, mimicking real graphs. Panels: (a) p_c(k) = tanh(k/15), (b) p_c(k) = 1 - tanh(k/15).

5 USAGE IN PRACTICE

We make available code for the model discussed in this work at https://github.com/nshah171/cabam-graph-generation. While the theoretical properties regarding degree distribution and assortativity in the limit discussed in Section 3 hold under fairly loose conditions, it is worth noting that the generative process for CaBaM is still usable even outside the scope of these conditions to produce attribute-imbued graphs with varying class-assortativity under a preferential attachment regime, albeit without "nice results" for these properties. Arguably, these are not strictly required for graph generation to usefully facilitate analysis of GNNs; in short, even without the theoretical underpinnings, the generative process allows reproducible and flexible generation of graphs with various desiderata, enabling careful empirical analysis which is attentive to these desiderata and the compatibility of existing models in handling them. While there are likely many uses for our model, several important ones include evaluating GNNs (a) on graphs with varying assortativity/dissortativity, (b) on graphs with arbitrarily noisy or pristine features, and (c) on the relative value of GNNs versus MLPs under different data scenarios, and more.

6 CONCLUSION AND FUTURE WORK

The success of graph-based semi-supervised learning relies on a mixture of factors, including graph structure, node features and class-assortativity between nodes. Modern datasets used to benchmark such methods are few, and quite diverse in several properties relating to node features and relative assortativity, making it difficult to comparatively analyze why some methods perform better on some datasets, and in which cases implicit assumptions can be revised to improve model performance. Graph generation is a promising approach to simulating graph data with various properties for this careful analysis. Thus, in this work (in progress), we introduce CaBaM, a model for generating scale-free, class-aware and assortative graphs, which builds upon the celebrated Barabasi-Albert model and its preferential attachment properties. Our model enables generation of flexible G (via adjacency A), node classes y and class-specific node features H. We hope our findings and remarks regarding benchmark property variations, the relative opacity of various GNN models' performances across these contexts, and our proposed graph generation modeling framework inspire future work in evaluating existing methods under different data conditions, and designing new ones with context-aware architecture improvements.

REFERENCES

[1] Sara Abdali, Neil Shah, and Evangelos E Papalexakis. 2020. HiJoD: Semi-Supervised Multi-aspect Detection of Misinformation using Hierarchical Joint Decomposition. arXiv preprint arXiv:2005.04310 (2020).
[2] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. 2018. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems. 9180–9190.
[3] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067 (2019).
[4] Albert-László Barabási et al. 2016. Network Science. Cambridge University Press.
[5] Tom Britton, Maria Deijfen, and Anders Martin-Löf. 2006. Generating simple random graphs with prescribed degree distribution. Journal of Statistical Physics 124, 6 (2006), 1377–1397.

[6] Andrea Capocci, Vito DP Servedio, Francesca Colaiori, Luciana S Buriol, Debora Donato, Stefano Leonardi, and Guido Caldarelli. 2006. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E 74, 3 (2006), 036116.
[7] Michele Catanzaro, Guido Caldarelli, and Luciano Pietronero. 2004. Social network growth with assortative mixing. Physica A: Statistical Mechanics and its Applications 338, 1-2 (2004), 119–124.
[8] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. 2009. Power-law distributions in empirical data. SIAM Review 51, 4 (2009), 661–703.
[9] Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 1 (1960), 17–60.
[10] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13, Mar (2012), 723–773.
[11] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[12] Gisel Bastidas Guacho, Sara Abdali, Neil Shah, and Evangelos E Papalexakis. 2018. Semi-supervised content-based detection of misinformation via tensor embeddings. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 322–325.
[13] Jiwoon Ha, Soon-Hyoung Kwon, Sang-Wook Kim, Christos Faloutsos, and Sunju Park. 2012. Top-N recommendation through belief propagation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 2343–2346.
[14] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[15] Mark Heimann, Tara Safavi, and Danai Koutra. 2019. Distribution of Node Embeddings as Multiresolution Features for Graphs. In ICDM.
[16] Petter Holme and Beom Jun Kim. 2002. Growing scale-free networks with tunable clustering. Physical Review E 65, 2 (2002), 026107.
[17] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687 (2020).
[18] Xiao Huang, Jundong Li, and Xia Hu. 2017. Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 731–739.
[19] Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, and Christos Faloutsos. 2015. A general suspiciousness metric for dense blocks in multimodal data. In 2015 IEEE International Conference on Data Mining. IEEE, 781–786.
[20] Fariba Karimi, Mathieu Génois, Claudia Wagner, Philipp Singer, and Markus Strohmaier. 2018. Homophily influences ranking of minorities in social networks. Scientific Reports 8, 1 (2018), 1–12.
[21] Ji Youn Kim, Michael Howard, Emily Cox Pahnke, and Warren Boeker. 2016. Understanding network formation in strategy research: Exponential random graph models. Strategic Management Journal 37, 1 (2016), 22–44.
[22] Thomas N Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
[23] Tamara G Kolda, Ali Pinar, Todd Plantenga, and Comandur Seshadhri. 2014. A scalable generative graph model with community structure. SIAM Journal on Scientific Computing 36, 5 (2014), C424–C452.
[24] Danai Koutra, Tai-You Ke, U Kang, Duen Horng Polo Chau, Hsing-Kuo Kenneth Pao, and Christos Faloutsos. 2011. Unifying guilt-by-association approaches: Theorems and fast algorithms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 245–260.
[25] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. 2010. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.
[26] Dean Lusher, Johan Koskinen, and Garry Robins. 2013. Exponential random graph models for social networks: Theory, methods, and applications. Cambridge University Press.
[27] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. 2019. Provably powerful graph networks. In Advances in Neural Information Processing Systems. 2153–2164.
[28] Stephen Mussmann, John Moore, Joseph John Pfeiffer, and Jennifer Neville. 2015. Incorporating assortativity and degree dependence into scalable network models. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[29] Mark EJ Newman. 2001. Clustering and preferential attachment in growing networks. Physical Review E 64, 2 (2001), 025102.
[30] Mark EJ Newman. 2002. Assortative mixing in networks. Physical Review Letters 89, 20 (2002), 208701.
[31] Hamed Nilforoshan and Neil Shah. 2019. SliceNDice: Mining Suspicious Multi-attribute Entity Groups with Multi-view Graphs. arXiv preprint arXiv:1908.07087 (2019).
[32] Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsos. 2007. Netprobe: A fast and scalable system for fraud detection in online auction networks. In Proceedings of the 16th International Conference on World Wide Web. 201–210.
[33] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[34] Joseph J Pfeiffer III, Sebastian Moreno, Timothy La Fond, Jennifer Neville, and Brian Gallagher. 2014. Attributed graph models: Modeling network structure with correlated attributes. In Proceedings of the 23rd International Conference on World Wide Web. 831–842.
[35] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks 20, 1 (2008), 81–102.
[36] Satyaki Sikdar, Justus Hibshman, and Tim Weninger. 2019. Modeling Graphs with Vertex Replacement Grammars. arXiv preprint arXiv:1908.03837 (2019).
[37] Tom AB Snijders. 2002. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure 3, 2 (2002), 1–40.
[38] Jian Sun, Heung-Yeung Shum, and Nan-Ning Zheng. 2002. Stereo matching using belief propagation. In European Conference on Computer Vision. Springer, 510–524.
[39] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
[40] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
[41] M Winlaw, H DeSterck, and G Sanders. 2015. An in-depth analysis of the Chung-Lu model. Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
[42] Jun Wu, Jingrui He, and Jiejun Xu. 2019. DEMO-Net: Degree-specific graph neural networks for node and graph classification. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 406–415.
[43] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018).
[44] R Xulvi-Brunet and IM Sokolov. 2004. Construction and properties of assortative random networks. arXiv preprint cond-mat/0405095 (2004).
[45] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861 (2016).
[46] Jonathan S Yedidia, William T Freeman, and Yair Weiss. 2001. Generalized belief propagation. In Advances in Neural Information Processing Systems. 689–695.
[47] Tong Zhao, Yozen Liu, Leonardo Neves, Oliver Woodford, Meng Jiang, and Neil Shah. 2020. Data Augmentation for Graph Neural Networks. arXiv preprint arXiv:2006.06830 (2020).
[48] Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems. 321–328.
[49] Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. (2002).