A Degeneracy Framework for Scalable Graph Autoencoders
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Guillaume Salha 1,2, Romain Hennequin 1, Viet Anh Tran 1 and Michalis Vazirgiannis 2
1 Deezer Research & Development, Paris, France
2 École Polytechnique, Palaiseau, France
[email protected]

Abstract

In this paper, we present a general framework to scale graph autoencoders (AE) and graph variational autoencoders (VAE). This framework leverages graph degeneracy concepts to train models only from a dense subset of nodes instead of using the entire graph. Together with a simple yet effective propagation mechanism, our approach significantly improves scalability and training speed while preserving performance. We evaluate and discuss our method on several variants of existing graph AE and VAE, providing the first application of these models to large graphs with up to millions of nodes and edges. We achieve empirically competitive results w.r.t. several popular scalable node embedding methods, which emphasizes the relevance of pursuing further research towards more scalable graph AE and VAE.

1 Introduction

Graphs have become ubiquitous in the Machine Learning community, thanks to their ability to efficiently represent relationships among items in various disciplines. Social networks, biological molecules and communication networks are some of the most famous real-world examples of data usually represented as graphs. Extracting meaningful information from such structures is a challenging task, which has initiated considerable research efforts aiming at tackling several learning problems, such as link prediction, influence maximization and node clustering.

In particular, over the last decade there has been an increasing interest in extending and applying Deep Learning methods to graph structures. [Gori et al., 2005; Scarselli et al., 2009] firstly introduced graph neural network architectures, and were later joined by numerous contributions to generalize CNNs and the convolution operation to graphs, leveraging spectral graph theory [Bruna et al., 2014], its approximations [Defferrard et al., 2016; Kipf and Welling, 2016a] or spatial-based approaches [Hamilton et al., 2017]. Attempts at extending RNNs, GANs, attention mechanisms or word2vec-like methods for node embeddings have also recently emerged in the literature; for complete references, we refer to [Wu et al., 2019]'s survey on Deep Learning for graphs.

In this paper, we focus on the graph extensions of autoencoders and variational autoencoders. Introduced in the 1980s [Rumelhart et al., 1986], autoencoders (AE) regained significant popularity in the last decade through neural network frameworks [Baldi, 2012], as efficient tools to learn reduced encoding representations of input data in an unsupervised way. Furthermore, variational autoencoders (VAE) [Kingma and Welling, 2013], described as extensions of AE but actually based on quite different mathematical foundations, also recently emerged as a successful approach for unsupervised learning from complex distributions, assuming the input data is the observed part of a larger joint model involving low-dimensional latent variables, optimized via variational inference approximations. [Tschannen et al., 2018] review the wide recent advances in VAE-based representation learning.

During the last three years, many efforts have been devoted to the generalization of such models to graphs. Graph AE and VAE appear as elegant node embedding tools, i.e. ways to learn a low-dimensional vector space representation of nodes, with promising applications to link prediction, node clustering, matrix completion and graph generation. However, most existing models suffer from scalability issues, and all existing experiments are limited to graphs with at most a few thousand nodes. The question of how to scale graph AE and VAE to larger graphs remains widely open, and we propose to address it in this paper. More precisely, our contribution is threefold:

• We introduce a general framework to scale graph AE and VAE models, by optimizing the reconstruction loss (for AE) or the variational lower bound (for VAE) only from a dense subset of nodes, and then propagating representations to the entire graph. These nodes are selected using graph degeneracy concepts. Such an approach considerably improves scalability while preserving performance (see the sketch after this list).

• We apply this framework to large real-world data and discuss empirical results on ten variants of graph AE or VAE models for two learning tasks. To the best of our knowledge, this is the first application of these models to graphs with up to millions of nodes and edges.

• We show that these scaled models have competitive performance w.r.t. several popular scalable node embedding methods, which emphasizes the relevance of pursuing further research towards scalable graph autoencoders.
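To make the first contribution more concrete, the snippet below is a minimal sketch (not the authors' implementation) of the overall strategy, assuming networkx is available for the k-core (degeneracy) decomposition; train_autoencoder and propagate are hypothetical placeholders for the training and propagation steps detailed later in the paper.

```python
import networkx as nx

def scalable_graph_ae_embedding(G, k, train_autoencoder, propagate):
    """Sketch of the degeneracy framework: train a graph AE/VAE on a
    dense k-core only, then spread representations to the full graph."""
    # 1. Select a dense subset of nodes via graph degeneracy (k-core).
    core = nx.k_core(G, k)                # subgraph induced by the k-core
    # 2. Train the (variational) graph autoencoder on the core only,
    #    which is typically much smaller than G.
    Z_core = train_autoencoder(core)      # hypothetical: node -> latent vector
    # 3. Propagate the learned representations from the core
    #    to every remaining node of G.
    return propagate(G, core, Z_core)     # hypothetical propagation step
```

The key point is that the costly autoencoder training only sees the core subgraph, while the propagation step cheaply extends the latent representations to all remaining nodes.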
This paper is organized as follows. In Section 2, we provide an overview of graph AE/VAE and of their extensions, applications and limits. In Section 3, we present our degeneracy framework and how we reconstruct the latent space from an autoencoder trained only on a subset of nodes. We interpret our experimental analysis and discuss possible extensions of our approach in Section 4, and we conclude in Section 5.

2 Preliminaries

In this section, we recall some key concepts related to graph AE and VAE. Throughout this paper, we consider an undirected graph G = (V, E) with |V| = n nodes and |E| = m edges, without self-loops. We denote by A the adjacency matrix of G, weighted or not. Nodes can possibly have feature vectors of size d, stacked up in an n × d matrix X. Otherwise, X is the identity matrix I.

2.1 Graph Autoencoders (GAE)

In the last three years, several attempts at transposing autoencoders to graph structures, with [Kipf and Welling, 2016b] or without [Wang et al., 2016] node features, have been presented. Their goal is to learn, in an unsupervised way, a low dimensional node embedding/latent vector space (encoding), from which reconstructing the graph topology (decoding) is possible. In its most general form, the n × f matrix Z of all latent space vectors z_i, where f is the dimension of the latent space, is the output of a Graph Neural Network (GNN) applied on A and, potentially, X. To reconstruct A from Z, one could resort to another GNN. However, [Kipf and Welling, 2016b] and several extensions of their model implement a simpler inner-product decoder between latent variables, along with a sigmoid activation σ(·) or, if A is weighted, some more complex thresholding. The drawback of this simple decoding is that it involves the multiplication of the two dense matrices Z and Z^T, which has a quadratic complexity O(fn²) w.r.t. the number of nodes. To sum up, with Â the reconstruction:

Â = σ(ZZ^T), with Z = GNN(X, A).

The model is trained by minimizing the reconstruction loss ||A − Â||_F of the graph structure, where ||·||_F denotes the Frobenius matrix norm, or alternatively a weighted cross-entropy loss, by stochastic gradient descent.
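As an illustration of the encoding and decoding steps above, here is a minimal NumPy sketch (not the authors' code) of a graph AE with a two-layer GCN-style encoder, anticipating Section 2.2, and the inner-product decoder Â = σ(ZZ^T); the toy graph, the weight initialization and all function names are ours, and the weights would in practice be trained by gradient descent on the reconstruction loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_encoder(A, X, W0, W1):
    """Two-layer GCN-style encoder Z = GNN(X, A), using the symmetric
    normalization D^-1/2 (A + I) D^-1/2 (dense, simplified version)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))   # normalized adjacency
    H = np.maximum(A_hat @ X @ W0, 0.0)         # ReLU hidden layer
    return A_hat @ H @ W1                       # n x f latent matrix Z

def inner_product_decoder(Z):
    """Reconstruction A_hat = sigma(Z Z^T), quadratic in n."""
    return sigmoid(Z @ Z.T)

def reconstruction_loss(A, Z):
    """Frobenius-norm reconstruction loss ||A - sigma(Z Z^T)||_F."""
    return np.linalg.norm(A - inner_product_decoder(Z), ord="fro")

# Toy usage: 4-node graph without node features (X = identity).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
rng = np.random.default_rng(0)
W0, W1 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))  # latent dimension f = 2
Z = gcn_encoder(A, X, W0, W1)
print(reconstruction_loss(A, Z))   # weights would be trained by gradient descent
```

Note that forming σ(ZZ^T) materializes a dense n × n matrix, which is exactly the O(fn²) decoding bottleneck mentioned above.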
2.2 Graph Convolutional Networks (GCN)

[Kipf and Welling, 2016b], and a majority of following works, assume that the GNN encoder is a Graph Convolutional Network (GCN). Its weight matrices W^(l), of potentially different dimensions, are trained by stochastic gradient descent. The choice of GCN encoders is mainly driven by complexity considerations: indeed, the cost of computing each hidden layer is linear w.r.t. m [Kipf and Welling, 2016a], and training efficiency can also be improved via importance sampling [Chen et al., 2018]. However, recent works, e.g. [Xu et al., 2019], highlight some fundamental limits of the simple GCN heuristics. This motivates resorting to more powerful, albeit more complex, GNN encoders, such as the model of [Bruna et al., 2014] computing actual spectral graph convolutions, later extended by [Defferrard et al., 2016] to approximate smooth filters in the spectral domain with Chebyshev polynomials (GCN being a faster first-order approximation of [Defferrard et al., 2016]). In this paper, we show that our scalable degeneracy framework adequately facilitates the training of such more complex encoders.

2.3 Variational Graph Autoencoders (VGAE)

[Kipf and Welling, 2016b] also introduced Variational Graph Autoencoders (VGAE). They assume a probabilistic model on the graph structure involving some latent variables z_i of length f for each node i ∈ V, later interpreted as latent representations of nodes in an embedding space of dimension f. More precisely, with Z the n × f latent variables matrix, the inference model (encoder) is defined as q(Z|X, A) = ∏_{i=1}^n q(z_i|X, A), where q(z_i|X, A) = N(z_i | μ_i, diag(σ_i²)). Parameters of the Gaussian distributions are learned using two two-layer GCNs. Therefore, μ, the matrix of mean vectors μ_i, is defined as μ = GCN_μ(X, A); also, log σ = GCN_σ(X, A), and both GCNs share the same weights in the first layer. Then, as for GAE, a generative model (decoder) aiming at reconstructing A is defined as the inner product between latent variables: p(A|Z) = ∏_{i=1}^n ∏_{j=1}^n p(A_ij | z_i, z_j), where p(A_ij = 1 | z_i, z_j) = σ(z_i^T z_j) and σ(·) is the sigmoid function. As explained for GAE, such a reconstruction has a limiting quadratic complexity w.r.t. n. [Kipf and Welling, 2016b] optimize the weights of the GCNs by maximizing a tractable variational lower bound (ELBO) of the model's likelihood:

L = E_{q(Z|X,A)}[ log p(A|Z) ] − D_KL( q(Z|X,A) || p(Z) ),

where D_KL(·, ·) is the Kullback-Leibler divergence. They perform full-batch gradient descent, using the reparameterization trick [Kingma and Welling, 2013], and choosing a Gaussian prior p(Z) = ∏_i p(z_i) = ∏_i N(z_i | 0, I).

2.4 Applications, Extensions and Limits