Autoregressive Models for Sequences of Graphs
Daniele Zambon* (Università della Svizzera italiana, Lugano, Switzerland) [email protected]
Daniele Grattarola* (Università della Svizzera italiana, Lugano, Switzerland) [email protected]
Lorenzo Livi (University of Manitoba, Winnipeg, Canada; University of Exeter, Exeter, United Kingdom)
Cesare Alippi (Università della Svizzera italiana, Lugano, Switzerland; Politecnico di Milano, Milano, Italy)

This research is funded by the Swiss National Science Foundation project 200021 172671. We gratefully acknowledge partial support of the Canada Research Chairs program. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. *Equal contribution.

Abstract—This paper proposes an autoregressive (AR) model for sequences of graphs, which generalises traditional AR models. A first novelty consists in formalising the AR model for a very general family of graphs, characterised by a variable topology, and attributes associated with nodes and edges. A graph neural network (GNN) is also proposed to learn the AR function associated with the graph-generating process (GGP), and subsequently predict the next graph in a sequence. The proposed method is compared with four baselines on synthetic GGPs, denoting a significantly better performance on all considered problems.

Index Terms—graph, structured data, stochastic process, recurrent neural network, graph neural network, autoregressive model, prediction.

I. INTRODUCTION

Several physical systems can be described by means of autoregressive (AR) models and their variants. In that case, a system is represented as a discrete-time signal in which, at every time step, the observation is modelled as a realisation of a random variable that depends on preceding observations. In this paper, we consider the problem of designing AR predictive models where the observed entity is a graph.

In the traditional setting for AR predictive models, each observation generated by the process is modelled as a vector, intended as a realisation of a random variable $x_{t+1} \in \mathbb{R}^d$, so that

$$\begin{cases} x_{t+1} = f(x_t, x_{t-1}, \dots, x_{t-p+1}) + \epsilon, \\ \mathbb{E}[\epsilon] = 0, \\ \operatorname{Var}[\epsilon] = \sigma^2 < \infty. \end{cases} \tag{1}$$

In other terms, each observation $x_{t+1}$ is obtained from the regressor

$$x_t^p = [x_t, x_{t-1}, \dots, x_{t-p+1}]$$

through an AR function $f(\cdot)$ of order $p$, affected by an additive stationary random noise $\epsilon$. Given (1), the prediction of $x_{t+1}$ is often taken as

$$\hat{x}_{t+1} = \mathbb{E}[x_{t+1}] = f(x_t, x_{t-1}, \dots, x_{t-p+1}), \tag{2}$$

where the expectation $\mathbb{E}[\cdot]$ is with respect to the noise at time $t+1$. The predictor from (2) is optimal when considering the $L_2$-norm loss between $\hat{x}_{t+1}$ and $x_{t+1}$.
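As a concrete reference point, the following minimal Python sketch instantiates (1)-(2) under the simplest possible assumption, a linear AR function $f$ estimated by least squares. It is not part of the paper; the order, coefficients, and noise level are illustrative choices only.

```python
# Minimal sketch of the classical AR model in (1)-(2), assuming for illustration a
# *linear* AR function f estimated by least squares; the coefficients, order p, and
# noise level below are arbitrary choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

p = 2
a_true = np.array([0.6, -0.2])      # coefficients of the linear f used to simulate data
T = 500
x = np.zeros(T)
for t in range(p, T):
    # x_t = f(x_{t-1}, ..., x_{t-p}) + eps, with eps zero-mean and of finite variance
    x[t] = a_true @ x[t - p:t][::-1] + 0.1 * rng.standard_normal()

# Regressors x_t^p = [x_t, x_{t-1}, ..., x_{t-p+1}] and targets x_{t+1}
X = np.array([x[t - p + 1:t + 1][::-1] for t in range(p - 1, T - 1)])
y = x[p:T]

# Estimate f by least squares; the predictor (2) is then x_hat_{t+1} = f(x_t^p)
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
x_hat_next = a_hat @ x[T - p:T][::-1]   # one-step-ahead prediction from the last regressor
print("estimated coefficients:", a_hat)
print("one-step prediction:", x_hat_next)
```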
Predicting the temporal evolution of a multivariate stochastic process is a widely explored problem in system identification and machine learning. However, several recent works have focused on problems where the vector representation of a system can be augmented by also considering the relations existing among variables [1], ending up with a structured representation that can be naturally described by a graph. Such structured information provides an important inductive bias that the learning algorithms can take advantage of. For instance, it is difficult to predict the chemical properties of a molecule by only looking at its atoms; on the other hand, by explicitly representing the chemical bonds, the description of the molecule becomes more complete, and the learning algorithm can take advantage of that information. Similarly, several other machine learning problems benefit from a graph-based representation, e.g., the understanding of visual scenes [2], the modelling of interactions in physical and multi-agent systems [3], [4], or the prediction of traffic flow [5]. In all these problems (and many others [1]), the dependencies among variables provide a strong prior that has been successfully leveraged to significantly surpass the previous state of the art.

In this paper, we consider attributed graphs where each node and edge can be associated with a feature vector. As such, a graph with $N$ nodes can be seen as a tuple $(V, E)$, where

$$V = \{v_i \in \mathbb{R}^F\}_{i=1,\dots,N}$$

represents the set of nodes with $F$-dimensional attributes, and

$$E = \{e_{ij} \in \mathbb{R}^S\}_{v_i, v_j \in V}$$

models the set of $S$-dimensional edge attributes [6]. By this formulation, categorical attributes can also be represented via vector encoding. We denote with $\mathcal{G}$ the set of all possible graphs with vector attributes. Graphs in $\mathcal{G}$ can have different order and topology, as well as variable node and edge attributes. Moreover, a correspondence among the nodes of different graphs might not be present, or can be unknown; in this case, we say that the nodes are non-identified.
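For readers who prefer code, the sketch below shows one possible in-memory container for such attributed graphs. The class and field names are illustrative choices, not an API defined in the paper.

```python
# One possible container for the attributed graphs (V, E) described above: N nodes
# with F-dimensional attributes and S-dimensional edge attributes. The class and
# field names are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Dict, Tuple
import numpy as np

@dataclass
class AttributedGraph:
    node_attr: np.ndarray                          # shape (N, F): row i holds v_i
    edge_attr: Dict[Tuple[int, int], np.ndarray]   # (i, j) -> e_ij in R^S; keys define the topology

    @property
    def order(self) -> int:
        """Number of nodes N, which may differ from graph to graph in a sequence."""
        return self.node_attr.shape[0]

# A small directed graph with F = 2 node attributes and S = 1 edge attributes.
g = AttributedGraph(
    node_attr=np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]),
    edge_attr={(0, 1): np.array([0.3]), (1, 2): np.array([0.7]), (2, 0): np.array([0.1])},
)
print(g.order, len(g.edge_attr))
```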
Here we build on the classic AR model (1) operating on discrete-time signals, and propose a generalised formulation that allows us to consider sequences of generic attributed graphs. The formulation is thus adapted by modelling a graph-generating process (GGP) that produces a sequence $\{g_1, g_2, \dots, g_t, g_{t+1}, \dots\}$ of graphs in $\mathcal{G}$, where each graph $g_{t+1}$ is obtained through an AR function

$$\phi(\cdot) : \mathcal{G}^p \to \mathcal{G} \tag{3}$$

on the space $\mathcal{G}$ of graphs. In order to formulate an AR model equivalent to (1) for graphs, the AR function $f(\cdot)$ must be defined on the space of graphs, and the concept of additive noise must also be suitably adapted to the graph representation.

Previous art on predictive autoregression in the space of graphs is scarce, with [7] being the most notable contribution to the best of our knowledge. However, the work in [7] only considers binary graphs governed by a vector AR process of order $p = 1$ on some graph features, and does not allow one to consider attributes or non-identified nodes.

The novel contribution of this work is therefore two-fold. First, we formulate an AR system model generating graph-valued signals as a generalisation of the numerical case. In this formulation, we consider graphs in a very general form. In particular, our model deals with graphs characterised by:
• directed and undirected edges;
• identified and non-identified nodes;
• variable topology (i.e., connectivity);
• variable order;
• node and edge attributes (not necessarily numerical vectors; categorical data is also possible).
Our model also accounts for the presence of an arbitrary stationary noise, which is formulated as a graph-valued random variable that can act on the topology, node attributes, and edge attributes of the graphs.

Second, we propose to learn the AR function (3) using the recently proposed framework of graph neural networks (GNNs) [1], a family of learnable functions that are designed to operate directly on arbitrary graphs. This paper represents a first step towards modelling graph-valued stochastic processes, and introduces a possible way of applying existing tools for deep learning to this new class of prediction problems.

The rest of the paper is structured as follows: Section II-A formulates the problem of defining AR models in the space of graphs and Section II-B introduces the proposed architecture for the GNN; Section III reports the experimental analysis performed to validate the methodology.

II. NEURAL GRAPH AUTOREGRESSION

A. Autoregressive Model for a GGP

Due to the lack of basic mathematical operators in the vast family of graphs we consider here, the generalisation of model (1) to account for graph data is non-trivial. As in the numerical case, we model each observation of the process as a realisation of a random variable $g_{t+1} \in \mathcal{G}$, dependent on a graph-valued regressor

$$g_t^p = [g_t, g_{t-1}, \dots, g_{t-p+1}],$$

through an AR function $\phi(\cdot)$ on the space of graphs. Similarly to (1), $g_{t+1}$ is modelled as

$$g_{t+1} = H(\phi(g_t^p), \eta), \tag{4}$$

where

$$H : \mathcal{G} \times \mathcal{G} \to \mathcal{G}$$

is a function that models the effects of the noise graph $\eta \in \mathcal{G}$ on graph $\phi(g_t^p)$. Function $H(\cdot, \cdot)$ in (4) is necessary because, as mentioned above, the sum between graphs $\phi(g_t^p)$ and $\eta$ is not generally defined.

The assumptions made on the noise in (1) have to be adapted as well. In the classic formulation (1), the condition of unbiased noise is formulated as $\mathbb{E}[\epsilon] = 0$ or, equivalently, as $f(x_t, \dots, x_{t-p+1}) = \mathbb{E}[f(x_t, \dots, x_{t-p+1}) + \epsilon]$. In the space of graphs, the assumption of unbiased noise can be extended as

$$\phi(g_t^p) \in \mathbb{E}^f_{\eta}\big[H(\phi(g_t^p), \eta)\big], \tag{5}$$

where $\mathbb{E}^f[\cdot] \subseteq \mathcal{G}$ is the set of mean graphs according to Fréchet [10], defined as the set of graphs minimising

$$\mathbb{E}^f[g] = \arg\min_{g' \in \mathcal{G}} \int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g). \tag{6}$$

Function $d(\cdot, \cdot)$ in (6) is a pre-metric distance between two graphs, and $Q_g$ is a graph distribution defined on the Borel sets of space $(\mathcal{G}, d)$. Examples of graph distances $d(\cdot, \cdot)$ that can be adopted are the graph edit distances [11]–[13], or any distance derived from positive semi-definite kernel functions [14]. Depending on the graph distribution $Q_g$, there can be more than one graph minimising (6), hence $\mathbb{E}^f[\cdot]$ is in general a set. We stress that, for a sufficiently small Fréchet variation of the noise graph $\eta$ (see Eq. (7)) and a metric $d(\cdot, \cdot)$, the Fréchet mean graph exists and is unique.

Equation (6) holds only when

$$\int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g) < \infty,$$

which can be interpreted as the graph counterpart of $\operatorname{Var}[\epsilon] < \infty$ in (1). The variance in the graph domain can be expressed in terms of the Fréchet variation

$$\operatorname{Var}^f[g] := \min_{g' \in \mathcal{G}} \int_{\mathcal{G}} d(g, g')^2 \, dQ_g(g) < \infty. \tag{7}$$

The final AR system model in the graph domain becomes
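The display equation announced by the last sentence lies outside this excerpt. Purely as a reading aid, combining (4) with the unbiasedness condition (5) and the finite Fréchet variation (7), in analogy with (1), suggests a system of roughly the following form; this is a reconstruction, not the paper's own numbered equation:

$$\begin{cases} g_{t+1} = H(\phi(g_t^p), \eta), \\ \phi(g_t^p) \in \mathbb{E}^f_{\eta}\big[H(\phi(g_t^p), \eta)\big], \\ \operatorname{Var}^f\big[H(\phi(g_t^p), \eta)\big] < \infty. \end{cases}$$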