SPAGAN: Shortest Path Graph Attention Network

SPAGAN: Shortest Path Graph Attention Network Yiding Yang1 , Xinchao Wang1∗ , Mingli Song2 , Junsong Yuan3 and Dacheng Tao4 1Department of Computer Science, Stevens Institute of Technology 2College of Computer Science and Technology, Zhejiang University 3Department of Computer Science and Engineering, State University of New York at Buffalo 4UBTECH Sydney Artifical Intelligence Centre, University of Sydney fyyang99, [email protected], [email protected], [email protected], [email protected]. Abstract Graph convolutional networks (GCN) have recently demonstrated their potential in analyzing non-grid structure data that can be represented as graphs. The core idea is to encode the local topology of a graph, via convolutions, into the feature of a center node. In this paper, we propose a Figure 1: Comparing neighbor attention and shortest path atten- novel GCN model, which we term as Shortest Path tion. Conventional approaches focus on aggregating information Graph Attention Network (SPAGAN). Unlike con- from first-order neighbors within each layer, shown on the left, and ventional GCN models that carry out node-based often rely on stacking multiple layers to reach higher-order neigh- attentions within each layer, the proposed SPA- bors. The proposed SPAGAN shown on the right, on the other hand, GAN conducts path-based attention that explicitly explicitly conducts shortest-path-based attention that aggregates information from distant neighbors and explores the graph topology, accounts for the influence of a sequence of nodes all in a single shot. yielding the minimum cost, or shortest path, between the center node and its higher-order neighbors. SPAGAN therefore allows for a more infor- conventional square or cubic filters tailored for handling grid- mative and intact exploration of the graph struc- structured data. ture and further a more effective aggregation of in- To extract features and utilize the contextual information formation from distant neighbors into the center encoded in graphs, researchers have explored various possi- node, as compared to node-based GCN methods. bilities. [Kipf and Welling, 2016] proposes a convolution op- We test SPAGAN on the downstream classification erator defined in the no-grid domain and build a graph convo- task on several standard datasets, and achieve per- lutional network (GCN) based on this operator. In GCN, the formances superior to the state of the art. Code is feature of a center node will be updated by averaging the fea- publicly available at https://github.com/ihollywhy/ tures of all its immediate neighbors using fixed weights deter- SPAGAN. mined by the Laplacian matrix. This feature updating mechanism can be seen as a simplified polynomial filtering of [Def- 1 Introduction ferrard et al., 2016], which only considers first-order neighbors. Instead of using all the immediate neighbors, Graph- Convolutional neural networks (CNNs) have achieved a great SAGE [Hamilton et al., 2017] suggests using only a fraction success in many fields especially computer vision, where they arXiv:2101.03464v1 [cs.LG] 10 Jan 2021 of them for the sake of computing complexity. A uniform can automatically learn both low- and high-level feature rep- sampling strategy is adopted to reconstruct graph and aggre- resentations of objects within an image, thus benefiting the gate features. Recently, [Velickoviˇ c´ et al., 2018] introduces subsequent tasks like classification and detection. The input an attention mechanism into the graph convolutional network, to such CNNs are restricted to grid-like structures such as im- and proposes the graph attention network (GAT) by defining ages, in which square-shaped filters can be applied. an attention function between each pair of connected nodes. However, there are many other types of data that do not take form of grid structures and instead represented using All the above models, within a single layer, only look at im- graphs. For example, in the case of social networks, each per- mediate or first-order neighboring nodes for aggregating the son is denoted as a node and each connection as an edge, and graph topological information and updating features of the each node may be connected to a different number of edges. center node. To account for indirect or higher-order neigh- The features of such graphical data are therefore encoded in bors, one can stack multiple layers so as to enlarge the size of their topological structures, which cannot be extracted using receptive field. However, results have demonstrated that, by doing so, the performances tend to drop dramatically [Kipf ∗Corresponding Author and Welling, 2016]. In fact, our experiments even demon- and thus into the features of nodes. This is achieved by, dur- ing training, an iterative scheme that computes shortest paths between a center node and high-order neighbors, and esti- mates the path-based attentions for updating parameters. We test the proposed SPAGAN on several benchmarks for the downstream node classification task, where SPAGAN consis- tently achieves results superior to the state of the art. 2 Related Work In this section, we review related methods on graph convolu- Figure 2: The overall training workflow of SPAGAN. Given a graph, tion and on graph attention mechanism, where the latter one for each center node we compute a set of shortest paths of varying lengths to its high-order neighbors denoted by P, and then extract can be considered as a variant of the former. The proposed their features as path features. Next, a shortest path attention mech- SPAGAN model falls into the graph attention domain. anism is used to compute their attention coefficients with respect Graph Convolutional Networks. Due to the great success to the center node. The features of each center node can therefore of CNNs, many recent works focus on generating such frame- be updated according to the feature of paths and also their attention work to make it possible to apply on non-grid data structure coefficients to minimize the loss function. Afterwards, P will be re- [Gilmer et al., 2017; Monti et al., 2017a; Morris et al., 2018; generated according to these new attention coefficients and used for Velickoviˇ c´ et al., 2018; Simonovsky and Komodakis, 2017; the next training phase. Monti et al., 2017b; Li et al., 2018; Zhao et al., 2019; Peng et al., 2018]. [Bruna et al., 2014] first proposes a convo- strate that simply stacking more layers using of the GAT lution operator for graph data by defining a function in spec- model will often lead to the failure of convergence. tral domain. It first calculates the eigenvalues and eigenvec- In this paper, we propose a novel and the first dedi- tors of graph’s Laplacian matrix and then applies a function to cated scheme, termed Shortest Path Graph Attention Net- the eigenvalues. For each layer, the features of node will be a work (SPAGAN), that allows us to, within a single layer, uti- different combination of weighted eigenvectors. [Defferrard lize path-based high-order attentions to explore the topologi- et al., 2016] defines polynomial filters on the eigenvalues of cal information of the graph and further update the features of the Laplacian matrix to reduce the computational complexity the center node. At the heart of SPAGAN is a mechanism that by using the Chebyshev expansion. [Kipf and Welling, 2016] finds the shortest paths between a center node and its higher- further simplifies the convolution operator by using only first- order neighbors, then computes a path-to-node attention for order neighbors. The adjacent matrix A is first added self- updating the node features and coefficients, and iterates the connections and normalized by the degree matrix D. Then, two steps. In this way, each distant node influences the cen- graph convolution operation can be applied by just simple 0 − 1 − 1 ter node through a path connecting the two with minimum matrix multiplication: H = D 2 AD 2 HΘ. Under this cost, providing a robust estimation and intact exploration of framework, the convolution operator is defined by parameter the graph structure. Θ which is not related to the graph structure. Instead of using We highlight in Fig. 1 the difference between SPAGAN and the full neighbors, GraphSAGE[Hamilton et al., 2017] uses a conventional neighbor-attention approaches. Conventional fixed-size set of neighbors for aggregation. This mechanism methods look at only immediate neighbors within a single makes the memory used and running time to be more pre- layer. To propagate information from distant neighbors, mul- dictable. [Abu-El-Haija et al., 2018] proposes a multi GCN tiple layers would have to be stacked, often yielding the afore- framework by using random walk. They generate multiple mentioned convergence or performance issues. By contrast, adjacent matrices through random walk and feed them to the SPAGAN explicitly explores a sequence of high-order neigh- GCN model. bors, achieved by shortest paths, to capture the more global Graph Attention Networks. Instead of using fixed aggre- graph topology into the path-to-node attetion mechanism. gation weights, [Velickoviˇ c´ et al., 2018] brings attention The training workflow of SPAGAN is depicted in Fig. 2. mechanism into graph convolutional network and proposes For each center node, a set of shortest paths of different GAT model based on that. One of the benefits of attention is lengths denoted by P, are first computed using the attention the ability to deal with input with variant sizes and make the coefficients of each pair of nodes that are all initialized to be model focus on parts which are most related to current tasks. the same. The features of P will then be generated. After- In GAT model, the weight for aggregation is no longer equal wards, a path attention mechanism is applied to generate the or fixed. Each edge in the adjacent matrix will be assigned new embedded feature for each node as well as the new at- with an attention coefficient that represents how much one tention coefficients.

Load more