Advancements in Graph Neural Networks: PGNNs, Pretraining, and OGB
Jure Leskovec
Includes joint work with W. Hu, J. You, M. Fey, Y. Dong, B. Liu, M. Catasta, K. Xu, S. Jegelka, M. Zitnik, P. Liang, V. Pande

Networks of Interactions
Social networks, knowledge graphs, biological networks, complex systems, molecules, code.

Representation Learning in Graphs
Input: a network. Output: node embeddings z. Predictions: node labels, link prediction, graph classification.

Why is it Hard?
Networks are complex!
§ Arbitrary size and complex topological structure (i.e., no spatial locality, unlike grids such as images or sequences such as text)
§ No fixed node ordering or reference point
§ Often dynamic, with multimodal features

Graph Neural Networks
[Figure: an input graph and the computation graph rooted at target node A.]
Each node defines a computation graph.
§ Each edge in this graph is a transformation/aggregation function
Scarselli et al. 2005. The Graph Neural Network Model. IEEE Transactions on Neural Networks.

Graph Neural Networks
Intuition: Nodes aggregate information from their neighbors using neural networks.
Inductive Representation Learning on Large Graphs. W. Hamilton, R. Ying, J. Leskovec. NIPS, 2017.

Idea: Aggregate Neighbors
Intuition: The network neighborhood defines a computation graph. Every node defines a computation graph based on its neighborhood! This can be viewed as learning a generic linear combination of graph low-pass and high-pass operators [Bronstein et al., 2017].

[NIPS '17] Our Approach: GraphSAGE
Update for node v:
h_v^(l+1) = σ( COMBINE( W_l h_v^(l) , AGG({ σ(Q_l h_u^(l)) : u ∈ N(v) }) ) )
i.e., the (l+1)-st level embedding of node v is obtained by transforming v's own level-l embedding (via W_l), transforming and aggregating the level-l embeddings of v's neighbors u ∈ N(v) (via Q_l and AGG), and combining the two (e.g., by concatenation or sum).
§ h_v^(0) = features x_v of node v; σ(·) is a sigmoid activation function
(A minimal code sketch of one such layer appears after the DIFFPOOL slide below.)

[NIPS '17] GraphSAGE: Training
§ Aggregation parameters are shared for all nodes
§ Number of model parameters is independent of |V|
§ Can use different loss functions:
§ Classification/Regression: L(h_v) = ‖ y_v − f(h_v) ‖^2
§ Pairwise loss: L(h_u, h_v) = max(0, 1 − dist(h_u, h_v))
[Figure: the shared parameters W^(l), Q^(l) appear in the compute graphs of all nodes, e.g., nodes A and B of the input graph.]

Inductive Capability
Train with a snapshot of the graph; when a new node arrives, generate an embedding z_u for the new node. This works even for nodes we never trained on!

[NeurIPS '18] DIFFPOOL: Pooling for GNNs
Don't just embed individual nodes. Embed an entire graph.
Problem: Learn how to hierarchically pool the nodes to embed the entire graph.
Our solution: DIFFPOOL
§ Learns a hierarchical pooling strategy
§ Sets of nodes are pooled hierarchically
§ Soft assignment of nodes to next-level nodes
Hierarchical Graph Representation Learning with Differentiable Pooling. R. Ying, et al. NeurIPS 2018.
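As referenced in the GraphSAGE slide above, here is a minimal numpy sketch of one such layer. It is an illustration under assumptions, not the authors' implementation: it uses mean as the neighbor aggregator and element-wise sum as the combine step, and the names (graphsage_layer, adj, W, Q) are chosen here for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graphsage_layer(h, adj, W, Q):
    """One GraphSAGE-style update (sketch).

    h   : (num_nodes, d_in) array of level-l embeddings
    adj : dict mapping each node v to a list of its neighbors N(v)
    W, Q: (d_out, d_in) weight matrices for the node itself and for its neighbors
    Returns a (num_nodes, d_out) array of level-(l+1) embeddings.
    """
    h_next = np.zeros((h.shape[0], W.shape[0]))
    for v, neighbors in adj.items():
        # Transform each neighbor's embedding with Q, then aggregate (here: mean).
        agg = np.mean([sigmoid(Q @ h[u]) for u in neighbors], axis=0)
        # Combine with v's own transformed embedding (here: sum) and apply sigmoid.
        h_next[v] = sigmoid(W @ h[v] + agg)
    return h_next

# Toy usage: 4 nodes with 3-dimensional input features, 5-dimensional output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # h^(0) = input features x_v
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
W, Q = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
print(graphsage_layer(x, adj, W, Q).shape)       # (4, 5)
```

Because W and Q are shared across all nodes, the number of parameters is independent of |V|, and the same trained layer can be applied to nodes that arrive after training, which is the inductive capability noted above.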
How Expressive are GNNs?
Theoretical framework: characterize a GNN's discriminative power.
§ Characterize an upper bound on the discriminative power of GNNs
§ Propose a maximally powerful GNN
§ Characterize the discriminative power of popular GNNs
[Figure: a node's GNN tree; each aggregation step operates on a multiset of neighbor embeddings.]
How Powerful are Graph Neural Networks? K. Xu, et al. ICLR 2019.

Key Insight: Rooted Subtrees
Assume no node features; then single nodes cannot be distinguished, but rooted trees can be distinguished. The most powerful GNN is able to distinguish rooted subtrees of different structure.
[Figure: an example graph and the rooted subtrees the GNN distinguishes: the blue and violet nodes can be distinguished, but the violet and pink nodes cannot.]

Discriminative Power of GNNs
Idea: If the GNN aggregation over a multiset X of neighbor embeddings is injective, then different multisets are distinguished and the GNN can capture subtree structure.
Theorem: Any injective multiset function g can be written as g(X) = φ( ∑_{x∈X} f(x) ), where the functions φ and f are obtained from the universal approximation theorem.
Consequence: The sum aggregator is the most expressive!

Power of Aggregators: Ranking by Discriminative Power
[Figure 2 of Xu et al., ICLR 2019: ranking by expressive power of the sum, mean, and max-pooling aggregators over a multiset. The left panel shows the input multiset; the other panels illustrate what each aggregator captures: sum captures the full multiset, mean captures the proportion/distribution of elements of a given type, and max ignores multiplicities (it reduces the multiset to a simple set).]
Failure cases for mean and max aggregation [Figure 3 of Xu et al.: simple graph structures that mean and max-pooling aggregators fail to distinguish]:
(a) Mean and max both fail
(b) Max fails
(c) Mean and max both fail
Figure 2 gives reasoning about how different aggregators "compress" different graph structures/multisets.
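The ranking in Figure 2 and the failure cases in Figure 3 can be checked numerically. Below is a small standalone Python sketch (not code from the talk or the paper); the two-dimensional one-hot features and the three multiset pairs are illustrative stand-ins for the colored neighborhoods in Figure 3. Sum distinguishes all three pairs, mean fails on (a) and (c), and max fails on all three.

```python
import numpy as np

# Hypothetical node features ("colors") standing in for Figure 3's colored nodes.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Pairs of neighbor multisets, loosely mirroring Figure 3's three cases.
cases = {
    "(a) same color, different counts      ": ([a, b][:1] * 2, [a] * 3),
    "(b) same max, different proportions   ": ([a, b],         [a, a, b]),
    "(c) same proportions, different counts": ([a, b],         [a, a, b, b]),
}

aggregators = {
    "sum":  lambda xs: np.sum(xs, axis=0),
    "mean": lambda xs: np.mean(xs, axis=0),
    "max":  lambda xs: np.max(xs, axis=0),
}

for case, (X1, X2) in cases.items():
    verdicts = [
        f"{name}: {'distinguishes' if not np.allclose(agg(X1), agg(X2)) else 'fails'}"
        for name, agg in aggregators.items()
    ]
    print(case, " | ".join(verdicts))
```

This reproduces the sum > mean > max ordering of Figure 2: only the sum aggregator preserves enough of the multiset to be injective.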
Excerpt from "How Powerful are Graph Neural Networks?" (Xu et al., ICLR 2019):

Many existing GNNs instead use a 1-layer perceptron σ ∘ W (Duvenaud et al., 2015; Kipf & Welling, 2017; Zhang et al., 2018), a linear mapping followed by a non-linear activation function such as a ReLU. Such 1-layer mappings are examples of Generalized Linear Models (Nelder & Wedderburn, 1972). Therefore, we are interested in understanding whether 1-layer perceptrons are enough for graph learning. Lemma 7 suggests that there are indeed network neighborhoods (multisets) that models with 1-layer perceptrons can never distinguish.

Lemma 7. There exist finite multisets X1 ≠ X2 so that for any linear mapping W, ∑_{x∈X1} ReLU(W x) = ∑_{x∈X2} ReLU(W x).

The main idea of the proof for Lemma 7 is that 1-layer perceptrons can behave much like linear mappings, so the GNN layers degenerate into simply summing over neighborhood features. Our proof builds on the fact that the bias term is lacking in the linear mapping. With the bias term and sufficiently large output dimensionality, 1-layer perceptrons might be able to distinguish different multisets. Nonetheless, unlike models using MLPs, the 1-layer perceptron (even with the bias term) is not a universal approximator of multiset functions. Consequently, even if GNNs with 1-layer perceptrons can embed different graphs to different locations to some degree, such embeddings may not adequately capture structural similarity, and can be difficult for simple classifiers, e.g., linear classifiers, to fit. In Section 7, we will empirically see that GNNs with 1-layer perceptrons, when applied to graph classification, sometimes severely underfit training data and often underperform GNNs with MLPs in terms of test accuracy.

5.2 STRUCTURES THAT CONFUSE MEAN AND MAX-POOLING

What happens if we replace the sum in h(X) = ∑_{x∈X} f(x) with mean or max-pooling as in GCN and GraphSAGE? Mean and max-pooling aggregators are still well-defined multiset functions because they are permutation invariant. But, they are not injective. Figure 2 ranks the three aggregators by their representational power, and Figure 3 illustrates pairs of structures that the mean and max-pooling aggregators fail to distinguish. Here, node colors denote different node features, and we assume the GNNs aggregate neighbors first before combining them with the central node.

In Figure 3a, every node has the same feature a and f(a) is the same across all nodes (for any function f). When performing neighborhood aggregation, the mean or maximum over f(a) remains f(a) and, by induction, we always obtain the same node representation everywhere.
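Returning to Lemma 7 above, here is a small numerical check (my own illustrative pair of multisets, not the construction used in the paper's proof): for a bias-free linear map W followed by ReLU and sum, the two multisets below are indistinguishable for every W, since ReLU(2w) + ReLU(-2w) = 2 ReLU(w) + 2 ReLU(-w), whereas a nonlinear per-element function such as x -> x^2 (a stand-in for an MLP) followed by sum separates them.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Two different multisets of scalar neighbor features (illustrative example only).
X1 = [1.0, 1.0, -1.0, -1.0]
X2 = [2.0, -2.0]

rng = np.random.default_rng(1)
for _ in range(3):
    W = rng.normal(size=(4, 1))          # random bias-free linear map from R^1 to R^4
    s1 = sum(relu(W * x) for x in X1)    # 1-layer perceptron + sum over the multiset
    s2 = sum(relu(W * x) for x in X2)
    print("1-layer perceptron + sum equal:", np.allclose(s1, s2))   # True every time

# An MLP-like nonlinearity followed by sum is expressive enough to separate them.
print("sum of x^2:", sum(x**2 for x in X1), "vs", sum(x**2 for x in X2))  # 4.0 vs 8.0
```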