
Neural Trees for Learning on Graphs
Rajat Talak, Siyi Hu, Lisa Peng, and Luca Carlone

Abstract—Graph Neural Networks (GNNs) have emerged as a flexible and powerful approach for learning over graphs. Despite this success, existing GNNs are constrained by their local message-passing architecture and are provably limited in their expressive power. In this work, we propose a new GNN architecture – the Neural Tree. The neural tree architecture does not perform message passing on the input graph, but on a tree-structured graph, called the H-tree, that is constructed from the input graph. Nodes in the H-tree correspond to subgraphs in the input graph, and they are reorganized in a hierarchical manner such that a parent-node of a node in the H-tree always corresponds to a larger subgraph in the input graph. We show that the neural tree architecture can approximate any smooth probability distribution function over an undirected graph, as well as emulate the junction tree algorithm. We also prove that the number of parameters needed to achieve an ε-approximation of the distribution function is exponential in the treewidth of the input graph, but linear in its size. We apply the neural tree to semi-supervised node classification in 3D scene graphs, and show that these theoretical properties translate into significant gains in prediction accuracy over the more traditional GNN architectures.

Index Terms—graph neural networks, universal function approximation, tree-decomposition, semi-supervised node classification, 3D scene graphs.

(arXiv:2105.07264v1 [cs.LG] 15 May 2021. The authors are with the Laboratory of Information and Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Corresponding author: Rajat Talak (email: [email protected]). This work was partially funded by the Office of Naval Research under the ONR RAIDER program (N00014-18-1-2828).)

I. INTRODUCTION

Graph-structured learning problems arise in several disciplines, including biology (e.g., molecule classification [Duvenaud et al., 2015]), computer vision (e.g., action recognition [Guo et al., 2018], image classification [Satorras and Estrach, 2018], shape and pose estimation [Kolotouros et al., 2019]), computer graphics (e.g., mesh and point cloud classification and segmentation [Hanocka et al., 2019], [Li et al., 2019]), and social networks (e.g., fake news detection [Monti et al., 2019]), among others [Bronstein et al., 2017]. In this landscape, Graph Neural Networks (GNN) have gained popularity as a flexible and effective approach for regression and classification over graphs.

Despite this growing research interest, recent work has pointed out several limitations of existing GNN architectures [Xu et al., 2019], [Morris et al., 2019], [Maron et al., 2019a], [Bouritsas et al., 2021]. Local message passing GNNs are no more expressive than the Weisfeiler-Lehman (WL) graph isomorphism test [Xu et al., 2019], neither can they serve as universal approximators to all G-invariant functions, i.e., functions defined over a graph G that remain unchanged with node permutation. [Chen et al., 2019] proves an equivalence between the ability to do graph isomorphism testing and the ability to approximate any G-invariant function.

Various GNN architectures have been proposed, that go beyond local message passing, in order to improve expressivity. Graph isomorphism testing, G-invariant function approximation, and the generalized k-order WL (k-WL) tests have served as the end objectives and guided recent progress of this inquiry. For example, the k-order linear GNN [Maron et al., 2019b] and the k-order folklore GNN [Maron et al., 2019a] have expressive powers equivalent to the k-WL and (k+1)-WL test, respectively [Azizian and Lelarge, 2021]. While these architectures can theoretically approximate any G-invariant function (as k → ∞), they use k-order tensors for representations, rendering them impractical for any k ≥ 3.

There is a need for a new way to look at constructing GNN architectures, with better end objectives to guide theoretical progress. Such an attempt can result in new and expressive GNNs that are provably tractable – if not in general, at least in reasonably constrained settings.

A GNN, by its very definition, operates on graph-structured data. The graph structure of the data determines the inter-dependency between nodes and their features. Probabilistic graphical models present a reasonable and well-established way of articulating and working with such inter-dependencies in the data. Prior to the advent of neural networks, inference algorithms on such graphical models were successfully applied to many real-world problems. Therefore, we pose that a GNN architecture operating on a graph should have the expressive power of at least a probabilistic graphical model, i.e., it should be able to approximate any distribution defined by a probabilistic graphical model.

This is not a trivial requirement, as exact inference (akin to learning the distribution or its marginals) on a probabilistic graphical model, without any structural constraints on the input graph, is known to be an NP-hard problem [Cooper, 1990]. Even approximate inference on a probabilistic graphical model, without structural constraints, is known to be NP-hard [Roth, 1996]. A common trick, for exact inference, consists in constructing a junction tree for an input graph and performing message passing on the junction tree instead. In the junction tree, each node corresponds to a subset of nodes of the input graph. The junction tree algorithm remains tractable for graphs with bounded treewidth, while [Chandrasekaran et al., 2008] shows that treewidth is the only structural parameter, bounding which, allows for tractable inference on graphical models.

This paper proposes a novel GNN architecture – the Neural Tree. Neural trees do not perform message passing on the input graph, but on a tree-structured graph, called the H-tree, that is constructed from the input graph. Each node in the H-tree corresponds to a subgraph of the input graph. These subgraphs are arranged hierarchically in the H-tree such that a parent-node, of a node in the H-tree, will always correspond to a larger subgraph in the input graph. Thus, only the leaf nodes in the H-tree correspond to singleton subsets (i.e., individual nodes) of the input graph. The H-tree is constructed by recursively computing tree decompositions of the input graph and its subgraphs, and attaching them to one another to form a hierarchy. Neural message passing on the H-tree generates representations for all the nodes and important subgraphs of the input graph.

We show that the neural tree architecture can approximate any smooth probability distribution function defined over a given undirected graph. We also bound the number of parameters required by a neural tree architecture to obtain an ε-approximation of an arbitrary (smooth) distribution function over a graphical model. We show that the number of parameters increases exponentially in the treewidth of the input graph, but only linearly in the input graph's size. Thus, for graphs with bounded treewidth, the neural tree architecture can tractably approximate any smooth distribution function.

The proposed neural tree architecture has a message passing structure that can emulate the junction tree algorithm. This is unlike the standard, local message passing GNN architectures. [Xu et al., 2020] suggests that such algorithmic alignment of neural architectures leads to lower sample complexity and better generalizability. Thus, for the task of inference on graphs, the neural tree architecture may also provide better generalizability.

We use the neural tree architecture for semi-supervised node classification in 3D scene graphs. A 3D scene graph [Armeni et al., 2019], [Rosinol et al., 2020] is a hierarchical graphical model of a 3D environment, where nodes correspond to entities in the environment (e.g., objects, rooms, buildings), while edges correspond to relations among these entities (e.g., "room i contains object j"). In many robotics and computer vision applications, the scene graph is partially labeled (e.g., a robot can predict the label of a subset of known objects, but may not have direct access to the labels of rooms and buildings). We use our neural trees to predict the missing node labels. Our experiments on 3D scene graphs demonstrate that neural trees outperform standard, local message passing GNNs by a large margin.

A. Organization

The paper is organized as follows. Section II discusses related work. Section III states the problem of semi-supervised node classification and provides other preliminaries. Section IV defines G-compatible functions and argues that approximating G-compatible functions is equivalent to approximating any probability distribution on a probabilistic graphical model. Section V describes the construction of the H-tree and the neural tree architecture. Section VI derives the approximation result. Section VII presents the experimental results on 3D scene graphs. Section VIII concludes the paper.

II. RELATED WORK

Graph Neural Networks. GNNs were proposed in [Gori et al., 2005], [Scarselli et al., 2008] as a generalization of the popular Recurrent Neural Networks to graph-structured data. The development of new architectures, such as Graph Convolutional Networks (GCN) [Kipf and Welling, 2017], Message Passing Neural Networks (MPNN) [Gilmer et al., 2017], GraphSAGE [Hamilton et al., 2017], and Graph Attention Networks (GAT) [Veličković et al., 2018], [Lee et al., 2019], and their successful applications have led to a renewed interest towards GNNs. In particular, GCNs extend convolutional neural networks to graph-structured data, and have been developed in several works [Henaff et al., 2015], [Defferrard et al., 2016], [Kipf and Welling, 2017], [Bronstein et al., 2017], where they are often motivated as a first-order approximation to spectral convolutions.

The GCN architecture has well-known limitations. For example, the GCN architecture treats all the neighbors of a node in the same way, irrespective of their features. GAT was proposed to address this limitation [Veličković et al., 2018], [Lee et al., 2019], [Busbridge et al., 2019]. In it, different weights are assigned to different edges, and these weights are also trained. [Hamilton et al., 2017] propose a GNN architecture to efficiently generate node embeddings for a graph. [Gilmer et al., 2017] propose a message-passing neural network architecture that encompasses all the pre-existing GNN models, and show state-of-the-art results on a molecular property prediction benchmark.

Expressive Power of GNNs. [Xu et al., 2019] and [Morris et al., 2019] show that local-message-passing GNNs can be as expressive as the 1-Weisfeiler-Lehman (WL) test in distinguishing between two non-isomorphic graphs. [Garg et al., 2020] further show that local-message-passing GNNs cannot compute simple graph properties such as longest or shortest cycle, diameter, or clique information. Many GNN architectures have been proposed to circumvent such problems of limited expressivity. [Zhang et al., 2020] show the limitations of GNNs in representing probabilistic logic reasoning. [Morris et al., 2019] propose a k-GNN, with an architecture inspired by the k-WL test, which is shown to be provably more powerful than local-message-passing GNNs. In k-GNNs, message passing is performed among a selected subset of nodes in the input graph. GNNs with expressive power equivalent to the k-WL test have also been proposed in [Maron et al., 2019a]. [Bouritsas et al., 2021] propose an architecture based on subgraph isomorphism counting. It is generally understood that to improve the expressivity of GNNs one has to extract features corresponding to important subgraphs or node subsets, and operate on them. [Ying et al., 2018] propose a hierarchical architecture that pools a representation vector from a subset of nodes in a graph, at every layer. [Jin et al., 2018] propose a junction-tree-based message passing GNN for molecular graph generation, where message passing occurs over clique nodes. This is unlike the proposed neural tree, in which we construct a hierarchy of tree decompositions and prove a novel approximation result.

[Scarselli et al., 2009] introduce the notion of unfolding equivalence and derive a universal approximation result for graph neural networks. While several questions regarding function approximation abilities and generalization of GNNs remain open, some progress has been made in recent years. [Garg et al., 2020] prove generalization bounds for message passing GNNs. [Scarsellia et al., 2018] obtain Vapnik–Chervonenkis dimension upper bounds for GNNs.

Scene Graphs. Scene graphs are a popular model to abstract information in images or model 3D environments. 2D scene graphs have been used in image retrieval [Johnson et al., 2015], caption generation [Karpathy and Fei-Fei, 2015], [Anderson et al., 2016], visual question answering [Ren et al., 2015], [Krishna et al., 2016], and relationship detection [Lu et al., 2016]. GNNs are a popular tool for joint object label and/or relationship inference on scene graphs [Xu et al., 2017], [Li et al., 2017], [Yang et al., 2018], [Zellers et al., 2017]. Recently, there has been a growing interest towards 3D scene graphs, which are constructed from 3D data, such as meshes [Armeni et al., 2019], point clouds [Wald et al., 2020], or raw sensor data [Kim et al., 2019], [Rosinol et al., 2020]. GNNs have been very recently applied to 3D scene graphs for scene layout prediction [Wald et al., 2020] or object search [Kurenkov et al., 2020].

III. PROBLEM STATEMENT AND PRELIMINARIES

In this section, we state the node classification problem, review standard graph neural networks, and introduce the notation used throughout the paper.

Problem. We focus on the standard problem of semi-supervised node classification [Kipf and Welling, 2017]. We are given a large graph G = (V, E) along with node features X = (x_v)_{v ∈ V}, where x_v denotes the node feature of node v ∈ V. The graph is not necessarily connected. A subset of nodes A ⊂ V in G is labeled, i.e., {z_v ∈ L | v ∈ A} is given; here z_v denotes the label for node v and L the finite set of label classes. We need to design a model that correctly predicts the labels of all the unlabeled nodes v ∈ V \ A in the graph G.

Graph Neural Networks (GNN). Various GNN architectures have been successfully applied to solve the node classification problem [Hamilton et al., 2017], [Kipf and Welling, 2017], [Veličković et al., 2018], [Xu et al., 2019], [Jin et al., 2018]. Standard GNN architectures construct representation vectors for each node in G by iteratively aggregating the representation vectors of its neighboring nodes. At iteration t, the representation vector of node v is generated as follows:

  h_v^t = AGG_t( h_v^{t-1}, { h_u^{t-1} | u ∈ N_G(v) } ),   (1)

with h_v^0 := x_v for all v ∈ V, where N_G(v) denotes the set of neighbors of node v in graph G. This process of sharing and aggregating representation vectors among neighboring nodes in G is often called message passing. The procedure runs for a predetermined number of iterations T. The node labels are then generated from the representation vectors at the final iteration T. Node labels are extracted as

  y_v = READ( h_v^T ),   (2)

for all v ∈ V. The functions AGG_t and READ are modeled as single or multi-layer perceptrons.

Notation. All graphs in this paper are undirected and simple, i.e., they do not contain multiple edges between two nodes. For a graph G, we also use V(G) and E(G) to denote the set of nodes and edges, respectively. We use G[A] to denote the subgraph of G induced by a subset of nodes A ⊂ V(G). |A| denotes the size of the set A. As described, we use x_v to denote the feature vector of v, while the space of all node features of v is denoted by X_v. The |V|-tuple of all node features of G is denoted by X = (x_v)_{v ∈ V}. For a set of nodes A ⊂ V, the |A|-tuple of node features, corresponding to nodes A, is denoted by X_A = (x_v)_{v ∈ A}.

IV. GRAPH COMPATIBLE FUNCTIONS

We first define a class of functions called graph compatible functions, or G-compatible functions. These functions will play an important role throughout the paper.

Definition 1 (G-compatible functions): We say that a function f : ×_{v ∈ V} X_v → R is compatible with graph G, or G-compatible, if it can be factorized as

  f(X) = Σ_{C ∈ C(G)} θ_C(x_C),   (3)

where C(G) denotes the collection of all maximal cliques in G and θ_C is some function that maps ×_{v ∈ C} X_v (the set of node features in the clique C) to a real number.

Compatible functions arise in probabilistic graphical models. The joint probability distribution of a probabilistic graphical model, on an undirected graph G, is given by

  p(X | G) = Π_{C ∈ C(G)} ψ_C(x_C),   (4)

where C(G) is the collection of maximal cliques in G and ψ_C are some functions, often referred to as clique potentials [Jordan, 2002], [Koller and Friedman, 2009]. This can be written as

  p(X | G) = exp{ f(X) },   (5)

where f(X) is the compatible function in (3) with θ_C(x_C) = log ψ_C(x_C). Thus, the ability to approximate any compatible function is equivalent to the ability to approximate any distribution function on a probabilistic graphical model.

A. Examples of G-Compatible Functions

Compatible functions arise naturally when performing inference on probabilistic graphical models. Here we provide two examples where we have to learn graph compatible functions to compute maximum likelihood estimates over graphs.

Graph Classification. Given a graph G and its label y ∈ L, suppose that the node features X are distributed according to a probabilistic graphical model on the undirected graph G. This induces a natural correlation between the observed node features, which is dictated by the graph G. Then, the conditional probability density p(X | y, G) of the node features X, given label y and graph G, is given by

  p(X | y, G) = Π_{C ∈ C(G)} ψ_C(x_C, y),   (6)

where C(G) is the set of all maximal cliques in graph G and ψ_C are the clique potentials. A maximum likelihood estimator for the graph label will predict:

  ŷ = arg max_{y ∈ L} log p(X | y, G)   (7)
    = arg max_{y ∈ L} Σ_{C ∈ C(G)} log ψ_C(x_C, y),   (8)

which is a compatible function on graph G. In practice, we do not know the functions ψ_C or the conditional distribution p(X | y, G), and will have to learn them from the data to make the predictions in (8) feasible.
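To make Definition 1 concrete, the following is a minimal Python sketch (ours, not part of the paper) of how a G-compatible function as in (3) can be evaluated with networkx; the graph, features, and the placeholder potential theta are purely illustrative.

```python
import networkx as nx

def evaluate_compatible(G, X, theta):
    """Evaluate f(X) = sum over maximal cliques C of theta_C(x_C), as in (3).

    G     : undirected networkx.Graph
    X     : dict mapping node -> feature x_v
    theta : callable (clique_tuple, features_tuple) -> float, standing in
            for the clique functions theta_C (illustrative placeholder).
    """
    total = 0.0
    for clique in nx.find_cliques(G):        # enumerates the maximal cliques C(G)
        C = tuple(sorted(clique))
        x_C = tuple(X[v] for v in C)         # node features restricted to the clique
        total += theta(C, x_C)
    return total

# Toy usage: a triangle plus a pendant node, with scalar features.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
X = {1: 0.1, 2: 0.5, 3: 0.2, 4: 0.9}
theta = lambda C, x_C: sum(x_C) / len(C)     # placeholder clique potential
print(evaluate_compatible(G, X, theta))
```

In a learning setting the potentials theta_C would of course not be given; the point of Sections V-VI is that the neural tree can learn such a factorized function from data.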

Node Classification. Given a graph G and node labels y = {y_v}_{v ∈ V(G)}, suppose the node features X are distributed according to a probabilistic graphical model on graph G. We, therefore, have

  p(X | y, G) = Π_{C ∈ C(G)} ψ_C(x_C, y).   (9)

A maximum likelihood estimator that estimates the node labels y by observing the node features will predict:

  ŷ = arg max_{y_v ∈ L} log p(X | y, G)   (10)
    = arg max_{y_v ∈ L} Σ_{C ∈ C(G)} log ψ_C(x_C, y),   (11)

which is a compatible function on G.

B. Relation with G-Invariant Functions

Recent interest in developing expressive GNN architectures has been directed towards approximating graph invariant functions, also referred to as G-invariant functions. A G-invariant function on graph G is one that is invariant to permutation of nodes in the graph G. A recent paper shows the equivalence between designing GNN architectures for graph isomorphism testing and designing GNN architectures as universal approximators of G-invariant functions [Chen et al., 2019]. While invariance seems a desirable property, the problem of designing GNNs that are universal approximators of G-invariant functions or perform graph isomorphism testing has been difficult. Recently, the k-order linear GNN [Maron et al., 2019b] and the k-order folklore GNN [Maron et al., 2019a] have been proposed, which in theory can approximate any G-invariant function as k → ∞ [Azizian and Lelarge, 2021]. However, they become impractical even for modest k ≥ 3, due to their reliance on k-order tensors for representations.

We note here that the class of G-compatible functions is not a subset of all G-invariant functions, i.e., there exist G-compatible functions that are not G-invariant. To see this, consider the example in Figure 1. The G-compatible function f is not G-invariant, i.e., switching nodes 1 and 3 (along with node features x_1 and x_3) – though it produces the same graph – does not produce the same output, assuming θ_a ≠ θ_b.

[Fig. 1. A G-compatible function that is not G-invariant (θ_a ≠ θ_b).]

In the next section, we describe the proposed neural tree architecture. We later show that this neural tree architecture is expressive enough to approximate any (smooth) G-compatible function.

V. NEURAL TREE ARCHITECTURE

The key idea behind the neural tree architecture is to construct a tree-structured graph from the input graph and perform message passing on the resulting tree instead of the input graph. This tree-structured graph is such that every node in it represents either a node or a subset of nodes in the input graph. Trees are known to be more amenable to message passing [Jordan, 2002], [Koller and Friedman, 2009], and indeed the proposed architecture enables the derivation of strong approximation results, which we present in Section VI.

In the following, we first review the notion of tree decomposition (Section V-A). We then show how to construct a H-tree for a graph, by successively applying tree decomposition on a given graph G and its subgraphs (Section V-B). Finally, we discuss the proposed neural tree architecture for node classification, which performs neural message passing on the H-tree (Section V-C).

A. Tree Decomposition

For a graph G, a tree decomposition is a tuple (T, B) where T is a tree graph and B = {B_τ}_{τ ∈ V(T)} is a family of bags, where B_τ ⊂ V(G) for every tree node τ ∈ V(T), such that the tuple (T, B) satisfies the following two properties:

(1) Connectedness Property: for every graph node v ∈ V(G), the subgraph of T induced by the tree nodes τ whose bag contains node v is connected, i.e., T_v := T[{τ ∈ V(T) | v ∈ B_τ}] is a connected subgraph of T for every v ∈ V(G).

(2) Covering Property: for every edge {u, v} ∈ E(G) there exists a node τ ∈ V(T) such that u, v ∈ B_τ.

The simplest tree decomposition of any graph G is a tree with a single node, whose bag contains all the nodes in G. However, in practical applications, it is desirable to obtain decompositions where the size of the largest bag is small. This is captured by the notion of treewidth. The treewidth of a tree decomposition (T, B) is defined as the size of the largest bag minus one:

  tw[(T, B)] := max_{τ ∈ V(T)} |B_τ| − 1.   (12)

The treewidth of a graph G is defined as the minimum treewidth that can be achieved among all tree decompositions of G. While finding a tree decomposition with minimum treewidth is NP-hard, many algorithms exist that generate tree decompositions with small enough treewidth [Becker and Geiger, 1996], [Bodlaender, 2006], [Thomas and Green, 2009], [Bodlaender and Koster, 2010], [Bodlaender and Koster, 2011].

One of the most popular tree decompositions is the junction tree decomposition, which was introduced in [Jensen and Jensen, 1994]. In it, the graph G is first triangulated. Triangulation is done by adding a chord between any two nodes in every cycle of length 4 or more. This eliminates all the cycles of length 4 or more in the graph G and produces a chordal graph G_c. The collection of bags B = {B_τ}_τ in the junction tree is chosen as the set of all maximal cliques in the chordal graph G_c. Then, an intersection graph I on B is built, which has a node for every bag in B and an edge between two bags B_τ and B_μ if they have a non-empty intersection, i.e., |B_τ ∩ B_μ| ≥ 1. The weight of every link {τ, μ} in the intersection graph I is set to |B_τ ∩ B_μ|. Finally, the desired junction tree is obtained by extracting a maximum-weight spanning tree of the weighted intersection graph I. It is known that this extracted tree T, with the bags B, is a valid tree decomposition of G that satisfies the connectedness and covering properties.
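The following is a minimal sketch (ours, not from the paper) of the junction tree construction just described, using networkx and assuming a recent NetworkX version: triangulate, collect the maximal cliques of the chordal graph as bags, build the weighted intersection graph, and take a maximum-weight spanning tree. The helper name junction_tree_decomposition is hypothetical.

```python
import itertools
import networkx as nx

def junction_tree_decomposition(G):
    """Return (T, bags): a junction tree T whose integer nodes index the bags."""
    # 1. Triangulate G to obtain a chordal graph G_c.
    G_c, _ = nx.complete_to_chordal_graph(G)
    # 2. Bags = maximal cliques of the chordal graph.
    bags = [frozenset(C) for C in nx.find_cliques(G_c)]
    # 3. Weighted intersection graph I over the bags.
    I = nx.Graph()
    I.add_nodes_from(range(len(bags)))
    for i, j in itertools.combinations(range(len(bags)), 2):
        w = len(bags[i] & bags[j])
        if w >= 1:
            I.add_edge(i, j, weight=w)
    # 4. Junction tree = maximum-weight spanning tree of I.
    T = nx.maximum_spanning_tree(I, weight="weight")
    return T, bags

# Example: the treewidth of the resulting decomposition, as in (12).
G = nx.cycle_graph(5)
T, bags = junction_tree_decomposition(G)
print(max(len(B) for B in bags) - 1)   # treewidth of this decomposition
```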

[Fig. 2. Generating H-trees for graph G and its subgraphs. Rows show: the graph and its subgraphs, their junction trees, and the resulting H-trees.]

We denote by (T, B) = junction-tree(G) the algorithm that takes an undirected graph G and outputs a junction tree decomposition (T, B) of G. The junction tree decomposition is only one way of constructing a tree decomposition. In practice, some other tree decompositions may be more suitable for a given input graph. While in this paper we use the junction tree decomposition, we also use the notation (T, B) = tree-decomposition(G) to stress that the derivation applies to any choice of tree decomposition.

B. H-tree

In this section, we propose and construct a new structure called the H-tree for a given graph. A H-tree is a tuple (J_G, R), where J_G is a tree and R is the set of root nodes. The key insight behind a H-tree is to capture nodes and subgraphs in the input graph and organize them in a suitable hierarchy – from nodes (as leaves) to large subgraphs (the largest being the root nodes).

We know that the tree decomposition (T, B) of a graph captures the important subgraphs (induced by bags B ∈ B) of the graph. In the H-tree, we further decompose each of these subgraphs with its own tree decomposition, and iterate this process till all the remaining subgraphs (induced by bags) contain only a single node. We start by defining the H-tree of a complete graph.

Definition 2 (Complete graph): Let S_n denote a star graph with n + 1 nodes and n edges, where there is one center node and all edges are between the center node and every other node. The H-tree (J_G, R) of a complete graph G with n nodes is a star graph J_G = S_n, where the center node represents the single maximal clique in G and each of the leaf nodes in S_n corresponds to one node in G. Furthermore, |R| = 1 and the only element in R corresponds to the center node in J_G = S_n.

The H-tree for a complete graph of three nodes is illustrated in Fig. 2, rightmost column. In it, the unique clique in the graph, which contains nodes {3, 4, 5}, is labeled as C = (345). From the figure, it is clear that the main role of the H-tree is to explicitly include the nodes of the input graph as part of the tree.

The H-tree for a graph that is not complete is constructed recursively as described in Algorithm 1.

Definition 3 (Generic graph): The H-tree (J_G, R) for an undirected graph G is defined as the output of Algorithm 1.

Algorithm 1 takes an undirected graph G and outputs a H-tree J_G with a set of root nodes R. To understand the algorithm, let (T, B) denote a tree decomposition of graph G (line 1). The H-tree J is initialized to J = T (line 2) and the set of root nodes equals the root nodes of this tree, namely R = V(T) (line 3). For B ∈ B, let τ(B) denote the node corresponding to bag B in J. Then for each bag B ∈ B we construct a H-tree of the induced subgraph G[B] (lines 4-17).

If G[B] is not complete, we attach its root nodes to τ(B) (lines 9-15). Specifically, if (J′, R′) denotes the H-tree for the induced subgraph G[B], then we attach the graph J′ to J by linking all root nodes of J′, namely R′, to the node τ(B) (lines 13-14). To avoid cycles, we also remove the edges between the root nodes R′ in J′ (line 12).

If the induced subgraph G[B] is complete, then from Definition 2 we know that its H-tree is a star graph with a single clique node, call it C. In this case, we attach the star graph to τ(B) by merging the two nodes – C and τ(B) – into one. Doing this avoids an unnecessary edge (τ(B), C) in the H-tree.
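Before the formal listing in Algorithm 1 below, here is a rough Python sketch (ours, heavily simplified and not the authors' implementation) of the recursion just described. It reuses the hypothetical junction_tree_decomposition helper from Section V-A; node naming and the leaf-to-node map κ are handled crudely through string prefixes.

```python
import networkx as nx

def h_tree(G, prefix=""):
    """Recursively build an (H-tree, roots) pair for an undirected graph G (sketch)."""
    J = nx.Graph()
    if G.number_of_nodes() == 1 or nx.density(G) == 1.0:          # complete graph
        center = prefix + "C" + "".join(map(str, sorted(G.nodes)))
        J.add_node(center)
        for v in G.nodes:                                          # star graph S_n
            J.add_edge(center, prefix + "leaf_" + str(v))
        return J, {center}
    T, bags = junction_tree_decomposition(G)                       # line 1
    name = {t: prefix + "B" + "".join(map(str, sorted(bags[t]))) for t in T}
    J.add_nodes_from(name.values())
    J.add_edges_from((name[a], name[b]) for a, b in T.edges)       # lines 2-3
    roots = set(name.values())
    for t in T:                                                    # lines 4-17
        sub = G.subgraph(bags[t])
        J_sub, R_sub = h_tree(sub, prefix=name[t] + "/")
        if len(R_sub) == 1 and next(iter(R_sub)).split("/")[-1].startswith("C"):
            # complete bag: merge its clique node C with tau(B)
            J = nx.contracted_nodes(nx.compose(J, J_sub),
                                    name[t], next(iter(R_sub)), self_loops=False)
        else:
            J_sub.remove_edges_from(list(J_sub.subgraph(R_sub).edges))  # line 12
            J = nx.compose(J, J_sub)                                # line 13
            J.add_edges_from((name[t], r) for r in R_sub)           # line 14
    return J, roots
```

In this sketch, a leaf name ending in "leaf_v" plays the role of κ(l) = v; multiple leaves with different prefixes can map to the same input node, as in Remark 4 below.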

Algorithm 1: H-tree
  input:  Graph G
  output: H-tree (J_G, R)
  1:  (T, B) ← tree-decomposition(G)
  2:  J ← T
  3:  R ← V(T)
  4:  for each bag B in B do
  5:    if G[B] is a complete graph then
  6:      Update J:
  7:        V(J) ← V(J) ∪ B
  8:        E(J) ← E(J) ∪ {{τ(B), b}}_{b ∈ B}
  9:    else
  10:     (J′, R′) ← H-tree(G[B])
  11:     Update J:
  12:       E(J′) ← E(J′) \ E(J′[R′])
  13:       J ← J ∪ J′
  14:       E(J) ← E(J) ∪ {{τ(B), r}}_{r ∈ R′}
  15:   end
  16:   J_G ← J
  17:   return (J_G, R)
  18: end

Example. Figure 2 shows the construction of a H-tree for a graph with 5 nodes and 6 edges. Here, we have used the junction-tree algorithm to perform the tree decomposition. The first column shows the graph G and its junction tree, which has three nodes corresponding to the three cliques in the chordal graph G_c.[1] The remaining columns show the three subgraphs of G corresponding to each of the three maximal cliques in G_c, along with their junction trees and H-trees. The H-tree of each of these subgraphs is then attached to the junction tree of G to get the required H-tree for G. The H-tree for graph G is shown in the last row of the first column in Figure 2. Also illustrated are the two edges deleted (in red) when merging the two H-trees of the subgraphs to the junction tree of G.

[1] The chordal graph G_c is used in the junction tree construction and is obtained from G after graph triangulation, which in this case consists in adding the dashed blue line in Figure 2.

Remark 4 (Leaves and features): The H-tree of a graph G is such that every node in it corresponds to a subset of nodes in graph G. Further, every leaf node l in J_G corresponds to exactly one node v in G. We denote this node by κ(l) for every leaf node l of J_G. In the construction of the H-tree, we also assign the node input feature x_v to every node l in J_G for which κ(l) = v. Note that multiple leaf nodes may correspond to a single node v in the graph G, i.e., we can have κ(l) = v for many leaf nodes l in J_G. Fig. 2 illustrates the input node features by node coloring. We show the input node features being replicated on (multiple) leaf nodes of the H-tree. The input node features for all other nodes in J_G are set to 0.

C. Message Passing on H-tree

Given a graph G with input node features, we construct a H-tree (J_G, R) and perform message passing on J_G. We call this the neural tree architecture. Representation vectors are generated for each node in the H-tree J_G by aggregating representation vectors of neighboring nodes in J_G. The message passing starts with h_l^0 = x_{κ(l)} for all leaf nodes l in J_G and h_u^0 = 0 for non-leaf nodes u in J_G. These representation vectors are then updated as

  h_u^t = AGG_t( h_u^{t-1}, { h_w^{t-1} | w ∈ N_{J_G}(u) } ),   (13)

for each iteration t ∈ {1, 2, ..., T}. The aggregation function AGG_t can be modeled in numerous ways. Many of the message passing GNN architectures in the literature, such as GCN [Kipf and Welling, 2017], GraphSAGE [Hamilton et al., 2017], GIN [Xu et al., 2019], and GAT [Veličković et al., 2018], [Lee et al., 2019], can be used to perform message passing on J_G. After T iterations of message passing, we extract the label y_v for node v ∈ G by combining the representation vectors of the leaf nodes l of J_G which correspond to node v in G, i.e., κ(l) = v:

  y_v = COMB( { h_l^T | l leaf node in J_G s.t. κ(l) = v } ),   (14)

for every v ∈ V, where h_l^T denotes the representation vector generated at leaf node l in J_G after T iterations. COMB can be modeled by using any of the standard neural network models. In our experiments, we model COMB with a mean pooling function followed by a softmax. This neural tree architecture is shown in Figure 3.

[Fig. 3. Neural tree architecture for node classification.]

Remark 5 (Mutatis mutandis): The neural tree architecture is inspired by the junction tree message passing algorithm [Jordan, 2002]. In the junction tree message passing algorithm, the clique potentials are first constructed for all nodes in the junction tree (T, B) of G. This is followed by message passing between nodes on T, which updates the clique potentials. Once this converges, the marginals are computed for each node from the clique potentials. The proposed neural tree has a similar structure. The forming of clique potentials, in the junction tree algorithm, is parallel to message passing from the leaf nodes in the H-tree to the root nodes. The message passing in the junction tree algorithm is parallel to the message passing between root nodes in the H-tree J_G. And, computing the marginals from the clique potentials, in the junction tree algorithm, is parallel to the messages passing back from the root nodes to the leaf nodes in the H-tree. [Xu et al., 2020] suggests that such algorithmic alignment of the neural architecture leads to lower sample complexity and better generalizability. We leave the question of generalization power to future work.
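To ground (13)-(14), here is a minimal PyTorch sketch (ours, not the authors' code) of one neural tree message-passing iteration with a simple mean-style aggregation standing in for AGG_t, and a mean-pooling COMB readout over the leaf copies of each input node; the tensors, adjacency, and the κ bookkeeping are illustrative placeholders.

```python
import torch
import torch.nn as nn

class SimpleHTreeLayer(nn.Module):
    """One message-passing iteration on the H-tree, i.e., a simple instance of (13)."""
    def __init__(self, dim):
        super().__init__()
        self.lin_self = nn.Linear(dim, dim)
        self.lin_nbr = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: [num_htree_nodes, dim]; adj: dense 0/1 adjacency matrix of J_G.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        nbr_mean = adj @ h / deg                       # average over N_{J_G}(u)
        return torch.relu(self.lin_self(h) + self.lin_nbr(nbr_mean))

def comb_readout(h_final, leaf_groups, classifier):
    """COMB in (14): mean-pool the leaf copies of each input node, then classify."""
    logits = []
    for leaves in leaf_groups:                         # indices of leaves l with kappa(l) = v
        pooled = h_final[leaves].mean(dim=0)
        logits.append(classifier(pooled))
    return torch.stack(logits).log_softmax(dim=-1)
```

Any off-the-shelf convolution (GCN, GraphSAGE, GIN, GAT) can replace SimpleHTreeLayer, since the H-tree is just another graph as far as the aggregation step is concerned; this is exactly what the experiments in Section VII do.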

VI. EXPRESSIVE POWER OF NEURAL TREES

We now establish some theoretical results regarding the expressive power of the neural tree architecture.

We next show that neural trees can learn any compatible function, provided it is smooth enough. For simplicity, let the input node features and the representation vectors be real numbers, i.e., x_v ∈ R and h_u^t ∈ R for all v ∈ V(G) and nodes u ∈ J_G. Let us implement the aggregation function AGG_t in (13) as a shallow network:

  h_u^t = AGG_t( h_u^{t-1}, { h_w^{t-1} | w ∈ N_{J_G}(u) } )   (15)
        = ReLU( Σ_{k=1}^{N_u} a_{u,t}^k ⟨ w_{u,t}^k, h_{N̄(u)}^{t-1} ⟩ + b_{u,t}^k ),   (16)

where N̄(u) = {u} ∪ N_{J_G}(u) denotes the set containing node u and its neighbors N_{J_G}(u).[2] The representation vectors h_u^t are initialized as discussed in Section V-C. We fix a node v_0 in graph G and extract our output from v_0:

  y_{v_0} = COMB( { h_l^T | l leaf node in J_G s.t. κ(l) = v_0 } ),   (17)

where T is the number of message passing iterations. We also assume the COMB function to be a shallow network.

Let N denote the total number of parameters used in the neural tree architecture. Consider the space of functions g that map the input node features X to the output y_{v_0} in (16)-(17):

  F(G, N) = { X ↦ y_{v_0} : y_{v_0} is given by (16)-(17), for some T > 0 }.

We now show that any graph-compatible function – that is smooth enough – can be approximated by a function in F(G, N) to an arbitrary precision.

Theorem 6: Let f : [0, 1]^n → [0, 1] be a function compatible with a graph G with n nodes. Let each of the clique functions θ_C in f (see Definition 1) be 1-Lipschitz and be bounded to [0, 1]. Then, for any ε > 0, there exists a g ∈ F(G, N) such that ||f − g||_∞ < ε, while the number of parameters N is bounded by

  N = O( Σ_{u ∈ V(J_G)} (d_u − 1) ( ε / (d_u − 1) )^{−(d_u − 1)} ),   (18)

where d_u denotes the degree of node u in J_G, and the summation is over all the non-leaf nodes in J_G.
Proof: See Appendix A.

We can further develop the bound in Theorem 6 to expose the dependence of the number of parameters N on the treewidth of the tree decomposition of the graph.

Corollary 7: The number of parameters N in Theorem 6 can be upper-bounded by

  N = O( n × (tw[J_G] + 1)^{2 tw[J_G] + 3} × ε^{−(tw[J_G] + 1)} ),

where tw[J_G] denotes the treewidth of the tree-decomposition of G formed by the root nodes R of the H-tree J_G.[3]
Proof: See Appendix B.

Remark 8 (Efficient approximations): Corollary 7 shows that the number of parameters needed to obtain an ε-approximation with neural trees increases exponentially in only the treewidth of the tree-decomposition, and is linear in the number of nodes n in the graph. Thus, for graphs with bounded treewidth, neural trees will be able to approximate any graph compatible function efficiently.

Remark 9 (Comparing with shallow and deep neural networks): If we were to take a single-layer neural network, as in (16), that aggregates all the input features X to approximate f without any regard to its compatibility structure, we would need N = O(ε^{−n}) parameters to achieve the same ε-approximation [Poggio et al., 2017a]. In this case, we see N growing exponentially in the graph size n. It can be argued that the same advantage is shared by a deep neural network of depth tw[J_G] [Poggio et al., 2017a]. However, a general multi-layer perceptron would have to train many more parameters (a constant factor more) to first zone in on the exact message passing structure for computing the compatible function f. In our case, however, this structure is given by the construction of the H-tree.

Remark 10 (Data efficiency): The value of N also affects the data required to train the model: the larger the N, the more samples are required for training. In particular, Corollary 7 provides the reassuring result that if the training dataset contains graphs of small treewidth, then the amount of data required for training scales only linearly in the number of nodes n.

[2] We assume a different AGG_t function for different nodes u, for the same t. This choice is more general than our architecture in Section V-C. However, our results extend to the case where the AGG_t function is the same across nodes u in each iteration t.
[3] By construction, the root nodes R of J_G induce a subgraph T that is a tree-decomposition of G.
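To make Remarks 8-9 concrete, the following back-of-the-envelope comparison (ours; constants and lower-order factors are ignored, so the numbers are only indicative) contrasts the ε^{-n} scaling of an unstructured shallow network with the n · (tw+1)^{2tw+3} · ε^{-(tw+1)} scaling of Corollary 7, for a hypothetical bounded-treewidth graph family.

```python
# Rough parameter-count scalings, ignoring constants (illustrative only).
def shallow_unstructured(eps, n):
    return eps ** (-n)                                             # O(eps^-n), Remark 9

def neural_tree(eps, n, tw):
    return n * (tw + 1) ** (2 * tw + 3) * eps ** (-(tw + 1))       # Corollary 7

eps, tw = 0.1, 2
for n in (10, 20, 40):
    print(n, f"{shallow_unstructured(eps, n):.1e}", f"{neural_tree(eps, n, tw):.1e}")
```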

[Fig. 5. Accuracy for increasing amount of training data (% of labeled nodes). Axes: average accuracy (%) vs. nodes used for training (%).]

[Fig. 6. Accuracy vs. number of iterations, overlaid with the weighted diameter distribution. Axes: average accuracy (%) and frequency vs. number of iterations / diameter.]

VII. EXPERIMENTS

We now use the proposed neural tree architecture for the task of node classification in a 3D scene graph, and show that it outperforms standard GNN architectures. A 3D scene graph [Armeni et al., 2019], [Rosinol et al., 2020] is a hierarchical graphical model of a 3D environment. The nodes in the graph correspond to entities in the environment (e.g., objects, rooms, buildings), while edges correspond to relations among these entities (e.g., "room i contains object j"). The model is hierarchical, in that nodes are grouped into layers corresponding to different levels of abstraction (Fig. 4).

[Fig. 4. A room-object graph in the 3D scene graph.]

Dataset. We run semi-supervised node classification experiments on Stanford's 3D scene graph dataset [Armeni et al., 2019]. The dataset includes 35 3D scene graphs with verified semantic labels, each containing building, room, and object nodes in a residential unit. Since there is only a single class of building nodes (residential), we remove the building node and obtain 482 room-object graphs, where each graph contains a room and at least one object in that room, as shown in Fig. 4. The resulting dataset has 482 room nodes with 15 semantic labels, and 2338 objects with 35 labels. Each object node is connected to the room node it belongs to. In addition, we add 4920 edges to connect adjacent objects in the same room. We use the centroid and bounding box dimensions as features for each node.

Approaches and Setup. We implement the neural tree architecture with four different aggregation functions AGG_t, specified in: GCN [Kipf and Welling, 2017], GraphSAGE [Hamilton et al., 2017], GAT [Veličković et al., 2018], and GIN [Xu et al., 2019]. We use the ReLU activation function and also implement dropout at each iteration. The READ function for the standard GNN (see (2)) is implemented as a single linear layer followed by a softmax. On the other hand, the COMB function (see (14)) for neural trees is implemented as a mean pooling operation, followed by a single linear layer and a softmax. We use different READ (similarly, COMB) functions for the room nodes and the object nodes. We train the architectures using the standard cross entropy loss function. The experiments are implemented using the PyTorch Geometric library.

We randomly select 10% of the nodes for validation, 20% for testing, and remove the corresponding labels during training. The hyper-parameters of the two approaches are separately tuned based on the best validation accuracy, while using all 70% of the remaining nodes for training; see Appendix C for the details on hyper-parameter tuning.

Results and Ablation. We compare the proposed neural tree architecture with the standard GNN architectures, which do message passing on the input graph. Table I compares the test accuracies for the standard GNN architectures and the corresponding neural tree architecture, while using the same type of aggregation function. The reported test accuracies are averaged over 100 runs. We see that the neural tree architecture always yields a better prediction model than the standard GNN, for a given aggregation function.

TABLE I: TEST ACCURACY
  Model      | Input graph      | Neural Tree
  GCN        | 40.88 ± 2.28 %   | 50.63 ± 2.25 %
  GraphSAGE  | 59.54 ± 1.35 %   | 63.57 ± 1.54 %
  GAT        | 46.56 ± 2.21 %   | 62.16 ± 2.03 %
  GIN        | 49.25 ± 1.15 %   | 63.53 ± 1.38 %

To further analyze the proposed architecture, we carry out a series of experiments to see how the test accuracy varies as a function of the amount of training data and the number of message passing iterations T. For simplicity, we only show the neural tree that uses the GCN aggregation function in comparison with the standard GCN.

Figure 5 plots the average test accuracy as a function of the training data used. We vary the number of training nodes from 10% (282 nodes) to 70% (1974 nodes) of the entire dataset. The test accuracy is averaged over 10 runs. We see that the average test accuracy – for both the neural tree (NT+GCN) and GCN – improves with increasing training data. However, this increase is sharper for the neural tree architecture. With 10% of the nodes for training, the neural tree behaves slightly worse than GCN. However, the test accuracy rapidly improves for the neural tree architecture with the increase in training data, eventually outperforming GCN. This shows the higher expressive power of the proposed neural tree architecture.

We also study the impact of the number of message passing iterations T (see (13)-(14)) on the test accuracy in our experiments. Figure 6 plots the average test accuracy as a function of the number of message passing iterations T. As before, the test accuracy is averaged over 10 runs. We see that the average test accuracy increases, hits a maximum, and then decreases, with increasing number of iterations T. This holds for both the neural trees and the standard GNN. We see that the peak test accuracy attained by the neural tree is much higher than the peak attained by GCN, again validating the higher expressive power of the proposed architecture.

In order to explain this peaking behavior, we plot the histogram of diameters of the H-trees (weighted by the number of nodes) of all (room-object) scene graphs in the dataset. We see that the maximum test accuracy occurs for a T that nearly equals the average diameter. This is intuitive: when propagating messages on the H-tree, we would need the number of iterations to equal the tree diameter for each node feature to propagate across the entire tree. Therefore, our algorithm achieves the best performance when the number of iterations is roughly equal to the average diameter (shown as a vertical dashed line in Fig. 6). It must be noted here that the diameter of the H-tree is bounded by twice the treewidth of the H-tree, namely 2 × tw[J_G].

Figure 6 also shows the criticality of choosing the correct number of iterations T in training a neural tree. The performance of the trained neural tree model degrades very rapidly with any further increase in the number of iterations T. This degradation is much more severe for the neural tree architecture than for the standard GNN. We attribute this to over-parametrization in the neural tree architecture. The increase in the number of parameters, with the number of iterations T, is much more drastic for the neural trees than for the standard GNN.
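For readers who want to reproduce the setup described under Approaches and Setup, here is a heavily simplified PyTorch Geometric sketch (ours, not the released code) of the standard-GNN baseline: a two-layer GCN with a linear READ layer and softmax, trained with cross entropy on the labeled nodes only; the data object, masks, and hyper-parameters are placeholders.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNNodeClassifier(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, dropout=0.5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.read = torch.nn.Linear(hidden_dim, num_classes)   # READ in (2)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = F.relu(self.conv2(x, edge_index))
        return self.read(x).log_softmax(dim=-1)

def train(model, data, train_mask, epochs=200, lr=0.01):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        out = model(data.x, data.edge_index)
        loss = F.nll_loss(out[train_mask], data.y[train_mask])  # cross entropy on labeled nodes
        loss.backward()
        opt.step()
    return model
```

The same skeleton applies to the neural tree variant by running the convolutions on the H-tree instead of the input graph and replacing the final linear READ with the leaf-pooling COMB readout sketched in Section V-C.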

TABLE II REFERENCES TIME REQUIREMENTS:COMPUTE,TRAIN, AND TEST [Anderson et al., 2016] Anderson, P., Fernando, B., Johnson, M., and Gould, Model Computing H-tree Training (per epoch) Testing S. (2016). Spice: Semantic propositional image caption evaluation. In European Conf. on Computer Vision (ECCV), pages 382–398. GCN n/a 0.072 s 0.048 s [Armeni et al., 2019] Armeni, I., He, Z.-Y., Gwak, J., Zamir, A. R., Fischer, NT+GCN 2.08 s 0.305 s 0.058 s M., Malik, J., and Savarese, S. (2019). 3D scene graph: A structure for unified semantics, 3D space, and camera. In Intl. Conf. on Computer Vision increase in the number of parameters, with the number of (ICCV), pages 5664–5673. [Azizian and Lelarge, 2021] Azizian, W. and Lelarge, M. (2021). Expressive iterations T , is much more drastic for the neural trees than for power of invariant and equivariant graph neural networks. In Intl. Conf. on the standard GNN. Learning Representations (ICLR). Time Requirements. Table II reports the time required for [Becker and Geiger, 1996] Becker, A. and Geiger, D. (1996). A sufficiently fast algorithm for finding close to optimal junction trees. In Conf. on training and testing our model over the 3D scene graph dataset. Uncertainty in Artificial Intelligence (UAI), pages 81–89. Also reported is the cumulative time required to compute and [Bodlaender, 2006] Bodlaender, H. L. (2006). Treewidth: Characterizations, construct the H-trees for the 482 room-object scene graphs. applications, and computations. In Graph-Theoretic Concepts in Computer Science, pages 1–14. Springer Berlin Heidelberg. It takes about 2 seconds to construct H-trees for all the 482 [Bodlaender and Koster, 2010] Bodlaender, H. L. and Koster, A. M. (2010). room-object scene graphs. The reported times are measured Treewidth computations I: Upper bounds. Information and Computation, when implementing the respective models on an Nvidia Quadro 208(3):259 – 275. P4000 GPU processor. [Bodlaender and Koster, 2011] Bodlaender, H. L. and Koster, A. M. (2011). Treewidth computations II: Lower bounds. Information and Computation, In Table II, we see that neural tree (NT+GCN) takes about 209(7):1103 – 1119. 4.25x more time to train than the standard GCN.4 We observe [Bouritsas et al., 2021] Bouritsas, G., Frasca, F., Zafeiriou, S., and Bronstein, similar results when using other GNN models (see Appendix C). M. M. (2021). Improving graph neural network expressivity via subgraph isomorphism counting. arXiv preprint arXiv:2006.09252. This is expected because the H-tree is much larger than the input [Bronstein et al., 2017] Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., graph, and as a consequence, the neural tree architecture needs and Vandergheynst, P. (2017). Geometric deep learning: going beyond to train more weights than a standard GNN. The testing time euclidean data. IEEE Signal Process. Mag., 34(4):18–42. [Busbridge et al., 2019] Busbridge, D., Sherburn, D., Cavallo, P., and Ham- for the neural tree, on the other hand, remains comparable to merla, N. Y. (2019). Relational graph attention networks. arXiv preprint the standard GNN architectures. This makes the more accurate arXiv:1904.05811. neural trees architecture amenable for real-time deployment. [Chandrasekaran et al., 2008] Chandrasekaran, V., Srebro, N., and Harsha, P. (2008). Complexity of inference in graphical models. In Conf. on Uncertainty in Artificial Intelligence (UAI), page 70–78. VIII.CONCLUSION [Chen et al., 2019] Chen, Z., Villar, S., Chen, L., and Bruna, J. (2019). 
On the We proposed a novel graph neural network architecture – equivalence between graph isomorphism testing and function approximation the neural tree. In it, we perform neural message passing, with gnns. In Advances in Neural Information Processing Systems (NIPS), volume 32. not on the input graph, but over a constructed H-tree. The [Cooper, 1990] Cooper, G. (1990). The computational complexity of prob- H-tree encodes nodes and subgraphs of the input graph in a abilistic inference using Bayesian belief networks. Artificial Intelligence, hierarchical manner. Performing neural message passing on 42(2-3):393–405. [Defferrard et al., 2016] Defferrard, M., Bresson, X., and Vandergheynst, P. the H-tree allows us to get representation vectors for all the (2016). Convolutional neural networks on graphs with fast localized spectral nodes and the important subgraphs of the input graph. filtering. In Advances in Neural Information Processing Systems (NIPS), We showed that the neural tree architecture can approximate volume 29, pages 3844–3852. any graph compatible function. The relevance of this result is [Diestel, 2005] Diestel, R. (2005). Graph Theory. Springer, 3ed edition. [Duvenaud et al., 2015] Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., two-fold. First, graph compatible functions arise naturally in Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. (2015). probabilistic graphical models, hence the result establishes a Convolutional networks on graphs for learning molecular fingerprints. In stronger connection between graph neural networks and exact Advances in Neural Information Processing Systems (NIPS), volume 28, pages 2224–2232. inference in graphical models. Second, the result guarantees [Garg et al., 2020] Garg, V., Jegelka, S., and Jaakkola, T. (2020). General- that the number of parameters required to obtain a desired ization and representational limits of graph neural networks. In Intl. Conf. approximation grows linearly with the number of nodes and on (ICML), volume 119, pages 3419–3430. [Gilmer et al., 2017] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., exponentially in the treewidth of the input graph. This renders and Dahl, G. E. (2017). Neural message passing for quantum chemistry. the proposed architecture more parsimonious for large graphs In Intl. Conf. on Machine Learning (ICML), volume 70, pages 1263–1272. with small treewidth. We also saw that the neural tree, unlike [Gori et al., 2005] Gori, M., Monfardini, G., and Scarselli, F. (2005). A new model for learning in graph domains. In IEEE Intl. J. Conf. Neural Netw., the local message passing GNN architectures, can emulate the volume 2, pages 729–734. junction tree algorithm used for exact inference on probabilistic [Guo et al., 2018] Guo, M., Chou, E., Huang, D.-A., Song, S., Yeung, S., graphical models. and Fei-Fei, L. (2018). Neural graph matching networks for fewshot 3D Application of the neural tree architecture to the problem of action recognition. In European Conf. on Computer Vision (ECCV), pages 673–689. node classification in 3D scene graphs shows promising results. [Hamilton et al., 2017] Hamilton, W. L., Ying, R., and Leskovec, J. (2017). In future work, we look at overcoming the complexity barrier Inductive representation learning on large graphs. In Advances in Neural posed by the treewidth of the graph, investigate a neural tree Information Processing Systems (NIPS), page 1025–1035. 
[Hanocka et al., 2019] Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleish- architecture for the task of graph classification, and establish man, S., and Cohen-Or, D. (2019). MeshCNN: a network with an edge. generalizability of the neural tree architecture. ACM Trans. Graph., 38(4):1–12. [Henaff et al., 2015] Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep 4For GCN and NT+GCN training we require no more than 1000 epochs convolutional networks on graph-structured data. arXiv preprint of SGD to attain reasonable convergence. arXiv:1506.05163. 10

[Jensen and Jensen, 1994] Jensen, F. V. and Jensen, F. (1994). Optimal junction trees. In Conf. on Uncertainty in Artificial Intelligence (UAI), pages 360-366.
[Jin et al., 2018] Jin, W., Barzilay, R., and Jaakkola, T. (2018). Junction tree variational autoencoder for molecular graph generation. In Intl. Conf. on Machine Learning (ICML), volume 80, pages 2323-2332.
[Johnson et al., 2015] Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. (2015). Image retrieval using scene graphs. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3668-3678.
[Jordan, 2002] Jordan, M. (2002). An introduction to probabilistic graphical models. Unpublished Lecture Notes.
[Karpathy and Fei-Fei, 2015] Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[Kim et al., 2019] Kim, U.-H., Park, J.-M., Song, T.-J., and Kim, J.-H. (2019). 3-D scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE Trans. Cybern., PP:1-13.
[Kipf and Welling, 2017] Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In Intl. Conf. on Learning Representations (ICLR).
[Koller and Friedman, 2009] Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
[Kolotouros et al., 2019] Kolotouros, N., Pavlakos, G., and Daniilidis, K. (2019). Convolutional mesh regression for single-image human shape reconstruction. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[Krishna et al., 2016] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332.
[Kurenkov et al., 2020] Kurenkov, A., Martín-Martín, R., Ichnowski, J., Goldberg, K., and Savarese, S. (2020). Semantic and geometric modeling with neural message passing in 3D scene graphs for hierarchical mechanical search. arXiv preprint arXiv:2012.04060.
[Lee et al., 2019] Lee, J. B., Rossi, R. A., Kim, S., Ahmed, N. K., and Koh, E. (2019). Attention models in graphs: A survey. ACM Trans. Knowl. Discov. Data, 13(6).
[Li et al., 2019] Li, G., Müller, M., Thabet, A., and Ghanem, B. (2019). DeepGCNs: Can GCNs go as deep as CNNs? In Intl. Conf. on Computer Vision (ICCV).
[Li et al., 2017] Li, Y., Ouyang, W., Zhou, B., Wang, K., and Wang, X. (2017). Scene graph generation from objects, phrases and region captions. In International Conference on Computer Vision (ICCV).
[Lu et al., 2016] Lu, C., Krishna, R., Bernstein, M., and Li, F.-F. (2016). Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852-869.
[Maron et al., 2019a] Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. (2019a). Provably powerful graph networks. In Advances in Neural Information Processing Systems (NIPS), volume 32, pages 2156-2167.
[Maron et al., 2019b] Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. (2019b). Invariant and equivariant graph networks. In Intl. Conf. on Learning Representations (ICLR).
[Monti et al., 2019] Monti, F., Frasca, F., Eynard, D., Mannion, D., and Bronstein, M. (2019). Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673.
[Morris et al., 2019] Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. (2019). Weisfeiler and Leman go neural: Higher-order graph neural networks. Nat. Conf. on Artificial Intelligence (AAAI), 33(01):4602-4609.
[Poggio et al., 2017a] Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017a). Why and when can deep - but not shallow - networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput., 14:503-519.
[Poggio et al., 2017b] Poggio, T. A., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. (2017b). Why and when can deep - but not shallow - networks avoid the curse of dimensionality: A review. arXiv preprint arXiv:1611.00740.
[Ren et al., 2015] Ren, M., Kiros, R., and Zemel, R. S. (2015). Image question answering: A visual semantic embedding model and a new dataset. arXiv preprint arXiv:1505.02074.
[Rosinol et al., 2020] Rosinol, A., Gupta, A., Abate, M., Shi, J., and Carlone, L. (2020). 3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans. In Robotics: Science and Systems (RSS).
[Roth, 1996] Roth, D. (1996). On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273-302.
[Satorras and Estrach, 2018] Satorras, V. G. and Estrach, J. B. (2018). Few-shot learning with graph neural networks. In Intl. Conf. on Learning Representations (ICLR).
[Scarselli et al., 2008] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The graph neural network model. IEEE Trans. Neural Netw., 20(1):61-80.
[Scarselli et al., 2009] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2009). Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1):81-102.
[Scarsellia et al., 2018] Scarsellia, F., Tsoi, A., and Hagenbuchner, M. (2018). The Vapnik–Chervonenkis dimension of graph and recursive neural networks. Neural Networks, 108:248-259.
[Shchur et al., 2019] Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. (2019). Pitfalls of graph neural network evaluation. In Relational Representation Learning Workshop (R2L), NeurIPS.
[Thomas and Green, 2009] Thomas, A. and Green, P. J. (2009). Enumerating the junction trees of a decomposable graph. J. Comput. Graph. Stat., 18(4):930-940.
[Veličković et al., 2018] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In Intl. Conf. on Learning Representations (ICLR).
[Wald et al., 2020] Wald, J., Dhamo, H., Navab, N., and Tombari, F. (2020). Learning 3D semantic scene graphs from 3D indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3961-3970.
[Xu et al., 2017] Xu, D., Zhu, Y., Choy, C. B., and Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3097-3106.
[Xu et al., 2019] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In Intl. Conf. on Learning Representations (ICLR).
[Xu et al., 2020] Xu, K., Li, J., Zhang, M., Du, S. S., Kawarabayashi, K., and Jegelka, S. (2020). What can neural networks reason about? In Intl. Conf. on Learning Representations (ICLR).
[Yang et al., 2018] Yang, J., Lu, J., Lee, S., Batra, D., and Parikh, D. (2018). Graph R-CNN for scene graph generation. In European Conf. on Computer Vision (ECCV).
[Ying et al., 2018] Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems (NIPS), pages 4800-4810.
[Zellers et al., 2017] Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. (2017). Neural motifs: Scene graph parsing with global context. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
[Zhang et al., 2020] Zhang, Y., Chen, X., Yang, Y., Ramamurthy, A., Li, B., Qi, Y., and Song, L. (2020). Efficient probabilistic logic reasoning with graph neural networks. arXiv preprint arXiv:2001.11850.

APPENDIX

A. Proof of Theorem 6

The proof is divided into four sub-sections. Here is a brief outline:

1. In Section A1, we first prove an aggregation lemma. It (roughly) states the following: if the representation vectors at the root nodes of the H-tree J_G are {h_r}_{r ∈ R} at some iteration t, then in finitely many more message passing iterations it is possible to output a label y_{v_0} = Σ_{r ∈ R} h_r.

2. In Section A2, we then prove that any G-compatible function f can be written as a sum f(X) = Σ_{r ∈ R} γ_r of component functions γ_r.

3. In Section A3, we establish that the component functions γ_r have a compositional structure that matches the sub-tree T_r of the H-tree J_G formed by the root node r and its descendants. This helps in the efficient computation of the component function γ_r on the sub-tree T_r.

4. The goal is to first estimate each component γ_r, by message passing on T_r, and then aggregate by applying the aggregation lemma. In Section A4, we put it all together to argue that it is indeed possible to approximate any (adequately smooth and bounded) compatibility function f, to arbitrary precision ε, by the message passing described in (16). We obtain a bound on the number of parameters N required to approximate any such function in Section A4.

1) Aggregation: Let the COMB function be a simple average:
$$ y_{v_0} = \text{COMB}\left( \{ h_l^T \mid l \text{ leaf node in } J_G \text{ s.t. } \kappa(l) = v_0 \} \right) \triangleq \frac{1}{|\{ l \mid \kappa(l) = v_0 \}|} \sum_{l : \kappa(l) = v_0} h_l^T, \tag{19} $$
for some $T$, where the index $l$ runs over the set of leaf nodes in $J_G$. We first prove the following lemma.

Lemma 11 (Aggregation): Let $h_r^t$ denote the representation vectors of the root nodes $r \in R$ at some iteration $t$. If $h_r^t \in [0,1]$ for all $r \in R$ and $\sum_{r \in R} h_r^t \in [0,1]$, then there exist $t_0$ message passing iterations such that
$$ y_{v_0} = \sum_{r \in R} h_r^t, \tag{20} $$
for $T = t + t_0$. Further, the parameters used in this message passing and the number of iterations $t_0$ do not depend on $\{h_r^t\}_{r \in R}$.

Proof: We first make a few assertions about the message passing described in (16) in the paper. The proof of the lemma follows directly from them. The assertions are self-evident, and we only give a one-line descriptive proof after each statement.

Assertion 1. Let $(v, u)$ be an edge in the H-tree $J_G$. If $h_v^{t-1} \in [0,1]$, then there exist parameters $N_u$, $a_{u,t}^k$, $b_{u,t}^k$, and $w_{u,t}^k$ in (16) such that
$$ h_u^t = \text{AGG}_t\left( h_u^{t-1}, \{ h_w^{t-1} \mid w \in N_{J_G}(u) \} \right) \tag{21} $$
$$ = \text{ReLU}\left( \sum_{k=1}^{N_u} a_{u,t}^k \langle w_{u,t}^k, h_{\bar{N}(u)}^{t-1} \rangle + b_{u,t}^k \right) \tag{22} $$
$$ = \text{ReLU}\left( h_v^{t-1} \right) = h_v^{t-1}. \tag{23} $$
The last equality holds only because $h_v^{t-1} \in [0,1]$.

Assertion 2. Let $(v, u)$ be an edge in $J_G$. If $h_u^{t-1} + h_v^{t-1} \in [0,1]$, then there exist parameters $N_u$, $a_{u,t}^k$, $b_{u,t}^k$, and $w_{u,t}^k$ in (16) such that
$$ h_u^t = \text{AGG}_t\left( h_u^{t-1}, \{ h_w^{t-1} \mid w \in N_{J_G}(u) \} \right) \tag{24} $$
$$ = \text{ReLU}\left( \sum_{k=1}^{N_u} a_{u,t}^k \langle w_{u,t}^k, h_{\bar{N}(u)}^{t-1} \rangle + b_{u,t}^k \right) \tag{25} $$
$$ = \text{ReLU}\left( h_u^{t-1} + h_v^{t-1} \right) = h_u^{t-1} + h_v^{t-1}. \tag{26} $$
The last equality holds only because $h_u^{t-1} + h_v^{t-1} \in [0,1]$.

Assertion 3. Let $h_r^t$ denote the representation vectors at the root nodes $r \in R$ of the H-tree $J_G$ at some iteration $t$. If $h_r^t \in [0,1]$ and $\sum_{r \in R} h_r^t \in [0,1]$, then for any $r_0 \in R$ there exist $t_0$ message passing iterations, for some $t_0 > 0$, on the root nodes in $J_G$ such that $h_{r_0}^{t+t_0} = \sum_{r \in R} h_r^t$. Further, the parameters used in this message passing are independent of $\{h_r^t\}_{r \in R}$. This can be established by looking at $\mathcal{T} = J_G[R]$ as a tree rooted at $r_0$ and performing message aggregation from the leaf nodes of $\mathcal{T}$ to the root node $r_0$ using Assertion 2.

Assertion 4. If $h_r^t \in [0,1]$ for some $r \in R$, then there exist $t_0$ message passing iterations from the root node $r$ to all the leaf nodes $l$ in the H-tree $J_G$ such that $h_l^{t+t_0} = h_r^t$ for all leaf nodes $l$. This can be done by using Assertion 1 and successively passing the representation vector $h_r^t$ from $r$ to all the leaf nodes $l$ in $J_G$.

From Assertions 3 and 4 it is clear that, given $\{h_r^t\}_{r \in R}$ at some $t$ (bounded in $[0,1]$ as described in the statement of the lemma), there exist $t_0$ message passing iterations such that $h_l^{t+t_0} = \sum_{r \in R} h_r^t$ at all the leaf nodes $l$ in $J_G$. Since the COMB operation computes a simple average (see (19)), we have the result. □
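Assertions 1 and 2 rely only on the fact that ReLU acts as the identity on $[0,1]$. The following minimal NumPy sketch is our illustration, not the paper's implementation: representations are treated as scalars for simplicity, and the weight layout mirrors (22). It checks that a single unit of the form in (22) can copy a neighbor's representation, or add two of them, provided the result stays in $[0,1]$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def agg_unit(h_bar, w, a=1.0, b=0.0):
    # One unit of the shallow update in (22): ReLU(a * <w, h_bar> + b).
    return relu(a * np.dot(w, h_bar) + b)

# Concatenated neighborhood vector [h_u, h_v, h_w], with entries in [0, 1].
h_bar = np.array([0.3, 0.6, 0.9])

# Assertion 1: copy h_v by placing weight 1 on its coordinate and 0 elsewhere.
copy_v = agg_unit(h_bar, w=np.array([0.0, 1.0, 0.0]))
assert np.isclose(copy_v, 0.6)

# Assertion 2: add h_u and h_v; their sum 0.9 stays in [0, 1], so ReLU is the identity.
sum_uv = agg_unit(h_bar, w=np.array([1.0, 1.0, 0.0]))
assert np.isclose(sum_uv, 0.9)
```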

2) Factorization: Next, we show that any compatibility function
$$ f(X) = \sum_{C \in \mathcal{C}(G)} \theta_C(x_C), \tag{27} $$
can be broken down into component functions $\{\gamma_r\}_{r \in R}$ such that
$$ f(X) = \sum_{r \in R} \gamma_r, \tag{28} $$
where
$$ \gamma_r = \sum_{C \in \mathcal{C}_r} \theta_C(x_C), \tag{29} $$
for all $r \in R$; the $\mathcal{C}_r$ are subsets of $\mathcal{C}(G)$ which form its partition, and $R$ is the set of root nodes in the H-tree $J_G$. (We omit the explicit dependence of the functions $\gamma_r$ on $X$ to ease the notation.)

Lemma 12 (Factorization): Let $f$ be a graph compatible function given in (27) with its clique functions $\theta_C$. Then, for every $r \in R$ there exists a subset $\mathcal{C}_r \subset \mathcal{C}(G)$ such that
$$ \gamma_r = \sum_{C \in \mathcal{C}_r} \theta_C(x_C), \tag{30} $$
and $f(X) = \sum_{r \in R} \gamma_r$. Further, the collection of subsets $\{\mathcal{C}_r\}_{r \in R}$ forms a partition of $\mathcal{C}(G)$, i.e., $\mathcal{C}_r \cap \mathcal{C}_{r'} = \emptyset$ whenever $r \neq r'$ and $\cup_{r \in R} \mathcal{C}_r = \mathcal{C}(G)$.

Proof: Let $f$ be a graph compatible function given in (27) with its clique functions $\theta_C$, and let
$$ (\mathcal{T}, \mathcal{B}) = \text{tree-decomposition}(G) \tag{31} $$
be the tree decomposition of graph $G$. Note that the set of root nodes $R$ in the H-tree $J_G$ is in fact all the nodes in $\mathcal{T}$, namely $R = \mathcal{V}(\mathcal{T})$. Further, for every $r \in R$, $B_r \in \mathcal{B}$ is a bag of nodes $B_r \subset \mathcal{V}(G)$ associated with $r$.

It is known that for any clique $C$ in graph $G$, i.e., $C \in \mathcal{C}(G)$, there exists an $r \in R$ such that all nodes in $C$ are in the bag $B_r$, i.e., $\mathcal{V}(C) \subset B_r$ [Diestel, 2005]. However, it is possible that two bags $B_r$ and $B_{r'}$, for $r \neq r'$, may contain all the nodes of the same clique $C$.

Ideally, we would define
$$ \mathcal{C}_r \triangleq \{ C \in \mathcal{C}(G) \mid \mathcal{V}(C) \subset B_r \}, \tag{32} $$

which is the set of all cliques $C$ in $G$ whose nodes are all contained in the bag $B_r$, and define the functions $\gamma_r$ to be
$$ \gamma_r = \sum_{C \in \mathcal{C}_r} \theta_C(x_C), \tag{33} $$
for all $r \in R$. However, this can lead the sum $\sum_{r \in R} \gamma_r$ to overestimate the function $f$, because two bags $B_r$ and $B_{r'}$ may contain all the nodes of the same clique $C$.

In order to avoid double counting of clique functions, we order the nodes in $R$ as $R = \{r_1, r_2, \ldots, r_{|R|}\}$. We then iterate over these ordered nodes of the tree decomposition to generate $\mathcal{C}_{r_k}$ and $\gamma_{r_k}$ (for $k = 1, 2, \ldots, |R|$) as follows. Initialize $\mathcal{M}_1 = \emptyset$ and iterate over $k = 1, 2, \ldots, |R|$:

$$ \mathcal{C}_{r_k} = \{ C \in \mathcal{C}(G) \setminus \mathcal{M}_k \mid \mathcal{V}(C) \subset B_{r_k} \}, \tag{34} $$
$$ \mathcal{M}_{k+1} = \mathcal{M}_k \cup \mathcal{C}_{r_k}, \tag{35} $$
and set
$$ \gamma_{r_k} = \sum_{C \in \mathcal{C}_{r_k}} \theta_C(x_C), \tag{36} $$
for $k = 1, 2, \ldots, |R|$. This procedure ensures that we do not overestimate $f$ and have $f(X) = \sum_{r \in R} \gamma_r$. Furthermore, the collection $\{\mathcal{C}_r\}_{r \in R}$, by its very construction in (34), is pairwise disjoint and spans the entire $\mathcal{C}(G)$, thereby forming its partition. □
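The construction in (34)-(36) is a greedy sweep over the ordered bags: each clique is charged to the first bag that contains it and is then marked as used. The Python sketch below is illustrative only; `cliques`, `ordered_bags`, and `theta` are hypothetical stand-ins for $\mathcal{C}(G)$, the ordered bags $B_{r_1}, \ldots, B_{r_{|R|}}$, and the clique functions $\theta_C$, with cliques represented as frozensets of node ids.

```python
def partition_cliques(cliques, ordered_bags):
    """Assign every clique to the first bag (in the given order) that contains it.

    cliques: list of frozensets of node ids (stands in for C(G)).
    ordered_bags: list of (root_id, set_of_nodes) pairs, ordered as r_1, ..., r_|R|.
    Returns a dict root_id -> list of cliques, i.e., the partition {C_r}.
    """
    used = set()          # M_k: cliques already assigned to an earlier bag
    partition = {}
    for root_id, bag in ordered_bags:
        assigned = [C for C in cliques if C not in used and C <= bag]  # eq. (34)
        used.update(assigned)                                          # eq. (35)
        partition[root_id] = assigned
    return partition

def gamma(root_id, partition, theta, x):
    # Component function gamma_{r_k} in (36): sum of clique potentials in C_{r_k}.
    return sum(theta[C](x) for C in partition[root_id])
```

Every clique ends up in exactly one $\mathcal{C}_{r_k}$, which is precisely the pairwise-disjoint, spanning property used to conclude the proof.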

[Fig. 7: three panels showing an example graph with its clique functions, its H-tree $J_G$, and the directed H-tree $\vec{J}_G$.] Fig. 7. Shows the H-tree $J_G$ and the directed H-tree $\vec{J}_G$ of a graph. Computation of a compatible function $f$ is shown on the H-tree.

3) Compositional Structure: Fig. 7 illustrates the computation of a compatible function on the H-tree. We see how the computation of $f$ splits as $f = \gamma_{r_1} + \gamma_{r_2} + \gamma_{r_3}$, where $\gamma_{r_1} = \theta_{12} + \theta_{13}$, $\gamma_{r_2} = \theta_{24}$, and $\gamma_{r_3} = \theta_{345}$. It is interesting to note that the functions $\gamma_r$, further, have a compositional structure that matches the sub-tree induced by the root node $r$ and its descendants. For example, the compositional structure of $\gamma_{r_1}$ matches the sub-tree formed by the root node $r_1$ and its descendants in the H-tree $J_G$. This turns out to be true in general for any compatible function $f$ and its factorization $\{\gamma_r\}_{r \in R}$ (in Lemma 12).

In order to make this precise, we introduce a few definitions that are inspired by [Poggio et al., 2017b], [Poggio et al., 2017a]. Let $\vec{D}$ be a directed acyclic graph (DAG) with a single root node $\alpha$, i.e., all the directed paths in $\vec{D}$ end at $\alpha$. We use the term DAG to refer to a single-root DAG in this section.

Definition 13: A function $f : X = (x_i)_{i \in [n]} \to f(X) \in \mathbb{R}$ is said to have a compositional structure that matches a DAG $\vec{D}$, with root node $\alpha$, if the following holds:
1. Each leaf node $l$ of $\vec{D}$ embeds one component of the input feature, i.e., $h_l = x_i$ for some $i \in [n]$.
2. For every non-leaf node $u$ there exists some function $H_u$ such that
$$ h_u = H_u\left( \{ h_w \mid w \in N_{\text{in}}(u) \} \right), \tag{37} $$
where $N_{\text{in}}(u)$ denotes the set of all incoming neighbors of node $u$.
3. $f(X) = h_\alpha$, where $\alpha$ is the single root node of $\vec{D}$.

Let $T_r$ denote the sub-tree of $J_G$ induced by the root $r \in R$ and its descendants in $J_G$. Further, let $\vec{T}_r$ denote a directed version of $T_r$ in which every edge $e$ in $T_r$ is turned into a directed edge pointing in the direction of the root node $r$. Note that $\vec{T}_r$ is a DAG with node $r$ functioning as the single root node.

We now show that the components $\{\gamma_r\}_{r \in R}$ of the compatible function $f$ in Lemma 12 have a compositional structure that matches $\vec{T}_r$.

Lemma 14 (Compositional Structure): The function $\gamma_r$ in Lemma 12 has a compositional structure that matches the directed sub-tree $\vec{T}_r$, for all $r \in R$.

Proof: In Lemma 12, the function $\gamma_r$ is given by
$$ \gamma_r = \sum_{C \in \mathcal{C}_r} \theta_C(x_C), \tag{38} $$
where $\mathcal{C}_r$ is given by
$$ \mathcal{C}_r = \{ C \in \mathcal{C}(G) \setminus \mathcal{M} \mid \mathcal{V}(C) \subset B_r \}, \tag{39} $$
for some set $\mathcal{M} \subset \mathcal{C}(G)$. The set $\mathcal{C}_r$ can be thought of as a collection of cliques in the subgraph of $G$ induced by the bag $B_r$, namely $G[B_r]$. Therefore, the function $\gamma_r$ is a compatible function on $G[B_r]$.

Note that, in Lemma 12, we showed that a $G$-compatible function can be factored as a sum of $|R|$ functions, call them $\{\gamma_r\}_{r \in R}$, where $R$ is the set of nodes in the tree decomposition $(\mathcal{T}, \mathcal{B})$. We have now argued that the functions $\gamma_r$ are compatible functions on the subgraphs $G[B_r]$.

Note that the set of all children of node $r$ in the sub-tree $T_r$ (of the H-tree $J_G$) forms a tree decomposition of $G[B_r]$. This indicates, by Lemma 12, that the function $\gamma_r$ should also split as a sum of functions, one corresponding to each node in the tree decomposition of $G[B_r]$.

Thus, by successively applying Lemma 12, we see that the compositional structure of $\gamma_r$ matches the directed sub-tree $\vec{T}_r$ constructed out of the H-tree $J_G$. □
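The notion of compositional structure in Definition 13 can be read as an evaluation rule: process the DAG in topological order, embed the inputs at the leaves, and apply $H_u$ at every internal node. The sketch below is our illustration under that reading; the dictionary-based DAG representation and the toy example are hypothetical and not taken from the paper.

```python
def evaluate_compositional(dag_in_edges, leaf_values, H, topo_order, root):
    """Evaluate a function with a compositional structure matching a single-root DAG.

    dag_in_edges: dict node -> list of incoming neighbors N_in(node).
    leaf_values: dict leaf node -> input component x_i embedded at that leaf.
    H: dict non-leaf node -> function of the in-neighbor values (eq. (37)).
    topo_order: nodes listed so that every node appears after its in-neighbors.
    root: the single root node alpha; the returned value is h_alpha = f(X).
    """
    h = dict(leaf_values)  # h_l = x_i at the leaves (Definition 13, item 1)
    for u in topo_order:
        if u in h:         # leaf, already embedded
            continue
        h[u] = H[u]([h[w] for w in dag_in_edges[u]])  # item 2: h_u = H_u({h_w})
    return h[root]         # item 3: f(X) = h_alpha

# Toy example: f(x1, x2) = x1 * x2 on a DAG with leaves 'l1', 'l2' and root 'a'.
f_val = evaluate_compositional(
    dag_in_edges={'a': ['l1', 'l2']},
    leaf_values={'l1': 0.5, 'l2': 0.8},
    H={'a': lambda vals: vals[0] * vals[1]},
    topo_order=['l1', 'l2', 'a'],
    root='a')
assert abs(f_val - 0.4) < 1e-12
```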

4) Approximation: Lemmas 11 and 12 suggest that, in order to approximate a compatible function $f(X) = \sum_{C \in \mathcal{C}(G)} \theta_C(x_C)$, with $f$ and $\theta_C$ bounded between $[0,1]$, it suffices to generate representation vectors
$$ h_r^t \approx \gamma_r = \sum_{C \in \mathcal{C}_r} \theta_C(x_C), \tag{40} $$
at each root node $r \in R$ of the H-tree $J_G$, for some $t$. The approximation in (40) must be such that
$$ \left| \sum_{r \in R} h_r^t - \sum_{r \in R} \gamma_r \right| = \left| \sum_{r \in R} h_r^t - f(X) \right| < \epsilon. \tag{41} $$
Once such representation vectors $h_r^t$ are generated at the root nodes of the H-tree, by Lemma 11, their sum can be propagated to generate the node label $y_{v_0} = \sum_{r \in R} h_r^t$, with message passing that is independent of the function being approximated. Next, we show that the message passing defined in (16) in the main paper can indeed produce the approximation given in (41). The number of parameters required to attain this approximation will be an upper bound on $N$.

To prove this, we consider a directed version of the H-tree $J_G$, where each edge in $J_G$ is turned into a directed edge pointing in the direction that leads to the root nodes $R$ in $J_G$. We also remove the edges between the root nodes $R$, and add another final node that aggregates information from all the root nodes. We call this final node the aggregator and denote it by $\alpha$. We call this directed graph $\vec{J}_G$. A directed H-tree graph is illustrated in Figure 7; the red colored edges between root nodes show the deleted edges between the root nodes $R$ in $J_G$ used to obtain $\vec{J}_G$.

We assume that the messages propagate only in one direction, i.e., from the leaf nodes, where the input node features are embedded, to the aggregator node $\alpha$. We implement a shallow neural network at every non-leaf node in $\vec{J}_G$, which takes input from all its incoming edges and propagates its output through its single outgoing edge, directed towards the root nodes. This can be implemented in the original message passing (16) by setting the weight (i.e., parameter $w_{u,t}^k$) components corresponding to the parent node in the directed $\vec{J}_G$ to zero. The final aggregation layer in $\vec{J}_G$ is only for mathematical purposes, so that we can prove an $\epsilon$-approximation result, as in (41).

With this, in the new message passing architecture on $\vec{J}_G$, each non-leaf node $u$ in $\vec{J}_G$ implements the following shallow neural network:
$$ h_u = \text{ReLU}\left( \sum_{k=1}^{N_u} a_u^k \langle w_u^k, h_{N_{\text{in}}(u)} \rangle + b_u^k \right), \tag{42} $$
where $h_{N_{\text{in}}(u)} \triangleq (h_{u'} \mid u' \in N_{\text{in}}(u))$ denotes the vector formed by concatenating the representation vectors $h_{u'}$ of all nodes $u'$ that have an incoming edge to $u$ in $\vec{J}_G$. Here, $a_u^k$ and $b_u^k$ are constants and $w_u^k$ is a vector of size $d_u - 1$, which is the total number of incoming links to node $u$ in $\vec{J}_G$, where $d_u$ is the total number of links that node $u$ has in $J_G$. Thus, for every non-leaf node $u \in \vec{J}_G$ we have $(d_u - 1) \times N_u$ parameters that model the shallow network. The aggregator node generates the output by simply summing the representation vectors at the root nodes.
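For concreteness, the sketch below implements one per-node shallow ReLU network of the form (42) in NumPy and produces the aggregator output by summing the root representations, as described above. It is an illustration only: the random parameters, the function names, and the tiny two-leaf H-tree are our own and are not taken from the paper.

```python
import numpy as np

def shallow_node(h_in, a, W, b):
    """Per-node shallow network of eq. (42): ReLU( sum_k a_k <w_k, h_in> + b_k ).

    h_in: concatenated in-neighbor representations, shape (d_u - 1,).
    a, b: shape (N_u,);  W: shape (N_u, d_u - 1), whose rows are the vectors w_u^k.
    """
    return max(0.0, float(np.sum(a * (W @ h_in) + b)))

# Tiny directed H-tree: two leaves feed a single root node r; r feeds the aggregator.
rng = np.random.default_rng(0)
x = {"leaf1": 0.2, "leaf2": 0.7}   # input features embedded at the leaves
a_r, W_r, b_r = rng.normal(size=3), rng.normal(size=(3, 2)), rng.normal(size=3)

h_r = shallow_node(np.array([x["leaf1"], x["leaf2"]]), a_r, W_r, b_r)
output = h_r   # aggregator: sum of root representations (only one root here)
print(output)
```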
Note that, in (42), $h_u$ depends on the input node features $X$. We omit this dependence in the notation for ease of presentation. We now define the space of functions that the above message passing on $\vec{J}_G$ produces:
$$ \mathcal{F}(G, N) = \left\{ X \to \sum_{u \in R} h_u \;\middle|\; h_u \text{ given in (42)} \right\}, \tag{43} $$
where $N = \sum_u (d_u - 1) N_u$ is the sum of all the parameters used in (42). In the following, we restrict ourselves to the DAG $\vec{J}_G$ and argue that any (smooth enough) function $f$ that has a compositional structure matching $\vec{J}_G$ can be approximated by a $g \in \mathcal{F}(G, N)$ (see (43)) with arbitrary precision.

Theorem 15: Let $f : [0,1]^n \to [0,1]$ be a function that has a compositional structure that matches the DAG $\vec{J}_G$. Let every constituent function $H_u$ of $f$ (see Definition 13) be $L_u$-Lipschitz with respect to the infinity norm. Then, for every $\epsilon > 0$ there exists a neural network $g \in \mathcal{F}(G, N)$ such that $\|f - g\|_\infty < \epsilon$ and the number of parameters $N$ is bounded by
$$ N = O\left( \sum_{u \in \mathcal{V}(\vec{J}_G) \setminus \{\alpha\}} (d_u - 1) \left( \frac{\epsilon}{L_u} \right)^{-(d_u - 1)} \right), \tag{44} $$
where $d_u$ denotes the degree (counting incoming and outgoing edges) of node $u$ in $\vec{J}_G$.

Proof: The proof of this result follows directly from the arguments presented for Theorem 3, Theorem 4, and Proposition 6 in [Poggio et al., 2017b], [Poggio et al., 2017a]. The first modification we make is the constant factor term $(d_u - 1)$ for each node $u$ in the summation in (44). This factor appears here, but not in [Poggio et al., 2017b], [Poggio et al., 2017a], because there the node degree was considered a constant. Here, the degree relates to the treewidth of the graph, and is an important parameter for tracking the scalability of the architecture. The second modification is that we allow for different Lipschitz constants $L_u$ for different constituent functions. However, the arguments in [Poggio et al., 2017b], [Poggio et al., 2017a] work for this case as well.

We now apply Theorem 15 to the function $f$ given in the statement of Theorem 6. In it, $f$ is compatible with respect to $G$. Thus, using Lemma 12 and Lemma 14, we can deduce that $f$ also has a compositional structure that matches the directed H-tree $\vec{J}_G$. In Figure 7, we illustrate this for a simple example. Thus, we can apply Theorem 15 on $f$ in order to seek an approximation $g \in \mathcal{F}(G, N)$.

In applying Theorem 15, we see that the functions $\theta_C$ are 1-Lipschitz. Thus, for all the nodes $u \in \vec{J}_G$ at which we compute a $\theta_C$, we have $L_u = 1 \leq d_u - 1$. The remaining functions that are to be approximated on $\vec{J}_G$ are the addition functions (see Figure 7 to see how they arise in computing a compatible function).

In order to derive our result, it suffices to argue that a simple sum of $k$ variables, taking values in the unit cube $[0,1]^k$, is $k$-Lipschitz with respect to the sup norm. This is indeed true and can be verified by simple arguments in analysis. Thus, for all the nodes $u$ at which we have to compute an addition, we have $L_u = d_u - 1$, where $d_u$ is the degree of node $u$ (counting both incoming and outgoing edges). Putting all this together and applying Theorem 15, we obtain the result.

B. Proof of Corollary 7

We first obtain upper bounds on the number of nodes $|\mathcal{V}(J_G)|$ in the H-tree $J_G$ and on the node degrees $d_u$ for $u \in \mathcal{V}(J_G)$. We prove the desired result by substituting these bounds in Theorem 6.

First, note that the subgraph of the H-tree $J_G$ induced by the set of root nodes $R$ is a tree decomposition $(\mathcal{T} = J_G[R], \mathcal{B})$ of $G$, by construction; see Algorithm 1 (lines 1-3). Let $\text{tw}[J_G]$ denote the treewidth of the tree decomposition $(\mathcal{T} = J_G[R], \mathcal{B})$. Then the size of each bag $B_\tau \in \mathcal{B}$ is bounded by $\text{tw}[J_G] + 1$ (see (12)). Let $T_\tau$ denote the sub-tree in $J_G$ that is formed by all the descendants of, and including, the node $\tau$ in $\mathcal{T} = J_G[R]$. Then, the number of nodes in $J_G$ is given by
$$ |\mathcal{V}(J_G)| = \sum_{\tau \in R} |\mathcal{V}(T_\tau)|. \tag{45} $$
Note that the size of each sub-tree $|\mathcal{V}(T_\tau)|$ is bounded by
$$ |\mathcal{V}(T_\tau)| \leq 1 + (\text{tw}[J_G] + 1)^{\text{tw}[J_G] + 1}. \tag{46} $$

This is because the depth of the tree $T_\tau$ is bounded by the bag size $|B_\tau|$, which is upper bounded by $\text{tw}[J_G] + 1$. Further, no node in $T_\tau$ has a bag size larger than $|B_\tau|$, and therefore the number of children at each non-leaf node in $T_\tau$ is bounded by $\text{tw}[J_G] + 1$. The additional "+1" in (46) accounts for the root node $\tau$ in $T_\tau$.

Finally, the number of nodes in the tree decomposition $\mathcal{T} = J_G[R]$ (or equivalently, the number of root nodes $|R|$) is upper bounded by $n$, the total number of nodes in graph $G$. This, along with (45)-(46), implies
$$ |\mathcal{V}(J_G)| \leq n + n\,(\text{tw}[J_G] + 1)^{\text{tw}[J_G] + 1}. \tag{47} $$
Note that the degree minus one, $d_u - 1$, is the size of a bag in a tree decomposition of some subgraph of $G$. Since the size of the largest bag in the tree decomposition of the entire graph is bounded by $\text{tw}[J_G] + 1$, we have
$$ d_u - 1 \leq \text{tw}[J_G] + 1, \tag{48} $$
for all $u \in \mathcal{V}(J_G)$. Substituting (47)-(48) in (18) of Theorem 6, we obtain the result.
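To get a feel for these bounds, the short sketch below (our illustration, not code from the paper) evaluates the right-hand sides of (47) and (48) for a given graph size $n$ and treewidth, making the exponential dependence on treewidth and the linear dependence on $n$ explicit.

```python
def h_tree_size_bound(n, tw):
    # Bound (47): |V(J_G)| <= n + n * (tw + 1)^(tw + 1).
    return n + n * (tw + 1) ** (tw + 1)

def degree_bound(tw):
    # Bound (48): d_u - 1 <= tw + 1, i.e., d_u <= tw + 2.
    return tw + 2

for tw in [1, 2, 3]:
    print(tw, h_tree_size_bound(n=1000, tw=tw), degree_bound(tw))
# Increasing the treewidth blows up the size bound exponentially,
# while doubling n only doubles it.
```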

C. Addendum to Experiments

We discuss the methods we use for tuning our hyper-parameters, report more results on train and test time requirements, and provide the list of semantic labels in the dataset.

Hyper-parameter Tuning. We tune the hyper-parameters in the following order, as recommended by [Shchur et al., 2019]:
• Iterations: [1, 2, 3, 4, 5, 6]
• Hidden dimension: [16, 32, 64, 128, 256]
• Learning rate: [0.0005, 0.001, 0.005, 0.01]
• Dropout probability: [0.25, 0.5, 0.75]
• L2 regularization strength: [0, 1e-4, 1e-3, 1e-2]
We first tune the number of iterations, the hidden dimension, and the learning rate using a grid search, while keeping dropout and L2 regularization at their lowest values. For both the standard GNNs and the neural tree, a single choice of the triplet (number of iterations, hidden dimension, learning rate) yields significantly higher accuracy than the others. With this triplet fixed, we then tune dropout and L2 regularization using another grid search.

In training, we notice that the batch size does not have a noticeable impact on the training and test accuracy. After having experimented with various batch sizes between 32 and 512, we recommend and use a batch size of 128 in our experiments.
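A minimal sketch of this two-stage grid search is shown below (illustrative only; `train_and_evaluate` is a hypothetical helper that trains a model with the given hyper-parameters and returns its validation accuracy).

```python
import itertools

iterations   = [1, 2, 3, 4, 5, 6]
hidden_dims  = [16, 32, 64, 128, 256]
lrs          = [0.0005, 0.001, 0.005, 0.01]
dropouts     = [0.25, 0.5, 0.75]
l2_strengths = [0.0, 1e-4, 1e-3, 1e-2]

def tune(train_and_evaluate):
    # Stage 1: grid search over (iterations, hidden dim, learning rate),
    # keeping dropout and L2 regularization at their lowest values.
    best_triplet = max(
        itertools.product(iterations, hidden_dims, lrs),
        key=lambda t: train_and_evaluate(*t, dropout=min(dropouts), l2=min(l2_strengths)),
    )
    # Stage 2: with the best triplet fixed, grid search over (dropout, L2).
    best_reg = max(
        itertools.product(dropouts, l2_strengths),
        key=lambda r: train_and_evaluate(*best_triplet, dropout=r[0], l2=r[1]),
    )
    return best_triplet, best_reg
```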

TABLE III
TUNED HYPER-PARAMETERS FOR VARIOUS MODELS

Model          hidden dim.   iter.   regularization   learning rate
GCN            64            3       0.0              0.01
NN + GCN       128           4       0.0              0.01
G-SAGE         128           3       1e-3             0.005
NN + G-SAGE    128           4       1e-3             0.005
GAT            128           2       1e-4             0.001
NN + GAT       128           4       1e-4             0.0005
GIN            64            3       1e-3             0.005
NN + GIN       128           4       1e-3             0.005

Table I in Section VII reported the test accuracies for various standard GNNs and neural tree models. The tuned hyper-parameters for these models are given in Table III. These hyper-parameters were tuned using the procedure described above. A dropout ratio of 0.25 turns out to be the optimal choice in each case. The optimization is run for no more than 1000 epochs of SGD (using the Adam optimizer) to achieve reasonable convergence during training.

Apart from the four listed hyper-parameters (hidden dimension, number of iterations, L2 regularization, learning rate), some of the implemented architectures (GAT, G-SAGE, GIN) have their own specific design choices and hyper-parameters. In the case of GAT, for example, we use 6 attention heads and the ELU activation function (instead of ReLU) to be consistent with the original paper. For GraphSAGE (G-SAGE in Table III), we use GraphSAGE-mean from the original paper, which does mean pooling after each convolution operation. In the case of GIN, we use the more general GIN-$\epsilon$ and train $\epsilon$ for better performance.

Time Requirements. In Table II we reported the compute, train, and test time required for the standard GCN and the neural tree with GCN message passing. We now report the train and test time for the other standard GNN architectures - GraphSAGE (G-SAGE), GAT, GIN - and the corresponding neural trees in Table IV. Note that the time to compute the H-trees remains the same across these implementations and is given in Table II.

TABLE IV
TIME REQUIREMENTS: TRAIN AND TEST

Model          Training (per epoch)   Testing
G-SAGE         0.068 s                0.042 s
NT + G-SAGE    0.311 s                0.060 s
GAT            0.089 s                0.049 s
NT + GAT       0.872 s                0.107 s
GIN            0.079 s                0.043 s
NT + GIN       0.348 s                0.059 s

Semantic Labels in the Dataset. In the 482 room-object scene graphs we used for testing, the room labels are: bathroom, bedroom, corridor, dining room, home office, kitchen, living room, storage room, utility room, lobby, playroom, staircase, closet, gym, garage. The object labels are: bottle, toilet, sink, plant, vase, chair, bed, tv, skateboard, couch, dining table, handbag, keyboard, book, clock, microwave, oven, cup, bowl, refrigerator, cell phone, laptop, bench, sports ball, backpack, tie, suitcase, wine glass, toaster, apple, knife, teddy bear, remote, orange, bicycle.