Incidence Networks for Geometric Deep Learning

Marjan Albooyeh*1, Daniele Bertolini*, Siamak Ravanbakhsh2,3

Abstract

Sparse incidence tensors can represent a variety of structured data. For example, we may represent attributed graphs using their node-node, node-edge, or edge-edge incidence matrices. In higher dimensions, incidence tensors can represent simplicial complexes and polytopes. In this paper, we formalize incidence tensors, analyze their structure, and present the family of equivariant networks that operate on them. We show that any incidence tensor decomposes into invariant subsets. This decomposition, in turn, leads to a decomposition of the corresponding equivariant linear maps. We characterize these linear maps as a combination of efficient pooling-and-broadcasting operations.

1. Introduction

Many interesting data structures can be represented with sparse incidence tensors. For example, we can represent graphs using both node-node and node-edge sparse incidence matrices. We can extend this incidence representation to data defined on simplicial complexes and polytopes of arbitrary dimension, such as meshes, polygons, and polyhedra. The goal of this paper is to design deep models for these structures.

We represent an attributed geometric structure using its incidence tensor, which models the incidence pattern of its faces. For example, rows and columns in a node-edge incidence matrix are indexed by faces of size one (nodes) and two (edges). Moreover, each edge (column) is incident to exactly two nodes (rows). The sparsity pattern of the incidence tensor carries important information about the geometric structure. This is because sparsity-preserving permutations of nodes often match the automorphism group of the geometric object; see Fig. 1(a,b).

We are interested in designing models that are informed by the symmetry of the underlying structure. We do so by making the model equivariant to symmetry transformations. When using the incidence tensor representation, a natural choice of symmetry transformations is the automorphism group of the geometric object. However, when working with a dataset comprising different instances (e.g., different graphs or polyhedra), using individual automorphism groups is not practical. This is because each symmetry group dictates a different equivariant model, and we cannot train a single model on the whole dataset. A solution is to use the symmetric group (the group of all permutations of nodes) for all instances, which implicitly assumes a dense structure where all faces are present, e.g., all graphs are fully connected; see Fig. 1(c,d).

We show that under the action of the symmetric group, any incidence tensor decomposes into invariant subsets, or orbits, where each orbit corresponds to faces of a particular size. For example, a node-node incidence decomposes into: 1) diagonals, which can encode node attributes, and 2) off-diagonals, corresponding to edge attributes. This is because any permutation of nodes (i.e., a simultaneous permutation of rows and columns) moves an (off-)diagonal entry to another (off-)diagonal entry in the node-node incidence. We can vectorize the diagonal and off-diagonal entries to get a node vector and an edge vector. These are examples of face-vectors in the general setting, and this example shows how an incidence tensor decomposes into face-vectors.

This decomposition into face-vectors also breaks up the design of equivariant linear maps for arbitrary incidence tensors into the design of such maps between face-vectors of different size. We show that any such linear map can be written as a linear combination of efficient pooling-and-broadcasting operations. These equivariant linear maps replace the linear layer in a feedforward neural network to create an incidence network.

*Equal contribution. 1Department of Computer Science, University of British Columbia, Vancouver, Canada. 2School of Computer Science, McGill University, Montreal, Canada. 3Mila Quebec AI Institute, Montreal, Canada. Correspondence to: Marjan Albooyeh, Daniele Bertolini.

2. Related Works

Deep learning with structured data is a very active area of research. Here, we briefly review some of the closely related works in graph learning and equivariant deep learning.

Figure 1: a) The sparsity pattern in the node-face incidence matrix for an (undirected) triangular bi-pyramid (concatenation of two tetrahedra). Note that each face (column) is adjacent to exactly three nodes. b) Nodes are permuted using a member of the symmetry group of the object, π ∈ D3h ≤ S5. This permutation of nodes imposes a natural permutation action on the faces, in which {δ1, δ2, δ3} ↦ {π⋅δ1, π⋅δ2, π⋅δ3}. Note that permutations from the automorphism group preserve the sparsity pattern of the incidence matrix. c) The geometric object of (a) after densification: the incidence matrix now includes all possible faces of size three; however, it still maintains a specific sparsity pattern. d) After densifying the structure, any permutation of nodes (and the corresponding permutation action on the faces of the dense incidence matrix) preserves its sparsity pattern.

GRAPH LEARNING. The idea of graph neural networks goes back to the work of (Scarselli et al., 2009). More recently, Gilmer et al. (2017) introduced message passing neural networks and showed that they subsume several other graph neural network architectures (Li et al., 2015; Duvenaud et al., 2015; Kearnes et al., 2016; Schütt et al., 2017), including the spectral methods that follow. Another body of work in graph deep learning extends convolution to graphs using the spectrum of the graph Laplacian (Bronstein et al., 2017; Bruna et al., 2014). While principled, in its complete form, the Fourier bases extracted from the Laplacian are instance-dependent, and the lack of any parameter or function sharing across the graphs limits their generalization. Following (Henaff et al., 2015; Defferrard et al., 2016), Kipf & Welling (2016) propose a single-parameter simplification of the spectral method that addresses this limitation and is widely used in practice. Some notable extensions and related ideas include (Veličković et al., 2017; Hamilton et al., 2017; Xu et al., 2018; Zhang et al., 2018; Ying et al., 2018; Morris et al., 2018; Maron et al., 2019a).

EQUIVARIANT DEEP LEARNING. Equivariance constrains the predictions of a model φ ∶ X ↦ Y under a group G of transformations of the input, such that

φ(π ⋅ x) = π ⋅ φ(x), ∀x ∈ X, ∀π ∈ G. (1)

Here π ⋅ x is a consistently defined transformation of x parameterized by π ∈ G, while π ⋅ φ(x) denotes the corresponding transformation of the output. For example, in a convolution layer (LeCun et al., 1998), G is the group of discrete translations, and (1) means that any translation of the input leads to the same translation of the output. When φ ∶ x ↦ σ(Wx) is a standard feed-forward layer with parameter matrix W, the equivariance property (1) enforces parameter-sharing in W (Shawe-Taylor, 1993; Ravanbakhsh et al., 2017).

Most relevant to our work are equivariant models proposed for geometric deep learning, which we review next. The covariant compositional network (Kondor et al., 2018) extends the message passing framework by considering basic tensor operations that preserve equivariance. While the resulting architecture can be quite general, it comes at the cost of efficiency. Hartford et al. (2018) propose a linear map equivariant to independent permutations of different dimensions of a tensor. The equivariant graph networks of (Maron et al., 2018) model the interactions within a set of nodes. We will further discuss this model as a special type of incidence network. These equivariant layers for interactions between and within sets are further generalized to multiple types of interactions in (Graham & Ravanbakhsh, 2019). Several recent works investigate the universality of such equivariant networks (Maron et al., 2019b; Keriven & Peyré, 2019; Chen et al., 2019). A flexible approach to equivariant and geometric deep learning where a global symmetry is lacking is proposed in (Cohen et al., 2019).
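The equivariance property (1) can be checked numerically. The following minimal sketch (ours, not from the paper) verifies it for the translation example above: a circular convolution commutes with the cyclic-shift action of the translation group.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # input signal
w = rng.normal(size=3)   # convolution kernel

def circ_conv(x, w):
    # circular (periodic) convolution: a translation-equivariant linear map
    n = len(x)
    return np.array([sum(w[k] * x[(i + k) % n] for k in range(len(w)))
                     for i in range(n)])

shift = lambda x, t: np.roll(x, t)   # action of the discrete translation group

# phi(pi . x) == pi . phi(x) for every translation pi, as in (1)
for t in range(len(x)):
    assert np.allclose(circ_conv(shift(x, t), w), shift(circ_conv(x, w), t))
```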

Figure 2: Parameter-sharing in the receptive field of an equivariant map: (left) 15-parameter layer for a node-node incidence, (middle) 7-parameter layer for a node-edge incidence, and (right) 4-parameter layer for a node-edge incidence. BLOCK STRUCTURE. The adjacency structure of the undirected graph with 5 nodes and 7 edges is evident from the sparsity patterns. Here, each inner block shows the parameter-sharing in the receptive field of the corresponding output unit. For example, the block on row 1 and column {1, 3} of the (middle) figure shows the dependency of the output incidence matrix at that location on the entire input incidence matrix. DECOMPOSITION. The total number of unique parameters in (left) is 15 compared to 7 for the (middle). As shown in Section 3.1, the 15 (= 7 + 2 + 3 + 3) parameter model decomposes into 4 linear maps, one of which is isomorphic to the 7-parameter model. One could also identify the 7 unique symbols of (middle) in the parameter-sharing of the (left). Note that these symbols appear on the off-diagonal blocks and off-diagonal elements within blocks, corresponding to input and output edges.

3. Graphs

In this section we discuss graphs and later generalize the arguments to a broader set of geometric objects in Section 4. Without loss of generality, we assume a fully-connected graph G = ([N], E ⊆ [N]×[N]), where [N] = {1,...,N} denotes a set of nodes, and E a set of edges.

3.1. Linear Layers

There are two common ways of representing G. The first approach is to use a node-node incidence matrix X_{δ1,δ2} indexed by δ1, δ2 ∈ [N]. In this representation, node and edge features are encoded as diagonal and off-diagonal entries of X, respectively. Here, we assume a single input and output channel (i.e., scalar node and edge attributes) for simplicity; results trivially generalize to multiple channels.

Consider the group of all N! permutations of nodes S_N and its action on X, which simultaneously permutes the rows and columns of X. Let W ∶ R^{N×N} → R^{N×N} be a linear map equivariant to this action of S_N,

W(πXπ^⊺) = πW(X)π^⊺, ∀π ∈ S_N, ∀X. (2)

The map W is constrained so that permuting the rows and columns of the input will have the same effect on the output. As shown in (Maron et al., 2018), this condition constrains the number of independent parameters in W to fifteen, regardless of the size of the graph.

Alternatively, one can represent G with a node-edge incidence matrix Y_{δ1,{δ2,δ3}}, where δ1 ∈ [N] labels nodes and the unordered pair {δ2, δ3} with δ2 ≠ δ3 ∈ [N] labels edges. Y has a special sparsity pattern: Y_{δ1,{δ2,δ3}} ≠ 0 iff node δ1 is incident to the edge {δ2, δ3}. We identify this sparsity pattern implicitly by writing the matrix as Y_{δ1,{δ1,δ2}}, so that we only index non-zero entries (note the repeated δ1). Y naturally encodes edge features at Y_{δ1,{δ1,δ2}} and Y_{δ2,{δ1,δ2}} for the two different edge-directions of the edge {δ1, δ2}.

The action of S_N on Y is also a simultaneous permutation of rows and columns, where the permutation of columns is defined by the action on the node pair that identifies each edge: π ⋅ {δ1, δ2} = {π ⋅ δ1, π ⋅ δ2}. This action preserves the sparsity pattern of Y defined above (note that even a fully connected graph has a sparse node-edge incidence matrix). The maximal equivariant linear map acting on Y is constrained to have seven independent parameters (assuming a single input and output channel). More alternatives beside the node-node and node-edge representations of a graph exist; for example, one may use an edge-edge incidence matrix.

Since both X and Y represent the same graph G, and the corresponding linear maps are equivariant to the same group S_N, one expects a relationship between the two representations and maps. This relationship is due to the decomposition of X into orbits under the action of S_N. In particular, X decomposes into two orbits: diagonal elements (X_{δ1}) and off-diagonal elements (X_{δ1,δ2}), where each subset is invariant under the action of S_N; that is, simultaneous permutations of rows and columns do not move a diagonal entry to an off-diagonal entry or vice-versa. We write this decomposition as

X ≅ X_{δ1} ⊍ X_{δ1,δ2}, (3)

where the diagonal orbit is isomorphic to the vector of nodes X_{δ1} and the off-diagonal orbit is isomorphic to the vector of edges X_{δ1,δ2} with δ1 ≠ δ2. Consider the map W defined above, in which both input and target decompose in this way. It follows that the map itself also decomposes into four maps,

W(X) = ⊍_{m′=1}^{2} ( W^{1→m′}(X_{δ1}) + W^{2→m′}(X_{δ1,δ2}) ), (4)

where W^{m→m′} maps a face-vector of faces of size m to a face-vector of faces of size m′. Equivariance to S_N for each of these maps constrains the number of independent parameters: W^{1→1} is the equivariant layer used in DeepSets (Zaheer et al., 2017), and has two parameters. W^{1→2} and W^{2→1} each have three parameters, and W^{2→2} has seven unique parameters and maps input edge features into target edge features. One key point is that the edge-vector X_{δ1,δ2} is isomorphic to the node-edge incidence matrix Y_{δ1,{δ1,δ2}}, and thus the seven-parameter equivariant map for Y is exactly W^{2→2}(X_{δ1,δ2}) of (4). The parameter-sharing in the two linear maps for the node-node and node-edge incidence matrices is visualized in Fig. 2 (left, middle).¹

¹As a side note, one can also encode node features in a node-edge incidence matrix by doubling the number of channels and broadcasting node features across all edges incident to a node. In this case all fifteen operations are retrieved, and the two layers for X and Y are equivalent.
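As a quick illustration of the decomposition (3) (a sketch of ours, not part of the paper's derivation): the diagonal and off-diagonal entries of a node-node incidence form two sets that a simultaneous row-and-column permutation never mixes.

```python
import numpy as np

N = 5
rng = np.random.default_rng(1)
X = rng.normal(size=(N, N))                 # dense node-node incidence

# the two orbits of the simultaneous S_N action
node_vec = np.diag(X).copy()                # diagonal: node features
edge_mask = ~np.eye(N, dtype=bool)
edge_vec = X[edge_mask]                     # off-diagonal: edge features

# a simultaneous permutation of rows and columns never mixes the two orbits
perm = rng.permutation(N)
X_perm = X[perm][:, perm]                   # pi X pi^T
assert np.allclose(np.sort(np.diag(X_perm)), np.sort(node_vec))
assert np.allclose(np.sort(X_perm[edge_mask]), np.sort(edge_vec))
```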

Rather than using explicit parameter-sharing in W, we show that W is a linear combination of pooling-and-broadcasting operations. In particular, the combination includes pooling the node indices of the input face-vector in "all possible ways", and broadcasting the resulting collection of pooled vectors to the target face-vector, again in all ways possible. Each operation is associated with a learnable parameter; see Fig. 3 for the graph example.

Figure 3: The 15-parameter equivariant map for a node-node incidence matrix is decomposed into four maps between two orbits, diagonals and off-diagonals, where each orbit represents features of faces of a specific size. Each of these four maps performs the pooling and broadcasting operations listed here: for example, the 3-parameter map from diagonals to off-diagonals is a linear combination of: (1) broadcasting the diagonal over rows; (2) broadcasting over columns, and; (3) first pooling the diagonal to get a scalar and broadcasting it over the off-diagonals. This approach generalizes: any S_N-equivariant linear map between face-vectors of any size can be written as a weighted sum of all "viable" pool-broadcast combinations.

3.2. Sparse Tensors and Non-Linear Layers

So far we have discussed equivariant linear layers for a fully connected graph. This means a dense input/output node-node incidence X, or equivalently a node-edge incidence Y with the sparsity pattern described in the previous section (which is maintained by the action of S_N). To avoid the cost of a dense representation, one may apply a sparsity mask after the linear map, while preserving equivariance:

W_sp ∶ X ↦ W(X) ○ s(X), (5)

where W is the equivariant linear map of Section 3.1, s(X) is the sparsity mask, and ○ is the Hadamard product. For example, assuming the layer output has the same shape as the input, we can choose to preserve the sparsity of the input. In this case, s(X) will have zero entries where the input X has zero entries, and ones otherwise. However, the setting of (5) is more general, as input and output may have different shapes. Since the sparsity mask s(X) depends on the input, the map of (5) is now non-linear. In practice, rather than calculating the dense output and applying the sparsity mask, we directly produce and store only the non-zero values.

3.3. Further Relaxation of the Symmetry Group

The neural layers discussed so far are equivariant to the group G = S_N, where N is the number of nodes. A simplifying alternative is to assume independent permutations of rows and columns of the node-node or node-edge matrix. This is particularly relevant for the node-edge matrix, where one can consider nodes and edges as two interacting sets of distinct objects. The corresponding (S_N × S_{N2})-equivariant layer, where N2 is the number of edges, was introduced in (Hartford et al., 2018), has 4 unique parameters, and is substantially easier to implement compared to the layers introduced so far. In Appendix A we show how to construct a sparsity-preserving (and therefore non-linear) layer for this case. Even though a single layer is over-constrained by these symmetry assumptions, we prove that two such layers generate exactly the same node and edge features as a single S_N-equivariant linear layer for a node-node incidence.

4. Higher Order Geometric Structures

In this section we generalize the results of Section 3 to geometric structures beyond graphs. In Section 4.1 we provide a definition of incidence tensors, which generalize the node-node and node-edge matrices of graphs. We discuss several examples, showing how they can represent geometric structures such as polytopes and simplicial complexes. In Section 4.2 we generalize the orbit decomposition of Section 3.1. Finally, in Section 4.3, we show how to build equivariant layers using a linear combination of simple pooling-and-broadcasting operations for arbitrary incidence tensors.

4.1. Incidence Tensors

Recall that [N] = {1,...,N} denotes a set of nodes. A directed face of size M is an ordered tuple of M distinct nodes, δ ∈ [N]^M with δ_i ≠ δ_j ∀i ≠ j. Following a similar logic, an undirected face δ ⊆ [N] is a subset of [N] of size M. We use δ^{(M)} when identifying the size of the face, i.e., |δ^{(M)}| = M. For example, δ^{(2)} identifies an edge in a graph or a mesh, while δ^{(3)} is a triangle in a triangulated mesh.

An incidence tensor X_{δ1,...,δD} is a tensor of order D, where each dimension is indexed by all faces of size M_d = |δ_d|. For example, if δ1 = {δ_{1,1}} indexes nodes, and δ2 = {δ_{2,1}, δ_{2,2}} identifies an edge, X_{δ1,δ2} becomes a node-edge incidence matrix. An incidence tensor has a sparsity structure, identified by a set of constraints Σ = {σ1,..., σC}, where all the indices σ_{m1},..., σ_{mc} ∈ σ_c, ∀σ_c ∈ Σ are equal for any non-zero entry of X. For example, we have X_{{δ_{1,1}},{δ_{2,1},δ_{2,2}}} ≠ 0 only if δ_{1,1} = δ_{2,1}. Therefore Σ = {σ1 = {δ_{1,1}, δ_{2,1}}}.
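To make the pool-and-broadcast picture concrete, here is a minimal numpy sketch (ours, not the authors' code) of the 3-parameter map from diagonals to off-diagonals described in the caption of Fig. 3, together with an equivariance check.

```python
import numpy as np

def w_node_to_edge(x_node, w):
    """W^{1->2}: map a node-vector to the off-diagonal (edge) orbit of a
    node-node matrix as a weighted sum of the three operations of Fig. 3.
    w = (w_row, w_col, w_all) are the learnable parameters."""
    n = len(x_node)
    out = (w[0] * x_node[:, None]       # (1) broadcast the diagonal over rows
           + w[1] * x_node[None, :]     # (2) broadcast the diagonal over columns
           + w[2] * x_node.sum())       # (3) pool to a scalar, broadcast everywhere
    out[np.eye(n, dtype=bool)] = 0.0    # keep only the off-diagonal (edge) orbit
    return out

# equivariance: permuting the input nodes permutes rows and columns of the output
rng = np.random.default_rng(2)
x, w = rng.normal(size=6), rng.normal(size=3)
perm = rng.permutation(6)
out = w_node_to_edge(x, w)
assert np.allclose(w_node_to_edge(x[perm], w), out[perm][:, perm])
```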

While in general the pair (X_{δ1,...,δD}, Σ) defines the incidence tensor, whenever it is clear from context, we will only use X_{δ1,...,δD} to denote it. This formalism can represent a variety of different geometric structures, as demonstrated in the following sections.
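For instance, the node-edge incidence with the constraint Σ above has a simple concrete form; the following sketch (illustrative only, not from the paper) builds it for a small fully connected graph.

```python
import numpy as np
from itertools import combinations

# node-edge incidence of a fully connected graph: an entry can be non-zero
# only if the node index coincides with one endpoint of the edge (Sigma).
N = 4
edges = list(combinations(range(N), 2))     # all undirected faces of size two
Y = np.zeros((N, len(edges)))
for e, (i, j) in enumerate(edges):
    Y[i, e] = Y[j, e] = 1.0                 # sparsity pattern as in Fig. 1

assert (Y.sum(axis=0) == 2).all()           # every edge touches exactly two nodes
```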

4.1.1. SIMPLICIAL COMPLEXES

Before discussing general simplicial complexes, let us review graphs as an example of incidence tensors. The node-node incidence matrix is an incidence tensor X_{{δ1},{δ2}} indexed by a pair of nodes, with no sparsity constraints. We denoted it with X_{δ1,δ2} in Section 3.1 for simplicity. The node-edge incidence matrix is denoted by the pair (X_{{δ1},{δ2,δ3}}, {δ1 = δ2}). It is indexed by nodes {δ1} and edges {δ2, δ3}. The entries can be non-zero only when δ1 = δ2, meaning that the edge {δ1, δ3} is adjacent to the node {δ1}. An alternative notation is X_{{δ1},{δ1,δ2}}. Again, we denoted it simply with X_{δ1,{δ1,δ2}} in Section 3.1. We have also denoted node and edge vectors with X_{δ1} and X_{δ1,δ2}, respectively. As a final example, X_{{δ1,δ2},{δ1,δ3}} would denote an edge-edge incidence matrix whose entries are non-zero wherever two edges are incident.

Let us now move to the definition of a general (undirected) simplicial complex. An abstract simplicial complex ∆ ⊆ 2^{[N]} is a collection of faces, closed under the operation of taking subsets; that is, (δ1 ∈ ∆ and δ2 ⊂ δ1) ⇒ δ2 ∈ ∆. The dimension of a face δ is its size minus one. Maximal faces are called facets, and the dimension of ∆ is the dimension of its largest facet. For example, an undirected graph is a one-dimensional simplicial complex. Each dimension of an incidence tensor X_{δ1,...,δD} may be indexed by faces of a specific dimension. Two undirected faces of different dimension δ, δ′ ∈ ∆ are incident if one is a subset of the other. This type of relationship, as well as alternative definitions of incidence between faces of the same dimension, can be easily accommodated in the form of equality constraints in Σ.

Although not widely used, a directed simplicial complex can be defined similarly. The main difference is that faces are sequences of nodes, and ∆ is closed under the operation of taking a subsequence. As one might expect, the incidence tensor for directed simplicial complexes can be built using directed faces in our notation.

Example 1. A zero-dimensional simplicial complex is a set of points that we may represent using an incidence vector. At dimension one, we get undirected graphs, where faces of dimension one are the edges. A triangulated mesh is an example of a two-dimensional simplicial complex; see the figure below.

The triangular bi-pyramid of Fig. 1 is an example of a 3-dimensional simplicial complex with 5 nodes, 9 edges, 7 faces of size 3, and two faces of size 4. The node-face incidence matrix in Fig. 1(a) is expressed by X_{{δ1},{δ1,δ2,δ3}} in our formalism.

4.1.2. POLYGONS, POLYHEDRA, AND POLYTOPES

Another family of geometric objects with incidence structure is polytopes. A formal definition of an abstract polytope and its representation using incidence tensors is given in Appendix D. A polytope is a generalization of polygon and polyhedron to higher dimensions. The structure of an (abstract) polytope is encoded using a partially ordered set (poset) that is graded, i.e., each element of the poset has a rank. For example, Fig. 4 shows the poset for a cube, where each level is a different rank, and subsets in each level identify faces of different size (nodes, edges, and squares). The idea of using an incidence tensor representation for a polytope is similar to its use for simplicial complexes. Each dimension of X_{δ1,...,δD} indexes faces of a different rank. Two faces of the same dimension may be considered incident if they have a face of a specific lower rank in common. We may also define two faces of different dimension to be incident if one face is a subset of the other, i.e., δ^{(m)} < δ^{(m′)} in the partial order.

Figure 4: Representation of a cube as a (graded) partially ordered set. The incidence structure of the poset as well as face attributes are encoded in an incidence tensor.

4.2. Symmetry & Decomposition

The automorphism group Aut(X) ≤ S_N associated with an incidence tensor is the set of all permutations of nodes that map every face to another face, and therefore preserve the sparsity,

(X_{π⋅δ1,...,π⋅δD} ≠ 0 ⇔ X_{δ1,...,δD} ≠ 0) ⇔ π ∈ Aut(X),

where the action of Aut(X) on the faces is naturally defined as

π ⋅ (δ1, . . . , δM) = (π ⋅ δ1, . . . , π ⋅ δM). (6)

See Fig. 1(a,b) for an example. We may then construct Aut(X)-equivariant linear layers through parameter-sharing. However, the constraints on this linear operator vary if our dataset has incidence tensors with different sparsity patterns. For example, a directed graph dataset may contain a fully connected graph with automorphism group S_N and a cyclic graph with automorphism group C_N. For these two graphs, node-node and node-edge incidence matrices are invariant to the corresponding automorphism groups, necessitating different constraints on their linear layer. To remedy the problem with model-sharing across instances, we densify all incidence tensors so that all directed or undirected faces of a given dimension are present. Now, one may use the same automorphism group S_N across all instances; see Fig. 1(c,d). Next, we consider the incidence tensor as a G-set, and identify the orbits of the S_N action.

Theorem 4.1. The action of S_N on any incidence tensor X decomposes into orbits that are each isomorphic to a face-vector:

(X_{δ1,...,δD}, Σ) ≅ ⊍_m κ_m X_{δ^{(m)}}, (7)

where κ_m is the multiplicity of faces of size m. The value of κ_m is equal to the number of partitionings of the set of all indices {δ_{1,1}, . . . , δ_{1,M_1}, . . . , δ_{D,M_D}} into m non-empty partitions, such that δ_{d,m} ∀m ∈ [M_d] belong to different partitions, and members of σ ∈ Σ belong to the same partition.

The proof appears in Appendix B.

Example 2 (Node-adjacency tensors). Consider an order-D node-node-. . . -node incidence tensor X_{{δ1},...,{δD}} with no sparsity constraints. In this case, the multiplicity κ_m of (7) corresponds to the number of ways of partitioning a set of D elements into m non-empty subsets, also known as the Stirling number of the second kind, written as S(D, m). Each partition of size m identifies a face-vector X_{δ^{(m)}} for a face of size m. These faces can be identified as hyper-diagonals of order m in the original adjacency tensor X. For example, as shown in the figure below, X_{{δ1},{δ2},{δ3}} decomposes into a node-vector (the main diagonal of the adjacency cube), three edge-vectors (isomorphic to the three diagonal planes of the cube adjacency, with the main diagonal removed), and one hyper-edge-vector (isomorphic to the adjacency cube, where the main diagonal and diagonal planes have been removed). Here, κ1 = S(3,1) = 1, κ2 = S(3,2) = 3, and κ3 = S(3,3) = 1.

4.3. Equivariant Maps for Incidence Tensors

As shown in the previous section, any incidence tensor can be decomposed into a disjoint union of face-vectors, which are invariant sets under the action of the symmetric group. An implication is that any equivariant map from an incidence tensor to another also decomposes into equivariant maps between face-vectors.

Let W^{M→M′} be a linear function (here represented as a tensor) that maps a vector of faces of size M to a vector of faces of size M′,

W^{M→M′} ∶ R^{N×⋯×N} (M times) ↦ R^{N×⋯×N} (M′ times),
X_{(δ1,...,δM)} ↦ W^{δ1,...,δM}_{δ′1,...,δ′M′} X_{(δ1,...,δM)}, (8)

where δ1, . . . , δM identifies faces of size M, and (using Einstein notation) repeated indices are summed over. Equivariance to S_N is realized through a symmetry constraint on W,

W^{π⋅δ1,...,π⋅δM}_{π⋅δ′1,...,π⋅δ′M′} = W^{δ1,...,δM}_{δ′1,...,δ′M′} ∀π ∈ S_N, (9)

which ties the elements within each orbit of the so-called diagonal S_N-action on W; see Fig. 2 (left, middle) for a graph example.

4.3.1. POOL & BROADCAST INTERPRETATION

Each unique parameter in the constrained W corresponds to a linear operation that has a pool-and-broadcast interpretation; that is, any linear equivariant map between two incidence tensors can be written as a linear combination of pooling-and-broadcasting operations. Moreover, this interpretation allows for a linear-time implementation of the equivariant layers, as we avoid the explicit construction of W.

Definition 1 (Pooling). Given a face-vector X_{(δ1,...,δM)}, for P = {p1, . . . , pL} ⊆ [M], the pooling operation sums over the indices in P:

Pool_P(X_{(δ1,...,δM)}) = Σ_{δ_{p1} ∈ [N]} ⋯ Σ_{δ_{pL} ∈ [N]} X_{(δ1,...,δM)}.

In practice, the summation in the definition may be replaced with any permutation-invariant aggregation function.

Definition 2 (Broadcasting). Bcast_{B,M′}(X) broadcasts X, a face-vector of size M, over a target vector of faces of size M′ ≥ M. We identify X with a sequence of node indices of the target face-vector, B = (b1, . . . , bM) with b_m ∈ [M′], and we broadcast across the remaining M′ − M node indices; that is,

( Bcast_{B,M′}(X) )_{(δ1,...,δM′)} = X_{(δ_{b1},...,δ_{bM})}.
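Definitions 1 and 2 translate directly into array operations. The following is a small sketch of ours (not the paper's implementation), using 0-based axis positions in place of the 1-based node indices above.

```python
import numpy as np

def pool(x, P):
    """Pool_P of Definition 1: sum the face-vector x (shape (N,)*M) over the
    axis positions in P (0-based)."""
    return x.sum(axis=tuple(P))

def bcast(x, B, M_out):
    """Bcast_{B,M'} of Definition 2: send input axis m of x to target axis B[m]
    (0-based, all distinct) and broadcast over the remaining M' - M target axes."""
    N = x.shape[0]
    order = np.argsort(B)                    # reorder input axes by target position
    xt = np.transpose(x, order)
    new_axes = tuple(i for i in range(M_out) if i not in B)
    return np.broadcast_to(np.expand_dims(xt, new_axes), (N,) * M_out)

# example: pool an edge-vector (M = 2) over its second index, then broadcast the
# resulting node-vector onto the first index of a triangle-vector (M' = 3)
rng = np.random.default_rng(3)
edges = rng.normal(size=(5, 5))
nodes = pool(edges, P=[1])                   # shape (5,)
tri = bcast(nodes, B=[0], M_out=3)           # shape (5, 5, 5), constant over axes 1, 2
assert tri.shape == (5, 5, 5) and np.allclose(tri[:, 0, 0], nodes)
```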

For example, given an edge-vector X = X_{{δ1,δ2}}, Bcast_{⟨0,1⟩,3}(X) broadcasts X to a triangle-vector (i.e., a vector of faces of size 3), where X is mapped to the first two node indices and broadcasted along the third. Note that operations defined through pooling-and-broadcasting are equivariant to permutations of nodes. In fact, it turns out that an equivariant W can only linearly combine pooling and broadcasting of the input incidence tensor into an output tensor.

Theorem 4.2. Any equivariant linear map W^{M→M′} between face-vectors of size M and M′, defined in (8), can be written as

W^{M→M′}(X) = Σ_{P⊆[M], B⊆⟨1,...,M′⟩, |B|=M−|P|} w_{B,P} Bcast_{B,M′}( Pool_P(X) ). (10)

The proof appears in Appendix B. The sum of the pooling-and-broadcasting operations in (10) includes pooling the node indices of the input face-vector in all possible ways, and broadcasting the resulting collection of face-vectors to the target face-vector, again in all ways possible; w_{B,P} ∈ R is the parameter associated with each unique pooling-and-broadcasting combination. More details are discussed in Appendix C.

The number of operations in (10) is given by

τ^{M→M′} = Σ_{m=0}^{min(M,M′)} C(M, m) C(M′, m) m!, (11)

where C(M, m) denotes the binomial coefficient. This counts the number of possible choices of m indices out of M input indices in (8) and m indices out of M′ output indices for pool and broadcast. Once this set is fixed, there are m! different ways to match input indices to output indices.

4.3.2. DECOMPOSITION OF EQUIVARIANT MAPS

Let W ∶ (⊍_m κ_m X_{δ^{(m)}}) ↦ (⊍_{m′} κ_{m′} X_{δ^{(m′)}}) be an equivariant map between arbitrary incidence tensors, where both input and output decompose according to (7). Using the equivariant maps W^{m→m′} of (10), we get a decomposition of W into all possible combinations of input-output face-vectors,

W(X_{δ1,...,δD}, Σ) ≅ ⊍_{m′} ⊍_{k′=1}^{κ_{m′}} Σ_m Σ_{k=1}^{κ_m} W^{k,m→k′,m′}(X_{δ^{(m)}}), (12)

where for each copy (out of κ_{m′} copies) of the output face of size m′, we are summing over all the maps produced by different input faces having different multiplicities. The use of k and k′ in the map W^{k,m→k′,m′} is to indicate that for each input-output copy, the map W^{m→m′} uses a different set of parameters. The upshot is that input and output multiplicities κ, κ′ play a role similar to input and output channels. The total number of independent parameters in a layer is

τ = Σ_{m,m′} κ_{m′} κ_m τ^{m→m′}, (13)

where τ^{m→m′} is given by (11).

Example 3 (Node-adjacency tensors). This example is concerned with the incidence representation used in the equivariant graph networks of (Maron et al., 2018) and derives their model as a special case, using our pooling-and-broadcasting layer and face-vector decomposition. For an equivariant layer that maps a node-node-. . . -node incidence tensor X of order D to the same structure, the decomposition in terms of face-vectors reads

X_{δ1,...,δD} ≅ ⊍_m S(D, m) X_{δ^{(m)}},

where S(D, m) is the Stirling number of the second kind; see Example 2. The total number of operations according to (13) is then given by

τ = Σ_{m,m′=1}^{D} S(D, m) S(D, m′) Σ_{l=0}^{min(m,m′)} C(m, l) C(m′, l) l!
  = Σ_{l=0}^{D} Σ_{m=l}^{D} Σ_{m′=l}^{D} C(m, l) S(D, m) C(m′, l) S(D, m′) l! = Bell(2D).

In the last line, Bell(2D) is the Bell number and counts the number of unique partitions of a set of size 2D. To see the logic in the final equality: first divide [2D] in half. Next, partition each half into subsets of different sizes (0 ≤ m, m′ ≤ D) and choose l of these partitions from each half and merge them in pairs. The first two terms count the number of ways we can partition each half into m (or m′) partitions and select a subset of size l among them. The l! term accounts for the different ways in which l partitions can be aligned. This result agrees with the result of (Maron et al., 2018). Therefore one may implement these hyper-graph networks using the efficient pooling-and-broadcasting operations outlined in (10).

Recall that when discussing equivariant layers for graphs, we also considered independent permutations of rows and columns in a node-edge incidence matrix, and claimed that despite having only 4 parameters, stacking two such layers (with additional channels) is equivalent to the 15-parameter model. In Appendix E a similar result is given for higher dimensions, showing that one may use S_{N_{|δ1|}} × ⋯ × S_{N_{|δD|}} as the symmetry group of an incidence tensor, where the equivariant model has 2^D parameters.
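A quick sanity check of the counting formulas (11) and (13), written as our own sketch: for D = 2 the total recovers the 15 parameters of (Maron et al., 2018).

```python
from math import comb, factorial

def tau(M, Mp):
    """Number of pool-and-broadcast operations for W^{M->M'}, eq. (11)."""
    return sum(comb(M, m) * comb(Mp, m) * factorial(m) for m in range(min(M, Mp) + 1))

def stirling2(n, k):
    """Stirling numbers of the second kind (partitions of an n-set into k blocks)."""
    if n == k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

assert (tau(1, 1), tau(1, 2), tau(2, 1), tau(2, 2)) == (2, 3, 3, 7)

# total parameter count (13) for an order-D node-adjacency tensor equals Bell(2D)
D = 2
total = sum(stirling2(D, m) * stirling2(D, mp) * tau(m, mp)
            for m in range(1, D + 1) for mp in range(1, D + 1))
bell_2D = sum(stirling2(2 * D, k) for k in range(2 * D + 1))
assert total == bell_2D == 15     # the 15-parameter layer of (Maron et al., 2018)
```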

Figure 5: Decomposition of the 52 = 2 + 3 + 3 × 3 + 7 × 3 + 4 + 13 parameter equivariant map from a node-node incidence matrix to a node-node-node incidence tensor. This figure is similar to the example of Fig. 3, with the only difference that the output tensor has a higher rank. The input has two orbits and the output has five orbits, described in Example 2. In our notation, the equivariant map from each orbit (face-vector) to another is W^{M→M′} for M = 1, 2 and M′ = 1, 2, 3. Each map W^{M→M′} consists of the different ways the source orbit can be pooled and broadcasted to the target orbit. For example, the 4 operations from a matrix diagonal to all the off-diagonals in the cube consist of broadcasting the diagonal over the cube in three different ways, plus an operation that pools over the diagonal and broadcasts the resulting scalar over the entire cube. The number of pool-broadcast operations τ^{M→M′} for each map is highlighted by the pink square, and the number agrees with (11).

5. Conclusion and Future Work

This paper introduces a general approach to learning equivariant models for a large family of structured data through their incidence tensor representation. In particular, we showed various incidence tensor representations for graphs, simplicial complexes, and abstract polytopes.

The proposed family of incidence networks are 1) modular: they decompose to simple blocks; and 2) efficient: they all have a linear-time pooling-and-broadcasting implementation. In our systematic study of this family, we discussed the implications of 1) added symmetry due to undirected faces; 2) sparsity-preserving equivariant maps, and; 3) the successive relaxation of the symmetry group Aut(X) ≤ S_N ≤ S_{N_{|δ1|}} × ⋯ × S_{N_{|δD|}}, from the automorphism group, to a direct product of symmetric groups that independently permutes each dimension of the incidence tensor. Here, moving to a larger group simplifies the neural layer by reducing the number of unique parameters (and linear operations), while increasing its bias.

Application of incidence networks to different domains, such as learning on triangulated meshes, is a natural extension of this work. Note that for a comprehensive study with nodes, edges, and triangles, as they appear in a triangulated mesh, one could investigate multiple order-three representations (node-node-node, node-edge-face, etc.), as well as some order-two representations (node-face, face-face, etc.), or simply work with face-vectors. For each of these, one potentially needs a different incidence network using a different set of sparse pool-broadcast operations. We hope to explore this direction in a follow-up work.

Acknowledgements

We thank the reviewers for helpful comments on the manuscript. The computational resources for this work were mainly provided by Compute Canada. This research was in part supported by the Canada CIFAR AI Chair Program and an NSERC Discovery Grant.

References

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. ICLR, 2014.

Chen, Z., Villar, S., Chen, L., and Bruna, J. On the equivalence between graph isomorphism testing and function approximation with gnns. arXiv preprint arXiv:1905.12560, 2019.

Cohen, T. S., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral cnn. arXiv preprint arXiv:1902.04615, 2019.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

Graham, D. and Ravanbakhsh, S. Deep models for relational databases. arXiv preprint arXiv:1903.09033, 2019.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. In Proceedings of the 35th International Conference on Machine Learning, pp. 1909–1918, 2018.

Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Kearnes, S., McCloskey, K., Berndl, M., Pande, V., and Riley, P. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.

Keriven, N. and Peyré, G. Universal invariant and equivariant graph neural networks. arXiv preprint arXiv:1905.04943, 2019.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. arXiv preprint arXiv:1905.11136, 2019a.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019b.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. arXiv preprint arXiv:1810.02244, 2018.

Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of JMLR: WCP, August 2017.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R., and Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.

Shawe-Taylor, J. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816–826, 1993.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.

Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810, 2018.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, 2017.

Zhang, M., Cui, Z., Neumann, M., and Chen, Y. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

A. Additional Results for Equivariant Layers Using a Relaxed Symmetry Group

Consider an undirected graph with N nodes and its node-node X_{δ1,δ2} and node-edge Y_{δ1,{δ1,δ2}} incidence representations. We discussed the equivalence of their S_N-equivariant linear layers in Section 3.1. Here, we study node-edge layers equivariant to S_N × S_{N2}, where N2 = N(N − 1)/2 is the number of edges, and compare their expressive power with their S_N-equivariant linear counterparts.

NODE-NODE INCIDENCE. Let X ∈ R^{N×N×K} be a node-node incidence matrix with K channels. Consider a linear layer W_X ∶ X ↦ X′ ∈ R^{N×N×K′}, where W_X is defined as in (12), and K′ is the number of output channels. Diagonal elements of X encode input node features and off-diagonal elements input edge features. Since the graph is undirected, X is symmetric, and the corresponding number of independent pooling and broadcasting operations (see, e.g., equation (31)) is τ_X = 9, for a total of 9KK′ parameters.

NODE-EDGE INCIDENCE. Alternatively, one could represent the graph with a node-edge incidence matrix Y ∈ R^{N×N2×2K}. Node features are mapped to the first K channels along the node dimension and broadcasted across the edge dimension. Similarly, edge features are mapped to the last K channels along the edge dimension and broadcasted across the node dimension. Consider a layer equivariant to S_N × S_{N2} as described in Section 3.3, that also preserves the sparsity through the non-linear implementation of Section 3.2, W_Y ∶ Y ↦ Y′ ∈ R^{N×N2×2K′}, where Y′ = W_Y(Y) ○ s(Y), and W_Y is an equivariant map defined as in (Hartford et al., 2018). We can write this linear map in our notation as

W_Y(Y) = Σ_{P∈2^{{0,1}}} λ_P Bcast_{{0,1}−P}( Pool_P(Y) ), (14)

where 2^{{0,1}} is the set of all subsets of {0, 1}, and we have dropped the output dimension in the Bcast operator, as all features are broadcasted back to Y′. W_Y(Y) corresponds to pooling-and-broadcasting each of the two dimensions of Y independently, thus the summation has four terms, for pooling/broadcasting over rows, columns, both rows and columns, and no pooling at all. The number of independent operations is τ_Y = 4, for a total of 16KK′ independent parameters.

Theorem A.1. Let W_X and W_Y be the linear S_N-equivariant and the non-linear (S_N × S_{N2})-equivariant layers operating on node-node and node-edge incidence, respectively. The following statements hold:

(a) a single W_Y layer spans a subspace of the features spanned by W_X,

(b) two W_Y layers span the same feature space spanned by W_X (the maximal feature space).

Proof. First, we discuss how to interpret output features. Additionally, for the rest of the proof we will assume K = K′ = 1 for simplicity, noting that the proof generalizes to the multi-channel case.

Output Features. For a node-node incidence layer, it is natural to interpret diagonal and off-diagonal elements of X′ as output node and edge features, respectively. For the node-edge incidence case, all four operations of the W_Y map return linear combinations of features that vary at most across one dimension, and are repeated across the remaining dimensions. This is the same pattern as the input node and edge features, and we will use the same scheme to interpret them. In particular,

• an output feature that varies across the node dimension (but is repeated across the edge dimension) is a node feature,
• an output feature that varies across the edge dimension (but is repeated across the node dimension) is an edge feature,
• and finally, a feature that is repeated across both node and edge dimensions is either a node or an edge feature.

For example, consider a complete graph with three nodes. Its incidence matrix, with node and edge features n_{δ1} and e_{δ1δ2} repeated across rows and columns (first channel: node features; second channel: edge features), is

        {1,2}         {1,3}         {2,3}
  1   (n1, e12)     (n1, e13)     −
  2   (n2, e12)     −             (n2, e23)      (15)
  3   −             (n3, e13)     (n3, e23)

Consider pooling and broadcasting across the edge dimension. Node features were broadcasted across it, and since every node is incident with two edges, we get back multiples of the original features. The edge channel will instead return new features that combine the edges incident on each node. These new node features vary across the node dimension and are broadcasted across the edge dimension,

        {1,2}                {1,3}                {2,3}
  1   (2n1, e12 + e13)     (2n1, e12 + e13)     −
  2   (2n2, e12 + e23)     −                    (2n2, e12 + e23)      (16)
  3   −                    (2n3, e13 + e23)     (2n3, e13 + e23)

On the other hand, pooling and broadcasting across the node dimension returns multiples of the input edge features, and generates new edge features from the two nodes incident on each edge,

        {1,2}               {1,3}               {2,3}
  1   (n1 + n2, 2e12)     (n1 + n3, 2e13)     −
  2   (n1 + n2, 2e12)     −                   (n2 + n3, 2e23)      (17)
  3   −                   (n1 + n3, 2e13)     (n2 + n3, 2e23)

Proof of statement (a). Given the feature encoding described above, the W_X and W_Y layers can be represented as functions acting on the space of node and edge features,

W ∶ R^Q ↦ R^Q, (18)

where Q = N(N + 1)/2 is the number of graph elements, i.e., the sum of nodes and edges. Let us fix a basis in R^Q such that the first N components of a vector φ ∈ R^Q represent node features and the remaining N2 = N(N − 1)/2 represent edge features,

φ = (n1, . . . , nN, e1, . . . , e_{N2})^⊺ ≡ (n, e)^⊺. (19)

Then a layer W has a matrix representation W ∈ R^{Q×Q},

φ ↦ Wφ = [ W^{1→1}  W^{2→1} ; W^{1→2}  W^{2→2} ] (n ; e), (20)

where we have split the matrix into four sub-blocks acting on the sub-vectors of node and edge features. Here W^{m→m′} labels operators that map faces of size m into faces of size m′, with 1 ≤ m, m′ ≤ 2. Nodes are faces of size 1 and edges are faces of size 2; see also Section 4 for a general definition of faces. In particular, W^{1→1} ∈ R^{N×N} maps input node features to output node features, W^{2→1} ∈ R^{N×N2} maps input edge features to output node features, W^{1→2} ∈ R^{N2×N} maps input node features to output edge features, and W^{2→2} ∈ R^{N2×N2} maps input edge features to output edge features.

The nine operations of the W_X map can be written as

W_X ≃ [ W_0^{1→1} + W_1^{1→1}   W_0^{2→1} + W_1^{2→1} ; W_0^{1→2} + W_1^{1→2}   W_0^{2→2} + W_1^{2→2} + W_2^{2→2} ], (21)

and are summarized in Table 1. We have split the operations according to the sub-blocks defined above, where W_i^{m→m′} represents an operator mapping faces of size m to faces of size m′, with 0 ≤ i ≤ m, m′ representing the size of the faces after pooling. In terms of the pooling-and-broadcasting of Definition 1 and Definition 2,

W_i^{m→m′}(X_{δ^{(m)}}) = Σ_{B∈2^{[m′]}, |B|=i} Bcast_{B,m′}( Pool_{[m−i]}(X_{(δ1,...,δm)}) ), (22)

where X_{δ^{(m)}} = X_{(δ1,...,δm)} is a face-vector representing either node or edge features. For example, W_0^{1→1} is the operator that pools all node features (i.e., it pools the node-vector to dimension zero) and broadcasts the pooled scalar over nodes. The symbol ≃ in (21) indicates that each operator is defined up to a multiplicative constant (i.e., the corresponding learnable parameter w_{B,P} in (10)). The action of W_X on the space of node and edge features can be uniquely identified with the nine-dimensional subspace

V_{W_X} = span( { [ W_0^{1→1}  0 ; 0  0 ], . . . , [ 0  0 ; 0  W_2^{2→2} ] } ). (23)

On the other hand, the four operations on the two channels of the W_Y map can be written as

W_Y ≃ [ W_1^{1→1}  0 ; 0  W_2^{2→2} ]  (identity)
    + [ W_1^{1→1}  W_1^{2→1} ; 0  0 ]  (pool/broadcast edges)
    + [ 0  0 ; W_1^{1→2}  W_2^{2→2} ]  (pool/broadcast nodes)
    + [ W_0^{1→1}  W_0^{2→1} ; W_0^{1→2}  W_0^{2→2} ]  (pool/broadcast all).  (24)

In particular,

• identity: if no pooling is applied, we simply map input node and edge features into output node and edge features, respectively. Note that, since we map two input channels into two output channels, we have four independent parameters associated with this operation, but two of them are redundant.

• pool/broadcast edges: when pooling and broadcasting over the edge dimension we are mapping input edge features into output node features, as in the example in (16). Input node features are also mapped into (multiples of) themselves. Similarly to the previous case, two of the four parameters are redundant.

• pool/broadcast nodes: when pooling and broadcasting over the node dimension we are mapping input node features into output edge features, as in the example in (17). Input edge features are also mapped into (multiples of) themselves. Similarly to the previous case, two of the four parameters are redundant.

• pool/broadcast all: when pooling and broadcasting over both dimensions we get the pooled node and pooled edge features across the entire matrix, which we can interpret as either node or edge features as described in the previous paragraph. We can use the four independent parameters associated with this operation to identify it with the operators that map pooled nodes to nodes and edges, and pooled edges to nodes and edges.

From (24), a single node-edge W_Y layer does not generate W_1^{2→2}, which corresponds to pooling one dimension of the edge-tensor and rebroadcasting the result to the full tensor (in order to preserve symmetry, the pooled one-dimensional tensor has to be rebroadcasted both across rows and columns). Thus, the subspace V_{W_Y} ⊆ R^{Q×Q} of node and edge features spanned by a single W_Y map is a subspace of V_{W_X},

V_{W_Y} = V_{W_X} / span( { [ 0  0 ; 0  W_1^{2→2} ] } ) ⊂ V_{W_X}. (25)

Table 1: The nine operators of the S_N-equivariant map W_X for an undirected graph. An operator W_i^{m→m′} maps faces of size m to faces of size m′ (with m, m′ = 1 for nodes and m, m′ = 2 for edges), and with i representing the size of the partially pooled faces. Rows represent the pooled features, and columns the targets each row is being broadcasted to. PARTIALLY-POOLED EDGES corresponds to pooling X (ignoring its diagonal) either across rows or columns. POOLED NODES and POOLED EDGES correspond to pooling over all node and edge features, respectively.

POOLED FEATURES / BROADCAST TO     NODES         EDGES
EDGES                              −             W_2^{2→2}
NODES                              W_1^{1→1}     W_1^{1→2}
PARTIALLY-POOLED EDGES             W_1^{2→1}     W_1^{2→2}
POOLED NODES                       W_0^{1→1}     W_0^{1→2}
POOLED EDGES                       W_0^{2→1}     W_0^{2→2}

Proof of statement (b). Consider the definition of the W_i^{m→m′} operators in (22). The following multiplication rule holds:

W_j^{m′→m″} W_i^{m→m′} ≃ Σ_{k=max(0,i+j−m′)}^{min(i,j)} W_k^{m→m″}. (26)

In order to show this, consider the following:

W_j^{m′→m″}( W_i^{m→m′} X_{δ^{(m)}} )
  = Σ_{Q∈2^{[m″]}, |Q|=j} Bcast_{Q,m″}( Pool_{[m′−j]}( W_i^{m→m′} X_{δ^{(m)}} ) )
  = Σ_{Q∈2^{[m″]}, |Q|=j} Bcast_{Q,m″}( Pool_{[m′−j]}( Σ_{P∈2^{[m′]}, |P|=i} Bcast_{P,m′}( Pool_{[m−i]}(X_{δ^{(m)}}) ) ) )
  ≃ Σ_{k=max(0,i+j−m′)}^{min(i,j)} Σ_{R∈2^{[m″]}, |R|=k} Bcast_{R,m″}( Pool_{[m−k]}(X_{δ^{(m)}}) ), (27)

where the first two lines simply substitute the definition of the operators. The important step is going from the second to the third line: this considers all the pooled tensors that can be created and broadcasts them to the target m″-dimensional tensor. Note that only the dimension of these pooled tensors is important, as the constants are irrelevant and absorbed in the ≃ symbol. In particular, the lowest dimension X_{δ^{(m)}} is pooled to is max(0, m − (m − i) − (m′ − j)) = max(0, i + j − m′). On the other hand, the maximum dimension X_{δ^{(m)}} is pooled to is min(i, j). Also, note that (26) is consistent with the special case m′ = m″ = j. In this case W_{m′}^{m′→m′} ≃ 1 is proportional to the identity operator. Consistently, we get W_{m′}^{m′→m′} W_i^{m→m′} ≃ Σ_{k=max(0,i)}^{min(m′,i)} W_k^{m→m′} = W_i^{m→m′}. The same holds for the special case m = m′ = i.

Using (26) we get

W_Y^2 ≃ [ W_0^{1→1} + W_1^{1→1}   W_0^{2→1} + W_1^{2→1} ; W_0^{1→2} + W_1^{1→2}   W_0^{2→2} + W_1^{2→2} + W_2^{2→2} ] ≃ W_X ⟹ V_{W_Y^2} = V_{W_X}. (28)

Two stacked W_Y maps span the same subspace of operators, and thus output features, as a single W_X map. Note that the W_1^{2→2} missing from a single W_Y is generated by composing W_1^{2→2} ≃ W_1^{1→2} W_1^{2→1}; that is, edge features are pooled to dimension one and broadcasted to nodes in the first layer through W_1^{2→1}, and then re-broadcasted across rows and columns, like all the other node features, in the second layer through W_1^{1→2}. Furthermore, using the same multiplication rules, we find that

W_X^m ≃ W_X ≃ W_Y^{1+m} ∀m ∈ N, m ≥ 1, (29)

thus by taking a single W_X map or by stacking two W_Y maps we span the maximal node and edge feature space.
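As an aside, the four-operation map of eq. (14) is easy to realize with array reductions. The following sketch is ours (single channel, without the sparsity mask s(Y)); it also checks (S_N × S_{N2})-equivariance.

```python
import numpy as np

def relaxed_layer(Y, lam):
    """The four pool/broadcast terms of eq. (14) on a node-edge incidence Y:
    no pooling, pool the node axis, pool the edge axis, or pool both,
    each broadcast back to the full matrix. lam holds the four parameters."""
    return (lam[0] * Y
            + lam[1] * Y.sum(axis=0, keepdims=True)   # pool nodes, broadcast back
            + lam[2] * Y.sum(axis=1, keepdims=True)   # pool edges, broadcast back
            + lam[3] * Y.sum())                       # pool both, broadcast back

# equivariance under independent permutations of rows (nodes) and columns (edges)
rng = np.random.default_rng(4)
Y, lam = rng.normal(size=(5, 10)), rng.normal(size=4)
pn, pe = rng.permutation(5), rng.permutation(10)
assert np.allclose(relaxed_layer(Y[pn][:, pe], lam),
                   relaxed_layer(Y, lam)[pn][:, pe])
```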

B. Proofs

B.1. Proof of Theorem 4.1

Proof. Consider an incidence tensor (X_{δ1,...,δD}, Σ). Assume there exist two node indices δ_{d,m}, δ_{d′,m′} within two face indices δ_d and δ_{d′}, d ≠ d′, that are not constrained to be equal by Σ; therefore they can be either equal or different. The action of S_N (and any of its subgroups) maintains this (lack of) equality, that is,

π ⋅ δ_{d,m} = π ⋅ δ_{d′,m′} ⇔ δ_{d,m} = δ_{d′,m′} ∀π ∈ S_N.

This means that the set of indices constrained by Σ, {δ1,..., δD | Σ}, can be partitioned into two disjoint G-sets,

{δ1,..., δD | Σ} = {δ1,..., δD | Σ, δ_{d,m} = δ_{d′,m′}} ⊍ {δ1,..., δD | Σ, δ_{d,m} ≠ δ_{d′,m′}},

with δ_{d,m} = δ_{d′,m′} in one G-set and δ_{d,m} ≠ δ_{d′,m′} in the other. We may repeat this partitioning recursively for each of the resulting index-sets. This process terminates with homogeneous G-sets where any two indices δ_{d,m} and δ_{d′,m′} are either constrained to be equal or different. It follows that if we aggregate all equal indices, we are left with a set of indices that are constrained to be different, and therefore define a face δ^{(m)}.

The number of ways in which we are left with m different indices, κ_m, is given by the number of partitions of {δ_{1,1}, . . . , δ_{1,M_1}, . . . , δ_{D,M_D}} into m non-empty sets, where two elements in the same partition are constrained to be equal. This partitioning is further constrained by the fact that elements within the same face δ_d are constrained to be different and therefore should belong to different partitions. Moreover, Σ adds its equality constraints.

B.2. Proof of Theorem 4.2

Proof. Consider the linear map W^{M→M′} of (8). Each unique parameter corresponds to an orbit under the action of S_N. Orbits are partitions of the set {δ1, . . . , δM, δ′1, . . . , δ′M′} of input and output node indices, where any subset of the partition contains at most one input and one output index. That is, subsets have either size one, with a single input or output index, or size two, with a pair of input and output indices. Given such a partition, all and only the W^{M→M′} elements such that indices in the same subset are equal and indices in separate subsets are different identify an orbit. That is, given any two elements of W^{M→M′}, there exists an element of S_N that maps one onto the other iff they satisfy the equality pattern described above. Each partition can be mapped to a pooling-and-broadcasting operation as detailed in Appendix C.

This is similar to the construction of (Maron et al., 2018). Here, we generalize their partitions approach and the pooling-and-broadcasting implementation to arbitrary incidence tensors (through the decomposition of Theorem 4.1).

C. Pooling-and-broadcasting from Partitions

Orbits of an equivariant linear map W^{M→M′} ∶ X_{(δ1,...,δM)} ↦ X_{(δ′1,...,δ′M′)} are partitions of the set {δ1, . . . , δM, δ′1, . . . , δ′M′} of input and output node indices, where any subset of the partition contains at most one input and one output index, as described in Appendix B. Note that for the partitions of interest, subsets have either size one, with a single input or output index, or size two, with a pair of input and output indices. We map each partition to the corresponding pooling-and-broadcasting operation:

(a) input indices in subsets of size one are pooled over,

(b) input and output indices in the same subset are mapped onto each other,

(c) output indices in subsets of size one are broadcasted over.

Example 4. Consider the function W^{2→1} that maps faces of size two (edges) to faces of size one (nodes). We introduced this map, e.g., in Section 3.1. From (11), W^{2→1} has three orbits. Let δ1, δ2 be the input node indices and δ′1 the output node index. Following the rules outlined above we get: {{δ1}, {δ2}, {δ′1}} corresponds to pooling both edge dimensions and broadcasting the result to the output nodes; {{δ1, δ′1}, {δ2}} corresponds to pooling the second edge dimension and mapping the resulting one-dimensional vector to output nodes; finally, {{δ2, δ′1}, {δ1}} corresponds to pooling the first edge dimension and mapping the resulting one-dimensional vector to output nodes.
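The three orbits of Example 4 correspond to three array reductions; a short sketch of ours (not the paper's code) with an equivariance check:

```python
import numpy as np

def w_edge_to_node(X, w):
    """W^{2->1}: the three partitions of Example 4, acting on an edge-vector X
    of shape (N, N); w holds the three learnable parameters."""
    n = X.shape[0]
    return (w[0] * np.full(n, X.sum())   # {{d1},{d2},{d1'}}: pool both, broadcast
            + w[1] * X.sum(axis=1)       # {{d1,d1'},{d2}}  : pool the second index
            + w[2] * X.sum(axis=0))      # {{d2,d1'},{d1}}  : pool the first index

rng = np.random.default_rng(5)
X, w, perm = rng.normal(size=(6, 6)), rng.normal(size=3), rng.permutation(6)
assert np.allclose(w_edge_to_node(X[perm][:, perm], w), w_edge_to_node(X, w)[perm])
```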

D. More on Abstract Polytopes

The structure of an abstract polytope is encoded using a graded partially ordered set (poset). A poset is a set equipped with a partial order that enables transitive comparison of certain members. A poset Π is graded if there exists a rank function rank ∶ Π ↦ N satisfying the following constraints:

δ < δ′ ⇒ rank(δ) < rank(δ′) ∀δ, δ′ ∈ Π,
∄ δ″ ∈ Π s.t. δ < δ″ < δ′ ⇒ rank(δ′) = rank(δ) + 1.

An abstract polytope is a set Π of partially ordered faces of different dimension. In a geometric realisation, the partial order is naturally defined by the inclusion of a lower-dimensional face in a higher-dimensional one (e.g., an edge that appears in a face of a cube). Fig. 4 shows the partial order for a cube, where we continue to use a set of nodes to identify a face.

We can define the incidence structure similarly to simplicial complexes. For example, we may assume that two faces δ, δ′ ∈ Π of different dimension (rank) are incident iff δ < δ′ or δ > δ′. Similarly, we may assume that two faces δ, δ′ of the same dimension d are incident iff there is a face δ″ of dimension d − 1 incident to both of them, δ″ < δ, δ′. Note that if the polytope is irregular, faces of similar rank may have different sizes; e.g., consider the soccer ball, where pentagons and hexagons have the same rank.

E. Additional Results for Higher Order Geometric Structures

E.1. Additional Symmetry of Undirected Faces

When counting the number of unique pooling and broadcasting operations, we assumed that X_{δ^{(M)}} = X_{(δ1,...,δM)} ≠ X_{(π⋅δ1,...,π⋅δM)}. However, if δ^{(M)} is an undirected face, the face-vector is invariant to permutations of its nodes. This symmetry reduces the number of independent parameters in (10). Furthermore, if we enforce the same symmetry on the output face-vector, some of the weights need to be tied together. In particular, there is only one vector of faces of size m that can be extracted through pooling for each value of 0 ≤ m ≤ min(M, M′). Similarly, all possible ways of broadcasting the pooled face-vector to the target face-vector have to carry the same parameter in order to restore symmetry of the output. Thus, in comparison to (11), for the symmetric input and symmetric output case, the degrees of freedom of the equivariant map are significantly reduced:

τ_symm^{(M→M′)} = Σ_{m=0}^{min(M,M′)} 1 = min(M, M′) + 1. (30)

Example 5 (Symmetric node-adjacency tensors). Consider a setup similar to that of Example 3, but for the case of undirected faces. In this case we map a symmetric input into a symmetric output. From (13), where we now use (30) for the symmetric case, we get

τ = Σ_{m,m′=1}^{D} ( min(m, m′) + 1 ) = (2D³ + 9D² + D)/6. (31)

Note that we have omitted the multiplicity coefficients κ_m, κ_{m′}, assuming that all faces of a given dimension are equal.

E.2. Representation and Expressiveness

Consider a simplicial complex or polytope where faces of a particular dimension have associated attributes. This information may be directly represented using face-vectors ⊍_{m=1}^{M} X_{δ^{(m)}}. Alternatively, we may only use the largest face δ^{(M)}, and broadcast all lower-dimensional data to the maximal face-vectors. This gives an equivalent representation

⊍_{m=1}^{M} X_{δ^{(m)}} ≡ ⊍_{m=1}^{M} Bcast_{B,M}(X_{δ^{(m)}}) = ⊍_{k=1}^{M} X^k_{δ^{(M)}},

that resembles having multiple channels, indexed by k. Yet another alternative is to use an incidence tensor X_{δ1,...,δM | Σ} ≅ ⊍_{m=1}^{M} κ_m X_{δ^{(m)}} that decomposes into face-vectors according to Theorem 4.1. We have similarly diverse alternatives for the "output" of an equivariant map. Observe that the corresponding equivariant maps have the same expressiveness, up to a change in the number of channels. This is because we could produce the same pool-broadcast features of (10) across different representations. An example is the node-edge incidence matrix discussed in Section 3.1, where node features are broadcasted across all edges incident to each node and whose equivariant layer has the same expressiveness as the layer for the corresponding node-node matrix.

E.3. Non-Linear Layers and Relaxation of the Symmetry Group

Non-linear layers, where the sparsity of the output is controlled by an input-dependent mask as in (5), are easily generalized to arbitrary incidence tensors. Similarly, as discussed in Section 3.3 for a node-edge incidence matrix, one can consider independent permutations for each dimension of the incidence tensor. Let N_m = |{δ^{(m)}}| denote the number of faces of size m, so that N_1 = N. Consider the action of π = (π^1, . . . , π^D) ∈ S_{N_{|δ1|}} × ⋯ × S_{N_{|δD|}} on X_{δ1,...,δD}:

π ⋅ X_{δ1,...,δD} = X_{π^1⋅δ1,...,π^D⋅δD},

where π^d ⋅ δ_d is one of N_{|δd|}! permutations of these faces. The corresponding equivariant layer introduced in (Hartford et al., 2018) has τ = 2^D unique parameters.
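A quick check of the closed form in (31); the helper below is our own sketch:

```python
def tau_symm_total(D):
    """Left-hand side of eq. (31): sum of min(m, m') + 1 over 1 <= m, m' <= D."""
    return sum(min(m, mp) + 1 for m in range(1, D + 1) for mp in range(1, D + 1))

for D in range(1, 8):
    assert tau_symm_total(D) == (2 * D**3 + 9 * D**2 + D) // 6

print(tau_symm_total(2))   # 9, the operation count used for W_X in Appendix A
```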