
Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data


Marc Finzi 1 Samuel Stanton 1 Pavel Izmailov 1 Andrew Gordon Wilson 1

Abstract

The translation equivariance of convolutional layers enables convolutional neural networks to generalize well on image problems. While translation equivariance provides a powerful inductive bias for images, we often additionally desire equivariance to other transformations, such as rotations, especially for non-image data. We propose a general method to construct a convolutional layer that is equivariant to transformations from any specified Lie group with a surjective exponential map. Incorporating equivariance to a new group requires implementing only the group exponential and logarithm maps, enabling rapid prototyping. Showcasing the simplicity and generality of our method, we apply the same model architecture to images, ball-and-stick molecular data, and Hamiltonian dynamical systems. For Hamiltonian systems, the equivariance of our models is especially impactful, leading to exact conservation of linear and angular momentum.

Figure 1. Many modalities of spatial data do not lie on a grid, but still possess important symmetries. We propose a single model to learn from continuous spatial data that can be specialized to respect a given continuous group.

¹New York University. Correspondence to: Marc Finzi, Samuel Stanton, Pavel Izmailov, Andrew Gordon Wilson.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Symmetry pervades the natural world. The same law of gravitation governs a game of catch, the orbits of our planets, and the formation of galaxies. It is precisely because of the order of the universe that we can hope to understand it. Once we started to understand the symmetries inherent in physical laws, we could predict behavior in galaxies billions of light-years away by studying our own local region of time and space. For statistical models to achieve their full potential, it is essential to incorporate our knowledge of naturally occurring symmetries into the design of algorithms and architectures. An example of this principle is the translation equivariance of convolutional layers in neural networks (LeCun et al., 1995): when an input (e.g. an image) is translated, the output of a convolutional layer is translated in the same way.

Group theory provides a mechanism to reason about symmetry and equivariance. Convolutional layers are equivariant to translations, and are a special case of group convolution. A group convolution is a general linear transformation equivariant to a given group, used in group equivariant convolutional networks (Cohen and Welling, 2016a).

In this paper, we develop a general framework for equivariant models on arbitrary continuous (spatial) data represented as coordinates and values {(x_i, f_i)}_{i=1}^N. Spatial data is a broad category, including ball-and-stick representations of molecules, the coordinates of a dynamical system, and images (shown in Figure 1). When the inputs or group elements lie on a grid (e.g., image data) one can simply enumerate the values of the convolutional kernel at each group element. But in order to extend to continuous data, we define the convolutional kernel as a continuous function on the group parameterized by a neural network.

We consider the large class of continuous groups known as Lie groups, which in most cases can be parameterized in terms of a vector space of infinitesimal generators (the Lie algebra) via the logarithm and exponential maps. Many useful transformations are Lie groups, including translations, rotations, and scalings.
We propose LieConv, a convolutional layer that can be made equivariant to a given Lie group by defining exp and log maps. We demonstrate the expressivity and generality of LieConv with experiments on images, molecular data, and dynamical systems. We emphasize that we use the same network architecture for all transformation groups and data types. LieConv achieves state-of-the-art performance in these domains, even compared to domain-specific architectures. In short, the main contributions of this work are as follows:

• We propose LieConv, a new convolutional layer equivariant to transformations from Lie groups. Models composed with LieConv layers can be applied to non-homogeneous spaces and arbitrary spatial data.

• We evaluate LieConv on the image classification benchmark dataset rotMNIST (Larochelle et al., 2007), and the regression benchmark dataset QM9 (Blum and Reymond, 2009; Rupp et al., 2012). LieConv outperforms state-of-the-art methods on some tasks in QM9, and in all cases achieves competitive results.

• We apply LieConv to modeling the Hamiltonian of physical systems, where equivariance corresponds to the preservation of physical quantities (energy, angular momentum, etc.). LieConv outperforms state-of-the-art methods for the modeling of dynamical systems.

We make code available at https://github.com/mfinzi/LieConv

2. Related Work

One approach to constructing equivariant CNNs, first introduced in Cohen and Welling (2016a), is to use standard convolutional kernels and transform them or the feature maps for each of the elements in the group. For discrete groups this approach leads to exact equivariance and uses the so-called regular representation of the group (Cohen et al., 2019). This approach is easy to implement, and has also been used when the feature maps are vector fields (Zhou et al., 2017; Marcos et al., 2017), and with other representations (Cohen and Welling, 2016b), but only on image data where locations are discrete and the group cardinality is small. This approach has the disadvantage that the computation grows quickly with the size of the group, and some groups like 3D rotations cannot be easily discretized onto a lattice that is also a subgroup.

Another approach, drawing on harmonic analysis, finds a basis of equivariant functions and parametrizes convolutional kernels in that basis (Worrall et al., 2017; Weiler and Cesa, 2019; Jacobsen et al., 2017). These kernels can be used to construct networks that are exactly equivariant to continuous groups. While the approach has been applied to general data types like spherical images (Esteves et al., 2018; Cohen et al., 2018), voxel data (Weiler et al., 2018), and point clouds (Thomas et al., 2018; Anderson et al., 2019), the requirement of working out the representation theory for the group can be cumbersome and is limited to compact groups. Our approach reduces the amount of work needed to implement equivariance to a new group, enabling rapid prototyping.

There is also work applying Lie group theory to deep neural networks. Huang et al. (2017) define a network where the intermediate activations are 3D rotations representing skeletal poses and embed elements into the Lie algebra using the log map. Bekkers (2019) uses the log map to express an equivariant convolution kernel through the use of B-splines, which they evaluate on a grid and apply to image problems. While similar in motivation, their method is not readily applicable to point data and can only be used when the equivariance group acts transitively on the input space. Both of these issues are addressed by our work.

3. Background

3.1. Equivariance

A mapping h(·) is equivariant to a set of transformations G if when we apply any transformation g to the input of h, the output is also transformed by g. The most common example of equivariance in deep learning is the translation equivariance of convolutional layers: if we translate the input image by an integer number of pixels in x and y, the output is also translated by the same amount (ignoring the regions close to the boundary of the image). Formally, if h : A → A, and G is a set of transformations acting on A, we say h is equivariant to G if ∀a ∈ A, ∀g ∈ G,

h(ga) = gh(a).    (1)

The continuous convolution of a function f : R → R with the kernel k : R → R is equivariant to translations in the sense that L_t(k ∗ f) = k ∗ L_t f, where L_t translates the function by t: L_t f(x) = f(x − t).

It is easy to construct invariant functions, where transformations on the input do not affect the output, by simply discarding information. Strict invariance unnecessarily limits the expressive power by discarding relevant information, and instead it is necessary to use equivariant transformations that preserve the information.
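To make Eq. (1) concrete for the translation group, the following minimal check (an illustration assuming NumPy, not part of the paper's code) verifies numerically that a circular 1D convolution commutes with a shift L_t:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=32)          # input signal
k = rng.normal(size=32)          # convolutional kernel
t = 5                            # translation (in samples)

def conv(k, f):
    # circular convolution, so translation equivariance holds exactly
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(f)))

def translate(f, t):
    return np.roll(f, t)         # L_t f(x) = f(x - t)

lhs = translate(conv(k, f), t)   # L_t(k * f)
rhs = conv(k, translate(f, t))   # k * (L_t f)
assert np.allclose(lhs, rhs)     # the two agree: equivariance
```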

3.2. Groups of Transformations and Lie Groups

Many important sets of transformations form a group. To form a group the set must be closed under composition, include an identity transformation, each element must have an inverse, and composition must be associative. The set of 2D rotations, SO(2), is a simple and instructive example. Composing two rotations r_1 and r_2 yields another rotation r = r_2 ∘ r_1. There exists an identity id ∈ G that maps every point in R² to itself (i.e., rotation by a zero angle). And for every rotation r, there exists an inverse rotation r⁻¹ such that r ∘ r⁻¹ = r⁻¹ ∘ r = id. Finally, the composition of rotations is an associative operation: (r_1 ∘ r_2) ∘ r_3 = r_1 ∘ (r_2 ∘ r_3). Satisfying these conditions, SO(2) is indeed a group.

We can also adopt a more familiar view of SO(2) in terms of angles, where a rotation R : R → R^{2×2} is parametrized as R(θ) = exp(Jθ). Here J = [[0, −1], [1, 0]] is the antisymmetric matrix (an infinitesimal generator of the group) and exp is the matrix exponential. Note that θ is totally unconstrained. Using R(θ) we can add and subtract rotations. Given θ_1, θ_2 we can compute R(θ_1)⁻¹R(θ_2) = exp(−Jθ_1 + Jθ_2) = R(θ_2 − θ_1). R(θ) = exp(Jθ) is an example of the Lie algebra parametrization of a group, and SO(2) forms a Lie group.

More generally, a Lie group is a group whose elements form a smooth manifold. Since G is not necessarily a vector space, we cannot add or subtract group elements. However, the Lie algebra of G, the tangent space at the identity, g = T_id G, is a vector space and can be understood informally as a space of infinitesimal transformations from the group. As a vector space, one can readily expand elements in a basis A = Σ_k a^k e_k and use the components for calculations.

The exponential map exp : g → G gives a mapping from the Lie algebra to the Lie group, converting infinitesimal transformations to group elements. In many cases, the image of the exponential map covers the group, and an inverse mapping log : G → g can be defined. For matrix groups the exp map coincides with the matrix exponential (exp(A) = I + A + A²/2! + ...), and the log map with the matrix logarithm. Matrix groups are particularly amenable to our method because in many cases the exp and log maps can be computed in closed form. For example, there are analytic solutions for the translation group T(d), the 3D rotation group SO(3), the translation and rotation group SE(d) for d = 2, 3, the rotation-scale group R* × SO(2), and many others (Eade, 2014). In the event that an analytic solution is not available there are reliable numerical methods at our disposal (Moler and Van Loan, 2003).
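As an illustration of such closed-form maps (a minimal sketch assuming NumPy; this is not the paper's implementation), the exp and log maps for SO(2) can be written directly from the generator J above:

```python
import numpy as np

J = np.array([[0., -1.],
              [1.,  0.]])        # infinitesimal generator of SO(2)

def exp_so2(theta):
    # matrix exponential exp(J*theta) in closed form
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def log_so2(R):
    # principal matrix logarithm of a rotation, returned as the angle theta
    return np.arctan2(R[1, 0], R[0, 0])

theta1, theta2 = 0.7, 2.1
R1, R2 = exp_so2(theta1), exp_so2(theta2)

# exp and log are inverse on the principal branch
assert np.isclose(log_so2(exp_so2(theta1)), theta1)
# R(theta1)^{-1} R(theta2) = R(theta2 - theta1): subtraction in the Lie algebra
assert np.allclose(R1.T @ R2, exp_so2(theta2 - theta1))
```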
3.3. Group Convolutions

Adopting the convention of left equivariance, one can define a group convolution between two functions on the group, which generalizes the translation equivariance of convolution to other groups (Kondor and Trivedi, 2018; Cohen et al., 2019):

Definition 1. Let k, f : G → R, and µ(·) be the Haar measure on G. For any u ∈ G, the convolution of k and f on G at u is given by

h(u) = (k ∗ f)(u) = ∫_G k(v⁻¹u) f(v) dµ(v).    (2)

3.4. PointConv Trick

In order to extend learnable convolution layers to point clouds, which do not have the regular grid structure of images, Dai et al. (2017), Simonovsky and Komodakis (2017), and Wu et al. (2019) go back to the continuous definition of a convolution for a single channel between a learned function (convolutional filter) k_θ(·) : R^d → R^{c_out×c_in} and an input feature map f(·) : R^d → R^{c_in}, yielding the function h(·) : R^d → R^{c_out},

h(x) = (k_θ ∗ f)(x) = ∫ k_θ(x − y) f(y) dy.    (3)

We approximate the integral using a discretization:

h(x_i) = (V/n) Σ_j k_θ(x_i − x_j) f(x_j).    (4)

Here V is the volume of the space integrated over and n is the number of points. In a 3×3 convolutional layer for images, where points fall on a uniform square grid, the filter k_θ has independent parameters for each of the inputs (−1, −1), (−1, 0), ..., (1, 1). In order to accommodate points that are not on a regular grid, k_θ can be parametrized as a small neural network, mapping input offsets to filter matrices, as explored with MLPs in Simonovsky and Komodakis (2017). The compute and memory costs have severely limited this approach: for typical CIFAR-10 images with batchsize = 32, N = 32 × 32, c_in = c_out = 256, n = 3 × 3, evaluating a single layer requires computing 20 billion values for k.

In PointConv, Wu et al. (2019) develop a trick where clever reordering of the computation cuts memory and computational requirements by ∼2 orders of magnitude, allowing them to scale to the point cloud classification and segmentation datasets ModelNet40 and ShapeNet, and the image dataset CIFAR-10. We review and generalize the Efficient-PointConv trick in Appendix A.1, which we will use to accelerate our method.
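The cost of Eq. (4) comes from materializing a filter matrix for every pair of points. The snippet below is a minimal sketch of this naive continuous-filter convolution (assuming PyTorch; the sizes and the two-layer MLP are illustrative choices rather than PointConv's implementation), which makes the N² · c_out · c_in memory footprint explicit:

```python
import torch
import torch.nn as nn

c_in, c_out, N = 8, 16, 64                      # channels and number of points

# k_theta maps an offset x_i - x_j in R^3 to a (c_out x c_in) filter matrix
k_theta = nn.Sequential(
    nn.Linear(3, 32), nn.SiLU(),                # SiLU == Swish
    nn.Linear(32, c_out * c_in),
)

x = torch.randn(N, 3)                           # point coordinates
f = torch.randn(N, c_in)                        # feature values at the points

offsets = x[:, None, :] - x[None, :, :]         # (N, N, 3) pairwise offsets
# materializing all N*N filter matrices is the memory bottleneck the
# Efficient-PointConv trick (Appendix A.1) avoids
filters = k_theta(offsets).view(N, N, c_out, c_in)

# h(x_i) = (V/n) * sum_j k_theta(x_i - x_j) f(x_j); take V/n = 1/N here
h = torch.einsum('ijab,jb->ia', filters, f) / N  # (N, c_out)
```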
4. Convolutional Layers on Lie Groups

We now introduce LieConv, a new convolutional layer that can be made equivariant to a given Lie group. Models with LieConv layers can act on arbitrary collections of coordinates and values {(x_i, f_i)}_{i=1}^N, for x_i ∈ X and f_i ∈ V, where V is a vector space. The domain X is usually a low-dimensional space like R² or R³, such as for molecules, point clouds, the configurations of a mechanical system, images, time series, videos, geostatistics, and other kinds of spatial data.

Figure 2. Visualization of the lifting procedure. Panel (a) shows a point x in the original input space X. In panels (b)–(f) we illustrate the lifted embeddings for different groups, in the form [u, q], where u ∈ G is an element of the group and q ∈ X/G identifies the orbit (see Section 4.5): (a) Data, (b) Trivial, (c) T(2), (d) SO(2), (e) R* × SO(2), (f) SE(2). For SE(2) the lifting is multi-valued.
We begin with a high-level overview of the method. In Section 4.1 we discuss transforming raw inputs x_i into group elements u_i on which we can perform group convolution. We refer to this process as lifting. Section 4.2 addresses the irregular and varied arrangements of group elements that result from lifting arbitrary continuous input data by parametrizing the convolutional kernel k as a neural network. In Section 4.3, we show how to enforce the locality of the kernel by defining an invariant distance on the group. In Section 4.4, we define a Monte Carlo estimator for the group convolution integral in Eq. (2) and show that this estimator is equivariant in distribution. In Section 4.5, we extend the procedure to cases where the group does not act transitively on the input space (when we cannot map any point to any other point with a transformation from the group). Additionally, in Appendix A.2, we show that our method generalizes coordinate transform equivariance when G is Abelian. At the end of Section 4.5 we provide a concise algorithmic description of the lifting procedure and our new convolution layer.

4.1. Lifting from X to G

If X is a homogeneous space of G, then every two elements in X are connected by an element in G, and one can lift elements by simply picking an origin o and defining Lift(x) = {u ∈ G : uo = x}: all elements in the group that map the origin to x. This procedure enables lifting tuples of coordinates and features, {(x_i, f_i)}_{i=1}^N → {(u_ik, f_i)}_{i=1,k=1}^{N,K}, with up to K group elements for each input.¹ To find all the elements {u ∈ G : uo = x}, one simply needs to find one element u_x and use the elements in the stabilizer of the origin, H = {h ∈ G : ho = o}, to generate the rest with Lift(x) = {u_x h for h ∈ H}. For continuous groups the stabilizer may be infinite, and in these cases we sample uniformly using the Haar measure µ, which is described in Appendix C.2. We visualize the lifting procedure for different groups in Figure 2.

¹ When f_i = f(x_i), lifting in this way is equivalent to defining f↑(u) = f(uo) as in Kondor and Trivedi (2018).
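As a concrete example of this procedure (a minimal sketch assuming NumPy; the 3×3 homogeneous-matrix representation and helper names are illustrative, not the paper's code), a point in R² can be lifted to SE(2) by composing one translation that maps the origin to x with rotations sampled from the stabilizer SO(2):

```python
import numpy as np

def se2(theta, t):
    # homogeneous 3x3 matrix: rotation by theta followed by translation t
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, t[0]],
                     [s,  c, t[1]],
                     [0., 0., 1. ]])

def lift_se2(x, K, rng):
    # Lift a point x in R^2 to K elements u of SE(2) with u @ origin = x.
    # The stabilizer of the origin o = (0, 0) is SO(2), so we sample K
    # rotations uniformly (the Haar measure on SO(2)) and compose each with
    # one particular element mapping o to x (a pure translation).
    thetas = rng.uniform(0.0, 2 * np.pi, size=K)
    return [se2(0.0, x) @ se2(th, (0.0, 0.0)) for th in thetas]

rng = np.random.default_rng(0)
x = np.array([2.0, -1.0])
origin = np.array([0.0, 0.0, 1.0])              # homogeneous coordinates of o
for u in lift_se2(x, K=4, rng=rng):
    assert np.allclose((u @ origin)[:2], x)     # every lift maps o to x
```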
4.2. Parameterization of the Kernel

The conventional method for implementing an equivariant convolutional network (Cohen and Welling, 2016a) requires enumerating the values of k(·) over the elements of the group, with separate parameters for each element. This procedure is infeasible for irregularly sampled data and problematic even for a discretization because there is no generalization between different group elements. Instead of having a discrete mapping from each group element to the kernel values, we parametrize the convolutional kernel as a continuous function k_θ using a fully connected neural network with Swish activations, varying smoothly over the elements in the Lie group.

However, as neural networks are best suited to learning on Euclidean data and G does not form a vector space, we propose to model k by mapping onto the Lie algebra g, which is a vector space, and expanding in a basis for the space. To do so, we restrict our attention in this paper to Lie groups whose exponential maps are surjective, where every element has a logarithm. This means defining k_θ(u) = (k ∘ exp)_θ(log u), where k̃_θ = (k ∘ exp)_θ is the function parametrized by an MLP, k̃_θ : g → R^{c_out×c_in}. Surjectivity of the exp map guarantees that exp ∘ log = id, although not in the other order.
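A minimal sketch of such a kernel network (assuming PyTorch; the class name and hidden sizes are illustrative choices, not the paper's implementation) takes Lie-algebra coordinates and returns a c_out × c_in filter matrix:

```python
import torch
import torch.nn as nn

class KernelMLP(nn.Module):
    """k_tilde_theta: Lie-algebra coordinates (plus optional orbit
    embeddings) -> a (c_out x c_in) filter matrix."""

    def __init__(self, lie_dim, c_in, c_out, hidden=64):
        super().__init__()
        self.c_in, self.c_out = c_in, c_out
        self.net = nn.Sequential(
            nn.Linear(lie_dim, hidden), nn.SiLU(),   # SiLU == Swish
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, c_out * c_in),
        )

    def forward(self, a):
        # a: (..., lie_dim) coordinates of log(v^{-1} u) in a Lie-algebra basis
        return self.net(a).view(*a.shape[:-1], self.c_out, self.c_in)

k = KernelMLP(lie_dim=3, c_in=8, c_out=16)       # e.g. SE(2): dim g = 3
filters = k(torch.randn(10, 10, 3))              # shape (10, 10, 16, 8)
```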

4.3. Enforcing Locality

Important both to the inductive biases of convolutional neural networks and their computational efficiency is the fact that convolutional filters are local, k_θ(u_i − u_j) = 0 for ‖u_i − u_j‖ > r. In order to quantify locality on matrix groups, we introduce the function

d(u, v) := ‖log(u⁻¹v)‖_F,    (5)

where log is the matrix logarithm and ‖·‖_F is the Frobenius norm. The function is left invariant, since d(wu, wv) = ‖log(u⁻¹w⁻¹wv)‖_F = d(u, v), and is a semi-metric (it does not necessarily satisfy the triangle inequality). In Appendix A.3 we show the conditions under which d(u, v) is additionally the distance along the geodesic connecting u, v, a generalization of the well known formula for the geodesic distance between rotations, ‖log(R_1ᵀR_2)‖_F (Kuffner, 2004).

To enforce that our learned convolutional filter k is local, we can use our definition of distance to only evaluate the sum for d(u, v) < r, implicitly setting k_θ(v⁻¹u) = 0 outside a local neighborhood nbhd(u) = {v : d(u, v) ≤ r},

h(u) = ∫_{v ∈ nbhd(u)} k_θ(v⁻¹u) f(v) dµ(v).    (6)

This restriction to a local neighborhood does not break equivariance precisely because d(·, ·) is left invariant. Since d(u, v) = d(v⁻¹u, id), this restriction is equivalent to multiplying by an indicator function, k_θ(v⁻¹u) → k_θ(v⁻¹u) 1[d(v⁻¹u, id) ≤ r], which depends only on v⁻¹u. Note that equivariance would have been broken if we used neighborhoods that depend on fixed regions in the input space like the square 3 × 3 region. Figure 3 shows what these neighborhoods look like in terms of the input space.

Figure 3. A visualization of the local neighborhood for R* × SO(2), in terms of the points in the input space. For the computation of h at the point in orange, elements are sampled from the colored region. Notice that the same points enter the calculation when the image is transformed by a rotation and scaling. We visualize the neighborhoods for other groups in Appendix C.6.

4.4. Discretization of the Integral

Assuming that we have a collection of quadrature points {v_j}_{j=1}^N as input and the function f_j = f(v_j) evaluated at these points, we can judiciously choose to evaluate the convolution at another set of group elements {u_i}_{i=1}^N, so as to have a set of quadrature points to approximate an integral in a subsequent layer. Because we have restricted the integral (6) to the compact neighbourhood nbhd(u), we can define a proper sampling distribution µ|_{nbhd(u)} to estimate the integral, unlike for the possibly unbounded G. Computing the outputs only at these target points, we use the Monte Carlo estimator for (6) as

h(u_i) = (k ∗_V f)(u_i) = (1/n_i) Σ_{j ∈ nbhd(i)} k(v_j⁻¹u_i) f(v_j),    (7)

where n_i = |nbhd(i)| is the number of points sampled in each neighborhood.

For v_j ∼ µ|_{nbhd(u)}(·), the Monte Carlo estimator is equivariant (in distribution).

Proof: Recalling that we can absorb the local neighborhood into the definition of k_θ using an indicator function, we have

(k ∗_V L_w f)(u_i) = (1/n_i) Σ_j k(v_j⁻¹u_i) f(w⁻¹v_j)
                  = (1/n_i) Σ_j k(ṽ_j⁻¹w⁻¹u_i) f(ṽ_j)
                  = (k ∗_V f)(w⁻¹u_i) = L_w(k ∗_V f)(u_i),

where the last line holds in distribution. Here ṽ_j := wv_j, and the last line follows from the fact that the random variables wv_j and v_j are equal in distribution because they are sampled from the Haar measure, with the property dµ(wv) = dµ(v). The equivariance also holds deterministically when the sampling locations are transformed along with the function, v_j → wv_j. Now that we have the discretization h_i = (1/n_i) Σ_{j ∈ nbhd(i)} k̃_θ(log(v_j⁻¹u_i)) f_j, we can accelerate this computation using the Efficient-PointConv trick, with the argument a_ij = log(v_j⁻¹u_i) for the MLP. See Appendix A.1 for more details. Note that we can also apply this discretization of the convolution when the inputs are not functions f_i = f(x_i), but simply coordinates and values {(x_i, f_i)}_{i=1}^N, and the mapping {(u_i, f_i)}_{i=1}^N → {(u_i, h_i)}_{i=1}^N is still equivariant, which we also demonstrate empirically in Table B.1. We also detail two methods for equivariantly subsampling the elements to further reduce the cost in Appendix A.4.
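The estimator in Eq. (7) is straightforward to realize once the pairwise Lie-algebra offsets are available. Below is a minimal sketch (assuming PyTorch; the class and argument names are illustrative, and it can be paired with a kernel network like the KernelMLP sketched in Section 4.2) for an abelian group such as T(d), where log(v_j⁻¹u_i) is simply the difference of the algebra coordinates; a general matrix group would require a matrix logarithm here.

```python
import torch
import torch.nn as nn

class MCLieConv(nn.Module):
    """Monte Carlo estimate of the group convolution, Eq. (7), sketched for
    an abelian group such as T(d) where log(v_j^{-1} u_i) = a_i - a_j.
    `kernel` maps Lie-algebra offsets to (c_out x c_in) filter matrices;
    `radius` defines the local neighborhood of Eq. (6)."""

    def __init__(self, kernel, radius):
        super().__init__()
        self.kernel, self.radius = kernel, radius

    def forward(self, a, f):
        # a: (N, d) Lie-algebra coordinates of the lifted elements
        # f: (N, c_in) feature values
        aij = a[:, None, :] - a[None, :, :]              # (N, N, d) offsets
        dist = aij.norm(dim=-1)                          # invariant distance
        mask = (dist < self.radius).float()              # nbhd indicator
        filters = self.kernel(aij)                       # (N, N, c_out, c_in)
        h = torch.einsum('ij,ijab,jb->ia', mask, filters, f)
        n = mask.sum(dim=-1, keepdim=True).clamp(min=1)  # points per nbhd
        return h / n                                     # (N, c_out)
```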

4.5. More Than One Orbit?

In this paper, we consider groups both large and small, and we require the ability to enable or disable equivariances like translations. To achieve this functionality, we need to go beyond the usual setting of homogeneous spaces considered in the literature, where every pair of elements in X is related by an element in G. Instead, we consider the quotient space Q = X/G, consisting of the distinct orbits of G in X (visualized in Figure 4).² Each of these orbits q ∈ Q is a homogeneous space of the group, and when X is a homogeneous space of G then there is only a single orbit. But in general, there will be many distinct orbits, and lifting should preserve the information on which orbit each point is on.

Figure 4. Orbits of SO(2) and T(1)_y containing input points in R². Unlike for T(2) and SE(2), not all points are contained in a single orbit of these smaller groups.

Since the most general equivariant mappings will use this orbit information, throughout the network the space of elements should not be G but rather G × X/G, and x ∈ X is lifted to the tuples (u, q) for u ∈ G and q ∈ Q. This mapping may be one-to-one or one-to-many depending on the size of H, but will preserve the information in x, as u o_q = x where o_q is the chosen origin for each orbit. General equivariant linear transforms can depend on both the input and output orbit, and equivariance only constrains the dependence on group elements and not the orbits.

When the space of orbits Q is continuous we can write the equivariant integral transform as

h(u, q) = ∫_{G,Q} k(v⁻¹u, q, q′) f(v, q′) dµ(v) dq′.    (8)

When G is the trivial group {id}, this equation simplifies to the integral transform h(x) = ∫ k(x, x′) f(x′) dx′, where each element in X is in its own orbit.

In general, even if X is a smooth manifold and G is a Lie group, it is not guaranteed that X/G is a manifold (Kono and Ishitoya, 1987). However, in practice this is not an issue as we will only have a finite number of orbits present in the data. All we need is an invertible way of embedding the orbit information into a vector space to be fed into k_θ. One option is to use an embedding of the orbit origin o_q, or simply find enough invariants of the group to identify the orbit. To give a few examples:

1. X = R^d and G = SO(d): Embed(q(x)) = ‖x‖
2. X = R^d and G = R*: Embed(q(x)) = x/‖x‖
3. X = R^d and G = T(k): Embed(q(x)) = x[k+1 : d]

Discretizing (8) as we did in (7), we get

h_i = (1/n_i) Σ_{j ∈ nbhd(i)} k̃_θ(log(v_j⁻¹u_i), q_i, q_j) f_j,    (9)

which again can be accelerated with the Efficient-PointConv trick by feeding in a_ij = Concat([log(v_j⁻¹u_i), q_i, q_j]) as input to the MLP. If we want the filter to be local over orbits also, we can extend the distance, d((u_i, q_i), (v_j, q_j))² = d(u_i, v_j)² + α‖q_i − q_j‖², which need not be invariant to transformations on q. To the best of our knowledge, we are the first to systematically address equivariances of this kind, where X is not a homogeneous space of G.

To recap, Algorithms 1 and 2 give a concise overview of our lifting procedure and our new convolution layer respectively. Please consult Appendix C.1 for additional implementation details.

Algorithm 1: Lifting from X to G × X/G
  Inputs: spatial data {(x_i, f_i)}_{i=1}^N (x_i ∈ X, f_i ∈ R^{c_in}).
  Returns: matrix-orbit-value tuples {(u_j, q_j, f_j)}_{j=1}^{NK}.
  For each orbit q ∈ X/G, choose an origin o_q.
  For each o_q, compute its stabilizer H_q.
  for i = 1, ..., N do
    Find the orbit q_i ∈ X/G s.t. x_i ∈ q_i.
    Sample {v_j}_{j=1}^K, where v_j ∼ µ(H_{q_i}) (see C.2).
    Compute an element u_i ∈ G s.t. u_i o_{q_i} = x_i.
    Z_i = {(u_i v_j, q_i, f_i)}_{j=1}^K.
  end for
  return Z

Algorithm 2: The Lie Group Convolution Layer
  Inputs: matrix-orbit-value tuples {(u_j, q_j, f_j)}_{j=1}^m.
  Returns: convolved matrix-orbit-values {(u_i, q_i, h_i)}_{i=1}^m.
  for i = 1, ..., m do
    u_i⁻¹ = exp(−log(u_i)).
    nbhd_i = {j : d((u_i, q_i), (u_j, q_j)) < r}.
    for j = 1, ..., m do
      a_ij = Concat([log(u_i⁻¹u_j), q_i, q_j]).
    end for
    h_i = (1/n_i) Σ_{j ∈ nbhd_i} k_θ(a_ij) f_j (see A.1).
  end for
  return (u, q, h)

² When X is a homogeneous space, the quantity of interest is the quotient with the stabilizer of the origin H: G/H ≅ X, which has been examined extensively in the literature. Here we are concerned with the separate quotient space Q = X/G, relevant when X is not a homogeneous space.
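The orbit identifiers in examples 1–3 above are cheap to compute; a minimal sketch (assuming NumPy; function names are illustrative, not the paper's code) is:

```python
import numpy as np

# Illustrative orbit identifiers Embed(q(x)) for the three examples above:
# quantities invariant to the group yet sufficient to tell orbits apart.

def embed_orbit_so_d(x):
    return np.array([np.linalg.norm(x)])   # X = R^d, G = SO(d): the radius

def embed_orbit_scaling(x):
    return x / np.linalg.norm(x)            # X = R^d, G = R^*: the direction

def embed_orbit_translation(x, k):
    return x[k:]                            # X = R^d, G = T(k): coordinates
                                            # k+1..d (0-indexed slice x[k:])
```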

5. Applications to Image and Molecular Data

First, we evaluate LieConv on two types of problems: classification on image data and regression on molecular data. With LieConv as the convolution layers, we implement a bottleneck ResNet architecture with a final global pooling layer (Figure 5). For a detailed architecture description, see Appendix C.3. We use the same model architecture for all tasks and achieve performance competitive with task-specific specialized methods.

Table 1. Classification Error (%) on the RotMNIST dataset for LieConv with different group equivariances and baselines: G-CNN (Cohen and Welling, 2016a), H-Net (Worrall et al., 2017), ORN (Zhou et al., 2017), TI-Pooling (Laptev et al., 2016), RotEqNet (Marcos et al., 2017), E(2)-Steerable CNNs (Weiler and Cesa, 2019).

Baseline Methods
  G-CNN            2.28
  H-Net            1.69
  ORN              1.54
  TI-Pooling       1.2
  RotEqNet         1.09
  E(2)-Steerable   0.68

LieConv (Ours)
  Trivial          1.58
  T(1)_y           1.49
  T(2)             1.44
  SO(2)            1.42
  SO(2)×R*         1.27
  SE(2)            1.24

Table 2. QM9 Molecular Property Mean Absolute Error.

Task         α      Δε   ε_HOMO  ε_LUMO  µ     C_v        G    H    R²     U    U₀   ZPVE
Units        bohr³  meV  meV     meV     D     cal/mol K  meV  meV  bohr²  meV  meV  meV
NMP          .092   69   43      38      .030  .040       19   17   .180   20   20   1.500
SchNet       .235   63   41      34      .033  .033       14   14   .073   19   14   1.700
Cormorant    .085   61   34      38      .038  .026       20   21   .961   21   22   2.027
LieConv(T3)  .084   49   30      25      .032  .038       22   24   .800   19   19   2.280

Figure 5. A visual overview of the LieConv model architecture, which is composed of L LieConv bottleneck blocks that couple the values at different group elements together. The BottleBlock is a residual block with a LieConv layer between two linear layers.

5.1. Image Equivariance Benchmark

The RotMNIST dataset consists of 12k randomly rotated MNIST digits with rotations sampled uniformly from SO(2), separated into 10k for training and 2k for validation. This commonly used dataset has been a standard benchmark for equivariant CNNs on image data. To apply LieConv to image data we interpret each input image as a collection of N = 28 × 28 points on X = R² with associated binary values, {x_i, f(x_i)}_{i=1}^{784}, to which we apply a circular center crop. We note that LieConv is broadly targeting generic spatial data, and more practical equivariant methods exist specialized to images (e.g. Weiler and Cesa (2019)). However, as we demonstrate in Table 1, we are able to easily incorporate equivariance to a variety of different groups without changes to the method or the architecture of the network, while achieving performance competitive with methods that are not applicable beyond image data.

5.2. Molecular Data

Now we apply LieConv to the QM9 molecular property learning task (Wu et al., 2018). The QM9 regression dataset consists of small organic molecules encoded as a collection of 3D spatial coordinates for each of the atoms, and their atomic charges. The labels consist of various properties of the molecules such as heat capacity. This is a challenging task as there is no canonical origin or orientation for each molecule, and the target distribution is invariant to E(3) (translation, rotation, and reflection) transformations of the coordinates. Successful models must generalize across different spatial locations and orientations.

We first perform an ablation study on the HOMO problem of predicting the energy of the highest occupied molecular orbital for the molecules. We apply LieConv with different equivariance groups, combined with SO(3) data augmentation. The results are reported in Table 3. Of the three groups, our SE(3) network performs the best. We then apply T(3)-equivariant LieConv layers to the full range of tasks in the QM9 dataset and report the results in Table 2. We perform competitively with state-of-the-art methods (Gilmer et al., 2017; Schütt et al., 2018; Anderson et al., 2019), with lowest MAE on several of the tasks. See B.1 for a demonstration of the equivariance property and efficiency with limited data.

Table 3. LieConv performance (Mean Absolute Error in meV) for different groups on the HOMO regression problem.

            Cormorant  Trivial  SO(3)  T(3)  SE(3)
MAE (meV)   34         31.7     65.4   29.6  26.8

6. Modeling Dynamical Systems

Accurate transition models for macroscopic physical systems are critical components in control systems (Lenz et al., 2015; Kamthe and Deisenroth, 2017; Chua et al., 2018) and data-efficient reinforcement learning algorithms (Nagabandi et al., 2018; Janner et al., 2019). In this section we show how to enforce conservation of quantities such as linear and angular momentum in the modeling of Hamiltonian systems through LieConv symmetries.

6.1. Predicting Trajectories with Hamiltonian Mechanics

For dynamical systems, the equations of motion can be written in terms of the state z and time t: ż = F(z, t). Many physically occurring systems have Hamiltonian structure, meaning that the state can be split into generalized coordinates and momenta, z = (q, p), and the dynamics can be written as

dq/dt = ∂H/∂p,    dp/dt = −∂H/∂q    (10)

for some choice of scalar Hamiltonian H(q, p, t). H is often the total energy of the system, and can sometimes be split into kinetic and potential energy terms, H(q, p) = K(p) + V(q). The dynamics can also be written compactly as ż = J∇H for

J = [[0, I], [−I, 0]].

As shown in Greydanus et al. (2019), a neural network parametrizing Ĥ_θ(z, t) can be learned directly from trajectory data, providing substantial benefits in generalization over directly modeling F(z, t), and with better energy conservation. We follow the approach of Sanchez-Gonzalez et al. (2019) and Zhong et al. (2019). Given an initial condition z_0 and F_θ(z, t) = J∇Ĥ_θ, we employ a twice-differentiable model architecture and a differentiable ODE solver (Chen et al., 2018) to compute predicted states (ẑ_1, ..., ẑ_T) = ODESolve(z_0, F_θ, (t_1, t_2, ..., t_T)). The parameters of the Hamiltonian model Ĥ_θ can be trained directly through the L2 loss,

L(θ) = (1/T) Σ_{t=1}^T ‖ẑ_t − z_t‖²₂.    (11)
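A minimal sketch of the symplectic vector field F_θ(z) = J∇Ĥ_θ(z) using automatic differentiation (assuming PyTorch; the helper name and toy Hamiltonian are illustrative, and the ODE solver and training loop of the paper are omitted):

```python
import torch

def hamiltonian_dynamics(H, z):
    """Time derivative z_dot = J grad_z H(z) for z = (q, p) concatenated
    along the last dimension; a sketch of the symplectic gradient used to
    define F_theta, not the paper's exact training code."""
    z = z.requires_grad_(True)
    dH = torch.autograd.grad(H(z).sum(), z, create_graph=True)[0]
    dq, dp = dH.chunk(2, dim=-1)          # dH/dq, dH/dp
    return torch.cat([dp, -dq], dim=-1)   # dq/dt = dH/dp, dp/dt = -dH/dq

# Example: a single 1D harmonic oscillator, H = p^2/2 + q^2/2
H = lambda z: 0.5 * (z ** 2).sum(dim=-1)
z0 = torch.tensor([[1.0, 0.0]])           # (q, p)
print(hamiltonian_dynamics(H, z0))        # dq/dt = p = 0, dp/dt = -q = -1
```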

Figure 6. A qualitative example of the trajectory predictions over 100 time steps on the 2D spring problem given a set of initial conditions. We see that HLieConv (right) yields predictions that are accurate over a longer time than HOGN (left), a SOTA architecture for modeling interacting physical systems.

6.2. Exact Conservation of Momentum

While equivariance is broadly useful as an inductive bias, it has a very special implication for the modeling of Hamiltonian systems. Noether's Hamiltonian theorem states that each continuous symmetry in the Hamiltonian of a dynamical system has a corresponding conserved quantity (Noether, 1971; Butterfield, 2006). Symmetry with respect to the continuous transformations of translations and rotations leads directly to conservation of the total linear and angular momentum of the system, an extremely valuable property for modeling dynamical systems. In fact, all models that exactly conserve linear and angular momentum must have a corresponding translational and rotational symmetry. See Appendix A.5 for a primer on Hamiltonian symmetries, Noether's theorem, and the implications in the current setting.

As shown in Section 4, we can construct models that are equivariant to a large variety of continuous Lie group symmetries, and therefore we can exactly conserve associated quantities like linear and angular momentum. Figure 7(a) shows that using LieConv layers with a given T(2) and/or SO(2) symmetry, the model trajectories conserve linear and/or angular momentum with relative error close to machine epsilon, determined by the integrator tolerance. As there is no corresponding Noether conservation for discrete symmetry groups, discrete approaches to enforcing symmetry (Cohen and Welling, 2016a; Marcos et al., 2017) would not be nearly as effective.

Figure 7. Left: We can directly control whether linear and angular momentum is conserved by changing the model's symmetries. The components of linear and angular momentum of the integrated trajectories are conserved up to integrator tolerance. Middle: Momentum along the rollout trajectories for the LieConv models with different imposed symmetries. Right: Our method outperforms HOGN, a state-of-the-art model, on both state rollout error and system energy conservation.

6.3. Results

For evaluation, we compare a fully-connected (FC) Neural ODE model (Chen et al., 2018), ODE graph networks (OGN) (Battaglia et al., 2016), Hamiltonian ODE graph networks (HOGN) (Sanchez-Gonzalez et al., 2019), and our own LieConv architecture on predicting the motion of point particles connected by springs, as described in Sanchez-Gonzalez et al. (2019). Figure 6 shows example rollout trajectories, and our quantitative results are presented in Figure 7. In the spring problem N bodies with mass m_1, ..., m_N interact through pairwise spring forces with constants k_1, ..., k_{N×N}. The system preserves energy, linear momentum, and angular momentum. The behavior of the system depends both on the values of the system parameters (s = (k, m)) and on the initial conditions z_0. The dynamics model must learn not only to predict trajectories across a broad range of initial conditions, but also to infer the dependence on varied system parameters, which are additional inputs to the model. We compare models that attempt to learn the dynamics F_θ(z, t) = dz/dt directly against models that learn the Hamiltonian as described in Section 6.1.

In Figure 7(a) and 7(b) we show that by changing the invariance of our Hamiltonian models, we have direct control over the conservation of linear and angular momentum in the predicted trajectories. Figure 7(c) demonstrates that our method outperforms HOGN, a SOTA architecture for dynamics problems, and achieves significant improvement over the naïve fully-connected (FC) model. We summarize the various models and their symmetries in Table 6. Finally, in Figure 8 we evaluate test MSE of the different models over a range of training dataset sizes, highlighting the additive improvements in generalization from the Hamiltonian, Graph-Network, and equivariance inductive biases successively.

Figure 8. Test MSE as a function of the number of examples in the training dataset, N. As the inductive biases of Hamiltonian, Graph-Network, and LieConv equivariance are added, generalization performance improves. LieConv outperforms the other methods across all dataset sizes. The shaded region corresponds to a 95% confidence interval, estimated across 3 trials.

7. Discussion

We presented a convolutional layer to build networks that can handle a wide variety of data types, and flexibly swap out the equivariance of the model. While the image, molecular, and dynamics experiments demonstrate the generality of our method, there are many exciting application domains (e.g. time-series, geostats, audio, mesh) and directions for future work. We also believe that it will be possible to benefit from the inductive biases of HLieConv models even for systems that do not exactly preserve energy or momentum, such as those found in control systems and reinforcement learning.
The success of convolutional neural networks on images has highlighted the power of encoding symmetries in models for learning from raw sensory data. But the variety and complexity of other modalities of data is a significant challenge in further developing this approach. More general data may not be on a grid, it may possess other kinds of symmetries, or it may contain quantities that cannot be easily combined. We believe that central to solving this problem is a decoupling of convenient computational representations of data as dense arrays from the set of geometrically sensible operations they may have. We hope to move towards models that can 'see' molecules, dynamical systems, multi-scale objects, heterogeneous measurements, and higher mathematical objects, in the way that convolutional neural networks perceive images.

Acknowledgements

MF, SS, PI and AGW are supported by an Amazon Research Award, Amazon Machine Learning Research Award, Facebook Research, NSF I-DISRE 193471, NIH R01 DA048764-01A1, NSF IIS-1910266, NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science, and by the United States Department of Defense through the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program. We thank Alex Wang for helpful comments.

References

Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. In Advances in Neural Information Processing Systems, pages 14510–14519, 2019.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

Erik J Bekkers. B-spline CNNs on Lie groups. arXiv preprint arXiv:1909.12057, 2019.

L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.

Jeremy Butterfield. On symmetry and conserved quantities in classical mechanics. In Physical Theory and its Interpretation, pages 43–100. Springer, 2006.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016a.

Taco S Cohen and Max Welling. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pages 9142–9153, 2019.

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.

Ethan Eade. Lie groups for computer vision. Cambridge Univ., Cambridge, UK, Tech. Rep., 2014.

Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. arXiv preprint arXiv:1709.01889, 2017.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–68, 2018.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1263–1272. JMLR.org, 2017.

Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pages 15353–15363, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Zhiwu Huang, Chengde Wan, Thomas Probst, and Luc Van Gool. Deep learning on Lie groups for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6099–6108, 2017.

Jörn-Henrik Jacobsen, Bert De Brabandere, and Arnold WM Smeulders. Dynamic steerable blocks in deep residual networks. arXiv preprint arXiv:1706.00598, 2017.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.

Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. arXiv preprint arXiv:1706.06491, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Akira Kono and Kiminao Ishitoya. Squaring operations in mod 2 cohomology of quotients of compact Lie groups by maximal tori. In Algebraic Topology Barcelona 1986, pages 192–206. Springer, 1987.

James J Kuffner. Effective sampling and distance metrics for 3D rigid body path planning. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA'04, volume 4, pages 3993–3998. IEEE, 2004.

Dmitry Laptev, Nikolay Savinov, Joachim M Buhmann, and Marc Pollefeys. TI-Pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 289–297, 2016.

Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480, 2007.

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

Ian Lenz, Ross A Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, Rome, Italy, 2015.

Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5048–5057, 2017.

Cleve Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3–49, 2003.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.

Emmy Noether. Invariant variation problems. Transport Theory and Statistical Physics, 1(3):186–207, 1971.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012.

Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph networks with ODE integrators. arXiv preprint arXiv:1909.12790, 2019.

Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. SchNet: a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24):241722, 2018.

Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.

Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems, pages 14334–14345, 2019.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pages 10381–10392, 2018.

Benjamin Willson. Reiter nets for semidirect products of amenable groups and semigroups. Proceedings of the American Mathematical Society, 137(11):3823–3832, 2009.

Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.

Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2019.

Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.

Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 519–528, 2017.

Appendices

A. Derivations and Additional Methodology

A.1. Generalized PointConv Trick

The matrix notation becomes very cumbersome for manipulating these higher order n-dimensional arrays, so we will instead use index notation, with Latin indices i, j, k indexing points, Greek indices α, β, γ indexing feature channels, and c indexing the coordinate dimensions, of which there are d = 3 for PointConv and d = dim(G) + 2 dim(Q) for LieConv.³ As the objects are not geometric tensors but simply n-dimensional arrays, we will make no distinction between upper and lower indices.

Let k^{α,β}_{ij} be the output of the MLP k_θ, which takes {a^c_{ij}} as input and acts independently over the locations i, j. For PointConv, the input a^c_{ij} = x^c_i − x^c_j, and for LieConv the input a^c_{ij} = Concat([log(v_j⁻¹u_i), q_i, q_j])^c.

We wish to compute

h^α_i = Σ_{j,β} k^{α,β}_{ij} f^β_j.    (12)

In Wu et al. (2019), it was observed that since k^{α,β}_{ij} is the output of an MLP, k^{α,β}_{ij} = Σ_γ W^{α,β}_γ s^γ_{i,j} for some final weight matrix W and penultimate activations s^γ_{i,j} (s^γ_{i,j} is simply the result of the MLP after the last nonlinearity). With this in mind, we can rewrite (12):

h^α_i = Σ_{j,β} (Σ_γ W^{α,β}_γ s^γ_{i,j}) f^β_j    (13)
      = Σ_{β,γ} W^{α,β}_γ (Σ_j s^γ_{i,j} f^β_j).    (14)

In practice, the intermediate number of channels is much less than the product of c_in and c_out, |γ| < |α||β|, and so this reordering of the computation leads to a massive reduction in both memory and compute. Furthermore, b^{γ,β}_i = Σ_j s^γ_{i,j} f^β_j can be implemented with regular matrix multiplication, and so can h^α_i = Σ_{β,γ} W^{α,β}_γ b^{γ,β}_i, by flattening (β, γ) into a single axis ε: h^α_i = Σ_ε W^{α,ε} b^ε_i.

The sum over index j can be restricted to a subset j(i) (such as a chosen neighborhood) by computing f(·) at each of the required indices and padding to the size of the maximum subset with zeros, and computing b^{γ,β}_i = Σ_j s^γ_{i,j(i)} f^β_{j(i)} using dense matrix multiplication. Masking out the values at indices i and j is also necessary when there are different numbers of points per minibatch but batched together using zero padding. The generalized PointConv trick can thus be applied in batch mode when there may be a varied number of points per example and a varied number of points per neighborhood.

³ dim(Q) is the dimension of the space into which Q, the orbit identifiers, are embedded.
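The reordering in Eqs. (12)–(14) is easy to express with einsum. The following sketch (assuming PyTorch; the tensor sizes are illustrative) checks that the naive and reordered computations agree, with only the naive path materializing the (N, N, c_out, c_in) filter tensor:

```python
import torch

N, c_in, c_out, c_mid = 32, 16, 16, 8            # c_mid plays the role of gamma

W = torch.randn(c_out, c_in, c_mid, dtype=torch.float64)  # final weight matrix
s = torch.randn(N, N, c_mid, dtype=torch.float64)         # penultimate activations s_{ij}^gamma
f = torch.randn(N, c_in, dtype=torch.float64)             # input features f_j^beta

# Naive Eq. (12): materialize the (N, N, c_out, c_in) filter tensor k_{ij}
k = torch.einsum('abg,ijg->ijab', W, s)
h_naive = torch.einsum('ijab,jb->ia', k, f)

# Reordered Eqs. (13)-(14): contract s with f first, then apply W
b = torch.einsum('ijg,jb->igb', s, f)            # (N, c_mid, c_in)
h_fast = torch.einsum('abg,igb->ia', W, b)       # (N, c_out)

assert torch.allclose(h_naive, h_fast)
```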
The matrix logarithm is so this reordering of the computation leads| | to| a|| massive| ∈ ∈ reduction in both memory and compute. Furthermore, r cos(θ) r sin(θ) log(r) θ mod 2π log − = − , bγ,β = sγ f β can be implemented with regular ma- r sin(θ) r cos(θ) θ mod 2π log(r) i j i,j j       α α,β γ,β trix multiplication and hi = β,γ Wγ bi can be also or more compactly log(M(r, θ)) = log(r)I+(θ mod 2π)J, P α α,ε ε by flattening (β, γ) into a single axis ε: hi = ε W bi . which is mod in the basis for the Lie algebra P [log(r), θ 2π] [I,J]. It is clear that proj = log exp is simply mod 2π The sum over index j can be restricted to a subsetPj(i) (such ◦ β on the J component. as a chosen neighborhood) by computing f( ) at each of the · required indices and padding to the size of the maximum As R2 is a homogeneous space of G, one can choose the γ,β γ β 2 subset with zeros, and computing bi = j si,j(i)fj(i) us- global origin o = [1, 0] R . A little algebra shows that ing dense matrix multiplication. Masking out of the values ∈ P 4Here θ mod 2π is defined to mean θ + 2πn for the integer 3dim(Q) is the dimension of the space into which Q, the orbit n such that the value is in (−π, π), consistent with the principal identifiers, are embedded. matrix logarithm. (θ + π)%2π − π in programming notation. Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data lifting to the group yields the transformation ui = M(ri, θi) A.3. Sufficient Conditions for Geodesic Distance 2 2 for each point pi = uio, where r = x + y , and 1 In general, the function d(u, v) = log(v− u) F , defined θ = atan2(y, x) are the polar coordinates of the point pi. k k 1 p on the domain of GL(d) covered by the exponential map, Observe that the logarithm of v− u has a simple expression j i satisfies the first three conditions of a distance metric but highlighting the fact that it is invariant to scale and rotational not the triangle inequality, making it a semi-metric: transformations of the elements,

1 1 log(v− u ) = log(M(r , θ ) M(r , θ )) 1. d(u, v) 0 j i j j − i i ≥ 1 = log(ri/rj)I + (θi θj mod 2π)J. 2. d(u, v) = 0 log(u− v) = 0 u = v − ⇔ ⇔ 1 1 3. d(u, v) = log(v− u) = log(u− v) = d(v, u). k k k − k Now writing out our Monte Carlo estimation of the integral: However for certain subgroups of GL(d) with additional 1 structure, the triangle inequality holds and the function is h(p ) = k˜ (log(r /r ), θ θ mod 2π)f(p ), i n θ i j i − j j the distance along geodesics connecting group elements u j X and v according to the metric tensor which is a discretization of the log polar convolution from T T 1 A, B := Tr(A u− u− B), (16) Esteves et al.(2017). This can be trivially extended to h iu encompass cylindrical coordinates with the group T (1) where T denotes inverse and transpose. × − R∗ SO(2). × Specifically, if the subgroup G is in the image of the exp : : For another nontrivial example, g G map and each infinitesmal generator commutes with → T consider the group of scalings and squeezes G = R∗ SQ its transpose: [A, A ] = 0 for A g, then d(u, v) = 2 × 1 ∀ ∈ acting on the positive orthant = (x, y) R : x > log(v− u) F is the geodesic distance between u, v. 0, y > 0 . Elements of the groupX can{ be expressed∈ as the k k Geodesic Equation: product of} a squeeze mapping and a scaling Geodesics of (16) satisfying γ˙ γ˙ = 0 can equivalently be derived by minimizing the∇ energy s 0 r 0 rs 0 functional M(r, s) = = 0 1/s 0 r 0 r/s 1       T T 1 E[γ] = γ,˙ γ˙ γ dt = Tr(˙γ γ− γ− γ˙ )dt for any r, s +. As the group is abelian, the logarithm h i R Zγ Z0 splits nicely∈ in terms of the two generators I and A: using the calculus of variations. Minimizing curves γ(t), rs 0 1 0 1 0 connecting elements u and v in G (γ(0) = v, γ(1) = u) log = (log r) + (log s) . 0 r/s 0 1 0 1 satisfy        −  1 Again is a homogeneous space of G, and we choose a T T 1 single originX o = [1, 1]. With a little algebra, it is clear that 0 = δE = δ Tr(˙γ γ− γ− γ˙ )dt Z0 M(ri, si)o = pi where r = √xy and s = x/y are the Noting that δ(γ 1) = γ 1δγγ 1 and the linearity of the hyperbolic coordinates of pi. − − − p trace, − Expressed in the basis = [I,A] for the Lie algebra above, B 1 we see that T T 1 T T 1 1 2 Tr(˙γ γ− γ− δγ˙ ) Tr(˙γ γ− γ− δγγ− γ˙ )dt = 0. 1 0 − log(vj− ui) = log(ri/rj)I + log(si/sj)A Z yielding the expression for convolution Using the cyclic property of the trace and integrating by parts, we have that 1 h(pi) = k˜θ(log(ri/rj), log(si/sj))f(pj), 1 n d T T 1 1 T T 1 j 2 Tr (γ ˙ γ− γ− )+γ− γ˙ γ˙ γ− γ− δγ dt = 0, X − dt Z0   which is equivariant to squeezes and scalings.  where the boundary term T 1 1 vanishes since Tr(˙γγ− γ− δγ) 0 As demonstrated, equivariance to groups that contain the (δγ)(0) = (δγ)(1) = 0. input space in a single orbit and are abelian can be achieved with a simple coordinate transform; however our approach As δγ may be chosen to vary arbitrarily along the path, γ generalizes to groups that are both ’larger’ and ’smaller’ than must satisfy the geodesic equation: the input space, including coordinate transform equivariance d T T 1 1 T T 1 as a special case. (γ ˙ γ− γ− ) + γ− γ˙ γ˙ γ− γ− = 0. (17) dt Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data
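A quick numerical check of the semi-metric in Eq. (5) and its left invariance on R* × SO(2) (an illustration assuming NumPy and SciPy, not the paper's code):

```python
import numpy as np
from scipy.linalg import logm

def d(u, v):
    # semi-metric d(u, v) = || log(v^{-1} u) ||_F from Eq. (5)
    return np.linalg.norm(logm(np.linalg.solve(v, u)), 'fro')

def rot_scale(r, theta):
    c, s = np.cos(theta), np.sin(theta)
    return r * np.array([[c, -s], [s, c]])      # element of R^* x SO(2)

u = rot_scale(2.0, 0.3)
v = rot_scale(0.5, -1.2)
w = rot_scale(3.0, 2.0)

# left invariance: d(wu, wv) = d(u, v)
assert np.isclose(d(w @ u, w @ v), d(u, v))
```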

A.3. Sufficient Conditions for Geodesic Distance

In general, the function $d(u, v) = \|\log(v^{-1}u)\|_F$, defined on the domain of $GL(d)$ covered by the exponential map, satisfies the first three conditions of a distance metric but not the triangle inequality, making it a semi-metric:

1. $d(u, v)\ge 0$
2. $d(u, v) = 0 \Leftrightarrow \log(u^{-1}v) = 0 \Leftrightarrow u = v$
3. $d(u, v) = \|\log(v^{-1}u)\| = \|-\log(u^{-1}v)\| = d(v, u)$.

However, for certain subgroups of $GL(d)$ with additional structure, the triangle inequality holds and the function is the distance along geodesics connecting group elements $u$ and $v$ according to the metric tensor

$$\langle A, B\rangle_u := \mathrm{Tr}(A^T u^{-T} u^{-1} B),\qquad(16)$$

where $-T$ denotes the inverse transpose. Specifically, if the subgroup $G$ is in the image of the $\exp : \mathfrak{g}\to G$ map and each infinitesimal generator commutes with its transpose, $[A, A^T] = 0$ for all $A\in\mathfrak{g}$, then $d(u, v) = \|\log(v^{-1}u)\|_F$ is the geodesic distance between $u$ and $v$.

Geodesic Equation: Geodesics of (16) satisfying $\nabla_{\dot\gamma}\dot\gamma = 0$ can equivalently be derived by minimizing the energy functional

$$E[\gamma] = \int_\gamma \langle\dot\gamma, \dot\gamma\rangle_\gamma\,dt = \int_0^1 \mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\dot\gamma)\,dt$$

using the calculus of variations. Minimizing curves $\gamma(t)$ connecting elements $u$ and $v$ in $G$ ($\gamma(0) = v$, $\gamma(1) = u$) satisfy

$$0 = \delta E = \delta\int_0^1 \mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\dot\gamma)\,dt.$$

Noting that $\delta(\gamma^{-1}) = -\gamma^{-1}\delta\gamma\,\gamma^{-1}$ and the linearity of the trace,

$$\int_0^1 2\Big[\mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\delta\dot\gamma) - \mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\delta\gamma\,\gamma^{-1}\dot\gamma)\Big]\,dt = 0.$$

Using the cyclic property of the trace and integrating by parts, we have that

$$\int_0^1 -2\,\mathrm{Tr}\Big(\Big[\tfrac{d}{dt}(\dot\gamma^T\gamma^{-T}\gamma^{-1}) + \gamma^{-1}\dot\gamma\dot\gamma^T\gamma^{-T}\gamma^{-1}\Big]\delta\gamma\Big)\,dt = 0,$$

where the boundary term $\mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\delta\gamma)\big|_0^1$ vanishes since $(\delta\gamma)(0) = (\delta\gamma)(1) = 0$. As $\delta\gamma$ may be chosen to vary arbitrarily along the path, $\gamma$ must satisfy the geodesic equation

$$\frac{d}{dt}(\dot\gamma^T\gamma^{-T}\gamma^{-1}) + \gamma^{-1}\dot\gamma\dot\gamma^T\gamma^{-T}\gamma^{-1} = 0.\qquad(17)$$

Solutions: When $A = \log(v^{-1}u)$ satisfies $[A, A^T] = 0$, the curve $\gamma(t) = v\exp(t\log(v^{-1}u))$ is a solution to the geodesic equation (17). Clearly $\gamma$ connects $u$ and $v$: $\gamma(0) = v$ and $\gamma(1) = u$. Plugging $\dot\gamma = \gamma A$ into the left hand side of equation (17), we have

$$\frac{d}{dt}(A^T\gamma^{-1}) + AA^T\gamma^{-1} = -A^T\gamma^{-1}\dot\gamma\,\gamma^{-1} + AA^T\gamma^{-1} = [A, A^T]\,\gamma^{-1} = 0.$$

Length of $\gamma$: The length of the curve $\gamma$ connecting $u$ and $v$ is $\|\log(v^{-1}u)\|_F$:

$$L[\gamma] = \int_\gamma\sqrt{\langle\dot\gamma,\dot\gamma\rangle_\gamma}\,dt = \int_0^1\sqrt{\mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\dot\gamma)}\,dt = \int_0^1\sqrt{\mathrm{Tr}(A^TA)}\,dt = \|A\|_F = \|\log(v^{-1}u)\|_F.$$

Of the Lie groups that we consider in this paper, all of which have a single connected component, the groups $G = T(d)$, $SO(d)$, $\mathbb{R}^*\times SO(d)$, and $\mathbb{R}^*\times SQ$ satisfy this property that $[\mathfrak{g}, \mathfrak{g}^T] = 0$; however, the $SE(d)$ groups do not.

A.4. Equivariant Subsampling

Even if all distances and neighborhoods are precomputed, the cost of computing equation (6) for $i = 1, \dots, N$ is still quadratic, $O(nN) = O(N^2)$, because the number of points $n$ in each neighborhood grows linearly with $N$ as $f$ is more densely evaluated. So that our method can scale to handle a large number of points, we show two ways to equivariantly subsample the group elements, which we can use both for the locations at which we evaluate the convolution and for the locations that we use for the Monte Carlo estimator. Since the elements are spaced irregularly, we cannot readily use the coset pooling method described in Cohen and Welling (2016a); instead we can perform:

Random Selection: Randomly selecting a subset of $p$ points from the original $n$ preserves the original sampling distribution, so it can be used.

Farthest Point Sampling: Given a set of group elements $S = \{u_i\}_{i=1}^k \subset G$, we can select a subset $S_p^*$ of size $p$ that maximizes the minimum distance between any two elements in that subset,

$$\mathrm{Sub}_p(S) := S_p^* = \arg\max_{S_p\subset S}\ \min_{u, v\in S_p:\,u\ne v} d(u, v),\qquad(18)$$

farthest point sampling on the group. Acting on a set of elements, $\mathrm{Sub}_p : S\mapsto S_p^*$, the farthest point subsampling is equivariant, $\mathrm{Sub}_p(wS) = w\,\mathrm{Sub}_p(S)$ for any $w\in G$, meaning that applying a group element to each of the elements does not change the chosen indices in the subsampled set, because the distances are left invariant: $d(u_i, u_j) = d(wu_i, wu_j)$.

Now we can use either of these methods for $\mathrm{Sub}_p(\cdot)$ to equivariantly subsample the quadrature points in each neighborhood used to estimate the integral to a fixed number $p$,

$$h_i = \frac{1}{p}\sum_{j\in\mathrm{Sub}_p(\mathrm{nbhd}(u_i))} k_\theta(v_j^{-1}u_i)\,f_j.\qquad(19)$$

Doing so reduces the cost of estimating the convolution from $O(N^2)$ to $O(pN)$, ignoring the cost of computing $\mathrm{Sub}_p$ and $\{\mathrm{nbhd}(u_i)\}_{i=1}^N$.
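Exactly maximizing the objective in Eq. (18) is combinatorial, so in practice a greedy farthest-point selection is the usual stand-in. The sketch below is a generic greedy version (our own, for illustration) operating on a precomputed matrix of group distances $d(u_i, u_j)$; it remains equivariant in the sense above as long as the starting index is chosen independently of any global transformation.

```python
import numpy as np

def farthest_point_subsample(dists, p, start=0):
    """Greedy farthest-point selection of p indices given a precomputed pairwise
    distance matrix dists[i, j] = d(u_i, u_j); a standard greedy stand-in for Eq. (18)."""
    chosen = [start]
    min_dist = dists[start].copy()         # distance of every element to the chosen set
    for _ in range(p - 1):
        nxt = int(np.argmax(min_dist))     # element farthest from the current subset
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dists[nxt])
    return np.array(chosen)

pts = np.random.randn(50, 3)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)   # stand-in for d(u_i, u_j)
idx = farthest_point_subsample(D, p=8)
```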
A.5. Review and Implications of Noether's Theorem

In the Hamiltonian setting, Noether's theorem relates the continuous symmetries of the Hamiltonian of a system to conserved quantities, and it has been deeply impactful in the understanding of classical physics. We give a review of Noether's theorem, loosely following Butterfield (2006).

More on Hamiltonian Dynamics

As introduced earlier, the Hamiltonian is a function acting on the state, $H(z) = H(q, p)$ (we will ignore time dependence for now). It can be viewed more formally as a function on the cotangent bundle $z = (q, p)\in M = T^*C$, where $C$ is the coordinate configuration space, and this is the setting for Hamiltonian dynamics.

In general, on a manifold $M$, a vector field $X$ can be viewed as an assignment of a directional derivative along $M$ for each point $z\in M$. It can be expanded in a basis using coordinate charts, $X = \sum_\alpha X^\alpha\partial_\alpha$, where $\partial_\alpha = \partial/\partial z^\alpha$, and it acts on functions $f$ by $X(f) = \sum_\alpha X^\alpha\partial_\alpha f$. In the chart, each of the components $X^\alpha$ is a function of $z$.

In Hamiltonian mechanics, for two functions on $M$ there is the Poisson bracket, which can be written in terms of the canonical coordinates $q_i, p_i$:

$$\{f, g\} = \sum_i \frac{\partial f}{\partial p_i}\frac{\partial g}{\partial q_i} - \frac{\partial f}{\partial q_i}\frac{\partial g}{\partial p_i}.$$

(Here we take the definition of the Poisson bracket to be the negative of the usual definition in order to streamline notation.) The Poisson bracket can be used to associate to each function $f$ a vector field

$$X_f = \{f, \cdot\} = \sum_i \frac{\partial f}{\partial p_i}\frac{\partial}{\partial q_i} - \frac{\partial f}{\partial q_i}\frac{\partial}{\partial p_i},$$

which specifies, by its action on another function $g$, the directional derivative of $g$ along $X_f$: $X_f(g) = \{f, g\}$. Vector fields that can be written in this way are known as Hamiltonian vector fields, and the Hamiltonian dynamics of the system are a special example, $X_H = \{H, \cdot\}$. In canonical coordinates $z = (p, q)$ this is the vector field $X_H = F(z) = J\nabla_z H$ (i.e. the symplectic gradient, as discussed in Section 6.1). Making this connection clear, a given scalar quantity evolves through time as $\dot f = \{H, f\}$. But this bracket can also be used to evaluate the rate of change of a scalar quantity along the flows of vector fields other than the dynamics, such as the flows of continuous symmetries.

Noether's Theorem

The flow $\varphi_\lambda^X$ by $\lambda\in\mathbb{R}$ of a vector field $X$ is the set of integral curves, the unique solution of the system of ODEs $\dot z^\alpha = X^\alpha$ with initial condition $z$, evaluated at parameter value $\lambda$; or, more abstractly, the iterated application of $X$: $\varphi_\lambda^X = \exp(\lambda X)$. Continuous symmetry transformations are the transformations that can be written as the flow $\varphi_\lambda^X$ of a vector field. The directional derivative characterizes how a function such as the Hamiltonian changes along the flow of $X$ and is a special case of the Lie derivative $\mathcal{L}$:

$$\mathcal{L}_X H = \frac{d}{d\lambda}\big(H\circ\varphi_\lambda^X\big)\Big|_{\lambda=0} = X(H).$$

A scalar function is invariant to the flow of a vector field if and only if the Lie derivative is zero:

$$H(\varphi_\lambda^X(z)) = H(z)\ \Leftrightarrow\ \mathcal{L}_X H = 0.$$

For all transformations that respect the Poisson bracket, which we add as a requirement for a symmetry, the vector field $X$ is (locally) Hamiltonian and there exists a function $f$ such that $X = X_f = \{f, \cdot\}$. If $M$ is a contractible domain such as $\mathbb{R}^{2n}$, then $f$ is globally defined. (More precisely, the Poisson bracket can be formulated in a coordinate-free manner in terms of a symplectic two-form $\omega$, $\{f, g\} = \omega(X_f, X_g)$. In the original coordinates $\omega = \sum_i dp_i\wedge dq_i$, and in this coordinate basis $\omega$ is represented by the matrix $J$ from earlier. The dynamics $X_H$ are determined by $dH = \omega(X_H, \cdot) = \iota_{X_H}\omega$. Transformations which respect the Poisson bracket are symplectic, $\mathcal{L}_X\omega = 0$; with Cartan's magic formula, this implies that $d(\iota_X\omega) = 0$. Because the form $\iota_X\omega$ is closed, Poincaré's lemma implies that locally $\iota_X\omega = df$ for some function $f$, and hence $X = X_f$ is (locally) a Hamiltonian vector field. For more details see Butterfield (2006).) For every continuous symmetry $\varphi_\lambda^{X_f}$,

$$\mathcal{L}_{X_f} H = X_f(H) = \{f, H\} = -\{H, f\} = -X_H(f)$$

by the antisymmetry of the Poisson bracket. So if $\varphi_\lambda^X$ is a symmetry of $H$, then $X = X_f$ for some function $f$, and $H(\varphi_\lambda^{X_f}(z)) = H(z)$ implies

$$\mathcal{L}_{X_f} H = 0\ \Leftrightarrow\ \mathcal{L}_{X_H} f = 0\ \Leftrightarrow\ f(\varphi_\tau^{X_H}(z)) = f(z),$$

or in other words $f(z(t+\tau)) = f(z(t))$ and $f$ is a conserved quantity of the dynamics. This implication goes both ways: if $f$ is conserved then $X_f$ is necessarily a symmetry of the Hamiltonian, and if $\varphi_\lambda^{X_f}$ is a symmetry of the Hamiltonian then $f$ is conserved.

Hamiltonian vs Dynamical Symmetries

So far we have been discussing Hamiltonian symmetries, invariances of the Hamiltonian. But in the study of dynamical systems there is a related concept of dynamical symmetries, symmetries of the equations of motion. This notion is also captured by the Lie derivative, but between vector fields. A dynamical system $\dot z = F(z)$ has a continuous dynamical symmetry $\varphi_\lambda^X$ if the flow along the dynamical system commutes with the symmetry:

$$\varphi_\lambda^X(\varphi_t^F(z)) = \varphi_t^F(\varphi_\lambda^X(z)),\qquad(20)$$

meaning that applying the symmetry transformation to the state and then flowing along the dynamical system is equivalent to flowing first and then applying the symmetry transformation. Equation (20) is satisfied if and only if the Lie derivative is zero:

$$\mathcal{L}_X F = [X, F] = 0,$$

where $[\cdot,\cdot]$ is the Lie bracket on vector fields. (The Lie bracket of vector fields produces another vector field and is defined by how it acts on functions: for any smooth function $g$, $[X, F](g) = X(F(g)) - F(X(g))$.)

For Hamiltonian systems, every Hamiltonian symmetry is also a dynamical symmetry. In fact, it is not hard to show that the Lie and Poisson brackets are related,

$$[X_f, X_g] = X_{\{f, g\}},$$

and this directly shows the implication: if $X_f$ is a Hamiltonian symmetry, $\{f, H\} = 0$, then

$$[X_f, F] = [X_f, X_H] = X_{\{f, H\}} = 0.$$

However, the converse is not true: dynamical symmetries of a Hamiltonian system are not necessarily Hamiltonian symmetries and thus might not correspond to conserved quantities. Furthermore, even if the system has a dynamical symmetry which is the flow along a Hamiltonian vector field $\varphi_\lambda^X$, $X = X_f = \{f, \cdot\}$, but the dynamics $F$ are not Hamiltonian, then the dynamics will not conserve $f$ in general. Both the symmetry and the dynamics must be Hamiltonian for the conservation laws to hold.

This fact is demonstrated by Figure 9, where the dynamics of the (non-Hamiltonian) equivariant LieConv-T(2) model have a T(2) dynamical symmetry with the generators $\partial_x, \partial_y$, which are Hamiltonian vector fields for $f = p_x$ and $f = p_y$, and yet linear momentum is not conserved by the model.
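To make the relation $X_H = J\nabla_z H$ and the conservation statement concrete, here is a small self-contained sketch on a 1D harmonic oscillator (our own toy example, not the paper's model). The Hamiltonian itself plays the role of the conserved quantity $f$, since $\{H, H\} = 0$.

```python
import numpy as np

# Toy 1D harmonic oscillator: H(q, p) = q**2/2 + p**2/2, with state z = (q, p).
def grad_H(z):
    q, p = z
    return np.array([q, p])                   # (dH/dq, dH/dp)

J = np.array([[0.0, 1.0], [-1.0, 0.0]])       # symplectic matrix in (q, p) coordinates

def F(z):
    return J @ grad_H(z)                       # Hamiltonian vector field X_H = J grad H

def rk4_step(z, dt):
    k1 = F(z); k2 = F(z + dt/2*k1); k3 = F(z + dt/2*k2); k4 = F(z + dt*k3)
    return z + dt/6*(k1 + 2*k2 + 2*k3 + k4)

z = np.array([1.0, 0.0])
H0 = 0.5 * (z**2).sum()
for _ in range(1000):
    z = rk4_step(z, 1e-2)
print(abs(0.5 * (z**2).sum() - H0))            # ~0 up to integration error
```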

[Figure 9: two panels showing Energy and Linear Momentum of the integrated trajectories over time t (0 to 250) for LieConv-T2, HLieConv-Trivial, HLieConv-T2, and the ground truth.]

Figure 9. Equivariance alone is not sufficient; for conservation we need both to model H and to incorporate the given symmetry. For comparison, LieConv-T(2) is T(2)-equivariant but models F, and HLieConv-Trivial models H but is not T(2)-equivariant. Only HLieConv-T(2) conserves linear momentum.

Conserving Linear and Angular Momentum

Consider a system of $N$ interacting particles described in Euclidean coordinates with positions and momenta $q_{im}, p_{im}$, such as the multi-body spring problem. Here the first index $i = 1, 2, 3$ indexes the spatial coordinates and the second $m = 1, 2, \dots, N$ indexes the particles. We will use the bolded notation $\mathbf{q}_m, \mathbf{p}_m$ to suppress the spatial indices, while still indexing the particles $m$ as in Section 6.1.

The total linear momentum along a given direction $n$ is $n\cdot P = \sum_{i,m} n_i p_{im} = n\cdot\big(\sum_m \mathbf{p}_m\big)$. Expanding the Poisson bracket, the Hamiltonian vector field is

$$X_{n\cdot P} = \{n\cdot P, \cdot\} = \sum_{i,m} n_i\frac{\partial}{\partial q_{im}} = \sum_m n\cdot\frac{\partial}{\partial\mathbf{q}_m},$$

which has the flow $\varphi_\lambda^{X_{n\cdot P}}(\mathbf{q}_m, \mathbf{p}_m) = (\mathbf{q}_m + \lambda n,\ \mathbf{p}_m)$, a translation of all particles by $\lambda n$. So our model of the Hamiltonian conserves linear momentum if and only if it is invariant to a global translation of all particles (e.g. T(2) invariance for a 2D spring system).

The total angular momentum along a given axis $n$ is

$$n\cdot L = n\cdot\sum_m \mathbf{q}_m\times\mathbf{p}_m = \sum_{i,j,k,m}\epsilon_{ijk}\, n_i q_{jm} p_{km} = \sum_m \mathbf{p}_m^T A\,\mathbf{q}_m,$$

where $\epsilon_{ijk}$ is the Levi-Civita symbol and we have defined the antisymmetric matrix $A$ by $A_{kj} = \sum_i \epsilon_{ijk} n_i$. The corresponding Hamiltonian vector field is

$$X_{n\cdot L} = \{n\cdot L, \cdot\} = \sum_{j,k,m} A_{kj} q_{jm}\frac{\partial}{\partial q_{km}} - A_{jk} p_{jm}\frac{\partial}{\partial p_{km}} = \sum_m \mathbf{q}_m^T A^T\frac{\partial}{\partial\mathbf{q}_m} + \mathbf{p}_m^T A^T\frac{\partial}{\partial\mathbf{p}_m},$$

where the second equality follows from the antisymmetry of $A$. We can find the flow of $X_{n\cdot L}$ from the differential equations $\dot{\mathbf{q}}_m = A\mathbf{q}_m$, $\dot{\mathbf{p}}_m = A\mathbf{p}_m$, which have the solution

$$\varphi_\theta^{X_{n\cdot L}}(\mathbf{q}_m, \mathbf{p}_m) = (e^{\theta A}\mathbf{q}_m,\ e^{\theta A}\mathbf{p}_m) = (R_\theta\mathbf{q}_m,\ R_\theta\mathbf{p}_m),$$

where $R_\theta$ is a rotation about the axis $n$ by the angle $\theta$, which follows from the Rodrigues rotation formula. Therefore, the flow of the Hamiltonian vector field of angular momentum along a given axis is a global rotation of the position and momentum of each particle about that axis. Again, the dynamics of a neural network modeling a Hamiltonian conserve total angular momentum if and only if the network is invariant to simultaneous rotation of all particle positions and momenta.
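As a sanity check on the quantities above, the total linear and angular momentum along an axis $n$, and the antisymmetric matrix $A$, can be computed directly. This is a small self-contained NumPy sketch of ours, for illustration only:

```python
import numpy as np

def momenta_along_axis(q, p, n):
    """Total linear and angular momentum of a particle system along axis n.
    q, p: arrays of shape (N, 3) of positions and momenta; n: unit 3-vector."""
    n = np.asarray(n, dtype=float)
    lin = n @ p.sum(axis=0)                  # n . sum_m p_m
    ang = n @ np.cross(q, p).sum(axis=0)     # n . sum_m q_m x p_m
    return lin, ang

def A_matrix(n):
    """Antisymmetric A with A_{kj} = sum_i eps_{ijk} n_i, so that n.L = sum_m p_m^T A q_m."""
    nx, ny, nz = n
    return np.array([[0.0, -nz,  ny],
                     [ nz, 0.0, -nx],
                     [-ny,  nx, 0.0]])

q = np.random.randn(6, 3); p = np.random.randn(6, 3); n = np.array([0.0, 0.0, 1.0])
lin, ang = momenta_along_axis(q, p, n)
assert np.isclose(ang, np.einsum('mk,kj,mj->', p, A_matrix(n), q))
```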

B. Additional Experiments

B.1. Equivariance Demo

While (7) shows that the convolution estimator is equivariant, we have conducted the ablation study below examining the equivariance of the network empirically. We trained LieConv (Trivial, T(3), SO(3), SE(3)) models on a limited subset of 20k training examples (out of 100k) of the HOMO task on QM9 without any data augmentation. We then evaluate these models on a series of modified test sets where each example has been randomly transformed by an element of the given group. In Table 4 the rows are the models configured with a given group equivariance, and the columns N/G denote no augmentation at training time and transformations from G applied to the test set (the test translations in T(3) and SE(3) are sampled from a normal with standard deviation 0.5).

Model   | N/N | N/T(3) | N/SO(3) | N/SE(3)
Trivial | 173 | 183    | 239     | 243
T(3)    | 113 | 113    | 133     | 133
SO(3)   | 159 | 238    | 160     | 240
SE(3)   |  62 |  62    |  63     |  62

Table 4. Test MAE (in meV) on the HOMO test set randomly transformed by elements of G. Despite no data augmentation (N), G-equivariant models perform as well on G-transformed test data.

Notably, the performance of the LieConv-G models does not degrade when random G transformations are applied to the test set. Also, in this low-data regime, the added equivariances are especially important.

B.2. RotMNIST Comparison

While the RotMNIST dataset consists of 12k rotated MNIST digits, it is standard to separate out 10k to be used for training and 2k for validation. However, in TI-Pooling and E(2)-Steerable CNNs, it appears that after hyperparameters were tuned the validation set is folded back into the training set to be used as additional training data, a common approach used on other datasets. Although in Table 1 we only use 10k training points, in the table below we report the performance with and without augmentation when trained on the full 12k examples.

Aug   | Trivial | T_y  | T(2) | SO(2) | SO(2)×R* | SE(2)
SO(2) | 1.44    | 1.35 | 1.32 | 1.27  | 1.13     | 1.13
None  | 1.60    | 2.64 | 2.34 | 1.26  | 1.25     | 1.15

Table 5. Classification error (%) on the RotMNIST dataset for LieConv with different group equivariances, trained on the full 12k examples with and without rotational data augmentation.

C. Implementation Details

C.1. Practical Considerations

While the high-level summary of the lifting procedure (Algorithm 1) and the LieConv layer (Algorithm 2) provides a useful conceptual understanding of our method, there are some additional details that are important for a practical implementation.

1. According to Algorithm 2, $a_{ij}$ is computed in every LieConv layer, which is both highly redundant and costly. In practice, we precompute $a_{ij}$ once after lifting and feed it through the network, with layers operating on the state $\big(\{a_{ij}\}_{i,j}^{N,N}, \{f_i\}_{i=1}^N\big)$ instead of $\{(u_i, q_i, f_i)\}_{i=1}^N$. Doing so requires fixing the group elements that will be used at each layer for a given forwards pass.

2. In practice only $p$ elements of $\mathrm{nbhd}_i$ are sampled (randomly) for computing the Monte Carlo estimator in order to limit the computational burden (see Appendix A.4).

3. We use the analytic forms for the exponential and logarithm maps of the various groups as described in Eade (2014).

C.2. Sampling from the Haar Measure for Various Groups

When the lifting map from $\mathcal{X}\to G\times\mathcal{X}/G$ is multi-valued, we need to sample elements $u\in G$ that project down to $x$, $uo = x$, in a way consistent with the Haar measure $\mu(\cdot)$. In other words, since the restriction $\mu(\cdot)|_{\mathrm{nbhd}}$ is a distribution, we must sample from the conditional distribution $u\sim\mu(u\,|\,uo = x)|_{\mathrm{nbhd}}$. In general this can be done by parametrizing the distribution of $\mu$ as a collection of random variables that includes $x$, and then sampling the remaining variables.

In this paper, the groups we use in which the lifting map is multi-valued are SE(2), SO(3), and SE(3). The process is especially straightforward for SE(2) and SE(3), as these groups can be expressed as a semi-direct product of two groups $G = H\ltimes N$, for which the Haar measure factorizes as

$$d\mu_G(h, n) = \delta(h)\,d\mu_H(h)\,d\mu_N(n),\qquad(21)$$

where $\delta(h) = \frac{d\mu_N(n)}{d\mu_N(hnh^{-1})}$ (Willson, 2009). For $G = SE(d) = SO(d)\ltimes T(d)$, $\delta(h) = 1$ since the Lebesgue measure $d\mu_{T(d)}(x) = d\lambda(x) = dx$ is invariant to rotations, so simply $d\mu_{SE(d)}(R, x) = d\mu_{SO(d)}(R)\,dx$.

So lifts of a point $x\in\mathcal{X}$ to SE(d) consistent with $\mu$ are just $T_x R$, the multiplication of a translation by $x$ with randomly sampled rotations $R\sim\mu_{SO(d)}(\cdot)$. There are multiple easy methods to sample uniformly from SO(d), given in Kuffner (2004); for example, sampling uniformly from SO(3) can be done by sampling a unit quaternion from the 3-sphere and identifying it with the corresponding rotation matrix.
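As a sketch of the quaternion-based sampling just described, one can draw a Gaussian 4-vector, normalize it to obtain a uniformly distributed unit quaternion, and convert it to a rotation matrix; pairing the result with the translation by $x$ gives a lift consistent with the Haar measure. Function names below are our own:

```python
import numpy as np

def sample_so3(rng=np.random.default_rng()):
    """Sample a rotation uniformly from SO(3) (Haar measure) by drawing a unit
    quaternion uniformly from the 3-sphere and converting it to a matrix."""
    w, x, y, z = rng.normal(size=4)
    w, x, y, z = np.array([w, x, y, z]) / np.sqrt(w*w + x*x + y*y + z*z)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# A lift of a point x to SE(3): the translation by x paired with R ~ mu_SO(3).
x = np.array([0.3, -1.2, 0.5])
R = sample_so3()
```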
C.3. Model Architecture

We employ a ResNet-style architecture (He et al., 2016), using bottleneck blocks (Zagoruyko and Komodakis, 2016), and replacing ReLUs with Swish activations (Ramachandran et al., 2017). The convolutional kernel $g_\theta$ internal to each LieConv layer is parametrized by a 3-layer MLP with 32 hidden units, batch norm, and Swish nonlinearities. Not only do the Swish activations improve performance slightly, but unlike ReLUs they are twice differentiable, which is a requirement for backpropagating through the Hamiltonian dynamics. The stack of elementwise linear and bottleneck blocks is followed by a global pooling layer that computes the average over all elements, but not over channels. As with regular image bottleneck blocks, the channels for the convolutional layer in the middle are smaller by a factor of 4 for increased parameter and computational efficiency.

Downsampling: As is traditional for image data, we increase the number of channels and the receptive field at every downsampling step. The downsampling is performed with the farthest point downsampling method described in Appendix A.4. For downsampling by a factor of $s < 1$, the radius of the neighborhood is scaled up by $s^{-1/2}$ and the channels are scaled up by $s^{-1/2}$. When an image is downsampled with $s = (1/2)^2$, as is typical in a CNN, this results in 2x more channels and a radius or dilation of 2x. In the bottleneck block, the downsampling operation is fused with the LieConv layer, so that the convolution is only evaluated at the downsampled query locations. We perform downsampling only on the image datasets, which have more points.

BatchNorm: In order to handle the varied number of group elements per example and within each neighborhood, we use a modified batchnorm that computes statistics only over elements from a given mask. The batch norm is computed per channel, with statistics averaged over the batch size and the valid locations.
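The masked statistics can be written in a few lines. This is a minimal NumPy sketch of the idea (our own simplification: the learnable affine parameters and running statistics of a full batchnorm layer are omitted):

```python
import numpy as np

def masked_batchnorm(x, mask, eps=1e-5):
    """Normalize features per channel using statistics computed only over valid elements.
    x: (batch, elements, channels); mask: (batch, elements) boolean validity mask."""
    m = mask[..., None].astype(x.dtype)                 # (batch, elements, 1)
    count = m.sum()                                     # number of valid element slots
    mean = (x * m).sum(axis=(0, 1)) / count             # per-channel mean over valid entries
    var = (((x - mean) ** 2) * m).sum(axis=(0, 1)) / count
    return (x - mean) / np.sqrt(var + eps) * m          # padded slots stay at zero
```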
C.4. Details for Hamiltonian Models

Model Symmetries: As the position vectors are mean-centered in the model forward pass, $q_i' = q_i - \bar q$, HOGN and HLieConv-SO2* have an additional T(2) invariance, yielding SE(2) invariance for HLieConv-SO2*. We also experimented with an HLieConv-SE2 equivariant model, but found that the exponential map for SE2 (involving Taylor expansions and masking) was not numerically stable enough for second derivatives, which are required for optimizing through the Hamiltonian dynamics. So instead we benchmark the HLieConv-SO2 (without centering) and the HLieConv-SO2* (with centering) models separately. Layer equivariance is preferable for not prematurely discarding useful information and for better modeling performance, but invariance alone is sufficient for the conservation laws. Additionally, since we know a priori that the spring problem has Euclidean coordinates, we need not model the kinetic energy $K(p, m) = \sum_{j=1}^{n}\|p_j\|^2/m_j$ and can instead focus on modeling the potential $V(q, k)$. We observe that this additional inductive bias of Euclidean coordinates improves model performance. Table 6 shows the invariance and equivariance properties of the relevant models and baselines. For Noether conservation, we need both to model the Hamiltonian and to have the symmetry property.

Model            | F(z,t) | H(z,t) | T(2) | SO(2)
FC               |   •    |        |      |
OGN              |   •    |        |      |
HOGN             |        |   •    |  #   |
LieConv-T(2)     |   •    |        |  %   |
HLieConv-Trivial |        |   •    |      |
HLieConv-T(2)    |        |   •    |  %   |
HLieConv-SO(2)   |        |   •    |      |   %
HLieConv-SO(2)*  |        |   •    |  #   |   %

Table 6. Model characteristics. A • marks whether the model predicts the dynamics F(z,t) or the Hamiltonian H(z,t); models with layers invariant to G are denoted with #, and those with equivariant layers with %.

Dataset Generation: To generate the spring dynamics datasets we generate $D$ systems, each with $N = 6$ particles connected by springs. The system parameters, mass and spring constant, are set by sampling $\{m_1^{(i)},\dots,m_6^{(i)}, k_1^{(i)},\dots,k_6^{(i)}\}_{i=1}^{D}$ with $m_j^{(i)}\sim\mathcal{U}(0.1, 3.1)$ and $k_j^{(i)}\sim\mathcal{U}(0, 5)$. Following Sanchez-Gonzalez et al. (2019), we set the spring constants as $k_{ij} = k_i k_j$. For each system $i$, the position and momentum of body $j$ were distributed as $q_j^{(i)}\sim\mathcal{N}(0, 0.16 I)$, $p_j^{(i)}\sim\mathcal{N}(0, 0.36 I)$. Using the analytic form of the Hamiltonian for the spring problem, $\mathcal{H}(q, p) = K(p, m) + V(q, k)$, we use the RK4 numerical integration scheme to generate 5-second ground truth trajectories broken up into 500 evaluation timesteps. We use a fixed step size for RK4 chosen automatically (as implemented in Chen et al. (2018)) with a relative tolerance of 1e-8 in double precision arithmetic. We then randomly select a single segment from each trajectory, consisting of an initial state $z_t^{(i)}$ and $\tau = 4$ transition states $(z_{t+1}^{(i)},\dots, z_{t+\tau}^{(i)})$.

Training: All models were trained in single precision arithmetic (double precision did not make any appreciable difference) with an integrator tolerance of 1e-4. We use a cosine decay for the learning rate schedule and perform early stopping on the validation MSE. We trained with a minibatch size of 200 and for 100 epochs each, using the Adam optimizer (Kingma and Ba, 2014) without batch normalization. With 3k training examples, the HLieConv model takes about 20 minutes to train on one 1080Ti.

For the examination of performance over the range of dataset sizes in Figure 8, we cap the validation set to the size of the training set to make the setting more realistic, and we also scale the number of training epochs up as the size of the dataset shrinks (epochs = $100\sqrt{10^3/D}$), which we found to be sufficient to fit the training set. For $D\le 200$ we use the full dataset in each minibatch.
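To make the data-generation recipe above concrete, here is a small NumPy sketch of one system's parameters and a candidate Hamiltonian. The quadratic pairwise potential is our own assumption for illustration (the text above only states $\mathcal{H} = K(p, m) + V(q, k)$ with $k_{ij} = k_i k_j$); trajectories would then be produced by integrating $\dot z = J\nabla\mathcal{H}$ with RK4 as described.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                        # particles per system
m = rng.uniform(0.1, 3.1, size=N)            # masses  m_j ~ U(0.1, 3.1)
k = rng.uniform(0.0, 5.0, size=N)            # spring params k_j ~ U(0, 5)
K = np.outer(k, k)                           # pairwise constants k_ij = k_i k_j
q = rng.normal(0.0, 0.4, size=(N, 2))        # q_j ~ N(0, 0.16 I)
p = rng.normal(0.0, 0.6, size=(N, 2))        # p_j ~ N(0, 0.36 I)

def hamiltonian(q, p):
    # Kinetic term as written in C.4; the potential is our assumed pairwise
    # quadratic spring energy, not necessarily the paper's exact form.
    kinetic = ((p ** 2).sum(-1) / m).sum()
    diffs = q[:, None, :] - q[None, :, :]
    sq_dists = (diffs ** 2).sum(-1)           # ||q_i - q_j||^2
    potential = 0.25 * (K * sq_dists).sum()   # (1/2) sum_{i<j} k_ij ||q_i - q_j||^2
    return kinetic + potential

print(hamiltonian(q, p))
```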

Hyperparameters:

Model      | channels | layers | lr
(H)FC      | 256      | 4      | 1e-2
(H)OGN     | 256      | 1      | 1e-2
(H)LieConv | 384      | 4      | 1e-3

Hyperparameter tuning: Model hyperparameters were tuned by grid search over channel width, number of layers, and learning rate. The models were tuned with training, validation, and test datasets consisting of 3000, 2000, and 2000 trajectory segments respectively.

C.5. Details for Image and Molecular Experiments

RotMNIST Hyperparameters: For RotMNIST we train each model for 500 epochs using the Adam optimizer with learning rate 3e-3 and batch size 25. The first linear layer maps the 1-channel grayscale input to k = 128 channels, and the number of channels in the bottleneck blocks follows the scaling law from Appendix C.3 as the group elements are downsampled. We use 6 bottleneck blocks, and the total downsampling factor S = 1/10 is split geometrically between the blocks as s = (1/10)^{1/6} per block. The initial radius r of the local neighborhoods in the first layer is set so as to include 1/15 of the total number of elements in each neighborhood and is scaled accordingly. The subsampled neighborhood used to compute the Monte Carlo convolution estimator uses p = 25 elements. The models take less than 12 hours to train on a 1080Ti.

QM9 Hyperparameters: For the QM9 molecular data, we use the featurization from Anderson et al. (2019), where the input features f_i are determined by the atom type (C, H, N, O, F) and the atomic charge. The coordinates x_i are simply the raw atomic coordinates measured in angstroms. A separate model is trained for each prediction task, all using the same hyperparameters and early stopping on the validation MAE. We use the same train, validation, test split as Anderson et al. (2019), with 100k molecules for training, 10% for test, and the remainder for validation. As with the other experiments, we use a cosine learning rate decay schedule. Each model is trained using the Adam optimizer for 1000 epochs with a learning rate of 3e-3 and batch size of 100. We use SO(3) data augmentation and 6 bottleneck blocks, each with k = 1536 channels. The radius of the local neighborhood is set to r = ∞ to include all elements. The model takes about 48 hours to train on a single 1080Ti.
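The QM9 featurization above is described only at a high level; the snippet below shows one simple encoding in that spirit (a one-hot over the five atom types concatenated with the atomic charge). The function name and exact encoding are our own illustrative choices, not those of Anderson et al. (2019).

```python
import numpy as np

ATOM_TYPES = ['C', 'H', 'N', 'O', 'F']        # atom vocabulary used for QM9

def featurize(symbols, charges):
    """One possible per-atom feature: one-hot atom type concatenated with atomic charge."""
    onehot = np.array([[s == t for t in ATOM_TYPES] for s in symbols], dtype=float)
    return np.concatenate([onehot, np.asarray(charges, dtype=float)[:, None]], axis=1)

f = featurize(['C', 'H', 'H', 'O'], [6, 1, 1, 8])   # (num_atoms, 6) feature array
```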

C.6. Local Neighborhood Visualizations

In Figure 10 we visualize the local neighborhoods used with different groups under three different types of transformations: translations, rotations, and scalings. The distance and neighborhood are defined for the tuples of group elements and orbit. For Trivial, T(2), SO(2), and R*×SO(2) the correspondence between points and these tuples is one-to-one, and we can identify the neighborhood in terms of the input points. For SE(2) each point is mapped to multiple tuples, each of which defines its own neighborhood in terms of other tuples. In the figure, for a given point under SE(2), we visualize the distribution of points that enter the computation of the convolution at a specific tuple.

(a) Trivial   (b) T(2)   (c) SO(2)

(d) R*×SO(2)   (e) SE(2)

Figure 10. A visualization of the local neighborhood for different groups, in terms of the points in the input space. For the computation of the convolution at the point in red, elements are sampled from the colored region. In each panel, the top row shows translations, the middle row shows rotations, and the bottom row shows scalings of the same image. For SE(2) we visualize the distribution of points entering the computation of the convolution over multiple lift samples. For each of the equivariant models that respects a given symmetry, the points that enter into the computation are not affected by the transformation.