Appendices

A. Derivations and Additional Methodology

A.1. Generalized PointConv Trick

The notation becomes very cumbersome when manipulating these higher-order n-dimensional arrays, so we will instead use index notation, with Latin indices $i, j, k$ indexing points, Greek indices $\alpha, \beta, \gamma$ indexing feature channels, and $c$ indexing the coordinate dimensions, of which there are $d = 3$ for PointConv and $d = \dim(G) + 2\dim(Q)$ for LieConv. (Here $\dim(Q)$ is the dimension of the space into which $Q$, the orbit identifiers, are embedded.) As the objects are not geometric tensors but simply n-dimensional arrays, we make no distinction between upper and lower indices. After expanding into indices, all values should be understood as scalars, and any free index ranges over all of its values.

Let $k^{\alpha\beta}_{ij}$ be the output of the MLP $k_\theta$, which takes $\{a^c_{ij}\}$ as input and acts independently over the locations $i, j$. For PointConv the input is $a^c_{ij} = x^c_i - x^c_j$, and for LieConv the input is $a^c_{ij} = \mathrm{Concat}([\log(v_j^{-1}u_i), q_i, q_j])^c$.

We wish to compute

$$h^\alpha_i = \sum_{j,\beta} k^{\alpha\beta}_{ij} f^\beta_j. \qquad (12)$$

In Wu et al. (2019) it was observed that since $k^{\alpha\beta}_{ij}$ is the output of an MLP, $k^{\alpha\beta}_{ij} = \sum_\gamma W^{\alpha\beta}_\gamma s^\gamma_{ij}$ for some final weight matrix $W$ and penultimate activations $s^\gamma_{ij}$ ($s^\gamma_{ij}$ is simply the output of the MLP after the last nonlinearity). With this in mind, we can rewrite (12):

$$h^\alpha_i = \sum_{j,\beta}\Big(\sum_\gamma W^{\alpha\beta}_\gamma s^\gamma_{ij}\Big) f^\beta_j \qquad (13)$$

$$\phantom{h^\alpha_i} = \sum_{\beta,\gamma} W^{\alpha\beta}_\gamma \Big(\sum_j s^\gamma_{ij} f^\beta_j\Big). \qquad (14)$$

In practice the intermediate number of channels is much smaller than the product of $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$, $|\gamma| < |\alpha||\beta|$, so this reordering of the computation leads to a massive reduction in both memory and compute.

Furthermore, $b^{\gamma\beta}_i = \sum_j s^\gamma_{ij} f^\beta_j$ can be implemented with regular matrix multiplication, and so can $h^\alpha_i = \sum_{\beta,\gamma} W^{\alpha\beta}_\gamma b^{\gamma\beta}_i$ by flattening $(\beta, \gamma)$ into a single axis $\varepsilon$: $h^\alpha_i = \sum_\varepsilon W^{\alpha\varepsilon} b^\varepsilon_i$.

The sum over index $j$ can be restricted to a subset $j(i)$ (such as a chosen neighborhood) by computing $f(\cdot)$ at each of the required indices, padding to the size of the maximum subset with zeros, and computing $b^{\gamma\beta}_i = \sum_j s^\gamma_{i,j(i)} f^\beta_{j(i)}$ using dense matrix multiplication. Masking out the values at indices $i$ and $j$ is also necessary when different numbers of points per minibatch are batched together using zero padding. The generalized PointConv trick can thus be applied in batch mode with a varied number of points per example and a varied number of points per neighborhood.
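To make the reordering in (13)-(14) concrete, here is a minimal PyTorch sketch (our illustration, not the paper's reference implementation; tensor names and shapes are assumptions) that computes both sides with einsum and checks that they agree:

```python
import torch

N, cin, cout, cmid = 5, 8, 16, 4          # points, input/output/penultimate channels
s = torch.randn(N, N, cmid)               # penultimate MLP activations s^gamma_{ij}
W = torch.randn(cmid, cout, cin)          # final linear layer W^{alpha,beta}_gamma
f = torch.randn(N, cin)                   # input features f^beta_j

# Direct evaluation of (12): materialize the full kernel k^{alpha,beta}_{ij}.
k = torch.einsum('ijg,gab->ijab', s, W)   # (N, N, cout, cin)
h_direct = torch.einsum('ijab,jb->ia', k, f)

# Reordered evaluation of (13)-(14): contract over j first, then apply W.
b = torch.einsum('ijg,jb->igb', s, f)     # b^{gamma,beta}_i, shape (N, cmid, cin)
h_fast = torch.einsum('gab,igb->ia', W, b)

assert torch.allclose(h_direct, h_fast, atol=1e-5)
```

The reordered version never materializes the (N, N, cout, cin) kernel, which is where the memory savings come from; restricting j to a neighborhood and masking padded entries acts on the b tensor in the same way.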
A.2. Abelian G and Coordinate Transforms

For Abelian groups that cover $\mathcal{X}$ in a single orbit, the computation is very similar to ordinary Euclidean convolution. Defining $a_i = \log(u_i)$ and $b_j = \log(v_j)$, and using the fact that $e^{-b_j}e^{a_i} = e^{a_i - b_j}$, we have $\log(v_j^{-1}u_i) = (\log \circ \exp)(a_i - b_j)$. Defining $\tilde{f} = f \circ \exp$ and $\tilde{h} = h \circ \exp$, we get

$$\tilde{h}(a_i) = \frac{1}{n}\sum_{j \in \mathrm{nbhd}(i)} (\tilde{k}_\theta \circ \mathrm{proj})(a_i - b_j)\, \tilde{f}(b_j), \qquad (15)$$

where $\mathrm{proj} = \log \circ \exp$ projects onto the image of the logarithm map. Apart from a projection and a change to logarithmic coordinates, this is equivalent to Euclidean convolution in a vector space with the dimensionality of the group. When the group is Abelian and $\mathcal{X}$ is a homogeneous space, the dimension of the group equals the dimension of the input. In these cases we have a trivial stabilizer group $H$ and a single origin $o$, so we can view $f$ and $h$ as acting on the input $x_i = u_i o$.

This directly generalizes some of the existing coordinate-transform methods for achieving equivariance from the literature, such as log-polar coordinates for rotation and scaling equivariance (Esteves et al., 2017), and hyperbolic coordinates for squeeze and scaling equivariance.

Log-Polar Coordinates: Consider the Abelian Lie group of positive scalings and rotations, $G = \mathbb{R}^* \times SO(2)$, acting on $\mathbb{R}^2$. Elements of the group $M \in G$ can be expressed as a $2 \times 2$ matrix

$$M(r, \theta) = \begin{pmatrix} r\cos\theta & -r\sin\theta \\ r\sin\theta & r\cos\theta \end{pmatrix}$$

for $r \in \mathbb{R}^+$ and $\theta \in \mathbb{R}$. The matrix logarithm is

$$\log \begin{pmatrix} r\cos\theta & -r\sin\theta \\ r\sin\theta & r\cos\theta \end{pmatrix} = \begin{pmatrix} \log r & -(\theta \bmod 2\pi) \\ \theta \bmod 2\pi & \log r \end{pmatrix},$$

or more compactly $\log(M(r,\theta)) = \log(r)\, I + (\theta \bmod 2\pi)\, J$, which is $[\log r,\ \theta \bmod 2\pi]$ in the basis $\mathcal{B} = [I, J]$ for the Lie algebra. (Here $\theta \bmod 2\pi$ is defined to mean $\theta + 2\pi n$ for the integer $n$ such that the value lies in $(-\pi, \pi)$, consistent with the principal matrix logarithm; $(\theta + \pi)\,\%\,2\pi - \pi$ in programming notation.) It is clear that $\mathrm{proj} = \log \circ \exp$ is simply mod $2\pi$ on the $J$ component.

As $\mathbb{R}^2$ is a homogeneous space of $G$, one can choose the global origin $o = [1, 0] \in \mathbb{R}^2$. A little algebra shows that lifting to the group yields the transformation $u_i = M(r_i, \theta_i)$ for each point $p_i = u_i o$, where $r = \sqrt{x^2 + y^2}$ and $\theta = \mathrm{atan2}(y, x)$ are the polar coordinates of the point $p_i$. Observe that the logarithm of $v_j^{-1} u_i$ has a simple expression highlighting the fact that it is invariant to scale and rotational transformations of the elements:

$$\log(v_j^{-1} u_i) = \log\big(M(r_j, \theta_j)^{-1} M(r_i, \theta_i)\big) = \log(r_i/r_j)\, I + (\theta_i - \theta_j \bmod 2\pi)\, J.$$

Now writing out our Monte Carlo estimate of the integral,

$$h(p_i) = \frac{1}{n}\sum_j \tilde{k}_\theta\big(\log(r_i/r_j),\ \theta_i - \theta_j \bmod 2\pi\big) f(p_j),$$

which is a discretization of the log-polar convolution from Esteves et al. (2017). This can be trivially extended to encompass cylindrical coordinates with the group $T(1) \times \mathbb{R}^* \times SO(2)$.

For another nontrivial example, consider the group of scalings and squeezes $G = \mathbb{R}^* \times SQ$ acting on the positive orthant $\mathcal{X} = \{(x, y) \in \mathbb{R}^2 : x > 0,\ y > 0\}$. Elements of the group can be expressed as the product of a squeeze mapping and a scaling

$$M(r, s) = \begin{pmatrix} s & 0 \\ 0 & 1/s \end{pmatrix}\begin{pmatrix} r & 0 \\ 0 & r \end{pmatrix} = \begin{pmatrix} rs & 0 \\ 0 & r/s \end{pmatrix}$$

for any $r, s \in \mathbb{R}^+$. As the group is Abelian, the logarithm splits nicely in terms of the two generators $I$ and $A$:

$$\log\begin{pmatrix} rs & 0 \\ 0 & r/s \end{pmatrix} = (\log r)\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} + (\log s)\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

Again $\mathcal{X}$ is a homogeneous space of $G$, and we choose a single origin $o = [1, 1]$. With a little algebra, it is clear that $M(r_i, s_i)o = p_i$, where $r = \sqrt{xy}$ and $s = \sqrt{x/y}$ are the hyperbolic coordinates of $p_i$. Expressed in the basis $\mathcal{B} = [I, A]$ for the Lie algebra above, we see that

$$\log(v_j^{-1} u_i) = \log(r_i/r_j)\, I + \log(s_i/s_j)\, A,$$

yielding the expression for convolution

$$h(p_i) = \frac{1}{n}\sum_j \tilde{k}_\theta\big(\log(r_i/r_j), \log(s_i/s_j)\big) f(p_j),$$

which is equivariant to squeezes and scalings.

As demonstrated, equivariance to groups that contain the input space in a single orbit and are Abelian can be achieved with a simple coordinate transform; however, our approach generalizes to groups that are both 'larger' and 'smaller' than the input space, including coordinate-transform equivariance as a special case.
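As a sanity check on the log-polar case, the following NumPy sketch (illustrative only; function names are ours) lifts two points to $G = \mathbb{R}^*\times SO(2)$ and verifies that $\log(v^{-1}u)$ is unchanged when both points are transformed by the same rotation and scaling:

```python
import numpy as np

def lift(p):
    """Lift a point in R^2 \\ {0} to M(r, theta) in R* x SO(2), with origin o = [1, 0]."""
    x, y = p
    r, th = np.hypot(x, y), np.arctan2(y, x)
    return np.array([[r * np.cos(th), -r * np.sin(th)],
                     [r * np.sin(th),  r * np.cos(th)]])

def log_rel(u, v):
    """[log(r_u/r_v), (theta_u - theta_v) mod 2pi]: Lie-algebra coordinates of v^{-1} u."""
    w = np.linalg.solve(v, u)                       # v^{-1} u, itself of the form M(r, theta)
    r = np.hypot(w[0, 0], w[1, 0])
    th = np.arctan2(w[1, 0], w[0, 0])               # already wrapped to (-pi, pi]
    return np.array([np.log(r), th])

p, q = np.array([1.0, 2.0]), np.array([-0.5, 0.3])
g = lift(np.array([0.7, 1.1]))                      # an arbitrary rotation + scaling
a = log_rel(lift(p), lift(q))
b = log_rel(lift(g @ p), lift(g @ q))               # transform both points by g
assert np.allclose(a, b)                            # the relative logarithm is invariant
```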

A.3. Sufficient Conditions for Geodesic Distance

In general, the function $d(u, v) = \|\log(v^{-1}u)\|_F$, defined on the domain of $GL(d)$ covered by the exponential map, satisfies the first three conditions of a distance metric but not the triangle inequality, making it a semi-metric:

1. $d(u, v) \ge 0$
2. $d(u, v) = 0 \Leftrightarrow \log(u^{-1}v) = 0 \Leftrightarrow u = v$
3. $d(u, v) = \|\log(v^{-1}u)\|_F = \|-\log(u^{-1}v)\|_F = d(v, u)$.

However, for certain subgroups of $GL(d)$ with additional structure, the triangle inequality holds and the function is the distance along geodesics connecting group elements $u$ and $v$ according to the metric tensor

$$\langle A, B\rangle_u := \mathrm{Tr}(A^T u^{-T} u^{-1} B), \qquad (16)$$

where $-T$ denotes inverse and transpose. Specifically, if the subgroup $G$ is in the image of the map $\exp : \mathfrak{g} \to G$ and each infinitesimal generator commutes with its transpose, $[A, A^T] = 0$ for all $A \in \mathfrak{g}$, then $d(u, v) = \|\log(v^{-1}u)\|_F$ is the geodesic distance between $u$ and $v$.

Geodesic Equation: Geodesics of (16), satisfying $\nabla_{\dot\gamma}\dot\gamma = 0$, can equivalently be derived by minimizing the energy functional

$$E[\gamma] = \int_\gamma \langle \dot\gamma, \dot\gamma \rangle_\gamma \, dt = \int_0^1 \mathrm{Tr}(\dot\gamma^T \gamma^{-T}\gamma^{-1}\dot\gamma)\, dt$$

using the calculus of variations. Minimizing curves $\gamma(t)$ connecting elements $u$ and $v$ in $G$ ($\gamma(0) = v$, $\gamma(1) = u$) satisfy

$$0 = \delta E = \delta \int_0^1 \mathrm{Tr}(\dot\gamma^T \gamma^{-T}\gamma^{-1}\dot\gamma)\, dt.$$

Noting that $\delta(\gamma^{-1}) = -\gamma^{-1}\delta\gamma\,\gamma^{-1}$ and the linearity of the trace,

$$\int_0^1 2\,\mathrm{Tr}(\dot\gamma^T \gamma^{-T}\gamma^{-1}\delta\dot\gamma) - 2\,\mathrm{Tr}(\dot\gamma^T \gamma^{-T}\gamma^{-1}\delta\gamma\,\gamma^{-1}\dot\gamma)\, dt = 0.$$

Using the cyclic property of the trace and integrating by parts, we have that

$$-\int_0^1 2\,\mathrm{Tr}\Big[\Big(\frac{d}{dt}\big(\dot\gamma^T\gamma^{-T}\gamma^{-1}\big) + \gamma^{-1}\dot\gamma\,\dot\gamma^T\gamma^{-T}\gamma^{-1}\Big)\delta\gamma\Big]\, dt = 0,$$

where the boundary term $\mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\delta\gamma)\big|_0^1$ vanishes since $(\delta\gamma)(0) = (\delta\gamma)(1) = 0$.

As $\delta\gamma$ may be chosen to vary arbitrarily along the path, $\gamma$ must satisfy the geodesic equation:

$$\frac{d}{dt}\big(\dot\gamma^T\gamma^{-T}\gamma^{-1}\big) + \gamma^{-1}\dot\gamma\,\dot\gamma^T\gamma^{-T}\gamma^{-1} = 0. \qquad (17)$$
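For matrix groups, the semi-metric $d(u, v)$ above can be computed directly from the matrix logarithm. A small sketch (our illustration, not library code from the paper) using SciPy, checked on random SO(3) elements:

```python
import numpy as np
from scipy.linalg import expm, logm

def lie_distance(u, v):
    """d(u, v) = ||log(v^{-1} u)||_F for matrix group elements u, v."""
    rel = np.linalg.solve(v, u)              # v^{-1} u
    return np.linalg.norm(logm(rel), 'fro')

def random_so3():
    """A random rotation via the exponential of a random skew-symmetric matrix."""
    w = np.random.randn(3)
    W = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]])
    return expm(W)

u, v = random_so3(), random_so3()
assert abs(lie_distance(u, v) - lie_distance(v, u)) < 1e-6   # symmetry (condition 3)
assert lie_distance(u, u) < 1e-6                              # d(u, u) = 0 (condition 2)
```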

Solutions: When $A = \log(v^{-1}u)$ satisfies $[A, A^T] = 0$, the curve $\gamma(t) = v\exp(t\log(v^{-1}u))$ is a solution to the geodesic equation (17). Clearly $\gamma$ connects $u$ and $v$: $\gamma(0) = v$ and $\gamma(1) = u$. Plugging $\dot\gamma = \gamma A$ into the left hand side of equation (17), we have

$$\frac{d}{dt}\big(A^T\gamma^{-1}\big) + AA^T\gamma^{-1} = -A^T\gamma^{-1}\dot\gamma\,\gamma^{-1} + AA^T\gamma^{-1} = [A, A^T]\gamma^{-1} = 0.$$

Length of $\gamma$: The length of the curve $\gamma$ connecting $u$ and $v$ is $\|\log(v^{-1}u)\|_F$:

$$L[\gamma] = \int_\gamma \sqrt{\langle\dot\gamma,\dot\gamma\rangle_\gamma}\, dt = \int_0^1 \sqrt{\mathrm{Tr}(\dot\gamma^T\gamma^{-T}\gamma^{-1}\dot\gamma)}\, dt = \int_0^1 \sqrt{\mathrm{Tr}(A^TA)}\, dt = \|A\|_F = \|\log(v^{-1}u)\|_F.$$

Of the Lie groups that we consider in this paper, all of which have a single connected component, the groups $G = T(d)$, $SO(d)$, $\mathbb{R}^*\times SO(d)$, $\mathbb{R}^*\times SQ$ satisfy the property that $[\mathfrak{g}, \mathfrak{g}^T] = 0$; however, the $SE(d)$ groups do not.
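A quick numerical check of this statement (an illustrative sketch under the stated assumption of commuting generators, using $\mathbb{R}^*\times SO(2)$): discretize $\gamma(t) = v\exp(t\log(v^{-1}u))$ and confirm that the accumulated length matches $\|\log(v^{-1}u)\|_F$.

```python
import numpy as np
from scipy.linalg import expm, logm

def d(u, v):
    return np.linalg.norm(logm(np.linalg.solve(v, u)), 'fro')

def M(r, th):
    """Element of R* x SO(2); its generators I and J commute with their transposes."""
    return r * np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])

u, v = M(2.0, 0.8), M(0.5, -1.1)
A = logm(np.linalg.solve(v, u))
gamma = lambda t: v @ expm(t * A)            # gamma(t) = v exp(t log(v^{-1} u))

ts = np.linspace(0, 1, 101)
piecewise_length = sum(d(gamma(t2), gamma(t1)) for t1, t2 in zip(ts[:-1], ts[1:]))
assert np.isclose(piecewise_length, d(u, v), atol=1e-6)
```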

A.4. Equivariant Subsampling

Even if all distances and neighborhoods are precomputed, the cost of computing equation (6) for $i = 1, \dots, N$ is still quadratic, $O(nN) = O(N^2)$, because the number of points $n$ in each neighborhood grows linearly with $N$ as $f$ is more densely evaluated. So that our method can scale to handle a large number of points, we show two ways to equivariantly subsample the group elements, which we can use both for the locations at which we evaluate the convolution and for the locations that we use in the Monte Carlo estimator. Since the elements are spaced irregularly, we cannot readily use the coset pooling method described in Cohen and Welling (2016a); instead we can perform:

Random Selection: Randomly selecting a subset of $p$ points from the original $n$ preserves the original sampling distribution, so it can be used.

Farthest Point Sampling: Given a set of group elements $S = \{u_i\}_{i=1}^k \subset G$, we can select a subset $S_p^*$ of size $p$ that maximizes the minimum distance between any two elements in that subset,

$$\mathrm{Sub}_p(S) := S_p^* = \underset{S_p\subset S}{\arg\max}\ \min_{u,v\in S_p : u\neq v} d(u, v), \qquad (18)$$

farthest point sampling on the group. Acting on a set of elements, $\mathrm{Sub}_p : S \mapsto S_p^*$, the farthest point subsampling is equivariant, $\mathrm{Sub}_p(wS) = w\,\mathrm{Sub}_p(S)$ for any $w \in G$, meaning that applying a group element to each of the elements does not change the chosen indices in the subsampled set, because the distances are left invariant: $d(u_i, u_j) = d(wu_i, wu_j)$.

Now we can use either of these methods for $\mathrm{Sub}_p(\cdot)$ to equivariantly subsample the points in each neighborhood used to estimate the integral to a fixed number $p$:

$$h_i = \frac{1}{p}\sum_{j\in \mathrm{Sub}_p(\mathrm{nbhd}(u_i))} k_\theta(v_j^{-1}u_i)\, f_j. \qquad (19)$$

Doing so reduces the cost of estimating the convolution from $O(N^2)$ to $O(pN)$, ignoring the cost of computing $\mathrm{Sub}_p$ and $\{\mathrm{nbhd}(u_i)\}_{i=1}^N$.
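The exact argmax in (18) is combinatorial, so implementations typically use a greedy approximation. A minimal sketch on a precomputed pairwise distance matrix (function name is ours):

```python
import numpy as np

def greedy_farthest_point_subsample(dists, p, start=0):
    """Greedily pick p indices from an (n, n) pairwise distance matrix so that each
    new index is as far as possible from the set already chosen."""
    chosen = [start]
    min_dist = dists[start].copy()          # distance of every element to the chosen set
    for _ in range(p - 1):
        nxt = int(np.argmax(min_dist))      # farthest element from the current subset
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dists[nxt])
    return chosen

# Because d(u_i, u_j) = d(w u_i, w u_j), running this on distances computed from
# {w u_i} returns the same indices as on {u_i}: the equivariance property above.
```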
A.5. Review and Implications of Noether's Theorem

In the Hamiltonian setting, Noether's theorem relates the continuous symmetries of the Hamiltonian of a system to conserved quantities, and it has been deeply impactful in the understanding of classical physics. We give a review of Noether's theorem, loosely following Butterfield (2006).

More on Hamiltonian Dynamics

As introduced earlier, the Hamiltonian is a function acting on the state, $H(z) = H(q, p)$ (we ignore time dependence for now). It can be viewed more formally as a function on the cotangent bundle $(q, p) = z \in M = T^*C$, where $C$ is the coordinate configuration space, and this is the setting for Hamiltonian dynamics.

In general, on a manifold $\mathcal{M}$, a vector field $X$ can be viewed as an assignment of a directional derivative along $\mathcal{M}$ for each point $z \in \mathcal{M}$. It can be expanded in a basis using coordinate charts, $X = \sum_\alpha X^\alpha \partial_\alpha$, where $\partial_\alpha = \partial/\partial z^\alpha$, and it acts on functions $f$ by $X(f) = \sum_\alpha X^\alpha \partial_\alpha f$. In the chart, each of the components $X^\alpha$ is a function of $z$.

In Hamiltonian mechanics, for two functions $f, g$ on $M$ there is the Poisson bracket, which can be written in terms of the canonical coordinates $q_i, p_i$:

$$\{f, g\} = \sum_i \frac{\partial f}{\partial p_i}\frac{\partial g}{\partial q_i} - \frac{\partial f}{\partial q_i}\frac{\partial g}{\partial p_i}.$$

(Here we take the definition of the Poisson bracket to be the negative of the usual definition in order to streamline notation.)

The Poisson bracket can be used to associate to each function $f$ a vector field

$$X_f = \{f, \cdot\} = \sum_i \frac{\partial f}{\partial p_i}\frac{\partial}{\partial q_i} - \frac{\partial f}{\partial q_i}\frac{\partial}{\partial p_i},$$

which specifies, by its action on another function $g$, the directional derivative of $g$ along $X_f$: $X_f(g) = \{f, g\}$. Vector fields that can be written in this way are known as Hamiltonian vector fields, and the Hamiltonian dynamics of the system is a special example, $X_H = \{H, \cdot\}$. In canonical coordinates $z = (p, q)$ this vector field is $X_H = F(z) = J\nabla_z H$ (i.e. the symplectic gradient, as discussed in Section 6.1). Making this connection clear, a given scalar quantity evolves through time as $\dot f = \{H, f\}$. But this bracket can also be used to evaluate the rate of change of a scalar quantity along the flows of vector fields other than the dynamics, such as the flows of continuous symmetries.

Noether's Theorem

The flow $\phi^X_\lambda$ by $\lambda \in \mathbb{R}$ of a vector field $X$ is the set of integral curves, the unique solution of the system of ODEs $\dot z^\alpha = X^\alpha$ with initial condition $z$ evaluated at parameter value $\lambda$, or more abstractly the iterated application of $X$: $\phi^X_\lambda = \exp(\lambda X)$. Continuous symmetry transformations are the transformations that can be written as the flow $\phi^X_\lambda$ of a vector field. The directional derivative characterizes how a function such as the Hamiltonian changes along the flow of $X$ and is a special case of the Lie derivative $\mathcal{L}$:

$$\mathcal{L}_X H = \frac{d}{d\lambda}\big(H\circ\phi^X_\lambda\big)\Big|_{\lambda=0} = X(H).$$

A scalar function is invariant to the flow of a vector field if and only if the Lie derivative is zero:

$$H(\phi^X_\lambda(z)) = H(z) \ \Leftrightarrow\ \mathcal{L}_X H = 0.$$

For all transformations that respect the Poisson bracket, which we add as a requirement for a symmetry, the vector field $X$ is (locally) Hamiltonian and there exists a function $f$ such that $X = X_f = \{f, \cdot\}$. (More precisely, the Poisson bracket can be formulated in a coordinate-free manner in terms of a symplectic two-form $\omega$, $\{f, g\} = \omega(X_f, X_g)$. In the original coordinates $\omega = \sum_i dp_i \wedge dq_i$, and in this coordinate basis $\omega$ is represented by the matrix $J$ from earlier. The dynamics $X_H$ are determined by $dH = \omega(X_H, \cdot) = \iota_{X_H}\omega$. Transformations which respect the Poisson bracket are symplectic, $\mathcal{L}_X\omega = 0$. With Cartan's magic formula, this implies that $d(\iota_X\omega) = 0$. Because the form $\iota_X\omega$ is closed, Poincaré's lemma implies that locally $\iota_X\omega = df$ for some function $f$, and hence $X = X_f$ is (locally) a Hamiltonian vector field. For more details see Butterfield (2006).) If $M$ is a contractible domain such as $\mathbb{R}^{2n}$, then $f$ is globally defined. For every continuous symmetry $\phi^{X_f}_\lambda$,

$$\mathcal{L}_{X_f} H = X_f(H) = \{f, H\} = -\{H, f\} = -X_H(f),$$

by the antisymmetry of the Poisson bracket. So if $\phi^X_\lambda$ is a symmetry of $H$, then $X = X_f$ for some function $f$, and $H(\phi^{X_f}_\lambda(z)) = H(z)$ implies

$$\mathcal{L}_{X_f} H = 0 \ \Leftrightarrow\ \mathcal{L}_{X_H} f = 0 \ \Leftrightarrow\ f(\phi^{X_H}_\tau(z)) = f(z),$$

or in other words $f(z(t+\tau)) = f(z(t))$, and $f$ is a conserved quantity of the dynamics. This implication goes both ways: if $f$ is conserved then $\phi^{X_f}_\lambda$ is necessarily a symmetry of the Hamiltonian, and if $\phi^{X_f}_\lambda$ is a symmetry of the Hamiltonian then $f$ is conserved.

Hamiltonian vs. Dynamical Symmetries

So far we have been discussing Hamiltonian symmetries, invariances of the Hamiltonian. But in the study of dynamical systems there is a related concept of dynamical symmetries, symmetries of the equations of motion. This notion is also captured by the Lie derivative, but between vector fields. A dynamical system $\dot z = F(z)$ has a continuous dynamical symmetry $\phi^X_\lambda$ if the flow along the dynamical system commutes with the symmetry:

$$\phi^X_\lambda(\phi^F_t(z)) = \phi^F_t(\phi^X_\lambda(z)), \qquad (20)$$

meaning that applying the symmetry transformation to the state and then flowing along the dynamical system is equivalent to flowing first and then applying the symmetry transformation. Equation (20) is satisfied if and only if the Lie derivative is zero:

$$\mathcal{L}_X F = [X, F] = 0,$$

where $[\cdot,\cdot]$ is the Lie bracket on vector fields. (The Lie bracket on vector fields produces another vector field and is defined by how it acts on functions: for any smooth function $g$, $[X, F](g) = X(F(g)) - F(X(g))$.)

For Hamiltonian systems, every Hamiltonian symmetry is also a dynamical symmetry. In fact, it is not hard to show that the Lie and Poisson brackets are related,

$$[X_f, X_g] = X_{\{f, g\}},$$

and this directly shows the implication. If $X_f$ is a Hamiltonian symmetry, $\{f, H\} = 0$, then

$$[X_f, F] = [X_f, X_H] = X_{\{f, H\}} = 0.$$

However, the converse is not true: dynamical symmetries of a Hamiltonian system are not necessarily Hamiltonian symmetries and thus might not correspond to conserved quantities. Furthermore, even if the system has a dynamical symmetry $\phi^X_\lambda$ which is the flow along a Hamiltonian vector field, $X = X_f = \{f, \cdot\}$, but the dynamics $F$ are not Hamiltonian, then the dynamics will not conserve $f$ in general. Both the symmetry and the dynamics must be Hamiltonian for the conservation laws to hold.

This fact is demonstrated by Figure 9, where the dynamics of the (non-Hamiltonian) equivariant LieConv-T(2) model have a T(2) dynamical symmetry with the generators $\partial_x, \partial_y$, which are Hamiltonian vector fields for $f = p_x$ and $f = p_y$, and yet linear momentum is not conserved by the model.
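As a concrete illustration of the conservation statement (a toy example of ours, not code from the paper): for a Hamiltonian that depends only on relative positions, the total momentum is translation-invariant, so its rate of change along the symplectic dynamics $\dot z = J\nabla H$ vanishes. A short autograd sketch:

```python
import torch

def hamiltonian(z):
    # Two unit-mass particles on a line with a spring between them; z = (q1, q2, p1, p2).
    q, p = z[:2], z[2:]
    return 0.5 * (p ** 2).sum() + 0.5 * (q[0] - q[1]) ** 2   # invariant to q -> q + c

def symplectic_grad(z):
    z = z.clone().requires_grad_(True)
    gq, gp = torch.autograd.grad(hamiltonian(z), z)[0].split(2)
    return torch.cat([gp, -gq])           # J grad H: dq/dt = dH/dp, dp/dt = -dH/dq

z = torch.tensor([0.3, -1.0, 0.5, 0.2])
dz = symplectic_grad(z)
total_momentum_rate = dz[2] + dz[3]       # d/dt (p1 + p2) = {H, p1 + p2}
print(float(total_momentum_rate))         # ~0: translation invariance => momentum conserved
```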

Conserving Linear and Angular Momentum

Consider a system of interacting particles described in Euclidean coordinates with positions and momenta $q_{im}, p_{im}$, such as the multi-body spring problem. Here the first index $i = 1, 2, 3$ indexes the spatial coordinates and the second index $m = 1, 2, \dots, N$ indexes the particles. We use the bolded notation $\mathbf{q}_m, \mathbf{p}_m$ to suppress the spatial indices while still indexing the particles $m$, as in Section 6.1.

The total linear momentum along a given direction $n$ is $n\cdot P = \sum_{i,m} n_i p_{im} = n\cdot\big(\sum_m \mathbf{p}_m\big)$. Expanding the Poisson bracket, the Hamiltonian vector field is

$$X_{n\cdot P} = \{n\cdot P, \cdot\} = \sum_{i,m} n_i\frac{\partial}{\partial q_{im}} = \sum_m n\cdot\frac{\partial}{\partial \mathbf{q}_m},$$

which has the flow $\phi^{X_{n\cdot P}}_\lambda(\mathbf{q}_m, \mathbf{p}_m) = (\mathbf{q}_m + \lambda n,\ \mathbf{p}_m)$, a translation of all particles by $\lambda n$. So our model of the Hamiltonian conserves linear momentum if and only if it is invariant to a global translation of all particles (e.g. T(2) invariance for a 2D spring system).

The total angular momentum along a given axis $n$ is

$$n\cdot L = n\cdot\sum_m \mathbf{q}_m\times\mathbf{p}_m = \sum_{i,j,k,m}\epsilon_{ijk}\, n_i q_{jm} p_{km} = \sum_m \mathbf{p}_m^T A\,\mathbf{q}_m,$$

where $\epsilon_{ijk}$ is the Levi-Civita symbol and we have defined the antisymmetric matrix $A$ by $A_{kj} = \sum_i \epsilon_{ijk} n_i$. The corresponding Hamiltonian vector field is

$$X_{n\cdot L} = \{n\cdot L, \cdot\} = \sum_{j,k,m}\Big(A_{kj} q_{jm}\frac{\partial}{\partial q_{km}} - A_{jk} p_{jm}\frac{\partial}{\partial p_{km}}\Big) = \sum_m\Big(\mathbf{q}_m^T A^T\frac{\partial}{\partial \mathbf{q}_m} + \mathbf{p}_m^T A^T\frac{\partial}{\partial \mathbf{p}_m}\Big),$$

where the second equality follows from the antisymmetry of $A$. We can find the flow of $X_{n\cdot L}$ from the differential equations $\dot{\mathbf{q}}_m = A\mathbf{q}_m$, $\dot{\mathbf{p}}_m = A\mathbf{p}_m$, which have the solution

$$\phi^{X_{n\cdot L}}_\theta(\mathbf{q}_m, \mathbf{p}_m) = (e^{\theta A}\mathbf{q}_m,\ e^{\theta A}\mathbf{p}_m) = (R_\theta\mathbf{q}_m,\ R_\theta\mathbf{p}_m),$$

where $R_\theta$ is a rotation about the axis $n$ by the angle $\theta$, which follows from the Rodrigues rotation formula. Therefore, the flow of the Hamiltonian vector field of angular momentum along a given axis is a global rotation of the positions and momenta of all particles about that axis. Again, the dynamics of a neural network modeling a Hamiltonian conserve total angular momentum if and only if the network is invariant to simultaneous rotation of all particle positions and momenta.

[Figure 9: plots of energy and total linear momentum over time t for LieConv-T2, HLieConv-Trivial, HLieConv-T2, and the ground truth.]

Figure 9. Equivariance alone is not sufficient; for conservation we need both to model H and to incorporate the given symmetry. For comparison, LieConv-T(2) is T(2)-equivariant but models F, and HLieConv-Trivial models H but is not T(2)-equivariant. Only HLieConv-T(2) conserves linear momentum.

B. Additional Experiments

B.1. Equivariance Demo

While (7) shows that the convolution estimator is equivariant, we have conducted the ablation study below examining the equivariance of the network empirically. We trained LieConv (Trivial, T(3), SO(3), SE(3)) models on a limited subset of 20k training examples (out of 100k) of the HOMO task on QM9 without any data augmentation. We then evaluate these models on a series of modified test sets where each example has been randomly transformed by an element of the given group. In Table 4 the rows are the models configured with a given group equivariance, and the columns N/G denote no augmentation at training time and transformations from G applied to the test set (the test translations in T(3) and SE(3) are sampled from a normal with standard deviation 0.5).

Model    N/N   N/T(3)   N/SO(3)   N/SE(3)
Trivial  173   183      239       243
T(3)     113   113      133       133
SO(3)    159   238      160       240
SE(3)     62    62       63        62

Table 4. Test MAE (in meV) on the HOMO test set randomly transformed by elements of G. Despite no data augmentation (N), G-equivariant models perform as well on G-transformed test data.

Notably, the performance of the LieConv-G models does not degrade when random G transformations are applied to the test set. Also, in this low-data regime, the added equivariances are especially important.
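The test-set transformations above amount to applying a random group element to each molecule's coordinates while leaving the atom features untouched. A hedged sketch of what such an evaluation transform might look like (function names are ours; the rotation here comes from a Gaussian axis-angle rather than the exact Haar procedure of Appendix C.2):

```python
import numpy as np
from scipy.linalg import expm

def random_se3(translation_std=0.5):
    """A random SE(3) element: a random rotation plus a Gaussian translation
    with the stated standard deviation."""
    w = np.random.randn(3)
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    return expm(W), translation_std * np.random.randn(3)

def transform_coords(coords, R, t):
    """Apply x -> R x + t to an (n_atoms, 3) coordinate array."""
    return coords @ R.T + t
```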
B.2. RotMNIST Comparison

While the RotMNIST dataset consists of 12k rotated MNIST digits, it is standard to separate out 10k to be used for training and 2k for validation. However, in TI-Pooling and E(2)-Steerable CNNs, it appears that after hyperparameters were tuned, the validation set is folded back into the training set to be used as additional training data, a common approach used on other datasets. Although in Table 1 we only use 10k training points, in the table below we report the performance with and without augmentation when trained on the full 12k examples.

Aug    Trivial   T_y    T(2)   SO(2)   SO(2)xR*   SE(2)
SO(2)  1.44      1.35   1.32   1.27    1.13       1.13
None   1.60      2.64   2.34   1.26    1.25       1.15

Table 5. Classification error (%) on the RotMNIST dataset for LieConv with different group equivariances and baselines, with and without SO(2) data augmentation.

C. Implementation Details

C.1. Practical Considerations

While the high-level summary of the lifting procedure (Algorithm 1) and the LieConv layer (Algorithm 2) provides a useful conceptual understanding of our method, there are some additional details that are important for a practical implementation.

1. According to Algorithm 2, $a_{ij}$ is computed in every LieConv layer, which is both highly redundant and costly. In practice, we precompute $a_{ij}$ once after lifting and feed it through the network, with layers operating on the state $\big(\{a_{ij}\}_{i,j}^{N,N}, \{f_i\}_{i=1}^N\big)$ instead of $\{(u_i, q_i, f_i)\}_{i=1}^N$. Doing so requires fixing the group elements that will be used at each layer for a given forwards pass.

2. In practice only $p$ elements of $\mathrm{nbhd}_i$ are sampled (randomly) for computing the Monte Carlo estimator, in order to limit the computational burden (see Appendix A.4).

3. We use the analytic forms for the exponential and logarithm maps of the various groups as described in Eade (2014).

C.2. Sampling from the Haar Measure for Various Groups

When the lifting map from $\mathcal{X} \to G \times \mathcal{X}/G$ is multi-valued, we need to sample elements $u \in G$ that project down to $x$, $uo = x$, in a way consistent with the Haar measure $\mu(\cdot)$. In other words, since the restriction $\mu(\cdot)|_{\mathrm{nbhd}}$ is a distribution, we must sample from the conditional distribution $u \sim \mu(u \mid uo = x)|_{\mathrm{nbhd}}$. In general this can be done by parametrizing the distribution of $\mu$ as a collection of random variables that includes $x$, and then sampling the remaining variables.

In this paper, the groups we use in which the lifting map is multi-valued are SE(2), SO(3), and SE(3). The process is especially straightforward for SE(2) and SE(3), as these groups can be expressed as a semi-direct product of two groups, $G = H \ltimes N$, for which the Haar measure factorizes as

$$d\mu_G(h, n) = \delta(h)\, d\mu_H(h)\, d\mu_N(n), \qquad (21)$$

where $\delta(h) = \frac{d\mu_N(n)}{d\mu_N(hnh^{-1})}$ (Willson, 2009). For $G = SE(d) = SO(d)\ltimes T(d)$, $\delta(h) = 1$ since the Lebesgue measure $d\mu_{T(d)}(x) = d\lambda(x) = dx$ is invariant to rotations. So simply $d\mu_{SE(d)}(R, x) = d\mu_{SO(d)}(R)\, dx$.

So lifts of a point $x \in \mathcal{X}$ to SE(d) consistent with $\mu$ are just $T_x R$, the multiplication of a translation by $x$ and randomly sampled rotations $R \sim \mu_{SO(d)}(\cdot)$. There are multiple easy methods to sample uniformly from SO(d), given in Kuffner (2004); for example, sampling uniformly from SO(3) can be done by sampling a unit quaternion from the 3-sphere and identifying it with the corresponding rotation matrix.
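A short sketch of the quaternion method mentioned above (our illustration): sample a point uniformly on the 3-sphere and convert it to a rotation matrix.

```python
import numpy as np

def sample_so3_haar():
    """Uniform (Haar) sample from SO(3): draw a unit quaternion uniformly on S^3
    and map it to the corresponding rotation matrix."""
    q = np.random.randn(4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

# A lift of a point x to SE(3) consistent with the Haar measure is then the pair
# (R, x) with R = sample_so3_haar().
```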
C.3. Model Architecture

We employ a ResNet-style architecture (He et al., 2016), using bottleneck blocks (Zagoruyko and Komodakis, 2016) and replacing ReLUs with Swish activations (Ramachandran et al., 2017). The convolutional kernel $g_\theta$ internal to each LieConv layer is parametrized by a 3-layer MLP with 32 hidden units, batch norm, and Swish nonlinearities. Not only do the Swish activations improve performance slightly, but unlike ReLUs they are twice differentiable, which is a requirement for backpropagating through the Hamiltonian dynamics. The stack of elementwise linear and bottleneck blocks is followed by a global pooling layer that computes the average over all elements, but not over channels. As in regular image bottleneck blocks, the channels for the convolutional layer in the middle are smaller by a factor of 4 for increased parameter and computational efficiency.

Downsampling: As is traditional for image data, we increase the number of channels and the receptive field at every downsampling step. The downsampling is performed with the farthest point downsampling method described in Appendix A.4. For downsampling by a factor of $s < 1$, the radius of the neighborhood is scaled up by $s^{-1/2}$ and the channels are scaled up by $s^{-1/2}$. When an image is downsampled with $s = (1/2)^2$, as is typical in a CNN, this results in 2x more channels and a radius or dilation of 2x. In the bottleneck block, the downsampling operation is fused with the LieConv layer, so that the convolution is only evaluated at the downsampled query locations. We perform downsampling only on the image datasets, which have more points.

BatchNorm: In order to handle the varied number of group elements per example and within each neighborhood, we use a modified batchnorm that computes statistics only over elements from a given mask. The batch norm is computed per channel, with statistics averaged over the batch size and each of the valid locations.
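A hedged sketch of such a masked batch norm (ours, not the paper's exact implementation; the learnable affine parameters are omitted for brevity): statistics are computed per channel over only the valid, unmasked locations across the batch.

```python
import torch

def masked_batchnorm(x, mask, eps=1e-5):
    """x: (batch, n, channels) features; mask: (batch, n) boolean validity mask.
    Normalizes each channel using the mean/variance over valid locations only."""
    m = mask.unsqueeze(-1).to(x.dtype)                     # (batch, n, 1)
    count = m.sum()
    mean = (x * m).sum(dim=(0, 1)) / count                 # per-channel mean over valid points
    var = (((x - mean) * m) ** 2).sum(dim=(0, 1)) / count  # per-channel variance
    return (x - mean) / torch.sqrt(var + eps) * m          # keep padded entries at zero
```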
C.4. Details for Hamiltonian Models

Model Symmetries: As the position vectors are mean-centered in the model forward pass, $q_i' = q_i - \bar q$, HOGN and HLieConv-SO2* have an additional T(2) invariance, yielding SE(2) invariance for HLieConv-SO2*. We also experimented with an HLieConv-SE2 equivariant model, but found that the exponential map for SE2 (involving Taylor expansions and masking) was not numerically stable enough for second derivatives, which are required for optimizing through the Hamiltonian dynamics. So instead we benchmark the HLieConv-SO2 (without centering) and the HLieConv-SO2* (with centering) models separately. Layer equivariance is preferable for not prematurely discarding useful information and for better modeling performance, but invariance alone is sufficient for the conservation laws. Additionally, since we know a priori that the spring problem has Euclidean coordinates, we need not model the kinetic energy $K(p, m) = \sum_{j=1}^n \|p_j\|^2/m_j$ and can instead focus on modeling the potential $V(q, k)$. We observe that this additional inductive bias of Euclidean coordinates improves model performance. Table 6 shows the invariance and equivariance properties of the relevant models and baselines. For Noether conservation, we need both to model the Hamiltonian and to have the symmetry property.

Model             Models F(z,t)   Models H(z,t)   T(2)   SO(2)
FC                •
OGN               •
HOGN                              •               #
LieConv-T(2)      •                               %
HLieConv-Trivial                  •
HLieConv-T(2)                     •               %
HLieConv-SO(2)                    •                      %
HLieConv-SO(2)*                   •               #      %

Table 6. Model characteristics. Models with layers invariant to G are denoted with #, and those with equivariant layers with %.

Dataset Generation: To generate the spring dynamics datasets, we generated D systems, each with N = 6 particles connected by springs. The system parameters, mass and spring constant, are set by sampling $\{m_1^{(i)}, \dots, m_6^{(i)}, k_1^{(i)}, \dots, k_6^{(i)}\}_{i=1}^{D}$ with $m_j^{(i)}\sim\mathcal{U}(0.1, 3.1)$ and $k_j^{(i)}\sim\mathcal{U}(0, 5)$. Following Sanchez-Gonzalez et al. (2019), we set the spring constants as $k_{ij} = k_i k_j$. For each system $i$, the position and momentum of body $j$ were distributed as $q_j^{(i)}\sim\mathcal{N}(0, 0.16\,I)$, $p_j^{(i)}\sim\mathcal{N}(0, 0.36\,I)$. Using the analytic form of the Hamiltonian for the spring problem, $\mathcal{H}(q, p) = K(p, m) + V(q, k)$, we use the RK4 numerical integration scheme to generate 5-second ground truth trajectories broken up into 500 evaluation timesteps. We use a fixed step-size scheme for RK4 chosen automatically (as implemented in Chen et al. (2018)) with a relative tolerance of 1e-8 in double precision arithmetic. We then randomly selected a single segment from each trajectory, consisting of an initial state $z_t$ and $\tau = 4$ transition states $(z_{t+1}, \dots, z_{t+\tau})$.
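A minimal sketch of the spring-system Hamiltonian described above (our paraphrase of the setup; the exact normalization constants of the kinetic and potential terms are our assumptions, and the paper's actual data generation integrates this with the RK4 scheme of Chen et al. (2018)):

```python
import numpy as np

def spring_hamiltonian(q, p, m, k):
    """q, p: (N, 2) positions/momenta; m, k: (N,) masses and spring parameters.
    H = sum_j |p_j|^2 / (2 m_j) + 0.5 * sum_{i<j} k_i k_j |q_i - q_j|^2."""
    kinetic = 0.5 * (np.sum(p ** 2, axis=1) / m).sum()
    diffs = q[:, None, :] - q[None, :, :]                 # pairwise displacement vectors
    kij = k[:, None] * k[None, :]                          # k_ij = k_i k_j
    potential = 0.25 * (kij * np.sum(diffs ** 2, axis=-1)).sum()  # 0.25: each pair counted twice
    return kinetic + potential

N = 6
m = np.random.uniform(0.1, 3.1, size=N)
k = np.random.uniform(0.0, 5.0, size=N)
q = 0.4 * np.random.randn(N, 2)                            # std 0.4 -> variance 0.16
p = 0.6 * np.random.randn(N, 2)                            # std 0.6 -> variance 0.36
print(spring_hamiltonian(q, p, m, k))
```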

Training: All models were trained in single precision arithmetic (double precision did not make any appreciable difference) with an integrator tolerance of 1e-4. We use a cosine decay for the learning rate schedule and perform early stopping on the validation MSE. We trained with a minibatch size of 200 and for 100 epochs each, using the Adam optimizer (Kingma and Ba, 2014) without batch normalization. With 3k training examples, the HLieConv model takes about 20 minutes to train on one 1080Ti.

For the examination of performance over the range of dataset sizes in Figure 8, we cap the validation set to the size of the training set to make the setting more realistic, and we also scale the number of training epochs up as the size of the dataset shrinks ($\mathrm{epochs} = 100\sqrt{10^3/D}$), which we found to be sufficient to fit the training set. For $D \le 200$ we use the full dataset in each minibatch.

Hyperparameters:

Model        channels   layers   lr
(H)FC        256        4        1e-2
(H)OGN       256        1        1e-2
(H)LieConv   384        4        1e-3

Hyperparameter tuning: Model hyperparameters were tuned by grid search over channel width, number of layers, and learning rate. The models were tuned with training, validation, and test datasets consisting of 3000, 2000, and 2000 trajectory segments respectively.

C.5. Details for Image and Molecular Experiments

RotMNIST Hyperparameters: For RotMNIST we train each model for 500 epochs using the Adam optimizer with learning rate 3e-3 and batch size 25. The first linear layer maps the 1-channel grayscale input to k = 128 channels, and the number of channels in the bottleneck blocks follows the scaling law from Appendix C.3 as the group elements are downsampled. We use 6 bottleneck blocks, and the total downsampling factor S = 1/10 is split geometrically between the blocks as s = (1/10)^{1/6} per block. The initial radius r of the local neighborhoods in the first layer is set so as to include 1/15 of the total number of elements in each neighborhood, and it is scaled accordingly thereafter. The subsampled neighborhood used to compute the Monte Carlo convolution estimator uses p = 25 elements. The models take less than 12 hours to train on a 1080Ti.

QM9 Hyperparameters: For the QM9 molecular data, we use the featurization from Anderson et al. (2019), where the input features f_i are determined by the atom type (C, H, N, O, F) and the atomic charge. The coordinates x_i are simply the raw atomic coordinates measured in angstroms. A separate model is trained for each prediction task, all using the same hyperparameters and early stopping on the validation MAE. We use the same train, validation, test split as Anderson et al. (2019), with 100k molecules for train, 10% for test, and the remainder for validation. As with the other experiments, we use a cosine learning rate decay schedule. Each model is trained using the Adam optimizer for 1000 epochs with a learning rate of 3e-3 and batch size of 100. We use SO(3) data augmentation and 6 bottleneck blocks, each with k = 1536 channels. The radius of the local neighborhood is set to r = ∞ to include all elements. The model takes about 48 hours to train on a single 1080Ti.
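Putting the downsampling scaling law from Appendix C.3 together with the RotMNIST schedule above, a tiny helper (ours, for illustration) computes the per-block subsample fraction and the resulting channel and radius growth:

```python
def downsample_schedule(total_factor=0.1, num_blocks=6, channels0=128, radius0=1.0):
    """Per-block subsample fraction s, with channels and neighborhood radius each
    scaled by s**-0.5 at every block (the scaling law from Appendix C.3)."""
    s = total_factor ** (1.0 / num_blocks)
    schedule, c, r = [], channels0, radius0
    for _ in range(num_blocks):
        c, r = c * s ** -0.5, r * s ** -0.5
        schedule.append((round(c), r))
    return s, schedule
```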

C.6. Local Neighborhood Visualizations

In Figure 10 we visualize the local neighborhoods used with different groups under three different types of transformations: translations, rotations, and scalings. The distance and neighborhood are defined on the tuples of group elements and orbits. For Trivial, T(2), SO(2), and R*×SO(2), the correspondence between points and these tuples is one-to-one, and we can identify the neighborhood in terms of the input points. For SE(2), each point is mapped to multiple tuples, each of which defines its own neighborhood in terms of other tuples. In the figure, for SE(2) we therefore visualize, for a given point, the distribution of points that enter the computation of the convolution at a specific tuple.

(a) Trivial  (b) T(2)  (c) SO(2)

(d) R*×SO(2)  (e) SE(2)

Figure 10. A visualization of the local neighborhood for different groups, in terms of the points in the input space. For the computation of the convolution at the point in red, elements are sampled from the colored region. In each panel, the top row shows translations, the middle row shows rotations, and the bottom row shows scalings of the same image. For SE(2) we visualize the distribution of points entering the computation of the convolution over multiple lift samples. For each of the equivariant models that respects a given symmetry, the points that enter into the computation are not affected by the transformation.