Spherical Convolutional Neural Networks: Stability to Perturbations in SO(3)
Zhan Gao†, Fernando Gama‡ and Alejandro Ribeiro†
Abstract
Spherical convolutional neural networks (Spherical CNNs) learn nonlinear representations from 3D data by exploiting the data structure and have shown promising performance in shape analysis, object classification, and planning, among others. This paper investigates the properties that Spherical CNNs exhibit as they pertain to the rotational structure inherent in spherical signals. We build upon the rotation equivariance of spherical convolutions to show that Spherical CNNs are stable to general structure perturbations. In particular, we model arbitrary structure perturbations as diffeomorphism perturbations, and define the rotation distance that measures how far from rotations these perturbations are. We prove that the output change of a Spherical CNN induced by a diffeomorphism perturbation is bounded proportionally by the perturbation size under the rotation distance. This stability property, coupled with the rotation equivariance, provides theoretical guarantees that underpin the practical observations that Spherical CNNs exploit the rotational structure, maintain performance under structure perturbations that are close to rotations, and offer good generalization and faster learning.

arXiv:2010.05865v2 [cs.LG] 3 Apr 2021

Keywords: Spherical convolutional neural networks, spherical convolutional filters, structure perturbations, stability analysis
†Department of Electrical and Systems Engineering, University of Pennsylvania, USA. Email: {gaozhan,aribeiro}@seas.upenn.edu. ‡Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA. Email: [email protected]
Preprint submitted to Journal of Signal Processing, April 6, 2021

1. Introduction
Modern image-acquisition technologies such as light detection and ranging (LIDAR) [1], panorama cameras [2] and optical scanners [3] obtain data that can be modelled on the spherical surface in 3D space [4]. Spherical signals offer mathematical representations for such data, which essentially assign a scalar (or vector) value to each point on the spherical surface [5, 6], and have been leveraged in applications of object classification in computer vision [7], panoramic video processing in self-driving cars [8] and 3D surface reconstruction in medical imaging [9]. Processing spherical signals in a successful manner entails architectures capable of exploiting the structural information given by the spherical surface. In particular, we are concerned with exploiting the rotational structure inherent in spherical signals [10, 11]. The rotational structure can be captured by means of the rotation group. This is a mathematical group defined on the spherical surface equipped with the operation of rotation [12]. The rotation group admits the definition of a convolution operation for spherical signals, which in turn gives rise to spherical convolutional filters [13]. The latter are linear processing operators that compute weighted, rotated linear combinations of spherical signal values [14, 15]. Spherical convolutional filters exhibit the property of rotation equivariance, which means that they yield the same features irrespective of rotated versions of the input spherical signal [16, 17]. This property is particularly meaningful when processing 3D data where the information is contained in the relative location of signal values but not their absolute positions [18, 19, 20]. Therefore, processing spherical signals with spherical convolutional filters allows the resulting architecture to exploit the data structure for information extraction.
Spherical convolutional neural networks (Spherical CNNs) are developed as nonlinear processing architectures that consist of a cascade of layers, each of which applies a spherical convolutional filter followed by a pointwise nonlinearity [16, 17, 21, 22, 23, 24]. The inclusion of nonlinearities coupled with the cascade of multiple layers endows Spherical CNNs with an enhanced representative power, thus exhibiting superior performance for spherical signal processing. Several Spherical CNN architectures have been proposed, leveraging different implementations of the spherical convolution. In particular, [16, 17] carry out the spherical convolution through the spherical harmonic transform, while [21, 22, 23, 24] discretize the sphere using a graph connecting pixels and then approximate the spherical convolution with the Laplacian-based graph convolution.

Considering the evident success of Spherical CNNs in 3D tasks, we focus on analyzing the properties that Spherical CNNs exhibit as they pertain to the rotational structure present in spherical signals. These properties are meant to shed light on the reasons behind this observed success and provide theoretical guarantees on the performance robustness under structure perturbations. Rotation equivariance of spherical convolutional linear filters was established in [16, 17], and Spherical CNNs were shown to be numerically effective in standard retrieval and classification problems.

In this paper, we further investigate how spherical convolutional filters and Spherical CNNs react to general structure perturbations applied to input spherical signals. More specifically, we build upon the rotation equivariance of spherical convolutions [16, 17] to prove that Spherical CNNs are Lipschitz stable to structure perturbations. To conduct such analysis, we leverage the Euler parametrization together with the normalized Haar measure to carry out spherical convolutions explicitly in the spherical coordinate system. We also model an arbitrary structure perturbation as a diffeomorphism perturbation, and define a notion of rotation distance that measures the difference between diffeomorphism perturbations and rotation operations.
The established stability results indicate that Spherical CNNs extract the same information under rotations and, more importantly, that they are able to maintain performance when structure perturbations are close to rotations. Our detailed contributions can be summarized as follows.
(i) Stability of spherical convolutional filters (Section 3): We prove that the output difference of spherical convolutional filters induced by general structure perturbations, defined as diffeomorphism perturbations, is bounded proportionally by the perturbation size under the rotation distance [Thm. 1]. This result extends the stability analysis from regular rotation operations to irregular diffeomorphism perturbations, and establishes that almost the same information can be extracted if structure perturbations are close to rotation operations.
(ii) Stability of Spherical CNNs (Section 4): We show that Spherical CNNs inherit the stability to general structure perturbations with a factor proportional to the perturbation size [Thm. 2]. The result also indicates the explicit impact of the filter, the nonlinearity, and the architecture width and depth on the stability property. In particular, a wider and deeper Spherical CNN degrades the stability but improves the representative power, indicating a trade-off between these two factors.
The stability property to arbitrary perturbations plays an important role in improving the generalization capability of Spherical CNNs. In the case of 3D object identification [25], for instance, mild changes in spherical signals may be introduced by different viewing angles or distances, but Spherical CNNs are expected to extract the same information. Section 5 provides numerical experiments to corroborate the theory on the problem of 3D object classification and Section 6 concludes the paper. All proofs are collected in the appendix.
2. Spherical Convolutional Neural Networks
Signals arising from 3D data, e.g., LIDAR images, X-ray models, surface data, etc., can be described as inscribed on the spherical surface in 3D space [5, 6]. In Sec. 2.1 we introduce the mathematical description of these spherical signals, in Sec. 2.2 we describe the rotation operation on the sphere, in Sec. 2.3 we discuss the spherical convolution as the fundamental linear operation between spherical signals, and in Sec. 2.4 we present the spherical convolutional neural network (Spherical CNN).
Figure 1. (a) The spherical coordinate system. (b) Azimuthal rotation on the sphere. The rotation displaces points on the sphere along the rotation axis, e.g., it displaces points on the red line to corresponding points on the blue line.
2.1. Spherical signal
Let $S^2 \subset \mathbb{R}^3$ be the spherical surface contained in $\mathbb{R}^3$. Given a coordinate system $S(x, y, z)$, a point $u = (x_u, y_u, z_u) \in S^2$ can be alternatively described by the polar angle $\theta_u = \arctan\big(\sqrt{x_u^2 + y_u^2}/z_u\big) \in [0, \pi]$ from the $z$-direction, the azimuth angle $\phi_u = \arctan(y_u/x_u) \in [0, 2\pi)$ along the $xy$-plane, and the distance $d = \sqrt{x_u^2 + y_u^2 + z_u^2}$ from the origin. Without loss of generality, we assume the unit sphere with $d = 1$—see Fig. 1a for the spherical coordinate system. More specifically, a point $u \in S^2$ can be represented by the vector
$$u = (\theta_u, \phi_u) = \big(\sin(\theta_u)\cos(\phi_u),\ \sin(\theta_u)\sin(\phi_u),\ \cos(\theta_u)\big). \qquad (1)$$
For points whose polar angle is $\theta_u = 0$ or $\theta_u = \pi$, the azimuth angle is assumed to be zero. These points are known as the north and south poles, referred to as $N \in S^2$ and $S \in S^2$ respectively [cf. Fig. 1a].
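The change of coordinates in (1) can be sketched numerically. The NumPy snippet below (function names are our own choosing) maps between Cartesian points on $S^2$ and the angle pair $(\theta_u, \phi_u)$, using `arctan2` to invert (1) with correct quadrant handling.

```python
import numpy as np

def to_angles(u):
    """Map a point u = (x, y, z) on the unit sphere to (theta, phi)."""
    x, y, z = u
    theta = np.arctan2(np.hypot(x, y), z)   # polar angle in [0, pi]
    phi = np.arctan2(y, x) % (2 * np.pi)    # azimuth angle in [0, 2*pi)
    return theta, phi

def to_point(theta, phi):
    """Inverse map: angles (theta, phi) to Cartesian coordinates [cf. (1)]."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# Round trip for a generic point on the sphere.
u = to_point(0.7, 1.3)
theta, phi = to_angles(u)
```

The round trip recovers the original angles for any non-pole point; at the poles the azimuth is conventionally set to zero, matching the text.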
A spherical signal is defined as the map $x : S^2 \to \mathbb{R}$ which assigns a scalar $x(u) \in \mathbb{R}$ to each point $u \in S^2$ on the sphere. Equivalently, these signals can be represented as mappings from the pair of angular variables $(\theta_u, \phi_u)$ to the real line $\mathbb{R}$, i.e., $x(u) = x(\theta_u, \phi_u) \in \mathbb{R}$. Spherical signals are typically used to describe data collected from 3D objects—see Fig. 2 for an example on how to cast the surface of 3D objects as spherical signals.
Figure 2. Spherical signal representing the surface data of a tetrahedron. The tetrahedron contains interior points from which the whole boundary is visible. Assume the object center is one of these interior points, from which we cast rays that intersect the object surface. Denote by $d(\phi_u, \theta_u)$ the distance between the farthest intersection point and the center along the ray direction $(\phi_u, \theta_u)$, where $\theta_u$ is the polar angle and $\phi_u$ is the azimuth angle. The surface data can be represented as a spherical signal $x(\phi_u, \theta_u) = d(\phi_u, \theta_u) \in \mathbb{R}$, which maps angular variables to the distance.
2.2. Rotation operation
Rotations are elementary operations to which a spherical signal can be subject [10, 11]. A rotation $r$ is uniquely characterized by the rotation point $N^r \in S^2$ and the rotation angle $\beta^r \in [0, 2\pi)$. It displaces the points on the sphere by $\beta^r$ degrees along the axis that passes through the origin and the point $N^r$; thus, we may refer to $N^r$ as the rotation axis. To describe the rotation in the spherical coordinate system [cf. (1)], let $r^\theta : S^2 \to \mathbb{R}$ be the polar angle displacement and $r^\phi : S^2 \to \mathbb{R}$ be the azimuth angle displacement induced by the rotation $r$.
These two quantities parametrize the rotation operation $r : S^2 \to S^2$ as
$$r(u) = r(\theta_u, \phi_u) = \big(\sin(\theta_u + r^\theta(\theta_u, \phi_u))\cos(\phi_u + r^\phi(\theta_u, \phi_u)),\ \sin(\theta_u + r^\theta(\theta_u, \phi_u))\sin(\phi_u + r^\phi(\theta_u, \phi_u)),\ \cos(\theta_u + r^\theta(\theta_u, \phi_u))\big) \qquad (2)$$
where $r^\theta$ and $r^\phi$ are determined by the Rodrigues' rotation formula [26]. The rotation operation takes a point $u$ on the sphere $S^2$ and maps it into another point $r(u)$, also on the sphere $S^2$, but where the polar angle has been displaced by $r^\theta$ and the azimuthal angle by $r^\phi$. The rotation axis intersects the spherical surface at two points, namely $N^r \in S^2$ referred to as the rotation north pole and $S^r \in S^2$ referred to as the rotation south pole. These two points remain unchanged under the rotation operation. As an illustrative example, consider the azimuthal rotation whose rotation axis coincides with the $z$-axis of the coordinate system $S(x, y, z)$, i.e., $N^r = N$ and $S^r = S$—see Fig. 1b. Then, $r^\theta(\theta_u, \phi_u) = 0$ and $r^\phi(\theta_u, \phi_u) = \beta^r$ are constants for all $(\theta_u, \phi_u) \in [0, \pi] \times [0, 2\pi)$. We can also describe the rotation operation $r \in SO(3)$ in terms of the ZYZ Euler parametrization as
$$r = r_{\phi_r \theta_r \rho_r} = r^z_{\phi_r} \circ r^y_{\theta_r} \circ r^z_{\rho_r} \qquad (3)$$
where $r^z_{\phi_r}$, $r^y_{\theta_r}$ and $r^z_{\rho_r}$ are the rotations along the Euclidean coordinate axes with the rotation angles $\phi_r \in [0, 2\pi)$, $\theta_r \in [0, \pi]$ and $\rho_r \in [0, 2\pi)$, respectively. The Euler parametrization allows us to decompose any rotation $r$ into three consecutive rotations. The set of all rotation operations about the origin [cf. (2)] defines a mathematical group under the operation of composition, referred to as the 3D rotation group $SO(3)$ [27, 28]. This is a group in the sense that (i) composing two rotations $r_1, r_2 \in SO(3)$ results in another rotation $r_1 \circ r_2 \in SO(3)$, (ii) rotations are associative, $(r_1 \circ r_2) \circ r_3 = r_1 \circ (r_2 \circ r_3)$ for any $r_1, r_2, r_3 \in SO(3)$, (iii) the rotation determined by $r^\theta = r^\phi = 0$ is the identity rotation $r_0 \in SO(3)$ such that $r \circ r_0 = r$ for all $r \in SO(3)$, and (iv) for every rotation $r \in SO(3)$ there exists an inverse rotation $r^{-1} \in SO(3)$ such that $r \circ r^{-1} = r_0$.
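The ZYZ decomposition (3) and the group properties (i)-(iv) can be checked numerically by representing each rotation with its $3 \times 3$ orthogonal matrix. A minimal sketch (helper names are our own):

```python
import numpy as np

def rot_z(a):
    """Rotation by angle a about the z-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    """Rotation by angle a about the y-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def euler_zyz(phi_r, theta_r, rho_r):
    """ZYZ Euler parametrization of a rotation [cf. (3)]."""
    return rot_z(phi_r) @ rot_y(theta_r) @ rot_z(rho_r)

# Group properties on sample rotations.
R1 = euler_zyz(0.3, 1.1, 2.0)
R2 = euler_zyz(2.5, 0.4, 5.0)
R = R1 @ R2                       # (i) closure: still orthogonal with det 1
I = euler_zyz(0.0, 0.0, 0.0)      # (iii) identity rotation r_0
R1_inv = R1.T                     # (iv) inverse: transpose of an orthogonal matrix
```

Associativity (ii) is inherited from matrix multiplication, so only closure, identity, and inverse need explicit checks.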
2.3. Spherical convolution
The spherical surface $S^2$ is now equipped with the rotation operation that conforms a mathematical group. This allows defining the convolution operation for spherical signals. Given two signals $x, h : S^2 \to \mathbb{R}$, the spherical convolution $\ast_{SO(3)}$ between them is defined as
$$y(u) = (h \ast_{SO(3)} x)(u) = \int_{SO(3)} h\big(r^{-1}(u)\big)\, x\big(r(u_0)\big)\, dr, \quad \forall\, u \in S^2 \qquad (4)$$
where $y(u)$ is the output spherical signal and $u_0 = (0, 0) \in S^2$ is the point given by $\theta_{u_0} = \phi_{u_0} = 0$.
We can rewrite the convolution operation in the spherical coordinate system by leveraging the Euler parametrization of rotations [cf. (3)]. Recall the normalized Haar measure on the rotation group $SO(3)$ [29],
$$dr = \frac{d\phi_r}{2\pi} \frac{\sin(\theta_r)\, d\theta_r}{2} \frac{d\rho_r}{2\pi}, \qquad (5)$$
and we can rewrite (4) as
$$y(u) = (h \ast_{SO(3)} x)(u) = \frac{1}{8\pi^2} \int h\big(r_{\phi_r \theta_r \rho_r}^{-1}(u)\big)\, x\big(r_{\phi_r \theta_r \rho_r}(u_0)\big) \sin(\theta_r)\, d\theta_r\, d\phi_r\, d\rho_r \qquad (6)$$
where $r_{\phi_r \theta_r \rho_r}$ is the rotation parametrized by the Euler angles $\phi_r, \theta_r, \rho_r$ [cf. (3)] and $r_{\phi_r \theta_r \rho_r}^{-1}$ is its inverse rotation. In the spherical coordinate system, by noting that $r_{\phi_r \theta_r \rho_r}(u_0) = (\theta_r, \phi_r)$ and using the Rodrigues' formula (2), we further get
$$y(\theta_u, \phi_u) = (h \ast_{SO(3)} x)(\theta_u, \phi_u) = \frac{1}{8\pi^2} \int h\big(\theta_u + r_{\phi_r \theta_r \rho_r}^{\theta\,-1}(\theta_u, \phi_u),\ \phi_u + r_{\phi_r \theta_r \rho_r}^{\phi\,-1}(\theta_u, \phi_u)\big)\, x(\theta_r, \phi_r) \sin(\theta_r)\, d\theta_r\, d\phi_r\, d\rho_r \qquad (7)$$
where $r_{\phi_r \theta_r \rho_r}^{\theta\,-1}$ and $r_{\phi_r \theta_r \rho_r}^{\phi\,-1}$ are the angle displacements induced by $r_{\phi_r \theta_r \rho_r}^{-1}$. We see that (7) is equivalent to (4) and (6), but described in terms of the polar angle $\theta_u$ and the azimuthal angle $\phi_u$, on which we can evaluate the output of the spherical convolution mathematically. That is, we move from a description given entirely in the realm of the rotation group $SO(3)$ to a description given by specific angular variables in the spherical coordinate system [cf. (1)]. This is important because it allows us to carry out the integral of the spherical convolution explicitly, serving the stability analysis in the following sections.
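To illustrate how the integral in (7) can be carried out explicitly, the sketch below approximates it with a midpoint Riemann sum over the Euler angles, using rotation matrices to compute the displaced angles of $r^{-1}(u)$. This is a brute-force illustration under our own discretization choices; practical implementations instead use spherical harmonics [16, 17] or graph approximations [21, 22, 23, 24].

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def euler_zyz(phi_r, theta_r, rho_r):
    """ZYZ Euler parametrization [cf. (3)]."""
    return rot_z(phi_r) @ rot_y(theta_r) @ rot_z(rho_r)

def to_point(theta, phi):
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi), np.cos(theta)])

def to_angles(u):
    return (np.arctan2(np.hypot(u[0], u[1]), u[2]),
            np.arctan2(u[1], u[0]) % (2 * np.pi))

def spherical_conv(h, x, theta_u, phi_u, n=10):
    """Midpoint Riemann sum for (7); h and x are functions of (theta, phi)."""
    u = to_point(theta_u, phi_u)
    d = np.pi / n                              # grid spacing for all three angles
    thetas = (np.arange(n) + 0.5) * d          # theta_r in [0, pi]
    phis = (np.arange(2 * n) + 0.5) * d        # phi_r, rho_r in [0, 2*pi)
    total = 0.0
    for th in thetas:
        for ph in phis:
            for rh in phis:
                R = euler_zyz(ph, th, rh)
                ti, pj = to_angles(R.T @ u)    # angles of r^{-1}(u)
                # r(u_0) has angles (theta_r, phi_r), so x is evaluated there
                total += h(ti, pj) * x(th, ph) * np.sin(th)
    return total / (8 * np.pi ** 2) * d ** 3
```

As a sanity check, for constant signals $h \equiv 1$ and $x \equiv 1$ the normalized Haar measure (5) integrates to one, so the output should be close to 1 at every evaluation point.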
Remark 1 (Group convolution). Convolution operations can be formally defined with respect to any group that is applicable to the space on which the convolving signals are defined. That is, let $x, h : X \to \mathbb{R}$ be two signals mapping some space $X$ into the real line, and let $G = G(X, \circ)$ be a group where the element $g \in G$ is defined over $X$, $\circ$ is the binary operation, and $g^{-1} \in G$ is the inverse element. The convolution operation $\ast_G$ is generically defined as
$$(h \ast_G x)(u) = \int_G h\big(g^{-1}(u)\big)\, x\big(g(u_0)\big)\, dg \qquad (8)$$
with $u, u_0 \in X$. The spherical convolution (4) is one instance of the group convolution for $G = SO(3)$. Another example is the regular convolution defined over the group of translations. We remark that many of the results derived in this work are applicable to generic group convolutions—see [30] for details.
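As a concrete instance of (8) beyond $SO(3)$, consider the cyclic translation group $\mathbb{Z}_n$ acting on length-$n$ signals, with $u_0 = 0$, $g(u) = (u + g) \bmod n$ and $g^{-1}(u) = (u - g) \bmod n$; the group convolution then reduces to the familiar circular convolution. A small sketch of our own (not from the paper):

```python
import numpy as np

def cyclic_group_conv(h, x):
    """Group convolution (8) for G = Z_n: (h * x)[u] = sum_g h[(u - g) mod n] x[g]."""
    n = len(x)
    return np.array([sum(h[(u - g) % n] * x[g] for g in range(n))
                     for u in range(n)])

rng = np.random.default_rng(0)
h, x = rng.standard_normal(8), rng.standard_normal(8)
y = cyclic_group_conv(h, x)
# Cross-check against the FFT implementation of circular convolution.
y_fft = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))
```

The agreement of the two computations reflects the convolution theorem for the cyclic group, mirroring the role the spherical harmonic transform plays for $SO(3)$ in [16, 17].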
2.4. Spherical convolutional neural network
Spherical convolutions are linear operations that exploit the rotational structure present in spherical signals. Thus, they can be used to regularize the linear transform in neural networks such that the resulting architecture is able to leverage the rotational structure when processing spherical signals from 3D tasks. In particular, neural networks are nonlinear maps that consist of a cascade of $L$ layers, each applying a linear transform $A_\ell$ followed by a pointwise nonlinearity $\sigma_\ell$ [31, Ch. 6],
$$x_\ell = \sigma_\ell\big[A_\ell x_{\ell-1}\big], \quad \ell = 1, \dots, L \qquad (9)$$
with $x_0 = x$ the input signal. The number of layers $L$ as well as the specific form of the nonlinearity $\sigma_\ell$ are typically design choices, while the linear transforms $\{A_\ell\}_{\ell=1}^L$ are learned by minimizing some objective function over a training set. Neural networks in such a general form work well only for small input data size; otherwise, the space of learnable linear transforms $\{A_\ell\}_{\ell=1}^L$ becomes too large to explore efficiently, and the resulting mapping rarely generalizes well [31, Ch. 9]. Successful neural network architectures like convolutional neural networks [32, 33, 34] or graph neural networks [35, 36, 37] typically regularize the space of linear transforms by imposing the data structure information on the operation $A_\ell$ (e.g., CNNs exploit the Euclidean structure present in data and GNNs exploit the graph structure present in data). Working with spherical signals demands neural network architectures that exploit the rotational structure. This motivates the development of the spherical convolutional neural network (Spherical CNN), where we regularize the linear transform
$A_\ell$ to be a spherical convolution [cf. (4)], namely
$$x_\ell = \sigma_\ell\big[h_\ell \ast_{SO(3)} x_{\ell-1}\big], \quad \ell = 1, \dots, L \qquad (10)$$
with $x_0 = x$ the input signal [16]. We see in (10) that the linear transform is now forced to be a spherical convolution. The learnable linear transforms are now the collection of spherical convolutional filters $H = \{h_\ell\}_{\ell=1}^L$. The descriptive power of Spherical CNNs can be enhanced by considering a multi-feature signal mapping $x : S^2 \to \mathbb{R}^F$, where we map each point on the sphere to an $F$-dimensional vector. Each entry of the vector is termed a feature, and the multi-feature spherical signal can be thought of as a collection of $F$ single-feature spherical signals $x^f : S^2 \to \mathbb{R}$, i.e., $x = \{x^f\}_{f=1}^F$. In this context, the convolution operation becomes a series of spherical convolutions where the input signal is processed through a filter bank. More specifically, an output signal with $G$ features $y : S^2 \to \mathbb{R}^G$ can be obtained by
$$y^g = \sum_{f=1}^{F} h^{fg} \ast_{SO(3)} x^f, \quad \forall\, g = 1, \dots, G. \qquad (11)$$
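The aggregation in (11) only relies on each single-feature convolution being linear, so its structure can be sketched independently of how $\ast_{SO(3)}$ is implemented; `conv` below is a placeholder for any such operation (names and the scalar toy check are our own):

```python
import numpy as np

def multi_feature_conv(h_bank, x_feats, conv):
    """Multi-feature spherical convolution [cf. (11)].
    h_bank[f][g] is the filter h^{fg}; x_feats lists the F input features.
    Returns the G output features y^g = sum_f conv(h^{fg}, x^f)."""
    F, G = len(h_bank), len(h_bank[0])
    return [sum(conv(h_bank[f][g], x_feats[f]) for f in range(F))
            for g in range(G)]

# Toy check with scalar "signals" and conv(h, x) = h * x: F = 2, G = 3.
y = multi_feature_conv([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                       [10.0, 100.0], lambda h, x: h * x)
```

Each output feature aggregates all $F$ input features through its own column of the $F \times G$ filter bank, exactly the sum over $f$ in (11).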
Operations (11) define the spherical convolution acting on multi-feature spherical signals, which is equivalent to filtering the $F$ features of the input signal through a bank of $FG$ filters and then aggregating, for each $g$th output, the filter outputs across all $F$ input features. To simplify notation, we denote the single-feature spherical convolution [cf. (4)] as a filtering operation, i.e., $y = H(x) = h \ast_{SO(3)} x$. The output of the last layer is $\Phi(x; H) = x_L$, where $\Phi(\cdot; H)$ represents the nonlinear mapping given by the Spherical CNN [cf. (10)]. The learnable parameters in the Spherical CNN consist of the filters $H = \{H_\ell\}_{\ell=1}^L$ (or $H = \{H_\ell^{fg}\}_{\ell=1}^L$ for the multi-feature scenario). The design hyperparameters are the number of layers $L$, the nonlinearity $\sigma_\ell$, and the number of features $F_\ell$ at each layer.

The resulting Spherical CNN [cf. (10)] leverages the 3D data structure and thus we expect it to exhibit strong performance on 3D tasks (as a matter of fact, this is observed in practice [16]). In what follows, we aim to analyze the properties that Spherical CNNs exhibit when processing spherical signals, i.e., the stability to arbitrary structure perturbations, to explicitly illustrate how they exploit the rotational structure and to bring insights into their observed superior performance.
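The layered recursion (10) can be written down compactly. The sketch below takes the convolution as a pluggable linear operation (here a circular convolution along a discretized azimuth, which is only a special case of the spherical convolution) and a ReLU nonlinearity; the names and discretization are our own illustration, not the paper's implementation.

```python
import numpy as np

def spherical_cnn_forward(x, filters, sigma, conv):
    """Forward pass of (10): x_l = sigma[h_l * x_{l-1}], l = 1, ..., L."""
    for h in filters:
        x = sigma(conv(h, x))
    return x

def circ_conv(h, x):
    """Stand-in linear operation: circular convolution (azimuthal special case)."""
    return np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))

relu = lambda v: np.maximum(v, 0.0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(32)                            # discretized input signal
filters = [rng.standard_normal(32) for _ in range(3)]   # L = 3 learnable filters
out = spherical_cnn_forward(x0, filters, relu, circ_conv)
```

Only the filters are learned; the depth $L$, the nonlinearity, and the feature widths remain design hyperparameters, as stated above.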
10 3. Stability of Spherical Convolutional Filters
Adopting the spherical convolution as the main operation to process spherical signals exploits the rotational structure of 3D data and thus yields improved performance. In this section, we analyze the effect that structure perturbations in spherical signals have on the output of spherical convolutional filters, to explain the theoretical reasons behind their superior performance.
3.1. Rotation equivariance
We begin by considering rotation operations as fundamental structure perturbations in spherical signals. Let $r \in SO(3)$ be a rotation and $x$ be a spherical signal. Define the rotated signal $x_r$ as
$$x_r(u) = x(r(u)) = x\big(\theta_u + r^\theta(\theta_u, \phi_u),\ \phi_u + r^\phi(\theta_u, \phi_u)\big), \quad \forall\, u \in S^2, \qquad (12)$$
where the last equality responds to the rotation displacements following the Rodrigues' formula [cf. (2)]. We formally state the rotation equivariance of spherical convolutional filters in the following proposition.
Proposition 1. Let $x$ be a spherical signal, $H$ be a spherical convolutional filter, and $r \in SO(3)$ be a rotation. Denote by $y = H(x)$ the filter output and by $y_r$ the rotated signal of $y$ [cf. (12)]. Then, it holds that
$$y_r = H(x_r). \qquad (13)$$
Proof. See Appendix A for a detailed proof that completes the ones in [16, 17].
Proposition 1 states that applying a spherical convolutional filter $H$ to a rotated version of the signal, $H(x_r)$, yields the correspondingly rotated version of the output $y_r$, where $y = H(x)$ is the output of filtering the original signal. This indicates that the spherical convolutional filter is capable of harnessing the same information irrespective of which rotated version of the spherical signal is processed. The rotation equivariance serves to improve transference and speed up training, since learning how to process one instance of the signal is equivalent to learning how to process all rotated versions of the same signal. This is analogous to the effect that translation equivariance has on CNNs [30] and the effect that permutation equivariance has on GNNs [38].
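For the azimuthal special case [cf. Fig. 1b], Proposition 1 can be checked numerically on a discretized sphere: an azimuthal rotation becomes a circular shift along $\phi$, and a filter acting by circular convolution along $\phi$ commutes with it. This toy construction of ours covers only azimuthal rotations, not all of $SO(3)$:

```python
import numpy as np

# Discretize the sphere on an equiangular (theta, phi) grid.
rng = np.random.default_rng(0)
n_theta, n_phi = 16, 32
x = rng.standard_normal((n_theta, n_phi))   # spherical signal samples
h = rng.standard_normal(n_phi)              # toy filter acting along phi

def conv_phi(h, x):
    """Toy rotation-equivariant filter: circular convolution along the azimuth."""
    return np.real(np.fft.ifft(np.fft.fft(x, axis=1) * np.fft.fft(h), axis=1))

def rotate_phi(x, k):
    """Azimuthal rotation by k grid steps, i.e. beta = 2*pi*k/n_phi [cf. Fig. 1b]."""
    return np.roll(x, k, axis=1)

# Prop. 1 (azimuthal case): filtering the rotated signal equals rotating the output.
lhs = conv_phi(h, rotate_phi(x, 5))
rhs = rotate_phi(conv_phi(h, x), 5)
```

The two sides agree to machine precision, which is the discrete analogue of (13) for this restricted family of rotations.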
3.2. Signal dissimilarity modulo rotations
In general, we are more interested in how Spherical CNNs fare when acting on two signals that are similar but not quite the same. In a sense, we expect the architecture to tell apart spherical signals when they are sufficiently different but to give a similar output if the signal change is due to noise or unimportant causes [39, 40, 41]. The rotation equivariance (Prop. 1) states that the same information is obtained when processing all rotated versions of the same sig- nal. This suggests that to measure how (dis)similar two signals are, we need to abstract rotations or, in other words, we need to do it modulo rotations. Motivated by this consideration, we define the rotation distance as follows.
Definition 1 (Rotation distance). For two spherical signals x, xˆ, the rotation distance is defined as
$$\|x - \hat{x}\|_{SO(3)} = \inf_{r \in SO(3)} \|x - \hat{x}_r\| \qquad (14)$$
where $\hat{x}_r(u) = \hat{x}(r(u))$ is the rotated spherical signal of $\hat{x}$ by the rotation $r$, and $\|\cdot\|$ is a valid norm.
Any valid norm can be applied in the rotation distance (Def. 1). In the case of spherical signals, we adopt the normalized spherical norm.
Definition 2 (Normalized spherical norm). For a spherical signal x : S2 → R, the normalized spherical norm is defined as
$$\|x\|^2 = \frac{1}{2\pi^2} \int x(\theta_u, \phi_u)^2\, d\theta_u\, d\phi_u, \qquad (15)$$
where $\theta_u \in [0, \pi]$ and $\phi_u \in [0, 2\pi)$ describe the support of the spherical signal $x$ in the spherical coordinate system [cf. (1)]. For multi-feature spherical signals $x = \{x^f\}_{f=1}^F$, we define $\|x\|^2 = \sum_{f=1}^F \|x^f\|^2$.
This is a proper norm in the sense that it (i) is absolutely scalable, (ii) is positive definite, and (iii) satisfies the triangle inequality. Equipping spherical signals with a norm allows us to define the following normed space
$$L^2(S^2) = \big\{ x : S^2 \to \mathbb{R}^F : \|x\| < \infty \big\}. \qquad (16)$$
The $L^2(S^2)$ space holds for single-feature signals by setting $F = 1$ in (16) and using the corresponding definition of the normalized spherical norm [cf. Def. 2]. In essence, $L^2(S^2)$ is the space of all finite-energy spherical signals.
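The normalized spherical norm (15) can be approximated with a midpoint rule on an equiangular grid. Note that the normalization $1/(2\pi^2)$ gives the constant signal $x \equiv 1$ unit norm, since the area of $[0, \pi] \times [0, 2\pi)$ is $2\pi^2$. A sketch with our own helper name:

```python
import numpy as np

def spherical_norm_sq(x_fn, n=200):
    """Midpoint-rule approximation of the squared norm in (15)."""
    thetas = (np.arange(n) + 0.5) * np.pi / n        # theta_u in [0, pi]
    phis = (np.arange(2 * n) + 0.5) * np.pi / n      # phi_u in [0, 2*pi)
    T, P = np.meshgrid(thetas, phis, indexing="ij")
    return np.sum(x_fn(T, P) ** 2) * (np.pi / n) ** 2 / (2 * np.pi ** 2)
```

For example, $x \equiv 1$ has $\|x\|^2 = 1$, while $x(\theta_u, \phi_u) = \cos(\theta_u)$ has $\|x\|^2 = 1/2$, since $\int_0^\pi \cos^2(\theta)\, d\theta = \pi/2$.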
Note that, if $\hat{x}(u) = x(r(u)) = x_r(u)$ is a rotated version of the spherical signal $x$ by $r$, then $\|x - \hat{x}\|_{SO(3)} = \|x - x_r\|_{SO(3)} = 0$. Under this dissimilarity measure given by the rotation distance (Def. 1), Prop. 1 can be restated as the following corollary.
Corollary 1. Let $x, \hat{x}$ be two spherical signals, and $H$ be a spherical convolutional filter. Then, it holds that
$$\|x - \hat{x}\|_{SO(3)} = 0 \ \Rightarrow\ \|H(x) - H(\hat{x})\|_{SO(3)} = 0. \qquad (17)$$
Proof. See Appendix B.
Note that the rotation distance [cf. (14)] applies to the filter output since the latter is also a spherical signal.
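On a discretized sphere, the infimum in (14) can be coarsely upper-bounded by searching only over azimuthal rotations (circular shifts along $\phi$), with the Frobenius norm of the grid samples standing in for Def. 2. This restricted search and the names below are our own illustration:

```python
import numpy as np

def rotation_distance_azimuthal(x, x_hat):
    """Upper bound on the rotation distance (14): minimize over discrete
    azimuthal rotations only, rather than all of SO(3)."""
    n_phi = x.shape[1]
    return min(np.linalg.norm(x - np.roll(x_hat, k, axis=1))
               for k in range(n_phi))

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))      # signal sampled on a (theta, phi) grid
x_rot = np.roll(x, 3, axis=1)         # azimuthally rotated version of x
```

Consistent with the discussion above, the distance between a signal and its (azimuthally) rotated version vanishes, while two unrelated signals remain well separated.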
Remark 2 (Group distance). Definition 1 can be extended to any group. In particular, let $G = G(X, \circ)$ be a group with elements $g \in G$ defined over the space $X$, and $x, \hat{x} : X \to \mathbb{R}$ be two group signals mapping $X$ to the real line $\mathbb{R}$. The group distance with respect to $G$ is defined as
$$\|x - \hat{x}\|_G = \inf_{g \in G} \|x - \hat{x}_g\| \qquad (18)$$
where $\hat{x}_g(u) = \hat{x}(g(u))$ is the signal operated on by $g$ and $\|\cdot\|$ can be any valid norm. The rotation distance is a particular case of this definition for $G = SO(3)$.
3.3. Stability to diffeomorphism perturbations

With the above preliminaries in place, we formally characterize the stability of Spherical CNNs to general structure perturbations under the rotation distance (Def. 1). In other words, we shift our focus to arbitrary structure perturbations that are different from, but close to, rotation operations.

Figure 3. Diffeomorphism perturbation on the sphere [cf. Def. 3]. (a) Unperturbed sphere. (b) Perturbed sphere after a diffeomorphism perturbation.
We consider a general structure perturbation $\tau : S^2 \to S^2$ that displaces each point $u \in S^2$ to another point $\tau(u) \in S^2$. Such a perturbation can be represented as a set of local rotations, i.e., rotations whose characterization changes depending on the specific point $u \in S^2$ to which they are applied. Let $r_u \in SO(3)$ be the rotation operation that maps the point $u_0 = (0, 0) \in S^2$ to the point $u \in S^2$ along the shortest arc, i.e., $r_u(u_0) = u$, and similarly let $r_{\tau(u)}$ be the rotation operation that maps the point $u_0$ to the perturbed point $\tau(u)$ along the shortest arc, i.e., $r_{\tau(u)}(u_0) = \tau(u)$. Combining the notions of $r_u$ and $r_{\tau(u)}$, we can formally define the general structure perturbation as a diffeomorphism perturbation in the following definition.
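The shortest-arc rotation $r_u$ taking $u_0$ (the north pole) to $u$ can be constructed explicitly with the Rodrigues formula [26]: rotate about the axis $N \times u$ by the angle between $N$ and $u$. A sketch with our own helper names (the degenerate case $u = \pm N$ is not handled):

```python
import numpy as np

def rodrigues(k, beta):
    """Rodrigues' rotation formula [26]: rotate by angle beta about unit axis k."""
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(beta) * K + (1.0 - np.cos(beta)) * (K @ K)

def shortest_arc_rotation(u):
    """r_u with r_u(u_0) = u, where u_0 is the north pole N = (0, 0, 1)."""
    N = np.array([0.0, 0.0, 1.0])
    axis = np.cross(N, u)
    axis = axis / np.linalg.norm(axis)
    beta = np.arccos(np.clip(np.dot(N, u), -1.0, 1.0))
    return rodrigues(axis, beta)

# Local rotation tau_u = r_{tau(u)} o r_u^{-1} maps u to tau(u).
u = np.array([0.6, 0.0, 0.8])
tau_u_point = np.array([0.0, 0.8, 0.6])   # a hypothetical perturbed point tau(u)
tau_u = shortest_arc_rotation(tau_u_point) @ shortest_arc_rotation(u).T
```

Composing $r_{\tau(u)}$ with $r_u^{-1}$ first brings $u$ back to $u_0$ and then carries it to $\tau(u)$, which is exactly the local-rotation construction used in the definition that follows.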
Definition 3 (Diffeomorphism perturbation). Let $\tau : S^2 \to S^2$ be a diffeomorphism, i.e., a bijective differentiable function whose inverse $\tau^{-1} : S^2 \to S^2$ is differentiable as well. The diffeomorphism is equivalent to a set of local rotations $\{\tau_u\}_{u \in S^2}$, where each local rotation $\tau_u \in SO(3)$ is defined as
$$\tau_u = r_{\tau(u)} \circ r_u^{-1}, \quad \forall\, u \in S^2, \qquad (19)$$
that maps the point $u$ to the perturbed point $\tau(u)$, i.e., $\tau(u) = \tau_u(u)$ for all $u \in S^2$. The diffeomorphism perturbation is defined as the set of local rotations in (19) that satisfies $\|\tau\| < \infty$ and $\|\nabla\tau\| < \infty$ for
$$\|\tau\| := \max_{u \in S^2} \big\{ \beta^{\tau_u} \big\}, \qquad (20)$$
where $\beta^{\tau_u}$ is the rotation angle of the local rotation $\tau_u$ [cf. (19)], and for
$$\|\nabla\tau\| := \max_{(\theta_u, \phi_u) \in [0, \pi] \times [0, 2\pi)} \max\left\{ \left| \frac{\partial \tau^\theta}{\partial \theta_u} \right|, \left| \frac{\partial \tau^\phi}{\partial \phi_u} \right| \right\}, \qquad (21)$$
where $\tau^\theta$ and $\tau^\phi$ are the polar and azimuthal angle displacements induced by $\tau$, such that $\tau^\theta(\theta_u, \phi_u) = \tau_u^\theta(\theta_u, \phi_u)$ and $\tau^\phi(\theta_u, \phi_u) = \tau_u^\phi(\theta_u, \phi_u)$ for all $u \in S^2$. The signal resulting from a diffeomorphism perturbation to the spherical structure of a given signal $x$ is denoted by $x_\tau$ such that