Synchronizing Probability Measures on Rotations via Optimal Transport
Tolga Birdal 1  Michael Arbel 2  Umut Şimşekli 3,4  Leonidas Guibas 1
1 Department of Computer Science, Stanford University, USA  2 Gatsby Computational Neuroscience Unit, University College London, UK  3 LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France  4 Department of Statistics, University of Oxford, Oxford, UK
Abstract
We introduce a new paradigm, 'measure synchronization', for synchronizing graphs with measure-valued edges. We formulate this problem as maximization of the cycle-consistency in the space of probability measures over relative rotations. In particular, we aim at estimating marginal distributions of absolute orientations by synchronizing the 'conditional' ones, which are defined on the Riemannian manifold of quaternions. Such graph optimization on distributions-on-manifolds enables a natural treatment of multimodal hypotheses, ambiguities and uncertainties arising in many computer vision applications such as SLAM, SfM, and object pose estimation. We first formally define the problem as a generalization of the classical rotation graph synchronization, where in our case the vertices denote probability measures over rotations. We then measure the quality of the synchronization by using Sinkhorn divergences, which reduce to other popular metrics, such as the Wasserstein distance or the maximum mean discrepancy, as limit cases. We propose a nonparametric Riemannian particle optimization approach to solve the problem. Even though the problem is non-convex, by drawing a connection to recently proposed sparse optimization methods, we show that the proposed algorithm converges to the global optimum in a special case of the problem under certain conditions. Our qualitative and quantitative experiments show the validity of our approach, and we bring in new perspectives to the study of synchronization.

Figure 1. We coin measure synchronization to refer to the problem of recovering distributions over absolute rotations {µ_i}_i (shown in red) from a given set of potential relative rotations {µ_ij}_(i,j) per pair that are estimated in an independent fashion.

1. Introduction

Synchronization [70, 30, 71] is the art of consistently recovering absolute quantities from a collection of ratios or pairwise relationships. An equivalent definition describes the problem as simultaneously upgrading from pairwise, local information onto the global one: an averaging [34, 36, 8]. At this point, synchronization is a fundamental piece of most multi-view reconstruction / multi-shape analysis pipelines [68, 17, 18], as it heavy-lifts the global constraint satisfaction while respecting the geometry of the parameters. In a wide variety of applications, these parameters are composed of the elements of a rotation, expressed either as a point on a Riemannian manifold (forming a Lie group) or on its tangent space (Lie algebra) [33, 14].

Unfortunately, in our highly complex environments, symmetries are inevitable, and they may render it impossible to hypothesize an ideal single candidate to explain the configuration of capture, be it pairwise or absolute. In those cases, there is no single best solution to the 3D perception problem. One might then speak of a set of plausible solutions that are only collectively useful in characterizing the global positioning of the cameras / objects [42, 54, 14]. Put more concretely, imagine a 3D sensing device capturing a rotating untextured, cylindrical coffee mug with a handle (c.f. Fig. 1). The mug's image is, by construction, identical under rotations around its symmetry axis whenever the handle is occluded. Any relative or absolute measurement is then subject to an ambiguity which prohibits the estimation of a unique best alignment. Whenever the handle appears, the mug's pose can be determined uniquely, leading to a single mode. Such phenomena are ubiquitous in real scenarios and have recently led to a paradigm shift in computer vision towards reporting estimates in terms of either empirical or continuous multimodal probability distributions on the parameter spaces [41, 54, 14, 13], rather than single entities as done in conventional methods [61, 55, 37].
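Classical synchronization is easy to state in code. The following is a toy sketch of our own (not from the paper) for planar rotations: given consistent relative angles θ_ij = θ_j − θ_i on a cycle graph, least squares recovers the absolute angles up to the global gauge freedom, here fixed by anchoring the first frame.

```python
import numpy as np

# Hypothetical toy instance: classical synchronization for planar rotations.
# Given relative angles theta_ij = theta_j - theta_i on graph edges, recover
# absolute angles up to a global offset (the gauge freedom).
theta = np.array([0.3, 1.1, 2.0, 2.7])          # ground-truth absolute angles
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a cycle graph
rel = {(i, j): theta[j] - theta[i] for i, j in edges}

# Least-squares synchronization: solve A x = b with frame 0 fixed to 0.
A = np.zeros((len(edges) + 1, len(theta)))
b = np.zeros(len(edges) + 1)
for r, (i, j) in enumerate(edges):
    A[r, i], A[r, j], b[r] = -1.0, 1.0, rel[(i, j)]
A[-1, 0] = 1.0                                   # gauge fix: theta_0 = 0
x = np.linalg.lstsq(A, b, rcond=None)[0]

# The recovered angles match the ground truth up to the global offset theta[0].
print(np.allclose(x + theta[0], theta))          # True
```

With multimodal, ambiguous measurements this single-hypothesis formulation breaks down, which is precisely the gap the paper addresses.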
In the aforementioned situations, to the best of our knowledge, all of the existing synchronization algorithms [33, 65, 25, 34, 79, 12] are doomed to failure, and a new notion of synchronization is required that instead takes into account distributions defined on the parameter space. In other words, what are to be synchronized are now the probability measures and not single data points¹.

In this paper, considering the Sinkhorn divergence, which has proven successful in many application domains [23, 26, 28], we introduce the concept of measure synchronization. After formally defining the graph synchronization problem over probability measures by using the Sinkhorn divergence, we propose a novel algorithm to solve this problem, namely Riemannian particle gradient descent (RPGD), which is based on the recently developed nonparametric 'particle-based' methods [51, 5, 52, 44] and Riemannian optimization methods [1]. In particular, we parameterize rotations by quaternions and use the Sinkhorn divergence [26] between the estimated joint rotation distribution and the input empirical relative rotation distribution as a measure of consistency.

We further investigate a special case of the proposed problem, where the loss function becomes the maximum mean discrepancy (MMD) combined with a simple regularizer. By using this special structure and imposing some structure on the weights of the measures to be estimated, we draw a link between our approach and the recently proposed sparse optimization of [20]. This link allows us to establish a theoretical guarantee showing that RPGD converges to the global optimum under certain conditions.

Our synthetic and real experiments demonstrate that both of our novel, non-parametric algorithms are able to faithfully discover the probability distribution for each of the absolute cameras with varying modes, given the multimodal rotation distribution for each edge in the graph. This is the first time the two realms of synchronization and optimal transport are united, and we hope to foster further research leading to enhanced formulations and algorithms. To this end, we will make our code as well as the datasets publicly available. In a nutshell, our contributions are:

1. We introduce the problem of measure synchronization, which aims for the synchronization of probability distributions defined on Riemannian manifolds.
2. We address the particular case of multiple rotation averaging and solve this new synchronization problem via minimizing a collection of regularized optimal transport costs over the camera pose graph.
3. We propose the first two solutions, resulting in two algorithms: Sinkhorn-Sync and MMD-Sync. These algorithms differ both by the topology they induce and by their descent regimes.
4. We show several use-cases of our approach under synthetic and real settings, demonstrating the applicability to a general set of challenges.

We release our source code under: https://synchinvision.github.io/probsync/.

¹ We note that our framework extends the classical one in the sense that single data points can be seen as degenerate probability distributions represented by Dirac-delta functions.

2. Prior Art

Synchronization. The emergence of the term synchronization dates back to Einstein & Poincaré and the principles of relativity [27]. It was first used to recover clocks [70]. Since then, the ideas have been extended to many domains other than space and time, leaping to communication systems [31, 30] and finally arriving at graphs in computer vision with Singer's angular synchronization [71]. Various aspects of the problem have been vastly studied: different group structures [33, 13, 11, 8, 38], closed-form solutions [12, 11, 8], robustness [19], certifiability [66], global optimality [15], learning-to-synchronize [39] and uncertainty quantification (UQ) [84, 14]. The recent work of Birdal et al. [14] and the K-best synchronization [79] are probably what relates to us the most. In the former, a probabilistic framework to estimate an empirical absolute distribution, and thereby the uncertainty per camera, is proposed. The latter uses a deterministic approach to yield K hypotheses with the hope that ambiguities are captured. Yet, none of the works in the synchronization field, including [14, 13, 79], are able to handle multiple relative hypotheses and thus are prone to failure when ambiguities are present in the input. This is what we overcome in this work by proposing to align the observed and computed relative probability measures: measure synchronization.

Optimal transport and MMD. Comparing probability distributions has an even longer history and is a very well studied subject [53, 47]. Taking into account the geometry of the underlying space, optimal transport (OT) [86], proposed by Monge in 1781 [57] and advanced by Kantorovich [40], seeks the minimum amount of effort needed to morph one probability measure into another. This geometric view of comparing empirical probability distributions metrizes convergence in law, better capturing the topology of measures [26]. Thanks to efficient algorithms developed recently [23, 64, 75], OT has begun to find use-cases in various sub-fields related to computer vision such as deep generative modeling [7], non-parametric modeling [52], domain adaptation [22], 3D shape understanding [76], graph matching [87], topological analysis [48], clustering [56], style transfer [43] and image retrieval [67], to name a few.

Only recently, Feydy et al. [26] formulated a new geometric Sinkhorn [74, 72] divergence that can relate OT distances, in particular the Sinkhorn distance [23], to the maximum mean discrepancy (MMD) [35], a cheaper-to-compute frequentist norm with a smaller sample complexity.
MMD has been successfully used in domain adaptation [85], in GANs [50, 6], as a critic [80], as well as to build autoencoders [82]. Furthermore, Arbel et al. [5] have established the Wasserstein gradient flow of the MMD and demonstrated global convergence properties based on a novel noise injection scheme. Jointly, these advancements allow us to tackle the proposed measure synchronization problem using MMDs with global optimality guarantees.

To the best of our knowledge, no prior work addresses the measure synchronization problem we consider here. The closest work is by Solomon et al. [77], where a Wasserstein metric is minimized so as to perform a graph transduction: the labels are propagated from the absolute entities along the edges (relative), not the other way around.

3. Background

We now define and review the concepts required for a grasp of our method.

3.1. Rotations and Synchronization

Definition 1 (Rotation Graph). We consider a finite simple directed graph G = (V, E), where vertices correspond to reference frames and edges to the available relative measurements. Both vertices and edges are labeled with absolute and relative orientations, respectively. Each absolute frame is described by a rotation matrix {R_i ∈ SO(3)}_{i=1}^n. Each relative pose is expressed as the transformation between frames i and j, R_ij, where (i, j) ∈ E ⊂ [n] × [n]. G is undirected in the sense that if (i, j) ∈ E then (j, i) ∈ E and R_ij = R_ji^{-1}.

Definition 2 (Rotation Synchronization). Synchronizing pose graphs, also known as multiple rotation averaging, involves the task of relating different coordinate frames by satisfying the compatibility constraint [32, 36]:

    R_ij ≈ R_j R_i^{-1},  ∀ i ≠ j.    (1)

Definition 3 (Quaternion). In the rest of the paper, we will represent a 3D rotation by a quaternion q, an element of the Hamilton algebra H, extending the complex numbers with three imaginary units i, j, k in the form q = q1·1 + q2·i + q3·j + q4·k = (q1, q2, q3, q4)^T, with (q1, q2, q3, q4)^T ∈ R^4 and i² = j² = k² = ijk = −1. We also write q := [a, v], with the scalar part a = q1 ∈ R and the vector part v = (q2, q3, q4)^T ∈ R³. The conjugate q̄ of the quaternion q is given by q̄ := q1 − q2·i − q3·j − q4·k. A versor or unit quaternion q ∈ H_1, with ||q|| = 1 and q^{-1} = q̄, gives a compact and numerically stable parametrization of orientations on S³, avoiding gimbal lock and singularities [16]. Identifying antipodal points q and −q with the same element, the unit quaternions form a double covering group of SO(3). The non-commutative multiplication of two quaternions p := [p1, v_p] and r := [r1, v_r] is defined to be p ⊗ r = [p1 r1 − v_p · v_r, p1 v_r + r1 v_p + v_p × v_r]. For simplicity we use p ⊗ r := p · r := pr.

Definition 4 (Relative Quaternion). The relative rotation between two frames can also be conveniently written in terms of quaternions: q_ij = q_i ⊗ q̄_j. Note that for quaternions we will later omit the symbol ⊗.

Riemannian geometry of quaternions. We now briefly explain the Lie group structure of the quaternions H, essential for deriving the update equations on the manifold. Our convention follows [4].

Definition 5 (Exponential Map). The exponential map Exp_q(·) maps any vector in T_q H, the tangent space of H at q, onto H:

    Exp_q(η) = q exp(η) = q e^w [cos(θ), (sin(θ)/θ) v],    (2)

where η = (w, v) ∈ T_q H and θ = ||v||.

Definition 6 (Logarithmic Map). The inverse of the exp-map, Log_q(p) : H → T_q H, is called the log-map and is defined as:

    Log_q(p) = log(q^{-1} p) = [0, arccos(w) v / ||v||],    (3)

this time with a slight abuse of notation q^{-1} p = (w, v).

Definition 7 (Riemannian Distance). Let us rephrase the Riemannian distance between two unit quaternions using the logarithmic map, whose norm is the length of the shortest geodesic path. Respecting the antipodality:

    d(q1, q2) = ||Log_{q1}(q2)|| = arccos(w),    if w ≥ 0,
    d(q1, q2) = ||Log_{q1}(−q2)|| = arccos(−w),  if w < 0,

where q1^{-1} q2 = (w, v).

3.2. Optimal Transport

Definition 8 (Discrete Probability Measure). We define a discrete probability measure µ, with weights a = {a_i} living on the simplex Σ_n := {x ∈ R_+^n : x^⊤ 1 = 1} and locations {x_i}, i = 1 ... n, as:

    µ = Σ_{i=1}^n a_i δ_{x_i},  a_i ≥ 0 ∧ Σ_{i=1}^n a_i = 1,    (4)

where δ_x is a Dirac delta function at x. We further define the product of two measures µ = Σ_{i=1}^{n_µ} a_i^µ δ_{x_i^µ} and ν = Σ_{i=1}^{n_ν} a_i^ν δ_{x_i^ν} as:

    µ ⊗ ν := Σ_{i=1}^{n_µ} Σ_{j=1}^{n_ν} a_i^µ a_j^ν δ_{x_i^µ} δ_{x_j^ν}.    (5)

Definition 9 (Transportation Polytope). For two probability measures µ and ν on the simplex, let U(µ, ν) denote the polyhedral set of n_µ × n_ν matrices:

    U(µ, ν) = {P ∈ R_+^{n_µ × n_ν} : P 1_{n_ν} = µ ∧ P^⊤ 1_{n_µ} = ν},

where 1_d is a d-dimensional vector of ones.
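As a concrete illustration of the quaternion operations in Definitions 3-7 above, here is a short NumPy sketch of our own; the helper names (qmul, qconj, qdist) are ours, not the paper's.

```python
import numpy as np

# Sketch of the quaternion operations in Definitions 3-7 (helper names are ours).
# Convention: q = (q1, q2, q3, q4) = [scalar, vector], with ||q|| = 1.

def qmul(p, r):
    """Hamilton product p (x) r of Definition 3."""
    p1, vp = p[0], p[1:]
    r1, vr = r[0], r[1:]
    return np.concatenate(([p1 * r1 - vp @ vr],
                           p1 * vr + r1 * vp + np.cross(vp, vr)))

def qconj(q):
    """Conjugate; for unit quaternions this is the inverse."""
    return np.concatenate(([q[0]], -q[1:]))

def qdist(q1, q2):
    """Riemannian (geodesic) distance of Definition 7, respecting antipodality."""
    w = qmul(qconj(q1), q2)[0]           # scalar part of q1^{-1} q2
    return np.arccos(np.clip(abs(w), -1.0, 1.0))

# Relative quaternion of Definition 4: q_ij = q_i (x) conj(q_j).
rng = np.random.default_rng(0)
qi, qj = rng.normal(size=4), rng.normal(size=4)
qi, qj = qi / np.linalg.norm(qi), qj / np.linalg.norm(qj)
qij = qmul(qi, qconj(qj))

# Composing the relative rotation with q_j recovers q_i (up to the double cover).
assert np.allclose(qmul(qij, qj), qi) or np.allclose(qmul(qij, qj), -qi)
assert np.isclose(qdist(qi, qi), 0.0)
```

The `abs(w)` in `qdist` folds the two antipodal cases of Definition 7 into one expression.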
Definition 10 (Kantorovich's Optimal Transport Problem). Let C be an n_µ × n_ν cost matrix constructed from the ground cost function c(x_i^µ, x_j^ν). Kantorovich's problem seeks the transport plan optimally mapping µ to ν:

    W_c(µ, ν) = d_c(µ, ν) = min_{P ∈ U(µ,ν)} ⟨P, C⟩,    (6)

with ⟨·,·⟩ denoting the Frobenius dot-product. Note that this expression is also known as the Wasserstein distance (WD).

Definition 11 (Maximum Mean Discrepancy). Given a characteristic positive semi-definite kernel k, the maximum mean discrepancy (MMD) [35] defines a distance between probability distributions. In particular, for discrete measures µ and ν, and when k is induced from a ground cost function c, k(x, y) = exp(−c(x, y)), the squared MMD is given by:

    MMD_k²(µ, ν) = K̄_{µ,µ} + K̄_{ν,ν} − 2 K̄_{µ,ν},    (7)

where K̄_{µ,ν} is the expectation of k under µ ⊗ ν:

    K̄_{µ,ν} = Σ_{i=1}^{n_µ} Σ_{j=1}^{n_ν} a_i^µ a_j^ν exp(−c(x_i^µ, x_j^ν)).    (8)

Definition 12 (Sinkhorn Distances). Cuturi [23] introduced an alternative convex set consisting of joint probabilities with small enough mutual information:

    U_α(µ, ν) = {P ∈ U(µ, ν) : KL(P ‖ µν^⊤) ≤ α},    (9)

where KL(·‖·) denotes the Kullback-Leibler divergence [47]. This restricted polytope leads to a tractable distance between discrete probability measures under the cost matrix C constructed from the function c:

    d_{c,α}(µ, ν) = min_{P ∈ U_α(µ,ν)} ⟨P, C⟩.    (10)

d_{c,α} can be computed by a modification of simple matrix scaling algorithms, such as Sinkhorn's algorithm [73]. The discrepancy d_{c,α} suffers from an entropic bias, which is corrected by [26, 29], giving rise to a new family of divergences defined below:

Definition 13 (Sinkhorn Divergences). S_{c,α}(µ, ν) is defined to be the Sinkhorn divergence:

    S_{c,α}(µ, ν) := 2 d_{c,α}(µ, ν) − d_{c,α}(µ, µ) − d_{c,α}(ν, ν).

Unlike Dfn. 12, this satisfies S_{c,α}(µ, µ) = S_{c,α}(ν, ν) = 0 and interpolates between the WD and the MMD with kernel k = −c:

    W_c(µ, ν)  ←−(α→0)−  S_{c,α}(µ, ν)  −(α→∞)−→  (1/2) MMD²_{−c}(µ, ν).

4. Measure Synchronization

In this section, we introduce the measure synchronization problem. In our problem setting, for each camera pair (i, j) ∈ E, we assume that we observe a discrete probability measure µ_ij, denoting the distribution of the relative poses between the cameras i and j. More formally, such relative pose measures consist of a collection of weighted Dirac masses (cf. (4)), given as follows:

    µ_ij := Σ_{k=1}^{K_ij} w_ij^{(k)} δ_{q_ij^{(k)}},    (11)

where {q_ij^{(k)}}_{k=1}^{K_ij} denote all the possible relative pose masses, K_ij denotes the total number of potential relative poses for a given camera pair (i, j), and w_ij^{(k)} ∈ R_+ denotes the weight of each point mass q_ij^{(k)}. We assume that these measures can be estimated (in a somewhat noisy manner) by an external algorithm, such as [55, 61, 24].

Given all the 'noisy' relative pose measures {µ_ij}_{(i,j)∈E}, similar to the conventional pose synchronization problem, our goal becomes estimating the absolute pose measures µ_i, which are a-priori unknown and denote the probability distribution of the absolute pose corresponding to camera i. Similar to the relative pose measures, we define the absolute pose measures as follows:

    µ_i := Σ_{k=1}^{K_i} w_i^{(k)} δ_{q_i^{(k)}},    (12)

where {q_i^{(k)}, w_i^{(k)}}_{k=1}^{K_i} are the point mass-weight pairs in µ_i and K_i denotes the number of points in µ_i. Semantically, q_i^{(k)} and w_i^{(k)} respectively denote the 'plausible' poses and their weights for camera i. Our ultimate goal is to estimate the absolute pose measures such that they produce 'compatible' relative measures in the sense of Eq. (1). In other words, we aim at estimating the pairs {{q_i^{(k)}, w_i^{(k)}}_{k=1}^{K_i}}_{i=1}^n that achieve a good graph consistency at the probability distribution level, a notion which will be made precise in the sequel.

Let us introduce some notation that will be useful for defining the ultimate problem. We denote by µ a coupling between (q_1, ..., q_n) which is of the form:

    µ = Σ_{k_1=1}^{K_1} ... Σ_{k_n=1}^{K_n} v^{(k_1,...,k_n)} Π_{i=1}^n δ_{q_i^{(k_i)}},    (13)

where the v^{(k_1,...,k_n)} are non-negative weights that sum to 1. µ represents the joint distribution over (q_1, ..., q_n) and recovers each marginal distribution µ_i after marginalizing over all the others. Although one could, in principle, learn the general form of Eq. (13), we will be interested in two extreme cases for µ, a high entropy case (HE) and a low entropy case (LE), for which Eq. (13) simplifies.
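For concreteness, Definitions 10-13 can be prototyped naively as follows. This is our own sketch: we use the common entropy-regularized (Lagrangian) Sinkhorn matrix-scaling iterations as a stand-in for the constrained d_{c,α} of Dfn. 12, and report the raw transport cost ⟨P, C⟩.

```python
import numpy as np

# Naive sketch of Definitions 10-13 for discrete measures (our own code; the
# Lagrangian Sinkhorn iterations stand in for the constrained d_{c,alpha}).

def sinkhorn_cost(a, b, C, eps=0.1, iters=500):
    """Approximate OT cost <P, C> via Sinkhorn matrix scaling."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * C))

def sinkhorn_divergence(a, b, Cab, Caa, Cbb, eps=0.1):
    """Debiased divergence of Dfn. 13: 2 d(mu,nu) - d(mu,mu) - d(nu,nu)."""
    return (2 * sinkhorn_cost(a, b, Cab, eps)
            - sinkhorn_cost(a, a, Caa, eps)
            - sinkhorn_cost(b, b, Cbb, eps))

def mmd_sq(a, b, Cab, Caa, Cbb):
    """Squared MMD of Dfn. 11 with the induced kernel k = exp(-c)."""
    Kbar = lambda w1, w2, C: w1 @ np.exp(-C) @ w2
    return Kbar(a, a, Caa) + Kbar(b, b, Cbb) - 2 * Kbar(a, b, Cab)

# Two small measures on the real line with squared-Euclidean ground cost.
x, y = np.array([0.0, 1.0]), np.array([0.2, 0.9])
a, b = np.array([0.5, 0.5]), np.array([0.3, 0.7])
cost = lambda u, v: (u[:, None] - v[None, :]) ** 2
Cab, Caa, Cbb = cost(x, y), cost(x, x), cost(y, y)

print(round(sinkhorn_divergence(a, a, Caa, Caa, Caa), 6))  # 0.0 (self-divergence)
print(mmd_sq(a, b, Cab, Caa, Cbb) > 0.0)                   # True
```

Production implementations (e.g. the POT or GeomLoss libraries) work in the log domain for stability; the point here is only the structure of Eqs. (6)-(10) and the debiasing of Dfn. 13.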
In the (HE) case, we assume that µ is fully factorized, i.e.:

    v^{(k_1,...,k_n)} = Π_{i=1}^n w_i^{(k_i)}.    (14)

This case occurs when all information about the pairings between the absolute pose measures µ_i is lost. In the (LE) case, v^{(k_1,...,k_n)} = 0 whenever k_1, ..., k_n are not all equal to each other. This implies in particular that all the marginals µ_i have the same number of point masses, K_1 = ... = K_n = K, and share the same weights:

    w_i^{(k)} = v^{(k,...,k)} =: v^{(k)}.    (15)

This setting is 'simpler' in the sense that it carries more information about the absolute poses. We will illustrate that both of these settings are useful in different applications.

We propose to compute, in either case, a joint distribution µ that is consistent with the relative pose measures. As a first step towards this, we further introduce the 'composition functions' g_ij, which are simply given by:

    g_ij(q_1, ..., q_n) := q_i q̄_j.    (16)

We use g_ij to construct 'ideal relative pose measures' from the joint absolute measure µ by a push-forward operation, again denoted by g_ij(µ):

    g_ij(µ) := Σ_{k_1=1}^{K_1} ... Σ_{k_n=1}^{K_n} v^{(k_1,...,k_n)} δ_{q_i^{(k_i)} q̄_j^{(k_j)}}.    (17)

In both cases (HE) and (LE), g_ij admits a simpler expression, which is provided in the supplementary document. We are now ready to formally define the measure synchronization problem.

Definition 14. Let L(µ) be a loss functional of the form:

    L(µ) := Σ_{(i,j)∈E} D(g_ij(µ), µ_ij),    (18)

where D denotes a divergence that measures the discrepancy between two probability distributions, and let R be a regularization functional imposing structure on µ. The measure synchronization problem is formally defined as:

    min_µ L(µ) + R(µ)
    s.t. [w_i^1, ..., w_i^{K_i}]^⊤ ∈ C_i,  ∀ i ∈ {1, ..., n},    (19)

where C_i ⊆ R_+^{K_i} denotes a constraint set in which the weights of the absolute measure µ_i are restricted to live.

Note that in both (HE) and (LE), µ can be completely recovered from the marginals µ_i, and that an additional constraint on the weights is needed in (LE): w_i^k = w_j^k for all 1 ≤ i, j ≤ n.

Our problem formulation resembles the recent optimal transport-based implicit generative modeling problems [7, 46, 81]. The main differences in our setting are as follows: (i) the choice of the composition functions g_ij is specific to the pose synchronization problem; (ii) the problem is purely nonparametric, in the sense that the absolute measures {µ_i}_i are not defined through any parametric mapping. We will be interested in two choices for the loss D:

Case 1 - Sinkhorn-Sync. Due to its recent success in various applications in machine learning [7] and its favorable theoretical properties [28], in our first configuration we choose the loss functional D to be the squared Sinkhorn divergence S²_{c,α} with finite α, as defined in Dfn. 13. In this setting, we impose the absolute measures µ_i to be probability measures, i.e. we set each C_i to be the probability simplex Σ_{K_i}, hence imposing the sum Σ_{k=1}^{K_i} w_i^{(k)} to be equal to 1. Thanks to these constraints, in this setup we do not need to use regularization, i.e. we set R(µ) = 0.

Case 2 - MMD-Sync. In our second configuration, we consider a setting which enables us to analyze the proposed Riemannian particle descent algorithm, to be defined in the next section. In particular, the loss functional is given by the MMD distance. In this setup, we constrain the absolute measures µ_i to be positive measures by choosing the constraint sets to be C_i = R_+^{K_i}. Accordingly, we use a regularization on the weights of µ_i, given as follows:

    R(µ) = λ Σ_{i=1}^n Σ_{k_i=1}^{K_i} w_i^{(k_i)},    (20)

where λ > 0 is a hyper-parameter. By relaxing the constraint on the weights, this problem can be written as an unconstrained optimization problem which favors 'sparse' solutions: we encourage the weights to have small values, which in turn forces the solution to have only a few point masses with large weights. We will show that in the (LE) setting, and by using the structure of the MMD, this setup is closely related to the recent sparse optimization problem in the space of measures [20], and hence inherits its theoretical properties.

Choice of the ground cost. As for the ground cost function c required by the Sinkhorn divergence and the MMD, we investigate two options. The first option is simply the squared Euclidean distance between the point masses, c(q̂_ij^{(k)}, q_ij^{(l)}) = ||q̂_ij^{(k)} − q_ij^{(l)}||²₂, where q̂_ij^{(k)} and q_ij^{(l)} are given in (17) and (11), respectively. This yields a positive definite MMD kernel. On the other hand, for the quaternions, which are naturally endowed with a Riemannian metric, the second option is the geodesic distance as defined in Dfn. 7, i.e. c(q̂_ij^{(k)}, q_ij^{(l)}) = d(q̂_ij^{(k)}, q_ij^{(l)})^p, 1 ≤ p ≤ 2. This second choice leads to a positive definite MMD kernel k (c.f. Dfn. 11) only when p = 1.
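To make Definition 14 concrete, here is a minimal sketch of our own (helper names hypothetical) of an MMD-Sync-style loss in the fully factorized (HE) case, using the squared-Euclidean ground cost on the quaternion embedding in R^4.

```python
import numpy as np

# Sketch of an MMD-Sync-style loss (Definition 14, HE case) for unit
# quaternions. Helper/variable names are ours; qmul/qconj are the Hamilton
# product and conjugate, vectorized over leading axes.

def qmul(p, r):
    p1, vp, r1, vr = p[..., :1], p[..., 1:], r[..., :1], r[..., 1:]
    return np.concatenate(
        [p1 * r1 - np.sum(vp * vr, -1, keepdims=True),
         p1 * vr + r1 * vp + np.cross(vp, vr)], axis=-1)

def qconj(q):
    return np.concatenate([q[..., :1], -q[..., 1:]], axis=-1)

def mmd_sq(wa, xa, wb, xb):
    """Squared MMD (Dfn. 11) with a Gaussian kernel on the embedding R^4."""
    k = lambda X, Y: np.exp(-np.sum((X[:, None] - Y[None]) ** 2, -1))
    return wa @ k(xa, xa) @ wa + wb @ k(xb, xb) @ wb - 2 * wa @ k(xa, xb) @ wb

def he_loss(particles, weights, edges, rel_measures):
    """L(mu) = sum over edges of MMD^2(g_ij(mu), mu_ij), HE (factorized) case."""
    loss = 0.0
    for (i, j), (w_ij, q_ij) in zip(edges, rel_measures):
        # Push-forward g_ij(mu): atoms q_i^(k) conj(q_j^(l)), weights w_i^(k) w_j^(l).
        atoms = qmul(particles[i][:, None], qconj(particles[j])[None]).reshape(-1, 4)
        w = np.outer(weights[i], weights[j]).ravel()
        loss += mmd_sq(w, atoms, w_ij, q_ij)
    return loss

rng = np.random.default_rng(1)
qs = rng.normal(size=(3, 2, 4))                     # 3 cameras, 2 particles each
qs /= np.linalg.norm(qs, axis=-1, keepdims=True)
ws = [np.full(2, 0.5)] * 3
edges = [(0, 1), (1, 2)]
# Perfectly consistent input: mu_ij generated from the particles themselves.
rel = [(np.outer(ws[i], ws[j]).ravel(),
        qmul(qs[i][:, None], qconj(qs[j])[None]).reshape(-1, 4)) for i, j in edges]
print(np.isclose(he_loss(qs, ws, edges, rel), 0.0))  # True: consistent graph
```

A fully cycle-consistent set of absolute measures drives every edge term, and hence the total loss, to zero; inconsistent relative measures leave a positive residual.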
Figure 2. Density estimates for a single camera visualized over time (panels: T=1 and T=1000). We sum up the probability density functions of Bingham distributions centered on each quaternion and marginalize over the angles to reduce the dimension to 3. This way we are able to plot the density estimate for all particles of a given camera and track it across iterations. Note that here we use the MMD distance with an unconstrained RPGD.

Positivity of the kernel is a theoretical requirement for both the Sinkhorn divergence and the MMD to be distance metrics; however, it turns out that the computation of the gradients becomes numerically unstable when p = 1. Hence, as a remedy, we follow the robust optimization literature [2] and consider p > 1. This approach finds a good balance in terms of numerical stability: for p slightly larger than 1, we lose the positive definiteness of the kernel k; however, the numerical computation of the gradients becomes stable. We have observed in practice that small values of p, i.e. 1 + ǫ

... PGD algorithms, which are in general applicable to measures defined in Euclidean spaces. Unfortunately, standard PGD algorithms often cannot be directly applied to our problem, since our measures are defined on a Riemannian manifold. Therefore, similar to [20], we consider a modified PGD algorithm, which we call the Riemannian PGD (RPGD). RPGD consists of two coupled equations. The first equation updates the locations of the particles q_i^{(k)} on the quaternion manifold and does not depend on R(µ):
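A generic Riemannian gradient step of this kind — project the Euclidean gradient onto the tangent space at each particle, then retract back to the unit-quaternion sphere S³ — can be sketched as follows. This is our own simplification (projection plus renormalization as the retraction, rather than the exponential map of Dfn. 5), not necessarily the paper's exact update.

```python
import numpy as np

# Illustrative sketch of one RPGD-style particle update on the unit-quaternion
# sphere S^3 (our own simplification; not the paper's exact update equation).

def riemannian_step(q, euclidean_grad, lr=0.1):
    """Descend along the tangent component of the gradient, then retract to S^3."""
    q = q / np.linalg.norm(q)
    tangent = euclidean_grad - (euclidean_grad @ q) * q   # project out normal part
    q_new = q - lr * tangent
    return q_new / np.linalg.norm(q_new)                  # retraction to the sphere

# Toy objective: pull a particle towards a target quaternion under the
# squared-Euclidean ground cost f(q) = ||q - q_target||^2.
target = np.array([1.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0, 0.0])                        # 90 degrees off
for _ in range(200):
    grad = 2.0 * (q - target)                             # Euclidean gradient of f
    q = riemannian_step(q, grad)

print(np.isclose(np.linalg.norm(q), 1.0))                 # True: stays on S^3
print(np.allclose(q, target, atol=1e-3))                  # True: converged
```

The projection keeps every iterate a valid unit quaternion, which is exactly why a Euclidean particle update cannot be used directly here.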