STRIDE along Spectrahedral Vertices for Solving Large-Scale Rank-One Semidefinite Relaxations

Heng Yang1, Ling Liang2, Kim-Chuan Toh2, and Luca Carlone1

1Laboratory for Information and Decision Systems, MIT 2Department of Mathematics, National University of Singapore

Abstract

We consider solving high-order semidefinite programming (SDP) relaxations of nonconvex polynomial optimization problems (POPs) that admit rank-one optimal solutions. Existing approaches, which solve the SDP independently from the POP, either cannot scale to large problems or suffer from slow convergence due to the typical degeneracy of such SDPs. We propose a new algorithmic framework, called SpecTrahedral pRoximal gradIent Descent along vErtices (STRIDE), that blends fast local search on the nonconvex POP with global descent on the convex SDP. Specifically, STRIDE follows a globally convergent trajectory driven by a proximal gradient method (PGM) for solving the SDP, while simultaneously probing long, but safeguarded, rank-one “strides”, generated by fast algo- rithms on the POP, to seek rapid descent. We prove STRIDE has global convergence. To solve the subproblem of projecting a given point onto the feasible set of the SDP, we reformulate the projection step as a continuously differentiable unconstrained optimization and apply a limited-memory BFGS method to achieve both scalability and accuracy. We conduct numerical experiments on solving second-order SDP relaxations arising from two important applications in machine learning and com- puter vision. STRIDE dominates a diverse set of five existing SDP solvers and is the only solver that can solve degenerate rank-one SDPs to high accuracy (e.g., KKT residuals below 1e−9), even in the presence of millions of equality constraints.

1 Introduction

d Let p, h1, . . . , hl ∈ [x] be real-valued multivariate polynomials in x ∈ R . We consider the following equality constrained polynomial optimization problem arXiv:2105.14033v1 [math.OC] 28 May 2021 min {p(x) | hi(x) = 0, i = 1, . . . , l} . (POP) d x∈R Assuming problem (POP) is feasible and bounded below, we denote by −∞ < p? < ∞ the global minimum and by x? a global minimizer. Finding (p?, x?) has applications in various fields of engineering and science [28]. It is, however, well recognized that problem (POP) is NP-hard except T ? ? for some special cases (e.g., minxTx=1 x Qx for a symmetric matrix Q, has (p , x ) as the minimum eigenvalue and associated eigenvector of Q). Therefore, solving convex relaxations of (POP), especially semidefinite programming (SDP) relaxations arising from the celebrated moment/sums- of-squares hierarchy [27, 40], has been the predominant way of finding (p?, x?). In this paper, we design robust and scalable algorithms for computing (p?, x?) of (POP) through its SDP relaxations. SDP Relaxations for Polynomial Optimization Problems. We first give an overview of the moment/sums-of-squares semidefinite relaxation hierarchy [27, 40]. Let κ ≥ 1 be an integer such that 2κ is equal or greater than the maximum degree of p, hi, i = 1, . . . , l, and denote v := [x]κ as

Preprint. Under review. T the standard vector of monomials in x of degree up to κ. We build the moment matrix X := [x]κ[x]κ that contains the standard monomials in x of degree up to 2κ. As a result, p and hi’s can be written as linear functions in X. We then consider two types of necessary linear constraints that can be imposed on X: (i) the entries of X are not linearly independent, in fact the same monomial could appear in multiple locations of X; (ii) each constraint hi in (POP) generates a set of equalities of the form hi(x)[x]2κ−deg(hi) = 0, having monomials in x of degree up to 2κ and thus linear in X, where deg(hi) is the degree of hi. Since X is positive semidefinite by construction, reformulating (POP) using the moment matrix X leads to the following primal semidefinite relaxation min {hC,Xi | A(X) = b, X  0} , (P) n X∈S ¯ d+κ m n m where n = dκ := ( κ ) is the dimension of [x]κ, b ∈ R , A : S → R is a linear map A(X) := m n n (hAi,Xi)i=1, ∀X ∈ S with Ai ∈ S , i = 1, . . . , m, that is onto and collects all the linearly independent constraints generated by the relaxation. The feasible set of (P) is called a spectrahedron and is defined by the intersection of the positive semidefinite cone with a finite set of linear equality ∗ m n ∗ Pm m constraints. Let A : R ∈ S be the adjoint of A defined as A y := i=1 yiAi, ∀y ∈ R , the standard Lagrangian dual of (P) is min {hb, yi | A∗y + S = ,S  0} , (D) m n y∈R ,S∈S and admits an interpretation using sums-of-squares polynomials [40]. SDP relaxations (P)-(D) correspond to the so-called dense hierarchy, while many sparse variants exist to exploit the sparsity of p and hi’s to generate relaxations with multiple smaller blocks, see [51]. The algorithms developed in this paper can be extended in a straightforward way to solve sparse multi-block relaxations. ? ? ? ? Assuming strong duality holds between (P) and (D), it follows that p ≥ fP = fD, where fP and ? fD are the optimal values of (P) and (D), respectively. The SDP relaxation is said to be tight if ? ? ? ? ? T p = fP. In such cases, for any global minimizer x of (POP), the rank-one lifting X = [x ]κ[x ]κ is a minimizer of (P), and any rank-one optimal solution X? of (P) corresponds to a global minimizer of (POP). A special case of the relaxation hierarchy is Shor’s relaxation (κ = 1) for quadratically constrained quadratic programs [47, 30], of which applications and analyses have been extensively studied [15, 1, 12, 45, 42, 17, 50, 14]. Although κ = 1 is interesting and typically leads to SDPs with a small number of constraints (e.g., m = n in MAXCUT [21]) that can be solved efficiently by existing solvers, higher order relaxations (κ ≥ 2) are often desirable to produce tighter relaxations. ? ? In the seminal work [27], Lasserre proved that fP asymptotically approaches p as κ increases to infinity [27, Theorem 4.2]. Later on, several authors showed that tightness indeed happens for a finite κ [29, 37, 38]. Notably, Nie proved that, under the archimedean condition, if constraint qualification, strict complementarity and second-order sufficient conditions hold at every global minimizer of (POP), then the hierarchy converges at a finite κ [38, Theorem 1.1]. What is more encouraging is that, empirically, for many important (POP) instances, tightness happens at a very low relaxation order such as κ = 2, 3, 4. For instance, Lasserre [26] showed that κ = 2 attains tightness for a set of 50 randomly generated MAXCUT problems; in the experiments of Henrion and Lasserre [22], tightness holds at a small κ for most of the (POP) problems in the literature; Doherty, Parrilo and Spedalieri [20] demonstrated that the second-order SOS relaxation is sufficient to decide quantum separability in all bound entangled states found in the literature of dimensions up to 6 by 6; Cifuentes [16] designed and proved that a sparse variant of the second-order moment relaxation is guaranteed to be tight for the structured total least squares problem under a low noise assumption; Yang and Carlone applied a sparse second-order moment relaxation to globally solve a broad family of nonconvex and combinatorial computer vision problems [54, 56, 55]. Computational Challenges in Solving High-order Relaxations. The surging applications listed above make it imperative to design computational tools for solving high-order (tight) SDP relax- ations. However, high-order relaxations pose notorious challenges to existing SDP solvers. The computational challenges mainly come from two aspects. In the first place, in stark contrast to Shor’s semidefinite relaxation, high-order relaxations lead to SDPs with a large amount of linear constraints, even if the dimension of the original (POP) is small. Taking binary quadratic program- ming as an example (i.e., xi ∈ {+1, −1} and p is quadratic in (POP)), when d = 60, the number of equality constraints for κ = 2 is m = 1, 266, 971. For this reason, in all applications above, the experiments were performed on (POP) problems of very small size, e.g., the original (POP) has d up to 20 for dense relaxations. In the second place, the resulting SDP is not only large, but

2 also highly degenerate. Specifically, since the relaxation is tight and the optimal solution of (P) is rank one, it necessarily fails the primal nondegeneracy condition [3], which is crucial for ensuring numerical stability and fast convergence of existing algorithms [4, 61]. Due to these two challenges, our experiments show that no existing solver can consistently solve semidefinite relaxations with rank-one solutions to a desired accuracy when m is larger than 500, 000. In particular, interior point methods, such as SDPT3 [49] and MOSEK [6], can only handle up to m = 50, 000 before they run out of memory on an ordinary workstation. First-order methods based on ADMM and conditional gradient method, such as CDCS [62] and SketchyCGAL [59], albeit scalable, converge very slowly and cannot attain even modest accuracy. The best performing existing solver is SDPNAL+ [58], which employs an augmented Lagrangian framework where the inner problem is solved inexactly by a semi-smooth Newton method with a conjugate gradient solver (SSNCG). However, the problem with SDPNAL+ is that, in the presence of large m and degeneracy, computing the inexact SSN step involves solving a large and singular linear system of size m × m, and CG becomes incapable of computing the search direction with sufficient accuracy. A tempting alternative approach, since (P) has rank-one optimal solutions, is to apply the low-rank factorization method of Burer and Monteiro T T n×r (B-M) [13] and solve the nonlinear optimization minV ∈R { C,VV | A(VV ) = b} for some T ? small r. We note that since ∇V ( Ai,VV ) = 2AiV , if rank (V ) = 1 at the optimal solution ? ? m ? V , then rank ({2AiV }i=1) ≤ n  m no matter how large r is. Therefore, V fails the linear independence constraint qualification and may not be a KKT point, both of which are required to guarantee convergence [13, 11]. In fact, successful theory and applications of B-M factorization re- quire m ≈ n so that the dual multipliers can be obtained in closed-form from nonlinear programming solutions [42, 41, 11], a property that does not hold for degenerate SDPs. Nevertheless, exploiting low rankness and nonlinear programming tools is a great source of inspiration, and, as we will show, our solver shares B-M’s insight of blending fast local search with a scheme to escape local minima. Our Contribution. We contribute the first approach that can solve high-order rank-one SDP relax- ations to high accuracy in spite of large m and degeneracy. Our solver employs a globally convergent proximal gradient method (PGM) as the backbone for solving the convex SDP. Despite having favorable global convergence, the PGM steps are typically short and lead to slow convergence, particularly in the presence of degeneracy. Therefore, we blend short PGM steps with long rank-one steps generated by fast nonlinear programming (NLP) algorithms. The intuition is that, once we have achieved a sufficient amount of descent via PGM, we can switch to NLP to perform local refinement to “jump” to a nearby rank-one point for rapid convergence. Due to this reason, we name our framework SpecTrahedral pRoximal gradIent Descent along vErtices (STRIDE), which essentially strides along the rank-one vertices of the spectrahedron until the global minimum is reached. Another interpretation is that STRIDE only uses the SDP relaxation to escape local minima (corresponding to suboptimal rank-one vertices of the spectrahedron) or certify global optimality, while relying on NLP to quickly reach local minima. To tackle the challenge of large m when solving the PGM subproblem of projection onto the spectrahedron of (P), we show the projection problem can be reformulated as a continuously differentiable unconstrained optimization and we solve it with a limited-memory BFGS (L-BFGS) method that scales to millions of variables. We apply STRIDE to solve second-order relaxations of two POPs, namely the nearest structured rank deficient matrix problem [16] that finds extensive applications in control, statistics, machine learning and computer algebra, and the outlier-robust Wahba problem [54] that underpins several important computer vision applications. We compare STRIDE with a diverse set of five state-of-the-art SDP solvers and demonstrate that STRIDE is the only solver that can solve SDP relaxations with millions of constraints to high accuracy (e.g., KKT residuals below 1e−9) while being up to 10-20 times faster on medium-scale problems. Paper Organization. We first provide the intuition behind STRIDE by discussing an illustrative example in Section2. We formally introduce STRIDE and prove its global convergence in Section3. In Section4, we show that the critical subproblem of projection onto the spectrahedron can be reformulated as a smooth unconstrained problem and we apply L-BFGS to solve it at scale. We provide numerical results in Section5 and we give concluding remarks in Section6. n n n Notation. Let S be the space of real symmetric n × n matrices, and S+ (resp. S++) be the set of positive semidefinite (resp. definite) matrices. We also write X  0 (resp. X 0) to n indicate√X is positive semidefinite (resp. definite) when the dimension of X is clear. For x ∈ R , T m×n p T kxk = x x is the standard `2 norm. For X ∈ R , kXk = tr (X X) denotes the Frobenius p norm. kxkH := hx, Hxi denotes the H-weighted norm for a positive semidefinite mapping H. Let n n I denote the identity map from S → S . We use δC(·) to denote the indicator function of a set C.

3 2 A Simple Univariate Example

Consider minimizing a quartic polynomial subject to a single quartic equality constraint   4 2 3 2 2 2 min p(x) := x + x − 8x − 8x h(x) := (x − 4)(x − 1) = 0 , (1) x∈R 3 where it is obvious that x can only take four real values {+2, −2, +1, −1} due to the constraint ? ? 80 h(x) = 0, among which x = 2 attains the global minimum p = − 3 ≈ −26.67. A plot of p(x) is shown in Fig.1(a). However, let us discard our “global” view of p(x) and design an algorithm to numerically obtain (x?, p?) through its SDP relaxation. Towards this goal, let v(x) := [1, x, x2]T be the vector of monomials in x of degree up to 2, and build the moment matrix X := vvT. X has i monomials of x with degree up to 4, which we denote as zi := x , i = 1,..., 4. With h(x) = 0, we 4 2 further have x = 5x − 4, i.e., z4 = 5z2 − 4. Now we can write the cost function p(x) as a linear 2 2 function of z: f(z) = z4 + 3 z3 − 8z2 − 8z1 = 3 z3 − 3z2 − 8z1 − 4. With this construction, the second-order relaxation of (1) reads ( " # ) 2 1 z1 z2 min f(z) := z3 − 3z2 − 8z1 − 4 X(z) := z1 z2 z3  0 . (2) z∈ 3 3 R z2 z3 5z2 − 4 We plot the feasible set of problem (2), i.e., a convex spectrahedron, in Fig.1(b) using Mathemat- ica [23]. The four rank-one vertices are annotated with associated coordinates (z1, z2, z3), where z1 = x ∈ {+2, −2, +1, −1} corresponds to the four values that x can take. The negative gradient 2 T direction −∇f(z) = [8, 3, − 3 ] , ∀z, is shown as solid blue arrows. The hyperplane defined by f(z) ? taking a constant value is shown as a solid green line in Fig.1(b). Clearly, z = (2, 4, 8) (“F” in (b)) ? attains the minimum of the SDP (2), which corresponds to x = 2 (“F” in (a)) in POP (1).

Figure 1: Overview of STRIDE using a univariate example.

We now state an informal version of STRIDE that finds both x? and z?. Initialization. We use a standard nonlinear programming (NLP) solver (e.g., fmincon in Matlab) to solve problem (1). We initialize at x(0) = −10, and NLP converges to the local minimum x(1) = −2, with a cost p(x(1)) ≈ −5.33 (“ ” in Fig.1(a)). Lift and Descend. The initialization converges to a local minimum. Unfortunately, the NLP solver is unaware of this suboptimality, and incapable of escaping the local minimum. The SDP (2) now comes into play. By lifting x(1) to X = v(x(1))v(x(1))T, we obtain the rank-one vertex “ ” in Fig.1(b) with z(1) = [−2, 4, −8]T. Although it is difficult to descend from x(1) in POP (1), descending from z(1) in the SDP (2) is relatively easier. We first take a step along the negative (1) (1) (1) gradient to arrive at z+ (σ) := z − σ∇f(z ) for a given step size σ > 0, and then project (1) (1) (1) 34 T z+ back to the spectrahedron, denoted as z¯ . Taking σ = 5 yields z+ = [38, 19, − 3 ] and z¯(1) = [−0.35, 3.99, −1.67]T, and the projection z¯(1) is plotted in Fig.1(b) as “ •”. Rounding and Local Search. The step we just performed is called projected gradient descent, and iteratively performing this step guarantees convergence to the SDP optimal solution [9]. However, the drawback is that this step is short and it typically requires many iterations to converge. Instead, we go back to the POP and try to generate a long rank-one step. In particular, we perform a spectral

4 decomposition of the top-left 2 × 2 block of X(¯z(1)): " # " # " # 1 −0.35 T T 1 1 =λ1v1v +λ2v2v , v1=0.12 , v2=0.99 , λ1,2=(4.04,0.96), −0.35 3.99 1 2 −8.59 0.12 where we have written v1,2 by normalizing their leading entry as 1, and ordered them such that (1) λ1 ≥ λ2. We then generate two hypotheses x¯1,2 = (−8.59, 0.12) (“•” in Fig.1(a)) and use NLP to (1) (1) perform local search starting from the hypotheses x¯1,2, respectively. Although x¯1 still converges to (1) (1) ? (1) ? ? x = −2, x¯2 converges to x = 2. Now that NLP has visited both x and x , with x attaining a lower cost, we are certain that x? is at least a better local minimum. Hence, a successful long rank-one step is generated and we can move to x?. We call this rounding and local search. Lift and Certify. To certify the global optimality of x? (or to escape it if suboptimal), we need the SDP relaxation again. We lift x? to z? = [2, 4, 8]T, and try to descend from z? by performing another step of projected gradient descent. However, because z? is optimal, this step ends up back to z?, i.e., no descent is possible [9]. Therefore, we conclude that z? is optimal for SDP (2). Because X(z?) is rank one, and p(x?) = f(z?), we can say that x? is indeed optimal for the nonconvex POP (1). This simple example demonstrates the major difference between STRIDE and existing SDP solvers. STRIDE only uses projected gradient descent on the SDP for escaping local minima or certifying global optimality, and it leverages fast nonlinear programming tools to generate long rank-one steps to accelerate the convergence of projected gradient descent.

3 STRIDE: Accelerating PGM with Nonlinear Programming

In this section, we formalize the intuition in Section2 and present the STRIDE algorithm for solving the nonconvex (POP) and its associated SDP relaxation (P)-(D). Let us first rewrite the primal SDP (P) as the following convex composite program min f(X) + g(X) (3) n X∈S where f(X) := hC,Xi is a continuously differentiable convex function with Lf -Lipschtzian gradient

(Lf can be any positive scalar since f is a linear function) and g(X) := δFP (X) is the indicator n ¯ k n function of the spectrahedron FP = {X ∈ S | A(X) = b, X  0}. Given a point X ∈ S and a n n positive definite map Hk : S → S for k ≥ 0, consider the following quadratic approximation of f ¯ k ¯ k ¯ k 1 ¯ k 2 qk(X) = f(X ) + ∇f(X ),X − X + X − X . (4) 2 Hk ¯ 1 0 n Short PGM Step. It is well known that, with an arbitrary initial point X = X ∈ S+ and t0 = 0, the following proximal gradient method (PGM)  k X = arg min qk(X) + g(X)  n X∈S , ∀k ≥ 0 (PGM) k+1 k tk − 1 k k−1 X¯ = X + X − X  tk+1 ? 2 guarantees convergence to an optimal solution X of (3), if we choose tk+1 > 0 such that tk+1 − 2 tk+1 ≤ tk, and Hk such that Hk  Hk+1 0 (see [24, 8] for well-studied complexity analyses). 1 Particularly, in this paper we consider a simple choice with tk = 1, ∀k, and Hk = I for some σk nondecreasing sequence of positive numbers {σk}. With this, the subproblem in (PGM) becomes

k+1 1 k 2 k  X = arg min hC,Xi + X − X + g(X) = ΠFP X − σkC , (5) n X∈S 2σk where ΠFP denotes the projection onto the spectrahedron FP, and we have used tk = 1 so that k+1 k X¯ = X . This choice of tk and Hk leads to projected gradient descent [9, Section 3.3], as eq. (5) first takes a step along the negative gradient and then projects the trial point onto the feasible set. Long Rank-One Step. The issue with (5), or in general (PGM), is that the convergence can be slow. Here we propose to exploit the low-rankness of the optimal solution X? and accelerate the ¯ k+1 k convergence by generating long rank-one steps. Towards this goal, calling X := ΠFP (X −σkC), we compute a potentially better rank-one iterate via the following three steps:

5 ¯ k+1 Pn T ¯ k+1 1. (Rounding). Let X = i=1 λivivi be the spectral decomposition of X with λ1 ≥ ... ≥ λn in nonincreasing order. Compute r hypotheses from the leading r eigenvectors k+1 x¯i = rounding(vi), i = 1, . . . , r, (6) where the function rounding tries to generate a feasible initial guess for the (POP). The function rounding is problem dependent and we provide examples in Section5. If the feasible set of (POP) is easy to project (e.g., binary [21], unit sphere [54], and orthogonal [56, 12]), then rounding just projects vi(x), i.e., the entries of vi corresponding to order-one monomials, to the feasible set. Otherwise, rounding outputs vi(x) as an infeasible initial guess, from which NLP can still converge [39]. 2. (Local Search). Apply a local search method for the (POP) using NLP with initial point k+1 k+1 chosen as x¯i for each i = 1, . . . , r. Denote the solution of each local search as xˆi , with k+1 associated objective value p(ˆxi ), choose the best local solution with minimum objective value. Formally, this is

k+1 k+1 k+1 k+1 xˆ = nlp(¯x ), i = 1, . . . , r, xˆ = arg min k+1 p(ˆx ). (7) i i xˆi ,i=1...,r i One can use any NLP solver to perform local search, with a good choice being an interior point method [39, Chapter 19] that is available via fmincon in Matlab. If the feasible set of (POP) admits a smooth manifold structure, then a Riemannian trust region solver is recommended and available in Manopt [10]. 3. (Lifting). Perform a rank-one lifting of the best local solution T Xˆ k+1 = v xˆk+1 v xˆk+1 , (8)

where v : Rd → Rn is a dense or sparse monomial lifting based on the relaxation scheme. Taking the Right Step. Now we are given two candidates for the next iteration, namely the short k+1 k PGM step X¯ (generated by computing the projection of X − σkC onto FP) and the long rank-one step Xˆ k+1 (obtained by rounding, local search, and lifting). The follow-up question is: which one should we choose to be the next iterate Xk+1 such that the entire sequence {Xk} is globally convergent? The answer to this question is quite natural –we accept Xˆ k+1 if and only if it attains a strictly lower cost than X¯ k+1 (cf. eq. (10))– and guarantees that the algorithm visits a sequence of rank-one vertices (local minima via NLP) with descending costs.

Algorithm 1: SpecTrahedral pRoximal gradIent Descent along vErtices (STRIDE). 0 0 0 n n m Given (X ,S , y ) ∈ S+ × S+ × R , a tolerance tol > 0, an integer r ∈ [1, n], a constant  > 0, a nondecreasing sequence {σk > 0}. Perform the following steps for k = 0, 1,... . Step 1. (Short PGM step). Apply a primal-dual algorithm with initial point (Xk,Sk, yk) to compute ¯ k+1 k  X = ΠFP X − σkC , (9)

k+1 k+1 n m and the associated dual variable (S , y ) ∈ S+ × R (see Section4). Step 2. (Long rank-one step via NLP). Compute Xˆ k+1 as in eqs. (6)-(8). Step 3. (Update primal variable). Update Xk+1 according to Xˆ k+1 if f(Xˆ k+1) ≤ f(X¯ k+1) −  , and Xˆ k+1 ∈ F Xk+1 = P . (10) X¯ k+1 otherwise

Step 4. (Check convergence). If a stopping condition with respect to tol is met (cf. (17)), then output (Xk+1,Sk+1, yk+1). Otherwise, go to Step 1.

The full STRIDE algorithm is presented in Algorithm1. We have the following global convergence. Theorem 1 (Global Convergence). Suppose the Slater condition for (P) holds and {(Xk, yk,Sk)} is generated by STRIDE, then {f(Xk)} converges to f ?, where f ? is the optimal value of (P).

The proof is straightforward and presented in the Supplementary Material.

6 4 Solving the Projection Subproblem

A key ingredient to enable effective execution of STRIDE is to efficiently compute the projection onto n the spectrahedron FP = {X ∈ S | A(X) = b, X  0}, which is required at each iteration (cf. (9)). n Formally, given a point Z ∈ S , the projection problem seeks the closest point in FP w.r.t. Z   1 2 min kX − Zk | X ∈ FP . (11) n X∈S 2

Since FP is the intersection of two convex sets, namely the hyperplane defined by A(X) = b and n the positive semidefinite cone S+, a natural idea is to apply Dykstra’s projection (see e.g., [18]) to generate an approximate solution by alternating the projection onto the hyperplane and the projection onto the semidefinite cone, both of which are easy to compute. However, Dykstra’s projection is known to have slow convergence and it may take too many iterations until a satisfactory projection is found. As a result, instead of solving (11) directly, we consider its dual problem   1 ∗ 2 m n min kS + A y + Zk − hb, yi | y ∈ R ,S ∈ S+ , (12) y,S 2 1 2 where we have ignored the constant term − 2 kZk and converted “max” to “min” by changing the sign of the objective function. The KKT conditions for the pair (11) and (12) are: A(X) = b, A∗y + S = X − Z,X,S  0,XS = 0. (13)

An Unconstrained Formulation. Now we introduce a key observation that allows us to further simplify the dual (12). Fixing the unconstrained y, problem (12) can be seen as finding the closest n ∗ S ∈ S+ w.r.t. the matrix −A y − Z, and hence admits a closed-form solution ∗ S = Π n (−A y − Z) . (14) S+ As a result, after inserting (14), problem (12) is equivalent to 2 1 ∗ min φ(y) := Π n (A y + Z) − hb, yi , (15) m S+ y∈R 2 ∗ ? with the gradient of φ(y) given as ∇φ(y) = AΠ n (A y + Z) − b. Thus, if y is an optimal S+ solution for problem (15), then we can recover S? from (14), and X? from the KKT conditions (13): X? = A∗y? + S? + Z. Formulating the dual problem as (15) has appeared multiple times in [61, 31]. Now that (15) is a smooth unconstrained convex problem in y ∈ Rm, plenty of efficient algorithms are available, such as (accelerated) gradient descend [36], nonlinear conjugate gradient [19], quasi- Newton methods [39] and the semismooth Newton method [61]. In this paper, we apply the celebrated limited-memory BFGS (L-BFGS) method, see for example [39, Algorithm 7.5]. L-BFGS is easy to implement, can handle very large unconstrained optimization problems due to its low memory consumption, and is typically the “the algorithm of choice” for large-scale problems [39, Chapter 7]. Empirically, we observed that L-BFGS is efficient and robust for various applications. To the best of our knowledge, this is the first work that demonstrates the effectiveness of L-BFGS, or in general quasi-Newton methods, in solving large-scale and degenerate SDPs.

5 Applications and Experiments

In this section, we test STRIDE on solving two POPs arising from practical machine learning and computer vision applications and compare its performance to five existing SDP solvers. Implementation. We implement STRIDE in Matlab, with admmplus [48] generating (X0, y0,S0) as a warm start. Experiments are performed on a Linux PC with Intel i9-7920X CPU and 128GB RAM. We choose r = 5 and  = 1e−12 in Algorithm1 for all our experiments. Stopping Conditions and Solution Quality. To measure the feasibility and optimality of a given n m n approximate solution (X, y, S) ∈ S+ × R × S+ for the SDP pair (P)-(D), we define the following standard relative KKT residuals: kA(X) − bk kA∗y + S − Ck |hC,Xi − hb, yi | ηp = , ηd = , ηg = . (16) 1 + kbk 1 + kCkF 1 + |hC,Xi |+|hb, yi | | {z } | {z } | {z } SDP primal feasibility SDP dual feasibility SDP duality gap

7 Table 1: Results on solving the nearest structured rank deficient matrix problem [16]. At each d, the “mean (max)” values of each metric is computed over 5 random instances. Time in seconds.

Dimension Metric SDPT3 [49] MOSEK [6] CDCS [62] SketchyCGAL [59] SDPNAL+ [58] STRIDE η ↓ 6.2e−11 (1.6e−10) 5.8e−11 (1.8e−10) 5.9e−06 (1.9e−05) 0.34 (0.39) 4.9e−11 (1.0e−10) 1.3e−15 (2.2e−15) d: 29 p ηd ↓ 1.0e−10 (1.7e−10) 2.1e−09 (9.7e−09) 3.3e−04 (6.5e−04) 2.2e−05 (5.7e−05) 1.5e−08 (4.7e−08) 1.7e−13 (7.9e−13) n: 200 ηg ↓ 2.3e−05 (5.5e−05) 2.4e−09 (6.4e−09) 1.7e−04 (3.2e−04) 0.13 (0.29) 3.1e−06 (1.5e−05) 2.7e−10 (6.4e−10) η ↓ 1.3e−06 (6.4e−06) 4.8e−10 (2.4e−09) 0.19 (0.43) 0.89 (0.92) 8.3e−07 (3.9e−06) 5.4e−10 (1.5e−09) m: 10,551 s time ↓ 94 (99) 47 (53) 141 (146) 344 (358) 38 (66) 15 (17) η 4.9e−06 (1.2e−05) 0.96 (1.00) 5.2e−10 (2.5e−09) 2.2e−15 (3.5e−15) d: 59 p ηd 3.7e−04 (5.3e−04) 3.3e−02 (6.2e−2) 4.5e−09 (1.7e−08) 1.7e−13 (7.6e−13) n: 800 ηg out of memory out of memory 2.1e−04 (3.1e−04) 0.28 (0.50) 2.6e−06 (1.2e−05) 9.9e−10 (4.3e−09) η 0.48 (0.57) 0.98 (0.99) 1.4e−06 (6.5e−06) 1.3e−09 (5.7e−09) m: 164,201 s time 4211 (4229) 2174 (2184) 1165 (1380) 279 (321) η 1.5e−05 (1.6e−05) 1.38 (1.39) 1.1e−12 (2.2e−12) 4.1e−15 (5.0e−15) d: 89 p ηd 1.0e−03 (1.1e−03) 1.35 (1.38) 8.6e−06 (2.4e−05) 3.7e−13 (1.6e−12) n: 1800 ηg out of memory out of memory 2.9e−04 (3.7e−04) 0.42 (0.52) 5.4e−2 (0.11) 9.2e−10 (2.0e−09) η 0.84 (0.87) 1.00 (1.00) 0.07 (0.21) 1.5e−08 (7.1e−08) m: 823,951 s time 12276 (12360) 6022 (6058) 9911 (10001) 1721 (2019) η 6.2e−15 (8.1e−15) d: 119 p ηd 2.3e−13 (1.0e−12) n: 3200 ηg out of memory out of memory out of time out of time out of time 6.9e−10 (1.5e−09) η 1.5e−08 (7.0e−08) m: 2,592,801 s time 11252 (15902)

For a given tolerance tol > 0, we terminate STRIDE when

max{ηp, ηd, ηg} ≤ tol. (17) We choose tol = 1e−8 for our experiments. To obtain a solution of the original (POP) with a (sub-)optimality certificate, we compute a POP relative suboptimality gap from (X, y, S) as ∗ |p(ˆx) − (hb, yi + Mλmin(C − A y))| ηs = ∗ , (18) 1 + |p(ˆx)| + |hb, yi + Mλmin(C − A y)| where xˆ ∈ FPOP is a feasible approximate solution to the (POP) that is rounded from the leading eigenvector of X; λmin denotes the minimum eigenvalue, and M ≥ tr (X) is a bound on the trace of ? X when X is generated by a rank-one lifting. We observe that p(ˆx) ≥ p ≥ hb, yi + Mλmin(C − ∗ A y), and hence ηs = 0 certifies that xˆ is a global minimizer of the nonconvex (POP). Baseline Solvers. We choose SDPT3 [49] and MOSEK [6] as representative interior point methods; CDCS [62] and SketchyCGAL [59] as representative first-order methods; and SDPNAL+ [58] as a representative method that combines first-order and second-order methods. For SDPT3 and MOSEK, we use default parameters. For CDCS, we use the sos solver with maximum 20, 000 iterations. For SketchyCGAL, we use sketching size 10, maximum runtime 10, 000 seconds and maximum iterations 20, 000. For SDPNAL+, we use 1e−8 as the tolerance. Random Test Problems. We select two fundamental problems in machine learning and computer vision to test STRIDE against the baseline solvers. The first problem is the nearest structured rank deficient matrix problem, also known as structured total least squares (STLS) [16, 32, 33]. Given an arbitrary matrix, STLS seeks a rank-deficient matrix with certain structure (e.g., Hankel) to best approximate the given matrix. The second problem is the outlier-robust Wahba problem (Wahba) [54, 57]. Given two sets of 3D points with putative correspondences corrupted by outliers, Wahba seeks the best 3D rotation to align them (see Fig.2). For both problems, we use d to denote the dimension of their POP formulations, and use (n, m) to denote the dimension of the resulting SDP relaxations. We generate random test problems by following the protocol in the original papers [16, 54], for increasing d, where five random instances are generated at each d. We compare five metrics, including ηp, ηd, ηg, ηs (as in eqs. (16) and (18)), and time (in seconds), for all the tested solvers. Due to space constraints, we only show the mean and max values of each metric over five random instances. We invite the interested reader to refer to the Supplementary Material for details about the POP formulations, the SDP relaxations, and the experimental setup of both problems. Results. Table1 and2 show the computational results for STLS and Wahba, respectively. We make the following observations. (i) Interior point methods (SDPT3 and MOSEK) are robust to degeneracy and can solve small problems to high accuracy (KKT residuals below 1e−9). However, they quickly become intractable due to large memory consumption and can only solve small SDP relaxations. (ii) First-order methods (CDCS and SketchyCGAL) have affordable memory requirement and scale to large problems. Nevertheless, they suffer from degeneracy and fail to attain even modest accuracy (e.g., ηg of CDCS and SketchyCGAL is above 1e−6 and ηs is above 1e−3). (iii) SDPNAL+ achieves a good balance between scalability and accuracy and it solves small to medium-scale problems with modest accuracy. However, when m is larger than 500, 000, SDPNAL+ starts failing, as shown by the large ηd, ηg and ηs. (iv) STRIDE successfully solves all problem instances, with an accuracy level that

8 Table 2: Results on solving the outlier-robust Wahba problem [54]. At each d, the “mean (max)” values of each metric is computed over 5 random instances. Time in seconds.

Dimension Metric SDPT3 [49] MOSEK [6] CDCS [62] SketchyCGAL [59] SDPNAL+ [58] STRIDE η ↓ 2.3e−09 (5.0e−09) 1.5e−12 (3.0e−12) 2.3e−06 (2.4e−06) 0.83 (0.84) 2.2e−11 (5.6e−11) 1.8e−15 (3.7e−15) d: 54 p ηd ↓ 1.4e−12 (2.2e−12) 3.2e−11 (8.0e−11) 3.3e−05 (3.4e−05) 3.5e−02 (3.5e−02) 1.2e−08 (4.9e−08) 1.1e−13 (1.8e−13) n: 204 ηg ↓ 8.4e−08 (1.8e−07) 6.5e−10 (1.3e−09) 2.0e−03 (2.1e−03) 1.0 (1.0) 5.4e−04 (2.5e−03) 2.7e−09 (4.1e−09) η ↓ 1.8e−12 (3.6e−12) 2.0e−13 (5.8e−13) 0.13 (0.20) 1.0 (1.0) 1.0e−07 (5.0e−07) 8.0e−14 (1.7e−13) m: 8,151 s time ↓ 66 (67) 33 (35) 89 (92) 303 (305) 45 (50) 13 (13) η 1.1e−11 (2.6e−11) 3.8e−07 (5.9e−07) 0.86 (0.88) 1.3e−14 (2.2e−14) 2.3e−15 (5.3e−15) d: 104 p ηd 5.3e−15 (7.1e−15) 2.5e−05 (2.6e−05) 5.2e−02 (5.3e−02) 8.9e−06 (1.1e−05) 2.3e−13 (9.4e−13) n: 404 ηg out of memory 1.1e−09 (3.9e−09) 2.4e−04 (3.8e−04) 1.0 (1.0) 6.6e−02 (8.8e−02) 4.3e−09 (8.9e−09) η 4.0e−13 (1.3e−12) 0.25 (0.35) 1.0 (1.0) 1.9e−02 (2.2e−02) 1.9e−09 (4.8e−09) m: 31,301 s time 870 (975) 328 (334) 831 (832) 501 (554) 45 (59) η 1.0e−06 (1.1e−06) 0.89 (0.89) 6.2e−06 (1.1e−05) 1.7e−15 (5.8e−15) d: 204 p ηd 1.8e−05 (1.9e−05) 9.0e−02 (9.5e−02) 1.3e−05 (1.3e−05) 9.1e−14 (3.9e−13) n: 804 ηg out of memory out of memory 4.8e−04 (5.2e−04) 1.0 (1.0) 7.4e−02 (0.11) 4.0e−09 (7.8e−09) η 0.30 (0.31) 1.0 (1.0) 2.2e−02 (4.0e−02) 7.7e−12 (3.2e−11) m: 122,601 s time 1210 (1220) 1694 (1723) 2333 (2530) 261 (283) η 4.0e−07 (5.7e−07) 0.90 (0.90) 1.5e−04 (2.6e−04) 1.1e−14 (3.8e−14) d: 504 p ηd 1.4e−05 (1.5e−05) 0.13 (0.13) 1.1e−05 (1.1e−05) 3.1e−13 (1.2e−12) n: 2004 ηg out of memory out of memory 1.7e−04 (2.3e−04) 1.0 (1.0) 7.1e−02 (7.8e−02) 3.8e−09 (9.8e−09) η 0.35 (0.38) 1.0 (1.0) 0.16 (0.29) 6.4e−09 (2.9e−08) m: 756,501 s time 7798 (7905) 6005 (6035) 10062 (10219) 2858 (3127) η 1.8e−12 (7.9e−12) d: 1004 p ηd 5.4e−14 (1.5e−13) n: 4004 ηg out of memory out of memory out of time out of time out of time 9.0e−10 (1.6e−09) η 1.9e−09 (8.1e−09) m: 3,013,001 s time 52335 (69992) clearly dominates SDPNAL+, CDCS, SketchyCGAL and is typically higher than SDPT3 and MOSEK in small problems. Moreover, STRIDE is 20 times faster than MOSEK on Wahba when d = 104, 10 times faster than SDPNAL+, on STLS when d = 89, and on Wahba when d = 204. Notably, for the first time in the literature, STRIDE solves degenerate SDP relaxations with m as large as three millions. Outlier-Robust Wahba Problems on Real Data. To show the robustness of STRIDE, we test it on two applications of the Wahba problem on real data. The first application is image stitching on PASSTA [35] shown in Fig.2(a). Given two images taken by the same camera with an unknown relative rotation, we first use SURF [7] to establish 70 putative keypoint matches, and then use STRIDE to solve the Wahba problem (d = 74, n = 284, m = 15, 611) to estimate the relative rotation and stitch the two images. STRIDE obtains the globally optimal solution (max{ηp, ηd, ηg} = 6.2e−13, ηs = 7.8e−14) in 79 seconds. The second application is point cloud registration on 3DMatch [60] shown in Fig.2(b). Given two point clouds with an unknown relative rotation, we first use FPFH [ 44] and ROBIN [46] to establish 108 keypoint matches, and then use STRIDE to solve the Wahba problem (d = 112, n = 436, m = 36, 397) to estimate the rotation and register the point clouds. STRIDE obtains the globally optimal solution (max{ηp, ηd, ηg} = 1.6e−10, ηs = 1.8e−13) in 53 seconds.

(a) Image stitching on PASSTA [35]. (b) Point cloud registration on 3DMatch [60]. Figure 2: STRIDE solves outlier-robust Wahba problems on real datasets to global optimality.

6 Conclusions

We presented STRIDE for globally solving polynomial optimization problems through their semidef- inite relaxations. STRIDE is the first algorithmic framework that combines local search using the POP and global descent using the SDP, and it is the only solver that can solve degenerate rank-one SDPs to high accuracy even in the presence of millions of equality constraints. We applied STRIDE to solve two POPs arising in machine learning and computer vision applications and demonstrated its superior performance in terms of accuracy, robustness, and scalability. Future work aims to investigate inexactness in the computation of the projection subproblem to further speedup computation.

9 Supplementary Material

A1 Proof of Theorem1

Proof. Let V = {Xˆ k(i)} be the sequence of all the Xˆ’s that have been accepted due to (10), where k(i) returns the iteration index of the i-th element in V. If V = ∅, then STRIDE reduces to (PGM) and (5) and is globally convergent. If V 6= ∅, then we claim that V must be finite. Note that, for any two consecutive elements Xˆ k(i) and Xˆ k(i+1) in V, we have f(Xˆ k(i+1)) ≤ f(X¯ k(i+1)) −  < f(Xk(i+1)−1) −  ≤ f(Xˆ k(i)) − , (A1) where the first inequality is due to (10), the second inequality is due to (9) and the fact that projected gradient descent must strictly decrease the objective value when optimality has not been achieved [9, Proposition 3.4.1], and the last inequality holds because k(i + 1) − 1 ≥ k(i). Eq. (A1) states that the objective value must decrease by at least  along each element of V. Therefore, we have fmin(V) ≤ fmax(V) − (|V|−1), where fmin and fmax are the minimum and maximum objective values along V. Hence |V| must be finite, otherwise f ? is unbounded below, contradicting Slater’s condition and strong duality. Let Xˆ k(|V|) be the last element of V, then STRIDE reduces to (PGM) with a new initial point at Xˆ k(|V|) and is still globally convergent, completing the proof.

A2 Nearest Structured Rank Deficient Matrix

N N1×N2 Let N,N1,N2 be positive integers with N1 ≤ N2, and let L : R → R be an affine map. Consider finding the nearest structured rank deficient matrix problem min ku − θk2 , subject to L(u) is rank deficient, (A2) N u∈R where θ ∈ RN is a given point. Problem (A2) is commonly known as the structured total least squares (STLS) problem [43, 34], and has numerous applications in control, systems theory, statistics [32], approximate greatest common divisor [25], camera triangulation [2], among others [33]. Problem (A2) can be reformulated as the following polynomial optimization problem N ! 2 X min ku − θk , subject to z L0 + uiLi = 0, (STLS) z∈SN1−1,u∈ N R i=1

d−1 d N1×N2 where S := {x ∈ R | kxk = 1} denotes the d-dimensional unit sphere, Li ∈ R , i = 0,...,N, are the set of independent bases of the affine map L, z ∈ SN1−1 is a unit vector in the left kernel of L(u) and acts as a witness of rank deficiency. Problem (STLS) is easily seen to be nonconvex and the best algorithm in practice is based on local nonlinear programming [33]. Recently, Cifuentes [16] devised a semidefinite relaxation for (STLS) and proved that the relaxation T T T d is guaranteed to be tight under a low-noise assumption [17]. Let x = [z , u ] ∈ R , d = N1 + N be the vector of unknowns in (STLS), construct the sparse monomial vector of degree up to 2 in x as T T T T n v(x) = [x]s = [z , u1z , . . . , uN z ] ∈ R , n = (N + 1)N1 (A3) T and build the moment matrix X = [x]s[x]s . It can be easily checked that all the off-diagonal N1 ×N1 T blocks, uiujzz , are symmetric by construction. Using [x]s, the N2 equality constraints in (STLS) T n can be conveniently written as [x]s ai = 0, i = 1,...,N2, for constant vectors ai ∈ R . In addition, T each of the equality constraint also gives rise to n localizing constraints of the form ([x]s ai)[x]s = 0. N1−1 Finally, the unit sphere constraint z ∈ S implies the trace of the leading N1 × N1 block of X is equal to 1. The construction above leads to a semidefinite relaxation with size

n = (N + 1)N1, m = 1 + nN2 + t(N1 − 1) × t(N), (A4) d(d+1) where t(d) := 2 for a positive integer d denotes the d-th triangular number. Due to the limitation of interior point methods, Cifuentes [16] was only able to numerically verify the tightness of the relaxation for very small problems (e.g., N1 ≤ N2 ≤ 10, N < 20). With STRIDE, we can compute globally optimal solutions of (STLS) with much larger dimensions. We perform experiments on random instances of (STLS) where the affine map L is structured to be a

10 square Hankel matrix such that N = N1 + N2 − 1,N1 = N2. We set N1 = 10, 20, 30, 40, and at each level we randomly generate five problem instances by drawing θ ∼ N (0,IN ) from the standard Gaussian distribution. We then solve the sparse semidefinite relaxation using SDPT3, MOSEK, CDCS, SketchyCGAL, SDPNAL+, and STRIDE. For STRIDE, since the nonlinear programing (STLS) does not admit any manifold structure, we use fmincon with an interior point method as the nlp method. To generate hypotheses for nlp from the moment matrix X, we follow n X vi(z) X = λ v vT, z = , ui = zTv (u z), j = 1,...,N, (A5) i i i i kv (z)k j i i j i=1 i where we first perform a spectral decomposition, then round zi by projecting the entries of vi i corresponding to block z onto the unit sphere, and then round uj by computing the inner product between zi and the entries of vi corresponding to block ujz (again, the rationale for this rounding comes from the lifting (A3)). We generate r = 5 hypotheses by rounding 5 eigenvectors. In order to set M for computing ηs, we make the assumption that the search variable u of (STLS) contains random vectors that follow N (0,IN ), and its squared norm follows a chi-square distribution of degree N. As a result, we choose the quantile corresponding to a 99.9% probability, denoted as M, as the bound on kuk2, such that M = M + 1 can upper bound the trace of X.

A3 Outlier-Robust Wahba Problem

Consider the problem of finding the best 3D rotation to align two sets of 3D points while explicitly tolerating outliers

N ( −1 2 ) X z˜i − q ◦ w˜i ◦ q min min , 1 (A6) q∈S3 β2 i=1 i 3 3 3 N where q ∈ S is the unit quaternion parametrization of a 3D rotation, (zi ∈ R , wi ∈ R )i=1 are T T 4 given N pairs of 3D points (often normalized to have unit norm), z˜ , [z , 0] ∈ R denotes the −1 T zero-homogenization of a 3D vector z, q , [−q1, −q2, −q3, q4] is the inverse quaternion, “◦” denotes the quaternion product defined as

 q4 −q3 q2 q1  q3 q4 −q1 q2 4 q ◦ p   p, ∀q, p ∈ R , (A7) ,  −q2 q1 q4 q3  −q1 −q2 −q3 q4

βi > 0 is a given threshold that determines the maximum inlier residual, and min {·, ·} realizes the so-called truncated least squares (TLS) cost function in robust estimation [5, 52]. Intuitively, the −1 term q ◦ w˜i ◦ q is the rotated copy of wi, and the `2 norm in (A6) measures the Euclidean distance between zi and wi after rotation (a metric for the goodness of fit). Problem (A6) therefore seeks to find the best 3D rotation that minimizes the sum of (normalized) squared Euclidean distances between zi and wi while preventing outliers from damaging the estimation via the usage of the TLS cost function that assigns a constant value to those pairs of points that cannot be aligned well (i.e., outliers). A pictorial description of the outlier-robust Wahba problem is presented in Fig. A1. Problem (A6) appears nonsmooth, but can be equivalently reformulated as

N −1 2 X 1 + θi z˜i − q ◦ w˜i ◦ q 1 − θi min 2 + (Wahba) q∈S3,θ ∈{+1,−1},i=1,...,N 2 β 2 i i=1 i by introducing N binary variables θi that expose the combinatorial nature, that is each θi acts as the selection variable for determining if the i-th pair of 3D points (zi, wi) is an inlier or an outlier. Problem (Wahba) is a fundamental problem in aerospace, robotics and computer vision, and is the essential subproblem in point cloud registration [57, 53]. To solve (Wahba) to global optimality, Yang and Carlone [54] proposed the following semidefinite T T d relaxation that was empirically shown to be always tight. Let x = [q , θ1, . . . , θN ] ∈ R , d = 4+N, be the variable of the nonlinear programming (Wahba), construct T T T T n [x]s = [q , θ1q , . . . , θN q ] ∈ R , n = 4N + 4 (A8)

11 Figure A1: An example of the outlier-robust (Wahba) problem. (a) Four pairs of 3D points (zi, wi) lying around a unit sphere, with one of the pairs being an outlier that cannot be aligned well by a 3D rotation. (b) A truncated least squares (TLS) cost function compared with a least squares cost function. The TLS cost function prevents the outlier from contaminating the estimation problem by assigning constant cost to the outlier. Adapted from [54]. as the sparse set of monomials in x of degree up to 2 (a technique that was dubbed binary cloning), T 2 and then build X = [x]s[x]s as the sparse moment matrix. Because of the binary constraint θi = 1, 2 T T it can be easily seen that: (i) the diagonal 4 × 4 blocks of X are all identical (θi qq = qq ), and T 4 (ii) the off-diagonal 4 × 4 blocks are symmetric (θiθjqq ∈ S ). Because of the unit quaternion constraint, X satisfies tr (X) = N + 1. Therefore, this leads to a semidefinite relaxation of size n = 4N + 4, m = 10N + 1 + 3N(N + 1). (A9) The largest N whose relaxation was successfully solved by interior point method was N = 100 [54]. This sparse second-order relaxation scheme has been shown as a general framework for certifiable outlier-robust machine perception [56]. Here we show the scalability of our solver by solving instances of (Wahba) up to N = 1000 and obtaining the globally optimal solution. At each N = 50, 100, 200, 500, 1000, we generate five random instances of the (Wahba) problem as follows. (i) We draw a random 3D rotation R ∈ SO(3) (a rotation matrix can be converted from and to a unit quaternion easily); (ii) we simulate N 3D unit vectors wi, i = 1,...,N uniformly on the unit sphere; (iii) we generate 2 zi = Rwi + i, i ∼ N (0, 0.01 ), i = 1,...,N (A10) by rotating wi and adding Gaussian noise; (iv) we replace 50% of the zi’s by random unit vectors on the sphere so that they do not follow the generative model (A10) and are outliers. We then use SDPT3, MOSEK, CDCS, SketchyCGAL, SDPNAL+, and STRIDE to solve the SDP relaxations. For STRIDE, we use Manopt with a trust region solver as the nlp method for solving the nonlinear programming (Wahba). Specifically, q ∈ S3 is modeled as a sphere manifold, and θ ∈ {+1, −1}N is modeled as an oblique manifold of size 1 × N (an oblique manifold of size n × m is the set of matrices of size n × m with unit-norm columns), and the problem is treated as an unconstrained problem on the product of two manifolds. To round hypotheses from a moment matrix X, we follow n X vi(q) X = λ v vT, q = , θi = sgn(qTv (θ q)), j = 1,...,N, (A11) i i i i kv (q)k j i i j i=1 i where we first perform spectral decomposition of X with eigenvalues in nonincreasing order, then i round qi by normalizing the corresponding entries of vi to have unit norm, and finally generate θj by taking the sign of the dot product between the rounded qi and the entries of vi corresponding to each θjq block (the rationale for using this rounding method is easily seen from (A8)). We generate r = 5 hypotheses by rounding 5 eigenvectors from the moment matrix. We set M = N + 1 = tr (X) to compute ηs.

References [1] Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1), 471–487 (2015)2

12 [2] Aholt, C., Agarwal, S., Thomas, R.: A qcqp approach to triangulation. In: European Conference on Computer Vision, pp. 654–667. Springer (2012) 10 [3] Alizadeh, F., Haeberly, J.P.A., Overton, M.L.: Complementarity and nondegeneracy in semidef- inite programming. Mathematical programming 77(1), 111–128 (1997)3 [4] Alizadeh, F., Haeberly, J.P.A., Overton, M.L.: Primal-dual interior-point methods for semidef- inite programming: convergence rates, stability and numerical results. SIAM Journal on Optimization 8(3), 746–768 (1998)3 [5] Antonante, P., Tzoumas, V., Yang, H., Carlone, L.: Outlier-robust estimation: Hardness, minimally-tuned algorithms, and applications. arXiv preprint arXiv:2007.15109 (2020) 11 [6] ApS, M.: The MOSEK optimization toolbox for MATLAB manual. Version 9.0. (2019). URL http://docs.mosek.com/9.0/toolbox/index.html 3,8,9 [7] Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: European conference on computer vision, pp. 404–417. Springer (2006)9 [8] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1), 183–202 (2009)5 [9] Bertsekas, D.: Nonlinear Programming. Athena Scientific (1999)4,5, 10 [10] Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research 15(42), 1455–1459 (2014). URL https://www.manopt.org 6 [11] Boumal, N., Voroninski, V., Bandeira, A.S.: The non-convex burer-monteiro approach works on smooth semidefinite programs. In: Conference on Neural Information Processing Systems (NeurIPS) (2016)3 [12] Briales, J., Gonzalez-Jimenez, J.: Convex global 3d registration with lagrangian duality. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4960– 4969 (2017)2,6 [13] Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite pro- grams via low-rank factorization. Mathematical Programming 95(2), 329–357 (2003)3 [14] Burer, S., Ye, Y.: Exact semidefinite formulations for a class of (random and non-random) nonconvex quadratic programs. Mathematical Programming pp. 1–17 (2019)2 [15] Candes, E.J., Eldar, Y.C., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM review 57(2), 225–251 (2015)2 [16] Cifuentes, D.: A convex relaxation to compute the nearest structured rank deficient matrix. arXiv preprint arXiv:1904.09661 (2019)2,3,8, 10 [17] Cifuentes, D., Agarwal, S., Parrilo, P.A., Thomas, R.R.: On the local stability of semidefinite relaxations. arXiv preprint arXiv:1710.04287 (2017)2, 10 [18] Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed-point algorithms for inverse problems in science and engineering, pp. 185–212. Springer (2011)7 [19] Dai, Y.H., Yuan, Y.: A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on optimization 10(1), 177–182 (1999)7 [20] Doherty, A.C., Parrilo, P.A., Spedalieri, F.M.: Complete family of separability criteria. Physical Review A 69(2), 022308 (2004)2 [21] Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM) 42(6), 1115–1145 (1995)2,6 [22] Henrion, D., Lasserre, J.B.: GloptiPoly: over polynomials with Matlab and SeDuMi. ACM Transactions on Mathematical Software (TOMS) 29(2), 165–194 (2003)2 [23] Inc., W.R.: Mathematica, Version 12.2. URL https://www.wolfram.com/mathematica. Champaign, IL, 20204 [24] Jiang, K., Sun, D., Toh, K.C.: An inexact accelerated proximal gradient method for large scale linearly constrained convex sdp. SIAM Journal on Optimization 22(3), 1042–1064 (2012)5

13 [25] Kaltofen, E., Yang, Z., Zhi, L.: Approximate greatest common divisors of several polynomials with linearly constrained coefficients and singular polynomials. In: Proceedings of the 2006 international symposium on Symbolic and algebraic computation, pp. 169–176 (2006) 10 [26] Lasserre, J.B.: An explicit exact SDP relaxation for nonlinear 0-1 programs. In: International Conference on Integer Programming and Combinatorial Optimization, pp. 293–303. Springer (2001)2 [27] Lasserre, J.B.: Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization 11(3), 796–817 (2001)1,2 [28] Lasserre, J.B.: Moments, positive polynomials and their applications, vol. 1. World Scientific (2009)1 [29] Laurent, M.: Semidefinite representations for finite varieties. Mathematical programming 109(1), 1–26 (2007)2 [30] Luo, Z.Q., Ma, W.K., So, A.M.C., Ye, Y., Zhang, S.: Semidefinite relaxation of quadratic optimization problems. IEEE Signal Processing Magazine 27(3), 20–34 (2010)2 [31] Malick, J., Povh, J., Rendl, F., Wiegele, A.: Regularization methods for semidefinite program- ming. SIAM Journal on Optimization 20(1), 336–356 (2009)7 [32] Markovsky, I.: Structured low-rank approximation and its applications. Automatica 44(4), 891–909 (2008)8, 10 [33] Markovsky, I., Usevich, K.: Software for weighted structured low-rank approximation. Journal of Computational and Applied Mathematics 256, 278–292 (2014)8, 10 [34] Markovsky, I., Van Huffel, S.: Overview of total least-squares methods. Signal processing 87(10), 2283–2302 (2007) 10 [35] Meneghetti, G., Danelljan, M., Felsberg, M., Nordberg, K.: Image alignment for panorama stitching in sparsely structured environments. In: Scandinavian Conference on Image Analysis, pp. 428–439. Springer (2015)9 [36] Nesterov, Y.: Lectures on convex optimization, vol. 137. Springer (2018)7 [37] Nie, J.: Polynomial optimization with real varieties. SIAM Journal On Optimization 23(3), 1634–1646 (2013)2 [38] Nie, J.: Optimality conditions and finite convergence of lasserre’s hierarchy. Mathematical programming 146(1), 97–121 (2014)2 [39] Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media (2006)6, 7 [40] Parrilo, P.A.: Semidefinite programming relaxations for semialgebraic problems. Mathematical programming 96(2), 293–320 (2003)1,2 [41] Rosen, D.M.: Scalable low-rank semidefinite programming for certifiably correct machine perception. In: Intl. Workshop on the Algorithmic Foundations of Robotics (WAFR), vol. 3 (2020)3 [42] Rosen, D.M., Carlone, L., Bandeira, A.S., Leonard, J.J.: SE-Sync: A certifiably correct algorithm for synchronization over the special euclidean group. The International Journal of Robotics Research 38(2-3), 95–125 (2019)2,3 [43] Rosen, J.B., Park, H., Glick, J.: Total least norm formulation and solution for structured problems. SIAM Journal on matrix analysis and applications 17(1), 110–126 (1996) 10 [44] Rusu, R., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: IEEE Intl. Conf. on Robotics and Automation (ICRA), pp. 3212–3217. Citeseer (2009)9 [45] Shi, J., Yang, H., Carlone, L.: Optimal pose and shape estimation for category-level 3d object perception. In: Robotics: Science and Systems (RSS) (2021)2 [46] Shi, J., Yang, H., Carlone, L.: ROBIN: a graph-theoretic approach to reject outliers in robust estimation using invariants. In: IEEE Intl. Conf. on Robotics and Automation (ICRA) (2021)9 [47] Shor, N.Z.: Dual quadratic estimates in polynomial and boolean programming. Annals of Operations Research 25(1), 163–168 (1990)2

14 [48] Sun, D., Toh, K.C., Yang, L.: A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM journal on Optimization 25(2), 882–915 (2015)7 [49] Toh, K.C., Todd, M.J., Tütüncü, R.H.: SDPT3—a MATLAB software package for semidefinite programming, version 1.3. Optimization methods and software 11(1-4), 545–581 (1999)3,8,9 [50] Wang, A.L., Kılınç-Karzan, F.: On the tightness of sdp relaxations of qcqps. Mathematical Programming pp. 1–41 (2021)2 [51] Wang, J., Magron, V., Lasserre, J.B.: Chordal-TSSOS: a moment-sos hierarchy that exploits term sparsity with chordal extension. SIAM Journal on Optimization 31(1), 114–141 (2021)2 [52] Yang, H., Antonante, P., Tzoumas, V., Carlone, L.: Graduated non-convexity for robust spa- tial perception: From non-minimal solvers to global outlier rejection. IEEE Robotics and Automation Letters (RA-L) 5(2), 1127–1134 (2020) 11 [53] Yang, H., Carlone, L.: A polynomial-time solution for robust registration with extreme outlier rates. In: Robotics: Science and Systems (RSS) (2019) 11 [54] Yang, H., Carlone, L.: A quaternion-based certifiably optimal solution to the wahba problem with outliers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1665–1674 (2019)2,3,6,8,9, 11, 12 [55] Yang, H., Carlone, L.: In perfect shape: Certifiably optimal 3d shape reconstruction from 2d landmarks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 621–630 (2020)2 [56] Yang, H., Carlone, L.: One ring to rule them all: Certifiably robust geometric perception with outliers. In: Conference on Neural Information Processing Systems (NeurIPS) (2020)2,6, 12 [57] Yang, H., Shi, J., Carlone, L.: Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics (2020)8, 11 [58] Yang, L., Sun, D., Toh, K.C.: SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Mathematical Programming Computation 7(3), 331–366 (2015)3,8,9 [59] Yurtsever, A., Tropp, J.A., Fercoq, O., Udell, M., Cevher, V.: Scalable semidefinite program- ming. SIAM Journal on Mathematics of Data Science 3(1), 171–200 (2021)3,8,9 [60] Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., Funkhouser, T.: 3dmatch: Learning the matching of local 3d geometry in range scans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 4 (2017)9 [61] Zhao, X.Y., Sun, D., Toh, K.C.: A newton-cg augmented lagrangian method for semidefinite programming. SIAM Journal on Optimization 20(4), 1737–1765 (2010)3,7 [62] Zheng, Y., Fantuzzi, G., Papachristodoulou, A., Goulart, P., Wynn, A.: Chordal decomposition in operator-splitting methods for sparse semidefinite programs. Mathematical Programming 180(1), 489–532 (2020)3,8,9

15