arXiv:1606.02395v3 [math.OC] 17 Sep 2017

Efficient quadratic penalization through the partial minimization technique

Aleksandr Y. Aravkin (Department of Applied Mathematics, University of Washington, Seattle, WA; [email protected]), Dmitriy Drusvyatskiy (Department of Mathematics, University of Washington, Seattle, WA; [email protected]), Tristan van Leeuwen (Department of Mathematics, Utrecht University, Utrecht, the Netherlands; [email protected])

Abstract: Common computational problems, such as parameter estimation in dynamic models and PDE constrained optimization, require data fitting over a set of auxiliary parameters subject to physical constraints over an underlying state. Naive quadratically penalized formulations, commonly used in practice, suffer from inherent ill-conditioning. We show that surprisingly the partial minimization technique regularizes the problem, making it well-conditioned. This viewpoint sheds new light on variable projection techniques, as well as the penalty method for PDE constrained optimization, and motivates robust extensions. In addition, we outline an inexact analysis, showing that the subproblem for partial minimization can be solved very loosely in each iteration. We illustrate the theory and algorithms on boundary control, optimal transport, and parameter estimation for robust dynamic inference.

I. INTRODUCTION

In this work, we consider a structured class of optimization problems having the form

    min_{y,u} f(y) + g(u)   subject to   A(u)y = q,    (1)

where f : R^n -> R is convex and smooth, q in R^n is a fixed vector, and A(.) is a smoothly varying invertible matrix. For now, we make no assumptions on g : R^d -> R, though in practice it is typically either a smooth or a 'simple' nonsmooth function. Optimization problems of this form often appear in PDE constrained optimization [21], [23], Kalman filtering [2], [4], [15], boundary control problems [11], [19], and optimal transport [1], [13]. Typically, y encodes the state of the system, the constraint A(u)y = q corresponds to a discretized PDE describing the physics, and u encodes auxiliary variables.

Since the discretization A(u)y = q is already inexact, it is appealing to relax it in the formulation. A seemingly naive approach is based on the quadratic penalty:

    min_{y,u} F(y,u) := f(y) + g(u) + (λ/2) ||A(u)y − q||^2,    (2)

where λ > 0 is a relaxation parameter for the equality constraints in (1), corresponding to the relaxed physics. The classical quadratic penalty method in nonlinear programming proceeds by applying an iterative optimization algorithm to the unconstrained problem (2) until some termination criterion is satisfied, then increasing λ, and repeating the procedure with the previous iterate used as a warm start. For a detailed discussion, see e.g. [17, Section 17.1].
The authors of [22] observe that this strategy helps to avoid extraneous local minima, in contrast to the original formulation (1). From this consideration alone, the formulation (2) appears to be useful.

[Fig. 1: Top panel shows log(F_k − F*) of L-BFGS applied to (2) with partial minimization, for the boundary control problem in Section III; each partial minimization solves a least squares problem in y. Bottom panel shows log(F_k − F*) of L-BFGS applied to (2) without partial minimization (jointly in (u,y)), for the same values of λ (between 0.05 and 10). Performance of L-BFGS without partial minimization degrades as λ increases, while performance of L-BFGS with partial minimization is insensitive to λ. Both methods are initialized at a random u.]

Conventional wisdom teaches us that the quadratic penalty technique is rarely appropriate. The difficulty is that one must allow λ to tend to infinity in order to force near-feasibility in the original problem (1); the residual error ||A(u)y − q|| at an optimal pair (u,y) for the problem (2) is at best on the order of O(1/λ). Consequently, the maximal eigenvalue of the Hessian of the penalty term scales linearly with λ, and the problems (2) become computationally difficult. Indeed, the maximal eigenvalue of the Hessian determines the behavior of numerical methods (gradient descent, quasi-Newton) far away from the solution, a regime in which most huge-scale problems are solved. Figure 1 is a simple numerical illustration of this inherent difficulty on a boundary control problem; see Section III for more details on the problem formulation. The bottom panel in the figure tracks progress of the objective function in (2) when an L-BFGS method is applied jointly in the variables (u,y). After 1000 iterations of L-BFGS, the objective value significantly increases with increasing λ, while the actual minimal value of the objective function converges to that of (1), and so hardly changes. In other words, performance of the method scales poorly with λ, illustrating the ill-conditioning.

In this paper, we show that by using a simple partial minimization step, this complexity blow-up can be avoided entirely. The resulting algorithm is perfectly suited for many large-scale problems, where satisfying the constraint A(u)y = q to high accuracy is not required (or even possible). The strategy is straightforward: we rewrite (2) as

    min_u ϕ̃(u) + g(u),    (3)

where the function ϕ̃(u) is defined implicitly by

    ϕ̃(u) = min_y { f(y) + (λ/2) ||A(u)y − q||^2 }.    (4)

We will call ϕ̃(.) the reduced function. Though this approach of minimizing out the variable y, sometimes called variable projection, is widely used (e.g. [7], [10], [22]), little theoretical justification for its superiority is known. In this work, we show that not only does partial minimization perform well numerically for the problem class (2), but it is also theoretically grounded: we prove that, surprisingly, the Lipschitz constant of ∇ϕ̃ is bounded by a constant independent of λ. Therefore, iterative methods can be applied directly to the formulation (3). The performance of the new method is illustrated using a toy example (top panel of Figure 1). We use L-BFGS to attack the outer problem, solving it within 35 iterations; the inner solver for the toy example simply solves the least squares problem in y.

The inner problem (4) can be solved efficiently since its condition number is nearly independent of λ. When f is a convex quadratic and A(u) is sparse, one can apply sparse direct solvers or iterative methods such as LSQR [18]. More generally, when f is an arbitrary smooth convex function, one can apply first-order methods, which converge globally linearly with the rate governed by the condition number of the strongly convex objective in (4). Quasi-Newton methods or variants of Newton's method are also available; even if g(u) is nonsmooth, BFGS methods can still be applied.

The outline of the paper is as follows. In Section II, we present complexity guarantees of the partial minimization technique. In Section III, we numerically illustrate the overall approach on boundary control and optimal transport problems, and on tuning an oscillator from very noisy measurements.
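To make the strategy (3)-(4) concrete before developing the theory, here is a minimal Python sketch (ours, not code from the paper) for a toy instance with quadratic f(y) = (1/2)||y − b||^2 and a two-parameter matrix A(u) = A0 + u_1 A1 + u_2 A2; all data and dimensions are invented for illustration. The inner problem is a single linear least squares solve, and L-BFGS is applied to the reduced function.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, lam = 20, 1e6
A0 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A1 = 0.1 * rng.standard_normal((n, n))
A2 = 0.1 * rng.standard_normal((n, n))
b, q = rng.standard_normal(n), rng.standard_normal(n)

def A(u):
    # parametric, smoothly varying matrix A(u); invertible near u = 0 for this toy choice
    return A0 + u[0] * A1 + u[1] * A2

def reduced(u):
    """Value and gradient of the reduced function (4) at u, with f(y) = 0.5*||y - b||^2."""
    Au = A(u)
    # partial minimization: y_u solves the inner least squares problem in y exactly
    y = np.linalg.solve(np.eye(n) + lam * Au.T @ Au, b + lam * Au.T @ q)
    r = Au @ y - q
    val = 0.5 * np.sum((y - b) ** 2) + 0.5 * lam * (r @ r)
    # gradient of the reduced function: lam * G(u, y_u)^T (A(u) y_u - q), with G columns A_i y
    grad = lam * np.array([(A1 @ y) @ r, (A2 @ y) @ r])
    return val, grad

res = minimize(reduced, x0=np.zeros(2), jac=True, method="L-BFGS-B")
print(res.x, res.fun)
```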
II. THEORY

In this section, we show that the proposed framework is insensitive to the parameter λ. Throughout we assume that f and A(.) are C²-smooth, f is convex, and A(u) is invertible for every u in R^n. To shorten the formulas, in this section we will use the symbol A_u instead of A(u) throughout. Setting the stage, define the function

    ϕ(u,y) = f(y) + (λ/2) ||A_u y − q||^2.

A quick computation shows

    ∇_y ϕ(u,y) = ∇f(y) + λ A_u^T (A_u y − q),
    ∇_u ϕ(u,y) = λ G(u,y)^T (A_u y − q),    (5)

where G(u,y) is the Jacobian with respect to u of the map u ↦ A_u y. Clearly the Lipschitz constant Lip(∇ϕ) scales with λ. This can be detrimental to numerical methods. For example, basic gradient descent will find a point (u_k, y_k) satisfying ||∇ϕ(u_k, y_k)||^2 < ε after at most O( Lip(∇ϕ)(ϕ(u_0, y_0) − ϕ*) / ε ) iterations [16, Section 1.2.3].

As discussed in the introduction, minimizing ϕ amounts to the minimization problem min_u ϕ̃(u) for the reduced function ϕ̃ defined in (4). Note that since f is convex and A_u is invertible, the function ϕ(u,.) admits a unique minimizer, which we denote by y_u. Appealing to the classical implicit function theorem (e.g. [20, Theorem 10.58]) we deduce that ϕ̃ is differentiable with

    ∇ϕ̃(u) = ∇_u ϕ(., y_u)|_u = λ G(u, y_u)^T (A_u y_u − q).

We aim to upper bound the Lipschitz constant of ∇ϕ̃ by a quantity independent of λ. We start by estimating the residual ||A_u y_u − q||.

Throughout the paper, we use the following simple identity. Given an invertible map F : R^n -> R^n and an invertible matrix C, for any point x in R^n and nonzero λ in R, we have
1) F^{-1}(λx) = (λ^{-1} F)^{-1}(x), and
2) (C^T ∘ F ∘ C)^{-1} = C^{-1} ∘ F^{-1} ∘ C^{-T}.
We often apply this observation to the invertible map F(x) = ∇f(x) + Bx, where B is a positive definite matrix.

Lemma 1 (Residual bound). For any point u, the inequality holds:

    ||A_u y_u − q|| ≤ ||∇(f ∘ A_u^{-1})(q)|| / λ.

Proof. Note that the minimizer y_u of ϕ(u,.) is characterized by the first order optimality conditions

    0 = ∇f(y) + λ A_u^T (A_u y − q).    (6)

Applying the implicit function theorem, we deduce that y_u depends C²-smoothly on u, with ∇_u y_u given by

    −( ∇²f(y_u) + λ A_u^T A_u )^{-1} ( λ ∇_u [ A(.)^T (A(.) y_u − q) ] )(u).

On the other hand, from the equality (6) we have

    y_u = ( ∇f + λ A_u^T A_u )^{-1} ( λ A_u^T q ) = ( ∇f/λ + A_u^T A_u )^{-1} ( A_u^T q ).

Therefore, we deduce

    A_u y_u − q = A_u ( ∇f/λ + A_u^T A_u )^{-1} A_u^T q − q
                = [ ( (1/λ) A_u^{-T} ∘ ∇f ∘ A_u^{-1} + I )^{-1} − I ] q.

Define now the operator

    F := (1/λ) A_u^{-T} ∘ ∇f ∘ A_u^{-1} + I

and the point z := F(q). Note that

    F(x) − x = (1/λ) ∇(f ∘ A_u^{-1})(x).

Letting L be a Lipschitz constant of F^{-1}, we obtain

    ||A_u y_u − q|| = ||F^{-1}(q) − F^{-1}(z)|| ≤ L ||q − z|| = L ||q − F(q)|| = (L/λ) ||A_u^{-T} ∇f(A_u^{-1} q)||.

Now the inverse function theorem yields for any point y the inequality

    ||∇F^{-1}(y)|| = ||∇F(F^{-1}(y))^{-1}|| = || ( (1/λ) ∇²(f ∘ A_u^{-1})(F^{-1}(y)) + I )^{-1} || ≤ 1,

where the last inequality follows from the fact that by convexity of f ∘ A_u^{-1} all eigenvalues of ∇²(f ∘ A_u^{-1}) are nonnegative. Thus we may set L = 1, completing the proof.

For ease of reference, we record the following direct corollary.

Corollary 1. For any point u, we have

    ||y_u − A_u^{-1} q|| ≤ ||A_u^{-1}|| ||A_u y_u − q|| ≤ ||A_u^{-1}|| ||∇(f ∘ A_u^{-1})(q)|| / λ.

Next we will compute the Hessian of ϕ(u,y), and use it to show that the norm of the Hessian of ϕ̃ is bounded by a constant independent of λ. Defining

    R(u,y,v) = ∇_u [ G(u,y)^T v ]   and   K(u,v) = ∇_u [ A_u^T v ],

we can partition the Hessian as follows:

    ∇²ϕ = [ ϕ_uu   ϕ_uy ]
          [ ϕ_yu   ϕ_yy ]

where

    ϕ_uu(u,y) = λ ( G(u,y)^T G(u,y) + R(u,y, A_u y − q) ),
    ϕ_yy(u,y) = ∇²f(y) + λ A_u^T A_u,
    ϕ_yu(u,y) = λ ( K(u, A_u y − q) + A_u^T G(u,y) ).

See [22, Section 4] for more details. Moreover, it is known that the Hessian of the reduced function ϕ̃ admits the expression [22, Equation 22]

    ∇²ϕ̃(u) = ϕ_uu(u,y_u) − ϕ_uy(u,y_u) ϕ_yy(u,y_u)^{-1} ϕ_yu(u,y_u),    (7)

which is simply the Schur complement of ϕ_yy(u,y_u) in ∇²ϕ(u,y_u). We define the operator norms

    ||∇_u G(u,y)|| := sup_{||v|| ≤ 1} || ∇_u [ G(u,y)^T v ] ||,
    ||∇_u A_u||    := sup_{||v|| ≤ 1} || ∇_u [ A_u^T v ] ||.

Using this notation, we can prove the following key bounds.

Corollary 2. For any points u and y, the inequalities hold:

    ||ϕ_yy(u,y)^{-1}|| ≤ ||A_u^{-1}||^2 / λ,
    ||R(u, y_u, A_u y_u − q)|| ≤ ||∇_u G(u,y_u)|| ||∇(f ∘ A_u^{-1})(q)|| / λ,
    ||K(u, A_u y_u − q)|| ≤ ||∇_u A_u^T|| ||∇(f ∘ A_u^{-1})(q)|| / λ.

Proof. The first bound follows from the inequality

    ||ϕ_yy(u,y)^{-1}|| = (1/λ) || A_u^{-1} ( (1/λ) A_u^{-T} ∇²f(y) A_u^{-1} + I )^{-1} A_u^{-T} || ≤ ||A_u^{-1}||^2 / λ,

since the middle operator has norm at most one by convexity of f ∘ A_u^{-1}; the remaining bounds are immediate from Lemma 1.

Next, we need the following elementary linear algebraic fact.

Lemma 2. For any positive semidefinite matrix B and a real λ > 0, we have || I − ( I + (1/λ) B )^{-1} || ≤ ||B|| / λ.

Proof. Define the matrix F = I − ( I + (1/λ) B )^{-1} and consider an arbitrary point z. Observing the inequality || ( I + (1/λ) B )^{-1} || ≤ 1 and defining the point p := ( I + (1/λ) B ) z, we obtain

    ||Fz|| = || z − ( I + (1/λ) B )^{-1} z ||
           = || ( I + (1/λ) B )^{-1} p − ( I + (1/λ) B )^{-1} z ||
           ≤ || ( I + (1/λ) B )^{-1} || ||p − z||
           ≤ (1/λ) ||Bz|| ≤ (||B||/λ) ||z||.

Since this holds for all z, the result follows.

Putting all the pieces together, we can now prove the main theorem of this section.

Theorem 1 (Norm of the reduced Hessian). The operator norm of ∇²ϕ̃(u) is bounded by a quantity C(A(.), f, q, u) independent of λ.

Proof. To simplify the proof, define G := G(u,y_u), R := R(u,y_u, A_u y_u − q), K := K(u, A_u y_u − q), and ∆ := ϕ_yy(u,y_u). When λ ≤ 1, the operator norm of ∇²ϕ̃ has a trivial bound directly from equation (7). For large λ, after rearranging (7), we can write ∇²ϕ̃ as follows:

    ∇²ϕ̃ = λ R − λ² ( K^T ∆^{-1} K + K^T ∆^{-1} A_u^T G + G^T A_u ∆^{-1} K )
           + λ G^T G − λ² G^T A_u ∆^{-1} A_u^T G.    (8)

Corollary 2 implies that the operator norm of the first row of (8) is bounded above by the quantity

    L_u ||∇_u G|| + (1/λ) L_u² ||∇_u A_u^T||^2 ||A_u^{-1}||^2 + 2 L_u ||∇_u A_u^T|| ||A_u^{-1}||^2 ||A_u^T G||,    (9)

where we set L_u := ||∇(f ∘ A_u^{-1})(q)||. Notice that the expression in (9) remains bounded as λ grows. We rewrite the second row of (8) using the explicit expression for ∆:

    λ G^T G − λ² G^T A_u ∆^{-1} A_u^T G
      = λ ( G^T G − G^T [ (1/λ) A_u^{-T} ∇²f(y_u) A_u^{-1} + I ]^{-1} G )
      = λ G^T ( I − [ (1/λ) A_u^{-T} ∇²f(y_u) A_u^{-1} + I ]^{-1} ) G.

Applying Lemma 2 with B = A_u^{-T} ∇²f(y_u) A_u^{-1}, we have

    || λ G^T G − λ² G^T A_u ∆^{-1} A_u^T G || ≤ ||G||^2 ||A_u^{-T} ∇²f(y_u) A_u^{-1}||.

Setting

    C(A(.), f, q, u) := L_u ||∇_u G|| + 2 L_u ||∇_u A_u^T|| ||A_u^{-1}||^2 ||A_u^T G||
                        + ||G||^2 ||A_u^{-1}||^2 ||∇²f(y_u)|| + (1/λ) L_u² ||∇_u A_u^T||^2 ||A_u^{-1}||^2,

we obtain the claimed bound. For λ > 1, the last term is always trivially bounded by L_u² ||∇_u A_u^T||^2 ||A_u^{-1}||^2, and the result follows.
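As a quick numerical sanity check of Theorem 1 and Lemma 1 (ours, not from the paper), the following sketch mirrors the toy setup used earlier, estimates the reduced Hessian at a fixed u by finite differences, and prints λ·||A_u y_u − q||; both quantities level off as λ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
A0 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A1, A2 = 0.1 * rng.standard_normal((n, n)), 0.1 * rng.standard_normal((n, n))
b, q = rng.standard_normal(n), rng.standard_normal(n)
u0 = np.array([0.3, -0.2])

def grad_reduced(u, lam):
    """Exact gradient of the reduced function for f(y) = 0.5||y-b||^2, plus the residual norm."""
    Au = A0 + u[0] * A1 + u[1] * A2
    y = np.linalg.solve(np.eye(n) + lam * Au.T @ Au, b + lam * Au.T @ q)  # inner minimizer y_u
    r = Au @ y - q
    return lam * np.array([(A1 @ y) @ r, (A2 @ y) @ r]), np.linalg.norm(r)

for lam in [1e0, 1e2, 1e4, 1e6, 1e8]:
    g0, res = grad_reduced(u0, lam)
    # finite-difference estimate of the 2x2 reduced Hessian at u0
    H = np.column_stack([(grad_reduced(u0 + 1e-6 * e, lam)[0] - g0) / 1e-6 for e in np.eye(2)])
    print(f"lambda={lam:.0e}  ||reduced Hessian||~{np.linalg.norm(H, 2):.3e}  lambda*residual={lam * res:.3e}")
```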

A. Inexact analysis of the projection subproblem

In practice, one can rarely evaluate ϕ̃(u) exactly. It is therefore important to understand how inexact solutions of the inner problems (4) impact iteration complexity of the outer problem. The results presented in the previous section form the foundation for such an analysis. For simplicity, we assume that g is smooth, though the results can be generalized, as we comment on shortly.

In this section, we compute the overall complexity of the partial minimization technique when the outer nonconvex minimization problem (3) is solved by an inexact gradient descent algorithm. When g is nonsmooth, a completely analogous analysis applies to the prox-gradient method. We only focus here on gradient descent, as opposed to more sophisticated methods, since the analysis is straightforward. We expect quasi-Newton methods and limited memory variants to exhibit exactly the same behavior (e.g. Figure 1). We do not perform a similar analysis here for inexact quasi-Newton methods, as the global efficiency estimates even for exact quasi-Newton methods for nonconvex problems are poorly understood.

Define the function H(u) := g(u) + ϕ̃(u). Let β > 0 be a Lipschitz constant of the gradient ∇H = ∇g + ∇ϕ̃. Fix a constant c > 0, and suppose that in each iteration k, we compute a vector v_k with ||v_k − ∇H(u_k)|| ≤ c/k. Consider then the inexact gradient descent method u_{k+1} = u_k − (1/β) v_k. Then we deduce

    H(u_{k+1}) − H(u_k) ≤ −⟨∇H(u_k), β^{-1} v_k⟩ + (β/2) ||β^{-1} v_k||^2
                        = (1/(2β)) ( ||v_k − ∇H(u_k)||^2 − ||∇H(u_k)||^2 ).    (10)

Hence we obtain the convergence guarantee:

    min_{i=1,...,k} ||∇H(u_i)||^2 ≤ (1/k) Σ_{i=1}^k ||∇H(u_i)||^2
      ≤ (1/k) [ 2β (H(u_1) − H*) + Σ_{i=1}^k ||v_i − ∇H(u_i)||^2 ]
      ≤ ( 2β (H(u_1) − H*) + c² π²/6 ) / k,

where (10) is used to go from line 1 to line 2. Now, if we compute ∇g exactly, the question is how many inner iterations are needed to guarantee ||v_k − ∇H(u_k)|| ≤ c/k. For fixed u = u_k, the inner objective is

    ϕ(u_k, y) = f(y) + (λ/2) ||A(u_k) y − q||^2.

The condition number (ratio of the Lipschitz constant of the gradient over the strong convexity constant) of ϕ(u_k, y) in y is

    κ_k := (1/λ) Lip(∇f) ||A(u_k)^{-1}||^2 + ||A(u_k)||^2 ||A(u_k)^{-1}||^2.

Notice that κ_k converges to the squared condition number of A(u_k) as λ ↑ ∞. Gradient descent on the function ϕ(u_k, .) guarantees ||y_i − y*||^2 ≤ ε after κ_k log( ||y_0 − y*||^2 / ε ) iterations. Then we have

    ||∇_u ϕ(u_k, y_i) − ∇_u ϕ(u_k, y*)|| ≤ λ ||∇_u G(u, y*)|| ||y_i − y*||.

Since we want the left hand side to be bounded by c/k, we simply need to ensure

    ||y_i − y*||^2 ≤ c² / ( k² λ² ||∇_u G(u_k, y*)||^2 ).

Therefore the total number of inner iterations is no larger than

    κ_k log( ||y_0 − y*||^2 ||∇_u G(u_k, y*)||^2 k² λ² / c² ),

which grows very slowly with k and with λ. In particular, the number of iterations to solve the inner problem scales as log(kλ) to achieve a global 1/k rate in ||∇H||^2. If instead we use a fast-gradient method [16, Section 2.2] for minimizing ϕ(u_k, .), we can replace κ_k with the much better quantity √κ_k throughout.
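The scheme analyzed above can be sketched as follows (ours, not from the paper). The inner stopping rule is a simple proxy for the requirement ||v_k − ∇H(u_k)|| ≤ c/k, and the arguments mirror the toy quantities from the earlier sketch.

```python
import numpy as np

def inexact_outer_loop(A, As, grad_g, b, q, u0, lam, beta, c=1.0, outer_iters=50):
    """Outer inexact gradient descent on H(u) = g(u) + phi_tilde(u), assuming f(y) = 0.5||y - b||^2.
    A(u) returns the matrix, As is the list of partial derivatives dA/du_i, grad_g the gradient of g."""
    u = np.asarray(u0, dtype=float).copy()
    y = np.zeros_like(b)
    for k in range(1, outer_iters + 1):
        Au = A(u)
        # inner loop: plain gradient descent in y, warm started at the previous y,
        # stopped once the inner gradient is below a tolerance shrinking like c/(k*lam)
        L_in = 1.0 + lam * np.linalg.norm(Au, 2) ** 2   # gradient Lipschitz constant in y
        tol = c / (k * lam)
        for _ in range(500):                            # cap on inner iterations
            g_in = (y - b) + lam * Au.T @ (Au @ y - q)
            if np.linalg.norm(g_in) <= tol:
                break
            y = y - g_in / L_in
        r = Au @ y - q
        v = grad_g(u) + lam * np.array([(Ai @ y) @ r for Ai in As])  # inexact gradient of H at u
        u = u - v / beta
    return u, y

# hypothetical usage with the toy quantities from the first sketch:
# u_opt, y_opt = inexact_outer_loop(A, [A1, A2], lambda u: np.zeros(2), b, q, np.zeros(2), lam=1e4, beta=10.0)
```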

III. NUMERICAL ILLUSTRATIONS

In this section, we present two representative examples of PDE-constrained optimization (boundary control and optimal transport) and a problem of robust dynamic inference. In each case, we show that practical experience supports the theoretical results from the previous section. In particular, in each numerical experiment, we study the convergence behavior of the proposed method as λ increases.

A. Boundary control

In boundary control, the goal is to steer a system towards a desired state by controlling its boundary conditions. Perhaps the simplest example of such a problem is the following. Given a source q(x), defined on a domain Ω, we seek boundary conditions u such that the solution to the Poisson problem

    ∆y = q   for x in Ω,
    y|_∂Ω = u,

is close to a desired state y_d. Discretizing the PDE yields the system

    Ay + Bu = q,

where A is a discretization of the Laplace operator on the interior of the domain and B couples the interior gridpoints to the boundary. The corresponding PDE-constrained optimization problem is given by

    min_{u,y} (1/2) ||y − y_d||_2^2   subject to   Ay + Bu = q,

whereas the penalty formulation reads

    min_{u,y} (1/2) ||y − y_d||_2^2 + (λ/2) ||Ay + Bu − q||_2^2.

Since both terms are quadratic in y, we can quickly solve for y explicitly.

1) Numerical experiments: In this example, we consider an L-shaped domain with a source q shaped like a Gaussian bell, as depicted in figure 2. Our goal is to get a constant distribution y_d = 1 in the entire domain. The solution for u = 1 is shown in figure 3 (a). To solve the optimization problem we use a steepest-descent method with a fixed step-size, determined from the Lipschitz constant of the gradient. The result of the constrained formulation is shown in figure 3 (b). We see that by adapting the boundary conditions we get a more even distribution. The convergence behavior for various values of λ is shown in figure 4 (a). We see that as λ ↑ ∞, the behaviour tends towards that of the constrained formulation, as expected. The Lipschitz constant of the gradient (evaluated at the initial point u), as a function of λ, is shown in figure 4 (b); the curve levels off as the theory predicts.

[Fig. 2: L-shaped domain with the source function q.]
[Fig. 3: Boundary values, u, and solution in the interior for the initial and optimized boundary values are depicted in (a) and (b) respectively.]
[Fig. 4: The convergence plots for various values of λ (λ = 1, 10, 100, 1000, ∞) are depicted in (a), while (b) shows the dependence of the (numerically computed) Lipschitz constant on λ.]
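A small Python sketch of this elimination (ours, not from the paper) is given below, on a one-dimensional stand-in for the discretized problem; the grid, operators, and data are invented for illustration. In this quadratic case the 2x2 reduced Hessian can be formed explicitly and supplies the fixed step size.

```python
import numpy as np

m, lam = 50, 1e4
# 1-D Laplacian as a stand-in for the interior discretization; B couples the two boundary values
h2 = (m + 1) ** 2
A = h2 * (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1))
B = np.zeros((m, 2)); B[0, 0] = h2; B[-1, 1] = h2
q = np.zeros(m)            # zero source for this toy run
y_d = np.ones(m)           # desired constant state

def reduced(u):
    # inner solve: y(u) minimizes 0.5||y - y_d||^2 + lam/2 ||A y + B u - q||^2
    y = np.linalg.solve(np.eye(m) + lam * A.T @ A, y_d + lam * A.T @ (q - B @ u))
    r = A @ y + B @ u - q
    return 0.5 * np.sum((y - y_d) ** 2) + 0.5 * lam * (r @ r), lam * B.T @ r

# fixed step from the (2x2) reduced Hessian, which stays bounded as lam grows
H_red = lam * B.T @ np.linalg.solve(np.eye(m) + lam * A @ A.T, B)
step = 1.0 / np.linalg.norm(H_red, 2)

u = np.zeros(2)
for _ in range(500):
    val, g = reduced(u)
    u = u - step * g
print("optimized boundary values:", u)
```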

B. Optimal transport

The second class of PDE-constrained problems we consider comes from optimal transport, where the goal is to determine a mapping, or flow, that optimally transforms one mass density function into another. Say we have two density functions y_0(x) and y_T(x), with x in Ω. We can formulate the problem as finding a flowfield u(t,x) = (u_1(t,x), u_2(t,x)) such that y_T(x) = y(T,x) and y_0(x) = y(0,x), where y(t,x) solves

    y_t + ∇·(y u) = 0.

Discretizing using an implicit Lax-Friedrichs scheme [12], the PDE reads

    A(u) y = q,

where q contains the initial condition and we have

    A(u) = [ I + ∆t B(u_1)                                      ]
           [ −M              I + ∆t B(u_2)                      ]
           [                   .              .                 ]
           [                                 −M   I + ∆t B(u_N) ],

with M a four-point averaging matrix and B containing the discretization of the derivative terms. Adding regularization to promote smoothness of u and y in time [12], we obtain the problem

    min_{u,y} (1/2) ||Py − y_T||_2^2 + (α/2) y^T L diag(u) u
    subject to A(u) y = q.    (11)

Here, P restricts the solution y to t = T, α is a regularization parameter, and L is a block matrix with I on the main and upper diagonal. The penalized formulation is

    min_{u,y} (1/2) ||Py − y_T||_2^2 + (α/2) y^T L diag(u) u + (λ/2) ||A(u)y − q||^2.    (12)

Again, the partial minimization in y amounts to minimizing a quadratic function.

1) Numerical experiments: For the numerical example we consider the domain Ω = [0,1]^2, discretized with ∆x = 1/16, and T = 1/32 with a stepsize of ∆t = 1/8. The initial and desired state are depicted in figure 5. The resulting state obtained at time T and the corresponding time-averaged flowfield are depicted in figure 6. The initial flow u_0 was generated by i.i.d. samples from a standard Gaussian random variable. To minimize (12), we used a steepest-descent method with constant step size, using the largest eigenvalue of the Gauss-Newton Hessian at the initial u as an estimate of the Lipschitz constant. The convergence behavior for various values of λ, as well as the corresponding estimates of the Lipschitz constant at the final solution, are shown in figure 7.

[Fig. 5: The initial (left) and desired (right) mass density are shown.]
[Fig. 6: The mass density obtained after optimization (left) and the corresponding time-averaged flow field (right) are shown.]
[Fig. 7: Convergence plots for various values of λ (λ = 1, 10, 100, 1000, 10000, ∞) are shown in (a), while (b) shows the (numerically computed) Lipschitz constant as a function of λ.]
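Schematically, the partial minimization step for (12) is a single sparse linear solve. The following sketch is ours, not code from the paper: the operators A_of_u, P, L and the data are assumed to be assembled elsewhere (e.g. by the Lax-Friedrichs discretization above), and all names are placeholders.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def inner_solve(u, A_of_u, P, L, yT, q, alpha, lam):
    """Minimize (12) over y for fixed u: the objective is quadratic in y (the regularizer is
    linear in y under the reading of (12) above), so y(u) solves one sparse SPD system."""
    A = A_of_u(u)                                   # sparse PDE matrix A(u)
    lhs = (P.T @ P + lam * (A.T @ A)).tocsc()       # normal-equations matrix
    rhs = P.T @ yT + lam * (A.T @ q) - 0.5 * alpha * (L @ sp.diags(u) @ u)
    return spla.spsolve(lhs, rhs)
```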

C. Robust dynamic inference with the penalty method

In many settings, data is naturally very noisy, and a lot of effort must be spent in pre-processing and cleaning before applying standard inversion techniques. To narrow the scope, consider dynamic inference, where we wish to infer both hidden states and unknown parameters driven by an underlying ODE. Recent efforts have focused on developing inference formulations that are robust to outliers in the data [3], [4], [9], using convex penalties such as ℓ1 and Huber [14], and non-convex penalties such as the Student's t log likelihood, in place of the least squares penalty. The goal is to develop formulations and estimators that achieve adequate performance when faced with outliers; these may arise either as gross measurement errors, or as real-world events that are not modeled by the dynamics.

Figure 8 shows the probability density functions and penalties corresponding to Gaussian, Huber, and Student's t densities. Quadratic tail growth corresponds to extreme decay of the Gaussian density for large inputs, and linear growth of the influence of any measurement on the fit. In contrast, Huber and Student's t have linear and sublinear tail growth, respectively, which ensures every observation has bounded influence.

[Fig. 8: Left: Densities, Gaussian (black dash), Huber (red solid), and Student's t (blue dot). Right: Negative log likelihoods.]

We focus on the Huber function [14], since it is both C¹-smooth and convex. In particular, the function f(y) in (1) and (4) is chosen to be a composition of the Huber with an observation model. Note that Huber is not C², so this case is not immediately captured by the theory we propose. However, Huber can be closely approximated by a C² function [8], and then the theory fully applies. For our numerical examples, we apply the algorithm developed in this paper directly to the Huber formulation.

We illustrate robust modeling using a simple representative example. Consider a 2-dimensional oscillator, governed by the following equations:

    [ y_1' ]   [ −2 u_1 u_2   −u_1² ] [ y_1 ]   [ sin(ωt) ]
    [ y_2' ] = [      1          0  ] [ y_2 ] + [    0    ],    (13)

where we can interpret u_1 as the frequency, and u_2 as the damping. Discretizing in time, we have

    [ y_1^{k+1} ]   [ 1 − 2∆t u_1 u_2   −∆t u_1² ] [ y_1^k ]        [ sin(ωt_k) ]
    [ y_2^{k+1} ] = [        ∆t              1   ] [ y_2^k ] + ∆t · [     0     ].

We now consider direct observations of the second component,

    z_k = [ 0  1 ] [ y_1^k ; y_2^k ] + w_k,    w_k ~ N(0, R_k).

We can formulate the joint inference problem on states and measurements as follows:

    min_{u,y} φ(u,y) := (1/2) ρ( R^{-1/2} (Hy − z) )   s.t.   Gy = v,    (14)

with v_k = sin(ωt_k), v_1 the initial condition for y, and ρ either the least squares or the Huber penalty, and where we use the following definitions:

    y = vec({y_k}),   z = vec({z_1, ..., z_N}),   R = diag({R_k}),   H = diag({H_k}),

    G = [  I                    ]
        [ −G_2    I             ]
        [          .      .     ]
        [               −G_N  I ].
Note in particular that there are only two unknown parameters, i.e. u in R^2, while the state y lies in R^{2N}, with N the number of modeled time points.

The reduced optimization problem for u is given by

    min_u f(u) := (1/2) ρ( R^{-1/2} ( H G(u)^{-1} v − z ) ).

To compute the derivative of the ODE-constrained problem, we can use the adjoint state method. Defining the Lagrangian

    L(y, x, u) = (1/2) ρ( R^{-1/2} (Hy − z) ) + ⟨x, Gy − v⟩,

we write down the optimality conditions ∇L = 0 and obtain

    y = G^{-1} v,
    x = −G^{-T} ( H^T R^{-1/2} ∇ρ( R^{-1/2} (H G^{-1} v − z) ) ),
    ∇_u f = ⟨ x, ∂(G(u)y)/∂u ⟩.

The inexact (penalized) problem is given by

    min_{u,y} ϕ(u,y) = (1/2) ρ( R^{-1/2} (Hy − z) ) + (λ/2) ||Gy − v||^2.    (15)

We then immediately find

    y = argmin_y { (1/2) ρ( R^{-1/2} (Hy − z) ) + (λ/2) ||Gy − v||^2 },
    ∇φ̃(u) = λ ( ∂(G(u)y)/∂u )^T ( Gy − v ).

When ρ is the least squares penalty, y is available in closed form. However, when ρ is the Huber, y requires an iterative algorithm. Rather than solving for y using a first-order method, we use IPsolve, an interior point method well suited for Huber [5]. Even though each iteration requires inversions of systems of size O(N), these systems are very sparse, and the complexity of each iteration to compute y is O(N) for any piecewise linear-quadratic function [5]. Once again, we see that the computational cost does not scale with λ.

1) Numerical Experiments: We simulate a data contamination scenario by solving the ODE (13) for the particular parameter value u = (2, 0.1). The second component of the resulting state y is observed, and the observations are contaminated. In particular, in addition to Gaussian noise with standard deviation σ = 0.1, for 10% of the measurements uniformly distributed errors in [0, 2] are added. The state y in R^{2·4000} is finely sampled over 40 periods.

For the least squares penalty and the Huber penalty with κ = 0.1, we solved both the ODE-constrained problem (14) and the penalized version (15) for λ in {10³, 10⁵, 10⁷, 10⁹}. The results are presented in Table I. The Huber formulation behaves analogously to the formulation using least squares; in particular, the outer (projected) function in u is no more difficult to minimize. And, as expected, the robust Huber penalty finds the correct values for the parameters.

TABLE I: Results for the Kalman experiment. The penalty method for both least squares and Huber achieves the same results for moderate values of λ as does the projected formulation. While the Huber results converge to nearly the true parameters u, the least squares results converge to an incorrect parameter estimate.

    λ      ρ     Iter   Opt          u
    10³    ℓ2    9      3.1 × 10⁻⁷   (.45, .98)
    10⁵    ℓ2    18     3.2 × 10⁻⁷   (.15, 4.3)
    10⁷    ℓ2    26     5 × 10⁻⁵     (.07, 11.1)
    10⁹    ℓ2    31     4 × 10⁻⁶     (.07, 11.8)
    ∞      ℓ2    29     3.3 × 10⁻⁷   (.07, 11.8)
    10³    h     9      2.4 × 10⁻⁷   (1.92, .14)
    10⁵    h     12     5 × 10⁻⁶     (1.98, .11)
    10⁷    h     10     5 × 10⁻⁶     (1.99, .11)
    10⁹    h     13     1 × 10⁻⁵     (1.99, .11)
    ∞      h     17     2 × 10⁻⁶     (1.99, .11)

A state estimate generated from u-estimates corresponding to large λ is shown in Figure 9. The huberized approach is able to ignore the outliers, and to recover both better estimates of the underlying dynamics parameters u, and the true observed and hidden components of the state y.

[Fig. 9: Top panel: first 100 samples of state y_2 (solid black), least squares recovery (red dash-dot) and Huber recovery (blue dash). Noisy measurements z appear as blue diamonds on the plot, with outliers shown at the top of the panel. Bottom panel: first 100 samples of state y_1 (solid black), least squares recovery (red dash-dot) and Huber recovery (blue dash).]
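For illustration only (the paper's implementation relies on the interior-point solver IPsolve [5]), the huberized inner problem in (15) can also be handled by a generic smooth solver once the Huber value and gradient are written out. The sketch below is ours; H, G, R_half_inv, z, v, lam, and kappa are placeholder names for the quantities defined above.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, kappa):
    """Elementwise Huber value (summed) and derivative."""
    small = np.abs(r) <= kappa
    val = np.where(small, 0.5 * r ** 2, kappa * np.abs(r) - 0.5 * kappa ** 2)
    grad = np.where(small, r, kappa * np.sign(r))
    return val.sum(), grad

def inner_objective(y, H, G, R_half_inv, z, v, lam, kappa):
    # value and gradient of 0.5*rho(R^{-1/2}(Hy - z)) + lam/2 ||Gy - v||^2
    res = R_half_inv @ (H @ y - z)
    h_val, h_grad = huber(res, kappa)
    pen = G @ y - v
    val = 0.5 * h_val + 0.5 * lam * (pen @ pen)
    grad = 0.5 * (H.T @ (R_half_inv.T @ h_grad)) + lam * (G.T @ pen)
    return val, grad

def solve_inner(y0, *args):
    return minimize(inner_objective, y0, args=args, jac=True, method="L-BFGS-B").x

# hypothetical usage: y = solve_inner(np.zeros(2 * N), H, G, R_half_inv, z, v, lam, 0.1)
```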

IV. CONCLUSIONS

In this paper, we showed that, contrary to conventional wisdom, the quadratic penalty technique can be used effectively for control and PDE constrained optimization problems, if done correctly. In particular, when combined with the partial minimization technique, we showed that the penalized projected scheme

    min_u g(u) + min_y { f(y) + (λ/2) ||A(u)y − q||^2 }

has the following advantages:
1) The Lipschitz constant of the gradient of the outer function in u is bounded as λ ↑ ∞, and hence we can effectively analyze the global convergence of first-order methods.
2) Convergence behavior of the data-regularized convex inner problem is controlled by the parametric matrix A(.), a fundamental quantity that does not depend on λ.
3) The inner problem can be solved inexactly, and in this case, the number of inner iterations (of a first-order algorithm) needs to grow only logarithmically with λ and the outer iterations counter, to preserve the natural rate of gradient descent.

As an immediate application, we extended the penalty method in [22] to convex robust formulations, using the Huber penalty composed with a linear model as the function f(y). Numerical results illustrated the overall approach, including convergence behavior of the penalized projected scheme, as well as modeling advantages of robust penalized formulations.

Acknowledgment. Research of A. Aravkin was partially supported by the Washington Research Foundation Data Science Professorship. Research of D. Drusvyatskiy was partially supported by the AFOSR YIP award FA9550-15-1-0237. The research of T. van Leeuwen was in part financially supported by the Netherlands Organisation of Scientific Research (NWO) as part of research programme 613.009.032.

REFERENCES

[1] L. Ambrosio and N. Gigli. A User's Guide to Optimal Transport. In Modelling and Optimisation of Flows on Networks, pages 1-155. Springer, 2013.
[2] B. D. Anderson and J. B. Moore. Optimal Filtering. 1979.
[3] A. Aravkin, B. Bell, J. Burke, and G. Pillonetto. An ℓ1-Laplace robust Kalman smoother. IEEE Transactions on Automatic Control, 2011.
[4] A. Aravkin, J. Burke, and G. Pillonetto. Robust and trend-following Student's t Kalman smoothers. SIAM Journal on Control and Optimization, 52(5):2891-2916, 2014.
[5] A. Aravkin, J. V. Burke, and G. Pillonetto. Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: Modeling, computation, and theory. Journal of Machine Learning Research, 14:2689-2728, 2013.
[6] A. Aravkin, M. Friedlander, F. Herrmann, and T. van Leeuwen. Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming, 134(1):101-125, 2012.
[7] A. Y. Aravkin and T. van Leeuwen. Estimating nuisance parameters in inverse problems. Inverse Problems, 28(11):115016, 2012.
[8] K. P. Bube and R. T. Langan. Hybrid ℓ1/ℓ2 minimization with applications to tomography. Geophysics, 62(4):1183-1195, 1997.
[9] S. Farahmand, G. B. Giannakis, and D. Angelosante. Doubly robust smoothing of dynamical processes via outlier sparsity constraints. IEEE Transactions on Signal Processing, 59:4529-4543, 2011.
[10] G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19(2):R1-R26, 2003.
[11] M. Gugat. Control problems with the wave equation. SIAM Journal on Control and Optimization, 48(5):3026-3051, 2009.
[12] E. Haber and L. Hanson. Model Problems in PDE-Constrained Optimization. Technical report, 2007.
[13] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225-240, 2004.
[14] P. J. Huber. Robust Statistics. John Wiley and Sons, 2004.
[15] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(D):35-45, 1960.
[16] Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course.
[17] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006.
[18] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43-71, 1982.
[19] P. Philip. Optimal Control and Partial Differential Equations. Technical report, LMU Berlin, 2009.
[20] R. Rockafellar and R. Wets. Variational Analysis, volume 317. Springer, 1998.
[21] A. Tarantola. Inverse Problem Theory. SIAM, 2005.
[22] T. van Leeuwen and F. J. Herrmann. A penalty method for PDE-constrained optimization in inverse problems. Inverse Problems, 32(1):015007, 2015.
[23] J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1-WCC26, 2009.