Level Set Methods in Convex Optimization

James V. Burke, University of Washington

Joint work with Aleksandr Aravkin (UW), Michael Friedlander (UBC/Davis), Dmitriy Drusvyatskiy (UW) and Scott Roy (UW)

Happy Birthday Andy!

Fields Institute, Toronto
Workshop on Nonlinear Optimization Algorithms and Industrial Applications, June 2016

Motivation

Optimization in Large-Scale Inference

• A range of large-scale data science applications can be modeled using optimization:
  - Inverse problems (medical and seismic imaging)
  - High-dimensional inference (compressive sensing, LASSO, quantile regression)
  - Machine learning (classification, matrix completion, robust PCA, time series)
• These applications are often solved using side information:
  - Sparsity or low rank of the solution
  - Constraints (topography, non-negativity)
  - Regularization (priors, total variation, “dirty” data)

• We need efficient large-scale solvers for nonsmooth programs.

Foundations of Nonsmooth Methods for NLP

I. I. Eremin, The penalty method in convex programming, Soviet Math. Dokl., 8 (1966), pp. 459–462.
W. I. Zangwill, Nonlinear programming via penalty functions, Management Sci., 13 (1967), pp. 344–358.
T. Pietrzykowski, An exact potential method for constrained maxima, SIAM J. Numer. Anal., 6 (1969), pp. 299–304.

A.R. Conn, Constrained optimization using a non-differentiable penalty function, SIAM Journal on Numerical Analysis, vol. 10(4), pp. 760–784, 1973.
T.F. Coleman and A.R. Conn, Second-order conditions for an exact penalty function, Mathematical Programming A, vol. 19(2), pp. 178–185, 1980.

R.H. Bartels and A.R. Conn, An Approach to Nonlinear ℓ1 Data Fitting, Proceedings of the Third Mexican Workshop on Numerical Analysis, pp. 48–58, J. P. Hennart (Ed.), Springer-Verlag, 1981.

The Prototypical Problem

Sparse Data Fitting:

Find sparse x with Ax ≈ b

There are numerous applications:
• system identification
• image segmentation
• compressed sensing
• grouped sparsity for remote sensor location
• ...


Convex approaches: ‖x‖₁ as a sparsity surrogate (Candès-Tao-Donoho ’05)

BPDN:                  min_x ‖x‖₁            s.t.  ½‖Ax − b‖₂² ≤ σ
LASSO:                 min_x ½‖Ax − b‖₂²     s.t.  ‖x‖₁ ≤ τ
Lagrangian (Penalty):  min_x ½‖Ax − b‖₂² + λ‖x‖₁

• BPDN: often most natural and transparent (physical considerations guide σ).
• Lagrangian: ubiquitous in practice (“no constraints”).

All three are (essentially) equivalent computationally!

Basis for SPGL1 (van den Berg-Friedlander ’08)
Optimal Value or Level Set Framework

Problem class: Solve

P(σ):   min_{x∈X} φ(x)   s.t.  ρ(Ax − b) ≤ σ

Strategy: Consider the “flipped” problem

Q(τ):   v(τ) := min_{x∈X} ρ(Ax − b)   s.t.  φ(x) ≤ τ

Then opt-val(P(σ)) is the minimal root of the equation

v(τ) = σ
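The sketch below (Python/NumPy, illustrative only, not the authors' implementation) carries out this strategy for the prototypical problem with ρ = ½‖·‖₂² and φ = ‖·‖₁: the LASSO subproblem Q(τ) is solved approximately by projected gradient onto the ℓ₁ ball, and the root of v(τ) = σ is located by a simple secant-type iteration (root-finding schemes are developed on the following slides). Solver choices, tolerances, and iteration counts are placeholders.

# Level-set sketch for the prototypical problem (illustrative; assumes 0 < sigma < v(0)).
import numpy as np

def project_l1_ball(x, tau):
    """Euclidean projection of x onto { z : ||z||_1 <= tau } (tau > 0)."""
    if np.abs(x).sum() <= tau:
        return x
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, u.size + 1)
    rho = np.nonzero(u * idx > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lasso_value(A, b, tau, iters=500):
    """Approximate v(tau) = min { (1/2)||Ax - b||^2 : ||x||_1 <= tau } by projected gradient."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = project_l1_ball(x - A.T @ (A @ x - b) / L, tau)
    return 0.5 * np.linalg.norm(A @ x - b) ** 2, x

def level_set_bpdn(A, b, sigma, tol=1e-6, max_iter=30):
    """Secant iteration on f(tau) = v(tau) - sigma; returns x and tau ~ opt-val(P(sigma))."""
    tau0, f0 = 0.0, 0.5 * np.linalg.norm(b) ** 2 - sigma     # v(0) = (1/2)||b||^2
    tau1 = 1.0
    f1, x = lasso_value(A, b, tau1)
    f1 -= sigma
    for _ in range(max_iter):
        if abs(f1) <= tol or f1 == f0:
            break
        tau0, tau1 = tau1, tau1 - f1 * (tau1 - tau0) / (f1 - f0)
        f0 = f1
        f1, x = lasso_value(A, b, tau1)      # warm-starting the subproblem would help here
        f1 -= sigma
    return x, tau1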

Queen Dido’s Problem

The intuition behind the proposed framework has a distinguished history, appearing even in antiquity. Perhaps the earliest instance is Queen Dido’s problem and the fabled origins of Carthage.

In short, the problem is to find the maximum area that can be enclosed by an arc of fixed length and a given line. The converse problem is to find an arc of least length that traps a fixed area between a line and the arc. Although these two problems reverse the objective and the constraint, the solution in each case is a semi-circle.

Other historical examples abound (e.g. the isoperimetric inequality). More recently, these observations provide the basis for the Markowitz Mean-Variance Portfolio Theory.

The Role of Convexity

Convex Sets
Let C ⊂ R^n. We say that C is convex if (1 − λ)x + λy ∈ C whenever x, y ∈ C and 0 ≤ λ ≤ 1.

Convex Functions
Let f : R^n → R̄ := R ∪ {+∞}. We say that f is convex if the set epi(f) := { (x, µ) : f(x) ≤ µ } is a convex set.

[Figure: the chord joining (x₁, f(x₁)) and (x₂, f(x₂)) lies above the graph of f]

f((1 − λ)x₁ + λx₂) ≤ (1 − λ)f(x₁) + λf(x₂)

Convex Functions

Convex indicator functions
Let C ⊂ R^n. Then the function δ_C(x) := 0 if x ∈ C, +∞ if x ∉ C, is a convex function.

Addition
Non-negative linear combinations of convex functions are convex: for f_i convex and λ_i ≥ 0, i = 1, ..., k, the function f(x) := Σ_{i=1}^k λ_i f_i(x) is convex.

Infimal Projection
If f : R^n × R^m → R̄ is convex, then so is v(x) := inf_y f(x, y), since epi(v) = { (x, µ) : ∃ y ∈ R^m s.t. f(x, y) ≤ µ }.

Convexity of v

When X , ρ, and φ are convex, the optimal value function v is a non-increasing convex function by infimal projection:

v(τ) := min_{x∈X} ρ(Ax − b)   s.t.  φ(x) ≤ τ

      = min_x  ρ(Ax − b) + δ_{epi(φ)}(x, τ) + δ_X(x)

Newton and Secant Methods

For f convex and non-increasing, solve f (τ) = 0.

[Figure: graph of a convex, non-increasing function f(τ) crossing zero]

The problem is that f is often not differentiable. Use the convex subdifferential ∂f(x) := { z : f(y) ≥ f(x) + zᵀ(y − x)  ∀ y ∈ R^n }.

Superlinear Convergence

Let τ∗ := inf{τ : f (τ) ≤ 0} and assume

g∗ := inf { g : g ∈ ∂f (τ∗) } < 0 (non-degeneracy)

Initialization: τ_{−1} < τ_0 < τ∗.

(Newton)  τ_{k+1} := τ_k if f(τ_k) = 0;  otherwise  τ_{k+1} := τ_k − f(τ_k)/g_k,  for g_k ∈ ∂f(τ_k).

(Secant)  τ_{k+1} := τ_k if f(τ_k) = 0;  otherwise  τ_{k+1} := τ_k − f(τ_k) (τ_k − τ_{k−1}) / (f(τ_k) − f(τ_{k−1})).

If either sequence terminates finitely at some τk, then τk = τ∗; otherwise,

|τ∗ − τ_{k+1}| ≤ (1 − g∗/γ_k) |τ∗ − τ_k|,   k = 1, 2, ...,

where γ_k = g_k (Newton) and γ_k ∈ ∂f(τ_{k−1}) (secant). In either case, γ_k ↑ g∗ and τ_k ↑ τ∗ globally q-superlinearly.
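For concreteness, the snippet below (illustrative only) applies the two updates to the convex, non-increasing test function f(τ) = e^(−τ) − σ, whose root is τ∗ = −ln σ; the function and the starting points τ_{−1} < τ_0 < τ∗ are chosen purely for illustration, and the printed errors decrease monotonically from the left, as in the theorem.

# Newton and secant updates on f(tau) = exp(-tau) - sigma (root: tau* = -log(sigma)).
import math

sigma = 0.1
f = lambda t: math.exp(-t) - sigma
fprime = lambda t: -math.exp(-t)        # f is smooth here, so g_k = f'(tau_k)
tau_star = -math.log(sigma)

tau = 0.0                               # Newton: tau_{k+1} = tau_k - f(tau_k)/g_k
for k in range(8):
    if f(tau) == 0.0:
        break
    tau -= f(tau) / fprime(tau)
    print(f"Newton k={k+1}: tau={tau:.12f}, error={tau_star - tau:.2e}")

t_prev, t = -1.0, 0.0                   # Secant: tau_{k+1} = tau_k - f(tau_k)(tau_k - tau_{k-1})/(f(tau_k) - f(tau_{k-1}))
for k in range(10):
    if f(t) == 0.0 or f(t) == f(t_prev):
        break
    t_prev, t = t, t - f(t) * (t - t_prev) / (f(t) - f(t_prev))
    print(f"Secant k={k+1}: tau={t:.12f}, error={tau_star - t:.2e}")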

Inexact Root Finding

• Problem: Find root of the inexactly known convex function

v(·) − σ.

• Bisection is one approach:
  • nonmonotone iterates (bad for warm starts)
  • at best linear convergence (with perfect information)

• Solution:
  • modified secant
  • approximate Newton methods (a sketch follows below)
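One plausible way to organize an approximate Newton step (a sketch under assumptions, not the authors' algorithm): suppose an oracle returns, at each τ, an upper bound u ≥ v(τ) together with a lower bound l ≤ v(τ) and a slope g < 0 of an affine minorant of v at τ (for instance obtained from duality, as on the minorants-from-duality slide below). The step moves to where the minorant crosses the level σ, and the iteration stops once the upper bound certifies near-feasibility.

# Hedged sketch of approximate Newton on v(tau) = sigma.  oracle(tau) is assumed
# to return (u, l, g) with l <= v(tau) <= u and g < 0 the slope of an affine
# minorant of v at tau (e.g., obtained from duality).  Illustrative only.
def approximate_newton(oracle, sigma, tau0=0.0, tol=1e-6, max_iter=100):
    tau = tau0
    for _ in range(max_iter):
        u, l, g = oracle(tau)
        if u - sigma <= tol:            # upper bound certifies (near-)feasibility for P(sigma)
            return tau
        # Newton-type step: move to where the minorant l + g*(t - tau) equals sigma.
        # Since the minorant under-estimates v, the step stays in the region where v >= sigma.
        # (In practice the oracle accuracy is tightened whenever l <= sigma; omitted here.)
        tau += (sigma - l) / g
    return tau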

Inexact Root Finding: Secant

[Figure: iterations of the inexact secant method]

Inexact Root Finding: Newton

[Figure: iterations of the inexact Newton method]

Inexact Root Finding: Convergence

[Figure: convergence estimate for the inexact methods]

Key observation: C = C(τ₀) is independent of ∂v(τ∗). Nondegeneracy is not required.

Minorants from Duality

[Figure: affine minorants of v(τ) obtained from duality]

Robustness: 1 ≤ u/l ≤ α, where α ∈ [1, 2) and ε = 10⁻²

[Figure: six panels of iterates; (a) k = 13, α = 1.3; (b) k = 770, α = 1.99; (c) k = 18, α = 1.3; (d) k = 9, α = 1.3; (e) k = 15, α = 1.99; (f) k = 10, α = 1.3]

Figure: Inexact secant (top row) and Newton (bottom row) for f₁(τ) = (τ − 1)² − 10 (first two columns) and f₂(τ) = τ² (last column). Below each panel, α is the oracle accuracy and k is the number of iterations needed to converge, i.e., to reach fᵢ(τ_k) ≤ ε = 10⁻².

When is the Level Set Framework Viable

Problem class: Solve

P(σ):   min_{x∈X} φ(x)   s.t.  ρ(Ax − b) ≤ σ

Strategy: Consider the “flipped” problem

Q(τ):   v(τ) := min_{x∈X} ρ(Ax − b)   s.t.  φ(x) ≤ τ

Then opt-val(P(σ)) is the minimal root of the equation v(τ) = σ.

Lower bounding slopes: ∂_τ Φ(y, τ)

Conjugate Functions and Duality

Convex Indicator
For any convex set C, the convex indicator function for C is δ(x | C) := 0 if x ∈ C, +∞ if x ∉ C.

Support Functionals
For any set C, the support functional for C is δ∗(x | C) := sup_{z∈C} ⟨x, z⟩.

Gauges
For any convex set C, the convex gauge function for C is γ(x | C) := inf { t ≥ 0 | x ∈ tC }, with polar gauge γ°(z | C) := sup { ⟨z, x⟩ | γ(x | C) ≤ 1 }.

Fact: If 0 ∈ C, then γ(x | C) = δ∗(x | C°), where C° := { z | ⟨z, x⟩ ≤ 1 ∀ x ∈ C }.
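For example, take C to be the closed unit ℓ₁ ball: then γ(x | C) = ‖x‖₁, the polar set is C° = { z : ‖z‖∞ ≤ 1 }, and δ∗(x | C°) = sup_{‖z‖∞ ≤ 1} ⟨x, z⟩ = ‖x‖₁, in agreement with the Fact.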

Gauge Optimization

• gauge optimization:  P_σ: min_x φ(x) s.t. ρ(Ax − b) ≤ σ;   Q_τ: min_x ρ(Ax − b) s.t. φ(x) ≤ τ;   ∂_τ Φ(y, τ) slope: −φ°(Aᵀy)
• BPDN:  P_σ: min_x ‖x‖₁ s.t. ‖Ax − b‖₂ ≤ σ;   Q_τ: min_x ‖Ax − b‖₂ s.t. ‖x‖₁ ≤ τ;   slope: −‖Aᵀy‖∞
• sharp elastic net:  P_σ: min_x α‖x‖₁ + β‖x‖₂ s.t. ‖Ax − b‖₂ ≤ σ;   Q_τ: min_x ‖Ax − b‖₂ s.t. α‖x‖₁ + β‖x‖₂ ≤ τ;   slope: −γ(Aᵀy | αB∞ + βB₂)
• matrix completion:  P_σ: min_X ‖X‖_* s.t. ‖AX − b‖₂ ≤ σ;   Q_τ: min_X ‖AX − b‖₂ s.t. ‖X‖_* ≤ τ;   slope: −σ₁(A*y)

Nonsmooth regularized data-fitting.
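As a small illustration of the BPDN row (a sketch under assumptions, not the authors' code): for ρ = ‖·‖₂ one natural dual feasible point is the normalized optimal residual of Q(τ), and the corresponding lower-bounding slope is −‖Aᵀy‖∞, mirroring the derivative formula used in SPGL1. The helper below assumes x_tau approximately solves Q(τ) and that the residual is nonzero.

# Lower-bounding slope for the BPDN row (illustrative; assumes b - A @ x_tau != 0).
import numpy as np

def bpdn_minorant_slope(A, b, x_tau):
    r = b - A @ x_tau                       # residual at an (approximate) solution of Q(tau)
    y = r / np.linalg.norm(r)               # dual feasible point for rho = ||.||_2
    return -np.linalg.norm(A.T @ y, ord=np.inf)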

Piecewise Linear-Quadratic Penalties

φ(x) := sup_{u∈U} [ ⟨x, u⟩ − ½ uᵀBu ],

where U ⊂ R^n is nonempty, closed, and convex with 0 ∈ U (not necessarily polyhedral), and B ∈ R^{n×n} is symmetric positive semi-definite.

Examples:
1. Support functionals: B = 0

2. Gauge functionals: γ(· | U°) = δ∗(· | U)

3. Norms: ‖·‖ = γ(· | 𝔹), with 𝔹 the closed unit ball

4. Least-squares: U = R^n, B = I

5. Huber: U = [−K, K]^n, B = I (checked componentwise in the sketch below)
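A quick numerical check (illustrative) that the PLQ representation with U = [−K, K]^n and B = I reproduces the familiar Huber penalty coordinatewise: sup_{|u|≤K} (xu − ½u²) equals ½x² when |x| ≤ K and K|x| − ½K² otherwise.

# Brute-force the PLQ sup over U = [-K, K]^n with B = I and compare with the Huber formula.
import numpy as np

def plq_sup(x, K, grid=10001):
    u = np.linspace(-K, K, grid)
    return np.max(np.outer(x, u) - 0.5 * u ** 2, axis=1).sum()

def huber(x, K):
    return np.sum(np.where(np.abs(x) <= K, 0.5 * x ** 2, K * np.abs(x) - 0.5 * K ** 2))

x = np.array([-3.0, -0.2, 0.0, 0.5, 2.0])
print(plq_sup(x, K=1.0), huber(x, K=1.0))    # agree up to grid resolution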

PLQ Densities: Gauss, Laplace, Huber, Vapnik

[Figure: the four penalties V(x) plotted against x]

Gauss:        V(x) = ½x²
ℓ₁ (Laplace): V(x) = |x|
Huber:        V(x) = −Kx − ½K² for x < −K;   ½x² for −K ≤ x ≤ K;   Kx − ½K² for K < x
Vapnik:       V(x) = −x − ε for x < −ε;   0 for −ε ≤ x ≤ ε;   x − ε for ε ≤ x

∂_τ Φ(y, τ) for PLQ Penalties φ

φ(x) := sup_{u∈U} [ ⟨x, u⟩ − ½ uᵀBu ]

P_σ:   min φ(x)   s.t.  ρ(b − Ax) ≤ σ

Q_τ:   min ρ(b − Ax)   s.t.  φ(x) ≤ τ

−max{ γ(Aᵀy | U),  √(yᵀABAᵀy) / √(2τ) }  ∈  ∂_τ Φ(y, τ)
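For the Huber-type penalty above (U = [−K, K]^n, B = I) the two terms in the bracket specialize to γ(Aᵀy | U) = ‖Aᵀy‖∞ / K and √(yᵀABAᵀy) = ‖Aᵀy‖₂, giving the small helper below (illustrative; y is assumed to be a dual feasible point for Q_τ).

# Lower-bounding slope for the Huber-type PLQ penalty (illustrative).
import numpy as np

def plq_lower_slope(A, y, tau, K):
    z = A.T @ y
    return -max(np.linalg.norm(z, ord=np.inf) / K,
                np.linalg.norm(z) / np.sqrt(2.0 * tau))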

Sparse and Robust Formulation

Signal Recovery

HBP_σ:   min ‖x‖₁   s.t.  ρ(b − Ax) ≤ σ

Problem Specification
• x: 20-sparse spike train in R^512
• b: measurements in R^120
• A: measurement matrix satisfying RIP
• ρ: Huber function
• σ: error level set at 0.01
• 5 outliers

Results
In the presence of outliers, the robust formulation recovers the spike train, while the standard formulation does not.

[Figure: recovered signals (truth, LS, Huber) and residuals]

Sparse and Robust Formulation

Signal Recovery

HBP_σ:   min_{0 ≤ x} ‖x‖₁   s.t.  ρ(b − Ax) ≤ σ

Problem Specification
• x: 20-sparse spike train in R^512_+
• b: measurements in R^120
• A: measurement matrix satisfying RIP
• ρ: Huber function
• σ: error level set at 0.01
• 5 outliers

Results
In the presence of outliers, the robust formulation recovers the spike train, while the standard formulation does not.

[Figure: recovered signals (truth, LS, Huber) and residuals]

Sensor Network Localization (SNL)

[Figure: realization of a sensor network in the plane]

Given a weighted graph G = (V, E, d), find a realization: p₁, ..., p_n ∈ R² with d_ij = ‖p_i − p_j‖² for all ij ∈ E.

Sensor Network Localization (SNL)

SDP relaxation (Weinberger et al. ’04, Biswas et al. ’06):

    max tr(X)   s.t.  ‖P_E K(X) − d‖₂² ≤ σ,  Xe = 0,  X ⪰ 0,

where [K(X)]_ij = X_ii + X_jj − 2X_ij.

Intuition: X = PPᵀ, and then tr(X) = (1/(n+1)) Σ_{i,j=1}^n ‖p_i − p_j‖², with p_i the i-th row of P.

Flipped problem:

    min ‖P_E K(X) − d‖₂²   s.t.  tr X = τ,  Xe = 0,  X ⪰ 0.

• Perfectly adapted for the Frank-Wolfe method (a generic sketch follows below).

Key point: Slater failing (always the case) is irrelevant.
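A generic Frank-Wolfe sketch over the trace-constrained PSD set (illustrative; the SNL objective ‖P_E K(X) − d‖₂², the centering constraint Xe = 0, and any stopping rule are omitted, and grad is a user-supplied gradient oracle): the linear minimization oracle over { X ⪰ 0, tr X = τ } is τ·vvᵀ with v a unit eigenvector for the smallest eigenvalue of the gradient, which is what makes the flipped problem so convenient.

# Generic Frank-Wolfe iteration over { X PSD, tr(X) = tau }; illustrative only.
import numpy as np

def frank_wolfe_psd(grad, n, tau, iters=200):
    X = np.zeros((n, n))
    for k in range(iters):
        G = grad(X)
        w, V = np.linalg.eigh(G)            # eigenvalues in ascending order
        v = V[:, 0]                         # eigenvector of the smallest eigenvalue
        S = tau * np.outer(v, v)            # solves min <G, S> over the spectrahedron
        gamma = 2.0 / (k + 2.0)             # standard step-size schedule
        X = (1 - gamma) * X + gamma * S
    return X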

Approximate Newton

[Figure: Pareto curve (optimal value of Q(τ) versus τ) with approximate Newton iterates, shown for σ = 0.25 and for σ = 0]

Max-trace

[Figure: max-trace relaxation]

Observations

• Simple strategy for optimizing over complex domains
• Rigorous convergence guarantees
• Insensitivity to ill-conditioning
• Many applications:
  • Sensor Network Localization (Drusvyatskiy-Krislock-Voronin-Wolkowicz ’15)
  • Sparse/Robust Estimation and Kalman Smoothing (Aravkin-B-Pillonetto ’13)
  • Large-scale SDP and LP (cf. Renegar ’14)
  • Chromosome reconstruction (Aravkin-Becker-Drusvyatskiy-Lozano ’15)
  • Phase retrieval (Aravkin-B-Drusvyatskiy-Friedlander-Roy ’16)
  • Generalized linear models (Aravkin-B-Drusvyatskiy-Friedlander-Roy ’16)
  • ...

Andy and Barbara

Thank you!

References

• “Probing the Pareto frontier for basis pursuit solutions”, van den Berg and Friedlander, SIAM J. Sci. Comput. 31 (2008), 890–912.

• “Sparse optimization with least-squares constraints”, van den Berg and Friedlander, SIOPT 21 (2011), 1201–1229.

• “Variational properties of value functions”, Aravkin, B., and Friedlander, SIOPT 23 (2013), 1689–1717.

• “Level-set methods for convex optimization”, Aravkin, B., Drusvyatskiy, Friedlander, and Roy, preprint, 2016.

General Level Set Theorem

Let ψᵢ : X ⊆ R^n → R, i = 1, 2, be arbitrary functions and X an arbitrary set, with epi(ψ) := { (x, µ) : ψ(x) ≤ µ }.

P_{1,2}(σ):   v₁(σ) := inf_{x∈X} ψ₁(x) + δ((x, σ) | epi(ψ₂))

P_{2,1}(τ):   v₂(τ) := inf_{x∈X} ψ₂(x) + δ((x, τ) | epi(ψ₁))

S_{1,2} := { σ ∈ R : ∅ ≠ argmin P_{1,2}(σ) ⊂ { x ∈ X : ψ₂(x) = σ } }

Then, for every σ ∈ S_{1,2},
(a) v₂(v₁(σ)) = σ, and
(b) argmin P_{1,2}(σ) = argmin P_{2,1}(v₁(σ)) ⊂ { x ∈ X : ψ₁(x) = v₁(σ) }.

Moreover, S_{2,1} = { v₁(σ) : σ ∈ S_{1,2} } and { (σ, v₁(σ)) : σ ∈ S_{1,2} } = { (v₂(τ), τ) : τ ∈ S_{2,1} }.
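A tiny numerical illustration of conclusion (a), with ψ₁(x) = |x| and ψ₂(x) = (x − 3)² on X = R (brute force over a grid; illustrative only): here v₁(σ) = 3 − √σ and v₂(τ) = (3 − τ)² for small σ and τ, so v₂(v₁(σ)) = σ.

# Check v2(v1(sigma)) = sigma for psi1(x) = |x|, psi2(x) = (x - 3)^2 on a grid.
import numpy as np

xs = np.linspace(-10.0, 10.0, 200001)
psi1, psi2 = np.abs(xs), (xs - 3.0) ** 2

def v1(sigma):                      # min psi1(x)  s.t. psi2(x) <= sigma
    return psi1[psi2 <= sigma].min()

def v2(tau):                        # min psi2(x)  s.t. psi1(x) <= tau
    return psi2[psi1 <= tau].min()

for sigma in [0.5, 1.0, 4.0]:
    print(sigma, v2(v1(sigma)))     # recovers sigma up to grid resolution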
