arXiv:1806.05218v2 [math.OC] 10 Sep 2019

LINE SEARCH AND TRUST-REGION METHODS FOR CONVEX-COMPOSITE OPTIMIZATION

J. V. BURKE AND A. ENGLE

University of Washington, Department of Mathematics, Seattle, WA, [email protected], [email protected]

Abstract. We consider descent methods for solving non-finite valued nonsmooth convex-composite optimization problems that employ Gauss-Newton subproblems to determine the iteration update. Specifically, we establish the global convergence properties for descent methods that use a backtracking line search, a weak Wolfe line search, or a trust-region update. All of these approaches are designed to exploit the structure associated with convex-composite problems.

1. Introduction

We consider three descent methods for solving the convex-composite optimization problem

(P)   minimize_{x ∈ R^n}  f(x) := h(c(x)) + g(x),

where g : R^n → R ∪ {+∞} is closed, proper, and convex, not necessarily finite valued [5], h : R^m → R is convex, and c : R^n → R^m is C^1-smooth. Algorithms for the problem P, in particular when both h and g are assumed to be piecewise linear-quadratic, have recently received renewed interest due to numerous modern applications in machine learning and nonlinear dynamics [8–11]. Our focus is on methods that employ search directions or steps d^k at each iteration that are approximate solutions to the Gauss-Newton subproblems

(P_k)   minimize_{d ∈ R^n}  ∆f(x^k; d)   subject to   ‖d‖ ≤ η_k,

where {x^k} ⊂ dom g are the iterates generated by the algorithm, {η_k} ⊂ (0, ∞], and ∆f(x; d) is an approximation to the directional derivative f′(x; d) (see Lemma 3.1). By descent we mean that ∆f(x^k; d^k) < 0 at each iteration. Two of the approaches are line search methods based on backtracking and on a weak Wolfe line search, and the third approach is based on a trust-region method. Although the weak Wolfe line search discussed here is motivated by the method for differentiable functions, as in [13], it is modified in a way that allows its application to nondifferentiable functions. In a companion work, we establish conditions for the local super-linear and quadratic convergence of nonlinear Gauss-Newton methods based on the subproblems P_k.
Previously, the backtracking line search was studied in the finite-valued case and in the absence of the function g [2]. In recent work, Lewis and Wright [14] utilized a similar backtracking line search in the context of infinite-valued prox-regular composite optimization. Lewis and Overton [13] developed a weak Wolfe algorithm using directional derivatives for finite-valued nonsmooth functions f that are absolutely continuous along the line segment of interest, with finite termination, in particular, when the function f is semi-algebraic. The method of Lewis and Overton can be applied in the finite-valued convex-composite case where g = 0. Here, we develop a weak Wolfe algorithm for infinite-valued problems that uses the approximation ∆f(x; d) to the directional derivative, which exploits the structure associated with convex-composite problems.
The function g in P is typically nonsmooth and is used to induce structure in the solution x. For example, it can be used to introduce sparsity or group sparsity in the solution x as well as bound constraints on x. Drusvyatskiy and Lewis [9] have established local and global convergence of proximal-based methods for solving P, and Drusvyatskiy and Paquette [10] have established iteration complexity results for proximal methods that locate first-order stationary points for P.
While the assumptions we use are similar to those in [9, 10], our algorithmic approach differs significantly. In particular, we use either a backtracking or an adaptive weak Wolfe line search, or a trust-region strategy to induce objective function descent at each iteration. In addition, we do not exclusively use proximal methods to generate search directions or employ the backtracking line search to estimate Lipschitz constants as in [10,14]. Moreover, all of the methods discussed here make explicit use of the structure in P, thereby differing from the method developed in [13].
2. Notation
This section records notation and tools from convex and variational analysis used throughout the paper. Unless otherwise stated, we follow the notation of [5, 16, 17, 21]. For any two points x, x′ ∈ R^n, denote the line segment connecting x and x′ by

[x, x′] := { (1 − λ)x + λx′ | 0 ≤ λ ≤ 1 }.

For a nonempty closed convex set C ⊂ R^m, let aff C denote its affine hull. Then the relative interior of C is

ri(C) := { x ∈ aff C | ∃ (ε > 0) (x + εB) ∩ aff C ⊂ C }.

The functions in this paper take values in the extended reals R̄ := R ∪ {±∞}. For f : R^n → R̄, the domain of f is dom f := { x ∈ R^n | f(x) < ∞ }, and the epigraph of f is epi f := { (x, α) ∈ R^n × R | f(x) ≤ α }.
A function f is closed if the level sets lev_f(α) := { x ∈ R^n | f(x) ≤ α } are closed for all α ∈ R, proper if dom f ≠ ∅ and f(x) > −∞ for all x ∈ R^n, and convex if epi f is a convex subset of R^{n+1}. For a set X ⊂ dom f and x ∈ X, the function f is strictly continuous at x relative to X if

lim sup_{x′, x″ →_X x, x′ ≠ x″}  |f(x′) − f(x″)| / ‖x′ − x″‖  <  ∞,

where x′, x″ →_X x means that x′, x″ ∈ X and x′, x″ → x, i.e., convergence within X. This finiteness property is equivalent to f being locally Lipschitz at x relative to X (see [17, Section 9.A]). By [16, Theorem 10.4], proper and convex functions g : R^n → R̄ are strictly continuous relative to ri(dom g). To each nonempty closed convex set C, we associate the closed, proper, and convex indicator function defined by

δ(x | C) := 0 if x ∈ C, and +∞ if x ∉ C.

Suppose f : R^n → R̄ is finite at x and w ∈ R^n. The subderivative d f(x) : R^n → R̄ and the one-sided directional derivative f′(x; ·) at x are given by

d f(x)(w) := lim inf_{t ↘ 0, w′ → w} [f(x + t w′) − f(x)] / t,   f′(x; w) := lim_{t ↘ 0} [f(x + t w) − f(x)] / t.

The structure of P allows the classical one-sided directional derivative f′(x; ·) to capture the variational properties of its more general counterpart, as discussed in the next section. Results in the following section also require the notion of set convergence from variational analysis, as in [17, Section 4.A]. For a sequence of sets {C_n}_{n∈N}, with C_n ⊂ R^m, the outer and inner limits are defined, respectively, as
lim sup_{n→∞} C_n := { x | ∃ (infinite K ⊂ N, x^k →_K x) ∀ (k ∈ K) x^k ∈ C_k },
lim inf_{n→∞} C_n := { x | ∃ (n_0 ∈ N, x^n → x) ∀ (n ≥ n_0) x^n ∈ C_n }.

The sets C_n converge to a set C if the two limits agree and equal C: lim sup_{n→∞} C_n = lim inf_{n→∞} C_n = C. With this notion of convergence in mind, we apply it to the epigraphs of a sequence of functions and say that f^k : R^m → R̄ epigraphically converges to f : R^m → R̄, written f^k →e f, if and only if epi f^k → epi f (see [17, Section 7.B]).
3. Properties of Convex-Composite Objectives
The general convex-composite optimization problem [3] is of the form

(1)   minimize_{x ∈ R^n}  f(x) := ψ(Φ(x)),

where ψ : R^m → R̄ is closed, proper, and convex, and Φ : R^n → R^m is sufficiently smooth. Allowing infinite-valued convex functions ψ into the composition introduces theoretical difficulties as discussed in [5, 6, 12, 14, 17]. In this work, we assume f takes the form given
in P by setting ψ(y, x) := h(y) + g(x) and Φ(x) := (c(x), x). In this case, the calculus simplifies dramatically. As in [3], we have dom f = dom g and

(2)   f(x + d) = h(c(x) + ∇c(x)d) + g(x + d) + o(‖d‖).

Consequently, at any x ∈ dom g and d ∈ R^n, f is directionally differentiable, with

d f(x)(d) = f′(x; d) = h′(c(x); ∇c(x)d) + g′(x; d).

This motivates defining the subdifferential of f at any x ∈ dom g by setting

(3)   ∂f(x) := ∇c(x)⊤ ∂h(c(x)) + ∂g(x).
Within the context of variational analysis [17], f is subdifferentially regular on its domain, and the subdifferential of f defined above agrees with the regular and limiting subdifferentials of variational analysis. In particular, f′(x; d) = sup_{v ∈ ∂f(x)} ⟨v, d⟩. Following [2], we define an approximation to the directional derivative that is key to our algorithmic development.

Definition 3.1. Let f be as in P and x ∈ dom g. Define ∆f(x; ·) : R^n → R̄ by

(4)   ∆f(x; d) := h(c(x) + ∇c(x)d) + g(x + d) − h(c(x)) − g(x).
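As an illustration, the quantity ∆f(x; d) in (4) is directly computable from h, c, ∇c, and g. The sketch below evaluates it on a small one-dimensional instance of our own choosing (it is not an example from the paper) and checks numerically that it upper-bounds a difference-quotient approximation of the directional derivative, consistent with Lemma 3.1(c).

```python
# A small 1-D convex-composite instance f(x) = h(c(x)) + g(x); this specific
# instance is our own illustrative choice, not taken from the paper:
#   c(x) = x^2 - 1  (smooth),  h(y) = |y|  (convex),  g(x) = 0.1|x|  (convex).
c  = lambda x: x * x - 1.0
dc = lambda x: 2.0 * x               # derivative of c (the "gradient" for n = 1)
h  = abs
g  = lambda x: 0.1 * abs(x)
f  = lambda x: h(c(x)) + g(x)

def delta_f(x, d):
    """The approximation (4): h(c(x) + grad c(x) d) + g(x + d) - h(c(x)) - g(x)."""
    return h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

# Lemma 3.1(c): f'(x; d) <= Delta f(x; d).  Approximate f'(x; d) by a small
# forward difference quotient and compare.
x, d = 0.5, 1.0
t = 1e-8
fprime_approx = (f(x + t * d) - f(x)) / t
print(fprime_approx <= delta_f(x, d) + 1e-6)   # True
```

Here ∆f(0.5; 1) = −0.4, while the directional derivative is ≈ −0.9, so the model slope is indeed an overestimate of the true slope, as Lemma 3.1(c) asserts.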
The next lemma records the interplay between ∆f(x; d) and its infinitesimal counterpart f′(x; d); it is a consequence of (2) and the definitions.

Lemma 3.1. Let f be given as in P and let x ∈ dom g. Then

(a) the function d ↦ ∆f(x; d) is convex;
(b) for any d ∈ R^n, the difference quotients ∆f(x; td)/t are nondecreasing in t > 0, with

f′(x; d) = inf_{t>0} ∆f(x; td)/t,

and in particular,
(c) for any d ∈ R^n, f′(x; d) ≤ ∆f(x; d);
(d) for any d ∈ R^n and t ∈ [0, 1], ∆f(x; td) ≤ t ∆f(x; d).
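Parts (b) and (d) of Lemma 3.1 can be sanity-checked numerically. The following sketch uses a toy instance of our own choosing (not from the paper) and verifies that the quotients ∆f(x; td)/t are nondecreasing in t > 0.

```python
# Numeric check of Lemma 3.1(b),(d) on the toy instance of our choosing,
# f(x) = |x^2 - 1| + 0.1|x|: the quotients Delta f(x; t d)/t should be
# nondecreasing in t > 0, so their infimum over t > 0 recovers f'(x; d).
c, dc = (lambda x: x * x - 1.0), (lambda x: 2.0 * x)
h, g = abs, (lambda x: 0.1 * abs(x))
delta_f = lambda x, d: h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

x, d = 0.5, 1.0
qs = [delta_f(x, t * d) / t for t in (0.25, 0.5, 1.0, 2.0)]
print(all(a <= b + 1e-12 for a, b in zip(qs, qs[1:])))   # True
```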
We now state equivalent first-order necessary conditions for a local minimizer x of P, emphasizing that f′(x; d) and ∆f(x; d) are interchangeable with respect to these conditions. The proof of this result parallels that given in [3] using (2) and (3).

Theorem 3.1 (First-order necessary conditions for P). [3, Theorem 2.6] Let h, c, and g be as given in P. If x ∈ dom g is a local minimizer of P, then f′(x; d) ≥ 0 for all d ∈ R^n. Moreover, the following conditions are equivalent for any x ∈ dom g:

(a) 0 ∈ ∂f(x);
(b) for all d ∈ R^n, 0 ≤ f′(x; d);
(c) for all d ∈ R^n, 0 ≤ ∆f(x; d);
(d) for all η > 0, d = 0 solves min{ ∆f(x; d) | ‖d‖ ≤ η }.
The next lemma shows that if the sequence {(x^k, d^k)} ⊂ R^n × R^n is such that d^k is an approximate solution to P_k for all k with ∆f(x^k; d^k) → 0, then cluster points of {x^k} are first-order stationary for P.

Lemma 3.2. Let h, c, and g be as in P and α ∈ R. Set L := lev_f(α). Let {(x^k, η_k)} ⊂ L × R_+, with (x^k, η_k) → (x, η) ∈ R^n × R_+ and 0 < η < ∞. Define

∆_k f(d) := ∆f(x^k; d) + δ(d | η_k B), and

(5)   ∆_k f := min_d ∆_k f(d).

If, for each k ≥ 1, d^k ∈ η_k B satisfies

(6)   ∆f(x^k; d^k) ≤ β ∆_k f ≤ 0,

with ∆f(x^k; d^k) → 0, then 0 ∈ ∂f(x).
Proof. Since f is closed, f (x) ≤ α, which implies x ∈ dom g . Define the functions k k k hk(d) := h(c(x )+ ∇c(x )d) − h (c(x )), h∞(d) := h(c(x)+ ∇c(x)d) − h(c(x)), k k gk(d) := g(x + d) − g(x ), and g∞(d) := g(x + d) − g(x).
Since 0 < η_k → η with η > 0, and since δ(d | η_k B) = δ(η_k^{-1} d | B), [17, Proposition 7.2] implies

δ(· | η_k B) →e δ(· | η B).

By [17, Exercise 7.8(d)], g_k →e g_∞, so [17, Exercise 7.47] implies g_k + δ(· | η_k B) →e g_∞ + δ(· | η B), and applying [17, Exercise 7.47] again yields

h_k + g_k + δ(· | η_k B) →e h_∞ + g_∞ + δ(· | η B).

Equivalently,

∆f(x^k; ·) + δ(· | η_k B) →e ∆f(x; ·) + δ(· | η B).

By [17, Proposition 7.30] and (6),

0 = lim sup_k ∆_k f ≤ min_{‖d‖ ≤ η} ∆f(x; d) ≤ 0,

so Theorem 3.1 implies 0 ∈ ∂f(x).
The approximate solution condition (6) is described in [2]. It can be satisfied by employing the trick described in [4, Remark 6, page 343]. Specifically, one can use any solution technique for the convex subproblems P_k that also generates lower bounds ℓ_{k,j} ∈ R such that ℓ_{k,j} ↗ ∆_k f and ∆f(x^k; d^{k,j}) ↘ ∆_k f as j → ∞. If ∆_k f < 0, then the condition

∆f(x^k; d^{k,j}) ≤ β ℓ_{k,j}

is finitely satisfied, and

∆f(x^k; d^{k,j}) ≤ β ∆_k f.
We conclude this section with a mean-value theorem for P.
Theorem 3.2 (Mean value theorem for convex-composite functions). [17, Theorem 10.48] Let f be as in P with g strictly continuous relative to its domain, and let x_0, x_1 ∈ dom g. Then there exist t ∈ (0, 1), x_t := (1 − t)x_0 + t x_1, and v ∈ ∂f(x_t) such that

f(x_1) − f(x_0) = ⟨v, x_1 − x_0⟩.
Proof. Let F(t) := (1 − t)x_0 + t x_1 and let φ(t) := f(F(t)) − (1 − t) f(x_0) − t f(x_1). Then

φ(t) = h(c(F(t))) + g(F(t)) − (1 − t) f(x_0) − t f(x_1)

is an instance of P, since g ∘ F is convex. Consequently, the chain rules for φ and −φ on [0, 1] give

∂φ(t) = F′(t)⊤ ∇c(F(t))⊤ ∂h(c(F(t))) + F′(t)⊤ ∂g(F(t)) + f(x_0) − f(x_1)
      = { ⟨v, x_1 − x_0⟩ | v ∈ ∂f(F(t)) } + f(x_0) − f(x_1), and

∂(−φ)(t) = F′(t)⊤ ∇c(F(t))⊤ ∂(−h)(c(F(t))) + F′(t)⊤ ∂(−g)(F(t)) + f(x_1) − f(x_0)
         = { ⟨−v, x_1 − x_0⟩ | v ∈ ∂f(F(t)) } + f(x_1) − f(x_0).

As g is continuous on its domain, φ is continuous on [0, 1] with φ(0) = φ(1) = 0. Therefore, φ attains either its minimum or its maximum value at some t ∈ (0, 1), and 0 ∈ ∂φ(t) or 0 ∈ ∂(−φ)(t), respectively.
4. Backtracking for Convex-Composite Minimization
The simplest and most well established line search is the Armijo-Goldstein backtracking procedure [21]. It has been adapted for the convex-composite setting in [2, 15, 22] where it takes the form
f (x + td) ≤ f (x)+ σ1t∆ f (x; d)
and enforces sufficient decrease of f along the ray { x + td | t > 0 }, with ∆f acting as a surrogate for the directional derivative. The existence of step sizes t > 0 satisfying the sufficient decrease condition follows immediately from Lemma 3.1. The method of proof below adapts the step-size arguments given in Royer and Wright [18] to the convex-composite setting. Similar ideas on convex majorants for the composite problem P are employed in [10, 14].
Algorithm 1 Global Backtracking
1: procedure BacktrackingGlobal(x^0, σ1, θ)
2:   k ← 0
3:   repeat
4:     Find d^k ∈ R^n such that ∆f(x^k; d^k) < 0
5:     if no such d^k then
6:       0 ∈ ∂f(x^k); return
7:     end if
8:     t ← 1
9:     while f(x^k + t d^k) > f(x^k) + σ1 t ∆f(x^k; d^k) do
10:      t ← θ t
11:    end while
12:    t_k ← t
13:    x^{k+1} ← x^k + t_k d^k
14:    k ← k + 1
15:  until convergence
16: end procedure
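The following Python sketch mirrors Algorithm 1 on a small one-dimensional instance, f(x) = |x² − 1| + 0.1|x|, i.e., h = |·|, c(x) = x² − 1, g = 0.1|·|. Both the instance and the crude grid-search solver for the Gauss-Newton subproblem are our illustrative assumptions; any method producing directions with ∆f(x^k; d^k) < 0 fits the algorithm.

```python
# Sketch of Algorithm 1 on the toy instance f(x) = |x^2 - 1| + 0.1|x|
# (h = |.|, c(x) = x^2 - 1, g = 0.1|.|); the instance and the crude
# grid-search subproblem solver below are our illustrative assumptions.
c, dc = (lambda x: x * x - 1.0), (lambda x: 2.0 * x)
h, g = abs, (lambda x: 0.1 * abs(x))
f = lambda x: h(c(x)) + g(x)
delta_f = lambda x, d: h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

def gn_direction(x, eta=1.0, grid=200):
    """Approximately solve the Gauss-Newton subproblem P_k: minimize the
    convex model d -> Delta f(x; d) over |d| <= eta by grid search."""
    return min((eta * (2 * i / grid - 1) for i in range(grid + 1)),
               key=lambda d: delta_f(x, d))

def backtracking(x, sigma1=1e-4, theta=0.5, iters=50):
    for _ in range(iters):
        d = gn_direction(x)
        if delta_f(x, d) >= -1e-12:          # no descent direction: stationary
            break
        t = 1.0
        while f(x + t * d) > f(x) + sigma1 * t * delta_f(x, d):
            t *= theta                       # backtrack until sufficient decrease
        x = x + t * d
    return x

x_star = backtracking(2.0)
print(abs(f(x_star) - 0.1) < 1e-2)           # True: near the minimizer x = 1
```

The inner while loop terminates by Lemma 3.1, since ∆f(x; d) < 0 guarantees sufficient decrease for all small t.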
Theorem 4.1. Let f be as in P, x^0 ∈ dom g, 0 < σ1 < 1, and 0 < θ < 1. Set L := lev_f(f(x^0)). Suppose there exist M > 0 and M̃ > 0 such that ‖d^k‖ ≤ M and sup_{x∈L} ‖∇c(x)‖ ≤ M̃, and that

(i) ∇c is L_∇c-Lipschitz on L + M B_n;
(ii) h is L_h-Lipschitz on c(L + M B_n) + M̃ M B_m.

Let {x^k} be a sequence initialized at x^0 and generated by Algorithm 1. Then one of the following must occur:

(a) the algorithm terminates finitely at a first-order stationary point for f;
(b) f(x^k) ↘ −∞;
(c) ∑_{k=0}^∞ ∆f(x^k; d^k)² / ‖d^k‖_2² < ∞; in particular, ∆f(x^k; d^k) → 0.
Proof. We assume that (a) and (b) do not occur and show that (c) occurs. Since (a) does not occur, the sequence {x^k} is infinite, and ∆f(x^k; d^k) < 0 for all k ≥ 0. The sufficient decrease condition enforced by the backtracking subroutine gives a strict descent method, so the function values {f(x^k)} are strictly decreasing, with {x^k} ⊂ L for all k ≥ 0. In particular, f(x^k) ↘ f̄ > −∞.
We first show that for each k ≥ 0, the step size 0 < t_k ≤ 1 satisfies

(7)   t_k ≥ min{ 1, 2θ(1 − σ1)|∆f(x^k; d^k)| / (L_∇c L_h ‖d^k‖_2²) },
by considering two cases.
If the unit step t_k = 1 is accepted, the bound is immediate. Following [18], suppose now that the unit step length is not accepted. Then t̂ := θ^j ∈ (0, 1] fails the decrease condition for some j ≥ 0. Using the Lipschitz condition on h, the quadratic bound lemma, and Lemma 3.1, we obtain

σ1 t̂ ∆f(x^k; d^k) < f(x^k + t̂ d^k) − f(x^k) ≤ ∆f(x^k; t̂ d^k) + t̂² (L_∇c L_h / 2) ‖d^k‖_2²
                 ≤ t̂ ∆f(x^k; d^k) + t̂² (L_∇c L_h / 2) ‖d^k‖_2².

After dividing both sides by t̂ > 0 and rearranging,

(8)   t̂ ≥ 2(1 − σ1)|∆f(x^k; d^k)| / (L_∇c L_h ‖d^k‖_2²).

Consequently, when the backtracking algorithm terminates at t_k > 0,

(9)   t_k ≥ 2θ(1 − σ1)|∆f(x^k; d^k)| / (L_∇c L_h ‖d^k‖_2²).
Therefore, t_k satisfying the sufficient decrease condition implies

σ1 min{ 1, 2θ(1 − σ1)|∆f(x^k; d^k)| / (L_∇c L_h ‖d^k‖_2²) } |∆f(x^k; d^k)| ≤ σ1 t_k |∆f(x^k; d^k)| ≤ f(x^k) − f(x^{k+1}).

Using the boundedness of the search directions and arguing as in the proof of Theorem 5.1, the bound (9) holds for all k ≥ k_0. Summing the previous display,

0 < ∑_{k≥k_0} 2σ1θ(1 − σ1)∆f(x^k; d^k)² / (L_∇c L_h ‖d^k‖_2²) < f(x^0) − lim_{k→∞} f(x^k).

Since (b) does not occur, lim_{k→∞} f(x^k) > −∞, so (c) must occur.
Remark 1. When h is the identity on R and g = 0, we recover the convergence analysis of backtracking for smooth minimization.
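To see the reduction in Remark 1 concretely: when h is the identity on R and g = 0, definition (4) collapses to ∆f(x; d) = ∇f(x)⊤d, the classical slope used in the smooth Armijo test. A one-line numeric check on a smooth example of our own choosing:

```python
# When h is the identity on R and g = 0, (4) collapses to the classical slope
# Delta f(x; d) = grad f(x)^T d, so the Armijo test above becomes the usual
# smooth sufficient decrease test.  Numeric check with c(x) = x^2 (our choice):
c, dc = (lambda x: x * x), (lambda x: 2.0 * x)
h = lambda y: y                      # identity
g = lambda x: 0.0
delta_f = lambda x, d: h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

x, d = 1.5, -0.7
print(abs(delta_f(x, d) - dc(x) * d) < 1e-12)   # True
```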
The following corollary is an immediate consequence of Lemma 3.2.
Corollary 4.1. Let the hypotheses of Theorem 4.1 hold. If 0 < β < 1 and the directions {d^k} are chosen to satisfy

∆f(x^k; d^k) ≤ β ∆_k f < 0,

then the occurrence of (c) in Theorem 4.1 implies that cluster points of {x^k} are first-order stationary for P.
5. Weak Wolfe for Convex-Composite Minimization
Definition 5.1. The weak Wolfe conditions in the convex-composite case are defined at x ∈ dom g with ∆f(x; d) < 0 by choosing 0 < σ1 < σ2 < 1 and µ > 0 and requiring

(WWI)   f(x + td) ≤ f(x) + σ1 t ∆f(x; d), and

(WWII)  σ2 ∆f(x; d) ≤ ∆f(x + td; µd)/µ.

Remark 2. The second condition (WWII) is a curvature condition that parallels the classical weak Wolfe [19, 20] curvature condition for smooth, unconstrained minimization:

σ2 f′(x; d) ≤ f′(x + td; d),

which prevents the line search from terminating early at "strongly negative" slopes [21, Section 3.1].
Remark 3. The strong Wolfe conditions require |f′(x + td; d)| ≤ −σ2 f′(x; d) whenever f is smooth. However, in nonsmooth minimization, kinks and upward cusps at local minimizers make this condition unworkable.
The following lemma shows that the set of points satisfying (WWI) and (WWII) has nonempty interior.
Lemma 5.1. Let f be as in P, x ∈ dom g, and d chosen so that ∆f(x; d) < 0. Suppose f is bounded below on the ray { x + td | t > 0 }. Then the set

C(µ) := { t > 0 | f(x + td) ≤ f(x) + σ1 t ∆f(x; d) and σ2 ∆f(x; d) ≤ ∆f(x + td; µd)/µ }

has nonempty interior for any µ > 0.
Proof. Define
K(y, z, t) := h(y) + g(z) − [f(x) + σ1 t ∆f(x; d)],
G(t) := (c(x + td), x + td, t), with G′(t) = (∇c(x + td)d, d, 1),

and set φ(t) := K(G(t)) = f(x + td) − [f(x) + σ1 t ∆f(x; d)]. Then φ(t) is convex-composite, and

∆φ(t; µ) = K(G(t) + µ G′(t)) − K(G(t))
        = h(c(x + td) + µ∇c(x + td)d) + g(x + (t + µ)d) − [f(x) + σ1(t + µ)∆f(x; d)]
          − ( h(c(x + td)) + g(x + td) − [f(x) + σ1 t ∆f(x; d)] )
        = ∆f(x + td; µd) − µσ1 ∆f(x; d),
and, by Lemma 3.1,
φ′(t; µ) = f′(x + td; µd) − µσ1 ∆f(x; d) ≤ µ ∆φ(t; 1).

Consequently, φ′(0; 1) ≤ (1 − σ1)∆f(x; d) < 0, so there exists t̄ > 0 such that φ(t) < 0 for all t ∈ (0, t̄). This is equivalent to (WWI) being satisfied on (0, t̄).
Since f is bounded below on the ray and ∆f(x; d) < 0, φ(t) → ∞ as t → ∞. Let t̂ := sup{ t > t̄ | φ(s) < 0 for all s ∈ (0, t) }. Then, since g is closed and h is finite-valued, φ(t̂) = lim inf_{t↗t̂} φ(t), which implies

h(c(x + t̂d)) + g(x + t̂d) = [f(x) + σ1 t̂ ∆f(x; d)] + lim inf_{t↗t̂} φ(t) ≤ f(x) + σ1 t̂ ∆f(x; d) < ∞,

so x + t̂d ∈ dom g. Since g is continuous relative to its domain, φ is continuous relative to its domain, so φ(t̂) ≤ 0. We now consider two cases on the value of φ(t̂).

Suppose φ(t̂) < 0. We aim to show that f satisfies (WWI) and (WWII) on the interval ((t̂ − µ)_+, t̂]. To prove this, we show that if φ(t̂) < 0, then t > t̂ implies x + td ∉ dom g. Suppose to the contrary that there exists t > t̂ with x + td ∈ dom g. Then the definition of t̂, the convexity of dom φ, and the intermediate value theorem imply there exists t̃ such that φ(t̃) = 0 and t ≥ t̃ > t̂. But relative continuity of φ at t̂ with respect to dom φ, together with φ(t̂) < 0, gives φ < 0 on an interval (t̂, t̂ + ε) for some ε > 0, so φ < 0 on all of (0, t̂ + ε), which contradicts the definition of t̂. This proves the claim. Consequently, if φ(t̂) < 0, then f satisfies both (WWI) and (WWII) on the interval ((t̂ − µ)_+, t̂]: for such t we have t + µ > t̂, so g(x + (t + µ)d) = +∞ and the right-hand side of (WWII) is +∞.

Otherwise, φ(t̂) = 0. Let t̃ ∈ arg min_{t∈[0,t̂]} φ(t). Then t̃ ∈ (0, t̂), with 0 ≤ φ′(t̃; µ) ≤ ∆φ(t̃; µ) for all µ > 0; equivalently,

∆f(x + t̃d; µd)/µ ≥ σ1 ∆f(x; d) > σ2 ∆f(x; d)  for all µ > 0,

so (WWI) and (WWII) hold with strict inequality at t̃. We now consider two cases based on whether x + (t̃ + µ)d ∈ dom g.

First, suppose x + (t̃ + µ)d ∈ dom g for all sufficiently small µ > 0. Because the inequalities in (WWI) and (WWII) are strict at t̃, relative continuity of f and of t ↦ ∆f(x + td; µd) at t = t̃ implies there exists an open interval I with t̃ ∈ I and x + I d ⊂ dom g on which both (WWI) and (WWII) hold. For those µ > 0 for which x + (t̃ + µ)d ∉ dom g, (WWI) and (WWII) hold for all t ∈ ((t̃ − µ)_+, t̃], as argued in the previous case where φ(t̂) < 0.

Next, we prove finite termination of a bisection algorithm at a point t ≥ 0 satisfying the weak Wolfe conditions. The algorithm is analogous to the weak Wolfe bisection method for finite-valued nonsmooth minimization in [13].
Algorithm 2 Weak Wolfe Bisection Method
Input: x ∈ dom g, d ∈ R^n with ∆f(x; d) < 0, 0 < σ1 < σ2 < 1, and µ > 0.
1: procedure WWBisect(x, d, σ1, σ2, µ)
2:   α ← 0
3:   t ← 1
4:   β ← ∞
5:   while (WWI) and (WWII) fail do
6:     if f(x + td) > f(x) + σ1 t ∆f(x; d) then   ⊲ If not sufficient decrease
7:       β ← t
8:     else if σ2 ∆f(x; d) > ∆f(x + td; µd)/µ then   ⊲ Else if not curvature
9:       α ← t
10:    else
11:      return t
12:    end if
13:    if β = ∞ then   ⊲ Doubling phase
14:      t ← 2t
15:    else   ⊲ Bisection phase
16:      t ← (α + β)/2
17:    end if
18:  end while
19: end procedure
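The bisection method transcribes directly to code. The sketch below runs it on the toy instance f(x) = |x² − 1| + 0.1|x| (our illustrative choice, not an example from the paper); a production implementation would also guard against infinite values of ∆f when g is infinite-valued.

```python
# Direct transcription of Algorithm 2 on the toy instance
# f(x) = |x^2 - 1| + 0.1|x| (our illustrative choice, not from the paper).
c, dc = (lambda x: x * x - 1.0), (lambda x: 2.0 * x)
h, g = abs, (lambda x: 0.1 * abs(x))
f = lambda x: h(c(x)) + g(x)
delta_f = lambda x, d: h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

def ww_bisect(x, d, sigma1=0.1, sigma2=0.9, mu=0.5, max_iter=100):
    alpha, t, beta = 0.0, 1.0, float('inf')
    for _ in range(max_iter):
        if f(x + t * d) > f(x) + sigma1 * t * delta_f(x, d):
            beta = t                              # (WWI) fails: step too long
        elif sigma2 * delta_f(x, d) > delta_f(x + t * d, mu * d) / mu:
            alpha = t                             # (WWII) fails: step too short
        else:
            return t                              # both weak Wolfe conditions hold
        t = 2.0 * t if beta == float('inf') else 0.5 * (alpha + beta)
    raise RuntimeError("no weak Wolfe step found")

x, d = 2.0, -1.0                                  # here Delta f(x; d) = -2.1 < 0
t = ww_bisect(x, d)
print(f(x + t * d) <= f(x) + 0.1 * t * delta_f(x, d))   # True
```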
Lemma 5.2. Let f be given as in P with g strictly continuous relative to its domain, and suppose x ∈ dom g and d is chosen such that ∆f(x; d) < 0. Then one of the following must occur in Algorithm 2:

(a) the doubling phase does not terminate finitely: the parameter β is never set to a finite value, the parameter α becomes positive on the first iteration and doubles every iteration thereafter, and f(x + t_k d) ↘ −∞;
(b) both the doubling phase and the bisection phase terminate finitely at a t ≥ 0 for which the weak Wolfe conditions are satisfied.
Proof. Suppose the procedure does not terminate finitely. If the parameter β is never set to a finite value, then the doubling phase does not terminate. Then the parameter α becomes positive on the first iteration and doubles on each subsequent iteration k, with tk satisfying
f (x + tkd) ≤ f (x)+ σ1tk∆ f (x; d), ∀ k ≥ 1.
Therefore, since ∆ f (x; d) < 0, the function values f (x + tkd) ց −∞, so the first option occurs. Otherwise, the procedure does not terminate finitely, and β is eventually finite. Therefore, the doubling phase terminates finitely, but the bisection phase does not terminate finitely. This implies there exists t ≥ 0 such that
(10)   α_k ↗ t,   t_k → t,   β_k ↘ t.
We now consider two cases. First, suppose that the parameter α is never set to a positive number. Then α_k = 0 for all k ≥ 1, and t_k, β_k → 0, so the first branch is entered in each iteration. This implies

σ1 ∆f(x; d) < [f(x + t_k d) − f(x)] / t_k,   ∀ k ≥ 1.

Since [x, x + d] ⊂ dom g, letting k → ∞ and applying Lemma 3.1 yields the chain of inequalities

σ1 ∆f(x; d) ≤ f′(x; d) ≤ ∆f(x; d) < 0,

which contradicts σ1 ∈ (0, 1). Otherwise, the parameter α is eventually positive. Then the bisection phase does not terminate, and the algorithm generates infinite sequences {α_k}, {t_k}, and {β_k} satisfying (10) such that, for all k large, 0 < α_k < t_k < β_k < ∞, and

(11)   f(x + α_k d) ≤ f(x) + σ1 α_k ∆f(x; d),
(12)   f(x + β_k d) > f(x) + σ1 β_k ∆f(x; d),
(13)   σ2 ∆f(x; d) > ∆f(x + α_k d; µd)/µ,
(14)   [x, x + max{α_k + µ, β_k} d] ⊂ dom g.

Letting k → ∞ in (13) and using the lower semicontinuity of g gives

(15)   σ2 ∆f(x; d) ≥ ∆f(x + td; µd)/µ.
By Theorem 3.2, for each sufficiently large k there exist τ_k ∈ (0, 1),

x^k := (1 − τ_k)(x + α_k d) + τ_k(x + β_k d) = x + [(1 − τ_k)α_k + τ_k β_k] d,   and   v^k ∈ ∂f(x^k),

yielding an extended form of the mean value theorem:

(16)   f(x + β_k d) − f(x + α_k d) = ⟨v^k, (β_k − α_k)d⟩.

Let γ_k := (1 − τ_k)α_k + τ_k β_k ∈ (α_k, β_k), so that x^k = x + γ_k d. Then γ_k → t as k → ∞. Combining (11) and (12) and using (16) gives

σ1(β_k − α_k)∆f(x; d) < f(x + β_k d) − f(x + α_k d) = ⟨v^k, (β_k − α_k)d⟩.

Dividing by β_k − α_k > 0 gives

σ1 ∆f(x; d) ≤ f′(x + γ_k d; d) ≤ ∆f(x + γ_k d; µd)/µ.

Letting k → ∞ and using (15), we obtain the string of inequalities

∆f(x + td; µd)/µ ≤ σ2 ∆f(x; d) < σ1 ∆f(x; d) ≤ ∆f(x + td; µd)/µ,

which is a contradiction. Therefore, either the doubling phase never terminates or the procedure terminates finitely at some t at which f satisfies both weak Wolfe conditions.
A global convergence result for the weak Wolfe line search that parallels [2, Theorem 2.4] now follows under standard Lipschitz assumptions, which hold, in particular, if the initial level set lev_f(f(x^0)) is compact.
Algorithm 3 Global Weak Wolfe
1: procedure WeakWolfeGlobal(x^0, σ1, σ2, µ)
2:   k ← 0
3:   repeat
4:     Find d^k ∈ R^n such that ∆f(x^k; d^k) < 0
5:     if no such d^k then
6:       0 ∈ ∂f(x^k); return
7:     end if
8:     Let t_k be a step size satisfying (WWI) and (WWII)
9:     if no such t_k then
10:      f is unbounded below; return
11:    end if
12:    x^{k+1} ← x^k + t_k d^k
13:    k ← k + 1
14:  until convergence
15: end procedure
Theorem 5.1. Let f be as in P with g strictly continuous relative to its domain, x^0 ∈ dom g, 0 < σ1 < σ2 < 1, and 0 < µ < 1. Set L := lev_f(f(x^0)). Suppose there exist M, M̃ > 0 such that ‖d^k‖ ≤ M for all k ≥ 0 and sup_{x∈L} ‖∇c(x)‖ ≤ M̃, and

(i) c is L_c-Lipschitz on L;
(ii) ∇c is L_∇c-Lipschitz on L;
(iii) g is L_g-Lipschitz on (L + MµB) ∩ dom g;
(iv) h is L_h-Lipschitz on c(L) + M̃MµB.

Let {x^k} be a sequence initialized at x^0 and generated by Algorithm 3. Then at least one of the following must occur:

(a) the algorithm terminates finitely at a first-order stationary point for f;
(b) for some k, the step size selection procedure generates a sequence of trial step sizes t_{k_n} ↗ ∞ such that f(x^k + t_{k_n} d^k) → −∞;
(c) f(x^k) ↘ −∞;
(d) ∑_{k=0}^∞ ∆f(x^k; d^k)² / (‖d^k‖ + ‖d^k‖²) < ∞; in particular, ∆f(x^k; d^k) → 0.
Proof. We assume that (a)–(c) do not occur and show that (d) occurs. Since (a) does not occur, the sequence {x^k} is infinite, and ∆f(x^k; d^k) < 0 for all k ≥ 0. Since (b) does not occur, Lemma 5.2 implies that the weak Wolfe bisection method terminates finitely at every iteration k ≥ 0. The sufficient decrease condition (WWI) gives a strict descent method, so the function values {f(x^k)} are strictly decreasing, with {x^k} ⊂ L for all k ≥ 0. By the nonoccurrence of (c), f(x^k) ↘ f̄ > −∞.
We first show that for each k ≥ 0, the step size tk satisfies
(17)   t_k ≥ min{ 1 − µ, µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) },

by considering two cases.
First, suppose ∆f(x^{k+1}; µd^k) = ∞. Then x^{k+1} + µd^k = x^k + (t_k + µ)d^k ∉ dom g. Since x^k + d^k ∈ dom g, t_k + µ > 1, and by assumption 0 < µ < 1. Therefore, t_k ≥ 1 − µ. Otherwise, ∆f(x^{k+1}; µd^k) < ∞. Then
∆f(x^{k+1}; µd^k) − ∆f(x^k; µd^k)
= h(c(x^{k+1}) + ∇c(x^{k+1})µd^k) − h(c(x^{k+1})) + g(x^{k+1} + µd^k) − g(x^{k+1})
  − [h(c(x^k) + ∇c(x^k)µd^k) − h(c(x^k)) + g(x^k + µd^k) − g(x^k)]
= h(c(x^k)) − h(c(x^{k+1})) + h(c(x^{k+1}) + ∇c(x^{k+1})µd^k) − h(c(x^k) + ∇c(x^k)µd^k)
  + g(x^k) − g(x^{k+1}) + g(x^{k+1} + µd^k) − g(x^k + µd^k)
≤ 2L_h L_c t_k ‖d^k‖ + L_h L_∇c µ t_k ‖d^k‖² + 2L_g t_k ‖d^k‖
≤ K t_k (‖d^k‖ + ‖d^k‖²),

for some K ≥ 0. Adding and subtracting in (WWII) gives

σ2 ∆f(x^k; d^k) ≤ ∆f(x^k + t_k d^k; µd^k)/µ
              = ∆f(x^k; µd^k)/µ + [ ∆f(x^k + t_k d^k; µd^k)/µ − ∆f(x^k; µd^k)/µ ]
              ≤ ∆f(x^k; d^k) + (K/µ) t_k (‖d^k‖ + ‖d^k‖²)   (since 0 < µ < 1),

which rearranges to

(18)   0 < µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) ≤ t_k,

so (17) holds. Next, (WWI) and (17) imply
(19)   σ1 min{ 1 − µ, µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) } |∆f(x^k; d^k)| ≤ σ1 t_k |∆f(x^k; d^k)| ≤ f(x^k) − f(x^{k+1}).

We aim to show that the bound (18) holds for all large k by showing that ∆f(x^k; d^k) → 0, using the boundedness of the search directions {d^k}. Suppose there exists a subsequence J_1 ⊂ N along which ∆f(x^k; d^k) does not converge to 0. Let γ > 0 be such that sup_{k∈J_1} ∆f(x^k; d^k) ≤ −γ < 0. Then, since {d^k} ⊂ MB,

(20)   µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) does not converge to 0 along J_1.

If there exists a further subsequence J_2 ⊂ J_1 with

µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) ≥ 1 − µ,   ∀ k ∈ J_2,

then, expanding the recurrence given by (WWI) and writing J_2 = {k_1, k_2, ...}, we have

(21)   f(x^{k_n}) ≤ f(x^{k_{n−1}}) − σ1(1 − µ)γ ≤ f(x^{k_1}) − C(k_n)σ1(1 − µ)γ,

with C(k_n) → ∞ as n → ∞. This contradicts the nonoccurrence of (c). Otherwise, by (20), there exist a subsequence J_2 ⊂ J_1 and δ > 0 so that

0 < δ ≤ µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) < 1 − µ

for all large k ∈ J_2. Repeating the argument at (21) with δ in place of 1 − µ, we conclude that

µ(1 − σ2)|∆f(x^k; d^k)| / (K(‖d^k‖ + ‖d^k‖²)) → 0 along J_1,

and consequently ∆f(x^k; d^k) → 0 along J_1, which is a contradiction. Therefore, (18) holds for all k ≥ k_0. Summing over k ≥ k_0 in (19) gives

0 < ∑_{k≥k_0} σ1 µ(1 − σ2)∆f(x^k; d^k)² / (K(‖d^k‖ + ‖d^k‖²)) < f(x^0) − lim_{k→∞} f(x^k).
Since (c) does not occur, lim_{k→∞} f(x^k) > −∞, so (d) must occur.
Remark 4. When h is the identity on R and g = 0, we recover the convergence analysis of weak Wolfe for smooth minimization given in [21, Theorem 3.2].

Remark 5. The hypotheses of Theorem 5.1 simplify if h is globally Lipschitz. In that case, the boundedness condition on { ‖∇c(x)‖ | x ∈ L } is not necessary. Alternatively, if ∇c(x) is bounded on the closed convex hull of L, then the Lipschitz continuity of c on L is immediate.
The following corollary is an immediate consequence of Lemma 3.2.
Corollary 5.1. Let the hypotheses of Theorem 5.1 hold. If 0 < β < 1 and the directions {d^k} are chosen to satisfy

∆f(x^k; d^k) ≤ β ∆_k f < 0,

then the occurrence of (d) in Theorem 5.1 implies that cluster points of {x^k} are first-order stationary for P.
6. Trust-Region Subproblems
In this section, we let ‖·‖ denote an arbitrary norm on R^n.
Definition 6.1. For δ > 0 and x ∈ dom g, define the set of Cauchy steps D^C_δ(x) by

D^C_δ(x) := arg min_{‖d‖ ≤ δ} ∆f(x; d),

and set

∆^C_δ f(x) := inf_{‖d‖ ≤ δ} ∆f(x; d).
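For intuition, a Cauchy step can be approximated by brute force on a toy one-dimensional instance (our own choice, not from the paper): minimize the convex model d ↦ ∆f(x; d) over the trust region by grid search. A practical method would instead exploit the convexity of the model.

```python
# Brute-force Cauchy step (Definition 6.1) for the toy instance
# f(x) = |x^2 - 1| + 0.1|x| (our illustrative choice): minimize the convex
# model d -> Delta f(x; d) over the trust region |d| <= delta by grid search.
c, dc = (lambda x: x * x - 1.0), (lambda x: 2.0 * x)
h, g = abs, (lambda x: 0.1 * abs(x))
delta_f = lambda x, d: h(c(x) + dc(x) * d) + g(x + d) - h(c(x)) - g(x)

def cauchy_step(x, delta, grid=400):
    """Return an (approximate) Cauchy step and its model decrease."""
    ds = [delta * (2 * i / grid - 1) for i in range(grid + 1)]
    d_c = min(ds, key=lambda d: delta_f(x, d))
    return d_c, delta_f(x, d_c)

d_c, dec = cauchy_step(2.0, 1.0)
print(dec <= 0.0)   # True: d = 0 is feasible, so the Cauchy decrease is <= 0
```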
Observe that ∆^C_{η_k} f(x^k) = ∆_k f, where ∆_k f is defined in Lemma 3.2. Our pattern of proof in this section follows a path similar to the standard approaches to such results given in [7]. In short, the Cauchy step defined above provides the driver for the global convergence of Algorithm 4. The particular steps chosen by the algorithm must do at least as well as the Cauchy step. This idea is formalized by the sufficient decrease condition described below. The utility of this condition requires some basic Lipschitz continuity assumptions.

Assumptions:

A1: c, ∇c, and g are Lipschitz continuous on dom g with Lipschitz constants L_c, L_{c′}, and L_g, respectively.
A2: h is Lipschitz continuous on c(dom g) with Lipschitz constant L_h.

These assumptions imply the Lipschitz continuity of ∆^C_δ f on dom g.

Lemma 6.1. Let assumptions A1 and A2 hold. Then, for δ > 0, the mapping x ↦ ∆^C_δ f(x) is Lipschitz continuous on dom g with constant L_∆ := L_h(2L_c + δL_{c′}) + 2L_g.
Proof. Observe that for any x^1, x^2 ∈ dom g and d ∈ δB we have

∆f(x^1; d) − ∆f(x^2; d) = [h(c(x^1) + ∇c(x^1)d) − h(c(x^2) + ∇c(x^2)d)] + [h(c(x^2)) − h(c(x^1))]
                        + [g(x^1 + d) − g(x^2 + d)] + [g(x^2) − g(x^1)].

Hence,

∆f(x^1; d) − ∆f(x^2; d) ≤ L_∆ ‖x^1 − x^2‖,

and, by symmetry,

∆f(x^2; d) − ∆f(x^1; d) ≤ L_∆ ‖x^1 − x^2‖.

Taking d ∈ D^C_δ(x^2) in the first of these inequalities and d ∈ D^C_δ(x^1) in the second gives the result.
Definition 6.2 (Sufficient decrease condition for f). We say that a direction choice method satisfies the sufficient decrease condition for f if for all ǫ > 0 there exist constants κ1, κ2 > 0, depending only on ǫ, such that whenever δ_k > 0 and x^k ∈ R^n satisfy |∆^C_1 f(x^k)| > ǫ, the direction choice d^k ∈ R^n satisfies

(22)   ∆f(x^k; d^k) + (1/2) d^{k⊤} H_k d^k < −κ1 min(κ2, δ_k).

Remark 6. The sufficient decrease condition can be defined using |∆^C_δ̂ f(x^k)| for any δ̂ > 0, but this choice of δ̂ must then remain constant throughout the iteration process; δ̂ = 1 is chosen for simplicity.

We now show that the sufficient decrease condition can always be satisfied in a neighborhood of any non-stationary point.
Lemma 6.2. Let x ∈ dom g, H ∈ R^{n×n}, δ > 0, and let σ > 0 be such that ‖d‖_2 ≤ σ‖d‖ for all d ∈ R^n. If d̂ ∈ D^C_1(x) with ∆^C_1 f(x) < 0, then there exists t̂ ∈ (0, min(1, δ)] such that

∆f(x; t̂ d̂) + (t̂²/2) d̂⊤H d̂ ≤ (1/2) ∆^C_1 f(x) min( |∆^C_1 f(x)| / (σ²‖H‖_2), 1, δ ).
Proof. For any t ∈ (0, min(1, δ)], Lemma 3.1 and Hölder's inequality imply

∆f(x; t d̂) + (t²/2) d̂⊤H d̂ ≤ t ∆f(x; d̂) + (t²/2) σ²‖H‖_2.

Set α := ∆f(x; d̂) < 0, β := σ²‖H‖_2 > 0, and

t̂ := arg min_{t ∈ [0, min(1, δ)]} (αt + βt²/2) = min(1, δ, −α/β).

There are two cases to consider. If −α/β ≤ min(1, δ), then t̂ = −α/β and

∆f(x; t̂ d̂) + (t̂²/2) d̂⊤H d̂ ≤ t̂ ∆f(x; d̂) + (t̂²/2) σ²‖H‖_2
                            = −∆f(x; d̂)² / (σ²‖H‖_2) + ∆f(x; d̂)² / (2σ²‖H‖_2)
                            = −∆f(x; d̂)² / (2σ²‖H‖_2)
                            = (1/2) ∆^C_1 f(x) ( |∆^C_1 f(x)| / (σ²‖H‖_2) ).

Otherwise, min(1, δ) ≤ −α/β. Setting t̂ = min(1, δ) gives

∆f(x; t̂ d̂) + (t̂²/2) d̂⊤H d̂ ≤ t̂ ∆f(x; d̂) + (t̂²/2) σ²‖H‖_2
                            ≤ t̂ ∆f(x; d̂) − (t̂/2) ∆f(x; d̂)
                            = (1/2) ∆^C_1 f(x) min(1, δ).

The result follows.
The sufficient decrease condition provides a useful bound on the change in the objective function.

Lemma 6.3. Let assumptions A1 and A2 hold. Suppose σ > 0 satisfies ‖d‖_2 ≤ σ‖d‖ for all d ∈ R^n. Let H ∈ R^{n×n}, 0 < β1 ≤ β2 < 1, and α, κ1, κ2 > 0 be given. Choose δ̄ > 0 so that for all δ ∈ [0, δ̄],

κ1(1 − β2) min(κ2, δ) ≥ (δ²/2) L_h L_∇c + (σ²δ²/2) ‖H‖_2.

Then, for every δ ∈ [0, δ̄], x ∈ dom g, and d ∈ δB for which