
Stochastic Second-Order Optimization Methods Part I: Convex

Fred Roosta

School of Mathematics and Physics, University of Queensland

Disclaimer

Disclaimer: To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized; please see the cited work for the details. Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.

Problem Statement

Problem

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} \overbrace{F(x) = \mathbb{E}_w f(x; w) = \int_\Omega \underbrace{f(x; w(\omega))}_{\text{e.g., loss/penalty + prediction function}} \, dP\big(w(\omega)\big)}^{\text{Risk}}$$

F: (non-)convex / (non-)smooth
High-dimensional: d ≫ 1

With the empirical (counting) measure over $\{w_i\}_{i=1}^n \subset \mathbb{R}^p$, i.e.,
$$\int_A dP(w) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{w_i \in A\}}, \quad \forall A \subseteq \Omega \;\Longrightarrow\; F(x) = \underbrace{\frac{1}{n} \sum_{i=1}^n f(x; w_i)}_{\text{finite-sum / empirical risk}}$$
“Big data”: n ≫ 1

Humongous Data / High Dimension

Classical algorithms ⇒ high per-iteration costs

Modern variants:

1. Low per-iteration costs

2. Maintain the original convergence rate

First Order Methods

Use only gradient (first-order) information.

E.g.:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla F(x^{(k)})$$

Smooth convex F ⇒ sublinear, O(1/k)
Smooth strongly convex F ⇒ linear, O(ρ^k), ρ < 1
However, the per-iteration cost is high!

Stochastic variants, e.g., (mini-batch) SGD: for some small s, choose $\{w_i\}_{i=1}^s$ at random and

$$x^{(k+1)} = x^{(k)} - \frac{\alpha^{(k)}}{s} \sum_{i=1}^s \nabla f(x^{(k)}; w_i)$$

Cheap per-iteration costs! However, slower to converge:
Smooth convex F ⇒ O(1/√k)
Smooth strongly convex F ⇒ O(1/k)
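To make the update above concrete, here is a minimal NumPy sketch of the mini-batch SGD iteration; grad_f (a per-sample gradient), the fixed step size, and the batch size are illustrative assumptions, not part of the slides.

```python
import numpy as np

def minibatch_sgd(grad_f, x0, data, s=32, alpha=0.1, iters=1000, seed=0):
    """Mini-batch SGD: x <- x - (alpha / s) * sum_{i in batch} grad_f(x, w_i)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = len(data)
    for _ in range(iters):
        batch = rng.choice(n, size=min(s, n), replace=False)  # s samples chosen at random
        g = np.mean([grad_f(x, data[i]) for i in batch], axis=0)
        x = x - alpha * g                                      # fixed step size for simplicity
    return x
```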

Modifications...
Achieve the convergence rate of full GD
Preserve the per-iteration cost of SGD
E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

Second Order Methods

Use both gradient and Hessian information

E.g., classical Newton's method:

$$x^{(k+1)} = x^{(k)} - \alpha^{(k)} \underbrace{[\nabla^2 F(x^{(k)})]^{-1} \nabla F(x^{(k)})}_{\text{non-uniformly scaled gradient}}$$

Smooth convex F ⇒ locally superlinear
Smooth strongly convex F ⇒ locally quadratic and globally linear
However, the per-iteration cost is much higher!

Outline

Part I: Convex
  Smooth: Newton-CG
  Non-Smooth: Proximal Newton; Semi-smooth Newton
Part II: Non-Convex
  Line-Search Based Methods: L-BFGS; Gauss-Newton; Natural Gradient
  Trust-Region Based Methods: Trust-Region; Cubic Regularization
Part III: Discussion and Examples

Finite Sum / Empirical Risk Minimization

FSM/ERM
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^n f(x; w_i) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

$f_i$: convex and smooth
F: strongly convex ⇒ unique minimizer $x^\star$
n ≫ 1 and/or d ≫ 1

Iterative Scheme

$$y^{(k)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ (x - x^{(k)})^T g^{(k)} + \frac{1}{2} (x - x^{(k)})^T H^{(k)} (x - x^{(k)}) \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha_k \big( y^{(k)} - x^{(k)} \big)$$

With the same iterative scheme, different choices of $g^{(k)}$ and $H^{(k)}$ recover familiar methods:

Newton: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = \nabla^2 F(x^{(k)})$

Gradient Descent: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = I$

Frank-Wolfe: $g^{(k)} = \nabla F(x^{(k)})$ and $H^{(k)} = 0$

(mini-batch) SGD: $S_g \subset \{1, 2, \ldots, n\} \Longrightarrow g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)})$ and $H^{(k)} = I$

Sub-Sampled Newton:

Hessian Sub-Sampling:

$$g^{(k)} = \nabla F(x^{(k)}), \qquad S_H \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Gradient and Hessian Sub-Sampling:

$$S_g \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)})$$
$$S_H \subset \{1, 2, \ldots, n\} \;\Longrightarrow\; H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Sub-sampled Newton's Method

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Hessian Sub-Sampling
$$g^{(k)} = \nabla F(x^{(k)}), \qquad H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Global convergence, i.e., starting from any initial point.

Iterative Scheme

Descent direction: $H^{(k)} p^{(k)} \approx -g^{(k)}$

Step size (Armijo line search):
$$\alpha_k = \operatorname*{argmax}_{\alpha \le 1} \alpha \quad \text{s.t.} \quad F(x^{(k)} + \alpha p^{(k)}) \le F(x^{(k)}) + \alpha \beta \, (p^{(k)})^T \nabla F(x^{(k)})$$

Update: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$

Algorithm: Newton's Method with Hessian Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   $g^{(k)} = \nabla F(x^{(k)})$
4:   Choose $S_H^{(k)} \subseteq \{1, 2, \ldots, n\}$
5:   $H^{(k)} = \frac{1}{|S_H^{(k)}|} \sum_{j \in S_H^{(k)}} \nabla^2 f_j(x^{(k)})$
6:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
7:   Find $\alpha^{(k)}$ that passes the Armijo line search
8:   Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
9: end for
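A minimal NumPy sketch of the algorithm above, under the assumption that the user supplies callables F, grad_F, and hess_fi (the per-sample Hessian); the Armijo constant, backtracking factor, and sample size are illustrative placeholders, not values from the slides.

```python
import numpy as np

def subsampled_newton(F, grad_F, hess_fi, x0, n, s=100, beta=1e-4,
                      iters=50, tol=1e-8, seed=0):
    """Newton's method with uniform Hessian sub-sampling and Armijo backtracking."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        g = grad_F(x)                                       # full gradient
        if np.linalg.norm(g) <= tol:
            break
        S = rng.choice(n, size=min(s, n), replace=False)    # S_H: Hessian sample
        H = np.mean([hess_fi(x, j) for j in S], axis=0)     # sub-sampled Hessian
        p = np.linalg.solve(H, -g)                          # (in)exact Newton direction
        alpha = 1.0
        for _ in range(50):                                 # Armijo backtracking
            if F(x + alpha * p) <= F(x) + alpha * beta * (p @ g):
                break
            alpha *= 0.5
        x = x + alpha * p
    return x
```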

Theorem ([Byrd et al., 2011])

If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\lim_{k \to \infty} \nabla F(x^{(k)}) = 0.$$

Theorem ([Bollapragada et al., 2016])

If each $f_i$ is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then

$$\mathbb{E}\left[ F(x^{(k)}) - F(x^\star) \right] \le \rho^k \left( F(x^{(0)}) - F(x^\star) \right),$$

where $\rho \le 1 - 1/\kappa^2$, with $\kappa$ being the “condition number” of the problem.

What if F is strongly convex, but each $f_i$ is only weakly convex? E.g.,
$$F(x) = \frac{1}{n} \sum_{i=1}^n \ell_i(a_i^T x)$$
with $a_i \in \mathbb{R}^d$ and $\mathrm{Range}(\{a_i\}_{i=1}^n) = \mathbb{R}^d$, $\ell_i : \mathbb{R} \to \mathbb{R}$ and $\ell_i'' \ge \gamma > 0$.
Each $\nabla^2 f_i(x) = \ell_i''(a_i^T x) \, a_i a_i^T$ is rank one! But
$$\nabla^2 F(x) \succeq \gamma \cdot \underbrace{\lambda_{\min}\Big( \frac{1}{n} \sum_{i=1}^n a_i a_i^T \Big)}_{>0} \, I$$

Sub-Sampling Hessian

$$H = \frac{1}{|S|} \sum_{j \in S} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016a])
Suppose $\nabla^2 F(x) \succeq \gamma I$ and $0 \preceq \nabla^2 f_i(x) \preceq L I$. Given any $0 < \epsilon < 1$ and $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|S| \ge \frac{2 \kappa \log(d/\delta)}{\epsilon^2},$$
then
$$\Pr\big( H \succeq (1 - \epsilon) \gamma I \big) \ge 1 - \delta,$$
where $\kappa = L/\gamma$.

Inexact Update: $Hp \approx -g$, i.e., $\|Hp + g\| \le \theta \|g\|$ with $\theta < 1$, using CG.

Why CG and not other solvers (at least theoretically)? H is SPD, and the CG iterate over the Krylov subspace $\mathcal{K}_t$ satisfies
$$p^{(t)} = \operatorname*{argmin}_{p \in \mathcal{K}_t} \; p^T g + \frac{1}{2} p^T H p \;\Longrightarrow\; \big\langle p^{(t)}, g \big\rangle \le -\frac{1}{2} \big\langle p^{(t)}, H p^{(t)} \big\rangle,$$
so every CG iterate is a descent direction.
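As an illustration of the inexact CG solve above, here is a self-contained sketch that only needs Hessian-vector products and stops once the relative residual drops below θ; the function and argument names are illustrative, not from the slides.

```python
import numpy as np

def newton_cg_direction(hess_vec, g, theta=0.1, maxiter=200):
    """Return p with ||H p + g|| <= theta * ||g||, using conjugate gradient.

    hess_vec(v) should compute the Hessian-vector product H @ v.
    """
    p = np.zeros_like(g)
    r = -g - hess_vec(p)                 # residual of H p = -g (r = -g with p = 0)
    d = r.copy()
    tol = theta * np.linalg.norm(g)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol:     # ||H p + g|| <= theta ||g||
            break
        Hd = hess_vec(d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p
```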

Sub-Sampling Hessian

Theorem ([Roosta and Mahoney, 2016a])
If $\theta \le \sqrt{\dfrac{1 - \epsilon}{\kappa}}$, then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho \left( F(x^{(k)}) - F(x^\star) \right),$$
where $\rho = 1 - (1 - \epsilon)/\kappa^2$.

Sub-sampled Newton's Method

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$.

Linear-Quadratic Error Recursion
$$\| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2$$

[Erdogdu and Montanari, 2015]

$$[U_{r+1}, \Lambda_{r+1}] = \mathrm{TSVD}(H, r+1)$$
$$\hat{H}^{-1} = \frac{1}{\lambda_{r+1}} I + U_r \left( \Lambda_r^{-1} - \frac{1}{\lambda_{r+1}} I \right) U_r^T$$
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\| x - \left( x^{(k)} - \alpha^{(k)} \hat{H}^{-1} \nabla F(x^{(k)}) \right) \right\|$$

Theorem ([Erdogdu and Montanari, 2015])
$$\| x^{(k+1)} - x^\star \| \le \xi_1^{(k)} \| x^{(k)} - x^\star \| + \xi_2^{(k)} \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1^{(k)} = 1 - \frac{\lambda_{\min}(H^{(k)})}{\lambda_{r+1}(H^{(k)})} + \frac{L_g}{\lambda_{r+1}(H^{(k)})} \sqrt{\frac{\log d}{|S_H^{(k)}|}}, \qquad \xi_2^{(k)} = \frac{L_H}{2 \lambda_{r+1}(H^{(k)})}.$$

Note: For constrained optimization, the method is based on two-metric gradient projection ⇒ starting from an arbitrary point, the algorithm might not recognize, i.e., may fail to stop at, a stationary point.
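For concreteness, a small sketch of the truncated low-rank construction of Ĥ⁻¹ above; since the (sub-sampled) Hessian is symmetric PSD, its truncated SVD coincides with its top eigenpairs, which is what this illustrative helper uses (names are assumptions, not the authors' code).

```python
import numpy as np

def truncated_hessian_inverse(H, r):
    """H_hat^{-1} = lam_{r+1}^{-1} I + U_r (Lam_r^{-1} - lam_{r+1}^{-1} I) U_r^T,
    built from the top r+1 eigenpairs of a symmetric PSD Hessian H (requires r < d)."""
    d = H.shape[0]
    eigvals, eigvecs = np.linalg.eigh(H)      # ascending eigenvalues of symmetric H
    order = np.argsort(eigvals)[::-1]         # sort descending
    lam, U = eigvals[order], eigvecs[:, order]
    Ur, Lam_r = U[:, :r], lam[:r]             # top-r eigenpairs
    lam_rp1 = lam[r]                          # the (r+1)-th largest eigenvalue
    return np.eye(d) / lam_rp1 + Ur @ np.diag(1.0 / Lam_r - 1.0 / lam_rp1) @ Ur.T
```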

Sub-sampling Hessian

$$H = \frac{1}{|S|} \sum_{j \in S} \nabla^2 f_j(x)$$

Lemma ([Roosta and Mahoney, 2016b])
Given any $0 < \epsilon < 1$ and $0 < \delta < 1$, if the Hessian is uniformly sub-sampled with
$$|S| \ge \frac{2 \kappa^2 \log(d/\delta)}{\epsilon^2},$$
then
$$\Pr\Big( (1 - \epsilon) \nabla^2 F(x) \preceq H(x) \preceq (1 + \epsilon) \nabla^2 F(x) \Big) \ge 1 - \delta.$$

Sub-sampled Newton's Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,
$$\| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1 = \frac{\epsilon}{1 - \epsilon} + \sqrt{\frac{\kappa}{1 - \epsilon}} \, \theta, \qquad \xi_2 = \frac{L_H}{2 (1 - \epsilon) \gamma}.$$
If $\theta = \epsilon / \sqrt{\kappa}$, then $\xi_1$ is problem-independent ⇒ it can be made arbitrarily small!

Theorem ([Roosta and Mahoney, 2016b])

Consider any $0 < \rho_0 < \rho < 1$ and $\epsilon \le \rho_0 / (1 + \rho_0)$. If
$$\| x^{(0)} - x^\ast \| \le \frac{\rho - \rho_0}{\xi_2} \qquad \text{and} \qquad \theta \le \rho_0 \sqrt{\frac{1 - \epsilon}{\kappa}},$$
we get locally Q-linear convergence

$$\| x^{(k)} - x^\ast \| \le \rho \, \| x^{(k-1)} - x^\ast \|, \quad k = 1, \ldots, k_0,$$

with probability $(1 - \delta)^{k_0}$.

Problem-independent local convergence rate
By increasing the Hessian accuracy, a super-linear rate is possible

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$:
we have linear convergence,
after a certain number of iterations, we get “problem-independent” linear convergence,
after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations.

“Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works.”

Prof. Jorge Nocedal, IPAM Summer School, 2012

Sub-sampled Newton's Method

Theorem ([Bollapragada et al., 2016])
Suppose

$$\mathbb{E}_i \left\| \nabla^2 f_i(x) - \nabla^2 F(x) \right\|^2 \le \sigma^2.$$

Then

$$\mathbb{E} \| x^{(k+1)} - x^\star \| \le \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_1 = \frac{\sigma}{\gamma \sqrt{|S_H^{(k)}|}} + \kappa \theta, \qquad \xi_2 = \frac{L_H}{2 \gamma}.$$

Exploiting the structure....

Example:

$$F(x) = \frac{1}{n} \sum_{i=1}^n \ell_i(a_i^T x) \;\Longrightarrow\; \nabla^2 F(x) = A^T D_x A,$$
where
$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad D_x = \frac{1}{n} \begin{bmatrix} \ell_1''(a_1^T x) & & \\ & \ddots & \\ & & \ell_n''(a_n^T x) \end{bmatrix} \in \mathbb{R}^{n \times n}.$$

Sketching [Pilanci and Wainwright, 2017]

$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T \nabla^2 F(x^{(k)}) (x - x^{(k)}) \right\}$$
With $\nabla^2 F(x^{(k)}) = B^{(k)T} B^{(k)}$, $B^{(k)} \in \mathbb{R}^{n \times d}$:
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} \big\| \underbrace{B^{(k)}}_{n \times d} (x - x^{(k)}) \big\|^2 \Big\}$$
Approximate $\nabla^2 F(x^{(k)}) \approx B^{(k)T} S^{(k)T} S^{(k)} B^{(k)}$, with a sketch $S^{(k)} \in \mathbb{R}^{s \times n}$, $d \le s \ll n$:
$$x^{(k+1)} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ (x - x^{(k)})^T \nabla F(x^{(k)}) + \frac{1}{2} \big\| \underbrace{S^{(k)} B^{(k)}}_{s \times d} (x - x^{(k)}) \big\|^2 \Big\}$$
(a code sketch of this step follows the list of sketching choices below)

Sub-Gaussian sketches: well-known concentration properties; involve dense/unstructured matrix operations
Randomized orthonormal systems (e.g., Hadamard or Fourier): sub-optimal sample sizes; fast matrix multiplication
Random row sampling: uniform; non-uniform (more on this later)
Sparse JL sketches
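For instance, the sub-Gaussian option above can be plugged into the sketched Newton step as in this minimal sketch for the GLM case, where B = D^{1/2} A is a square root of the Hessian; the Gaussian sketch, the unconstrained solve, and all names are illustrative assumptions.

```python
import numpy as np

def newton_sketch_step(A, ell_pp, g, x, s, seed=0):
    """One unconstrained Newton-sketch step for F(x) = (1/n) sum_i ell_i(a_i^T x).

    A: n x d data matrix, ell_pp: maps the vector A @ x to the second derivatives
    ell_i''(a_i^T x), g: full gradient at x, s: sketch size (d <= s << n)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    D_sqrt = np.sqrt(ell_pp(A @ x) / n)            # diag(D)^{1/2}, D = (1/n) diag(ell_i'')
    B = D_sqrt[:, None] * A                        # B in R^{n x d}, so H = B^T B
    S = rng.standard_normal((s, n)) / np.sqrt(s)   # Gaussian sketch, E[S^T S] = I
    SB = S @ B                                     # s x d sketched square root
    H_sketch = SB.T @ SB                           # sketched Hessian approximation
    p = np.linalg.solve(H_sketch, -g)              # sketched Newton direction
    return x + p
```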

Non-uniform Sampling [Xu et al., 2016]

Non-uniformity among the $\nabla^2 f_i$ ⇒ O(n) uniform samples needed!

Goal: find $|S_H|$ independent of n and immune to non-uniformity. Non-uniform sampling schemes based on:

1. Row norms

2. Leverage scores

LiSSA [Agarwal et al., 2017]

Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse.
$$\|A\| \le 1, \; A \succ 0 \;\Longrightarrow\; A^{-1} = \sum_{i=0}^{\infty} (I - A)^i$$
$$A_j^{-1} = \sum_{i=0}^{j} (I - A)^i = I + (I - A) A_{j-1}^{-1}, \qquad \lim_{j \to \infty} A_j^{-1} = A^{-1}$$
Uniformly sub-sampled Hessian + truncated Neumann series

Three nested loops involving Hessian-vector products (HVPs), i.e., $\nabla^2 f_{i,j}(x^{(k)}) v_{i,j}$
Leverages fast multiplication by the Hessian for GLMs
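A stripped-down sketch of the truncated Neumann recursion above for estimating H⁻¹g from sub-sampled Hessian-vector products; this is a simplified single-loop variant rather than the full three-loop LiSSA algorithm, and hess_vec_i, scale, and depth are assumptions.

```python
import numpy as np

def neumann_inverse_hvp(hess_vec_i, g, n, depth=50, scale=1.0, seed=0):
    """Estimate H^{-1} g via v_j = g + (I - H_{i_j} / scale) v_{j-1}, where H_{i_j} is a
    single randomly chosen per-sample Hessian; requires ||H / scale|| <= 1."""
    rng = np.random.default_rng(seed)
    v = g.copy()
    for _ in range(depth):
        i = rng.integers(n)                      # pick one component at random
        v = g + v - hess_vec_i(i, v) / scale     # truncated Neumann recursion step
    return v / scale                             # undo the spectral rescaling
```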

Name          Complexity per iteration                     Reference
Newton-CG     O(nnz(A) √κ)                                 folklore
SSN-LS        Õ(nnz(A) log n + p² κ^{3/2})                 [Xu et al., 2016]
SSN-RNS       Õ(nnz(A) + sr(A) p κ^{5/2})                  [Xu et al., 2016]
SRHT          Õ(n p (log n)⁴ + p² (log n)⁴ κ^{3/2})        [Pilanci et al., 2016]
SSN-Uniform   Õ(nnz(A) + p κ̂ κ^{3/2})                      [Roosta et al., 2016]
LiSSA         Õ(nnz(A) + p κ̂ κ̄²)                           [Agarwal et al., 2017]

$$\kappa = \max_x \frac{\lambda_{\max}\big(\nabla^2 F(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \hat{\kappa} = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\lambda_{\min}\big(\nabla^2 F(x)\big)}, \qquad \bar{\kappa} = \max_x \frac{\max_i \lambda_{\max}\big(\nabla^2 f_i(x)\big)}{\min_i \lambda_{\min}\big(\nabla^2 f_i(x)\big)}, \qquad \kappa \le \hat{\kappa} \le \bar{\kappa}$$

Sub-sampled Newton's Method

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T g^{(k)} + \frac{1}{2} p^T H^{(k)} p \right\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Gradient and Hessian Sub-Sampling

$$g^{(k)} = \frac{1}{|S_g|} \sum_{j \in S_g} \nabla f_j(x^{(k)}), \qquad H^{(k)} = \frac{1}{|S_H|} \sum_{j \in S_H} \nabla^2 f_j(x^{(k)})$$

Algorithm: Newton's Method with Gradient and Hessian Sub-Sampling
1: Input: $x^{(0)}$
2: for k = 0, 1, 2, ... until termination do
3:   Choose $S_g^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $g^{(k)} = \frac{1}{|S_g^{(k)}|} \sum_{j \in S_g^{(k)}} \nabla f_j(x^{(k)})$
4:   if $\|g^{(k)}\| \le \sigma \epsilon_g$ then
5:     STOP
6:   end if
7:   Choose $S_H^{(k)} \subseteq \{1, 2, \ldots, n\}$ and set $H^{(k)} = \frac{1}{|S_H^{(k)}|} \sum_{j \in S_H^{(k)}} \nabla^2 f_j(x^{(k)})$
8:   Solve $H^{(k)} p^{(k)} \approx -g^{(k)}$
9:   Find $\alpha^{(k)}$ that passes the Armijo line search
10:  Update $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$
11: end for

Global convergence, i.e., starting from any initial point.

[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]

We can write $\nabla F(x) = AB$ where
$$A := \begin{bmatrix} | & | & & | \\ \nabla f_1(x) & \nabla f_2(x) & \cdots & \nabla f_n(x) \\ | & | & & | \end{bmatrix} \qquad \text{and} \qquad B := \begin{bmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{bmatrix}.$$

Lemma ([Roosta and Mahoney, 2016a])

Let $\|\nabla f_i(x)\| \le G(x) < \infty$. For any $0 < \epsilon, \delta < 1$, if

$$|S| \ge \frac{G(x)^2 \big( 1 + \sqrt{8 \ln(1/\delta)} \big)^2}{\epsilon^2},$$

then

$$\Pr\big( \|\nabla F(x) - g(x)\| \le \epsilon \big) \ge 1 - \delta.$$
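A tiny helper that evaluates the sample-size bounds from the gradient lemma above and the earlier Hessian lemma (|S| ≥ 2κ log(d/δ)/ε²); the constants G, L, and γ are problem-dependent quantities the user must supply, and the numbers in the usage line are purely illustrative.

```python
import numpy as np

def gradient_sample_size(G, eps, delta):
    """|S_g| >= G^2 (1 + sqrt(8 ln(1/delta)))^2 / eps^2  (gradient sub-sampling lemma)."""
    return int(np.ceil(G**2 * (1 + np.sqrt(8 * np.log(1 / delta)))**2 / eps**2))

def hessian_sample_size(L, gamma, d, eps, delta):
    """|S_H| >= 2 kappa log(d/delta) / eps^2, with kappa = L / gamma (Hessian lemma)."""
    kappa = L / gamma
    return int(np.ceil(2 * kappa * np.log(d / delta) / eps**2))

# Illustrative usage with made-up constants:
print(gradient_sample_size(G=10, eps=0.1, delta=0.01),
      hessian_sample_size(L=100, gamma=1, d=1000, eps=0.1, delta=0.01))
```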

Theorem ([Roosta and Mahoney, 2016a])
If
$$\theta \le \sqrt{\frac{1 - \epsilon_H}{\kappa}},$$
then, w.h.p.,
$$F(x^{(k+1)}) - F(x^\star) \le \rho \left( F(x^{(k)}) - F(x^\star) \right),$$

where $\rho = 1 - (1 - \epsilon_H)/\kappa^2$, and upon “STOP”, we have $\|\nabla F(x^{(k)})\| < (1 + \sigma) \, \epsilon_g$.

Local convergence, i.e., in a neighborhood of $x^\star$, and with $\alpha^{(k)} = 1$.

Error Recursion
$$\| x^{(k+1)} - x^\star \| \le \xi_0 + \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2$$

[Roosta and Mahoney, 2016b, Bollapragada et al., 2016]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,

$$\| x^{(k+1)} - x^\star \| \le \xi_0 + \xi_1 \| x^{(k)} - x^\star \| + \xi_2 \| x^{(k)} - x^\star \|^2,$$
where
$$\xi_0 = \frac{\epsilon_g}{(1 - \epsilon_H) \gamma}, \qquad \xi_1 = \frac{\epsilon_H}{1 - \epsilon_H} + \sqrt{\frac{\kappa}{1 - \epsilon_H}} \, \theta, \qquad \xi_2 = \frac{L_H}{2 (1 - \epsilon_H) \gamma}.$$

Theorem ([Roosta and Mahoney, 2016b])
Consider any $0 < \rho_0 + \rho_1 < \rho < 1$. If $\| x^{(0)} - x^\ast \| \le c(\rho_0, \rho_1, \rho)$, $\epsilon_g^{(k)} = \rho^k \epsilon_g$, and
$$\theta \le \rho_0 \sqrt{\frac{1 - \epsilon_H}{\kappa}},$$
we get locally R-linear convergence

$$\| x^{(k)} - x^\ast \| \le c \rho^k$$

with probability $(1 - \delta)^{2k}$.

Problem-independent local convergence rate

Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any $x^{(0)}$:
we have linear convergence,
after a certain number of iterations, we get “problem-independent” linear convergence,
after a certain number of iterations, the step size $\alpha^{(k)} = 1$ passes the Armijo rule for all subsequent iterations,
upon “STOP” at iteration k, we have $\|\nabla F(x^{(k)})\| < (1 + \sigma) \sqrt{\rho^k} \, \epsilon_g$.

Newton-type Methods for Non-Smooth Problems

Convex Composite Problem
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x)$$

F: convex and smooth
R: convex and non-smooth

Non-Smooth Newton-type [Yu et al., 2010]

Let $\mathcal{X} = \mathbb{R}^d$, i.e., unconstrained optimization.

Iterative Scheme (Non-Quadratic Sub-Problem)
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \Bigg\{ \sup_{g^{(k)} \in \partial (F + R)(x^{(k)})} p^T g^{(k)} + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\text{e.g., QN}} p \Bigg\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \Bigg\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T \underbrace{H^{(k)}}_{\approx \nabla^2 F(x^{(k)})} p + R(x^{(k)} + p) \Bigg\}, \qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$$

Notable examples of proximal Newton methods:

glmnet [Friedman et al., 2010]: ℓ1-regularized GLMs; sub-problems are solved using coordinate descent

QUIC [Hsieh et al., 2014]: graphical lasso problem with factorization tricks; sub-problems are solved using coordinate descent

Inexactness of the sub-problem solver
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p) \right\}$$

$p^\star$ is the minimizer of the above sub-problem iff $p^\star = \mathrm{Prox}_\lambda^{(k)}(p^\star)$, where
$$\mathrm{Prox}_\lambda^{(k)}(p) \triangleq \operatorname*{argmin}_{q \in \mathbb{R}^d} \left\{ \big\langle \nabla F(x^{(k)}) + H^{(k)} p, \, q \big\rangle + R(x^{(k)} + q) + \frac{\lambda}{2} \| q - p \|^2 \right\}.$$

Inexactness:
$$\big\| p^{(k)} - \mathrm{Prox}_\lambda^{(k)}(p^{(k)}) \big\| \le \theta \big\| \mathrm{Prox}_\lambda^{(k)}(0) \big\|,$$
and some sufficient decrease condition on the sub-problem.
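To make the sub-problem concrete, here is a minimal sketch for R(x) = λ‖x‖₁ that approximately solves the quadratic-plus-ℓ1 model with proximal-gradient (ISTA) inner iterations; this is just one possible inner solver, not the one prescribed by the cited works, and all names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox of t * ||.||_1 (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_subproblem(grad, H, x, lam, inner_iters=200):
    """Approximately solve  min_p  p^T grad + 0.5 p^T H p + lam * ||x + p||_1
    by proximal gradient on the shifted variable q = x + p."""
    t = 1.0 / np.linalg.norm(H, 2)     # step size 1 / lambda_max(H)
    q = x.copy()                        # corresponds to starting from p = 0
    for _ in range(inner_iters):
        q = soft_threshold(q - t * (grad + H @ (q - x)), t * lam)
    return q - x                        # return p = q - x
```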

Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step size: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} p^{(k)}$

Line-Search
$$(F + R)(x^{(k)} + \alpha p^{(k)}) - (F + R)(x^{(k)}) \le \beta \left( \ell_k\big( x^{(k)} + \alpha p^{(k)} \big) - \ell_k\big( x^{(k)} \big) \right),$$
where
$$\ell_k(x) = F(x^{(k)}) + (x - x^{(k)})^T \nabla F(x^{(k)}) + R(x).$$

Global convergence: $x^{(k)}$ converges to an optimal solution starting at any $x^{(0)}$.
Local convergence, for $x^{(0)}$ close enough to $x^\star$:
With $H^{(k)} = \nabla^2 F(x^{(k)})$: an exact solve gives quadratic convergence; an approximate solve with a decaying forcing term gives superlinear convergence; an approximate solve with a fixed forcing term gives linear convergence.
With $H^{(k)}$ satisfying the Dennis-Moré condition: superlinear convergence.

Sub-Sampled Proximal Newton-type [Liu et al., 2017]

FSM/ERM
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x) = \frac{1}{n} \sum_{i=1}^n f_i(a_i^T x) + R(x)$$

$f_i$: strongly convex, smooth, and self-concordant
R: convex and non-smooth

n ≫ 1 and/or d ≫ 1

Dennis-Moré condition:

$$\big| p^{(k)T} \big( H^{(k)} - \nabla^2 F(x^{(k)}) \big) p^{(k)} \big| \le \eta_k \, p^{(k)T} \nabla^2 F(x^{(k)}) p^{(k)}$$

Leverage score sampling to ensure the Dennis-Moré condition

Inexact sub-problem solver

Finite Sum / Empirical Risk Minimization

Self-Concordant

$$\big| v^T \nabla^3 f(x)[v] \, v \big| \le M \big( v^T \nabla^2 f(x) v \big)^{3/2}, \quad \forall x, v \in \mathbb{R}^d$$

M = 2 is called standard self-concordant.

Theorem ([Zhang and Lin, 2015])
Suppose there exist $\gamma > 0$ and $\eta \in [0, 1)$ such that $|f_i'''(t)| \le \gamma \big( f_i''(t) \big)^{1 - \eta}$. Then
$$F(x) = \frac{1}{n} \sum_{i=1}^n f_i(a_i^T x) + \frac{\lambda}{2} \| x \|^2$$
is self-concordant with
$$M = \frac{\gamma \max_i \| a_i \|^{1 + 2\eta}}{\lambda^{\eta + 1/2}}.$$

Proximal Newton-type Methods

General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]
Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]
Self-concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]

Semi-Smooth Newton-type Methods

Convex Composite Problem
$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^d} F(x) + R(x)$$

F: (non-)convex and (non-)smooth
R: convex and non-smooth

Recall: Proximal Newton-type Methods
$$p^{(k)} \approx \operatorname*{argmin}_{p \in \mathbb{R}^d} \left\{ p^T \nabla F(x^{(k)}) + \frac{1}{2} p^T H^{(k)} p + R(x^{(k)} + p) \right\} \equiv \mathrm{prox}_R^{H^{(k)}}\Big( x^{(k)} - \big( H^{(k)} \big)^{-1} \nabla F(x^{(k)}) \Big) - x^{(k)},$$
where
$$\mathrm{prox}_R^{H^{(k)}}(x) \triangleq \operatorname*{argmin}_{y \in \mathbb{R}^d} \; R(y) + \frac{1}{2} \| y - x \|_{H^{(k)}}^2, \qquad \| y - x \|_H^2 = \langle y - x, \, H (y - x) \rangle.$$

Recall Newton's method for smooth root-finding problems:

$$r(x) = 0 \;\Longrightarrow\; J(x^{(k)}) p^{(k)} = -r(x^{(k)}) \;\Longrightarrow\; x^{(k+1)} = x^{(k)} + p^{(k)}$$

Key Idea
For solving non-linear, non-smooth systems of equations, replace the Jacobian in the Newton iteration by an element of Clarke's generalized Jacobian, which may be nonempty (e.g., when r is locally Lipschitz continuous) even if the true Jacobian does not exist.

Example:
$$r(x) = \mathrm{prox}_R^{B}\big( x - B^{-1} \nabla F(x) \big) - x$$

Clarke's Generalized Jacobian
Let $r : \mathbb{R}^d \to \mathbb{R}^p$ be locally Lipschitz continuous at x (by Rademacher's theorem, r is then almost everywhere differentiable). Let $D_r$ be the set of points at which r is differentiable. The Bouligand subdifferential of r at x is defined as
$$\partial_B r(x) \triangleq \left\{ \lim_{k \to \infty} r'(x_k) \;\middle|\; x_k \in D_r, \; x_k \to x \right\}.$$

Clarke's generalized Jacobian is $\partial r(x) = \mathrm{Conv\text{-}hull}\big( \partial_B r(x) \big)$.

Semi-Smooth Map
$r : \mathbb{R}^d \to \mathbb{R}^p$ is semi-smooth at x if: r is locally Lipschitz continuous; r is directionally differentiable at x; and
$$\| r(x + v) - r(x) - J v \| = o(\| v \|), \qquad v \in \mathbb{R}^d, \; J \in \partial r(x + v).$$

Example: piecewise continuously differentiable functions. A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.

First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,

$$r^B(x) \triangleq x - \mathrm{prox}_R^{B}\big( x - B^{-1} \nabla F(x) \big) = 0, \qquad B \succ 0.$$

A (stochastic) semi-smooth Newton method is used to (approximately) solve these systems of nonlinear equations, e.g., as in [Milzarek et al., 2018]:

$$\hat{r}^B(x) \triangleq x - \mathrm{prox}_R^{B}\big( x - B^{-1} g(x) \big),$$
$$J^{(k)} p^{(k)} = -\hat{r}^B(x^{(k)}), \qquad J^{(k)} \in \mathcal{J}_k^B,$$
where
$$\mathcal{J}_k^B \triangleq \left\{ J \in \mathbb{R}^{d \times d} \;\middle|\; J = (I - A) + A B^{-1} H^{(k)}, \; A \in \partial \, \mathrm{prox}_R^{B}\big( u(x) \big) \right\}, \qquad u(x) \triangleq x - B^{-1} g(x).$$
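To make the residual r^B and the generalized Jacobian set above concrete, here is a minimal deterministic sketch for R(x) = λ‖x‖₁ with B = (1/t) I, where the prox is soft thresholding and an element of its generalized Jacobian is a 0/1 diagonal matrix; the plain linear solve and all names are illustrative choices under those assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semismooth_newton_l1(grad_F, hess_F, x0, lam, t=1.0, iters=50, tol=1e-8):
    """Semi-smooth Newton on r(x) = x - soft_threshold(x - t * grad_F(x), t * lam),
    i.e. B = (1/t) I. An element A of the generalized Jacobian of the prox is the
    0/1 diagonal matrix with A_ii = 1 iff |u_i| > t * lam, where u = x - t * grad_F(x)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        u = x - t * grad_F(x)
        r = x - soft_threshold(u, t * lam)             # residual r^B(x)
        if np.linalg.norm(r) <= tol:
            break
        a = (np.abs(u) > t * lam).astype(float)        # diagonal of A in dprox(u)
        J = np.diag(1.0 - a) + a[:, None] * (t * hess_F(x))   # J = (I - A) + A B^{-1} H
        p = np.linalg.solve(J, -r)                     # Newton-type step
        x = x + p
    return x
```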

Semi-smoothness of proximal mapping does not hold in general.

In certain settings, these vector-valued functions are monotone and can be semi-smooth due to the properties of the proximal mappings.

In such cases, their generalized Jacobian matrix is positive semi-definite due to monotonicity.

Lemma
For a monotone and Lipschitz continuous mapping $r : \mathbb{R}^d \to \mathbb{R}^d$ and any $x \in \mathbb{R}^d$, each element of $\partial_B r(x)$ is positive semi-definite.

General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]
Tailored to a specific problem: [Yuan et al., 2018]
Stochastic: [Milzarek et al., 2018]

THANK YOU!

References

Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.

Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.

Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.

Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.

Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.

Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.

Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.

Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.

Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.

Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.

Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for l1-optimization. SIAM Journal on Optimization, 24(1):298–333.

Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.

Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.

Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 2358–2363. IEEE.

Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.

Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737.

Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738.

Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.

Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).

Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.

Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.

Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.

Xu, P., Yang, J., Roosta, F., Ré, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.

Yu, J., Vishwanathan, S., Günter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.

Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.

Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.