ℓ1 and ℓ2 Regularization

David Rosenberg

New York University

July 26, 2017


Tikhonov and Ivanov Regularization

Hypothesis Spaces

We’ve spoken vaguely about “bigger” and “smaller” hypothesis spaces.
In practice, it’s convenient to work with a nested sequence of spaces:

F_1 ⊂ F_2 ⊂ ··· ⊂ F_n ⊂ ··· ⊂ F

Polynomial Functions
F = {all polynomial functions}

F_d = {all polynomials of degree ≤ d}

Complexity Measures for Decision Functions

Number of variables / features
Depth of a decision tree
Degree of polynomial
How about for linear models?

ℓ0 complexity: number of non-zero coefficients
ℓ1 (“lasso”) complexity: Σ_{i=1}^d |w_i|, for coefficients w_1, ..., w_d
ℓ2 (“ridge”) complexity: Σ_{i=1}^d w_i², for coefficients w_1, ..., w_d

Nested Hypothesis Spaces from Complexity Measure

Hypothesis space: F
Complexity measure Ω : F → [0, ∞)
Consider all functions in F with complexity at most r:
F_r = {f ∈ F | Ω(f) ≤ r}

If Ω is a norm on F, this is a ball of radius r in F.
Increasing the complexity r = 0, 1.2, 2.6, 5.4, ... gives nested spaces:

F0 ⊂ F1.2 ⊂ F2.6 ⊂ F5.4 ⊂ ··· ⊂ F

Constrained Empirical Risk Minimization

Constrained ERM (Ivanov regularization)
For complexity measure Ω : F → [0, ∞) and fixed r > 0,

min_{f ∈ F} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)
s.t. Ω(f) ≤ r

Choose r using validation data or cross-validation.
Each r corresponds to a different hypothesis space.
Could also write:
min_{f ∈ F_r} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)

Penalized Empirical Risk Minimization

Penalized ERM (Tikhonov regularization)
For complexity measure Ω : F → [0, ∞) and fixed λ > 0,

min_{f ∈ F} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) + λΩ(f)

Choose λ using validation data or cross-validation.
(Ridge regression in Homework #1 is of this form.)

Ivanov vs Tikhonov Regularization

Let L : F → R be any performance measure of f; e.g., L(f) could be the empirical risk of f.
For many L and Ω, Ivanov and Tikhonov are “equivalent”. What does this mean?
Any solution f* you could get from Ivanov, you can also get from Tikhonov.
Any solution f* you could get from Tikhonov, you can also get from Ivanov.
In practice, both approaches are effective.
Tikhonov is convenient because it’s unconstrained minimization.
Proof of equivalence is based on Lagrangian duality – a topic of Lecture 3.

Ivanov vs Tikhonov Regularization (Details)

Ivanov and Tikhonov regularization are equivalent if:
1. For any choice of r > 0, the Ivanov solution
   f_r^* = argmin_{f ∈ F} L(f) s.t. Ω(f) ≤ r
   is also a Tikhonov solution for some λ > 0. That is, ∃λ > 0 such that
   f_r^* = argmin_{f ∈ F} (L(f) + λΩ(f)).
2. Conversely, for any choice of λ > 0, the Tikhonov solution
   f_λ^* = argmin_{f ∈ F} (L(f) + λΩ(f))
   is also an Ivanov solution for some r > 0. That is, ∃r > 0 such that
   f_λ^* = argmin_{f ∈ F} L(f) s.t. Ω(f) ≤ r.


ℓ1 and ℓ2 Regularization

Linear Regression

Consider linear models

F = {f : R^d → R | f(x) = wᵀx for w ∈ R^d}

Loss: ℓ(ŷ, y) = (y − ŷ)²

Training data D_n = ((x_1, y_1), ..., (x_n, y_n)). Linear regression is ERM for ℓ over F:

ŵ = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)²

Can overfit when d is large compared to n.
e.g.: d ≫ n is very common in Natural Language Processing problems (e.g. 1M features for 10K documents).

Ridge Regression: Workhorse of Modern Data Science

Ridge Regression (Tikhonov Form)
The ridge regression solution for regularization parameter λ > 0 is
ŵ = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₂²,
where ‖w‖₂² = w_1² + ··· + w_d² is the square of the ℓ2-norm.

Ridge Regression (Ivanov Form)
The ridge regression solution for complexity parameter r > 0 is
ŵ = argmin_{‖w‖₂² ≤ r²} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)².
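For concreteness, a minimal NumPy sketch of the Tikhonov form: setting the gradient of the penalized objective to zero gives the normal equations (XᵀX + nλI)w = Xᵀy. The data and the value of lam below are synthetic placeholders.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n)*||Xw - y||^2 + lam*||w||_2^2 via the normal equations."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Toy usage on synthetic data (all values are placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
w_hat = ridge_closed_form(X, y, lam=0.1)
```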

How does the ℓ2 penalty induce “regularity”?

Let f̂(x) = ŵᵀx be a solution to the Ivanov form of ridge regression:

ŵ = argmin_{‖w‖₂² ≤ r²} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)².

Suppose x and x′ are “close”, that is, ‖x − x′‖₂ < ε.

Then f̂(x) and f̂(x′) are also “close” if ‖ŵ‖₂ ≤ r is small:

|f̂(x) − f̂(x′)| = |ŵᵀx − ŵᵀx′| = |ŵᵀ(x − x′)| ≤ ‖ŵ‖₂ ‖x − x′‖₂ (Cauchy–Schwarz inequality)
So f̂ is Lipschitz continuous with Lipschitz constant ‖ŵ‖₂ ≤ r.
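A quick numerical check of the Cauchy–Schwarz bound above (all vectors below are arbitrary synthetic values):

```python
import numpy as np

rng = np.random.default_rng(0)
w_hat = rng.normal(size=10)
x, x_prime = rng.normal(size=10), rng.normal(size=10)

lhs = abs(w_hat @ x - w_hat @ x_prime)                     # |f(x) - f(x')|
rhs = np.linalg.norm(w_hat) * np.linalg.norm(x - x_prime)  # ||w||_2 * ||x - x'||_2
assert lhs <= rhs + 1e-12                                  # Cauchy-Schwarz bound holds
```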

Ridge Regression: Regularization Path

β̃ is the unregularized solution; β̂ is the ridge solution. (Which side has more regularization?)
Plot from Hastie, Tibshirani, and Wainwright’s Statistical Learning with Sparsity, Figure 2.1. Data about predicting crime rate in 50 US cities.

Lasso Regression: Workhorse (2) of Modern Data Science

Lasso Regression (Tikhonov Form)
The lasso regression solution for regularization parameter λ > 0 is
ŵ = argmin_{w ∈ R^d} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₁,
where ‖w‖₁ = |w_1| + ··· + |w_d| is the ℓ1-norm.
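In practice one would typically call an off-the-shelf solver; a sketch using scikit-learn (assuming it is available) is below. Note that scikit-learn’s Lasso minimizes (1/(2n))‖y − Xw‖₂² + α‖w‖₁, so α = λ/2 corresponds to the (1/n)-scaled objective above. The data are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -3.0, 1.5]                 # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

lam = 0.2
# scikit-learn's Lasso minimizes (1/(2n))*||y - Xw||^2 + alpha*||w||_1,
# so alpha = lam/2 corresponds to the (1/n)-scaled objective above.
model = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```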

Lasso Regression (Ivanov Form)
The lasso regression solution for complexity parameter r > 0 is
ŵ = argmin_{‖w‖₁ ≤ r} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)².

Lasso Regression: Regularization Path

β̃ is the unregularized solution; β̂ is the lasso solution.

Plot from Hastie, Tibshirani, and Wainwright’s Statistical Learning with Sparsity, Figure 2.1.

Ridge vs. Lasso: Regularization Paths

Plot from Hastie, Tibshirani, and Wainwright’s Statistical Learning with Sparsity, Figure 2.1.

Lasso Gives Feature Sparsity: So What?

Coefficients that are 0 =⇒ don’t need those features. What’s the gain?
Time/expense to compute/buy features
Memory to store features (e.g. real-time deployment)
Identifies the important features
Better prediction? Sometimes.
As a feature-selection step for training a slower non-linear model

Ivanov and Tikhonov Equivalent?

For ridge regression and lasso regression (and much more), the Ivanov and Tikhonov formulations are equivalent. [Homework assignment 3 or 4.] We will use whichever form is most convenient.


The ℓ1 and ℓ2 Norm Constraints

For visualization, restrict to 2-dimensional input space

F = {f(x) = w_1 x_1 + w_2 x_2} (linear hypothesis space)
Represent F by (w_1, w_2) ∈ R².
ℓ2 contour: w_1² + w_2² = r
ℓ1 contour: |w_1| + |w_2| = r

Where are the “sparse” solutions?
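A small matplotlib sketch (assuming matplotlib is available) that draws the two families of contours for a few arbitrary radii:

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.contour(w1, w2, np.abs(w1) + np.abs(w2), levels=[0.5, 1.0, 1.5])  # l1 "diamonds"
ax1.set_title("l1 contours: |w1| + |w2| = r")
ax2.contour(w1, w2, w1**2 + w2**2, levels=[0.5, 1.0, 1.5])            # l2 circles
ax2.set_title("l2 contours: w1^2 + w2^2 = r")
for ax in (ax1, ax2):
    ax.set_aspect("equal")
plt.show()
```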


The Famous Picture for ℓ1 Regularization

f_r^* = argmin_{w ∈ R²} (1/n) Σ_{i=1}^n (wᵀx_i − y_i)² subject to |w_1| + |w_2| ≤ r

Red lines: contours of R̂_n(w) = Σ_{i=1}^n (wᵀx_i − y_i)².
Blue region: area satisfying the complexity constraint |w_1| + |w_2| ≤ r.
KPM Fig. 13.3

The Empirical Risk for Square Loss

Denote the empirical risk of f(x) = wᵀx by

R̂_n(w) = (1/n)‖Xw − y‖²,
where X is the design matrix.
R̂_n is minimized by ŵ = (XᵀX)⁻¹Xᵀy, the OLS solution.

What does R̂_n look like around ŵ?


By “completing the square”, we can show for any w ∈ R^d:
R̂_n(w) = (1/n)(w − ŵ)ᵀXᵀX(w − ŵ) + R̂_n(ŵ)

The set of w with R̂_n(w) exceeding R̂_n(ŵ) by c > 0 is

{w | R̂_n(w) = c + R̂_n(ŵ)} = {w | (w − ŵ)ᵀXᵀX(w − ŵ) = nc},

which is an ellipsoid centered at ŵ. We’ll derive this in homework #2.
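In the meantime, a quick numerical sanity check of the identity on synthetic data (a sketch, not the homework derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)       # OLS minimizer of R_n

def emp_risk(w):
    return np.sum((X @ w - y) ** 2) / n

w = rng.normal(size=d)                           # an arbitrary point
lhs = emp_risk(w)
rhs = (w - w_ols) @ (X.T @ X) @ (w - w_ols) / n + emp_risk(w_ols)
assert np.isclose(lhs, rhs)                      # the completing-the-square identity
```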


The Famous Picture for ℓ2 Regularization

f_r^* = argmin_{w ∈ R²} Σ_{i=1}^n (wᵀx_i − y_i)² subject to w_1² + w_2² ≤ r

Red lines: contours of R̂_n(w) = Σ_{i=1}^n (wᵀx_i − y_i)².
Blue region: area satisfying the complexity constraint w_1² + w_2² ≤ r.
KPM Fig. 13.3

Why are Lasso Solutions Often Sparse?

Suppose the design matrix X is orthogonal, so XᵀX = I, and the contours are circles.

Then an OLS solution in the green or red regions implies the ℓ1-constrained solution will be at a corner.

Fig. from Mairal et al.’s Sparse Modeling for Image and Vision Processing, Fig. 1.6.

ℓq Even Sparser

Suppose the design matrix X is orthogonal, so XᵀX = I, and the contours are circles.

Then an OLS solution in the green or red regions implies the ℓq-constrained solution will be at a corner.

The ℓq-ball constraint is not convex, so it is more difficult to optimize.
Fig. from Mairal et al.’s Sparse Modeling for Image and Vision Processing, Fig. 1.9.

The Quora Picture

From Quora: “Why is L1 regularization supposed to lead to sparsity than L2? [sic]”

Does this picture have any interpretation that makes sense? (Aren’t those lines supposed to be ellipses?) Yes... we can revisit.

Figure from https://www.quora.com/Why-is-L1-regularization-supposed-to-lead-to-sparsity-than-L2.

Finding the Lasso Solution

How to find the Lasso solution?

How to solve the Lasso?
min_{w ∈ R^d} Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₁
‖w‖₁ is not differentiable!

Splitting a Number into Positive and Negative Parts

Consider any number a ∈ R.
Let the positive part of a be a⁺ = a·1(a ≥ 0).
Let the negative part of a be a⁻ = −a·1(a ≤ 0).
Do you see why a⁺ ≥ 0 and a⁻ ≥ 0?
How do you write a in terms of a⁺ and a⁻?
How do you write |a| in terms of a⁺ and a⁻?
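If you want to verify your answers numerically, here is a small NumPy sketch of the identities these questions point to:

```python
import numpy as np

a = np.array([3.0, -1.5, 0.0])
a_pos = np.maximum(a, 0)        # a+ = a * 1(a >= 0)
a_neg = np.maximum(-a, 0)       # a- = -a * 1(a <= 0)

assert np.allclose(a_pos - a_neg, a)          # a   = a+ - a-
assert np.allclose(a_pos + a_neg, np.abs(a))  # |a| = a+ + a-
```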

How to find the Lasso solution?

The Lasso problem:
min_{w ∈ R^d} Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₁
Replace each w_i by w_i⁺ − w_i⁻.
Write w⁺ = (w_1⁺, ..., w_d⁺) and w⁻ = (w_1⁻, ..., w_d⁻).

The Lasso as a Quadratic Program

We will show: substituting w = w⁺ − w⁻ and |w| = w⁺ + w⁻ gives the equivalent Lasso problem:

min_{w⁺, w⁻} Σ_{i=1}^n ((w⁺ − w⁻)ᵀx_i − y_i)² + λ 1ᵀ(w⁺ + w⁻)
subject to w_i⁺ ≥ 0 for all i, w_i⁻ ≥ 0 for all i

Objective is differentiable (in fact, convex and quadratic).
2d variables vs d variables, and 2d constraints vs no constraints.
A “quadratic program”: a convex quadratic objective with linear constraints.
Could plug this into a generic QP solver (sketched below).
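For example, a minimal sketch handing this QP to cvxpy as the generic solver (assuming cvxpy is installed; the data below are synthetic placeholders):

```python
import numpy as np
import cvxpy as cp

def lasso_qp(X, y, lam):
    """Solve the lasso via the (w+, w-) quadratic program with a generic solver."""
    n, d = X.shape
    w_pos = cp.Variable(d, nonneg=True)   # w+
    w_neg = cp.Variable(d, nonneg=True)   # w-
    objective = cp.Minimize(
        cp.sum_squares(X @ (w_pos - w_neg) - y) + lam * cp.sum(w_pos + w_neg)
    )
    cp.Problem(objective).solve()
    return w_pos.value - w_neg.value      # recover w = w+ - w-

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.normal(size=100)
w_hat = lasso_qp(X, y, lam=5.0)
```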


The Lasso problem is trivially equivalent to the following:

min_w min_{a,b} Σ_{i=1}^n ((a − b)ᵀx_i − y_i)² + λ 1ᵀ(a + b)
subject to a_i ≥ 0 for all i, b_i ≥ 0 for all i,
a − b = w, a + b = |w|

Claim: We don’t need the constraint a + b = |w|.
a′ ← a − min(a, b) and b′ ← b − min(a, b) (componentwise) is at least as good: a′ − b′ = a − b, and a′ + b′ ≤ a + b, so the penalty can only decrease.
So if a and b are minimizers, at least one of a_i, b_i is 0 for each i.
Since a − b = w, we must have a = w⁺ and b = w⁻. So also a + b = |w|.


min_w min_{a,b} Σ_{i=1}^n ((a − b)ᵀx_i − y_i)² + λ 1ᵀ(a + b)
subject to a_i ≥ 0 for all i, b_i ≥ 0 for all i,
a − b = w

Claim: We don’t need the constraint a − b = w. If we let a and b vary freely, they’ll hit all possible w’s via w = a − b.


So the lasso optimization problem is equivalent to

min_{a,b} Σ_{i=1}^n ((a − b)ᵀx_i − y_i)² + λ 1ᵀ(a + b)
subject to a_i ≥ 0 for all i, b_i ≥ 0 for all i.
QED

Projected SGD

min_{w⁺, w⁻ ∈ R^d} Σ_{i=1}^n ((w⁺ − w⁻)ᵀx_i − y_i)² + λ 1ᵀ(w⁺ + w⁻)
subject to w_i⁺ ≥ 0 for all i, w_i⁻ ≥ 0 for all i

Just like SGD, but after each step, project w⁺ and w⁻ into the constraint set.
In other words, if any component of w⁺ or w⁻ becomes negative, set it back to 0.
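A minimal NumPy sketch of this projected SGD, splitting the λ 1ᵀ(w⁺ + w⁻) penalty evenly across the n examples so the per-example gradients sum to the full objective’s gradient; the step size and epoch count are arbitrary placeholders:

```python
import numpy as np

def lasso_projected_sgd(X, y, lam, lr=1e-3, epochs=50, seed=0):
    """Projected SGD on the (w+, w-) formulation of the lasso."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_pos, w_neg = np.zeros(d), np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            resid = (w_pos - w_neg) @ X[i] - y[i]
            # Stochastic gradient of ((w+ - w-)^T x_i - y_i)^2 + (lam/n)*1^T(w+ + w-),
            # which sums over i to the full objective.
            g_pos = 2 * resid * X[i] + lam / n
            g_neg = -2 * resid * X[i] + lam / n
            w_pos -= lr * g_pos
            w_neg -= lr * g_neg
            # Projection step: clip any negative components back to 0.
            np.maximum(w_pos, 0, out=w_pos)
            np.maximum(w_neg, 0, out=w_neg)
    return w_pos - w_neg
```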

Coordinate Descent Method

Goal: Minimize L(w) = L(w_1, ..., w_d) over w = (w_1, ..., w_d) ∈ R^d.
In gradient descent or SGD, each step potentially changes all entries of w.
In each step of coordinate descent,

we adjust only a single w_i.
In each step, solve
w_i^new = argmin_{w_i} L(w_1, ..., w_{i−1}, w_i, w_{i+1}, ..., w_d)

Solving this argmin may itself be an iterative process.

Coordinate descent is great when it’s easy or easier to minimize w.r.t. one coordinate at a time


Coordinate Descent Method
Goal: Minimize L(w) = L(w_1, ..., w_d) over w = (w_1, ..., w_d) ∈ R^d.
Initialize w^(0) = 0.
while not converged:
    Choose a coordinate j ∈ {1, ..., d}
    w_j^new ← argmin_{w_j} L(w_1^(t), ..., w_{j−1}^(t), w_j, w_{j+1}^(t), ..., w_d^(t))
    w_j^(t+1) ← w_j^new and w_i^(t+1) ← w_i^(t) for i ≠ j
    t ← t + 1

Random coordinate choice =⇒ stochastic coordinate descent
Cyclic coordinate choice =⇒ cyclic coordinate descent
In general, we will adjust each coordinate several times.
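A generic sketch of stochastic coordinate descent, using SciPy’s 1-d minimizer for the inner argmin (assuming SciPy is available); the step count is an arbitrary placeholder:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(L, d, n_steps=1000, seed=0):
    """Stochastic coordinate descent: repeatedly re-minimize a single coordinate."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                            # w^(0) = 0
    for _ in range(n_steps):
        j = rng.integers(d)                    # random coordinate choice
        def L_j(wj):                           # objective as a function of w_j alone
            w_try = w.copy()
            w_try[j] = wj
            return L(w_try)
        w[j] = minimize_scalar(L_j).x          # the inner 1-d argmin (itself iterative)
    return w

# e.g. least squares: coordinate_descent(lambda w: np.sum((X @ w - y) ** 2), d=X.shape[1])
```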

Coordinate Descent Method for Lasso

Why mention coordinate descent for Lasso?
In Lasso, the coordinate minimization has a closed-form solution!


Closed Form Coordinate Minimization for Lasso
ŵ_j = argmin_{w_j ∈ R} Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₁
Then
ŵ_j = (c_j + λ)/a_j   if c_j < −λ
ŵ_j = 0               if c_j ∈ [−λ, λ]
ŵ_j = (c_j − λ)/a_j   if c_j > λ

where
a_j = 2 Σ_{i=1}^n x_{i,j}²,   c_j = 2 Σ_{i=1}^n x_{i,j} (y_i − w_{−j}ᵀ x_{i,−j}),
and w_{−j} is w without component j, and similarly for x_{i,−j}.
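Putting the closed-form update into code, a minimal cyclic coordinate descent sketch for the lasso (the sweep count is an arbitrary placeholder; in practice one would also check convergence):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for sum_i (w^T x_i - y_i)^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    a = 2 * np.sum(X ** 2, axis=0)               # a_j = 2 * sum_i x_{i,j}^2
    for _ in range(n_sweeps):
        for j in range(d):
            # r_i = y_i - w_{-j}^T x_{i,-j}  (residual with coordinate j removed)
            r = y - X @ w + X[:, j] * w[j]
            c_j = 2 * (X[:, j] @ r)              # c_j = 2 * sum_i x_{i,j} * r_i
            if c_j < -lam:
                w[j] = (c_j + lam) / a[j]
            elif c_j > lam:
                w[j] = (c_j - lam) / a[j]
            else:
                w[j] = 0.0
    return w
```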

Coordinate Descent: When does it work?

Suppose we’re minimizing f : R^d → R. Sufficient conditions:
1. f is continuously differentiable, and
2. f is strictly convex in each coordinate.

But the lasso objective
Σ_{i=1}^n (wᵀx_i − y_i)² + λ‖w‖₁
is not differentiable...
Luckily there are weaker conditions...

Coordinate Descent: The Separability Condition

Theorem (Tseng 1988)
If the objective f has the following structure

f(w_1, ..., w_d) = g(w_1, ..., w_d) + Σ_{j=1}^d h_j(w_j),
where g : R^d → R is differentiable and convex, and

each h_j : R → R is convex (but not necessarily differentiable), then the coordinate descent algorithm converges to the global minimum.

Tseng 1988: “Coordinate ascent for maximizing nondifferentiable concave functions”, Technical Report LIDS-P.

Coordinate Descent Method – Variation

Suppose there’s no closed form? (e.g., ...) Do we really need to fully solve each inner minimization problem?

A single projected gradient step is enough for ℓ1 regularization! Shalev-Shwartz & Tewari’s “Stochastic Methods...” (2011)

Stochastic Coordinate Descent for Lasso – Variation

Let w̃ = (w⁺, w⁻) ∈ R^{2d} and

L(w̃) = Σ_{i=1}^n ((w⁺ − w⁻)ᵀx_i − y_i)² + λ 1ᵀ(w⁺ + w⁻)

Stochastic Coordinate Descent for Lasso – Variation
Goal: Minimize L(w̃) s.t. w_i⁺, w_i⁻ ≥ 0 for all i.
Initialize w̃^(0) = 0.
while not converged:
    Randomly choose a coordinate j ∈ {1, ..., 2d}
    w̃_j ← w̃_j + max(−w̃_j, −∇_j L(w̃))
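A sketch of this variation in NumPy. The update above corresponds to lr=1.0 below; in practice a smaller step (e.g. a 1/β factor for β-smooth coordinates, as in Shalev-Shwartz & Tewari) is typically needed for stability.

```python
import numpy as np

def lasso_stochastic_cd(X, y, lam, n_steps=5000, lr=1.0, seed=0):
    """Stochastic coordinate descent on w_tilde = (w+, w-): one projected
    gradient step per randomly chosen coordinate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(2 * d)                    # first d entries: w+, last d: w-
    for _ in range(n_steps):
        j = rng.integers(2 * d)
        sign, col = (1.0, j) if j < d else (-1.0, j - d)
        resid = X @ (w_tilde[:d] - w_tilde[d:]) - y
        grad_j = sign * 2 * (X[:, col] @ resid) + lam        # dL/dw_tilde_j
        # Projected step: never move a coordinate below 0.
        w_tilde[j] = w_tilde[j] + max(-w_tilde[j], -lr * grad_j)
    return w_tilde[:d] - w_tilde[d:]
```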
