
ScAi Lab Study Group

Zhiping (Patricia) Xiao University of California, Los Angeles

Winter 2020

Outline

Introduction: Convex Optimization
- Convexity
- Convex Functions' Properties
- Definition of Convex Optimization

Convex Optimization
- General Strategy
- Learning Algorithms
- Convergence Analysis
- Examples

Source of Materials

Textbook:
- Convex Optimization and Introduction to Applied Linear Algebra by Prof. Boyd and Prof. Vandenberghe

Course Materials:
- ECE236B, ECE236C offered by Prof. Vandenberghe
- CS260 Lecture 12 offered by Prof. Quanquan Gu

Notes:
- My previous ECE236B notes and ECE236C final report
- My previous CS260 Cheat Sheet

Related Papers:
- Accelerated methods for nonconvex optimization
- Lipschitz regularity of deep neural networks: analysis and efficient estimation

Introduction: Convex Optimization

Notations

- iff: if and only if
- R_+ = {x ∈ R | x ≥ 0}
- R_++ = {x ∈ R | x > 0}
- int K: the interior of K, not its boundary
- Generalized inequalities (textbook 2.4), based on a proper cone K (convex, closed, solid, and pointed: if x ∈ K and −x ∈ K then x = 0):
  - x ⪯_K y ⟺ y − x ∈ K
  - x ≺_K y ⟺ y − x ∈ int K
- Positive semidefinite matrix X ∈ S^n_+: ∀y ∈ R^n, y^T X y ≥ 0 ⟺ X ⪰ 0

Convexity of Set

Set C is convex iff the line segment between any two points in C lies in C, i.e. ∀x1, x2 ∈ C and ∀θ ∈ [0, 1], we have:

θx1 + (1 − θ)x2 ∈ C
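This definition can be checked numerically on concrete sets. Below is a minimal sketch (the disc and annulus membership tests are illustrative choices, not from the slides) that samples pairs of points in a set and tests whether their convex combinations stay inside:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_disc(p):
    # convex set: the unit disc {x : ||x||_2 <= 1}
    return np.linalg.norm(p) <= 1.0

def in_annulus(p):
    # nonconvex set: the ring {x : 0.5 <= ||x||_2 <= 1}
    return 0.5 <= np.linalg.norm(p) <= 1.0

def segment_stays_inside(member, n_pairs=200, n_theta=20):
    """Sample point pairs in the set and test theta*x1 + (1-theta)*x2 membership."""
    pts = rng.uniform(-1, 1, size=(5000, 2))
    pts = [p for p in pts if member(p)][:2 * n_pairs]
    for x1, x2 in zip(pts[::2], pts[1::2]):
        for theta in np.linspace(0.0, 1.0, n_theta):
            if not member(theta * x1 + (1 - theta) * x2):
                return False
    return True

print(segment_stays_inside(in_disc))     # True: the disc is convex
print(segment_stays_inside(in_annulus))  # False: e.g. a near-antipodal pair's midpoint leaves the ring
```

The annulus fails because a chord between two points in the ring can cross the inner hole, exactly the line-segment criterion above.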

Any set C, convex or not, has a convex hull, which is defined as:

conv C = { θ1 x1 + · · · + θk xk | xi ∈ C, θi ≥ 0, i = 1, 2, . . . , k, θ1 + · · · + θk = 1 }

Convexity of Set (Examples from Textbook)

Figure: Left: convex, middle & right: nonconvex.

Figure: Left: convex hull of the points; right: convex hull of the kidney-shaped set above.

Operations that Preserve Convexity of Sets I

The most common operations that preserve convexity of convex sets include:
- Intersection
- Image / inverse image under an affine function
- Cartesian product, Minkowski sum, projection
- Perspective function
- Linear-fractional functions

Operations that Preserve Convexity of Sets II

Convexity is preserved under intersection:

- If S1, S2 are convex sets, then S1 ∩ S2 is also convex.
- If Sα is convex for every α ∈ A, then ∩_{α∈A} Sα is convex.

Proof: the intersection of a collection of convex sets is a convex set. If the intersection is empty, or consists of a single point, it is convex by definition. Otherwise, for any two points A, B in the intersection, segment AB must lie wholly within each set in the collection, hence must lie wholly within their intersection.

Operations that Preserve Convexity of Sets III

An affine function f : R^n → R^m is a sum of a linear function and a constant, i.e., it has the form f(x) = Ax + b, where A ∈ R^{m×n}, b ∈ R^m.

Suppose S ⊆ R^n is convex; then the image of S under f is convex:

f(S) = {f(x) | x ∈ S}

Also, if f : R^m → R^n is an affine function and S ⊆ R^n is convex, the inverse image of S under f is convex:

f^{-1}(S) = {x | f(x) ∈ S}

Examples include scaling αS = {αx | x ∈ S} (α ∈ R) and translation S + a = {x + a | x ∈ S} (a ∈ R^n); both are convex sets when S is convex.

Operations that Preserve Convexity of Sets IV

Proof: the image of convex set S under affine function f(x) = Ax + b is also convex.

If S is empty or contains only one point, then f(S) is obviously convex. Otherwise, take x_S, y_S ∈ f(S), with x_S = f(x) = Ax + b and y_S = f(y) = Ay + b. Then ∀θ ∈ [0, 1], we have:

θ x_S + (1 − θ) y_S = A(θx + (1 − θ)y) + b = f(θx + (1 − θ)y)

Since x, y ∈ S, and S is a convex set, θx + (1 − θ)y ∈ S, and thus f(θx + (1 − θ)y) ∈ f(S).

Operations that Preserve Convexity of Sets V

The Cartesian product of convex sets S1 ⊆ R^n, S2 ⊆ R^m is obviously convex:

S1 × S2 = {(x1, x2) | x1 ∈ S1, x2 ∈ S2}

The Minkowski sum of the two sets is defined as:

S1 + S2 = {x1 + x2 | x1 ∈ S1, x2 ∈ S2}

and it is also obviously convex. The projection of a convex set S ⊆ R^m × R^n onto some of its coordinates is also convex (consider the definition of convexity restricted to those coordinates):

T = {x1 ∈ R^m | (x1, x2) ∈ S for some x2 ∈ R^n}

Operations that Preserve Convexity of Sets VI

We define the perspective function P : R^{n+1} → R^n, with domain dom P = R^n × R_++, as P(z, t) = z/t. The perspective function scales or normalizes vectors so the last component is one, and then drops the last component.

We can interpret the perspective function as the action of a pin-hole camera: a point (x1, x2, x3) passing through a hole at (0, 0, 0) forms an image at −(x1/x3, x2/x3, 1) on the plane x3 = −1. The last component can be dropped, since it is fixed.

Proof: that this operation preserves convexity follows from the fact that affine functions and projections preserve convexity.

Operations that Preserve Convexity of Sets VII

A linear-fractional function f : R^n → R^m is formed by composing the perspective function with an affine function. Consider the affine function g : R^n → R^{m+1} given by:

g(x) = (Ax + b, c^T x + d)

where A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n, d ∈ R. Composing with the perspective function P : R^{m+1} → R^m, we have:

f(x) = (Ax + b)/(c^T x + d), dom f = {x | c^T x + d > 0}

It naturally preserves convexity, because both the affine function and the perspective function preserve convexity.

Convexity of Function

Figure: The three commonly-seen types of convex functions and their relations: strongly convex functions ⊂ strictly convex functions ⊂ convex functions. In brief, strongly convex ⇒ strictly convex ⇒ convex.

Convexity of Function

f : R^n → R is convex iff it satisfies:
- dom f is a convex set;
- ∀x, y ∈ dom f, θ ∈ [0, 1], we have Jensen's inequality:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
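Jensen's inequality can be probed numerically. A minimal sketch (the three test functions are illustrative choices, not from the slides) computes the "Jensen gap", which is nonnegative exactly when the inequality holds:

```python
import numpy as np

def jensen_gap(f, x, y, theta):
    # theta*f(x) + (1-theta)*f(y) - f(theta*x + (1-theta)*y); >= 0 iff Jensen holds
    return theta * f(x) + (1 - theta) * f(y) - f(theta * x + (1 - theta) * y)

thetas = np.linspace(0.1, 0.9, 9)

# f(x) = x^2 is strongly (hence strictly) convex: the gap is strictly positive for x != y
gaps_sq = [jensen_gap(lambda v: v * v, -2.0, 3.0, t) for t in thetas]
print(min(gaps_sq) > 0)   # True

# f(x) = |x| is convex but not strictly convex: on [1, 4] the gap is exactly 0
gaps_abs = [jensen_gap(abs, 1.0, 4.0, t) for t in thetas]
print(max(gaps_abs) == 0)  # True

# f(x) = x^3 is not convex on R: the gap goes negative for negative arguments
gaps_cube = [jensen_gap(lambda v: v ** 3, -3.0, -1.0, t) for t in thetas]
print(min(gaps_cube) < 0)  # True
```

For x^2 the gap works out to 25θ(1 − θ), matching the strict inequality used in the strong-to-strict proof below.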

f is strictly convex iff, when x ≠ y and θ ∈ (0, 1), the above inequality holds strictly.

f is concave when −f is convex, strictly concave when −f is strictly convex, and vice versa.

f is strongly convex iff ∃α > 0 such that f(x) − α‖x‖^2 is convex, where ‖·‖ is any norm.

Convexity of Function

Proof: strongly convex ⇒ strictly convex ⇒ convex.

That all strictly convex functions are convex, and that convex functions are not necessarily strictly convex, follows directly from the definitions. It remains to show that strong convexity implies strict convexity. Strong convexity implies that ∃α > 0 such that ∀x, y ∈ dom f, θ ∈ [0, 1], x ≠ y:

f(θx + (1 − θ)y) − α‖θx + (1 − θ)y‖^2 ≤ θf(x) + (1 − θ)f(y) − θα‖x‖^2 − (1 − θ)α‖y‖^2   (1.1)

Something we didn't prove yet, but which is true and needed for this proof: ‖·‖^2 is strictly convex, i.e. for x ≠ y, θ ∈ (0, 1):

‖θx + (1 − θ)y‖^2 < θ‖x‖^2 + (1 − θ)‖y‖^2

Convexity of Function

(proof continues)

α‖θx + (1 − θ)y‖^2 < θα‖x‖^2 + (1 − θ)α‖y‖^2

t = −α‖θx + (1 − θ)y‖^2 + θα‖x‖^2 + (1 − θ)α‖y‖^2 > 0

(1.1) is equivalent to:

f(θx + (1 − θ)y) + t ≤ θf(x) + (1 − θ)f(y)

where t > 0, thus:

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)

Convexity of Function

Figure: illustration from Prof. Gu's slides. The figure shows a typical convex function f; instead of our x and y, it uses u and v.

Convexity of Function

Commonly-seen univariate convex functions include:
- Constant: C
- Exponential: e^{ax}
- Power function: x^a on R_++ (for a ∈ (−∞, 0] ∪ [1, ∞); for a ∈ [0, 1] it is concave)
- Powers of absolute value: |x|^p (p ≥ 1)
- Negative logarithm: − log(x) (x ∈ R_++)
- x log(x) (x ∈ R_++)
- All norm functions ‖x‖: "the inequality follows from the triangle inequality, and the equality follows from homogeneity of a norm."

Affine Functions are Convex & Concave

An affine function f : R^n → R^m with f(x) = Ax + b, where A ∈ R^{m×n}, b ∈ R^m, is both convex and concave (but neither strictly convex nor strictly concave).

Conversely, all functions that are both convex and concave are affine functions.

Proof: ∀θ ∈ [0, 1], x, y ∈ dom f, we have:

f(θx + (1 − θ)y) = A(θx + (1 − θ)y) + b
                 = θ(Ax + b) + (1 − θ)(Ay + b)
                 = θf(x) + (1 − θ)f(y)

0th-Order Characterization

f is convex iff it is convex when restricted to any line that intersects its domain. In other words, f is convex iff ∀x ∈ dom f and ∀v ∈ R^n, the function

g(t) = f(x + tv), dom g = {t | x + tv ∈ dom f}

is convex.

This property allows us to check convexity of a function by restricting it to a line.

1st-Order Characterization

Suppose f is differentiable (its gradient ∇f exists at each point in dom f, which is open). Then f is convex iff:
- dom f is a convex set;
- ∀x, y ∈ dom f:

f(y) ≥ f(x) + ∇f(x)T (y − x)

It states that, for a convex function, the first-order Taylor approximation near x, namely f(x) + ∇f(x)^T (y − x), is in fact a global underestimator of the function.

Could also be interpreted as “tangents lie below f”.
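The first-order condition is easy to verify numerically. A sketch with a hypothetical positive definite quadratic (my own test case, not from the slides):

```python
import numpy as np

# a hypothetical convex quadratic f(x) = x^T Q x with Q positive definite
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric, eigenvalues > 0

def f(x):
    return x @ Q @ x

def grad_f(x):
    return 2.0 * Q @ x

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    # tangent plane at x must underestimate f at y (up to rounding)
    if f(y) < f(x) + grad_f(x) @ (y - x) - 1e-12:
        ok = False
print(ok)   # True: tangents lie below f
```

For this f the slack is exactly (y − x)^T Q (y − x) ≥ 0, so the check can never fail beyond floating-point noise.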

Proof is on the next page.

1st-Order Characterization: Proof

This proof comes from CVX textbook page 70, 3.1.3.

Let x, y ∈ dom f and t ∈ (0, 1] such that x + t(y − x) ∈ dom f. Then, by convexity, we have:

f(x + t(y − x)) = f((1 − t)x + ty) ≤ (1 − t)f(x) + tf(y)

Rearranging:

tf(y) ≥ (t − 1)f(x) + f(x + t(y − x))

f(y) ≥ f(x) + (f(x + t(y − x)) − f(x)) / t

Taking the limit as t → 0, we have:

f(y) ≥ f(x) + ∇f(x)^T (y − x)

It is not specifically mentioned in the textbook, but this is also referred to as the subgradient inequality elsewhere.

Subgradient

Figure: By using a subgradient g ∈ ∂f(x) instead of the gradient ∇f(x), where ∀u ∈ dom f, f(u) ≥ f(x) + g^T (u − x), we can handle the cases where the function is not differentiable. ∂f(x) is called the subdifferential: the set of subgradients of f at x.

f is convex iff for every x ∈ dom f, ∂f(x) ≠ ∅. (In Prof. Gu's slides.)

1st-Order Characterization w. Strong Convexity

First we assume that α is the largest value of the parameter before the norm for which strong convexity holds.

Also note that all norms are equivalent (https://math.mit.edu/~stevenj/18.335/norm-equivalence.pdf), meaning that for any two norms ‖·‖_a, ‖·‖_b there exist 0 < C1 ≤ C2 such that for all x:

C1 ‖x‖_b ≤ ‖x‖_a ≤ C2 ‖x‖_b

and thus it is okay to treat ‖·‖ as the ℓ2 norm. Consider the Taylor formula:

f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇^2 f(x) (y − x)

2nd-Order Characterization

We now assume that f is twice differentiable, that is, its Hessian or second derivative ∇^2 f exists at each point in dom f, which is open. Then f is convex iff:
- dom f is convex;
- f's Hessian is positive semidefinite, ∀x ∈ dom f:

∇^2 f(x) ⪰ 0
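Positive semidefiniteness of a concrete Hessian can be checked via its eigenvalues. A sketch (the random least-squares objective is an illustrative choice, not from the slides):

```python
import numpy as np

# f(x) = ||Ax - b||_2^2 has constant Hessian H = 2 A^T A, which is always PSD,
# so f is convex; if A has full column rank, the smallest eigenvalue m of H is
# positive and f is strongly convex with constant m.
rng = np.random.default_rng(2)
A = rng.normal(size=(8, 3))
H = 2.0 * A.T @ A

eigvals = np.linalg.eigvalsh(H)     # eigenvalues of the symmetric matrix H
print(np.all(eigvals >= 0))         # True: Hessian PSD, hence convex
print(eigvals.min() > 0)            # True (full column rank): strongly convex
```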

When f : R → R, this is simply:

f''(x) ≥ 0

(*) When f is strongly convex with constant m:

∇^2 f(x) ⪰ mI, ∀x ∈ dom f

2nd-Order Characterization: More on Bound

For strong convexity, f(x) − α‖x‖^2 is convex, so ∇^2 f(x) − ∇_x^2 (α‖x‖^2) ⪰ 0, i.e.:

∇^2 f(x) ⪰ ∇_x^2 (α‖x‖^2)

and we often take a bound m on ∇_x^2 (α‖x‖^2); for instance, in the case of α‖x‖_2^2, m = 2α. Note that α and m are usually different constants, but that doesn't matter much in practice.

Sublevel Sets & Superlevel Sets

The α-sublevel set of a function f : R^n → R is defined as:

C_α = {x ∈ dom f | f(x) ≤ α}

Sublevel sets of a convex function are convex, for any value of α.

Proof: ∀x, y ∈ C_α we have f(x) ≤ α and f(y) ≤ α, so ∀θ ∈ [0, 1], f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) ≤ α, and hence θx + (1 − θ)y ∈ C_α.

The converse is not true: a function can have all its sublevel sets convex (such functions are called quasiconvex) without being convex itself. E.g. f(x) = −e^x is concave on R, but all its sublevel sets are convex.

If f is concave, then its α-superlevel set is a convex set:

{x ∈ dom f | f(x) ≥ α}

Graph and Epigraph

Figure: The illustration of graph and epigraph from the textbook. The epigraph of f is the shaded region; the graph of f is the dark line.

Graph and Epigraph

The graph of f : R^n → R is a subset of R^{n+1}:

{(x, f(x)) | x ∈ dom f}

The epigraph of f is also a subset of R^{n+1}, defined as:

epi f = {(x, t) | x ∈ dom f, f(x) ≤ t}

The link between convex sets and convex functions is via the epigraph: a function is convex iff its epigraph is a convex set.

Proof of Epigraph's Property I

Statement: A function f is convex iff its epigraph epi f is a convex set.

First, we assume that f is convex and show that epi f is convex.

Take any (x1, y1), (x2, y2) ∈ epi f and θ ∈ [0, 1], and let:

(x̃, ỹ) = θ(x1, y1) + (1 − θ)(x2, y2)

By definition, a point (x, y) ∈ epi f satisfies y ≥ f(x), so y1 ≥ f(x1) and y2 ≥ f(x2).

f is convex, thus:

f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2)

Proof of Epigraph's Property II

Then we have:

ỹ = θy1 + (1 − θ)y2
  ≥ θf(x1) + (1 − θ)f(x2)   (∵ epigraph)
  ≥ f(θx1 + (1 − θ)x2)      (∵ convexity)
  = f(x̃)

Since ỹ ≥ f(x̃), we have (x̃, ỹ) ∈ epi f, and epi f is proved to be convex.

Proof of Epigraph's Property III

Next, we prove that when epi f is convex, f must be convex.

Take any x1, x2 ∈ dom f and θ ∈ [0, 1]; the points (x1, f(x1)) and (x2, f(x2)) must be in epi f (on the graph of f, to be specific). Let:

(x̃, ỹ) = θ(x1, f(x1)) + (1 − θ)(x2, f(x2))

From the convexity of epi f, (x̃, ỹ) must also be in epi f, and thus:

ỹ = θf(x1) + (1 − θ)f(x2) ≥ f(x̃) = f(θx1 + (1 − θ)x2)

This is essentially Jensen's inequality, so f has to be convex.

Why Convex Optimization

Figure labels: (Mathematical) Optimization Problems ⊃ Convex Optimization Problems ⊃ {Least-Squares Problems, Linear Programming Problems}.

Figure: "... least-squares and linear programming problems have a fairly complete theory, arise in a variety of applications, and can be solved numerically very efficiently ... the same can be said for the larger class of convex optimization problems." (from the textbook)

Note that although the figure is drawn this way for clearer visualization, convex optimization covers far more than just these two families; we will see the other families' names later.

Mathematical Optimization Problem

Consider the following mathematical optimization problem (a.k.a. optimization problem):

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, 2, . . . , m

- x = (x1, . . . , xn) ∈ R^n is the optimization variable
- f0 : R^n → R is the objective function
- fi : R^n → R (i = 1, 2, . . . , m) are the constraint functions

A vector x* is called optimal, or a solution of the problem, iff for every z satisfying fi(z) ≤ bi (i = 1, 2, . . . , m), we have f0(z) ≥ f0(x*).

Least-Squares Problems (Linear ver.)

minimize f0(x) = ‖Ax − b‖_2^2 = Σ_{i=1}^k (a_i^T x − b_i)^2

It has no constraints; x ∈ R^n, A ∈ R^{k×n} with k ≥ n, and the a_i^T are the rows of the coefficient matrix A. The solution can be reduced to solving a set of linear equations (the normal equations):

A^T A x = A^T b

We have the analytical solution:

x = (A^T A)^{-1} A^T b
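The normal-equation solution can be cross-checked against a library least-squares solver; a sketch with random data (my own example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 50, 4
A = rng.normal(size=(k, n))     # k >= n, dense, full column rank (almost surely)
b = rng.normal(size=k)

# analytical solution via the normal equations A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# library least-squares solver (better conditioned in practice)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))            # True
# optimality: the residual is orthogonal to the column space of A
print(np.allclose(A.T @ (A @ x_normal - b), 0))  # True
```

The second check is exactly the geometric picture on the next slide: Ax is the orthogonal projection of b onto Col(A).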

It can be solved in approximately O(n^2 k) time if A is dense, and much faster otherwise.

Least-Squares Problems' Solution

Figure: Illustration of how we get the solution of a least-squares problem, with k = 3, n = 1. With Col(A) the set of all vectors of the form Ax (the column space), the closest vector of the form Ax to b is the orthogonal projection of b onto Col(A). Figure from https://textbooks.math.gatech.edu/ila/least-squares.html.

Linear Programming Problems

minimize f0(x) = c^T x
subject to fi(x) = a_i^T x ≤ bi, i = 1, 2, . . . , m

It is called linear programming because the objective (parameterized by c ∈ R^n) and all constraint functions (parameterized by a_i ∈ R^n and b_i ∈ R) are linear.
- No simple analytical solution.
- Cannot give an exact number of arithmetic operations required.
- Many effective methods, including:
  - Dantzig's simplex method (footnote 3)
  - Interior-point methods (most recent)
- Time complexity can be estimated to a given accuracy, usually around O(n^2 m) in practice (assuming m ≥ n).
- Can be extended to convex optimization problems.

3: It's the thing you've been taught in junior high school.

Linear Programming Problems

Many optimization problems can be transformed into an equivalent linear program. For example, the Chebyshev approximation problem:

minimize max_{i=1,...,k} |a_i^T x − b_i|

It can be solved by solving:

minimize t
subject to a_i^T x − t ≤ b_i, i = 1, 2, . . . , k
           −a_i^T x − t ≤ −b_i, i = 1, 2, . . . , k
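This reformulation can be checked with an off-the-shelf LP solver; a sketch using scipy.optimize.linprog on random data (my own example, not from the slides), with stacked variables z = (x, t):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
k, n = 30, 3
A = rng.normal(size=(k, n))
b = rng.normal(size=k)

# minimize t over z = (x, t)
c = np.r_[np.zeros(n), 1.0]
A_ub = np.vstack([np.c_[A, -np.ones(k)],      #  a_i^T x - t <= b_i
                  np.c_[-A, -np.ones(k)]])    # -a_i^T x - t <= -b_i
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))

x_opt, t_opt = res.x[:n], res.x[n]
print(res.success)   # True
# at the optimum, t equals the largest absolute residual
print(np.isclose(t_opt, np.max(np.abs(A @ x_opt - b)), atol=1e-6))
```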

Here, a_i, x ∈ R^n and b_i, t ∈ R.

Convex Optimization Problems

A convex optimization problem is one of the form

minimize f0(x)
subject to fi(x) ≤ bi, i = 1, 2, . . . , m

where the functions f0, f1, . . . , fm : R^n → R are all convex functions. That is, they satisfy:

fi(θx + (1 − θ)y) ≤ θfi(x) + (1 − θ)fi(y)

∀x, y ∈ R^n, θ ∈ [0, 1].

The least-squares problem and the linear programming problem are both special cases of the general convex optimization problem.

Convex Optimization Problems' Solution

- No analytical formula for the solution.
- Interior-point methods work very well in practice, but no consensus has emerged yet as to what the best method or methods are, and it is still a very active research area.
- We cannot yet claim that solving general convex optimization problems is a mature technology.
- For some subclasses of convex optimization problems, e.g. second-order cone programming or geometric programming, interior-point methods are approaching mature technology.

Nonlinear Optimization

Figure labels: (Mathematical) Optimization Problems ⊃ Convex Optimization Problems ⊃ {Least-Squares Problems, Linear Programming Problems}.

Figure: Illustration of what is included in nonlinear optimization (grey parts are wiped out): the problems where (1) some fi is not linear, and (2) the problem is not known to be convex.

Nonlinear Optimization Study I

There are no effective methods for solving the general nonlinear programming problem, and the different approaches each involve some compromise:
- Local optimization: "more art than technology"
- Global optimization: "the compromise is efficiency"

Convex optimization also helps with non-convex problems, e.g. by:

- Initialization for local optimization:
  1. Find an approximate, but convex, formulation of the problem.
  2. Use the approximate convex problem's exact solution as a starting point for the original non-convex problem.

Nonlinear Optimization Study II

- Introducing convex heuristics for solving nonconvex optimization problems, e.g.:
  - Sparsity: when and why it is preferred.
  - The use of randomized algorithms to find the best parameters.
- Estimating bounds, e.g. a lower bound on the optimal value (the best-possible value):
  - Lagrangian relaxation:
    1. Solve the Lagrangian dual problem, which is convex.
    2. It provides a lower bound on the optimal value.
  - Relaxation: each nonconvex constraint is replaced with a looser, but convex, constraint.

Convex Optimization

The Standard Form

The problem is often expressed as:

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, 2, . . . , m
           hi(x) = 0, i = 1, 2, . . . , p

The domain D is defined as:

D = (∩_{i=0}^m dom fi) ∩ (∩_{i=1}^p dom hi)

A point x ∈ D is feasible if it satisfies the constraints (fi(x) ≤ 0 for i = 1, . . . , m, and hi(x) = 0 for i = 1, . . . , p).

The optimal value p* is the infimum of f0(x) over feasible x. An optimal point x* is a feasible point satisfying f0(x*) = p*.

The Standard Form Convex Optimization Problem

The standard form optimization problem is a convex optimization problem when it satisfies three additional conditions:

1. The objective function f0 must be convex;
2. The inequality constraint functions fi (i = 1, 2, . . . , m) must be convex;
3. The equality constraint functions hi(x) = a_i^T x − b_i (i = 1, 2, . . . , p) must be affine.

An important property follows: the feasible set of a convex optimization problem is convex, as it is the intersection of convex sets (the domain of the problem, the sublevel sets of the convex fi, and the affine solution sets of the hi).

The Epigraph Form

The epigraph form (with variables x ∈ R^n, t ∈ R) is obviously equivalent to the standard form:

minimize t
subject to f0(x) − t ≤ 0
           fi(x) ≤ 0, i = 1, 2, . . . , m
           hi(x) = 0, i = 1, 2, . . . , p

Note that the objective function of the epigraph form problem is a linear function of the variables x, t. It can be interpreted geometrically as minimizing t over the epigraph of f0, subject to the constraints on x.

Local and Global Optima

A fundamental property of convex optimization problems is that any locally optimal point is also (globally) optimal.

Proof: Assume x is a local optimum; then x ∈ D and, for some R > 0:

f0(x) = inf{f0(z) | z ∈ D, ‖z − x‖_2 ≤ R}

Now assume it is not a global optimum; then ∃y ∈ D with f0(y) < f0(x), and we must have ‖y − x‖_2 > R. Consider the point z given by:

z = (1 − θ)x + θy, θ = R / (2‖y − x‖_2) < 1/2

Then ‖z − x‖_2 = R/2 < R. By convexity of the feasible set D, z ∈ D; and since f0 is convex, the following contradicts the local optimality of x:

f0(z) ≤ (1 − θ)f0(x) + θf0(y) < f0(x)

Solving Convex Optimization Problems

When solving an optimization problem, we follow these steps:

1. Reformulate the problem into the standard form / epigraph form / another known equivalent form (e.g. LP (Linear Program), QP (Quadratic Program), SOCP (Second-Order Cone Program), GP (Geometric Program), CP (Cone Program), SDP (Semidefinite Program)); see textbook chapter 4.
2. We can form highly nontrivial bounds on convex optimization problems by Lagrangian duality. (Weak) duality works even for hard problems that are not necessarily convex (the Lagrange dual function is concave regardless).
3. The problem can be solved by solving the KKT (Karush-Kuhn-Tucker) conditions.

Special Cases

When f0 is a constant, the problem becomes a feasibility problem.

When there are no f1, . . . , fm and no h1, . . . , hp, the problem is an unconstrained minimization problem.

- Feasibility → unconstrained minimization: make a new f0' with value 0 (or another constant) when x ∈ D, and f0'(x) = +∞ otherwise.
- Unconstrained minimization → feasibility: introduce f0(x) ≤ p* + ε as the constraint and remove the objective.

Infeasible problem: p* = +∞; unbounded problem: p* = −∞.

Example: Converting LPs to Standard Form

LPs are normally in the form:

minimize c^T x + d
subject to Gx ⪯ h
           Ax = b

With a slack variable s ∈ R^m, s ⪰ 0, introduced and x = x^+ − x^−, where x^+, x^− ⪰ 0:

minimize c^T x^+ − c^T x^− + d
subject to Gx^+ − Gx^− + s = h
           Ax^+ − Ax^− = b
           s, x^+, x^− ⪰ 0

Example: Converting a Problem into LP I

Consider a polyhedron P, defined as:

P = {x ∈ R^n | a_i^T x ≤ b_i, i = 1, 2, . . . , m}

We want to find the largest Euclidean ball that lies in P; its center is known as the Chebyshev center of the polyhedron. The ball is represented as:

B = {x_c + u | ‖u‖_2 ≤ r}

The variables are x_c ∈ R^n and r ∈ R; the problem is to maximize r subject to the constraint B ⊆ P.

Example: Converting a Problem into LP II

We start by observing that every x ∈ B has the form x = x_c + u, and that x ∈ P requires:

a_i^T (x_c + u) = a_i^T x_c + a_i^T u ≤ b_i

Since ‖u‖_2 ≤ r implies:

sup{a_i^T u | ‖u‖_2 ≤ r} = r‖a_i‖_2

the condition we need is:

a_i^T x_c + r‖a_i‖_2 ≤ b_i

which is a linear inequality in (x_c, r).
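Collecting these linear inequalities, the Chebyshev center can be computed with any LP solver. A sketch using scipy.optimize.linprog on a hypothetical polyhedron (the unit square, my own choice, whose center and radius are known by symmetry):

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical polyhedron: the unit square 0 <= x1 <= 1, 0 <= x2 <= 1
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

# variables z = (x_c, r); maximize r  <=>  minimize -r
c = np.array([0.0, 0.0, -1.0])
A_ub = np.c_[A, np.linalg.norm(A, axis=1)]    # a_i^T x_c + r ||a_i||_2 <= b_i
res = linprog(c, A_ub=A_ub, b_ub=b,
              bounds=[(None, None), (None, None), (0, None)])

print(res.x)   # center (0.5, 0.5), radius 0.5
```

For the unit square the optimum is unique: the largest inscribed ball is centered at (0.5, 0.5) with radius 0.5.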

The resulting LP:

minimize −r
subject to a_i^T x_c + r‖a_i‖_2 ≤ b_i, i = 1, 2, . . . , m

Lagrangian Duality

Consider the standard form written as:

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, 2, . . . , m
           hi(x) = 0, i = 1, 2, . . . , p

Denote the optimal value as p*. Its Lagrangian is L : R^n × R^m × R^p → R, with dom L = D × R^m × R^p:

L(x, λ, ν) = f0(x) + Σ_{i=1}^m λi fi(x) + Σ_{i=1}^p νi hi(x)

It is basically a weighted sum of the objective and the constraints. The Lagrange dual function g : R^m × R^p → R is:

g(λ, ν) = inf_{x∈D} L(x, λ, ν)

Lagrangian Duality Properties

Denote g's optimal point, or dual optimal point, as (λ*, ν*).

g is always concave, and can reach −∞ for some λ, ν values.

There is an important lower-bound property: if λ ⪰ 0, then g(λ, ν) ≤ p*.

Proof: Since the optimal x* ∈ D is feasible, fi(x*) ≤ 0 and hi(x*) = 0; thus for any λ ⪰ 0 and any ν:

p* = f0(x*) ≥ L(x*, λ, ν) ≥ inf_{x∈D} L(x, λ, ν) = g(λ, ν)

KKT: Proof of Complementary Slackness

Assume strong duality holds; then:

f0(x*) = g(λ*, ν*)
       = inf_{x∈D} ( f0(x) + Σ_{i=1}^m λi* fi(x) + Σ_{i=1}^p νi* hi(x) )
       ≤ f0(x*) + Σ_{i=1}^m λi* fi(x*) + Σ_{i=1}^p νi* hi(x*)
       ≤ f0(x*)

because λi* ≥ 0, fi(x*) ≤ 0, hi(x*) = 0. Therefore:

Σ_{i=1}^m λi* fi(x*) = 0

Since each term is non-positive, λi* fi(x*) = 0 for every i.

KKT (Karush-Kuhn-Tucker) Optimality Conditions

For a problem with differentiable fi and hi, we have four conditions that together are named the KKT conditions:

- Primal constraints:
  fi(x) ≤ 0, i = 1, 2, . . . , m
  hi(x) = 0, i = 1, 2, . . . , p
- Dual constraints: λ ⪰ 0
- Complementary slackness: λi fi(x) = 0 (i = 1, 2, . . . , m)
- The gradient of the Lagrangian vanishes (with respect to x):

∇_x L(x, λ, ν) = ∇_x f0(x) + Σ_{i=1}^m λi ∇_x fi(x) + Σ_{i=1}^p νi ∇_x hi(x) = 0

KKT Properties

- If strong duality holds and (x, λ, ν) are optimal, then the KKT conditions must be satisfied.
- For a convex problem, if the KKT conditions are satisfied by (x, λ, ν), then strong duality holds and the variables are optimal.
- If Slater's condition (see textbook section 3.5.6; it implies strong duality) is satisfied, then x is optimal ⟺ ∃(λ, ν) that satisfy the KKT conditions.

Least Squares and Least Norm

The original form of the least-squares problem:

minimize ‖Ax − b‖_2^2

With regularization (μ > 0):

minimize ‖Ax − b‖_2^2 + μ‖x‖_2^2

the solution becomes x_μ = (A^T A + μI)^{-1} A^T b.

A corresponding least-norm problem, with solution x_ln:

minimize ‖x‖_2^2
subject to Ax = b

A fact: x_ln = lim_{μ→0} x_μ (ref: Prof. Boyd's slides, https://see.stanford.edu/materials/lsoeldsee263/08-min-norm.pdf).

Least-Norm Problem: Equivalent Form

Previously we had the least-norm problem in the form:

minimize ‖x‖_2^2
subject to Ax = b

But it is equivalent to the form:

minimize x^T x
subject to Ax = b

Proof sketch: for any x with Ax = b, let x* = A^T (AA^T)^{-1} b; then A(x − x*) = 0 and (x − x*)^T x* = 0, so by the Pythagorean theorem ‖x‖_2^2 = ‖x − x*‖_2^2 + ‖x*‖_2^2 ≥ ‖x*‖_2^2, with equality iff x = x*.

Pseudo-Inverse

With independent rows, AA^T is nonsingular, and thus:

A^† = A^T (AA^T)^{-1}

With independent columns, A^T A is nonsingular, and thus:

A^† = (A^T A)^{-1} A^T

Recall that previously, for the least-squares problem ‖Ax − b‖_2^2, we noted that A^{-1} sometimes does not exist, so we solve A^T A x = A^T b instead; the pseudo-inverse is a formal definition of this operation.

Example: Solving Least-Norm I

Consider the following problem with x ∈ R^n, A ∈ R^{p×n}, b ∈ R^p:

minimize x^T x
subject to Ax = b

Solution: The Lagrangian of this problem is (no λ needed):

L(x, ν) = x^T x + ν^T (Ax − b)

The KKT conditions are:
- Primal constraints: Ax = b
- Dual constraints: none
- Complementary slackness: none
- The gradient of the Lagrangian vanishes (with respect to x):

∇_x L(x, ν) = 2x + A^T ν = 0

Example: Solving Least-Norm II

From the vanishing Lagrangian gradient, x* = −(1/2)A^T ν*; from the primal constraints, Ax* = b. Therefore:

AA^T ν* = −2b
ν* = −2(AA^T)^{-1} b
x* = A^T (AA^T)^{-1} b = A^† b
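This KKT solution is easy to verify numerically; a sketch on random underdetermined data (my own example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 6                                    # underdetermined: p < n
A = rng.normal(size=(p, n))                    # full row rank almost surely
b = rng.normal(size=p)

nu_star = -2.0 * np.linalg.solve(A @ A.T, b)   # nu* = -2 (A A^T)^{-1} b
x_star = -0.5 * A.T @ nu_star                  # x*  = -(1/2) A^T nu*

print(np.allclose(A @ x_star, b))                       # primal constraint holds
print(np.allclose(2.0 * x_star + A.T @ nu_star, 0.0))   # Lagrangian gradient vanishes
print(np.allclose(x_star, np.linalg.pinv(A) @ b))       # matches the pseudo-inverse A† b
```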

In terms of the pseudo-inverse, this matches the least-squares solution we had previously:

x* = (A^T A)^{-1} A^T b = A^† b

Categories of Learning Algorithms

- Descent methods
  - To find the best step size we do line search; the particular choice of line search does not matter that much, but the particular choice of search direction matters a lot.
  - SGD, AdaGrad, Adam, etc.: almost all popular optimizers today.
- Newton's method
  - In theory faster convergence; in practice much larger space requirements.
  - (*) Prof. Lin's course projects (Newton + CNN).
- Interior-point methods
  - Applying Newton's method to a sequence of modified versions of the KKT conditions.

Descent Methods

x^(k+1) = x^(k) + η ∆x^(k), with f(x^(k+1)) < f(x^(k))

where ∆x^(k) is called the step (or search direction), and η > 0 the step size. From convexity, the descent condition implies:

∇f(x)^T ∆x < 0

i.e. ∆x must be a descent direction. The step size can be determined by line search along the ray {x + η∆x | η ≥ 0}, using the approximation:

f(x + η∆x) ≈ f(x) + η∇f(x)^T ∆x

Line Search

Exact line search:

η* = argmin_{η>0} f(x + η∆x)

Backtracking line search:
- Parameters: α ∈ (0, 0.5), β ∈ (0, 1)
- Start with η = 1, and repeat:
  1. Stop when: f(x + η∆x) < f(x) + αη∇f(x)^T ∆x
  2. Otherwise, update η := βη.

Both strategies are used for selecting a proper step size; the choice between them is not very important in practice.
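The backtracking rule above drops straight into a gradient-descent loop. A sketch on a hypothetical convex quadratic (my own test problem, not from the slides), with a small eta floor added defensively against floating-point stalls:

```python
import numpy as np

def backtracking(f, g, x, dx, alpha=0.25, beta=0.5):
    # shrink eta until f(x + eta*dx) < f(x) + alpha*eta*grad^T dx
    eta = 1.0
    while f(x + eta * dx) >= f(x) + alpha * eta * (g @ dx) and eta > 1e-12:
        eta *= beta
    return eta

# hypothetical convex quadratic f(x) = x^T Q x - b^T x
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: x @ Q @ x - b @ x
grad = lambda x: 2.0 * Q @ x - b

x = np.array([5.0, 5.0])
for _ in range(500):
    g = grad(x)
    if np.linalg.norm(g) < 1e-9:
        break
    dx = -g                                   # descent direction: grad^T dx < 0
    x = x + backtracking(f, g, x, dx) * dx

x_star = np.linalg.solve(2.0 * Q, b)          # exact minimizer of the quadratic
print(np.allclose(x, x_star, atol=1e-6))      # True
```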

Figure: Illustration of the two types of line search (exact and backtracking).

Steepest Descent Methods

In steepest descent methods, instead of moving along −∇f(x), we search for the unit vector v with the most negative ∇f(x)^T v, the directional derivative of f at x in the direction v. In other words:

x^(k+1) = x^(k) + η ∆x_nsd

where ∆x_nsd is defined as:

∆x_nsd = argmin_v {∇f(x)^T v | ‖v‖ = 1}

Subgradient Methods

Use a subgradient g ∈ ∂f(x) instead of the gradient ∇f(x), which means that:

f(y) ≥ f(x) + g^T (y − x), ∀y

There can be multiple g for the same x. The advantage of using g is that it can handle non-differentiable functions. (In Prof. Gu's slides only; not included in the textbook.)

With g at x^(k) denoted as g^(k), the update is:

x^(k+1) = x^(k) − η g^(k)

Newton's Method

Given a starting point x ∈ dom f and tolerance ε > 0, repeat the following steps:

1. Compute the Newton step: ∆x_nt = −∇^2 f(x)^{-1} ∇f(x)
2. Compute the Newton decrement: λ(x)^2 = ∇f(x)^T ∇^2 f(x)^{-1} ∇f(x)
3. Quit if λ(x)^2 / 2 ≤ ε
4. Select the step size η by backtracking line search

5. Update x := x + η ∆x_nt

Newton's Method Interpretations
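The five steps just listed can be sketched in code. The test function below is a hypothetical choice (not from the slides): smooth, strictly convex, with its minimum at the origin.

```python
import numpy as np

def newton(f, grad, hess, x, eps=1e-10, alpha=0.25, beta=0.5):
    # damped Newton's method with backtracking line search (a sketch)
    while True:
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)        # 1. Newton step
        lam2 = -g @ dx                     # 2. Newton decrement squared: g^T H^{-1} g
        if lam2 / 2.0 <= eps:              # 3. stopping criterion
            return x
        eta = 1.0                          # 4. backtracking line search
        while f(x + eta * dx) > f(x) + alpha * eta * (g @ dx) and eta > 1e-12:
            eta *= beta
        x = x + eta * dx                   # 5. update

# hypothetical smooth, strictly convex test function, minimized at the origin
f = lambda x: np.logaddexp(x[0], -x[0]) + x[1] ** 2
grad = lambda x: np.array([np.tanh(x[0]), 2.0 * x[1]])
hess = lambda x: np.diag([1.0 / np.cosh(x[0]) ** 2, 2.0])

x_star = newton(f, grad, hess, np.array([1.0, -2.0]))
print(np.allclose(x_star, [0.0, 0.0], atol=1e-4))   # True
```

The quadratic coordinate is solved exactly in one Newton step; the log-sum-exp coordinate converges in a handful of damped iterations.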

- x + ∆x_nt minimizes the second-order approximation:

  f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇^2 f(x) v

- x + ∆x_nt solves the linearized optimality condition:

  ∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇^2 f(x) v = 0

- ∆x_nt is the steepest descent direction at x in the local Hessian norm:

  ‖u‖_{∇^2 f(x)} = (u^T ∇^2 f(x) u)^{1/2}

- λ(x)^2 / 2 is an approximation of f(x) − p*, with p* estimated by inf_y f̂(y):

  f(x) − inf_y f̂(y) = λ(x)^2 / 2

Logarithm Barrier Function

Define the logarithm barrier function:

φ(x) = − Σ_{i=1}^m log(−fi(x)), dom φ = {x | fi(x) < 0, i = 1, . . . , m}

It preserves the convexity and the twice continuous differentiability (if any) of the fi, and turns the inequality constraints from explicit to implicit:

minimize f0(x) + φ(x)
subject to hi(x) = 0, i = 1, 2, . . . , p

Logarithm Barrier Function's Properties

φ(x) = − Σ_{i=1}^m log(−fi(x)), dom φ = {x ∈ ∩_{i=1}^m dom fi | fi(x) < 0, i = 1, . . . , m}

The function φ(x) is convex when all fi(x) are convex, and twice continuously differentiable when the fi are all twice continuously differentiable.

∇φ(x) = Σ_{i=1}^m (1 / (−fi(x))) ∇fi(x)

∇^2 φ(x) = Σ_{i=1}^m (1 / fi(x)^2) ∇fi(x) ∇fi(x)^T + Σ_{i=1}^m (1 / (−fi(x))) ∇^2 fi(x)

Interior-Point Method

The only difference between the interior-point method's conditions and the KKT conditions is that complementary slackness is replaced with approximate complementary slackness:

−λi fi(x) = 1/t, i = 1, 2, . . . , m

Interior-point methods rely on the following assumptions (in particular, they do not work well when the problem is not strictly feasible):

- fi is convex and twice continuously differentiable
- A ∈ R^{p×n} and A's rank is p
- p* is finite and attained
- The problem is strictly feasible (an interior point exists); hence strong duality holds and the dual optimum is attained.

Interior-Point Method

It is the algorithm coming directly from primal-dual methods. In brief, at iteration step t, we set x*(t) as the solution of:

minimize t f0(x) + φ(x)
subject to Ax = b

The factor t balances against φ(x)'s value, forcing the algorithm to focus more on f0 in the end; the approximation improves as t → ∞.

The central path is defined as {x*(t) | t > 0}, the path along which we minimize the Lagrangian, and:

lim_{t→∞} f0(x*(t)) = p*

Central Path I

The central path is formed by the solutions of:

minimize t f0(x) + φ(x)
subject to Ax = b

The necessary and sufficient conditions for a point to be on the central path (such points are a.k.a. central points): it is strictly feasible, i.e.

Ax*(t) = b, fi(x*(t)) < 0 (i = 1, 2, . . . , m)

and, applying the vanishing-Lagrangian-gradient condition (No. 4), for A ∈ R^{p×n} there exists ν̂ ∈ R^p such that:

t ∇f0(x*(t)) + ∇φ(x*(t)) + A^T ν̂ = 0

Central Path II

Expanding ∇φ, we have:

t ∇f0(x*(t)) + Σ_{i=1}^m (1 / (−fi(x*(t)))) ∇fi(x*(t)) + A^T ν̂ = 0

From the previous properties of x*(t), we derive an important property: every central point yields a dual feasible point, and hence a lower bound on the optimal value p*. Specifically:

λi*(t) = −1 / (t fi(x*(t))), i = 1, 2, . . . , m;  ν*(t) = ν̂ / t

are a dual feasible pair for the original problem with objective f0(x) and inequality constraints, without the barrier function.

Central Path III

In particular, evaluating the dual function $g$ at this pair gives an estimate of $p^*$ (a lower bound on it):

$$g(\lambda^*(t), \nu^*(t)) = f_0(x^*(t)) + \sum_{i=1}^{m} \lambda_i^*(t) f_i(x^*(t)) + \nu^*(t)^T \left( Ax^*(t) - b \right)$$
$$= f_0(x^*(t)) - \sum_{i=1}^{m} \frac{1}{t} + \frac{\hat{\nu}^T}{t} \cdot 0$$
$$= f_0(x^*(t)) - \frac{m}{t}$$

Therefore, the central point $x^*(t)$ is no more than $\frac{m}{t}$-suboptimal:

$$f_0(x^*(t)) - p^* \leq \frac{m}{t}$$

Interior-Point Method Example I 81
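The $\frac{m}{t}$ gap can be seen in closed form on a one-variable toy problem of my own: minimize $f_0(x) = x$ subject to $1 - x \leq 0$, so $p^* = 1$ and $m = 1$. The centering objective is $tx - \log(x - 1)$; setting its derivative $t - \frac{1}{x-1}$ to zero gives $x^*(t) = 1 + \frac{1}{t}$, hence $f_0(x^*(t)) - p^* = \frac{1}{t} = \frac{m}{t}$ exactly:

```python
# Toy problem: minimize x subject to 1 - x <= 0 (p* = 1, one constraint so m = 1).
# Central point in closed form: x*(t) = 1 + 1/t, from t - 1/(x - 1) = 0.
for t in [1.0, 10.0, 100.0, 1000.0]:
    x_star = 1.0 + 1.0 / t
    gap = x_star - 1.0                  # f0(x*(t)) - p*
    assert abs(gap - 1.0 / t) < 1e-12   # equals m/t (up to rounding)
    print(t, gap)
```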

Consider the inequality-form linear program:

minimize $c^T x$
subject to $Ax \preceq b$

Then we have the barrier function:

$$\phi(x) = -\sum_{i=1}^{m} \log(b_i - a_i^T x), \qquad \operatorname{dom} \phi = \{x \mid Ax \prec b\}$$

where $a_i^T$ are the rows of $A$. Interior-Point Method Example II 82

$$\nabla \phi(x) = \sum_{i=1}^{m} \frac{1}{-f_i(x)} \nabla f_i(x) = \sum_{i=1}^{m} \frac{a_i}{b_i - a_i^T x}$$

$$\nabla^2 \phi(x) = \sum_{i=1}^{m} \frac{1}{f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T + \sum_{i=1}^{m} \frac{1}{-f_i(x)} \nabla^2 f_i(x) = \sum_{i=1}^{m} \frac{a_i a_i^T}{(b_i - a_i^T x)^2}$$

If we define $d \in \mathbb{R}^m$ s.t. $d_i = \frac{1}{b_i - a_i^T x}$, we have: $\nabla \phi(x) = A^T d$ and $\nabla^2 \phi(x) = A^T \operatorname{diag}(d)^2 A$. Interior-Point Method Example III 83
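Both compact expressions are easy to check numerically. The data $A$, $b$ below are arbitrary illustrative choices (picked so that $x = 0$ is strictly feasible):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))            # m = 6 constraint rows a_i^T, n = 2
b = np.ones(6)                             # A @ 0 = 0 < b, so x = 0 is strictly feasible
x = np.zeros(2)

def phi(x):                                # logarithmic barrier for Ax <= b
    return -np.sum(np.log(b - A @ x))

d = 1.0 / (b - A @ x)
g = A.T @ d                                # grad phi(x) = A^T d
H = A.T @ np.diag(d ** 2) @ A              # hess phi(x) = A^T diag(d)^2 A

eps = 1e-6
num_g = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(num_g, g, atol=1e-5))    # True: matches the closed form
```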

There are no equality constraints in this case, so there is no $\nu$. Recall that previously we had ($A$ there corresponded to the equality constraint):

$$t \nabla f_0(x^*(t)) + \sum_{i=1}^{m} \frac{1}{-f_i(x^*(t))} \nabla f_i(x^*(t)) + A^T \hat{\nu} = 0$$

In this situation:

$$tc + \sum_{i=1}^{m} \frac{a_i}{b_i - a_i^T x^*(t)} = tc + A^T d = 0$$

At points $x^*(t)$ on the central path, $\nabla \phi(x^*(t)) = -tc$ must be parallel to $-c$, and $\nabla \phi(x^*(t))$ is normal to the level set of $\phi$ through $x^*(t)$. Illustration of the Previous Example's Central Path 84

Figure: $n = 2$, $m = 6$. The dashed curves show three contour lines of the logarithmic barrier function, at different levels of $\phi(x^*(t))$. The Lipschitz Constraint 85

The Lipschitz constraint is a very common type of condition imposed on functions; $f$ being $L$-Lipschitz means:

$$|f(x) - f(y)| \leq L \|x - y\|, \quad \forall x, y \in \operatorname{dom} f$$

$L$ is called the Lipschitz constant (or coefficient).
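A simple way to get a feel for $L$ is to lower-bound it by sampling difference quotients. This is a rough sketch of my own (for $f(x) = \sin(x)$ the true constant is $L = 1$, since $|\cos| \leq 1$); note that sampling only ever gives a lower bound on the true constant:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                    # true Lipschitz constant is 1

x = rng.uniform(-10.0, 10.0, size=2000)
y = x + rng.uniform(0.001, 0.1, size=2000)    # nearby second points (avoids x == y)
ratios = np.abs(f(x) - f(y)) / np.abs(x - y)  # difference quotients |f(x)-f(y)| / |x-y|
L_est = ratios.max()                          # a sampled lower bound on L
print(L_est)                                  # close to, and never above, 1
```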

Lipschitzness is very important in analyzing the convergence of optimization algorithms, in both convex and non-convex cases.

We need it to relate one step to the next, although sometimes the constant is omitted from the final bound. The Lipschitz: Interpretation 86

The coefficient $L$ of $f$ can be interpreted as:
I A bound on the next-level derivative of $f$
I It can be taken to be zero if $f$ is constant.
I More generally, $L$ measures how well $f$ can be approximated by a constant.
I If $f = \nabla g$ then $L$ measures how well $g$ can be approximated by a linear model.
I If $f = \nabla^2 h$ then $L$ measures how well $h$ can be approximated by a quadratic model.
Example of Lipschitz: Convergence Analysis 87

Consider the convergence analysis of Newton's method, in unconstrained optimization, where the objective $f$ is:
I Twice continuously differentiable: $\nabla f(x)$ and $\nabla^2 f(x)$ exist;
I Strongly convex with constant $m$: $\nabla^2 f(x) \succeq mI$ ($x \in D$)
I This implies that $\exists M > 0$, $\forall x \in D$, $\nabla^2 f(x) \preceq MI$. (Proof on next page)
I The Hessian of $f$ is $L$-Lipschitz continuous on $D$, $\forall x, y \in D$:

$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \leq L \|x - y\|_2$$

Proof of Strong-Convexity Upper Bound I 88

This part’s proof comes from textbook 9.1.2.

First, by using the first-order characterization of the convex function $f$, we have that, $\forall x, y \in \operatorname{dom} f$:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x)$$

with the second-order Taylor expansion (for some $z$ on the line segment between $x$ and $y$):

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)$$

In the case of strong convexity with constant $m > 0$, we have $\nabla^2 f(z) \succeq mI$, thus

$$(y - x)^T \nabla^2 f(z) (y - x) \geq m \|y - x\|_2^2$$

Proof of Strong-Convexity Upper Bound II 89

Therefore we have:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$$

This inequality implies that the sublevel sets contained in $\operatorname{dom} f$ are bounded. It essentially means that the maximum eigenvalue of $\nabla^2 f(x)$, which is a continuous function of $x$, is bounded above on such a bounded sublevel set, i.e., there exists a constant $M > 0$ such that:

$$\nabla^2 f(x) \preceq MI$$

Note that $m$ and $M$ are often unknown in practice.

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2} \|y - x\|_2^2$$

Strong-Convexity Error Bounds Implications I 90

Still from textbook 9.1.2.

Previously we have had:

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$$

Now, considering a fixed $x$, it is obvious that the right-hand side is a convex quadratic function of $y$. The $\tilde{y}$ that minimizes it is the one that achieves zero derivative:

$$\nabla_{\tilde{y}} \left[ f(x) + \nabla f(x)^T (\tilde{y} - x) + \frac{m}{2} \|\tilde{y} - x\|_2^2 \right] = \nabla f(x) + m(\tilde{y} - x) = 0$$

$$\tilde{y} = x - \frac{1}{m} \nabla f(x)$$

Strong-Convexity Error Bounds Implications II 91

And therefore,

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2$$
$$\geq f(x) + \nabla f(x)^T (\tilde{y} - x) + \frac{m}{2} \|\tilde{y} - x\|_2^2$$
$$= f(x) - \frac{1}{m} \nabla f(x)^T \nabla f(x) + \frac{1}{2m} \|\nabla f(x)\|_2^2$$
$$= f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2$$

Since it holds for $\forall y \in D$, we can say that:

$$p^* \geq f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2$$

Strong-Convexity Error Bounds Implications III 92

It is often used as an upper-bound estimation of the error:

$$\epsilon = f(x) - p^* \leq \frac{1}{2m} \|\nabla f(x)\|_2^2$$
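For the simple quadratic $f(x) = \frac{m}{2} x^2$ the bound is tight: $p^* = 0$, $\nabla f(x) = mx$, so $\frac{1}{2m} |\nabla f(x)|^2 = \frac{m}{2} x^2 = f(x) - p^*$. A quick numeric check of my own ($m = 3$ is an arbitrary choice):

```python
# f(x) = (m/2) x^2: strongly convex with constant m, p* = 0 attained at x = 0.
m = 3.0                                  # arbitrary illustrative constant
f = lambda x: 0.5 * m * x * x
grad = lambda x: m * x

for x in [-2.0, 0.5, 1.7]:
    err = f(x) - 0.0                     # f(x) - p*
    bound = grad(x) ** 2 / (2.0 * m)     # (1/2m) ||grad f(x)||^2
    assert abs(err - bound) < 1e-12      # the bound holds with equality here
print("bound tight for quadratics")
```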

Similarly, applying the same strategy to:

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{M}{2} \|y - x\|_2^2$$

we have a lower bound of the error $\epsilon$:

$$p^* \leq f(x) - \frac{1}{2M} \|\nabla f(x)\|_2^2$$
$$\epsilon = f(x) - p^* \geq \frac{1}{2M} \|\nabla f(x)\|_2^2$$

Idea of the Newton Convergence Analysis 93

The general idea: the learning process of the (damped) Newton method can be divided into two phases; once we enter the second phase, we never leave it.

(*) If you've taken Prof. Gu's CS260 you'll see how commonly used this division approach is in convergence analysis. Example of Lipschitz: Convergence Analysis I 94

Outline of the proof: $\exists \eta \in (0, \frac{m^2}{L}]$, $\gamma > 0$, such that

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma \qquad \text{if } \|\nabla f(x^{(k)})\|_2 \geq \eta$$
$$\frac{L}{2m^2} \|\nabla f(x^{(k+1)})\|_2 \leq \left( \frac{L}{2m^2} \|\nabla f(x^{(k)})\|_2 \right)^2 \qquad \text{if } \|\nabla f(x^{(k)})\|_2 < \eta$$

There are two phases of the algorithm ($t^{(k)}$ is the step size here):

1. Damped Newton phase ($\|\nabla f(x)\|_2 \geq \eta$):
I Most iterations require backtracking steps
I Function value decreases by at least $\gamma$
I If bounded ($p^* > -\infty$), this phase costs no more than $\frac{f(x^{(0)}) - p^*}{\gamma}$ iterations

2. Quadratically convergent phase ($\|\nabla f(x)\|_2 < \eta$):
I All iterations use step size $t^{(k)} = 1$
Example of Lipschitz: Convergence Analysis II 95

I $\|\nabla f(x)\|_2$ converges to zero quadratically:

$$\frac{L}{2m^2} \|\nabla f(x^{(k+1)})\|_2 \leq \left( \frac{L}{2m^2} \|\nabla f(x^{(k)})\|_2 \right)^2 \leq \left( \frac{1}{2} \right)^2$$

We've set $\eta \leq \frac{m^2}{L}$; thus if $\|\nabla f(x^{(k)})\|_2 < \eta$, for $k + 1$ we have:

$$\|\nabla f(x^{(k+1)})\|_2 \leq \frac{L}{2m^2} \|\nabla f(x^{(k)})\|_2^2 < \frac{\eta^2 L}{2m^2} \leq \frac{\eta}{2} < \eta$$

and it holds for $\forall l > k$. More generally:

$$\frac{L}{2m^2} \|\nabla f(x^{(l)})\|_2 \leq \left( \frac{L}{2m^2} \|\nabla f(x^{(k)})\|_2 \right)^{2^{l-k}} \leq \left( \frac{1}{2} \right)^{2^{l-k}}$$

$$\|\nabla f(x^{(l)})\|_2^2 \leq \frac{4m^4}{L^2} \left( \frac{1}{2} \right)^{2^{l-k+1}}$$

Example of Lipschitz: Convergence Analysis III 96

From strong convexity we know:

$$f(x) - p^* \leq \frac{1}{2m} \|\nabla f(x)\|_2^2$$

$$f(x^{(l)}) - p^* \leq \frac{1}{2m} \|\nabla f(x^{(l)})\|_2^2 \leq \frac{2m^3}{L^2} \left( \frac{1}{2} \right)^{2^{l-k+1}} \leq \epsilon$$

This shows that convergence is very fast in this phase.
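The two phases are easy to observe numerically. Below is my own small illustration (not from the slides): damped Newton with backtracking on the strongly convex function $f(x) = \sum_i (e^{x_i} + e^{-x_i})$, whose Hessian satisfies $\nabla^2 f \succeq 2I$. The recorded gradient norms first shrink slowly, then collapse quadratically once they are small:

```python
import numpy as np

# Illustrative strongly convex test function (my own choice, not from the slides).
def f(x):    return np.sum(np.exp(x) + np.exp(-x))
def grad(x): return np.exp(x) - np.exp(-x)
def hess(x): return np.diag(np.exp(x) + np.exp(-x))

x = np.array([3.0, -4.0])
norms = []                                   # track ||grad f(x^(k))||_2 per iteration
for _ in range(20):
    g = grad(x)
    norms.append(np.linalg.norm(g))
    dx = np.linalg.solve(hess(x), -g)        # Newton direction
    step = 1.0                               # backtracking line search (alpha=0.25, beta=0.5)
    while f(x + step * dx) > f(x) + 0.25 * step * (g @ dx):
        step *= 0.5
    x = x + step * dx
print(norms)  # decreases slowly at first, then quadratically once it is small
```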

Define $\epsilon_0 = \frac{2m^3}{L^2}$; then we can bound the number of iterations in the quadratically convergent phase by:

$$\log_2 \log_2 \left( \frac{\epsilon_0}{\epsilon} \right)$$

Example of Lipschitz: Sub-gradient Analysis 97

Consider a function $f$ which is $\rho$-Lipschitz, minimized via sub-gradient descent.

Something we need to prove in advance to make the conclusion obvious:

$$\because \; f(x^{(t+1)}) - f(x^*) \geq 0, \qquad f(x^{(t+1)}) - f(x^{(t)}) \leq 0$$
$$\therefore \; \langle x^{(t+1)} - x^{(t)}, g^{(t+1)} \rangle \leq 0 \leq \langle x^{(t+1)} - x^*, g^{(t+1)} \rangle$$
$$\iff \langle x^{(t+1)} - x^*, x^{(t+1)} - x^{(t)} \rangle \leq 0$$
$$\therefore \; \|x^{(t)} - x^{(t+1)}\|^2 \leq \|x^{(t)} - x^*\|^2 - \|x^{(t+1)} - x^*\|^2$$

I have a figure to illustrate this relation on the next page. Proof of the $\|\cdot\|^2$ Inequality 98
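Algebraically, the last implication is just the expansion of a squared norm (a one-line check, added here for completeness):

```latex
\|x^{(t)} - x^*\|^2
  = \|x^{(t)} - x^{(t+1)}\|^2 + \|x^{(t+1)} - x^*\|^2
  + 2\,\langle x^{(t)} - x^{(t+1)},\; x^{(t+1)} - x^* \rangle
```

so if $\langle x^{(t+1)} - x^*, x^{(t+1)} - x^{(t)} \rangle \leq 0$, the cross term above is nonnegative, and dropping it gives $\|x^{(t)} - x^{(t+1)}\|^2 \leq \|x^{(t)} - x^*\|^2 - \|x^{(t+1)} - x^*\|^2$.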


Figure: Illustration of why we conclude $\|x^{(t)} - x^{(t+1)}\|^2 \leq \|x^{(t)} - x^*\|^2 - \|x^{(t+1)} - x^*\|^2$ from $\langle x^{(t+1)} - x^*, x^{(t+1)} - x^{(t)} \rangle \leq 0$. Example of Lipschitz: Sub-gradient Analysis I 99

Using the sub-gradient inequality $f(x^{(t)}) - f(y) \leq \langle x^{(t)} - y, g^{(t)} \rangle$ and the update rule $x^{(t+1)} = x^{(t)} - \eta g^{(t)}$:

$$\sum_{t=1}^{T} \left[ f(x^{(t)}) - f(x^*) \right] \leq \sum_{t=1}^{T} (x^{(t)} - x^*)^T g^{(t)}$$
$$= \frac{1}{\eta} \sum_{t=1}^{T} (x^{(t)} - x^*)^T (x^{(t)} - x^{(t+1)})$$
$$= \sum_{t=1}^{T} \left[ \frac{\|x^{(t)} - x^*\|^2 - \|x^{(t+1)} - x^*\|^2}{2\eta} + \frac{\|x^{(t)} - x^{(t+1)}\|^2}{2\eta} \right]$$
$$= \sum_{t=1}^{T} \frac{\|x^{(t)} - x^*\|^2 - \|x^{(t+1)} - x^*\|^2}{2\eta} + \sum_{t=1}^{T} \frac{\eta}{2} \|g^{(t)}\|^2$$
$$\leq \frac{\|x^{(0)} - x^*\|^2 - \|x^{(T+1)} - x^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|g^{(t)}\|^2$$

Example of Lipschitz: Sub-gradient Analysis II 100

Assume that $f$ is $\rho$-Lipschitz; then $\|g^{(t)}\| \leq \rho$ ($\forall t$). Also assume $x^{(0)} = 0$ and $\lim_{t \to \infty} x^{(t)} = x^*$.

$$\sum_{t=1}^{T} \left[ f(x^{(t)}) - f(x^*) \right] \leq \frac{\|x^{(0)} - x^*\|^2 - \|x^{(T+1)} - x^*\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|g^{(t)}\|^2$$
$$\frac{1}{T} \sum_{t=1}^{T} \left[ f(x^{(t)}) - f(x^*) \right] \leq \frac{\|x^*\|^2}{2\eta T} + \frac{\eta \rho^2}{2}$$
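A quick empirical check of this bound on a toy instance of my own, $f(x) = |x - 1|$ (so $\rho = 1$, $x^* = 1$, $p^* = 0$), using the step size $\eta = \sqrt{\|x^*\|^2 / (\rho^2 T)}$ from the next slide:

```python
import numpy as np

T = 10000
eta = 1.0 / np.sqrt(T)                  # eta = sqrt(||x*||^2 / (rho^2 T)), ||x*|| = rho = 1
f = lambda x: abs(x - 1.0)              # rho = 1 Lipschitz, minimized at x* = 1, p* = 0
subgrad = lambda x: np.sign(x - 1.0)    # a valid sub-gradient (0 at the kink)

x, total = 0.0, 0.0                     # x^(0) = 0
for t in range(T):
    total += f(x)                       # accumulate f(x^(t)) - p*  (p* = 0)
    x -= eta * subgrad(x)
avg_gap = total / T
print(avg_gap)  # theory: at most ||x*||^2/(2*eta*T) + eta*rho^2/2 = 0.01
```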

For every $x^*$, if $T \geq \frac{\|x^*\|^2 \rho^2}{\epsilon^2}$ and $\eta = \sqrt{\frac{\|x^*\|^2}{\rho^2 T}}$, then the right-hand side of the last inequality is at most $\epsilon$. Examples

Accelerated Methods for Non-Convex Optimization 102

Accelerated Methods for Non-Convex Optimization
I Designs accelerated methods that don't rely on convexity of the optimization problem.

I It relies on the problem having an $L_1$-Lipschitz continuous gradient and an $L_2$-Lipschitz continuous Hessian.
I Calculates a score $\alpha$ according to $L_1$, $\epsilon$, and the gradient $\nabla f(x)$ to decide whether or not negative curvature descent should be conducted at each step.
I Applies accelerated gradient descent for almost-convex functions to the almost-convex point at each step to update that point.
Gradient Descent with One-Step Escaping (GOSE) 103

Saving gradient and negative curvature computations: Finding local minima more efficiently
I Doesn't require the original problem to be convex.
I Develops an algorithm with fewer steps of computing the negative curvature descent 7.
I Divides the entire domain of the objective function into two regions (by comparing $\|\nabla f(x)\|_2$ with $\epsilon$): the large gradient region and the small gradient region; then performs gradient descent-based methods in the large gradient region, and only performs negative curvature descent in the small gradient region.
Official code in PyTorch: https://github.com/yaodongyu/gose-nonconvex.

7Useful for escaping the small-gradient regions. MTL as Multi-Objective Optimization 104

Multi-Task Learning as Multi-Objective Optimization works on the problem that multiple tasks might conflict.
I Uses the multiple-gradient-descent algorithm (MGDA) optimizer;
I Defines Pareto optimality for MTL (in brief, no other solution dominates the current solution);
I Uses the multi-objective KKT (Karush-Kuhn-Tucker) conditions and finds a descent direction that decreases all objectives.
I Applicable to any problem that uses optimization based on gradient descent.
Implementation: https://github.com/hav4ik/Hydra