ECE 592 Topics in Data Science

Dror Baron, Associate Professor
Dept. of Electrical and Computer Engineering
North Carolina State University, NC, USA

Optimization

Keywords: linear programming, dynamic programming, convex optimization, non-convex optimization

What is optimization?

. Wikipedia: In mathematics, computer science and operations research, mathematical optimization (alternatively, mathematical programming or simply, optimization) is the selection of a best element (with regard to some criterion) from some set of available alternatives.

Application #1 – Classroom scheduling

. Real story: NCSU has classes on multiple campuses, dozens of buildings, etc.
. We want a “good” schedule

. What’s good? – Availability of rooms – Proximity of classroom to department – Instructors have day/time preferences – Match sizes of rooms and anticipated class enrollment – Avoid conflicts between course pairs of interest to students

Application #2 – ℓ1 recovery

. Among infinitely many solutions, seek one with smallest ℓ1 norm (sum of absolute values)

. Relation to compressed sensing recovery (later in course)

. Can express x = xp − xn, so ||x||1 = sum_{i=1..N} (xp_i + xn_i)
. min over (xp, xn) of sum_{i=1..N} (xp_i + xn_i) subject to (s.t.) y = Φxp − Φxn
– Also need xp, xn to be non-negative
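This reformulation maps directly to a standard LP solver. A minimal sketch in Python, assuming SciPy's linprog and a small randomly generated Φ (illustrative only, not the course's code):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
M, N = 20, 50                      # underdetermined system: M < N
Phi = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[rng.choice(N, 3, replace=False)] = rng.standard_normal(3)  # sparse signal
y = Phi @ x_true

# Variables are [xp; xn], both non-negative; x = xp - xn.
# Objective: sum(xp) + sum(xn) equals ||x||_1 at the optimum.
c = np.ones(2 * N)
A_eq = np.hstack([Phi, -Phi])      # y = Phi*xp - Phi*xn
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:N] - res.x[N:]
print("recovery error:", np.linalg.norm(x_hat - x_true))
```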

Application #3 – Reducing fuel consumption

. Suppose gas prices increase a lot
. Truck fleet company wants to save $ by reducing fuel consumption
. Things are simple on flat highways

. Challenges:
1) You see a hill; you can push the engine up the hill and coast down, or accelerate before the hill, then reduce speed while climbing
2) You see a red light; should you coast, accelerate, or slam the brakes?
. Main point: dynamic behavior links past, present, future

Application #4 – Process design in factories

. Consider factory with a complicated process
– Want to buy fewer inputs (chemicals)
– Want to use less energy
– Want product to be produced quickly (time)
– Want robustness to surprises (e.g., power shortages)

. Goal: tune production process to minimize costs
– “Costs” involve inputs, energy, time, robustness, …
– Known as multi-objective optimization

Dynamic Programming (DP)

Keywords: Bellman equations, dynamic programming

What is dynamic programming (DP)?

. Wikipedia: In mathematics, management science, economics, computer science, and bioinformatics, dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions – ideally, using a memory-based data structure.

Resembles divide and conquer

. Have a large problem
. Partition into parts

. Dynamic nature of problem links past, present, future
. Want the decision whose combined “costs” (current plus future) are best

. Whereas brute force optimization is computationally intense, DP is fast

Problem setting

. t – time
. T – time horizon (maximal time)

. xt – state at time t
. at ∈ Γ(xt) – possible actions
. T(x,a) – next state upon choosing action a
. F(x,a) – payoff from action a

. Want to maximize our payoff up to horizon T

Solution approach

. Basis case: t=T-1, have one time step left for an action

. Maximize payoff by maximizing F(xt,a): a* = arg max_{a∈Γ(xt)} F(xt,a)

. At time T (end of problem) arrive at state xT = T(xT-1, a*)
– Don’t care about final state, only about payoff

Solution continued

. Recursive case: t < T-1

. Let’s keep it simple with t=T-2

. Based on the basis case, for each possible next state xt+1 = xT-1 we can calculate a* for the last decision (taken in the next time step, t=T-1)

. Want optimal cost to account for current payoff and payoff in next step:
at* = arg max_{a∈Γ(xt)} { F(xt,a) + next_payoff(T(xt,a)) }

Recursive solution

. Let’s simplify recursive case for t=T-2 using notation for optimal actions / payoffs at time t

– a*(xt) – optimal action at time t given state xt
– Ψ(xt) – optimal payoff starting from time t

. Basis case provides a*(xT-1), Ψ(xT-1), ∀xT-1

. Recursive case for t=T-2:
at* = arg max_{a∈Γ(xt)} { F(xt,a) + next_payoff(T(xt,a)) }
    = arg max_{a∈Γ(xt)} { F(xt,a) + Ψ(xT-1 = T(xt,a)) }
. Repeat recursively for smaller t

Computationally efficient DP solution

. Instead of processing from t up to T, reverse order:

– t=T-1: compute a*(xt), Ψ(xt) for all possible xt
– t=T-2: at* = arg max_{a∈Γ(xt)} { F(xt,a) + Ψ(xT-1 = T(xT-2,a)) }
– t=T-3: at* = arg max_{a∈Γ(xt)} { F(xt,a) + Ψ(xT-2 = T(xT-3,a)) }

. General case: Bellman’s optimality equations

. Each time step, store optimal actions and payoffs
. Lookup table (LUT) for Ψ instead of recomputing
. Can construct sequence of optimal actions with LUT
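A minimal backward-induction sketch of this table-filling scheme; the state space, action set Γ, transition T, and payoff F below are toy assumptions chosen only to show the lookup-table structure:

```python
# Backward induction: fill lookup tables psi (optimal payoff-to-go) and
# a_star (optimal action) from t = T-1 down to t = 0.
T_horizon = 5
states = range(10)                      # toy state space (assumption)
actions = [-1, 0, +1]                   # toy action set Gamma (assumption)

def transition(x, a):                   # T(x, a): next state (toy dynamics)
    return max(0, min(9, x + a))

def payoff(x, a):                       # F(x, a): immediate payoff (toy)
    return -(x - 5) ** 2 - abs(a)

psi = [dict() for _ in range(T_horizon + 1)]
a_star = [dict() for _ in range(T_horizon)]
for x in states:
    psi[T_horizon][x] = 0.0             # no payoff after the horizon

for t in reversed(range(T_horizon)):    # Bellman recursion, one LUT per t
    for x in states:
        best_a, best_val = None, float("-inf")
        for a in actions:
            val = payoff(x, a) + psi[t + 1][transition(x, a)]
            if val > best_val:
                best_a, best_val = a, val
        a_star[t][x], psi[t][x] = best_a, best_val

# Reconstruct an optimal action sequence from the LUTs, starting at x0 = 0.
x = 0
for t in range(T_horizon):
    a = a_star[t][x]
    print(f"t={t}, state={x}, action={a}")
    x = transition(x, a)
```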

Why computationally efficient?

. Let’s contrast computational complexities

. Brute force optimization:
– |Γ| actions per time step and T time steps
– Must evaluate |Γ|^T trajectories of actions
– Θ(|Γ|^T)

. DP:
– Compute F(xt,a) + Ψ(xt+1 = T(xt,a)) for |Γ| actions over T time steps
– Θ(|Γ|·T)

. Whereas brute force optimization is computationally intense, DP is fast

Variations

. Deterministic / random
– Next state and payoff could be random
– Example: there could be more users than expected; adjust server (action) to account for future trajectory of software

. Finite / infinite horizon
– Infinite horizon decision problems require discount factor β to give future payoffs at time t weight β^t
– Payoffs in far future matter less → β<1

. Discrete / continuous time

Example [Cormen et al.]

. Rod cutting problem
. Have rod of integer length n

. Have table of prices pi charged for length-i cuts
. Cutting is free
. Want to cut rod into parts (or not cut at all) to maximize profit

Example continued

. Length n=4

. Can charge prices p1=1, p2=5, p3=8, p4=9
. Could look at all possible sets of cuts (2^(n-1) = 8 configurations for n=4)

Example using DP

. Unrealistic to consider all cutting configurations for large n; use DP instead

. Basis: n=1, Ψ(1)=p1=1

. Recursion: n=2, Ψ(2)=max{2Ψ(1),p2}=5

. n=3, Ψ(3)=max{Ψ(1)+Ψ(2),p3}=max{5+1,8}=8

. At each stage, maximize over Ψ(k)+Ψ(n-k) for k=1,2,…,n-1; and for k=n use pn

Ex: Ψ(7)=max{Ψ(1)+Ψ(6),Ψ(2)+Ψ(5),Ψ(3)+Ψ(4),…,p7}
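A short bottom-up sketch of this recursion in Python, using the prices from the slide (not from Cormen et al.'s code):

```python
def rod_cut(prices):
    """Bottom-up DP for rod cutting.
    prices[i] = price p_{i+1} charged for a piece of length i+1."""
    n = len(prices)
    psi = [0] * (n + 1)                 # psi[k] = best revenue for length k
    for length in range(1, n + 1):
        best = prices[length - 1]       # sell the whole piece uncut, price p_length
        for k in range(1, length):      # or split into lengths k and length-k
            best = max(best, psi[k] + psi[length - k])
        psi[length] = best
    return psi[n]

print(rod_cut([1, 5, 8, 9]))            # n=4 example from the slides -> 10
```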

Real-world application

. Viterbi algorithm
. Decodes convolutional codes in CDMA
– Also used in speech recognition
– Text is “hidden” and (noisy) speech observations help estimate text

. Relies on DP
. Finds shortest path

Linear Programming

Keywords: linear programming, simplex method

Formulation

. Canonical form: max_x c^T x s.t. Ax ≤ b, x ≥ 0
– Note: s.t. = subject to

. Matrix manipulations/tricks create variations:
– Ax=b by enforcing ≤ and ≥
– We’ve minimized ||x||1 (instead of c^T x)
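A tiny numeric sketch of the canonical form and the equality trick, assuming SciPy's linprog (the data is made up; note linprog minimizes, so max c^T x becomes min −c^T x):

```python
import numpy as np
from scipy.optimize import linprog

# Canonical form: max c^T x  s.t.  A x <= b, x >= 0
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 5.0])

# linprog minimizes, so negate c; bounds=(0, None) enforces x >= 0.
res = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))
print("x* =", res.x, "max value =", -res.fun)

# Equality trick: Ax = b can be written as Ax <= b and -Ax <= -b.
res_eq = linprog(-c, A_ub=np.vstack([A, -A]), b_ub=np.hstack([b, -b]),
                 bounds=(0, None))
print("x* with Ax = b:", res_eq.x)
```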

What’s it good for?

. Transportation - pose airline costs and revenue as linear model, maximize profit (revenue-costs) w/LP

. Manufacturing – minimize costs by posing them as linear model

. Common theme: many real-world problems are approximately linear, or can be linearized around working point (Taylor series)

History

. Early formulations date back to early 20th century (rudimentary forms even earlier)

. Dantzig invented the simplex method (solver) in the 1940s
– Polynomial average runtime; exponential worst case

. Interior point methods – much faster worst case

Simplex algorithm

. Linear constraints Ax ≤ b, x ≥ 0
– Correspond to a convex polytope
. Linear objective being optimized, c^T x
– Optimum attained at a corner point of the polytope
– Simplex = outer shell of convex polytope

. Start at some corner point (vertex)
. Examine neighboring vertices
. Either c^T x already optimal, or it’s better at a neighbor
. Move to best neighboring vertex; iterate until done
. Specific steps correspond to linear algebra

Convex Optimization

Keywords: convex optimization

What are convex/concave functions?

. Consider convex real-valued function f: 𝒳 → ℝ defined on space 𝒳

. Convex: f(λx+(1−λ)y) ≤ λf(x)+(1−λ)f(y), ∀x,y∈𝒳, λ∈(0,1)
. Concave: same inequality with ≥
. Note: f convex if and only if −f concave; for twice-differentiable f, convex/concave correspond to non-negative/non-positive second derivatives

. Any local optimum is global optimum
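A quick numeric sanity check of the convexity inequality, using f(x) = x² as an assumed example:

```python
import random

f = lambda x: x ** 2          # convex example function
for _ in range(10000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.uniform(0, 1)
    # Convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
print("convexity inequality held on all random samples")
```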

What is convex optimization?

. Basic convex problem: x* = arg min_{x∈𝒳} f(x)
– Set 𝒳 and function f(x) must both be convex
. Alternate form: min f(x) s.t. g_i(x) ≤ 0, ∀i
– Functions f, g_1, …, g_m all convex

Applications (Why is this interesting?)

. Many problems can be posed as convex

. Least squares

. Entropy maximization

. Linear programming

Newton’s method

. Newton’s method finds roots of equations, f(x)=0
. Here, apply it to the derivative f′=0 or gradient ∇f=0

. Taylor expansion: f(x) = f(xt) + f′(xt)·Δx + ½ f′′(xt)·Δx^2 + … (where Δx = x − xt)
. Root of derivative: f′(xt) + f′′(xt)·Δx = 0, so Δx = −f′(xt)/f′′(xt)
. Iterate with xt+1 = xt + Δx

. Newton’s method is simple but O(1/t) convergence
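A minimal sketch of this iteration for a 1-D function, assuming the first and second derivatives are available in closed form (the test function is made up):

```python
def newton_minimize(fprime, fsecond, x0, iters=20):
    """Find a stationary point of a 1-D function by applying Newton's method
    to its derivative: x_{t+1} = x_t - f'(x_t) / f''(x_t)."""
    x = x0
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# Example: f(x) = x^4 - 3x^2 + x has f'(x) = 4x^3 - 6x + 1, f''(x) = 12x^2 - 6.
x_min = newton_minimize(lambda x: 4 * x ** 3 - 6 * x + 1,
                        lambda x: 12 * x ** 2 - 6,
                        x0=2.0)
print("stationary point near x =", x_min)
```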

Second order methods

. Challenge: first order approximation to derivative slows down Newton’s method
. Solution: use higher order approximation
– Instead of f′(xt) + f′′(xt)·Δx, use the third derivative too

. Multi-dimensional function? Use gradient, Hessians…

. Second order methods more complicated but faster

Gradient descent

Keywords: gradient descent, line search, golden section search

Gradient descent

. In each iteration, select direction to pursue
– Coordinate descent – move along one of the coordinates
– Gradient descent – move in the direction that decreases the cost function fastest (negative gradient)

. How far should we move along that direction?
. Undershooting or overshooting is bad for convergence

Line search

. Key sub-method is to move along direction just enough to minimize the function along that line

. Line search = optimization along line

. Many variations – binary search, golden section search

. Let’s make up an example for this and code it! – Check course webpage for Matlab script
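The course webpage has the Matlab script; below is an illustrative Python stand-in under my own toy assumptions: gradient descent on a 2-D quadratic, with a golden section line search picking the step size along the negative gradient.

```python
import numpy as np

def golden_section(phi, lo, hi, iters=50):
    """Minimize a 1-D function phi on [lo, hi] via golden section search."""
    gr = (np.sqrt(5) - 1) / 2                 # inverse golden ratio ~ 0.618
    a, b = lo, hi
    c, d = b - gr * (b - a), a + gr * (b - a)
    for _ in range(iters):
        if phi(c) < phi(d):                   # minimum lies in [a, d]
            b, d = d, c
            c = b - gr * (b - a)
        else:                                 # minimum lies in [c, b]
            a, c = c, d
            d = a + gr * (b - a)
    return (a + b) / 2

# Toy cost: f(x) = 0.5 x^T Q x - b^T x (convex quadratic, made up for illustration)
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x = np.zeros(2)
for it in range(30):
    direction = -grad(x)                      # steepest descent direction
    step = golden_section(lambda s: f(x + s * direction), 0.0, 2.0)
    x = x + step * direction                  # move just enough along the line

print("minimizer ~", x, "  exact:", np.linalg.solve(Q, b))
```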

Integer Programming

Keywords: integer programming, integer linear programs, relaxation

What is integer programming?

. Integer program = optimization problem where some/all variables must be integers

. Integer linear programs (ILP): x̂ = arg max_x c^T x s.t. Ax + s = b, s ≥ 0, x ∈ ℤ^n
– Slack variables s
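A tiny numeric sketch, assuming SciPy ≥ 1.9 (which provides scipy.optimize.milp); the slack form Ax + s = b above is equivalent to the Ax ≤ b constraint passed to the solver, and the data is made up:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Tiny ILP: max c^T x  s.t.  A x <= b, x >= 0, x integer.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 3.0],
              [3.0, 1.0]])
b = np.array([6.0, 6.0])

res = milp(c=-c,                                    # milp minimizes, so negate c
           constraints=LinearConstraint(A, ub=b),   # A x <= b
           integrality=np.ones(2),                  # all variables integer
           bounds=Bounds(lb=0))                     # x >= 0
print("x* =", res.x, "max value =", -res.fun)       # here x* = (0, 2), value 4
```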

Example

. Support set detection – y=Ax+z
– Sparse x
– Want to identify support set where x≠0

. Can we do “perfect” support set detection?
– Are there tiny non-zeros? (yes → difficult)
– What’s the SNR? (low → difficult)

Example continued

. Support set detection, y=Ax+z, want support set

. Algorithm (sketched in code below):
– Consider candidate support set s ∈ {0,1}^N
– Create matrix As containing column i of A iff si=1
– Run least squares using As (find low-energy solution to y=As x)
– Iterate over all s, select solution with smallest residual

. Algorithm is optimal & slow
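A minimal brute-force sketch of this exhaustive search; the problem sizes are toy assumptions (and the loop is restricted to supports of size ≤ K to keep it short):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 8, 10, 2                      # small N keeps the exhaustive loop feasible
A = rng.standard_normal((M, N))
x = np.zeros(N)
support_true = rng.choice(N, K, replace=False)
x[support_true] = rng.standard_normal(K)
y = A @ x + 0.01 * rng.standard_normal(M)   # y = Ax + z

best_support, best_residual = None, np.inf
for k in range(1, K + 1):               # iterate over candidate supports
    for support in itertools.combinations(range(N), k):
        A_s = A[:, list(support)]       # keep column i iff i is in the support
        coeffs, *_ = np.linalg.lstsq(A_s, y, rcond=None)   # least squares
        residual = np.linalg.norm(y - A_s @ coeffs)
        if residual < best_residual:
            best_support, best_residual = set(support), residual

print("true support:", sorted(support_true), " estimated:", sorted(best_support))
```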

More about ILP

. Integer linear programs can be shown to be NP-hard
– This means they are slow to solve exactly

. Another algorithmic approach – relaxation
– First ignore integer constraints, solve standard LP
– Next, “round” (sort of!) to nearby (not necessarily nearest) integer solution

. Various applications require integer solutions; we’re just skimming the surface

Non-Convex Optimization

Keywords: non-convex optimization

What’s the challenge?

. Many functions are non-convex
. Convex → one local min (it’s the global min)

. Non-convex – local min need not be global min
. Various algorithms could get stuck in a local min

Is it hopeless?

. Maybe initialize an algorithm many different ways; runs could get stuck in different local mins, choose the best
– But could be tons of local mins (especially in higher dimensions)

Markov chain Monte Carlo

. Markov chain Monte Carlo (MCMC) can solve some non-convex problems
. Form expression E(x) for energy (analogous to statistical physics)

. Distribution for signal: Pr(x) = (1/Z)·exp{−s·E(x)}
– s analogous to inverse temperature; Z is the normalization term
– Sample next version of x from this Gibbs distribution
– High temperature → small s → weak pull toward low energy
– Low temperature → large s → strong pull toward low energy
– Gradual cooling
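This gradual-cooling recipe is commonly implemented as simulated annealing; a minimal 1-D sketch with a made-up energy and cooling schedule:

```python
import math
import random

def energy(x):
    """Non-convex toy energy with several local minima (made up)."""
    return 0.1 * x ** 2 + math.sin(2 * x) + 1.5 * math.cos(3 * x)

random.seed(0)
x = 5.0                                   # arbitrary starting point
s = 0.1                                   # inverse temperature (starts "hot")
best_x, best_E = x, energy(x)
for step in range(20000):
    x_new = x + random.gauss(0, 0.5)      # propose a local move
    dE = energy(x_new) - energy(x)
    # Metropolis rule: always accept downhill, accept uphill with prob exp(-s*dE)
    if dE <= 0 or random.random() < math.exp(-s * dE):
        x = x_new
    if energy(x) < best_E:
        best_x, best_E = x, energy(x)
    s *= 1.0005                           # gradual cooling: slowly raise s
print("best x found:", best_x, "energy:", best_E)
```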

MCMC continued

1) Do we sample entire sequence x?

Not necessarily. Can consider re-generating one xi at a time; only need the conditional distribution of xi given the other entries

2) Is MCMC guaranteed to converge to global min? Maybe. If you cool down very slowly

3) So is it any good? Depends. MCMC is very slow but can converge to global min; some techniques to accelerate it

EM Algorithm

Keywords: expectation maximization algorithm, Gaussian mixture models, latent variables

Main ideas

. Iterate over expectation (E) & maximization (M)

. Expectation – create function for computing expected log likelihood (based on current parameters)

. Maximization – update parameters to maximize expected log likelihood from E step

. Details coming up

Statistical model & motivation

. Model generates data X
. Z – latent / missing values
. θ – parameter

. Likelihood: L(θ;X,Z)=Pr(X,Z|θ)

. Marginal likelihood: L(θ;X)=Pr(X|θ)=∫L(θ;X,Z)dZ – Might be intractable (e.g., due to many possible Z sequences)

. Want to compute L(θ;X), then optimize parameter – Computationally intractable – Motivates EM

Statistical model & motivation

. Expectation – compute expected value of log likelihood for parameter θ(t) in current iteration t

– Q(θ | θ(t)) = E_{Z|X,θ(t)} [ log L(θ; X, Z) ]
– Z typically discrete latent variables
– Given parameter θ(t), sequence Z can be found; typically via fast algorithm, e.g., dynamic programming

. Maximization: θ(t+1) = arg max_θ Q(θ | θ(t))

Example – Gaussian mixture models

. What’s a Gaussian mixture model (GMM)?
– X ~ Σi αi·𝒩(μi, σi)
– Component i has probability αi, mean μi, standard deviation σi
– Could be multi-dimensional data → covariance matrix Σi

. Useful? – Many distributions well-approximated by GMM – In principle can model almost everything as GMM – Trade-off between # components and model accuracy

Example continued

. Challenge – parameters (αi, μi, σi) often unavailable

. Must estimate from data X

. To keep it simple: N scalar samples X ∈ ℝ^N
. Latent variable Z ∈ ℤ^N; zn corresponds to the Gaussian component that xn belongs to
. E step: compute sequence Z given parameters θ = (αi, μi, σi)
. Optimize θ given Z
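A compact sketch of EM for a two-component scalar GMM; it uses soft responsibilities in the E step (rather than a hard Z sequence), and the data and initialization are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data from a 2-component scalar GMM (made-up parameters)
X = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for theta = (alpha_i, mu_i, sigma_i)
alpha = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for it in range(100):
    # E step: responsibility of component i for each sample (posterior over Z)
    w = np.vstack([a * gauss(X, m, s) for a, m, s in zip(alpha, mu, sigma)])
    w /= w.sum(axis=0, keepdims=True)
    # M step: update theta to maximize the expected log likelihood
    Nk = w.sum(axis=1)
    alpha = Nk / len(X)
    mu = (w @ X) / Nk
    sigma = np.sqrt((w * (X - mu[:, None]) ** 2).sum(axis=1) / Nk)

print("alpha:", alpha, "mu:", mu, "sigma:", sigma)
```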
