
Statistical Techniques in Robotics (16-831, F08)    Lecture #8 (Thursday, September 18)

Convex Optimization and Descent

Lecturer: Drew Bagnell    Scribe: Forrest Rogers-Marcovitz

1 Online Convex Programming

1.1 Review

Convex loss functions reduce the computation time needed for large sets of experts in the weighted expert algorithm. We want to minimize a convex loss $l(w)$, where $w$ is an element of a convex set $C$. In online convex programming there is a convex loss function $l_t(w)$ at each time step. Every expert $w_i$ is represented as a point within the convex set. We want to minimize our regret with respect to the best expert $w^*$:

$$R = \sum_{t=0}^{T} l_t(w_t) - \min_{w^* \in C} \sum_{t=0}^{T} l_t(w^*) \qquad (1)$$
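To make the definition concrete, here is a minimal sketch (not from the lecture) of computing regret in hindsight for a finite set of candidate experts; the loss values and the sequence of plays are invented for illustration.

```python
import numpy as np

# Hypothetical setup: losses[t][i] is the loss of candidate expert i at round t.
losses = np.array([
    [0.9, 0.4, 0.7],
    [0.2, 0.8, 0.5],
    [0.6, 0.3, 0.4],
])
played = [0, 2, 1]  # index of the expert the learner actually played each round

learner_loss = sum(losses[t, played[t]] for t in range(len(played)))
best_in_hindsight = losses.sum(axis=0).min()  # best single fixed expert
regret = learner_loss - best_in_hindsight
print(f"regret = {regret:.2f}")
```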

1.2 Examples

• Smart Thermometer: Set the temperature in the room based on feedback from the temperatures the person has set in the past. We want the algorithm, with weight vector $w$, to predict the desired temperature:

$$w^\top x \qquad (2)$$

Given possible features:

– $x_0$ = Is it summer? ($\mathbf{1}_{\text{summer}}$)

– $x_1$ = Amount of light

– $x_2$ = Is it Wednesday? ($\mathbf{1}_{\text{Wed}}$)

Each day the loss function is:

$$l_t(w) = (w^\top x_t - y_t)^2 \qquad (3)$$

where $y_t$ is the user-selected temperature that day; a small numerical sketch of this setup appears after this list.

• Tree Detection: Given good object detection up close (trees, in this example), we want to learn to recognize similar objects far away. For example, we may have good laser scan data at short range and want to extend detection to far-away objects using image characteristics (shape, texture, color). The algorithm tries to detect objects far away and then checks whether it was right once it gets closer; this is an example of self-supervised online learning. Because what we predict influences what we see in the future, regret might not be a good measure of performance, as it can change depending on the examples seen. We will return to this example later.
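As a concrete (hypothetical) instance of the thermometer setup, the sketch below evaluates the linear prediction $w^\top x_t$ from (2) and the squared loss (3) for one day; all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical features for one day: [is_summer, light_level, is_wednesday]
x_t = np.array([1.0, 0.8, 0.0])
w = np.array([3.0, 10.0, 0.5])   # current weights (made-up values)
y_t = 22.0                        # temperature the user actually set that day

prediction = w @ x_t              # w^T x_t: predicted temperature
loss = (prediction - y_t) ** 2    # squared loss l_t(w) = (w^T x_t - y_t)^2
print(f"prediction = {prediction:.1f}, loss = {loss:.2f}")
```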

2 Subgradients

2.1 Definition

Subgradient methods are algorithms for solving convex optimization problems. They can be used even when the objective function is non-differentiable. When the objective function is differentiable, subgradient methods for unconstrained problems use the same search direction as the method of steepest descent. A subgradient at a point $x$ is a vector $\nabla f_x$ whose linear approximation lower-bounds the function globally. Convex functions have subgradients at every point. For any point $y$ in the set, we know that:

$$f(y) \geq \nabla f_x^\top (y - x) + f(x) \qquad (4)$$
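As a quick numerical illustration (not from the lecture): $f(x) = |x|$ is convex but non-differentiable at 0, where any $g \in [-1, 1]$ is a valid subgradient. The sketch below verifies inequality (4) on a grid of points.

```python
import numpy as np

def subgradient_abs(x):
    """One valid subgradient of f(x) = |x|: sign(x) away from 0, and 0 at x = 0."""
    return float(np.sign(x))

f = abs
x0 = 0.0
g = subgradient_abs(x0)

# Verify the subgradient inequality f(y) >= g*(y - x0) + f(x0) at several points.
for y in np.linspace(-2.0, 2.0, 9):
    assert f(y) >= g * (y - x0) + f(x0) - 1e-12
print("subgradient inequality (4) holds at all tested points")
```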

2.2 Subgradients for online learning

Given a convex loss function, we can use subgradients to bound the regret of online learning.

Proof.

$$l_t(w^*) \geq \nabla l_t(w_t)^\top (w^* - w_t) + l_t(w_t) \qquad (5)$$

$$l_t(w_t) - l_t(w^*) \leq \nabla l_t(w_t)^\top (w_t - w^*) \qquad (6)$$

$$R \leq \sum_{t=0}^{T} \left[ l_t(w_t) - l_t(w^*) \right] \leq \sum_{t=0}^{T} \nabla l_t(w_t)^\top (w_t - w^*) \qquad (7)$$

We see that regret is bounded by a linear function. This means that linear (flat) functions are the hardest to optimize against. Also, regret is highest when the subgradient is parallel to $w_t - w^*$, i.e., pointing directly away from the optimum. The next algorithm will take advantage of this fact.
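A quick numerical check (with invented numbers) of why linear losses are the hard case: for a linear loss $l(w) = g^\top w$, inequality (6) holds with equality, so the linear bound cannot be improved.

```python
import numpy as np

g = np.array([1.0, -2.0])       # gradient of a linear loss l(w) = g^T w
w_t = np.array([0.5, 0.5])
w_star = np.array([0.0, 1.0])

lhs = g @ w_t - g @ w_star      # l(w_t) - l(w*)
rhs = g @ (w_t - w_star)        # the linear bound from (6)
print(lhs, rhs)                 # identical: the bound is tight for linear losses
```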

3 Projected Subgradient Descent

3.1 Algorithm

This algorithm is a method to minimize the regret for an online convex optimization problem.

Algorithm 1 Projected Subgradient Descent
1: Predict $w_0$
2: for t = 1 ... T do
3:   Receive $l_t(w_t)$ and $\nabla l_t(w_t)$
4:   $\hat{w}_{t+1} \leftarrow w_t - \alpha \nabla l_t(w_t)$
5:   $w_{t+1} \leftarrow \mathrm{Proj}_C[\hat{w}_{t+1}]$
6: end for

Line 5 projects the weights back into the convex set. Depending on the convex set, an exact projection may be difficult to compute.

In line 4, $\alpha$ is the step size, or learning rate. Large values of $\alpha$ learn faster but are less likely to converge, since they may step over the minimum. Smaller values pay a larger upfront cost but are more likely to converge and achieve lower regret over time. There are various schemes in which $\alpha_t$ decreases over time.
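Below is a minimal runnable sketch of the algorithm (not the lecture's code), assuming the squared loss from the examples above and a Euclidean ball as the convex set C, which makes the projection step a simple rescaling; the constants, data, and function names are illustrative.

```python
import numpy as np

def project_ball(w, radius=1.0):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}, one easy choice of C."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_subgradient_descent(stream, alpha=0.1, dim=3):
    """Online projected subgradient descent under the squared loss (w^T x - y)^2."""
    w = np.zeros(dim)
    for x_t, y_t in stream:
        grad = 2.0 * (w @ x_t - y_t) * x_t   # subgradient of the squared loss
        w = project_ball(w - alpha * grad)   # gradient step, then project back into C
    return w

# Hypothetical data stream: noisy linear observations of a true weight vector.
rng = np.random.default_rng(0)
w_true = np.array([0.3, -0.2, 0.5])
stream = [(x, x @ w_true + 0.01 * rng.normal()) for x in rng.normal(size=(200, 3))]
print(projected_subgradient_descent(stream))   # should approach w_true
```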

3.2 Examples

• Tree Prediction: Using the example described above, we are given the following state features:

– $x_1$ = red, $x_2$ = blue, $x_3$ = green

– $x_4$ = point spread

– $x_5$ = response of a Gabor filter on an image patch (texture)

– $y_t \in \{1, -1\}$ (tree or no tree)

Using the loss function

$$l_t(w) = (w_t^\top x_t - y_t)^2 \qquad (8)$$

we can use the following update rule:

$$w_{t+1} \leftarrow w_t - 2\alpha (w_t^\top x_t - y_t)\, x_t \qquad (9)$$

where $(w_t^\top x_t - y_t)$ is the residual.

• Portfolio Optimization: We are given a set of investment weights $w_i$ and daily market return ratios $r_i$ such that:

$$w_i \geq 0 \qquad (10)$$

$$\sum_i w_i = 1 \qquad (11)$$

The daily increase in wealth is

$$w_t^\top r_t \qquad (12)$$

and the total wealth over time is

$$\prod_{t=1}^{T} w_t^\top r_t \qquad (13)$$

Since the log turns this product into a sum, we want to maximize the log-gain function:

$$\sum_{t=1}^{T} \log(w_t^\top r_t) \qquad (14)$$

We will compare the policy to a constantly rebalancing portfolio, which adjusts its investments each day to maintain constant investment ratios. We make the following adjustments to the Projected Subgradient Descent algorithm:

– $w_0^i = \frac{1}{N}$

– $w_{t+1}^i \leftarrow w_t^i + \alpha \dfrac{r_t^i}{\sum_j w_t^j r_t^j}$

– Project each $w_{t+1}$ back onto the simplex (see the sketch after the caveats below)

The following problems exist with this algorithm, so you won't get "fabulously wealthy":

– A constantly rebalanced portfolio may not be the best policy over time

– Fixed transaction costs add up

– Large trades will affect the stock's market price
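A minimal sketch of these adjustments (not the lecture's code): gradient ascent on $\log(w^\top r_t)$, whose gradient is $r_t / (w^\top r_t)$, followed by a standard sort-based Euclidean projection onto the simplex. The step size and return data below are invented.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the simplex {w : w_i >= 0, sum_i w_i = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def rebalance(returns, alpha=0.05):
    """Online subgradient ascent on log(w^T r_t), projected back to the simplex."""
    n_days, n_assets = returns.shape
    w = np.full(n_assets, 1.0 / n_assets)     # w_0^i = 1/N in every asset
    wealth = 1.0
    for r_t in returns:
        wealth *= w @ r_t                     # daily gain w_t^T r_t
        grad = r_t / (w @ r_t)                # gradient of log(w^T r_t)
        w = project_simplex(w + alpha * grad) # ascent step, then project
    return w, wealth

# Hypothetical daily return ratios for 3 assets over 5 days.
r = np.array([
    [1.01, 0.99, 1.00],
    [1.02, 0.98, 1.00],
    [0.99, 1.03, 1.00],
    [1.01, 1.00, 1.00],
    [1.00, 1.02, 1.00],
])
w_final, wealth = rebalance(r)
print(w_final, wealth)
```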
