Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium
Avrim Blum

High level
• Last time we discussed the notion of Nash equilibrium. It is a static concept: a set of probability distributions (p, q, …) such that nobody has any incentive to deviate.
• But it doesn't say how the system would get there. Troubling that even finding one can be hard in large games.
• What if agents adapt (learn) in ways that are well-motivated in terms of their own rewards? What can we say about the system then?

High level
Today:
• Online learning guarantees that are achievable when acting in a changing and unpredictable environment.
• What happens when two players in a zero-sum game both use such strategies?
  – Approach minimax optimality.
  – Gives an alternative proof of the minimax theorem.
• What happens in a general-sum game?
  – Approaches (the set of) correlated equilibria.

Consider the following setting…
• Each morning, you need to pick one of N possible routes to drive to work. [figure: map of candidate routes; the route past "Robots R Us" currently shows 32 min]
• But traffic is different each day. Not clear a priori which route will be best. When you get there you find out how long your route took. (And maybe how long the others took too, or maybe not.)
• Is there a strategy for picking routes so that in the long run, whatever the sequence of traffic patterns has been, you've done nearly as well as the best fixed route in hindsight? (In expectation, over the internal randomness of the algorithm.)
• Yes.

"No-regret" algorithms for repeated decisions
A bit more generally:
• The algorithm has N options. The world chooses a cost vector. Can view this as a matrix (maybe with an infinite number of columns): the algorithm picks rows, the world ("life", "fate") picks columns.
• At each time step, the algorithm picks a row, life picks a column.
• The alg pays the cost of the action chosen.
• The alg gets the whole column as feedback (or just its own cost, in the "bandit" model).
• Need to assume some bound on the max cost. Let's say all costs are between 0 and 1.
• Define the average regret in T time steps as:
  (avg per-day cost of alg) – (avg per-day cost of best fixed row in hindsight).
• We want this to go to 0 (or better) as T gets large. [Such an algorithm is called a "no-regret" algorithm.]

Some intuition & properties of no-regret algs
• Let's look at a small example: two rows (two routes to dest), and each day's column gives one of them cost 1 and the other cost 0:
  [matrix: columns are (1, 0) and (0, 1)]
• Note: we are not trying to compete with the best adaptive strategy – just the best fixed row (path) in hindsight.
• No-regret algorithms can do much better than playing minimax optimal, and never much worse.
• The existence of no-regret algs yields an immediate proof of the minimax theorem (will see in a bit).

History and development (abridged)
• [Hannan '57, Blackwell '56]: an alg with avg regret O((N/T)^{1/2}).
• Re-phrasing: need only T = O(N/ε²) steps to get the time-average regret down to ε. (Will call this quantity T_ε.)
• Optimal dependence on T (or ε). Game theorists viewed the number of rows N as a constant, not as important as T, so the problem looked pretty much done.

Why is the dependence on T optimal? Consider the two-row example above:
• Say the world flips a fair coin each day to decide which row gets cost 1.
• Any alg, over T days, has expected cost T/2.
• But E[min(#heads, #tails)] = T/2 – Θ(√T).
• So, the expected average per-day gap is Θ(1/√T). (See the simulation sketch below.)
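To make the regret definition and the coin-flip argument concrete, here is a minimal simulation sketch in Python. The function names and the uniformly random "algorithm" are illustrative choices, not anything from the lecture: it measures the gap between the algorithm's average per-day cost and that of the best fixed row in hindsight, in the two-row fair-coin world above, and the gap shrinks roughly like 1/√T.

```python
import random

def average_regret(costs, choices):
    """costs[t][i] = cost of action i on day t; choices[t] = action the algorithm picked on day t."""
    T = len(costs)
    alg_cost = sum(costs[t][choices[t]] for t in range(T))
    best_fixed = min(sum(costs[t][i] for t in range(T)) for i in range(len(costs[0])))
    return (alg_cost - best_fixed) / T

def coin_flip_gap(T, trials=200):
    """Average per-day regret of a uniformly random algorithm in the fair-coin world."""
    total = 0.0
    for _ in range(trials):
        # Each day the world flips a fair coin: cost 1 on one action, cost 0 on the other.
        costs = [(1, 0) if random.random() < 0.5 else (0, 1) for _ in range(T)]
        choices = [random.randrange(2) for _ in range(T)]
        total += average_regret(costs, choices)
    return total / trials

for T in (100, 400, 1600, 6400):
    # The gap should shrink roughly like a constant times 1/sqrt(T).
    print(T, round(coin_flip_gap(T), 4))
```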
History and development (abridged), continued
• Learning theory, 80s-90s: to think about the dependence on N, look at the problem of "combining expert advice". Imagine a large class C of N prediction rules; perform (nearly) as well as the best f ∈ C.
• [Littlestone-Warmuth '89]: the Weighted Majority algorithm.
  – E[cost] ≤ OPT(1+ε) + ε^{-1} log N.
  – Regret O(((log N)/T)^{1/2}); T_ε = O((log N)/ε²).
• Optimal as a function of N too, plus lots of work on exact constants, 2nd-order terms, etc. [CFHHSW '93] …
• Extensions to the bandit model (adds an extra factor of N).

Using "expert" advice
• Say we want to predict the stock market.
• We solicit n "experts" for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction.
• Basic question: is there a strategy that allows us to do nearly as well as the best of these experts in hindsight?
• ["Expert" here just means someone with an opinion, not necessarily someone who knows anything.]

Simpler question
• We have n "experts".
• One of these is perfect (never makes a mistake). We just don't know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?
• Answer: sure. Just take a majority vote over all experts that have been correct so far. Each mistake we make cuts the number of available experts by a factor of 2. Note: this means it is fine for n to be very large. [The "halving algorithm".]

What if no expert is perfect?
• One idea: just run the above protocol until all experts are crossed off, then repeat.
• This makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).
• Seems wasteful: we are constantly forgetting what we've "learned". Can we do better?

Weighted Majority Algorithm
• Intuition: making a mistake doesn't completely disqualify an expert. So, instead of crossing it off, just lower its weight.
• Weighted Majority Alg:
  – Start with all experts having weight 1.
  – Predict based on a weighted majority vote.
  – Penalize mistakes by cutting the weight in half.

Analysis: do nearly as well as the best expert in hindsight
• M = # mistakes we've made so far.
• m = # mistakes the best expert has made so far.
• W = total weight (starts at n).
• After each of our mistakes, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• The weight of the best expert is (1/2)^m. So (1/2)^m ≤ n(3/4)^M, which gives M ≤ 2.4(m + lg n).
• So, if m is small, then M is pretty small too.

Randomized Weighted Majority
• 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
• Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight says up and 30% says down, then predict up with probability 0.7 and down with probability 0.3.) Idea: smooth out the worst case.
• Also, generalize the penalty factor ½ to 1 - ε.
• Now let M = the expected number of mistakes. Unlike most worst-case bounds, the constants here are pretty good. (A short code sketch of this algorithm appears after the summary below.)

Analysis
• Say at time t we have a fraction F_t of the weight on experts that make a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight:
  – W_final = n(1 - εF_1)(1 - εF_2)…
  – ln(W_final) = ln(n) + Σ_t ln(1 - εF_t) ≤ ln(n) - ε Σ_t F_t   (using ln(1-x) < -x)
  –            = ln(n) - εM.   (Σ_t F_t = E[# mistakes] = M)
• If the best expert makes m mistakes, then ln(W_final) > ln((1-ε)^m).
• Now solve: ln(n) - εM > m ln(1-ε).

Summarizing
• E[# mistakes] ≤ (1+ε)m + ε^{-1} log(n).
• If we set ε = (log(n)/m)^{1/2} to balance the two terms (or use guess-and-double), we get the bound E[mistakes] ≤ m + 2(m·log n)^{1/2}.
• Since m ≤ T, this is at most m + 2(T·log n)^{1/2}.
• So, avg regret ≤ 2(T·log n)^{1/2}/T → 0.

What can we use this for?
• Can use it to combine multiple algorithms and do nearly as well as the best one in hindsight.
• But what about cases like choosing paths to drive to work, where the "experts" are different actions, not different predictions?
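Here is a minimal sketch of the Randomized Weighted Majority predictor described above. The variable names, the 0/1 encoding of up/down, and the tiny data set are my own illustrative assumptions: weights start at 1, each wrong expert's weight is multiplied by (1 - ε), and the prediction is drawn with probability proportional to the weight on each side. The analysis above says E[# mistakes] ≤ (1+ε)m + ε^{-1} log(n), with ε ≈ (log(n)/m)^{1/2} giving roughly m + 2(m·log n)^{1/2}.

```python
import random

def rwm_predict(expert_preds, outcomes, eps=0.1):
    """expert_preds[t] = tuple of 0/1 predictions, one per expert; outcomes[t] = true 0/1 label."""
    n = len(expert_preds[0])
    w = [1.0] * n                      # all experts start with weight 1
    mistakes = 0
    for preds, truth in zip(expert_preds, outcomes):
        total = sum(w)
        # Probability of predicting 1 = fraction of the total weight on experts predicting 1.
        p_one = sum(wi for wi, p in zip(w, preds) if p == 1) / total
        guess = 1 if random.random() < p_one else 0
        mistakes += (guess != truth)
        # Penalize each expert that was wrong this round by a factor (1 - eps).
        w = [wi * (1 - eps) if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes

# Tiny hypothetical run: expert 0 happens to be right every round in this toy data.
preds = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (0, 1, 0)]
truth = [1, 0, 1, 0]
print(rwm_predict(preds, truth, eps=0.2))
```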
Game-theoretic version
• What if the experts are actions? (Paths in a network, rows in a matrix game, …)
• At each time t, each action has a loss (cost) in {0,1}.
• We can still run the algorithm:
  – Rather than viewing it as "pick a prediction with probability proportional to its weight",
  – view it as "pick an expert with probability proportional to its weight":
  – choose expert i with probability p_i = w_i / Σ_j w_j.
• The same analysis applies.

Game-theoretic version
• What if the losses (costs) are in [0,1]?
• If expert i has cost c_i, do: w_i ← w_i(1 - εc_i).
• Our expected cost = Σ_i c_i w_i / W.
• Amount of weight removed = ε Σ_i w_i c_i.
• So, the fraction of weight removed = ε · (our expected cost).
• The rest of the proof continues as before… (A code sketch of this version appears at the end of this section.)

Illustration
[figure: the rows-vs-days matrix for the driving example; after day t, action i's weight becomes w_i(1 - εc_i^1)(1 - εc_i^2)…(1 - εc_i^t), with travel times scaled so that all costs lie in [0,1]]

• Guarantee: E[cost] ≤ OPT + 2(OPT·log n)^{1/2}.
• Since OPT ≤ T, this is at most OPT + 2(T·log n)^{1/2}.
• So, regret per time step ≤ 2(T·log n)^{1/2}/T → 0.
• So, now we can drive to work! (Assuming full feedback.)

Connections to Minimax Optimality

Minimax-optimal strategies
• Can solve for minimax-optimal strategies using linear programming.
• Claim: against any sequence of opponent plays, no-regret strategies will do nearly as well as minimax optimal, or better.
  – They do nearly as well as the best fixed choice in hindsight.
  – This implies doing nearly as well as the best distribution in hindsight.
  – This implies doing nearly as well as minimax optimal!
• Example payoff matrix (row payoff, column payoff):
              Left       Right
    Left     (½, -½)    (1, -1)
    Right    (1, -1)    (0, 0)

Proof of minimax thm using RWM
• Suppose for contradiction it was false.

Proof contd
• Now, consider playing randomized weighted-majority…
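Tying back to the driving-to-work example, here is a minimal sketch of the action version with costs in [0,1] described in the "Game-theoretic version" slides above. The variable names, the toy traffic data, and the interface are my own assumptions: pick action i with probability w_i / Σ_j w_j, observe the full cost vector (full-feedback model), and update w_i ← w_i(1 - εc_i). Comparing the returned total cost with the best fixed action's cost illustrates the guarantee E[cost] ≤ OPT + 2(OPT·log n)^{1/2}.

```python
import random

def rwm_actions(cost_vectors, eps=0.1):
    """cost_vectors[t] = tuple of per-action costs in [0,1] on day t (full-feedback model)."""
    n = len(cost_vectors[0])
    w = [1.0] * n                      # one weight per action (e.g., per route to work)
    total_cost = 0.0
    for costs in cost_vectors:
        total = sum(w)
        # Sample an action with probability proportional to its weight.
        r = random.random() * total
        i, acc = 0, w[0]
        while acc < r:
            i += 1
            acc += w[i]
        total_cost += costs[i]
        # Multiplicative update: w_i <- w_i * (1 - eps * c_i); costlier actions are penalized more.
        w = [wi * (1 - eps * c) for wi, c in zip(w, costs)]
    best_fixed = min(sum(day[j] for day in cost_vectors) for j in range(n))
    return total_cost, best_fixed

# Hypothetical traffic data: 3 routes over 5 days, travel times already scaled to [0,1].
days = [(0.2, 0.9, 0.5), (0.3, 0.8, 0.6), (0.1, 0.7, 0.9), (0.4, 0.6, 0.5), (0.2, 0.9, 0.4)]
print(rwm_actions(days, eps=0.2))   # (algorithm's total cost, best fixed route's total cost)
```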