
Exploiting the Surrogate Gap in Online Multiclass Classification

Dirk van der Hoeven

Leiden University

This talk

1. Brief introduction: Online Multiclass Classification
2. My contributions
3. Some intuition about my contributions
4. Future work


Example Application: Full Information

Football prediction: ADO Den Haag versus AFC Ajax. We know that:
- ADO plays at home
- there are 0 supporters for either side
- players of ADO Den Haag are valued at 11.35 million euros
- players of AFC Ajax are valued at 288.85 million euros

Who will win? Probably AFC Ajax, but not absolutely certain.

If we gather this information for all Eredivisie games, can we predict perfectly? Probably not.

Important: regardless of what we predict, we will see the true outcome


Setting: Full Information

The online multiclass classification setting proceeds in rounds $t = 1, \ldots, T$. In each round $t$:

1. the environment picks an outcome $y_t \in \{1, \ldots, K\}$ and reveals a feature vector $x_t \in \mathbb{R}^d$ to the learner
2. the learner issues a (randomized) prediction $\hat{y}_t$
3. the environment reveals $y_t$
4. the learner suffers $\mathbf{1}[y_t \neq \hat{y}_t]$

Goal: minimize the expected surrogate regret $R_T$
\[
R_T = \mathbb{E}\Big[\sum_{t=1}^T \mathbf{1}[y_t \neq \hat{y}_t]\Big] - \min_{U} \sum_{t=1}^T \ell(\underbrace{\langle U, x_t \rangle}_{\text{margin}}, y_t)
\]
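As a point of reference, here is a minimal Python sketch of the protocol loop just described; the `learner` and `environment` objects and their methods are hypothetical stand-ins used only for illustration, not part of the paper.

```python
def run_full_information(learner, environment, T):
    """Run the full-information protocol: the environment fixes (x_t, y_t),
    the learner predicts, and the true label is revealed afterwards."""
    mistakes = 0
    for t in range(T):
        x_t, y_t = environment.round(t)   # outcome is fixed before the prediction
        y_hat = learner.predict(x_t)      # possibly randomized prediction
        mistakes += int(y_hat != y_t)     # zero-one loss suffered by the learner
        learner.update(x_t, y_t)          # full information: the true label is revealed
    return mistakes
```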

Surrogate Loss with K = 2

[Figure: the surrogate loss and the zero-one loss plotted against the margin (range -1 to 1); if the margin is positive predict 1, otherwise predict 2.]

Is regret a reasonable measure of performance?

Goal: minimize the expected surrogate regret $R_T$
\[
R_T = \mathbb{E}\Big[\sum_{t=1}^T \mathbf{1}[y_t \neq \hat{y}_t]\Big] - \underbrace{\min_{U} \sum_{t=1}^T \ell(\langle U, x_t \rangle, y_t)}_{=\,0 \text{ if the model predicts perfectly}}
\]

Translation: I want to be close to the performance of the best offline version of the model.


Example Application: Bandit Setting

You have to suggest a movie to a friend from a list of movies you like (and suppose there is only one movie in this list your friend will like). You know:
- your friend's height
- your friend's weight
Can you give the perfect suggestion? Probably not.

What is your strategy for the suggestion? Mine: randomly select one movie, since I don't have any useful information.

Important: You don’t get feedback about what you should have suggested

Setting: Bandit

The bandit online multiclass classification setting proceeds in rounds $t = 1, \ldots, T$. In each round $t$:

1. the environment picks an outcome $y_t \in \{1, \ldots, K\}$ and reveals a feature vector $x_t$ to the learner
2. the learner issues a (randomized) prediction $\hat{y}_t$
3. the environment reveals $\mathbf{1}[y_t \neq \hat{y}_t]$
4. the learner suffers $\mathbf{1}[y_t \neq \hat{y}_t]$

Goal: minimize the expected surrogate regret $R_T$
\[
R_T = \mathbb{E}\Big[\sum_{t=1}^T \mathbf{1}[y_t \neq \hat{y}_t]\Big] - \min_{U} \sum_{t=1}^T \ell(\underbrace{\langle U, x_t \rangle}_{\text{margin}}, y_t)
\]
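The same interaction loop with bandit feedback, again as a rough Python sketch with hypothetical `learner` and `environment` interfaces: the only change is that the learner is told whether it erred, never the true label.

```python
def run_bandit(learner, environment, T):
    """Run the bandit protocol: identical to the full-information loop,
    except that the learner only observes whether its own prediction was wrong."""
    mistakes = 0
    for t in range(T):
        x_t, y_t = environment.round(t)
        y_hat = learner.predict(x_t)
        mistake = int(y_hat != y_t)
        learner.update(x_t, y_hat, mistake)   # bandit feedback: only 1[y_t != y_hat]
        mistakes += mistake
    return mistakes
```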

Results 1

Results for the full information setting

Algorithm                 R_T                      Time (per round)
Standard first-order      O(‖U‖ √T)                O(dK)
Standard second-order     O(e^‖U‖ dK ln(T))        O((dK)²)
Gaptron                   O(K ‖U‖²)                O(dK)

Results 2

Results for the bandit setting

Algorithm                 E[R_T]                   Time (per round)
Standard first-order      O(K^(1/3) T^(2/3))       O(dK)
Standard second-order     O(K √(dT ln(T)))         O((dK)²)
Gaptron                   O(K √T)                  O(dK)


Old Analysis: Upper Bound Zero-One Loss with Surrogate

  T T X 1 X 1  [yt 6=y ˆt ] − `(hU, xt i, yt ) =  [yt 6=y ˆt ] − `(hWt , xt i, yt ) t=1 t=1 | {z } Very wastefull: bound by 0 T ! X + `(hWt , xt i, yt ) − `(hU, xt i, yt ) t=1 T X ≤ `(hWt , xt i, yt ) − `(hU, xt i, yt ) t=1 | {z } controlled by OGD T kUk2 η X ≤ + k∇ `(hW , x i, y )k2 2η 2 Wt t t t √ t=1 ≤kUkX T

How to improve upon standard methods?

[Figure: the surrogate loss and the zero-one loss plotted against the margin; if the margin is positive predict 1, otherwise predict 2.]

Key Idea: when uncertain, randomize.

[Figure: the surrogate loss, the zero-one loss, and Gaptron's expected loss plotted against the margin; Gaptron adds randomness to exploit the surrogate gap.]


Gaptron Analysis: first steps

\[
\begin{aligned}
\sum_{t=1}^T \Big(\underbrace{\mathbb{E}\big[\mathbf{1}[y_t \neq \hat{y}_t]\big]}_{\text{Gaptron is random}} - \ell(\langle U, x_t\rangle, y_t)\Big)
&= \sum_{t=1}^T \Big(\mathbb{E}\big[\mathbf{1}[y_t \neq \hat{y}_t]\big] - \ell(\langle W_t, x_t\rangle, y_t)\Big)
 + \underbrace{\sum_{t=1}^T \ell(\langle W_t, x_t\rangle, y_t) - \ell(\langle U, x_t\rangle, y_t)}_{\text{controlled by OGD}} \\
&\le \frac{\|U\|^2}{2\eta}
 + \underbrace{\sum_{t=1}^T \Big(\mathbb{E}\big[\mathbf{1}[y_t \neq \hat{y}_t]\big] + \frac{\eta}{2}\|\nabla_{W_t}\ell(\langle W_t, x_t\rangle, y_t)\|^2 - \ell(\langle W_t, x_t\rangle, y_t)\Big)}_{\text{smaller than the surrogate loss!}}
\end{aligned}
\]

Gaptron Analysis: is that really smaller than surrogate loss?

[Figure: the surrogate loss and Gaptron's expected loss plotted against the margin; the shaded area represents η times the norm of the gradient.]


Gaptron Analysis: wow, that is very useful!

\[
\begin{aligned}
\sum_{t=1}^T \Big(\mathbb{E}\big[\mathbf{1}[y_t \neq \hat{y}_t]\big] - \ell(\langle U, x_t\rangle, y_t)\Big)
&\le \frac{\|U\|^2}{2\eta}
 + \underbrace{\sum_{t=1}^T \Big(\mathbb{E}\big[\mathbf{1}[y_t \neq \hat{y}_t]\big] + \frac{\eta}{2}\|\nabla_{W_t}\ell(\langle W_t, x_t\rangle, y_t)\|^2 - \ell(\langle W_t, x_t\rangle, y_t)\Big)}_{\text{smaller than the surrogate loss!}} \\
&\le K X^2 \|U\|^2
\end{aligned}
\]

Full information before: $\|U\|\,X\sqrt{T}$ regret

Full information now: $K X^2 \|U\|^2$ regret


Bandit before: $O(K\sqrt{dT\ln(T)})$ regret with $O((dK)^2)$ runtime

Bandit now: $O(K\sqrt{T})$ regret with $O(dK)$ runtime (very cool in high-dimensional applications)

A closer look at Gaptron

Input: learning rate $\eta > 0$, exploration rate $\gamma \in [0, 1]$, and gap map $a : \mathbb{R}^{K \times d} \times \mathbb{R}^d \to [0, 1]$
1: Initialize $W_1 = 0$
2: for $t = 1, \ldots, T$ do
3:   Obtain $x_t$
4:   Let $y_t^\star = \arg\max_k \langle W_t^k, x_t \rangle$
5:   Set $p_t' = (1 - \max\{a(W_t, x_t), \gamma\})\, e_{y_t^\star} + \max\{a(W_t, x_t), \gamma\}\,\frac{1}{K}\mathbf{1}$
6:   Predict with label $\hat{y}_t \sim p_t'$
7:   Obtain $\mathbf{1}[\hat{y}_t \neq y_t]$ and set $g_t = \nabla \ell_t(W_t)$
8:   Update $W_{t+1} = \arg\min_{W \in \mathcal{W}} \eta\langle g_t, W\rangle + \frac{1}{2}\|W - W_t\|^2$
9: end for
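A minimal Python sketch of this loop with full-information feedback, assuming the caller supplies the gap map `gap_map` and the surrogate gradient `surrogate_grad` (the paper derives concrete choices for these); the projection onto the comparator set W is omitted for brevity.

```python
import numpy as np

def gaptron(X, Y, K, eta, gamma, gap_map, surrogate_grad, seed=0):
    """Gaptron sketch, full-information feedback.
    gap_map(W, x) returns a value in [0, 1]; surrogate_grad(W, x, y) returns the
    gradient of the surrogate loss at W. Both are caller-supplied assumptions."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    W = np.zeros((K, d))
    mistakes = 0
    for t in range(T):
        x, y = X[t], int(Y[t])
        y_star = int(np.argmax(W @ x))            # greedy label y_t*
        mix = max(gap_map(W, x), gamma)           # randomization weight max{a, gamma}
        p = np.full(K, mix / K)                   # uniform exploration part
        p[y_star] += 1.0 - mix                    # remaining mass on the greedy label
        y_hat = int(rng.choice(K, p=p))           # randomized prediction
        mistakes += int(y_hat != y)
        W -= eta * surrogate_grad(W, x, y)        # gradient step on the surrogate
    return mistakes, W
```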

Dirk van der Hoeven (Leiden University) Exploiting the Surrogate Gap 18 / 30 Choosing the right a ensures that the expected loss of Gaptron plus the norm of the gradient is smaller than the surrogate loss. If a(W , x) = 0 we recover standard algorithms such as the Perceptron. 1 If γ > 0 we sample any outcome with probability at least γ K , which is important for the bandit setting

A closer look at Gaptrons predictions

0 1 p = (1 − max{a(W , x ), γ})e ? + max{a(W , x ), γ} 1 t t t yt t t K | {z ? } | {z } I think the outcome is yt But I am not certain

Dirk van der Hoeven (Leiden University) Exploiting the Surrogate Gap 19 / 30 1 If γ > 0 we sample any outcome with probability at least γ K , which is important for the bandit setting

A closer look at Gaptrons predictions

0 1 p = (1 − max{a(W , x ), γ})e ? + max{a(W , x ), γ} 1 t t t yt t t K | {z ? } | {z } I think the outcome is yt But I am not certain Choosing the right a ensures that the expected loss of Gaptron plus the norm of the gradient is smaller than the surrogate loss. If a(W , x) = 0 we recover standard algorithms such as the Perceptron.

A closer look at Gaptron's predictions

\[
p_t' = \underbrace{(1 - \max\{a(W_t, x_t), \gamma\})\, e_{y_t^\star}}_{\text{I think the outcome is } y_t^\star} + \underbrace{\max\{a(W_t, x_t), \gamma\}\,\tfrac{1}{K}\mathbf{1}}_{\text{but I am not certain}}
\]

- Choosing the right $a$ ensures that the expected loss of Gaptron plus the norm of the gradient is smaller than the surrogate loss.
- If $a(W, x) = 0$ we recover standard algorithms such as the Perceptron.
- If $\gamma > 0$ we sample any outcome with probability at least $\gamma \frac{1}{K}$, which is important for the bandit setting.

Dirk van der Hoeven (Leiden University) Exploiting the Surrogate Gap 19 / 30 Rest of the paper: finding the correct a, η, and γ to bound the surrogate gap by 0

Main Lemma of the paper

Lemma

For any U ∈ W Gaptron satisfies

" T T # X X E 1[ˆyt 6= yt ] − `t (U) t=1 t=1 kUk2 K − 1 ≤ + γ T 2η K T X  K − 1 η  + (1 − a )1[y ? 6= y ] + a − ` (W ) + kg k2 . E t t t t K t t 2 t t=1 | {z } surrogate gap

Main Lemma of the paper

Lemma

For any $U \in \mathcal{W}$, Gaptron satisfies
\[
\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^T \mathbf{1}[\hat{y}_t \neq y_t] - \sum_{t=1}^T \ell_t(U)\Big]
&\le \frac{\|U\|^2}{2\eta} + \gamma\,\frac{K-1}{K}\,T \\
&\quad + \sum_{t=1}^T \mathbb{E}\Big[\underbrace{(1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\,\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta}{2}\|g_t\|^2}_{\text{surrogate gap}}\Big].
\end{aligned}
\]

Rest of the paper: finding the correct a, η, and γ to bound the surrogate gap by 0

Dirk van der Hoeven (Leiden University) Exploiting the Surrogate Gap 20 / 30 A function f is H-smooth if H f (x + z) ≤ f (x) + h∇f (x), zi + kzk2. 2 2 ? Let Ut = arg minW `t (W ). For smooth surrogate losses we have

2 ? k∇`t (Wt )k2 ≤ H(`t (Wt ) − `t (Ut )) = H`t (Wt )

Choosing at and η for smooth losses in 2 dimensions

In 2 dimensions yt ∈ {−1, +1} and `t (W ) = `(hW , xt iyt ).

Choosing a_t and η for smooth losses in 2 dimensions

In 2 dimensions $y_t \in \{-1, +1\}$ and $\ell_t(W) = \ell(\langle W, x_t\rangle y_t)$.

A function $f$ is $H$-smooth if
\[
f(x + z) \le f(x) + \langle \nabla f(x), z\rangle + \frac{H}{2}\|z\|_2^2.
\]
Let $U_t^\star = \arg\min_W \ell_t(W)$. For smooth surrogate losses we have
\[
\|\nabla \ell_t(W_t)\|_2^2 \le H\big(\ell_t(W_t) - \ell_t(U_t^\star)\big) = H\,\ell_t(W_t).
\]

Choosing a_t and η for smooth losses

For smooth surrogate losses we have:
\[
(1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta}{2}\|g_t\|^2
\le (1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta H}{2}\ell_t(W_t).
\]

Picking $a_t = \ell_t^\star(W_t) = \ell(\langle W_t, x_t\rangle y_t^\star)$ and $\eta = \frac{2}{HK}$ we have
\[
(1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta}{2}\|g_t\|^2
\le (1 - \ell_t^\star(W_t))\mathbf{1}[y_t^\star \neq y_t] + \frac{K-1}{K}\ell_t^\star(W_t) - \frac{K-1}{K}\ell_t(W_t).
\]
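In the binary case this choice of gap map is essentially one line of code: evaluate the surrogate at the margin of the greedy prediction. The sketch below is an illustration only; `loss` is any $H$-smooth surrogate chosen by the caller (assumed here to take values in $[0, 1]$ at nonnegative margins), and `W` is the single weight vector of the binary case.

```python
def binary_gap_map(W, x, loss):
    """Gap map for the binary (K = 2) smooth-loss case:
    a(W, x) = loss(<W, x> * y_star) with y_star = sign(<W, x>),
    i.e. the surrogate loss the model would pay if its own greedy
    prediction were the true label."""
    margin = float(W @ x)
    y_star = 1.0 if margin > 0 else -1.0   # greedy label in {-1, +1}
    return loss(margin * y_star)           # equals loss(abs(margin))
```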

Choosing a_t and η for smooth losses

If $y_t^\star = y_t$:
\[
\begin{aligned}
(1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta}{2}\|g_t\|^2
&\le (1 - \ell_t^\star(W_t))\mathbf{1}[y_t^\star \neq y_t] + \frac{K-1}{K}\ell_t^\star(W_t) - \frac{K-1}{K}\ell_t(W_t) \\
&= \frac{K-1}{K}\ell_t(W_t) - \frac{K-1}{K}\ell_t(W_t) \\
&= 0
\end{aligned}
\]

Choosing a_t and η for smooth losses

If $y_t^\star \neq y_t$:
\[
\begin{aligned}
(1 - a_t)\mathbf{1}[y_t^\star \neq y_t] + a_t\frac{K-1}{K} - \ell_t(W_t) + \frac{\eta}{2}\|g_t\|^2
&\le (1 - \ell_t^\star(W_t))\mathbf{1}[y_t^\star \neq y_t] + \frac{K-1}{K}\ell_t^\star(W_t) - \frac{K-1}{K}\ell_t(W_t) \\
&= 1 - \frac{1}{K}\ell_t^\star(W_t) - \frac{K-1}{K}\ell_t(W_t) \\
&= 1 - \frac{1}{K}\ell(\langle W_t, x_t\rangle y_t^\star) - \frac{K-1}{K}\ell(\langle W_t, x_t\rangle y_t) \\
&\le 1 - \ell\Big(\underbrace{\langle W_t, x_t\rangle\big(\tfrac{1}{K} y_t^\star + \tfrac{K-1}{K} y_t\big)}_{\text{opposite sign of } \langle W_t, x_t\rangle}\Big) \\
&\le 0
\end{aligned}
\]

Inspiration from classification with abstention

Setting:

1. the learner observes the predictions $y_t^i \in [-1, 1]$ of experts $i = 1, \ldots, d$
2. based on the experts' predictions the learner predicts $y_t' \in [-1, 1] \cup \{\ast\}$, where $\ast$ stands for abstaining
3. the environment reveals $y_t \in \{-1, 1\}$
4. the learner suffers loss $\ell_t(y_t') = \frac{1}{2}(1 - y_t' y_t)$ if $y_t' \in [-1, 1]$ and $c_t$ otherwise.

Inspiration from classification with abstention

Algorithm (input: AdaHedge)
1: for $t = 1, \ldots, T$ do
2:   Obtain expert predictions $y_t = (y_t^1, \ldots, y_t^d)^\top \in [-1, 1]^d$
3:   Obtain expert distribution $\hat{p}_t$ from AdaHedge
4:   Set $\hat{y}_t = \langle \hat{p}_t, y_t \rangle$
5:   Let $y_t^\star = \operatorname{sign}(\hat{y}_t)$
6:   Set $b_t = 1 - |\hat{y}_t|$
7:   Predict $y_t' = y_t^\star$ with probability $1 - b_t$ and predict $y_t' = \ast$ with probability $b_t$
8:   Obtain $\ell_t$ and send $\ell_t$ to AdaHedge
9: end for
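A short Python sketch of this scheme; to keep it self-contained, a fixed-learning-rate Hedge update stands in for AdaHedge (that substitution is an assumption, not what the slide uses), and the abstention cost is taken constant over rounds.

```python
import numpy as np

def abstention_scheme(expert_preds, labels, cost, eta=0.5, seed=0):
    """Classification with abstention, with fixed-learning-rate Hedge replacing
    AdaHedge. expert_preds: (T, d) array with entries in [-1, 1];
    labels: entries in {-1, +1}; cost: abstention cost per round."""
    rng = np.random.default_rng(seed)
    T, d = expert_preds.shape
    cum_loss = np.zeros(d)                        # cumulative expert losses
    total_loss = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        p_hat = w / w.sum()                       # expert distribution (Hedge stand-in)
        y_hat = float(p_hat @ expert_preds[t])    # aggregated prediction in [-1, 1]
        y_star = 1.0 if y_hat >= 0 else -1.0
        b = 1.0 - abs(y_hat)                      # abstention probability
        if rng.random() < b:
            total_loss += cost                    # abstain and pay the abstention cost
        else:
            total_loss += 0.5 * (1.0 - y_star * labels[t])
        cum_loss += 0.5 * (1.0 - expert_preds[t] * labels[t])   # full-information expert losses
    return total_loss
```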

Inspiration from classification with abstention

Lemma. For any expert $i$, the expected loss of the algorithm satisfies:

\[
\begin{aligned}
\sum_{t=1}^T (1 - b_t)\,\ell_t(y_t^\star) + b_t c_t
&\le \sum_{t=1}^T \ell_t(y_t^i)
 + \inf_{\eta > 0}\Bigg\{\frac{\ln(d)}{\eta} + \sum_{t=1}^T \underbrace{(1 - b_t)\,\ell_t(y_t^\star) + c_t b_t + \eta v_t - \ell_t(\hat{y}_t)}_{\text{abstention gap}}\Bigg\} \\
&\quad + \frac{4}{3}\ln(d) + 2,
\end{aligned}
\]
where $v_t = \mathbb{E}_{i \sim \hat{p}_t}\big[(\ell_t(\hat{y}_t) - \ell_t(y_t^i))^2\big]$.

Inspiration from classification with abstention

Abstention gap:
\[
(1 - b_t)\,\ell_t(y_t^\star) + b_t c_t + \eta v_t - \ell_t(\hat{y}_t)
\]
Surrogate gap:
\[
(1 - a_t)\,\mathbf{1}[y_t^\star \neq y_t] + a_t c_t' + \frac{\eta}{2}\|g_t\|^2 - \ell_t(W_t)
\]
Here $c_t'$ is the cost for guessing rather than abstaining, although if abstaining costs strictly less than 1 we could replace $c_t'$ with abstention cost $c_t$ and obtain surrogate regret that satisfies
\[
R_T(U) = O\!\left(\frac{\|X\|_2^2\,\|U\|_2^2}{1 - \max_t c_t}\right)
\]

Future work

- High probability bounds
- Can we exploit curvature to improve the regret bound?
- Empirical performance?
- Can we improve the regret when the model can predict perfectly (both bandit and full information)?
  - For full information, I think yes.
  - For the bandit setting, I am not sure.
- Can we apply this idea to other problems such as ranking?

Where to find the paper?

- My website: dirkvanderhoeven.com/research
- arXiv: https://arxiv.org/abs/2007.12618
- NeurIPS 2020 proceedings
- By googling "gaptron" (the Twitter and Instagram accounts are not mine).
