Introduction to Multi-Armed Bandits
Aleksandrs Slivkins
Microsoft Research NYC

First draft: January 2017
Published: November 2019
This version: June 2021

Abstract

Multi-armed bandits is a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises.

The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence.

The chapters on “bandits with similarity information”, “bandits with knapsacks” and “bandits and agents” can also be consumed as standalone surveys on the respective topics.

Published with Foundations and Trends® in Machine Learning, November 2019.

arXiv:1904.07272v6 [cs.LG] 26 Jun 2021

The present version is a revision of the “Foundations and Trends” publication. It contains numerous edits for presentation and accuracy (based in part on readers’ feedback), updated and expanded literature reviews, and some new exercises. Further comments, suggestions and bug reports are very welcome!

© 2017-2021: Aleksandrs Slivkins.
Author’s webpage: https://www.microsoft.com/en-us/research/people/slivkins.
Email: slivkins at microsoft.com.

Preface

Multi-armed bandits is a rich, multi-disciplinary research area which receives attention from computer science, operations research, economics and statistics. It has been studied since (Thompson, 1933), with a big surge of activity in the past 15-20 years. An enormous body of work has accumulated over time, various subsets of which have been covered in several books (Berry and Fristedt, 1985; Cesa-Bianchi and Lugosi, 2006; Gittins et al., 2011; Bubeck and Cesa-Bianchi, 2012).

This book provides a more textbook-like treatment of the subject, based on the following principles. The literature on multi-armed bandits can be partitioned into a dozen or so lines of work. Each chapter tackles one line of work, providing a self-contained introduction and pointers for further reading. We favor fundamental ideas and elementary proofs over the strongest possible results. We emphasize accessibility of the material: while exposure to machine learning and probability/statistics would certainly help, a standard undergraduate course on algorithms, e.g., one based on (Kleinberg and Tardos, 2005), should suffice for background.

With the above principles in mind, the choice of specific topics and results is based on the author’s subjective understanding of what is important and “teachable”, i.e., presentable in a relatively simple manner.
Many important results have been deemed too technical or advanced to be presented in detail.

The book is based on a graduate course at University of Maryland, College Park, taught by the author in Fall 2016. Each chapter corresponds to a week of the course. Five chapters were used in a similar course at Columbia University, co-taught by the author in Fall 2017. Some of the material has been updated since then, so as to reflect the latest developments.

To keep the book manageable, and also more accessible, we chose not to dwell on the deep connections to online convex optimization. A modern treatment of this fascinating subject can be found, e.g., in Hazan (2015). Likewise, we do not venture into reinforcement learning, a rapidly developing research area and subject of several textbooks such as Sutton and Barto (1998); Szepesvári (2010); Agarwal et al. (2020). A course based on this book would be complementary to graduate-level courses on online convex optimization and reinforcement learning. Also, we do not discuss Markovian models of multi-armed bandits; this direction is covered in depth in Gittins et al. (2011).

The author encourages colleagues to use this book in their courses. A brief email regarding which chapters have been used, along with any feedback, would be appreciated.

A simultaneous book. An excellent recent book on bandits, Lattimore and Szepesvári (2020), has evolved over several years simultaneously and independently with ours. Their book is much longer, providing a deeper treatment for a number of topics and omitting a few others. The two books reflect the authors’ somewhat differing tastes and presentation styles, and, I believe, are complementary to one another.

Acknowledgements. Most chapters originated as lecture notes from my course at UMD; the initial versions of these lectures were scribed by the students. Presentation of some of the fundamental results is influenced by (Kleinberg, 2007). I am grateful to Alekh Agarwal, Bobby Kleinberg, Yishay Mansour, and Rob Schapire for discussions and advice. Chapters 9, 10 have benefited tremendously from numerous conversations with Karthik Abinav Sankararaman. Special thanks go to my PhD advisor Jon Kleinberg and my postdoc mentor Eli Upfal; Jon has shaped my taste in research, and Eli has introduced me to bandits back in 2006. Finally, I wish to thank my parents and my family for love, inspiration and support.

Contents

Introduction: Scope and Motivation

1 Stochastic Bandits
  1.1 Model and examples
  1.2 Simple algorithms: uniform exploration
  1.3 Advanced algorithms: adaptive exploration
  1.4 Forward look: bandits with initial information
  1.5 Literature review and discussion
  1.6 Exercises and hints

2 Lower Bounds
  2.1 Background on KL-divergence
  2.2 A simple example: flipping one coin
  2.3 Flipping several coins: “best-arm identification”
  2.4 Proof of Lemma 2.8 for the general case
  2.5 Lower bounds for non-adaptive exploration
  2.6 Instance-dependent lower bounds (without proofs)
  2.7 Literature review and discussion
  2.8 Exercises and hints

3 Bayesian Bandits and Thompson Sampling
  3.1 Bayesian update in Bayesian bandits
  3.2 Algorithm specification and implementation
  3.3 Bayesian regret analysis
  3.4 Thompson Sampling with no prior (and no proofs)
  3.5 Literature review and discussion

4 Bandits with Similarity Information
  4.1 Continuum-armed bandits
  4.2 Lipschitz bandits
  4.3 Adaptive discretization: the Zooming Algorithm
  4.4 Literature review and discussion
  4.5 Exercises and hints

5 Full Feedback and Adversarial Costs
  5.1 Setup: adversaries and regret
  5.2 Initial results: binary prediction with experts advice
  5.3 Hedge Algorithm
  5.4 Literature review and discussion
  5.5 Exercises and hints

6 Adversarial Bandits
  6.1 Reduction from bandit feedback to full feedback
  6.2 Adversarial bandits with expert advice
  6.3 Preliminary analysis: unbiased estimates
  6.4 Algorithm Exp4 and crude analysis
  6.5 Improved analysis of Exp4
  6.6 Literature review and discussion
  6.7 Exercises and hints

7 Linear Costs and Semi-bandits
  7.1 Online routing problem
  7.2 Combinatorial semi-bandits
  7.3 Follow the Perturbed Leader
  7.4 Literature review and discussion

8 Contextual Bandits
  8.1 Warm-up: small number of contexts
  8.2 Lipschitz contextual bandits
  8.3 Linear contextual bandits (no proofs)
  8.4 Contextual bandits with a policy class
  8.5 Learning from contextual bandit data
  8.6 Contextual bandits in practice: challenges and a system design
  8.7 Literature review and discussion
  8.8 Exercises and hints

9 Bandits and Games
  9.1 Basics: guaranteed minimax value
  9.2 The minimax theorem
  9.3 Regret-minimizing adversary
  9.4 Beyond zero-sum games: coarse correlated equilibrium
  9.5 Literature review and discussion
  9.6 Exercises and hints

10 Bandits with Knapsacks
  10.1 Definitions, examples, and discussion
  10.2 Examples
  10.3 LagrangeBwK: a game-theoretic algorithm for BwK
  10.4 Optimal algorithms and regret bounds (no proofs)
  10.5 Literature review and discussion
  10.6 Exercises and hints

11 Bandits and Agents
  11.1 Problem formulation: incentivized exploration
  11.2 How much information to reveal?
  11.3 Basic technique: hidden exploration
  11.4 Repeated hidden exploration