New Directions in Bandit Learning: Singularities and Random Walk Feedback by Tianyu Wang

Department of Computer Science
Duke University

Approved:
Cynthia Rudin, Advisor
Xiuyuan Cheng
Rong Ge
Alexander Volfovsky

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University, 2021.

Copyright © 2021 by Tianyu Wang. All rights reserved.

Abstract

My thesis focuses on new directions in bandit learning problems. In Chapter 1, I give an overview of the bandit learning literature, which lays out the framework for the studies in Chapters 2 and 3.

In Chapter 2, I study bandit learning problems in metric measure spaces. I start with the multi-armed bandit problem with Lipschitz rewards, and propose a practical algorithm that utilizes greedy tree training methods and adapts to the landscape of the reward function. In particular, the study provides a Bayesian perspective on this problem. I also study bandit learning for Bounded Mean Oscillation (BMO) functions, where the goal is to "maximize" a function that may go to infinity in parts of the space. For an unknown BMO function, I present algorithms that efficiently find regions with high function values. To handle possible singularities and unboundedness in BMO functions, I introduce the new notion of δ-regret: the difference between the function values along the trajectory and a point that is optimal after removing a δ-sized portion of the space. I show that my algorithm has O(κ log T / T) average T-step δ-regret, where κ depends on δ and adapts to the landscape of the underlying reward function.

In Chapter 3, I study bandit learning with random walk trajectories as feedback. In domains including online advertisement and social networks, user behaviors can be modeled as a random walk over a network. To this end, we study a novel bandit learning problem, where each arm is the starting node of a random walk in a network and the reward is the length of the walk. We provide a comprehensive understanding of this formulation by studying both the stochastic and the adversarial setting. In the stochastic setting, we observe that there exists a difficult problem instance on which the following two seemingly conflicting facts simultaneously hold: (1) no algorithm can achieve a regret bound independent of problem intrinsics, information-theoretically; and (2) there exists an algorithm whose performance is independent of problem intrinsics in terms of the tail of mistakes. This reveals an intriguing phenomenon in general semi-bandit feedback learning problems. In the adversarial setting, we establish novel algorithms that achieve a regret bound of order Õ(√(κT)), where κ is a constant that depends on the structure of the graph rather than on the number of arms (nodes). This bound significantly improves over standard bandit algorithms, whose regret depends on the number of arms (nodes).
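To make the δ-regret notion above concrete, the following is a minimal sketch, in LaTeX, of one way to formalize it. The symbols f^δ, x_t, and μ (a measure on the arm space X) are my own notation inferred from the abstract's wording; the precise definition used in Chapter 2 may differ.

```latex
% Sketch only (notation inferred from the abstract, not the thesis's exact definition):
% f^\delta is the best value attainable after discarding a \delta-sized portion of the
% space, and the average T-step \delta-regret compares the trajectory x_1,\dots,x_T to it.
\[
  f^{\delta} \;=\; \sup\bigl\{\, y \;:\; \mu\bigl(\{x \in \mathcal{X} : f(x) > y\}\bigr) \ge \delta \,\bigr\},
  \qquad
  \overline{R}_T^{\delta} \;=\; \frac{1}{T} \sum_{t=1}^{T} \bigl( f^{\delta} - f(x_t) \bigr).
\]
% The abstract's claim is then \overline{R}_T^{\delta} = O(\kappa \log T / T),
% where \kappa depends on \delta and on the landscape of f.
```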
Contents

Abstract
List of Figures
List of Tables
Acknowledgements
1 Introduction
2 Bandit Learning in Metric Spaces
  2.1 Lipschitz Bandits: A Bayesian Approach
    2.1.1 Introduction
    2.1.2 Main Results: TreeUCB Framework and a Bayesian Perspective
    2.1.3 Empirical Study
    2.1.4 Conclusion
  2.2 Bandits for BMO Functions
    2.2.1 Introduction
    2.2.2 Preliminaries
    2.2.3 Problem Setting: BMO Bandits
    2.2.4 Solve BMO Bandits via Partitioning
    2.2.5 Achieve Poly-log Regret via Zooming
    2.2.6 Experiments
    2.2.7 Conclusion
3 Bandit Learning with Random Walk Feedback
  3.1 Introduction
    3.1.1 Related Works
  3.2 Problem Setting
  3.3 Stochastic Setting
    3.3.1 Reduction to Standard MAB
    3.3.2 Is this Problem Much Easier than Standard MAB?
    3.3.3 Regret Analysis
  3.4 Adversarial Setting
    3.4.1 Analysis of Algorithm 6
    3.4.2 Lower Bound for the Adversarial Setting
  3.5 Experiments
  3.6 Conclusion
4 Conclusion
Appendices
A Supplementary Materials for Chapter 2
  A.1 Proof of Lemma 2
  A.2 Proof of Lemma 3
    A.2.1 Proof of Proposition 5
    A.2.2 Proof of Proposition 6
  A.3 Proof of Lemma 4
  A.4 Proof of Theorem 1
  A.5 Proof of Proposition 2
  A.6 Proof of Proposition 3
  A.7 Elaboration of Remark 6
  A.8 Proof of Theorem 3
B Supplementary Materials for Chapter 3
  B.1 Additional Details for the Stochastic Setting
    B.1.1 Concentrations of Estimators
    B.1.2 Proof of Theorem 9
    B.1.3 Proof of Theorem 10
    B.1.4 Greedy Algorithm for the Stochastic Setting
  B.2 Proofs for the Adversarial Setting
    B.2.1 Proof of Theorem 12
    B.2.2 Additional Propositions
Bibliography
Biography

List of Figures

1.1 A bandit octopus.
2.1 Example reward function (in color gradient) with an example partitioning.
2.2 The left subfigure is the metric learned by Algorithm 1 (2.9). The right subfigure is the smoothed version of this learned metric.
2.3 The estimates for a function with respect to a given partition.
2.4 Performance of TUCB against benchmark methods in tuning neural networks.
2.5 Graph of f(x) = −log(|x|), with δ and f^δ annotated. This function is an unbounded BMO function.
2.6 Example of terminal cubes, pre-parent and parent cubes.
2.7 Algorithms 3 and 4 on Himmelblau's function (left) and the Styblinski–Tang function (right).
2.8 Landscapes of test functions used in Section 2.1.3. Left: (rescaled) Himmelblau's function. Right: (rescaled) Styblinski–Tang function.
3.1 Problem instances constructed to prove Theorem 8. The edge labels denote edge transition probabilities in J/J′.
3.2 A plot of the function f(x) = (1 − √x)/(1 + √x), x ∈ [0, 1]. This shows that in Theorem 11, the dependence on graph connectivity is highly non-linear.
3.3 Experimental results for Section 3.5.
3.4 The network structure for experiments in Section 3.5.

List of Tables

2.1 Settings for the SVHN experiments.
2.2 Settings for the CIFAR-10 experiments.
Acknowledgements

When I started my PhD study, I knew little about the journey ahead. There were multiple points where I almost failed. After five years of adventures, how much I have transformed!

I am grateful to my advisor, Prof. Cynthia Rudin, for her guidance and advice. The lessons I learnt from Prof. Rudin are not only technical, but also philosophical. She has taught me how to define problems, how to collaborate, and how to write academic papers and give talks. I am still practicing and improving the skills I learnt from her.

I would like to thank Duke University and the Department of Computer Science. Administratively and financially, the university and the department have supported me throughout my PhD study. Also, a thank you to the Alfred P. Sloan Foundation for supporting me via the Duke Energy Data Analytics Fellowship.

I would like to thank all my co-authors during my PhD study. They are, alphabetically, M. Usaid Awan, Siddhartha Banerjee, Dawei Geng, Gauri Jain, Yameng Liu, Marco Morucci, Sudeepa Roy, Cynthia Rudin, Sean Sinclair, Alexander Volfovsky, Zizhuo Wang, Lin Yang, Weicheng Ye, and Christina Lee Yu.

I would like to thank all the course instructors and teachers. The techniques and methodologies I learnt from them are invaluable. I also thank Duke University and the Department of Computer Science for providing rich learning resources.

Very importantly, I am grateful to my parents.

Looking ahead into the future, I will maintain a high standard for myself, to live up to the names of Duke University, the Department of Computer Science, my advisor Cynthia Rudin, and all my collaborators.

Chapter 1
Introduction

Bandit learning algorithms seek to answer the following critical question in sequential decision making: in interacting with the environment, when should one exploit the historically good options, and when should one explore the decision space? This intriguing question, and the corresponding exploitation-exploration tension, arise in many, if not all, online decision making problems. Bandit learning algorithms find applications ranging from experiment design [Rob52] to online advertising [LCLS10].

In the classic bandit learning problem, an agent interacts with an unknown and possibly changing environment. The agent has a set of choices (called arms in the bandit community) and tries to maximize the total reward while learning the environment. Performance is usually measured by regret: the total difference, summed over time, between the rewards of the agent's choices and those of a hindsight-optimal option. We seek to design algorithms whose regret is sub-linear in time. This ensures that, when the algorithm is run long enough, it chooses the best options most of the time (see the illustrative sketch below).

Three Different Settings. In this part, I classify bandit problems into three different settings, based on how the environment may change.
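To illustrate the exploitation-exploration trade-off and the regret measure described above, here is a minimal, self-contained Python sketch of a generic UCB-style bandit loop. This is a textbook-style illustration under assumed Gaussian rewards with made-up arm means; it is not an algorithm from this thesis.

```python
import math
import random

# Generic UCB1-style bandit loop (illustration only; not an algorithm from this thesis).
# Each arm returns a noisy reward around a fixed, hypothetical mean.
true_means = [0.2, 0.5, 0.8]            # hypothetical arm means (assumption)
T = 10_000                              # number of rounds
counts = [0] * len(true_means)          # number of pulls per arm
estimates = [0.0] * len(true_means)     # empirical mean reward per arm
total_reward = 0.0

for t in range(1, T + 1):
    if t <= len(true_means):
        arm = t - 1                     # pull each arm once to initialize estimates
    else:
        # Favor arms with high empirical means, plus an exploration bonus
        # that shrinks as an arm is pulled more often (UCB index).
        arm = max(
            range(len(true_means)),
            key=lambda a: estimates[a] + math.sqrt(2 * math.log(t) / counts[a]),
        )
    reward = random.gauss(true_means[arm], 0.1)   # noisy feedback from the chosen arm
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    total_reward += reward

# Regret: total shortfall relative to always pulling the hindsight-best arm.
regret = T * max(true_means) - total_reward
print(f"average regret per round: {regret / T:.4f}")
```

The exploration bonus sqrt(2 log t / n_a) decays as an arm accumulates pulls, so the loop gradually shifts from exploring all arms to exploiting the empirically best one, which is exactly the tension the chapters that follow study in richer settings.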
