Structured Online Learning with Full and Bandit Information

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Nicholas Johnson

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

ARINDAM BANERJEE

September, 2016

© Nicholas Johnson 2016
ALL RIGHTS RESERVED

Acknowledgements

I would like to thank many people who supported me throughout my graduate career. First, I would like to thank my advisor Professor Arindam Banerjee for taking me on as a student, teaching me everything I know about machine learning, encouraging me to pursue my own intellectual curiosities, providing career opportunities and guidance, and for his overall enthusiasm and professionalism in research and life. I will forever be grateful. To the rest of my committee – Professor Maria Gini for providing incredible advice whenever I needed it, Professor Rui Kuang for helping start my graduate career in a positive way, and Professor Jarvis Haupt for the thought-provoking questions. To my fellow graduate students in the Artificial Intelligence lab: Steve Damer, Mohamed Elidrisi, Julio Godoy, Will Groves, Elizabeth Jensen, Marie Manner, Ernesto Nunes, James Parker, Erik Steinmetz, and Mark Valovage, and in the Machine Learning lab: Soumyadeep Chatterjee, Sheng Chen, Konstantina Christakopoulou, Puja Das, Farideh Fazayeli, Jamal Golmohammady, Andre Goncalves, Qilong Gu, Igor Melnyk, Vidyashankar Sivakumar, Karthik Subbian, Amir Taheri, and Huahua Wang. I learned from each of you and will never forget the support you provided especially during the late nights in Walter Library B22 and Keller Hall 6-210 working toward paper deadlines. To my collaborators at Amazon – Aaron Dykstra, Ted Sandler, Houssam Nassif, Jeff Bilger, and Charles Elkan for teaching me how to use machine learning in the wild. Finally, to my mom and dad for their continuous love and support and for instilling in me the value of hard work, passion in my interests, and balance in life. To the rest of my family and friends for always being there. The research in this thesis was supported in part by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, IIS-0916750, NSF CAREER award IIS-0953274, NASA grant NNX12AQ39A, Adobe, IBM, and Yahoo.

Dedication

To my family.

Abstract

Numerous problems require algorithms to repeatedly interact with the environment, which may include humans, physical obstacles such as doors or walls, biological processes, or even other algorithms. One popular application domain where such repeated interaction is necessary is social media. Every time a user opens a social media application, an algorithm must make a series of decisions on what is shown, ranging from news content to friend recommendations to trending topics. The user then provides feedback, frequently in the form of a click or no click. The algorithm must use such feedback to learn the user’s likes and dislikes. Similar scenarios play out in medical treatments, autonomous robot exploration, online advertising, and algorithmic trading. In such applications, users often have high expectations of the algorithm, such as immediately showing relevant recommendations, quickly administering effective treatments, or performing profitable trades in fractions of a second. Such demands require algorithms to have the ability to learn in a dynamic environment, learn efficiently, and provide high quality solutions. Designing algorithms which meet such user demands poses significant challenges for researchers. In this thesis, we design and analyze machine learning algorithms which interact with the environment and specifically study two aspects which can help alleviate these challenges: (1) learning online, where the algorithm selects an action after which it receives the outcome (i.e., loss) of selecting such an action, and (2) using the structure (sparsity, group sparsity, low-rankness) of a solution or user model. We explore such aspects under two feedback models: full and bandit information. With full information feedback, the algorithm observes the loss of each possible action it could have selected; for example, a trading algorithm can observe the price of each stock it could have invested in at the end of the day. With bandit information feedback, the algorithm can only observe the loss of the action taken; for example, a medical treatment algorithm can only observe whether a patient’s health improved for the treatment provided. We measure the performance of our algorithms by their regret, which is the difference between the cumulative loss received by the algorithm and the cumulative loss received by the best fixed or time-varying actions in hindsight.

In the first half of this thesis, we focus on full information settings and study online learning algorithms for general resource allocation problems motivated, in part, by applications in algorithmic trading. The first two topics we explore are controlling the cost of updating an allocation, for example, one’s stock portfolio, and learning to allocate a resource across groups of objects such as stock market sectors. In portfolio selection, making frequent trades may incur huge amounts of transaction costs and hurt one’s bottom line. Moreover, groups of stocks may perform similarly, and investing in a few groups may lead to higher returns. We design and analyze two efficient algorithms and present new theoretical regret bounds. Further, we experimentally show the algorithms earn more wealth than existing algorithms even with transaction costs. The third and fourth topics we consider are two different ways to control suitable measures of risk associated with a resource allocation. The first approach is through diversification and the second is through the concept of hedging where, in the application of portfolio selection, a trader borrows shares from the bank and holds both long and short positions. We design and analyze two efficient online learning algorithms which either diversify across groups of stocks or hedge between individual stocks. We establish standard regret bounds and show experimentally our algorithms earn more wealth, in some cases orders of magnitude more, than existing algorithms and incur less risk. In the second half of this thesis, we focus on bandit information settings and how to use the structure of a user model to design algorithms with theoretically sharper regret bounds. We study the stochastic linear bandit problem which generalizes the widely studied multi-armed bandit. In the multi-armed bandit, an algorithm repeatedly selects an arm (action) from a finite decision set after which it receives a stochastic loss. In the stochastic linear bandit, arms are selected from a decision set with infinitely many arms (e.g., vectors from a compact set) and the loss is a stochastic linear function parameterized by an unknown vector. The first topic we explore is how the regret scales when the unknown parameter is structured (e.g., sparse, group sparse, low-rank). We design and analyze an algorithm which uses the structure to construct tight confidence sets which contain the unknown parameter with high probability, which leads to sharp regret bounds. The second topic we explore is how to generalize the previous algorithm to non-linear losses often used in Generalized Linear Models. We design and analyze a similar algorithm and show the regret is of the same order as with linear losses.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
  1.1 Interactive Machine Learning
    1.1.1 Online Learning
    1.1.2 Structured Learning
    1.1.3 Full and Bandit Information
  1.2 Overview and Contributions
    1.2.1 Online Lazy Updates
    1.2.2 Online Lazy Updates with Group Sparsity
    1.2.3 Online Structured Diversification
    1.2.4 Online Structured Hedging
    1.2.5 Structured Stochastic Linear Bandits
    1.2.6 Structured Stochastic Generalized Linear Bandits

2 Formal Setting
  2.1 Preliminaries
    2.1.1 Notation
    2.1.2 Convex Sets and Functions
    2.1.3 Smooth Convex Functions
    2.1.4 Non-smooth Convex Functions
    2.1.5 Strongly Convex Functions
    2.1.6 Bregman Divergences
    2.1.7 Alternating Direction Method of Multipliers
  2.2 Online Convex Optimization
  2.3 Regret
  2.4 Feedback
  2.5 Structure
  2.6 Algorithmic Frameworks

3 Related Work
  3.1 Full Information
    3.1.1 Weighted Majority Algorithm and Expert Prediction
    3.1.2 Online Gradient Descent and Exponentiated Gradient
    3.1.3 Online Convex Optimization
    3.1.4 Follow-The-Regularized-Leader
    3.1.5 Online Mirror Descent and Composite Objective Mirror Descent
  3.2 Bandit Information
    3.2.1 Stochastic K-Armed Bandits
    3.2.2 Contextual and Stochastic Linear Bandits
    3.2.3 Stochastic Binary and Generalized Linear Model Bandits

4 Online Portfolio Selection
  4.1 Model
  4.2 Algorithms
    4.2.1 Single Stock and Buy-and-Hold
    4.2.2 Constantly Rebalanced Portfolio
    4.2.3 Universal Portfolio
    4.2.4 Exponentiated Gradient
  4.3 Datasets

5 Online Lazy Updates
  5.1 Introduction
  5.2 Problem Formulation
  5.3 Algorithm
  5.4 Regret Analysis
    5.4.1 Fixed Regret
    5.4.2 Shifting Regret
  5.5 Experiments and Results
    5.5.1 Methodology and Parameter Setting
    5.5.2 Effect of α and the L1 penalty
    5.5.3 Wealth with Transaction Costs (S_T^γ)
    5.5.4 Parameter Sensitivity (η and α)

6 Online Lazy Updates with Group Sparsity
  6.1 Introduction
  6.2 Problem Formulation
  6.3 Algorithm
  6.4 Regret Analysis
    6.4.1 Fixed Regret
    6.4.2 Shifting Regret
  6.5 Experiments and Results
    6.5.1 Methodology and Parameter Setting
    6.5.2 Effect of λ1 for Group Sparsity (Ω(p))
    6.5.3 Wealth and Group Sparsity
    6.5.4 Switching Sectors

7 Online Structured Diversification
  7.1 Introduction
  7.2 Problem Formulation
    7.2.1 Structured Diversity with L(∞,1) Norm
    7.2.2 Online Resource Allocation Framework
  7.3 Algorithm
  7.4 Regret Bound
  7.5 Experiments and Results
    7.5.1 Methodology and Parameter Setting
    7.5.2 Effect of the L(∞,1) group norm and κ
    7.5.3 Risk and κ
    7.5.4 Transaction Cost-Adjusted Wealth
    7.5.5 Diversification
    7.5.6 Risk Comparison

8 Online Structured Hedging
  8.1 Introduction
  8.2 Problem Formulation
    8.2.1 Hedging and Leverage
    8.2.2 Structured Hedging
    8.2.3 Online Resource Allocation Framework
  8.3 Online Portfolio Selection
    8.3.1 Long-Only Portfolios
    8.3.2 Short-Only Portfolios
    8.3.3 Long and Short Portfolios
  8.4 Algorithm
  8.5 Regret Bound
  8.6 Experiments and Results
    8.6.1 Methodology and Parameter Setting
    8.6.2 Cumulative Wealth
    8.6.3 Effect of Hedging Function Ωh and λ
    8.6.4 Risk Comparison
    8.6.5 Risk and λ

9 Structured Stochastic Linear Bandits
  9.1 Introduction
  9.2 Background: High-Dimensional Structured Estimation
  9.3 Structured Bandits: Problem and Algorithm
    9.3.1 Problem Setting
    9.3.2 Algorithm
  9.4 Regret Bound for Structured Bandits
    9.4.1 Examples
  9.5 Overview of the Analysis

10 Structured Stochastic Generalized Linear Bandits
  10.1 Introduction
  10.2 Background: Structured Estimation and Generalized Linear Models
    10.2.1 Structured Estimation
    10.2.2 Generalized Linear Models
  10.3 Problem Setting and Algorithm
    10.3.1 Algorithm
  10.4 Regret Bound
    10.4.1 Examples
  10.5 Overview of the Analysis

11 Conclusions

References

Appendix A. Online Lazy Updates
  A.1 Regret Analysis

Appendix B. Online Lazy Updates with Group Sparsity
  B.1 Regret Analysis

Appendix C. Structured Stochastic Linear Bandits
  C.1 Definitions and Background
  C.2 Ellipsoid Bound
  C.3 Algorithm
  C.4 Bound on Regularization Parameter λt

Appendix D. Structured Stochastic Generalized Linear Bandits
  D.1 Generalized Ellipsoid Bound

List of Tables

4.1 Datasets with data taken from 4 different markets and trading periods.
5.1 Parameter descriptions as given in (5.6) and used in Algorithm 1 and 2.
6.1 Overview of GICS sectors used in our datasets.
8.1 Table of B_ℓ and B_s values for each dataset.
8.2 Cumulative wealth for EG, leveraged long-only (LO), short-only (SO), and long/short (LS) variants of EG*, and SHERAL with λ > 0.
8.3 Cumulative wealth (without transaction costs) of SHERAL, benchmark algorithms, and several variants for each of the five datasets.

List of Figures

2.1 Online Convex Optimization.
5.1 With a low α = 0, we are effectively removing the lazy updates term and allowing the algorithm to trade aggressively to maximize returns. This is equivalent to what the EG algorithm is doing. With higher α values, we are penalizing the amount of transactions more severely and, as such, the total amount decreases as we increase α.
5.2 As α increases, the total amount of transaction and number of trades decrease for the NYSE dataset. However, for the S&P500 dataset, as we increase α the total amount of transaction decreases but the total number of trades does not decrease.
5.3 In (a) with α = 1, the percentage of stocks frequently traded is high. As α increases, the percentage and frequency of stocks traded decreases. However, for each value of α we still see activity during periods of economic instability. In (b) with α = 1, the number of active stocks changes often which shows the algorithm is making frequent trades. As α increases, the number of active stocks stabilizes which means the algorithm is not trading as often and instead investing and holding on to a few stocks.
5.4 We compare active stocks with α = 4 to the S&P500 index. When the index decreases between 2000-2003 and 2007-2009, the number of active stocks increases and OLU starts to make frequent trades. This implies during times of economic instability, e.g., the dot-com and financial crashes, OLU is trading frequently to find good stocks; however, many stocks are performing badly with high volatility.
5.5 Transaction cost-adjusted cumulative wealth. In (a) with α = 0.087 and in (b) with α = 2.6, OLU earns more wealth ($50.80 and $901.00 respectively) than the competing algorithms even with transaction costs.
5.6 Wealth as a function of η and α values. In (a) and (b) there is a range of parameter values that give significant wealth.
6.1 As λ1 increases the (a) total group lasso value and (b) number of active group changes decrease.
6.2 As λ1 increases the number of days with high group lasso value and the number of active groups decrease.
6.3 (a) OLU-GS returns more than competing algorithms even with transaction costs. (b) There exists a parameter range that gives good wealth performance.
6.4 Picking non-cyclic sectors, sectors that tend to perform well during economic downturns, during the 1970s and dot-com bear markets and cyclic sectors, sectors that perform well during economic booms, during the dot-com bull market.
7.1 (a) With an aggressive trading strategy, the total value of the L(∞,1) penalty follows the increasing κ. (b) With low κ the algorithm is forced to diversify but less so as κ increases.
7.2 Number of active groups for the S&P500 dataset. For low κ we trade with a more conservative, diversified portfolio while with high κ we are more aggressive and risk seeking.
7.3 For αcov with low η, the risk stays low for varying κ. For large η, the risk increases with an increasing κ. For both αsharpe and αsortino, the risk-adjusted return is higher for higher η and decreases as we increase κ. From this figure, we can see that if we want to control the risk exposure, we can effectively do it by controlling κ.
7.4 Transaction cost-adjusted wealth for the NYSE dataset. ORASD returns more wealth than competing algorithms even with transaction costs. Note, ORASD earns more wealth than OLU from Chapter 5 which was tuned for optimal wealth and without transaction costs.
7.5 Average entropy for ORASD and competing algorithms. We can see that as κ decreases so does the entropy. This indicates that the portfolio is becoming less diverse and as such may be exposed to more risk.
7.6 Risk comparison on the NYSE dataset. We plot the negative Sharpe and Sortino ratios; therefore, a higher bar implies higher risk.
8.1 Transaction cost-adjusted wealth for the NYSE dataset with varying values of λ. SHERAL returns more wealth than the best competing algorithm even with transaction costs.
8.2 Total value of the hedging penalty Ωh with varying λ for the NYSE dataset. As λ increases, the value Ωh decreases.
8.3 Active stocks with varying λ. As λ increases the number of active stocks decreases which indicates either more hedging or less total investing.
8.4 Active positions with varying λ. When the NYSE Index decreases, hedging increases with λ = 0.01.
8.5 Average risk for each algorithm and dataset with optimal parameters in terms of wealth returned. SHERAL computes portfolios with less risk than U-BAH (LO) and U-CRP (LO) for almost all risk measures and datasets and is competitive with the non-leveraged algorithm OLU.
8.6 Average αvar risk for each dataset with varying η and λ. With a higher η value, the average variance risk is higher; however, as we increase λ the risk decreases for each η value across all datasets.

Chapter 1

Introduction

Modern advances in science and technology have led to paradigm shifts in the acceptance and use of technology in everyday life. Nearly everyone in the US has access to a computer, tablet, or smartphone. Further, constant use of such devices, especially smartphones, is becoming socially acceptable. Data from the use of such devices, for example, a user’s website viewing habits, online purchasing history, email and text communications, and social network connections, are valuable and can be used in numerous ways. Companies, in particular, are interested in using such data to better understand their customer base in order to more accurately target advertisements, provide relevant content, and administer effective medical treatments.

Analyzing the huge amount of heterogeneous data generated each day from such devices for any particular application is a significant challenge. Classical rule-based methods such as expert systems in artificial intelligence fail to compute effective solutions in many modern applications due, in part, to the sheer amount of data and the highly non-trivial and unintuitive correlations inherent in the data. In response, machine learning algorithms have become popular as a way to intelligently use data to compute effective solutions to modern problems.

Machine learning algorithms are designed to learn complex models from data, rather than relying on a set of rules as in expert systems, which makes them a natural choice in the “Big Data” era. Many modern applications accumulate data over time and require algorithms to learn and adapt efficiently and effectively in real-time to a changing environment and update models dynamically. However, traditional machine learning algorithms are quickly becoming insufficient for such problems because they often rely on the availability of a static dataset from which a model can be learned and are unable to learn and adapt efficiently in real-time. Therefore, there is an increasing need for new machine learning algorithms to learn to compute solutions dynamically through repeated interactions with an environment. Such a need has led to increased and ongoing research [52, 49, 85] in the area of interactive machine learning which has found real-world applications from drug discovery [125] to user interface design [51] to teaching robots to bake a cake [117].

1.1 Interactive Machine Learning

Modern applications require algorithms to interact with an environment which may include humans, physical obstacles such as doors or walls, biological processes such as a metabolic pathway, or even interact with other algorithms in order to learn and adapt to changes. For example, in social media applications, users are actively engaged with the application through clicking on recommended news articles, friend profiles, and advertisements. The application can observe the user’s actions and such actions provide valuable insights into the user’s likes and dislikes. In order to take advantage of such feedback and learn about the user to provide a satisfying user experience, the application must use algorithms which can learn through interaction. Interactive learning is not exclusive to social media applications as similar interactive scenarios can also be found in medical treatments, autonomous robot exploration, and algorithmic trading.

There are a number of common technical challenges present in such problems. First, simply the act of learning and adapting in a dynamic environment, such as learning a user’s media preferences or a patient’s sensitivity to medical treatments which may change over time, is challenging. Second, many problems require algorithms to be efficient and compute solutions in fractions of a second. For example, a recommendation algorithm must compute a new set of recommendations which can be shown to the user almost immediately after the user loads the application – users are often not willing to wait minutes to view their recommendations. Third, many problems and users expect high quality solutions such as relevant recommendations, effective medical treatments, and investment portfolios which earn money with low risk.

In this thesis, we consider the design and analysis of interactive machine learning algorithms. We study two aspects which can help alleviate the aforementioned challenges in such problems: (1) learning incrementally (i.e., online) to facilitate efficient computation and adaptation in dynamic environments and (2) using the structure (sparsity, group sparsity, low-rankness) of a solution or user model to compute effective solutions.

1.1.1 Online Learning

Online learning is a technique where an algorithm sequentially learns a model. As new data is made available, an online learning algorithm updates its current model using only the new data. Such a learning process is in contrast to traditional (offline) learning algorithms which learn a single model using a fixed dataset. Online learning algorithms can efficiently compute new models and quickly adapt to changes in a dynamic environment, which makes them a natural choice for interactive learning problems.

1.1.2 Structured Learning

Structured learning is a type of learning problem where the learning algorithm computes a solution or learns a model with a specific type of structure. Often the type of structure is known a priori; for example, in movie recommendation applications it may be known that the user prefers only a few movie genres, so the user preference vector has a sparse structure. The goal of the algorithm is to learn which of the genres the user prefers with the knowledge that there are only a few. The use of structure decreases the number of possible user models the algorithm must consider, thereby allowing the algorithm to efficiently compute a model. In this thesis, we will consider a solution or user model to be structured if it has a small value according to some measure of complexity such as a norm. Popular types of structure we will consider include sparsity, group sparsity, and low rank, which can all be captured using norms (the L1, L(1,2), and nuclear norms respectively).

1.1.3 Full and Bandit Information

In interactive machine learning, the algorithm provides a solution and the user (or some entity ‘Nature’) observes the solution and provides feedback. For example, if the solution is a ranked list of websites, the user’s feedback will be a click or no-click on one of the website links, and if the solution is a mixture of drugs, the patient’s feedback will be an improvement or decline in patient health. We consider two models of user feedback: full information and bandit information. In problems with full information feedback, the algorithm gets to observe the outcome (i.e., loss) of each action it could have taken; for example, a trading algorithm can observe the price of each stock at the end of the day and can compare how much wealth it could have earned had it invested in a different stock. In problems with bandit information feedback, the algorithm can only observe the outcome of the action taken; for example, a medical treatment algorithm can only observe whether a patient’s health improved for the treatment provided. Bandit information feedback is a subset of full information feedback since, with full information, the algorithm observes the outcome of the action it took, as in bandit information problems, as well as the outcome of all other actions. Full information problems therefore provide the algorithm with more information to use in computing its next solution. As a result, learning in such problems is often easier than learning in problems with bandit feedback, which we will see throughout this thesis.

1.2 Overview and Contributions

In the first half of this thesis, we focus on problems which admit full information. We design and analyze interactive online learning algorithms where the algorithm must decide how to distribute an (almost) infinitely divisible resource (e.g., wealth, time, food) across a number of objects (e.g., assets, compute jobs, people). We specifically focus on one sequential resource allocation problem: algorithmic trading. In algorithmic trading, the algorithm must decide how to distribute its wealth across a number of assets such as stocks, often with the goal of maximizing its cumulative wealth over a trading period. The algorithm must repeatedly interact with the market in order to learn which stocks perform well (e.g., earn wealth and/or reduce risk) and, often, must do so in fractions of a second. We focus on designing algorithms which learn to construct a stock portfolio which is structured, such as sparse (only investing in a few stocks), group sparse (only invested in a few market sectors), diverse (investing in a variety of different stocks), or hedged (holding long and short positions of similar stocks). Each type of structured solution we consider is motivated by real-world uses, and the work in the first half of the thesis is designed to close the gap between theory and practice which will lead to the use of such algorithms in real-world applications.

In the second half of this thesis, we study problems which admit bandit information. We specifically consider one such problem: the stochastic linear bandit. The problem requires the algorithm to select a sequence of points at which it can observe a noisy linear loss function value parameterized by an unknown parameter. The goal of the algorithm is to select a sequence of points which minimizes the total loss it receives. The problem is rather abstract but has a number of real-world applications, for example, in medical treatments where an algorithm must sequentially administer a mixture of drugs to a patient to determine the mixture which improves the patient’s health the most.

1.2.1 Online Lazy Updates

The first topic we explore in this thesis is controlling the cost of updating one’s portfolio. In algorithmic trading, the challenge is deciding how to allocate one’s money across a variety of assets such as stocks. Frequently changing such an allocation may incur huge amounts of transaction costs and become detrimental to one’s bottom line. The key idea is to trade infrequently (i.e., sparse trading) and only when it is worth the cost.

Contributions. We make the following contributions to the literature. First, we design and analyze an efficient online learning algorithm which performs sparse or lazy updates to a portfolio. The algorithm is the first in the literature to consider transaction costs in the design which guide what stocks are invested in and which includes theoretical guarantees on performance. Second, we present a theoretical analysis of our algorithm’s performance. The analysis extends existing theory and shows our algorithm’s performance is competitive with the best fixed and time-varying portfolios in hindsight under reasonable assumptions. Third, we perform extensive experiments on real-world stock price datasets and show our algorithm is able to earn more wealth over the trading periods than existing algorithms even with transaction costs.

1.2.2 Online Lazy Updates with Group Sparsity

The second topic we explore is how to compute a portfolio which invests in groups of assets such as stocks. Frequently in algorithmic trading, groups of assets, such as market sectors, will perform similarly, and investing in all assets of a few top performing groups leads to higher returns. However, the top performing groups of assets may change over time; therefore, an algorithm must be able to repeatedly interact with and adapt to market trends focusing specifically on groups of assets.

Contributions. We make the following contributions to the literature. First, we design and analyze an efficient online learning algorithm which can utilize group information and identify which groups of assets to invest in. No previous portfolio selection algorithms have considered group information in their algorithmic framework. Second, we build on our previous theoretical analysis and show our algorithm’s performance is competitive with the best fixed and time-varying portfolios which utilize group information in hindsight under reasonable assumptions. Third, we perform extensive experiments using two real-world stock price datasets where the stocks are grouped according to market sectors. The experimental results show our algorithm is able to accurately identify top performing market sectors and earn more wealth than existing algorithms including our previous algorithm which only considered transaction costs.

1.2.3 Online Structured Diversification

The third topic we consider is how to control the risk of allocating wealth across a set of assets such as stocks. One well-known approach to control risk is through diversifying one’s portfolio; however, naive diversification may be ineffective. In particular, distributing one’s wealth across stocks that are similar may not accomplish the goal of reducing the risk of exposure to market crashes. The key idea is to distribute one’s wealth across stocks that are not similar such that if one stock’s price crashes the other stock’s price will not crash and one can control the amount of wealth lost.

Contributions. We make the following contributions to the literature. First, we present a general framework and an efficient online learning algorithm which computes portfolios which are diverse across groups of stocks. The novel component of our formulation is the introduction of a new overlapping group norm which is used to limit the maximum amount of wealth invested in any group, which forces diversification across groups. No previous works have considered any mechanism to compute diversified portfolios. Second, we experiment with our algorithm using two real-world stock price datasets. We show our algorithm is able to earn more wealth than our previous two algorithms and existing algorithms in the literature. We also show our algorithm is effective at controlling risk and typically has less exposure to risk than competing algorithms. Third, we present a theoretical analysis of our algorithm’s performance and show it is competitive with the best fixed portfolio in hindsight under reasonable assumptions.

1.2.4 Online Structured Hedging

In the previous three topics, we only consider investing with the current wealth on hand. In the final topic under full information feedback, we study how to design a learning algorithm which is allowed to borrow shares of stock from the bank with which to invest and how to use such borrowed stock to control risk. The idea is to sell the borrowed shares when the market price is high, then buy them back when the price is low and return the shares to the bank. Such a strategy is known as holding a short position or shorting the market in finance. When shorting is used in conjunction with typical investing where one purchases shares of stock, i.e., holding a long position, one can strategically hold both long and short positions to hedge the market and reduce risk.

Contributions. We make the following contributions to the literature. First, we present a general framework and an efficient online learning algorithm which computes hedged portfolios using borrowed wealth and shares of stock. The algorithm is able to learn when to hold long and short positions in order to take advantage of both bear and bull markets. All previous algorithms could only hold long positions and they lose wealth when the market crashes. Moreover, we present the first algorithm in the literature to also use short and long positions to control risk through hedging. Second, we experiment with our algorithm using five real-world stock price datasets and show our algorithm is able to earn orders of magnitude more wealth than all of our previous algorithms and all existing algorithms in the literature. Moreover, our algorithm is exposed to less risk than previous algorithms under various measures of risk. Third, we present a theoretical analysis of our algorithm and show its performance is competitive with the best fixed portfolio in hindsight under reasonable assumptions.

1.2.5 Structured Stochastic Linear Bandits

In the second half of this thesis, we switch our focus to bandit information problems where we study the stochastic linear bandit problem. The first topic we explore is how the regret, which is the difference between the cumulative loss of the algorithm and the cumulative loss of the best fixed vector in hindsight, scales when we assume the unknown parameter is structured (e.g., sparse, group sparse, low-rank).

Contribution. We make one significant contribution to the literature. We design and analyze an algorithm which constructs tight confidence sets which contain the unknown parameter with high probability. Using such confidence sets, we show sharp theoretical regret bounds which depend linearly on the structure rather than linearly on the ambient dimensionality of the problem. Only a couple of previous works have considered a similar setting where the unknown parameter is sparse. They show sharp regret bounds only for the sparse setting. We match their regret bounds and generalize the results to any norm structured parameter using a single unified algorithm and analysis.

1.2.6 Structured Stochastic Generalized Linear Bandits

The last topic we study is how to generalize the results of our previous algorithm for the structured stochastic linear bandit problem. We consider settings where the noisy loss function is non-linear rather than linear. Specifically, we consider loss functions which arise naturally from the Generalized Linear Models framework in statistics.

Contribution. We make one significant contribution to the literature. We design and analyze an algorithm which again constructs tight confidence sets which contain the unknown parameter with high probability. Using such confidence sets, we show sharp theoretical regret bounds for a wider class of loss functions and when the unknown parameter is structured. We show the regret bounds match the bounds for linear loss functions and for all types of structure. One previous work has considered a non-linear loss function and showed matching bounds; however, the work only considered one specific loss function and did not consider structured settings.

Chapter 2

Formal Setting

In this chapter, we will review background material focusing mostly on convex optimization which is necessary to understand the following chapters. We will then formally introduce the problem setting, measures of performance, types of feedback and structure, and the two algorithmic frameworks we will build on.

2.1 Preliminaries

We review some mathematical definitions of a number of objects which will play an important role in the design and analysis of our algorithms. In particular, we heavily rely on properties and techniques from convex optimization. The following definitions, which we repeat here for completeness, can be found in [20].

2.1.1 Notation

We will use the following notation. Vectors are either denoted as lower-case variables x or bold lower-case variables x, and element i in vector x is denoted x(i). Matrices are similarly denoted as upper-case variables X or upper-case bold variables X. We denote the set of real numbers, n-dimensional real vectors, non-negative real numbers, and strictly positive real numbers as R, R^n, R_+, and R_{++} respectively. For a vector x ∈ R^n we denote an Lp norm as ‖x‖_p = (Σ_{i=1}^n |x(i)|^p)^{1/p} and use ‖x‖ to represent the L2 norm by default. For two vectors x, y let ⟨x, y⟩ denote the inner product. Scalar-valued functions are denoted by the lower-case notation f, and the gradient of a function f at x ∈ R^n is ∇f(x). If x ∈ R then the derivative is f′(x). ∇²f(x) is the Hessian of the function and f″(x) is the second derivative. The trace of a matrix is trace(X) and is the sum of the diagonal elements of X. We use ⊤ to denote the transpose, e.g., x^⊤.

2.1.2 Convex Sets and Functions

A set X is convex if the line segment between any two points in X lies in X . In other words, for any x1, x2 ∈ X and any 0 ≤ θ ≤ 1 we have θx1 + (1 − θ)x2 ∈ X . Examples of convex sets which we will use include:

• (closed) unit Lp norm balls B̄_p^n := {x ∈ R^n : ‖x‖_p ≤ 1},

• ellipsoids C := {x ∈ R^n : (x − v)^⊤ D (x − v) ≤ 1} centered at some vector v with axes defined through the positive definite matrix D,

• the n-dimensional probability simplex ∆_n := {x ∈ R^n : x(i) ≥ 0 ∀i, Σ_{i=1}^n x(i) = 1},

• polytopes P := {x ∈ R^n : a_i^⊤ x ≤ b_i, c_j^⊤ x = d_j, for i = 1, . . . , s and j = 1, . . . , t} which can be interpreted as the intersection of a finite number of halfspaces and hyperplanes.
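Several of these sets recur throughout the thesis; in particular, the portfolio vectors in Chapters 4-8 live on the probability simplex ∆_n. As a purely illustrative aside (the code and function name below are ours, not part of the original development), a minimal NumPy sketch of Euclidean projection onto ∆_n using the standard sort-and-threshold construction is:

    import numpy as np

    def project_to_simplex(v):
        # Euclidean projection of v onto {x : x(i) >= 0, sum_i x(i) = 1},
        # via sorting and thresholding.
        n = v.shape[0]
        u = np.sort(v)[::-1]                      # sort descending
        cssv = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, n + 1) > (cssv - 1))[0][-1]
        theta = (cssv[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    # Example: project an arbitrary vector onto the 3-dimensional simplex.
    print(project_to_simplex(np.array([0.5, 2.0, -1.0])))   # -> [0., 1., 0.]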

Let f : X → R be a function defined on a convex set X . Then f is a convex function if for all x1, x2 ∈ X and 0 ≤ θ ≤ 1 the following holds

f(θx1 + (1 − θ)x2) ≤ θf(x1) + (1 − θ)f(x2) . (2.1)

This can be interpreted as: for any two points on the function, if a chord is drawn connecting them, then all points on the chord will lie on or above the function. One nice property of convex functions is that every local minimum is a global minimum. A function is strictly convex if (2.1) holds with strict inequality for any x1 ≠ x2 ∈ X . A strictly convex function has a unique minimum. We can consider (2.1) as the zeroth-order definition of a convex function. The first-order definition is the following. Let f be a differentiable function; then f is convex if and only if X is convex and

f(x2) ≥ f(x1) + ⟨∇f(x1), x2 − x1⟩   ∀x2 ∈ X .        (2.2)

This shows that a function is convex if its linear approximations lie on or below the function. A strictly convex function is one where (2.2) holds with strict inequality.

Note, (2.2) is the first-order Taylor expansion of f at x1, which is something we will make use of in Chapters 5-10. The second-order definition of a convex function is the following. Let f be a twice differentiable function at each point in its convex domain X . Then f is convex if and only if its Hessian is positive semidefinite, i.e., for all x ∈ X the condition

∇²f(x) ⪰ 0        (2.3)

holds, where ⪰ denotes the positive semidefinite ordering (i.e., z^⊤ ∇²f(x) z ≥ 0 for all z). A function where (2.3) holds with strict inequality is a strictly convex function (note, the converse is not true). Examples of convex functions which we will use include:

• linear and affine functions y = ⟨a, x⟩ + b for a, x ∈ R^n and b ∈ R,

• (negative) logarithms − log x for x ∈ R_{++},

• Lp norms ‖x‖_p for p ≥ 1 and x ∈ R^n,

• the max function max{x1, . . . , xn} for x ∈ R^n.
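For example, the (negative) logarithm satisfies the second-order condition (2.3) on its domain: for f(x) = − log x with x ∈ R_{++} we have f″(x) = 1/x² > 0, so f is (strictly) convex. Similarly, the Lp norms satisfy the zeroth-order definition (2.1) directly from the triangle inequality and absolute homogeneity.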

2.1.3 Smooth Convex Functions

Let f : X → R be a continuously differentiable, convex function which has a minimizer x∗ ∈ X . We define f to be smooth if the gradient ∇f is β-Lipschitz for β ≥ 0, i.e.,

∀x1, x2 ∈ X we have

‖∇f(x1) − ∇f(x2)‖ ≤ β ‖x1 − x2‖ .

Such a condition implies that the gradient, and hence the function, cannot change abruptly. We will use this condition to show bounds on performance in Chapters 5-8.
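For example, the quadratic f(x) = (1/2) x^⊤ A x with A symmetric positive semidefinite has ∇f(x) = Ax, so ‖∇f(x1) − ∇f(x2)‖ = ‖A(x1 − x2)‖ ≤ λ_max(A) ‖x1 − x2‖, and f is smooth with β = λ_max(A).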

2.1.4 Non-smooth Convex Functions

In Chapters 5-7, in addition to smooth functions, we will also use non-smooth functions such as the L1 norm. To define a non-smooth function, we must first define the sub-differential set. The sub-differential set is defined as

∂f(x1) := {g ∈ R^n : f(x2) ≥ f(x1) + ⟨g, x2 − x1⟩  ∀x2 ∈ X } .

The sub-differential set ∂f(x1) is convex, compact, and each g ∈ ∂f(x1) is a sub-gradient. Note, the condition for set membership is the Taylor expansion of f at x1 which we saw in the first-order definition of a convex function in (2.2) where g = ∇f(x1). Given the sub-differential set, a convex function f is non-smooth if ∂f(x) ≠ ∅ for all x ∈ X . Moreover, if f is differentiable in X then ∂f(x) := {∇f(x)}. One popular condition for such functions is that the function f is G-Lipschitz in X if ‖g‖ ≤ G holds for all g ∈ ∂f(x) and all x ∈ X . We will use such a condition to show performance bounds in Chapters 5-8.
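For example, for the absolute value f(x) = |x| on R, ∂f(x) = {sign(x)} for x ≠ 0 and ∂f(0) = [−1, 1]. Consequently, every sub-gradient g ∈ ∂‖x‖_1 satisfies ‖g‖_∞ ≤ 1, so the L1 norm is G-Lipschitz with G = √n with respect to the L2 norm.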

2.1.5 Strongly Convex Functions

Another class of functions we will use to show strong performance bounds is the class of strongly convex functions. A function f is α-strongly convex if it satisfies the inequality

f(x2) ≥ f(x1) + ⟨x2 − x1, ∇f(x1)⟩ + (α/2) ‖x2 − x1‖²_2 .

We can see that a strongly convex function is lower bounded by a positive quadratic function at each point in its domain. Note, a strongly convex function is also strictly convex but a strictly convex function is not necessarily strongly convex. Another useful definition of a strongly convex function is the following. A function f is α-strongly convex if the function is twice differentiable and its Hessian is bounded away from zero, i.e., ∇²f(x) ⪰ α I_{n×n} for α > 0, where ⪰ is the positive semidefinite ordering and I_{n×n} is the n-dimensional identity matrix. This implies that the minimum eigenvalue of the Hessian matrix satisfies λ_min(∇²f(x)) ≥ α for all x ∈ X . We will use strongly convex functions in Chapters 5-10.
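For example, f(x) = (1/2) x^⊤ A x with A symmetric and λ_min(A) = α > 0 is α-strongly convex since ∇²f(x) = A ⪰ α I_{n×n}; a linear function, in contrast, is convex but not strongly convex for any α > 0.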

2.1.6 Bregman Divergences

We introduce a popular distance function called the Bregman divergence [21]. Let ψ : X → R be a strictly convex function defined on a convex set X ⊆ R^p such that ψ is differentiable on the non-empty relative interior of X . The Bregman divergence d_ψ : X × relint(X ) → [0, ∞) is defined as

d_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ .        (2.4)

A couple of useful properties of Bregman divergences we will use are that they are non-negative and convex in the first argument. The following are examples of Bregman divergences.

• For ψ(x) = ‖x‖²_2, d_ψ(x, y) = ‖x − y‖²_2 is the squared Euclidean distance.

• For ψ(x) = x^⊤ D x where D is a positive definite matrix, d_ψ(x, y) = (x − y)^⊤ D (x − y) is the (squared) Mahalanobis distance when D is the inverse covariance matrix.

• For ψ(x) = Σ_{i=1}^n x(i) log(x(i)), d_ψ(x, y) = Σ_{i=1}^n x(i) log(x(i)/y(i)) is the KL divergence.
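As a small illustration of (2.4) (ours, for exposition only), the following NumPy sketch evaluates a Bregman divergence given ψ and ∇ψ and instantiates the squared Euclidean and KL cases above:

    import numpy as np

    def bregman(psi, grad_psi, x, y):
        # d_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>   (2.4)
        return psi(x) - psi(y) - np.dot(grad_psi(y), x - y)

    # psi(x) = ||x||_2^2  ->  squared Euclidean distance
    sq = lambda x: np.dot(x, x)
    sq_grad = lambda x: 2.0 * x

    # psi(x) = sum_i x(i) log x(i)  ->  KL divergence (for x, y on the simplex)
    ent = lambda x: np.sum(x * np.log(x))
    ent_grad = lambda x: np.log(x) + 1.0

    x = np.array([0.2, 0.3, 0.5])
    y = np.array([0.4, 0.4, 0.2])
    print(bregman(sq, sq_grad, x, y))    # equals ||x - y||_2^2
    print(bregman(ent, ent_grad, x, y))  # equals sum_i x(i) log(x(i)/y(i))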

2.1.7 Alternating Direction Method of Multipliers

Now that we have reviewed convex sets, convex functions, and Bregman divergences, we will turn to a method for solving convex optimization problems of the form

min_{x∈R^n} f(x)   s.t.   f_i(x) ≤ b_i,   i = 1, . . . , m        (2.5)

where the goal is to find an x which minimizes the objective function f(x) subject to the constraints that f_i(x) ≤ b_i ∀i must be satisfied. We will encounter similar convex optimization problems throughout this thesis. In particular, each of our algorithms in Chapters 5-7 involves a complicated convex optimization problem involving both smooth and non-smooth terms. Methods which can divide such an optimization problem into a series of smaller problems, each of which is easier to solve or can be solved in parallel, are useful, particularly in practice. One such method which has become popular in recent years is the Alternating Direction Method of Multipliers (ADMM) [19]. ADMM is an efficient, distributed optimization method closely related to Bregman iterative algorithms for L1 problems and proximal point methods [105]. It has been applied in many real-world problems because of its computational benefits and fast convergence in practice. ADMM solves problems of the form

min_{x∈R^n, z∈R^m} f(x) + g(z)   s.t.   Ax + Bz = c

where A ∈ R^{p×n}, B ∈ R^{p×m}, c ∈ R^p, and f and g are convex functions. ADMM splits the typical optimization problem f(x) as in (2.5) into a composite objective function using two primal variables x and z since it allows for decomposability and parallelization. The first step in the method is to form the augmented Lagrangian by moving the constraints to the objective function, scaling them by a dual variable y, and adding a quadratic term which provides robustness:

L_ρ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − c⟩ + (ρ/2) ‖Ax + Bz − c‖²_2        (2.6)

for ρ > 0. Using the augmented Lagrangian, ADMM decomposes the problem into the following sub-problems which are solved iteratively for k = 1, 2, . . . until convergence:

x_{k+1} := argmin_x f(x) + ⟨y_k, Ax + Bz_k − c⟩ + (ρ/2) ‖Ax + Bz_k − c‖²_2 ,        (2.7)
z_{k+1} := argmin_z g(z) + ⟨y_k, Ax_{k+1} + Bz − c⟩ + (ρ/2) ‖Ax_{k+1} + Bz − c‖²_2 ,        (2.8)
y_{k+1} = y_k + ρ (Ax_{k+1} + Bz_{k+1} − c) .        (2.9)

Notice, the x-update and z-updates can be viewed as proximal operators [105] so if there exists efficient algorithms to solve the proximal operators then ADMM will be efficient. Further, when ADMM solves the above problems in sequence it is guaranteed 1 to converge to the optimal solution at a rate of the order T where T is the number of iterations under reasonable conditions [19, 124]. Our algorithms in Chapters 5-7 will involve solving similar ADMM sub-problems. 15 2.2 Online Convex Optimization

Given that we have reviewed enough background material, we are ready to introduce the main problem setting. In this thesis, we consider the standard online learning setting [110] which can be considered as a game between two players: the algorithm and Nature. The game proceeds in rounds t = 1,...,T where at each round t the algorithm selects an action xt from some action or decision set X , Nature selects a scalar-valued loss function `t(·), the algorithm receives some feedback Ft, and the algorithms suffers a loss of `t(xt). Note, the setting is quite general and, in particular, we have not specified how Nature selects the loss function. We do not make any statistical assumptions regarding how the losses are selected. In fact, we allow Nature to select the losses adversarially in response to what action the algorithm selected. The goal of the algorithm is to PT minimize its cumulative loss over the T rounds minx1,...,xT t=1 `t(xt). However, given the possibly adversarial sequence of loss functions, we cannot hope to minimize the cumulative loss in general. Instead, we will measure the performance of the algorithm using the regret which will be discussed in the next section. The setting naturally models numerous real-world problems. For example, in user movie recommendations, an algorithm must select a movie xt to recommend the user from a set X of movies available after which the user either watches the movie or not thereby providing the algorithm with a loss of `t(xt) = 0 or `t(xt) = 1 respectively. In this problem, the feedback the algorithm receives is only whether the user watched the movie recommended, i.e., Ft = {`t(xt)}. The algorithm must use such (bandit) feedback to decide what movie to recommend next. Other real-world problems can similarly be modeled using the online learning setting such as medical treatments, online advertising, robot exploration, and algorithmic trading. The online learning setting is general enough to be useful for many applications however, in order to provide theoretical guarantees on performance and computationally efficient algorithms we need to make additional assumptions on various aspects of the setting. The most common set of assumptions which we use in this thesis are that the set of actions and loss functions are both convex which gives rise to the popular Online Convex Optimization (OCO) setting [110, 130] shown in Figure 2.1. 16 For t = 1,...,T

1. Algorithm selects an action xt from a convex set X

2. Nature selects a convex loss function `t : X → R

3. Algorithm receives feedback Ft

4. Algorithm suffers a loss of `t(xt)

Figure 2.1: Online Convex Optimization.

2.3 Regret

Since Nature is allowed to select the sequence of loss functions in an adversarial way, one cannot hope to minimize the cumulative loss in general. Instead, we will measure the performance of the algorithm with respect to a comparator class where, after the algorithm is finished playing the game, we can compute the cumulative loss of each member of the class using the same sequence of loss functions. Therefore, if Nature selects a difficult sequence of loss functions then any member of the class will perform poorly and our goal is to perform not much worse than the best member in the class. The performance measure is called the regret and we consider two classes of com- parators: fixed solutions x ∈ X and any sequence of solutions {x1, . . . , xT } such that each xi ∈ X for i = 1,...,T . Under these two comparator classes we consider three types of regret in this thesis: fixed regret [66], shifting regret [70], and pseudo regret [24]. The fixed regret measures the algorithm’s realized performance against the comparator class of fixed solutions and is formally defined as

T T X X ∗ RT = `t(xt) − min `t(x ) . (2.13) x∗∈X t=1 t=1 The shifting regret measures the algorithm’s realized performance against the compara- tor class of time-varying solutions and is formally defined as

T T X X ∗ RT = `t(xt) − min `t(xt ) . (2.14) x∗,...,x∗ ∈X t=1 1 T t=1 17 Finally, the pseudo regret measures the algorithm’s expected performance against the expected performance of the comparator class of fixed solutions where the expectation is taken with respect to the internal randomness of the algorithm and the losses suffered and is formally defined as

" T # " T # X X ∗ RT = E `t(xt) − min E `t(x ) . (2.15) x∗∈X t=1 t=1 We consider the fixed and shifting regret in the first half of this thesis under full infor- mation feedback and the pseudo regret in the second half of this thesis under bandit information feedback. The shifting regret is the strongest form of regret, the fixed regret follows, and the pseudo regret is the weakest since we only require the algorithm to be competitive with the best fixed solution in expectation. We require the algorithm to have sublinear regret with respect to the time horizon, i.e., RT = o(T ), which implies the average regret goes to zero as the number of rounds T goes to infinity. A sublinear regret also implies the algorithm is performing almost as well as the best member from the comparator class which is all we can hope for when Nature is allowed to adversarially √ select the loss functions. Typical regret bounds are of the form O( T ) and O(log T ).

2.4 Feedback

In addition to requirements on the algorithm’s performance as measured by the regret, we also consider two types of feedback: full information and bandit information. With full information feedback, the algorithm is able to observe the loss function value for each action it could have chosen: Ft := {`t(x) ∀x ∈ X } or, similarly, it has access to the gradient of the loss function at the action selected: Ft := {∇`t(xt)}. A typical real-world example when such feedback is available is a horse race. A gambler will place a bet on a horse, watch the race, and at the end can observe how each horse placed and the payoffs attributed to each horse. Full information feedback is the standard type of feedback considered in online learning. Algorithms which have access to such feedback tend to have sharp regret bounds [110] since learning in such settings can be considered easier than settings with less informative feedback. Full information feedback allows the algorithm to observe how it could have performed had it selected a different action and select a really good action in the next round. 18 A more difficult setting in which to learn is with bandit information feedback [24]. With bandit feedback, the algorithm is only able to observe the loss function value of the action selected: Ft := {`t(xt)}. For example, a medical practitioner can only observe how a patient’s health changed for the treatment administered. Bandit information feedback provides the algorithm with less information with which to base its next action on compared to full information feedback. As such, learning with bandit feedback is more difficult and it has been shown that there is a price one pays when learning with bandit feedback in the form of a gap between the optimal regret achievable with bandit feedback compared to with full information feedback [38].

2.5 Structure

To compute efficient, high quality solutions we use the structure of the solutions and user models. We consider vectors and matrices to be structured if they have a small value according to some norm. In other words, we only consider types of structure which can be captured by a norm. Common types of structure include: sparse, group sparse, and low-rank which can be captured by the L1, L(1,2), and nuclear norms defined as follows. Note, if a vector has a small L2 norm we consider it to be unstructured.

Sparsity with L1 Norm

The L1 norm of a vector x is the sum of absolute values of the vector elements: Pn kxk1= i=1 |x(i)|. The L1 norm is used often in machine learning to encourage sparse solutions, i.e., solutions with few non-zero elements1. For example, if we want to com- pute a sparse estimate of the coefficients of a linear function then we can solve the Lasso regression problem which uses the squared loss with the L1 norm as a regularizer as

ˆ 1 2 θ := argmin ky − Xθk2 + λkθk1 θ∈Rp n

n n×p p where y ∈ R is the response vector, X ∈ R is the design matrix, θ ∈ R is the sparse optimization parameter, and λ > 0 is the regularization parameter.

1 Note, it would be more accurate to consider the L0 norm which simply counts the number of non-zero elements. However, the L0 norm is not convex and since we follow the OCO model we can only consider convex functions so we instead use the L1 norm as a proxy since it is convex. 19

Group Sparsity with L(1,2) Norm

Let G be a set of K groups such that each gi ∈ G is defined as gi ⊆ {1, . . . , p} for i = 1,...,K. The L(1,2) norm of a vector x is the sum of the L2 norm of each sub- vector xgi which is equal to x for those indices in the set gi and 0 otherwise, i.e., PK kxk(1,2) = i=1 wgi kxgi k2 where wgi ≥ 0 is the weight for group gi. The L(1,2) norm is used to encourage sparsity over the groups rather than sparsity over the elements as in the L1 norm. It is used in the group lasso regression problem

$$\hat{\theta} := \operatorname*{argmin}_{\theta \in \mathbb{R}^p}\ \frac{1}{n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_{(1,2)}\ .$$

Low-rank with Nuclear Norm

Let $\Theta \in \mathbb{R}^{d \times p}$ be a matrix with rank $r \leq \min\{d, p\}$. The nuclear norm is the sum of the $r$ singular values $\sigma_i$ (in descending order) of $\Theta$: $\|\Theta\|_* = \sum_{i=1}^{r}\sigma_i$. The nuclear norm is used to compute low-rank approximations, for example, by solving a nuclear-norm regularized least squares problem:

$$\hat{\Theta} := \operatorname*{argmin}_{\Theta \in \mathbb{R}^{d \times p}}\ \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \operatorname{trace}(X_i^\top \Theta)\right)^2 + \lambda\|\Theta\|_*$$

where $y_i \in \mathbb{R}$ is a response variable and $X_i \in \mathbb{R}^{d \times p}$ is a sample.
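To make the three structure-inducing norms concrete, the following is a minimal NumPy sketch; the function names and toy inputs are ours, purely for illustration.

```python
import numpy as np

def l1_norm(x):
    # Sum of absolute values; encourages element-wise sparsity.
    return np.sum(np.abs(x))

def group_l12_norm(x, groups, weights=None):
    # Weighted sum of the L2 norms of the sub-vectors x[g]; encourages group sparsity.
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(w * np.linalg.norm(x[g]) for w, g in zip(weights, groups))

def nuclear_norm(Theta):
    # Sum of singular values; encourages low rank.
    return np.sum(np.linalg.svd(Theta, compute_uv=False))

x = np.array([0.0, 2.0, 0.0, -1.0, 0.0, 3.0])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
Theta = np.outer([1.0, 2.0], [3.0, 0.5, -1.0])   # rank-1 matrix
print(l1_norm(x), group_l12_norm(x, groups), nuclear_norm(Theta))
```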

2.6 Algorithmic Frameworks

We will use two algorithmic frameworks on which to design our algorithms: one for the first half of the thesis, where we consider full information feedback, and another for the second half, where we consider bandit information feedback.

Full Information Framework

The first framework, on which we design our full information algorithms in Chapters 5-8, is motivated by the ideal scenario in the OCO setting where the algorithm would like to select an action $x_{t+1}$ which minimizes the next loss function $\ell_{t+1}(\cdot)$, i.e.,

$$x_{t+1} := \operatorname*{argmin}_{x \in \mathcal{X}}\ \ell_{t+1}(x)\ .$$

However, given the online nature of the problems we consider, where the action $x_t$ must be selected before the loss $\ell_t(x_t)$ is suffered, such minimization is not possible. A common approach is instead to minimize the previous loss:

$$x_{t+1} := \operatorname*{argmin}_{x \in \mathcal{X}}\ \ell_t(x)\ .$$

It is reasonable to consider such a problem if we make some assumptions on the loss functions. One common assumption is that the losses are Lipschitz continuous:

$$|\ell_t(x_1) - \ell_t(x_2)| \leq G\|x_1 - x_2\|_2 \quad \text{for all } x_1, x_2 \in \mathcal{X}$$

where $G > 0$ is the Lipschitz constant which limits how fast the loss function can change. Note, as we mentioned in Section 2.1.4, a function is $G$-Lipschitz if its (sub)gradients $g \in \partial\ell(x)$ are bounded as $\|g\| \leq G$, which implies the loss is Lipschitz continuous. When the losses are Lipschitz continuous, we can often show regret bounds when selecting an action by minimizing the previous loss function.

Further, it is common to linearize the loss function at the previous solution xt to make it computationally easier to solve the optimization problem

$$x_{t+1} := \operatorname*{argmin}_{x \in \mathcal{X}}\ \ell_t(x_t) + \langle\nabla\ell_t(x_t), x - x_t\rangle + d_\psi(x, x_t)\ .$$

In doing so, we must add a proximal term (or distance function) $d_\psi(x, x_t)$ which penalizes selecting an $x$ far away from $x_t$, so as to prevent the update from following the linear approximation too far from the true loss function. Finally, in this thesis we take advantage of a priori structure in the solution or user model as introduced in Section 2.5. We can encode such structure in the algorithm by adding a norm penalty function $R(\cdot)$, such as $R(x) = \|x\|_1$, to the optimization problem to induce structured solutions or user models:

$$x_{t+1} := \operatorname*{argmin}_{x \in \mathcal{X}}\ \ell_t(x_t) + \langle\nabla\ell_t(x_t), x - x_t\rangle + \lambda R(x) + d_\psi(x, x_t) \qquad (2.16)$$

where $\lambda \geq 0$ is a parameter which balances minimizing the loss function and fitting to the structure. Note, (2.16) is known as the composite objective mirror descent (COMID) problem, which we will discuss in Section 3.1.5.

Bandit Information Framework

The second framework, on which we design our bandit information algorithms in Chapters 9 and 10, relies on the optimism-in-the-face-of-uncertainty (OFU) principle [24], which states that when one must select from a number of actions, each with an uncertain reward (loss), one should select the action with the largest (smallest) potential reward (loss). The OFU principle is well-known in the bandit literature, and one common approach is to compute the expected reward $\hat{\mu}(x_i)$ and a confidence interval $\sigma(x_i)$ for each action $x_i \in \mathcal{X}$ and then select the action via

$$\operatorname*{argmax}_{x \in \mathcal{X}}\ \hat{\mu}(x) + \sigma(x)$$

when $\hat{\mu}$ is an expected reward (equivalently, $\operatorname*{argmin}_{x \in \mathcal{X}}\ \hat{\mu}(x) - \sigma(x)$ when $\hat{\mu}$ is an expected loss). We will follow such a framework in Chapters 9 and 10, where we make assumptions about the expected loss, for example, that it is a linear function parameterized by an unknown and structured variable, and construct confidence sets whose size depends on the structure of the unknown variable.

Chapter 3

Related Work

In this chapter, we will discuss the work related to the research presented in this thesis. We organize the related work into two sections: works with full information feedback and works with bandit information feedback. Within each section, we will present the work chronologically. We will discuss works which are most relevant to the research presented in this thesis; however, this chapter is meant only as a brief review of the literature and provides context for placing our contributions.

3.1 Full Information

The standard online learning setting [28, 110] considers full information feedback and has roots that trace back to non-cooperative game theory [56] and the minimax theorem [122]. The minimax theorem was one of the first attempts at showing performance guarantees similar to regret. Such guarantees were made more clear by the work of James Hannan in 1957 when he introduced a modern measure of performance called Hannan consistency [66]. An algorithm is Hannan consistent if the average regret goes to zero in the limit. Outside of the game theory community, Frank Rosenblatt introduced the Perceptron algorithm [107], which is a linear classifier that sequentially classifies data points. The Perceptron algorithm was one of the first algorithms to consider online learning for classification.

3.1.1 Weighted Majority Algorithm and Expert Prediction

Online learning and regret were further considered in binary prediction settings where the algorithm has access to a set of $N$ experts, each of which makes a binary prediction of some event. The idea is to design an algorithm which will use the experts' predictions to form its own prediction such that the number of incorrect predictions, i.e., mistakes, is small. One of the simplest algorithms, introduced in the 1970s and late 1980s for such a problem, is the Halving algorithm [5, 92], which works by predicting according to a majority vote among the experts' predictions. After each round of prediction, the Halving algorithm removes those experts who made an incorrect prediction from the set of experts, and the next vote is among only those left in the set. At most half of the experts will be removed from the set in each round, hence the name the Halving algorithm. If there exists at least one expert who never makes a mistake, this expert can be identified after at most $\log_2 N$ rounds, so the regret will never grow above $O(\log N)$.

The Halving algorithm makes a strong assumption that there exists at least one expert who will never make a mistake, which is unrealistic in most scenarios. For such scenarios, Littlestone and Warmuth introduced the Weighted Majority (WM) [93] algorithm in 1994. WM maintains a weight for each expert, and if an expert makes a mistake then its weight is decreased by a factor $\beta \in (0, 1]$. In each round, WM calls a vote and each expert's prediction is adjusted by its weight; therefore, badly performing experts' predictions contribute less to the overall prediction vote. WM then predicts by selecting the prediction which has the highest aggregate weight. Therefore, WM is able to identify well-performing experts; however, it never completely eliminates experts from the set of experts since their weights never reach zero. It can be similarly shown that the regret of WM is of the order $\log N$, even without the assumption of a perfect expert.

The Halving algorithm and Weighted Majority algorithm fall under the framework of Prediction with Expert Advice (PWEA) [55, 56, 57, 82, 123, 27, 58, 60], which saw a surge of interest throughout the 1990s. One of the main ideas in PWEA is to maintain a weight vector $w_t \in \mathbb{R}^N_{++}$ over the $N$ experts and multiplicatively update the weights after each round: $w_{t+1}(i) = w_t(i)\,\beta^{\ell_t(w_t(i))}$, where $\beta > 0$ and $\ell_t(w_t(i))$ is the loss for expert $i$. The multiplicative update decreases the weights of those experts which incur non-zero loss and does not change the weights for those experts that incur zero loss. Such multiplicative update algorithms tend to learn exponentially fast which experts are best.

3.1.2 Online Gradient Descent and Exponentiated Gradient

Instead of a multiplicative update, one may also consider an additive update where the weights change as $w_{t+1} = w_t + \eta_t u$. The additive update moves the weight vector in the direction $u$ scaled by the step size $\eta_t$, where $u$ is designed to decrease the algorithm's loss at the next round. Often, the negative gradient of the loss function, i.e., $u = -\nabla\ell_t(w_t)$, is used since it points in the direction of steepest descent, which gives

$$w_{t+1} = w_t - \eta_t\nabla\ell_t(w_t) \qquad (3.1)$$

which is the well-known online Gradient Descent algorithm. The multiplicative and additive updates may seem different, but it was shown by Kivinen and Warmuth [82] in 1997 that the two updates can be derived from a common framework. The framework introduced is the optimization problem

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \ell_t(w) + R(w, w_t) \qquad (3.2)$$

where $\ell_t(\cdot)$ is a linear function and $R(w, w_t)$ is a function which measures the distance between a new weight vector $w$ and the previous weight vector $w_t$, such as the squared Euclidean distance, relative entropy, etc. They showed that one can derive online Gradient Descent in (3.1) by solving (3.2) with $R(w, w_t) = \frac{1}{2\eta_t}\|w - w_t\|_2^2$, where $\eta_t > 0$ is the learning rate:

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \ell_t(w) + \frac{1}{2\eta_t}\|w - w_t\|_2^2$$

$$= w_t - \eta_t\nabla\ell_t(w_t)\ . \qquad (3.3)$$

However, if we instead set $R(w, w_t) = \frac{1}{2\eta_t}\sum_{i=1}^{p} w(i)\log\left(\frac{w(i)}{w_t(i)}\right)$ and solve, we get

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p_+}\ \ell_t(w) + \frac{1}{2\eta_t}\sum_{i=1}^{p} w(i)\log\left(\frac{w(i)}{w_t(i)}\right)$$
$$\Rightarrow\quad w_{t+1}(i) = \frac{w_t(i)\exp\left(-\eta_t\nabla\ell_t(w_t)_i\right)}{\sum_{j=1}^{p} w_t(j)\exp\left(-\eta_t\nabla\ell_t(w_t)_j\right)} \quad \text{for } i = 1, \ldots, p \qquad (3.4)$$

which is a multiplicative update similar to WM and other PWEA algorithms, and which they called Exponentiated Gradient (EG). The EG algorithm has been used in the context of online portfolio selection [69], which is the application domain we use for our algorithms and which we will formally introduce in Chapter 4. Note, the framework in (3.2) can be used without a linear loss function if the proximal operator [105] can be computed efficiently, which provides a more modern viewpoint. The proximal operator is defined as
$$\operatorname{prox}_{\ell_t}(v) := \operatorname*{argmin}_{x}\ \ell_t(x) + \frac{1}{2}\|x - v\|_2^2\ . \qquad (3.5)$$

For example, if $\ell_t(x) = \lambda\|x\|_1$ then the proximal operator is the soft-thresholding (shrinkage) operator, which can be computed efficiently [105]. Numerous other proximal operators can be computed efficiently; however, as we will see in Chapter 7, many are still open problems.
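As a concrete illustration of this common case, the following is a minimal sketch of the soft-thresholding operator, i.e., the proximal operator of $\tau\|\cdot\|_1$ (the function name and example values are ours):

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: shrink each coordinate toward zero by tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

v = np.array([0.3, -1.2, 0.05, 2.0])
print(soft_threshold(v, 0.5))   # -> [ 0.  -0.7  0.   1.5]
```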

3.1.3 Online Convex Optimization

Building on the work of Kivinen and Warmuth [82], a paper by Zinkevich [130] presented an analysis of online Gradient Descent for general $G$-Lipschitz (bounded gradient) convex loss functions rather than only linear loss functions. Further, Zinkevich initiated the study of online optimization with convex functions, named Online Convex Optimization (OCO) (previously introduced in Figure 2.1). Zinkevich showed, for the OCO setting, that if the algorithm selects $w_t$ via online Gradient Descent (3.3) then the regret for any general convex loss function (with bounded gradients) is at most $O(\sqrt{T})$. The general OCO setting is applicable in numerous applications, has gained significant popularity since Zinkevich introduced it in 2003, and is the setting we consider in the first half of this thesis.
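To make the projected online Gradient Descent update concrete, the following is a minimal sketch for the case where the feasible set is the probability simplex; the projection routine and names are ours, and the step size schedule $\eta/\sqrt{t}$ is one standard choice.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (standard sort-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def online_gradient_descent(grad_fns, x0, eta=0.1):
    # x_{t+1} = Proj_X( x_t - eta_t * grad of loss_t at x_t ), with eta_t = eta / sqrt(t).
    x = x0.copy()
    for t, grad in enumerate(grad_fns, start=1):
        x = project_simplex(x - (eta / np.sqrt(t)) * grad(x))
    return x

# Toy run: two rounds with made-up linear losses <g_t, x>.
grads = [lambda x: np.array([1.0, -0.5, 0.2]), lambda x: np.array([-0.3, 0.4, 0.1])]
print(online_gradient_descent(grads, np.full(3, 1.0 / 3.0)))
```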

3.1.4 Follow-The-Regularized-Leader

A common and intuitive framework often used within the OCO setting for computing the next weight vector is to find the vector which minimizes all previous losses

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \sum_{\tau=1}^{t}\ell_\tau(w)\ . \qquad (3.6)$$

Such a framework is called Fictitious Play [22] in economics and game theory, and in machine learning it has been called Follow-The-Leader (FTL) [79]. It can be shown that FTL suffers linear regret [110] for linear loss functions because the weight vector can change dramatically from one round to the next, which Nature can take advantage of. One can avoid such volatile changes by introducing a scalar penalty function $R(\cdot)$ which penalizes changing the weight vector, and instead solve

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \sum_{\tau=1}^{t}\ell_\tau(w) + R(w)\ . \qquad (3.7)$$

Such a problem was coined Follow-The-Regularized-Leader (FTRL) by Shalev-Shwartz and Singer in 2007; however, algorithms solving such a problem were first studied in the early 2000s by Grove et al. [64] and Kivinen and Warmuth [83]. It can be shown [110] that when one solves (3.7) with $R(w) = \frac{1}{2\eta_t}\|w\|_2^2$ one obtains

$$w_{t+1} = -\eta_t\sum_{\tau=1}^{t}\nabla\ell_\tau(w_\tau) = w_t - \eta_t\nabla\ell_t(w_t)$$

which is the online Gradient Descent algorithm in (3.3). Similarly, if one sets $R(w) = \frac{1}{2\eta_t}\sum_{i=1}^{p} w(i)\log(w(i))$ one obtains the EG algorithm as in (3.4). The interesting thing to note here is that there are two optimization problems, (3.2) and (3.7) respectively:

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \ell_t(w) + R(w, w_t), \qquad (3.8)$$
$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \sum_{\tau=1}^{t}\ell_\tau(w) + R(w) \qquad (3.9)$$

and the solutions to such problems can be made equivalent by carefully selecting $R(\cdot)$ and $R(\cdot,\cdot)$. To make this clear, recall the Bregman divergence (Section 2.1.6) defined with respect to a strictly convex function $\psi$ as

$$d_\psi(w, w_t) = \psi(w) - \psi(w_t) - \langle\nabla\psi(w_t), w - w_t\rangle\ . \qquad (3.10)$$

If we set the function $R(w)$ in (3.9) as $R(w) = \|w\|_2^2$ and solve, we obtain the same solution (online Gradient Descent) as if we set the function $R(w, w_t)$ in (3.8) to a Bregman divergence with respect to $\psi(w) = R(w) = \|w\|_2^2$, which gives

$$d_\psi(w, w_t) = \|w\|_2^2 - \|w_t\|_2^2 - \langle 2w_t, w - w_t\rangle = \|w - w_t\|_2^2 = R(w, w_t)\ .$$

Similarly, setting $R(w) = \sum_{i=1}^{p} w(i)\log(w(i))$ gives the same solution, the EG algorithm, as if we set $R(w, w_t)$ to a Bregman divergence with respect to $\psi(w) = R(w) = \sum_{i=1}^{p} w(i)\log(w(i))$, which gives

$$d_\psi(w, w_t) = \sum_{i=1}^{p} w(i)\log(w(i)) - \sum_{i=1}^{p} w_t(i)\log(w_t(i)) - \sum_{i=1}^{p}(w(i) - w_t(i))(\log w_t(i) + \log e)$$
$$= \sum_{i=1}^{p} w(i)\log\left(\frac{w(i)}{w_t(i)}\right) = R(w, w_t)\ .$$

3.1.5 Online Mirror Descent and Composite Objective Mirror Descent

As we just saw, one can interpret FTRL (3.7) as a version of (3.2) when the regularizer is a Bregman divergence as

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \ell_t(w) + d_\psi(w, w_t)\ . \qquad (3.11)$$

Further, for linear loss functions it can be shown that the solutions of FTRL and (3.11) are equivalent [68, 110, 67]. Therefore, if we linearize the convex loss function we get

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \eta_t\langle\nabla\ell_t(w_t), w\rangle + d_\psi(w, w_t) \qquad (3.12)$$

which is called Online Mirror Descent (OMD) [103, 12] and which has the general solution

$$w_{t+1} = \nabla\psi^*\left(\nabla\psi(w_t) - \eta_t\nabla\ell_t(w_t)\right) \qquad (3.13)$$

where $\eta_t \geq 0$ is the step size and $\psi^*(\cdot)$ is the Fenchel conjugate of $\psi(\cdot)$. For general convex loss functions, OMD suffers a regret of $O(\sqrt{T})$, and for strongly convex loss functions it suffers a regret of $O(\log T)$. In 2010, Duchi et al. generalized OMD to composite loss functions $f_t = \ell_t + \Omega$, where $\ell_t$ is a convex loss function which can change with time and $\Omega$ is a fixed convex function. The Composite Objective Mirror Descent (COMID) [48] problem is

$$w_{t+1} := \operatorname*{argmin}_{w\in\mathbb{R}^p}\ \eta_t\langle\nabla\ell_t(w_t), w\rangle + \lambda_t\Omega(w) + d_\psi(w, w_t)\ . \qquad (3.14)$$

Note, the loss function is again linearized; however, the function $\Omega$ is not. Often one wants to induce structure in the solutions, and linearizing the objective function can lose such structure. Therefore, using COMID we can induce structure through the function $\Omega$, where the structure is maintained since $\Omega$ is not linearized. When the losses $\ell_t$ are Lipschitz the regret is $O(\sqrt{T})$, and when $\ell_t + \Omega$ are strongly convex functions the regret is $O(\log T)$. Note, COMID can be viewed as an online analogue of the (Fast) Iterative Shrinkage-Thresholding Algorithm (ISTA/FISTA) [29, 53, 43]. Duchi et al. only showed the fixed regret (2.13) of COMID. In this thesis, we make two contributions in Chapters 5 and 6: two algorithms which build on COMID and extend the regret analysis to include bounds on the shifting regret (2.14) when the composite objective includes a non-smooth term. Further, in Chapter 4 we will introduce the online portfolio selection problem and the work related to research along this thread. In Chapters 5-8, we will again build on COMID, now inducing structured solutions by carefully selecting the function $\Omega$. We make a number of practical contributions to the online portfolio selection literature using such structure.
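For the Euclidean proximal term $d_\psi(w, w_t) = \frac{1}{2}\|w - w_t\|_2^2$ and $\Omega(w) = \|w\|_1$, a single COMID step reduces to a gradient step followed by soft thresholding. The following is a minimal sketch under exactly those choices; the names and toy values are ours.

```python
import numpy as np

def comid_l1_step(w, grad, eta, lam):
    # argmin_w  eta*<grad, w> + eta*lam*||w||_1 + 0.5*||w - w_t||_2^2
    # = soft-threshold(w_t - eta*grad, eta*lam).
    v = w - eta * grad                                   # step on the linearized loss
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

w = np.zeros(5)
grad = np.array([0.8, -0.1, 0.05, -2.0, 0.0])
print(comid_l1_step(w, grad, eta=0.5, lam=0.2))          # sparse next iterate
```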

3.2 Bandit Information

With bandit information, the feedback the algorithm receives is only the reward¹ function value for the action taken, $r_t(x_t)$. There are many types of bandit problems, which depend on the process generating the reward (stochastic or adversarial), the reward function model (probability distribution, linear, non-linear, general convex), and the type of feedback received (binary, real-valued) [24].

3.2.1 Stochastic K-Armed Bandits

The standard problem with bandit feedback is the stochastic K-armed bandit problem, which consists of $K$ arms (i.e., actions), each of which has a fixed and unknown distribution. In each round $t$, the algorithm selects an arm $I_t \in \{1, \ldots, K\}$ and a reward $r_t(I_t)$ is drawn i.i.d. from the $I_t$th distribution. The algorithm observes $r_t(I_t)$ and receives a reward of $r_t(I_t)$. The goal of the algorithm is to maximize its cumulative reward. The study of the problem began with William Thompson in 1933 [118] for application in experimental design. Thompson considered a Bayesian approach and developed an algorithm, later named Thompson Sampling (TS). TS works by selecting a likelihood function of the reward given a set of parameters and a prior distribution over the parameters, and computes a posterior distribution of the reward according to Bayes' rule. Then, for each arm a sample is drawn i.i.d. from the posterior distribution and the arm with the largest sampled reward is selected to be played in the next round. After the reward is observed, a new posterior distribution is computed and another set of samples is drawn. TS can be considered a probability matching algorithm.

¹ The use of rewards or losses and arms or decisions depends on the community. We will use the terminology which follows from the community, and it should be clear from the context what is intended.

TS is one approach to the K-armed bandit problem. Another approach, introduced by Lai and Robbins in 1985 [84], uses the optimism-in-the-face-of-uncertainty (OFU) principle, which states one should select the action with the largest potential reward. A general framework which uses the OFU principle was briefly discussed in Section 2.6, and we will follow such a framework in Chapters 9 and 10. More specifically, the framework uses the OFU principle by computing an upper confidence bound (UCB) for each arm, which consists of the mean observed reward $\mu_k = \frac{1}{n_k}\sum_{\tau=1}^{t} r_\tau(I_\tau)\,\mathbf{1}_{\{I_\tau = k\}}$, where $\mathbf{1}_{\{I_\tau = k\}}$ is the indicator function which returns 1 if the condition $\{I_\tau = k\}$ is true and 0 if not and $n_k$ is the number of times arm $k$ has been selected, and a confidence interval $\sigma_k$. The algorithm optimistically selects an arm via

$$I_{t+1} := \operatorname*{argmax}_{k\in\{1,\ldots,K\}}\ \mu_k + \sigma_k\ .$$

The UCB1 algorithm follows this approach: it computes the mean reward as above and computes the confidence interval using the Chernoff-Hoeffding inequality as $\sigma_k = \sqrt{2\log t / n_k}$, where $t$ is the current round and $n_k$ is the number of times arm $k$ has been selected. UCB1's regret is $O(K\log T)$.
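As a concrete illustration of this index-based approach, the following is a minimal sketch of a UCB1-style algorithm on Bernoulli arms; the arm means and helper names are made up for the example.

```python
import numpy as np

def ucb1(pull, K, T):
    # Pull each arm once, then repeatedly pick the arm maximizing
    # empirical mean + sqrt(2 * log(t) / n_k).
    counts, sums = np.zeros(K), np.zeros(K)
    for k in range(K):                                    # initialization round
        sums[k] += pull(k); counts[k] += 1
    for t in range(K + 1, T + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        k = int(np.argmax(sums / counts + bonus))
        sums[k] += pull(k); counts[k] += 1
    return sums.sum()                                     # cumulative reward collected

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]                              # hypothetical Bernoulli arms
pull = lambda k: float(rng.random() < true_means[k])
print(ucb1(pull, K=3, T=2000))
```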

3.2.2 Contextual and Stochastic Linear Bandits

Building on the standard K-armed bandit problem, one can consider scenarios where a $p$-dimensional feature vector is provided for each of the $K$ decisions and the expected loss is a linear function of the feature vector and an unknown parameter. Such a problem is called the contextual linear bandit problem and has become popular for online advertising, news article recommendation, and medical treatment applications [24]. Several papers in the 2000s and early 2010s studied the contextual linear bandit under the OFU principle, took a similar UCB approach [7, 90, 32], and showed the regret to be $O(\log^{3/2}(K)\sqrt{p}\sqrt{T})$. However, for some applications, such as medical treatments where a treatment may consist of an infinite number of drug mixtures, the dependence on the number of arms $K$ can be problematic when it is large or infinite. For such settings, Dani et al. in 2008 studied the stochastic linear bandit problem [38], where the decision set is an arbitrary compact set in $\mathbb{R}^p$. They presented the algorithm

ConfidenceBall2, based on the OFU principle, which computes an estimate $\hat{\theta}_t$ of the unknown parameter $\theta^*$ using ridge regression and constructs a confidence ellipsoid around $\hat{\theta}_t$ with radius $\sqrt{\beta_t}$. Then, $x_t$ and $\tilde{\theta}_t$ are selected optimistically from the decision set and confidence ellipsoid respectively, such that $\langle x_t, \tilde{\theta}_t\rangle$ is minimized. They showed how to set $\beta_t$ such that $\theta^*$ stays within the ellipsoid with high probability for all $t$, and showed a regret² of $\widetilde{O}(p\sqrt{T})$. Note, the regret depends on the ambient dimensionality $p$ and not on the number of decisions. Further, two papers [109, 1] around 2010 showed how to construct tighter confidence ellipsoids and, in Abbasi-Yadkori et al. [1], such confidence ellipsoids decreased the regret by a $\sqrt{\log T}$ multiplicative factor.

Building on such works, which had considered the problem without structural assumptions on $\theta^*$, two papers published simultaneously in 2012 [26, 2] considered the problem where $\theta^*$ is $s$-sparse. [2] followed the same problem setting as [38] and presented a method which can use the predictions of any full information online algorithm with an upper bound on its regret to construct confidence sets. When constructing confidence sets using the algorithm SeqSEW [61], they showed a regret of $\widetilde{O}(\sqrt{spT})$. [26] does not consider the standard stochastic linear bandit problem. Instead, they define the loss as $\ell_t(x_t) = \langle x_t, \theta^*\rangle + \langle x_t, \eta_t\rangle$, where $\eta_t$ is i.i.d. white noise, and assume the decision set is the unit $L_2$ ball. Their algorithm performs random estimation and then uses techniques from compressed sensing to identify the subspace where $\theta^*$ lives, then runs ConfidenceBall2 where the decision set is a subspace. They show a regret of $\widetilde{O}(s\sqrt{T})$. Such work shows that when it is known that $\theta^*$ is sparse, sharper regret bounds can be shown. This realization motivates the second half of this thesis, which we present in Chapter 9 and Chapter 10. In Chapter 9, we consider the stochastic linear bandit problem under structural assumptions on the unknown parameter $\theta^*$. We show sharper regret bounds for any norm-structured $\theta^*$, not just sparse.
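To illustrate the OFU template used by such algorithms, the following is a minimal sketch of a ridge-regression-based optimistic selection over a finite candidate set. This is a generic OFU/LinUCB-style step, not ConfidenceBall2 itself, and the confidence radius beta is treated here as a tuning constant, whereas the theory prescribes a specific schedule.

```python
import numpy as np

def ofu_linear_step(X_hist, y_hist, candidates, beta, lam=1.0):
    # Ridge estimate of the unknown parameter, then pick the candidate action
    # maximizing the optimistic value <x, theta_hat> + beta * ||x||_{V^{-1}}.
    p = candidates.shape[1]
    V = lam * np.eye(p) + X_hist.T @ X_hist
    theta_hat = np.linalg.solve(V, X_hist.T @ y_hist)
    V_inv = np.linalg.inv(V)
    bonus = beta * np.sqrt(np.einsum('ij,jk,ik->i', candidates, V_inv, candidates))
    return int(np.argmax(candidates @ theta_hat + bonus)), theta_hat

X_hist = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # past actions (toy)
y_hist = np.array([0.9, 0.1, 1.1])                        # observed rewards (toy)
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(ofu_linear_step(X_hist, y_hist, candidates, beta=0.5))
```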

² The $\widetilde{O}(\cdot)$ notation selectively hides constant and log factors.

3.2.3 Stochastic Binary and Generalized Linear Model Bandits

Bandit problems have also been considered which do not assume a linear loss function as in the contextual or stochastic linear bandits. Such problems typically consider losses which are found in Generalized Linear Models (GLMs). GLMs have been considered in contextual bandits [89], where it was shown experimentally that the GLM provides improvements over linear models, specifically when the losses are binary. Special cases of the GLM setting, such as binary losses, have been analyzed in the standard K-armed bandit problem [80] and in the finite-vector [80] and infinite-vector [129] stochastic linear bandit problems, where they show regret bounds matching the linear setting. The GLM setting has been considered in the finite-arm stochastic linear bandit problem [54]; however, no previous works have considered GLMs for the standard stochastic linear bandit problem, which allows for decision sets with an infinite number of vectors. In Chapter 10 we build on our results in Chapter 9 and consider non-linear loss functions which are used in Generalized Linear Models. We show regret bounds which match the linear loss function setting for any norm-structured $\theta^*$ and any GLM loss function.

Chapter 4

Online Portfolio Selection

Our full information algorithms in Chapters 5-8 are designed for general resource allocation problems. One instance of such problems, which motivates many of the techniques we use, is online portfolio selection, where one must decide how to distribute wealth across a number of different assets. We use the problem to experimentally demonstrate the effectiveness of our algorithms and formally introduce it here.

4.1 Model

The online portfolio selection model is an instance of the Online Convex Optimization (OCO) setting discussed in Section 3.1.3. Specifically, it involves a stock market consisting of $n$ stocks $\{s_1, \ldots, s_n\}$ over a span of $T$ periods, where a period can be any amount of time (minute, day, year); for ease of exposition, we will consider a time period to be a day. Let $x_t(i) = \frac{\text{closing price}}{\text{opening price}}$ denote the price relative of stock $s_i$ in day $t$, i.e., the multiplicative factor by which the price of $s_i$ changes in day $t$. Hence, $x_t(i) > 1$ implies a gain, $x_t(i) < 1$ implies a loss, and $x_t(i) = 1$ implies the price remained unchanged.

We assume all stocks are alive for all $T$ days so that $\infty > x_t(i) > 0\ \forall\, i, t$. Let $x_t = [x_t(1), \ldots, x_t(n)]^\top$ denote the vector of price relatives for day $t$, and let $x_{1:t}$ denote the collection of such price relative vectors up to and including day $t$. A portfolio $p_t = [p_t(1), \ldots, p_t(n)]^\top \in \mathcal{P}$ on day $t$ must be selected from the convex set $\mathcal{P}$, which is the $n$-dimensional probability simplex $\mathcal{P} = \Delta_n := \{p : p(i) \geq 0\ \forall i,\ \sum_{i=1}^{n} p(i) = 1\}$. $p_t$ is a probability distribution over the $n$ stocks which prescribes investing a $p_t(i)$ fraction of the

current wealth in stock $s_i$, which implies the portfolio is self-financing and all wealth must be invested in the portfolio at each time period. It also implies only long positions can be held, so shorting or buying on margin is not allowed; however, we will relax such constraints in Chapter 8. Note, the portfolio $p_t$ has to be decided before knowing the price relatives $x_t$, which are only revealed at the end of the day after the market closes. Under such a model, the multiplicative gain in wealth at the end of day $t$ is $p_t^\top x_t$.

For a sequence of price relatives x1:t−1 up to day (t−1), the sequential portfolio selection problem in day t is to determine a portfolio pt based on past performance of the stocks.

At the end of day $t$, $x_t$ is revealed and the actual performance of $p_t$ gets determined by $p_t^\top x_t$. Over $T$ periods, for a sequence of portfolios $p_{1:T}$, the multiplicative and logarithmic gains in wealth are respectively

$$S_T(p_{1:T}, x_{1:T}) = \prod_{t=1}^{T}\left(p_t^\top x_t\right), \qquad \mathrm{LS}_T(p_{1:T}, x_{1:T}) = \sum_{t=1}^{T}\log\left(p_t^\top x_t\right)\ .$$

Following Cover's work in portfolio theory and log-optimal portfolios [33], the goal is to maximize the doubling rate across all rounds of investing. The problem each day then is to compute a log-optimal portfolio $p_{t+1} = \operatorname*{argmax}_p \log(p^\top x_t)$. Therefore, in a costless environment, i.e., no transaction costs, we want to maximize $\mathrm{LS}_T(p_{1:T}, x_{1:T})$ over $p_{1:T}$. However, we cannot pose it as a batch optimization problem due to the temporal nature of the choices: $x_t$ is not available when one has to decide on $p_t$. Further, (statistical) assumptions regarding $x_t$ can be difficult to make. As such, we will focus on minimizing the fixed regret (2.13) and (in some cases) the shifting regret (2.14), where the convex loss function is the negative logarithmic gain in wealth.
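As a small sanity check of these definitions, the following sketch computes $S_T$ and $\mathrm{LS}_T$ for a given sequence of portfolios and price relatives; the toy numbers are made up.

```python
import numpy as np

def wealth(portfolios, price_relatives):
    # portfolios and price_relatives have shape (T, n).
    daily = np.einsum('ti,ti->t', portfolios, price_relatives)   # p_t^T x_t each day
    return np.prod(daily), np.sum(np.log(daily))                 # (S_T, LS_T)

x = np.array([[1.01, 0.99, 1.02],
              [0.98, 1.03, 1.00],
              [1.05, 0.97, 1.01],
              [1.00, 1.02, 0.99]])
p = np.full_like(x, 1.0 / 3.0)        # uniform constantly rebalanced portfolio
print(wealth(p, x))
```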

4.2 Algorithms

Algorithms for automatically designing portfolios based on historical stock market data have been extensively investigated in the literature for the past five decades [81, 97, 112, 88]. With the realization that any statistical assumptions regarding the stock market may be inappropriate and eventually counter-productive, over the past two decades new methods for portfolio selection have been designed which make no statistical assumptions regarding the movement of stocks [34, 35, 69]. In a well-defined technical sense, such methods are guaranteed to perform competitively with certain families of adaptive portfolios even in an adversarial market. From the theoretical perspective, algorithm design for portfolio selection has largely been a success story [28, 34, 69]. We build on such works and contribute to this line of literature by designing online portfolio selection algorithms which explicitly consider transaction costs in Chapter 5, trading using market sectors in Chapter 6, diversified trading in Chapter 7, and hedged trading in Chapter 8. No previous algorithms have considered such practical aspects and strategies in the online portfolio selection literature. In what follows, we describe a few of the popular and/or benchmark algorithms which we will compete against.

4.2.1 Single Stock and Buy-and-Hold

One of the simplest algorithms one can use is to invest the entire wealth each day into the same single stock $s_i$. For a portfolio $p_t$ on day $t$, the trading strategy is $p_t(i) = 1$ for the stock $s_i$ the algorithm wants to invest in and $p_t(j) = 0$ for $j \neq i$. Investing in the same single stock for the entire trading period is not a sophisticated strategy. However, in our experiments, we will show results for the best single stock, i.e., the stock which earns the most cumulative wealth over the trading period in hindsight. A more common trading algorithm is the Buy-and-Hold strategy [28], where on the first day of the trading period the algorithm distributes its wealth across the $n$ stocks according to an initial portfolio $q$ (i.e., $p_1 = q$), after which the algorithm does not trade. The best Buy-and-Hold strategy is to select the best single stock in hindsight. One special case of the Buy-and-Hold strategy which we will compete against is the Uniform Buy-and-Hold (U-BAH), where the initial strategy is a uniform distribution across all stocks, $q(i) = \frac{1}{n}$ for $i = 1, \ldots, n$. Such a strategy allows the wealth in the portfolio to naturally follow the performance of the market.

4.2.2 Constantly Rebalanced Portfolio

A Constantly Rebalanced Portfolio (CRP) [28, 36] is a portfolio which is set to a fixed distribution of wealth over the stocks each day. Specifically, given a fixed distribution $q$, each day $t$ the portfolio is set as $p_t = q$. Compared to the Buy-and-Hold strategy, the CRP strategy requires frequent trading because throughout the day the distribution of wealth in a portfolio will change due to price movements. Therefore, on the next day, in order to invest with a portfolio which has the same distribution of wealth $q$, the investor must trade in order to rebalance the portfolio $p_{t+1}$ back to the fixed distribution $q$. One special case of the CRP strategies is the Uniform Constantly Rebalanced Portfolio (U-CRP), where the fixed distribution is the uniform distribution and the portfolio each day is the uniform portfolio $p_t(i) = \frac{1}{n}$, $i = 1, \ldots, n$. For our portfolio selection algorithms, we will measure their performance by the fixed regret (2.13), which is the difference between the total loss of the algorithm and the total loss of the best CRP in hindsight.
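The following toy sketch contrasts the wealth of the uniform CRP (rebalance to $1/n$ every day) with the uniform Buy-and-Hold strategy (invest $1/n$ once and never trade); the price relatives are made up and chosen only to illustrate that rebalancing can outperform holding on oscillating prices.

```python
import numpy as np

def ucrp_wealth(x):
    # Rebalance to the uniform portfolio every day: the daily return is mean(x_t).
    return np.prod(x.mean(axis=1))

def ubah_wealth(x):
    # Invest 1/n per stock on day 1 and hold: average of the per-stock growth factors.
    return np.prod(x, axis=0).mean()

x = np.array([[1.10, 0.95], [0.95, 1.10], [1.10, 0.95], [0.95, 1.10]])
print(ucrp_wealth(x), ubah_wealth(x))    # ~1.104 vs ~1.092 on this toy sequence
```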

4.2.3 Universal Portfolio

Cover presented the Universal Portfolios (UP) algorithm in [34]. Let $q$ be a CRP. UP maintains a distribution $\mu(q)$ over all CRPs in the $n$-dimensional probability simplex and computes a new portfolio by observing the price relative $x_t$ and performing a

Bayesian update. Let $S_{t-1}(q, x_{1:t-1})$ be the cumulative wealth of CRP $q$ over a period of $t-1$ days. UP computes a portfolio $p_t$ which is a weighted average over all CRPs as
$$p_t(i) = \frac{\int_{\Delta_n} q(i)\, S_{t-1}(q, x_{1:t-1})\, \mu(q)\, dq}{\int_{\Delta_n} S_{t-1}(q, x_{1:t-1})\, \mu(q)\, dq} \quad \text{for } i = 1, \ldots, n\ .$$
UP has a fixed regret of $O(\log T)$; however, one downside of UP is that it is computationally demanding and often infeasible in practice.

4.2.4 Exponentiated Gradient

We have already introduced the EG algorithm [69] in (3.4), which was motivated by the problem:

$$p_{t+1} = \operatorname*{argmin}_{p}\ \ell_t(p) + R(p, p_t)\ .$$

When setting the regularizer to the (scaled) KL divergence $R(p, p_t) = \frac{1}{2\eta_t}\sum_{i=1}^{n} p(i)\log\left(\frac{p(i)}{p_t(i)}\right)$ we get the multiplicative update

$$p_{t+1}(i) = \frac{p_t(i)\exp\left(-\eta_t\nabla\ell_t(p_t)_i\right)}{\sum_{j=1}^{n} p_t(j)\exp\left(-\eta_t\nabla\ell_t(p_t)_j\right)} \quad \text{for } i = 1, \ldots, n\ .$$

For application to the online portfolio selection problem, we use the negative logarithmic gain in wealth $\ell_t(p_t) = -\log(p_t^\top x_t)$ as the loss function, which gives

$$p_{t+1}(i) = \frac{p_t(i)\exp\left(\eta_t\,\frac{x_t(i)}{p_t^\top x_t}\right)}{\sum_{j=1}^{n} p_t(j)\exp\left(\eta_t\,\frac{x_t(j)}{p_t^\top x_t}\right)}\ .$$

In contrast to the UP algorithm, EG is computationally efficient, with cost linear in the number of stocks $n$. Its regret can be shown to be $O(\sqrt{T})$ under the assumption that the price relatives are bounded away from zero, $x_t(i) > 0\ \forall\, i, t$, which implies it is competitive with the best CRP in hindsight; however, this is worse than the $O(\log T)$ regret of UP.
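A minimal sketch of this EG portfolio update; the names and toy numbers are ours.

```python
import numpy as np

def eg_update(p, x, eta):
    # Multiplicative update for the loss -log(p^T x), then renormalize onto the simplex.
    gross = p @ x                         # p_t^T x_t, the day's gross return
    w = p * np.exp(eta * x / gross)
    return w / w.sum()

p = np.full(3, 1.0 / 3.0)
x = np.array([1.02, 0.97, 1.01])          # one day of price relatives
print(eg_update(p, x, eta=0.05))
```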

4.3 Datasets

We perform experiments using 5 datasets consisting of price relatives from different markets, periods of time, and durations. The datasets were constructed by taking stock market prices and computing the price relatives from the following markets: Dow Jones Industrial Average (DJIA), New York Stock Exchange (NYSE), Standard & Poor's 500 (S&P 500), and the Toronto Stock Exchange (TSX). The datasets are different in nature: 25 out of the 30 stocks (83%) in the DJIA lost value, every stock in the NYSE increased in value, 6 of 263 stocks (2%) in the S&P500 lost value, 7 of the 25 stocks (28%) in the S&P500a lost value, and 32 out of the 88 stocks (36%) in the TSX lost value, as detailed in Table 4.1. Note, the NYSE dataset captures the bear market that lasted between January 1973 and December 1974 and has been used extensively in the literature for demonstration of empirical results [4, 15, 34, 69], while the S&P500 dataset captures the bull and bear markets of recent times, such as the dot-com bubble and burst between 1997-2000 and the financial and housing bubble burst occurring between 2007-2009.

Dataset      | Number of stocks | Number of stocks that lost value | Trading days | Years
DJIA [15]    | 30               | 25                               | 507          | 2001-2003
NYSE [69]    | 36               | 0                                | 5651         | 1962-1984
S&P500 [39]  | 263              | 6                                | 5162         | 1990-2010
S&P500a [15] | 25               | 7                                | 1276         | 1998-2003
TSX [15]     | 88               | 32                               | 1259         | 1994-1998

Table 4.1: Datasets with data taken from 4 different markets and trading periods.

Chapter 5

Online Lazy Updates

We start by considering problems which admit full information feedback. In particular, in this chapter we¹ will study how to design and analyze an algorithm which makes sparse or lazy updates to a resource allocation. In the context of portfolio selection, the algorithm makes sparse trades in order to control transaction costs. The work in this chapter first appeared as a peer-reviewed conference paper [41].

5.1 Introduction

With the ever increasing amount of data, particularly from search engines and social networks, online optimization algorithms have become desirable because of their efficiency and strong theoretical guarantees [3, 12, 16, 17, 111]. However, a major challenge is the cost of updating model parameters, especially when there are billions of parameters. Often when parameters are updated, their values do not change significantly; therefore, the cost of updating each parameter starts to outweigh the benefit. An important and relevant application where changing the model parameters might prove to be monetarily expensive is the domain of online portfolio selection. Every time an investor makes changes to the portfolio by buying or selling stocks, transaction costs are incurred. Hence, trading aggressively might hurt an investor instead of being beneficial. In such a situation, it might be helpful to make sparse or lazy updates to a portfolio. In Section 4.2, we mentioned related work in portfolio selection and described a few

1 The work in this chapter was done in collaboration with Puja Das.

algorithms. Although the theoretical and empirical performance of such online portfolio selection algorithms has been encouraging, they have mostly ignored one crucial practical aspect of financial trading: transaction costs. These online algorithms [4, 15, 28, 34, 69] could be trading aggressively, and a major concern is the cost they would incur in a real-world scenario. The need for considering transaction costs in the design and analysis of algorithms has been raised in [4, 15, 28, 35, 69, 104, 88]. However, only [14] has extended the analysis of [34] to include transaction costs. Their strategy first computes a target portfolio using the Universal Portfolio algorithm [34] and then pays for the transactions proportionally from each stock. Their analysis shows that the performance guarantee of the Universal Portfolio algorithm still holds (and gracefully degrades) in the case of proportional transaction costs. However, [34] is computationally demanding and has been shown to have sobering empirical performance [39, 69]. [14] and heuristics like Anticor [15] and OLMAR [87] do not account for transaction costs in their algorithm design. Anticor and OLMAR rely on empirical results to show scalability of their strategies to transaction costs only as a post-processing step.

In this chapter, we introduce an online portfolio selection algorithm with transaction costs. The algorithm is penalized by a fixed percentage of the amount of transactions it makes on a per day basis. We pose this as a non-smooth online convex optimization problem and propose an efficient algorithm called Online Lazy Updates (OLU) to make lazy updates to our online portfolio. Furthermore, we prove that our lazy portfolio is competitive with fixed and shifting strategies which have the benefit of hindsight. We use the online portfolio selection model discussed in Chapter 4 for our experiments. Typically, there can be two types of transaction costs in real markets: (1) a fixed percentage of each transaction that the investor has to pay to a broker, or (2) a fixed amount paid per transaction (sell or buy). In this work, we look at costs of the first type, also known as proportional transaction costs in financial modeling [44, 95]. We conduct extensive experiments on the NYSE and S&P500 datasets (refer to Section 4.3 for more details on the two datasets). Our experiments show that our lazy portfolios are scalable with transaction costs and, interestingly, in some cases can earn more wealth than their non-lazy counterparts which do not take transaction costs into account.

Overview of Contributions: We make two main contributions. First, we present a general framework and an efficient algorithm for lazy updates and instantiate it for the problem of portfolio selection with transaction costs. We are one of the first to consider transaction costs in the algorithmic design, and we show that the algorithm outperforms existing methods in the literature with and without transaction costs. Second, we provide theoretical analysis of both the fixed and shifting regret of our algorithm for general Lipschitz convex functions and strongly convex functions. Such a contribution extends the existing literature by showing a shifting regret bound when the loss function is composed of a smooth term and a non-smooth term (which is used to control the transaction costs). Previous work has only shown fixed regret bounds for smooth and/or non-smooth loss functions.

5.2 Problem Formulation

We present a general formulation for the online lazy updates problem and show how portfolio selection with transaction costs is a special case of such a setting. In an online lazy updates setting, the optimization proceeds in rounds, where in round $t$ the algorithm has to pick a solution $p_t \in \mathcal{P} \subset \mathbb{R}^n$ from the feasible set such that it is close to the previous solution $p_{t-1}$. Nature then selects a convex loss function $\phi_t : \mathcal{P} \to \mathbb{R}$, the feedback the algorithm receives is the entire loss function $\phi_t(p)\ \forall p \in \mathcal{P}$, and the algorithm suffers a loss of $\phi_t(p_t)$. Ideally, over $T$ rounds we would like to minimize

$$\sum_{t=1}^{T}\phi_t(p_t) + \gamma\sum_{t=2}^{T}\|p_t - p_{t-1}\|_1\ . \qquad (5.1)$$

The $L_1$ penalty term ensures that updates to the solution $p_t$ are lazy. Absolute minimization of (5.1) is not feasible because we do not know the sequence of $\phi_t$ a priori.

If the φts are known, (5.1) reduces to a batch optimization problem: a special case is the fused lasso when φt is quadratic [120]. Hence, over T iterations we intend to get a sequence of pt such that the following regret bound is sublinear in T

$$R_T = \sum_{t=1}^{T} f_t(p_t) - \min_{p^*}\sum_{t=1}^{T} f_t(p^*) = o(T) \qquad (5.2)$$

where $f_t(p) = \phi_t(p) + \gamma\|p - p_{t-1}\|_1$ and $p^*$ is the minimizer of $\sum_{t=1}^{T} f_t$ in hindsight. Note, while the $p_t$s can change over time, $p^*$ is fixed, i.e., $p^* = \operatorname*{argmin}_p\sum_{t=1}^{T} f_t(p) = \operatorname*{argmin}_p\sum_{t=1}^{T}\phi_t(p)$, since it incurs zero $L_1$ penalty in every iteration. We will refer to this as the fixed regret bound.

Alternatively, over $T$ rounds we intend to get a sequence of $p_t$ such that the following shifting regret bound is sublinear in $T$:
$$R_T = \sum_{t=1}^{T} f_t(p_t) - \min_{p_1^*,\cdots,p_T^*}\sum_{t=1}^{T} f_t(p_t^*) = o(T) \qquad (5.3)$$
where the sequence $\{p_1^*, \cdots, p_T^*\}$ is the minimizer of $\sum_{t=1}^{T} f_t$. The bound in (5.3) is a shifting bound [70] since it allows the $\{p_1^*, \cdots, p_T^*\}$ to also lazily change over time. Following recent advances in online convex optimization, in order to accomplish a sublinear regret, we consider solving a linearized version of the problem obtained by a

first-order Taylor expansion of φt at pt along with a proximal term

$$p_{t+1} = \operatorname*{argmin}_{p\in\mathcal{P}}\ \underbrace{\phi_t(p_t) + \langle p - p_t, \nabla_{p_t}\phi_t(p_t)\rangle}_{\text{linear approximation}} + \underbrace{\gamma\|p - p_t\|_1}_{\text{lazy update penalty}} + \underbrace{d_\psi(p, p_t)}_{\text{proximal term}}\ . \qquad (5.4)$$

5.3 Algorithm

Online portfolio selection with transaction costs can now be viewed as a special case of our online lazy updates setting where $\mathcal{P} = \Delta_n = \{p : p(i)\geq 0,\ \sum_{i=1}^{n}p(i) = 1\}$ is the $n$-dimensional probability simplex, which forces the algorithm to invest all the wealth in each round of investing, and $\phi_t(p) = -\log(p^\top x_t)$ is the negative logarithmic gain in

wealth used as the loss function, with $\nabla_{p_t}\phi_t(p) = -\frac{x_t}{p_t^\top x_t}$ and $d_\psi(p, p_t) = \frac{1}{2\eta}\|p - p_t\|_2^2$. The portfolio for day $t+1$ is computed as
$$p_{t+1} := \operatorname*{argmin}_{p\in\Delta_n}\ -\log(p_t^\top x_t) - \frac{x_t^\top(p - p_t)}{p_t^\top x_t} + \gamma\|p - p_t\|_1 + \frac{1}{2\eta}\|p - p_t\|_2^2\ . \qquad (5.5)$$
To simplify the objective function, we drop the constant terms, since they do not change the minimizing argument. Then, multiplying each term by $\eta$, (5.5) can be written as
$$p_{t+1} := \operatorname*{argmin}_{p\in\Delta_n}\ -\frac{\eta\, x_t^\top p}{p_t^\top x_t} + \alpha\|p - p_t\|_1 + \frac{1}{2}\|p - p_t\|_2^2 \qquad (5.6)$$
where $\alpha = \eta\gamma$. The $L_1$ penalty term on the difference of two consecutive portfolios measures the fraction of wealth traded. The parameter $\gamma$ controls the amount that can be traded every day.

In the online lazy update framework, a new portfolio is computed as a function of $p_t$ and the price relatives $x_t$, and lives in the $n$-dimensional probability simplex. The purpose of the first term is to maximize the logarithmic wealth if the current price relative $x_t$ is replicated. The $L_1$ penalty term accounts for the amount of transaction that would take place to update the portfolio. The parameter $\alpha > 0$ determines how frequently trades are performed; high values of $\alpha$ lead to lazy updates of the portfolio with few transactions, while low values allow the portfolio to change more frequently. Our framework for updating a portfolio vector is analogous to the framework of the EG algorithm [69]. We use $\|\cdot\|_2^2$ as the distance function instead of relative entropy. However, unlike EG, we solve a non-smooth problem. We propose an ADMM (Alternating Direction Method of Multipliers [19]) based efficient primal-dual algorithm to obtain the lazy portfolio $p_{t+1}$ by solving (5.6). We can rewrite (5.6) in ADMM form by introducing an auxiliary variable $z$ as
$$p_{t+1} := \operatorname*{argmin}_{p\in\Delta_n,\ p - p_t = z}\ -\frac{\eta\, x_t^\top p}{p_t^\top x_t} + \alpha\|z\|_1 + \frac{1}{2}\|p - p_t\|_2^2\ . \qquad (5.7)$$

The ADMM formulation allows decoupling the non-smooth $L_1$ term from the smooth terms, which is computationally advantageous. The augmented Lagrangian for (5.7) is
$$L_\beta(p, z, u) = -\frac{\eta\, x_t^\top p}{p_t^\top x_t} + \alpha\|z\|_1 + \frac{1}{2}\|p - p_t\|_2^2 + \frac{\beta}{2}\|p - p_t - z + u\|_2^2 \qquad (5.8)$$
where $u = \frac{1}{\beta}\lambda$ is the scaled dual variable and $\lambda$ is the dual variable. ADMM consists of the following iterations for solving for $p_{t+1}$:
$$p_{t+1}^{(k+1)} := \operatorname*{argmin}_{p\in\Delta_n}\ -\frac{\eta\, x_t^\top p}{p_t^\top x_t} + \frac{1}{2}\|p - p_t\|_2^2 + \frac{\beta}{2}\|p - p_t - z^{(k)} + u^{(k)}\|_2^2 \qquad (5.9)$$

$$z^{(k+1)} := \operatorname*{argmin}_{z}\ \alpha\|z\|_1 + \frac{\beta}{2}\|p_{t+1}^{(k+1)} - p_t - z + u^{(k)}\|_2^2 \qquad (5.10)$$

$$u^{(k+1)} = u^{(k)} + \left(p_{t+1}^{(k+1)} - p_t - z^{(k+1)}\right)\ . \qquad (5.11)$$

Algorithm 1 shows the closed form updates derived for $p_{t+1}^{(k+1)}$, $z^{(k+1)}$, and $u^{(k+1)}$. The update for $p_{t+1}^{(k+1)}$ is derived by taking the derivative of (5.9), setting it to zero, and solving for $p$. The projection to the simplex $\Pi_{\Delta_n}$ is carried out as in [47]. The stopping criteria for the OLU algorithm are based on the primal and dual residuals from [19].

Algorithm 1 Online Lazy Update (OLU) with ADMM

1: Input $p_t$, $x_t$, $\eta$, $\alpha$, $\beta$
2: Initialize $p, z, u = 0_n$, $k = 0$
3: ADMM iterations:

$$p^{(k+1)} = \Pi_{\Delta_n}\left\{ p_t + \frac{\eta\, x_t}{(\beta + 1)\, p_t^\top x_t} + \frac{\beta z^{(k)}}{\beta + 1} - \frac{\beta u^{(k)}}{\beta + 1} \right\}$$
$$z^{(k+1)} = S_{\alpha/\beta}\left(p^{(k+1)} - p_t + u^{(k)}\right)$$
$$u^{(k+1)} = u^{(k)} + \left(p^{(k+1)} - p_t - z^{(k+1)}\right)$$

where $\Pi_{\Delta_n}$ is the projection onto the simplex and $S_\rho$ is the shrinkage (soft-thresholding) operator.
4: Continue until the stopping criteria are satisfied

Algorithm 2 Portfolio Selection with Transaction Costs
1: Input $\eta$, $\gamma$, $\beta$; compute $\alpha = \eta\gamma$
2: Initialize $p_1(h) = \frac{1}{n}$ for $h = 1, \ldots, n$; $p_1 = p_0$; $S_0^\gamma = 1$
3: For $t = 1, \ldots, T$

4: Receive $x_t$, the vector of price relatives
5: Compute cumulative wealth: $S_t^\gamma = S_{t-1}^\gamma \times (p_t^\top x_t) - \gamma \times S_{t-1}^\gamma \times \|p_t - p_{t-1}\|_1$

6: Update portfolio: $p_{t+1} = \mathrm{OLU}(p_t, x_t, \eta, \alpha, \beta)$
7: end for

Algorithm 2 is our online portfolio selection algorithm with transaction costs. It uses the OLU algorithm to compute the lazy portfolio $p_{t+1}$. It takes as input an additional parameter $\gamma \geq 0$, which is a fixed percentage charged for the total amount of transactions every day. $S_t^\gamma$ is the transaction cost-adjusted cumulative wealth at the end of $t$ days.
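The following is a minimal NumPy sketch of the OLU inner loop (Algorithm 1) as described above: a projected closed-form step in $p$, soft thresholding in $z$, and a dual update in $u$. The $p$-step uses the closed form derived from (5.9); the residual-based stopping criteria of [19] are replaced here by a fixed iteration budget for simplicity, so this is an illustrative sketch rather than the reference implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def olu_update(p_t, x_t, eta, alpha, beta, iters=50):
    # ADMM iterations for (5.6)/(5.7): p-step, z-step (shrinkage), dual u-step.
    n = len(p_t)
    p, z, u = p_t.copy(), np.zeros(n), np.zeros(n)
    grad_term = eta * x_t / (p_t @ x_t)
    for _ in range(iters):                      # fixed budget instead of residual test
        p = project_simplex(p_t + (grad_term + beta * (z - u)) / (beta + 1.0))
        z = soft_threshold(p - p_t + u, alpha / beta)
        u = u + (p - p_t - z)
    return p
```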

5.4 Regret Analysis

We now present an analysis of our OLU algorithm. The updates we consider follow the general form:

$$p_{t+1} = \operatorname*{argmin}_{p\in\mathcal{P}}\ \eta\langle\nabla\phi_t(p_t), p\rangle + \eta\gamma\|p - p_t\|_1 + d_\psi(p, p_t)\ . \qquad (5.12)$$

In general, a Bregman divergence with respect to a strictly convex function $\psi : \mathcal{P} \to$

$\mathbb{R}$ is defined as $d_\psi(p, p_t) = \psi(p) - \psi(p_t) - \langle\nabla\psi(p_t), p - p_t\rangle$. For our online lazy updates formulation, $\psi(p) = \frac{1}{2}\|p\|_2^2$ and $d_\psi(p, p_t) = \frac{1}{2}\|p - p_t\|_2^2$. For the portfolio

Algorithm 1, we obtain a sequence of lazy solutions {p1,..., pT } and on day t suffer a loss of ft(pt) = φt(pt) + γkpt − pt−1k1. Our first goal is to minimize the regret with respect to the best fixed solution p∗ in hindsight. Here, p∗ is the optimal, fixed (constantly rebalanced) portfolio we could have computed if we had knowledge of the price relatives x1:T beforehand. Our second goal is to minimize the regret with respect ∗ ∗ to the best sequence of shifting solutions {p1,..., pT } in hindsight. The sequence of ∗ ∗ solutions {p1,..., pT }, is the optimal portfolio we could have computed each day with knowledge of the price relatives beforehand. We first establish the standard fixed regret bounds for our lazy updates in Theorem 1 and Theorem 2. We then go on to establish the shifting bounds in Theorem 5.20. For both the fixed and shifting bounds, we focus on settings when the φts are general convex functions. For the fixed regret bound, we additionally consider when the φts are strongly convex functions. The proofs of all lemmas and theorems are in Appendix A.1.

5.4.1 Fixed Regret

We first focus on the case where the comparator class is fixed, i.e., $p^*$ is the minimizer of $\sum_{t=1}^{T}\phi_t$ in hindsight, as it incurs zero $L_1$ penalty in every iteration, and we prove regret bounds as in (5.2). We show that for general convex functions the regret is $O(\sqrt{T})$, while for strongly convex functions the regret is $O(\log T)$.

General Convex Functions

We assume that the $\phi_t$ are general convex functions with bounded (sub)gradients, i.e., for any $\hat{g} \in \partial\phi_t(p)$ we have $\|\hat{g}\|_2 \leq G$.

Lemma 1 Let the sequence $\{p_t\}$ be defined by the update in (5.12), with potentially time-varying $\eta_t, \gamma_t$. Let $d_\psi(\cdot,\cdot)$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$, i.e., $d_\psi(p, \hat{p}) \geq \frac{\lambda}{2}\|p - \hat{p}\|^2$, and let $\|p - \hat{p}\|_1 \leq L$, $\forall p, \hat{p} \in \mathcal{P}$. Then, for any $p^* \in \mathcal{P}$,

$$\eta_t\left[\phi_t(p_t) + \gamma_t\|p_{t+1} - p_t\|_1 - \phi_t(p^*)\right] \leq d_\psi(p^*, p_t) - d_\psi(p^*, p_{t+1}) + \eta_t\gamma_t L + \frac{\eta_t^2}{2\lambda}\|\nabla\phi_t(p_t)\|^2\ . \qquad (5.13)$$

Based on the above result, we obtain the following fixed regret bound for Algorithm 1:

Theorem 1 Let the sequence $\{p_t\}$ be defined by the update in (5.12). Let the $\phi_t$ be Lipschitz continuous functions for which $\|\nabla\phi_t(p_t)\|_2^2 \leq G$. Then, by choosing $\eta_t = \eta = \frac{c_1}{\sqrt{T}}$ and $\gamma_t = \frac{c_2}{\sqrt{t}}$ for $c_1, c_2 > 0$, we have

$$\sum_{t=1}^{T}\left[\phi_t(p_t) + \gamma_t\|p_{t+1} - p_t\|_1 - \phi_t(p^*)\right] \leq O(\sqrt{T})\ . \qquad (5.14)$$

Note, instead of an adaptive $\gamma_t = \frac{c_2}{\sqrt{t}}$, one can choose a fixed $\gamma = \frac{c_2}{\sqrt{T}}$ and obtain the same result, up to constants.

Strongly Convex Functions

Next, we assume that the $\phi_t$ are all $\beta$-strongly convex functions, so that for any $(p, p_t)$
$$\phi_t(p) \geq \phi_t(p_t) + \langle p - p_t, \nabla\phi_t(p_t)\rangle + \frac{\beta}{2}\|p - p_t\|^2\ . \qquad (5.15)$$

Lemma 2 Let the sequence $\{p_t\}$ be defined by the update in (5.12) with potentially time-varying $\eta_t$ and fixed $\gamma$. Let $d_\psi(\cdot,\cdot)$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|_2$, i.e., $d_\psi(p, \hat{p}) \geq \frac{\lambda}{2}\|p - \hat{p}\|_2^2$, and $\|p - \hat{p}\|_1 \leq L$, $\forall p, \hat{p} \in \mathcal{P}$. Assuming the $\phi_t$ are all $\beta$-strongly convex, for any $\gamma < \frac{\beta}{4}$ and any $p^* \in \mathcal{P}$, we have

$$\eta_t\left[\phi_t(p_t) - \phi_t(p^*) + \gamma\|p_{t+1} - p_t\|_1\right] \leq d_\psi(p^*, p_t) - d_\psi(p^*, p_{t+1}) + \frac{\eta_t^2}{2\lambda}\|\nabla\phi_t(p_t)\|^2 - \eta_t\left(\frac{\beta}{2} - 2\gamma\right)\|p^* - p_t\|^2\ . \qquad (5.16)$$

Theorem 2 Let the sequence $\{p_t\}$ be defined by the update in (5.12). Let the $\phi_t$ all be $\beta$-strongly convex and $\|\nabla\phi_t(p_t)\|_2^2 \leq G$. Then, for any $\gamma < \beta/4$, choosing $\eta_t = \frac{1}{\kappa t}$ where $\kappa \in (0, \beta - 4\gamma]$, and with $d_\psi(p, p') = \frac{1}{2}\|p - p'\|^2$, we have

$$\sum_{t=1}^{T}\left[\phi_t(p_t) + \gamma\|p_{t+1} - p_t\|_1 - \phi_t(p^*)\right] \leq O(\log T)\ . \qquad (5.17)$$

5.4.2 Shifting Regret

We now focus on the case where the comparator class can also shift or change over time, i.e., a time-varying sequence $\{p_1^*, \cdots, p_T^*\}$ is used as the comparator class for the cumulative loss $\sum_{t=1}^{T} f_t$. Such a time-varying sequence will incur a cumulative non-zero shifting penalty of the form

$$\mathrm{shift}_q(p_1^*, \ldots, p_T^*) = \sum_{t=1}^{T}\|p_t^* - p_{t+1}^*\|_q\ , \qquad (5.18)$$

for some suitable $q$-norm with $q \geq 1$. The regret bounds in this section are in terms of such shifting penalties. A specific case of interest is when the comparator class consists of constant shifting penalties, i.e.,

$$\mathrm{shift}_q(p_1^*, \ldots, p_T^*) \leq c\ . \qquad (5.19)$$

We show that for this important special case, the shifting regret bounds are of the same order as the regret bounds with a fixed comparator, i.e., $p_t^* = p^*$, as discussed earlier.

General Convex Functions

We assume that the $\phi_t$ are general convex functions with bounded (sub)gradients, i.e., for any $\hat{g} \in \partial\phi_t(p)$ we have $\|\hat{g}\| \leq G$.

Theorem 3 Let $\{p_1^*, \cdots, p_T^*\}$ be any sequence of portfolios serving as a comparator in (5.12). Let $d_\psi(\cdot,\cdot)$ be $\lambda$-strongly convex with respect to the norm $\|\cdot\|$, i.e., $d_\psi(p, \hat{p}) \geq \frac{\lambda}{2}\|p - \hat{p}\|^2$, and $\|p - \hat{p}\|_1 \leq L$, $\forall p, \hat{p} \in \mathcal{P}$. For $\eta_t = \eta = \frac{c_1}{\sqrt{T}}$ and $\gamma_t = \frac{c_2}{\sqrt{t}}$ with $c_1, c_2 > 0$, $\frac{1}{r} + \frac{1}{q} = 1$, and $\|\nabla\psi(p_t)\|_r \leq \zeta$, we have

$$\sum_{t=1}^{T}\left[\phi_t(p_t) - \phi_t(p_t^*)\right] + \sum_{t=1}^{T}\left[\gamma_t\|p_{t+1} - p_t\|_1 - \gamma_t\|p_{t+1}^* - p_t^*\|_1\right] \qquad (5.20)$$
$$\leq O(\sqrt{T}) + \frac{\sqrt{T}}{c_1}\left\{ d_\psi(p_1^*, p_1) - d_\psi(p_{T+1}^*, p_{T+1}) + \psi(p_{T+1}^*) - \psi(p_1^*) + \zeta\sum_{t=1}^{T}\|p_t^* - p_{t+1}^*\|_q \right\}\ .$$

Assuming $d_\psi(p, p') \leq c_3$, $\psi(p) \leq c_4$, and the shifting penalty $\sum_{t=1}^{T}\|p_t^* - p_{t+1}^*\|_q \leq c$, the shifting regret is $O(\sqrt{T})$, which is the same order as the fixed regret.

5.5 Experiments and Results

In this section, we present experimental results for our OLU algorithm for transaction cost-adjusted portfolio selection on two real-world datasets: the New York Stock Exchange (NYSE) [34] and the Standard & Poor's 500 (S&P 500) [39] (refer to Section 4.3 for more details).

5.5.1 Methodology and Parameter Setting

In all the experiments, an initial investment of $1 and an initial portfolio uniformly distributed over all the stocks were used. Algorithm 2 was used to obtain the portfolios and compute the transaction cost-adjusted wealth each day. For all the experiments, $\beta = 0.1$, which was found to give reasonable and consistent accuracy. Table 5.1 contains the description of the various parameters. Since the two datasets are different in nature (stock composition and duration), we experimented with a range of $\eta$ and $\alpha$ values in $[10^{-6}, 10^6]$ to observe their effect on the lazy updates. Moreover, a range of $\gamma$ values in $[0\%, 2\%]$ was used to compute the proportional transaction costs. Representative plots from either the NYSE or S&P500 datasets were used to illustrate the results. We compared the wealth obtained from the EG algorithm, a uniform constantly rebalanced portfolio (U-CRP), and a Buy-and-Hold strategy (without transaction costs) as benchmarks. EG has been shown to be competitive with U-CRP and, in some cases, to outperform it in terms of wealth earned [39, 69]. For the Buy-and-Hold strategy, we started with a uniformly distributed portfolio and held the positions thereafter, i.e., no trades. Additionally, the S&P500 Index was used as a representative index for the US stock market to analyze the activity of our lazy update algorithm. We did not compare our method with Anticor or OLMAR because these heuristics do not account

η | Weight on logarithmic gain in wealth.
γ | Fixed (pre-specified) transaction cost expressed as a percentage.
α | Weight on the L1 penalty term; computed in Algorithm 2 as η × γ.
β | Weight on the augmented Lagrangian; parameter for the OLU algorithm with ADMM.

Table 5.1: Parameter descriptions as given in (5.6) and used in Algorithms 1 and 2.

[Figure 5.1 panels: histograms of the number of days versus the total L1 traded, for α ∈ {0, 2.6, 4} (S&P500) and α ∈ {0, 0.02, 0.087} (NYSE).]

(a) L1 values for the S&P500 dataset. (b) L1 values for the NYSE dataset.

Figure 5.1: With a low α = 0, we are effectively removing the lazy updates term and allowing the algorithm to trade aggressively to maximize returns. This is equivalent to what the EG algorithm is doing. With higher α values, we penalize the amount of transactions more severely and, as such, the total amount traded decreases as we increase α.

for transaction costs in their algorithmic framework and have no theoretical guarantees. We also did not compare with the Universal Portfolio algorithm because it has been shown to have sobering empirical performance even in the absence of transaction costs when compared to EG and Buy-and-Hold [39, 69].

5.5.2 Effect of α and the L1 penalty

The parameter α is the weight on the L1 penalty term and can influence (a) the total daily amount of transactions, (b) the total amount of trades and transactions, (c) the daily percentage of transactions, and (d) the stock activity.

(a) Total Daily Amount of Transactions: Let $\Upsilon_t = \|p_{t+1} - p_t\|_1$; then $\sum_{t=1}^{T-1}\Upsilon_t$ is a measure of the total amount that a trader had to pay in transaction costs over $T$ days (as a fraction of his wealth). Figure 5.1 plots histograms of $\Upsilon_t$ for varying α values.

We observe that as α increases, the $\Upsilon_t$ value is small for most days. With α = 0, $\Upsilon_t$ was 2 for most days, denoting non-lazy portfolios, which is how portfolios are expected to trade in a costless environment.

(b) Total Amount of Trades and Transactions: We now analyze the behavior of the total number of trades ($L_0$ norm) and the total amount of transactions ($\sum_{t=1}^{T-1}\Upsilon_t$)

[Figure 5.2 panels: total number of trades (L0) and total amount of transactions (L1) versus α, for several η values on each dataset.]

(a) Number (L0 norm) and amount (L1 norm) of trades for the S&P500 dataset. (b) Number (L0 norm) and amount (L1 norm) of trades for the NYSE dataset.

Figure 5.2: As α increases, the total amount of transactions and the number of trades decrease for the NYSE dataset. However, for the S&P500 dataset, as we increase α the total amount of transactions decreases but the total number of trades does not.

as we increase the value of α. Figure 5.2 gives an overview of the behavior of the aforementioned quantities as we increase α. Figure 5.2 confirms that the total L1 norm decreases as we increase α. The total L0 norm, however, does not always decrease as we increase α. Figure 5.2(a) shows such a situation for the S&P500 dataset. For the

NYSE dataset in Figure 5.2(b), however, we observe that the L0 norm also decreases as we increase α.

(c) Daily Percentage of Transactions: Figure 5.3(a) plots the percentage of stocks traded per day for the S&P500 dataset for three values of α. We observe that as we increase the weight on the L1 penalty term by tuning α, the number of transactions decreases. Whereas a large fraction of the 263 stocks was traded every day for α = 0, with higher values of α the number of transactions reduces significantly. We observe a similar trend for the NYSE dataset.

(d) Active Stocks: Figure 5.3(b) plots the number of stocks which comprise 80% of the total wealth on a per day basis, which we call the active stocks. As α increases, the lazy behavior of the portfolios becomes more apparent and the online portfolios change their composition only on a handful of days.

[Figure 5.3(a) panels: percentage of stocks traded per day for α = 1 (wealth = 0.47), α = 2.6 (wealth = 901.00), and α = 4 (wealth = 139.74), 1991-2009.]

(a) Percentage of stocks traded each day.

α=1, wealth=0.47286 102 100 10−2 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09 α=2.6, wealth=900.9956 102 100 10−2 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09 Active Stocks α=4, wealth=139.741 102 100 10−2 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09 Year

(b) Stocks which comprise 80% of the wealth.

Figure 5.3: In (a) with α = 1, the percentage of stocks frequently traded is high. As α increases, the percentage and frequency of stocks traded decreases. However, for each value of α we still see activity during periods of economic instability. In (b) with α = 1, the number of active stocks changes often which shows the algorithm is making frequent trades. As α increases, the number of active stocks stabilizes which means the algorithm is not trading as often and instead investing and holding on to a few stocks.

Correlation with the US market: In Figure 5.3(b), we observe significant activity between years 2002-2003 and between years 2008-2009. On plotting the value of the S&P500 index for the US market between 1990 and 2010 in Figure 5.4, we realized that the increase in trading activity reflected two major market movements: the dot-com and housing bubble bursts. Figure 5.4 shows that the days of high stock activity coincides with major market movements. Similar trends were observed for the NYSE dataset. 50 2000 SP500 Index 1000 0 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09

2 10 Active stocks with α=4 100 10−2 ‘91 ‘93 ‘95 ‘97 ‘99 Year‘01 ‘03 ‘05 ‘07 ‘09

Figure 5.4: We compare active stocks with α = 4 to the S&P500 index. When the index decreases between 2000-2003 and 2007-2009, the number of active stocks increases and OLU starts to make frequent trades. This implies during times of economic instability, e.g., the dot-com and financial crashes, OLU is trading frequently to find good stocks however, many stocks are performing badly with high volatility.

γ 5.5.3 Wealth with Transaction Costs (ST )

To evaluate the practical application of our proposed algorithm, we now analyze its performance when calculating the transaction cost-adjusted cumulative wealth. Fig- ure 5.5 shows how the choice of different α values affect the transaction cost-adjusted cumulative wealth for the two datasets (for a fixed η value). Figures 5.5(a) and 5.5(b) demonstrate that there exists a regime of α = ηγ which makes an optimum choice be- tween exploration and exploitation of stocks. Since γ can be fixed, the learning rate η can be adequately chosen to maximize wealth. Low values of α tend to aggressively change the portfolio too often. Whereas, with high values, the algorithm is too conser- vative and might not be able to take advantage of short trends in the market. EG, U-CRP, and Buy-and-Hold: We compared the total wealth without transac- tion costs of OLU with that of EG, U-CRP, and a Buy-and-Hold strategy. EG, U-CRP, and Buy-and-Hold are plotted as horizontal lines and we can see that for the NYSE dataset U-CRP returns $27.08, EG returns $26.70, and Buy-and-Hold returns $14.50. For the S&P500 dataset U-CRP returns $27.29, EG returns $26.78, and Buy-and-Hold returns $16.65. In comparison, OLU returns $50.80 and $901.00 respectively without transaction costs. OLU returns almost 2x as much wealth for the NYSE dataset and 33x as much wealth for the S&P500 dataset as EG, U-CRP, or Buy-and-Hold do. Fig- ures 5.5(a) and 5.5(b) also show that OLU is able to return more wealth than EG, U-CRP, and Buy-and-Hold with reasonable transaction costs (0.1%, 0.25%, and 0.5%). 51

60 1000 No transaction cost No transaction cost γ=0.001 900 γ=0.001 50 γ=0.0025 γ=0.0025 800 γ=0.005 γ=0.005 700 EG 40 EG U−CRP U−CRP 600 Buy−and−Hold Buy−and−Hold 30 500 Wealth Wealth 400 20 300

10 200 100 0 0 0.04 0.08 0.2 0 α 1 1.4 1.8 2.2 2.6 α 3 3.4 3.8 4.2

(a) Wealth for the NYSE dataset. (b) Wealth for the S&P500 dataset.

Figure 5.5: Transaction cost-adjusted cumulative wealth. In (a) with α = 0.087 and in (b) with α = 2.6, OLU earns more wealth ($50.80 and $901.00 respectively) than the competing algorithms even with transaction costs.

1000

50 800

40 600 30

Wealth 400 20 Wealth

10 200

0 0 1.3 32 0.9 0.3 28 24 3.6 0.08 20 2.4 3.0 16 1.2 1.8 0.5 0.04 12 8 0 0.6 0.1 0 η α η α

(a) Wealth for varying η and α values (b) Wealth for varying η and α values for the for the NYSE dataset. S&P500 dataset.

Figure 5.6: Wealth as a function of η and α values. In (a) and (b) there is a range of parameter values that given significant wealth.

5.5.4 Parameter Sensitivity (η and α)

Figure 5.6 gives us more insight into how the transaction cost-adjusted wealth behaves as a function of η and α for the two datasets. We can see that the cumulative wealth looks like a hill or ridge and that on either sides of the ridge the wealth is small. This particularly occurs when either η or α are too high or too low. Only when both η and α are in relative balance are we able to obtain significant cumulative wealth. Chapter 6

Online Lazy Updates with Group Sparsity

We saw in the previous chapter how to consider the cost of updating one’s allocation of a resource in the algorithmic design. In particular, we saw how to effectively control transaction costs in online portfolio selection which is an important practical aspect mostly ignored in the literature but which is a real concern for traders. Another aspect which has not been considered is how to invest in groups of stock which perform similarly. In this chapter, we1 will show how to design an algorithm which explicitly considers market sectors when investing and how such consideration leads to gains in wealth. The work in this chapter first appeared as a peer-reviewed conference paper [42].

6.1 Introduction

In financial trading, investors often follow a top down approach which usually involves group selection followed by identifying the most profitable stocks within a group. One of the ways investors group stocks is by the type of business. The idea is to put companies in similar sectors together. However, not all sectors can yield profit and not all stocks in a particular sector can be profitable. Moreover, sectors might react differently during different economic conditions [91, 6]. For example, defensive sectors like utilities and consumer staples are robust to economic downturns whereas cyclical

1 The work in this chapter was done in collaboration with Puja Das.

52 53 sectors which include technology, financials, health care, etc., tend to react quickly to fluctuations in the market. We are particularly interested in exploiting any underlying structure amongst the stocks for the problem of online portfolio selection. One aspect which is missing from the existing portfolio selection literature is taking advantage of the group structure that could exist amongst the stocks. In this chapter, we consider such an aspect and focus on using a group sparsity inducing regularizer to identify well performing groups of stocks in an online learning framework. In addition to considering groups of stocks, we also continue to consider transaction costs and use the L1 regularizer introduced in Chapter 5 to induce lazy updates to the portfolio in order to control proportional transaction costs. Overview of Contributions: We make two main contributions. First, we propose a general online lazy updates with group sparsity framework and go on to show that the online portfolio selection with sector information is a special case of such a framework. We pose the problem as a constrained non-smooth convex optimization problem at ev- ery iteration. We propose a novel alternating direction method of multipliers (ADMM) algorithm to efficiently solve such a problem. Using the algorithm, we conduct extensive experiments on two real-world datasets (NYSE and S&P500) and use the Global In- dustry Classification Standard to group the stocks into sectors. Our experiments show that our sparse group lazy portfolios can take advantage of the sector information to beat the market and are scalable with transaction costs. It shows an interesting group switching behavior and could be especially beneficial for individual investors who have expertise in select market sectors and are averse to changing their portfolio. Second, we present an analysis for any convex composite function with lazy updates √ and show that our algorithm has O( T ) regret for a general convex functions and O(log T ) regret for strongly convex functions. Additionally, we prove regret bounds with respect to a shifting solution which has the benefit of hindsight.

6.2 Problem Formulation

We present a general formulation for our online lazy algorithm with group sparsity and go on to show how the portfolio selection problem is a special case of our setting. In an online lazy setting, the optimization proceeds in rounds where in round t the algorithm 54 has to pick a solution from a feasible set, pt ∈ P, such that it is sparse in the number of groups picked and close to the previous solution pt−1. Nature then selects a convex loss function φt : P → R, the algorithm observes the entire loss function φt(p) ∀p ∈ P as feedback, and suffers a loss of φt(pt). Ideally, over T rounds we would like to minimize

T T −1 X X {φt(pt) + λ1Ω(pt)} + λ2 kpt+1 − ptk1 . (6.1) t=1 t=1

In (6.1), Ω(·): P → R is a penalty function which we use to select groups of stocks and can be any group norm which will ensure group sparsity. We adopt the “groupwise”

L2-norm used in group lasso [128, 59] as our regularizer

G X λ1Ω(p) = λ1 wgkp|gk . (6.2) g=1

Let G be a set of groups and ∀g ∈ G, g ⊆ {1, . . . , n}. We define p|g as the vector whose coordinates are equal to those of p for indices in the set g and 0 otherwise. We define weights (wg)g∈G where each wg ≥ 0 and k · k is the Euclidean norm. To introduce group sparsity, it is also possible to impose other joint regularization on the weight, e.g., the L(1,∞)-norm [106]. We consider the case where the groups are disjoint, i.e., G is separable over {1, ··· , n}, however our framework and algorithm can be extended to overlapping group lasso [74]. The L1 penalty term in (6.1) ensures lazy updates to pt. Similar to what we considered in Chapter 5, absolute minimization of (6.1) is not possible because we do not know the sequence of φt a priori. If the φts are known, (6.1) reduces to a batch optimization problem: a special case is the fused group lasso when

φt is quadratic [59, 120] or TV regularization [108]. Alternatively, over T iterations we again intend to select a sequence of pt such that the fixed regret is sublinear in T ,

T T X X ∗ RT = ft(pt) − min ft(p ) = o(T ) (6.3) p∗∈P t=1 t=1

∗ where ft(p) = φt(p) + λ1Ω(p) + λ2kp − pt−1k1 is non-smooth and p is the minimizer PT of t=1 ft in hindsight. Additionally, we again examine the case where the comparator class can also change ∗ ∗ over time. In particular, we consider the sequence {p1, ··· , pT } which has the power 55 of hindsight. Then, over T iterations we ensure that the following shifting regret is sublinear in T T T X X ∗ ∗ ∗ ft(pt) − min ft(pt ) = o(T ) + c size(hp1, ··· , pT i) , (6.4) p∗,··· ,p∗ t=1 1 T t=1 ∗ ∗ where size(hp1, ··· , pT i) measures the amount of shifting that occurs in the best se- quence of solutions in hindsight and c is a constant. Online portfolio selection with group sparsity can now be viewed as a special case of the above setting where φt(p) = > − log(p xt) and the L1 penalty term on the difference of two consecutive portfolios measures the fraction of wealth traded. The parameters λ1 controls how many groups are selected (setting λ1 = 0 reduces to the OLU problem considered in Chapter 5) and

λ2 controls the amount that can be traded every day. To show a sublinear regret bound, we solve a linearized version of the problem as we did in Chapter 5 by taking a first-order Taylor expansion of φt at pt with a proximal term 1 p = argmin h∇φ (p ), pi + λ Ω(p) + λ kp − p k + kp − p k2 (6.5) t+1 p∈P t t 1 2 t 1 2η t 2 where we use the squared Euclidean distance as the proximal term and ignore the constants since they do not change which argument optimizes the objective function.

6.3 Algorithm

We now present our Online Lazy Updates with Group Sparsity (OLU-GS) algorithm. In the sequel, we show that using the solutions generated by OLU-GS, we can achieve sublinear regret for the fixed (6.3) and shifting case (6.4). At the beginning of day t + 1, we find a new solution pt+1 by minimizing (6.5). The objective function in (6.5) is composite with smooth and non-smooth terms with the probability simplex as a constraint set. Although there is literature on solving composite functions [48, 127], composite functions with linear constraints have not been adequately investigated. We propose an Alternating Direction Method of Multipliers (ADMM) [19] based efficient primal-dual algorithm to solve (6.5). We rewrite (6.5) in ADMM form by introducing auxiliary variables y and z

1 2 argmin h∇φt(pt), pi + λ1Ω(y) + λ2kzk1 + kp − ptk2 . (6.6) p∈∆n,p=y,p−pt=z 2η 56 Using variable splitting, we write the augmented lagrangian as,

L(p, y, z, w, v) = h∇φt(pt), pi + λ1Ω(y) + λ2kzk1 1 β β + kp − p k2 + kp − y + wk2 + kp − p − z + vk2 2η t 2 2 2 2 t 2 where w and v are the scaled dual variables and p ∈ ∆n. Splitting the variables as we do in (6.6) has two advantages. Firstly, we will show there is a closed form solution for each update. Secondly, the updates for y and z can be done in parallel and the same is true for the scaled dual variables w and v. ADMM consists of the following iterations for solving pt+1,

(k+1) 1 2 β (k) (k) 2 pt+1 = argmin h∇φt(pt), pi + kp − ptk2 + kp − y + w k2 (6.7) p∈∆n 2η 2 β + kp − p − z(k) + v(k)k2 2 t 2

(k+1) β (k+1) (k) 2 y = argmin λ1Ω(y) + kpt+1 − y + w k2 (6.8) y 2

(k+1) β (k+1) (k) 2 z = argmin λ2kzk1+ kpt+1 − pt − z + v k2 (6.9) z 2

(k+1) (k) (k+1) (k+1) w = w + (pt+1 − y ) (6.10)

(k+1) (k) (k+1) (k+1) v = v + (pt+1 − pt − z ) . (6.11) p-update: We take the derivative of (6.7) with respect to p, set it to zero, and solve for p to get a closed form update. The projection Q is carried out as in [47]. p∈∆n y-update: We can rewrite (6.8) as

(k+1) 1 (k+1) (k) 2 λ1 y = argmin kp + w − yk2 + Ω(y) . (6.17) y 2 β

(6.17) is separable in every group when Ω(·) is a group lasso penalty with L2 norm and G is a partitioning of {1, ··· , n} and the solution is a generalization of the soft-thresholding operator to groups of variables [75]:  ˜ 0 if kq|gk2 ≤ λ ∀g ∈ G, y|g = ˜ (6.18) kq|gk2−λ  q|g otherwise kq|gk2 57 Algorithm 3 OLU-GS Algorithm with ADMM

1: Input pt, xt, ∇φt(pt), G, λ1, λ2, η, β 2: Initialize p, y, z, w, v ∈ 0n, k = 0 1+ηβ ˆ ηβ η 3: Seta ˆ = 1+2ηβ , b = 1+2ηβ , andc ˆ = 1+2ηβ 4: ADMM iterations

k+1 Y n ˆ (k) (k) (k) (k) o pt+1 = aˆpt − cˆ∇φt(pt) + b(y + z − w − v ) (6.12) p∈∆n (k+1) (k+1) (k) y|g = Sλ1/β (p|g − pt|g + w|g ), ∀g ∈ G (6.13) (k+1) k+1 k z = Sλ2/β (pt+1 − pt + v ) (6.14) w(k+1) = w(k) + (p(k+1) − y(k+1) + w(k)) (6.15)

(k+1) (k) (k+1) (k+1) (k) v = v + (p − pt − z + v ) (6.16)

Q where is the projection to the simplex and Sρ is the shrinkage operator. ∆n 5: Continue until Stopping Criteria is satisfied

Algorithm 4 Portfolio Selection with Group Sparsity 1: Input G, λ1, λ2, η, β; Transaction cost γ 1 γ 2: Initialize p1,g = |G| , g = 1,..., |G|; p0 = p1; S0 = 1 3: For t = 1,...,T

4: Receive xt, the vector of price relatives γ γ > γ 5: Compute cumulative wealth: St = St−1 × (pt xt) − γ × St−1 × kpt − pt−1k1 xt 6: Update portfolio: pt+1 = OLU-GS(pt, xt, − > , G, λ1, λ2, η, β) p xt 7: end for

(k+1) (k) ˜ λ1 where q = p + w , λ = β and y|g is a vector of length n × 1 whose coordinates are equal to those of y for indices in the set g and 0 otherwise. z-update: We obtain a closed form solution for zk+1 by using the soft-thresholding operator Sρ(a) [19]. w and v updates: The updates for w(k+1) and v(k+1) are already in closed form. We iterate over the updates until convergence according to the stopping criteria in [19]. Algorithm 3 summarizes the ADMM updates for OLU-GS. Algorithm 4 outlines γ our algorithm for computing the transaction cost-adjusted wealth ST , where γ is a proportional transaction cost [41]. 58 6.4 Regret Analysis

We consider updates of the following form:

pt+1 := argmin ηh∇φt(pt), pi + ηr(p) + ηλ2kp − ptk1 + dψ(p, pt) , (6.19) p∈P where r(·) is any non-smooth regularizer and dψ is a Bregman divergence. Note, that our analysis is different from [48], because of the presence of the kp − ptk1 term in our updates and is different from the update we considered in Section 5.4 because of the additional r(·) term. Our OLU-GS updates in Section 6.3 are a special case of (6.19) 1 2 by setting r(p) = λ1Ω(p) and dψ(p, pt) = 2 kp − ptk2. In this section, we prove fixed regret bounds for two cases: when the (1) φts are general convex functions and (2) φts are strongly convex. Moreover, we prove shifting bounds, where the comparator class can itself change over time when the φts are general convex functions. All proofs for the following lemmas and theorems are in Appendix B.1.

6.4.1 Fixed Regret

We first focus on the case where the comparator class is fixed, i.e., p∗ is the minimizer PT of φt in hindsight as it incurs zero L1 penalty in every iteration and we prove fixed t=1 √ regret bounds. We show that for general convex functions the fixed regret is O( T ) while for strongly convex functions the fixed regret is O(log T ).

General Convex Functions

We assume that φt are general convex functions with bounded (sub)gradients, i.e., for anyg ˆ ∈ ∂φt(p) we have kgˆk ≤ G.

Lemma 3 Let the sequence of {pt} be defined by the update in (6.19), with poten- tially time-varying ηt. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, i.e., λ 2 ∗ dψ(p, ˆp) ≥ 2 kp − ˆpk2 and let ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Then, for any p ∈ P,

∗ ∗ ηt[φt(pt) + λ1r(pt+1) + λ2kpt+1 − ptk1 − φt(p ) − λ2r(p )] (6.20) η2 ≤ d (p∗, p ) − d (p∗, p ) + η λ L + t ||∇φ (p )||2 . ψ t ψ t+1 t 2 2λ t t 59 Based on the above result, we obtain the following fixed regret bound

Theorem 4 Let the sequence of {pt} be defined by (6.19). Let φt be a Lipschitz con- tinuous function for which k∇φ (p )k2 ≤ G. Then, by choosing η ∝ √1 and λ ∝ √1 , t t 2 T 2 T we have

T √ X ∗ ∗ [φt(pt) + r(pt) + λ2kpt+1 − ptk1 − φ(p ) − r(p )] ≤ O( T ) . (6.21) t=1

Strongly Convex Functions

We assume that φt are all β-strongly convex functions. For any (p, pt), φt(p) ≥ φt(pt)+ β 2 hp − pt, ∇φt(pt)i + 2 kp − ptk .

Lemma 4 Let the sequence of {pt} be defined by the update in (6.19) with potentially time-varying ηt and λ2. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||2, i.e. λ 2 dψ(p, ˆp) ≥ 2 kp − ˆpk2 and ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Assuming φt are all β-strongly β ∗ convex, for any λ2 < 4 and any p ∈ P, we have

∗ ∗ ηt[φt(pt) − φt(p ) + λ1r(pt+1) − λ1r(p ) + λ2kpt+1 − ptk1] η2 β  (6.22) ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 − η − 2λ kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 2 t

Theorem 5 Let the sequence of {pt} be defined by (6.19). Let φt be all β-strongly 2 1 convex and k∇φt(pt)k2 ≤ G. Then, for any λ2 < β/4, choosing ηt = κt where κ ∈ 0 1 0 2 (0, β − λ2] and with dψ(p, p ) = 2 kp − p k2, we have

T X ∗ ∗ [φt(pt) + r(pt) + λ2kpt+1 − ptk1 − φ(p ) − r(p )] ≤ O(log T ) . (6.23) t=1 60 6.4.2 Shifting Regret

We now focus on the case where the comparator class can also shift or change over ∗ ∗ time, i.e., time-varying sequences {p1, ··· , pT } is used as the comparator class on the PT cumulative loss t=1 ft. Such a time varying sequence will incur cumulative non-zero shifting penalty penalty of the form

T ∗ ∗ X ∗ ∗ shiftq(p1,..., pT ) = kpt − pt+1kq , (6.24) t=1 for some suitable q-norm for q ≥ 1. The regret bounds in this section are in terms of such shifting penalties. A specific case of interest is when the comparator class consists of constant shifting penalties, i.e.,

∗ ∗ shiftq(p1,..., pT ) ≤ c . (6.25)

We show that for this important special case, the shifting regret bounds are of the same ∗ ∗ order as the regret bounds with a fixed comparator, i.e., pt = p , as discussed earlier.

General Convex Functions

We assume that φt are general convex functions with bounded (sub)gradients, i.e., for anyg ˆ ∈ ∂φt(p) we have kgˆk ≤ G.

∗ ∗ Theorem 6 Let {p1, ··· , pT } be any sequence of portfolios serving as a comparator in (6.19). Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, i.e., dψ(p, ˆp) ≥ λ kp − ˆpk2 and ||p − ˆp|| ≤ L, ∀p, ˆp ∈ P. For η = η = √c1 and λ = √c2 for c , c > 0, 2 2 1 t T 2 T 1 2 1 1 r + q = 1, and ||∇ψ(pt)||r ≤ ζ, we have

T T X ∗ ∗ X ∗ ∗ [φt(pt) + λ1r(pt) − φt(pt ) − λ1r(p )] + [λ2kpt+1 − ptk1 − λ2kpt+1 − pt k1] t=1 t=1

(6.26) √ ( T ) √ T X ≤ O( T )+ d (p∗, p ) − d (p∗ , p ) + ψ(p∗ ) − ψ(p∗)+ζ kp∗−p∗ k . c ψ 1 1 ψ T +1 T +1 T +1 1 t t+1 q 1 t=1

0 PT ∗ ∗ Assuming dψ(p, p ) ≤ c3, ψ(p) ≤ c4, and the shifting penalty kp − p kq ≤ c , √ t=1 t t+1 the shifting regret is O( T ), which is the same order as the fixed regret. 61 6.5 Experiments and Results

The experiments were conducted on data taken from the NYSE and S&P 500 stock markets (refer to Section 4.3 for details on the two datasets). We used the Global Industry Classification Standard (GICS) to group the stocks in the datasets into their designated sectors. This resulted in 8 sectors and 30 stocks in the NYSE dataset and 9 sectors and 243 stocks in the S&P500 dataset (some sectors were removed if they were empty and some stocks were removed if they were the only member in a sector). Table 6.1 shows the sectors represented in the two datasets, and a couple representative companies from each sector.

Sector Example Companies Consumer Discretionary Nike Inc., Target Corp. Consumer Staples Costco Co., Beam Inc. Energy Chevron Corp., Noble Corp. Financials Equifax Inc., AFLAC Inc. Health Care Cerner, Pfizer Inc. Industrials Raytheon Co., 3M Co. Information Tech Apple Inc., Dell Inc. Materials Alcoa Inc., Ecolab Inc. Utilities AGL Resources, AES Corp.

Table 6.1: Overview of GICS sectors used in our datasets.

6.5.1 Methodology and Parameter Setting

In all experiments we started with $1 as our initial investment and an initial portfolio uniformly distributed over the groups to avoid group bias. We use OLU-GS to obtain our portfolios sequentially and compute the transaction cost-adjusted wealth for each day. The parameters consist of λ1: weight on group sparsity norm, λ2: lazy updates weight, η: weight on the L2 norm, and β: the parameter for the augmentation term. For all our experiments, we set β = 2 which we found to give reasonable accuracy and use group lasso for group sparsity. Since the two datasets are very different in nature (stock composition and duration), 62

6000 5000 NYSE SP500 4000 4000 3000 λ =1e−06, η=1e−06 2 2000 λ =1e−06, η=0.01 2000 2 λ =0.01, η=1e−06 2 Total Group Lasso 1000 λ =0.01, η=0.01 0 2 1e−06 0.001 0.2 0.5 1 0 λ 1e−06 0.0001 λ 0.01 1 1 Number of Active Group Changes 1

(a) Total Group Lasso. (b) Number Active Group Changes.

Figure 6.1: As λ1 increases the (a) total group lasso value and (b) number of active group changes decrease.

we experimented extensively with a values of λ1, λ2, and η values ranging in [1e−9, 1] to observe their effect on group sparsity and lazy updates to our portfolio. Moreover, we chose a reasonable range of γ values ranging in [0%, 2%] to compute the proportional transaction costs incurred due to the portfolio update every day. We have illustrated some of our results with representative plots from either the NYSE or S&P500 dataset. We use the wealth obtained (without transaction costs) from the EG algorithm with experimentally tuned parameters, a Buy-and-Hold strategy, and the best single stock as benchmarks for our experiments with initial investments of $1. EG has been shown to outperform a uniform constantly rebalanced portfolio [69, 39]. For the Buy-and-Hold case we start with a uniformly distributed portfolio and do a hold on the positions thereafter (i.e. no trades). For the best single stock case we observe how the market performs and select the stock that has accumulated the most wealth at the end of the period. Note, in a real world situation, this strategy is infeasible since it is not possible to know the best stock a priori.

6.5.2 Effect of λ1 for Group Sparsity (Ω(p))

The regularization parameter λ1 for the group lasso term (Ω(p)) is varied from [1e−9, 1] to obtain different levels of group sparsity. The value of λ1 has a strong effect on (a) the total group lasso penalty value, (b) the number of active groups, and (c) active groups. 63

λ λ λ 4 =0.001 =0.001, =0.01 10 1 1 2 6 2 10 4 2 0 10 0 0.25 0.5 0.75 1 ‘63 ‘65 ‘67 ‘69 ‘71 ‘73 ‘75 ‘77 ‘79 ‘81 ‘83 ‘85 λ =0.1 λ =0.1, λ =0.01 4 1 1 2 10 6 2 10 4 2 0 10 0 0.25 0.5 0.75 1 ‘63 ‘65 ‘67 ‘69 ‘71 ‘73 ‘75 ‘77 ‘79 ‘81 ‘83 ‘85 λ =1 λ =1, λ =0.01 4 1 1 2 10 Number of Active Groups Frequency (Number of Days) 6 2 10 4 2 0 10 0 0.25 0.5 0.75 1 ‘63 ‘65 ‘67 ‘69 ‘71 ‘73 ‘75 ‘77 ‘79 ‘81 ‘83 ‘85

(a) Group Lasso per day. (b) Number of Active Groups.

Figure 6.2: As λ1 increases the number of days with high group lasso value and the number of active groups decrease.

(a) Total Group Lasso penalty: Figure 6.1(a) plots the value of the total group lasso PT penalty ( t=1 Ω(pt)) as we increase λ1, keeping λ2 and η fixed. For both the NYSE PT and S&P500 datasets, we observe that t=1 Ω(pt) decreases as we increase λ1, which is in conformance with our objective. Since the two datasets are different in terms of the total number of stocks and the number of stocks composing each sector, Figure 6.1(a) specifically illustrates how to choose λ1 to attain a desired level of sparsity for each of the datasets. Figure 6.2(a) plots a histogram of the total per day group lasso penalty with increasing λ1 values for the S&P500. It is fairly evident that there is a decrease in the number of days with high group lasso penalty as λ1 increases. (b) Active Groups: We compute the active groups each day by selecting the groups in which the majority (80%) of the wealth is invested. Figure 6.2(b) plots the number of active groups per day for the NYSE dataset. With λ1 = 1e−3, OLU-GS picks up to 4 groups on a particular day. For a higher value of λ1 = 1 a maximum of 2 groups are selected to invest in. In particular, the two sectors picked are Basic Materials and Consumer Discretionary. 64

80 No transaction cost 70 γ=1e−05 γ=5e−05 60 γ=0.0001 γ=0.0005 60 50 EG Buy−and−Hold 40 Best Stock (MO) 40

Wealth 1e−09 30 Wealth 20 1e−06 20 0 0.001 λ 2 10 1e−06 0.001 0.2 0.5 1 1 0 λ 0.01 0.05 0.09 0.13 0.17 λ 0.21 0.25 0.29 0.6 1 1

(a) NYSE transaction cost-adjusted wealth. (b) NYSE Wealth as a function of λ1 and λ2.

Figure 6.3: (a) OLU-GS returns more than competing algorithms even with transaction costs. (b) There exists a parameter range that gives good wealth performance.

(c) Active Groups Changes: Figure 6.1(b) plots the total number of times the active groups change for the NYSE dataset (over 22 years) as λ1 increases. We consider an active group change as anytime the group composition changes. The individual line plots indicate different values of λ2 and η. For λ1 between 1e−6 to 1e−2: with low values for λ2, the total number of changes in the active groups are quite high but for a higher value of λ2 = 1e−2 we see a decrease in the number of active group changes illustrating the portfolio laziness. With values of λ1 ≥ 1e − 2 we see a dramatic drop in the number of active group changes and high λ2 values reemphasize this behavior.

6.5.3 Wealth and Group Sparsity

To evaluate the practical application of our proposed algorithm, we now analyze its performance when calculating the transaction cost-adjusted cumulative wealth. Fig- ure 6.3(a) shows how the choice of different λ1 values affect the transaction cost-adjusted cumulative wealth for the NYSE dataset (for a fixed λ2 and η value). Figure 6.3(b) demonstrates that there exists a combination of λ1 and λ2 values which make an opti- mal choice between group sparsity and lazy updates. EG, Buy-and-Hold, Best Single Stock: We compared the total wealth without transaction costs of OLU-GS with that of EG, a Buy-and-Hold strategy, and the best performing single stock. These strategies are plotted as horizontal lines and we can see that for the NYSE dataset EG returns $20.89, Buy-and-Hold returns $20.88, and the best single stock, Phillip Morris (MO), returns $54.14. In comparison, OLU-GS returns $71.18 without transaction costs. OLU-GS returns over 3x as much wealth for 65

200 2000 NYSE Index SP500 Index 100 1000

0 0 ‘63 ‘65 ‘67 ‘69 ‘71 ‘73 ‘75 ‘77 ‘79 ‘81 ‘83 ‘85 90 92 94 96 98 00 02 04 06 08 10 0.5 Consumer Staples Utilities 0.5 Information Tech Weight Weight 0 0 ‘63 ‘65 ‘67 ‘69 ‘71 ‘73 ‘75 ‘77 ‘79 ‘81 ‘83 ‘85 90 92 94 96 98 00 02 04 06 08 10

(a) Consumer staples picked during bear (b) Utilities/Info Tech selected during markets. bear/bull markets.

Figure 6.4: Picking non-cyclic sectors, sectors that tend to perform well during economic downturns, during the 1970s and dot-com bear markets and cyclic sectors, sectors that perform well during economic booms, during the dot-com bull market. the NYSE dataset as EG or Buy-and-Hold do and about $15 more than the best stock. Figure 6.3(a) also shows that OLU-GS is able to return more wealth than EG and Buy-and-Hold with reasonable transaction costs (0.001%, 0.005%, and 0.01%).

6.5.4 Switching Sectors

We desire that OLU-GS is able to identify the best sectors automatically. We illustrate the strength of OLU-GS in selecting the best sectors with two examples. A recur- ring trend that we observe from our experiments with both the NYSE and S&P500 datasets is that OLU-GS selects stocks in Consumer Staples during the bear markets. Figure 6.4(a) clearly shows that OLU-GS selects and invests in this defensive sector during the historical bear markets of 1969-1971 and 1975-1977. Another example of a defensive or non-cyclic sector is Utilities. Figure 6.4(b) shows that the weight on the Utilities sector sees a considerable increase during the dot-com crash. This is interesting because unlike other areas of the economy, even during bear markets, the demand for Consumer Staples and Utilities do not slow down. These sectors consist of stocks which are defensive in nature and usually outperform the S&P500 Index during bearish mar- kets and under-perform during bullish markets. Sectors like Information Technology and Financials comprise of cyclical stocks which are sensitive to market movements and can take advantage of the bullish markets. In Figure 6.4(b), the Information Technology sector is picked up during the bullish markets which preceded the dot-com bubble. Chapter 7

Online Structured Diversification

So far, we have designed and analyzed algorithms which control transaction costs in Chapter 5 and take advantage of market sectors in Chapter 6. However, one key aspect of many resource allocation problems especially in algorithmic trading is risk and ways to control risk. In this chapter, we introduce a framework and algorithm which alleviates various measure of risk through diversifying one’s resource allocation. The work in this chapter first appeared as a peer-reviewed conference paper [76].

7.1 Introduction

In sequential decision making problems and specifically resource allocation problems, one must consider some notion of risk, and suitable ways to alleviate risk. One common approach in financial trading is to diversify the asset portfolio. Specifically in portfolio selection, putting all of one’s money in one or a few stocks is considered risky, since those few stocks may not perform well over time. In several such settings, the risk is often structured and calls for structured diversification strategies. For example, in portfolio selection, one considers investing in market sectors, such as energy, technology, utilities, etc., which gives a structure beyond individual stocks and may lead to more returns as we saw in Chapter 6. Often, stocks within a sector move together in response to external influences so that investing in just one sector can be risky. A structurally diverse portfolio would invest in multiple sectors to alleviate risk. In addition to the literature on portfolio selection we discussed in Section 4.2, there is

66 67 also a large literature on general resource allocation where several papers pose the prob- lem as an auction [45], mechanism design problem [8], Markov Decision Process [46, 99], matching problem [62], and planning problem [63, 126]. However, few have considered risk when computing resource allocations in an online fashion [50, 71, 72]. Addition- ally, not much work [40] has been done to consider risk in the common online convex optimization framework used for most online portfolio selection algorithms. Overview of Contributions: We make two main contributions. First, we intro- duce a novel formulation for online resource allocation with structured diversification (ORASD). The formulation considers groups over the assets of interest, such as stocks, where the groups can be of different sizes and they can overlap. We also outline a way in which these groups can be inferred from the data, e.g., based on their correlation struc- tures. The key novel component of our formulation is the L(∞,1) group norm, which is rather different from L(1,p) type group norms typically used for overlapping group Lasso and related problems. We pose the ORASD problem as a suitable online convex optimization problem with the constraint that the L(∞,1) group norm of the resource allocation vector has to be bounded within a pre-specified limit. Such a constraint en- sures that no single group gets a large share of the resource. Further, unlike overlapping group Lasso, there is no sparsity restriction on the resource allocation vector, so that one can invest in all assets if that helps the resource allocation objective. Second, we instantiate the problem in the context of portfolio selection and propose an efficient algorithm based on the alternating direction method of multipliers (ADMM). We illustrate the effectiveness through extensive experiments on two benchmark datasets and several baselines from the existing literature.

7.2 Problem Formulation

In this section, we introduce a framework for resource allocation with structured diver- sity, and consider the online resource allocation problem under such structured diversity. In Section 7.3, we consider the online portfolio selection problem as an instance of such diversified resource allocation. 68

7.2.1 Structured Diversity with L(∞,1) Norm

We consider an online resource allocation problem over n objects, where at each round t the goal is find a probability distribution pt ∈ ∆n, the n-dimensional probability simplex, which determines how to split up a resource over the n objects such that a certain (convex) objective ft : ∆n → R is minimized. For example, in the context of online portfolio selection, the n objects can be different stocks, and pt is an investment strategy on day t, i.e., what fraction of one’s money should one invest in a stock. The basic idea of diversification is to avoid putting all of the resources on one asset. A simple way to accomplish this is to put a cap or upper bound on the amount of resources which can be put on any asset, i.e., p(i) ≤ κ for some κ ∈ [0, 1]. A potential issue with such an approach is that there may be structural dependencies between the assets. For example, in the context of portfolio selection, selling some Google stocks to buy Apple stocks may not accomplish the goals of diversification when the entire tech sector goes down. In this example, Google and Apple as assets can be considered structurally related, both being part of the tech sector. The goal of structured diversification is to develop strategies which explicitly consider such structurally related groups and diversify across them. For the development, we assume knowledge of such structurally related groups, and outline approaches for inferring such groups directly from the data in Section 7.3 in the context of portfolio selection. Let G = {g1, . . . , gm} be the set of groups, where gi is the set of indices for the ni assets in the group. The groups can be of different sizes,

|gi| = ni, and the groups may overlap, i.e., gi ∩ gj 6= ∅.

Given such group structure, we introduce the L(∞,1) group norm which plays a key n role in structured diversity. In general, for any x ∈ R and a set of (possibly overlapping, different sized) groups G, the L(∞,1) group norm is defined as:

G  > kxk = kxg1 k1 ... kxgm k1 (7.1) (∞,1) ∞ where xgi is a vector of length n with value equal to x for indices in gi and 0 otherwise.

The L(∞,1) group norm is determined by the largest L1 norm over all groups in G.

This norm is rather different in character compared to the L(1,2) norm [75], used in the context of overlapping group Lasso and multi-task learning, as well as the L(1,∞) group norm [96]. The L(1,p) group norms accomplish sparsity over the groups, i.e., certain groups can become all zeros. Such a structure is not desirable in the context 69 of diversified portfolio selection, since one wants to distribute the resource to multiple G groups for diversification. In fact, for kxk(∞,1) ≤ κ, each individual group gi has the flexibility of increasing their L1 norm to κ without changing the norm ball (penalty).

Similarly, one could also consider any L(∞,p) group norm, which looks at the maximum over Lp norms over groups. In the context of resource allocation, since the key object of interest is a probability distribution, we work with the L(∞,1) group norm and use it as a constraint in our problem framework.

7.2.2 Online Resource Allocation Framework

We continue to consider the online learning framework in this chapter. Specifically, the problem proceeds in rounds where in round t the algorithm has to pick a solution from a feasible set, pt ∈ ∆n, Nature selects a convex loss functions φt(·), the algorithm observes the loss function, and incurs a loss of φt(pt). In addition to being feasible, we add two additional requirements on pt: (1) pt needs to stay close to pt−1 since we do not want the resource allocation to change drastically in every time step, which may G have a cost associated with it as in the previous two chapters, and (2) kptk(∞,1) ≤ κ for κ ∈ [0, 1] so that structural diversity is maintained over time. Thus, the sub-problem at time t takes the form

G min η φt(p) + Ω(p, pt−1) s.t. kpk(∞,1) ≤ κ , (7.2) p∈∆n where φt denotes a suitable loss function and Ω(·, ·) is a convex penalty function which measures the change in pt. Again, over T rounds we would like to minimize the con- strained cumulative loss T X G η φt(pt) + Ω(pt, pt−1) s.t. kptk(∞,1) ≤ κ . (7.3) t=1 However, in the online setting, absolute minimization of (7.3) is not feasible since we do not know the sequence of φt a priori. Again, we instead consider the fixed regret such that over T rounds we compute a sequence of pt such that the regret is sublinear in T , i.e.,

T T X X ∗ RT = ft(pt) − min ft(p ) = o(T ) (7.4) p∗ t=1 t=1 70 where ft(p) = ηφt(p) + Ω(p, pt−1). The regret is measured with respect to the best fixed minimizer in hindsight p∗. As in the previous two chapters, we consider solving a linearized version of the problem obtained by a first-order Taylor expansion of ft at pt along with a proximal term, so that

pt+1 := argmin ηhp, ∇φt(pt)i + λΩ(p, pt) + d(p, pt) , (7.5) p∈∆n G kpk(∞,1)≤κ

1 2 where d(p, pt) = 2 kp − ptk2 is the proximal term and η, λ ≥ 0 are tunable parameters.

7.3 Algorithm

Given the resource allocation with structured diversity framework, we can now view the online portfolio selection problem (Chapter 4) as a special case of our framework > by setting φt(pt) = − log(pt xt), and Ω = kp − ptk1. As discussed in the previous two chapters, the L1 penalty term on the difference of two consecutive portfolios measures the fraction of wealth traded and encourages lazy updates to the portfolio to limit transaction costs. The parameter λ controls the amount that can be traded every day.

The L(∞,1) penalty term forces the portfolio to spread out the investment amongst the groups. The level of diversification depends on the value of κ. We propose a primal-dual algorithm for solving (7.5) in this context. Additionally, since we allow overlapping groups, we will solve (7.5) via lifting by adding a consensus constraint pgi = Sgi p ∀i where pgi is a vector of length n with value equal to p for indices in gi and 0 otherwise and Sgi is a known diagonal matrix with Sgi (j, j) = 1 if element j is in group gi and 0 otherwise. The online portfolio selection with structured diversification problem is now

1 2 pt+1 := argmin ηhp, ∇φt(pt)i + kp − ptk2 + λkp − ptk1 . (7.6) p∈∆n 2 G kpk(∞,1)≤κ Sgi p=pgi ∀i

(7.6) consists of smooth and non-smooth terms in the objective with L(∞,1) and linear constraints. We can use ADMM to solve this problem by introducing auxiliary variable 71 Algorithm 5 ORASD Algorithm with ADMM

1: Input pt, xt, Sg1,...,gm , η, λ, β n 2: Initialize p, pgi , z, ui, v ∈ 0 , k = 0 −1 3: Set aˆ = ((1 + β)I + β (Sg1 + ··· + Sgm )) 4: ADMM iterations ! k+1 Y ηxt  k k k k k k p = aˆ > + aˆ(1 + β)pt + aˆβ Sg1 (pg1 − u1 ) + ··· + Sgm (pgm − um) + z − v (7.8) pt xt ∆n

k+1 Y  k+1 k pgi = Sgi p + ui ∀i (7.9) k·k1≤κ k+1 k+1 k z = Sλ/β (p − pt + v ) (7.10) k+1 k k+1 k+1 ui = ui + Sgi p − pgi ∀i (7.11) k+1 k k+1 k+1 v = v + p − pt − z (7.12)

Q where is the projection to the simplex and Sρ is the shrinkage operator. ∆n 5: Continue until Stopping Criteria is satisfied

z = p − pt and moving the inequality constraint into the objective function if we let G h(pg ) = 1 where kpk ≤ κ ≡ kpg k1 ≤ κ ∀i i (kpgi k1≤κ) (∞,1) i m 1 X p := argmin ηhp, ∇φ (p )i + kp − p k2 + λkzk + h(p ). (7.7) t+1 t t 2 t 2 1 gi p∈∆n i=1 Sgi p=pgi ∀i p−pt=z The ADMM formulation in (7.7) naturally lets us decouple the non-smooth terms from the smooth terms, which is computationally advantageous. The augmented La- grangian for (7.7) is

m 1 X L(p, p , z, u , v) = ηhp, ∇φ (p )i + kp − p k2 + λkzk + h(p ) (7.13) g1...gm 1...m t t 2 t 2 1 gi i=1 m β X β + kS p − p + u k2 + kp − p − z + vk2 2 gi gi i 2 2 t 2 i=1 where u and v are scaled dual variables. ADMM consists of the following iterations m 1 β X pk+1 := argmin ηhp, ∇φ (p )i + kp − p k2 + kS p − pk + ukk2 (7.14) t t 2 t 2 2 gi gi i 2 p∈∆n i=1 β + kp − p − zk + vkk2 2 t 2 72 Algorithm 6 Diversified Online Portfolio Selection 1: Input η, λ, β; Transaction cost γ, Days lag numlag 1 γ 2: Initialize p0(i) = n i = 1, . . . , n, S0 = $1 3: For t = 1,...,T

4: Receive xt, the vector of price relatives γ γ >  5: Compute wealth: St = St−1 pt xt − γkpt − pt−1k1

6: If t ≤ numlag 1 7: pt+1(i) = n i = 1, . . . , n (uniform portfolio) 8: Else 9: Compute groups: G

10: Update: pt+1 = ORASD(pt, xt, G, η, λ, β) 11: end for

k+1 β k+1 k 2 pg := argminp h(pgi ) + kSgi p − pgi + ui k2 ∀i (7.15) i gi 2 β zk+1 := argmin λkzk + kpk+1 − p − z + vkk2 (7.16) z 1 2 t 2 k+1 : k k+1 k+1 ui = ui + Sgi p − pgi ∀i (7.17) k+1 k k+1 k+1 v := v + p − pt − z . (7.18) p-update: We solve for p be taking the gradient of (7.14) with respect to p, setting it to zero, and solving to get the closed form update of p as

Y  ηxt  p = aˆ +aˆ(1+β)p +aˆβS (pk −uk)+···+S (pk −uk )+zk −vk (7.19) > t g1 g1 1 gm gm m p xt ∆n t for aˆ= ((1+β)I+β (S +...+S ))−1 and Q is a projection to ∆ following [47]. g1 gm ∆n n pgi -updates: We solve for each pgi in parallel by projecting to the L11 ball of radius κ

Y  k+1 k pgi = Sgi p + ui . (7.20) k·k1≤κ z-update: We solve for z by using the soft-thresholding operator Sρ(a) [75]

k+1 k z = Sλ/β(p − pt + v ) . (7.21) ui and v updates: The ui and v updates are in closed form and the uis can be computed in parallel. 73 Algorithm 5 shows the complete ADMM based algorithm with the closed form up- dates. The stopping criteria for the ORASD algorithm is based on the primal and dual residuals from [19]. Algorithm 6 is our diversified online portfolio selection algorithm.

It uses the ORASD algorithm to compute pt+1. It takes in an additional parameter γ γ which is a fixed percentage charged for the total amount of transaction every day. St is the transaction cost-adjusted cumulative wealth gain at the end of t days.

7.4 Regret Bound

We sequentially invest with the diverse portfolios p1, ··· , pT obtained from Algorithm 6 > and on day t suffer a loss ft(pt) = ηφt + λkpt − pt−1k1, where φt = − log(pt xt). Our goal is to minimize the regret with respect to the best fixed portfolio p∗ in hindsight.

∗ PT Theorem 7 Let p ∈ P be the fixed portfolio obtained from min φt(p). For η = p t=1 √1 , λ = 1 , and k∇φ (p )k2 ≤ G, the regret can be bounded as, T t t t 2

T T T √ X X X ∗ φt(pt) + kpt − pt−1k1 − φt(p ) ≤ O( T ), (7.22) t=1 t=2 t=1 where φt is a Lipschitz continuous convex function and the sequence pt and the fixed ∗ G optimal portfolio p all lie in P := {p | p ∈ ∆n, kpk(∞,1) ≤ κ}.

The theorem follows directly from Theorem 1.

7.5 Experiments and Results

The experiments were conducted on data taken from the NYSE and S&P 500 datasets (refer to Section 4.3 for more details).

7.5.1 Methodology and Parameter Setting

In all our experiments we start with $1 as our initial investment and an initial portfolio which is uniformly distributed over all the stocks. We use Algorithm 6 to obtain our portfolios sequentially and compute the transaction cost-adjusted wealth each day. 74 For our experiments, we utilize the follow method for computing the groups. For the previous numlag days, we compute the correlation graph C over the stocks. To compute structurally related groups, we set a correlation threshold  on the edges of the graph and if an edge has weight <  then it is removed resulting in C. For each stock in

C, we construct a group around it by including its neighbors within k hops away. For example, if k = 1 then we construct group gi by including stock si and only its directly connected neighbors in the group. As such, there will be exactly n groups but the size of each group may vary. For our experiments, we allow the groups to change each day. Since the two datasets are very different in nature (stock composition and dura- tion), we experimented with various parameter values. We found stable behavior across the following range of parameters: numlag ∈ {5, 10, 15},  = {0.90, 0.95}, and β = 2. Additionally, we experimented extensively with a large range of values for η and λ in [10−6, 103] and values for κ in [0.1, 1.0] to observe their effect on our portfolio. More- over, we chose a reasonable range of γ values in [0%, 2%] to compute the proportional transaction costs incurred due to the portfolio update every day. We have illustrated some of our results with representative plots from either the NYSE or S&P500 dataset. We use the wealth obtained from OLU [41], EG [69], a uniform constant rebalanced portfolio (U-CRP), and a buy-and-hold strategy as benchmarks for our experiments. For U-CRP, we make trades to rebalance the portfolio at the end of each day after the market movement has driven it away from the uniform distribution. For the buy- and-hold case we start with a uniformly distributed portfolio and do a hold on the positions thereafter, i.e., no trades. We do not consider algorithms such as Anticor [15] or OLMAR [87], which are good heuristics but without regret bounds.

7.5.2 Effect of the L(∞,1) group norm and κ

The L(∞,1) group norm encourages diversity between the groups and sparsity within the groups. This structure is further effected by the value of κ which has direct control G over the level of diversity. κ has an effect on (a) the value of kptk(∞,1) each day, (b) the group weight kpgi k1 ∀i, and (c) the number of active groups.

G G (a) Value of kptk(∞,1) With the diversity inducing constraint kptk(∞,1) ≤ κ we are encouraging different levels of diversity depending on the value of κ. From Figure 7.1(a), 75

κ 4 κ=0.4 4 κ=0.6 4 κ=0.8 5 =0.4 5 κ=0.6 5 κ=0.8 10 10 10 10 10 10

4 4 4 3 3 3 10 10 10 10 10 10 3 3 3 10 10 10 2 2 2 10 10 10 2 2 2 10 10 10 1 1 1 Number of Days 10 10 10 Number of Days 1 1 1 10 10 10

0 0 0 0 0 0 10 10 10 10 10 10 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 L 0 0.5 1 (∞,1) Total Group Weight

G (a) Total value of kptk(∞,1) with vertical red (b) Total group weight kpgi k1 ∀i with vertical lines marking the value of κ. red lines marking the value of κ.

Figure 7.1: (a) With an aggressive trading strategy, the total value of the L(∞,1) penalty follows the increasing κ. (b) With low κ the algorithm is forced to diversify but less so as κ increases.

G −6 we can see the effect κ has on kptk(∞,1) with η = 100 and λ = 10 . With a low κ value G of 0.4, kptk(∞,1) is small with many days seeing the value exactly equal to κ. As we increase κ, we see that the value moves along with κ with many days seeing the value equal κ. This behavior is consistent with aggressive trading due to the high η and low λ values. ORASD tries to invest as much wealth as allowed into groups that performed well in the past. With higher λ this behavior is not as aggressive and the value more slowly moves with κ and becomes more spread out. G (b) Group Weight kpgi k1 Not only is the value of kptk(∞,1) affected by κ, but so are the group weights kpgi k1 ∀i. Each of these are constrained to be less than or equal to κ and how close each group’s weight gets to κ is of interest. If there are many days which see many groups with weight close to κ and many groups with little weight then this suggests ORASD is focusing on a few groups to invest the majority of wealth in at each day. If, however, there are many days with small group weight and few with large this suggests ORASD is investing a small amount of wealth in many groups and has a more diversified portfolio. From Figure 7.1(b) we can see with η = 10−3 and λ = 10−6 ORASD utilizes a more conservative trading strategy. We can see this from how slowly the distribution moves with κ and from how many days see a small amount of wealth invested in each group. 76 κ=0.1 50

0 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09 κ=0.9 50

Number of Active Groups 0 ‘91 ‘93 ‘95 ‘97 ‘99 ‘01 ‘03 ‘05 ‘07 ‘09

Figure 7.2: Number of active groups for the S&P500 dataset. For low κ we trade with a more conservative, diversified portfolio while with high κ we are more aggressive and risk seeking.

With low η and λ, ORASD does not trade aggressively and is further limited by κ. The portfolio is diversified with small κ and becomes slightly less so as we increase κ. (c) Number of Active Groups We define an active group as a group which has a significant percent of wealth invested in it. The number of active groups can be a measure of how diverse a portfolio is. If there are many active groups this implies a diverse portfolio where few active groups does not. From Figure 7.2 we can see that for low κ = 0.1, the number of active groups is reasonably high with around 20 groups active out of 263 (about 8%). With low κ we are forcing a diverse portfolio therefore many groups have a significant amount of wealth invested in them. For high κ = 0.9, we see that the number of active groups drops to around 3 groups (about 1%). This implies that ORASD is focusing on a handful of well performing groups. This trading strategy has a potentially high reward but it also carries high risk. We can see how adjusting the value of κ can provide us the flexibility to use different trading strategies.

7.5.3 Risk and κ

Even though our online structured diversification framework (7.5) does not explicitly take risk into account, we can effectively control risk using κ as a proxy. We observe how the value of κ affects the amount of risk our portfolio is exposed to using three measures of risk: (a) covariance, (b) , and (c) Sortino ratio.

(a) Covariance We compute the covariance Σt using the previous numlag days of price relatives. We measure the risk of a portfolio pt with respect to a uniform constant 77

30 −10 0 η=0.001 η=0.001 η=0.001 25 η=0.01 −12 η=0.01 −20 η=0.01 η=0.1 η=0.1 η=0.1 −14 20 η=1 η=1 −40 η=1 η=10 −16 η=10 η=10 15 η=100 η=100 −60 η=100 −18 10 −80 −20 Average variance Average sortino ratio Average sharpe ratio 5 −22 −100

0 −24 −120 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 κ κ κ

(a) Covariance risk, αcov (b) Sharpe ratio, αsharpe (c) Sortino ratio, αsortino

Figure 7.3: For αcov with low η, the risk stays low for varying κ. For large η, the risk increases with an increasing κ. For both αsharpe and αsortino, the risk-adjusted return is higher for higher η and deceases as we increase κ. From this figure, we can see that if we want to control the risk exposure, we can effectively do it by controlling κ.

> > rebalanced portfolio u as αcov=pt Σtpt/u Σtu. High αcov implies high risk and low

αcov implies low risk. (b) Sharpe ratio The Sharpe ratio [113] measures how much the return (percent gain or loss on investment) of a portfolio compensates for the level of risk taken. It computes what can be considered as a risk-adjusted return for a given portfolio and benchmark return. It does this by measuring both the downwards and upwards volatility. A higher Sharpe ratio implies better compensation for the risk exposure. We compute the p Sharpe ratio of a portfolio as αsharpe=(R−Rb)/ var(R−Rb) where R is the return for the portfolio and Rb is the benchmark return which is typically a large index such as the S&P500. (c) Sortino ratio The Sortino ratio [114] is similar to the Sharpe ratio however it only measures the downwards volatility. Typically, upwards volatility is encouraged as we would gladly accept the price of a stock we have invested in to go up. However, the Sharpe ratio penalizes this type of volatility where the Sortino ratio does not. We compute the Sortino ratio as αsortino = (R−Rb)/DR where DR is the of negative returns (losses). From Figure 7.3, we can see the behavior of the three measures of risk for varying −6 values of η and κ with λ = 10 . For αcov in 7.3(a), with low values of η the risk is low and stays low as we increase κ. However, once we have higher η values, the risk starts to increase up to a point before it starts to show a decreasing trend as we 78

Figure 7.4: Transaction cost-adjusted wealth for the NYSE dataset. ORASD returns more wealth than competing algorithms even with transaction costs. Note, ORASD earns more wealth than OLU from Chapter 5 which was tuned for optimal wealth and without transactions costs. increase κ. With higher η, the portfolio is focusing more on maximizing returns and is only investing in stocks that performed well in the past. This behavior does afford the portfolio the ability to earn huge amounts of wealth, however, as we can see it also exposes the investor to huge amounts of risk. In 7.3(b) and 7.3(c), we see that the risk-adjusted return increases as we increase η however, the curve trends downwards as we increase κ. This again implies that as κ increases so does the risk. From these plots we can see that κ is a good proxy to risk and we can therefore control risk by setting κ.

7.5.4 Transaction Cost-Adjusted Wealth

To evaluate the practical application of ORASD we analyze the performance by calcu- lating the transaction cost-adjusted cumulative wealth. We do this for varying values of κ to get a sense of the tradeoff between risk and return. We compare the perfor- mance of ORASD to the state-of-the-art algorithms OLU, EG, U-CRP, buy-and-hold, and the best stock in hindsight (without transaction costs) with empirically determined parameters. From Figure 7.4 we can see that for optimal η and λ values and varying values of κ, ORASD returns more wealth than all the other competing algorithms for the NYSE dataset. ORASD earns $163.07 compared to $54.14 for the single best stock 79

4 6 NYSE S&P500 26.70 27.08 5 14.50 3 4

2 50.80 3 163.07

Average Entropy 2

1 Average Entropy

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 κ ORASD OLU EG Buy−and−hold U−CRP

(a) Entropy of ORASD for η=100, λ=10−6. High (b) Entropy of ORASD and competing algo- entropy implies the investment is more spread rithms with cumulative wealth. ORASD has out. As κ increases, the entropy decreases which around the same level of diversification as OLU implies that the portfolio is concentrated on a sin- but returns more wealth. gle group of stocks.

Figure 7.5: Average entropy for ORASD and competing algorithms. We can see that as κ decreases so does the entropy. This indicates that the portfolio is becoming less diverse and as such may be exposed to more risk.

(Morris), $50.80 for OLU, $27.08 for U-CRP, $26.70 for EG, and $14.50 for buy-and- hold. ORASD returns over 3x as much as OLU and the best single stock we could have chosen in hindsight. We can see that ORASD also returns more wealth than the competing algorithms even with transaction costs.

7.5.5 Diversification

We saw from Figure 7.2 that one way to measure the diversity of a portfolio is by the number of active groups. Another measure commonly used is the entropy of the portfolio. Entropy measures how spread out a distribution is. If we have a portfolio with high entropy this implies that the portfolio is invested in many groups and is more diversified whereas with low entropy the portfolio only has investments in a few groups and is not as diversified. From Figure 7.5(a), we can see that as κ increases the entropy decreases. This shows that with high κ the portfolio is focusing on a few groups to invest in and as such may be exposed to more risk. This again gives evidence that κ is a proxy for risk. Finally, we want to compare how diversified ORASD is against competing algorithms. We can see from Figure 7.5(b) that ORASD has the lowest entropy but is close to that of OLU. However, ORASD returns more than 3x as much wealth as OLU for the same level of diversity and risk exposure.
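The entropy measure of diversification discussed above can be computed directly from the portfolio weights aggregated over groups. The sketch below assumes a long-only weight vector and a hypothetical grouping of stocks into sectors; the function name and grouping are illustrative, not the thesis code.

```python
import numpy as np

def group_entropy(weights, groups):
    # Aggregate the portfolio weights over groups and compute the entropy of the
    # resulting distribution; a higher value means a more diversified portfolio.
    w = np.array([np.abs(np.asarray(weights)[g]).sum() for g in groups])
    w = w / w.sum()
    w = w[w > 0]                      # drop empty groups to avoid log(0)
    return float(-(w * np.log(w)).sum())

# Hypothetical example: 6 stocks partitioned into 3 groups (e.g., market sectors).
groups = [[0, 1], [2, 3], [4, 5]]
concentrated = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0]   # all wealth in one group
diversified = [0.2, 0.1, 0.2, 0.2, 0.1, 0.2]    # wealth spread over groups
print(group_entropy(concentrated, groups), group_entropy(diversified, groups))
```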

Figure 7.6: Risk comparison on the NYSE dataset. We plot the negative Sharpe and Sortino ratios; therefore, a higher bar implies higher risk.

7.5.6 Risk Comparison

From Section 7.5.4, we saw that ORASD is able to return more wealth than all of the state-of-the-art algorithms. However, as Markowitz postulated, we should seek low risk in addition to high returns. As such, we compare the risk exposure for each of the competing algorithms with optimal parameters with respect to wealth. To be consistent in the plot and have each bar represent the level of risk exposure, we have plotted the negative Sharpe and Sortino ratios since a low ratio implies a high risk relative to the return. Therefore, for each of the bar plots, a higher bar height implies higher risk.

In Figure 7.6 we can see that ORASD has lower αcov than OLU but higher than the others with κ = 0.2, however, the other algorithms returned far less wealth. There is a balance between risk taken and potential returns. For αsharpe, we can see that ORASD and OLU are reasonably close and both have smaller risk than the other algorithms.

For αsortino, ORASD has the smallest risk. This can be explained by the fact that both the Sharpe and Sortino ratios compute the risk-adjusted return and the competing algorithms do not return much wealth so the compensation for the risk is low.

Chapter 8

Online Structured Hedging

This is the last chapter in which we consider problems which admit full information feedback. In Chapters 5-7, we considered the online portfolio selection problem where we required the algorithm to invest all its wealth into the portfolio each day. Moreover, we required the algorithm to only invest by using money to purchase shares of a stock, after which the algorithm earned money if the stock share price increased and lost money if it decreased. Such a requirement will fail when the market crashes and all stocks lose value. To address such a practical issue, in this chapter we allow the algorithm to borrow money and/or shares of stock from the bank and sell such shares. Afterwards, the algorithm can purchase the shares back and return what is owed to the bank; therefore, if the market crashes and the share price falls, the algorithm earns wealth. Further, we take advantage of such an ability to reduce various measures of risk. The work in this chapter first appeared as a peer-reviewed conference paper [77].

8.1 Introduction

Existing algorithms for online resource allocation problems, in particular, online portfolio selection problems, focused on budgeting resources that are currently in possession [45, 50, 63, 99, 126]. For example, financial trading algorithms only invest in the stock market using the current cash on hand. However, in several settings there are opportunities to borrow additional resources at the cost of a certain interest rate to use as leverage to increase returns. For example, a trader can invest with borrowed cash from

a bank with the obligation to pay back the loan plus interest at a later date. However, such leveraged allocations are risky since gains and losses are magnified. Moreover, several online resource allocation problems allow different allocation types or positions. For instance, when investing in the stock market, one can take a long position by purchasing shares in a company with cash on hand. Profit is made when the price of the shares increases. Additionally, one can take a short position by borrowing shares and selling them on the market at a price X and later purchasing them at a price Y and returning them. Profit is made when the price of purchasing them back is less than the price of selling them, Y < X, i.e., the price of the shares decreases. When such online resource allocation problems allow opposing allocation positions, one can strategically invest in both positions to reduce risk. This is often referred to as hedging and can be done by mining and utilizing correlation structures between assets to hold opposing positions in similar assets to withstand market crashes. In this chapter, we consider the problem of hedging resource allocations using leverage. In the context of online portfolio selection, a few papers have considered leverage [4, 34, 69] through buying on margin. However, this was simulated after-the-fact and not considered in their problem setting. One attempt at including leverage and long/short positions in the problem setting was recently presented in [65]. A few preliminary experiments on one dataset were presented to illustrate that their formulation worked as expected. However, they did not consider the structural dependencies between positions or utilize such structure to control risk. We build on the setting in [65] by introducing a framework with key changes to the loss function and constraint set to allow more flexible long and short portfolios and to consider structured hedging to alleviate risk, in addition to thoroughly exploring the effect similar features have on existing algorithms in the literature.

Overview of Contributions: We make two main contributions. First, we present a novel framework and algorithm for hedging online resource allocations with leverage (SHERAL). The formulation considers opposing long and short positions over assets such as stocks. The key novel components of our formulation are (1) a loss function for general leveraging and structurally dependent allocation positions and (2) a penalty function which encourages hedging between structurally dependent assets to control risk. We pose the problem as a constrained online convex optimization problem and instantiate it in the context of online portfolio selection where one has to compute a portfolio of stocks to invest in each day. Second, we experiment extensively with five datasets and various measures of risk. Further, we present leveraged variants of the EG algorithm [69] and show experimentally that adding leveraged long-only positions leads to orders of magnitude improvements in wealth gain, and adding both long and short positions with leverage leads to even more improvements. The results are consistent across multiple datasets, and SHERAL outperforms several existing methods for the portfolio selection problem.

8.2 Problem Formulation

In this section, we introduce a framework for structured hedging with leverage, and consider the online resource allocation problem under such structured hedging. In Section 8.4, we consider the online portfolio selection problem as an instance of such hedged resource allocation.

8.2.1 Hedging and Leverage

In many resource allocation problems, one may benefit from borrowing additional resources to increase allocation power. For example, in finance, one often buys on margin which is the act of buying shares of stock using cash that is borrowed. This is achieved by putting some amount of cash into a margin account, which acts as collateral c, after which the bank loans additional cash. When one borrows cash, interest is charged on the loan which is often represented as a percentage of the loan size, e.g., a daily interest rate of r = 0.000245 which is equivalent to an annual rate of 0.063. In finance, the borrowed cash used to increase potential return is often referred to as leverage. In addition to leverage, there are often opposing allocation types or positions one may take. For instance, one can budget a fraction of the resource to an object, e.g., using cash to buy shares of stock. In finance, this is what is often referred to as holding a long position in an asset. One will profit in a long position if the value of the allocation increases, i.e., the stock price increases. In contrast, one can choose an opposing type of allocation where one will profit if the value of the allocation decreases. In finance, one such type of allocation is called a short position, which is when one allocates a borrowed resource. One obtains a short position by placing collateral c, such as cash, into a margin account after which a resource, such as shares of a stock, is borrowed with the obligation to return it later at its current value. Often interest is charged on the borrowed resource rb and may be earned on the collateral rc, however in this chapter we assume the interest on the borrowed resource, collateral, and cash leverage are the same, i.e., rb = rc = r. We also assume that the value of the collateral c is equal to the value of the resource borrowed. More specifically, a trader places X amount of cash as collateral into a margin account and then the bank lends the trader shares of stock worth X. The trader then sells the shares at the current market price (presumably X). After the market moves, the trader purchases the shares back at the new market price Y and returns them to the bank. The trader profits if the purchase price is less than the selling price, i.e., Y < X. Long and short positions act in opposition and are most frequently utilized to offset the risk of a particular allocation. For example, if two stocks s1 and s2 are positively correlated but highly volatile, then a trader may hold a long position in s1 and a short position in s2. The trader is exposed to less risk with this allocation because if both stocks crash then the trader will not lose as much since there will be a loss in the long position s1 but a gain in the short position s2. This is often referred to as hedging and the difficulty is in taking advantage of the structural dependencies between assets to balance high returns and exposure to risk.

8.2.2 Structured Hedging

We consider structure in resource allocation problems in the form of a graph over the assets. For n objects, the goal is to find an allocation p ∈ P ⊂ R^{2n} which determines how to split up a resource amongst long and short positions over the n objects such that a certain (convex) objective f(p) is minimized. We denote the set of indices

` = {1, . . . , n} and s = {n + 1, . . . , 2n} as the long and short positions in p and let D` and Ds be 2n × 2n diagonal matrices with D`(i, i) = 1 for i ∈ `, Ds(i, i) = 1 for i ∈ s, and 0 otherwise. Let q` = D`p ≥ 0 and qs = Dsp ≤ 0 where the inequalities are taken element-wise. q` and qs are the long-only and short-only vectors of size 2n × 1 with value equal to p for indices in ` and s respectively and 0 otherwise.

For example, in the context of portfolio selection, the n objects are stocks and p is an investment strategy, i.e., what fraction of one's money should one put on each stock. The basic idea of hedging is to place resources in opposing positions and different assets. A simple way to accomplish this is to select a set of assets to hold long positions and another set of assets to hold short positions. A potential issue with such an approach is that there may be structural dependencies between the assets, such as being negatively correlated, which may result in losing every position held. For example, Apple and Costco may be negatively correlated because Apple sells luxury items and Costco sells consumer staples. Since companies that sell luxury items are cyclical as we saw in Chapter 6, i.e., share price is positively correlated with economic conditions, the purchases of luxury items often slow during market crashes. In contrast, companies that sell consumer staples are non-cyclical and the purchases of consumer staple items do not slow during market crashes. Therefore, if we hold a long position in Apple and a short position in Costco during this time, we will lose in both positions. Such structure is often hard to determine a priori and represent simply such as in groups of assets, i.e., market sectors. We can more easily capture relationships between assets via a graph where the value of the edges determines how similar the assets are. Similarity can be any suitable measure of correlation such as linear correlation, rank correlation, etc. The goal of structured hedging is to develop hedged strategies which explicitly consider such graph structured assets.

Hedging Penalty Function

For the development, we assume knowledge of the structured graph G = (V, E) where V = {v(1), . . . , v(n)} are the nodes and E = {e(1), . . . , e(m)} are the edges where e(k) = (v(i), v(j)) if there is an edge between nodes i and j. Let the weight on edge e(k) be Wij ∈ R, then W ∈ R^{n×n} is the graph's weighted adjacency matrix with Wij = 0 if there is no edge between nodes i and j. We outline approaches for constructing such a graph directly from the data in Section 8.4 in the context of portfolio selection. Given such a graph and long/short position vectors q`, qs ∈ R^{2n} where the long position of asset i is contained in index i of q` and the short position of asset i is contained in the index i + n of qs, we introduce a hedging penalty function

\[
\Omega_h = \sum_{i=1}^{n} \sum_{\substack{j=n+1 \\ j \neq i+n}}^{2n} W_{ij}\,\big(q_\ell(i) + q_s(j)\big)^2 \tag{8.1}
\]
where Wij is a measure of the similarity between assets i and j and q`(i) ≥ 0 and qs(j) ≤ 0 are the value of the asset positions. We seek to minimize Ωh, so when Wij is large, we minimize Ωh by making the value of the assets' opposing positions close, effectively encouraging hedging. When Wij is small, the assets are not similar so the assets' position values do not need to be close and we do not hedge. One key aspect of hedging is that we want to hold opposite positions in different but structurally related assets rather than the same asset since this is similar to holding only the difference between the positions. (8.1) is designed to do this by (i) taking into account the structural dependencies between assets in G and (ii) only considering edges between opposing positions in different assets. In other words, we do not consider the difference between q`(i) and q`(j) (same position type, different assets) or the difference between q`(i) and qs(i) (different position type, same assets). One benefit of our quadratic hedging function (8.1) is that we are able to capture the interaction between different asset positions which provides more flexibility in responding to asset fluctuations in addition to considering just the correlation between assets. We can consider other hedging penalty functions, e.g., $\sum_{i=1}^{n} \sum_{j=n+1, j \neq i+n}^{2n} W_{ij}\big(\frac{1}{q_\ell(i) - q_s(j)}\big)$, however such a function does not capture the position interactions. Additionally, such functions may not reflect a true hedging strategy since the function can be minimized by making one position large while leaving the other small. One limitation of (8.1) is that when there are two identical assets, this form will encourage both long positions and short positions to be similar across assets. We can represent (8.1) in a more compact form by considering a graph over the

2n positions instead of the n assets. Let Gp = (Vp, Ep) be a graph over the positions where Vp = {v`(1), vs(1), . . . , v`(n), vs(n)} are the nodes, i.e., v`(i) is the node for the long position of asset i and vs(i) is the node for the short position of asset i. We can construct the set of edges Ep and their corresponding weights Uij from G by considering assets i and j that are connected by edge e(k) with weight Wij. We add an edge to Ep between nodes v`(i) and vs(j) with weight Wij and an edge between nodes vs(i) and v`(j) with weight Wij. In other words, we only connect the opposing position nodes between the different assets and the corresponding edges will have the same weight as the weight between the assets.

Under this construction we can similarly define Gp's weighted adjacency matrix
\[
U = \begin{bmatrix} 0 & W \\ W & 0 \end{bmatrix} \in \mathbb{R}^{2n \times 2n},
\]
which is symmetric since W is symmetric. Let D ∈ R^{2n×2n} be a diagonal matrix with D(i, i) = $\sum_j U_{ij}$ and 0 otherwise, then

\[
L = D + U \tag{8.2}
\]
which is also symmetric and positive semi-definite by design. Then, by construction

\[
p^\top L p = \sum_{i=1}^{n} \sum_{\substack{j=n+1 \\ j \neq i+n}}^{2n} W_{ij}\,\big(q_\ell(i) + q_s(j)\big)^2 . \tag{8.3}
\]
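As a sanity check on the identity (8.3), the following sketch builds U, D, and L = D + U from a small random adjacency matrix W and verifies that the quadratic form p^⊤Lp matches the double sum in (8.1); the dimensions and weights are arbitrary illustrative choices.

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)
W = rng.random((n, n))
W = (W + W.T) / 2                     # symmetric similarity weights
np.fill_diagonal(W, 0.0)              # no self-edges

# Positional graph (8.2): U = [[0, W], [W, 0]], D = diag(row sums of U), L = D + U.
U = np.block([[np.zeros((n, n)), W], [W, np.zeros((n, n))]])
D = np.diag(U.sum(axis=1))
L = D + U

# A feasible position vector p = [q_long; q_short], long >= 0 and short <= 0.
q_long = rng.random(n)
q_short = -rng.random(n)
p = np.concatenate([q_long, q_short])

quad_form = p @ L @ p                                              # p^T L p
double_sum = sum(W[i, j] * (q_long[i] + q_short[j]) ** 2
                 for i in range(n) for j in range(n) if j != i)    # (8.1)
print(np.isclose(quad_form, double_sum))                           # True
```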

8.2.3 Online Resource Allocation Framework

We consider the online learning setting which proceeds in rounds where at each round t the algorithm selects a solution pt ∈ P, Nature selects a convex loss function φt : P → R, the algorithm observes the entire loss function as feedback, and suffers a loss of φt(pt). Using Ωh(p) = p^⊤Lp to induce hedged solutions, we would like to minimize the constrained cumulative loss

\[
\sum_{t=1}^{T} \big[\phi_t(p_t) + \beta\,\Omega_h(p_t)\big] . \tag{8.4}
\]
However, in the online setting, minimization of (8.4) is not feasible since we do not know the sequence of φt a priori. Alternatively, over T rounds we intend to get a sequence of pt such that the following fixed regret is sublinear in T

\[
R_T = \sum_{t=1}^{T} f_t(p_t) - \min_{p^*} \sum_{t=1}^{T} f_t(p^*) = o(T) \tag{8.5}
\]
where ft(p) = φt(p) + βΩh(p). The regret is measured with respect to the best fixed minimizer in hindsight p*. As in previous chapters, each day we consider solving a linearized version by taking a first-order Taylor expansion of φt at pt along with a proximal term, so we end up solving

\[
p_{t+1} = \operatorname*{argmin}_{p \in \mathcal{P}} \; \langle p, \nabla\phi_t(p_t)\rangle + \beta\,\Omega_h(p) + d(p, p_t) , \tag{8.6}
\]
where $d(p, p_t) = \frac{1}{2\eta_t}\|p - p_t\|_2^2$ and parameters ηt, β ≥ 0.

8.3 Online Portfolio Selection

We follow a similar online portfolio selection setting as discussed in Chapter 4 with some key changes to allow leverage and long/short positions. Recall for a stock market consisting of n stocks we consider price relatives xt(i) > 0 for i = 1, . . . , n at day t which is the closing price over the opening price of stock i. Let x̂t = [xt(1), . . . , xt(n)]^⊤ denote the vector of price relatives for day t, let xt = [x̂t; x̂t] be the 2n×1 length double stacked vector of price relatives, and let x1:t denote the collection of such price relative vectors up to and including day t. A portfolio on day t is pt = [pt(1), . . . , pt(2n)]^⊤ ∈ P where the first |`| elements are long-only positions, i.e., pt(i) ≥ 0, and the last |s| elements are short-only positions, i.e., pt(i) ≤ 0, which prescribes investing pt(i) fraction of the total wealth, including leverage, in stock si.

8.3.1 Long-Only Portfolios

For long-only portfolios q` ≥ 0 without leverage, the multiplicative gain in wealth at the end of day t is q`^⊤xt. When we allow borrowing cash from the bank as leverage we have

\[
\underbrace{q_\ell^\top x_t}_{\substack{\text{market change}\\ \text{in wealth}}} + \underbrace{(1 - q_\ell^\top \mathbf{1})(1 + r)}_{\substack{\text{cash borrowed}\\ \text{or not invested}}} . \tag{8.7}
\]
The last term accounts for the percentage of our current wealth we borrowed from the bank or wealth not invested. If we did not borrow any cash then 0 ≤ (1 − q`^⊤1) ≤ 1 and any cash not invested is considered as being held in a savings bank account which earns interest at a rate of r. If we did borrow cash, then this term is negative and is the amount we have to pay back to the bank plus interest r. When we allow borrowing cash, we have to be careful about owing more money than we have left at the end of any day in order to avoid financial ruin. For instance, if we invest long with leverage and the market crashes, we may have no money left to pay back the bank loan. In order to guarantee no-ruin, we make an assumption on the price relatives similar to [65] such that 0 < 1 − B` < xt where B` is a parameter that can be set based on historical stock data. Then the maximum amount we can invest is (1 + r)/(B` + r). We prove that this will not lead to a negative growth rate in the following proposition.

Definition 1 For a long-only portfolio q` such that q` ≥ 0, ‖q`‖1 ≤ (1 + r)/(B` + r), and with bounded price relatives 0 < 1 − B` < xt, the multiplicative gain q`^⊤xt + (1 − q`^⊤1)(1 + r) ≥ 0.

Proof: For ease of exposition, let us consider only a single investment so q` ∈ R+. Then in the worst case we have
\[
q_\ell^\top x_t + (1 - q_\ell^\top \mathbf{1})(1 + r) \;\ge\; \frac{1+r}{B_\ell + r}(1 - B_\ell) + \Big(1 - \frac{1+r}{B_\ell + r}\Big)(1 + r) = 0 .
\]
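As a small numeric illustration of the no-ruin bound above (assuming the illustrative values B` = 0.6 and r = 0.000245), the worst-case gain of a maximally leveraged long-only investment is exactly zero:

```python
# Worst-case check of Definition 1 with illustrative B_l and r values.
B_l, r = 0.60, 0.000245
max_invest = (1 + r) / (B_l + r)                  # maximum long investment allowed
worst_price_relative = 1 - B_l                    # worst case allowed by 0 < 1 - B_l < x_t
gain = max_invest * worst_price_relative + (1 - max_invest) * (1 + r)   # (8.7)
print(max_invest, gain)                           # gain is (numerically) zero: no ruin
```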

8.3.2 Short-Only Portfolios

For short-only portfolios qs ≤ 0 without cash leverage, the multiplicative gain in wealth at the end of day t consists of the difference between the price value of shares borrowed qs^⊤1 and the price the shares are now worth qs^⊤xt minus the interest owed on borrowing the shares plus the collateral c and interest on collateral r. In this chapter, we assume the interest rate for borrowing cash/shares is the same as the interest rate earned for keeping cash and collateral in a margin account. Additionally, the total value of shares borrowed is equal to the collateral put down, i.e., ‖qs‖1 = c. Therefore, the multiplicative gain in wealth is qs^⊤(xt − 1) + qs^⊤1r − qs^⊤1(1 + r). When we allow borrowing cash from the bank as leverage we have

\[
\underbrace{q_s^\top (x_t - \mathbf{1})}_{\substack{\text{market change}\\ \text{in wealth}}} + \underbrace{q_s^\top \mathbf{1}\, r}_{\substack{\text{interest owed on}\\ \text{borrowed shares}}} + \underbrace{(1 + q_s^\top \mathbf{1})(1 + r)}_{\substack{\text{cash borrowed}\\ \text{or not invested}}} - \underbrace{q_s^\top \mathbf{1}(1 + r)}_{\substack{\text{collateral and}\\ \text{interest earned}}} .
\]
Re-arranging the terms we get

\[
q_s^\top (x_t - \mathbf{1}) + q_s^\top \mathbf{1}\,r + (1 + q_s^\top \mathbf{1})(1 + r) - q_s^\top \mathbf{1}(1 + r) = q_s^\top (x_t - \mathbf{1} + r) + (1 + r) . \tag{8.8}
\]

Similar to long-only portfolios, in order to guarantee no-ruin we assume xt < 1 + Bs < ∞. Then the maximum amount we can invest is (1 + r)/(Bs + r). We prove that this will not lead to a negative growth rate in the following proposition.

Definition 2 For a short-only portfolio qs such that qs ≤ 0, ‖qs‖1 ≤ (1 + r)/(Bs + r), and with bounded price relatives xt < 1 + Bs < ∞, the multiplicative gain qs^⊤(xt − 1 + r) + (1 + r) ≥ 0.

Proof: For ease of exposition, let us consider only a single investment so qs ∈ R−. Then in the worst case we have

\[
q_s^\top (x_t - \mathbf{1} + r) + (1 + r) \;\ge\; -\frac{1+r}{B_s + r}\big((1 + B_s) - 1 + r\big) + (1 + r) = 0 .
\]

8.3.3 Long and Short Portfolios

For portfolios that allow both long and short positions with leverage, the multiplicative gain in wealth at the end of the day t will combine (8.7) and (8.8) to get

\[
q_\ell^\top x_t + (1 - q_\ell^\top \mathbf{1})(1 + r) + q_s^\top (x_t - \mathbf{1} + r) + (1 + r) .
\]

However, we have counted the amount of cash borrowed or not invested plus interest twice since both were included individually in (8.7) and (8.8) so we subtract (1 + r) to get

\[
q_\ell^\top x_t + q_s^\top (x_t - \mathbf{1} + r) + (1 - q_\ell^\top \mathbf{1})(1 + r) . \tag{8.9}
\]

Unlike in [65], we do not assume B` = Bs since bear and bull markets tend to not be symmetrical. Therefore, we define a hyperplane that guarantees no-ruin for a portfolio

with a combination of long and short positions as ‖q`‖1 + ((Bs + r)/(B` + r))‖qs‖1 ≤ (1 + r)/(B` + r). Since qs ≤ 0, we can define a vector a such that the first |`| elements are equal to 1 and the last |s| elements are equal to −(Bs + r)/(B` + r), so the constraint on maximum investment as a combination of long and short positions is a^⊤p ≤ (1 + r)/(B` + r). We prove that this will not lead to a negative growth rate in the following proposition.

Definition 3 For a long and short portfolio p = q` + qs such that q` ≥ 0, qs ≤ 0, a^⊤p ≤ (1 + r)/(B` + r), and with bounded price relatives 0 < 1 − B` < xt < 1 + Bs < ∞, the multiplicative gain q`^⊤xt + qs^⊤(xt − 1 + r) + (1 − q`^⊤1)(1 + r) ≥ 0.

Proof: Setting u = ‖q`‖1 implies ‖qs‖1 ≤ (1 + r)/(Bs + r) − ((B` + r)/(Bs + r))u. Then in the worst case

\[
q_\ell^\top x_t + q_s^\top (x_t - \mathbf{1} + r) + (1 - q_\ell^\top \mathbf{1})(1 + r) \;\ge\; u(1 - B_\ell) - \Big(\frac{1+r}{B_s + r} - \frac{B_\ell + r}{B_s + r}\,u\Big)\big((1 + B_s) - 1 + r\big) + (1 - u)(1 + r) = 0 .
\]

Therefore, the multiplicative gain in wealth at the end of day t for a leveraged portfolio with long and short positions is (8.9) which, in the worst case, is non-negative. This ensures we have enough money left over after the market moves to pay back all loans.

Given this, and a sequence of price relatives x1:t−1 = {x1,..., xt−1} up to day (t−1), the sequential portfolio selection problem in day t is to determine a portfolio pt based on past performance of the stocks. At the end of day t, xt is revealed and the actual performance of pt gets determined by (8.9). Over T periods, for a sequence of portfolios p1:T = {p1,..., pT }, the multiplicative and logarithmic gain in wealth are

\[
S_T(p_{1:T}, x_{1:T}) = \prod_{t=1}^{T} \Big[\, q_\ell^\top x_t + q_s^\top (x_t - \mathbf{1} + r) + (1 - q_\ell^\top \mathbf{1})(1 + r) \,\Big]
\]
\[
\mathrm{LS}_T(p_{1:T}, x_{1:T}) = \sum_{t=1}^{T} \log\Big[\, q_\ell^\top x_t + q_s^\top (x_t - \mathbf{1} + r) + (1 - q_\ell^\top \mathbf{1})(1 + r) \,\Big] .
\]
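The per-day multiplicative gain (8.9) and the cumulative wealth ST can be computed directly from a double-stacked price relative vector and a long/short portfolio. The sketch below uses illustrative numbers for two stocks and is not the thesis code.

```python
import numpy as np

def daily_gain(p, x, r):
    # Multiplicative gain (8.9) for one day: p = [long part; short part] of length 2n,
    # x the double-stacked price relatives, r the daily interest rate.
    n = len(p) // 2
    q_l = np.concatenate([p[:n], np.zeros(n)])   # q_l = D_l p >= 0
    q_s = np.concatenate([np.zeros(n), p[n:]])   # q_s = D_s p <= 0
    return q_l @ x + q_s @ (x - 1.0 + r) + (1.0 - q_l.sum()) * (1.0 + r)

# Toy example: two stocks over three days; long the first stock, short the second.
r = 0.000245
p = np.array([0.8, 0.0, 0.0, -0.3])
price_relatives = [np.array([1.02, 0.97] * 2),
                   np.array([0.99, 1.01] * 2),
                   np.array([1.01, 0.98] * 2)]
wealth = 1.0
for x in price_relatives:
    wealth *= daily_gain(p, x, r)                # cumulative wealth S_T
print(wealth)
```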

8.4 Algorithm

Online portfolio selection can now be viewed as a special case of our online resource allocation with structured hedging setting with

\[
\phi_t(p_t) = -\log\big(\alpha_1 q_\ell^\top x_t + \alpha_2 q_s^\top (x_t - \mathbf{1} + r) + (1 - q_\ell^\top \mathbf{1})(1 + r)\big)
\]

Algorithm 7 SHERAL Algorithm
1: Input pt, xt, η, α1, α2, λ, B`, Bs, r, D`, Ds, G.
2: Compute weighted adjacency matrix L via (8.2).
3: Compute portfolio for day t + 1:
\[
p_{t+1} = \Pi_{\mathcal{P}}\Big[\big(\lambda(L + L^\top) + I\big)^{-1}\big(\eta\nabla\phi_t(p_t) + p_t\big)\Big]
\]
where ∇φt(pt) is as defined in (8.12) and $\Pi_{\mathcal{P}}$ is a projection onto the convex set P = {p | D`p ≥ 0, Dsp ≤ 0, a^⊤p ≤ (1 + r)/(B` + r)} via alternating projections.

Algorithm 8 Hedging for Online Portfolio Selection
1: Input η, α1, α2, λ, B`, Bs, interest rate r, transaction cost γ, days lag δ.
2: Set S0^γ = $1.
3: Set p0:1(i) = [(1 + r)/(B` + r)] / [1 + (Bs + r)/(B` + r)] ∀i (uniform over positions).
4: For t = 1, . . . , T
5:   Receive the vector of price relatives: xt.
6:   Compute multiplicative gain in wealth via (8.9) as Ψt.
7:   Compute wealth: St^γ = St−1^γ (Ψt − γ‖pt − pt−1‖1).
8:   If t ≤ δ
9:     pt+1 = pt (uniform over positions).
10:  Else
11:    Compute graph: G.
12:    pt+1 = SHERAL(pt, xt, η, α1, α2, λ, B`, Bs, r, D`, Ds, G).
13: end for

where α1, α2 ≥ 0 are parameters that control the importance of long and short positions respectively. Note, to value both long and short positions equally one needs α2 > α1 since the scales of the position returns differ by a factor (refer to Section 8.6.3 (c) for an example).

Letting ηt = η and multiplying each term in (8.6) by η so that λ = ηβ, the online portfolio selection with structured hedging problem is

\[
\min_{\substack{q_\ell \ge 0,\; q_s \le 0 \\ a^\top p \le \frac{1+r}{B_\ell + r}}} \; \eta\langle p, \nabla\phi_t(p_t)\rangle + \lambda\, p^\top L p + \frac{1}{2}\|p - p_t\|_2^2 . \tag{8.10}
\]
This is a strongly convex optimization problem with the linear constraint set P =

−1 Y  >  pt+1 = (η∇φt(pt) + pt) λ(L + L ) + I (8.11) P where α D>x + α D>(x − 1 + r) − D>1(1 + r) ∇φ (p ) = 1 ` t 2 s t ` (8.12) t t > > > > > > α1pt D` xt + α2pt Ds (xt − 1 + r) + (1 − pt D` 1)(1 + r) Q and P is a projection onto the convex constraint set P. Algorithm 7 shows the complete details for computing the hedged portfolio. Algo- rithm 8 is our hedged online portfolio selection with leverage algorithm which includes computing the transaction-cost adjusted wealth using a fixed percentage transaction cost γ.

8.5 Regret Bound

We sequentially invest with the hedged portfolios p1, . . . , pT obtained from Algorithm 8 and on day t suffer a loss of ft(pt) = ηφt + λpt^⊤Lpt. Our goal is to minimize the regret with respect to the best fixed portfolio p* in hindsight. We establish the standard regret bound in portfolio selection literature [4, 34, 69] and omit the proof since it follows from existing results.

Theorem 8 Let p* ∈ P be the fixed portfolio obtained from $\min_p \sum_{t=1}^{T} \phi_t(p)$. For η = 1/√T, λ = 1/t, and ‖∇φt(pt)‖2² ≤ G, the regret can be bounded as,

\[
\sum_{t=1}^{T} \big(\phi_t(p_t) + p_t^\top L p_t\big) - \sum_{t=1}^{T} \phi_t(p^*) \le O(\sqrt{T}), \tag{8.13}
\]
where φt is a strongly convex function and the sequence pt and the fixed optimal portfolio p* all lie in the constraint set P = {p | D`p ≥ 0, Dsp ≤ 0, a^⊤p ≤ (1 + r)/(B` + r)}.

8.6 Experiments and Results

The experiments were conducted on 5 datasets with data taken from: Dow Jones Industrial Average (DJIA), New York Stock Exchange (NYSE), Standard & Poor's 500 (S&P 500), and the Toronto Stock Exchange (TSX) (refer to Section 4.3 for details). Note, we use a subset of the S&P500 dataset which consists of the same 263 stocks but over a shortened time period of 505 trading days over a period of 2 years from 2007 to 2009 and we denote this dataset1 S&P500b. The reason we consider a subset is to focus specifically on the financial and housing crash where 253 of the 263 stocks (96%) lost value to illustrate the effectiveness of short positions and structured hedging.

8.6.1 Methodology and Parameter Setting

In all our experiments we start with $1 as our initial investment and an initial portfolio with maximum leverage uniformly distributed over all the positions. We use Algorithm 8 to obtain our portfolios sequentially and compute the transaction cost-adjusted wealth each day. Since the five datasets are very different in nature, we experimented with various parameter values for all algorithms using a grid search in the following ranges: δ ∈ {5, 10, . . . , 50}, and each of η, α1, α2, and λ following a log-scale in [10^{-6}, 10^3] to observe their effect on our portfolio and found stable behavior in these ranges. Moreover, we chose a reasonable range of transaction costs γ ∈ [0%, 2%] to observe their effect on the transaction cost-adjusted wealth. Typical yearly margin interest rates are between 5% and 8%. For all experiments we set the daily interest rate r = 0.000245 which is equivalent to a yearly interest rate of 6.3%. Rates around 6.3% have been used before in the literature [34, 65, 69]. For each dataset, we computed the parameters B` and Bs to the nearest hundredth decimal place in hindsight (Table 8.1). In practice, one must use historical data to compute such values, thus, no-ruin guarantees can only be made probabilistically. Alternatively, one could develop online generalizations of Bayesian Optimization or suitable variants of parameter free online learning to avoid parameter setting in hindsight. In all our experiments, we construct the graph G by fully connecting each stock so

1 http://www-users.cs.umn.edu/∼njohnson/port sel.html

every node has degree n − 1. We compute their similarity by calculating the linear correlation coefficient wij between each pair of stocks i and j using the previous δ days of price relatives to get the weighted adjacency matrix W ∈ R^{n×n}. The positional graph

Gp was then constructed as in (8.2).
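A minimal sketch of this graph construction, assuming a δ × n window of price relatives and using the sample linear correlation as the edge weights; the synthetic data and function name are illustrative.

```python
import numpy as np

def correlation_graph(price_relative_window):
    # price_relative_window: a (delta x n) array of the previous delta days of
    # price relatives; edge weights are pairwise linear correlation coefficients.
    W = np.corrcoef(price_relative_window, rowvar=False)
    np.fill_diagonal(W, 0.0)          # no self-edges in the fully connected graph
    return W

# Hypothetical usage: delta = 20 days of synthetic price relatives for n = 5 stocks.
rng = np.random.default_rng(1)
window = 1.0 + 0.01 * rng.standard_normal((20, 5))
W = correlation_graph(window)
```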

8.6.2 Cumulative Wealth

To evaluate the practical application of SHERAL, we analyze the performance and compare several benchmark algorithms with parameters tuned for maximum cumulative wealth (without transaction costs).

Leveraged Long and Short Effect on EG

First, we take the well-known algorithm EG [69] and observe its performance on the 5 datasets. To observe the effect that long and short positions with leverage have, we further experiment with variants of EG to allow: (1) long-only positions with leverage (LO), (2) short-only positions with leverage (SO), and (3) long and short positions with leverage (LS).

EG Variants Formulation: EG allows long-only positions p ≥ 0 and uses the relative entropy function as the proximal term, i.e., $d(p, p_t) = \sum_{i=1}^{n} p(i) \log\frac{p(i)}{p_t(i)}$ where p ∈ Δn = {p(i) ≥ 0 ∀i, Σi p(i) = 1}. Since we need to allow p(i) ≤ 0 for short positions, we instead use $d(p, p_t) = \frac{1}{2}\|p - p_t\|_2^2$. Additionally, we utilize the loss function defined in (8.9). Essentially, the variant EG problem is

\[
\min_{p \in \mathcal{P}} \; \eta\langle p, \nabla\phi_t(p_t)\rangle + \frac{1}{2}\|p - p_t\|_2^2 \tag{8.14}
\]

where P = {p | q` ≥ 0, ‖q`‖1 ≤ (1 + r)/(B` + r)} for LO, P = {p | qs ≤ 0, ‖qs‖1 ≤ (1 + r)/(Bs + r)} for SO, and P = {p | q` ≥ 0, qs ≤ 0, a^⊤p ≤ (1 + r)/(B` + r)} for LS. We can see that this is a special case of (8.10) where λ = 0. Therefore, SHERAL is able to perform at least as well as the EG* variants.

DJIA NYSE SP500a SP500b TSX

B` 0.60 0.26 0.31 0.61 0.64

Bs 0.21 0.36 0.25 0.67 0.94

Table 8.1: Table of B` and Bs values for each dataset.

DJIA NYSE SP500a SP500b TSX
EG 0.81 26.70 1.64 0.68 1.59
EG* (LO) 1.55 6.9×10^14 20.90 2.21 1.0×10^3
EG* (SO) 0.63 0.04 0.34 1.10 1.07
EG* (LS) 2.00 6.6×10^14 20.65 2.26 1.62
SHERAL (λ>0) 2.47 1.8×10^15 19.89 7.84 8.74

Table 8.2: Cumulative wealth for EG, leveraged long-only (LO), short-only (SO), and long/short (LS) variants of EG*, and SHERAL with λ > 0.

Results: The results are presented in Table 8.2. From this table, we can see that the original EG is outperformed by at least one of the EG* variants in each dataset. Interestingly, when only adding leverage to EG, i.e., EG* (LO), the performance improves substantially for the NYSE, SP500a, and TSX datasets. It seems that EG* (LO) is able to select the correct stocks and that increasing the investment power dramatically enhances the multiplicative gain in wealth. Additionally, we see that EG* (SO) is able to earn more wealth for the SP500b dataset than EG. This is because almost all stocks in this dataset lose value and EG is not able to hold short positions thereby limiting it to invest in stocks that are performing poorly. However, we also see that EG* (LO) earns more wealth than EG* (SO) even though it is limited by only holding long positions. It seems that for the few stocks that have increases in value, the leverage is enough to allow it to earn more wealth. Finally, we see that EG* (LS) is able to outperform EG on each dataset even though they are significantly different in nature. It is also able to compete with EG* (LO) in 4 of the 5 datasets. From these results, we can see that adding leverage to EG significantly increases cumulative wealth. Moreover, it seems that short positions do not have a strong impact on cumulative wealth but they do allow for more flexibility and reasonable performance on datasets with both bull and bear markets. We can observe the effect that structured hedging has on wealth by comparing the cumulative wealth of SHERAL to that of EG and EG* variants. We can especially see its effect by looking at EG* (LS) since it is equivalent to setting λ = 0 in (8.10).


Figure 8.1: Transaction cost-adjusted wealth for the NYSE dataset with varying values of λ. SHERAL returns more wealth than the best competing algorithm even with transaction costs.

We can see that SHERAL is able to earn more wealth than EG and all EG* variants on 3 of the 5 datasets. This implies that there is some volatility in these datasets and the structured hedging helps alleviate this volatility. The structured hedging has a positive effect on most datasets and, as we will see later, it is able to reduce the risk and earn similar amounts of wealth. When λ = 0, SHERAL can earn as much as EG and all EG* variants so for datasets where hedging is not beneficial, we can still earn the same amount of wealth. In Figure 8.1, we can see how the transaction cost-adjusted cumulative wealth changes as we vary the structured hedging weight λ in comparison with the best performing variant EG* (LO) for the NYSE dataset. SHERAL is able to earn more wealth even with reasonable transaction costs than the best EG variant (which does not include transaction costs).

Comparison with Benchmark Algorithms

In addition to a comparison between EG and our EG* variants, we compared the performance of several benchmark algorithms: a buy-and-hold strategy (U-BAH), a uniform constant rebalanced portfolio (U-CRP), Universal Portfolios (UP) [34], Online Lazy Updates (OLU) from Chapter 5, and the best single stock in hindsight. These algorithms were designed to only invest in long positions and without leverage.

DJIA NYSE SP500a SP500b TSX
U-BAH 0.76 14.49 1.34 0.63 1.61
U-CRP 0.81 26.78 1.64 0.69 1.59
UP 0.80 26.99 1.62 0.69 1.59
OLU 0.84 50.80 2.45 3.02 2.24
Best Stock (LO) 1.18 54.14 3.77 1.74 6.27
U-BAH* (LO) 0.55 38.44 0.44 0.38 1.71
U-BAH* (SO) 0.43 3.68×10^−6 0.01 1.21 0.54
U-BAH* (LS) 0.97 0.43 0.78 1.03 1.12
U-CRP* (LO) 0.61 695.59 1.04 0.43 1.68
U-CRP* (SO) 0.28 1.04×10^−6 1.00×10^−2 0.97 0.54
U-CRP* (LS) 0.83 0.05 0.45 0.93 0.92
Best Stock* (LO) 1.09 (P&G) 65.71 (PM) 0.56 (WMT) 1.15 (SWN) 7.84 (GTA)
Best Stock* (SO) 1.13 (MCD) 1.45×10^−5 (DD) 0.02 (KO) 4.71 (GCI) 2.11 (IFP)
SHERAL (λ > 0) 2.47 1.81×10^15 19.89 7.84 8.74

Table 8.3: Cumulative wealth (without transaction costs) of SHERAL, benchmark algorithms, and several variants for each of the five datasets.

Further, we again compared against several variants of these algorithms: a long-only, short-only, and long/short with leverage variant of both U-BAH and U-CRP, and a long-only and short-only with leverage variant of the best single stock. The results of these algorithms and variants on the 5 datasets are presented in Table 8.3. We can see that out of those algorithms that only invest in long positions and without leverage, the best single stock tends to perform the best with OLU being competitive in most cases. We can see how the market performs by looking at the U-BAH algorithm. If U-BAH returns < 1 then the market was down for that dataset. For such datasets (DJIA, SP500b), we see that both U-CRP and UP perform about the same and lose money. However, the best stock earns money for these datasets. We can see that there is at least one stock that performs well even if the majority of the stocks do not. OLU is able to identify this stock, or the few similar stocks, for SP500b and earn money but is not able to do the same for DJIA. Comparing the U-BAH and U-CRP variants on the SP500b dataset, we see that only U-BAH* (SO) and (LS) are able to earn money. Since for this dataset, the vast majority of the stocks decrease in value, these variants are able to hold short positions and take advantage of this. U-CRP* (SO) and (LS) lose money but still perform better than the (LO) variant for this dataset.


Figure 8.2: Total value of the hedging penalty Ωh with varying λ for the NYSE dataset. As λ increases, the value of Ωh decreases.

For the NYSE dataset, we can see that U-CRP* (LO) earns significantly more wealth than all the other algorithms (except SHERAL). Since all the stocks in the NYSE dataset increase in value, holding long positions is beneficial and adding leverage provides even more earning power. However, comparing the leveraged (LO) best stock with the non-leveraged (LO) best stock, we can see that for each dataset the leverage does not appear to help that much and in some cases even hurts the cumulative wealth. This is because leverage magnifies not only gains but also losses. We can see the benefits of leverage but also the drawbacks. If we have a portfolio that is allowed to switch between stocks with leverage, there can be huge gains. However, if we are stuck investing in a single stock with leverage, there are few gains and even some losses. Looking at the last row of Table 8.3, we observe the results for our SHERAL algorithm. We can see that SHERAL is able to outperform all other algorithms for each of the datasets even though they are very different in nature. For example, SHERAL earns $7.84 on the SP500b dataset where 96% of the stocks decrease in value and is also able to earn $1.81 × 10^15 on the NYSE dataset where all stocks increase in value. This shows the flexibility and power of being able to invest in both long and short positions with leverage.


Figure 8.3: Active stocks with varying λ. As λ increases the number of active stocks decreases which indicates either more hedging or less total investing.

8.6.3 Effect of Hedging Function Ωh and λ

The hedging penalty function Ωh encourages structural hedging between assets. The amount of hedging is further controlled by the value of λ which has an effect on (a) the value of Ωh each day, (b) the number of active stocks, and (c) the number of active positions.

(a) Value of Ωh With the hedging penalty function Ωh, we are encouraging different levels of asset hedging depending on the value of λ. From Figure 8.2, we can see the effect λ has on Ωh = p^⊤Lp with η = 0.1, α1 = 0.1, and α2 = 0.1 for the NYSE dataset with δ = 20 days. With a low λ value of 10^{-6}, there are many days with large values of

Ωh. As we increase λ to 100, we see that the value of Ωh becomes very small with most days having a value of 0.

(b) Number of Active Stocks An active stock is a stock which has a significant percent of the wealth, e.g., 1%, 10%, etc., invested in it between both long and short positions. For instance, if there is 1% of the wealth invested in the long position and 1% invested in the short position, the total effective wealth invested is 0% since the amounts are equal and cancel out. An active stock has a significant amount of total effective wealth invested. It is a measure of the number of non-hedged assets where a higher number indicates less hedging and a lower number indicates more hedging or less total investing, i.e., more wealth held in the bank. From Figure 8.3, we can see that with η = 0.1, α1 = 1, α2 = 10, and λ = 10^{-6}, the number of active stocks for the NYSE dataset is high, with around 25 out of 36 stocks active (about 70%) on average.

Figure 8.4: Active positions with varying λ. When the NYSE Index decreases, hedging increases with λ = 0.01.

With the other parameters fixed and only increasing λ to 0.01, the number of active stocks decreases to around 4 (about 11%). This could be due to two reasons: (1) the amount invested in both long and short positions increases and the total effective wealth invested decreases, or (2) the total amount invested decreases with the rest kept in the bank.

(c) Number of Active Positions We define an active position in a similar way to that of an active stock, however we consider the positions independent of one another. If both the long and short positions have the same and significant amount invested in them, then both are considered active. From Figure 8.4, we can see the effect λ has on the number of active positions for the same parameter values as those of Figure 8.3. With a low λ value of 10^{-6}, we can see that the number of active positions is reasonably high, with around the same amount as the number of active stocks in Figure 8.3 (note the scale differences). This shows that when λ is small for this set of parameters, SHERAL is not hedging much. When λ increases to 0.01, the number of active positions increases some days and decreases other days. Comparing this to the NYSE index value in the bottom plot, we can see that the days where SHERAL is hedging more correspond to days that the index decreases, e.g., 1969-1970, 1973-1975, and 1981-1983. This is explained by the fact that with this set of parameters, we are emphasizing long positions more even though α2 > α1 because of the difference in scale. For example, the mean price relative for the NYSE dataset is 1.0006, and for a long position the gain is q`^⊤xt which on average is of the order 1 × q`

8.6.4 Risk Comparison

From Section 8.6.2, we saw that SHERAL is able to return more wealth than all of the state-of-the-art algorithms on all of the datasets, and more wealth than EG and our EG* variants in three out of the five datasets. However, as Markowitz postulated, we should seek low risk in addition to high returns. As such, we compare the risk exposure for each of the competing algorithms with optimal parameters with respect to wealth using three common measures of risk: (a) covariance, (b) Sharpe ratio, and (c) Sortino ratio.

(a) Covariance We compute the covariance Σt using the previous δ days of price relatives. We measure the risk of a portfolio pt w.r.t. a uniform constant rebalanced > > portfolio u as αvar=pt Σtpt/u Σtu. High αvar implies high risk and low αvar implies low risk. (b) Sharpe ratio The Sharpe ratio [113] measures how much the return (percent gain or loss on investment) of a portfolio compensates for the level of risk taken. It computes what can be considered as a risk-adjusted return for a given portfolio and benchmark return. It does this by measuring both the downwards and upwards volatility. A higher Sharpe ratio implies better compensation for the risk exposure. We compute the p Sharpe ratio of a portfolio as αSharpe=(R−Rb)/ var(R−Rb) where R is the return for the portfolio and Rb is the benchmark return which is typically a large index such as the S&P500. (c) Sortino ratio The Sortino ratio [114] is similar to the Sharpe ratio, however it only measures the downwards volatility. Typically, upwards volatility is encouraged as we would gladly accept the price of a stock we have invested long in to go up. However, the Sharpe ratio penalizes this type of volatility where the Sortino ratio does 103


Figure 8.5: Average risk for each algorithm and dataset with optimal parameters in terms of wealth returned. SHERAL computes portfolios with less risk than U-BAH (LO) and U-CRP (LO) for almost all risk measures and datasets and is competitive with the non-leveraged algorithm OLU.

We compute the Sortino ratio as αSortino = (R − Rb)/DR where DR is the standard deviation of negative returns (losses). To be consistent in the plots and have each bar represent the level of risk exposure, we have plotted the negative Sharpe and Sortino ratios since a low ratio implies a high risk relative to the return. Therefore, for each of the bar plots in Figure 8.5, a higher bar height implies higher risk. Additionally, we only compare against algorithms that are not special cases of SHERAL, i.e., we exclude EG and the EG* variants. We also only compare the best performing algorithms and variants with parameters tuned for optimal wealth for each algorithm. From Figure 8.5, we can see that SHERAL is competitive with OLU in terms of risk, and in many cases has less risk. SHERAL consistently has less risk than U-BAH (LO) and U-CRP (LO). These results are encouraging because in addition to earning more wealth, SHERAL is able to reduce risk in spite of using leverage which inherently increases risk. Additionally, for algorithms which do use leverage, U-BAH (LO) and U-CRP (LO), SHERAL is exposed to much less risk than these algorithms and earns more wealth.

8.6.5 Risk and λ

Even though our hedged resource allocation with leverage framework (8.6) does not explicitly take risk into account, we can control it by setting the value of λ. We observe how the value of λ affects the amount of risk our portfolios are exposed to using variance as the measure of risk. We do not consider the Sharpe or Sortino ratios here since with varying parameters, the return in wealth changes drastically and the risk-adjusted returns from the Sharpe and Sortino ratios cannot effectively illustrate how only the risk changes.


Figure 8.6: Average αvar risk for each dataset with varying η and λ. With a higher η value, the average variance risk is higher, however as we increase λ the risk decreases for each η value across all datasets.

From Figure 8.6, we can see that across each dataset the behavior is consistent. As we increase η, the average variance is higher across all values of λ. As we increase λ, the average variance decreases to zero for all η values. This illustrates how as we increase λ, we are encouraging more hedged portfolios and reducing risk. Thus, we can effectively control the level of risk by setting the value of λ.

Chapter 9

Structured Stochastic Linear Bandits

In this chapter, we switch from considering problems which admit full information feedback to problems with bandit information feedback. Recall, in such problems, the algorithm can only observe the loss of the action selected `t(xt) for xt ∈ X ⊂ R^p rather than the loss of all actions the algorithm could have selected `t(x) ∀x ∈ X. Since the algorithm receives less feedback, learning is more difficult and it has been shown in [37] for linear losses selected adversarially that there exists a gap in regret lower bounds between the full and bandit information settings which is of the order √p. We1 will consider a specific bandit problem in this chapter and show how structural assumptions on the loss function parameter vector lead to sharper upper bounds on the pseudo regret. The work in this chapter first appeared as a technical report on the arXiv repository [78].

9.1 Introduction

We consider the stochastic linear bandit problem [38, 1] which proceeds in rounds t =

1, . . . , T where at each round t the algorithm selects a vector xt from some decision set X ⊂ R^p and receives a noisy loss defined as `t(xt) = ⟨xt, θ*⟩ + ηt where θ* is an unknown parameter and ηt is martingale difference noise. The algorithm observes only `t(xt) at

1 The work in this chapter was done in collaboration with Vidyashankar Sivakumar.

each round t and its goal is to minimize the cumulative loss $\sum_{t=1}^{T} \ell_t(x_t)$. We measure its performance by the pseudo regret [24] defined as

\[
R_T = \sum_{t=1}^{T} \langle x_t, \theta^*\rangle - \min_{x^* \in \mathcal{X}} \sum_{t=1}^{T} \langle x^*, \theta^*\rangle . \tag{9.1}
\]
The stochastic linear bandit can be used to model problems in several real-world applications ranging from recommender systems to medical treatments to network security. Frequently, in such applications, one has knowledge of the structure of the unknown parameter θ*, for example, θ* may be sparse, group sparse, or low-rank. Previous works [38, 1] either made no structural assumptions on θ* and proved regret bounds2 of the form $\widetilde{O}(p\sqrt{T})$ or assumed θ* was s-sparse (s non-zero elements) and showed [2] the regret sharpens to $\widetilde{O}(\sqrt{spT})$. In this chapter, we consider the setting where θ* is any generally structured vector (sparse, group sparse, low-rank, etc.) such that the structure can be captured by some norm (L1, L(1,2), nuclear norm, etc.).

Our approach follows previous works [38, 1, 2] which use the optimism-in-the-face-of-uncertainty (OFU) principle [24] to design a class of algorithms which construct a confidence ellipsoid Ct such that θ* ∈ Ct across all rounds with high-probability.

After which, the algorithm optimistically selects an xt+1 ∈ X and θ̃t+1 ∈ Ct such that ⟨xt+1, θ̃t+1⟩ ≤ ⟨x*, θ*⟩. Our algorithm differs from previous algorithms [38, 1, 2] since we select samples uniformly at random from specific subsets of X. More specifically, previous works selected the sample xt+1 ∈ X and θ̃t+1 ∈ Ct which minimizes the quantity ⟨xt+1, θ̃t+1⟩ by solving a bilinear optimization problem, however, such an approach only gives one, possibly unique, solution which will not allow our analysis to hold. We build on such works by first solving a similar bilinear optimization problem to get an optimistic x′t+1 but then we center an L2 ball of suitable radius over x′t+1 and select xt+1 uniformly at random from the ball and deterministically compute θ̃t+1.

2 The Oe(·) notation selectively hides constants and log terms. 107 any norm structured θ∗. Moreover, we desire that the ellipsoids are tighter than previous works in order to provide sharper regret bounds. Previous works [38, 1] constructed the confidence ellipsoids by solving a ridge regression problem to compute an estimate θˆt and centered an ellipsoid over the estimate. We generalize such an approach by instead solving a norm regularized regression problem, e.g., Lasso, given the structure of θ∗. ∗ We show our construction of Ct contains θ across all rounds with high-probability by extending results in structured estimation [25, 30, 101, 9] which rely on i.i.d. samples. The main technical result we show is that the radius of our confidence ellipsoids depend on the Gaussian width3 of sets associated with the structure of θ∗ which leads to tighter confidence ellipsoids than previous works [38, 1] when θ∗ is structured. For √ example, with an s-sparse θ∗ the radius of the confidence ellipsoid scales as O( s log p) √ compared to O( p) in the unstructured settings considered in [38, 1]. The regret bounds for our algorithm follow from the analysis in [38] and depend on the radius of the confidence ellipsoid therefore, our regret bounds scale with the structure of θ∗ as measured by the Gaussian width. Recall from Section 3.2 that the √ regret bounds for an unstructured and s-sparse θ∗ were shown in [38, 2] to be Oe(p T ) √ and Oe( spT ) respectively. We show the regret of our algorithms matches such works and further shows how the regret scales for any other type of structure such as group sparse or low-rank which has not been considered in the literature.

9.2 Background: High-Dimensional Structured Estimation

We rely on developments in the analysis of non-asymptotic bounds for structured estimation in high-dimensional statistics. Here, we will discuss the main results needed for our analysis which can be found in the following papers [25, 13, 30, 100, 121, 18, 9, 10]. In high-dimensional structured estimation, one is concerned with settings in which the dimension p of the parameter θ* to be estimated is significantly larger than the sample size t, i.e., p ≫ t. It is known that for t i.i.d. Gaussian samples, one can compute an estimate θ̂t using least squares regression which converges to θ* at a rate of $O\big(\sqrt{p/t}\big)$. The convergence rate can be improved when θ* is structured which is usually

3 The Gaussian width is a geometric characterization of the size of a set and the definition is presented in Section C.1.

characterized as having a small value according to some norm R(·). For such problems, estimation is performed by solving a norm regularized regression problem

\[
\hat{\theta}_t := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \mathcal{L}(\theta, Z_t) + \lambda_t R(\theta) \tag{9.2}
\]
where L(·, ·) is a convex loss function4, Zt is a dataset of random i.i.d. pairs {(xi, yi)}_{i=1}^{t} where xi ∈ R^p is a sample, yi ∈ R is the response, and λt is the regularization parameter.

(9.2). For a suitably large λt, [9] showed the random error vector deterministically belongs to the restricted error set

nˆ ∗ p ˆ ∗ 1 ˆ ∗ o Er,t = θt − θ ∈ R : R(θt) ≤ R(θ ) + ρ R(θt − θ ) (9.3) where ρ > 1 is a constant which we fix as ρ = 2 for ease of exposition. For such a ρ, ∗ Er,t is a restricted set of directions, in particular, the error vector θˆt − θ cannot be in the direction of θ∗. Using the restricted error set, bounds on the estimation error can be established which hold with high-probability under two assumptions. First, the regularization parameter λt must satisfy the inequality

∗ ∗ λt ≥ 2R (∇L(θ ,Zt)) (9.4) where R∗(·) is the dual norm of R(·). Second, the loss function must satisfy the restricted strong convexity (RSC) condition in the restricted error set Er,t as illustrated in [100]. Specifically, there exists a κ > 0 such that

\[
L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla L(\theta^*), \hat{\theta}_t - \theta^* \rangle \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \qquad \forall\, \hat{\theta}_t - \theta^* \in E_{r,t} \,. \tag{9.5}
\]

For the squared loss, the RSC condition simplifies to the restricted eigenvalue (RE) condition
\[
\frac{1}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \qquad \forall\, \hat{\theta}_t - \theta^* \in E_{r,t} \tag{9.6}
\]
where X_t ∈ R^{t×p} is the design matrix [30]. Under such conditions, the following bound holds with high probability [100, 9]:
\[
\| \hat{\theta}_t - \theta^* \|_2 \le c\, \psi(E_{r,t}) \frac{\lambda_t}{\kappa} \tag{9.7}
\]
where ψ(E_{r,t}) = sup_{u ∈ E_{r,t}} R(u)/‖u‖_2 is the norm compatibility constant and c > 0 is a constant. For an s-sparse θ^*, one obtains ‖θ̂_t − θ^*‖_2 ≤ O(√(s log p / t)), and for a group sparse θ^*, one obtains ‖θ̂_t − θ^*‖_2 ≤ O(√(s_G(m + log K) / t)), where K is the number of groups, m is the maximum group size, and s_G is the group sparsity level. Similar bounds can be computed for other types of structure including low-rank.
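To make the estimation step concrete, the following is a minimal sketch (in Python/NumPy with scikit-learn, which are illustrative tools rather than anything prescribed by this thesis) of solving the norm regularized problem (9.2) with the L1 norm and comparing the estimation error with the s-sparse rate above; the dimensions, sparsity level, noise level, and regularization constant are assumptions made purely for illustration.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, t, s = 500, 200, 5                        # dimension, samples, sparsity (illustrative)

theta_star = np.zeros(p)
theta_star[rng.choice(p, s, replace=False)] = 1.0
theta_star /= np.linalg.norm(theta_star)     # unit-norm, s-sparse parameter

X = rng.standard_normal((t, p))              # i.i.d. Gaussian design
y = X @ theta_star + 0.1 * rng.standard_normal(t)

lam = 0.1 * np.sqrt(np.log(p) / t)           # lambda_t on the order suggested by the theory
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

err = np.linalg.norm(theta_hat - theta_star)
print(f"||theta_hat - theta*||_2 = {err:.3f}, "
      f"sqrt(s log p / t) = {np.sqrt(s * np.log(p) / t):.3f}")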

9.3 Structured Bandits: Problem and Algorithm

Here, we will formally define the problem, mention the assumptions for our analysis, and present our algorithm. The results and analysis are presented in subsequent sections.

9.3.1 Problem Setting

We consider the stochastic linear bandit problem [38, 1] where in each round t = 1, ..., T the algorithm selects a vector x_t from the decision set X ⊂ R^p and receives a loss of ℓ_t(x_t) = ⟨x_t, θ^*⟩ + η_t. Our focus is on settings where the unknown parameter θ^* ∈ R^p is structured, which we characterize as having a small value according to some norm R(·). The goal of the algorithm is to minimize its cumulative loss Σ_t ℓ_t(x_t), and we measure the performance of the algorithm in terms of the fixed cumulative pseudo-regret

\[
R_T = \sum_{t=1}^{T} \langle x_t, \theta^* \rangle - \min_{x^* \in \mathcal{X}} \sum_{t=1}^{T} \langle x^*, \theta^* \rangle \,. \tag{9.8}
\]

We require that the algorithm's regret grow sublinearly in T, i.e., R_T = o(T), and desire that, with high probability, it grows with the structure of θ^* rather than the ambient dimensionality p. The following are the assumptions under which our analysis holds; most are standard in the literature [38, 1, 2].
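As a concrete illustration of this protocol, the following is a minimal sketch of the interaction loop and of how the pseudo-regret (9.8) accumulates in a simulated environment; the decision set (the unit L2 ball), the sparse θ^*, and the placeholder select_action routine are assumptions made only for illustration and are not part of the algorithm developed below.

import numpy as np

rng = np.random.default_rng(1)
p, T = 50, 1000
theta_star = np.zeros(p); theta_star[:3] = [0.6, -0.6, 0.5]
theta_star /= np.linalg.norm(theta_star)

def select_action(t):
    # placeholder policy: a uniformly random direction on the unit sphere;
    # Algorithm 9 replaces this with its optimistic, structure-aware choice
    x = rng.standard_normal(p)
    return x / np.linalg.norm(x)

x_best = -theta_star                      # minimizer of <x, theta*> over the unit ball
regret = 0.0
for t in range(1, T + 1):
    x_t = select_action(t)
    loss = theta_star @ x_t + 0.1 * rng.standard_normal()   # bandit feedback: only this value is seen
    regret += theta_star @ x_t - theta_star @ x_best        # pseudo-regret uses expected losses
print(f"cumulative pseudo-regret after T={T} rounds: {regret:.1f}")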

Assumptions and Definitions

Assumption 1 The decision set X ⊂ R^p is a compact (closed and bounded) convex set with non-empty interior. For ease of exposition, we assume X ⊆ B̄_2^p, the (closed) unit L2 ball defined as B̄_2^p = {x ∈ R^p : ‖x‖_2 ≤ 1}, to avoid scaling factors.

Assumption 2 The noise sequence {η_1, ..., η_T} is a bounded martingale difference sequence (MDS), i.e., |η_t| ≤ B and E[η_t | F_{t−1}] = 0 for all t, where F_t = {x_1, ..., x_{t+1}, η_1, ..., η_t} is a filtration (sequence of σ-algebras). We assume bounded noise for simplicity; however, the results hold for any sub-Gaussian noise (refer to Section C.1 for the definition).

Assumption 3 We assume the unknown parameter θ^* is fixed for all rounds, the structure is known (for example, for an s-sparse θ^* the value of s is known), and ‖θ^*‖_2 = 1.

Assumption 4 The number of rounds T is known a priori.

Assumption 5 There exists a constant κ > 0 such that the RE condition (9.6) holds for a design matrix X_t where each row is drawn randomly from the intersection of the decision set X and an L2 ball whose radius decreases at a rate of the order 1/√t.

Definition 4 The Gaussian width [30] of a set A is w(A) = E[ sup_{u ∈ A} ⟨g, u⟩ ], where the expectation is over g, a random vector with i.i.d. zero mean, unit variance Gaussian entries.

Definition 5 The diameter of a set A is φ(A) = sup_{u,v ∈ A} ‖u − v‖_2 = sup_{u ∈ A} 2‖u‖_2.

Definition 6 Let E_{r,t} := {θ̂_t − θ^* ∈ R^p : R(θ̂_t) ≤ R(θ^*) + (1/2) R(θ̂_t − θ^*)} be the restricted error set and E_{r,max} = argmax_{E_r ∈ {E_{r,1}, ..., E_{r,T}}} w(E_r) the largest such set.

Definition 7 The set A_t := cone(E_{r,t}) ∩ S^{p−1} is a spherical cap, where S^{p−1} is the unit sphere in p dimensions, and A_max := cone(E_{r,max}) ∩ S^{p−1} is the largest such cap.

Definition 8 The unit norm R(·) ball is Ω_R := {u ∈ R^p : R(u) ≤ 1}. With respect to vectors in E_{r,t}, the norm compatibility constant at round t is ψ(E_{r,t}) = sup_{u ∈ E_{r,t}} R(u)/‖u‖_2.
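To give a feel for Definition 4, the following small Monte Carlo sketch estimates the Gaussian width of the unit L1 ball; for that set, sup_{u ∈ Ω_R} ⟨g, u⟩ = ‖g‖_∞, so the estimate should be close to the known √(2 log p) scaling. The sample count and dimension are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
p, n_samples = 1000, 5000

# w(A) = E[ sup_{u in A} <g, u> ]; for the unit L1 ball the supremum equals max_i |g_i|
g = rng.standard_normal((n_samples, p))
width_estimate = np.abs(g).max(axis=1).mean()

print(f"Monte Carlo w(L1 ball) ~ {width_estimate:.2f}, "
      f"sqrt(2 log p) = {np.sqrt(2 * np.log(p)):.2f}")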

9.3.2 Algorithm

For the initial t = 1, ..., n = c′p rounds, where c′ > 0 is a constant, our algorithm selects vectors x_{1:n} := {x_1, ..., x_n} uniformly at random from X and receives the losses ℓ_{1:n} := {ℓ_1(x_1), ..., ℓ_n(x_n)}. The random estimation rounds can be considered the "burn-in" period, similar to the use of a barycentric spanner or identity matrix as in [38, 1].

Algorithm 9 Structured Stochastic Linear Bandit
1: Input: p, X, R(·), T, E_{r,max}, Ω_R, γ, c_0, c′, C
2: Set β = C ψ(E_{r,max}) ( w(Ω_R) + (γ + √(2 log T)) φ(Ω_R)/2 )   (9.28)
3: Play n = c′p uniform i.i.d. random vectors x_{1:n} ∈ X and receive losses ℓ_{1:n}
4: For t = n, ..., T
5:   Compute X_t = [x_1 ... x_t]^⊤, y_t = [ℓ_1(x_1) ... ℓ_t(x_t)]^⊤, and D_t = X_t^⊤ X_t
6:   Set λ_t = c_0 ( w(Ω_R) + (γ + √(2 log T)) φ(Ω_R)/2 ) / √t   (9.27)
7:   Compute θ̂_t = argmin_{θ ∈ R^p} (1/t) ‖y_t − X_t θ‖_2^2 + λ_t R(θ)
8:   Construct C_t := {θ : ‖θ − θ̂_t‖_{2,D_t} ≤ β}
9:   Compute (x′_{t+1}, θ′_{t+1}) := argmin_{x ∈ X, θ ∈ C_t ∩ S^{p−1}} ⟨x, θ⟩
10:  Play x_{t+1} ∼ Uniform( X ∩ B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) ) and receive loss ℓ_{t+1}(x_{t+1})
11: End For

After the loss ℓ_n(x_n) is received in round n, the algorithm constructs an (n × p)-dimensional design matrix X_n = [x_1 ... x_n]^⊤, a sample covariance matrix D_n = X_n^⊤ X_n, and an n-dimensional response vector y_n = [ℓ_1(x_1) ... ℓ_n(x_n)]^⊤. The algorithm then computes an estimate θ̂_n by solving a norm regularized regression problem, constructs a confidence ellipsoid using the Mahalanobis distance defined as ‖θ − θ̂_n‖_{2,D_n} = √((θ − θ̂_n)^⊤ D_n (θ − θ̂_n)), then selects a sample to play. Specifically, the algorithm performs the following four main steps sequentially in each round thereafter.

For each t = n, . . . , T :

1. Compute an estimate:
\[
\hat{\theta}_t := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \| y_t - X_t \theta \|_2^2 + \lambda_t R(\theta) \tag{9.9}
\]
2. Construct a confidence ellipsoid:
\[
C_t := \left\{ \theta \in \mathbb{R}^p : \| \theta - \hat{\theta}_t \|_{2, D_t} \le \beta \right\} \tag{9.10}
\]

3. Compute an optimal solution:
\[
(x'_{t+1}, \theta'_{t+1}) := \operatorname*{argmin}_{x \in \mathcal{X},\; \theta \in C_t \cap S^{p-1}} \; \langle x, \theta \rangle \tag{9.11}
\]
4. Play the vector x_{t+1} ∼ Uniform( X ∩ B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) ) and receive the loss ℓ_{t+1}(x_{t+1}),

where B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) is a closed L2 ball centered at x′_{t+1} which has a radius of ‖x′_{t+1}‖_2 / (2√t). After receiving ℓ_{t+1}(x_{t+1}), the design matrix X_{t+1} and response vector y_{t+1} are updated with x_{t+1} and ℓ_{t+1}(x_{t+1}) respectively, the sample covariance matrix D_{t+1} = X_{t+1}^⊤ X_{t+1} is recomputed, and the regularization parameter λ_{t+1} is updated.
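As a concrete illustration of steps 1 and 2, the following is a minimal sketch (Python with scikit-learn, both assumptions of this sketch rather than tools used in the thesis) of computing the structured estimate with the L1 norm and testing membership in the confidence ellipsoid C_t via the Mahalanobis distance; λ_t and β are taken as given inputs rather than computed from (9.27) and (9.28).

import numpy as np
from sklearn.linear_model import Lasso

def confidence_ellipsoid(X_hist, y_hist, beta, lam):
    """Steps 1-2: structured estimate and a membership test for the ellipsoid C_t."""
    # Step 1: norm regularized estimate (here the L1 norm, i.e., a Lasso fit)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X_hist, y_hist).coef_
    D_t = X_hist.T @ X_hist                       # sample covariance matrix

    def in_ellipsoid(theta):
        diff = theta - theta_hat
        return np.sqrt(diff @ D_t @ diff) <= beta  # Mahalanobis distance ||.||_{2,D_t}

    return theta_hat, in_ellipsoid

A candidate θ would then be accepted as a feasible point for step 3 only if in_ellipsoid(θ) returns True.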

Discussion

Step 1. An estimate is computed by solving a norm regularized regression problem following the existing results discussed in Section 9.2. This generalizes previous works [38, 1], which only consider computing an estimate by solving the ridge regression problem.

Step 2. A confidence ellipsoid is constructed in order to allow the algorithm to explore in certain directions. Since the confidence ellipsoid is defined as C_t = {θ ∈ R^p : ‖θ − θ̂_t‖_{2,D_t} ≤ β}, we focus on bounds for ‖θ̂_t − θ^*‖_{2,D_t}. Extending the results in Section 9.2, we will show in Section 9.5 high-probability bounds on the estimation error of the form ‖θ̂_t − θ^*‖_{2,D_t} ≤ c ψ(E_{r,t}) (λ_t/√κ) √t. Therefore, setting β to the right hand side will give bounds such that θ^* ∈ C_t with high probability. The value of β then depends on two key terms: the regularization parameter λ_t and the restricted eigenvalue (RE) constant

κ detailed in (9.6). The value of λ_t is set by the user, and we will provide an explicit characterization of its value in Section 9.5. Moreover, the estimation error bound holds with the RE constant κ under Assumption 5.

Steps 3 and 4. These steps are motivated by the regret analysis established in [38] and by the need to make Assumption 5 reasonable. Let the instantaneous regret at round t + 1 be defined as r_{t+1} = ⟨x_{t+1}, θ^*⟩ − ⟨x^*, θ^*⟩, where x^* = argmin_{x∈X} ⟨x, θ^*⟩. As shown in [38], by selecting an x_{t+1} and θ̃_{t+1} via

\[
(x_{t+1}, \tilde{\theta}_{t+1}) := \operatorname*{argmin}_{x \in \mathcal{X},\; \theta \in C_t} \; \langle x, \theta \rangle \tag{9.12}
\]
the instantaneous regret can be bounded as r_{t+1} = ⟨x_{t+1}, θ^*⟩ − ⟨x^*, θ^*⟩ ≤ ⟨x_{t+1}, θ^*⟩ − ⟨x_{t+1}, θ̃_{t+1}⟩ because we optimize over both x and θ. Therefore, one obtains the inequality ⟨x_{t+1}, θ̃_{t+1}⟩ ≤ ⟨x^*, θ^*⟩, on which the entire regret analysis relies. We will use the regret analysis from [38]; therefore, we need to select an x_{t+1} and θ̃_{t+1} such that the above inequality holds. Further, recall the RE condition in (9.6),
\[
\frac{1}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \qquad \forall\, \hat{\theta}_t - \theta^* \in E_{r,t} \,.
\]
We must have κ > 0 for the estimation error bound used to compute β to hold. Therefore, in order for such a κ to exist, we need samples which are not too correlated, otherwise the design matrix will be ill-conditioned. To use the regret analysis and satisfy the RE condition, we cannot exactly follow existing work [38, 1, 2] and select an x_{t+1} by solving (9.12), since we may obtain a single unique solution and the rows of the design matrix will be too correlated. Instead, we select samples uniformly at random from specific subsets of X, which spreads the samples out enough that it is reasonable to assume the RE condition holds. Moreover, as we will show in Section C.3, for any random sample x_{t+1} we select, we can deterministically compute a θ̃_{t+1} ∈ C_t such that the inequality ⟨x_{t+1}, θ̃_{t+1}⟩ ≤ ⟨x^*, θ^*⟩ holds. The only requirement is that we select an x_{t+1} from an L2 ball which shrinks at a rate of O(1/√T) in order to obtain sublinear regret in T.

Steps 1, 2, and 4 can be performed efficiently; in particular, there are several efficient methods for computing the estimate in (9.9) for common regularizers, e.g., L1,

L_{(1,2)}, nuclear norm, etc. [43, 105, 19]. Step 3 is computationally difficult in general (similar to all previous work); however, for simple decision sets such as the unit L2 ball, a solution can be computed efficiently by solving the corresponding quadratically constrained quadratic program. Our algorithm for structured stochastic linear bandits is presented in Algorithm 9.

9.4 Regret Bound for Structured Bandits

Here, we present the main result, which is a high-probability bound on the regret of Algorithm 9, and show examples for popular types of structure. The analysis of the bound is presented in Section 9.5.

First, we recall a few definitions introduced in Section 9.3.1 which will help interpret the main result. We define the set Ω_R as the unit norm R(·) ball and A_max ⊂ S^{p−1} as the spherical cap of the largest restricted error set. For such sets, w(Ω_R) and w(A_max) are the Gaussian widths. Moreover, we define φ(Ω_R) to be the diameter of the set

Ω_R, E_{r,max} as the largest restricted error set, and ψ(E_{r,max}) as the norm compatibility constant of the largest restricted error set. Under such assumptions, we present the main result in a high-level form which hides the exact nature of the constants involved. A more explicit form of the constants is presented in the appendix.

The main result consists of two theorems for the problem independent and problem dependent settings [38]. Let E be the set of all extremal points. The problem independent setting occurs when the difference between the expected loss of the best extremal point x^* and the expected loss of the second best extremal point is zero, i.e., ∆ = inf_{x∈E} ⟨x, θ^*⟩ − ⟨x^*, θ^*⟩ = 0. Such a setting occurs, for example, when the decision set is the unit L2 ball. The problem dependent setting occurs when ∆ > 0, for example, when the decision set is a polytope.

Theorem 9 (Problem Independent Regret Bound) Under the Section 9.3.1 assumptions and for any γ > 0, choose the radius of the ellipsoid in Algorithm 9 as
\[
\beta = c_0\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) . \tag{9.13}
\]
Then, for any T > c′p, with probability at least 1 − c_1 exp(−γ²), the fixed cumulative regret of Algorithm 9 is at most
\[
R_T \le O\!\left( \psi(E_{r,\max}) \left( w(\Omega_R) + \gamma + \sqrt{2 \log T} \right) \sqrt{p\, T \log T} \right) , \tag{9.14}
\]
where c′, c_0, c_1 > 0 are constants.

Theorem 10 (Problem Dependent Regret Bound) Under the Section 9.3.1 assumptions and for any γ > 0, choose the radius of the ellipsoid in Algorithm 9 as
\[
\beta = c_0\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) . \tag{9.15}
\]
Then, for any T > c′p, with probability at least 1 − c_1 exp(−γ²), the fixed cumulative regret of Algorithm 9 with a decision set which has non-zero gap ∆ > 0 is at most
\[
R_T \le O\!\left( \psi^2(E_{r,\max}) \left( w(\Omega_R) + \gamma + \sqrt{2 \log T} \right)^2 \frac{p \log T}{\Delta} \right) , \tag{9.16}
\]
where c′, c_0, c_1 > 0 are constants.

9.4.1 Examples

We present the problem independent regret of popular types of structured θ^* using Theorem 9 and the values of ψ(E_{r,max}) and w(Ω_R) from [30, 10, 31]. The problem dependent regret can be similarly computed. Only unstructured and sparse θ^* have been considered previously [38, 1, 2, 26]; no previous works have considered any other types of structure, including group sparse and low-rank.

Example 1 (Unstructured) For problems where θ^* is not structured, we simply use R(θ) = ‖θ‖_2^2 and solve the ridge regression problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \| y_t - X_t \theta \|_2^2 + \lambda_t \| \theta \|_2^2 = \left( X_t^\top X_t + \lambda_t I_{p \times p} \right)^{-1} X_t^\top y_t \,. \tag{9.17}
\]
We compute the regret by plugging in the values ψ(E_{r,max}) = O(1) and w(Ω_R) = O(√p) to obtain a regret of Õ(p√T). The regret matches [38, 1] up to log and constant factors.

Example 2 (Sparse) For problems where θ^* is s-sparse (s non-zeros), one common regularizer to induce sparse solutions is R(θ) = ‖θ‖_1. With such a regularizer, we solve the Lasso problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \| y_t - X_t \theta \|_2^2 + \lambda_t \| \theta \|_1 \,. \tag{9.18}
\]
We compute the regret by plugging in the values ψ(E_{r,max}) = O(√s) and w(Ω_R) = O(√(log p)) to obtain a regret of Õ(√(s log p) √(pT)), which matches [2] up to log and constant factors. Note, it is worse than the regret from [26], which is Õ(s√T); however, they consider a different noise model in the loss function.
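As a quick numerical illustration of Example 1, the following sketch checks that the closed-form expression in (9.17) agrees with a direct solve of the penalized least squares problem (with the 1/t factor absorbed into λ); the sizes and λ are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
t, p, lam = 100, 20, 0.5
X = rng.standard_normal((t, p))
y = rng.standard_normal(t)

# closed form of the ridge regression estimate, as in (9.17)
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# the same estimate from the augmented least squares system [X; sqrt(lam) I] theta = [y; 0]
theta_direct = np.linalg.lstsq(
    np.vstack([X, np.sqrt(lam) * np.eye(p)]),
    np.concatenate([y, np.zeros(p)]),
    rcond=None,
)[0]
print(np.allclose(theta_closed, theta_direct))   # True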

Example 3 (Group Sparse) Let {1, ..., p} be an index set of θ^* and G = {G_1, ..., G_K} be a known set of K groups which define a disjoint partitioning of the index set. For group sparse problems, one common regularizer is R(θ) = Σ_{i=1}^{K} ‖θ_{G_i}‖_2, where θ_{G_i} is a vector with elements equal to θ for indices in G_i and 0 otherwise. With such a regularizer, we solve the group lasso problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \| y_t - X_t \theta \|_2^2 + \lambda_t \sum_{i=1}^{K} \| \theta_{G_i} \|_2 \,. \tag{9.19}
\]
With maximum group size m = max_i |G_i| and a subset S_G ⊂ {1, ..., K} of the groups with cardinality s_G, which denotes the number of active groups, we compute the regret by plugging in the values ψ(E_{r,max}) = O(√s_G) and w(Ω_R) = O(√(m + log K)) to obtain a regret of Õ(√(s_G(m + log K)) √(pT)).
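Solvers for the group lasso problem (9.19), e.g., proximal gradient methods [43, 19], rely on the proximal operator of the group norm, which is a blockwise soft-thresholding. The following is a minimal sketch of that operator; the group partition and threshold are illustrative.

import numpy as np

def group_soft_threshold(theta, groups, tau):
    """Prox of tau * sum_i ||theta_{G_i}||_2: shrink each group's norm by tau."""
    out = np.zeros_like(theta)
    for g in groups:                       # groups: list of index arrays (disjoint partition)
        block = theta[g]
        norm = np.linalg.norm(block)
        if norm > tau:
            out[g] = (1.0 - tau / norm) * block   # groups below the threshold are set to zero
    return out

theta = np.array([3.0, 4.0, 0.1, -0.1, 1.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_soft_threshold(theta, groups, tau=1.0))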

Example 4 (Low-Rank) Let Θ^* ∈ R^{d×p} be a matrix with rank r, and suppose we select a matrix X_t ∈ R^{d×p} at each round. Define the loss we receive as ℓ_t(X_t) = trace(X_t^⊤ Θ^*) + η_t. For problems where the rank of Θ^* is small, for example r ≪ min(d, p), one common regularizer to use is the nuclear norm R(Θ) = ‖Θ‖_* = Σ_{j=1}^{min{d,p}} σ_j(Θ), where σ_j(Θ) are the singular values of Θ. With such a regularizer, we solve the trace-norm regularized least squares problem
\[
\hat{\Theta}_t = \operatorname*{argmin}_{\Theta \in \mathbb{R}^{d \times p}} \; \frac{1}{t} \sum_{i=1}^{t} \left( y_i - \mathrm{trace}(X_i^\top \Theta) \right)^2 + \lambda_t \| \Theta \|_* \,. \tag{9.20}
\]
We compute the regret by plugging in the values ψ(E_{r,max}) = O(√r) and w(Ω_R) = O(√(d + p)) from [101] to obtain a regret of Õ(√(r(d + p)) √(dpT)).

9.5 Overview of the Analysis

The analysis starts from a regret result established in [38]. First, denote r_t = ⟨x_t, θ^*⟩ − ⟨x^*, θ^*⟩ as the instantaneous regret acquired by the algorithm on round t, where the optimal vector x^* is defined as x^* = argmin_{x∈X} ⟨x, θ^*⟩. Then for Algorithm 9, as long as we have θ^* ∈ C_t over all rounds t, [38, Theorem 6] shows that Σ_{t=1}^{T} r_t² ≤ 8β²p log T. Then, to establish a problem independent regret bound, we directly apply the Cauchy-Schwarz inequality to get

\[
R_T = \sum_{t=1}^{T} r_t \le \left( T \sum_{t=1}^{T} r_t^2 \right)^{1/2} \le \beta \sqrt{8 p T \log T} \,, \tag{9.21}
\]

which holds conditioned on θ^* ∈ C_t over all rounds t. Moreover, for a problem dependent regret bound, we follow the proof of [38, Theorem 1], which shows

\[
R_T = \sum_{t=1}^{T} r_t \le \sum_{t=1}^{T} \frac{r_t^2}{\Delta} \le \frac{8 p \beta^2 \log T}{\Delta} \,, \tag{9.22}
\]

which holds conditioned on θ^* ∈ C_t over all rounds t.

The focus of our analysis is then to choose a β such that the condition holds with high probability uniformly over all rounds. From Algorithm 9, since C_t := {θ : ‖θ − θ̂_t‖_{2,D_t} ≤ β} and we want to have θ^* ∈ C_t, we focus on bounds for ‖θ̂_t − θ^*‖_{2,D_t}, the instantaneous estimation error. Building on ideas for high-dimensional structured estimation as discussed in Section 9.2, deterministic bounds on the instantaneous estimation error can be obtained under two assumptions. First, we need to choose the regularization parameter λ_t such that
\[
\lambda_t \ge 2 R^*\!\left( \frac{1}{t} X_t^\top (y_t - X_t \theta^*) \right) . \tag{9.23}
\]

Second, for all θ̂_t − θ^* ∈ E_{r,t}, we need to have the restricted eigenvalue (RE) condition for a constant κ > 0:

\[
\inf_{\hat{\theta}_t - \theta^* \in E_{r,t}} \; \frac{1}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \,. \tag{9.24}
\]

Under these two assumptions, following existing analysis for high-dimensional estimation, we have the following theorem (refer to Section C.2 for the proof).

Theorem 11 Assume that θ̂_t − θ^* ∈ E_{r,t}, the RE condition is satisfied in E_{r,t} with parameter κ, and λ_t is suitably large. Then for any norm R(·) and constant c > 0,
\[
\| \hat{\theta}_t - \theta^* \|_{2, D_t} \le c\, \psi(E_{r,t}) \frac{\lambda_t}{\sqrt{\kappa}} \sqrt{t} \,. \tag{9.25}
\]

In Section C.4 we show that the assumption in (9.23) holds with high probability, and together with Assumption 5 this implies the bound in (9.25) holds with high probability. In particular, for the assumption in (9.23), we show the following result.

Theorem 12 For any γ > 0 and for an absolute constant L > 0, with probability at least 1 − L exp(−γ²), the following bound holds uniformly for all rounds t = 1, ..., T:
\[
R^*\!\left( \frac{1}{t} X_t^\top (y_t - X_t \theta^*) \right) \le 2 L B \, \frac{ w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} }{ \sqrt{t} } \,. \tag{9.26}
\]

Then, from (9.23), for c_0 = 4LB we set λ_t as
\[
\lambda_t \ge c_0 \, \frac{ w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} }{ \sqrt{t} } \,. \tag{9.27}
\]
For a bound on the instantaneous ellipsoidal estimation error, we plug the value of λ_t from (9.27) into (9.25) and use the norm compatibility constant of the largest restricted error set to obtain
\[
\| \hat{\theta}_t - \theta^* \|_{2, D_t} \le C\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right)
\]
where C = c_0 c / √κ is a constant, and the bound holds with high probability across all rounds t = 1, ..., T. Therefore, if we set
\[
\beta = C\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) \tag{9.28}
\]

the confidence ellipsoid C_t will contain θ^* across all rounds with high probability. Substituting our β into the regret bounds in (9.21) and (9.22) gives our main result in Theorem 9 and Theorem 10.

Chapter 10

Structured Stochastic Generalized Linear Bandits

In Chapter 9 we only considered linear loss functions for the stochastic linear bandit problem. In this chapter, we extend such results and consider certain non-linear loss functions following the framework of Generalized Linear Models in statistics.

10.1 Introduction

We begin by recalling the stochastic linear bandit problem [38, 1, 2, 78] and provide a few more details which will make the contributions in this chapter clear. The stochastic linear bandit problem proceeds in rounds t = 1, ..., T where at each round the algorithm selects a vector x_t from a decision set X ⊂ R^p, observes only the loss function value f_t(x_t) ∈ R, and suffers a loss of f_t(x_t). Previous works [38, 1, 2, 78] assumed E[f_t(x_t) | x_t] = ⟨x_t, θ^*⟩, where θ^* ∈ R^p is an unknown vector. Under such an assumption, the noise is η_t = f_t(x_t) − E[f_t(x_t) | x_t], so {η_1, ..., η_T} is a martingale difference sequence by construction, and we rewrite the loss as ℓ_t(x_t) = ⟨x_t, θ^*⟩ + η_t, which is the form we used in Chapter 9. The goal of the algorithm is to minimize its cumulative loss Σ_{t=1}^{T} ℓ_t(x_t), and we measure the performance by the pseudo-regret (hereafter referred to as regret)
\[
R_T = \mathbb{E}\!\left[ \sum_{t=1}^{T} \ell_t(x_t) \right] - \min_{x^* \in \mathcal{X}} \mathbb{E}\!\left[ \sum_{t=1}^{T} \ell_t(x^*) \right] . \tag{10.1}
\]

Previous works [38, 1] have introduced algorithms which incur a regret of Õ(p√T), which shows the regret depends linearly on the ambient dimensionality p and sublinearly on the time horizon T. Under structural assumptions on θ^*, e.g., θ^* is sparse, group sparse, low-rank, etc., we introduced an algorithm in Chapter 9 and showed sharper regret bounds of the order Õ(ψ(E_{r,max}) w(Ω_R) √(pT)). Each of the previous works [38, 1, 2, 78] makes the assumption that the expected loss is a linear function, specifically E[f_t(x_t) | x_t] = ⟨x_t, θ^*⟩, which implicitly assumes the loss is drawn i.i.d. from a Gaussian distribution. However, in many applications which can be modeled using the stochastic linear bandit (recommendations, medical treatments, robot exploration, etc.), such an assumption is inappropriate. For example, when recommending a news article, the loss function is the feedback the user provides, which is a binary click or no-click event. The loss in such a problem is drawn from a Bernoulli distribution and not a Gaussian. In medical treatments, one may be interested in predicting the number of occurrences or the time between recurrences of cancer, which can more accurately be assumed to be drawn from a Poisson or exponential distribution.

In this chapter, we relax the assumption previous works make, follow the classical Generalized Linear Model (GLM) framework, and assume the loss can be drawn from any distribution in the exponential family, with the expected loss defined by a (possibly non-linear) inverse link function, i.e., E[f_t(x_t) | x_t] = g^{−1}(⟨x_t, θ^*⟩). The setting in previous works is a special case of our setting, which we can see by setting the link function g(·) to be the identity function, which gives an inverse link function of g^{−1}(⟨x_t, θ^*⟩) = ⟨x_t, θ^*⟩. For the examples above, we can set the link function to be the logit, log, or inverse functions, which give the following inverse link functions: g^{−1}(⟨x_t, θ^*⟩) = 1/(1 + exp(−⟨x_t, θ^*⟩)), g^{−1}(⟨x_t, θ^*⟩) = exp(⟨x_t, θ^*⟩), and g^{−1}(⟨x_t, θ^*⟩) = ⟨x_t, θ^*⟩^{−1} respectively. No previous works have considered the stochastic linear bandit problem under the GLM setting, with or without structural assumptions on θ^*.

Overview of Results: We present an algorithm which follows previous works [38, 1, 2, 78] and uses the optimism-in-the-face-of-uncertainty (OFU) principle [24]. Similar to our algorithm in Chapter 9, our algorithm constructs a confidence ellipsoid C_t in each round which contains θ^* with high probability, then optimistically selects an x_{t+1} ∈ X and θ̃_{t+1} ∈ C_t such that ⟨x_{t+1}, θ̃_{t+1}⟩ ≤ ⟨x^*, θ^*⟩ in each round. We follow and extend the analysis presented in Chapter 9, and the main technical challenge is showing how to construct tight confidence ellipsoids where the radius of such ellipsoids depends on the structure of θ^*. Such a challenge requires showing how to bound certain empirical stochastic processes, which we show can be accomplished using recent results in structured estimation [9, 10, 98] which rely on techniques such as generic chaining [115, 116]. Using the regret analysis from [38], we show the regret of our algorithm matches previous works even under the GLM setting and is of the form Õ(ψ(E_{r,max}) w(Ω_R) √(pT)). Our results show that the problem is no more difficult than with linear losses and, for an unstructured θ^*, matches the lower bound [38].
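To make the three canonical cases concrete, the following small sketch (illustrative only; none of these names appear in the thesis) defines the identity, logistic, and exponential inverse link functions and draws a Bernoulli loss for the logistic case, matching the click/no-click example above.

import numpy as np

rng = np.random.default_rng(0)

identity = lambda z: z                            # Gaussian losses (Chapter 9 setting)
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))     # Bernoulli losses (click / no-click)
explink  = lambda z: np.exp(z)                    # Poisson losses (count data)

theta_star = np.array([0.5, -0.5])
x_t = np.array([0.8, 0.1])
z = x_t @ theta_star

# expected loss under the logistic inverse link, and one observed bandit loss
mean_loss = sigmoid(z)
observed_loss = rng.binomial(1, mean_loss)
print(mean_loss, observed_loss)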

10.2 Background: Structured Estimation and Generalized Linear Models

In this section, we briefly recall some background material on high-dimensional structured estimation which was also presented in Section 9.2. We also introduce material on Generalized Linear Models (GLMs), all of which can be found in numerous sources including [102, 11, 23].

10.2.1 Structured Estimation

Recall that in high-dimensional structured estimation, one is concerned with estimating an unknown parameter θ^* ∈ R^p which is assumed to be structured. We define a parameter to be structured if it has a small value according to some norm, i.e., the structure can be suitably captured by a norm. The estimation can be performed by solving a norm regularized regression problem

\[
\hat{\theta}_t := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; L(\theta, Z_t) + \lambda_t R(\theta) \tag{10.2}
\]
where L(·,·) is a convex loss function, Z_t is a dataset consisting of random i.i.d. pairs {(x_i, y_i)}_{i=1}^{t} where x_i ∈ R^p is a sample, y_i ∈ R is the response, and λ_t is the regularization parameter. Let θ̂_t − θ^* be the estimation error vector; for a suitably large λ_t, [9] showed the random error vector lives in the restricted error set

\[
E_{r,t} = \left\{ \hat{\theta}_t - \theta^* \in \mathbb{R}^p \,:\, R(\hat{\theta}_t) \le R(\theta^*) + \tfrac{1}{\rho} R(\hat{\theta}_t - \theta^*) \right\} \tag{10.3}
\]
where ρ > 1 is a constant which we set as ρ = 2 for ease of exposition. Now, bounds on the estimation error can be established which hold with high probability under two assumptions. First, the regularization parameter λ_t must satisfy the inequality

\[
\lambda_t \ge 2 R^*\!\left( \nabla L(\theta^*, Z_t) \right) \tag{10.4}
\]
where R^*(·) is the dual norm of R(·). Second, the loss function must satisfy the restricted strong convexity (RSC) condition in the restricted error set E_{r,t} as illustrated in [100]. Specifically, the RSC condition implies that there exists a κ > 0 such that

\[
L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla L(\theta^*), \hat{\theta}_t - \theta^* \rangle \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \qquad \forall\, \hat{\theta}_t - \theta^* \in E_{r,t} \,. \tag{10.5}
\]

Under such conditions, the following bound holds with high probability [100, 9]:

\[
\| \hat{\theta}_t - \theta^* \|_2 \le c\, \psi(E_{r,t}) \frac{\lambda_t}{\kappa} \tag{10.6}
\]

where ψ(E_{r,t}) = sup_{u ∈ E_{r,t}} R(u)/‖u‖_2 is the norm compatibility constant and c > 0 is a constant.

10.2.2 Generalized Linear Models

Generalized Linear Models (GLMs) [102, 11, 23] predict the expected value of a response variable y_t using three key components: a linear predictor η_t = ⟨x_t, θ^*⟩, a probability distribution p(y_t | η_t), and a link function g(·). GLMs assume the response variable y_t is drawn from an exponential family distribution p(y_t | η_t) and that y_t is conditionally independent given y_i and x_i for i = 1, ..., t − 1. The exponential family is a set of distributions where the conditional distribution can be written in the canonical form

\[
p(y_t \mid \eta_t) = h(y_t) \exp\!\left( \eta_t T(y_t) - \varphi(\eta_t) \right)
\]
where η_t is the natural parameter, T(y_t) is a sufficient statistic of the distribution, ϕ(·) is the log-partition function, which is strictly convex and assumed to be twice continuously differentiable, and h(y_t) is a known function. Assuming the natural parameter is defined as η_t = ⟨x_t, θ^*⟩ and T(·) is the identity function, the exponential family is then given by

\[
p(y_t \mid \langle x_t, \theta^* \rangle) = h(y_t) \exp\!\left( y_t \langle x_t, \theta^* \rangle - \varphi(\langle x_t, \theta^* \rangle) \right) .
\]
The log-partition function ensures p(y_t | ⟨x_t, θ^*⟩) is a probability distribution and is defined as
\[
\varphi(\langle x_t, \theta^* \rangle) = \log \int_{y_t} h(y_t) \exp\!\left( y_t \langle x_t, \theta^* \rangle \right) dy_t \,.
\]
The expected value of the response variable can be computed from the log-partition function as
\[
\mathbb{E}[y_t \mid x_t] = \varphi'(\langle x_t, \theta^* \rangle) \,,
\]
where we denote the first and second derivatives of ϕ(·) with a prime and double prime respectively. GLMs relate the linear predictor ⟨x_t, θ^*⟩ to the expected value of the response E[y_t | x_t] through what is called the inverse link function. The inverse link function g^{−1} : R → R can be defined as E[y_t | x_t] = g^{−1}(⟨x_t, θ^*⟩) = ϕ'(⟨x_t, θ^*⟩); therefore, it is continuously differentiable and a strictly increasing function. Estimation of the parameters θ^* of the linear predictor is typically performed using maximum likelihood estimation, where the loss function is the negative log likelihood

\[
L(\theta) = -\frac{1}{t} \sum_{i=1}^{t} \log p(y_i \mid x_i) = \frac{1}{t} \sum_{i=1}^{t} \left\{ \varphi(\langle x_i, \theta \rangle) - y_i \langle x_i, \theta \rangle \right\} . \tag{10.7}
\]
When θ^* is structured, we compute an estimate by solving (10.2) using the negative log likelihood loss following [9, 10]:

\[
\hat{\theta}_t := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \sum_{i=1}^{t} \left\{ \varphi(\langle x_i, \theta \rangle) - y_i \langle x_i, \theta \rangle \right\} + \lambda_t R(\theta) \,.
\]
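As an illustration of this estimation step, the following is a minimal sketch of solving the norm regularized negative log likelihood with the Poisson log-partition ϕ(η) = exp(η) and the L1 norm, using proximal gradient descent with soft-thresholding; the step size, regularization level, and data generation are illustrative assumptions rather than choices made in the thesis.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_poisson_glm(X, y, lam, step=0.01, iters=2000):
    """Proximal gradient for (1/t) sum[exp(<x_i,th>) - y_i <x_i,th>] + lam ||th||_1."""
    t, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        eta = X @ theta
        grad = X.T @ (np.exp(eta) - y) / t     # gradient of the negative log likelihood
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
t, p = 300, 40
theta_star = np.zeros(p); theta_star[:3] = [0.5, -0.4, 0.3]
X = rng.standard_normal((t, p)) / np.sqrt(p)
y = rng.poisson(np.exp(X @ theta_star))        # Poisson responses with mean exp(<x_i, theta*>)
theta_hat = sparse_poisson_glm(X, y, lam=0.01)
print(np.linalg.norm(theta_hat - theta_star))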

10.3 Problem Setting and Algorithm

Here, we will formally define the problem, mention the assumptions under which our analysis works, and present our algorithm. The results and analysis are presented in subsequent sections.

We consider the stochastic linear bandit problem [38, 1] where in each round t = 1, ..., T the algorithm selects a p-dimensional vector x_t from the decision set X and receives a loss of f_t(x_t). We assume the expected loss is defined by E[f_t(x_t) | x_t] = g^{−1}(⟨x_t, θ^*⟩), where g^{−1}(·) is an inverse link function and θ^* ∈ R^p is an unknown parameter vector. Let η_t = f_t(x_t) − E[f_t(x_t) | x_t] be the noise, which implies {η_1, ..., η_T} is a martingale difference sequence (MDS), and rewrite the loss as ℓ_t(x_t) = g^{−1}(⟨x_t, θ^*⟩) + η_t. We further consider settings where θ^* is structured, which we define as having a small value according to some norm R(·). The goal of the algorithm is to minimize its cumulative loss Σ_t ℓ_t(x_t), and we measure the performance of the algorithm in terms of the fixed cumulative pseudo-regret

\[
R_T = \sum_{t=1}^{T} g^{-1}(\langle x_t, \theta^* \rangle) - \min_{x^* \in \mathcal{X}} \sum_{t=1}^{T} g^{-1}(\langle x^*, \theta^* \rangle) \,. \tag{10.8}
\]

We require that the algorithm's regret grow sublinearly in T, i.e., R_T = o(T), and, with high probability, grow with the structure of θ^* rather than the ambient dimensionality p. In addition to the assumptions in Section 9.3.1, the following assumption is needed for the analysis.

Assumption 6 The log-partition function ϕ(·) is twice continuously differentiable and (locally) Lipschitz continuous, i.e., g^{−1}(·) = ϕ'(·) ≤ G for a constant G, and there exists a constant ℓ > 0 such that ϕ''(·) ≥ ℓ.

10.3.1 Algorithm

The algorithm is quite similar to Algorithm 9 presented in Chapter 9; however, here we must solve a norm regularized regression problem which does not always use the squared loss as in Algorithm 9. Specifically, for the initial t = 1, ..., n = c′p rounds, where c′ > 0 is a constant, our algorithm selects vectors x_{1:n} := {x_1, ..., x_n} uniformly at random from X and receives the corresponding losses ℓ_{1:n} := {ℓ_1(x_1), ..., ℓ_n(x_n)}. After the loss ℓ_n(x_n) is received in round n, the algorithm constructs an (n × p)-dimensional design matrix X_n = [x_1 ... x_n]^⊤, a sample covariance matrix D_n = X_n^⊤ X_n, and an n-dimensional response vector y_n = [ℓ_1(x_1) ... ℓ_n(x_n)]^⊤. The algorithm then computes an estimate θ̂_n by solving a norm regularized regression problem, constructs a confidence ellipsoid using the Mahalanobis distance defined as ‖θ − θ̂_n‖_{2,D_n} = √((θ − θ̂_n)^⊤ D_n (θ − θ̂_n)), then selects a sample to play. Specifically, the algorithm performs the following four main steps sequentially in each round thereafter.

Algorithm 10 Structured Stochastic Generalized Linear Bandit
1: Input: p, X, R(·), T, E_{r,max}, Ω_R, γ, c_0, c′, C
2: Set β = C ψ(E_{r,max}) ( w(Ω_R) + (γ + √(2 log T)) φ(Ω_R)/2 )   (10.25)
3: Play n = c′p uniform i.i.d. random vectors x_{1:n} ∈ X and receive losses ℓ_{1:n}
4: For t = n, ..., T
5:   Compute X_t = [x_1 ... x_t]^⊤, y_t = [ℓ_1(x_1) ... ℓ_t(x_t)]^⊤, and D_t = X_t^⊤ X_t
6:   Set λ_t = c_0 ( w(Ω_R) + (γ + √(2 log T)) φ(Ω_R)/2 ) / √t   (9.27)
7:   Compute θ̂_t = argmin_{θ ∈ R^p} (1/t) Σ_{i=1}^{t} { ϕ(⟨x_i, θ⟩) − y_i ⟨x_i, θ⟩ } + λ_t R(θ)
8:   Construct C_t := {θ : ‖θ − θ̂_t‖_{2,D_t} ≤ β}
9:   Compute (x′_{t+1}, θ′_{t+1}) := argmin_{x ∈ X, θ ∈ C_t ∩ S^{p−1}} ⟨x, θ⟩
10:  Play x_{t+1} ∼ Uniform( X ∩ B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) ) and receive loss ℓ_{t+1}(x_{t+1})
11: End For

For each t = n, . . . , T :

1. Compute an estimate:
\[
\hat{\theta}_t := \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \sum_{i=1}^{t} \left\{ \varphi(\langle x_i, \theta \rangle) - y_i \langle x_i, \theta \rangle \right\} + \lambda_t R(\theta) \tag{10.9}
\]
2. Construct a confidence ellipsoid:
\[
C_t := \left\{ \theta \in \mathbb{R}^p : \| \theta - \hat{\theta}_t \|_{2, D_t} \le \beta \right\} \tag{10.10}
\]

3. Compute an optimal solution:
\[
(x'_{t+1}, \theta'_{t+1}) := \operatorname*{argmin}_{x \in \mathcal{X},\; \theta \in C_t \cap S^{p-1}} \; \langle x, \theta \rangle \tag{10.11}
\]
4. Play the vector x_{t+1} ∼ Uniform( X ∩ B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) ) and receive the loss ℓ_{t+1}(x_{t+1}),

where B̄_2^p(x′_{t+1}, ‖x′_{t+1}‖_2 / (2√t)) is a closed L2 ball centered at x′_{t+1} which has a radius of ‖x′_{t+1}‖_2 / (2√t). After receiving ℓ_{t+1}(x_{t+1}), the design matrix X_{t+1} and response vector y_{t+1} are updated with x_{t+1} and ℓ_{t+1}(x_{t+1}) respectively, the sample covariance matrix D_{t+1} = X_{t+1}^⊤ X_{t+1} is recomputed, and the regularization parameter λ_{t+1} is updated. Our structured stochastic generalized linear bandit algorithm is presented in Algorithm 10.

10.4 Regret Bound

Here, we present the main result, which is a high-probability bound on the regret of Algorithm 10, and show examples for popular types of regression problems and structures. The analysis of the bound is presented in Section 10.5.

The main result consists of two theorems for the problem independent and problem dependent settings [38]. Let E be the set of all extremal points. The problem independent setting occurs when the difference between the expected loss of the best extremal point x^* and the expected loss of the second best extremal point is zero, i.e., ∆ = inf_{x∈E} g^{−1}(⟨x, θ^*⟩) − g^{−1}(⟨x^*, θ^*⟩) = 0. Such a setting occurs, for example, when the decision set is the unit L2 ball. The problem dependent setting occurs when ∆ > 0, for example, when the decision set is a polytope.

Theorem 13 (Problem Independent Regret Bound) Under the assumptions in Section 9.3.1, Assumption 6, and for any γ > 0, choose the radius of the ellipsoid in Algorithm 10 as
\[
\beta = c_0\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) . \tag{10.12}
\]
Then, for any T > c′p, with probability at least 1 − c_1 exp(−γ²), the fixed cumulative regret of Algorithm 10 is at most
\[
R_T \le O\!\left( \psi(E_{r,\max}) \left( w(\Omega_R) + \gamma + \sqrt{2 \log T} \right) \sqrt{p\, T \log T} \right) , \tag{10.13}
\]
where c′, c_0, c_1 > 0 are constants.

Theorem 14 (Problem Dependent Regret Bound) Under the assumptions in Section 9.3.1, Assumption 6, and for any γ > 0, choose the radius of the ellipsoid in Algorithm 10 as
\[
\beta = c_0\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) . \tag{10.14}
\]
Then, for any T > c′p, with probability at least 1 − c_1 exp(−γ²), the fixed cumulative regret of Algorithm 10 with a decision set which has non-zero gap ∆ > 0 is at most
\[
R_T \le O\!\left( \psi^2(E_{r,\max}) \left( w(\Omega_R) + \gamma + \sqrt{2 \log T} \right)^2 \frac{p \log T}{\Delta} \right) , \tag{10.15}
\]
where c′, c_0, c_1 > 0 are constants.

10.4.1 Examples

We present the problem independent regret for popular regression problems: linear regression, logistic regression, and Poisson regression, each with various types of structure. We compute the regret using Theorem 13 and the values of ψ(E_{r,max}) and w(Ω_R) from [30, 10, 31]. In the following examples, we consider three types of structure: unstructured, s-sparse, and group sparse. For unstructured, we assume θ^* has no particular structure. For s-sparse structure, we assume θ^* has exactly s non-zero elements. For group sparse structure, let {1, ..., p} be an index set of θ^*, G = {G_1, ..., G_K} be a known set of K groups which define a disjoint partitioning of the index set, m = max_i |G_i| be the maximum group size, and S_G ⊂ {1, ..., K} be a subset of the groups which has cardinality s_G and denotes the number of active groups. We use the notation θ_{G_i} to denote a vector with elements equal to θ for indices in G_i and 0 otherwise. Note, the regret for linear regression has been shown in [38, 1, 2, 78] for all types of structure we consider. The regret for logistic regression has only been considered without any structural assumptions in [129]. The remaining examples are novel contributions.

Example 5 (Linear Regression) For problems where the loss ℓ_t(·) is a real-valued function with support in [−C, C] for some constant C, we can set the inverse link function to be the identity function, i.e., g^{−1}(⟨x_t, θ⟩) = ⟨x_t, θ⟩, which implies the conditional probability distribution p(y_t | x_t) is Gaussian. The estimation loss function L(θ) reduces to the squared loss, and we solve the norm regularized linear regression problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \| y_t - X_t \theta \|_2^2 + \lambda_t R(\theta) \,. \tag{10.16}
\]
If θ^* is unstructured then we set R(θ) = ‖θ‖_2^2 and (10.16) becomes the ridge regression problem; if θ^* is s-sparse then we set R(θ) = ‖θ‖_1 and (10.16) becomes the Lasso problem [119]; and if θ^* is group sparse then we set R(θ) = Σ_{j=1}^{K} ‖θ_{G_j}‖_2 and (10.16) becomes the group lasso problem [128].

We compute the regret by plugging in the values of ψ(E_{r,max}) and w(Ω_R). For ridge regression: ψ(E_{r,max}) = O(1), w(Ω_R) = O(√p), which gives a regret of Õ(p√T). For the Lasso: ψ(E_{r,max}) = O(√s), w(Ω_R) = O(√(log p)), which gives a regret of Õ(√(s log p) √(pT)). For the group lasso: ψ(E_{r,max}) = O(√s_G), w(Ω_R) = O(√(m + log K)), which gives a regret of Õ(√(s_G(m + log K)) √(pT)).

Example 6 (Logistic Regression) For problems where the loss ℓ_t(·) is binary, we can set the inverse link function to be the logistic function, i.e., g^{−1}(⟨x_t, θ⟩) = 1/(1 + exp(−⟨x_t, θ⟩)), which implies the conditional probability distribution p(y_t | x_t) is Bernoulli. The estimation loss function L(θ) is the well-known logistic loss function, and we solve the norm regularized logistic regression problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \sum_{i=1}^{t} \left\{ \log\left( 1 + \exp\left( \langle x_i, \theta \rangle \right) \right) - y_i \langle x_i, \theta \rangle \right\} + \lambda_t R(\theta) \,. \tag{10.17}
\]
If θ^* is unstructured then we set R(θ) = ‖θ‖_2^2 and (10.17) becomes logistic regression with L2 regularization; if θ^* is s-sparse then we set R(θ) = ‖θ‖_1 and (10.17) becomes the sparse logistic regression problem [86]; and if θ^* is group sparse then we set R(θ) = Σ_{j=1}^{K} ‖θ_{G_j}‖_2 and (10.17) becomes the group sparse logistic regression problem [94].

The regret for each of the problems (logistic regression with L2 regularization, sparse logistic regression, and group sparse logistic regression) is Õ(p√T), Õ(√(s log p) √(pT)), and Õ(√(s_G(m + log K)) √(pT)) respectively, which is the same as in the linear regression example.
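For the s-sparse case of this example, the estimate in (10.17) with R(θ) = ‖θ‖_1 can be computed with standard solvers; the following is a minimal sketch using scikit-learn's L1-penalized LogisticRegression (an external tool chosen here purely for illustration; the data and regularization strength are arbitrary assumptions).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
t, p, s = 400, 60, 4
theta_star = np.zeros(p)
theta_star[:s] = 0.8

X = rng.standard_normal((t, p))
prob = 1.0 / (1.0 + np.exp(-(X @ theta_star)))
y = rng.binomial(1, prob)                           # Bernoulli responses

# L1-penalized logistic regression; C is the inverse of the regularization strength
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0, fit_intercept=False)
clf.fit(X, y)
theta_hat = clf.coef_.ravel()
print("non-zeros in estimate:", np.count_nonzero(theta_hat))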

Example 7 (Poisson Regression) For problems where the loss ℓ_t(·) is a non-negative integer (a count), we can set the inverse link function to be the exponential function, i.e., g^{−1}(⟨x_t, θ⟩) = exp(⟨x_t, θ⟩), which implies the conditional probability distribution p(y_t | x_t) is Poisson, and we solve the norm regularized Poisson regression problem
\[
\hat{\theta}_t = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \; \frac{1}{t} \sum_{i=1}^{t} \left\{ \exp\left( \langle x_i, \theta \rangle \right) - y_i \langle x_i, \theta \rangle \right\} + \lambda_t R(\theta) \,. \tag{10.18}
\]
If θ^* is unstructured then we set R(θ) = ‖θ‖_2^2 and (10.18) becomes Poisson regression with L2 regularization; if θ^* is s-sparse then we set R(θ) = ‖θ‖_1 and (10.18) becomes the sparse Poisson regression problem [73]; and if θ^* is group sparse then we set R(θ) = Σ_{j=1}^{K} ‖θ_{G_j}‖_2 and (10.18) becomes the group sparse Poisson regression problem [73].

The regret for each of the problems (Poisson regression with L2 regularization, sparse Poisson regression, and group sparse Poisson regression) is Õ(p√T), Õ(√(s log p) √(pT)), and Õ(√(s_G(m + log K)) √(pT)) respectively, which is the same as in the linear regression example.

10.5 Overview of the Analysis

The analysis starts by first showing a bound on the non-linear instantaneous regret.

Lemma 5 For round t, define the non-linear instantaneous regret as ṙ_t = g^{−1}(⟨x_t, θ^*⟩) − g^{−1}(⟨x^*, θ^*⟩) and the linear instantaneous regret as r_t = ⟨x_t, θ^*⟩ − ⟨x^*, θ^*⟩. Under the assumption that |⟨x, θ^*⟩| ≤ 1 and g^{−1}(·) is (locally) Lipschitz continuous with constant G, the following holds for all rounds t = 1, ..., T:
\[
\dot{r}_t \le G\, r_t \,, \tag{10.19}
\]
which implies the non-linear regret is upper bounded by a constant times the linear regret.

Proof: The proof builds on [129]. Since g^{−1}(·) is (locally) Lipschitz, this implies ∇g^{−1}(·) ≤ G. Then for any −1 ≤ a ≤ b ≤ 1 we have g^{−1}(b) = g^{−1}(a) + ∫_a^b ∇g^{−1}(x) dx; therefore, g^{−1}(b) − g^{−1}(a) ≤ G(b − a). Now, since g^{−1}(·) is strictly increasing,
\[
x^* := \operatorname*{argmin}_{x \in \mathcal{X}} \langle x, \theta^* \rangle = \operatorname*{argmin}_{x \in \mathcal{X}} g^{-1}(\langle x, \theta^* \rangle) \,.
\]
Since −1 ≤ ⟨x^*, θ^*⟩ ≤ ⟨x_t, θ^*⟩ ≤ 1, then g^{−1}(⟨x_t, θ^*⟩) − g^{−1}(⟨x^*, θ^*⟩) ≤ G(⟨x_t, θ^*⟩ − ⟨x^*, θ^*⟩).

Using Lemma 5, if we can show a bound on the regret with linear loss functions, then this implies a bound on the regret with non-linear loss functions. To bound the linear regret, we use the regret analysis established in [38]. As in Section 9.5, for Algorithm 10, as long as we have θ^* ∈ C_t over all rounds t, [38, Theorem 6] shows that Σ_{t=1}^{T} r_t² ≤ 8β²p log T. Then, to establish a problem independent regret bound, we directly apply the Cauchy-Schwarz inequality to get

\[
R_T = \sum_{t=1}^{T} r_t \le \left( T \sum_{t=1}^{T} r_t^2 \right)^{1/2} \le \beta \sqrt{8 p T \log T} \,, \tag{10.20}
\]
which holds conditioned on θ^* ∈ C_t over all rounds t. Moreover, for a problem dependent regret bound, we follow the proof of [38, Theorem 1], which shows

\[
R_T = \sum_{t=1}^{T} r_t \le \sum_{t=1}^{T} \frac{r_t^2}{\Delta} \le \frac{8 p \beta^2 \log T}{\Delta} \,, \tag{10.21}
\]
which holds conditioned on θ^* ∈ C_t over all rounds t.

The focus of our analysis is then to choose a β such that the condition holds with high probability uniformly over all rounds. From Algorithm 10, since C_t := {θ : ‖θ − θ̂_t‖_{2,D_t} ≤ β} and we want to have θ^* ∈ C_t, we focus on bounds for ‖θ̂_t − θ^*‖_{2,D_t}, the instantaneous estimation error. As in Section 9.5, deterministic bounds on the instantaneous estimation error can be obtained under two assumptions. First, we need to choose the regularization parameter λ_t such that

\[
\lambda_t \ge 2 R^*\!\left( \nabla_{\theta^*} L(\theta^*) \right) . \tag{10.22}
\]

Second, we need to have restricted strong convexity (RSC) for a constant κ > 0:

\[
\inf_{\hat{\theta}_t - \theta^* \in E_{r,t}} \; L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla_{\theta^*} L(\theta^*), \hat{\theta}_t - \theta^* \rangle \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \,. \tag{10.23}
\]
Under these assumptions, we have the following theorem (see Section D.1 for the proof), which generalizes Theorem 11, which was only shown to hold for linear regression.

Theorem 15 Assume that θ̂_t − θ^* ∈ E_{r,t}, the RSC condition is satisfied in the set E_{r,t} with parameter κ, λ_t is suitably large, and c > 0 is a constant. Then for any norm R(·),

\[
\| \hat{\theta}_t - \theta^* \|_{2, D_t} \le c\, \psi(E_{r,t}) \frac{\lambda_t}{\sqrt{\kappa}} \sqrt{t} \,. \tag{10.24}
\]
Consider the following observation presented in [10]:

\[
\nabla_{\theta^*} L(\theta^*) = \frac{1}{t} \sum_{i=1}^{t} \left\{ x_i \nabla_{\langle x_i, \theta^* \rangle} \varphi(\langle x_i, \theta^* \rangle) - y_i x_i \right\} = \frac{1}{t} \sum_{i=1}^{t} x_i \left( \mathbb{E}[y_i \mid x_i] - y_i \right) = \frac{1}{t} X_t^\top \omega_t
\]
where from Section 10.2 we noted that the gradient of the log-partition function is equal to the expected response, i.e., ∇_{⟨x_i, θ^*⟩} ϕ(⟨x_i, θ^*⟩) = E[y_i | x_i], and observing that η_i = E[y_i | x_i] − y_i is precisely the noise, ω_t = [η_1, ..., η_t]^⊤ is the noise vector. Therefore, we need to set λ_t ≥ 2R^*((1/t) X_t^⊤ ω_t), which is exactly the same quantity we have already shown how to bound in Theorem 12; therefore, we can use that theorem again. For the RSC assumption, observe the following from [10]:

\[
\begin{aligned}
L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla L(\theta^*), \hat{\theta}_t - \theta^* \rangle
&= \frac{1}{t}\sum_{i=1}^{t} \left\{ \varphi(\langle x_i, \hat{\theta}_t \rangle) - y_i \langle x_i, \hat{\theta}_t \rangle \right\}
 - \frac{1}{t}\sum_{i=1}^{t} \left\{ \varphi(\langle x_i, \theta^* \rangle) - y_i \langle x_i, \theta^* \rangle \right\} \\
&\quad - \left\langle \frac{1}{t}\sum_{i=1}^{t} \left\{ x_i \varphi'(\langle x_i, \theta^* \rangle) - y_i x_i \right\},\; \hat{\theta}_t - \theta^* \right\rangle
\end{aligned}
\]

\[
L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla_{\theta^*} L(\theta^*), \hat{\theta}_t - \theta^* \rangle = \frac{1}{t} \sum_{i=1}^{t} \varphi''\!\left( \langle x_i, \theta^* \rangle + \gamma \langle x_i, \hat{\theta}_t - \theta^* \rangle \right) \langle x_i, \hat{\theta}_t - \theta^* \rangle^2
\]
where γ ∈ [0, 1]. Since the log-partition function is convex, its second derivative is non-negative, and we will assume that it is bounded away from zero, i.e., there exists a constant ℓ such that ϕ''(·) ≥ ℓ > 0. Therefore,

\[
\frac{1}{t} \sum_{i=1}^{t} \varphi''\!\left( \langle x_i, \theta^* \rangle + \gamma \langle x_i, \hat{\theta}_t - \theta^* \rangle \right) \langle x_i, \hat{\theta}_t - \theta^* \rangle^2 \;\ge\; \frac{\ell}{t} \sum_{i=1}^{t} \langle x_i, \hat{\theta}_t - \theta^* \rangle^2 \;=\; \frac{\ell}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \,,
\]
which shows

\[
L(\hat{\theta}_t) - L(\theta^*) - \langle \nabla_{\theta^*} L(\theta^*), \hat{\theta}_t - \theta^* \rangle \ge \frac{\ell}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \,.
\]
Given this and Assumption 5, the following inequality holds with a constant κ > 0:

\[
\inf_{\hat{\theta}_t - \theta^* \in E_{r,t}} \; \frac{\ell}{t} \| X_t (\hat{\theta}_t - \theta^*) \|_2^2 \ge \kappa \| \hat{\theta}_t - \theta^* \|_2^2 \,.
\]

For a bound on the instantaneous ellipsoidal estimation error, we plug the value of λ_t from (9.27) into (10.24) and use the norm compatibility constant of the largest restricted error set to obtain
\[
\| \hat{\theta}_t - \theta^* \|_{2, D_t} \le C\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right)
\]
where C = c_0 c / √κ is a constant, and the bound holds with high probability across all rounds t = 1, ..., T. Therefore, if we set
\[
\beta = C\, \psi(E_{r,\max}) \left( w(\Omega_R) + \left( \gamma + \sqrt{2 \log T} \right) \frac{\phi(\Omega_R)}{2} \right) , \tag{10.25}
\]

the confidence ellipsoid C_t will contain θ^* across all rounds with high probability. Substituting our β into the regret bounds in (10.20) and (10.21) gives our main result in Theorem 13 and Theorem 14.

Chapter 11

Conclusions

In this thesis, we considered problems which require interaction between an algorithm and its environment, for example, recommender systems, medical treatments, online advertising, and algorithmic trading. We designed and analyzed structured online learning algorithms for both full and bandit information feedback for such problems.

In the first half of this thesis, we considered full information settings and general resource allocation problems with a particular focus on algorithmic trading. We made the following contributions. In Chapter 5, we considered the cost of updating one's portfolio and presented an efficient algorithm (OLU) which performs lazy updates to the portfolio. We analyzed the algorithm and showed that for general convex and strongly convex functions its fixed regret is bounded as O(√T) and O(log T) respectively. Further, for general convex functions we showed the shifting regret is bounded as O(√T). We experimented with our algorithm using real-world stock market data and showed it earns more wealth than existing algorithms even with transaction costs.

In Chapter 6, we considered settings where one invests in groups of assets such as market sectors. We presented an efficient algorithm (OLU-GS) which uses group information and invests in top performing groups. We analyzed the algorithm and showed that with group information the fixed regret for general convex and strongly convex functions is bounded as O(√T) and O(log T) respectively. Further, for general convex functions we showed the shifting regret is bounded as O(√T). We experimented with our algorithm using real-world stock market data and showed it earns more wealth than our previous algorithm OLU and also existing algorithms even with transaction costs.

In Chapter 7, we were interested in controlling the risk of our portfolios. We presented an efficient algorithm (ORASD) which computes diversified portfolios using a correlation network structure and the novel L_{(∞,1)} group norm as a constraint to force diversification. We established regret bounds of the form O(√T) and experimented with real-world stock market data. We showed the algorithm earns more wealth than our previous two algorithms and existing algorithms while incurring less risk.

In Chapter 8, we relaxed standard assumptions to allow borrowing wealth and shares of stock from the bank. We presented an algorithm (SHERAL) which learns to borrow wealth and shares of stock and uses a correlation network structure to compute hedged portfolios to reduce risk and take advantage of market crashes. We established regret bounds of the form O(√T) and experimented with real-world stock market data. We showed the algorithm earns orders of magnitude more wealth than our previous algorithms and existing algorithms and can effectively control various measures of risk.

In the second half of this thesis, we considered bandit information settings. We made the following theoretical contributions. In Chapter 9, we considered the stochastic linear bandit problem under structural assumptions on the unknown parameter. We presented an algorithm which constructs tighter confidence ellipsoids which contain the unknown parameter with high probability across all rounds. We established sharper theoretical regret bounds than what exists in the literature for any norm structure. In Chapter 10, we considered the stochastic linear bandit problem under relaxed assumptions which allow the expected loss to be a non-linear function from Generalized Linear Models. We presented an algorithm which constructs tight confidence ellipsoids which contain the unknown parameter with high probability across all rounds. We established theoretical regret bounds which match the regret bounds for linear loss functions even when considering a wider class of non-linear loss functions.

This thesis considered structured online learning algorithms to overcome challenges involved in interactive machine learning problems. It presented several experimental results which illustrated the algorithms' effectiveness and helped close the gap between theory and practice. Furthermore, it established several theoretical results which show sharper regret bounds are achievable under structural assumptions on the underlying model. Such results suggest the use of structured online learning algorithms in interactive learning problems is useful both in theory and practice.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Neural Information Processing Systems, 2011.

[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In International Conference on Artificial Intelligence and Statistics, 2012.

[3] Alekh Agarwal, Sahand N. Negahban, and Martin J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Neural Information Processing Systems, 2012.

[4] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the Newton method. In International Conference on Machine Learning, 2006.

[5] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.

[6] Mohamed El Hedi Arouri and Duc Khuong Nguyen. Oil prices, stock markets and portfolio investment: Evidence from sector analysis in Europe over the last decade. Energy Policy, 38(8):4528–4539, 2010.

[7] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(3):397–422, 2003.

[8] Maria-Florina Balcan, Avrim Blum, Jason D. Hartline, and Yishay Mansour. Mechanism design via machine learning. In Symposium on Foundations of Computer Science, 2005.

[9] Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with norm regularization. In Neural Information Processing Systems, 2014.

[10] Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with norm regularization. Extended version at arXiv:1505.02294, 2015.

[11] Ole Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, 1978.

[12] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[13] Peter J. Bickel, Yaacov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[14] Avrim Blum and Adam Kalai. Universal portfolios with and without transaction costs. In Conference on Learning Theory, 1997.

[15] Allan Borodin, Ran El-Yaniv, and Vincent Gogan. Can we learn to beat the best stock. Journal of Artificial Intelligence Research, 21(1):579–594, 2004.

[16] Leon Bottou. Stochastic gradient learning in neural networks. In Neuro-Nîmes, 1991.

[17] Leon Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, 2010.

[18] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[19] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[20] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[21] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[22] George W. Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation. Wiley, 1951.

[23] Lawrence D. Brown. Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes-Monograph Series, 9:i–279, 1986.

[24] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[25] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[26] Alexandra Carpentier and Remi Munos. Bandit theory meets compressed sensing for high-dimensional stochastic linear bandit. In International Conference on Artificial Intelligence and Statistics, 2012.

[27] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.

[28] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[29] Antonin Chambolle, Ronald A. DeVore, Nam-Yong Lee, and Bradley J. Lucier. Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. Transactions on Image Processing, 7(3):319–335, 1998.

[30] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[31] Sheng Chen and Arindam Banerjee. Structured estimation with atomic norms: General bounds and applications. In Neural Information Processing Systems, 2015.

[32] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.

[33] Thomas M. Cover. Log optimal portfolios. In Gambling Research: Gambling and Risk Taking. University of Nevada-Reno, 1987.

[34] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

[35] Thomas M. Cover and Erik Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348–363, 1996.

[36] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

[37] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In Neural Information Processing Systems, 2007.

[38] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

[39] Puja Das and Arindam Banerjee. Meta optimization and its application to portfolio selection. In Conference on Knowledge Discovery and Data Mining, 2011.

[40] Puja Das and Arindam Banerjee. Online quadratically constrained convex optimization with applications to risk adjusted portfolio selection. Technical report, University of Minnesota, 2012.

[41] Puja Das, Nicholas Johnson, and Arindam Banerjee. Online lazy updates for portfolio selection with transaction costs. In Association for the Advancement of Artificial Intelligence, 2013.

[42] Puja Das, Nicholas Johnson, and Arindam Banerjee. Online portfolio selection with group sparsity. In Association for the Advancement of Artificial Intelligence, 2014.

[43] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

[44] Mark H.A. Davis and A. R. Norman. Portfolio selection with transaction costs. Mathematics of Operations Research, 15(4):676–713, 1990.

[45] Nikhil R. Devanur and Thomas P. Hayes. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In Conference on Electronic Commerce, 2009.

[46] Dmitri A. Dolgov and Edmund H. Durfee. Resource allocation among agents with mdp-induced preferences. Journal of Artificial Intelligence Research, 27:505–549, 2006.

[47] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008.

[48] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Compos- ite objective mirror descent. In Conference on Learning Theory, 2010.

[49] Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, 2011.

[50] Stefano Ermon, Jon Conrad, Carla Gomes, and Bart Selman. Risk-sensitive policies for sustainable renewable resource allocation. In International Joint Conference on Artificial Intelligence, 2011.

[51] Jerry Fails and Dan Olsen. Interactive machine learning. In International Conference on Intelligent User Interfaces, 2003.

[52] Rebecca Fiebrink, Perry R. Cook, and Dan Trueman. Human model evaluation in interactive supervised learning. In Conference on Human Factors in Computing Systems, 2011.

[53] Mario A. T. Figueiredo and Robert D. Nowak. An EM algorithm for wavelet-based image restoration. Transactions on Image Processing, 12(8):906–916, 2003.

[54] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Neural Information Processing Systems, 2010.

[55] Dean P. Foster. Prediction in the worst case. The Annals of Statistics, 19(2):1084–1090, 1991.

[56] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Conference on Computational Learning Theory, 1996.

[57] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[58] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

[59] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. Preprint at arXiv:1001.0736, 2010.

[60] Drew Fudenberg and David Levine. An easier way to calibrate. Games and Economic Behavior, 29(1-2):131–137, 1999.

[61] Sebastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Conference on Learning Theory, 2011.

[62] Gagan Goel and Aranyak Mehta. Online budgeted matching in random input models with applications to adwords. In Symposium on Discrete Algorithms, 2008.

[63] Daniel Golovin, Andreas Krause, Beth Gardner, Sarah J. Converse, and Steve Morey. Dynamic resource allocation in conservation planning. In Association for the Advancement of Artificial Intelligence, 2011. 139 [64] Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210, 2001.

[65] Laszlo Gyorfi, Gyorgy Ottucsak, and Harro Walk. Machine Learning for Financial Engineering. Imperial College Press, 2011.

[66] James Hannan. Approximation to bayes risk in repeated plays. Contributions to the Theory of Games, 3:97–139, 1957.

[67] Elad Hazan. Introduction to online convex optimization. Technical report, Prince- ton University, 2015.

[68] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165–188, 2010.

[69] David P. Helmbold, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998.

[70] Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

[71] Joseph W. Horwood. Risk-sensitive optimal harvesting and control of biological populations. Mathematical Medicine and Biology, 13:35–71, 1996.

[72] Ronald A. Howard and James E. Matheson. Risk-sensitive markov decision pro- cesses. Management Science, 18(7):356–369, 1972.

[73] St´ephaneIvanoff, Franck Picard, and Vincent Rivoirard. Adaptive lasso and group-lasso for functional poisson regression. Journal of Machine Learning Re- search, 17(1):1903–1948, 2016.

[74] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In International Conference on Machine Learning, 2009.

[75] Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, and Francis Bach. Prox- imal methods for sparse hierarchical dictionary learning. In International Confer- ence on Machine Learning, 2010. 140 [76] Nicholas Johnson and Arindam Banerjee. Online resource allocation with struc- tured diversification. In SIAM International Conference on Data Mining, 2015.

[77] Nicholas Johnson and Arindam Banerjee. Structured hedging for resource allo- cations with leverage. In Conference on Knowledge Discovery and Data Mining, 2015.

[78] Nicholas Johnson, Vidyashankar Sivakumar, and Arindam Banerjee. Structured stochastic linear bandits. Preprint at arXiv:1606.05693, 2016.

[79] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision prob- lems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[80] Emilie Kaufmann, Olivier Capp´e,and Aur´elienGarivier. On bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, 2012.

[81] John L. Kelly. A new interpretation of information rate. Bell Systems Technical Journal, 35(4):917–926, 1956.

[82] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[83] Jyrki Kivinen and Manfred K. Warmuth. Relative loss bounds for multidimen- sional regression problems. Machine Learning, 45(3):301–329, 2001.

[84] Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[85] Hoang Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. In International Conference on Machine Learning, 2016.

[86] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient l1 regularized logistic regression. In Association for the Advancement of Artificial Intelligence, 2006. 141 [87] Bin Li and Steven C.H. Hoi. On-line portfolio selection with moving average reversion. In International Conference of Machine Learning, 2012.

[88] Bin Li and Steven C.H. Hoi. Online portfolio selection: A survey. ACM Computing Surveys, 46(3):1–35, 2014.

[89] Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An unbiased ofine evaluation of contextual bandit algorithms with generalized linear models. Journal of Machine Learning Research, 26:19–36, 2012.

[90] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International World Wide Web Conference, 2010.

[91] Qing Li, Maria Vassalou, and Yuhang Xing. Sector investment growth rates and the cross section of equity returns. The Journal of Business, 79(3):1637–1665, 2006.

[92] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[93] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[94] Peter B¨uhlmannLukas Meier, Sara van de Geer. The group lasso for logistic regression. Journal of the Royal Statistical Society, 70(1):53–71, 2008.

[95] Michael J. P. Magill and George M. Constantinides. Portfolio selection with trans- action costs. Journal of Economic Theory, 13(2):245–263, 1976.

[96] Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Net- work flow algorithms for structured sparsity. In Neural Information Processing Systems, 2010.

[97] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[98] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruc- tion and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17:1248–1282, 2007. 142 [99] R´emiMunos. Efficient resources allocation for markov decision processes. In Neural Information Processing Systems, 2001.

[100] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decompos- able regularizers. Statistical Science, 27(4):538–557, 2012.

[101] Sahand N. Negahban and Martin J. Wainwright. Estimation of (near) low- rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.

[102] John A. Nelder and Robert W. M. Wedderburn. Generalized linear models. Jour- nal of the Royal Statistical Society, 135(3):370–384, 1972.

[103] Arkadi S. Nemirovsky and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[104] Erik Ordentlich and Thomas M. Cover. Online portfolio selection. In Conference on Learning Theory, 1996.

[105] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2014.

[106] Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. An effi-

cient projection for `1,∞ regularization. In International Conference on Machine Learning, 2009.

[107] Frank Rosenblatt. The perceptron–a perceiving and recognizing automaton. Tech- nical Report 85-460-1, Cornell Aeronautical Laboratory, 1957.

[108] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60(1-4):259–268, 1992.

[109] Patt Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

[110] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. 143 [111] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009.

[112] William F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19(3):425–442, 1964.

[113] William F. Sharpe. Mutual fund performance. The Journal of Business, 39(1):119–138, 1966.

[114] Frank A. Sortino and Rovert van der Meer. Downside risk. Journal of Portfolio Management, 17(4):27–31, 1991.

[115] Michel Talagrand. The Generic Chaining. Springer Monographs in Mathematics, 2005.

[116] Michel Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer- Verlag, 2014.

[117] Andrea Thomaz and Cynthia Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737, 2008.

[118] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

[119] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996.

[120] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, 67:91–108, 2005.

[121] Roman Vershynin. Introduction to the non-asymptotic analysis of random matri- ces. In Compressed Sensing, pages 210–268. Cambridge University Press, 2012.

[122] John von Neumann. Zur theorie der gesellschaftsspiele. In Mathematische An- nalen, volume 100, pages 295–320, 1928. 144 [123] Vladimir Vovk. A game of prediction with expert advice. In Conference on Computational Learning Theory, 1995.

[124] Huahua Wang and Arindam Banerjee. Online alternating direction method. In International Conference on Machine Learning, 2012.

[125] Manfred K. Warmuth, R¨atsch Gunnar, Michael Mathieson, Jun Liao, and Chris- tian Lemmen. Active learning in the drug discovery process. In Neural Information Processing Systems, 2001.

[126] Shan Xue, Alan Fern, and Daniel Sheldon. Dynamic resource allocation for opti- mizing population diffusion. In International Conference on Artificial Intelligence and Statistics, 2014.

[127] Yaoliang Yu. Better approximation and faster algorithm using the proximal av- erage. In Neural Information Processing Systems, 2013.

[128] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, 68(1):49–67, 2006.

[129] Lijun Zhang, Tianbao Yang, Rong Jin, Yichi Xiao, and Zhi hua Zhou. Online stochastic linear optimization under one-bit feedback. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[130] Martin Zinkevich. Online convex programming and generalized infinitesimal gra- dient ascent. In International Conference on Machine Learning, 2003. Appendix A

Online Lazy Updates

A.1 Regret Analysis

In this section, we present the proofs of the lemmas and theorems from Section 5.4.

Lemma 1 Let the sequence of {pt} be defined by the update in (5.12), with poten- tially time-varying ηt, γt. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, λ 2 ∗ i.e., dψ(p, ˆp) ≥ 2 kp − ˆpk2 and let ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Then, for any p ∈ P,

∗ ηt[φt(pt) + γtkpt+1 − ptk1 − φt(p )] (A.1) η2 ≤ d (p∗, p ) − d (p∗, p ) + η γ L + t ||∇φ (p )||2 . ψ t ψ t+1 t t 2λ t t

Proof: Let hpt (p) = kp − ptk1 and let gpt (pt+1) ∈ ∂hpt (pt+1). Then, for any p ∈ P, the optimality condition for (5.12) can be written as

hp − pt+1, ηt∇φt(pt) + ηtγtgpt (pt+1) + ∇ψ(pt+1) − ∇ψ(pt)i ≥ 0 . (A.2)

Further, by convexity we have

∗ ∗ φt(pt) − φt(p ) ≤ hpt − p , ∇φt(pt)i , (A.3) ∗ ∗ kpt+1 − ptk1 − kp − ptk1 ≤ hpt+1 − p , gpt (pt+1)i . (A.4)

145 146 Hence, for any p∗ ∈ P

∗ ∗ ηt[φt(pt)−φt(p )+γtkpt+1 −ptk1 − γtkp −ptk1] ∗ ∗ ≤ ηthpt − p , ∇φt(pt)i + ηtγthpt+1 − p , gpt (pt+1)i ∗ ∗ = ηthpt+1 − p , ∇φt(pt)i + ηtγthpt+1 − p , gpt (pt+1)i + ηthpt − pt+1, ∇φt(pt)i ∗ = hp − pt+1, ∇ψ(pt) − ∇ψ(pt+1) − ηt∇φt(pt) − ηtγtgpt (pt+1)i ∗ + hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i .

The first term of the last equation is non-positive from (A.2). Thus we have,

∗ ∗ ηt[φt(pt) − φt(p )+γtkpt+1 −ptk1 − γtkp −ptk1] ∗ ≤ hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i ∗ ∗ = dψ(p , pt) − dψ(pt+1, pt) − dψ(p , pt+1) + ηt hpt − pt+1, ∇φt(pt)i *s r + ∗ ∗ λ ηt = dψ(p , pt)−dψ(pt+1, pt)−dψ(p , pt+1)+ηt (pt − pt+1), ∇φt(pt) ηt λ λ η2 ≤ d (p∗, p )−d (p , p )−d (p∗, p )+ ||p −p ||2 + t ||∇φ (p )||2 ψ t ψ t+1 t ψ t+1 2 t t+1 2λ t t η2 ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 , ψ t ψ t+1 2λ t t where in the second to last inequality follows from the Fenchel-Young inequality and λ 2 ∗ from using dψ(pt+1, pt) ≥ 2 kpt+1 − ptk . Rearranging terms and using kp − ptk1 ≤ L ends the proof.

Theorem 1 Let the sequence of {pt} be defined by the update in (5.12). Let φt be a 2 Lipschitz continuous function for which k∇φt(pt)k2 ≤ G. Then, by choosing ηt = η = √c1 and γ = √c2 for c , c > 0, we have T t t 1 2 T √ X ∗ [φt(pt) + γtkpt+1 − ptk1 − φt(p )] ≤ O( T ) . (A.5) t=1

Proof: By Lemma 1, noting ηt = η, we have T T X X η2G2T η [φ (p ) + γ kp −p k − φ (p∗)] ≤ {d (p∗, p ) − d (p∗, p )+ηγ L} + t t t t+1 t 1 t ψ t ψ t+1 t 2λ t=1 t=1 T X η2G2T = d (p∗, p )−d (p∗, p )+ ηγ L+ . ψ 1 ψ T +1 t 2λ t=1 147

Noting that Bregman divergences are always non-negative, dropping the γtkpT +1 −pT k1 term, dividing by η , and using η = √c1 , γ = √c2 where c , c , L > 0 are constants, we t T t t 1 2 have T T −1 T X X 1 X ηG2T [φ (p ) − φ (p∗)] + γ kp − p k ≤ d (p∗, p ) + L γ + t t t t t+1 t 1 η ψ 1 t 2λ t=1 t=1 t=1 √ √ √ c G2 T ≤ T d (p∗, p )/c + 2c L T + 1 ψ 1 1 2 2λ √ ≤ O( T ) .

Lemma 2 Let the sequence of {pt} be defined by the update in (5.12) with potentially time-varying ηt and fixed γ. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||2, λ 2 i.e. dψ(p, ˆp) ≥ 2 kp − ˆpk2 and ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Assuming φt are all β-strongly β ∗ convex, for any γ < 4 and any p ∈ P, we have ∗ ηt[φt(pt) − φt(p ) + γkpt+1 − ptk1] η2 β  (A.6) ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 − η − 2γ kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 t

Proof: Let hpt (p) = kp − ptk1 and let gpt (pt+1) ∈ ∂hpt (pt+1). Then, for any p ∈ P, the optimality condition for (5.12) can be written as

hp − pt+1, ηt∇φt(pt) + ηtγgpt (pt+1) + ∇ψ(pt+1) − ∇ψ(pt)i ≥ 0 .

Further, by convexity we have β φ (p ) − φ (p∗) + kp∗ − p k2 ≤ hp − p∗, ∇φ (p )i , (A.7) t t t 2 t t t t ∗ ∗ kpt+1 − ptk1 − kp − ptk1 ≤ hpt+1 − p , gpt (pt+1)i . (A.8)

Hence, for any p∗ ∈ P β η [φ (p )−φ (p∗)+γkp −p k − γkp∗ −p k + kp∗ − p k2] t t t t t+1 t 1 t 1 2 t ∗ ∗ ≤ ηthpt − p , ∇φt(pt)i + ηtγhpt+1 − p , gpt (pt+1)i ∗ ∗ = ηthpt+1 − p , ∇φt(pt)i + ηtγhpt+1 − p , gpt (pt+1)i + ηthpt − pt+1, ∇φt(pt)i ∗ = hp − pt+1, ∇ψ(pt) − ∇ψ(pt+1) − ηt∇φt(pt) − ηtγgpt (pt+1)i ∗ + hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i . 148 The first term of the last equation is non-positive. Thus we have, β η [φ (p )−φ (p∗)+γkp −p k − γkp∗ −p k + kp∗ − p k2] t t t t t+1 t 1 t 1 2 t ∗ ≤ hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i ∗ ∗ = dψ(p , pt) − dψ(pt+1, pt) − dψ(p , pt+1) + ηthpt − pt+1, ∇φt(pt)i *s r + ∗ ∗ λ ηt = dψ(p , pt)−dψ(pt+1, pt)−dψ(p , pt+1)+ηt (pt − pt+1), ∇φt(pt) ηt λ λ η2 ≤ d (p∗, p )−d (p , p )−d (p∗, p )+ ||p −p ||2 + t ||∇φ (p )||2 ψ t ψ t+1 t ψ t+1 2 t t+1 2λ t t η2 ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 , ψ t ψ t+1 2λ t t where the second to last inequality follows from the Fenchel-Young inequality and from λ 2 using dψ(pt+1, pt) ≥ 2 kpt+1 − ptk . Rearranging terms, we have

∗ ηt[φt(pt)−φt(p )+γkpt+1 −ptk1] η2 η β ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 + η γkp∗ −p k − t kp∗ − p k2 . ψ t ψ t+1 2λ t t t t 1 2 t

2 Now, using the fact that for any u ∈ R, |u| ≤ 2u , we have

n n ∗ X ∗ X ∗ 2 ∗ 2 kp −ptk1 = |p (i) − pt(i)| ≤ 2 (p (i) − pt(i)) = 2kp − ptk . i=1 i=1

∗ β ∗ 2 Hence, for any γ < β/4, we have γkp −ptk1 < 2 kp − ptk , implying

∗ ηt[φt(pt)−φt(p )+γkpt+1 −ptk1] η2 β  ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 − η − 2γ kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 t

Theorem 2 Let the sequence of {pt} be defined by the update in (5.12). Let φt be 2 1 all β-strongly convex and k∇φt(pt)k2 ≤ G. Then, for any γ < β/4, choosing ηt = κt 0 1 0 2 where κ ∈ (0, β − 4γ], and with dψ(p, p ) = 2 kp − p k , we have

T X ∗ [φt(pt) + γkpt+1 − ptk1 − φt(p )] ≤ O(log T ) . (A.9) t=1 149 Proof: By Lemma 2, we have

T X ∗ [φt(pt) + γkpt+1 − ptk1 − φt(p )] t=1 T     X 1 1 β ηt ≤ d (p∗, p ) − d (p∗, p ) − − 2γ kp∗ − p k2 + ||∇φ (p )||2 η ψ t η ψ t+1 2 t 2λ t t t=1 t t 1 ∗ 1 ∗ ≤ dψ(p , p1)− dψ(p , pT +1) η1 ηT T −1 T X   1 1  β   G2 X + d (p∗, p ) − − − 2γ kp∗ − p k2 + η ψ t+1 η η 2 t+1 2λ t t=1 t+1 t t=1 T −1 1 X β κ cG2 ≤ d (p∗, p ) − − 2γ − kp∗ − p k2 + log T, η ψ 1 2 2 t+1 2λκ 1 t=1 where the last inequality follows from

T −1 X  1 1  κ d (p∗, p ) − = kp∗ − p k2 , ψ t+1 η η 2 t+1 t=1 t+1 t

∗ 1 ∗ 2 PT 1 β κ for dψ(p , pt+1) = 2 kp − pt+1k , and t=1 t ≤ c log T . Now, since ( 2 − 2γ − 2 ) ≥ 0

T X cG2 [φ (p ) + γkp − p k − φ (p∗)] ≤ κd (p∗, p ) + log T = O(log T ) . t t t+1 t 1 t ψ 1 λκ t=1

∗ ∗ Theorem 3 Let {p1, ··· , pT } be any sequence of portfolios serving as a comparator in (5.12). Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, i.e., dψ(p, ˆp) ≥ λ kp − ˆpk2 and ||p − ˆp|| ≤ L, ∀p, ˆp ∈ P. For η = η = √c1 and γ = √c2 for c , c > 0, 2 2 1 t T t t 1 2 1 1 r + q = 1, and ||∇ψ(pt)||r ≤ ζ, we have

T T X ∗ X ∗ ∗ [φt(pt) − φt(pt )] + [γtkpt+1 − ptk1 − γtkpt+1 − pt k1] t=1 t=1 √ ( T ) √ T X ≤ O( T )+ d (p∗, p ) − d (p∗ , p ) + ψ(p∗ ) − ψ(p∗) +ζ kp∗−p∗ k . c ψ 1 1 ψ T +1 T +1 T +1 1 t t+1 q 1 t=1 (A.10) 150 ∗ ∗ Proof: From Lemma 1 we have the following by substituting p = pt

∗ η[φt(pt) + γtkpt+1 − ptk1 − φt(pt )] (A.11) η2 ≤ d (p∗, p ) − d (p∗, p ) + ηγ L + ||∇φ (p )||2 . ψ t t ψ t t+1 t 2λ t t This does not telescope, so we need to add additional terms. To this end, consider

∗ ∗ ∗ ∗ ∗ ∗ dψ(pt , pt+1) − dψ(pt+1, pt+1) = ψ(pt ) − ψ(pt+1) + h∇ψ(pt+1), pt+1 − pt i (A.12) ∗ ∗ ∗ ∗ ≥ ψ(pt ) − ψ(pt+1) − ζ||pt − pt+1||q where the last inequality follows from Holder’s inequality, i.e.,

∗ ∗ ∗ ∗ ∗ ∗ hpt+1 − pt , ∇ψ(pt+1)i ≥ −k∇ψ(pt+1)krkpt − pt+1kq ≥ −ζkpt − pt+1kq .

∗ ∗ Adding (A.12) to (A.11), and further adding −ηγtkpt+1 − pt k1 on both sides, we have 2 for k∇φt(pt)k ≤ G

∗ ∗ ∗ ∗ ∗ ∗ ∗ η[φt(pt) + γtkpt+1−ptk1 − φt(pt ) − γtkpt+1−pt k1] + ψ(pt ) − ψ(pt+1) − ζkpt −pt+1kq η2 ≤ d (p∗, p ) − d (p∗ , p ) − ηγ kp∗ − p∗k + ηγ L + G2 . (A.13) ψ t t ψ t+1 t+1 t t+1 t 1 t 2λ Summing over T rounds, we have

T T X ∗ ∗ ∗ ∗ ∗ X ∗ ∗ η [φt(pt)+γtkpt+1−ptk1−φt(pt )−γtkpt+1−pt k1]+ψ(p1)−ψ(pT +1)−ζ kpt −pt+1kq t=1 t=1 T T X η2 X ≤ d (p∗, p ) − d (p∗ , p ) + ηL γ + G2 . ψ 1 1 ψ T +1 T +1 t 2λ t=1 t=1 (A.14) where we ignored some negative terms on the right hand side of the inequality. Rear- ranging, we have

T T X ∗ X ∗ ∗ η [φt(pt) − φt(pt )] + η [γtkpt+1 − ptk1 − γtkpt+1 − pt k1] (A.15) t=1 t=1 ∗ ∗ ∗ ∗ ≤ dψ(p1, p1) − dψ(pT +1, pT +1) + ψ(pT +1) − ψ(p1) T T T X X η2 X + ζ ||p∗ − p∗ || + ηL γ + G2 t t+1 q t 2λ t=1 t=1 t=1 151 Setting η = √c1 and γ = √c2 and dividing both sides by η gives T t t

T T X ∗ X ∗ ∗ [φt(pt) − φt(pt )] + [γtkpt+1 − ptk1 − γtkpt+1 − pt k1] (A.16) t=1 √ t=1 √ c TG2 ≤ c L T + 1 2 2λ √ ( T ) T X + d (p∗, p ) − d (p∗ , p ) + ψ(p∗ ) − ψ(p∗) + ζ ||p∗ − p∗ || . c ψ 1 1 ψ T +1 T +1 T +1 1 t t+1 q 1 t=1 Appendix B

Online Lazy Updates with Group Sparsity

B.1 Regret Analysis

In this section, we present the proofs of the lemmas and theorems from Section 6.4.

Lemma 3 Let the sequence of {pt} be defined by the update in (6.19), with poten- tially time-varying ηt. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, i.e., λ 2 ∗ dψ(p, ˆp) ≥ 2 kp − ˆpk2 and let ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Then, for any p ∈ P,

∗ ∗ ηt[φt(pt) + λ1r(pt) + λ2kpt+1 − ptk1 − φt(p ) − λ2r(p )] (B.1) η2 ≤ d (p∗, p ) − d (p∗, p ) + η λ L + t ||∇φ (p )||2 . ψ t ψ t+1 t 2 2λ t t

0 Proof: Let r (p) ∈ ∂r(p), hpt (p) = kp − ptk1 and let gpt (pt+1) ∈ ∂hpt (pt+1). Then, for any p ∈ P, the optimality condition for (6.19) can be written as

0 hp − pt+1, ηt∇φt(pt) + ηtλ1r (pt+1) + ηtλ2gpt (pt+1) + ∇ψ(pt+1) − ∇ψ(pt)i ≥ 0 . (B.2)

Further, by convexity we have

∗ ∗ φt(pt) − φt(p ) ≤ hpt − p , ∇φt(pt)i , (B.3) ∗ ∗ 0 r(pt) − r(p ) ≤ hpt+1 − p , r (pt+1)i (B.4) ∗ ∗ kpt+1 − ptk1 − kp − ptk1 ≤ hpt+1 − p , gpt (pt+1)i . (B.5)

152 153 Hence, for any p∗ ∈ P

∗ ∗ ∗ ηt[φt(pt)−φt(p ) + λ1r(pt) − λ1r(p )+λ2kpt+1 −ptk1 − λ2kp −ptk1] ∗ ∗ 0 ∗ ≤ ηthpt − p , ∇φt(pt)i + ηtλ1hpt+1 − p , r (pt+1)i + ηtλ2hpt+1 − p , gpt (pt+1)i ∗ ∗ 0 ∗ = ηthpt+1 − p , ∇φt(pt)i + ηtλ1hpt+1 − p , r (pt+1)i + ηtλ2hpt+1 − p , gpt (pt+1)i

+ ηthpt − pt+1, ∇φt(pt)i ∗ 0 = hp − pt+1, ∇ψ(pt) − ∇ψ(pt+1) − ηt∇φt(pt) − ηtλ1r (pt+1) − ηtλ2gpt (pt+1)i ∗ + hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i .

The first term of the last equation is non-positive from (B.2). Thus we have,

∗ ∗ ∗ ηt[φt(pt) − φt(p ) + λ1r(pt) − λ1r(p )+λ2kpt+1 −ptk1 − λ2kp −ptk1] ∗ ≤ hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i ∗ ∗ = dψ(p , pt) − dψ(pt+1, pt) − dψ(p , pt+1) + ηt hpt − pt+1, ∇φt(pt)i *s r + ∗ ∗ λ ηt = dψ(p , pt)−dψ(pt+1, pt)−dψ(p , pt+1)+ηt (pt − pt+1), ∇φt(pt) ηt λ λ η2 ≤ d (p∗, p )−d (p , p )−d (p∗, p )+ ||p −p ||2 + t ||∇φ (p )||2 ψ t ψ t+1 t ψ t+1 2 t t+1 2λ t t η2 ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 , ψ t ψ t+1 2λ t t where in the second to last inequality follows from the Fenchel-Young inequality and λ 2 ∗ from using dψ(pt+1, pt) ≥ 2 kpt+1 − ptk . Rearranging terms and using kp − ptk1 ≤ L ends the proof.

Theorem 4 Let the sequence of {pt} be defined by (6.19). Let φt be a Lipschitz continuous function for which k∇φ (p )k2 ≤ G. Then, by choosing η ∝ √1 and λ ∝ t t 2 T 2 √1 , we have T

T √ X ∗ ∗ [φt(pt) + r(pt) + λ2kpt+1 − ptk1 − φ(p ) − r(p )] ≤ O( T ) . (B.6) t=1 154

Proof: By Lemma 3, noting ηt = η, we have

T X ∗ ∗ η [φt(pt) + λ1r(pt) + λ2kpt+1−ptk1 − φt(p ) − λ1r(p )] t=1 T X η2G2T ≤ {d (p∗, p ) − d (p∗, p )+ηλ L} + ψ t ψ t+1 2 2λ t=1 T X η2G2T = d (p∗, p )−d (p∗, p )+ ηλ L+ . ψ 1 ψ T +1 2 2λ t=1

Noting that Bregman divergences are always non-negative, dropping the λ2kpT +1−pT k1 term, dividing by η , and using η = √c1 , λ = √1 where c , L > 0 are constants, we t T 2 T 1 have

T T −1 X ∗ ∗ X [φt(pt) + λ1r(pt) − φt(p ) − λ1r(p )] + λ2 kpt+1 − ptk1 t=1 t=1 T 1 X ηG2T ≤ d (p∗, p ) + L λ + η ψ 1 2 2λ t=1 √ √ √ c G2 T ≤ T d (p∗, p )/c + 2L T + 1 ψ 1 1 2λ √ ≤ O( T ) .

Lemma 4 Let the sequence of {pt} be defined by the update in (6.19) with potentially time-varying ηt and λ2. Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||2, i.e. λ 2 dψ(p, ˆp) ≥ 2 kp − ˆpk2 and ||p − ˆp||1 ≤ L, ∀p, ˆp ∈ P. Assuming φt are all β-strongly β ∗ convex, for any λ2 < 4 and any p ∈ P, we have

∗ ∗ ηt[φt(pt) − φt(p ) + λ1r(pt) − λ1r(p ) + λ2kpt+1 − ptk1] η2 β  (B.7) ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 − η − 2λ kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 2 t

0 Proof: Let r (p) ∈ ∂r(p), hpt (p) = kp − ptk1 and let gpt (pt+1) ∈ ∂hpt (pt+1). Then, for any p ∈ P, the optimality condition for (6.19) can be written as

0 hp − pt+1, ηt∇φt(pt) + ηtλ1r (pt+1) + ηtλ2gpt (pt+1) + ∇ψ(pt+1) − ∇ψ(pt)i ≥ 0 . 155 Further, by convexity we have β φ (p ) − φ (p∗) + kp∗ − p k2 ≤ hp − p∗, ∇φ (p )i , (B.8) t t t 2 t t t t ∗ ∗ 0 r(pt) − r(p ) ≤ hpt+1 − p , r (pt+1)i (B.9) ∗ ∗ kpt+1 − ptk1 − kp − ptk1 ≤ hpt+1 − p , gpt (pt+1)i . (B.10)

Hence, for any p∗ ∈ P β η [φ (p )−φ (p∗) + λ r(p ) − λ r(p∗)+λ kp −p k − λ kp∗ −p k + kp∗ − p k2] t t t t 1 t 1 2 t+1 t 1 2 t 1 2 t ∗ ∗ 0 ∗ ≤ ηthpt − p , ∇φt(pt)i + ηtλ1hpt+1 − p , r (pt+1)i + ηtλ2hpt+1 − p , gpt (pt+1)i ∗ ∗ 0 = ηthpt+1 − p , ∇φt(pt)i + ηtλ1hpt+1 − p , r (pt+1)i ∗ + ηtλ2hpt+1 − p , gpt (pt+1)i + ηthpt − pt+1, ∇φt(pt)i ∗ 0 = hp − pt+1, ∇ψ(pt) − ∇ψ(pt+1) − ηt∇φt(pt) − ηtλ1r (pt+1) − ηtλ2gpt (pt+1)i ∗ + hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i .

The first term of the last equation is non-positive. Thus we have, β η [φ (p )−φ (p∗) + λ r(p ) − λ r(p∗)+λ kp −p k − λ kp∗ −p k + kp∗ − p k2] t t t t 1 t 1 2 t+1 t 1 2 t 1 2 t ∗ ≤ hp − pt+1, ∇ψ(pt+1) − ∇ψ(pt)i + ηthpt − pt+1, ∇φt(pt)i ∗ ∗ = dψ(p , pt) − dψ(pt+1, pt) − dψ(p , pt+1) + ηthpt − pt+1, ∇φt(pt)i *s r + ∗ ∗ λ ηt = dψ(p , pt)−dψ(pt+1, pt)−dψ(p , pt+1)+ηt (pt − pt+1), ∇φt(pt) ηt λ λ η2 ≤ d (p∗, p )−d (p , p )−d (p∗, p )+ ||p −p ||2 + t ||∇φ (p )||2 ψ t ψ t+1 t ψ t+1 2 t t+1 2λ t t η2 ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 , ψ t ψ t+1 2λ t t where the second to last inequality follows from the Fenchel-Young inequality and from λ 2 using dψ(pt+1, pt) ≥ 2 kpt+1 − ptk . Rearranging terms, we have

∗ ∗ ηt[φt(pt)−φt(p ) + λ1r(pt) − λ1r(p )+λ2kpt+1 −ptk1] η2 η β ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 + η λ kp∗ −p k − t kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 t 1 2 t 156 2 Now, using the fact that for any u ∈ R, |u| ≤ 2u , we have

n n ∗ X ∗ X ∗ 2 ∗ 2 kp −ptk1 = |p (i) − pt(i)| ≤ 2 (p (i) − pt(i)) = 2kp − ptk . i=1 i=1

∗ β ∗ 2 Hence, for any λ2 < β/4, we have λ2kp −ptk1 < 2 kp − ptk , implying

∗ ∗ ηt[φt(pt)−φt(p ) + λ1r(pt) − λ1r(p )+λ2kpt+1 −ptk1] η2 β  ≤ d (p∗, p )−d (p∗, p )+ t ||∇φ (p )||2 − η − 2λ kp∗ − p k2 . ψ t ψ t+1 2λ t t t 2 2 t

Theorem 5 Let the sequence of {pt} be defined by (6.19). Let φt be all β-strongly 2 1 convex and k∇φt(pt)k2 ≤ G. Then, for any λ2 < β/4, choosing ηt = κt where κ ∈ 0 1 0 2 (0, β − λ2] and with dψ(p, p ) = 2 kp − p k2, we have

T X ∗ ∗ [φt(pt) + r(pt) + λ2kpt+1 − ptk1 − φ(p ) − r(p )] ≤ O(log T ) . (B.11) t=1 Proof: By Lemma 4, we have

T X ∗ ∗ [φt(pt) + λ1r(pt) − λ1r(p ) + λ2kpt+1 − ptk1 − φt(p )] t=1 T     X 1 1 β ηt ≤ d (p∗, p ) − d (p∗, p ) − − 2λ kp∗ − p k2 + ||∇φ (p )||2 η ψ t η ψ t+1 2 2 t 2λ t t t=1 t t 1 ∗ 1 ∗ ≤ dψ(p , p1)− dψ(p , pT +1) η1 ηT T −1 T X   1 1  β   G2 X + d (p∗, p ) − − − 2λ kp∗ − p k2 + η ψ t+1 η η 2 2 t+1 2λ t t=1 t+1 t t=1 T −1 1 X β κ cG2 ≤ d (p∗, p ) − − 2λ − kp∗ − p k2 + log T, η ψ 1 2 2 2 t+1 2λκ 1 t=1 where the last inequality follows from

T −1 X  1 1  κ d (p∗, p ) − = kp∗ − p k2 , ψ t+1 η η 2 t+1 t=1 t+1 t 157 ∗ 1 ∗ 2 PT 1 β κ for dψ(p , pt+1) = 2 kp − pt+1k , and t=1 t ≤ c log T . Now, since ( 2 − 2λ2 − 2 ) ≥ 0 T X ∗ ∗ [φt(pt) + λ1r(pt) + λ2kpt+1 − ptk1 − φt(p ) − λ1r(p )] t=1 cG2 ≤ κd (p∗, p ) + log T = O(log T ) . ψ 1 λκ

∗ ∗ Theorem 6 Let {p1, ··· , pT } be any sequence of portfolios serving as a comparator in (6.19). Let dψ(·, ·) be λ-strongly convex with respect to norm || · ||, i.e., dψ(p, ˆp) ≥ λ kp − ˆpk2 and ||p − ˆp|| ≤ L, ∀p, ˆp ∈ P. For η = η = √c1 and λ = √c2 for c , c > 0, 2 2 1 t T 2 T 1 2 1 1 r + q = 1, and ||∇ψ(pt)||r ≤ ζ, we have T T X ∗ ∗ X ∗ ∗ [φt(pt) + λ1r(pt) − φt(pt ) − λ1r(p )] + [λ2kpt+1 − ptk1 − λ2kpt+1 − pt k1] t=1 t=1 (B.12) √ ( T ) √ T X ≤ O( T )+ d (p∗, p ) − d (p∗ , p ) + ψ(p∗ ) − ψ(p∗)+ζ kp∗−p∗ k . c ψ 1 1 ψ T +1 T +1 T +1 1 t t+1 q 1 t=1 ∗ ∗ Proof: From Lemma 3 we have the following by substituting p = pt

∗ ∗ η[φt(pt) + λ1r(pt) + λ2kpt+1 − ptk1 − φt(pt ) − λ1r(pt )] (B.13) η2 ≤ d (p∗, p ) − d (p∗, p ) + ηλ L + ||∇φ (p )||2 . ψ t t ψ t t+1 2 2λ t t This does not telescope, so we need to add additional terms. To this end, consider ∗ ∗ ∗ ∗ ∗ ∗ dψ(pt , pt+1) − dψ(pt+1, pt+1) = ψ(pt ) − ψ(pt+1) + h∇ψ(pt+1), pt+1 − pt i (B.14) ∗ ∗ ∗ ∗ ≥ ψ(pt ) − ψ(pt+1) − ζ||pt − pt+1||q where the last inequality follows from Holder’s inequality, i.e.,

∗ ∗ ∗ ∗ ∗ ∗ hpt+1 − pt , ∇ψ(pt+1)i ≥ −k∇ψ(pt+1)krkpt − pt+1kq ≥ −ζkpt − pt+1kq .

∗ ∗ Adding (B.14) to (B.13), and further adding −ηλ2kpt+1 − pt k1 on both sides, we have 2 for k∇φt(pt)k ≤ G

∗ ∗ ∗ ∗ η[φt(pt) + λ1r(pt) + λ2kpt+1−ptk1 − φt(pt ) − λ1r(pt ) − λ2kpt+1−pt k1] (B.15) ∗ ∗ ∗ ∗ + ψ(pt ) − ψ(pt+1) − ζkpt −pt+1kq η2 ≤ d (p∗, p ) − d (p∗ , p ) − ηλ kp∗ − p∗k + ηλ L + G2 . (B.16) ψ t t ψ t+1 t+1 2 t+1 t 1 2 2λ 158 Summing over T rounds, we have

T X ∗ ∗ ∗ ∗ η [φt(pt) + λ1r(pt)+λ2kpt+1−ptk1−φt(pt ) − λ1r(pt )−λ2kpt+1−pt k1] (B.17) t=1 T ∗ ∗ X ∗ ∗ +ψ(p1)−ψ(pT +1)−ζ kpt −pt+1kq t=1 T T X η2 X ≤ d (p∗, p ) − d (p∗ , p ) + ηL λ + G2 . ψ 1 1 ψ T +1 T +1 2 2λ t=1 t=1 where we ignored some negative terms on the right hand side of the inequality. Rear- ranging, we have

T T X ∗ ∗ X ∗ ∗ η [φt(pt)+λ1r(pt)−φt(pt )−λ1r(pt )]+η [λ2kpt+1−ptk1−λ2kpt+1−pt k1] (B.18) t=1 t=1 ∗ ∗ ∗ ∗ ≤ dψ(p1, p1) − dψ(pT +1, pT +1) + ψ(pT +1) − ψ(p1) T T T X X η2 X + ζ ||p∗ − p∗ || + ηL λ + G2 t t+1 q 2 2λ t=1 t=1 t=1

Setting η = √c1 and λ = √c2 and dividing both sides by η gives T 2 T

T T X ∗ ∗ X ∗ ∗ [φt(pt)+λ1r(pt)−φt(pt )−λ1r(pt )]+ [λ2kpt+1−ptk1−λ2kpt+1−pt k1] (B.19) t=1 √ t=1 √ c TG2 ≤ c L T + 1 2 2λ √ ( T ) T X + d (p∗, p ) − d (p∗ , p ) + ψ(p∗ ) − ψ(p∗) + ζ ||p∗ − p∗ || . c ψ 1 1 ψ T +1 T +1 T +1 1 t t+1 q 1 t=1 Appendix C

Structured Stochastic Linear Bandits

C.1 Definitions and Background

The following definitions and lemmas can be found in [9, 10, 121].

Definition 9 A random variable x is sub-Gaussian if the moments satisfies

p 1 √ [E|x| ] p ≤ K p (C.1) for any p ≥ 1 with constant K. The minimum value of K is called the sub-Gaussian norm of x and denoted by |kxk|ψ2 . Additionally, every sub-Gaussian random variable satisfies ! t2 P (|x| > t) ≤ exp 1 − c (C.2) kxk2 ψ2 for all t ≥ 0.

p Definition 10 A random vector X ∈ R is sub-Gaussian if the one-dimensional marginals p hX, xi are sub-Gaussian random variables for all x ∈ R . The sub-Gaussian norm of X is defined as

|kXk|ψ2 = sup khX, xikψ2 (C.3) x∈Sp−1

159 160 p Definition 11 For any set A ∈ R , the Gaussian width of the set A is defined as   w(A) = E suphg, ui (C.4) u∈A where the expectation is over g ∼ N(0, Ip×p) which is a vector of independent zero-mean unit-variance Gaussian random variables.

Lemma 6 For any bounded random variable |X| ≤ B, then X is a sub-Gaussian ran- dom variable with kXkψ2 ≤ B.

Lemma 7 Consider a sub-Gaussian random vector X with sub-Gaussian norm K = maxi |kXik|ψ2 , then, for vector a, Z = hX, ai is a sub-Gaussian random variable with sub-Gaussian norm |kZk|ψ2 ≤ CKkak2 for absolute constant C.

C.2 Ellipsoid Bound

∗ Theorem 11 Assume that θˆt − θ ∈ Er,t, the RE condition is satisfied in the set Er,t with parameter κ, and λt is suitably large. Then for any norm R(·) we have for constant c > 0 λ √ kθˆ − θ∗k ≤ cψ(E )√t t . (C.5) t 2,Dt r,t κ 1 2 Proof: Proof of Theorem 11. Define L(θ) = t kyt − Xtθk2 then

∗ ∗ ∗ ∗ ∗ ∗ L(θˆt) − L(θ ) − h∇L(θ ), θˆt − θ i = L(θˆt) − L(θ ) − h∇L(θ ), θˆt − θ i ∗ ∗ ∗ ∗ ∗ ∗ L(θˆt) − L(θ ) = h∇L(θ ), θˆt − θ i + L(θˆt) − L(θ )−h∇L(θ ), θˆt−θ i 1 L(θˆ ) − L(θ∗) = h∇L(θ∗), θˆ − θ∗i + kX (θˆ − θ∗)k2 . (C.6) t t t t t 2 By the definition of a dual norm

∗ ∗ ∗ ∗ ∗ |h∇L(θ ), θˆt − θ i| ≤ R (∇L(θ ))R(θˆt − θ ) .

By construction following (9.4) from Section 9.2, for any ρ > 0 (not just ρ = 2) we get

λ R∗(∇L(θ∗)) ≤ t ρ 161 which implies λ |h∇L(θ∗), θˆ − θ∗i| ≤ t R(θˆ − θ∗) t ρ t λ ⇒ h∇L(θ∗), θˆ − θ∗i ≥ − t R(θˆ − θ∗) . (C.7) t ρ t

Therefore, substituting (C.7) in (C.6)

λ 1 L(θˆ ) − L(θ∗) ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 . (C.8) t ρ t t t t 2 By the triangle inequality we have

∗ ∗ R(θˆt) − R(θ ) ≥ −R(θˆt − θ ) .

∗ ∗ Adding λt(R(θˆt) − R(θ )) ≥ −λtR(θˆt − θ ) to (C.8) λ 1 L(θˆ ) − L(θ∗) + λ (R(θˆ ) − R(θ∗)) ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 − λ R(θˆ − θ∗) . t t t ρ t t t t 2 t t

ˆ ˆ ∗ ˆ ∗ Since θt := argminθ L(θ) + λtR(θ) then L(θt) − L(θ ) + λt(R(θt) − R(θ )) ≤ 0 therefore λ 1 0 ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 − λ R(θˆ − θ∗) . ρ t t t t 2 t t Re-arranging 1 + ρ 1 0 ≥ − λ R(θˆ − θ∗) + kX (θˆ − θ∗)k2 . ρ t t t t t 2

R(u) By the definition of the norm compatibility constant ψ(Er,t) = sup we have u∈Er,t kuk2 ∗ ∗ ∗ ∗ R(θˆt−θ ) ≤ kθˆt−θ k2ψ(Er,t) which implies −R(θˆt−θ ) ≥ −kθˆt−θ k2ψ(Er,t). Therefore 1 + ρ 1 0 ≥ − λ kθˆ − θ∗k ψ(E ) + kX (θˆ − θ∗)k2 . ρ t t 2 r,t t t t 2

ˆ ∗ 1+ρ λt Substituting in the bound kθt − θ k2 ≤ ρ κ ψ(Er,t) we obtain 1 + ρ 1 + ρ λ 1 0 ≥ − λ t ψ(E )ψ(E ) + kX (θˆ − θ∗)k2 ρ t ρ κ r,t r,t t t t 2 1 + ρ2 λ2 1 0 ≥ − t ψ2(E ) + kX (θˆ − θ∗)k2 . ρ κ r,t t t t 2 162 Therefore,

1 + ρ2 λ2 1 t ψ2(E ) ≥ kX (θˆ − θ∗)k2 ρ κ r,t t t t 2

and multiplying by t and taking the square root on both sides we obtain 1 + ρ λ √ √t ψ(E ) t ≥ kX (θˆ − θ∗)k . ρ κ r,t t t 2

ˆ ∗ ˆ ∗ > Noting that kXt(θt − θ )k2 = kθt − θ k2,Dt where Dt = Xt Xt end the proof.

C.3 Algorithm

In this section, we will show that selecting an xt+1 following Algorithm 9, that we can ∗ ∗ compute a θet+1 such that the inequality hxt+1, θet+1i ≤ hx , θ i holds.

Lemma 8 For a decision set X and and a confidence ellipsoid Ct, if we compute

0 0 (xt+1, θt+1) = argmin hx, θi x∈X p−1 θ∈Ct∩S and set xt+1 and θet+1 as

0 0 kxt+1k2 xt+1 = x + √ v ∩ X t+1 2 t 0 0 1 xt+1 θet+1 = θt+1 − √ 0 t kxt+1k2 where v is a random vector such that kvk2 ≤ 1 then the inequality hxt+1, θet+1i ≤ 0 0 ∗ ∗ hxt+1, θt+1i ≤ hx , θ i holds.

0 0 ∗ ∗ ∗ Proof: First, the inequality hxt+1, θt+1i ≤ hx , θ i holds because we assume kθ k2 = 1, ∗ θ ∈ Ct with high-probability, and the optimization is over both x and θ. Now, we will show that for all xt+1 we can compute as above (without the intersection) there exists a θet+1 such that the inequality is satisfied which implies that the same holds for those 163 xt+1 in the intersection. First, observe  0 0  0 kxt+1k2 0 1 xt+1 hxt+1, θet+1i = xt+1 + √ v, θt+1 − √ 0 2 t t kxt+1k2 0 0 kx k2 kx k2 1 = hx0 , θ0 i − √t+1 + t√+1 hv, θ0 i − hv, x0 i . t+1 t+1 t 2 t t+1 2t t+1 We need to show that

0 0 kx k2 kx k2 1 hx0 , θ0 i − √t+1 + t√+1 hv, θ0 i − hv, x0 i ≤ hx0 , θ0 i t+1 t+1 t 2 t t+1 2t t+1 t+1 t+1 0 0 kx k2 kx k2 1 ⇒ t√+1 hv, θ0 i ≤ √t+1 + hv, x0 i 2 t t+1 t 2t t+1 0 kxt+1k2 0 0 1 0 ⇒ hv, θ i ≤ kx k2 + √ hv, x i 2 t+1 t+1 2 t t+1 0 kx k2 1 ⇒ − t+1 ≤ √ hv, x0 i (since |hv, θ0 i| ≤ 1) 2 2 t t+1 t+1 0 1 0 ⇒ −kx k2 ≤ √ hv, x i t+1 t t+1 1 0 ⇒ 0 ≤ kxt+1k2 + √ hv, x i t t+1 1 0 ⇒ 0 ≤ kxt+1k2 − √ kx k2 (from Cauchy-Schwarz inequality and kvk2 = 1) (C.9) t t+1 where the last line holds with equality for t = 1 and strict inequality for t > 1. Finally, since θ lives inside a ball centered at θ0 ∈ C with radius of the order √1 this et+1 t+1 t t 0 implies that θet+1 is captured within the confidence ellipsoid Ct ⊇ Ct with radius cβ for constant c ≥ 1.

C.4 Bound on Regularization Parameter λt

We will prove the following main theorem. Theorem 12 For any γ > 0 and for absolute constant L > 0, with probability at least 1 − L exp(−γ2), the following bound holds uniformly for all t = 1,...,T :   p 2 φ(ΩR)   w(ΩR) + γ + log T ∗ 1 > ∗ 2 R X (yt − Xtθ ) ≤ 2LKB √ . (C.10) t t t 164 Proof: Proof of Theorem 12.

Recall the regularization parameter λt needs to satisfy the inequality 1  λ ≥ ρR∗(∇L(θ∗,Z )) = ρR∗ X>(y − X θ∗) (C.11) t t t t t t for ρ > 1. Two issues of the right hand side are (1) the expression depends on the unknown parameter θ∗ and (2) the expression is a random variable since it depends on n vectors selected uniformly at random from the decision set X and a sequence of ∗ random noise terms η1, . . . , ηt. We can remove the dependence on θ by observing that ∗ > yt − Xtθ is precisely the t-dimensional noise vector ωt = [η1 . . . ηt] . Therefore,

1  1  R∗ X>(y − X θ∗) = R∗ X>ω . (C.12) t t t t t t t

∗ 1 >  1 > By the definition of the dual norm R t Xt ωt = supR(u)≤1 t Xt ωt, u . The proof 1 > involves showing that t hXt ωt, ui is a martingale difference sequence (MDS) which con- centrates as a sub-Gaussian random variable. Then, using a generic chaining argument, we show the supremum of such a quantity also concentrates as a sub-Gaussian random variable. We begin by observing that 1 hX>ω , ui = √1 √1 hX>ω , ui. We will save one t t t t t t t of the √1 terms for later and now proceed to show how √1 X>ω , u concentrates. t t t t

√1 X>ω , u Concentrates as a Sub-Gaussian t t t First, let   1 D > E 1 > u 1 D > E √ Xt ωt, u = kuk2 √ Xt ωt, = kuk2 √ Xt ωt, q (C.13) t t kuk2 t

u 1 > where q = . We focus on the term √ X ωt, q . We can construct a martingale kuk2 t t difference sequence (MDS) by observing that

t t D > E X X Xt ωt, q = hωt,Xtqi = ητ hxτ , qi = zτ (C.14) τ=1 τ=1 for zτ = ηthxτ , qi. Let the filtration be defined as Ft = {x1, . . . , xt+1, η1, . . . , ηt}. Each zτ can be seen as a MDS since

E[zτ |Fτ−1] = E[ητ hxτ , qi|Fτ−1] = hxτ , qi · E[ητ |Fτ−1] = 0 (C.15) 165 because xτ is Fτ−1 measurable and ητ is Fτ measurable. Additionally, each zτ follows a sub-Gaussian distribution with parameter KB because |kητ hxτ , qik|ψ2 ≤ KB (Assump- tion 2). Since each zτ is a bounded MDS, we use the Azuma-Hoeffding inequality to Pt show τ=1 zτ concentrates as a sub-Gaussian with parameter KB. For all γ ≥ 0 ! t   X > P zτ ≥ γ = P hXt ωt, qi ≥ γ τ=1      2  > u −γ = P Xt ωt, ≥ γ ≤ 2 exp 2 2 kuk2 2tK B  1  u    −ζ2  √ > = P Xt ωt, ≥ ζ ≤ 2 exp 2 2 . (C.16) t kuk2 2K B

1 D > u E From (C.16) and (C.2) in Definition 9 (Section C.1) we can see that √ X ωt, t t kuk2 concentrations as a sub-Gaussian with |khX>ω , u ik| ≤ KB. t t kuk2 ψ2 Next, we show that the term √1 X>ω , u also concentrations as a sub-Gaussian t t t > with |khXt ωt, uik|ψ2 ≤ kuk2KB using (C.16) as     1 > u P √ Xt ωt, ≥ ζ t kuk2     1 > u =P kuk2 √ Xt ωt, ≥ kuk2ζ t kuk2  1 D E   −2  =P √ X>ω , u ≥  ≤ 2 exp (C.17) t t 2 2 2 t 2kuk2K B where  = kuk2ζ which implies ζ = /kuk2. We went through showing the above because the generic chaining argument we will invoke to bound sup √1 X>ω , u requires R(u)≤1 t t t that √1 X>ω , u is a sub-Gaussian random variable. t t t

Bound on sup √1 hX>ω , ui via Generic Chaining R(u)≤1 t t t

We obtain a high-probability bound on sup √1 hX>ω , ui using a generic chaining R(u)≤1 t t t argument from [115, 116]. This involves (1) showing that the absolute difference of two sub-Gaussian processes concentrates as a sub-Gaussian, (2) showing the expectation over the supremum of the absolute difference of two sub-Gaussian processes is upper bounded by the sub-Gaussian width of a set from which the processes are indexed from, and (3) showing the supremum of a sub-Gaussian process is concentrated around its expectation and therefore, around the sub-Gaussian width with high-probability. 166 (1) Sub-Gaussian Process Concentration

First, we show that the absolute difference of two sub-Gaussian processes concentrates as a sub-Gaussian. Let Y = √1 hX>ω , ui indexed by u ∈ Ω and Y = √1 hX>ω , vi u t t t R v t t t indexed by v ∈ ΩR be two zero-mean (since they are both a MDS sum), random symmetric processes (since (Yu)u∈Ωr has the same law as (−Yu)u∈ΩR via (C.16) and ωt is symmetric and similarly for Yv). Then by construction

1 D > E |Yu − Yv| = √ X ωt, u − v . t t Using the bound we established in (C.17), we obtain the following bound on the absolute difference of two sub-Gaussian random processes Yu and Yv as

 1 D E   −2  P √ X>ω , u − v ≥  ≤ 2 exp (C.18) t t 2 2 2 t 2ku − vk2K B which shows |Yu − Yv| concentrates as a sub-Gaussian random variable with |kYu −

Yvk|ψ2 = ku − vk2KB.

h 1 > i (2) Bound on E supR(u)≤1 t hXt ωt, ui 1 > In order to establish a high-probability bound on supR(u)≤1 t hXt ωt, ui we need to prove h 1 > i a bound on E supR(u)≤1 t hXt ωt, ui . To prove such a bound, we will apply a generic chaining argument for upper bounds on such sub-Gaussian processes. For the generic chaining argument, we will need the result in (C.18) and the following lemma.

Lemma 9 ([115], Theorem 2.1.5) Consider two processes (Yu)u∈ΩR and (Xu)u∈ΩR indexed by the same set. Assume that the process (Xu)u∈ΩR is Gaussian and that the process (Yu)u∈ΩR satisfies the condition

 2  ∀ > 0, ∀u, v ∈ Ω ,P (|Y − Y | ≥ ) ≤ 2 exp − (C.19) R u v d(u, v)2 where d(u, v) is a distance function which we assume is d(u, v) = ku − vk2 for the set Ω . Then we have R " # " # E sup |Yu − Yv| ≤ LE sup Xu (C.20) u,v∈ΩR u∈ΩR where L is an absolute constant. 167 sup X  is the Gaussian width w(Ω ) of the set Ω (Definition 11, Section C.1). E u∈ΩR u R R Since our process |Yu −Yv| concentrates as a sub-Gaussian, we scale the Gaussian width by the sub-Gaussian parameter similar to [10, Theorem 8] to get " # " # E sup |Yu − Yv| ≤ LKBE sup Xu = LKBw(ΩR) . (C.21) u,v∈ΩR u∈ΩR

The second result we need is the following.

Lemma 10 ([115], Lemma 1.2.8) If the process (Yu)u∈ΩR is symmetric then " # " # E sup |Yu − Yv| = 2E sup Yu . (C.22) u,v∈ΩR u∈ΩR

Since our processes are symmetric we get the following lemma.

Lemma 11 From (C.18) the condition of Lemma 9 is satisfied in the sub-Gaussian case so using Lemma 9 and Lemma 10 for some absolute constant L we obtain " # " # E sup |Yu − Yv| = 2E sup |Yu| ≤ 2LKBw(ΩR) (C.23) u,v∈ΩR u∈ΩR " # 1 D E > ⇒ 2E sup √ Xt ωt, u ≤ 2LKBw(ΩR) . u∈ΩR t

(3) Concentration of sup √1 X>ω , u R(u)≤1 t t t To complete the argument, we need the following lemma.

Lemma 12 ([116], Theorem 2.2.27) If the process (Yu) satisfies (C.19) or similarly (C.18) for the sub-Gaussian case then for  > 0 one has !  2 P sup |Yu − Yv| ≥ L γ2(ΩR, d(u, v)) + ∆(ΩR) ≤ L exp(− ) . (C.24) u,v∈ΩR

Note, the function ∆(Ω ) = sup d(u, v) is the diameter of the set Ω . For our R u,v∈ΩR R setting, d(u, v) = ku − vk2 so we replace ∆(ΩR) with φ(ΩR) as detailed in Definition 5 in Section 9.3.1. The specifics of the γ2(·, ·) function are not necessary for this work since we can bound it and simplify Lemma 12 by using the following lemma. 168 Lemma 13 ([116], Theorem 2.4.1) For some universal constant L we have " # 1 γ2(ΩR, d(u, v)) ≤ E sup Yu ≤ Lγ2(ΩR, d(u, v)) . (C.25) L u∈ΩR

Combining Lemma 12 with Lemma 13, using Lemma 11, and our definitions of Yu and

Yv for any  > 0 we get

Lemma 14 ! ! 1 D E   2 > P sup √ Xt ωt, u ≥ 2LKBw(ΩR) +  ≤ L exp − . R(u)≤1 t LKBφ(ΩR) (C.26)

Proof: Proof of Lemma 14. !  P sup |Yu − Yv| ≥ L γ2(ΩR, d(u, v)) + ζ∆(ΩR) u,v∈ΩR ! = P sup |Yu − Yv| ≥ Lγ2(ΩR, d(u, v)) +  u,v∈ΩR " # ! ≤ P sup |Yu − Yv| ≥ E sup |Yu − Yv| +  u,v∈ΩR u,v∈ΩR " # ! = P sup |Yu| ≥ 2E sup |Yu| +  u∈ΩR u∈ΩR ! ! 1 D E   2 > = P sup √ Xt ωt, u ≥ 2LKBw(ΩR) +  ≤ L exp − . R(u)≤1 t LKBφ(ΩR) where the first line comes from the left-hand side of Lemma 12, the second line comes from the fact that ∆(ΩR) ≤ γ2(ΩR, d(u, v)) from [116] Definition 2.2.19, the third line comes from Lemma 13, the fourth line comes from Lemma 10, the fifth line comes from

Lemma 11, and the last line follows from our construction of the process Yu and the right-hand side of Lemma 12. √ √ √ Dividing the other t through and setting / t = α2LKBw(ΩR)/ t we get

Lemma 15      2! ∗ 1 > w(ΩR) 2αw(ΩR) P R Xt ωt ≥ 2LKB(1 + α) √ ≤ L exp − . (C.27) t t φ(ΩR) 169 Proof: Proof of Lemma 15

     2! ∗ 1 >  P R √ Xt ωt ≥ 2LKBw(ΩR) +  ≤ L exp − t LKBφ(ΩR)      √ 2! ∗ 1 > w(ΩR) tγ P R Xt ωt ≥ 2LKB √ + γ ≤ L exp − t t LKBφ(ΩR)

 √ w(Ω )2     tα2LKB √ R ∗ 1 > w(ΩR) w(ΩR) t P R Xt ωt ≥2LKB √ +2LKBα √ ≤L exp−   t t t LKBφ(ΩR)      2! ∗ 1 > w(ΩR) 2αw(ΩR) P R Xt ωt ≥ 2LKB(1 + α) √ ≤ L exp − . t t φ(ΩR) where the first inequality is from Lemma 14, the second inequality is from multiplying both sides by √1 and setting γ = √ , and the third inequality is from setting γ = t t w(Ω ) α2LKB √ R . t ∗ >  Lemma 15 gives a high-probability bound on the value of R Xt ωt for round t but to complete the proof of Theorem 12 we need a bound which holds simultaneously for all rounds T with high-probability. To obtain such a bound, we can set α2 =  2 (γ2 + log T ) φ(ΩR) and apply a union bound for all t 2w(ΩR)

T        [ 1 p φ(ΩR) w(ΩR) P R∗ X>ω ≥ 2LKB 1 + γ2 + log T √ t t t 2w(Ω ) t=1 R t T  2  2! X φ(ΩR) 2w(ΩR) ≤ L exp −(γ2 + log T ) 2w(Ω ) φ(Ω ) t=1 R R T X = L exp −γ2 − log T  t=1 T X 1 = L exp −γ2 × T t=1 = L exp −γ2 .

Rearranging the terms ends the proof of Theorem 12. Appendix D

Structured Stochastic Generalized Linear Bandits

D.1 Generalized Ellipsoid Bound

∗ Theorem 15 Assume that θˆt − θ ∈ Er,t, the RSC condition is satisfied in the set Er,t with parameter κ, and λt is suitably large. Then for any norm R(·) we have for constant c > 0 λ √ kθˆ − θ∗k ≤ cψ(E )√t t . (D.1) t 2,Dt r,t κ Proof: The following decomposition follows [10]. Let L(θ) be the negative log likeli- 1 Pt hood loss function L(θ) = t i=1 {ϕ(hxi, θi) − yihxi, θi} then we have

t t 1Xn o 1X L(θˆ )−L(θ∗)−h∇L(θ∗), θˆ −θ∗i= ϕ(hx , θˆ i)−y hx , θˆ i − {ϕ(hx , θ∗i)−y hx , θ∗i} t t t i t i i t t i i i i=1 i=1 * t + 1 X − x ϕ0(hx , θ∗i) − y x , θˆ − θ∗ t i i i i t i=1 where ϕ0(·) denotes the first derivative of ϕ(·). Simplifying and applying the mean value theorem twice we obtain

t ∗ ∗ ∗ 1 X 00 ∗ ∗ ∗ 2 L(θˆ ) − L(θ ) − h∇ ∗ L(θ ), θˆ − θ i = ϕ (hx , θ i + γ hx , θˆ − θ i)hx , θˆ − θ i t θ t t i i i t i t i=1

170 171 where γ ∈ [0, 1]. Since the log-partition function is convex, its second derivative is non-negative and we will assume that it is bounded away from zero, i.e., there exists a constant ` such that ϕ00(·) ≥ ` > 0. Therefore,

t ∗ ∗ ∗ 1 X 00 ∗ ∗ ∗ 2 L(θˆ ) − L(θ ) − h∇ ∗ L(θ ), θˆ − θ i = ϕ (hx , θ i + γ hx , θˆ − θ i)hx , θˆ − θ i t θ t t i i i t i t i=1 t ` X ≥ hx , θˆ − θ∗i2 t i t i=1 ` = kX (θˆ − θ∗)k2 . t t t 2 Then we can follow the analysis in Section C.2 as

∗ ∗ ∗ ∗ ∗ ∗ L(θˆt) − L(θ ) − h∇L(θ ), θˆt − θ i = L(θˆt) − L(θ ) − h∇L(θ ), θˆt − θ i ∗ ∗ ∗ ∗ ∗ ∗ L(θˆt) − L(θ ) = h∇L(θ ), θˆt − θ i+L(θˆt) − L(θ )−h∇L(θ ), θˆt−θ i ` L(θˆ ) − L(θ∗) ≥ h∇L(θ∗), θˆ − θ∗i + kX (θˆ − θ∗)k2 . (D.2) t t t t t 2 By the definition of a dual norm

∗ ∗ ∗ ∗ ∗ |h∇L(θ ), θˆt − θ i| ≤ R (∇L(θ ))R(θˆt − θ ) .

By construction following (10.4) from Section 10.2.1, for any ρ > 0 we get λ R∗(∇L(θ∗)) ≤ t ρ which implies λ |h∇L(θ∗), θˆ − θ∗i| ≤ t R(θˆ − θ∗) t ρ t λ ⇒ h∇L(θ∗), θˆ − θ∗i ≥ − t R(θˆ − θ∗) . (D.3) t ρ t Therefore, substituting (D.3) in (D.2) λ ` L(θˆ ) − L(θ∗) ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 . (D.4) t ρ t t t t 2 By the triangle inequality we have

∗ ∗ R(θˆt) − R(θ ) ≥ −R(θˆt − θ ) . 172 ∗ ∗ Adding λt(R(θˆt) − R(θ )) ≥ −λtR(θˆt − θ ) to (D.4) λ ` L(θˆ ) − L(θ∗) + λ (R(θˆ ) − R(θ∗)) ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 − λ R(θˆ − θ∗) . t t t ρ t t t t 2 t t

ˆ ˆ ∗ ˆ ∗ Since θt := argminθ L(θ) + λtR(θ) then L(θt) − L(θ ) + λt(R(θt) − R(θ )) ≤ 0 therefore λ ` 0 ≥ − t R(θˆ − θ∗) + kX (θˆ − θ∗)k2 − λ R(θˆ − θ∗) . ρ t t t t 2 t t Re-arranging 1 + ρ ` 0 ≥ − λ R(θˆ − θ∗) + kX (θˆ − θ∗)k2 . ρ t t t t t 2

By the definition of the norm compatibility constant ψ(E ) = sup R(u) we have r,t u∈Er,t kuk2 ∗ ∗ ∗ ∗ R(θˆt−θ ) ≤ kθˆt−θ k2ψ(Er,t) which implies −R(θˆt−θ ) ≥ −kθˆt−θ k2ψ(Er,t). Therefore 1 + ρ ` 0 ≥ − λ kθˆ − θ∗k ψ(E ) + kX (θˆ − θ∗)k2 . ρ t t 2 r,t t t t 2

ˆ ∗ 1+ρ λt Substituting in the bound kθt − θ k2 ≤ ρ κ ψ(Er,t) we obtain 1 + ρ 1 + ρ λ ` 0 ≥ − λ t ψ(E )ψ(E ) + kX (θˆ − θ∗)k2 ρ t ρ κ r,t r,t t t t 2 1 + ρ2 λ2 ` 0 ≥ − t ψ2(E ) + kX (θˆ − θ∗)k2 . ρ κ r,t t t t 2

Therefore,

1 + ρ2 λ2 ` t ψ2(E ) ≥ kX (θˆ − θ∗)k2 ρ κ r,t t t t 2

t and multiplying by ` and taking the square root on both sides we obtain √ 1 + ρ λt ∗ √ √ ψ(Er,t) t ≥ kXt(θˆt − θ )k2 . ρ ` κ

ˆ ∗ ˆ ∗ > Noting that kXt(θt − θ )k2 = kθt − θ k2,Dt where Dt = Xt Xt ends the proof.