Evolutionary Dynamics of Reinforcement Learning Algorithms in Strategic Interactions
Total Page:16
File Type:pdf, Size:1020Kb
A DISSERTATION IN ARTIFICIAL INTELLIGENCE Evolutionary Dynamics of Reinforcement Learning Algorithms in Strategic Interactions MAASTRICHT UNIVERSITY Learning against Learning Evolutionary Dynamics of Reinforcement Learning Algorithms in Strategic Interactions Dissertation to obtain the degree of Doctor at Maastricht University, on the authority of the Rector Magnificus, Prof. dr. L. L. G. Soete, in accordance with the decision of the Board of Deans, to be defended in public on Monday, 17 December 2012, at 16.00 hours by Michael Kaisers born on 8 April 1985 in Duisburg, Germany Supervisor: Prof. dr. G. Weiss Co-supervisors: Dr. K.P. Tuyls Prof. dr. S. Parsons, City University of New York, USA Assessment Committee: Prof. dr. ir. R. L. M. Peeters (chairman) Dr. ir. K. Driessens Prof. dr. M. L. Littman, Brown University, USA Prof. dr. P. McBurney, King’s College London, UK Prof. dr. ir. J. A. La Poutr´e, Utrecht University Prof. dr. ir. J.C. Scholtes The research reported in this dissertation has been funded by a TopTalent 2008 grant of the Netherlands Organisation for Scientific Research (NWO). SIKS Dissertation Series No. 2012-49 The research reported in this dissertation has been carried out under the auspices of the Dutch Research School for Information and Knowledge Systems (SIKS). ISBN: 978-94-6169-331-0 c Michael Kaisers, Maastricht 2012 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronically, mechanically, photocopying, recording or otherwise, without prior permission of the author. Preface This dissertation concludes a chapter of my life. It is the destination of a journey of seven years of studying, through which I grew personally and professionally. Maastricht became my second home—yet I travelled to so many other places, living in New York City, Eindhoven and New Jersey in between. When I look back and see how many wonderful people have accompanied me, I am grateful to every one of them. Dear Karl, you have guided my studies no less than 6 years, and following my Bachelor’s and Master’s thesis this is the third work I have written under your su- pervision. I have learned a lot from you, not only scientifically. Your consideration of moral and political correctness inspires me and now influences many of my own de- cisions. Dear Simon, the research visit to your lab vastly broadened my horizon with respect to research, but also in a wider sense. The ideas which grew during the visit became the basis of my PhD proposal that eventually enabled me to perform four years of self-determined research, sponsored by a TopTalent 2008 grant of the Netherlands Organisation for Scientific Research (NWO). Of course, writing a dissertation is not an easy task, but your encouragements throughout the project have helped motivate me to push through. Dear Gerhard, I have enjoyed co-chairing EUMAS with you, and I appreciate the protective leadership you contribute to our department. Your fair credit assignment creates a feeling of appreciation for good work that keeps the spirit high. Besides my promotors, my dissertation has greatly benefited from the comments and suggestions of the assessment committee. A special thanks goes to Michael Littman for his extensive comments and the inspirational time I had when visiting his lab; I still draw research ideas from the discussions we had. Dear Kurt, thank you for your comments and also for the fruitful collaboration on planning using Monte Carlo tech- niques, a topic that is not covered in my dissertation but that I intend to follow up on. Dear Peter, the repeated encounters in the last four years have been very encouraging, thank you for sharing your enthusiasm. Dear Ralf, Jan and Han, thank you for your assessment, I highly appreciate your support. Many colleagues have gone part of the way with me. I enjoyed the many conver- sations with my colleagues in Maastricht, talking about intellectual topics, but also digressing into absurdity at times. To my friends and colleagues abroad, dear George, Tuna, Ari, and Mike, I will remember the time we had and hope our ways keep crossing in the future. I also want to express my gratitude to Jinzhong Niu and Steve Phelps for the stimulating conversations that got me off the ground in simulating auctions. To the PhD candidates who still have their dissertation ahead, it has been great to have you around, and I hope to make it back for your defenses. i PREFACE A PhD should of course not be all work and no play. Luckily, my spare time was always filled with exciting events and people to hang out with. Dear Hennes, you have been great company, and professionally we became something like Karl’s twins. I’m very glad we could walk the way together, and many stretches of time would have been difficult to master alone. Your smarts are incredible, and I appreciate your relaxed attitude. Dear Rena, thanks for exploring the world with me. Dear Gesa, thanks for catching me from my lows, accepting me and being honest with me. Dear Joran, Mare, Jason and Mautje, Salsa would not have been the same without you, I will never forget the many nights we danced. Dear friends of PhD Academy, I have had a great time with you, and I learned so much with you that no course could teach me, be it in the board, in one of the activities, like improvisation theater, or while organizing the PhD conference. To everyone I met while paragliding, thank you for being there with me, dreaming the dream of human flight. While my journey would have not been half as fun without these people accompa- nying me, my roots have been equally important. My family has been a constant source of unconditional support, and I appreciate that deeply. Liebe Großeltern, liebe Eltern, liebe Tanten und Onkel, liebe Cousinen und Cousins, liebe Marie, danke daf¨ur, dass ihr einfach da seid und mir einen Hafen bietet, wenn ich von meinen Reisen zur¨uck komme. Last but not least, dear Pia, your love carries me through any doubt and fears. You make me believe in me, and you give me a home wherever we go. It is impossible to name everyone who deserves mentioning; so please forgive me if your name is not here. I cherish the memories of every one who once made my day, because you all made Maastricht a beautiful place to live, and the most wonderful memory to keep. Thank you all, Michael ii Contents Preface i Contents iii 1 Introduction 1 1.1 Motivationandscope ............................ 2 1.2 Relatedwork ................................. 4 1.3 Problemstatement .............................. 5 1.4 Researchquestions .............................. 7 1.5 Contributions and structure of this dissertation . ... 7 1.6 Relationtopublishedwork ......................... 9 2 Background 11 2.1 Reinforcementlearning. .. .. .. .. .. .. .. .. .. .. .. 11 2.1.1 The multi-armed bandit problem and regret . 12 2.1.2 Markovdecisionprocesses . 14 2.1.3 Stochasticgames ........................... 15 2.2 Reinforcement-learning algorithms . 16 2.2.1 Crosslearning ............................ 16 2.2.2 Regret minimization . 17 2.2.3 Q-learning............................... 18 2.3 Reinforcement learning in strategic interactions . .. 19 2.3.1 Challenges of multi-agent reinforcement learning . 20 2.3.2 Multi-agent reinforcement learning . 20 2.4 Gametheoryandstrategicinteractions. 22 2.4.1 Classicalgametheory ........................ 22 2.4.2 Evolutionarygametheory . 24 2.4.3 Realpayoffsandtheheuristicpayofftable . 28 2.5 Dynamical systems of multi-agent learning . 30 2.5.1 Cross learning and the replicator dynamics . 31 2.5.2 Learning dynamics of regret minimization . 32 2.5.3 An idealized model of Q-learning . 33 2.5.4 Dynamicalsystemsofgradientascent . 34 2.6 Summary and limitations of the state of the art . 35 iii CONTENTS 3 Frequency Adjusted Q-learning 37 3.1 Discrepancy between Q-learning and its idealized model . 38 3.2 Implementing the idealized model of Q-learning . 40 3.2.1 The algorithm Frequency Adjusted Q-learning . 41 3.2.2 Experimentsandresults . 42 3.2.3 Discussion............................... 45 3.3 Convergenceintwo-actiontwo-playergames . .. 45 3.3.1 Preliminaries . 45 3.3.2 Proofofconvergence. .. .. .. .. .. .. .. .. .. .. 47 3.3.3 Experiments ............................. 50 3.3.4 Discussion............................... 52 3.4 Summary ................................... 54 4 Extending the dynamical systems framework 55 4.1 Time-dependent exploration rates in Q-learning . 55 4.1.1 The time derivative of softmax activation . 56 4.1.2 Frequency Adjusted Q-learning with varying exploration . 57 4.1.3 Designing appropriate temperature functions . 59 4.1.4 Experimentsinauctions . 61 4.1.5 Discussion............................... 65 4.2 Learning dynamics in stochastic games . 65 4.2.1 Multi-state Q-value dynamics . 65 4.2.2 Multi-state policy dynamics . 66 4.2.3 DynamicsofSARSA......................... 67 4.2.4 Networks of learning algorithms . 68 4.3 Lenient learning in cooperative games . 69 4.3.1 LenientFrequencyAdjustedQ-learning . 70 4.3.2 Experimentsandresults . 70 4.3.3 Discussion............................... 73 4.4 Summary ................................... 73 5 New perspectives 75 5.1 An orthogonal visualization of learning dynamics . 75 5.1.1 Method ................................ 76 5.1.2 Experiments ............................. 78 5.1.3 Discussion............................... 81 5.2 Reinforcement learning as stochastic gradient ascent . .... 82 5.2.1 Evolutionary dynamics of reinforcement learning . 83 5.2.2 Similarities in two-player two-action games . 84 5.2.3 Generalizationtonormalformgames. 88 5.2.4 Discussion............................... 89 5.3 Summary ................................... 89 iv CONTENTS 6 Applications 91 6.1 The value of information in auctions . 91 6.1.1 Relatedwork ............................. 92 6.1.2 Marketmodel............................. 94 6.1.3 Tradingstrategies .......................... 96 6.1.4 Evaluation of selected information distributions . 97 6.1.5 Evolutionary analysis . 100 6.1.6 Discussion............................... 102 6.2 Metastrategiesinpoker .