Reinforcement Learning in Supervised Problem Domains
Technische Universität München
Fakultät für Informatik
Lehrstuhl VI – Echtzeitsysteme und Robotik

Reinforcement Learning in Supervised Problem Domains

Thomas F. Rückstieß

Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Univ.-Prof. Dr. Daniel Cremers
Examiners of the dissertation:
1. Univ.-Prof. Dr. Patrick van der Smagt
2. Univ.-Prof. Dr. Hans Jürgen Schmidhuber

The dissertation was submitted to the Technische Universität München on 30 June 2015 and accepted by the Fakultät für Informatik on 18 September 2015.

Thomas Rückstieß: Reinforcement Learning in Supervised Problem Domains © 2015
email: [email protected]

ABSTRACT

Despite continuous advances in computing technology, today's brute-force data processing approaches may not provide the advantage necessary to win the race against the ever-growing amount of data witnessed over the last decades. In this thesis, we discuss novel methods and algorithms that are capable of directing attention to relevant details and analysing them in sequence, overcoming the processing bottleneck and keeping up with this data explosion.

In the first of three parts, a novel exploration technique for Policy Gradient Reinforcement Learning is presented which replaces traditional additive random exploration with state-dependent exploration, exploring on a higher, more strategic level. We show that this new exploration method converges faster and finds better global solutions than random exploration.
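To make the contrast concrete, the following is a minimal Python sketch of the idea, assuming a simple linear controller; the names and the class structure are illustrative choices for this summary, not the thesis's actual implementation. Additive exploration perturbs the action with fresh noise at every time step, whereas state-dependent exploration draws its exploration parameters once per episode, so that the perturbation becomes a deterministic, repeatable function of the state:

    import numpy as np

    rng = np.random.default_rng(0)

    def additive_exploration(state, theta, sigma=0.1):
        # classical approach: fresh Gaussian noise on the action at every step
        return theta @ state + rng.normal(0.0, sigma, size=theta.shape[0])

    class StateDependentExploration:
        # exploration parameters are drawn once per episode; within an episode
        # the perturbation is then a deterministic function of the state
        def __init__(self, theta, sigma_hat=0.1):
            self.theta = theta
            self.sigma_hat = sigma_hat
            self.new_episode()

        def new_episode(self):
            # re-sample the exploration parameters at episode boundaries only
            self.theta_hat = rng.normal(0.0, self.sigma_hat,
                                        size=self.theta.shape)

        def act(self, state):
            # same state, same explorative action for the whole episode
            return (self.theta + self.theta_hat) @ state

Because the perturbed policy stays consistent within an episode, each return can be attributed to a well-defined deterministic policy, which is what permits the faster, more strategic exploration claimed above.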
The second part of this thesis introduces the concept of "data consumption" and discusses means to minimise it in supervised learning tasks by casting classification as a sequential decision process, thereby making it accessible to Reinforcement Learning methods. Depending on the previously selected features and the internal belief state of the classifier, a sequential online feature selection chooses the next feature, learning which features are most informative at each time step. In experiments, this attentive hybrid learning system shows a significant reduction in the data required for correct classification.
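Read as pseudocode for the decision loop just described, a short Python sketch may help; classify_sequentially, select_action and classifier are hypothetical names, and the stopping rule is an assumption rather than the thesis's actual formulation:

    import numpy as np

    def classify_sequentially(x, select_action, classifier, max_features=5):
        # x: the full feature vector; only queried entries are revealed
        # select_action: learned policy mapping the belief state and the set of
        #                queried features to the next feature index, or None
        # classifier: maps the observed features to a class posterior
        observed = {}                      # features consumed so far
        belief = classifier(observed)      # internal belief state
        while len(observed) < max_features:
            action = select_action(belief, set(observed))
            if action is None:             # policy decides it has seen enough
                break
            observed[action] = x[action]   # pay for one more feature
            belief = classifier(observed)  # update belief with new evidence
        return int(np.argmax(belief)), len(observed)  # prediction, data used

A natural reward for the Reinforcement Learning component would combine classification correctness with a small cost per queried feature, so that low data consumption is learned rather than hard-coded.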
Finally, the third major contribution of this thesis is a novel sequence learning approach that learns an explicit contextual state while traversing a sequence. This context helps distinguish the current input and removes the need for a predictor capable of dealing with sequential data. We show the close relationship to concepts from theoretical computer science, in particular to deterministic finite automata and regular languages, and demonstrate the capabilities of this hybrid algorithm experimentally.

All three parts share a tight integration of Reinforcement Learning and Supervised Learning, which not only delivers an orthogonal view onto this research but also establishes for the first time a general framework of such hybrid algorithms.

ZUSAMMENFASSUNG

Despite continuous technical progress in computing technology, it is quite possible that today's brute-force methods of data analysis will not give us the advantage necessary to win the race against the steady growth of data observed over the last decades. This dissertation discusses new methods and algorithms that are able to direct their attention to the relevant details and process them sequentially, in order to overcome the bottleneck of information processing and keep pace with the data explosion.

In the first of three parts, a novel exploration method for Reinforcement Learning with policy gradients is presented, which replaces the traditional additive way of exploring with state-dependent exploration and operates on a higher, more strategic level. We show that this new form of exploration converges faster and can find better, more global solutions than random exploration.

The second part of this work introduces the concept of "data consumption" and discusses means to minimise it in supervised learning. This is made possible by deriving classification as a sequential decision process, thereby making it accessible to Reinforcement Learning methods. Depending on the previously selected features and the internal state of a classifier, a sequential component selects a new feature online by learning which feature carries the highest information content at each point in time. In experiments, this hybrid learning system shows a significant reduction in the amount of data needed for correct classification.

The third major contribution of this dissertation describes a new sequential learning method that builds up an explicit contextual state while traversing a sequence. This context supports the characterisation of the current input and makes it possible to tell temporal input signals apart without the help of a sequential classification method. We show the close relationship to concepts from theoretical computer science, in particular to automata theory and regular languages, and demonstrate the capabilities of this hybrid algorithm experimentally.

All three parts have in common a tight integration of reinforcement and supervised learning methods. This not only offers an alternative view of the research presented here, but also establishes for the first time a general framework for such combined algorithms.

ACKNOWLEDGEMENTS

I would like to thank my supervisors Prof. Jürgen Schmidhuber and Prof. Patrick van der Smagt for their help and support. Without either of them, this thesis would not have been possible. I thank Jürgen for believing in me and letting me join his group, for his inspiring vision and valuable conversations, and for his support and funding during the first years of my Ph.D. I want to thank Patrick for jumping in and taking over the supervision without knowing me well, for funding my work in the later years of my Ph.D., and for the interesting lunch discussions during my visits. His constructive and critical suggestions and diligent proofreading have helped improve this thesis tremendously.

Special thanks go to my colleagues at TUM (a.k.a. "die Buam") for the many great and interesting years together. Christian Osendorfer, my "roommate" for many years, was always there to bounce off ideas, discuss crazy theories and help out with complex mathematical problems. Thanks go to Frank Sehnke, Alex Graves and Martin Felder, the other Buam in the team. We worked on some great projects together, always had each other's backs and made the years of Ph.D. slavery a fun experience. The other contributors of PyBrain, in particular Justin Bayer, Tom Schaul and Daan Wierstra, also deserve a mention for their great work on this joint project. Everyone at the chair of Robotics and Embedded Systems, especially Prof. Alois Knoll, Dr. Gerhard Schrott, and our secretaries Monika Knürr, Gisela Hibsch and Amy Bücherl, helped and supported me in various ways over the years and made my time at TUM more about the research and less about the bureaucracy.

A big Dankeschön goes to all my friends. I'm lucky to have met so many great people, both during my time in Tübingen and after my move to München. They supported me in every decision, were there for me during good and bad times, and kept me sane and happy over all these years. I miss you all incredibly.

I also want to thank my parents Wolfgang and Berta for a happy and unburdened upbringing, for giving me the opportunity to make my own choices in life, and for always giving me good advice on important decisions.

Finally, I would like to thank my wonderful partner Steve Pennells for supporting me unconditionally during all this time, for his sacrifices and his endless patience over the last years, and for being such a great companion on our journey around the globe and through life.

CONTENTS

abstract
zusammenfassung
acknowledgements

I Overview
1 introduction
2 outline
3 contributions
  3.1 Scientific Contributions
  3.2 Technical Contributions

II Literature Review
4 function approximation
  4.1 General
  4.2 Function Approximation in Reinforcement Learning
  4.3 Linear Regression
  4.4 Logistic and Multinomial Regression
  4.5 Neural Networks
    4.5.1 Feed-Forward Neural Networks
    4.5.2 Recurrent Neural Networks and LSTMs
  4.6 Locally Weighted Projection Regression
  4.7 Sequence Learning
5 reinforcement learning
  5.1 General Remarks and Notation
    5.1.1 Formal Definition of Reinforcement Learning
  5.2 Model-based Reinforcement Learning
  5.3 Value-Based Reinforcement Learning
    5.3.1 Discrete Value-Based RL
    5.3.2 Continuous Value-Based RL
  5.4 Direct Reinforcement Learning
    5.4.1 Overall performance measure
    5.4.2 Finite Difference Methods
    5.4.3 Likelihood Ratio Methods
  5.5 Exploration
    5.5.1 Exploration in Discrete Action Spaces
    5.5.2 Exploration in Continuous Action Spaces

III Methods and Experiments
6 state-dependent exploration
  6.1 Introduction
  6.2 Exploration in Parameter Space
  6.3 State-Dependent Exploration
    6.3.1 REINFORCE for General Multi-Dimensional FA
    6.3.2 Derivation from