A Regularization Study for Policy Gradient Methods
Master Thesis
to obtain the academic degree of
Diplom-Ingenieur
in the Master's Program Computer Science

Submitted by: Florian Henkel
Submitted at: Institute of Computational Perception
Supervisor: Univ.-Prof. Dr. Gerhard Widmer
Co-Supervisor: Dipl.-Ing. Matthias Dorfer
July 2018

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Österreich
www.jku.at
DVR 0093696

Abstract

Regularization is an important concept in the context of supervised machine learning. Especially with neural networks it is necessary to restrict their capacity and expressivity in order to avoid overfitting to the given training data. While there are several well-known and widely used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their effect in the context of reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning algorithms relying on neural networks. We compare different state-of-the-art algorithms together with regularization methods from supervised learning to get a better understanding of how we can improve generalization in reinforcement learning.

The main motivation for exploring this line of research is our current work on score following, where we try to train reinforcement learning agents to listen to and read music. These agents should learn from given musical training pieces to follow music they have never heard or seen before. The agents thus have to generalize, which makes this scenario a suitable test bed for investigating generalization in the context of reinforcement learning.

The empirical results found in this thesis should primarily serve as a guideline for our future work in this field. Although the set of experiments is rather limited due to hardware constraints, we see that regularization in reinforcement learning does not work in the same way as in supervised learning. Most notable is the effect of Batch-Normalization: while this technique did not work for one of the tested algorithms, it yields promising but very unstable results for another. We further observe that one algorithm is robust and not affected at all by regularization. In our opinion it is necessary to further explore this field and to perform a more in-depth and thorough study in the future.
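To make the three regularization techniques named above concrete, the following is a minimal sketch, assuming a PyTorch implementation, of how they can be attached to a small policy network. The layer sizes, dropout rate, and weight-decay coefficient are illustrative assumptions only and do not correspond to the configurations studied in this thesis.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Small policy network with Batch-Normalization and Dropout (illustrative only)."""

    def __init__(self, n_inputs: int, n_actions: int, dropout_p: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_inputs, 128),
            nn.BatchNorm1d(128),       # Batch-Normalization on the hidden layer
            nn.ReLU(),
            nn.Dropout(p=dropout_p),   # Dropout on the hidden activations
            nn.Linear(128, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Action logits; a softmax over them defines the stochastic policy.
        return self.body(x)


policy = PolicyNet(n_inputs=64, n_actions=3)
# An L2 penalty on the weights enters through the optimizer's weight-decay term.
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4, weight_decay=1e-5)

# Example forward pass on a batch of 8 states (Batch-Normalization requires batches).
logits = policy(torch.randn(8, 64))
action_probs = torch.softmax(logits, dim=-1)
```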
Kurzfassung

Regularization plays an essential role in the field of supervised machine learning. Especially for neural networks it is necessary to restrict their capacity and expressivity in order to avoid so-called overfitting to the given training data. While there are several well-known and widely used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their influence in the context of reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning that is based on neural networks. We compare different state-of-the-art algorithms together with regularization methods from supervised machine learning in order to understand how the generalization ability in reinforcement learning can be improved.

The main motivation for investigating this line of research is our current work in the area of automatic score following, where we try to teach agents, by means of reinforcement learning, to listen to and read music. These agents should learn from given musical training pieces in order to follow music they have never heard or seen before. The agents therefore have to be able to generalize, which makes this scenario a suitable test environment for studying generalization in the context of reinforcement learning.

The empirical results of this thesis should primarily serve as a guideline for our future work in this field. Even though we could only carry out a limited number of experiments due to hardware constraints, we can still observe that regularization in reinforcement learning does not behave in the same way as in supervised learning. Particularly noteworthy is the influence of Batch-Normalization: while this technique did not work for one of the tested algorithms, it delivered promising, albeit unstable, results for another. Furthermore, we observe that one algorithm is robust to regularization and is not affected by it at all. In our opinion it is necessary to continue research in this area and to carry out a more thorough and extensive study in the future.

Acknowledgments

First of all, I would like to thank my whole family, my loving partner and my friends, who always supported me throughout my studies. Without them this would not have been possible. Furthermore, I would like to thank Gerhard Widmer, who not only supervised this thesis but also gave me the opportunity to work as a student researcher at the Institute of Computational Perception. I am really grateful for the possibility to work in such a productive environment and with all my experienced colleagues. I would also particularly like to thank Matthias Dorfer, who guided me during the course of this thesis. One could not think of a better and more supportive advisor.

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 670035, project Con Espressione).

Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Outline
2 Theory
  2.1 Reinforcement Learning
  2.2 Policy Gradient and Actor Critic Methods
    2.2.1 REINFORCE
    2.2.2 One-Step Actor Critic
    2.2.3 (Asynchronous) Advantage Actor Critic
    2.2.4 Proximal Policy Optimization
  2.3 Neural Networks
    2.3.1 Fully Connected Neural Networks
    2.3.2 Convolutional Neural Networks
    2.3.3 Back-Propagation and Gradient Based Optimization
    2.3.4 Activation Functions
  2.4 Regularization for Neural Networks
    2.4.1 L2-Normalization
    2.4.2 Dropout
    2.4.3 Batch-Normalization
3 Experimental Study
  3.1 The Score Following Game
  3.2 Experimental Setup
    3.2.1 The Nottingham Dataset
    3.2.2 Network Architectures
    3.2.3 Training and Validation
  3.3 Results and Discussion
    3.3.1 Result Summary
    3.3.2 Comparing Algorithms
    3.3.3 Comparing Activation Functions
    3.3.4 Comparing Regularization Techniques
  3.4 Implications on the Network Architecture
4 Conclusion

List of Figures

1 Agent-Environment interaction framework
2 Simple example neural network
3 Convolution visualization
4 Activation Function comparison
5 Dropout
6 Time Domain to Frequency Domain
7 Score Following Game MDP
8 Score Following Game State Space
9 Score Following Game Reward function
10 Network Architecture sketch
11 Shallow Network: Validation set setting comparison
12 Shallow Network: Test set setting comparison
13 Deep Network: Validation set setting comparison
14 Deep Network: Test set setting comparison
15 Shallow Network: Algorithm comparison
16 Shallow Network: Performance with different activations
17 Deep Network: Performance with different activations
18 Shallow Network: REINFORCE with regularization
19 Shallow Network: A2C with regularization
20 Deep Network: A2C with regularization
21 Shallow Network: PPO with regularization
22 Deep Network: PPO with regularization

List of Tables

1 Shallow Network Architecture
2 Shallow Network Architecture with Dropout
3 Shallow Network Architecture with Batch-Normalization
4 Deep Network Architecture
5 Deep Network Architecture with Dropout
6 Deep Network Architecture with Batch-Normalization
7 Hyperparameters
8 Shallow Network: Training set performance
9 Shallow Network: Validation set performance
10 Shallow Network: Test set performance
11 Deep Network: Training set performance
12 Deep Network: Validation set performance
13 Deep Network: Test set performance
14 Simplified Network Architecture S1
15 Simplified Network Architecture S2
16 Simplified Network Results