Applying Return Decomposition for Delayed Rewards (RUDDER) to Text-Based Games

Master Thesis to obtain the academic degree of Master of Science in the Master's Program Bioinformatics

Submitted by: Fabian Paischer
Submitted at: Institute for Machine Learning
Supervisor: Univ.-Prof. Dr. Sepp Hochreiter
Co-Supervisor: Jose Arjona-Medina, PhD
September 2020

Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria, www.jku.at

Abstract

Text-based games date back to the first computers, which could only display and interact with text in the form of ASCII characters. With advances in computer graphics, such games eventually fell into oblivion; however, they provide an excellent environment for machine learning algorithms to learn language understanding and common-sense reasoning simultaneously, based solely on interaction. A wide variety of text-based games spanning multiple domains has been developed. Recent work has shown that navigating text-based worlds is extremely difficult for reinforcement learning algorithms: state-of-the-art agents reach reasonable performance only on fairly easy quests. This work focuses on solving text-based games via reinforcement learning within the TextWorld framework [Côté et al., 2018]. A substantial part of it builds on recent work by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, a reproducibility study of [Jain et al., 2019] is conducted, demonstrating that reproducibility in reinforcement learning remains a common problem. Since no prior work has applied return decomposition and reward redistribution to text-based environments, a feasibility study is conducted. Further, a hyperparameter search is performed to find the best parameters for continuous return prediction and regularization. Reward redistributions are shown for randomly sampled episodes by taking the differences between adjacent return predictions. Further, a novel training procedure for agents navigating text-based environments is presented, which incorporates the return prediction as a critic network. The actor is pre-trained with deep Q-learning following [Jain et al., 2019] and fine-tuned with proximal policy optimization [Schulman et al., 2017], using the redistributed rewards of the critic as advantage. On handcrafted benchmark games created within the TextWorld framework, the agent achieves results comparable to [Jain et al., 2019], establishing a policy-gradient baseline. For the simpler benchmark games the agent performs on par with [Jain et al., 2019]; moreover, fine-tuning with policy gradients enables the agent to recover from runs that performed particularly poorly under Q-learning. For the more advanced games, fine-tuning with proximal policy optimization is able to recover when Q-learning stagnates and shows improvements within a small number of training steps. A longer fine-tuning phase or policy-gradient training from scratch might yield even better performance on the benchmark games.
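The reward-redistribution step summarized above can be illustrated with a short sketch: a sequence model predicts the return of an episode at every time step, and the differences between adjacent predictions are used as redistributed, dense rewards. The following is a minimal illustration only; the LSTM, its dimensions, and the function names are assumptions made for the example, not the architecture or code used in this thesis.

```python
import torch
import torch.nn as nn


class ReturnPredictor(nn.Module):
    """Illustrative LSTM that predicts the episode return at every time step."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim) -> per-step return prediction: (batch, time)
        hidden, _ = self.lstm(obs_seq)
        return self.head(hidden).squeeze(-1)


def redistribute_rewards(return_pred: torch.Tensor) -> torch.Tensor:
    """Turn per-step return predictions g_0..g_T into dense rewards:
    r_0 = g_0 and r_t = g_t - g_{t-1}, so the rewards sum to the final prediction."""
    first = return_pred[:, :1]                          # contribution of the first step
    diffs = return_pred[:, 1:] - return_pred[:, :-1]    # differences of adjacent predictions
    return torch.cat([first, diffs], dim=1)


# Toy usage on a random 5-step "episode" with 8-dimensional observations.
model = ReturnPredictor(obs_dim=8)
obs = torch.randn(1, 5, 8)
g = model(obs)                                  # per-step return predictions
dense_rewards = redistribute_rewards(g)
print(dense_rewards.sum().item(), g[0, -1].item())  # both values match (telescoping sum)
```

In the training procedure described above, such redistributed rewards would then stand in for the sparse game score, for example as the per-step advantage signal during fine-tuning with proximal policy optimization.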
Zusammenfassung

Text-based games date back to a time when the first computers were being developed and were only able to output text in the form of ASCII characters. With the progress in computer graphics, these games fell into oblivion. They nevertheless offer an interesting and difficult environment for machine learning algorithms to learn language understanding as well as logical reasoning, purely through interaction with the game environment. An enormous number of different text-based games spanning the most diverse domains has since been developed. Recent work in this field confirms how difficult it is for reinforcement learning algorithms to navigate such game environments; current state-of-the-art algorithms are only able to solve simple tasks. In this thesis I address solving text-based games with reinforcement learning in the TextWorld framework [Côté et al., 2018]. A large part of my work builds on recent publications by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, I conduct a reproducibility study based on the publication of [Jain et al., 2019] and show that reproducing results in reinforcement learning is a widespread problem. I then turn to the prediction of returns in reinforcement learning and show that it is feasible in text-based settings. Return prediction introduces several new parameters for continuous prediction and regularization, which I tune for optimal results. Furthermore, I show that it is possible to redistribute the total return of a played episode via the differences of adjacent continuous predictions. I give suggestions on how the prediction and redistribution of returns could be improved further. Finally, I propose a new training procedure for reinforcement learning algorithms by realizing the return prediction with a critic network. The actor network is trained with deep Q-learning following the procedure of [Jain et al., 2019] and is then further trained with policy gradients based on the proximal policy optimization update rule [Schulman et al., 2017]. My trained model achieves comparable results on benchmarks created with the TextWorld framework, establishing a baseline based on policy gradients. For easier variants of the benchmarks, the model reaches performance comparable to the state-of-the-art model of [Jain et al., 2019], and fine-tuning with policy gradients is able to improve runs that performed particularly poorly with Q-learning. For the harder benchmarks, fine-tuning is able to improve the poor runs of the Q-learning model even after a small number of training steps. A longer fine-tuning phase, as well as training with PPO from scratch, could help achieve better performance than previously reported on the benchmark games.

Acknowledgments

First and foremost, I would like to thank my supervisor Jose Arjona-Medina for his great support and guidance throughout this entire work; his most recent publication [Arjona-Medina et al., 2018] forms a substantial part of it. I would also like to thank the Institute for Machine Learning for providing as many resources as possible to efficiently conduct experiments and collect results. Special thanks to Vishal Jain for providing most of the code from his prior work on algorithmic improvements for interactive fiction [Jain et al., 2019].
Further, I would like to thank the researchers at Facebook for developing the open-source deep learning framework PyTorch [Paszke et al., 2019], which was used for training the neural networks. Thanks also to Michael Widrich from the Institute for Machine Learning for developing the Python package widis-lstm-tools (https://github.com/widmi/widis-lstm-tools.git), which greatly simplified implementing the LSTM architecture used for return prediction. The Python package matplotlib [Hunter, 2007] was used for most visualizations in this work. Finally, I would like to thank [Côté et al., 2018] for developing the Python package TextWorld, which enables easy handcrafting of text-based games and convenient interaction between text-based environments and an agent.

List of Figures

1   Introduction to Zork
2   Types of text-based games [He et al., 2015]
3   Reinforcement learning paradigm [Sutton and Barto, 1998]
4   Unified view of reinforcement learning [Sutton and Barto, 1998]
5   TD update for V(s_t)
6   Overview of the TextWorld framework, taken from [Côté et al., 2018]
7   Logical representation of states; P represents the player, taken from [Côté et al., 2018]
8   Logical representation of the transition function; uppercase letters define object types (F: food, C: container, S: supporter, R: room), P represents the player, I the player's inventory, and ⊸ denotes implication, taken from [Côté et al., 2018]
9   Logical representation of the action space, taken from [Côté et al., 2018]
10  Benchmarks on curated games within TextWorld, taken from [Côté et al., 2018]
11  Sketch of the POMDP formulation
12  Score contextualization architecture
13  LSTM architecture used for return prediction
14  Possible scores for each level, taken from [Jain et al., 2019]
15  Graphical representation of levels 1 and 2, taken from [Jain et al., 2019]
16  Graphical representation of higher levels, taken from [Jain et al., 2019]
17  Sketch of the map of Zork 1
18  Results for level 1; right image shows results by [Jain et al., 2019] (red curve: score contextualization, blue curve: score contextualization + action gating, gray curve: baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (re-implementation of score contextualization)
19  Results for level 2; right image shows results by [Jain et al., 2019] (red curve: score contextualization, blue curve: score contextualization + action gating, gray curve: baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (re-implementation of score contextualization)