Submitted by Fabian Paischer

Applying Return Decomposition for Delayed Rewards (RUDDER) to Text-Based Games

Submitted at: Institute for Machine Learning
Supervisor: Univ.-Prof. Dr. Sepp Hochreiter
Co-Supervisor: Jose Arjona-Medina, PhD
September, 2020

Master's Thesis to obtain the academic degree of Master of Science in the Master's Program Bioinformatics

JOHANNES KEPLER UNIVERSITY LINZ Altenbergerstraße 69 4040 Linz, Österreich www.jku.at DVR 0093696

Abstract

The advent of text-based games traces back to the first computers, which were only able to display and interact with text in the form of ASCII characters. With advancing technology in computer graphics such games eventually fell into oblivion; however, they provide a great environment for machine learning algorithms to learn language understanding and common sense reasoning simultaneously, solely based on interaction. A vast variety of text-based games has been developed, spanning multiple domains. Recent work has shown that navigating through text-based worlds is extremely cumbersome for reinforcement learning algorithms; state-of-the-art agents reach reasonable performance on fairly easy quests only. This work focuses on solving text-based games via reinforcement learning within the TextWorld framework [Côté et al., 2018]. A substantial amount of this work builds on recent work by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, a reproducibility study of the work by [Jain et al., 2019] is conducted, demonstrating that reproducibility remains a common problem in reinforcement learning. Since no work on return decomposition and reward redistribution in the realm of text-based environments existed prior to this work, a feasibility study was conducted. Further, a hyperparameter search was performed to find the best possible parameters for continuous return prediction and regularization. Reward redistributions are exhibited for randomly sampled episodes by taking the difference between adjacent return predictions. Furthermore, a novel procedure for training agents to navigate through text-based environments is presented, which incorporates return prediction as a critic network. The actor is pre-trained with deep Q-learning according to [Jain et al., 2019] and fine-tuned with proximal policy optimization [Schulman et al., 2017], using the redistributed rewards of the critic as advantage. The agent achieves results comparable to [Jain et al., 2019] on handcrafted benchmark games created within the TextWorld framework, establishing a baseline using policy gradients. For simpler benchmark games the agent performs comparably to [Jain et al., 2019]; moreover, fine-tuning with policy gradients enables the agent to recover from runs that performed particularly poorly under Q-learning. For more advanced games, fine-tuning with proximal policy optimization is capable of recovering when Q-learning stagnates and shows improvements within a small number of training steps. A longer fine-tuning phase or policy gradient training from scratch might yield even better performance on the benchmark games.

Zusammenfassung

Text-based games date from a time when the first computers were developed and were only able to display text in the form of ASCII characters. With the progress in computer graphics these games fell into oblivion. However, they provide an interesting and difficult environment for machine learning algorithms to learn language understanding as well as logical reasoning purely through interaction with the game environment. An enormous number of different text-based games across the most diverse domains has been developed since then. Recent work in this field confirms how difficult it is for reinforcement learning algorithms to navigate through such game environments. Current state-of-the-art algorithms are only able to solve simple tasks. In this thesis I address solving text-based games with reinforcement learning in the TextWorld framework [Côté et al., 2018]. A large part of my work is based on recent publications by [Jain et al., 2019] and [Arjona-Medina et al., 2018]. First, I conduct a reproducibility study based on the publication of [Jain et al., 2019] and show that reproducing results in reinforcement learning is a widespread problem. I then turn to the prediction of returns in reinforcement learning and show that it is feasible in text-based settings. Return prediction introduces several new parameters for continuous prediction and regularization, which I tune for optimal results. Further, I show that it is possible to redistribute the total return of a played episode via the differences of adjacent continuous predictions. I give suggestions on how the prediction and redistribution of returns could be improved further. Finally, I propose a new training method for reinforcement learning algorithms by realizing return prediction with a critic network. The actor network is trained with deep Q-learning following the training procedure of [Jain et al., 2019] and is further trained with policy gradients based on the proximal policy optimization [Schulman et al., 2017] update rule. My trained model achieves comparable results on benchmarks created with the TextWorld framework, with which I introduce a baseline based on policy gradients. For easier variants of the benchmarks the model reaches performance comparable to a state-of-the-art model by [Jain et al., 2019], while fine-tuning with policy gradients is able to improve runs that performed particularly poorly with Q-learning. For the harder benchmarks, fine-tuning is able to improve the poor runs of the Q-learning model even after a small number of training steps. A longer fine-tuning phase, as well as training with PPO from the beginning, could help to achieve better performance than previously reported on the benchmark games.

Acknowledgments

First and foremost I would like to thank my supervisor Jose Arjona-Medina for the great support and guidance throughout this entire work; his most recent publication [Arjona-Medina et al., 2018] forms a substantial part of it. I would also like to thank the Institute for Machine Learning for providing as many resources as possible to efficiently conduct experiments and collect results. Special thanks to Vishal Jain for providing most of the code from his prior work [Jain et al., 2019] on algorithmic improvements for interactive fiction.

Further, I would like to thank the researchers who developed the open source framework PyTorch [Paszke et al., 2019], which was utilized for training neural networks. Thanks also to Michael Widrich from the Institute for Machine Learning for developing the Python package widis-lstm-tools (https://github.com/widmi/widis-lstm-tools.git), which greatly simplified implementing the LSTM architecture used for return prediction. [Hunter, 2007] developed the Python package matplotlib, which was used for most visualizations in this work. Finally, I would like to thank [Côté et al., 2018] for developing the Python package TextWorld, which enables easy handcrafting of text-based games and convenient interaction between text-based environments and an agent.

List of Figures

1 Introduction to Zork
2 Types of Text-Based Games [He et al., 2015]
3 Reinforcement Learning Paradigm, [Sutton and Barto, 1998]
4 Unified View of Reinforcement Learning, [Sutton and Barto, 1998]
5 TD Update for V(s_t)
6 Overview of the TextWorld Framework, taken from [Côté et al., 2018]
7 Logical Representation of States, P represents the player, taken from [Côté et al., 2018]
8 Logical Representation of the Transition Function, taken from [Côté et al., 2018]
9 Logical Representation of the Action Space, taken from [Côté et al., 2018]
10 Benchmarks on curated games within TextWorld, taken from [Côté et al., 2018]
11 Sketch of the POMDP formulation
12 Score Contextualization Architecture
13 LSTM architecture used for return prediction
14 Possible scores for each level, taken from [Jain et al., 2019]
15 Graphical representation of levels 1 and 2, taken from [Jain et al., 2019]
16 Graphical representation of higher levels, taken from [Jain et al., 2019]
17 Sketch of the map of Zork 1
18 Results for level 1; right: results by [Jain et al., 2019] (red: Score Contextualization, blue: Score Contextualization + Action Gating, gray: baseline re-implementation of [Yuan et al., 2018]); left: reproduced results (re-implementation of Score Contextualization)
19 Results for level 2 (same layout as figure 18)
20 Results for level 3 (same layout as figure 18)
21 Results for level 4 (same layout as figure 18)
22 Results for level 5 (same layout as figure 18)
23 Results for level 6 (same layout as figure 18)
24 Results for level 7 (same layout as figure 18)
25 Results for Zork 1 (same layout as figure 18)
26 Main and Auxiliary Loss for Return Prediction
27 Return Prediction Metrics for continuous prediction factor of 1
28 Return Prediction Metrics for continuous prediction factor of 0.5
29 Return Prediction Metrics for continuous prediction factor of 0.1
30 Reward Redistributions for two random samples for a continuous prediction factor s_aux = 1
31 Reward Redistributions for two random samples for a continuous prediction factor s_aux = 0.5
32 Reward Redistributions for two random samples for a continuous prediction factor s_aux = 0.1
33 Original Quest for level 2 and returned rewards
34 Top 5 positive and negative redistributed rewards for level 2
35 Metrics for scaling factor s_κ = 0.01 and a kappa window of w_κ = 10
36 Reward Redistributions for two random samples for w_κ = 10 and s_κ = 0.01
37 Reward Redistribution for a randomly sampled trajectory with w_κ = 40 and s_κ = 1e-5
38 Reward Redistribution for w_κ = 40 and s_κ = 1e-5
39 Reward Redistributions for a random sample for w_κ = 1 (left) and w_κ = 3 (right) with N-Step Return Correction
40 Reward Redistributions for a random sample for w_κ = 5 (left) and w_κ = 10 (right) with N-Step Return Correction
41 Metrics for single run of PPO fine-tuning phase
42 RUDDER training metrics for single run of PPO fine-tuning phase
43 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 2, mean and standard deviation over 5 runs depicted
44 Single runs for level 2; left: superior performance of the Q-learning agent; right: improvement after PPO fine-tuning
45 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 3, mean and standard deviation over 5 runs depicted
46 Single runs for level 3; left: superior performance of the Q-learning agent; right: improvement after PPO fine-tuning
47 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 4, mean and standard deviation over 5 runs depicted
48 Single runs for level 4; left: better performance of the Q-learning agent up to some point; right: improvement after PPO fine-tuning
49 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 5, mean and standard deviation over 5 runs depicted
50 Single runs for level 5; left: collapse of PPO fine-tuning; right: superior performance after PPO fine-tuning
51 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 6, mean and standard deviation over 5 runs depicted
52 Single runs for level 6; both images show superior performance of PPO during the fine-tuning phase
53 Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 7, mean and standard deviation over 5 runs depicted
54 Single runs for level 7; both images show superior performance of PPO during the fine-tuning phase

Contents

1 Introduction

2 Reinforcement Learning
  2.1 TD-learning
  2.2 Function Approximation
  2.3 Policy Gradients
    2.3.1 Policy Gradient Theorem
    2.3.2 Proximal Policy Optimization

3 The TextWorld Framework
  3.1 TextWorld Engine
  3.2 TextWorld Generator
  3.3 TextWorld Features
  3.4 TextWorld Benchmarks

4 Text-based Games from a Reinforcement Learning Perspective
  4.1 Formulation as POMDP
  4.2 Challenges posed by Text-based Games

5 Relevant Work

6 Methods
  6.1 Score Contextualization
  6.2 RUDDER
    6.2.1 Reward Redistribution
    6.2.2 Contribution Analysis
    6.2.3 Redistribution Quality
    6.2.4 N-Step Kappa Correction

7 Empirical Results
  7.1 Training Procedure
    7.1.1 Score Contextualization
    7.1.2 Return Prediction as Critic
    7.1.3 Fine-Tuning Phase
  7.2 Benchmark Games
  7.3 Score Contextualization
    7.3.1 SaladWorld Benchmarks
    7.3.2 Zork 1 Benchmark
    7.3.3 Discussion
  7.4 Return Prediction: A Feasibility Study
  7.5 Reward Redistribution
    7.5.1 Continuous Prediction Factor
    7.5.2 Kappa Correction
  7.6 PPO Fine-tuning
    7.6.1 SaladWorld Benchmarks

8 Conclusion

9 Future Work

1 Introduction

Text-based games, also known as Interactive Fiction, are interactive simulations in which the user navigates through a fictional world given a narrative in the form of text. Depending on the current narrative, the user decides which command to issue. This requires skills such as planning, common sense reasoning, recalling already visited locations, and exploration via trial and error. For humans, skills like common sense reasoning (how to interact with day-to-day objects) and the comprehension of physical processes happen unconsciously; an algorithm, however, has to learn them from scratch, which made mastering text-based games cumbersome in the early days of reinforcement learning. Text-based games experienced a renaissance with the advent of deep reinforcement learning and advances in natural language processing. Recently a vast amount of work has been conducted to solve text-based games; however, current state-of-the-art agents are only capable of completing fairly easy tasks.

One of the most popular examples of text-based games is called Zork. A short excerpt of the introduction of the game can be seen in figure 1. The description of the current narrative is given in text and the user issues an action (the first action issued in this example is open mailbox). After issuing a command the user gets feedback from the environment in the form of text (Opening the small mailbox reveals a leaflet). The quest to be completed might or might not be given at the beginning of the game.

Figure 1: Introduction to Zork

There are three types of text-based games, namely (a) parser-based games, (b) choice-based games, and (c) hypertext-based games. Parser-based games expect a string of natural language as input command, while choice-based games provide predefined action choices for each location throughout the game. Hypertext-based games usually give choices in the form of hyperlinks contained in the narrative displayed to the user. Figure 2 shows an example of the three different types of text-based games. In this work, a pre-compiled set of all admissible actions that can possibly be issued throughout the entire game is used as action space. Thus, this work mainly moves in the realm of choice-based games.

Figure 2: Types of Text-Based Games [He et al., 2015]

TextWorld [Côté et al., 2018] is a sandbox learning environment in Python which handles interactive play-through of text-based games. TextWorld distinguishes itself from other existing frameworks by the functionality to handcraft games or to automatically generate games by sampling from a parameterized game distribution. Further, it provides a set of curated games for benchmarking. The structure of the framework is explained in more detail in section 3. The agent trained in this work is tested on seven handcrafted games of increasing difficulty to compare its performance to earlier work conducted by [Jain et al., 2019].

Recently, a novel architecture consisting of several LSTM networks [Hochreiter and Schmidhuber, 1997] was proposed by [Jain et al., 2019], specifically designed for the reward structure of text-based games. Rewards are received from the environment in a sequential fashion: the main quest comprises several sub-tasks, for which intermediate rewards are returned upon completion. The aim of the proposed architecture is to relieve a single LSTM network of the burden of solving a whole episode by dividing the episode into sub-sequences corresponding to different intermediate scores in the game, each handled by a different network head. The network heads receive as input word representations created by another LSTM. Word representations and score contextualization are trained jointly. This architecture is used as the actor network, in combination with return decomposition as a critic, to train the agent proposed in this work. The architecture is referred to as Score Contextualization and explained in more detail in section 6.1.

The critic network is based on the work of [Arjona-Medina et al., 2018], namely RUDDER. Reinforcement learning algorithms such as Monte Carlo (MC) and Temporal Difference (TD) learning suffer from high variance and high bias, respectively. The reason for this is the sampling strategy in MC and the bootstrapping in TD. Considering Q-values, the term causing variance or bias is the expected sum of future rewards. RUDDER provides a return-equivalent reward redistribution of the original reward which, if optimal, causes the expected future reward to be zero. Thus, at each timestep only immediate rewards are given, which can significantly speed up training and yield better performance. Section 6.2 elaborates on RUDDER in more detail.
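To make the redistribution step concrete, the following is a minimal sketch (not the implementation used in this thesis) of how the per-timestep return predictions of a trained critic could be turned into redistributed rewards; the function name and the correction scheme are illustrative assumptions.

```python
import numpy as np

def redistribute_reward(return_predictions, episode_return):
    """Redistribute an episode's return via differences of adjacent return predictions.

    return_predictions: per-timestep predictions of the final return, shape (T,)
    episode_return: the actual return observed for the episode
    """
    g = np.asarray(return_predictions, dtype=np.float64)
    # Contribution of step t = change in the predicted return at step t.
    redistributed = np.diff(g, prepend=0.0)
    # Return-equivalence correction: spread any remaining prediction error
    # evenly so the redistributed rewards sum to the observed return.
    error = episode_return - redistributed.sum()
    return redistributed + error / len(redistributed)

# Toy example: the predictor becomes confident about a return of 3 at step 2.
print(redistribute_reward([0.1, 0.2, 2.9, 3.0, 3.0], episode_return=3.0))
```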

The agent presented in this work comprises an actor network taken from [Jain et al., 2019] and a critic network for return prediction based on [Arjona-Medina et al., 2018]. A novel training procedure is proposed: the actor is pre-trained with Q-learning for a certain number of episodes; once enough trajectories are collected in a buffer, the critic network is trained and reward redistribution is performed on newly collected episodes. The redistributed reward is used for proximal policy optimization (PPO) instead of the originally proposed advantages in [Schulman et al., 2017]. For text-based environments the critic learns a reasonable reward redistribution, providing positive rewards for necessary intermediate actions and negative rewards for actions that are meaningless or have a bad impact in the current state.
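The overall procedure can be summarized by the following structural sketch; the callables are placeholders standing in for the Score Contextualization actor, the RUDDER critic, and the TextWorld environment loop, so the snippet illustrates the phases rather than the actual implementation.

```python
from typing import Callable, List, Sequence

def train_agent(
    collect_episode: Callable[[str], dict],          # plays one episode; argument selects exploration mode
    q_learning_update: Callable[[List[dict]], None], # one deep Q-learning update on the buffer
    fit_critic: Callable[[List[dict]], None],        # trains the return predictor on complete episodes
    redistribute: Callable[[dict], Sequence[float]], # differences of adjacent return predictions
    ppo_update: Callable[[dict, Sequence[float]], None],
    n_pretrain: int = 10_000,
    n_finetune: int = 1_000,
) -> None:
    """Pre-train with Q-learning, train the critic, then fine-tune with PPO."""
    buffer: List[dict] = []

    # Phase 1: Q-learning pre-training of the actor while filling the episode buffer.
    for _ in range(n_pretrain):
        buffer.append(collect_episode("epsilon_greedy"))
        q_learning_update(buffer)

    # Phase 2: train the critic (return prediction) on the collected episodes.
    fit_critic(buffer)

    # Phase 3: PPO fine-tuning, with the critic's redistributed reward as advantage.
    for _ in range(n_finetune):
        episode = collect_episode("sampling")
        ppo_update(episode, redistribute(episode))
```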

As opposed to policy gradient methods, Q-learning is very well established in the realm of text-based games. This work aims to establish a policy gradient baseline based on the PPO update rule. First, a reproducibility study of the originally reported results for Score Contextualization was conducted. Further, a feasibility study was performed to check whether return decomposition and reward redistribution are feasible in text-based games. Indeed, the reward redistributions shown in this work seem reasonable, indicating that RUDDER can be incorporated into agents for text-based games. An actor-critic approach is realized by utilizing an existing network architecture (Score Contextualization) as actor and return decomposition with reward redistribution as critic (RUDDER). In the initial pre-training phase the agent is trained via Q-learning until sufficiently many episodes are collected to train the critic network. After the critic is trained, training of the actor network continues according to the PPO update rule in the fine-tuning phase. Handcrafted games created within the TextWorld framework provide a comparison to a Score Contextualization architecture trained solely with Q-learning.

Figure 3: Reinforcement Learning Paradigm, [Sutton and Barto, 1998]

The remainder of this thesis explains how text-based games can be solved by reinforcement learning by formulating them as partially observable Markov decision processes in section 4, and enumerates the challenges posed by such environments. Sections 2, 3, 6.1, and 6.2 explain the basic foundations of RL, the TextWorld framework, the Score Contextualization architecture, and RUDDER in more detail. Section 5 presents recent relevant work conducted in this field and differences to the proposed approach. Section 7 shows the results collected throughout this work and elaborates on possible improvements. Sections 8 and 9 conclude this work and give future perspectives.

2 Reinforcement Learning

Reinforcement learning is a machine learning paradigm based upon learning by interaction. In contrast to other machine learning paradigms, reinforcement learning is directed towards a certain predefined goal, e.g. playing a game or learning to walk. What makes reinforcement learning particularly difficult compared to other machine learning paradigms is that the assumption of i.i.d. (independent and identically distributed) data is broken, since the agent needs to learn from samples it generated itself previously. The general paradigm of reinforcement learning can be seen in figure 3. The agent interacts with the environment by issuing an action according to a policy. The policy is a mapping from states to actions and defines the agent's behaviour. It can range from lookup tables and simple functions to extensive search processes. After a command is issued the environment transitions to a new state and yields a reward. The reward is a single number which indicates which actions are beneficial and which are not, and forms a primary basis for altering the policy. The model of the environment mimics the dynamics of the environment and can be deterministic or stochastic. An environment is deterministic if it consistently transitions to the same state after issuing a certain command in a particular state. If the environment is stochastic, state transitions are random to some degree. The aim of the agent is to maximize the total reward returned from the environment over time. Since the reward is returned each time an action is issued, it represents the immediate effect of the taken action. The value of a state/action defines which states/actions are beneficial in the long run. The value of a state is the total reward an agent can expect to accumulate in the future starting from that particular state onwards (see equation 1). G_t represents the return of a trajectory and is defined as the sum of discounted rewards. γ is the discount factor and lies in the range γ ∈ [0, 1]. If γ = 0 only the immediate reward r_{t+1} is considered; if γ = 1 all rewards are weighted equally. γ < 1 should be used for environments with long or infinite horizons to avoid G_t = ∞. The value of a state is always computed under some policy π (Policy Evaluation); different policies result in different state values. A state yielding little reward might still have a high value if it is followed by beneficial states.

" ∞ # X k vπ(s) = Eπ[Gt|st] = Eπ γ rt+k+1|st (1) k=0 The formulation of equation 1 can be extended to the value of actions (equation 2).

" ∞ # X k qπ(st, at) = Eπ [Gt|st, at] = Eπ γ rt+k+1|st, at (2) k=0 Policies can be compared via their value functions, a policy is better than another policy if its action values are larger (equation 3).

$$\pi \leq \pi', \quad \text{if } q_\pi(s, a) \leq q_{\pi'}(s, a), \; \forall s \in S, \forall a \in A \tag{3}$$

An optimal policy is precisely defined for an MDP and all optimal policies share equivalent state and action values (equation 4).

$$q_*(s, a) = \max_{\pi} q_\pi(s, a), \quad \forall s \in S, \forall a \in A \tag{4}$$

The Bellman optimality equation for Q-values states that the value of an action under an optimal policy is equivalent to the expected immediate reward plus the expected return for the best action in the subsequent state (equation 5). Note that we only use Q-values here instead of state values, since the computation of the value of state s_t depends on the values of all successor states and thus requires the model of the environment for the computation of the expectation term. Using Q-values in the Bellman optimality equation elegantly sidesteps this issue. An additional important measure is the advantage function, which describes the advantage of a certain action over other actions and is defined in equation 6.

"T −k # X 0 0 q∗(s, a) = Eπ∗ [rt + 1|st, at] + Eπ∗ rt+k+1|s, a = rt+1 + γ max q∗(s , a ) (5) a0 k=1 A(s, a) = Q(s, a) − V (s) (6) Optimal Q-values immediately give the optimal policy π∗ by simply taking the argmax over all action values (equation 7). The procedure of improving the policy according to computed values is called Policy Improvement.

$$\pi_*(s) = \operatorname*{argmax}_{a} q_*(s, a) \tag{7}$$

The aim is to estimate action values and select actions according to these estimates. Performing greedy action selection during training results in exploitation only (the exploration/exploitation dilemma, see section 4), meaning no time is spent exploring actions that might be more beneficial in the long run, which leads to a suboptimal solution. A common approach to remedy this problem is to incorporate a small probability ε according to which actions are selected randomly. Using ε-greedy action selection, all actions are selected infinitely many times in the limit and the Q-values q_t(s_t, a_t) converge to q_*(s_t, a_t).
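As a small illustration of ε-greedy action selection over estimated action values (a generic sketch, not tied to this thesis' code):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(seed=0)
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```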

Classical reinforcement learning algorithms rely on a combination of Policy Evaluation and Policy Improvement, called Policy Iteration or Value Iteration. First, values are estimated for an initial policy π_0 (usually random). After evaluation, the policy is updated according to the newly estimated values. Policy Iteration combines Policy Evaluation and Policy Improvement in an iterative manner and returns the optimal policy. Different types of reinforcement learning algorithms either consider the model dynamics of the environment (model-based, e.g. dynamic programming) or do not (model-free, e.g. temporal difference and Monte Carlo methods). Figure 4 shows different reinforcement learning algorithms and their characteristics.

Figure 4: Unified View of Reinforcement Learning, [Sutton and Barto, 1998]

Exhaustive Search iterates over the entire state/action space and is intractable and inefficient for many problems. Dynamic Programming (DP) considers the model of the environment and uses bootstrapping (i.e. it makes an initial guess of the value of a state and updates it accordingly). It uses a distribution over successor states together with an estimate of the expected return to update the value of a state. DP is proven to converge to the optimal value for each state after a sufficient number of iterations. Monte Carlo methods (MC) are model-free and based on sampling whole trajectories to estimate the value of a state by averaging over the returns observed for each state. MC yields an unbiased estimate of state values; however, it results in high variance since many samples are needed to obtain a reasonable estimate. Also, MC can only be used in a finite-horizon setting (i.e. environments that actually end at some point). Temporal Difference learning (TD) is a model-free approach that combines the advantages of MC and DP: it samples single transitions and uses bootstrapping. Q-learning belongs to the family of TD learning and is a substantial part of this work.

Text-based games feature a deterministic environment model since an action always yields the same change in the underlying state (e.g. the action take blue key always results in the agent picking up the key if it is present); however, the observation produced by the environment need not always be the same and depends on a stochastic function, which makes model-based approaches (e.g. Dynamic Programming) impractical because the dynamics of this function are unknown. Since the state space consists of natural language it is combinatorial and extremely large, which makes it intractable for lookup tables to store the values of states/actions. Thus, function approximation by neural networks needs to be performed. In the realm of choice-based games the action space is limited to a pre-compiled set of actions for each game, therefore approximation of Q-values is feasible. The policy is defined as taking the action corresponding to the highest Q-value. The reward signal in text-based games is sparse and only present if the agent completes specific sub-tasks; no reward is returned for intermediate actions.


Figure 5: TD Update for V(s_t)

2.1 TD-learning

As mentioned in the previous section, DP is not feasible since the model dynamics of the environment are unknown, and MC is impractical since it assumes a finite-horizon MDP and the agent is not guaranteed to ever reach the end of an episode, especially for more complex games. The TD update for the value V(s_t) of a state s_t is based on samples of the estimate of the immediate successor state V(s_{t+1}) (see equation 8). Figure 5 illustrates the formulation of the update.

$$V(s_t) = V(s_t) + \alpha\left(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right) \tag{8}$$

Extending the formulation of equation 8 to action values Q(s_t, a_t) leads to the update shown in equation 9. This is also called the SARSA update (short for State, Action, Reward, State, Action) and uses a 5-tuple for a single update. The Q-values of terminal states are set to Q(s_T, ·) = 0. SARSA is an on-policy method since the policy used for collecting transitions is directly used in the update. Thus, when policy evaluation via SARSA converges, policy improvement needs to be performed.

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right) \tag{9}$$

Q-learning is a variant of TD learning and has proven effective for Atari games [Mnih et al., 2013] and text-based games ([He et al., 2015], [Narasimhan et al., 2015], [Yuan et al., 2018]). The Q-value update of Q-learning is shown in equation 10. Due to the argmax in the update, Q-learning directly performs policy improvement and can be seen as off-policy. Off-policy algorithms utilize two different policies: the target policy π which we want to improve, and the behavioral policy µ which is used to generate transitions. Commonly the behavioral policy µ is ε-greedy to enforce exploration while generating samples. The Q-learning update, however, considers an alternative successor action a', which is selected greedily via the max operator using policy π. Q-learning effectively combines policy evaluation and policy improvement while using sampling and bootstrapping.

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right) \tag{10}$$
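The update of equation 10 translates directly into a tabular rule; the following generic sketch (with an added terminal-state case) illustrates it:

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning update following equation 10."""
    # Bootstrapped target: immediate reward plus discounted best successor value.
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy usage on a made-up transition.
Q = defaultdict(float)
q_learning_step(Q, s="hallway", a="go north", r=0.0, s_next="kitchen",
                actions_next=["take knife", "go south"])
```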


2.2 Function Approximation

Using tabular reinforcement learning algorithms to store Q-values is not feasible in the realm of text-based games. Tabular methods also do not generalize across states, since each state is a single entry in a lookup table. The advent of deep learning enabled the use of deep neural networks for approximating value functions in reinforcement learning. This alters the update of the Q-values to include the network parameters θ (equation 11), where the term r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; θ) is denoted as the TD target.

$$Q(s_t, a_t; \theta) = Q(s_t, a_t; \theta) + \alpha\left(r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta)\right) \tag{11}$$

The difference between the TD target and the currently estimated Q-value is measured via the mean squared error. The problem with this specific update rule is that the expected Q-value and the currently predicted Q-value are highly correlated, which results in oscillations. [Mnih et al., 2013] successfully trained a deep neural network to predict Q-values for Atari games using a separate network for predicting the target values (target network). After a certain training time the target network is updated with the weights of the network predicting the actual Q-values. This results in more stable training and convergence. Since deep Q-learning learns from transitions sampled under some policy, sampled transitions need to be stored in a buffer from which batches are sampled to train the network. [Schaul et al., 2015] proposed a sampling strategy for prioritizing samples in the buffer according to their TD errors. Samples with a large TD error deviate considerably from the expected Q-value and are thus sampled more often. [Hausknecht and Stone, 2015] propose a sampling method specifically designed for recurrent neural networks in POMDPs: a specific fraction of the minibatch is sampled such that those episodes contain at least one positive reward, another fraction is sampled containing at least one negative reward, and the remaining episodes are sampled randomly. Several other variants of deep Q-learning exist, namely Double Q-learning [van Hasselt et al., 2015], Dueling DQN [Wang et al., 2015], Distributional DQN [Bellemare et al., 2017], non-delusional Q-learning [Lu et al., 2018], and Noisy Nets [Fortunato et al., 2017]. In this work deep recurrent Q-learning with target networks and sampling according to [Hausknecht and Stone, 2015] is used for the pre-training phase of the actor network. The strategy of [Schaul et al., 2015] is used for training the critic network, prioritizing episodes that yield a large mean squared error in the return prediction.
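For illustration, a typical deep Q-learning loss with a separate target network could look like the following PyTorch sketch; the batch layout and module interfaces are assumptions, not the exact implementation used for the Score Contextualization agent.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared error between predicted Q-values and the TD target."""
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t; theta) for the actions that were actually taken.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # TD target computed with the (frozen) target network.
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_taken, target)

# Periodically copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```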

2.3 Policy Gradients

Policy gradients form an alternative to value-based methods such as Q-learning by directly optimizing the policy to perform actions that yield high rewards.

2.3.1 Policy Gradient Theorem

The probability of an episode being sampled is defined as pπ(τ) (equation 12), where τ denotes the episode.

$$p_\pi(\tau) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \tag{12}$$

RL aims to maximize the expected return of an episode sampled from the distribution of trajectories (equation 13), where $R(\tau) = \sum_t r_{t+1}$ is the return of trajectory τ and J(π) is a performance measure for policy π.

The optimal policy π* is the policy that maximizes the return of a trajectory τ sampled from the distribution of trajectories p_π(τ).

$$\pi_* = \operatorname*{argmax}_{\pi} \mathbb{E}_{\tau \sim p_\pi(\tau)}\left[R(\tau)\right] = \operatorname*{argmax}_{\pi} J(\pi) \tag{13}$$

Equivalently for a parameterized policy πθ the objective is shown in equation 14.

$$\theta_* = \operatorname*{argmax}_{\theta} \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)}\left[R(\tau)\right] = \operatorname*{argmax}_{\theta} J(\theta) \tag{14}$$

Theoretically the policy could be optimized by taking the gradient of J(θ) directly; practically this is not straightforward, since the expectation is taken over the trajectory distribution, which itself depends on θ. Also, the gradient depends on the stationary distribution of states under the target policy. The policy gradient theorem (equation 15) helps to solve this problem by reformulating the gradient computation. According to the policy gradient theorem, the gradient of the expectation can be expressed as the expectation of the return weighted by the gradient of the log-likelihood of the trajectory.

$$\nabla_\theta \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)}\left[R(\tau)\right] = \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)}\!\left[R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right] \tag{15}$$

Using the policy gradient theorem, the gradient $\nabla_\theta \mathbb{E}_{\pi_\theta}[R(\tau)]$ can be approximated (equation 16).

$$\nabla_\theta \mathbb{E}_{\pi_\theta}\left[R(\tau)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) \left(\sum_{t=0}^{T} r_{i,t+1}\right) \tag{16}$$

Intuitively, the gradient $\nabla_\theta \mathbb{E}_{\pi_\theta}[R(\tau)]$ decreases the likelihood of negative-return episodes and increases the likelihood of high-return episodes. The reward term in equation 16 can be replaced by any term indicating which actions are preferable over others (a common choice is the advantage function). The above gradient computation only holds for on-policy algorithms, which are sample inefficient since no samples created by an earlier policy can be utilized. The formulation of equation 16 can be extended by importance sampling to incorporate samples created by a behavioral policy µ. This alters the gradient approximation according to equation 17. Importance sampling includes the probability ratio of the policies. Practically this term can be problematic since it can either vanish or explode; thus, the ratio is usually clipped to a certain range.

" T t ! T # X Y πθ(at0 |st0 ) X 0 0 ∇θJ(θ) = Eτ∼pµ(τ) ∇θ log πθ(at|st) r(st , at ) (17) µ(at0 |st0 ) t=1 t0=1 t0=t Optimizing this term without regularization might result in overfitting. If suboptimal actions yield high immediate rewards, the likelihood for those actions is maximized resulting in suboptimal policies. To counter that problem the entropy (equation 18) of the probability distribution over actions can be used as regularization term. High entropy means the distribution is close to uniform, which results in high exploration since all actions are equally likely to be taken. Low entropy means the policy is committed to single actions and mostly exploits. X H(s) = − π(a|s) log π(a|s) (18) a

2.3.2 Proximal Policy Optimization

As opposed to supervised learning, update steps that lower performance cannot easily be corrected in reinforcement learning; once an update results in a bad policy, subsequent samples are collected under that bad policy and recovery is difficult. To remedy this problem the concept of trust regions was developed. The aim is to optimize the policy under the constraint that the distance between the stationary distribution of the behavioral policy and that of the target policy does not exceed a certain threshold. This can be done via natural policy gradients [Peters and Schaal, 2008], which use second-order optimization, or by estimating trust regions for the update.

The simplest way to estimate trust regions was proposed by [Schulman et al., 2017] and is based on clipping of the objective. This is realized by bounding the ratio of probabilities for an action. As in equation 17 the ratio of probabilities is defined as ρ (equation 19). If ρ is equal to 1 the objective is trustworthy since the deviation between the policies is minimal. If ρ is far from 1, the policies are very different and an update might destroy the current policy.

$$\rho_t(\theta') = \frac{\pi_{\theta'}(a_t \mid s_t)}{\mu(a_t \mid s_t)} \tag{19}$$

The ratio ρ is clipped to a certain range (equation 20, where ε = 0.2) to avoid large updates in one direction ([Schulman et al., 2017] propose clipping to the range [0.8, 1.2]). The new gradient update is shown in equation 21. Again, instead of the reward r(s_t, a_t), other measures such as the advantage function can be utilized.

$$\rho_{\text{clip}} = \operatorname{clip}(\rho_t,\, 1 - \epsilon,\, 1 + \epsilon) \tag{20}$$

$$\theta' = \operatorname*{argmax}_{\theta'} \mathbb{E}_{\theta}\left[\min\{\rho_t\, r(s_t, a_t),\; \rho_{\text{clip}}\, r(s_t, a_t)\}\right] \tag{21}$$

The term ρ_t r(s_t, a_t) is the conservative policy iteration objective [Kakade and Langford, 2002]. The term ρ_clip r(s_t, a_t) modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving ρ outside of the pre-determined interval. The min operator makes the objective a lower bound on the unclipped objective. In this work PPO is used for fine-tuning the policy network. Instead of the immediate reward, the reward redistributed by the critic network is used for the update (reward redistribution is explained in detail in section 6.2).
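A minimal PyTorch sketch of the clipped surrogate loss of equations 19-21, negated for gradient descent; in this work the advantages would be the critic's redistributed rewards (interfaces are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = torch.exp(new_log_probs - old_log_probs)             # rho_t
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```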

3 The TextWorld Framework

TextWorld is a Python framework developed by Microsoft Research. Analogously to frameworks like OpenAI Gym [Brockman et al., 2016], it was specifically designed for the development of reinforcement learning algorithms for text-based games. It provides a list of curated games and also enables the user to sample games from a parameterizable game distribution. The two main components of TextWorld are the TextWorld Engine and the TextWorld Generator (see figure 6).
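For orientation, the agent-environment loop through TextWorld's Python API looks roughly like the sketch below, following the interface documented in the TextWorld repository (textworld.start, env.reset, env.step); the game file path is a placeholder.

```python
import textworld

# Placeholder path to any compiled TextWorld game file.
env = textworld.start("games/example_game.ulx")

game_state = env.reset()
print(game_state.feedback)                # narrative shown to the player

game_state, reward, done = env.step("open mailbox")
print(game_state.feedback, reward, done)  # parser feedback and reward signal
```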

3.1 TextWorld Engine

The TextWorld Engine uses an inference engine to check the validity of generated games. A valid game can be finished if there exists a trajectory from an initial game state to the goal state without loops. In contrast to the agent's perspective, text-based games can internally be interpreted as a Markov Decision Process (MDP), since the true underlying states are known to the Engine. The MDP is defined as (S, A, T, R, γ), where S is the set of environment states, A is the set of actions available in state s_t at time t, T is the state transition function, R : S × A ↦ ℝ is the reward function, and γ ∈ [0, 1] is the discount factor. The environment states S, the action space A, and the transition function T are defined by linear logic. States are represented as conjunctions of logical predicates that are true in the current state, as shown in figure 7.


Figure 6: Overview of the TextWorld Framework, taken from [Côté et al., 2018]

Figure 7: Logical Representation of States, P represents the player, taken from [Côté et al., 2018]

The state transition function is defined by a set of logical rules stored in the knowledge base. These rules define which actions are possible in the current state (see figure 8). The left-hand side (LHS) of the implication defines the requirements on the current state and the right-hand side (RHS) defines the resulting predicate being carried over to the next state. All predicates marked with the $ sign are also carried over to the next state. The definition of an action is displayed in figure 9 and is defined as one of the logical rules in the knowledge base. The actions possible in the next state s_{t+1} can be retrieved by performing a step of forward chaining (selecting the valid subset of actions A_{s_t}) in a state s_t for which the conjunction of predicates on the LHS is true. In general, if a winning state is reached the reward function R returns a positive reward.

Figure 8: Logical Representation of the Transition Function; uppercase letters denote object types (F: food type, C: container, S: supporter, R: room), P represents the player, and I represents the player's inventory; the implication connects a rule's preconditions to its effects. Taken from [Côté et al., 2018]


Figure 9: Logical Representation of Action Space, taken from [Côté et al., 2018]

3.2 TextWorld Generator

The TextWorld Generator creates sample games based on its knowledge base, which contains all necessary information by which the game distribution is parameterized, e.g. map size, quest length and complexity, number of objects, etc. In general the TextWorld framework provides two themes for the creation of text-based games, the home theme and the medieval theme. Based on the given parameters in the knowledge base the TextWorld Generator creates a game definition in several steps:

1. World Generation The world generation process is parameterized by the size of the world grid, and the number of rooms and their connections. A map is generated according to the Random Walk algorithm to ensure a wide variety of room configurations.

2. Quest Generation Quests are generated by either forward or backward quest generation. Forward quest generation uses the inference mechanism of the Game Engine (forward chaining, explained in section 3.1) and introduces constraints to avoid cycles in the path from start state to goal state. Backward quest generation starts at the goal state and uses backward chaining (reversed forward chaining) to create a quest. During creation of the quest, the generator is free to extend the world and add missing objects and facts if needed.

3. Text Generation Game states are represented by logical elements and need to be converted into natural language. This is done by utilizing a context-free grammar (CFG). The generator uses different grammars for object names, instructions, and room descriptions. The observations are generated by looping over all elements contained in the current observation and filling out templates. Different templates are used for object names, room descriptions, and quest instructions.

3.3 TextWorld Features

Since TextWorld enables the creation of custom text-based games via a parameterized game distribution, there are several parameters that can be adjusted to simplify the game-play for an agent:

• The size of the state space can be controlled (number of rooms, objects, size of the map, etc.)

• The partial observability of the states can be adjusted, from very little observability to displaying the true underlying states.

• The agent's vocabulary can be restricted to admissible commands only.

• Intermediate rewards can be returned - a small positive reward after performing an action that is contained in the winning policy, or a small negative reward if the performed action does not appear in the winning policy.


Figure 10: Benchmarks on curated games within TextWorld, taken from [Côté et al., 2018]

Handcrafting games for benchmarking also enables users of TextWorld to test trained agents for generalization. Since multiple games can be created from the same distribution, problems like language acquisition (explained in section 4) do not occur anymore.

3.4 TextWorld Benchmarks

[Côté et al., 2018] provide benchmarks on a fraction of the games contained in their list of curated games (50 curated games). Two agents which participated in the Text-Based Adventure AI Competition [Atkinson et al., 2018], namely BYU [Fulda et al., 2017] and Golovin [Kostka et al., 2017], were tested against a simple baseline agent. The baseline agent randomly samples actions from a pool of admissible commands. [Fulda et al., 2017] proposed a preprocessing method based on affordance extraction to craft a small subset of possible actions for a given state and apply tabular Q-learning to solve the text-based environment. Golovin is an agent that relies on a language model trained on fantasy books and utilizes external resources and templates from the Interactive Fiction Database (IFDB).

Figure 10 shows the normalized score of each agent (highest reached score divided by the maximum possible score) for each tested game. All of the curated games originate from different domains and pose different quests to be completed. The baseline agent has a significant advantage over both other agents since it samples from a predefined set of admissible commands. It outperforms both the BYU and the Golovin agent in games which can mostly be solved with navigational commands (Detective), since its sampled commands are mostly navigational. In Advent each player starts with a minimum of 36 points, which explains the equal performance of all three agents. Since Golovin relies on pre-training on fantasy books it outperforms the other agents in games like Zork 1 and Omniquest. The BYU agent outperforms the Golovin agent mostly on games that can be solved by verb-noun commands (ztuu, lostpig). Both agents perform poorly overall, since neither of them conditions on prior observations or actions, which is crucial when dealing with text-based games.


4 Text-based Games from a Reinforcement Learning Perspective

4.1 Formulation as POMDP

After issuing an action the game returns an observation, but not the true underlying game state. Due to this partial observability, text-based games can be seen as sequential decision-making problems in which each action depends on all previously issued actions and observations. An extension of the Markov Decision Process (MDP) is used to formally describe text-based games, namely the Partially Observable Markov Decision Process (POMDP) [Kaelbling et al., 1998]. Figure 11 depicts a sketch of a POMDP.

Figure 11: Sketch of POMDP formulation

A POMDP is defined by a 7-tuple (S, T, A, Ω, O, R, γ). S is the set of environment states, T describes the set of conditional transition probabilities, A is the set of possible actions, Ω is the set of observations, O is the set of conditional observation probabilities, R is the reward function mapping a state-action pair to a real-valued number, R : S × A ↦ ℝ, and γ ∈ [0, 1] is the discount factor.

The underlying state at a given time t is s_t ∈ S and contains the complete underlying information such as the positions and states of all objects and entities, which is mostly hidden from the agent. If the agent issues a command a_t, the current state transitions to s_{t+1} with probability T(s_{t+1} | s_t, a_t). In parser-based games a command a_t can be a string of arbitrary length consisting of a sequence of n words (w_t^0, ..., w_t^{n-1}). The words building the actions are selected from a predefined vocabulary V, and at each timestep n tokens are selected to form a command to be executed, a_t = [w_t^0, ..., w_t^{n-1}], where w_t^{n-1} is a special token indicating the end of the command. An agent learns a policy π_θ which is parameterized by θ and maps the current state s_t and all previously selected words (w_t^0, ..., w_t^{n-2}) to the next word w_t^{n-1} = π_θ(s_t, w_t^0, ..., w_t^{n-2}). The information displayed at time t is o_t ∈ Ω and depends on the environment state and the issued command with probability O(o_t | s_t, a_{t-1}). The agent receives a reward r_t = R(s_t, a_t) based on the taken actions, and its goal is to maximize the discounted expected reward E[Σ_t γ^t r_t]. In the realm of choice-based and hypertext-based games an action a_t is a fixed string of arbitrary length provided to the agent in advance. In order to properly solve a POMDP the agent needs to consider the prior sequence of observations and actions at each timestep.
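To handle this dependence on the full history, the agents considered in this work use recurrent networks; the following is a generic sketch (layer sizes and input encoding are illustrative, not the Score Contextualization architecture) of a recurrent Q-network whose hidden state summarizes past observations and actions for a fixed, choice-based action set.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-values over a fixed action set, conditioned on the observation history."""

    def __init__(self, vocab_size, num_actions, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, token_ids, hidden=None):
        # token_ids: (batch, seq_len) word indices of the current observation
        # (optionally concatenated with the previously issued command).
        embedded = self.embedding(token_ids)
        output, hidden = self.encoder(embedded, hidden)
        q_values = self.q_head(output[:, -1])  # one Q-value per admissible action
        return q_values, hidden                # the hidden state carries the history
```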


4.2 Challenges posed by Text-based Games

Since the interaction with the environment is based solely on natural language, additional issues arise on top of those already present in reinforcement learning. The most common challenges posed by text-based games are:

1. Partial Observability Since text-based games are defined as a POMDP the agent only receives local information instead of the true underlying state. The displayed information may only contain a current room description, navigation possibilities to adjacent rooms, and the agent’s inventory. Based on this observation the agent has to infer which action is most beneficial in the current state.

2. Enormous State Space Since text-based games solely consist of natural language, the resulting state space is combinatorial and enormous. The number of states increases exponentially with the number of rooms and objects. This makes tabular reinforcement learning algorithms impractical.

3. Enormous Action Space The agent needs to issue actions comprised of natural language. In addition to being extremely large and combinatorial, the action space is sparse: most commands are not accepted by the game parser and the ratio of admissible commands to the number of possible commands is extremely small. Recent methods try to limit the state space and restrict the action space by extracting affordance relations in a new embedding space [Fulda et al., 2017], or by learning which commands are not favorable in certain states [Haroush et al., 2018]; however, such restrictions unavoidably lose information.

4. Exploration vs. Exploitation The trade-off between Exploration and Exploitation is one of the fundamental issues in reinforcement learning. Exploration is a core concept in text-based games since the agent needs to collect information about the environment, search for clues to solve puzzles, and gain knowledge that might come in handy later in the game to solve a certain quest. A common approach to remedy this problem in TD learning is to include a probability of issuing a random action (ε-greedy). In the PG setting, an entropy term of the categorical distribution from which actions are sampled is usually added to the objective.

5. Sparse Rewards Another fundamental issue in reinforcement learning is long-term credit assignment. Usually the agent receives a positive reward upon finishing the game and a negative or no reward at all for failing to complete it. Earlier actions in a trajectory may be crucial to fulfill a quest, but since they are not rewarded they are not taken. A recent method [Arjona-Medina et al., 2018] introduces a framework for learning an intrinsic reward distribution to promote favorable actions that are not rewarded by default.

6. Observation Modality At time t the agent receives an observation o_t based on the last command it issued. o_t is an arbitrarily long sequence of words separated by spaces. The player has to decide which parts of the observation to focus on. Considering only single words means losing some information; focusing on whole sentences is computationally more expensive and might promote redundancies.

7. Parser Feedback The game parser varies from game to game, and so do failure messages. Since only a small fraction of possible commands is admissible, the agent needs to find a way to comprehend the feedback of the parser. This problem limits the ability of agents to generalize across games.


8. Affordance Extraction & Common Sense Reasoning An essential part of text-based games is to understand affordances (which verbs can be applied to which objects) and to develop a common sense of how to interact with day-to-day objects. Comprehending affordances does not apply to the domain of choice-based or hypertext-based games; however, understanding how to interact with objects (e.g. the player first needs to collect a key, unlock a door with the key, and open it to progress to the next room) is still necessary.

9. Language Acquisition A vast variety of text-based games in different domains exists. Zork, for example, is a fantasy adventure game in which a considerable number of words is invented or taken from fantasy books. The meaning of such words and the corresponding affordances must be learned on-the-fly. An agent trained on a text-based environment needs to overcome challenges related to natural language and reinforcement learning simultaneously in order to succeed and finish the game.

5 Relevant Work

The DRRN (deep reinforcement relevance network) proposed by [He et al., 2015] was designed for choice-based and hypertext-based games. Representations for states and actions are learned separately and combined with an interaction function (dot product) to obtain a Q-value for an action. The word representations are learned via a bag-of-words approach as in [Mikolov et al., 2013]. They use experience replay to train the network and show that it learns to correlate action representations with state representations.
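A minimal sketch of this interaction function is given below; the module, layer sizes, and pre-encoded feature inputs are illustrative assumptions rather than the original implementation:

```python
import torch
import torch.nn as nn

class DRRNScorer(nn.Module):
    """Sketch of the DRRN idea: separate encoders for state and action text
    features, combined with a dot product to yield one Q-value per action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.action_net = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU())

    def forward(self, state_feats: torch.Tensor, action_feats: torch.Tensor) -> torch.Tensor:
        # state_feats: (batch, state_dim), action_feats: (batch, n_actions, action_dim)
        s = self.state_net(state_feats)           # (batch, hidden)
        a = self.action_net(action_feats)         # (batch, n_actions, hidden)
        return torch.einsum("bh,bnh->bn", s, a)   # dot-product interaction -> Q-values
```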

The LSTM-DQN was proposed by [Narasimhan et al., 2015] and jointly learns state representations and action policies via Q-learning. It consists of a representation generator and an action scorer which approximates the Q-values for state-action pairs, and it only allows verb-noun commands. The model was tested on two text-based games with small vocabulary sizes and exhibited superior performance compared to two state-of-the-art agents at that time. Neither the DRRN nor the LSTM-DQN has the capacity to cope with partial observability, since they assume a Markovian environment and only condition on single transitions; however, text-based games have been shown to be non-Markovian, since the current state depends on the sequence of actions taken earlier, not only on the last action taken.

The Action Elimination Network (AEN) proposed by [Haroush et al., 2018] learns to eliminate irrelevant actions in a given state. On top of the AEN a Deep Q-Network (DQN) is stacked to predict Q-values for state-action pairs. The AEN predicts a score for each action of a restricted action space, which indicates how likely an action is to fail in a given state. Those predictions serve as input to the DQN to select actions that are most likely admissible in the current state. They train their model on small subtasks of Zork 1 and show that the agent is capable of completing those; however, their agent is not trained on the whole game of Zork 1. Like the work of [Narasimhan et al., 2015] and [He et al., 2015], their approach does not condition on prior actions and observations.

The NAIL ("Navigate Acquire Interact Learn") agent 1 was introduced by [Hausknecht et al., 2019] for playing IF games. NAIL follows a modular architecture separating decision making from knowledge acquisition. The decision-making problem is further decomposed into a set of decision modules performing actions when a certain context arises. The main decision modules are Examiner, Interactor, and Navigator. The Examiner identifies objects in the current state which can be interacted with; these are later added to the knowledge graph upon visiting a new state.

1available at: https://github.com/Microsoft/nail_agent

The Interactor aims to reasonably interact with objects in the environment. It utilizes a 5-gram language model which computes the probability of an action making sense. The list of possible actions is created according to action templates commonly used in the domain of IF games. The Navigator is responsible for navigating to a new location. Each module computes an eagerness score at each step of the game and the module with the highest eagerness takes control of the current situation. Further, they utilize a validity detector which predicts the validity of an issued command. During interaction with the game the agent accumulates its knowledge and constructs an internal knowledge graph. It keeps track of locations, objects, connections between locations, unrecognized words, and object states. The graph is only updated for common actions, which have proven successful in most text-based games. Non-common actions are only stored if they were successful. NAIL won the Text-Based Adventure AI competition in 2018, outperforming other agents by a large margin.

Another architecture specifically designed for text-based games was proposed by [Zelinka, 2018] and inspired by [He et al., 2015]. The siamese state-action Q-network (SSAQN) aims to overcome shortcomings of the DRRN and to serve as a new baseline for choice-based and hypertext-based games. They produce contextualized word embeddings with an LSTM and apply a dense layer on top of the hidden state and action representations of the LSTM, followed by a hyperbolic tangent activation function. Contrary to [He et al., 2015], they use cosine similarity instead of the dot product as interaction function, since it is a common choice for determining document similarity according to [Huang, 2008]. Additionally, the SSAQN was trained to play multiple games concurrently within the pyfiction framework 2. To counteract the overestimation of Q-values for actions taken multiple times in a certain state, they introduce a simple form of intrinsic reward motivation [Singh et al., 2004] depending on the number of times an action was already taken in a certain state. The SSAQN outperforms the DRRN on every benchmark.

Other work ([Zelinka et al., 2019]) focuses on dynamically learning knowledge graphs from taken actions and the returned observation of the environment. They propose the TextWorld KG dataset 3, which consists of {G^seen_{t-1}, A_{t-1}, O_t, G^seen_t} tuples, where G^seen_{t-1} denotes the knowledge graph seen up to timestep t - 1, A_{t-1} denotes the last taken action, O_t denotes the observation returned by the game engine, and G^seen_t denotes the updated ground-truth knowledge graph according to A_{t-1} and O_t. They train a seq2seq model to predict the update operations to the knowledge graph given {G^seen_{t-1}, A_{t-1}, O_t} as input. The model architecture consists of a text encoder, encoding the concatenation of observation and action, a graph encoder producing a hidden graph representation, a representation aggregator combining the two representations via attention mechanisms, and a command generator generating update operations to the last knowledge graph. Teacher forcing is used as the training procedure for command generation. Their model reaches reasonable performance and serves as a foundation for [Adhikari et al., 2020].

A sequence-to-sequence architecture based on [Yuan et al., 2018] was proposed by [Tao et al., 2018] to learn to formulate valid actions given the current context returned by the game engine. This is framed as a supervised learning task where the input to the model is the current context together with all objects in the player’s inventory, and the corresponding labels are the admissible actions that can be applied in this particular game state. They collected two different datasets on which they evaluate their model. Their best model proves to be capable of generating an adaptive action space depending on the current game state.

2pyfiction: https://github.com/MikulasZelinka/pyfiction
3publicly available at https://github.com/MikulasZelinka/textworld_kg_dataset


Most of the above-mentioned methods use word2vec [Mikolov et al., 2013] embeddings as initial encodings. The Bidirectional Encoder Representations from Transformers (BERT, [Devlin et al., 2018]) model is capable of learning either sentence or word embeddings in a self-supervised manner. BERT alleviates the unidirectionality constraint by utilizing two pre-training objectives, namely Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). It is applicable to feature-based and fine-tuning approaches. The model is based on a multi-layer bidirectional transformer encoder following [Vaswani et al., 2017]. BERT can easily be fine-tuned on downstream tasks and set a new state of the art on a wide variety of natural language processing tasks. No work has been published so far utilizing BERT for solving text-based games; however, it might be a fruitful addition in this domain.

6 Methods

6.1 Score Contextualization

Algorithmic improvements for deep reinforcement learning agents were proposed by [Jain et al., 2019] to cope with partial observability in text-based games. Their architecture is inspired by [Yuan et al., 2018] and [Narasimhan et al., 2015] and makes use of recurrent neural networks to overcome the non-Markovian nature of text-based games. They use deep Q-learning to predict the value of an action given the current state. By definition the vanilla DQN [Mnih et al., 2013] is trained with experience replay, sampling single transitions from a collection of transitions. Since text-based games are defined as POMDPs and issued commands need to depend on all previously visited states and taken commands, they extend the definition of Q-learning to depend on whole histories of observations and actions. Equation 22 shows the Q-value update depending on histories instead of single states, where h_t denotes the history including all observations and actions up to timestep t. Due to the combinatorial nature of the state space the number of possible histories rises exponentially and must be approximated by a recurrent neural network.

Q(h_t, a_t) = Q(h_t, a_t) + α [ r_t + γ max_{a∈A} Q(h_{t+1}, a) − Q(h_t, a_t) ]    (22)

They tested an additional variant of Q-learning, namely Consistent Q-learning [Bellemare et al., 2016]. Consistent Q-learning aims at decreasing the Q-value of suboptimal actions while maintaining it for optimal ones, dependent on whether a state change has been elicited. Equation 23 shows the Q-value backup for consistent Q-learning, where ξ̂_t denotes the probability of the action a_t being admissible in the current state s_t. Intuitively, the consistent Q backup propagates the reward signal back only if the current action is admissible. They show that by performing the consistent Q backup the action-value of inadmissible actions indeed diminishes during training.

Q(h_t, a_t) = Q(h_t, a_t) + α [ r_t + γ ( ξ̂_t max_{a∈A} Q(h_{t+1}, a) + (1 − ξ̂_t) Q(h_{t+1}, a_t) ) − Q(h_t, a_t) ]    (23)
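To make the two backups concrete, the following is a minimal scalar sketch of equations 22 and 23; the function and argument names are illustrative and not taken from the original implementation:

```python
def q_backup(q, q_next_max, r, alpha=0.1, gamma=0.9):
    """History-based Q-learning backup (equation 22), written for scalar values."""
    return q + alpha * (r + gamma * q_next_max - q)

def consistent_q_backup(q, q_next_max, q_next_same, xi, r, alpha=0.1, gamma=0.9):
    """Consistent Q-learning backup (equation 23): the bootstrap mixes the greedy
    next value and the value of the same action, weighted by the probability xi
    that the issued action was admissible (i.e. elicited a state change)."""
    bootstrap = xi * q_next_max + (1.0 - xi) * q_next_same
    return q + alpha * (r + gamma * bootstrap - q)
```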

The reward structure in text-based games is mostly linear and intermediate rewards are given for each sub-quest being completed. [Jain et al., 2019] propose to split a whole sequence into sub-sequences for each score and to assign a separate network head to each of the sub-sequences; they call this architecture score contextualization. The number k of network heads is a hyperparameter, which they chose to be k = 5 in their work. Each network head predicts Q-values for a sub-sequence of a trajectory corresponding to a specific score. A drawback of this method is that if the number of different scores exceeds the number of network heads, the network head corresponding to the highest score is utilized for prediction, and usually the maximum reachable score is not known. Figure 12 shows the network architecture.


Figure 12: Score Contextualization Architecture

Actions and observations are embedded separately by the representation generator φR, which consists of an embedding layer and an LSTM layer to produce contextualized embeddings. The learned word embeddings for actions and observations are average-pooled to create sentence representations. The sentence representations for commands and observations are concatenated and serve as input to the k score contextualization LSTM sub-networks. The action scorer φA is applied to every timestep of the LSTM and consists of two linear layers and a ReLU activation to predict the Q-value for the currently issued command. The auxiliary classifier is applied to every timestep to predict the probability of the current action being admissible, with the admissibility of an action being defined by whether it elicited a state change in the game.
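A rough single-timestep sketch of this architecture in PyTorch is shown below; the dimensions, the number of actions, and the single-step simplification are assumptions made for illustration, while the layer names follow the notation above:

```python
import torch
import torch.nn as nn

class ScoreContextualization(nn.Module):
    """Sketch: shared representation generator, k score-contextualized LSTM heads,
    an action scorer, and an auxiliary admissibility classifier."""
    def __init__(self, vocab_size=2000, emb_dim=20, hidden=64, n_actions=10, k=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.phi_R = nn.LSTM(emb_dim, hidden, batch_first=True)        # representation generator
        self.heads = nn.ModuleList([nn.LSTM(2 * hidden, hidden, batch_first=True)
                                    for _ in range(k)])                # one head per score context
        self.phi_A = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                   nn.Linear(128, n_actions))          # action scorer
        self.phi_C = nn.Linear(hidden, 1)                              # auxiliary admissibility classifier

    def encode(self, tokens):
        # tokens: (batch, seq_len) word indices; mean-pool the contextualized embeddings
        out, _ = self.phi_R(self.embed(tokens))
        return out.mean(dim=1)                                         # (batch, hidden)

    def forward(self, obs_tokens, act_tokens, head_idx):
        obs_rep = self.encode(obs_tokens)
        act_rep = self.encode(act_tokens)
        x = torch.cat([obs_rep, act_rep], dim=-1).unsqueeze(1)         # one timestep of the history
        h, _ = self.heads[head_idx](x)
        q_values = self.phi_A(h.squeeze(1))                            # Q-value per action
        admissible = torch.sigmoid(self.phi_C(h.squeeze(1)))           # p(action admissible)
        return q_values, admissible
```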

They trained their agent on handcrafted games created with the TextWorld framework and show that the agent is able to navigate through environments of increasing difficulty. They conducted an ablation study on different training settings including consistent Q-learning, score contextualization, and score contextualization with action masking. For benchmarking they train their agent on Zork 1, where it reached a maximum score of 35. Similar to other recently published work, they fail to surpass the 35-point benchmark since this corresponds to a difficult in-game task (the troll quest); however, their agent does not make use of the information-gathering action INVENTORY and issues the LOOK command only occasionally, thus learning its state representation and belief only from the feedback of the environment. In this work the same network architecture is used for the actor network without the action gating mechanism, since this information is not always accessible through the environment. Moreover, reward redistribution assigns credit to actions that are essential to complete the game and blame to those that are not; thus, no action gating is needed. A reproducibility study was performed for score contextualization without action gating on all TextWorld games and the benchmark Zork 1. The results of this reproducibility study can be seen in section 7.3.


6.2 RUDDER

Assigning credit to actions that cause a reward in the future (delayed reward) is one of the central problems in reinforcement learning. Delayed rewards hinder learning good approximate policies for text-based games. Reward is mostly received after completing a subtask of the game, but not for the intermediate actions within such a task. RUDDER [Arjona-Medina et al., 2018] was designed to remedy this problem in two essential steps: (i) reward redistribution to create return-equivalent MDPs, and (ii) return decomposition via contribution analysis.

MC and TD learning are forward-view approaches, since they make guesses about the future, which is highly chaotic (state transitions can have a high branching factor, and stochastic environments result in probabilistic transitions). In early stages of the game the number of possible transitions affected by this variance/bias grows exponentially. The further the agent progresses throughout the game, the more states have already been visited and the number of states affected by the variance/bias grows only linearly. This results in high variance for MC (because entire trajectories are sampled) and high bias for TD learning (because of the initial guess used for bootstrapping).

Reward transformation, or reward shaping, is a common concept in reinforcement learning and was first introduced by [Skinner, 1958]. Reward shaping can be used to inject prior knowledge into the reward function. Potential-based reward shaping was introduced by [Ng et al., 1999]: a shaping function F is used to transform an existing MDP M = (S, A, T, γ, R) into another MDP M' = (S, A, T, γ, R + F) which maintains the optimal policy of M. [Wiewiora et al., 2003] extend this concept to shaping based on states and actions, which allows injecting additional information about which actions to issue. As opposed to RUDDER, potential-based reward shaping is a very general theory for injecting prior knowledge while maintaining the optimal policy. It is based on a potential-based shaping function which is learned during training and does not alter or reduce the original reward. RUDDER assumes an optimal reward redistribution by which the expected future rewards equal zero and alters the original reward distribution. It also performs contribution analysis by taking differences of continuous predictions.

6.2.1 Reward Redistribution

RUDDER proposes a reward redistribution which, if optimal, results in the expected future rewards being equal to 0 (equation 5). Thus, the Q-values equal the expected immediate reward, according to equation 25. Since the expected future reward causes high variance/bias in MC/TD learning, an optimal reward redistribution should speed up and stabilize learning.

E_π [ Σ_{k=1}^{T−t−1} r_{t+k+1} | s_t, a_t ] = 0    (24)

q(s_t, a_t) = E_π [ r_{t+1} | s_t, a_t ]    (25)

Optimal reward redistribution is generally not feasible in MDPs or POMDPs since the reward signal needs to be Markov. Thus, [Arjona-Medina et al., 2018] propose a transformation of the MDP into a return-equivalent sequence-Markov decision process (SDP) with expected future rewards equal to zero. The difference between an SDP and an MDP lies only in the structure of the reward; policy and transition probabilities maintain the Markov property. If an SDP is return-equivalent to an MDP, the expected return at the first timestep is the same, ṽ^π_0 = v^π_0, and it possesses the same optimal policy π.

The optimal policy is defined as the policy that yields the highest expected return v_0 at t = 0 (see equation 26).

π* = argmax_π ṽ^π_0 = argmax_π v^π_0    (26)

We define the expected future reward at time t in the interval [t + 1, t + m + 1] as κ (equation 27). Thus, the formulation of the Q-values changes (equation 28). An optimal second-order reward redistribution results in κ(T − t − 1, t) = 0, implying that the expected reward equals the difference between consecutive Q-values of the originally delayed reward (equation 29).

κ(m, t) = E_π [ Σ_{τ=0}^{m} r_{t+1+τ} | s_t, a_t ]    (27)

q(s_t, a_t) = E_π [ r_{t+1} | s_t, a_t ] + κ(T − t − 1, t)    (28)

E_π [ r_{t+1} | s_{t−1}, a_{t−1}, s_t, a_t ] = q̃^π(s_t, a_t) − q̃^π(s_{t−1}, a_{t−1})    (29)

Since the expectation in equation 29 depends on (s_{t−1}, a_{t−1}, s_t, a_t), the reward redistribution is required to be second-order Markov. After redistribution the advantage function of the SDP is preserved, since the term q̃^π(s_{t−1}, a_{t−1}) can be rewritten as E_{s_{t−1}, a_{t−1}}[ q̃^π(s_{t−1}, a_{t−1}) | s_t ], which equals the mean Q-value of the preceding timestep.

RUDDER transforms the reinforcement learning task into a supervised regression task: an LSTM is trained to predict the scaled return G_0 (equation 30) of entire sequences. s_r is a scaling factor chosen to be equal to the maximum score reachable in a game. The purpose of rescaling the ground truth is to avoid large gradients.

G_0 = (1 / s_r) Σ_{t=0}^{T} r_{t+1}    (30)

LSTMs are generally preferred over feed-forward networks (FFNs) because they are capable of storing and reusing information from the past. An LSTM learns to store a pattern only if it encounters strong evidence for a change in the expected return. Moreover, consecutive predictions of an LSTM are highly correlated since they rely on internal cell states, which makes it more likely that prediction errors cancel out. FFNs are generally not designed for sequence data and are prone to prediction errors. For return prediction a modified LSTM architecture is utilized (see figure 13). The forget gate and the output gate of the LSTM cell are omitted. Due to the cumulative nature of the G_0 return of a sequence there is no need to forget anything, thus the forget gate is obsolete; it would also reintroduce the vanishing gradient problem [Hochreiter, 1998]. The output gate is omitted since the output at timestep t − 1 should not be capable of switching off the output at timestep t; the output at timestep t should be based solely on the input consisting of state-action pairs. Further, the recurrent connections at the cell input are omitted to enforce information extraction from the current state-action pair. The input gate only considers recurrent connections, since information from the previous timestep t − 1 might be beneficial for return prediction at timestep t. Equation 31 shows the formulation of the components of the LSTM cell visualized in figure 13. For the implementation of this special LSTM architecture the Python package widis-lstm-tools 4, which serves as a wrapper around PyTorch's LSTM implementation, is used; it allows gates, inputs, and recurrences to be adjusted comfortably.

4https://github.com/widmi/widis-lstm-tools


z(t) = h(W_z x(t))
i(t) = σ(R_i y(t − 1))
c(t) = c(t − 1) + z(t) i(t)
y(t) = tanh(c(t))    (31)
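A minimal sketch of this reduced cell is shown below. The tanh activation for the cell input and the weight names W_z and R_i are assumptions; the experiments in this work use the widis-lstm-tools package rather than this hand-written loop:

```python
import torch
import torch.nn as nn

class ReturnPredictionCell(nn.Module):
    """Sketch of the reduced LSTM cell for return prediction (equation 31):
    no forget gate, no output gate, cell input without recurrence, and an
    input gate driven only by the previous output."""
    def __init__(self, input_dim: int, hidden: int):
        super().__init__()
        self.W_z = nn.Linear(input_dim, hidden)   # cell input, forward connections only
        self.R_i = nn.Linear(hidden, hidden)      # input gate, recurrent connections only

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (batch, time, input_dim); returns the per-timestep outputs y(t)
        batch, T, _ = x_seq.shape
        hidden = self.W_z.out_features
        c = x_seq.new_zeros(batch, hidden)
        y = x_seq.new_zeros(batch, hidden)
        outputs = []
        for t in range(T):
            z = torch.tanh(self.W_z(x_seq[:, t]))   # z(t), assumed tanh cell-input activation
            i = torch.sigmoid(self.R_i(y))          # i(t) from y(t-1)
            c = c + z * i                           # cumulative cell state (no forget gate)
            y = torch.tanh(c)                       # y(t), no output gate
            outputs.append(y)
        return torch.stack(outputs, dim=1)          # (batch, time, hidden)
```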

The LSTM is trained on entire sequences stored in a buffer. The mean squared error (MSE) is taken between the LSTM output at the last timestep y(T) and the sequence return G_0 (equation 32), where N denotes the number of sequences in a minibatch. To enforce continuous prediction, an auxiliary loss is added by taking the MSE between the prediction y(t) at each timestep and the overall sequence return G_0, see equation 33.

L_main = (1 / N) Σ_{n=1}^{N} (y_n(T) − G_{0,n})²    (32)

L_aux = (1 / N) Σ_{n=1}^{N} Σ_{t=0}^{T} (y_n(t) − G_{0,n})²    (33)

As mentioned above, an optimal reward redistribution results in the expected future reward being equal to zero (equation 24); however, optimality is not guaranteed in practice, leading to a residual return κ > 0. To counteract this issue another term may be introduced into the loss function, namely the kappa correction. The kappa correction (equation 34) is formulated as the difference between the return prediction at timestep t and the one at timestep t + k, i.e. k steps in the future. The intuition behind this is that if the LSTM is capable of learning the return a sequence yields in the future, it can utilize this information for the current timestep t. This acts as a regularization term in the loss function and is learned with a separate network head.

L_kappa = (1 / N) Σ_{n=1}^{N} Σ_{t=0}^{T−k} (y_n(t + k) − y_n(t))²    (34)

Summing the main loss, the auxiliary loss, and the kappa correction yields the loss function used for training the LSTM, L_total = L_main + s_aux L_aux + s_kappa L_kappa, where s_aux and s_kappa represent scaling factors for the auxiliary loss and the kappa loss. Note that the losses need to be scaled properly to ensure stable training. At the beginning of training the main loss L_main must have the highest impact, since reward redistribution depends on correctly predicting the expected return of the entire sequence. As soon as the G_0 prediction yields reasonable results, the auxiliary loss should start to distribute the return backwards across the sequence. Finally, the kappa correction only serves as a regularization term after the reward has been successfully redistributed.
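The combined training loss can be sketched as follows; the tensor shapes and the simple batch-mean normalization are illustrative simplifications of equations 32-34:

```python
import torch

def rudder_loss(y: torch.Tensor, g0: torch.Tensor, w_k: int = 5,
                s_aux: float = 0.1, s_kappa: float = 1e-5) -> torch.Tensor:
    """Sketch of L_total = L_main + s_aux * L_aux + s_kappa * L_kappa.
    y: (batch, T) continuous return predictions, g0: (batch,) scaled returns."""
    l_main = ((y[:, -1] - g0) ** 2).mean()                    # final prediction vs. sequence return
    l_aux = ((y - g0.unsqueeze(1)) ** 2).mean()               # continuous prediction at every timestep
    if y.shape[1] > w_k:                                      # kappa correction over a window of w_k steps
        l_kappa = ((y[:, w_k:] - y[:, :-w_k]) ** 2).mean()
    else:
        l_kappa = y.new_zeros(())
    return l_main + s_aux * l_aux + s_kappa * l_kappa
```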

6.2.2 Contribution Analysis

Contribution analysis tries to identify which features of the input to a neural network were decisive for a certain prediction, for example which regions of an image resulted in a network predicting the class cat. Examples of contribution analysis techniques are Integrated Gradients [Sundararajan et al., 2017], Layer-wise Relevance Propagation [Bach et al., 2015], and DeepLIFT [Shrikumar et al., 2017].

RUDDER uses a contribution analysis method similar to Layer-wise Relevance Propagation. The relevance (expected immediate reward) for each transition is propagated backwards from the end of the sequence to intermediate timesteps by taking the differences between continuous predictions (r_t = y(t) − y(t − 1)). The main advantage of this method is that the predictions at adjacent timesteps contain the same contributions of the past and the future, which cancel out by taking the difference. What remains is the direct contribution of the two state-action pairs.


Figure 13: LSTM Architecture used for return prediction

The expected return of the first timestep is a special case since no difference can be taken to an earlier timestep. In this case the difference between the prediction of the first timestep of a novel sequence and the mean expected return at the first timestep of all predictions in the training set is taken.
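A minimal sketch of this redistribution step is shown below; the function name and shapes are assumptions:

```python
import torch

def redistribute_reward(y: torch.Tensor, mean_first_prediction: float) -> torch.Tensor:
    """Contribution analysis by differences of continuous predictions:
    r_t = y(t) - y(t-1); the first timestep is taken relative to the mean
    first-timestep prediction over the training set."""
    first = y[:, :1] - mean_first_prediction      # special case at t = 0
    rest = y[:, 1:] - y[:, :-1]                   # differences of adjacent predictions
    return torch.cat([first, rest], dim=1)        # (batch, T) redistributed rewards
```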

6.2.3 Redistribution Quality

During training the return prediction acts as a critic by redistributing the reward across sequences while the actor progresses throughout the game. Thus, the critic network needs to be capable of generalizing to unseen trajectories; however, the further the actor progresses in the game, the more unseen data is created for the critic to predict on. Thus, we must keep track of the quality of the redistribution on novel sequences. The computation of the redistribution quality is problem specific and can vary. The quality of a reward redistribution on a newly created sequence is computed according to equation 35. ε_b denotes the absolute error between the predicted expected return G_0 and the actual return for a batch b, µ is the current variance in the buffer, and ε is a hyperparameter in the range [0, 1]. The absolute batch error ε_b is upscaled by s_r in order to be in the same range as the buffer variance µ. If ε_b ≥ 1 the first term dominates and the second term turns negative, resulting in a negative quality. In the case of ε_b < 1 the result depends on whether ε_b > 1 − ε; if so, the second term again turns negative. If ε_b < 1 − ε the quality starts to increase; thus, 1 − ε determines the threshold of a valid reward redistribution. The general intuition behind this formulation is that the redistribution quality depends on the current variance in the buffer and the absolute prediction error. If the variance in the buffer is small, ε_b needs to be very small to reach reasonable quality; for larger buffer variance a higher absolute error is allowed. Since the variance in the buffer changes over time as new episodes are generated by the agent, the quality needs to adapt to those circumstances. As soon as the redistribution quality on newly generated sequences drops below a certain threshold, the LSTM is re-trained on all trajectories, including the novel ones.


Q_RR = 1 − (ε_b · s_r) / µ + 1 / (1 − ε − ε_b)    (35)

Although the LSTM is steadily re-trained after the moving average of the quality drops below a threshold, it can happen that reward redistributions on newly generated episodes yield a very low quality and are thus not usable. For such cases the quality Q_RR of the redistribution is used to interpolate between the original reward and the redistributed reward. For a quality Q_RR = 0 the advantage of the original reward (see equation 36) is used as redistribution, for a quality Q_RR = 1 the redistributed reward is used (see equation 37). r^{RR}_{t+1} denotes the redistributed reward at timestep t and r_{t+1} the reward originally returned by the environment. Additionally, redistributed rewards that are assigned to inadmissible actions are masked out according to an action gating mechanism (see section 7.1.2).

A(s_t, a_t) = Q(s_t, a_t) − V(s_t) = r_{t+1} + V(s_{t+1}) − V(s_t)    (36)

r_{t+1} = Q_RR · r^{RR}_{t+1} + (1 − Q_RR) · A(s_t, a_t)    (37)
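The quality measure and the interpolation can be sketched with two scalar helpers (illustrative names; the formula assumes ε_b does not equal 1 − ε):

```python
def redistribution_quality(eps_b: float, mu: float, s_r: float, eps: float = 0.6) -> float:
    """Equation 35: quality of a redistribution on newly generated episodes,
    based on the absolute batch prediction error eps_b and the buffer variance mu."""
    return 1.0 - (eps_b * s_r) / mu + 1.0 / (1.0 - eps - eps_b)

def blended_reward(q_rr: float, r_redistributed: float, advantage: float) -> float:
    """Equation 37: interpolate between the redistributed reward and the advantage
    of the original reward according to the redistribution quality."""
    return q_rr * r_redistributed + (1.0 - q_rr) * advantage
```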

6.2.4 N-Step Kappa Correction

In addition to serving as a regularization term, the kappa correction can also be included in the computation of the immediate redistributed reward in the form of an N-step return correction. Instead of using only the difference between adjacent return predictions, the sum of differences over a window can be taken as redistributed reward (see equation 38), where k determines the depth of the N-step window to be summed over. Intuitively, an appropriate choice for this window is the window size of the kappa correction loss (equation 34).

r^{κRR}_{t+1} = Σ_{τ=0}^{k} r^{RR}_{t+1+τ}    (38)

Since the reward redistribution is not optimal in practice, the redistributed reward might still be delayed, although not necessarily by much. Summing over a specified window captures the remaining delay so that all actions receive their corresponding reward. The N-step kappa correction helps to redistribute the remaining delayed reward locally, since RUDDER bridges long delays but not local ones; thus, a local reward redistribution is needed.
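A minimal sketch of this N-step correction (loop-based and purely illustrative):

```python
import torch

def n_step_correction(r_rr: torch.Tensor, k: int) -> torch.Tensor:
    """Equation 38: replace each redistributed reward by the sum over the next k
    redistributed rewards to capture locally remaining delays."""
    T = r_rr.shape[-1]
    out = torch.zeros_like(r_rr)
    for t in range(T):
        out[..., t] = r_rr[..., t:min(t + k + 1, T)].sum(dim=-1)
    return out
```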

7 Empirical Results

In this section empirical results collected throughout this work are exhibited. First the training procedure for Score Contextualization and RUDDER as well as the fine-tuning approach of the Actor-Critic method is explained in more detail. Further, handcrafted benchmarks created with the TextWorld framework are presented which were originally used by [Jain et al., 2019]. Finally, results for a reproducibility study of the work done by [Jain et al., 2019], return prediction and reward redistribution in the realm of text-based games, and a novel fine-tuning approach using policy gradients are shown.


7.1 Training Procedure

The training procedure is split into two parts: the first part is equivalent to the training procedure conducted by [Jain et al., 2019] for score contextualization, while the second part includes the incorporation of RUDDER as a critic network, the handling of reward redistribution, and fine-tuning with policy gradients.

7.1.1 Score Contextualization

For score contextualization the network architecture shown in section 6.1 is used without the action gating mechanism. The auxiliary classifier φC is intentionally left out of the actor network, since a sufficiently optimal reward redistribution should assign credit to important intermediate actions and blame or no reward to actions that are not beneficial. Since the reward redistribution is not optimal, the action gating mechanism is instead included in the critic network.

The network is trained for a total of 1.3 million steps with a maximum of 200 steps per episode. The behavioral policy µ acts according to ε-greedy action selection during training, with ε = 1 slowly annealed to ε = 0.1 over the first one million training steps. The network is initialized according to a random normal distribution. A maximum vocabulary size of 2000 is used with an embedding dimension of 20 and a hidden dimension of 64 for the contextualized word embeddings. The same hidden dimension is used for the k different score contextualization network heads. The action scorer φA is a two-layer FFN with 128 units and |A| units respectively, where |A| denotes the size of the action space. To enforce exploration the agent issues a LOOK command every 20 steps. As soon as the agent finishes an episode, it is stored in a lesson buffer. Training starts once the number of episodes stored in the buffer exceeds the minibatch size of 64 used for training. The network is trained every fourth step by sampling trajectories according to the sampling strategy introduced by [Hausknecht and Stone, 2015], with a positive fraction of τ_p = 0.25 and a negative fraction of τ_n = 0.25. If no episodes containing positive or negative reward are present in the buffer, the episodes are sampled randomly. From every trajectory a sub-sequence of length h = 15 is sampled randomly and utilized for training the network. The agent is updated by the Q-learning update rule (equation 10) using a discount factor γ = 0.9. The target network is updated every 10,000 steps and the agent network is saved every 25,000 steps. The Adam optimizer [Kingma and Ba, 2014] with a learning rate of 0.001 is used for optimization.
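For reference, the hyperparameters of this pre-training phase can be collected in a single configuration; the dictionary below merely restates the values listed above with illustrative key names:

```python
# Q-learning pre-training configuration for score contextualization
# (values as stated above; key names are illustrative).
SCORE_CONTEXT_CONFIG = {
    "total_steps": 1_300_000,
    "max_episode_steps": 200,
    "epsilon_start": 1.0,
    "epsilon_end": 0.1,
    "epsilon_anneal_steps": 1_000_000,
    "vocab_size": 2000,
    "embedding_dim": 20,
    "hidden_dim": 64,
    "look_every": 20,
    "minibatch_size": 64,
    "train_every": 4,
    "positive_fraction": 0.25,
    "negative_fraction": 0.25,
    "history_length": 15,
    "gamma": 0.9,
    "target_update_steps": 10_000,
    "checkpoint_steps": 25_000,
    "learning_rate": 1e-3,
}
```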

7.1.2 Return Prediction as Critic

Training return prediction requires special handling of the lesson buffer, since the network is trained on entire trajectories. The word embeddings are created equivalently to those for score contextualization with equal dimension sizes. The return prediction LSTM follows the special architecture explained in section 6.2 (figure 13) with a hidden dimension of 32 and is initialized according to a random normal distribution. Two linear network heads are stacked on top for (i) expected return prediction and (ii) κ prediction. Additionally, the action gating mechanism utilized by [Jain et al., 2019] is included in the critic network. In section 7.5.1 an example is shown where positive reward is assigned to a failed attempt to finish the game by issuing the command put lettuce on counter in an early stage of the game. Due to the architectural design of the score contextualization actor network this might cause problems, since different network heads are utilized for different stages of the game: increasing the probability of taking the action put lettuce on counter for, e.g., the first network head does not make sense, as a different network head is used in the later stage of the game. The action gating mechanism is realized by a separate representation generator φR (following figure 12) on top of which an LSTM is stacked (following the architecture in figure 13). For predicting whether an action is admissible, an auxiliary classifier (following figure 12) is stacked on top of the LSTM.

As ground truth, the information from the environment on whether an action elicited a state change is taken, and the binary cross-entropy loss is used. If the predicted probability of an action being admissible is below 0.01, the redistributed reward for this transition is set to 0. This results in more stable learning and avoids the problem of actions being taken by the wrong network heads of the actor.

The LSTM training for return prediction begins after 1 million training steps (the annealing phase of the parameter ε). Before training, the bias of the linear layers is set to the scaled mean return of the episodes in the buffer. Also, for return prediction the ground-truth returns of the episodes are scaled down by the maximum possible score of the game to avoid backpropagation of large gradients. For the reward redistribution of new trajectories the return predictions are rescaled by the same factor to restore the reward to its original range. RUDDER is trained as long as the mean redistribution quality (see equation 35) over the whole buffer is below 0.9, or until a maximum of 50 epochs is reached. During training, rank-based prioritization [Schaul et al., 2015] of trajectories according to the main loss of the return prediction (equation 32) is used. A minibatch size of 32, a continuous prediction factor of 0.1, a κ-window of 5 with N-step return correction, and a κ-scaling of 1e-05 are used. Section 7.4 elaborates in more detail on this hyperparameter setting. The ε parameter for the computation of the redistribution quality is set to ε = 0.6. The parameters ε_b and µ are computed at runtime.

7.1.3 Fine-Tuning Phase

After the return prediction is trained, action selection in the actor network is performed by sampling from a probability distribution over actions, obtained by stacking a normalization and a softmax layer on top of the action scorer φA. Episodes are collected by running multiple agents in parallel in separate environments. After the trajectories are created, RUDDER is used to redistribute the reward across those episodes, and the redistributed reward is stored together with the trajectories and action probabilities in a separate buffer. An exponential moving average of the redistribution quality of newly created episodes is computed; if this average drops below a threshold of 0.9, the return prediction is retrained on the originally created 4000 episodes and all newly created trajectories. For the PPO update, randomly extracted histories of length h = 15 from the collected episodes are used. The current policy of the actor network is evaluated on the sampled minibatch to collect action probabilities and entropy for regularization. The entropy term in the update is scaled down by a factor of 0.0001. This term needs to be chosen carefully, since if it is too large the optimization results in a random policy.
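A minimal sketch of the clipped PPO objective used in this phase is shown below; the clipping parameter is an assumption, the entropy coefficient follows the value stated above, and the advantages are the redistributed rewards supplied by the critic:

```python
import torch

def ppo_loss(new_logp: torch.Tensor, old_logp: torch.Tensor, advantages: torch.Tensor,
             entropy: torch.Tensor, clip_eps: float = 0.2, entropy_coef: float = 1e-4) -> torch.Tensor:
    """Clipped surrogate objective with entropy regularization (sketch)."""
    ratio = torch.exp(new_logp - old_logp)                                # pi_new / pi_old for taken actions
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()                   # maximize the clipped surrogate
    return policy_loss - entropy_coef * entropy.mean()                    # entropy bonus scaled by 0.0001
```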

7.2 Benchmark Games

Seven benchmark games were created by [Jain et al., 2019] with the TextWorld framework. The games vary in difficulty (i.e. quest length, number of objects, number of rooms, map size), but the goal is the same in every game, namely collecting ingredients from different locations and preparing a salad in the kitchen. The games are referred to as SaladWorld levels 1 to 7. As mentioned in section 4, the reward structure of text-based games is mostly linear and reward is received for completing a subquest. Figure 14 shows the possible scores that can be reached for each level. While the first level can be completed within a few issued commands and can even mostly be solved by a random agent, the quest to be completed grows in complexity for higher levels. Figure 15 depicts a graphical representation of the first two levels, and figure 16 shows a graphical representation of the higher levels.


Figure 14: Possible Scores for each level, taken from [Jain et al., 2019]

The numbers in brackets beside objects indicate which objects are present in which levels. The colored outlines represent the size of the map for each level. In addition to the handcrafted TextWorld benchmarks, the agents were evaluated on the interactive fiction game Zork 1. Figure 17 shows a sketch of the map of Zork 1. Compared to the TextWorld benchmarks, Zork 1 is significantly harder to solve and consists of much more complex puzzles and mazes. The maximum score reachable in Zork 1 is 350 and the entire quest consists of 70 subquests.

In addition to the increased quest length, the size of the action space |A| increases drastically from level to level. The maximum score also differs from game to game, with the maximum reachable score being 350 for Zork 1. Table 1 shows the different benchmark games, their quest lengths, the size of their action spaces, and the maximum possible score.

7.3 Score Contextualization

Reproducibility is a common problem in deep reinforcement learning due to the stochasticity of the environment, hyperparameter settings, and the intrinsic variance of algorithms ([Henderson et al., 2017]). A reproducibility study of the Score Contextualization architecture on the aforementioned TextWorld benchmarks and Zork 1 was conducted and compared with the results of [Jain et al., 2019]. The results of five runs were collected for each level and the mean and standard deviation of those runs are depicted.


Figure 15: Graphical representation of levels 1 and 2, taken from [Jain et al., 2019]

Figure 16: Graphical representation of higher levels, taken from [Jain et al., 2019]


Figure 17: Sketch of the map of Zork 1

Game                  Size of action space   Quest Length   Maximum Possible Score
SaladWorld Level 1      7                      7              15
SaladWorld Level 2     14                     16              20
SaladWorld Level 3     14                     18              20
SaladWorld Level 4     50                     27              25
SaladWorld Level 5    140                     39              30
SaladWorld Level 6    282                     54              35
SaladWorld Level 7    294                     55              40
Zork 1                131                    357             350

Table 1: Size of Action Space for different levels of SaladWorld and Zork 1


Figure 18: Results for level 1, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)

7.3.1 SaladWorld Benchmarks

The following figures show the results of the reproducibility study for SaladWorld and compare them with the original results reported by [Jain et al., 2019]. Figure 18 shows the normalized score for the re-implementation and the originally published results for the first level. The re-implementation shows convergence after approximately 100,000 training steps, which is much faster than the originally published results. The variance between runs is fairly small, since even the random behavioral policy at the beginning is capable of solving the first level. Note that for comparison only the red curve in the right figure should be considered, as the other variants of the originally published results were not re-implemented.

For level 2 (figure 19) the performance differs. The re-implementation converges in only 2 out of 5 runs, whereas the results of [Jain et al., 2019] show stable convergence for all runs. The variance of the re-implementation is also significantly larger than that of the original results.

For level 3 (figure 20) the re-implementation performs significantly better than the original results, reaching an average score of 16 as opposed to 12; however, concerning variance the same pattern as for level 2 is present, with the re-implementation showing significantly higher variance. Note that only the Score Contextualization + Action Gating variant of [Jain et al., 2019] was capable of finishing the game. For level 4 (figure 21) the performance of both implementations is similar. In level 5 (figure 22) the re-implementation performs slightly better than the original implementation. For level 6 (figure 23) and level 7 (figure 24) the original implementation performs slightly better.

7.3.2 Zork 1 Benchmark

Figure 25 shows the results for Zork 1. Remarkably, the re-implementation of Score Contextualization failed to make significant progress throughout the game, while the originally proposed results show a high score of 35, which corresponds to a normalized score of 0.1. The highest score reached by the re-implementation is 25, which is comparable to the results reached by Score Contextualization with Action Masking. The originally proposed results for Zork 1 also show much faster convergence behaviour compared to the re-implementation.


Figure 19: Results for level 2, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)

Figure 20: Results for level 3, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)


Figure 21: Results for level 4, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)

Figure 22: Results for level 5, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)


Figure 23: Results for level 6, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)

Figure 24: Results for level 7, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)


Figure 25: Results for Zork 1, right image shows results by [Jain et al., 2019] (red curve: Score contextualization, blue curve: Score Contextualization + Action Gating, gray curve: Baseline, re-implementation of [Yuan et al., 2018]), left image shows reproduced results (results for re-implementation of Score Contextualization)

7.3.3 Discussion

The reproducibility study highlights a main issue in deep reinforcement learning. The results obtained from the re-implementation are comparable for some games, but strongly deviate for others. A reason for this might be hyperparameters that were adjusted differently for different games. Throughout the reproducibility study the entire training procedure as well as the hyperparameter configuration remained the same.

A hyperparameter which turns out to be problematic is the number of network heads for score contextualization. The intuition behind contextualizing the score is to ease the burden of issuing optimal commands throughout the entire game by having different network heads predict actions for different parts of the game. Since the maximum reachable score differs from game to game, this parameter would have to be adjusted. If the number of intermediate scores is larger than the number of network heads, the initial problem is re-introduced; thus, this parameter should be adapted to the game currently being played.

Sentence embeddings for score contextualization are created by a unidirectional LSTM. Introducing bidirectionality has proven to improve performance on various natural language processing tasks ([Peters et al., 2018]). The advent of the Transformer architecture [Vaswani et al., 2017] and BERT [Devlin et al., 2018] further advanced the state of the art for a variety of natural language processing tasks. Such techniques could easily be incorporated into agents for text-based games to learn more abstract word representations.

Although [Jain et al., 2019] shared most of their code, the results of the reproducibility study differ from those reported in their work. For Zork 1 the agent performs significantly worse than the original implementation. Nonetheless, the re-implementation of score contextualization is used as a baseline for comparison with the agent trained according to the training procedure proposed in this work.


Figure 26: Main and Auxiliary Loss for Return Prediction

7.4 Return Prediction: A Feasibility Study

No previous work has been conducted on return prediction for reward redistribution in text-based games. Reward redistribution was successfully performed by [Arjona-Medina et al., 2018] in Atari games; however, the framework has not been tested on other types of games. For the feasibility study only the main loss and the auxiliary loss for continuous prediction were used. The LSTM was trained on a training set of 4000 episodes collected from a checkpoint of the reproducibility study for SaladWorld level 2. Since even the random agent was able to complete level 1, there is no need to train return prediction on episodes collected from level 1. The generalization capabilities were tested by validating the LSTM on 3000 unseen trajectories from a different checkpoint. Episodes of varying length are padded with the pad command. The LSTM is trained for at most 1000 epochs and training is stopped as soon as the quality on the validation set reaches Q^val_RR ≥ 0.95. For the feasibility study a continuous prediction factor s_aux = 1 is assumed. Figure 26 shows the mean main and auxiliary loss over all batches in an epoch for the feasibility study. The learning curves indicate that the LSTM is indeed capable of learning patterns in the text sequences and of predicting the overall return of a trajectory.

7.5 Reward Redistribution

Since predicting the return of an episode proved feasible for text-based environments, hyperparameter tuning was conducted based on the quality of the redistributed reward. For the following experiments each trajectory is padded with one look command to provide the LSTM with information about the starting state.


Figure 27: Return Prediction Metrics for continuous prediction factor of 1

7.5.1 Continuous Prediction Factor

Figure 26 shows that, for a continuous prediction factor of 1, the auxiliary loss is approximately in the same range as the main loss. This has a great influence on the reward redistribution, since the auxiliary loss is supposed to pull the return backwards across the sequence only after a good return prediction has been established. Thus, the auxiliary loss needs to be scaled down. Figures 27, 28, and 29 show the auxiliary loss, the main loss, the redistribution quality, and the absolute prediction error for continuous prediction factors s_aux of 1, 0.5, and 0.1, respectively.

Figure 27 shows that even for s_aux = 1 the redistribution quality converges towards 1. The reason for this is that the quality measure is based on the absolute return prediction error, not on the continuous prediction. Reaching a redistribution quality of 1 therefore does not indicate that the reward redistribution assigns credit to intermediate actions correctly. Thus, although each setting reaches a quality measure of Q^val_RR = 1, the corresponding reward redistributions differ significantly. The auxiliary loss shows the same pattern for every setting on the validation set. This is due to the fact that the mean error of the continuous predictions is taken: for timesteps towards the end of the episode the auxiliary loss is smaller, since the prediction is closer to the overall return. Actions that are favorable should be assigned more credit (i.e. low auxiliary loss) and non-favorable actions should be assigned less credit (i.e. high auxiliary loss). The rising mean auxiliary loss indicates reasonable variety in the continuous predictions. The absolute prediction error for the overall return steadily decreases for both training and validation set, indicating reasonable generalization behaviour. The main loss for the training set diminishes immediately and also steadily decreases for the validation set.


Figure 28: Return Prediction Metrics for continuous prediction factor of 0.5


Figure 29: Return Prediction Metrics for continuous prediction factor of 0.1


Figure 30: Reward Redistributions for two random samples for a continuous prediction factor saux = 1

Figure 31: Reward Redistributions for two random samples for a continuous prediction factor saux = 0.5

Considering the learning curves and the redistribution quality, there is no significant difference between the aforementioned settings. Thus, for hyperparameter tuning the redistributed reward must be considered directly. To inspect the reward redistributions, two random unseen trajectories were sampled and the redistribution was computed. Figures 30, 31, and 32 depict the reward redistributions for continuous prediction factors s_aux = 1, s_aux = 0.5, and s_aux = 0.1, respectively.

A significant difference between the three reward redistributions can be observed. Noticeably, the LSTM consistently assigns credit to intermediate actions that actually led to a reward, without any prior knowledge about intermediate rewards, indicating the capability of the LSTM to learn a reasonable reward redistribution. For the highest continuous prediction factor the reward redistribution is the least informative (i.e. the lowest number of intermediate rewards is assigned). This makes sense, since the importance of the auxiliary loss is then equal to that of the main loss and thus all intermediate predictions are pushed towards the overall return, resulting mostly in zero differences between adjacent return predictions. Hence the auxiliary loss needs to be scaled down, since the priority lies in predicting the overall return correctly. The lower the continuous prediction factor, the more intermediate rewards are assigned.


Figure 32: Reward Redistributions for two random samples for a continuous prediction factor saux = 0.1

Table 2 shows the top 5 actions that were assigned a positive intermediate reward and table 3 shows the top 5 actions that were assigned a negative intermediate reward for s_aux = 0.1. Remarkably, the most positive reward was assigned for navigating to the kitchen (go west); this is probably due to the fact that the agent needs to navigate to the kitchen twice throughout the game in order to complete it. Further, the reward for finishing the game (put lettuce on counter) is preserved. Noticeably, positive reward is assigned to a failed attempt of the agent to complete the game (last action in table 2). The most punished action is a failed attempt of the agent to unlock the blue door with the blue key. This seems reasonable, since the command unlock blue door with blue key should not be issued when it is not possible in the current state. Further, navigating to the living room is punished (go east, go west), which is rather bad since navigational commands enforce exploration and should not be punished. Finally, the last two actions assigned blame are both drop lettuce, which is reasonable since this action should not be taken. For a more intuitive illustration of the reward redistribution, figure 33 depicts the actual quest for level 2 and the credit assignment, and figure 34 shows the redistributed reward. For the actual quest the navigational commands are only written if they lead to a reward (i.e. navigating to the vegetable market via go east yields a reward of 10). The redistributed reward is mostly reasonable; however, navigation commands essential to complete the quest are in danger of being assigned blame. Thus, some regularization is needed to prevent negative credit from being assigned to actions that are essential for fulfilling the quest. Note that figure 33 shows the complete trajectory to finish the game, but figure 34 only shows the reward redistribution of a randomly sampled episode (right plot in figure 32) with a total return of G_0 = 15. Thus, no reward is redistributed for the last subquest (opening the blue door with the blue key, taking the tomato, and putting it on the counter in the kitchen).

For an optimal reward redistribution the summed-up intermediate rewards should equal the G_0 return. The sums over redistributed rewards for continuous prediction factors s_aux = 1 and s_aux = 0.5 strongly deviate from the original G_0 return, indicating that the scaling factor needs to be smaller. For s_aux = 0.1 the sum over redistributed rewards is closest to the G_0 return and the most intermediate credit is assigned; thus, s_aux was set to 0.1.


Action                    Redistributed Reward   Original Reward
go west                   3.9                    0
put lettuce on counter    1.9                    5
go west                   1.85                   0
go west                   1.8                    0
put lettuce on counter    1.78                   0

Table 2: Top 5 positive intermediate rewards assigned to actions by reward redistribution for s_aux = 1 (right plot of figure 30)

Action                           Intermediate Reward   Original Reward
unlock blue door with blue key   -3.3                  0
go east                          -1.8                  0
go west                          -1.2                  0
drop lettuce                     -0.8                  0
drop lettuce                     -0.6                  0

Table 3: Top 5 negative intermediate rewards assigned to actions by reward redistribution for s_aux = 1 (right plot of figure 30)

Figure 33: Original Quest for level 2 and returned rewards

Figure 34: Top 5 positive and negative redistributed rewards for level 2


7.5.2 Kappa Correction

As explained in section 6.2.1, the kappa correction acts as a regularization technique for return prediction. Essentially, κ formulates the residual return which was not redistributed across the episode. The intuition is to let the LSTM predict what the return will be in w_κ steps. If the network is capable of predicting the future return at timestep t + w_κ, this future return might already be redistributed around the current timestep t. This introduces another hyperparameter, namely the window size w_κ of the kappa correction. Further, since the kappa correction should only act as a regularization, the kappa loss needs to be scaled down by the scaling factor s_κ. The kappa correction was incorporated into the return prediction and another hyperparameter study was conducted to find the best values for s_κ and w_κ.

Naturally, a good choice for the parameter w_κ would be the length of an episode divided by the number of subquests completed, so the reward for finishing a subtask can be redistributed effectively; however, some subtasks are completed faster than others (e.g., navigating to the vegetable market requires a few steps, whereas taking the blue key, unlocking a door, grabbing the tomato, and putting it on the counter in the kitchen requires many more steps). Also, the number of subtasks varies from game to game, so this parameter would need to be adapted from game to game. The scaling factor s_κ needs to be chosen such that the kappa correction only acts as regularization and does not interfere with the main training for return prediction.

Different parameter choices for w_κ (w_κ ∈ {10, 20, 30, 40}) and s_κ (s_κ ∈ {0.01, 1e-05, 1e-06}) were evaluated. First, a fixed kappa window w_κ = 10 and a scaling factor s_κ = 0.01 were considered to shift the kappa loss into the appropriate range. Figure 35 shows the learning curves for the main, auxiliary, and kappa loss and the redistribution quality for a scaling factor s_κ = 0.01. The absolute G_0 prediction error is not depicted in this graphic, since the increasing redistribution quality already indicates a decreasing absolute error. It can be seen that the kappa correction loss for the training set decreases, but for the validation set it fluctuates wildly. Since the kappa correction serves only as regularization, the loss is not expected to decrease monotonically. Considering the reward redistributions (figure 36), using this parameter setting leads to detrimental results. This happens because the impact of the kappa correction is too large: if the influence of the kappa correction during training is too heavy, the continuous predictions are pulled towards the same values, which results in zeros after the difference between adjacent predictions is taken. Thus, the scaling factor for the kappa correction should be much smaller to utilize its regularizing effect. Therefore, scaling factors s_κ ∈ {1e-05, 1e-06} were considered.

The parameter combination leading to the most promising reward redistribution was w_κ = 40 and s_κ = 1e-05. Figure 37 illustrates the reward redistribution of a randomly sampled unseen episode and figure 38 depicts the redistributed reward in comparison with the original reward. The G_0 return of this particular episode is G_0 = 10. In figure 37 the top 5 positively and negatively rewarded actions and the corresponding rewards are depicted. Dropping ingredients (drop tomato, drop lettuce) generally leads to negative reward no matter in which room, as does trying to navigate in a direction which is not available (going east at the starting point). Navigating towards the vegetable market and taking the lettuce yield rewards. The most rewarding command is open blue door; however, this action was issued in the wrong room, indicating some confusion of the network. Generally, opening the blue door is a necessary command to complete the game, thus assigning a positive immediate reward for issuing this action in the wrong room should not worsen the performance. This problem is alleviated by the action gating mechanism in the critic network, which masks out rewards for actions that were not admissible in the current state.


Figure 35: Metrics for scaling factor sκ = 0.01 and a kappa window of wκ = 10

Figure 36: Reward Redistributions for two random samples for wκ = 10 and sκ = 0.01


Figure 37: Reward Redistribution for a randomly sampled trajectory with wκ = 40 and sκ = 1e−05

Figure 38: Reward Redistribution for wκ = 40 and sκ = 1e−05


Figure 39: Reward Redistributions for a random sample for wκ = 1 (left) and wκ = 3 (right) with N-Step Return Correction

Figure 40: Reward Redistributions for a random sample for wκ = 5 (left) and wκ = 10 (right) with N-Step Return Correction

The effect of the N-Step Kappa Correction can be seen in figures 39 and 40. The return correction leads to a smoother redistribution of the reward; however, the larger wκ, the more the positively or negatively redistributed reward at a single point affects neighboring rewards, and thus reward is assigned to more actions in the trajectory. If N-Step Kappa Correction is utilized, the hyperparameter wκ must be chosen carefully. Again, an action gating mechanism alleviates the problem of assigning credit to inadmissible actions in an episode.
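To give an intuition for this smoothing effect only, the following toy sketch spreads each redistributed reward uniformly over its 2·wκ neighbors with a moving-average filter. This is merely an illustration of how a larger window distributes reward over more actions; it is not necessarily the exact N-Step Kappa Correction used in this work.

```python
import torch
import torch.nn.functional as F

def smooth_redistribution(redistributed: torch.Tensor,
                          w_kappa: int) -> torch.Tensor:
    """Illustrative moving-average smoothing of a redistributed
    reward sequence of shape (T,)."""
    kernel_size = 2 * w_kappa + 1
    # Uniform kernel: each step shares its reward with its neighbors.
    kernel = torch.full((1, 1, kernel_size), 1.0 / kernel_size)
    smoothed = F.conv1d(redistributed.view(1, 1, -1), kernel,
                        padding=w_kappa)
    return smoothed.view(-1)
```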

7.6 PPO Fine-tuning

Since return prediction and reward redistribution yield reasonable results, they can be incorporated into the training phase as a critic. The score contextualization network is trained via Q-learning for the first 1 million training steps, until the annealing phase of ε is finished. Afterwards, RUDDER is trained for reward redistribution and used as a critic, applying N-Step Kappa Correction to obtain the redistributed reward of a sequence. From that point on, the PPO update rule is used for policy gradient updates.
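For reference, the clipped PPO surrogate with the redistributed reward plugged in as the advantage can be sketched as follows; function and argument names are illustrative, and the entropy and value terms that a full implementation would add are omitted.

```python
import torch

def ppo_surrogate_loss(new_log_probs: torch.Tensor,
                       old_log_probs: torch.Tensor,
                       redistributed_reward: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO objective (to be minimized) over one batch of steps.

    The redistributed reward produced by the RUDDER critic is used in
    place of the usual advantage estimate.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    advantage = redistributed_reward
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the surrogate, so we minimize its negative mean.
    return -torch.min(unclipped, clipped).mean()
```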


Figure 41: Metrics for single run of PPO fine-tuning phase

7.6.1 SaladWorld Benchmarks

As in section 7.3.1, the first level has been omitted for fine-tuning, since even a random agent is capable of finishing SaladWorld level 1. Figure 41 shows the normalized score, the exponential moving average (EMA) of the online redistribution, the mean entropy of the action probability distributions, and the PPO loss. The increase of the exponential moving average towards one indicates that RUDDER efficiently learns to redistribute reward and to generalize to unseen trajectories. The normalized score shows that the agent indeed improves during the fine-tuning phase, although not consistently. The peak in the PPO loss might be due to a suboptimal reward redistribution being used for the update; however, the agent manages to recover from it.

Figure 42 shows the loss metrics for RUDDER during the PPO fine-tuning phase. In total, RUDDER is trained for approximately 800 epochs during the fine-tuning phase. The second LSTM head efficiently learns to gate actions based on their admissibility, while the other LSTM head learns return prediction and kappa correction with two separate output heads stacked on top.
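A schematic of such a critic is sketched below as a single LSTM with three output heads (return prediction, kappa correction, and admissibility gating). This single-LSTM wiring, the class name, and the layer sizes are simplifying assumptions for illustration; in this work the gating head and the return-prediction heads sit on separate LSTM network heads.

```python
import torch
import torch.nn as nn

class RudderCriticSketch(nn.Module):
    """Illustrative critic: an LSTM over the episode with heads for
    continuous return prediction, kappa correction, and action gating."""

    def __init__(self, input_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.return_head = nn.Linear(hidden_dim, 1)  # g_hat(t)
        self.kappa_head = nn.Linear(hidden_dim, 1)   # kappa correction
        self.gate_head = nn.Linear(hidden_dim, 1)    # admissibility gate

    def forward(self, episode: torch.Tensor):
        # episode: (batch, T, input_dim) encodings of observation/action pairs
        h, _ = self.lstm(episode)
        g_hat = self.return_head(h).squeeze(-1)
        kappa = self.kappa_head(h).squeeze(-1)
        gate = torch.sigmoid(self.gate_head(h)).squeeze(-1)
        return g_hat, kappa, gate
```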

Figure 43 shows the results for the original Score Contextualization training using Q-learning and the results for the PPO fine-tuning phase. The results are comparable; however, looking at single runs, a pattern can be observed. Figure 44 shows two single runs out of the 5 averaged runs in figure 43. Remarkably, policy gradient fine-tuning performs poorly for runs where Q-learning converges rather fast. Conversely, for runs where Q-learning stagnated, PPO fine-tuning managed to recover and finish the game when Q-learning did not. This pattern repeatedly occurs throughout the first 5 levels of SaladWorld. For the lower levels of SaladWorld the delay in the rewards is rather short, indicating that the effect of RUDDER as critic does not make a significant difference there.

Figure 45 shows the results for SaladWorld level 3 and figure 46 depicts single runs where PPO fine-tuning outperformed Q-learning and vice versa. Remarkably, no Q-learning run was able to complete the game, whereas one run was able to finish the game after the fine-tuning phase.


Figure 42: RUDDER training metrics for single run of PPO fine-tuning phase

Figure 43: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 2, mean and standard deviation over 5 runs depicted


Figure 44: Single runs for level 2, left image shows superior performance of Q-learning agent, right image shows improvement after PPO fine-tuning

Figure 47 shows the performance of both methods on SaladWorld level 4, and figure 48 depicts one run where the fine-tuning phase collapses (although Q-learning collapses as well after some more training steps) and one run where Q-learning collapses and the fine-tuning phase is able to recover.

Figure 49 shows the performance of both agents on SaladWorld level 5. Figure 50 depicts the same pattern that occurred for the previous levels.

Figure 51 shows the performance of both agents on SaladWorld level 6, averaged over 5 runs. Remarkably, PPO fine-tuning consistently outperforms Q-learning during the fine-tuning phase. The Q-learning agent collapses shortly before the fine-tuning phase and does not recover. This might be due to the fact that for higher levels of SaladWorld the delay in the reward is larger, since the sub-quests are longer. Figure 52 shows two single runs where PPO consistently outperforms Q-learning during the fine-tuning phase.

Figure 53 shows the results on SaladWorld level 7. Again, during the fine-tuning phase the policy gradient agent consistently outperforms the Q-learning agent. Remarkably, the highest score reached throughout the entire training run was achieved by the PPO agent during the fine-tuning phase. Figure 54 shows two runs where the policy gradient agent outperforms the Q-learning agent during the fine-tuning phase.


Figure 45: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 3, mean and standard deviation over 5 runs depicted

Figure 46: Single runs for level 3, left image shows superior performance of Q-learning agent, right image shows improvement after PPO fine-tuning


Figure 47: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 4, mean and standard deviation over 5 runs depicted

Figure 48: Single runs for level 4, left image shows better performance of Q-learning agent up to some point, right image shows improvement after PPO fine-tuning


Figure 49: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 5, mean and standard deviation over 5 runs depicted

Figure 50: Single runs for level 5, left image shows collapse of PPO fine-tuning, right image shows superior performance after PPO fine-tuning


Figure 51: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 6, mean and standard deviation over 5 runs depicted

Figure 52: Single runs for level 6, both images show superior performance of PPO during fine-tuning phase


Figure 53: Normalized Score of Score Contextualization (red) and Score Contextualization + PPO fine-tuning (blue) for level 7, mean and standard deviation over 5 runs depicted

Figure 54: Single runs for level 7, both images show superior performance of PPO during fine-tuning phase


8 Conclusion

Learning to solve text-based games via reinforcement learning remains cumbersome, due to the non-Markovian environment and the additional challenges posed by interacting with the environment via natural language. The agent needs to comprehend observations and learn common sense reasoning and affordances in order to issue valid commands.

In this work a reproducibility study was conducted to point out that reproducibility remains a major issue in the realm of reinforcement learning, by reproducing results of a previously state-of-the-art agent. Since text-based games follow a particular reward structure and suffer from the explaining away problem, a feasibility study was conducted to determine whether RUDDER [Arjona-Medina et al., 2018] can remedy long-term credit assignment in text-based environments. Different hyperparameter combinations were tested to obtain a reasonable reward redistribution, providing a foundation for incorporating return decomposition into agents for text-based games.

A novel training procedure was introduced, consisting of a Q-learning pre-training phase to collect sufficiently many trajectories for training RUDDER, and a fine-tuning phase using the PPO update rule with the redistributed rewards as advantage. On average the agent shows comparable performance on several TextWorld benchmarks created by [Jain et al., 2019]. The fine-tuning phase proved to be especially helpful for games with long delayed rewards and showed promising improvement within a few training steps in cases where Q-learning stagnated or failed to progress at all.

A longer fine-tuning phase or training with a policy gradient update rule from scratch might improve the overall performance on higher levels of the TextWorld benchmarks. Section 9 highlights further possible improvements which might result in better performance.

9 Future Work

A few aspects in which reward redistribution in text-based games could be improved are:

1. Incorporate Intermediate Rewards: Since intermediate rewards are returned from the environment after completing a subquest, this reward signal can be incorporated into the continuous prediction. Instead of taking the MSE between the prediction and the G0 return, at each timestep the loss would be the MSE between the prediction and the next reached intermediate score (e.g. if the first reward returned from the environment is +10 when reaching the vegetable market, all timesteps prior to entering the market would use +10 as ground truth instead of G0); a sketch of this target construction follows after this list.

2. Adaptive Kappa Window: As mentioned in the previous section, the kappa window wκ was set to a fixed value; however, it could be more appropriate to adapt this parameter to the average number of steps needed to complete each subtask (e.g. using a window of 15 if the first subtask requires 15 steps on average). This could be extended to different windows for different subtasks.

3. Increase Complexity: In this work 32 hidden units were used in the LSTM for return prediction. More abstract games with a larger number of subtasks may require more LSTM cells to learn more patterns. Especially on higher levels of SaladWorld this could lead to improvements.
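The following sketch illustrates the target construction proposed in the first item, under the assumption that each timestep is trained towards the cumulative game score reached by completing the next subquest (and towards G0 after the last one); the function name and the choice of cumulative score as target are assumptions for illustration.

```python
import torch

def intermediate_targets(rewards: torch.Tensor, g0: float) -> torch.Tensor:
    """Per-timestep targets for continuous return prediction.

    rewards: immediate rewards from the environment, shape (T,);
             non-zero entries mark completed subquests.
    g0:      final return of the episode, used after the last subquest.
    """
    score = torch.cumsum(rewards, dim=0)  # cumulative game score
    targets = torch.empty_like(rewards)
    next_target = g0
    # Walk backwards so every step sees the next upcoming intermediate score.
    for t in range(rewards.shape[0] - 1, -1, -1):
        if rewards[t] != 0:
            next_target = score[t].item()
        targets[t] = next_target
    return targets
```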


Overall, the performances of policy gradient fine-tuning and Q-learning are comparable. For lower levels there is no long delay in the intermediate rewards, thus Q-learning is efficient in propagating the reward information back to prior actions. RUDDER was specifically designed for long-term credit assignment, and in this setting the resulting reward redistribution is not optimal. Adjusting the window of the N-Step Kappa Correction of the redistributed reward might remedy this problem, since the local redistribution at some point deviates from the optimal redistribution. Different values for the kappa window might be considered for different levels of SaladWorld. The longer the delay in the reward and the bigger the action space, the more Q-learning fails to adequately propagate back the reward signal; however, policy gradient fine-tuning can help, as can be seen in the results for SaladWorld levels 6 and 7. Considering the growing action spaces of higher levels, an efficient exploration strategy is needed (e.g. as proposed by [Yuan et al., 2018]).

In this work the Score Contextualization architecture from [Jain et al., 2019] is used for the actor network during fine-tuning. The main purpose of this architecture is to relieve a single LSTM network of the burden of completing an entire episode: different LSTM network heads are used for different parts of the trajectory corresponding to different intermediate scores. While this is perfectly suited for Q-learning, it might even hinder learning for policy gradient updates. Since the immediate reward is available after reward redistribution and N-Step Kappa Correction, a single LSTM network could be used to navigate through the environment. Training one LSTM network should stabilize training compared to training multiple LSTM network heads with the PPO update rule.

Originally, the PPO update rule proposed by [Schulman et al., 2017] uses an advantage estimate in its surrogate objective. In this work the redistributed rewards after N-Step Kappa Correction are used instead; however, this could easily be extended to the advantage of the redistributed rewards by learning a value function for states with a separate network head, taking the redistribution into account. The advantage of the redistributed reward could then be used, which might stabilize the training procedure.
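One possible form of this extension is sketched below: a learned state-value baseline trained on the redistributed rewards is subtracted from the redistributed reward to obtain an advantage. The function, the simple baseline-subtraction form, and the advantage normalization are assumptions for illustration, not the method used in this work.

```python
import torch

def redistributed_advantage(redistributed: torch.Tensor,
                            value_estimates: torch.Tensor) -> torch.Tensor:
    """Advantage of the redistributed reward w.r.t. a learned baseline.

    redistributed:   redistributed rewards, shape (T,)
    value_estimates: outputs of a value head trained to predict the
                     expected redistributed reward per state, shape (T,)
    """
    advantage = redistributed - value_estimates
    # Normalizing the advantage is a common stabilization heuristic.
    return (advantage - advantage.mean()) / (advantage.std() + 1e-8)
```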

Another interesting aspect is training from scratch with the PPO update rule; however, for this approach RUDDER needs to be pre-trained on a sufficiently large collection of episodes in order to enable a reasonable reward redistribution. Having access to intermediate rewards from the beginning of the game could substantially improve the performance of agents trained on text-based games. Due to time constraints, no experiments beyond those shown in this work could be conducted.


References

Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and William L. Hamilton. Learning dynamic knowledge graphs to generalize on text-based games. CoRR, abs/2002.09127, 2020. URL https://arxiv.org/abs/2002.09127.

Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, and Sepp Hochreiter. RUDDER: return decomposition for delayed rewards. CoRR, abs/1806.07857, 2018. URL http://arxiv.org/abs/1806.07857.

Timothy Atkinson, Hendrik Baier, Tara Copplestone, Sam Devlin, and Jerry Swan. The text-based adventure AI competition. CoRR, abs/1808.01262, 2018. URL http://arxiv.org/abs/1808.01262.

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015. doi: 10.1371/journal.pone.0130140. URL http://dx.doi.org/10.1371%2Fjournal.pone.0130140.

Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Dale Schuurmans and Michael P. Wellman, editors, AAAI, pages 1476–1483. AAAI Press, 2016. URL http://dblp.uni-trier.de/db/conf/aaai/aaai2016.html#BellemareOGTM16.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. CoRR, abs/1707.06887, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1707.html#BellemareDM17.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A learning environment for text-based games. CoRR, abs/1806.11532, 2018. URL http://arxiv.org/abs/1806.11532.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. CoRR, abs/1706.10295, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1706.html#FortunatoAPMOGM17.

Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. What can you do with a rock? Affordance extraction via word embeddings. CoRR, abs/1703.03429, 2017. URL http://arxiv.org/abs/1703.03429.

Matan Haroush, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. Learning how not to act in text-based games, 2018. URL https://openreview.net/forum?id=B1-tVX1Pz.


Matthew J. Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. CoRR, abs/1507.06527, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1507.html#HausknechtS15.

Matthew J. Hausknecht, Ricky Loynd, Greg Yang, Adith Swaminathan, and Jason D. Williams. NAIL: A general interactive fiction agent. CoRR, abs/1902.04259, 2019. URL http://arxiv.org/abs/1902.04259.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with an unbounded action space. CoRR, abs/1511.04636, 2015. URL http://arxiv.org/abs/1511.04636.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. CoRR, abs/1709.06560, 2017. URL http://arxiv.org/abs/1709.06560.

Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2):107–116, 1998. URL http://dblp.uni-trier.de/db/journals/ijufks/ijufks6.html#Hochreiter98.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Anna Huang. Similarity measures for text document clustering. Proceedings of the 6th New Zealand Computer Science Research Student Conference, 01 2008.

John D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, May 2007. ISSN 1521-9615. doi: 10.1109/mcse.2007.55. URL http://dx.doi.org/10.1109/mcse.2007.55.

Vishal Jain, William Fedus, Hugo Larochelle, Doina Precup, and Marc G. Bellemare. Algorithmic improvements for deep reinforcement learning applied to interactive fiction. CoRR, abs/1911.12511, 2019. URL http://dblp.uni-trier.de/db/journals/corr/corr1911.html#abs-1911-12511.

Leslie P. Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Claude Sammut and Achim G. Hoffmann, editors, ICML, pages 267–274. Morgan Kaufmann, 2002. ISBN 1-55860-873-7. URL http://dblp.uni-trier.de/db/conf/icml/icml2002.html#KakadeL02.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015.

Bartosz Kostka, Jaroslaw Kwiecien, Jakub Kowalski, and Pawel Rychlikowski. Text-based adventures of the Golovin AI agent. CoRR, abs/1705.05637, 2017. URL http://arxiv.org/abs/1705.05637.


Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value-iteration. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 9971–9981, 2018. URL http://dblp.uni-trier.de/db/conf/nips/nips2018.html#LuSB18.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013. URL http://arxiv.org/abs/1310.4546.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. URL https://arxiv.org/pdf/1312.5602.pdf.

Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. CoRR, abs/1506.08941, 2015. URL http://arxiv.org/abs/1506.08941.

Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019. URL http://arxiv.org/abs/1912.01703. Published at NeurIPS 2019.

Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008. URL http://dblp.uni-trier.de/db/journals/nn/nn21.html#PetersS08.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations, 2018. URL http://arxiv.org/abs/1802.05365. Published at NAACL 2018.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay, 2015. URL http://arxiv.org/abs/1511.05952. Published at ICLR 2016.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1707.html#SchulmanWDRK17.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1704.html#ShrikumarGK17.

Satinder P. Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In NIPS, pages 1281–1288, 2004. URL http://dblp.uni-trier.de/db/conf/nips/nips2004.html#SinghBC04.

B. F. Skinner. Reinforcement today. In American Psychologist, pages 94–99, 1958.


Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. CoRR, abs/1703.01365, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1703.html#SundararajanTY17.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, March 1998. ISBN 0262193981. URL http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0262193981.

Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan, and Layla El Asri. Towards solving text-based games by producing adaptive action spaces. CoRR, abs/1812.00855, 2018. URL http://dblp.uni-trier.de/db/journals/corr/corr1812.html#abs-1812-00855.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. CoRR, abs/1509.06461, 2015. URL http://arxiv.org/abs/1509.06461.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1511.html#WangFL15.

Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. Principled methods for advising reinforcement learning agents. In Tom Fawcett and Nina Mishra, editors, ICML, pages 792–799. AAAI Press, 2003. ISBN 1-57735-189-4. URL http://dblp.uni-trier.de/db/conf/icml/icml2003.html#WiewioraCE03.

Xingdi Yuan, Marc-Alexandre Côté, Alessandro Sordoni, Romain Laroche, Remi Tachet des Combes, Matthew J. Hausknecht, and Adam Trischler. Counting to explore and generalize in text-based games. CoRR, abs/1806.11525, 2018. URL http://arxiv.org/abs/1806.11525.

Mikuláš Zelinka. Baselines for reinforcement learning in text games. CoRR, abs/1811.02872, 2018. URL http://arxiv.org/abs/1811.02872.

Mikuláš Zelinka, Xingdi Yuan, Marc-Alexandre Côté, Romain Laroche, and Adam Trischler. Building dynamic knowledge graphs from text-based games. CoRR, abs/1910.09532, 2019. URL http://dblp.uni-trier.de/db/journals/corr/corr1910.html#abs-1910-09532.


Statutory Declaration

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references. This printed thesis is identical with the electronic version submitted.

Linz, September 2020 Fabian Paischer

Eidesstattliche Erklärung

Ich erkläre an Eides statt, dass ich die vorliegende Masterarbeit selbstständig und ohne fremde Hilfe verfasst, andere als die angegebenen Quellen und Hilfsmittel nicht benutzt bzw. die wörtlich oder sinngemäß entnommenen Stellen als solche kenntlich gemacht habe. Die vorliegende Masterarbeit ist mit dem elektronisch übermittelten Textdokument identisch.

Linz, September 2020 Fabian Paischer