Improving Model-Based Reinforcement Learning with Internal State Representations Through Self-Supervision
Julien Scholz, Cornelius Weber, Muhammad Burhan Hafez and Stefan Wermter
Knowledge Technology, Department of Informatics, University of Hamburg, Germany
[email protected], {weber, hafez, [email protected]}
© 2021 IEEE. Personal use of this material is permitted.

Abstract—Using a model of the environment, reinforcement learning agents can plan their future moves and achieve superhuman performance in board games like Chess, Shogi, and Go, while remaining relatively sample-efficient. As demonstrated by the MuZero Algorithm, the environment model can even be learned dynamically, generalizing the agent to many more tasks while at the same time achieving state-of-the-art performance. Notably, MuZero uses internal state representations derived from real environment states for its predictions. In this paper, we bind the model's predicted internal state representation to the environment state via two additional terms: a reconstruction model loss and a simpler consistency loss, both of which work independently and unsupervised, acting as constraints to stabilize the learning process. Our experiments show that this new integration of reconstruction model loss and simpler consistency loss provides a significant performance increase in OpenAI Gym environments. Our modifications also enable self-supervised pretraining for MuZero, so the algorithm can learn about environment dynamics before a goal is made available.

I. INTRODUCTION

Reinforcement learning algorithms are classified into two groups. Model-free algorithms can be viewed as behaving instinctively, while model-based algorithms can plan their next moves ahead of time. The latter are arguably more human-like and have the added advantage of being comparatively sample-efficient.

The MuZero Algorithm [1] outperforms its competitors in a variety of tasks while simultaneously being very sample-efficient. Unlike other recent model-based reinforcement learning approaches that perform gradient-based planning [2]–[6], MuZero does not assume a continuous and differentiable action space. However, we realized that MuZero performed relatively poorly in our experiments, in which we trained a robot to grasp objects using a low-dimensional input space, and that it was comparatively slow to operate. Contrary to our expectations, older, model-free algorithms such as A3C [7] delivered significantly better results while being more lightweight and comparatively easy to implement. Even in very basic tasks, our custom implementation as well as one provided by other researchers, muzero-general [8], produced unsatisfactory outcomes.

Since these results did not match those that MuZero achieved in complex board games [1], we are led to believe that MuZero may require extensive training time and expensive hardware to reach its full potential. We question some of the design decisions made for MuZero, namely, how unconstrained its learning process is. After explaining all necessary fundamentals, we propose two individual augmentations to the algorithm that work fully unsupervised in an attempt to improve its performance and make it more reliable. Afterward, we evaluate our proposals by benchmarking the regular MuZero algorithm against the newly augmented version.

II. BACKGROUND

A. MuZero Algorithm

The MuZero algorithm [1] is a model-based reinforcement learning algorithm that builds upon the success of its predecessor, AlphaZero [9], which itself is a generalization of the successful AlphaGo [10] and AlphaGo Zero [11] algorithms, to a broader range of challenging board and Atari games. Similar to other model-based algorithms, MuZero can predict the behavior of its environment to plan and choose the most promising action at each timestep to achieve its goal. In contrast to AlphaZero, it does this using a learned model. As such, it can be applied to environments for which the rules are not known ahead of time. MuZero uses an internal (or embedded) state representation that is deduced from the environment observation but is not required to have any semantic meaning beyond containing sufficient information to predict rewards and values. Accordingly, it may be infeasible to reconstruct observations from internal states.
This gives MuZero an advantage over traditional model-based algorithms that need to be able to predict future observations (e.g. noisy visual input).

There are three distinct functions (e.g. neural networks) that are used harmoniously for planning. Namely, there is a representation function hθ, a prediction function fθ, as well as a dynamics function gθ, each being parameterized using parameters θ to allow for adjustment through a training process. The complete model is called µθ. We will now explain each of the three functions in more detail; a minimal interface sketch follows Fig. 1 below.

• The representation function hθ maps real observations to internal state representations. For a sequence of recorded observations o_1, ..., o_t at timestep t, an embedded representation s^0 = hθ(o_1, ..., o_t) may be produced. As previously mentioned, s^0 has no semantic meaning, and typically contains significantly less information than o_1, ..., o_t. Thus, the function hθ is tasked with eliminating unnecessary details from observations, for example by extracting object coordinates and other attributes from images.

• The dynamics function gθ tries to mimic the environment by advancing an internal state s^{k−1} at a hypothetical timestep k−1 based on a chosen action a^k to predict r^k, s^k = gθ(s^{k−1}, a^k), where r^k is an estimate of the real reward u_{t+k} and s^k is the internal state at timestep k. This function can be applied recursively, and acts as the simulating part of the algorithm, estimating what may happen when taking a sequence of actions a^1, ..., a^k in a state s^0.

• The prediction function fθ can be compared to an actor-critic architecture having both a policy and a value output. For any internal state s^k, there shall be a mapping p^k, v^k = fθ(s^k), where policy p^k represents the probabilities with which the agent would perform actions and value v^k is the expected return in state s^k. Whereas the value is very useful to bootstrap future rewards after the final step of planning by the dynamics function, the policy of the prediction function may be counterintuitive, keeping in mind that the agent should derive its policy from the considered plans. We will explore the uses of p^k further on.

[Fig. 1: Testing a single action sequence made up of three actions a^1, a^2, and a^3 to plan based on observation o_t using all three MuZero functions. Figure omitted: hθ maps past observations to s^0, gθ unrolls s^0 through s^3 while emitting r^1, r^2, r^3, and fθ yields p^3, v^3.]
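To make the division of µθ into hθ, gθ, and fθ more tangible, the following sketch outlines the interface of the three functions. It is only an illustration: the class name MuZeroModel, the layer sizes, and the plain numpy linear maps are our own simplifying assumptions and stand in for the trained deep networks that MuZero actually uses.

# Minimal sketch of the three MuZero functions bundled as one model.
# The linear maps and their sizes are illustrative assumptions, not the
# trained neural networks used by MuZero. obs_dim is the length of the
# stacked observation vector o_1..o_t supplied by the caller.
import numpy as np

class MuZeroModel:
    def __init__(self, obs_dim, state_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.num_actions = num_actions
        # theta: all adjustable parameters (here: three random matrices)
        self.W_h = rng.normal(size=(state_dim, obs_dim)) * 0.1
        self.W_g = rng.normal(size=(state_dim + 1, state_dim + num_actions)) * 0.1
        self.W_f = rng.normal(size=(num_actions + 1, state_dim)) * 0.1

    def representation(self, obs_history):
        """h_theta: stacked observations o_1..o_t -> embedded state s^0."""
        return np.tanh(self.W_h @ obs_history)

    def dynamics(self, state, action):
        """g_theta: (s^{k-1}, a^k) -> (r^k, s^k)."""
        one_hot = np.eye(self.num_actions)[action]
        out = self.W_g @ np.concatenate([state, one_hot])
        reward, next_state = out[0], np.tanh(out[1:])
        return reward, next_state

    def prediction(self, state):
        """f_theta: s^k -> (policy p^k, value v^k)."""
        out = self.W_f @ state
        logits, value = out[:self.num_actions], out[self.num_actions]
        policy = np.exp(logits - logits.max())
        return policy / policy.sum(), value

In practice, all three mappings are deep networks whose parameters θ are adjusted jointly through the training loss introduced below.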
Given parameters θ, we can now decide on an action policy π for each observation o_t, which we will call the search policy; it uses hθ, gθ, and fθ to search through different action sequences and find an optimal plan. As an example, in a small action space, we can iterate through all action sequences a^1, ..., a^n of a fixed length n and apply each function in the order visualized in Fig. 1. By discounting the reward and value outputs with a discount factor γ ∈ [0, 1], we receive an estimate for the return when first performing the action sequence and subsequently following π:

E_π[u_{t+1} + γ u_{t+2} + ... | o_t, a^1, ..., a^n] = Σ_{k=1}^{n} γ^{k−1} r^k + γ^n v^n    (1)

Given the goal of an agent to maximize the return, we may now simply choose the first action of the action sequence with the highest return estimate. Alternatively, a less promising action can be selected to encourage the exploration of new behavior. At timestep t, we call the return estimate for the chosen action produced by our search ν_t. An illustrative sketch of this exhaustive planning procedure is given at the end of this subsection, followed by a sketch of the training targets described below.

Training occurs on trajectories previously observed by the MuZero agent. For each timestep t in the trajectory, we unroll the dynamics function K steps through time and adjust θ such that the predictions further match what was already observed. With a loss l^r, each reward output r_t^k is trained toward the observed reward u_{t+k}, and with a loss l^p, each policy output p_t^k is trained toward the recorded search policy π_{t+k}, making p^k an estimate (that is faster to compute) of what our search policy π might be, meaning it can be used as a heuristic. For our value estimates ν, we first calculate n-step bootstrapped returns z_t = u_{t+1} + γ u_{t+2} + ... + γ^{n−1} u_{t+n} + γ^n ν_{t+n}. With another loss l^v, the target z_{t+k} is set for each value output v_t^k. The three previously mentioned output targets as well as an additional L2 regularization term c||θ||^2 [12] form the complete loss function:

l_t(θ) = Σ_{k=0}^{K} [ l^r(u_{t+k}, r_t^k) + l^v(z_{t+k}, v_t^k) + l^p(π_{t+k}, p_t^k) ] + c||θ||^2    (2)

The aforementioned brute-force search policy is highly unoptimized. For one, we have not described a method of caching and reusing internal states and prediction function outputs so as to reduce the computational footprint. Furthermore, iterating through all possible actions is very slow for a high number of actions or longer action sequences and impossible for continuous action spaces. Ideally, we would focus our planning efforts on only those actions that are promising from the beginning and spend less time incorporating clearly unfavorable behavior. AlphaZero and MuZero employ a variation of Monte-Carlo tree search (MCTS) [13] (see Section II-B).

MuZero matches the state-of-the-art results of AlphaZero in the board games Chess and Shogi, despite not having access to a perfect environment model. It even exceeded AlphaZero's unprecedented rating in the board game Go, while using less computation per node, suggesting that it caches useful information in its internal states to gain a deeper understanding of the environment with each application of the dynamics function. Furthermore, MuZero outperforms humans and previous state-of-the-art agents at the Atari game benchmark, demonstrating its ability to solve tasks with long time horizons that are not turn-based.
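To make the exhaustive search policy behind Eq. (1) concrete, the following sketch enumerates every action sequence of length n, rolls each one out with the learned model, and scores it by Σ_{k=1}^{n} γ^{k−1} r^k + γ^n v^n. The function name plan_brute_force is our own, and the sketch assumes the hypothetical MuZeroModel interface outlined earlier; as stated above, MuZero itself replaces this brute-force enumeration with MCTS.

# Brute-force planning sketch for Eq. (1): score every action sequence of
# length n and return the first action of the best-scoring sequence.
# Assumes the hypothetical MuZeroModel interface sketched in Section II-A.
import itertools

def plan_brute_force(model, obs_history, n, gamma):
    s0 = model.representation(obs_history)              # s^0 = h_theta(o_1..o_t)
    best_action, best_estimate = None, float("-inf")
    for actions in itertools.product(range(model.num_actions), repeat=n):
        state, estimate = s0, 0.0
        for k, action in enumerate(actions, start=1):
            reward, state = model.dynamics(state, action)   # r^k, s^k
            estimate += gamma ** (k - 1) * reward            # discounted rewards
        _, value = model.prediction(state)                   # v^n bootstraps the tail
        estimate += gamma ** n * value
        if estimate > best_estimate:
            best_action, best_estimate = actions[0], estimate
    return best_action, best_estimate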
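The training targets and the unrolled loss of Eq. (2) can be sketched in the same spirit for a single trajectory position t. The concrete loss functions (squared error for l^r and l^v, cross-entropy for l^p), the regularization constant c, and the indexing of actions and rewards are illustrative assumptions; the sketch only shows how z_{t+k} and the three loss terms fit together and omits gradient computation and optimization.

# Sketch of the n-step value targets and the unrolled loss of Eq. (2) for one
# trajectory position t. Loss choices and indexing are illustrative assumptions;
# the trajectory arrays are assumed to extend at least n + K steps beyond t.
import numpy as np

def n_step_return(rewards, search_values, t, n, gamma):
    """z_t = u_{t+1} + gamma u_{t+2} + ... + gamma^(n-1) u_{t+n} + gamma^n nu_{t+n}."""
    z = sum(gamma ** i * rewards[t + 1 + i] for i in range(n))
    return z + gamma ** n * search_values[t + n]

def unrolled_loss(model, obs_history, actions, rewards, search_policies,
                  search_values, t, K, n, gamma, c=1e-4):
    state = model.representation(obs_history)                # s^0 from o_1..o_t
    loss = 0.0
    for k in range(K + 1):
        policy, value = model.prediction(state)              # p_t^k, v_t^k
        z = n_step_return(rewards, search_values, t + k, n, gamma)
        loss += (z - value) ** 2                                              # l^v
        loss += -np.sum(search_policies[t + k] * np.log(policy + 1e-12))      # l^p
        if k < K:                                            # unroll the dynamics
            reward, state = model.dynamics(state, actions[t + k + 1])
            loss += (rewards[t + k + 1] - reward) ** 2       # l^r (no reward predicted at k = 0)
    # L2 regularization c * ||theta||^2 over all parameters
    loss += c * sum(np.sum(W ** 2) for W in (model.W_h, model.W_g, model.W_f))
    return loss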