NPFL122, Lecture 11

PopArt Normalization, R2D2, MuZero

Milan Straka

December 14, 2020

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

PopArt Normalization

An improvement of IMPALA from Sep 2018, which performs normalization of task rewards instead of just reward clipping. PopArt stands for Preserving Outputs Precisely, while Adaptively Rescaling Targets. Assume the value estimate v(s; θ, σ, μ) is computed using a normalized value predictor n(s; θ)

$$v(s; \theta, \sigma, \mu) \stackrel{\textrm{def}}{=} \sigma\, n(s; \theta) + \mu$$

and further assume that n(s; θ) is an output of a linear function

$$n(s; \theta) \stackrel{\textrm{def}}{=} \omega^T f(s; \theta - \{\omega, b\}) + b.$$

We can update σ and μ using an exponential moving average with decay rate β (in the paper, the first moment μ and the second moment υ are tracked, and the standard deviation is computed as $\sigma = \sqrt{\upsilon - \mu^2}$; a decay rate of $\beta = 3 \cdot 10^{-4}$ is employed).


Utilizing the parameters μ and σ, we can normalize the observed (unnormalized) returns as (G − μ)/σ and use an actor-critic with advantage (G − μ)/σ − n(S; θ). However, in order to make sure the value function estimate does not change when the normalization parameters change, the parameters ω, b used to compute the value estimate

$$v(s; \theta, \sigma, \mu) \stackrel{\textrm{def}}{=} \sigma\bigl(\omega^T f(s; \theta - \{\omega, b\}) + b\bigr) + \mu$$

are updated under any change $\mu \to \mu'$ and $\sigma \to \sigma'$ as

$$\omega \leftarrow \frac{\sigma}{\sigma'}\,\omega, \qquad b \leftarrow \frac{\sigma b + \mu - \mu'}{\sigma'}.$$
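To make the update concrete, here is a minimal NumPy sketch of a PopArt value head (the class and method names are ours, not from the paper): it tracks μ and υ with an exponential moving average and rescales ω and b so that the unnormalized value estimate is preserved.

```python
import numpy as np

class PopArtValueHead:
    """Illustrative sketch of a PopArt-normalized linear value head."""

    def __init__(self, feature_dim, beta=3e-4):
        self.omega = np.zeros(feature_dim)  # weights ω of the normalized predictor n(s; θ)
        self.b = 0.0                        # bias b of the normalized predictor
        self.mu = 0.0                       # first moment μ of the targets
        self.upsilon = 1.0                  # second moment υ of the targets
        self.beta = beta                    # decay rate of the exponential moving average

    @property
    def sigma(self):
        # σ = sqrt(υ − μ²), kept away from zero for numerical stability
        return np.sqrt(max(self.upsilon - self.mu ** 2, 1e-6))

    def value(self, features):
        # v(s; θ, σ, μ) = σ (ωᵀ f(s; θ) + b) + μ
        return self.sigma * (self.omega @ features + self.b) + self.mu

    def update_statistics(self, target):
        """Update μ, υ from an unnormalized return and rescale ω, b to preserve outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.upsilon = (1 - self.beta) * self.upsilon + self.beta * target ** 2
        new_sigma = self.sigma
        # The compensation from the slide: ω ← (σ/σ')ω, b ← (σb + μ − μ')/σ'
        self.omega *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma
```

With this compensation, only the normalized predictor n(s; θ) has to fit the normalized targets (G − μ)/σ, while the unnormalized estimates v(s; θ, σ, μ) are unchanged at the moment the statistics are updated.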

In multi-task settings, we train a task-agnostic policy and task-specific value functions (therefore, μ, σ and n(s; θ) are vectors).

PopArt Results

Table 1 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.

Figures 1, 2 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.


Figure 3 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al. Normalization statistics on chosen environments.


Figures 4, 5 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.

Transformed Rewards

So far, we have clipped the rewards in DQN on Atari environments. Consider the Bellman operator $\mathcal{T}$,

$$(\mathcal{T} q)(s, a) \stackrel{\textrm{def}}{=} \mathbb{E}_{s', r \sim p}\Bigl[r + \gamma \max_{a'} q(s', a')\Bigr].$$

Instead of reducing the magnitude of rewards, we might use a function $h : \mathbb{R} \to \mathbb{R}$ to reduce their scale. We then define the transformed Bellman operator $\mathcal{T}_h$ as

$$(\mathcal{T}_h q)(s, a) \stackrel{\textrm{def}}{=} \mathbb{E}_{s', r \sim p}\Bigl[h\Bigl(r + \gamma \max_{a'} h^{-1}\bigl(q(s', a')\bigr)\Bigr)\Bigr].$$


It is easy to prove the following two propositions from the 2018 paper Observe and Look Further: Achieving Consistent Performance on Atari by Tobias Pohlen et al.

1. If $h(z) = \alpha z$ for $\alpha > 0$, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_* = \alpha q_*$.
   The statement follows from the fact that such an h is equivalent to scaling the rewards by a constant α.

2. When h is strictly monotonically increasing and the MDP is deterministic, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_*$.
   This second proposition follows from

$$h \circ q_* = h \circ \mathcal{T} q_* = h \circ \mathcal{T}\bigl(h^{-1} \circ h \circ q_*\bigr) = \mathcal{T}_h(h \circ q_*),$$

where the last equality only holds if the MDP is deterministic.


The authors use the following transformation for the Atari environments

$$h(x) \stackrel{\textrm{def}}{=} \operatorname{sign}(x)\bigl(\sqrt{|x| + 1} - 1\bigr) + \varepsilon x$$

with $\varepsilon = 10^{-2}$. The additive regularization term ensures that $h^{-1}$ is Lipschitz continuous. It is straightforward to verify that

$$h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)} - 1}{2\varepsilon}\right)^2 - 1\right).$$
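As a sanity check, the transform and its inverse can be implemented directly from the two formulas above (a short illustrative sketch; the constant and function names are ours):

```python
import numpy as np

EPS = 1e-2  # ε from the paper

def h(x, eps=EPS):
    # h(x) = sign(x)(sqrt(|x| + 1) − 1) + εx
    return np.sign(x) * (np.sqrt(np.abs(x) + 1) - 1) + eps * x

def h_inv(x, eps=EPS):
    # Closed-form inverse of h, as given above
    return np.sign(x) * (((np.sqrt(1 + 4 * eps * (np.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)

# Sanity check that h_inv really inverts h on a range of values.
xs = np.linspace(-1000.0, 1000.0, 1001)
assert np.allclose(h_inv(h(xs)), xs)
```

In transformed Q-learning, the target then becomes $h\bigl(r + \gamma\, h^{-1}(\max_{a'} q(s', a'))\bigr)$, matching the transformed Bellman operator above.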

Recurrent Replay Distributed DQN (R2D2)

Proposed in 2019 to study the effects of recurrent state, experience replay and distributed training.

R2D2 utilizes prioritized replay, n-step double Q-learning with n = 5, convolutional layers followed by a 512-dimensional LSTM feeding a duelling architecture, experience generation by a large number of actors (256, each performing approximately 260 steps per second), and learning from batches by a single learner (achieving 5 updates per second using mini-batches of 64 sequences of length 80).

Instead of individual transitions, the replay buffer stores fixed-length (m = 80) sequences of (s, a, r), with adjacent sequences overlapping by 40 time steps.
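The following is a minimal sketch of how an episode might be cut into such overlapping sequences (the function name and the handling of a shorter trailing chunk are illustrative assumptions, not the paper's exact procedure):

```python
def episode_to_sequences(transitions, m=80, overlap=40):
    """Split an episode (a list of (s, a, r) tuples) into length-m sequences,
    with adjacent sequences overlapping by `overlap` time steps."""
    stride = m - overlap
    sequences = []
    for start in range(0, max(len(transitions) - overlap, 1), stride):
        chunk = transitions[start:start + m]
        if len(chunk) == m:  # a shorter trailing chunk is simply dropped in this sketch
            sequences.append(chunk)
    return sequences

# A 200-step episode yields four sequences starting at steps 0, 40, 80 and 120.
episode = [(f"s{t}", f"a{t}", 0.0) for t in range(200)]
assert [len(seq) for seq in episode_to_sequences(episode)] == [80, 80, 80, 80]
```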


Figure 1 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Table 1 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Figure 2 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Table 2 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Figure 9 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Ablations comparing reward clipping instead of value function rescaling (Clipped), a smaller discount factor γ = 0.99 (Discount), and a feed-forward variant of R2D2. Furthermore, the life-loss ablations evaluate resetting an episode on life loss: the reset variant ends the episode there, while the roll variant only prevents value bootstrapping across the life loss (but not LSTM unrolling).

Figure 4 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Figure 7 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Utilization of LSTM Memory During Inference

Figure 5 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

MuZero

The MuZero algorithm extends AlphaZero with a trained model, alleviating the requirement for known MDP dynamics. At each time step $t$, for each of $1 \le k \le K$ steps, a model $\mu_\theta$ with parameters $\theta$, conditioned on past observations $o_1, \ldots, o_t$ and future actions $a_{t+1}, \ldots, a_{t+k}$, predicts three future quantities:

- the policy $p_t^k \approx \pi(a_{t+k+1} \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k})$,
- the value function $v_t^k \approx \mathbb{E}\bigl[u_{t+k+1} + \gamma u_{t+k+2} + \ldots \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k}\bigr]$,
- the immediate reward $r_t^k \approx u_{t+k}$,

where $u_i$ are the observed rewards and $\pi$ is the behaviour policy.


At each time step $t$ (omitted from now on for simplicity), the model is composed of three components: a representation function, a dynamics function and a prediction function.

- The dynamics function, $r^k, s^k = g_\theta(s^{k-1}, a^k)$, simulates the MDP dynamics and predicts an immediate reward $r^k$ and an internal state $s^k$. The internal state has no explicit semantics; its only goal is to accurately predict rewards, values and policies.
- The prediction function, $p^k, v^k = f_\theta(s^k)$, computes the policy and the value function, similarly to AlphaZero.
- The representation function, $s^0 = h_\theta(o_1, \ldots, o_t)$, generates an internal state encoding the past observations.
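A small illustrative sketch of how the three components compose when unrolling the model for given future actions; `h_fn`, `g_fn` and `f_fn` stand for the learned representation, dynamics and prediction functions and are passed in as plain callables (an assumed interface, not the paper's):

```python
def unroll_model(h_fn, g_fn, f_fn, observations, actions):
    """Unroll μθ: encode past observations into s⁰, then follow the given future
    actions through the dynamics, collecting (pᵏ, vᵏ, rᵏ) predictions."""
    s = h_fn(observations)            # s⁰ = h_θ(o₁, …, o_t)
    p, v = f_fn(s)                    # k = 0 predictions (no reward is predicted here)
    predictions = [(p, v, None)]
    for a in actions:                 # k = 1, …, K
        r, s = g_fn(s, a)             # rᵏ, sᵏ = g_θ(sᵏ⁻¹, aᵏ)
        p, v = f_fn(s)                # pᵏ, vᵏ = f_θ(sᵏ)
        predictions.append((p, v, r))
    return predictions

# Toy usage with dummy functions; a real model would use neural networks.
preds = unroll_model(h_fn=lambda obs: 0.0,
                     g_fn=lambda s, a: (1.0, s + a),
                     f_fn=lambda s: ([0.5, 0.5], s),
                     observations=["o1", "o2"], actions=[1, 0, 1])
```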


Figure 1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.


To select a move, we employ an MCTS algorithm similar to the one in AlphaZero. It produces a policy $\pi_t$ and a value estimate $\nu_t$, and an action is then sampled from the policy, $a_{t+1} \sim \pi_t$.

During training, we utilize a sequence of $K$ moves. We estimate the return using bootstrapping as $z_t = u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n}$. The values $K = 5$ and $n = 10$ are used in the paper.

The loss is then composed of the following components:

$$L_t(\theta) = \sum_{k=0}^{K} L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) + c\|\theta\|^2.$$

Note that in Atari, rewards are scaled by $\operatorname{sign}(x)\bigl(\sqrt{|x| + 1} - 1\bigr) + \varepsilon x$ for $\varepsilon = 10^{-3}$, and the authors utilize a cross-entropy loss with 601 categories for values $-300, \ldots, 300$, which they claim to be more stable.
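A hedged sketch of such a categorical encoding (a two-hot encoding over 601 integer bins; the function names are ours): a transformed scalar is clipped to [−300, 300] and its mass split between the two nearest integers, so that the expectation over the support recovers the scalar.

```python
import numpy as np

SUPPORT = np.arange(-300, 301)  # 601 integer bins

def scalar_to_categorical(x, support=SUPPORT):
    """Two-hot encoding: spread the (clipped) scalar over its two nearest integer bins."""
    x = float(np.clip(x, support[0], support[-1]))
    low = int(np.floor(x))
    p_high = x - low                      # fraction of mass assigned to the upper bin
    probs = np.zeros(len(support))
    probs[low - support[0]] += 1.0 - p_high
    if p_high > 0.0:
        probs[low + 1 - support[0]] += p_high
    return probs

def categorical_to_scalar(probs, support=SUPPORT):
    return float(probs @ support)

# The encoding is exact: the expectation over the support recovers the scalar.
assert abs(categorical_to_scalar(scalar_to_categorical(3.7)) - 3.7) < 1e-9
```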


Model

$$s^0 = h_\theta(o_1, \ldots, o_t)$$
$$r^k, s^k = g_\theta(s^{k-1}, a^k)$$
$$p^k, v^k = f_\theta(s^k)$$

so that $p^k, v^k, r^k = \mu_\theta(o_1, \ldots, o_t, a^1, \ldots, a^k)$.

Search

$$\nu_t, \pi_t = \mathrm{MCTS}(s_t^0, \mu_\theta)$$
$$a_t \sim \pi_t$$


Learning Rule

$$p_t^k, v_t^k, r_t^k = \mu_\theta(o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k})$$

$$z_t = \begin{cases} u_T & \text{for games} \\ u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n} & \text{for general MDPs} \end{cases}$$

$$L_t(\theta) = \sum_{k=0}^{K} L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) + c\|\theta\|^2$$

Losses

$$L^r(u, r) = \begin{cases} 0 & \text{for games} \\ \phi(u)^T \log r & \text{for general MDPs} \end{cases}$$
$$L^v(z, q) = \begin{cases} (z - q)^2 & \text{for games} \\ \phi(z)^T \log q & \text{for general MDPs} \end{cases}$$
$$L^p(\pi, p) = \pi^T \log p$$

MuZero – Evaluation

Figure 2 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Atari Results

Table 1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Planning Ablations

Figure 3 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.


Figure S3 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Detailed Atari Results

Table S1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

