NPFL122, Lecture 11

PopArt Normalization, R2D2, MuZero

Milan Straka

December 14, 2020

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

PopArt Normalization

An improvement of IMPALA from Sep 2018, which performs normalization of task rewards instead of just reward clipping. PopArt stands for Preserving Outputs Precisely, while Adaptively Rescaling Targets. Assume the value estimate v(s; θ, σ, μ) is computed using a normalized value predictor n(s; θ)

$$v(s; \theta, \sigma, \mu) \stackrel{\textrm{def}}{=} \sigma\, n(s; \theta) + \mu$$

and further assume that n(s; θ) is an output of a linear function

$$n(s; \theta) \stackrel{\textrm{def}}{=} \omega^T f(s; \theta - \{\omega, b\}) + b.$$

We can update σ and μ using an exponential moving average with decay rate β (in the paper, the first moment μ and the second moment υ are tracked, and the standard deviation is computed as $\sigma = \sqrt{\upsilon - \mu^2}$; a decay rate of $\beta = 3 \cdot 10^{-4}$ is employed).


Utilizing the parameters μ and σ, we can normalize the observed (unnormalized) returns as (G − μ)/σ and use an actor-critic with advantage (G − μ)/σ − n(S; θ). However, in order to make sure the value function estimate does not change when the normalization parameters change, the parameters ω, b used to compute the value estimate

$$v(s; \theta, \sigma, \mu) \stackrel{\textrm{def}}{=} \sigma\bigl(\omega^T f(s; \theta - \{\omega, b\}) + b\bigr) + \mu$$

are updated under any change $\mu \to \mu'$ and $\sigma \to \sigma'$ as

$$\omega \leftarrow \frac{\sigma}{\sigma'}\,\omega, \qquad b \leftarrow \frac{\sigma b + \mu - \mu'}{\sigma'}.$$
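To make the update concrete, here is a minimal NumPy sketch of a PopArt value head (the class and method names are ours, not from the paper): it tracks μ and υ with an exponential moving average and rescales ω and b so that the unnormalized value estimate is preserved.

```python
import numpy as np

class PopArtValueHead:
    """Illustrative sketch of a PopArt-normalized linear value head."""

    def __init__(self, feature_dim, beta=3e-4):
        self.omega = np.zeros(feature_dim)  # weights ω of the normalized predictor n(s; θ)
        self.b = 0.0                        # bias b of the normalized predictor
        self.mu = 0.0                       # first moment μ of the targets
        self.upsilon = 1.0                  # second moment υ of the targets
        self.beta = beta                    # decay rate of the exponential moving average

    @property
    def sigma(self):
        # σ = sqrt(υ − μ²), kept away from zero for numerical stability
        return np.sqrt(max(self.upsilon - self.mu ** 2, 1e-6))

    def value(self, features):
        # v(s; θ, σ, μ) = σ (ωᵀ f(s; θ) + b) + μ
        return self.sigma * (self.omega @ features + self.b) + self.mu

    def update_statistics(self, target):
        """Update μ, υ from an unnormalized return and rescale ω, b to preserve outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.upsilon = (1 - self.beta) * self.upsilon + self.beta * target ** 2
        new_sigma = self.sigma
        # The compensation from the slide: ω ← (σ/σ')ω, b ← (σb + μ − μ')/σ'
        self.omega *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma
```

With this compensation, only the normalized predictor n(s; θ) has to fit the normalized targets (G − μ)/σ, while the unnormalized estimates v(s; θ, σ, μ) are unchanged at the moment the statistics are updated.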

In multi-task settings, we train a task-agnostic policy and task-specific value functions (therefore, μ, σ and n(s; θ) are vectors).

PopArt Results

Table 1 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.

Figures 1, 2 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.


Figure 3 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al. Normalization statistics on chosen environments.


Figures 4, 5 of paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.

Transformed Rewards

So far, we have clipped the rewards in DQN on Atari environments. Consider the Bellman operator $\mathcal{T}$,

$$(\mathcal{T} q)(s, a) \stackrel{\textrm{def}}{=} \mathbb{E}_{s', r \sim p}\Bigl[r + \gamma \max_{a'} q(s', a')\Bigr].$$

Instead of reducing the magnitude of rewards, we might use a function $h : \mathbb{R} \to \mathbb{R}$ to reduce their scale. We then define the transformed Bellman operator $\mathcal{T}_h$ as

$$(\mathcal{T}_h q)(s, a) \stackrel{\textrm{def}}{=} \mathbb{E}_{s', r \sim p}\Bigl[h\Bigl(r + \gamma \max_{a'} h^{-1}\bigl(q(s', a')\bigr)\Bigr)\Bigr].$$


It is easy to prove the following two propositions from the 2018 paper Observe and Look Further: Achieving Consistent Performance on Atari by Tobias Pohlen et al.

1. If $h(z) = \alpha z$ for $\alpha > 0$, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_* = \alpha q_*$.
   The statement follows from the fact that such an h is equivalent to scaling the rewards by a constant α.

2. When h is strictly monotonically increasing and the MDP is deterministic, then $\mathcal{T}_h^k q \xrightarrow{k \to \infty} h \circ q_*$.
   This second proposition follows from

$$h \circ q_* = h \circ \mathcal{T} q_* = h \circ \mathcal{T}\bigl(h^{-1} \circ h \circ q_*\bigr) = \mathcal{T}_h(h \circ q_*),$$

where the last equality only holds if the MDP is deterministic.


The authors use the following transformation for the Atari environments

$$h(x) \stackrel{\textrm{def}}{=} \operatorname{sign}(x)\bigl(\sqrt{|x| + 1} - 1\bigr) + \varepsilon x$$

with $\varepsilon = 10^{-2}$. The additive regularization term ensures that $h^{-1}$ is Lipschitz continuous. It is straightforward to verify that

$$h^{-1}(x) = \operatorname{sign}(x)\left(\left(\frac{\sqrt{1 + 4\varepsilon(|x| + 1 + \varepsilon)} - 1}{2\varepsilon}\right)^2 - 1\right).$$
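As a sanity check, the transform and its inverse can be implemented directly from the two formulas above (a short illustrative sketch; the constant and function names are ours):

```python
import numpy as np

EPS = 1e-2  # ε from the paper

def h(x, eps=EPS):
    # h(x) = sign(x)(sqrt(|x| + 1) − 1) + εx
    return np.sign(x) * (np.sqrt(np.abs(x) + 1) - 1) + eps * x

def h_inv(x, eps=EPS):
    # Closed-form inverse of h, as given above
    return np.sign(x) * (((np.sqrt(1 + 4 * eps * (np.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1)

# Sanity check that h_inv really inverts h on a range of values.
xs = np.linspace(-1000.0, 1000.0, 1001)
assert np.allclose(h_inv(h(xs)), xs)
```

In transformed Q-learning, the target then becomes $h\bigl(r + \gamma\, h^{-1}(\max_{a'} q(s', a'))\bigr)$, matching the transformed Bellman operator above.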

Recurrent Replay Distributed DQN (R2D2)

Proposed in 2019 to study the effects of recurrent state, experience replay and distributed training.

R2D2 utilizes prioritized replay, n-step double Q-learning with n = 5, convolutional layers followed by a 512-dimensional LSTM feeding a duelling architecture, experience generation by a large number of actors (256, each performing approximately 260 steps per second), and learning from batches by a single learner (achieving 5 updates per second using mini-batches of 64 sequences of length 80).

Instead of individual transitions, the replay buffer stores fixed-length (m = 80) sequences of (s, a, r), with adjacent sequences overlapping by 40 time steps.
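The following is a minimal sketch of how an episode might be cut into such overlapping sequences (the function name and the handling of a shorter trailing chunk are illustrative assumptions, not the paper's exact procedure):

```python
def episode_to_sequences(transitions, m=80, overlap=40):
    """Split an episode (a list of (s, a, r) tuples) into length-m sequences,
    with adjacent sequences overlapping by `overlap` time steps."""
    stride = m - overlap
    sequences = []
    for start in range(0, max(len(transitions) - overlap, 1), stride):
        chunk = transitions[start:start + m]
        if len(chunk) == m:  # a shorter trailing chunk is simply dropped in this sketch
            sequences.append(chunk)
    return sequences

# A 200-step episode yields four sequences starting at steps 0, 40, 80 and 120.
episode = [(f"s{t}", f"a{t}", 0.0) for t in range(200)]
assert [len(seq) for seq in episode_to_sequences(episode)] == [80, 80, 80, 80]
```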


Figure 1 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Table 1 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Figure 2 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Table 2 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Figure 9 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.


Ablations comparing reward clipping instead of value function rescaling (Clipped), a smaller discount factor γ = 0.99 (Discount), and a feed-forward variant of R2D2. Furthermore, the life-loss ablations evaluate resetting an episode on life loss: the reset variant ends the episode there, while the roll variant only prevents value bootstrapping across the life loss (but not LSTM unrolling).

Figure 4 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Figure 7 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

Utilization of LSTM Memory During Inference

Figure 5 of the paper "Recurrent Experience Replay in Distributed Reinforcement Learning" by Steven Kapturowski et al.

MuZero

The MuZero algorithm extends AlphaZero with a trained model, alleviating the requirement for known MDP dynamics. At each time step $t$, for each of $1 \le k \le K$ steps, a model $\mu_\theta$ with parameters $\theta$, conditioned on past observations $o_1, \ldots, o_t$ and future actions $a_{t+1}, \ldots, a_{t+k}$, predicts three future quantities:

- the policy $p_t^k \approx \pi(a_{t+k+1} \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k})$,
- the value function $v_t^k \approx \mathbb{E}\bigl[u_{t+k+1} + \gamma u_{t+k+2} + \ldots \mid o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k}\bigr]$,
- the immediate reward $r_t^k \approx u_{t+k}$,

where $u_i$ are the observed rewards and $\pi$ is the behaviour policy.


At each time step $t$ (omitted from now on for simplicity), the model is composed of three components: a representation function, a dynamics function and a prediction function.

- The dynamics function, $r^k, s^k = g_\theta(s^{k-1}, a^k)$, simulates the MDP dynamics and predicts an immediate reward $r^k$ and an internal state $s^k$. The internal state has no explicit semantics; its only goal is to accurately predict rewards, values and policies.
- The prediction function, $p^k, v^k = f_\theta(s^k)$, computes the policy and the value function, similarly to AlphaZero.
- The representation function, $s^0 = h_\theta(o_1, \ldots, o_t)$, generates an internal state encoding the past observations.
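A small illustrative sketch of how the three components compose when unrolling the model for given future actions; `h_fn`, `g_fn` and `f_fn` stand for the learned representation, dynamics and prediction functions and are passed in as plain callables (an assumed interface, not the paper's):

```python
def unroll_model(h_fn, g_fn, f_fn, observations, actions):
    """Unroll μθ: encode past observations into s⁰, then follow the given future
    actions through the dynamics, collecting (pᵏ, vᵏ, rᵏ) predictions."""
    s = h_fn(observations)            # s⁰ = h_θ(o₁, …, o_t)
    p, v = f_fn(s)                    # k = 0 predictions (no reward is predicted here)
    predictions = [(p, v, None)]
    for a in actions:                 # k = 1, …, K
        r, s = g_fn(s, a)             # rᵏ, sᵏ = g_θ(sᵏ⁻¹, aᵏ)
        p, v = f_fn(s)                # pᵏ, vᵏ = f_θ(sᵏ)
        predictions.append((p, v, r))
    return predictions

# Toy usage with dummy functions; a real model would use neural networks.
preds = unroll_model(h_fn=lambda obs: 0.0,
                     g_fn=lambda s, a: (1.0, s + a),
                     f_fn=lambda s: ([0.5, 0.5], s),
                     observations=["o1", "o2"], actions=[1, 0, 1])
```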


Figure 1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.


To select a move, we employ an MCTS algorithm similar to the one in AlphaZero. It produces a policy $\pi_t$ and a value estimate $\nu_t$, and an action is then sampled from the policy, $a_{t+1} \sim \pi_t$.

During training, we utilize a sequence of $K$ moves. We estimate the return using bootstrapping as $z_t = u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n}$. The values $K = 5$ and $n = 10$ are used in the paper.

The loss is then composed of the following components:

$$L_t(\theta) = \sum_{k=0}^{K} L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) + c\|\theta\|^2.$$

Note that in Atari, rewards are scaled by $\operatorname{sign}(x)\bigl(\sqrt{|x| + 1} - 1\bigr) + \varepsilon x$ for $\varepsilon = 10^{-3}$, and the authors utilize a cross-entropy loss with 601 categories for values $-300, \ldots, 300$, which they claim to be more stable.
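A hedged sketch of such a categorical encoding (a two-hot encoding over 601 integer bins; the function names are ours): a transformed scalar is clipped to [−300, 300] and its mass split between the two nearest integers, so that the expectation over the support recovers the scalar.

```python
import numpy as np

SUPPORT = np.arange(-300, 301)  # 601 integer bins

def scalar_to_categorical(x, support=SUPPORT):
    """Two-hot encoding: spread the (clipped) scalar over its two nearest integer bins."""
    x = float(np.clip(x, support[0], support[-1]))
    low = int(np.floor(x))
    p_high = x - low                      # fraction of mass assigned to the upper bin
    probs = np.zeros(len(support))
    probs[low - support[0]] += 1.0 - p_high
    if p_high > 0.0:
        probs[low + 1 - support[0]] += p_high
    return probs

def categorical_to_scalar(probs, support=SUPPORT):
    return float(probs @ support)

# The encoding is exact: the expectation over the support recovers the scalar.
assert abs(categorical_to_scalar(scalar_to_categorical(3.7)) - 3.7) < 1e-9
```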


Model

$$s^0 = h_\theta(o_1, \ldots, o_t)$$
$$r^k, s^k = g_\theta(s^{k-1}, a^k)$$
$$p^k, v^k = f_\theta(s^k)$$

so that $p^k, v^k, r^k = \mu_\theta(o_1, \ldots, o_t, a^1, \ldots, a^k)$.

Search

$$\nu_t, \pi_t = \mathrm{MCTS}(s_t^0, \mu_\theta)$$
$$a_t \sim \pi_t$$


Learning Rule

$$p_t^k, v_t^k, r_t^k = \mu_\theta(o_1, \ldots, o_t, a_{t+1}, \ldots, a_{t+k})$$

$$z_t = \begin{cases} u_T & \text{for games} \\ u_{t+1} + \gamma u_{t+2} + \ldots + \gamma^{n-1} u_{t+n} + \gamma^n \nu_{t+n} & \text{for general MDPs} \end{cases}$$

$$L_t(\theta) = \sum_{k=0}^{K} L^r(u_{t+k}, r_t^k) + L^v(z_{t+k}, v_t^k) + L^p(\pi_{t+k}, p_t^k) + c\|\theta\|^2$$

Losses

$$L^r(u, r) = \begin{cases} 0 & \text{for games} \\ \phi(u)^T \log r & \text{for general MDPs} \end{cases}$$
$$L^v(z, q) = \begin{cases} (z - q)^2 & \text{for games} \\ \phi(z)^T \log q & \text{for general MDPs} \end{cases}$$
$$L^p(\pi, p) = \pi^T \log p$$

MuZero – Evaluation

Figure 2 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Atari Results

Table 1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Planning Ablations

Figure 3 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.


Figure S3 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

MuZero – Detailed Atari Results

Table S1 of the paper "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" by Julian Schrittwieser et al.

