
Action and Perception as Divergence Minimization
Danijar Hafner (1,2), Pedro Ortega (3), Jimmy Ba (2), Thomas Parr (4), Karl Friston (4), Nicolas Heess (3)
1 Google Brain   2 University of Toronto   3 DeepMind   4 University College London

1 Overview
1. Map of possible agent objectives that correspond to different latent variables, target factorizations, and divergence measures
2. Unified perspective on representation learning, exploration, and empowerment
3. Representation learning should be paired with infogain exploration for a temporally consistent objective
4. World models as a path toward adaptive infomax agents while making task rewards optional
5. Future objectives should be derived from a joint divergence to facilitate comparison and make the target explicit

2 Agents with Latent Variables
[Figure: agent beliefs (actions, objects, rules, etc.) and the input sequence (images, proprioception, etc.) are coupled through action and perception.] The agent's beliefs over representations and actions (including the actions themselves) are parameterized, so acting and perceiving both amount to optimizing these parameters.

3 Joint KL Minimization
Formulate the agent objective as bringing its current actual distribution toward a target distribution (see the sketch below).
● Actual distribution: the lifetime trajectory of inputs and the set of agent latents (actions, state estimates, parameters, skills)
● Target distribution: many different options
● The divergence is minimized over past inputs from on-policy data or a replay buffer, and over future inputs by infogain exploration or planning
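A minimal sketch of the panel-3 objective, using notation assumed here rather than copied from the poster: x denotes the lifetime sequence of inputs, z the set of agent latents, p_\phi the actual distribution induced by acting and inferring with parameters \phi, and \tau the target distribution.

    \min_\phi \; \mathrm{KL}\big[\, p_\phi(x, z) \,\|\, \tau(x, z) \,\big]

Perception and action both descend on this one quantity: perception by changing the beliefs over z, action by changing which inputs x are gathered (panel 8).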

4 Target Dependencies
Factorized Targets
● Inputs and latents have zero mutual information under the target
● The agent minimizes the mutual information in the actual distribution
● Example: MaxEnt RL uses a reward factor and an action prior to solve the task while keeping actions as random as possible

Expressive Targets
● The target knows or learns dependencies between inputs and latents
● The agent maximizes the mutual information in the actual distribution
● Examples: world models learn representations that are informative of past inputs; reverse predictors learn skills that maximally influence future inputs (see Eq. 6 in the paper)

5 Information Bounds
Minimizing the joint KL to an expressive target...
● realizes the preferences expressed by the target
● maximizes a variational bound on the mutual information between inputs and latents
● yields a bound that is tighter the better the target can express dependencies
[Equations in the panel: the objective decomposes into simplicity, accuracy, and input entropy terms over past inputs, and into control and information gain terms over future inputs; a hedged reconstruction follows below.]

6 Past and Future
Agents with expressive targets...
● infer latent representations that are informative of past inputs
● explore future inputs that are informative of the representations (a sketch follows after this panel)
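A hedged reconstruction of the panel-5 equation over past inputs, assuming the standard notation q_\phi(z|x) for the agent's belief and \tau(z), \tau(x|z) for the target's prior and likelihood; the grouping follows the poster's term labels, though the paper's exact form may differ:

    \mathrm{KL}\big[\, p_\phi(x, z) \,\|\, \tau(x, z) \,\big]
      \;=\; \underbrace{\mathbb{E}_{p(x)}\,\mathrm{KL}\big[ q_\phi(z \mid x) \,\|\, \tau(z) \big]}_{\text{simplicity}}
      \;-\; \underbrace{\mathbb{E}_{p(x,z)}\big[ \ln \tau(x \mid z) \big]}_{\text{accuracy}}
      \;-\; \underbrace{\mathrm{H}\big[ p(x) \big]}_{\text{input entropy}}

The accuracy and input entropy terms together form a Barber-Agakov-style lower bound on the mutual information I(x; z), which is why minimizing the joint KL maximizes a variational bound on that information, and why the bound tightens as \tau(x|z) better expresses the dependencies. Over future inputs the same divergence splits into the control and information gain terms named in the panel; for the factorized MaxEnt RL target of panel 4 (a reward factor times an action prior), it reduces, roughly and up to constants, to maximizing expected reward plus action entropy.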

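A minimal Python sketch of the panel-6 pairing of representation learning (past infomax) with information-gain exploration (future infomax). Everything here is illustrative: the encoder/decoder, the ensemble of one-step latent predictors, and the use of ensemble disagreement as an information-gain proxy (in the spirit of Plan2Explore from the table below) are assumptions, not the paper's exact algorithm.

    import torch
    import torch.nn as nn

    class WorldModel(nn.Module):
        """Past infomax: learn representations that are informative of past inputs."""
        def __init__(self, obs_dim=8, latent_dim=4):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, obs_dim))

        def forward(self, obs):
            z = self.encoder(obs)
            recon = self.decoder(z)
            # Accuracy term: explain the inputs from the latent representation.
            return z, ((recon - obs) ** 2).mean()

    class Ensemble(nn.Module):
        """Future infomax proxy: disagreement of an ensemble of one-step latent predictors."""
        def __init__(self, latent_dim=4, act_dim=2, members=5):
            super().__init__()
            self.members = nn.ModuleList(
                nn.Sequential(nn.Linear(latent_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
                for _ in range(members))

        def forward(self, z, action):
            preds = torch.stack([m(torch.cat([z, action], -1)) for m in self.members])
            # High variance across members ~ high expected information gain about the dynamics.
            return preds.var(0).mean(-1)

    # Usage: train the world model on replayed inputs, and add the disagreement as an
    # intrinsic reward so the policy seeks future inputs that are informative of the
    # representations (infogain exploration).
    obs, action = torch.randn(16, 8), torch.randn(16, 2)
    model, ensemble = WorldModel(), Ensemble()
    z, recon_loss = model(obs)
    intrinsic_reward = ensemble(z.detach(), action)
    print(recon_loss.item(), intrinsic_reward.mean().item())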
7 Types of Latents

Latent Variable                       | Past Infomax                                   | Future Infomax
Actions (past actions are observed)   | N/A                                            | Empowerment & MaxEnt RL: VIM, ACIE, EPC, SQL, SAC
Skills (past skills are observed)     | N/A                                            | Skill Discovery: VIC, SNN, DIAYN, VALOR
State Estimates                       | State Estimation: VAE, DVBF, SOLAR, PlaNet     | State Information Gain: NDIGO, DVBF-LM
Dynamics Parameters                   | System Identification: PETS, Bayesian PlaNet   | Dynamics Information Gain: VIME, MAX, Plan2Explore
Policy Parameters                     | Belief over Policies: BootDQN, Bayesian DQN    | Policy Information Gain: BootDQN, Bayesian DQN

(A reverse-predictor sketch for the Skill Discovery row follows after panel 9.)

8 Action Perception Cycle
Action and perception optimize the same objective but receive and affect different variables. Under a unified target distribution...
● perception makes the agent's beliefs consistent with the world
● actions make the world consistent with the agent's beliefs

9 Niche Seeking
● Minimizing a joint divergence also brings the marginals together
● The marginal target distribution over inputs is the marginal likelihood
The agent thus converges to an ecological niche...
● where it sees inputs in proportion to how well it can learn to predict them
● that is large because of the information gain exploration
● that it can inhabit despite external perturbations
The agent thus seeks out a large niche that it can inhabit and understand. Models that assign high probability to more trajectories lead to larger niches.
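An illustrative sketch of the reverse-predictor idea behind the Skill Discovery row (VIC/DIAYN-style): a skill z is drawn from a fixed prior, and a learned predictor q(z | inputs) provides both its own training signal and an intrinsic reward log q(z|x) - log p(z), a variational lower bound on the mutual information between skills and future inputs. The categorical-skill setup and network sizes are assumptions for the example, not the paper's exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n_skills, obs_dim = 8, 6
    # Reverse predictor q(z | x): classify which skill produced the observed inputs.
    reverse_predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_skills))

    def intrinsic_reward(obs, skill):
        """Variational lower bound on I(skill; inputs): log q(z|x) - log p(z)."""
        log_q = F.log_softmax(reverse_predictor(obs), dim=-1)
        log_q_z = log_q.gather(-1, skill.unsqueeze(-1)).squeeze(-1)
        log_p_z = -torch.log(torch.tensor(float(n_skills)))  # uniform skill prior
        return log_q_z - log_p_z

    # Usage: sample a skill, roll out the skill-conditioned policy (not shown), reward
    # the policy with intrinsic_reward on the visited inputs, and train the reverse
    # predictor to recover the skill from those inputs.
    skill = torch.randint(n_skills, (16,))
    obs = torch.randn(16, obs_dim)
    reward = intrinsic_reward(obs, skill)
    loss = F.cross_entropy(reverse_predictor(obs), skill)
    print(reward.mean().item(), loss.item())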

@danijarh Project website with video: danijar.com/apd