
Universal Planning Networks

Aravind Srinivas 1  Allan Jabri 1  Pieter Abbeel 1  Sergey Levine 1  Chelsea Finn 1

1 UC Berkeley, Computer Science. Correspondence to: Aravind Srinivas.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization. To this end, we introduce universal planning networks (UPN). UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. Visit https://sites.google.com/view/-public/home for video highlights.

Figure 1. An overview of the UPN, which embeds a gradient descent planner (GDP) in the action-selection process. We demonstrate transfer to different, harder control tasks, including morphological (yellow point robot to ant) and topological (3-link to 7-link reacher) variants, as shown above. (a) Universal Planning Network (UPN). (b) Leveraging learned latent representations: demonstrations for training tasks → learn UPN representations → derive reward functions from the learned metric → reinforcement learning on harder tasks.

1. Introduction

Learning visuomotor policies is a central pursuit in building machines capable of performing complex skills in the variety of unstructured and dynamic environments seen in the real world (Levine et al., 2016; Pinto et al., 2016). A key challenge in learning such policies lies in acquiring representations of the visual environment and its dynamics that are suitable for control. This challenge arises both in the construction of the policy itself and in the specification of the task. Extrinsic and perfect reward signals are typically not available for real-world reinforcement learning, and users must manually specify tasks via hand-crafted rewards with hand-crafted representations. To automate this process, some prior methods have proposed to specify tasks by providing an image of the goal scene (Deguchi & Takahashi, 1999; Watter et al., 2015; Finn et al., 2016b). However, a reward that measures success based on matching the raw pixels of the goal image is far from ideal: such a reward is both uninformative and overconstrained, since matching all pixels is usually not required for succeeding in tasks. If we can automatically identify the right representation, we can both accelerate the policy learning process and simplify the specification of tasks via goal images.

Prior work in visual representation learning for planning and control has relied predominantly on unsupervised or self-supervised objectives (Watter et al., 2015; Finn et al., 2016b), which in principle only provide an indirect connection to the utility of the representation for the underlying control problem. Effective representation learning for planning and control remains an open problem.

In this work, instead of learning from unsupervised or auxiliary objectives and expecting that useful representations should emerge, we directly optimize for plannable representations; that is, representations through which gradient-based planning is successful with respect to the goal-directed task. We propose universal planning networks (UPN), a neural network architecture that can be trained to acquire a plannable representation. By embedding a differentiable planning computation inside the policy, our method enables joint training of the planner and its underlying latent encoder and forward dynamics representations. An outer imitation learning objective ensures that the learned representations are directly optimized for successful gradient-based planning on a set of training demonstrations. In principle, however, the architecture could also be trained with other policy optimization techniques, such as those from reinforcement learning. An overview is provided in Figure 1(a).

We demonstrate that the representations learned by UPN not only support gradient-based trajectory optimization for successful visual imitation, but in fact acquire a meaningful encoding of state, which can be used as a metric for task-specific latent distance to a goal. We find that we can reuse this representation to specify latent distance-based rewards to reach new target states via standard model-free reinforcement learning, resulting in substantially more effective learning when using image targets. These properties are naturally induced by the agent's reliance on the minimization of the latent distance between its predicted terminal state and goal state throughout the planning process. By learning plannable representations, the UPN learns an optimizable latent distance metric. Our findings are based on a suite of challenging vision-based simulated robot control tasks that involve planning.

At a high level, our approach is a goal-conditioned policy architecture that leverages a gradient-based planning computation in its action-selection process. While the architecture is agnostic to the objective function in the outer loop, we will focus on the imitation learning setting. From the perspective of representation learning, our method provides a way to learn more effective representations suitable for specifying perceptual reward functions, which can then be used, for example, with a model-free reinforcement learner. In terms of meta-learning, our architecture can be seen as learning a planning computation by learning representations that are in some sense traversable by gradient descent trajectory optimization for satisfying the outer meta-objective.

In extensive experiments, we show that (1) UPNs learn effective visual goal-directed policies more efficiently (that is, with less data) than traditional imitation learners; (2) the latent representations induced by optimizing for successful planning can be leveraged to transfer task-related semantics to other agents for more challenging tasks through goal-conditioned reward functions, which to our knowledge has not previously been demonstrated; and (3) the learned planning computation improves when allowed more updates at test time, even in scenarios of less data, providing encouraging evidence of successful meta-learning for planning.

2. Universal Planning Networks

Model-based approaches leverage forward models to search for, or plan, sequences of actions to achieve goal states such that a planning objective is minimized. Forward modeling supports simulation of future states and hence, in principle, should allow for planning over extended horizons. In the absence of known environment dynamics, a forward model must be learned. Differentiable forward models allow for end-to-end training of model-based planners, as well as planning by back-propagating gradients with respect to input actions (Schmidhuber, 1990; Henaff et al., 2017).

Nevertheless, learned forward models may: (1) suffer from function approximation and modeling error, especially in complex, high-dimensional environments; (2) capture irrelevant details under the incentive to reduce model bias, as is often the case when learning directly from pixels; and (3) not necessarily align with the task and planning problem at hand, such that the inferred plans are sub-optimal even if the planning objective is optimized.

These issues motivate a central idea of the proposed method: instead of learning from surrogate unsupervised or auxiliary objectives, we directly optimize for what we care about, namely representations with which gradient-based trajectory optimization leads to the desired actions. We study a model-based architecture that performs a differentiable planning computation in a latent space jointly learned with forward dynamics, trained end-to-end to encode what is necessary for solving tasks by gradient-based planning.

2.1. Learning to Plan

The UPN computation graph forms a goal-directed policy supported by an iterative planning algorithm. Given initial and goal observations (o_t and o_g) as input images, the model produces an optimal plan â_{t:t+T} to arrive at o_g, where â_t denotes the predicted action at time t. The computation graph consists of a pair of tied encoders that encode both o_t and o_g, and their features are fed into a gradient descent planner (GDP), which produces the action a_t as output. The GDP uses a neural network encoder and forward dynamics model to simulate transitions in a learned latent space and is thus fully differentiable. An overview of the method is presented in Figure 2.
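To make the structure of what follows concrete, the planner and its training signal can be summarized as a nested optimization. This is a restatement of the objectives defined in Algorithms 1 and 2 below, not an additional assumption; the notation (f_φ, g_θ, α, n_p) is introduced in the remainder of this section.

Inner loop (gradient-descent planning in the learned latent space, with x_t = f_φ(o_t), x_g = f_φ(o_g), and x̂^(i)_{t+j+1} = g_θ(x̂^(i)_{t+j}, â^(i)_{t+j})):

\[
\hat{a}^{(i+1)}_{t:t+T} \;=\; \hat{a}^{(i)}_{t:t+T} \;-\; \alpha \,\nabla_{\hat{a}^{(i)}_{t:t+T}} \left\| \hat{x}^{(i)}_{t+T+1} - x_g \right\|_2^2
\]

Outer loop (imitation of expert actions using the plan returned after n_p inner updates):

\[
\min_{\theta,\phi}\;\; \mathbb{E}_{(o_t,\, o_g,\, a^{*}_{t:t+T})} \left\| \hat{a}^{(n_p)}_{t:t+T}(o_t, o_g;\, \theta, \phi) - a^{*}_{t:t+T} \right\|_2^2
\]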


Figure 2. An overview of the proposed method. Given an initial o_t and a goal o_g, the GDP (gradient descent planner) uses gradient descent to optimize a plan to reach the goal observation with a sequence of actions in a latent space represented by f_φ. This planning process forms one large computation graph, chaining together the sub-graphs of each iteration of planning. The learning signal is derived from the (outer) imitation loss and the gradient is back-propagated through the entire planning computation graph. The blue lines represent the flow of gradients for planning, while the red lines depict the meta-optimization learning signal and the components of the architecture affected by it. Note that the GDP iteratively plans across n_p updates, as indicated by the i-th loop.

The GDP uses gradient descent to optimize for a sequence of actions â_{t:t+T} to reach the encoded goal observation o_g from an initial o_t. Since the model is differentiable, back-propagation through time allows for computing the gradient with respect to each planned action in order to end up closer to the desired goal state. Each iteration of the GDP thus involves unrolling the trajectory of latent state encodings using the current planned actions, and taking a step along the gradient to improve the planning objective. The cumulative planning process forms a large, differentiable computation graph, chaining together each iteration of planning.

The actual learning signal is derived from an outer loss function, which supervises the entire computation graph (including the GDP) to output the correct action sequence. The outer loss can in principle take any form, but in this work we use an imitation learning loss derived from demonstrations. The outer loss provides task-specific grounding to optimize for representations that support effective iterative planning for the task and environment at hand, as the gradient is back-propagated through the entire iterative planning computation graph.

Training thus involves nested objectives. One can view the learning process as first deriving a plan to achieve the goal and then updating the model parameters to make the planning procedure more effective for the outer objective. In other words, we seek to learn the planning computation through its underlying representations for latent state encoding and latent forward dynamics.

Parameters: The model is composed of a forward dynamics model g_θ and an encoder f_φ, where θ and φ are neural network parameters that are learned end-to-end:

    x_t = f_φ(o_t),    x̂_{t+1} = g_θ(x_t, a_t)

Specifically, f_φ is a convolutional network and g_θ is a fully connected network. Further architectural details can be found in the supplementary. (Footnote: we note that one could also use an action encoder h_α(a_t) = u_t, with g_θ operating on x_t and u_t. A temporal encoder h would allow for abstract sequences of actions (options), for an option-conditioned latent forward model g_θ. We work with flat sequences of actions, leaving hierarchical extensions for future work.)

Planning by Gradient Descent: The planner starts with an element-wise randomly initialized plan â^(0)_{t:t+T} ~ U(-1, 1) and aims to minimize the distance between the predicted terminal latent state and the encoded goal observation. T denotes the horizon over which the agent plans, which can depend on the task and hence may be treated as a hyperparameter, while n_p is the number of planning updates performed. Algorithm 1 describes the iterative optimization procedure implemented by the GDP.

Huber Loss: In practice, for L^(i)_plan, we use a Huber loss centered around x_g for well-behaved inner-loop gradients, instead of a direct quadratic ||x̂^(i)_{t+T+1} − x_g||_2^2. This usage is inspired by the Deep Q-Networks paper of Mnih et al. (2015), and similar metrics have also been used by Levine et al. (2016) and Sermanet et al. (2017).
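As a concrete illustration of this parameterization, the sketch below shows one way f_φ, g_θ, the latent rollout, and the Huber planning loss could be written in PyTorch. The layer sizes, image channels, latent dimension, and the use of PyTorch's smooth_l1_loss as the Huber loss are illustrative assumptions, not the exact architecture from the supplementary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder(nn.Module):
        """f_phi: encodes an image observation o into a latent state x."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
            )
            self.fc = nn.LazyLinear(latent_dim)  # infers the flattened conv size

        def forward(self, obs):                  # obs: (batch, 3, H, W)
            return self.fc(self.conv(obs).flatten(start_dim=1))

    class ForwardModel(nn.Module):
        """g_theta: predicts the next latent state from (x_t, a_t)."""
        def __init__(self, latent_dim=128, action_dim=2, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, latent_dim),
            )

        def forward(self, x, a):
            return self.net(torch.cat([x, a], dim=-1))

    def plan_loss(f_phi, g_theta, o_t, o_g, actions):
        """Unrolls g_theta in latent space under `actions` (batch, T+1, action_dim)
        and returns the Huber planning loss between the predicted terminal latent
        state and the encoded goal."""
        x, x_g = f_phi(o_t), f_phi(o_g)
        for a in actions.unbind(dim=1):
            x = g_theta(x, a)
        return F.smooth_l1_loss(x, x_g)  # Huber loss centered at x_g

Note that the rollout never decodes back to pixels; all planning happens in the learned latent space.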

Algorithm 1  GDP(o_t, o_g, α) → â_{t:t+T}
Require: α: hyperparameter for step size
  Randomize an initial guess for the optimal plan â^(0)_{t:t+T}
  Compute x_t = f_φ(o_t), x_g = f_φ(o_g)
  for i from 0 to n_p − 1 do
    for j from 0 to T do
      x̂^(i)_{t+j+1} = g_θ(x̂^(i)_{t+j}, â^(i)_{t+j})
    end for
    Compute L^(i)_plan = ||x̂^(i)_{t+T+1} − x_g||_2^2
    Update plan: â^(i+1)_{t:t+T} = â^(i)_{t:t+T} − α ∇_{â^(i)_{t:t+T}} L^(i)_plan
  end for
  Return â^(n_p)_{t:t+T}

Algorithm 2  Learning the Planner via Imitation
Require: GDP(o_t, o_g, α), expert a*_{t:t+T}, step sizes α, β
  for n from 1 to N do
    Sample a batch of demonstrations (o_t, o_g, a*_{t:t+T})
    Compute â_{t:t+T} = GDP(o_t, o_g, α)
    Compute L_imitate = ||â_{t:t+T} − a*_{t:t+T}||_2^2
    Update θ := θ − β ∇_θ L_imitate
    Update φ := φ − β ∇_φ L_imitate
  end for

Action selection at test-time: At test-time, Algorithm 1 can be used to produce a sequence of actions. A more sophisticated approach is to use Algorithm 1 to re-plan at each timestep. The agent first plans a trajectory suitable to reach o_g from o_t, but only executes the first action before re-planning. This allows the agent to achieve goals requiring longer planning horizons at test-time, even if the GDP was trained with a shorter horizon. This amounts to using model-predictive control (MPC) over our learned planner.

2.2. Imitation as the Outer Objective

An idea central to our approach is to directly optimize the planning computation for the task at hand, through the outer objective. Though in this work we study the use of an imitation loss as the outer objective, the policy can in principle be trained through any gradient-based policy search method, including policy gradients (Schulman, 2016) and value functions (Sutton & Barto, 1998).

To learn parameters φ and θ, we do not directly optimize the planning error under L_plan, but instead learn the planner insofar as it can imitate an expert agent by iteratively applying L_plan (Algorithm 2). The model is therefore trained to plan in such a way as to produce actions that match the expert demonstrations.

Note that the subroutine GDP(o_t, o_g, α) is an accumulated computation graph composed of several iterations of planning, each of which includes encoding observations and unrolling of latent forward dynamics through time. Learning end-to-end thus requires that we back-propagate the behavior cloning loss under the produced plan through the GDP subroutine, as depicted in Figure 2. We note that the gradients obtained on the network parameters θ and φ from the outer objective are composed of first-order derivatives of these parameters. Therefore, even though the computation graph of UPN may seem long and complicated, it is not prohibitively expensive to compute.

In learning to plan via imitation, the agent jointly optimizes for latent state and dynamics representations that capture notions of state comparison useful for the imitation task and that are traversable by gradient descent for trajectory optimization. This is naturally induced by the agent's reliance on the minimization of the latent distance between its predicted terminal state and goal state throughout the planning process. Thus, in necessitating plannable representations, the encoder learns an optimizable latent distance metric. This is key to the viability of using the learned latent space as a metric from which to derive reward functions for reinforcement learning.
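The following PyTorch-style sketch shows how Algorithms 1 and 2 could be realized on top of the Encoder/ForwardModel/plan_loss sketch above. The essential detail is create_graph=True in the inner gradient step, which keeps each planning update differentiable so the outer imitation loss can back-propagate through the entire planner. The optimizer is assumed to hold the parameters of both networks; batch shapes and the choice of optimizer are illustrative, while the step sizes and the uniform plan initialization follow the text.

    import torch
    import torch.nn.functional as F

    def gdp(f_phi, g_theta, o_t, o_g, T, n_p, alpha, action_dim):
        """Algorithm 1: gradient descent planner. Returns a (batch, T+1, action_dim)
        plan after n_p differentiable inner updates on the planning loss."""
        batch = o_t.shape[0]
        # Element-wise random initialization of the plan in U(-1, 1).
        actions = torch.empty(batch, T + 1, action_dim,
                              device=o_t.device).uniform_(-1.0, 1.0)
        actions.requires_grad_(True)
        for _ in range(n_p):
            loss = plan_loss(f_phi, g_theta, o_t, o_g, actions)
            # create_graph=True keeps the update differentiable w.r.t. theta and phi.
            (grad,) = torch.autograd.grad(loss, actions, create_graph=True)
            actions = actions - alpha * grad
        return actions

    def imitation_step(f_phi, g_theta, optimizer, o_t, o_g, expert_actions,
                       T, n_p, alpha):
        """Algorithm 2: one outer update of theta and phi on a demonstration batch."""
        plan = gdp(f_phi, g_theta, o_t, o_g, T, n_p, alpha,
                   action_dim=expert_actions.shape[-1])
        loss = F.mse_loss(plan, expert_actions)   # || a_hat - a* ||^2
        optimizer.zero_grad()
        loss.backward()                           # back-propagates through the planner
        optimizer.step()
        return loss.item()

At test time, the same gdp routine can be called with a larger n_p than was used during training, or wrapped in the MPC-style re-planning described above, where only the first action of each returned plan is executed.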

2.3. RL with Rewards from a pre-trained UPN

A potential solution to the reward specification problem for image targets is to use distance to the target image in an abstract representation. Previous work such as Watter et al. (2015) and Finn et al. (2016b) propose unsupervised learning with autoencoders, while others attempt to fine-tune representations from ImageNet using auxiliary losses tailor-made for robotic manipulation (Sermanet et al., 2017). With the UPN having been trained to acquire plannable representations, we might naturally expect the learned latent space encoded by f_φ to serve well as an abstract representation through which rewards can be specified for RL on visuomotor tasks with image targets from scratch. We can leverage the learned f_φ from a pre-trained UPN to provide reward functions of the form r(o_t, o_g) = −||f_φ(o_t) − f_φ(o_g)||_2^2. The RL procedure is portrayed in Figure 3. As we will see in subsection 4.4, this enables transferring general task semantics learned by the UPN such that other RL agents of different form can be trained with RL from scratch.

Figure 3. UPN-RL Agent: The rewards are based on the difference between o_t and o_g in the UPN latent space, while the policy is conditioned on joint angles and velocities specific to the agent, s_t, and the feature vector of the goal, f_φ(o_g). The agent has to reason about the goals and how to achieve them.
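A minimal sketch of this reward, assuming a trained Encoder f_phi as in the earlier sketches (whether the latent distance should additionally be scaled or clipped is not specified here):

    import torch

    @torch.no_grad()
    def upn_reward(f_phi, o_t, o_g):
        """r(o_t, o_g) = -|| f_phi(o_t) - f_phi(o_g) ||_2^2 for a batch of observations."""
        diff = f_phi(o_t) - f_phi(o_g)
        return -(diff ** 2).sum(dim=-1)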

3. Related Work

Our work is primarily concerned with learning representations that can support planning for tasks described through an image target. Watter et al. (2015) and Finn et al. (2016b) take an unsupervised learning approach to learning such representations, which they use for planning with respect to target images using iLQR (Tassa et al., 2012). However, reconstructing all the pixels in the scene could lead to the encoding of state variables not necessarily useful in the context of planning (Higgins et al., 2017) and discard state variables that are not visually prominent (Goodfellow et al. (2016), Chapter 15). Our approach avoids this problem by explicitly optimizing a representation for plannability through gradient descent as the only criterion. Self-supervised methods that avoid pixel reconstruction by using other intermediate forms of supervision that can be obtained automatically from the data have also been used to learn representations for visuomotor control (Sermanet et al., 2016; 2017). We again differ by optimizing directly for what we need: plannable representations, instead of intermediate objectives.

There has been work in learning state representations usable for model-free RL when provided rewards (Lange et al., 2012; Jonschkowski & Brock, 2015; Jonschkowski et al., 2017; Higgins et al., 2017; de Bruin et al., 2018). The key difference in our work is that we focus on learning representations that can be used for defining metric-based rewards for new tasks, as opposed to just learning state representations for RL from external environment rewards. Learning representations capable of providing metric-based rewards naturally relates to inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004; Finn et al., 2016a; Ho & Ermon, 2016; Baram et al., 2017) and reward shaping (Ng et al., 1999). However, IRL from raw pixels is challenging due to the lack of sufficient constraints in the problem definition. Our work can be viewed as connecting IRL and reward shaping: learning representations amenable to gradient-based trajectory optimization is one way to extract a perceptual reward function. However, we differ significantly from conventional IRL in that our derived reward functions are effective even for new tasks.

From an architectural standpoint, we embed a differentiable planner within our computation graph. The value iteration networks of Tamar et al. (2016) embed an approximate differentiable value iteration computation, though their architecture only supports discrete planning and is evaluated on tasks with sparse state transition probabilities. We seek a more general planning computation for more complex transition dynamics and continuous actions suitable for motor control from raw pixels. Concurrently, a few recent efforts have been developed to embed differentiable planning procedures in computation graphs (Guez et al., 2018; Farquhar et al., 2017). However, to our knowledge, our paper is the first to connect the use of differentiable planning procedures to learning reusable representations that generalize across complex visuomotor tasks.

The idea of planning by gradient descent has existed for decades (Kelley, 1960). While such work relied on known analytic forms of environment dynamics, later work (Schmidhuber, 1990) explored jointly learning approximate models of dynamics with neural networks. Henaff et al. (2017) adopt gradient-based trajectory optimization for model-based planning in discrete action spaces, but rely on representations learned from unsupervised pretraining. Oh et al. (2017) and Silver et al. (2016) have also explored forward predictions in a latent space that is learned by decoding the value function of a state. Our architecture is related insofar as distance to goal in the learned latent space can be viewed as a value function. However, we also differ significantly by not relying on extrinsic rewards and by focusing on continuous control tasks.

4. Experiments

We designed experiments to answer the following questions: (1) does embedding a gradient descent planner help learn a policy that can map from pixels to torque control when provided current and goal observations at test-time? (2) how does our method compare to reactive and autoregressive behavior cloning agents as the amount of training data varies? (3) what are the properties of the representation learned by UPN? (4) how can the learned representations from UPN be leveraged for transfer to new and more complex tasks, compared to representations from standard imitation and unsupervised methods (e.g., a VAE)?

Methods for comparison: We consider two alternative imitation learning approaches for comparison: (1) a reactive imitation learner (RIL), composed of a convolutional feedforward policy that takes as input the current and goal observation; (2) an auto-regressive imitation learner (AIL), composed of a recurrent decoder initially conditioned on convolutionally encoded representations of the current and goal observation, trained to output a sequence of intermediate actions. Both (1) and (2) are methods adopted from Pathak et al. (2018). These comparisons are important for studying the effects of the inductive bias of gradient descent planning that is embedded within UPN. More specifically, comparing to (1) allows us to understand the need for such an inductive bias, while comparing to (2) is necessary to verify that the benefits are not purely due to recurrent computation. All methods are trained on the same synthetically-generated expert demonstration datasets. We refer the reader to the supplementary materials for details on the architectures and dataset generation.

(a) Pointmass config. 1 (b) Pointmass config. 2 (c) Reacher config. 1 (d) Reacher config. 2

Figure 4. Examples of the visuomotor tasks considered for the zero-shot generalization study. We consider two 2D robot models: a force-controlled point robot and a 3-link torque-controlled reacher robot. We consider two types of generalization: fixing the obstacles while varying the target goals (FOVG) and varying both the obstacles and target goals (VOVG). These tasks require non-trivial generalization, combining visual planning with low-level motor control.

(a) Pointmass - VOVG (b) Reacher - VOVG (c) Pointmass - FOVG (d) Reacher - FOVG

Figure 5. Notation: VOVG - Varying Obstacles and Varying Goals; FOVG - Fixed Obstacles and Varying Goals. Success on test tasks as a function of the dataset size. Our approach (UPN) outperforms RIL and AIL consistently across the four generalization conditions considered and is more sample-efficient. As expected, AIL improves with more data to eventually almost match the UPN. This illustrates the tradeoff between inductive bias and expressive architectures when given sufficient data.

4.1. UPNs Learn Effective Imitation Policies

Here, we study the suitability of the UPN for learning visual imitation policies that generalize to new goal-directed tasks. We focus on two tasks: (1) navigating a 2D point robot around obstacles to desired goal locations amidst distractors (Figures 4(a) and 4(b)), wherein the color of the goal is randomized; (2) a harder task of controlling a 3-DoF planar arm to reach goals amidst scattered distractors and obstacles, as shown in Figures 4(c) and 4(d).

For these tasks, we consider two types of generalization: (1) generalizing to new goals for a fixed configuration of obstacles, having trained on the same configuration; (2) generalizing to new goals in new obstacle configurations, having trained across varying obstacle configurations. Figures 4(c) and 4(d) show two different obstacle configurations for the reaching task, while the differently colored locations in Figure 4 represent varying goal locations.

We employ the action selection process described in subsection 2.1 with a chosen maximum episode length. Results shown in Figure 5 compare performance over a varying number of training demonstrations. As expected, the inductive bias of embedding trajectory optimization via gradient descent in UPN supports generalization from fewer demonstrations. With more demonstrations, however, the expressive AIL is able to almost match the performance of the UPN. This is consistent with the conclusions of Tamar et al. (2016), who observed that the benefit of the value iteration inductive bias shrinks in regimes in which demonstrations are plentiful. Note that generalization across obstacle configurations in the reacher case (Figure 4(c)) is a hard task; expert performance is only 73.12%. We encourage the reader to refer to the supplementary for further details about the experiment.

Figure 6. (a) The effect of additional planning steps at test-time. UPN learns an effective gradient descent planner whose convergence improves with more planning steps at test-time. (b) A comparison of the success rate of UPN between 40 and 160 planning steps at test time with a varying number of demonstrations on Reacher VOVG. Using 160 planning steps is consistently better than using 40 steps (though the relative benefit shrinks with more demonstrations) and allows the UPN to match the expert level.

4.2. Analysis of the Gradient Descent Planner

The UPN can be viewed in the context of meta-learning as learning a planning algorithm and its underlying representations. We take inspiration from Finn & Levine (2018), which studied a gradient-based model-agnostic meta-learning algorithm and showed that a classifier trained for few-shot image classification improves in accuracy at test-time with additional gradient updates. In our case, the inner loop is the GDP, which may not necessarily converge due to the fixed number of planning updates. Hence, it is worth studying whether additional test-time GDP updates yield more accurate plans and therefore better success rates.

Planning more helps: Figure 6(a) shows that with more planning steps at test-time, a UPN trained with fewer demonstrations (20000) can improve on task success rate, beginning from 38.1% with 40 planning steps to 64.44% with 160 planning steps. As a reference, the average test success rate of the expert on these tasks is 73.12%, while the best UPN model with 40 planning steps (trained on twice the number of demos (40000)) achieves 64.78%. Thus, with more planning steps, we see that UPN can improve to match the performance of a UPN with fewer planning steps but trained on twice the number of demonstrations. We also find that 160 steps is consistently better than using 40 steps (though the relative benefit shrinks with more demonstrations) and that the UPN is able to match expert performance (Figure 6(b)). This finding suggests that the learned planning objective is well defined, and can likely be reused for related control problems, as we explore in Sections 4.4 and 4.5.

4.3. Latent Space Visualization

We offer a qualitative analysis for studying the acquired latent space for an instance of the reacher with obstacles task. Given the selected initial pose, we compute the distance in the learned f_φ space for 150 random final poses and illustrate these distances qualitatively on the environment arena by color mapping each end-effector position accordingly. The result is shown in Figure 7; lighter blue corresponds to larger distances in the feature space. We see that the learned distance metric is obstacle-aware and task-specific: regions below the initial position in Figure 7 are less desirable even though they are near, while farther regions above are comparatively favorable.

Figure 7. Visualization of the learned metric in the UPN latent space on the reacher with obstacles task. Lighter color → larger latent distance. The learned distance metric is obstacle-aware and supports obstacle avoidance.

Table 1. Average success rate (%) in solving the task described in Figure 4(d) for fixed and varying goals.

FEATURE SPACE        FIXED     VARYING
RIL-RL               0%        0.01%
AIL-RL               0%        4.72%
VAE-RL               20.23%    24.67%
UPN-160 IMITATION    45.82%    47.99%
EXPERT               46.77%    51.1%
UPN-RL               69.84%    71.12%

4.4. Transfer to Harder Scenarios

We have seen in subsections 4.1 and 4.2 that UPNs can learn effective imitation policies that can perform close to the expert level on visuomotor planning tasks. In principle, deriving reward functions from a pre-trained UPN as explained in subsection 2.3 should allow us to extend beyond the capabilities of the expert on harder scenarios where the expert fails. We study this idea in the reaching scenario with the obstacle configuration presented in Figure 4(d). The difference between fixed and varying goals is that for varying goals, we feed in a feature vector of the goal image as an additional input to the RL policy. We use PPO (Schulman et al., 2017) for model-free policy optimization of the rewards derived from the feature space(s).

Though subsection 2.3 explains an RL procedure based on using the f_φ of a UPN, one could use a trained encoder f_φ from other methods, such as our supervised learning comparisons RIL and AIL. In addition to RIL and AIL, a feature space we compare to is an encoder obtained from training a variational auto-encoder (VAE) (Kingma & Welling, 2013) on the images of the demonstrations. This comparison is necessary to judge how useful the feature space of a UPN is for downstream reinforcement learning when compared to pixel-reconstruction methods such as VAEs. In Table 1, we see that reinforcement learning on the feature spaces of RIL and AIL clearly fails, while RL on the UPN feature space is better compared to that of a VAE. We also see that UPN-RL is able to outperform the expert and the imitating UPN-160.
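As an illustration of how such an agent could be assembled, the sketch below wraps an environment so that a standard model-free learner (e.g., PPO) receives the proprioceptive state s_t concatenated with f_φ(o_g) as its observation and the UPN latent distance as its reward, as in Figure 3. It assumes the classic gym Wrapper API (4-tuple step) and an environment exposing a proprioceptive state and an image observation via hypothetical helpers get_state() and render_obs(); these names are placeholders, not part of any specific benchmark, and observation/action space bookkeeping is omitted.

    import gym
    import numpy as np
    import torch

    class UPNRewardWrapper(gym.Wrapper):
        """Replaces the environment reward with the negative UPN latent distance to a
        goal image, and appends the goal's UPN features to the proprioceptive state."""

        def __init__(self, env, f_phi, goal_image):
            super().__init__(env)
            self.f_phi = f_phi.eval()
            with torch.no_grad():
                self.goal_feat = f_phi(goal_image.unsqueeze(0)).squeeze(0)

        def _observation(self):
            s_t = self.env.get_state()            # joint angles / velocities (placeholder)
            return np.concatenate([s_t, self.goal_feat.numpy()])

        def _reward(self):
            o_t = self.env.render_obs()           # current image observation (placeholder)
            with torch.no_grad():
                feat = self.f_phi(o_t.unsqueeze(0)).squeeze(0)
            return -torch.sum((feat - self.goal_feat) ** 2).item()

        def reset(self, **kwargs):
            self.env.reset(**kwargs)
            return self._observation()

        def step(self, action):
            _, _, done, info = self.env.step(action)
            return self._observation(), self._reward(), done, info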



Figure 8. Transfer between robots, as described in subsection 4.5. (a) Transfer to a more complex topology (Reacher). (b) Transfer to a new morphology (Point to Ant).

4.5. Transfer Across Robots

Having seen the success of reinforcement learning using rewards derived from UPN representations in subsection 4.4, we pose a harder problem in this subsection: can we leverage UPN representations trained on some source task(s) to provide rewards for target task(s) with significantly different dynamics and action spaces? We propose to do this by training and testing with different robots (morphological variations) on the same desired functionality (reaching / locomotion, around obstacles). This study will highlight the extrapolative generalizability of UPN representations. The idea of trajectory optimization with a learned metric is a fundamental prior that can hold across a large class of visuomotor control problems. Having trained UPN to learn such a prior, it is natural to expect the underlying representation to be amenable to providing suitable metric-based rewards for similar but unseen tasks. We craft two challenging experimental scenarios to verify this hypothesis.

Reacher with new morphology: Having trained a UPN with a shared f_φ and different g_θ for a 3-link and a 4-link reacher (on the obstacles task), can we leverage the learned f_φ to specify rewards for reinforcement-learning a 5-link reacher to reach different goals around the same obstacles? Figure 8(a) visually depicts this experiment. To the best of our knowledge, such a transfer scenario has not previously been studied for visuomotor control. The dynamics of a 5-link reacher are more complex (compared to 3- and 4-link reachers), thereby posing a harder control problem to solve at test time. However, a good path-planning reward function learned from 3- and 4-link reachers is likely to help for a 5-link reacher due to morphological similarities. We train the UPN on both the 3- and 4-link reachers to avoid overfitting the learned metric to a specific dynamical system. As comparison methods, we train RIL and AIL (with a multi-task (head) architecture), and a VAE (jointly on images from both tasks).

Point to Ant: High-level visual planning for navigation should be common across different robots, from a 2D point robot controlled through simple forces to a robot as complex as an 8-joint quadruped ant. We empirically confirm this via an experiment illustrated in Figure 8(b). Here, we learn representations with a UPN on demonstrations collected from a 2D point robot trained to traverse obstacles to reach varying goals. We randomize the robot's morphological appearance across demonstrations (Figure 8(b)), inspired by Sadeghi & Levine (2016); Tobin et al. (2017). This allows the UPN to learn an encoder f_φ that is robust to the creature's appearance. We then replace the point robot with a 3D torque-controlled ant to study how well the learned representation transfers to a harder task.

In Figure 9, we see that for both transfer scenarios, RL with rewards from the UPN representation is more successful compared to other feature spaces (VAE, AIL, and RIL). Additionally, we also compare the UPN-RL setup to a naïve RL agent optimizing a spatial distance to goal in the coordinate space as the reward; this procedure assumes that the spatial position of the goal is known, unlike UPN, RIL, AIL, and VAE, where the feature vector of the goal image is provided as input. We note that this method performs poorly, as expected, because the spatial distance in the coordinate space is not obstacle-aware. These results show that optimizing for the rewards derived from UPN correlates with task success, supporting our claim that UPNs learn generalizable and transferrable latent spaces.

Figure 9. RL with rewards from the UPN representation is significantly more successful compared to other feature spaces (VAE, AIL, RIL, shaped rewards), suggesting that UPNs learn transferrable, generalizable latent spaces. (a) Point robot to Ant transfer. (b) Reacher transfer.

5. Discussion

We studied the problem of learning generalizable representations for visuomotor control and introduced universal planning networks, a goal-directed policy architecture with an embedded differentiable planner that can be trained end-to-end. Our extensive experiments demonstrated that (1) UPNs learn effective visual goal-directed policies efficiently; (2) UPN latent representations can be leveraged to transfer task-related semantics to more complex agents and more challenging tasks through goal-conditioned reward functions; and (3) the learned planner improves with more updates at test-time, providing encouraging evidence of meta-learning for planning. Our transfer learning results demonstrate that UPNs learn generic representations with task-specific structure of agency and goals useful for planning. Future work should investigate self-supervised ways to train UPN-like representations, borrowing ideas from Weber et al. (2017).

Acknowledgements

We thank Aviv Tamar, John Schulman, Rocky Duan, Alyosha Efros and colleagues from the Robot Learning Lab at Berkeley for helpful feedback. This research was supported by an ONR PECASE N000141612723AS grant, NVIDIA, AWS, Berkeley EECS and Berkeley AI fellowships, and an NSF GRFP award.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Baram, N., Anschel, O., Caspi, I., and Mannor, S. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pp. 390–399, 2017.

de Bruin, T., Kober, J., Tuyls, K., and Babuska, R. Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, PP(99):1–1, 2018. doi: 10.1109/LRA.2018.2800101.

Deguchi, K. and Takahashi, I. Image-based simultaneous control of robot and target object motions by direct-image-interpretation method. In Intelligent Robots and Systems (IROS'99), 1999 IEEE/RSJ International Conference on, volume 1, pp. 375–380. IEEE, 1999.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. arXiv preprint arXiv:1710.11417, 2017.

Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations, 2018.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016a.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 512–519. IEEE, 2016b.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. 2016.

Guez, A., Weber, T., Antonoglou, I., Simonyan, K., Vinyals, O., Wierstra, D., Munos, R., and Silver, D. Learning to search with MCTSnets, 2018. URL https://openreview.net/forum?id=r1TA9ZbA-.

Henaff, M., Whitney, W. F., and LeCun, Y. Model-based planning in discrete action spaces. arXiv preprint arXiv:1705.07177, 2017.

Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C. P., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Jonschkowski, R. and Brock, O. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.

Jonschkowski, R., Hafner, R., Scholz, J., and Riedmiller, M. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.

Kelley, H. J. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Lange, S., Riedmiller, M., and Voigtlander, A. Autonomous reinforcement learning on raw visual input data in a real world application. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–8. IEEE, 2012.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning. Citeseer, 2000.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6120–6130, 2017.

Pathak*, D., Mahmoudieh*, P., Luo*, M., Agrawal*, P., Chen, D., Shentu, F., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkisuzWRW.

Pinto, L., Gandhi, D., Han, Y., Park, Y.-L., and Gupta, A. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pp. 3–18. Springer, 2016.

Sadeghi, F. and Levine, S. CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.

Schmidhuber, J. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, pp. 253–258. IEEE Press, 1990.

Schulman, J. Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs. PhD thesis, EECS Department, University of California, Berkeley, Dec 2016. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-217.html.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sermanet, P., Xu, K., and Levine, S. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.

Sermanet, P., Lynch, C., Hsu, J., and Levine, S. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.

Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE, 2012.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30. IEEE, 2017.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.