
Universal Planning Networks

Aravind Srinivas 1  Allan Jabri 1  Pieter Abbeel 1  Sergey Levine 1  Chelsea Finn 1

1 UC Berkeley, Computer Science. Correspondence to: Aravind Srinivas.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

A key challenge in complex visuomotor control is learning abstract representations that are effective for specifying goals, planning, and generalization. To this end, we introduce universal planning networks (UPN). UPNs embed differentiable planning within a goal-directed policy. This planning computation unrolls a forward model in a latent space and infers an optimal action plan through gradient descent trajectory optimization. The plan-by-gradient-descent process and its underlying representations are learned end-to-end to directly optimize a supervised imitation learning objective. We find that the representations learned are not only effective for goal-directed visual imitation via gradient-based trajectory optimization, but can also provide a metric for specifying goals using images. The learned representations can be leveraged to specify distance-based rewards to reach target states for model-free reinforcement learning, resulting in substantially more effective learning when solving new tasks described via image-based goals. Visit https://sites.google.com/view/-public/home for video highlights.

Figure 1. An overview of the UPN, which embeds a gradient descent planner (GDP) in the action-selection process. We demonstrate transfer to different, harder control tasks, including morphological (yellow point robot to ant) and topological (3-link to 7-link reacher) variants, as shown above. (a) Universal Planning Network (UPN). (b) Leveraging learned latent representations: demonstrations for training tasks → learn UPN representations → derive reward functions from the learned metric → reinforcement learning on harder tasks.

1. Introduction

Learning visuomotor policies is a central pursuit in building machines capable of performing complex skills in the variety of unstructured and dynamic environments seen in the real world (Levine et al., 2016; Pinto et al., 2016). A key challenge in learning such policies lies in acquiring representations of the visual environment and its dynamics that are suitable for control. This challenge arises both in the construction of the policy itself and in the specification of the task. Extrinsic and perfect reward signals are typically not available for real-world reinforcement learning, and users must manually specify tasks via hand-crafted rewards with hand-crafted representations. To automate this process, some prior methods have proposed to specify tasks by providing an image of the goal scene (Deguchi & Takahashi, 1999; Watter et al., 2015; Finn et al., 2016b). However, a reward that measures success based on matching the raw pixels of the goal image is far from ideal: such a reward is both uninformative and overconstrained, since matching all pixels is usually not required for succeeding in tasks. If we can automatically identify the right representation, we can both accelerate the policy learning process and simplify the specification of tasks via goal images.

Prior work in visual representation learning for planning and control has relied predominantly on unsupervised or self-supervised objectives (Watter et al., 2015; Finn et al., 2016b), which in principle only provide an indirect connection to the utility of the representation for the underlying control problem. Effective representation learning for planning and control remains an open problem.

In this work, instead of learning from unsupervised or auxiliary objectives and expecting that useful representations should emerge, we directly optimize for plannable representations; that is, representations through which gradient-based planning is successful with respect to the goal-directed task. We propose universal planning networks (UPN), a neural network architecture that can be trained to acquire a plannable representation. By embedding a differentiable planning computation inside the policy, our method enables joint training of the planner and its underlying latent encoder and forward dynamics representations. An outer imitation learning objective ensures that the learned representations are directly optimized for successful gradient-based planning on a set of training demonstrations. In principle, however, the architecture could also be trained with other policy optimization techniques, such as those from reinforcement learning. An overview is provided in Figure 1(a).

We demonstrate that the representations learned by UPN not only support gradient-based trajectory optimization for successful visual imitation, but in fact acquire a meaningful encoding of state, which can be used as a metric for task-specific latent distance to a goal. We find that we can reuse this representation to specify latent distance-based rewards to reach new target states via standard model-free reinforcement learning, resulting in substantially more effective learning when using image targets. These properties are naturally induced by the agent's reliance on the minimization of the latent distance between its predicted terminal state and goal state throughout the planning process. By learning plannable representations, the UPN learns an optimizable latent distance metric. Our findings are based on a suite of challenging vision-based simulated robot control tasks that involve planning.

At a high level, our approach is a goal-conditioned policy architecture that leverages a gradient-based planning computation in its action-selection process. While the architecture is agnostic to the objective function in the outer loop, we will focus on the imitation learning setting. From the perspective of representation learning, our method provides a way to learn more effective representations suitable for specifying perceptual reward functions, which can then be used, for example, with a model-free reinforcement learner. In terms of meta-learning, our architecture can be seen as learning a planning computation by learning representations that are in some sense traversable by gradient descent trajectory optimization for satisfying the outer meta-objective.

In extensive experiments, we show that (1) UPNs learn effective visual goal-directed policies more efficiently (that is, with less data) than traditional imitation learners; (2) the latent representations induced by optimizing for successful planning can be leveraged to transfer task-related semantics to other agents for more challenging tasks through goal-conditioned reward functions, which to our knowledge has not previously been demonstrated; and (3) the learned planning computation improves when allowed more updates at test time, even in scenarios of less data, providing encouraging evidence of successful meta-learning for planning.

2. Universal Planning Networks

Model-based approaches leverage forward models to search for, or plan, sequences of actions to achieve goal states such that a planning objective is minimized. Forward modeling supports simulation of future states and hence, in principle, should allow for planning over extended horizons. In the absence of known environment dynamics, a forward model must be learned. Differentiable forward models allow for end-to-end training of model-based planners, as well as planning by back-propagating gradients with respect to input actions (Schmidhuber, 1990; Henaff et al., 2017).

Nevertheless, learned forward models may: (1) suffer from function approximation and modeling error, especially in complex, high-dimensional environments; (2) capture irrelevant details under the incentive to reduce model bias, as is often the case when learning directly from pixels; and (3) not necessarily align with the task and planning problem at hand, such that the inferred plans are sub-optimal even if the planning objective is optimized.

These issues motivate a central idea of the proposed method: instead of learning from surrogate unsupervised or auxiliary objectives, we directly optimize for what we care about, namely representations with which gradient-based trajectory optimization leads to the desired actions. We study a model-based architecture that performs a differentiable planning computation in a latent space jointly learned with forward dynamics, trained end-to-end to encode what is necessary for solving tasks by gradient-based planning.

2.1. Learning to Plan

The UPN computation graph forms a goal-directed policy supported by an iterative planning algorithm. Given initial and goal observations (o_t and o_g) as input images, the model produces an optimal plan â_{t:t+T} to arrive at o_g, where â_t denotes the predicted action at time t. The computation graph consists of a pair of tied encoders that encode both o_t and o_g, and their features are fed into a gradient descent planner (GDP), which produces the action a_t as output. The GDP uses a neural network encoder and forward dynamics model to simulate transitions in a learned latent space and is thus fully differentiable. An overview of the method is presented in Figure 2.
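To make the structure of what follows concrete, the planner and its training signal can be summarized as a nested optimization. This is a restatement of the objectives defined in Algorithms 1 and 2 below, not an additional assumption; the notation (f_φ, g_θ, α, n_p) is introduced in the remainder of this section.

Inner loop (gradient-descent planning in the learned latent space, with x_t = f_φ(o_t), x_g = f_φ(o_g), and x̂^(i)_{t+j+1} = g_θ(x̂^(i)_{t+j}, â^(i)_{t+j})):

\[
\hat{a}^{(i+1)}_{t:t+T} \;=\; \hat{a}^{(i)}_{t:t+T} \;-\; \alpha \,\nabla_{\hat{a}^{(i)}_{t:t+T}} \left\| \hat{x}^{(i)}_{t+T+1} - x_g \right\|_2^2
\]

Outer loop (imitation of expert actions using the plan returned after n_p inner updates):

\[
\min_{\theta,\phi}\;\; \mathbb{E}_{(o_t,\, o_g,\, a^{*}_{t:t+T})} \left\| \hat{a}^{(n_p)}_{t:t+T}(o_t, o_g;\, \theta, \phi) - a^{*}_{t:t+T} \right\|_2^2
\]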


Figure 2. An overview of the proposed method. Given an initial o_t and a goal o_g, the GDP (gradient descent planner) uses gradient descent to optimize a plan to reach the goal observation with a sequence of actions in a latent space represented by f_φ. This planning process forms one large computation graph, chaining together the sub-graphs of each iteration of planning. The learning signal is derived from the (outer) imitation loss and the gradient is back-propagated through the entire planning computation graph. The blue lines represent the flow of gradients for planning, while the red lines depict the meta-optimization learning signal and the components of the architecture affected by it. Note that the GDP iteratively plans across n_p updates, as indicated by the i-th loop.

The GDP uses gradient descent to optimize for a sequence of actions â_{t:t+T} to reach the encoded goal observation o_g from an initial o_t. Since the model is differentiable, back-propagation through time allows for computing the gradient with respect to each planned action in order to end up closer to the desired goal state. Each iteration of the GDP thus involves unrolling the trajectory of latent state encodings using the current planned actions, and taking a step along the gradient to improve the planning objective. The cumulative planning process forms a large, differentiable computation graph, chaining together each iteration of planning.

The actual learning signal is derived from an outer loss function, which supervises the entire computation graph (including the GDP) to output the correct action sequence. The outer loss can in principle take any form, but in this work we use an imitation learning loss derived from demonstrations. The outer loss provides task-specific grounding to optimize for representations that support effective iterative planning for the task and environment at hand, as the gradient is back-propagated through the entire iterative planning computation graph.

Training thus involves nested objectives. One can view the learning process as first deriving a plan to achieve the goal and then updating the model parameters to make the planning procedure more effective for the outer objective. In other words, we seek to learn the planning computation through its underlying representations for latent state encoding and latent forward dynamics.

Parameters: The model is composed of a forward dynamics model g_θ and an encoder f_φ, where θ and φ are neural network parameters that are learned end-to-end:

    x_t = f_φ(o_t),    x̂_{t+1} = g_θ(x_t, a_t)

Specifically, f_φ is a convolutional network and g_θ is a fully connected network. Further architectural details can be found in the supplementary. (Footnote: we note that one could also use an action encoder h_α(a_t) = u_t, with g_θ operating on x_t and u_t. A temporal encoder h would allow for abstract sequences of actions (options), for an option-conditioned latent forward model g_θ. We work with flat sequences of actions, leaving hierarchical extensions for future work.)

Planning by Gradient Descent: The planner starts with an element-wise randomly initialized plan â^(0)_{t:t+T} ~ U(-1, 1) and aims to minimize the distance between the predicted terminal latent state and the encoded goal observation. T denotes the horizon over which the agent plans, which can depend on the task and hence may be treated as a hyperparameter, while n_p is the number of planning updates performed. Algorithm 1 describes the iterative optimization procedure implemented by the GDP.

Huber Loss: In practice, for L^(i)_plan, we use a Huber loss centered around x_g for well-behaved inner-loop gradients, instead of a direct quadratic ||x̂^(i)_{t+T+1} − x_g||_2^2. This usage is inspired by the Deep Q-Networks paper of Mnih et al. (2015), and similar metrics have also been used by Levine et al. (2016) and Sermanet et al. (2017).
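As a concrete illustration of this parameterization, the sketch below shows one way f_φ, g_θ, the latent rollout, and the Huber planning loss could be written in PyTorch. The layer sizes, image channels, latent dimension, and the use of PyTorch's smooth_l1_loss as the Huber loss are illustrative assumptions, not the exact architecture from the supplementary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder(nn.Module):
        """f_phi: encodes an image observation o into a latent state x."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
            )
            self.fc = nn.LazyLinear(latent_dim)  # infers the flattened conv size

        def forward(self, obs):                  # obs: (batch, 3, H, W)
            return self.fc(self.conv(obs).flatten(start_dim=1))

    class ForwardModel(nn.Module):
        """g_theta: predicts the next latent state from (x_t, a_t)."""
        def __init__(self, latent_dim=128, action_dim=2, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, latent_dim),
            )

        def forward(self, x, a):
            return self.net(torch.cat([x, a], dim=-1))

    def plan_loss(f_phi, g_theta, o_t, o_g, actions):
        """Unrolls g_theta in latent space under `actions` (batch, T+1, action_dim)
        and returns the Huber planning loss between the predicted terminal latent
        state and the encoded goal."""
        x, x_g = f_phi(o_t), f_phi(o_g)
        for a in actions.unbind(dim=1):
            x = g_theta(x, a)
        return F.smooth_l1_loss(x, x_g)  # Huber loss centered at x_g

Note that the rollout never decodes back to pixels; all planning happens in the learned latent space.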

Algorithm 1  GDP(o_t, o_g, α) → â_{t:t+T}
Require: α: hyperparameter for step size
  Randomize an initial guess for the optimal plan â^(0)_{t:t+T}
  Compute x_t = f_φ(o_t), x_g = f_φ(o_g)
  for i from 0 to n_p − 1 do
    for j from 0 to T do
      x̂^(i)_{t+j+1} = g_θ(x̂^(i)_{t+j}, â^(i)_{t+j})
    end for
    Compute L^(i)_plan = ||x̂^(i)_{t+T+1} − x_g||_2^2
    Update plan: â^(i+1)_{t:t+T} = â^(i)_{t:t+T} − α ∇_{â^(i)_{t:t+T}} L^(i)_plan
  end for
  Return â^(n_p)_{t:t+T}

Algorithm 2  Learning the Planner via Imitation
Require: GDP(o_t, o_g, α), expert a*_{t:t+T}, step sizes α, β
  for n from 1 to N do
    Sample a batch of demonstrations (o_t, o_g, a*_{t:t+T})
    Compute â_{t:t+T} = GDP(o_t, o_g, α)
    Compute L_imitate = ||â_{t:t+T} − a*_{t:t+T}||_2^2
    Update θ := θ − β ∇_θ L_imitate
    Update φ := φ − β ∇_φ L_imitate
  end for

Action selection at test-time: At test-time, Algorithm 1 can be used to produce a sequence of actions. A more sophisticated approach is to use Algorithm 1 to re-plan at each timestep. The agent first plans a trajectory suitable to reach o_g from o_t, but only executes the first action before re-planning. This allows the agent to achieve goals requiring longer planning horizons at test-time, even if the GDP was trained with a shorter horizon. This amounts to using model-predictive control (MPC) over our learned planner.

2.2. Imitation as the Outer Objective

An idea central to our approach is to directly optimize the planning computation for the task at hand, through the outer objective. Though in this work we study the use of an imitation loss as the outer objective, the policy can in principle be trained through any gradient-based policy search method, including policy gradients (Schulman, 2016) and value functions (Sutton & Barto, 1998).

To learn parameters φ and θ, we do not directly optimize the planning error under L_plan, but instead learn the planner insofar as it can imitate an expert agent by iteratively applying L_plan (Algorithm 2). The model is therefore trained to plan in such a way as to produce actions that match the expert demonstrations.

Note that the subroutine GDP(o_t, o_g, α) is an accumulated computation graph composed of several iterations of planning, each of which includes encoding observations and unrolling of latent forward dynamics through time. Learning end-to-end thus requires that we back-propagate the behavior cloning loss under the produced plan through the GDP subroutine, as depicted in Figure 2. We note that the gradients obtained on the network parameters θ and φ from the outer objective are composed of first-order derivatives of these parameters. Therefore, even though the computation graph of UPN may seem long and complicated, it is not prohibitively expensive to compute.

In learning to plan via imitation, the agent jointly optimizes for latent state and dynamics representations that capture notions of state comparison useful for the imitation task and that are traversable by gradient descent for trajectory optimization. This is naturally induced by the agent's reliance on the minimization of the latent distance between its predicted terminal state and goal state throughout the planning process. Thus, in necessitating plannable representations, the encoder learns an optimizable latent distance metric. This is key to the viability of using the learned latent space as a metric from which to derive reward functions for reinforcement learning.
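The following PyTorch-style sketch shows how Algorithms 1 and 2 could be realized on top of the Encoder/ForwardModel/plan_loss sketch above. The essential detail is create_graph=True in the inner gradient step, which keeps each planning update differentiable so the outer imitation loss can back-propagate through the entire planner. The optimizer is assumed to hold the parameters of both networks; batch shapes and the choice of optimizer are illustrative, while the step sizes and the uniform plan initialization follow the text.

    import torch
    import torch.nn.functional as F

    def gdp(f_phi, g_theta, o_t, o_g, T, n_p, alpha, action_dim):
        """Algorithm 1: gradient descent planner. Returns a (batch, T+1, action_dim)
        plan after n_p differentiable inner updates on the planning loss."""
        batch = o_t.shape[0]
        # Element-wise random initialization of the plan in U(-1, 1).
        actions = torch.empty(batch, T + 1, action_dim,
                              device=o_t.device).uniform_(-1.0, 1.0)
        actions.requires_grad_(True)
        for _ in range(n_p):
            loss = plan_loss(f_phi, g_theta, o_t, o_g, actions)
            # create_graph=True keeps the update differentiable w.r.t. theta and phi.
            (grad,) = torch.autograd.grad(loss, actions, create_graph=True)
            actions = actions - alpha * grad
        return actions

    def imitation_step(f_phi, g_theta, optimizer, o_t, o_g, expert_actions,
                       T, n_p, alpha):
        """Algorithm 2: one outer update of theta and phi on a demonstration batch."""
        plan = gdp(f_phi, g_theta, o_t, o_g, T, n_p, alpha,
                   action_dim=expert_actions.shape[-1])
        loss = F.mse_loss(plan, expert_actions)   # || a_hat - a* ||^2
        optimizer.zero_grad()
        loss.backward()                           # back-propagates through the planner
        optimizer.step()
        return loss.item()

At test time, the same gdp routine can be called with a larger n_p than was used during training, or wrapped in the MPC-style re-planning described above, where only the first action of each returned plan is executed.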

2.3. RL with Rewards from a pre-trained UPN

A potential solution to the reward specification problem for image targets is to use distance to the target image in an abstract representation. Previous work such as Watter et al. (2015) and Finn et al. (2016b) propose unsupervised learning with autoencoders, while others attempt to fine-tune representations from ImageNet using auxiliary losses tailor-made for robotic manipulation (Sermanet et al., 2017). With the UPN having been trained to acquire plannable representations, we might naturally expect the learned latent space encoded by f_φ to serve well as an abstract representation through which rewards can be specified for RL on visuomotor tasks with image targets from scratch. We can leverage the learned f_φ from a pre-trained UPN to provide reward functions of the form r(o_t, o_g) = −||f_φ(o_t) − f_φ(o_g)||_2^2. The RL procedure is portrayed in Figure 3. As we will see in subsection 4.4, this enables transferring general task semantics learned by the UPN such that other RL agents of different form can be trained with RL from scratch.

Figure 3. UPN-RL Agent: The rewards are based on the difference between o_t and o_g in the UPN latent space, while the policy is conditioned on joint angles and velocities specific to the agent, s_t, and the feature vector of the goal, f_φ(o_g). The agent has to reason about the goals and how to achieve them.
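A minimal sketch of this reward, assuming a trained Encoder f_phi as in the earlier sketches (whether the latent distance should additionally be scaled or clipped is not specified here):

    import torch

    @torch.no_grad()
    def upn_reward(f_phi, o_t, o_g):
        """r(o_t, o_g) = -|| f_phi(o_t) - f_phi(o_g) ||_2^2 for a batch of observations."""
        diff = f_phi(o_t) - f_phi(o_g)
        return -(diff ** 2).sum(dim=-1)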

3. Related Work

Our work is primarily concerned with learning representations that can support planning for tasks described through an image target. Watter et al. (2015) and Finn et al. (2016b) take an unsupervised learning approach to learning such representations, which they use for planning with respect to target images using iLQR (Tassa et al., 2012). However, reconstructing all the pixels in the scene could lead to the encoding of state variables not necessarily useful in the context of planning (Higgins et al., 2017) and discard state variables that are not visually prominent (Goodfellow et al. (2016), Chapter 15). Our approach avoids this problem by explicitly optimizing a representation for plannability through gradient descent as the only criterion. Self-supervised methods that avoid pixel reconstruction by using other intermediate forms of supervision that can be obtained automatically from the data have also been used to learn representations for visuomotor control (Sermanet et al., 2016; 2017). We again differ by optimizing directly for what we need: plannable representations, instead of intermediate objectives.

There has been work in learning state representations usable for model-free RL when provided rewards (Lange et al., 2012; Jonschkowski & Brock, 2015; Jonschkowski et al., 2017; Higgins et al., 2017; de Bruin et al., 2018). The key difference in our work is that we focus on learning representations that can be used for defining metric-based rewards for new tasks, as opposed to just learning state representations for RL from external environment rewards. Learning representations capable of providing metric-based rewards naturally relates to inverse reinforcement learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004; Finn et al., 2016a; Ho & Ermon, 2016; Baram et al., 2017) and reward shaping (Ng et al., 1999). However, IRL from raw pixels is challenging due to the lack of sufficient constraints in the problem definition. Our work can be viewed as connecting IRL and reward shaping: learning representations amenable to gradient-based trajectory optimization is one way to extract a perceptual reward function. However, we differ significantly from conventional IRL in that our derived reward functions are effective even for new tasks.

From an architectural standpoint, we embed a differentiable planner within our computation graph. The value iteration networks of Tamar et al. (2016) embed an approximate differentiable value iteration computation, though their architecture only supports discrete planning and is evaluated on tasks with sparse state transition probabilities. We seek a more general planning computation for more complex transition dynamics and continuous actions suitable for motor control from raw pixels. Concurrently, a few recent efforts have been developed to embed differentiable planning procedures in computation graphs (Guez et al., 2018; Farquhar et al., 2017). However, to our knowledge, our paper is the first to connect the use of differentiable planning procedures to learning reusable representations that generalize across complex visuomotor tasks.

The idea of planning by gradient descent has existed for decades (Kelley, 1960). While such work relied on known analytic forms of environment dynamics, later work (Schmidhuber, 1990) explored jointly learning approximate models of dynamics with neural networks. Henaff et al. (2017) adopt gradient-based trajectory optimization for model-based planning in discrete action spaces, but rely on representations learned from unsupervised pretraining. Oh et al. (2017) and Silver et al. (2016) have also explored forward predictions in a latent space that is learned by decoding the value function of a state. Our architecture is related insofar as distance to goal in the learned latent space can be viewed as a value function. However, we also differ significantly by not relying on extrinsic rewards and by focusing on continuous control tasks.

4. Experiments

We designed experiments to answer the following questions: (1) does embedding a gradient descent planner help learn a policy that can map from pixels to torque control when provided current and goal observations at test-time? (2) how does our method compare to reactive and autoregressive behavior cloning agents as the amount of training data varies? (3) what are the properties of the representation learned by UPN? (4) how can the learned representations from UPN be leveraged for transfer to new and more complex tasks, compared to representations from standard imitation and unsupervised methods (e.g., a VAE)?

Methods for comparison: We consider two alternative imitation learning approaches for comparison: (1) a reactive imitation learner (RIL), composed of a convolutional feedforward policy that takes as input the current and goal observation; (2) an auto-regressive imitation learner (AIL), composed of a recurrent decoder initially conditioned on convolutionally encoded representations of the current and goal observation, trained to output a sequence of intermediate actions. Both (1) and (2) are methods adopted from Pathak et al. (2018). These comparisons are important for studying the effects of the inductive bias of gradient descent planning that is embedded within UPN. More specifically, comparing to (1) allows us to understand the need for such an inductive bias, while comparing to (2) is necessary to verify that the benefits are not purely due to recurrent computation. All methods are trained on the same synthetically-generated expert demonstration datasets. We refer the reader to the supplementary materials for details on the architectures and dataset generation.

(a) Pointmass config. 1 (b) Pointmass config. 2 (c) Reacher config. 1 (d) Reacher config. 2

Figure 4. Examples of the visuomotor tasks considered for the zero-shot generalization study. We consider two 2D robot models: a force-controlled point robot and a 3-link torque-controlled reacher robot. We consider two types of generalization: fixing the obstacles while varying the target goals (FOVG) and varying both the obstacles and target goals (VOVG). These tasks require non-trivial generalization, combining visual planning with low-level motor control.

(a) Pointmass - VOVG (b) Reacher - VOVG (c) Pointmass - FOVG (d) Reacher - FOVG

Figure 5. Notation: VOVG - Varying Obstacles and Varying Goals; FOVG - Fixed Obstacles and Varying Goals. Success on test tasks as a function of the dataset size. Our approach (UPN) outperforms RIL and AIL consistently across the four generalization conditions considered and is more sample-efficient. As expected, AIL improves with more data to eventually almost match the UPN. This illustrates the tradeoff between inductive bias and expressive architectures when given sufficient data.

4.1. UPNs Learn Effective Imitation Policies

Here, we study the suitability of the UPN for learning visual imitation policies that generalize to new goal-directed tasks. We focus on two tasks: (1) navigating a 2D point robot around obstacles to desired goal locations amidst distractors (Figures 4(a) and 4(b)), wherein the color of the goal is randomized; (2) a harder task of controlling a 3-DoF planar arm to reach goals amidst scattered distractors and obstacles, as shown in Figures 4(c) and 4(d).

For these tasks, we consider two types of generalization: (1) generalizing to new goals for a fixed configuration of obstacles, having trained on the same configuration; (2) generalizing to new goals in new obstacle configurations, having trained across varying obstacle configurations. Figures 4(c) and 4(d) show two different obstacle configurations for the reaching task, while the differently colored locations in Figure 4 represent varying goal locations.

We employ the action selection process described in subsection 2.1 with a chosen maximum episode length. Results shown in Figure 5 compare performance over a varying number of training demonstrations. As expected, the inductive bias of embedding trajectory optimization via gradient descent in UPN supports generalization from fewer demonstrations. With more demonstrations, however, the expressive AIL is able to almost match the performance of the UPN. This is consistent with the conclusions of Tamar et al. (2016), who observed that the benefit of the value iteration inductive bias shrinks in regimes in which demonstrations are plentiful. Note that generalization across obstacle configurations in the reacher case (Figure 4(c)) is a hard task; expert performance is only 73.12%. We encourage the reader to refer to the supplementary for further details about the experiment.

Figure 6. (a) The effect of additional planning steps at test-time. UPN learns an effective gradient descent planner whose convergence improves with more planning steps at test-time. (b) A comparison of the success rate of UPN between 40 and 160 planning steps at test time with a varying number of demonstrations on Reacher VOVG. Using 160 planning steps is consistently better than using 40 steps (though the relative benefit shrinks with more demonstrations) and allows the UPN to match the expert level.

4.2. Analysis of the Gradient Descent Planner

The UPN can be viewed in the context of meta-learning as learning a planning algorithm and its underlying representations. We take inspiration from Finn & Levine (2018), which studied a gradient-based model-agnostic meta-learning algorithm and showed that a classifier trained for few-shot image classification improves in accuracy at test-time with additional gradient updates. In our case, the inner loop is the GDP, which may not necessarily converge due to the fixed number of planning updates. Hence, it is worth studying whether additional test-time GDP updates yield more accurate plans and therefore better success rates.

Planning more helps: Figure 6(a) shows that with more planning steps at test-time, a UPN trained with fewer demonstrations (20000) can improve on task success rate, beginning from 38.1% with 40 planning steps to 64.44% with 160 planning steps. As a reference, the average test success rate of the expert on these tasks is 73.12%, while the best UPN model with 40 planning steps (trained on twice the number of demos (40000)) achieves 64.78%. Thus, with more planning steps, we see that UPN can improve to match the performance of a UPN with fewer planning steps but trained on twice the number of demonstrations. We also find that 160 steps is consistently better than using 40 steps (though the relative benefit shrinks with more demonstrations) and that the UPN is able to match expert performance (Figure 6(b)). This finding suggests that the learned planning objective is well defined, and can likely be reused for related control problems, as we explore in Sections 4.4 and 4.5.

4.3. Latent Space Visualization

We offer a qualitative analysis for studying the acquired latent space for an instance of the reacher with obstacles task. Given the selected initial pose, we compute the distance in the learned f_φ space for 150 random final poses and illustrate these distances qualitatively on the environment arena by color mapping each end-effector position accordingly. The result is shown in Figure 7; lighter blue corresponds to larger distances in the feature space. We see that the learned distance metric is obstacle-aware and task-specific: regions below the initial position in Figure 7 are less desirable even though they are near, while farther regions above are comparatively favorable.

Figure 7. Visualization of the learned metric in the UPN latent space on the reacher with obstacles task. Lighter color → larger latent distance. The learned distance metric is obstacle-aware and supports obstacle avoidance.

Table 1. Average success rate (%) in solving the task described in Figure 4(d) for fixed and varying goals.

FEATURE SPACE        FIXED     VARYING
RIL-RL               0%        0.01%
AIL-RL               0%        4.72%
VAE-RL               20.23%    24.67%
UPN-160 IMITATION    45.82%    47.99%
EXPERT               46.77%    51.1%
UPN-RL               69.84%    71.12%

4.4. Transfer to Harder Scenarios

We have seen in subsections 4.1 and 4.2 that UPNs can learn effective imitation policies that can perform close to the expert level on visuomotor planning tasks. In principle, deriving reward functions from a pre-trained UPN as explained in subsection 2.3 should allow us to extend beyond the capabilities of the expert on harder scenarios where the expert fails. We study this idea in the reaching scenario with the obstacle configuration presented in Figure 4(d). The difference between fixed and varying goals is that for varying goals, we feed in a feature vector of the goal image as an additional input to the RL policy. We use PPO (Schulman et al., 2017) for model-free policy optimization of the rewards derived from the feature space(s).

Though subsection 2.3 explains an RL procedure based on using the f_φ of a UPN, one could use a trained encoder f_φ from other methods, such as our supervised learning comparisons RIL and AIL. In addition to RIL and AIL, a feature space we compare to is an encoder obtained from training a variational auto-encoder (VAE) (Kingma & Welling, 2013) on the images of the demonstrations. This comparison is necessary to judge how useful the feature space of a UPN is for downstream reinforcement learning when compared to pixel-reconstruction methods such as VAEs. In Table 1, we see that reinforcement learning on the feature spaces of RIL and AIL clearly fails, while RL on the UPN feature space is better compared to that of a VAE. We also see that UPN-RL is able to outperform the expert and the imitating UPN-160.
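As an illustration of how such an agent could be assembled, the sketch below wraps an environment so that a standard model-free learner (e.g., PPO) receives the proprioceptive state s_t concatenated with f_φ(o_g) as its observation and the UPN latent distance as its reward, as in Figure 3. It assumes the classic gym Wrapper API (4-tuple step) and an environment exposing a proprioceptive state and an image observation via hypothetical helpers get_state() and render_obs(); these names are placeholders, not part of any specific benchmark, and observation/action space bookkeeping is omitted.

    import gym
    import numpy as np
    import torch

    class UPNRewardWrapper(gym.Wrapper):
        """Replaces the environment reward with the negative UPN latent distance to a
        goal image, and appends the goal's UPN features to the proprioceptive state."""

        def __init__(self, env, f_phi, goal_image):
            super().__init__(env)
            self.f_phi = f_phi.eval()
            with torch.no_grad():
                self.goal_feat = f_phi(goal_image.unsqueeze(0)).squeeze(0)

        def _observation(self):
            s_t = self.env.get_state()            # joint angles / velocities (placeholder)
            return np.concatenate([s_t, self.goal_feat.numpy()])

        def _reward(self):
            o_t = self.env.render_obs()           # current image observation (placeholder)
            with torch.no_grad():
                feat = self.f_phi(o_t.unsqueeze(0)).squeeze(0)
            return -torch.sum((feat - self.goal_feat) ** 2).item()

        def reset(self, **kwargs):
            self.env.reset(**kwargs)
            return self._observation()

        def step(self, action):
            _, _, done, info = self.env.step(action)
            return self._observation(), self._reward(), done, info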



Figure 8. Transfer between robots, as described in subsection 4.5. (a) Transfer to a more complex topology (Reacher). (b) Transfer to a new morphology (Point to Ant).

4.5. Transfer Across Robots

Having seen the success of reinforcement learning using rewards derived from UPN representations in subsection 4.4, we pose a harder problem in this subsection: can we leverage UPN representations trained on some source task(s) to provide rewards for target task(s) with significantly different dynamics and action spaces? We propose to do this by training and testing with different robots (morphological variations) on the same desired functionality (reaching / locomotion, around obstacles). This study will highlight the extrapolative generalizability of UPN representations. The idea of trajectory optimization with a learned metric is a fundamental prior that can hold across a large class of visuomotor control problems. Having trained UPN to learn such a prior, it is natural to expect the underlying representation to be amenable to providing suitable metric-based rewards for similar but unseen tasks. We craft two challenging experimental scenarios to verify this hypothesis.

Reacher with new morphology: Having trained a UPN with a shared f_φ and different g_θ for a 3-link and a 4-link reacher (on the obstacles task), can we leverage the learned f_φ to specify rewards for reinforcement-learning a 5-link reacher to reach different goals around the same obstacles? Figure 8(a) visually depicts this experiment. To the best of our knowledge, such a transfer scenario has not previously been studied for visuomotor control. The dynamics of a 5-link reacher are more complex (compared to 3- and 4-link reachers), thereby posing a harder control problem to solve at test time. However, a good path-planning reward function learned from 3- and 4-link reachers is likely to help for a 5-link reacher due to morphological similarities. We train the UPN on both the 3- and 4-link reachers to avoid overfitting the learned metric to a specific dynamical system. As comparison methods, we train RIL and AIL (with a multi-task (head) architecture), and a VAE (jointly on images from both tasks).

Point to Ant: High-level visual planning for navigation should be common across different robots, from a 2D point robot controlled through simple forces to a robot as complex as an 8-joint quadruped ant. We empirically confirm this via an experiment illustrated in Figure 8(b). Here, we learn representations with a UPN on demonstrations collected from a 2D point robot trained to traverse obstacles to reach varying goals. We randomize the robot's morphological appearance across demonstrations (Figure 8(b)), inspired by Sadeghi & Levine (2016); Tobin et al. (2017). This allows the UPN to learn an encoder f_φ that is robust to the creature's appearance. We then replace the point robot with a 3D torque-controlled ant to study how well the learned representation transfers to a harder task.

In Figure 9, we see that for both transfer scenarios, RL with rewards from the UPN representation is more successful compared to other feature spaces (VAE, AIL, and RIL). Additionally, we also compare the UPN-RL setup to a naïve RL agent optimizing a spatial distance to goal in the coordinate space as the reward; this procedure assumes that the spatial position of the goal is known, unlike UPN, RIL, AIL, and VAE, where the feature vector of the goal image is provided as input. We note that this method performs poorly, as expected, because the spatial distance in the coordinate space is not obstacle-aware. These results show that optimizing for the rewards derived from UPN correlates with task success, supporting our claim that UPNs learn generalizable and transferrable latent spaces.

Figure 9. RL with rewards from the UPN representation is significantly more successful compared to other feature spaces (VAE, AIL, RIL, shaped rewards), suggesting that UPNs learn transferrable, generalizable latent spaces. (a) Point robot to Ant transfer. (b) Reacher transfer.

5. Discussion

We studied the problem of learning generalizable representations for visuomotor control and introduced universal planning networks, a goal-directed policy architecture with an embedded differentiable planner that can be trained end-to-end. Our extensive experiments demonstrated that (1) UPNs learn effective visual goal-directed policies efficiently; (2) UPN latent representations can be leveraged to transfer task-related semantics to more complex agents and more challenging tasks through goal-conditioned reward functions; and (3) the learned planner improves with more updates at test-time, providing encouraging evidence of meta-learning for planning. Our transfer learning results demonstrate that UPNs learn generic representations with task-specific structure of agency and goals useful for planning. Future work should investigate self-supervised ways to train UPN-like representations, borrowing ideas from Weber et al. (2017).

Acknowledgements

We thank Aviv Tamar, John Schulman, Rocky Duan, Alyosha Efros and colleagues from the Robot Learning Lab at Berkeley for helpful feedback. This research was supported by an ONR PECASE N000141612723AS grant, NVIDIA, AWS, Berkeley EECS and Berkeley AI fellowships, and an NSF GRFP award.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Baram, N., Anschel, O., Caspi, I., and Mannor, S. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning, pp. 390–399, 2017.

de Bruin, T., Kober, J., Tuyls, K., and Babuska, R. Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, PP(99):1–1, 2018. doi: 10.1109/LRA.2018.2800101.

Deguchi, K. and Takahashi, I. Image-based simultaneous control of robot and target object motions by direct-image-interpretation method. In Intelligent Robots and Systems (IROS'99), 1999 IEEE/RSJ International Conference on, volume 1, pp. 375–380. IEEE, 1999.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. arXiv preprint arXiv:1710.11417, 2017.

Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations, 2018.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016a.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pp. 512–519. IEEE, 2016b.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. 2016.

Guez, A., Weber, T., Antonoglou, I., Simonyan, K., Vinyals, O., Wierstra, D., Munos, R., and Silver, D. Learning to search with MCTSnets, 2018. URL https://openreview.net/forum?id=r1TA9ZbA-.

Henaff, M., Whitney, W. F., and LeCun, Y. Model-based planning in discrete action spaces. arXiv preprint arXiv:1705.07177, 2017.

Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C. P., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Jonschkowski, R. and Brock, O. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.

Jonschkowski, R., Hafner, R., Scholz, J., and Riedmiller, M. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.

Kelley, H. J. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Lange, S., Riedmiller, M., and Voigtlander, A. Autonomous reinforcement learning on raw visual input data in a real world application. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pp. 1–8. IEEE, 2012.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning. Citeseer, 2000.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6120–6130, 2017.

Pathak*, D., Mahmoudieh*, P., Luo*, M., Agrawal*, P., Chen, D., Shentu, F., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkisuzWRW.

Pinto, L., Gandhi, D., Han, Y., Park, Y.-L., and Gupta, A. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pp. 3–18. Springer, 2016.

Sadeghi, F. and Levine, S. CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.

Schmidhuber, J. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, pp. 253–258. IEEE Press, 1990.

Schulman, J. Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs. PhD thesis, EECS Department, University of California, Berkeley, Dec 2016. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-217.html.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sermanet, P., Xu, K., and Levine, S. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.

Sermanet, P., Lynch, C., Hsu, J., and Levine, S. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.

Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.

Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE, 2012.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30. IEEE, 2017.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.