Arxiv:2010.13483V3 [Cs.RO] 31 Oct 2020 Catch and Throw the Second Ball and Return to the Left in Time to Catch the first Ball
Total Page:16
File Type:pdf, Size:1020Kb
High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards Kai Ploeger∗, Michael Lutter∗, Jan Peters Computer Science Department, Technical University of Darmstadt fkai, michael, [email protected] Abstract: Robots that can learn in the physical world will be important to en- able robots to escape their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real-world is particularly challenging as one must push the limits of the robot and its actuation without harming the system, amplifying the necessity of sample efficiency and safety for robot learning algorithms. In contrast to prior work which mainly focuses on the learning algorithm, we propose a learning system, that directly incorporates these requirements in the design of the policy representation, initialization, and opti- mization. We demonstrate that this system enables the high-speed Barrett WAM manipulator to learn juggling two balls from 56 minutes of experience with a bi- nary reward signal. The final policy juggles continuously for up to 33 minutes or about 4500 repeated catches. The videos documenting the learning process and the evaluation can be found at https://sites.google.com/view/jugglingbot Keywords: Reinforcement Learning, Dynamic Manipulation, Juggling 1 Introduction Robot learning is one promising approach to overcome the stiff and pre-programmed movements of current robots. When learning a task, the robot autonomously explores different movements and improves its behavior using scalar rewards. In recent years, research has focused a lot on improving task-agnostic deep reinforcement learning (DRL) algorithms by changing either the optimization [1,2], the simulation to use perturbed physics parameters [3,4], or the task to gradually increase complexity [5]. While these approaches have propelled learning robots to very complex domains in simulation, ranging from full-body control of humanoids [6] to control of dexterous hands [7,8], most of these approaches are not applicable to learn on physical systems as they neglect the intricate complexities of the real world. On the physical system, the learning is constrained to real-time and a single instance. Hence, the learning must not damage the robot with jerky actions and must be sample efficient. Consider the high-acceleration task of juggling two balls next to each other with a single anthro- pomorphic manipulator. The manipulator is required to throw a ball upwards, move to the right, arXiv:2010.13483v3 [cs.RO] 31 Oct 2020 catch and throw the second ball and return to the left in time to catch the first ball. To sustain this cyclic juggling pattern, the robot must always throw the ball sufficiently vertical and maintain precise timing. Therefore, this task pushes the limits of the robot to achieve the required high ac- celerations (of up to 8g), while maintaining precise control of the end-effector and the safety of the physical system. The task is not only inherently difficult to master1 but also requires learning on the physical system. Real-world experience is required as simulation models, while good at simulating contact-free rigid-bodies, cannot represent the non-linear effects close to the torque limits of the actuators and the highly dynamic contacts between end-effector and balls. For the tendon driven Barrett WAM, rigid-body-simulators also cannot model the dominating cable dynamics at high ac- celerations. Hence, simulation-based solutions cannot be transferred for very dynamic tasks given our current simulators. For this specific task even robot to robot transfer between robots of the same ∗Equal contribution 1For reference, the untrained human jugglers of the lab achieve 2 repeated catches and improve to about 20 repeated catches after a few hours of training. 4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA. (a) (b) (c) (d) (e) Figure 1: The juggling movement consisting of four separate movements, which are repeated to achieve juggling of two balls with a single anthropomorphic manipulator: (a) throw ball 1, (b) catch ball 2, (c) throw ball 2, (d) catch ball one, (e) repeat. model is not possible. The learned policy fails immediately when transferred to a different Barrett WAM. Therefore, the optimal policy depends on the exact robot instance and must be learned on the individual robot. The high accelerations amplify the safety and sample efficiency requirements for learning on the physical system as collisions at high velocities severely damage the system. Further- more, the high acceleration movements induce a lot wear and tear and cannot be executed for days. Therefore, high acceleration tasks are an ideal problem to test the limitations of current learning algorithms on the physical system. To emphasize important considerations for building a real-world learning system for high- acceleration tasks, we describe our robot learning setup to learn juggling and the taken system design considerations. We limit ourselves to existing methods to focus solely on the requirements for the real-world application of current learning approaches. This is in contrast to many prior works, which mainly propose new policy representations or policy optimizations and demonstrate that the new algorithm can be applied to the physical system. Instead, we focus on a single challenging task and evaluate which existing representations and algorithms can be used to build a safe and sample efficient robot learning system. We also highlight limitations of learning approaches, which - given the current knowledge - we do not dare to apply to the physical system for robot juggling. After- wards, we describe one approach to solve robot toss juggling with two balls and an anthropomorphic manipulator and validate the chosen approach in the real-world. The used approach - in our opin- ion - optimally combines engineering expertise and learning to obtain a reliable, safe, and sample efficient solution for this task. Our contribution is (1) the application of robot learning to the challenging task of single-arm toss juggling with two balls and (2) highlighting the challenges of applying a learning system in the physical world for a high-acceleration task. In the following, we first cover the prior work on robot juggling and learning dynamical tasks on the physical robot in Section2. Afterwards, we compare different approaches to learn real-world toss juggling in Section3 and describe the implemented learning system in Section4. The experimental setup and results are presented in Section5. 2 Related Work 2.1 Robot Juggling For decades robot juggling has been used to showcase the ingenuity of mechanical system design and control engineering as listed in table 1. Starting with Claude Shannon in the 1970s, many different juggling machines were built.2 For example, the Shannon juggler used cups to bounce up to five balls off a surface [9], the devil-sticking machine stabilized a stick in the air [10, 11], paddle jugglers, built from designated hardware [12,9] or by attaching tennis rackets to manipulators [13, 14, 15, 16, 17, 18, 19], juggled multiple balls using paddling. Toss jugglers were built using manipulators to juggle one [20, 21] or two balls [22] and even humanoids were used to juggle up to three balls using two arms [23, 24]. Most of these approaches proposed new engineering solutions for movements and controllers showing that these systems achieve juggling when the control parameters are manually 2A historic overview of robot juggling is available at https://www.youtube.com/watch?v= 2ZfaADDlH4w. 2 fine-tuned. Only a few approaches used supervised learning for model learning [14, 11], behavioral cloning [18], or evolutionary strategies with dense fitness functions [25] to achieve paddle juggling and devil sticking. We build upon the vast experience on end-effector and controller design for robot juggling but in contrast to the prior work, we demonstrate, to the best of our knowledge, the first robot learning system that learns toss juggling with two balls and a single anthropomorphic manipulator in the real world using only binary rewards. Using this approach, we achieve juggling of up to 33 minutes and high repeatability between trials. Table 1: Prior work on different types of robot juggling Juggling Type Approach Papers Devil Sticking Engineered [9] Devil Sticking Model Learning [11] Paddle Juggling Engineered [15, 16,9, 17, 12, 19] Paddle Juggling Imitation [18] Paddle Juggling Model Learning [14] Paddle Juggling Evolutionary Strategies [25] Toss Juggling Engineered [22, 24, 23, 21, 20] Toss Juggling Reinforcement Learning [Ours] 2.2 Learning Dynamical Tasks on the Physical Robot Despite the recent surge of deep reinforcement learning algorithms for controlling robots, most of these approaches are constrained to learn in simulation due to sample complexity and the high risk of catastrophic policies. Only relatively few DRL approaches have been applied to physical robots, e.g., robot manipulators [26, 27,8, 28, 29, 30] or mobile robots [31, 32]. Most of the work for manipulators focuses on non-dynamic tasks. Only Schwab et al. [30] and Buchler¨ et al. [33] ap- plied DRL to learn the dynamic tasks of Ball-in-a-Cup or robot table tennis. In [30], the authors built a fully automated environment to achieve large-scale data collection and engineered classical safety mechanisms to avoid damaging the physical system. Using the safe and automated environ- ment, SAC-X was able to learn Ball-in-a-Cup from raw pixels within three days [30]. Most other approaches for learning dynamical tasks on the physical system use more classical robot learning techniques. These algorithms combine engineering- and task knowledge with learning to achieve sample efficient and safe learning that does not require completely safe and fully automated envi- ronments. For example, combining model learning with trajectory optimization/model predictive control [34,8, 11] or model-free reinforcement learning with engineered policy representation, ex- pert policy initialization, and dense rewards [35, 36, 18, 37, 38].