OpenDR — Open Deep Learning Toolkit for Robotics

Project Start Date: 01.01.2020
Duration: 36 months
Lead contractor: Aristotle University of Thessaloniki

Deliverable D5.1: First report on deep robot action and decision making
Date of delivery: 31 December 2020

Contributing Partners: TUD, ALU-FR, AU, TAU
Version: v4.0

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449.

Title: D5.1: First report on deep robot action and decision making
Project: OpenDR (ICT-10-2019-2020 RIA)
Nature: Report
Dissemination Level: PU
Authors: Bas van der Heijden (TUD), Osama Mazhar (TUD), Laura Ferranti (TUD), Jens Kober (TUD), Robert Babuška (TUD), Halil Ibrahim Ugurlu (AU), Erdal Kayacan (AU), Amir Mehman Sefat (TAU), Roel Pieters (TAU), Daniel Honerkamp (ALU-FR), Tim Welschehold (ALU-FR), Abhinav Valada (ALU-FR), Wolfram Burgard (ALU-FR), Anastasios Tefas (AUTH), Nikos Nikolaidis (AUTH)
Lead Beneficiary: TUD (Technische Universiteit Delft)
WP: 5
Doc ID: OPENDR D5.1.pdf

Document History

Version | Date       | Reason of change
v1.0    | 25/9/2020  | Deliverable structure ready
v2.0    | 27/11/2020 | Contributions from partners finalized
v3.0    | 18/12/2020 | Final version ready
v4.0    | 30/12/2020 | Final version to be submitted


Contents

1 Introduction ...... 6
  1.1 Deep Planning (T5.1) ...... 6
    1.1.1 Objectives ...... 6
    1.1.2 Innovations and achieved results ...... 6
    1.1.3 Ongoing and future work ...... 6
  1.2 Deep Navigation (T5.2) ...... 7
    1.2.1 Objectives ...... 7
    1.2.2 Innovations and achieved results ...... 7
    1.2.3 Ongoing and future work ...... 7
  1.3 Deep Action and Control (T5.3) ...... 8
    1.3.1 Objectives ...... 8
    1.3.2 Innovations and achieved results ...... 8
    1.3.3 Ongoing and future work ...... 9
  1.4 Connection to Project Objectives ...... 10

2 Deep Planning ...... 12
  2.1 End-to-end planning for drone racing ...... 12
    2.1.1 Introduction and objectives ...... 12
    2.1.2 Summary of state of the art ...... 13
    2.1.3 Description of work performed so far ...... 16
    2.1.4 Performance evaluation ...... 18
    2.1.5 Future work ...... 20
  2.2 Local replanning in Agricultural Field ...... 21
    2.2.1 Introduction and objectives ...... 21
    2.2.2 Summary of state of the art ...... 21
    2.2.3 Description of work performed so far ...... 24
    2.2.4 Performance evaluation ...... 28
    2.2.5 Future work ...... 30

3 Deep Navigation ...... 31
  3.1 Learning Kinematic Feasibility for Mobile Manipulation ...... 31
    3.1.1 Introduction and objectives ...... 31
    3.1.2 Summary of state of the art ...... 32
    3.1.3 Description of work performed so far ...... 33
    3.1.4 Performance evaluation ...... 35
    3.1.5 Analytical evaluation ...... 36
    3.1.6 Motion execution in simulation ...... 37
    3.1.7 Future work ...... 38
  3.2 Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning ...... 39
    3.2.1 Introduction and objectives ...... 39
    3.2.2 Summary of state of the art ...... 40
    3.2.3 Description of work performed so far and performance evaluation ...... 41


4 Deep action and control ...... 42
  4.1 Model-based latent planning ...... 42
    4.1.1 Introduction and objectives ...... 42
    4.1.2 Summary of state of the art ...... 44
    4.1.3 Description of work performed so far ...... 48
    4.1.4 Performance evaluation ...... 49
    4.1.5 Future work ...... 50
  4.2 Existing reinforcement learning frameworks ...... 51
    4.2.1 Environments ...... 51
    4.2.2 Development Tools ...... 52
    4.2.3 Baselines ...... 54
  4.3 Hyperparameter Optimization ...... 55
    4.3.1 Introduction and objectives ...... 55
    4.3.2 Summary of state of the art ...... 55
    4.3.3 Description of work performed so far ...... 57
  4.4 Grasp pose estimation and execution ...... 58
    4.4.1 Introduction and objectives ...... 58
    4.4.2 Summary of state of the art ...... 58
    4.4.3 Description of work performed so far ...... 62
    4.4.4 Performance evaluation ...... 70
    4.4.5 Future work ...... 74

5 Conclusions ...... 75

A Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning ...... 92


Executive Summary

This document presents the status of the work performed for WP5–Deep robot action and decision making. WP5 consists of four main tasks: Task 5.1–Deep Planning, Task 5.2–Deep Navigation, Task 5.3–Deep Action and Control, and Task 5.4–Human Robot Interaction. This document covers the first three tasks of the WP, given that Task 5.4 will start later in the project. After a general introduction that provides an overview of the individual chapters and links them to the main objectives of the project, the document dedicates a chapter to each task. Each chapter (i) provides an overview of the state of the art for the individual topics and existing toolboxes, (ii) details the partners’ current work in each task with initial performance results, and (iii) describes the next steps for the individual tasks. Finally, a conclusion chapter provides a final overview of the work and the planned future work for each individual task.


1 Introduction

This document describes the work done during the first year of the project in the three (out of four) major research areas of WP5, namely deep planning, deep navigation, and deep action and control. The next sections (Sections 1.1-1.4) provide a summary of the work done so far on these three main topics and its link to the project objectives. The rest of the document is structured as follows. Chapter 2 details our work on deep planning. Chapter 3 describes our work on deep navigation. Chapter 4 presents our work on deep action and control. Finally, Chapter 5 concludes this deliverable.

1.1 Deep Planning (T5.1)

1.1.1 Objectives

Conventional robot motion planning is based on solving individual sub-problems such as perception, planning, and control. End-to-end motion planning methods, on the other hand, intend to solve the problem in one shot with a lower computational cost. Deep learning enables us to learn such end-to-end policies, particularly when integrated with reinforcement learning. AU introduces end-to-end motion planning methods for UAV navigation trained with deep reinforcement learning. Notably, AU studies two applications: autonomous drone racing and an agricultural use-case.

1.1.2 Innovations and achieved results

AU first studies end-to-end planning for autonomous drone racing. AU employs curriculum-based deep reinforcement learning to train an end-to-end policy that generates velocity commands for the UAV using only RGB images from the onboard camera. A sparse reward indicating a gate pass is used to train the policy, and curriculum learning is applied to solve the exploration problem caused by the sparse reward: the task difficulty is adjusted according to the success rate of the agent. AU trains and evaluates the method in a realistic simulation environment. Although AU achieves a high gate pass rate, it is not yet sufficient to test on a real system, since a UAV crash would be expensive. AU also studies the local motion planning task in the agricultural use-case, where the UAV assists the UGV with its wide camera range. Informed by the global plan, the UAV generates local plans for the UAV and the UGV in case of obstacle detection. AU has created an agricultural simulation environment in Webots, including trees, trucks, and inclination in the field, and presents initial results for this task.

1.1.3 Ongoing and future work

AU is developing a local planner for the agricultural use-case and will implement an end-to-end neural network policy for this task. The policy will generate local motion plans based on a depth image and a moving target point that tracks the global plan. AU will include the curriculum-based DRL method for autonomous drone racing in the OpenDR toolkit; the end-to-end planning method for the agricultural use-case is also intended to be included in the toolkit.


1.2 Deep Navigation (T5.2)

1.2.1 Objectives

Learning-based approaches have been shown to be well suited to solving navigation tasks across diverse environments and platforms, including autonomous vehicles, video games, and robotics. In particular, deep learning and reinforcement learning approaches have been shown to work well with the complex, high-dimensional inputs of real-world environments. Navigation tasks involve both long-horizon goals that require long-term planning and local, short-term decision making, such as traversing unknown terrain or avoiding static and dynamic obstacles. As a result, both the decomposition of the problem into different components and levels of abstraction, and the combination of traditional optimization and planning approaches with learned modules, are very promising directions.

1.2.2 Innovations and achieved results

Mobile manipulation tasks with autonomous robots remain one of the critical challenges for applications in service robotics as well as in industrial scenarios. ALU-FR proposes a deep reinforcement learning approach to learn feasible dynamic motions for a mobile base while the end-effector follows a trajectory in task space generated by an arbitrary system to fulfill the task at hand. This modular formulation has several benefits: it enables us to readily transform a broad range of end-effector motions into mobile applications, it allows us to use the kinematic feasibility of the end-effector trajectory as a simple dense reward signal instead of relying on sparse or costly shaped rewards, and it generalises to new, unseen, task-specific end-effector trajectories at test time. ALU-FR demonstrates the capabilities of this approach on multiple platforms with different kinematic abilities and different types of wheeled bases, both in simulation and in real-world experiments. The resulting approach has the potential to enable the application of any system that generates task-specific end-effector motions across a diverse set of mobile manipulation platforms.

Visual navigation is essential for many applications in robotics, from manipulation, through mobile robotics, to automated driving. DRL provides an elegant map-free approach that integrates image processing, localization, and planning in one module, which can be trained and therefore optimized for a given environment. However, to date, DRL-based visual navigation has been validated exclusively in simulation, where the simulator provides information that is not available in the real world, e.g., the robot's position or image segmentation masks. This precludes the use of the learned policy on a real robot. Therefore, TUD proposes a novel approach that enables a direct deployment of the trained policy on real robots. TUD designed visual auxiliary tasks, a tailored reward scheme, and a new powerful simulator to facilitate domain randomization. The policy is fine-tuned on images collected from real-world environments. The method was evaluated on a mobile robot in a real office environment. The training took roughly 30 hours on a single GPU. In 30 navigation experiments, the robot reached a 0.3-meter neighborhood of the goal in more than 86.7% of cases. This result makes the proposed method directly applicable to tasks like mobile manipulation.

1.2.3 Ongoing and future work While ALU-FR achieves very good results in terms of kinematic feasibility during trajectory generation the approach so far does not consider collisions with the environment. In future

OpenDR No. 871449 D5.1: First report on deep robot action and decision making 8/99 work ALU-FR plans to incorporate obstacles and object detection into the training to enable the agent to also avoid collision while moving the base. Further we observed occasional undesired configuration jumps in the arms with high degrees of freedom which we will aim to avoid by including a corresponding penalty in the reward function. ALU-FR will include methods to train and run the current mobile manipulation framework in the first version of toolkit.

1.3 Deep Action and Control (T5.3)

1.3.1 Objectives

Model-based reinforcement learning (RL) is well suited for robotics due to its sample efficiency. In the high-dimensional case, the policy must be learned in a lower-dimensional latent space to fully exploit this data efficiency. For the learned policy to run on a robotic platform with limited computational resources, the latent dynamics model must be compatible with fast control algorithms. A promising research direction is the use of Koopman operator theory in combination with deep learning. Although a latent Koopman representation facilitates more efficient action planning, it still focuses the majority of the model capacity on potentially task-irrelevant dynamics contained in the observations. Hence, it is of great importance to mitigate the effect of task-irrelevant dynamics in the latent representation learning, to make the representation more effective for robotics in industrial settings with many moving systems. Therefore, TUD sought a lightweight and sample-efficient method based on Koopman theory to control robotic systems for which it is hard to directly find a simple state description, using high-dimensional observations (which are possibly contaminated with task-irrelevant dynamics).

Despite the recent successes of deep neural networks in robotics, they have not yet become a standard component in the control design toolbox. This is due to several limitations imposed by the practice of manually tuning a large number of hyperparameters; optimal tuning of these parameters requires significant expertise. To simplify the tuning process and to save the scarce resource of expert time, TUD aims to develop efficient hyperparameter tuning algorithms in OpenDR.

One practical example that motivates the use of deep learning in robotics is object grasping, as provided by the Agile Production use case under development by TAU. Within such an object grasping task, perception, action, and control all play a crucial role. For example, the generation of object grasps can use RGB-D sensing for modelling (e.g., with deep convolutional neural networks) or utilize a simulator to learn successful grasping. A first step towards the development of grasping models in the OpenDR toolkit is therefore to explore the state of the art in modern grasp and grasp pose-estimation methods and to evaluate their performance on the custom-developed OpenDR Agile Production dataset. Second, shortcomings of existing approaches, such as training time and model complexity, should be considered in the context of the industrial use case, where large computation power and object models might be unavailable.
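To make the latent Koopman idea above concrete, the following is a minimal sketch (not the architecture developed in Chapter 4): an encoder maps observations to a latent state in which the dynamics are forced to be linear, so that fast linear control techniques can be applied in the latent space. The network sizes and the loss terms are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepKoopmanModel(nn.Module):
    """Encoder maps observations to a latent state z in which the dynamics are
    (approximately) linear, z_{t+1} = A z_t + B u_t, so that fast linear
    control techniques can be applied in the latent space."""

    def __init__(self, obs_dim, latent_dim, action_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
        self.A = nn.Linear(latent_dim, latent_dim, bias=False)   # Koopman matrix
        self.B = nn.Linear(action_dim, latent_dim, bias=False)   # control matrix

    def loss(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_next_pred = self.A(z) + self.B(action)
        # a reconstruction term keeps the encoder from collapsing to a trivial latent
        recon = nn.functional.mse_loss(self.decoder(z), obs)
        pred = nn.functional.mse_loss(z_next_pred, self.encoder(next_obs))
        return recon + pred
```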

1.3.2 Innovations and achieved results

TUD first provided an extensive review of the state of the art in deep action and control, including a broad overview of state-of-the-art environments that could potentially be incorporated in the OpenDR toolkit. Motivated by this review of the literature, TUD focused on the use of Koopman theory combined with deep learning for the control of robotic systems. In this respect, a preliminary architecture is provided. This architecture allows one to control generic robotic systems without any assumption on the way the control commands affect the system. In addition, the proposed architecture is suitable for both regulation and tracking problems. TUD showed preliminary results on a task that is challenging to represent using Koopman operator theory, namely OpenAI's single-pendulum swing-up task. To investigate the effect of distractor dynamics, TUD tested two different scenarios. In the first scenario, only relevant dynamics are observed, while in the second a more realistic scenario was considered in which the observations were purposely contaminated with distractor dynamics. This is often encountered in industrial environments, where, for example, the camera of a vision-based manipulator also captures the movements of other manipulators nearby; the manipulator must then be able to distinguish relevant from irrelevant dynamics. TUD compared the two scenarios against two model-free baseline approaches, namely Deep Bisimulation for Control (DBC) [219] and DeepMDP [64]. Preliminary results show that the approach reaches the same level of accuracy as the two model-free baselines in the distractor-free scenario, while achieving better performance than the baselines in the distractor scenario.

In addition, TUD shortlisted existing state-of-the-art toolkits for automatic hyperparameter optimization and briefly explored their target platforms, the implemented algorithms, and the utilities offered by these toolkits. This survey will form the basis for the future development of automatic hyperparameter optimization methods that target the gaps in the current state of the art, with design strategies well suited to the OpenDR objectives.

Moreover, towards the deep learning of grasp actions, TAU has provided a review of the state of the art in grasp (pose) estimation. This gives a clear overview of the different approaches, divided by criteria such as sensor modality and the objects to be grasped. Two deep learning-based object pose and grasp estimation frameworks were chosen for further analysis, and their performance was tested on the custom-developed OpenDR Agile Production database. This database, developed by TAU, consists of several parts, both simple in nature (3D-printed assembly parts) and industrial (Diesel engine parts). As both approaches have shortcomings with respect to the practical industrial application and the specifications of the OpenDR toolkit (framerate, latency, and model size), a novel grasp model, denoted Single Demonstration Grasping (SDG), is also under development by TAU. This model is designed to take a single demonstration of an object to be grasped and to generate the required training data and training procedure for accurate and robust grasp generation. Preliminary evaluation demonstrates promising results; however, further development and evaluation are necessary.

1.3.3 Ongoing and future work Concerning the use of Koopman theory combined with deep learning for control of robotic sys- tems, as part of future work, TUD plans to finalize the proposed control architecture and test the proposed approach on a simplified manipulator task, that is, the OpenAI’s reacher task where the two joints of a two-link arm are controlled and the euclidean distance the arm’s end-point and a goal must be minimized. This will allow TUD to test the tracking performance of the proposed control architecture. In addition, TUD will extent the method with high-dimensional images as observations. TUD intents to include the extended method into the toolkit if convinc- ing results with images are achieved. Regarding the development of robot grasp actions, future work of TAU will continue de- velopment and extension of the OpenDR Agile Production dataset that is tailored to industrial human-collaborative assembly tasks. Existing object pose and grasp estimation frameworks

OpenDR No. 871449 D5.1: First report on deep robot action and decision making 10/99 will be used for evaluation of the generated grasps and their suitability in industrial context. Finally, TAU’s proposed Single Grasp Demonstration model will be improved and extended to tackle short-comings of state of the art approaches. Suitable grasping models will be included in the OpenDR toolkit.

1.4 Connection to Project Objectives

The work performed within WP5, as summarized in the previous subsections, aligns well with the project objectives. More specifically, the conducted work progressed the state of the art towards meeting the following objectives of the project:

O2. To provide lightweight deep learning methods for deep robot action and decision making, namely:

O2.c.i Deep reinforcement learning (RL) and related control methods.
  * TUD contributed to this objective as detailed in part of Chapter 4. They provided an overview of the relevant literature on model-based RL and existing toolboxes for deep RL in robotics (Sections 4.1.2 and 4.2). Their main focus is on the design of a model-based approach in which the latent dynamics model must be compatible with fast control algorithms. For this, their objective is the design of a lightweight and sample-efficient method based on Koopman theory (Section 4.1.3). This approach will allow one to control robotic systems for which we cannot directly find a simple state description, by using high-dimensional observations (which are possibly contaminated with task-irrelevant dynamics). Preliminary results are presented (Section 4.1.4) and future work has been planned accordingly (Section 4.1.5) in line with this objective. Moreover, to simplify the task of hyperparameter tuning of deep RL for non-expert users, they aim to develop automatic hyperparameter optimization strategies; in this report, they present a survey of existing hyperparameter tuning toolkits (Section 4.3).

O2.c.ii Deep planning and navigation methods that can be trained in end-to-end fashion.
  * ALU-FR developed a method based on deep RL to enable a mobile robot to perform navigation for manipulation by generating kinematically feasible base motions given an end-effector motion following an unknown policy to fulfill a certain task. Given its lightweight architecture, it can be employed in real time on mobile robot platforms. These results could also be linked to O2.c.i.
  * AU contributed to this objective, as described in Chapter 2. They provide end-to-end planning methods for autonomous drone racing and local planning in the agricultural use-case. They train an end-to-end neural network policy generating velocity commands based on RGB images for autonomous drone racing with curriculum-based deep reinforcement learning. They also introduce an end-to-end planner for the agricultural use-case that avoids obstacles while following a given global trajectory. These results could also be linked to O2.c.i.
  * TUD contributed to this objective as detailed in parts of Chapter 3. A novel end-to-end deep navigation approach was introduced that enables a direct deployment of a policy, trained with images, on real robots. While the training is computationally demanding and took 30 hours on a single GPU, the execution of the eventual agent's policy is computationally cheap and can be executed in real time (sampling time of 50 ms), even on lightweight embedded hardware, such as an NVIDIA Jetson.

O2.c.iii Enable robots to decide on actions based on observations in WP3, as well as to learn from observations.
  * TAU has contributed to this objective as described in Section 4.4. They have provided an overview of the relevant literature and models towards the robotic action of grasping based on visual observations, in particular object pose estimation and object grasp generation (Section 4.4.2). Details and shortcomings of two state-of-the-art approaches are analysed by evaluating the models on the custom-developed OpenDR Agile Production dataset, which contains several industrial parts relevant to the Agile Production use case. Performance results in simulation indicate that the state-of-the-art models are not lightweight and fast, and take considerable computational resources for training (Section 4.4.4). A novel single demonstration grasping (SDG) model is under development that takes these limitations into account. Preliminary results with a Franka Panda collaborative robot show that the model can grasp unknown parts with high performance and a fast inference time (20 FPS), and future work is planned accordingly (Section 4.4.5).

O2.c.iv Enable efficient and effective human robot interaction.
  * The work on this objective will start in M13 with T5.4 Human-robot interaction.


2 Deep Planning

2.1 End-to-end planning for drone racing

2.1.1 Introduction and objectives

The autonomous navigation problem of robots is conventionally divided into three sub-problems: perception, planning, and control. One approach, which is the conventional approach in robotics, is to find individual solutions for each block to achieve the desired performance. However, such an approach may require large computational power, since each problem is solved individually. Furthermore, in various applications where we do not have sufficient time to solve each sub-problem separately (e.g., autonomous drone racing), this approach may not be preferable. Alternatively, end-to-end planning methods have become more popular with the development of cost-effective onboard processors. These methods aim to solve the problem in one shot; in other words, they try to map input sensor data (e.g., RGB camera images) directly to robot actions (e.g., drone velocity commands in the x-y-z directions). This approach is computationally less expensive due to its more compact structure, which is vital for onboard processors where energy consumption is critical and computational power is limited, as in autonomous drone racing. On the other hand, end-to-end methods have a significant drawback: debugging particular problems in a sub-problem is more difficult, and consequently it becomes more challenging to obtain safety and/or convergence guarantees.

Reinforcement learning (RL) is a machine learning approach that focuses on how agents learn by trial and error in an environment based on a certain reward. RL is attractive for robotic applications because it can train an end-to-end policy using only a reward function that defines the task. Recently, thanks to their ability to extract high-level features from raw sensory data, deep neural networks (DNNs) have also been used in RL methods, giving rise to deep RL (DRL). DNNs help RL scale to very large problems, such as learning to play computer video games. In a similar manner, DRL algorithms are used in the literature to learn control policies for robots directly from sensor inputs, such as RGB cameras. However, a significant challenge when defining a robotic task as a DRL problem is choosing a descriptive reward function. One can consider a sparse reward, e.g., a positive reward for passing a gate successfully in autonomous drone racing, or a shaped reward, which informs the agent more frequently. An exploration problem appears with the sparsity of the reward: how long the agent should act in the environment to obtain sufficient reward and observation samples. While shaping the reward function decreases the exploration complexity, it requires engineering and optimization of the function definition; moreover, the resulting policy is not directly optimized for the specific goal. Alternatively, a curricular schedule can be imposed on the learning algorithm to overcome the exploration complexity of a sparse reward. This research adopts such a curriculum learning strategy [85] to solve the autonomous drone racing problem by only rewarding gate passes.

Figure 1: A drone in front of a racing gate
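To make the distinction between sparse and shaped rewards concrete, the following sketch contrasts the two styles for a gate-passing task. The function names and the distance-based shaping term are illustrative assumptions; the sparse reward actually used in this work is given later in Eqn. (4).

```python
import numpy as np

def sparse_reward(passed_gate: bool, crashed: bool) -> float:
    """Reward only at episode termination: directly encodes the goal,
    but makes exploration hard."""
    if passed_gate:
        return 1.0
    if crashed:
        return -1.0
    return 0.0

def shaped_reward(drone_pos: np.ndarray, gate_center: np.ndarray) -> float:
    """Dense reward at every step: easier to explore, but the shaping term
    must be engineered and may bias the learned behaviour."""
    return -float(np.linalg.norm(drone_pos - gate_center))
```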


In the autonomous drone racing problem, the main goal is to complete a specific course defined by gates (as in Fig. 1) with an aerial robot carrying onboard sensors, in a faster and safer manner than a skilled human pilot. The long-term hope is that the novel algorithms and methods developed for the autonomous drone racing problem will be transferable to other real-world indoor navigation and surveillance problems, such as the exploration of a nuclear power plant after an accident. However, there are various difficulties, such as nonlinear dynamics excited by agile flight and onboard computation limitations, which make the autonomous drone racing challenge an excellent benchmark problem for testing our end-to-end planning methods. Motivated by the challenges and opportunities above, in this section we present our initial results on the autonomous drone racing problem using a visuomotor planner trained with DRL.

2.1.2 Summary of state of the art

One approach to solve the autonomous drone racing problem is to decouple it into localization, planning, and control tasks. Here, one critical issue is gate detection from the camera image and localization. Traditional computer vision methods have been employed to solve this task using prior knowledge of the gates' color and shape [94, 124]; unfortunately, these methods generalize poorly to new environments. Deep learning methods can handle the task successfully and generalize the information robustly [95, 100]. Recently, Foehn et al. [60] proposed a novel method that segments the four corners of the gate and identifies the poses of multiple gates. Although it achieves remarkable performance, the method necessitates a heavy onboard computer. Instead of learning gate poses, Loquercio et al. [130] train an artificial neural network to generate the planner's steering function. They also demonstrate the benefit of deep learning with domain randomization for simulation-to-real transfer, which opens the way for deep learning approaches that couple the localization, planning, and control blocks. For example, Bonatti et al. [19] implement the entire system, from sensors to actions, with neural networks: they use a variational autoencoder to learn a latent representation for gate pose estimation and then apply imitation learning to generate actions from the latent space. Muller et al. [148] also use imitation learning, but they train a convolutional DNN to generate low-level actuator commands from RGB camera images; however, they ease the problem by using additional guiding cones between the gates of standard drone racing applications. More recently, Rojas-Peres et al. [166] use a heavier convolutional DNN to generate controller actions fed with mosaic images, which contain consecutive camera images side-by-side to visually inform the network about the velocity of the drone. However, they apply a simple training method based on manual flight references and ultimately achieve lower velocities than other works in the domain.

Datasets

Five visual-inertial UAV datasets are listed in Table 1. The first two datasets provide lower-speed data collected with heavy sensor payloads. The other datasets publish data from more agile flights to challenge visual-inertial odometry algorithms. Uniquely, the UZH-FPV drone racing dataset [48] provides data from an unconventional sensor, an event-based camera.

Table 1: UAV visual-inertial datasets

Dataset                   | Top speed (m/s) | Data collection | mm ground truth | Data modalities
EuRoC MAV [28]            | 2.3             | indoor          | Yes             | IMU, stereo camera, 3D lidar scan
Zurich urban MAV [135]    | 3.9             | outdoor         | No              | IMU, RGB camera, GPS
Blackbird [2]             | 13.8            | indoor          | Yes             | IMU, RGB camera, depth camera, segmentation
UPenn fast flight [185]   | 17.5            | outdoor         | No              | IMU, RGB camera
UZH-FPV drone racing [48] | 23.4            | indoor/outdoor  | Yes             | IMU, RGB camera, event camera, optical flow

Simulations

Simulations have long been a valuable tool for developing robotic vehicles. They allow engineers to identify faults early in the development process and let scientists quickly prototype and demonstrate their ideas without the risk of destroying potentially expensive hardware. While simulation systems have various benefits, many researchers view results generated in simulation with skepticism, as any simulation system is an abstraction of reality and will vary from reality on some scale. Despite this skepticism, several trends have emerged in recent years that have driven the research community to develop better simulation systems. A major driving trend toward realistic simulators originates from the emergence of data-driven algorithmic methods in robotics, for instance based on machine learning, which requires extensive data, or reinforcement learning, which requires a safe learning environment to generate experiences. Simulation systems provide not only vast amounts of data but also the corresponding labels for training, which makes them ideal for these data-driven methods. This trend has created a critical need to develop better, more realistic simulation systems. The ideal simulator has three main features:

1. Fast collection of large amounts of data with limited time and computations.

2. Physically accurate to represent the dynamics of the real world with high fidelity.

3. Photorealistic to minimize the discrepancy between simulations and real sensor observations.

These objectives are generally conflicting in nature: the higher the accuracy of a simulation, the more computational power is needed to provide that accuracy, and hence the slower the simulator. Therefore, achieving all of these objectives in a single uniform simulator is challenging. Three simulation tools that have tried to balance these objectives for UAVs are FlightGoggles [70], AirSim [176], and Flightmare [180]. These simulators use modern 3D game engines such as Unity or Unreal Engine to produce high-quality, photorealistic images in real time.

In 2017, Microsoft released the first version of AirSim (Aerial Informatics and Robotics Simulation). AirSim is an open-source photorealistic simulator for autonomous vehicles built on Unreal Engine. A set of APIs, written in either C++ or Python, allows communication with the vehicle, whether it is observing the sensor streams, collisions, and vehicle state, or sending commands to the vehicle. In addition, AirSim offers an interface to configure multiple vehicle models for quadrotors and supports hardware-in-the-loop as well as software-in-the-loop with flight controllers such as PX4.

Figure 2: Information extracted from Unreal Engine using AirSim (RGB image, segmentation, and depth)

AirSim differs from the other two simulators in that the dynamic model of the vehicle is simulated using NVIDIA's physics engine PhysX, a popular physics engine used by the large majority of today's video games. Despite its popularity in the gaming industry, PhysX is not specialized for quadrotors, and it is tightly coupled with the rendering engine to simulate environment dynamics. AirSim achieves some of the most realistic images, but because of this strict connection between rendering and physics simulation, it can achieve only limited simulation speeds. Figure 2 presents examples of the photorealistic information collected via AirSim: RGB, segmentation, and depth images.
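As a rough illustration of this Python API, the sketch below connects to a running AirSim instance, sends a velocity command, and requests the three image modalities shown in Figure 2. It is based on the public airsim package; method names and arguments should be checked against the installed version.

```python
import airsim

client = airsim.MultirotorClient()      # connects to the simulator over RPC
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# send a velocity reference (m/s) for one second
client.moveByVelocityAsync(vx=1.0, vy=0.0, vz=0.0, duration=1.0).join()

# request RGB, segmentation, and depth images in one call
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene),
    airsim.ImageRequest("0", airsim.ImageType.Segmentation),
    airsim.ImageRequest("0", airsim.ImageType.DepthPerspective, pixels_as_float=True),
])
```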

In 2019, the MIT FAST Lab developed FlightGoggles, an open-source photorealistic sensor simulator for perception-driven robotic vehicles. The contribution of FlightGoggles consists of two separate components. Firstly, decoupling the photorealistic rendering engine from the dynamics modeling gives the user the flexibility to choose the complexity of the dynamic model. Lowering the complexity allows the rendering engine more time to generate higher-quality perception data at the cost of model accuracy, whereas increasing the complexity allocates more resources to accurate modeling. Secondly, FlightGoggles provides an interface with real-world vehicles and actors in a motion capture system for vehicle-in-the-loop and human-in-the-loop testing. The motion capture system is very useful for rendering camera images given trajectories and inertial measurements from vehicles flying in the real world, and the collected datasets are used for testing vision-based algorithms. Being able to perform vehicle-in-the-loop experiments with photorealistic sensor simulation facilitates novel research directions involving, e.g., fast and agile autonomous flight in obstacle-rich environments and safe human interaction.

In 2020, the UZH Robotics and Perception Group released the first version of Flightmare: A Flexible Quadrotor Simulator. Flightmare shares the same motivation as FlightGoggles by decoupling the dynamics modeling from the photorealistic rendering engine, gaining the flexibility to trade off how accurately the model is simulated against how fast data is gathered. While not incorporating a motion capture system, Flightmare provides interfaces to the well-known robotics simulator Gazebo along with different high-performance physics engines. Flightmare can simulate several hundred agents in parallel by decoupling the rendering module from the physics engine. This is useful for multi-drone applications and enables extremely fast data collection and training, which is crucial for developing deep reinforcement learning applications. To facilitate reinforcement learning even further, Flightmare provides a standard wrapper (OpenAI Gym), together with popular OpenAI baselines for state-of-the-art reinforcement learning algorithms. Lastly, Flightmare contributes a sizeable multi-modal sensor suite, providing IMU, RGB, segmentation, and depth data, and an API to extract the full 3D information of the environment in the form of a point cloud.

This review has introduced some of the high-quality perception simulators available. Table 2 summarises the main differences between the presented simulators, including Gazebo-based packages and Webots; choosing the right one depends on the problem and on what data is essential for the project. For example, if high-quality photorealistic images are the priority of the project, AirSim might be best, but if an evaluation of the quadrotor's performance is needed, FlightGoggles with its vehicle-in-the-loop capability may be the right choice. These simulation tools are being improved and extended on a daily basis, so no one can predict the right simulator to use in the future, but hopefully this section has shed some light on how simulators can be a useful tool for training, testing, and prototyping, and how valuable they are for autonomous vehicles.

Table 2: UAV simulation tools

Simulator          | Rendering     | Dynamics     | Sensor suite                | Point cloud | RL API | Vehicles
Hector [107]       | OpenGL        | Gazebo-based | IMU, RGB                    | No          | No     | single
RotorS [62]        | OpenGL        | Gazebo-based | IMU, RGB, Depth             | No          | No     | single
Webots [139]       | WREN          | ODE          | IMU, RGB, Depth, Seg, Lidar | No          | No     | multiple
FlightGoggles [70] | Unity         | Flexible     | IMU, RGB                    | No          | No     | single
AirSim [176]       | Unreal Engine | PhysX        | IMU, RGB, Depth, Seg        | No          | No     | multiple
Flightmare [180]   | Unity         | Flexible     | IMU, RGB, Depth, Seg        | Yes         | Yes    | multiple

2.1.3 Description of work performed so far

We consider a basic quadrotor drone equipped with an RGB camera in a realistic AirSim [176] environment. The block diagram of our methodology is given in Fig. 3. The drone provides RGB images to the agent while it is commanded with linear velocity references. The agent is trained with DRL using the online data collected from the simulation. While the control commands and sensors are simulated in AirSim, we apply a high-level curricular schedule on top. The curriculum generator helps the agent to explore the environment effectively by adjusting the difficulty level of the episodes. Below we give the details of the method.


Figure 3: Block diagram of the proposed method

Drone racing environment for DRL

We aim to learn an end-to-end neural network policy, $\pi(\cdot)$, that generates quadrotor actions as velocity references using only RGB input images. In order to train the policy with a DRL algorithm, we consider a supervisory environment that has access to all ground-truth information and rewards the agent. In the context of our research, the drone is assumed to face the gate and to be kept at a constant altitude equal to that of the gate center. The policy is trained in episodes by simulating action and observation sequences with a constant discrete time step, $\Delta t$. In particular, for an observation $o_t$ at time step $t$, when an action $a_t$ is taken, the simulation runs for $\Delta t$ seconds and the new observation $o_{t+1}$ is obtained. An episode is randomly initiated by placing the drone with respect to the gate according to

$$p_0^x \sim \beta(1.3,\,0.9)\, d_{\max}, \qquad (1)$$
$$p_0^y \sim U(-p_0^x,\, p_0^x), \qquad (2)$$

where the drone's position variables $p_0^x$, $p_0^y$ are drawn from the given beta ($\beta$) and uniform ($U$) distributions, with $d_{\max}$ acting as a difficulty parameter. A beta distribution with the given parameters samples the interval $[0,1]$, giving higher values more probability. The action $a_t$ is applied as velocity references to the controller in the x- and y-axes as

$$a_t = (v_r^x,\, v_r^y), \qquad (3)$$

where $v_r^x$ and $v_r^y$ are the reference values for the drone controller. Note that the altitude is controlled separately and kept aligned with the gate center. We propose a sparse reward that is given only at episode termination:

$$R = \begin{cases} 1, & \text{if the drone passes the gate successfully,} \\ -1, & \text{otherwise.} \end{cases} \qquad (4)$$
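The following is a minimal, self-contained sketch of this episode structure (Eqns. (1)-(4)). It is a toy stand-in for the actual AirSim-backed environment: the gate sits at the origin, the observation is the ground-truth position rather than an RGB image, and the success/failure checks are simplified versions of the termination criteria described next.

```python
import numpy as np

class DroneGateEnv:
    """Toy sketch (not the OpenDR implementation): reset() samples the start
    pose per Eqns. (1)-(2), step() applies the velocity reference of Eqn. (3)
    and returns the sparse terminal reward of Eqn. (4)."""

    def __init__(self, d_max=5.0, dt=0.1, max_steps=200, bound=10.0):
        self.d_max, self.dt, self.max_steps, self.bound = d_max, dt, max_steps, bound

    def reset(self):
        px = np.random.beta(1.3, 0.9) * self.d_max        # Eqn. (1)
        py = np.random.uniform(-px, px)                    # Eqn. (2)
        self.pos = np.array([px, py], dtype=float)
        self.steps = 0
        return self.pos.copy()

    def step(self, action):
        vx_ref, vy_ref = action                            # Eqn. (3): velocity references
        prev_x = self.pos[0]
        self.pos += self.dt * np.array([vx_ref, vy_ref])
        self.steps += 1
        # success: crossed the (assumed 1 m wide) gate window while moving forward
        passed = prev_x > 0.0 >= self.pos[0] and abs(self.pos[1]) < 0.5 and vx_ref < 0.0
        # failure: time limit or boundary limit exceeded
        failed = self.steps >= self.max_steps or np.any(np.abs(self.pos) > self.bound)
        if passed:
            return self.pos.copy(), +1.0, True, {}         # Eqn. (4), success
        if failed:
            return self.pos.copy(), -1.0, True, {}         # Eqn. (4), failure
        return self.pos.copy(), 0.0, False, {}
```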


An episode is thus terminated in one of two ways: success or failure. For success, we check two conditions. First, the drone should cross a virtual window between the previous and the current time step. Second, it should have a velocity oriented within a certain range of the forward direction. For a step to be considered a failure, a time limit and a boundary limit are taken into account.

Curriculum methodology

We prefer a sparse reward to avoid the side effects of reward shaping, as explained in Section 2.1.1. On the other hand, it is hard for a DRL algorithm to explore a sparse reward in high-dimensional state spaces. We adopt a curriculum strategy [85] to overcome this issue. In particular, we define a difficulty parameter, $d \in [0,1]$, that is controlled by the curriculum generator. The parameter is initialized at zero and updated according to

$$d \longleftarrow \begin{cases} d + 0.0005, & 60\% \le \alpha_{\text{success}}, \\ d, & 40\% < \alpha_{\text{success}} < 60\%, \\ d - 0.0005, & \alpha_{\text{success}} \le 40\%, \end{cases} \qquad (5)$$

where $\alpha_{\text{success}}$ is the success rate computed over the last fifty episodes. The difficulty parameter is linearly related to $d_{\max}$ in Eqn. 1; therefore, it controls how far from the gate the drone is initialized.

Policy network architecture

We train our neural network policy using the Proximal policy optimization (PPO) algorithm [172], a state-of-the-art DRL algorithm. The policy network structure, illustrated in Fig. 4, is a standard convolutional neural network used in DRL algorithms. The network contains three convolutional layers with 32, 64, and 64 filters, filter sizes of 8, 4, and 4, and strides of 4, 2, and 1, respectively. The convolutional layers are followed by a single fully-connected layer with 512 hidden neurons. We keep the images in a first-in-first-out buffer to feed our neural network policy, since a single image does not carry information about the drone's velocity. The buffer size is chosen as three, considering the results shown in Table 3. We observe that the reinforcement learning agent converges in the difficulty parameter, meaning that it can keep the success rate above 60% while d = 1, if the buffer size is above two. On the other hand, the computational complexity increases with the buffer size. Since we do not observe better performance with a buffer size above three, we continue our experiments with three buffered images. At every layer, the activation functions are rectified linear units (ReLU). The policy outputs the action as explained in Eqn. 3.
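A direct transcription of the curriculum update rule in Eqn. (5) is sketched below. The fifty-episode window and the step size follow the text above, while clamping d to [0, 1] is an added assumption.

```python
from collections import deque

class CurriculumGenerator:
    """Adjusts the difficulty parameter d in [0, 1] based on the success rate
    over the last 50 episodes, following Eqn. (5)."""

    def __init__(self, step=0.0005, window=50):
        self.d = 0.0
        self.step = step
        self.results = deque(maxlen=window)

    def update(self, episode_success: bool) -> float:
        self.results.append(1.0 if episode_success else 0.0)
        success_rate = sum(self.results) / len(self.results)
        if success_rate >= 0.6:
            self.d = min(1.0, self.d + self.step)
        elif success_rate <= 0.4:
            self.d = max(0.0, self.d - self.step)
        # for 0.4 < success_rate < 0.6 the difficulty is kept unchanged
        return self.d
```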

Table 3: Effect of the size of the image buffer on the training performance

Number of images         | 2   | 3   | 4
Max difficulty reached   | 0.8 | 1   | 1
Max success rate reached | -   | 90% | 80%

2.1.4 Performance evaluation

We use Stable Baselines [89], a Python package that includes implementations of several DRL algorithms. We use the default hyperparameters except for the number of time-steps per algorithm update. In Fig. 5, training curves are shown for increasing values of this hyperparameter.
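A rough sketch of this training setup with the (TensorFlow-based) Stable Baselines package [89] is given below. A standard Gym task stands in for the AirSim-backed drone racing environment, and everything except the increased n_steps value is generic boilerplate rather than the actual configuration used in this work.

```python
import gym
from stable_baselines import PPO2

# stand-in environment; in this work the environment is the AirSim-backed
# gym wrapper with stacked RGB image observations and a CNN policy
env = gym.make("CartPole-v1")

# default hyperparameters except the number of time-steps per algorithm update
model = PPO2("MlpPolicy", env, n_steps=4096, verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_policy")
```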



Figure 4: Policy network structure. The network input is three consecutive RGB images. There are three convolutional layers followed by two fully connected layers. The network output is the action command to the drone.
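A PyTorch sketch of the network described above and shown in Figure 4 (three convolutional layers with 32/64/64 filters, kernel sizes 8/4/4 and strides 4/2/1, followed by a 512-unit fully connected layer and a two-dimensional velocity output). Stacking the three RGB frames along the channel dimension and the 84x84 input resolution are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """CNN policy: three stacked RGB images in, (v_r^x, v_r^y) out."""

    def __init__(self, in_channels=9, n_actions=2):   # 3 RGB frames = 9 channels (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 512-unit fully connected layer; the flattened size depends on the
        # (assumed) input resolution, so it is inferred lazily here
        self.fc = nn.LazyLinear(512)
        self.head = nn.Linear(512, n_actions)

    def forward(self, x):
        x = self.features(x)
        x = torch.relu(self.fc(x))
        return self.head(x)

# example: one observation of three stacked 84x84 RGB frames (assumed resolution)
obs = torch.zeros(1, 9, 84, 84)
action = PolicyNetwork()(obs)
```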


Figure 5: Sample training curves for an increasing number of time-steps per algorithm update: (a) 128, (b) 1024, (c) 4096. Training performance increases both in time and in reward variance as the hyperparameter is increased from its default value for the PPO algorithm.

We train and test our method in a realistic AirSim [176] simulation environment. The laboratory environment created in Unreal Engine 4 is shown in Fig. 6a. We have placed four gates in different directions in the environment to avoid memorization of a single scene. Each episode is randomly initiated at one of the four gates with equal probability. We obtain a success rate of about 90% when we test the policy on the gates used in training. We run 20 episodes for each gate and tabulate the success rates in Table 4. Sample trajectories for one gate are shown in Fig. 6b. The starting point of each episode is indicated by a green dot. From the starting point, the policy generates and executes a velocity command according to the current observation. The trajectories are obtained using ground-truth positions at each time step. We measure the speed of the network on a laptop CPU and on a Jetson TX2 onboard computer by running the network 10000 times and computing the frame rate; we obtain 500 Hz on an i7 laptop CPU and 71 Hz on the TX2.


Table 4: Success rate of the trained policy over 20 test episodes per gate

Gate number  | 1   | 2   | 3   | 4   | Overall
Success rate | 95% | 95% | 90% | 85% | 91%

(a) Simulated laboratory environment in AirSim with four training gates. The gates are placed in different orientations to diversify the background. (b) Sample trajectories obtained from 20 episodes for one gate. Each trajectory is plotted in top view, starting from a green dot and passing through the gate denoted in blue.

Figure 6: Simulated laboratory environment and sample test trajectories of the drone

2.1.5 Future work

In this work, an end-to-end visuomotor planner is trained with curriculum-based DRL for the drone racing task. We qualitatively obtain a success rate of about 85% for a gate placed in our simulated laboratory environment. Although this rate is comparable to that of a different task using a similar methodology [85], it is not sufficient for the autonomous drone racing challenge. Moreover, we can expect the success rate to drop further in real-world experiments, and in practice a failure with a real drone implies damage to hardware. This means the method currently lacks reliable sim-to-real transfer. One explanation comes from the nature of DRL: since it is subject to the exploration-exploitation dilemma, DRL algorithms are known to result in less optimized networks than supervised learning algorithms. Nevertheless, there are open questions whose answers could improve the results. Instead of depending only on the camera, other sensors could also benefit the planner; one research question is integrating the overall system with additional sensors, such as an inertial sensor or a depth camera. The network architecture and the hyperparameters of the learning algorithms are other variables that should be extensively analyzed.


2.2 Local replanning in Agricultural Field

2.2.1 Introduction and objectives

The agriculture scenario in OpenDR is based on the cooperation of an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) to complete their tasks in an agricultural environment. The UGV is responsible for autonomous weed control: the machine has to detect and distinguish plants and weeds in order to efficiently remove weeds between the rows (inter-row) and possibly between the plants (intra-row). For this purpose, the ground robot is equipped with deep learning models that are integrated with the Robot Operating System (ROS). Thanks to a deep learning-based weeding system that is compatible with ROS, the ground robot will also have improved control over its own movement and its actuators, and it is capable of adjusting its operating speed based on plant density. Since these tasks require a proper hardware setup, the UGV uses an NVIDIA Jetson TX2 as its AI computing device, RGB cameras to detect local weed intensity and the positions of rows and crops, lidars and RGB cameras to detect obstacles such as unforeseen objects on the ground or animals in the field, and an RTK-GPS sensor to carry out tasks with accurate positioning in the field.

The tasks of the UGV are conducted according to the global planner's instructions. However, due to its design goals, the UGV is not well suited to detecting particular obstacles, such as very high crops and heavy occlusions in the environment. As a result, it suffers from limited situational awareness: since it is not fully aware of stationary or moving obstacles at longer range, it becomes more vulnerable to accidents. At this point, the assistance of an air vehicle is needed. The UAV is equipped with a more suitable skill and sensor set, including a camera and a depth sensor. In addition to the sensor set providing a wide field of view and depth information about obstacles, the quick field-covering capability of the air vehicle makes it suitable for the assistance task. Moreover, the air vehicle will locally generate its own and the ground robot's trajectories in response to obstacle detections. Since the ground robot will have an updated trajectory, it will be able to operate without operator supervision.

In order to reach our goals, our methods need to be trained in realistic scenarios. For this purpose, the simulation environment is assumed to have different kinds of field characteristics. Trees, humans, and tractors are considered as obstacles in the agricultural field, and possible inclines in the field must also be taken into account. These decisions are crucial to obtain maximum benefit from our air and ground vehicles. The final goals of this scenario are to maximize profits, to minimize costs, and, most importantly, to prevent serious accidents.

2.2.2 Summary of state of the art

2.2.2.1 Global and local planning

Robot motion planning concerns finding a collision-free path from an initial state to a goal state. It can be principally classified into global planning and local planning [116]. Global planning is concerned with finding a path with full knowledge of the environment; such methods can therefore provide optimal or complete solutions to motion planning problems. However, in real-world applications a robot cannot obtain full environmental information due to its limited sensing capabilities. Local planning methods have been developed for incremental planning in a partially observable or dynamic environment; they mostly build on global planners while taking the partial observations into account. We consider a local planning problem in our scenario, where the UAV has a limited field of view.

2.2.2.2 Graph-based planning methods

An alternative classification of motion planning methods is based on the type of algorithm applied (see Fig. 7). One approach to solving the motion planning problem is based on discretizing the state space into grids and using graph-theoretic methods to find a path. The choice of grid resolution directly affects the performance of grid-based planning algorithms: coarser grids lead to faster searches, but grid-based algorithms struggle to find paths through narrow portions of the collision-free configuration space. The grid can be considered as a graph, and a path can be found using graph search algorithms; depth-first search, breadth-first search, A* [79], and D* [182] fall under this class.
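For concreteness, a compact textbook A* implementation on a 2D occupancy grid is sketched below (4-connected moves, Manhattan heuristic). This is a generic illustration, not code from the OpenDR toolkit.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2D occupancy grid (0 = free, 1 = obstacle); 4-connected moves,
    Manhattan-distance heuristic. Returns a list of cells or None."""
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    open_set = [(h(start), 0, start)]
    came_from, g_cost, closed = {start: None}, {start: 0}, set()
    while open_set:
        _, g, cell = heapq.heappop(open_set)
        if cell in closed:
            continue
        closed.add(cell)
        if cell == goal:                          # reconstruct path back to start
            path = [cell]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                came_from[nxt] = cell
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # path around the obstacle row
```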


Figure 7: Classification of planning algorithms

2.2.2.3 Optimization-based methods

Another group of methods uses control theory and/or optimization theory to solve the motion planning problem using the equations of motion [44, 177]. However, these methods demand a high computational cost when operating in cluttered environments or in harsh environments with complex dynamic characteristics.

2.2.2.4 Potential fields

Artificial potential fields are another approach to solve the problem in the continuous domain, where a potential function defined over the state space guides the robot toward the goal and away from obstacles [103]. The major downside of this method is the presence of local minima in the potential field.
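For illustration, a minimal sketch of such a field is given below (a quadratic attractive term plus a classic inverse-distance repulsive term, followed by gradient descent on the total potential). The gains, influence radius, and obstacle layout are arbitrary assumptions, chosen so that plain gradient descent can stall in a local minimum between the two obstacles, illustrating the downside mentioned above.

```python
import numpy as np

def potential_gradient(pos, goal, obstacles, k_att=1.0, k_rep=100.0, rho0=2.0):
    """Gradient of an attractive + repulsive potential field."""
    grad = k_att * (pos - goal)                  # attractive: 0.5*k_att*||pos - goal||^2
    for obs in obstacles:
        diff = pos - obs
        rho = np.linalg.norm(diff)
        if 1e-6 < rho < rho0:                    # repulsion only inside the influence radius
            grad += k_rep * (1.0 / rho0 - 1.0 / rho) / rho**3 * diff
    return grad

# gradient descent on the potential; may stall in a local minimum between the obstacles
pos, goal = np.array([0.0, 0.0]), np.array([10.0, 0.0])
obstacles = [np.array([5.0, 1.0]), np.array([5.0, -1.0])]
for _ in range(500):
    pos -= 0.05 * potential_gradient(pos, goal, obstacles)
print(pos)
```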

2.2.2.5 Sampling-based methods

Sampling-based methods use probabilistic techniques to construct a graph or tree representation of the free space. The probabilistic roadmap (PRM) [102], the rapidly-exploring random tree (RRT) [115], and the Voronoi diagram [187] are generic example algorithms in this class. These algorithms have lower computational complexity because the graph representations they build can be much sparser than a full discretization of the state space. The downsides are the suboptimality of the solution and probabilistic completeness, i.e., if a solution exists it will be found only as time goes to infinity. These algorithms can be deployed in real-world UAV applications by taking their limitations into account.

2.2.2.6 Computational-intelligence-based methods

Beyond these traditional algorithms, computational-intelligence-based algorithms have also been proposed to solve the UAV path planning problem, such as genetic algorithms [80], particle swarm optimization [221], ant colony optimization [35], artificial neural networks, fuzzy logic, and learning-based methods [224].

2.2.2.7 Planning approaches for UAVs

We can also classify planning algorithms according to the cascaded planning/control structure of quadrotor UAV applications: high-level planning, position planning, and attitude planning. Attitude (orientation) planning develops an attitude plan for the quadrotor based on a given trajectory. Traditionally, proportional-integral-derivative or linear quadratic regulator controllers are implemented to minimize the trajectory tracking error. However, more sophisticated planning algorithms have been studied to achieve acrobatic maneuvers such as passing through a slanted, narrow gap or performing a barrel roll. Earlier, optimization-based methods were published for safe motion capture environments to perform flip [65] or slanted-gap passing [149] maneuvers. Afterward, researchers achieved the slanted-gap passing maneuver based on onboard sensing only [55]. Recently, learning-based approaches have also been applied to this task: Camci et al. [32] incorporate a reinforcement learning framework for swift maneuver planning of quadrotors, and Kaufmann et al. [101] developed a neural network planner to achieve aggressive maneuvers like the barrel roll or the matty flip using only onboard sensors and computation.

Position planning methods are implemented to generate position commands based on a high-level task of the quadrotor. Traditionally, sampling-based or potential field methods are employed for this task. In addition, trajectory optimization algorithms based on the properties of quadrotors, using intermediate target points [84] or full map knowledge [114], have been proposed to extend the traditional methods. The limitations of onboard sensor range have also been considered in the literature [152]. Another challenge for position planning is designing end-to-end planners that avoid the computational complexity of mapping the environment [31].

Finally, high-level planning is closer to task planning. Algorithms have been developed to plan tasks such as visiting multiple targets with limited resources [153] or shooting aesthetically pleasing videos [68].

2.2.2.8 UAV-UGV collaborative planning
Scenarios in which UAVs and UGVs collaborate as a team are widely studied due to their different but complementary skill sets: a UAV has a broader field of view, while a UGV has a higher payload capacity for the mission. Mueggler et al. [147] propose a mission planning algorithm for such a UAV-UGV team in a search and rescue scenario; in their work, the UAV is assumed to capture the region of interest fully. Alternatively, solutions have been proposed to construct the global map from partial observations [156]. More recently, Chen et al. [37] propose two networks for map generation and planning in a UAV-UGV collaboration scenario that avoid explicit map reconstruction and save computation time.


2.2.2.9 Tools
The open motion planning library (OMPL) [184] is an open-source collection of state-of-the-art sampling-based motion planning algorithms for robotics applications. It also provides primitives for sampling-based planning, such as collision checking, nearest-neighbor searching, and configuration space definitions. The library is mainly written in C++, but all functionality is also accessible through Python bindings. It is also available in ROS through MoveIt.
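As an illustration of how OMPL can be used from Python, the short sketch below sets up an RRT-Connect query in a 2D state space; it assumes that the OMPL Python bindings are installed and uses a placeholder validity checker.

from ompl import base as ob
from ompl import geometric as og

# Plan in a simple 2D real-vector state space with RRT-Connect (sketch only).
space = ob.RealVectorStateSpace(2)
bounds = ob.RealVectorBounds(2)
bounds.setLow(-10.0)
bounds.setHigh(10.0)
space.setBounds(bounds)

ss = og.SimpleSetup(space)
# Placeholder validity (collision) checker: everything outside a unit disc is free.
ss.setStateValidityChecker(ob.StateValidityCheckerFn(
    lambda s: (s[0] ** 2 + s[1] ** 2) > 1.0))

start, goal = ob.State(space), ob.State(space)
start[0], start[1] = -8.0, -8.0
goal[0], goal[1] = 8.0, 8.0
ss.setStartAndGoalStates(start, goal)
ss.setPlanner(og.RRTConnect(ss.getSpaceInformation()))

if ss.solve(1.0):            # plan for at most one second
    ss.simplifySolution()
    print(ss.getSolutionPath())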

2.2.3 Description of work performed so far

Simulation environment To satisfy the objectives mentioned above, we designed a field that has elevation changes, trees spread over the field in a regular structure, and different types of trees with different obstacle characteristics (see Fig. 8). We also took into account that the field contains other vehicles, such as tractors. Thanks to these design specifications, we obtain a realistic environment in which both ground and air vehicles can be simulated with their own requirements.

Figure 8: Agricultural field environment with a size of 300m length, 400m width, and 15m height.

After finalizing the environment design, we continued our work in Webots by analyzing the sensor pool supported by Webots and listing possible additional sensor use scenarios. In addition to the sensors required for autonomous flight, we decided to implement a camera and a rangefinder sensor. The collection, processing, and transfer of the sensor data need to be suitable for companion devices; among the middleware options provided by Webots, the robot operating system (ROS) was chosen. We first wrote a dedicated ROS package for the sensors: using the Webots-specific ROS commands for the camera, we developed a node to obtain the camera data and visualize it in Rviz, and we additionally mounted the rangefinder sensor on our air vehicle to obtain depth information. In parallel, we started with the implementation of an air vehicle provided by Webots. Since the Mavic 2 Pro was the only copter provided by Webots, its compatibility with future implementations had to be understood clearly. After an evaluation process, we concluded that we need a more flexible and customizable copter for our future real-world implementations. For these reasons, we decided to move to a third-party software stack, ArduPilot. The first attractive feature of this software is the capability of using its communication protocol, the micro air vehicle link (MAVLink), to access companion boards. ArduPilot also supports software-in-the-loop (SITL) simulation with Webots. Using this SITL capability with Webots, we created a network architecture to assign tasks to our aerial vehicle via MAVLink.

Figure 9: Camera and rangefinder windows. (a) Instant view from the camera; (b) rangefinder view in Webots.

At this point, we first assigned tasks using a ground control station, QGroundControl. After the successful integration of the ground control station, we decided to continue directly via MAVROS, the extendable MAVLink communication node for ROS that also provides a proxy for the ground control station. The trajectory given via QGroundControl is shown in Fig. 10a, and the corresponding tracked trajectory, visualized in Rviz, is shown in Fig. 10b.

Figure 10: Tracked trajectories. (a) Trajectory given to the air vehicle; (b) trajectory realization in Rviz.

After this step, we decided to discard QGroundControl from our network architecture and to work directly via MAVROS. To do so, we implemented our own ROS node that performs arming, mode selection, take-off, and waypoint assignment, and this node is now stable.
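For illustration, the sketch below shows how such a node could call the standard MAVROS services for arming, mode selection, and take-off, and then stream position setpoints as waypoints; it assumes a running MAVROS instance with the default service and topic names and is not a verbatim copy of our node.

#!/usr/bin/env python
# Minimal sketch of an arming/mode/take-off/waypoint node via MAVROS.
import rospy
from geometry_msgs.msg import PoseStamped
from mavros_msgs.srv import CommandBool, SetMode, CommandTOL

rospy.init_node("offboard_waypoint_node")
setpoint_pub = rospy.Publisher("/mavros/setpoint_position/local",
                               PoseStamped, queue_size=10)

rospy.wait_for_service("/mavros/set_mode")
set_mode = rospy.ServiceProxy("/mavros/set_mode", SetMode)
arm = rospy.ServiceProxy("/mavros/cmd/arming", CommandBool)
takeoff = rospy.ServiceProxy("/mavros/cmd/takeoff", CommandTOL)

set_mode(custom_mode="GUIDED")   # mode selection (ArduPilot GUIDED mode)
arm(True)                        # arming
takeoff(altitude=5.0)            # take-off to 5 m
rospy.sleep(10.0)

# Waypoint assignment: stream position setpoints at roughly 10 Hz.
waypoint = PoseStamped()
waypoint.header.frame_id = "map"
waypoint.pose.position.x, waypoint.pose.position.y, waypoint.pose.position.z = 10.0, 0.0, 5.0
rate = rospy.Rate(10)
while not rospy.is_shutdown():
    waypoint.header.stamp = rospy.Time.now()
    setpoint_pub.publish(waypoint)
    rate.sleep()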

Local replanning for the agricultural use-case The general block diagram of our methodology is given in Fig. 11. As explained before, the global plan is already defined for the UGV. The local planner is informed about the global plan through a moving target; it combines this moving target with the current depth image and generates local plans for the UAV and the UGV. It thereby helps the UAV-UGV team track the globally defined trajectory while avoiding obstacles. We aim to implement the local planner block as a learning-based end-to-end motion planner. We start from an existing work [31] that trains a similar end-to-end planner for a single UAV with deep reinforcement learning.


Figure 11: Collaborative local planning with partial observation from UAV.

In the following, we explain this method. A reinforcement learning problem is defined over a Markov decision process (MDP) formulation, and it is critical to express the real-world problem adequately through the elements of the MDP: the state $s_t$, the action $a_t$, and the reward signal $r_t$, defined at discrete time steps. For our problem, the state is multi-modal, containing the depth image and the vector representing the moving target point. Mathematically, the state is defined as

$$s_t = (I_{depth}, p_t), \qquad (6)$$

where $I_{depth}$ is the matrix representing the depth image and $p_t = [x_t, y_t, z_t]^T$ is the $3 \times 1$ vector representing the position of the target point with respect to the body frame along the three main axes. The action is defined as a local position plan consisting of 1m steps in the forward, lateral, and vertical directions:

$$a_t = (x_a, y_a, z_a), \qquad (7)$$

where $x_a \in \{0, 1\}$, $y_a \in \{-1, 0, 1\}$, and $z_a \in \{-1, 0, 1\}$ represent the next position step of the UAV with respect to the body frame along each of the three main axes. Note that the agent selects one of the 18 actions formed by the combinations of these elements; the action set is biased towards the forward direction, which is the moving direction and the direction of the front-facing depth camera. At each time step, the action computed by the policy network (see Fig. 12) is applied to the UAV, and the depth image and the next target point are obtained as the new state. The reward signal is based on the UAV's relative motion towards the target point and on collisions if they occur. The motion of the UAV is evaluated through the change in Euclidean distance between the target point and the UAV position over the time step, $\Delta d$, scaled by the inverse of the Euclidean distance at the next time step, $d_t$. Collisions and excessive deviations are punished with constant values. Mathematically, the reward signal is defined as [31]

$$r_t = \begin{cases} \dfrac{R_l}{d_t}, & \text{for } \Delta d_u < \Delta d \\[4pt] \dfrac{R_l + (R_u - R_l)\dfrac{\Delta d_u - \Delta d}{\Delta d_u - \Delta d_l}}{d_t}, & \text{for } \Delta d_l < \Delta d < \Delta d_u \\[4pt] \dfrac{R_u}{d_t}, & \text{for } \Delta d < \Delta d_l \\[4pt] R_{dp}, & \text{for excessive deviation} \\ R_{cp}, & \text{for collision,} \end{cases} \qquad (8)$$

where $R_l = 0$ and $R_u = 0.5$ are the reward boundaries, $\Delta d_l = -1$ and $\Delta d_u = 1$ are the reward saturation bounds, and $R_{dp} = -0.5$ and $R_{cp} = -1$ are the punishments for excessive deviation and collision, respectively. This reward logic enables the agent to avoid obstacles while quickly navigating to the goal.
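For clarity, the reward of Eq. (8) can be written compactly as in the following sketch; the variable names mirror the symbols above, and the deviation and collision flags are assumed to be provided by the simulator.

# Sketch of the reward in Eq. (8). delta_d is the change in distance to the
# target over the time step, d_t the distance to the target at the next time step.
R_L, R_U = 0.0, 0.5          # reward boundaries
DD_L, DD_U = -1.0, 1.0       # reward saturation bounds
R_DP, R_CP = -0.5, -1.0      # penalties for excessive deviation / collision

def reward(delta_d, d_t, deviated, collided):
    if collided:
        return R_CP
    if deviated:
        return R_DP
    if delta_d > DD_U:                       # moving away from the target
        return R_L / d_t
    if delta_d < DD_L:                       # moving quickly towards the target
        return R_U / d_t
    # linear interpolation between the two reward boundaries
    return (R_L + (R_U - R_L) * (DD_U - delta_d) / (DD_U - DD_L)) / d_t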

The policy network $\pi(s_t)$ structure for this task is given in Fig. 12. The network feeds the depth image through three convolutional layers with 32, 64, and 64 filters, filter sizes of 8, 4, and 4, and strides of 4, 2, and 1, respectively. The convolutional layers are followed by a single fully-connected layer with 512 hidden neurons. The vector of the moving target position is fed to two fully-connected layers with 64 hidden neurons each. The two branches are concatenated into a fully-connected layer with 124 hidden neurons, from which the action is generated. The algorithm trains this policy network with the given settings using a deep reinforcement learning method called Deep Q-network (DQN).

Figure 12: Policy network structure of the end-to-end local planner. The depth-image branch uses ReLU activations, the moving-target branch uses tanh activations, and the two branches are concatenated before the action output.
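For concreteness, the following PyTorch sketch mirrors the architecture described above; the deliverable does not prescribe a deep learning framework, so the layer names, the single-channel depth input, and the use of a lazy linear layer are illustrative choices.

import torch
import torch.nn as nn

class LocalPlannerQNet(nn.Module):
    """Sketch of the two-branch Q-network described above (18 discrete actions)."""
    def __init__(self, n_actions=18):
        super().__init__()
        # Depth-image branch: three conv layers (32/64/64 filters, 8/4/4 kernels, 4/2/1 strides)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
        )
        # Moving-target branch: two fully-connected layers with 64 neurons each
        self.target = nn.Sequential(
            nn.Linear(3, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        # Concatenation, one fully-connected layer with 124 neurons, then the action output
        self.head = nn.Sequential(
            nn.Linear(512 + 64, 124), nn.Tanh(),
            nn.Linear(124, n_actions),
        )

    def forward(self, depth_image, target_vec):
        z = torch.cat([self.conv(depth_image), self.target(target_vec)], dim=1)
        return self.head(z)   # Q-values over the 18 discrete actions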

UGV integration For our use-case, we consider collaboration between the UAV and the UGV. This setting implies some changes in the MDP formulation, such as selecting UGV actions and redesigning the reward function. In the following, we explain these changes. We extend the state with the UAV position with respect to the UGV in order to include the UGV in the problem. Mathematically, the state is now defined as

$$s_t = (I_{depth}, p_t, p_{UAV-UGV}), \qquad (9)$$

where $p_{UAV-UGV}$ is the $3 \times 1$ vector representing the UAV position with respect to the UGV. In this problem, the UAV's objective is to lead the UGV, flying ahead of it at a certain altitude; therefore, the UAV encounters obstacles before the UGV does. The UAV's action vector is updated as

$$a_t = (x_a, y_a), \qquad (10)$$

since the altitude is stabilized with respect to the UGV. Moreover, the UGV's actions are selected according to a simple rule: if the UAV deviates from the route by more than a certain distance, the UGV follows it,

$$a_{UGV} \leftarrow \begin{cases} \text{Normal operation}, & \text{if } |y_t| < \Delta y_l \\ \text{Obstacle avoidance}, & \text{if } |y_t| > \Delta y_l, \end{cases} \qquad (11)$$

where $\Delta y_l$ is the distance limit to the UGV. We should also update the reward signal to define the objective of the UAV; in particular, we discount the reward if the UAV moves away from the UGV:

$$r_t = \begin{cases} R_{ugv}\dfrac{R_l}{d_t}, & \text{for } \Delta d_u < \Delta d \\[4pt] R_{ugv}\dfrac{R_l + (R_u - R_l)\dfrac{\Delta d_u - \Delta d}{\Delta d_u - \Delta d_l}}{d_t}, & \text{for } \Delta d_l < \Delta d < \Delta d_u \\[4pt] R_{ugv}\dfrac{R_u}{d_t}, & \text{for } \Delta d < \Delta d_l \\[4pt] R_{dp}, & \text{for excessive deviation} \\ R_{cp}, & \text{for collision,} \end{cases} \qquad (12)$$

where $R_{ugv} = 1/\|x_{uav-ugv}\|_2$ is a discount factor inversely proportional to the deviation of the UAV position from the point two meters in front of the UGV.
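The UGV switching rule of Eq. (11) and the distance-based discount $R_{ugv}$ of Eq. (12) translate into a few lines of code; the sketch below reuses the reward helper from the previous listing, assumes the UGV's forward axis is x, and uses a hypothetical value for the lateral limit and a small numerical safeguard against division by zero.

import numpy as np

DY_L = 1.0  # lateral distance limit (hypothetical value)

def ugv_action(y_t):
    # Eq. (11): the UGV switches to obstacle avoidance when the UAV deviates laterally.
    return "normal_operation" if abs(y_t) < DY_L else "obstacle_avoidance"

def collaborative_reward(delta_d, d_t, deviated, collided, p_uav_ugv):
    # Eq. (12): discount the distance-based reward terms by R_ugv, which is
    # inversely proportional to the UAV's deviation from a point 2 m ahead of the UGV.
    r = reward(delta_d, d_t, deviated, collided)
    if collided or deviated:
        return r                      # the constant penalties are not discounted
    lead_point = np.array([2.0, 0.0, 0.0])          # 2 m in front of the UGV (UGV frame)
    r_ugv = 1.0 / (np.linalg.norm(np.asarray(p_uav_ugv) - lead_point) + 1e-6)
    return r_ugv * r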

2.2.4 Performance evaluation

We implement an OpenAI gym wrapper to create our DRL environment, communicating with Webots and ArduPilot through ROS in Python 3. The ROS communication flow is shown in Fig. 13. Our gym ROS node commands the quadrotor through MAVROS and uses the odometry information it provides. The depth sensor and collision information are obtained directly from the robot object in Webots. The gym node also generates the target point on the global trajectory. In our initial experiments, the global trajectory is a straight line, and the target point is selected 5m ahead of the quadrotor's current position projected onto the global path.
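The structure of this wrapper can be summarized with the following skeleton; the bridge object is a placeholder that hides the MAVROS/Webots communication, and the observation shapes are illustrative.

import gym
import numpy as np

class AgriReplanEnv(gym.Env):
    """Skeleton of the gym wrapper around the Webots/ArduPilot/ROS stack.
    `bridge` is a placeholder object that hides the MAVROS/Webots communication
    (sending position steps, reading the depth image, odometry and collision flag)."""

    def __init__(self, bridge, lookahead=5.0):
        self.bridge = bridge
        self.lookahead = lookahead                      # target point 5 m ahead on the global path
        self.action_space = gym.spaces.Discrete(18)     # the position steps of Eq. (7)
        self.observation_space = gym.spaces.Dict({
            "depth": gym.spaces.Box(0.0, 1.0, shape=(64, 64, 1), dtype=np.float32),
            "target": gym.spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32),
        })

    def step(self, action):
        self.bridge.send_position_step(action)                     # command the UAV via MAVROS
        depth, collided = self.bridge.read_depth_and_collision()   # from the Webots robot object
        target = self.bridge.target_on_global_path(self.lookahead)
        r = self.bridge.compute_reward(target, collided)           # e.g. Eq. (8)
        return {"depth": depth, "target": target}, r, collided, {}

    def reset(self):
        self.bridge.respawn()                                      # reset simulation and controller
        depth, _ = self.bridge.read_depth_and_collision()
        return {"depth": depth, "target": self.bridge.target_on_global_path(self.lookahead)}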

Figure 13: Visualization of the ROS nodes and topics.

We initially conduct two sets of experiments, one in an obstacle-free environment and the other with a single obstacle. The first experiment investigates the trajectory-following skills of our algorithm. We train our agent on an 80m straight trajectory, and the episode ends at the end of the course. The depth image and the related part of the neural network are not used in this experiment, considering its objective. We observed that the agent learns a policy biased towards forward motion because of the environment's forward-biased nature. To overcome this problem, we updated the Euclidean distance calculation in the reward function with weights, defining the distance function as

$$D_w(p_d) = \sqrt{p_d^T \begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix} p_d}, \qquad (13)$$

where $D_w(\cdot)$ is the weighted distance function and $p_d = p_1 - p_2$ is the difference between the points of interest. Thereby, the importance of the lateral deviation increases in the reward, and the agent becomes more robust against deviating from the global path. With the previously explained settings, the reward collected during training in the obstacle-free environment is plotted in Fig. 14a. The trained policy is then tested in the environment; the trajectory it follows is shown in Fig. 15. During the test, we force the agent to take three consecutive right or left actions to create a disturbance. The policy returns to the global trajectory as expected.
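For completeness, the weighted distance of Eq. (13) is a one-liner:

import numpy as np

W = np.diag([1.0, 4.0, 1.0])   # weight matrix of Eq. (13)

def weighted_distance(p1, p2):
    """Weighted Euclidean distance D_w between two points (Eq. 13)."""
    p_d = np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float)
    return float(np.sqrt(p_d @ W @ p_d))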

Figure 14: Mean episode reward collected during training in (a) the obstacle-free environment and (b) the environment with a single obstacle.


Figure 15: Trajectory followed under disturbances. The drone is initialized at the starting point and approaches the global trajectory even though some actions are forced as disturbances.

In the second experiment, we train the agent in the environment with a single obstacle. In Fig. 16, an instance from training in this environment is shown, with the drone and the rectangular-shaped obstacle. It is a simple scenario that tests the algorithm in all aspects, including trajectory following and obstacle avoidance. In this case, the global trajectory is shortened to 10m. The mean reward during training is plotted in Fig. 14b. The trained policy is then deployed in the simulation environment; the resulting trajectory is presented in Fig. 17. The agent takes a right-forward action based on the depth image to avoid the obstacle.

Figure 16: A time instance from training in Webots. The drone is in front of the obstacle.


Figure 17: Trajectory followed to avoid the obstacle. The drone is initialized at the starting point and takes a right-forward action to avoid the obstacle.

2.2.5 Future work

In this work, an end-to-end planner is trained with DRL for local replanning in the agricultural use-case. We developed an agricultural simulation environment in Webots and presented the initial results of our algorithm. We continue working on this task: we will diversify our experiments and include the UGV in the problem.


3 Deep Navigation

3.1 Learning Kinematic Feasibility for Mobile Manipulation

3.1.1 Introduction and objectives

In recent years, approaches to improving the capabilities of robotic platforms to perform flexible and complex tasks in both industrial and dynamic domestic environments have achieved impressive results [7, 17, 183, 181]. So far, however, most of these approaches separate navigation and manipulation due to the difficulties in planning the joint movement of the robot base and its end-effector (EE). Typically, the tasks that a robot is expected to perform are linked to conditions in the task space, such as poses at which handled objects can be grasped, orientations that objects should maintain, or entire trajectories that must be followed. While there are approaches to position a manipulator such that it fulfills various task constraints with respect to the robot's kinematics, based on inverse reachability maps (IRM) [154], performing such tasks while moving the base remains an unsolved problem. Classical planning approaches circumvent kinematic issues implicitly by exploring paths in the robot's configuration space [27]. However, this creates a number of new difficulties. First, the constraints must be transferred from the task space to the robot-specific configuration space, requiring expert knowledge on the task, the robot, and the environment. Furthermore, the execution of pre-planned configuration-space movements in dynamic environments is not easy, as minor errors in the execution of poses in the configuration space can lead to large deviations in the task space and, due to changes in the dynamic scene, adjustments to the movement might become necessary, requiring a complete re-planning in the configuration space.

Figure 18: Given the robot configuration, a velocity $\vec{v}_{ef}$ for the end-effector and the desired goal for the end-effector, the robot learns a corresponding base velocity $\vec{v}_b$ in a reinforcement learning setting to maintain kinematic feasibility throughout the motion execution.

We present a method to generate a kinematically compatible movement for the base of a mobile robot while its end-effector executes motions generated by an arbitrary system in task space in order to perform a certain action. This decomposes the task into the generation of trajectories for the end-effector, which is usually defined by the task constraints, and for the robot base, which should handle kinematic constraints and collision avoidance. This separation is beneficial for many robotic applications: first, it results in a high modularity where action models for the end-effector can easily be shared among robots with different kinematic properties without having to adapt the end-effector behavior, as the kinematic constraints are handled entirely by the base motion control. Second, if the learned base policy is able to generalise to arbitrary end-effector motions, the same base policy can be used on new, unseen tasks without expensive retraining. And lastly, learning the full task end-to-end is a complex long-horizon problem. In contrast,

we show that we can directly leverage the kinematic feasibility as a dense reward signal across several platforms without the need for extensive reward shaping. In prior work [204], we addressed the kinematic feasibility of a joint base and end-effector motion by treating the inverse reachability constraint as an obstacle avoidance problem. While the approach achieved good results in mobile manipulation actions on a PR2 robot, it required substantial robot-specific design, and the approximations made to model the inverse reachable space restricted the robot's movement further than necessary. Instead of explicitly modeling the kinematic abilities of the robot, we propose to directly learn a feasible motion of the robot base respecting a given motion of the end-effector, and we show that the learned policy achieves better results. We illustrate our envisioned system in Fig. 18.

3.1.2 Summary of state of the art

In general, there are two distinct methods to ensure kinematic feasibility while performing manipulation actions: on the one hand, planning frameworks that plan trajectories for the robot in joint space and thereby only explore kinematically feasible paths [27, 26]; on the other hand, inverse reachability maps [192] that seek a good positioning for the robot base given the task constraints [154]. While combinations of the two methods exist [119], it remains a hard problem to integrate kinematic feasibility constraints into task-space mobile motion planning. In this context, [204] proposed a geometric description of the inverse reachability and addressed the kinematic feasibility of an arbitrary gripper trajectory as an obstacle avoidance problem. On a high level, we use a conceptually similar setup and compare our results to this approach. By using RL to directly learn the base controls, the resulting base motions are not limited by geometric approximations of the reachability. We also do not rely on robot-specific expert knowledge, but show that our approach generalises to different robots and platforms. While RL has shown a lot of promise in manipulation tasks, it has only recently been incorporated into mobile manipulation tasks. RelMoGen proposes to learn high-level subgoals through RL and to execute them with traditional motion generators to simplify the exploration problem [210]. It focuses on tasks in which the exact gripper trajectories are not relevant and learns to move either the gripper or the base. In contrast, we are explicitly interested in task-specific end-effector trajectories and aim to simplify the joint motion generation for robot end-effector and base by using RL to ensure kinematic feasibility. [198] use RL to solve a mobile picking task by learning to jointly control both a robot's base and end-effector. While their focus is on learning a policy for one specific task, we address the more general problem of maintaining kinematic feasibility in arbitrary mobile manipulation actions. By introducing a given planner for arbitrary end-effector trajectories, our approach decouples the RL agent's action space from the task the end-effector performs, thus allowing a generalisation to different tasks. While [198] require calculating a distance-based reward at every step, we leverage inverse kinematics for the rewards, which does not rely on ground-truth distances. Recently, the use of RL has also been explored to calculate forward and inverse kinematics of complex many-joint robot arms [73]. While such approaches are interesting for robot systems with a large number of degrees of freedom (up to 40 DoF [73]), in our applications (7-DoF robot arms) the inverse kinematics can still be solved numerically. Goal-conditional RL [97, 169] takes both the current state and a goal as input to predict the actions to arrive at this goal. Hierarchical methods [186, 5, 96] commonly reduce the complexity of long-horizon tasks by splitting them into a subgoal proposal and a goal-conditional policy. In contrast,

we reduce the complexity of the task by assuming a given planner for a subset of the agent (the gripper) and learn to control the remaining degrees of freedom (the base) to achieve all points along its proposed end-effector trajectory. This shifts the burden from the end-effector planner to the base policy and allows us to use simple methods to generate the end-effector trajectories.

3.1.3 Description of work performed so far

Mobile manipulation tasks require complex trajectories in the joint space of arm and base over long horizons. We decompose the problem into a given, arbitrary motion for the end-effector and a learned base policy. This allows us to readily transform end-effector motions into mobile applications and to strongly reduce the burden on the end-effector planner, which can now be reduced to a fairly simple system. We then formulate this as a goal-conditional RL problem and show that we can leverage kinematic feasibility as a simple, dense reward signal instead of relying on either sparse or extensively shaped reward functions. Our proposed system consists of three main components: a planner for the end-effector, a learned RL policy for the base, and a standard inverse kinematics (IK) solver for the manipulator arm that provides us with the rewards. An overview of the system is shown in Fig. 19.

Figure 19: We decompose mobile manipulation tasks into two components: an end-effector (EE) planner and a conditional RL agent that controls the base velocities. This enables us to produce a dense reward solely based on the kinematic feasibility, calculated by a standard inverse kinematics (IK) solver.

3.1.3.1 End-effector Trajectory Generation

To generate end-effector trajectories, we assume access to an arbitrary planner that takes as input the current end-effector pose and a goal state g, specified by an end-effector pose in Cartesian space. At every time step, this planner outputs the next velocity command $v_{ef}$ for the end-effector. During training, we pass the last planned EE pose instead of the current one to the planner, so as to prevent the RL agent from influencing the shape of the overall EE-trajectory. At test time, the EE-planner can be replaced by an arbitrary system, including both closed- and open-loop approaches.

3.1.3.2 Learning Robot Base Trajectories

Given an end-effector motion dynamic, our goal is to ensure that the resulting EE-poses remain kinematically feasible at every time step. We can formulate this as a goal-conditioned reinforcement learning problem. We define a finite-horizon Markov decision process (MDP) $M = (S, A, P, r, \gamma)$ with state and action spaces $S$ and $A$, transition dynamics $P(s_{t+1}|s_t, a_t)$, a reward function $r$, and a discount factor $\gamma$. The objective is to find the policy $\pi(a|s)$ that maximises the expected return

$E_\pi[\sum_{t=1}^{T} \gamma^t r(s_t, a_t)]$. We can extend this to a goal-conditional formulation [96] to learn a policy $\pi(a|s,g)$ that maximises the expected return $r(s_t, a_t, g)$ under a goal distribution $g \sim G$ as $E_\pi[\sum_{t=1}^{T} \gamma^t r(s_t, a_t, g)]$. At every step, the agent observes the current state $s_t$, consisting of the current base and end-effector pose and the arm joint positions, as well as the end-effector goal $g$ and the next planned gripper velocities. It then learns a policy $\pi(s,g)$ for the base velocity commands $v_b$. We propose two different parametrizations of the base velocities, given the agent's actions $a$: (i) directly learn the base velocities, $v_b = a$; (ii) learn the difference in velocities relative to the planned end-effector velocity, $v_b = v_{ef} + a$. The base rotation is always learned directly. We also experimented with learning the parameters of a geometric modulation as introduced by [204], but did not observe any advantages in that formulation.
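A minimal sketch of the observation construction and of the two action parametrizations could look as follows (array layouts and names are illustrative, assuming an omnidirectional base with a three-component action):

import numpy as np

def build_observation(base_pose, ee_pose, joint_positions, ee_goal, planned_ee_vel):
    # State s_t: current base and EE pose, arm joint positions, EE goal and the
    # next planned gripper (EE) velocities, flattened into a single vector.
    return np.concatenate([np.ravel(np.asarray(x, dtype=float)) for x in
                           (base_pose, ee_pose, joint_positions, ee_goal, planned_ee_vel)])

def base_velocity(action, planned_ee_vel, relative=True):
    # Two parametrizations of the learned base velocity, action = [a_x, a_y, a_theta]:
    #   (i)  v_b = a             (direct velocities, relative=False)
    #   (ii) v_b = v_ef + a      (relative to the planned EE velocity, relative=True)
    # The base rotation (last component) is always learned directly.
    action = np.asarray(action, dtype=float)
    v_ef = np.asarray(planned_ee_vel, dtype=float)[:2]
    v_lin = v_ef + action[:2] if relative else action[:2]
    return np.concatenate([v_lin, action[2:3]])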

3.1.3.3 Kinematic Feasibility

By separating EE and base motions, we can now directly leverage the kinematic feasibility of the EE-motions to train the base policy. We convert this into a straightforward reward function that simply penalises whenever a point along the trajectory is not feasible. This provides us with a dense reward signal without having to rely on ground-truth distances or extensive reward shaping. It naturally arises from framing mobile manipulation as a modular problem and can be evaluated without adaptation across platforms, as we will show in our experiments. To ensure smooth and economical behaviour, we also add a regularisation term to keep the actions small whenever possible. The overall reward function can be expressed as follows:

$$r(s, g, a) = -\mathbf{1}_{!\mathrm{kin}} - \lambda \|a\|^2, \qquad (14)$$

where $\mathbf{1}_{!\mathrm{kin}}$ is an indicator function evaluating to one when the next gripper pose is not kinematically feasible and $\lambda$ is a hyperparameter weighting the norm of the actions. To evaluate the kinematic feasibility, we first calculate the next desired end-effector pose from the current pose and velocities. We then use standard kinematic solvers to evaluate whether this new pose is feasible. Essentially, we use the known kinematic chains of the manipulator arms to move the reward generation out of the otherwise potentially unknown environment dynamics. Note that the impact of the regularisation differs for our two proposed action parameterisations: when directly learning base velocities, it incentivises overall small actions, while when learning relative velocities it incentivises staying close to the gripper velocities. We optimise this objective with recent model-free RL algorithms that have been shown to be robust to noise and overestimation, namely TD3 [61] and SAC [76]. An episode ends when either we are close to the goal or too many kinematic failures have occurred.
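In code, the reward of Eq. (14) only requires an IK feasibility check; the following sketch assumes a user-provided ik_feasible predicate wrapping a standard IK solver and, for brevity, treats the pose as a plain vector.

import numpy as np

LAMBDA = 0.1   # action regularisation weight (hypothetical value)

def kinematic_reward(current_ee_pose, ee_velocity, dt, action, ik_feasible):
    """Reward of Eq. (14): -1 for a kinematically infeasible next EE pose,
    plus a penalty on the squared action norm. `ik_feasible` is a placeholder
    predicate wrapping a standard IK solver."""
    next_ee_pose = np.asarray(current_ee_pose, dtype=float) + np.asarray(ee_velocity, dtype=float) * dt
    infeasible = 0.0 if ik_feasible(next_ee_pose) else 1.0
    return -infeasible - LAMBDA * float(np.sum(np.square(action)))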

3.1.3.4 Training task

The starting point for modularising end-effector and base control was the ability to learn a base policy that enables a large number of task-specific end-effector trajectories. In many cases, we will not know all tasks at training time. To be able to generalise to diverse end-effector motions, the agent should ideally observe a wide range of different relative EE-positions and goals. To do so, we train the agent on a random goal-reaching task: we first initialize the robot in a random joint configuration and then sample random end-effector goals within a distance of one to five meters around the robot base.


To generate end-effector motions, we use a linear dynamic system in which the end-effector velocity at each step is generated as the difference between the current pose and the given goal pose, constrained by a maximum and a minimum velocity. By training with a very simple EE-planner that does not take the current joint configuration into account in any way, we shift the burden of generating kinematically feasible movements to the base policy.
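For illustration, this linear dynamic system used as the EE-planner during training can be sketched as follows (the velocity limits and gain are illustrative values):

import numpy as np

def linear_ee_planner(ee_pose, goal_pose, v_min=0.05, v_max=0.3, gain=1.0):
    """Sketch of the linear dynamic system: the commanded EE velocity points from the
    current (last planned) pose towards the goal, clipped between v_min and v_max."""
    direction = np.asarray(goal_pose, dtype=float) - np.asarray(ee_pose, dtype=float)
    dist = np.linalg.norm(direction)
    if dist < 1e-6:
        return np.zeros_like(direction)
    speed = np.clip(gain * dist, v_min, v_max)
    return speed * direction / dist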

3.1.4 Performance evaluation

We evaluate our approach on multiple mobile robot platforms in a series of analytical and simulated experiments.

3.1.4.1 Experimental setup

Robot platforms We train agents across three different robotic platforms that differ considerably in their kinematic structure and base motion abilities. The PR2 is equipped with a 7-DoF arm mounted on an omnidirectional base, giving it high mobility and kinematic flexibility. TIAGo is also equipped with a 7-DoF arm, and we additionally use the height adjustment of its torso. For the base motion it uses a differential drive, restricting its mobility compared to the PR2. The Toyota HSR also has an omnidirectional base, but the arm is limited to 5 DoF, including the height-adjustable torso. Given the low flexibility of the HSR arm, we consider EE-poses within 10 cm and within a rotational distance of 0.05 of the desired pose as kinematically feasible. This leeway does not apply to the final goal. To minimize the use of this 'slack', we additionally penalize the sum of the squared distance and rotational distance to the desired EE-pose, scaled into the range $[-1, 0]$ (i.e., smaller than or equal to the penalty for kinematically infeasible poses). The action space for these platforms is continuous, consisting of either one (diff-drive) or two (omnidirectional) directional velocities $v_{b,\{x,y\}}$ and an angular velocity $v_{b,\theta}$.

Tasks For the virtual evaluation, we place the objects in a room around the robot. In all tasks, the robot starts randomly within a 1.5x1.5m square in the center of the room, rotated between $[-\pi/2, \pi/2]$ relative to the first end-effector goal, and in a random joint configuration sampled from the full possible configuration space, including difficult and unusual poses. We construct five tasks. A general goal reaching (ggr) task in which the goals are selected randomly as in the training phase; as the kinematics become very restrictive towards the edges of the height range, we also analyse the results for initial configurations and goals that are restricted to more common heights, which we refer to as ggr restr. A pick&place task in which the robot has to grasp an object randomly located on the edge of a table, move it to a goal table randomly located at the wall next to it, and place it down. We use the linear system to generate EE-motions by sequentially combining four goals relative to the object: slightly in front of the object, at the object, in front of the goal location, and at the goal location. To test whether the decomposition into base and end-effector generalises to different EE-motions, we then construct two more tasks from an unseen imitation learning system developed in [203]. These motions are learned from a human teacher and encoded in a dynamic system that follows a demonstrated hand trajectory to manipulate a certain object. As a result, the motions can differ substantially from the linear system used during training. We use models to grasp and open

a cabinet door as well as to grasp and open a drawer. The corresponding motion models are adapted to the given poses of the handled objects.

Baselines We construct separate baselines for the linear system and the imitation system. For the linear system, the baseline replicates the same robot base motion in the x-y direction as the end-effector. For TIAGo, this agent also takes into account the limitations of the differential drive. While simple, this strategy already removes some of the main difficulties by keeping a fixed distance between EE and body and by preventing the gripper from having to pass around or through the robot body. The baseline for the imitation system follows base motions that were learned together with the end-effector, adapted to the different lengths of the robot arms. We refer to these baselines as PR2 bl, Tiago bl, and HSR bl. We also compare against the geometric modulation for the PR2 presented in [204], termed PR2 gm, which approximates the inverse reachability and modulates the base velocities to stay within allowed poses. We then evaluate our agents' ability to achieve the described tasks without any task-specific fine-tuning. For each platform, we train models with ten different seeds and average the results over 50 episodes per task. For each task, we report the share of trajectories executed without a single kinematic failure. Note that this is a fairly strict metric, and in many cases the task can still be completed successfully even with a few failures. For this reason, we also report the share of episodes that never deviate more than 5cm from the EE-motions (in brackets).

3.1.5 Analytical evaluation

We train the model on analytically generated trajectories, i.e., we integrate the system step by step to generate the poses of the robot end-effector and base, and at each step we evaluate the kinematic feasibility of the generated poses without a physical simulation of the robot controllers. We evaluate this system at a step size corresponding to a frequency of 10Hz. We then evaluate the trained models across all tasks. Tab. 5 summarises the results. On the PR2, the baseline replicating the EE-velocities is able to complete between 60% and 73% of the linear motion tasks successfully. By simply 'sliding' towards the goal, the manipulation task is kept relatively static and most of the difficulty is transferred to the IK solver. But on the more constrained platforms, performance drops to between 0% and 24%, both for the 5-DoF arm of the HSR and for the diff-drive of TIAGo, which has to 'circle' towards the goal. This illustrates the need for base and arm to work together to achieve these motions. On the door and drawer tasks, the PR2 and TIAGo are able to follow some of the imitated motions, but fail in the majority of cases, with between 9% and 32% of the episodes fully successful. The HSR is completely unable to follow these motions. The ellipse modulation approach is not able to improve significantly upon the baseline, indicating that the velocities of the baseline that serve as its input already remove most of the difficulty from the task. Failure cases for the ellipse approach include starting poses that lie outside the approximations as well as situations in which the EE comes close to the edges of the approximation. In contrast, our approach of directly learning the base velocities from the kinematic feasibility reward achieves high success rates across all robots and tasks, solving the goal reaching task with 90.0% (PR2), 71.6% (TIAGo) and 75.2% (HSR) success. This translates into near-perfect performance on the pick&place task, with 97.0% success for the PR2, 91.4% for TIAGo and 90.2% for the HSR. Looking at the imitation tasks, we find that these results generalise to the unseen motions, with all platforms achieving over 90.6% of all episodes without a single kinematic failure. This indicates a number of things: first, our training task is comparably difficult, making it a good training ground for general tasks. This makes sense, as the goal distribution encompasses the full range of possible heights, in which the kinematics can become quite restrictive, as well as short distances to the goal. Focusing on a workspace restricted to common heights, performance increases for both TIAGo (+8.8%) and HSR (+4.9%), while not changing significantly for the PR2 (-1.4%). Secondly, we find no drop in performance when following the unseen imitation learning system, showing the agent's ability to enable general movements of the end-effector. Qualitatively, we find that the resulting policies take reasonable and economical paths, such as rotating around the closest angle and seeking robust relative EE-poses.


             linear dynamic system                       imitation learning system
Agent        ggr          ggr restr    pick&place     door         drawer
PR2 bl       60.8 (65.0)  70.8 (72.8)  72.6 (76.8)    30.2 (33.6)  31.6 (35.4)
PR2 gm       64.6 (69.6)  68.6 (74.4)  74.4 (78.6)    31.6 (38.0)  28.2 (34.0)
PR2          90.2 (91.2)  88.8 (90.6)  97.0 (97.4)    94.2 (95.4)  95.4 (95.4)
Tiago bl     21.6 (24.6)  23.6 (26.8)  12.0 (13.8)     9.2 (10.0)  28.6 (31.8)
Tiago        71.6 (73.4)  80.2 (83.0)  91.4 (92.2)    95.3 (96.9)  94.9 (95.3)
HSR bl        5.4 (7.4)    7.6 (10.4)   0.0 (3.2)      0.0 (0.0)    0.0 (0.0)
HSR          75.2 (69.8)  80.1 (72.6)  93.4 (90.2)    91.2 (87.0)  90.6 (88.6)

Table 5: Performance in the analytical evaluation as the share of successfully executed episodes with zero kinematically infeasible EE-poses and the share of episodes that never deviate more than 5cm from the EE-motion (in brackets). We evaluate each task over 50 episodes for 10 random seeds.

3.1.6 Motion execution in simulation

We proceed to analyse how well the generated motions can be executed by running the system in the Gazebo simulator [106]. This includes both a simulation of the robot controllers and a physical simulation of the environment. We run all agents at a frequency of 50Hz. The RL agent operates on low-level robot state information and uses small three-layer neural networks; as a result, this frequency can easily be achieved on standard CPUs. Differences to the analytical environment include previously unmodelled accelerations, inertias and constraints, the execution time of the arm movements, as well as collisions with the physical objects. We find that particularly large inertias for the PR2 cause the EE to outrun the base. To account for this, we slow down the EE-motions by a factor of two for all PR2 experiments in this section. Note that this was not needed in the real-world experiments. As for the analytical evaluation, we evaluate our approach on all combinations of agents and tasks. The results are summarized in Tab. 6. While all approaches experience a drop in performance, we find this impact to be relatively small for the learned base motions, showing that they can be readily executed on all platforms. The average difference in performance is 8.4% (PR2), 8.0% (TIAGo) and 3.0% (HSR) across all tasks. We identify two main causes for this.


             Linear Dynamic System                       Imitation Learning System
Agent        ggr          ggr restr    pick&place     door         drawer
PR2 bl       48.2 (53.0)  56.6 (62.2)  35.4 (41.2)     3.6 (4.8)    3.2 (4.2)
PR2 gm       17.0 (22.2)  19.4 (22.8)  20.8 (24.6)    21.2 (31.4)   0.1 (4.8)
PR2          80.2 (83.6)  84.4 (88.8)  85.6 (88.8)    88.0 (92.8)  85.4 (91.6)
PR2 hs       87.0 (88.0)  89.0 (90.8)  92.0 (93.6)    94.2 (95.0)  90.4 (90.6)
PR2 nObj     -            -            87.6 (92.4)    85.6 (90.0)  85.4 (90.8)
Tiago bl      7.0 (7.8)    9.0 (11.0)   2.2 (2.8)      1.6 (3.0)    6.8 (7.8)
Tiago        65.4 (70.4)  74.6 (81.2)  88.2 (91.2)    81.6 (86.8)  83.8 (88.0)
Tiago nObj   -            -            87.4 (91.0)    93.4 (95.4)  92.8 (94.2)
HSR bl        2.6 (0.0)    2.6 (0.1)    0.0 (0.0)      0.0 (0.0)    0.0 (0.0)
HSR          64.4 (54.0)  70.3 (59.4)  85.6 (69.6)    87.2 (78.0)  90.2 (83.8)
HSR nObj     -            -            92.2 (78.2)    87.2 (76.6)  87.4 (82.4)

Table 6: Performance in the Gazebo physics simulator as the share of successfully executed episodes with zero kinematically infeasible EE-poses and the share of episodes that never deviate more than 5cm from the EE-motion (in brackets).

3.1.6.1 Obstacles

A main limitation of the current approach is that the base agent does not take obstacles into account. To measure the impact of collisions with the physical objects, we execute the same motions with the objects removed from the scene, labelled nObj in Tab. 6. We find that this explains the largest part of the difference to the analytical environment for TIAGo, while the PR2 and HSR do not seem to collide much. The gap to the analytical environment reduces to an average of 7.1% (PR2), 2.6% (TIAGo) and 2.8% (HSR). Overall, collisions account for less than 8.4% across all tasks, further indicating that the resulting base motions are economical and do not unnecessarily move or rotate around the EE-goals. We leave obstacle avoidance to future work, but we believe our formulation to be well suited to be extended in this direction by learning to locally modulate the feasible trajectories around obstacles.

3.1.6.2 Inertia

For the PR2, we find that inertias in the base movement often cause failures at the beginning of the motion when the arm starts in an already stretched-out position and then has to move further away from the base. We extend our model to give the PR2 a 'head start' of three seconds to position itself before we begin the EE-motion (PR2 hs). To do so, we use the same trained model, i.e., the agent did not see this head start during training. This simple adaptation closes the gap to the analytical environment to as little as 2.6%. Alternative approaches to account for large inertias would be to include acceleration limits in the linear motion system or to learn a scaling term that would allow the agent to influence the EE-velocities.

3.1.7 Future work

We present an approach to generate a suitable robot base motion for mobile platforms given an end-effector motion generated by an arbitrary system. Leveraging state-of-the-art RL methods, we show that we can achieve high success rates across different robot platforms for both seen and unseen end-effector motions. This demonstrates the potential of this approach to enable the

application of any system that generates task-specific end-effector motions across a diverse set of mobile manipulation platforms. While we achieve very good results in terms of kinematic feasibility during trajectory generation, the approach so far does not consider collisions with the environment. In a next step, we plan to run these experiments on a real PR2. In future work, we plan to incorporate obstacles and object detection into the training to enable the agent to also avoid collisions while moving the base. Furthermore, we observed occasional undesired configuration jumps in the arms with high degrees of freedom, which we aim to avoid by including a corresponding penalty in the reward function.

3.2 Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning

3.2.1 Introduction and objectives

Vision-based navigation is essential for a broad range of robotic applications, from industrial and service robotics to automated driving. The wide-spread use of this technique will be further stimulated by the availability of low-cost cameras and high-performance computing hardware. Conventional vision-based navigation methods usually build a map of the environment and then use planning to reach the goal. They often rely on precise, high-quality stereo cameras and additional sensors, such as laser rangefinders, and are generally computationally demanding. As an alternative, end-to-end deep-learning systems can be used that do not employ any map. They integrate image processing, localization, and planning in one module or agent, which can be trained and therefore optimized for a given environment. While the training is computationally demanding, the execution of the eventual agent's policy is computationally cheap and can be executed in real time (sampling times of approximately 50 ms), even on lightweight embedded hardware, such as the NVIDIA Jetson. These methods also do not require any expensive cameras. The successes of deep reinforcement learning (DRL) in game domains [143, 6, 13] inspired the use of DRL in visual navigation. As current DRL methods require many training samples, it is impossible to train the agent directly in a real-world environment. Instead, a simulator provides the training data [207, 225, 111, 93, 208, 213, 175, 214, 50, 38, 140], and domain randomization [191] is used to cope with the simulator-reality gap. However, there are several unsolved problems associated with the above simulator-based methods:

• The simulator often provides the agent with features that are not available in the real world: segmentation masks [207, 111, 175], the distance to the goal, a stopping signal [207, 225, 93, 111, 140, 214, 50], etc., either as one of the agent's inputs [207, 175] or in the form of an auxiliary task [111]. While this improves the learning performance, it precludes a straightforward transfer from simulation-based training to real-world deployment. For auxiliary tasks using segmentation masks during training [111], another deep neural network could be used to annotate the inputs [91, 189]. However, this would introduce additional overhead and noise to the process and diminish the performance gain.

• Another major problem with the current approaches [207, 225, 93, 111, 140, 214, 50] is that during the evaluation, the agent uses yet other forms of input provided by the simulator. In particular, it relies on the simulator to terminate the navigation as soon as the agent gets close to the goal. Without this signal from the simulator, the agent never stops, and after reaching its goal, it continues exploring the environment. If we


provide the agent with the termination signal during training, in some cases, the agent learns an efficient random policy, ignores the navigation goal, and tries only to explore the environment efficiently.

To address these issues, we propose a novel method for DRL visual navigation in the real world, with the following contributions:

1. We designed a set of auxiliary tasks that do not require any input that is not readily available to the robot in the real world. Therefore, we can use the auxiliary tasks both during the pre-training on the simulator and during the fine-tuning on previously unseen images collected from the real-world environment.

2. Similarly to [213, 208], we use an improved training procedure that forces the agent to automatically detect whether it reached the goal. This enables the direct use of the trained policy in the real world where the information from the simulator is not available.

3. To make our agent more robust and to improve the training performance, we have designed a fast and realistic environment simulator based on Quake III Arena [42] and DeepMind Lab [8]. It allows for pre-training the models quickly, and thanks to the high degree of variation in the simulated environments, it helps to bridge the reality gap. Furthermore, we propose a procedure to fine-tune the pre-trained model on a dataset consisting of images collected from the real-world environment. This enables us to use a lot of synthetic data collected from the simulator and still train on data visually similar to the target real-world environment.

4. To demonstrate the viability of our approach, we report a set of experiments in a real-world office environment, where the agent was trained to find its goal given by an image.

3.2.2 Summary of state of the art

The application of DRL to the visual navigation domain has attracted a lot of attention recently [207, 225, 111, 93, 208, 213, 175, 214, 50, 38, 140]. This is possible thanks to the deep learning successes in the computer vision [189, 82] and gaming domains [142, 6, 13]. However, most of the research was done in simulated environments [207, 225, 111, 93, 208, 213, 175, 38, 140], where one can sample an almost arbitrary number of trajectories. The early methods used a single fixed goal as the navigation target [93]. More recent approaches are able to separate the goal from the navigation task and pass the goal as the input either in the form of an embedding [207, 140, 213] or an input image [225, 111, 214, 50]. Visual navigation was also combined with natural language in the form of instructions for the agent [175]. The use of auxiliary tasks to stabilize the training was proposed in [93] and later applied to visual navigation [111]. In our work, we use a similar set of auxiliary tasks, but instead of using segmentation masks to build the internal representation of the observation, we use the camera images directly to be able to apply our method to the real world without labeling the real-world images. This is built on ideas from model-based DRL [196]. Unfortunately, most of the current approaches [207, 225, 111, 93, 208, 213, 175, 214, 50, 38, 140] were validated only in simulated environments, where the simulator provides the agent with signals that are normally not available in the real-world setting, such as the distance to the goal or segmentation masks. This simplifies the problem, but at the same time, precludes the use of the learnt policy on a real robot.


There are some methods [140, 50, 214] that attempt to apply end-to-end DRL to real-world domains. In [140], the authors trained their agent to navigate in Google Street View, which is much more photorealistic than other simulators. They, however, did not evaluate their agent on a real-world robot. In [214], a mobile robot was evaluated in an environment discretized into a grid with approximately 27 states, which is an order of magnitude fewer than our method handles. Moreover, they did not use the visual features directly but used ResNet [82] features instead. This makes the problem easier to learn, but requires the final environment to have a lot of visual features recognizable by the trained ResNet. In our case, we do not restrict ourselves to such a requirement, and our agent is much more general. Generalization across environments is discussed in [50]. The authors trained the agent in domain-randomized maze-like environments and experimented with a robot in a small maze. Their agent, however, does not include any stopping action, and the evaluator is responsible for terminating the navigation. Our evaluation is performed in a realistic real-world environment, and we do not require any external signal to stop the agent. In our work, we focus on end-to-end deep reinforcement learning. We believe that it has the potential to overcome the limitations of, and outperform, other methods that use deep learning or DRL as a part of their pipeline, e.g., [74, 14]. We also do not consider pure obstacle avoidance methods such as [164, 212, 178], or methods relying on other types of input apart from camera images, e.g., [21, 168].

3.2.3 Description of work performed so far and performance evaluation

The details of this work can be found in the corresponding publication, which is listed below and included in Appendix A:

• [112] J. Kulhánek, E. Derner and R. Babuška, "Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning", arXiv preprint arXiv:2010.10903.

In this paper, we proposed a deep reinforcement learning method for visual navigation in real-world settings. Our method is based on the Parallel Advantage Actor-Critic algorithm, boosted with auxiliary tasks and curriculum learning. It was pre-trained in a simulator and fine-tuned on a dataset of images from the real-world environment. We reported a set of experiments with a TurtleBot in an office, where the agent was trained to find a goal given by an image. In simulated scenarios, our agent outperformed strong baseline methods. We showed the importance of using auxiliary tasks and pre-training. The trained agent can easily be transferred to the real world and achieves an impressive performance. In 86.7% of cases, the trained agent was able to successfully navigate to its target and stop within a 0.3-meter neighborhood of the target position, facing the same way as the target image with a heading angle difference of at most 30 degrees. The average distance to the goal in the case of successful navigation was 0.175 meters. The results show that DRL presents a promising alternative to conventional visual navigation methods. Auxiliary tasks provide a powerful way to combine a large number of simulated samples with a comparatively small number of real-world ones. We believe that our method brings us closer to real-world applications of end-to-end visual navigation, thus avoiding the need for expensive sensors.


4 Deep action and control

4.1 Model-based latent planning

4.1.1 Introduction and objectives

Reinforcement learning (RL) provides a mathematical framework for solving a sequential decision process. Loosely speaking, the user specifies a reward function that encodes the desired behavior of an agent, and the RL algorithm optimizes this function to achieve the desired behavior, also called the agent's "policy". The use of deep learning architectures (neural networks) as function approximators within RL algorithms led to the concept of deep reinforcement learning (DRL). DRL is able to learn complex behavior from raw high-dimensional sensor data. This end-to-end learning approach alleviates the need for hand-tailored features that require a lot of expert knowledge. DRL has achieved good results on complex tasks that range from playing games [144, 195] to enabling robots to manipulate objects [220]. For robotics researchers, RL provides powerful methods to control complex physical systems that exhibit nonlinear dynamics in uncertain environments with (sub-)optimal performance that is otherwise hard to achieve with conventional control techniques. RL methods that directly learn the policy for a task by trial and error, without explicitly learning a model of the environment, are called model-free. In contrast, RL methods that use known or learned models (e.g., by planning ahead or by generating artificial experience) are called model-based. Model-free methods have achieved excellent results [172, 76], but require either a huge number of iterations to converge or a carefully designed low-dimensional parametrization. This is especially troublesome for complex robotic tasks that lack a simple parametrization. Moreover, acquiring a large amount of real-world data can be time-consuming due to the physical nature of robotic tasks. In contrast, model-based methods have achieved comparable performance on low-dimensional benchmark tasks with impressive data efficiency [39, 113]. The performance of model-based methods strongly depends on the ability to learn an accurate model of the dynamics. Policies derived from inaccurate models suffer from modelling bias, which will typically lead to performance degradation in the real world. For complex systems with high-dimensional observations this poses a major challenge, as accurate forward predictions usually require high-capacity dynamics models that need a lot of training data, thus negating the data efficiency benefits of model-based methods. Moreover, even if we had a dynamics model that could accurately predict the future evolution of high-dimensional observations, the large dimension would render the planning algorithms in model-based methods computationally infeasible. This motivates research into data-driven methods that learn low-dimensional latent representations based on high-dimensional observations for complex systems that lack simple equations. To this end, several model-based methods have previously been proposed [202, 78, 220] that learn a latent representation which accurately predicts future rewards and observations to allow for planning in the latent space. Though planning in a latent space can be considered fast compared to planning in a high-dimensional space, [78] is nonetheless stuck with an inefficient planning algorithm because neural networks are used as the dynamics and reward models.
This touches upon a central tension in learning dynamics for control: the trade-off between the expressive power of a model and its compatibility with efficient planning algorithms. Other methods [202, 220] use (linear) model structures that are suitable for efficient (linear) control and planning techniques at the cost of making limiting assumptions. These limiting assumptions will be

OpenDR No. 871449 D5.1: First report on deep robot action and decision making 43/99 further discussed in Section 4.1.2. Alternatively, [98] proposes to learn a latent model suitable for efficient real-time control by relying on Koopman operator theory [108]. This theory states that any nonlinear dynamical system can be lifted to an infinite dimensional latent space, where the evolution in this latent space is linear and governed by the linear Koopman operator. A finite-dimensional approxi- mation of this operator should provide a latent representation of the nonlinear system where the evolution is approximately and globally linear. Provided that we can identify the Koopman representation, consisting of the lifting function and accompanying Koopman operator, this framework enables the use of powerful linear and robust control techniques for a general class of non-linear systems. Previous works successfully applied Koopman theory to control various nonlinear systems with linear control techniques, both in simulation [98, 109, 126] and in real- world robotic applications [136, 23]. Identifying the lifting function that maps the observations to a suitable Koopman representation is challenging and considered to be an open problem. Re- cent works propose Deep Koopman representation learning methods [188, 131, 216, 126] that leverage the expressive power of deep learning architectures to identify the Koopman lifting function. We believe the combination of model-based reinforcement learning with latent plan- ning together with Koopman theory can provide the basis for a novel representation learning method. Herein, the non linearity of the system could be completely offloaded to a Koopman lifting function based on deep learning, such that we obtain a latent space that evolves linearly according to the Koopman operator, thus enabling the use of fast and powerful linear control techniques. In general, reconstruction-based representation learning aims to learn a lower dimensional representation, sufficient to reconstruct current measurements and predict future measurements. However, such methods are task-agnostic: the models represent all dynamic elements they ob- serve, whether they are relevant to the task or not [219]. This consumes expressive power of the latent dynamical model at the expense of the relevant dynamical model accuracy. This is espe- cially troublesome for data-driven Koopman representation learning, as the expressive power of the linear latent dynamical model is already limited compared to general nonlinear dynamical models, such as neural networks. If finding a Koopman representation was not already challeng- ing, then measurements contaminated with task-irrelevant dynamics only add to the challenge of finding parsimonious latent representations. To summarize, model-based reinforcement learning is well-suited for robotics due to its sample complexity. In the high-dimensional case, the policy must be learned in a lower- dimensional latent space to fully exploit this data efficiency. For the learned policy to run on a robotic platform with limited computational resources, the latent dynamics model must be compatible with fast control algorithms. For this, Koopman operator theory combined with deep learning provides a promising research direction. Although a latent Koopman representation fa- cilitates more efficient control, it still focuses the majority of the model capacity on potentially task-irrelevant dynamics that are contained in the observations. 
It is therefore of great importance to mitigate the effect of task-irrelevant dynamics in latent representation learning. Our objective is thus to find a lightweight and sample-efficient method based on Koopman theory to control robotic systems that lack a simple state description, using high-dimensional observations that are possibly contaminated with task-irrelevant dynamics. Specifically, we aim for a control algorithm that is:

1. Sample efficient The learning algorithm must be sample-efficient, because real environment interactions are both time-consuming and costly in robotic systems due to wear of the equipment.

2. Lightweight Combine deep learning methods with Koopman theory to enable the use of lightweight linear optimal and robust control techniques in the latent space.

3. High-dimensional observations Enable robots to efficiently decide on actions based on high-dimensional observations (e.g., images), by first mapping them to a low-dimensional space.

4. Invariance to task-irrelevant dynamics Learn a low-dimensional mapping that is invariant to task-irrelevant dynamics.

4.1.2 Summary of state of the art

This review of the literature will cover three main topics, namely, (i) model-based reinforcement learning for high-dimensional observations, (ii) Koopman representation learning, and (iii) filtering techniques for task-irrelevant dynamics in representation learning. The aim of this review is to provide the reader with the required knowledge to understand the proposed approach and clarify the gap with respect to the state of the art.

4.1.2.1 Model-based reinforcement learning with high-dimensional observations

RL is a broad research area. To make our review more effective, we restrict this summary to methods that are relevant for (parts of) the objective described in the previous section. In this respect, we exclude from this overview model-free methods, because of their data inefficiency. The reader is referred to [4] for a detailed survey on model-free methods. Moreover, we focus on methods that address the difficulties in policy learning with high-dimensional observations. These difficulties arise in vision-based robotic tasks where image observations consisting of thousands of pixels are used. Therefore, we also exclude from the review model-based methods that plan end-to-end in the observation space or use non-parametric approximation methods, because both components would render the algorithm computationally infeasible for high-dimensional problems. Some methods assume the dynamics model of the environment to be known. For some game environments (e.g., checkers, go) this might be true, as the rules are known a priori, but for robotic tasks this is unrealistic. Therefore, these methods are also excluded. The reader is referred to [145] for a general survey on model-based methods.

We define a discrete time-step t, latent state s_t, high-dimensional observation o_t, action vector a_t, and scalar reward r_t. A typical latent model then contains the following components:

Dynamics model:  p(s_{t+1} | s_t, a_t)
Encoder:         p(s_t | o_t)
Decoder:         p(o_t | s_t)                                   (15)
Policy:          p(a_t | s_t)
Reward:          p(r_t | s_t, a_t)

The encoder maps high-dimensional observations to a latent state, while the decoder does the opposite and reconstructs the observation from a given latent state. In addition, there is a dynamics model that predicts future latent states. The latent representation is trained such that decoding the predicted latent state evolution provides an accurate prediction of the future observation evolution. The goal is to implement a policy p(a_t | s_t) that maximizes the expected cumulative latent reward p(r_t | s_t, a_t), while ensuring that maximizing in the latent space using p(r_t | s_t, a_t) is equivalent to maximizing the real (unknown) reward function. In other words, starting from the same real state, the cumulative reward of a trajectory through the latent space must match the cumulative reward of the corresponding trajectory in the real environment [170]. This value equivalence is not ensured in [202], where the latent reward function is simply defined as a quadratic function with arbitrary weights, without explicitly considering the true rewards of the environment. This might lead to policies that "work", but are by no means optimal, as the approach actually optimizes a different (self-defined) latent reward function that is not trained to resemble the true reward function. The works in [78, 220, 170] do take the true rewards into consideration by training p(r_t | s_t, a_t) on actual observed reward samples in a supervised manner.

The underlying environment state can be stochastic, so many methods adopt a latent model that is also stochastic [202, 78, 220, 151], which is reflected in (15) with the probabilistic notation p(·). Nonetheless, remarkable results have also been achieved with a deterministic model [170], although the authors plan to add stochastic components in future work. Variational inference based on neural network approximations is most commonly used to learn stochastic latent models [202, 78, 220, 151].

The region of validity of the model is another aspect to consider when selecting the model structure. Most methods use a neural network as the parametric function approximator for a dynamics model that is globally valid. However, some approaches [202, 220] choose to approximate the system dynamics locally by a more restricted function approximation class (e.g., linear), in favor of the ability to use efficient planning algorithms and arguably less instability compared to a global approximation. In [202] a global model provides a local linear model at every time-step, whereas [220] uses a global model to initialize an optimization in search of a separate local linear model for each time-step. The linear structure of the local model enables the use of efficient linear control techniques that are locally optimal [121].

The latent model is trained by minimizing a loss function that penalizes errors between sampled observation evolutions and the observation evolutions that result from decoding predicted latent state evolutions using the dynamics model p(s_{t+1} | s_t, a_t) and decoder p(o_t | s_t). The works in [202, 220] only make a one-step prediction in the loss function, even though after learning the model is used to plan into the future, which usually involves multi-step predictions. So, multi-step predictions are made by repeatedly feeding one-step predictions back into the model. However, since the model was not optimized to make long-term predictions, accumulating errors may actually cause the multi-step prediction to diverge from the true dynamics [145]. Therefore, the works in [67, 78, 170, 151] include multi-step predictions in the loss function, such that the model is optimized towards accurate long-term predictions.
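To make the structure in (15) and the multi-step prediction objective more concrete, the following is a minimal PyTorch-style sketch of a deterministic latent model trained on H-step open-loop predictions. All module sizes, names and the mean-squared-error losses are illustrative assumptions and do not reproduce the exact architectures of the cited works.

```python
import torch
import torch.nn as nn

class LatentModel(nn.Module):
    """Deterministic stand-in for the components in Eq. (15)."""
    def __init__(self, obs_dim, act_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.reward = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, 1))

def multistep_loss(model, obs_seq, act_seq, rew_seq):
    """obs_seq: (H+1, B, obs_dim), act_seq: (H, B, act_dim), rew_seq: (H, B, 1).

    Encode the first observation once, then roll the latent dynamics forward
    H steps and penalize reconstruction and reward errors at every step.
    """
    s = model.encoder(obs_seq[0])
    loss = 0.0
    for t in range(act_seq.shape[0]):
        sa = torch.cat([s, act_seq[t]], dim=-1)
        loss = loss + nn.functional.mse_loss(model.reward(sa), rew_seq[t])
        s = model.dynamics(sa)  # predicted next latent state
        loss = loss + nn.functional.mse_loss(model.decoder(s), obs_seq[t + 1])
    return loss / act_seq.shape[0]
```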
There are different ways to incorporate the multi-step predicted latent state evolution for policy learning. Broadly speaking, we can categorize methods into three groups [158], namely latent imagination, latent planning, and world modelling. Latent imagination methods use real-world samples to learn a latent model of the environment. Then, the methods in [75, 98, 77, 151] use the latent model to 'imagine' trajectories that are used with model-free methods to learn the policy. The work in [57] uses both the imagined and real-world trajectories to learn the policy in a model-free way. In the literature, latent imagination methods are also referred to as hybrid model-free methods.

Latent planning methods also learn a latent model from real-world samples of the environment. Here, however, the policy follows from the learned model, instead of viewing the model as a simulator that can create additional training data. The method in [202] solves a local stochastic optimal control problem (iLQR [125]) in the latent space at every time-step. Herein, a priori knowledge of the goal state is assumed. The goal state is mapped to the latent space, where it is inserted in a self-defined quadratic reward model. Value equivalence, as previously discussed, is not ensured. Nevertheless, the method is able to reach the goal with success. The method in [220] also derives its policy directly from local linear dynamics and quadratic reward models, learned for each time-step separately. The local models can provide gradient directions for local policy improvements. The algorithm is very efficient for solving real-world robotics tasks, because it can pre-compute the policy improvements during training so that no planning is required when actually interacting with the environment. However, this approach is limited to "trajectory-centric" environments where the distributions over initial and goal conditions have low variance [220], because the local models are time-varying. In case of high variance, the agent might find itself in vastly different states at the same initial time-step, so that a single linear model cannot accurately capture the dynamics. Neither method in [202, 220] takes action saturation into account. In [78], the agent uses the learned model in a model-predictive control (MPC) fashion. It uses a sample-based MPC method to search for the best action sequence under the model (a minimal sketch of such a planner is given at the end of this subsection). This requires the evaluation of many candidate action sequences that are drawn from a prior action distribution, which is iteratively re-weighted based on predicted performance using a learned reward model. After a number of iterations, only the first action of the action distribution's mean is applied. After receiving the next observation, the agent has to re-plan all over again. The performance on several challenging image-based benchmarks is comparable to state-of-the-art model-free methods, while requiring several orders of magnitude fewer environment interactions. On the downside, the method is computationally heavy, due to the many candidate action sequences that have to be evaluated at every time-step during run-time, which makes the approach less applicable to systems with limited computational resources.

From the literature, it has not become clear whether latent planning or imagination should take precedence over the other. We believe both planning and imagination to be valid alternatives, but in the context of robotics we expect a policy that explicitly includes latent planning to be more robust against modelling errors, compared to a model-free policy trained on imagined trajectories. Only [53, 220] have reported results for real robot experiments based on high-dimensional observations, and both methods employed a planning strategy. For a definitive answer, a real-world comparison between planning and imagination methods is required.

World models [75, 99, 67, 77, 173] are based on the idea that there is a large amount of structure in the world. Capturing this structure in a general model allows the agent to quickly adapt to a variety of tasks. The adaptation to specific tasks is done with either latent planning or imagination, so world modelling should be viewed as a pre-training step for, rather than a separate approach alongside, latent planning and imagination.
First, the agent aims to learn a latent model in the absence of extrinsic rewards, i.e., rewards related to a specific task. Instead, it uses planning to explore the world in a self-supervised manner by generating intrinsic rewards. After obtaining a world model, the agent can use it to adapt quickly to various tasks in a zero- or few-shot manner. These methods are less applicable in settings where much of the observed dynamics is irrelevant for any useful task. In that case, the world modelling method has no way of filtering out task-irrelevant dynamics, so it will try to predict everything it observes, regardless of relevance. This is elaborated on further in Section 4.1.2.2.
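As an illustration of the sample-based latent planning used in [78], the sketch below implements a cross-entropy-method style planner that iteratively re-weights a Gaussian over action sequences using a learned latent dynamics and reward model. The function names (`dynamics`, `reward`) and all hyperparameters are placeholders, not the exact procedure of the cited work.

```python
import numpy as np

def cem_plan(s0, dynamics, reward, act_dim, horizon=12, iters=5,
             n_candidates=500, n_elites=50):
    """Sample-based planning in a learned latent space (CEM-style).

    s0:       current latent state, shape (latent_dim,)
    dynamics: callable (s, a) -> next latent state
    reward:   callable (s, a) -> scalar predicted reward
    Returns the first action of the refined action-sequence distribution.
    """
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current belief.
        actions = mean + std * np.random.randn(n_candidates, horizon, act_dim)
        returns = np.zeros(n_candidates)
        for i in range(n_candidates):
            s = s0
            for t in range(horizon):
                # Roll the latent dynamics forward and accumulate predicted reward.
                returns[i] += reward(s, actions[i, t])
                s = dynamics(s, actions[i, t])
        # Refit the Gaussian to the best-performing (elite) sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # apply only the first action, then re-plan (MPC)
```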


4.1.2.2 Filtering techniques for task-irrelevant dynamics

Reconstruction-based representation learning aims to learn a lower-dimensional representation, sufficient to reconstruct current observations and predict future observations. However, such methods are task-agnostic: the models represent all dynamic elements they observe, whether they are relevant to the task or not [219]. Therefore, observations contaminated with task-irrelevant dynamics complicate learning a parsimonious latent representation, because expressive power of the latent dynamical model is consumed to represent task-irrelevant dynamics at the expense of the accuracy of the relevant dynamics model. This is especially troublesome for representation learning that uses a more restricted function approximation class as the dynamics model (e.g., linear), because the expressive power of such models is already limited compared to global function approximators (e.g., neural networks).

Instead of training a latent space such that it can predict future observations, the latent model can be trained with the sole purpose of accurately predicting relevant future quantities, such as future rewards or the value function. Essentially, we want an abstract latent state description that maps observations to the same latent state in case they are "behaviorally equivalent". The concept of bisimulation formalizes this idea, which is explained in [219] as follows. Bisimulation maps observations to the same latent state if they are indistinguishable with respect to reward sequences given any action sequence tested. Or, in its recursive form: two observations are bisimilar if they share both the same immediate reward and equivalent distributions over the next bisimilar observations. This is a binary concept, in that observations are either indistinguishable or they are not, which makes it less applicable to continuous observations, where the probability of producing exactly the same reward sequence is practically zero. For continuous observations, bisimulation metrics [58] can provide a continuous measure of how "behaviorally similar" observations are with respect to the reward. The works in [219, 64] both use bisimulation metrics to learn the latent model, where the distance between two latent states corresponds to how similar the corresponding observations are. The method in [219] can be combined with any model-free method to learn the policy. The authors also note that their method can be used as a model-based approach that uses latent planning, as they learn both the representation and the latent dynamics. The method is tested in a realistic driving simulator with images. Their results (see Figure 10 in [219] for details) demonstrate that latent states that are close to each other (so behaviorally similar) correspond to image observations that can be visually very different but indeed are behaviorally equivalent. This demonstrates the potential robustness of the representation learning method to task-irrelevant elements. Whether an obstacle at a certain position relative to the car is a wall, a car or a truck does not matter. The only thing that matters is that the car should not crash into it, so the visually distinct observations are in close proximity in the latent space. Likewise, the approach demonstrates robustness to different lighting conditions (cloudy or sunny). The works in [170, 150] also avoid reconstruction of observations.
In addition, they aim to learn a latent model that is invariant to task-irrelevant elements by predicting the value function [150] and additionally the immediate rewards [170]. While these works are not explicitly connected to bisimulation theory, the proposed methods seem to apply similar principles.
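A hedged sketch of how a bisimulation-style metric can shape the latent space, loosely following the idea in [219]: pairs of latent states are pushed to have distances that match a combination of their reward difference and the distance between their predicted next latent states. The discount factor, the use of deterministic latent predictions and the L1 distance are simplifying assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def bisimulation_loss(z_i, z_j, r_i, r_j, z_next_i, z_next_j, gamma=0.99):
    """Encourage ||z_i - z_j|| to match a bisimulation-style target.

    z_i, z_j:           latent states of two sampled observations, (B, latent_dim)
    r_i, r_j:           their immediate rewards, (B, 1)
    z_next_i, z_next_j: predicted next latent states, (B, latent_dim)
    """
    dist = torch.norm(z_i - z_j, p=1, dim=-1)  # current latent distance
    target = (r_i - r_j).abs().squeeze(-1) \
        + gamma * torch.norm(z_next_i.detach() - z_next_j.detach(), p=1, dim=-1)
    return F.mse_loss(dist, target)
```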

4.1.2.3 Koopman representation learning

Consider the following nonlinear discrete-time autonomous system with smooth differentiable dynamics o_{k+1} = F(o_k), where the observations o_k ∈ R^N at time-step k evolve according to the nonlinear dynamics F. For such a system, there exists a lifting function g(·) : R^N → R^n that maps the observations to a latent space where the dynamics are linear [108], that is,

g(o_{k+1}) = K g(o_k). (16)

Loosely speaking, the Koopman operator K acts as a linear matrix on the lifting function g(·). In theory, a latent space with infinite dimension is required for an exact solution (i.e., n → ∞). Fortunately, a finite approximation with n sufficiently large can provide a latent representation where the global dynamics are approximately linear. Direct identification of the lifting function and Koopman operator with the formulation in (16) often leads to many spurious modes in K, so many approaches [131, 109] seek to identify eigenfunctions of the Koopman operator instead, which are defined as

φ(o_{k+1}) = K φ(o_k) = J(µ,ω) φ(o_k),   where   J(µ,ω) = e^{µ∆t} [ cos(ω∆t)  −sin(ω∆t) ;  sin(ω∆t)  cos(ω∆t) ]. (17)

Herein, the Jordan real block J ∈ R^{2×2} and the complex eigenvalue pair λ_± = µ ± iω correspond to the eigenfunction φ(o_k), and ∆t is the sampling time. By stacking P = n/2 eigenfunctions into a single vector ϕ = (φ_±[1], φ_±[2], ..., φ_±[P]), and constructing the corresponding Koopman operator Λ = diag(J(λ_±[1]), J(λ_±[2]), ..., J(λ_±[P])), we can simultaneously identify a set of eigenfunctions ϕ with

ϕ(o_{k+1}) = Λ ϕ(o_k). (18)

Identifying the lifting function and the accompanying complex eigenvalue pairs of the Koopman operator is challenging and motivates the use of powerful deep learning methods [131, 188]. The authors of [109, 3, 23, 110] used data-driven methods derived from the Extended Dynamic Mode Decomposition (EDMD) [205] to find the Koopman representation. In contrast, [136, 98, 24] used prior knowledge of the system dynamics to hand-craft parts of the lifting function (i.e., a function that maps observations with nonlinear dynamics to a latent space where the dynamics are approximately linear), or constructed a dictionary of possibly relevant nonlinear functions and applied Sparse Identification of Nonlinear Dynamics (SINDy) [25]. Recently, powerful deep learning methods [188, 131, 216] have been used to alleviate the need for hand-crafted features. Only [126] has combined Koopman theory and deep learning for control purposes.

Previous works successfully applied Koopman theory to control various nonlinear systems with linear control techniques, both in simulation [98, 109, 126] and in real-world robotic applications [136, 23]. Herein, [24, 98, 136] used a linear quadratic regulator (LQR), while [109, 3, 23, 126, 110] applied linear model predictive control (MPC). All aforementioned methods are trained towards full observation reconstruction, which focuses the majority of their model capacity on potentially task-irrelevant dynamics. Also, most methods assume the effect of control on the Koopman eigenfunctions to be linear, which is a limiting assumption.
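To illustrate how a finite-dimensional Koopman approximation can be identified from data, the snippet below sketches the basic EDMD regression [205]: observations are lifted with a fixed dictionary, and the Koopman matrix is obtained by a least-squares fit. The polynomial dictionary is an arbitrary illustrative choice; deep-learning variants replace it with a learned lifting network.

```python
import numpy as np

def lift(obs):
    """Fixed example dictionary: a constant, the observations and their pairwise products."""
    o = np.atleast_2d(obs)
    cross = np.einsum('bi,bj->bij', o, o).reshape(o.shape[0], -1)
    return np.hstack([np.ones((o.shape[0], 1)), o, cross])

def edmd_fit(X, X_next):
    """Least-squares Koopman approximation K such that lift(x_{k+1}) ≈ lift(x_k) @ K.

    X, X_next: (num_snapshots, obs_dim) arrays of consecutive observations.
    Note the row-vector convention: K here is the transpose of the operator
    acting on column vectors in Eq. (16).
    """
    Psi, Psi_next = lift(X), lift(X_next)
    K, *_ = np.linalg.lstsq(Psi, Psi_next, rcond=None)
    return K  # (lifted_dim, lifted_dim)

def predict_next(obs, K):
    """One-step prediction in the lifted space."""
    return lift(obs) @ K
```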

4.1.3 Description of work performed so far

We have performed an extensive state-of-the-art survey on latent planning with high-dimensional observations. We concluded that efficient planning in the latent space while still achieving competitive performance remains an open problem. In this respect, a promising strategy is represented by the Koopman framework that was covered in the previous section.


Figure 20: The architecture of the proposed agent, Deep KoCo. Herein, z^{-1} represents a time-step delay.

We propose Deep Koopman Control (Deep KoCo), that is, a model-based agent that learns a latent Koopman representation from images and achieves its goal through planning in this latent space. The representation will be (i) invariant to task-irrelevant dynamics and (ii) compatible with efficient planning algorithms. We propose a lossy autoencoder network that reconstructs and predicts observed costs, rather than all observed dynamics, which leads to a representation that is invariant to task-irrelevant dynamics. The latent-dynamics model can represent non-affine nonlinear systems and does not require access to the underlying environment dynamics or reward function. This makes the approach suitable for a large variety of robotic systems, which aligns with OpenDR's project goals. A preliminary architecture of this agent is provided in Figure 20. Herein, a_k is the agent's action, o_k, o_{k+1} are the observations of the environment, and s_k is the Koopman latent state. We learn the parameters θ ∈ R^{30865} that define the Koopman latent model by letting the agent interact with the environment, storing the observed tuples t_{k+1} = {o_{k+1}, a_{k+1}, c_{k+1}} in an experience replay buffer D, and training on the gathered experience.

4.1.4 Performance evaluation

We evaluate Deep KoCo on OpenAI's pendulum swing-up task. To investigate the effect of distractor dynamics, we test in two different scenarios. In the first scenario, only relevant dynamics are observed, while in the second one we purposely contaminate the observations with distractor dynamics. The agent is provided with a concatenated state observation containing the state measurements of both the relevant system and the distractors. When using state observations, the observation dimension increases from 3 in the clean scenario to 15 in the distractor scenario.

We compare with two baselines that both combine model-free reinforcement learning with an auxiliary bisimulation loss to be robust against task-irrelevant dynamics. The first is Deep Bisimulation for Control (DBC) [219] and the second is DeepMDP [64]. As expected, in the clean scenario the baselines converge more quickly to the final performance than Deep KoCo, as the first column of Figure 21 shows. Nevertheless, we do consistently achieve a similar final performance. The slower convergence can be explained by the noise we add for exploratory purposes in the first 400 episodes. We believe the convergence rate can be significantly improved by performing a parameter search together with a faster annealing rate, but we do not expect to be able to match the baselines in this ideal scenario. Note that, despite the apparent simplicity of the application scenario, finding an accurate Koopman representation for the pendulum system is challenging, because it exhibits a continuous eigenvalue spectrum and has multiple (unstable) fixed points [131].


Figure 21: Top: The pendulum swing-up task in a clean scenario (first column) and a distractor scenario (second column). In all setups, the center pendulum is controlled by the agent. Bottom: Learning curves when using state observations. Each column refers to the scenario depicted on the top row. The mean performance (line) with one standard deviation (shaded area) over 5 random-seed runs is shown.

In the more realistic scenario in which we consider distractions, both baselines fail to learn anything. In contrast, our approach is able to reach the same final performance as in the clean scenario, in a comparable number of episodes, as the second column of Figure 21 shows. We believe this is because our method minimizes a loss function that provides a stronger learning signal for the agent to distinguish relevant from irrelevant dynamics. These preliminary results support our statement that our approach learns a Koopman representation that is invariant to task-irrelevant dynamics.

4.1.5 Future work

Concerning the use of Koopman theory combined with deep learning for the control of robotic systems, as part of our future work we plan to finalize the proposed control architecture and test the proposed approach on a simplified manipulator task, that is, OpenAI's reacher task, where the two joints of a two-link arm are controlled and the Euclidean distance between the arm's end-point and a goal must be minimized. This will allow us to test the tracking performance of the proposed control architecture. In addition, we will test our method with high-dimensional images as observations. Our expectation is that the proposed method will also be able to learn a Koopman representation and improve over the model-free baselines when distractions are present. We intend to include the extended method in the toolkit if we are able to show convincing results with images. Also, we will benchmark the efficiency of our planning algorithm at run time on the project's selected hardware (i.e., Nvidia's Jetson TX2).


4.2 Existing reinforcement learning frameworks

In this section, we discuss some of the existing frameworks and libraries for reinforcement learning research. We divide this section into three subsections, namely environments, development tools and baselines.

4.2.1 Environments

4.2.1.1 OpenAI gym

OpenAI Gym1 is one of the most popular and widely used platforms for developing and comparing reinforcement learning algorithms [22]. The gym library is open source and contains a collection of environments (Atari, Box2D, Classic Control, Robotics and Toy Text) that can be used to benchmark reinforcement learning algorithms. It supports Python and is compatible with TensorFlow or Theano. To control 2D and 3D robots, Gym uses the MuJoCo physics engine, a proprietary engine for fast and accurate simulation of articulated rigid-body dynamics with contacts.
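As a minimal usage illustration (assuming the classic CartPole-v1 environment is installed; environment IDs and API details may differ between Gym versions), an agent-environment interaction loop looks as follows:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # replace with a learned policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple step API
    total_reward += reward
env.close()
print("episode return:", total_reward)
```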

4.2.1.2 Unity ML Agents

The Unity Machine Learning Agents2 toolkit is also one of the most popular open-source projects that enables games and simulators to serve as environments for training intelligent agents. It provides a Python API through which agents can be trained using reinforcement learning, imitation learning, neuro-evolution and other machine learning methods. This toolkit also provides TensorFlow-based implementations of state-of-the-art algorithms for the users to train autonomous agents for 2D, 3D and VR/AR games.

4.2.1.3 DeepMind Lab

DeepMind Lab3 is a first-person 3D game platform developed by Google, based on the open-source ioquake3 engine (from id Software's Quake III Arena), for agent-based research. It can be used to study puzzle-solving and 3D navigation tasks for autonomous learning agents in large and partially observed environments. DeepMind Lab currently ships as source code only and depends on several additional libraries to work. DeepMind also provides a control suite for physics-based simulation, named dm_control, which is based on the MuJoCo physics engine.

4.2.1.4 PySC2

PySC24 is a Python wrapper of the StarCraft II Learning Environment (SC2LE), which exposes Blizzard's StarCraft II Machine Learning API as a Python reinforcement learning environment. It provides an interface through which intelligent agents can interact with StarCraft II, sending actions and receiving observations.

1 https://github.com/openai/gym
2 https://github.com/Unity-Technologies/ml-agents
3 https://github.com/deepmind/lab
4 https://github.com/deepmind/pysc2


4.2.1.5 Project Malmö

Project Malmö5 is a platform developed by Microsoft for AI experimentation and research, built on top of Minecraft. It provides an API for creating tasks or "Missions", which are essentially descriptions of tasks in the Minecraft environment. It also provides an OpenAI Gym-like environment in Python, without any native code, which communicates directly with Java Minecraft.

4.2.1.6 AI safety gridworlds

AI safety gridworlds6 is a highly-customizable gridworld game engine for reinforcement learning that targets several safety properties of intelligent agents. It provides various environments, all of which are Markov Decision Processes, covering aspects such as safe interruptibility, absent supervisor, self-modification, distributional shift and safe exploration.

4.2.1.7 gym-gazebo2

gym-gazebo27 is a toolkit for developing and comparing reinforcement learning algorithms using ROS 2 and Gazebo. It is a collection of simulation tools, robot middleware, machine learning and reinforcement learning techniques for developing and benchmarking algorithms. It provides an accurate simulated version of the Modular Articulated Robotic Arm (MARA) in Gazebo, which allows behaviors to be transferred from simulation to the real robot.

4.2.2 Development Tools

4.2.2.1 Dopamine

Dopamine8 is a TensorFlow-based framework and a relatively new entrant to the reinforcement learning space [34]. It exploits Google's gin-config configuration framework, which facilitates pluggability and reuse. Dopamine supports a single-GPU Rainbow agent [87] applied to Atari 2600 game-playing. The Rainbow agent implements n-step Bellman updates, prioritized experience replay and distributional reinforcement learning, as well as Deep Q-Networks (DQN) [144]. It runs on TensorFlow 1.x and 2.x and has lately switched its network definitions to Keras models.

4.2.2.2 ReAgent

ReAgent9 is another open-source end-to-end platform for applied reinforcement learning, developed by Facebook [63]. It is built in Python and uses PyTorch and TorchScript for modeling, training and model saving. ReAgent supports Discrete-Action DQN, Parametric-Action DQN, Double DQN (DDQN) [193], Dueling DQN [200], Dueling Double DQN [87], distributional RL with C51 [9] and Quantile Regression DQN (QR-DQN) [45], Twin Delayed Deep Deterministic Policy Gradient (TD3) [61] and Soft Actor-Critic (SAC) [76].

5 https://github.com/microsoft/malmo
6 https://github.com/deepmind/ai-safety-gridworlds
7 https://github.com/AcutronicRobotics/gym-gazebo2
8 https://github.com/google/dopamine
9 https://github.com/facebookresearch/ReAgent


4.2.2.3 Garage

Garage10 is a toolkit for developing and evaluating reinforcement learning algorithms, which is accompanied by implementations of state-of-the-art algorithms. Garage provides several algorithms such as the Cross-Entropy Method (CEM) [137], REINFORCE [206], DDQN, Trust Region Policy Optimization (TRPO) [171], Model-Agnostic Meta-Learning (MAML) [59], Probability Embeddings for Actor-critic RL (PEARL) [162], Task Embedding and Behavioral Cloning, in various frameworks including NumPy, PyTorch and TensorFlow.

4.2.2.4 RL Coach

Coach11 is a Python reinforcement learning framework that exposes a set of easy-to-use APIs for developing new algorithms. It supports the integration of several environments such as OpenAI Gym, ViZDoom [209], PyBullet [43] and CARLA [51]. It also provides a dedicated Coach Dashboard for visualization. Coach supports distributed multi-node and batch reinforcement learning.

4.2.2.5 Tensorforce

Tensorforce12 is another open-source deep reinforcement learning framework, which attempts to abstract reinforcement learning primitives in a modular, component-based fashion using TensorFlow. The model policy is defined within the Agent, which allows users to change models with a single line of code. Tensorforce's modular components can be used to replicate several Q-learning and policy-gradient algorithms. It also provides adapters for several environments including the Arcade Learning Environment, CARLA, OpenAI Gym and OpenSim [49, 174].

4.2.2.6 TF-Agents

TF-Agents13 is another TensorFlow library for contextual bandits and reinforcement learning, which is under active development. It strongly overlaps with Dopamine, but is designed for production-level reinforcement learning. TF-Agents provides well-tested and modular components which can be modified and extended.

4.2.2.7 KerasRL

KerasRL14 is an implementation of state-of-the-art reinforcement learning algorithms in Python which integrates with the deep learning library Keras. It has out-of-the-box support for OpenAI Gym, and several algorithms such as DQN, Double DQN, Continuous DQN [69], Deep SARSA [1] and Proximal Policy Optimization (PPO) [172] have already been implemented in KerasRL.

10 https://github.com/rlworkgroup/garage
11 https://github.com/IntelLabs/coach
12 https://github.com/tensorforce/tensorforce
13 https://github.com/tensorflow/agents
14 https://github.com/keras-rl/keras-rl


4.2.2.8 Acme

Acme15 is a reinforcement learning development framework designed by DeepMind with the goals of enabling reproducibility, simplifying the design of new algorithms and enhancing the readability of RL agents. It enables agents to run in both single-process and highly-distributed regimes by providing tools for constructing agents at various levels of abstraction. Acme also offers a scalable data-storage mechanism with the help of its companion Reverb library, which decouples data producers (i.e., agents) from data consumers (i.e., learners).

4.2.2.9 Other Tools

Some other tools for developing reinforcement learning algorithms include pyqlearning16, ChainerRL17 and MushroomRL18.

4.2.3 Baselines

4.2.3.1 OpenAI Baselines

OpenAI Baselines19 is a set of high-quality implementations of state-of-the-art RL algorithms by OpenAI. The framework contains implementations of several popular agents such as A2C [142], Deep Deterministic Policy Gradient (DDPG) [128], DQN, PPO and TRPO. Some limitations of the OpenAI Baselines implementations include a lack of comments in the code, the absence of meaningful variable names, no common code style or consistency, and plenty of duplicated code. Moreover, the documentation is not well written, so implementing new environments or even tweaking/modifying the existing implementations can be extremely challenging. The framework currently does not include some recent works from OpenAI.

4.2.3.2 Stable Baselines

To address the shortcomings of OpenAI Baselines, a group of researchers forked the framework, performed a major structural refactoring and released it as Stable Baselines20. This toolkit provides a unified structure for all algorithms, a PEP8-compliant code style, well-documented functions and classes, more tests and code coverage, and implementations of additional algorithms such as SAC and TD3. Stable Baselines also provides the ability to pass a callback, offers loading and saving functions for all algorithms, and has full TensorBoard support. It additionally supports training RL algorithms on arbitrary features, i.e., not only on pixels. Stable Baselines currently supports TensorFlow versions 1.8.0 to 1.14.0; the TensorFlow 2 API is not yet supported.
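For illustration, a hedged sketch of a typical Stable Baselines training script (TensorFlow 1.x backend; exact import paths and defaults may vary between releases):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("CartPole-v1")
model = PPO2(MlpPolicy, env, verbose=1)   # PPO with a simple MLP policy
model.learn(total_timesteps=100000)       # train on the environment
model.save("ppo2_cartpole")               # save for later reuse

obs = env.reset()
for _ in range(200):
    action, _states = model.predict(obs)  # query the trained policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```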

15 https://github.com/deepmind/acme
16 https://github.com/accel-brain/accel-brain-code/tree/master/Reinforcement-Learning
17 https://github.com/chainer/chainerrl
18 https://github.com/MushroomRL/mushroom-rl
19 https://github.com/openai/baselines
20 https://github.com/hill-a/stable-baselines


4.2.3.3 RL Baselines Zoo

RL Baselines Zoo21 is a collection of 100+ pre-trained RL agents using Stable Baselines. It also offers an interface to train RL agents, provides basic scripts for benchmarking different RL algorithms, and includes tuned hyperparameters for each environment and RL algorithm.

4.3 Hyperparameter Optimization

4.3.1 Introduction and objectives

The combination of deep neural networks with reinforcement learning (DRL) has been successful on some high-dimensional domains in the field of robotics, such as object grasping in cluttered environments using image sensors. However, it has not yet become a standard component of the robotics control design toolbox due to several limitations. One of these is the tuning of a large number of hyper-parameters that have to be set manually, while there is still no standard or systematic framework, method or criterion for their tuning. Hyper-parameters refer to a set of parameters that cannot be updated while training the learning algorithm. The accepted practice among users of machine learning algorithms is to either use the default parameters recommended in the software implementation packages or to configure them manually through experience or by trial-and-error.

To save the scarce resource of expert time and simplify the process of parameter tuning for less experienced users, automatic hyper-parameter optimization offers to tune algorithms automatically at the cost of computational resources. Efficient hyper-parameter optimization strategies require determining a set of variables to optimize, as well as their corresponding ranges, scales and potential prior distributions for subsequent sampling. In the following section we discuss some of the existing toolkits for hyper-parameter optimization of machine learning algorithms and deep neural networks.

4.3.2 Summary of state of the art

Existing toolkits There exist several hyper-parameter tuning tools that exploit different strategies for this task. Here we discuss some of the most popular libraries for the automatic tuning of hyper-parameters.

4.3.2.1 Hyperopt

Hyperopt22 is a Python library for serial and distributed hyperparameter optimization which allows the user to define a search space in which the best results are expected to exist. Hyperopt implements three algorithms for the optimization task, namely Random Search [10], Tree of Parzen Estimators (TPE) [12] and Adaptive TPE [11]. It requires the user to describe the objective function to minimize, the search space, a database to store point evaluations of the search, and the search algorithm to use.

21 https://github.com/araffin/rl-baselines-zoo
22 https://github.com/hyperopt/hyperopt


Hyperopt uses Apache Spark23 and MongoDB24 to parallelize the algorithms.
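A minimal, hedged example of Hyperopt's workflow (the objective and search space are toy placeholders; in practice the objective would train a model and return a validation loss):

```python
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Toy objective: minimize a simple quadratic in the sampled parameter.
    x = params["x"]
    return (x - 3.0) ** 2

space = {"x": hp.uniform("x", -10, 10)}   # search space definition
trials = Trials()                         # stores point evaluations of the search
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print(best)                               # best parameter values found
```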

4.3.2.2 Optuna

Optuna25 is a lightweight, platform-agnostic framework for automatic hyperparameter optimization. It allows users to define search spaces using Python syntax, including conditionals and loops. Optuna uses the same optimization methods as Hyperopt under the hood. Optuna allows run pruning with the help of pruning callbacks. It supports many machine learning frameworks like Keras, TensorFlow, PyTorch etc. Optuna lets the program pass allowed exceptions in case a trial fails due to a wrong parameter combination, a random training error or some other problem; thus, the scores of already tested parameters are not lost in case of such failures. It also supports easy parallelization and provides quick visualization tools.
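For comparison, a hedged sketch of the same kind of search expressed with Optuna's define-by-run syntax (toy objective; pruning and parallelization are omitted):

```python
import optuna

def objective(trial):
    # Search spaces are declared inline with Python syntax (conditionals/loops allowed).
    x = trial.suggest_float("x", -10.0, 10.0)
    use_offset = trial.suggest_categorical("use_offset", [True, False])
    offset = trial.suggest_float("offset", 0.0, 1.0) if use_offset else 0.0
    return (x - 3.0) ** 2 + offset

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```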

4.3.2.3 Ray Tune

Tune26 is a Python library for executing scalable experiments and hyperparameter tuning. It exploits the strength of distributed computing to speed up hyperparameter optimization. Tune supports several learning frameworks like Keras, TensorFlow, PyTorch etc. It implements several state-of-the-art scalable optimization algorithms like Population Based Training (PBT) [92], Vizier's Median Stopping Rule [66], ASHA [123] and BOHB [56].

4.3.2.4 Facebook Ax

Ax27 is a platform for managing and optimizing experiments by exploring a parameter space to identify optimal configurations. It exploits Bayesian optimization, powered by BoTorch, and bandit optimization as exploration strategies. In high-noise settings such as A/B tests and simulations with reinforcement learning agents, Ax supports state-of-the-art optimization algorithms [120] that work better than traditional Bayesian optimization. Ax also supports multi-modal experimentation and multi-objective optimization.

4.3.2.5 Optimizers for Keras

Hyperas28 is a wrapper around Hyperopt for the optimization of Keras models. It helps users exploit the power of Hyperopt with simple template notations to define hyperparameter ranges to optimize.

23 https://spark.apache.org/
24 https://www.mongodb.com/
25 https://github.com/optuna/optuna
26 https://github.com/ray-project/ray
27 https://github.com/facebook/Ax
28 https://github.com/maxpumperla/hyperas


Talos29 is another hyperparameter optimization tool, mainly for Keras models. It exposes Keras functionality entirely, without introducing any new syntax or templates to learn. Talos offers pseudo-, quasi- and random search options. It also supports grid search and probabilistic optimizers. Some other key features include the possibility to dynamically change the optimization strategy during an experiment and a live training monitor.

4.3.2.6 Other Optimizers

Some other hyperparameter optimizers available for machine learning algorithms include Determined AI30, Hyperparameter hunter31, auto-sklearn32, Neuraxle33, sherpa34 and adatune35.

4.3.3 Description of work performed so far

We surveyed the existing toolkits for hyper-parameter optimization. Most of the existing toolkits are aimed at optimizing end-to-end deep learning algorithms, while their application to DRL is yet to be tested.

29 https://github.com/autonomio/talos
30 https://github.com/determined-ai/determined
31 https://github.com/HunterMcGushion/hyperparameterhunter
32 https://automl.github.io/auto-sklearn/master/
33 https://github.com/Neuraxio/Neuraxle
34 https://github.com/sherpa/sherpa/
35 https://github.com/awslabs/adatune


4.4 Grasp pose estimation and execution

4.4.1 Introduction and objectives

Robot object grasping and manipulation are commonplace in industrial manufacturing. For high-throughput assembly lines, fast and accurate grasping is typically established by custom-made bulk feeders [129] and standard parallel-gripper mechanisms [16]. However, these solutions are unsuitable in custom manufacturing, which requires flexibility and reconfigurability following task changes. Ongoing research efforts towards grasping disregard feeders and utilize sensing to detect objects and their grasp pose for handling [88], for example by deep convolutional neural networks (CNNs) trained on large datasets of grasp attempts in a simulator or on a physical robot [132, 122]. The grasping task is driven by the Agile Production use case of OpenDR: human-robot collaborative assembly of a Diesel engine, which considers the handling and manipulation of several industrial (Diesel engine) parts by either the human or the robot manipulator. Regarding the robotic handling and manipulation of these parts, perception, action planning and control play a crucial role. In light of this research challenge, the following objectives are defined:

• Analyse the state-of-the-art in modern grasp and pose-estimation methods

• Provide an empirical comparison of class-agnostic and multi-class trained grasp estimation approaches

• Train and evaluate object pose-estimation over a variety of basic and industrial parts using state-of-the-art proven and tested methods.

• Develop and evaluate a novel single demonstration grasping model

• Evaluate the different grasping models in simulation and on a real robot, using the Franka Emika Panda robot manipulator.

4.4.2 Summary of state of the art

To a robot, the task of grasping is to achieve a secure and stable lift-off of an object without any slippage. Grasping is a popular topic in robotics research, and several surveys and reviews describe the field [167, 18, 29, 52]. A grasp can be represented as points (x,y) in image coordinates or as 3D points (x,y,z) in the robot workspace. Other approaches use oriented rectangular-box representations, either in 7D (x, y, z, roll, pitch, yaw, width) in the robot workspace or in 5D (x, y, theta, width, height) on the image plane. These approaches are analogous to the object detection and localization concept and hence can be easily translated to grasping.

Earlier works utilized analytical approaches, calculating grasp mechanics and dynamics to deal with the constraints imposed by the 3D geometry of the object to be grasped [29, 52]. These techniques aim at satisfying force-closure, form-closure or task-specific geometric constraints to find feasible contact points for a particular object and manipulator configuration. The majority of this work is concerned with finding and parameterizing the surface normals on various flat faces of the object and then testing the force-closure condition by requiring the angles between these normals to be within certain thresholds. Generally, a force-closed grasp is one in which the end-effector can apply the required forces on the object in any direction, without letting it slip or rotate out of the gripper. Later analytical methods argued about the optimality criterion of the grasp quality. This means that a metric should decide on the quality of the force-closure achieved with a certain grasp, i.e., how close the grasp is to losing its force-closure, given a particular object geometry and hand configuration. More recently, it has been found that these analytical modeling methods and metrics alone are not good measures of grasps, as they do not adapt well to the challenges that are faced during execution, such as uncertainties in dynamic behavior and the unstructured environment. Therefore, in the last decade, there has been a general push towards machine-learning approaches, which present a more abstract, easy-to-test and indirect approach of evaluating grasp success on a huge variety of objects, grippers and environmental conditions. The shift from complex mathematical modeling of the grasping itself towards the indirect mapping of perceptual features to grasp success was made possible by the availability of high-quality 3D cameras and depth sensors, increasingly powerful computational resources and a substantial amount of invaluable research in deep neural networks and transfer learning in the last few decades [29]. The biggest advantage these methods provide is the vast possibility for testing in simulation and real execution, using real and synthesized data. These methods do not explicitly guarantee equilibrium, dexterity or stability, but the premise of testing all criteria based solely on sensor data, object representations and carefully designed object features provides a convenient way of studying grasp synthesis [18].
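As a concrete illustration of the 5D image-plane grasp representation mentioned above, the following hedged sketch stores a grasp as (x, y, theta, width, height) and converts it to the four corner points of the oriented rectangle; the exact parameterization and conventions vary between datasets and detectors.

```python
from dataclasses import dataclass
import math

@dataclass
class GraspRectangle:
    x: float       # center column in the image (pixels)
    y: float       # center row in the image (pixels)
    theta: float   # gripper orientation in the image plane (radians)
    width: float   # gripper opening (pixels)
    height: float  # finger/jaw size (pixels)

    def corners(self):
        """Return the four corners of the oriented grasp rectangle."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        dx, dy = self.width / 2.0, self.height / 2.0
        pts = [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]
        return [(self.x + c * px - s * py, self.y + s * px + c * py) for px, py in pts]
```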

4.4.2.1 Grasp detection with RGB(-D) image data

Known objects Grasping objects that are known has been researched extensively, as it is a direct extension of object detection and object pose estimation. Different methods are used to extract information from the objects, such as 2D point correspondences (e.g., SIFT, SURF, ORB), templates (e.g., HOG, surface normals, moments) or patches (e.g., image regions, super-pixels), to represent objects in intermediate representations for correspondence matching. Hinterstoisser et al. [90] pioneered the research in template matching by providing a complete framework for creating robust templates from existing 3D models of objects by sampling a full-view hemisphere around an object. In addition, they introduced a dataset called LINEMOD, made of 1100+ frame video sequences of 15 different household items varying in shapes, colors and sizes, along with their registered meshes and CAD models. Other research includes PoseCNN [211], which combines feature and template methods, ConvPoseCNN [33], which improves PoseCNN by a fully-convolutional architecture, HybridPose [179] and PVNet [155].

Similar objects This class of grasp-detection methods is aimed at objects that are similar, i.e., a different/unseen instance of a previously seen category of objects. For example, all cups, with slight intra-class variation, belong to a single category of objects, all shoes belong to another, etc. These methods learn a normalized representation of an object category and transfer grasps using sparse-dense correspondence between the normalized 3D representation of the category and the partial-view object in the scene [52]. NOCS [199] presents an initial benchmark in this category by formulating a canonical-space representation per category using a vast collection of different CAD models for each class. A color-coded 2D perspective projection of this space (NOCS map) is then used to train a Mask-RCNN-based network [81] in order to learn correspondences from RGB-D images of unseen instances to this NOCS map. These correspondences are later combined with the depth map to estimate the 6D pose and size of multiple instances per class.


Novel objects In this class of methods, there is no existing knowledge of the object geometry, and grasps are estimated directly from image and/or depth data. The majority of these methods use geometric properties inferred from the input perceptual data as a measure of grasp success [52]. Most methods are developed in an end-to-end fashion, learning from a database of grasps on a huge number of different object models. These grasps are sampled exhaustively around the objects and are either evaluated on classical grasp metrics, such as the epsilon quality metric [159], manually annotated with their success measures by humans, or tested with real execution [134, 133]. The premise of these methods lies in the later training stage, where a deep neural network learns to produce robust grasps in general. The emphasis is on learning a robustness function that ranks a candidate grasp for various quality metrics.

DexNet 1.0 [134] and DexNet 2.0 [133] are two pioneering works that utilized this strategy and created large datasets of 3D object models for learning objective functions that minimize grasp failure in the presence of object and gripper position uncertainties and camera noise. Enforcing constraints like collision avoidance, approach-angle thresholds and gripper-roll thresholds, these methods provide a baseline for correlating object geometry from RGB-D images [134] or pointclouds [133] with grasp robustness. Finally, [163] and [71] are excellent examples that use a fully-convolutional architecture to regress graspable bounding boxes, bypassing the need for any sliding-window detectors or convex-hull-sampling approaches.

4.4.2.2 Grasp detection with pointcloud data

Deep learning with pointcloud data presents a unique set of challenges uncommon to RGB and RGB-D-based methods, such as the small scale of available datasets, the high dimensionality and the unstructured nature of pointclouds. Nevertheless, pointclouds are a richer representation of geometry, scale and shape, as they preserve the original geometric information in 3D space without any discretization [72]. Perhaps the most widely used pointcloud aggregation methods in applications of grasp detection and pose estimation are PointNet [160] and PointNet++ [161]. These two methods revolutionized geometry encoding in pointclouds by preserving permutation invariance. Pointclouds are an inherently unordered data type, and any kind of global or local feature representation should not change with the way they are ordered. To deal with this problem, most techniques convert pointclouds into other discrete and ordered forms, i.e., voxel grids, height maps, octomaps, surface normals or 2D-projected gradient maps, before aggregating them into a final compact representation. PointNet, PointNet++ and their later modifications overcame this and paved the way for the direct usage of pointclouds with other deep learning frameworks.

Due to better scene understanding and geometry aggregation, pointclouds processed with deep neural networks have led to a whole new line of pose-estimation methods. These methods filled the undesirable gap that existed in RGB/RGB-D techniques, which required data collection, training and evaluation in image coordinates in the form of 2D or 3D bounding boxes. Also, the need to transfer poses or grasps from image to world coordinates was removed, as these techniques can recover the full 6-DOF pose of the object without having to employ any post-processing on the depth channel and without learning to estimate depth. In this regard, methods have been developed that follow the same categorization as RGB/RGB-D-based pose-estimation methods, namely correspondence-based [218, 157, 222], template-based [86] or voting-based [155, 83].
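The permutation-invariance property exploited by PointNet can be summarized in a few lines: a shared per-point network followed by a symmetric pooling operation yields a global feature that does not depend on the point ordering. The sketch below is a simplified illustration, not the full PointNet architecture (no input or feature transforms).

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Shared per-point MLP + symmetric max-pooling => order-invariant global feature."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.per_point = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, points):
        # points: (batch, num_points, 3); the same MLP is applied to every point.
        per_point_feats = self.per_point(points)
        # Max over the point dimension is symmetric, so shuffling the points
        # leaves the global feature unchanged.
        global_feat, _ = per_point_feats.max(dim=1)
        return global_feat
```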


Methods that generate grasps directly from pointcloud data circumvent the need for estimating the pose of an object in the scene or any prior knowledge about the shape or canonical representation of a class of objects. With pointclouds, these grasps can be recovered in 6-DOF without constraining the gripper to move along the image plane, as is the case with the RGB-D-based detectors. Noteworthy examples of such networks are [190], [217], PointNetGDP [127] and 6-DOF GraspNet [146].

4.4.2.3 Single demonstration grasp (SDG)

Learning from demonstration is another paradigm that plays an important role in human-robot collaboration tasks; it opens up the opportunity for non-expert users to interact with a robot [117]. While the majority of research focuses on generic solutions for known or unknown objects and for different grasp techniques [105, 52, 29], which typically require a significant amount of data collection for generalized models, single demonstration grasping focuses only on a successful grasping action on known objects rather than on "grasp pose detection". This research work is inspired by [46, 47], where the authors develop a system that is able to learn how to perform an efficient grasp action by developing a CNN for centering the object in the camera image frame, followed by the execution of a fixed action to perform the actual grasping task in 3D space.

This research work aims at extending the application of state-of-the-art deep learning models in SDG systems to achieve enhanced performance. Finally, the concept of relative camera pose estimation will be evaluated on the system to find the correct pose of the camera with respect to the desired camera pose. Relative camera pose estimation is one of the fundamental vision tasks used in many core robotic functionalities such as structure from motion, SLAM and visual odometry. The conventional approaches use feature extraction algorithms like SIFT, DAISY, ORB, BRIEF and SURF that come with many limitations when dealing with nuisance factors such as brightness changes, texture-less objects and large changes in viewpoint. An alternative to these conventional methods is to apply deep learning to directly predict the relative camera pose, or to use deep learning for feature extraction and use the features in conventional models [138, 165].

4.4.2.4 Grasp evaluation
An open question in grasping research is how grasp candidates should be sampled for testing and how they should be evaluated. In order to generate training data for a robot that is large enough in quantity, varied enough for generalization and an accurate enough representation of task constraints, efficient heuristic measures are needed to search through a space of thousands of potentially viable grasps [54]. Even after the initial selection of these candidates, effective evaluation of the grasps and the metrics for a robot's performance on them determine the usefulness of the data and the robustness of the grasp algorithm trained on it [20]. The commonly used grasp sampling techniques analyzed by Eppner et al. [54] are broadly categorized into object geometry-guided sampling, (non-)uniform sampling [215], approach-based sampling and antipodal sampling [15]. Eppner et al. [54] then devised a few intuitive metrics for comparing these sampling methods, i.e., grasp coverage, robustness and precision. Based on their own simulations they generated over 317 billion grasps on 21 YCB-dataset objects [30] and evaluated the 1 billion successful grasps among them. They conclude that uniform samplers have better grasp coverage because of their minimal constraints, at the cost of efficiency. On the other hand, heuristics like approach-based or antipodal sampling are efficient but might not capture all possible grasps. Moreover, antipodal samplers were found to have higher coverage and to find more robust grasps. Precision is quite low for both uniform and approach-based methods, while being significantly higher for antipodal methods. Non-uniform, geometry-based approaches consistently perform poorly on all three metrics.

4.4.2.5 Manipulation benchmarking
For robotic manipulation tasks such as assembly, accurate grasps and grasp poses are crucial to ensure successful completion of the task. This implies that the object neither falls out of the gripper nor slips too much inside it. With an accurate estimate of the initial object pose, the in-hand pose of the object is assumed to remain within certain constraints and the final placement of the object can be done with reasonable certainty. A recent method tackles the problem of object placement in a tight region using conservative and optimistic estimates of the object volume [141]. Another group of methods uses force feedback or compliance control of the robotic hand/arm and proposes algorithmic approaches to solve well-known assembly tasks such as peg-in-hole, hole-on-peg and screwing a bolt. [194] and [201] are two state-of-the-art works that provide a general framework for the aforementioned assembly tasks using motion priors, like spiralling around the hole for peg-insertion tasks and back-and-forth spinning for screwing tasks. These methods are thoroughly tested on various combinations of compliant arms and compliant hands and fingers, both with and without contact sensors. They argue in detail over the benefits of using various force profiles as cues for driving the manipulation towards a more accurate and robust assembly, and present a general framework for benchmarking these problems. The ability of a robot to plan for and reach all possible poses in its workspace with a required certainty also contributes to the absolute constraints on its task execution and completion. The work by Bottarel et al. [20] lays down a general framework in this regard, to test the manipulability of a given robot in a particular environment setup without any specific task constraints. They formulate a composite expression to test reachability, obstacle avoidance and grasp robustness by repeatedly performing these three tasks in different regions into which the workspace is divided. Finally, recent work also combines visual and force sensing with deep reinforcement learning to learn policies for contact-rich manipulation tasks [118]. Through self-supervision, a compact multimodal representation of the sensory inputs is learned, which can then be used to improve the sample efficiency of policy learning. The method is evaluated on a peg-insertion task and shown to generalize over varying geometries, configurations and clearances, while being robust to external perturbations.

4.4.3 Description of work performed so far
The research and development of grasping capabilities for robot manipulators aims, first, to obtain results from existing deep learning networks on the dataset relevant to the Agile Production use case, and second, to generate a novel grasping model that addresses the shortcomings of the state-of-the-art approaches. To this end, two state-of-the-art networks were chosen and utilized for the generation and execution of grasps on the OpenDR AP dataset: PVN3D [83] and 6-DOF GraspNet [146]. PVN3D is a recent (2020) deep point-wise 3D-keypoint voting network for 6-DOF pose estimation that uses single RGB-D images for inference. 6-DOF GraspNet, on the other hand, generates grasps directly as output of the network, from 3D pointclouds observed by a depth camera. Both approaches are tested on the OpenDR AP dataset that is being developed in parallel. In particular, PVN3D uses the OpenDR AP dataset itself to train a new network, while for 6-DOF GraspNet the model trained by the authors is used for the evaluation of grasps. For both, simulation results are evaluated to assess their suitability for the use case and for inclusion in the OpenDR toolkit. The novel approach, which focuses on single demonstrations for generating a grasping model, tries to tackle several shortcomings of the state-of-the-art approaches, such as the required training data, the required training time and the size of the neural network. As work in progress, the developments and (intermediate) results of all three methods are presented in the following sections.

Figure 22: CAD dataset for the Agile Production benchmark of OpenDR. From top left to top right: piston, round peg, square peg, separator, pendulum head. From bottom left to bottom right: pendulum, shaft, face plate, valve tappet, shoulder bolt. The dataset consists of parts from the (3D printed) Cranfield assembly benchmark [40] and from a Diesel engine assembly use case.

4.4.3.1 OpenDR Agile Production benchmark dataset
The OpenDR AP dataset comprises a few commonly used Diesel engine-assembly parts and a standard set of parts from the Cranfield assembly benchmark [40]. These parts are chosen for their variety of shape, size, mass distribution and manipulability. The dataset forms an initial step towards a wider dataset in the future and provides enough flexibility to test multiple use cases with increasing levels of complexity for pose estimation, grasping and manipulation. CAD-model images of all objects used in this dataset are shown in Fig. 22.

4.4.3.2 PVN3D
The PVN3D neural network [83] is implemented in TensorFlow and the generic training and evaluation scripts are provided under the MIT License in the authors' repository36. The network consists of the following blocks and functionalities:

36https://github.com/ethnhe/PVN3D


1. Feature extraction: This block contains two separate steps:

   • a PSPNet-based [223] CNN layer for feature extraction in the RGB image;

   • a PointNet++-based [161] layer for geometry extraction in the pointcloud.

   The output features from these two are then fused by a DenseFusion [197] layer into a combined RGB-D feature embedding.

2. 3D-keypoint detection: This block comprises Multi-Layer Perceptrons (MLPs) shared with the semantic-segmentation block, and uses the features extracted by the previous block to estimate the offset of each visible point from the target keypoints in Euclidean space. The points and their offsets are then used to vote for candidate keypoints. The candidate keypoints are clustered using MeanShift clustering [41] and the cluster centers are cast as keypoint predictions.

3. Instance semantic segmentation: This block contains two modules sharing the same MLP layers as the 3D-keypoint detection block: a semantic-segmentation module that predicts a per-point class label, and a center-voting module that votes for the different object centers in order to distinguish between object instances in the scene. The center-voting module is similar to the 3D-keypoint detection block in that it predicts a per-point offset, which in this case votes for the candidate center of the object rather than for the keypoints.

4. 6-DoF pose estimation: This is simply a least-squares fitting between the keypoints predicted by the network (in the transformed camera coordinate system) and the corresponding keypoints (in the untransformed object coordinate system); a minimal sketch of such a fitting step is given after this list.
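For reference, the least-squares fitting used in the pose-estimation step can be sketched with the standard SVD-based (Kabsch/Umeyama) solution. This is a generic sketch of the fitting step, not the code used inside PVN3D:

```python
import numpy as np

def fit_pose(model_kps, pred_kps):
    """Least-squares rigid transform (R, t) mapping object-frame keypoints
    (model_kps, Mx3) onto the keypoints predicted in the camera frame
    (pred_kps, Mx3), via the SVD-based Kabsch solution."""
    mu_m = model_kps.mean(axis=0)
    mu_p = pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_p - R @ mu_m
    return R, t
```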

Data collection
All the data was collected in simulation only, as it allows easy fine-tuning of a variety of parameters, i.e., object placement and pose, and ambient, diffuse, directional and spot lighting. An Xbox 360 Kinect camera is simulated and publishes color images, depth images and the corresponding camera intrinsics over ROS. All objects are simulated using their standard polygon mesh files generated from CAD models. Only ambient and diffuse lighting (i.e., no directional or spot light) is used, with a fixed color for each object. The background is left as a brightly lit, plain-grey room with no walls, and the objects always rest on the floor in their most stable equilibrium poses. For the sake of simplicity, no dense background clutter or non-dataset objects are added to the environment, because the current simulation conditions do not require the algorithm to deal with any complex background objects other than the plain-grey simulation environment itself. A moderately dense clutter of all the objects in the dataset is created around the origin of the simulation environment, with three distinct sets of relative object positions and different levels of clutter and truncation. Hemisphere sampling as described in [90] is carried out around each clutter set. The steps in this sampling are as follows (a minimal sketch of the sampling loop is given after the list):

• The camera is moved from yaw = 0 degrees to yaw = 360 degrees in increments of 15 degrees around the clutter, with its principal axis always pointing towards the origin.


Figure 23: Upper-hemisphere sampling for the data collection. The image shows the camera sampling the scene at various poses, all superimposed in one frame. For the sake of visibility, fewer samples are shown here than the actual number used.

• For each yaw, the camera goes from pitch = 25 degrees to pitch = 85 degrees in increments of 10 degrees, with its principal axis always pointing towards the origin. This range allows for an adequate number of samples and avoids sampling objects in nearly flat poses (either completely horizontal or vertical with respect to the camera).

• For each combination of yaw and pitch, a total of four different scales are sampled, i.e., hemispheres of four different radii, from 65 cm to 95 cm in increments of 10 cm, around the origin.
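A minimal sketch of this sampling loop is given below; the exact pose construction (look-at convention, output format) is an illustrative assumption and not the simulation code itself:

```python
import numpy as np

def camera_positions(yaw_step=15, pitches=range(25, 86, 10),
                     radii=(0.65, 0.75, 0.85, 0.95)):
    """Yield camera positions (in metres) on upper hemispheres around the origin;
    at every position the camera principal axis is pointed back at the origin."""
    for yaw_deg in range(0, 360, yaw_step):
        for pitch_deg in pitches:
            for r in radii:
                yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
                position = np.array([r * np.cos(pitch) * np.cos(yaw),
                                     r * np.cos(pitch) * np.sin(yaw),
                                     r * np.sin(pitch)])
                look_dir = -position / np.linalg.norm(position)  # towards origin
                yield position, look_dir

# At each yielded viewpoint the simulator renders and stores RGB, depth,
# per-object masks and the ground-truth object poses in camera coordinates.
```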

A similar setup has been used for data collection and mesh reconstruction in both the LineMOD [90] and YCB [30] datasets. Fig. 23 shows the camera sampling for data collection following this scheme. For each sample, the dataset records RGB and depth images of the scene, a grey-scale image with a binary mask of each object encoded with the respective class label, and the ground-truth poses of each object in camera coordinates, acquired directly from the simulation environment. Also, since the simulation can directly provide an accurate camera-in-world pose for each sample, there is no need for extrinsic camera calibration. For each object mesh, greedy farthest-point sampling is used to select keypoints that are spread out at the largest possible distances from each other on the mesh surface (a minimal sketch is given below). Three different versions, i.e., 8, 12 and 16 keypoints, are used for training three separate network checkpoints. As a starting point, Gazebo was chosen as the simulation environment, due to its familiarity.
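Greedy farthest-point sampling can be sketched as follows; this is a generic sketch of the algorithm, not the exact selection code used for the dataset:

```python
import numpy as np

def farthest_point_sampling(vertices, k, seed=0):
    """Greedily pick k points from an (N, 3) vertex array such that each new
    point maximizes its distance to the points selected so far."""
    rng = np.random.default_rng(seed)
    selected = [rng.integers(len(vertices))]           # arbitrary first point
    dists = np.linalg.norm(vertices - vertices[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                    # farthest from current set
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(vertices - vertices[nxt], axis=1))
    return vertices[selected]

# keypoints = farthest_point_sampling(mesh_vertices, k=8)   # 8, 12 or 16 keypoints
```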

Training
A joint multi-task training is carried out for the 3D-keypoint detection and instance semantic-segmentation blocks. First, the semantic-segmentation module facilitates extracting global and local features in order to differentiate between instances, which results in accurate localization of points and improves the keypoint-offset reasoning procedure. Second, learning to predict keypoint offsets indirectly learns size information as well, which helps distinguish objects with similar appearance but different size. This paves the way for joint optimization of both network branches under a combined loss function, similar to the original work

[83]:

$L_{\mathrm{multitask}} = \lambda_1 L_{\mathrm{keypoints}} + \lambda_2 L_{\mathrm{semantic}} + \lambda_3 L_{\mathrm{center}},$  (19)

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights for each task. The authors of this method have shown experimentally that, when jointly trained, these tasks boost each other's performance. The individual losses for each module are:

$L_{\mathrm{keypoints}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left\| of_i^{\,j} - of_i^{\,j*} \right\| \, \mathbb{I}(p_i \in I),$  (20)

where $of_i^{\,j*}$ is the ground-truth translation offset; $M$ is the total number of selected target keypoints; $N$ is the total number of seed points and $\mathbb{I}$ is an indicator function that equals 1 only when point $p_i$ belongs to instance $I$, and 0 otherwise.

$L_{\mathrm{semantic}} = -\alpha (1 - q_i)^{\gamma} \log(q_i),$  (21)

where $q_i = c_i \cdot l_i$, with $\alpha$ the balance parameter, $\gamma$ the focusing parameter, $c_i$ the predicted confidence that the $i$-th point belongs to each class and $l_i$ the one-hot representation of the ground-truth class label.

$L_{\mathrm{center}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \Delta x_i - \Delta x_i^{*} \right\| \, \mathbb{I}(p_i \in I),$  (22)

where $N$ denotes the total number of seed points on the object surface, $\Delta x_i^{*}$ is the ground-truth translation offset from seed $p_i$ to the instance center, and $\mathbb{I}$ is an indicator function indicating whether point $p_i$ belongs to that instance.

With a total of 2304 collected data samples, a 75%-25% train-test split was used, where every 4th sample is used as a test sample in order to evenly cover all possible pitches, yaws and scales in both the test and training set. The input image size for training is 640 by 480, and a total of 12288 points are randomly sampled for PointNet++ [161] feature extraction. Only these points are further used for semantic labeling and keypoint-offset voting. This is the number originally recommended and tested by the authors. If the pointcloud contains fewer points than this, it is recursively wrap-padded around its edges until it contains at least 12288 points. All three keypoint variations were trained for a total of 70 epochs with a batch size of 24, as recommended by the authors. The training was carried out on 4 Nvidia V100 GPUs simultaneously and takes around 5-7 hours for the given batch size, number of epochs and training dataset.
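As an illustration of how the three losses in Eqs. (20)-(22) combine into Eq. (19), a minimal numpy sketch is given below; the tensor layouts, the weights λ and the focal-loss parameters are illustrative placeholders, not the values used in training:

```python
import numpy as np

def keypoint_loss(pred_off, gt_off, inst_mask):
    """Eq. (20): mean offset error over N seeds and M keypoints, masked per instance."""
    err = np.linalg.norm(pred_off - gt_off, axis=-1)        # (N, M)
    return (err * inst_mask[:, None]).sum() / len(pred_off)

def semantic_loss(conf, onehot, alpha=0.25, gamma=2.0):
    """Eq. (21): focal loss on the confidence of the ground-truth class."""
    q = (conf * onehot).sum(axis=-1)                        # (N,)
    return (-alpha * (1.0 - q) ** gamma * np.log(q + 1e-12)).mean()

def center_loss(pred_dx, gt_dx, inst_mask):
    """Eq. (22): mean error of the per-point offsets towards the instance center."""
    err = np.linalg.norm(pred_dx - gt_dx, axis=-1)          # (N,)
    return (err * inst_mask).sum() / len(pred_dx)

def multitask_loss(outputs, targets, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (19): weighted sum of the three task losses."""
    l1, l2, l3 = lambdas
    return (l1 * keypoint_loss(outputs["kp_off"], targets["kp_off"], targets["mask"])
            + l2 * semantic_loss(outputs["conf"], targets["onehot"])
            + l3 * center_loss(outputs["ctr_off"], targets["ctr_off"], targets["mask"]))
```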

4.4.3.3 6-DOF GraspNet
6-DOF GraspNet [146] is implemented in TensorFlow; the source code is released under the MIT License and the trained weights under CC-BY-NC-SA 2.0, in the authors' repository37. The neural network has two main components and a third refinement step, as follows:

1. Grasp sampler: This component is a Variational Auto-Encoder (VAE) [104]; VAEs are widely used as generative models in other machine-learning domains and can undergo unsupervised training to maximize the likelihood of the training data. In this method, a VAE with PointNet++ encoding layers learns to maximize the likelihood P(G|X) of ground-truth

37https://github.com/NVlabs/6dof-graspnet


positive grasps G given a pointcloud X. G and X are mapped to a latent-space variable z. The probability density function P(z) in latent space is approximated as uniform; hence, the likelihood of the generated grasps can be written as:

$\int P(G \,|\, X, z; \theta)\, P(z)\, dz.$  (23)

2. Grasp evaluator: Since the grasp sampler learns to maximize the likelihood of generating as many grasps as it can for a pointcloud, it learns only from positive examples and hence can generate false positives, e.g., due to noisy or incomplete pointclouds at test time. To overcome this, an adversarial network, which also has a PointNet++ architecture, learns to estimate the probability of success P(S|g,X) of a grasp g given the observed pointcloud X. This network also uses negative examples from the training data, as well as hard negatives generated through random perturbation of positive grasp samples. The main difference from the generator is that it extracts PointNet++ features from a unified pointcloud containing both the object and the gripper (in the grasp pose) points, and the measure of success is derived from this geometric association between the two.

3. Iterative grasp-pose refinement: The grasps rejected by the evaluator are often close to success and can undergo an iterative refinement step. A refinement transformation ∆g can be found by taking the partial derivative of the success function P(S|g,X) with respect to the grasp transformation T(g). Using the chain rule, ∆g is computed as follows:

$\Delta g = \eta \, \frac{\partial S}{\partial g} = \eta \, \frac{\partial S}{\partial T(g; p)} \times \frac{\partial T(g; p)}{\partial g},$  (24)

where T(g; p) is the transformation of a set of predefined points p and η is a hyperparameter that limits the update at each step. The authors of this method chose η such that the maximum translation update never exceeds 1 cm per refinement step. A minimal sketch of such a gradient-based refinement step is given below.
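The gradient-based refinement step can be sketched with automatic differentiation as follows; the evaluator network and the 6-D grasp parameterization are placeholders for illustration, not the original implementation:

```python
import torch

def refine_grasp(grasp, pointcloud, evaluator, eta=0.05, max_trans=0.01):
    """One refinement step: move the grasp parameters along the gradient of the
    predicted success score, with the translation update capped at max_trans (1 cm)."""
    g = grasp.clone().requires_grad_(True)            # e.g. 6-D pose parameters
    score = evaluator(g, pointcloud)                  # predicted P(S | g, X), scalar
    (grad,) = torch.autograd.grad(score, g)
    step = eta * grad
    # Cap the translation part of the update (assumed to be the first 3 entries).
    trans_norm = step[:3].norm()
    if trans_norm > max_trans:
        step[:3] *= max_trans / trans_norm
    return (g + step).detach()
```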

Data collection
The grasp data is collected in a physics simulation, based on grasps performed with a free-floating parallel-jaw gripper and objects in zero gravity. Objects have a uniform surface density and friction coefficient, and a grasping trial consists of closing the gripper in a given grasp pose and performing a shaking motion. If the object stays enclosed within the fingers during the shaking, the grasp is labelled as positive. Grasps are sampled based on object geometry, by sampling random points on the object mesh surface and aligning the approach axis with the surface normal at each of these points. The distance of the gripper from the object is sampled uniformly between zero and the finger length, and the gripper roll is also sampled from a uniform distribution. Only grasps with a non-empty closing volume and no collision with the object are simulated. Grasps are performed on a total of 206 objects from six categories in ShapeNet [36]. A total of 10,816,720 candidate grasps are sampled, of which 7,074,038 (65.4%) are simulated, i.e., those that pass the initial non-collision and non-empty closing volume constraints. Overall, 2,104,894 successful grasps (19.4%) are generated. This data collection was done as part of the original work in [146].

Training
The posterior probability in Eq. (23) is intractable because of the infinitely many values in latent space. This is simplified by the encoder of the grasp sampler, which learns the mapping Q(z|X,g) between the latent variable z and the pointcloud-grasp pair (X, g). The decoder learns to reconstruct a grasp pose ĝ from the latent variable z. The reconstruction loss between ground-truth grasps g ∈ G* and the reconstructed grasps ĝ is:

$L(g, \hat{g}) = \frac{1}{n} \sum \left\| T(g, p) - T(\hat{g}, p) \right\|_1,$  (25)

where T(·, p) is the transformation of a set of predefined points p on the robot gripper. The total loss function optimized by the VAE is

$L_{\mathrm{VAE}} = \sum_{z \sim Q,\, g \sim G^{*}} L(g, \hat{g}) - \alpha\, D_{\mathrm{KL}}\!\left[ Q(z \,|\, X, g),\, \mathcal{N}(0, I) \right],$  (26)

where $D_{\mathrm{KL}}$ represents the KL-divergence between the learned distribution Q(·|·) and the normal distribution N(0, I); including this term ensures a latent-space distribution close to a normal distribution with unit variance. For a pointcloud X, grasps g are sampled from the set of ground-truth grasps G* using stratified sampling. Both encoder and decoder are PointNet++-based and encode a feature vector that contains the 3D coordinates of each sampled point and the relative positions of its neighbours. The decoder concatenates the latent variable z with this feature vector. Optimizing the loss function in Eq. (26) using stochastic gradient descent makes the encoder learn to pack enough information (about the grasp and the pointcloud) into the variable z so that the decoder can reliably reconstruct the grasps from it. The grasp evaluator is optimized using the cross-entropy loss, as explained in the original work [146]:

$L_{\mathrm{evaluator}} = -\left( y \log(s) + (1 - y) \log(1 - s) \right),$  (27)

where y is the ground-truth binary label indicating whether the grasp is successful or not, and s is the probability of success predicted by the evaluator.
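For clarity, the losses in Eqs. (25)-(27) can be sketched as follows; the gripper control points, the weight α, the batch handling and the conventional sign of the KL penalty (deviation from N(0, I) is penalized) are illustrative assumptions, not the original training code:

```python
import torch
import torch.nn.functional as F

def control_points(grasp_pose, gripper_pts):
    """Transform predefined gripper points p (n x 3) by a 4x4 grasp pose: T(g, p)."""
    R, t = grasp_pose[:3, :3], grasp_pose[:3, 3]
    return gripper_pts @ R.T + t

def reconstruction_loss(g, g_hat, gripper_pts):
    """Eq. (25): mean L1 distance between transformed control points."""
    return (control_points(g, gripper_pts) - control_points(g_hat, gripper_pts)).abs().mean()

def vae_loss(g, g_hat, mu, logvar, gripper_pts, alpha=0.01):
    """Eq. (26): reconstruction term plus a KL penalty towards N(0, I)."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction_loss(g, g_hat, gripper_pts) + alpha * kl

def evaluator_loss(success_prob, label):
    """Eq. (27): binary cross-entropy between predicted and ground-truth success."""
    return F.binary_cross_entropy(success_prob, label)
```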

4.4.3.4 Single demonstration grasping
The majority of the robot grasping approaches discussed earlier develop generic and efficient learning-based models capable of working in different environments, and typically require a significant amount of training data and training time. As such data and time might not be available in an industrial scenario, this work proposes a Single-Demonstration Grasp (SDG) model that relies only on a single demonstration by the operator. SDG then collects and generates the necessary data for learning how to repeat a grasp action for a particular object, regardless of its position and orientation. The proposed model is inspired by the work of De Coninck et al. [47], where grasping is divided into two consecutive actions: first, the model predicts the 2D pose of an object on a plane parallel to the surface of the table; second, after reaching the hover pose, the end-effector executes a predefined trajectory to reach the final 3D grasp pose. The hover pose (a relative pose of the camera with respect to the object) and the predefined trajectory are demonstrated by the user as the only inputs to the system. The proposed SDG model aims at extending the baseline approach in [47] with deep learning, by evaluating more robust, efficient and flexible methods for reaching the hover pose. In addition, DL-based models are utilized for predicting the relative pose of the camera. Currently, the main focus is on evaluating the application of deep learning in the baseline model; later on, the aim is to use the concept of relative pose estimation to directly reach the final grasp pose. The baseline model as described in [47] consists of the following steps:


1. Hover pose demonstration: showing a pose with respect to the object before executing the final grasp action. This is used for generating augmented data from the captured image.

2. Approach demonstration: recording the trajectory data from the hover pose to the grasp pose.

3. Data generation: generating training data using data augmentation techniques.

4. Training: training a Convolutional Neural Network (CNN) model for localizing the object and predicting its orientation.

5. Grasp execution: using all the information from the previous steps to execute a grasping action regardless of the object's pose.

In the baseline, the prediction of the 2D location and orientation of the objects is treated as a pair of classification problems, where the output ranges of the models are (0 to 1) for location and (0 to 360) degrees for orientation. The two models differ only in their output nodes and loss functions: the location model outputs a probability of a window being empty or occupied, while the angle model outputs the class corresponding to the predicted angle. Our SDG approach extends the baseline by:

1. utilizing one-shot localization approaches instead of a sliding window;

2. integrating the depth towards the object into the grasping model;

3. utilizing camera relative pose estimation to enhance orientation estimation.

Facebook's Detectron2 object-detection library38 is used to replace the original sliding-window approach with a one-shot detection model. This enables object detection and orientation estimation simultaneously, by predicting keypoints in the image.

Data collection
The following augmentation sequence is applied in order to generate a training dataset according to Detectron2's input requirements: brightness variation, 2D translation, rotation and scaling. In addition, bounding boxes and keypoints are defined on the image, where the bounding-box coordinates are used to obtain the center point of each object and two keypoints represent the orientation of the object. The transformation of the keypoint locations and bounding boxes after augmentation is handled internally by the 'imgaug' library. The generated images, as well as the corresponding bounding-box and keypoint pixel coordinates, are stored in order to register them in Detectron2's data dictionary. Examples of the training data are illustrated in Fig. 24.
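A minimal sketch of such an augmentation pipeline with imgaug is shown below; the exact augmenter parameters are illustrative assumptions, not the values used to build the dataset:

```python
import imgaug.augmenters as iaa
from imgaug.augmentables.kps import Keypoint, KeypointsOnImage
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# Brightness variation, 2D translation, rotation and scale (parameters are examples).
seq = iaa.Sequential([
    iaa.Multiply((0.7, 1.3)),                       # brightness
    iaa.Affine(translate_percent=(-0.2, 0.2),
               rotate=(-180, 180),
               scale=(0.8, 1.2)),
])

def augment_sample(image, box_xyxy, kp_xy):
    """Augment one demonstration image together with its bounding box and the
    two keypoints that encode the object orientation."""
    bbs = BoundingBoxesOnImage([BoundingBox(*box_xyxy)], shape=image.shape)
    kps = KeypointsOnImage([Keypoint(x, y) for x, y in kp_xy], shape=image.shape)
    img_aug, bbs_aug, kps_aug = seq(image=image, bounding_boxes=bbs, keypoints=kps)
    return img_aug, bbs_aug, kps_aug
```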

Training
The model is trained for a set of different objects, such as a Diesel engine piston and an Allen key set, i.e., parts with complex as well as symmetric shapes, as shown in Fig. 24. The keypoint_rcnn_R_50_FPN_3x pretrained model and weights from Detectron2's model zoo are used, with a learning rate of 0.0008 and 1000 iterations. After these 1000 iterations, the total loss has decreased to 1.27. The training is done on Google Colab servers and takes around 10 minutes per model for the 1000 iterations.

38https://github.com/facebookresearch/detectron2
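A minimal sketch of such a Detectron2 fine-tuning setup is given below; the dataset name and the number of keypoints/classes are illustrative assumptions (a dataset with boxes and keypoints is assumed to be registered beforehand), while the learning rate, iteration count and model-zoo entry follow the values reported above:

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")   # pretrained weights
cfg.DATASETS.TRAIN = ("sdg_train",)    # assumed: registered augmented dataset
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1                    # one object class per model
cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = 2          # two orientation keypoints
cfg.SOLVER.BASE_LR = 0.0008
cfg.SOLVER.MAX_ITER = 1000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```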


Figure 24: Industrial parts and tools (left), and training data generated by data augmentation (right) for the proposed Single-Demonstration Grasping model (SDG).

4.4.4 Performance evaluation
The described results are obtained with an i7-9850H CPU @ 2.60 GHz × 12 and a Quadro T1000 GPU, for an input image resolution of 640 × 480, in both RGB and depth format.

4.4.4.1 Pose estimation with PVN3D (simulation)
For evaluating pose estimation, two widely used metrics, ADD and ADD-S, are used. Given a ground-truth object rotation R and translation t, and a predicted rotation R̃ and translation t̃, ADD computes the average distance between pairs of corresponding 3D points on the model transformed according to the ground-truth pose and the estimated pose:

$\mathrm{ADD} = \frac{1}{m} \sum_{x \in M} \left\| (Rx + t) - (\tilde{R}x + \tilde{t}) \right\|,$  (28)

where M denotes the set of 3D model points and m is the number of points. For objects that are symmetric about their principal axis of rotation, the rotation about this axis in the predicted pose can be ambiguous by 180 degrees, since similar points repeat every 180 degrees and the correspondences are spurious. This rotational ambiguity is bypassed by the ADD-S metric, where only the minimum distance over all model points is considered for each point, and the average of these minimum distances gives the error:

$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in M} \min_{x_2 \in M} \left\| (Rx_1 + t) - (\tilde{R}x_2 + \tilde{t}) \right\|.$  (29)

Results are presented in terms of accuracy and the area under the accuracy-threshold curve, where the threshold for both the ADD and ADD-S metrics is varied from 0 to 10 cm. Results are reported in Table 7 and some qualitative results are shown in Fig. 25. A minimal sketch of both metrics is given below.

As these results only demonstrate the accuracy of pose estimation, a follow-up step is devised that transforms the estimated pose into a usable grasp pose. Predefined grasps for each object are selected based on the constraints and properties of the gripper, object and robotic system, such as gripper dimensions, object shape and center of mass, and robot approach path. These grasps are defined in object coordinates (see Fig. 26) and are transformed to world coordinates based on the estimated pose of the object. Then, grasps are executed in pick-and-place trials in isolation and in clutter (see Fig. 27) to assess their quality based on the following metrics: the number of generated grasps, the grasp test (the percentage of grasps that passed the initial planning and grasp-execution test, where the object stays within the gripper), the placement test (the percentage of grasps that passed the placement test, where the object is placed upright and within 5 cm of the required x and y coordinates of the goal pose) and the placement error. These results are reported in Table 8 for isolated objects and in Table 9 for cluttered objects.
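A minimal numpy sketch of the ADD and ADD-S metrics in Eqs. (28) and (29) is given here; this is a generic sketch, not the evaluation code used to produce Table 7:

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Eq. (28): mean distance between corresponding transformed model points."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Eq. (29): for each ground-truth point, distance to the closest predicted point."""
    gt = model_pts @ R_gt.T + t_gt                      # (m, 3)
    pred = model_pts @ R_pred.T + t_pred                # (m, 3)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)  # (m, m)
    return dists.min(axis=1).mean()
```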

Figure 25: Images from the pose-estimation inference (PVN3D). The 3D poses are shown as bounding boxes projected on the RGB image.

Figure 26: Predefined grasps for each of the objects used in pose estimation (PVN3D) and grasping simulation. These grasps are defined in object coordinates and are transformed to world coordinates based on the estimated pose of the respective object.

Figure 27: Pick poses of each object when tested in isolation (left two images) and clutter (right two images).


Table 7: Area under the accuracy-threshold curve (AUC) for the ADD and ADD-S metrics on the OpenDR Agile Production benchmark dataset.

Object               | 16 keypoints    | 12 keypoints    | 8 keypoints
                     | ADD     ADD-S   | ADD     ADD-S   | ADD     ADD-S
piston               | 97.76   98.16   | 97.57   98.02   | 97.21   97.85
round peg            | 97.35   97.35   | 97.4    97.4    | 97.05   97.05
square peg           | 97.12   97.12   | 96.92   96.92   | 96.32   96.32
pendulum             | 96.09   96.09   | 95.68   95.68   | 95.31   95.31
pendulum-head        | 97.27   97.27   | 97.05   97.05   | 96.73   96.73
separator            | 96.4    96.4    | 96.24   96.24   | 95.95   95.95
shaft                | 97.93   97.93   | 97.68   97.68   | 97.41   97.41
face-plate           | 96.16   96.16   | 96.15   96.15   | 95.88   95.88
valve-tappet         | 91.7    91.7    | 92.48   92.48   | 91.58   91.58
shoulder-bolt        | 93.5    93.5    | 93.08   93.08   | 91.4    91.4
Average              | 96.13   96.17   | 96.03   96.07   | 95.48   95.55
Inference time [sec] | 1.98            | 1.75            | 1.5
GPU memory [MB]      | 2775            | 2544            | 2343

Table 8: Results from pick-and-place experiments in isolation. The experiments are reported in terms of the number of generated grasps, the grasp test (% of grasps that ended in successful holding of the object), the placement test (% of grasps that were able to place the object upright with an (x, y) error < 5 cm) and the average placement error in (cm, deg).

Object     | Generated grasps | Grasp test [%] | Placement test [%] | Placement error: x [cm] | y [cm] | yaw [deg]
Piston     | 88               | 83.0           | 60.2               | 0.9                     | 2.5    | 25
Round peg  | 108              | 76.9           | 53.7               | 0.2                     | 1.3    | -
Square peg | 108              | 72.2           | 42.6               | 0.9                     | 1.4    | 46
Pendulum   | 54               | 87.0           | 61.1               | 0.4                     | 2      | 2
Separator  | 58               | 50.0           | 36.2               | 1.1                     | 3.8    | 2
Shaft      | 108              | 74.1           | 51.9               | 0.4                     | 1.5    | 44

Table 9: Results from pick-and-place experiments in clutter. These results are reported using the same metrics as the isolated pick-and-place experiments presented in Table 8.

Object     | Generated grasps | Grasp test [%] | Placement test [%] | Placement error: x [cm] | y [cm] | yaw [deg]
Piston     | 31               | 74.2           | 58.1               | 0.55                    | 2.9    | 14
Round peg  | 37               | 75.7           | 46.0               | 0.1                     | 1.3    | -
Square peg | 37               | 55.1           | 35.1               | 0.87                    | 1.6    | 36
Pendulum   | 19               | 42.1           | 16.0               | 0.19                    | 2.1    | 1
Separator  | 22               | 50.0           | 36.4               | 1.33                    | 1.8    | 3
Shaft      | 37               | 75.8           | 48.6               | 0.33                    | 1.1    | 40


4.4.4.2 Grasp generation and execution with 6-DOF GraspNet (simulation)
As mentioned in Section 4.4.3.3, the 6-DOF GraspNet algorithm generates grasps all over the pointcloud, because it was trained to generate grasps for a free-floating gripper and does not take into account the constraints that apply when planning the grasp for the whole arm. Filtering of the generated grasps is therefore crucial to select grasps suitable for the objects and the robotic system, based on criteria such as approach direction, occlusion and the ground plane; an example is shown in Fig. 28 for the piston, and a minimal sketch of such a filter is given below. The grasp quality of 6-DOF GraspNet cannot be assessed with pick-and-place tests, as there is no prior information about the object pose with respect to a ground plane and accurate placement is therefore not possible. Instead, we assess a grasp's success based on whether it keeps holding the object after being subjected to a predefined jerky motion along various axes. Results are shown in Table 10, which provides the total number of grasps generated during the simulation, the percentage of grasps that passed the initial planning and grasp-execution test (grasp test) and the percentage that passed the stability test after oscillation (stability test).
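A minimal sketch of an approach-direction filter of this kind is given below; the camera-frame convention, the pose layout and the angular threshold are illustrative assumptions, not the filter used to produce Fig. 28:

```python
import numpy as np

def filter_grasps(grasp_poses, max_angle_deg=45.0):
    """Keep grasps whose approach axis (assumed to be the local z-axis of the
    4x4 grasp pose, expressed in the camera frame) stays within max_angle_deg
    of the camera viewing direction, so the arm can approach without occlusion."""
    keep = []
    camera_forward = np.array([0.0, 0.0, 1.0])          # camera looks along +z
    for T in grasp_poses:                                # T: (4, 4) grasp pose
        approach = T[:3, 2]                              # gripper approach axis
        cos_angle = float(approach @ camera_forward)
        if np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))) <= max_angle_deg:
            keep.append(T)
    return keep
```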

Figure 28: Camera coordinate system used for filtering grasp projections (left). Grasps before filtering (middle). Grasps after filtering (right).

Table 10: Results from pick-and-oscillate experiments. The results are reported in terms of the number of generated grasps, the grasp test (% of grasps that ended in successful holding of the object) and the stability test (% of grasps that kept holding the object after oscillation).

Object     | Generated grasps | Grasp test [%] | Stability test [%]
Piston     | 40               | 42.5           | 12.1
Round peg  | 22               | 77.3           | 59.1
Square peg | 19               | 47.4           | 21.1
Pendulum   | 36               | 8.3            | 2.8
Separator  | 39               | 5.1            | 0.0
Shaft      | 21               | 42.9           | 28.6


4.4.4.3 Grasp generation and execution with Single-demonstration grasping (experiments)
The trained Single-Demonstration Grasping model described in Section 4.4.3.4 has been evaluated by executing grasping actions for several industrial parts and tools, such as an Allen key set, a Diesel engine piston and a small piston part (see Fig. 24). Each grasp has been repeated 10 times. Out of 10 attempts, the model trained for the Allen key set has a 100% success rate and the one for the Diesel engine piston a 90% success rate; the failure is due to the object leaving the camera's field of view and not due to the model itself. One noteworthy issue is the symmetry of objects, for which the generated model is not able to predict the location of the keypoints precisely. For non-symmetric parts, the developed model performs with a high success rate. Detectron2's default predictor has been used for evaluating the performance of the trained model on a machine equipped with an Nvidia GTX 1080. The average inference rate on an image stream from a RealSense D435 camera (640 × 480 RGB images) is about 20 FPS, which is sufficient for closed-loop control of the arm. Fig. 29 depicts preliminary results of real experiments with a Franka Panda collaborative robot.

Figure 29: Proposed Single-Demonstration Grasping model at the hover pose (left) and the grasp pose (right).

4.4.5 Future work
Deep-learning-based robot grasp generation for industrial objects has clear potential, as demonstrated by the different algorithms, both in simulation and in experiments. Nevertheless, several directions will be pursued to improve the approaches and to expand their implementation. As a start, the Agile Production dataset presented in Fig. 22 will be extended with more industrial parts and tools, and evaluated with the mentioned algorithms for grasp (pose) estimation. A broader dataset serves to demonstrate the capabilities of the algorithms and the OpenDR toolkit, and could potentially serve as an agile production grasping benchmark. Second, the development of the dataset, the generation of training data and the simulation environment will be integrated in Webots, the dedicated simulator for the OpenDR toolkit. Concerning the performance of the tested and developed grasping algorithms, the current results do not meet the specifications of the toolkit as stated in Deliverable 2.1, 'Toolkit and use case-specific requirements and specifications'. Future efforts will therefore focus on acceleration and improvement of the neural networks (pose estimation and grasp estimation) to achieve a performance suitable for real-time control (i.e., 25 FPS, 100 ms latency, max 1 GB memory) on real robotic systems. Such suitable models will then be included in the OpenDR toolkit.


5 Conclusions

This document presented the work performed in WP5. After a short introduction to the work done on the individual tasks, the document provided a detailed overview of each task, as summarized below. Chapter 2 presented the status of the work performed for Task 5.1, Deep Planning. First, AU presented deep end-to-end planning for autonomous drone racing, proposing a curriculum-based deep reinforcement learning approach to train an end-to-end policy, with a performance evaluation in a realistic simulation environment. Second, AU introduced a local replanning strategy for the agricultural use case, with preliminary results in Webots simulation. AU will include the curriculum-based DRL method for autonomous drone racing in the OpenDR toolkit; the end-to-end planning method for the agricultural use case is also intended to be included in the toolkit. Chapter 3 detailed the status of the work performed for Task 5.2, Deep Navigation. ALU-FR presented a new method to enable navigation for mobile manipulation tasks. It uses reinforcement learning to learn a base policy that ensures the kinematic feasibility of arbitrary end-effector motions. The chapter first reviewed the existing literature, presented the new approach and then empirically demonstrated its ability to generalize across robotic platforms and tasks, thereby enabling the transformation of a wide range of applications into mobile tasks. Methods to train and run this approach will be included in the OpenDR toolkit. Also, TUD introduced a visual navigation method for real-world indoor environments using end-to-end DRL. The approach enables the direct deployment of a policy trained in simulation with images on real robots. Moreover, the approach is lightweight and reaches 20 frames per second on embedded hardware such as the NVIDIA Jetson. Finally, Chapter 4 highlighted the work performed for Task 5.3, Deep Action and Control. The chapter covered the following topics. First, TUD introduced the work performed on model-based latent planning, with preliminary but encouraging performance results in a simplified scenario; if an extension of the proposed method that can deal with image observations shows convincing results, it is intended to be added to the toolkit. Second, TUD provided a detailed overview of existing RL frameworks, together with an overview of the state of the art and of the existing toolboxes for hyperparameter optimization. Third, TAU provided an overview of robot grasping, where two state-of-the-art methods (grasp pose estimation and grasp generation) are evaluated on the OpenDR Agile Production dataset. In addition, TAU is developing a novel single-demonstration grasping model that shows promising preliminary results. Suitable grasping models will be included in the OpenDR toolkit.


References

[1] M. Andrecut and M. Ali. Deep-sarsa: A reinforcement learning algorithm for au- tonomous navigation. International Journal of Modern Physics C, 12(10):1513–1523, 2001.

[2] A. Antonini, W. Guerra, V. Murali, T. Sayre-McCord, and S. Karaman. The blackbird dataset: A large-scale dataset for uav perception in aggressive flight. In International Symposium on Experimental Robotics, pages 130–139. Springer, 2018.

[3] H. Arbabi, M. Korda, and I. Mezić. A data-driven koopman model predictive control framework for nonlinear partial differential equations. In 2018 IEEE Conference on Decision and Control (CDC), pages 6409–6414, 2018.

[4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

[5] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[6] A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell. Agent57: Outperforming the Atari human benchmark. arXiv preprint arXiv:2003.13350, 2020.

[7] J. A. Bagnell, F. Cavalcanti, L. Cui, T. Galluzzo, M. Hebert, M. Kazemi, M. Klingen- smith, J. Libby, T. Y. Liu, N. Pollard, M. Pivtoraiko, J. Valois, and R. Zhu. An integrated system for autonomous robotics manipulation. Int. Conf. on Intelligent Robots and Sys- tems, 2012.

[8] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

[9] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforce- ment learning. arXiv preprint arXiv:1707.06887, 2017.

[10] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(10):281–305, 2012.

[11] J. Bergstra, D. Yamins, and D. Cox. Making a science of model search: Hyperparam- eter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pages 115–123, 2013.

[12] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.

[13] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fis- cher, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.


[14] S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. Torr. Playing DOOM with slam-augmented deep reinforcement learning. arXiv preprint arXiv:1612.00380, 2016.

[15] A. Bicchi and V. Kumar. Robotic grasping and contact: A review. In IEEE International Conference on Robotics and Automation (ICRA), pages 348–353, 2000.

[16] L. Birglen and T. Schlicht. A statistical review of industrial robotic grippers. Robotics and Computer-Integrated Manufacturing, 49:88–97, 2018.

[17] K. Blomqvist, M. Breyer, A. Cramariuc, J. Förster, M. Grinvald, F. Tschopp, J. J. Chung, L. Ott, J. Nieto, and R. Siegwart. Go fetch: Mobile manipulation in unstructured environments. arXiv preprint arXiv:2004.00899, 2020.

[18] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis—a survey. IEEE Transactions on Robotics, 30(2):289–309, 2013.

[19] R. Bonatti, R. Madaan, V. Vineet, S. Scherer, and A. Kapoor. Learning visuomotor policies for aerial navigation using cross-modal representations. arXiv preprint arXiv:1909.06993, 2020.

[20] F. Bottarel, G. Vezzani, U. Pattacini, and L. Natale. Graspa 1.0: Graspa is a robot arm grasping performance benchmark. IEEE Robotics and Automation Letters, 5(2):836–843, 2020.

[21] N. Botteghi, B. Sirmacek, K. A. Mustafa, M. Poel, and S. Stramigioli. On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach. arXiv preprint arXiv:2002.04109, 2020.

[22] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

[23] D. Bruder, B. Gillespie, C. D. Remy, and R. Vasudevan. Modeling and control of soft robots using the koopman operator and model predictive control. In Proceedings of Robotics: Science and Systems (RSS), pages 01–09, 2019.

[24] S. L. Brunton, B. W. Brunton, J. L. Proctor, and J. N. Kutz. Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control. PLoS ONE, 11(2), 2016.

[25] S. L. Brunton, J. L. Proctor, J. N. Kutz, and W. Bialek. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 113(15):3932–3937, 2016.

[26] F. Burget, M. Bennewitz, and W. Burgard. BI2RRT*: An efficient sampling-based path planning framework for task-constrained mobile manipulation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.

[27] F. Burget, A. Hornung, and M. Bennewitz. Whole-body motion planning for manipulation of articulated objects. IEEE International Conference on Robotics and Automation (ICRA), pages 1656–1662, 2013.

[28] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.

[29] S. Caldera, A. Rassau, and D. Chai. Review of deep learning methods in robotic grasp detection. Multimodal Technologies and Interaction, 2(3):57, 2018.

[30] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In inter- national conference on advanced robotics (ICAR), pages 510–517, 2015.

[31] E. Camci and E. Kayacan. End-to-end motion planning of quadrotors using deep rein- forcement learning. arXiv preprint arXiv:1909.13599, 2019.

[32] E. Camci and E. Kayacan. Learning motion primitives for planning swift maneuvers of quadrotor. Autonomous Robots, 43(7):1733–1745, 2019.

[33] C. Capellen, M. Schwarz, and S. Behnke. Convposecnn: Dense convolutional 6D object pose estimation. arXiv preprint arXiv:1912.07333, 2019.

[34] P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. 2018.

[35] U. Cekmez, M. Ozsiginan, and O. K. Sahingoz. A uav path planning with parallel aco algorithm on cuda platform. In 2014 International Conference on Unmanned Aircraft Systems (ICUAS), pages 347–354. IEEE, 2014.

[36] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[37] C. Chen, Y. Wan, B. Li, C. Wang, G. Xie, and H. Jiang. Motion planning for het- erogeneous unmanned systems under partial observation from uav. arXiv preprint arXiv:2007.09633, 2020.

[38] K. Chen, J. P. de Vicente, G. Sepulveda, F. Xia, A. Soto, M. Vázquez, and S. Savarese. A behavioral approach to visual navigation with graph localization networks. arXiv preprint arXiv:1903.00445, 2019.

[39] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31:4754–4765, 2018.

[40] K. Collins, A. Palmer, and K. Rathmill. The development of a european benchmark for the comparison of assembly robot programming systems. In Robot technology and applications, pages 187–199. Springer, 1985.

[41] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, 24(5):603–619, 2002.

[42] W. W. Connors, M. Rivera, and S. Orzel. Quake 3 arena manual, 1999.


[43] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016.

[44] K. Culligan, M. Valenti, Y. Kuwata, and J. P. How. Three-dimensional flight experi- ments using on-line mixed-integer linear programming trajectory optimization. In 2007 American Control Conference, pages 5322–5327. IEEE, 2007.

[45] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.

[46] E. De Coninck, T. Verbelen, P. Van Molle, P. Simoens, and B. Dhoedt. Learning to grasp arbitrary household objects from a single demonstration. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2372–2377, 2019.

[47] E. De Coninck, T. Verbelen, P. Van Molle, P. Simoens, and B. Dhoedt. Learning robots to grasp by demonstration. Robotics and Autonomous Systems, 127:103474, 2020.

[48] J. Delmerico, T. Cieslewski, H. Rebecq, M. Faessler, and D. Scaramuzza. Are we ready for autonomous drone racing? the uzh-fpv drone racing dataset. In 2019 International Conference on Robotics and Automation (ICRA), pages 6713–6719. IEEE, 2019.

[49] S. L. Delp, F. C. Anderson, A. S. Arnold, P. Loan, A. Habib, C. T. John, E. Guendelman, and D. G. Thelen. Opensim: open-source software to create and analyze dynamic simu- lations of movement. IEEE transactions on biomedical engineering, 54(11):1940–1950, 2007.

[50] A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi. Towards general- ization in target-driven visual navigation by using deep reinforcement learning. IEEE Transactions on Robotics, 2020.

[51] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.

[52] G. Du, K. Wang, and S. Lian. Vision-based robotic grasping from object localiza- tion, pose estimation, grasp detection to motion planning: A review. arXiv preprint arXiv:1905.06658, 2019.

[53] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual Foresight: Model- Based Deep Reinforcement Learning for Vision-Based Robotic Control. arXiv preprint arXiv:1812.00568, pages 1–14, 2018.

[54] C. Eppner, A. Mousavian, and D. Fox. A billion ways to grasp: An evaluation of grasp sampling schemes on a dense, physics-based grasp data set. arXiv preprint arXiv:1912.05604, 2019.

[55] D. Falanga, E. Mueggler, M. Faessler, and D. Scaramuzza. Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision. In 2017 IEEE international conference on robotics and automation (ICRA), pages 5774–5781. IEEE, 2017.


[56] S. Falkner, A. Klein, and F. Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.

[57] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning. arXiv preprint arXiv:1803.00101v1, 2018.

[58] N. Ferns and D. Precup. Bisimulation metrics are optimal value functions. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 210–219, 2014.

[59] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[60] P. Foehn, D. Brescianini, E. Kaufmann, T. Cieslewski, M. Gehrig, M. Muglikar, and D. Scaramuzza. Alphapilot: Autonomous drone racing. arXiv preprint arXiv:2005.12813, 2020.

[61] S. Fujimoto, H. Van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[62] F. Furrer, M. Burri, M. Achtelik, and R. Siegwart. Rotors—a modular gazebo mav simulator framework. In Robot Operating System (ROS), pages 595–625. Springer, 2016.

[63] J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, X. Ye, Z. Chen, and S. Fujimoto. Horizon: Facebook's open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.

[64] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. Deepmdp: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pages 2170–2179, 2019.

[65] J. H. Gillula, H. Huang, M. P. Vitus, and C. J. Tomlin. Design of guaranteed safe maneuvers using reachable sets: Autonomous quadrotor aerobatics in theory and practice. In 2010 IEEE International Conference on Robotics and Automation, pages 1649–1654. IEEE, 2010.

[66] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1487–1495, 2017.

[67] K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. van den Oord. Shaping Belief States with Generative Environment Models for RL. arXiv preprint arXiv:1906.09237, 2019.

[68] M. Gschwindt, E. Camci, R. Bonatti, W. Wang, E. Kayacan, and S. Scherer. Can a robot become a movie director? learning artistic principles for aerial cinematography. arXiv preprint arXiv:1904.02579, 2019.

[69] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

[70] W. Guerra, E. Tal, V. Murali, G. Ryou, and S. Karaman. Flightgoggles: Photorealis- tic sensor simulation for perception-driven robotics using photogrammetry and virtual reality. arXiv preprint arXiv:1905.11377, 2019.

[71] D. Guo, T. Kong, F. Sun, and H. Liu. Object discovery and grasp detection with a shared convolutional neural network. In IEEE International Conference on Robotics and Automation (ICRA), pages 2038–2043, 2016.

[72] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[73] Z. Guo, J. Huang, W. Ren, and C. Wang. A reinforcement learning approach for inverse kinematics of arm robot. Int. Conf. on Robotics & Automation (ICRA), 2019.

[74] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.

[75] D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[76] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy max- imum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[77] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. arXiv preprint arXiv:1912.01603, pages 1–20, 2019.

[78] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learn- ing latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019.

[79] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100– 107, 1968.

[80] R. L. Haupt and S. Ellen Haupt. Practical genetic algorithms. 2004.

[81] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE international conference on computer vision (ICCV), pages 2961–2969, 2017.

[82] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[83] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun. Pvn3d: A deep point-wise 3d key- points voting network for 6dof pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[84] M. Hehn and R. D’Andrea. Quadrocopter trajectory generation and control. IFAC pro- ceedings Volumes, 44(1):1485–1491, 2011.


[85] L. Hermann, M. Argus, A. Eitel, A. Amiranashvili, W. Burgard, and T. Brox. Adaptive curriculum generation from demonstrations for sim-to-real visuomotor control. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.

[86] A. Hertz, R. Hanocka, R. Giryes, and D. Cohen-Or. PointGMM: a neural GMM network for point clouds. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12054–12063, 2020.

[87] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.

[88] A. Hietanen, J. Latokartano, A. Foi, R. Pieters, V. Kyrki, M. Lanz, and J.-K. Kämäräinen. Benchmarking 6d object pose estimation for robotics. arXiv preprint arXiv:1906.02783, 2019.

[89] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.

[90] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pages 548–562. Springer, 2012.

[91] Y.-T. Hu, J.-B. Huang, and A. Schwing. MaskRNN: Instance level video object segmentation. In Advances in neural information processing systems, pages 325–334, 2017.

[92] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[93] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[94] S. Jung, S. Cho, D. Lee, H. Lee, and D. H. Shim. A direct -based framework for the 2016 iros autonomous drone racing challenge. Journal of Field Robotics, 35(1):146–166, 2018.

[95] S. Jung, S. Hwang, H. Shin, and D. H. Shim. Perception, guidance, and navigation for indoor autonomous drone racing using deep learning. IEEE Robotics and Automation Letters, 3(3):2539–2544, 2018.

[96] L. P. Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. Proceedings of the tenth international conference on machine learning, 951:167–173, 1993.

[97] L. P. Kaelbling. Learning to achieve goals. Proc. of the 13th Int. Joint Conference on Artificial Intelligence, pages 1094–1098, 1993.

[98] E. Kaiser, J. N. Kutz, and S. L. Brunton. Data-driven discovery of Koopman eigenfunctions for control. arXiv preprint arXiv:1707.01146, April-2017.

[99] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski. Model-Based Reinforcement Learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

[100] E. Kaufmann, M. Gehrig, P. Foehn, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza. Beauty and the beast: Optimal methods meet learning for drone racing. In 2019 International Conference on Robotics and Automation (ICRA), pages 690–696. IEEE, 2019.

[101] E. Kaufmann, A. Loquercio, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza. Deep drone acrobatics. arXiv preprint arXiv:2006.05768, 2020.

[102] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE transactions on Robotics and Automation, 12(4):566–580, 1996.

[103] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. In Autonomous robot vehicles, pages 396–404. Springer, 1986.

[104] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[105] K. Kleeberger, R. Bormann, W. Kraus, E. Rahtu, and M. F. Huber. A survey on learning-based robotic grasping. Current Robotics Reports, 2020.

[106] N. Koenig and A. Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154, Sep 2004.

[107] S. Kohlbrecher, J. Meyer, T. Graber, K. Petersen, U. Klingauf, and O. von Stryk. Hector open source modules for autonomous mapping and navigation with rescue robots. In Robot Soccer World Cup, pages 624–631. Springer, 2013.

[108] B. O. Koopman. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences, 17(5):315–318, 1931.

[109] M. Korda and I. Mezić. Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control. Automatica, 93:149–160, 2018.

[110] M. Korda and I. Mezić. Optimal construction of Koopman eigenfunctions for prediction and control. IEEE Transactions on Automatic Control, pages 01–15, 2020.

[111] J. Kulhánek, E. Derner, T. de Bruin, and R. Babuška. Vision-based navigation using deep reinforcement learning. In 2019 European Conference on Mobile Robots (ECMR), pages 1–8. IEEE, 2019.

[112] J. Kulhánek, E. Derner, and R. Babuška. Visual navigation in real-world indoor environments using end-to-end deep reinforcement learning, 2020.

[113] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

[114] B. Landry, R. Deits, P. R. Florence, and R. Tedrake. Aggressive quadrotor flight through cluttered environments using mixed integer programming. In 2016 IEEE international conference on robotics and automation (ICRA), pages 1469–1475. IEEE, 2016.

[115] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998.

[116] S. M. LaValle. Planning algorithms. Cambridge university press, 2006.

[117] J. Lee. A survey of robot learning from demonstrations for human-robot collaboration. arXiv preprint arXiv:1710.08789, 2017.

[118] M. A. Lee, Y. Zhu, P. Zachares, M. Tan, K. Srinivasan, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks. IEEE Transactions on Robotics, 36(3):582–596, 2020.

[119] D. Leidner, A. Dietrich, F. Schmidt, C. Borst, and A. Albu-Schäffer. Object-centered hybrid reasoning for whole-body mobile manipulation. Int. Conf. on Robotics & Automation (ICRA), pages 1828–1835, 2014.

[120] B. Letham, B. Karrer, G. Ottoni, and E. Bakshy. Constrained bayesian optimization with noisy experiments. arXiv preprint arXiv:1706.07094, 2017.

[121] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.

[122] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.

[123] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, and A. Talwalkar. Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934, 2018.

[124] S. Li, M. M. Ozo, C. De Wagter, and G. C. de Croon. Autonomous drone race: A computationally efficient vision-based navigation and control strategy. Robotics and Autonomous Systems, 133:103621, 2020.

[125] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO, pages 222–229, 2004.

[126] Y. Li, H. He, J. Wu, D. Katabi, and A. Torralba. Learning compositional Koopman operators for model-based control. In International Conference on Learning Representations, 2020.

[127] H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang. PointNetGPD: Detecting grasp configurations from point sets. In IEEE International Conference on Robotics and Automation (ICRA), pages 3629–3635, 2019.

[128] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.


[129] V. Limère, H. V. Landeghem, M. Goetschalckx, E.-H. Aghezzaf, and L. F. McGinnis. Optimising part feeding in the automotive assembly industry: deciding between kitting and line stocking. International Journal of Production Research, 50(15):4046–4060, 2012.

[130] A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza. Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics, 36(1):1–14, 2019.

[131] B. Lusch, J. N. Kutz, and S. L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):1–10, 2018.

[132] J. Mahler and K. Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on robot learning, pages 515–524, 2018.

[133] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.

[134] J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In IEEE international conference on robotics and automation (ICRA), pages 1957–1964, 2016.

[135] A. L. Majdik, C. Till, and D. Scaramuzza. The Zurich urban micro aerial vehicle dataset. The International Journal of Robotics Research, 36(3):269–273, 2017.

[136] G. Mamakoukas, M. Castano, X. Tan, and T. Murphey. Local Koopman operators for data-driven control of robotic systems. In Proceedings of Robotics: Science and Systems (RSS), pages 01–09, 2019.

[137] S. Mannor, R. Y. Rubinstein, and Y. Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 512–519, 2003.

[138] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. 2017.

[139] O. Michel. Cyberbotics Ltd. Webots™: professional mobile robot simulation. International Journal of Advanced Robotic Systems, 1(1):5, 2004.

[140] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell, et al. Learning to navigate in cities without a map. In Advances in Neural Information Processing Systems, pages 2419–2430, 2018.

[141] C. Mitash, R. Shome, B. Wen, A. Boularias, and K. Bekris. Task-driven perception and manipulation for constrained placement of unknown objects. IEEE Robotics and Automation Letters, 5(4):5605–5612, 2020.

[142] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.

[143] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[144] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[145] T. M. Moerland, J. Broekens, and C. M. Jonker. Model-based reinforcement learning: A survey. arXiv preprint arXiv:2006.16712, pages 01–58, 2020.

[146] A. Mousavian, C. Eppner, and D. Fox. 6-dof graspnet: Variational grasp generation for object manipulation. In IEEE International Conference on Computer Vision, pages 2901–2910, 2019.

[147] E. Mueggler, M. Faessler, F. Fontana, and D. Scaramuzza. Aerial-guided navigation of a ground robot among movable obstacles. In 2014 IEEE International Symposium on Safety, Security, and Rescue Robotics (2014), pages 1–8. IEEE, 2014.

[148] M. Müller, V. Casser, N. Smith, D. L. Michels, and B. Ghanem. Teaching UAVs to race: End-to-end regression of agile controls in simulation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[149] M. Neunert, C. De Crousaz, F. Furrer, M. Kamel, F. Farshidian, R. Siegwart, and J. Buchli. Fast nonlinear model predictive control for unified trajectory optimization and tracking. In 2016 IEEE international conference on robotics and automation (ICRA), pages 1398–1404. IEEE, 2016.

[150] J. Oh, S. Singh, and H. Lee. Value prediction network. Advances in Neural Information Processing Systems, 30:6118–6128, 2017.

[151] M. Okada and T. Taniguchi. Dreaming: Model-based Reinforcement Learning by Latent Imagination without Reconstruction. arXiv preprint arXiv:2007.14535, pages 1–12, 2020.

[152] H. Oleynikova, M. Burri, Z. Taylor, J. Nieto, R. Siegwart, and E. Galceran. Continuous-time trajectory optimization for online UAV replanning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5332–5339. IEEE, 2016.

[153] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, and J. P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 5962–5969. IEEE, 2015.

[154] F. Paus, P. Kaiser, N. Vahrenkamp, and T. Asfour. A combined approach for robot placement and coverage path planning for mobile manipulation. Int. Conf. on Intelligent Robots and Systems, 2017.

[155] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao. PVNet: Pixel-wise voting network for 6dof pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4561–4570, 2019.

[156] J. Peterson, H. Chaudhry, K. Abdelatty, J. Bird, and K. Kochersberger. Online aerial terrain mapping for ground robot navigation. Sensors, 18(2):630, 2018.

[157] Q.-H. Pham, M. A. Uy, B.-S. Hua, D. T. Nguyen, G. Roig, and S.-K. Yeung. LCD: Learned cross-domain descriptors for 2D-3D matching. In AAAI, pages 11856–11864, 2020.

[158] A. Plaat, W. Kosters, and M. Preuss. Model-Based Deep Reinforcement Learning for High-Dimensional Problems, a Survey. arXiv preprint arXiv:2008.05598, pages 1–21, 2020.

[159] F. T. Pokorny and D. Kragic. Classical grasp quality evaluation: New algorithms and theory. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3493–3500, 2013.

[160] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3D classification and segmentation. In IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.

[161] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.

[162] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pages 5331–5340, 2019.

[163] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322, 2015.

[164] P. Regier, L. Gesing, and M. Bennewitz. Deep reinforcement learning for navigation in cluttered environments, 2020.

[165] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humenberger. R2D2: repeatable and reliable detector and descriptor. In NeurIPS, 2019.

[166] L. O. Rojas-Perez and J. Martinez-Carranza. DeepPilot: A CNN for autonomous drone racing. Sensors, 20(16):4524, 2020.

[167] A. Sahbani, S. El-Khoury, and P. Bidaud. An overview of 3d object grasp synthesis algorithms. Robotics and Autonomous Systems, 60(3):326–336, 2012.

[168] C. Sampedro, H. Bavle, A. Rodriguez-Ramos, P. de la Puente, and P. Campoy. Laser-based reactive navigation for multirotor aerial robots using deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1024–1031. IEEE, 2018.

[169] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. Proceedings of Machine Learning Research, 37:1312–1320, 2015.


[170] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. arXiv preprint arXiv:1911.08265, pages 1–21, 2019.

[171] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.

[172] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[173] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak. Planning to explore via self-supervised world models, 2020.

[174] A. Seth, J. L. Hicks, T. K. Uchida, A. Habib, C. L. Dembia, J. J. Dunne, C. F. Ong, M. S. DeMers, A. Rajagopal, M. Millard, et al. OpenSim: Simulating musculoskeletal dynamics and neuromuscular control to study human and animal movement. PLoS computational biology, 14(7):e1006223, 2018.

[175] P. Shah, M. Fiser, A. Faust, J. C. Kew, and D. Hakkani-Tur. FollowNet: Robot navigation by following natural language directions with deep reinforcement learning. arXiv preprint arXiv:1805.06150, 2018.

[176] S. Shah, D. Dey, C. Lovett, and A. Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pages 621–635. Springer, 2018.

[177] D. H. Shim, H. Chung, and S. S. Sastry. Conflict-free navigation in unknown urban environments. IEEE robotics & automation magazine, 13(3):27–33, 2006.

[178] A. Singla, S. Padakandla, and S. Bhatnagar. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge. IEEE Transactions on Intelligent Transportation Systems, pages 1–12, 2019.

[179] C. Song, J. Song, and Q. Huang. Hybridpose: 6D object pose estimation under hybrid representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 431–440, 2020.

[180] Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza. Flightmare: A flexible quadrotor simulator. arXiv preprint arXiv:2009.00563, 2020.

[181] S. S. Srinivasa, D. Ferguson, C. J. Helfrich, D. Berenson, A. Collet, R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and M. V. Weghe. Herb: a home exploring robotic butler. Autonomous Robots, 28(1):5, 2010.

[182] A. Stentz. Optimal and efficient path planning for partially known environments. In Intelligent unmanned ground vehicles, pages 203–220. Springer, 1997.

[183] J. Stückler, M. Schwarz, and S. Behnke. Mobile manipulation, tool use, and intuitive interaction for cognitive Cosero. Frontiers in Robotics and AI, 3:58, 2016.


[184] I. A. Şucan, M. Moll, and L. E. Kavraki. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012.

[185] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar. Robust stereo visual inertial odometry for fast autonomous flight. IEEE Robotics and Automation Letters, 3(2):965–972, 2018.

[186] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.

[187] O. Takahashi and R. J. Schilling. Motion planning in a plane using generalized Voronoi diagrams. IEEE Transactions on robotics and automation, 5(2):143–150, 1989.

[188] N. Takeishi, Y. Kawahara, and T. Yairi. Learning Koopman invariant subspaces for dynamic mode decomposition. Advances in Neural Information Processing Systems, 30:1130–1140, 2017.

[189] M. Tan, R. Pang, and Q. V. Le. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.

[190] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13-14):1455–1473, 2017.

[191] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

[192] N. Vahrenkamp, T. Asfour, and R. Dillmann. Robot placement based on reachability inversion. Int. Conf. on Robotics & Automation (ICRA), 2013.

[193] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

[194] K. Van Wyk, M. Culleton, J. Falco, and K. Kelly. Comparative peg-in-hole testing of a force-based manipulation controlled robotic hand. IEEE Transactions on Robotics, 34(2):542–549, 2018.

[195] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

[196] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015.

[197] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese. DenseFusion: 6d object pose estimation by iterative dense fusion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3343–3352, 2019.


[198] C. Wang, Q. Zhang, Q. Tian, S. Li, X. Wang, D. Lane, and Y. Petillot. Learning mobile manipulation through deep reinforcement learning. Sensors, 20:939, 02 2020.

[199] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2642–2651, 2019.

[200] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas. Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pages 1995–2003, 2016.

[201] J. Watson, A. Miller, and N. Correll. Autonomous industrial assembly using force, torque, and rgb-d sensing. Advanced Robotics, 34(7-8):546–559, 2020.

[202] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.

[203] T. Welschehold, C. Dornhege, and W. Burgard. Learning mobile manipulation actions from human demonstrations. Int. Conf. on Intelligent Robots and Systems, 2017.

[204] T. Welschehold, C. Dornhege, F. Paus, T. Asfour, and W. Burgard. Coupling mobile base and end-effector motion in task space. Int. Conf. on Intelligent Robots and Systems, 2018.

[205] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley. A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science, 25(6):1307–1346, 2015.

[206] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[207] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.

[208] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian. Bayesian relational memory for semantic visual navigation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2769–2779, 2019.

[209] M. Wydmuch, M. Kempka, and W. Jaśkowski. ViZDoom competitions: Playing Doom from pixels. IEEE Transactions on Games, 2018.

[210] F. Xia, C. Li, R. Martín-Martín, O. Litany, A. Toshev, and S. Savarese. ReLMoGen: Leveraging motion generation in reinforcement learning for mobile manipulation. arXiv preprint arXiv:2008.07792, 2020.

[211] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.


[212] L. Xie, S. Wang, A. Markham, and N. Trigoni. Towards monocular vision based obstacle avoidance through deep reinforcement learning. arXiv preprint arXiv:1706.09829, 2017.

[213] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi. Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543, 2018.

[214] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang. Active object perceiver: Recognition-guided policy learning for object searching on mobile robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6857–6863. IEEE, 2018.

[215] A. Yershova, S. Jain, S. M. Lavalle, and J. C. Mitchell. Generating uniform incremental grids on SO(3) using the hopf fibration. The International journal of robotics research, 29(7):801–812, 2010.

[216] E. Yeung, S. Kundu, and N. Hodas. Learning deep neural network representations for Koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), pages 4832–4839. IEEE, 2019.

[217] B. S. Zapata-Impata, P. Gil, J. Pomares, and F. Torres. Fast geometry-based computation of grasping points on three-dimensional point clouds. International Journal of Advanced Robotic Systems, 16(1), 2019.

[218] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from RGB-D reconstructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1802–1811, 2017.

[219] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine. Learning Invariant Representations for Reinforcement Learning without Reconstruction. arXiv preprint arXiv:2006.10742v1, pages 1–16, 2020.

[220] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, pages 7444–7453. PMLR, 2019.

[221] Y. Zhang, S. Wang, and G. Ji. A comprehensive survey on particle swarm optimization algorithm and its applications. Mathematical Problems in Engineering, 2015, 2015.

[222] B. Zhao, H. Zhang, X. Lan, H. Wang, Z. Tian, and N. Zheng. REGNet: Region-based grasp network for single-shot grasp detection in point clouds. arXiv preprint arXiv:2002.12647, 2020.

[223] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.

[224] Y. Zhao, Z. Zheng, and Y. Liu. Survey on computational-intelligence-based UAV path planning. Knowledge-Based Systems, 158:54–64, 2018.

[225] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.


A

Visual Navigation in Real-World Indoor Environments Using End-to-End Deep Reinforcement Learning

Jonáš Kulhánek¹∗, Erik Derner², and Robert Babuška³

Abstract—Visual navigation is essential for many applications eventual agent’s policy is computationally cheap and can be in robotics, from manipulation, through mobile robotics to executed in real time (sampling times ~50 ms), even on light- automated driving. Deep reinforcement learning (DRL) provides weight embedded hardware, such as NVIDIA Jetson. These an elegant map-free approach integrating image processing, localization, and planning in one module, which can be trained methods also do not require any expensive cameras. and therefore optimized for a given environment. However, to The successes of deep reinforcement learning (DRL) on date, DRL-based visual navigation was validated exclusively in game domains [1]–[3] inspired the use of DRL in visual simulation, where the simulator provides information that is navigation. As current DRL methods require many training not available in the real world, e.g., the robot’s position or samples, it is impossible to train the agent directly in a real- image segmentation masks. This precludes the use of the learned policy on a real robot. Therefore, we propose a novel approach world environment. Instead, a simulator provides the training that enables a direct deployment of the trained policy on real data [4]–[14], and domain randomization [15] is used to cope robots. We have designed visual auxiliary tasks, a tailored reward with the simulator-reality gap. However, there are several scheme, and a new powerful simulator to facilitate domain unsolved problems associated with the above simulator-based randomization. The policy is fine-tuned on images collected from methods: real-world environments. We have evaluated the method on a mobile robot in a real office environment. The training took The simulator often provides the agent with features that • ~30 hours on a single GPU. In 30 navigation experiments, the are not available in the real world: the segmentation robot reached a 0.3-meter neighborhood of the goal in more than masks [4], [6], [10], distance to the goal, stopping signal 86.7 % of cases. This result makes the proposed method directly applicable to tasks like mobile manipulation. [4]–[7], [11], [12], [14], etc., either as one of the agent’s Index Terms—Vision-based navigation, reinforcement learning, inputs [4], [10] or in the form of an auxiliary task [6]. deep learning methods. While this improves the learning performance, it pre- cludes a straightforward transfer from simulation-based training to real-world deployment. For auxiliary tasks us- I.INTRODUCTION ing segmentation masks during training [6], another deep Vision-based navigation is essential for a broad range of neural network could be used to annotate the inputs [16], robotic applications, from industrial and service robotics to [17]. However, this would introduce additional overhead automated driving. The wide-spread use of this technique will and noise to the process and diminish the performance be further stimulated by the availability of low-cost cameras gain. and high-performance computing hardware. Another major problem with the current approaches [4]– • Conventional vision-based navigation methods usually build [7], [11], [12], [14] is that during the evaluation, the a map of the environment and then use planning to reach agent uses yet other forms of input provided by the the goal. They often rely on precise, high-quality stereo simulator. 
In particular, it relies on the simulator to cameras and additional sensors, such as laser rangefinders, and terminate the navigation as soon as the agent gets close generally are computationally demanding. As an alternative, to the goal. Without this signal from the simulator, the end-to-end deep-learning systems can be used that do not em- agent never stops, and after reaching its goal, it continues ploy any map. They integrate image processing, localization, exploring the environment. If we provide the agent with and planning in one module or agent, which can be trained the termination signal during training, in some cases, and therefore optimized for a given environment. While the the agent learns an efficient random policy, ignores the training is computationally demanding, the execution of the navigation goal, and tries only to explore the environment efficiently. 1Jonáš Kulhánek is with the Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague, 16636 Prague, Czech To address these issues, we propose a novel method for Republic [email protected] DRL visual navigation in the real world, with the following 2Erik Derner is with the Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague, 16636 Prague, Czech Re- contributions: public and with the Department of Control Engineering, Faculty of Electrical 1) We designed a set of auxiliary tasks that do not require Engineering, Czech Technical University in Prague, 16627 Prague, Czech Republic [email protected] any input that is not readily available to the robot in 3Robert Babuška is with , Faculty of 3mE, Delft the real world. Therefore, we can use the auxiliary tasks University of Technology, 2628 CD Delft, The Netherlands and with the Czech both during the pre-training on the simulator and during Institute of Informatics, Robotics, and Cybernetics, Czech Technical Univer- sity in Prague, 16636 Prague, Czech Republic [email protected] the fine-tuning on previously unseen images collected ∗ Corresponding author from the real-world environment.

OpenDR No. 871449 2) Similarly to [8], [9], we use an improved training In [11], a mobile robot was evaluated in an environment procedure that forces the agent to automatically detect discretized into a grid with ~27 states, which is an order whether it reached the goal. This enables the direct of magnitude lower than our method. Moreover, they did use of the trained policy in the real world where the not use the visual features directly but used ResNet [20] information from the simulator is not available. features instead. It makes the problem easier to learn but 3) To make our agent more robust and to improve the train- requires the final environment to have a lot of visual features ing performance, we have designed a fast and realistic recognizable by the trained ResNet. In our case, we do not environment simulator based on Quake III Arena [18] restrict ourselves to such a requirement, and our agent is and DeepMind Lab [19]. It allows for pre-training the much more general. Generalization across environments is models quickly, and thanks to the high degree of vari- discussed in [12]. The authors trained the agent in domain- ation in the simulated environments, it helps to bridge randomized maze-like environments and experimented with the reality gap. Furthermore, we propose a procedure to a robot in a small maze. Their agent, however, does not fine-tune the pre-trained model on a dataset consisting of include any stopping action, and the evaluator is responsible images collected from the real-world environment. This for terminating the navigation. Our evaluation is performed in enables us to use a lot of synthetic data collected from a realistic real-world environment, and we do not require any the simulator and still train on data visually similar to external signal to stop the agent. the target real-world environment. In our work, we focus on end-to-end deep reinforcement 4) To demonstrate the viability of our approach, we report learning. We believe that it has the potential to overcome the a set of experiments in a real-world office environment, limitations of and have superior performance to other methods, where the agent was trained to find its goal given by an which use deep learning or DRL as a part of their pipeline, e.g., image. [23], [24]. We also do not consider pure obstacle avoidance methods such as [25]–[27], or methods relying on other types II.RELATED WORK of input apart from the camera images, e.g., [28], [29]. The application of DRL to the visual navigation domain has III.METHOD attracted a lot of attention recently [4]–[14]. This is possible thanks to the deep learning successes in the computer vision We modify and extend our method [6] to adapt it to real- [17], [20] and gaming domains [2], [3], [21]. However, most world environments. We design a powerful environment simu- of the research was done in simulated environments [4]–[10], lator that uses synthetic scenes to pre-train the agent. The agent [13], [14], where one can sample an almost arbitrary number is then fine-tuned on real-world images. The implementation 1 of trajectories. The early methods used a single fixed goal as is publicly available on GitHub , including the source code of 2 the navigation target [7]. More recent approaches are able to the simulator . separate the goal from the navigation task and pass the goal as A. 
Network architecture & training the input either in the form of an embedding [4], [9], [14] or We adopt the neural network architecture from our prior an input image [5], [6], [11], [12]. Visual navigation was also work [6], which uses a single deep neural network as the combined with natural language in the form of instructions for learning agent. It outputs a probability distribution over a the agent [10]. discrete set of allowed actions. The input to the agent is the The use of auxiliary tasks to stabilize the training was visual output of the camera mounted on the robot, and an proposed in [7] and later applied to visual navigation [6]. In image of the goal, i.e., the image taken from the agent’s target our work, we use a similar set of auxiliary tasks, but instead of pose. The previous action and reward are also used as an input using segmentation masks to build the internal representation to the agent. of the observation, we use the camera images directly to be The network is trained using the Parallel Advantage Actor- able to apply our method to the real world without labeling Critic (PAAC) algorithm [30] with off-policy critic updates. the real-world images. This is built on ideas from model-based The network uses a stack of convolutional layers at the bottom, DRL [22]. Long Short-Term Memory (LSTM) [31] in the middle, and Unfortunately, most of the current approaches [4]–[14] were actor and critic heads. To improve the training performance, validated only in simulated environments, where the simulator the following auxiliary tasks were used in [6]: pixel control, provides the agent with signals that are normally not available reward prediction, reconstruction of the depth image, and of in the real-world setting, such as the distance to the goal or the observation and target image segmentation masks. Each segmentation masks. This simplifies the problem, but at the auxiliary task has its own head, and the total loss is computed same time, precludes the use of the learnt policy on a real as a weighted sum of the individual losses. For further details, robot. please refer to [6], [7]. There are some methods [11], [12], [14] which attempt to To be able to train the agent on real-world images, we apply end-to-end DRL to real-world domains. In [14], the modified the set of auxiliary tasks. We no longer use the authors trained their agent to navigate in Google Street View, which is much more photorealistic than other simulators. They, 1https://github.com/jkulhanek/robot-visual-navigation however, did not evaluate their agent on a real-world robot. 2https://github.com/jkulhanek/dmlab-vn

segmentation masks as they cannot be obtained from the environment without manual labeling. We have therefore replaced the two segmentation auxiliary tasks with two new auxiliary tasks: one to reconstruct the observation image and the other one to reconstruct the target image, see Fig. 1. This guides the convolutional base part of the network to build useful features using unsupervised learning. We hypothesized that having the auxiliary tasks share the latent space with the actor and the critic will have a positive effect on the training performance. This hypothesis is verified empirically in Section IV.
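To make this architecture concrete, the following is a minimal PyTorch sketch of such an actor-critic network with reconstruction auxiliary heads, assuming the layer sizes reported in Section IV-A (84×84 inputs; convolutional layers with 16, 32, and 32 features; a 512-unit fully-connected layer and LSTM) and stacking the observation and target images along the channel axis. The class name NavNet, the decoder shapes, and the way the previous action and reward are injected are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NavNet(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        # Shared convolutional base; observation and target images are stacked
        # along the channel axis (a simplification of the original two-stream input).
        self.conv = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 6 * 6, 512)        # flattened conv output for 84x84 inputs
        # The LSTM also receives the previous action (one-hot) and previous reward.
        self.lstm = nn.LSTMCell(512 + num_actions + 1, 512)
        self.actor = nn.Linear(512, num_actions)     # policy logits (including `terminate`)
        self.critic = nn.Linear(512, 1)              # state-value head
        # Auxiliary decoders: reconstruct (downsized) observation and target images.
        def decoder():
            return nn.Sequential(
                nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2),
            )
        self.rec_obs, self.rec_target = decoder(), decoder()

    def forward(self, obs, target, prev_action, prev_reward, hx, cx):
        # obs, target: (B, 3, 84, 84); prev_action: (B, num_actions); prev_reward: (B, 1)
        x = self.conv(torch.cat([obs, target], dim=1))           # (B, 32, 6, 6)
        feat = F.relu(self.fc(x.flatten(1)))                     # (B, 512)
        hx, cx = self.lstm(torch.cat([feat, prev_action, prev_reward], dim=1), (hx, cx))
        return {
            "policy_logits": self.actor(hx),
            "value": self.critic(hx).squeeze(-1),
            "rec_obs": self.rec_obs(x),          # targets of the reconstruction auxiliary tasks
            "rec_target": self.rec_target(x),
            "state": (hx, cx),
        }

During training, the PAAC actor-critic loss would be combined with the reconstruction, depth-map, pixel-control, and reward-prediction losses using the per-task weights listed in Table I.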

Fig. 1. Neural network overview. It is similar to [6] with the difference of the labels for VN auxiliary tasks being the raw camera images instead of the segmentation masks, which are not readily available in the real-world environment. (Diagram blocks: input image and target image feeding the shared conv-base; LSTM with the last action and last reward as additional inputs; A2C actor and critic heads; UNREAL pixel-control and reward-prediction heads; depth-map and observation/target reconstruction auxiliary heads.)

Similarly to [9], we have also modified the action set to include a new action called terminate. This action stops the navigation episode and enables the trained agent to navigate in a real-world environment using only the camera images, without additional sensors to provide its actual pose in the environment. During training, when the episode terminates with the terminate action, the agent receives either a positive or a negative reward based on its distance to the goal. The episode can also terminate (with a negative reward) after a predefined maximum number of steps.

B. Environment simulator

Since DRL requires many training samples, we first pre-train the agent in a simulated environment and then fine-tune on images collected from the real world. We have designed a novel, fast, and realistic 3D environment simulator that dynamically generates scenes of office rooms. It is based on DeepMind Lab [19], and it uses the Quake III Arena [18] rendering engine. We have extended the simulator with office-like models and textures. The room is generated by placing random objects along the walls, e.g., bookshelves, chairs, or cabinets. A target object is selected from the set of objects placed in the room, and a target image is taken from the proximity of the object. Figure 2 shows four random examples of images from the environment.

Fig. 2. Images taken from our environment simulator.

For each room layout, a new map is generated, which is then compiled and optimized using tools designed for Quake. To speed up the training, we keep the same room layout for 50 episodes before reshuffling the objects.

C. Fine-tuning on real-world data

Since the simulated environment does not precisely match the real-world environment, we fine-tune the agent on a set of real-world images. Throughout the rest of this paper, we restrict the motion of the robot to a rectangular grid, both for the image collection and for the final evaluation experiments.

To collect the training dataset, we placed the mobile robot equipped with an RGB-D camera in the real-world environment. We programmed the robot to automatically collect a set of images from each point of a rectangular grid, in all four cardinal directions. Odometry was used to derive and store the robot’s pose (position and orientation) from which the images were taken. In this way, we obtained a dataset consisting of tuples (x, y, φ, i, camera image, depth map), where x, y and φ are the pose coordinates, and i is the index of the observation image, as several images were taken from each pose.

The pre-trained agent was then fine-tuned on the collected dataset. It was initialized at a random pose, and its current policy was used to move to the subsequent pose on the grid. For each pose, a random image was sampled from the dataset, and this process repeated until the episode terminated. In this way, the agent was trained in an on-policy fashion on rollouts generated from previously collected data.

D. Real-world deployment

After the training, the agent uses its trained policy and requires only the camera observation and the target image – no depth image or localization is necessary. To deploy our method to a new environment, we propose the following procedure. First, the agent is pre-trained in a simulated environment using our simulator. RGB-D images are then semi-automatically collected from the real-world environment, and the agent is fine-tuned on this dataset. Finally, the agent can be placed in a real-world environment with an RGB camera as its only sensor.
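To make the episode logic concrete, the sketch below shows how the terminate action, the grid-restricted motion, and the replay of collected images could be wired into a gym-style environment for the fine-tuning stage. It assumes the 0.3 m / 30° success tolerance and the reward values quoted in Section IV (the -0.1 penalty for an incorrect terminate is taken from the simulated setting, the -0.01 collision penalty from the real-world dataset setting), and all names (GridNavEnv, the action indices, the dataset layout) are illustrative rather than taken from the released code.

import math
import random

TERMINATE = 0          # index of the added `terminate` action; 1 = forward, 2 = turn left, 3 = turn right
CELL = 0.2             # grid resolution in metres

class GridNavEnv:
    """Replays images collected on a rectangular grid; poses are (ix, iy, heading_deg)."""

    def __init__(self, dataset, max_steps=100):
        # dataset: dict mapping a pose to the list of camera images recorded at that pose
        self.dataset = dataset
        self.max_steps = max_steps

    def reset(self, start_pose, goal_pose):
        self.pose, self.goal, self.t = start_pose, goal_pose, 0
        return self._observe()

    def _observe(self):
        # Several images were stored per pose; sample one at random on every visit.
        return random.choice(self.dataset[self.pose])

    def step(self, action):
        self.t += 1
        if action == TERMINATE:
            (x, y, phi), (gx, gy, gphi) = self.pose, self.goal
            ok = (CELL * math.hypot(x - gx, y - gy) <= 0.3
                  and abs((phi - gphi + 180) % 360 - 180) <= 30)
            # Positive reward only when the agent stops within the tolerance.
            return self._observe(), (1.0 if ok else -0.1), True, {}
        nxt = self._move(self.pose, action)
        if nxt not in self.dataset:            # stepping off the grid counts as a collision
            return self._observe(), -0.01, self.t >= self.max_steps, {}
        self.pose = nxt
        # The paper also applies a negative reward when the step limit is hit; omitted here.
        return self._observe(), 0.0, self.t >= self.max_steps, {}

    def _move(self, pose, action):
        x, y, phi = pose
        if action == 1:                        # move forward by one grid cell
            dx, dy = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}[phi]
            return (x + dx, y + dy, phi)
        turn = 90 if action == 2 else -90      # rotate in place by 90 degrees
        return (x, y, (phi + turn) % 360)

Pre-training in the simulator would use the same action interface, with the simulator rendering the observations instead of replaying the recorded dataset.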

IV. EXPERIMENTS

First, we compared several algorithms on our 3D simulated environment. Then, we fine-tuned the pre-trained model on the real-world images, and finally, we evaluated the method on the real robot.

A. Implementation details

The network architecture described in Section III was implemented as follows. The input images to the network were downsized to 84×84 pixels. The shared convolutional part of the deep neural network had four convolutional layers. The first convolutional layer had 16 features, kernel size 8×8, and stride 4. The second convolutional layer had 32 features, kernel size 4×4, and stride 2. The third layer had 32 features with kernel size 4×4 and stride 1. The first fully-connected layer had 512 features and LSTM had also 512 output features and a single layer. The deconvolutional networks used in auxiliary tasks had two layers with kernel sizes 4×4 and strides 2, first having 16 output features. For image auxiliary tasks, the first layer was shared. The pixel control task used a similar architecture, but it had a single fully-connected layer with 2592 output features at the bottom, and it was dueled, as described in [7], [32]. We used the discount factor γ = 0.99 for simulated experiments and γ = 0.9 for the real-world dataset. The training parameters are summarized in Table I. The learning rate depends on f – the global step. Actor weight, critic weight, etc., correspond to weights of each term in the total loss [6].

TABLE I
METHOD PARAMETERS.

name                                               value
discount factor (γ)                                0.99 simulated / 0.9 RW dataset
maximum rollout length                             20 steps
number of environment instances                    16
replay buffer size                                 2 000 samples
optimizer                                          RMSprop
RMSprop alpha                                      0.99
RMSprop epsilon                                    10^-5
learning rate                                      7 × 10^-4 · (1 − f / (4 × 10^7))
max. gradient norm                                 0.5
entropy gradient weight                            0.001
actor weight                                       1.0
critic weight                                      0.5
off-policy critic weight                           1.0
pixel control weight                               0.05
reward prediction weight                           1.0
depth-map prediction weight                        0.1
observation image segmentation prediction weight   0.1
target segmentation prediction weight              0.1
pixel control discount factor                      0.9
pixel control downsize factor                      4
auxiliary VN downsize factor                       4

B. Experiment configuration

For both the simulated environment and the real-world dataset, we have evaluated the performance of the proposed algorithm and compared it to the performance of relevant baseline algorithms. The results were computed from 1 000 runs where the starting position and the goal were sampled randomly. We present the mean cumulative reward and the mean success rate – the percentage of cases when the agent reached its goal.

C. Simulated environment

In the simulated environment, we trained the algorithm for 8 × 10^6 training frames. We gave the agent a reward of 1 if the agent reached the goal and stopped using the terminate action. We gave it a reward of -0.1 if it used the terminate action incorrectly close to (and looking at) an object of a different type than its goal, e.g., a bookcase while the goal was a chair. Otherwise, the reward was 0. In the case of the simulated environment, we did not stop the agent when it used the terminate action incorrectly in order to improve the training performance.

The evaluation was done in 100 randomly generated environments. A total of 1 000 simulations were evaluated with initial positions and goals randomly sampled. The mean success rate, the mean traveled distance, the mean number of steps taken, and its standard deviations are given in Table II. The training performance can be seen in Figure 3. The mean number of steps was averaged over the successful episodes only, i.e., those episodes which result in the agent signaling the goal correctly. The algorithm was compared to PAAC [30], which has comparable performance to A3C [21]. We compared the proposed method also with the UNREAL method [7]. Both alternative methods were adapted to incorporate the goal input by concatenating image channels of the target and the observation images and were trained on 8 × 10^6 training frames, same as our method.

Fig. 3. The plot shows the average episode length during training in the simulated environment. The PAAC, UNREAL, and our algorithm are compared. Note that we plot the average episode length, which may not correlate with the success rate – the probability of signaling the goal correctly. In the case of UNREAL and our algorithm, however, this metric better shows the asymptotic behavior of the two algorithms since both converged to the success rate of one. (Axes: average episode length in steps, log scale, versus training frames up to 8 × 10^6; curves: PAAC, UNREAL, ours.)

D. Real-world dataset experiment

In an office room, we used the TurtleBot 2 robot (Fig. 4) to collect a dataset of images taken at grid points with a

TABLE II
METHOD PERFORMANCE ON SIMULATED ENVIRONMENTS.

algorithm    success rate    distance traveled (m)    simulation steps
ours         1.000           7.691 ± 3.415            41.552 ± 25.717
PAAC         0.420           66.913 ± 30.329          398.214 ± 276.642
UNREAL       0.999           8.322 ± 6.187            45.541 ± 50.349

Fig. 6. The plot shows the average return during training on the simulated environment. (Axes: return from −1.5 to 1.0 versus training frames up to 1.6 × 10^7; curves: PAAC, UNREAL, ours.)

0.2 m resolution. Examples of these images can be seen in Fig. 5. When we collected the dataset, we estimated the robot pose through odometry. The odometry was also used for the evaluation of our trained agent in the experimental environment to assess whether the agent fulfilled its task and stopped close to the goal. The target was sampled from a set of robot positions near Fig. 4. Mobile robot TurtleBot 2 in the experimental environment. a wall or near an object and facing the wall or the object. The initial position was uniformly sampled from the set of non- target states, and the initial orientation was chosen randomly. The maximal length from the initial state to the goal started at approximately three actions, and between the frames 0.5 106 × and 5 106, it was increased to the maximal length, using × curriculum learning [6], [14]. The final position of the robot was considered correct when its Euclidean distance from the target position was at most 0.3 m, and the difference between the robot orientation and the target orientation was at most 30◦. The tolerance was set to compensate for the odometry inaccuracy. For the purpose of training, we had chosen the reward to be equal to 1 when the agent reached and stopped at its target. We penalized the agent with a reward of 0.01 if it tried to move in a direction − that would cause a collision. Otherwise, the reward was zero. The algorithm was trained on 3 107 frames. The training × Fig. 5. Training images from the real-world dataset. performance can be seen in Figure 6. Our method was com- pared to the PAAC algorithm [30] and the UNREAL algorithm [7], modified in the same way as described in Section IV-C. TABLE III All models were pre-trained in the simulated environment. METHODPERFORMANCEONREAL-WORLD DATASET. We evaluated the algorithms on 1 000 simulations in total with initial positions and goals randomly sampled. The mean algorithm success rate goal distance (m) steps on grid success rate, the mean distance from the goal (goal distance) . ours 0 936 0.145±0.130 13.489±6.286 when the goal was signaled, and the mean number of steps are 0.157 0.209 14.323 10.141 PAAC 0.922 ± ± UNREAL 0.863 0.174 0.173 14.593 9.023 given in Table III. We also show a comparison with the same ± ± models trained from scratch (labels beginning with np, which 0.187 0.258 15.880 7.022 np ours 0.883 ± ± 0.243 0.447 13.699 6.065 stands for no pre-training) in the same table. Same as before, np PAAC 0.860 ± ± 0.224 0.358 15.676 6.578 np UNREAL 0.832 ± ± the mean number of steps was averaged over the successful 1.467 1.109 147.956 88.501 episodes only. For comparison, we also report the shortest path random 0.205 ± ± 0.034 0.039 12.595 5.743 shortest path – ± ± and the performance of a random agent. It selects random movements, but when it reaches the target, the ground truth

96 information is used to signal the goal. C. Real-world dataset E. Real-world evaluation In the case of the real-world dataset (Table III), the goal was to learn a robust navigation policy in a noisy environment. The TABLE IV noisiness came from the fact that we did not have access to REAL-WORLDEXPERIMENTRESULTS. the precise robot positions and orientations when the dataset was collected, but only to their estimates based on odometry, evaluation success rate goal distance (m) steps on grid and the images were aligned to the grid points. Our method 0.175 0.101 15.153 6.455 mobile robot 0.867 ± ± achieved the best performance. It was followed by the raw real-world dataset 0.933 0.113 0.109 14.750 6.583 ± ± PAAC algorithm [30]. The UNREAL method [7] was the worst of these methods. We see that for all algorithms, the average Finally, to evaluate the trained network in the real-world number of steps to get to the target is very close to the optimal environment, we have randomly chosen 30 pairs of initial and number of steps. target states. The trained robot was placed in an initial pose, We did not require the agent to end up precisely in the and it was given a target image. Same as before, we show correct position, but only in its proximity – 30 cm from the the mean number of steps, the mean distance from the goal target position. The reward was always the same, no matter (goal distance), and the mean success rate, where the mean how far the agent was from the target. The noisiness in the number of steps was averaged over the successful episodes position labels in the dataset might cause the agent to stop only. The results are summarized in Table IV, where we also closer to the goal than it was necessary to meet a safety margin include the evaluation of the trained model on the real-world at the cost of more steps. image dataset. The results for this dataset slightly differ from A substantial performance improvement can be observed the results reported in Table III, as in this case, a smaller set when using pre-training on the simulated environment for of initial-goal state pairs was used. all three methods. Although the simulated environment used V. DISCUSSION continuous space, it looked visually different, and the control was different, it was still remarkably beneficial for the overall A. Alternative agent objective performance. This can be seen in Figure 3, where all methods In this work, we framed the navigation problem as the quickly reached a relatively high score. The decrease in ability of the agent to reach the goal and stop there. To this end, performance of the UNREAL method might be caused by the we introduced the terminate action and trained the agent to use ineffectiveness of its auxiliary tasks. As the environment was it. Alternatively, we could stick to the approach of previous not continuous, but the rotation actions turned the robot by visual navigation work [4], [6]–[8], [12] in which the agent 90 ◦, the observations at these rotations were non-overlapping. is stopped by the simulator. In the real-world environment, Therefore, the task of predicting the effect of each action be- a localization method [33], [34] would then have to be used came much more difficult, and instead of helping the algorithm to detect whether the agent reached the goal. However, this converge faster, it was adding some noise to the gradient. 
would introduce unnecessary computational overhead for the In our method, the auxiliary tasks might help the model robot, and it would obscure the evaluation, as a navigation learn a compact representation of each state quicker and failure could be caused by either of the two systems. make the algorithm converge faster. We believe the proposed B. Simulated environments visual navigation auxiliary tasks have a similar effect as using autoencoders in model-based DRL [22]. By learning the Training the agent on the simulated environments was sufficient features to reconstruct the observation, and by using difficult since the agent was supposed to generalize across these features as the input for the policy, the policy could use different room layouts. In this case, the agent’s goal was not a higher abstraction of the input while it learns an easier task to navigate to a given target by using the shortest path but than model-based methods and thus converges faster. to locate the target object in the environment first and then to navigate to it, possibly minimizing the total number of D. Real-world experiment steps. From some initial positions, the target could not be seen directly, and the agent had to move around to see the The results of the real-world experiments indicate that our target. Also, sometimes, there were many instances of the same algorithm is able to navigate in the real environment well. The object, complicating further the task. discrepancy in the performance between the simulation and Our proposed method achieved the best results of all com- the real-world environment can be caused by a generalization pared methods, see Tab. II, closely followed by the UNREAL error in transferring the learned policy as well as by the errors algorithm [7]. This clearly indicates the positive effect of using in odometry measurements, which were used for estimating auxiliary tasks for continuous spaces. The similarity between the robot position for the experiment evaluation. our method and the UNREAL algorithm can be that both were VI.CONCLUSION & FUTUREWORK close to the optimal solution and could not improve further. This is implied by the fact that both methods reached and In this paper, we proposed a deep reinforcement learn- signaled the goal successfully almost always. ing method for visual navigation in real-world settings. Our

97 method is based on the Parallel Advantage Actor-Critic algo- [6] J. Kulhánek, E. Derner, T. de Bruin, and R. Babuška, “Vision-based rithm boosted with auxiliary tasks and curriculum learning. It navigation using deep reinforcement learning,” in 2019 European Con- ference on Mobile Robots (ECMR). IEEE, 2019, pp. 1–8. was pre-trained in a simulator and fine-tuned on a dataset of [7] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Sil- images from the real-world environment. ver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised We reported a set of experiments with a TurtleBot in an auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016. [8] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian, “Bayesian office, where the agent was trained to find a goal given by relational memory for semantic visual navigation,” in Proceedings of the an image. In simulated scenarios, our agent outperformed IEEE International Conference on Computer Vision, 2019, pp. 2769– strong baseline methods. We showed the importance of using 2779. [9] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual se- auxiliary tasks and pre-training. The trained agent can easily mantic navigation using scene priors,” arXiv preprint arXiv:1810.06543, be transferred to the real world and achieves an impressive 2018. performance. In 86.7 % of cases, the trained agent was able [10] P. Shah, M. Fiser, A. Faust, J. C. Kew, and D. Hakkani-Tur, “FollowNet: Robot navigation by following natural language directions with deep to successfully navigate to its target and stop in the 0.3-meter reinforcement learning,” arXiv preprint arXiv:1805.06150, 2018. neighborhood of the target position facing the same way as [11] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, “Active object perceiver: the target image with a heading angle difference of at most Recognition-guided policy learning for object searching on mobile robots,” in 2018 IEEE/RSJ International Conference on Intelligent 30 degrees. The average distance to the goal in the case of Robots and Systems (IROS). IEEE, 2018, pp. 6857–6863. successful navigation was 0.175 meters. [12] A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi, The results show that DRL presents a promising alternative “Towards generalization in target-driven visual navigation by using deep reinforcement learning,” IEEE Transactions on Robotics, 2020. to conventional visual navigation methods. Auxiliary tasks [13] K. Chen, J. P. de Vicente, G. Sepulveda, F. Xia, A. Soto, M. Vázquez, provide a powerful way to combine a large number of simu- and S. Savarese, “A behavioral approach to visual navigation with graph lated samples with a comparatively small number of real-world localization networks,” arXiv preprint arXiv:1903.00445, 2019. [14] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, ones. We believe that our method brings us closer to real-world D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell, et al., “Learning applications of end-to-end visual navigation, so avoiding the to navigate in cities without a map,” in Advances in Neural Information need for expensive sensors. Processing Systems, 2018, pp. 2419–2430. [15] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. 
The results show that DRL presents a promising alternative to conventional visual navigation methods. Auxiliary tasks provide a powerful way to combine a large number of simulated samples with a comparatively small number of real-world ones. We believe that our method brings us closer to real-world applications of end-to-end visual navigation, thus avoiding the need for expensive sensors.

In our future work, we will evaluate this approach in larger environments with continuous motion of the robot, rather than motion on a grid. The experiments will be performed using a motion capture system for accurate robot localization. To extend the memory of the actor, one can pursue the idea of implicit external memory in deep reinforcement learning [35] and transformers [36]. By using better domain randomization, a general model can be trained that does not need the robot pose data accompanying the images in the fine-tuning phase. We also plan to work with a continuous action space and to model the robot dynamics.

ACKNOWLEDGMENT

This work was supported by the European Regional Development Fund under the project Robotics for Industry 4.0 (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000470). Robert Babuška was supported by the European Union’s H2020 project Open Deep Learning Toolkit for Robotics (OpenDR) under grant agreement No 871449.
REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[2] A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell, “Agent57: Outperforming the Atari human benchmark,” arXiv preprint arXiv:2003.13350, 2020.
[3] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
[4] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3D environment,” arXiv preprint arXiv:1801.02209, 2018.
[5] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[6] J. Kulhánek, E. Derner, T. de Bruin, and R. Babuška, “Vision-based navigation using deep reinforcement learning,” in 2019 European Conference on Mobile Robots (ECMR). IEEE, 2019, pp. 1–8.
[7] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
[8] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian, “Bayesian relational memory for semantic visual navigation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2769–2779.
[9] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” arXiv preprint arXiv:1810.06543, 2018.
[10] P. Shah, M. Fiser, A. Faust, J. C. Kew, and D. Hakkani-Tur, “FollowNet: Robot navigation by following natural language directions with deep reinforcement learning,” arXiv preprint arXiv:1805.06150, 2018.
[11] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, “Active object perceiver: Recognition-guided policy learning for object searching on mobile robots,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 6857–6863.
[12] A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi, “Towards generalization in target-driven visual navigation by using deep reinforcement learning,” IEEE Transactions on Robotics, 2020.
[13] K. Chen, J. P. de Vicente, G. Sepulveda, F. Xia, A. Soto, M. Vázquez, and S. Savarese, “A behavioral approach to visual navigation with graph localization networks,” arXiv preprint arXiv:1903.00445, 2019.
[14] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell, et al., “Learning to navigate in cities without a map,” in Advances in Neural Information Processing Systems, 2018, pp. 2419–2430.
[15] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[16] Y.-T. Hu, J.-B. Huang, and A. Schwing, “MaskRNN: Instance level video object segmentation,” in Advances in Neural Information Processing Systems, 2017, pp. 325–334.
[17] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.
[18] W. W. Connors, M. Rivera, and S. Orzel, “Quake 3 arena manual,” 1999.
[19] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al., “DeepMind Lab,” arXiv preprint arXiv:1612.03801, 2016.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
[22] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: Policy learning with deep dynamical models,” arXiv preprint arXiv:1502.02251, 2015.
[23] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2616–2625.
[24] S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. Torr, “Playing DOOM with SLAM-augmented deep reinforcement learning,” arXiv preprint arXiv:1612.00380, 2016.
[25] P. Regier, L. Gesing, and M. Bennewitz, “Deep reinforcement learning for navigation in cluttered environments,” 2020.
[26] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular vision based obstacle avoidance through deep reinforcement learning,” arXiv preprint arXiv:1706.09829, 2017.
[27] A. Singla, S. Padakandla, and S. Bhatnagar, “Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge,” IEEE Transactions on Intelligent Transportation Systems, pp. 1–12, 2019.
[28] N. Botteghi, B. Sirmacek, K. A. Mustafa, M. Poel, and S. Stramigioli, “On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach,” arXiv preprint arXiv:2002.04109, 2020.

[29] C. Sampedro, H. Bavle, A. Rodriguez-Ramos, P. de la Puente, and P. Campoy, “Laser-based reactive navigation for multirotor aerial robots using deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1024–1031.
[30] A. V. Clemente, H. N. Castejón, and A. Chandra, “Efficient parallel methods for deep reinforcement learning,” arXiv preprint arXiv:1705.04862, 2017.
[31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1995–2003.
[33] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
[34] Z. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Liu, C. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3223–3230.
[35] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, no. 7626, pp. 471–476, 2016.
[36] E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, et al., “Stabilizing transformers for reinforcement learning,” arXiv preprint arXiv:1910.06764, 2019.
