<<

Robotics: Science and Systems 2020, Corvallis, Oregon, USA, July 12–16, 2020

Deep Drone Acrobatics

Elia Kaufmann∗‡, Antonio Loquercio∗‡, René Ranftl†, Matthias Müller†, Vladlen Koltun†, Davide Scaramuzza‡

Fig. 1: A quadrotor performs a Barrel Roll (left), a Power Loop (middle), and a Matty Flip (right). We safely train acrobatic controllers in simulation and deploy them with no fine-tuning (zero-shot transfer) on physical quadrotors. The approach uses only onboard sensing and computation. No external motion tracking was used.

Abstract—Performing acrobatic maneuvers with quadrotors is extremely challenging. Acrobatic flight requires high thrust and extreme angular accelerations that push the platform to its physical limits. Professional drone pilots often measure their level of mastery by flying such maneuvers in competitions. In this paper, we propose to learn a sensorimotor policy that enables an autonomous quadrotor to fly extreme acrobatic maneuvers with only onboard sensing and computation. We train the policy entirely in simulation by leveraging demonstrations from an optimal controller that has access to privileged information. We use appropriate abstractions of the visual input to enable transfer to a real quadrotor. We show that the resulting policy can be directly deployed in the physical world without any fine-tuning on real data. Our methodology has several favorable properties: it does not require a human expert to provide demonstrations, it cannot harm the physical system during training, and it can be used to learn maneuvers that are challenging even for the best human pilots. Our approach enables a physical quadrotor to fly maneuvers such as the Power Loop, the Barrel Roll, and the Matty Flip, during which it incurs accelerations of up to 3g.

SUPPLEMENTARY MATERIAL

A video demonstrating acrobatic maneuvers is available at https://youtu.be/2N_wKXQ6MXA. Code can be found at https://github.com/uzh-rpg/deep_drone_acrobatics.

I. INTRODUCTION

Acrobatic flight with quadrotors is extremely challenging. Human drone pilots require many years of practice to safely master maneuvers such as power loops and barrel rolls¹. Existing autonomous systems that perform agile maneuvers require external sensing and/or external computation [23, 1, 3]. For aerial vehicles that rely only on onboard sensing and computation, the high accelerations that are required for acrobatic maneuvers together with the unforgiving requirements on the control stack raise fundamental questions in both perception and control. Therefore, they provide a natural benchmark to compare the capabilities of autonomous systems against trained human pilots.

Acrobatic maneuvers represent a challenge for the actuators, the sensors, and all physical components of a quadrotor. While hardware limitations can be resolved using expert-level equipment that allows for extreme accelerations, the major limiting factor to enable agile flight is reliable state estimation. Vision-based state estimation systems either provide significantly reduced accuracy or completely fail at high accelerations due to effects such as motion blur, large displacements, and the difficulty of robustly tracking features over long time frames [4]. Additionally, the harsh requirements of fast and precise control at high speeds make it difficult to tune controllers on the real platform, since even tiny mistakes can result in catastrophic outcomes for the platform.

The difficulty of agile autonomous flight led previous work to mostly focus on specific aspects of the problem. One line of research focused on the control problem, assuming near-perfect state estimation from external sensors [23, 1, 15, 3]. While these works showed impressive examples of agile flight, they focused purely on control. The issues of reliable perception and state estimation during agile maneuvers were cleverly circumvented by instrumenting the environment with sensors (such as Vicon and OptiTrack) that provide near-perfect state estimation to the platform at all times. Recent works addressed the control and perception aspects in an integrated way via techniques like perception-guided trajectory optimization [9, 10, 33] or training end-to-end visuomotor agents [34]. However, acrobatic performance of high-acceleration maneuvers with only onboard sensing and computation has not yet been achieved.

In this paper, we show for the first time that a vision-based autonomous quadrotor with only onboard sensing and computation is capable of autonomously performing agile maneuvers with accelerations of up to 3g, as shown in Fig. 1.

∗ Equal contribution. ‡ Robotics and Perception Group, University of Zurich and ETH Zurich. † Intelligent Systems Lab, Intel.
¹ https://www.youtube.com/watch?v=T1vzjPa5260

This contribution is enabled by a novel simulation-to-reality transfer strategy, which is based on abstraction of both visual and inertial measurements. We demonstrate both formally and empirically that the presented abstraction strategy decreases the simulation-to-reality gap with respect to a naive use of sensory inputs. Equipped with this strategy, we train an end-to-end sensorimotor controller to fly acrobatic maneuvers exclusively in simulation. Learning agile maneuvers entirely in simulation has several advantages: (i) maneuvers can be simply specified by reference trajectories in simulation and do not require expensive demonstrations by a human pilot, (ii) training is safe and does not pose any physical risk to the quadrotor, and (iii) the approach can scale to a large number of diverse maneuvers, including ones that can only be performed by the very best human pilots.

Our sensorimotor policy is represented by a neural network that combines information from different input modalities to directly regress thrust and body rates. To cope with different output frequencies of the onboard sensors, we design an asynchronous network that operates independently of the sensor frequencies. This network is trained in simulation to imitate demonstrations from an optimal controller that has access to privileged state information.

We apply the presented approach to learning autonomous execution of three acrobatic maneuvers that are challenging even for expert human pilots: the Power Loop, the Barrel Roll, and the Matty Flip. Through controlled experiments in simulation and on a real quadrotor, we show that the presented approach leads to robust and accurate policies that are able to reliably perform the maneuvers with only onboard sensing and computation.

II. RELATED WORK

Acrobatic maneuvers comprehensively challenge perception and control stacks. The agility that is required to perform acrobatic maneuvers requires carefully tuned controllers together with accurate state estimation. Compounding the challenge, the large angular rates and high speeds that arise during the execution of a maneuver induce strong motion blur in vision sensors and thus compromise the quality of state estimation.

The complexity of the problem has led early works to only focus on the control aspect while disregarding the question of reliable perception. Lupashin et al. [23] proposed iterative learning of control strategies to enable platforms to perform multiple flips. Mellinger et al. [25] used a similar strategy to autonomously fly quadrotors through a tilted window. By switching between two controller settings, Chen et al. [6] also demonstrated multi-flip maneuvers. Abbeel et al. [1] learned to perform a series of acrobatic maneuvers with autonomous helicopters. Their algorithm leverages expert pilot demonstrations to learn task-specific controllers. While these works proved the ability of flying machines to perform agile maneuvers, they did not consider the perception problem. Indeed, they all assume that near-perfect state estimation is available during the maneuver, which in practice requires instrumenting the environment with dedicated sensors.

Aggressive flight with only onboard sensing and computation is an open problem. The first attempts in this direction were made by Shen et al. [33], who demonstrated agile vision-based flight. The work was limited to low-acceleration trajectories, therefore only accounting for part of the control and perception problems encountered at high speed. More recently, Loianno et al. [20] and Falanga et al. [10] demonstrated aggressive flight through narrow gaps with only onboard sensing. Even though these maneuvers are agile, they are very short and cannot be repeated without re-initializing the estimation pipeline. Using perception-guided optimization, Falanga et al. [9] and Lee et al. [17] proposed model-predictive control frameworks to plan aggressive trajectories while minimizing motion blur. However, such control strategies are too conservative to fly acrobatic maneuvers, which always induce motion blur.

Abolishing the classic division between perception and control, a recent line of work proposes to train visuomotor policies directly from data. Similarly to our approach, Zhang et al. [34] trained a neural network from demonstrations provided by an MPC controller. While the latter has access to the full state of the platform and knowledge of obstacle positions, the network only observes laser range finder readings and inertial measurements. Similarly, Li et al. [18] proposed an imitation learning approach for training visuomotor agents for the task of quadrotor flight. The main limitation of these methods is their sample complexity: large amounts of demonstrations are required to fly even straight-line trajectories. As a consequence, these methods were only validated in simulation or were constrained to slow hover-to-hover trajectories.

Our approach employs abstraction of sensory input [27] to reduce the problem's sample complexity and enable zero-shot sim-to-real transfer. While prior work has demonstrated the possibility of controlling real-world quadrotors with zero-shot sim-to-real transfer [21, 32], our approach is the first to learn an end-to-end sensorimotor mapping – from sensor measurements to low-level controls – that can perform high-speed and high-acceleration acrobatic maneuvers on a real physical system.

III. OVERVIEW

In order to perform acrobatic maneuvers with a quadrotor, we train a sensorimotor controller to predict low-level actions from a history of onboard sensor measurements and a user-defined reference trajectory. An observation o[k] ∈ O at time k ∈ [0, ..., T] consists of a camera image I[k] and an inertial measurement φ[k]. Since the camera and IMU typically operate at different frequencies, the visual and inertial observations are updated at different rates. The controller's output is an action u[k] = [c, ω⊤]⊤ ∈ U that consists of continuous mass-normalized collective thrust c and body rates ω = [ωx, ωy, ωz]⊤ that are defined in the quadrotor body frame.

The controller is trained via privileged learning [5]. Specifically, the policy is trained on demonstrations that are provided by a privileged expert: an optimal controller that has access to privileged information that is not available to the sensorimotor student, such as the full ground-truth state of the platform s[k] ∈ S. The privileged expert is based on a classic optimization-based planning and control pipeline that tracks a reference trajectory from the state s[k] using MPC [9].

We collect training demonstrations from the privileged expert in simulation. Training in simulation enables synthesis and use of unlimited training data for any desired trajectory, without putting the physical platform in danger.
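To make the interface concrete, the following minimal Python sketch shows the observation and action spaces described above. All names here are illustrative placeholders, not the authors' published code.

```python
# Illustrative sketch of the observation/action interface (hypothetical names).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:          # o[k]
    image: np.ndarray       # camera frame I[k], arrives at the camera rate
    imu: np.ndarray         # inertial measurement phi[k], arrives at the IMU rate

@dataclass
class Action:               # u[k] = [c, omega^T]^T
    thrust: float           # mass-normalized collective thrust c
    body_rates: np.ndarray  # omega = [wx, wy, wz] in the body frame

def expert_policy(state: np.ndarray, reference: np.ndarray) -> Action:
    """Privileged expert: MPC tracking the reference from the full
    ground-truth state s[k], available only in simulation."""
    raise NotImplementedError

def student_policy(history: list[Observation], reference: np.ndarray) -> Action:
    """Sensorimotor student: same action space, but only a history of
    onboard sensor measurements as input."""
    raise NotImplementedError
```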
This includes maneuvers that stretch the abilities of even expert human pilots. To facilitate zero-shot simulation-to-reality transfer, the sensorimotor student does not directly access raw sensor input such as color images. Rather, the sensorimotor controller acts on an abstraction of the input, in the form of feature points extracted via classic computer vision. Such abstraction supports sample-efficient training, generalization, and simulation-to-reality transfer [27, 35].

The trained sensorimotor student does not rely on any privileged information and can be deployed directly on the physical platform. We deploy the trained controller to perform acrobatic maneuvers in the physical world, with no adaptation required.

The next section presents each aspect of our method in detail.

IV. METHOD

We define the task of flying acrobatic maneuvers with a quadrotor as a discrete-time, continuous-valued optimization problem. Our task is to find an end-to-end control policy π : O → U, defined by a neural network, which minimizes the following finite-horizon objective:

    \min_{\pi} J(\pi) = \mathbb{E}_{\rho(\pi)} \Big[ \sum_{k=0}^{T} C(\tau_r[k], s[k]) \Big],    (1)

where C is a quadratic cost depending on a reference trajectory τ_r[k] and the quadrotor state s[k], and ρ(π) is the distribution of possible trajectories {(s[0], o[0], u[0]), ..., (s[T], o[T], u[T])} induced by the policy π.

We define the quadrotor state s[k] = [p, q, v, ω] as the platform position p, its orientation quaternion q, and their derivatives. Note that the agent π does not directly observe the state s[k]. We further define the reference trajectory τ_r[k] as a time sequence of reference states which describe the desired trajectory. We formulate the cost C as

    C(\tau_r[k], s[k]) = x[k]^\top L\, x[k],    (2)

where x[k] = τ_r[k] − s[k] denotes the difference between the state of the platform and the corresponding reference at time k, and L is a positive-semidefinite cost matrix.

A. Reference Trajectories

Both the privileged expert and the learned policy assume access to a reference trajectory τ_r[k] that specifies an acrobatic maneuver. To ensure that such a reference is dynamically feasible, it has to satisfy constraints that are imposed by the physical limits and the underactuated nature of the quadrotor platform. Neglecting aerodynamic drag and motor dynamics, the dynamics of the quadrotor can be modelled as

    \dot{p}_{WB} = v_{WB}
    \dot{v}_{WB} = g_W + q_{WB} \odot c_B
    \dot{q}_{WB} = \tfrac{1}{2} \Lambda(\omega_B) \cdot q_{WB}    (3)
    \dot{\omega}_B = J^{-1} \cdot (\eta - \omega_B \times J \cdot \omega_B),

where p_WB, v_WB, q_WB denote the position, linear velocity, and orientation of the platform body frame with respect to the world frame. The gravity vector g_W is expressed in the world frame, and q_WB ⊙ c_B denotes the rotation of the mass-normalized thrust vector c_B = (0, 0, c)⊤ by quaternion q_WB. The time derivative of a quaternion q = (q_w, q_x, q_y, q_z)⊤ is given by q̇ = ½ Λ(ω) · q, where Λ(ω) is a skew-symmetric matrix of the vector (0, ω⊤)⊤ = (0, ωx, ωy, ωz)⊤. The diagonal matrix J = diag(J_xx, J_yy, J_zz) denotes the quadrotor inertia, and η ∈ R³ are the torques acting on the body due to the motor thrusts.

Instead of directly planning in the full state space, we plan reference trajectories in the space of flat outputs z = [x, y, z, ψ]⊤ proposed in [24], where x, y, z denote the position of the quadrotor and ψ is the yaw angle. It can be shown that any smooth trajectory in the space of flat outputs can be tracked by the underactuated platform (assuming reasonably bounded derivatives).

The core part of each acrobatic maneuver is a circular motion primitive with constant tangential velocity v_l. The orientation of the quadrotor is constrained by the thrust vector the platform needs to produce. Consequently, the desired platform orientation is undefined when there is no thrust. To ensure a well-defined reference trajectory through the whole circular maneuver, we constrain the norm of the tangential velocity v_l to be larger by a margin ε than the critical tangential velocity that would lead to free fall at the top of the maneuver:

    \| v_l \| > \epsilon \sqrt{r g},    (4)

where r denotes the radius of the loop, g = 9.81 m/s², and ε = 1.1.

While the circular motion primitives form the core part of the agile maneuvers, we use constrained polynomial trajectories to enter, transition between, and exit the maneuvers. A polynomial trajectory is described by four polynomial functions specifying the independent evolution of the components of z over time:

    z_i(t) = \sum_{j=0}^{P_i} a_{ij} \, t^j \quad \text{for } i \in \{0, 1, 2, 3\}.    (5)

We use polynomials of order P_i = 7 for the position components (i ∈ {0, 1, 2}) of the flat outputs and P_i = 2 for the yaw component (i = 3). By enforcing continuity for both start (t = 0) and end (t = t_m) of the trajectory down to the 3rd derivative of position, the trajectory is fully constrained. We minimize the execution time t_m of the polynomial trajectories, while constraining the maximum speed, thrust, and body rates throughout the maneuver.

Fig. 3: Reference trajectories for acrobatic maneuvers. Top row, from left to right: Power Loop, Barrel Roll, and Matty Flip. Bottom row: Combo.
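As a concrete illustration of Eqs. (4) and (5), the short sketch below checks the tangential-velocity constraint for a loop and evaluates a flat-output polynomial. It reflects our reading of the constraints, not the authors' trajectory generator.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def min_tangential_speed(radius, eps=1.1):
    # Eq. (4): the reference is well defined only if ||v_l|| > eps * sqrt(r * g),
    # i.e. the platform must stay faster than free fall at the top of the loop.
    return eps * np.sqrt(radius * G)

def eval_flat_outputs(coeffs, t):
    # Eq. (5): z_i(t) = sum_j a_ij * t^j for the flat outputs [x, y, z, yaw].
    # coeffs[i] lists a_i0 ... a_iPi (order 7 for position, 2 for yaw).
    # np.polyval expects highest-order coefficients first, hence the reversal.
    return np.array([np.polyval(np.asarray(c)[::-1], t) for c in coeffs])

# Example: the paper's loop radius r = 1.5 m requires
# ||v_l|| > 1.1 * sqrt(1.5 * 9.81), roughly 4.2 m/s.
print(min_tangential_speed(1.5))
```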

Finally, the trajectories are concatenated into the full reference trajectory, which is then converted back to the full state-space representation τ_r(t) [24]. Subsequent sampling with a frequency of 50 Hz results in the discrete-time representation τ_r[k] of the reference trajectory. Some example trajectories are illustrated in Figure 3.

B. Privileged Expert

Our privileged expert π* consists of an MPC [9] that generates collective thrust and body rates via an optimization-based scheme. The controller operates on the simplified dynamical model of a quadrotor proposed in [26]:

    \dot{p}_{WB} = v_{WB}
    \dot{v}_{WB} = g_W + q_{WB} \odot c_B    (6)
    \dot{q}_{WB} = \tfrac{1}{2} \Lambda(\omega_B) \cdot q_{WB}

In contrast to the model (3), the simplified model neglects the dynamics of the angular rates. The MPC repeatedly optimizes the open-loop control problem over a receding horizon of N time steps and applies the first control command from the optimized sequence. Specifically, the action computed by the MPC is the first element of the solution to the following optimization problem:

    \pi^* = \min_{u} \Big\{ x[N]^\top Q \, x[N] + \sum_{k=1}^{N-1} \big( x[k]^\top Q \, x[k] + u[k]^\top R \, u[k] \big) \Big\}    (7)
    \text{s.t.} \quad r(x, u) = 0, \quad h(x, u) \le 0,

where x[k] = τ_r[k] − s[k] denotes the difference between the state of the platform at time k and the corresponding reference τ_r[k], r(x, u) are equality constraints imposed by the system dynamics (6), and h(x, u) are optional bounds on inputs and states. Q, R are positive-semidefinite cost matrices.

C. Learning

The sensorimotor controller is trained by imitating demonstrations provided by the privileged expert. While the expert has access to privileged information in the form of ground-truth state estimates, the sensorimotor controller does not access any privileged information and can be directly deployed in the physical world [5].

A lemma by Pan et al. [28] formally defines an upper bound between the expert and the student performance:

    J(\pi) - J(\pi^*) \le C_{\pi^*} \, \mathbb{E}_{\rho(\pi)} \big[ D_W(\pi, \pi^*) \big] \le C_{\pi^*} \, \mathbb{E}_{\rho(\pi)} \mathbb{E}_{u^* \sim \pi^*} \mathbb{E}_{u \sim \pi} \big[ \| u^* - u \| \big],    (8)

where D_W(·, ·) is the Wasserstein metric [13] and C_{π*} is a constant depending on the smoothness of expert actions. Finding an agent with the same performance as the privileged controller π* boils down to minimizing the discrepancy in actions between the two policies on the expected agent trajectories ρ(π).

The aforementioned discrepancy can be minimized by an iterative supervised learning process known as DAGGER [31]. This process iteratively collects data by letting the student control the platform, annotating the collected observations with the expert's actions, and updating the student controller based on the supervised learning problem

    \pi = \min_{\hat{\pi}} \, \mathbb{E}_{s[k] \sim \rho(\pi)} \big[ \| u^*(s[k]) - \hat{\pi}(o[k]) \| \big],    (9)

where u*(s[k]) is the expert action and o[k] is the observation vector in the state s[k]. Running this process for O(N) iterations yields a policy π with performance J(π) ≤ J(π*) + O(N) [31].

Naive application of this algorithm to the problem of agile flight in the physical world presents two major challenges: how can the expert access the ground-truth state s[k], and how can we protect the platform from damage when the partially trained student π is in control? To circumvent these challenges, we train exclusively in simulation. This significantly simplifies the training procedure, but presents a new hurdle: how do we minimize the difference between the sensory input received by the controller in simulation and reality?

Fig. 4: Network architecture. The network receives a history of feature tracks, IMU measurements, and reference trajectories as input. Each input modality is processed using temporal convolutions and updated at different input rates. The resulting intermediate representations are processed by a multi-layer perceptron at a fixed output rate to produce collective thrust and body rate commands.
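The training procedure of Eq. (9), together with the expert-gated action selection and exploration schedule described later in Sec. IV-E, can be summarized by the following sketch. Here `env`, `expert`, and `student` are hypothetical stand-ins for the simulator, the MPC, and the network; this is a sketch of the described procedure, not the released implementation.

```python
# DAgger-style training loop (sketch; env/expert/student are placeholders).
import numpy as np

def train_dagger(student, expert, env, iters=5, rollouts_per_iter=30,
                 epochs=40, threshold=1.0, p_random=0.3):
    dataset = []                                    # (observation, expert action)
    for _ in range(iters):
        for _ in range(rollouts_per_iter):
            obs, state = env.reset()
            done = False
            while not done:
                u_exp = expert(state)               # privileged label, Eq. (9)
                u_stu = student(obs)
                dataset.append((obs, u_exp))        # always annotate with the expert
                if np.random.rand() < p_random:     # random actions for coverage
                    u = env.sample_random_action()
                elif np.linalg.norm(u_stu - u_exp) < threshold:
                    u = u_stu                       # student flies when close to expert
                else:
                    u = u_exp                       # expert takes over near failures
                obs, state, done = env.step(u)
        threshold *= 2                              # relax the gate as training progresses
        student.fit(dataset, epochs=epochs)         # supervised regression on all data
    return student
```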

Our approach to bridging the gap between simulation and reality is to leverage abstraction [27]. Rather than operating on raw sensory input, our sensorimotor controller operates on an intermediate representation produced by a perception module [35]. This intermediate representation is more consistent across simulation and reality than raw visual input.

We now formally show that training a network on abstractions of sensory input reduces the gap between simulation and reality. Let M(z | s), L(z | s) : S → O denote the observation models in the real world and in simulation, respectively. Such models describe how an on-board sensor measurement z senses a state s. We further define π_r = E_{o_r∼M(s)}[π(o_r[k])] and π_s = E_{o_s∼L(s)}[π(o_s[k])] as the realizations of the policy π in the real world and in simulation. The following lemma shows that, disregarding actuation differences, the distance between the observation models upper-bounds the gap in performance in simulation and reality.

Lemma 1: For a Lipschitz-continuous policy π, the simulation-to-reality gap J(π_r) − J(π_s) is upper-bounded by

    J(\pi_r) - J(\pi_s) \le C_{\pi_s} K \, \mathbb{E}_{\rho(\pi_r)} \big[ D_W(M, L) \big],    (10)

where K denotes the Lipschitz constant.

Proof: The lemma follows directly from (8) and the fact that

    D_W(\pi_r, \pi_s) = \inf_{\gamma \in \Pi(o_r, o_s)} \mathbb{E}_{(o_r, o_s)} \big[ d_p(\pi_r, \pi_s) \big]
                      \le K \inf_{\gamma \in \Pi(o_r, o_s)} \mathbb{E}_{(o_r, o_s)} \big[ d_o(o_r, o_s) \big]
                      = K \cdot D_W(M, L),

where d_o and d_p are distances in observation and action space, respectively.

We now consider the effect of abstraction of the input observations. Let f be a mapping of the observations such that

    D_W(f(M), f(L)) \le D_W(M, L).    (11)

The mapping f is task-dependent and is generally designed – with domain knowledge – to contain sufficient information to solve the task while being invariant to nuisance factors. In our case, we use feature tracks as an abstraction of camera frames. The feature tracks are provided by a visual-inertial odometry (VIO) system. In contrast to camera frames, feature tracks primarily depend on scene geometry, rather than surface appearance. We also make inertial measurements independent of environmental conditions, such as temperature and pressure, by integration and de-biasing. As such, our input representations fulfill the requirements of Eq. (11).

As the following lemma shows, training on such representations reduces the gap between task performance in simulation and the real world.

Lemma 2: A policy that acts on an abstract representation of the observations, π_f : f(O) → U, has a lower simulation-to-reality gap than a policy π_o : O → U that acts on raw observations.

Proof: The lemma follows directly from (10) and (11).

D. Sensorimotor Controller

In contrast to the expert policy π*, the student policy π is only provided with onboard sensor measurements from the forward-facing camera and the IMU. There are three main challenges for the controller to tackle: (i) it should keep track of its state based on the provided inputs, akin to a visual-inertial odometry system [11, 8], (ii) it should be invariant to environments and domains, so as to not require retraining for each scene, and (iii) it should process sensor readings that are provided at different frequencies.

We represent the policy as a neural network that fulfills all of the above requirements. The network consists of three input branches that process visual input, inertial measurements, and the reference trajectory, followed by a multi-layer perceptron that produces actions. The architecture is illustrated in Fig. 4. Similarly to visual-inertial odometry systems [7, 8, 11], we provide the network with a representation of the platform state by supplying it with a history of length L = 8 of visual and inertial information.

To ensure that the learned policies are scene- and domain-independent, we provide the network with appropriate abstractions of the inputs instead of directly feeding raw inputs. We design these abstractions to contain sufficient information to solve the task while being invariant to environmental factors that are hard to simulate accurately and are thus unlikely to transfer from simulation to reality.

The distribution of raw IMU measurements depends on the exact sensor as well as environmental factors such as pressure and temperature. Instead of using the raw measurements as input to the policy, we preprocess the IMU signal by applying bias subtraction and gravity alignment. Modern visual-inertial odometry systems perform a similar pre-integration of the inertial data in their optimization backend [29]. The resulting inertial signal contains the estimated platform velocity, orientation, and angular rate.

Fig. 5: Example images from simulation (left) and the real test environment (right).

We use the history of filtered inertial measurements, sampled at 100 Hz, and process them using temporal convolutions [2].
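A rough sketch of the two input abstractions, as we read them, follows below. The bias and rotation estimates would come from the VIO backend; all names and the exact sign conventions are our assumptions, not specified in the paper.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world frame, m/s^2

def preprocess_imu(acc_body, gyro_body, acc_bias, gyro_bias, R_wb):
    """Bias subtraction and gravity alignment of one IMU sample.
    R_wb rotates body-frame vectors into the world frame (from VIO)."""
    acc_aligned = R_wb @ (acc_body - acc_bias) + GRAVITY  # gravity removed
    gyro_debiased = gyro_body - gyro_bias
    return acc_aligned, gyro_debiased

def feature_track_vector(kp, prev_kp, track_age):
    """5-D per-track input: keypoint position, displacement w.r.t. the
    previous keyframe (both on the rectified image plane), and the number
    of keyframes the feature has been tracked (a quality measure)."""
    return np.array([kp[0], kp[1],
                     kp[0] - prev_kp[0], kp[1] - prev_kp[1],
                     float(track_age)])
```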
Specifically, the inertial branch consists of a temporal convolutional layer with 128 filters, followed by three temporal convolutions with 64 filters each. A final fully-connected layer maps the signal to a 128-dimensional representation.

Another input branch processes a history of reference velocities, orientations, and angular rates. It has the same structure as the inertial branch. New reference states are added to the history at a rate of 50 Hz.

For the visual signal, we use feature tracks, i.e. the motion of salient keypoints in the image plane, as an abstraction of the input. Feature tracks depend on the scene structure, ego-motion, and image gradients, but not on absolute pixel intensities. At the same time, the information contained in the feature tracks is sufficient to infer the ego-motion of the platform up to an unknown scale. Information about the scale can be recovered from the inertial measurements. We leverage the computationally efficient feature extraction and tracking frontend of VINS-Mono [29] to generate feature tracks. The frontend extracts Harris corners [14] and tracks them using the Lucas-Kanade method [22]. We perform geometric verification and exclude correspondences with a distance of more than 2 pixels from the epipolar line. We represent each feature track by a 5-dimensional vector that consists of the keypoint position, its displacement with respect to the previous keyframe (both on the rectified image plane), and the number of keyframes that the feature has been tracked (a measure of keypoint quality). To facilitate efficient batch training, we randomly sample 40 keypoints per keyframe. The features are processed by a reduced version of the PointNet architecture proposed in [30] before we generate a fixed-size representation at each timestep. Specifically, we reduce the number of hidden layers from 6 to 4, with 32, 64, 128, and 128 filters, respectively, in order to reduce latency. The output of this subnetwork is reduced to a 128-dimensional vector by global average pooling over the feature tracks. The history of resulting hidden representations is then processed by a temporal convolutional network that has the same structure as the inertial and reference branches.

Finally, the outputs of the individual branches are concatenated and processed by a synchronous multi-layer perceptron with three hidden layers of size 128, 64, and 32. The final outputs are the body rates and collective thrust that are used to control the platform.

We account for the different input frequencies by allowing each of the input branches to operate asynchronously. Each branch operates independently from the others by generating an output only when a new input from the sensor arrives. The multi-layer perceptron uses the latest outputs from the asynchronous branches and operates at 100 Hz. It outputs control commands at approximately the same rate due to its minimal computational overhead.

E. Implementation Details

We use the Gazebo simulator to train our policies. Gazebo can model the physics of quadrotors with high fidelity using the RotorS extension [12]. We simulate the AscTec Hummingbird multirotor, which is equipped with a forward-facing fisheye camera. The platform is instantiated in a cubical simulated flying space with a side length of 70 meters. An example image is shown in Fig. 5 (left).

For the real-world experiments we use a custom quadrotor that weighs 1.15 kg and has a thrust-to-weight ratio of 4:1. We use a Jetson TX2 for neural network inference. Images and inertial measurements are provided by an Intel RealSense T265 camera.

We use an off-policy learning approach. We execute the trained policy, collect rollouts, and add them to a dataset. After 30 new rollouts are added, we train for 40 epochs on the entire dataset. This collect-rollouts-and-train procedure is repeated 5 times: there are 150 rollouts in the dataset by the end. We use the Adam optimizer [16] with a learning rate of 3e−4. We always use the latest available model for collecting rollouts. We execute a student action only if the difference to the expert action is smaller than a threshold t = 1.0 to avoid crashes in the early stages of training. We double the threshold t every 30 rollouts. We perform a random action with probability p = 30% at every stage of the training to increase the coverage of the state space. To facilitate transfer from simulation to reality, we randomize the IMU biases and the thrust-to-weight ratio of the platform by up to 10% of their nominal values in every iteration. We do not perform any randomization of the geometry and appearance of the scene during data collection.

V. EXPERIMENTS

We design our evaluation procedure to address the following questions. Does the presented sensorimotor controller offer an advantage over a standard decomposition into state estimation and control? What is the role of input abstraction in facilitating transfer from simulation to reality? Finally, we validate our design choices with ablation studies.

A. Experimental Setup

We learn sensorimotor policies for three acrobatic maneuvers that are popular among professional drone pilots, as well as a policy that consists of a sequence of multiple maneuvers.

1) Power Loop: Accelerate over a distance of 4 m to a speed of 4.5 m/s and enter a loop maneuver with a radius of r = 1.5 m.
2) Barrel Roll: Accelerate over a distance of 3 m to a speed of 4.5 m/s and enter a roll maneuver with a radius of r = 1.5 m.
3) Matty Flip: Accelerate over a distance of 4 m to a speed of 4.5 m/s while yawing 180° and enter a backwards loop maneuver with a radius of r = 1.5 m.
4) Combo: This sequence starts with a triple Barrel Roll, followed by a double Power Loop, and ends with a Matty Flip. The full maneuver is executed without stopping between maneuvers.

The maneuvers are listed by increasing difficulty. The trajectories of these maneuvers are illustrated in Fig. 3. They contain high accelerations and fast angular velocities around the body axes of the platform. All maneuvers start and end in the hover condition.

For comparison, we construct a strong baseline by combining visual-inertial odometry [29] and model predictive control [9]. Our baseline receives the same inputs as the learned controllers: inertial measurements, camera images, and a reference trajectory.

We define two metrics to compare different approaches. We measure the average root mean square error in meters of the reference position with respect to the true position of the platform during the execution of the maneuver. Note that we can measure this error only for simulation experiments, as it requires exact state estimation. We thus define a second metric, the average success rate for completing a maneuver. In simulation, we define success as not crashing the platform into any obstacles during the maneuver. For the real-world experiments, we consider a maneuver successful if the safety pilot did not have to intervene during the execution and the maneuver was executed correctly.
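The two metrics can be computed as in the following sketch; the array shapes are our assumptions.

```python
import numpy as np

def tracking_error(ref_pos, true_pos):
    """Average root-mean-square reference-tracking error in meters.
    Both arrays are (T, 3); requires ground truth, hence simulation only."""
    dists = np.linalg.norm(ref_pos - true_pos, axis=1)
    return np.sqrt(np.mean(dists ** 2))

def success_rate(outcomes):
    """Share of runs without a crash (simulation) or without
    safety-pilot intervention (real world); outcomes are 0/1 flags."""
    return float(np.mean(outcomes))
```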

B. Experiments in Simulation

We first evaluate the performance for individual maneuvers in simulation. The results are summarized in Table I. The learned sensorimotor controller that has access to both visual and inertial data (Ours) is consistently the best across all maneuvers. This policy exhibits a tracking error that is up to 45% lower than that of the strong VIO-MPC baseline. The baseline can complete the simpler maneuvers with perfect success rate, but generally has higher tracking error due to drift in state estimation. The gap between the baseline and our controller widens for longer and more difficult sequences.

| Method          | Ref | IMU | FT | Power Loop: Error (↓) / Success (↑) | Barrel Roll: Error (↓) / Success (↑) | Matty Flip: Error (↓) / Success (↑) | Combo: Error (↓) / Success (↑) |
| VIO-MPC         |  ✓  |  ✓  | ✓  | 43 ± 14 / 100%  | 79 ± 43 / 100%   | 92 ± 41 / 100%   | 164 ± 51 / 70%  |
| Ours (Only Ref) |  ✓  |     |    | 250 ± 50 / 20%  | 485 ± 112 / 85%  | 340 ± 120 / 15%  | ∞ / 0%          |
| Ours (No IMU)   |  ✓  |     | ✓  | 210 ± 100 / 30% | 543 ± 95 / 85%   | 380 ± 100 / 20%  | ∞ / 0%          |
| Ours (No FT)    |  ✓  |  ✓  |    | 28 ± 8 / 100%   | 64 ± 24 / 95%    | 67 ± 29 / 100%   | 134 ± 113 / 85% |
| Ours            |  ✓  |  ✓  | ✓  | 24 ± 5 / 100%   | 58 ± 9 / 100%    | 53 ± 15 / 100%   | 128 ± 57 / 95%  |

TABLE I: Comparison of different variants of our approach with the baseline (VIO-MPC) in terms of the average tracking error in centimeters and the success rate, averaged over 20 runs. Agents without access to IMU data perform poorly. An agent that has access only to IMU measurements has a significantly lower tracking error than the baseline. Adding feature tracks further improves tracking performance and success rate, especially for longer and more complicated maneuvers.

Table I also highlights the relative importance of the input modalities. Policies that only receive the reference trajectories but no sensory input (Ours (Only Ref)) – effectively operating open-loop – perform poorly across all maneuvers. Policies that have access to visual input but not to inertial data (Ours (No IMU)) perform similarly poorly, since they do not have sufficient information to determine the absolute scale of the ego-motion. On the other hand, policies that only rely on inertial data for sensing (Ours (No FT)) are able to safely fly most maneuvers. Even though such controllers only have access to inertial data, they exhibit significantly lower tracking error than the VIO-MPC baseline. However, the longer the maneuver, the larger the drift accumulated by purely-inertial (Ours (No FT)) controllers. When both inertial and visual data are incorporated (Ours), drift is reduced and accuracy improves. For the longest sequence (Combo), the abstracted visual input raises the success rate by 10 percentage points.

Fig. 6 analyzes the evolution of tracking errors and success rates of different methods over time. For this experiment, we trained a policy to repeatedly fly barrel rolls for four seconds. We evaluate robustness and generalization of the learned policy by flying the maneuver for up to 20 seconds at test time. The results again show that (abstracted) visual input reduces drift and increases robustness. The controller that has access to both visual and inertial data (Ours) is able to perform barrel rolls for 20 seconds without a single failure.

Fig. 6: Tracking error (left) and success rate (right) over time when a maneuver is executed repeatedly in simulation. The controllers were trained to complete the maneuver for six seconds and generalize well to longer sequences. Our learned controller, which leverages both IMU and visual data, provides consistently good performance without a single failure.

To validate the importance of input abstraction, we compare our approach to a network that uses raw camera images instead of feature tracks as visual input. This network substitutes the PointNet in the input branch with a 5-layer convolutional network that directly operates on image pixels, but retains the same structure otherwise. We train this network on the Matty Flip and evaluate its robustness to changes in the background images. The results are summarized in Table II. In the training environment, the image-based network has a success rate of only 80%, with a 58% higher tracking error than the controller that receives an abstraction of the visual input in the form of feature tracks (Ours). We attribute this to the higher sample complexity of learning from raw pixels [35]. Even more dramatically, the image-based controller fails completely when tested with previously unseen background images (Test 1, Test 2). (For backgrounds, we use randomly sampled images from the COCO dataset [19].) In contrast, our approach maintains a 100% success rate in these conditions.

| Input | Train: Error (↓) / Success (↑) | Test 1: Error (↓) / Success (↑) | Test 2: Error (↓) / Success (↑) |
| Image | 90 ± 32 / 80%  | ∞ / 0%         | ∞ / 0%         |
| Ours  | 53 ± 15 / 100% | 58 ± 18 / 100% | 61 ± 11 / 100% |

TABLE II: Sim-to-sim transfer for different visual input modalities. Policies that directly rely on images as input do not transfer to scenes with novel appearance (Test 1, Test 2). Feature tracks enable reliable transfer. Results are averaged over 10 runs.

C. Deployment in the Physical World

We now perform direct simulation-to-reality transfer of the learned controllers. We use exactly the same sensorimotor controllers that were learned in simulation and quantitatively evaluated in Table I to fly a physical quadrotor, with no fine-tuning. Despite the differences in the appearance of simulation and reality (see Fig. 5), the abstraction scheme we use facilitates successful deployment of simulation-trained controllers in the physical world. The controllers are shown in action in the supplementary video.

We further evaluate the learned controllers with a series of quantitative experiments on the physical platform. The success rates for different maneuvers are shown in Table III. Our controllers can fly all maneuvers with no intervention. An additional experiment is presented in Fig. 7, where a controller that was trained for a single loop was tested on repeated execution of the maneuver with no breaks. The results indicate that using all input modalities, including the abstracted visual input in the form of feature tracks, enhances robustness.

| Maneuver     | Power Loop | Barrel Roll | Matty Flip |
| Ours (No FT) | 100%       | 90%         | 100%       |
| Ours         | 100%       | 100%        | 100%       |

TABLE III: Success rate across 10 runs on the physical platform.

Fig. 7: Number of successful back-to-back Power Loops on the physical quadrotor before the human expert pilot had to intervene. Results are averaged over 5 runs.

VI. CONCLUSION

Our approach is the first to enable an autonomous flying machine to perform a wide range of acrobatic maneuvers that are highly challenging even for expert human pilots. The approach relies solely on onboard sensing and computation, and leverages sensorimotor policies that are trained entirely in simulation.
We have shown that designing appropriate abstractions of the input facilitates direct transfer of the policies from simulation to physical reality. The presented methodology is not limited to autonomous flight and can enable progress in other areas of robotics.

ACKNOWLEDGEMENT

This work was supported by the Intel Network on Intelligent Systems, the National Centre of Competence in Research Robotics (NCCR) through the Swiss National Science Foundation, and the SNSF-ERC Starting Grant.

REFERENCES

[1] Pieter Abbeel, Adam Coates, and Andrew Y Ng. "Autonomous helicopter aerobatics through apprenticeship learning". In: The International Journal of Robotics Research 29.13 (2010), pp. 1608–1639.
[2] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling". In: arXiv:1803.01271 (2018).
[3] Adam Bry, Charles Richter, Abraham Bachrach, and Nicholas Roy. "Aggressive flight of fixed-wing and quadrotor aircraft in dense indoor environments". In: The International Journal of Robotics Research 34.7 (2015), pp. 969–1002.
[4] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian D. Reid, and John J. Leonard. "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age". In: IEEE Transactions on Robotics 32.6 (2016), pp. 1309–1332.
[5] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. "Learning by Cheating". In: Conference on Robot Learning (CoRL). 2019.
[6] Ying Chen and Néstor O Pérez-Arancibia. "Controller Synthesis and Performance Optimization for Aerobatic Quadrotor Flight". In: IEEE Transactions on Control Systems Technology (2019), pp. 1–16.
[7] Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. "VINet: Visual-inertial odometry as a sequence-to-sequence learning problem". In: AAAI Conference on Artificial Intelligence. 2017.
[8] Jakob Engel, Vladlen Koltun, and Daniel Cremers. "Direct sparse odometry". In: IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) 40.3 (2018), pp. 611–625.
[9] Davide Falanga, Philipp Foehn, Peng Lu, and Davide Scaramuzza. "PAMPC: Perception-aware model predictive control for quadrotors". In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2018.
[10] Davide Falanga, Elias Mueggler, Matthias Faessler, and Davide Scaramuzza. "Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision". In: IEEE International Conference on Robotics and Automation (ICRA). 2017.
[11] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza. "On-Manifold Preintegration for Real-Time Visual-Inertial Odometry". In: IEEE Transactions on Robotics 33.1 (2017), pp. 1–21.
[12] Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. "RotorS – A modular Gazebo MAV simulator framework". In: Robot Operating System (ROS). Springer, 2016, pp. 595–625.
[13] Alison L Gibbs and Francis Edward Su. "On choosing and bounding probability metrics". In: International Statistical Review 70.3 (2002), pp. 419–435.
[14] Christopher G. Harris and Mike Stephens. "A Combined Corner and Edge Detector". In: Proceedings of the Alvey Vision Conference. 1988.
[15] Jemin Hwangbo, Inkyu Sa, Roland Siegwart, and Marco Hutter. "Control of a quadrotor with reinforcement learning". In: IEEE Robotics and Automation Letters 2.4 (2017), pp. 2096–2103.
[16] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: International Conference on Learning Representations (ICLR). 2015.
[17] Keuntaek Lee, Jason Gibson, and Evangelos A Theodorou. "Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC". In: arXiv:2001.02307 (2020).
[18] Shuo Li, Ekin Ozturk, Christophe De Wagter, Guido de Croon, and Dario Izzo. "Aggressive Online Control of a Quadrotor via Deep Network Representations of Optimality Principles". In: arXiv:1912.07067 (2019).
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. "Microsoft COCO: Common Objects in Context". In: European Conference on Computer Vision (ECCV). 2014.
[20] Giuseppe Loianno, Chris Brunner, Gary McGrath, and Vijay Kumar. "Estimation, control, and planning for aggressive flight with a small quadrotor with a single camera and IMU". In: IEEE Robotics and Automation Letters 2.2 (2016), pp. 404–411.
[21] Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. "Deep Drone Racing: From Simulation to Reality with Domain Randomization". In: IEEE Transactions on Robotics 36.1 (2020), pp. 1–14.
[22] Bruce D. Lucas and Takeo Kanade. "An Iterative Image Registration Technique with an Application to Stereo Vision". In: International Joint Conference on Artificial Intelligence. 1981.
[23] Sergei Lupashin, Angela Schöllig, Michael Sherback, and Raffaello D'Andrea. "A simple learning strategy for high-speed quadrocopter multi-flips". In: IEEE International Conference on Robotics and Automation (ICRA). 2010, pp. 1642–1648.
[24] Daniel Mellinger and Vijay Kumar. "Minimum snap trajectory generation and control for quadrotors". In: IEEE International Conference on Robotics and Automation (ICRA). 2011, pp. 2520–2525.
[25] Daniel Mellinger, Nathan Michael, and Vijay Kumar. "Trajectory generation and control for precise aggressive maneuvers with quadrotors". In: The International Journal of Robotics Research 31.5 (2012), pp. 664–674.
[26] Mark W Mueller, Markus Hehn, and Raffaello D'Andrea. "A computationally efficient algorithm for state-to-state quadrocopter trajectory generation and feasibility verification". In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2013, pp. 3480–3486.
[27] Matthias Müller, Alexey Dosovitskiy, Bernard Ghanem, and Vladlen Koltun. "Driving Policy Transfer via Modularity and Abstraction". In: Conference on Robot Learning (CoRL). 2018.
[28] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos Theodorou, and Byron Boots. "Agile Autonomous Driving Using End-to-End Deep Imitation Learning". In: Robotics: Science and Systems (RSS). 2018.
[29] Tong Qin, Peiliang Li, and Shaojie Shen. "VINS-Mono: A robust and versatile monocular visual-inertial state estimator". In: IEEE Transactions on Robotics 34.4 (2018), pp. 1004–1020.
[30] René Ranftl and Vladlen Koltun. "Deep fundamental matrix estimation". In: European Conference on Computer Vision (ECCV). 2018.
[31] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. "A reduction of imitation learning and structured prediction to no-regret online learning". In: International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.
[32] Fereshteh Sadeghi and Sergey Levine. "CAD2RL: Real Single-Image Flight Without a Single Real Image". In: Robotics: Science and Systems (RSS). 2017.
[33] Shaojie Shen, Yash Mulgaonkar, Nathan Michael, and Vijay Kumar. "Vision-Based State Estimation and Trajectory Control Towards High-Speed Flight with a Quadrotor". In: Robotics: Science and Systems (RSS). 2013.
[34] Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search". In: IEEE International Conference on Robotics and Automation (ICRA). 2016, pp. 528–535.
[35] Brady Zhou, Philipp Krähenbühl, and Vladlen Koltun. "Does computer vision matter for action?" In: Science Robotics 4.30 (2019).