Autonomous Thermalling as a Partially Observable Markov Decision Process (Extended Version)

Iain Guilliard¹
Australian National University
Canberra, Australia
[email protected]

Richard Rogahn
Microsoft Research
Redmond, WA 98052
[email protected]

Jim Piavis
Microsoft Research
Redmond, WA 98052
[email protected]

Andrey Kolobov
Microsoft Research
Redmond, WA 98052
[email protected]

Abstract—Small uninhabited aerial vehicles (sUAVs) commonly rely on active propulsion to stay airborne, which limits their flight time and range. To address this, autonomous soaring seeks to utilize free atmospheric energy in the form of updrafts (thermals). However, the irregular nature of thermals at low altitudes makes them hard for existing methods to exploit. We model autonomous thermalling as a POMDP and present a receding-horizon controller based on it. We implement it as part of ArduPlane, a popular open-source autopilot, and compare it to an existing alternative in a series of live flight tests involving two sUAVs thermalling simultaneously, with our POMDP-based controller showing a significant advantage.

arXiv:1805.09875v1 [cs.RO] 24 May 2018

I. INTRODUCTION

Fig. 1: Thermals and their bell-shaped model (in red).

Small uninhabited aerial vehicles (sUAVs) commonly rely on active propulsion to stay in the air. They use motors either directly to generate lift, as in copter-type sUAVs, or to propel the aircraft forward and thereby help produce lift with airflow over the drone's wings. Unfortunately, motors' power demand significantly limits sUAVs' time in the air and range.

In the meantime, the atmosphere has abundant energy sources that go unused by most aircraft. Non-uniform heating and cooling of the Earth's surface creates thermals — areas of rising air that vary from several meters to several hundreds of meters in diameter (see Figure 1). Near the center of a thermal, air travels upwards at several meters per second. Indeed, thermals are used by human sailplane pilots and many bird species to gain hundreds of meters of altitude [3]. A simulation-based theoretical study estimated that under the exceptionally favorable thermalling conditions of Nevada, USA and in the absence of altitude restrictions, an aircraft's 2-hour endurance could potentially be extended to 14 hours by exploiting these atmospheric updrafts [5, 4].

¹The author did most of the work for this paper while at Microsoft Research.

Researchers have proposed several approaches to enable autonomous thermalling for fixed-wing sUAVs [40, 41, 6, 16, 23, 33, 28]. Generally, they rely on various parameterized thermal models characterizing the vertical air velocity distribution within a thermal. In order to successfully gain altitude in a thermal, a sUAV's autopilot has to discover it, determine its parameters such as shape and lift distribution inside it, construct a trajectory that would exploit this lift, and exit at the right time. In this paper, we focus on autonomously identifying a thermal's parameters and using them to gain altitude. These processes are interdependent: thermal identification influences the choice of trajectory, which, in turn, determines what information will be collected about the thermal; both of these affect the decision to exit the thermal or stay in it.

Reinforcement learning (RL) [37], a family of techniques for resolving such exploration-exploitation tradeoffs, has been considered in the context of autonomous thermalling, but only in simulation studies [40, 41, 33, 28]. Its main practical drawback for this scenario is the episodic nature of classic RL algorithms. RL agents learn by executing sequences of actions (episodes) that are occasionally "reset", teleporting the agent to its initial state. If the agent has access to an accurate resettable simulator of the environment, this is not an issue, and there have been attempts to build sufficiently detailed thermal models [33] for learning thermalling policies offline. However, to our knowledge, policies learned in this way have never been tested on real sUAVs. Lacking a highly detailed simulator, in order to learn a policy for a specific thermal, a sUAV would need to make many attempts at entering the same thermal repeatedly in the real world, a luxury it doesn't have. On the other hand, thermalling controllers tested live [5, 16] rely on simple, fixed strategies that, unlike RL-based ones, don't take exploratory steps to gather information about a thermal. They were tested at high altitudes, where thermals are quite stable. However, below 200 meters, near the ground, thermals' irregular shape makes the lack of exploration a notable drawback, as we show in this paper.

Thermal updrafts are used by birds [3] and by human pilots flying sailplanes. Sailplanes (Figure 2, left), colloquially also called gliders, are a type of fixed-wing aircraft optimized for unpowered flight, although some do have a limited-run motor.

The main contribution of our work is framing and solving
autonomous thermalling as a partially observable Markov decision process (POMDP). A POMDP agent maintains a belief about possible world models (in our case — thermal models) and can explicitly predict how new information could affect its beliefs. This effectively allows the autopilot to build a simulator for a specific thermal in real time, "on the fly", and trade off information gathering to refine it against exploiting the already available knowledge to gain height. We propose a fast approximate algorithm tailored to this scenario that runs in real time on Pixhawk, a common autopilot hardware that has a 32-bit ARM processor with only a 168MHz clock speed and 256KB of RAM. On sUAVs with a more powerful companion computer such as a Raspberry Pi 3 onboard, our approach allows for thermalling policy generation with a full-fledged POMDP solver. Alternatively, our setup can be viewed and solved as a model-based Bayesian reinforcement learning problem [21].

For evaluation, we added the proposed algorithm to ArduPlane [1], an open-source drone autopilot, and conducted a live comparison against ArduSoar, ArduPlane's existing soaring controller. This experiment comprised 14 missions, in which two RC sailplanes running each of the two thermalling algorithms onboard flew simultaneously in weak, turbulent, windy thermals at altitudes below 200 meters. Our controller significantly outperformed ArduSoar in flight duration in 11 flights out of 14, showing that its unconventional thermalling trajectories let it take advantage of the slightest updrafts even when running on very low-power hardware.

II. BACKGROUND

Thermals and sailplanes. Thermals are rising plumes of air that originate above areas of the ground that give up previously accumulated heat during certain parts of the day (Figure 1). They tend to occur in at least partly sunny weather several hours after sunrise above darker-colored terrain such as fields or roads and above buildings, but occasionally also appear where there are no obvious features on the Earth's surface. As the warm air rises, it coalesces into a "column", cools off with altitude and eventually starts sinking around the column's fringes. As any conceptual model, this representation of a thermal is idealized. In reality, thermals can be turbulent and irregularly shaped, especially at altitudes up to 300m.

To test our algorithms, we use a Radian Pro sailplane sUAV (Figure 2, right) controllable by a human from the ground.

Fig. 2: A full-size DG1000 sailplane (left, photo credit: Paul Hailday) and the Radian Pro remote-controllable model sailplane that serves as our sUAV platform (right).

Thermalling strategies depend on the distribution of lift within a thermal. Much of the autonomous thermalling literature, as well as human pilots' intuition, relies on the bell-shaped model of lift in the horizontal cross-section of a thermal at a given altitude [41]. It assumes thermals to be approximately round, with vertical air velocity w(x, y) largest near the center and monotonically getting smaller towards the fringes:

w(x, y) = W0 · exp( −((x − xth)² + (y − yth)²) / R0² )    (1)

Here, (xth, yth) is the position of the thermal center at a given altitude, W0 is the vertical air velocity, in m/s, at the center, and R0 can be interpreted as the thermal's radius (Figure 1, in red). Note that a thermal's lift doesn't disappear entirely more than R0 meters from its center. In spite of its simplicity, we use this model in our controller for its low computational cost.

MDPs and POMDPs. Settings where an agent has to optimize its course of action are modeled as Markov Decision Processes (MDPs). An MDP is a tuple ⟨S, A, T, R, s0⟩, where S is the set of possible joint agent/environment states, A is the set of actions available to the agent, T : S × A × S → [0, 1] is a transition function specifying the probability that executing action a in state s will change the state to s′, R : S × A × S → ℝ is a reward function specifying the agent's reward for such a transition, and s0 is the start state. An MDP agent is assumed to know the current state exactly. An optimal MDP solution is a mapping π : S → A, called a policy, that dominates all other policies under the expected reward value starting at s0:

V^π(s0) = E_{T^π} [ Σ_{i=0}^{∞} γ^i R(S_i, A^π_i, S_{i+1}) | S_0 = s0 ]    (2)

Here, S_i and A^π_i are random variables for the agent's state i steps into the future and the action chosen by π in that state, under the trajectory distribution T^π induced by π from s0.

If the agent doesn't have full state knowledge but has access to noisy state observations (as in this paper's setting), it is in a partially observable MDP (POMDP) setting [8]. A POMDP is a tuple ⟨S, A, T, R, O, Z, b0⟩, where S, A, T, and R are as in the MDP definition, O is the observation space of possible clues about the true state, and Z : A × S × O → [0, 1] describes the probabilities of these observations for different states s′ where the agent may end up after executing action a. Since a POMDP agent doesn't know the world state, it maintains a belief — a state probability distribution given the observations received so far. Initially, this distribution is just a prior b0 : S → [0, 1]; the agent updates it using the Bayes rule. POMDP policies are belief-based and have the form π : B → A, where B is the belief space. The optimization criterion amounts to finding a policy that maximizes

V^π(b0) = Σ_s b0(s) V^π(s)    (3)

where V^π(s) comes from Equation 2.
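The bell-shaped lift model of Equation 1 costs only a few floating-point operations to evaluate, which is what makes it viable on a 168MHz flight controller. A minimal sketch (the function and variable names here are ours, not taken from the paper's code):

```python
import math

def bell_lift(x, y, x_th, y_th, W0, R0):
    """Vertical air velocity (m/s) at (x, y) under the bell-shaped
    thermal model of Equation 1: a Gaussian bump of strength W0 (m/s)
    centered at (x_th, y_th) with characteristic radius R0 (m)."""
    d2 = (x - x_th) ** 2 + (y - y_th) ** 2
    return W0 * math.exp(-d2 / R0 ** 2)
```

As the text points out, lift does not vanish at the nominal radius: at distance R0 from the center the model still predicts W0/e ≈ 0.37·W0 of the peak lift.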
III. AUTONOMOUS THERMALLING AS A POMDP

Our formalization of autonomous thermalling is based on the fact that within a thermal, a sUAV's probability of gaining altitude and the altitude gain itself depend on the lift distribution (in our case, given by Equation 1). Thus, in the MDP/POMDP terminology, the lift distribution model determines the transition and reward functions T and R. Since Equation 1's parameters are initially unknown, neither are T and R. However, the autopilot can keep track of a belief b over the model parameters xth, yth, W0, and R0 as it receives a stream of variometer data, making a POMDP a natural choice for this scenario. For a known initial belief state b0, general POMDPs can be solved with different degrees of optimality using methods from the point-based family [32] or variants of the POMCP algorithm [35].

Extended Kalman filter. Since a POMDP agent's action choice depends on its belief, efficient belief updates are crucial. For Gaussian beliefs, the Bayesian update can be performed very fast using various Kalman filters. In the POMDP notation, under stationary transition and observation functions, the original Kalman filter (KF) [25] assumes T(s, a, s′) = N(s′ | f(s, a), Q) and Z(a, s′, o) = N(o | h(a, s′), R), where N denotes a Gaussian, Q and R are process and observation noise covariance matrices, and the transition and observation transformations f(s, a) and h(a, s′) are linear in state and action features. If beliefs are Gaussian but f(s, a) and h(a, s′) are non-linear, as in our scenario, they can be linearized locally, giving rise to the extended Kalman filter (EKF). Concretely, suppose that for belief b = N(· | µ, Σ), the agent takes action a, gets observation o, and wants to compute its new belief b′ = N(· | µ′, Σ′). First, we linearize the transition f(s, a) about the current belief by computing the Jacobian

F = ∂f/∂µ |_µ    (4)

and a new belief estimate not yet adjusted for observation o:

µ′ = f(µ, a),    Σ′ = F Σ Fᵀ + Q    (5)

Next, we linearize h(a, s′) around this uncorrected belief by producing the Jacobian

H = ∂h/∂µ |_{a, µ′}    (6)

and compute the approximate Kalman gain

K = Σ′ Hᵀ (H Σ′ Hᵀ + R)⁻¹    (7)

Finally, we correct our intermediate belief for the observations:

µ′ ← µ′ + K(o − h(a, µ′)),    Σ′ ← (I − K H) Σ′    (8)

There are other filters for belief tracking in systems with non-linear dynamics, including the unscented Kalman filter (UKF) [24] and the particle filter (PF) [29]. We choose the EKF for its balance of approximation quality and low computational requirements.

The belief b characterizes the autopilot's uncertainty about the thermal's position, size, and strength. Based on its current belief, at any time the autopilot can simulate observations it might get from measuring lift strength at different locations, and estimate how this knowledge is expected to reduce its thermal model uncertainty. This simulation can also help assess the expected net altitude gain from following various trajectories. Thus, the autopilot can deliberately select trajectories that reduce its model uncertainty in addition to providing lift. The ability to explicitly plan environment exploration is missing from all thermalling controllers proposed so far, including those that use RL (see the Related Work section). As hypothesized by Allen and Lin [6], Edwards [16], and Tabor et al. [38], and as demonstrated in our experiments, this ability is important, especially for thermalling in difficult conditions.

The POMDP we formulate models the decision-making process once the sailplane is in a thermal. Note that gaining altitude with the help of an updraft also involves thermal detection and timely exit. These are beyond the paper's scope; we rely on Tabor et al. [38]'s mechanisms to implement them.

Practicalities, assumptions, and mode of operation

Thermalling as model predictive control. POMDPs assume the environment to be stationary, i.e., governed by unchanging T and R. Viewing thermalling in this way would require modeling thermal evolution, which is computationally expensive, error-prone, and itself involves assumptions. Instead, as is common for such scenarios, we view it as a model predictive control (MPC) problem [20]. Our controller gets invoked with a fixed frequency of > 1Hz, solves the POMDP below approximately to choose an action for the current belief, and the process repeats at the next time step.

Modelling assumptions. The policy quality of our controller depends on the validity of the following assumptions:

• Assumption 1. The thermal changes with time no faster than on the order of tens of seconds, and is approximately the same within a few meters of vertical distance. This ensures that the world model used by the controller to choose the next action isn't too different from the world where this action will be executed. Recall that Equation 1 applies to a given altitude, and as the sUAV thermals, its altitude changes. The assumption is that, locally, the thermal doesn't change with altitude too much.

• Assumption 2. The thermal doesn't move w.r.t. the surrounding air. The air mass containing both the thermal and the sUAV may move w.r.t. the ground due to wind. We effectively assume that the thermal moves at the wind velocity. While this assumption isn't strictly true for thermals [39], it is common in the autonomous thermalling literature, and its violation doesn't seriously affect our controller, which recomputes the policy frequently.

• Assumption 3. The sUAV is flying at a constant airspeed, and the thermalling controller can't change the sUAV's pitch angle at will. During thermalling, pitch angle control is necessary only for maintaining airspeed and executing coordinated turns, which can be done by the lower levels of the autopilot. If the thermalling controller could change pitch directly, it might attempt to "cheat" by using it to convert kinetic energy into potential energy to gain altitude, instead of exploiting thermal lift to do so.

• Assumption 4. The thermal has no effect on the sUAV's horizontal displacement. This holds for the model in Equation 1 because that model disregards turbulence.

A note on reference frames. Wind complicates some computations related to the sUAV's real-world state, because the locations of important observations such as lift strength are given in Earth's reference frame, in terms of GPS coordinates, whereas the sUAV drifts with the wind. The autopilot maintains a wind vector estimate and uses it to translate GPS locations to the air mass's reference frame [23, 38]. To make wind correction unnecessary, our POMDP model reasons entirely in the reference frame of the air mass. Due to Assumption 2 above, in this frame the sUAV's air velocity fully accounts for the thermal's displacement w.r.t. the sUAV. When solving the POMDP involves simulating observations, their locations are generated directly in this reference frame too.

POMDP Formulation

The components of the thermalling POMDP are as follows:

State space S consists of vectors (s^u, s^th) describing the joint state of the sUAV (s^u) and the thermal (s^th). In particular, s^u = (p^u, v, ψ, φ, φ̇, h) and s^th = (p^th, W0, R0), where p^u gives the 2-D location, in meters, of the sUAV w.r.t. an arbitrary but fixed origin of the air mass coordinate system, v is the sUAV's airspeed in m/s, ψ is its heading – the angle between north and the direction in which it is flying, φ is its bank angle, φ̇ is its rate of roll, and h is its altitude w.r.t. the mean sea level (MSL).

Action space A is a set of arc-like sUAV trajectory segments originating at the sUAV's current position, parametrized by a duration TA common to all actions and indexed by a set of bank angles {φ1, . . . , φn}. Each trajectory corresponds to a coordinated turn at bank angle φi for TA seconds. It is not exactly an arc, because attaining bank angle φi takes time dependent on the sUAV's roll rate φ̇ and current bank angle φ.

Transition function T gives the probability of the joint system state s′ after executing action a in state s = (s^u, s^th). Under Assumptions 2 and 4, T is described by the sUAV's dynamics and Gaussian process noise:

T(s, a, s′) = N(s′ | f_{TA}(s, a), Q)    (9)

Here, f_{TA} captures the sUAV's dynamics, Q is the process noise covariance matrix, and T satisfies a crucial property: T does not modify the thermal parameters W0, R0, and the thermal center p^th w.r.t. the sUAV's position p^u in s. Intuitively, the thermal doesn't change just because the sUAV moves. The change in the thermal center position is purely due to the sUAV's motion and the fact that the thermal's position is relative to the sUAV's.

Reward function R for states s and s′ is (h_{s′} − h_s), the resulting change in altitude. This definition relies on Assumption 3. Without it, the thermalling controller could gain altitude by manipulating pitch, not by exploiting thermal lift.

Observation set O consists of the possible sets of sensor readings — airspeed sensor, GPS, and barometer readings — that the sUAV might receive while executing an action a.

Observation function Z, assigning probabilities Z(a, s′, o) to various observation sets o, is governed by action a's trajectory and the Gaussian observation noise covariance matrix R. Each of the sUAV's sensors operates at some frequency ξ. During a's execution, such a sensor generates a sequence of datapoints of length ξTA, each sampled from 0-mean Gaussian noise around the corresponding intermediate point on a's trajectory. The probability of a set of sensor data sequences produced during a's execution is the product of the corresponding Gaussians.

Initial belief b0 is given by a product of two Gaussians:

b0(s) = b0(s^u) b0(s^th) = N(s^u | s^u_0, Σ^u_0) N(s^th | s^th_0, Σ^th_0),

where s^u_0 has p_0 = (0, 0) for mathematical simplicity, because the sUAV's position is w.r.t. the air mass and for the POMDP's purposes isn't tied to any absolute location, and s^u_0's other components are the sUAV's current airspeed, roll, etc. s^th_0 is initialized with estimates of thermals generally encountered in the area of operations. Σ^u_0 and Σ^th_0 are diagonal matrices.
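Because the belief over the thermal parameters (xth, yth, W0, R0) is Gaussian and, by the transition property above, those parameters are static in the air-mass frame, one EKF step of Equations 4-8 reduces to an identity prediction plus a correction against a variometer reading. The sketch below uses our own notation, not the ArduPlane code, and assumes a single scalar lift observation per step; the observation Jacobian of Equation 6 is derived by hand from Equation 1:

```python
import numpy as np

def bell_lift(p, x, y):
    """Predicted variometer reading (Equation 1) for thermal parameters
    p = [x_th, y_th, W0, R0] at sUAV position (x, y)."""
    x_th, y_th, W0, R0 = p
    return W0 * np.exp(-((x - x_th) ** 2 + (y - y_th) ** 2) / R0 ** 2)

def ekf_update(mu, Sigma, x, y, w_obs, Q, r):
    """One EKF step (Equations 4-8) on the thermal belief N(mu, Sigma).
    The thermal parameters don't change when the sUAV moves, so f is the
    identity and F = I; only the observation model h (Equation 1) is
    non-linear and needs a Jacobian H."""
    # Predict (Equations 4-5): identity dynamics plus process noise Q.
    mu_p, Sigma_p = mu.copy(), Sigma + Q
    # Linearize h around the predicted mean (Equation 6).
    x_th, y_th, W0, R0 = mu_p
    d2 = (x - x_th) ** 2 + (y - y_th) ** 2
    e = np.exp(-d2 / R0 ** 2)
    H = np.array([[2 * W0 * (x - x_th) / R0 ** 2 * e,   # dw/dx_th
                   2 * W0 * (y - y_th) / R0 ** 2 * e,   # dw/dy_th
                   e,                                   # dw/dW0
                   2 * W0 * d2 / R0 ** 3 * e]])         # dw/dR0
    # Kalman gain (Equation 7); r is the scalar variometer noise variance.
    K = Sigma_p @ H.T / (H @ Sigma_p @ H.T + r)
    # Correct (Equation 8).
    mu_new = mu_p + (K * (w_obs - bell_lift(mu_p, x, y))).ravel()
    Sigma_new = (np.eye(4) - K @ H) @ Sigma_p
    return mu_new, Sigma_new
```

Repeating this update over a stream of (position, lift) readings contracts the covariance trace, which is exactly the uncertainty measure POMDSoar thresholds on.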
The thermal state vector s^th consists of p^th = (xth, yth) – the 2-D position of the thermal model center relative to the sUAV position, W0 – the vertical lift in m/s at the thermal center, and R0 – the thermal radius (Figure 1). The wind vector w and the pitch angle are notably missing from the state space definition due to Assumptions 2 and 3.

Keeping the belief components for the sUAV and thermal state separate is convenient for computational reasons. The sUAV state observations are heavily filtered sensor signals, so in practice the sUAV's state can be treated as known. At the same time, the thermal state belief, especially the initial belief b^th_0, has a lot of uncertainty and will benefit from a full belief update.

Belief updates are a central factor determining the computational cost of solving a POMDP. Our primary motivation for defining the initial belief b0 to be Gaussian, in spite of possible inaccuracies, is that we can repeatedly use the EKF defined in Equations 4-8 to update b0 with new observations and get a Gaussian posterior b′ at every step:

b′ = EKF_update(b0, a, o)    (10)

The EKF_update routine implicitly uses the transition and observation functions T and Z from the POMDP's definition.

IV. THE ALGORITHM

Although using EKF-based belief updates reduces the computational cost of solving a POMDP, solving an EKF-based POMDP near-optimally on common low-power flight controllers such as APM and Pixhawk is still much too expensive. The algorithm we present, POMDSoar (Algorithm 1), makes several fairly drastic approximations in order to be able to solve the POMDP model. While they undoubtedly lead to sacrifices in solution quality, POMDSoar still retains a controller's ability to explore in a (myopically) guided way, which, we claim, gives it an advantage in messy thermals at low altitudes. This section presents a high-level description of POMDSoar, while further details about its tradeoffs and implementation are contained in the Appendix.

To further reduce the belief update cost, POMDSoar always uses the belief mean as the estimate of the sUAV state; see, e.g., line 12 of Algorithm 1. For the thermal part of the state, however, it uses the full belief over that part. For a Gaussian, it is natural to take the covariance trace as a measure of uncertainty, and this is what POMDSoar does (lines 4, 21).

Conceptually, POMDSoar is split into two parts: exploration and exploitation. Whenever it is invoked, it first analyzes the covariance of the controller's current belief about the thermal state (line 4). If its trace is below a given threshold — an input parameter — POMDSoar chooses an action aimed at exploiting the current belief. Otherwise, it tries to reduce belief uncertainty by doing exploration. In effect, POMDSoar switches between two approximations of the POMDP reward function R, an exploration- and a lift-oriented one.

In either mode, POMDSoar performs similar calculations. Its action set is a set of arcs parametrized by discrete roll angles, e.g. -45, -30, -15, 0, 15, 30, and 45 degrees. For each of them it computes the sequence of points that would result from executing a coordinated turn at that roll angle for T_A^explore or T_A^exploit seconds (lines 12, 29). See the Implementation Details section for more information about this computation.

Then, POMDSoar samples N thermal states from the current belief and for each of these "imaginary thermals" simulates the reward of executing each of the above action trajectories in the presence of this thermal. In explore mode, action reward is the reduction in belief uncertainty, so at each of the trajectory points POMDSoar generates a hypothetical observation and uses it in a sequence of imaginary belief updates (line 17). The trace of the last resulting belief measures the thermal belief uncertainty after the action's hypothetical execution. For execution in the real world, POMDSoar chooses the action that minimizes this uncertainty.

In the exploit mode, at each generated point of each action trajectory, POMDSoar measures hypothetical lift (line 33) instead of generating observations, and then integrates the lift along the trajectory to estimate altitude gain (line 35). The action with the maximum altitude gain estimate "wins".

Algorithm 1: POMDSoar
1   Input: POMDP ⟨S, A, T, R, O, Z, b0⟩; N – number of thermal state belief samples; T_A^exploit – planning horizon for exploitation mode; T_A^explore – planning horizon for exploration mode; ∆t – action trajectory time resolution; THERMAL_CONFIDENCE_THRES – thermal state confidence threshold
2   b0 = N(s^u_0, Σ^u_0) N(s^th_0, Σ^th_0) ← current belief
3
4   if tr(Σ^th_0) < THERMAL_CONFIDENCE_THRES then
5       Go to EXPLOIT
6   else
7       Go to EXPLORE
8   end
9
10  EXPLORE begin
11      foreach a ∈ A do
12          Traj_a ← SimActionTraj(s^u_0, φ_a, ∆t, T_A^explore)
13          foreach i = 1 ... N do
14              s^th_i ← SampleThermalState(b0)
15              foreach t = 1 ... Traj_a.Length do
16                  o_{a,i,t} ← SimObservation(Traj_a[t], s^th_i)
17                  b_{a,i,t} ← EKF_update(b_{a,i,t−1}, a[t−1,t], o_{a,i,t})
18              end
19              Σ^th_{a,i} ← covariance of b_{a,i,Traj_a.Length}
20          end
21          Uncertainty_a ← (Σ^N_{i=1} tr(Σ^th_{a,i})) / N
22      end
23      a* ← argmin_{a∈A} Uncertainty_a
24      Return a*
25  end
26
27  EXPLOIT begin
28      foreach a ∈ A do
29          Traj_a ← SimActionTraj(s^u_0, φ_a, ∆t, T_A^exploit)
30          foreach i = 1 ... N do
31              s^th_i ← SampleThermalState(b0)
32              foreach t = 1 ... Traj_a.Length do
33                  w_{a,i,t} ← SimLift(Traj_a[t], s^th_i)
34              end
35              AltGain_{a,i} ← Σ^{Traj_a.Length}_{t=1} w_{a,i,t} ∆t
36          end
37          ExpAltGain_a ← (Σ^N_{i=1} AltGain_{a,i}) / N
38      end
39      a* ← argmax_{a∈A} ExpAltGain_a
40      Return a*
41  end

POMDSoar's performance in practice critically relies on the match between the actual turn trajectory (a sequence of 2-D locations) the sUAV will follow for a commanded bank angle and the turn trajectory predicted by our controller for this bank angle and used by POMDSoar during planning. Our controller's trajectory prediction model is described in the Appendix. It uses several airframe type-specific parameters that had to be fitted to data gathered by entering the sUAV into turns at various bank angles. The model achieves a difference of < 1m between predicted and actual trajectories. Figure 3 shows the match between the model-predicted and actual bank and aileron deflection angle evolutions for several roll commands.

Fig. 3: Comparison between actual flight data and its predictions by the model used in POMDSoar for the same input roll commands: (a) bank angle; (b) aileron deflection angle.

Our POMDSoar implementation is available at https://github.com/Microsoft/Frigatebird. It is in C++ and is based on ArduPlane 3.8.2, an open-source autopilot for fixed-wing drones [1] that has Tabor et al. [38]'s ArduSoar controller built in. We reused the parts of it that were conceptually common between ArduSoar and POMDSoar, including the thermal tracking EKF and the thermal entry/exit logic. The ArduSoar and POMDSoar implementations differ only in how they use the EKF to choose the sUAV's next action. Running POMDSoar onboard a sUAV requires picking input parameters (Algorithm 1, line 1) that result in reasonable-quality solutions within the short (< 1s) time between controller invocations on the flight controller hardware. All parameter values from our experiments are available in a .param file together with POMDSoar's implementation.

V. RELATED WORK

While our approach can be regarded as solving a POMDP, it is also an instance of Bayesian reinforcement learning (BRL) [18, 21]. In BRL, the agent starts with a prior distribution over model dynamics T and R, and maintains a posterior over them as it gets observations. Viewing T and R as part of the augmented state space S′ = S × T × R reveals that BRL is equivalent to a special POMDP called Bayes-adaptive MDP (BAMDP) [15], which POMDSoar ends up solving. Nonetheless, due to the computational constraints of our platform, POMDSoar is very different from existing POMDP and BRL solvers such as POMCP [35] and BAMCP [22]. However, BRL-induced POMDPs with belief tracking using an EKF were studied by Slade et al. [36], and the idea of separating uncertainty reduction from knowledge exploitation for approximate BRL is similar to Dearden et al. [13].

There is vast literature on automatically exploiting thermals [40, 41, 6, 16, 23, 33, 28], orographic lift [26, 19], wind gusts [27], wind fields [31], and wind gradients [30, 9], as well as on planning flight paths to extend fixed-wing sUAV endurance and range [17, 31, 10, 11, 28, 14]. Soaring patterns have also been studied for birds [3]. Since our paper focuses exclusively on thermalling, here we survey papers from this subarea.

In an early autonomous thermalling work, Wharington et al. [40, 41] used RL [37] in simulation to learn a thermalling strategy under several simplifying assumptions. Reddy et al. [33] also used RL but removed some of Wharington and Herszberg [41]'s simplifications. In particular, they built a far more detailed thermal model that reflects updrafts' messy, turbulent nature. However, neither of these techniques has been deployed on real UAVs so far, and how this could be done is conceptually non-obvious. These works' version of RL would require executing many trials in the real world in order to learn a thermalling strategy for a given situation, including the ability to restart a trajectory at will — a luxury unavailable to real-world sUAVs. Given the wide variability in atmospheric conditions, it is also unclear whether this method can be successfully used to learn a policy offline for subsequent deployment onboard a sUAV. Our POMDP model circumvents these challenges facing classical RL in the autonomous soaring scenario, since the former allows sampling trajectories during flight on the flight controller's hardware running a data-driven simulator built on the go. However, incorporating Reddy et al. [33]'s elaborate thermal model into a POMDP-based controller could yield even more potent thermalling strategies.

Allen and Lin [6] were the first to conduct an extensive live evaluation of an automatic thermalling technique. They employed a form of Reichmann rules [6, 34] — a set of heuristics employed for thermal centering by human pilots — to collect data for learning thermal location and radius in flight. Edwards [16] mitigated this approach's potential issue with biased thermal parameter estimates. In a live flight test conducted primarily at altitudes over 700m, the resulting method, along with inter-thermal speed optimization, kept a sailplane airborne for 5.3 hours, the record for automatic soaring so far, with thermal centering running on a laptop on the ground. It is not clear whether it is possible to implement it as part of a fully autonomous autopilot on such a computationally constrained device as a Pixhawk, and how such an implementation, if realistic, would perform in less regular low-altitude thermalling conditions. Another controller based on Reichmann rules was introduced by Andersson et al. [7]. Daugherty and Langelaan [12] improved on it by using a Savitzky-Golay filter to estimate total specific energy and its derivatives, which reduced the estimation lag compared to Andersson et al. [7] and allowed latching onto smaller thermals in simulation. Andersson et al. [7]'s controller itself, like those of Edwards [16] and Allen and Lin [6], was evaluated on a real sUAV and did well at altitudes over 400m AGL, but the relative live performance of all these controllers is unknown.

Last but not least, our live study compares POMDSoar against ArduSoar [38], the thermalling controller of a popular open-source autopilot for fixed-wing sUAVs called ArduPlane [1]. ArduSoar's implementation of the thermal-tracking EKF and the thermal entry/exit logic is shared with POMDSoar. When ArduSoar detects a thermal, it starts circling at a fixed radius around the EKF's thermal position estimate mean. ArduSoar's thermal tracking was inspired by Hazard [23]'s work, which used a UKF instead of an EKF for this purpose and evaluated a number of fixed thermalling trajectories in simulation.

Thus, several aspects distinguish our work from prior art. On the theoretic level, we frame thermalling as a POMDP, which allows principled analysis and practical onboard solutions. We also conduct the first live side-by-side evaluation of two thermalling controllers. Last but not least, this evaluation is done in an easily replicable experimental setup.

VI. EMPIRICAL EVALUATION

Our empirical evaluation aimed at comparing POMDSoar's performance as part of an end-to-end sUAV autopilot to a publicly available thermalling controller for sUAVs, ArduSoar, in challenging real-life thermalling conditions. We did this via side-by-side live testing of the two soaring controllers.
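Tying the pieces together, the explore/exploit decision loop of Algorithm 1 (Section IV) can be sketched as follows. This is a simplified stand-in, not the Frigatebird code: the turn model uses the ideal coordinated-turn rate g·tan(φ)/v with no roll-in dynamics, a single scalar lift reading stands in for the full sensor suite, and all constants are illustrative rather than the tuned .param values.

```python
import numpy as np

G = 9.81  # m/s^2

def lift(p, x, y):
    """Bell-shaped lift model (Equation 1); p = [x_th, y_th, W0, R0]."""
    return p[2] * np.exp(-((x - p[0])**2 + (y - p[1])**2) / p[3]**2)

def lift_jac(p, x, y):
    """Row Jacobian of the lift model w.r.t. p (EKF linearization)."""
    x_th, y_th, W0, R0 = p
    d2 = (x - x_th)**2 + (y - y_th)**2
    e = np.exp(-d2 / R0**2)
    return np.array([[2*W0*(x - x_th)/R0**2 * e, 2*W0*(y - y_th)/R0**2 * e,
                      e, 2*W0*d2/R0**3 * e]])

def ekf_update(mu, Sigma, x, y, w_obs, r=0.04):
    """Static-dynamics EKF correction of the thermal belief (Eqs. 6-8)."""
    H = lift_jac(mu, x, y)
    K = Sigma @ H.T / (H @ Sigma @ H.T + r)
    return mu + (K * (w_obs - lift(mu, x, y))).ravel(), (np.eye(4) - K @ H) @ Sigma

def sim_traj(x, y, psi, v, phi_deg, dt, horizon):
    """Idealized coordinated turn at bank angle phi: turn rate g*tan(phi)/v.
    (The real controller also models roll-in dynamics; see the Appendix.)"""
    pts, psi_dot = [], G * np.tan(np.radians(phi_deg)) / v
    for _ in range(int(horizon / dt)):
        psi += psi_dot * dt
        x, y = x + v * np.cos(psi) * dt, y + v * np.sin(psi) * dt
        pts.append((x, y))
    return pts

def pomdsoar_step(mu, Sigma, x, y, psi, v=9.0, dt=1.0, horizon=5.0,
                  n=10, conf_thres=50.0, seed=0):
    """One receding-horizon decision over discrete bank angles: EXPLOIT
    (maximize mean simulated altitude gain) when tr(Sigma) is below the
    confidence threshold, else EXPLORE (minimize mean posterior trace
    after imaginary EKF updates along each candidate arc)."""
    rng = np.random.default_rng(seed)
    exploit = np.trace(Sigma) < conf_thres
    scores = {}
    for phi in (-45, -30, -15, 0, 15, 30, 45):
        traj = sim_traj(x, y, psi, v, phi, dt, horizon)
        vals = []
        for _ in range(n):
            p = rng.multivariate_normal(mu, Sigma)  # an "imaginary thermal"
            p[3] = max(p[3], 1.0)                   # guard: physical radius
            if exploit:
                vals.append(sum(lift(p, *pt) * dt for pt in traj))
            else:
                m, S = mu.copy(), Sigma.copy()
                for pt in traj:                     # imaginary belief updates
                    m, S = ekf_update(m, S, *pt, lift(p, *pt))
                vals.append(np.trace(S))
        scores[phi] = np.mean(vals)
    return (max if exploit else min)(scores, key=scores.get)
```

With an uncertain belief, this picks the arc that most shrinks the expected posterior covariance trace; once the trace drops below the threshold, it switches to picking the arc with the highest expected integrated lift, mirroring lines 4, 21, and 35 of Algorithm 1.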
The choice of ArduSoar as a baseline is motivated by several considerations. First, despite its simplicity, it performs well at higher altitudes. Second, it incorporates central ideas from several other strong works, including Allen and Lin [6], Edwards [16], and Hazard [23]. Third, the availability of source code, documentation, parameters, and hardware for it eliminates from our experiments potential performance differences due to hardware capabilities, implementation tricks, or mistuning.

A field study of a thermalling controller is complicated by exogenous factors such as weather, so before presenting the results we elaborate on the evaluation methodology.

Equipment. We used two identical off-the-shelf Radian Pro remote-controllable sailplanes as sUAV airframes. They are made of styrofoam, have a 2m wingspan and 1.15m length, and carry a motor that can run for a limited duration. To enable them to fly autonomously, we reproduced the setup from Tabor et al. [38], installing on each a 3DR Pixhawk flight controller (32-bit 168MHz ARM processor, 256KB RAM, 2MB flash memory), a GPS, and other peripherals, as shown in Figure 4. A human operator could take over control at will using an X9D+ remote controller. All onboard electronics and the motor were powered by a single 3-cell 1300 mAh LiPo battery.

Fig. 4: Radian Pro sUAV's electronic equipment.

Software. Both Radian Pros ran the ArduPlane autopilot modified to include POMDSoar, as described in the Implementation Details section of the Appendix. During each flight, POMDSoar was enabled on one Radian Pro, and ArduSoar was enabled on the other. We used Mission Planner v1.3.49 [2] as ground control station (GCS) software to monitor flight telemetry in real time. ArduPlane was tuned separately on each airframe using a standard procedure [1]. The parameters not affected by tuning were copied from Tabor et al. [38]'s setup, including the target airspeed of 9 m/s. The airspeed sensors were recalibrated before every flight.

Location. Flights took place at two test sites, denoted Valley and Field, located in the Northwestern USA around 47.6°N, 122°W, approximately 16km apart, each ≈ 700m in diameter. Figure 5 shows their layout, waypoint tracks, and geofences.

Flight conditions. All flights for this experiment took place between October 2017 and January 2018. Regional weather during this season is generally poor for thermalling, with low daily temperatures and small daily temperature amplitudes (a major factor for thermal existence and strength), nearly constant cloud cover, and frequent gusty winds and rain. All flights were carried out in dry but at best partly cloudy weather, at temperatures ≤ 14°C and daily temperature amplitudes ≤ 9°C. All took place in predominant winds between 2 and 9 m/s; in ≈ 25% of the missions, the wind was around 7 m/s – 78% of our sUAVs' 9 m/s airspeed during thermalling and off-motor glides.

Flight constraints. Due to regulations, all flights were geofenced as shown in Figure 5 and restricted to low altitudes: 180m AGL at the Valley site and 160m AGL at the Field site. To avoid collisions with ground obstacles, the minimum autonomous flight altitude was 30m and 50m AGL, respectively.

Mission profile. Each mission consisted of two Radian Pros, one using POMDSoar and the other using ArduSoar as the thermalling controller, taking off within seconds of each other and repeatedly flying the same set of waypoints at a given test site counterclockwise from the Home location for as long as their batteries lasted, deviating from this path only during thermal encounters. Figure 6a explains the mission pattern. Each flight was fully autonomous until either the sUAV's battery voltage, as reported in the telemetry, fell below 3.3V/cell for 10s during a motor-off glide, or for 10s was insufficient to continue a motor-on climb. At that point, the mission was considered over. A human operator would put the sUAV into the Fly-By-Wire-A mode and land it.

Eliminating random factors. Live evaluation of a component as part of an end-to-end system is always affected by factors exogenous to the component. The Appendix details the measures we took to eliminate many of these factors, including the presence of non-thermal lift, potential systematic bias due to minor airframe and battery differences, and high outcome variance due to the chance of finding thermals.

(a) The Field test site and POMDSoar's typical flight path at it. (b) The Valley test site and ArduSoar's typical flight path at it.

Fig. 5: The Field and Valley test site layouts and typical mission flight paths due to POMDSoar and ArduSoar. When not thermalling (AUTO mode), the sUAVs follow a fixed sequence of waypoints, resulting in a pentagon-shaped path at the Field site and a triangle-shaped one at the Valley site. During thermalling, the sUAVs are allowed to deviate from this path anywhere within the geofenced region shown in grey. In the thermalling mode (FBWB for POMDSoar, LOITER for ArduSoar; green sections of the paths), POMDSoar does a lot of exploration, yielding irregularly shaped, meandering trajectories. ArduSoar, due to a more rigid thermalling policy, yields spiral-shaped paths that result in little exploration. Both controllers are forced to give up thermalling and switch to AUTO mode if the sUAV breaches the geofence.
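The battery-based mission-termination rule above (voltage below 3.3 V/cell for 10 s) amounts to a debounced threshold check. A minimal sketch of such a check, assuming a fixed telemetry sampling period (the function name and the 10 Hz sampling rate are illustrative assumptions, not autopilot code):

```python
def mission_over(cell_voltages, threshold=3.3, window_s=10.0, dt=0.1):
    """Return True once per-cell voltage has stayed below `threshold`
    for `window_s` consecutive seconds, given samples taken every `dt` s."""
    needed = int(window_s / dt)  # consecutive low samples required
    run = 0
    for v in cell_voltages:
        run = run + 1 if v < threshold else 0  # reset on any recovery
        if run >= needed:
            return True
    return False
```

A single voltage spike back above the threshold resets the window, so brief sag under motor load does not end the mission prematurely.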

Fig. 6: Altitude vs. time plots for thermalling and baseline flights illustrating the mission profile. (a) Altitude plot for a typical thermalling flight. (b) Altitude plot for a typical baseline measurement flight. After being hand-tossed into the air in AUTO mode, each Radian automatically started its motor and climbed to the altitude given by SOAR_ALT_CUTOFF. At that altitude, it would shut down the motor and glide down, still in AUTO mode, until it either reached SOAR_ALT_MIN or detected a thermal. In the latter case, it would abandon waypoint following and enter a thermalling mode (FBWB for POMDSoar, LOITER for ArduSoar) until it gained altitude up to SOAR_ALT_MAX, descended to SOAR_ALT_MIN if the thermalling attempt was unsuccessful, or hit the geofence. In all these cases, the AUTO mode would engage automatically, forcing the sUAV to give up thermalling, guiding it to the next waypoint, and turning on the motor if necessary to climb from SOAR_ALT_MIN to SOAR_ALT_CUTOFF. For the Field and Valley test sites, SOAR_ALT_MIN is 50m and 30m, SOAR_ALT_CUTOFF is 110m and 130m, and SOAR_ALT_MAX is 160m and 180m, respectively.
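The altitude-triggered mode switching described in this caption can be sketched as a small state machine. This is a simplified illustration, not ArduPlane's actual code; the mode names and the `next_mode` function are hypothetical, and the thresholds shown are the Field site's values:

```python
# Field site thresholds from the mission profile (meters AGL).
SOAR_ALT_MIN, SOAR_ALT_CUTOFF, SOAR_ALT_MAX = 50.0, 110.0, 160.0

def next_mode(mode, alt, thermal_detected, geofence_breached):
    """One decision step of the mission's altitude/mode logic.
    Modes: 'AUTO_CLIMB' (AUTO, motor on), 'AUTO_GLIDE' (AUTO, motor off),
    'THERMAL' (FBWB for POMDSoar / LOITER for ArduSoar)."""
    if mode == 'AUTO_CLIMB':
        # Motor-on climb ends at the cutoff altitude.
        return 'AUTO_GLIDE' if alt >= SOAR_ALT_CUTOFF else 'AUTO_CLIMB'
    if mode == 'AUTO_GLIDE':
        if thermal_detected:
            return 'THERMAL'
        # Glide ends at the floor, where the motor restarts.
        return 'AUTO_CLIMB' if alt <= SOAR_ALT_MIN else 'AUTO_GLIDE'
    # THERMAL: give up on reaching the ceiling or breaching the geofence
    # (AUTO re-engages, motor off), or on sinking to the floor (motor on).
    if alt >= SOAR_ALT_MAX or geofence_breached:
        return 'AUTO_GLIDE'
    if alt <= SOAR_ALT_MIN:
        return 'AUTO_CLIMB'
    return 'THERMAL'
```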

Fig. 7: Relative gains in flight time over the baseline (Equation 11, y-axis: % gain over the baseline) for POMDSoar (P) and ArduSoar (A) across 14 flights at the Valley (V) and Field (F) test sites. POMDSoar outperformed ArduSoar overall, yielding higher gains in 11 flights, losing in 1, and ending 2 in a draw.

Performance measure. We compare POMDSoar and ArduSoar in terms of the relative increase in flight duration they provide. Namely, for a flight at test site $S$ where controller $C^{th}$ ran on airframe $A$ and used battery $B$, we compute

$$\mathrm{RelTimeGain}_{A,B,S}(C^{th}) = \frac{\mathrm{FlightTime}_{A,B,S}(C^{th})}{\mathrm{BaselineFlightTime}_{A,B,S}} \qquad (11)$$

The $\mathrm{BaselineFlightTime}_{A,B,S}$ values are averages over the durations of a series of separate flights in calm conditions with thermalling controllers turned off (see the Appendix).

Results. Our comparison is based on 14 two-sUAV thermalling flights performed in the above setup. Figure 7 shows the results. Importantly, they are primarily indicative of the controllers' advantages over each other, rather than of each controller's absolute potential to extend flight time, due to the altitude and geofence restrictions that often forced the sUAVs out of thermals, and the deliberate lack of a thermal-finding strategy.

Qualitatively, as Figure 7 indicates, POMDSoar outperformed ArduSoar in 11 out of 14 flights, and did as well in 2 of the remaining 3. Moreover, the flight time gains POMDSoar provides are appreciably larger compared to ArduSoar's. The data in the "Effect of baseline correction" subsection of the Appendix adds more nuance. It shows that without baseline correction (Equation 11), the results look somewhat different, even though POMDSoar still wins in 11 flights.

These results agree with our hypothesis that in turbulent low-altitude thermals, POMDP-driven exploration and action selection are critical for taking advantage of whatever lift there is. Likely due to active exploration, POMDSoar's trajectories can be much "messier" than ArduSoar's; see Figure 5.
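As a worked example of the performance measure in Equation 11, consider flight 1(F)'s raw times from Table I in the Appendix: the baseline-corrected gain is simply each controller's flight time divided by the matching airframe/battery/site baseline.

```python
def rel_time_gain(flight_time_min, baseline_flight_time_min):
    """Equation 11: a controller's flight time divided by the baseline
    flight time for the same airframe, battery, and test site."""
    return flight_time_min / baseline_flight_time_min

# Flight 1(F) from Table I: POMDSoar (on YT) flew 32 min against a 25-min
# baseline; ArduSoar (on BT) flew 33 min against a 21-min baseline.
pomdsoar_gain = rel_time_gain(32, 25)  # 1.28
ardusoar_gain = rel_time_gain(33, 21)  # about 1.57
```

Despite the nearly equal absolute durations (33 vs. 32 minutes), baseline correction shows ArduSoar clearly ahead in this particular flight, which is why the measure matters.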

VII. CONCLUSION

This paper has presented a POMDP-based approach to autonomous thermalling. Viewing this scenario as a POMDP has allowed us to analyze it in a principled way, to identify the assumptions that make our approach feasible, and to show that this approach naturally makes deliberate environment exploration part of the thermalling strategy. This part is missing from prior work but is very important for successful thermalling in turbulent low-altitude conditions, as our experiments have demonstrated. We have presented a light-weight thermalling algorithm, POMDSoar, deployed it onboard a sUAV equipped with a computationally constrained Pixhawk flight controller, and conducted an extensive field study of its behavior. Our experimental setup is easily replicable and minimizes the effect of external factors on soaring controller evaluation. The study has shown that in challenging low-altitude thermals, POMDSoar outperforms the soaring controller included in ArduPlane, a popular open-source sUAV autopilot. Despite encouraging exploration, due to Pixhawk's computational constraints POMDSoar makes many approximations. However, companion computers on larger sUAVs may be able to run a full-fledged POMDP solver in real time. Design and evaluation of a controller based on solving the thermalling POMDP near-optimally, e.g., using an approach similar to Slade et al. [36]'s, is a direction for future work.

Acknowledgements. We would like to thank Eric Horvitz, Chris Lovett, Shital Shah, Debadeepta Dey, Ashish Kapoor, Nicholas Lawrance, and Jen Jen Chung for thought-provoking discussions relevant to this work.

VIII. APPENDIX

A. Implementation Details

To make POMDSoar's implementation configurable, we have added several parameters used by POMDSoar to ArduPlane's SOAR_* parameter group. They have an effect only if thermalling functionality is turned on by setting the SOAR_ENABLE parameter to 1 [38]. The new SOAR_POMDP_ON parameter determines whether the sUAV should use POMDSoar or ArduSoar for thermalling. Similar to ArduSoar adapting ArduPilot's LOITER mode for thermalling, we implemented POMDSoar in the Fly-By-Wire-B (FBWB) mode. This mode is similar to AUTO, but doesn't try to follow waypoints. In particular, POMDSoar relies on the fact that ArduPlane in FBWB will control the pitch and yaw channels to execute a coordinated turn if a roll angle (POMDSoar's output) is provided as input.

POMDSoar's performance in practice critically relies on the match between the actual turn trajectory (a sequence of 2-D locations) the sUAV will follow for a commanded bank angle and the turn trajectory predicted by our controller for this bank angle and used by POMDSoar during planning. To predict the sequences of locations corresponding to different actions' trajectories, our implementation uses differential Equations 12 through 15:

$$L_p = -K_d \, \frac{C_{lp}\,\dot\phi}{2v} \qquad (12)$$

$$\ddot\phi = \frac{K_a \delta_a - L_p}{I_x} \qquad (13)$$

$$\dot\psi = \frac{g \tan\phi}{v} \qquad (14)$$

$$\dot{\vec{p}}^{\,u} = (v \sin\psi,\; v \cos\psi) \qquad (15)$$

The meanings of the variables and the constants in these equations, as well as the constants' corresponding parameter names in ArduPlane and their values for Radian Pro airframes, are as follows:

• ψ is the aircraft heading in radians.
• φ is the aircraft bank angle in radians.
• L_p is the roll damping moment.
• I_x is the moment of inertia about the aircraft roll axis. In ArduPlane, the corresponding parameter we have added is SOAR_I_MOMENT. For our Radian Pro sUAVs, I_x = m_radian·k² = 0.00257482, where m_radian = 1.2kg is the measured mass of our modified Radian Pro and k = 0.05 is estimated from the distribution of area in the lateral cross section.
• v is the airspeed in m/s.
• C_lp is the roll damping derivative. In ArduPlane, the corresponding new parameter is SOAR_ROLL_CLP. For a Radian Pro, C_lp = −(1/4) · πAR / (√(AR²/4 + 4) + 2) = −1.12808704, where AR = 11.8645073263 is the wing aspect ratio calculated from dimensional measurements of the Radian Pro's wing.
• K_d is a coefficient accounting for the wing span and other aerodynamic effects on roll damping. ArduPlane's corresponding parameter is SOAR_K_ROLLDAMP. For a Radian Pro, K_d = 0.41073588.
• δ_a = f_φPID(φ_a − φ) is the aileron deflection signal ∈ [−1, 1], where f_φPID is ArduPlane's roll-channel PID controller function and φ_a is the commanded bank angle for the action.
• K_a is a coefficient accounting for the rolling forces due to aileron deflection. In ArduPlane, the respective parameter is SOAR_K_AILERON. For a Radian Pro, K_a = 1.448331.
• g is the acceleration due to gravity.

Note that L_p and δ_a depend on ArduPlane's PID controller parameters, and I_x, C_lp, K_d, and K_a are specific to the airframe type. The latter constants need to be fitted to data gathered by entering the sUAV into turns at various bank angles.

In order to predict a sequence of locations that constitute an action's trajectory given the sUAV's current state, our thermalling controller implementation solves Equations 12 to 15 using numerical integration starting at that state, with a time step of 0.02s for the action's duration. In the explore mode, an action's duration is given by ArduPlane's SOAR_POMDP_HORI parameter; in the exploit mode, it is determined by a product of ArduPlane's parameters, SOAR_POMDP_HORI · SOAR_POMDP_EXT. In our experiments, we used SOAR_POMDP_HORI = 4s and SOAR_POMDP_EXT = 3. Recall that the sUAV's current location – the starting location of every trajectory – is (0,0). When generating an action trajectory for planning, our implementation records φ and p⃗^u, the sUAV's (simulated) location in the air mass's reference frame, every 0.2s along the simulated trajectory. The oversampling matches the 50Hz update rate of the ArduPlane control loop, which is required for output equivalence from the simulation PID function. It also reduces error and avoids instability arising from numerical integration. The resulting location sequence for the given action is then passed to POMDSoar.

With the above method, we have achieved a difference of < 1 meter between the predicted and actual trajectories. Figure 3 shows the match between the model-predicted and actual bank and aileron deflection angle evolutions for several roll commands, and Figure 8 shows the match between predicted and actual multi-action trajectories typically generated during thermalling.

Importantly, depending on the chosen range of target bank angles that encode the action set, some of these bank angles may sometimes not be achievable by ArduPlane due to its overly conservative built-in stall prevention mechanism. E.g., if the commanded bank angle is 45°, ArduPlane's stall prevention may override it and only allow the sUAV to bank at 40°, which will cause a discrepancy between the predicted action trajectory and the actual one. In order to avoid this problem, we have introduced the SOAR_NO_STALLPRV parameter to ArduPlane. When set to 1, it turns off stall prevention during thermalling. Using it requires care, since stalls resulting from extreme bank angles can be very difficult to recover from.

B. Experiment Methodology Details and Additional Results

As described in the "Empirical Evaluation" section, our live experiment consisted of flying two identical Radian Pro airframes (one running POMDSoar, the other running ArduSoar) simultaneously along the same set of waypoints. In order to reduce randomness and potential bias in the outcomes of these flights, we have taken measures to eliminate the following confounding factors from our evaluation:

• Systematic bias due to airframe differences. Despite undergoing the same tuning procedure, one airframe may perform systematically worse than the other due to minor physical differences such as discrepancies in the center of gravity positions, AHRS calibration results, current draw of the electronic equipment (especially the motor), and even dents in the airfoil. To minimize their effect, after each flight that counted towards this experiment's results we switched the assignment of thermalling controllers to airframes. At each test site, POMDSoar and ArduSoar completed an equal number of experimental flights on each sUAV.

• Differences in battery properties. Similar to the previous factor, using batteries with the same nominal capacity may nonetheless result in different flight times under the same conditions. This is partly due to slight variability in the battery manufacturing process, and partly due to divergence of battery properties after several recharge cycles. In order to eliminate battery performance from the picture, for each combination of airframe A, test site (waypoint sequence) S, and battery B used in the experiments, on several calm days we measured flight duration with thermalling controllers turned off and recorded the resulting flight times. For each A, S, and B, we call the average of these timings the baseline flight time, denote it BaselineFlightTime_{A,B,S}, and account for it in our performance measure for each thermalling flight (see Equation 11). In effect, each BaselineFlightTime_{A,B,S} is a flight duration estimate for a given A, B, and S combination when the sUAV is unaided by a thermalling controller.

• Wave and other kinds of lift. Although the controllers aboard our sUAVs were designed specifically for thermalling, they could chance upon rising air masses of a different nature. The combination of hills, trees, and wind in the test site areas could occasionally give rise to orographic and wave lift. Data gathered in such conditions could distort the picture of POMDSoar's and ArduSoar's comparative ability to exploit thermals. To eliminate this risk, we avoided flying when we suspected wave lift at the test sites. However, given the virtual indistinguishability of orographic lift from low-altitude thermals, which tend to be irregular and turbulent, we included flights that may have involved orographic soaring in the results.

Fig. 8: Comparison between actual flight data and its predictions by the simulation model used in POMDSoar for the same input roll commands, over 20s of flight.

Flight ID:                                 1(F)    2(F)    3(F)    4(F)    5(F)    6(F)    7(F)    8(F)    1(V)    2(V)    3(V)    4(V)    5(V)    6(V)
Airframe with POMDSoar:                    YT      BT      YT      YT      BT      YT      BT      BT      BT      YT      YT      YT      BT      BT
BT flight time, minutes (baseline in
parentheses):                              33(21)  37(30)  30(26)  27(25)  46(30)  27(26)  35(30)  39(26)  41(27)  29(25)  27(25)  27(25)  39(27)  37(25)
YT flight time, minutes (baseline in
parentheses):                              32(25)  31(30)  48(29)  40(27)  38(30)  37(29)  35(30)  34(29)  36(27)  45(25)  38(25)  32(25)  39(27)  32(25)

TABLE I: Raw flight times for the Valley (V) and Field (F) test sites for the YellowTail (YT) and BlackTail (BT) sUAV airframes. Each airframe performed an equal number of flights at each test site running each thermalling controller (4 flights per controller per airframe at the Field, 3 per controller per airframe at the Valley). For each flight, the battery used was recorded. Baseline flight times are the flight times for the combination of that battery and that airframe at that test site without thermalling. These flight times were recorded in a series of separate flights on still days with thermalling turned off.

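The trajectory prediction described in the Implementation Details above (Equations 12 to 15, Euler-integrated at the 0.02s control-loop step with samples recorded every 0.2s) can be sketched as follows, using the Radian Pro constants from the text. This is a simplified sketch: ArduPlane's roll-channel PID f_φPID is replaced here by an assumed proportional law, so it illustrates the constants' interplay rather than the exact onboard response.

```python
import math

# Radian Pro constants from the Implementation Details section.
C_LP = -1.12808704   # roll damping derivative (SOAR_ROLL_CLP)
K_D = 0.41073588     # roll damping coefficient (SOAR_K_ROLLDAMP)
K_A = 1.448331       # aileron effectiveness (SOAR_K_AILERON)
I_X = 0.00257482     # roll moment of inertia (SOAR_I_MOMENT)
G = 9.81             # acceleration due to gravity, m/s^2

def predict_trajectory(phi_cmd, v=9.0, duration=4.0, dt=0.02, sample_dt=0.2):
    """Euler-integrate Eqs. 12-15 from (0,0) in the air mass's frame,
    heading north, returning (phi, x, y) samples every sample_dt seconds."""
    phi = phi_dot = psi = x = y = 0.0
    samples = [(phi, x, y)]
    n_steps = int(round(duration / dt))
    record_every = int(round(sample_dt / dt))
    for i in range(1, n_steps + 1):
        # Stand-in for f_phiPID (an assumption): proportional law, output
        # clamped to the aileron signal range [-1, 1].
        delta_a = max(-1.0, min(1.0, 0.01 * (phi_cmd - phi)))
        l_p = -K_D * C_LP * phi_dot / (2.0 * v)   # Eq. 12: roll damping moment
        phi_ddot = (K_A * delta_a - l_p) / I_X    # Eq. 13: roll acceleration
        phi_dot += phi_ddot * dt
        phi += phi_dot * dt
        psi += (G * math.tan(phi) / v) * dt       # Eq. 14: coordinated turn
        x += v * math.sin(psi) * dt               # Eq. 15: air-mass-frame
        y += v * math.cos(psi) * dt               #         position rates
        if i % record_every == 0:
            samples.append((phi, x, y))
    return samples
```

With a zero bank command the predicted path runs straight north; a positive command rolls the model into a rightward turn whose curvature follows Eq. 14.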
• High outcome variance due to the chance of finding thermals. Making good use of even a single thermal could substantially increase a sUAV's flight time. Therefore, discrepancies in the chances of finding thermals – a factor orthogonal to the quality of the thermalling controller itself – could significantly affect flight duration. The design of our mission profile (see the "Mission profile" subsection of the "Empirical Evaluation" section in the main paper) is motivated exactly by minimizing flight time variance due to this factor. The relatively small size of our test sites (both < 700m in diameter), the small range of altitudes, fixed waypoint courses, identical thermal mode triggering mechanisms, and simultaneous flights by both sUAVs mean that both controllers regularly pass through the same locations at similar altitudes within at most minutes of each other. In addition, we don't use any explicit thermal finding or altitude loss management strategies such as MacCready's speed-to-fly or memorizing the locations of previous thermal encounters. All of this results in both sUAVs often activating their thermalling controllers repeatedly at the same set of locations during a given flight, sometimes at the same time, indicating that they are likely catching the same thermals.

In addition, since both controllers are activated using identical conditions, we eliminated flights where one of the sUAVs activated its thermalling controller at least once and the other didn't at all, because in those cases the difference in flight duration was purely due to the probabilistic nature of thermal detection. There were only a few such flights, a testament to the robustness of the above mission profile, airframe tuning, and thermal mode triggering strategy.

Extended Results

Qualitatively, as Figure 7 indicates, POMDSoar outperformed ArduSoar in 11 out of 14 flights in terms of the measure in Equation 11, and did as well in 2 of the remaining 3. Moreover, the flight time gains POMDSoar provides are appreciably larger compared to ArduSoar's.

Moreover, Table I appears to indicate that in terms of win/loss outcomes, baseline correction using Equation 11 didn't make a difference in any flight. Indeed, every time, the sUAV with the bigger absolute flight duration also happened to win according to Equation 11's relative gain criterion. These results agree with our hypothesis that in turbulent low-altitude thermals, POMDP-driven exploration and action selection are critical for taking advantage of whatever lift there is.

Quantitatively, however, Table I shows that baseline correction makes a big difference when analyzing the relative gains themselves. For instance, based on absolute numbers, one might think that during flight 1(F), the only one that ArduSoar won, ArduSoar won by only a narrow margin. However, as the baseline-corrected results in Figure 7 demonstrate, this is not the case – in that particular flight, ArduSoar's performance was significantly better than POMDSoar's. Therefore, baseline correction is important to apply in field experiments with many confounding factors, such as when comparing thermalling controllers.

REFERENCES

[1] ArduPlane. URL http://ardupilot.org/plane/index.html.
[2] Mission Planner. URL http://ardupilot.org/planner/.
[3] Z. Akos, M. Nagy, S. Leven, and T. Vicsek. Thermal soaring flight of birds and unmanned aerial vehicles. Bioinspiration & Biomimetics, 5(4), 2010.
[4] M. Allen. Updraft model for development of autonomous soaring uninhabited air vehicles. In Proc. of the 44th AIAA Aerospace Sciences Meeting and Exhibit, 2006. doi: https://doi.org/10.2514/6.2006-1510. URL https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20060004052.pdf.
[5] M. J. Allen. Autonomous soaring for improved endurance of a small uninhabited air vehicle. In Proc. of the 43rd AIAA Aerospace Sciences Meeting and Exhibit, 2005.
[6] M. J. Allen and V. Lin. Guidance and control of an autonomous soaring UAV. Technical report, 2007. URL https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20070022339.pdf.
[7] K. Andersson, I. Kaminer, V. Dobrokhodov, and V. Cichella. Thermal centering control for autonomous soaring; stability analysis and flight test results. Journal of Guidance, Control, and Dynamics, 35(3):963–975, 2012.
[8] K. J. Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
[9] J. J. Bird, J. W. Langelaan, C. Montella, J. Spletzer, and J. Grenestedt. Closing the loop in dynamic soaring. In Proc. of the 54th AIAA Guidance, Navigation and Control Conference and Exhibit, 2014.
[10] A. Chakrabarty and J. W. Langelaan. Energy-based long-range path planning for soaring-capable UAVs. Journal of Guidance, Control and Dynamics, 34(4):1002–1015, 2011.
[11] J. J. Chung, N. R. J. Lawrance, and S. Sukkarieh. Learning to soar: Resource-constrained exploration in reinforcement learning. The International Journal of Robotics Research, 34(2):158–172, 2015.
[12] S. Daugherty and J. W. Langelaan. Improving autonomous soaring via energy state estimation and extremum seeking control. In Proc. of the 52nd AIAA Aerospace Sciences Meeting and Exhibit, 2014.
[13] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Proc. of the Conference on Uncertainty in Artificial Intelligence (UAI), 1999.
[14] N. T. Depenbusch, J. J. Bird, and J. W. Langelaan. The AutoSOAR autonomous soaring aircraft part 2: Hardware implementation and flight results. Journal of Field Robotics. doi: 10.1002/rob.21747. URL http://onlinelibrary.wiley.com/doi/10.1002/rob.21747/full.
[15] M. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, Amherst, MA, 2002.
[16] D. Edwards. Implementation details and flight test results of an autonomous soaring controller. In Proc. of the 46th AIAA Guidance, Navigation and Control Conference and Exhibit, 2008. doi: https://doi.org/10.2514/6.2008-7244. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.559.4025&rep=rep1&type=pdf.
[17] D. J. Edwards and L. M. Silberberg. Autonomous soaring: The Montague cross-country challenge. Journal of Aircraft, 47(5):1763–1769, 2010. doi: https://doi.org/10.2514/1.C000287. URL https://arc.aiaa.org/doi/abs/10.2514/1.C000287.
[18] N. Filatov and H. Unbehauen. Survey of adaptive dual control methods. IEEE Control Theory and Applications, 147:118–128, 2000.
[19] A. Fisher, M. Marino, R. Clothier, S. Watkins, L. Peters, and J. L. Palmer. Emulating avian orographic soaring with a small autonomous glider. Bioinspiration & Biomimetics, 11(1), 2016. URL http://stacks.iop.org/1748-3190/11/i=1/a=016002.
[20] C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: Theory and practice - a survey. Automatica, 25(3):335–348, 1989.
[21] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–492, 2015.
[22] A. Guez, D. Silver, and P. Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48:841–883, 2013. URL http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/bamcp-jair.pdf.
[23] M. W. Hazard. Unscented Kalman filter for thermal parameter identification. In Proc. of the 48th AIAA Aerospace Sciences Meeting and Exhibit, 2010.
[24] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, volume Multi Sensor Fusion, Tracking and Resource Management II, pages 182–193, 1997.
[25] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D):35–45, 1960.
[26] J. W. Langelaan. Long distance/duration trajectory optimization for small UAVs. In Proc. of the 47th AIAA Guidance, Navigation and Control Conference and Exhibit, 2007.
[27] J. W. Langelaan. Gust energy extraction for mini- and micro-uninhabited aerial vehicles. In Proc. of the 46th AIAA Aerospace Sciences Meeting and Exhibit, 2008.
[28] E. Lecarpentier, S. Rapp, M. Melo, and E. Rachelson. Empirical evaluation of a Q-learning algorithm for model-free autonomous soaring. Les Journées Francophones sur la Planification, la Décision et l'Apprentissage pour la conduite de systèmes (JFPDA), 2017.
[29] P. Del Moral. Non linear filtering: Interacting particle solution. Markov Processes and Related Fields, 2(4):555–580, 1996.
[30] N. R. J. Lawrance and S. Sukkarieh. A guidance and control strategy for dynamic soaring with a UAV. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 3632–3637, 2009. URL http://ieeexplore.ieee.org/abstract/document/5152441/.
[31] N. R. J. Lawrance and S. Sukkarieh. Path planning for autonomous soaring flight in dynamic wind fields. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 2499–2505, 2011. URL http://ieeexplore.ieee.org/abstract/document/5979966/.
[32] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1025–1032, 2003.
[33] G. Reddy, A. Celani, T. J. Sejnowski, and M. Vergassola. Learning to soar in turbulent environments. In Proceedings of the National Academy of Sciences (PNAS), 2016.
[34] H. Reichmann. Cross-Country Soaring. 7th edition, 1993.
[35] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Proc. of the 23rd International Conference on Neural Information Processing Systems (NIPS), pages 2164–2172, 2010. URL http://dl.acm.org/citation.cfm?id=2997046.2997137.
[36] P. Slade, P. Culbertson, Z. Sunberg, and M. J. Kochenderfer. Simultaneous active parameter estimation and control using sampling-based Bayesian reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017. URL https://arxiv.org/abs/1707.09055.
[37] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[38] S. Tabor, I. Guilliard, and A. Kolobov. ArduSoar: an open-source thermalling controller for resource-constrained autopilots. arXiv, 2018. URL https://arxiv.org/abs/1802.08215.
[39] J. W. Telford. Convective plumes in a convective field. Journal of the Atmospheric Sciences, 27(3):347–358, 1970.
[40] J. Wharington. Autonomous Control of Soaring Aircraft by Reinforcement Learning. Doctoral thesis, Royal Melbourne Institute of Technology, Melbourne, Australia, 1998.
[41] J. Wharington and I. Herszberg. Control of a high endurance unmanned air vehicle. In Proc. of the 21st ICAS Congress, 1998.