Training an artificial bat: Modeling sonar-based obstacle avoidance using deep-reinforcement learning

A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science

Department of Electrical Engineering and Computer Science, College of Engineering and Applied Science, University of Cincinnati, November 2020

Author: Adithya Venkatesh Mohan
Chair: Dr. Dieter Vanderelst
Committee: Dr. Ali A. Minai, Dr. Zachariah Fuchs

Abstract

Recent evidence suggests that sonar provides bats only with limited information about the environment. Nevertheless, they can fly swiftly through dense environments while avoiding obstacles. Previously, we proposed a model of sonar-based obstacle avoidance that relied only on the interaural level difference of the onset of the echoes. In this paper, we extend this previous model. In particular, we present a model that (1) is equipped with a short-term memory of recent echo trains, and (2) uses the full echo train. Because handcrafting a controller to use more sonar data is challenging, we resort to machine learning to train a robotic model. We find that both extensions increase performance and conclude that these could be used to enhance our models of bat sonar behavior. We discuss the implications of our method and findings for both biology and bio-inspired engineering.

© 2020 by Adithya Venkatesh Mohan. All rights reserved.

Acknowledgments

First of all, thanks to Dr. Dieter Vanderelst for supporting me at each and every step of this work and throughout my Masters. Thanks to the University of Cincinnati for giving me this opportunity, especially the EECS (Electrical Engineering and Computer Science) Department of CEAS and the Biology Department for all the support. Thanks also to the Cognition, Action and Perception Labs for providing the equipment and computers necessary for the experiments. Special thanks to my family and friends for supporting me completely. Thanks to the 3305 Jefferson group (close friends) for making this graduate life fun-filled. The one mantra of life which I learnt from them is "Rakita rakita rakita oo ...": enjoy your life to the fullest no matter what.

Contents

Abstract ii

Copyright iii

Acknowledgments iv

List of Tables vii

List of Figures viii

1. Introduction 1

2. Background 6
   2.1. 3D reconstruction of the environment? 6
   2.2. Sonar for robots 9
   2.3. Biological plausibility of reinforcement learning 11

3. Methods 14
   3.1. Robot and echo generation 14
   3.2. Neural network and training 24
   3.2.1. Algorithmic details 27

   3.3. Conditions 29

4. Results 33

5. Discussion & Future Work 39
   5.1. Biology 40
   5.2. Engineering 41
   5.3. Future work 42

Bibliography 45

A. Prerequisite material 54
   A.1. Reinforcement Learning 54
   A.2. Q-Learning 57
   A.3. Double DQN 59
   A.4. Dueling Architecture 60
   A.4.1. Preference of D3QN - Dueling Double Deep Q-network 62
   A.5. Prioritized Experience Replay Memory 62

List of Tables

3.1. The most important hyperparameters used in training the robot. The learning rate and the batch size are parameters determining neural network optimization. The four bottom parameters pertain to the algorithm as shown in algorithm 1. 28

List of Figures

3.1. Illustration of the software architecture used in the current paper. Gazebo was used to simulate the robot and the environment. We used Gazebo to obtain a point cloud of the environment of the robot. These points were used to approximate the acoustics, i.e., to calculate the echo waveforms returning from the environment. The processed acoustic data was used as input to a neural network. The neural network was trained to learn the relationship between the current sensory state (echoes) and the reward function (a measure of the goodness of obstacle avoidance). The different components of this architecture are explained in more detail in the main text of the paper. 16
3.2. Illustration of the Kobuki-based TurtleBot used in the experiment: (a) top view and (b) side view. 17
3.3. Visualization of the 360 degree 2D LIDAR sensor used to get the shortest obstacle distance for the reward function. 19
3.4. Visualization of the point cloud data from the onboard 3D LIDAR sensor, which is used to approximate the sonar sensor. 20
3.5. Visualization of the combined directionality graph of the emitter and receiver: energy of the echo vs. azimuth and elevation at a constant distance. 22

3.6. Visualization of the entire echo generation and pre-processing pipeline, from top to bottom (1000-point 3D point cloud to the 24-value reduced echo waveform). 23

3.7. Example state St. Rows represent a short-term memory of past echo trains. The columns represent the 24 samples of the sub-sampled echo trains. The intensity of the echoes is encoded in the gray scale, with more intense echoes shown brighter. 25
3.8. Illustration of the Dueling Double Deep Q-Network (D3QN) neural network that was trained to control the robot, based on the echo data contained in the sensory state matrix St (see equation 3.2.1). The CNN filters have a shape of (1,3) and a stride of 1. This emulates a 1D convolution so that each echo is processed separately. The dueling architecture CNN has two outputs: one for the state value function V(S), which returns a single value, and one for the advantage function A(S, a), which is a vector of length 3, returning the advantage of taking each action. The Q-function is calculated using both of these: Q(S, a) = V(S) + A(S, a) [Wang et al., 2015]. The Dueling Double Deep Q-network is optimized using gradient descent (i.c., RMSprop). 28

3.9. Example state St given to condition 1, as explained above. All rows of the image are used, giving a memory of 3, and the complete echo train is given. 30
3.10. Example state St given to condition 2, the No Memory condition. The first 4 rows from the top are blacked out, as the t − 1 and t − 2 echoes from equation 3.2.1 are zeroed out. 31
3.11. Example state St given to condition 3, the Onset condition. All rows are used, i.e., the echoes at t − 1 and t − 2 are given, but the complete echo train is not: only the onset value of each echo train is kept. 31

4.1. Reward return vs. the time step t during the training period. (a) Memory condition: full memory (T=3) and complete echo train; (b) No Memory condition (T=1) with complete echo train; (c) Onset condition with full memory (T=3). 36
4.2. Example trajectories for each condition and each environment. Notice that the robot was trained in the Simple environment and evaluated in all three environments. The trajectories shown for the Memory and No Memory condition were successfully completed (i.e., 1000 steps of 15 cm each). In contrast, for the Onset condition, the trajectories are rather short. This is because the robot was less successful at negotiating the environments. Indeed, each of the three trajectories shown for this condition ended in a collision soon after the start (locations indicated with a star). The black scale bar in each of the top panels is 1 m long. 37

4.3. Results of the robot evaluation, for each condition (Memory, No Memory, and Onset). Each row depicts the results for a different environment. Top Row: Simple, Middle Row: Narrow, and Bottom Row: Circle. The results are averaged across 15 evaluation runs (each starting at a different location in the arena). (a, d, g) Violin plots of the distance to the nearest wall at each time step. (b, e, h) Stacked barplots showing the proportion of each of the three possible actions taken. (c, f, i) Barplots showing the distance covered without colliding. Error bars give the Standard Error (n = 15). Note that the maximum distance covered by the robot was 150 m (15 cm × 1000 steps, indicated by horizontal line). As the robot's speed was constant, the plots also show the time (in seconds) driven by the robots. 38

A.1. The Markov decision process flowchart, showing how the agent interacts with the environment through actions and how, in return, the environment transitions to a new state and returns a reward for that time step. 55
A.2. The unrolled Markov decision process for 3 time steps, showing how the interactions between the agent and the environment progress. 55

Chapter 1

Introduction

Bats are often assumed to infer the 3D position and identity of objects from the echoes they receive [e.g., Barchi et al., 2013, Moss and Surlykke, 2001, Schnitzler et al., 2003, Simmons, 2012, Ulanovsky and Moss, 2008, Clare and Holderied, 2015, Geipel et al., 2013]. This view on bat echolocation has its origin in experimental findings showing that bats can locate single targets (typically, a small sphere) with high accuracy [e.g., Simmons et al., 1983, Lawrence and Simmons, 1982, Wotton and Simmons, 2000]. However, the ability to locate isolated objects under favorable conditions does not necessarily imply that bats can reconstruct the layout of multiple complex reflectors (e.g., trees and shrubs) under natural conditions. In contrast to the artificial targets used in experiments, most natural objects, including vegetation and human-made objects, typically return a multitude of overlapping echoes [Yovel et al., 2011, 2009, Vanderelst et al., 2016, Warnecke et al., 2018, Kuc, 2020]. Over the past few years, we argued in several papers that it is highly questionable whether bats can interpret echoes from complex natural objects in terms of a 3D model or acoustic image [Vanderelst et al., 2015a, 2016, Steckel and Peremans, 2013, Mansour

et al., 2019a,b]. We argued that bat sonar is inherently limited in the amount of information it can extract about the structure of natural environments. The presumed ability of bats to reconstruct acoustic images from echoes is questionable due to factors such as their small field of view [Surlykke et al., 2009, Jakobsen et al., 2012, 2013] and low update rate [Holderied et al., 2006, Seibert et al., 2013]. However, the most severe barrier likely stems from the temporal resolution of the sonar system [Simmons et al., 1989, Wiegrebe and Schmidt, 1996, Surlykke and Bojesen, 1996]. Most objects, including vegetation and human-made objects, return a multitude of echoes [Yovel et al., 2011, 2009, Vanderelst et al., 2016, Warnecke et al., 2018]. The temporal integration in the cochlea introduces interference between these cascades of echoes. In psychophysical experiments, Wiegrebe and Schmidt [1996] observed a cochlear temporal integration constant of about 200 µs in Megaderma lyra. Simmons et al. [1989] derived an integration constant of about 200-400 µs for the bat Eptesicus fuscus. Echoes from points separated in time by less than the time constant of the auditory system will be integrated, and interference between them will deteriorate the spectral cues [Simmons et al., 1989], degrading the encoding of both object location and properties. This results in substantial loss of information during the cochlear transduction of acoustic energy to spike trains [Reijniers and Peremans, 2010]. However, the temporal resolution of the bat sonar system is probably even worse than suggested by the cochlear temporal integration constant. Indeed, the integration constant is a conservative upper boundary for the temporal resolution of the auditory system. Masking effects have been found to extend for substantially longer intervals than the integration time (both in bats and humans [Moore, 2012]). For example, Geberl [2013] found temporal masking in Phyllostomus discolor for a temporal separation between echoes of up to 6 ms (or 100 cm).

More recently, Geberl et al. [2019] found that bats were unable to resolve objects spaced by less than 40°. Finally, neurophysiological data suggest that closely spaced objects are not perceived individually by echolocating bats. In contrast, neural responses in the inferior colliculus to echo cascades from densely spaced objects represent a single extended stimulus [Warnecke et al., 2018], which explains several recent behavioral findings in bats [Knowles et al., 2015, Simmons et al., 2020]. It follows that bats, at least most of the time, are unable to reconstruct the 3D layout of the environment. Despite this informational constraint, bats exhibit astoundingly intelligent behavior in interaction with complex environments. For example, bats perform sonar-based obstacle avoidance in complete darkness [Griffin and Galambos, 1941, Griffin, 1958]. Previous models of this behavior [See Mansour et al., 2019a,b, for an overview] have assumed that bats localize obstacles and use this to plan a path around them. However, we have shown that a Braitenberg-like behavior using the interaural level difference of the echoes between the ears is sufficient to avoid obstacles [Mansour et al., 2019a,b, Vanderelst et al., 2015a]. In this model, the bat is assumed to compare the intensity of the echoes between the ears. It turns away from the ear that received the loudest echoes. This simple approach does not require the bat to localize the origin of echoes, which indeed might be impossible when faced with complex reflectors. In our previous work, the controller for the robotic or simulated bat used the loudness of the onset of the echo train. The model relied on the energy integrated over 1 ms following the arrival of the first detectable echo to steer away from obstacles. This 1 ms window could contain multiple (overlapping) echoes. While this strategy was successful, it implies that control decisions are only based on the nearest reflectors. Moreover, the model was memory-less and depended solely on the most recent echoes.

It is conceivable that using interaural level information beyond the echo train onset allows for improved performance. Also, we conjecture that giving the model a short-term memory of the most recently received echo trains results in increased obstacle avoidance performance. The downside of extending the model this way is that designing the controller becomes more complicated. Indeed, previously, we were able to handcraft the controller parameters (taking into account the aerodynamic constraints of bats). However, designing a controller that uses more data would require a substantial amount of tuning. For example, one needs to decide how echoes returning from further away are combined with nearer echoes to arrive at a single motor command. This extended model also takes into account the complete echo train received at the left and the right ear instead of discarding all echoes beyond the onset. To overcome the additional tuning such a model would require, we employ a machine learning approach, i.c., Reinforcement Learning (RL). Another area we wanted to focus on in this work is memory. Memory is important for intelligent behavior in general, as it allows an agent to take its past experience into account when deciding on an action that maximizes its return. In most real-life cases, the agent's observation comprises only a partial observation that does not contain complete information about the environment and therefore does not obey the Markov condition (in RL terms); this setting is termed a Partially Observable Markov Decision Process (POMDP) (further explained in the Appendix). To counter this, the agent needs some memory to incorporate previous experience into the decision-making process, allowing it to form a better representation of the environment and hence make better decisions. So, to test the effects of memory, we evaluate whether our previously presented model can be augmented by providing it with a short-term memory with a fixed window size of 3. The

short-term memory consists of both an observation memory and an action memory, which are explained in detail in later sections.

Chapter 2

Background

2.1. 3D reconstruction of the environment?

Popular literature often refers to bat echolocation as 'seeing with sound' [Surlykke et al., 2016]. Bats are assumed to infer the 3D position and identity of objects from the echoes they receive. This notion of bats being able to reconstruct an acoustic image of their surroundings from echoes is also maintained in the scientific literature [Barchi et al., 2013, Moss and Surlykke, 2001, Schnitzler et al., 2003, Simmons, 2012, Ulanovsky and Moss, 2008, Clare and Holderied, 2015, Geipel et al., 2013]. This view on bat echolocation, which we call the Acoustic Image Theory (AIT), has its origin in experimental findings showing that bats can locate single targets (typically, a small sphere) with high accuracy [Simmons et al., 1983, Lawrence and Simmons, 1982, Wotton and Simmons, 2000]. However, the ability to locate isolated objects under favorable conditions does not necessarily imply that bats can reconstruct the layout of multiple complex reflectors (e.g., trees and shrubs) under natural conditions. In contrast to the artificial targets used in experiments, most natural objects, including vegetation and human-made objects, typically return a multitude of overlapping echoes [Yovel et al., 2011, 2009, Vanderelst et al., 2016, Warnecke et al., 2018].

Over the past few years, we argued in several papers that it is highly questionable whether bats can interpret echoes from complex natural objects in terms of a 3D model or acoustic image [e.g., Vanderelst et al., 2015a, 2016, Steckel and Peremans, 2013, Mansour et al., 2019a,b]. We argued that bat sonar is inherently limited in the amount of information it can extract about the structure of natural environments. The presumed ability of bats to reconstruct acoustic images from echoes is questionable due to factors such as their small field of view [Surlykke et al., 2009, Jakobsen et al., 2012, 2013] and low update rate [Holderied et al., 2006, Seibert et al., 2013]. However, the most severe barrier stems from the temporal resolution of the sonar system [Simmons et al., 1989, Wiegrebe and Schmidt, 1996, Surlykke and Bojesen, 1996]. As said, most objects, including vegetation and human-made objects, return a multitude of echoes [Yovel et al., 2011, 2009, Vanderelst et al., 2016, Warnecke et al., 2018]. The temporal integration in the cochlea introduces interference between these cascades of echoes. In a psychophysical experiment, Wiegrebe and Schmidt [1996] observed a cochlear temporal integration constant of about 200 µs in Megaderma lyra. Simmons et al. [1989] derived an integration constant of about 200-400 µs for the bat Eptesicus fuscus. Echoes from points separated in time by less than the time constant of the auditory system will be integrated, and interference between them will deteriorate the spectral cues [Simmons et al., 1989], degrading the encoding of both object location and properties. This results in substantial loss of information during the cochlear transduction of acoustic energy to spike trains [Reijniers and Peremans, 2010]. However, the temporal resolution of the bat sonar system is probably substantially worse than suggested by the cochlear temporal integration constant. Indeed, the integration constant is a conservative upper boundary for the temporal resolution of the auditory system. In behavioral experiments, masking effects have been found to extend for much longer intervals than the integration time (both in bats and humans [Moore,

2012]). For example, Geberl [2013] found temporal masking in Phyllostomus discolor for a temporal separation between echoes of up to 6 ms (or 100 cm). More recently, Geberl et al. [2019] found that bats were unable to resolve objects spaced by less than 40°. Neurophysiological data confirm that closely spaced objects are not perceived individually by echolocating bats. In contrast, neural responses in the inferior colliculus to echo cascades from densely spaced objects represent a single extended stimulus [Warnecke et al., 2018], which explains several recent behavioral findings in bats [Knowles et al., 2015, Simmons et al., 2020]. It follows that bats, at least most of the time, are unable to extract much information about the 3D layout of the environment. Despite this informational constraint, bats exhibit astoundingly intelligent behavior in interaction with complex environments. For example, echolocating bats have excellent spatial memory [Barchi et al., 2013, Holland, 2007, Geva-Sagiv et al., 2015, Möhres and Neuweiler, 1966]. They can navigate to salient locations like roosts, foraging grounds, and water sources [Schnitzler et al., 2003, Helversen and Helversen, 2003]. Displaced bats deprived of sight have been found to return to their roost successfully [Stones and Branick, 1969, Williams et al., 1966]. Bats are highly maneuverable animals that can fly and forage along established routes in highly cluttered space [Barchi et al., 2013, Petrites et al., 2009]. Von Helversen and Winter [2005] discuss how Glossophaga soricina revisit the location of removed feeders down to the centimeter, even after a few days have passed. In conclusion, bats perform astonishingly intelligent sensorimotor behavior under conditions where sonar provides limited information about the structure of the environment. This makes bat sonar an excellent system to test whether the same principles underlying insect sensorimotor intelligence also explain intelligent behavior, under severe informational constraints, in other taxa.

2.2. Sonar for robots

The enormous tactical [Hundley and Gritton, 1994, Hassanalian and Abdelkefi, 2017, Petricca et al., 2011] and civilian benefits [Kendoul, 2012, Floreano and Wood, 2015, Hassanalian and Abdelkefi, 2017, Petricca et al., 2011] of Unmanned Aerial Vehicles (UAVs) have motivated large research efforts [e.g., Wood et al., 2012, Ma et al., 2013, Recchiuto et al., 2014]. Great strides have been taken in the fabrication and actuation of UAVs [Petricca et al., 2011, Bareiss et al., 2017]. This progress has resulted in ever smaller, more versatile fixed, rotary and flapping wing aircraft [e.g., Floreano and Wood, 2015, Liu et al., 2016, Recchiuto et al., 2014, Wood et al., 2012, Ma et al., 2013]. Various groups have proposed vision-based control algorithms and hardware for the new generation of nano-UAVs, inspired by insect vision [e.g., Franceschini et al., 2007, Franceschini, 2014, Briod et al., 2013, Duhamel et al., 2013]. Lightweight cameras perform well when lighting is good and reliable. Moreover, these passive sensors require little energy, and bio-inspired vision algorithms are computationally very efficient [Floreano and Wood, 2015]. Unfortunately, the quality of the signals captured by small lensed photoreceptors rapidly degrades when lighting is low. This makes vision an inappropriate sense for many applications. Nano-UAVs will be mostly deployed to (autonomously) inspect the interior of buildings and other structures. Indeed, in confined spaces, their high maneuverability and small size are most useful. Inside buildings and other human-made structures, the light will often be low and changeable. Moreover, the structures can be filled with smoke, dust, and other aerosols [to which sonar is relatively insensitive, Steckel et al., 2011]. Under these circumstances, vision will not be a suited sense, and active sensors should complement or replace vision-based approaches. Sonar is perhaps the most common sensor on autonomous robots. Sonar is one of the first sensors to have been used in autonomous robots [e.g., Brooks, 1986] and today it is still used in many commercial and experimental robots. Sonar sensors are entirely

insensitive to lighting conditions – even more so than Laser Scanners and TOF cameras – and can operate in spaces filled with smoke and dust [Steckel et al., 2011]. Also, several properties make sonar a promising sensor modality for nano-UAVs, including low cost, small form factor, low power consumption, and mechanical robustness [Müller et al., 2009, Steder et al., 2007]. Despite these advantages, applications of artificial in-air sonar are almost exclusively limited to simple ranging and obstacle detection [See Peremans et al., 2012, Kleeman and Kuc, 2008, for reviews]. However, this limited application domain is not due to a fundamental limitation of sonar. Indeed, the potential of sonar is demonstrated by the abilities of echolocating bats. More than fifty years of intense research on bats [Griffin, 1958] have shown that sonar is capable of supporting swift flight through dense vegetation [Holderied et al., 2006], navigation in changing environments [Barchi et al., 2013], object recognition [Von Helversen and Von Helversen, 2003] and airborne foraging [Griffin, 1958]. The enormous gap between artificial sonar applications and biological performance implies that, currently, we are not taking full advantage of the potential of sonar. This calls for a new approach to sonar for robots, inspired by bat echolocation research. We suggest that, just as the AIT is not a tenable theory about bio-sonar, the AIT should not drive the development of bio-inspired artificial sonar. In other words, a bio-inspired approach to sonar should not aim at reconstructing the (3D) layout of a scene or the shape of objects from the echoes. Not relying on a reconstructed image solves (or bypasses) many of the problems accompanying approaches to sonar based on inferring the geometric layout of the reflectors from the echoes [Kleeman and Kuc, 2008]. The sensorimotor strategies investigated in this proposal could be ported to UAVs as an alternative to the current approach to sonar-based control, which does not depend on inferring the geometric layout of the environment from echoes. In particular, this project

will demonstrate that a method relying on a direct mapping of task-specific sensor data to motor commands could close the performance gap between biological and artificial sonar, thus making sonar a sophisticated sensor for robotic control.

2.3. Biological plausibility of reinforcement learning

In this work, we choose to use RL algorithms to learn the controller for the robot. This is a natural choice, as RL algorithms have been attracting biologists and behavioral ecologists for behavioral cloning, mainly because RL is known for generating complex traits with simple rules (reward functions). RL's main advantage lies in the fact that it works well for studying behaviors and traits which emerge via natural selection, cue-driven-switch plasticity, and developmental selection, as opposed to some classical methods which focus more on fixed or static trait development [Frankenhuis et al., 2019]. Each organism faces a wide variety of tasks, like resource gathering, navigation, maintenance, and reproduction, which it must perform on a daily basis for its survival. These organisms have to constantly take decisions and actions based on their sensory inputs from the environment and the knowledge they possess about the environment [Frankenhuis et al., 2019, Neftci and Averbeck, 2019]. Also, these biological systems must accomplish these tasks in an environment that changes rapidly. For example, conditions might change drastically depending on geography, weather, or other organisms. To deal with these dynamic environments, biological systems must learn new behaviors to adapt to these circumstances [Frankenhuis et al., 2019, Neftci and Averbeck, 2019]. Behavior can be modeled as a sequence of choices or actions taken by an organism. Such a definition sits well with reinforcement learning, as its aim is to make the agent learn a sequence of actions (a policy) which maximizes the reward. This is explained in more detail in the appendix.

As we discussed earlier, reinforcement learning has been used extensively in biology and psychology [Frankenhuis et al., 2019]. The relationship between RL and biological learning systems is well established [Black and Prokasy, 1972, Schultz et al., 1997, Montague et al., 1996, Barto et al., 1995, Frank, 2005]. There are theories that link biological reinforcement learning to the role of the frontal-striatal systems of the brain [Barto et al., 1995, Frank, 2005]. In the famous Rescorla-Wagner model, the activity of dopamine is taken to explain the learning and shaping of behaviors in animals [Black and Prokasy, 1972]. According to this model, the cortex of the brain can be represented as a set of choices available to the animal [Haber et al., 2006, Mink, 1996]. The strength of the synapses between the choices represented in the cortex and the striatum cells encodes the information about the values of these choices. Hence, the synapses represent the values of the available actions [Lau and Glimcher, 2008, O'Doherty et al., 2004]. The striatal activity drives choice activity via descending projections to motor output areas. The connections between cortical choices and striatal cells are strengthened or weakened, depending on whether the outcome of a particular action returns a higher or lower than expected reward. The basic properties of this process are captured by the Rescorla-Wagner model in equation 2.3.1 below, with α a learning rate parameter, v_i the value of action i, and r the experienced reward.

$v_i(t + 1) = v_i(t) + \alpha \left( r(t) - v_i(t) \right)$  (2.3.1)

The temporal difference (TD) learning used in artificial reinforcement learning [Sutton and Barto, 2018, Bellman, 1966] is very similar to the Rescorla-Wagner model. However, it extends the Rescorla-Wagner model by including state dependency; in contrast, the Rescorla-Wagner model is stateless [Neftci and Averbeck, 2019]. The equation for the temporal difference update is given below.

$v_i(s_t) \leftarrow v_i(s_t) + \left( r(t) - v_i(s_t) + \gamma \, v_i(s_{t+1}) \right)$  (2.3.2)

In equation 2.3.2, i represents the action, v_i(s_t) gives the value of state s_t and action i, r(t) is the reward at time t, s_t is the state at time t, and γ is the decay factor which sets the horizon [Bellman, 1966]. Therefore, v_i(s_t) is updated based on the error δ_t = r(t) − v_i(s_t) + γ v_i(s_{t+1}) (equation 2.3.2).

Equations 2.3.1 and 2.3.2 reveal the analogy between biological and artificial reinforcement learning. In this thesis, we do not attempt to model the ontological development of bats directly. Nor are we directly modeling how bats learn to use their sonar system. Instead, as argued above, reinforcement learning is used to overcome practical issues related to parametrizing the model. Nevertheless, the biological plausibility of reinforcement learning adds some value to our results. Indeed, successful parametrization through reinforcement learning suggests that, at least in principle, the resulting models are learnable through experience and interaction with the environment.
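To make the parallel concrete, the sketch below implements both update rules in Python (the language of the software stack used later in this thesis). It is a minimal illustration only; the learning-rate, discount and reward values are arbitrary assumptions and are not taken from the cited studies.

```python
import numpy as np

def rescorla_wagner_update(v, reward, alpha=0.1):
    """Stateless value update of equation 2.3.1: v <- v + alpha * (r - v)."""
    return v + alpha * (reward - v)

def td_update(values, state, next_state, reward, gamma=0.9):
    """State-dependent update of equation 2.3.2, written as in the text
    (i.e., without a separate learning-rate parameter)."""
    values = values.copy()
    values[state] += reward - values[state] + gamma * values[next_state]
    return values

# Toy illustration: a single action value tracking a constant reward of 1.0.
value = 0.0
for _ in range(20):
    value = rescorla_wagner_update(value, reward=1.0)
print(round(value, 3))            # approaches 1.0

# Toy illustration of the TD update over a two-state chain.
values = td_update(np.zeros(2), state=0, next_state=1, reward=0.5)
print(values)                     # [0.5, 0.0]
```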

Chapter 3

Methods

3.1. Robot and echo generation

RL algorithms are powerful but require substantial data and exploration of the action space. Therefore, it is a common approach to train robotic controllers using realistic physics simulators. Here, we used Gazebo [Koenig and Howard, 2004] to simulate the robot and its environment, as the simulator must be very realistic for accurate echo emulation and behavior generation. We also used ROS [Stanford Artificial Intelligence Laboratory et al., 2018], the Robot Operating System, a flexible framework with versatile services and communication protocols and many functional packages and models, which enables quick and reliable prototyping and testing of robots. OpenAI Gym [Brockman et al., 2016] provides a clean framework for RL, and gym-gazebo [Zamora et al., 2016] is an extension of OpenAI Gym for use with Gazebo. It provides pre-defined mazes and circuits, which were used as the environments for training the robot, and was used for setting up the environment and the robot, as well as for connecting with the OpenAI RL software framework.

The overall software architecture used in the current paper for simulating the echoes and training the robot is depicted in figure 3.1. Stable Baselines [Hill et al., 2018] is an open source RL library with implementations of many baseline RL algorithms. TensorFlow is a library which provides tools for designing and training deep-learning models such as MLPs, CNNs, RNNs and many more.
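The sketch below shows how these components are typically driven through the Gym interface. It is illustrative only: the environment id is hypothetical (gym-gazebo registers its Gazebo worlds under ids of this form), the loop requires a running ROS/Gazebo instance, and the random policy is a stand-in for the trained network described later.

```python
import gym
import gym_gazebo  # noqa: F401  (registers the Gazebo environments with Gym)

# Hypothetical environment id; the maze worlds used in this thesis may differ.
env = gym.make("GazeboCircuit2TurtlebotLidar-v0")

state = env.reset()                     # spawn the robot, return the first observation
for t in range(300):                    # an episode is capped at 300 steps (section 3.2)
    action = env.action_space.sample()  # placeholder policy; the trained D3QN goes here
    state, reward, done, info = env.step(action)
    if done:                            # collision / idle counters end the episode
        state = env.reset()
env.close()
```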

Figure 3.1.: Illustration of the software architecture used in the current paper. Gazebo was used to simulate the robot and the environment. We used Gazebo to obtain a point cloud of the environment of the robot. These points were used to approximate the acoustics, i.e., to calculate the echo waveforms returning from the environment. The processed acoustic data was used as input to a neural network. The neural network was trained to learn the relationship between the current sensory state (echoes) and the reward function (a measure of the goodness of obstacle avoidance). The different components of this architecture are explained in more detail in the main text of the paper.

Figure 3.2.: Illustration of the Kobuki-based TurtleBot used in the experiment: (a) top view and (b) side view.

A Kobuki-based TurtleBot is used as the mobile robot platform. It is designed to have a circular body measuring 35 cm in diameter, as can be seen in figure 3.2, which corresponds to the wingspan of many species of bat [See Vanderelst et al., 2015a, for references]. It should be noted that the (simulated) robotic experiments presented here are performed in 2D using a wheeled robot. In contrast, bats can avoid obstacles in 3D. However, in most experiments testing obstacle avoidance, bats show little variation in their altitude while negotiating obstacles. The same holds for flight trajectories of bats under natural conditions (see Vanderelst et al. [2015a] for references). Therefore, while a 2D model does not capture the complete behavioral repertoire of bats, it is representative of much natural and experimental behavior. The robot was equipped with one ultrasonic emitter (modeling the bat's mouth) and two receivers (modeling the bat's ears). The emitter and the receivers' directionalities were modeled using the simulated head related transfer function of the bat Phyllostomus discolor [De Mey et al., 2008, Vanderelst et al., 2010]. Most echolocating bat species use brief frequency-modulated emissions, especially in complex environments [Schnitzler

and Kalko, 2001]. However, in this paper, we modeled the robot as using narrow band 40 kHz signals. We did this to simplify modeling the acoustics. Also, by using 40 kHz signals, our work is more relevant to robotics, where 40 kHz ceramic emitters are often used (see the discussion of the implications of our work for robotics). The robot featured a simulated 3D LIDAR sensor providing 1000 distance readings in a field of view (FOV) spanning ±90° in azimuth and 0 to 30° in elevation. The polar coordinates of each of the 1000 distance readings were used to approximate the acoustic impulse response of the environment. The field of view of the 3D LIDAR covered the main lobes of the simulated acoustic emitter and receivers: as shown in the directionality graph (figure 3.5), the energy is concentrated in the main lobe and two side lobes, which fall within the 180 degree horizontal and 30 degree vertical FOV, so the rest of the FOV is not necessary for echo emulation and hence not considered in the simulation. The output of the 3D LIDAR sensor is shown in figure 3.4. The vertical FOV is taken from 0 to 30 degrees rather than from -30 to 30 degrees, which would more accurately cover the frontal lobe. This is because taking -30 degrees as the minimum vertical FOV would emulate the generation of echoes from the ground. This is undesirable, as these ground echoes overlap other useful echoes from the space that are vital for navigation and obstacle avoidance. In practice, the ultrasonic sensors would be placed at a higher point to avoid these ground echoes, or the ground echoes would be measured during calibration and subtracted from the echoes during execution. Assuming one of these methods is adopted to avoid ground echoes, we skip them directly by setting the minimum vertical FOV to 0 degrees.

Figure 3.3.: Visualization of the 360 degree 2D LIDAR sensor used to get the shortest obstacle distance for the reward function.

This makes the echo emulation process simple and reduces unnecessary complexity and computational load. Note that the simulated 3D LIDAR system was only used as a means to approximate the acoustic impulse response of the room. The 3D LIDAR was not used to support obstacle avoidance. Another sensor used on board the robot is a 360-degree 2D LIDAR sensor; this is used to get the nearest obstacle distance, which is used in the reward function of the learning algorithm. The LIDAR data was converted into a simulated impulse response of the environment using the sonar equation [Urick, 1983, Vanderelst et al., 2015b] (eq. 3.1.1). In calculating the strength g_{i,r} of the echo received at each ear r for each point i returned by the LIDAR, we took into account two-way spherical spreading and the directionality of each ear.

Figure 3.4.: Visualization of the point cloud data from the onboard 3D LIDAR sensor, which is used to approximate the sonar sensor.

In addition, we included atmospheric attenuation for sound at 40 kHz [Bass et al., 1995]:

$g_{i,r} = g_e + 40 \cdot \log \frac{0.1}{r_i} + 2 \cdot r_i \cdot a_f + d_{\phi,r,i}$  (3.1.1)

In equation 3.1.1, g_{i,r} denotes the echo strength of point i, in dB SPL, at receiver r (i.e., either the left or the right ear). The term g_e gives the emission strength. This was set to 100 dB, which is similar to the emission strengths of commercially available sonar emitters and comparable to bat emissions in closely spaced conditions. The term r_i denotes the distance of point i, as returned by the virtual LIDAR system. The term d_{φ,r,i} gives the combined directionality of the emitter and receiver r for direction φ. The variable a_f gives the atmospheric attenuation at frequency f = 40 kHz (-1.318 dB/m). To account for internal and external noise, we assumed a noise floor of 20 dB SPL (see [Vanderelst and Peremans, 2018] for a biological justification). Echo strengths g_{i,r} smaller than 20 dB were set to zero.
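A minimal sketch of this per-point computation is given below. The constants are those quoted in the text; the directionality value is assumed to be looked up elsewhere (e.g., from the simulated transfer function of Phyllostomus discolor), so it is simply passed in as an argument.

```python
import numpy as np

EMISSION_DB = 100.0      # g_e, emission strength (dB SPL)
ATTENUATION = -1.318     # a_f, atmospheric attenuation at 40 kHz (dB/m)
NOISE_FLOOR = 20.0       # dB SPL; weaker echoes are discarded

def echo_strength(r_i, directionality_db):
    """Echo strength g_{i,r} (eq. 3.1.1) of one reflecting point at range r_i (m)."""
    spreading = 40.0 * np.log10(0.1 / r_i)       # two-way spherical spreading
    absorption = 2.0 * r_i * ATTENUATION         # two-way atmospheric absorption
    g = EMISSION_DB + spreading + absorption + directionality_db
    return g if g > NOISE_FLOOR else 0.0         # apply the 20 dB SPL noise floor

# Example: a point 1.5 m away, 10 dB down in the combined emitter/receiver pattern.
print(round(echo_strength(1.5, directionality_db=-10.0), 1))
```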

The impulse response was convolved with a 2.5 ms long (fs = 125 kHz) signal to obtain the echo sequence, which corresponds to the call lengths used by Eptesicus fuscus in cluttered conditions [Sandig et al., 2014, Knowles et al., 2015]. Next, the echo signal received at the left and the right ear was filtered using a model of the bat's cochlea [Wiegrebe, 2008]. This model incorporates the temporal resolution of the basilar membrane response and the transduction to a neural spike train in the cochlear nerve. In effect, this model results in a low-pass filtered envelope of the echo sequence at each ear.

Figure 3.5.: Visualization of the combined directionality graph of the emitter and receiver: energy of the echo vs. azimuth and elevation at a constant distance.

The low-pass filtering by the cochlear model [Wiegrebe, 2008] introduces temporal dependencies between nearby samples of the resulting signals. To avoid passing redundant information to the neural network, we subsampled the output of the cochlear model

after smoothing it with a boxcar function. In doing so, we reduced the signals arriving at the left and the right ear from 3000 samples (fs = 125 kHz) to 24 samples. Indeed, it has been shown before that sonar signals can be substantially subsampled without loss of information [e.g., Vanderelst et al., 2016, Kuc, 2020]. Refer to figure 3.6 for all the preprocessing steps applied to the echo state E_t.
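The final reduction step can be sketched as follows. Only the input size (3000 samples) and output size (24 samples) are taken from the text; the boxcar width and the way output samples are selected are assumptions of this illustration.

```python
import numpy as np

def reduce_echo(envelope, n_out=24):
    """Boxcar-smooth a 3000-sample cochlear envelope and reduce it to 24 samples."""
    envelope = np.asarray(envelope, dtype=float)   # 3000 samples at fs = 125 kHz
    window = len(envelope) // n_out                # assumed boxcar width per output bin
    boxcar = np.ones(window) / window
    smoothed = np.convolve(envelope, boxcar, mode="same")
    idx = np.linspace(0, len(envelope) - 1, n_out).astype(int)
    return smoothed[idx]                           # 24 floating point values per ear

print(reduce_echo(np.random.rand(3000)).shape)     # (24,)
```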

[Figure panels, top to bottom: point-cloud data, unprocessed echo, cut 25 ms, gamma-filtered, rectified, exponentially compressed, low-pass filtered, reduced; shown for the left ear and right ear echoes.]

Figure 3.6.: Visualization of the entire echo generation and pre-processing pipeline, from top to bottom (1000-point 3D point cloud to the 24-value reduced echo waveform).

The robot emitted calls at a rate of 20 Hz (50 ms interpulse interval) and moved at a constant speed of 3 m/s. Therefore, the robot moved 15 cm between subsequent calls. These parameters approximate the flight speed and sampling strategy of bats in cluttered environments, e.g., [Knowles et al., 2015, Warnecke et al., 2018, Barchi et al., 2013]. The velocity being fixed, the robot's only possible actions a consisted of several angular rotations it could execute after each emission. For the scope of this paper, since we only use a variation of DQN to generate behavior, and since DQN works only with discrete action spaces, the robot's action space is discretized to a simple three-action mode: moving forward, turning right, and turning left. The robot could adopt a rotational velocity of either 0, -360, or +360 degrees per second. Given the inter-call interval of 50 ms, this translated into the robot turning 0, -18, or +18 degrees between subsequent calls. A rotational speed of 360 degrees per second is well below the turning rates bats can achieve at the modeled speed of 3 m/s [Jones and Rayner, 1989, Holderied, 2001].
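A sketch of this action set as velocity commands is given below. The text only fixes the three motions and their rates; the assignment of action indices to directions is an assumption of this illustration.

```python
import math

FORWARD_SPEED = 3.0                  # m/s (15 cm per 50 ms inter-call interval)
TURN_RATE = math.radians(360.0)      # rad/s, i.e. 18 degrees per call interval

ACTIONS = {
    0: (FORWARD_SPEED, 0.0),         # drive straight
    1: (FORWARD_SPEED, +TURN_RATE),  # turn left
    2: (FORWARD_SPEED, -TURN_RATE),  # turn right
}

def action_to_velocity(action):
    """Map a discrete action index to (linear, angular) velocity commands."""
    return ACTIONS[action]

print(action_to_velocity(1))
```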

In the remainder of the paper, the concatenation of these two vectors for the emission at time step t is denoted as

$E_t = \begin{bmatrix} E_{L,t} \\ E_{R,t} \end{bmatrix}$  (3.1.2)

with the subscripts L and R referring to the left and the right ear, respectively.

3.2. Neural network and training

We trained the robot in a single environment (Simple environment, see figs. 4.2 and 4.3). This environment consisted of an arena of straight walls forming a corridor. The corridor was 1.5 m wide. To ensure that the robot's trained controller was able to generalize across environments, we tested it in the Simple environment and two additional environments (Narrow and Circle, figs. 4.2 and 4.3). We used a variant of RL, in particular Deep Q-Network Learning (DQN), to train the simulated robot to steer through the environment using the sonar data provided by its two ears. RL is a broad class of machine learning methods where the agent learns a policy that maximizes the discounted return G_t [Sutton and Barto, 2018].

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

In RL, the agent is assumed to operate under a Markov decision process (MDP). At each time step, the agent is in a state S_t and takes an action a_t. As a result of taking action a_t in state S_t, the agent receives a reward R_t. The environment has a state transition function P(S_{t+1}|S_t, a_t) and a reward function R(S_t, a_t), both of which are unknown to the agent at the onset of training [Sutton and Barto, 2018]. Here, the states are given by the 6 × 24 matrix S_t,

$S_t = \begin{bmatrix} E_{t-2} & a_{t-3} \\ E_{t-1} & a_{t-2} \\ E_{t} & a_{t-1} \end{bmatrix}$  (3.2.1)

Matrix S_t contains the echoic information received from the last three emissions (at time steps t − 2, t − 1, and t) for both ears (L and R) and an action memory containing the previous three actions chosen by the robot (at time steps t − 3, t − 2 and t − 1) that led to the corresponding partial states E_t.

An example of the state matrix St is given in figure 3.7.

Figure 3.7.: Example state S_t. Rows represent a short-term memory of past echo trains. The columns represent the 24 samples of the sub-sampled echo trains. The intensity of the echoes is encoded in the gray scale, with more intense echoes shown brighter.
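A minimal sketch of assembling such a state is shown below. The echo part follows figure 3.7 (three stacked 2 × 24 blocks, one pair of left/right rows per emission); the text does not spell out exactly how the three past actions are folded into the network input, so here they are simply returned alongside the echo matrix.

```python
from collections import deque
import numpy as np

echo_memory = deque(maxlen=3)     # holds E_{t-2}, E_{t-1}, E_t (each 2 x 24)
action_memory = deque(maxlen=3)   # holds a_{t-3}, a_{t-2}, a_{t-1}

def build_state(new_echo, last_action):
    """new_echo: 2 x 24 array (left-ear row, right-ear row); last_action: int."""
    echo_memory.append(np.asarray(new_echo, dtype=float))
    action_memory.append(last_action)
    while len(echo_memory) < 3:                 # pad with silence at episode start
        echo_memory.appendleft(np.zeros((2, 24)))
        action_memory.appendleft(0)
    state = np.vstack(echo_memory)              # shape (6, 24)
    return state, list(action_memory)

state, actions = build_state(np.random.rand(2, 24), last_action=0)
print(state.shape, actions)                     # (6, 24) [0, 0, 0]
```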

The reward function used to train the agent is based on the shortest obstacle distance s, which we obtain via the 2D, 360 degree LIDAR sensor. We reiterate that the robot cannot use this sensor to avoid the walls. The reward function used for training the agent is as follows,

$R_t = \begin{cases} -0.005 & \text{if } s < 0.4\,\text{m} \\ +0.050 & \text{if } s > 0.4\,\text{m and } a_t = 0 \end{cases}$  (3.2.2)

The reward function R_t is designed to encourage the agent to learn obstacle avoidance behavior. The reward is positive when the robot (1) is further than 0.4 m from the wall and (2) moves straight (i.e., it takes action a_t = 0). Adding this second requirement avoids the robot accruing rewards by rotating on the spot at a position well away from the walls. Additionally, two counters are used to speed up the training process. First, we implemented a damage counter that increments when the robot gets too near to a wall, i.e., s < 0.4. If the damage counter is higher than three, the current training episode terminates. The second counter is an idle counter that increments when the robot continuously gets negative rewards and resets to zero if the agent receives a positive reward. The episode is terminated if the idle counter is greater than 10. This second counter ensures that episodes in which the robot is performing poorly are terminated quickly. If neither the damage counter nor the idle counter terminates the learning episode, the episode is automatically terminated once the time step t reaches 300. At the start of each learning episode, the robot is spawned at a random, but safe, point in the maze to avoid over-fitting to the environment, i.e., memorizing routes and learning a monotonous strategy. In addition, regularization methods like weight decay and dropout are adopted to avoid over-fitting of the neural network.
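The sketch below captures this reward and termination logic in one place. The counter thresholds and the two reward values follow the paragraph above; the reward for turning while far from the walls is not specified in the text and is assumed to be zero here.

```python
class RewardMonitor:
    """Per-episode reward and termination bookkeeping (sketch of section 3.2)."""

    def __init__(self, max_steps=300, max_damage=3, max_idle=10):
        self.max_steps, self.max_damage, self.max_idle = max_steps, max_damage, max_idle
        self.damage = self.idle = self.t = 0

    def step(self, s, action):
        """Return (reward, done) for nearest-obstacle distance s (m) and chosen action."""
        self.t += 1
        if s < 0.4:
            reward = -0.005
            self.damage += 1                 # too close to a wall
        elif action == 0:
            reward = +0.050                  # safe distance and driving straight
        else:
            reward = 0.0                     # assumption: turning far from walls is neutral
        self.idle = 0 if reward > 0 else self.idle + 1
        done = (self.damage > self.max_damage
                or self.idle > self.max_idle
                or self.t >= self.max_steps)
        return reward, done

monitor = RewardMonitor()
print(monitor.step(s=0.9, action=0))         # (0.05, False)
```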

26 3.2.1. Algorithmic details

The goal of the agent is to select a sequence of actions that maximizes the discounted return. In DQN this is achieved by Q-learning, where the agent learns the optimal value function Q*(S_t, a_t) by training a deep Q-network using the Bellman equation [Mnih et al., 2013],

$Q_\pi(S_t, a_t) \leftarrow Q_\pi(S_t, a_t) + \alpha \left( y_t - Q(S_t, a_t) \right)$  (3.2.3)

$y_t = R_t + \max_{a} Q(S_{t+1}, a)$  (3.2.4)

The architecture of the convolutional neural network (CNN) for the experiment is shown in figure 3.8. The Q-function gives a value for each state and action. The optimal action can be found using a greedy approach [Mnih et al., 2013], i.e., by taking the action a_t that is predicted to return the maximum reward,

$\pi(a|s) = \arg\max_{a} Q(S, a)$  (3.2.5)

As mentioned above, at each time step t, the robot could take one of three different actions. It could either drive straight (a_t = 0) or turn left or right at a fixed rotational speed. This discretization is required because DQN is limited to discrete action spaces [Mnih et al., 2013]. Expanding this behavior to continuous action spaces will be future work. To enhance the performance of the learning agent, we have implemented modifications on top of DQN, namely Double Deep Q-Networks (DDQN) [Van Hasselt et al., 2016] with a dueling architecture [Wang et al., 2015] and prioritized replay buffers [Schaul et al., 2015], using OpenAI Gym [Brockman et al., 2016] and the Stable Baselines RL library [Hill et al., 2018]. Table 3.1 lists the hyperparameters used in training the D3QN.

Learning Rate        0.005
Batch Size           32
Initial Exploration  1.0
Final Exploration    0.01
Replay Buffer Size   50,000
Tau                  0.001

Table 3.1.: The most important hyperparameters used in training the robot. The learning rate and the batch size are parameters determining neural network optimization. The four bottom parameters pertain to the algorithm as shown in algorithm 1.
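How such a D3QN configuration could look on the library side is sketched below. This is a hedged reconstruction rather than the actual training script: the environment id is hypothetical, the custom CNN of figure 3.8 is replaced by the library's default MLP policy for brevity, and the parameter names assume the Stable Baselines 2.x DQN interface.

```python
import gym
import gym_gazebo  # noqa: F401
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

env = gym.make("GazeboCircuit2TurtlebotLidar-v0")   # placeholder environment id

model = DQN(
    MlpPolicy,
    env,
    learning_rate=0.005,                 # table 3.1
    batch_size=32,
    buffer_size=50000,
    exploration_final_eps=0.01,
    double_q=True,                       # Double DQN target [Van Hasselt et al., 2016]
    prioritized_replay=True,             # prioritized replay [Schaul et al., 2015]
    policy_kwargs=dict(dueling=True),    # dueling value/advantage streams [Wang et al., 2015]
    verbose=1,
)
model.learn(total_timesteps=5000)        # training ran for 5000 iterations (section 3.2.1)
```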

Figure 3.8.: Illustration of the D3QN neural network that was trained to control the robot, based on the echo data contained in the sensory state matrix S_t (see equation 3.2.1). The CNN filters have a shape of (1,3) and a stride of 1. This emulates a 1D convolution so that each echo is processed separately. The dueling architecture CNN has two outputs: one for the state value function V(S), which returns a single value, and one for the advantage function A(S, a), which is a vector of length 3, returning the advantage of taking each action. The Q-function is calculated using both of these: Q(S, a) = V(S) + A(S, a) [Wang et al., 2015]. The Dueling Double Deep Q-network is optimized using gradient descent (i.c., RMSprop).

Training was done for 5000 iterations. We assessed whether learning had converged by looking at the reward obtained by the robot as a function of time. We also inspected whether the loss function had converged and whether further training would still increase performance. The loss function is a measure indicating how well the neural network has learned the relationship between the states, the actions and the resulting rewards. In other words, a lower loss implies that the neural network has learned the Q-function. As also indicated in algorithm 1, the loss L was calculated as,

$L = \left( y_t - Q(S_t, a_t; \alpha, \beta) \right)^2$  (3.2.6)

where

$y_t = r_t + \eta \max_{a'} \hat{Q}(S_{t+1}, a'; \alpha, \beta)$  (3.2.7)

and

$Q(S_t, a_t) = V(S_t; \alpha) + \left( A(S_t, a_t; \beta) - \max_{a'_t} A(S_t, a'_t; \beta) \right)$  (3.2.8)
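For concreteness, the sketch below evaluates equations 3.2.6-3.2.8 for a single transition using made-up numbers. In the actual network, V and A are the two output heads of figure 3.8 and Q̂ comes from the target network; here they are plain arrays.

```python
import numpy as np

def dueling_q(v, advantages):
    """Equation 3.2.8: combine the value and advantage streams into Q-values."""
    advantages = np.asarray(advantages, dtype=float)
    return v + (advantages - advantages.max())

eta = 0.99                                                    # discount factor (eta in the text)
q_online = dueling_q(v=1.2, advantages=[0.3, -0.1, 0.0])      # Q(S_t, .)
q_target_next = dueling_q(v=1.0, advantages=[0.2, 0.1, 0.4])  # Q_hat(S_{t+1}, .)

a_t, r_t = 0, 0.05
y_t = r_t + eta * q_target_next.max()       # equation 3.2.7 (target from Q_hat)
loss = (y_t - q_online[a_t]) ** 2           # equation 3.2.6 (squared TD error)
print(round(y_t, 3), round(loss, 4))        # 1.04 0.0256
```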

3.3. Conditions

Here, we set out to assess whether our previously presented model of sonar-based obstacle avoidance in bats [Vanderelst et al., 2015a, Mansour et al., 2019a,b] can be augmented by providing it with short-term memory and with the full echo sequence (instead of only the onset thereof). To test this hypothesis, we train the robot by providing it with three versions of the sensory information contained in the state matrix S_t (equation 3.2.1).

1. Condition 1 (Memory): We provide the robot with a 6 × 24 matrix containing a short-term trace of the past echo trains received at both ears. That is, we provide the robot with the state matrix S_t as given by equation 3.2.1 and depicted in fig. 3.9.

2. Condition 2 (No Memory): We provide the robot with a 6 × 24 matrix containing a short-term trace of the past echo trains received at both ears. However, the values in the vectors E_{t−2,∗} and E_{t−1,∗} are set to 0. This model only uses the most recent echo data, as depicted in fig. 3.10.

3. Condition 3 (Onset): We provide the robot with a 6 × 24 matrix containing a short-term trace of the past echo trains received at both ears. However, only the first value larger than zero in each vector E_{∗,∗} is kept. Subsequent values in the vector are set to 0. This represents a situation where the bat is assumed to use only the onset of the echo train, as depicted in fig. 3.11 (see also the masking sketch below this list).
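The three conditions can be thought of as three masks applied to the same state matrix. The sketch below is an illustrative reconstruction of that masking; the thesis does not show the actual preprocessing code, so row indices follow the layout of figure 3.7.

```python
import numpy as np

def apply_condition(state, condition):
    state = np.array(state, dtype=float)          # shape (6, 24): 3 echo pairs (L, R)
    if condition == "memory":                     # condition 1: use everything
        return state
    if condition == "no_memory":                  # condition 2: zero the older echoes
        state[:4, :] = 0.0                        # rows for E_{t-2} and E_{t-1}
        return state
    if condition == "onset":                      # condition 3: keep only the onset
        for row in state:
            nonzero = np.flatnonzero(row)
            if nonzero.size:
                row[nonzero[0] + 1:] = 0.0        # discard everything after the first echo
        return state
    raise ValueError(condition)

print(apply_condition(np.random.rand(6, 24), "no_memory")[:4].sum())   # 0.0
```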

Figure 3.9.: Example state S_t given to condition 1, as explained above. All rows of the image are used, giving a memory of 3, and the complete echo train is given.

Figure 3.10.: Example state S_t given to condition 2, the No Memory condition. The first 4 rows from the top are blacked out, as the t − 1 and t − 2 echoes from equation 3.2.1 are zeroed out.

Figure 3.11.: Example state S_t given to condition 3, the Onset condition. All rows are used, i.e., the echoes at t − 1 and t − 2 are given, but the complete echo train is not: only the first (onset) value of each echo train is kept and the rest of the train is set to zero.

Initialize the Gazebo simulator
Initialize prioritized replay memory D to capacity N
Initialize the Q-network with random weights θ
for episode = 1, M do
    Re-spawn the robot at a random safe point in the maze
    Initialize with observation S_1 (6 × 24)
    for t = 1, T do
        With probability ε select a uniform random action a_t; otherwise select greedily, a_t = argmax_a Q(S_t, a; θ)
        The robot executes the chosen action a_t and observes the reward r_t and the partial state (echo waveform) E_{t+1}, which is combined with the previous E to form the state S_{t+1}
        Store the tuple (S_t, a_t, r_t, S_{t+1}) in the priority buffer, with the TD error as priority
        Sample a priority-based mini-batch of experience tuples (S_t, a_t, R_t, S_{t+1}) from D
        Calculate the Q-function as Q(S, a) = V(S) + (A(S, a) − max_{a'} A(S, a'))
        Set y_t = r_t + η max_{a'} Q̂(S_{t+1}, a') and optimize (y_t − Q(S_t, a_t))² with respect to the network parameters using the RMSprop optimizer
        Every C steps, reset Q̂ = Q
    end
end

Algorithm 1: Dueling architecture double deep Q-learning algorithm [Wang et al., 2015, Van Hasselt et al., 2016] used in this paper. A list of the most important parameters and their values is given in table 3.1.

Chapter 4

Results

In all three conditions, the robot's training had converged at 5000 iterations. The reward return vs. the training time step for the three conditions is given below (figure 4.1). After training, the robot was evaluated for 15 runs in each of the three environments. Evaluation runs started from randomly selected locations in the environment. For each condition, the same random start locations were used. Below, we report on the results of these evaluation runs. Each run lasted for 1000 steps (calls) or until the robot collided with a wall. Therefore, each evaluation run's length can be used as a metric of the obstacle avoidance performance of the robot (see below). An example trajectory for each condition and environment is shown in figure 4.2. Figure 4.3 quantifies the robot performance in the three different environments. The data shown are averaged over the 15 replications. For each environment and condition, we assess the distribution of the closest distance to any of the environment's walls at each time step. As can be seen from the violin plots in figure 4.3, the median distance from the walls is about the same across conditions. However, the (shape of the) distribution is different. There is a tendency for the robot in the No Memory condition to stray closer to the walls than in the other conditions. We also evaluated the proportion of each action taken in the three environments and the three conditions. Figure 4.3 reveals that the number of straight drives, left turns, and right turns differed across conditions. We assessed this statistically using a Kruskal-Wallis H-test on the proportion of steps the robot drove straight (instead of turning left or right) in each of the 15 replications. In each of the three environments, this test reached significance (H > 22, p < 0.01). This indicated that, for each environment, the different conditions resulted in a different distribution of actions chosen. The differences in the distances to the walls and the proportion of actions taken indicate that the robot behaved differently across conditions. These differences in behavior translated into marked differences in their obstacle avoidance performance. Figure 4.3 shows the distance driven by the robot without colliding with the walls. Because the robots were evaluated for a maximum of 1000 steps and the distance driven per step was fixed to 15 cm (3 m/s, 20 calls per second), the maximum distance covered by the robot was 150 m. The robot without memory (No Memory) performed worse than the robot with memory (Memory). Performance in the Onset condition was worse than in both other conditions. On average, the Memory robot covered 1.5× as much distance as the No Memory robot and 6× as many steps as the Onset robot. These results were confirmed statistically. In each environment, comparing the number of completed steps between conditions using Kruskal-Wallis H-tests yielded statistically significant results (H > 16, p < 0.01). Another way of expressing the robots' performance in the various conditions is to look at the proportion of trials in which they completed the maximum of 1000 steps. In the Memory condition, 100% of trials were successfully completed. In the No Memory and Onset conditions, 66% and 21% were completed successfully, respectively. This confirms the differences in performance across conditions. Running a z-test on the proportions of successfully completed trials affirmed that the Memory condition differed significantly from the No Memory condition (z > 2.4, p < 0.05), and that the No Memory condition differed significantly from the Onset condition (z > 2.5, p < 0.01).
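The Kruskal-Wallis comparison reported above can be reproduced with scipy as sketched below. The arrays are placeholders standing in for the per-run numbers of completed steps in each condition (n = 15); they are not the evaluation data.

```python
from scipy.stats import kruskal

# Placeholder per-run completed-step counts for one environment (n = 15 each).
steps_memory = [1000] * 15
steps_no_memory = [1000, 1000, 620, 410] * 3 + [1000, 300, 150]
steps_onset = [90, 120, 60, 1000, 45] * 3

H, p = kruskal(steps_memory, steps_no_memory, steps_onset)
print(round(H, 2), round(p, 4))
```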

Figure 4.1.: Reward return vs. the training time step t during the training period. (a) Memory condition: full memory (T=3) and complete echo train; (b) No Memory condition (T=1) with complete echo train; (c) Onset condition with full memory (T=3).

[Figure 4.2 panels: columns show the Simple, Narrow and Circle environments; rows show the Memory, No Memory and Onset conditions; black scale bars are 1 m.]

Figure 4.2.: Example trajectories for each condition and each environment. Notice that the robot was trained in the Simple environment and evaluated in all three environments. The trajectories shown for the Memory and No Memory condition were successfully completed (i.e., 1000 steps of 15 cm each). In contrast, for the Onset condition, the trajectories are rather short. This is because the robot was less successful at negotiating the environments. Indeed, each of the three trajectories shown for this condition ended in a collision soon after the start (locations indicated with a star). The black scale bar in each of the top panels is 1 m long.

Figure 4.3.: Results of the robot evaluation for each condition (Memory, No Memory, and Onset). Each row depicts the results for a different environment (Simple, Narrow, and Circle); results are averaged across the 15 evaluation runs, each starting at a different location in the arena. Columns show, from left to right, the distance to obstacles, the actions taken, and the distance driven without collision. (a, d, g) Violin plots of the distance to the nearest wall at each time step. (b, e, h) Stacked bar plots showing the proportion of each of the three possible actions (left, straight, right) taken. (c, f, i) Bar plots of the distance covered without colliding; error bars give the standard error. Note that the maximum distance covered by the robot was 150 m (1000 steps × 15 cm, indicated by the horizontal line). As the robot's speed was constant, these panels also show the time (in seconds) driven by the robots.

Chapter 5

Discussion & Future Work

As stated in the introduction, recent evidence suggests that echolocating bats have access to only minimal sensory data while negotiating complex environments [e.g., Geberl, 2013, Geberl et al., 2019, Warnecke et al., 2018]. This led us to propose models of obstacle avoidance [Vanderelst et al., 2015a, Mansour et al., 2019a,b] that rely only on the interaural level differences of the echo train onset. We showed that our artificial Braitenberg-bats were able to avoid obstacles, and construct cognitive maps [Vanderelst and Peremans, 2017], guided only by a simple algorithm. While our previous results were encouraging, the performance of the models still falls short of the impressive flight control observed in bats. This indicates that our models are at best partial and require extending. The temporal and spatial resolution of the bat's sonar system is limited by its auditory periphery, in particular by processing in the cochlea. Therefore, the quality of the sensory data we can presume in modeling bat echolocation is fixed. However, we can assess whether providing our artificial bats with more data makes them better echolocators. Here, we offer a preliminary expansion of our existing models. We tested whether providing a robotic echolocator with (1) a short term memory of echoes and (2) the ability to use the complete echo train (instead of only its onset) would improve performance. Our results indicate that this is the case: performance was best when the robot could exploit more acoustic data. This finding has both biological and engineering implications.

5.1. Biology

First, the current results represent an incremental refinement of our previous models [Vanderelst et al., 2015a, Mansour et al., 2019a,b]. The results suggest that relevant information is present in the complete echo train, something we ignored in our earlier modeling efforts. In the current paper, preventing the bat from using the entire echo train had a significant impact on performance. By comparison, the short term memory was less critical for performance. In hindsight, this might be expected. The sonar system of bats (and our robot) has only a limited field of view. Therefore, turns executed between calls result in the sonar system sampling largely non-overlapping regions on subsequent calls. In other words, turning to avoid obstacles might result in the bat 'seeing' mostly different parts of the environment with each call, especially when avoiding obstacles in tight spaces. Therefore, having a memory of past echoes is perhaps not very informative to the bat. In summary, the current results suggest different sensory (processing) strategies that might (or might not) be useful to a bat trying to avoid obstacles in tight spaces. In itself, this is a new hypothesis suggested by this work that is worth following up. A second, and potentially more important, implication of the current work lies in establishing a method for automatically deriving models of bat sonar-based control. Here, we revisited a sonar-based behavior we modeled before (in this case, obstacle avoidance). However, many more behaviors await explanation, for example, the capture of prey hidden in vegetation [Geipel et al., 2013, 2019], or mass emergence from caves [Kloepper and Bentley, 2017]. The work presented here suggests a method of arriving at models of these behaviors. One could envision training artificial bats to perform these tasks, providing them with (simulated) sensory data as understood to be available to them from physiological and behavioral studies. Instead of handcrafting potential controllers, hypotheses about how the bat performs the behavior would arise from the training. In other words, machine learning could be used to find patterns in the sensory data to support motor behavior.

5.2. Engineering

Various groups have proposed vision-based control algorithms and hardware for the new generation of nano-UAVs, inspired by insect vision [e.g., Franceschini et al., 2007, Franceschini, 2014, Briod et al., 2013, Duhamel et al., 2013]. Lightweight cameras perform well when lighting is good and reliable. Moreover, these passive sensors require little energy, and bio-inspired vision algorithms are computationally very efficient [Floreano and Wood, 2015]. Unfortunately, the quality of the signals captured by small lensed photoreceptors rapidly degrades when lighting is low. This makes vision an inappropriate sense for many applications. Nano-UAVs will mostly be deployed to (autonomously) inspect the interior of buildings and other structures. Indeed, in confined spaces, their high maneuverability and small size are most useful. Inside buildings and other human-made structures, the light will often be low and changeable. Moreover, the structures can be filled with smoke, dust, and other aerosols [to which sonar is relatively insensitive, Steckel et al., 2011]. Under these circumstances, vision will not be a suitable sense, and active sensors should complement or replace vision-based approaches. Sonar is perhaps the most common sensor on autonomous robots. It was one of the first sensors to be used in autonomous robots [e.g., Brooks, 1986] and today it is still used in many commercial and experimental robots. Sonar sensors are entirely insensitive to lighting conditions – even more so than laser scanners and TOF cameras – and can operate in spaces filled with smoke and dust [Steckel et al., 2011]. Also, several properties make sonar a promising sensor modality for nano-UAVs, including low cost, small form factor, low power consumption, and mechanical robustness [Müller et al., 2009, Steder et al., 2007]. Despite these advantages, applications of artificial in-air sonar are almost exclusively limited to simple ranging and obstacle detection [see Peremans et al., 2012, Kleeman and Kuc, 2008, for reviews]. However, this limited application domain is not due to a fundamental limitation of sonar. Indeed, the potential of sonar is demonstrated by the abilities of echolocating bats. More than fifty years of intense research on bats [Griffin, 1958] have shown that sonar is capable of supporting swift flight through dense vegetation [Holderied et al., 2006], navigation in changing environments [Barchi et al., 2013], object recognition [Von Helversen and Von Helversen, 2003] and airborne foraging [Griffin, 1958]. The enormous gap between artificial sonar applications and biological performance implies that, currently, we are not taking full advantage of the potential of sonar. This calls for a new approach to sonar for robots, inspired by bat echolocation research. Our results suggest that closing the performance gap between biological and artificial sonar might not require enormous computational power. The simulated robot presented in this paper uses only 6×24 floating-point numbers to support control. In combination with emerging technologies for efficient implementation of neural networks, this suggests that sonar could be a computationally cheap sensing modality for robots, in particular for robots with highly constrained energy budgets, such as nano-UAVs.

5.3. Future work

In the discussion above, we have shown the various implications of this work for biology and robotics. This work can still be improved in multiple ways; some of the most important are listed below. The most significant future expansion of this work would be to transfer the model from the simulator to the physical environment using a hardware implementation. This would probably involve adding various noise distributions to both the echo simulation and the robot dynamics during training in simulation, and then testing the resulting model in the real world. Another possible direction is to extend the model to operate in a continuous action space instead of a discrete one; this extension is already in progress, using the actor-critic algorithm DDPG. A further possible expansion is to use a much deeper convolutional neural network for the agent's policy, replacing the existing cochlear pre-processing step. This would allow us to study the behavior using the raw signal directly. Finally, the framework developed for this work could be extended to study other behaviors, such as prey capture combined with obstacle avoidance and navigation.

Bibliography

Jonathan R Barchi, Jeffrey M Knowles, and James A Simmons. Spatial memory and stereotypy of flight paths by big brown bats in cluttered surroundings. The Journal of experimental biology, 216(6):1053–1063, 2013.

Daman Bareiss, Joseph R. Bourne, and Kam K. Leang. On-board model-based automatic collision avoidance: application in remotely-piloted unmanned aerial vehicles. Autonomous Robots, 41(7):1539–1554, 2017. ISSN 1573-7527. doi: 10.1007/s10514-017-9614-4.

AG Barto, JC Houk, JL Davis, and DG Beiser. Models of information processing in the basal ganglia. 1995.

H. E. Bass, L. C. Sutherland, A. J. Zuckerwar, D. T. Blackstock, and D. M. Hester. Atmospheric absorption of sound: Further developments. The Journal of the Acoustical Society of America, 97(1):680–683, 1995. doi: 10.1121/1.412989.

Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.

Abraham H Black and William F Prokasy. Classical conditioning ii: Current research and theory. 1972.

Adrien Briod, Jean-Christophe Zufferey, and Dario Floreano. Optic-flow based control of a 46g quadrotor. In Workshop on Vision-based Closed-Loop Control and Navigation of Micro Helicopters in GPS-denied Environments, IROS 2013, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.

R. Brooks. A robust layered control system for a mobile robot. IEEE Journal on Robotics and Automation, 2, 1986. ISSN 0882-4967. doi: 10.1109/JRA.1986.1087032.

Elizabeth L Clare and Marc W Holderied. Acoustic shadows help gleaning bats find prey, but may be defeated by prey acoustic camouflage on rough surfaces. Elife, 4: e07404, 2015.

F De Mey, J Reijniers, H Peremans, M Otani, and U Firzlaff. Simulated head related transfer function of the phyllostomid bat phyllostomus discolor. The Journal of the Acoustical Society of America, 124(4):2123–2132, 2008.

Pierre-Emile J Duhamel, Nestor O Perez-Arancibia, Geoffrey L Barrows, and Robert J Wood. Biologically inspired optical-flow sensing for altitude control of flapping-wing microrobots. IEEE/ASME Transactions on Mechatronics, 18(2):556–568, 2013.

Dario Floreano and Robert J. Wood. Science, technology and the future of small autonomous drones. Nature, 521(7553):460–466, 2015. ISSN 0028-0836. doi: 10.1038/nature14542.

N Franceschini, F Ruffier, and J Serres. A bio-inspired flying robot sheds light on insect piloting abilities. Current Biology, 17(4):329–335, 2007.

Nicolas Franceschini. Small brains, smart machines: From fly vision to robot vision and back again. Proceedings of the IEEE, 102(5):751–781, 2014. ISSN 0018-9219. doi: 10.1109/JPROC.2014.2312916.

Michael J Frank. Dynamic dopamine modulation in the basal ganglia: a neurocomputational account of cognitive deficits in medicated and nonmedicated parkinsonism. Journal of Cognitive Neuroscience, 17(1):51–72, 2005.

Willem E Frankenhuis, Karthik Panchanathan, and Andrew G Barto. Enriching behavioral ecology with reinforcement learning methods. Behavioural Processes, 161:94–100, 2019.

Cornelia Geberl. Spatial and temporal resolution of bat sonar. PhD thesis, Ludwig-Maximilians-Universität München, July 2013.

Cornelia Geberl, Kathrin Kugler, and Lutz Wiegrebe. The spatial resolution of bat biosonar quantified with a visual-resolution paradigm. Current Biology, 29(11):1842– 1846, 2019.

I. Geipel, K. Jung, and E.K. Kalko. Perception of silent and motionless prey on vegetation by echolocation in the gleaning bat Micronycteris microtis. Proceedings. Biological Sciences / The Royal Society, 280(1754):20122830, 2013. doi: 10.1098/rspb.2012.2830.

Inga Geipel, Jan Steckel, Marco Tschapka, Dieter Vanderelst, Hans-Ulrich Schnitzler, Elisabeth KV Kalko, Herbert Peremans, and Ralph Simon. Bats actively use leaves as specular reflectors to detect acoustically camouflaged prey. Current Biology, 29(16): 2731–2736, 2019.

Maya Geva-Sagiv, Liora Las, Yossi Yovel, and Nachum Ulanovsky. Spatial cognition in bats and rats: from sensory acquisition to multiscale maps and navigation. Nature Reviews. Neuroscience, 16(2):94–108, 2015.

Donald R. Griffin. Listening in the dark; the acoustic orientation of bats and men. Yale University Press, New Haven, 1958.

Donald R Griffin and Robert Galambos. The sensory basis of obstacle avoidance by flying bats. Journal of Experimental Zoology, 86(3):481–506, 1941.

Suzanne N Haber, Ki-Sok Kim, Philippe Mailly, and Roberta Calzavara. Reward-related cortical inputs define a large striatal region in primates that interface with associative cortical connections, providing a substrate for incentive-based learning. Journal of Neuroscience, 26(32):8368–8376, 2006.

M. Hassanalian and A. Abdelkefi. Classifications, applications, and design challenges of drones: A review. Progress in Aerospace Sciences, 91(November 2016):99–131, 2017. ISSN 0376-0421. doi: 10.1016/j.paerosci.2017.04.003.

D Von Helversen and O Von Helversen. Object recognition by echolocation: A nectar-feeding bat exploiting the flowers of a rain forest vine. Journal of Comparative Physiology. A, Neuroethology, Sensory, Neural, and Behavioral Physiology, 189(5):327–36, 2003.

Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.

Marc W Holderied. Akustische Flugbahnverfolgung von Fledermäusen: Artvergleich des Verhaltens beim Suchflug und Richtcharakteristik der Schallabstrahlung. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2001.

Marc W Holderied, Gareth Jones, and Otto von Helversen. Flight and echolocation behaviour of whiskered bats commuting along a hedgerow: range-dependent sonar signal design, doppler tolerance and evidence for 'acoustic focussing'. Journal of Experimental Biology, 209(10):1816–1826, 2006.

Richard A. Holland. Orientation and navigation in bats: Known unknowns or unknown unknowns? Behavioral Ecology and Sociobiology, 61:653–660, 2007. ISSN 0340-5443. doi: 10.1007/s00265-006-0297-7.

Richard O Hundley and Eugene E Gritton. Future technology-driven revolutions in military operations: Results of a workshop. Technical report, 1994.

Lasse Jakobsen, John M. Ratcliffe, and Annemarie Surlykke. Convergent acoustic field of view in echolocating bats. Nature, 493(7430):93–96, November 2012. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature11664.

Lasse Jakobsen, Signe Brinkløv, and Annemarie Surlykke. Intensity and directionality of bat echolocation signals. Frontiers in Physiology, 4(April):89, January 2013. ISSN 1664-042X. doi: 10.3389/fphys.2013.00089.

Gareth Jones and Jeremy M V Rayner. Foraging behavior and echolocation of wild horseshoe bats Rhinolophus ferrumequinum and R. hipposideros (Rhinolophidae). Behavioral Ecology and Sociobiology, 25(3):183–191, 1989.

Farid Kendoul. Survey of Advances in Guidance, Navigation, and Control of Unmanned Rotorcraft Systems. Journal of Field Robotics, 29(2):315–378, 2012. doi: 10.1002/rob.20414.

L Kleeman and R Kuc. Springer Handbook of Robotics, chapter Sonar Sensing. Springer, 2008.

Laura N Kloepper and Ian Bentley. Stereotypy of group flight in brazilian free-tailed bats. Animal Behaviour, 131:123–130, 2017.

Jeffrey M. Knowles, Jonathan R. Barchi, Jason E. Gaudette, and James A. Simmons. Effective biosonar echo-to-clutter rejection ratio in a complex dynamic scene. The Journal of the Acoustical Society of America, 138(2):1090–1101, August 2015. ISSN 0001-4966. doi: 10.1121/1.4915001.

N. Koenig and A. Howard. Design and use paradigms for gazebo, an open-source multi- robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), volume 3, pages 2149–2154 vol.3, 2004.

Roman Kuc. Artificial neural network classification of surface reflectors and volume scatterers using sequential echoes acquired with a biomimetic audible sonar. The Journal of the Acoustical Society of America, 147(4):2357–2364, 2020.

Brian Lau and Paul W Glimcher. Value representations in the primate striatum during matching behavior. Neuron, 58(3):451–463, 2008.

BD Lawrence and JA Simmons. Echolocation in bats: the external ear and perception of the vertical positions of targets. Science, 218(4571):481–483, 1982.

Hao Liu, Sridhar Ravi, Dmitry Kolomenskiy, and Hiroto Tanaka. Biomechanics and biomimetics in insect-inspired flight systems. Phil. Trans. R. Soc. B, 371(1704):20150390, 2016.

Kevin Y Ma, Pakpong Chirarattananon, Sawyer B Fuller, and Robert J Wood. Controlled flight of a biologically inspired, insect-scale robot. Science, 340(6132):603–7, 2013. ISSN 1095-9203. doi: 10.1126/science.1231806.

Carl Bou Mansour, Elijah Koreman, Dennis Laurijssen, Jan Steckel, Herbert Peremans, and Dieter Vanderelst. Robotic models of obstacle avoidance in bats. In The 2018 Conference on Artificial Life: A Hybrid of the European Conference on Artificial Life (ECAL) and the International Conference on the Synthesis and Simulation of Living Systems (ALIFE), pages 463–464. MIT Press, 2019a.

Carl Bou Mansour, Elijah Koreman, Jan Steckel, Herbert Peremans, and Dieter Vanderelst. Avoidance of non-localizable obstacles in echolocating bats: A robotic model. PLoS Computational Biology, 15(12), 2019b.

Jonathan W Mink. The basal ganglia: focused selection and inhibition of competing motor programs. Progress in Neurobiology, 50(4):381–425, 1996.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Franz Peter Möhres and Gerhard Neuweiler. Die Ultraschallorientierung der Grossblattfledermäuse (Chiroptera-Megadermatidae). Zeitschrift für vergleichende Physiologie, 53(3):195–227, 1966.

P Read Montague, Peter Dayan, and Terrence J Sejnowski. A framework for mesencephalic dopamine systems based on predictive hebbian learning. Journal of Neuroscience, 16(5):1936–1947, 1996.

Brian CJ Moore. An introduction to the psychology of hearing. Brill, 2012.

C F Moss and A Surlykke. Auditory scene analysis by echolocation in bats. The Journal of the Acoustical Society of America, 110(4):2207–2226, 2001. ISSN 0001-4966. doi: 10.1121/1.1398051.

Jörg Müller, Axel Rottmann, Leonhard M Reindl, and Wolfram Burgard. A probabilistic sonar sensor model for robust localization of a small-size blimp in indoor environments using a particle filter. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 3589–3594. IEEE, 2009.

Emre O Neftci and Bruno B Averbeck. Reinforcement learning in artificial and biological systems. Nature Machine Intelligence, 1(3):133–143, 2019.

John O’Doherty, Peter Dayan, Johannes Schultz, Ralf Deichmann, Karl Friston, and Raymond J Dolan. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. science, 304(5669):452–454, 2004.

Herbert Peremans, F De Mey, and F Schillebeeckx. Man-made versus biological in-air sonar systems. Frontiers in Sensing, 3:195–206, 2012. doi: 10.1007/978-3-211-99749-9_13.

Luca Petricca, Per Ohlckers, and Christopher Grinde. Micro- and nano-air vehicles: State of the art. International Journal of Aerospace Engineering, 2011, 2011. ISSN 1687-5966. doi: 10.1155/2011/214549.

Anthony Petrites, Oliver Eng, Donald Mowlds, James Simmons, and Caroline DeLong. Interpulse interval modulation by echolocating big brown bats in different densities of obstacle clutter. Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology, 195:603–617, 2009. ISSN 0340-7594. doi: 10.1007/s00359-009-0435-6.

Carmine Tommaso Recchiuto, Rezia Molfino, Anders Hedenström, Herbert Peremans, Vittorio Cipolla, Aldo Frediani, Emanuele Rizzo, and Giovanni Gerardo Muscolo. Bioinspired mechanisms and sensorimotor schemes for flying: A preliminary study for a robotic bat. In Advances in Autonomous Robotics Systems, pages 37–47. Springer, 2014.

Jonas Reijniers and Herbert Peremans. On population encoding and decoding of auditory information for bat echolocation. Biological Cybernetics, 102(4):311–326, 2010.

S. Sandig, H.-U. Schnitzler, and A. Denzinger. Echolocation behaviour of the big brown bat (Eptesicus fuscus) in an obstacle avoidance task of increasing difficulty. Journal of Experimental Biology, 217(16):2876–2884, August 2014. ISSN 0022-0949, 1477-9145. doi: 10.1242/jeb.099614.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

H U Schnitzler and E K V Kalko. Echolocation by insect-eating bats. Bioscience, 51(7):557–569, 2001.

Hans-Ulrich Schnitzler, Cynthia F Moss, and Annette Denzinger. From spatial orientation to food acquisition in echolocating bats. Trends in Ecology & Evolution, 18(8):386–394, 2003.

W Schultz, P Dayan, and PR Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.

Anna-Maria Seibert, Jens C Koblitz, Annette Denzinger, and Hans-Ulrich Schnitzler. Scanning behavior in echolocating common pipistrelle bats (Pipistrellus pipistrellus). PLOS ONE, 8(4):e60752, 2013.

J A Simmons, E G Freedman, S B Stevenson, L Chen, and T J Wohlgenant. Clutter interference and the integration time of echoes in the echolocating bat, Eptesicus fuscus. The Journal of the Acoustical Society of America, 86(4):1318–1332, 1989.

JA Simmons, SA Kick, BD Lawrence, C. Hale, C. Bard, and B. Escudie. Acuity of horizontal angle discrimination by the echolocating bat, Eptesicus fuscus. Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology, 153(3):321–330, 1983.

James A. Simmons. Bats use a neuronally implemented computational acoustic model to form sonar images. Current Opinion in Neurobiology, 22(2):311–319, April 2012. ISSN 0959-4388. doi: 10.1016/j.conb.2012.02.007.

James A. Simmons, Patricia E. Brown, Carlos E. Vargas-Irwin, and Andrea M. Simmons. Big brown bats are challenged by acoustically-guided flights through a circular tunnel of hoops. Scientific Reports, 10(1), December 2020. ISSN 2045-2322. doi: 10.1038/s41598-020-57632-4.

Stanford Artificial Intelligence Laboratory et al. Robotic operating system, 2018. URL https://www.ros.org.

Jan Steckel and Herbert Peremans. Batslam: Simultaneous localization and mapping using biomimetic sonar. PLOS ONE, 8(1):e54076, January 2013. doi: 10.1371/journal.pone.0054076.

Jan Steckel, Wouter Vanduren, and Herbert Peremans. 3d localization by a biomimetic sonar system in a fire-fighting application. In Image and Signal Processing (CISP), 2011 4th International Congress on, volume 5, pages 2549–2553. IEEE, 2011.

Bastian Steder, Axel Rottmann, Giorgio Grisetti, Cyrill Stachniss, and Wolfram Burgard. Autonomous navigation for small flying vehicles. In Workshop on Micro Aerial Vehicles at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). San Diego, CA, USA, 2007.

Robert C Stones and Leo P Branick. Use of hearing in homing by two species of Myotis bats. Journal of Mammalogy, 50(1):157–160, 1969.

A Surlykke and O Bojesen. Integration time for short broad band clicks in echolocating fm-bats (Eptesicus fuscus). Journal of Comparative Physiology A, 178(2):235–241, 1996.

Annemarie Surlykke, Simon Boel Pedersen, and Lasse Jakobsen. Echolocating bats emit a highly directional sonar sound beam in the field. Proceedings of the Royal Society B: Biological Sciences, 276(1658):853–860, 2009.

Annemarie Surlykke, James A Simmons, and Cynthia F Moss. Perceiving the world through echolocation and vision. In Bat Bioacoustics, pages 265–288. Springer, 2016.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Nachum Ulanovsky and Cynthia F Moss. What the bat's voice tells the bat's brain. Proceedings of the National Academy of Sciences of the United States of America, 105(25):8491–8498, 2008.

RI Urick. Principles of underwater acoustics. McGraw-Hill, New York, 1983.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

D Vanderelst, H Peremans, and M.W. Holderied. Sensorimotor model of obstacle avoidance in echolocating bats. PLoS Computational Biology, 2015a. Paper accepted pending minor revisions.

Dieter Vanderelst and Herbert Peremans. A computational model of mapping in echolocating bats. Animal Behaviour, 131:73–88, 2017.

Dieter Vanderelst and Herbert Peremans. Modeling bat prey capture in echolocating bats: The feasibility of reactive pursuit. Journal of theoretical biology, 456:305–314, 2018.

Dieter Vanderelst, Fons De Mey, Herbert Peremans, Inga Geipel, Elisabeth Kalko, and Uwe Firzlaff. What noseleaves do for fm bats depends on their degree of sensorial specialization. PLOS ONE, 5(8):e11893, 2010.

Dieter Vanderelst, Marc W. Holderied, and Herbert Peremans. Sensorimotor model of obstacle avoidance in echolocating bats. PLoS Computational Biology, 11(10): e1004484, October 2015b.

Dieter Vanderelst, Jan Steckel, Andre Boen, Herbert Peremans, and Marc W Holderied. Place recognition using batlike sonar. Elife, 5:e14188, 2016.

D Von Helversen and O Von Helversen. Object recognition by echolocation: a nectar-feeding bat exploiting the flowers of a rain forest vine. Journal of Comparative Physiology A, 189(5):327–336, 2003.

Otto Von Helversen and York Winter. Glossophagine bats and their flowers: Costs and benefits for plants and pollinators. Bat ecology, page 346, 2005.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Michaela Warnecke, Silvio Macías, Benjamin Falk, and Cynthia F. Moss. Echo interval and not echo intensity drives bat flight behavior in structured corridors. The Journal of Experimental Biology, 221(24):jeb191155, December 2018. ISSN 0022-0949, 1477-9145. doi: 10.1242/jeb.191155.

Lutz Wiegrebe. An autocorrelation model of bat sonar. Biological cybernetics, 98(6): 587–595, 2008.

Lutz Wiegrebe and Sabine Schmidt. Temporal integration in the echolocating bat, megaderma lyra. Hearing research, 102(1):35–42, 1996.

Timothy C Williams, Janet M Williams, and Donald R Griffin. The homing ability of the neotropical bat phyllostomus hastatus, with evidence for visual orientation. Animal behaviour, 14(4):468–473, 1966.

Robert J Wood, Ben Finio, Michael Karpelson, Kevin Ma, Néstor Osvaldo Pérez-Arancibia, Pratheev S Sreetharan, Hiroto Tanaka, and John Peter Whitney. Progress on 'pico' air vehicles. The International Journal of Robotics Research, 31(11):1292–1302, 2012.

J M Wotton and J A Simmons. Spectral cues and perception of the vertical position of targets by the big brown bat, Eptesicus fuscus. The Journal of the Acoustical Society of America, 107, 2000.

Yossi Yovel, Peter Stilz, Matthias O. Franz, Arjan Boonman, and Hans-Ulrich Schnitzler. What a plant sounds like: The statistics of vegetation echoes as received by echolocating bats. PLoS Computational Biology, 5(7), July 2009. ISSN 1553-734X. doi: 10.1371/journal.pcbi.1000429.

Yossi Yovel, Matthias O Franz, Peter Stilz, and Hans-Ulrich Schnitzler. Complex echo classification by echo-locating bats: a review. Journal of Comparative Physiology A, 197(5):475–490, 2011.

Iker Zamora, Nestor Gonzalez Lopez, Victor Mayoral Vilches, and Alejandro Hernandez Cordero. Extending the openai gym for robotics: a toolkit for reinforcement learning using ros and gazebo. arXiv preprint arXiv:1608.05742, 2016.

Appendix A

Prerequisite material

A.1. Reinforcement Learning

Reinforcement Learning (RL) is a class of machine learning where the agent learns a sequence of actions that maximize the expected discounted return Gt by interacting with the environment [Sutton and Barto, 2018].

G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (A.1.1)

The sequence of actions taken by the agent is called its policy, denoted π, which defines the behavior of the agent by providing a mapping between states and actions. A policy can be either stochastic, giving a probability distribution over actions given the state, π(a_t|s_t), or deterministic, a rigid mapping between states and actions that prescribes which action a the agent has to take in a given state s_t at time t [Sutton and Barto, 2018]. Compared to supervised learning, RL requires little direct human intervention, as it uses a reward function rather than labeled examples to generate the desired behavior in the system.
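As a small numerical illustration of equation A.1.1, the Python snippet below computes the discounted return for an assumed, finite reward sequence; the reward values and the discount factor are chosen arbitrarily for illustration.

# Discounted return G_t (equation A.1.1) for an assumed finite reward sequence.
gamma = 0.99
rewards = [1.0, 0.0, -0.5, 2.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)   # 1.0 + 0.99*0.0 + 0.99^2*(-0.5) + 0.99^3*2.0 ≈ 2.45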

Figure A.1.: The Markov decision process flowchart, showing how the agent interacts with the environment through actions; in return, the environment transitions to a new state and provides a reward for that time step.

Figure A.2.: The Markov decision process unrolled over three time steps, illustrating how the interactions between the agent and the environment progress.

In RL, the agent operates in a Markov decision process (MDP). At each time step, the agent experiences a current state s_t, selects and executes an action a_t based on this state, and then receives a reward r_t. Furthermore, the environment has a transition function P(s'|s, a) and a reward function R(s, a), both of which are unknown to the agent [Sutton and Barto, 2018, Mnih et al., 2013]. The agent's goal is to learn the policy π that maps states to actions (or to probabilities of actions, π(a_t|s_t)) so as to maximize the expected discounted return of equation A.1.1. RL can be broadly classified into two categories: model-based RL and model-free RL. In a model-based RL algorithm, the agent first tries to learn the environment's transition and reward functions through interaction, and then uses them to learn the sequence of actions that maximizes the cumulative reward. In model-free RL, the agent directly tries to learn the sequence of actions that maximizes the return G_t of equation A.1.1, without any prior knowledge about the environment. Examples of model-free algorithms are Q-learning, temporal-difference learning, SARSA, and policy search methods. Model-free RL algorithms can be further divided into two categories: value-based learning and policy-based learning [Sutton and Barto, 2018]. Value-based RL methods concentrate on learning a value function or Q-function and then deriving the policy from it. The action-value (or quality) function Q(s, a) gives the expected return for a state-action pair; in general, it expresses how good a particular action a_t is when the agent is in state s_t at time step t. The optimal action-value function Q*(s, a) gives the value that maximizes the expected return, and Q*(s, a) obeys the Bellman equation [Bellman, 1966, Sutton and Barto, 2018]. The Bellman equation tells us that the difference between the expected value of a trajectory from time step t to the end and the expected value of that trajectory from the next time step t + 1 to the end is simply the reward at time step t; this relationship holds under the condition that the trajectory yields the maximum expected return. This property can be used to learn the value function using dynamic programming [Sutton and Barto, 2018]. The agent uses this value function to obtain the policy, by either greedily selecting the action with the maximum Q-value (deterministic) or sampling from a probability distribution over actions based on the Q-values (stochastic) [Mnih et al., 2013]. In policy-based RL methods, the agent skips the value function and directly learns the policy that maximizes the discounted return [Sutton and Barto, 2018, Sutton et al., 2000].

J(θ) = Σ_s Σ_a π_θ(a|s) Q(s, a)    (A.1.2)

∇_θ J(θ) = E_π[Q^π(s, a) ∇_θ π_θ(a|s)]    (A.1.3)

A.2. Q-Learning

Among the value-based RL algorithms, one of the most frequently used is Q-learning. As discussed earlier, the agent's goal is to learn the sequence of actions that maximizes the discounted return. This is achieved in Q-learning by learning the optimal action-value function Q*(s, a). The optimal value function Q*(s, a) can be defined as the expected discounted return obtained when, starting from state s and action a, all subsequent actions are chosen optimally [Mnih et al., 2013]. This can be written as the expected return from time step t if the agent follows an optimal policy π*,

Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π]    (A.2.1)

As discussed earlier, the optimal value function obeys the Bellman equation. The Bellman equation can be used to estimate the optimal value function with dynamic programming (value iteration), by minimizing the equation given below [Mnih et al., 2013].

Q(s_t, a_t) ← Q(s_t, a_t) + α (y_t − Q(s_t, a_t))    (A.2.2)

y_t = r_t + γ max_a Q(s_{t+1}, a)    (A.2.3)
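To make the update in equations A.2.2 and A.2.3 concrete, the following is a minimal tabular Q-learning sketch in Python; the state and action counts, hyperparameter values, and function names are illustrative assumptions, not the settings used in this work.

import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))   # tabular Q-function

def q_update(s, a, r, s_next):
    # Bootstrapped target y_t: reward plus discounted value of the best next action (A.2.3).
    y = r + gamma * Q[s_next].max()
    # Move Q(s, a) a fraction alpha towards the target (A.2.2).
    Q[s, a] += alpha * (y - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=4)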

When using the Q-learning algorithm with high-dimensional inputs such as images or audio, we use a function approximator with parameters θ to estimate the action-value function Q(s, a) [Mnih et al., 2013, Sutton and Barto, 2018].

Q(s, a; θ) ≈ Q(s, a) (A.2.4)

This Q-function approximator can be of various types, ranging from linear models to non-linear approximators such as neural networks. When a deep neural network is used as the function approximator for the Q-function, the result is known as a Deep Q-Network (DQN) [Mnih et al., 2013, Sutton and Barto, 2018]. The loss function used to optimize the Q-function, i.e., the DQN, is given below.

L_i(θ_i) = E_{s,a∼p(·)}[(y_i − Q(s, a; θ_i))²]    (A.2.5)

Here, y_i is the same as y_t in equation A.2.3. One of the major advantages of the DQN algorithm is that it is an off-policy learning algorithm, as it uses the greedy strategy a = argmax_a Q(s, a; θ), which can be seen from equations A.2.5 and A.2.3. Because of this, the agent's learning does not depend on the current action the robot is taking (online), so the experiences and observations gathered by the robot over time can be stored and used later for learning. This is called the experience replay memory, and it improves both the speed of training and the reusability of experience. During training, the experiences stored in the replay memory are randomly sampled and given to the DQN for training. This is done to break the strong correlations between subsequent samples collected during training, which increases the stability of training the DQN. For more details on the DQN algorithm, refer to [Mnih et al., 2013]. In this work, we use D3QN, a vanilla DQN with two modifications, the dueling architecture and Double DQN (DDQN), combined with a prioritized replay buffer. These modifications improve the performance of the agent considerably.
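As an illustration of the uniform experience replay described above, the sketch below stores transitions in a bounded buffer and samples mini-batches at random; the buffer size, batch size, and function names are illustrative assumptions rather than the values used in this work.

import random
from collections import deque

# Uniform experience replay: transitions are (state, action, reward, next_state, done) tuples.
replay = deque(maxlen=100_000)

def store(transition):
    replay.append(transition)

def sample_batch(batch_size=32):
    # Uniform random sampling breaks the correlation between consecutive samples.
    idx = random.sample(range(len(replay)), batch_size)
    return [replay[i] for i in idx]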

A.3. Double DQN

In vanilla DQN, the same Q-network is used both to select the action greedily (using the max operator) and to estimate its Q-value, as can be seen in equation A.2.3. Through many experiments, Van Hasselt et al. [2016] noted that this is the cause of over-optimistic Q-value estimates from the DQN. To avoid this issue, they suggested using two Q-networks: one for selecting the action that gives the maximum Q-value and another for estimating the value of that action. In practice, the online Q-network selects the action, while the existing target network (used for estimating the target Q-value, and already introduced to stabilize learning in [Mnih et al., 2013]) evaluates the selected action [Van Hasselt et al., 2016].

Y_t^{DoubleQ} = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ')    (A.3.1)

Equation A.3.1 replaces the target used in DQN, equation A.2.3. As can be seen in equation A.3.1, the Q-network with parameters θ_t is used to select the action that gives the maximum Q-value, while the network with parameters θ' is used to estimate the Q-value entering the loss function A.2.5.
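A minimal sketch of the Double DQN target of equation A.3.1 is given below. It assumes q_online and q_target are callables returning a vector of Q-values (one per action) for a given state; these names and the default discount factor are illustrative.

import numpy as np

def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99, done=False):
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))      # online network selects the action
    return r + gamma * q_target(s_next)[a_star]    # target network evaluates that action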

A.4. Dueling Architecture

The dueling architecture is a modification of the DQN Q-network architecture. The dueling architecture [Wang et al., 2015] differs from DQN by dividing the Q-network into two streams: one for estimating the state value function V(s) and another for the action advantage function A(s, a), since the Q-value can be decomposed into the sum of the state value and the action advantage, as shown in equation A.4.1. The motivation is that, with the dueling architecture, the network learns the state value separately, without having to learn the effect of each action. This results in large improvements in policy evaluation, as the action advantage stream can identify the correct action more precisely once the redundant, shared state value has been removed [Wang et al., 2015].

Q(s, a) = V (s) + A(s, a) (A.4.1)

The dueling architecture of [Wang et al., 2015] uses equation A.4.1 to learn two streams (a value stream and an action-advantage stream). The dueling architecture can be integrated with DQN by substituting this equation for Q(s, a) in the DQN equations A.3.1 and A.2.5. The state value function appearing in equation A.4.1 is given below.

V(s) = E_{a∼π(s)}[Q^π(s, a)]    (A.4.2)

From equation A.4.2 it is clear that the state value function is the expected value of the Q-function Q(s, a) over the actions available in state s_t [Wang et al., 2015]. The condition that the action-advantage function has to satisfy for equation A.4.1 to hold is given below in equation A.4.3.

E_{a∼π(s)}[A^π(s, a)] = 0    (A.4.3)

We use a neural network as the function approximator for estimating both the value and the action-advantage function. The estimate of the Q-function of equation A.4.1 by the dueling network, with the parameters included, is given below.

Q(s, a; θ, α, β) = V (s; θ, β) + A(s, a; θ, α) (A.4.4)

Equation A.4.4 replaces Q(s, a) in equation A.2.5, which is then used to train the dueling network. Since the neural network estimates the state value function V(s) and the action-advantage function A(s, a), and only their sum is used in the loss function A.2.5, we cannot recover V(s) and A(s, a) uniquely: as can be seen from equation A.4.4, adding a constant to V(s) and subtracting the same constant from A(s, a) yields the same Q-value. This creates an identifiability issue between V and A [Wang et al., 2015]. To address it, Wang et al. [2015] suggest adjusting equation A.4.4 by forcing the action-advantage function to have zero advantage at the chosen action, as shown in the equation below.

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a'∈|A|} A(s, a'; θ, α))    (A.4.5)

An alternative that solves this issue, also suggested by [Wang et al., 2015], is to replace the max operator in equation A.4.5 with an average, as shown in equation A.4.6:

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_{a'} A(s, a'; θ, α))    (A.4.6)
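As an illustration of how the two streams are combined with the mean-subtracted advantage of equation A.4.6, the sketch below assumes the scalar state value v and the per-action advantages adv have already been produced by the two network streams; the function and the example values are illustrative, not the network used in this work.

import numpy as np

def dueling_q_values(v, adv):
    # Combine the streams: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')), equation A.4.6.
    adv = np.asarray(adv, dtype=float)
    return v + (adv - adv.mean())

print(dueling_q_values(1.5, [0.2, -0.1, 0.5]))   # one Q-value per action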

A.4.1. Preference of D3QN - Dueling Double Deep Q-network

The authors of the dueling architecture [Wang et al., 2015] highlighted the importance of the Double Deep Q-Network (DDQN) in their work and performed experiments combining DDQN with the dueling architecture. These experiments showed that the two modifications work very well together and outperform the plain DDQN model. Wang et al. [2015] therefore suggested using DDQN and the dueling architecture together, and we adopt both modifications in this work. We refer to the combination as the Dueling Double Deep Q-Network (D3QN).

A.5. Prioritized Experience Replay Memory

Experience replay allows us to reuse past experience because DQN is an off-policy learning algorithm [Mnih et al., 2013]. In standard experience replay, we store experiences in memory and sample them uniformly for learning, without any prioritization. An improvement on this is prioritized experience replay [Schaul et al., 2015], in which experience tuples with a high TD error are given more priority and are sampled more often for learning. The TD error is given in the equation below.

δ_j = R_j + γ_j Q_target(S_j, argmax_a Q(S_j, a)) − Q(S_{j−1}, a_{j−1})    (A.5.1)

The priority is given by,

p_j ← |δ_j| + ε    (A.5.2)

where ε is a small positive constant. The probability of sampling a particular experience from the replay buffer for learning is given by the equation below [Schaul et al., 2015]:

P(i) = p_i^α / Σ_k p_k^α    (A.5.3)

Notice that if α equals zero, then p_i^α equals 1 for every experience, which results in uniform sampling of memories and gives back the original replay memory without prioritization; α equal to 1 results in fully greedy prioritization. Fully greedy prioritized sampling causes over-fitting to high-priority data [Schaul et al., 2015]. To avoid over-fitting, we use stochastic prioritization, interpolating between greedy prioritized sampling and uniform sampling by annealing the α value from 0.9 to 0.01. Alternatively, one can select an α value of 0.6, which gives some prioritization while still allowing random sampling. By introducing prioritization in sampling, we introduce bias into the learning algorithm [Schaul et al., 2015], which destabilizes the learning process. To correct the bias, we introduce importance sampling (IS) weights, which compensate for the probabilities used for prioritization [Schaul et al., 2015].

w_i = (1/N · 1/P(i))^β    (A.5.4)

To speed up learning, we allow some bias during the initial phase of training, when learning is very chaotic and the effect of the bias is minimal. The exponent β is therefore annealed from 0 to 1 over the course of training [Schaul et al., 2015].
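The sketch below illustrates equations A.5.2 to A.5.4: priorities from TD errors, the resulting sampling distribution, and the importance sampling weights. The array of TD errors, the hyperparameter values, and the max-normalization of the weights (as used by Schaul et al. [2015]) are assumptions for illustration.

import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    p = np.abs(td_errors) + eps                     # priorities, equation A.5.2
    probs = p**alpha / np.sum(p**alpha)             # sampling probabilities, equation A.5.3
    idx = np.random.choice(len(p), size=batch_size, p=probs)
    weights = (1.0 / (len(p) * probs[idx]))**beta   # IS weights, equation A.5.4
    weights /= weights.max()                        # normalize for stability
    return idx, weights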
