Training an Artificial Bat: Modeling Sonar-Based Obstacle Avoidance Using Deep Reinforcement Learning
Training an artificial bat: Modeling sonar-based obstacle avoidance using deep reinforcement learning

A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science
Department of Electrical Engineering and Computer Science
COLLEGE OF ENGINEERING AND APPLIED SCIENCE
University of Cincinnati, November 2020

Author: Adithya Venkatesh Mohan
Chair: Dr. Dieter Vanderelst
Committee: Dr. Ali A. Minai, Dr. Zachariah Fuchs

Abstract

Recent evidence suggests that sonar provides bats only with limited information about the environment. Nevertheless, they can fly swiftly through dense environments while avoiding obstacles. Previously, we proposed a model of sonar-based obstacle avoidance that relied only on the interaural level difference of the onset of the echoes. In this paper, we extend this previous model. In particular, we present a model that (1) is equipped with a short-term memory of recent echo trains, and (2) uses the full echo train. Because handcrafting a controller to use more sonar data is challenging, we resort to machine learning to train a robotic model. We find that both extensions increase performance and conclude that these could be used to enhance our models of bat sonar behavior. We discuss the implications of our method and findings for both biology and bio-inspired engineering.

Copyright

© 2020 by Adithya Venkatesh Mohan. All rights reserved.

Acknowledgments

First of all, thanks to Dr. Dieter Vanderelst for supporting me at each and every step of this work and throughout my Master's. Thanks to the University of Cincinnati for giving me this opportunity, and especially to the Department of Electrical Engineering and Computer Science (EECS) of CEAS and the Department of Biology for all their support. Thanks also to the Cognition, Action and Perception Labs for providing the equipment and computers necessary for the experiments. Special thanks to my family and friends for their complete support. Thanks to the 3305 Jefferson group (close friends) for making this graduate life fun-filled. The one life mantra I learned from them is "Rakita rakita rakita oo ...": enjoy your life to the fullest no matter what.

Contents

Abstract
Copyright
Acknowledgments
List of Tables
List of Figures
1. Introduction
2. Background
  2.1. 3D reconstruction of the environment?
  2.2. Sonar for robots
  2.3. Biological plausibility of reinforcement learning
3. Methods
  3.1. Robot and echo generation
  3.2. Neural network and training
    3.2.1. Algorithmic details
  3.3. Conditions
4. Results
5. Discussion & Future Work
  5.1. Biology
  5.2. Engineering
  5.3. Future work
Bibliography
A. Prerequisite material
  A.1. Reinforcement Learning
  A.2. Q-Learning
  A.3. Double DQN
  A.4. Dueling Architecture
    A.4.1. Preference of D3QN - Dueling Double Deep Q-Network
  A.5. Prioritized Experience Replay Memory

List of Tables

3.1. The most important hyperparameters used in training the robot. The learning rate and the batch size are parameters determining neural network optimization. The four bottom parameters pertain to the algorithm, as shown in Algorithm 1.
List of Figures

3.1. Illustration of the software architecture used in the current paper. Gazebo was used to simulate the robot and the environment. We used Gazebo to obtain a point cloud of the environment of the robot. These points were used to approximate the acoustics, i.e., to calculate the echo waveforms returning from the environment. The processed acoustic data was used as input to a neural network. The neural network was trained to learn the relationship between the current sensory state (echoes) and the reward function (a measure of the goodness of obstacle avoidance). The different components of this architecture are explained in more detail in the main text of the paper.
3.2. Illustration of the Kobuki-based TurtleBot used in the experiments: (a) top view and (b) side view.
3.3. Visualization of the 360-degree 2D LIDAR sensor used to obtain the shortest obstacle distance for the reward function.
3.4. Visualization of the point cloud data from the onboard 3D LIDAR sensor, which is used to approximate the sonar sensor.
3.5. Visualization of the combined directionality graph of the emitter and receiver: energy of the echo vs. azimuth and elevation, at a constant distance.
3.6. Visualization of the entire echo generation and pre-processing pipeline, from top to bottom (from a 1000-point 3D point cloud to a reduced echo waveform of 24 floating-point samples).
3.7. Example state St. The rows represent a short-term memory of past echo trains. The columns represent the 24 samples of the sub-sampled echo trains. The intensity of the echoes is encoded in gray scale, with more intense echoes shown brighter.
3.8. Illustration of the Dueling Double Deep Q-Network (D3QN) that was trained to control the robot, based on the echo data contained in the sensory state matrix St (see equation 3.2.1). The CNN filters have a shape of (1, 3) and a stride of 1. This emulates a 1D convolution so that each echo is processed separately. The dueling architecture has two outputs: the state value function V(S), which returns a single value, and the advantage function A(S, a), a vector of length 3 returning the advantage of taking each action. The Q-function is calculated from both: Q(S, a) = V(S) + A(S, a) [Wang et al., 2015]. The network is optimized using gradient descent (i.e., RMSprop).
3.9. Example state St given to Condition 1 (Memory). As explained above, all rows in the image are used, providing a memory of 3, and the complete echo train is given.
3.10. Example state St given to Condition 2 (No Memory). The first four rows from the top are blacked out, as the echoes at t − 1 and t − 2 from equation 3.2.1 are zeroed out.
3.11. Example state St given to Condition 3 (Onset). All rows are used, since the echoes at t − 1 and t − 2 are given, but the complete echo train is not: the first (onset) echo value is assigned to the rest of the echo train.
4.1. Reward return vs. time step t during the training period. (a) Memory condition: full memory (T = 3) and complete echo train. (b) No Memory condition (T = 1) with complete echo train. (c) Onset condition with full memory (T = 3).
4.2. Example trajectories for each condition and each environment. Notice that the robot was trained in the Simple environment and evaluated in all three environments. The trajectories shown for the Memory and No Memory conditions were successfully completed (i.e., 1000 steps of 15 cm each). In contrast, for the Onset condition, the trajectories are rather short. This is because the robot was less successful at negotiating the environments. Indeed, each of the three trajectories shown for this condition ended in a collision soon after the start (locations indicated with a star). The black scale bar in each of the top panels is 1 m long.
4.3. Results of the robot evaluation, for each condition (Memory, No Memory, and Onset). Each row depicts the results for a different environment. Top row: Simple, middle row: Narrow, and bottom row: Circle. The results are averaged across 15 evaluation runs (each starting at a different location in the arena). (a, d, g) Violin plots of the distance to the nearest wall at each time step. (b, e, h) Stacked barplots showing the proportion of each of the three possible actions taken. (c, f, i) Barplots showing the distance covered without colliding. Error bars give the standard error (n = 15). Note that the maximum distance covered by the robot was 150 m (15 cm × 1000 steps, indicated by a horizontal line). As the robot's speed was constant, the plots also show the time (in seconds) driven by the robots.
A.1. Flowchart of the Markov decision process, showing how the agent interacts with the environment through actions; in return, the environment transitions to a new state and yields a reward for that time step.
A.2. The Markov decision process unrolled over three time steps, showing how the interactions between the agent and the environment progress.

Chapter 1

Introduction

Bats are often assumed to infer the 3D position and identity of objects from the echoes they receive [e.g., Barchi et al., 2013, Moss and Surlykke, 2001, Schnitzler et al., 2003, Simmons, 2012, Ulanovsky and Moss, 2008, Clare and Holderied, 2015, Geipel et al., 2013]. This view on bat echolocation has its origin in experimental findings showing that bats can locate single targets (typically, a small sphere) with high accuracy [e.g., Simmons et al., 1983, Lawrence and Simmons, 1982, Wotton and Simmons, 2000]. However, the ability to locate isolated objects under favorable conditions does not necessarily imply that bats can reconstruct the layout of multiple complex reflectors (e.g., trees and shrubs) under natural conditions.
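To make the dueling computation described in the caption of Figure 3.8 concrete, the following is a minimal PyTorch sketch of such a network head. The channel count and the number of rows in the state matrix are assumptions chosen for demonstration; the (1, 3) filter shape, the 24-sample echo trains, the three actions, and the aggregation Q(S, a) = V(S) + A(S, a) follow the caption.

```python
# Minimal sketch of a dueling Q-head, after the caption of Figure 3.8.
# Assumptions (not from the text): 16 convolutional channels and a state
# matrix with 6 rows (e.g., two ears x three remembered echo trains).
import torch
import torch.nn as nn


class DuelingQHead(nn.Module):
    def __init__(self, state_rows: int = 6, echo_samples: int = 24, n_actions: int = 3):
        super().__init__()
        # (1, 3) filters with stride 1 emulate a 1D convolution, so each
        # echo train (row of the state matrix) is processed separately.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 3), stride=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 16 * state_rows * (echo_samples - 2)
        # Two dueling streams: a scalar state value V(S) and a
        # per-action advantage vector A(S, a).
        self.value = nn.Linear(feat_dim, 1)
        self.advantage = nn.Linear(feat_dim, n_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state has shape (batch, 1, state_rows, echo_samples)
        x = self.features(state)
        v = self.value(x)        # (batch, 1)
        a = self.advantage(x)    # (batch, n_actions)
        return v + a             # Q(S, a) = V(S) + A(S, a), per the caption


# Example: one state matrix of 6 rows x 24 echo samples yields 3 Q-values.
q = DuelingQHead()(torch.zeros(1, 1, 6, 24))
print(q.shape)  # torch.Size([1, 3])
```

Wang et al. [2015] additionally subtract the mean advantage before adding it to V(S) to make the decomposition identifiable; the sketch shows only the simpler sum stated in the caption.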