A Robot Exploration Strategy Based on Q-Learning Network

Tai Lei
Department of Mechanical and Biomedical Engineering
City University of Hong Kong
Email: [email protected]

Liu Ming
Department of Mechanical and Biomedical Engineering
City University of Hong Kong
Email: [email protected]

Abstract—This paper introduces a reinforcement learning method for exploring a corridor environment with the depth information from an RGB-D sensor only. The robot controller achieves obstacle avoidance ability through feature maps pre-trained on the depth information. The system is based on the recent Deep Q-Network (DQN) framework, where a convolution neural network structure is adopted in the Q-value estimation of the Q-learning method. We separate the DQN into a supervised deep learning structure and a Q-learning network. Experiments with a Turtlebot in the Gazebo simulation environment show robustness to different kinds of corridor environments. All of the experiments use the same pre-trained deep learning structure. Note that the robot travels in environments that are different from the pre-training environment. It is the first time that raw sensor information is used to build such an exploration strategy for robotics through reinforcement learning.

I. INTRODUCTION

Mobile robot exploration of an unknown environment is a common problem in robot applications such as rescue and mining. Normally, with information from vision or depth sensors, a robot requires complicated, human-designed logic about obstacles and topological mapping of the environment [1] [2]. However, there is no high-level, human-brain-like intelligence in these traditional approaches. Recently, machine learning has attracted more and more attention. In this paper, we want to develop a machine learning method for robots to explore an unknown environment using raw sensor inputs.

Regarding the requirements mentioned above, Deep Reinforcement Learning, which merges reinforcement learning and deep learning, is a suitable method for this scenario. For example, Google DeepMind implemented a Deep Q-Network (DQN) [3] on 49 Atari-2600 games. This method outperformed almost all other state-of-the-art reinforcement learning methods and 75% of human players, without any prior knowledge of the Atari 2600 games. It showed great potential for applying this algorithm in other related fields, including robotic exploration.

Unlike the DQN mentioned above, we apply this learning approach in two steps for robotic exploration of an unknown environment. Firstly, we build a supervised learning model that takes the depth information as input and the command as output. The data are manually labeled with control commands that tune the moving directions of the robot. This supervised learning model is implemented with a Convolution Neural Network [4]. Secondly, a neural network structure with three fully-connected hidden layers is used to mimic the reinforcement learning procedure, taking the feature maps as input. The feature maps are the output of the second-to-last layer of the previously trained supervised learning model. This reinforcement learning framework is defined as a Q-network. In this paper, we mainly introduce the second step.

Particularly, we stress the following contributions:
• We design a revised version of the DQN for a moving robot to explore an unknown environment. The project is implemented with Gazebo and ROS-based interfaces. Feature learning is based on Caffe [5], a popular tool-kit for deep learning.
• The model is validated in several simulated environments. We also discuss future work, such as adding noise to verify the robustness of the system and applying it in real environments.

II. RELATED WORK

A. Reinforcement Learning in Robotics

Reinforcement Learning (RL) [6] is an efficient way for robots to acquire data and learn skills. With an appropriate and abstract reward, the robot can learn a complex strategy without ground truth as reference. It was recently applied to mastering the strategy of Go (an ancient Chinese board game regarded as one of the most challenging tasks for artificial intelligence) [7].
This indicates the great feasibility of reinforcement learning in other fields. RL was also applied to autonomous helicopter flight [8] and autonomous inverted helicopter flight [9], by collecting flight data and learning a non-linear model of the aerodynamics.

Reinforcement learning was also shown to substantially improve the motion behaviour of a humanoid robot reacting to visually identified objects [10], by building an autonomous strategy with little prior knowledge. In this application, the robot showed continuously improving performance over time.

Most reinforcement learning methods for robotics are based on state information. To our knowledge, raw image sensor information has never been considered directly.

Fig. 1. Structure of the CNN layers. Depth images after down-sampling are fed into the model. Three convolution layers, each followed by pooling and rectifier layers, are connected together. After that, the feature maps of every input are fully connected and fed to the softmax layer of the classifier.

B. CNN in Perception

The Convolution Neural Network (CNN) is a classic visual learning method. With the development of large-scale computing and GPU acceleration, huge CNN frameworks can be set up with tens of convolution layers.

Normally, a CNN is used to solve a classification problem with a softmax layer, such as ImageNet classification [11] [12] and face recognition [4]. With a regression layer that optimizes the Euclidean loss, the feature maps extracted by a CNN can also be applied to key-point search problems [13] [14]. Computer-vision-based recognition methods mainly rely on feature detection and extraction [15] [16] [17], whereas a CNN extracts such features by self-learning.

In terms of robotics, a CNN was also used to perceive environment information for visual navigation [18]. However, a supervised-learning-based method requires a complicated and time-consuming training period, and the trained model cannot be applied directly in a different environment.
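As an illustration of the pre-training structure in Fig. 1, the sketch below builds a feature extractor with three convolution/pooling/rectifier stages, a fully-connected layer whose activations serve as the feature map, and a softmax classifier over the moving commands. It is written in PyTorch rather than the Caffe tool-kit used in the paper, and all channel counts, kernel sizes and the feature dimension are assumptions for illustration only.

```python
# Minimal sketch of the Fig. 1 structure (PyTorch, not the authors' Caffe model).
# Layer sizes are assumptions; only the overall topology follows the paper.
import torch
import torch.nn as nn

class DepthCNN(nn.Module):
    def __init__(self, num_commands: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),   # conv. pool. relu.
            nn.Conv2d(32, 64, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),  # conv. pool. relu.
            nn.Conv2d(64, 64, kernel_size=3), nn.MaxPool2d(2), nn.ReLU(),  # conv. pool. relu.
        )
        self.fc = nn.LazyLinear(512)                      # produces the feature map
        self.classifier = nn.Linear(512, num_commands)    # softmax over moving commands

    def forward(self, depth_image: torch.Tensor):
        x = self.features(depth_image)
        x = torch.flatten(x, start_dim=1)
        feature_map = torch.relu(self.fc(x))
        logits = self.classifier(feature_map)
        # logits are used for supervised pre-training against the labeled commands;
        # feature_map is what is later fed to the Q-network.
        return logits, feature_map
```

During supervised pre-training the logits would be compared with the manually labeled commands through a softmax (cross-entropy) loss; afterwards only the feature map is reused, as the Q-network input described below.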
Algorithm 1 Q-network algorithm
 1: Initialize the action-value Q-network with random weights θ
    Initialize the memory D to store experience replay
    Set the distance threshold l_s = 0.6 m
 2: for episode = 1, M do
 3:    Set the Turtlebot to the start position
       Get the minimum intensity of the depth image as l_min
 4:    while l_min > l_s do
 5:       Capture the real-time feature map x_t
 6:       With probability ε select a random action a_t,
          otherwise select a_t = argmax_a Q(x_t, a; θ)
 7:       Move along the selected direction a_t
          Update l_min with the new depth information
 8:       if l_min < l_s then
 9:          r_t = −50
             x_{t+1} = Null
10:       else
11:          r_t = 1
             Capture the new feature map x_{t+1}
12:       end if
13:       Store the transition (x_t, a_t, r_t, x_{t+1}) in D
          Select a batch of transitions (x_k, a_k, r_k, x_{k+1}) randomly from D
14:       if r_k = −50 then
15:          y_k = r_k
16:       else
17:          y_k = r_k + γ max_{a'} Q(x_{k+1}, a'; θ)
18:       end if
          Update θ through a gradient descent procedure on the batch of (y_k − Q(x_k, a_k; θ))^2
19:    end while
20: end for

Fig. 2. The feature map extracted from the supervised learning model is the input and is reshaped to a one-dimensional vector. After three fully-connected hidden layers of a neural network, it is transformed into the three moving-direction commands as the outputs.

III. IMPLEMENTATION DETAILS

Travelling in an unknown environment with obstacle avoidance ability is the main target of this paper. This task is defined as controlling a ground-moving robot in an environment without any collisions with the obstacles. The input information here is the feature maps extracted by the CNN supervised learning model which was trained in our prior work. A supervised learning model can be used to perceive an environment [18], so the feature maps can be regarded as abstracted information about the environment to some extent. The robot then performs obstacle avoidance through Q-network learning on its own in environments which are a little different from the training environment. The implementation of the experiment includes three parts:
• a simulated 3D environment in Gazebo for a robot to explore.
• a Q-network reinforcement learning framework.
• a simulated Turtlebot with a Kinect sensor in Gazebo, controlled by the Q-network outputs.

TABLE II
TRAINING PARAMETERS AND THEIR VALUES

Parameter             Value
batch size            32
replay memory size    5000
discount factor       0.85
learning rate         0.000001
gradient momentum     0.9
max iteration         15000
step size             10000
gamma                 0.1

The optimal strategy is to select the action maximizing the expected value of r + γQ*(s', a'), if the optimal value Q*(s', a') of the sequence at the next time step is known:

Q*(s, a) = 𝔼_{s'∼E}[ r + γ max_{a'} Q*(s', a') | s, a ]

Instead of optimizing this equation with an iterative updating method, it is common to estimate it with a function approximator. The Q-network in DQN is such a neural-network function approximator with weights θ and Q(s, a; θ) ≈ Q*(s, a). The loss function to train the Q-network is:

L_i(θ_i) = 𝔼_{s,a∼ρ(·)}[ (y_i − Q(s, a; θ_i))^2 ]

where y_i is the target, calculated from the previous iteration result θ_{i−1}, and ρ(s, a) is the probability distribution of sequences s and a. The gradient of the loss function is shown below:

∇_{θ_i} L_i(θ_i) = 𝔼_{s,a∼ρ(·); s'∼E}[ (y_i − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

Fig. (a) Straight Corridor. (b) Circular Corridor.

B. Q-network-based Learning System

To accomplish the task of exploration, we simplify the DQN
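To make the structure of Fig. 2 concrete, the following sketch reshapes the CNN feature map into a one-dimensional vector and passes it through three fully-connected hidden layers, producing one Q-value per moving-direction command. It is written in PyTorch, and the hidden-layer sizes are assumptions, since they are not listed in the text above.

```python
# Sketch of the Q-network of Fig. 2 (PyTorch; hidden sizes are assumptions).
# Input: the CNN feature map reshaped to a 1-D vector (or a batch of them).
# Output: one Q-value for each of the three moving-direction commands.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, feature_dim: int = 512, num_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),   # hidden layer 1
            nn.Linear(256, 128), nn.ReLU(),           # hidden layer 2
            nn.Linear(128, 64), nn.ReLU(),            # hidden layer 3
            nn.Linear(64, num_actions),               # Q(x, a; θ) for each direction
        )

    def forward(self, feature_vector: torch.Tensor) -> torch.Tensor:
        return self.net(feature_vector)
```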
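A compact sketch of the training loop in Algorithm 1 follows, using the parameters of Table II (batch size 32, replay memory 5000, discount factor 0.85, learning rate 10^-6, gradient momentum 0.9) and a Q-network such as the QNetwork sketch above. The environment object with reset(), min_depth(), capture_feature_map() and move() methods is a hypothetical stand-in for the Gazebo/ROS Turtlebot interface, and the fixed ε value is an assumption.

```python
# Sketch of Algorithm 1 with the Table II parameters; the env interface is hypothetical.
import random
from collections import deque
import torch

GAMMA, LR, BATCH, MEMORY = 0.85, 1e-6, 32, 5000    # Table II values
L_S = 0.6                                           # distance threshold l_s in metres

def train(env, q_net, num_episodes: int, epsilon: float = 0.1):
    # SGD with the gradient momentum of Table II
    optimizer = torch.optim.SGD(q_net.parameters(), lr=LR, momentum=0.9)
    D = deque(maxlen=MEMORY)                        # experience replay memory

    for _ in range(num_episodes):
        env.reset()                                 # Turtlebot back to the start position
        while env.min_depth() > L_S:                # l_min > l_s: no collision yet
            x_t = env.capture_feature_map()         # real-time feature map x_t
            if random.random() < epsilon:           # ε-greedy action selection
                a_t = random.randrange(3)
            else:
                a_t = q_net(x_t).argmax().item()    # a_t = argmax_a Q(x_t, a; θ)
            env.move(a_t)                           # move along the selected direction

            if env.min_depth() < L_S:               # collision: terminal transition
                r_t, x_next = -50.0, None
            else:
                r_t, x_next = 1.0, env.capture_feature_map()
            D.append((x_t, a_t, r_t, x_next))

            # sample a random batch of transitions and build the targets y_k
            batch = random.sample(list(D), min(BATCH, len(D)))
            loss = torch.zeros(())
            for x_k, a_k, r_k, x_k1 in batch:
                if x_k1 is None:
                    y_k = torch.tensor(r_k)         # y_k = r_k for terminal transitions
                else:
                    y_k = r_k + GAMMA * q_net(x_k1).max().detach()
                loss = loss + (y_k - q_net(x_k)[a_k]) ** 2

            optimizer.zero_grad()
            loss.backward()                         # gradient step on (y_k − Q(x_k, a_k; θ))²
            optimizer.step()
```

As in Algorithm 1, a collision ends the episode with reward −50, while every collision-free step is rewarded with 1.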
