Continuous Control with Deep Reinforcement Learning for Autonomous Vessels
Nader Zare*, Bruno Brandoli*, Mahtab Sarvmaili
Institute for Big Data Analytics, Dalhousie University, Halifax
[email protected], [email protected], [email protected]

Amilcar Soares
Department of Computer Science, Memorial University of Newfoundland, St. John's
[email protected]

Stan Matwin*
Institute for Big Data Analytics, Dalhousie University, Halifax
Institute for Computer Science, Polish Academy of Sciences, Warsaw
[email protected]

arXiv:2106.14130v1 [cs.AI] 27 Jun 2021

ABSTRACT
Maritime autonomous transportation has played a crucial role in the globalization of the world economy. Deep Reinforcement Learning (DRL) has been applied to automatic path planning to simulate vessel collision-avoidance situations in open seas. End-to-end approaches that learn complex mappings directly from the input have poor generalization when reaching targets in different environments. In this work, we present a new strategy called state-action rotation to improve the agent's performance in unseen situations by rotating the obtained experience (state-action-state) and preserving it in the replay buffer. We designed our model based on Deep Deterministic Policy Gradient, a local view maker, and a planner. Our agent uses two deep Convolutional Neural Networks to estimate the policy and action-value functions. The proposed model was exhaustively trained and tested in maritime scenarios with real maps from cities such as Montreal and Halifax. Experimental results show that state-action rotation on top of the CVN consistently improves the rate of arrival to a destination (RATD) by up to 11.96% with respect to the Vessel Navigator with Planner and Local View (VNPLV), and that it achieves superior performance in unseen maps by up to 30.82%. Our approach exhibits advantages in terms of robustness when tested in a new environment, supporting the idea that generalization can be achieved by using state-action rotation.

KEYWORDS
Deep Reinforcement Learning, Path Planning and Obstacle Avoidance, Rotation Strategy, Autonomous Vessel Navigation

1 INTRODUCTION
Deep Reinforcement Learning (DRL) has become one of the most active research areas in Artificial Intelligence (AI). This remarkable surge started by obtaining human-level performance in Atari games [16] without any prior knowledge, and continued by defeating the best human player in the game of Go [19]. Since then, DRL has been widely employed for robotic control [8, 12, 24], strategic games [11, 20, 21], and autonomous car navigation [10, 17]. Thereby, DRL has strong potential to process and make decisions in challenging and dynamic environments, such as autonomous vessel navigation and collision avoidance.

Traditionally, ship navigation has been performed entirely by humans. However, advances in maritime technologies and AI algorithms have great potential to make vessel navigation semi- or fully autonomous. An autonomous navigation system has the ability to guide its operator in discovering the near-optimum trajectory for ship collision avoidance [22].
This idea is known as path planning, in which a system attempts to find the optimal path to reach a target point while avoiding obstacles. The algorithms proposed in [4, 5, 14] require complete information about the real-world environment. In practice, the environment is unknown and must be explored by one or more agents; requiring prior knowledge to train the agent is therefore a restrictive assumption. In addition, finding the shortest path between a departure point and a landing location is itself a challenge, because autonomous vessels cannot navigate close to the shore, in areas that are usually shallow and full of stones and terrain banks. Regarding RL-based navigation, the main concern is model generalization to different locations.

This problem has been addressed with A* algorithms [2] and Q-learning [15]. In this context, Yoo and Kim [26] proposed a new reinforcement learning method for path planning of maritime vessels in ocean environments, although without considering obstacles. Q-learning, however, is limited to low-dimensional state-space domains. Cheng and Zhang [3] developed a DRL-based algorithm for unmanned ships based on DQN with no prior knowledge of the environment. Similar to our approach, they used a local view strategy to reduce state-space exploration and make path planning more efficient. In contrast with Cheng and Zhang, we design our model based on Deep Deterministic Policy Gradient; the main drawback of their approach is its use of discrete actions for reaching the target. Recently, [18] presented a prototype of multiple autonomous vessels controlled by deep Q-learning. The main issue with this idea is that the incorporated navigation rules limit navigation by polygons or lines.

This research's main goal is not to find the shortest path for an autonomous vessel, but to develop a new strategy of continuous action on top of local view path planning. Our proposal is designed on the DDPG architecture, using the environment's raw input images to learn the best path between a source and a destination point. The idea is to feed the input images through the actor network to predict the best continuous action for that state, and then to feed the predicted action and the state to the critic network to predict the state-action value. The critic network's output is used during the training phase; once this phase is over, the actor alone selects actions. Our model was exhaustively trained and tested in maritime scenarios with real map data from cities such as Montreal and Halifax, with non-fixed origin and destination positions. For training the model, we used areas with the presence of real geographical land. We give a detailed example of state-action rotation in Section 3. The model is trained using the DDPG algorithm; we chose DDPG because it has an actor-critic architecture, and we describe it in Section 2. At each step, the transition-maker module produces the reward given the state of the environment, and the process starts over. An exploration episode ends when the agent either reaches the target or hits an obstacle in the scenario, such as rocks and islands along the vessel's route.
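The following is a minimal sketch of the actor-critic decision flow described above, assuming PyTorch. The single-channel local-view input, the two-dimensional continuous action (e.g., changes in heading and speed), and the layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of the actor-critic flow: the actor maps a local-view image to a
# continuous action; the critic scores a (state, action) pair. The critic is
# only needed during training; afterwards the actor alone selects actions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a local-view image to a continuous action in [-1, 1]^2 (assumed action space)."""
    def __init__(self, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Maps a (local-view image, action) pair to a scalar state-action value."""
    def __init__(self, action_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(128), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Linear(128 + action_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, state, action):
        return self.head(torch.cat([self.encoder(state), action], dim=1))

actor, critic = Actor(), Critic()
state = torch.zeros(1, 1, 64, 64)   # one 64x64 local-view image (assumed size)
action = actor(state)               # continuous action predicted for this state
q_value = critic(state, action)     # state-action value used during training
```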
In summary, the main contributions of our paper are:
• The development of an environment for 2D simulation of real-world maps that takes insights from the Floyd and Ramer-Douglas-Peucker algorithms to find the trajectory between two positions.
• A new architecture named Continuous Vessel Navigator (CVN), based on the DDPG algorithm with continuous action spaces, whose generalization we evaluate on two different shores.

Section 4 describes how the agent navigates a vessel from an origin to a destination autonomously. Section 5 demonstrates the efficiency of SAR and CVN, and discusses the results of the experiments in a simulated maritime environment. Finally, the conclusions of this work and future directions are presented in Section 6.

2 BACKGROUND

2.1 Deep Q-Network
Deep Q-Network (DQN), proposed by Mnih et al. [16], is a model-free RL algorithm that uses a multi-layered neural network Q to approximate Q*. It has been used in continuous or high-dimensional environments with a discrete action space. DQN receives a state s and outputs a vector of action values Q(s, a). However, RL models are unstable when a non-linear function approximator is used to estimate the Q-values. There are two main reasons for this problem: the correlations between sequences of observations, which cause a significant change in the agent's policy after each update, and the correlation between action-values and target values. A randomized replay buffer is used to break the correlations between observation sequences. To address the second issue, another neural network, known as the target network with parameters θ⁻, is used to decorrelate the action-values from the target values. Hence, DQN keeps two neural networks: the online network with weights θ and the target network with weights θ⁻.

During training, the environment sends a state s_t and a reward to the agent, and the agent selects an action with an ε-greedy policy; the agent then forms the transition tuple (s_t, a_t, r_t, s_{t+1}) and stores it in the replay buffer. The ε-greedy policy selects a random action with probability ε and the greedy action π_Q(s) = argmax_{a ∈ A} Q(s, a) with probability 1 − ε. To train the online network, we sample a batch of transitions from the replay buffer and update the online network with mini-batch gradient descent on the Bellman loss (Equation 1):

\ell = \mathbb{E}\big[(Q(s_t, a_t) - y_t)^2\big], \quad \text{where} \quad y_t^{\mathrm{DQN}} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta^-) \qquad (1)

After every τ steps, we copy the weights θ of the online network to the target network θ⁻.

2.2 Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) [13] is a model-free RL algorithm that uses two multi-layered neural networks for training.
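DDPG's two networks are the actor and the critic described in the Introduction. The sketch below shows one DDPG-style update only to make the mechanics concrete: the critic is trained toward a Bellman target analogous to Equation 1, with max_a Q(s', a) replaced by the target critic evaluated at the target actor's action, and the target networks are updated softly rather than by DQN's periodic hard copy. The small MLPs, dimensions, and hyper-parameters are assumptions for illustration, not the paper's implementation.

```python
# Minimal DDPG-style update step (illustrative sketch, assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma, tau = 8, 2, 0.99, 0.005

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor,  actor_target  = mlp(state_dim, action_dim, nn.Tanh()), mlp(state_dim, action_dim, nn.Tanh())
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    s, a, r, s_next, done = batch            # tensors sampled from the replay buffer
    with torch.no_grad():                    # Bellman target, cf. Equation 1
        a_next = actor_target(s_next)
        y = r + gamma * (1 - done) * critic_target(torch.cat([s_next, a_next], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # The actor is trained to output actions the critic scores highly.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target update (DDPG's counterpart of DQN's hard copy every τ steps).
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# Example call with a random batch of 32 transitions standing in for buffer samples.
batch = (torch.randn(32, state_dim), torch.rand(32, action_dim) * 2 - 1,
         torch.randn(32, 1), torch.randn(32, state_dim), torch.zeros(32, 1))
update(batch)
```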