arXiv:1911.11991v1 [cs.RO] 27 Nov 2019

Hsu et al. / Front Inform Technol Electron Eng in press 1

Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online)
E-mail: [email protected]

Review: A selected review on reinforcement learning based control for autonomous underwater vehicles∗

Yachu HSU, Hui WU, Keyou YOU‡, Shiji SONG Department of Automation and BNRist, Tsinghua University, Beijing 100084, China E-mail: {xuyz17, wuhui14}@mails.tsinghua.edu.cn; {youky, shijis}@tsinghua.edu.cn

Abstract: Recently, reinforcement learning (RL) has been extensively studied and has achieved promising results in a wide range of control tasks. Meanwhile, the autonomous underwater vehicle (AUV) is an important tool for executing complex and challenging underwater tasks. The advances in RL offer ample opportunities for developing intelligent AUVs. This paper provides a selected review of RL-based control for AUVs, with a focus on applications of RL to low-level control tasks for underwater regulation and tracking. To this end, we first present a concise introduction to the RL-based control framework. Then, we provide an overview of RL methods for AUV control problems, where the main challenges and recent progress are discussed. Finally, two representative cases of RL-based controllers are given in detail for the model-free RL methods on AUVs.

Key words: reinforcement learning; autonomous underwater vehicle; low-level control; model-free
https://doi.org/
CLC number: TP

‡ Corresponding author
∗ This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0300801, and the National Natural Science Foundation of China under Grants No. 41576101 and No. 41427806.
ORCID: Ke-you YOU, https://orcid.org/0000-0003-4355-5340
© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2019

1 Introduction

The development of AUVs was initially motivated by the desire to explore the Arctic waters in 1957. Since then, AUVs have received considerable attention, and intensive efforts have been made to deploy them in various underwater environments. These versatile vehicles have brought a revolution to the field of ocean research. The development of controllers has contributed to this, as control is key to the capability of AUVs. Many controllers have been designed for AUVs to complete manifold military and civilian tasks, including source seeking (Li et al., 2018), pipeline inspecting (Xiang et al., 2010), seafloor mapping (Ribas et al., 2011), etc.

Among them, RL-based controllers are highly anticipated for their capability to enable adaptive autonomy in an optimal manner (Kiumarsi et al., 2017). RL algorithms provide control policies that maximize the quantitative performance throughout a well-designed task by learning from ongoing interactions with the environment. RL looks ahead to future events and focuses on long-term performance, making it appealing for control problems. In the fields of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), RL algorithms are widely studied. The successes of Waslander et al. (2005), Kim et al. (2004) and Bagnell and Schneider (2001) demonstrate that RL-based controllers perform better than classic controllers or highly trained pilots. Abbeel et al. (2010) presented apprenticeship learning algorithms that allowed an autonomous helicopter to perform arbitrary challenging aerobatic maneuvers. Hester et al. (2011) conducted real-time learning on a physical vehicle to control its velocity through pedals. The velocity was accurately tracked after 3 minutes. Kendall et al. (2019) realized autonomous driving via RL, where the full-sized vehicle learnt to follow lanes from scratch within 30 minutes using on-board computers.

The success of RL in the ground and aerial vehicle control community suggests its potential for controlling AUVs. In fact, the RL framework has been introduced to achieve persistent autonomy and precise control for AUVs. This paper briefly surveys the progress of the implementation of RL on different low-level control tasks for AUVs, hoping to inspire more research into the application of RL.

The remainder of this paper is organized as follows. Section 2 describes basic concepts in RL with an introduction of RL algorithms. Section 3 discusses challenges in applying RL to control AUVs. Recent studies in this area are briefly introduced in Section 4. In Section 5, the cases of two RL-based controllers for different underwater tasks are presented in detail to better illustrate the advantages of RL-based controllers. Finally, a conclusion is made in Section 6.

2 Basics of RL

This section concisely introduces basic concepts and foundational algorithms of RL. Due to the uncertainty of underwater dynamics, this section mainly focuses on model-free algorithms.

2.1 Markov decision process

This subsection is mainly based on Sutton and Barto (2018). Formally, RL aims to solve problems modeled as a Markov decision process (MDP), which consists of four basic elements: a set of valid states $S$ (and the starting state distribution $\rho_0$), a set of valid actions $A$, a reward function $r(s, a)$ and a transition probability function $p(s', r \mid s, a)$. Transitions should depend only on the most recent state and action, which is known as the Markov property.

By adopting RL, an agent learns the mapping between situations and control outputs from interactions with the environment, so as to optimize its control performance. Learning from these repeated interactions enables RL to handle the case where dynamic programming (Bertsekas et al., 1995) is not applicable, i.e., the case when the function $p$ is unknown. At every step of the interaction, the agent observes a state and chooses an action based on the observation. The environment then transits to a new state depending on the current state and the action just made. A reward signal is received to evaluate such an action. Fig. 1 illustrates this agent-environment interaction, which gives rise to a sequence of states, actions and rewards $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$. The agent's goal is to maximize the return $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, namely the cumulative reward received during the interaction period, where $\gamma \in (0, 1]$ is a discount factor that assigns decayed weights to future rewards.

Fig. 1 The agent-environment interaction.

2.2 Value-based RL

The rule that an agent uses for choosing actions is called a policy. The expected return received when starting in a state $s$ under a specific policy $\pi$ is called the value at $s$. For MDPs, we define the state-value function and state-action-value function as

$$V^{\pi}(s) = \mathbb{E}_{\tau\sim\pi}\left[R(\tau) \mid s_0 = s\right], \tag{1}$$
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau\sim\pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right], \tag{2}$$

where $\tau$ denotes a trajectory and $s_0$, $a_0$ are its starting state and action. The state-action-value function $Q^{\pi}(s, a)$ can be used to construct a policy by choosing an action that maximizes the value function; applied to the optimal value function, this yields an optimal policy. A policy is optimal if it reaches the highest expected return for all states, while a value function is optimal if it corresponds to the optimal policy. The optimal value functions can be defined as

$$V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau\sim\pi}\left[R(\tau) \mid s_0 = s\right], \tag{3}$$
$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau\sim\pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]. \tag{4}$$

Value functions play an important role in designing RL algorithms, because they evaluate how good a state or state-action pair is and guide the algorithm in its search for an optimal policy. Methods used to estimate value functions are typically derived from

Bellman equations

$$V^{\pi}(s) = \mathbb{E}_{a\sim\pi,\, s'\sim P}\left[r(s, a) + \gamma V^{\pi}(s')\right], \tag{5}$$
$$Q^{\pi}(s, a) = \mathbb{E}_{s'\sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a'\sim\pi}\left[Q^{\pi}(s', a')\right]\right], \tag{6}$$
$$V^{*}(s) = \max_{a} \mathbb{E}_{s'\sim P}\left[r(s, a) + \gamma V^{*}(s')\right], \tag{7}$$
$$Q^{*}(s, a) = \mathbb{E}_{s'\sim P}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')\right], \tag{8}$$

where $P$ denotes the state transition probabilities $p(s', r \mid s, a)$. Most model-free RL procedures can be abstracted as alternating between policy evaluation and policy improvement. The former estimates the value function of the current policy, and the latter improves the policy with respect to the estimated value function, usually by making it greedy with respect to the current value function. This process stabilizes when the policy is greedy with respect to its own value function, which matches the Bellman equations of the optimal value and policy. This idea is referred to as generalized policy iteration (GPI).

The self-consistency of the Bellman equations implies that the estimation of value functions can be improved by bootstrapping. Inspired by this idea, temporal-difference (TD) learning is a commonly used class of value-based RL algorithms. It updates the estimation by minimizing the TD error $\delta$, i.e.,

$$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha\delta, \tag{9}$$

where $\alpha$ denotes the learning rate and $\delta = Y - Q^{\pi}(s_t, a_t)$ is the error between the target value $Y$ and its estimated value. The target varies between algorithms, for example $Y = r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1})$ in the on-policy algorithm SARSA and $Y = r_t + \gamma \max_{a} Q^{\pi}(s_{t+1}, a)$ in the off-policy algorithm Q-learning. On-policy means that the updated policy is consistent with the policy used for sampling, whereas an off-policy algorithm uses a different policy for interacting with the environment.

Besides bootstrapping, value functions can also be updated by Monte Carlo (MC) methods in episodic tasks. In these methods, value functions are estimated by averaging the returns observed so far. MC methods and TD methods are two extremes and can be unified by n-step bootstrapping. This usually performs better because it combines the advantages of both methods, although more computation is required.

2.3 Policy-based RL

The major defect of value-based algorithms is that the maximum operation makes them inapplicable to continuous action spaces. A solution is to use parameterized policies. Unlike greedy policies, a parameterized policy $\pi_{\theta}$ can be stochastic and is thereby more suitable for problems with imperfect information. Moreover, by utilizing a neural network as a non-linear function approximator, along with some modifications to stabilize learning, the policy is able to handle high-dimensional observations. The performance of the policy $J(\pi_{\theta})$ can be defined as either the expectation of the cumulative discounted reward or the average reward, for example

$$J(\pi_{\theta}) = V^{\pi_{\theta}}(s_0). \tag{10}$$

Then the parameter can be directly optimized for the performance by gradient ascent, i.e.,

$$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k}). \tag{11}$$

The Policy Gradient Theorem in Sutton et al. (2000) lays the foundation for these algorithms:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s\sim\rho^{\pi},\, a\sim\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right], \tag{12}$$

where $\rho^{\pi}$ denotes the state distribution. It shows that the gradient of the performance function with respect to the policy parameter can be expressed as an expectation that does not involve the effect of policy changes on the state distribution. This gradient can be estimated by sampling methods as long as an unbiased expectation is guaranteed. REINFORCE is a straightforward Monte Carlo implementation, which estimates the action-value function with sampled returns (Williams, 1992).

Another family of variants is the well-known actor-critic methods. They take the value function estimation mentioned above as the critic part of the algorithm to better learn the policy parameter. The policy learning part, referred to as the actor, consults the estimated value function $Q^{\omega}(s, a)$ when computing the performance gradient. It is proved that, under certain constraints, this approximation does not result in bias (Sutton et al., 2000):

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{s\sim\rho^{\pi},\, a\sim\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\omega}(s, a)\right]. \tag{13}$$
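As a minimal sketch of how Eqs. (11) and (12) can be turned into a sampled update, the following REINFORCE step uses a tabular softmax policy and replaces $Q^{\pi}(s_t, a_t)$ with the sampled discounted return from step $t$. The tabular parameterization, the function names (`softmax`, `grad_log_pi`, `discounted_returns`, `reinforce_update`) and the default hyperparameters are assumptions made for illustration only; they are not taken from the reviewed works.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. the row theta[s].

    For a tabular softmax policy this is one_hot(a) - pi(.|s);
    all other rows of theta have zero gradient.
    """
    probs = softmax(theta[s])
    return [(1.0 if b == a else 0.0) - p for b, p in enumerate(probs)]


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_update(theta, episode, gamma=0.99, alpha=0.1):
    """One gradient-ascent step on theta from a single sampled episode.

    episode is a list of (state, action, reward) tuples. The gradient is
    evaluated at the old theta for every step and then applied at once.
    """
    new_theta = [row[:] for row in theta]
    returns = discounted_returns([r for _, _, r in episode], gamma)
    for (s, a, _), g_t in zip(episode, returns):
        for b, d in enumerate(grad_log_pi(theta, s, a)):
            new_theta[s][b] += alpha * g_t * d
    return new_theta
```

With a positive return, the update raises the probability of the action that was taken. Practical implementations usually subtract a baseline from the return to reduce variance, which this sketch omits.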

It is obvious that when the variance of the stochastic policy is zero, the policy reduces to a deterministic policy. Silver et al. (2014) extended the policy gradient framework to deterministic policies by proving that, as the variance approaches zero, the stochastic policy gradient converges to the deterministic gradient of the following form:

$$\nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\mu}}\left[\nabla_{\theta}\mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right]. \tag{14}$$

This adaptation can also be interpreted as approximating the maximization with the policy

$$\max_{a} Q(s, a) \approx Q(s, \mu(s)). \tag{15}$$

The deterministic policy gradient (DPG) does not need to integrate over the action space, which is appealing when the dimension of the action space is high. Popular algorithms derived from this idea have already proved their advantages (Lillicrap et al., 2015; Fujimoto et al., 2018).

Normal policy gradient methods measure the distance between policies in the parameter space, yet it is better to measure the distance on the probability manifold to ensure performance improvement in every step. This is the basic idea of the natural policy gradient (Amari, 1998). These methods optimize the policy with different surrogate objective functions (Schulman et al., 2015, 2017).

2.4 Model-based RL

Model-free RL learns directly from interactions, whereas model-based RL relies on a model of the environment, which can be described as the transition probability function $p(s', r \mid s, a)$. The model accuracy has a great impact on the performance of model-based RL. The policy may perform poorly in the real test when the learnt model is inaccurate. Two main concerns in model-based RL are how to obtain models and how to utilize them.

In some simple tasks, an accurate mathematical model can be established based on a priori knowledge. For other cases, it is possible to learn a model from interactions with the environment. Provided that assumptions about the model are given, the unknown parameters of the model can be inferred by methods like linear regression. If there is no prior knowledge of the model, a common approach for learning the model is the Gaussian process (GP), which can build a distribution over the transition function (Polydoros and Nalpantidis, 2017).

Once the model of the environment is accessible, a possible step or even all possible episodes can be generated for planning methods like dynamic programming, heuristic search and exhaustive search. There are also many other ways of combining models with model-free algorithms, e.g., regarding the planning method as an expert that the policy should learn from (Anthony et al., 2017), considering plans as side information for the policy (Racanière et al., 2017), producing simulated experiences for data augmentation (Feinberg et al., 2018), etc.

Recently, a rich class of RL methods has been extensively studied, although a further review of the state of the art in RL is not included in this paper. Instead, we focus on integrating RL into the control of AUVs in the following sections.

3 Challenges for implementing RL to control AUVs

The fact that RL-based control methods can learn from interactions distinguishes them from classic control methods including PID control (Khodayari and Balochian, 2015), adaptive control (Narasimhan and Singh, 2006), backstepping control (Lapierre and Soetanto, 2007), sliding-mode control (Elmokadem et al., 2016), etc. Controlling an AUV is kinematically similar to the problem of controlling a free-floating rigid body in a six-dimensional space (Fig. 2). However, the underwater environment complicates the dynamics (Antonelli, 2018). For most classic controllers, one of the major obstacles is the lack of an accurate dynamic model for designing controllers. Moreover, the models are manually decoupled or linearized without sufficient consideration of the uncertainties and disturbances, which are important characteristics of the underwater environment. On the other hand, RL-based controllers can be trained without a dynamic model while enabling adaptive autonomy. Challenges of controlling AUVs in the context of RL differ in many ways. Some possible challenges are discussed in this section.

3.1 Sample efficiency

To apply RL algorithms to AUVs, experiences need to be acquired by interacting with the physical system. It is obvious that carrying out such an

experiment with an AUV is costly in terms of time, labour and finances. More specifically, an AUV is expensive to build and needs careful maintenance to reduce wear as well as to avoid crashing. Meanwhile, neither building a water tank nor finding a site with a suitable underwater environment is an easy task. Even if all preparations are complete, the process of collecting data is itself time consuming. Therefore, gaining better sample efficiency to minimize interactions has become an outstanding issue, outweighing the limiting of memory consumption and computational complexity. Off-policy methods are more suitable in this case since they are able to reuse the experiences collected, and are thus more sample efficient. Model-based methods are widely used in robot control for their promise of sample efficiency, though models of AUVs and the underwater environment are often poorly known.

Fig. 2 The six degree-of-freedom (DOF) motions of the AUV.

3.2 Tradeoff between exploration and exploitation

During the learning process, exploiting means selecting greedy actions, which have the greatest estimated value. This maximizes the expected reward on the current step, whereas exploring by selecting non-greedy actions may produce a greater total reward in the long run (Sutton and Barto, 2018). Additionally, by taking these actions, more kinds of states will be visited, which results in a more robust policy.

Guaranteeing sufficient exploration has been a long-standing problem in RL algorithms, and is especially important in the control of AUVs to provide robustness to the variable underwater environment. Although learning from mistakes is beneficial in most cases, exploring underwater with an AUV is rather complicated. The price of damaging an AUV is particularly high, considering the cost, physical labour and long waiting period for repairing an AUV. Safe exploration is a key issue for the practical application of RL algorithms, which is often neglected in the general RL community (Kober et al., 2013). One possible solution is to prepare an emergency protocol which takes a higher priority when the AUV encounters danger, such as getting too close to an obstacle (El-Fakdi and Carreras, 2013).

3.3 Model uncertainty

The mathematical model of the dynamics of underwater vehicles is usually derived from the Newton-Euler equations of a rigid body, in which the effects of inertial generalized forces, hydrodynamics, gravity, buoyancy and thrusters' presence are taken into account (Antonelli, 2018). However, most of these effects are either complex in themselves, with high nonlinearity and time-varying characteristics, or closely related to the exact structure of the AUV. In a word, it is hard to develop a reliable model for an AUV. Although learning with an accurate model can solve the problem of collecting real-world samples, bias in the model may cause sub-optimal or terrible performance in the real environment, no matter how well the policy behaves with the approximate model. The unexpected poor performance may cause irreversible damage to AUVs. For model-based algorithms and for model-free algorithms that are trained in a simulator, it is essential to deal with this reality gap issue. A compromise has to be made between accuracy and robustness. Under such a condition, the best policy should be the one that is robust to noise rather than the one with the highest reward.

3.4 Partially observed state

Most existing pure RL algorithms are designed under the assumption that the environment can be totally observed. However, the underwater environment is noisy and uncertain, creating great challenges for AUVs to collect useful information.

For visual inspections in the underwater domain, both optical and sonar systems are widely used (Ferreira et al., 2016). Optical sensors are expected to obtain high resolution data with helpful colour information. However, in turbid water, water molecules, dissolved organic and inorganic matter, and various types of suspended particles cause scattering and absorption of light, and result in dark and low contrast underwater images with poor visibility. Colour distortion also occurs because of the different attenuation rates, inversely proportional to the wavelength of light (Lu et al., 2015). Sonars are able to look further, while providing low resolution images insufficient for object identification.

As for underwater localization, unlike for UAVs and UGVs, radio or spread-spectrum communications and global positioning are disabled due to the rapid attenuation of higher frequency signals. Although acoustic-based sensors and communications can support the localization of AUVs, they are constrained by limited and distance-dependent bandwidth, time-varying multi-path propagation and the low speed of sound (Heidemann et al., 2012). In many cases, control designers of AUVs need to balance the need to inspect the environment against the cost of higher energy consumption, which diminishes the available mission time. In a real-world task, only limited information can be gained, not to mention the quality of the information.

4 RL applications in control of underwater vehicles

This section briefly introduces recent works on controlling AUVs with RL. There are also works that focus on high-level decision tasks such as path planning (Yoo and Kim, 2016; Wang et al., 2018; Hu et al., 2019), which do not involve the low-level control of AUVs. An extension of this topic is beyond the scope of this paper.

4.1 Modeling of MDPs

Before applying RL methods, it is vital to frame the control task as an MDP. As mentioned above, there are four elements that have to be well defined. For the low-level control problem, the actions are the control inputs of AUVs and the transition probability is usually inaccessible. Hence the key is to find a proper state representation and design an appropriate reward function.

4.1.1 State representations

It is natural to define the raw observations as the state, since this is the most informative form. However, RL methods suffer from the curse of dimensionality. A larger number of samples and more computation are required to ensure convergence as the number of state-space dimensions grows. Thus a proper formulation should involve fewer variables while avoiding perceptual aliasing, which is the case where different states cannot be distinguished by the given information. Wu et al. (2018) considered three depth control problems precisely and designed the states carefully. More details about this piece of work will be introduced in the following section. Meanwhile, a good state representation can greatly improve the robustness of an algorithm and result in a more generalized policy, e.g., the goal-oriented control architecture used in Carlucho et al. (2018a,b) can omit the continuous retraining step when changing to a new goal.

4.1.2 Design of reward functions

On the other hand, the design of reward functions also holds great importance. When facing conflicting objectives, a simple treatment is to unify them with prescribed weights. Such scalarized reward functions are easy to implement and can produce a single optimal solution. Generally, a reward function for AUV control tasks consists of two parts, i.e., terms to evaluate the error and terms to restrict the thruster usage or penalize sudden changes. Carlucho et al. (2018a) illustrated the significance of each term. The former terms ensure that the AUV achieves the control target, while the latter terms prevent the thruster outputs from oscillating violently. The weights are usually chosen empirically according to the relative importance of the objectives. Yu et al. (2017) imposed constraints on the weights, which were derived from Lyapunov theory to guarantee the stability of the control system. Nevertheless, the weights are still hard to tune, since even small changes in weights may cause a great difference in the learnt policy. Ahmadzadeh et al. (2014a) employed multi-objective RL that can discover multiple optimal solutions which satisfy different objectives respectively (shortest path, minimum final velocity and minimum heading error). An additional algorithm is then needed for selecting the optimal solution.

A piecewise function is also a common form of the reward function, in which different levels of reward are given according to the preference for the current state. Carlucho et al. (2018b) suggested that by gradually tightening the definition of the preferred state, the agent should obtain more useful experiences. For example, if a positive reward signal can be obtained when reaching a spherical neighborhood of a desired way-point, then it is beneficial to

world learning problem. Other studies all tried to discover fault-tolerant strategies, i.e., methods that can operate under thruster failure. This means that these algorithms should be able to control both over- actuated and under-actuated AUVs. Carlucho et al. Fig. 3 REMUS, a kind of screw-driven AUV (Stokey (2018b) applied an algorithm similar to the one in et al., 2005). Carlucho et al. (2018a). Jamali et al. (2014) aimed decay the radius of the sphere throughout learning. at improving the robustness of the policy found by model-based direct policy search. A Gaussian noise 4.2 RL for screw-driven underwater vehicles was added to the inputs of the thrusters to test the sensitivity of the policy to noise. The results showed Classical AUVs are usually controlled by rotary that the relationship between the performance in the propellers and control surfaces such as rudders and noiseless setting and the robustness of the policy is sterns. Most of them have the shape of a torpedo unpredictable. Covariance analysis was used to mea- for hydrodynamic performance (Fig. 3). Amounts of sure the robustness in order to find a policy that experiments have been carried out on them. performed well whilst being robust to noise. 4.2.1 Set-point regulation Researchers of Istituto Italiano di Tecnologia carried out a series of experiments on this topic Stabilization is the most fundamental control (Leonetti et al., 2013; Ahmadzadeh et al., 2014b,a). task for AUVs. Fernandez-Gauna et al. (2014) con- In 2013, an on-line controller framed within model- ducted experiments on the speed control problem based policy search was proposed. Although the fea- using Continuous Action-Critic Learning Automa- sibility of the method was tested in simulator, the ton (CACLA). Unlike other policy gradient meth- policy cannot be applied to real open water scenario ods, CACLA only updated the policy in action space for being an open-loop function of time. In 2014, when the critique was strictly positive. 
Besides, it this was solved by closing the loop with state feed- was proved that starting the training with outputs backs and the learnt policy was evaluated on a real of PID as replacements of random actions helped AUV. Different levels of thruster failure was also con- to bias the learnt policy towards the optimal policy. sidered. Nevertheless, the fact that the presented Walters et al. (2018) carried out the regulation task method requires the dynamic model of the AUV as in reality utilizing model-based dynamic program- well as related hydrodynamic parameters make it less ming. The dynamic model was learnt on-the-fly. appealing in practical use. They focused on the influence of the time-varying ir- So far, only learnt policies have been tested, the rational current and presented the Lyapunov-based test of learning a fault-tolerant policy online has not stability analysis to guarantee the convergence to yet been performed in reality. the target state and optimal polices. Carlucho et al. (2018a) contributed to this field by conducting con- 4.2.3 Trajectory tracking trol tests of a real AUV on all six DOFs. The pro- Varying degrees of success have been achieved in posed deep RL algorithm was based on deep deter- the tracking control of AUVs. Palomeras et al. (2012) ministic policy gradient(DDPG) and framed in the presented a control architecture for AUVs in which goal oriented control architecture. the RL algorithm was programmed in the reactive 4.2.2 Way-point tracking layer and tested in a real-time autonomous underwater task. A visual based cable tracking task was Several attempts have been done for applying completed after applying a two-step learning process RL to the way point tracking problem for AUVs, using natural actor-critic algorithm. The location which can be seen as a transition task between the and rotation of the cable were computed for two RL station keeping task and tracking task. 
Frost and Lane (2014) presented a simplistic implementation of tabular Q-learning in both simulated and real scenarios. The problem was discretized into a grid-

controllers to learn uncoupled policies for the yaw and sway actions. The controllers were trained in the simulator before learning in reality to enhance the convergence rate. Carlucho et al. (2018a) adopted a similar learning strategy when conducting the velocity control task. In El-Fakdi and Carreras (2013), more real-world experiments on cable tracking were conducted. To study the robustness of the learnt policies, they were tested with different cable configurations without retraining. Another test changed the altitude of the AUV with respect to the cable during the online learning process. The results showed that the policies had high adaptation capabilities.

Due to a variety of restrictions, unlike the station keeping task, other algorithms proposed for tracking are not yet sufficiently validated in real scenes. Sun et al. (2015) used a regularized extreme learning machine to replace the look-up table in Q-learning. However, the description of the experiment settings and results was ambiguous. Shi et al. (2018b) modified the calculation of the target value used for updating the critic in the deterministic policy gradient algorithm. The so-called pseudo averaged Q-learning method averaged over several previously learnt action-value estimations and benefited from multiple actors. This scheme stabilized the learning process by reducing the variance of the target approximation error. In Shi et al. (2018a), the proposed multi pseudo Q-learning based deterministic policy gradient algorithm employed multiple critics and multiple actors simultaneously. The critics were updated by the expected absolute Bellman error to accelerate the learning process.

To demonstrate the effectiveness of the proposed algorithms, some researchers gave rigorous theoretical analyses of the stability of the control system. Yu et al. (2017) solved the tracking problem through DDPG. The system was mathematically proved to be stable as long as the reward was chosen according to the Lyapunov stability principle. In 2014, Cui et al. (2014) proposed a partially model-based adaptive control algorithm framed within the actor-critic architecture. The actor network compensated for the uncertainties in the dynamics and the critic network evaluated the tracking performance. In 2017, input nonlinearities were considered in the dynamic model (Cui et al., 2017). The nonlinearities included the actuator dead-zone and saturation as well as the relationship between the nominal and actual force. In 2019, the actor-critic adaptive control algorithm was further investigated for continuous-time systems with completely unknown dynamics (Guo et al., 2019b). A Nussbaum-type function was used to resolve unknown control directions. Compared with the previous discrete-time algorithms (Cui et al., 2017), it successfully avoided chattering of the control inputs in the steady-state phase. The other simulation showed its ability to gain results competitive with general neural network control, which had access to the input dynamics. On the other hand, in Guo et al. (2019a), an event-triggered RL-based adaptive tracking control algorithm was investigated to reduce the update frequency of the controller. The algorithm was designed to consider the long-term performance index, unmodeled dynamics, and external disturbances simultaneously. Compared to the ordinary time-triggered methods, it significantly reduced the computational load and energy consumption.

4.3 RL for bionic underwater vehicles

Bionic AUVs, which mimic the swimming motions of underwater creatures, are developed to meet higher requirements on endurance, system noise and especially maneuverability. Compared with screw-driven AUVs, more effort has to be put into the control algorithms of these AUVs to acquire an optimal swimming pattern for the complicated dynamics.

For fish-like robots, the fin-type propulsive forces and moments depend on the integrated influence of various factors such as waveform, wavelength, amplitude and frequency. In Lin et al. (2009), an online Q-learning method was implemented on a bionic underwater robot (Fig. 4a) to select frequencies for its two undulating fins in the autonomous heading control task. The experimental result was barely satisfactory, with a relatively large error in yaw angle. In 2010, the same task was carried out on a similar robot with a more flattened body (Fig. 4b), utilizing an improved Q-learning method for continuous state space (Lin et al., 2010). The proposed algorithm stored experiences in a replay buffer and removed old experiences according to the resembling degree. A PID controller was adopted for supervision to prevent the occurrence of a low learning rate when starting from scratch. The bionic AUV swam smoothly under the modified algorithm. Wang and Kim (2015) showed that a hierarchical RL structure can enhance the convergence rate of Q-learning in such locomotion problems.

Fig. 4 Fish-like AUVs used in Lin et al. (2009) (a) and Lin et al. (2010) (b).

Fig. 5 The bionic AUV Aqua (Prahacs et al., 2004).

Aqua, as shown in Fig. 5, is a descendant of a hexapod walking vehicle and is able to work underwater (Prahacs et al., 2004). Meger et al. (2015) aimed to learn the gait of its six flippers through the policy search method PILCO (Probabilistic Inference for Learning Control), in which a probabilistic dynamic model was learnt before implementing tabula rasa. Five out of the six different fixed-depth tasks carried out on the real robot obtained satisfactory results within seven iterations. Additional experiments about sharing experiences from a simulator showed that an inaccurate model will deteriorate the performance of the proposed method. To address the problem of being computationally expensive, Higuera et al. (2018) proposed an improved deep-PILCO method, which gained competitive data-efficiency while optimizing neural network controllers.

Zhang et al. (2018) concentrated on the control of a snake-like underwater robot that incorporated advantages of the underwater glider through two gliding wings, as shown in Fig. 6. The REINFORCE algorithm using preprocessed input was adopted and the simulation result was encouraging.

Fig. 6 The snake-like AUV used in Zhang et al. (2018).

5 Case Study

In this section, two representative examples of RL-based controllers are introduced in detail. The first task is a regular tracking problem based on a low-level representation of the system state. It provides an alternative approach to formulate an underwater control task. The second task unfolds more possibilities of RL-based controllers by performing end-to-end learning.

5.1 Seafloor tracking problem

5.1.1 Problem formulation

In a seafloor tracking task, an AUV should keep a certain tracking velocity while holding a constant relative distance $z_r$ from the seafloor. Generally, only motions in the vertical plane are considered, in which the surge speed is assumed to be constant. The actions are continuous inputs of the related thrusters and the state of the AUV can be described as $\chi = [z, \theta, w, q]^T$, including heave position $z$, heave velocity $w$, pitch orientation $\theta$ and pitch angular velocity $q$. To avoid the confusion due to the periodicity of the angle, $[\cos(\theta), \sin(\theta)]^T$ is used instead of $\theta$. Moreover, replacing $z$ with a goal-oriented variable $\Delta z \doteq z - z_r$ can enhance the generality of the learnt policy. Thus, the state can be designed as

$$s = [\Delta z, \cos(\theta), \sin(\theta), w, q]^T. \tag{16}$$

However, as illustrated in Fig. 7, perceptual aliasing may appear owing to the unknown future trend of the target depth. Whilst this trend is inaccessible in the seafloor tracking problem, it can be predicted from the sequence of recent observations $[\Delta z_{t-N+1}, \ldots, \Delta z_{t-1}, \Delta z_t]$, where $N$ denotes the length of the sequence. In conclusion, the state of the seafloor tracking problem is designed as

$$s = [\Delta z_{t-N+1}, \ldots, \Delta z_{t-1}, \Delta z_t, \cos(\theta), \sin(\theta), w, q]^T. \tag{17}$$

The reward is straightforward and given as follows:

$$r = \rho_1 \Delta z_t^2 + \rho_2 w^2 + \rho_3 q^2 + u^T R u \tag{18}$$

where the first term aims to minimize the depth error and the other terms are for the minimization of the consumed energy. The coefficients provide tradeoffs among the different objectives.
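To make the state design of Eq. (17) and the reward of Eq. (18) concrete, a minimal Python sketch is given below. The class and function names, the padding of the history at the start of an episode, and the numerical values of rho and R are assumptions made for illustration only; the history length of three follows the setting reported in Section 5.1.3. Note that Eq. (18), as written, is a quantity to be minimized, so an agent that maximizes return would presumably use its negative.

```python
import math
from collections import deque


class SeafloorTrackingState:
    """Keeps the last N depth errors and assembles the state of Eq. (17)."""

    def __init__(self, n_history=3):
        self.errors = deque(maxlen=n_history)

    def observe(self, z, z_r, theta, w, q):
        dz = z - z_r  # goal-oriented depth error, Delta z = z - z_r
        if not self.errors:
            # Assumption: pad the history with the first error of an episode.
            self.errors.extend([dz] * self.errors.maxlen)
        else:
            self.errors.append(dz)  # the oldest error is dropped automatically
        return list(self.errors) + [math.cos(theta), math.sin(theta), w, q]


def reward_eq18(dz_t, w, q, u, rho=(1.0, 0.1, 0.1), R=((0.01, 0.0), (0.0, 0.01))):
    """Eq. (18): rho1*dz^2 + rho2*w^2 + rho3*q^2 + u^T R u.

    The weights rho and the matrix R are placeholder values.
    """
    n = len(u)
    quad = sum(u[i] * R[i][j] * u[j] for i in range(n) for j in range(n))
    return rho[0] * dz_t ** 2 + rho[1] * w ** 2 + rho[2] * q ** 2 + quad
```

Encoding the pitch as a cosine-sine pair keeps the state continuous across the wrap-around of the angle, and stacking recent depth errors lets the policy infer whether the seafloor is rising or falling.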

Fig. 7 Perceptual aliasing in the depth control problem.

5.1.2 Methods and strategies

This problem is solved by implementing the DPG algorithm. As mentioned above, it updates the parameterized policy $\pi_\theta$ along the gradient of the performance function $\nabla_\theta J(\pi_\theta)$. This gradient is approximated by

$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_\theta \pi(s_i|\theta) \nabla_{u_i} Q(s_i, u_i|\omega) \tag{19}$$

in which $(s_k, u_k, s_{k+1})$ is a transition pair along a trajectory at time $k$ and $Q(s, u|\omega)$ is a parameterized approximation for the value function. As illustrated in Fig. 8, both the policy and the value function are approximated by neural networks (NNs), with three layers and four layers respectively. The ReLU activation function is used for a better convergence rate. To improve sample efficiency, prioritized experience replay (Schaul et al., 2015), which reuses previous experiences according to their priority, is adopted. The priority of an experience is proportional to its TD error

$$PRI_k = |r_k + \gamma Q(s_{k+1}, \pi(s_{k+1}|\theta)|\omega) - Q(s_k, u_k|\omega)|. \tag{20}$$

The intuition behind this definition is that an RL agent can learn more from a transition with a higher magnitude of TD error. During training, samples with higher priority are more likely to be chosen.

Fig. 8 Structure of the evaluation network (a), and policy network (b).

5.1.3 Results

Simulation tests are carried out on a path generated by a data set sampled from the real seafloor of the South China Sea at (23°06′N, 120°07′E), which is provided by the Shenyang Institute of Automation, Chinese Academy of Sciences. The number of preserved $\Delta z$ values is three, which is decided by preliminary experiments. Fig. 9 shows that the proposed controller performs well in the test and is comparable with nonlinear model predictive control (NMPC) without having to know the dynamics of the AUV.

Fig. 9 Tracking trajectory of NNDPG, NMPC, and the realistic seafloor.

5.2 End-to-end control problem

5.2.1 Problem formulation

Deep learning is capable of learning from unprocessed, high-dimensional sensory input. In other words, it has the ability to construct end-to-end solutions, which is preferable in most situations. On the other hand, algorithms with this kind of input are usually hard to converge, especially when dealing with low-level control problems involving complex dynamics.

Here we present an example which proposed an end-to-end control policy for the pipe following task using sensor signals and motion variables as inputs. The AUV has to keep the pipeline in its camera view and head along the pipeline without knowing its own position or that of the pipeline (Fig. 10). The sensor input is an 84×84×3 image and the motion variables contain the orientation vector and velocity vector. Although the controller is trained directly through the raw image input, image processing is leveraged to aid the reward extraction. After a series of procedures, the center line of the pipeline in the camera view can be detected. Its distance from the center of the view $d_c$ and the angle between the distance line and the x axis $\theta_c$ are then calculated (Fig. 11). The reward is designed as

$$r = u \cdot |\cos\theta_c| - d_c d_{max}^{-1} \tag{21}$$

where $u$ denotes the surge velocity and $d_{max}$ equals half of the diagonal length of the view. The actions are the inputs of the two related thrusters.

Fig. 10 Simulation scene for pipe following (a), and view of the camera (b).

Fig. 11 Illustration of $d_c$ and $\theta_c$.

5.2.2 Methods and strategies

The sensor input and motion variables are handled with two encoder networks respectively. The former is a four-layer CNN and the latter is an LSTM (long short-term memory) network. As illustrated in Fig. 12, both of their outputs are fed to a fully connected layer followed by a value network and a policy network. Proximal policy optimization, a kind of natural policy gradient method, is implemented for training the network. Besides using the performance function that is defined directly as the cumulative reward, the objective of PPO is given as

$$L^{PPO} = L^{CLIP} - \lambda_1 \hat{E}_t\left[\left(V_\theta(s_t) - V_t^{targ}\right)^2\right] + \lambda_2 \hat{E}_t\left[H\left(\pi_\theta(\cdot|s_t)\right)\right] \tag{22}$$

where $\hat{E}_t$ means to average over a batch of samples. The second term minimizes the error of the value function for better estimation of the advantage function $\hat{A}_t$ and the last term encourages exploration. The advantage function measures the relative advantage of an action and is mathematically defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. $V_t^{targ}$ can be the cumulative return and $H$ computes the entropy of a distribution. $L^{CLIP}$ is a clipped surrogate objective

$$L^{CLIP} = \hat{E}_t\left[\min\left(w_t(\theta)\hat{A}_t, \mathrm{clip}\left(w_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right)\right] \tag{23}$$

where the clip function $\mathrm{clip}(x, a, b)$ restricts $x$ to the bound $[a, b]$ and $w_t(\theta)$ is a weight that measures the difference between policies by calculating

$$w_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}. \tag{24}$$

Fig. 12 Structure of the network.

5.2.3 Results

The designed controller behaves well on the pipeline tracking task in the simulation scene. The AUV follows the straight pipeline successfully without requiring its localization information and dynamic model. Two extra experiments highlight the advantage of using an end-to-end control structure and the generality of the learnt policy respectively. The first experiment replaces the CNN with the extracted features $\theta_c$ and $d_c$. Other settings remain unchanged. The results in Fig. 13 show that the network with the CNN performs much better, which indicates that the use of the raw sensory input helps preserve more useful information. The other experiment checks the predicted actions when the sensory input changes from views of the simulated scene to images of realistic underwater pipelines (Fig. 14). Actions generated from 21 out of 30 images move the AUV towards the correct direction. Though the magnitude of these actions is barely satisfactory, the results imply the potential of the algorithm to be applied in real-world training.

Fig. 13 Comparison between the use of hand-designed image features and the CNN encoder.

Fig. 14 The inference of the learnt policy on several realistic underwater pipeline images. When the output of the left thruster (blue bar) is larger than that of the right thruster (red bar), the AUV tends to turn right, and vice versa.

6 Conclusion

This paper provides a selective overview of controlling AUVs with RL. Methods that have been proposed in the literature are presented according to the motion control task they are designed for. Whilst there are still many challenges for merging these two areas, steady progress is being made to gain an optimal and practical solution via RL. Furthermore, we list two detailed cases to help reveal the feasibility and potential of RL in the underwater control domain. We believe that RL-based controllers can pave the way to more intelligent AUVs.

References

Abbeel P, Coates A, Ng AY, 2010. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13):1608-1639.
Ahmadzadeh SR, Kormushev P, Caldwell DG, 2014a. Multi-objective reinforcement learning for AUV thruster failure recovery. 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), p.1-8.
Ahmadzadeh SR, Leonetti M, Carrera A, et al., 2014b. Online discovery of AUV control policies to overcome thruster failures. 2014 IEEE International Conference on Robotics and Automation (ICRA), p.6522-6528.
Amari SI, 1998. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276.
Anthony T, Tian Z, Barber D, 2017. Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems, p.5360-5370.
Antonelli G, 2018. Underwater Robots. Springer.
Bagnell JA, Schneider JG, 2001. Autonomous helicopter control using reinforcement learning policy search methods. Proceedings 2001 ICRA IEEE International Conference on Robotics and Automation (Cat No 01CH37164), 2:1615-1620.
Bertsekas DP, 1995. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.
Carlucho I, De Paula M, Wang S, et al., 2018a. Adaptive low-level control of autonomous underwater vehicles using deep reinforcement learning. Robotics and Autonomous Systems, 107:71-86.
Carlucho I, De Paula M, Wang S, et al., 2018b. AUV position tracking control using end-to-end deep reinforcement learning. OCEANS 2018 MTS/IEEE Charleston, p.1-8.
Cui R, Yang C, Li Y, et al., 2014. Neural network based reinforcement learning control of autonomous underwater vehicles with control input saturation. 2014 UKACC International Conference on Control (CONTROL), p.50-55.
Cui R, Yang C, Li Y, et al., 2017. Adaptive neural network control of AUVs with control input nonlinearities using reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(6):1019-1029.
El-Fakdi A, Carreras M, 2013. Two-step gradient-based reinforcement learning for underwater robotics behavior learning. Robotics and Autonomous Systems, 61(3):271-282.
Elmokadem T, Zribi M, Youcef-Toumi K, 2016. Trajectory tracking sliding mode control of underactuated AUVs. Nonlinear Dynamics, 84(2):1079-1091.
Feinberg V, Wan A, Stoica I, et al., 2018. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
Jamali N, Kormushev P, Ahmadzadeh SR, et al., 2014. Covariance analysis as a measure of policy robustness. OCEANS 2014-TAIPEI, p.1-5.
Kendall A, Hawke J, Janz D, et al., 2019. Learning to drive in a day. 2019 International Conference on Robotics and Automation (ICRA), p.8248-8254.
Khodayari MH, Balochian S, 2015. Modeling and control of autonomous underwater vehicle (AUV) in heading and depth attitude via self-adaptive fuzzy PID controller. Journal of Marine Science and Technology, 20(3):559-578.
Kim HJ, Jordan MI, Sastry S, et al., 2004. Autonomous helicopter flight via reinforcement learning. Advances in Neural Information Processing Systems, p.799-806.
Kiumarsi B, Vamvoudakis KG, Modares H, et al., 2017. Optimal and autonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2042-2062.
Kober J, Bagnell JA, Peters J, 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238-1274.
Fernandez-Gauna B, Osa JL, Graña M, 2014. Effect of initial conditioning of reinforcement learning agents on feedback control tasks over continuous state and action spaces. International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, p.125-133.
Ferreira F, Machado D, Ferri G, et al., 2016. Underwater optical and acoustic imaging: A time for fusion? A brief overview of the state-of-the-art. OCEANS 2016 MTS/IEEE Monterey, p.1-6.
Frost G, Lane DM, 2014. Evaluation of Q-learning for search and inspect missions using underwater vehicles. 2014 Oceans-St John's, p.1-6.
Fujimoto S, Hoof H, Meger D, 2018. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, p.1582-1591.
Guo X, Yan W, Cui R, 2019a. Event-triggered reinforcement learning-based adaptive tracking control for completely unknown continuous-time nonlinear systems. IEEE Transactions on Cybernetics.
Guo X, Yan W, Cui R, 2019b. Integral reinforcement learning-based adaptive NN control for continuous-time nonlinear MIMO systems with unknown control directions. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
Heidemann J, Stojanovic M, Zorzi M, 2012. Underwater sensor networks: applications, advances and challenges. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 370(1958):158-175.
Hester T, Quinlan M, Stone P, 2011. A real-time model-based reinforcement learning architecture for robot control. arXiv preprint arXiv:1105.1749.
Higuera JCG, Meger D, Dudek G, 2018. Synthesizing neural network controllers with probabilistic model-based reinforcement learning. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), p.2538-2544.
Hu H, Song S, Chen CP, 2019. Plume tracing via model-free reinforcement learning method. IEEE Transactions on Neural Networks and Learning Systems.
Lapierre L, Soetanto D, 2007. Nonlinear path-following control of an AUV. Ocean Engineering, 34(11-12):1734-1744.
Leonetti M, Ahmadzadeh SR, Kormushev P, 2013. On-line learning to recover from thruster failures on autonomous underwater vehicles. 2013 OCEANS-San Diego, p.1-6.
Li Z, You K, Song S, 2018. AUV based source seeking with estimated gradients. Journal of Systems Science and Complexity, 31(1):262-275.
Lillicrap TP, Hunt JJ, Pritzel A, et al., 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lin L, Xie H, Shen L, 2009. Application of reinforcement learning to autonomous heading control for bionic underwater robots. 2009 IEEE International Conference on Robotics and Biomimetics (ROBIO), p.2486-2490.
Lin L, Xie H, Zhang D, et al., 2010. Supervised neural Q-learning based motion control for bionic underwater robots. Journal of Bionic Engineering, 7:S177-S184.
Lu H, Li Y, Zhang L, et al., 2015. Contrast enhancement for images in turbid water. JOSA A, 32(5):886-893.
Meger D, Higuera JCG, Xu A, et al., 2015. Learning legged swimming gaits from experience. 2015 IEEE International Conference on Robotics and Automation (ICRA), p.2332-2338.
Narasimhan M, Singh SN, 2006. Adaptive optimal control of an autonomous underwater vehicle in the dive plane using dorsal fins. Ocean Engineering, 33(3-4):404-416.
Palomeras N, El-Fakdi A, Carreras M, et al., 2012. COLA2: A control architecture for AUVs. IEEE Journal of Oceanic Engineering, 37(4):695-716.
Polydoros AS, Nalpantidis L, 2017. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153-173.
Prahacs C, Saunders A, Smith MK, et al., 2004. Towards legged amphibious mobile robotics. Proceedings of the Canadian Engineering Education Association (CEEA).
Racanière S, Weber T, Reichert D, et al., 2017. Imagination-augmented agents for deep reinforcement learning. Advances in Neural Information Processing Systems, p.5690-5701.

Ribas D, Palomeras N, Ridao P, et al., 2011. Girona 500 AUV: From survey to intervention. IEEE/ASME Transactions on Mechatronics, 17(1):46-53.
Schaul T, Quan J, Antonoglou I, et al., 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Schulman J, Levine S, Abbeel P, et al., 2015. Trust region policy optimization. International Conference on Machine Learning, p.1889-1897.
Schulman J, Wolski F, Dhariwal P, et al., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shi W, Song S, Wu C, et al., 2018a. Multi pseudo Q-learning-based deterministic policy gradient for tracking control of autonomous underwater vehicles. IEEE Transactions on Neural Networks and Learning Systems.
Shi W, Song S, Wu C, 2018b. High-level tracking of autonomous underwater vehicles based on pseudo averaged Q-learning. 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), p.4138-4143.
Silver D, Lever G, Heess N, et al., 2014. Deterministic policy gradient algorithms. International Conference on Machine Learning, p.387-395.
Stokey RP, Roup A, von Alt C, et al., 2005. Development of the REMUS 600 autonomous underwater vehicle. Proceedings of OCEANS 2005 MTS/IEEE, p.1301-1304.
Sun T, He B, Nian R, et al., 2015. Target following for an autonomous underwater vehicle using regularized ELM-based reinforcement learning. OCEANS 2015-MTS/IEEE Washington, p.1-5.
Sutton RS, Barto AG, 2018. Reinforcement Learning: An Introduction. MIT Press.
Sutton RS, McAllester DA, Singh SP, et al., 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, p.1057-1063.
Walters P, Kamalapurkar R, Voight F, et al., 2018. Online approximate optimal station keeping of a marine craft in the presence of an irrotational current. IEEE Transactions on Robotics, 34(2):486-496.
Wang C, Wei L, Wang Z, et al., 2018. Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation. Sensors, 18(11):3859.
Wang J, Kim J, 2015. Optimization of fish-like locomotion using hierarchical reinforcement learning. 2015 12th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), p.465-469.
Waslander SL, Hoffmann GM, Jang JS, et al., 2005. Multi-agent quadrotor testbed control design: Integral sliding mode vs. reinforcement learning. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, p.3712-3717.
Williams RJ, 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.
Wu H, Song S, You K, et al., 2018. Depth control of model-free AUVs via reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, (99):1-12.
Xiang X, Jouvencel B, Parodi O, 2010. Coordinated formation control of multiple autonomous underwater vehicles for pipeline inspection. International Journal of Advanced Robotic Systems, 7(1):3.
Yoo B, Kim J, 2016. Path optimization for marine vehicles in ocean currents using reinforcement learning. Journal of Marine Science and Technology, 21(2):334-343.
Yu R, Shi Z, Huang C, et al., 2017. Deep reinforcement learning based optimal trajectory tracking control of autonomous underwater vehicle. 2017 36th Chinese Control Conference (CCC), p.4958-4965.
Zhang XL, Li B, Chang J, et al., 2018. Gliding control of underwater gliding snake-like robot based on reinforcement learning. 2018 IEEE 8th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), p.323-328.