
Article

A Self-Adaptive Reinforcement-Exploration Q-Learning Algorithm

Lieping Zhang 1, Liu Tang 1, Shenglan Zhang 1, Zhengzhong Wang 1, Xianhao Shen 2 and Zuqiong Zhang 2,*

1 College of Mechanical and Control Engineering, Guilin University of Technology, Guilin 541006, China; [email protected] (L.Z.); [email protected] (L.T.); [email protected] (S.Z.); [email protected] (Z.W.)
2 College of Information Science and Engineering, Guilin University of Technology, Guilin 541006, China; [email protected]
* Correspondence: [email protected]

Abstract: To address several problems of the traditional Q-Learning algorithm, such as heavy repetition and disequilibrium of explorations, a reinforcement-exploration strategy was used to replace the decayed ε-greedy strategy in the traditional Q-Learning algorithm, and a novel self-adaptive reinforcement-exploration Q-Learning (SARE-Q) algorithm was thus proposed. First, the concept of behavior utility trace was introduced into the proposed algorithm, and the probability for each action to be chosen was adjusted according to the behavior utility trace, so as to improve the efficiency of exploration. Second, the attenuation process of the exploration factor ε was designed in two phases, where the first phase centered on exploration and the second shifted the focus from exploration to utilization, with the exploration rate dynamically adjusted according to the success rate. Finally, by establishing a list of state access times, the exploration factor ε of the current state is adaptively adjusted according to the number of times the state has been accessed.

The symmetric grid map environment was established via the OpenAI Gym platform to carry out symmetrical simulation experiments on the Q-Learning algorithm, the self-adaptive Q-Learning (SA-Q) algorithm and the SARE-Q algorithm. The experimental results show that the proposed algorithm has obvious advantages over the first two in the average number of turning times, average success rate, and number of times with the shortest planned route.

Keywords: Q-Learning algorithm; self-adaptive Q-Learning algorithm; self-adaptive reinforcement-exploration strategy; path planning

1. Introduction

Reinforcement learning (RL), one of the methodologies of machine learning, is used to

describe and solve how an intelligent agent learns and optimizes its strategy during the interaction with the environment [1]. To be more specific, the intelligent agent acquires the reinforcement signal (reward feedback) from the environment during the continuous interaction with the environment, and adjusts its own action strategy through the reward feedback, aiming at the maximum gain. Different from supervised learning [2] and semi-supervised learning [3], RL does not need to collect training samples in advance, and during the interaction with the environment, the intelligent agent will automatically learn

to evaluate the action generated according to the rewards fed back from the environment, instead of being directly told the correct action.
In general, the Markov decision-making process is used by the RL algorithm for environment modeling [4]. Based on whether the transition probability P of the Markov decision-making process in the sequential decision problem is already known, the RL algorithm is divided into two major types [5]: a model-based RL algorithm under known transition probability and a model-free RL algorithm under unknown transition probability,


where the former is a dynamic planning method, and the latter mainly includes the strategy-based RL method, the value function-based RL method, and the integrated strategy–value function RL method. The value function-based RL method is an important solution to the model-free Markov problem, and it mainly makes use of Monte Carlo (MC) RL and temporal-difference (TD) RL [6]. As one of the most commonly used RL algorithms, the Q-Learning algorithm, which is based on value, reference strategy learning and the TD method [7], has been widely applied to route planning, manufacturing and assembly, and dynamic train scheduling [8–10].
Many researchers have been dedicated to improving the low exploration efficiency of the traditional Q-Learning algorithm. Qiao, J.F. et al. [11] proposed an improved Q-Learning obstacle avoidance algorithm in which the Q-Table was substituted with a neural network (NN), in an effort to overcome the deficiency of Q-Learning, namely, that it is inapplicable under continuous states. Song, Y. et al. [12] applied the artificial potential field to the Q-Learning algorithm, proposed a Q value initialization method, and improved the convergence rate of the Q-Learning algorithm. In [13], a dynamic adjustment algorithm for the exploration factor in the ε-greedy strategy was raised to improve the exploration–utilization equilibrium problem existing in the practical application of the RL method. In [14], a nominal control-based supervised RL route planning algorithm was brought forward, and tutor supervision was introduced into the Q-Learning algorithm, thus accelerating the algorithm convergence. Andouglas et al. came up with a reward matrix-based Q-Learning algorithm, which satisfies the route planning demand of marine robots [15]. Pauline et al. introduced the concept of partially guided Q-Learning, initialized the Q-Table through the flower pollination algorithm (FPA), and thus optimized the route planning performance of mobile robots [16]. Park, J.H. et al. employed a genetic algorithm to evolve robotic structures as an outer optimization, and applied a reinforcement learning algorithm to each candidate structure to train its behavior and evaluate its potential learning ability as an inner optimization [17].
It can be seen from the above description that the current research on the improvement of the Q-Learning algorithm mainly focuses on two aspects: the first is to study the attenuation law of the exploration factor ε with the increase in the number of training episodes; the second is to adjust the exploration factor ε of the next episode according to the training information of the previous episode. This article creatively proposes the idea of dynamically adjusting the exploration factor of the current episode according to the number of times states have been accessed. Based on the decayed ε-greedy strategy, the concept of behavior utility trace is introduced, and a states-accessed list and an on-site adaptive dynamic adjustment method are added, so that the self-adaptive reinforcement-exploration method is realized. Taking path planning as the application object, the simulation results verify the effectiveness and superiority of the proposed algorithm.
The main contributions of the proposed algorithm are as follows: (1) The traditional Q-Learning algorithm selects actions randomly with simple equal probability, without considering the differences between actions. In this article, the behavior utility trace is introduced to increase the probability of effective actions being randomly selected.
(2) In order to better resolve the contradiction between exploration and utilization, this article designs the attenuation process of the exploration factor in two stages. The first stage is the exploration stage, in which the attenuation of the exploration factor is designed to be slow so as to improve the exploration rate. The second stage is the transition from the exploration stage to the utilization stage, which dynamically adjusts the attenuation speed of the exploration factor according to the success rate. (3) In addition, because the exploration factor is fixed within each episode, the agent makes too much exploration near the initial location. This article proposes to record the access times of the current state through the states-accessed list, so as to reduce the exploration probability of states with more access times and improve the exploration rate of states accessed for the first time, thereby enlarging the radiation range of exploration.

2. Markov Process

2.1. Markov Reward Process (MRP)

In a timing sequence process, if the state at time t + 1 only depends on the state St at time t and is unrelated to any state before time t, the state St at time t is considered to have the Markov property. If each state in a process has the Markov property, the random process is called a Markov process [18]. The key to describing a Markov process lies in the state transition probability matrix, namely, the probability of transition from the state St = s at time t into the state St+1 = s' at time t + 1, as shown in Equation (1):

$$P_{ss'} = p\left(S_{t+1} = s' \mid S_t = s\right) \qquad (1)$$

For a Markov process with n states (s1, s2, ··· , sn), the state transition probability matrix P from the state s into all follow-up states s' is expressed by Equation (2):

$$P = \begin{pmatrix} P_{s_1 s_1} & \cdots & P_{s_1 s_n} \\ \vdots & \ddots & \vdots \\ P_{s_n s_1} & \cdots & P_{s_n s_n} \end{pmatrix} \qquad (2)$$

In Equation (2), the data in each row of the state transition matrix P denote the probability values for the transition from one state into all n states, and the sum of each row is constantly 1. An MRP is formed when the reward element is added into a Markov process. An MRP, which is composed of a finite state set S, a state transition probability matrix P, a reward function R and an attenuation factor γ (γ ∈ [0, 1]), can be described through the quadruple form ⟨S, P, R, γ⟩. In an MRP, the sum of all attenuated rewards from the initial sampling under a state St until the terminal state is called the gain, which is expressed by Equation (3):

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k R_{t+k+1} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (3)$$

In Equation (3), Rt+1 is the instant reward at time t + 1, k is the total number of subsequent states from the time t + 1 until the terminal state, and Rt+k+1 is the instant reward of terminal state. The expected gain of a state in the MRP is called value, which is used to describe the importance of a state. The value v(s) is expressed as shown in Equation (4):

$$v(s) = \mathbb{E}[\,G_t \mid S_t = s\,] \qquad (4)$$

By unfolding the gain Gt in the value function according to its definition, the expression shown in Equation (5) can be acquired:

$$v(s) = \mathbb{E}[\,R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s\,] \qquad (5)$$

In Equation (5), the expected value of the state St+1 at time t + 1 can be obtained according to the probability distribution at the next time. Let s denote the state at the present time and s' represent any possible state at the next time; then Equation (5) can be written in the following form:

$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s') \qquad (6)$$

Equation (6), which is called the Bellman equation of the MRP, expresses that the value of a state consists of its own reward and the values of its subsequent states taken according to a certain probability and attenuation ratio.
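As a quick numerical illustration of Equation (6) (not part of the original paper), the following Python sketch iterates the Bellman equation for a small, hypothetical three-state MRP until the state values converge; the transition matrix and rewards are invented purely for demonstration.

```python
import numpy as np

# A hypothetical 3-state MRP, used only to illustrate Equation (6).
P = np.array([[0.5, 0.5, 0.0],   # state transition probability matrix; each row sums to 1
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])  # the last state is absorbing
R = np.array([1.0, 0.5, 0.0])    # reward R_s of each state
gamma = 0.9                      # attenuation factor

v = np.zeros(len(R))             # initial value estimates
for _ in range(1000):            # repeatedly apply v <- R + gamma * P v
    v_new = R + gamma * P @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)                         # converged state values
```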

2.2. Markov Decision-Making Process

In addition to the MRP, the RL problem also involves the individual behavior choice. Once the individual behavior choice is included in the MRP, the Markov decision-making process is obtained.
The Markov decision-making process can be described using the quintuple form ⟨S, A, P, R, γ⟩, where its finite behavior set A, finite state set S and attenuation factor γ are identical to those of the MRP. Different from the MRP, the gain R and the state transition probability matrix P in the Markov decision-making process are based on behaviors. When the action At = a is chosen at the state St = s, the gain R_s^a at that moment can be expressed by the expected gain Rt+1 at time t + 1, and it is mathematically described as shown in Equation (7):

$$R^a_s = \mathbb{E}[\,R_{t+1} \mid S_t = s, A_t = a\,] \qquad (7)$$

The mathematical description of the state transition probability P_ss'^a from the state St = s into the subsequent state St+1 = s' is shown in Equation (8):

$$P^a_{ss'} = P\left(S_{t+1} = s' \mid S_t = s, A_t = a\right) \qquad (8)$$

In the Markov decision-making process, an individual chooses one action from the finite behavior set according to its own recognition of the present state, and the basis for choosing such a behavior is called a strategy, which is a mapping from state to action. The strategy is usually expressed by the symbol π, referring to a distribution over the behavior set under the given state s, namely:

$$\pi(a \mid s) = P[\,A_t = a \mid S_t = s\,] \qquad (9)$$

The state value function vπ(s) is used to describe the value generated when the state s abides by a specific strategy π in the Markov decision-making process, and it is defined by Equation (10):

$$v_\pi(s) = \mathbb{E}[\,G_t \mid S_t = s\,] \qquad (10)$$

The behavior value function is used to describe the expected gain that can be obtained by executing an action a under the present state s when a specific strategy π is followed. As a behavior value is usually based on one state, the behavior value function is also called the state–behavior pair value function. The definition of the behavior value function qπ(s, a) is shown in Equation (11):

$$q_\pi(s, a) = \mathbb{E}[\,G_t \mid S_t = s, A_t = a\,] \qquad (11)$$

By substituting Equation (3) into Equations (10) and (11), respectively, the two Bellman expectation Equations (12) and (13) can be acquired:

$$v_\pi(s) = \mathbb{E}[\,R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\,] \qquad (12)$$

$$q_\pi(s, a) = \mathbb{E}[\,R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\,] \qquad (13)$$

In the Markov decision-making process, a behavior serves as the bridge of state transition, and the behavior value is closely related to the state value. To be more specific, the state value can be expressed by all behavior values under this state, and the behavior value can be expressed by all state values that the behavior can reach, as shown in Equations (14) and (15), respectively:

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a) \qquad (14)$$

$$q_\pi(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_\pi(s') \qquad (15)$$

The following can be obtained by substituting Equation (15) into Equation (14):

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left(R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')\right) \qquad (16)$$

Equation (14) is substituted into Equation (15) to obtain:

$$q_\pi(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a') \qquad (17)$$

Each strategy corresponds to a state value function, and the optimal strategy naturally corresponds to the optimal state value function. The optimal state value function v∗(s) is defined as the maximum state value function among all strategies, and the optimal behavior value function q∗(s, a) is defined as the maximum behavior value function among all strategies, as shown in Equations (18) and (19), respectively:

$$v_*(s) = \max_{\pi} v_\pi(s) \qquad (18)$$

$$q_*(s, a) = \max_{\pi} q_\pi(s, a) \qquad (19)$$

From Equations (16) and (17), the optimal state value function and the optimal behavior value function can be obtained, as shown in Equations (20) and (21), respectively:

$$v_*(s) = \max_{a}\left(R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_*(s')\right) \qquad (20)$$

$$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a'} q_*(s', a') \qquad (21)$$

If the optimal behavior value function is known, the optimal strategy can be acquired by directly maximizing the optimal behavior value function q∗(s, a), as shown in Equation (22):

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in A}{\arg\max}\, q_*(s, a) \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$
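For illustration only, Equation (22) amounts to a one-line argmax over the Q-Table; the small Q-table below is a hypothetical example, not data from the paper.

```python
import numpy as np

# Hypothetical optimal behavior value function: rows are states, columns are actions.
q_star = np.array([[0.2, 0.8, 0.1],
                   [0.5, 0.4, 0.9]])

# Equation (22): put probability 1 on argmax_a q*(s, a), 0 elsewhere.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), q_star.argmax(axis=1)] = 1.0
print(pi_star)   # one deterministic greedy action per state
```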

3. Self-Adaptive Reinforcement-Exploration Q-Learning Algorithm

3.1. Q-Learning Algorithm

Compared with the TD reinforcement learning method based on the value function and the state–action–reward–state–action (SARSA) algorithm, the iteration of the Q-Learning algorithm is a trial-and-error process. One of the conditions for convergence is to try every possible state–action pair many times and finally learn the strategy. At the same time, the Q-Learning algorithm is an effective reinforcement learning algorithm under the condition of an unknown environment. It does not need to establish an environment model, and it has the advantages of guaranteed convergence, few parameters and strong exploratory ability. Especially in the field of path planning, which focuses on obtaining optimal results, the Q-Learning algorithm has clear advantages, and it is a current research hotspot [19].
The Q in the Q-Learning algorithm is the value Q(s,a) of a state–behavior pair, representing the expected return when the Agent executes the action a (a ∈ A) under the state s (s ∈ S). The environment feeds back a corresponding instant reward value r according to the behavior a of the Agent. Therefore, the core idea of this algorithm is to use a Q-Table to store all Q values. In the continuous interaction between the Agent and the environment, the expected value (evaluation) is updated through the actual reward r obtained by executing the action a under the state s, and in the end, the action contributing to the maximum return according to the Q-Table is chosen.

Reference strategy TD learning updates the behavior value with the TD method under the reference strategy condition. The updating method of reference strategy TD learning is shown in Equation (23):

$$V(S_t) \leftarrow V(S_t) + \alpha\left(\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\bigl(R_{t+1} + \gamma V(S_{t+1})\bigr) - V(S_t)\right) \qquad (23)$$

In Equation (23), V(St) represents the value assessment of the state St at time t, V(St+1) is the value assessment of the state St+1 at time t + 1, α is the learning rate, γ is the attenuation factor, Rt+1 denotes the instant return value at time t + 1, π(At|St) is the probability for the target strategy π(a|s) to execute the action At under the state St, and µ(At|St) is the probability of executing the action At under the state St according to the behavior strategy µ(a|s). In general, the chosen target strategy π(a|s) is greedy to a certain extent. If π(At|St)/µ(At|St) < 1, the probability of the reference strategy choosing the action At is smaller than that of the behavior strategy, and the updating amplitude is relatively conservative at that moment. When this ratio is greater than 1, the probability of the reference strategy choosing the action At is greater than that of the behavior strategy, and the updating amplitude is relatively bold.
When the behavior strategy µ of the reference strategy TD learning method is replaced by the ε-greedy strategy based on the value function Q(s,a), and the target strategy π is replaced by the complete greedy strategy based on the value function Q(s,a), a Q-Learning algorithm is formed.

Q-Learning updates Q(St, At) using the temporal-difference method, and the updating method is expressed by Equation (24):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)\right) \qquad (24)$$

In Equation (24), the TD target value Rt+1 + γQ(St+1, A') is the Q value of the behavior A' generated based on the reference strategy π. Thanks to this updating method, the behavior value obtained by the state St according to the ε-greedy strategy is updated by a certain proportion towards the direction of the maximum behavior value determined by the greedy strategy under the state St+1, and it finally converges to the optimal strategy and the optimal behavior value. The concrete behavior updating formula of the Q-Learning algorithm is shown in Equation (25):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(R + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right) \qquad (25)$$

In Equation (25), a' is the action acquiring the maximum behavior value under the subsequent state.
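A minimal tabular reading of the update in Equation (25), sketched here in Python for illustration (the array-indexed Q-Table and the function interface are assumptions, not the authors' code):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=1.0, gamma=0.9):
    """One application of Equation (25) to a tabular Q-Table (2-D array: state x action)."""
    td_target = r + gamma * np.max(Q[s_next])   # R + gamma * max_a' Q(S_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(S_t, A_t) toward the TD target
    return Q
```

With α = 1.0, as used in the experiments of Section 4.1, the old estimate is completely replaced by the TD target at every update.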

3.2. Adaptive Reinforcement-Exploration Strategy Design

3.2.1. Behavior Utility Trace

In order to improve the low exploration efficiency of the traditional exploration strategy, in this study, the behavior utility trace [20] was introduced into the probability of action selection, and the self-adaptive reinforcement-exploration strategy was proposed based on the improved ε-greedy algorithm. In RL, the behavior utility trace is used to record the influence of each state on its subsequent states and to adjust the step size during the state value updating. The behavior utility trace is defined as shown in Equation (26):

$$E_t(s) = \lambda E_{t-1}(s) + 1(S_t = s), \quad \lambda \in [0, 1] \qquad (26)$$

In Equation (26), 1(St = s) is an indicator expression used to judge a truth value. It is 1 when and only when St = s, and 0 under other conditions.
Meanwhile, whether a state has been accessed is also an important signal. After an action is executed and the Agent is transferred to a brand-new state, this, to some extent, indicates that the

exploration of the action is effective this time. Therefore, the state access function v(st) was used in this study to describe whether a state has been accessed, as shown in Equation (27):

$$v(s_t) = \begin{cases} 0 & s_t \in V \\ 1 & \text{else} \end{cases} \qquad (27)$$

In Equation (27), st represents the present state, V is the set of accessed states, and the value of v(st) is 1 when and only when this state is accessed for the first time.
By combining the instant reward, the features of the utility trace, and the relationship between whether a state is accessed and the action itself, an action utility trace based on the instant reward and the state access function was designed in this study, as shown in Equation (28):

$$E_t(a_i) = \begin{cases} \lambda E_{t-1}(a_i) + e \times v(s_t) \times 1(r_t > 0) & a_i = a_{t-1} = a_{t-2} \\ e \times v(s_t) \times 1(r_t > 0) & a_i = a_{t-1} \neq a_{t-2} \\ 0 & \text{else} \end{cases} \qquad (28)$$

In Equation (28), ai ∈ A is an action in the behavior space of the Agent, at−1 and at−2 are the actions taken at the previous time and the time before that, respectively, rt is the instant reward of the present state, and e is the exploration incentive value set according to the practical situation. When the same action is chosen continuously, the E value Et(ai) of the action ai evolves as follows: when the instant reward is positive and the present state has not been accessed before, the E value at the present time is obtained by adding e to the attenuated E value of the previous time; when the instant reward is not positive or the present state has already been accessed, the E value at the present time is simply the attenuated E value of the previous time. When the action selection is not continuous (the action was chosen at the previous time but not the time before), the E value at the present time is e under a positive instant reward in a newly accessed state and 0 otherwise; for all other actions, the E value is constantly 0.
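A minimal Python sketch of the utility-trace update in Equation (28); the data structures (a dictionary of E values keyed by action and a set of visited states) and the function name are illustrative assumptions rather than the paper's implementation:

```python
def update_utility_trace(E, actions, a_prev, a_prev2, s_t, r_t, visited,
                         lam=0.75, e=1.0):
    """Apply Equation (28) to the E value of every action a_i."""
    v_st = 0 if s_t in visited else 1            # state access function, Equation (27)
    bonus = e * v_st * (1 if r_t > 0 else 0)     # e * v(s_t) * 1(r_t > 0)
    for a_i in actions:
        if a_i == a_prev == a_prev2:             # the same action chosen twice in a row
            E[a_i] = lam * E[a_i] + bonus
        elif a_i == a_prev:                      # chosen last step but not the step before
            E[a_i] = bonus
        else:                                    # all other actions lose their trace
            E[a_i] = 0.0
    visited.add(s_t)
    return E
```

Here λ = 0.75 and e = 1 follow the parameter settings reported later in Section 4.1.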

3.2.2. Adaptive Reinforcement-Exploration Strategy

Based on the utility trace introduced above, a self-adaptive reinforcement-exploration strategy was put forward in this study. This strategy, which is a symmetric improvement of the ε-greedy strategy, was improved mainly in the following three aspects:
• Introduction of the behavior utility trace to improve the probability for different actions to be chosen and to enhance the effectiveness of exploration actions. When the strategy chooses to explore the environment at a certain probability, the probability for each action to be randomly chosen is shown in Equation (29):

$$P(a_i) = \frac{1 + E_{a_i}}{n + \sum_{j=1}^{n} E_{a_j}} \qquad (29)$$

In Equation (29), ai ∈ A is an action, Eai is the E value of the behavior utility trace for each action, and n is the total number of actions in the behavior space. When a positive instant reward is obtained by executing an action or the Agent is transferred to a new state, the E value of this action is enlarged, and so is the probability of it being chosen. Through such a design, the algorithm is encouraged to explore in a linear form, thus expanding the radiation range of the initial exploration.
• Real-time adjustment of the exploration factor ε in different phases until it meets the objective needs. As for the adjustment of the exploration factor ε, the whole attenuation process of ε is divided into two phases. The first phase is the initial training phase of the Agent, in which the Agent has almost no understanding of the environmental information and its main task is to explore the environment. In this phase, the attenuation of the exploration factor ε should be

slightly slow. In this study, the concrete adjustment formula for the exploration factor ε is shown in Equation (30):

$$\varepsilon_t = 1 - \sin^2\left(\frac{t}{T} \cdot \frac{\pi}{2}\right), \quad t \leq \beta T \qquad (30)$$

In Equation (30), εt is the exploration factor at the present time, t is the present number of iteration cycles, T is the maximum number of training cycles of the state, and β is the self-adaptive exploration factor, with a range of [0, 0.5]. When the iteration cycle satisfies t > βT, the second phase begins, in which the focus is shifted from exploration to utilization. In this study, the exploration factor ε was designed to attenuate approximately linearly and to gradually stabilize at the set minimum value during this phase. Meanwhile, in order to adjust the attenuation rate according to the practical situation, the concept of success rate proposed in the literature was taken for reference to design the concrete adjustment formula of the exploration factor ε in the second phase, as shown in Equation (31):

$$\varepsilon_t = \begin{cases} \varepsilon_1 + \dfrac{t - \beta T}{T_1 - \beta T}(\varepsilon_{\min} - \varepsilon_1) - i \times R & \varepsilon_1 + \dfrac{t - \beta T}{T_1 - \beta T}(\varepsilon_{\min} - \varepsilon_1) - i \times R \geq \varepsilon_{\min} \\ \varepsilon_{\min} & \text{else} \end{cases} \qquad (31)$$

In Equation (31), ε1 is the final value of ε in the first phase, T1 is the target number of episodes at which ε reaches its minimum, R is the probability of the Agent successfully arriving at the target position in every ten experiments, i is the accelerated utilization factor of the success rate, and εmin is the set minimum value of the exploration factor. Through such a design, the higher the success rate of the Agent becomes, the faster the exploration factor ε attenuates, so as to make better use of the information and reach the goal of accelerating the convergence while ensuring the algorithm effect.
• Adaptive adjustment of the exploration factor ε of the present action according to the number of access times of the state. If the traditional exploration strategy is used, although the previous information will be shared in the subsequent iterations, the exploration of the Agent near the initial position will be much greater than the exploration near the target position. In order to improve the exploration rate near the target position and reduce the ineffective exploration near the initial position, a dynamic adjustment method was proposed in this study for the exploration factor of a state according to the number of access times of this state. The adjustment formula of the exploration factor εcur of the present state within the iteration cycle is shown in Equation (32):

$$\varepsilon_{cur} = \begin{cases} \varepsilon_t + \mu & x = 0 \\ \varepsilon_t - \mu \sin(\varphi x \times \pi/2) & \text{else} \end{cases} \qquad (32)$$

In Equation (32), εt is the basic exploration factor of the present state sequence, µ and ϕ are the utilization factors of number of state access times, x is the number of access times of the present state, the maximum value on record is xmax, and ϕ satisfies ϕxmax ≤ 1.
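Putting Equations (30)–(32) together, the following Python sketch computes the exploration factor for a given episode t and a state that has been accessed x times; it is an illustrative reading of the schedule under the parameter values of Section 4.1, not the authors' code.

```python
import math

def episode_epsilon(t, T, T1, eps1, success_rate,
                    beta=0.1, i=0.1, eps_min=0.01):
    """Two-phase attenuation of the exploration factor, Equations (30) and (31)."""
    if t <= beta * T:                                  # phase 1: slow decay, Equation (30)
        return 1.0 - math.sin((t / T) * math.pi / 2) ** 2
    # phase 2: approximately linear decay, accelerated by the success rate, Equation (31)
    eps = eps1 + (t - beta * T) / (T1 - beta * T) * (eps_min - eps1) - i * success_rate
    return max(eps, eps_min)

def state_epsilon(eps_t, x, mu=0.5, phi=0.1):
    """Per-state adjustment by the number of access times, Equation (32)."""
    if x == 0:                                         # a state visited for the first time
        return eps_t + mu
    return eps_t - mu * math.sin(phi * x * math.pi / 2)  # frequently visited states explore less
```

Here eps1 would be the phase-1 value at t = βT, T1 the target episode at which ε should reach its minimum, and success_rate the success probability R measured over the last ten episodes.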

3.3. Design of Reward and Penalty Functions

In essence, solving the RL problem means maximizing the return during the continuous trial and error and interaction between the Agent and the environment. As a reward given after the Agent executes an action, the reward function serves as the bridge for the Agent to acquire the environmental information, and its quality has a direct

bearing on whether or not the final strategy converges [21]. The traditional reward function is designed as shown in Equation (33):

$$r = \begin{cases} -r_1 & \text{collision} \\ r_2 & \text{get\_target} \\ 0 & \text{else} \end{cases} \qquad (33)$$

In Equation (33), both r1 and r2 are positive, collision means that a collision occurs, in which case the instant reward −r1 is acquired, and get_target expresses that the Agent successfully arrives at the target position, in which case the instant reward r2 is harvested. Obviously, a reward function of such a design suffers from an obvious problem: the reward values are too sparse, and the Agent obtains reward feedback only when experiencing a collision or reaching the target position. The Agent fails to obtain any reward feedback under other states, so it has to wander continuously and can hardly converge. Therefore, the reward density needs to be reasonably increased.
In the reward function designed in this study, any action of the Agent obtains an instant reward. The Agent acquires an instant reward −r1 when colliding, a positive reward r2 when approaching the target, a negative reward −r3 when moving away from the target, and the final reward r4 when reaching the target position. In the path planning task, the Agent needs to reach the target location without collision. Reaching the target location is the ultimate goal and the main task; therefore, the reward value r4 should be set to the maximum. The setting of r1 is to avoid collisions and can be regarded as a secondary task, so it is less than r4. The setting of r2 and r3 is to avoid reward values that are too sparse for the Agent to reach the target location, which plays an auxiliary role. Therefore, the values of r2 and r3 are set about two orders of magnitude smaller than r1 or r4. The reward function with symmetry designed in this study is shown in Equation (34):

$$r = \begin{cases} -r_1 & \text{collision} \\ +r_2 & \text{close\_to\_target} \\ -r_3 & \text{far\_from\_target} \\ r_4 & \text{get\_target} \end{cases} \qquad (34)$$

In Equation (34), r1, r2, r3 and r4 are all positive, close_to_target means approaching the target position, and far_from_target means moving away from the target position.
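A minimal sketch of the shaped reward in Equation (34), using the values reported in Section 4.1 (r1 = 20, r2 = 0, r3 = 0.2, r4 = 30); the distance-based test for approaching versus moving away is an illustrative assumption, since the paper does not spell out the distance metric.

```python
def shaped_reward(collided, reached_goal, dist_now, dist_prev,
                  r1=20.0, r2=0.0, r3=0.2, r4=30.0):
    """Reward function of Equation (34)."""
    if collided:
        return -r1            # collision penalty
    if reached_goal:
        return r4             # final reward for reaching the target position
    if dist_now < dist_prev:
        return r2             # the Agent got closer to the target
    return -r3                # the Agent moved away from the target
```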

3.4. Algorithm Implementation

The basic idea of the proposed SARE-Q algorithm is as follows: the self-adaptive reinforcement-exploration strategy is used as the behavior selection strategy of the Q-Learning algorithm to improve the environmental exploration efficiency and better balance the contradiction brought by the exploration–utilization problem. Meanwhile, the optimized reward function is combined to enhance the agent–environment interaction efficiency and improve the performance of the original algorithm. The algorithm pseudocode is shown in Algorithm 1:

Algorithm 1 SARE-Q algorithm.

Initialization: The following are initialized: the minimum exploration factor εmin, the maximum number of iterative episodes Tmax_episodes, the maximum step size of a single episode Tmax_steps, the learning rate α, the reward attenuation factor γ, the utility trace attenuation factor λ, the exploration incentive value e, the self-adaptive exploration factor β, the utilization factor i of the success rate, and the utilization factors µ and ϕ of access times. The terminal state set ST is set. For each state s (s ∈ S) and action a (a ∈ A), the initial value of the Q-Table is randomly set. The T-Table of state access times is initialized, and so is the behavior utility trace table, namely, the E-Table.
Loop iteration (for t = 1, 2, 3 ··· (t < Tmax_episodes)):
    Initialize the initial state s
    Update the basic exploration factor εt in this episode
    Cycle (for t = 1, 2, 3 ··· (t < Tmax_steps)):
        Update the exploration factor εcur of the current state
        Choose an action a according to the self-adaptive reinforcement-exploration strategy and the Q-Table
        Execute the action a, and acquire the instant reward r and the state s1 at the next time
        Update the Q-Table, T-Table and E-Table
        s ← s1
    Until reaching the terminal state or the maximum step size Tmax_steps of a single episode
Until reaching the maximum number of iterative episodes Tmax_episodes
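Algorithm 1 can also be read as the following Python-style sketch, which assembles Equations (25) and (27)–(32) around a Gym-like grid environment; the environment interface (reset/step returning an integer state, a reward and a done flag) and all variable names are illustrative assumptions, not the authors' released code.

```python
import math
import random
import numpy as np

def sare_q(env, n_states, n_actions, T=1000, T_steps=100, alpha=1.0, gamma=0.9,
           beta=0.1, i_rate=0.1, eps_min=0.01, mu=0.5, phi=0.1, lam=0.75, e=1.0):
    Q = np.random.rand(n_states, n_actions)       # randomly initialized Q-Table
    visits = np.zeros(n_states, dtype=int)        # T-Table: state access times
    E = np.zeros(n_actions)                       # E-Table: behavior utility traces
    visited, a_prev = set(), None
    recent = []                                   # success flags of recent episodes
    T1 = int(0.8 * T)                             # target episode for reaching eps_min
    eps1 = 1.0 - math.sin(beta * math.pi / 2) ** 2    # phase-1 value at t = beta*T
    for t in range(T):
        # basic exploration factor for this episode, Equations (30) and (31)
        if t <= beta * T:
            eps_t = 1.0 - math.sin((t / T) * math.pi / 2) ** 2
        else:
            window = recent[-10:]
            R = sum(window) / len(window) if window else 0.0
            eps_t = max(eps1 + (t - beta * T) / (T1 - beta * T) * (eps_min - eps1)
                        - i_rate * R, eps_min)
        s = env.reset()
        success = False
        for _ in range(T_steps):
            x = visits[s]                         # per-state adjustment, Equation (32)
            eps_cur = eps_t + mu if x == 0 else eps_t - mu * math.sin(phi * x * math.pi / 2)
            if random.random() < eps_cur:         # explore: trace-weighted sampling, Eq. (29)
                w = 1.0 + E
                a = int(np.random.choice(n_actions, p=w / w.sum()))
            else:                                 # exploit the Q-Table
                a = int(np.argmax(Q[s]))
            s1, r, done = env.step(a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s1]) - Q[s, a])   # Equation (25)
            visits[s] += 1
            bonus = e * (0 if s1 in visited else 1) * (1 if r > 0 else 0)  # Eqs. (27)-(28)
            for a_i in range(n_actions):
                if a_i == a == a_prev:
                    E[a_i] = lam * E[a_i] + bonus
                elif a_i == a:
                    E[a_i] = bonus
                else:
                    E[a_i] = 0.0
            visited.add(s1)
            a_prev, s = a, s1
            if done:
                success = r > 0                   # assume a positive final reward marks success
                break
        recent.append(1 if success else 0)
    return Q
```

Whether an episode counts as a success is inferred here from a positive terminal reward, which matches the reward design of Section 3.3 but is an implementation choice of this sketch.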

4. Simulation Experiment and Results Analysis

4.1. Simulation Experimental Environment and Parameter Setting

In this study, the simulation experiment was carried out in the n × n symmetric grid map environment commonly used in route planning experiments. Three experimental scenarios—multi-obstacle grid map environments consisting of 10 × 10, 15 × 15 and 20 × 20 grids, respectively—were designed to verify the Q-Learning algorithm, the self-adaptive Q-Learning (SA-Q) algorithm proposed in [22], and the SARE-Q algorithm proposed in this study. Hereby the simulation experimental environment design is introduced by taking the 20 × 20 grid environment as an example, as shown in Figure 1.

Figure 1. The schematic diagram of the 20 × 20 grid environment.

As shown in Figure 1, there were in total 20 × 20 = 400 grids. The grey grids in the figure represented obstacles, and eight different obstacles were set in total. The green grid expressed the initial position of the Agent, with the coordinates of (2,14). The blue grid was the target position of the Agent, with the coordinates of (17,6). The red circle was the Agent. The initial position and target position of the Agent were fixed, and the Agent only moved within the white grids, where neither obstacles nor the boundary could be accessed.


The Agent being located in different grids represented different states. The Agent could move in four directions: up, down, left and right, but it could move forward only by the unit grid length in one direction each time. When the Agent moved in one direction, if the next target position was an accessible grid, it would be shifted to this grid. If the next target position went beyond the edge or was inaccessible owing to obstacles, etc., the Agent would stay at the original position.
The experimental simulation platform in this study was as follows: the computer operating system was Windows 10, with a CPU of i5-8300H, internal storage of 8 GB, a graphics card of 1050Ti, video memory of 4 GB, conda version of 4.8.4, Python version of 3.5.6, and the simulation platform of OpenAI Gym. The related algorithm parameters used were as follows [23]: the initial exploration factor was εinit = 1, the minimum exploration factor was εmin = 0.01, and through repeated experiments, the learning rate was set as α = 1.0, the reward attenuation factor as γ = 0.9, the maximum step size of a single episode as Stepmax = 100, the maximum number of iterative episodes as Tmax_episodes = 1000, the maximum number of exploration episodes as 0.8 times the maximum number of iterative episodes Tmax_episodes, namely 800, the utility trace attenuation factor as λ = 0.75, the exploration incentive value as e = 1, the self-adaptive exploration factor as β = 0.1, the utilization factor of the success rate as i = 0.1, µ and ϕ for the utilization factors of access times as 0.5 and 0.1, respectively, and r1, r2, r3 and r4 for the reward function as 20, 0, 0.2 and 30, respectively.
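For reference, the hyperparameters stated above can be collected into a single Python mapping (the values are transcribed from this paragraph; the dictionary itself is only an illustrative way of organizing them):

```python
SARE_Q_PARAMS = {
    "eps_init": 1.0,          # initial exploration factor
    "eps_min": 0.01,          # minimum exploration factor
    "alpha": 1.0,             # learning rate
    "gamma": 0.9,             # reward attenuation factor
    "max_steps": 100,         # maximum step size of a single episode
    "max_episodes": 1000,     # maximum number of iterative episodes
    "explore_episodes": 800,  # 0.8 x max_episodes
    "lambda": 0.75,           # utility trace attenuation factor
    "e": 1.0,                 # exploration incentive value
    "beta": 0.1,              # self-adaptive exploration factor
    "i": 0.1,                 # utilization factor of the success rate
    "mu": 0.5,                # utilization factor of access times
    "phi": 0.1,               # utilization factor of access times
    "r1": 20, "r2": 0, "r3": 0.2, "r4": 30,   # reward function values
}
```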

4.2. Simulation Experiment and Result Analysis

In the experimental environment shown in Figure 1, three different algorithms were used for the route planning, and their typical return curves are displayed in Figure 2, where the x-coordinate denotes the present iterative episode, and the y-coordinate represents the total return obtained by each episode. It could be observed from Figure 2 that in the 20 × 20 grid environment, in comparison with the Q-Learning algorithm and the SA-Q algorithm, the Agent could reach the target position earlier by using the SARE-Q algorithm, and the number of times it reached the target position was the maximum. From the curve shape, the curves obtained by the Q-Learning algorithm and the SA-Q algorithm were straighter than that obtained by the proposed algorithm after the initial convergence, which is ascribed to the dynamic adjustment of the exploration factor ε. For any state not accessed, or accessed only a few times, the proposed algorithm would reinforce the exploration nearby such a state, thus leading to a great curve fluctuation. However, for the other two algorithms, the exploration factor ε was stabilized at a small value in the later iteration phase, and it almost did not fluctuate after being approximately converged.
The optimal routes obtained by 100 route planning experiments using the three different algorithms are shown in Figure 3. It can be seen that all three algorithms could give the shortest routes, among which the number of turning times of the optimal route given by the Q-Learning algorithm was six, that by the SA-Q algorithm was seven, and that by the SARE-Q algorithm was four.
The statistical performance results of the three algorithms in executing the route planning task 100 times are listed in Table 1. It can be seen from Table 1 that the average operating time of the SARE-Q algorithm differed little from that of the SA-Q algorithm and the Q-Learning algorithm. The average number of turning times of the proposed algorithm was the least, and it was much smaller than that of the Q-Learning algorithm and the SA-Q algorithm. As for the average success rate, the proposed algorithm was 16.3% ahead of the SA-Q algorithm, and a considerable 35.8% ahead of the Q-Learning algorithm. In the aspects of average step size and number of times with the shortest route, all three algorithms might miss the shortest route, but the SARE-Q algorithm was superior to the other two algorithms.



Figure 2. The return curves of three different algorithms in a 20 × 20 grid environment: (a) the return curve of the Q-Learning algorithm; (b) the return curve of the SA-Q algorithm; (c) the return curve of the SARE-Q algorithm.


Figure 3. The optimal routes planned by three different algorithms in a 20 × 20 grid environment: (a) the optimal route planned by the Q-Learning algorithm; (b) the optimal route planned by the SA-Q algorithm; (c) the optimal route planned by the SARE-Q algorithm.

Table 1. The performance comparison of the three different algorithms in a 20 × 20 grid environment (100 times).

Evaluation Indexes                                        Q-Learning Algorithm    SA-Q Algorithm    SARE-Q Algorithm
Average operating time (s)                                2.047                   1.739             1.995
Average number of turning times (times)                   9.16                    10.22             7.6
Average success rate (%)                                  30.2                    49.7              66
Average step size (step)                                  23.38                   24.44             23.16
Number of times with the shortest route (times)           83                      88                92
Number of turning times of the optimal route (times)      6                       7                 4

The simulation experiments were also implemented in 15 × 15 and 10 × 10 symmetric grid environments, in an effort to explore the influence of different grid environments on the algorithm performance, and the corresponding experimental results are shown in Tables 2 and 3.


Table 2. The performance comparison of the three different algorithms in a 15 × 15 grid environment (100 times).

Evaluation Indexes                                        Q-Learning Algorithm    SA-Q Algorithm    SARE-Q Algorithm
Average operating time (s)                                1.43                    1.034             1.147
Average number of turning times (times)                   9.67                    10.32             8.33
Average success rate (%)                                  41.9                    71.5              85.5
Average step size (step)                                  24.24                   24.16             24.04
Number of times with the shortest route (times)           89                      93                98
Number of turning times of the optimal route (times)      9                       9                 7

Table 3. The performance comparison of the three different algorithms in a 10 × 10 grid environment (100 times).

Evaluation Indexes                                        Q-Learning Algorithm    SA-Q Algorithm    SARE-Q Algorithm
Average operating time (s)                                0.819                   0.629             0.79
Average number of turning times (times)                   7.19                    7.13              6.1
Average success rate (%)                                  58.4                    70.4              75.2
Average step size (step)                                  18.1                    18.81             18
Number of times with the shortest route (times)           94                      99                100
Number of turning times of the optimal route (times)      5                       5                 4

As shown in Tables 1–3, all performance indexes of the three algorithms declined with the increase in the number of map grids. During this process, the SA-Q algorithm was generally ahead of the Q-Learning algorithm in terms of performance indexes, though some of its indexes were lower than those of the Q-Learning algorithm. The SA-Q algorithm was always superior to the other two algorithms in terms of operating time. Although its operating time was not as good as that of the SA-Q algorithm, the other performance indexes of the SARE-Q algorithm were all better than those of the other two algorithms.

5. Conclusions

The SARE-Q algorithm was proposed in this study in order to tackle the problems of the traditional Q-Learning algorithm, e.g., slow convergence and easy trapping in local optima. In the end, the route planning was simulated on the OpenAI Gym platform. By a symmetric comparison with the Q-Learning algorithm and the SA-Q algorithm, the superiority of the proposed SARE-Q algorithm was verified. The following conclusions were obtained through the theoretical study and simulation experiments:
1. The problems existing in the Q-Learning algorithm were studied, the behavior utility trace was introduced into the Q-Learning algorithm, and the self-adaptive dynamic exploration factor was combined to put forward a reinforcement-exploration strategy, which substituted the exploration strategy of the traditional Q-Learning algorithm. The simulation experiments of route planning manifest that the SARE-Q algorithm shows advantages, to different extents, over the traditional Q-Learning algorithm and the algorithms proposed in other references in the following aspects: average number of turning times, average success rate, average step size, number of times with the shortest planned route, number of turning times of the optimal route, etc.
2. Though being of a certain complexity, the random environment given in this study is obviously too simple compared with the actual environment. Therefore, the algorithms should be explored in environments with dynamic obstacles and dynamic target positions in the future. Meanwhile, the algorithm was verified only through simulation experiments, so the corresponding actual system should be established for further verification. In addition, the SARE-Q algorithm proposed in this study remains to be further optimized in the aspect of parameter selection. How to optimize the related parameters through intelligent algorithms is the follow-up research content.
3. Restricted by the action space and sample space, the traditional RL algorithm is inapplicable to actual scenarios with a very large state space and a continuous action

space. With the integration of deep learning and RL, the deep RL method will overcome the deficiencies of the traditional RL method by virtue of the powerful representation ability of deep learning technology. The subsequent research content lies in studying the deep RL method and verifying it in route planning.

Author Contributions: Conceptualization, L.Z.; data curation, Z.W. and L.T.; simulation, Z.W. and Z.Z.; writing—original draft preparation, L.T. and S.Z.; writing—review and editing, L.T. and X.S.; supervision, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Natural Science Foundation of China (No. 61741303), the Key Laboratory of Spatial Information and Geomatics (Guilin University of Technology) (No. 19-185-10-08), and the Scientific Research Basic Ability Enhancement Program for Young and Middle-aged Teachers of Guangxi (No. 2021KY0260).

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Not applicable.

Acknowledgments: The authors express their gratitude to the reviewers for helping to improve this work.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Zhou, X.M.; Bai, T.; Gao, Y.B.; Han, Y. Vision-Based Robot Navigation through Combining and Hierarchical Reinforcement Learning. Sensors 2019, 19, 1576. [CrossRef]
2. Miorelli, R.; Kulakovskyi, A.; Chapuis, B.; D'Almeida, O.; Mesnil, O. Supervised learning strategy for classification and regression tasks applied to aeronautical structural health monitoring problems. Ultrasonics 2021, 113, 106372. [CrossRef]
3. Faroughi, A.; Morichetta, A.; Vassio, L.; Figueiredo, F.; Mellia, M.; Javidan, R. Towards website domain name classification using graph based semi-supervised learning. Comput. Netw. 2021, 188, 107865. [CrossRef]
4. Zeng, J.J.; Qin, L.; Hu, Y.; Yin, Q. Combining Subgoal Graphs with Reinforcement Learning to Build a Rational Pathfinder. Appl. Sci. 2019, 9, 323. [CrossRef]
5. Zeng, F.Y.; Wang, C.; Ge, S.S. A Survey on Visual Navigation for Artificial Agents with Deep Reinforcement Learning. IEEE Access 2020, 8, 135426–135442. [CrossRef]
6. Li, R.Y.; Peng, H.M.; Li, R.G.; Zhao, K. Overview on Algorithms and Applications for Reinforcement Learning. Comput. Syst. Appl. 2020, 29, 13–25. [CrossRef]
7. Luan, P.G.; Thinh, N.T. Hybrid genetic algorithm based smooth global-path planning for a mobile robot. Mech. Based Des. Struct. Mach. 2021, 2021, 1–17. [CrossRef]
8. Mao, G.J.; Gu, S.M. An Improved Q-Learning Algorithm and Its Application in Path Planning. J. Taiyuan Univ. Technol. 2021, 52, 91–97. [CrossRef]
9. Neves, M.; Vieira, M.; Neto, P. A study on a Q-Learning algorithm application to a manufacturing assembly problem. J. Manuf. Syst. 2021, 59, 426–440. [CrossRef]
10. Han, X.C.; Yu, S.P.; Yuan, Z.M.; Cheng, L.J. High-speed railway dynamic scheduling based on Q-Learning method. Appl. 2021. Available online: https://kns.cnki.net/kcms/detail/44.1240.TP.20210330.1333.042.html (accessed on 31 March 2021).
11. Qiao, J.F.; Hou, Z.J.; Ruan, X.G. Neural network-based reinforcement learning applied to obstacle avoidance. J. Tsinghua Univ. Sci. Technol. 2008, 48, 1747–1750. [CrossRef]
12. Song, Y.; Li, Y.B.; Li, C.H. Initialization in reinforcement learning for mobile robots path planning. Control Theory Appl. 2012, 29, 1623–1628.
13. Zhao, Y.N. Research of Path Planning Problem Based on Reinforcement Learning. Master's Thesis, Harbin Institute of Technology, Harbin, China, 2017.
14. Zeng, J.J.; Liang, Z.H. Research of path planning based on the supervised reinforcement learning. Comput. Appl. Softw. 2018, 35, 185–188.
15. da Silva, A.G.; dos Santos, D.H.; de Negreiros, A.P.F.; Silva, J.M.; Gonçalves, L.M.G. High-Level Path Planning for an Autonomous Sailboat Robot Using Q-Learning. Sensors 2020, 20, 1550. [CrossRef] [PubMed]
16. Low, E.S.; Ong, P.; Cheah, K.C. Solving the optimal path planning of a mobile robot using improved Q-learning. Robot. Auton. Syst. 2019, 115, 143–161. [CrossRef]
17. Park, J.H.; Lee, K.H. Computational Design of Modular Robots Based on Genetic Algorithm and Reinforcement Learning. Symmetry 2021, 13, 471. [CrossRef]
18. Li, B.H.; Wu, Y.J. Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning. IEEE Access 2020, 8, 29064–29074. [CrossRef]

19. Yan, J.J.; Zhang, Q.S.; Hu, X.P. Review of Path Planning Techniques Based on Reinforcement Learning. Comput. Eng. 2021. [CrossRef]
20. Seo, K.; Yang, J. Differentially Private Actor and Its Eligibility Trace. Electronics 2020, 9, 1486. [CrossRef]
21. Qin, Z.H.; Li, N.; Liu, X.T.; Liu, X.L.; Tong, Q.; Liu, X.H. Overview of Research on Model-free Reinforcement Learning. Comput. Sci. 2021, 48, 180–187.
22. Li, T. Research of Path Planning Algorithm based on Reinforcement Learning. Master's Thesis, Jilin University, Changchun, China, 2020.
23. Li, T.; Li, Y. A Novel Path Planning Algorithm Based on Q-learning and Adaptive Exploration Strategy. In Proceedings of the 2019 Scientific Conference on Network, Power Systems and Computing (NPSC 2019), Guilin, China, 16–17 November 2019; pp. 105–108. [CrossRef]