Article

A Self-Adaptive Reinforcement-Exploration Q-Learning Algorithm
Lieping Zhang 1, Liu Tang 1, Shenglan Zhang 1, Zhengzhong Wang 1, Xianhao Shen 2 and Zuqiong Zhang 2,*

1 College of Mechanical and Control Engineering, Guilin University of Technology, Guilin 541006, China; [email protected] (L.Z.); [email protected] (L.T.); [email protected] (S.Z.); [email protected] (Z.W.)
2 College of Information Science and Engineering, Guilin University of Technology, Guilin 541006, China; [email protected]
* Correspondence: [email protected]
Abstract: To address problems of the traditional Q-Learning algorithm, such as heavily repeated and unbalanced exploration, a reinforcement-exploration strategy is used to replace the decayed ε-greedy strategy of the traditional Q-Learning algorithm, yielding a novel self-adaptive reinforcement-exploration Q-Learning (SARE-Q) algorithm. First, the concept of a behavior utility trace is introduced, and the probability of each action being chosen is adjusted according to this trace, so as to improve exploration efficiency. Second, the attenuation of the exploration factor ε is designed in two phases: the first phase centers on exploration, and the second shifts the focus from exploration to utilization, with the exploration rate dynamically adjusted according to the success rate. Finally, by maintaining a list of state access counts, the exploration factor of the current state is adaptively adjusted according to the number of times the state has been visited.
A symmetric grid map environment was established on the OpenAI Gym platform to carry out simulation experiments on the Q-Learning algorithm, the self-adaptive Q-Learning (SA-Q) algorithm, and the SARE-Q algorithm. The experimental results show that the proposed algorithm has obvious advantages over the first two algorithms in the average number of turns, average success rate, and number of times the shortest planned route is obtained.

Keywords: reinforcement learning; Q-Learning algorithm; self-adaptive Q-Learning algorithm; self-adaptive reinforcement-exploration strategy; path planning
1. Introduction

Reinforcement learning (RL), one of the methodologies of machine learning, is used to
describe and solve how an intelligent agent learns and optimizes its strategy during interaction with the environment [1]. To be more specific, the intelligent agent acquires a reinforcement signal (reward feedback) from the environment during continuous interaction with it, and adjusts its own action strategy through the reward feedback, aiming at the maximum gain. Different from supervised learning [2] and semi-supervised learning [3], RL does not need to collect training samples in advance; during the interaction with the environment, the intelligent agent automatically learns
to evaluate the actions it generates according to the rewards fed back from the environment, instead of being directly told the correct action.

In general, the Markov decision-making process is used by the RL algorithm for environment modeling [4]. Based on whether the transition probability P of the Markov decision-making process in the sequential decision problem is already known, the RL algorithm is divided into two major types [5]: a model-based RL algorithm under known transition probability and a model-free RL algorithm under unknown transition probability,
where the former is a dynamic programming method, and the latter mainly includes the strategy-based RL method, the value function-based RL method, and the integrated strategy–value function RL method. The value function-based RL method is an important solution to the model-free Markov problem, and it mainly makes use of Monte Carlo (MC) RL and temporal-difference (TD) RL [6]. As one of the most commonly used RL algorithms, the Q-Learning algorithm, which is based on value, reference strategy learning and the TD method [7], has been widely applied to route planning, manufacturing and assembly, and dynamic train scheduling [8–10].

Many researchers have been dedicated to improving the low exploration efficiency of the traditional Q-Learning algorithm. Qiao, J.F. et al. [11] proposed an improved Q-Learning obstacle avoidance algorithm in which the Q-Table was substituted with a neural network (NN), in an effort to overcome the deficiency that Q-Learning is inapplicable to continuous states. Song, Y. et al. [12] applied the artificial potential field to the Q-Learning algorithm, proposed a Q value initialization method, and improved the convergence rate of the Q-Learning algorithm. In [13], a dynamic adjustment algorithm for the exploration factor in the ε-greedy strategy was proposed to improve the exploration–utilization equilibrium problem existing in practical applications of the RL method. In [14], a nominal control-based supervised RL route planning algorithm was brought forward, and tutor supervision was introduced into the Q-Learning algorithm, thus accelerating convergence. Andouglas et al. proposed a reward matrix-based Q-Learning algorithm, which satisfies the route planning demand of marine robots [15]. Pauline et al. introduced the concept of partially guided Q-Learning, initialized the Q-Table through the flower pollination algorithm (FPA), and thus optimized the route planning performance of mobile robots [16]. Park, J.H. et al. employed a genetic algorithm to evolve robotic structures as an outer optimization, and applied a reinforcement learning algorithm to each candidate structure to train its behavior and evaluate its potential learning ability as an inner optimization [17].

It can be seen from the above description that current research on improving the Q-Learning algorithm mainly focuses on two aspects: the first is to study the attenuation law of the exploration factor ε as the number of training episodes increases; the second is to adjust the exploration factor ε of the next episode according to the training information of the previous episode. This article proposes the idea of dynamically adjusting the exploration factor of the current episode according to the number of times each state has been accessed. Based on the decayed ε-greedy strategy, the concept of a behavior utility trace is introduced, and a states-accessed list together with an on-line adaptive adjustment method is added, so that a self-adaptive reinforcement-exploration method is realized. Taking path planning as the application object, the simulation results verify the effectiveness and superiority of the proposed algorithm. The main contributions of the proposed algorithm are as follows: (1) The traditional Q-Learning algorithm selects actions randomly with equal probability, without considering the differences between actions.
In this article, the behavior utility trace is introduced to increase the probability that effective actions are randomly selected. (2) In order to better resolve the contradiction between exploration and utilization, the attenuation process of the exploration factor is designed in two stages. The first stage is the exploration stage, in which the exploration factor decays slowly so as to keep the exploration rate high. The second stage is the transition from the exploration stage to the utilization stage, in which the decay speed of the exploration factor is dynamically adjusted according to the success rate. (3) In addition, because the exploration factor is fixed within each episode, the agent explores too much near the initial location. This article proposes recording the number of visits to the current state in a states-accessed list, so as to reduce the exploration probability of frequently visited states and raise the exploration rate of states visited for the first time, thereby enlarging the radiation range of exploration (an illustrative sketch of these mechanisms is given below).
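To make these three mechanisms concrete, the following Python sketch outlines one possible realization of utility-trace-weighted action selection, two-phase decay of the exploration factor driven by the success rate, and state-visit-count-based scaling of ε. The class name, decay constants and weighting formulas are illustrative assumptions, not the paper's exact update rules, which are defined in the following sections.

```python
import numpy as np

# Illustrative sketch only: the constants and weighting rules below are
# assumptions for exposition, not the update rules used in the paper.
class AdaptiveExplorer:
    def __init__(self, n_states, n_actions,
                 eps_start=1.0, eps_mid=0.5, eps_end=0.05):
        self.eps = eps_start
        self.eps_mid = eps_mid                    # boundary of the two decay phases
        self.eps_end = eps_end
        self.utility_trace = np.zeros((n_states, n_actions))  # behavior utility trace
        self.visit_count = np.zeros(n_states, dtype=int)      # states-accessed list

    def select_action(self, q_row, state, rng):
        # Scale the exploration factor down for frequently visited states,
        # so that exploration radiates outward from the start location.
        local_eps = self.eps / (1 + self.visit_count[state])
        self.visit_count[state] += 1
        if rng.random() < local_eps:
            # Bias the random choice by the utility trace instead of
            # drawing uniformly over all actions.
            weights = np.exp(self.utility_trace[state])
            return int(rng.choice(len(q_row), p=weights / weights.sum()))
        return int(np.argmax(q_row))

    def decay(self, episode, n_episodes, success_rate):
        if episode < n_episodes // 2:
            # Phase 1: slow, fixed decay keeps the exploration rate high.
            self.eps = max(self.eps_mid, self.eps * 0.999)
        else:
            # Phase 2: decay speed follows the recent success rate.
            self.eps = max(self.eps_end, self.eps * (1.0 - 0.05 * success_rate))
```

The key point is that the effective exploration factor local_eps shrinks as a state is revisited, so heavily revisited regions near the start are mostly exploited while newly reached states retain a high exploration rate.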
2. Markov Process

2.1. Markov Reward Process (MRP)
In a time-sequential process, if the state at time t + 1 only depends on the state $S_t$ at time t and is unrelated to any state before time t, the state $S_t$ at time t is said to have the Markov property. If every state in a process has the Markov property, the random process is called a Markov process [18]. The key to describing a Markov process lies in the state transition probability, namely, the probability of a transition from the state $S_t = s$ at time t into the state $S_{t+1} = s'$ at time t + 1, as shown in Equation (1):
$$P_{ss'} = p\left(S_{t+1} = s' \mid S_t = s\right) \quad (1)$$
For a Markov process with n states $(s_1, s_2, \cdots, s_n)$, the state transition probability matrix P from the state s into all follow-up states $s'$ is expressed by Equation (2):

$$P = \begin{bmatrix} P_{s_1 s_1} & \cdots & P_{s_1 s_n} \\ \vdots & \ddots & \vdots \\ P_{s_n s_1} & \cdots & P_{s_n s_n} \end{bmatrix} \quad (2)$$
In Equation (2), the entries in each row of the state transition matrix P denote the probabilities of transitioning from one state into each of the n states, and each row sums to 1.

An MRP is formed when the reward element is added to a Markov process. An MRP, which is composed of a finite state set S, a state transition probability matrix P, a reward function R and an attenuation factor γ (γ ∈ [0, 1]), can be described by the quadruple ⟨S, P, R, γ⟩. In an MRP, the sum of all attenuated rewards from the initial sampling at a state $S_t$ until the terminal state is called the gain, which is expressed by Equation (3):
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k R_{t+k+1} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \quad (3)$$
In Equation (3), $R_{t+1}$ is the instant reward at time t + 1, k is the total number of subsequent states from time t + 1 until the terminal state, and $R_{t+k+1}$ is the instant reward of the terminal state. The expected gain of a state in the MRP is called its value, which describes the importance of the state. The value v(s) is expressed as shown in Equation (4):
$$v(s) = \mathbb{E}[G_t \mid S_t = s] \quad (4)$$
By unfolding the gain $G_t$ in the value function according to its definition, the expression shown in Equation (5) can be acquired:
$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] \quad (5)$$
In Equation (5), the expected value of the state $S_{t+1}$ at time t + 1 can be obtained according to the probability distribution at the next time step. Let s denote the state at the present time and $s'$ any possible state at the next time; Equation (5) can then be written in the following form:

$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}\, v(s') \quad (6)$$

Equation (6), which is called the Bellman equation of the MRP, expresses that the value of a state consists of its reward and the values of subsequent states weighted by the transition probabilities and the attenuation factor.
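As a concrete illustration of Equations (1)–(6), the sketch below evaluates a hypothetical three-state MRP, both by solving the Bellman equation (6) as a linear system and by iterating it; the transition matrix, rewards and γ are made-up numbers used only for this example.

```python
import numpy as np

# Hypothetical 3-state MRP used only to illustrate Equations (1)-(6).
P = np.array([[0.5, 0.4, 0.1],    # each row of the transition matrix sums to 1
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # the third state is terminal (absorbing)
R = np.array([-1.0, -2.0, 0.0])   # reward function R_s
gamma = 0.9                       # attenuation factor

# Closed-form solution of the Bellman equation v = R + gamma * P v:
v_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# Equivalent iterative evaluation of Equation (6):
v = np.zeros(3)
for _ in range(1000):
    v = R + gamma * P @ v

print(v_exact)   # value of each state
print(v)         # converges to the same values
```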
2.2. Markov Decision-Making Process

In addition to the MRP, the RL problem also involves the individual's behavior choice. Once the behavior choice is included in the MRP, the Markov decision-making process is obtained. The Markov decision-making process can be described using the quintuple ⟨S, A, P, R, γ⟩, where its finite behavior set A, finite state set S and attenuation factor γ are identical with those of the MRP. Different from the MRP, the reward R and state transition probability matrix P in the Markov decision-making process depend on behaviors. When the action $A_t = a$ is chosen at state $S_t = s$, the reward $R_s^a$ at that moment can be expressed by the expected reward $R_{t+1}$ at time t + 1, and it is mathematically described in Equation (7):

$$R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \quad (7)$$

The mathematical description of the state transition probability $P_{ss'}^a$ from the state $S_t = s$ into the subsequent state $S_{t+1} = s'$ is shown in Equation (8):
$$P_{ss'}^a = P\left(S_{t+1} = s' \mid S_t = s, A_t = a\right) \quad (8)$$
In the Markov decision-making process, an individual chooses one action from the finite behavior set according to its own recognition of the present state; the basis for choosing such a behavior is called a strategy, which is a mapping from states to actions. The strategy is usually expressed by the symbol π, referring to a distribution over the behavior set under the given state s, namely:
$$\pi(a \mid s) = P[A_t = a \mid S_t = s] \quad (9)$$
The state value function $v_\pi(s)$ describes the value obtained when the state s follows a specific strategy π in the Markov decision-making process, and it is defined by Equation (10):

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] \quad (10)$$

The behavior value function describes the expected gain that can be obtained by executing an action a under the present state s when a specific strategy π is followed. As a behavior value is always attached to a state, the behavior value function is also called the state–behavior pair value function. The definition of the behavior value function $q_\pi(s, a)$ is shown in Equation (11):

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \quad (11)$$

By substituting Equation (3) into Equations (10) and (11), respectively, two Bellman expectation Equations (12) and (13) can be acquired:
$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \quad (12)$$
$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \quad (13)$$

In the Markov decision-making process, a behavior serves as the bridge of state transition, and the behavior value is closely related to the state value. To be more specific, the state value can be expressed by all behavior values under this state, and the behavior value can be expressed by all state values that the behavior can reach, as shown in Equations (14) and (15), respectively:

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a) \quad (14)$$

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s') \quad (15)$$
The following can be obtained by substituting Equation (15) into Equation (14):

$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_\pi(s')\right) \quad (16)$$
Equation (14) is substituted into Equation (15) to obtain:
$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a') \quad (17)$$
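The mutual relationship expressed by Equations (14)–(17) can be checked numerically. The sketch below performs iterative policy evaluation on a hypothetical two-state, two-action Markov decision-making process under a uniform random strategy, alternating Equation (15) (q from v) and Equation (14) (v from q); all transition probabilities and rewards are invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; the uniform random strategy pi(a|s) = 0.5.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))               # P[s, a, s'] = P_{ss'}^a
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.3, 0.7]
R = np.array([[1.0, 0.0],                   # R[s, a] = R_s^a
              [0.5, 2.0]])
pi = np.full((n_s, n_a), 0.5)

# Alternate Equations (15) and (14) until v_pi converges.
v = np.zeros(n_s)
for _ in range(500):
    q = R + gamma * np.einsum('sap,p->sa', P, v)   # Equation (15)
    v = np.sum(pi * q, axis=1)                     # Equation (14)

print(v)   # state values v_pi(s)
print(q)   # behavior values q_pi(s, a)
```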
Each strategy corresponds to a state value function, and the optimal strategy naturally corresponds to the optimal state value function. The optimal state value function $v^*(s)$ is defined as the maximum state value function among all strategies, and the optimal behavior value function $q^*(s, a)$ is defined as the maximum behavior value function among all strategies, as shown in Equations (18) and (19), respectively:
$$v^*(s) = \max_{\pi} v_\pi(s) \quad (18)$$

$$q^*(s, a) = \max_{\pi} q_\pi(s, a) \quad (19)$$

From Equations (16) and (17), the optimal state value function and optimal behavior value function can be obtained, as shown in Equations (20) and (21), respectively:
$$v^*(s) = \max_{a}\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v^*(s')\right) \quad (20)$$
$$q^*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q^*(s', a') \quad (21)$$

If the optimal behavior value function is known, the optimal strategy can be acquired by directly maximizing the optimal behavior value function $q^*(s, a)$, as shown in Equation (22):

$$\pi^*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in A}{\arg\max}\, q^*(s, a) \\ 0 & \text{otherwise} \end{cases} \quad (22)$$
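In the same spirit, value iteration applies Equations (20) and (21) repeatedly and then extracts the optimal strategy via Equation (22). The sketch below reuses the hypothetical MDP from the previous example; it is an illustration, not part of the paper's experiments.

```python
import numpy as np

# Value iteration on the hypothetical 2-state, 2-action MDP from above.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.3, 0.7]
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

v = np.zeros(n_s)
for _ in range(500):
    q = R + gamma * np.einsum('sap,p->sa', P, v)   # Equation (21), with v* plugged in
    v = q.max(axis=1)                              # Equation (20)

# Equation (22): the optimal strategy is greedy with respect to q*.
optimal_policy = q.argmax(axis=1)
print(v, optimal_policy)
```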
3. Self-Adaptive Reinforcement-Exploration Q-Learning Algorithm

3.1. Q-Learning Algorithm

Compared with the value-function-based TD reinforcement learning method and the state–action–reward–state–action (SARSA) algorithm, the iteration of the Q-Learning algorithm is a trial-and-error process. One of the conditions for convergence is to try every possible state–action pair many times and finally learn the optimal control strategy. At the same time, the Q-Learning algorithm is an effective reinforcement learning algorithm under unknown environments: it does not need to establish an environment model, and it offers guaranteed convergence, few parameters and strong exploratory ability. Especially in the field of path planning, which focuses on obtaining optimal results, the Q-Learning algorithm has clear advantages, and it is a current research hotspot [19].

The Q in the Q-Learning algorithm is the value Q(s, a) of a state–behavior pair, representing the expected return when the Agent executes the action a (a ∈ A) under the state s (s ∈ S). The environment feeds back a corresponding instant reward value r according to the behavior a of the Agent. Therefore, the core idea of this algorithm is to use a Q-Table to store all Q values. In the continuous interaction between the Agent and the environment, the expected value (evaluation) is updated through the actual reward r obtained by executing the action a under the state s, and in the end, the action yielding the maximum return according to the Q-Table is chosen.
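The following is a minimal tabular sketch of the ideas described above: a Q-Table stored as a NumPy array, ε-greedy action selection, and a one-step update of the stored value toward the reward plus the maximum Q value of the next state (the formal update rule is given as Equation (24) below). The grid size, ε, α and γ values are illustrative assumptions, not the experimental settings of the paper.

```python
import numpy as np

# Minimal tabular Q-Learning skeleton; all parameter values are illustrative.
n_states, n_actions = 25, 4            # e.g. a small grid map with four moves
Q = np.zeros((n_states, n_actions))    # the Q-Table storing Q(s, a)
epsilon, alpha, gamma = 0.1, 0.5, 0.9
rng = np.random.default_rng()

def choose_action(state):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Move Q(s, a) toward r + gamma * max_a' Q(s', a').
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```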
Reference strategy (i.e., off-policy) TD learning means updating the behavior value with the TD method under the reference strategy condition. The updating method of reference strategy TD learning is shown in Equation (23):

$$V(S_t) \leftarrow V(S_t) + \alpha\left(\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\left(R_{t+1} + \gamma V(S_{t+1})\right) - V(S_t)\right) \quad (23)$$
In Equation (23), $V(S_t)$ represents the value assessment of state $S_t$ at time t, $V(S_{t+1})$ is the value assessment of state $S_{t+1}$ at time t + 1, α is the learning rate, γ is the attenuation factor, $R_{t+1}$ denotes the instant return value at time t + 1, $\pi(A_t \mid S_t)$ is the probability that the target strategy $\pi(a \mid s)$ executes the action $A_t$ under the state $S_t$, and $\mu(A_t \mid S_t)$ is the probability of executing the action $A_t$ under the state $S_t$ according to the behavior strategy $\mu(a \mid s)$. In general, the chosen target strategy $\pi(a \mid s)$ differs to a certain extent from the behavior strategy. If $\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} < 1$, the probability of the reference strategy choosing the action $A_t$ is smaller than that of the behavior strategy, and the updating amplitude is relatively conservative. When this ratio is greater than 1, the probability of the reference strategy choosing the action $A_t$ is greater than that of the behavior strategy, and the updating amplitude is relatively bold. Replacing the behavior strategy µ of the reference strategy TD learning method with the ε-greedy strategy based on the value function Q(s, a), and replacing the target strategy π with the completely greedy strategy based on the value function Q(s, a), forms the Q-Learning
algorithm. Q-Learning updates $Q(S_t, A_t)$ with the temporal-difference method, and the updating method is expressed by Equation (24):