13
Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem
Francisco C. de Lima Junior1, Adrião D. Doria Neto2 and Jorge Dantas de Melo3 University of State of Rio Grande do Norte, Federal University of Rio Grande do Norte Natal – RN, Brazil
1. Introduction Techniques of optimization known as metaheuristics have achieved success in the resolution of many problems classified as NP-Hard. These methods use non deterministic approaches that reach very good solutions which, however, don’t guarantee the determination of the global optimum. Beyond the inherent difficulties related to the complexity that characterizes the optimization problems, the metaheuristics still face the dilemma of exploration/exploitation, which consists of choosing between a greedy search and a wider exploration of the solution space. A way to guide such algorithms during the searching of better solutions is supplying them with more knowledge of the problem through the use of a intelligent agent, able to recognize promising regions and also identify when they should diversify the direction of the search. This way, this work proposes the use of Reinforcement Learning technique – Q-learning Algorithm (Sutton & Barto, 1998) - as exploration/exploitation strategy for the metaheuristics GRASP (Greedy Randomized Adaptive Search Procedure) (Feo & Resende, 1995) and Genetic Algorithm (R. Haupt & S. E. Haupt, 2004). The GRASP metaheuristic uses Q-learning instead of the traditional greedy- random algorithm in the construction phase. This replacement has the purpose of improving the quality of the initial solutions that are used in the local search phase of the GRASP, and also provides for the metaheuristic an adaptive memory mechanism that allows the reuse of good previous decisions and also avoids the repetition of bad decisions. In the Genetic Algorithm, the Q-learning algorithm was used to generate an initial population of high fitness, and after a determined number of generations, where the rate of diversity of the population is less than a certain limit L, it also was applied to supply one of the parents to be used in the genetic crossover operator. Another significant change in the hybrid genetic algorithm is an interactive/cooperative process, where the Q-learning algorithm receives an additional update in the matrix of Q-values based on the current best solution of the Genetic Algorithm. The computational experiments presented in this work compares the results obtained with the implementation of traditional versions of GRASP metaheuristic and Genetic Algorithm, with those obtained using the proposed hybrid
www.intechopen.com 214 Traveling Salesman Problem, Theory and Applications methods. Both algorithms had been applied successfully to the symmetrical Traveling Salesman Problem, which was modeled as a Markov decision process.
2. Theoretical foundation
2.1 GRASP metaheuristic The Metaheuristic Greedy Randomized Adaptive Search Procedure - GRASP (Feo & Resende, 1995), is a multi-start iterative process, where each iteration consists of two phases: constructive and local search. The constructive phase builds a feasible solution, whose neighbourhood is investigated until a local minimum is found during the local search phase. The best overall solution is kept as the final solution. Each iteration of the construction phase let the set of candidate elements be formed by all elements that can be incorporated in to the partial solution under construction without destroying its feasibility. The selection of the next element for incorporation is determined by the evaluation of all candidate elements according to a greedy evaluation function g(c), where c is a candidate element to compose the solution. The evaluation of the elements by function g(c) leads to the creation of a restricted candidate list - RCL formed by the best elements, i.e., those whose incorporation to the current partial solution results in the smallest incremental cost (this is the greedy aspect of the algorithm in the minimization case). Once the selected element is incorporated into the partial solution, the candidate list is updated and the incremental costs are revaluated (this is the adaptive aspect of the heuristic). The probabilistic aspect of GRASP is given by the random choice of one of the elements of the RCL, not necessarily the best, except in the case of the RCL, which has unitary size. In this case, the selection criterion is reduced to the greedy criterion. The improvement phase typically consists of a local search procedure aimed at improving the solution obtained in the constructive phase, since this solution may not represent a global optimum. In GRASP metaheuristics, it is always important to use a local search to improve the solutions obtained in the constructive phase. The local search phase works in an iterative way, replacing successively the current solution by a better one from its neighbourhood. Thus, the success of the local search algorithm depends on the quality of the neighbourhood chosen (Feo & Resende, 1995). This dependence can be considered a disadvantage of the GRASP metaheuristic. The GRASP metaheuristic presents advantages and disadvantages: Advantages: • Simple implementation: greedy algorithm and local search; • Some parameters require adjustments: restrictions on the candidate list and the number of iterations. Disadvantages: • It depends on good initial solutions: since it is based only on randomization between iterations, each iteration benefits from the quality of the initial solution; • It does not use memory of the information collected during the search. This work explores the disadvantages of the GRASP metaheuristic replacing the random- greedy algorithm of the constructive phase by constructive algorithms that use the Q- learning algorithm as an exploitation/exploration strategy aimed at building better initial solutions. The pseudo code of the traditional GRASP algorithm is presented in the Fig. 1,
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 215 where D corresponds to the distance matrix for the TSP instances, αg1 is the control parameter of the restricted candidate list - RCL, and Nm the maximal number of interactions.
Fig. 1. Traditional GRASP Algorithm
In the GRASP metaheuristic, the restricted candidate list, specifically the αg parameter, is practically the only parameter to be adjusted. The effect of the choice of αg value, in terms of quality solution and diversity during the construction phase of the GRASP, is discussed by Feo and Resende (Feo & Resende, 1995). Prais and Ribeiro (Prais & Ribeiro, 1998) proposed a new procedure called reactive GRASP, for which the αg parameter of the restricted candidate list is self-adjusted according to the quality of the solution previously found. In the reactive GRASP algorithm, the αg value is randomly selected from a discreet set containing m predetermined feasible values:
1 m Ψ={ααgg,..., } (1)
The use of different αg values at different iterations allows the building of different Restrict Candidate Lists - RCL, possibly allowing the generation of distinct solutions that would not be built through the use of a single fixed αg value. The choice of αg in the set Ψ is made using a probability distribution pi, i=1,...,m, which is periodically updated by the so-called absolute qualification rule (Prais & Ribeiro, 1998), which is based on the average value of the solutions obtained with each αg value and is computed as follows:
((FS* ))δ qi = (2) Ai for all i=1,...,m, where F(S*) is the value of the overall best solution already found and δ is used to explore the updated values of probability pi. Using qi, the probability distribution is given by:
1 The index g is used here to differ from the αg parameter used in the Q-learning algorithm in this paper.
www.intechopen.com 216 Traveling Salesman Problem, Theory and Applications
m (3) pii= qq∑ j j=1 In this paper, the reactive GRASP is presented as a more robust version of the traditional GRASP algorithm, and its performance will be compared with the new method proposed.
2.2 Genetic algorithm Genetic algorithms (GA) are based on a biological metaphor: they see the resolution of a problem as a competition among a population of evolving candidate problem solutions. A “fitness” function evaluates each solution to decide whether it will contribute to the next generation of solutions. Then, through analogous operations to gene transfer in sexual reproduction, the algorithm creates a new population of candidate solutions. At the beginning of a run of a genetic algorithm, a large population of random chromosomes is created. Each one, when decoded, will represent a different solution to the problem at hand. Some of the advantages of a GA (R. Haupt & S. E. Haupt, 2004) include the following: • It optimizes with continuous or discrete variables; • Does not require derivative information; • Simultaneously searches through wide sampling of the cost surface; • Deals with a large number of variables and is well suited for parallel computers; • Optimizes variables with extremely complex cost surfaces (they can jump out of a local minimum); • Provides a list of optimum variables, not just a single solution; • May encode the variables so that the optimization is done with the encoded variables; and • Works with numerically generated data, experimental data, or analytical functions. A typical genetic algorithm presents the following operators: • Crossover - this operator is used to vary the population of the chromosomes from one generation to the next. It selects the two fittest chromosomes in the population and produces a number of offspring. There are several crossover techniques; in this work, we will use the one-point crossover. • Selection - this operator replicates the most successful solutions found in a population at a rate proportional to their relative quality, which is determined by the fitness function. In this paper we will use the roulette wheel selection technique for this operator. • Mutation - used to maintain genetic diversity from one generation of a population of chromosomes to the next. It selects a position of an offspring chromosome with small probability and changes the value based on the problem model; for example, in this work, the mutation operator modifies the offspring chromosome simply by changing the position of a gene. The pseudo code of the standard genetic algorithm is summarized in the Fig. 2, where Tc is the crossover rate or parameter that determines the rate at which the crossover operator is applied, Tm is the equivalent for the mutation rate, Tp is the population size (number of chromosomes) and MaxG the number of generations used in the experiment.
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 217
Fig. 2. Traditional Genetic Algorithm
2.3 Reinforcement learning The reinforcement learning is characterized by the existence of an agent which should learn the behaviours through trial-and-error interactions in a dynamic environment. In the interaction process the learner agent, at each discrete time step t receives from the environment some denominated representation state. Starting from state st∈S (S is the set of possible states), the learner agent chooses an action at∈A (A is the set of actions available in state st). When the learner agent chooses an action, receives a numerical reward rt+1, which represent the quality of the action selected and the environment response to that action presenting st+1, a new state of the environment. The agent's goal is to maximize the total reward it receives along the process, so the agent has to exploit not only what it already knows in order to obtain the reward, but also explore the environment in order to select better action in the future (Sutton & Barto, 1998). There are some classes of methods for solving the reinforcement learning problem, such as: Dynamic Programming, Monte Carlo methods and Temporal-difference learning. Each of these methods presents its characteristics. Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment. Monte Carlo methods require no model and are very simple conceptually, but are not suited for step-by-step incremental computation. Temporal-difference methods do not require a complete model and are fully incremental, but it's more complex to mathematical analysis. Based in the characteristics and need of the problem in hand, this work will use a well-know Temporal-difference technique denominated Q-learning algorithm, which will be presented in details in the next section.
2.3.1 The Q-learning algorithm Not all reinforcement learning algorithms need a complete modelling of the environment. Some of them do not need to have all the transition probabilities and expected return values for all the transitions of the possible environmental states. This is the case of the reinforcement learning techniques based on temporal differences (Sutton & Barto, 1998). One of these techniques is the well-known Q-learning algorithm (Watkins, 1989), considered
www.intechopen.com 218 Traveling Salesman Problem, Theory and Applications one of the most important breakthroughs in reinforcement learning, since its convergence for optimum Q-values does not depend on the policy applied. The updated expression of the Q value in the Q-learning algorithm is:
Qsa(,)= (1−+αqq ) Qsa (,)αγ [ r+ max(',)] Qsa (4) aA∈ where s is the current state, a is the action carried out in the state s, r is the immediate reward received for executing a in s, s’ is the new state, γ is a discount factor (0 ≤ γ ≥ 1), and αq, (0 < αq < 1) is the learning factor. An important characteristic of this algorithm is that the choice of actions that will be executed during the iterative process of function Q can be made through any exploration/exploitation criterion, even in a random way. A widely used technique for such a choice is the so-called ε-greedy exploration, which consists of an agent to execute the action with the highest Q value with probability 1-ε, and choose a random action with probability ε. The Q-learning was the first reinforcement learning method to provide strong proof of convergence. Watkins showed that if each pair state-action is visited an infinite number of times and with an adjusted value, the value function Q will converge with probability 1 to Q*. The pseudo code of the Q-learning algorithm is presented in the Fig. 3, where, r is the reward matrix, ε is the parameter of the ε-greedy police.
Fig. 3. Q-learning Algorithm
3. Proposed hybrid methods Metaheuristics are approximate methods that depend on good exploration/exploitation strategies based on previous knowledge of the problem and are can guide the search for an optimum solution in order avoiding local minimum. Good strategies (or good heuristics) alternate adequately between exploration and exploitation. In other words they maintain the balance between these two processes during the search for an optimum solution. The methods presented here are hybrid since they use a well-known reinforcement learning method - Q-learning algorithm - as an exploration/exploitation mechanism for GRASP and genetic algorithm metaheuristics (Lima, F. C., et al., 2007).
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 219
The Fig. 4, presents a framework of the proposed hybrid methods:
Fig. 4. Framework of the proposed hybrid methods. The proposed methods in this section are applied to solve the symmetric traveling salesman problem - TSP and are modelled as a Reinforced Learning Problem.
3.1 The traveling salesman problem modeled as a sequential decision problem The traveling salesman problem can be described as a sequential decision process, represented by the quintuplet M= {T, S, A, R, P}, in at least two different ways. For example, consider a model where the set of states S is the set of all possible solutions for TSP (Ramos et al, 2003). The model in this study has a high-dimensional inconvenience S, since TSP has a higher number of possible solutions in the symmetric case (n-1)!/2. An alternative to model TSP as a sequential decision problem is to consider that S is formed by all the cities to solve TSP. In this new model, the cardinality of S is equal to the instance size of the problem. This lowers the risk of S suffering the “curse of dimensionality”. In this study TSP is modelled as a sequential decision problem based on this second alternative. To better understand the proposed model, consider an example of TSP with 5 cities shown in graph G(V, E, W) of Fig. 5. V is the set of vertices, E is the set of arcs between the vertices and W is the weight associated to each arc. In graph G(V,E,W), a12 corresponds to visiting the “city” s2 derived from s1, and the values associated to each arc (number between parentheses) corresponds to the distance between the “cities”. Considering graph G(V,E,W), the quintuplet M = {T, S, A, R, P}, representing a sequential decision process, can be defined by TSP as follows: • T: The set of decision instants is denoted by T = {1, 2, 3, 4, 5}, where the cardinality of T corresponds to the number of cities that compose a route to TSP. • S: The set of states is represented by S = {s1, s2, s3, s4, s5}, where each state si, i=1,...,5 corresponds to a city2 at a route to TSP. • A: The set of possible actions is denoted by:
2 From this point onwards the expression “city” or “cities” indicates the association between a city of TSP and a state (or states) from the environment.
www.intechopen.com 220 Traveling Salesman Problem, Theory and Applications
AAs() As () As () As () As () = 12345∪∪∪∪
Aaaaaaaaaaaaaaaaaaaaa=∪∪∪∪{12 , 13 , 14 , 15 }{ 21 , 23 , 24 , 25 }{ 31 , 32 , 34 , 35 }{ 41 , 42 , 43 , 45 }{ 51 , 52 , 53 , 54 }
A{,,,,,,,,,,,,,,,,,,,} aaaaaaaaaaaaaaaaaaaa = 12 13 14 15 21 23 24 25 31 32 34 35 41 42 43 45 51 52 53 54 (5) It is important to emphasize that owing to the restriction of TSP, some actions may not be available when constructing a solution. To understand it more clearly, consider the following partial solution for TSP:
Solp : s23→→ s s 5 (6) where the decision process is in the decision instant 3 and state s5. In this case the actions available are A(s5)={a51, a54}, since the “cities” s2 (action choice a52) and s3 (action choice a53) are not permitted to avoid repetitions in the route. • R: S × A → ℜ: Expected Return. In TSP, elements rij are calculated using the distance between the “cities” si and sj. The return should be a reward that encourages the choice of “city” sj closest to si. Since TSP is a minimization problem, a trivial method to calculate the reward is to consider it inversely proportional to the traveling cost between the cities. That is:
1 rij = (7) dij where dij > 0 is a real number that represents the distance from “city” si to “city” sj. Using the weights from the graph in Fig. 5, R is represented by:
⎡ 014161713⎤ ⎢14 0 13 19 110⎥ ⎢ ⎥ R = ⎢16 13 0 12 18⎥ (8) ⎢ ⎥ ⎢17 19 12 0 15⎥ ⎢ ⎥ ⎣131101815 0⎦ • P: S × A × S → [1, 0]: The function that defines the probability transition between states s∈S, where the elements pij(sj|si, aij) correspond to the probability of reaching “city” sj when in “city” si and choosing action aij. The values of pij are denoted by:
⎧1 if snotvisitedj ⎪ (9) pssaij(|, j i ij )= ⎨ ⎩⎪0 otherwise The quintuplet M = {T, S, A, R, P} defined in the graph of the Fig. 5. can be defined with no loss of generality for a generic graph G(V, E, W), (with |V|=N) where a Hamiltonian cycle of cardinality N is found. An optimum policy for TSP, given generic graph G(V, E, W), should determine the sequence of visits to each “city” si, i=1, 2,...,N to achieve the best possible sum of returns. In this case, the problem has a finite-time horizon, sets of states, and discrete and finite actions.
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 221
Fig. 5. Complete Graph, instance of the TSP with 5 cities.
3.2 The Markov property In a sequential decision process, the environmental response in view of the action choice at any time t+1 depends on the history of events at a time before t+1. This means that the dynamic of environmental behaviour will be defined by complete probability distribution (Sutton & Barto, 1998):
a (10) Pssss'11===Pr{ t++ ',rr t |s t ,a t ,r t ,sars t −−− 1 , t1 , t 11 ,… , ,a1 ,r1 ,s0 ,a0 } where Pr represents the probability of st+1 being s’, for all s, a, r, states, actions and past reinforcements st, at, rt, ...., r1, s0, a0. However, if the sequential process obeys the Markov property, the environmental response at t+1 only depends on information for state s and action a available at t. In other words the environment dynamic is specified by:
a (11) Pssss'11===Pr{ t++ ',rr t |s t ,a t } which qualifies it as a Markov decision process – MDP (Puterman, 1994). In the traveling salesman problem, an action is chosen (deciding which city to place on the route under construction) from state st (the actual city in the route under construction) based on the instance between this city and the others. It is important to note that, to avoid violating the restriction imposed by TSP (not to repeat cities on the route), actions that would lead to states already visited should not be available in state st. In this context, TSP is a Markov decision process since all the information necessary to make the decision at t is available at state st. A list of actions not chosen during route construction could be included in state st to provide information on the actions available in the state. The Markov property is fundamentally important in characterizing the traveling salesman problem as a reinforcement learning task, since the convergence of methods used in this study depends on its verification in the proposed model. The restriction imposed on TSP creates an interesting characteristic in the Markov decision process associated to the problem. The fact that the sets of available actions A(s) for state s at each time instant t vary during the learning process, implies changes in the strategies for choosing actions in any state st. This means that modifications will occur in the politics used during the learning process. Fig. 6 demonstrates this characteristic.
www.intechopen.com 222 Traveling Salesman Problem, Theory and Applications
Fig. 6. Changes in the number of actions available during the decision process. To understand more clearly, consider the following partial solution for TSP: The chart in Fig. 6, can be interpreted as follows: A route to TSP is constructed in each episode of the learning process. During construction of a route to TSP, at each decision instant t, a state is visited s∈S and an action choice a∈A is made. Since the set A(s) of actions available for state s varies along time, in an episode i the set of actions available for state st at instant t can be different of the actions available in the same state s at time instant t+k and episode j. This confirms that the Markov decision process associated to the traveling salesman problem can be modelled has a non-stationary policy.
3.3 The Q-learning algorithm implemented Since the methods presented in this section use Q-learning, this section presents details such as: action selection policies and convergence of the algorithm when applied to the proposed problem.
3.3.1 Action selection policies for the Q-learning algorithm When solving a reinforcement learning Problem, an action selection policy aims to determine the behaviour of the agent so that it alternates adequately between using obtained knowledge and acquiring new knowledge. This optimizes the exploration/exploitation process of the search space. More than one action selection policy is used for Q-learning to determine which policy is most appropriate to implement the proposed hybrid methods. • The ε-greedy policy chooses the action with the highest expected value, compatibility defined by (1-ε) and random action with a probability of ε. Mathematically, given the matrix of Q-values Q(s, a), the greedy action a* is obtained for state s as follows:
aQ∗ = max (s ,a ) aAs∈ ()
∗ ε πε(,sa )=−+ 1 (12) As() ε π(,)sa=∀ a∈− As () a∗ As() {}
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 223
where |A(s)| corresponds to the number of possible actions from s, and ε is the control parameter between greed and randomness. The restriction in Eq. 12 allows Q-learning to explore the state space of the problem and is needed to find the optimum control policy. • The Adaptive ε-greedy Policy is similar to the ε-greedy policy described. In other words, it allows the action with the highest expected value to be chosen, with probability defined by (1 - ε) and a random action with probability ε. It is different and also “Adaptive”, since the value of ε suffers exponential decay calculated by:
k (13) ε = max{vvbif , . }
where k is the episode counter of Q-learning, b is the closest value of 1 and vi < vf ∈ [0,1]. The algorithm initially uses high values for ε (close to vf) and, as the value of k increases, choice ε is directed to lower values (closer to vi). The objective is to allow more random choices to be made and to explore more the greedy aspect as the number of episodes increases. • Based on Visitor Count Policy chooses the action based on a technique called Reinforcement Comparison (Sutton & Barto, 1998). In this technique actions are chosen based on the principle that actions followed by large rewards are preferred to those followed by small rewards. To define a ``large reward'', a comparison is made with a standard reward known as a reference reward. The objective of the proposed policy is similar to the reinforcement comparison technique. However, choosing preferred actions is based on the visitor count to the states affected by these actions. That is, the most visited states indicate preferred actions in detriment to actions that lead to states with a lower number of visits. When the preference measurement for actions is known, the probability of action selection can be determined as follows:
pa() e t−1 aPaa (14) πtrt()==={} p ()b ∑e t−1 b
where πt(a) denotes the probability of selecting action a at step t and pt(a) is the preference for action a at time t. This is calculated by:
(15) patt+1()= pa tt ()+ β ( NsaN vt (,)/ Ep )
where s is the state affected by selecting action a at step t, Nv(s, at) is the number of visits to state s, NEp is the total number of episodes and β∈[0,1] is the control parameter that considers the influence level of preferred actions. An experiment was carried out to compare the policies tested using 10 instances of TSP available in TSPLIB library (TSPLIB, 2010) and the results for the three policies tested simultaneously considering the value of the function and processing time, the adaptive ε- greedy policy had the best performance. This policy will be used in the version of the Q- learning algorithm implemented in this work.
www.intechopen.com 224 Traveling Salesman Problem, Theory and Applications
3.4 The GRASP learning method In GRASP metaheuristics the exploration and exploitation processes occur at different moments. The constructive phase explores the space of viable solutions and the local search improves the solution constructed in the initial phase of exploiting the area. Despite the clear delimitation of roles, the two phases of GRASP metaheuristics work in collaboration. The good performance of local search algorithm varies with the neighbourhood chosen and depends substantially on the quality of the initial solution. (Feo & Resende, 1995). This section presents a hybrid method using Q-learning algorithm in the constructive phase of GRASP metaheuristics. In traditional GRASP metaheuristics iteration is independent, in other words, actual iteration does not use information obtained in past iterations (Fleurent & Glover, 1999). The basic idea of this proposed method is to use the information from the Q- values matrix as a form of memory that enables good decisions to be made in previous iterations and avoids uninteresting ones. This facilitates the exploration and exploitation process. Based on the results of using an isolated Q-learning algorithm to solve the small instances of TSP, the approach proposed may significantly improve metaheuristic performance in locating the global optimum. In the GRASP-Learning method, the Q-learning algorithm is used to construct initial solutions for GRASP metaheuristic. A good quality viable solution is constructed at each iteration of the algorithm using information from the Q-values matrix. The Q-values matrix can then be used at each GRASP iteration as a form of adaptive memory to allow the past experience to be used in the future. The use of adaptive word means that at each iteration new information is inserted by using the matrix Q. This influences the constructive phase of the next iteration. The Q-learning algorithm will therefore be implemented as a randomized greedy algorithm. The control between “greed” (Exploration) and “randomness” (Exploitation) is achieved using the parameter ε of the transition rule defined in equation 13. The reward matrix is determined as follows:
1 Rsa(,)=∗ Nv (,) sa (16) dij where 1/dij is the inverse distance between the two cities ci and cj (city ci is represented by state s and city cj by the state accessed by choosing action a), and Nv(s,a) is the number of visits to the current state. The procedure of the Q-learning algorithm proposed for the GRASP constructive phase applied to the symmetric TSP is: A table of state-action values Q (Q-values matrix) initially receives a zero value for all items. An index of table Q is randomly selected to start the updating process and the index then becomes state s0 for Q-learning. State s1 can be obtained from state s0 using the ε-greedy adaptive transition rule of the following way: • Randomly, selecting a new city of the route, or • Using the maximum argument Q in relation to the previous state s0. Given states s0 and s1, the iteration that updates table Q begins using the equation 4. The possibility of selecting a lower argument state through randomness ensures the method ability to explore other search areas. Table Q is obtained after a maximum number of
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 225 episodes. The route to TSP is taken from this matrix. Constructing a solution for TSP after obtaining matrix Q is achieved as follows: • Copy the data of matrix Q to auxiliary matrix Q1; • Select an index l that shows an initial line of table Q1 (corresponding to the city at the start of the route) • Starting from line l choose the line's greatest value, index c of this value is the column of the next city on the route. • Attribute a null value to all the values of column c at Q1, to ensure there is no repetition of cities on the route, repeating the basic restriction of TSP; • Continue the process while the route is incomplete. At each GRASP iteration the matrix Q is updated by executing the Q-learning algorithm. In other words, at the end of Nmax iterations Q-learning algorithm will have executed Nmax*NEp episodes, where NEp denotes the number of episodes executed for each iteration. Updating the Q-values matrix at each GRASP iteration improves the information quality collected throughout the search process. Fig. 7, shows a overview of the GRASP-Learning method.
Fig. 7. Framework of the GRASP-Learning Method. The local search phase in GRASP-Learning suffers no modifications. That is, it uses the same traditional GRASP algorithm that was implemented in this study as a descending method 2-Opt. This local search technique only moves to a neighbouring solution if it improves in objective function. The Fig. 8, presents a simplified form of the pseudo code method GRASP-learning proposed. When examining the pseudo code GRASP-Learning algorithm, significant modifications are seen in lines 3 and 5. Function MakeReward in line 3 uses the distance matrix D to create a reward matrix, based on Equation 16. In line 5 the function Qlearning returns a viable solution S for TSP and the updated Q-values matrix Q, which is used in the next iteration. The other aspects of GRASP-learning algorithm are identical to traditional GRASP metaheuristics.
www.intechopen.com 226 Traveling Salesman Problem, Theory and Applications
Fig. 8. Pseudo code of GRASP-Learning Algorithm.
3.5 The cooperative genetic-learning algorithm The hybrid genetic algorithm proposed in this section uses reinforcement learning to construct better solutions to the symmetric traveling salesman problem. The method main focus is the use the Q-learning algorithm to assist the genetic operators with the difficult task of achieving balance between exploring and exploiting the problem solution space. The new method suggests using the Q-learning algorithm to create an initial population of high fitness with a good diversity rate that also cooperates with the genetic operators. In the learning process, the Q-learning algorithm considers using either the knowledge already obtained or choosing new unexplored search spaces. In other words, it has the characteristics of a randomized greedy algorithm. It is greedy owing to its use of the maximum value of Q(s, a) to choose action a, contributing to a greater return in state s, and randomized since it uses an action choice policy that allows for occasional visits to the other states in the environment. This behaviour allows Q-learning to significantly improve the traditional genetic algorithm. As mentioned previously, the initial population of the proposed genetic algorithm is created by Q-learning algorithm. This is achieved by executing Q-learning with a determinate number of episodes (NEp). Tp chromosomes are then created using the matrix Q (Q-values matrix), where Tp is the population size. The choice criteria in the creation of each chromosome is the value of Q(s, a) associated to each gene so that the genes with the highest value of Q(s, a) are predominantly chosen in chromosome composition3. In addition to creating the initial population, the Q-learning algorithm cooperates with the genetic operators as follows: • In each new generation, the best solution S* obtained by the genetic operators is used to update the matrix of Q-values Q produced by the Q-learning algorithm in previous generations. • After updating matrix Q, it is used in subsequent generations by the genetic algorithm. This iteration process rewards the state-action pairs that compose the
3 A chromosome is an individual in the population (for example, a route to TSP), whereas a gene is each part that composes a chromosome (in TSP a city composes a route)
www.intechopen.com Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem 227
best solutions from the last generations. They receive an additional increment that identifies them as good decisions. Calculating the incremental value is based on the following theory: • According to the model of TSP described in section 3.1, a solution for the model can be denoted by Fig. 9.
Fig. 9. TSP solution, modelled as a reinforcement learning Problem. In the Fig. 9, the sequence of states, actions and rewards corresponds to a solution S constructed during an episode of Q-learning algorithm, where si, i=0,…,n is the current state and rj, j=1,…,n is the immediate reward received by executing change si, for si+1. In this scenario, with solution S*, the accumulated reward value Ri is easily calculated:
Qs(,)01 s= r 1++++ r 2 r 3… rnn− 1 += r R 0
Qs(,)12 s=+++ r 2 r 3… rnn− 1 += r R 1