Hybrid Metaheuristics Using Reinforcement Learning Applied to Salesman Traveling Problem
Francisco C. de Lima Junior1, Adrião D. Doria Neto2 and Jorge Dantas de Melo3
University of State of Rio Grande do Norte, Federal University of Rio Grande do Norte
Natal – RN, Brazil

1. Introduction

Optimization techniques known as metaheuristics have achieved success in the resolution of many problems classified as NP-hard. These methods use non-deterministic approaches that reach very good solutions which, however, do not guarantee the determination of the global optimum. Beyond the inherent difficulties related to the complexity of optimization problems, metaheuristics also face the exploration/exploitation dilemma, which consists of choosing between a greedy search and a wider exploration of the solution space. One way to guide such algorithms in the search for better solutions is to supply them with more knowledge of the problem through an intelligent agent able to recognize promising regions and to identify when the direction of the search should be diversified.

Accordingly, this work proposes the use of a reinforcement learning technique, the Q-learning algorithm (Sutton & Barto, 1998), as the exploration/exploitation strategy for the GRASP (Greedy Randomized Adaptive Search Procedure) metaheuristic (Feo & Resende, 1995) and for the Genetic Algorithm (R. Haupt & S. E. Haupt, 2004). The GRASP metaheuristic uses Q-learning instead of the traditional greedy-random algorithm in the construction phase. This replacement aims to improve the quality of the initial solutions used in the local search phase of GRASP, and it also provides the metaheuristic with an adaptive memory mechanism that allows the reuse of good previous decisions and avoids the repetition of bad ones. In the Genetic Algorithm, the Q-learning algorithm is used to generate an initial population of high fitness and, after a given number of generations in which the diversity rate of the population falls below a limit L, it is also applied to supply one of the parents used by the genetic crossover operator. Another significant change in the hybrid Genetic Algorithm is an interactive/cooperative process in which the Q-learning algorithm receives an additional update of its matrix of Q-values based on the current best solution found by the Genetic Algorithm.

The computational experiments presented in this work compare the results obtained with traditional implementations of the GRASP metaheuristic and the Genetic Algorithm with those obtained using the proposed hybrid methods. Both algorithms were applied successfully to the symmetric Traveling Salesman Problem, which was modeled as a Markov decision process.
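Since both hybrids rest on the Q-learning rule, a minimal sketch of it may help before the individual metaheuristics are discussed. The Python fragment below illustrates the standard one-step Q-learning update (Sutton & Barto, 1998) under the TSP-as-MDP view adopted in this chapter, in which states are cities and actions are moves to still-unvisited cities. The class name, the ε-greedy selection, and the negative-distance reward are our illustrative assumptions, not the authors' exact formulation; alpha here is the Q-learning learning rate, distinct from the GRASP αg parameter discussed in Section 2.1.

```python
import random

class QLearningTSP:
    """Sketch of Q-learning on the TSP viewed as a Markov decision process:
    states are cities, actions are moves to still-unvisited cities."""

    def __init__(self, dist, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.dist = dist                      # distance matrix D of the instance
        self.n = len(dist)
        self.alpha = alpha                    # learning rate (not the GRASP alpha_g)
        self.gamma = gamma                    # discount factor
        self.epsilon = epsilon                # exploration probability
        # matrix of Q-values, one entry per (city, next city) pair
        self.Q = [[0.0] * self.n for _ in range(self.n)]

    def choose(self, s, unvisited):
        """Epsilon-greedy choice: explore at random, otherwise exploit Q."""
        if random.random() < self.epsilon:
            return random.choice(sorted(unvisited))
        return max(unvisited, key=lambda a: self.Q[s][a])

    def build_tour(self, start=0):
        """Build one tour while applying the one-step Q-learning update:
        Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        tour = [start]
        unvisited = set(range(self.n)) - {start}
        s = start
        while unvisited:
            a = self.choose(s, unvisited)
            unvisited.discard(a)
            r = -self.dist[s][a]              # shorter edges yield larger rewards
            best_next = max((self.Q[a][j] for j in unvisited), default=0.0)
            self.Q[s][a] += self.alpha * (r + self.gamma * best_next - self.Q[s][a])
            tour.append(a)
            s = a
        return tour
```

Repeated calls to build_tour() refine the matrix of Q-values; in the hybrids described above, the tours produced this way feed the GRASP local search phase or seed the GA population.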
2. Theoretical foundation

2.1 GRASP metaheuristic

The metaheuristic Greedy Randomized Adaptive Search Procedure (GRASP) (Feo & Resende, 1995) is a multi-start iterative process in which each iteration consists of two phases: construction and local search. The constructive phase builds a feasible solution, whose neighbourhood is investigated until a local minimum is found during the local search phase. The best overall solution is kept as the final solution. At each iteration of the construction phase, the set of candidate elements is formed by all elements that can be incorporated into the partial solution under construction without destroying its feasibility.

The selection of the next element for incorporation is determined by the evaluation of all candidate elements according to a greedy evaluation function g(c), where c is a candidate element to compose the solution. The evaluation of the elements by the function g(c) leads to the creation of a restricted candidate list (RCL) formed by the best elements, i.e., those whose incorporation into the current partial solution results in the smallest incremental cost (this is the greedy aspect of the algorithm in the minimization case). Once the selected element is incorporated into the partial solution, the candidate list is updated and the incremental costs are re-evaluated (this is the adaptive aspect of the heuristic). The probabilistic aspect of GRASP is given by the random choice of one of the elements of the RCL, not necessarily the best one, except when the RCL has unitary size; in that case, the selection criterion reduces to the greedy criterion.

The improvement phase typically consists of a local search procedure aimed at improving the solution obtained in the constructive phase, since this solution may not represent a global optimum; in the GRASP metaheuristic it is therefore always important to use a local search to improve the solutions obtained in the constructive phase. The local search phase works in an iterative way, successively replacing the current solution with a better one from its neighbourhood. Thus, the success of the local search algorithm depends on the quality of the chosen neighbourhood (Feo & Resende, 1995). This dependence can be considered a disadvantage of the GRASP metaheuristic.

The GRASP metaheuristic presents advantages and disadvantages:
Advantages:
• Simple implementation: a greedy algorithm plus a local search;
• Few parameters require adjustment: the restriction on the candidate list and the number of iterations.
Disadvantages:
• It depends on good initial solutions: since it relies only on randomization between iterations, each iteration benefits from the quality of the initial solution;
• It does not use a memory of the information collected during the search.

This work addresses these disadvantages of the GRASP metaheuristic by replacing the greedy-random algorithm of the constructive phase with constructive algorithms that use the Q-learning algorithm as an exploitation/exploration strategy, aiming to build better initial solutions. The pseudo code of the traditional GRASP algorithm is presented in Fig. 1, where D corresponds to the distance matrix of the TSP instance, αg¹ is the control parameter of the restricted candidate list (RCL), and Nm is the maximum number of iterations.

Fig. 1. Traditional GRASP Algorithm

In the GRASP metaheuristic, the restricted candidate list, specifically the αg parameter, is practically the only parameter to be adjusted. The effect of the choice of the αg value, in terms of solution quality and diversity during the construction phase of GRASP, is discussed by Feo and Resende (Feo & Resende, 1995). Prais and Ribeiro (Prais & Ribeiro, 1998) proposed a procedure called Reactive GRASP, in which the αg parameter of the restricted candidate list is self-adjusted according to the quality of the solutions previously found. In the Reactive GRASP algorithm, the αg value is randomly selected from a discrete set Ψ containing m predetermined feasible values:

\Psi = \{ \alpha_g^1, \ldots, \alpha_g^m \}   (1)

The use of different αg values at different iterations allows the building of different restricted candidate lists, possibly allowing the generation of distinct solutions that would not be built through the use of a single fixed αg value. The choice of αg in the set Ψ is made using a probability distribution p_i, i = 1, ..., m, which is periodically updated by the so-called absolute qualification rule (Prais & Ribeiro, 1998). This rule is based on the average value A_i of the solutions obtained with each αg value and is computed as follows:

q_i = \left( \frac{F(S^*)}{A_i} \right)^{\delta}   (2)

for all i = 1, ..., m, where F(S*) is the value of the overall best solution found so far and δ is an exponent that controls how strongly these quality differences are reflected in the updated probabilities p_i. Using q_i, the probability distribution is given by:

p_i = q_i \Big/ \sum_{j=1}^{m} q_j   (3)

¹ The index g is used here to distinguish this parameter from the α parameter used in the Q-learning algorithm in this paper.

In this paper, the Reactive GRASP is treated as a more robust version of the traditional GRASP algorithm, and its performance will be compared with that of the newly proposed method.
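The chapter gives no code for this rule, so the following Python fragment is a minimal sketch of how equations (1)-(3) can be implemented for a minimization problem such as the TSP. The class name ReactiveAlpha, the default set Ψ, and the default δ are our illustrative choices, not values taken from the chapter.

```python
import random

class ReactiveAlpha:
    """Sketch of Reactive GRASP alpha selection (Prais & Ribeiro, 1998):
    alpha_g is drawn from a discrete set Psi with probabilities that are
    periodically re-fitted to the quality of the solutions each value produced."""

    def __init__(self, psi=(0.1, 0.2, 0.3, 0.4, 0.5), delta=10.0):
        self.psi = list(psi)                    # the set Psi of eq. (1)
        self.delta = delta                      # exponent of eq. (2)
        self.p = [1.0 / len(psi)] * len(psi)    # start from a uniform distribution
        self.sums = [0.0] * len(psi)            # accumulated solution values per alpha
        self.counts = [0] * len(psi)            # solutions built per alpha

    def select(self):
        """Randomly pick an index i (hence alpha_g^i) according to p_i."""
        return random.choices(range(len(self.psi)), weights=self.p)[0]

    def record(self, i, value):
        """Store the value of a solution built with alpha_g^i."""
        self.sums[i] += value
        self.counts[i] += 1

    def update(self, best_value):
        """Absolute qualification rule: q_i = (F(S*) / A_i)^delta, eq. (2),
        then p_i = q_i / sum_j q_j, eq. (3). Minimization is assumed."""
        q = []
        for i in range(len(self.psi)):
            a_i = self.sums[i] / self.counts[i] if self.counts[i] else best_value
            q.append((best_value / a_i) ** self.delta)
        total = sum(q)
        self.p = [qi / total for qi in q]
```

A Reactive GRASP loop would call select() before each construction, record() with the value of each solution built, and update() every fixed number of iterations. Since F(S*) ≤ A_i in minimization, the αg values that have produced better averages receive proportionally larger probabilities.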
2.2 Genetic algorithm

Genetic algorithms (GA) are based on a biological metaphor: they see the resolution of a problem as a competition among a population of evolving candidate solutions. A "fitness" function evaluates each solution to decide whether it will contribute to the next generation of solutions. Then, through operations analogous to gene transfer in sexual reproduction, the algorithm creates a new population of candidate solutions. At the beginning of a run of a genetic algorithm, a large population of random chromosomes is created; each one, when decoded, represents a different solution to the problem at hand.

Some of the advantages of a GA (R. Haupt & S. E. Haupt, 2004) include the following:
• It optimizes with continuous or discrete variables;
• It does not require derivative information;
• It simultaneously searches through a wide sampling of the cost surface;
• It deals with a large number of variables and is well suited for parallel computers;
• It optimizes variables with extremely complex cost surfaces (it can jump out of a local minimum);
• It provides a list of optimum variables, not just a single solution;
• It may encode the variables so that the optimization is done with the encoded variables; and
• It works with numerically generated data, experimental data, or analytical functions.

A typical genetic algorithm presents the following operators:
• Crossover - this operator is used to recombine the genetic material of two parent solutions into new offspring (one common TSP variant is sketched below);
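The text does not specify at this point which crossover operator the authors adopt. For permutation-encoded TSP tours, order crossover (OX) is one common choice, so the following Python fragment is offered purely as an illustrative sketch of that operator; the function name and the example tours are ours.

```python
import random

def order_crossover(parent_a, parent_b):
    """Illustrative order crossover (OX) for permutation-encoded TSP tours:
    copy a random slice from parent A, then fill the remaining positions
    with the cities of parent B in their original order."""
    n = len(parent_a)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent_a[i:j + 1]              # inherit a contiguous segment of A
    fill = [c for c in parent_b if c not in child]  # B's cities, minus A's segment
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

# Example: crossing two 6-city tours always yields a valid permutation
print(order_crossover([0, 1, 2, 3, 4, 5], [5, 3, 1, 0, 4, 2]))
```

OX preserves the relative order of cities from both parents, which keeps every offspring a feasible tour without any repair step.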