Effective Reinforcement Learning through Evolutionary Surrogate-Assisted Prescription

Olivier Francon (1), Santiago Gonzalez (1,2), Babak Hodjat (1), Elliot Meyerson (1), Risto Miikkulainen (1,2), Xin Qiu (1), and Hormoz Shahrzad (1)
(1) Cognizant Technology Solutions, (2) The University of Texas at Austin

ABSTRACT

There is now significant historical data available on decision making in organizations, consisting of the decision problem, what decisions were made, and how desirable the outcomes were. Using this data, it is possible to learn a surrogate model, and with that model, evolve a decision strategy that optimizes the outcomes. This paper introduces a general such approach, called Evolutionary Surrogate-Assisted Prescription, or ESP. The surrogate is, for example, a random forest or a neural network trained with gradient descent, and the strategy is a neural network that is evolved to maximize the predictions of the surrogate model. ESP is further extended in this paper to sequential decision-making tasks, which makes it possible to evaluate the framework in reinforcement learning (RL) benchmarks. Because the majority of evaluations are done on the surrogate, ESP is more sample efficient, has lower variance, and lower regret than standard RL approaches. Surprisingly, its solutions are also better because both the surrogate and the strategy network regularize the decision-making behavior. ESP thus forms a promising foundation for decision optimization in real-world problems.

Figure 1: Elements of ESP. A Predictor is trained with historical data on how given actions in given contexts led to specific outcomes. The Predictor can be any model trained with supervised methods, such as a random forest or a neural network. The Predictor is then used as a surrogate in order to evolve a Prescriptor, i.e. a neural network implementing a decision policy that results in the best possible outcomes. The majority of evaluations are done on the surrogate, making the process highly sample-efficient and robust, and leading to decision policies that are regularized and therefore generalize well.

KEYWORDS

Reinforcement Learning, Decision Making, Surrogate-Assisted Evolution, Genetic Algorithms, Neural Networks

ACM Reference Format:
Olivier Francon, Santiago Gonzalez, Babak Hodjat, Elliot Meyerson, Risto Miikkulainen, Xin Qiu, and Hormoz Shahrzad. 2020. Effective Reinforcement Learning through Evolutionary Surrogate-Assisted Prescription. In Genetic and Evolutionary Computation Conference (GECCO '20), July 8–12, 2020, Cancun, Mexico. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3377930.3389842

1 INTRODUCTION

Many organizations in business, government, education, and healthcare now collect significant data about their operations. Such data is transforming decision making in organizations: It is now possible to use machine learning techniques to build predictive models of behaviors of customers, consumers, students, and competitors, and, in principle, make better decisions, i.e. those that lead to more desirable outcomes. However, while prediction is necessary, it is only part of the process. Predictive models do not specify what the optimal decisions actually are. To find a good decision strategy, different approaches are needed.

The main challenge is that optimal strategies are not known, so standard gradient-based machine learning approaches cannot be used. The domains are only partially observable, and decision variables and outcomes often interact nonlinearly: For instance, allocating marketing resources to multiple channels may have a nonlinear cumulative effect, or nutrition and exercise may interact to leverage or undermine the effect of medication in treating an illness [9, 32]. Such interactions make it difficult to utilize linear programming and other traditional optimization approaches from operations research.

Instead, good decision strategies need to be found using search, i.e. by generating strategies, evaluating them, and generating new, hopefully better strategies based on the outcomes. In many domains such search cannot be done in the domain itself: For instance, testing an ineffective marketing strategy or medical treatment could be prohibitively costly; evaluating an engineering design through simulation, or a behavioral strategy in game playing, could require a prohibitive amount of computation time. However, given that historical data about past decisions and their outcomes exist, it is possible to do the search using a predictive model as a surrogate to evaluate them. Only once good decision strategies have been found using the surrogate are they tested in the real world.

Even with the surrogate, the problem of finding effective decision strategies is still challenging. Nonlinear interactions may result in deceptive search landscapes, where progress towards good solutions cannot be made through incremental improvement: Discovering them requires large, simultaneous changes to multiple variables. Decision strategies often require balancing multiple objectives, such as performance and cost, and in practice, generating a number of different trade-offs between them is needed. Consequently, search methods such as reinforcement learning (RL), where a solution is gradually improved through local exploration, do not lend themselves well to searching solution strategies either. Further, the number of variables can be very large, e.g. thousands or even millions as in some manufacturing and logistics problems [6], making methods such as Kriging and Bayesian optimization [5, 46] ineffective. Moreover, the solution is not a single point but a strategy, i.e. a function that maps input situations to optimal decisions, exacerbating the scale-up problem further.

Keeping in mind the above challenges, an approach is developed in this paper for Evolutionary Surrogate-Assisted Prescription (ESP; Figure 1), i.e. for discovering effective solution strategies using evolutionary optimization. With a population-based search method, it is possible to navigate deceptive, high-dimensional landscapes, and discover trade-offs across multiple objectives [27]. The strategy is expressed as a neural network, making it possible to use state-of-the-art neuroevolution techniques to optimize it. Evaluations of the neural network candidates are done using a predictive model, trained with historical data on past decisions and their outcomes.

Elements of the ESP approach were already found effective in challenging real-world applications. In an autosegmentation version of Ascend by Evolv, a commercial product for optimizing designs of web pages [28], a neural network was used to map user descriptions to the most effective web-page designs. In CyberAg, effective growth recipes for basil were found through search with a surrogate model trained with outcomes of past recipes [23]. In both cases, evolutionary search found designs that were more effective than human designs, even surprising and unlikely to be found by humans. In ESP, these elements of strategy search and surrogate modeling are combined into a general approach for decision strategy optimization. ESP is implemented as part of LEAF, i.e. Cognizant's Evolutionary AI platform, and is currently being applied to a number of business decision optimization problems.

The goal of this paper is to introduce the ESP approach in general, and to extend it further into decision strategies that consist of sequences of decisions. This extension makes it possible to evaluate ESP against other methods in RL domains. Conversely, ESP is used to formalize RL as surrogate-assisted, population-based search. This approach is particularly compelling in domains where real-world evaluations are costly. ESP improves upon traditional RL in several ways: It converges faster given the same number of episodes, indicating better sample-efficiency; it has lower variance for best policy performance, indicating better reliability of delivered solutions; and it has lower regret, indicating lower costs and better safety during training. Surprisingly, optimizing against the surrogate also has a regularization effect: the solutions are sometimes more general and thus perform better than solutions discovered in the domain itself. Further, ESP brings the advantages of population-based search outlined above to RL, i.e. enhanced exploration, multiobjectivity, and scale-up to high-dimensional search spaces.

The ESP approach is evaluated in this paper in various RL benchmarks. First its behavior is visualized in a synthetic domain, illustrating how the Predictor and Prescriptor learn together to discover optimal decisions. Direct evolution (DE) is compared to evolution with the surrogate, to demonstrate how the approach minimizes the need for evaluations in the real world. ESP is then compared with standard RL approaches, demonstrating better sample efficiency, reliability, and lower cost. The experiments also demonstrate regularization through the surrogate, as well as the ability to utilize different kinds of Predictor models (e.g. random forests and neural networks) to best fit the data. ESP thus forms a promising evolutionary-optimization-based approach to sequential decision-making tasks.

2 RELATED WORK

Traditional model-based RL aims to build a transition model, embodying the system's dynamics, that estimates the system's next state in time, given current state and actions. The transition model, which is learned as part of the RL process, allows for effective action selection to take place [13]. These models allow agents to leverage predictions of the future state of their environment [54]. However, model-based RL usually requires a prohibitive amount of data for building a reliable transition model while also training an agent. Even simple systems can require tens to hundreds of thousands of samples [41]. While techniques such as PILCO [7] have been developed to specifically address sample efficiency in model-based RL, they can be computationally intractable for all but the lowest-dimensional domains [51].

As RL has been applied to increasingly complex tasks with real-world costs, sample efficiency has become a crucial issue. Model-free RL has emerged as an important alternative in such tasks because such methods sample the domain without a transition model. Their performance and efficiency thus depend on their sampling and reward estimation methods. As a representative model-free off-policy method, Deep Q-Networks (DQN) [30] solves the sample efficiency issue by modeling future rewards using action values, also known as Q values. The Q-network is learned based on a replay buffer that collects training data from real-world interactions. Advanced techniques such as double Q-learning [14] and dueling network architectures [52] make DQN competitive in challenging problems.

In terms of on-policy model-free techniques, policy gradient approaches (sometimes referred to as deep RL) leverage developments in deep neural networks to provide a general RL solution. Asynchronous Advantage Actor-Critic (A3C) [29] in particular builds policy critics through an advantage function, which considers both action and state values. Proximal Policy Optimization (PPO) [44] further makes actor-critic methods more robust with a clipped surrogate objective. Unfortunately, policy gradient techniques are typically quite sensitive to hyperparameter settings, and require overwhelming numbers of samples. One reason is the need to train a policy neural network. While expensive, policy gradient methods have had success in simulated 3D locomotion [15, 43], Atari video games [30], and, famously, Go [45].

ESP will be compared to both DQN and PPO methods in this paper. Compared to ESP, these existing RL approaches have several limitations: First, their performance during real-world interactions cannot be guaranteed, leading to safety concerns [37]. In ESP, only elite agents selected via the surrogate model are evaluated on the real world, significantly improving safety. Second, the quality of the best-recognized policy is unreliable because it has not been sufficiently evaluated during learning. ESP solves this issue by evaluating all elite policies in the real world for multiple episodes. Third, existing RL methods rely heavily on deep neural networks. In contrast, ESP treats the Predictor as a black box, allowing high flexibility in model choices, including simpler models such as random forests that are sufficient in many cases.

Evolutionary approaches have been used in RL in several ways. For instance, in evolved policy gradients [17], the loss function against which an agent's policy is optimized is evolved. The policy loss is not represented symbolically, but rather as a neural network that convolves over a temporal sequence of context vectors; the parameters to this neural network are optimized using evolutionary strategies. In reward function search [33], the task is framed as a genetic programming problem, leveraging PushGP [47].

More significantly, entire policies can be discovered directly with many evolutionary techniques including Genetic Algorithms, Natural Evolutionary Strategies, and Neuroevolution [1, 11, 39, 48, 49, 55]. Such direct policy evolution can only leverage samples from the real world, which is costly and risky since many evaluations have to take place in low-performing areas of the policy space. This issue in evolutionary optimization led to the development of surrogate-assisted evolution techniques [12, 18, 22]. Surrogate methods have been applied to a wide range of domains, ranging from turbomachinery blade optimization [35] to protein design [42]. ESP aims to build upon this idea, by relegating the majority of policy evaluations to a flexible surrogate model, allowing for a wider variety of contexts to be used to evolve better policies.

3 THE ESP APPROACH

The goal of the ESP approach is to find a decision policy that optimizes a set of outcomes (Figure 1). Given a set of possible contexts (or states) C and possible actions A, a decision policy D returns a set of actions A to be performed in each context C:

D(C) = A ,    (1)

where C ∈ C and A ∈ A. For each such (C, A) pair there is a set of outcomes O(C, A), i.e. the consequences of carrying out decision A in context C. For instance, the context might be a description of a patient, actions might be medications, and outcomes might be health, side effects, and costs. In the following, higher values of O are assumed to be better for simplicity.

In ESP, two models are employed: a Predictor Pd, and a Prescriptor Ps. The Predictor takes, as its input, context information, as well as actions performed in that context. The output of the Predictor is the resulting outcomes when the given actions are applied in the given context. The Predictor is therefore defined as

Pd(C, A) = O' ,    (2)

such that Σ_j L(O_j, O'_j) across all dimensions j of O is minimized. The function L can be any of the usual loss functions used in machine learning, such as cross-entropy or mean-squared error, and the model Pd itself can be any supervised machine learning model such as a neural network or a random forest.

The Prescriptor takes a given context as input, and outputs a set of actions:

Ps(C) = A ,    (3)

such that Σ_{i,j} O'_j(C_i, A_i) over all possible contexts i is maximized. It thus approximates the optimal decision policy for the problem. Note that the optimal actions A are not known, and must therefore be found through search.

Figure 2: The ESP Outer Loop. The Predictor can be trained gradually at the same time as the Prescriptor is evolved, using the Prescriptor to drive exploration. That is, the user can decide to apply the Prescriptor's outputs to the real world, observe the outcomes, and aggregate them into the Predictor's training set.

The ESP algorithm then operates as an outer loop that constructs the Predictor and Prescriptor models (Figure 2):

(1) Train a Predictor based on historical training data;
(2) Evolve Prescriptors with the Predictor as the surrogate;
(3) Apply the best Prescriptor in the real world;
(4) Collect the new data and add to the training set;
(5) Repeat until convergence.

As usual in evolutionary search, the process terminates when a satisfactory level of outcomes has been reached, or no more progress can be made. Note that in Step 1, if no historical decision data exists initially, a random Predictor can be used. Also note that not all data needs to be accumulated for training each iteration. In domains where the underlying relationships between variables might change over time, it might be advisable to selectively ignore samples from the older data as more data is added to the training set in Step 4. It is thus possible to bias the training set towards more recent experiences.

Building the Predictor model is straightforward given a (C, A, O) dataset. The choice of algorithm depends on the domain, i.e. how much data there is, whether it is continuous or discrete, structured or unstructured. Random forests and neural networks will be demonstrated in this role in this paper. The Prescriptor model, in contrast, is built using neuroevolution in ESP: Neural networks because they can express complex nonlinear mappings naturally, and evolution because it is an efficient way to discover such mappings [48], and naturally suited to optimize multiple objectives [4, 10]. Because it is evolved with the Predictor, the Prescriptor is not restricted by a finite training dataset, or limited opportunities to evaluate in the real world. Instead, the Predictor serves as a fitness function, and it can be queried frequently and efficiently. In a multiobjective setting, ESP produces multiple Prescriptors, selected from the Pareto front of the multiobjective neuroevolution run.
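To make the outer loop of Steps 1–5 concrete, the following minimal sketch illustrates one possible realization; it is not the LEAF implementation described in the paper. A scikit-learn random forest stands in for the Predictor, the Prescriptor is a small tanh network evolved with simple truncation selection and weight perturbation, and real_world_outcome, the dimensions, and all hyperparameters are placeholder assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def esp_outer_loop(real_world_outcome, context_dim, action_dim,
                   iterations=10, pop_size=100, generations=20):
    """Illustrative ESP loop: train a Predictor on (C, A, O) data, evolve
    Prescriptors against it, apply the best one, and grow the dataset."""
    rng = np.random.default_rng(0)
    # Step 1 seed data: random actions stand in for missing historical decisions.
    C = rng.uniform(-1, 1, (32, context_dim))
    A = rng.uniform(-1, 1, (32, action_dim))
    O = np.array([real_world_outcome(c, a) for c, a in zip(C, A)])

    def new_prescriptor(hidden=32):   # single hidden layer, tanh activation
        return [rng.normal(size=(context_dim, hidden)), rng.normal(size=(hidden, action_dim))]

    def prescribe(w, contexts):
        return np.tanh(np.tanh(contexts @ w[0]) @ w[1])

    population = [new_prescriptor() for _ in range(pop_size)]
    for _ in range(iterations):
        predictor = RandomForestRegressor(n_estimators=100).fit(np.hstack([C, A]), O)   # Step 1
        for _ in range(generations):                                                    # Step 2
            fitness = [predictor.predict(np.hstack([C, prescribe(w, C)])).mean() for w in population]
            ranked = [population[i] for i in np.argsort(fitness)[::-1]]
            elites = ranked[:pop_size // 10]
            population = elites + [
                [m + 0.1 * rng.normal(size=m.shape) for m in elites[rng.integers(len(elites))]]
                for _ in range(pop_size - len(elites))]
        best = population[0]                                   # best elite per the surrogate
        new_C = rng.uniform(-1, 1, (8, context_dim))           # Step 3: apply in the real world
        new_A = prescribe(best, new_C)
        new_O = np.array([real_world_outcome(c, a) for c, a in zip(new_C, new_A)])
        C, A = np.vstack([C, new_C]), np.vstack([A, new_A])    # Step 4: add to the training set
        O = np.concatenate([O, new_O])
    return population[0]
```

The key design point is that almost all candidate evaluations in the inner loop query only the surrogate; the real world is touched only when the current best Prescriptor is applied and its outcomes are added to the training set.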


Figure 3: Visualizing ESP Behavior. (a) True reward for every state-action pair; (b–c) After 1000 episodes, the top Prescriptors for direct evolution (DE) and PPO are still far from optimal; (d) ESP Prescriptor (orange) and Predictor (background) for several iterations. The translucent circles indicate the state-action pairs sampled so far, i.e., the samples on which the Predictor is trained. By 125 episodes, ESP has converged around the optimal Prescriptor, and the ESP Predictor has converged in the neighborhood of this optimum, showing how ESP can leverage Predictors over time to find good actions quickly. Note that the Prescriptor does not exactly match the actions the Predictor would suggest as optimal: the Prescriptor regularizes the Predictor's suggestions by implicitly ensembling the Predictors evolved against over time. For a full video of the algorithms, see http://evolution.ml/demos/espvisualization.

Applying the ESP framework to RL problems involves extending the contexts and actions to sequences. The Prescriptor can be seen as an RL agent, taking the current context as input, and deciding what actions to perform in each time step. The output of the Predictor, O', can be seen as the reward vector for that step, i.e. as Q values [53] (with a given discount factor, such as 0.9, as in the experiments below). Evolution thus aims to maximize the predicted reward, or minimize the regret, throughout the sequence.

The outer loop of ESP changes slightly because in RL there is no dataset to train the Predictor; instead, the data needs to be generated by applying the current Prescriptors to the domain. An elite set of several good Prescriptors is used in this role to create a more diverse training set. The initial training set is created randomly. The loop now is:

(1) Apply the elite Prescriptors in the actual domain;
(2) Collect Q values for each time step for each Prescriptor;
(3) Train a Predictor based on the data collected in Step 2;
(4) Evolve Prescriptors with the Predictor as the surrogate;
(5) Repeat until convergence.

The evolution of Prescriptors continues in each iteration of this loop from where it left off in the previous iteration. In addition, the system keeps track of the best Prescriptor so far, as evaluated in the actual domain, and makes sure it stays in the parent population during evolution. This process discovers good Prescriptor agents efficiently, as will be described in the experiments that follow.

4 EXPERIMENTS

ESP was evaluated in three domains: Function approximation, where its behavior could be visualized concretely; Cart-pole control [2], where its performance could be compared to standard RL methods in a standard RL benchmark task; and Flappy Bird, where the regularization effect of the surrogate could be demonstrated most clearly.

The neuroevolution algorithm for discovering Prescriptors evolves weights of neural networks with fixed topologies. Unless otherwise specified, all experiments use the following default setup for evolution: candidates have a single hidden layer with bias and tanh activation; the initial population uses orthogonal initialization of layer weights with a mean of 0 and a standard deviation of 1 [40]; the population size is 100; the top 10% of the population is carried over as elites; parents are selected by tournament selection of the top 20% of candidates; recombination is performed by uniform crossover at the weight level; and there is a 10% probability of multiplying each weight by a mutation factor, where mutation factors are drawn from N(1, 0.1).
4.1 Visualizing ESP Behavior

This section demonstrates the behavior of ESP in a synthetic function approximation domain where its behavior can be visualized. The domain also allows comparing ESP to direct evolution in the domain, as well as to PPO, and visualizing their differences.

Problem Description. The domain has a one-dimensional context C and a one-dimensional action A, with outcome O given by the function O = −|A − 3 sin(C/2)|. The optimal action for each context lies on a periodic curve, which captures complexity that can arise from periodic variables such as time of day or time of year. The outcome of each action decreases linearly as the action moves away from the optimal action. Episodes in this domain consist of a single action in [−10, 10], which is taken in a context drawn uniformly over [−10, 10]. The full domain is shown in Figure 3(a).

Algorithm Setup. ESP begins by taking ten random actions. Thereafter, in every iteration, ESP trains a Predictor neural network with two hidden layers of size 64 and tanh activation for 2000 epochs using the Adam optimizer [24] with default parameters to minimize MSE, and evolves Prescriptor neural networks for 20 generations against this Predictor. Then, the top Prescriptor is run in the real domain for a single episode. Prescriptors have a single hidden layer of size 32 with tanh activation; default parameters are used for evolution.

Direct Evolution (DE) was run as a baseline comparison for ESP. It consists of the exact same evolution process, except that it is run directly against the real function instead of the Predictor. That is, in each generation, all 100 candidates are evaluated on one episode from the real function.

PPO was run as an RL comparison, since it is a state-of-the-art RL approach for continuous action spaces [44]. During each iteration it was run for ten episodes, since this setting was found to perform best during hyperparameter search. PPO defaults were used for the remaining hyperparameters (https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html).

Ten independent runs were performed for each algorithm. The returned policy at any time for DE and ESP is the candidate with the highest fitness in the population; for PPO it is the learned policy run without stochastic exploration.
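As a concrete reference for this setup, the synthetic domain and the per-episode regret reported below can be written directly from the definitions above (an illustrative sketch; prescriptor is any mapping from context to action):

```python
import numpy as np

def outcome(context, action):
    """Synthetic domain: reward peaks on the periodic curve A* = 3 sin(C / 2)."""
    return -abs(action - 3.0 * np.sin(context / 2.0))

def episode_regret(prescriptor, rng=None):
    """One episode: a single action in [-10, 10] for a context drawn uniformly
    from [-10, 10]; regret is the gap between the optimal reward (0) and the
    reward actually obtained."""
    rng = rng or np.random.default_rng()
    context = rng.uniform(-10.0, 10.0)
    action = float(np.clip(prescriptor(context), -10.0, 10.0))
    return 0.0 - outcome(context, action)
```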
Qualitative Results. Snapshots of the convergence behavior for each method are shown in Figures 3(b–d). After 1000 episodes, neither DE nor PPO converged near the optimal solution. On the other hand, ESP discovered the periodic nature of the problem within 50 episodes, and converged almost exactly to the optimal within 125 episodes.

The Predictor's predicted reward for each state-action pair is shown using the background colors in each snapshot of ESP. The rapid convergence of the Predictor highlights the sample efficiency of ESP, due to aggressive use of historical data (shown as translucent circles). Note, however, that the Predictor does not converge to the ground truth over the entire domain; it does so just in the neighborhood of the optimal Prescriptor. Thus, ESP avoids excessive costly exploration of low-quality actions once the structure of optimal actions has become clear.

Note also that the Prescriptor does not follow the optimal action suggested by the Predictor at every iteration exactly. Since it maps states directly to actions, the Prescriptor provides a smoothing regularization in action space that can overcome Predictor overfitting. Also, since the top ESP Prescriptors must survive across many different Predictors over time, ESP benefits from an implicit temporal ensembling, which further improves regularization.

Figure 4: Performance in the Function Approximation Domain. The horizontal axis indicates the total number of real-world episodes used in training and the vertical axis the performance at that point. Ten independent runs were performed for each method. Solid lines represent the mean over 10 runs, and colored areas show the corresponding standard deviation. (a) The true performance of the returned best agent converges substantially faster with ESP. (b) ESP also operates with much lower regret than DE or PPO, converging to very low regret behavior orders of magnitude faster than the other approaches (the plot shows the moving average regret over the last 100 training episodes). The standard deviation of ESP is small in both metrics, attesting to the reliability of the method.

Quantitative Results. The numerical performance results in Figure 4 confirm the substantial advantage of ESP. ESP converges rapidly to a high-performing returned solution (Figure 4(a)), while taking actions with significantly lower regret (defined as the reward difference between the optimal solution and the current policy) than the other methods during learning (Figure 4(b)). On both metrics, ESP converges orders of magnitude faster than the other approaches. In particular, after a few hundred episodes, ESP reaches a solution that is significantly better than any found by DE or PPO, even after 3,000 episodes. This result shows that, beyond being more sample efficient, by systematically exploiting historical data, ESP is able to find solutions that direct evolution or policy gradient search cannot.

The next sections show how these advantages of ESP can be harnessed in standard RL benchmarks, focusing on the advantage of surrogate modeling over direct evolution in the domain, comparing to DQN and PPO, and demonstrating the regularization effect of the surrogate.

4.2 Comparing with Standard RL

The goal of the Cart-pole experiments was to demonstrate ESP's performance compared to direct evolution and standard RL methods.


Figure 5: Performance in the Cart-Pole Domain. The same experimental and plotting conventions were used as in Figure 4. (a) True performance of the best policy returned by each method during the learning process. True performance is based on the average reward of 100 real-world episodes (this evaluation is not part of the training). ESP converges significantly faster and has a much lower variance, demonstrating better sample efficiency and reliability than the other methods. (b) Average regret over all past training episodes. ESP has significantly lower regret, suggesting that it has lower cost and is safer in real-world applications.

Problem Description. The Cart-pole control domain is one of the standard RL benchmarks. In the popular CartPole-v0 implementation on the OpenAI Gym platform [3] used in the experiments, there is a single pole on a cart that moves left and right depending on the force applied to it. A reward is given for each time step that the pole stays near vertical and the cart stays near the center of the track; otherwise the episode ends.

Algorithm Setup. DE was run with a population size of 50 candidates. A candidate is a neural network with four inputs (observations), one hidden layer of 32 units with tanh activation, and two outputs (actions) with argmax activation functions. The fitness of each candidate is the average reward over five episodes in the game, where the maximum episode length is 200 time steps.

ESP runs similarly to DE, except that the fitness of each candidate is evaluated against the Predictor instead of the game. The Predictor is a neural network with six inputs (four observations and two actions), two hidden layers with 64 units each and tanh activation, and one output (the predicted discounted future reward) with tanh activation. It is trained for 1,000 epochs with the Adam optimizer [24] with MSE loss and a batch size of 256.

The first Predictor is trained on samples collected from five random agents playing five episodes each. Random agents choose a uniform random action at each time step. A sample corresponds to a time step in the game and comprises four observations, two actions, and the discounted future reward. Reward is +1 on each time step, except for the last one, where it is adjusted to +2,000 in case of success or −2,000 in case of failure (i.e. 10 × the maximum number of time steps). The discount factor is set to 0.9. The reward value is then scaled to lie between −1 and 1.

In order to be evaluated against the Predictor, a Prescriptor candidate has to prescribe an action for each observation vector from the collected samples. The action is then concatenated with the observation vector and passed to the Predictor to get the predicted future reward. The fitness of the candidate is the average of the predicted future rewards.

Every five generations, data is collected from the game from the five elites, for five episodes each. The new data is aggregated into the training set and a new Predictor is trained. The generation's candidates are then evaluated on the new Predictor with the new training data. The top elite candidate is also evaluated for 100 episodes on the game for reporting purposes only. Evolution is stopped after 160 generations, which corresponds to 800 episodes played from the game, or once an elite receives an average reward of 200 over five episodes.

In addition to DE and ESP, two state-of-the-art RL methods were implemented for comparison: double DQN with dueling network architectures [30, 52] and actor-critic style PPO [44]. The implementation and parametric setup of DQN and PPO were based on OpenAI Baselines [8]. For PPO, the policy's update frequency was set to 20, which was found to be optimal during hyperparameter search. All other parametric setups of DQN and PPO utilized the default setups recommended in OpenAI Baselines.
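The sample construction above (per-step reward of +1, terminal adjustment of ±2,000, discount factor 0.9, and scaling to [−1, 1]) can be sketched as follows. This is an illustration of the stated scheme rather than the authors' code; in particular, the exact scaling and the pass-through of the two-unit action encoding are assumptions.

```python
import numpy as np

def episode_to_samples(observations, actions_onehot, success, gamma=0.9):
    """Turn one Cart-pole episode into (observation, action, target) rows for the
    Predictor: targets are discounted future rewards, scaled to lie in [-1, 1]."""
    rewards = np.ones(len(observations))
    rewards[-1] = 2000.0 if success else -2000.0        # terminal adjustment (10 x max time steps)
    targets, future = [], 0.0
    for r in rewards[::-1]:                             # discounted return, accumulated backwards
        future = r + gamma * future
        targets.append(future)
    targets = np.array(targets[::-1])
    targets = targets / np.abs(targets).max()           # scaling choice assumed
    return [(obs, act, t) for obs, act, t in zip(observations, actions_onehot, targets)]
```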

Results. Figure 5(a) shows how the true performance of the best policy returned by ESP, DE, PPO, and DQN changes during the learning process in CartPole-v0. For ESP and DE, the elite candidate that has the best real-world fitness is selected as the best policy so far. For DQN and PPO, whenever the moving average reward of the past 100 episodes of training increases, the best policy is updated using the most recent policy. One hundred additional real-world episodes were used to evaluate the best policies (these evaluations are not part of the training).

ESP converges significantly faster than the other methods, implying better sample-efficiency during learning. Moreover, the variance of the true performance becomes significantly smaller for ESP after an early stage, while all other algorithms have high variances even during later stages of learning. This observation demonstrates that the solutions delivered by ESP are highly reliable.

Figure 5(b) shows the average regret for the training processes of all algorithms in CartPole-v0. ESP has significantly lower regret during the entire learning process, indicating not only lower costs but also better safety in real-world interactions.

4.3 Regularization Through Surrogate Modeling

Whenever a surrogate is used to approximate a fitness function, there is a risk that the surrogate introduces false optima and misleads the search [21] (for an inspiring collection of similar empirical phenomena, see Lehman et al. [25]). ESP mitigates that risk by alternating between actual domain evaluations and the surrogate. However, the opposite effect is also possible: Figure 6 shows how the surrogate may form a more regularized version of the fitness than the real world, and thereby make it easier to learn policies that generalize well [18, 34].

Figure 6: Surrogate Approximation of the Fitness Landscape. The fitness in the actual domain may be deceptive and nonlinear, for instance single actions can have large consequences. The surrogate learns to approximate such a landscape, thereby creating a surrogate landscape that is easier to search and optimize. (The plot shows fitness as a function of a domain feature for the ground-truth and surrogate landscapes.)

Problem Description. Flappy Bird is a side-scroller game where the player controls a bird, attempting to fly it between columns of pipes without hitting them by performing flapping actions at carefully chosen times. This experiment is based on a PyGame [50] implementation of this game, running at a speed of 30 frames per second. The goal of the game is to finish ten episodes of two minutes, or 3,600 frames each, through random courses of pipes. A reward is given for each frame where the bird does not collide with the boundaries or the pipes; otherwise the episode ends. The score of each candidate is the average reward over the ten episodes.

Algorithm Setup. Both DE and ESP were set up in a similar way as in the preceding sections. DE had a population of 100 candidates, each a neural network with eight inputs (observations), one hidden layer of 128 nodes with tanh activation, and two outputs (actions) with argmax activation. The ESP Predictor was a random forest with 100 estimators, approximating reward values for state-action pairs frame by frame. The state-action pairs were collected with the ten best candidates of each generation running ten episodes on the actual game, for a total of one hundred episodes per generation.
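To illustrate how a random-forest Predictor serves as the surrogate in this setup, the following sketch (an assumption-laden reimplementation, not the authors' code) fits a forest on frame-level state-action samples with reward-value targets and uses it to score a candidate's prescribed actions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_flappy_predictor(states, actions, reward_targets):
    """Fit the surrogate on frame-level samples: eight observation values plus the
    chosen action, with the (discounted) reward value as the regression target."""
    features = np.hstack([states, np.asarray(actions).reshape(-1, 1)])
    return RandomForestRegressor(n_estimators=100).fit(features, reward_targets)

def surrogate_fitness(prescriptor, predictor, states):
    """Score a candidate Prescriptor: prescribe an action for every collected state
    and average the Predictor's reward estimates for those state-action pairs."""
    prescribed = np.array([prescriptor(s) for s in states])        # 0 = glide, 1 = flap (assumed encoding)
    features = np.hstack([states, prescribed.reshape(-1, 1)])
    return predictor.predict(features).mean()
```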
Figure 7: Performance in the Flappy Bird Domain. The same experimental and plotting conventions were used as in Figure 5(a), except that true performance of the best policy returned is based on the average reward of 10 real-world episodes during the training. ESP discovers a policy that solves the task very quickly, whereas DE cannot discover it even though it is run an order of magnitude longer. This result is likely due to the regularization effect that the surrogate provides, as shown in Figure 6.

Results. Figure 7 shows how the true performance of the best policy returned by ESP and DE improved during the learning process. The elite candidate that has the best real-world fitness was selected as the best policy so far. In about 80,000 episodes, ESP discovered a policy that solved the task, i.e. was able to guide the bird through the entire course of pipes without hitting any of them. It is interesting that DE converged to a suboptimal policy even though it was run an order of magnitude longer. This result is likely due to the regularization effect illustrated in Figure 6. Direct evolution overfits to the nonlinear effects in the game, whereas the surrogate helps smooth the search landscape, thereby leading evolution to policies that perform better.

5 DISCUSSION AND FUTURE WORK

The results in this paper show that the ESP approach performs well in sequential decision-making tasks like those commonly used as benchmarks for RL. Compared to direct evolution and state-of-the-art RL methods, it is highly sample efficient, which is especially important in domains where exploring with the real world is costly. Its solutions are also reliable and safe, and the complexity of its models can be adjusted depending on the complexity of the data and task. These advantages apply to ESP in general, including decision strategies that are not sequential, which suggests that it is a good candidate for improving decision making in real-world applications, including those in business, government, education, and healthcare.

When ESP is applied to such practical problems, the process outlined in Section 3 can be extended further in several ways. First, ESP can be most naturally deployed to augment human decision making. The Prescriptor's output is thus taken as advice, and the human decision maker can modify the actions before applying them. These actions and their eventual outcomes are still captured and processed in Step 4 of the ESP process, and thus become part of the learning (Figure 2). Second, to support human decision making, an uncertainty estimation model such as RIO [36] can be applied to the Predictor, providing confidence intervals around the outcome O'. Third, the continual new data collection in the outer loop makes it possible to extend ESP to uncertain environments and to dynamic optimization, where the objective function changes over time [18, 19, 57]. By giving higher priority to new examples, the Predictor can be trained to track such changing objectives. Fourth, in some domains, such as those in the financial services and healthcare industries that are strongly regulated, it may be necessary to justify the actions explicitly. Rather than evolving a Prescriptor as a neural network, it may be possible to evolve rule-set representations [16] for this role, thus making the decision policy explainable. Such extensions build upon the versatility of the ESP framework, and make it possible to incorporate the demands of real-world applications.

Although the Predictor does not have to be perfect, and its approximate performance can even lead to regularization as was discussed in Section 4.3, it is sometimes the bottleneck in building an application of ESP. In the non-sequential case, the training data may not be sufficiently complete, and in the sequential case, it may be difficult to create episodes that run to a successful conclusion early in the training. Note that in this paper, the Predictor was trained with targets from discounted rewards over time. An alternative approach would be to incrementally extend the time horizon of Predictors by training them iteratively [38]. Such an approach could help resolve conflicts between Q targets, and thereby help in early training. Another approach would be to make the rewards more incremental, or evolve them using reward function search [17, 33]. It may also be possible to evaluate the quality of the Predictor directly, and adjust sampling from the real world accordingly [20].

An interesting future application of ESP is in multiobjective domains. In many real-world decision-making domains, there are at least two conflicting objectives: performance and cost. As an evolutionary approach, ESP lends itself well to optimizing multiple objectives [4, 10]. The population forms a Pareto front, and multiple Prescriptors can be evolved to represent the different tradeoffs. Extending the work in this paper, it would be illuminating to compare ESP in such domains to recent efforts in multiobjective RL [26, 31, 56], evaluating whether there are complementary strengths that could be exploited.

6 CONCLUSION

ESP is a surrogate-assisted evolutionary optimization method designed specifically for discovering decision strategies in real-world applications. Based on historical data, a surrogate is learned and used to evaluate candidate policies with minimal exploration cost. Extended into sequential decision making, ESP is highly sample efficient, has low variance, and low regret, making the policies reliable and safe. Surprisingly, the surrogate also regularizes decision making, making it sometimes possible to discover good policies even when direct evolution fails. ESP is therefore a promising approach to improving decision making in many real-world applications where historical data is available.

REFERENCES

[1] Ackley, D. and Littman, M. 1991. Interactions between learning and evolution. Artificial Life II 10 (1991), 487–509.
[2] Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics 13 (1983), 834–846.
[3] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. 2016. OpenAI Gym. CoRR abs/1606.01540 (2016).
[4] Coello Coello, C. A. 1999. A Comprehensive Survey of Evolutionary-Based Multiobjective Optimization Techniques. International Journal of Knowledge and Information Systems 1 (1999), 269–308.
[5] Cressie, N. 1990. The origins of kriging. Mathematical Geology 22, 3 (1990), 239–252.
[6] Deb, K. and Myburgh, C. 2017. A population-based fast algorithm for a billion-dimensional resource allocation problem with integer variables: Breaking the Billion-Variable Barrier in Real-World. European Journal of Operational Research 261 (2017), 460–474.
[7] Deisenroth, M. and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML'11). 465–472.
[8] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. 2017. OpenAI Baselines. https://github.com/openai/baselines. (2017).
[9] Dreer, L. E. and Linley, A. 2017. Behavioral Medicine: Nutrition, Medication Management, and Exercise. In Practical Psychology in Medical Rehabilitation, M. Budd, S. Hough, S. Wegener, and W. Stiers (Eds.). Springer.
[10] Emmerich, M. and Deutz, A. 2018. A tutorial on multiobjective optimization: Fundamentals and evolutionary methods. Natural Computation 17 (2018), 585–609.
[11] Gomez, F., Schmidhuber, J., and Miikkulainen, R. 2006. Efficient non-linear control through neuroevolution. In Proceedings of the European Conference on Machine Learning. Springer, 654–662.
[12] Grefenstette, J. J. and Fitzpatrick, J. M. 1985. Genetic search with approximate function evaluations. In Proceedings of the International Conference on Genetic Algorithms and their Applications. 112–120.
[13] Ha, D. and Schmidhuber, J. 2018. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems 32 (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 2455–2467.
[14] Hasselt, H. V. 2010. Double Q-learning. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 2613–2621.
[15] Heess, N., TB, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S., and others. 2017. Emergence of locomotion behaviours in rich environments. arXiv:1707.02286 (2017).
[16] Hodjat, B., Shahrzad, H., Miikkulainen, R., Murray, L., and Holmes, C. 2018. PRETSL: Distributed Probabilistic Rule Evolution for Time-Series Classification. In Genetic Programming Theory and Practice XIV. Springer, 139–148.
[17] Houthooft, R., Chen, Y., Isola, P., Stadie, B., Wolski, F., Ho, O. J., and Abbeel, P. 2018. Evolved policy gradients. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 5400–5409.
[18] Jin, Y. 2011. Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation 1, 2 (2011), 61–70. https://doi.org/10.1016/j.swevo.2011.05.001
[19] Jin, Y. and Branke, J. 2005. Evolutionary optimization in uncertain environments – a survey. IEEE Transactions on Evolutionary Computation 9, 3 (June 2005), 303–317. https://doi.org/10.1109/TEVC.2005.846356
[20] Jin, Y., Husken, M., and Sendhoff, B. 2003. Quality measures for approximate models in evolutionary computation. In Proceedings of the Bird of a Feather Workshop, Genetic and Evolutionary Computation Conference (GECCO). 170–173.
[21] Jin, Y., Olhofer, M., and Sendhoff, B. 2000. On Evolutionary Optimization with Approximate Fitness Functions. 786–793.
[22] Jin, Y., Wang, H., Chugh, T., Guo, D., and Miettinen, K. 2019. Data-Driven Evolutionary Optimization: An Overview and Case Studies. IEEE Transactions on Evolutionary Computation 23 (2019), 442–458.

[23] Johnson, A. J., Meyerson, E., de la Parra, J., Savas, T. L., Miikkulainen, R., and Harper, C. B. 2019. Flavor-Cyber-Agriculture: Optimization of plant metabolites in an open-source control environment through surrogate modeling. PLOS ONE (2019). https://doi.org/10.1371/journal.pone.0213918
[24] Kingma, D. P. and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
[25] Lehman, J., Clune, J., Misevic, D., Adami, C., Beaulieu, J., Bentley, P. J., Bernard, S., Beslon, G., Bryson, D. M., Chrabaszcz, P., Cheney, N., Cully, A., Doncieux, S., Dyer, F. C., Ellefsen, K. O., Feldt, R., Fischer, S., Forrest, S., Frénoy, A., Gagné, C., Goff, L. K. L., Grabowski, L. M., Hodjat, B., Hutter, F., Keller, L., Knibbe, C., Krcah, P., Lenski, R. E., Lipson, H., MacCurdy, R., Maestre, C., Miikkulainen, R., Mitri, S., Moriarty, D. E., Mouret, J., Nguyen, A., Ofria, C., Parizeau, M., Parsons, D. P., Pennock, R. T., Punch, W. F., Ray, T. S., Schoenauer, M., Shulte, E., Sims, K., Stanley, K. O., Taddei, F., Tarapore, D., Thibault, S., Weimer, W., Watson, R., and Yosinksi, J. 2018. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. arXiv:1803.03453 (2018).
[26] Liu, C., Xu, X., and Hu, D. 2015. Multiobjective Reinforcement Learning: A Comprehensive Overview. IEEE Transactions on Systems, Man, and Cybernetics 45 (2015), 385–398.
[27] Miikkulainen, R. 2019. Creative AI Through Evolutionary Computation. In Evolution in Action: Past, Present and Future, Banzhaf et al. (Ed.). Springer, New York.
[28] Miikkulainen, R., Brundage, M., Epstein, J., Foster, T., Hodjat, B., Iscoe, N., Jiang, J., Legrand, D., Nazari, S., Qiu, X., Scharff, M., Schoolland, C., Severn, R., and Shagrin, A. In Press. Ascend by Evolv: AI-Based Massively Multivariate Conversion Rate Optimization. AI Magazine (In Press).
[29] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16). 1928–1937.
[30] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., and others. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[31] Mossalam, H., Assael, Y. M., Roijers, D. M., and Whiteson, S. 2016. Multi-Objective Deep Reinforcement Learning. arXiv:1610.02707 (2016).
[32] Naik, P. A., Raman, K., and Winer, R. S. 2005. Planning Marketing-Mix Strategies in the Presence of Interaction Effects. Marketing Science 24 (2005), 25–34.
[33] Niekum, S., Barto, A. G., and Spector, L. 2010. Genetic programming for reward function search. IEEE Transactions on Autonomous Mental Development 2, 2 (2010), 83–90.
[34] Ong, Y., Nair, P., and Keane, A. 2003. Evolutionary Optimization of Computationally Expensive Problems via Surrogate Modeling. AIAA Journal (2003). https://doi.org/10.2514/2.1999
[35] Pierret, S. and Van den Braembussche, R. 1999. Turbomachinery blade design using a Navier-Stokes solver and Artificial Neural Network. Journal of Turbomachinery 121, 2 (1999), 326–332.
[36] Qiu, X., Meyerson, E., and Miikkulainen, R. 2020. Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel. In Proceedings of the Eighth International Conference on Learning Representations (ICLR).
[37] Ray, A., Achiam, J., and Amodei, D. 2019. Benchmarking Safe Exploration in Deep Reinforcement Learning. (2019). https://cdn.openai.com/safexp-short.pdf
[38] Riedmiller, M. 2005. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning. Springer, 317–328.
[39] Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. 2017. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864 (2017).
[40] Saxe, A. M., McClelland, J. L., and Ganguli, S. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR). Citeseer.
[41] Schmidhuber, J. and Huber, R. 1991. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems 2, 01n02 (1991), 125–134.
[42] Schneider, G., Schuchhardt, J., and Wrede, P. 1994. Artificial neural networks and simulated molecular evolution are potential tools for sequence-oriented protein design. Bioinformatics 10, 6 (1994), 635–645.
[43] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the Fourth International Conference on Learning Representations (ICLR).
[44] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017).
[45] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., and others. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
[46] Snoek, J., Larochelle, H., and Adams, R. P. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2951–2959.
[47] Spector, L., Goodman, E., Wu, A., Langdon, W. B., Voigt, H. M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M., and Burke, E. 2001. Autoconstructive Evolution: Push, PushGP, and Pushpop. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO) (2001).
[48] Stanley, K. O., Clune, J., Lehman, J., and Miikkulainen, R. 2019. Designing Neural Networks through Evolutionary Algorithms. Nature Machine Intelligence 1, 1 (2019), 24–35.
[49] Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., and Clune, J. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567 (2017).
[50] Tasfi, N. 2016. PyGame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment. (2016).
[51] Wahlström, N., Schön, T. B., and Deisenroth, M. P. 2015. From pixels to torques: Policy learning with deep dynamical models. arXiv:1502.02251 (2015).
[52] Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. 2016. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), Vol. 48. JMLR.org, 1995–2003.
[53] Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Dissertation. Cambridge University.
[54] Werbos, P. J. 1987. Learning how the world works: Specifications for predictive networks in robots and brains. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, NY.
[55] Whiteson, S. 2012. Evolutionary computation for reinforcement learning. In Reinforcement Learning. Springer, 325–355.
[56] Yang, R., Sun, X., and Narasimhan, K. 2019. A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 14610–14621.
[57] Yu, X., Jin, Y., Tang, K., and Yao, X. 2010. Robust optimization over time – A new perspective on dynamic optimization problems. In IEEE Congress on Evolutionary Computation. 1–6. https://doi.org/10.1109/CEC.2010.5586024