24th European Conference on Artificial Intelligence - ECAI 2020, Santiago de Compostela, Spain

Contextual Q-Learning
Tiago Pinto1 and Zita Vale2
1 GECAD Research Group, Polytechnic Institute of Porto, Portugal
2 Polytechnic Institute of Porto, Portugal

Abstract. This paper presents a new learning model that introduces a contextual dimension to the well-known Q-Learning algorithm. Through the identification of different contexts, the learning process is adapted accordingly, thus converging to enhanced results. The proposed learning model includes a simulated annealing (SA) process that accelerates convergence. The model is integrated in a multi-agent decision support system for electricity market players' negotiations, enabling the assessment of results using real electricity market data.

1 CONTEXTUAL Q-LEARNING

Let us consider an agent that operates in an environment conceptualized by a set of possible states, in which the agent can choose actions from a set of possible actions. Each time the agent performs an action, a reinforcement value is received, indicating the immediate value of the resulting state transition. Thus, the only learning source is the agent's own experience, and its goal is to acquire an action policy that maximizes its overall performance.

The methodology introduced in [3] proposes an adaptation of the Q-Learning algorithm [4] to undertake the learning process. Q-Learning is a very popular reinforcement learning method, since it allows the autonomous establishment of an interactive action policy. It has been demonstrated that the Q-Learning algorithm converges to the optimal policy when the Q values of the state-action pairs are represented in a table containing the full information of each pair [4]. The basic concept behind the proposed Q-Learning adaptation is that the learning algorithm can learn a function of optimal evaluation over the whole space of context-scenario pairs (c x s). This evaluation defines the Q confidence value with which each scenario can represent the actual negotiation scenario s encountered in context c.

The Q function performs the mapping in (1):

Q : C × S → U    (1)

where U is the expected utility value when selecting a scenario s in context c. The expected future reward, when choosing the scenario s in context c, is learned through trial and error according to (2):

Q_{t+1}(c, s) = Q_t(c, s) + α · [ r_t + γ · U_t(c') − Q_t(c, s) ]    (2)

where c is the context verified when performing under scenario s at time t, and:

• Q_t(c, s) represents the value of the previous iteration (each iteration corresponds to a new contract established in the given scenario and context). Generally, the Q value is initialized to 0.
• α is the learning rate, which determines the extent to which the newly acquired information replaces the old information (a value of 0 means that nothing is learned, while a value of 1 is suitable for a fully deterministic environment).
• r_t is the reward, which represents the quality of the context-scenario pair (c x s). It rewards positive actions with high values and negative actions with low values, all normalized on a scale from 0 to 1. The reward r is defined in (3):

r_t = 1 − | R_{p,c,t,a} − E_{s,p,c,t,a} |    (3)

where R_{p,c,t,a} represents the real price that has been established in a contract with an opponent p, in context c, at time t, referring to an amount of power a; and E_{s,p,c,t,a} is the estimated price of scenario s for the same player, amount of power and context at time t. All r values are normalized on a scale from 0 to 1.
• γ is the discount factor, which determines the importance of future rewards. A value of 0 evaluates only current rewards, while values higher than 0 take future rewards into account.
• U_t(c') is the estimation of the optimal future value, which determines the utility of the best scenario in the new context c'. It is calculated as in (4):

U_t(c') = max_{s' ∈ S} Q_t(c', s')    (4)

The Q-Learning algorithm is executed as follows:

• For each c and s, initialize Q(c, s) = 0;
• Observe a new event (a newly established contract);
• Repeat until the stopping criterion is satisfied:
  o Select a new scenario s for the current context;
  o Receive the immediate reward r_t;
  o Update Q(c, s) according to (2);
  o Observe the new context c';
  o Set c = c'.

After each update, all Q values are normalized according to (5), to facilitate the interpretation of the values of each scenario in a range from 0 to 1:

Q'(c, s) = Q(c, s) / max[Q(c, s)]    (5)

The proposed learning model assumes the confidence given by the Q values as the probability of each scenario in a given context. Q(c, s) learns by treating a forecast error, being updated each time a new observation (a newly established contract) becomes available. Once all context-scenario pairs have been visited, the scenario presenting the highest Q value in the last update is chosen by the learning algorithm as the most likely scenario to occur in the actual negotiation.
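As a minimal sketch of this process, the Python code below keeps a tabular Q value per context-scenario pair, applies the update of (2), the normalization of (5) and the final scenario choice. The class name, the default values of the learning rate and discount factor, and the reward helper (which follows the price-difference idea of (3) assuming prices already normalized to [0, 1]) are illustrative assumptions, not the authors' implementation.

class ContextualQLearner:
    """Sketch of contextual Q-Learning over (context, scenario) pairs."""

    def __init__(self, contexts, scenarios, alpha=0.5, gamma=0.1):
        self.alpha = alpha                      # learning rate (illustrative default)
        self.gamma = gamma                      # discount factor (illustrative default)
        self.scenarios = list(scenarios)
        # Q values are generally initialized to 0, one entry per context-scenario pair
        self.q = {(c, s): 0.0 for c in contexts for s in self.scenarios}

    def reward(self, real_price, estimated_price):
        # Reward in the spirit of (3): closeness between the contracted price and
        # the scenario estimate, assuming both are normalized to [0, 1]
        return 1.0 - abs(real_price - estimated_price)

    def update(self, context, scenario, r, next_context):
        # Optimal future value of (4): best Q over all scenarios of the new context
        future = max(self.q[(next_context, s)] for s in self.scenarios)
        # Contextual Q-Learning update of (2)
        self.q[(context, scenario)] += self.alpha * (
            r + self.gamma * future - self.q[(context, scenario)]
        )

    def normalized(self):
        # Normalization of (5): divide by the largest Q value so that every
        # confidence lies in [0, 1]
        top = max(self.q.values()) or 1.0
        return {pair: value / top for pair, value in self.q.items()}

    def most_likely_scenario(self, context):
        # Once all pairs have been visited, pick the scenario with the highest Q
        return max(self.scenarios, key=lambda s: self.q[(context, s)])

In the negotiation setting described above, update() would be called once per newly established contract, and most_likely_scenario() queried for the observed context before the next negotiation.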
2 SIMULATED ANNEALING PROCESS

SA is an optimization method that imitates the annealing process used in metallurgy. The final properties of the treated substance depend strongly on the cooling schedule applied: if it cools down quickly, the resulting substance breaks easily due to an imperfect structure; if it cools down slowly, the resulting structure is well organized and strong. When solving an optimization problem with SA, the structure of the substance represents a codified solution of the problem, and the temperature is used to determine how and when new solutions are perturbed and accepted. The algorithm is basically a three-step process: perturb the solution, evaluate the quality of the solution, and accept the solution if it is better than the previous one [1].

The two main factors of SA are the decrease of the temperature and the probability of acceptance. The temperature only decreases when the acceptance counter is greater than a stipulated maximum. This acceptance counter is only incremented when the probability of acceptance is higher than a random number, which allows some solutions to be accepted even if their quality is lower than that of the previous one. When the condition of acceptance is not satisfied, the solution is compared to the previous one and, if it is better, the best solution is updated.

At high temperatures, the simulated annealing method searches for the global optimum in a wide region; on the contrary, when the temperature decreases, the method reduces the search area, in order to refine the solutions found at high temperatures. This quality makes simulated annealing a good approach for problems with multiple local optima. SA thereby does not converge too quickly to solutions near the global optimum; instead, the algorithm covers a wide area while always trying to improve the solution. Thus, it is important that the temperature decreases slowly, to enable exploring a large part of the search space. The considered stopping criteria are the current temperature and the maximum number of iterations. In each iteration it is necessary to seek a new solution, which is calculated according to (6):

solution_new = solution + rand · S    (6)

The term solution in (6) refers to the previous solution, since this may not be the best one found so far. rand is a random number drawn from a normal distribution, and the scaling variable S is obtained through (7) as a function of the upper and lower limits of each variable, which prevent the new solution from leaving the bounds of the search problem.

The decisive parameters in the SA search are the decrease of the temperature and the likelihood of acceptance. Four variations of the SA algorithm have been implemented, combining different approaches for calculating these two components. The quantities involved in these expressions are the current iteration, the difference between the best solution and the current solution, the Boltzmann constant, the initial temperature, the number of variables, and the absolute value of the current solution. Since these components introduce a strong randomness in SA, the different variations are expected to produce different results for different groups of cases, which is reflected in the final results.
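The sketch below illustrates one possible instantiation of this loop in Python: a normally distributed perturbation of the current solution in the spirit of (6), clipped to the variable limits, a Boltzmann-style acceptance probability, and a geometric cooling schedule. The step scale, the acceptance rule and the cooling factor are assumptions chosen for illustration, and for brevity the temperature is decreased every iteration instead of using the acceptance-counter rule described above; it is not one of the paper's four SA variations.

import math
import random

def simulated_annealing(evaluate, lower, upper, t0=1.0, t_min=1e-3,
                        cooling=0.95, max_iter=1000):
    """Illustrative SA loop: perturb, evaluate, accept (assumed formulas).

    evaluate(x) returns the cost to minimize; lower and upper are the limits
    of the decision variable, keeping new solutions inside the search space.
    """
    current = random.uniform(lower, upper)
    current_cost = evaluate(current)
    best, best_cost = current, current_cost
    temperature = t0

    for _ in range(max_iter):
        if temperature < t_min:        # stopping criterion: current temperature
            break
        # Perturbation in the spirit of (6): previous solution plus a normally
        # distributed step, scaled (assumption) by the variable range and
        # clipped to [lower, upper]
        step = random.gauss(0.0, 1.0) * (upper - lower) * 0.1
        candidate = min(max(current + step, lower), upper)
        candidate_cost = evaluate(candidate)

        delta = candidate_cost - current_cost
        # Accept better solutions always; worse ones with a Boltzmann-style
        # probability that shrinks as the temperature decreases
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost

        temperature *= cooling         # assumed geometric cooling schedule

    return best, best_cost

A cooling factor closer to 1 keeps the temperature high for longer, matching the remark above that the temperature should decrease slowly so that a large part of the search space is explored.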
3 RESULTS

A historical database based on real data extracted from MIBEL, the Iberian Electricity Market, is used to assess the results of the proposed model. The dataset is composed of the bilateral contracts declared in MIBEL between 1 July 2007 and 31 October 2008 [2].

Table 2 shows the learning results of the proposed model against several benchmark reinforcement learning algorithms, considering two distinct negotiation contexts. The results show that the proposed method achieves the lowest prediction errors in all contexts, as a consequence of its context-aware learning capability.

Table 2: Average prediction errors in different contexts

Context  Algorithm        MAE    MAPE (%)  STD
1        Proposed Model   7.45   9.89      8.98
1        Std. Q-Learning  11.24  16.28     14.04
1        Roth-Erev        10.49  14.87     13.41
1        UCB1             15.36  21.04     18.93
1        EXP3             18.56  24.90     21.39
2        Proposed Model   4.28   6.46      5.89
2        Std. Q-Learning  4.88   7.23      6.68
2        Roth-Erev        4.46   6.83      6.03
2        UCB1             5.89   9.31      8.72
2        EXP3             6.53   10.85     9.38

4 CONCLUSIONS

The proposed model improves the standard Q-Learning algorithm by including a contextual dimension, thus providing a context-aware learning model. A simulated annealing process is also included to accelerate convergence when needed, especially when the number of observations is low. The results show that the proposed model is able to undertake context-aware learning, surpassing the results of several benchmark learning algorithms when it comes to contextual learning.

ACKNOWLEDGEMENTS

This work has received funding from the EU Horizon 2020 research and innovation program under project DOMINOES (grant agreement No 771066), from FEDER Funds through the COMPETE program, and from National Funds through FCT under projects CEECIND/01811/2017 and UIDB/00760/2020.