Rule-Based Reinforcement Learning Augmented by External Knowledge
Nicolas Bougie 1,2, Ryutaro Ichise 1
1 National Institute of Informatics
2 The Graduate University for Advanced Studies, Sokendai
[email protected], [email protected]

Abstract

Reinforcement learning has achieved several successes in sequential decision problems. However, these methods require a large number of training iterations in complex environments. A standard paradigm to tackle this challenge is to extend reinforcement learning to handle function approximation with deep learning. The lack of interpretability and the impossibility of introducing background knowledge limit their usability in many safety-critical real-world scenarios. In this paper, we study how to combine reinforcement learning and external knowledge. We derive a rule-based variant of the Sarsa(λ) algorithm, which we call Sarsa-rb(λ), that augments data with complex knowledge and exploits similarities among states. We apply our method to a trading task from the Stock Market Environment. We show that the resulting algorithm not only leads to much better performance but also improves training speed compared to the Deep Q-learning (DQN) algorithm and the Deep Deterministic Policy Gradients (DDPG) algorithm.

1 Introduction

Over the last few years, reinforcement learning (RL) has made significant progress in learning good policies in many domains. Well-known temporal difference (TD) methods such as Sarsa [Sutton, 1996] or Q-learning [Watkins and Dayan, 1992] learn to predict the best action to take through step-wise interactions with the environment. In particular, Q-learning has been shown to be effective in solving the traveling salesman problem [Gambardella and Dorigo, 1995] or learning to drive a bicycle [Randløv and Alstrøm, 1998]. However, large or continuous state spaces limit their application to simple environments.

Recently, combining advances in deep learning and reinforcement learning has proved to be very successful in mastering complex tasks. A significant example is the combination of neural networks and Q-learning, resulting in "Deep Q-Learning" (DQN) [Mnih et al., 2013], able to achieve human performance on many tasks including Atari video games [Bellemare et al., 2013].

Learning from scratch and lack of interpretability impose some problems on deep reinforcement learning methods. Randomly initializing the weights of a neural network is inefficient. Furthermore, it is often intractable to train the model in many domains due to the large amount of required data. Additionally, most RL algorithms cannot incorporate external knowledge, which limits their performance. Moreover, the impossibility of explaining and understanding the reason for a decision restricts their use to non-safety-critical domains, excluding, for example, medicine or law. An approach to tackle these problems is to combine simple reinforcement learning techniques and external knowledge.

A powerful recent idea to address the problem of computational expense is to modularize the model into an ensemble of experts [Lample and Chaplot, 2017], [Bougie and Ichise, 2017]. The task is divided into a sequence of stages, and a policy is learned for each one. Since each expert focuses on learning one stage of the task, the reduction of the actions to consider leads to a shorter learning period. Although this approach is conceptually simple, it does not handle very complicated environments or environments with a large set of actions.

Another technique, called Hierarchical Learning [Tessler et al., 2017], [Barto and Mahadevan, 2003], is used to solve complex tasks, such as "simulating the human brain" [Lake et al., 2016]. It is inspired by human learning, which uses previous experiences to face new situations. Instead of learning the entire task directly, the agent learns different sub-tasks. By reusing knowledge acquired from previous sub-tasks, the learning is faster and easier. Some limitations are the necessity to re-train the model, which is time-consuming, and problems related to catastrophic forgetting of knowledge about previous tasks. All the previously cited approaches suffer from a lack of interpretability, reducing their usage in critical applications such as autonomous driving.

Another approach, Symbolic Reinforcement Learning [Garnelo et al., 2016], [d'Avila Garcez et al., 2018], combines a system that learns an abstracted representation of the environment with high-order reasoning. However, this has several limitations: it cannot support ongoing adaptation to a new environment and cannot handle external sources of prior knowledge.

This paper demonstrates that a simple reinforcement learning agent can overcome these challenges to learn control policies. Our model is trained with a variant of the Sarsa(λ) algorithm [Singh and Sutton, 1996]. We introduce external knowledge by representing the states as rules. Rules transform the raw data into a compressed and high-level representation. To deal with the problems of training speed and highly fluctuating environments [Dundar et al., ], we use a sub-states mechanism. Sub-states allow a more frequent update of the Q-values, thereby smoothing and speeding up the learning. Furthermore, we adapted eligibility traces, which turned out to be critical in guiding the algorithm to solve tasks.

In order to evaluate our method, we constructed a variety of trading environment simulations based on real stock market data. Our rule-based approach, Sarsa-rb(λ), can learn to trade in a small number of iterations. In many cases, we are able to outperform the well-known Deep Q-learning algorithm in terms of quality of policy and training time. Sarsa-rb(λ) also exhibits higher performance than DDPG [Lillicrap et al., 2015] after converging.

The paper is organized as follows. Section 2 gives an overview of reinforcement learning. Section 3 describes the main contributions of the paper. Section 4 presents the experiments and the results. Section 5 presents the main conclusions drawn from the work.

2 Reinforcement Learning

Reinforcement learning consists of an agent learning a policy π by interacting with an environment. At each time-step the agent receives an observation s_t and chooses an action a_t. The agent gets feedback from the environment called a reward r_t. Given this reward and the observation, the agent can update its policy to improve the future rewards.

Given a discount factor γ, the future discounted reward, called the return R_t, is defined as follows:

    R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}    (1)

where T is the time-step at which the epoch terminates. The goal of reinforcement learning is to learn to select the action with the maximum return R_t achievable for a given observation [Sutton and Barto, 1998]. From Equation (1), we can define the action value Q^π(s, a) at a time t as the expected return when selecting an action a in a given state s_t and following a policy π:

    Q^π(s, a) = E[R_t | s_t = s, a]    (2)

The optimal policy is defined as selecting the action with the optimal Q-value, i.e., the highest expected return, followed by an optimal sequence of actions. This obeys the Bellman optimality equation:

    Q*(s, a) = E[r + γ max_{a'} Q*(s', a') | s, a]    (3)

In temporal difference (TD) learning methods such as Q-learning or Sarsa, the Q-values are updated after each time-step instead of updating the values after each epoch, as happens in Monte Carlo learning.
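As a concrete illustration of the return in Equation (1), the following minimal Python sketch computes the discounted return of a finite reward sequence; the reward values and discount factor are arbitrary examples, not values taken from the paper.

# Minimal sketch of Equation (1): discounted return of a reward sequence.
def discounted_return(rewards, gamma):
    # R_t = sum over t' >= t of gamma^(t' - t) * r_{t'}
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 ≈ 2.62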
2.1 Q-learning algorithm

Q-learning [Watkins and Dayan, 1992] is a common technique to approximate π ≈ π*. The estimation of the action value function is iteratively performed by updating Q(s, a). This algorithm is considered an off-policy method since the update rule does not depend on the policy being followed:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (4)

The choice of the action follows a policy derived from Q. The most common policy, called the ε-greedy policy, trades off the exploration/exploitation dilemma. In case of exploration, a random action is sampled, whereas exploitation selects the action with the highest estimated return. In order to converge to a stable policy, the probability of exploitation must increase over time. An obvious approach to adapting Q-learning to continuous domains is to discretize the state space, leading to an explosion of the number of Q-values. Therefore, a good estimation of the Q-values in this context is often intractable.
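For readers who prefer code, the sketch below shows one tabular Q-learning update with ε-greedy action selection, mirroring Equation (4). The dictionary Q-table, the action set, and the hyperparameter values are illustrative assumptions rather than details from the paper.

import random
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-values; unseen (state, action) pairs default to 0 (assumption)
actions = [0, 1, 2]      # hypothetical discrete action set

def epsilon_greedy(state, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Equation (4): the off-policy target uses the max over actions in the next state.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])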
2.2 Sarsa algorithm

Sarsa is a temporal-difference (TD) control method. The key difference between Q-learning and Sarsa is that Sarsa is an on-policy method. It implies that the Q-values are learned based on the action performed by the current policy instead of a greedy policy. The update rule becomes:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (5)

Algorithm 1 Sarsa: Learn function Q : X × A → ℝ
  procedure SARSA(X, A, R, T, α, γ)
    Initialize Q : X × A → ℝ uniformly
    while Q is not converged do
      Start in state s ∈ X
      Choose a from s using policy derived from Q (e.g., ε-greedy)
      while s is not terminal do
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        Q(s, a) ← Q(s, a) + α · (r + γ · Q(s', a') − Q(s, a))
        s ← s'
        a ← a'
    return Q

Sarsa converges with probability 1 to an optimal policy as long as all state-action pairs are visited an infinite number of times. Unfortunately, it is not possible to straightforwardly apply Sarsa to continuous or large state spaces. Such large spaces are difficult to explore, since accurately estimating the values requires frequent visits to each state, resulting in an inefficient estimation of the Q-values.
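A runnable Python counterpart of Algorithm 1 is sketched below for the tabular case. The Gym-style env.reset()/env.step() interface, the fixed episode budget, and the hyperparameter values are assumptions made for illustration; the paper's Sarsa-rb(λ) extends this basic loop with rules, sub-states, and eligibility traces.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular on-policy Sarsa; unseen (state, action) pairs default to a Q-value of 0.
    Q = defaultdict(float)

    def policy(state):
        # epsilon-greedy policy derived from the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()                    # assumed Gym-style environment interface
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)  # assumed to return (next state, reward, terminal flag)
            a_next = policy(s_next)
            # Equation (5): the on-policy target uses the action actually chosen next.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Compared with the Q-learning update above, the bootstrap term uses Q(s', a') for the action the policy actually takes, which is what makes Sarsa on-policy.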