Feature Selection by Learning Process Evaluation

Cleiton Alves da Silva and Valdinei Freire
Escola de Artes, Ciências e Humanidades, Universidade de São Paulo (EACH USP)
[email protected], [email protected]

ABSTRACT

Approaches to Transfer Learning in Reinforcement Learning aim to improve the learning process, whether in the same problem, through generalization that exploits similarity between states, or across different problems, by transferring knowledge acquired in a source problem to accelerate learning in a target problem. One way of obtaining generalization is state abstraction through feature selection. This article presents the FS-LPE algorithm, which selects a subset of features by designing an agent based on each candidate subset and observing the quality that the subset exhibits during the learning process. The algorithm is iterative: in each round, FS-LPE evaluates a number of feature subsets and assigns a score to each feature, and these scores influence the next round of the algorithm. We propose three approaches to choosing the subsets to evaluate in each round and compare them empirically in a Discrete 2D Soccer Simulator.

Keywords
Reinforcement Learning, Transfer Learning and Feature Selection

1. INTRODUCTION

Problems involving sequential decision making can be solved by Reinforcement Learning (RL) techniques [13], in which learning is based on a trial-and-error strategy: the agent needs no prior information about the environment, but must learn to choose actions that balance exploration and exploitation of the environment. After observing the outcome of each executed action, an agent using an RL algorithm such as Q-learning [21] or SARSA [9] estimates how good a particular action was in a particular state and stores the values that represent states and actions. For the algorithm to converge, the agent must be able to experience the states an infinite number of times. In the tabular versions of the Q-learning and SARSA algorithms, learning consists of estimating a function Q : S × A → R with |S × A| parameters. This technique can become slow and impractical in environments where the size of the state space grows exponentially with the number of variables.

To deal with situations in which the state space is arbitrarily large, the agent must be able to generalize the knowledge gained from its experience and share it among similar states [12], speeding up learning. A common approach to generalization is to use function approximation and aggregate states that have similar characteristics, reducing the number of learned parameters relative to a tabular value function. Also, to reduce the size of the state space and speed up learning in problems where states are described by several features, we need a function approximation based on feature selection methods. In addition, another way to speed up learning is Transfer Learning (TL) [16]: knowledge acquired in a source problem, such as a set of selected features, can be transferred to a target problem to speed up learning there.

Feature selection is a technique used to reduce dimensionality; it consists of detecting, according to an evaluation criterion, the features relevant to the original problem, which usually improves learning, for example by speeding up the learning process. The main approaches to feature selection are Filter, Wrapper and Embedded. The Filter approach estimates a relevance index for each feature and then ranks the features according to a statistical criterion. The Wrapper approach explores the feature space, generates multiple candidate subsets and, after evaluating each subset, selects the best performing one. In the Embedded approach, a learning algorithm performs feature selection during the training process, adjusting the model and the feature selection simultaneously, as in classification trees.

This paper proposes a strategy for feature selection that integrates the Filter and Wrapper approaches in order to discover subsets of features that represent the original learning problem. Through the Filter approach each feature is individually ranked according to its relevance; through the Wrapper approach, feature subsets are evaluated based on sampling. Evaluations are performed during the learning process, selecting features that can speed up learning rather than only considering the power of a subset to represent a quasi-optimal value function. This strategy requires the learning process to be run over and over again. We consider that the experience gained in solving the source problem, in the form of the selected feature subsets, can be transferred and used to improve learning in a more complex target problem.

The remainder of this paper is organized as follows: Section 2 presents background on Reinforcement Learning and Transfer Learning; Section 3 presents state abstraction based on learning-process evaluation and our proposal; Section 4 presents our experiments; Section 5 discusses feature selection in Reinforcement Learning; and Section 6 concludes.

Appears in: Proceedings of the 1st Workshop on Transfer in Reinforcement Learning (TiRL) at the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), A. Costa, D. Precup, M. Veloso, M. Taylor (chairs), May 8–9, 2017, São Paulo, Brazil.
2. REINFORCEMENT LEARNING

The Reinforcement Learning problem considers an agent interacting with an environment, where the agent learns by trial and error, i.e., it has no prior information about the environment. At each time step t, the agent perceives the environment state s_t, chooses an action a_t and receives a reward r_t. The objective of the agent is to accumulate positive rewards while avoiding negative ones.

2.1 SARSA Algorithm

While interacting with the environment, the agent accumulates experience s_0, a_0, r_0, s_1, a_1, r_1, .... An RL algorithm must solve two problems: (i) how to obtain an optimal policy of action, and (ii) how to interact with the environment, i.e., how to choose actions. A policy π maps each possible state s ∈ S into an action a ∈ A, i.e., π : S → A; here, an optimal policy is a policy that maximizes the expected discounted sum of rewards E[Σ_{t=0}^∞ γ^t r_t], where γ ∈ (0, 1) is a discount factor. When acting in the environment to obtain experience, the agent must make a trade-off between exploitation, acting optimally by following the current learned policy to obtain rewards, and exploration, acting randomly to obtain experience that improves the current learned policy.

To obtain an optimal policy, a common approach is to learn a value function Q : S × A → R to evaluate actions. Given a state s, Q(s, a) evaluates the action a, and a policy is obtained by π(s) = arg max_{a∈A} Q(s, a). A well-known algorithm in the literature is SARSA [9]; SARSA updates the value function Q at each time step t by using the tuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}) with

Q(s_t, a_t) ← Q(s_t, a_t) + α_t δ_t,

where δ_t = r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) is the temporal difference and α_t ∈ (0, 1] is a learning rate. If an appropriate trade-off between exploitation and exploration and an appropriate learning rate are applied, then, independently of the initial Q values, SARSA is guaranteed to converge to an optimal policy.

Convergence requires the state-action pairs (s, a) to be experienced by the agent an infinite number of times, so it is necessary to specify an action selection strategy with which the agent decides between exploration, i.e., learning more about actions that have not been experienced enough, and exploitation, i.e., using current knowledge to guarantee rewards. The ε-greedy strategy is a well-known way to handle this trade-off: exploitation is applied with probability 1 − ε_t and a uniform random choice is applied with probability ε_t.
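For concreteness, a minimal Python sketch of tabular SARSA with an ε-greedy policy follows. The environment interface (reset/step returning the next state, the reward and a termination flag) and the fixed hyper-parameter values are illustrative assumptions, not the simulator of Section 4.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # Explore with probability epsilon, otherwise exploit the current estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                      # tabular Q: one entry per (state, action)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # Temporal difference: delta_t = r_t + gamma*Q(s',a') - Q(s,a)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            Q[(s, a)] += alpha * delta          # SARSA update
            s, a = s_next, a_next
    return Q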
2.2 Function Approximation

Although the SARSA algorithm is proved to converge, convergence may take a long time. One technique to accelerate convergence applies function approximation to the value function Q [12, 10, 8]. In a tabular representation of Q values, it is necessary to learn |S| × |A| parameters; function approximation instead considers an analytical function with a vector of parameters θ = (θ_1, ..., θ_n), with |S| × |A| ≫ n, such that Q(s, a) ≈ q(s, a; θ).

One way of obtaining a function approximation is through abstraction, i.e., a function M : S → X that maps the original state space S into a smaller space X with |S| ≫ |X|; then the tabular SARSA algorithm can be used directly on the mapped states. The tile coding approach considers a set of abstractions {M_1, ..., M_{d_S}} and approximates the Q values by the sum of one tabular function per abstraction [13, 22, 11], i.e.,

q(s, a; θ) = Σ_{i=1}^{d_S} Q^i(M_i(s), a),

where each value Q^i(M_i(s), a) is a parameter in the vector θ. The SARSA algorithm can be used by updating each tabular function Q^i by

Q^i(M_i(s_t), a_t) ← Q^i(M_i(s_t), a_t) + (α/d_S) δ_t,

where δ_t = r_t + γ q(s_{t+1}, a_{t+1}; θ) − q(s_t, a_t; θ).
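A minimal sketch of this tile-coding approximation, assuming each mapping M_i is given as a callable from states to abstract states (an assumed interface), is:

from collections import defaultdict

class TileCodingQ:
    """Q(s,a) approximated as a sum of tabular functions, one per abstraction M_i."""
    def __init__(self, mappings, alpha=0.1, gamma=0.99):
        self.mappings = mappings                      # list of callables M_i: state -> abstract state
        self.tables = [defaultdict(float) for _ in mappings]
        self.alpha, self.gamma = alpha, gamma

    def q(self, s, a):
        # q(s,a; theta) = sum_i Q^i(M_i(s), a)
        return sum(t[(m(s), a)] for m, t in zip(self.mappings, self.tables))

    def update(self, s, a, r, s_next, a_next, done=False):
        target = r if done else r + self.gamma * self.q(s_next, a_next)
        delta = target - self.q(s, a)
        # Each tabular function receives a fraction alpha/d_S of the TD error.
        step = self.alpha / len(self.mappings)
        for m, t in zip(self.mappings, self.tables):
            t[(m(s), a)] += step * delta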
2.3 Transfer in RL

Another way to speed up learning in RL is Transfer Learning (TL) [16, 17], in which the agent uses experience gained in a source problem to speed up learning in a similar target problem. The TL problem is formalized when there is a source problem from which knowledge is extracted and a target problem to which the knowledge is transferred. The hypothesis is that reusing the transferred knowledge will help solve the target problem faster. The types of knowledge transferred between problems vary across approaches; commonly transferred are: a value function [15, 6]; a policy [2, 18, 19]; samples of interactions [7]; or a state abstraction [20]. Usually, two assumptions are made about the agent's interaction with the source task: (i) the source task is simpler than the target one, and (ii) the agent interacts with the environment for an arbitrarily long time.

3. DESIGNING STATE ABSTRACTION BASED ON LEARNING-PROCESS EVALUATION

Although state abstraction may accelerate RL, automatically designing an appropriate state abstraction is not a trivial task. Under the assumption that there exists a set of state abstractions F and that the agent learns using a tile coding representation, the problem of designing a state abstraction can be posed as a feature selection problem. Given a state s, by using the state abstractions in F we can represent s by a vector of features (F_1(s), F_2(s), ..., F_{|F|}(s)), where F_i ∈ F and F_i ≠ F_j for i ≠ j. Then, designing an appropriate state abstraction is the same as selecting a subset of features. In this section we present a set of abstraction mappings and a feature selection method that evaluates state abstractions through the learning process.

3.1 Abstraction Mappings

In the experiments reported in Section 4, the SARSA algorithm is combined with the tile coding function approximation: it considers a set of tabular functions Q^i and approximates the Q function by their sum. Each function Q^i considers a different abstraction M_i, enabling knowledge to generalize differently in each function. In this work three types of abstractions, here called mappings, are considered: Canonical, Delta and Joint.

The Canonical mapping contains the direct variables of the state space and represents the original features of the problem. Although in some problems the original features may help in constructing the Q function independently, this is not a general property. Thus, two forms of combination between two Canonical features were considered: Delta and Joint. The pairwise combinations of the features in the Canonical mapping define the number of features of the Delta and Joint mappings: if d is the number of Canonical features, the number of features in each of the Delta and Joint mappings is d(d−1)/2. In the experiments carried out we consider 8 features in the Canonical mapping and 28 features for each of the Delta and Joint mappings.

Assume that a state s is defined by a d-tuple (f_1(s), f_2(s), ..., f_d(s)) and that each function f_i : S → R_i has as its image a finite set R_i. Formally, the mappings are defined as follows:

• The Canonical mapping C uses the features that represent the original state of the problem, that is, F_i^C(s) = f_i(s), with image R_i^C such that |R_i^C| = |R_i|;

• The Delta mapping D is created from the difference between two Canonical features, that is, F_ij^D(s) = f_i(s) − f_j(s), with image R_ij^D such that |R_ij^D| = |R_i| + |R_j| − 1; and

• The Joint mappings J are created from the Cartesian product of two Canonical features, that is, F_ij^J(s) = f_j(s) + |R_j| · f_i(s) − 1, with image R_ij^J such that |R_ij^J| = |R_i| · |R_j|.

Note that the table sizes differ across mapping types: a Delta mapping tends to be twice the size of a Canonical mapping, while a Joint mapping tends to be the square of a Canonical mapping. Also note that the number of Delta and Joint features grows quadratically with the number of Canonical features. While using only Canonical mappings may make learning impossible, using every mapping can slow learning down.
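A sketch of how the three mapping families could be enumerated from the canonical factors follows. It assumes 1-based factor values, and the Joint index uses a standard pairing of the two factor values, which may differ in detail from the exact formula above.

from itertools import combinations

def build_mappings(ranges):
    """Build Canonical, Delta and Joint mappings for a factored state s = (f_1, ..., f_d).
    `ranges` lists |R_i| for each factor; state values are assumed to be 1-based integers."""
    d = len(ranges)
    mappings = []
    # Canonical: F_i(s) = f_i(s)
    for i in range(d):
        mappings.append(lambda s, i=i: s[i])
    # Delta: F_ij(s) = f_i(s) - f_j(s), with |R_i| + |R_j| - 1 possible values
    for i, j in combinations(range(d), 2):
        mappings.append(lambda s, i=i, j=j: s[i] - s[j])
    # Joint: one index per pair (f_i, f_j), with |R_i| * |R_j| possible values
    for i, j in combinations(range(d), 2):
        mappings.append(lambda s, i=i, j=j: s[j] + ranges[j] * (s[i] - 1))
    return mappings

# With d = 8 canonical factors this yields 8 + 28 + 28 = 64 mappings, as in Section 4.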
3.2 Feature Selection through Learning-Process Evaluation

An essential part of the feature selection problem is how to evaluate a set of features. In RL, it is usually considered how well a set of features represents the value function; for example, the value function may be learned with the complete set of features and then a small subset of features is selected that represents that value function well. Here, we propose to evaluate a set of features by the influence it has on the learning process. Two main difficulties arise from this approach: (i) the learning process must be repeated over and over, which is why we consider this feature selection approach only in the knowledge transfer scenario; and (ii) there is not a unique way of evaluating the learning process.

Here we evaluate the learning process by simply measuring the reward accumulated during a learning process of fixed length, but different options may be considered; for example, the performance of the learned policy at the end of the learning process or, when it is possible to evaluate it, the time to convergence. In a sense, accumulated reward combines the performance of the learned policy with the time to convergence, and the weight given to each depends on the length of the learning period: if the period is long, importance is given to the performance of the learned policy; if the period is short, importance is given to the time to convergence. We set the fixed period empirically.

Another point regarding the evaluation of the learning process is stochasticity, which comes from two places: (i) the environment itself is stochastic; and (ii) the trade-off between exploration and exploitation is stochastic. To evaluate a set of features properly, averages over learning-process trials should be taken, which increases feature selection time. To cope with stochasticity we propose a hybrid method that combines the Filter and Wrapper approaches: (i) sets of features are evaluated; (ii) the evaluation of each feature is averaged over the evaluations of the sets in which it participates; and (iii) the sets of features to be evaluated are chosen based on the evaluation of each feature.

Algorithm 1, FS-LPE, shows our solution to the problem of evaluating the various features and subsets. The algorithm considers four parameters: N_F, N_R, N_E and T. FS-LPE runs for N_R rounds; in each round, N_E subsets with N_F features are selected (line 8); for each selected subset S(i), an agent with the state abstraction based on S(i) learns during T steps (line 9) and the accumulated reward is stored (line 10). The accumulated rewards are used to evaluate each feature in the selected subsets by Algorithm 2, UpdateEval (line 12).

Algorithm 1 FS-LPE Algorithm.
1: Input: set of mappings F, number of selected features N_F, number of evaluations per round N_E, number of rounds N_R, number of time steps T
2: Output: subset with the N_F selected features
3: for c = 1 to |F| do
4:   e(c) ← 0
5: endfor
6: for j = 1 to N_R do
7:   for i = 1 to N_E do
8:     Select a subset S(i) ⊂ F with N_F features by considering the probability distribution Pr(F|e)
9:     Use a Reinforcement Learning agent that learns with the features of the subset S(i) for T steps
10:    Store the accumulated reward in r(i)
11:  endfor
12:  e ← UpdateEval(e, r, S)
13: endfor

For each feature in F (lines 7–14), the UpdateEval algorithm accumulates the rewards of the evaluations in which the feature was included (line 9) and of those in which it was not (line 12); then the difference between the average reward when the feature is included and the average reward when it is not is computed (line 20). If a feature never appears, its score is simply set to 0 (line 18); if a feature appears in every subset, its score is set to the best score among all features (line 25). Finally, the evaluation from previous rounds is combined with the score of the current round (line 27).

Algorithm 2 UpdateEval Algorithm.
1: Input: set of mappings F, number of evaluations per round N_E, vector of previous feature evaluations e, vector of accumulated rewards r, set of feature subsets S
2: Output: vector of updated feature evaluations e
3: for c = 1 to |F| do
4:   Diff(c) ← 0, Soma(c) ← 0, Nocor(c) ← 0
5: endfor
6: for i = 1 to N_E do
7:   for c = 1 to |F| do
8:     if c ∈ S(i) then
9:       Soma(c) ← Soma(c) + r(i)
10:      Nocor(c) ← Nocor(c) + 1
11:    else
12:      Diff(c) ← Diff(c) + r(i)
13:    endif
14:  endfor
15: endfor
16: for c = 1 to |F| do
17:   if Nocor(c) = 0 then
18:     m(c) ← 0
19:   else if Nocor(c) < N_E then
20:     m(c) ← Soma(c)/Nocor(c) − Diff(c)/(N_E − Nocor(c))
21:   endif
22: endfor
23: for c = 1 to |F| do
24:   if Nocor(c) = N_E then
25:     m(c) ← max_{c' ∈ {1,...,|F|}} m(c')
26:   endif
27:   e(c) ← e(c) + m(c)
28: endfor
29: return the updated vector of evaluations e
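A direct Python transcription of the scoring step of Algorithm 2 is sketched below; the data structures (dictionaries keyed by feature identifiers) are an illustrative choice, not the authors' implementation.

def update_eval(e, rewards, subsets, all_features):
    """Update feature evaluations e from per-subset accumulated rewards (Algorithm 2, sketched)."""
    soma = {c: 0.0 for c in all_features}    # reward accumulated when c was in the subset
    diff = {c: 0.0 for c in all_features}    # reward accumulated when c was not in the subset
    nocor = {c: 0 for c in all_features}     # number of subsets containing c
    for r, subset in zip(rewards, subsets):
        for c in all_features:
            if c in subset:
                soma[c] += r
                nocor[c] += 1
            else:
                diff[c] += r
    n_eval = len(rewards)
    m = {}
    for c in all_features:
        if nocor[c] == 0:
            m[c] = 0.0                        # never evaluated in this round
        elif nocor[c] < n_eval:
            m[c] = soma[c] / nocor[c] - diff[c] / (n_eval - nocor[c])
    best = max(m.values()) if m else 0.0
    for c in all_features:
        if nocor[c] == n_eval:
            m[c] = best                       # present in every subset: take the best score
        e[c] = e.get(c, 0.0) + m[c]           # accumulate the evaluation across rounds
    return e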
The FS-LPE algorithm must define the probability distribution Pr(F|e); here, we consider three different approaches: Probabilistic Selection, Hard Selection and Uniform Selection.

In the first approach, Probabilistic Selection, we choose a schedule (p_1, p_2, ..., p_{N_R}) with p_j ≥ N_F/|F|, defining the maximum inclusion probability for each round; then, for each round j,

Pr(F_c|e) = N_F · exp(τ e(c)) / Σ_{i=1}^{|F|} exp(τ e(i)),

where τ is chosen so that max_{c ∈ {1,...,|F|}} Pr(F_c|e) = p_j. We recommend choosing a schedule where p_i > p_j if i > j and p_1 = N_F/|F|; then, in the first round all features have equal chances of being drawn, and from the second round on the probability bound increases, giving features with better evaluations more chances of being part of a subset.

In the second approach, Hard Selection, we choose a schedule (n_1, n_2, ..., n_{N_R}) with n_j ∈ {0, 1, ..., N_F − 1}, defining the number of features that are deterministically selected in each round; then, for each round j,

Pr(F_c|e) = 1, if c is an n_j-best feature,
Pr(F_c|e) = (N_F − n_j)/(|F| − n_j), otherwise,

where an n_j-best feature means that c ranks equal to or better than position n_j according to the evaluation e. We recommend choosing a schedule where n_i > n_j if i > j and n_1 = 0; then, in the first round all features have equal chances of being drawn, and from the second round on at least one new feature is hard selected, i.e., it will surely be returned as a selected feature.

In the third approach, Uniform Selection, for every round j we simply set

Pr(F_c|e) = N_F/|F|,

so that every feature has the same chance.
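The sketch below illustrates one possible realization of Probabilistic Selection: τ is calibrated by bisection so that the largest inclusion weight matches p_j, and a subset is then drawn sequentially without replacement. The calibration and sampling procedures are assumptions; the paper only specifies the softmax form and the constraint on the maximum probability.

import math
import random

def probabilistic_weights(e, n_select, p_max, iters=60):
    """Inclusion weights Pr(F_c|e) = N_F * exp(tau*e_c) / sum_i exp(tau*e_i),
    with tau calibrated by bisection so that max_c Pr(F_c|e) ~= p_max."""
    e_top = max(e)
    def probs(tau):
        z = [math.exp(tau * (v - e_top)) for v in e]   # shift for numerical stability
        total = sum(z)
        return [n_select * v / total for v in z]
    lo, hi = 0.0, 1.0
    for _ in range(iters):                             # grow the bracket until p_max is reachable
        if max(probs(hi)) >= p_max:
            break
        hi *= 2.0
    for _ in range(iters):                             # bisection on tau
        mid = 0.5 * (lo + hi)
        if max(probs(mid)) < p_max:
            lo = mid
        else:
            hi = mid
    return probs(hi)

def sample_subset(weights, n_select):
    """Draw n_select distinct features with chances proportional to the inclusion weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(n_select):
        w = [weights[i] for i in remaining]
        pick = random.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Example: 64 mappings, all evaluations equal, first-round bound p_1 = 32/64 = 0.5.
# weights = probabilistic_weights([0.0] * 64, 32, 0.5); subset = sample_subset(weights, 32)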
4. EXPERIMENTS

The experiments were performed in a 2D soccer simulator. The simulator can represent problems of different difficulties, depending on the chosen configuration. The experiments use a configuration with two agents disputing a match: one agent implements the SARSA algorithm with function approximation and the other is a Planning agent. The SARSA algorithm in all experiments uses the parameters α = 0.1, γ = 0.99 and ε = 0.1. The other parameters, defined in Algorithm 1, are: number of features to be selected N_F = 32, number of evaluations per round N_E = 10, learning period T = 10^6 steps, and number of rounds N_R = 6. The size of the field is 11 × 7. The set of mappings has 64 elements (8 Canonical features, 28 Delta mappings and 28 Joint mappings).

4.1 2D Soccer Simulator

We run our experiments in a Discrete 2D Soccer Simulator (D2DSS), implemented in MATLAB. The soccer problem makes it easy to set up multiagent environments; we can choose the number of players on each team and the size of the field. Although a discrete simulator may be less realistic than a continuous one, the problem does not get much easier because of that, and it leaves fewer variables for the RL algorithm to handle, since the discretization is given a priori.

In the D2DSS, the state is fully observable and is factored as follows: 4 factors for the ball and 2 factors for each player. The ball state is represented by its vertical and horizontal coordinates, its velocity and its direction. Each player is represented only by its vertical and horizontal coordinates. For example, if we have one player in team A and one player in team B, then we have 8 factors in total.

In the D2DSS, the agent is supposed to control team A while playing against team B. Each agent can choose among 13 actions: move in each of the four cardinal directions; kick the ball in each of the four cardinal and four intercardinal directions; and try a tackle. Although a kick is only meaningful when the agent has the ball and a tackle is only meaningful when it is near the ball, these actions are also allowed in other situations and then result in no movement for the agent. In our experiments, the agents in team B implement a planning algorithm that considers a deterministic dynamics of the environment, heuristics and a small planning horizon; we call such an agent a Planning Agent. Because no learning happens in Planning Agents, we do not need to be concerned with game-theoretic issues.

Finally, the D2DSS offers many configuration options regarding stochasticity. It is possible to control the chance of a player moving, the chance of the ball moving, the chance of a kick and the chance of a tackle. It is also possible to control the size of the field and the number of players. Two other configurations are set automatically when the size of the field is chosen: the size of the goal and the velocity of the ball after a successful kick.

In the example of Figure 1, the simulator settings define the use of two agents and a field of size 11 × 7. The states are represented by the Canonical mapping, and the scenario of one player versus one player results in the following features:

• f1 - position of the ball on the x-axis (11 values);

• f2 - position of the ball on the y-axis (7 values);

• f3 - direction of the ball (9 values);

• f4 - velocity of the ball (3 values);

• f5 - position of player A on the x-axis (11 values);

• f6 - position of player A on the y-axis (7 values);

• f7 - position of player B on the x-axis (11 values);

• f8 - position of player B on the y-axis (7 values).

In this configuration, the state space contains 12,326,391 states. To generate an environment with a larger state space, to which the selected subsets of features are transferred, the setting that defines the size of the playing field was changed to 33 × 21, resulting in a state space with 8.9859e+09 states.
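The state-space sizes quoted above follow directly from the factor ranges listed for f1 through f8; the short check below reproduces both numbers (the function name and interface are illustrative):

from math import prod

def n_states(width, height, directions=9, speeds=3, players=2):
    # ball: x, y, direction, velocity; each player: x, y
    factors = [width, height, directions, speeds] + [width, height] * players
    return prod(factors)

print(n_states(11, 7))    # 12,326,391 states for the 11 x 7 source field
print(n_states(33, 21))   # 8,985,939,039 ~ 8.9859e+09 states for the 33 x 21 target field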

Figure 1: 2D Soccer Simulator. The figure shows two agents interacting with the environment and the ball. They are positioned in the field according to coordinates on the Cartesian plane, and player A (light grey) learns how to score more goals than team B.

4.2 Results

4.2.1 Feature Selection

Figures 2 and 3 show our results for the FS-LPE algorithm when using Probabilistic Selection with schedule (0.50, 0.59, 0.69, 0.79, 0.89, 0.99); results are averaged over 10 repetitions. Each curve shown in Figure 2 represents the mean of 100 learning curves (10 repetitions times N_E = 10 subsets) for each round; in addition to these curves, the average of 100 learning curves of an agent using all 64 mappings is shown. Figure 3 shows the average of 100 learning curves (10 repetitions of the experiment times 10 repetitions of the learning process with the same best features) of an agent designed with the 32 best features of each round according to the evaluation vector e; the average learning curve of the agent using the 64 available mappings is also displayed for comparison.

Figure 2: Average curves of learning obtained by Probabilistic Selection.

Figure 3: Average learning curves using the best selected features in each round by Probabilistic Selection.

Figures 4 and 5 show our results for the FS-LPE algorithm when using Hard Selection with schedule (0, 5, 11, 16, 21, 26); results are averaged over 10 repetitions. The figures have the same meaning as the corresponding figures for Probabilistic Selection.

Figure 4: Average curves of learning obtained by Hard Selection.

Figure 5: Average learning curves using the best selected features in each round by Hard Selection.

Figures 6 and 7 show our results for the FS-LPE algorithm when using Uniform Selection; results are averaged over 10 repetitions. The figures have the same meaning as the corresponding figures for Probabilistic Selection.

Figure 6: Average curves of learning obtained by Uniform Selection.

Figure 7: Average learning curves using the best selected features in each round by Uniform Selection.

Figure 8 shows the performance of the agent in terms of accumulated reward for each of the approaches: Probabilistic Selection, Hard Selection and Uniform Selection. There are two curves for each approach: one for the subsets that were evaluated during the execution of the FS-LPE algorithm (32 Features), and another for the best features selected in each round (32 Best Features). By the end of the 6 rounds, all of the approaches reached a better result than an agent designed with the 64 features, but Probabilistic Selection and Hard Selection improve faster than Uniform Selection, showing that an appropriate schedule may lead to a better result.

Figure 8: Accumulated rewards for each of the 3 approaches for each round.

4.2.2 Transfer Learning

With the knowledge acquired through the FS-LPE algorithm, in this experiment we evaluate, through transfer of knowledge, the performance of the agent when using the subsets of best features from the source problem in a target problem. The settings for this new problem are: field size 33 × 21, with each agent evaluated during 10^7 steps. The subsets evaluated in this new problem are those obtained through the feature selection process of the three approaches.

The results presented in Figures 9, 10 and 11 relate to the use of the best subsets of features returned by the FS-LPE algorithm using Probabilistic Selection, Hard Selection and Uniform Selection, respectively.

Figure 12 shows the amount of reward received in each round. In the target problem with field size 33 × 21, the agent uses the subsets of features that were selected in and transferred from the source problem. The curve for the 64 available mappings is also presented for comparison.

We note in Figure 12 that Uniform Selection seems to be the most coherent: even though only in the last round did the agent with 32 features present a better result than the agent with 64 features, Uniform Selection shows an ascending behavior through the rounds. Probabilistic Selection presented in every round a worse performance than the agent with 64 features, but in round 6 the agent with 32 features presents a better result in the first half of the steps (see Figure 9). Finally, Hard Selection presented the best performance but with no coherence: its best performance was in round 2, and in round 4 it presented a performance worse than any other method.

Despite this apparently negative transfer result, i.e., the agents with 32 features were not consistently better than the agent with 64 features, the difference was not that large, while only half of the memory and computation is necessary to learn with an agent that has half of the features. An agent may put up with a small loss in performance if a gain in computational cost is obtained.

Figure 9: Average learning curves using the best selected features in each round by Probabilistic Selection in the environment with field size 33 × 21.

Figure 10: Average learning curves using the best selected features in each round by Hard Selection in the environment with field size 33 × 21.

Figure 11: Average learning curves using the best selected features in each round by Uniform Selection in the environment with field size 33 × 21.

Figure 12: Accumulated rewards for each of the 3 approaches for each round in the environment with field size 33 × 21.

5. FEATURE SELECTION IN RL

Feature selection is considered important in an RL environment because it reduces the state space after identifying the relevant features, which can substantially improve the learning performance of the agent on the same problem. Feature selection methods are used to separate relevant from irrelevant features in the original data according to a certain evaluation criterion [3]. These features must adequately represent the learning problem and, in general, feature selection methods are classified into three approaches: Filter, Wrapper and Embedded [14]. The Filter approach applies a statistical measure that gives each feature a score ranking it by relevance; the most commonly used algorithms are Relief [5] and the Fisher score [1]. The Wrapper approach, rather than evaluating each feature individually, evaluates subsets of features; because this is posed as a combinatorial optimization problem, heuristics such as hill climbing [4] and mixed forward and backward selection [23] are used. The Embedded approach describes feature selection as an optimization problem in which the use of features is penalized, and uses constructive techniques such as classification trees or techniques based on regularization.

6. CONCLUSION

The results presented in this paper demonstrate that the proposed feature selection method can find small subsets of features that represent the original problem, substantially improve agent performance and decrease the computational cost in Transfer Learning. By treating the learning process itself as the evaluation of subsets, it is possible to select features that not only adequately represent the value function, but also allow the value function to be represented during the learning process. If sufficient time for interaction with a source environment is available, the method proposed here allows the selection of features that speed up learning.

Although the method proposed here achieves a speed-up in learning, no comparison with methods from the literature was performed. Such a comparison would allow the quality of the selected features to be verified. It is also necessary to better understand the relationship between the selected features, the task used in the learning and the learning algorithm.

In addition to investigating our approach in more detail, some immediate directions may be proposed, in particular how to select better features: since the features were evaluated individually even though they present dependencies and redundancy, a joint assessment of features may bring better results.

REFERENCES

[1] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2001.
[2] F. Fernández and M. Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS '06, pages 720–727, New York, NY, USA, 2006. ACM.
[3] R. Gaudel and M. Sebag. Feature selection as a one-player game. In J. Fürnkranz and T. Joachims, editors, ICML, pages 359–366. Omnipress, 2010.
[4] M. A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 359–366, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[5] K. Kira and L. A. Rendell. A practical approach to feature selection. In Proceedings of the Ninth International Workshop on Machine Learning, ML92, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[6] G. Konidaris. A framework for transfer in reinforcement learning. In Proceedings of the ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
[7] A. Lazaric, M. Restelli, and A. Bonarini. Transfer of samples in batch reinforcement learning. In ICML, volume 307 of ACM International Conference Proceeding Series, pages 544–551. ACM, 2008.
[8] M. Ponsen, M. Taylor, and K. Tuyls. Abstraction and generalization in reinforcement learning: A summary and framework. In M. Taylor and K. Tuyls, editors, Adaptive and Learning Agents, volume 5924 of Lecture Notes in Computer Science, pages 1–32. Springer Berlin Heidelberg, 2010.
[9] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England, 1994.
[10] A. A. Sherstov and P. Stone. Function approximation via tile coding: Automating parameter choice. In J.-D. Zucker and L. Saitta, editors, SARA 2005, volume 3607 of Lecture Notes in Artificial Intelligence, pages 194–205. Springer Verlag, Berlin, 2005.
[11] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[12] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 1038–1044. MIT Press, 1996.
[13] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[14] J. Tang, S. Alelyani, and H. Liu. Feature selection for classification: A review. 2013.
[15] M. E. Taylor and P. Stone. Behavior transfer for value-function-based reinforcement learning. In F. Dignum, V. Dignum, S. Koenig, S. Kraus, M. P. Singh, and M. Wooldridge, editors, The Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 53–59, New York, NY, July 2005. ACM Press.
[16] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.
[17] M. E. Taylor, P. Stone, and Y. Liu. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(1):2125–2167, 2007.
[18] M. E. Taylor, S. Whiteson, and P. Stone. Transfer learning for policy search methods. In ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
[19] L. Torrey, J. W. Shavlik, T. Walker, and R. Maclin. Skill acquisition via transfer learning and advice taking. In ECML, volume 4212 of Lecture Notes in Computer Science, pages 425–436. Springer, 2006.
[20] T. J. Walsh, L. Li, and M. L. Littman. Transferring state abstractions between MDPs. In ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
[21] C. Watkins and P. Dayan. Technical note: Q-learning. In R. Sutton, editor, Reinforcement Learning, volume 173 of The Springer International Series in Engineering and Computer Science, pages 55–68. Springer US, 1992.
[22] S. Whiteson, M. E. Taylor, and P. Stone. Adaptive tile coding for value function approximation. Technical Report AI-TR-07-339, University of Texas at Austin, 2007.
[23] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, 2008.