
Sparse Cooperative Q-learning

Jelle R. Kok [email protected]
Nikos Vlassis [email protected]
Informatics Institute, Faculty of Science, University of Amsterdam, The Netherlands

Abstract

Learning in multiagent systems suffers from the fact that both the state and the action space scale exponentially with the number of agents. In this paper we are interested in using Q-learning to learn the coordinated actions of a group of cooperative agents, using a sparse representation of the joint state-action space of the agents. We first examine a compact representation in which the agents need to explicitly coordinate their actions only in a predefined set of states. Next, we use a coordination-graph approach in which we represent the Q-values by value rules that specify the coordination dependencies of the agents at particular states. We show how Q-learning can be efficiently applied to learn a coordinated policy for the agents in the above framework. We demonstrate the proposed method on the predator-prey domain, and we compare it with other related multiagent Q-learning methods.

(Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.)

1. Introduction

A multiagent system (MAS) consists of a group of agents that can potentially interact with each other (Weiss, 1999; Vlassis, 2003). In this paper, we are interested in fully cooperative multiagent systems in which the agents have to learn to optimize a global performance measure. One of the key problems in such systems is the problem of coordination: how to ensure that the individual decisions of the agents result in jointly optimal decisions for the group.

Reinforcement learning (RL) techniques (Sutton & Barto, 1998) have been applied successfully in many single-agent systems for learning the policy of an agent in uncertain environments. In principle, it is possible to treat a multiagent system as a 'big' single agent and learn the optimal joint policy using standard single-agent learning techniques. However, both the state and action space scale exponentially with the number of agents, rendering this approach infeasible for most problems. Alternatively, we can let each agent learn its policy independently of the other agents, but then the transition model depends on the policy of the other learning agents, which may result in oscillatory behavior.

On the other hand, in many problems the agents only need to coordinate their actions in few states (e.g., two cleaning robots that want to clean the same room), while in the rest of the states the agents can act independently. Even if these 'coordinated' states are known in advance, it is not a priori clear how the agents can learn to act cooperatively in these states. In this paper we describe a multiagent Q-learning technique, called Sparse Cooperative Q-learning, that allows a group of agents to learn how to jointly solve a task when the global coordination requirements of the system (but not the particular action choices of the agents) are known beforehand.

We first examine a compact representation in which the agents learn to take joint actions in a predefined set of states. In all other (uncoordinated) states, we let the agents learn independently. Then we generalize this approach by using a context-specific coordination graph (Guestrin et al., 2002b) to specify the coordination dependencies of subsets of agents according to the current context (dynamically). The proposed framework allows for a sparse representation of the joint state-action space of the agents, resulting in large computational savings.

We demonstrate the proposed technique on the 'predator-prey' domain, a popular multiagent problem in which a number of predator agents try to capture a poor prey. Our method achieves a good trade-off between speed and solution quality.

2. MDPs and Q-learning

In this section, we review the Markov Decision Process (MDP) framework. An observable MDP is a tuple ⟨S, A, T, R⟩ where S is a finite set of world states, A is a set of actions, T : S × A × S → [0, 1] is the Markovian transition function that describes the probability p(s'|s, a) of ending up in state s' when performing action a in state s, and R : S × A → IR is a reward function that returns the reward R(s, a) obtained after taking action a in state s. An agent's policy is defined as a mapping π : S → A. The objective is to find an optimal policy π* that maximizes the expected discounted future reward

U*(s) = max_π E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

for each state s. The expectation operator E[·] averages over reward and stochastic transitions, and γ ∈ [0, 1) is the discount factor. We can also represent this using Q-values which store the expected discounted future reward for each state s and possible action a:

Q*(s, a) = R(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'} Q*(s', a').   (1)

The optimal policy for a state s is the action arg max_a Q*(s, a) that maximizes the expected future discounted reward.

Reinforcement learning (RL) (Sutton & Barto, 1998) can be applied to estimate Q*(s, a). Q-learning is a widely used learning method when the transition and reward model are unavailable. This method starts with an initial estimate Q(s, a) for each state-action pair. When an exploration action a is taken in state s, reward R(s, a) is received and next state s' is observed, the corresponding Q-value is updated by

Q(s, a) := Q(s, a) + α [ R(s, a) + γ max_{a'} Q(s', a') − Q(s, a) ]   (2)

where α ∈ (0, 1) is an appropriate learning rate. Under certain conditions, Q-learning is known to converge to the optimal Q*(s, a) (Watkins & Dayan, 1992).
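To make the tabular Q-learning update of Eq. (2) concrete, here is a minimal sketch of one training loop; the environment interface (env.reset, env.step, env.actions) and all parameter values are illustrative assumptions, not part of the original paper.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch for Eq. (2); the environment interface
# and all parameter values are illustrative assumptions.
def q_learning(env, num_episodes=1000, alpha=0.3, gamma=0.9, epsilon=0.2):
    Q = defaultdict(float)              # Q[(state, action)], initialized to 0

    def greedy(state):
        # Pick the action with the highest current Q-value in this state.
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, reward, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions)
            # Eq. (2): Q(s,a) := Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```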

3. Multiagent Q-learning

The framework discussed in the previous section only involves single agents. In this work, we are interested in systems in which multiple agents, each with their own set of actions, have to collaboratively solve a task. A collaborative multiagent MDP (Guestrin, 2003) extends the single agent MDP framework to include multiple agents whose joint action impacts the state transition and the received reward. Now, the transition model T : S × A × S → [0, 1] represents the probability p(s'|s, a) the system will move from state s to s' after performing the joint action a ∈ A = ×_{i=1}^n A_i, and R_i : S × A → IR is the reward function that returns the reward R_i(s, a) for agent i after the joint action a is taken in state s. As global reward function R(s, a) = Σ_{i=1}^n R_i(s, a) we take the sum of all individual rewards received by the n agents. This framework differs from a stochastic game (Shapley, 1953) in that each agent wants to maximize social welfare (sum of all payoffs) instead of its own payoff.

Within this framework different choices can be made which affect the problem description and possible solution concepts, e.g., whether the agents are allowed to communicate, whether they observe the selected joint action, whether they perceive the individual rewards of the other agents, etc. In our case we assume that the agents are allowed to communicate and thus are able to share individual actions and rewards. Before we discuss our approach, we first describe two other learning methods for environments with multiple agents.

3.1. MDP Learners

In principle, a collaborative multiagent MDP can be regarded as one large single agent in which each joint action is represented as a single action. The optimal Q-values for the joint actions can then be learned using standard single-agent Q-learning. In order to apply this MDP learners approach, a central controller models the complete MDP and communicates to each agent its individual action, or all agents model the complete MDP separately and select the individual action that corresponds to their own identity. In the latter case, no communication is needed between the agents but they all have to observe the joint action and all individual rewards. Moreover, the problem of exploration can be solved by using the same random number generator (and the same seed) for all agents (Vlassis, 2003). Although this approach leads to the optimal solution, it is infeasible for problems with many agents since the joint action space, which is exponential in the number of agents, becomes intractable.

3.2. Independent Learners

At the other extreme, we have the independent learners (IL) approach (Claus & Boutilier, 1998) in which the agents ignore the actions and rewards of the other agents in the system, and learn their strategies independently. The standard convergence proof for Q-learning does not hold in this case, since the transition model depends on the unknown policy of the other learning agents. Despite the lack of guaranteed convergence, this method has been applied successfully in multiple cases (Tan, 1993; Sen et al., 1994).
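The practical difference between the two baselines is mainly one of table size: one table over joint actions versus one table per agent. The following small illustration uses hypothetical sizes (not taken from the paper) to make the gap explicit.

```python
from itertools import product

# Illustrative comparison of the two table layouts (not from the paper).
# Hypothetical sizes: n agents, |S| states, |A_i| actions per agent.
n_agents, n_states, n_actions = 3, 100, 5

# MDP learners: one table over joint actions, size |S| * |A_i|^n.
joint_actions = list(product(range(n_actions), repeat=n_agents))
Q_joint = {(s, a): 0.0 for s in range(n_states) for a in joint_actions}

# Independent learners: one table per agent, each of size |S| * |A_i|.
Q_indep = [{(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}
           for _ in range(n_agents)]

print(len(Q_joint))                   # 100 * 5**3 = 12500 entries
print(sum(len(q) for q in Q_indep))   # 3 * 100 * 5  = 1500 entries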

Figure 1. Graphical representation of the Q-tables in the case of three agents A1, A2, and A3. State s and s'' are coordinated states, while state s' is an uncoordinated state.

4. Context-Specific Q-learning

In many problems, agents only have to coordinate their actions in a specific context (Guestrin et al., 2002b). For example, two cleaning robots only have to take care that they do not obstruct each other when they are cleaning the same room. When they work in two different rooms, they can work independently.

In this section, we describe a reinforcement learning method which explicitly models these types of context-specific coordination requirements. The main idea is to learn joint action values only in those states where the agents actually need to coordinate their actions. We create a sparse representation of the joint state-action space by specifying in which states the agents do (and in which they do not) have to coordinate their actions. During learning the agents apply the IL method in the uncoordinated states and the MDP learners approach in the coordinated states. Since in practical problems the agents typically need to coordinate their actions only in few states, this framework allows for a sparse representation of the complete action space, resulting in large computational savings.

Because of the distinction in action types for different states, we also have to distinguish between different representations for the Q-values. Each agent i maintains a single-action value table Q_i(s, a_i) for the uncoordinated states, and one joint action value table Q(s, a) for the coordinated states. In the coordinated states the global Q-value Q(s, a) directly relates to the shared joint Q-table. In the uncoordinated states, we assume that the global Q-value is the sum of all individual Q-values:

Q(s, a) = Σ_{i=1}^n Q_i(s, a_i).   (3)

When the agents observe a state transition, values from the different Q-tables are combined in order to update the Q-values. There are four different situations that must be taken into account. When moving between two coordinated or between two uncoordinated states, we respectively apply the MDP Learners and IL approach. In the case that the agents move from a coordinated state s to an uncoordinated state s' we back up the individual Q-values to the joint Q-value by

Q(s, a) := (1 − α) Q(s, a) + α Σ_{i=1}^n [ R_i(s, a) + γ max_{a'_i} Q_i(s', a'_i) ].   (4)

Conversely, when moving from an uncoordinated state s' to a coordinated state s'' we back up the joint Q-value to the different individual Q-values by

Q_i(s', a_i) := (1 − α) Q_i(s', a_i) + α [ R_i(s', a_i) + γ (1/n) max_{a'} Q(s'', a') ].   (5)

That is, in this case each agent is rewarded with the same fraction of the expected future discounted reward from the resulting coordinated state. This essentially implies that each agent contributes equally to the coordination.

Fig. 1 shows a graphical representation of the transition between three states for a problem involving three agents. In state s the agents have to coordinate their actions and use the shared Q-table to determine the joint action. After taking the joint action a and observing the transition to the uncoordinated state s', the joint action Q-value Q(s, a) is updated using Eq. (4). Similarly, in s' each agent i chooses its action independently and after moving to state s'' updates its individual Q-value Q_i using Eq. (5).

In terms of implementation, the shared Q-table can be either stored centrally (and the agents should have access to this shared resource) or updated identically by all individual agents. Note that in the latter case the agents rely on (strong) common knowledge assumptions about the observed actions and rewards of the other agents. Furthermore, all agents have to coordinate their actions in a coordinated state. In the remainder of this paper, we will discuss a coordination-graph approach which is a generalization of the approach described in this section. In that framework the coordination requirements are specified over subsets of agents and the global Q-value is distributed among the different agents. Before we discuss this generalized approach, we first review the notion of a context-specific coordination graph.
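A minimal sketch of the four transition cases of this section, assuming hypothetical helper names and table layouts (a coordinated(s) predicate, a shared joint table Q_joint and individual tables Q_ind); it illustrates the bookkeeping of Eqs. (2)-(5), not the authors' implementation.

```python
# Sketch of the four transition cases of section 4 (hypothetical helpers;
# Q_joint[s][a] is the shared joint table, Q_ind[i][s][a_i] the individual tables).
def update(s, a, rewards, s_next, best_joint_next, best_ind_next,
           coordinated, Q_joint, Q_ind, alpha=0.3, gamma=0.9, n=2):
    if coordinated(s) and coordinated(s_next):
        # MDP learners update on the shared joint table (Eq. (2) over joint actions).
        Q_joint[s][a] += alpha * (sum(rewards) + gamma * best_joint_next - Q_joint[s][a])
    elif not coordinated(s) and not coordinated(s_next):
        # Independent learners update for every agent (Eq. (2) per agent).
        for i in range(n):
            Q_ind[i][s][a[i]] += alpha * (rewards[i] + gamma * best_ind_next[i]
                                          - Q_ind[i][s][a[i]])
    elif coordinated(s) and not coordinated(s_next):
        # Eq. (4): back up the individual Q-values into the joint Q-value.
        backup = sum(rewards[i] + gamma * best_ind_next[i] for i in range(n))
        Q_joint[s][a] = (1 - alpha) * Q_joint[s][a] + alpha * backup
    else:
        # Eq. (5): back up an equal share of the joint Q-value into each agent.
        for i in range(n):
            Q_ind[i][s][a[i]] = ((1 - alpha) * Q_ind[i][s][a[i]]
                                 + alpha * (rewards[i] + gamma * best_joint_next / n))
```

Here best_joint_next stands for max_{a'} Q(s_next, a') over joint actions and best_ind_next[i] for max_{a'_i} Q_i(s_next, a'_i); both are assumed to be computed by the caller.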

5. Context-Specific Coordination Graphs

A context-specific coordination graph (CG) represents a dynamic (context-dependent) set of coordination requirements of a multiagent system (Guestrin et al., 2002b). If A1, . . . , An is a group of agents, then a node of the CG in a given context represents an agent A_i, while an edge defines a dependency between two agents. Only interconnected agents have to coordinate their actions at any time step. For example, the left graph in Fig. 2 shows a CG for a 4-agent problem in which agent A3 has to coordinate with A2, A4 has to coordinate with A3, and A1 has to coordinate with both A2 and A3. If the global payoff function is decomposed as a sum of local payoff functions, a CG replaces a global coordination problem by a number of local coordination problems that involve fewer agents, which can be solved in a distributed manner using a message passing scheme.

In (Guestrin et al., 2002b) the global payoff function is distributed among the agents using a set of value rules. These are propositional rules of the form ⟨ρ; c : v⟩, where c (the context) is an element from the set of all possible combinations of the state and action variables, c ∈ C ⊆ S ∪ A, and ρ(c) = v ∈ IR is a payoff that is added to the global payoff when c holds. By definition, two agents are neighbors in the CG if and only if there is a value rule ⟨ρ; c : v⟩ that contains the actions of these two agents in c. Clearly, the set of value rules forms a sparse representation of the global payoff function since not all state and action combinations have to be defined.

Fig. 2 shows an example of a context-specific CG, where for simplicity all actions and state variables are assumed binary¹. In the left graph we show the initial CG together with the corresponding set of value rules. Note that agents involved in the same rules are neighbors in the graph. In the center we show how the value rules, and therefore the CG, are updated after the agents condition on the current context (the state s = true). Based on this information about state s, rule ρ5 is irrelevant and is removed. As a consequence, the optimal joint action is independent of A4 and its edge is deleted from the graph as shown in the center of Fig. 2.

Figure 2. Initial CG with its value rules (left), after conditioning on the context s = true (center), and after elimination of A3 (right).

In order to compute the optimal joint action (with maximum total payoff) in a CG, a variable elimination algorithm can be used, which we briefly illustrate in the example of Fig. 2. After the agents have conditioned on the context (center figure), the agents are eliminated from the graph one by one. Let us assume that we first eliminate A3. This agent first collects all rules in which it is involved, these are ⟨a1 ∧ a3 : 4⟩⟨a3 ∧ a2 : 5⟩. Next, for all possible actions of A1 and A2, agent A3 determines its conditional strategy, in this case equal to ⟨a2 : 5⟩⟨a1 ∧ a2 : 4⟩, and is then eliminated from the graph. The algorithm continues with agent A2 which computes its conditional strategy ⟨a1 : 11⟩⟨a1 : 5⟩, and is then also eliminated. Finally, A1 is the last agent left and fixes its action to a1. Now a second pass in the reverse order is performed, where each agent distributes its strategy to its neighbors, who then determine their final strategy. This results in the optimal joint action {a1, a2, a3} with a global payoff of 11.

¹ Action a1 corresponds to a1 = true and action ā1 to a1 = false.
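The conditioning and elimination steps above can be sketched compactly as follows; the rule encoding, the helper names, and the fixed elimination order are our own illustrative assumptions, and the small example at the bottom is hypothetical rather than the rules of Fig. 2, whose negated literals are not recoverable here.

```python
from itertools import product

# Illustrative variable elimination over value rules with binary actions; a rule
# is (scope, value), where scope maps an agent name to the action it requires.
# The rule encoding and the fixed elimination order are our own assumptions.
def applies(scope, assignment):
    return all(assignment.get(var) == val for var, val in scope.items())

def eliminate(rules, order):
    best_responses = {}                      # agent -> (neighbors, response table)
    for agent in order:
        involved = [r for r in rules if agent in r[0]]
        rules = [r for r in rules if agent not in r[0]]
        neighbors = sorted({v for scope, _ in involved for v in scope} - {agent})
        table = {}
        for values in product([True, False], repeat=len(neighbors)):
            ctx = dict(zip(neighbors, values))
            # Best response of `agent` to this joint action of its neighbors.
            payoff, action = max(
                (sum(v for scope, v in involved
                     if applies(scope, {**ctx, agent: act})), act)
                for act in (True, False))
            table[values] = action
            if payoff:
                # The eliminated agent leaves behind a new rule over its neighbors.
                rules.append((ctx, payoff))
        best_responses[agent] = (neighbors, table)
    joint = {}
    for agent in reversed(order):            # second pass, in reverse order
        neighbors, table = best_responses[agent]
        joint[agent] = table[tuple(joint[v] for v in neighbors)]
    return joint

# Hypothetical 3-agent example (not the rules of Fig. 2).
rules = [({'A1': True, 'A2': True}, 5),
         ({'A2': True, 'A3': True}, 4),
         ({'A3': False}, 2)]
print(eliminate(rules, ['A3', 'A2', 'A1']))  # -> {'A1': True, 'A2': True, 'A3': True}
```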
6. Sparse Cooperative Q-learning

The method discussed in section 4 defined a state either as a coordinated state in which all agents coordinate their actions, or as an uncoordinated state in which all agents act independently. However, in many situations only some of the agents have to coordinate their actions. In this section we describe Sparse Cooperative Q-learning, which allows a group of agents to learn how to coordinate based on a predefined coordination structure that can differ between states.

As in (Guestrin et al., 2002b) we begin by distributing the global Q-value among the different agents. Every agent i is associated with a local value function Q_i(s, a) which only depends on a subset of all possible state and action variables. The global Q-value equals the sum of the local Q-values of all n agents:

Q(s, a) = Σ_{i=1}^n Q_i(s, a).   (6)

Figure 3. Example representation of the Q-components of three agents for a transition from state s to state s'.

Suppose that an exploration joint action a is taken from state s, each agent receives reward R_i(s, a), and next state s' is observed. Based on the decomposition (6) the global Q-learning update rule now reads

Σ_{i=1}^n Q_i(s, a) := Σ_{i=1}^n Q_i(s, a) + α [ Σ_{i=1}^n R_i(s, a) + γ max_{a'} Q(s', a') − Σ_{i=1}^n Q_i(s, a) ].   (7)

Using the variable elimination algorithm discussed in section 5, the agents can compute the optimal joint action a* = arg max_{a'} Q(s', a') in state s', and from this compute their contribution Q_i(s', a*) to the total payoff Q(s', a*), as we will show next. This allows the above update to be decomposed locally for each agent i:

Q_i(s, a) := Q_i(s, a) + α [ R_i(s, a) + γ Q_i(s', a*) − Q_i(s, a) ].   (8)
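Since each agent only needs its own reward and its own share Q_i(s', a*) of the next state's payoff, the decomposed update of Eq. (8) is straightforward to write down. A minimal sketch, assuming these contributions have already been computed (e.g., from the applicable value rules):

```python
# Sketch of the decomposed update of Eq. (8); `q_local[i]` and `q_next_star[i]`
# stand for Q_i(s, a) and Q_i(s', a*) and are assumed to be computed elsewhere.
def local_updates(q_local, rewards, q_next_star, alpha=0.3, gamma=0.9):
    return [q + alpha * (r + gamma * q_star - q)
            for q, r, q_star in zip(q_local, rewards, q_next_star)]

# Summing the updated locals reproduces the global update of Eq. (7),
# since the contributions Q_i(s', a*) sum to max_a' Q(s', a').
q_new = local_updates(q_local=[1.0, 2.0], rewards=[0.5, -0.5], q_next_star=[1.5, 0.5])
print(q_new, sum(q_new))
```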

We still have to discuss how the local Q-functions are represented. In our notation, we use the value rule representation of section 5 to specify the coordination requirements between the agents for a specific state. This is a much richer representation than the IL-MDP variants since it allows us to represent all possible dependencies between the agents in a context-specific manner. Every Q_i(s, a) depends on those value rules that are consistent with the given state-action pair (s, a) and in which agent i is involved:

Q_i(s, a) = Σ_j ρ_j^i(s, a) / n_j,   (9)

where n_j is the number of agents (including agent i) involved in rule ρ_j^i.

Such a representation for Q_i(s, a) can be regarded as a linear expansion into a set of basis functions ρ_j^i, each of them peaked on a specific state-action context which may potentially involve many agents. The 'weights' of these basis functions (the rules' values) can then be updated as follows:

ρ_j(s, a) := ρ_j(s, a) + α Σ_{i=1}^{n_j} [ R_i(s, a) + γ Q_i(s', a*) − Q_i(s, a) ],   (10)

where we add the contribution of each agent involved in the rule.

As an example, assume we have the following set of value rules:

⟨ρ1 ; a1 ∧ s : v1⟩
⟨ρ2 ; a1 ∧ a2 ∧ s : v2⟩
⟨ρ3 ; a1 ∧ a2 ∧ s' : v3⟩
⟨ρ4 ; a1 ∧ a2 ∧ s : v4⟩
⟨ρ5 ; a2 ∧ a3 ∧ s : v5⟩
⟨ρ6 ; a3 ∧ s' : v6⟩

Furthermore, assume that a = {a1, a2, a3} is the performed joint action in state s and a* = {a1, a2, a3} is the optimal joint action found with the variable elimination algorithm in state s'. After conditioning on the context, the rules ρ1 and ρ5 apply in state s, whereas the rules ρ3 and ρ6 apply in state s'. This is graphically depicted in Fig. 3. Next, we use Eq. (10) to update the value rules ρ1 and ρ5 in state s as follows:

ρ1(s, a) = v1 + α [ R1(s, a) + γ v3/2 − v1/1 ]
ρ5(s, a) = v5 + α [ R2(s, a) + γ v3/2 − v5/2 + R3(s, a) + γ v6/1 − v5/2 ].

Note that in order to update ρ5 we have used the (discounted) Q-values Q2(s', a*) = v3/2 and Q3(s', a*) = v6/1. Furthermore, the component Q2 in state s' is based on a coordinated action of agent A2 with agent A1 (rule ρ3), whereas in state s agent A2 has to coordinate with agent A3 (rule ρ5).
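The following sketch mirrors this worked example using Eqs. (9) and (10); the dictionary-based rule encoding, the reward values, and the placeholder payoffs standing in for v1, v3, v5, v6 are illustrative assumptions.

```python
# Sketch of Eqs. (9) and (10) on the worked example above. Which rules apply in
# (s, a) and (s', a*) is given directly, matching the text (rho1, rho5 in s;
# rho3, rho6 in s'). All numbers are placeholders.
rules = {
    'rho1': {'agents': [1],    'value': 2.0},   # v1
    'rho3': {'agents': [1, 2], 'value': 4.0},   # v3
    'rho5': {'agents': [2, 3], 'value': 6.0},   # v5
    'rho6': {'agents': [3],    'value': 1.0},   # v6
}
applies_s      = ['rho1', 'rho5']      # rules consistent with (s, a)
applies_s_next = ['rho3', 'rho6']      # rules consistent with (s', a*)
rewards = {1: 0.0, 2: 0.0, 3: 37.5}    # placeholder rewards R_i(s, a)
alpha, gamma = 0.3, 0.9

def q_i(agent, applying):
    # Eq. (9): the agent's equal share of every applying rule it is involved in.
    return sum(rules[name]['value'] / len(rules[name]['agents'])
               for name in applying if agent in rules[name]['agents'])

deltas = {}
for name in applies_s:                 # Eq. (10): update the rules applying in s
    deltas[name] = alpha * sum(
        rewards[i] + gamma * q_i(i, applies_s_next) - q_i(i, applies_s)
        for i in rules[name]['agents'])
for name, delta in deltas.items():
    rules[name]['value'] += delta

print({name: round(r['value'], 3) for name, r in rules.items()})
```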

7. Experiments

In this section, we apply our method to a predator-prey problem in which the goal of the predators is to capture a prey as fast as possible in a discrete grid-like world (Kok & Vlassis, 2003). We concentrate on a coordination problem in which two predators in a 10×10 toroidal grid have to capture a single prey. Each agent can move to one of its adjacent cells or remain on its current position. The prey is captured when both predators are located in a cell adjacent to the prey and only one of the two agents moves to the location of the prey. A possible capture situation is depicted in Fig. 4. When the two predators move to the same cell or a predator moves to the prey position without a nearby predator, they are penalized and placed on random positions on the field. The policy of the prey is fixed: it stays on its current position with a probability of 0.2, and in all other cases it moves to one of its free adjacent cells with uniform probability.

The complete state-action space for this problem consists of all combinations of the two predator positions relative to the prey and the joint action of the two predators (almost 250,000 states). However, in many of these states the predators do not have to coordinate their actions. Therefore, we first initialize each predator with a set of individual value rules which do not include the state and action of the other predator. An example rule is defined as

⟨ρ1 ; prey(−3, −3) ∧ a1 = move none : 75⟩.

The payoffs of all value rules are initialized with a value of 75 which corresponds to the maximal reward at the end of an episode. This ensures that the predators explore all possible action combinations sufficiently.

Next, the specific coordination requirements between the two predators are added. Since the predators only have to coordinate their actions when they are close to each other, we add extra value rules, depending on the joint action, for the following situations:

• the (Manhattan) distance to the other predator is smaller than or equal to two cells
• both predators are within a distance of two cells to the prey

The value rule for which the prey is captured in the situation of Fig. 4 looks as

⟨ρ2 ; prey(0, −1) ∧ pred(1, −1) ∧ a1 = move none ∧ a2 = move west : 75⟩.
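One possible way to encode such predator-prey value rules as data, so that the rules applying to a given state-action pair can be collected for Eqs. (9) and (10), is sketched below; the field names and matching logic are our own illustration, not the authors' implementation.

```python
# Illustrative encoding of predator-prey value rules (not the paper's code).
# Contexts are relative positions and (possibly joint) actions; a rule applies
# when every condition it mentions matches the current state-action pair.
def make_rule(value=75.0, prey=None, pred=None, a1=None, a2=None):
    return {'value': value, 'prey': prey, 'pred': pred, 'a1': a1, 'a2': a2}

def rule_applies(rule, prey_rel, pred_rel, a1, a2):
    checks = [('prey', prey_rel), ('pred', pred_rel), ('a1', a1), ('a2', a2)]
    return all(rule[key] is None or rule[key] == obs for key, obs in checks)

# The two rules quoted in the text, for the first predator.
rho1 = make_rule(prey=(-3, -3), a1='move_none')                    # individual rule
rho2 = make_rule(prey=(0, -1), pred=(1, -1),
                 a1='move_none', a2='move_west')                   # coordinated rule

# In the capture situation of Fig. 4 only the coordinated rule applies.
print(rule_applies(rho1, (0, -1), (1, -1), 'move_none', 'move_west'))   # False
print(rule_applies(rho2, (0, -1), (1, -1), 'move_none', 'move_west'))   # True
```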

Figure 4. Possible capture position for two predators.

This results in the generation of 31,695 value rules for the first predator (31,200 for the 1,248 coordinated states and 495 for the 99 uncoordinated states²). The second predator holds only a set of 495 rules for the uncoordinated states since its action is based on the rules from the other predator in the coordinated states.

During learning we use Eq. (10) to update the payoffs of the rules. Each predator i receives a reward Ri = 37.5 when it helps to capture the prey and a negative reward of −50.0 when it collides with another predator. When an agent moves to the prey without support the reward is −5.0. In all other cases the reward is −0.5. We use an ε-greedy exploration step of 0.2, a learning rate α of 0.3, and a discount factor γ of 0.9.

We compare our method to the two Q-learning methods mentioned in section 3. In case of the independent learners, each Q-value is derived from a state that consists of both the position of the prey and the other predator and one of the five possible actions. This corresponds to 48,510 (= 99 · 98 · 5) different state-action pairs for each agent. For the MDP Learners we model the system as a complete MDP with the joint action represented as a single action. In this case, the number of state-action pairs equals 242,550 (= 99 · 98 · 5²).

Fig. 5 shows the capture times for the learned policy during the first 500,000 episodes for the different methods. The results are generated by running the current learned policy after each interval of 500 episodes five times on a fixed set of 100 starting configurations. During these 500 test episodes no exploration actions were performed. This was repeated for 10 different runs. The 100 starting configurations were selected randomly beforehand and were used during all 10 runs.

Both the independent learners and our proposed method learn quickly in the beginning with respect to the MDP learners since learning is based on fewer state-action pairs. However, the independent learners do not converge to a single policy but keep oscillating. This is caused by the fact that they do not take the action of the other agent into account. When both predators are located next to the prey and one predator moves to the prey position, this predator is not able to distinguish between the situation where the other predator remains on its current position or performs one of its other actions (e.g., an exploration action). In the first case a positive reward is returned, while in the second case a large negative reward is received. However, in both situations the same Q-value is updated.

² Note that creating value rules based on the full state information, and only decomposing the action space, would result in 8,454 (= 99 · 98 − 1,248) uncoordinated states.

Table 1. Average capture time after learning (averaged over 5,000 episodes) and the number of state-action pairs for the different methods.

Method                  avg. time    #Q-values
Independent learners    16.86        97,020
Manual policy           14.43        -
Sparse Cooperative      13.14        32,190
MDP Learners            12.87        242,550
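As a quick sanity check, the rule and state-action counts quoted above and in Table 1 follow directly from a few assumed quantities (99 relative prey positions, 98 relative positions of the other predator, 5 actions per predator, 1,248 coordinated states):

```python
# Small sanity check of the counts quoted in the text and in Table 1.
prey_pos, other_pos, actions = 99, 98, 5
coordinated_states, uncoordinated_states = 1248, 99

individual_rules = uncoordinated_states * actions        # 495
coordinated_rules = coordinated_states * actions ** 2    # 31,200
print(individual_rules + coordinated_rules)              # 31,695 (first predator)
print(coordinated_rules + 2 * individual_rules)          # 32,190 (Table 1, Sparse Cooperative)

print(prey_pos * other_pos * actions)                    # 48,510 per independent learner
print(2 * prey_pos * other_pos * actions)                # 97,020 (Table 1, Independent learners)
print(prey_pos * other_pos * actions ** 2)               # 242,550 (Table 1, MDP Learners)
```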


Figure 5. Capture times for the learned policy for the four different methods during the first 500,000 episodes. Results are averaged over 10 runs.

These coordination dependencies are explicitly taken into account for the two other approaches. For the MDP learners, they are modeled in every state which results in a slowly decreasing learning curve; it takes longer before all state-action pairs are explored. The context-specific approach has a quicker decreasing learning curve since only joint actions are considered for these coordinated states. As we see from Fig. 5, both methods result in an almost identical policy.

Table 1 shows the average capture times for the different approaches for the last 10 test runs from Fig. 5 and a manual implementation in which both predators first minimize the distance to the prey and then wait till both predators are located next to the prey. When both predators are located next to the prey, social conventions based on the relative positioning are used to decide which of the two predators moves to the prey position.

The context-specific learning approach converges to a slightly higher capture time than that of the MDP Learners. An explanation for this small difference is the fact that not all necessary coordination requirements are added as value rules. In our construction of value rules we assume that the agents do not have to coordinate when they are located far away from each other, but already coordinating in these states might have a positive influence on the final result. These constraints could be added as extra value rules, but then the learning time would increase with the increased state-action space. Clearly, a trade-off exists between the expressiveness of the model and the learning time.

8. Discussion and Conclusions

In this paper we discussed a Q-learning approach for cooperative multiagent systems that is based on context-specific coordination graphs, and in which value rules specify the coordination requirements of the system for a specific context. These rules can be regarded as a sparse representation of the complete state-action space, since they are defined over a subset of all state and action variables. The value of each rule contributes additively to the global Q-value and is updated based on a Q-learning rule that adds the contribution of all involved agents in the rule. Effectively, each agent learns to coordinate only with its neighbors in a dynamically changing coordination graph. Results in the predator-prey domain show that our method improves the learning time of other multiagent Q-learning methods, and performs comparably to the optimal policy.

Our approach is closely related to the coordinated reinforcement learning approach of (Guestrin et al., 2002a). In their approach the global Q-value is also represented as the sum of local Q-functions, and each local Q-function assumes a parametric function representation. The main difference with our work is that they update the weights of each local Q-value (of each agent) based on the difference between the global Q-values (over all agents) of the current and (discounted) next state (plus the immediate rewards). In our approach, the update of the Q-function of an agent is based only on the rewards and Q-values of its neighboring agents in the graph.
This can be advantageous when subgroups of agents need to separately coordinate their actions. From this perspective, our local Q-learning updates seem closer in spirit to the local Sarsa updates of (Russell & Zimdars, 2003).

Another related approach is the work of (Schneider et al., 1999) in which each agent updates its local Q-value based on the Q-value of its neighboring nodes. A weight function f(i, j) determines how much the Q-value of an agent j contributes to the update of the Q-value of agent i. Just as in our approach, this function defines a graph structure of agent dependencies. However, these dependencies are fixed throughout the learning process (although they mention the possibility of a dynamically changing f). Moreover, in their approach Q-learning involves back-propagating averages of individual Q-values, whereas in our case Q-learning involves back-propagating individual components of joint Q-values. We applied their distributed value function approach to our predator-prey problem with a weighting function that averaged the value evenly over the two agents. However, the policy did not converge and oscillated around an average capture time of 33.25 cycles, since the agents also affect each other in the uncoordinated states. For instance, an agent ending up in a low-valued state after taking an exploratory action negatively influences the individual action taken by the other agent.

There are several directions for future work. In our current implementation we have assumed that all the agents contribute equally to the rules in which they are involved (see Eq. (9)). We would like to investigate the consequence of this choice. Furthermore, we would like to apply our approach to continuous domains with more agent dependencies, and investigate methods to learn the coordination requirements automatically.

Acknowledgments

We would like to thank the three reviewers for their detailed and constructive comments. This research is supported by PROGRESS, the embedded systems research program of the Dutch organization for Scientific Research NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW, project AES 5414.

References

Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. Proc. 15th Nation. Conf. on Artificial Intelligence. Madison, WI.

Guestrin, C. (2003). Planning under uncertainty in complex structured environments. Doctoral dissertation, Computer Science Department, Stanford University.

Guestrin, C., Lagoudakis, M., & Parr, R. (2002a). Coordinated reinforcement learning. Proc. 19th Int. Conf. on Machine Learning. Sydney, Australia.

Guestrin, C., Venkataraman, S., & Koller, D. (2002b). Context-specific multiagent coordination and planning with factored MDPs. Proc. 18th Nation. Conf. on Artificial Intelligence. Edmonton, Canada.

Kok, J. R., & Vlassis, N. (2003). The pursuit domain package (Technical Report IAS-UVA-03-03). Informatics Institute, University of Amsterdam, The Netherlands.

Russell, S., & Zimdars, A. L. (2003). Q-decomposition for reinforcement learning agents. Proc. 20th Int. Conf. on Machine Learning. Washington, DC.

Schneider, J., Wong, W.-K., Moore, A., & Riedmiller, M. (1999). Distributed value functions. Proc. Int. Conf. on Machine Learning. Bled, Slovenia.

Sen, S., Sekaran, M., & Hale, J. (1994). Learning to coordinate without sharing information. Proc. 12th Nation. Conf. on Artificial Intelligence. Seattle, WA.

Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39, 1095–1100.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. Proc. 10th Int. Conf. on Machine Learning. Amherst, MA.

Vlassis, N. (2003). A concise introduction to multiagent systems and distributed AI. Informatics Institute, University of Amsterdam. http://www.science.uva.nl/~vlassis/cimasdai.

Watkins, C., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8, 279–292.

Weiss, G. (Ed.). (1999). Multiagent systems: a modern approach to distributed artificial intelligence. MIT Press.