STRATEGIC DOMINANCE AND DYNAMIC PROGRAMMING FOR MULTI-AGENT PLANNING
Application to the Multi-Robot Box-pushing Problem

Mohamed Amine Hamila¹, Emmanuelle Grislin-Le Strugeon¹, René Mandiau¹ and Abdel-Illah Mouaddib²
¹LAMIH, Université de Valenciennes, Valenciennes, France
²GREYC, Université de Caen Basse-Normandie, Caen, France

Keywords: Multi-agent planning, Coordination, Stochastic games, Markov processes.

Abstract: This paper presents a planning approach for a multi-agent coordination problem in a dynamic environment. We introduce the algorithm SGInfiniteVI, which allows game-theoretic tools to be applied to the engineering of multi-agent systems and is designed to solve stochastic games. In order to limit the decision complexity, and thus decrease the resources used (memory and processor time), our approach relies on reducing the number of joint actions considered at each decision step. A multi-robot box-pushing scenario is used as a platform to evaluate and validate our approach. We show that only the elimination of weakly dominated actions improves the resolution process, despite a slight deterioration of the solution quality due to the loss of information.

1 INTRODUCTION

Many daily situations involve decision making: for example, an air-traffic controller has to assign landing areas and time slots to planes, or a taxi company has transportation tasks to be carried out. Intelligent agents can aid in this decision-making process. In this paper, we address the problem of collision-free paths for multiple agents sharing and moving in the same environment.

The objective of this work is to propose an efficient answer to such coordination problems. One answer is to consider stochastic games, since they provide a powerful framework for modeling multi-agent interactions. Stochastic games were first studied as an extension of matrix games (Neumann and Morgenstern, 1944) to multiple states. They are also seen as a generalization of Markov decision processes (MDP) (Puterman, 2005) to several agents. The Nash equilibrium (Nash, 1950) is the most commonly used solution concept, intuitively defined as a particular behavior for all agents, where each agent acts optimally with regard to the others' behavior.

This work aims to improve the performance of a previous algorithm, SGInfiniteVI (Hamila et al., 2010), designed to solve stochastic games. The latter allowed finding decentralized action policies, based on the dynamic programming technique and Nash equilibria. However, the computed solution is obtained through an exhaustive process, thereby limiting the dimensions of the problems that can be addressed.

Our contribution is mainly threefold. Firstly, we present an exact algorithm for the elimination of weakly/strictly dominated strategies. Secondly, we have incorporated this technique into the algorithm SGInfiniteVI, in order to simplify the decision problems and accelerate the resolution process. Thirdly, we propose an experimental approach for the evaluation of the resulting new algorithm, performed in two stages: (1) a numerical evaluation, which compares the effect of the elimination of weakly and strictly dominated strategies on the used resources; (2) a behavioral evaluation, which checks the impact of the chosen procedure on the solution quality.

The paper is organized as follows. Section 2 recalls some definitions of stochastic games. Section 3 describes the algorithm SGInfiniteVI and the process improvement, and shows how to apply it on a grid-world game. Results are presented and discussed in Section 4, and we conclude in Section 5.

2 BACKGROUND


In this section, we present, on the one hand, an introduction to the model of stochastic games and, on the other hand, some of its crucial aspects.

2.1 Definitions and Concepts

Stochastic Games (SG) (Shoham et al., 2003; Hansen et al., 2004) are defined by the tuple:

⟨Ag, {Ai : i = 1...|Ag|}, {Ri : i = 1...|Ag|}, S, T⟩

• Ag: the finite set of agents.
• Ai: the finite set of actions (or pure strategies) available to agent i (i ∈ Ag).
• Ri: the immediate reward function of agent i, Ri(a) → R, where a is the joint action defined as a ∈ ×i∈Ag Ai and given by a = ⟨a1, ..., a|Ag|⟩.
• S: the finite set of environment states.
• T: the stochastic transition function, T : S × A × S → [0,1], indicating the probability of moving from a state s ∈ S to a state s′ ∈ S by executing the joint action a.

The particularity of stochastic games is that each state s can be considered as a matrix game M(s). At each step of the game, the agents observe their environment, simultaneously choose actions and receive rewards. The environment transitions stochastically into a different state M(s′) with probability P(s′|s,a) and the above process repeats. The goal for each agent is to maximize the expected sum of rewards it receives during the game.

2.2 Equilibrium in Stochastic Games

Stochastic games have reward functions which can be different for every agent. In certain cases, it may be difficult to find policies that maximize the performance criteria of all agents, so in stochastic games an equilibrium is sought for every state. This equilibrium is a situation in which no agent, taking the other agents' actions as given, can improve its performance criterion by choosing an alternative action: this is the definition of the Nash equilibrium (Nash, 1950).

Definition 1. A Nash equilibrium is a set of strategies (actions) a∗ such that:

Ri(ai∗, a−i∗) ≥ Ri(ai, a−i∗)   ∀i ∈ Ag, ∀ai ∈ Ai   (1)

2.2.1 Strategic Dominance

When the number of agents is large, it becomes difficult for everyone to consider the entire joint-action space. This may involve a high cost of matrix construction and resolution. To reduce the joint-action set, most research (in game theory) has focused on studying the concepts of plausible solutions. Strategic dominance (Fudenberg and Tirole, 1991; Leyton-Brown and Shoham, 2008) is one of the most widely used concepts, seeking to eliminate actions that are dominated by other actions.

Definition 2. A strategy ai ∈ Ai is said to be strictly dominated if there is another strategy a′i ∈ Ai such that:

Ri(a′i, a−i) > Ri(ai, a−i)   ∀a−i ∈ A−i   (2)

Thus a strictly dominated strategy yields a lower expected payoff than at least one other strategy available to the player, regardless of the strategies chosen by everyone else. Obviously, a rational player will never use a strictly dominated strategy. The process can be repeated until no more strategies can be eliminated in this manner. This prediction process on actions is called "Iterative Elimination of Strictly Dominated Strategies" (IESDS).

Definition 3. For every player i, if there is only one solution resulting from the IESDS process, then the game is said to be dominance solvable and the solution is a Nash equilibrium.

However, in many cases the process ends with a large number of remaining strategies. To further reduce the joint-action space, we can relax the principle of dominance and also include weakly dominated strategies.

Definition 4. A strategy ai ∈ Ai is said to be weakly dominated if there is another strategy a′i ∈ Ai such that:

Ri(a′i, a−i) ≥ Ri(ai, a−i)   ∀a−i ∈ A−i   (3)

The elimination process then provides more compact matrices and consequently reduces the computation time of the equilibrium. However, this procedure has two major drawbacks: (1) the elimination order may change the final outcome of the game, and (2) eliminating weakly dominated strategies can exclude some Nash equilibria present in the game.

2.2.2 Best-response Function

As explained above, the iterative elimination of dominated strategies is a relevant solution, but unreliable for an exact search of equilibrium. Indeed, discarding dominated strategies narrows the search for a solution strategy, but does not identify a unique solution. Selecting a specific strategy requires introducing the concept of Best Response.

Definition 5. Given the other players' actions a−i, the Best Response (BR) of player i is:

BRi : a−i → argmax ai∈Ai Ri(ai, a−i)   (4)

Thus, the notion of Nash equilibrium can be expressed using the concept of Best Response:

∀i ∈ Ag,  ai∗ ∈ BRi(a−i∗)
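To make Definitions 2, 4 and 5 concrete, the small Java sketch below checks strict and weak dominance and computes a best response for the row player of a two-player matrix game. It is only an illustration of the definitions above; the class name, method names and example payoffs are ours and do not come from the paper.

// Illustration of Definitions 2, 4 and 5 for the row player of a two-player
// matrix game: payoff[a][b] is player 1's reward when it plays row a and the
// opponent plays column b.
public final class Dominance {

    /** Definition 2: row a is strictly dominated by row aPrime (">" for every column). */
    static boolean strictlyDominated(double[][] payoff, int a, int aPrime) {
        for (int b = 0; b < payoff[a].length; b++)
            if (payoff[aPrime][b] <= payoff[a][b]) return false;
        return true;
    }

    /**
     * Definition 4: row a is weakly dominated by row aPrime (">=" for every column).
     * The textbook definition also asks for ">" in at least one column; we follow
     * Definition 4 as stated in the text.
     */
    static boolean weaklyDominated(double[][] payoff, int a, int aPrime) {
        for (int b = 0; b < payoff[a].length; b++)
            if (payoff[aPrime][b] < payoff[a][b]) return false;
        return true;
    }

    /** Definition 5: best response of player 1 to a fixed opponent column b. */
    static int bestResponse(double[][] payoff, int b) {
        int best = 0;
        for (int a = 1; a < payoff.length; a++)
            if (payoff[a][b] > payoff[best][b]) best = a;
        return best;
    }

    public static void main(String[] args) {
        double[][] r1 = { { 1.0, 3.0 }, { 1.0, 4.0 } };   // example payoffs (ours)
        System.out.println(strictlyDominated(r1, 0, 1));  // false: tie in column 0
        System.out.println(weaklyDominated(r1, 0, 1));    // true
        System.out.println(bestResponse(r1, 1));          // 1
    }
}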


2.3 Solving Stochastic Games

The main works on planning in stochastic games are those of Shapley (Shapley, 1953) and Kearns (Kearns et al., 2000). Shapley was the first to propose an algorithm for stochastic games with an infinite horizon in the case of zero-sum games. The FINITEVI algorithm by Kearns et al. generalizes Shapley's work to the general-sum case.

In the same area, the algorithm SGInfiniteVI (Hamila et al., 2010) brought several improvements, including the decentralization and the implementation of the equilibrium-selection function (previously regarded as an oracle), in order to deal with complex situations such as equilibrium multiplicity.

However, the algorithm reaches its limits as the problem size increases. Our objective consists in improving the algorithm SGInfiniteVI by eliminating useless strategies, with the aim to accelerate the computation, to reduce the memory usage and to plan more easily with fewer coordination problems.

3 INTEGRATION OF THE DOMINANCE PROCEDURE AND APPLICATION

In this section we present the dominance procedure (IEDS) and how we integrate it into the SGInfiniteVI algorithm.

3.1 The Improvement of SGInfiniteVI

We first present Algorithm 1, which performs the iterative elimination of dominated strategies (introduced in Section 2). The algorithm takes as parameter a matrix M(s) of arbitrary dimension and returns a matrix M′(s) assumed to be smaller. The elimination process is applied iteratively until each player has one remaining strategy, or several strategies none of which is weakly dominated. Note that each player will seek to reduce its matrix, not only by eliminating its own dominated strategies but also those of the others.

The IEDS procedure is incorporated into the algorithm (Algorithm 2, line 8) in order to:
• Significantly reduce the matrix size: if, after the procedure, there is more than one strategy per player (line 12), the equilibrium selection must be refined to favor one strategy over another. Thus, the Best-Response function is used to find equilibria.
• Directly calculate a Nash equilibrium: this happens when the intersection of all strategies (one per player) forms a Nash equilibrium (line 8).

Algorithm 1: IEDS algorithm.
Input: a matrix M(s)
1   for k ∈ 1...|Ag| do
2     for stratCandidate ∈ Ak do
3       stratDominated ← true
4       for stratAlternat ∈ Ak do
5         if stratCandidate is non-dominated by stratAlternat then
6           stratDominated ← false
7       if stratDominated then
8         delete stratCandidate from Ak
Output: M′ with |M′| ≤ |M|

Algorithm 2: The SGInfiniteVI algorithm with IEDS.
Input: a stochastic game SG, γ ∈ [0,1], ε ≥ 0
1   t ← 0
2   repeat
3     t ← t + 1
4     for s ∈ S do
5       for a ∈ A do
6         for k ∈ 1...|Ag| do
7           M(s,a,k,t) = Rk(s,a) + γ Σ s′∈S T(s,a,s′) V_k^f(M(s′,t−1))
8       M′(s,t) = IEDS(M(s,t))
9       if |M′| = 1 then
10        πk(s,t) = f^join(M′(s,t))
11      else
12        πk(s,t) = f(M′(s,t)), with f ∈ {f^NashMaxTot, f^NashMaxSub, f^ApproxNash}
13  until max s∈S |V^f(s,t) − V^f(s,t−1)| < ε
Output: policy πi

where f^NashMaxTot maximizes the payoffs of all agents, f^NashMaxSub maximizes the agent's own payoff, and f^ApproxNash selects an approximate Nash equilibrium.

Note that in case of equilibrium multiplicity, we propose two selection functions¹: one maximizes the overall gain (f^NashMaxTot) and the other maximizes the individual gain (f^NashMaxSub). If there is no equilibrium, then we use the function f^ApproxNash to reach an approximate Nash equilibrium.

¹We propose only the functions f^NashMaxTot and f^NashMaxSub because it has been shown that these functions perform better than NashPareto in case of equilibrium multiplicity.

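As a structural illustration of Algorithm 2, the sketch below (ours, not the authors' implementation) builds the per-state matrices with the discounted backup of line 7, filters each player's actions with a single weak-dominance pass standing in for the IEDS call of line 8, and then picks the surviving joint action with the highest total payoff, a crude stand-in for the selection functions. A real implementation would plug in Algorithm 1 and the f^NashMaxTot / f^NashMaxSub / f^ApproxNash functions instead; the two-agent restriction and all array layouts are assumptions made for brevity.

import java.util.Arrays;

/*
 * Structural sketch (ours) of Algorithm 2 for two agents.
 * R[k][s][a1][a2] is agent k's reward for joint action (a1, a2) in state s,
 * T[s][a1][a2][s2] the transition probability to s2.
 */
public final class SgValueIterationSketch {

    static double[][] solve(double[][][][] R, double[][][][] T,
                            int nStates, int nActions, double gamma, double eps) {
        double[][] V = new double[2][nStates];            // one value per agent and state
        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < nStates; s++) {
                // Line 7: discounted backup builds the matrix game M(s).
                double[][][] M = new double[2][nActions][nActions];
                for (int a1 = 0; a1 < nActions; a1++)
                    for (int a2 = 0; a2 < nActions; a2++)
                        for (int k = 0; k < 2; k++) {
                            double q = R[k][s][a1][a2];
                            for (int s2 = 0; s2 < nStates; s2++)
                                q += gamma * T[s][a1][a2][s2] * V[k][s2];
                            M[k][a1][a2] = q;
                        }
                // Line 8: stand-in for IEDS, one weak-dominance pass per player.
                boolean[] alive1 = weakDominancePass(M[0], nActions, true);
                boolean[] alive2 = weakDominancePass(M[1], nActions, false);
                // Lines 9-12, simplified: best total payoff among surviving joint actions.
                double best = Double.NEGATIVE_INFINITY;
                int b1 = 0, b2 = 0;
                for (int a1 = 0; a1 < nActions; a1++)
                    for (int a2 = 0; a2 < nActions; a2++)
                        if (alive1[a1] && alive2[a2] && M[0][a1][a2] + M[1][a1][a2] > best) {
                            best = M[0][a1][a2] + M[1][a1][a2];
                            b1 = a1;
                            b2 = a2;
                        }
                for (int k = 0; k < 2; k++) {             // the policy would record (b1, b2) here
                    delta = Math.max(delta, Math.abs(M[k][b1][b2] - V[k][s]));
                    V[k][s] = M[k][b1][b2];
                }
            }
        } while (delta >= eps);                           // line 13: stop when values are stable
        return V;
    }

    /** Marks the player's actions that survive one weak-dominance pass (Definition 4). */
    static boolean[] weakDominancePass(double[][] myPayoff, int nActions, boolean rowPlayer) {
        boolean[] alive = new boolean[nActions];
        Arrays.fill(alive, true);
        for (int a = 0; a < nActions; a++)
            for (int alt = 0; alt < nActions && alive[a]; alt++) {
                if (alt == a || !alive[alt]) continue;
                boolean dominated = true;
                for (int other = 0; other < nActions; other++) {
                    double mine = rowPlayer ? myPayoff[a][other] : myPayoff[other][a];
                    double alts = rowPlayer ? myPayoff[alt][other] : myPayoff[other][alt];
                    if (alts < mine) { dominated = false; break; }
                }
                if (dominated) alive[a] = false;
            }
        return alive;
    }

    public static void main(String[] args) {
        // Tiny smoke test (ours): one state, two actions per agent, self-looping transitions.
        int nS = 1, nA = 2;
        double[][][][] R = new double[2][nS][nA][nA];
        double[][][][] T = new double[nS][nA][nA][nS];
        for (int a1 = 0; a1 < nA; a1++)
            for (int a2 = 0; a2 < nA; a2++) {
                R[0][0][a1][a2] = a1 + a2;                // both agents prefer action 1
                R[1][0][a1][a2] = a1 + a2;
                T[0][a1][a2][0] = 1.0;
            }
        System.out.println(Arrays.toString(solve(R, T, nS, nA, 0.9, 1e-6)[0]));
    }
}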

Algorithmic Complexity. In the search for equilibria, the algorithm deals only with pure Nash equilibria. It has been proved that deciding whether a game has a pure Nash equilibrium is NP-complete (Conitzer and Sandholm, 2008). Therefore, the running time of the SGInfiniteVI algorithm is not polynomial in the size of the game matrix (the function f used to compute the Nash equilibrium is itself non-polynomial). Moreover, the size of the state space is exponential in the number of agents. The running time of our algorithm is therefore taken to be exponential.

Spatial Complexity. The matrix of a state is composed of |Ag| values for each joint action, making a total of |Ag| × |A| values. Every agent could store a backup matrix after each evaluation of the state s, including its own payoff but also the payoffs of the other agents. This is not necessary, however, since the agent can keep only the values of the payments coming from the calculated equilibrium. Therefore the total number of required values is |S| × |Ag|: SGInfiniteVI is linear in the number of states and agents².

²The next section will show that this complexity is a worst-case complexity and that in practice a gain in memory can be considered.

3.2 Grid-world Game

The example that we chose is similar to examples from the literature, such as the "Two-Player Coordination Problem" of Hu and Wellman (Hu and Wellman, 2003). It is the problem of multi-robot box-pushing (see Figure 1): the game includes robots, objects and a container box. In this example, the objective of the robots is to put all the objects into the box in a minimum number of steps and without conflict.

Figure 1: An example of scenario: two robots, four objects and a container box.

The following section intends to validate the improved algorithm on the modeled game.

4 VALIDATION

The experiments were performed on 200 policies (one for each agent), with 20,000 tests per policy (randomly changing the agents' initial positions at each test). The simulator was implemented in Java and the experiments were run on a quad-core 2.8 GHz machine with 4 GB of memory.

First, experiments were made to study the effect of the IEDS procedure on the used resources. Second, we sought to determine its effect on the agents' behavior.

4.1 Numerical Evaluation

This section aims to demonstrate empirically the effect of IEDS on the equilibrium computation, the CPU time and the memory space, depending on the type of eliminated strategies.

4.1.1 Equilibrium Computation

We tested the algorithm with two different procedures (IESDS and IEWDS), respectively performing iterative elimination of strictly and weakly dominated strategies. The purpose is to know which of the two procedures would bring a gain without degrading the quality of the solution. The optimal case of reduction corresponds to one strategy per player (the joint action then forms an equilibrium). We call pcRed the percentage of reduction.

The experiments show that, for strictly dominated strategies, the percentage of Nash equilibria found by the f^join function is generally less than 20%. For the elimination of weakly dominated strategies, the percentage is close to 90%, as shown in Figure 2. The results reflect the ascendancy of weakly dominated strategies in terms of matrix reduction.

Figure 2: Comparison between IEWDS and IESDS in terms of matrix reduction (with 2 agents, 4 objects and different grid sizes).
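To relate the grid-world scenario of Section 3.2 to the resource measurements that follow, the snippet below enumerates the joint-action space as the number of robots grows. The paper does not spell out the individual action set; we assume five moves per robot (the four directions plus staying in place), which is consistent with |A| = 25, 125 and 625 reported for 2, 3 and 4 agents in Table 1. The code is purely illustrative.

/** Illustrative size of the joint-action space in the box-pushing grid world. */
public final class JointActionSpace {

    // Assumed individual action set: four moves plus "stay" (not spelled out in the paper).
    enum Move { UP, DOWN, LEFT, RIGHT, STAY }

    /** |A| = m^n joint actions for n robots with m individual actions each. */
    static long jointActionCount(int nRobots) {
        return (long) Math.pow(Move.values().length, nRobots);
    }

    /** A matrix M(s) stores |Ag| x |A| values: one payoff per agent and joint action. */
    static long matrixValueCount(int nRobots) {
        return nRobots * jointActionCount(nRobots);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 4; n++)
            System.out.printf("%d robots: |A| = %d, |M| = %d%n",
                    n, jointActionCount(n), matrixValueCount(n));
        // Prints 25/50, 125/375 and 625/2500, the |A| and |M| columns of Table 1 in Section 4.1.3.
    }
}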


4.1.2 Evaluation on the Computation Time

We compared the computation time used by the algorithm SGInfiniteVI with and without IEDS. Figure 3 shows that the elimination of weakly dominated strategies provides a significant gain in time Gt, depending on the size of the environment. Nevertheless, the use of strictly dominated strategies does not reduce the computation time but increases it slightly. Outside the context of the application, we can say that the reduction of the matrix size is not necessarily synonymous with less computation time. Indeed, the IEDS procedure leads to a gain gt on the exploration time of the matrix, but also entails a computational cost ct in addition to that of the algorithm, the useful gain being given by the difference gt − ct.

Figure 3: Comparison between IEWDS, IESDS and the original version of the algorithm, in terms of computation time (with 2 agents, 4 objects and different grid sizes).

Thus, to obtain a gain in time Gt that is perceptible at the general level, the degree of reduction pcRed must be large enough to generate a gain gt covering ct. The parameter pcRed could be integrated into the algorithm to allow launching the IEDS procedure only when required.

4.1.3 Evaluation on Memory Space

At this level, the dominance procedure used does not directly reduce the memory usage, but an estimation of the possible gain can be proposed. Indeed, a gain is possible only when matrices are partially calculated: for example, it is useless to calculate the payments of the other players when these correspond to a dominated, and therefore never performed, strategy. An estimate of the expected gain can be made from the average number of dominated strategies per player. The scenario is the following one:
1. Each player computes only its own payments. The number of calculated values is |A|.
2. Start the IEDS process.
3. Calculate the payments of the other players only where they correspond to the surviving strategies. The number of new values filled in the matrix is |A−i|, which makes a total of |A| + |A−i| per matrix.
4. Find the best response.

The expected gain per matrix is then:

gm = (nbrValMat − nbrValCalc) / nbrValMat = ((|A| × |Ag|) − (|A| + |A−i|)) / (|A| × |Ag|)

The total expected gain is:

Gm = gm × pcRed = [((|A| × |Ag|) − (|A| + |A−i|)) / (|A| × |Ag|)] × pcRed

Figure 4: Memory space required to calculate a policy (with 3 agents, 3 objects and different grid sizes).

Table 1 shows the evolution of the expected gain according to the number of agents, and Figure 4 shows in practice the evolution of the gain according to the problem size. The empirical analysis of gm and pcRed leads to an estimated gain Gm of about 40%.

Table 1: Expected gain according to the number of agents.

            |Ag|    |A|     |M|     gm      pcRed    Gm
2 agents     2       25      50     40%     ≈ 90%    35%
3 agents     3      125     375     60%     ≈ 75%    45%
4 agents     4      625    2500     70%     ≈ 53%    37%

This will allow considering larger problem sizes, increasing the number of agents, the grid size, etc.
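As a quick check of Table 1, the helper below evaluates the two formulas above for the three configurations of the table. The five individual actions per agent are inferred from |A| = 25, 125 and 625 and are our assumption; pcRed is copied from the table rather than measured.

/** Re-computes the expected-gain figures of Table 1 from the formulas above. */
public final class ExpectedGain {

    /** gm = (|A|*|Ag| - (|A| + |A-i|)) / (|A|*|Ag|) */
    static double perMatrixGain(int nAgents, int actionsPerAgent) {
        double jointA = Math.pow(actionsPerAgent, nAgents);           // |A|
        double jointAMinusI = Math.pow(actionsPerAgent, nAgents - 1); // |A-i|
        double matrixValues = jointA * nAgents;                       // |M| = |A| * |Ag|
        return (matrixValues - (jointA + jointAMinusI)) / matrixValues;
    }

    public static void main(String[] args) {
        double[] pcRed = { 0.90, 0.75, 0.53 };       // taken from Table 1 (2, 3 and 4 agents)
        for (int n = 2; n <= 4; n++) {
            double gm = perMatrixGain(n, 5);         // 5 individual actions per agent (assumption)
            double totalGain = gm * pcRed[n - 2];    // Gm = gm * pcRed
            System.out.printf("%d agents: gm = %.0f%%, Gm = %.0f%%%n",
                    n, 100 * gm, 100 * totalGain);
        }
        // Prints gm = 40%, 60%, 70% and Gm = 36%, 45%, 37%, close to Table 1's 35%, 45%, 37%.
    }
}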


4.2 Evaluation on the Agents' Behavior

To assess the effect of the elimination of dominated strategies on the agents' behavior (we consider only weakly dominated strategies, since the elimination of strictly dominated strategies cannot yield significant gains), we define three elements of comparison:

• The average number of conflicts: a conflict occurs when agents violate any of the game rules, e.g. an agent moving to a position that is already occupied by another agent.

• The average number of deadlocks: a deadlock occurs when agents prevent each other's actions forever, waiting for a shared resource.

• The average number of livelocks: a livelock is an endless cycle that prevents the agents from reaching a goal state, and the game from making any progress.

4.2.1 Average Number of Conflicts

Figure 5 shows that the elimination of dominated strategies leads to an increase in the number of conflicts compared to the original version of the algorithm. Nevertheless, the observed values remain relatively small and can be considered acceptable in the context of the simulation. For example, for a simulation with a grid size of 12 × 12, two agents and four objects, the average number of conflicts is only 0.25 per simulation.

Figure 5: Evaluation according to the average number of conflicts (with 2 agents, 4 objects and different grid sizes).

4.2.2 Average Number of Deadlocks and Livelocks

Figure 6 shows a slight increase in the average number of deadlocks per simulation compared to the original version of the algorithm. The number of deadlocks may reflect a loss in terms of coordination between agents: such a situation occurs when the performed actions come from distinct Nash equilibria.

Figure 6: Evaluation according to the average number of deadlocks (with 2 agents, 4 objects and different grid sizes).

4.2.3 Conclusion of the Experimentation

We found that the results show a significant gain in computation time and an opportunity to save memory space. Intuitively, this gain is not without consequence on the agents' policies: dominated strategies may affect long-term gains, even though they are considered unnecessary during the elimination process. In addition, adopting the concept of dominance simplifies the decision problem at each stage by getting rid of the equilibrium multiplicity.

5 CONCLUSIONS

In the context of coordinating agents, the aim of this work was not only to study the model of stochastic games, but also to propose a planning algorithm based on dynamic programming and the Nash equilibrium. Our method involves the implementation, the validation and the evaluation on an example of interaction between agents. The concept of strategic dominance has been studied and used in order to improve the computation time and memory usage of the SGInfiniteVI algorithm. The experiments demonstrated that only the elimination of weakly dominated strategies could provide a gain in time and memory.

REFERENCES

Conitzer, V. and Sandholm, T. (2008). New complexity results about Nash equilibria. Games and Economic Behavior, 63(2):621–641.

Fudenberg, D. and Tirole, J. (1991). Game Theory. MIT Press, Cambridge, MA.

Hamila, M. A., Grislin-le Strugeon, E., Mandiau, R., and Mouaddib, A.-I. (2010). An algorithm for multi-robot planning: SGInfiniteVI. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 02, WI-IAT '10, pages 141–148, Washington, DC, USA. IEEE Computer Society.

Hansen, E. A., Bernstein, D. S., and Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In Proceedings of the 19th National Conference on Artificial Intelligence, AAAI'04, pages 709–715. AAAI Press.


Hu, J. and Wellman, M. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069.

Kearns, M., Mansour, Y., and Singh, S. (2000). Fast planning in stochastic games. In Proceedings of UAI-2000, pages 309–316. Morgan Kaufmann.

Leyton-Brown, K. and Shoham, Y. (2008). Essentials of Game Theory: A Concise Multidisciplinary Introduction. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2(1):1–88.

Nash, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences of the United States of America, 36(1):48–49.

Neumann, J. V. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton. Second edition in 1947, third in 1954.

Puterman, M. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences USA, pages 1095–1100.

Shoham, Y., Powers, R., and Grenager, T. (2003). Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University.
