Game-theoretic Management of Interacting Adaptive Systems

David Wolpert
NASA Ames Research Center, MS 269-1
Moffett Field, CA 94035
Email: david.h.wolpert@.gov

Nilesh Kulkarni
Perot Systems Inc., NASA Ames Research Center, MS 269-1
Moffett Field, CA 94035
Email: [email protected]

Abstract— A powerful technique for optimizing an evolving system ("agent") is co-evolution, in which one evolves the agent's environment at the same time that one evolves the agent. Here we consider such co-evolution when there is more than one agent, and the agents interact with one another. By definition, the environment of such a set of agents defines a non-cooperative game they are playing. So in this setting co-evolution means using a "manager" to adaptively change the game the agents play, in such a way that their resultant behavior optimizes a utility function of the manager. We introduce a fixed-point technique for doing this, and illustrate it on computer experiments.

I. Introduction

Many distributed systems involve multiple goal-seeking agents. Often the interactions of those agents can be modeled as a noncooperative game where the agents are identified with the players of the game and their goals are identified with the associated utility functions of the players [7], [8], [12], [17], [21]. Examples involving purely artificial players include distributed adaptive control, distributed reinforcement learning (e.g., such systems involving multiple autonomous adaptive rovers on Mars or multiple adaptive telecommunications routers), and more generally multi-agent systems involving adaptive agents [2], [6], [15], [16]. In other examples some of the agents / players are human beings. Examples include air-traffic management [9], multi-disciplinary optimization [4], [5], and, in a sense, much of mechanism design, including in particular the design of auctions [8], [12], [13].

Sometimes the goals of the players do not conflict; intuitively, the system is "modularized". However often this cannot be guaranteed. As an example, it may be that in most conditions the system is modularized, but that some conditions cause conflicts among the needs of the players for system-wide resources (e.g., when an "emergency" occurs). Alternatively, it may be that the players take physical actions and that under some conditions the laws of physics couple those actions in a way that makes their goals conflict. Moreover, whenever some agents are humans, in almost all conditions there will be some degree of conflict among their goals. Finally, note that even when there are no conflicts among the goals of the players, there may be synergies among the players that they cannot readily find if left on their own.

In all of these scenarios there is a need to intervene in the behavior of the players. However often we do not have complete control over the behavior of the players. That is always true if there are communication restrictions that prevent us from having full and continual access to all of the players. It is also always true if some of the players are humans. Such limitations on our control are also typical when the players are software programs written by third party vendors.

On the other hand, even when we cannot completely control the players, often we can set / modify some aspects of the game among the players. As examples, we might have a manager external to the players who can modify their utility functions (e.g., by providing them with incentives), the communication sequence among them, what they communicate with one another, the command structure relating them, how their chosen moves are mapped into the physical world, how (if at all) their sensory inputs are distorted, or even how rational they are. The ultimate goal of the manager is to make such modifications that induce behavior of the players that is optimal as far as the manager is concerned.

As an example, say some of the players are evolvable software systems. Then the game details comprise the environment in which those systems evolve. So modifying the game details to optimize the resultant behavior of the players is a variant of using co-evolution for optimization [1], [3]. The difference with most work on co-evolution is that in optimal management of a game we are concerned with the environment of multiple interacting and evolving systems rather than the environment of a solitary evolving system.

In the next section we present a simple example that illustrates how the behavior of players in a game can be improved by changing the game, discuss previous related work, and overview our approach. After that we present our notation. We then use that notation to introduce a formal framework for analyzing optimal management. Next we use our framework to introduce an algorithm for optimal management. After that we present a computer-based test validating that algorithm.

II. Background

To illustrate how changing the details of a game may result in the players behaving in a way that is better for an external manager, say we have two players, Row and Col, each of whom has four possible moves (also called "pure strategies"). As usual, each player has a utility function that maps the joint pure strategy of the two players into the reals. Say that those utility functions, (g^R, g^C), are given by the following bimatrix:

    (0, 6)    (4, 7)    (−1, 5)   (4, 4)
    (−1, 6)   (5, 5)    (2, 3)    (7, 4)
    (−2, 1)   (3, 2)    (0, 0)    (5, −1)        (1)
    (1, 1)    (6, 0)    (1, −2)   (6, −1)

To play the game each player i ∈ {Row, Col} independently chooses a "mixed strategy", i.e., a probability distribution P_i(x_i) over her set of (four) allowed moves. So the expected utility for player i is E_P(g^i) = Σ_{x_i, x_{−i}} P_i(x_i) P_{−i}(x_{−i}) g^i(x_i, x_{−i}), where P_{−i}(x_{−i}) is the mixed strategy of i's opponent.

A pair of mixed strategies (P_Row, P_Col) is a "Nash Equilibrium" (NE) of the game if for all players i, E_P(g^i) cannot increase if P_i changes while P_{−i} stays the same. At a NE, neither player can benefit by changing her mixed strategy, given her opponents' mixed strategies. To illustrate, a NE of the game in (1) is the joint pure strategy where Row plays her bottom-most move, and Col plays her left-most move.
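That this joint pure strategy is a NE can be verified mechanically by checking that each move is a best response to the other. The following minimal Python sketch does so; the payoff arrays are transcribed from the bimatrix (1), while the check itself is only an editorial illustration and is not part of the original development.

```python
import numpy as np

# Row player's and Col player's payoffs from the bimatrix in Eq. (1).
g_row = np.array([[ 0,  4, -1,  4],
                  [-1,  5,  2,  7],
                  [-2,  3,  0,  5],
                  [ 1,  6,  1,  6]])
g_col = np.array([[ 6,  7,  5,  4],
                  [ 6,  5,  3,  4],
                  [ 1,  2,  0, -1],
                  [ 1,  0, -2, -1]])

def is_pure_ne(r, c):
    """True if (row r, col c) is a pure-strategy Nash equilibrium."""
    row_best = g_row[r, c] >= g_row[:, c].max()   # Row cannot gain by deviating
    col_best = g_col[r, c] >= g_col[r, :].max()   # Col cannot gain by deviating
    return row_best and col_best

# Bottom-most row (index 3) and left-most column (index 0), as in the text.
print(is_pure_ne(3, 0), (g_row[3, 0], g_col[3, 0]))   # True (1, 1)
```

Negating both payoff arrays and rerunning the same check identifies the top-most row / right-most column as the equilibrium of the "anti-rational" game discussed next.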
Noncooperative game theory's starting premise is that "rational" players will adopt a NE joint mixed strategy.

Now say that we could induce both players to be "anti-rational", that is, to try to minimize their expected utilities. (Formally, this is equivalent to multiplying their utility functions by −1.) Now the equilibrium of the game occurs where Row plays the top-most row and Col plays the right-most column. Note that both players have a higher utility at this equilibrium than at the NE (4 > 1). Moreover, neither player would benefit if she changed from being anti-rational to being rational and the equilibrium changed accordingly, regardless of which rationality her opponent adopted. Accordingly, consider the scenario where the manager's goal is to increase the utility functions of both players. Then if he can infer that joint anti-rationality does that, and can also induce the players both to be anti-rational, the manager would benefit.

More generally, the optimal management problem is how to find changes that the manager can make to the system that would make it behave in a way that the manager prefers.

In the Probability Collectives [14], [19], [20] approach to optimal management, one has complete freedom to design the players in some subset T of all the players. The remaining players are treated as exogenous noise, with no attempt to exploit prior knowledge that they are goal-seeking. Distributed reinforcement learning and related approaches share these characteristics [2], [6], [15], [16]. In the Collective Intelligence approach [21], [22] one only has freedom to design the utility functions of the players in T, and again treats all other players as exogenous noise. Most of mechanism design [8], [12], [13] shares these characteristics (though in mechanism design, one talks of "conditional payments" to players rather than "modifications to their utility functions").

In contrast, here we focus on a more general situation. We expect that the manager may only have partial ability to modify the players in T, and that those modifications may involve other characteristics besides their utility functions. We also allow the manager to change system parameters not directly part of any player. Finally, the players not in T are modeled as players, rather than treated as noise.

Our approach starts by specifying a parameterized set of models for the dynamics of the entire system. To capture our prior knowledge that the players are goal-seeking, these models are based on game-theoretic considerations. There are two types of parameters of our models: those that characterize the behavior of the players, and those that the manager sets. The former are estimated from observations of system behavior. The manager then uses those estimates and searches over the remaining parameters, to find which of the associated probability distributions are optimal for him. He then sets those parameters to their optimizing values.

III. General Notation

Given any space Z, we write the set of functions from Z into Z as Z^Z. We use a minus sign before a set of subscripts of a vector to indicate all components of the vector other than the indicated one(s). We will use the integral symbol with the measure implicit. So for example for finite X, "∫ dx" implicitly uses a point-mass measure and therefore means a sum. We write |Z| to indicate the cardinality of Z.

We use upper cases to indicate a random variable, and lower cases to indicate a value of that variable. So for example "P(A | b)" means the full function mapping any value a of the random variable A to the conditional probability of a given the value b of the random variable B. We will be loose in distinguishing between probability distributions and probability density functions, using the term "probability distribution" and the symbol P to mean both concepts, with the context making the precise meaning clear if only one of the concepts is meant. We will also use the Dirac delta symbol even if its arguments are integers or even symbols. So for example the expression δ(a − b) where a and b are members of an arbitrary finite space equals 1 if a = b, and 0 otherwise.

We write ℕ to mean the infinite set of all integers not less than 1, and 𝒩 ≡ {1, ..., N}. We will sometimes use curly brackets to indicate a set of indexed elements without explicitly specifying the range of the indices. For example, in an N-player game where each player i has an associated variable a_i, "{a_i}" is shorthand for {a_i : i ∈ 𝒩}. Similarly a_{−j} is shorthand for {a_i : i ∈ 𝒩, i ≠ j}. sgn(x) is defined to equal the sign of the real value x, with sgn(0) ≡ 0.

When defining a function, the symbol "≜" means the definition holds for all values of the listed arguments. So for example, "f(a, b) ≜ ∫ dc r(a)s(b, c)" means the definition holds for all values of a and b (as opposed to being an equation whose solution specifies a and b). Abusing notation, given a function F : A → B, we will sometimes write F(A) to indicate the range of F, but sometimes write F(A′) to indicate the full function F evaluated over domain A′.

The unit simplex of possible distributions over a space Z is written ∆_Z. Given two spaces A, B, we write ∆_{A,B} to mean the unit simplex over the Cartesian product A × B. Similarly, ∆_{A|b} indicates the set of all distributions of A conditioned on b, and ∆_{A|B} indicates the set of all functions from B into ∆_A, i.e., the set of all conditional distributions P(A | b), b ∈ B. Finally, given a Cartesian product space X = ∏_i X_i, we write ∆_χ to indicate the Cartesian product ∏_i ∆_{X_i}. So ∆_χ is the set of all product distributions over X. Similarly, ∆_{χ|a} is ∏_i ∆_{X_i | a}, the set of all product distributions over X conditioned on a.

IV. T

A. Exact State Information

Say we have a repeated game, with the set of possible system states (states of Nature) given by Θ, and the set of possible joint moves by the players given by X [8], [12]. For simplicity, say this is a stationary game of complete state information [12]. So we can dispense with a special space of possible signals to the players; at every iteration every player knows the current state of Nature exactly. In fact, we restrict attention to behavior strategies by the players that only depend on the preceding state of Nature; every player's behavioral strategy is a conditional distribution over the possible moves of that player conditioned on the preceding state of Nature. Also for simplicity, we take Θ and X finite.

We write the (fixed) conditional probability of the next state given the current state and the next joint move as

π(θ_{t+1}, θ_t, x_{t+1}) ≜ P(θ_{t+1} | θ_t, x_{t+1})     (2)

where t ∈ ℕ, θ_{t+1}, θ_t ∈ Θ and x_{t+1} ∈ X. The joint behavioral strategy of the players at time t is the distribution

P(x_{t+1} | θ_t) = ∏_i P(x^i_{t+1} | θ_t) ≜ ∏_i σ^i_t(x^i_{t+1}, θ_t).     (3)
(Note that it is a matter of convention whether we have the index on x be t or t + 1; we choose it to be t + 1 so that the behavioral strategy of the players is formulated as the move the players will make given the system state they just observed.)

Intuitively, each component x^i of x is set independently of the other components by player i, in a completely free manner. It is in how those moves affect the next state of Nature that "constraints" can arise that physically couple the devices controlled by the players. In general, we place no restrictions on the form of each P(X^i_{t+1} | Θ_t); they are to be set solely by associated game-theoretic considerations.

Given this, the updating rule for the distribution over Θ is

P(θ_{t+1}) = ∫ dθ_t dx_{t+1} π(θ_{t+1}, θ_t, x_{t+1}) ∏_i σ^i_t(x^i_{t+1}, θ_t) P(θ_t).     (4)

This is the transition equation for a Markov process taking P(Θ_t) to P(Θ_{t+1}), where the transition matrix ∫ dx_{t+1} π(Θ_{t+1}, Θ_t, x_{t+1}) ∏_i σ^i_t(x^i_{t+1}, Θ_t) is parameterized by the behavioral strategies of the players. Write this matrix as A(Θ_{t+1}, Θ_t; σ_t), or A(σ_t) for short, where σ_t is the vector of all players' behavioral strategies at time t. Similarly write the update equation Eq. 4 as P(Θ_{t+1}) = A(σ_t)P(Θ_t).

For simplicity, we restrict attention to scenarios where at any moment t, the goal of each player i is to maximize an expected associated utility g^i : Θ → ℝ evaluated at time t + 1. So our players are single-step receding horizon controllers. (An example of an alternative is where each player wants to maximize the expectation of a discounted sum of future rewards.) Expanding, we write that expected utility as

E[g^i(Θ_{t+1})] = ∫ dθ_{t+1} P(θ_{t+1}) g^i(θ_{t+1}) = ∫ dθ_t [ ∫ dθ_{t+1} A(θ_{t+1}, θ_t; σ_t) g^i(θ_{t+1}) ] P(θ_t).     (5)

Now plug in the definition of A to get

E[g^i(Θ_{t+1})] = ∫ dθ_t P(θ_t) ∫ dx_{t+1} ∏_j σ^j_t(x^j_{t+1}, θ_t) [ ∫ dθ_{t+1} π(θ_{t+1}, θ_t, x_{t+1}) g^i(θ_{t+1}) ]
  ≜ ∫ dθ_t P(θ_t) ∫ dx_{t+1} ∏_j σ^j_t(x^j_{t+1}, θ_t) γ^i(x_{t+1}; θ_t)     (6)

where for all i, γ^i(x_{t+1}; θ_t) ≜ E(g^i_{t+1} | x_{t+1}, θ_t).

This shows that each value of θ_t specifies a separate conventional strategic form game for the players, with their θ_t-specific mixed strategies over X_{t+1} given by {σ^i_t = P(X^i_{t+1} | θ_t) : i ∈ 𝒩} and their θ_t-specific "effective" utility functions over X_{t+1} given by {γ^i(X_{t+1}; θ_t) : i ∈ 𝒩}. The actual time t is irrelevant to the specification of the game; only the value of the parameter θ_t matters, by setting the dependence of the effective utility functions on X_{t+1}. We write this game jointly specified by θ_t, π and the set g ≜ {g^i} as the game function B^{π,g}(θ_t), or sometimes just B(θ_t) for short. (So for every θ_t, B(θ_t) is a set of N utility functions, {γ^i(X; θ_t)}.)

Accordingly, consider the following iterated process. First, at time t we sample P(Θ_t) to get a θ_t. Next, the players set σ_t(X_{t+1} | θ_t) to a NE of the game B(θ_t). Each player i then samples her mixed strategy σ^i_t(X^i_{t+1} | θ_t) to get a move x^i_{t+1}. This specifies a joint move x_{t+1}. After this π(Θ_{t+1}, θ_t, x_{t+1}) is sampled to get a next Θ value, θ_{t+1}, and the process repeats.

Now in general, there may be more than one NE for some game B(θ). More broadly, if the players are allowed to have bounded rationality, the set of possible joint mixed strategies adopted by the players for any game B(θ) may have non-zero measure. However say we have a universal refinement which the players jointly use to always pick the same unique joint distribution for any game, i.e., a mapping R : B(Θ) → ∆_χ. (As an example, it might be that for any θ, R(B(θ)) is a NE of game B(θ).) Then σ_t(X_{t+1}, θ_t) = R(B(θ_t)).¹ This means that for fixed π and g, σ_t is a t-independent function from Θ to ∆_χ (changes to t that don't change θ_t don't change B(θ_t) and therefore don't change σ_t(X_{t+1}, θ_t)). Accordingly we can drop the subscript t from σ_t.

¹ Such "point prediction", specifying a single possible σ for a given π and g, is the goal of conventional game theory. More sophisticated modeling provides a distribution over σ's. See [18].

R induces a (t-independent) conditional distribution

P^{b,R}(θ_{t+1} | θ_t) ≜ ∫ dx_{t+1} π(θ_{t+1}, θ_t, x_{t+1}) [R(b)](x_{t+1})     (7)

where b ∈ B(Θ). (Note that P^{b,R} is well-defined even if b ≠ B(θ_t).) If we sample this conditional distribution for a current pair (θ_t, b = B(θ_t)) we (stochastically) generate a new Θ value, θ_{t+1}. Evaluating B(θ_{t+1}) then produces a new game b, and we can then apply Eq. 7 using that new game, so that the process repeats. In this way, for any fixed R and π, the function B(Θ) generates a distribution over possible sequences of θ's.
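For finite Θ and X, the update of Eq. 4 is a matrix-vector product, and iterating it traces out the Markov dynamics over ∆_Θ. The following Python sketch illustrates this; the space sizes and the randomly drawn π and σ are arbitrary stand-ins chosen for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, n_moves, n_players = 5, 3, 2
n_joint = n_moves ** n_players

# pi[theta_next, theta, x] : P(theta_{t+1} | theta_t, joint move x), as in Eq. (2).
pi = rng.random((n_theta, n_theta, n_joint))
pi /= pi.sum(axis=0, keepdims=True)

# sigma[i][x_i, theta] : player i's behavioral strategy, as in Eq. (3).
sigma = [rng.dirichlet(np.ones(n_moves), size=n_theta).T for _ in range(n_players)]

def transition_matrix(pi, sigma):
    """A(theta_next, theta) = sum_x pi(theta_next, theta, x) * prod_i sigma_i(x_i, theta)."""
    n_theta = pi.shape[0]
    m = sigma[0].shape[0]                      # moves per player
    A = np.zeros((n_theta, n_theta))
    for x in range(pi.shape[2]):
        x1, x2 = divmod(x, m)                  # decode the joint move into (x^1, x^2)
        joint = sigma[0][x1, :] * sigma[1][x2, :]
        A += pi[:, :, x] * joint[np.newaxis, :]
    return A

A = transition_matrix(pi, sigma)
P = np.full(n_theta, 1.0 / n_theta)            # initial P(Theta_0)
for _ in range(50):                            # Eq. (4): P(Theta_{t+1}) = A P(Theta_t)
    P = A @ P
print(P.round(3), P.sum())
```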
An interesting special case is when B is invertible, i.e., where knowing the N functions {γ^i(X_{t+1}; θ_t)} uniquely fixes θ_t. In such cases we can dispense with Θ; the stochastic dynamics of the system across Θ reduces to a stochastic dynamics across an associated space of games, B(Θ).²

² One example of an invertible B is presented in the experiments below, in which the two players have different utility functions and there is no manager, so the thruster angles are fixed. As a more extreme example, let Θ be a subset of ℝ^N, and define each g^i as the associated projection operator, g^i(θ) ≜ θ^i. Presume further that for all x_{t+1} ∈ X, θ_t ∈ Θ, θ_{t+1} ∈ Θ, π(θ_{t+1}, θ_t, x_{t+1}) = δ(φ(θ_t, x_{t+1}) − θ_{t+1}) for some vector-valued function φ. So for all i, θ_t and x_{t+1}, γ^i(x_{t+1}; θ_t) = φ^i(θ_t, x_{t+1}). Now φ takes X × Θ → Θ, i.e., X acts on Θ via φ. Say that each x ∈ X is a different permutation of Θ. Then not only is B(Θ_t) invertible, but in fact we only need to know the N values {γ^i(x_{t+1}; θ_t)} for one x_{t+1} to know θ_t.

B. Partial state information

Now say that each player i does not see θ_t at iteration t, but only a "signal" w^i_t that is stochastically related to θ_t. So the updating rule for the distribution over Θ is now

P(θ_{t+1}) = ∫ dθ_t dx_{t+1} dw_t π(θ_{t+1}, θ_t, x_{t+1}) [ ∏_i h^i_t(x^i_{t+1}, w^i_t) Ω^i(w^i_t, θ_t) ] P(θ_t)     (8)

where h^i_t(x^i_{t+1}, w^i_t) ≜ P(x^i_{t+1} | w^i_t) replaces σ^i_t as the strategy adopted by player i at time t, and where P(w | θ) ≜ ∏_{i∈𝒩} Ω^i(w^i, θ) is the (fixed) conditional distribution specifying how the vector of all the signals of all players, w, is stochastically generated from a current state of nature θ.³

³ There are many variations of this scenario. For example, we could have each player i base her probability of move x^i at some time t on a history of values w^i_{t′} for t′ < t in addition to w^i_t. Many such variations can be mapped into one another by redefining what the variables mean.

Now by Bayes' rule, we can always expand

P(w_t)P(θ_t | w_t) = P(θ_t)P(w_t | θ_t) = P(θ_t) ∏_{i∈𝒩} Ω^i(w^i_t, θ_t).     (9)

Plugging this into Eq. 8,

E[g^i(Θ_{t+1})] = ∫ dw_t P(w_t) ∫ dx_{t+1} ∏_j h^j_t(x^j_{t+1}, w^j_t) [ ∫ dθ_t dθ_{t+1} P(θ_t | w_t) π(θ_{t+1}, θ_t, x_{t+1}) g^i(θ_{t+1}) ]
  ≜ ∫ dw_t P(w_t) ∫ dx_{t+1} ∏_j h^j_t(x^j_{t+1}, w^j_t) r^i_t(x_{t+1}; w_t)     (10)

where the value of the vector w at time t is w_t and where each r^i_t(X_{t+1}; W_t) is an "effective" utility function for player i. The full-rationality equilibrium at time t is any set of strategies {h^j_t} such that simultaneously each h^i_t maximizes the expected (effective) utility E(r^i_t(X_{t+1}; W_t)) as evaluated using h^{−i}_t, i.e., subject to the strategies of the other players.

Recall that when every player has exact knowledge of the current state, the players are involved in a θ_t-indexed strategic form game, by Eq. 6. So comparing Eq. 6 with Eq. 10, it might seem that in the partial state information case the players are involved in "a w_t-indexed strategic form game". This is not strictly correct though. The problem with the comparison is that whereas each h^i_t only depends on w^i, the effective utility r^i_t depends on the entire vector w; there is no such distinction in the arguments of the corresponding functions in Eq. 6. Here player i can only use w^i to choose her move, whereas her utility function depends on the full w.

More carefully, what Eq. 10 really shows is that at each time t the players are involved in a correlated equilibrium game.⁴ Moreover, since each conditional distribution Ω^i is fixed in time, the distribution P(W_t), along with each effective utility function r^i_t, all vary with P(Θ_t).⁵ So the correlated game the players are engaged in varies with t in general, and therefore so does the full rationality equilibrium h_t.

⁴ One can see this by considering a two-stage extensive form game based on our original strategic form game. In that two-stage game there is a single first-moving Nature player who sets w_t by playing the (potentially time-varying) distribution P(W_t). The original players simultaneously move in the second stage, and each such player i ∈ 𝒩 has an information set consisting of the value w^i_t. The utility function of the i'th player in the two-stage game is the associated effective utility function of player i in the original game, r^i_t. Now if h(X_{t+1}, W_t) is a full rationality equilibrium of the original game, then there is no player i and function ψ such that ∫ dw_t dx_{t+1} P(w_t) h_t(x_{t+1}, w_t) r^i_t(ψ(x^i_{t+1}), x^{−i}_{t+1}; w_t) > E[g^i(Θ_{t+1})]. So that h_t is a correlated equilibrium of the two-stage game.

⁵ Integrate both sides of Eq. 9 over θ_t to see that P(W_t) depends on P(Θ_t). Then use Eq. 9 again to see that P(Θ_t | W_t) depends on P(Θ_t), and therefore so does each r^i_t.

To understand this intuitively, note that P(Θ_t) changes as the system evolves, just as θ_t changes in the exact information case. (See Eq. 8.) Moreover, in this partial information setting P(Θ_t) is a "prior" that each player i uses to infer what θ_t is likely to be, having observed signal w^i_t. Accordingly, changes to P(Θ_t) change the optimal behavior strategies h^j_t.

Just as θ_t defines the game of Eq. 6, so P(Θ_t) defines the game of Eq. 10. In both cases, the game the players are engaged in changes in time. In the partial information case, the system evolves from one t to the next by going through a sequence of correlated equilibria. For every t, the equilibrium of the associated game specifies a new game whose correlated equilibrium gives the time t+1 joint move, x_{t+1}. The associated transition rule is non-Markovian, i.e., since the optimal h_t depends on P(Θ_t), which changes in time, the transition rule in Eq. 8 is not a Markovian process over X × Θ. In particular, the dynamics over Θ is not governed by the matrix A.
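To make the role of P(Θ_t) concrete, the sketch below computes the posterior P(θ_t | w_t) of Eq. 9 and one player's effective utility r^i_t of Eq. 10 for a small finite system, using a single scalar signal in place of the joint signal vector w; all sizes and the randomly drawn Ω, π and g^i are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_theta, n_w, n_joint = 4, 3, 4                      # |Theta|, |W|, |X|

P_theta = rng.dirichlet(np.ones(n_theta))            # current P(Theta_t)
Omega = rng.dirichlet(np.ones(n_w), size=n_theta)    # Omega[theta, w] = P(w | theta)
pi = rng.dirichlet(np.ones(n_theta), size=(n_theta, n_joint))  # pi[theta, x, theta_next]
g_i = rng.random(n_theta)                            # player i's utility over Theta

# Eq. (9): P(theta | w) is proportional to P(theta) * Omega(w, theta).
joint_tw = P_theta[:, None] * Omega                  # shape (n_theta, n_w)
P_w = joint_tw.sum(axis=0)
P_theta_given_w = joint_tw / P_w[None, :]

# Eq. (10): r_i(x; w) = sum_theta P(theta | w) sum_theta' pi(theta', theta, x) g_i(theta')
exp_g_given_theta_x = pi @ g_i                       # shape (n_theta, n_joint)
r_i = P_theta_given_w.T @ exp_g_given_theta_x        # shape (n_w, n_joint)
print(r_i.round(3))   # effective utility of each joint move, for each possible signal
```

Because the posterior, and hence r_i, is built from P(Θ_t), re-running this computation with a different P_theta changes the effective game, which is exactly the t-dependence discussed above.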
For this partial information case, the game function has P(Θ_t) as its argument rather than θ_t (the argument of the game function in the exact information case), and is parameterized by Ω in addition to π and g. Similarly, the domain of any universal refinement for the partial information case is an expanded version of the domain for exact information universal refinements, now being the set of all correlated equilibrium games. Given such a partial information game function B̄^{π,g,Ω} and refinement R̄, we can write h_t(x_{t+1}, w_t) = R̄[B̄^{π,g,Ω}(P(Θ_t))](x_{t+1}, w_t). Plugging this into Eq. 8 then gives the dynamics over Θ:

P^{π,g,R̄,Ω}(θ_{t+1}) = ∫ dθ_t dx_{t+1} dw_t π(θ_{t+1}, θ_t, x_{t+1}) Ω(w_t, θ_t) [ R̄[B̄^{π,g,Ω}(P(Θ_t))](x_{t+1}, w_t) ] P(θ_t)     (11)

where Ω(w_t, θ_t) ≜ ∏_i Ω^i(w^i_t, θ_t).

V. Optimal Management

A. General considerations of optimal management

From now on, for simplicity we restrict attention to the case where the players have exact state information, as in Sec. IV-A. This means that we do not consider the effects of sensor noise on player behavior.

Say we have a manager external to the players who at each t has a preference order over sequences of future θ's. So the manager prefers π's, R's, and g's that are more likely to produce desirable sequences of θ's, as determined by Eq. 7:

P^{π,g,R}(θ_{t+1} | θ_t) ≜ ∫ dx_{t+1} π(θ_{t+1}, θ_t, x_{t+1}) [R(B^{π,g}(θ_t))](x_{t+1}).     (12)

(For the partial information case, the dynamics over Θ is instead given by the non-Markovian Eq. 11.)

Suppose that the manager can change some aspects of π and/or g and/or R. Then at each t the manager has an optimization problem, of how to choose among its set of triples y ≡ (π, g, R) to optimize the likely resultant sequences of θ's. Note that since we are in the exact information case, we do not allow the manager to distort the sensor inputs to the players. (That would mean distorting Ω.) Similarly, it means we do not consider the possibility of the manager modifying the inter-player communication structure and/or command structure (i.e., we do not allow the manager to change the extensive game the players are engaged in).

As an example of this, the manager's preference order might be a discounted sum of future rewards. This need not be the case however. To illustrate this, note from Eq. 12 that for any fixed y the transition matrix A^y ≜ P(Θ_{t+1} | Θ_t) is fixed for all time. So for any fixed y, we have a Markov process across P(Θ) ∈ ∆_Θ with transition matrix A^y and can analyze its convergence properties. In particular, the manager might prefer a sequence in ∆_Θ that eventually converges to a fixed point. (Note this is a fixed point in ∆_Θ, not in Θ.) Furthermore, if by setting y he can vary among a set of such fixed points {P(Θ_∞) ∈ ∆_Θ}, then his preference ordering might lead him to prefer ones that are centered about certain locations in Θ. He might also prefer a P(Θ_∞) that is stable, in that it is highly peaked, so that in the infinite time limit (P(Θ_t) settles to a distribution under which) θ_t has little variability. In addition to these aspects of the fixed point P(Θ_∞), the manager might prefer a fixed point that is stable under the Markovian dynamics. (This stability under the Markovian dynamics is different from the stability of a peaked fixed point; the first concerns ∆_Θ, and the second concerns Θ.) He might also prefer that the dynamics converges to the fixed point quickly from some initial distribution P(Θ_0).

Given any such preferences, the simplest version of the manager's optimization problem is to find the y ∈ Y such that the associated Markov transition matrix A^y has a fixed point, and to optimize the location of that fixed point, its stability, and how quickly it is achieved. There are more sophisticated policies the manager might adopt however. For example, it may be that by judiciously "mode-switching" among the possible y as the system evolves, the manager can induce the dynamics to go from the initial P(Θ_0) to a desired fixed point P(Θ_∞) more expeditiously than it would if any single y ∈ Y were used for the entire sequence.

Note that despite having a preference ordering and a move to choose, the manager is not a player in the game. In particular, there is no dynamics of y. Rather the manager sets y from outside of the system.

B. Algorithm overview

For simplicity, from now on we restrict attention to the case where the preference order of the manager does not depend on the full future sequence through the space of games, involving fixed points, discounted sums of rewards, or something similar. Rather, like the players, at every t the manager is only interested in optimizing (the expectation of) an associated utility function of θ_{t+1}. This restriction means we do not need to explicitly consider θ-indexed effective utility functions, game functions, or the like. We write the manager's utility function as G : Θ → ℝ.

In practice, the manager may not know all relevant attributes of the players and/or the rest of the system. In this case the manager must estimate those attributes at run-time. Since the underlying process is Markovian, this means that the manager's problem is one of controlling a partially observable Markov decision process. However since here at every t the manager is only concerned with θ_{t+1} (rather than the whole future sequence of θ's), we adopt a simpler approach.

To begin, parameterize the triple {π, g, R} that fixes Eq. 12 by (ζ, y). As before, y is the set of all parameters affecting the behavior of the players and the system that the manager can set. ζ is a set of other parameters outside of the manager's control that affect the behavior of the players and system, but can be estimated from observational data. These may include in particular parameters that characterize the endogenous behavior of the players. For example, ζ might specify the rationality of some player i (suitably quantified), or if the manager cannot modify g^i, ζ could specify g^i. Other components of ζ might affect multiple players at once, by modifying R, π and/or the parametric dependence of g on y. To formalize this we sometimes write g^{ζ,y}, R^{ζ,y}, and/or π^{ζ,y}.

Any pair (ζ, y) specifies the dynamics over Θ, via Eq. 12. So presuming the manager's estimate of ζ is correct, since ζ is independent of y, the manager can determine how varying y translates into variations in σ̂^{ζ,y}(x_{t+1}, θ_t) ≜ R^{ζ,y}[B^{ζ,y}(θ_t)](x_{t+1}). So the task for the manager is to find the y that maximizes

E^y(G(θ_{t+1}) | θ_t) = ∫ dθ_{t+1} dx_{t+1} G(θ_{t+1}) π^y(θ_{t+1}, θ_t, x_{t+1}) ∏_{i∈𝒩} σ̂^{ζ,y,i}(x^i_{t+1}, θ_t).     (13)

Often the manager will not be able to find this optimal y. This may be due to ignorance of some of the distributions, computational limitations, inability to estimate some relevant components of θ_t, etc. More generally, it may be that the players are actually in a partial state information scenario, but the manager cannot solve for the y that optimizes E^{ζ,y}(G_{t+1} | w_t) (e.g., due to ignorance of Ω, or of w_t).

In such situations we approximate how the joint behavioral strategy and system dynamics depend on y and ζ with an equilibrium model. This is a pair of counterfactual game and (perhaps bounded rational) refinement functions, B̄^{ζ,y} and R̄^{ζ,y}. Our presumption is that if we change the integrand in Eq. 13 to involve those functions, the resultant dependence of E^{ζ,y}(G_{t+1} | θ_t) on (ζ, y) accurately approximates the true dependence.

In practice, the manager might also estimate σ̂^{ζ,y′}(X_{t+1}, θ_t) for the current t from observations. Doing that allows the manager to avoid evaluating R̄^{ζ,y}[B̄^{ζ,y}(θ_t)], which typically would require solving a coupled set of equations. Intuitively, Nature solves the equations on behalf of the manager.⁶ In this situation, the manager only has to solve for how the solution R̄^{ζ,y}[B̄^{ζ,y}(θ_t)] would change if y were to change (for the given ζ). As illustrated below, this may reduce to the manager's estimating how the integrand in Eq. 13 varies with y, for the given ζ. From now on we suppress the ζ superscript.

⁶ Sometimes the manager can even solicit σ̂^{ζ,y′}(X_{t+1}, θ_t) from the players.

C. Quantal Response Equilibria

We are interested in modeling players that are "bounded rational", i.e., who want to maximize their expected utilities but are not able to do so. A popular model for this situation is the Quantal Response Equilibrium (QRE) [10], [11]. Under the QRE, the mixed strategy of player i is a Boltzmann distribution over her move-conditioned expected utilities:

σ^i(x^i_{t+1}, θ_t) ≜ e^{β^i E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t)} / N^i(θ_t)     (14)

where N^i is the associated normalization constant,

N^i(θ_t) ≜ ∫ dx^i_{t+1} e^{β^i E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t)}.     (15)

If at each t the objectives of the players involve future trajectories through Θ rather than just (as in this paper) θ_{t+1}, then the exponents in the QRE equations should be changed accordingly. Those exponents should also be changed if we are in a partial state information setting rather than an exact state information setting (by replacing each E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t), given by Eq. 6, with E^y(g^{y,i}_{t+1} | x^i_{t+1}, w^i_t), given by Eq. 10).

Note that the QRE mixed strategy for player i depends on the mixed strategies of the other players, through E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t). So the QRE is a set of coupled simultaneous equations. Brouwer's fixed point theorem guarantees that there is always at least one σ that solves this set of equations. Moreover, in the limit of β^i → ∞, the QRE σ^i places zero probability mass on any move that doesn't maximize i's expected utility. Accordingly, as β^i → ∞ for every player i, the QRE approaches a NE [10].

For each i, the associated QRE equations can be expanded as the |X^i| + 1 equations

σ^i(x^i_{t+1}, θ_t) − e^{β^i E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t)} / N^i(θ_t) = 0    ∀ x^i_{t+1}     (16)

N^i(θ_t) − ∫ dx^i_{t+1} e^{β^i E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t)} = 0     (17)

where for all i,

E^y(g^{y,i}_{t+1} | x^i_{t+1}, θ_t) = ∫ dθ_{t+1} dx^{−i}_{t+1} g^{y,i}(θ_{t+1}) π^y(θ_{t+1}, θ_t, x^i_{t+1}, x^{−i}_{t+1}) ∏_{j≠i} σ^j(x^j_{t+1}, θ_t).     (18)

By plugging Eq. 18 into Eqs. 16, 17 and running over all players i, we specify the QRE as a set of M ≡ N + Σ_{i∈𝒩} |X^i| coupled simultaneous equations.
For fixed θ_t, there are a total of M unknowns in those equations: the N normalization factors {N^i} together with the Σ_{i∈𝒩} |X^i| mixed strategy components of the players, {σ^i(x^i_{t+1}, θ_t) : i ∈ 𝒩, x^i_{t+1} ∈ X^i}. We can condense those M equations in M unknowns into the following equation:

f(σ, N, y) = 0     (19)

where σ is the vector of Σ_{i∈𝒩} |X^i| probabilities {σ^i(x^i_{t+1}, θ_t)}, N is the vector of N normalization factors, 0 is the M-dimensional vector of all 0's, and f is an M-dimensional vector-valued function. For any y, the solution to Eq. 19 for σ and N gives us σ̂^{ζ,y} and the associated values {N^i}, respectively.

For simplicity, we model the player interaction as being a QRE for some suitable set of β^i's. (Exploring more sophisticated models is the subject of future work.) The set of the β^i's of all the players will comprise ζ in our experiments. In addition, every g^y will be independent of y; only π depends on y. The resultant dependence of the player mixed strategies on y is captured in Eq. 19.

D. Moving the QRE fixed point

The manager's expected utility is given by Eq. 13 where each σ̂^{ζ,y,i} is given by Eq. 19. For a given ζ, the task of the manager is to move y so that the resultant σ solving Eq. 19 optimizes E^y(G(θ_{t+1}) | θ_t) as given by Eq. 13. In more detail, as the manager changes y, he changes the values in Eq. 18, which then changes the probabilities in Eq. 14. That in turn changes expected G, according to Eq. 13. The manager wants to search over y's to maximize this ensuing value of expected G. The manager can do this using a gradient descent over expected G based on the following equation:
∂/∂y E^y(G(θ_{t+1}) | θ_t) = ∫ dθ_{t+1} dx_{t+1} G(θ_{t+1}) [ ∂π^y(θ_{t+1}, θ_t, x_{t+1})/∂y ] ∏_{i∈𝒩} σ̂^{ζ,y,i}(x^i_{t+1}, θ_t)
    + ∫ dθ_{t+1} dx_{t+1} G(θ_{t+1}) π^y(θ_{t+1}, θ_t, x_{t+1}) ∂/∂y [ ∏_{i∈𝒩} σ̂^{ζ,y,i}(x^i_{t+1}, θ_t) ].     (20)

The first integral is what the manager's estimate of the gradient of expected G would be if the manager were a "conventional controller", who presumes that σ̂^{ζ,y,i}(x^i_{t+1}, θ_t) is some stationary distribution. The second integral is the correction term introduced if the manager accounts for the fact that the algorithms setting each x^i are actually adaptive players who (under the QRE model of their mutual adaptation) obey Eq. 19.

To compute the integrand terms in Eq. 20 involving partial derivatives of the σ̂^{ζ,y,i}'s, expand both sides of Eq. 19 to first order in y (i.e., use implicit differentiation):

[ ∂σ̂^{ζ,y}/∂y ; ∂N/∂y ] = − [ ∂f/∂σ̂^{ζ,y}   ∂f/∂N ]^{−1} ∂f/∂y.     (21)

(Note that in general y is a vector, so for example ∂f/∂y, ∂σ̂^{ζ,y}/∂y, and ∂N/∂y are all matrices.) The solution to this equation gives us the partial derivatives we need to evaluate Eq. 20.

Given these partial derivatives, we employ conjugate gradient descent to update y. An alternative is to use Newton's method; to do that one needs to compute the Hessian of the QRE probabilities with respect to y, which can be done by differentiating the solution for ∂σ̂^{ζ,y}/∂y given by solving Eq. 21.

Note that in practice, when running this algorithm we can ask the players (or observe their behavior) to determine their joint mixed strategy for the current y, σ̂^{ζ,y}. So we don't have to solve the fixed point equation giving the QRE. Our descent algorithms only require that we can predict the dependence of the position of the QRE on y. That means we only need to know how π and the player utilities depend on y.

E. Experimental details

In our experiments the players actually face a partial information scenario, where w^i_t is player i's history of moves and rewards. However we cannot write down Ω tractably, and therefore we instead approximate it, and its ramifications, with an exact information equilibrium model. It is that model that we then manage. We also assume that whatever the refinement R is for the QRE that approximates our adaptive controllers, it is a smooth function of y.

To illustrate the various concepts outlined, we consider a simple problem of controlling a satellite in ℝ². The satellite has two controllers (players) that each fire a thruster from a set of four thrusters assigned to each of them. The resultant displacement of the satellite is given by the vector addition of the two thrusts from the two thrusters fired by the controllers. In these experiments, the dynamics of the satellite corresponds to π^y(θ_{t+1}, θ_t, x_{t+1}), where θ_t represents the current state of the satellite in ℝ², and x_{t+1} is the two thrust vectors that are chosen by the two controllers to fire at the next time step. The eight possible thrust vectors are assigned an initial set of angles. Figure 1 illustrates the satellite with the thrusters. Each controller has its individual goal point to which it wishes to move the satellite. The manager has its own objective for moving the satellite.

Fig. 1. Satellite Control Example (the satellite and the possible thrust vectors of Controller 1 and Controller 2, plotted in the (θ1, θ2) plane).

Given that the two controllers are allowed to fire one thruster each simultaneously, we implement the evolution of the satellite's trajectory as a repeated game. For this experiment, we realized the two controllers as Boltzmann reinforcement learners. Each of the four moves of each controller has an associated utility value. As the system trajectory evolves, the controllers update the utilities associated with each move based on how close the moves get them to their individual goals. At every time step, the controllers assign Boltzmann probabilities to the moves based on the utilities associated with those moves. Thus, moves with higher utilities are given a higher probability, and moves with lower utilities are given a lower probability. To guarantee exploration of all moves, including those with low utilities, the probabilities of the individual moves have a set lower limit. If the equilibrium probabilities specified by the Boltzmann distribution fall below this limit, they are reset to this minimum threshold, and the probabilities of the remaining moves are renormalized to sum to unity. This guarantees exploration of the move space along with exploitation.

The reinforcement learners also use data-aging techniques to give more weight to recent data versus old data. In these experiments, we used exponential weighting to update the utility values associated with each move. Thus, the utility value for a particular move is given by a weighted sum of the past utilities that the controller observes for that move. The exponents of the weighting term are a function of state and time. So an observed utility value that is based on a system state closer to the current state, and not very far back in time, is given more weight than a utility for a move that was taken at a state further away from the current state and farther back in time.

The manager, observing the behavior of the two controllers and the evolution of the system states, needs to deduce a model of the controllers' interaction. For the current experiment, the manager allows the controllers to operate without updating the thruster angles. The manager then captures the middle section of the trajectory along with the associated moves of the two controllers. The manager then estimates the rationality (β^i) of the two controllers by optimizing the log-likelihood objective function that maximizes the equilibrium probabilities of a QRE model given the observed moves. The manager, now having a model of the interaction between the two controllers, updates the thruster angles. This update is carried out every five simulation steps. So the learning controllers operate for five simulation steps, at which point the manager updates the thruster angles. In these experiments we employed Newton's method to update the thruster angles. Every update corresponds to multiple internal update steps, where the manager keeps updating the angles until it can no longer increase its objective function beyond a certain prechosen threshold value.
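As a rough illustration of the rationality-estimation step, the sketch below fits a single β to observed moves by maximizing the log-likelihood of a logit (QRE-style) choice model over a grid. The utilities, simulated observations and grid search are assumptions introduced for illustration; the paper's manager instead fits the β^i of both controllers simultaneously against the full QRE model of their interaction.

```python
import numpy as np

rng = np.random.default_rng(3)
n_moves = 4
utilities = rng.random(n_moves)               # a controller's current expected move utilities
true_beta = 3.0

def logit_probs(beta, u):
    z = np.exp(beta * (u - u.max()))          # subtract max for numerical stability
    return z / z.sum()

# Simulated observed moves (stand-in for the captured middle section of the trajectory).
obs = rng.choice(n_moves, size=200, p=logit_probs(true_beta, utilities))

def log_likelihood(beta):
    p = logit_probs(beta, utilities)
    return np.log(p[obs]).sum()

# Simple grid search over beta.
betas = np.linspace(0.0, 10.0, 1001)
beta_hat = betas[np.argmax([log_likelihood(b) for b in betas])]
print(round(float(beta_hat), 2))              # recovered rationality estimate
```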

In this work, we consider two cases. The first case illustrates the manager aiding the two players in achieving their common goal. In this case, the initial thruster angles for both sets of four thrusters are given to be (85°, −85°, 95°, −95°). With these angles, the thrusters have limited capability in moving the satellite along the θ1 axis. Here we set up the situation where both the controllers wish to move the satellite to the position (2,2). Figures 2 and 3 illustrate the effect of the manager, where the manager's goal is also set up to be at (2,2). Without the manager, the trajectory takes a very long time to move along the θ1 axis. With the manager active, the thruster angles are updated to move the satellite quickly to the desired position.

Fig. 2. Comparison of the Trajectories without (Blue) and with (Red) Active Management of the Interaction between the Learning Controllers for Case 1 (trajectories in the (θ1, θ2) plane).

Fig. 3. Comparison of the Manager's Objective Function without (Blue) and with (Red) Active Management of the Interactions between the Learning Controllers for Case 1 (manager's objective function versus time index).

In the second case, the thruster angles are given to be (0°, 90°, 180°, 270°). Thus, both controllers have full controllability to move the satellite in the ℝ² space. However, now controller 1's desired position is (2,2), controller 2's desired position is (2,−2), while the manager has a different goal (0,0) from these individual goals of the two controllers. Figures 4 and 5 again illustrate the trajectory of the satellite with and without active management along with the corresponding values of the manager's objective function for this case. We note that in both cases, the manager is successful in achieving its objective.

Fig. 4. Comparison of the Trajectories without (Blue) and with (Red) Active Management of the Interaction between the Learning Controllers for Case 2 (trajectories in the (θ1, θ2) plane).

Fig. 5. Comparison of the Manager's Objective Function without (Blue) and with (Red) Active Management of the Interactions between the Learning Controllers for Case 2 (manager's objective function versus time index).
0 4 and 5 again illustrate the trajectory of the satellite with 0 100 200 300 400 500 600 700 800 900 1000 time index and without active management along with the corresponding values of the manager’s objective function for this case. We Fig. 3. Comparison of the Manager’s Objective Function without (Blue) note that in both the cases, the manager is successful in and with (Red) Active Management of the Interactions between the Learning Controllers for Case 1 achieving its objective. R In this work, we consider two cases. The first case illustrates the manager aiding the two players in achieving their common [1] T. Back, D. B. Fogel, and Z. Michalewicz (eds.), Handbook of evolu- tionary computation, Oxford University Press, 1997. goal. In this case, the initial thruster angles for both sets [2] S. Bieniawski, I. Kroo, and D. H. Wolpert, Flight Control with Dis- of four thrusters are given to be (85◦, −85◦, 95◦, −95◦) With tributed Effectors, Proceedings of 2005 AIAA Guidance, Navigation, these angles, the thrusters have limited capability in moving and Control Conference, San Francisco, CA, 2005, AIAA Paper 2005- 6074. the satellite along the θ1 axis. Here we set up the situation [3] K. Chellapilla and D.B. Fogel, Evolution, neural networks, games, and where both the controllers wish to move the satellite to the intelligence, Proceedings of the IEEE (1999), 1471–1496. [4] S. Choi and J.J. Alonso, Multi-fidelity design optimization of low-boom supersonic business jet, Proceedings of 10th AIAA/ISSMO Multidisci- plany Analysis and Optimization Conference, 2004, AIAA Paper 2004- 4371. [5] E.J. Cramer, J.E. Dennis, and et alia, Problem formulation for multidis- ciplinary optimization, SIAM J. of Optimization 4 (1994). [6] J. Ferber, Reactive distributed artificial intelligence: Principles and applications, Foundations of Distributed Artificial Intelligence (G. O- Hare and N. Jennings, eds.), John Wiley and Sons, 1996, pp. 287–314. [7] D. Fudenberg and D. K. Levine, Steady state learning and Nash equilibrium, Econometrica 61 (1993), no. 3, 547–573. [8] D. Fudenberg and J. Tirole, Game theory, MIT Press, Cambridge, MA, 1991. [9] H. Hwang, J. Kim, and C. Tomlin, Protocol-based conflict resolution for air traffic control, Air Traffic Control Quarterly (2007), in press. [10] R. D. McKelvey and T. R. Palfrey, Quantal response equilibria for normal form games, Games and Economic Behavior 10 (1995), 6–38. [11] , Quantal response equilibria for extensive form games, Exper- imental Economics 1 (1998), 9–41. [12] Roger B. Myerson, Game theory: Analysis of conflict, Harvard Univer- sity Press, 1991. [13] N. Nisan and A. Ronen, Algorithmic mechanism design, Games and Economic Behavior 35 (2001), 166–196. [14] D. Rajnarayan and David H. Wolpert, Exploiting parametric learning to improve black-box optimization, Proceedings of ECCS 2007 (J. Jost, ed.), 2007. [15] A. Schaerf, Y. Shoham, and M. Tennenholtz, Adaptive load balancing: A study in multi-agent learning, Journal of Artificial Intelligence Research 162 (1995), 475–500. [16] J.S. Shamma and G. Arslan, Dynamic fictitious play, dynamic gradient play, and distributed convergence to nash equilibria, IEEE Trans. on Automatic Control 50 (2004), no. 3, 312–327. [17] D. H. Wolpert, - the bridge connecting bounded rational game theory and statistical physics, Complex Engineered Sys- tems: Science meets technology (D. Braha, A. Minai, and Y. Bar-Yam, eds.), Springer, 2004, pp. 262–290. [18] , Predicting the outcome of a game, Submitted. 
[19] D. H. Wolpert and S. Bieniawski, Distributed control by Lagrangian steepest descent, Proc. of the 2004 IEEE Control and Decision Conf., 2004, pp. 1562–1567.
[20] D. H. Wolpert, C. E. M. Strauss, and D. Rajnarayan, Advances in distributed optimization using probability collectives, Advances in Complex Systems (2006), in press.
[21] D. H. Wolpert and K. Tumer, Collective intelligence, data routing and Braess' paradox, Journal of Artificial Intelligence Research 16 (2002), 359–387.
[22] D. H. Wolpert, K. Tumer, and E. Bandari, Improving search by using intelligent coordinates, Physical Review E 69 (2004), no. 017701.