Minimax Sample Complexity for Turn-based Stochastic Game

Qiwen Cui¹   Lin F. Yang²

¹School of Mathematical Sciences, Peking University   ²Electrical and Computer Engineering Department, University of California, Los Angeles

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

The empirical success of multi-agent reinforcement learning is encouraging, while few theoretical guarantees have been revealed. In this work, we prove that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSG). Specifically, we perform planning in an empirical TBSG by utilizing a 'simulator' that allows sampling from an arbitrary state-action pair. We show that the empirical Nash equilibrium strategy is an approximate Nash equilibrium strategy in the true TBSG and give both problem-dependent and problem-independent bounds. We develop reward perturbation techniques to tackle the non-stationarity in the game and a Taylor-expansion-type analysis to improve the dependence on the approximation error. With these novel techniques, we prove the minimax sample complexity of turn-based stochastic games.

1 INTRODUCTION

Reinforcement learning (RL) [Sutton and Barto, 2018], a framework where agents learn to make sequential decisions in an unknown environment, has received tremendous attention. An interesting branch is multi-agent reinforcement learning (MARL), in which multiple agents interact with the environment as well as with each other, which bridges RL and game theory. In general, each agent attempts to maximize its own reward by utilizing the data collected from the environment and also inferring other agents' strategies. Impressive successes have been achieved in games such as backgammon [Tesauro, 1995], Go [Silver et al., 2017] and strategy games [Ye et al., 2020]. MARL has shown the potential for superhuman performance, but theoretical guarantees are rather rare due to the complex interaction between agents, which makes the problem considerably harder than single-agent reinforcement learning. This is also known as non-stationarity in MARL: when multiple agents alter their strategies based on previously collected samples, the system becomes non-stationary for each agent and improvement cannot be guaranteed. One fundamental question in MARL is how to design efficient algorithms to overcome non-stationarity.

Two-player turn-based stochastic game (TBSG) is a two-agent generalization of the Markov decision process (MDP), where two agents choose actions in turn and one agent wants to maximize the total reward while the other wants to minimize it. As a zero-sum game, TBSG is known to have a Nash equilibrium strategy [Shapley, 1953], which means there exists a strategy pair such that neither agent benefits from changing its strategy alone, and our target is to find the (approximate) Nash equilibrium strategy. Dynamic-programming-type algorithms are a basic but powerful approach to solving TBSG. Strategy iteration, a counterpart of policy iteration in MDP, is known to be a polynomial-complexity algorithm for solving TBSG with a known transition kernel [Hansen et al., 2013, Jia et al., 2020]. However, these algorithms suffer from high computational cost and require full knowledge of the transition dynamics. Reinforcement learning is a promising alternative, which has demonstrated its potential in solving sequential decision-making problems. However, non-stationarity poses a large obstacle against the convergence of model-free algorithms. To tackle this challenge, sophisticated algorithms have been proposed [Jia et al., 2019, Sidford et al., 2020]. In this work, we focus on another promising but insufficiently developed method, model-based algorithms, where agents learn a model of the environment and plan in the empirical model.

Model-based RL has long been perceived to be the cure to the sample inefficiency of RL, which is also justified by empirical advances [Kaiser et al., 2019, Wang et al., 2019]. However, the theoretical understanding of model-based RL is still far from complete. Recently, a line of research focuses on analyzing model-based algorithms under the generative model setting [Azar et al., 2013, Agarwal et al., 2019, Cui and Yang, 2020, Zhang et al., 2020, Li et al., 2020].

In this work, we aim to prove that the simplest model-based algorithm, the plug-in solver approach, enjoys minimax sample complexity for TBSG by utilizing novel absorbing TBSG and reward perturbation techniques.

Specifically, we assume that we have access to a generative model oracle [Kakade et al., 2003], which allows sampling from an arbitrary state-action pair. It is known that the exploration-exploitation tradeoff is simplified under this setting, i.e., sampling from each state-action pair equally can already yield the minimax sample complexity result. With the generative model, the most intuitive approach is to learn an empirical TBSG and then use a planning algorithm to find the empirical Nash equilibrium strategy as the solution, which separates learning and planning. This kind of algorithm is known as the 'plug-in solver approach', which means an arbitrary planning algorithm can be used. We will show that this simple plug-in solver approach enjoys minimax sample complexity. For MDP, this line of research was started in the seminal work of Azar et al. [2013]. In the recent work [Li et al., 2020], they almost completely solve this problem by proving the minimax complexity for the full range of error ε. However, the corresponding result for TBSG is still unknown, largely due to the interaction between agents. In this work, we give sample complexity upper bounds for TBSG in both the problem-dependent and the problem-independent cases.

The suboptimality gap is a widely studied notion in bandit theory, which stands for a constant gap between the optimal arm and the second optimal arm. This notion has received increasing focus in MDP and we generalize it to TBSG. To start with, we prove that Õ(|S||A|(1 − γ)^{-3}∆^{-2}) samples are enough to recover the Nash equilibrium strategy exactly for ∆ ∈ (0, (1 − γ)^{-1/2}], where S is the state space, A is the action space, γ is the discount factor and ∆ is the suboptimality gap. The key analysis tool is the absorbing TBSG technique, which helps to show that the empirical optimal Q-value is close to the true optimal Q-value. With the absorbing TBSG, the statistical dependence on P̂(s, a) is moved to a single parameter u, which can be approximated by a covering argument. Therefore, standard concentration arguments combined with a union bound can be applied. The suboptimality gap plays a critical role in showing that the empirical Nash equilibrium strategy is exactly the same as the true Nash equilibrium strategy.

The main contribution is in the second part, where we give our problem-independent bound, which meets the existing lower bound Ω(|S||A|(1 − γ)^{-3}ε^{-2}) [Azar et al., 2013, Sidford et al., 2020]. Note that the problem-dependent result becomes meaningless when the suboptimality gap is sufficiently small, and the gap may not even exist. In the problem-independent case, we develop the reward perturbation technique to create such a gap in the estimated TBSG, which is inspired by the technique developed in [Li et al., 2020]. The key observation is that if r(s, a) increases, Q^∗(s, a) increases much faster than Q^∗(s, a') for a' ≠ a, thus the perturbed TBSG enjoys a suboptimality gap with high probability. Different from its usage in the first part, here the suboptimality gap is used to ensure that the empirical Nash equilibrium strategy lies in a finite set so that a union bound can be applied. Combining the reward perturbation technique with the absorbing TBSG technique, we are able to prove more subtle concentration arguments and finally show that the empirical Nash equilibrium strategy is an ε-approximate Nash equilibrium strategy in the true TBSG with minimax sample complexity Õ(|S||A|(1 − γ)^{-3}ε^{-2}) for ε ∈ (0, (1 − γ)^{-1}]. In addition, we can recover the problem-dependent bound by simply setting ε = ∆.

Recently, Zhang et al. [2020] propose a similar result for simultaneous stochastic games. However, their result requires a planning oracle for a regularized stochastic game, which is computationally intractable. Our algorithm only requires planning in a standard turn-based stochastic game, which can be performed efficiently by utilizing strategy iteration [Hansen et al., 2013] or learning algorithms [Sidford et al., 2020]. In addition, their sample complexity result N = Õ(|S||A|(1 − γ)^{-3}ε^{-2}) only holds for ε ∈ (0, (1 − γ)^{-1/2}], which means it requires N = Ω̃(|S||A|(1 − γ)^{-2}). Our result fills the blank in the sample region Õ(|S||A|(1 − γ)^{-1}) ≤ N ≤ Õ(|S||A|(1 − γ)^{-2}). Note that the lower bound [Azar et al., 2013] indicates that N = O(|S||A|(1 − γ)^{-1}) is insufficient to learn a policy.

As both parts of our analysis heavily rely on the suboptimality gap, we hope our work can provide more understanding about this notion and about TBSG.

2 PRELIMINARY

Turn-based Stochastic Game  Turn-based two-player zero-sum stochastic game (TBSG) is a generalized version of the Markov decision process (MDP) which includes two players competing with each other. Player 1 aims to maximize the total reward while player 2 aims to minimize it. TBSG is described by the tuple G = (S = S_max ∪ S_min, A, P, r, γ), where S_max is the state space of player 1, S_min is the state space of player 2, A is the action space of both players, P ∈ R^{|S||A|×|S|} is the transition probability matrix, r ∈ R^{|S||A|} is the reward vector and γ is the discount factor. In each step, only one player plays an action and a transition happens. For instance, if the state s ∈ S_max, player 1 needs to select an action a ∈ A. After selecting the action, the state transits to s' ∈ S_min according to the distribution P(·|s, a) with reward r(s, a), and player 2 then chooses the next action. For simplicity of presentation and without loss of generality, we assume that r is known and only P is unknown.¹

¹Our proof can be easily adapted to show that the sample complexity of learning the reward r is an order of 1/(1 − γ) smaller than that of learning the transition P.
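To fix ideas, here is one minimal way the tuple G = (S_max ∪ S_min, A, P, r, γ) might be represented in code. This is an illustrative sketch only; the container and field names are not notation from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TBSG:
    """Illustrative container for the tuple G = (S_max ∪ S_min, A, P, r, gamma).

    States are indexed 0..|S|-1; is_max_state[s] says whether state s is
    controlled by player 1 (the maximizer).  P has shape (|S|, |A|, |S|) and
    r has shape (|S|, |A|).
    """
    P: np.ndarray             # transition kernel: P[s, a] is a distribution over next states
    r: np.ndarray             # reward vector r(s, a)
    gamma: float              # discount factor
    is_max_state: np.ndarray  # boolean mask: True if the state belongs to player 1

    @property
    def num_states(self) -> int:
        return self.P.shape[0]

    @property
    def num_actions(self) -> int:
        return self.P.shape[1]
```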

We denote a strategy pair as π := (µ, ν), where µ is the strategy of player 1 and ν is the strategy of player 2. Given a strategy π, the value function and Q-function can be defined similarly as in MDP:

V^π(s) := E[ Σ_{t=0}^∞ γ^t r(s^t, π(s^t)) | s^0 = s ],

Q^π(s, a) := E[ r(s^0, a^0) + Σ_{t=1}^∞ γ^t r(s^t, π(s^t)) | s^0 = s, a^0 = a ] = r(s, a) + γP(s, a)V^π.

From the perspective of player 1, if the strategy ν of player 2 is given, TBSG degenerates to an MDP, so the optimal policy against ν exists, which we call the counterstrategy and denote by c_max(ν). Similarly, we define c_min(µ) as the counterstrategy of µ for player 2. For simplicity, we drop the subscripts of c_max and c_min when they are clear from the context. In addition, we define V^{∗,ν} := V^{c(ν),ν} and V^{µ,∗} := V^{µ,c(µ)}, and the same for Q. By definition and the property of the optimal policy in MDP, we have

Q^{∗,ν}(s, a) = max_µ Q^{µ,ν}(s, a), ∀(s, a),
Q^{µ,∗}(s, a) = min_ν Q^{µ,ν}(s, a), ∀(s, a),
Q^{∗,ν}(s, c_max(ν)(s)) = max_{a'} Q^{∗,ν}(s, a'), ∀s ∈ S_max,
Q^{µ,∗}(s, c_min(µ)(s)) = min_{a'} Q^{µ,∗}(s, a'), ∀s ∈ S_min.

Note that these are the sufficient and necessary conditions of the counterstrategy, which will be utilized repeatedly in our analysis.

To solve a TBSG, our goal is to find the Nash equilibrium strategy π^∗ = (µ^∗, ν^∗), where µ^∗ = c(ν^∗) and ν^∗ = c(µ^∗). For the Nash equilibrium strategy, neither player can benefit from changing its policy alone. As µ^∗ and ν^∗ are counterstrategies to each other, they inherit the properties of the counterstrategy given above; for simplicity, we do not repeat them here. It is well known that in TBSG there always exists a pure strategy that is a Nash equilibrium strategy. In addition, all Nash equilibrium strategies share the same state-action value, which makes the pure Nash equilibrium strategy unique given some tie-selection rule, so we only consider pure strategies in our analysis.

Specifically, our target is to find an ε-approximate Nash equilibrium strategy π = (µ, ν) such that

|Q^{µ,∗}(s, a) − Q^∗(s, a)| ≤ ε, ∀(s, a),
|Q^{∗,ν}(s, a) − Q^∗(s, a)| ≤ ε, ∀(s, a),

for some ε > 0, with as few samples as possible. Note that this is different from and stronger than the MDP analogue, which would be |Q^π(s, a) − Q^∗(s, a)| ≤ ε. This slight difference creates subtle difficulties, as we will show later.

Generative Model Oracle and Plug-in Solver Approach  We assume that we have access to a generative model, where we can input an arbitrary state-action pair (s, a) and receive a state sampled from P(·|s, a). The generative model oracle was introduced in [Kearns and Singh, 1999, Kakade et al., 2003]. This setting is different from the offline oracle, where we can only sample trajectories via a behaviour policy, and from the online oracle, where we adaptively change the policy to explore and exploit. The advantage of the generative model setting is that exploration and exploitation are simplified, as previous work shows that treating all state-action pairs equally is already optimal [Azar et al., 2013, Sidford et al., 2018]. In particular, we call the generative model N/|S||A| times on each state-action pair and construct the empirical TBSG Ĝ = (S = S_max ∪ S_min, A, P̂, r, γ):

P̂(s'|s, a) = count(s, a, s') / (N/|S||A|), ∀s, s' ∈ S_max ∪ S_min, a ∈ A,

where count(s, a, s') is the number of times that s' is sampled from the state-action pair (s, a). It is straightforward that P̂ is an unbiased and maximum likelihood estimate of the true transition kernel P. We use π̂^∗ = (µ̂^∗, ν̂^∗) to denote the Nash equilibrium strategy in the empirical TBSG, as well as V̂ and Q̂. The algorithm is given in Algorithm 1.

Algorithm 1: Solving TBSG via Plug-in Solver
Input: A generative model that can output samples from distribution P(·|s, a) for any query (s, a); a plug-in solver.
Initial: Sample size N.
for (s, a) in S × A do
    Collect N/|S||A| samples from P(·|s, a);
    Compute P̂(s'|s, a) = count(s, a, s')/(N/|S||A|);
end
Construct the (perturbed) empirical TBSG Ĝ = (S, A, P̂, r, γ) (Ĝ_p = (S, A, P̂, r_p, γ));
Plan with an arbitrary plug-in solver and receive the (perturbed) empirical Nash equilibrium strategy π̂^∗ (π̂^∗_p);
Output: π̂^∗ (π̂^∗_p)

As the transition kernel of Ĝ is known, an arbitrary planning algorithm can be used to find the empirical Nash equilibrium strategy π̂^∗. One choice is strategy iteration, which finds π̂^∗ within Õ(|S||A|(1 − γ)^{-1}) iterations [Hansen et al., 2013]. In addition, algorithms that find an approximate Nash equilibrium strategy can be applied, such as QVI-MIVSS in [Sidford et al., 2020]. Note that our analysis is for the exact Nash equilibrium strategy π̂^∗, but it can be generalized to approximate Nash equilibrium strategies by using the techniques in [Agarwal et al., 2019].
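The estimation step of Algorithm 1 is simply one empirical-frequency estimate per state-action pair. Below is a minimal Python sketch of this step; the sample_next_state(s, a) callable and the plan solver are illustrative placeholders for the generative model and the plug-in planner, not interfaces defined in the paper.

```python
import numpy as np

def estimate_kernel(sample_next_state, num_states, num_actions, n_total):
    """Estimation step of Algorithm 1 (sketch): call the generative model
    N/(|S||A|) times for each state-action pair and return the empirical
    kernel P_hat of shape (|S|, |A|, |S|).

    `sample_next_state(s, a)` is an assumed interface for the generative
    model: it returns one next-state index drawn from P(.|s, a).
    Assumes n_total >= |S||A| so that every pair gets at least one sample.
    """
    m = n_total // (num_states * num_actions)   # samples per state-action pair
    counts = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(m):
                counts[s, a, sample_next_state(s, a)] += 1
    return counts / m                           # unbiased maximum-likelihood estimate

# The planning step is left to an arbitrary plug-in solver, e.g. (hypothetical call)
#   P_hat = estimate_kernel(sample_next_state, S, A, N)
#   mu_hat, nu_hat = plan(P_hat, r, gamma)   # any TBSG planner, e.g. strategy iteration
```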
Suboptimality Gap  The suboptimality gap originates in bandit theory, where it is the gap between the mean reward of the best arm and the second best arm. In TBSG, we define the suboptimality gap based on the optimal Q-value and the second optimal Q-value.

Definition 1. (Suboptimality gap for the Nash equilibrium strategy) A TBSG enjoys a suboptimality gap of ∆ for the Nash equilibrium strategy if and only if

∀s ∈ S_max, a ≠ µ^∗(s): Q^∗(s, µ^∗(s)) − Q^∗(s, a) ≥ ∆,
∀s ∈ S_min, a ≠ ν^∗(s): Q^∗(s, ν^∗(s)) − Q^∗(s, a) ≤ −∆,

and if there are two optimal actions at some state, the gap is zero.

Definition 2. (Suboptimality gap for the counterstrategy) A TBSG enjoys a suboptimality gap of ∆ for the counterstrategy to the strategy ν of player 2 if and only if

∀s ∈ S_max, a ≠ c(ν)(s): Q^{∗,ν}(s, c(ν)(s)) − Q^{∗,ν}(s, a) ≥ ∆.

A TBSG enjoys a suboptimality gap of ∆ for the counterstrategy to the strategy µ of player 1 if and only if

∀s ∈ S_min, a ≠ c(µ)(s): Q^{µ,∗}(s, c(µ)(s)) − Q^{µ,∗}(s, a) ≤ −∆.

The suboptimality gap means that, following the Nash equilibrium strategy, the expected total rewards of the best action and the second best action differ by at least ∆. Intuitively, this gap quantifies the difficulty of learning the optimal action, and a small gap hinders finding the optimal action.
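As a concrete reading of Definition 1, the following sketch computes the gap from a given table of optimal values Q^∗. The array names and the boolean mask distinguishing maximizing from minimizing states are illustrative, not notation from the paper.

```python
import numpy as np

def suboptimality_gap(Q_star, is_max_state):
    """Gap of Definition 1 from an optimal value table (sketch).

    Q_star has shape (|S|, |A|), assumed to have at least two actions.
    For a maximizing state the per-state gap is Q*(s, best) - Q*(s, 2nd best);
    for a minimizing state it is Q*(s, 2nd smallest) - Q*(s, smallest).
    The TBSG gap is the minimum over states (zero if some state has ties).
    """
    gaps = []
    for s in range(Q_star.shape[0]):
        q = np.sort(Q_star[s])                       # ascending
        gaps.append(q[-1] - q[-2] if is_max_state[s] else q[1] - q[0])
    return min(gaps)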

Notations  f(x) = O(g(x)) means that there exists a constant C such that f ≤ Cg, and f(x) = Ω(g(x)) means that g(x) = O(f(x)). Õ and Ω̃ are the same as O and Ω except that logarithmic factors are ignored. We use f ≳ g to denote that there exists some constant C such that f ≥ Cg, and f ≲ g means f ≤ Cg for some constant C. For a strategy π, P^π ∈ R^{|S||A|×|S||A|} is the transition matrix induced by the policy π, with P^π((s, a), (s', a')) = P(s'|s, a)1(a' = π(s')), where 1(·) is the indicator function. P(s, a) is the row vector of P that corresponds to (s, a). We use | · | to denote the infinity norm ‖ · ‖_∞, and √·, ≤, ≥ are entry-wise operators.

3 TECHNICAL LEMMAS FROM MDP

In this section, we present several technical lemmas that originate in MDP analysis [Azar et al., 2013, Agarwal et al., 2019, Li et al., 2020]. These lemmas can be easily adapted to TBSG, and we present them to give some intuition for our TBSG analysis.

Lemma 1. For any strategy π, we have

Q^π − Q̂^π = γ(I − γP^π)^{-1}(P − P̂)V̂^π = γ(I − γP̂^π)^{-1}(P − P̂)V^π.

This lemma portrays the concentration of Q̂^π. Note that there are two kinds of factorization, and they require different analyses. In the problem-dependent bound we use the first one, and in the problem-independent bound we use the second one.

Definition 3. (One-step Variance) We define the one-step variance of a state-action pair (s, a) with respect to a certain value function V to be

Var_{s,a}(V) := P(s, a)V^2 − (P(s, a)V)^2,

which is the variance of the next-state value. We define Var_P(V) ∈ R^{|S||A|} to be the vector consisting of all Var_{s,a}(V).

We define the one-step variance to facilitate the usage of Bernstein's inequality on the term (P − P̂)V^π. A detailed introduction to variance in MDP can be found in Azar et al. [2013]. The following two lemmas show how to bound the one-step variance term.

Lemma 2. For any policy π with value function V^π in an MDP with transition P, we have

(I − γP^π)^{-1} √(Var_P(V̂^π)) ≤ √(2/(1 − γ)^3) + |Q^π − Q̂^π|/(1 − γ).
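As a sanity check on the identity in Lemma 1 and on Definition 3, the following self-contained snippet verifies the first factorization and computes the one-step variances on a small random model. The matrix layout follows the (s, a)-indexed convention of the Notations paragraph; the snippet is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random "true" and "empirical" transition kernels, a reward vector, and a fixed strategy.
P = rng.dirichlet(np.ones(S), size=(S, A))       # shape (S, A, S)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))
pi = rng.integers(A, size=S)                      # deterministic strategy pi(s)

def induced_matrix(P):
    """P^pi on state-action pairs: entry ((s,a),(s',a')) = P(s'|s,a) 1(a' = pi(s'))."""
    M = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                M[s * A + a, s2 * A + pi[s2]] = P[s, a, s2]
    return M

def q_value(P):
    # Q^pi solves (I - gamma P^pi) Q = r
    return np.linalg.solve(np.eye(S * A) - gamma * induced_matrix(P), r.reshape(-1))

Q, Q_hat = q_value(P), q_value(P_hat)
V_hat = Q_hat.reshape(S, A)[np.arange(S), pi]     # V_hat^pi(s) = Q_hat^pi(s, pi(s))

# Lemma 1, first factorization: Q^pi - Q_hat^pi = gamma (I - gamma P^pi)^{-1} (P - P_hat) V_hat^pi
lhs = Q - Q_hat
rhs = gamma * np.linalg.solve(np.eye(S * A) - gamma * induced_matrix(P),
                              ((P - P_hat) @ V_hat).reshape(-1))
assert np.allclose(lhs, rhs)

# Definition 3: one-step variance Var_{s,a}(V) = P(s,a)V^2 - (P(s,a)V)^2, per (s, a)
var = (P @ V_hat**2) - (P @ V_hat) ** 2
```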

from distribution P (·|s, a) for query (s, a), a π plug-in solver. Lemma 3. For any policy π and V is the value function |S||A|  1  Initial: Sample size: N; in a MDP with transition P , if N & 1−γ log (1−γ)δ , for (s,a) in S × A do with probability larger than 1 − δ, we have Collect N/|S||A| samples from P (·|s, a); 16 0 π −1p π 0 count(s,a,s ) (I − γPb ) V arP (V ) ≤ p . Compute Pb(s |s, a) = N/|S||A| ; (1 − γ)3 end Construct the (perturbed) empirical TBSG; Lemma 2 correspond to the first factorization in Lemma 1 G = (S, A, P , r, γ); b b and Lemma 3 is correspond to the second one. Lemma 2 Gbp = (S, A, Pb , rp, γ); is derived by utilizing the variance-Bellman-equation and Plan with an arbitrary plug-in solver and receive the Lemma 3 is derived by a taylor expansion type analysis. If ∗ π π (perturbed) empirical Nash equilibrium strategy b we can apply Bernstein’s inequality to (Pb − P )Vb or (Pb − (π∗); q bp π π p π ∗ ∗ P )V to generate the V arP (Vb ) or V arP (V ) term, Output: πb (πbp) then with Lemma 2 or Lemma 3, we can bound |Qπ − Qbπ|. By concentration inequalities, we can bound (P −Pb)V π for ∗ ∗ ∗ 3 TECHNICAL LEMMAS FROM MDP fixed π. However, if π = (µb , c(µb )) or π = (µ , bc(µ)), the complex statistical dependence between π and Pb hinders the In this section, we present several technical lemmas that conventional concentration and subtle techniques are needed. is originated in MDP analysis [Azar et al., 2013, Agarwal In addition, (P − Pb)Vb π suffers from the dependence even et al., 2019, Li et al., 2020]. These lemmas can be easily for fixed π. The key contribution in the next two section is adapted to TBSG and we present them to give some intuition to use novel TBSG techniques to make the concentration on our TBSG analysis. arguments applicable to these two terms. 4 WARM UP: PROBLEM-DEPENDENT construct an absorbing TBSG G on empirical TBSG G. Com- e b UPPER BOUND bining Lemma 1 and Lemma 2, we can bound |Qµ,bc(µ) − Qbµ,c(µ)|. In this section, we show that if a TBSG G enjoys a sub- Lemma 5. optimality gap of ∆, then with Oe(|S||A|(1 − γ)−3∆−2) samples, the empirical Nash equilibrium strategy π∗ in G ∗ ∗ µ,c(µ) µ,∗ c(ν),ν ∗,ν b b Q − Qb ≤ max{ Q b − Qb , Qb − Qb }. is exactly the Nash equilibrium strategy π∗ in G with high probability. To begin with, we introduce a novel absorbing TBSG technique, which is motivated by the absorbing MDP technique developed in [Agarwal et al., 2019]. An absorbing Lemma5 shows that we can bound the error of estimate the

4 WARM UP: PROBLEM-DEPENDENT UPPER BOUND

In this section, we show that if a TBSG G enjoys a suboptimality gap of ∆, then with Õ(|S||A|(1 − γ)^{-3}∆^{-2}) samples, the empirical Nash equilibrium strategy π̂^∗ in Ĝ is exactly the Nash equilibrium strategy π^∗ in G with high probability. To begin with, we introduce a novel absorbing TBSG technique, which is motivated by the absorbing MDP technique developed in [Agarwal et al., 2019]. An absorbing TBSG G̃_{s,a,u} is identical to the empirical TBSG Ĝ except that the transition distribution of a specific state-action pair (s, a) is set to be absorbing. Similar techniques have been developed for MDP [Li et al., 2020] and for simultaneous stochastic games [Zhang et al., 2020].

Definition 4. (Absorbing TBSG) For a TBSG G = (S_max, S_min, A, P, r, γ) and a given state-action pair (s, a) ∈ (S_max ∪ S_min) × A, the absorbing TBSG is G̃_{s,a,u} = (S_max, S_min, A, P̃, r̃, γ), where

P̃(s|s, a) = 1,   P̃(·|s', a') = P(·|s', a'), ∀(s', a') ≠ (s, a),
r̃(s, a) = u,   r̃(s', a') = r(s', a'), ∀(s', a') ≠ (s, a).

Remark 1. The absorbing TBSG is independent of P̂(s, a), which makes this a kind of leave-one-out analysis. Note that P̃(s, a) can be set to an arbitrary fixed distribution; we use the absorbing distribution for simplicity and for correspondence with its name.

For simplicity, we omit (s, a) from the absorbing TBSG notation when there is no misunderstanding. We use π̃_u, Q̃_u, Ṽ_u to denote the strategy, state-action value and state value in the absorbing TBSG. These terms actually depend on (s, a) and we omit (s, a) when there is no confusion. Note that G̃_u has no dependence on P̂(s, a), which makes concentration on (P̂(s, a) − P(s, a))Ṽ_u possible. The key property of the absorbing TBSG is that it can recover the Q-value of the empirical TBSG by tuning the parameter u. Moreover, the Q-value of the absorbing TBSG is 1/(1 − γ)-Lipschitz in u, which means we can use an ε-cover on the range of u to approximately recover the Q-value in the empirical TBSG.

Lemma 4. (Properties of the absorbing TBSG) Set u^∗ = r(s, a) + γ(P(s, a)V^∗) − γV^∗(s) and u^µ = r(s, a) + γ(P(s, a)V^{µ,∗}) − γV^{µ,∗}(s). Then we have

Q̃^∗_{u^∗} = Q^∗,   Q̃^{µ,∗}_{u^µ} = Q^{µ,∗},
|Q̃^∗_u − Q̃^∗_{u'}| ≤ |u − u'|/(1 − γ),   |Q̃^{µ,∗}_u − Q̃^{µ,∗}_{u'}| ≤ |u − u'|/(1 − γ).

We can concentrate (P̂(s, a) − P(s, a))V̂^{µ,∗} by a union bound and an additional approximation error term, as we construct an absorbing TBSG G̃ on the empirical TBSG Ĝ. Combining Lemma 1 and Lemma 2, we can then bound |Q^{µ,ĉ(µ)} − Q̂^{µ,∗}|.

Lemma 5. |Q^∗ − Q̂^∗| ≤ max{ |Q^{µ^∗,ĉ(µ^∗)} − Q̂^{µ^∗,∗}|, |Q^{ĉ(ν^∗),ν^∗} − Q̂^{∗,ν^∗}| }.

Lemma 5 shows that we can bound the error of estimating the optimal state-action value, |Q^∗ − Q̂^∗|, by the policy evaluation errors |Q^{µ^∗,ĉ(µ^∗)} − Q̂^{µ^∗,∗}| and |Q^{ĉ(ν^∗),ν^∗} − Q̂^{∗,ν^∗}|.

Now we show how to use the absorbing TBSG and the suboptimality gap to prove that π̂^∗ = π^∗ with large probability. The key is to prove that if |Q^∗ − Q̂^∗| ≤ ∆/2, then by the definition of the suboptimality gap, Q̂^∗ enjoys the gap for the policy π^∗. As π̂^∗ is the Nash equilibrium policy in Ĝ, it is the only policy that can enjoy the gap, which means π̂^∗ = π^∗.

Lemma 6. If Q^∗ enjoys a suboptimality gap of ∆ and |Q^∗ − Q̂^∗| ≤ ∆/2, then we have π̂^∗ = π^∗.

Combining all the parts together, we get our problem-dependent result. Theorem 1 indicates that with a suboptimality gap, we can accurately recover the Nash equilibrium strategy with polynomial sample complexity. The detailed proof is provided in Appendix B.

Theorem 1. If G enjoys a suboptimality gap of ∆ and the number of samples satisfies

N ≥ (C|S||A| / ((1 − γ)^3 ∆^2)) log(|S||A| / ((1 − γ)δ∆))

for some constant C and ∆ ∈ (0, (1 − γ)^{-1/2}], then with probability at least 1 − δ, we have π̂^∗ = π^∗, which means the empirical Nash equilibrium strategy we obtain is exactly the Nash equilibrium strategy in the true TBSG.

Note that this result recovers π^∗ exactly, but at the price of a restrictive sample complexity. We need to know the suboptimality gap ∆ beforehand, and we have little knowledge about the empirical Nash equilibrium strategy if the sample size is smaller than (C|S||A|/((1 − γ)^3∆^2)) log(|S||A|/((1 − γ)δ∆)). In the next section, we give a problem-independent result, which has no dependence on the suboptimality gap of G.
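A minimal sketch of the construction in Definition 4, operating directly on kernel and reward arrays with the shapes used in the earlier sketches. This is illustrative code, not an implementation from the paper.

```python
import numpy as np

def absorbing_model(P, r, s, a, u):
    """Absorbing TBSG of Definition 4 (sketch): make (s, a) self-absorbing with
    reward u and leave every other state-action pair untouched.

    P has shape (|S|, |A|, |S|), r has shape (|S|, |A|); copies are returned.
    """
    P_tilde, r_tilde = P.copy(), r.copy()
    P_tilde[s, a] = 0.0
    P_tilde[s, a, s] = 1.0      # P~(s | s, a) = 1
    r_tilde[s, a] = u           # r~(s, a) = u
    return P_tilde, r_tilde
```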

5 PROBLEM-INDEPENDENT UPPER BOUND

Two flaws exist in our problem-dependent result. First, if the suboptimality gap is considerably small, the bound Õ(|S||A|(1 − γ)^{-3}∆^{-2}) becomes meaningless. In addition, when N < |S||A|(1 − γ)^{-3}∆^{-2}, the quality of the empirical Nash equilibrium strategy π̂^∗ is completely unknown. In this section, we aim to give a minimax sample complexity result without the assumption of a suboptimality gap. Interestingly, though the suboptimality gap is not assumed, we artificially create a suboptimality gap instead, and the role of the suboptimality gap is different from that in the first part of the analysis, as we will specify later. We now introduce the reward perturbation technique that can artificially create a suboptimality gap.

5.1 REWARD PERTURBATION TECHNIQUE

Here we use a reward perturbation technique to create a suboptimality gap in TBSG, which is inspired by a similar argument in MDP analysis [Li et al., 2020]. We give a proof that is different from the one in [Li et al., 2020], and our analysis for TBSG generalizes to MDP automatically, as MDP is a degenerate version of TBSG. We show that by randomly perturbing the reward function, with large probability, the perturbed TBSG enjoys a suboptimality gap. First we define the perturbed TBSG.

Definition 5. (Perturbed TBSG) For a TBSG G = (S, A, P, r, γ), the perturbed TBSG is G_p = (S, A, P, r_p, γ), where

r_p = r + ζ

and ζ ∈ R^{|S||A|} is a vector composed of independent random variables following the uniform distribution on [0, ξ].

We use the subscript p (π_p) to denote a strategy in the perturbed TBSG, as well as V_p and Q_p. The key property of the perturbed TBSG is that it enjoys a suboptimality gap of ξδ(1 − γ)/(4|S|^2|A|) with probability at least 1 − δ.

Lemma 7. (TBSG version of the suboptimality gap lemma in [Li et al., 2020]) For a fixed policy ν, with probability at least 1 − δ, the perturbed TBSG G_p enjoys a suboptimality gap of ξδ(1 − γ)/(4|S|^2|A|) for the Nash equilibrium strategy and the same gap for the counterstrategy to ν.
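A minimal sketch of the perturbation in Definition 5; the uniform draw per (s, a) entry plays the role of ζ, and the function name is illustrative.

```python
import numpy as np

def perturbed_rewards(r, xi, rng=None):
    """Reward perturbation of Definition 5 (sketch): add an independent
    Uniform[0, xi] draw to every entry of the reward vector; the transition
    kernel is left unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    return r + rng.uniform(0.0, xi, size=r.shape)
```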
Our proof is substantially different from that of [Li et al., 2020]; it consists of two important observations. Here we consider the case where only r(s, a_1) is perturbed to r(s, a_1) + τ. First, the Nash equilibrium strategy π^∗_τ is a piecewise constant function of τ, if some tie-breaking rule is given. Second, we show that Q^∗_τ(s, a_1) = k_1(π_τ)τ + b_1(π_τ) and Q^∗_τ(s, a_2) = k_2(π_τ)τ + b_2(π_τ) are piecewise linear functions and k_2(π_τ) ≤ γk_1(π_τ) for all τ. Intuitively, these two observations mean that Q^∗_τ(s, a_2) grows at most γ times as fast as Q^∗_τ(s, a_1), which further implies that |Q^∗_τ(s, a_1) − Q^∗_τ(s, a_2)| ≤ w can only hold for a small interval of τ.

Lemma 8. Consider a TBSG G_τ = (S, A, P, r_τ, γ) where r_τ = r + τ·1_{s,a}. Then the following facts hold.

• Given a rule to select the optimal action when there are multiple optimal actions, π^∗_τ is a piecewise constant (vector) function of τ.
• Q^∗_τ is a piecewise linear (vector) function of τ.
• Q^∗_τ(s, a') = kQ^∗_τ(s, a) + b, where 0 ≤ k ≤ γ, and k and b are functions of π^∗_τ.

With the above argument, we can prove that for arbitrary s, a, a', if we increase r(s, a), then the growth of Q^∗_τ(s, a') is at most γ times that of Q^∗_τ(s, a), which means |Q^∗_τ(s, a') − Q^∗_τ(s, a)| ≤ w only holds for a small range of r_τ(s, a). With a union bound argument, we prove the existence of the suboptimality gap. The proof for the counterstrategy is similar and the details are given in the appendix.

5.2 MINIMAX SAMPLE COMPLEXITY

In this section, we show how to use the reward perturbation technique to derive the minimax sample complexity result. First, we show that the optimal strategy in the perturbed empirical TBSG is contained in a finite set that has no dependence on P̂(s, a).

Lemma 9. Set U to be a set of equally spaced points in [−1/(1 − γ), 1/(1 − γ)] with |U| = 16|S|^2|A|/((1 − γ)^2 ξδ). We define

M^∗ = {µ̃^∗_{p,u} : u ∈ U},   M^{µ^∗} = {c̃_{p,u}(µ^∗) : u ∈ U}.

With probability at least 1 − δ, we have µ̂^∗_p ∈ M^∗ and ĉ_p(µ^∗) ∈ M^{µ^∗}.

Lemma 9 means that with large probability µ̂^∗_p and ĉ_p(µ^∗) lie in a finite set which is independent of P̂(s, a). This independence allows the usage of Bernstein's inequality, and with the union bound, we can prove the concentration of (P̂(s, a) − P(s, a))V_p^{µ̂^∗_p,∗} and (P̂(s, a) − P(s, a))V_p^{µ^∗,ĉ_p(µ^∗)}. Then, with Lemma 1 and Lemma 3, we can bound |Q_p^{µ̂^∗_p,∗} − Q̂_p^{µ̂^∗_p,c(µ̂^∗_p)}| and |Q_p^{µ^∗,ĉ_p(µ^∗)} − Q̂_p^{µ^∗,∗}|.

Lemma 10. For the perturbed empirical Nash equilibrium strategy π̂^∗_p = (µ̂^∗_p, ν̂^∗_p), we have

|Q^{µ̂^∗_p,∗} − Q^∗| ≤ |Q_p^{µ̂^∗_p,∗} − Q̂_p^{µ̂^∗_p,c(µ̂^∗_p)}| + |Q_p^{µ^∗,ĉ_p(µ^∗)} − Q̂_p^{µ^∗,∗}| + 4ξ/(1 − γ).

Lemma 10 shows how to bound |Q^{µ̂^∗_p,∗} − Q^∗| via the perturbed TBSG. In the same manner, we can bound |Q^{∗,ν̂^∗_p} − Q^∗|. Selecting an appropriate ξ, we can show that the perturbed empirical Nash equilibrium strategy π̂^∗_p is an ε-approximate Nash equilibrium strategy. The detailed proof is provided in Appendix C.

Theorem 2. If the number of samples satisfies

N ≥ (C|S||A| / ((1 − γ)^3 ε^2)) log(|S||A| / ((1 − γ)δε))

for some constant C and ε ∈ (0, (1 − γ)^{-1}], then with probability at least 1 − δ, π̂^∗_p is an ε-approximate Nash equilibrium strategy in G.

Remark 2. Compared with Theorem 3.6 in [Zhang et al., 2020], Theorem 2 extends the range of ε to (0, (1 − γ)^{-1}], which is the full possible range. In addition, Zhang et al. [2020] requires a smooth planning oracle, which is computationally inefficient. Our algorithm only needs a standard planning oracle to solve the empirical TBSG.

In addition, we can derive the following improved result for the problem-dependent bound by choosing ε = ∆/2 and further analysis of the suboptimality gap, which is provided in Appendix C.

Theorem 3. If G enjoys a suboptimality gap of ∆ and the number of samples satisfies

N ≥ (C|S||A| / ((1 − γ)^3 ∆^2)) log(|S||A| / ((1 − γ)δ∆))

for some constant C and ∆ ∈ (0, (1 − γ)^{-1}], then with probability at least 1 − δ, we have π̂^∗_p = π^∗, which means the empirical Nash equilibrium strategy we obtain is exactly the Nash equilibrium strategy in the true TBSG.

Theorem 3 is strictly stronger than Theorem 1. The difference between Theorem 1 and Theorem 3 is that the range of the suboptimality gap ∆ is extended to (0, (1 − γ)^{-1}], which is the full possible range of the suboptimality gap.
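To make the scaling in Theorems 2 and 3 concrete, here is a hypothetical helper that evaluates the two sample-size expressions at the order level; the unspecified constant C is set to 1, so the outputs are illustrative magnitudes only.

```python
import numpy as np

def sample_size_problem_independent(S, A, gamma, eps, delta, C=1.0):
    """Order-level evaluation of the Theorem 2 requirement (sketch; C is unspecified)."""
    return C * S * A / ((1 - gamma) ** 3 * eps ** 2) * np.log(S * A / ((1 - gamma) * delta * eps))

def sample_size_problem_dependent(S, A, gamma, Delta, delta, C=1.0):
    """Order-level evaluation of the Theorem 3 requirement (same caveat)."""
    return C * S * A / ((1 - gamma) ** 3 * Delta ** 2) * np.log(S * A / ((1 - gamma) * delta * Delta))
```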
6 RELATED LITERATURE

TBSG  TBSG has been widely studied since [Shapley, 1953]. For a detailed introduction to stochastic games, readers can refer to [Neyman et al., 2003]. Early on, the focus was on dynamic-programming-type algorithms for solving TBSG. Strategy iteration, as the counterpart of policy iteration in MDP and the parallelized simplex method, is proved to be a strongly polynomial time algorithm [Hansen et al., 2013, Jia et al., 2020]. Reinforcement learning approaches have been studied recently for TBSG to relieve the high computational cost of dynamic programming. Several works have been proposed for the generative model setting [Sidford et al., 2020, Jia et al., 2019, Zhang et al., 2020]. [Sidford et al., 2020] first gives a sample-efficient algorithm for tabular TBSG, while their result achieves the minimax sample complexity Õ(|S||A|(1 − γ)^{-3}ε^{-2}) only for ε ∈ (0, 1]. [Jia et al., 2019] adapts the MDP algorithm in [Sidford et al., 2018] to feature-based TBSG, but leaves a gap of 1/(1 − γ) to the optimal sample complexity. [Cui and Yang, 2020] uses a similar algorithm to ours in feature-based TBSG, while their result only holds for ε-Nash equilibrium values, which results in a gap when converting to an ε-Nash equilibrium strategy. [Zhang et al., 2020] considers the simultaneous stochastic game, and their approach consists of solving a regularized simultaneous stochastic game, which is computationally costly. For the online sampling setting, a recent work [Bai et al., 2020] uses upper-confidence-bound algorithms that can find an approximate Nash equilibrium strategy in Õ(|S||A||B|) steps.

Generative Model  The generative model is a sampling oracle setting in MDP, which has been shown to simplify the exploration and exploitation tradeoff. This concept is formalized in [Kakade et al., 2003], where an Õ(|S||A| poly((1 − γ)^{-1}) ε^{-2}) sample complexity is proved. [Azar et al., 2013] proves the minimax sample complexity Õ(|S||A|(1 − γ)^{-3}ε^{-2}). However, the upper bound there only holds for ε ∈ (0, (1 − γ)^{-1/2}|S|^{-1/2}]. Many works have been devoted to improving the dependence on ε. Recently, [Sidford et al., 2018] gives a minimax model-free algorithm for ε ∈ (0, 1] and [Agarwal et al., 2019] gives a minimax model-based algorithm for ε ∈ (0, (1 − γ)^{-1/2}]. Finally, [Li et al., 2020] uses a perturbed MDP technique to prove the minimax sample complexity for the full range of ε.

Suboptimality Gap  The suboptimality gap originated in bandit theory. Multi-armed bandits and linear bandits enjoy a logarithmic gap-dependent and a square-root gap-independent regret [Auer et al., 2002, Abbasi-Yadkori et al., 2011]. MDPs with a suboptimality gap have been studied in [Auer et al., 2009], and recently Õ(|S||A| poly(H) log(T)) regret has been proved for both model-based and model-free algorithms [Simchowitz and Jamieson, 2019, Yang et al., 2020]. [Du et al., 2020] utilized the suboptimality gap in the general function approximation setting and proved that the optimal policy can be found in Õ(dim_E) trajectories in deterministic MDPs. Most of the gap-dependent analysis in MDP focuses on online RL and, to the best of our knowledge, we are the first to study this notion in TBSG with a generative model.

7 CONCLUSION

In this work, we completely solve the sample complexity problem of TBSG with a generative model oracle. We prove that the simplest model-based algorithm, the plug-in solver approach, is minimax sample optimal for the full range of ε by using absorbing TBSG and reward perturbation techniques. Our proof is based on the suboptimality gap, a notion that originated in bandit theory and has received great attention in RL. We believe that our work can shed some light on the suboptimality gap and on TBSG.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Alekh Agarwal, Sham Kakade, and Lin F Yang. On the optimality of sparse model-based planning for Markov decision processes. arXiv preprint arXiv:1906.03804, 2019.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. arXiv preprint arXiv:2006.12007, 2020.

Qiwen Cui and Lin F. Yang. Is plug-in solver sample-efficient for feature-based reinforcement learning?, 2020.

Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic Q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. arXiv preprint arXiv:2002.07125, 2020.

Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.

Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based Q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.

Zeyu Jia, Zaiwen Wen, and Yinyu Ye. Towards solving 2-TBSG efficiently. Optimization Methods and Software, 35(4):706–721, 2020.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002, 1999.

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint arXiv:2005.12900, 2020.

Abraham Neyman, Sylvain Sorin, and S Sorin. Stochastic games and applications, volume 570. Springer Science & Business Media, 2003.

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. SIAM, 2018.

Aaron Sidford, Mengdi Wang, Lin Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 2992–3002. PMLR, 2020.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1153–1162, 2019.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.

Kunhe Yang, Lin F Yang, and Simon S Du. Q-learning with logarithmic regret. arXiv preprint arXiv:2006.09118, 2020.

Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, et al. Mastering complex control in MOBA games with deep reinforcement learning. In AAAI, pages 6672–6679, 2020.

Kaiqing Zhang, Sham M Kakade, Tamer Başar, and Lin F Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461, 2020.