Learning to Communicate Implicitly by Actions

Zheng Tian†, Shihao Zou†, Ian Davies†, Tim Warr†, Lisheng Wu†, Haitham Bou Ammar†‡, Jun Wang†
† University College London   ‡ Huawei R&D UK
{zheng.tian.11, shihao.zou.17, ian.davies.12, tim.warr.17, lisheng.wu.17}@ucl.ac.uk, [email protected], [email protected]

Abstract

In situations where explicit communication is limited, human collaborators act by learning to: (i) infer meaning behind their partner's actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second — which is equally crucial for successful collaboration — has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agent's private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, a particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.

Introduction

In collaborative multi-agent systems, communication is essential for agents to learn to behave as a collective rather than a collection of individuals. This is particularly important in the imperfect-information setting, where private information becomes crucial to success. In such cases, efficient communication protocols between agents are needed for private information exchange, coordinated joint-action exploration, and true world-state inference.

In typical multi-agent reinforcement learning (MARL) settings, designers incorporate explicit communication channels hoping to conceptually resemble language or verbal communication, which are known to be important for human interaction (Baker et al. 1999). Though they can be used for facilitating collaboration in MARL, explicit communication channels come at additional computational and memory costs, making them difficult to deploy in decentralized control (Roth, Simmons, and Veloso 2006).

Environments where explicit communication is difficult or prohibited are common. These settings can be synthetic, such as those in games, e.g., bridge and Hanabi, but also frequently appear in real-world tasks such as autonomous driving and autonomous fleet control. In these situations, humans rely upon implicit communication as a means of information exchange (Rasouli, Kotseruba, and Tsotsos 2017) and are effective in learning to infer the implicit meaning behind others' actions (Heider and Simmel 1944). The ability to perform such inference requires the attribution of a mental state and reasoning mechanism to others. This ability is known as theory of mind (Premack and Woodruff 1978). In this work, we develop agents that benefit from considering others' perspectives and thereby explore the further development of machine theory of mind (Rabinowitz et al. 2018).

Previous works have considered ways in which an agent can, by observing an opponent's behavior, build a model of opponents' characteristics, objectives or hidden information, either implicitly (He et al. 2016; Bard et al. 2013) or explicitly (Raileanu et al. 2018; Li and Miikkulainen 2018). Whilst these works are of great value, they overlook the fact that an agent should also consider that it is being modeled and adapt its behavior accordingly, thereby demonstrating a theory of mind. For instance, in collaborative tasks, a decision-maker could choose to take actions which are informative to its teammates, whereas, in competitive situations, agents may act to conceal private information to prevent their opponents from modeling them effectively.

In this paper, we propose a generic framework, titled policy belief learning (PBL), for learning to cooperate in imperfect-information multi-agent games. Our work combines opponent modeling with a policy that considers that it is being modeled. PBL consists of a belief module, which models other agents' private information by considering their previous actions, and a policy module, which combines the agent's current observation with its beliefs to return a distribution over actions. We also propose a novel auxiliary reward for encouraging communication by actions, which is integrated into PBL. Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment.

Related Work

Our work is closely related to (Albrecht and Stone 2017; Lowe et al. 2017; Mealing and Shapiro 2017; Raileanu et al. 2018), where agents build models to estimate other agents' hidden information. Contrastingly, our work enhances a "flat" opponent model with recursive reasoning. "Flat" opponent models estimate only the hidden information of opponents. Recursive reasoning requires making decisions based on the mental states of others as well as the state of the environment. In contrast to works such as I-POMDP (Gmytrasiewicz and Doshi 2005) and PR2 (Wen et al. 2019), where the nested belief is embedded into the training agent's opponent model, we incorporate the level-1 nested belief "I believe that you believe" into our policy by a novel auxiliary reward.

Recently, there has been a surge of interest in using reinforcement learning (RL) approaches to learn communication protocols (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Mordatch and Abbeel 2017; Sukhbaatar, Szlam, and Fergus 2016). Most of these works enable agents to communicate via an explicit channel. Among these works, Mordatch and Abbeel (2017) also observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel, where agents are exclusively either a sender or a receiver. Similar research is also conducted in (de Weerd, Verbrugge, and Verheij 2015). In our setting, we do not restrict agents to be exclusively a sender or a receiver of communications – agents can communicate mutually by actions. Knepper et al. (2017) propose a framework for implicit communication in a cooperative setting and show that various problems can be mapped into this framework. Although our work is conceptually close to (Knepper et al. 2017), we go further and present a practical algorithm for training agents. The recent work of (Foerster et al. 2018) solves an imperfect-information problem as considered here from a different angle. We approach the problem by encouraging agents to exchange critical information through their actions, whereas Foerster et al. train a public player to choose an optimal deterministic policy for players in a game based on publicly observable information. In Cooperative Inverse RL (CIRL), where robotic agents try to infer a human's private reward function from their actions (Hadfield-Menell et al. 2016), optimal solutions need to produce behavior that conveys information. Dragan, Lee, and Srinivasa (2013) consider how to train agents to exhibit legible behavior (i.e. behavior from which it is easy to infer the intention). Their approach is dependent on a hand-crafted cost function to attain informative behavior. Mutual information has been used as a means to promote coordination without the need for a human-engineered cost function. Strouse et al. (2018) use a mutual information objective to encourage an agent to reveal or hide its intention. In a related work, Jaques et al. (2019) utilize a mutual information objective to imbue agents with social influence. While the objective of maximal mutual information in actions can yield highly effective collaborating agents, a mutual information objective in itself is insufficient to necessitate the development of implicit communication by actions. Eccles et al. (2019) introduce a reciprocity reward as an alternative approach to solve social dilemmas.

A distinguishing feature of our work in relation to previous works in multi-agent communication is that we do not have a predefined explicit communication protocol or learn to communicate through an explicit channel. Information exchange can only happen via actions. In contrast to previous works focusing on unilaterally making actions informative, we focus on bilateral communication by actions, where information transmission is directed to a specific party with potentially limited reasoning ability. Our agents learn to communicate through iterated policy and belief updates, such that the resulting communication mechanism and belief models are interdependent. The development of a communication mechanism therefore requires either direct access to the mental state of other agents (via centralized training) or the ability to mentalize, commonly known as theory of mind. We investigate our proposed algorithm in both settings.

Problem Definition

We consider a set of agents, denoted by N, interacting with an unknown environment by executing actions from a joint set A = {A^1, ..., A^N}, with A^i denoting the action space of agent i, and N the total number of agents. To enable models that approximate real-world scenarios, we assume private and public information states. Private information states, jointly (across agents) denoted by X = {X^1, ..., X^N}, are a set of hidden information states where X^i is only observable by agent i, while public states O are observed by all agents. We assume that hidden information states at each time step are sampled from an unknown distribution P_x : X -> [0, 1], while public states evolve from an initial distribution P_o : O -> [0, 1] according to a stochastic transition model T : O x X x A^1 x ... x A^N x O -> [0, 1]. Having transitioned to a successor state according to T, agents receive rewards from R : S x A^1 x ... x A^N -> R, where we have used S = O x X to denote joint state descriptions that incorporate both public and private information. Finally, rewards are discounted over time by a factor γ in (0, 1]. With this notation, our problem can be described succinctly by the tuple <N, A, O, X, T, R, P_x, P_o, γ>, which we refer to as an imperfect-information Markov decision process (I2MDP)¹. In this work, we simplify the problem by assuming that hidden information states X are temporally static and are given at the beginning of the game.

¹ We also note that our problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) (Bernstein, Zilberstein, and Immerman 2013).
We interpret the joint policy from the perspective of agent i such that π = (π^i(a^i | s), π^{-i}(a^{-i} | s)), where π^{-i}(a^{-i} | s) is a compact representation of the joint policy of all agents excluding agent i. In the collaborative setting, each agent is presumed to pursue the shared maximal cumulative reward expressed as

    max_π  η^i(π) = E[ Σ_{t=1}^∞ γ^t R(s_t, a^i_t, a^{-i}_t) ],    (1)

where s_t = [x^i_t, x^{-i}_t, o_t] is the current full information state, (a^i_t, a^{-i}_t) are the joint actions taken by agent i and all other agents respectively at time t, and γ is a discount factor.

Policy Belief Learning

Applying naive single-agent reinforcement learning (SARL) algorithms to our problem will lead to poor performance. One reason for this is the partial observability of the environment. To succeed in a partially observable environment, an agent is often required to maintain a belief state. Recall that, in our setting, the environment state is formed from the union of the private information of all agents and the publicly observable information, s_t = [x^i_t, x^{-i}_t, o_t]. We therefore learn a belief module Φ^i(x^{-i}_t) to model other agents' private information x^{-i}_t, which is the only hidden information from the perspective of agent i in our setting. We assume that an agent can model x^{-i}_t given the history of public information and actions executed by other agents, h^i_t = {o_{1:t-1}, a^{-i}_{1:t-1}}. We use a NN to parameterize the belief module, which takes in the history of public information and produces a belief state b^i_t = Φ^i(x^{-i}_t | h^i_t). The belief state together with information observable by agent i forms a sufficient statistic, ŝ^i_t = [x^i_t, b^i_t, o_t], which contains all the information necessary for the agent to act optimally (Åström 1965). We use a separate NN to parameterize agent i's policy π^i(a^i_t | ŝ^i_t), which takes in the estimated environment state ŝ^i_t and outputs a distribution over actions. As we assume hidden information is temporally static, we drop the time subscript for it in the rest of the paper.
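As a concrete illustration of this parameterization, the following is a minimal sketch of a belief module Φ^i mapping a public history to a belief vector b^i_t. The GRU encoder, layer sizes and all names are our assumptions for illustration only, not the architecture used in the paper (see Appendix B.2 for that).

```python
import torch
import torch.nn as nn

class BeliefModule(nn.Module):
    """Sketch of Phi^i: maps the public history h^i_t to a belief b^i_t over
    the other agent's private information (one probability per dimension)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int, private_dim: int):
        super().__init__()
        # Each history element concatenates a public observation and the
        # other agent's previous action (e.g. one-hot encoded).
        self.encoder = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, private_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, t-1, obs_dim + act_dim)
        _, h_n = self.encoder(history)
        # Belief b^i_t in [0, 1]^private_dim, one Bernoulli probability per dimension.
        return torch.sigmoid(self.head(h_n[-1]))
```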
The presence of multiple learning agents interacting with the environment renders the environment non-stationary. This further limits the success of SARL algorithms, which are generally designed for environments with stationary dynamics. To solve this, we adopt centralized training and decentralized execution, where during training all agents are recognized as one central representative agent differing only by their observations. Under this approach, one can imagine belief models Φ^i(x^{-i} | h^i_t) and Φ^{-i}(x^i | h^{-i}_t) sharing parameters φ. The input data, however, varies across agents due to the dependency on both h^i_t and h^{-i}_t. In a similar fashion, we let policies share the parameters θ. Consequently, one may think of updating θ and φ using one joint data set aggregated across agents. Without loss of generality, in the remainder of this section, we discuss the learning procedure from the point of view of a single agent, agent i.

We first present the learning procedure of our belief module. At iteration k, we use the current policy π_[k](a | ŝ) to generate a data set of size M, Ω_[k] = {(x^{-i}_j, h^i_j)}_{j=1}^M, using self-play and learn a new belief module by minimizing:

    φ_[k] := argmin_φ  E_{(x^{-i}, h^i) ~ Ω_[k-1]} [ KL( x^{-i}_j || b^i_j(h^i_j; φ) ) ],    (2)

where KL(·||·) is the Kullback–Leibler (KL) divergence and we use a one-hot vector to encode the ground truth, x^{-i}_j, when we calculate the relevant KL-divergence.

With the updated belief module Φ_[k], we learn a new policy for the next iteration, π_[k+1], via a policy gradient algorithm. Sharing information in multi-agent cooperative games through communication reduces intractability by enabling coordinated behavior. Rather than implementing expensive protocols (Heider and Simmel 1944), we encourage agents to implicitly communicate through actions by introducing a novel auxiliary reward. To do so, notice that in the centralized setting agent i has the ability to consult its opponent's belief model Φ^{-i}(x^i | h^{-i}_t), thereby exploiting the fact that other agents hold beliefs over its private information x^i. In fact, comparing b^{-i}_t to the ground-truth x^i enables agent i to learn which actions bring these two quantities closer together and thereby learn informative behavior. This can be achieved through an auxiliary reward signal devised to encourage informative action communication:

    r^i_{c,t} = KL( x^i || b^{-i,*} ) − KL( x^i || b^{-i}_{t+1} ),    (3)

where b^{-i,*} = Φ^{-i}_[k](x^i | h^{-i}_{t,*}) is agent −i's best belief (so far) about agent i's private information:

    b^{-i,*} = argmin_{b^{-i}_u} KL( x^i || b^{-i}_u ),  for all u ≤ t.

In other words, r^i_{c,t} encourages communication as it is proportional to the improvement in the opponent's belief (for a fixed belief model Φ^{-i}_[k](x^i | h^{-i}_{t+1})), measured by its proximity to the ground-truth, resulting from the opponent observing agent i's action a^i_t. Hence, during the policy learning step of PBL, we apply a policy gradient algorithm with a shaped reward of the form²:

    r = r_e + α r_c,    (4)

where r_e is the reward from the environment, r_c is the communication reward and α ≥ 0 balances the communication and environment rewards.

² Please note, we omit the agent index i in the reward equation, as we shape rewards similarly for all agents.

Algorithm 1 Per-Agent Policy Belief Learning (PBL)
1: Initialize: Randomly initialize policy π_0 and belief Φ_0
2: Pre-train π_0
3: for k = 0 to max iterations do
4:   Sample episodes for belief training using self-play, forming the data set Ω_[k]
5:   Update the belief network using data from Ω_[k] by solving Equation 2
6:   Given the updated beliefs Φ_[k+1](·), update the policy π(·) (policy gradients with rewards from Equation 4)
7: end for
8: Output: Final policy and belief model

Initially, in the absence of a belief module, we pre-train a policy π_[0] naively by ignoring the existence of other agents in the environment. As an agent's reasoning ability may be limited, we may then iterate between belief and policy learning multiple times until either the allocated computational resources are exhausted or the policy and belief modules converge. We summarize the main steps of PBL in Algorithm 1. Note that, although information can be leaked during training, as training is centralized, distributed test-phase execution ensures that private variables remain hidden during execution.
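As a concrete illustration of Equations 3 and 4, the sketch below computes the communication reward from stored beliefs, under the assumption (used in our bridge experiments) that each dimension of the belief and of the binary ground truth is treated as an independent Bernoulli distribution. All function and variable names are ours, not the paper's code.

```python
import numpy as np

def bernoulli_kl(x: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """KL(x || b) with x a binary ground-truth vector and b a belief in (0, 1),
    treating every dimension as an independent Bernoulli distribution."""
    b = np.clip(b, eps, 1.0 - eps)
    x = np.clip(x, eps, 1.0 - eps)
    return float(np.sum(x * np.log(x / b) + (1.0 - x) * np.log((1.0 - x) / (1.0 - b))))

def communication_reward(x_i: np.ndarray,
                         best_belief_so_far: np.ndarray,
                         belief_after_action: np.ndarray) -> float:
    """Equation 3: positive when agent i's action moves the opponent's belief
    about x^i closer to the truth than its best value so far."""
    return bernoulli_kl(x_i, best_belief_so_far) - bernoulli_kl(x_i, belief_after_action)

def shaped_reward(r_env: float, r_comm: float, alpha: float) -> float:
    """Equation 4: r = r_e + alpha * r_c."""
    return r_env + alpha * r_comm
```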

Figure 1: Matrix game experiment and results. (a) Payoff for the matrix game (Player 1 acts first, Player 2 acts second; rows: Player 1's card and action; columns: Player 2's action and card). (b) Learning curves of PBL and baselines (BAD, BAD-CF, PG) over 100 runs.

(a) Payoff table:
                              Player 2: A        Player 2: B        Player 2: C
                              Card 1   Card 2    Card 1   Card 2    Card 1   Card 2
    Player 1 Card 1, act A      10        0         0        0         0       10
    Player 1 Card 1, act B       4        8         4        4         8        4
    Player 1 Card 1, act C      10        0         0        0         0       10
    Player 1 Card 2, act A       0        0        10       10         0        0
    Player 1 Card 2, act B       4        8         4        4         8        4
    Player 1 Card 2, act C       0        0         0       10         0        0
Machine Theory of Mind

In PBL, we adopt a centralized training and decentralized execution scheme where agents share the same belief and policy models. In reality, however, it is unlikely that two people will have exactly the same reasoning process. In contrast to requiring everyone to have the same reasoning process, a person's success in navigating social dynamics relies on their ability to attribute mental states to others. This attribution of mental states to others is known as theory of mind (Premack and Woodruff 1978). Theory of mind is fundamental to human social interaction, which requires the recognition of other sensory perspectives, the understanding of other mental states, and the recognition of complex non-verbal signals of emotional state (Lemaignan and Dillenbourg 2015). In collaboration problems without an explicit communication channel, humans can effectively establish an understanding of each other's mental state and subsequently select appropriate actions. For example, a teacher will reiterate a difficult concept to students if she infers from the students' facial expressions that they have not understood. The effort of one agent to model the mental state of another is characterized as Mutual Modeling (Dillenbourg 1999).

In our work, we also investigate whether the proposed communication reward can be generalized to a distributed setting which resembles a human application of theory of mind. Under this setting, we train a separate belief model for each agent so that Φ^i(x^{-i} | h^i_t) and Φ^{-i}(x^i | h^{-i}_t) do not share parameters (φ^i ≠ φ^{-i}). Without centralization, an agent can only measure how informative its action is to others with its own belief model. Assuming agents can perfectly recall their past actions and observations, agent i computes its communication reward as³:

    r^i_{c,t} = KL( x^i || b̃^{i,*}_t ) − KL( x^i || b̃^i_{t+1} ),

where b̃^{i,*}_t = Φ^i(x^i | h^{-i}_{t,*}) and b̃^i_{t+1} = Φ^i(x^i | h^{-i}_{t+1}). In this way, an agent essentially establishes a mental state of others with its own belief model and acts upon it. We humbly believe this could be a step towards machine theory of mind, where algorithmic agents learn to attribute mental states to others and adjust their behavior accordingly.

³ Note the difference of super/sub-scripts of the belief model and its parameters when compared to Equation 3.

The ability to mentalize relieves the restriction of collaborators having the same reasoning process. However, the success of collaboration still relies on the correctness of one's belief about the mental states of others. For instance, correctly inferring other drivers' mental states and conventions can reduce the likelihood of traffic accidents. Therefore road safety education is important, as it reduces variability among drivers' reasoning processes. In our work, this alignment amounts to the similarity between two agents' trained belief models, which is affected by training data, initialization of weights, training algorithms and so on. We leave investigation of the robustness of collaboration to variability in collaborators' belief models to future work.
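Relative to Equation 3, the only change in the distributed setting is which belief model is consulted: agent i applies its own belief model Φ^i to the history its partner observes. A minimal sketch, reusing bernoulli_kl from the sketch above; belief_model and the history arguments are hypothetical names of ours.

```python
def distributed_communication_reward(belief_model, x_i, partner_history_best, partner_history_next):
    # Agent i scores its own action with its OWN belief model Phi^i, applied to
    # the history observable by its partner, as a stand-in for the partner's belief.
    b_best = belief_model(partner_history_best)   # corresponds to b~^{i,*}_t
    b_next = belief_model(partner_history_next)   # corresponds to b~^i_{t+1}
    return bernoulli_kl(x_i, b_best) - bernoulli_kl(x_i, b_next)
```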
Experiments & Results

We test our algorithms in three experiments. In the first, we validate the correctness of the PBL framework, which integrates our communication reward with iterative belief and policy module training, in a simple matrix game. In this relatively simple experiment, PBL achieves near optimal performance. Equipped with this knowledge, we further apply PBL to the non-competitive bridge bidding problem to verify its scalability to more complex problems. Lastly, we investigate the efficacy of the proposed communication reward in a distributed training setting.

Matrix Game

We test our PBL algorithm on a matrix card game where an implicit communication strategy is required to achieve the global optimum. This game was first proposed in (Foerster et al. 2018). There are two players and each player receives a card drawn from {card 1, card 2} independently at the beginning of the game. Player 1 acts first and Player 2 responds after observing Player 1's action. Neither player can see the other's hand. By the design of the payoff table (shown in Figure 1a), Player 1 has to use actions C and A to signify that it holds Cards 1 and 2 respectively, so that Player 2 can choose its actions optimally with the given information. We compare PBL with algorithms proposed in (Foerster et al. 2018) and vanilla policy gradient. As can be seen from Figure 1b, PBL performs similarly to BAD and BAD-CF on this simple game and outperforms vanilla policy gradient significantly. This demonstrates a proof of principle for PBL in a multi-agent imperfect-information coordination game.

Figure 2: (a) Learning curves for non-competitive bridge bidding with a warm start from a model trained to predict the score distribution (average reward at warm start: 0.038). Details of the warm start are provided in Appendix B.3. (b) Bar graph comparing PBL to variants of PQL, with the full version of PQL results as reported in (Yeh and Lin 2016).

Contract Bridge Case-Study

Non-competitive contract bridge bidding is an imperfect-information game that requires information exchange between agents to agree high-quality contracts. Hence, such a game serves as an ideal test-bed for PBL. In bridge, two teams of two (North-South vs East-West) are situated in opposing positions and play a trick-taking game using a standard 52-card deck. Following a deal, the bidding and playing phases can be effectively separated. During the bidding phase, players sequentially bid for a contract until a final contract is reached. A PASS bid retains previously proposed contracts, and a contract is considered final if it is followed by three consecutive PASS bids. A non-PASS bid proposes a new contract of the form <integer, suit>, where integer takes integer values between one and seven, and suit belongs to {♣, ♦, ♥, ♠, NT}. The number of tricks needed to achieve a contract is 6 + integer, and an NT suit corresponds to bidding to win tricks without trumps. A contract-declaring team achieves points if it fulfills the contract, and if not, the points for the contract go to the opposing team. Bidding must be non-decreasing, meaning integer is non-decreasing and must increase if the newly proposed suit precedes or equals the currently bid suit in the ordering ♣ < ♦ < ♥ < ♠ < NT.

In this work, we focus on non-competitive bidding in bridge, where we consider North (N) and South (S) bidding in the game, while East (E) and West (W) always bid PASS. Hence, the declaring team never changes. Thus, each deal can be viewed as an independent episode of the game. The private information of player i in {N, S}, x^i, is its hand. x^i is a 52-dimensional binary vector encoding player i's 13 cards. An agent's observation at time step t consists of its hand and the bidding history: o^i_t = {x^i_t, h^i_t}. In each episode, Players N and S are dealt hands x^N, x^S respectively. Their hands, together, describe the full state of the environment s = {x^N, x^S}, which is not fully observed by either of the two players. Since rolling out via self-play for every contract is computationally expensive, we resort to double dummy analysis (DDA) (Haglund 2010) for score estimation. Interested readers are referred to (Haglund 2010) and the appendix for further details. In our work, we use standard Duplicate rules (League 2017) to score games and normalize scores by dividing them by the maximum absolute score.

Benchmarking & Ablation Studies: PBL introduces several building blocks, each affecting performance in its own right. We conduct an ablation study to better understand the importance of these elements and compare against a state-of-the-art method in PQL (Yeh and Lin 2016). We introduce the following baselines:

1. Independent Player (IP): A player bids independently without consideration of the existence of the other player.

2. No communication reward (NCR): One important question to ask is how beneficial the additional communication auxiliary reward r_c is in terms of learning a good bidding strategy. To answer this question, we implement a baseline using the same architecture and training schedule as PBL but setting the communication reward weighting to zero, α = 0.

3. No PBL style iteration (NPBI): To demonstrate that multiple iterations between policy and belief training are beneficial, we compare our model to a baseline policy trained with the same number of weight updates as our model but no further PBL iterations after training a belief network Φ_0 at PBL iteration k = 0.

4. Penetrative Q-Learning (PQL): PQL was proposed by Yeh and Lin (2016) as the first bidding policy for non-competitive bridge bidding without human domain knowledge.


Figure 3: An example of a belief update trace showing how PBL agents use actions for effective communication. Upon observing PASS from East, West decreases its HCP belief in all suits. When West bids <1, C>, East improves its belief in clubs. Next, East bids <1, H>. West recalculates its belief from the last time step and increases its HCP belief in hearts. (Bid sequence along the time axis: Player Two bids PASS; Player One bids One Club; Player Two bids One Heart; Player One bids Four Hearts.)

Figure 2a shows the average learning curves of our model and three baselines for our ablation study. We obtain these curves by testing trained algorithms periodically on a pre-generated test data set which contains 30,000 games. Each point on the curve is an average score computed by Duplicate bridge scoring rules (League 2017) over 30,000 games and 6 training runs. As can be seen, IP and NCR both initially learn faster than our model. This is reasonable, as PBL spends more time learning a communication protocol at first. However, IP converges to a local optimum very quickly and is surpassed by PBL after approximately 400 learning iterations. NCR learns a better bidding strategy than IP with a belief module. However, NCR learns more slowly than PBL in the later stage of training because it has no guidance on how to convey information to its partner. PBL outperforming NPBI demonstrates the importance of iterative training between policy and belief modules.

Restrictions of PQL: PQL (Yeh and Lin 2016) is the first algorithm trained to bid in bridge without human-engineered features. However, its strong bidding performance relies on heavy adaptation and heuristics for non-competitive bridge bidding. First, PQL requires a predefined maximum number of allowed bids in each deal, while using different bidding networks at different times. Our results show that it will fail when we train a single NN for the whole game, which can be seen as a minimum requirement for most DRL algorithms. Second, PQL relies on a rule-based function for selecting the best contracts at test time. In fact, removing this second heuristic significantly reduces PQL's performance, as reported in Figure 2b. In addition, without pre-processing the training data as in (Yeh and Lin 2016), we could not reproduce the original results. To achieve state-of-the-art performance, we could use these (or other) heuristics for our bidding algorithm. However, this deviates from the focus of our work, which is to demonstrate that PBL is a general framework for learning to communicate by actions.

Belief Update Visualization: To understand how agents update their beliefs after observing a new bid, we visualize the belief update process (Figure 3). An agent's belief about its opponent's hand is represented as a 52-dimensional vector with real values, which is not amenable to human interpretation. Therefore, we use high card points (HCPs) to summarize each agent's belief. For each suit, each card is given a point score according to the mapping: A=4, K=3, Q=2, J=1, else=0. Note that while an agent's belief is updated based on the entire history of its opponent's bids, the difference between that agent's belief from one round to the next is predominantly driven by the most recent bid of its opponent, as shown in Figure 3.

Learned Bidding Convention: Whilst our model's bidding decisions are based entirely on raw card data, we can use high card points as a simple way to observe and summarize the decisions which are being made. For example, we observe that our policy opens the bid with 1♠ if its HCPs in spades are 4.5 or higher but its HCPs in every other suit are lower. We run the model on the unseen test set of 30,000 deals and summarize the learned bidding convention in Appendix C.1.

Imperfect Recall of History: The length of the action history players can recall affects the accuracy of the belief models. The extent of the impact depends on the nature of the game. In bridge, the order of bidding encodes important information. We ran an ablation study where players can only recall the most recent bid. In this setting, players do worse (average score 0.065) than players with perfect recall. We conjecture that this is because players can extract less information and therefore the accuracy of belief models drops.


Figure 4: a) Learning curves for Silent Guide. Guide agent trained with communication reward (CR) significantly outperforms the one trained with no communication reward (NCR). b) A trajectory of Listener (gray circle) and Guide (blue circle) with CR. Landmarks are positioned randomly and the Goal landmark (blue square) is randomly chosen at the start of each episode. c) A trajectory of Listener and Guide with NCR. Trajectories are presented with agents becoming progressively darker over time.

Silent Guide

We modify a multi-agent particle environment (Lowe et al. 2017) to test the effectiveness of our novel auxiliary reward in a distributed setting. This environment also allows us to explore the potential for implicit communication to arise through machine theory of mind. In the environment there are two agents and three landmarks. We name the agents Guide and Listener respectively. Guide can observe Listener's goal landmark, which is distinguished by its color. Listener does not observe its goal. However, Listener is able to infer the meaning behind Guide's actions. The two agents receive the same reward, which is the negative distance between Listener and its goal. Therefore, to maximize the cumulative reward, Guide needs to tell Listener the goal landmark color. However, as the "Silent Guide" name suggests, Guide has no explicit communication channel and can only communicate to Listener through its actions.

In the distributed setting, we train separate belief modules for Guide and Listener respectively. The two belief modules are both trained to predict a naive agent's goal landmark color given its history within the current episode, but using different data sets. We train both Guide and Listener policies from scratch. Listener's policy takes Listener's velocity, relative distance to the three landmarks and the prediction of the belief module as input. It is trained to maximize the environment reward it receives. Guide's policy takes its velocity, relative distance to the landmarks and Listener's goal as input. To encourage communication by actions, we train Guide's policy with the auxiliary reward proposed in our work. We compare our method against a naive Guide policy which is trained without the communication reward. The results are shown in Figure 4. Guide, when trained with the communication reward (CR), learns to inform Listener of its goal by approaching the goal it observes. Listener learns to follow. However, in the NCR setting, Listener learns to ignore Guide's uninformative actions and moves to the center of the three landmarks. While Guide and Listener are equipped with belief models trained from different data sets, Guide manages to use its own belief model to establish the mental state of Listener and learns to communicate through actions judged by this constructed mental state of Listener. We also observe that a trained Guide agent can work with a naive RL Listener (best reward -0.252) which has no belief model but can observe the PBL Guide agent's action. The success of Guide with CR shows the potential for machine theory of mind. We obtain the learning curves by repeating the training process five times and taking the shared average environment reward.

Conclusions & Future Work

In this paper, we focus on implicit communication through actions. This draws a distinction between our work and previous works, which either focus on explicit communication or unilateral communication. We propose an algorithm combining agent modeling and communication for collaborative imperfect-information games. Our PBL algorithm iterates between training a policy and a belief module. We propose a novel auxiliary reward for encouraging implicit communication between agents, which effectively measures how much closer the opponent's belief about a player's private information becomes after observing the player's action. We empirically demonstrate that our methods can achieve near optimal performance in a matrix problem and scale to complex problems such as contract bridge bidding. We conduct an initial investigation of the further development of machine theory of mind. Specifically, we enable an agent to use its own belief model to attribute mental states to others and act accordingly. We test this framework and achieve some initial success in a multi-agent particle environment under distributed training. There are many interesting avenues for future work, such as exploration of the robustness of collaboration to differences in agents' belief models.

References

Albrecht, S. V., and Stone, P. 2017. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. arXiv e-prints.

Åström, K. J. 1965. Optimal control of Markov Processes with incomplete state information. Journal of Mathematical Analysis and Applications 10:174–205.

Baker, M.; Hansen, T.; Joiner, R.; and Traum, D. 1999. The role of grounding in collaborative learning tasks. Collaborative learning: Cognitive and computational approaches.

Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, 255–262.

Bernstein, D. S.; Zilberstein, S.; and Immerman, N. 2013. The Complexity of Decentralized Control of Markov Decision Processes. arXiv e-prints arXiv:1301.3836.

de Weerd, H.; Verbrugge, R.; and Verheij, B. 2015. Higher-order theory of mind in the tacit communication game. Biologically Inspired Cognitive Architectures.

Dillenbourg, P. 1999. What do you mean by collaborative learning? In Dillenbourg, P., ed., Collaborative-learning: Cognitive and Computational Approaches. Oxford: Elsevier.

Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 301–308. IEEE Press.

Eccles, T.; Hughes, E.; Kramár, J.; Wheelwright, S.; and Leibo, J. Z. 2019. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082.

Foerster, J. N.; Assael, Y. M.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Song, F.; Hughes, E.; Burch, N.; Dunning, I.; Whiteson, S.; Botvinick, M.; and Bowling, M. 2018. Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning. arXiv e-prints arXiv:1811.01458.

Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. J. Artif. Int. Res.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haglund, B. 2010. Search algorithms for a bridge double dummy solver.

He, H.; Boyd-Graber, J.; Kwok, K.; and Daumé III, H. 2016. Opponent modeling in deep reinforcement learning. In ICML '16.

Heider, F., and Simmel, M. 1944. An experimental study of apparent behavior. The American Journal of Psychology.

Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, 3040–3049.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Knepper, R. A.; Mavrogiannis, C. I.; Proft, J.; and Liang, C. 2017. Implicit communication in a joint action. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI '17. ACM.

Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. CoRR abs/1612.07182.

League, A. C. B. 2017. Laws of Duplicate Bridge.

Lemaignan, S., and Dillenbourg, P. 2015. Mutual modelling in robotics: Inspirations for the next steps. 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 303–310.

Li, X., and Miikkulainen, R. 2018. Dynamic adaptation and opponent exploitation in computer poker. In Workshops at the 32nd AAAI Conference on Artificial Intelligence.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).

Mealing, R., and Shapiro, J. L. 2017. Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9(1):11–24.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

Premack, D., and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1.

Rabinowitz, N. C.; Perbet, F.; Song, H. F.; Zhang, C.; Eslami, S. M. A.; and Botvinick, M. 2018. Machine theory of mind. CoRR abs/1802.07740.

Raileanu, R.; Denton, E.; Szlam, A.; and Fergus, R. 2018. Modeling others using oneself in multi-agent reinforcement learning. CoRR abs/1802.09640.

Rasouli, A.; Kotseruba, I.; and Tsotsos, J. 2017. Agreeing to cross: How drivers and pedestrians communicate. In Intelligent Vehicles Symposium (IV), 2017 IEEE. IEEE.

Roth, M.; Simmons, R.; and Veloso, M. 2006. What to communicate? Execution-time decision in multi-agent POMDPs. In DARS '06, 177–186. Springer.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Strouse, D. J.; Kleiman-Weiner, M.; Tenenbaum, J.; Botvinick, M.; and Schwab, D. 2018. Learning to share and hide intentions using information regularization. NIPS '18.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR.

Wen, Y.; Yang, Y.; Luo, R.; Wang, J.; and Pan, W. 2019. Probabilistic recursive reasoning for multi-agent reinforcement learning. In ICLR.

Yeh, C., and Lin, H. 2016. Automatic bridge bidding using deep reinforcement learning. CoRR abs/1607.03290.
A Bridge

A.1 Playing Phase in Bridge

After the final contract is decided, the player from the declaring side who first bid the trump suit named in the final contract becomes Declarer. Declarer's partner is Dummy. The player to the left of Declarer becomes the first leading player. Dummy then lays his cards face up on the table and play proceeds clockwise. On each trick, the leading player plays one card from their hand, and the other players must follow with a card of the same suit as the leading player if possible; otherwise, they can play a card from another suit. The trump suit is superior to all other suits and, within a suit, a higher-rank card is superior to a lower-rank one. A trick is won by the player who plays the card with the highest priority. The winner of the trick becomes the leading player for the next trick.

A.2 Double Dummy Analysis (DDA)

Double Dummy Analysis assumes that, for a particular deal, one player's hand is fully observed by other players and players always play cards to their best advantage. However, given a set s = {x^N, x^S}, the distribution of remaining cards for the two non-bidding players East and West is still unknown. To reduce the variance of the estimate, we repeatedly sample a deal U times by allocating the remaining cards randomly to East and West and then estimate r_e(s) by taking the average of their DDA scores,

    r_e(s) = (1/U) Σ_{u=1}^{U} r_e(x^N, x^S, x^E_u, x^W_u),    (5)

where x^E_u, x^W_u are the hands for East and West from the u-th sampling respectively. For a specific contract a_t, the corresponding score is given by r_e(a_t | s) = r_e(x^N, x^S, a_t). In our work, we set U = 20.
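A sketch of the estimate in Equation 5 follows. The double-dummy scorer itself is external (Haglund 2010); dda_score here is a hypothetical callable standing in for it, and all other names are ours.

```python
import random

def estimate_scores(north, south, all_cards, contracts, dda_score, U=20):
    """Monte Carlo estimate of r_e(a | s): average double-dummy scores over U
    random completions of the East/West hands (Equation 5)."""
    remaining = [c for c in all_cards if c not in set(north) | set(south)]
    totals = {a: 0.0 for a in contracts}
    for _ in range(U):
        # Deal the 26 remaining cards randomly to East and West.
        shuffled = random.sample(remaining, len(remaining))
        east, west = shuffled[:13], shuffled[13:]
        for a in contracts:
            totals[a] += dda_score((north, east, south, west), a)
    return {a: total / U for a, total in totals.items()}
```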

A.3 Bridge Scoring

Algorithm 2 Bridge Duplicate Scoring
Input: tricks made, bid level, trump suit
  T ← trump suit
  δ ← tricks made − (bid level + 6)
  score ← 0
  if δ ≥ 0 then
    score ← score + bid level ∗ scale_T + bias_T      {Contract tricks}
    if score ≥ 100 then
      score ← score + 300      {Game bonus}
    else
      score ← score + 50       {Partscore}
    end if
    if δ = 6 then
      score ← score + 500      {Slam bonus}
    else if δ = 7 then
      score ← score + 1000     {Grand slam bonus}
    end if
    if δ > 0 then
      score ← score + δ ∗ scale_T      {Over-tricks}
    end if
  else
    score ← score − bid level ∗ 50     {Under-tricks}
  end if

Algorithm 2 shows how we score a game under Duplicate Bridge scoring rules. We obtain the average number of tricks made using Double Dummy Analysis (Haglund 2010) given the hands of players North and South {x^N, x^S}, the declarer and the trump suit. The score function above has a scale and a bias for each trump suit. The scale is 20 for ♣ and ♦ and 30 for all others. The bias is zero for all trumps, except NT which has a bias of 10.

Note that Double is only a valid bid in response to a contract proposed by one's opponents. Also, a Redouble bid must be preceded by a Double bid. In the non-competitive game, opponents do not propose contracts, so these options are naturally not included.
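For reference, a direct Python transcription of Algorithm 2 with the scale and bias values stated above (this mirrors the simplified scoring used here, not the full duplicate-bridge law; the suit keys are our own encoding):

```python
SCALE = {'C': 20, 'D': 20, 'H': 30, 'S': 30, 'NT': 30}
BIAS = {'C': 0, 'D': 0, 'H': 0, 'S': 0, 'NT': 10}

def duplicate_score(tricks_made: int, bid_level: int, trump: str) -> int:
    """Score a contract as in Algorithm 2."""
    delta = tricks_made - (bid_level + 6)
    score = 0
    if delta >= 0:
        score += bid_level * SCALE[trump] + BIAS[trump]   # contract tricks
        score += 300 if score >= 100 else 50              # game bonus / partscore
        if delta == 6:                                    # slam bonuses as written
            score += 500
        elif delta == 7:
            score += 1000
        if delta > 0:
            score += delta * SCALE[trump]                 # over-tricks
    else:
        score -= bid_level * 50                           # under-tricks
    return score
```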

p p bS xN

S S N N a FCp η + E + η FCp a xS bN

b N h FCb

South

Figure 5: The architecture of our policy network and belief network in bridge. E represents the environment and FC stands for fully connected layers. For player i, we add their private information x^i and belief b^i together to form η^i as the input to its policy network π. Player i's action a^i becomes part of player j's history h^j, which is the input to player j's belief network Φ.

A.4 Double Pass Analysis

At the beginning of bidding, when both players bid PASS (a Double Pass), all players' hands are re-dealt and a new episode starts. If we ignore these episodes in training, a naive strategy emerges where a player always bids PASS unless it is highly confident about its hand and therefore bids at level 1, whose risk is at the minimum. In this work, we are interested in solving problems where private information needs to be inferred from observed actions for better performance in a game. Therefore, this strategy is less meaningful, and the opportunity cost of bidding PASS could be high when a player could have won the game with a high reward. To reflect this opportunity cost in training, we set the reward for a Double Pass as the negative of the maximum attainable reward given the players' hands: r_dp(s) = −max(r_e(s)). Therefore, a player will be penalized heavily for bidding PASS if it could have obtained a high reward otherwise, and rewarded slightly if it could never win in the current episode. It is worthy of note that an initial PASS bid can convey information; however, if it is followed by a second PASS bid the game ends and hence no further information is imparted from a second PASS. We note that, by Duplicate Bridge Laws 77 and 22 of (League 2017), if the game opens with four PASS bids all players score zero. In our setting, however, East and West bid PASS regardless of their hands, which would give North and South an extra advantage. Therefore, we use r_dp to avoid results where North and South only bid when they have particularly strong hands; we discourage fully risk-averse behavior.

B Experiment Details in Bridge

B.1 Offline Environment

Generating a new pair of hands for North and South and pre-calculating scores for every possible contract at the start of a new episode during policy training is time-inefficient. Instead, we pre-generate 1.5 million hands and score them in advance. We then sample new episodes from this data set when we train a policy. We also generate a separate test data set containing 30,000 hands for testing our models.

B.2 Model Architecture

We parameterize the policy and belief modules with two neural networks π_{θ_p} and Φ_{θ_b} respectively. Fig. 5 shows our model for the bridge experiments. The weights of the belief and policy modules are shared between players. The input to a player's policy π_{θ_p} is the sum of its private information x and belief b. The last layer of the belief module Φ has a Sigmoid non-linear activation function, so that the belief vector b has elements in the interval [0, 1]. By adding the belief b to x, we can reuse the parameters associated with x for b and avoid training extra parameters.

B.3 Policy Pre-training

In the first PBL iteration (k = 0), to have a good initial policy and avoid one with low entropy, we train our policy to predict a distribution formed by applying a Softmax with temperature τ to r_e(s). The loss for the pre-trained policy given s is:

    L_{k_0} = KL( π(a | x) || Softmax(r_e(s), τ) ),

where KL is the KL-divergence. To have a fair comparison with other benchmarks, all our benchmarks are initialized with this pre-trained policy. Supervising a bidding policy to predict pre-calculated scores for all actions only provides it with a basic understanding of its hand.

B.4 Policy Training

We utilize Proximal Policy Optimization (PPO) (Schulman et al. 2017) for policy training. Optimization is performed with the Adam optimizer (Kingma and Ba 2014) with β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}. The initial learning rate is 10^{-4}, which we decay exponentially with a rate of decay of 0.95 and a decay step of 50. The distance measure used in the communication reward r_c is cross entropy, and we treat each dimension of the belief and ground truth as an independent Bernoulli distribution. We train PBL with 8 PBL iterations; we apply policy gradients 200 times per iteration and we sample 5000 episodes each time. Each mini-batch is then a sub-sample of 2048 episodes. The initial communication weight α is 5, and we decay it gradually. We train all other baselines with the same PPO hyperparameters.
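The pre-training objective in B.3 above can be written compactly; the following is a minimal sketch. The use of PyTorch primitives and the application of the temperature as r_e(s)/τ are our assumptions; only the form of the loss follows the text.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(policy_logits: torch.Tensor, scores: torch.Tensor, tau: float) -> torch.Tensor:
    """KL( pi(a|x) || Softmax(r_e(s), tau) ): pushes the initial policy towards
    bids whose pre-computed double-dummy scores are high."""
    log_pi = F.log_softmax(policy_logits, dim=-1)
    target = F.softmax(scores / tau, dim=-1)
    pi = log_pi.exp()
    # KL(pi || target), summed over actions and averaged over the batch.
    return torch.sum(pi * (log_pi - torch.log(target + 1e-8)), dim=-1).mean()
```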

B.5 Learning A Belief Network

When a player tries to model its partner's hand based on the observed bidding history, we assume it can omit the restriction that its partner can only hold 13 cards. Therefore, we treat the prediction of the partner's hand given the observed bidding history as a 52-label classification problem, where each label represents one corresponding card being in the partner's hand. In other words, we treat each card from a 52-card deck being in the partner's hand as an independent Bernoulli distribution, and we train a belief network by maximizing the joint likelihood of these 52 Bernoulli distributions given a bidding history h. This gives the loss for the belief network as:

    L_Φ = − Σ_{c=1}^{52} [ x^{-i}_c log(b^i_c) + (1 − x^{-i}_c) log(1 − b^i_c) ],

where x^{-i}_c and b^i_c are the c-th elements of the encoding vector of the partner's hand x^{-i} and of the agent's belief b^i respectively. The reasoning behind this assumption is that we consider it more important to have a more accurate prediction over an invalid distribution than a less accurate one over a valid distribution, as the belief itself is already an approximation.

For each iteration of belief training, we generate 300,000 data episodes with the current policy to train the belief network. Optimization is performed with the Adam optimizer with β_1 = 0.9, β_2 = 0.999, ε = 10^{-8} and a decay rate of 0.95. The initial learning rate is 10^{-3}. The batch size is 1024. We split the data set such that 90% of it is used for training and 10% is used for an early-stopping check to prevent overfitting.
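The loss above is a summed binary cross-entropy over 52 independent Bernoulli variables. A minimal sketch, assuming b is the sigmoid output of the belief network and x the partner's 52-dimensional binary hand encoding (names are ours):

```python
import torch
import torch.nn.functional as F

def belief_loss(b: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """L_Phi = -sum_c [ x_c * log b_c + (1 - x_c) * log(1 - b_c) ], averaged over the batch.
    b: (batch, 52) predicted probabilities; x: (batch, 52) binary hand encoding."""
    return F.binary_cross_entropy(b, x, reduction='none').sum(dim=-1).mean()
```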

C Learned Opponent Model and Bidding Convention

C.1 High Card Points Tables

Table 1: Opening bid – own aHCPs.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   1.4  1.3  1.4  1.4   5.4
    1♠     2.3  2.1  2.3  4.5  11.2
    1♥     2.3  2.2  4.7  1.8  11.0
    1♦     2.2  4.4  2.1  2.0  10.7
    1♣     4.8  1.9  2.1  2.1  10.9
    3NT    4.4  4.4  4.9  4.6  18.3
    4♦     3.0  6.5  3.1  3.4  16.1
    4♣     6.6  5.3  2.3  2.3  16.5

Table 2: Belief HCPs after observing the opening bid.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   1.4  1.3  1.3  1.3   5.3
    1♠     2.3  2.1  2.2  4.6  11.1
    1♥     2.2  2.2  4.7  1.9  11.0
    1♦     2.3  4.5  2.0  2.0  10.8
    1♣     4.8  2.0  2.2  2.2  11.1
    3NT    4.4  4.6  4.8  4.7  18.5
    4♦     2.8  6.8  3.3  3.1  16.0
    4♣     6.2  4.0  3.0  2.8  16.0

Table 3: Responding bid – own + belief aHCPs.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   4.4  4.5  4.4  4.5  17.8
    1♠     4.4  4.3  4.5  7.1  20.3
    1♥     4.6  4.4  7.0  4.9  20.9
    1♦     4.3  6.7  4.8  4.9  20.7
    1♣     7.2  4.3  4.7  4.9  21.1
    2♠     4.8  4.7  5.2  8.5  23.2
    2♥     4.9  5.1  8.6  5.1  23.7
    3NT    6.6  6.6  7.1  7.2  27.6
    4♦     5.2  8.5  5.5  5.6  24.9
    4♣     8.4  5.6  5.4  5.4  24.8
    6♥     5.7  6.2  9.9  6.9  28.7
    6♦     6.4  9.2  7.3  7.1  30.0
    6♣    10.8  4.0 10.0  6.5  31.3
    6NT    7.5  7.0  8.7  9.0  32.2

Table 1 shows the average HCPs (aHCPs) present in a hand for each of the opening bidding decisions made by North. Once an opening bid is observed, South updates their belief; Table 2 shows the effect which each opening bid has on South's belief. We show in Table 3 the responding bidding decisions made by South; the aHCP values in Table 3 are the sum of the HCPs in South's hand and South's belief over the HCPs in North's hand. For each row whose bid has a specified trump suit, the maximum value in that row falls in the column of the bid suit.

C.3 Interesting Bidding Examples

We note that, in general, in the initial stages of bidding the suit of a bid is used to show strength in that suit, with no-trump bids used to show strength across all suits and PASS bids used to show a weak hand. In later stages of bidding, players either reiterate their strength with larger trick numbers in their bids or show strength in other suits. In the final rounds of bidding, agents show their overall combined view of the pairing's hands to propose bids that reflect all 26 of the pair's cards. In the following examples, we describe how this bidding process manifests in certain cases. Note that in each case where a bid specifying the trump suit is made, the bidding pair have the majority of the cards in the trump suit in their collective hands. This is often essential for success in attaining suited bids.

Example 1
NORTH: ♠ 6 10   ♥ 2 4 6 A   ♦ 2   ♣ 3 5 J Q K A
SOUTH: ♠ 8 K A   ♥ 8   ♦ 3 4 7 9 J A   ♣ 2 6 7
BIDS: 1♣ 1♦ 2♣ 2♦ 3NT PASS
REWARD: 0.283

North has a very strong hand in clubs, holding all of the picture cards and the ace. North also holds the ace of hearts. South has strength in diamonds and spades. The bidding pattern shows that North initially signals their strength in clubs and South their strength in diamonds. Both parties then reiterate their strengths in their next turns and a contract of 3NT is agreed.

Example 2
NORTH: ♠ 8 Q K A   ♥ 8 Q   ♦ 2 3 4 9 J Q A
SOUTH: ♠ 3 4   ♥ 3 7 9 10 J K   ♦ 5 6 K   ♣ 4 A
BIDS: 1♦ 1♥ 3NT 4♥ 6♥ PASS
REWARD: 0.301

North opens the bidding with diamonds to show their strength in the suit. South then responds with hearts to show their strongest suit. North then bids no trumps to signal an inferred collective strength in multiple suits. South has little strength outside of hearts and continues to bid hearts, and ultimately a contract of 6♥ is reached as bid by North, who only has two hearts. This is a bid to win all but one trick with hearts as trumps, which reflects the fact that between them the pair have from 7 to King of hearts. The pairing bid hearts despite their collective strength also in diamonds, since hearts commands a higher per-trick score if the contract is met.

Example 3
NORTH: ♠ 2 4 7 J A   ♥ K   ♦ K   ♣ 3 4 5 6 8 A
SOUTH: ♠ 6 9 K   ♥ 7   ♦ 3 5 6 10 Q A   ♣ 10 J K
BIDS: 1♣ 1♦ 1♠ 2♣ 3NT 4♠ PASS
REWARD: 0.305

Between them the pairing have a strong hand in diamonds, clubs and spades with a lack of hearts (15 + 13 = 28 HCP Total). Bidding opens with 1♣ as North aims to show their strength in clubs. South then conveys their strength in diamonds with a bid of 1♦. Bidding continues as North demonstrates some strength in spades with a bid of 1♠. Later, a bid by North of 3NT demonstrates that North believes the pair have strength across the suits, as reflected by the bids from all suits except hearts. Bidding ends, however, with a bid of 4♠ from South, which is then accepted through a PASS bid from North. South makes the bid of 4♠ despite only having 3 spades in their hand. The earlier bid of 1♠ from North informs South that North has some strength in spades and hence South believes that the pair may be able to attain 4♠.

Example 4
NORTH: ♠ 2   ♥ 7 9 10 K   ♦ 2 4 9 J A   ♣ 10 J A
SOUTH: ♠ 9 K   ♥ 2 4 J A   ♦ 3 6 8 Q K   ♣ Q K
BIDS: 1♦ 3NT 4♥ 6♦ 6NT PASS
REWARD: 0.651

North has strength in all suits excluding spades (13 HCP Total) and South has strength evenly spread across all suits (18 HCP). Their shared strongest suit is diamonds, where they hold the highest 4 cards between them. The bid of 6♦ is raised to 6NT as North forms a belief that the pair share strength across multiple suits (31 HCP Total for both hands).

Example 5
NORTH: ♠ 3 9 J Q   ♥ 6 10 Q   ♦ 2 3 7 A   ♣ 2 5
SOUTH: ♠ 7 10   ♥ 4 J A   ♦ 4 8 Q K   ♣ 4 6 K A
BIDS: PASS 1♣ 1♦ 2♦ 3NT PASS
REWARD: 0.211

North opens the bidding with PASS, signaling that they have a relatively weak hand (8 HCP Total). South then responds with 1♣, signaling their strength in clubs. Bidding then progresses as North bids 1♦, signaling that diamonds is the stronger of their suits. South also has strength in diamonds, and ultimately the pairing is able to reason about their collective hands and bid 3NT; a bid that would not be reached without forming beliefs, due to the relative weakness of North's hand.

Example 6
NORTH: ♠ 6 7 10 J   ♥ 5 6 7 K   ♦ 8 10   ♣ 7 Q K
SOUTH: ♠ 9 K   ♥ A   ♦ 3 4 9 Q A   ♣ 2 3 4 8 A
BIDS: PASS 1♣ 2♣ 2♦ 3NT PASS
REWARD: 0.258

Between them, North and South have evenly spread strength across the suits (9 + 17 = 26 HCP Total). Similarly to Example 4, we see that the agents are able to reason, using their beliefs, about the collective strength of both of their hands. Here the pair reach a bid of 3NT, while in Example 4 the pairing shared 31 HCP and reached a bid of 6NT.

Example 7
NORTH: ♥ 8 Q K   ♦ 4 5 6 7 8 9   ♣ 2 3 7 K
SOUTH: ♠ 2 4 6 10 J   ♥ 3 9 A   ♣ 9 10 J Q A
BIDS: PASS 1♣ 1♦ 1♠ 2♣ PASS
REWARD: 0.104

Both players lack any cards in one of the four suits. North opens the bidding with PASS to show the relative weakness of their hand (8 HCP Total). Bidding remains low in different suits, reflecting the similar level of strength across the suits, while low bids show no significant strength overall. Ultimately a bid of 2♣ is reached, reflecting strength in ♣, as between them the pair hold from 9 to Ace of clubs.