
Ad hoc teamwork with unknown task model and teammate behavior

Gonçalo Alfredo dos Santos Rodrigues

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Francisco António Chaves Saraiva de Melo and Prof. José Alberto Rodrigues Pereira Sardinha

Examination Committee

Chairperson: Prof. Francisco João Duarte Cordeiro Correia dos Santos
Supervisor: Prof. Francisco António Chaves Saraiva de Melo
Member of the Committee: Prof. Pedro Manuel Urbano de Almeida Lima

OCTOBER 2018

Acknowledgments

I want to thank both my supervisors, Francisco Melo and Alberto Sardinha, for steering me in the right direction, helping me whenever I had questions and providing adequate resources when I needed them. You were always there to answer my questions and provide technical guidance and, of course, advice when it came to writing. I also want to thank my family for supporting me throughout this process, always assuring me that I could finish this thesis and providing everything I needed to do so. Finally, I want to thank my friends, for sharing their thesis stories and relating to my problems, for supporting me and keeping me sane. Without you, I would have gone crazy trying to write all of this.

Abstract

This thesis addresses the problem of ad hoc teamwork. Ad hoc teamwork consists of creating an agent that can cooperate with a team without pre-coordination. We focus on settings where the task and world model are unknown, and the agent must learn this task, learn its teammates' policies and adapt to them. We propose a new approach that combines model-based reinforcement learning with Monte Carlo tree search to enable an ad hoc agent to learn the underlying task and coordinate with its teammates. We model the underlying environment and the teammates' behaviors using deep neural networks that are learned as the ad hoc agent explores. By using a model-based approach, we can achieve good performance with low sample complexity. The solution is demonstrated in the well-established pursuit domain and our results show that it is competitive with state-of-the-art approaches that rely on a perfect model of the task and known behavior of the teammates.

Keywords

Ad hoc teamwork; Multiagent reinforcement learning; Deep learning; Model-based reinforcement learning; Artificial intelligence


Resumo

This thesis addresses the ad hoc teamwork problem. Ad hoc teamwork consists of creating an agent that cooperates with a team without pre-coordination. We focus on scenarios in which the task and world model are unknown, and the agent has to learn the task, learn the behavior of its teammates and adapt to them. We propose a new solution that combines model-based reinforcement learning with Monte Carlo tree search so that the ad hoc agent learns the task and is able to coordinate with the rest of the team. We model the environment and the team's behavior using deep neural networks that are learned as the agent explores. Using this model-based approach, we achieve good performance without using many samples. The solution is demonstrated in the well-known pursuit domain and our results show that it is competitive with state-of-the-art solutions that assume perfect knowledge of the task and of the team's behavior.

Palavras Chave

Ad hoc teamwork; Multiagent reinforcement learning; Deep learning; Model-based reinforcement learning; Artificial intelligence


Contents

1 Introduction
  1.1 Motivation
  1.2 Hypothesis
  1.3 Contributions

2 Background
  2.1 Multi-armed bandits
  2.2 Fully cooperative normal form game
  2.3 Markov decision processes (MDP)
  2.4 Monte carlo tree search (MCTS)
    2.4.1 Upper Confidence Bound for Trees (UCT)
  2.5 Neural networks

3 Related work
  3.1 Multi-armed bandits
  3.2 Iterated normal form game
  3.3 Markov Decision Processes
  3.4 Other formalizations
  3.5 Summary

4 Solution
  4.1 Problem formalization
  4.2 3-layered architecture for ad-hoc teamwork
  4.3 Learning teammate's policies with a deep neural network
  4.4 Task identification with model-based reinforcement learning
  4.5 MCTS planning

5 Evaluation
  5.1 Pursuit domain
  5.2 System layout
  5.3 Setup
    5.3.1 Features selected for teammate policy learning
    5.3.2 Features selected for environment dynamics learning
  5.4 Results
    5.4.1 Greedy teammates
    5.4.2 Teammate aware teammates

6 Conclusion
  6.1 Future work

List of Figures

2.1 Monte carlo tree search algorithm in four steps
2.2 Example of a neural network with one hidden layer

4.1 Architecture of proposed solution
4.2 An example of the expansion step of Monte Carlo Tree Search (MCTS) planning using previously learned models of both behavior and environment. (D,U,L) represents an example of actions by the three teammates and R represents an action by the ad hoc agent.

5.1 An example of a capture position. Prey is represented as a green circle and predators are represented as red circles. In this case, the prey is unable to move in any direction, ending the game and all predators earn a reward.
5.2 Architecture of teammate behavior neural network
5.3 Architecture of environment transition neural network
5.4 Example state in the pursuit domain
5.5 Results for our ad-hoc agent after 1, 50 and 200 episodes in a 5 × 5 world
5.6 Results for our ad-hoc agent after 1, 50 and 200 episodes in a 10 × 10 world
5.7 Results for our ad-hoc agent after 1, 50 and 200 episodes in a 20 × 20 world
5.8 Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 5 × 5 world
5.9 Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 10 × 10 world
5.10 Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 20 × 20 world

List of Tables

3.1 Summary of approaches that tackle the ad hoc teamwork problem

Acronyms

MDP Markov Decision Process

MMDP Multiagent Markov Decision Process

POMDP Partially Observable Markov Decision Process

RL Reinforcement Learning

MCTS Monte Carlo Tree Search

HFO Half Field Offense

FQI Fitted Q-iteration

MAB Multi-armed bandits

SGD Stochastic Gradient Descent

UCT Upper Confidence Bound for Trees

DQN Deep Q-Networks

1 Introduction

Contents

1.1 Motivation

1.2 Hypothesis

1.3 Contributions

1.1 Motivation

As more agents are incorporated into the world, the more we need them to work together for better performance. In the cooperative multiagent setting, two or more agents act in the environment with the goal of maximizing their joint reward. Solutions for this setting either incorporate explicit communication between the agents, or the same algorithm is shared among the agents so that they are implicitly coordinated. However, there is an increasing need for agents that can cooperate with their team without communicating or without pre-coordination. Consider the following two scenarios:

1. Developers from different countries create agents independently to rescue people from a natural catastrophe, without defining a communication protocol or coordinating their approaches.

2. Agents have been deployed in some environment and, years later, one of them breaks and must be replaced. Since many years have passed, there is an opportunity to create an agent more powerful than the existing ones, but we still wish to keep the old agents.

In both cases we need to create an agent capable of cooperating with others whose behavior differs from its own. It must be able to adapt to a wide range of behaviors in an online fashion, possibly behaviors not known a priori. Agents can have different capabilities, different models of the world, different roles and different ways of communicating, and yet they need to be able to cooperate. We are facing an ad hoc teamwork problem. In ad hoc teamwork, the goal is to design an agent that can learn its teammates' behaviors and characteristics online and adapt to them in order to reach a certain shared objective. This problem was introduced by Stone et al. [1] and some work has been done that tackles specific constrained instances of the problem. Some of the more complex domains used as a testbed for ad hoc teamwork approaches include the pursuit domain (where four agents chase a prey) and Half Field Offense (HFO) (a simulated soccer game played only in one half of the field). Most theoretical solutions propose algorithms whose execution time grows exponentially with the number of agents. Some of the empirical approaches either have strong assumptions – usually a known task and team behavior – or they are model free, leading to a high sample complexity. In this thesis we address the following research question: How can we tackle the ad hoc teamwork problem without imposing many restrictions on the other agents and without incurring an excessive sample complexity?

1.2 Hypothesis

Similarly to previous work [2], our ad hoc agent must cooperate with a team of agents in the pursuit domain [3], and must complete the goal in different world sizes.

We also expect that our solution, while having less knowledge regarding the domain and teammates' policies, still achieves comparable performance to state-of-the-art solutions that incorporate such knowledge, notably PLASTIC-Model [4]. Finally, we expect that our solution converges faster to a good performance when compared to model-free approaches, such as PLASTIC-Policy [5] or Deep Q-networks [6].

1.3 Contributions

The main contribution is a novel ad hoc agent architecture addressing the ad hoc teamwork problem in full, including task identification, teammate identification and planning. The approach taken comprises three main modules, each addressing one of the aforementioned challenges:

• A model-based reinforcement learning module that learns a model of the environment and task. The model uses a deep neural network that predicts, given the current state and the actions of all agents in the environment (ad hoc and teammates), the resulting state.

• A fictitious-play-like approach to model the teammates, constructing a predictive model that, given an input state, returns a distribution over the teammates’ actions.

• Finally, a Monte Carlo tree search planner that uses the environment and teammate models to select the action of the ad hoc agent.

The approach is illustrated in the well-established pursuit domain, providing a comparative analysis between our own performance and that of state-of-the-art methods from the literature.

2 Background

Contents

2.1 Multi-armed bandits

2.2 Fully cooperative normal form game

2.3 Markov decision processes (MDP)

2.4 Monte carlo tree search (MCTS)

2.5 Neural networks

Most related work in the ad hoc teamwork field looks at the ad hoc teamwork problem from one of three different perspectives – as Multi-armed bandits (MAB), as a fully cooperative normal form game or as a Markov Decision Process (MDP). In this chapter we introduce these three frameworks and some of their properties. Additionally, we introduce Monte Carlo Tree Search (MCTS) and neural networks, as they are used in this work.

2.1 Multi-armed bandits

MAB are usually used to study exploration versus exploitation problems. There are k arms, each with an unknown payoff distribution (usually assumed to be stationary). An agent can choose one of the arms and receive a reward drawn from the corresponding distribution, but it is only able to see the reward of the arm it picked and none other. The problem either has a finite number of rounds (finite horizon) or an infinite discounted number of rounds, where future rewards are multiplied by a discount factor. The goal is to choose a sequence of actions that maximizes the total reward. If the agent always picks the same arm, it might be missing out, as it cannot know whether it is picking the optimal one. But if it keeps exploring and changing arms, it is giving up some of the reward, as it is not always choosing the optimal arm. Formally, we measure the performance of different algorithms using the regret R. It can be computed by finding the difference between the sum of rewards of the optimal arm in hindsight and the sum of

actually collected rewards. Denoting by X_{it} the reward that arm i gave, or would give, if picked at time t, and by a_t the arm chosen at time t, the expected regret after T plays is defined as in (2.1). One goal when solving MAB problems is to minimize the expected regret.

" T T # X X E[RT ] = E max Xit − Xatt (2.1) i=0,1,··· ,k t=0 t=0

MAB problems can be further extended to the multi-agent scenario, in which all agents pick their actions at the same time or take turns. If the agent is allowed to see the other agents' actions, it can use that information to update its estimate of the average reward of each arm.
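Equation (2.1) can be made concrete with a small simulation. The sketch below is not from the thesis: it assumes Bernoulli arms and an ε-greedy selection rule purely for illustration, and estimates the expected regret by averaging the realized regret over independent runs.

```python
import numpy as np

def run_epsilon_greedy(arm_means, T=1000, eps=0.1, seed=0):
    """One run of epsilon-greedy on Bernoulli arms; returns the realized regret R_T."""
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    # Pre-draw X_{it}: the reward arm i would give if picked at time t.
    rewards = rng.binomial(1, arm_means, size=(T, k))
    counts, sums = np.zeros(k), np.zeros(k)
    collected = 0.0
    for t in range(T):
        if t < k:                                  # pull each arm once to initialise estimates
            a = t
        elif rng.random() < eps:                   # explore with probability eps
            a = int(rng.integers(k))
        else:                                      # exploit the empirically best arm so far
            a = int((sums / counts).argmax())
        counts[a] += 1
        sums[a] += rewards[t, a]                   # the agent only observes the arm it pulled
        collected += rewards[t, a]
    best_in_hindsight = rewards.sum(axis=0).max()  # max_i sum_t X_{it}
    return best_in_hindsight - collected           # realized regret, as in Equation (2.1)

# Averaging over independent runs approximates the expected regret E[R_T].
print(np.mean([run_epsilon_greedy([0.2, 0.5, 0.8], seed=s) for s in range(100)]))
```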

2.2 Fully cooperative normal form game

A fully cooperative normal form game can be fully defined by a reward matrix where each joint action is associated with a reward. All agents choose an action at the same time, and they all get the same reward, as defined in the matrix. We only consider fixed (not stochastic) rewards and a repeated

7 scenario where agents play the same game for a finite or infinite number of rounds. One of the problems that this framework can be used for is cooperation. Consider the scenario represented by the matrix below:

       b_0   b_1
a_0     1     0
a_1     0     1

Two agents play a game where they get 1 point if they pick the same action, either (a_0, b_0) or (a_1, b_1), but get 0 points if they play differently from each other, (a_0, b_1) or (a_1, b_0). In this case the optimal action of one agent depends on the other agent's actions, and when no communication is available this can be a hard problem. One possible way to solve it is to try to predict the other agent's behavior, for example using the history of actions. If the agent can predict this behavior then the problem is completely solved. This framework is widely used, as the ad hoc teamwork problem includes this coordination-without-communication problem. A simple strategy that has been studied frequently is the bounded-memory best-response strategy. An agent following such a strategy remembers the last mem (where mem is an integer ≥ 1) joint actions and, for each other player, computes the empirical distribution of their actions. It then assumes that each player chooses randomly from their action distribution. The strategy is to choose the action that leads to the best expected value. When memory is unbounded, this algorithm is also known as fictitious play. MAB and normal form games serve different purposes – MAB are usually concerned with minimizing regret, while normal form games are usually looking for Nash equilibria.
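To make the bounded-memory best-response strategy concrete, here is a minimal sketch under illustrative assumptions (a two-player fully cooperative game given as a NumPy reward matrix and a memory window `mem`); none of the names come from the thesis. Passing `mem=None` uses the whole history, which corresponds to fictitious play.

```python
import numpy as np
from collections import deque

def best_response(reward_matrix, other_action_history, mem=3):
    """Bounded-memory best response for the row player in a fully cooperative game.

    reward_matrix[a, b] is the shared reward when the row player plays a and the
    column player plays b. The opponent is assumed to pick uniformly from its
    last `mem` observed actions (mem=None uses the whole history: fictitious play).
    """
    history = list(other_action_history)
    window = history[-mem:] if mem else history
    if not window:                        # no observations yet: assume a uniform opponent
        beliefs = np.full(reward_matrix.shape[1], 1.0 / reward_matrix.shape[1])
    else:
        counts = np.bincount(window, minlength=reward_matrix.shape[1])
        beliefs = counts / counts.sum()
    expected = reward_matrix @ beliefs    # expected reward of each of our actions
    return int(expected.argmax())

# Coordination game from the text: reward 1 for matching actions, 0 otherwise.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
history = deque([1, 1, 0], maxlen=3)      # the other agent recently played b_1, b_1, b_0
print(best_response(R, history, mem=3))   # -> 1: match the opponent's most frequent action
```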

2.3 Markov decision processes (MDP)

MDPs are defined as a 5-tuple (X, A, {P_a}, r, γ), where X is a set of states, A is a set of actions available to the agent, P_a(s' | s) is the probability of the next state being s' when coming from state s and using action a, r(s, a) is the reward received for selecting action a in state s, and γ is the discount factor. The main difference between MDPs and the other two frameworks is that MDPs have multiple states through which the agent transitions over time. In an MDP, the environment has a state and the rewards depend on this state, and therefore the rewards are no longer stationary. At each timestep, the agent transitions to a new state, depending on the action it chose, as given by the function P, and gets a reward that depends on the current state and the chosen action, as given by the function R. This framework can model a huge variety of problems (e.g. robot soccer) and is much more general than the previous ones. X, A and γ are always assumed to be known, but P and R can be unknown. In the case that they are unknown, we are facing a Reinforcement Learning (RL) problem. If everything is known, the optimal action to choose at each timestep can be computed by iterating

the following equation until the value converges (value iteration).

$$Q^{t+1}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q^{t}(s', a') \tag{2.2}$$

It can be proven that the Q function converges to a unique solution and that the greedy policy with respect to this Q function (for each state s, choose the action a that maximizes Q(s, a)) is the optimal policy. Each step of value iteration has complexity O(|A||S|^2). If, however, we do not know P or R, there are two kinds of approaches that can be used: model-free or model-based. Model-free approaches attempt to get the optimal policy without computing P or R, while model-based approaches try to infer P and R from the observations and then compute the optimal policy from those estimates. One example of a model-free approach is Q-learning, which updates Q-values as (state, action) pairs are visited. For each transition from s_t to s_{t+1} using action a_t and receiving reward r_t, the update is done as in (2.3), where α represents a learning rate.

$$Q^{t+1}(s_t, a_t) = (1 - \alpha) Q^{t}(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a \in A} Q^{t}(s_{t+1}, a) \right) \tag{2.3}$$

It can be proven that Q-learning converges to the optimal policy if every (state, action) pair is visited infinitely many times. Often, the number of states is quite large and these solutions are no longer practical, as they depend at least quadratically on the number of states. Imagine a grid world with an n × n grid and k agents positioned in it. The state is fully represented by the positions of all agents in the grid and must have an x and y coordinate for each agent, so it corresponds to having 2k variables, each with a value from 1 to n. To represent all possible values, we would need n^{2k} states, which might be impractical even for a dozen agents. Another problem with the above techniques arises when we are dealing with continuous states or actions, as they need to keep a value for each state-action pair. One approach would be to discretize states and actions, but this can lead to a huge number of states in order to get good enough precision. Another model-free algorithm, called Fitted Q-iteration, attempts to solve these problems. In this algorithm, one collects a dataset of transitions offline and then uses a supervised learning technique to model the Q-value function for each action given the state, where the error function can be, for example, (Q^t(s_t, a_t) − (r_t + γ max_{a∈A} Q^t(s_{t+1}, a)))^2. The input for the supervised learning algorithm is a set of features that fully describes the state. This way, the similarity between states is leveraged, as states with similar features should have similar values, which can be naturally represented in supervised learning methods. For the example given before, only 2k features – the x, y coordinates of each agent – would be needed to represent the state. MDPs can be extended to multiple agents in a fully cooperative scenario simply by considering the action space to be the set of all possible joint actions. If all agents run the exact same value iteration algorithm

breaking ties the same way, then the value iteration approach still works, as each agent simply picks its action from the optimal joint policy. Of course, if we cannot control all agents' actions (decentralized approach), then the optimal policy is no longer the one given by value iteration. This is the case for the ad hoc teamwork problem – we are not able to control all agents. Multiagent Markov Decision Processes (MMDPs) are more general than the multi-armed bandits and fully cooperative normal form games described previously. In fact, a fully cooperative normal form game can be seen as a special case of an MMDP with a single state, where the reward function represents the reward matrix.
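For reference, the update in Equation (2.2) can be implemented as a short tabular value iteration routine. The sketch below assumes the transition and reward functions are given as dense NumPy arrays `P` (shape |S| × |A| × |S|) and `R` (shape |S| × |A|); this layout and the stopping tolerance are illustrative choices, not something fixed by the thesis.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration.

    P[s, a, s'] is the transition probability, R[s, a] the expected reward.
    Returns the converged Q-values and the greedy (optimal) policy.
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                       # max_{a'} Q(s', a')
        Q_new = R + gamma * P @ V               # Equation (2.2) for all (s, a) at once
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1)  # greedy policy w.r.t. the fixed point
        Q = Q_new
```

In the multiagent case the action axis ranges over joint actions, which is exactly why this table-based approach blows up exponentially with the number of agents.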

2.4 Monte carlo tree search (MCTS)

To solve MDPs, one can use Value Iteration as explained in Section 2.3, but this can be very slow when there is a large number of state-action pairs. Rather than solving the MDP optimally, MCTS is able to solve it approximately, reducing the required computation time. MCTS is an algorithm composed of four stages: selection, expansion, simulation and backpropagation. It builds a game tree as time goes on, where nodes represent either states or actions. It starts with a root node, representing the initial state. A representation of these steps can be seen in Figure 2.1.

Figure 2.1: Monte carlo tree search algorithm in four steps

The first step is selection. In this step, we go through the tree, starting at the root node, selecting a child at each iteration until we hit a leaf node. The children are not necessarily selected at random, but rather using a tree policy. The second step is expansion. In this step, we generate a child of the leaf node that was selected in the previous step. Children of a state node are simply all actions available to the agent at that state. Children of an action node are all possible states resulting from the previous state and the chosen action. The third step is simulation. Starting from the child generated in the previous step, we simulate a

game starting from that state until the game ends or a maximum depth is reached. In this simulation, the agent chooses its actions according to a default policy, usually just random actions. Nodes resulting from this simulation are not kept in the tree after the simulation ends. The score achieved is saved in the generated child as an estimate of its value. The final step is backpropagation. In this step we backpropagate the reward that the agent got in the simulated game, so that each node, from the root to the child generated in the second step, has an estimate of the reward achieved when starting from its state. After a given number of iterations, a tree has been generated and there is an estimate of each action's value. To pick the best action, the agent simply chooses the action linked to the root node that has the best reward estimate.

2.4.1 Upper Confidence Bound for Trees (UCT)

UCT is an implementation of MCTS with a specific tree policy and default policy. When choosing which action to explore, it balances exploration of unexplored nodes and exploitation of nodes with good rewards. It does this by casting the problem of selecting a node as an MAB problem and using upper confidence bounds to solve it. In UCT, at a state node, the action i is selected if it maximizes the following value:

$$\bar{r}_i + c \sqrt{\frac{\log N}{n_i}} \tag{2.4}$$

where r̄_i is the estimate of the value of action i, N is the number of visits to the parent state node, n_i is the number of times that action has been picked, and c is the exploration parameter, which depends on the problem and is usually chosen empirically. The default policy, used for the simulations, is to just pick an action from the available actions uniformly at random.
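The selection rule in Equation (2.4) is only a few lines of code. In the sketch below, the per-child statistics `(value_sum, visits)` and the value c ≈ 1.4 are assumptions made for illustration; they are not prescribed by the thesis.

```python
import math

def uct_select(children_stats, c=1.4):
    """Pick a child index according to Equation (2.4).

    children_stats is a list of (value_sum, visits) pairs for the children of
    a state node; unvisited children get priority (treated as an infinite bound).
    """
    N = sum(visits for _, visits in children_stats)  # visits to the parent state node
    best, best_score = None, -math.inf
    for i, (value_sum, visits) in enumerate(children_stats):
        if visits == 0:
            return i                                  # explore unvisited actions first
        score = value_sum / visits + c * math.sqrt(math.log(N) / visits)
        if score > best_score:
            best, best_score = i, score
    return best

print(uct_select([(3.0, 5), (1.0, 1), (0.0, 0)]))     # -> 2 (unvisited child)
```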

2.5 Neural networks

Supervised learning is a subset of machine learning that studies algorithms for learning some mapping from input to output given examples. Formally, supervised learning algorithms are given examples, each constituted by input features x and a target output y. Their goal is to learn a mapping from features to target so that, when given the features of a new example, they predict its target. Neural networks are one framework for supervised learning that has been heavily studied in recent years and is state-of-the-art for many different domains. A neural network can be seen as a directed acyclic graph composed of an input layer, an output layer and H hidden layers. Each layer is composed of many nodes and each node is connected to every node of the next layer by a weighted edge. These weights are organized in matrices. The weight that connects node i in layer h − 1 to node j in layer h is denoted by W^{(h)}_{ji}. An example can be seen in Figure 2.2.

Figure 2.2: Example of a neural network with one hidden layer

Given a set of features x as a column vector, the target prediction is computed by

$$\hat{y} = f(W^{(H)} \cdots f(W^{(2)} f(W^{(1)} x))) \tag{2.5}$$

where f is an activation function. The activation function must be non-linear, or else ŷ would just be a linear combination of x, which would severely limit the types of functions the network could learn. Usually f is chosen to be the sigmoid¹ or the ReLU² function, as they work well in practice. One key characteristic of ŷ is that it is differentiable with respect to x. In order to learn, neural networks optimize their weights by minimizing the error between their predictions ŷ and the known targets y. This error can be captured by many different functions, but usually one of two is chosen depending on the task: mean squared error for regression or cross-entropy for classification. Mean squared error is defined as MSE(ŷ, y) = ||ŷ − y||² while cross-entropy is defined as CE(ŷ, y) = −y log(ŷ). As ŷ gets closer to the true y, the value of these loss functions gets smaller –

¹ σ(x) = 1 / (1 + e^{−x})
² ReLU(x) = max(0, x)

and therefore, if we minimize the loss functions, we should get ŷ close to the true y, leading to accurate predictions of the target. Optimization in neural networks is usually done with Stochastic Gradient Descent (SGD), as the loss function is differentiable, making it suitable for derivative-based optimization. There are many variations and adaptations of SGD, such as those with momentum, which yield better results in practice. A commonly used one is Adam [7], which adapts the learning rate of each parameter that needs to be learned in each iteration.
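As a minimal illustration of Equation (2.5) and of loss minimization by gradient descent, the sketch below trains a one-hidden-layer network with a sigmoid activation and a linear output on a toy regression task; the layer sizes, learning rate and plain full-batch gradient descent (rather than SGD or Adam) are simplifying assumptions, not choices made in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy regression data: the target is the sum of the 4 input features.
X = rng.normal(size=(256, 4))
y = X.sum(axis=1, keepdims=True)

W1 = rng.normal(scale=0.5, size=(4, 16))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(16, 1))   # hidden -> output weights

lr = 0.05
for step in range(2000):
    h = sigmoid(X @ W1)                    # hidden activations f(W1 x)
    y_hat = h @ W2                         # prediction (linear output layer)
    err = y_hat - y
    loss = (err ** 2).mean()               # mean squared error
    # Backpropagation: gradients of the loss w.r.t. W2 and W1.
    grad_out = 2 * err / len(X)
    grad_W2 = h.T @ grad_out
    grad_W1 = X.T @ ((grad_out @ W2.T) * h * (1 - h))  # sigmoid derivative h(1-h)
    W2 -= lr * grad_W2                     # plain gradient descent step
    W1 -= lr * grad_W1

print(f"final MSE: {loss:.4f}")
```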

3 Related work

Contents

3.1 Multi-armed bandits

3.2 Iterated normal form game

3.3 Markov Decision Processes

3.4 Other formalizations

3.5 Summary

The general ad hoc teamwork problem was first introduced by Stone et al. [1] as the problem of cooperating with a team of agents without pre-coordination. This definition includes problems such as figuring out teammates' behaviors, environment rewards and planning. Regarding teammates' behaviors, they can either be known, unknown or one of a fixed set of known behaviors. Most work done so far formalizes the problem as an MAB [8, 9, 10], as a fully cooperative normal-form game [11, 12, 13, 14] or as an MDP [4, 15], with multiple agents where one is under our control (ad hoc) and the others are not (legacy). Some of the work is focused on the teammate identification problem, while other work focuses solely on the planning part of the problem. In what follows, we review the existing literature on ad hoc teamwork, organizing it according to which formalization is used to describe the problem.

3.1 Multi-armed bandits

In this setting, there is a finite set of actions that the ad hoc agent can pick, called arms, which are followed by the action(s) of the other agent(s). This is a very limited scenario as the environment does not change and the reward each one gets is simply the sum of each agent’s individual reward. The reward distribution can be stationary or changing in time, but all work presented in the following sections assume a stationary distribution. Stone et al. [8] were the first to formalize the ad hoc scenario in this manner, in which there are only 2 agents, one of them ad hoc, and the other one always chooses the action with the best empirical average reward, with all actions being observable to both sides. The challenge here lies in the fact that the other agent does not know the expected rewards of each action, while the ad hoc one does, leaving opportunity for our agent to communicate by showing which actions are best allowing the legacy agent to see their superior reward. The ad hoc agent then has to be able to balance exploiting the reward of superior actions only available to him and teaching the other agent the rewards of the other arms. They prove that the ad hoc agent should always select either the optimal arm which is not available to the legacy agent or choose the best arm that is available to the legacy agent, in order to teach it. They propose a dynamic programming algorithm running in polynomial time for the optimal decision sequence in finite horizons settings. While this work assumed a finite number of rounds, Barrett and Stone [9] later extend the theoretical results to discounted infinite horizons. Barrett et al. [10] considered a slightly different scenario, where there is one ad hoc agent and N − 1 legacy ones and they are allowed limited communication. They can communicate their last selected arm and payoff, the observed average reward for a specific arm and they can suggest others to select a specific arm. Regarding teammate behavior they consider different scenarios and analyze their results: known

teammate behavior, unknown but one of a finite set of possible behaviors, and one of a finite set of possible behaviors but with unknown continuous parameters (in practice this constitutes an infinite set). The problem can be modeled as an MDP and solved in polynomial time or, in the unknown continuous parameters case, as a Partially Observable Markov Decision Process (POMDP), for which an approximation can be computed in polynomial time. Solving the POMDP exactly is very computationally costly and infeasible in practice, so they use a Monte Carlo approximation. These first approaches to the ad hoc teamwork problem solve a very simplified version of the general problem, and while they are important foundations for the succeeding solutions, they are not practical enough to deploy in real scenarios on their own. It is easy to see that all of these algorithms are exponential in the number of agents.

3.2 Iterated normal form game

In this case there is still a finite set of actions for each agent but they all select their action simultaneously, receiving a joint reward. The environment does not change and the rewards are fixed and shared but are now deterministic. Brafman and Tennenholtz [16] looked into the ad hoc teamwork problem in a normal form game. They describe a setting with two agents, composed by a student and a teacher. It is not possible to directly control the student, but using the teacher actions we can influence it. They assume the student behavior is known and can either be based on Q-learning or follow some hand-coded policy. They show that the teacher can choose actions to punish or reward the student for its behavior, so the student may learn a better policy. The main difference between this early paper and those described in this thesis is that Brafman and Tennenholtz do not assume a fully cooperative game, and instead focus on non-cooperative games such as prisoner’s dilemma. Stone et al. [17] introduced this formalization for the ad hoc teamwork problem when tackling the following problem: a 2-agent team, one ad hoc and one legacy which follows a -greedy bounded- memory best-response policy, try to maximize their joint reward. Only the ad hoc agent knows the reward matrix. It might be tempting to make the ad hoc agent simply choose the action which maximizes the joint reward and the legacy agent will eventually choose that one too. However this is not the optimal behavior (even when memory = 1) because if the ad hoc agent moves gradually through the matrix it can get higher rewards until it gets to the optimal joint action. Computing the optimal path of actions to choose can be done with a dynamic programming algorithm shown in [17] when memory=1. If the legacy agent has a longer memory than that, the algorithm proposed is exponential in the memory size. Later, Agmon and Stone [12] studied this setting for more than 1 legacy agent. With many legacy agents, all of them best-response, the optimal joint action might be unreachable. Sometimes it is impos-

to remain in a single joint action, so the goal becomes to find the optimal reachable steady cycle of actions and the optimal path to reach that cycle. To solve this, they describe the problem as a graph where each state is a possible joint action and there is an edge for each possible ad hoc agent action that connects the joint action at time t − 1 and the joint action at time t, with the cost of the joint action at time t. For memory = 1, these edges are easily calculated, as we can always fully determine the action of each agent at time t if we know the joint action at time t − 1. Having this graph, finding the optimal steady cycle corresponds to finding the minimum cycle mean, and we can find the best path leading there by using Dijkstra's algorithm for each node in the cycle. Both these algorithms are polynomial in the number of nodes and edges, which is polynomial in the action space but exponential in the number of agents. For memory > 1, each node must contain information about the last memory joint actions, and therefore the number of nodes grows exponentially with the memory size. This algorithm also naturally extends to cases where there is more than one ad hoc agent.

In a step towards the more general problem, Chakraborty and Stone [14] relaxed the assumption that the legacy agent has memory-bounded best-response behavior and focused on 2-agent teams where the legacy agent has a Markovian behavior. A set of features is Markovian if its value at time t depends only on its collective value and the joint action at time t − 1. A teammate is said to be Markovian if its action at each step depends only on the current value of a set of Markovian features. The work considers only discrete features. This is a much more general assumption than best-response agents, as the features can contain (summarized) information about the whole joint action history. For example, the memory-bounded best-response agent is Markovian, with as many features as the memory size, where feature i is the joint action at time t − i. They further assume that these features, and how to compute them, are known and can be ordered by relevance. To solve the problem, K MDPs are constructed, where the kth MDP considers only the k most relevant features. A state of the MDP is a possible value for the features, while the actions and rewards are the ones described in the game matrix. Since we do not know the policy of the legacy agent in general, we cannot know the transition function of the MDP, and therefore RL is used. To choose the best MDP, they compute the difference between the policies produced by each and choose the first one (with the fewest features) that is close enough to the following ones.

So far, all work has focused either on planning the optimal behavior when we know the teammates' behaviors and the task (environment rewards), or on finding out the teammates' behaviors and planning accordingly, still aware of the environment rewards. Melo and Sardinha [11] focus instead on finding out the task and, to some extent, the teammates' behaviors, while planning accordingly. Not knowing the task leads to a much more challenging setting, as we now have to find out the environment rewards and the teammates' policies, and choose the optimal action, all without having control of our teammates. They assume the teammates can be thought of as a single meta-agent that behaves as a memory-bounded

best-response agent. Note, however, that in general there are multiple best responses, and therefore we still need to identify their behavior to predict which of these actions the teammate chooses. They first cast the problem as an online learning problem, where the loss function is 0 if the agent successfully guesses the meta-agent's action and 1 otherwise, and use an exponentially weighted forecaster¹ to solve it. Another approach is also taken, where the authors cast the problem as a POMDP whose states include information about the history of actions and the real unknown task. A theoretical analysis is performed which bounds the maximal regret of both approaches. The solutions above are already quite interesting for simple scenarios. Regarding scalability as team size grows, the four solutions have different properties. Both [17] and [14] only deal with teams of two agents, and therefore we cannot analyze their scalability with respect to team size. The approach of [12] builds an MDP with a number of states that grows exponentially with the number of agents, making the algorithm itself also exponential with respect to the team size. Melo and Sardinha's [11] solution also scales exponentially when solved with the POMDP, but is tractable when using online learning; however, it does not perform as well. Analyzing the task size is the same as analyzing the number of actions for this formalization, and all proposed algorithms scale polynomially in this direction.

3.3 Markov Decision Processes

So far we have seen static environments, where the reward for each joint action is either fixed and constant or has a stationary distribution. For real world applications, we need more expressiveness, as there is only a small subset of problems that can be effectively modeled with either an MAB or a fully cooperative iterated game. In this section we study work that attempts to solve the ad hoc teamwork problem by modeling it as an MDP. Modeling it as an MDP means we know a finite set of states S, a reward function R(s, a) and a transition function P(s' | s, a) between states. The action space, A, includes all available joint actions. Note that if we do not know the other agents' behaviors, then we cannot predict the reward value and the transition probabilities, as these depend on the joint action and not only on our agent's action; this is the main challenge introduced by the ad hoc teamwork problem. Barrett et al. [2] created two algorithms for ad hoc teamwork, formalizing the problem as an MMDP – PLASTIC-Model [19] and PLASTIC-Policy [15]. PLASTIC-Model was demonstrated on a specific domain, the same as in our work - the pursuit domain. In this problem, we have four predators living in a grid world that need to catch a prey, and they do so if there is an agent on each side of the prey, blocking all movement. At each timestep, agents can move in any of the four directions - up, down, left, right - and the prey also moves randomly in one of the directions. There cannot be two agents in the same cell, and so

¹ An exponentially weighted forecaster is a probabilistic algorithm used to solve sequential prediction problems [18].

movement will be blocked for one of them if they try to move to the same cell. The goal is to catch the prey as fast as possible. This domain is suitable for studying teamwork problems because all four agents need to cooperate to achieve good results. Assuming all these dynamics are known and translated into an MDP, if we knew every agent's behavior, the ad hoc agent could solve for its optimal policy using Value Iteration. If we do not know the other agents' behavior, the problem becomes more complicated. Let us say there is a small set of possible behaviors and we are uncertain which one our teammates are using. The solution proposed to find the most likely behavior is to use Bayes' theorem to update the agent's belief:

$$P(\text{model} \mid \text{actions}) = \frac{P(\text{actions} \mid \text{model}) \times P(\text{model})}{P(\text{actions})} \tag{3.1}$$

P(actions | model) can be computed if we know the possible behaviors, and P(model) is a prior belief which is initially set to uniform. P(actions) can be treated as a normalizing constant. If we know the probability distribution over teammate behaviors, we can again use Value Iteration to compute the optimal policy. However, this requires recomputing it each time the distribution changes, which is very expensive. A possible approach is to use MCTS to approximate the exact solution and sample teammates according to the behavior distribution. What if teammates behave in a way that is not in the fixed set of known behaviors? Barrett et al. [19] also deal with this by casting it as a supervised learning problem. They assume that the mapping state → action (also known as the policy) does not change over time and treat it as a supervised learning problem where the agent initially has no samples whatsoever. For this, a decision tree (C4.5 algorithm) is used. As time moves on, the ad hoc agent observes its teammates' actions and retrains the model. By assuming a homogeneous team, there are 3 new samples every timestep to train the decision tree. The input features used by the decision tree algorithm include the x, y coordinates of the prey and teammates, the previous 2 actions and occupation information about the neighboring cells. If we were to use this for another domain, we would need to pick suitable features for that domain. After training with the new team, their behavior is added to the fixed set of known behaviors, in case the ad hoc agent encounters them again in the future. Later, Barrett and Stone [15] extended this algorithm to a much more complex domain, HFO, introduced by Kalyanakrishnan et al. [20]. They call it PLASTIC-Policy. In this problem, a team of 4 offense agents plays soccer against a team of 5 defense agents. The offensive team is trying to score a goal without letting the defense intercept the ball. If it succeeds, the offensive team wins; if the ball gets out of the offensive half of the field or is intercepted by the defense, the defense wins. There is also a timeout of 50 seconds and the defense team wins if the game lasts that long.

In such a complex domain it might be hard to learn the environment dynamics, and so the authors opt for a model-free approach where the policy for the ad hoc agent is directly learned without using models of the environment and teammates. They create a behavior model for the team in order to identify it in future episodes. It is a nearest neighbor model which maps states s to next states s'. To pick a policy, the ad hoc agent trains a Fitted Q-iteration (FQI) algorithm for each team, which lets it handle continuous states and directly compute the Q-value of each state–action pair. By playing some exploratory episodes with different teams, the ad hoc agent can save the FQI model and the nearest neighbor model it found for each team, in order to use them in future events. When the ad hoc agent joins a new team, it must determine which past team has the most similar behavior to the current one. They employ an algorithm similar to their previous work – maintain a belief over all learned teams and play a policy according to the most likely one. Instead of using Bayes' theorem to update beliefs, it uses the polynomial weights rule from regret minimization, as the Bayes update rule would drop probabilities to 0 for a single wrong prediction. If the agent joins a team it has not seen before, it is unable to adapt to it and will simply choose the most similar team it played with in the exploratory phase. For these two algorithms it is harder to analyze scalability, mainly because they use approximate methods such as MCTS. Their execution time is strongly tied to the approximation error: as the complexity increases, more computation time is needed to maintain a good score.
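The belief update in Equation (3.1) is straightforward to write down. The sketch below is a generic illustration of that update, not Barrett et al.'s implementation; it assumes each candidate behavior model exposes a hypothetical `prob(state, action)` method returning the likelihood it assigns to an observed teammate action.

```python
import numpy as np

def update_beliefs(beliefs, models, state, observed_action):
    """One Bayes update of P(model | actions), as in Equation (3.1).

    beliefs: prior probability of each candidate model (sums to 1).
    models:  objects with prob(state, action) -> likelihood of the observation.
    """
    likelihoods = np.array([m.prob(state, observed_action) for m in models])
    posterior = likelihoods * np.asarray(beliefs, dtype=float)  # P(actions | model) P(model)
    total = posterior.sum()                                     # P(actions): normalizing constant
    if total == 0.0:                                            # every model ruled out: keep the prior
        return np.asarray(beliefs, dtype=float)
    return posterior / total
```

Note that a single zero likelihood permanently eliminates a model under this multiplicative rule, which is precisely why PLASTIC-Policy replaces it with a polynomial-weights update.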

3.4 Other formalizations

Some of the work in this area does not fit into the three categories above – either it is specific to a single domain, or it applies to all domains, or it uses a novel formalization of the problem. Bowling and McCracken [21] study ad hoc teamwork in robot soccer. They assume that the team plays using a playbook. This playbook contains plays specifying the roles needed to execute them and how to execute them. The ad hoc agent needs to select a play and choose its own role within that play, without communicating with its team. After playing hundreds of games, the ad hoc agent can learn how the team performs role assignment and correctly predict which role it should take in each play. Genter and Stone [22] focus on the problem of leading a flock of agents, whose behavior is assumed to follow the Boid model, in order to get them into a specific orientation. They propose an algorithm to coordinate multiple ad hoc agents to influence the remainder of the flock towards a desired goal. These two papers focus on very specific domains and do not generalize to other domains, whereas we are concerned with a wide range of domains. Liemhetcharat and Veloso [23] deal with the problem of ad hoc team formation, where a group of agents that are unaware of each other's behaviors, and that have different capabilities, must cooperate with each other as a team. While this paper focuses on selecting teammates, the ad hoc teamwork problem is

more about cooperating with them after they have been selected. Albrecht and Ramamoorthy [24] model the ad hoc teamwork problem as a Bayesian Stochastic Game (BSG). This concept is quite similar to an MDP, where there are multiple states, a transition function between states and a policy that each agent follows. This policy depends on the history of joint actions and on the type of the agent. The difference between BSGs and MDPs is this type that each agent has. A type can be seen as the role of the agent, and it can change over time. They show an algorithm to compute a sequence of actions if the behaviors of the teammates are known. If these are unknown, then they propose a "conceptualized type" where the user defines a function that measures similarity between states, along with a radius, and the behavior is learned from those and adapted as time goes on.

Table 3.1: Summary of approaches that tackle the ad hoc teamwork problem

Reference | Domain | Number of teammates | Teammate behavior       | Task
[8]       | MAB    | 1                   | greedy                  | known
[9]       | MAB    | 1                   | greedy                  | known
[10]      | MAB    | ≥ 1                 | one of a known set      | known
[17]      | NFG    | 1                   | ε-greedy best-response  | known
[12]      | NFG    | ≥ 1                 | best-response           | known
[14]      | NFG    | 1                   | Markovian               | known
[11]      | NFG    | ≥ 1                 | best-response           | unknown
[19]      | MDP    | ≥ 1                 | unknown                 | known
[15]      | MDP    | ≥ 1                 | unknown                 | unknown
[24]      | BSG    | ≥ 1                 | one of a known set      | known

3.5 Summary

Table 3.1 summarizes some solutions to the ad hoc teamwork problem. Since most authors of these papers were not concerned with scalability, it is hard to report it here. However, it is easy to see that all exact optimal algorithms scale exponentially with the number of agents and polynomially with the number of available actions. For approximate algorithms it is harder to see whether they scale well, as it depends on the approximation error. Out of all the approaches, PLASTIC-Model [4] is the one whose objectives are most similar to ours, as it explicitly includes planning and models the domain as an MDP. However, it still has strong assumptions – such as a perfect model of the domain and pre-existing models of various teammates. In some scenarios it is not realistic to assume we have access to this knowledge. Our approach thus goes

beyond PLASTIC-Model by considering that both the teammate behaviors and the environment dynamics are unknown.

4 Solution

Contents

4.1 Problem formalization

4.2 3-layered architecture for ad-hoc teamwork

4.3 Learning teammate's policies with a deep neural network

4.4 Task identification with model-based reinforcement learning

4.5 MCTS planning

4.1 Problem formalization

We model the problem as an MMDP M = (N + 1, X, A, {P_a}, r, γ), where

• N + 1 is the number of agents considered. We assume throughout that the ad hoc agent is agent 0 and there are N teammates.

• X is the (countable) set of possible states of the environment. The state at time step t is denoted as the random variable x(t) (we use upright letters to denote random variables) taking values in X .

• A is the joint action space. In other words, A is the Cartesian product of N + 1 individual action sets, A_0, ..., A_N. Each A_n is the individual action set of agent n.

We write a_n to refer to an element of A_n, and a to refer to an element of A, understood as a vector a = (a_0, a_1, ..., a_N).

We write a_{-0} to denote a reduced joint action a_{-0} = (a_1, ..., a_N), obtained from a joint action a by removing the component corresponding to the action of agent 0.

The joint action at time step t is denoted as a random vector a(t) ∈ A, and we write a_n(t) to denote the nth component of a(t), corresponding to the individual action of agent n at time step t.

• P_a denotes the transition probabilities associated with joint action a. We write, interchangeably, P_a(y | x) and P(y | x, a) to denote the transition probability

P_a(y | x) = P(y | x, a) = P[x(t + 1) = y | x(t) = x, a(t) = a].

• r(x, a) is the expected reward function representing the target task.

• γ is a constant in [0, 1), denoting the discount.

The difference when solving the MMDP in the ad hoc teamwork setting is that we only control one of the agents, and therefore only one action is provided at each timestep. Also, the transition function P_a is not known to the ad hoc agent, which constitutes another obstacle to directly solving the MMDP. Since we cannot control the other agents' actions, we can also see this problem as an MDP whose transition function is a combination of the teammates' behaviors and the environment dynamics. Denoting by a_{-0} the joint action of the legacy agents and by a_0 the ad hoc agent's action, and defining

π(x, a_{-0}) ∈ [0, 1] as the teammate behavior function (the probability that the teammates jointly choose a_{-0} in state x), this MDP transition function would simply be:

$$P^{\text{MDP}}_{a_0}(x(t+1) \mid x(t)) = \sum_{a_{-0} \in \mathcal{A}_{-0}} \pi(x(t), a_{-0}) \, P^{\text{MMDP}}_{(a_0, a_{-0})}(x(t+1) \mid x(t)) \tag{4.1}$$

Similarly, the reward function would be:

$$r^{\text{MDP}}(x(t), a_0) = \sum_{a_{-0} \in \mathcal{A}_{-0}} \pi(x(t), a_{-0}) \, r^{\text{MMDP}}(x(t), (a_0, a_{-0})) \tag{4.2}$$

This representation of the problem is much more interesting and allows us to focus on a single action - the ad hoc agent's. So, the new MDP can be described as the tuple (X, A_0, P^{MDP}_{a_0}, r^{MDP}, γ), where:

• X is the same state space as the MMDP described previously

• A_0 is the action space of the ad hoc agent

• P^{MDP}_{a_0} is the transition function as computed in Equation 4.1, which is simply the expected transition function for a given teammate behavior.

• r^{MDP}(x, a_0) is, similarly to the transition function, the expected reward for a given teammate behavior, given by Equation 4.2

• γ is the same discount factor as in the MMDP

It is clear that if we knew the legacy teammates' behaviors and the dynamics of the environment, we could solve the MDP with the classical value iteration solution. It is also tempting to simply use Q-learning to learn a policy without knowing the transition function P^{MDP}. We show in the Results section that this approach takes a lot of iterations to converge to an acceptable performance, partly due to the huge state-action space of the domain used for testing.
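Equations (4.1) and (4.2) are a marginalization of the MMDP over the teammates' joint action. The sketch below shows this in tabular form, assuming dense arrays with the shapes given in the comments; this representation is only for illustration, since in the rest of this chapter these quantities are provided by learned neural networks rather than tables.

```python
import numpy as np

def induced_mdp(pi, P_mmdp, r_mmdp):
    """Marginalise teammate actions out of the MMDP (Equations 4.1 and 4.2).

    pi[x, j]            : probability that the teammates jointly play a_{-0} = j in state x
    P_mmdp[x, a0, j, y] : MMDP transition probability for joint action (a0, j)
    r_mmdp[x, a0, j]    : MMDP expected reward for joint action (a0, j)
    Returns P_mdp[x, a0, y] and r_mdp[x, a0] for the single-agent MDP.
    """
    P_mdp = np.einsum('xj,xajy->xay', pi, P_mmdp)
    r_mdp = np.einsum('xj,xaj->xa', pi, r_mmdp)
    return P_mdp, r_mdp
```

Together with value iteration, this is the "known models" solution that Section 4.5 approximates with MCTS once the models are learned and the state space becomes large.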

4.2 3-layered architecture for ad-hoc teamwork

Our solution can be segmented into three parts that address the three distinct challenges described in the previous section. These three parts are: model-based reinforcement learning using a deep neural network to model the environment, a fictitious-play-like model to learn the teammates' policies, and MCTS planning to select the optimal action given the previous models. In this section we go into detail about these segments and how they interact to provide a solution to the ad hoc teamwork problem. The architecture of the proposed solution can be seen in Figure 4.1. The different boxes represent functions developed in this work. At every timestep the ad hoc agent observes the state and reward given by the environment. It uses the state, along with the previous state and previous joint action, to

update the environment model, that is, the approximation of the function P(x(t + 1) | x(t), a). It uses the previous state and joint action to update the teammates model, that is, the approximation of their policy, π(x, a_{-0}). These two models are used to compute a solution to the underlying MDP and then select the ad hoc agent's action, which will act on the environment along with its teammates' actions.

Figure 4.1: Architecture of proposed solution

The various functions will be modeled as follows:

• Environment model – A neural network receiving as input a state and joint action and outputting the most likely next state.

• Teammate behavior model – A neural network receiving as input a state and outputting a prediction of the distribution of the teammates' actions.

• Action selection – MCTS receiving as input the models and the current state, and outputting the best action for the ad hoc agent. For the rollouts, it uses the teammate behavior model to predict the teammates' joint action, and then uses the environment model and the predicted joint action to predict the next state.

The environment is assumed to be the same across episodes, and therefore the environment model is kept even if the team changes. The teammates' model needs to be retrained after the team changes, but previous models can be used as an initialization for new teams.
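To make the interaction in Figure 4.1 explicit, here is a high-level sketch of the ad hoc agent's control loop. The interface names (`reset`, `step`, `update`, `search`) are assumptions made for illustration and do not correspond to a concrete implementation in the thesis.

```python
def run_episode(env, env_model, teammate_model, planner, max_steps=500):
    """One episode of the proposed architecture (Figure 4.1).

    env_model      : learns P(x(t+1) | x(t), a) from observed transitions
    teammate_model : learns pi(x, a_{-0}) from observed teammate actions
    planner        : MCTS that queries both models to pick the ad hoc action
    """
    state = env.reset()
    prev_state, prev_joint_action = None, None
    for _ in range(max_steps):
        if prev_state is not None:
            # Update both learned models with the latest observed transition.
            env_model.update(prev_state, prev_joint_action, state)
            teammate_model.update(prev_state, prev_joint_action[1:])  # teammates only
        action = planner.search(state, env_model, teammate_model)     # MCTS planning
        next_state, reward, done, teammate_actions = env.step(action)
        prev_state, prev_joint_action = state, (action, *teammate_actions)
        state = next_state
        if done:
            break
```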

29 4.3 Learning teammate’s policies with a deep neural network

One of the challenges of the ad hoc teamwork setting is to predict the behavior of the legacy teammates, that is, the function π(x, a_{-0}). We define F(x, i) as a set of features for a given state x as seen from agent i. In this work, we make the following assumptions:

• Teammates have a static policy – meaning the action distribution they choose from depends only on the current state. This implies they do not learn over time.

• Actions chosen are visible to the ad hoc agent

• Behaviors are homogeneous among the teammates – meaning they all follow the same policy, assuming states are relative to each agent. This means F(x_1, i) = F(x_2, j) ⇒ π(x_1, a_i) = π(x_2, a_j)

These assumptions allow us to model the teammate policy using the fictitious play algorithm. Using fictitious play, the predicted policy is given by the following equation:

$$\hat{\pi}(x, a_{-0}) = \frac{N(x, a_{-0})}{\sum_{a'_{-0} \in \mathcal{A}_{-0}} N(x, a'_{-0})} \tag{4.3}$$

where N(x, a_{-0}) represents how many times the team played the joint action a_{-0} in state x in the past. A problem with this approach is that it requires lots of state-action visits if the domain has a sizable state-action space. One way to improve this is to leverage the similarity between different state-action pairs. We propose looking at this as a supervised learning problem, where the input is the state (or the state features as seen from one teammate, F(x, i)) and the output is the action distribution of a single teammate. Samples to train this model are collected as the ad hoc agent navigates through the environment and observes the other agents' actions. We chose a neural network as the teammate behavior model, with the input being state features and an output for every action, estimating the probability that a given legacy teammate chooses that action in the given state. Since in the domain considered in this thesis every teammate has the same policy, we can use the same neural network to predict each agent's actions, and train it with all observed actions. Using a neural network to model the teammate policy allows us to generalize the behavior of the team in a given state to similar states, using state features to measure this similarity. This works in huge domains where mapping every state-action pair to a probability in a non-parametric way, for example when using fictitious play, would be intractable as the available memory is limited.
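As an illustration of how the teammate behavior model can be realized, the sketch below uses PyTorch to map state features to a distribution over the four pursuit actions. The network size, optimizer and training-loop details are assumptions; the thesis only specifies that a neural network is trained on the observed (state features, action) pairs.

```python
import torch
import torch.nn as nn

class TeammatePolicyNet(nn.Module):
    """Maps state features F(x, i) to a distribution over a teammate's actions."""
    def __init__(self, n_features, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # logits; softmax is applied by the loss
        )

    def forward(self, features):
        return self.net(features)

def train_step(model, optimizer, features, observed_actions):
    """One supervised update from observed (state features, teammate action) pairs."""
    logits = model(features)
    loss = nn.functional.cross_entropy(logits, observed_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_action_distribution(model, features):
    """Estimated pi(x, a): probability of each action given the state features."""
    with torch.no_grad():
        return torch.softmax(model(features), dim=-1)
```

Because the team is assumed homogeneous, every teammate's observed action in a state becomes a training example for this single network.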

4.4 Task identification with model-based reinforcement learning

Task identification can be split into two other tasks: learning the task goal (the reward function) and the environment dynamics (the transition function). Environment dynamics are described by the distribution

P_a(y | x) for every a, x, y, where a denotes the joint action, x the current state and y the next state. To learn the dynamics of the environment, a feedforward neural network is used, as seen in Figure 5.3, similarly to the teammate behavior. In this case, however, we do not have an output node for each possible state, as it would require a huge number of nodes. Instead, we assume the state can be fully described by a reasonable number of features, denoted F(x), for which a distance metric is defined. In particular, state transitions can be fully described in terms of a translation in feature space. If F(x(t)) is a feature vector that describes the state at time step t, then the transition from state x(t) to state x(t + 1) can be fully described by the translation vector F(x(t + 1)) − F(x(t)). For example, in a grid world we could have a set of features encoding the position of each object. The output nodes of the neural network are the difference between the features of the next state and the current state, that is, F(x(t + 1)) − F(x(t)). By using this change of variables, the neural network only needs to output a smaller range of values, making training considerably easier, as validated by our experiments. The input nodes are the features that describe the current state, F(x(t)), along with the actions of all agents, a. Actions are described as one-hot encoded vectors¹. Ideally, the model would output a distribution over all possible next states. However, this would require an output node for every possible state, which is intractable for many domains (e.g. the pursuit domain). Instead, the model predicts the most likely next state given the current state and the actions taken. By using an environment model, we can reuse this information across different teams. That is, even though the teammate behavior changes between episodes, the environment dynamics remain the same, allowing the agent to improve its policy even with changing teams.
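The environment model can follow the same pattern, regressing the feature translation F(x(t + 1)) − F(x(t)) from the current features and the one-hot encoded joint action. Again, the sketch below (layer sizes, a mean-squared-error training target) is a hedged illustration, not the exact network of Figure 5.3.

```python
import torch
import torch.nn as nn

class EnvironmentNet(nn.Module):
    """Predicts F(x(t+1)) - F(x(t)) from F(x(t)) and the one-hot joint action."""
    def __init__(self, n_features, n_agents, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features + n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),   # translation in feature space
        )

    def forward(self, features, joint_action_onehot):
        return self.net(torch.cat([features, joint_action_onehot], dim=-1))

def predict_next_features(model, features, joint_action_onehot):
    """Most likely next state, expressed in feature space."""
    with torch.no_grad():
        return features + model(features, joint_action_onehot)
```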

4.5 MCTS planning

If the agent knew both the teammate behavior and the environment model, we could compute PMDP as described in Equation 4.1 and solve it using Value Iteration. However, we are interested in problems with large state spaces, for which Value Iteration would take too long. We therefore use MCTS to plan the ad hoc agent's actions, as it provides a good approximation of the optimal policy. Given the teammate behavior model and the environment model for the MMDP, it is possible to compute the environment dynamics of the resulting MDP. Of course, the resulting plan will not be optimal, since both the teammate behavior model and the environment model are approximations and planning time is limited. An example of the expansion step of MCTS is shown in Figure 4.2. In this step, we use the model described in Section 4.3 to predict which actions the ad hoc agent's teammates will take in a given state.

¹ A one-hot encoded vector is a vector where each entry is associated with a possible value of the set we want to describe. For a given value a in the set, all entries are 0 except the entry corresponding to a, which is 1.

Figure 4.2: An example of the expansion step of MCTS planning using previously learned models of both behavior and environment. (D, U, L) represents an example of actions by the three teammates and R represents an action by the ad hoc agent.

After we have the predicted actions, we concatenate them with the ad hoc agent's action to build a joint action. This joint action, together with the given state, is fed into the environment model described in Section 4.4, which outputs the most likely next state, concluding the expansion step of MCTS. The same is done in the simulation step, where we need to generate the next state given the current state and a random action from the ad hoc agent.
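To make the planning loop concrete, the following is a compact UCT sketch under assumed interfaces: teammate_model.predict(state) returns the predicted teammate actions as a tuple, and env_model.predict(state, joint_action) returns the most likely next state together with its reward (in our setting the reward can be read off the predicted state; folding it into the model call just keeps the sketch short). It is an illustration of the technique, not the exact implementation used in this thesis; the constants match Section 5.3.

import math
import random

class Node:
    def __init__(self):
        self.visits = 0
        self.action_visits = {}   # ad hoc action -> count
        self.action_value = {}    # ad hoc action -> mean return
        self.children = {}        # ad hoc action -> Node

def ucb1(node, action, c=10.0):
    """UCB1 tree policy: exploit the mean value, explore rarely tried actions."""
    n = node.action_visits.get(action, 0)
    if n == 0:
        return float("inf")
    return node.action_value[action] + c * math.sqrt(math.log(node.visits) / n)

def generative_step(state, adhoc_action, teammate_model, env_model):
    # Expansion: predicted teammate actions plus the ad hoc action form the
    # joint action that is fed to the learned environment model.
    joint_action = teammate_model.predict(state) + (adhoc_action,)
    return env_model.predict(state, joint_action)   # -> (next_state, reward)

def rollout(state, actions, teammate_model, env_model, depth):
    """Random, undiscounted rollout of bounded depth (simulation step)."""
    ret = 0.0
    for _ in range(depth):
        state, reward = generative_step(state, random.choice(actions),
                                        teammate_model, env_model)
        ret += reward
    return ret

def simulate(node, state, actions, teammate_model, env_model, depth):
    if depth == 0:
        return 0.0
    if node.visits == 0:                      # leaf: evaluate with a random rollout
        node.visits = 1
        return rollout(state, actions, teammate_model, env_model, depth)
    a = max(actions, key=lambda act: ucb1(node, act))
    next_state, reward = generative_step(state, a, teammate_model, env_model)
    child = node.children.setdefault(a, Node())
    value = reward + simulate(child, next_state, actions,
                              teammate_model, env_model, depth - 1)
    node.visits += 1
    n = node.action_visits.get(a, 0) + 1
    node.action_visits[a] = n
    q = node.action_value.get(a, 0.0)
    node.action_value[a] = q + (value - q) / n
    return value

def uct_plan(root_state, actions, teammate_model, env_model,
             playouts=1000, depth=10):
    root = Node()
    root.visits = 1                           # treat the root as already expanded
    for _ in range(playouts):
        simulate(root, root_state, actions, teammate_model, env_model, depth)
    return max(actions, key=lambda a: root.action_value.get(a, float("-inf")))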

5 Evaluation

Contents

5.1 Pursuit domain...... 37

5.2 System layout...... 38

5.3 Setup...... 40

5.4 Results...... 42

Figure 5.1: An example of a capture position. The prey is represented as a green circle and the predators as red circles. In this case, the prey is unable to move in any direction, ending the game, and all predators earn a reward.

5.1 Pursuit domain

We evaluate our work in the pursuit domain, similarly to Barrett et al. [4], which is widely used in multiagent reinforcement learning. In the pursuit domain, there is an N × N toroidal grid world where four predators, starting in random positions, chase a prey. At each timestep, the prey chooses a random action, either Up, Down, Right or Left, and tries to execute it. If the destination cell is blocked, the move fails and the prey remains in the same cell. If the prey is surrounded on all four sides, it is said to be captured and the episode ends – an example can be seen in Figure 5.1. Similarly, the predators choose one of the four actions at each timestep with the goal of surrounding the prey. If two agents try to move to the same cell, the one who succeeds is chosen at random and the other remains in its cell. Since the world is toroidal, if an agent is in the top-left corner (0, 0) and moves left, it goes to the top-right corner (N − 1, 0). Similarly, if the prey is in the top-left corner, then it is blocked if and only if all four adjacent positions – (0, 1), (1, 0), (0, N − 1), (N − 1, 0) – are taken by the predators. This domain is well suited for the ad hoc teamwork problem, for the following reasons:

• Fully cooperative game, that is, all agents receive the same reward at every timestep

• All agents are needed to achieve the goal, so the ad hoc agent cannot play independently of its team

• The optimal strategy of one agent depends heavily on the strategies of the other agents. For example, if the team avoids collisions by giving priority to the agent farthest from the prey, an agent following the opposite strategy will cause many collisions.

Also, the domain is simple enough to be modeled by an MMDP but complex enough that it is intractable to solve exactly. For an N × N world with four predators and one prey, it has approximately N^10 different configurations if we consider each agent to be distinct from its teammates.

Formally, we define this domain as the MMDP ML = (4, X, A, {Pa}, r, γ), where:

• L is the height and width of the grid

• X = {(x0, y0, . . . , x4, y4) : 0 ≤ xi, yi < L for all 0 ≤ i ≤ 4}, where each pair (xi, yi) represents the position of one agent

• A = {(a0, a1, a2, a3) : ai ∈ {U, D, L, R} for all 0 ≤ i ≤ 3}, where U, D, L and R represent the actions of moving Up, Down, Left and Right

• r(x) = 0 if the prey is blocked in x, and r(x) = −1 otherwise

The transition function Pa is defined by the movements of the agents. When an agent moves in a direction, the move succeeds with probability 1 if the destination cell is empty and no other agent tried to move into it. Otherwise, a collision occurs: one of the agents involved, selected at random, moves, while the others remain in their cells. The prey always chooses each direction with probability 0.25.
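A minimal sketch of these dynamics is given below, assuming positions are stored as (x, y) tuples with the prey last. The sequential, randomly ordered collision resolution is a simplification of the rules above, intended only to illustrate the toroidal movement, the capture test and the reward.

import random

MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def prey_blocked(positions, L):
    """Capture test: all four cells adjacent to the prey hold predators."""
    px, py = positions[4]
    neighbours = {((px + dx) % L, (py + dy) % L) for dx, dy in MOVES.values()}
    return neighbours <= set(positions[:4])

def step(positions, predator_actions, L):
    """One timestep: returns (next positions, reward). Reward is 0 on capture
    and -1 otherwise. Collision handling is a simplified approximation: agents
    are processed in random order and move only if their target cell is free."""
    actions = list(predator_actions) + [random.choice("UDLR")]   # prey acts at random
    targets = []
    for (x, y), a in zip(positions, actions):
        dx, dy = MOVES[a]
        targets.append(((x + dx) % L, (y + dy) % L))             # toroidal wrap-around

    next_positions = list(positions)
    occupied = set(positions)
    order = list(range(len(positions)))
    random.shuffle(order)
    for i in order:
        if targets[i] not in occupied:
            occupied.discard(positions[i])
            occupied.add(targets[i])
            next_positions[i] = targets[i]

    reward = 0 if prey_blocked(next_positions, L) else -1
    return next_positions, reward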

5.2 System layout

The architecture of the neural network for the teammates' behavior can be seen in Figure 5.2. It is composed of two hidden layers, each with 128 nodes. Its input is the set of state features relative to teammate i, and it outputs the predicted distribution over teammate i's actions. A ReLU activation function is used between all layers except the last one, where a softmax is used instead. During training, the neural network is optimized with the ADAM [7] optimizer and a cross-entropy loss function.

The architecture of the neural network for the environment is described in Figure 5.3. It is also composed of two hidden layers; its input layer contains the current state features (along with the joint action, as described in Section 4.4) and it outputs the predicted difference between the current state features and the next state features. The activation function is ReLU for every layer except the last, which uses no activation. We optimize, using ADAM [7], the loss function ||F(x(t + 1)) − F̂(x(t + 1))||², which, denoting the network output by o, corresponds to ||(F(x(t + 1)) − F(x(t))) − o||².

We use UCT, a variant of MCTS whose selection step uses a tree policy to balance exploration and exploitation. For the simulation step, we use random rollouts capped at a depth of 10 steps; after some experimentation, this value proved to be a good tradeoff between final score and computational cost.
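For concreteness, the two networks could be defined as in the PyTorch sketch below, using the feature dimensions of Sections 5.3.1 and 5.3.2. The hidden width of the environment network, as well as details such as batching and data handling, are assumptions, since the text does not fix them; the behavior network is kept as logits, with the softmax applied at prediction time, which is the standard way to pair it with a cross-entropy loss.

import torch
import torch.nn as nn

behavior_net = nn.Sequential(          # Figure 5.2: two hidden layers of 128 units
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),                 # action logits; softmax applied when predicting
)
behavior_loss = nn.CrossEntropyLoss()  # cross-entropy against the observed actions
behavior_opt = torch.optim.Adam(behavior_net.parameters())

STATE_FEATURES = 10                    # Section 5.3.2
JOINT_ACTION_DIM = 4 * 4               # four agents, one-hot over four actions
environment_net = nn.Sequential(       # Figure 5.3: hidden width of 128 is an assumption
    nn.Linear(STATE_FEATURES + JOINT_ACTION_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, STATE_FEATURES),    # predicted F(x(t+1)) - F(x(t)), no activation
)
environment_loss = nn.MSELoss()        # ||(F(x(t+1)) - F(x(t))) - o||^2
environment_opt = torch.optim.Adam(environment_net.parameters())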

Figure 5.2: Architecture of teammate behavior neural network

Figure 5.3: Architecture of environment transition neural network

5.3 Setup

We evaluate our work using the framework described in [4]. In this framework, a team of agents plays a game in a given domain, obtaining a score s. Then one of the agents is replaced by an ad hoc agent and the game is played again, obtaining a score s′. By using different ad hoc agents in each game, we can compare their scores against each other (a sketch of this evaluation loop is given after the list of baselines below). Our score corresponds to the number of steps the team takes to capture the prey, so lower is better. All results shown in the following sections are averages over 100 runs with different initial states. The initial state heavily affects the final score, so the score has high variance, but we found that 100 runs give a small enough confidence interval for the average score. All values are reported along with a 90% confidence interval.

We use two different kinds of teammates, each with a handcoded policy detailed in the following sections. For the ad hoc agent, we use an ε-greedy policy with ε = 0.9, and after the 1st, 50th and 200th episodes we evaluate the ad hoc agent in 100 different initial states, with ε = 0, and plot the average. The ad hoc agent updates its models after each episode by running all observations from the last episode through the neural networks, for a single epoch. For UCT, both in our solution and in PLASTIC-Model, we use 1000 playouts and a maximum rollout depth of 10. For the exploration constant c, we use 10.0, which seemed to be a good tradeoff between exploration and exploitation in our experiments. We compare our results against the following:

• All agents follow the handcoded policy; no ad hoc agent is present.

• PLASTIC-Model – PLASTIC-Model with a perfect model of the teammates' policies and of the task. This reduces to a planning problem, for which a variant of UCT is used.

• PLASTIC-Model(NN) – PLASTIC-Model with a perfect model of the task but with our learned model of teammates’ policies.

• Deep Q-Networks (DQN) – a model-free reinforcement learning algorithm that, similarly to our work, has a model of neither the teammates nor the task. It learns for 200 episodes.
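As referenced above, the evaluation loop can be sketched as follows. play_episode is an assumed callable returning the number of steps the team takes to capture the prey (lower is better), and the 90% interval uses a normal approximation, which is an assumption about how the intervals were computed.

import math
import statistics

def mean_with_90ci(scores):
    """Mean and half-width of a 90% normal-approximation confidence interval."""
    m = statistics.mean(scores)
    half = 1.645 * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, half

def evaluate(play_episode, full_team, team_with_adhoc, initial_states):
    """Score the original team (s) and the team with the ad hoc agent (s')
    on the same set of initial states."""
    s = [play_episode(full_team, start) for start in initial_states]
    s_prime = [play_episode(team_with_adhoc, start) for start in initial_states]
    return mean_with_90ci(s), mean_with_90ci(s_prime)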

5.3.1 Features selected for teammate policy learning

In the pursuit domain, teammates can select one of four actions at every timestep: up, down, left or right. Our goal is to estimate the distribution over each teammate's actions, π(x, ai). Using the neural network described in Section 4.3 requires us to build features of the state relative to each teammate. We selected the following feature set:

Figure 5.4: Example state in the pursuit domain

1. Difference of positions in the x-axis between the teammate and the prey

2. Difference of positions in the y-axis between the teammate and the prey

3. Difference of positions in the x-axis between the teammate and its closest teammate

4. Difference of positions in the y-axis between the teammate and its closest teammate

5. Difference of positions in the x-axis between the teammate and its 2nd closest teammate

6. Difference of positions in the y-axis between the teammate and its 2nd closest teammate

7. Difference of positions in the x-axis between the teammate and its 3rd closest teammate

8. Difference of positions in the y-axis between the teammate and its 3rd closest teammate

This set allows us to describe the state relative to a teammate. If, in a given pursuit world, we switch the positions of two agents, their feature sets are switched accordingly. This is possible because the feature set is completely independent of teammate identifiers. Since teammate behaviors are homogeneous and the features are relative to each teammate, we can use the same neural network to predict the actions of all teammates. In Figure 5.4, we can see an example of a valid state in the pursuit domain. In that state, the feature set for teammate 1 is [1, 0, 0, 2, −2, 0, 1, −2]. First, we have the distance to the prey, (1, 0); then the closest teammate is teammate 2, at a distance of (0, 2); then teammate 4, at a distance of (−2, 0) since the world is toroidal; and finally teammate 3, which is (1, −2) blocks apart from teammate 1. If we switch the numbers 2, 3 and 4 in the figure, the feature set for teammate 1 remains the same, as it is independent of the teammate identifiers.
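A sketch of this feature extraction is shown below. The toroidal signed differences, the Manhattan-distance ordering of the "closest" teammates and the sign of the offsets are conventions assumed here for illustration, since the text does not pin them down.

def wrapped_diff(a, b, L):
    """Signed difference a - b on a toroidal axis of size L, with the smallest
    absolute value (ties resolved toward the negative side for even L)."""
    return (a - b + L // 2) % L - L // 2

def teammate_features(positions, prey, i, L):
    """Features 1-8 for teammate i: offset to the prey, then offsets to the
    other teammates, ordered from closest to farthest."""
    xi, yi = positions[i]
    features = [wrapped_diff(prey[0], xi, L), wrapped_diff(prey[1], yi, L)]
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: abs(wrapped_diff(positions[j][0], xi, L))
                              + abs(wrapped_diff(positions[j][1], yi, L)))
    for j in others:
        features.append(wrapped_diff(positions[j][0], xi, L))
        features.append(wrapped_diff(positions[j][1], yi, L))
    return features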

5.3.2 Features selected for environment dynamics learning

In the pursuit domain, the reward is −1 for every state except the final state, where it is 0. The environment dynamics regarding the teammates' positions are deterministic as long as there are no collisions, but the prey acts randomly. In case of a collision, the winner is selected at random, as each cell can only be occupied by a single agent. To represent the state, we select the following features:

• Position in the x-axis of the prey

• Position in the y-axis of the prey

• Position in the x-axis of teammate 1 (ad-hoc agent)

• Position in the y-axis of teammate 1 (ad-hoc agent)

• Position in the x-axis of teammate 2

• Position in the y-axis of teammate 2

• Position in the x-axis of teammate 3

• Position in the y-axis of teammate 3

• Position in the x-axis of teammate 4

• Position in the y-axis of teammate 4

This feature set contains all the information needed to reconstruct the state, and the metric distance between states is well defined, which is useful for computing similarity between states.
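A trivial sketch of this representation follows; the names are illustrative, and the ordering (prey first, then the four predators, with the ad hoc agent as the first predator) follows the list above.

def environment_features(prey, predators):
    """Returns [prey_x, prey_y, x_1, y_1, ..., x_4, y_4] as a flat list."""
    features = [prey[0], prey[1]]
    for (x, y) in predators:            # predators[0] is the ad hoc agent
        features.extend([x, y])
    return features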

5.4 Results

5.4.1 Greedy teammates

In this section we evaluate our work using greedy teammates, as defined in [2]. Greedy teammates behave according to a fixed set of rules:

• If already adjacent to the prey, move towards it

• Pick the closest free cell adjacent to the prey

Figure 5.5: Results for our ad-hoc agent after 1, 50 and 200 episodes in a 5 × 5 world

• If the distance in the x-axis is bigger than in the y-axis, move either left or right, whichever gets it closer to the prey; otherwise, move either up or down, whichever gets it closer to the prey.

• If both directions are blocked, move randomly

This is a very simple, almost deterministic teammate that does not take its teammates into account unless they are already neighboring the prey. As a result, it works well in small (e.g. 5 × 5) worlds, but gets worse as the world size increases. Our ad hoc agent plays with these teammates for a number of episodes and learns as it plays.
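A simplified sketch of this greedy rule set follows. The toroidal-difference convention and the tie-breaking when the two axes are equally distant are assumptions; the actual handcoded policy from [2] may differ in such details.

import random

MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def wrapped_diff(a, b, L):
    """Smallest signed toroidal difference a - b on an axis of size L."""
    return (a - b + L // 2) % L - L // 2

def towards(dx, dy):
    """Preferred moves toward offset (dx, dy), larger axis first."""
    horizontal = ["R"] if dx > 0 else ["L"] if dx < 0 else []
    vertical = ["D"] if dy > 0 else ["U"] if dy < 0 else []
    return horizontal + vertical if abs(dx) > abs(dy) else vertical + horizontal

def greedy_action(me, prey, predators, L):
    dx, dy = wrapped_diff(prey[0], me[0], L), wrapped_diff(prey[1], me[1], L)
    if abs(dx) + abs(dy) == 1:                     # already adjacent: close in on the prey
        return towards(dx, dy)[0]
    # Otherwise, target the closest free cell adjacent to the prey.
    adjacent = [((prey[0] + mx) % L, (prey[1] + my) % L) for mx, my in MOVES.values()]
    free = [c for c in adjacent if c not in predators]
    if free:
        target = min(free, key=lambda c: abs(wrapped_diff(c[0], me[0], L))
                                         + abs(wrapped_diff(c[1], me[1], L)))
        dx, dy = wrapped_diff(target[0], me[0], L), wrapped_diff(target[1], me[1], L)
    occupied = set(predators) | {tuple(prey)}
    for move in towards(dx, dy):                   # larger-axis direction first
        mx, my = MOVES[move]
        if ((me[0] + mx) % L, (me[1] + my) % L) not in occupied:
            return move
    return random.choice(list(MOVES))              # both directions blocked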

In Figure 5.5 we show its score after playing 1, 50 and 200 episodes in a 5 × 5 world. We can see that it improves with more episodes of training and even surpasses a team of all greedy agents.

Compared to PLASTIC-Model and PLASTIC-Model with our teammate behavior model, it performs somewhat worse. This is expected as those two solutions have a perfect model of the environment.

DQN has the worst result of all solutions, presumably because it requires more data to converge to a good performance.

In Figures 5.6 and 5.7 we show its score for bigger worlds. It is clear that its performance keeps up with bigger worlds, and in all three world sizes it achieves results similar to a team of all greedy teammates.

Even though it scores worse than PLASTIC-Model, this can be easily explained as PLASTIC-Model has the correct model of the greedy agents and knows the task exactly, while our approach learns it with time. When using our teammates’ behavior model in PLASTIC-Model(NN), it gets closer to our results.

Looking at the results in Figure 5.7, we conclude that the DQN approach gets even worse in bigger worlds, due to having only a small number of episodes for training. Our solution copes with the increasing state-space size while also having only 200 episodes to train.

Figure 5.6: Results for our ad-hoc agent after 1, 50 and 200 episodes in a 10 × 10 world

Figure 5.7: Results for our ad-hoc agent after 1, 50 and 200 episodes in a 20 × 20 world

Figure 5.8: Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 5 × 5 world

5.4.2 Teammate aware teammates

In this section we evaluate our work using teammate aware agents, again as defined in [2]. Teammate aware agents give priority to the agents farther from the prey, so that those agents can take the fastest path to chase the prey. They also use A* to plan the optimal path, assuming teammates are static obstacles. Formally, they follow this set of rules (a simplified sketch is given after the list):

• Sort the predators by decreasing distance to the prey

• For each predator, pick the free cell adjacent to the prey that is closest to it and has not been picked yet

• If the predator is already at the picked cell, move towards the prey

• Otherwise, use A* to pick an action, assuming teammates are static obstacles
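The sketch below illustrates this assignment scheme. For simplicity, breadth-first search on the toroidal grid stands in for A* (both find shortest obstacle-avoiding paths here), and the assignment tie-breaking is an assumption, so this is an approximation of the behavior described in [2].

from collections import deque
import random

MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def torus_dist(a, b, L):
    dx = min(abs(a[0] - b[0]), L - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), L - abs(a[1] - b[1]))
    return dx + dy

def first_move_bfs(start, goal, obstacles, L):
    """First move along a shortest obstacle-avoiding path, or None if unreachable."""
    frontier, seen = deque([(start, None)]), {start}
    while frontier:
        cell, first = frontier.popleft()
        if cell == goal:
            return first
        for move, (dx, dy) in MOVES.items():
            nxt = ((cell[0] + dx) % L, (cell[1] + dy) % L)
            if nxt not in seen and nxt not in obstacles:
                seen.add(nxt)
                frontier.append((nxt, first if first else move))
    return None

def teammate_aware_actions(predators, prey, L):
    """Joint action of the four teammate-aware predators."""
    capture_cells = [((prey[0] + dx) % L, (prey[1] + dy) % L)
                     for dx, dy in MOVES.values()]
    order = sorted(range(len(predators)),
                   key=lambda i: -torus_dist(predators[i], prey, L))  # farthest first
    actions, taken = {}, set()
    for i in order:
        # Pick the closest capture cell that has not been claimed yet.
        options = [c for c in capture_cells if c not in taken]
        target = min(options, key=lambda c: torus_dist(predators[i], c, L))
        taken.add(target)
        if predators[i] == target:
            # Already on the capture cell: step toward the prey to block it.
            move = first_move_bfs(predators[i], tuple(prey), set(), L)
        else:
            obstacles = {p for j, p in enumerate(predators) if j != i} | {tuple(prey)}
            move = first_move_bfs(predators[i], target, obstacles, L)
        actions[i] = move if move else random.choice(list(MOVES))
    return [actions[i] for i in range(len(predators))]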

This behavior works much better in bigger worlds, as can be seen by comparing Figure 5.10 with Figure 5.7. It is also a behavior that is harder to learn, due to the obstacle avoidance performed by A* and the pre-coordination used when picking destinations. Again, in Figure 5.8 we show its score after playing 1, 50 and 200 episodes in a 5 × 5 world. Compared to the greedy agents in Section 5.4.1, our solution does not work as well. This is explained by two reasons: teammate aware agents have a more complex behavior, which is harder to learn, and they rely heavily on their team (to avoid collisions and prioritize agents farther from the prey), so mistakes are penalized more heavily. In Figures 5.9 and 5.10 we show our score in 10 × 10 and 20 × 20 worlds. The results are similar to those of the previous section with greedy agents: the ad hoc agent still manages a good score in big worlds when following our solution, but its performance degrades heavily when using DQN.

Figure 5.9: Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 10 × 10 world

Figure 5.10: Results for our ad-hoc agent, playing with team aware agents, after 1, 50 and 200 episodes in a 20 × 20 world

However, all results are overall worse than when the ad hoc agent played with greedy teammates, and again this is explained by the increased complexity of playing with teammate aware agents.

6 Conclusion

Contents

6.1 Future work...... 51

Ad hoc teamwork is an emerging research field whose goal is to create an agent that can cooperate with a team of other agents without pre-coordinating with them. As more agents enter the real world, it is expected that they will need to cooperate with others they have never worked with and whose behavior is unknown, making ad hoc agents necessary and useful.

Most existing work in this area focuses on theoretical results with very strong restrictions regarding the number of teammates, the complexity of their behavior and the available information. Some work focuses on empirical results, such as PLASTIC-Model and PLASTIC-Policy, from which our inspiration comes. We propose a novel architecture combining deep neural networks, to model the teammates and the task, with MCTS for planning. This differs both from PLASTIC-Model, where the task is known and teammate behaviors are either known or learned beforehand with decision trees, and from PLASTIC-Policy, where the policy is learned directly without learning the teammates' behaviors and the environment dynamics.

By comparing our solution with the state-of-the-art PLASTIC-Model, we show that we can obtain comparable performance while using less information about the team and the task. Looking at the results of PLASTIC-Model with our learned teammate behavior, we can see that the difference in performance comes from the approximation error in learning the teammates' policies.

We also compare our solution with DQN, an algorithm widely used in reinforcement learning, under assumptions similar to ours. DQN is quite similar to PLASTIC-Policy for our purposes, as we are more interested in online learning with teams that the ad hoc agent has never seen before. We show that DQN, compared to our work, takes a long time to learn, as is usual for model-free approaches. Additionally, explicitly modeling the task, as we do, allows the agent to reuse this information when teams change, unlike DQN and PLASTIC-Policy.

6.1 Future work

In the future, it would be interesting to adapt the algorithm to work in continuous state and action spaces, such as the Half Field Offense domain, so that it could be compared against the PLASTIC-Policy [5] algorithm. Another point of improvement would be to allow the environment model to output a distribution over possible next states, instead of only the most likely one. This could be done using, for example, Mixture Density Networks [25] for continuous state spaces, or for large ones where a metric distance is defined (such as the pursuit domain). For small discrete spaces, a simple neural network with an output for each possible state would be the best fit. Regarding planning, MCTS turned out to be quite slow; a faster planning algorithm would make this work more useful and interesting. One option would be to use a well-studied optimization of MCTS, such as RAVE [26], or simply an alternative to MCTS.

It would also be interesting to try other supervised learning techniques instead of neural networks, as they might be more data-efficient.

Bibliography

[1] P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein, “Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination,” Twenty-Fourth AAAI Conference on Artificial Intelligence, pp. 1504–1509, 2010.

[2] S. R. Barrett, P. Stone, and R. Mooney, “Making Friends on the Fly: Advances in Ad Hoc Teamwork,” 2014.

[3] M. Benda, V. Jagannathan, and R. Dodhiawala, “On Optimal Cooperation of Knowledge Sources - An Empirical Investigation,” Boeing Advanced Technology Center, Boeing Computing Services, Seattle, WA, USA, Tech. Rep. BCS–G2010–28, 1986. [Online]. Available: http://www.cs.utexas.edu/~shivaram

[4] S. Barrett, P. Stone, and S. Kraus, “Empirical evaluation of ad hoc teamwork in the pursuit domain,” Autonomous Agents and Multiagent Systems (AAMAS), no. May, pp. 567–574, 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=2031678.2031698

[5] S. Barrett and P. Stone, “Cooperating with Unknown Teammates in Robot Soccer,” AAAI Workshop on Multiagent Interaction without Prior Coordination (MIPC 2014), no. July, pp. 2–7, 2014.

[6] Y. Zhan, H. B. Ammar, and M. E. Taylor, “Theoretically-grounded policy advice from multiple teachers in reinforcement learning settings with applications to negative transfer,” IJCAI International Joint Conference on Artificial Intelligence, vol. 2016-Janua, no. 7540, pp. 2315–2321, 2016. [Online]. Available: http://dx.doi.org/10.1038/nature14236

[7] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations 2015, 2015, pp. 1–15.

[8] P. Stone, R. Gan, and S. Kraus, “To teach or not to teach? Decision making under uncertainty in ad hoc teams,” Proc. of 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), no. May, pp. 117–124, 2010.

[9] S. Barrett and P. Stone, “Ad Hoc Teamwork Modeled with Multi-armed Bandits: An Extension to Discounted Infinite Rewards,” Teacher, no. May, 2011.

[10] S. Barrett, N. Agmon, N. Hazon, S. Kraus, and P. Stone, “Communicating with unknown team- mates,” Frontiers in Artificial Intelligence and Applications, vol. 263, no. May, pp. 45–50, 2014.

[11] F. S. Melo and A. Sardinha, “Ad hoc teamwork by learning teammates’ task (Extended Abstract),” Autonomous Agents and Multi-Agent Systems, pp. 577–578, 2016.

[12] N. Agmon and P. Stone, “Leading ad hoc agents in joint action settings with multiple teammates,” Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, no. June, pp. 341–348, 2012.

[13] N. Agmon, S. Barrett, and P. Stone, “Modeling uncertainty in leading ad hoc teams,” Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, no. May, pp. 397–404, 2014.

[14] D. Chakraborty and P. Stone, “Cooperating with a markovian ad hoc teammate,” International Conference on Autonomous Agents and Multiagent Systems,, pp. 1085–1092, 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2485091

[15] S. Barrett and P. Stone, “Cooperating with Unknown Teammates in Complex Domains: A Robot Soccer Case Study of Ad Hoc Teamwork,” Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, no. January, pp. 2010–2016, 2015.

[16] R. I. Brafman and M. Tennenholtz, “On Partially Controlled Multi-Agent Systems,” vol. 4, pp. 477– 507, 1996.

[17] P. Stone, G. A. Kaminka, and J. S. Rosenschein, “Leading a best-response teammate in an ad hoc team,” Lecture Notes in Business Information Processing, vol. 59 LNBIP, no. May, pp. 132–146, 2010.

[18] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press, 2006.

[19] S. Barrett, P. Stone, S. Kraus, and A. Rosenfeld, “Learning teammate models for ad hoc teamwork,” AAMAS Adaptive Learning Agents (ALA) Workshop, no. June, pp. 57–63, 2012. [Online]. Available: http://u.cs.biu.ac.il/∼sarit/data/articles/ala2012.pdf

[20] S. Kalyanakrishnan, Y. Liu, and P. Stone, “Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study,” RoboCup 2006: Robot Soccer World Cup X, pp. 72–85, 2007.

[21] M. H. Bowling and P. McCracken, “Coordination and Adaptation in Impromptu Teams,” in AAAI, 2005.

[22] K. Genter and P. Stone, “Ad hoc teamwork behaviors for influencing a flock,” vol. 00, pp. 1–8.

[23] S. Liemhetcharat and M. M. Veloso, “Weighted Synergy Graphs for Effective Team Formation with Heterogeneous Ad Hoc Agents,” pp. 41–65, 2013.

[24] S. V. Albrecht and S. Ramamoorthy, “A Game-Theoretic Model and Best-Response Learning Method for Ad Hoc Coordination in Multiagent Systems,” no. January, 2014.

[25] C. M. Bishop, “Mixture Density Networks,” Neural Computing Research Group, Aston University, Tech. Rep. NCRG/94/004, 1994.

[26] S. Gelly and D. Silver, “Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go,” pp. 1–33, 2011. [Online]. Available: http://www.cs.utexas.edu/~pstone/Courses/394Rspring11/resources/mcrave.pdf
