Deep Reinforcement Learning

Ben Bell
[email protected]

Outline

● What is Reinforcement Learning (RL) and why is it useful?
● What is Deep Reinforcement Learning (DRL)?
● DRL case studies

What is Reinforcement Learning?

● The Workflow (a minimal interaction-loop sketch follows below):
○ An agent interacts with an environment to learn a policy that maximizes the cumulative reward it receives from the environment
○ By randomly selecting actions, the agent explores the environment and gradually learns which actions yield positive or negative rewards
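
Below is a minimal sketch of that loop, assuming a Gymnasium-style environment API; CartPole stands in for the much richer environments discussed later, and the agent here only explores randomly rather than learning.

```python
# A minimal sketch of the RL interaction loop, assuming the Gymnasium API.
# CartPole is only a stand-in environment; the agent below never learns,
# it just explores with random actions, as described on this slide.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    # Random exploration; a learning agent would act from its current policy
    # and use the observed reward to improve that policy.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"episode return: {episode_return}")
env.close()
```

A learning algorithm would replace the random `env.action_space.sample()` call with actions chosen by the current policy and use the observed rewards to update that policy.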

Where is Reinforcement Learning Applied?

● General algorithm that can be applied to many complex problems
● Solves problems where the rules are simple but the solution is not
○ Games like Go and Chess
○ Robotic control
● Stochastic and partially observable environments where manually designing even a good solution is difficult or impossible

Why is Reinforcement Learning Useful?

● RL represents a paradigm shift from hand-engineering a solution to specifying an objective; no expert knowledge is needed
● Creates agents that exhibit complex behavior and often discover novel solutions
○ AlphaGo Zero taught human experts new strategies for a 3,000-year-old game
● Adaptive to changes in objectives or environments
● The same RL algorithm can solve different problems

Deep Reinforcement Learning (DRL)

● The combination of Reinforcement Learning and Deep Neural Networks (NNs) creates a truly general algorithm that can be applied to almost any problem (a minimal network sketch follows below)
○ Neural networks can process many different kinds of data: numeric, images, video, audio, and any combination thereof
● NNs have the capability to generalize past experience to new states
● NNs and RL algorithms are independent; advancements in both areas improve agent performance
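
As a concrete illustration, the sketch below shows the kind of network DRL pairs with an RL algorithm: a small fully connected net that maps an observation vector to one value estimate per action, in the spirit of the Q-networks used in the 2013 Atari work. The layer sizes and dimensions are illustrative assumptions, and PyTorch is used only because it is a common choice.

```python
# A minimal sketch of the "deep" part of DRL: a small neural network that
# maps an observation vector to one estimated return per action.
# Layer sizes and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one value estimate per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Greedy action selection from the network's value estimates.
q = QNetwork(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)                  # stand-in observation
action = q(obs).argmax(dim=-1).item()    # pick the highest-valued action
```

Swapping the input layers for convolutions lets the same idea consume images instead of numeric vectors, which is part of what makes the combination so broadly applicable.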

Case Studies (Why learn on video games?)

● Simplified views into real-world scenarios
● An extensive list of environments already exists:
○ Driving simulators, tactical & strategy, complex 3D worlds
● Games require many characteristics desirable in real-world scenarios
○ Quick decision making
○ Balancing short-term vs long-term goals
○ Adaptability to evolving scenarios

DRL Timeline (Breakout)

● Dec 2013 - First successful application of Deep Reinforcement Learning
● Nov 2016 - DeepMind announces plans to research SC2
● Jan 2019 - DRL agent beats professional players
● Only 5 years from Breakout to SC2
○ A 34-year difference in release dates
○ May 1976 - July 2010

AlphaStar - StarCraft 2

● SC2 is a complex game, played in real time, which requires micro and macro strategic decision-making along with resource management
● Partially observable, requiring enemy positions to be scouted/tracked
● Initially trained to mimic human actions/strategy
● Multiple agents with differing objectives are trained by competing against each other, with AlphaStar incorporating the best strategies discovered
● Unlike AlphaGo, AlphaStar does not use a search algorithm

AlphaStar - 10, Humans - 0

● Beat TLO, a professional player ranked in the top 600, 5-0
○ “AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn’t thought of before...”
● After another week of training, it defeated MaNa, a top-10 player, 5-0
○ “I’ve realised how much my gameplay relies on forcing mistakes and being able to exploit human reactions…”

OpenAI Five - DotA 2

● Multi-agent 5 vs 5 game, played in real time and partially observable; each agent must fulfill its role and trade off personal vs team rewards
● Trained with no human supervision; agents learn from random action policies and self-play
○ 180 years per day for 9 days -> 1,620 years of play
● Reward shaping is used for the final agent, but good policies can be learned from only a binary win/loss signal (see the sketch below)
● No explicit communication channel between agents; they collaborate based on a shared view of the environment (emergent swarming)
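
The sketch below contrasts the two reward signals mentioned above: a sparse binary win/loss reward versus a shaped reward that adds dense intermediate signals. The event names and weights are illustrative assumptions, not the actual OpenAI Five reward terms.

```python
# A minimal sketch contrasting a sparse win/loss reward with a shaped reward.
# Event names and weights are illustrative assumptions only.

def sparse_reward(won: bool, game_over: bool) -> float:
    """Binary signal: +1 for a win, -1 for a loss, 0 until the game ends."""
    if not game_over:
        return 0.0
    return 1.0 if won else -1.0

def shaped_reward(events: dict) -> float:
    """Dense signal: small rewards for intermediate progress at every step."""
    return (
        0.5 * events.get("last_hits", 0)
        + 1.0 * events.get("kills", 0)
        - 1.0 * events.get("deaths", 0)
        + 0.1 * events.get("net_worth_delta", 0.0)
    )
```

Shaping gives the agent feedback long before a game finishes, which speeds up learning, at the cost of hand-picking which intermediate events to reward.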

OpenAI Five - DotA 2

● Won 2-0 against semi-pro (99th percentile) players
○ Attack target coordination
○ Flanking & Ambushes
○ Perfect timing / low-level control
○ Punishes opponents’ mistakes without hesitation
● Lost 0-2 against professional players
○ Humans were able to adapt during the game to exploit the AI

Robots Learning (Walker)

● Learns a robust policy in 2 hours, training only on a flat surface

Robots Learning (Quadruped)

● https://youtu.be/aTDkYFZFWug?t=157
● “The learned policies are also robust to changes in hardware… different robot configurations, which roughly contribute 2.0 kg to the total weight, and a new drive which has a spring three times stiffer than the original one.”
● “In terms of computational cost ... the inference on the robot requires less than 25 µs using a single CPU thread.”
● “…this process [designing the rewards & NN architecture] takes about two days for the locomotion policies presented in this work.”

Where Does RL Fail?

● Generalization: applying learned concepts to new & unseen environments
● In the maze environment, the agent overfits even with 20,000 training mazes
● AlphaStar competed on only one map
● OpenAI Five plays with hero (18/117), item, and skill restrictions

RL Summary

● Used where the rules are defined but the optimal solution is unknown
● Typically requires an accurate simulation
● Adaptable to rule/requirement changes: retrain instead of re-engineer
● In real-world use cases, special care is needed to test and evaluate performance in unseen scenarios

● DotA 2 Rematch - April 13th

The Power of Reinforcement Learning

AlphaStar was able to beat a professional player in a restricted setting after 7 days of training, then a top-10 player the following week. “...the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…”

- Alex Irpan, Software Engineer at Brain Robotics

Q&A