Deep Reinforcement Learning

Ben Bell

Outline
● What is Reinforcement Learning (RL) and why is it useful?
● What is Deep Reinforcement Learning (DRL)?
● DRL case studies

What is Reinforcement Learning?
● The Workflow:
  ○ An agent interacts with an environment to learn a policy that maximizes the reward from the environment
  ○ By randomly selecting actions, the agent explores the environment and slowly learns which actions give a positive or negative reward

Where is Reinforcement Learning Applied?
● General algorithm that can be applied to many complex problems
● Solves problems where the rules are simple but the solution is not
  ○ Games like Go and Chess
  ○ Robotic control
● Stochastic and partially observable environments where manually designing even a good solution is difficult or impossible

Why is Reinforcement Learning Useful?
● RL represents a paradigm shift from hand-engineering a solution to specifying an objective; no expert knowledge is needed
● Creates agents that exhibit complex behavior and often discover novel solutions
  ○ AlphaGo Zero taught human experts new strategies for a 3,000-year-old game
● Adaptive to changes in objectives or environments
● The same RL algorithm can solve different problems

Deep Reinforcement Learning (DRL)
● The combination of Reinforcement Learning and Deep Neural Networks (NNs) creates a truly general algorithm which can be applied to almost any problem
  ○ Neural networks can process many different kinds of data: numeric, images, video, audio, and any combination thereof
● NNs have the capability to generalize past experience to new states
● NNs and RL algorithms are independent; advancements in either area improve agent performance

Case Studies (Why learn on video games?)
● Simplified views into real-world scenarios
● An extensive list of environments already exists:
  ○ Driving simulators, tactical & strategy, complex 3D worlds
● Games require many characteristics desirable in real-world scenarios
  ○ Quick decision making
  ○ Balancing short-term vs long-term goals
  ○ Adaptability to evolving scenarios

DRL Timeline (Breakout)
● Dec 2013 - First successful application of Deep Reinforcement Learning
● Nov 2016 - DeepMind announces plans to research SC2
● Jan 2019 - DRL agent beats professional players
● Only 5 years from Breakout to SC2
  ○ A 34-year difference in release date
  ○ May 1976 - July 2010

AlphaStar - StarCraft 2
● SC2 is a complex game, played in real time, which requires micro and macro strategic decision-making along with resource management
● Partially observable, requiring enemy positions to be scouted/tracked
● Initially trained to mimic human actions/strategy
● Multiple agents with differing objectives are trained by competing against each other, with AlphaStar incorporating the best strategies discovered
● Unlike AlphaGo, AlphaStar does not use a search algorithm

AlphaStar - 10, Humans - 0
● Beat TLO, a professional player ranked in the top 600, 5-0
  ○ “AlphaStar takes well-known strategies and turns them on their head. The agent demonstrated strategies I hadn’t thought of before...”
● After another week of training, defeated MaNa, a top 10 player, 5-0
  ○ “I’ve realised how much my gameplay relies on forcing mistakes and being able to exploit human reactions…”

OpenAI Five - DotA 2
● Multi-agent 5 vs 5 game, played in real time and partially observable; each agent must fulfill its role and trade off personal vs team rewards
● Trained with no human supervision; agents learn from random action policies and self-play
  ○ 180 years of play per day for 9 days -> 1,620 years of play
● Reward shaping is used for the final agent, but good policies can be learned from only a binary win/loss signal
● No explicit communication channel between agents; they collaborate based on a shared view of the environment (emergent swarming)

OpenAI Five - DotA 2
● Won 2-0 against semi-pro (99th percentile) players
  ○ Attack target coordination
  ○ Flanking & ambushes
  ○ Perfect timing / low-level control
  ○ Punishes opponents' mistakes without hesitation
● Lost 0-2 against professional players
  ○ Humans were able to adapt during the game to exploit the AI

Robots Learning (Walker)
● Learns a robust policy in 2 hours, training only on a flat surface

Robots Learning (Quadruped)
● https://youtu.be/aTDkYFZFWug?t=157
● “The learned policies are also robust to changes in hardware… different robot configurations, which roughly contribute 2.0 kg to the total weight, and a new drive which has a spring three times stiffer than the original one.”
● “In terms of computational cost ... the inference on the robot requires less than 25 µs using a single CPU thread.”
● “…this process [designing the rewards & NN architecture] takes about two days for the locomotion policies presented in this work.”

Where Does RL Fail?
● Generalization: applying learned concepts to new and unseen environments
● In the maze environment, the agent overfits even with 20,000 training mazes
● AlphaStar competed on only one map
● OpenAI Five plays with hero (18/117), item, and skill restrictions

RL Summary
● Used where the rules are defined but the optimal solution is unknown
● Typically requires an accurate simulation
● Adaptable to rule/requirement changes: retrain instead of re-engineer
● In real-world use cases, special care is needed to test and evaluate performance in unseen scenarios
● DotA 2 Rematch - April 13th

The Power of Reinforcement Learning
AlphaStar was able to beat a professional player in a restricted setting after 7 days of training, then a top 10 player the following week.
“...the first success took almost three years of research time and the second success took seven days. Similarly, although OpenAI’s DotA 2 agent lost against a pro team, they were able to beat their old agent 80% of the time with 10 days of training. Wonder where it’s at now…”
- Alex Irpan, Software Engineer at Google Brain Robotics

Q&A
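
The workflow slide describes an agent that explores by selecting random actions and slowly learns which actions are rewarded. A minimal sketch of that loop follows. It assumes tabular Q-learning with epsilon-greedy exploration, a technique the slides do not name, on a hypothetical six-cell corridor environment. The names `step`, `train`, and `greedy_policy`, along with every hyperparameter value, are illustrative assumptions rather than anything taken from the talk.

```python
import random

# Hypothetical toy environment (not from the slides): a corridor of 6 cells.
# The agent starts in cell 0; reaching cell 5 ends the episode with reward +1.
N_STATES = 6
ACTIONS = (-1, +1)  # step left, step right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (assumed setup)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or when the estimates tie),
            # otherwise exploit the action with the higher estimate.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(len(ACTIONS))
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """Return the preferred action for each non-terminal cell."""
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q[:-1]]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

After training, the greedy policy should prefer stepping right in every cell, since that is the only direction that leads to the reward. Deep RL replaces the table `q` with a neural network so that the same update can generalize across states too numerous to enumerate.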
