Teaching Agents with Deep Apprenticeship Learning
Rochester Institute of Technology
RIT Scholar Works, Theses, 6-2017

Teaching Agents with Deep Apprenticeship Learning
Amar Bhatt
[email protected]

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation
Bhatt, Amar, "Teaching Agents with Deep Apprenticeship Learning" (2017). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

Teaching Agents with Deep Apprenticeship Learning
by Amar Bhatt

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering

Supervised by:
Assistant Professor Dr. Raymond Ptucha
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
June 2017

Approved by:
Dr. Raymond Ptucha, Assistant Professor (Thesis Advisor, Department of Computer Engineering)
Dr. Ferat Sahin, Professor (Committee Member, Department of Electrical Engineering)
Dr. Iris Asllani, Assistant Professor (Committee Member, Department of Biomedical Engineering)
Dr. Christopher Kanan, Assistant Professor (Committee Member, Department of Imaging Science)
Louis Beato, Lecturer (Committee Member, Department of Computer Engineering)

Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title: Teaching Agents with Deep Apprenticeship Learning
I, Amar Bhatt, hereby grant permission to the Wallace Memorial Library to reproduce my thesis in whole or in part.
Amar Bhatt        Date

Dedication

This thesis is dedicated to all those who have taught and guided me throughout my years...

My loving parents, who continue to support me in my academic and personal endeavors. Their emphasis on respect, diligence, and never giving up shaped me into the student and the person I am today.

My teachers, professors, coaches, gurus, and senseis, whose commitment to excellence, student development, and higher education brought light to what they taught me. Thank you for being my guides and imparting your knowledge to me.

Acknowledgments

I am grateful...

For my parents, Anup and Meena Bhatt, who funded my pursuit of higher education and whose unwavering support gave me the strength to continue forward.

For my advisor, Dr. Ptucha, who brought me under his wing as a young undergraduate. His guidance, mentorship, and ability to unlock true potential shaped me as a student and an academic. I thank him for trusting me with Milpet and for being there for me in my journey through higher education.

For my thesis committee, made up of professors from several different fields. Their diverse knowledge reflects my passion for multidisciplinary research. They have each contributed to my academics in unique ways, and for that I am grateful.

For my friends, for staying up with me late nights and collaborating on projects. To Luke Boudreau, thank you for being a great friend and teammate. To Felipe Petroski Such, you are one of the most intelligent people I know; thank you for all of your insight. To Radha Mendapra, thank you for continually supporting me through my highs and my lows.

For RIT, for giving me the opportunity to succeed both as an academic and as a student leader. This university's commitment to student excellence is incredible and rare.
Abstract

Teaching Agents with Deep Apprenticeship Learning
Amar Bhatt
Supervising Professor: Dr. Raymond Ptucha

As the field of robotic and humanoid systems expands, more research is being done on how best to control these systems to perform complex, smart tasks. Many supervised learning and classification techniques require large datasets and only result in the system mimicking what it was given. The sequential relationship within datasets used for task learning results in Markov decision problems that traditional classification algorithms cannot solve. Reinforcement learning helps to solve these types of problems using a reward/punishment and exploration/exploitation methodology without the need for datasets. While this works for simple systems, complex systems are more difficult to teach using traditional reinforcement learning: they often have complex, non-linear, non-intuitive cost functions that are nearly impossible to model by hand. Inverse reinforcement learning (apprenticeship learning) algorithms learn such cost functions based on input from an expert system. Deep learning has also made a large impact on learning complex systems and has achieved state-of-the-art results in several applications. Using methods from apprenticeship learning and deep learning, a system can be taught complex tasks by watching an expert. It is shown here how well these types of networks solve a specific task, and how well they generalize and understand the task using only raw pixel data from an expert.

Contents

Dedication
Acknowledgments
Abstract

1 Introduction

2 Background
  2.1 Reinforcement Learning
    2.1.1 Temporal Difference Learning
    2.1.2 Q-Learning and Sarsa Implementations
  2.2 Deep Reinforcement Learning
    2.2.1 Deep Q-Networks
    2.2.2 Double Deep Q-Networks
    2.2.3 Dueling Deep Q-Networks
    2.2.4 Deep Recurrent Q-Networks
  2.3 Apprenticeship Learning
    2.3.1 Bayesian Inverse Reinforcement Learning
    2.3.2 Gaussian Process Inverse Reinforcement Learning
    2.3.3 Maximum Entropy Inverse Reinforcement Learning
    2.3.4 IRL using DQN
  2.4 Deep Inverse Reinforcement Learning
    2.4.1 Deep Gaussian Process IRL
    2.4.2 Deep Maximum Entropy IRL
    2.4.3 Deep Apprenticeship Learning
    2.4.4 Deep Q-Learning from Demonstrations

3 Dataset and Technologies
  3.1 Maze World
    3.1.1 Expert Data
    3.1.2 Random Data
    3.1.3 Datasets
    3.1.4 Simulation
    3.1.5 Processing Data
  3.2 Tools and Technology
    3.2.1 Python
    3.2.2 TensorFlow
    3.2.3 Python Imaging Library
    3.2.4 Numpy
    3.2.5 h5py

4 Proposed Methodologies
  4.1 Deep Apprenticeship Learning Network Modifications
    4.1.1 No Pooling Layers
    4.1.2 Transfer Learning
    4.1.3 Using Q-Learning
  4.2 Deep Q-Network Implementations
    4.2.1 Using Shared Experience Replay
    4.2.2 Target Q-Network
    4.2.3 Using Dueling DQN
    4.2.4 Using Deep Recurrent Q-Networks

5 Implementation
  5.1 Architecture Details
  5.2 Algorithms
    5.2.1 Deep Apprenticeship Learning Networks
    5.2.2 Deep Q-Network Apprenticeship Learning

6 Results and Analysis
  6.1 Task Completion and Task Understanding
  6.2 Test Methodology
  6.3 Proposed Architecture Performances
  6.4 Discussion
    6.4.1 Task Completion
    6.4.2 Task Understanding

7 Conclusions and Future Work
  7.1 Remarks on Novel Contributions
    7.1.1 Reward Abstraction
    7.1.2 Scheduled Shared Experience Replay
    7.1.3 Dueling Deep Q-Network Architecture
    7.1.4 Deep Recurrent Q-Network Architecture
  7.2 Challenges and Future Work
    7.2.1 Datasets and Benchmarking
    7.2.2 Overfitting
  7.3 Applications

Bibliography

List of Tables

2.1 Q-Learning and Sarsa results across several world map sizes.
5.1 Architecture hyper-parameters for task completion.
5.2 Architecture hyper-parameters for task understanding.
6.1 Task completion results.
6.2 Task understanding results.

List of Figures

2.1 Floor plan for a one-story house used in the temporal difference learning example, shown as a picture (left) and as a graph (right).
2.2 Rewards graph for room transitions in Fig. 2.1.
2.3 State transition table with reward values. A "−" sign denotes that no state transition exists; for example, state A cannot go to state D.
2.4 Q-table initialized to zero, denoting the lack of information the agent has at time step zero.
2.5 Q-table update after the first episode iteration.
2.6 Q-table update after the second episode iteration.
2.7 Q-table update after the third episode iteration.
2.8 Map example with terrain (sand = magenta, forest = green, pavement = black, water = blue, misc. debris = red, start/goal = white dots).
2.9 Parameter tuning for optimal path selection in controlled environments for Q-Learning: environment with punishment (left), environment without punishment (right).
2.10 Parameter tuning for optimal path selection in controlled environments for Sarsa: environment with punishment (left), environment without punishment (right).
2.11 Q-Learning algorithm path example with punishment.
2.12 Sarsa algorithm path example with punishment.
2.13 Deep Q-Network (DQN) architecture. The input consists of an 84x84x4 image. Each hidden layer is followed by a rectifier non-linearity (max(0, x)) [22].
2.14 Deep Q-Network results on Atari 2600 games compared to a linear learner [22].
2.15 Double Deep Q-Network results on Atari 2600 games compared to DQN [35].
2.16 Dueling Deep Q-Network architecture. The network splits into two streams (V(s) on top, A(a) on bottom) that are combined at the end [37].
2.17 Dueling Deep Q-Network results on Atari 2600 games compared to Double DQN [37].
2.18 Deep Recurrent Q-Network (DRQN) architecture [11].
2.19 DRQN results on Atari 2600 games compared to the DQN architecture [11].
2.20 Maximum Entropy versus action-based selection diagram [41].
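The background material listed above builds from tabular reinforcement learning (Section 2.1, Figures 2.1 through 2.7) up to deep Q-networks. As a rough, illustrative companion to the reward/punishment and exploration/exploitation methodology described in the abstract, the sketch below shows a plain tabular Q-learning loop in Python (the language listed in Section 3.2.1). The grid size, reward values, hyper-parameters, and placeholder environment dynamics are all assumptions made for illustration only; this is not the thesis code.

```python
# Minimal tabular Q-learning sketch (illustrative only, not the thesis code).
# The environment is a stand-in: a 16-state world where reaching the last
# state ends the episode with a positive reward.
import random

import numpy as np

n_states, n_actions = 16, 4            # e.g. a small maze with 4 moves
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # assumed learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))    # Q-table starts at zero (cf. Fig. 2.4)

def step(state, action):
    """Placeholder dynamics: returns (next_state, reward, done)."""
    next_state = random.randrange(n_states)
    done = next_state == n_states - 1
    reward = 1.0 if done else -0.01    # reward at the goal, small punishment elsewhere
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Exploration/exploitation: epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Temporal-difference (Q-learning) update of the Q-table.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```

In the deep variants surveyed in Chapter 2, the Q-table is replaced by a network that maps raw pixels (for example, the 84x84x4 input of the DQN in Fig. 2.13) to Q-values, and in the apprenticeship learning setting the reward is not hand-written as above but inferred from expert demonstrations.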