Baconian: A Unified Open-source Framework for Model-Based Reinforcement Learning (Demonstration)

Linsen Dong, Guanyu Gao, Xinyi Zhang
School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected], [email protected], [email protected]

Liangyu Chen
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
[email protected]

Yonggang Wen
School of Computer Science and Engineering, Nanyang Technological University, Singapore
[email protected]

ABSTRACT
Model-Based Reinforcement Learning (MBRL) is a category of Reinforcement Learning (RL) algorithms that can improve sampling efficiency by modeling and approximating the system dynamics. It has been widely adopted in research on robotics, autonomous driving, etc. Despite its popularity, there is still a lack of sophisticated and reusable open-source frameworks to facilitate MBRL research and experiments. To fill this gap, we develop a flexible and modularized framework, Baconian, which allows researchers to easily implement an MBRL testbed by customizing or building upon our provided modules and algorithms. Our framework can free users from re-implementing popular MBRL algorithms from scratch, thus greatly saving users' effort on MBRL experiments.

Figure 1: Feature list of Baconian.
  Algorithm: Dyna, GPS, ME-TRPO, MPC, iLQR, DQN, DDPG, PPO
  Environment: OpenAI Gym, PyBullet, DeepMind Control Suite
  Utility: Built-in Logging and Visualization, TensorFlow Integration, Parameter Management
  Open-source Support: User Guide and API References, MIT License, Benchmark Results Released

KEYWORDS
Reinforcement Learning, Model-based Reinforcement Learning, Open-source Library

1 INTRODUCTION
Model-Based Reinforcement Learning (MBRL) is proposed to reduce the sample complexity of model-free Deep Reinforcement Learning (DRL) algorithms [12]. Specifically, MBRL approximates the system dynamics with a parameterized model, which can be used for policy optimization when training data is very limited or costly to obtain in the real world.

Implementing RL experiments from scratch can be tedious and bug-prone. Fortunately, many open-source frameworks have been developed to facilitate DRL research, including baselines [3], rllab [4], Coach [2], and Horizon [6]. However, these frameworks are mainly implemented for model-free DRL methods and lack sufficient support for MBRL.

Existing model-based frameworks are few and have shortcomings. The work in [18] gives a comprehensive benchmark over state-of-the-art MBRL algorithms, but the implementations are scattered across different codebases without a unified implementation, which makes it cumbersome to conduct experiments with. The work in [5] provides an implementation of Guided Policy Search (GPS) [10], which supports robotic control tasks, but it lacks support for other MBRL algorithms. Thus, a unified MBRL open-source framework is needed. To fill this gap, we design and implement a unified MBRL framework, Baconian, by trading off the diversity of included MBRL algorithms against the complexity of the framework. Users can reproduce benchmark results or prototype their ideas easily with a minimal amount of code and without understanding the detailed implementations. Moreover, the design of Baconian not only benefits MBRL research, but is also applicable to other types of RL algorithms, including model-free algorithms. The codebase is available at https://github.com/cap-ntu/baconian-project. The demo video is available at https://youtu.be/J6xI6qI3CvE.

2 MAIN FEATURES
Baconian supports many RL algorithms, test environments, and experiment utilities. We summarize the main features in Fig. 1.

Figure 2: The system design of Baconian. The system is divided into three main modularized components (Experiment Manager, Training Engine, and Monitor) to minimize the coupling for flexibility and maintainability.

Figure 3: The procedure to create an MBRL experiment in Baconian. Each module is replaceable and configurable to reduce the effort of building from scratch.

State-of-the-Art RL Algorithms. We implement many widely used RL algorithms. For model-based algorithms, we implement Dyna [15], ME-TRPO (Model-Ensemble Trust Region Policy Optimization) [9], MPC (Model Predictive Control) [13], iLQR (Iterative Linear Quadratic Regulator) [17], etc. Since many model-based algorithms are built upon model-free algorithms, we also implement some popular model-free algorithms, including DQN [7], DDPG [11], and PPO [14], in Baconian.

Supported Test Environments. To evaluate the performance of RL algorithms, it is essential to support a wide range of test environments. Baconian supports OpenAI Gym [1], RoboSchool [8], and the DeepMind Control Suite [16]. These test environments cover most essential tasks in the RL community.

Experiment Utilities. Baconian provides many utilities to reduce users' effort on experiment set-up, hyper-parameter tuning, logging, result visualization, and algorithm diagnosis. We provide TensorFlow integration to support neural network building, training, and management. As hyper-parameters play a critical role in RL experiments, we provide a user-friendly parameter management utility to remove the tedious work of setting, loading, and saving these hyper-parameters.

Open-source Support. Baconian provides detailed user guides and API references¹, so users can get started with Baconian easily and conduct novel MBRL research upon it. We also release some preliminary benchmark results in the codebase.

3 DESIGN AND IMPLEMENTATION
Baconian consists of three major components, namely, the Experiment Manager, the Training Engine, and the Monitor. The system overview of Baconian is shown in Fig. 2. Various design patterns are applied to decouple the complicated dependencies across different modules and to enable easy extension of and programming over the framework.

3.1 Experiment Manager
The Experiment Manager consists of the Experiment Settings, Status Collector, and Experiment Recorder. The Experiment Settings module manages the creation and initialization of each module. The Status Collector gathers status information across different modules to compose a globally shared status, which can be used for learning rate decay, exploration strategy scheduling, etc. The Experiment Recorder records the information generated during the experiment, such as losses and rewards; this information is handed to the Monitor for rendering or saving.

3.2 Training Engine
The Training Engine handles the training process of the MBRL algorithms. The novelty of the design lies in abstracting and encapsulating the training process as a Control Flow module, which controls the execution of the experiment based on the user's specifications, including the agent's sampling from the environment, policy model and dynamics model optimization, and testing. MBRL algorithms can be complicated [12, 15]. This abstraction decouples the tangled and complicated MBRL training process into independent tasks, which are further encapsulated as sub-modules of the Control Flow module to provide flexibility, as illustrated by the conceptual sketch at the end of this section.

3.3 Monitor
The Monitor is responsible for monitoring and recording the experiment as it proceeds. This includes recording the necessary logs, printing information/warnings/errors, and rendering the results.
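To make the Control Flow abstraction of Section 3.2 concrete, the following is a conceptual sketch, in plain Python rather than Baconian's actual classes, of how a Dyna-style training process decomposes into independent sampling, model-fitting, policy-optimization, and testing tasks. The objects and methods shown (agent, env, dynamics_model and their calls) are illustrative placeholders only.

# Conceptual sketch of a Control Flow for a Dyna-style MBRL loop.
# The agent/env/dynamics_model objects and their methods are placeholders,
# not Baconian APIs; they only illustrate how the training process splits
# into independent, replaceable sub-tasks.

def dyna_style_control_flow(agent, env, dynamics_model, num_steps=100):
    for step in range(num_steps):
        # 1. The agent samples transitions from the real environment.
        real_data = agent.sample(env, sample_count=1000)

        # 2. The parameterized dynamics model is fit on the real data.
        dynamics_model.train(real_data)

        # 3. Additional rollouts are sampled from the learned model and
        #    used, together with the real data, to optimize the policy.
        model_data = agent.sample(dynamics_model, sample_count=5000)
        agent.train(real_data + model_data)

        # 4. The current policy is periodically evaluated (testing).
        if step % 10 == 0:
            agent.test(env, num_episodes=5)

In Baconian, each of these numbered sub-tasks corresponds to a replaceable sub-module of the Control Flow, so a different MBRL algorithm can reuse the same skeleton while swapping individual sub-tasks.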
4 USAGE
This section presents the procedure to create an MBRL experiment with the essential modules in Baconian. The procedure is shown in Fig. 3². For high flexibility, most of the modules are customizable. Meanwhile, users can directly adopt the built-in benchmark modules or code if customization is unnecessary.

¹ The documentation can be found at https://baconian-public.readthedocs.io/en/latest/API.html.
² Due to the page limit, please see the documentation page https://baconian-public.readthedocs.io/en/latest/step_by_step.html for details on how to configure these modules.

First, the user should create an environment and an RL algorithm module with the necessary hyper-parameters configured, e.g., neural network size and learning rate. The algorithm module is usually composed of a policy module and a dynamics model module, depending on the algorithm. Then, the user needs to create an agent module by passing into it the algorithm module and, if needed, an exploration strategy module.

Second, the user should create a control flow module that defines how the experiment should proceed and its stopping conditions. This includes defining how many samples should be collected for training at each step, what condition indicates the completion of an experiment, etc. Some typical control flows have already been implemented in Baconian to meet users' needs.

Finally, an experiment module should be created by passing the agent, environment, and control flow modules into it, and then launched. After that, Baconian will handle the experiment running, monitoring, and result saving/logging.
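Putting these steps together, an experiment script has roughly the following shape. This is a minimal sketch for illustration only: the names used here (make_env, MLPDynamicsModel, DDPG, Dyna, Agent, TrainTestFlow, Experiment) and their arguments are assumptions standing in for the corresponding Baconian modules; consult the documentation and the examples in the codebase for the exact imports and signatures.

# Hypothetical end-to-end script following the procedure in Fig. 3.
# All names below are illustrative stand-ins for Baconian modules; the
# exact import paths and constructor arguments are documented in the
# online API reference.

# Step 1: environment plus an algorithm module whose hyper-parameters
# (network size, learning rate, ...) are configured at construction.
env = make_env('Pendulum-v0')
dynamics = MLPDynamicsModel(env_spec=env.env_spec, learning_rate=1e-3)
policy_algo = DDPG(env_spec=env.env_spec, learning_rate=1e-4)
algo = Dyna(env_spec=env.env_spec, model_free_algo=policy_algo,
            dynamics_model=dynamics)

# The agent wraps the algorithm and, if needed, an exploration strategy.
agent = Agent(env=env, algo=algo, exploration_strategy=None)

# Step 2: a control flow defining how many samples are collected per
# training step and when the experiment is considered complete.
flow = TrainTestFlow(train_samples_per_step=100, total_train_steps=10000)

# Step 3: assemble the experiment and launch it; Baconian then handles
# running, monitoring, and result saving/logging.
exp = Experiment(agent=agent, env=env, flow=flow)
exp.run()

Because the control flow and experiment modules are agnostic to the concrete algorithm and environment, switching to another algorithm (e.g., MPC or ME-TRPO) or another test environment only requires replacing the corresponding module.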
5 CONCLUSION
This paper presented a unified, reusable, and flexible framework, Baconian, for MBRL research. It can reduce users' effort to conduct MBRL experiments and to prototype new MBRL algorithms. In the future, we will implement more state-of-the-art MBRL algorithms and benchmark them on different tasks.

A LIST OF REQUIREMENTS
To present Baconian, we require a computer running Ubuntu 16.04/18.04 with Python 3.5 or higher installed.

REFERENCES
[1] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540.
[2] Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. 2017. Reinforcement Learning Coach. (Dec. 2017). https://doi.org/10.5281/zenodo.1134899
[3] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. 2017. OpenAI Baselines. https://github.com/openai/baselines.
[4] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning. 1329–1338.
[5] C. Finn, M. Zhang, J. Fu, X. Tan, Z. McCarthy, E. Scharff, and S. Levine. 2016. Guided Policy Search Code Implementation. http://rll.berkeley.edu/gps. Software available from rll.berkeley.edu/gps.
[6] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. 2018. Horizon: Facebook's Open Source Applied Reinforcement Learning Platform. arXiv preprint arXiv:1811.00260.
[7] Ionel-Alexandru Hosu and Traian Rebedea. 2016. Playing Atari Games with Deep Reinforcement Learning and Human Checkpoint Replay. CoRR abs/1607.05077. http://arxiv.org/abs/1607.05077
[8] Oleg Klimov and John Schulman. 2017. Roboschool. https://github.com/openai/roboschool.
[9] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. 2018. Model-Ensemble Trust-Region Policy Optimization. In 6th International Conference on Learning Representations (ICLR 2018). https://openreview.net/forum?id=SJJinbWRZ
[10] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 17 (2016), 39:1–39:40. http://jmlr.org/papers/v17/15-522.html
[11] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations (ICLR 2016). http://arxiv.org/abs/1509.02971
[12] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. 2018. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 7559–7566.
[13] Arthur George Richards. 2005. Robust constrained model predictive control. Ph.D. Dissertation. Massachusetts Institute of Technology.
[14] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347. http://arxiv.org/abs/1707.06347
[15] Richard S. Sutton. 1991. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2, 4 (1991), 160–163.
[16] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. 2018. DeepMind Control Suite. Technical Report. DeepMind. https://arxiv.org/abs/1801.00690
[17] Yuval Tassa, Tom Erez, and Emanuel Todorov. 2012. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 4906–4913. https://doi.org/10.1109/IROS.2012.6386025
[18] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. 2019. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057.