Online Learning from Human Feedback with Applications to Exoskeleton Gait Optimization

Thesis by Ellen Novoseller

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CALIFORNIA INSTITUTE OF TECHNOLOGY
Pasadena, California

2021
Defended November 30th, 2020

© 2020 Ellen Novoseller
ORCID: 0000-0001-5263-0598
Some rights reserved. This thesis is distributed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

ACKNOWLEDGEMENTS

I am deeply grateful to my advisors, Professors Joel Burdick and Yisong Yue, for their support along my PhD journey. This thesis would not have been possible without their constant advice, insights, patience, guidance, and encouragement.

I would also like to thank my committee members, Professors Aaron Ames, Dorsa Sadigh, and Richard Murray, for taking time out of their busy schedules to provide valuable suggestions and advice.

Also, I am extremely grateful to everyone with whom I have collaborated (in no particular order): Maegan Tucker, Kejun Li, Myra Cheng, Claudia Kann, Richard Cheng, Yibing Wei, Erdem Bıyık, Jeffrey Edlund, Charles Guan, Atli Kosson, Solveig Einarsdottir, Sonia Moreno, and Professors Aaron Ames, Yanan Sui, Dorsa Sadigh, and Dimitry Sayenko. I have learned a tremendous amount from you, and this work would never have been possible if I had not had the opportunity to work together with you.

I would like to thank all of my colleagues in Joel’s and Yisong’s groups; I have been really fortunate to get to know you during my time at Caltech. I am also lucky to have many great friends who have been there for me over the last six years and made grad school more enjoyable.

Finally, I am grateful to my family for their love and for believing in me, particularly my husband David, my parents, and my brother Michael.

ABSTRACT

Systems that intelligently interact with humans could improve people’s lives in numerous ways and in numerous settings, such as households, hospitals, and workplaces. Yet, developing algorithms that reliably and efficiently personalize their interactions with people in real-world environments remains challenging. In particular, one major difficulty lies in adapting to human-in-the-loop feedback, in which an algorithm makes sequential decisions while receiving online feedback from humans; throughout this interaction, the algorithm seeks to optimize its decision-making quality, as measured by the utility of its performance to the human users. Such algorithms must balance between exploration and exploitation: on one hand, the algorithm must select uncertain strategies to fully explore the environment and the interacting human’s preferences, while on the other hand, it must exploit the empirically-best-performing strategies to maximize its cumulative performance.

Learning from human feedback can be difficult, as people are often unreliable in specifying numerical scores. In contrast, humans can often more accurately provide various types of qualitative feedback, for instance pairwise preferences. Yet, sample efficiency is a significant concern in human-in-the-loop settings, as qualitative feedback is less informative than absolute metrics, and algorithms can typically pose only limited queries to human users. Thus, there is a need to create theoretically-grounded online learning algorithms that efficiently, reliably, and robustly optimize their interactions with humans while learning from online qualitative feedback.

This dissertation makes several contributions to algorithm design for human-in-the-loop learning. Firstly, this work develops the Dueling Posterior Sampling (DPS) algorithmic framework, a model-based, Bayesian approach for online learning in the settings of preference-based and generalized linear dueling bandits. DPS is developed together with a theoretical regret analysis framework, and yields competitive empirical performance in a range of simulations. Additionally, this thesis presents the CoSpar and LineCoSpar algorithms for sample-efficient, mixed-initiative learning from pairwise preferences and coactive feedback. CoSpar and LineCoSpar are both deployed in human subject experiments with a lower-body exoskeleton to identify optimal, user-preferred exoskeleton walking gaits. This work presents the first demonstration of preference-based learning for optimizing dynamic crutchless exoskeleton walking for user comfort, and makes progress toward customizing exoskeletons and other assistive devices for individual users.

PUBLISHED CONTENT AND CONTRIBUTIONS

Novoseller, Ellen R. et al. “Dueling posterior sampling for preference-based reinforcement learning.” In: Conference on Uncertainty in Artificial Intelligence (UAI). PMLR. 2020, pp. 1029–1038. url: http://proceedings.mlr.press/v124/novoseller20a.html.
E.R.N. contributed to the conception of the project, developing the algorithm, performing the theoretical analysis, conducting the simulation experiments, and writing the manuscript.

Tucker, Maegan, Myra Cheng, et al. “Human preference-based learning for high-dimensional optimization of exoskeleton walking gaits.” In: IEEE International Conference on Intelligent Robots and Systems (IROS). 2020. url: https://arxiv.org/pdf/2003.06495.pdf.
E.R.N. contributed to the conception of the project, developing the algorithm, providing ongoing mentorship and direction for conducting the simulations, conducting the exoskeleton experiments, and writing the manuscript.

Tucker, Maegan, Ellen R. Novoseller, et al. “Preference-based learning for exoskeleton gait optimization.” In: IEEE International Conference on Robotics and Automation (ICRA). 2020. doi: 10.1109/ICRA40945.2020.9196661. url: https://ieeexplore.ieee.org/document/9196661.
E.R.N. contributed to the conception of the project, developing the algorithm, conducting the simulation and exoskeleton experiments, analyzing the experimental results, and writing the manuscript.

CONTENTS

Acknowledgements ...... iii
Abstract ...... iv
Published Content and Contributions ...... v
Contents ...... vi
List of Figures ...... viii
List of Tables ...... xvii
Chapter I: Introduction ...... 1
1.1 Motivation ...... 2
1.2 The Bandit and Reinforcement Learning Problems ...... 3
1.3 Human-in-the-Loop Learning ...... 4
1.4 Lower-Body Exoskeletons for Mobility Assistance ...... 6
1.5 Contributions ...... 8
1.6 Organization ...... 10
Chapter II: Background ...... 11
2.1 Bayesian Inference for Parameter Estimation ...... 11
2.2 Gaussian Processes ...... 16
2.3 Entropy, Mutual Information, and Kullback-Leibler Divergence ...... 22
2.4 Bandit Learning ...... 24
2.5 Dueling Bandits ...... 36
2.6 Episodic Reinforcement Learning ...... 41
Chapter III: The Preference-Based Generalized Linear Bandit and Reinforcement Learning Problem Settings ...... 48
3.1 The Generalized Linear Dueling Bandit Problem Setting ...... 48
3.2 The Preference-Based Reinforcement Learning Problem Setting ...... 51
3.3 Comparing the Preference-Based Generalized Linear Bandit and RL Settings ...... 54
Chapter IV: Dueling Posterior Sampling for Preference-Based Bandits and Reinforcement Learning ...... 55
4.1 The Dueling Posterior Sampling Algorithm ...... 55
4.2 Additional Notation ...... 58
4.3 Posterior Modeling for Utility Inference and Credit Assignment ...... 58
4.4 Theoretical Analysis ...... 64
4.5 Empirical Performance of DPS ...... 95
4.6 Discussion ...... 102
Chapter V: Mixed-Initiative Learning for Exoskeleton Gait Optimization ...... 103
5.1 Introduction ...... 103
5.2 Background on the Atalante Exoskeleton and Gait Generation for Bipedal Robots ...... 106
5.3 The CoSpar Algorithm for Preference-Based Learning ...... 109
5.4 Simulation Results for CoSpar ...... 114
5.5 Deployment of CoSpar in Human Subject Exoskeleton Experiments ...... 117
5.6 The LineCoSpar Algorithm for High-Dimensional Preference-Based Learning ...... 119
5.7 Performance of LineCoSpar in Simulation ...... 125
5.8 Deployment of LineCoSpar in Human Subject Exoskeleton Experiments ...... 127
5.9 Discussion ...... 129
Chapter VI: Conclusions and Future Directions ...... 131
6.1 Conclusion ...... 131
6.2 Future Work ...... 132
Bibliography ...... 135
Appendix A: Models for Utility Inference and Credit Assignment ...... 149
A.1 Bayesian Linear Regression ...... 149
A.2 Gaussian Process Regression ...... 149
A.3 Gaussian Process Preference Model ...... 155
Appendix B: Proofs of Asymptotic Consistency for Dueling Posterior Sampling ...... 158
B.1 Facts about Convergence in Distribution ...... 160
B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference-Based RL Setting ...... 160
B.3 Asymptotic Consistency of the Utilities in DPS ...... 171
B.4 Asymptotic Consistency of the Selected Policies in DPS ...... 190
Appendix C: Additional Details about the Dueling Posterior Sampling Experiments in the Linear and Logistic Dueling Bandit Settings ...... 192
C.1 Baselines: Sparring with Upper Confidence Bound (UCB) Algorithms ...... 192
C.2 Hyperparameter Optimization ...... 196
Appendix D: Additional Details about the Dueling Posterior Sampling Experiments in the Preference-Based RL Setting ...... 199

LIST OF FIGURES

1.1 The Atalante exoskeleton, designed by Wandercraft (Wandercraft, n.d.). ...... 8
4.1 Comparison of Poisson disk sampling and uniform random sampling over the surface of the 3-dimensional unit sphere. Both plots show 100 samples. While the uniformly random samples often cluster together, the Poisson disk samples are more uniformly spaced over the sphere’s surface. ...... 84
4.2 Cumulative regret and estimated information ratio values in the linear bandit setting with relative Gaussian feedback over pairs of actions. Values are plotted over the entire learning process for three representative experimental repetitions (colors are identical for corresponding experiment runs between the two plots). These three experiments use d = 3, λ = 1, σ = 1, and A = 100. ...... 85
4.3 Scatter plots of the maximum estimated information ratio value in each of the 960 relative Gaussian bandit simulations (each dot corresponds to one of these simulations). All 960 trials are shown in each of the three plots. Values are color-coded according to: a) the action space size A, b) σ, and c) λ. ...... 86
4.4 Estimating the information ratio with different numbers of MCMC warm-up steps. In each of the five DPS runs with 100 learning iterations each, the information ratio is independently estimated six times per learning iteration, once with each of the following numbers of warm-up samples: {0, 5, 10, 50, 100, 500}. On the plot, each color corresponds to one of the five simulation runs (i.e., for each run, the six different numbers of warm-up samples are plotted in the same color). Clearly, as the similarly-colored lines mostly overlap, the number of warm-up samples does not heavily impact the information ratio estimates. ...... 87
4.5 Corner plots comparing posterior samples from the Laplace approximation and MCMC. The prior over utility vectors r = [r_1, r_2, r_3]^T is only supported on ||r||_2 ≤ 1. a)-b) Case 1: r = [0.7, 0.5, 0.2]^T. The true posterior is close to Gaussian, so the two approximations appear similar. c)-d) Case 2: r = [0.8, 0.5, 0.1]^T, which is closer to the boundary of allowable utility vectors, ||r||_2 = 1. Because the true posterior is highly non-Gaussian, the two approximations are visibly different (compare r_1 versus r_2 in the plots). Importantly, the MCMC posterior respects the constraint ||r||_2 ≤ 1, while the Laplace approximation does not. ...... 89
4.6 Corner plots comparing posterior samples from the Laplace approximation and MCMC under small numbers of samples. The Laplace and MCMC posteriors are visibly different, as the utility posterior is highly non-Gaussian. The preference data was generated by running DPS with d = 3, c = 4, λ = 1, and A = 10. ...... 90
4.7 Cumulative regret and estimated information ratio values in the preference-based linear bandit setting with posterior modeling via MCMC. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for corresponding experimental runs between the two plots). The experiments in this plot use d = 10, λ = 1, and A = 10. ...... 91
4.8 Scatter plot of the maximum information ratio estimates when running DPS in the linear dueling bandit setting with MCMC posterior inference. Each dot corresponds to one of 120 simulation runs, and values are color-coded according to the action space size, A. ...... 91
4.9 Cumulative regret and information ratio estimates in the RL setting with relative feedback, known dynamics, and posterior inference via relative Gaussian feedback. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for each experimental run between the two plots). These experiments utilize λ = 1, σ = 10, and MDP structure 1 (see Table 4.1), for which S = 3, A = 2, and h = 4. ...... 93
4.10 Scatter plots of the maximum information ratio estimates in each of the 600 relative Gaussian RL simulations (each dot corresponds to one of the 600 simulations). The x-axis labels indicate the MDP labels 0-4 (from Table 4.1) and the values of d = SA. All 600 trials are shown in both of the two plots. Values are color-coded according to: a) σ and b) λ. In all cases, the information ratios are significantly less than d. ...... 93
4.11 Cumulative regret and estimated information ratio values in the preference-based RL setting with known dynamics and posterior inference via MCMC. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for corresponding simulation runs between the two plots). These experiments use λ = 1 and MDP structure 4 (see Table 4.1), for which S = 3, A = 3, h = 2. The value of c is set to 2√2 h = 4√2. ...... 94
4.12 Scatter plot of the maximum information ratio estimates for the preference-based RL setting with known dynamics and MCMC posterior inference. Each dot corresponds to one of the 80 simulation runs, with the maximum taken over the 1,000 learning iterations. The x-axis labels indicate the MDP structure labels (from Table 4.1) and the values of d = SA. In all cases, the information ratios are significantly less than d. ...... 95
4.13 Empirical performance of DPS in the generalized linear dueling bandit setting (mean ± std over 100 runs). The plots show DPS with utility inference via both Bayesian logistic and linear regression, and under both logistic (a-i) and linear (j-l) user feedback noise. The plots normalize the rewards for each simulation run such that the best and worst actions in A have rewards of 1 and 0, respectively. For both baselines, the hyperparameters were optimized independently in each of the 12 cases, while for linear and logistic DPS, the plots depict results for a single set of well-performing hyperparameters. DPS is competitive against both baselines and is robust to the utility inference method and to noise in the preference feedback. ...... 98
4.14 Empirical performance of DPS; each simulated environment is shown under the two least-noisy user preference models evaluated. The plots show DPS with three credit assignment models: Gaussian process regression (GPR), Bayesian linear regression, and a Gaussian process preference model. PSRL is an upper bound that receives numerical rewards, while EPMC is a baseline. Plots display the mean +/- one standard deviation over 100 runs of each algorithm tested. Results from the remaining user noise parameters are plotted in Appendix D. For RiverSwim and Random MDPs, normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model. ...... 101
5.1 Atalante Exoskeleton with and without a user. The user is wearing a mask to measure metabolic expenditure. ...... 104
5.2 Human subject experiments with the LineCoSpar algorithm exploring six exoskeleton gait parameters: step length, step duration, step width, maximum step height, pelvis roll, and pelvis pitch. ...... 106
5.3 Leftmost: Cost of transport (COT) for the compass-gait biped at different step lengths and a fixed 0.2 m/s velocity. Remaining plots: posterior utility estimates of CoSpar (n = 2, b = 0; without coactive feedback) after varying iterations of learning (posterior mean +/- 2 standard deviations). The plots each show three posterior samples, which lie in the high-confidence region (mean +/- 2 standard deviations) with high probability. The posterior utility estimate quickly converges to identifying the optimal action. ...... 115
5.4 CoSpar models two-dimensional utility functions using preference data. a) Example of a synthetic 2D objective function. b) Utility model posterior learned after 150 iterations of CoSpar in simulation (n = 1; b = 1; coactive feedback). CoSpar prioritizes identifying and exploring the optimal region, rather than learning a globally-accurate utility landscape. ...... 116
5.5 CoSpar simulation results on 100 2D synthetic objective functions, comparing CoSpar with and without coactive feedback for three settings of the parameters n and b (see Algorithm 13). Mean +/- standard error of the objective values achieved over the 100 repetitions. The maximal and minimal objective function values are normalized to 0 and 1. We see that coactive feedback always helps, and that n = 2, b = 0, which receives the fewest preferences, performs worst. ...... 117
5.6 Experimental results for optimizing step length with three subjects (one row per subject). Columns 1-4 illustrate the evolution of the preference model posterior (mean +/- standard deviation), shown at various trials. CoSpar converges to similar but distinct optimal gaits for different subjects. Column 5 depicts the subjects’ blind ranking of the three gaits executed after 20 trials. The rightmost column displays the experimental trials in chronological order, with the background depicting the posterior preference mean at each step length. CoSpar draws more samples in the region of higher posterior preference. ...... 119
5.7 Experimental results from two-dimensional feature spaces (top row: step length and duration; bottom row: step length and width). Columns 1-4 illustrate the evolution of the preference model’s posterior mean. Column 4 also shows the subject’s blind rankings of the three gaits executed after 20 trials. Column 5 depicts the experimental trials in chronological order, with the background as in Figure 5.6. CoSpar draws more samples in the region of higher posterior preference. ...... 120
5.8 Experimental phase diagrams of the left leg joints over 10 seconds of walking. The gaits shown correspond to the maximum, mean, and minimum preference posterior values for both of subject 1’s 2D experiments. For instance, Subject 1 preferred gaits with longer step lengths, as shown by the larger range in sagittal hip angles in the phase diagram. ...... 120
5.9 Curse of dimensionality for CoSpar. Average time per iteration of CoSpar versus LineCoSpar. The y-axis is on a logarithmic scale. For LineCoSpar, the time is roughly constant in the number of dimensions d, while the runtime of CoSpar increases exponentially. For d = 4, the duration of a CoSpar iteration is inconvenient in the human-in-the-loop learning setting, and for d ≥ 5, it is intractable. ...... 121
5.10 Convergence to higher objective values on standard benchmarks. Mean objective value ± standard deviation using H3 and H6, averaged over 100 runs. Compared to CoSpar, LineCoSpar converges to sampling actions with higher objective values at a faster rate, as it employs an improved sampling approach and link function. It is intractable to run CoSpar on a 6-dimensional action space. ...... 126
5.11 Robustness to noisy preferences. Mean objective value ± standard deviation of the action x_max with the highest posterior utility. This is averaged over 100 runs of LineCoSpar on H6 with varying preference noise, as quantified by c_h. Higher performance correlates with less noise (lower c_h). The algorithm is robust to noise up to a certain degree (c_h ≤ 0.5). ...... 126
5.12 Coactive feedback improves convergence of LineCoSpar. Mean objective value ± standard deviation of the sampled actions using random objective functions. Results are averaged over 1,000 runs of LineCoSpar over 100 randomly-generated six-dimensional functions (d = 6; 10 runs per synthetic function). The sampled actions converge to high objective values in relatively few iterations, and coactive feedback accelerates this process. ...... 127
5.13 LineCoSpar experimental procedure. After setup of the subject-exoskeleton system, subjects were queried for preferences between all consecutive gait pairs, along with coactive feedback, in 30 gait trials (in total, at most 29 pairwise preferences and 30 pieces of coactive feedback). After these 30 trials, the subject unknowingly entered the validation portion of the experiment, in which he/she validated the posterior-maximizing gait, x_max, against four randomly-selected gaits. ...... 128
5.14 Exploration versus exploitation in the LineCoSpar human trials. Each row depicts the distribution of a particular gait parameter’s values across all gaits that the subject tested. Each dimension is discretized into 10 bins. Note that the algorithm explores different parts of the action space for each subject. These visitation frequencies exhibit a statistically-significant correlation with the posterior utilities across these regions (Pearson’s p-value = 1.22e-10). ...... 129
C.1 Hyperparameter sensitivity of DPS in the linear and logistic dueling bandit settings (mean ± std over 20 runs). The plots show DPS with utility inference via both Bayesian linear (a-c) and logistic (d-f) regression. The learning curves compare several sets of hyperparameters among those evaluated, for several different action space dimensionalities d and levels of logistic user feedback noise c. The plots normalize the rewards for each simulation run such that the best action in A has a reward of 1, while the worst action has a reward of 0. Overall, DPS performs well and is robust to the hyperparameter values to a certain degree. ...... 198
D.1 Empirical performance of DPS in the RiverSwim environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model. ...... 200
D.2 Empirical performance of DPS in the Random MDP environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model. ...... 203
D.3 Empirical performance of DPS in the Mountain Car environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Overall, DPS performs well and is robust to the choice of credit assignment model. ...... 203
D.4 Empirical performance of DPS in the RiverSwim environment for different hyperparameter combinations. Plots display mean +/- one standard deviation over 30 runs of each algorithm tested with logistic user noise and c = 0.001. Overall, DPS is robust to the choice of hyperparameters. The hyperparameter values depicted in each plot are (from left to right): for Bayesian linear regression, (σ, λ) = {(0.5, 0.1), (0.5, 10), (0.1, 0.1), (0.1, 10), (1, 0.1)}; for GP regression, (σ_f², σ_n²) = {(0.1, 0.001), (0.1, 0.1), (0.01, 0.001), (0.001, 0.0001), (0.5, 0.1)}; for Bayesian logistic regression (special case of the GP preference model), (λ, α) = {(1, 1), (30, 1), (20, 0.5), (1, 0.5), (30, 0.1)}; and additionally for the GP preference model, c ∈ {0.5, 1, 2, 5, 13}. See Table D.2 for the values of any hyperparameters not specifically mentioned here. ...... 205
D.5 Empirical performance of DPS in the Random MDP environment for different hyperparameter combinations. Plots display mean +/- one standard deviation over 30 runs of each algorithm tested with logistic user noise and c = 0.001. Overall, DPS is robust to the choice of hyperparameters. The hyperparameter values depicted in each plot are (from left to right): for Bayesian linear regression, (σ, λ) = {(0.1, 10), (0.1, 0.1), (0.05, 0.01), (0.5, 20), (1, 10)}; for GP regression, (σ_f², σ_n²) = {(0.05, 0.0005), (0.001, 0.0001), (0.05, 0.1), (0.001, 0.0005), (1, 0.1)}; for Bayesian logistic regression (special case of the GP preference model), (λ, α) = {(0.1, 0.01), (1, 0.01), (0.1, 1), (30, 0.1), (5, 0.5)}; and additionally for the GP preference model, c ∈ {1, 10, 15, 19, 100}. See Table D.3 for the values of any hyperparameters not specifically mentioned here. ...... 206
D.6 Empirical performance of DPS in the Mountain Car environment for different hyperparameter combinations. Plots display mean +/- one standard deviation over 30 runs of each algorithm tested with logistic user noise and c = 0.001. Overall, DPS is robust to the choice of hyperparameters. The hyperparameter values depicted in each plot are (from left to right): for Bayesian linear regression, (σ, λ) = {(10, 1), (10, 10), (30, 0.001), (0.001, 10), (0.1, 0.1)}; for GP regression, (σ_f², ℓ, σ_n²) = {(0.01, 2, 10⁻⁵), (0.01, 1, 10⁻⁵), (0.1, 2, 0.01), (1, 2, 0.001), (0.001, 3, 10⁻⁶)}; for Bayesian logistic regression (special case of the GP preference model), (λ, α) = {(0.0001, 0.01), (0.1, 0.01), (0.0001, 0.0001), (0.001, 0.0001), (0.001, 0.01)}; and additionally for the GP preference model, c ∈ {10, 300, 400, 700, 1000}. See Table D.4 for the values of any hyperparameters not specifically mentioned here. ...... 207

LIST OF TABLES

4.1 The labels 0-4 are assigned to the five MDP structures used in the RL information ratio simulations. For each of the five MDP structures, this table specifies the number of states S, the number of actions A, and the episode horizon h. For each (S, A, h) triple, the total numbers of deterministic policies and of distinct policy pairs (on which information ratio computations depend) are also displayed. ...... 92

5.1 Gait parameters optimizing LineCoSpar’s posterior mean (x_max) for each able-bodied subject. ...... 130
C.1 Hyperparameters in the linear and logistic dueling bandit experiments. For the linear and logistic UCB algorithms, the hyperparameters were optimized individually for each dimension d and preference noise level. For each d, the best-performing hyperparameters are listed in the following order: logistic noise with c = 0.01, logistic noise with c = 0.1, logistic noise with c = 1, and linear noise with c = 4. For each of the two versions of DPS, a single set of hyperparameters was found that performed well across the different values of d and noise levels considered. Hyperparameters were optimized by performing 20 simulation runs of each candidate set of values. ...... 197
D.1 Hyperparameters for the EPMC baseline algorithm (Wirth and Fürnkranz, 2013a). Each table element shows the best-performing α/η values for the corresponding simulation environment and type of simulated user feedback (logistic or linear noise). For preference feedback with logistic noise, values of c are given in parentheses; larger values correspond to noisier preference feedback. ...... 201
D.2 Credit assignment hyperparameters tested for the RiverSwim Environment. ...... 202
D.3 Credit assignment hyperparameters tested for the Random MDP Environment. ...... 202
D.4 Credit assignment hyperparameters tested for the Mountain Car Environment. ...... 204

Chapter 1

INTRODUCTION

We are quickly entering an era where digital and cyberphysical systems are increasingly autonomous and can adapt to the specific needs of individuals. In particular, there is an increasing focus on personalized medicine (Sui, Yue, and Burdick, 2017; Sui, Zhuang, et al., 2018), autonomous driving (Basu, Bıyık, et al., 2019; Basu, Yang, et al., 2017), and personalized education (Jain, Thiagarajan, et al., 2020). Furthermore, intelligent robotic assistive devices that can adapt to each user’s limitations could give increased independence to individuals with disabilities or limited mobility (Harib et al., 2018; Donati et al., 2016).

Yet, creating principled algorithms that efficiently and reliably personalize their interactions with people in real-world settings remains challenging. In particular, one difficulty lies in adapting to human feedback via human-in-the-loop learning, in which the algorithm seeks to optimize its interactions with people while receiving sequential feedback from users.

The problem of sequential interaction with an unknown environment has been studied extensively in the contexts of bandit learning and reinforcement learning (RL). In these settings, an agent takes actions in the environment and typically receives a reward signal reflecting the quality of those actions. The agent’s learning goal is to maximize its cumulative reward over time, or equivalently, to minimize its regret, that is, the gap between the algorithm’s total reward and the total reward obtained by executing the optimal action sequence.

To achieve low regret, algorithms must balance between exploration and exploitation: locating optimal actions requires exploring actions with uncertain rewards, while maximizing cumulative reward also requires exploiting the empirically-best-performing actions.

In practice, people are typically unreliable in specifying numerical scores, but can give more accurate qualitative information, for instance correctly answering preference queries of the form, “Do you prefer trial A or B?” (Yue, Broder, et al., 2012), or suggesting improvements to the presented trials, for instance via the coactive feedback learning paradigm (Shivaswamy and Joachims, 2015).

This dissertation tackles the problem of online learning from sequential, qualitative human feedback in order to minimize regret with respect to user satisfaction. Learning from qualitative feedback is challenging, as such data contains less information than numerical rewards; for instance, a pairwise preference query yields only a single bit of information. Furthermore, human-in-the-loop learning algorithms can collect only limited data in practice, as each trial involves interacting with a human. Human patience, energy, resources, and attention have limits. For instance, the total time available in a clinical laboratory that can be directed to the process of adapting an assistive medical device to a patient may be quite limited due to cost. For these reasons, sample efficiency is a significant concern when learning from online human feedback.

This thesis presents the Dueling Posterior Sampling (DPS) algorithmic framework for online learning in the preference-based RL and generalized linear bandit settings, together with a concurrently-developed theoretical analysis framework for bounding the regret. This model-based approach learns a Bayesian posterior over the utility function underlying the user’s feedback, which quantifies the user’s satisfaction with the agent’s task performance. Furthermore, this work presents the CoSpar framework for mixed-initiative learning in the bandit setting. The CoSpar algorithm is deployed on a robotic exoskeleton platform to optimize exoskeleton walking gaits in response to online human feedback, and achieves state-of-the-art performance in personalized exoskeleton gait optimization.

1.1 Motivation
Robots that interact intelligently with humans could improve people’s lives in many capacities, from performing everyday household tasks to improving surgical outcomes for patients to assisting individuals with mobility impairments. Such devices must interact with human users intuitively and adapt to people’s preferences quickly, reliably, and safely.

Many online learning algorithms operate within the multi-armed bandits or reinforcement learning (RL) frameworks, which formalize the problem of learning to maximize reward feedback over time. Yet, many bandit and RL algorithms rely upon receiving a numerical reward signal, and humans are often unreliable in specifying such absolute reward feedback. In practice, human-specified scores may drift over time (Payne et al., 1993; Brochu, Brochu, and de Freitas, 2010), while misspecified rewards can exhibit unintended side effects—for example, a robot might knock over a vase to clean an area more quickly—and reward hacking, in which the agent “games” the objective function, finding a solution that achieves a high reward but perverts the human’s intent (Amodei et al., 2016). For instance, in OpenAI’s widely-cited boat racing example (Clark and Amodei, 2016), the agent fails to finish the race course as intended; rather, it maximizes its reward by turning in a large circle repeatedly, each time hitting three point-yielding targets just as they regenerate, crashing into other objects, and catching on fire. In another example (Popov et al., 2017), an agent fails to complete a simulated robotic block stacking task because it is rewarded for raising the vertical coordinate of the block’s ground-facing side, and learns to do so by flipping the block upside down (instead of picking it up).

While humans can have difficulty assigning accurate numerical scores, people can often give qualitative feedback more reliably. Examples include preference-based feedback, which poses queries of the form, “Do you prefer trial A or B?” (Yue, Broder, et al., 2012; Sui, Zhuang, et al., 2017; Sui, Zoghi, et al., 2018); coactive feedback, in which the user suggests an improvement following each trial (Shivaswamy and Joachims, 2015; Shivaswamy and Joachims, 2012; Raman et al., 2013); and ordinal feedback, in which users assign each trial to one of several discrete, ordered categories (e.g., “bad,” “neutral,” and “good”) (Chu and Ghahramani, 2005a).

Thus, there is an increasing need to develop systematic algorithmic frameworks for integrating and learning from different modalities of user feedback that are both theoretically-grounded and practical in real-world settings.

1.2 The Bandit and Reinforcement Learning Problems
This thesis considers human-in-the-loop learning problems that build upon the standard bandit and reinforcement learning (RL) problem settings. Bandits and RL are both sequential decision-making problems in which a learning agent takes actions in an environment while seeking to maximize a numerical reward signal.

Firstly, in the standard multi-armed bandit setting (Robbins, 1952; Agrawal and Goyal, 2012), an agent sequentially selects actions, and after each action it receives a numerical reward reflecting the quality of that action. The agent’s goal is to minimize its regret, that is, the gap between the algorithm’s total reward and the reward obtained by repeatedly selecting the optimal action. Achieving low regret requires both choosing the empirically-most-promising actions (exploitation) and selecting actions with uncertain rewards, in case the optimal action has not yet been discovered (exploration).
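In symbols, one common way to express this regret over T rounds (the notation here is purely illustrative; the formal definitions used in this thesis appear in Chapter 2) is:

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \left[ r(a^{*}) - r(a_t) \right], \qquad a^{*} = \operatorname*{argmax}_{a \in \mathcal{A}} r(a),$$

where r(a) denotes the expected reward of action a, a_t is the action selected in round t, and A is the set of available actions.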

In the RL setting, meanwhile, the agent similarly takes actions and observes rewards, but in addition to generating rewards, these actions alter the environment’s underlying state. After each step, the agent not only receives the reward signal, but also observes the environment’s state. If the environment only has one state, then RL reduces to the bandit problem. With more than one environment state, however, the decision-making process must account for the state transition dynamics as well as the rewards.

A number of strategies have been proposed for tackling the exploration-exploitation trade-off in both the bandit and RL settings, including upper confidence bound algorithms that predict reward information optimistically (Auer, Cesa-Bianchi, and Fischer, 2002; Dann and Brunskill, 2015; Abbasi-Yadkori, Pál, and Szepesvári, 2011); posterior sampling, which learns a Bayesian model over the environment and samples from it to select actions (Agrawal and Goyal, 2012; Agrawal and Jia, 2017; Abeille and Lazaric, 2017; Osband, Russo, and Van Roy, 2013); and information-theoretic approaches, which trade off between choosing actions with high estimated rewards and actions expected to yield significant new information (Russo and Van Roy, 2014a; Kirschner and Krause, 2018; Nikolov et al., 2018).
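As a concrete illustration of the posterior sampling idea in the simplest numerical-reward case, the following sketch runs Thompson sampling on a Bernoulli bandit with Beta priors. The environment and all variable names are illustrative examples, not code from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the learner
K = len(true_means)
alpha = np.ones(K)   # Beta prior: pseudo-counts of successes per arm
beta = np.ones(K)    # Beta prior: pseudo-counts of failures per arm

for t in range(1000):
    # Draw one plausible reward estimate per arm from the posterior, then act
    # greedily with respect to that sample: exploration comes from posterior
    # uncertainty, exploitation from the argmax.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```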

1.3 Human-in-the-Loop Learning
This work considers several different paradigms of online learning from humans. Unlike in the standard bandit and RL settings, the feedback signal is no longer numerical, but rather qualitative. For example, in the dueling bandit setting (Yue, Broder, et al., 2012; Yue and Joachims, 2011; Sui, Zhuang, et al., 2017; Sui, Zoghi, et al., 2018), the agent chooses at least two actions in each learning iteration and receives feedback in the form of pairwise preferences between the selected actions. The notion of regret can be adapted to the dueling bandits problem: similarly to the numerical-reward setting, the agent must balance between exploration and exploitation to minimize the gap between its performance and that achieved by repeatedly selecting the most-preferred actions.

Meanwhile, in the episodic preference-based RL setting, the agent executes trajectories of interaction with the environment, and receives pairwise preference feedback revealing which of the trajectories are preferred. The RL problem is more challenging than the bandit problem, since instead of selecting individual actions, the agent selects (potentially-stochastic) policies that govern action selection as a function of the environment’s state. Thus, RL policies can be viewed analogously to actions in the bandit setting. The environment’s dynamics then stochastically translate the agent’s policies to the observed trajectories. While in the standard RL problem, the agent receives rewards after every action, in preference-based RL, the agent only receives preferences between entire trajectories of interaction.
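To make the trajectory-level feedback concrete, the sketch below simulates a single preference query under a logistic (Bradley-Terry-style) noise model, in which the probability of preferring one trajectory grows with its advantage in total hidden utility. The utility function, the temperature c, and all names here are illustrative assumptions, not the exact preference model defined later in this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def total_utility(trajectory, reward_fn):
    """Sum the hidden per-step utilities over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def preference_query(traj_a, traj_b, reward_fn, c=1.0):
    """Return 1 if traj_a is preferred and 0 otherwise, under logistic noise with temperature c."""
    diff = total_utility(traj_a, reward_fn) - total_utility(traj_b, reward_fn)
    p_prefer_a = 1.0 / (1.0 + np.exp(-diff / c))
    return int(rng.random() < p_prefer_a)

# Toy example: states and actions are integers; the hidden utility favors larger actions.
reward_fn = lambda s, a: 0.1 * a
traj_a = [(0, 2), (1, 2), (2, 2)]
traj_b = [(0, 0), (1, 1), (2, 0)]
print(preference_query(traj_a, traj_b, reward_fn))  # usually prefers traj_a
```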

Learning from trajectory-level preferences is in general a very challenging problem, as information about the rewards is sparse (often just one bit), is only relative to the pair of trajectories being compared, and does not explicitly include information about actions within the trajectories. One approach is to infer the utilities of individual state-action pairs by solving the temporal credit assignment problem (Akrour, Schoenauer, and Sebag, 2012; Zoghi, Whiteson, Munos, et al., 2014; Szörényi et al., 2015; Christiano et al., 2017; Wirth, Fürnkranz, and Neumann, 2016; Wirth, Akrour, et al., 2017), i.e., determining which of the encountered states and actions are responsible for the trajectory-level preference feedback.

Preference-based learning algorithms have seen success in a number of domains. In the bandit setting, preference-based algorithms have been deployed in several real-world applications, including optimizing spinal cord injury therapy in clinical trials (Sui, Zhuang, et al., 2018; Sui, Yue, and Burdick, 2017; Sui and Burdick, 2014), learning search result rankings in the information retrieval setting (Yue, Finley, et al., 2007; Radlinski and Joachims, 2005), and optimizing parameters in computer graphics design (Brochu, de Freitas, and Ghosh, 2008; Brochu, Brochu, and de Freitas, 2010).

Preference-based RL algorithms, meanwhile, have demonstrated successful performance in applications including Atari games and the Mujoco environment (Christiano et al., 2017; Ibarz et al., 2018), learning human preferences for autonomous driving (Sadigh et al., 2017; Bıyık, Huynh, et al., 2020), and selecting a robot’s controller parameters (Kupcsik, Hsu, and Lee, 2018; Akrour, Schoenauer, Sebag, and Souplet, 2014). Yet, there remains a lack of formal frameworks for theoretical analysis of preference-based RL algorithms.

Much of the existing work in preference-based RL focuses on a different setting from that considered in this thesis. While this work is concerned with online regret minimization, several existing algorithms instead minimize preference queries to the user (Christiano et al., 2017; Wirth, Fürnkranz, and Neumann, 2016). Such algorithms typically assume that many simulations can be executed inexpensively between preference queries. In many domains, however, experimentation is as time-intensive as preference elicitation; for instance, in adaptive experiment design and human-robot interaction, it could be infeasible to accurately simulate the environment.

As mentioned previously, sample complexity is an important concern when learning from humans. Data collection is expensive, as the algorithm can only pose a limited number of queries to the user. For instance, Brochu, de Freitas, and Ghosh (2008) assert that “requiring more than 50 user queries in a real application would be unacceptable.” One approach toward acquiring more data under a fixed trial budget is to learn simultaneously from more than one type of user feedback, for instance via mixed-initiative systems.

In particular, this work considers combining preferences with coactive feedback, in which the user gives suggestions in response to each action that the algorithm selects. Coactive feedback has been applied to web search and movie recommendation tasks in simulation (Shivaswamy and Joachims, 2015), as well as to optimize trajectory planning in robotic manipulation tasks such as grocery store checkout (Jain, Sharma, et al., 2015).

1.4 Lower-Body Exoskeletons for Mobility Assistance
In the United States alone, various forms of paralysis affect nearly 5.4 million people (Armour et al., 2016), with major causes including stroke, spinal cord injury, multiple sclerosis, and cerebral palsy. While this group includes people with a range of movement difficulties, in particular (in the United States), there are currently approximately 300,000 individuals with severe spinal cord injuries (Facts and Figures at a Glance 2019), while every year, more than 795,000 people have a stroke (Stroke Facts 2020). Such individuals could benefit significantly from assistive devices that help to replace or restore lost motor function.

Lower-body exoskeletons are wearable assistive devices that can restore mobility to people suffering from lower-body mobility impairments such as paralysis.¹ Exoskeleton-assisted walking can benefit patients in several ways. Firstly, while wheelchairs are limited by obstacles such as stairs, exoskeletons are comparatively less encumbered. Furthermore, continuous wheelchair use can result in pressure sores and loss of bone mass. For instance, Goemaere et al. (1994) found that paraplegic patients who regularly perform passive weightbearing standing with the aid of a standing device have better-preserved bone mass in comparison to control subjects. In fact, a survey of wheelchair users and healthcare professionals working with mobility-impaired individuals (Wolff et al., 2014) identified “health benefits” as the most highly-recommended reason for using an exoskeleton. Respondents further indicated a number of health benefits associated with exoskeleton use, including pressure relief, increased circulation, improved bone density, improved bowel and bladder function, and reduced risk of orthostatic hypotension.

¹While exoskeletons can also be designed to augment function for healthy humans (Dollar and Herr, 2008), this dissertation is concerned with exoskeletons for assisting mobility-impaired individuals.

Finally, multiple studies have demonstrated that exoskeletons have significant potential to assist in patient rehabilitation following spinal cord injuries and strokes (Gad et al., 2017; Donati et al., 2016; Mehrholz et al., 2017). In Donati et al. (2016), eight patients with chronic paraplegia spent 12 months training with an exoskeleton, and regained voluntary motor control below the level of their spinal cord injuries; four of these patients were upgraded from a complete to an incomplete paraplegia classification. Mehrholz et al. (2017) found that the combination of exoskeleton-assisted gait training with physiotherapy improved stroke patients’ recovery of independent walking compared to physiotherapy alone.

To improve users’ quality of life, exoskeleton devices must be comfortable, affordable, safe, and enable users to engage in everyday activities. For lower-body exoskeletons, designing comfortable walking gaits is a significant challenge, due to the high cost of human trials, the enormous space of possible walking gaits, and the need to ensure safety during learning. Furthermore, exoskeleton walking must be personalized to each individual user.

This thesis applies human-in-the-loop learning to optimize gait parameters for the Atalante lower-body exoskeleton, designed by the French company Wandercraft, to identify user-personalized gaits that maximize comfort. The exoskeleton is pictured in Figure 1.1.

The Atalante exoskeleton, first introduced in Harib et al. (2018), has 18 degrees of freedom and 12 actuated joints: three at each hip, one at each knee, and two in each ankle. Gurriet, Finet, et al. (2018) describe the device’s mechanical components and control architecture in detail. Importantly, while other exoskeletons require users to rely upon crutches for balance and stability (Dollar and Herr, 2008), Atalante facilitates dynamically-stable, crutchless walking by leveraging the partial hybrid zero dynamics framework to formally generate stable gaits (Gurriet, Finet, et al., 2018).

Figure 1.1: The Atalante exoskeleton, designed by Wandercraft (Wandercraft, n.d.).

1.5 Contributions
This dissertation presents three main contributions to algorithm design for human-in-the-loop learning. Firstly, it develops the Dueling Posterior Sampling (DPS) framework for preference-based learning in the RL and generalized linear bandit settings. Secondly, this thesis presents the CoSpar and LineCoSpar learning frameworks for sample-efficient, mixed-initiative learning in the Gaussian process bandits regime. Thirdly, the CoSpar and LineCoSpar algorithms are deployed on the Atalante lower-body exoskeleton to learn personalized, user-preferred walking gaits. Each of these contributions is further described below.

The DPS algorithm uses preference-based posterior sampling to tackle the regret minimization problem in the Bayesian regime. Posterior sampling (Thompson, 1933) is a Bayesian model-based approach to balancing exploration and exploitation, which enables the algorithm to efficiently learn models of both the environment’s state transition dynamics and reward function. Previous work on posterior sampling in RL (Osband, Russo, and Van Roy, 2013; Gopalan and Mannor, 2015; Agrawal and Jia, 2017; Osband and Van Roy, 2017) is focused on learning from absolute rewards, while DPS extends posterior sampling to both elicit and learn from trajectory-level preference feedback.

To elicit preference feedback, at every episode of learning, DPS draws two independent samples from the posterior to generate two trajectories to duel against each other. This approach is inspired by the SelfSparring algorithm proposed for the bandit setting (Sui, Zhuang, et al., 2017), but has a quite different theoretical analysis, due to the need to incorporate structured preference feedback over RL trajectories and featurized actions.

DPS learns from preference feedback by internally maintaining a Bayesian model over the environment. In the RL setting, this approach models both the transition dynamics and rewards over state-action pairs. The reward model explains the preferences by solving the temporal credit assignment problem to determine which of the encountered states and actions are responsible for the trajectory-level preferences. This thesis presents several Bayesian approaches to credit assignment, including via Bayesian linear regression and two Gaussian process-based methods.

In the bandit setting, the DPS framework provides a low-dimensional structure for relating actions and user preferences in terms of the actions’ features (for instance, the features could represent RL policy parameters). Thus, DPS can learn over large or infinite action spaces, which would be infeasible for a learning algorithm that models the actions’ rewards independently of each other.
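The following sketch ties these pieces together for the featurized bandit case: two independent samples from a Bayesian linear regression posterior over per-feature utilities select the pair of actions to duel, and the posterior is then updated from the resulting preference label regressed onto the actions' feature difference. The modeling choices here (a Gaussian-noise regression on ±1 preference labels, a logistic simulated user, and the prior scale lam) are simplifying assumptions for illustration, not the exact DPS models analyzed in Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(2)

d, num_actions = 3, 20
features = rng.normal(size=(num_actions, d))   # phi(a) for each action
theta_true = rng.normal(size=d)                # hidden utility parameters

lam, noise_var = 1.0, 1.0                      # assumed prior precision and noise variance
X, y = np.empty((0, d)), np.empty(0)           # feature differences and preference labels

def posterior():
    """Gaussian posterior over theta for Bayesian linear regression on (X, y)."""
    cov = np.linalg.inv(lam * np.eye(d) + X.T @ X / noise_var)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

for t in range(200):
    mean, cov = posterior()
    # Two independent posterior samples choose the two actions to duel.
    s1 = rng.multivariate_normal(mean, cov)
    s2 = rng.multivariate_normal(mean, cov)
    a1 = int(np.argmax(features @ s1))
    a2 = int(np.argmax(features @ s2))
    # Simulated user: logistic preference noise on the true utility gap.
    gap = features[a1] @ theta_true - features[a2] @ theta_true
    prefer_a1 = rng.random() < 1.0 / (1.0 + np.exp(-gap))
    # Credit assignment data: a signed label regressed onto the feature difference.
    X = np.vstack([X, features[a1] - features[a2]])
    y = np.append(y, 1.0 if prefer_a1 else -1.0)

print("best action by posterior mean:", int(np.argmax(features @ posterior()[0])))
```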

Although the study of preference-based RL has seen increased interest in recent years (Christiano et al., 2017; Wirth, Akrour, et al., 2017; Ibarz et al., 2018), it remains an open challenge to design formal frameworks that admit tractable theoretical analysis. This thesis develops DPS concurrently with an analysis framework for characterizing regret convergence in the episodic RL setting, based upon information-theoretic techniques for bounding the Bayesian regret of posterior sampling (Russo and Van Roy, 2016). The analysis depends upon upper-bounding a quantity called the information ratio, which balances between the expected instantaneous regret and the information gained about the optimal action. This work conjectures an upper bound for the information ratio under a linear credit assignment model, which is supported by an extensive set of simulations that estimate the information ratio empirically. This result leads to a Bayesian no-regret rate for DPS under credit assignment via Bayesian linear regression.

In summary, concurrently with developing the DPS algorithm, this work mathematically integrates Bayesian credit assignment and preference elicitation within the conventional posterior sampling framework, evaluates several credit assignment models, and presents a regret analysis framework. Finally, experiments demonstrate that DPS delivers competitive performance empirically.

This thesis also presents the CoSpar and LineCoSpar frameworks for efficient, mixed-initiative learning, and applies them to optimize exoskeleton gaits for user comfort. This work was conducted jointly with Dr. Aaron Ames’ research group, and in particular with Maegan Tucker and Myra Cheng. Firstly, the CoSpar algorithm relies on Gaussian process modeling and posterior sampling to infer each action’s utility to the user and to select actions that achieve low regret under sequential feedback. To increase sample efficiency, CoSpar integrates preference and coactive feedback within a mixed-initiative system. CoSpar was experimentally deployed in human subject trials with the Atalante exoskeleton to determine users’ preferred walking parameters within a gait library. CoSpar successfully identified users’ preferred exoskeleton step lengths, durations, and widths, while also providing insights into the users’ preferences for certain gaits.

This thesis also describes the LineCoSpar algorithm, which incorporates techniques from high-dimensional Gaussian process Bayesian optimization (Kirschner, Mutny, et al., 2019) to jointly optimize over larger numbers of gait parameters. LineCoSpar delivers robust performance in simulation, and in human subject experiments, the algorithm jointly optimizes user preferences over six exoskeleton gait parameters. To our knowledge, LineCoSpar is the first algorithm for high-dimensional Gaussian process learning under preference feedback. Furthermore, this line of work demonstrates the first application of preference-based learning for optimizing dynamic crutchless walking. The results present progress for customizing exoskeleton walking for user comfort, as well as for gaining an understanding of the mechanisms underlying comfortable walking in an exoskeleton.

1.6 Organization
This dissertation is organized as follows. Chapter 2 reviews background material on Bayesian inference, Gaussian processes, information theory, the bandit and RL settings, and learning from qualitative feedback. Chapter 3 outlines the preference-based bandit and RL settings considered within the DPS framework, and Chapter 4 presents the DPS algorithm along with its theoretical analysis and experiments. Chapter 5 discusses mixed-initiative learning for exoskeleton gait optimization. Finally, Chapter 6 draws conclusions and considers possible avenues for future extension.

Chapter 2

BACKGROUND

This chapter reviews background material that will be helpful for understanding the following chapters. First, Sections 2.1-2.3 review mathematical concepts in Bayesian inference, Gaussian processes, and information theory, respectively. Next, Section 2.4 discusses bandit learning and Section 2.5 reviews preference-based bandit learning. Finally, Section 2.6 surveys work in RL and preference-based RL.

2.1 Bayesian Inference for Parameter Estimation
Probability Theory
This subsection briefly introduces the concepts of a probability space, conditional probability, and Bayes’ Theorem. A detailed exposition of these topics can be found in Grimmett and Stirzaker (2001).

Probability theory formalizes the notion of chance. For instance, the outcome of an experiment may be unknown in advance, and could be influenced by a number of random factors. To analyze such a situation, one can begin by listing all of the possible experimental outcomes, also known as events:

Definition 1 (Sample space). A sample space is the set of all possible outcomes of an experiment, and is denoted by Ω.

For example, if the experiment consists of flipping a coin, then there are two possible outcomes: heads and tails, denoted by H and T, respectively. Then, the sample space is given by Ω = {H, T}. In another example, if the experiment involves asking a person for a pairwise preference between two options (“Do you prefer A or B?”), then the sample space includes the two possible preferences that the person could give: Ω = {A ≻ B, B ≻ A}, where x ≻ y denotes a preference for x over y.

In order to later define probabilities over collections of events, one must define event collections that are closed under the operation of taking countable unions. Such a collection of subsets of Ω is called a σ-field, and is defined as follows:

Definition 2 (σ-field). A σ-field is a collection F of subsets of Ω that satisfies the following three conditions:

A) The empty set belongs to F, that is, ∅ ∈ F.

B) For events A_1, A_2, ..., if A_1, A_2, ... ∈ F, then ∪_{j=1}^∞ A_j ∈ F.

C) For any event A ∈ F, its complement Ω \ A also belongs to F.

For example, the smallest σ-field associated with any sample space Ω is the collection: F = {∅, Ω}. Meanwhile, the power set of Ω is the set of all possible subsets of Ω, and clearly must always be a σ-field. For instance, the power set of the earlier coin toss example, with Ω = {H, T}, is F = {∅, Ω, {H}, {T}}.

One can then define a probability measure over the members of a σ-field F:

Definition 3 (Probability measures and spaces). A probability measure P on the tuple (Ω, F) is a function P : F → [0, 1] that satisfies the following requirements:

A) The empty set has zero probability, and the entire sample space has a probability of one: P(∅) = 0 and P(Ω) = 1.

B) If A_1, A_2, ... is a collection of members of F that are disjoint, that is, A_j ∩ A_k = ∅ for all pairs j, k such that j ≠ k, then:

$$P\left( \bigcup_{j=1}^{\infty} A_j \right) = \sum_{j=1}^{\infty} P(A_j).$$

A probability space is a triple (Ω, F, P) consisting of a sample space Ω, a σ-field F of subsets of Ω, and a probability measure P on (Ω, F).
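As a minimal illustration of these definitions, the snippet below builds the fair-coin probability space from the earlier example and checks the additivity requirement on its (finite) σ-field; it is purely illustrative and introduces no notation used later.

```python
from itertools import chain, combinations

omega = frozenset({"H", "T"})
# The power set of omega is always a sigma-field.
sigma_field = [frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))]

def P(event):
    """Uniform (fair-coin) probability measure: each outcome has mass 1/2."""
    return len(event) / len(omega)

assert P(frozenset()) == 0 and P(omega) == 1
# Additivity over the disjoint events {H} and {T}:
assert P(frozenset({"H"}) | frozenset({"T"})) == P(frozenset({"H"})) + P(frozenset({"T"}))
print(sorted(len(e) for e in sigma_field))  # event sizes in the sigma-field: 0, 1, 1, 2
```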

In many cases, it is useful to express an event’s probability given the occurrence of a second event. This is formalized via the notion of conditional probability:

Definition 4 (Conditional probability). For two events A and B, if P(B) > 0, then the conditional probability that event A occurs given that event B occurs is defined as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \quad (2.1)$$

Symmetrically, it holds that P(B | A)P(A) = P(A ∩ B). By substituting this fact into the right-hand side of Eq. (2.1), one obtains Bayes’ Theorem, which states that for any two events A and B:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \quad (2.2)$$

Bayesian Inference
Assume that we aim to estimate a vector θ ∈ R^d of unknown parameters belonging to a model. One can use Bayesian inference to probabilistically estimate θ given the combination of data and any prior knowledge about θ. In Bayes’ Theorem, given by Eq. (2.2), replacing A by the model parameters θ and B by the observed dataset D yields:

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\,P(\theta)}{P(\mathcal{D})}. \quad (2.3)$$

In Eq. (2.3), the quantity P(θ | D) is known as the model posterior: this distribution yields the probability of each possible value of θ conditioned upon the observed dataset. Meanwhile, P(θ) is known as the prior, and represents the prior belief about θ before data is observed. The term P(D | θ) is called the likelihood, and quantifies the probability of observing a dataset D given particular model parameters θ. Finally, the denominator of Eq. (2.3) contains the evidence, that is, the probability of the dataset, P(D).

The maximum a posteriori estimate $\hat{\theta}$, or MAP estimate, is defined as the mode of the posterior distribution. This is the parameter vector $\theta$ which maximizes $P(\theta \mid \mathcal{D})$:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} P(\theta \mid \mathcal{D}) = \operatorname*{argmax}_{\theta} \frac{P(\mathcal{D} \mid \theta)P(\theta)}{P(\mathcal{D})} \overset{(a)}{=} \operatorname*{argmax}_{\theta} P(\mathcal{D} \mid \theta)P(\theta),$$

where (a) holds because the evidence $P(\mathcal{D})$ does not depend on $\theta$. In many posterior modeling situations, it is sufficient to use that $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)P(\theta)$, rather than explicitly calculating the evidence.

Posterior Inference with Conjugate Priors
When the prior and posterior belong to the same family of probability distributions, the prior $P(\theta)$ is said to be a conjugate prior for the likelihood, $P(\mathcal{D} \mid \theta)$. In such cases, the posterior $P(\theta \mid \mathcal{D})$ can be straightforwardly calculated in closed form. This section lists several examples of conjugate prior and likelihood pairs, which will appear at various future points in this dissertation.

In Murphy (2007), the author derives several conjugate priors for the Gaussian likelihood. Notably, the Gaussian distribution is self-conjugate, such that a Gaussian prior and likelihood result in a Gaussian posterior. For instance, consider the problem of estimating the mean $\mu$ of a univariate Gaussian distribution with known variance $\sigma^2$; in this case, the unknown model parameter vector is $\theta = \mu \in \mathbb{R}$. A Gaussian prior on $\mu$ takes the form $\mu \mid \mu_0, \sigma_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$ for specified prior parameters $\mu_0$ and $\sigma_0$. The likelihood assumes that observations are centered at $\mu$ with Gaussian noise of variance $\sigma^2$, such that given $\mu$ and $\sigma$, the $i^{\text{th}}$ observation $x_i$ is distributed as $x_i \mid \mu, \sigma \sim \mathcal{N}(\mu, \sigma^2)$. Then, for observations $\mathcal{D} = \{x_1, \ldots, x_N\}$, the full likelihood expression is:

$$P(\mathcal{D} \mid \mu, \sigma) = \prod_{i=1}^{N} p(x_i \mid \mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}. \quad (2.4)$$

Under this prior and likelihood, Murphy (2007) shows that the posterior is also Gaussian, with distribution:

$$p(\mu \mid \mathcal{D}, \sigma^2, \mu_0, \sigma_0) = \mathcal{N}\left(\frac{1}{N\sigma_0^2 + \sigma^2}\left(\sigma^2\mu_0 + \sigma_0^2 \sum_{i=1}^{N} x_i\right),\; \frac{\sigma^2\sigma_0^2}{N\sigma_0^2 + \sigma^2}\right).$$

Next, consider Gaussian observations for which both the mean and variance are a priori unknown. In this case, the likelihood of the $i^{\text{th}}$ observation $x_i$ can be written as $x_i \sim \mathcal{N}(\mu, \lambda^{-1})$, where $\mu$ and the precision $\lambda := \sigma^{-2}$ are both unknown. The prior probabilities for $\mu$ and $\lambda$ can be jointly modeled via a normal-gamma distribution:

$$\mu, \lambda \sim \mathrm{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0), \quad \text{or equivalently,} \quad \lambda \sim \mathrm{Gamma}(\alpha_0, \mathrm{rate} = \beta_0), \quad \mu \sim \mathcal{N}(\mu_0, (\kappa_0\lambda)^{-1}).$$

It can be shown (Murphy, 2007) that under this normal-gamma prior and the Gaussian likelihood in Eq. (2.4), the posterior also takes a normal-gamma form, with the following parameters:

$$p(\mu, \lambda \mid \mathcal{D}, \mu_0, \kappa_0, \alpha_0, \beta_0) = \mathrm{NG}(\mu_N, \kappa_N, \alpha_N, \beta_N), \quad \text{where:}$$

$$\mu_N = \frac{\kappa_0\mu_0 + N\bar{x}}{\kappa_0 + N}, \quad \kappa_N = \kappa_0 + N, \quad \alpha_N = \alpha_0 + \frac{N}{2}, \quad \beta_N = \beta_0 + \frac{1}{2}\sum_{i=1}^{N}(x_i - \bar{x})^2 + \frac{\kappa_0 N (\bar{x} - \mu_0)^2}{2(\kappa_0 + N)}, \quad \text{and} \quad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i.$$
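As a quick illustration (not tied to any specific codebase), the normal-gamma update can be implemented in a few lines; the helper name and the example arguments below are hypothetical.

```python
import numpy as np

def normal_gamma_update(x, mu0, kappa0, alpha0, beta0):
    """Posterior normal-gamma parameters (mu_N, kappa_N, alpha_N, beta_N) for
    Gaussian data with unknown mean and precision, per the updates above."""
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    mu_N = (kappa0 * mu0 + N * xbar) / (kappa0 + N)
    kappa_N = kappa0 + N
    alpha_N = alpha0 + N / 2.0
    beta_N = (beta0
              + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * N * (xbar - mu0) ** 2 / (2.0 * (kappa0 + N)))
    return mu_N, kappa_N, alpha_N, beta_N

print(normal_gamma_update([0.9, 1.1, 1.0], mu0=0.0, kappa0=1.0, alpha0=2.0, beta0=2.0))
```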

While Gaussian distributions have infinite support, the beta distribution can serve as a conjugate prior to model variables with support on the $[0, 1]$ interval, such as probabilities. Consider binomial data consisting of a series of Bernoulli trials with an unknown success probability, $\theta \in [0, 1]$. After any number of trials $N$, one observes some number $W$ of positive results ("wins") and some number $F$ of negative results ("failures"); thus, $\mathcal{D} = \{W, F\}$, where $W + F = N$. Note that the Bernoulli distribution is a special case of the binomial distribution, in which $N = 1$.

Because the success probability $\theta$ must lie within $[0, 1]$, the beta distribution is a reasonable prior: $\theta \mid \alpha_0, \beta_0 \sim \mathrm{Beta}(\alpha_0, \beta_0)$, for some prior hyperparameters $\alpha_0, \beta_0 > 0$. The beta prior is conjugate to the Bernoulli and binomial likelihoods (Murphy, 2012), such that $\theta$ has the following exact posterior:

$$p(\theta \mid \mathcal{D}, \alpha_0, \beta_0) = \mathrm{Beta}(\alpha_0 + W, \beta_0 + F).$$
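A minimal sketch of this beta-binomial update, assuming SciPy is available; the variable names (wins, failures) mirror the $W$ and $F$ counts above.

```python
from scipy.stats import beta

def beta_posterior(wins, failures, alpha0=1.0, beta0=1.0):
    """Beta posterior over a Bernoulli success probability theta,
    after observing `wins` successes and `failures` failures."""
    return beta(alpha0 + wins, beta0 + failures)

posterior = beta_posterior(wins=7, failures=3)      # Beta(8, 4) under a uniform prior
print(posterior.mean(), posterior.interval(0.95))   # posterior mean and 95% credible interval
```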

In a final example, the Dirichlet distribution is a conjugate prior for the multinomial likelihood. This fact generalizes the conjugacy of the beta and binomial distributions to model multinomial data, in which each trial has one of $K$ possible outcomes for some $K \geq 2$. Each trial exhibits outcome $k$ with probability $\theta_k \in [0, 1]$, and so the vector of unknown model parameters is $\theta = [\theta_1, \ldots, \theta_K]^T$. The dataset $\mathcal{D}$ records the number of trials $N_k$ corresponding to each outcome: $\mathcal{D} = \{N_1, \ldots, N_K\}$.

One can place a Dirichlet prior distribution upon $\theta$. This distribution is supported over the $(K-1)$-dimensional standard simplex, that is, over $x_1, \ldots, x_K \geq 0$ such that $\sum_{k=1}^{K} x_k = 1$. With the prior $\theta \mid \alpha_1, \ldots, \alpha_K \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$, for hyperparameters $\alpha_1, \ldots, \alpha_K > 0$, $\theta$ has the following model posterior:

$$p(\theta \mid \mathcal{D}, \alpha_1, \ldots, \alpha_K) = \mathrm{Dirichlet}(\alpha_1 + N_1, \ldots, \alpha_K + N_K).$$

Approximate Posterior Inference
In many real-world situations, model posteriors not only lack convenient conjugate forms, but are analytically intractable and do not have closed-form representations. Fortunately, there are many techniques for approximating model posteriors, including the Laplace approximation (Murphy, 2012), expectation propagation (Minka, 2001; Minka and Lafferty, 2002), variational inference (Jordan, 1999; Wainwright and Jordan, 2008), and Markov Chain Monte Carlo (Metropolis et al., 1953; Murphy, 2012). The remainder of this subsection reviews the Laplace approximation method in detail.

The Laplace approximation to the posterior uses a second-order Taylor approximation to model the posterior distribution as Gaussian. To derive the Laplace approximation, first note that the posterior $p(\theta \mid \mathcal{D})$ can be written as follows:

$$p(\theta \mid \mathcal{D}) = \frac{p(\theta, \mathcal{D})}{P(\mathcal{D})} = \frac{e^{-[-\log p(\theta, \mathcal{D})]}}{P(\mathcal{D})} = \frac{e^{-E(\theta)}}{P(\mathcal{D})}, \quad (2.5)$$

where $E(\theta) := -\log p(\theta, \mathcal{D})$. Performing a second-order Taylor approximation of $E(\theta)$ about the MAP estimate, $\hat{\theta} = \operatorname*{argmax}_{\theta} p(\theta \mid \mathcal{D})$, yields:

$$E(\theta) \approx E(\hat{\theta}) + (\theta - \hat{\theta})^T \nabla_{\theta} E(\theta)\Big|_{\hat{\theta}} + \frac{1}{2}(\theta - \hat{\theta})^T \left[\nabla_{\theta}^2 E(\theta)\Big|_{\hat{\theta}}\right](\theta - \hat{\theta}). \quad (2.6)$$

The second term in Eq. (2.6) must be zero, because:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} p(\theta \mid \mathcal{D}) = \operatorname*{argmax}_{\theta} \frac{e^{-E(\theta)}}{P(\mathcal{D})} = \operatorname*{argmax}_{\theta} e^{-E(\theta)} = \operatorname*{argmin}_{\theta} E(\theta), \quad (2.7)$$

and $\nabla_{\theta} E(\theta) = 0$ at the point at which $E(\theta)$ is minimized. Thus, Eq. (2.6) becomes:

$$E(\theta) \approx E(\hat{\theta}) + \frac{1}{2}(\theta - \hat{\theta})^T A (\theta - \hat{\theta}),$$

where $A := \nabla_{\theta}^2 E(\theta)\big|_{\hat{\theta}}$. Substituting this Taylor approximation into Eq. (2.5) yields:

$$p(\theta \mid \mathcal{D}) = \frac{e^{-E(\theta)}}{P(\mathcal{D})} \approx \frac{e^{-E(\hat{\theta})}\, e^{-\frac{1}{2}(\theta - \hat{\theta})^T A (\theta - \hat{\theta})}}{P(\mathcal{D})}.$$

The numerator of this fraction is an unnormalized Gaussian density with mean $\hat{\theta}$ and covariance $A^{-1}$. In order for the expression to be a normalized probability density, one must set $P(\mathcal{D}) = e^{-E(\hat{\theta})}(2\pi)^{d/2}|A|^{-1/2}$, where recall that $d$ is the length of $\theta$. Therefore, $p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\hat{\theta}, A^{-1})$, where $A := \nabla_{\theta}^2 E(\theta)\big|_{\hat{\theta}}$. Note that $A$ must be positive semidefinite for the approximated posterior covariance matrix, $A^{-1}$, to be well-defined. Convexity of $E(\theta) = -\log p(\theta, \mathcal{D})$ over the domain of $\theta$ is a necessary and sufficient condition for $A$ to be positive semidefinite for any $\hat{\theta}$. Thus, $E(\theta)$ must be convex in order for the Laplace approximation to be applicable. The MAP estimate in Eq. (2.7) can therefore be computed using any convex optimization solver.
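The following sketch illustrates the Laplace approximation for a generic, convex energy $E(\theta) = -\log p(\theta, \mathcal{D})$: it finds the MAP estimate with an off-the-shelf optimizer and estimates the Hessian by finite differences. The quadratic toy energy and the helper names are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference estimate of the Hessian of f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps**2)
    return H

def laplace_approximation(energy, theta_init):
    """Gaussian approximation N(theta_hat, A^{-1}) to exp(-energy(theta)),
    where theta_hat minimizes the (assumed convex) energy."""
    theta_hat = minimize(energy, theta_init, method="BFGS").x
    A = numerical_hessian(energy, theta_hat)
    return theta_hat, np.linalg.inv(A)

# Toy quadratic energy, for which the Laplace approximation is exact.
energy = lambda th: 0.5 * th @ th + np.array([1.0, -2.0]) @ th
theta_hat, cov = laplace_approximation(energy, np.zeros(2))
```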

2.2 Gaussian Processes
Gaussian processes (GPs) provide a flexible Bayesian approach for modeling smooth, nonparametric functions and for specifying probability distributions over spaces of such functions. Rasmussen and Williams (2006) give an excellent introduction to the use of Gaussian process methods in machine learning. This section reviews the definition of a GP, as well as GP regression for modeling both numerical and preference data.

As stated in Rasmussen and Williams (2006), a Gaussian process is defined as a collection of random variables such that any finite subset of them have a joint Gaussian distribution. In our case, these random variables represent the values of an unknown function $f : \mathcal{A} \rightarrow \mathbb{R}$ evaluated over points $\mathbf{x} \in \mathcal{A}$, where $\mathcal{A} \subset \mathbb{R}^d$ is the domain of $f$. Then, the GP is fully specified by a mean function $\mu : \mathcal{A} \rightarrow \mathbb{R}$ and a covariance function $\mathcal{K} : \mathcal{A} \times \mathcal{A} \rightarrow \mathbb{R}_{\geq 0}$. For any $\mathbf{x}, \mathbf{x}' \in \mathcal{A}$, the mean and covariance functions are denoted as $\mu(\mathbf{x})$ and $\mathcal{K}(\mathbf{x}, \mathbf{x}')$, respectively. Then, if the function $f : \mathcal{A} \rightarrow \mathbb{R}$ is distributed according to this GP:

$$f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), \mathcal{K}(\mathbf{x}, \mathbf{x}')), \quad \text{where:} \quad \mu(\mathbf{x}) := \mathbb{E}[f(\mathbf{x})], \quad \text{and} \quad \mathcal{K}(\mathbf{x}, \mathbf{x}') := \mathbb{E}[(f(\mathbf{x}) - \mu(\mathbf{x}))(f(\mathbf{x}') - \mu(\mathbf{x}'))].$$

Over any collection of points in the domain $\mathcal{A}$, the mean and covariance functions fully define the joint Gaussian distribution of $f$'s values. To illustrate, define $X := [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^T \in \mathbb{R}^{N \times d}$ as a matrix in which $\mathbf{x}_i \in \mathcal{A}$ for each $i \in \{1, \ldots, N\}$. For this collection of points in $\mathcal{A}$, let $\mu(X) := [\mu(\mathbf{x}_1), \ldots, \mu(\mathbf{x}_N)]^T$ be a vector of the mean function values at each row in $X$, and let $K(X, X) \in \mathbb{R}^{N \times N}$ be such that $[K(X, X)]_{ij} = \mathcal{K}(\mathbf{x}_i, \mathbf{x}_j)$. The matrix $K(X, X)$ is called the Gram matrix. Furthermore, let $f(X) := [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N)]^T$ contain the values of $f$ at each point in $X$. Then, $f(X) \sim \mathcal{N}(\mu(X), K(X, X))$.

Thus, the GP is fully specified by its mean and covariance functions, $\mu$ and $\mathcal{K}$. Furthermore, the specific choices of $\mu$ and $\mathcal{K}$ define a distribution over the space of functions on $\mathcal{A}$. A common choice of $\mu$ is $\mu(\mathbf{x}) := 0$ for all $\mathbf{x}$: assuming a flat prior is reasonable without problem-specific knowledge, and setting it to zero simplifies notation. This dissertation will only consider zero-mean GP priors, in which $\mu(\mathbf{x}) := 0$. Non-zero GP mean functions are discussed in Chapter 2.7 of Rasmussen and Williams (2006), and can be useful in a number of situations, for instance to express prior information or to enhance model interpretability.

The covariance function $\mathcal{K}$, also called the kernel, defines the covariance between pairs of points in $\mathcal{A}$. The choice of $\mathcal{K}$ determines the strength of links between points in $\mathcal{A}$ at various distances from each other, and therefore characterizes the smoothness and structure of the functions $f$ sampled from the GP. This thesis will utilize the squared exponential kernel, which is defined as follows:

$$\mathcal{K}_{se}(\mathbf{x}_i, \mathbf{x}_j) := \sigma_f^2 \exp\left(-\frac{1}{2l^2}\|\mathbf{x}_i - \mathbf{x}_j\|^2\right),$$

where $\sigma_f^2$ is the signal variance and $l$ is the lengthscale. The variables $\sigma_f$ and $l$ are hyperparameters, which must be either obtained from expert domain knowledge or learned, for instance via evidence maximization or cross-validation (discussed in Rasmussen and Williams, 2006). The signal variance, $\sigma_f^2$, reflects the prior belief about the function $f$'s amplitude, while the lengthscale $l$ captures $f$'s sensitivity across distances in $\mathcal{A}$. Note that $\mathcal{K}_{se}(\mathbf{x}_i, \mathbf{x}_j)$ is maximized when $\mathbf{x}_i = \mathbf{x}_j$, and that as the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ increases, $\mathcal{K}_{se}(\mathbf{x}_i, \mathbf{x}_j)$ decays exponentially.
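As an illustration, the squared exponential kernel can be computed for a batch of points as follows; the function name and default hyperparameter values are arbitrary choices for this sketch.

```python
import numpy as np

def squared_exponential_kernel(X1, X2, signal_var=1.0, lengthscale=1.0):
    """Gram matrix K with [K]_{ij} = signal_var * exp(-||x_i - x_j||^2 / (2 l^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

X = np.random.rand(5, 2)                      # five points in a 2-D domain
K = squared_exponential_kernel(X, X)          # 5 x 5 positive semidefinite Gram matrix
```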

The choice of the kernel and settings of its hyperparameter values influence the types of functions likely to be sampled from the GP prior, and can convey expert knowledge. It is assumed that nearby points in the domain $\mathcal{A}$ have similar target values in $f$. In GP learning, the covariance kernel dictates this notion of nearness or similarity. Aside from the squared exponential kernel, other common types of kernels include the linear, periodic, and Matérn kernels.

When $d > 1$ (recall that $d$ is the dimension of $\mathcal{A}$), the squared exponential kernel can be modified such that the lengthscale differs between dimensions; this is called the automatic relevance determination (ARD) kernel, and will also appear later in this thesis. It is defined as follows:

$$\mathcal{K}_{se,ARD}(\mathbf{x}_i, \mathbf{x}_j) := \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{k=1}^{d} \frac{\left([\mathbf{x}_i]_k - [\mathbf{x}_j]_k\right)^2}{l_k^2}\right),$$

where $[\mathbf{x}]_k$ denotes the $k^{\text{th}}$ element of $\mathbf{x}$, and $l_1, \ldots, l_d$ are $d$ lengthscale hyperparameters. Under the ARD kernel, functions sampled from the GP vary more slowly in some dimensions than others. If the lengthscales $l_1, \ldots, l_d$ are learned from data, then their resultant values yield insight into the relevance of each dimension (or feature) in the domain $\mathcal{A}$ to the output function; hence, the name "automatic relevance determination."

Importantly, an arbitrary function of two inputs will generally not be a valid covariance function. For a valid kernel, the Gram matrix $K(X, X)$ must be positive semidefinite for any set of points $X$. Such a kernel, for which the Gram matrix is guaranteed to be positive semidefinite, is called a positive semidefinite kernel. A more complete discussion of this topic can be found in either Chapter 4 of Rasmussen and Williams (2006) or Chapter 12 of Wainwright (2019).

Every positive semidefinite kernel function $\mathcal{K}(\cdot, \cdot)$ corresponds to a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{\mathcal{K}}$ of real-valued functions associated with that kernel (Wainwright, 2019). The RKHS satisfies two properties: 1) for any fixed $\mathbf{x} \in \mathcal{A}$, the function $\mathcal{K}(\cdot, \mathbf{x})$ belongs to $\mathcal{H}_{\mathcal{K}}$, and 2) for any fixed $\mathbf{x} \in \mathcal{A}$, the function $\mathcal{K}(\cdot, \mathbf{x})$ satisfies the reproducing property, namely, that $\langle \mathcal{K}(\cdot, \mathbf{x}), f \rangle_{\mathcal{H}_{\mathcal{K}}} = f(\mathbf{x})$ for all $f \in \mathcal{H}_{\mathcal{K}}$ (Wainwright, 2019). The RKHS norm, $\|f\|_{\mathcal{H}_{\mathcal{K}}}$, captures information about the smoothness of functions in the RKHS; $f$ becomes smoother as $\|f\|_{\mathcal{H}_{\mathcal{K}}}$ decreases (Kanagawa et al., 2018).

Gaussian Process Regression
Gaussian process regression performs Bayesian inference on an unknown function $f$ with a GP prior, when noisy observations of $f$ are available. This subsection outlines the procedure for GP regression when the observations have Gaussian noise. The material is summarized from Chapter 2.3 of Rasmussen and Williams (2006), and additional details may be found therein.

First, assume that $f$ has a GP prior: $f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), \mathcal{K}(\mathbf{x}, \mathbf{x}'))$ for some $\mu$ and $\mathcal{K}$. Additionally, evaluations of $f$ are observed, such that $y_i$ denotes an observation corresponding to the $i^{\text{th}}$ data point location $\mathbf{x}_i \in \mathcal{A}$. In most real-world situations, such observations are noisy. Assume that the observation noise is additive, Gaussian, independent, and identically-distributed. Then, $y_i = f(\mathbf{x}_i) + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. The variable $\sigma_n^2$ is called the noise variance, and is an additional hyperparameter (alongside the kernel hyperparameters) which must either be learned from data or given by domain knowledge. Note that a noise variance of $\sigma_n^2 = 0$ indicates noise-free observations.

GP regression makes predictions about the values of $f$, given a dataset of $N$ potentially-noisy function evaluations, $\{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, N\}$. Let $\mathbf{y} := [y_1, \ldots, y_N]^T$ be the vector of observations. Recall that $X$ is a matrix in which each row corresponds to an observation location: $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^T$. Let $X_* \in \mathbb{R}^{M \times d}$ be a matrix in which each row corresponds to a point in $\mathcal{A}$ at which the value of $f$ is to be predicted, and let $\mathbf{f}_* \in \mathbb{R}^M$ be a vector containing the values of $f$ at the locations in $X_*$. Notationally, for any two collections $X_A$ and $X_B$ of points in $\mathcal{A}$, where $X_A = [\mathbf{x}_1^{(A)}, \ldots, \mathbf{x}_n^{(A)}]^T \in \mathbb{R}^{n \times d}$ and $X_B = [\mathbf{x}_1^{(B)}, \ldots, \mathbf{x}_m^{(B)}]^T \in \mathbb{R}^{m \times d}$, define $K(X_A, X_B) \in \mathbb{R}^{n \times m}$ as a matrix such that the $ij^{\text{th}}$ element is $[K(X_A, X_B)]_{ij} = \mathcal{K}(\mathbf{x}_i^{(A)}, \mathbf{x}_j^{(B)})$. Then, $\mathbf{y}$ and $\mathbf{f}_*$ have the following joint Gaussian distribution:

" # " 2 #! y (-,-) + f  (-,-∗) ∼ N 0, = , (2.8) f∗ (-∗,-) (-∗,-∗) where  is the #-dimensional identity matrix.

By the standard method for obtaining a conditional distribution from a joint Gaussian distribution (e.g., see Appendix A in Rasmussen and Williams, 2006), the distribution of $\mathbf{f}_*$ conditioned upon the observations $\mathbf{y}$ is also Gaussian:

$$\mathbf{f}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\boldsymbol{\mu}_{f_*}, \Sigma_{f_*}), \quad \text{where:}$$
$$\boldsymbol{\mu}_{f_*} = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y},$$
$$\Sigma_{f_*} = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}K(X, X_*).$$

Thus, the posterior over $\mathbf{f}_*$ can be computed in closed form using the data $(X, \mathbf{y})$, the kernel and its hyperparameters, and the noise variance $\sigma_n^2$. Note that the prior, likelihood, and posterior are all Gaussian; this is because the Gaussian distribution is self-conjugate, similarly to the case of univariate Gaussian data discussed in Section 2.1.
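A compact sketch of these GP regression equations, reusing the squared_exponential_kernel helper from the earlier sketch; in practice one would typically use a Cholesky factorization rather than repeated linear solves, but the direct translation below keeps the correspondence with the formulas clear.

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, noise_var):
    """GP regression posterior mean and covariance at test inputs X_star,
    for a zero-mean prior and Gaussian observation noise."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_star = kernel(X_star, X)
    mean = K_star @ np.linalg.solve(K, y)
    cov = kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

# Example: noisy samples of a sine function, predictions on a finer grid.
X = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X).ravel() + 0.1 * np.random.randn(10)
X_star = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
mean, cov = gp_posterior(X, y, X_star, squared_exponential_kernel, noise_var=0.01)
```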

Gaussian Process Regression for Modeling Preference Data
In addition to regression, GPs have been applied toward a number of other modeling domains, for instance binary classification (Rasmussen and Williams, 2006) and ordinal regression (Chu and Ghahramani, 2005a). In particular, this subsection details a GP approach first proposed in Chu and Ghahramani (2005b) for Bayesian modeling of pairwise preference data, which will be applied later in this dissertation. Preference data (further discussed in Sections 2.5-2.6) can be useful when labels are given by humans: often, people cannot give reliable numerical scores, but can accurately respond to queries of the form, "Do you prefer A or B?"

Chu and Ghahramani (2005b) propose a GP-based approach in which preferences are given between elements of a space $\mathcal{A} \subset \mathbb{R}^d$, and each element $\mathbf{x} \in \mathcal{A}$ is assumed to have a latent utility $f(\mathbf{x})$ to the human user assigning the preferences. In other words, there exists a latent utility function $f : \mathcal{A} \longrightarrow \mathbb{R}$ that explains the preference data. A GP prior with an appropriate kernel $\mathcal{K}$ is placed over $f$: $f(\mathbf{x}) \sim \mathcal{GP}(0, \mathcal{K}(\mathbf{x}, \mathbf{x}'))$.

Next, assume that $\mathcal{A}$ is a finite set with cardinality $A := |\mathcal{A}|$. The elements of $\mathcal{A}$ can then be indexed as $\mathbf{x}_1, \ldots, \mathbf{x}_A$, and the values of $f$ at the points in $\mathcal{A}$ can be written in vectorized form: $\mathbf{f} := [f(\mathbf{x}_1), f(\mathbf{x}_2), \ldots, f(\mathbf{x}_A)]^T$. The GP prior over $\mathbf{f}$ can be expressed as follows:

$$P(\mathbf{f}) = \frac{1}{(2\pi)^{A/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\mathbf{f}^T \Sigma^{-1} \mathbf{f}\right),$$

where $\Sigma \in \mathbb{R}^{A \times A}$, $[\Sigma]_{ij} = \mathcal{K}(\mathbf{x}_i, \mathbf{x}_j)$, and $\mathcal{K}$ is a kernel function of choice, for instance the squared exponential kernel.

The dataset $\mathcal{D}$ consists of $N$ pairwise preferences between points in $\mathcal{A}$: $\mathcal{D} = \{\mathbf{x}_{k1} \succ \mathbf{x}_{k2} \mid k = 1, \ldots, N\}$, where $\mathbf{x}_{k1}, \mathbf{x}_{k2} \in \mathcal{A}$ and $\mathbf{x}_{k1} \succ \mathbf{x}_{k2}$ indicates that the user prefers action $\mathbf{x}_{k1} \in \mathcal{A}$ to action $\mathbf{x}_{k2} \in \mathcal{A}$ in the $k^{\text{th}}$ preference.

In modeling the likelihood $P(\mathcal{D} \mid \mathbf{f})$, the user's feedback is assumed to reflect the underlying utilities given by $\mathbf{f}$, but corrupted by i.i.d. Gaussian noise: when presented with action $\mathbf{x}_i$, the user determines her internal valuation $y(\mathbf{x}_i) = f(\mathbf{x}_i) + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Note that the preference noise parameter, $\sigma$, is a hyperparameter that must either be learned from data or set via domain knowledge. The likelihood of a specific preference given $\mathbf{f}$ is given by:

$$P(\mathbf{x}_{k1} \succ \mathbf{x}_{k2} \mid \mathbf{f}) = P(y(\mathbf{x}_{k1}) > y(\mathbf{x}_{k2}) \mid f(\mathbf{x}_{k1}), f(\mathbf{x}_{k2})) = \Phi\left(\frac{f(\mathbf{x}_{k1}) - f(\mathbf{x}_{k2})}{\sqrt{2}\sigma}\right),$$

where $\Phi$ is the standard normal cumulative distribution function, and $y(\mathbf{x}_{kj}) = f(\mathbf{x}_{kj}) + \varepsilon_{kj}$, $j \in \{1, 2\}$. Thus, the full expression for the likelihood is:

$$P(\mathcal{D} \mid \mathbf{f}) = \prod_{k=1}^{N} \Phi\left(\frac{f(\mathbf{x}_{k1}) - f(\mathbf{x}_{k2})}{\sqrt{2}\sigma}\right).$$

Intuitively, the user can easily give a reliable preference between two options with disparate utilities (i.e., $f(\mathbf{x}_{k1})$ and $f(\mathbf{x}_{k2})$ are far apart), while the closer the utilities of the two options, the more difficult it becomes for the user to give a correct preference. If the utilities of $\mathbf{x}_{k1}$ and $\mathbf{x}_{k2}$ are equal, then the user cannot distinguish between them at all, and $P(\mathbf{x}_{k1} \succ \mathbf{x}_{k2} \mid \mathbf{f}) = \Phi(0) = 0.5$. The full model posterior takes the following form:

$$P(\mathbf{f} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathbf{f})P(\mathbf{f}) = \frac{1}{(2\pi)^{A/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\mathbf{f}^T \Sigma^{-1} \mathbf{f}\right) \prod_{k=1}^{N} \Phi\left(\frac{f(\mathbf{x}_{k1}) - f(\mathbf{x}_{k2})}{\sqrt{2}\sigma}\right).$$

Unlike the posterior for GP regression with numerical rewards, discussed previously, the above posterior is non-Gaussian and lacks a tractable conjugate form. Because its negative log-likelihood is convex, however, one can use the Laplace approximation (see Section 2.1) to approximate $P(\mathbf{f} \mid \mathcal{D})$ as a multivariate Gaussian distribution.
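To illustrate, the sketch below assembles the negative log posterior implied by the expressions above and obtains the MAP estimate with a generic optimizer; the prior covariance, preference list, and noise level are placeholder values, and a full Laplace approximation would additionally evaluate the Hessian of this objective at the optimum.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_log_posterior(f, Sigma_inv, prefs, sigma_noise):
    """-log P(f | D) up to an additive constant. `prefs` is a list of (k1, k2)
    index pairs, each meaning that x_{k1} was preferred to x_{k2}."""
    nll = 0.0
    for k1, k2 in prefs:
        z = (f[k1] - f[k2]) / (np.sqrt(2.0) * sigma_noise)
        nll -= norm.logcdf(z)                     # preference likelihood term
    return nll + 0.5 * f @ Sigma_inv @ f          # Gaussian GP prior term

A = 4                                             # number of actions (placeholder)
Sigma_inv = np.eye(A)                             # placeholder prior covariance inverse
prefs = [(0, 1), (2, 1), (0, 3)]                  # hypothetical preference data
f_map = minimize(neg_log_posterior, np.zeros(A),
                 args=(Sigma_inv, prefs, 0.1)).x
# A Laplace approximation would set the posterior covariance to the inverse
# Hessian of this objective evaluated at f_map.
```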

2.3 Entropy, Mutual Information, and Kullback-Leibler Divergence
This section defines several information-theoretic measures—namely, entropy, mutual information, and the Kullback-Leibler divergence—and states some of their properties, which will appear later in this dissertation. All of the facts below are proven for discrete random variables in Chapter 2 of Cover and Thomas (2012), and all random variables in this section are likewise discrete.

In the following, $X$ is a random variable that takes values over a finite set $\mathcal{X}$ ($\mathcal{X}$ is called the alphabet of $X$); similarly, $Y$ and $Z$ are random variables with respective alphabets $\mathcal{Y}$ and $\mathcal{Z}$.

Firstly, entropy is a measure of a random variable's uncertainty. The Shannon entropy of $X$ is defined as:
$$H(X) := -\sum_{x \in \mathcal{X}} P(X = x) \log P(X = x).$$

The entropy $H(X)$ is non-negative and upper-bounded in terms of $|\mathcal{X}|$:

Fact 1 (Non-negativity and upper-boundedness of entropy). It holds that:

$$0 \leq H(X) \leq \log|\mathcal{X}|.$$

Note that $H(X) = 0$ when the distribution of $X$ is a point mass at a particular element of $\mathcal{X}$—i.e., the distribution is maximally certain—while $H(X) = \log|\mathcal{X}|$ when $P(X = x) = \frac{1}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$, i.e., the distribution of $X$ is maximally uncertain. The entropy of $X$ conditioned upon $Y = y$ is defined as:
$$H(X \mid Y = y) := -\sum_{x \in \mathcal{X}} P(X = x \mid Y = y) \log P(X = x \mid Y = y),$$
and the conditional entropy of $X$ given $Y$ is:
$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} P(Y = y) H(X \mid Y = y) = -\sum_{y \in \mathcal{Y}} P(Y = y) \sum_{x \in \mathcal{X}} P(X = x \mid Y = y) \log P(X = x \mid Y = y) = \mathbb{E}_Y\left[-\sum_{x \in \mathcal{X}} P(X = x \mid Y) \log P(X = x \mid Y)\right].$$

Next, the Kullback-Leibler divergence is an (asymmetric) measure of distance between two probability distributions. The Kullback-Leibler divergence between two probability mass functions $\{P(X = x) \mid x \in \mathcal{X}\}$ and $\{Q(X = x) \mid x \in \mathcal{X}\}$ is denoted by $D_{KL}(P \,\|\, Q)$ and is defined as:

$$D_{KL}(P(X) \,\|\, Q(X)) = \sum_{x \in \mathcal{X}} P(X = x) \log\left(\frac{P(X = x)}{Q(X = x)}\right), \quad (2.9)$$

where by convention, $0\log\frac{0}{0} = 0$, $0\log\frac{0}{q} = 0$, and $p\log\frac{p}{0} = \infty$. Importantly, the Kullback-Leibler divergence is always non-negative:

Fact 2 (Non-negativity of Kullback-Leibler divergence).

$$D_{KL}(P(X) \,\|\, Q(X)) \geq 0,$$ with equality if and only if $P(X = x) = Q(X = x)$ for all $x \in \mathcal{X}$.

The mutual information of two random variables $X$ and $Y$ quantifies the amount of information learned about one of the variables when observing the value of the other. It is defined as:
$$I(X; Y) := \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P(X = x, Y = y) \log\left(\frac{P(X = x, Y = y)}{P(X = x)P(Y = y)}\right) = D_{KL}(P(X, Y) \,\|\, P(X)P(Y)) = \mathbb{E}_{X,Y}\left[\log\frac{P(X, Y)}{P(X)P(Y)}\right].$$

Because the mutual information can be expressed as a Kullback-Leibler divergence, the following fact holds:

Fact 3 (Non-negativity of mutual information). $I(X; Y) \geq 0$, and $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.
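These quantities are straightforward to compute for discrete distributions; the short sketch below (with an arbitrary example joint distribution) evaluates entropy, Kullback-Leibler divergence, and mutual information, and illustrates Facts 2 and 3 numerically.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural logarithm)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(p_xy):
    """I(X; Y) = D_KL(P(X, Y) || P(X) P(Y)) for a joint pmf stored as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
print(mutual_information(p_xy) >= 0.0)            # Fact 3: non-negative
print(entropy(p_xy.sum(axis=1)))                  # marginal entropy H(X)
```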

One can show that the mutual information can be written as a difference of entropies:

Fact 4 (Entropy reduction form of mutual information). The mutual information between $X$ and $Y$ can be written as $I(X; Y) = H(X) - H(X \mid Y)$. Conditioned on a third random variable, $Z$, the mutual information between $X$ and $Y$ can be written: $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$.

The mutual information between $X$ and a collection of random variables $Y_1, \ldots, Y_n$ can be simplified via the chain rule for mutual information:

Fact 5 (Chain rule for mutual information).
$$I(X; (Y_1, \ldots, Y_n)) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}).$$

Next, the following result from Russo and Van Roy (2016) expresses the mutual information in terms of a Kullback-Leibler divergence:

Fact 6 (Mutual information in terms of Kullback-Leibler divergence). The mutual information between $X$ and $Y$ can be expressed as:
$$I(X; Y) = \mathbb{E}_X\left[D_{KL}(P(Y \mid X) \,\|\, P(Y))\right] = \sum_{x \in \mathcal{X}} P(X = x)\, D_{KL}(P(Y \mid X = x) \,\|\, P(Y)).$$

Finally, Russo and Van Roy (2016) applies Pinsker’s inequality to relate the Kullback- Leibler divergence to a difference of expectations:

Fact 7 (Relating the Kullback-Leibler divergence to a difference of expectations; Fact 9 in Russo and Van Roy, 2016). Let $P$ and $Q$ be two probability distributions such that $P$ is absolutely continuous with respect to $Q$, let $X$ be any random variable taking values on the set $\mathcal{X}$, and let $g : \mathcal{X} \longrightarrow \mathbb{R}$ be any function such that $\sup g - \inf g \leq 1$.

Then, with $\mathbb{E}_P$ and $\mathbb{E}_Q$ denoting the expectations under $P$ and $Q$:

$$D_{KL}(P \,\|\, Q) \geq 2\left(\mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)]\right)^2.$$

2.4 Bandit Learning
The multi-armed bandit problem, first studied in Robbins (1952), is a sequential decision-making problem in which an agent interacts with an environment by taking a series of actions in the environment. In each step, the agent observes a reward corresponding to the action that it selected. This section first reviews the K-armed stochastic bandit problem, and then discusses several extensions of this setting.

In the K-armed bandit setting (Auer, Cesa-Bianchi, and Fischer, 2002), the environment consists of an action space $\mathcal{A}$ with finite cardinality, $|\mathcal{A}| = K$. The actions are indexed such that $\mathcal{A} = \{1, \ldots, K\}$. In each learning iteration $i$, the agent selects a specific action $a_i \in \mathcal{A}$ and observes a reward $R_{a_i, i}$ corresponding to action $a_i$. The rewards for a particular action $a \in \{1, \ldots, K\}$ at each time step are denoted by $R_{a,1}, R_{a,2}, \ldots$, and are assumed to be independent and identically distributed (i.i.d.) samples from the distribution $\nu_a$. Importantly, rewards for different actions are independent, that is, $R_{a, i_1}$ and $R_{a', i_2}$ are independent for any $i_1, i_2 \geq 1$ and for any $a \neq a'$ such that $a, a' \in \mathcal{A}$; thus, observing action $a$'s reward does not reveal information about any other actions' rewards.¹

Denote $\mu_a$ to be the expected reward when selecting action $a$: $\mu_a := \mathbb{E}[R_{a,i}]$ for all $i$. Then, the index of the optimal action is $a^* := \operatorname{argmax}_{a \in \{1, \ldots, K\}} \mu_a$. Let $\mu^*$ be its corresponding expected reward: $\mu^* := \max_{a \in \{1, \ldots, K\}} \mu_a = \mu_{a^*}$.

The agent seeks to maximize its cumulative reward over time: $\sum_{i=1}^{T} R_{a_i, i}$. This is equivalent to minimizing the regret, or the gap between the algorithm's total reward and the total reward accumulated by repeatedly choosing the optimal action $a^*$. The regret is defined as follows over $T$ learning iterations:

$$\mathrm{Reg}(T) := \sum_{i=1}^{T} \left[\mu^* - \mu_{a_i}\right] = \mu^* T - \sum_{i=1}^{T} \mu_{a_i} = \sum_{i=1}^{T} \Delta_{a_i},$$

where $\Delta_a := \mu^* - \mu_a$ is the expected per-iteration contribution to the regret when choosing action $a$, and is also known as the expected instantaneous regret under action $a$. In expectation,

$$\mathbb{E}[\mathrm{Reg}(T)] := \mathbb{E}\left[\sum_{i=1}^{T} \Delta_{a_i}\right] = \sum_{a=1}^{K} \Delta_a\, \mathbb{E}[T_a(T)],$$

where $T_a(t)$ is the number of times that action $a$ is selected in the first $t$ steps, and where the expectation is taken over any randomness in the action selection strategy and in the rewards (which could affect future action selection).

The literature on the stochastic bandit problem can be divided into two approaches toward analyzing regret, as discussed in Kaufmann, Cappé, and Garivier (2012): frequentist and Bayesian. In the frequentist setting, the reward distributions for each action are assumed to be fixed and deterministic. Meanwhile, under the Bayesian viewpoint, the environment variables are assumed to be drawn from a prior over all possible environments. In particular, in the K-armed bandit setting, the environment is determined by the distributions $\{\nu_1, \ldots, \nu_K\}$. While the frequentist perspective assumes that $\{\nu_1, \ldots, \nu_K\}$ are fixed, unknown distributions, a Bayesian framework would assume that each distribution $\nu_a$ is defined by a parameter vector $\theta_a$, which is drawn from a prior distribution: $\theta_a \sim p(\theta_a)$. The Bayesian regret is:

$$\mathrm{BayesReg}(T) := \mathbb{E}\left[\sum_{i=1}^{T} \Delta_{a_i}\right] = \sum_{a=1}^{K} \Delta_a\, \mathbb{E}[T_a(T)],$$

where in addition to randomness in the rewards and in the action selection scheme, the expectation is taken with respect to randomness in the environment, captured by $\theta_a \sim p(\theta_a)$ for $a \in \{1, \ldots, K\}$.

¹There are many versions of the bandit setting in which different actions' rewards are dependent on one another; some of these will be discussed later.

The regret minimization problem in the bandit setting was first formulated in Robbins (1952) (regret is called “loss” in that work).

Bandit Algorithms
The first multi-armed bandit algorithm with an asymptotically-optimal regret guarantee was proposed in Lai and Robbins (1985). This seminal work demonstrated that under mild assumptions on the reward distributions, within $T$ total steps, any suboptimal action $a \neq a^*$ is selected an asymptotically lower-bounded number of times, $T_a(T)$:

$$\mathbb{E}[T_a(T)] \geq \frac{\log T}{D_{KL}(\nu_a \,\|\, \nu^*)}, \quad (2.10)$$

where $\nu^* = \nu_{a^*}$ is the reward distribution of the optimal action $a^*$ and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler divergence between continuous probability distributions $p$ and $q$, defined as:

$$D_{KL}(p \,\|\, q) := \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx,$$

where $p$ and $q$ are continuous distributions over the real numbers. In addition, Lai and Robbins propose an algorithm for which $\mathbb{E}[T_a(T)] \leq \left[\frac{1}{D_{KL}(\nu_a \,\|\, \nu^*)} + o(1)\right]\log T$, where $o(1) \rightarrow 0$ as $T \rightarrow \infty$. Comparing with Eq. (2.10), one can see that this approach is asymptotically optimal, and furthermore, that the optimal action $a^*$ is asymptotically selected exponentially more often than any suboptimal action.

While Lai and Robbins (1985) demonstrate that one can achieve logarithmic regret asymptotically, Gittins (1974) and Gittins (1979) proposed a provably-optimal dynamic programming strategy for the Bayesian setting with known priors and time-discounted rewards, in which the algorithm seeks to maximize $\mathbb{E}\left[\sum_{i=1}^{T} \gamma^{i-1} R_{a_i, i}\right]$ for a discount factor $\gamma \in (0, 1)$. This computationally-intensive strategy uses dynamic programming to compute dynamic allocation indices, or Gittins indices, for each action at every step. Later, Auer, Cesa-Bianchi, and Fischer (2002) were first to propose computationally-efficient algorithms for achieving logarithmic regret in finite time; their analysis only requires that the reward distributions have bounded support. The following subsections detail several regret minimization strategies for the bandit setting.

Tackling the Exploration-Exploitation Trade-Off
Every algorithm in the bandit setting must balance between selecting actions thought to yield high rewards and exploring actions whose outcomes are uncertain. There are numerous strategies in the literature for tackling this trade-off between exploration and exploitation; among others, these include upper confidence bound algorithms, posterior sampling, and information-theoretic approaches. Each of these three algorithmic design methodologies is reviewed below.

Upper-Confidence Bound Algorithms
Upper confidence bound algorithms leverage the principle of optimism in the face of uncertainty, in which actions are selected as if all rewards are equal to their maximum statistically-plausible values. The idea first appeared in Lai and Robbins (1985), in which an upper confidence index is associated with each action; the algorithm alternates between selecting the action whose confidence index is largest and selecting the "leader," that is, the action with the largest estimated mean among sufficiently-observed actions. The procedure for computing the confidence indices is computationally inefficient, however, and utilizes the entire history of rewards observed for a given action.

Subsequent upper confidence bound approaches, for example the UCB1 and UCB2 algorithms introduced in Auer, Cesa-Bianchi, and Fischer (2002), select an action $a_{i+1}$ at step $i + 1$ as follows:

$$a_{i+1} = \operatorname*{argmax}_{a \in \{1, \ldots, K\}} \left[\hat{\mu}_{a,i} + \hat{\sigma}_{a,i}\right],$$

where $\hat{\mu}_{a,i}$ is the empirical mean reward observed for action $a$ after $i$ steps, while $\hat{\sigma}_{a,i}$ estimates the uncertainty in action $a$'s reward. Thus, the first term prioritizes selecting the actions that appear most promising, while the second term favors actions with more uncertain reward distributions. There are a number of possible definitions for $\hat{\sigma}_{a,i}$. For instance, the UCB1 algorithm uses $\hat{\sigma}_{a,i} = \sqrt{\frac{2\log i}{T_a(i)}}$, where $T_a(i)$ is the number of times that action $a$ has been selected at step $i$. The UCB1 and UCB2 algorithms have expected regret values of $O\left(\sum_{a : \mu_a < \mu^*} \frac{\log T}{\Delta_a} + \sum_{j=1}^{K} \Delta_j\right)$ and $O\left(\sum_{a : \mu_a < \mu^*} \left[\frac{\log(\Delta_a^2 T)}{\Delta_a} + \frac{1}{\Delta_a}\right]\right)$, respectively.

More recently, a number of UCB-type algorithms have improved upon the regret bounds given in Auer, Cesa-Bianchi, and Fischer (2002). These include Audibert, Munos, and Szepesvári (2009), Audibert and Bubeck (2010), Garivier and Cappé (2011), and Kaufmann, Cappé, and Garivier (2012).

Posterior Sampling
Posterior sampling, also known as Thompson sampling, is a Bayesian alternative to the upper confidence bound approach. First proposed by Thompson (1933), posterior sampling selects each action according to its probability of being optimal. The method iterates between maintaining a posterior distribution over possible environments and sampling from this posterior to inform action selection. Intuitively, when the algorithm is less certain, it samples from diffuse distributions encouraging exploration, while when it becomes more certain of the optimal action, it samples from peaked distributions leading to exploitation.

In the strictly Bayesian setting, posterior sampling assumes that each action has a prior probability distribution over its reward, which given reward observations, is updated to obtain a model posterior over the rewards (e.g., Russo and Van Roy, 2016). Importantly, however, the sampling distributions need not be exact Bayesian posteriors, and in a number of cases are not (e.g., Abeille and Lazaric, 2017; Agrawal and Goyal, 2013; Sui, Zhuang, et al., 2017). In particular, Abeille and Lazaric (2017) explain that posterior sampling does not necessarily need to sample from an actual Bayesian posterior: in fact, any probability distribution satisfying appropriate concentration and anti-concentration inequalities will achieve low regret. As a result, although posterior sampling is an apparently-Bayesian approach, it can enjoy bounded regret in the frequentist setting as well as the Bayesian setting.

To illustrate posterior sampling, consider the Bernoulli bandits problem, in which the actions yield binary rewards: $R_{a,i} \in \{0, 1\}$ for all $a \in \{1, \ldots, K\}$, $i \geq 1$. A reward of 1 is regarded as a "success," while a reward of 0 is a "failure." This is analogous to a game in which each action tosses a corresponding coin with unknown probability of landing on heads. The player selects one coin to flip per step, and receives a reward of 1 if it lands on heads, and 0 if it lands on tails. The player's total reward is thus the total number of times that a coin toss results in heads.

Let $\theta_a = P(R_{a,i} = 1)$ denote action $a$'s success probability (analogously, a coin's probability of landing on heads). The Beta distribution can be used to model the algorithm's belief over $\{\theta_1, \ldots, \theta_K\}$:

$$\theta_a \sim \mathrm{Beta}(\alpha_0, \beta_0), \quad a \in \{1, \ldots, K\},$$

where $\alpha_0, \beta_0$ are prior parameters. Without any prior information about the rewards, one could set $\alpha_0 = \beta_0 = 1$ for each action $a \in \{1, \ldots, K\}$; the $\mathrm{Beta}(1, 1)$ prior distribution is equivalent to a uniform distribution over the interval $[0, 1]$.

Algorithm 1 Posterior Sampling for Bernoulli Bandits (Agrawal and Goyal, 2012)
1: For each action $a = 1, 2, \ldots, K$, set $W_a = 0$, $L_a = 0$.
2: for $i = 1, 2, \ldots$ do
3:   For each action $a = 1, 2, \ldots, K$, sample $\theta_a \sim \mathrm{Beta}(W_a + 1, L_a + 1)$
4:   Execute action $a_i := \operatorname{argmax}_a \theta_a$, and observe reward $r_i \in \{0, 1\}$
5:   $W_{a_i} \leftarrow W_{a_i} + r_i$, $L_{a_i} \leftarrow L_{a_i} + (1 - r_i)$
6: end for

The Beta distribution is a convenient choice of prior: firstly, it is supported only on the interval $[0, 1]$, which is a necessary condition for modeling probabilities, and secondly, it is conjugate to the Bernoulli and Binomial distributions. Thus, if a Beta prior is coupled with Binomial data, then the posterior also has a Beta distribution. For each action, the corresponding data consists of the numbers of successes and failures observed for that action, and thus is naturally modeled via the Binomial distribution. In particular, after observing $W_a$ successes, or wins, and $L_a$ failures, or losses, one can show that the conjugate posterior takes the following form:

$$\theta_a \sim \mathrm{Beta}(\alpha_0 + W_a, \beta_0 + L_a), \quad a \in \{1, \ldots, K\}. \quad (2.11)$$

Algorithm 1 details the posterior sampling algorithm for this Bernoulli bandit problem, as given in Agrawal and Goyal (2012). The variables $W_a$ and $L_a$ keep track of the numbers of successes and failures observed for each action. The algorithm then alternates between sampling from the actions' posteriors in Eq. (2.11), selecting the action $a_i$ with the highest sampled reward, and updating the posterior parameters $W_{a_i}$ and $L_{a_i}$ given the observed reward.

Agrawal and Goyal (2012) also propose a more general posterior sampling algorithm for the K-armed bandit setting, in which rewards are not necessarily Bernoulli, but rather can lie in the entire $[0, 1]$ interval. In this procedure, detailed in Algorithm 2, the algorithm similarly maintains variables $W_a$ and $L_a$ for each action and draws samples from the posterior in Eq. (2.11). In contrast, however, after observing each reward $\tilde{r}_i \in [0, 1]$, the algorithm samples a Bernoulli random variable $r_i$ with success probability $\tilde{r}_i$, such that $r_i \sim \mathrm{Bernoulli}(\tilde{r}_i)$, and updates the counts $W_{a_i}$ and $L_{a_i}$ just as in Algorithm 1. Note that Algorithm 1 is a special case of Algorithm 2, and also that it is straightforward to adapt Algorithms 1 and 2 such that different actions have different prior parameters $\alpha_0$ and $\beta_0$.
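As an illustration of Algorithm 1, the following sketch simulates posterior sampling on a toy Bernoulli bandit; the true success probabilities and the simulation loop are illustrative, not part of the original algorithm specification.

```python
import numpy as np

def thompson_sampling_bernoulli(true_probs, T, seed=0):
    """Simulate Algorithm 1 on a Bernoulli bandit with the given success probabilities."""
    rng = np.random.default_rng(seed)
    K = len(true_probs)
    wins, losses = np.zeros(K), np.zeros(K)
    total_reward = 0
    for _ in range(T):
        theta = rng.beta(wins + 1, losses + 1)     # one posterior sample per action
        a = int(np.argmax(theta))                  # play the action with the best sample
        r = rng.binomial(1, true_probs[a])         # observe a Bernoulli reward
        wins[a] += r
        losses[a] += 1 - r
        total_reward += r
    return total_reward, wins, losses

print(thompson_sampling_bernoulli([0.2, 0.5, 0.7], T=1000))
```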

Algorithm 2 Posterior Sampling for general stochastic bandits (Agrawal and Goyal, 2012)
1: For each action $a = 1, 2, \ldots, K$, set $W_a = 0$, $L_a = 0$.
2: for $i = 1, 2, \ldots$ do
3:   For each action $a = 1, 2, \ldots, K$, sample $\theta_a \sim \mathrm{Beta}(W_a + 1, L_a + 1)$
4:   Execute action $a_i := \operatorname{argmax}_a \theta_a$, and observe reward $\tilde{r}_i \in [0, 1]$
5:   Sample $r_i \sim \mathrm{Bernoulli}(\tilde{r}_i)$
6:   $W_{a_i} \leftarrow W_{a_i} + r_i$, $L_{a_i} \leftarrow L_{a_i} + (1 - r_i)$
7: end for

A number of studies demonstrate that posterior sampling is not only empirically competitive with upper confidence bound approaches, but it often outperforms them

(Chapelle and Li, 2011; Granmo, 2010; May and Leslie, 2011; May, Korda, et al., 2012). Granmo (2010) proves that Algorithm 1 with $K = 2$ is asymptotically consistent, that is, its probability of selecting the optimal action approaches 1.

Agrawal and Goyal (2012) derive the first regret analysis for posterior sampling. Firstly, for $K = 2$, the authors demonstrate that posterior sampling has an expected regret of $O\left(\frac{\log T}{\Delta} + \frac{1}{\Delta^3}\right)$, where $\Delta = |\mu_1 - \mu_2|$. Secondly, for any number of actions $K$, the expected regret of posterior sampling is upper-bounded by $O\left(\left(\sum_{a : \mu_a < \mu^*} \frac{1}{\Delta_a^2}\right)^2 \log T\right)$, where recall that $\Delta_a := \mu^* - \mu_a$. These bounds have a finite-time logarithmic dependence in the total time $T$, which matches the lower bound for this setting.

Meanwhile, Russo and Van Roy (2016) present an information-theoretic perspective on bounding the Bayesian regret of posterior sampling algorithms. In particular, this work introduces a quantity called the information ratio, denoted by $\Gamma_i$:

$$\Gamma_i := \frac{\mathbb{E}\left[\mu^* - \mu_{a_i} \mid \mathcal{H}_i\right]^2}{I\left(a^*; (a_i, R_{a_i, i}) \mid \mathcal{H}_i\right)}, \quad (2.12)$$

where $\mathcal{H}_i := \{a_1, R_{a_1, 1}, \ldots, a_{i-1}, R_{a_{i-1}, i-1}\}$ is the history of actions and rewards observed during the first $i - 1$ time-steps of running the algorithm. The numerator of $\Gamma_i$ is the squared expected instantaneous regret given the history, while the denominator is the mutual information between the optimal action $a^*$ and the information gained during the current iteration $i$ (see Section 2.3 for the definition of mutual information). Smaller values of $\Gamma_i$ are preferred: intuitively, if the information ratio is small, then in each iteration, the algorithm must either incur little instantaneous regret or gain a significant amount of information about the optimal action. Thus, a bounded information ratio directly balances between exploration and exploitation.

Russo and Van Roy (2016) upper-bound the Bayesian regret of the bandit problem in terms of the information ratio:

$$\mathrm{BayesReg}(T) \leq \sqrt{\bar{\Gamma}\, H(a^*)\, T}, \quad (2.13)$$

where $H(a^*)$ is the entropy of the optimal action, and $\bar{\Gamma}$ is a uniform upper-bound upon the information ratio. By Fact 1, $H(a^*) \leq \log|\mathcal{A}| = \log K$. Russo and Van Roy (2016) demonstrate that in the K-armed bandit problem with $K$ actions, posterior sampling achieves $\Gamma_i \leq \frac{K}{2}$. This results in the following Bayesian regret bound for posterior sampling in the K-armed bandit setting:

$$\mathrm{BayesReg}(T) \leq \sqrt{\frac{1}{2} K T \log K}.$$

This regret bound has an optimal dependence upon the number of actions $K$ and time steps $T$ neglecting logarithmic factors, as Bubeck and Liu (2013) show that for any $T$ and $K$, there exists a prior over the environment such that $\mathrm{BayesReg}(T) \geq \frac{1}{20}\sqrt{KT}$. Beyond the K-armed bandit problem, Russo and Van Roy (2016) analyze the information ratio of posterior sampling in several more problem settings, one of which will be discussed later.

Information-Theoretic Regret Minimization Approaches
This section discusses several information-theoretic approaches to guiding action selection in the bandit problem. Firstly, the knowledge gradient algorithm makes a series of greedy decisions targeted at improving decision quality. The approach, first proposed in Gupta and Miescke (1996) and further studied in Frazier, Powell, and Dayanik (2008) and Ryzhov, Powell, and Frazier (2012), at each step selects the action predicted to maximally improve the expected reward of an exploit-only strategy (i.e., a strategy which terminates learning and repeatedly chooses the predicted optimal action). As discussed in Russo and Van Roy (2014a), however, the knowledge gradient algorithm lacks general theoretical guarantees, and one can construct cases in which it fails to identify the optimal action.

Secondly, Russo and Van Roy (2014a) propose the information-directed sampling framework, which uses the information ratio $\Gamma_i$, defined in Eq. (2.12), to guide action selection in the Bayesian setting. At each time step $i$, information-directed sampling selects the action which minimizes the information ratio, $\Gamma_i$. The minimum-possible $\Gamma_i$ cannot exceed the upper-bound for $\Gamma_i$ derived for posterior sampling in Russo and Van Roy (2016). Therefore, using the relationship between regret and $\Gamma_i$ given by Eq. (2.13), one can argue that the regret bounds for posterior sampling proven in Russo and Van Roy (2016) also hold for information-directed sampling. Empirically, however, information-directed sampling often outperforms posterior sampling (Russo and Van Roy, 2014a). As discussed in Russo and Van Roy (2014a), under certain types of information structures, information-directed sampling can be more sample efficient than upper confidence bound or posterior sampling approaches.

While the analyses in Russo and Van Roy (2016) and Russo and Van Roy (2014a) are restricted to the Bayesian regret setting, Kirschner and Krause (2018) define a frequentist counterpart to the information ratio, termed the regret-information ratio, and leverage it to develop a version of information-directed sampling with frequentist regret guarantees.

Bandits with Structured Reward Functions
Strategies for the K-armed bandit setting assume that different actions' outcomes are independent; as a result, regret bounds depend polynomially on the number of actions, and algorithms become highly inefficient in large action spaces. While such strategies are infeasible for learning over exponentially large or infinite action sets, a number of efficient learning algorithms have been developed for cases where rewards have a lower-dimensional structure. This subsection reviews work corresponding to several such reward structures.

For example, in the linear bandits problem, each action is represented by a $d$-dimensional feature vector. The expected reward for executing action $\mathbf{x} \in \mathcal{A} \subset \mathbb{R}^d$ is denoted by $R(\mathbf{x})$, and is a linear combination of these features:

$$\mathbb{E}[R(\mathbf{x})] = \mathbf{r}^T \mathbf{x},$$

where $\mathbf{r} \in \mathbb{R}^d$ is an unknown model parameter vector defining the rewards. Note that the action space $\mathcal{A} \subset \mathbb{R}^d$ could potentially be infinite in this setting.

A number of works study upper confidence bound approaches for the linear bandits problem (Dani, Hayes, and Kakade, 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori, Pál, and Szepesvári, 2011) and derive frequentist regret bounds.

At a high level, these approaches maintain a confidence region $\mathcal{R}_i$ which contains $\mathbf{r}$ with high probability at each step $i$. Actions are then selected optimistically, such that at each step $i$, the algorithm selects action $\mathbf{x}_i$ via:

$$\mathbf{x}_i = \operatorname*{argmax}_{\mathbf{r} \in \mathcal{R}_i,\, \mathbf{x} \in \mathcal{A}} \mathbf{r}^T \mathbf{x}. \quad (2.14)$$

Among confidence-based linear bandit algorithms, smaller confidence regions $\mathcal{R}_i$ result in tighter regret bounds. In particular, Abbasi-Yadkori, Pál, and Szepesvári (2011) introduced a novel martingale concentration inequality that enables a vastly-improved regret guarantee in comparison to prior work, for instance in Dani, Hayes, and Kakade (2008) and Rusmevichientong and Tsitsiklis (2010). In particular, Abbasi-Yadkori, Pál, and Szepesvári (2011) develop the OFUL algorithm for the linear bandit problem, and prove that with probability at least $1 - \delta$, its regret satisfies $\mathrm{Reg}(T) \leq O\left(d \log T \sqrt{T} + \sqrt{dT\log(T/\delta)}\right)$, where $\delta$ is an algorithmic parameter. The authors further prove a problem-dependent regret bound: with probability at least $1 - \delta$, $\mathrm{Reg}(T) \leq O\left(\frac{\log(1/\delta)}{\Delta}\left(\log T + d\log\log T\right)^2\right)$, where $\Delta$ (as defined in Dani, Hayes, and Kakade, 2008) can be intuitively understood as the gap in reward between the best and second-best corner decisions.² Formally, $\Delta$ is defined in terms of the set of extremal points $\mathcal{E}$, which consists of all the points in $\mathcal{A}$ that cannot be expressed as a convex combination of any other points in $\mathcal{A}$. The optimal action $\mathbf{x}^* = \operatorname{argmax}_{\mathbf{x} \in \mathcal{A}} \mathbf{r}^T \mathbf{x}$ is guaranteed to belong to $\mathcal{E}$. Then, defining $\mathcal{E}_- := \mathcal{E} \setminus \mathbf{x}^*$, the gap is given by:

$$\Delta = \mathbf{r}^T \mathbf{x}^* - \sup_{\mathbf{x} \in \mathcal{E}_-} \mathbf{r}^T \mathbf{x}.$$

Meanwhile, posterior sampling algorithms for linear bandits (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) place a prior upon $\mathbf{r}$, $\mathbf{r} \sim p(\mathbf{r})$, and maintain a model posterior over $\mathbf{r}$, $p(\mathbf{r} \mid \mathcal{D})$, where $\mathcal{D}$ is the evolving dataset of agent-environment interactions. For instance, this can be performed via Bayesian linear regression. In each iteration, the algorithm samples $\tilde{\mathbf{r}} \sim p(\mathbf{r} \mid \mathcal{D})$, and chooses the optimal action as if $\tilde{\mathbf{r}}$ is the model parameter vector: $\mathbf{x}_i = \operatorname{argmax}_{\mathbf{x} \in \mathcal{A}} \tilde{\mathbf{r}}^T \mathbf{x}$.

Agrawal and Goyal (2013) and Abeille and Lazaric (2017) both provide high-probability frequentist regret bounds for posterior sampling-based algorithms for the linear bandits problem. Both of these analyses upper-bound the $T$-step regret by $\tilde{O}(d^{3/2}\sqrt{T})$ with high probability, where the $\tilde{O}$-notation hides logarithmic factors. More specifically, Agrawal and Goyal (2013) prove that with probability at least $1 - \delta$, $\mathrm{Reg}(T) \leq O\left(d^{3/2}\sqrt{T}\left(\log T\sqrt{\log T} + \log(1/\delta)\right)\right)$ and in addition, that $\mathrm{Reg}(T) \leq O\left(d\sqrt{T\log K}\left(\log T\sqrt{\log T} + \log(1/\delta)\right)\right)$.
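The sketch below illustrates this posterior sampling scheme for a finite set of feature vectors, using Bayesian linear regression with an (assumed) zero-mean Gaussian prior on $\mathbf{r}$ and known noise variance; the reward oracle and all parameter values are placeholders.

```python
import numpy as np

def linear_thompson_sampling(actions, reward_fn, T, noise_var=0.1, prior_var=1.0, seed=0):
    """Posterior sampling for a linear bandit: maintain a Bayesian linear regression
    posterior over r, sample r_tilde each round, and play argmax_x r_tilde^T x."""
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    precision = np.eye(d) / prior_var              # posterior precision (starts at the prior)
    b = np.zeros(d)                                # accumulates x_i * y_i / noise_var
    for _ in range(T):
        cov = np.linalg.inv(precision)
        r_tilde = rng.multivariate_normal(cov @ b, cov)
        x = actions[int(np.argmax(actions @ r_tilde))]
        y = reward_fn(x) + np.sqrt(noise_var) * rng.standard_normal()
        precision += np.outer(x, x) / noise_var
        b += x * y / noise_var
    return np.linalg.inv(precision) @ b            # posterior mean estimate of r

rng = np.random.default_rng(1)
actions = rng.standard_normal((20, 3))             # 20 candidate actions in R^3
r_true = np.array([0.5, -0.2, 1.0])                # unknown parameter vector (for simulation)
print(linear_thompson_sampling(actions, lambda x: x @ r_true, T=500))
```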

Notably, in the linear bandit setting, posterior sampling is more computationally efficient than confidence-based approaches; the latter involve solving the optimization problem in Eq. (2.14), which is more costly than updating a model posterior and sampling from it via the procedure in Agrawal and Goyal (2013) or Abeille and Lazaric (2017).

²Note that $\Delta$ is only well-defined for polytopic action spaces, and not, e.g., for spherical ones.

While the preceding analyses upper-bound the frequentist regret with high probability, Russo and Van Roy (2016) and Dong and Van Roy (2018) derive Bayesian regret bounds for the linear bandits problem by upper-bounding the information ratio for this setting. In particular, Dong and Van Roy (2018) derive a Bayesian regret bound on the order of $d\sqrt{T\log\left(3 + \frac{T}{d}\right)}$, which is competitive with other results for the Bayesian linear bandits setting, for instance the Bayesian regret analysis in Russo and Van Roy (2014b).

More generally, the linear bandit problem is a special case of the generalized linear bandit setting, in which the reward feedback $R(\mathbf{x})$ for each action $\mathbf{x} \in \mathcal{A} \subset \mathbb{R}^d$ is generated according to:

$$\mathbb{E}[R(\mathbf{x})] = g(\mathbf{r}^T \mathbf{x}),$$

where $\mathbf{r} \in \mathbb{R}^d$ is an unknown model parameter vector, and $g$ is a monotonically-increasing link function satisfying appropriate conditions; for instance, a sigmoidal $g$ results in the logistic bandit problem. For this setting, Filippi et al. (2010) provide a regret bound for a confidence set approach, while Abeille and Lazaric (2017) bound the regret for posterior sampling and Dong and Van Roy (2018) bound the Bayesian regret for posterior sampling, again by deriving an upper bound for the problem's information ratio.

Meanwhile, in the Gaussian process bandit setting (Srinivas et al., 2010; Chowdhury and Gopalan, 2017), the reward function is assumed to be an arbitrary, non-parametric, smooth function of the action space features. More formally, $\mathbb{E}[R(\mathbf{x})] = f(\mathbf{x})$ for $\mathbf{x} \in \mathcal{A} \subset \mathbb{R}^d$, and a Gaussian process prior is placed over the unknown function $f$, such that $f(\mathbf{x}) \sim \mathcal{GP}(0, \mathcal{K}(\mathbf{x}, \mathbf{x}'))$. Smoothness assumptions about $f$ are encoded through the choice of kernel, $\mathcal{K}$; intuitively, actions with similar feature vectors have similar rewards. Then, given reward observations, a Gaussian process model posterior $P(\mathbf{f} \mid \mathcal{D})$ is fit to the data, $\mathcal{D}$. Srinivas et al. (2010) introduced the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm for this setting, and proved the first sub-linear regret guarantee for Gaussian process bandits. Chowdhury and Gopalan (2017) proposed a confidence-based approach with tighter regret bounds than Srinivas et al. (2010), and additionally introduced and derived a regret bound for GP-TS, a posterior sampling Gaussian process bandit algorithm. Because Gaussian processes can be represented in an infinite-dimensional linear space defined by the kernel function $\mathcal{K}$, the Gaussian process bandit setting can be viewed as a generalization of the generalized linear bandit setting discussed previously.

Other structural assumptions in the literature include bandits with Lipschitz-continuous reward functions (Kleinberg, Slivkins, and Upfal, 2008) and convex reward functions (Saha and Tewari, 2011; Hazan and Levy, 2014).

Related Problem Settings
Beyond the standard bandit problem, a number of related problem settings have been studied. For instance, the active learning problem focuses exclusively on learning an accurate model rather than maximizing the cumulative utility of decision-making (Golovin and Krause, 2011; Sadigh et al., 2017; Houlsby et al., 2011; Brochu, de Freitas, and Ghosh, 2008). Similarly, Bayesian optimization (Brochu, Cora, and de Freitas, 2010; Hoffman, Brochu, and de Freitas, 2011) aims to identify the optimum of an expensive-to-evaluate function within a limited budget of queries, rather than minimize the regret:

$$\max_{\mathbf{x} \in \mathcal{A} \subset \mathbb{R}^d} f(\mathbf{x}).$$

Bayesian optimization is closely related to the Gaussian process bandit setting, as many Bayesian optimization algorithms use Gaussian process methods to model the objective function. Furthermore, approaches such as GP-UCB (Srinivas et al., 2010), which have bounded regret in the multi-armed bandit setting, are also applicable to Bayesian optimization; as Brochu, Cora, and de Freitas (2010) point out, a Bayesian optimization problem can be cast as a bandit problem, such that its objective function $f$ becomes the bandit problem's reward function. Then, any bandit algorithm with sublinear regret, for instance GP-UCB, must converge to identifying the objective function's optimum. Other approaches to Bayesian optimization include maximizing the expected improvement over the highest objective value observed so far (Jones, Schonlau, and Welch, 1998), and entropy search (Hennig and Schuler, 2012), which selects points to minimize the entropy, or uncertainty, of $f$.

Another related problem is the adversarial bandit setting (Saha and Tewari, 2011; Hazan and Levy, 2014), in which rewards are selected by an adversary rather than stochastically from fixed probability distributions, as in the stochastic bandit setting.

Meanwhile, the agent can receive reward information in many different ways. While in the bandit setting, the agent only observes the reward corresponding to the specific action selected, in the case of expert advice, all actions' rewards are revealed in every iteration (Cesa-Bianchi and Lugosi, 2006). In the semi-bandit feedback setting (Neu and Bartók, 2013; Russo and Van Roy, 2016), the algorithm selects a subset of the action space in each iteration, and the environment reveals the selected actions' total reward.

Environment feedback can also take various forms in addition to numerical rewards. In particular, the following subsection will discuss learning from pairwise preference feedback and review several other forms of feedback.

2.5 Dueling Bandits
In a number of real-world situations, absolute numerical feedback is unavailable. For instance, as discussed in Chapter 1, humans often cannot reliably specify numerical scores. Yet, in many cases, people can more accurately provide various forms of qualitative feedback, and in particular, much of this thesis focuses on online learning from pairwise preferences. In the bandit setting, preference-based learning is called the dueling bandits problem and was first formalized in Yue, Broder, et al. (2012). The dueling bandit setting is a sequential optimization problem, in which the learning agent selects actions and receives relative pairwise preference feedback between them. In other words, the agent poses queries of the form, "Do you prefer A or B?" The dueling bandits paradigm formalizes the problem of online regret minimization through preference-based feedback.

The dueling bandits setting is defined by a tuple, $(\mathcal{A}, \phi)$, in which $\mathcal{A}$ is the set of possible actions and $\phi$ is a function defining the feedback generation mechanism. The setting proceeds in a sequence of iterations or rounds. In each iteration $i$, the algorithm chooses a pair of actions $\mathbf{x}_{i1}, \mathbf{x}_{i2} \in \mathcal{A}$ to duel or compare, and observes the outcome of that comparison ($\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ can be identical).³ The outcome of each duel between $\mathbf{x}, \mathbf{x}' \in \mathcal{A}$ represents a preference between them, and is an independent sample of a Bernoulli random variable. We denote a preference for action $\mathbf{x}$ over action $\mathbf{x}'$ as $\mathbf{x} \succ \mathbf{x}'$, and define the probability that $\mathbf{x}$ is preferred to $\mathbf{x}'$ as:

$$P(\mathbf{x} \succ \mathbf{x}') := \phi(\mathbf{x}, \mathbf{x}') + \frac{1}{2},$$

where $\phi(\mathbf{x}, \mathbf{x}') \in [-1/2, 1/2]$ denotes the stochastic preference between $\mathbf{x}$ and $\mathbf{x}'$, so that $P(\mathbf{x} \succ \mathbf{x}') > \frac{1}{2}$ if and only if $\phi(\mathbf{x}, \mathbf{x}') > 0$.

In each learning iteration $i$, the preference for the $i^{\text{th}}$ query is denoted by $y_i := \mathbb{I}[\mathbf{x}_{i2} \succ \mathbf{x}_{i1}] - \frac{1}{2} \in \left\{-\frac{1}{2}, \frac{1}{2}\right\}$, where $\mathbb{I}[\cdot]$ denotes the indicator function, so that $P\left(y_i = \frac{1}{2}\right) = 1 - P\left(y_i = -\frac{1}{2}\right) = P(\mathbf{x}_{i2} \succ \mathbf{x}_{i1}) = \phi(\mathbf{x}_{i2}, \mathbf{x}_{i1}) + \frac{1}{2}$. The outcome $y_i$ is always a binary preference (not a tie), but preference feedback need not be received for every preference query.

³The actions are bolded because they could be represented by $d$-dimensional vectors (i.e., $\mathcal{A} \subset \mathbb{R}^d$); however, such a representation is not required.

Bounding the regret of a bandit algorithm requires some assumptions on the tuple $(\mathcal{A}, \phi)$. For instance, if $\mathcal{A}$ is an infinite set and the outcomes associated with different actions are independent—that is, observing $\mathbf{x} \succ \mathbf{x}'$ yields no information about any action $\mathbf{x}'' \notin \{\mathbf{x}, \mathbf{x}'\}$—then, it would be impossible to achieve sublinear regret. Thus, either $\mathcal{A}$ must be finite, or there must be structural dependencies between outcomes associated with different actions.⁴

In the dueling bandits setting, the agent’s decision-making quality can be quantified using a notion of cumulative regret in terms of the preference probabilities:

$$\mathrm{Reg}^{DB}_{\phi}(T) = \sum_{i=1}^{T} \left[\phi(\mathbf{x}^*, \mathbf{x}_{i1}) + \phi(\mathbf{x}^*, \mathbf{x}_{i2})\right], \quad (2.15)$$

where $\mathbf{x}^*$ is the optimal action. Importantly, this regret definition assumes the existence of a Condorcet winner, that is, a unique action $\mathbf{x}^*$ superior to all other actions (Sui, Zoghi, et al., 2018); formally, $\mathbf{x}^* \in \mathcal{A}$ is a Condorcet winner if $P(\mathbf{x}^* \succ \mathbf{x}) > \frac{1}{2}$ for any $\mathbf{x} \neq \mathbf{x}^* \in \mathcal{A}$. When the learning algorithm has converged to the best action $\mathbf{x}^*$, then it can simply duel $\mathbf{x}^*$ against itself, incurring no additional regret. Intuitively, one can interpret Eq. (2.15) as a measure of how much the user(s) would have preferred the best action over the ones presented by the algorithm.

The Multi-Dueling Bandits Setting
The multi-dueling bandits setting (Brost et al., 2016; Sui, Zhuang, et al., 2017) generalizes the dueling bandits setting such that in each iteration, the algorithm can select more than 2 actions. More specifically, in the $i^{\text{th}}$ iteration, the algorithm selects a set $S_i \subset \mathcal{A}$ of actions, and observes outcomes of duels between some pairs of the actions in $S_i$. These observed preferences could correspond to all action pairs in $S_i$ or to any subset. The number of actions selected in each iteration is assumed to be a fixed constant, $m := |S_i|$. When $m = 2$, the problem reduces to the original dueling bandits setting. For any $m \geq 2$, the dueling bandits regret formulation in Eq. (2.15) extends to:

$$\mathrm{Reg}^{MDB}_{\phi}(T) = \sum_{i=1}^{T} \sum_{\mathbf{x} \in S_i} \phi(\mathbf{x}^*, \mathbf{x}). \quad (2.16)$$

⁴Both of these conditions can also be true simultaneously.

The learning algorithm seeks action subsets $S_i$ that minimize the cumulative regret in Eq. (2.16). Intuitively, the algorithm must explore the action space while aiming to minimize the number of times that it selects suboptimal actions. As the algorithm converges to the best action $\mathbf{x}^*$, then it increasingly chooses $S_i$ to only contain $\mathbf{x}^*$, thus incurring no additional regret.

The multi-dueling setting could correspond to a number of different feedback collection mechanisms. For example, in some applications it may be viable to collect pairwise feedback between all pairs of actions in $S_i$. In other applications, it is more realistic to observe a subset of these preferences, for example the "winner" of $S_i$, that is, a preference set of the form, $\{\mathbf{x} \succ \mathbf{x}' \mid \mathbf{x}' \in S_i \setminus \mathbf{x}\}$ for some $\mathbf{x} \in S_i$.

Algorithms for the Dueling (and Multi-Dueling) Bandits Setting
A number of algorithms have been proposed for the preference-based bandit setting, many of which are equipped with regret bounds. Earlier work on dueling bandits considered an "explore-then-exploit" algorithmic structure; examples include the Interleaved Filter (Yue, Broder, et al., 2012), Beat-the-Mean (Yue and Joachims, 2011), and SAVAGE (Urvoy et al., 2013) algorithms. Confidence-based approaches include RUCB (Zoghi, Whiteson, Munos, et al., 2014), MergeRUCB (Zoghi, Whiteson, and Rijke, 2015), and CCB (Zoghi, Karnin, et al., 2015). The RMED algorithm (Komiyama et al., 2015), meanwhile, defines Bernoulli distributions whose parameters are actions' probabilities of being preferred to one another. The approach calculates Kullback-Leibler divergences between these distributions, and uses them to estimate each action's probability of being the Condorcet winner and to decide whether actions have been sufficiently explored. RMED is optimal in the sense that when a Condorcet winner exists, the constant factor in its regret matches the regret lower bound for the dueling bandits problem.

Ailon, Karnin, and Joachims (2014) propose several strategies for reducing the dueling bandits problem to a standard multi-armed bandit problem with numerical feedback. In particular, this work introduces the Sparring method, in which two multi-armed bandit algorithms are trained in parallel. These two algorithms could be instances of any standard multi-armed bandit algorithm, for instance RUCB or

RMED. In every iteration, each of the two algorithms selects an action; let x_{i1} and x_{i2} be the actions selected by algorithms 1 and 2 in iteration i, respectively. These two actions are dueled against each other. If x_{i1} wins, then algorithm 1 receives a reward of 1 and algorithm 2 receives a reward of 0; conversely, if x_{i2} wins, then algorithm 1 receives a reward of 0 and algorithm 2 receives a reward of 1.

Algorithm 3 IndependentSelfSparring (Sui, Zhuang, et al., 2017)
1: Input: m = the number of actions taken in each iteration, η = the learning rate
2: For each action a = 1, 2, ..., A, set W_a = 0, L_a = 0.
3: for i = 1, 2, ... do
4:     for j = 1, ..., m do
5:         For each action a = 1, 2, ..., A, sample θ_a ∼ Beta(W_a + 1, L_a + 1)
6:         Select a_j(i) := argmax_a θ_a
7:     end for
8:     Execute m actions: {a_1(i), ..., a_m(i)}
9:     Observe pairwise feedback matrix R ∈ {0, 1, ∅}^{m×m}
10:    for j, l = 1, ..., m do
11:        if r_{jl} ≠ ∅ then
12:            W_{a_j(i)} ← W_{a_j(i)} + η · r_{jl},  L_{a_j(i)} ← L_{a_j(i)} + η · (1 − r_{jl})
13:        end if
14:    end for
15: end for
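To make the update rule concrete, here is a minimal Python sketch of self-sparring with independent Beta posteriors, in the spirit of Algorithm 3; the simulated preference oracle duel and the true win probabilities are hypothetical stand-ins for human feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, eta, n_iters = 5, 2, 1.0, 2000
true_utility = np.linspace(0.0, 1.0, K)          # action K-1 is best (hypothetical)
W, L = np.zeros(K), np.zeros(K)                  # win/loss counts per action

def duel(a, b):
    """Return 1 if action a beats action b, using a linear link on utilities."""
    p = 0.5 + 0.5 * (true_utility[a] - true_utility[b])
    return rng.random() < p

for _ in range(n_iters):
    # Sample a reward estimate for every action and select the m maximizers (with replacement).
    chosen = [int(np.argmax(rng.beta(W + 1, L + 1))) for _ in range(m)]
    # Observe pairwise feedback among the selected actions and update the Beta posteriors.
    for j in range(m):
        for l in range(j + 1, m):
            a, b = chosen[j], chosen[l]
            if a == b:
                continue
            r = duel(a, b)
            W[a] += eta * r;       L[a] += eta * (1 - r)
            W[b] += eta * (1 - r); L[b] += eta * r

print("Estimated best action:", int(np.argmax(W / np.maximum(W + L, 1))))
```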

There are also a number of posterior sampling-based strategies for the dueling bandit problem. The Relative Confidence Sampling algorithm, for instance, on each iteration selects one action via posterior sampling and another via an upper confidence bound strategy (Zoghi, Whiteson, De Rijke, et al., 2014). Meanwhile, the Double Thompson Sampling algorithm (Wu and Liu, 2016) selects both actions in each iteration via posterior sampling; however, each action is selected from a restricted set. The first action is selected among those thought to be relatively well performing, while the second is chosen among actions with less certain priors.

In particular, SelfSparring (Sui, Zhuang, et al., 2017) is a state-of-the-art posterior sampling framework for preference-based learning in the multi-dueling bandits setting, inspired by Sparring (Ailon, Karnin, and Joachims, 2014). SelfSparring maintains a distribution over the probability of selecting each action, and updates it during each iteration based on the observed preferences. Actions are selected by drawing independent samples from this probability distribution, and for each sample, selecting the action with the highest sampled reward. Intuitively, a flatter sampling distribution results in more exploration, while more peaked sampling distributions reflect certainty about the optimal action and yield more exploitation.

Algorithm 3 describes SelfSparring for the case of a finite action set (|A| = A) with independent actions, that is, observing a preference x ≻ x′ does not yield information about the performance of any action x″ ∉ {x, x′}. SelfSparring maintains a Beta posterior for each action, and updates it similarly to the Bernoulli bandit posterior sampling algorithm (Algorithm 1). For each action a, the variables W_a and L_a, respectively, record the number of times that the action wins and loses against any other action (possibly scaled by a learning rate, η). Actions are then selected by drawing a sample for each action a ∈ A, θ_a ∼ Beta(W_a + 1, L_a + 1), and choosing the action for which θ_a is maximized. Sui, Zhuang, et al. (2017) derive a no-regret guarantee for Algorithm 3 when the actions have a total ordering with respect to the preferences.

Additionally, Sui, Zhuang, et al. (2017) propose a kernel-based variant of SelfSparring called KernelSelfSparring, which leverages Gaussian process regression to model dependencies among rewards corresponding to different actions. Other dueling bandit approaches which handle structured action spaces include the Dueling Bandit Gradient Descent algorithm for convex utility functions over continuous action spaces (Yue and Joachims, 2009); the Correlational Dueling Bandits algorithm, in which a large number of actions have a low-dimensional correlative structure based on a given similarity function (Sui, Yue, and Burdick, 2017); and the StageOpt algorithm for safe learning with Gaussian process dueling bandits (Sui, Zhuang, et al., 2018).

The multi-dueling bandits setting is also considered in Brost et al. (2016), which proposes a confidence interval strategy for estimating preference probabilities.

Related Problem Settings

Aside from pairwise preferences, a number of other forms of qualitative reward feedback have also been studied. For instance, in the coactive feedback problem (Shivaswamy and Joachims, 2015; Shivaswamy and Joachims, 2012; Raman et al., 2013), the agent selects an action a_i in the i-th iteration, and the environment reveals an action a_i′ which is preferred to a_i. With ordinal feedback, meanwhile, the user assigns each trial to one of a discrete set of ordered categories (Chu and Ghahramani, 2005a; Herbrich, Graepel, and Obermayer, 1999). In the "learning to rank" problem, an agent learns a ranking function for information retrieval using feedback such as users' clickthrough data (Burges, Shaked, et al., 2005; Radlinski and Joachims, 2005; Burges, Ragno, and Le, 2007; Liu, 2009; Schuth, Sietsma, et al., 2014; Schuth, Oosterhuis, et al., 2016; Yue, Finley, et al., 2007).

2.6 Episodic Reinforcement Learning

In the episodic reinforcement learning (RL) problem setting (Sutton and Barto, 2018; Dann and Brunskill, 2015; Osband, Russo, and Van Roy, 2013), an agent interacts with an environment in a series of episodes, with the goal of maximizing its cumulative reward over time. The RL setting differs from the bandit setting in that the reward distributions depend upon the environment's state, which changes stochastically in each time step as a function of the previous state and the selected action.

The tabular RL problem is formalized as a Markov decision process (MDP), defined via the tuple M = (S, A, r, p, p_0, h), where the state space S and action space A are finite sets with cardinalities S = |S| and A = |A|. At each time step t, the agent observes the environment's current state s_t, takes an action a_t, and then observes a reward r_t and the subsequent state, s_{t+1}. The episode's initial state is sampled from the distribution p_0 such that s_1 ∼ p_0, while p defines the transition dynamics: s_{t+1} ∼ p(· | s_t, a_t). Because s_{t+1} does not depend on any states or actions prior to time t, the environment is Markovian. Finally, r defines the expected reward given a state and action: E[r_t] = r(s_t, a_t).⁵

The agent interacts with the environment through a sequence of episodes of length h, where h is known as the episode horizon time. Each episode is a trajectory of states, actions, and rewards of the form:

τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_h, a_h, s_{h+1}}.

Many episodic RL algorithms select a single action selection policy per episode. In this context, a policy π is a possibly-stochastic mapping from states and time indices (within an episode) to actions: π : S × {1, ..., h} → A. In each iteration i, the agent selects a policy π_i and executes it to observe a trajectory, τ_i.

⁵Similarly, r could be defined as a function of just the current state, E[r_t] = r(s_t), or of the state, action, and next state, E[r_t] = r(s_t, a_t, s_{t+1}).

Given a policy π, the value function is defined as the expected total reward accrued when starting in state s at step j, and following π for the rest of the episode:

V_{π,j}(s) = E[ Σ_{t=j}^{h} r(s_t, π(s_t, t)) | s_j = s ].    (2.17)

Similarly, the action-value function is the expected total reward when starting in state s at step j, taking action a, and then following π until the episode's completion:

Q_{π,j}(s, a) = r(s, a) + E[ Σ_{t=j+1}^{h} r(s_t, π(s_t, t)) | s_j = s, a_j = a ].

An optimal policy π* is one which maximizes the expected total reward over the episode:

π* = arg sup_π Σ_{s∈S} p_0(s) V_{π,1}(s),

where π is maximized over the set of all possible policies.

The agent's learning goal is to minimize the regret, which is defined as the gap between the total reward obtained when repeatedly executing π* and the total reward accrued by the learning algorithm:

E[Reg(T)] = E[ Σ_{i=1}^{⌈T/h⌉} Σ_{s∈S} p_0(s) ( V_{π*,1}(s) − V_{π_i,1}(s) ) ],

where T is the total number of time-steps of agent-environment interaction (i.e., the number of episodes multiplied by h), and where the expected value is taken with respect to randomness in the policy selection algorithm and randomness in the rewards (which affect future policy selection). The Bayesian regret BayesReg(T) is defined similarly, but the expected value is additionally taken with respect to randomness in the MDP parameters p, p_0, and r, which are assumed to be sampled from a prior over MDP environments.

Remark 1. Note that the episodic RL setting stands in contrast to the infinite-horizon RL problem. The latter is also defined as a Markov decision process, except without the time horizon h (since the interaction consists of a single, infinite-length trajectory). The MDP is defined as M = (S, A, r, p, p_0, γ), where γ ∈ (0, 1) is the discount factor used to quantify the total reward. Unlike in the episodic setting, the value does not depend on the initial state, the policy is time-independent, and the discounted value function (Sutton and Barto, 2018) is also time-step-independent and is defined as:

V_π(s) = E[ Σ_{t=1}^{∞} γ^{t−1} r(s_t, π(s_t)) ].

While a number of infinite-horizon RL studies derive performance guarantees for the discounted reward setting (Kearns and Singh, 2002; Lattimore and Hutter, 2012), others consider the average reward setting (Jaksch, Ortner, and Auer, 2010; Bartlett and Tewari, 2012; Agrawal and Jia, 2017), which removes the factor γ and defines V_π(s) as follows:

V_π(s) = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r(s_t, π(s_t)) ].

Using Value Iteration to Compute Optimal Policies

When the transition dynamics p and rewards r are known, then it is possible to exactly calculate the optimal, reward-maximizing policy via a dynamic programming method called finite-horizon value iteration (Sutton and Barto, 2018). In practice, the rewards and dynamics are typically unknown and must be learned, so that value iteration is insufficient to obtain the best policy. As will be discussed in the next subsection, however, value iteration can still be a useful component within algorithms that learn the optimal policy.

The value iteration procedure operates as follows. Firstly, the value function V_{π,t}(s) is defined as in Eq. (2.17) for each state and time-step in the episode. Because the episode stops accruing reward at its completion, one can set V_{π,h+1}(s) := 0 for each s ∈ S. Then, successively for each t ∈ {h, h−1, ..., 1}, π(s, t) is set deterministically to maximize the total reward starting from time t, and the Bellman equation is used to calculate V_{π,t}(s) given p and r. Formally, for each t ∈ {h, h−1, ..., 1}:

π(s, t) = argmax_{a∈A} [ r(s, a) + Σ_{s′∈S} P(s_{t+1} = s′ | s_t = s, a_t = a) V_{π,t+1}(s′) ]  for all s ∈ S,

V_{π,t}(s) = Σ_{a∈A} I[π(s,t) = a] [ r(s, a) + Σ_{s′∈S} P(s_{t+1} = s′ | s_t = s, a_t = a) V_{π,t+1}(s′) ]  for all s ∈ S.

Note that value iteration results in only deterministic policies, of which there are finitely-many (more precisely, there are A^{Sh}).
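The backward recursion above translates directly into code. Below is a minimal Python sketch of finite-horizon value iteration for a tabular MDP, assuming known dynamics p (shape S×A×S) and rewards r (shape S×A); the array names and the toy example are illustrative, not from the thesis.

```python
import numpy as np

def finite_horizon_value_iteration(p, r, h):
    """Compute an optimal deterministic, time-dependent policy for an episodic MDP.

    p[s, a, s'] = transition probability, r[s, a] = expected reward, h = horizon.
    Returns policy pi (shape h x S) and values V (shape (h+1) x S), with V[h] = 0.
    """
    S, A = r.shape
    V = np.zeros((h + 1, S))          # V[h] = 0: no reward accrues after the episode ends
    pi = np.zeros((h, S), dtype=int)
    for t in range(h - 1, -1, -1):    # t = h-1, ..., 0 (0-indexed time steps)
        Q = r + p @ V[t + 1]          # Q[s, a] = r(s, a) + sum_s' p(s'|s, a) V_{t+1}(s')
        pi[t] = np.argmax(Q, axis=1)
        V[t] = Q[np.arange(S), pi[t]]
    return pi, V

# Tiny example: 2 states, 2 actions, horizon 3.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
pi, V = finite_horizon_value_iteration(p, r, h=3)
print(pi, V[0])
```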

Reinforcement Learning Algorithms

RL algorithms can be broadly divided into two categories: model-based and model-free approaches. Model-based algorithms explicitly construct a model of the environment as they interact with it, for instance storing information about state transitions and rewards. Model-free algorithms, in contrast, do not attempt to directly learn the MDP's parameters, but rather only model value function information.

Both model-free and model-based RL algorithms have leveraged deep learning to successfully execute tasks in such domains as Atari games (Mnih, Kavukcuoglu, Silver, Graves, et al., 2013; Mnih, Kavukcuoglu, Silver, Rusu, et al., 2015; Kaiser et al., 2019), simulated physics-based tasks in the Mujoco environment (Lillicrap et al., 2015; Chua et al., 2018), and robotics (Ebert et al., 2018; Ruan et al., 2019). Model-based RL algorithms are often more sample-efficient than model-free algorithms (Kaiser et al., 2019; Ebert et al., 2018; Plaat, Kosters, and Preuss, 2020), as they explicitly build up a dynamics model of the environment.

There is also significant work on analyzing algorithmic performance in the tabular RL setting, in which there are finite numbers of states and actions. In addition to regret bounds, common performance metrics include sample complexity guarantees, which upper-bound the number of steps for which performance is not near-optimal. In particular, a PAC sample complexity guarantee states that with probability at least 1 − δ, in all but an upper-bounded number of time-steps, the values of the selected policies are within ε of the optimal policy's value.

There has been extensive work on confidence and posterior sampling approaches to RL (among other types of algorithms). These are both model-based methods, as they require learning estimates of the state transition probabilities and rewards.

Confidence-based algorithms for the episodic RL setting include the Bayesian Exploration Bonus algorithm (Kolter and Ng, 2009), which learns transition and reward information to estimate the action-value function Q(s, a) at each state-action pair. Actions are chosen greedily with respect to the sum of the Q-function and an exploration bonus given to less-explored state-action pairs. The authors prove a PAC sample complexity guarantee, specifically that with probability at least 1 − δ, the policy is ε-close to optimal for all but (at most) O( (S A h⁶/ε²) log(S A/δ) ) time-steps. Meanwhile, Dann and Brunskill (2015) dramatically improve upon this h⁶ dependence in the analysis of their upper-confidence fixed-horizon RL algorithm (UCFH). UCFH selects policies by optimistically estimating each policy's value, or total reward, and maximizing the expected total reward over an MDP confidence set containing the true MDP with high probability. This confidence region is constructed by defining a confidence interval around the empirical mean of each reward and transition dynamics parameter in the MDP. The authors prove that with probability at least 1 − δ, UCFH selects policies that are ε-close to optimal for all but (at most) Õ( (C S A h²/ε²) log(1/δ) ) steps, where the Õ-notation hides logarithmic factors and C ≤ S upper-bounds the number of possible successor states from any state in the MDP.

Algorithm 4 Posterior Sampling for Reinforcement Learning (PSRL) (Osband, Russo, and Van Roy, 2013)
1: Initialize prior for f_p                          ⊲ Initialize transition dynamics model
2: Initialize prior for f_r                          ⊲ Initialize reward model
3: for i = 1, 2, ... do
4:     Sample p̃ ∼ f_p(·)                             ⊲ Sample transition dynamics from posterior
5:     Sample r̃ ∼ f_r(·)                             ⊲ Sample rewards from posterior
6:     π_i ← optimal policy given p̃ and r̃            ⊲ Value iteration
7:     for t = 1, ..., h do                          ⊲ Execute policy π_i to roll out a trajectory
8:         Observe state s_t and take action a_t = π_i(s_t, t)
9:         Observe subsequent state s_{t+1} and reward r_t
10:    end for
11:    Update posteriors f_p, f_r given observed transitions and rewards
12: end for

Posterior sampling approaches to finite-horizon RL include the posterior sampling for reinforcement learning (PSRL) algorithm (Osband, Russo, and Van Roy, 2013), outlined in Algorithm 4. PSRL learns Bayesian posteriors over both the MDP's transition dynamics and rewards. In each learning iteration, the algorithm samples both dynamics and rewards from their posteriors, and computes the optimal policy for this sampled MDP via value iteration. The resultant policy is executed to get a roll-out trajectory, which is used to update the dynamics and reward posteriors. Osband, Russo, and Van Roy (2013) derive a Bayesian regret bound of O(hS√(AT log(SAT))) for PSRL after T steps of agent-environment interaction. Notably, computational results in Osband, Russo, and Van Roy (2013) demonstrate that posterior sampling outperforms confidence-based approaches, and Osband and Van Roy (2017) theorize further about the reasons behind this phenomenon.
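The following Python sketch illustrates the PSRL loop of Algorithm 4 for a tabular episodic MDP, using a Dirichlet posterior over transitions and a Gaussian posterior over mean rewards; a compact value-iteration step is inlined, and the simulated environment, priors, and noise model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, h, n_episodes = 3, 2, 5, 200

# True (hidden) environment, used only to simulate interaction.
true_p = rng.dirichlet(np.ones(S), size=(S, A))
true_r = rng.uniform(0, 1, size=(S, A))

# Posterior statistics: Dirichlet counts for dynamics, Gaussian (sum, count) for rewards.
dir_counts = np.ones((S, A, S))
r_sum, r_cnt = np.zeros((S, A)), np.zeros((S, A))

for _ in range(n_episodes):
    # Sample an MDP from the posterior.
    p_tilde = np.array([[rng.dirichlet(dir_counts[s, a]) for a in range(A)] for s in range(S)])
    r_tilde = rng.normal(r_sum / np.maximum(r_cnt, 1), 1.0 / np.sqrt(r_cnt + 1))

    # Finite-horizon value iteration on the sampled MDP.
    V = np.zeros(S)
    pi = np.zeros((h, S), dtype=int)
    for t in range(h - 1, -1, -1):
        Q = r_tilde + p_tilde @ V
        pi[t] = np.argmax(Q, axis=1)
        V = Q[np.arange(S), pi[t]]

    # Roll out the sampled policy in the true environment and update the posteriors.
    s = rng.integers(S)
    for t in range(h):
        a = pi[t, s]
        reward = true_r[s, a] + 0.1 * rng.normal()
        s_next = rng.choice(S, p=true_p[s, a])
        dir_counts[s, a, s_next] += 1
        r_sum[s, a] += reward
        r_cnt[s, a] += 1
        s = s_next
```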

Beyond finite-horizon RL, upper confidence bound methods (Jaksch, Ortner, and Auer, 2010; Bartlett and Tewari, 2012) and posterior sampling techniques (Agrawal and Jia, 2017; Ouyang et al., 2017; Gopalan and Mannor, 2015) have also been demonstrated to perform well in the infinite-horizon RL domain.

Meanwhile, randomized least-squares value iteration (RLSVI), proposed in Osband, Van Roy, and Wen (2016), is a model-free approach inspired by posterior sampling. RLSVI maintains a distribution over plausible linearly-parameterized value functions, and explores by randomly sampling from these distributions and selecting the actions which are optimal under these value functions. The authors prove an expected regret bound of Õ(√(h³SAT)).

Finally, information-directed sampling methods select actions by estimating the information ratio proposed in Russo and Van Roy (2016) and given by Eq. (2.12), which can be adapted from the bandit setting to RL. As in the bandit setting, the information ratio can be intuitively understood as the squared expected instantaneous regret divided by the current iteration's information gain. Information-directed sampling, first developed in Russo and Van Roy (2014a) for the bandit problem, selects the actions that minimize the information ratio.

Zanette and Sarkar (2017) propose both exact and approximate methods for information-directed RL. Their exact method identifies the true information-ratio-minimizing policy, and has a Bayesian regret bound of √((1/2) S A T log A). This result is obtained by assuming known transition dynamics and applying a result from Russo and Van Roy (2016) for linear bandits. The method, however, is computationally intractable for all but the smallest MDPs; thus, Zanette and Sarkar (2017) also propose several approximations to the algorithm, for instance only optimizing the information ratio over a reduced subset of policies in each episode, where the policies considered are obtained using posterior sampling. In experiments, the approximate information-directed sampling algorithm significantly outperforms PSRL.

Lastly, Nikolov et al. (2018) extend frequentist information-directed exploration to the deep RL setting, and estimate the information ratio via a Bootstrapped DQN (Deep Q-Network; Osband, Blundell, et al., 2016) that learns the action-value function. The resulting algorithm uses information-theoretic methods to account for heteroscedastic noise (i.e., the distribution of the reward noise varies depending on the state and action), and outperforms state-of-the-art algorithms on Atari games.

Preference-Based Reinforcement Learning

Existing work on preference-based RL has shown successful performance in a number of applications, including Atari games and the Mujoco environment (Christiano et al., 2017; Ibarz et al., 2018), autonomous driving (Sadigh et al., 2017; Bıyık, Huynh, et al., 2020), and robot control (Kupcsik, Hsu, and Lee, 2018; Akrour, Schoenauer, Sebag, and Souplet, 2014). Yet, the preference-based RL literature still lacks theoretical guarantees.

Much of the existing work in preference-based RL handles a distinct setting from that considered in this thesis. While the work found in this thesis is concerned with online regret minimization, several existing algorithms instead minimize the number of preference queries (Christiano et al., 2017; Wirth, Fürnkranz, and Neumann, 2016).

Such algorithms, for instance those which apply deep learning, typically assume that many simulations can be cheaply run between preference queries. In contrast, the present setting assumes that experimentation is as expensive as preference elicitation; this includes any situation in which a simulator is unavailable.

Existing approaches for trajectory-level preference-based RL may be broadly divided into three categories (Wirth, 2017): a) directly optimizing policy parameters (Wilson, Fern, and Tadepalli, 2012; Busa-Fekete et al., 2014; Kupcsik, Hsu, and Lee, 2018); b) modeling preferences between actions in each state (Fürnkranz, Hüllermeier, et al., 2012); and c) learning a utility function to characterize the rewards, returns, or values of state-action pairs (Wirth and Fürnkranz, 2013b; Wirth and Fürnkranz, 2013a; Akrour, Schoenauer, and Sebag, 2012; Wirth, Fürnkranz, and Neumann, 2016; Christiano et al., 2017). In c), the utility is often modeled as linear in features of the trajectories. If those features are defined in terms of visitations to each state-action pair, then utility directly corresponds to the total (undiscounted) reward. Notably, under a), Wilson, Fern, and Tadepalli (2012) learn a Bayesian model over policy parameters, and sample from its posterior to inform action selection. This is a model-free posterior sampling-inspired method; however, compared to utility-based approaches, policy search methods typically require either more samples or expert knowledge to craft the policy parameters (Wirth, Akrour, et al., 2017; Kupcsik, Hsu, and Lee, 2018).

The work presented in this thesis adopts the third of these paradigms: preference-based RL with underlying utility functions. By inferring state-action rewards from preference feedback, one can derive relatively-interpretable reward models and employ such methods as value iteration. In addition, utility-based approaches may be more sample efficient compared to policy search and preference relation methods (Wirth, 2017), as they extract more information from each observation.

Chapter 3

THE PREFERENCE-BASED GENERALIZED LINEAR BANDIT AND REINFORCEMENT LEARNING PROBLEM SETTINGS

This chapter describes the problem formulations and key assumptions that are considered in Chapter 4, which introduces the Dueling Posterior Sampling framework for preference-based learning. Below, Section 3.1 describes the generalized linear dueling bandit setting, while Section 3.2 defines the tabular preference-based RL setting. These interrelated problem settings formalize the challenge of learning from preference feedback when the preferences are governed by underlying utilities of the bandit actions and RL trajectories, and when these utilities are linear in features of the actions and trajectories. Subsequently, Chapter 5 further extends the problem settings presented in this chapter by allowing mixed-initiative user feedback and considering kernelized utility functions.

3.1 The Generalized Linear Dueling Bandit Problem Setting

The generalized linear dueling bandit setting defines a low-dimensional structure relating actions to preferences. This framework allows for large or infinite action spaces, which would be infeasible for a learning algorithm that models each action's reward independently.

The setting is defined by a tuple (A, φ), in which A ⊂ R^d is a compact set of possible actions and φ is a function defining the preference feedback generation mechanism. In each learning iteration i, the algorithm chooses a pair of actions x_{i1}, x_{i2} ∈ A to duel or compare, and receives a preference between them. For two actions x, x′ ∈ A, the function φ defines the probability that x is preferred to x′:

P(x ≻ x′) := φ(x, x′) + 1/2,

where φ(x, x′) ∈ [−1/2, 1/2] denotes the stochastic preference between x and x′. The outcome of the i-th preference query is denoted by y_i := I[x_{i2} ≻ x_{i1}] − 1/2 ∈ {−1/2, 1/2}, so that P(y_i = 1/2) = 1 − P(y_i = −1/2) = φ(x_{i2}, x_{i1}) + 1/2.

Preferences are assumed to be governed by a low-dimensional structure, which depends upon a model parameter r ∈ Θ ⊂ R^d for a compact set Θ. This structure is defined by three main assumptions. Firstly, to guarantee that the rewards are learnable, one must assume that they belong to a compact set. This is satisfied by assuming that the reward vector r has bounded magnitude:

Assumption 1. For some known S_r < ∞, ||r||_2 ≤ S_r.

Secondly, each action has an associated utility:

Assumption 2. Each action x ∈ A has a utility to the user, which reflects the value that the user places upon x. This utility, denoted by u(x), is bounded and given by u(x) := r^T x.

The third assumption defines the preference relation function φ in terms of the utilities u.

Assumption 3. The user's preference between any two actions x, x′ ∈ A is given by P(x ≻ x′) = φ(x, x′) + 1/2, where φ(x, x′) is defined in terms of a link function g : R → [−1/2, 1/2]: φ(x, x′) := g(u(x) − u(x′)) = g(r^T(x − x′)). The link function g must a) be non-decreasing, and b) satisfy g(x) = −g(−x) to ensure that P(x ≻ x′) = 1 − P(x′ ≻ x). Note that if u(x) = u(x′), then P(x ≻ x′) = 1/2, and that P(x ≻ x′) > 1/2 ⟺ g(u(x) − u(x′)) > 0 ⟺ u(x) > u(x′).

Assumptions 2 and 3 are likely to hold in many real-world applications. The utility could reflect a human user’s satisfaction with the algorithm’s performance, and it is plausible that users would give more accurate preferences between actions that have more disparate utilities. Furthermore, linear utility models have been applied in a number of domains, including web search (Shivaswamy and Joachims, 2015), movie recommendation (Shivaswamy and Joachims, 2015), autonomous driving (Sadigh et al., 2017; Bıyık, Palan, et al., 2020), and robotics (Bıyık, Palan, et al., 2020).

There are many alternatives for the link function g. For noiseless preferences,

g_ideal(x) := I[x > 0] − 1/2,    (3.1)

while the linear link function (Ailon, Karnin, and Joachims, 2014) is given by

g_lin(x) := x/c,  for c > 0 and x ∈ [−c/2, c/2].    (3.2)

Under g_lin, P(x_{i2} ≻ x_{i1}) − 1/2 = r^T(x_{i2} − x_{i1})/c. Without loss of generality, c can be set to 1 by subsuming it into r. Alternatively, the logistic or Bradley-Terry link function is defined as

g_log(x) := [1 + exp(−x/c)]^{−1} − 1/2,  with "temperature" c ∈ (0, ∞).    (3.3)

The noise in the i-th preference is defined as η_i, so that y_i = E[y_i] + η_i = P(x_{i2} ≻ x_{i1}) − 1/2 + η_i. Under the linear link function, for instance, η_i = y_i − r^T(x_{i2} − x_{i1}). Lastly, define x_i := x_{i2} − x_{i1} as the observed feature difference vector in iteration i.
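A minimal Python sketch of this preference generation model under the linear and logistic link functions; the utility vector and the actions below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def g_lin(x, c=1.0):
    """Linear link function, Eq. (3.2): valid for x in [-c/2, c/2]."""
    return x / c

def g_log(x, c=1.0):
    """Logistic (Bradley-Terry) link function, Eq. (3.3)."""
    return 1.0 / (1.0 + np.exp(-x / c)) - 0.5

def sample_preference(r, x1, x2, g=g_log):
    """Return y = +1/2 if x2 is preferred to x1, else -1/2 (Section 3.1)."""
    p_x2_wins = g(r @ (x2 - x1)) + 0.5
    return 0.5 if rng.random() < p_x2_wins else -0.5

r = np.array([1.0, -0.5])                 # hypothetical utility weights
x1, x2 = np.array([0.2, 0.4]), np.array([0.8, 0.1])
print(sample_preference(r, x1, x2))       # x2 has higher utility, so +1/2 is more likely
```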

In the utility-based setting, regret can be defined in two ways. The first approach, which is typically used in the dueling bandits setting (Yue, Broder, et al., 2012; Sui, Zhuang, et al., 2017), was previously presented in Eq. (2.15):

Reg_φ^{DB}(T) = Σ_{i=1}^{T} [φ(x*, x_{i1}) + φ(x*, x_{i2})],

where x* ∈ A is the optimal action given r, that is, x* := argmax_{x∈A} u(x) = argmax_{x∈A} r^T x. In the second approach, regret can also be defined with respect to the underlying utilities, u, as the total loss in utility during the learning process:

Reg_u^{DB}(T) = Σ_{i=1}^{T} [2u(x*) − u(x_{i1}) − u(x_{i2})] = Σ_{i=1}^{T} r^T(2x* − x_{i1} − x_{i2}).    (3.4)

Remark 2. Note that unlike Reg_u^{DB}(T), the non-utility-based regret Reg_φ^{DB}(T) is well-defined even when utilities do not exist. When the utilities exist and are defined as above, however, Reg_φ^{DB}(T) and Reg_u^{DB}(T) can be bounded in terms of each other via the constants g′_min, g′_max ∈ (0, ∞), which are, respectively, the minimum and maximum derivatives of the link function g over its possible inputs:

g′_max := sup_{x∈[−u₀, u₀]} g′(x)  and  g′_min := inf_{x∈[−u₀, u₀]} g′(x),

where u₀ := sup_{x,x′∈A} [u(x) − u(x′)]. These facts can be derived as follows:

Reg_φ^{DB}(T) = Σ_{i=1}^{T} [φ(x*, x_{i1}) + φ(x*, x_{i2})] = Σ_{i=1}^{T} [g(u(x*) − u(x_{i1})) + g(u(x*) − u(x_{i2}))]
  ≤(a) g′_max Σ_{i=1}^{T} [u(x*) − u(x_{i1}) + u(x*) − u(x_{i2})] = g′_max Reg_u^{DB}(T),  and similarly,

Reg_φ^{DB}(T) = Σ_{i=1}^{T} [φ(x*, x_{i1}) + φ(x*, x_{i2})] = Σ_{i=1}^{T} [g(u(x*) − u(x_{i1})) + g(u(x*) − u(x_{i2}))]
  ≥(b) g′_min Σ_{i=1}^{T} [u(x*) − u(x_{i1}) + u(x*) − u(x_{i2})] = g′_min Reg_u^{DB}(T),

where (a) and (b) hold because u(x*) − u(x) ≥ 0 for all x ∈ A.

Remark 3. The generalized linear dueling bandits setting straightforwardly extends to the multi-dueling bandits paradigm discussed in Section 2.5, in which a set S_i of m ≥ 2 actions is selected in each learning iteration. Recall that the multi-dueling regret is defined as in Eq. (2.16):

Reg_φ^{MDB}(T) = Σ_{i=1}^{T} Σ_{x∈S_i} φ(x*, x).

When the actions have associated utilities, the regret can also be defined with respect to the loss in accumulated utility:

Reg_u^{MDB}(T) = Σ_{i=1}^{T} Σ_{x∈S_i} [u(x*) − u(x)] = Σ_{i=1}^{T} r^T [ Σ_{x∈S_i} (x* − x) ].

Just as in the dueling bandit setting, if utilities exist, then g′_min Reg_u^{MDB}(T) ≤ Reg_φ^{MDB}(T) ≤ g′_max Reg_u^{MDB}(T) (this can be shown as in Remark 2).

Finally, the Bayesian regret is defined as the expected regret, where the expectation is taken over the algorithm's action selection strategy, randomness in the outcomes, and randomness due to the prior over r:

BayesReg_φ^{DB}(T) = E[ Σ_{i=1}^{T} [φ(x*, x_{i1}) + φ(x*, x_{i2})] ],

and similarly,

BayesReg_u^{DB}(T) = E[ Σ_{i=1}^{T} r^T(2x* − x_{i1} − x_{i2}) ].

3.2 The Preference-Based Reinforcement Learning Problem Setting

In this setting, the agent interacts with fixed-horizon Markov Decision Processes (MDPs), in which rewards are replaced by preferences over trajectories. This class of MDPs can be represented as a tuple, M = (S, A, φ, p, p_0, h), where the state space S and action space A are finite sets with cardinalities S and A, respectively. The agent episodically interacts with the environment in length-h roll-out trajectories of the form τ = {s_1, a_1, s_2, a_2, ..., s_h, a_h, s_{h+1}}. In the i-th iteration, the agent executes two trajectory roll-outs τ_{i1} and τ_{i2} and observes a preference between them; similarly to the preference-based bandit setting, the notation τ ≻ τ′ indicates a preference for trajectory τ over τ′. The initial state is sampled from p_0, while p defines the transition dynamics: s_{t+1} ∼ p(· | s_t, a_t). Finally, the function φ captures the preference feedback generation mechanism: φ(τ, τ′) := P(τ ≻ τ′) − 1/2 ∈ [−1/2, 1/2].

A policy, π : S × {1, ..., h} → A, is a (possibly-stochastic) mapping from states and time indices to actions. In each iteration i, the agent selects two policies, π_{i1} and π_{i2}, which are rolled out to obtain trajectories τ_{i1} and τ_{i2} and preference label y_i. Each trajectory is represented as a feature vector, where each feature records the number of times that a particular state-action pair is visited. In iteration i, rolled-out trajectories τ_{i1} and τ_{i2} correspond, respectively, to feature vectors x_{i1}, x_{i2} ∈ R^d, where d := SA is the total number of state-action pairs, and the k-th element of x_{ij}, j ∈ {1, 2}, is the number of times that τ_{ij} visits state-action pair k. The preference for iteration i is denoted by y_i := I[τ_{i2} ≻ τ_{i1}] − 1/2 ∈ {−1/2, 1/2}, where I[·] is the indicator function, so that P(y_i = 1/2) = 1 − P(y_i = −1/2) = φ(τ_{i2}, τ_{i1}) + 1/2; there are no ties in any comparisons. Lastly, the i-th feature difference vector is written as x_i := x_{i2} − x_{i1}.

Analogously to the generalized linear dueling bandit setting discussed in Section 3.1, preferences are assumed to be generated based upon underlying trajectory utilities that are linear in the state-action visitation features, as formalized in the following two assumptions. Firstly, the analysis assumes the existence of underlying utilities, which quantify the user’s satisfaction with each trajectory:

Assumption 4. Each trajectory τ has utility u(τ), which decomposes additively: u(τ) ≡ Σ_{t=1}^{h} r(s_t, a_t) for the state-action pairs in τ. Defining r ∈ R^d as the vector of all state-action rewards, u(τ) can also be expressed in terms of τ's state-action visit counts x: u(τ) = r^T x.
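As an illustration of Assumption 4, the short Python sketch below converts a trajectory into its state-action visitation feature vector, so that u(τ) = r^T x; the trajectory and reward vector are hypothetical.

```python
import numpy as np

def trajectory_features(trajectory, S, A):
    """Map a trajectory [(s_1, a_1), ..., (s_h, a_h)] to visit counts x in R^{S*A}."""
    x = np.zeros(S * A)
    for s, a in trajectory:
        x[s * A + a] += 1            # feature index k corresponds to state-action pair (s, a)
    return x

S, A = 3, 2
r = np.arange(S * A, dtype=float) * 0.1          # hypothetical state-action rewards
tau = [(0, 1), (2, 0), (2, 0), (1, 1)]           # h = 4 state-action pairs
x = trajectory_features(tau, S, A)
print("u(tau) =", r @ x)                         # additive utility of the trajectory
```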

Secondly, the utilities u(τ) are stochastically translated to preferences via the noise model φ, such that the probability of observing τ_{i2} ≻ τ_{i1} is a function of the difference in the trajectories' utilities. Intuitively, the greater the disparity in two trajectories' utilities, the more accurate the user's preference between them:

Assumption 5. For two trajectories τ_{i1} and τ_{i2}, P(τ_{i2} ≻ τ_{i1}) = φ(τ_{i2}, τ_{i1}) + 1/2 = g(u(τ_{i2}) − u(τ_{i1})) + 1/2 = g(r^T x_{i2} − r^T x_{i1}) + 1/2, where g : R → [−1/2, 1/2] is a link function such that:

a) g is non-decreasing, and

b) g(x) = −g(−x), to ensure that P(τ ≻ τ′) = 1 − P(τ′ ≻ τ).

Note that if u(τ) = u(τ′), then P(τ ≻ τ′) = 1/2, and P(τ_{i2} ≻ τ_{i1}) > 1/2 ⟺ g(r^T x_i) > 0 ⟺ r^T x_{i2} > r^T x_{i1}.

Section 3.1 gives several examples of link functions, including the linear and logistic link functions g_lin and g_log, which are respectively defined in Eq.s (3.2) and (3.3). Similarly, the observation noise in iteration i is η_i := y_i − g(r^T(x_{i2} − x_{i1})). For the linear link function, for instance, η_i := y_i − r^T(x_{i2} − x_{i1}).

Lastly, as in the bandit setting, the rewards must belong to a compact set. This is satisfied by assuming the reward vector r has bounded magnitude:

Assumption 6. For some known S_r < ∞, ||r||_2 ≤ S_r.

Given a policy π, the standard RL value function is defined as the expected total utility when starting in state s at step j, and following π:

V_{π,j}(s) = E[ Σ_{t=j}^{h} r(s_t, π(s_t, t)) | s_j = s ].    (3.6)

The optimal policy π* is then defined as one that maximizes the expected value over all input states:

π* = arg sup_π Σ_{s∈S} p_0(s) V_{π,1}(s).

Note that E_{s_1∼p_0}[V_{π,1}(s_1)] ≡ E_{τ∼π}[u(τ)]. Given fully-specified dynamics and rewards, p and r, it is straightforward to apply standard dynamic programming approaches such as value iteration (Sutton and Barto, 2018) to arrive at the optimal policy under p and r. The learning goal, then, is to infer p and r to the extent necessary for good decision-making.

The learning agent's performance is quantified via its cumulative T-step Bayesian regret relative to the optimal policy:

BayesReg(T) = E[ Σ_{i=1}^{⌈T/(2h)⌉} Σ_{s∈S} p_0(s) ( 2V_{π*,1}(s) − V_{π_{i1},1}(s) − V_{π_{i2},1}(s) ) ].    (3.7)

To minimize regret, the agent must balance exploration (collecting new data) with exploitation (behaving optimally given current knowledge). Over-exploration of bad trajectories will incur large regret, and under-exploration can prevent convergence to optimality. In contrast to the standard regret formulation in RL, the regret of both selected policies is measured in each iteration.

3.3 Comparing the Preference-Based Generalized Linear Bandit and RL Settings

In both the preference-based bandit and RL settings discussed in Sections 3.1 and 3.2, preferences are assumed to be governed by linear utilities of the form u(x) = r^T x, where in the bandit case, x ∈ R^d represents the features of an action, while in the RL setting, x contains the features of a trajectory, which are the trajectory's state-action visitation counts. While in the bandit case, the algorithm directly selects actions x_{i1} and x_{i2} in each iteration i, in the RL setting, the agent selects policies π_{i1} and π_{i2}, which stochastically map to the trajectory feature representations x_{i1} and x_{i2}.

In both cases, the linear and logistic link functions' analysis assumes a regularity condition upon the observation noise, η_i = y_i − g(r^T x_i). Namely, η_i must be conditionally R-sub-Gaussian, that is, there exists R ≥ 0 such that for all λ ∈ R:

E[ exp(λη_i) | x_1, ..., x_i, η_1, ..., η_{i−1} ] ≤ exp(λ²R²/2).

One can see that this requirement is satisfied for R = 1 as follows. Firstly, note that bounded, zero-mean noise lying in an interval of length at most 2R is R-sub-Gaussian (Abbasi-Yadkori, Pál, and Szepesvári, 2011). The noise η_i is by definition conditionally-zero-mean, since given any r:

E[η_i | x_1, ..., x_i, η_1, ..., η_{i−1}] = E[y_i − g(r^T x_i) | x_1, ..., x_i, η_1, ..., η_{i−1}] = g(r^T x_i) − g(r^T x_i) = 0.

Secondly, η_i is bounded:

|η_i| = |y_i − g(r^T x_i)| ≤ |y_i| + |g(r^T x_i)| = 1/2 + |E[y_i | x_i]| ≤ 1/2 + 1/2 = 1.

Because η_i is zero-mean and η_i ∈ [−1, 1], it must be sub-Gaussian with R = 1.

Chapter 4

DUELING POSTERIOR SAMPLING FOR PREFERENCE-BASED BANDITS AND REINFORCEMENT LEARNING

This chapter describes, analyzes, and evaluates the Dueling Posterior Sampling (DPS) framework for preference-based bandits and reinforcement learning (RL). The framework is applied to both the preference-based RL and generalized linear bandit problem settings, which are defined in Chapter 3. Parts of these results are published in Novoseller et al. (2020b).

Below, Section 4.1 introduces the DPS algorithm, which extends the SelfSparring framework from Sui, Zhuang, et al. (2017) to the preference-based generalized linear bandit and RL settings. Section 4.3 then discusses several posterior inference strategies that can be coupled with DPS. Subsequently, Section 4.4 presents a theoretical analysis of DPS in two parts: 1) an asymptotic consistency result which holds when preferences are modeled via both the linear and logistic link functions, and 2) a regret analysis for the linear link function, which is based on empirically bounding the information ratio introduced in Russo and Van Roy (2016) (defined in Section 2.3). Finally, Section 4.5 demonstrates the performance of DPS in simulation, and Section 4.6 discusses the results and proposes some future directions.

4.1 The Dueling Posterior Sampling Algorithm

The Dueling Posterior Sampling (DPS) algorithm adapts the SelfSparring framework from Sui, Zhuang, et al. (2017) to the preference-based RL setting, and as a byproduct, to the generalized linear dueling bandits problem. SelfSparring applies posterior sampling to preference-based bandit learning by maintaining a Bayesian model posterior over the environment, and by drawing multiple samples from this posterior to duel against each other for preference elicitation. This section first introduces DPS in the preference-based RL setting, and then modifies it for the generalized linear dueling bandit setting.

Dueling Posterior Sampling for Preference-Based RL

In the RL setting, DPS maintains Bayesian posteriors f_p and f_r over the transition dynamics and the utilities r, respectively. As outlined in Algorithm 5, DPS iterates among three steps: (a) sampling two policies π_{i1}, π_{i2} from the Bayesian posteriors of the dynamics and utility models (Advance – Algorithm 6); (b) rolling out π_{i1} and π_{i2} to obtain trajectories τ_{i1} and τ_{i2}, and receiving a preference y_i between them; and (c) updating the posterior (Feedback – Algorithm 7). In contrast to conventional posterior sampling with absolute feedback, DPS samples two policies rather than one during each iteration and solves a credit assignment problem to learn from feedback.

Algorithm 5 Dueling Posterior Sampling (DPS) for Preference-Based RL
1: H_0 = ∅                                          ⊲ Initialize history
2: Initialize prior for f_p                         ⊲ Initialize state transition model
3: Initialize prior for f_r                         ⊲ Initialize utility model
4: for i = 1, 2, ... do
5:     π_{i1} ← Advance(f_p, f_r)
6:     π_{i2} ← Advance(f_p, f_r)
7:     Sample trajectories τ_{i1} and τ_{i2} from π_{i1} and π_{i2}
8:     Observe feedback y_i = I[τ_{i2} ≻ τ_{i1}] − 1/2
9:     H_i = H_{i−1} ∪ (τ_{i1}, τ_{i2}, y_i)
10:    f_p, f_r = Feedback(H_i, f_p, f_r)
11: end for

Algorithm 6 Advance (RL): Sample policy from dynamics and utility models
1: Input: transition model f_p, utility model f_r
2: Sample p̃ ∼ f_p(·)                                ⊲ Sample MDP transition dynamics parameters from posterior
3: Sample r̃ ∼ f_r(·)                                ⊲ Sample utilities from posterior
4: Compute π = argmax_π V(p̃, r̃)                     ⊲ Value iteration yields sampled MDP's optimal policy
5: Return π

Algorithm 7 Feedback (RL): Update dynamics and utility models based on new user feedback
1: Input: history H, transition model f_p, utility model f_r
2: Apply Bayesian update to f_p, given H             ⊲ Update dynamics model given history
3: Apply Bayesian update to f_r, given H             ⊲ Update utility model given preferences
4: Return f_p, f_r

Advance (Algorithm 6) selects a policy to execute by drawing samples from the Bayesian posteriors of both the dynamics and utilities. The sampled dynamics and utilities form an MDP, for which finite-horizon value iteration (Sutton and Barto, 2018) derives the optimal policy π under the sample. One can also view π as a random function whose randomness depends on the sampling of the dynamics and utility models. In the Bayesian setting, it can be shown that π is sampled according to its posterior probability of being the optimal policy π* (Osband, Russo, and Van Roy, 2013). Intuitively, peaked (i.e., certain) posteriors lead to less variability when sampling π, which implies less exploration, while diffuse (i.e., uncertain) posteriors lead to greater variability when sampling π, implying more exploration.

Feedback (Algorithm 7) updates the Bayesian posteriors of the dynamics and utility models based on new data. Updating the dynamics posterior is relatively straightforward, as the dynamics are assumed to be fully-observed. This work models the dynamics prior via a Dirichlet distribution for each state-action pair; the observed state transitions have a multinomial likelihood, leading to a conjugate Dirichlet posterior (see Section 2.1). In contrast, performing Bayesian inference over state-action utilities from trajectory-level feedback is much more challenging. A range of approaches are therefore considered, as discussed in Section 4.3. In particular, Bayesian linear regression was found to both perform well and to admit tractable analysis within the theoretical framework presented.

Dueling Posterior Sampling for Generalized Linear Dueling Bandits

The bandit setting can be viewed as a special case of the RL problem in which there is only one state. Thus, modeling state transition information is unnecessary, and the environment posterior consists only of the utility model. Secondly, while the RL setting stochastically maps policies to trajectories, in the bandit setting, the concepts of a policy and a trajectory both reduce to the selected action. Thirdly, each action is associated with a d-dimensional vector, rather than with just an index in {1, ..., A}, as in tabular RL. While in the RL case, the utility vector r denotes the utility of each state-action pair, in the generalized linear bandit case, it specifies a linear weight on each action space dimension. With these modifications, DPS is outlined in Algorithms 8–10.

Algorithm 8 Dueling Posterior Sampling (DPS) for Generalized Linear Dueling Bandits
1: H_0 = ∅                                          ⊲ Initialize history
2: Initialize prior for f_r                         ⊲ Initialize utility model
3: for i = 1, 2, ... do
4:     x_{i1} ← Advance(f_r)
5:     x_{i2} ← Advance(f_r)
6:     Observe feedback y_i = I[x_{i2} ≻ x_{i1}] − 1/2
7:     H_i = H_{i−1} ∪ (x_{i1}, x_{i2}, y_i)
8:     f_r = Feedback(H_i, f_r)
9: end for

Algorithm 9 Advance (bandit): Sample policy from utility model
1: Input: utility model f_r
2: Sample r̃ ∼ f_r(·)                                ⊲ Sample utilities from posterior
3: x = argmax_{x′∈A} r̃^T x′                          ⊲ Identify action with maximum sampled reward
4: Return x

Algorithm 10 Feedback (bandit): Update utility model based on new user feedback
1: Input: history H, utility model f_r
2: Apply Bayesian update to f_r, given H             ⊲ Update utility model given preferences
3: Return f_r
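For concreteness, here is a minimal Python sketch of the DPS loop for generalized linear dueling bandits (Algorithms 8–10), using a Bayesian linear-regression utility posterior of the form described in Section 4.3 with a finite action set; the environment, prior scale, and clipping of the linear-link probability are hypothetical choices, not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_iters, lam, sigma2 = 2, 500, 1.0, 1.0
actions = rng.normal(size=(20, d))               # finite action set A, a subset of R^d
r_true = np.array([1.0, 0.5])                    # hidden utility parameters

M = lam * np.eye(d)                              # M_n as in Eq. (4.2)
b = np.zeros(d)                                  # running sum of y_i * x_i

def advance():
    """Sample utilities from the Gaussian posterior and pick the maximizing action."""
    r_hat = np.linalg.solve(M, b)
    cov = sigma2 * np.linalg.inv(M)
    r_tilde = rng.multivariate_normal(r_hat, cov)
    return actions[np.argmax(actions @ r_tilde)]

for _ in range(n_iters):
    x1, x2 = advance(), advance()                # two independent posterior samples
    p = np.clip(r_true @ (x2 - x1) + 0.5, 0, 1)  # linear link (Eq. 3.2), clipped for safety
    y = 0.5 if rng.random() < p else -0.5
    x_diff = x2 - x1                             # feedback update (credit assignment)
    M += np.outer(x_diff, x_diff)
    b += y * x_diff

print("Estimated utilities:", np.linalg.solve(M, b))
```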

4.2 Additional Notation

This thesis uses the following notation. In the RL setting, let p ∈ R^{S²A} be the vector containing all true state transition dynamics parameters, {P(s_{t+1} = s′ | s_t = s, a_t = a) | s, s′ ∈ S, a ∈ A}. Let p̃_{i1}, p̃_{i2} ∈ R^{S²A} be the two posterior samples of the transition dynamics p in iteration i. Similarly, r ∈ R^{SA} is the vector of true utility parameters, while r̃_{i1}, r̃_{i2} ∈ R^{SA} are the posterior samples of r in iteration i.

For a random variable X and a sequence of random variables (X_n), n ∈ N, X_n →^D X denotes that X_n converges to X in distribution. Similarly, X_n →^P X denotes that X_n converges to X in probability.

For x ∈ R^d and a positive definite matrix A ∈ R^{d×d}, the norm ||x||_A is defined as ||x||_A := √(x^T A x).

4.3 Posterior Modeling for Utility Inference and Credit Assignment

At the start of iteration n, the algorithm has observed n − 1 preferences. Recall that in each learning iteration i ∈ {1, ..., n − 1}, the algorithm observes a preference y_i ∈ {−1/2, 1/2} between two vectors x_{i1}, x_{i2} ∈ R^d. Importantly, the quantities x_{i1} and x_{i2} represent actions in the bandit problem, while they are trajectory feature vectors (more specifically, state-action pair visitation counts) in the RL problem. Yet, in both settings, the preference labels y_i are assumed to be generated according to the same model:

P(y_i = 1/2) = P(x_{i2} ≻ x_{i1}) = g(r^T x_i) + 1/2,

where g is the link function and x_i := x_{i2} − x_{i1}. Although x_i has a different interpretation in each of the two settings, the utility inference task takes the same form in both: to estimate r given the dataset D_n := {(x_i, y_i) | i = 1, ..., n − 1}.

In the RL setting, utility inference is known as the credit assignment problem because it involves identifying which state-action pairs are responsible for the observed trajectory-level preferences.

Let X_n ∈ R^{(n−1)×d} be the observation matrix after n − 1 preferences, in which the i-th row contains observation x_i = x_{i2} − x_{i1}. Furthermore, define y_n ∈ R^{n−1} as the vector of corresponding preference labels, with its i-th element equal to y_i ∈ {−1/2, 1/2}. The following two subsections discuss Bayesian linear regression and Bayesian logistic regression approaches to utility inference. In addition, Appendix A discusses Gaussian process methods for posterior inference of the utilities.

Remark 4. The posterior inference methods discussed in this section also apply to the multi-dueling setting (see Section 2.5), where in each iteration, more than 2 actions or policies can be sampled to yield multiple pairwise preferences. Posterior inference simply translates a preference dataset to a posterior distribution; this procedure is agnostic to the number of preference queries per iteration.

Utility Inference via Bayesian Linear Regression

Several different variants of Bayesian linear regression could be performed to infer the utilities, r. Firstly, in standard Bayesian linear regression, one defines a Gaussian prior over the reward vector r ∈ R^d: r ∼ N(0, (λ_0 I)^{−1}). The likelihood of the data conditioned upon r is also Gaussian:

p(y_n | X_n, r; σ²) = (2πσ²)^{−(n−1)/2} exp( −(1/(2σ²)) ||y_n − X_n r||² ).

This conjugate Gaussian prior and likelihood pair result in the following closed-form Gaussian posterior:

r | X_n, y_n, σ², λ_0 ∼ N( (X_n^T X_n + σ²λ_0 I)^{−1} X_n^T y_n,  σ²(X_n^T X_n + σ²λ_0 I)^{−1} ).

Defining λ := σ²λ_0, this posterior can equivalently be written as:

r | D_n ∼ N( r̂_n, σ²M_n^{−1} ),  where    (4.1)

M_n := λI + Σ_{i=1}^{n−1} x_i x_i^T  for λ ≥ 1,  and  r̂_n := M_n^{−1} Σ_{i=1}^{n−1} y_i x_i.    (4.2)

Note that r̂_n is the MAP estimate of r, as well as a ridge regression estimate of r. In the linear bandit setting, several posterior sampling regret analyses (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) use a modification of this posterior, which scales the covariance matrix by a factor that increases logarithmically in n:

r | D_n ∼ N( r̂_n, β_n(δ)² M_n^{−1} ),    (4.3)

where r̂_n and M_n are defined in Eq. (4.2), and β_n(δ) is given by:

β_n(δ) := R √( 2 log( det(M_n)^{1/2} λ^{−d/2} / δ ) ) + √λ S_r ≤ R √( d log( (1 + nL²/(dλ)) / δ ) ) + √λ S_r,    (4.4)

where δ ∈ (0, 1) is a failure probability to be discussed shortly, and L is an upper bound such that ||x_n||_2 ≤ L for all n. Note that in the RL setting, L ≤ 2h, since ||x_n||_2 = ||x_{n2} − x_{n1}||_2 ≤ ||x_{n2} − x_{n1}||_1 ≤ ||x_{n2}||_1 + ||x_{n1}||_1 = 2h. Recall that R ≤ 1 (see Section 3.3) due to the sub-Gaussianity of preference feedback, and that ||r||_2 ≤ S_r (see Assumptions 1 and 6).

at rˆ=:

Proposition 1 (Theorem 2 from Abbasi-Yadkori, Pál, and Szepesvári, 2011). Let  ∞ [ ∞ { 8}8=0 be a filtration, and let { 8}8=1 be a real-valued stochastic process such that [8 is both 8-measurable and conditionally '-sub-Gaussian for some ' ≥ 0. Let 3 {x8} be an R -valued stochastic process such that x8 is 8−1-measurable. Define H ) [ ( ! X > 8 := x8 r + 8, and assume that ||r||2 ≤ A and ||x8 ||2 ≤ . Then, for any 0, X 8 > V X " with probability at least 1 − and for all 0, ||rˆ8 − r||"8 ≤ 8 ( ), where rˆ8, 8 and V8 (X) are defined in Eq.s (4.2) and (4.4).

Importantly, this result does not require the noise in the labels y_i to be Gaussian, but instead only assumes that the noise is sub-Gaussian. Furthermore, preference noise is sub-Gaussian as discussed in Section 3.3.

Note that while the distribution in Eq. (4.1) is an exact, conjugate posterior given a Gaussian prior and likelihood, the scaled distribution in Eq. (4.3) is no longer an exact posterior, as it is not proportional to the product of a prior and likelihood. In fact, as discussed in Agrawal and Goyal (2013) and Abeille and Lazaric (2017), posterior sampling does not need to sample from an actual Bayesian posterior distribution; rather, any distribution satisfying appropriate concentration and anti-concentration inequalities will achieve low regret.

In fact, neither of the probability distributions in Eq.s (4.1) or (4.3) represents a true, exact posterior for modeling preference data, as binary preference feedback does not exhibit a Gaussian likelihood. Recall that y_i ∈ {−1/2, 1/2}, such that y_i = 1/2 if x_{i2} ≻ x_{i1} and y_i = −1/2 if x_{i1} ≻ x_{i2}. The exact posterior takes the form:

p(r | D_n) ∝ p(r) Π_{i=1}^{n−1} P(y_i | x_i, r) = p(r) Π_{i=1}^{n−1} ( 2y_i r^T x_i + 1/2 ),    (4.5)

where p(r) is an appropriate prior for r and 2y_i r^T x_i ∈ (−0.5, 0.5) ensures that P(y_i | x_i, r) ∈ [0, 1]. To guarantee that 2y_i r^T x_i ∈ (−0.5, 0.5), the prior p(r) must have bounded support; possibilities include uniform and truncated Gaussian distributions.

While the posterior in Eq. (4.5) lacks a closed-form representation and is therefore intractable, one can sample from it via approximate posterior inference techniques such as the Laplace approximation (Section 2.1) or Markov chain Monte Carlo sampling. The Laplace approximation is only valid when the Hessian matrix of the negative log posterior is positive definite. In this case, the negative log posterior is:

−log p(r | D_n) = −log p(r) − Σ_{i=1}^{n−1} log( 2y_i r^T x_i + 1/2 ) + C,

where C is a constant. Its Hessian matrix is:

H := ∇²_r [−log p(r | D_n)] = ∇²_r [−log p(r)] + Σ_{i=1}^{n−1} x_i x_i^T / (2y_i r^T x_i + 0.5)².

Clearly, the terms containing x_i x_i^T are positive semidefinite. Thus, H ≻ 0 holds, provided that the selected prior satisfies ∇²_r [−log p(r)] ≻ 0. Under such a prior, the Laplace approximation is given by:

p(r | D_n) ≈ N( r̂_n, H^{−1}|_{r=r̂_n} ),

where r̂_n = argmin_r [−log p(r | D_n)] and H = ∇²_r [−log p(r | D_n)].

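A minimal Python sketch of this Laplace approximation for the preference posterior in Eq. (4.5), assuming a Gaussian N(0, λ⁻¹I) prior (whose negative log has positive-definite Hessian λI) and a SciPy optimizer for the MAP estimate; the simulated data, prior scale, and optimizer choice are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
d, n, lam = 2, 40, 5.0
r_true = np.array([0.3, -0.2])
X = rng.uniform(-0.5, 0.5, size=(n, d))             # feature differences x_i = x_{i2} - x_{i1}
p = np.clip(X @ r_true + 0.5, 0.0, 1.0)             # linear link, Eq. (3.2)
y = np.where(rng.random(n) < p, 0.5, -0.5)          # preference labels y_i

def neg_log_posterior(r):
    """-log p(r) - sum_i log(2 y_i r^T x_i + 1/2), with the assumed Gaussian prior."""
    lik = 2.0 * y * (X @ r) + 0.5
    if np.any(lik <= 0):                            # outside the region where Eq. (4.5) is valid
        return np.inf
    return 0.5 * lam * r @ r - np.sum(np.log(lik))

res = minimize(neg_log_posterior, np.zeros(d), method="Nelder-Mead")
r_map = res.x
# Hessian of the negative log posterior at the MAP estimate (see the expression above).
lik = 2.0 * y * (X @ r_map) + 0.5
H = lam * np.eye(d) + (X.T * (1.0 / lik**2)) @ X
laplace_cov = np.linalg.inv(H)                      # Laplace approximation: N(r_map, H^{-1})
print(r_map, laplace_cov)
```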
Utility Inference via Bayesian Logistic Regression

One can also infer the utilities r ∈ R^d via Bayesian logistic regression, especially if a logistic link function is thought to govern user preferences. Typically, one sets a Gaussian prior over the utilities r: r ∼ N(0, λ^{−1}I), where λ > 0. As in the linear case, the outcomes are y_i ∈ {−1/2, 1/2}. Then, the logistic regression likelihood is:

p(D_n | r) = Π_{i=1}^{n−1} p(y_i | r, x_i) = Π_{i=1}^{n−1} 1 / (1 + exp(−2y_i x_i^T r)).

The posterior, p(r | D_n) ∝ p(D_n | r) p(r), can be approximated as Gaussian via the Laplace approximation:

p(r | D_n) ≈ N( r̂_n^MAP, Σ_n^MAP ),  where:    (4.6)

r̂_n^MAP = argmin_r f(r),  f(r) := −log p(D_n, r) = −log p(r) − log p(D_n | r),    (4.7)

Σ_n^MAP = ( ∇²_r f(r)|_{r=r̂_n^MAP} )^{−1},    (4.8)

and where the optimization problem in Eq. (4.7) is convex: similarly to the linear case, one can show that the Hessian matrix of f(r), ∇²_r [f(r)], is positive definite.

Note that the Laplace approximation's inverse posterior covariance, (Σ_n^MAP)^{−1}, is given by:

(Σ_n^MAP)^{−1} = λI + Σ_{i=1}^{n−1} g̃(2y_i x_i^T r̂_n^MAP) x_i x_i^T,    (4.9)

where g̃(x) := ( g′_log(x)/g_log(x) )² − g″_log(x)/g_log(x), and g_log is the sigmoid function.
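A minimal Python sketch of logistic-regression utility inference with a Laplace approximation, in the spirit of Eq.s (4.6)–(4.8); plain gradient descent is used for the MAP estimate, the Hessian weights use the standard sigmoid-curvature form σ(z)(1 − σ(z)) rather than the g̃ notation above, and all numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, lam, lr, n_steps = 2, 200, 1.0, 0.05, 2000
r_true = np.array([1.5, -1.0])
X = rng.normal(size=(n, d))                          # feature differences x_i
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ r_true))), 0.5, -0.5)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# MAP estimate of r by gradient descent on the negative log posterior of Eq. (4.7).
r_map = np.zeros(d)
for _ in range(n_steps):
    z = 2.0 * y * (X @ r_map)
    grad = lam * r_map - X.T @ (2.0 * y * sigmoid(-z))
    r_map -= lr * grad / n

# Laplace approximation (Eq.s 4.6, 4.8): Gaussian with covariance = inverse Hessian at r_map.
z = 2.0 * y * (X @ r_map)
w = sigmoid(z) * (1.0 - sigmoid(z))                  # per-observation curvature weights
H = lam * np.eye(d) + (X.T * w) @ X
cov_map = np.linalg.inv(H)
r_sample = rng.multivariate_normal(r_map, cov_map)   # posterior sample, as used by DPS
print(r_map, r_sample)
```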

The posterior p(r | D_n) can also be estimated using other posterior approximation methods, for instance Markov chain Monte Carlo.

Filippi et al. (2010) prove that for logistic regression, the true utilities r lie within a confidence ellipsoid centered at the estimate r̂_n with high probability, where r̂_n is the maximum quasi-likelihood estimator of r, projected onto the domain Θ:

r̂_n = argmin_{r∈Θ} || Σ_{i=1}^{n−1} [ g_log(x_i^T r) − (y_i + 1/2) ] x_i ||_{M_n^{−1}},    (4.10)

where M_n is as defined in Eq. (4.2) and g_log is the sigmoidal function, g_log(x) = (1 + e^{−x})^{−1}.¹

¹The equation Σ_{i=1}^{n−1} [ g_log(x_i^T r) − (y_i + 1/2) ] x_i = 0 is known as the likelihood equation, and its unique solution is the maximum likelihood estimate of the logistic regression problem. This fact is derived, for instance, on page 2 of Gourieroux and Monfort (1981). Because the typical form of the likelihood equation assumes that the labels belong to {0, 1}, Eq. (4.10) adds 0.5 to the labels in order to use this standard form, mirroring Filippi et al. (2010).

Proposition 2 (Proposition 1 from Filippi et al., 2010). Let {F_i}_{i=0}^∞ be a filtration, and let {η_i}_{i=1}^∞ be a real-valued stochastic process such that η_i is F_i-measurable and conditionally R-sub-Gaussian for some R ≥ 0. Let {x_i} be an R^d-valued stochastic process such that x_i is F_{i−1}-measurable. The observations y_i ∈ {−1/2, 1/2} are given by y_i = g_log(x_i^T r) + η_i, where g_log is a sigmoidal function. Assume that ||x_i||_2 ≤ L, g′_min is a positive lower bound on the slope of g_log, and λ > 0 lower-bounds the minimum eigenvalue of M_i. Then, for any δ > 0, with probability at least 1 − δ and for all i > 0, ||r̂_i − r||_{M_i} ≤ β_i(δ), where:

β_i(δ) = (2R/g′_min) √( 3 + 2 log(1 + 2L²/λ) ) √( 2d log(i) log(i/δ) ).    (4.11)

Recall from Eq. (4.2) that λ > 0 indeed lower-bounds M_i's minimum eigenvalue.

Remark 5. In fact, the result in Filippi et al. (2010) applies not only to logistic regression on binary labels, but also to any generalized linear model. Generalized linear models are a well-known statistical framework in which outcomes are generated from distributions belonging to the exponential family, and are further described in McCullagh and Nelder (1989).

Proof. The stated result is in fact a slightly-adapted version of Proposition 1 from Filippi et al. (2010), which actually derives that |g(x_i^T r) − g(x_i^T r̂_i)| ≤ β′_i(δ) ||x_i||_{M_i^{−1}} for all i ≥ 1 and vectors x_i with probability 1 − δ. The constant β′_i(δ) is related to β_i(δ) (defined above) by β′_i(δ) = (R_max L_g / R) β_i(δ), where L_g is the Lipschitz constant of g (denoted by k_μ in Filippi et al., 2010) and where each reward lies in [0, R_max] in the setting described in Filippi et al. (2010). To prove their result, Filippi et al. (2010) use that |g(x_i^T r) − g(x_i^T r̂_i)| ≤ L_g |x_i^T r − x_i^T r̂_i| by definition of Lipschitz continuity, and then show that |x_i^T r − x_i^T r̂_i| ≤ (1/L_g) β′_i(δ) ||x_i||_{M_i^{−1}} to obtain the result. Therefore, as a byproduct, their analysis also proves that:

|x_i^T r − x_i^T r̂_i| ≤ (1/L_g) β′_i(δ) ||x_i||_{M_i^{−1}} =(a) (R_max/R) β_i(δ) ||x_i||_{M_i^{−1}},

where (a) comes from substituting the relationship between β_i(δ) and β′_i(δ). Also, the analysis in Filippi et al. (2010) begins with quantities in terms of R instead of R_max, and then replaces R by R_max by using that R ≤ R_max (which holds in their setting). Therefore, their result is still valid when reverting from R_max to R, so that:

|x_i^T r − x_i^T r̂_i| ≤ β_i(δ) ||x_i||_{M_i^{−1}}.

To adapt this result to the desired form, it suffices to show that the following two statements are equivalent for any r_1, r_2 ∈ R^d, positive definite M ∈ R^{d×d}, and β > 0: 1) |x^T(r_1 − r_2)| ≤ β ||x||_{M^{−1}} for all x ∈ R^d, and 2) ||r_1 − r_2||_M ≤ β.

Firstly, if 1) holds, then in particular, it holds when setting x = M(r_1 − r_2). Substituting this definition of x into 1) yields: |(r_1 − r_2)^T M(r_1 − r_2)| ≤ β ||M(r_1 − r_2)||_{M^{−1}}. Now, the left-hand side is |(r_1 − r_2)^T M(r_1 − r_2)| = ||r_1 − r_2||²_M, while on the right-hand side, ||M(r_1 − r_2)||_{M^{−1}} = √( (r_1 − r_2)^T M M^{−1} M (r_1 − r_2) ) = ||r_1 − r_2||_M. Thus, the statement is equivalent to ||r_1 − r_2||²_M ≤ β ||r_1 − r_2||_M, and therefore ||r_1 − r_2||_M ≤ β as desired.

Secondly, if 2) holds, then for any x ∈ R^d,

|x^T(r_1 − r_2)| = |x^T M^{−1} M(r_1 − r_2)| = |⟨x, M(r_1 − r_2)⟩_{M^{−1}}|
  ≤(a) ||x||_{M^{−1}} ||M(r_1 − r_2)||_{M^{−1}} = ||x||_{M^{−1}} √( (r_1 − r_2)^T M M^{−1} M (r_1 − r_2) )
  = ||x||_{M^{−1}} ||r_1 − r_2||_M ≤(b) β ||x||_{M^{−1}},

where (a) applies the Cauchy-Schwarz inequality, while (b) uses 2).

This theoretical guarantee motivates the following modification to the Laplace-approximation posterior, which will be useful for guaranteeing asymptotic consistency:
$$p(\mathbf{r} \mid \mathcal{D}_n) \approx \mathcal{N}\!\left(\hat{\mathbf{r}}_n,\ \beta_n(\delta)^2 (M'_n)^{-1}\right), \qquad (4.12)$$
where $\hat{\mathbf{r}}_n$ and $\beta_n(\delta)$ are respectively defined in Eq.s (4.10) and (4.11), and $M'_n$ is given analogously to $\Sigma_n^{\mathrm{MAP}}$, defined in Eq.s (4.6) and (4.9), by simply replacing $\hat{\mathbf{r}}_n^{\mathrm{MAP}}$ with $\hat{\mathbf{r}}_n$ in the definition in Eq. (4.9):

$$M'_n = \lambda I + \sum_{i=1}^{n-1} \tilde{g}(2 y_i \mathbf{x}_i^T \hat{\mathbf{r}}_n)\, \mathbf{x}_i \mathbf{x}_i^T,$$

where $\tilde{g}(x) := \left(\frac{g'_{\log}(x)}{g_{\log}(x)}\right)^2 - \frac{g''_{\log}(x)}{g_{\log}(x)}$, and $g_{\log}$ is the sigmoid function.
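For concreteness, the sampling distribution in Eq. (4.12) can be assembled directly from these definitions. The following is a minimal NumPy sketch, assuming the regularizer enters as $\lambda I$ (consistent with Eq. (4.2)) and that $\hat{\mathbf{r}}_n$ and $\beta_n(\delta)$ have already been computed from Eq.s (4.10)-(4.11); all function and variable names are illustrative rather than taken from any released code.

import numpy as np

def g_log(z):
    """Sigmoid link: g_log(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def g_tilde(z):
    """g_tilde(z) := (g_log'(z) / g_log(z))^2 - g_log''(z) / g_log(z).

    For the sigmoid link this simplifies to g_log(z) * (1 - g_log(z)), since
    g_log' = g_log (1 - g_log) and g_log'' = g_log (1 - g_log)(1 - 2 g_log).
    """
    g = g_log(z)
    return g * (1.0 - g)

def sampling_distribution(X, y, r_hat, lam, beta):
    """Mean and covariance of the modified posterior in Eq. (4.12).

    X     : (n-1, d) array of observation vectors x_i
    y     : (n-1,) array of labels in {-1/2, +1/2}
    r_hat : (d,) maximum likelihood estimate from Eq. (4.10)
    lam   : regularization parameter lambda > 0
    beta  : confidence width beta_n(delta) from Eq. (4.11)
    """
    d = X.shape[1]
    # M'_n = lambda * I + sum_i g_tilde(2 y_i x_i^T r_hat) x_i x_i^T
    weights = g_tilde(2.0 * y * (X @ r_hat))
    M_prime = lam * np.eye(d) + (X * weights[:, None]).T @ X
    cov = beta**2 * np.linalg.inv(M_prime)   # covariance in Eq. (4.12)
    return r_hat, cov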

4.4 Theoretical Analysis

This section begins by deriving several asymptotic consistency results for DPS, and then motivates the use of an information-theoretic analysis inspired by Russo and Van Roy (2016) to analyze the regret of DPS. The regret analysis is subsequently discussed.

Asymptotic Consistency Results

Propositions 1 and 2 (derived in Abbasi-Yadkori, Pál, and Szepesvári, 2011 and Filippi et al., 2010, respectively) prove that with high probability, the utilities $\mathbf{r}$

lie within confidence ellipsoids centered at the estimators $\hat{\mathbf{r}}_n$, defined in Eq.s (4.2) and (4.10) for the linear and logistic link functions. These results can be leveraged to prove asymptotic consistency of DPS when the algorithm draws utility samples from the distributions in Eq.s (4.3) and (4.12), rewritten here:

$$p(\mathbf{r} \mid \mathcal{D}_n) \approx \mathcal{N}\!\left(\hat{\mathbf{r}}_n,\ \beta_n(\delta)^2 \tilde{M}_n^{-1}\right), \qquad (4.13)$$

where $\hat{\mathbf{r}}_n$ and $\beta_n(\delta)$ have different definitions corresponding to the linear and logistic link functions, and $\tilde{M}_n = M_n$ for the linear link function, while $\tilde{M}_n = M'_n$ for the logistic link function. This overloaded notation is useful because the asymptotic consistency proofs are nearly-identical in the two cases.

The asymptotic consistency results are stated below, with detailed proofs given in Appendix B.

Firstly, in the RL setting, the MDP transition dynamics model is asymptotically-consistent, that is, the sampled transition dynamics parameters converge in distribution to the true dynamics. Recall from Section 4.2 that $\tilde{\mathbf{p}}_{i1}, \tilde{\mathbf{p}}_{i2} \in \mathbb{R}^{S^2 A}$ are the two posterior samples of the transition dynamics $\mathbf{p}$ in iteration $i$, while $\mathbf{p}$ represents the true transition dynamics parameters and $\xrightarrow{D}$ denotes convergence in distribution.

Proposition 3. Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in Eq. (4.13). Then, the sampled transition dynamics $\tilde{\mathbf{p}}_{i1}, \tilde{\mathbf{p}}_{i2}$ converge in distribution to the true dynamics, $\tilde{\mathbf{p}}_{i1}, \tilde{\mathbf{p}}_{i2} \xrightarrow{D} \mathbf{p}$. This consistency result also holds when removing the $\beta_n(\delta)$ factors from the distributions in Eq. (4.13).

Proof sketch. The full proof is located in Appendix B.2. Applying standard concentration inequalities (i.e., Propositions 1 and 2) to the Dirichlet dynamics posterior, one can show that the sampled dynamics converge in distribution to their true values if every state-action pair is visited infinitely often. The latter condition can be proven via contradiction: assuming that certain state-action pairs are visited only finitely often, DPS does not receive new information about their rewards. Examining their reward posteriors, one can show that DPS is guaranteed to eventually sample high enough rewards in the unvisited state-action pairs that its policies will attempt to reach them.

It can similarly be proven that the reward samples $\tilde{\mathbf{r}}_{i1}, \tilde{\mathbf{r}}_{i2}$ converge in distribution to the true utilities, $\mathbf{r}$. This result is first established in the bandit setting, and then extended to the RL case.

Proposition 4. When running DPS in the generalized linear dueling bandit setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13), then with probability $1 - \delta$, the sampled utilities $\tilde{\mathbf{r}}_{i1}, \tilde{\mathbf{r}}_{i2}$ converge in distribution to their true values, $\tilde{\mathbf{r}}_{i1}, \tilde{\mathbf{r}}_{i2} \xrightarrow{D} \mathbf{r}$, as $i \longrightarrow \infty$.

Proof sketch. The full proof is located in Appendix B.3. By leveraging Propositions 1 and 2 (proven in Abbasi-Yadkori, Pál, and Szepesvári, 2011 and Filippi et al., 2010), one obtains that under the stated conditions, for any $\delta > 0$ and for all $i > 0$, $\|\hat{\mathbf{r}}_i - \mathbf{r}\|_{M_i} \leq \beta_i(\delta)$ with probability $1 - \delta$. This result defines a high-confidence ellipsoid, which can be linked to the posterior sampling distribution. The analysis demonstrates that it suffices to show that all eigenvalues of the posterior covariance matrix, $\beta_i(\delta)^2 \tilde{M}_i^{-1}$, converge in distribution to zero. This statement is proven via contradiction, by analyzing the behavior of posterior sampling if it does not hold. The probability of failure, $\delta$, comes entirely from Propositions 1 and 2.

To extend these results to the preference-based RL setting (the proof can be found in Appendix B.3), one can leverage asymptotic consistency of the transition dynamics, as given by Proposition 3, to obtain:

Proposition 5. When running DPS in the preference-based RL setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13), then with probability $1 - \delta$, the sampled utilities $\tilde{\mathbf{r}}_{i1}, \tilde{\mathbf{r}}_{i2}$ converge in distribution to their true values, $\tilde{\mathbf{r}}_{i1}, \tilde{\mathbf{r}}_{i2} \xrightarrow{D} \mathbf{r}$, as $i \longrightarrow \infty$.

Finally, in the preference-based RL setting, one can utilize asymptotic consistency of the transition dynamics and utility model posteriors to guarantee that the selected policies are asymptotically consistent, that is, the probability of selecting a non- optimal policy approaches zero. An analogous result holds in the generalized linear dueling bandit setting if the action space A is finite.

Theorem 1. Assume that DPS is executed in the preference-based RL setting, with utilities given via either the linear or logistic link function and with posterior sampling distributions given in Eq. (4.13).

With probability $1 - \delta$, the sampled policies $\pi_{i1}, \pi_{i2}$ converge in distribution to the optimal policy, $\pi^*$, as $i \longrightarrow \infty$. That is, $P(\pi_{i1} = \pi^*) \longrightarrow 1$ and $P(\pi_{i2} = \pi^*) \longrightarrow 1$ as $i \longrightarrow \infty$.

Under the same conditions, with probability $1 - \delta$, in the generalized linear dueling bandit setting with a finite action space $\mathcal{A}$, the sampled actions converge in distribution to the optimal action: $P(\mathbf{x}_{i1} = \mathbf{x}^*) \longrightarrow 1$ and $P(\mathbf{x}_{i2} = \mathbf{x}^*) \longrightarrow 1$ as $i \longrightarrow \infty$.

The proof of Theorem 1 is located in Appendix B.4.

Why Use an Information-Theoretic Approach to Analyze Regret?

Several existing regret analyses in the linear bandit domain (e.g., Abbasi-Yadkori, Pál, and Szepesvári, 2011; Agrawal and Goyal, 2013; Abeille and Lazaric, 2017) utilize martingale concentration properties introduced by Abbasi-Yadkori, Pál, and Szepesvári (2011). In these analyses, a key step requires sublinearly upper-bounding an expression of the form (e.g., Lemma 11 in Abbasi-Yadkori, Pál, and Szepesvári, 2011; Prop. 2 in Abeille and Lazaric, 2017):

$$\sum_{i=1}^{n} \mathbf{x}_i^T \left(\lambda I + \sum_{s=1}^{i-1}\mathbf{x}_s\mathbf{x}_s^T\right)^{-1} \mathbf{x}_i, \qquad (4.14)$$

where $\lambda \geq 1$ and $\mathbf{x}_i \in \mathbb{R}^d$ is the observation in iteration $i$. In the preference-based feedback setting, however, the analogous quantity cannot always be sublinearly upper-bounded. Consider the settings defined in Chapter 3, with Bayesian linear regression credit assignment. Under preference feedback, the probability that one trajectory is preferred to another is assumed to be fully determined by the difference between the trajectories' total rewards: in iteration $i$, the algorithm receives a pair of observations $\mathbf{x}_{i1}, \mathbf{x}_{i2}$, with $\mathbf{x}_i := \mathbf{x}_{i2} - \mathbf{x}_{i1}$, and a preference generated according to $P(\mathbf{x}_{i2} \succ \mathbf{x}_{i1}) = \mathbf{r}^T(\mathbf{x}_{i2} - \mathbf{x}_{i1}) + \frac{1}{2}$. Thus, only differences between compared trajectory feature vectors yield information about the rewards. Under this assumption, one can show that applying the analogous martingale techniques yields the following variant of Eq. (4.14):
$$\sum_{i=1}^{n}\sum_{j=1}^{2} \mathbf{x}_{ij}^T \left(\lambda I + \sum_{s=1}^{i-1}\mathbf{x}_s\mathbf{x}_s^T\right)^{-1} \mathbf{x}_{ij}. \qquad (4.15)$$
This difference occurs because the expression within the matrix inverse comes from the posterior, and learning occurs with respect to the observations $\mathbf{x}_i$, while regret is incurred with respect to $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$. In contrast, in the non-preference-based case considered in Agrawal and Goyal (2013), Abeille and Lazaric (2017), and Abbasi-Yadkori, Pál, and Szepesvári (2011), learning and regret both occur with respect to the same vectors $\mathbf{x}_i$.

To see that Eq. (4.15) does not necessarily have a sublinear upper bound, consider a deterministic MDP as a counterexample. For the regret to have a sublinear upper bound, the probability of choosing the optimal policy must approach 1. In a fully deterministic MDP, this means that $P(\mathbf{x}_{i1} = \mathbf{x}_{i2}) \longrightarrow 1$ as $i \longrightarrow \infty$, and thus $P(\mathbf{x}_i = \mathbf{0}) \longrightarrow 1$ as $i \longrightarrow \infty$. Clearly, in this case, the matrix inside the inverse in Eq. (4.15) acquires nonzero terms at a rate that decays in $n$; the summands $\mathbf{x}_{ij}^T(\cdot)^{-1}\mathbf{x}_{ij}$ therefore do not decay toward zero, and so the expression in Eq. (4.15) does not have a sublinear upper bound.

Intuitively, to upper-bound Eq. (4.15), one would need the observations x81, x82 to contribute fully toward learning the rewards, rather than the contribution coming only from their difference. Notice that in the counterexample, even when the optimal policy is selected increasingly-often, corresponding to a low regret, Eq. (4.15) cannot be sublinearly bounded. In contrast, Russo and Van Roy (2016) introduce a more direct approach for quantifying the trade-off between instantaneous regret and the information gained from reward feedback, as encapsulated by the information ratio defined therein; the theoretical analysis of DPS is thus based upon this latter framework.

Relationship between Information Ratio and Regret

This section extends the concept of the information ratio introduced in Russo and Van Roy (2016) to analyze regret in the preference-based setting. In particular, a relationship is derived between the information ratio and Bayesian regret; this result is analogous to Proposition 1 in Russo and Van Roy (2016), which applies to the bandit setting with absolute, numerical rewards. This section considers both the preference-based generalized linear bandit setting and the preference-based RL setting with known transition dynamics. In the former case, the action space $\mathcal{A}$ is assumed to be finite.

Firstly, regarding notation, let $a_{i1}, a_{i2}$ denote DPS's two selections in iteration $i$. More specifically, in the bandit setting, $a_{i1}, a_{i2}$ are actions in $\mathcal{A}$, i.e., $a_{i1} = \mathbf{x}_{i1}$ and $a_{i2} = \mathbf{x}_{i2}$; meanwhile, in RL they are policies, such that $a_{i1} = \pi_{i1}$ and $a_{i2} = \pi_{i2}$. These variables are introduced to obtain a common notation between the bandit and RL settings.

Secondly, for an action or RL policy $a$, let $u(a)$ denote the total expected utility of executing it. In the bandit setting, $u(a) = \mathbf{r}^T a$, while in RL, $u(a) = \mathbf{r}^T \mathbb{E}_{\mathbf{x}\sim a}[\mathbf{x}]$, where the trajectory features $\mathbf{x}$ are sampled by rolling out the policy $a$. The optimal choice $a^*$ is the one which maximizes the utility (i.e., $a^* = \mathbf{x}^*$ in the bandit case, and $a^* = \pi^*$ in RL).

Define the algorithm's history after $i$ iterations as $\mathcal{H}_i = \{Z_1, Z_2, \ldots, Z_i\}$, where $Z_i = (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)$. Note that the algorithm does not need to keep track of $a_{i1}$ and $a_{i2}$ in its history, as the utility posterior is updated only with respect to $(\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)$, and in the RL setting, the transition dynamics are assumed to be known.

Analogously to Russo and Van Roy (2016), notation is established for probabilities and information-theoretic quantities when conditioning on the history $\mathcal{H}_{i-1}$. In particular, $P_i(\cdot) := P(\cdot \mid \mathcal{H}_{i-1})$ and $\mathbb{E}_i[\cdot] := \mathbb{E}[\cdot \mid \mathcal{H}_{i-1}]$. With respect to the history, the entropy of a random variable $X$ is $H_i(X) := -\sum_x P_i(X = x)\log P_i(X = x)$. Finally, conditioned on $\mathcal{H}_{i-1}$, the mutual information of two random variables $X$ and $Y$ is given by $I_i(X; Y) := H_i(X) - H_i(X \mid Y)$.

The information ratio $\Gamma_i$ in iteration $i$ is then defined as follows:
$$\Gamma_i := \frac{\left(\mathbb{E}_i[2u(a^*) - u(a_{i1}) - u(a_{i2})]\right)^2}{I_i(a^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i))}. \qquad (4.16)$$

Analogously to Russo and Van Roy (2016), the numerator of the information ratio is the squared expected instantaneous regret, conditioned on all prior knowledge given the history. Meanwhile, the denominator is the expected information gain with respect to the optimal action or policy $a^*$ upon receiving the data $Z_i = (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)$ in the current iteration. Intuitively, if the ratio $\Gamma_i$ is small, then in iteration $i$, DPS either incurs little instantaneous regret or gains significant information about $a^*$.

The following result, analogous to Proposition 1 in Russo and Van Roy (2016), assumes that the information ratio $\Gamma_i$ has a uniform upper bound $\Gamma$, and shows that the cumulative Bayesian regret can be upper-bounded in terms of $\Gamma$:

Theorem 2. In either the preference-based generalized linear bandit problem or preference-based RL with known transition dynamics, if $\Gamma_i \leq \Gamma$ almost surely for each $i \in \{1, \ldots, N\}$, then:
$$\mathrm{BayesReg}(N) \leq \sqrt{\Gamma N H(a^*)},$$
where $N$ is the number of learning iterations, and $H(a^*)$ is the entropy of $a^*$.

Proof. The Bayesian regret can be upper-bounded as follows:

" # # Õ ∗ BayesReg(#) = E (2D(a ) − D(a81) − D(a82)) 8=1 " # # (0) Õ ∗ = E E8 [(2D(a ) − D(a81) − D(a82))] , 8=1 where (a) follows from the Tower property of expectation and the outer expectation is taken with respect to the history. Substituting the definition of the information ratio into the latter expression yields:

" # # Õ p ∗ BayesReg(#) = E Γ8 8 (a ; (x82 − x81,H8)) 8=1 # p Õ hp ∗ i ≤ Γ E 8 (a ; (x82 − x81,H8) 8=1 # (0) p Õ p ∗ ≤ Γ E[8 (a ; (x82 − x81,H8)] 8=1 v u # (1) t Õ ∗ ≤ Γ# E[8 (a ; (x82 − x81,H8)], 8=1 where (a) applies Jensen’s inequality, and (b) holds via the Cauchy-Schwarz in- Í#  ∗ ,H equality. To upper-bound the sum 8=1 E[ 8 (a ; (x82 − x81 8)], first recall that /8 = (x82 − x81,H8) denotes the information acquired during iteration 8. Then, each term in the summation can be written:

$$\mathbb{E}[I_i(a^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i))] = \mathbb{E}[I_i(a^*; Z_i)] = I(a^*; Z_i \mid Z_1, \ldots, Z_{i-1}),$$

where the last step follows from the definition of conditional mutual information. Therefore:
$$\sum_{i=1}^{N}\mathbb{E}[I_i(a^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i))] = \sum_{i=1}^{N} I(a^*; Z_i \mid Z_1, \ldots, Z_{i-1}) \overset{(a)}{=} I(a^*; (Z_1, \ldots, Z_N)) \overset{(b)}{=} H(a^*) - H(a^* \mid Z_1, \ldots, Z_N) \overset{(c)}{\leq} H(a^*),$$

where (a) applies the chain rule for mutual information (Fact 5, Section 2.3), (b) follows from the entropy reduction form of mutual information (Fact 4), and (c) holds by non-negativity of entropy (Fact 1).

The desired result follows.

Note that the total number of actions selected is $T = 2N$ in the dueling bandit case and $T = 2Nh$ for preference-based RL. Furthermore, the entropy $H(a^*)$ can be upper-bounded in terms of the number of actions or number of policies (for bandits and RL, respectively) using Fact 1. For the bandit case, $H(a^*) = H(\mathbf{x}^*) \leq \log|\mathcal{A}|$; under an informative prior, $H(a^*)$ could potentially be much smaller than the upper bound $\log|\mathcal{A}|$. In the RL scenario, meanwhile, the number of deterministic policies is equal to $A^{Sh}$, so that $H(a^*) = H(\pi^*) \leq \log(A^{Sh}) = Sh\log A$. This leads to the following corollary to Theorem 2, specific to preference-based RL:

Corollary 1. In the preference-based RL setting with known transition dynamics, if $\Gamma_i \leq \Gamma$ almost surely for each $i \in \{1, \ldots, N\}$, then:
$$\mathrm{BayesReg}(N) \leq \sqrt{\Gamma N S h \log A}.$$
Moreover, since $T = 2Nh$, the regret in terms of the total number of actions (or time-steps) is:
$$\mathrm{BayesReg}_{\mathrm{RL}}(T) \leq \sqrt{\frac{\Gamma T S \log A}{2}},$$
where $\mathrm{BayesReg}_{\mathrm{RL}}(T)$ is in terms of the total number of time-steps, while $\mathrm{BayesReg}(N)$ is in terms of the number of learning iterations.

Thus, Bayesian regret can be upper-bounded by showing that the information ratio is upper-bounded, and then applying either Theorem 2 or Corollary 1.

Remark 6. Importantly, the relationship in Theorem 2 between the Bayesian regret and the information ratio only applies under an exact posterior (i.e., the posterior is an exact product of a prior and likelihood); however, not all of the posterior sampling distributions considered in Section 4.3 satisfy this criterion. For instance, while the $\beta_n(\delta)$ factors in Eq. (4.13) were useful for deriving asymptotic consistency results, they yield a sampling distribution that is no longer an exact posterior. To see that a true posterior is indeed required, consider the following inequality, used in proving Theorem 2:
$$\sum_{i=1}^{N}\mathbb{E}[I_i(a^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i))] \leq H(a^*).$$
This inequality uniformly upper-bounds the sum of the information gains across all iterations: intuitively, the total amount of information gained cannot exceed the initial uncertainty about the optimal policy or action. Yet, the factor $\beta_n(\delta)$ increases in $n$, so that multiplying the posterior covariance by $\beta_n(\delta)^2$ injects uncertainty into the model posterior indefinitely. In such a case, the sum of the information gains could not have a finite upper bound, and thus the above inequality could not hold.

Estimating the Information Ratio Empirically

This thesis conjectures an upper bound on the information ratio $\Gamma_i$ for the linear link function by empirically estimating $\Gamma_i$'s values. This subsection discusses the procedure used to estimate the information ratio in simulation. The process is first discussed for the preference-based bandit setting, and then extended to RL with known transition dynamics.

Information ratio estimation for generalized linear dueling bandits. In the preference-based generalized linear bandit setting, the information ratio can equivalently be written as:
$$\Gamma_i := \frac{\left(\mathbb{E}_i[2u(\mathbf{x}^*) - u(\mathbf{x}_{i1}) - u(\mathbf{x}_{i2})]\right)^2}{I_i(\mathbf{x}^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i))}.$$

The following lemma, inspired by similar analyses in Russo and Van Roy (2016) and Russo and Van Roy (2014a) for settings with numerical rewards, expresses both the

numerator and denominator of $\Gamma_i$ in forms that are more straightforward to calculate empirically.

Lemma 1. When running DPS in the preference-based generalized linear bandit setting, the following two statements hold:

$$\mathbb{E}_i[2u(\mathbf{x}^*) - u(\mathbf{x}_{i1}) - u(\mathbf{x}_{i2})] = 2\sum_{\mathbf{x}\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x})\,\mathbf{x}^T\left(\mathbb{E}_i[\mathbf{r} \mid \mathbf{x}^* = \mathbf{x}] - \mathbb{E}_i[\mathbf{r}]\right), \ \text{and}$$
$$I_i(\mathbf{x}^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)) \geq 2\sum_{\mathbf{x},\mathbf{x}'\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x})P_i(\mathbf{x}^* = \mathbf{x}')(\mathbf{x} - \mathbf{x}')^T\left\{\sum_{\mathbf{x}''\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x}'')\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}''] - \mathbb{E}_i[\mathbf{r}]\right)\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}''] - \mathbb{E}_i[\mathbf{r}]\right)^T\right\}(\mathbf{x} - \mathbf{x}').$$

Proof. To prove the first statement:

$$\mathbb{E}_i[2u(\mathbf{x}^*) - u(\mathbf{x}_{i1}) - u(\mathbf{x}_{i2})] \overset{(a)}{=} 2\mathbb{E}_i[u(\mathbf{x}^*) - u(\mathbf{x}_{i1})]$$
$$= 2\sum_{\mathbf{x}\in\mathcal{A}}\left(P_i(\mathbf{x}^* = \mathbf{x})\mathbb{E}_i[u(\mathbf{x}^*) \mid \mathbf{x}^* = \mathbf{x}] - P_i(\mathbf{x}_{i1} = \mathbf{x})\mathbb{E}_i[u(\mathbf{x}_{i1}) \mid \mathbf{x}_{i1} = \mathbf{x}]\right)$$
$$\overset{(b)}{=} 2\sum_{\mathbf{x}\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x})\left(\mathbb{E}_i[u(\mathbf{x}) \mid \mathbf{x}^* = \mathbf{x}] - \mathbb{E}_i[u(\mathbf{x})]\right)$$
$$\overset{(c)}{=} 2\sum_{\mathbf{x}\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x})\left(\mathbb{E}_i[\mathbf{x}^T\mathbf{r} \mid \mathbf{x}^* = \mathbf{x}] - \mathbb{E}_i[\mathbf{x}^T\mathbf{r}]\right) = 2\sum_{\mathbf{x}\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x})\,\mathbf{x}^T\left(\mathbb{E}_i[\mathbf{r} \mid \mathbf{x}^* = \mathbf{x}] - \mathbb{E}_i[\mathbf{r}]\right),$$

where (a) holds because $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ are independent samples from the model posterior, and thus are identically-distributed; (b) holds because posterior sampling selects each action according to its posterior probability of being optimal, so that $P_i(\mathbf{x}^* = \mathbf{x}) = P_i(\mathbf{x}_{i1} = \mathbf{x})$; and (c) applies the definition of utility for generalized linear dueling bandits.

Turning to the second statement, it is notationally-convenient to define a function $y: \mathcal{A}\times\mathcal{A} \to \{-\frac{1}{2}, \frac{1}{2}\}$, where for each pair of actions $\mathbf{x}, \mathbf{x}'\in\mathcal{A}$, $y(\mathbf{x},\mathbf{x}')$ is the outcome of a preference query between $\mathbf{x}$ and $\mathbf{x}'$:
$$y(\mathbf{x},\mathbf{x}') = \begin{cases} \ \ \tfrac{1}{2} & \text{with probability } P(\mathbf{x}\succ\mathbf{x}'), \\ -\tfrac{1}{2} & \text{with probability } P(\mathbf{x}\prec\mathbf{x}'). \end{cases}$$
Equipped with this definition, we can see that:

$$I_i(\mathbf{x}^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)) \overset{(a)}{=} I_i(\mathbf{x}^*; \mathbf{x}_{i2} - \mathbf{x}_{i1}) + I_i(\mathbf{x}^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1}) \overset{(b)}{=} I_i(\mathbf{x}^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1})$$
$$\overset{(c)}{=} \sum_{\mathbf{x},\mathbf{x}'\in\mathcal{A}} P_i(\mathbf{x}_{i1} = \mathbf{x}')P_i(\mathbf{x}_{i2} = \mathbf{x})\, I_i(\mathbf{x}^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1} = \mathbf{x} - \mathbf{x}')$$
$$\overset{(d)}{=} \sum_{\mathbf{x},\mathbf{x}'\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x}')P_i(\mathbf{x}^* = \mathbf{x})\, I_i(\mathbf{x}^*; y(\mathbf{x},\mathbf{x}'))$$
$$\overset{(e)}{=} \sum_{\mathbf{x},\mathbf{x}'\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x}')P_i(\mathbf{x}^* = \mathbf{x})\sum_{\mathbf{x}''\in\mathcal{A}} P_i(\mathbf{x}^* = \mathbf{x}'')\, D_{KL}\!\left(P_i(y(\mathbf{x},\mathbf{x}')\mid\mathbf{x}^*=\mathbf{x}'')\,\big\|\,P_i(y(\mathbf{x},\mathbf{x}'))\right),$$
where (a) applies the chain rule for mutual information (Fact 5), and (b) uses that $I_i(\mathbf{x}^*; \mathbf{x}_{i2} - \mathbf{x}_{i1}) = 0$, which holds because conditioned on the history, $\mathbf{x}_{i1}, \mathbf{x}_{i2}$ are independent samples from the posterior of $\mathbf{x}^*$ that yield no new information about $\mathbf{x}^*$. Then, (c) holds by definition of conditional mutual information, and (d) holds because posterior sampling selects each action according to its probability of being optimal under the posterior, so that $P_i(\mathbf{x}^* = \mathbf{x}) = P_i(\mathbf{x}_{i1} = \mathbf{x})$. Finally, (e) applies Fact 6 from Section 2.3 (which is also Fact 6 in Russo and Van Roy, 2016).

Next, to lower-bound the Kullback-Leibler divergence in terms of utilities, one can apply Fact 7 from Section 2.3 (Fact 9 in Russo and Van Roy, 2016):

$$I_i(\mathbf{x}^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)) \overset{(a)}{\geq} 2\sum_{\mathbf{x},\mathbf{x}',\mathbf{x}''\in\mathcal{A}} P_i(\mathbf{x}^*=\mathbf{x})P_i(\mathbf{x}^*=\mathbf{x}')P_i(\mathbf{x}^*=\mathbf{x}'')\left(\mathbb{E}_i[y(\mathbf{x},\mathbf{x}')\mid\mathbf{x}^*=\mathbf{x}''] - \mathbb{E}_i[y(\mathbf{x},\mathbf{x}')]\right)^2$$
$$\overset{(b)}{=} 2\sum_{\mathbf{x},\mathbf{x}',\mathbf{x}''\in\mathcal{A}} P_i(\mathbf{x}^*=\mathbf{x})P_i(\mathbf{x}^*=\mathbf{x}')P_i(\mathbf{x}^*=\mathbf{x}'')\left(\mathbb{E}_i[(\mathbf{x}-\mathbf{x}')^T\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}''] - \mathbb{E}_i[(\mathbf{x}-\mathbf{x}')^T\mathbf{r}]\right)^2$$
$$= 2\sum_{\mathbf{x},\mathbf{x}',\mathbf{x}''\in\mathcal{A}} P_i(\mathbf{x}^*=\mathbf{x})P_i(\mathbf{x}^*=\mathbf{x}')P_i(\mathbf{x}^*=\mathbf{x}'')\left((\mathbf{x}-\mathbf{x}')^T\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}''] - \mathbb{E}_i[\mathbf{r}]\right)\right)^2,$$
where (a) applies Fact 7 and (b) applies the definition of the linear link function.

It is straightforward to rearrange the final expression into the form given in the statement of the lemma.

Similarly to Russo and Van Roy (2014a), the information ratio simulations in fact

estimate an upper bound on $\Gamma_i$, which is obtained by replacing its denominator in Eq. (4.16) by the lower bound obtained in Lemma 1.

To estimate the quantities in Lemma 1, it is helpful to define the following notation.

Firstly, let $\mu_i$ be the expected value of $\mathbf{r}$ at iteration $i$:

$$\mu_i := \mathbb{E}_i[\mathbf{r}].$$

Similarly, $\mu_i^{(\mathbf{x})}$ is the expected value of $\mathbf{r}$ at iteration $i$ when conditioned on $\mathbf{x}^* = \mathbf{x}$:
$$\mu_i^{(\mathbf{x})} := \mathbb{E}_i[\mathbf{r} \mid \mathbf{x}^* = \mathbf{x}].$$

The bracketed matrix in the statement of Lemma 1 is denoted by $L_i$:

$$L_i := \sum_{\mathbf{x}\in\mathcal{A}} P_i(\mathbf{x}^*=\mathbf{x})\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}] - \mathbb{E}_i[\mathbf{r}]\right)\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}] - \mathbb{E}_i[\mathbf{r}]\right)^T.$$

Algorithm 11: Empirically estimating the information ratio in the linear dueling bandit setting
1: Input: $\mathcal{A}$ = action space, $\nu$ = posterior distribution over $\mathbf{r}$, $M$ = number of samples to draw from the posterior
2: $\mathbf{r}_1, \ldots, \mathbf{r}_M \sim \nu$  ⊲ Draw samples from utility posterior
3: $\hat{\mu} \leftarrow \frac{1}{M}\sum_m \mathbf{r}_m$  ⊲ Estimate $\mu_i = \mathbb{E}_i[\mathbf{r}]$
4: $\hat{\Theta}_{\mathbf{x}} \leftarrow \{m : \mathbf{x}^T\mathbf{r}_m = \max_{\mathbf{x}'\in\mathcal{A}}(\mathbf{x}'^T\mathbf{r}_m)\}\ \forall\,\mathbf{x}\in\mathcal{A}$  ⊲ Samples $\mathbf{r}_m$ for which $\mathbf{x}$ is optimal
5: $p^*(\mathbf{x}) \leftarrow |\hat{\Theta}_{\mathbf{x}}|/M\ \forall\,\mathbf{x}\in\mathcal{A}$  ⊲ Empirical posterior probability of actions being optimal
6: $\hat{\mu}^{(\mathbf{x})} \leftarrow \frac{1}{|\hat{\Theta}_{\mathbf{x}}|}\sum_{m\in\hat{\Theta}_{\mathbf{x}}}\mathbf{r}_m\ \forall\,\mathbf{x}\in\mathcal{A}$  ⊲ Estimate $\mu_i^{(\mathbf{x})} = \mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*=\mathbf{x}]$
7: $\hat{L} \leftarrow \sum_{\mathbf{x}\in\mathcal{A}} p^*(\mathbf{x})(\hat{\mu}^{(\mathbf{x})}-\hat{\mu})(\hat{\mu}^{(\mathbf{x})}-\hat{\mu})^T$  ⊲ Estimate $L_i$
8: $r^* \leftarrow \sum_{\mathbf{x}\in\mathcal{A}} p^*(\mathbf{x})\,\mathbf{x}^T\hat{\mu}^{(\mathbf{x})}$  ⊲ Estimate posterior reward of optimal action
9: $\Delta_{\mathbf{x}_1,\mathbf{x}_2} \leftarrow 2r^* - (\mathbf{x}_1+\mathbf{x}_2)^T\hat{\mu}\ \forall\,\mathbf{x}_1,\mathbf{x}_2\in\mathcal{A}$
10: $\Delta \leftarrow \sum_{\mathbf{x}_1,\mathbf{x}_2\in\mathcal{A}} p^*(\mathbf{x}_1)p^*(\mathbf{x}_2)\Delta_{\mathbf{x}_1,\mathbf{x}_2}$  ⊲ Expected instantaneous regret
11: $v_{\mathbf{x}_1,\mathbf{x}_2} \leftarrow (\mathbf{x}_2-\mathbf{x}_1)^T\hat{L}(\mathbf{x}_2-\mathbf{x}_1)\ \forall\,\mathbf{x}_1,\mathbf{x}_2\in\mathcal{A}$
12: $v \leftarrow \sum_{\mathbf{x}_1,\mathbf{x}_2\in\mathcal{A}} p^*(\mathbf{x}_1)p^*(\mathbf{x}_2)v_{\mathbf{x}_1,\mathbf{x}_2}$  ⊲ Information gain lower bound
13: $\hat{\Gamma} \leftarrow \Delta^2/(2v)$  ⊲ Estimated information ratio

The matrix $L_i$ can also be written in the following, more compact form:
$$L_i = \mathbb{E}_i\left[\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*] - \mathbb{E}_i[\mathbf{r}]\right)\left(\mathbb{E}_i[\mathbf{r}\mid\mathbf{x}^*] - \mathbb{E}_i[\mathbf{r}]\right)^T\right] = \mathbb{E}_i\left[\left(\mu_i^{\mathbf{x}^*} - \mu_i\right)\left(\mu_i^{\mathbf{x}^*} - \mu_i\right)^T\right],$$
where the outer expectation is taken with respect to the optimal action, $\mathbf{x}^*$.

Algorithm 11 outlines the procedure for estimating the information ratio in the linear dueling bandit setting, given a posterior distribution $P(\mathbf{r} \mid \mathcal{D})$ over $\mathbf{r}$ from which samples can be drawn. This procedure extends the one given in Russo and Van Roy (2014a) in order to handle relative preference feedback. First, in Line 2, the algorithm draws $M$ samples from the model posterior over $\mathbf{r}$. These are then averaged to obtain an estimate $\hat{\mu}$ of $\mu_i$ (Line 3). By dividing the posterior samples of $\mathbf{r}$ into groups $\hat{\Theta}_{\mathbf{x}}$ depending on which optimal action they induce, one can then estimate each action's posterior probability $p^*(\mathbf{x})$ of being optimal (Lines 4-5). Note that the values of $p^*(\mathbf{x})$ also give the probability that DPS selects each action. The procedure then uses these quantities to estimate $\mu_i^{(\mathbf{x})}$ (Line 6) and $L_i$ (Line 7). From these, the instantaneous regret and information gain lower bound are both estimated, using the forms given in Lemma 1. Finally, the information ratio is estimated; in fact, the value calculated in Line 13 upper bounds the true information ratio $\Gamma_i$, due to the inequality in Lemma 1.
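To make the procedure concrete, the following minimal NumPy sketch implements the steps of Algorithm 11, under the assumption that the $M$ posterior samples are supplied as an array (e.g., draws from the conjugate Gaussian posterior or from MCMC); all names are illustrative. The returned value corresponds to Line 13 and therefore upper-bounds $\Gamma_i$.

import numpy as np

def estimate_information_ratio(actions, posterior_samples):
    """Empirically estimate the information ratio, following Algorithm 11.

    actions           : (A, d) array; each row is an action in the action space.
    posterior_samples : (M, d) array of samples r_1, ..., r_M from the utility posterior.
    """
    A, d = actions.shape
    M = posterior_samples.shape[0]

    mu_hat = posterior_samples.mean(axis=0)            # Line 3: estimate E_i[r]
    utilities = posterior_samples @ actions.T          # (M, A): x^T r_m for every action
    best = np.argmax(utilities, axis=1)                # optimal action index per sample

    p_star = np.bincount(best, minlength=A) / M        # Line 5: P_i(x* = x)
    support = np.flatnonzero(p_star > 0)

    # Lines 6-7: conditional means mu^(x) and the matrix L_hat
    L_hat = np.zeros((d, d))
    mu_cond = np.tile(mu_hat, (A, 1))
    for a in support:
        mu_cond[a] = posterior_samples[best == a].mean(axis=0)
        diff = mu_cond[a] - mu_hat
        L_hat += p_star[a] * np.outer(diff, diff)

    # Line 8: estimated posterior reward of the optimal action
    r_star = sum(p_star[a] * actions[a] @ mu_cond[a] for a in support)

    # Lines 9-10: expected instantaneous regret Delta
    mean_utility = p_star @ (actions @ mu_hat)
    Delta = 2.0 * r_star - 2.0 * mean_utility

    # Lines 11-12: lower bound v on the information gain
    # (forms an (A, A, d) array; for very large action spaces one may loop instead)
    diffs = actions[:, None, :] - actions[None, :, :]
    v_pairs = np.einsum('abd,de,abe->ab', diffs, L_hat, diffs)
    v = p_star @ v_pairs @ p_star

    if v == 0.0:
        return np.nan   # posterior concentrated on a single action (see discussion below)
    return Delta**2 / (2.0 * v)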

Information ratio estimation for preference-based RL with known dynamics. To extend Algorithm 11 to preference-based RL with known transition dynamics, one must replace the action space $\mathcal{A}$ with the set of deterministic policies $\Pi$, where $|\Pi| = A^{Sh}$ (recall that $S$ is the number of states in the MDP and $h$ is the episode length). While the bandit case associates each action in $\mathcal{A}$ with a $d$-dimensional vector, the present setting analogously associates each policy $\pi$ with a known $d$-dimensional occupancy vector $\mathbf{x}_\pi$ (recall that $d = SA$): in $\mathbf{x}_\pi$, each element is the expected number of times that a particular state-action pair is visited under $\pi$. More formally, the occupancy vector associated with $\pi$ is given by
$$\mathbf{x}_\pi = \mathbb{E}_{\mathbf{x}\sim\pi}[\mathbf{x}],$$
where $\mathbf{x}$ is a trajectory obtained by executing the policy $\pi$, and the expectation is taken with respect to the known MDP transition dynamics and initial state distribution. This subsection first discusses computation of the occupancy vectors, and then shows that to extend Algorithm 11 to RL, one must simply replace the action space $\mathcal{A}$ with $\Pi$, and replace the actions in $\mathcal{A}$ with the occupancy vectors associated with the policies: $\{\mathbf{x}_\pi \mid \pi \in \Pi\}$. Each policy $\pi$ can therefore be viewed as a meta-action leading to a known distribution over trajectory feature vectors, whose expectation is given by the occupancy vector $\mathbf{x}_\pi$.

The occupancy vector for a given deterministic policy $\pi$ can be calculated as follows. First, define $\mathbf{s}_\pi(t) \in \mathbb{R}^S$ as a vector containing the probabilities of being in each state at time-step $t$ in the episode, where $t \in \{1, \ldots, h\}$. Further, the state at time $t$ is denoted by $s_t \in \{1, \ldots, S\}$. Then, for $t = 1$, $\mathbf{s}_\pi(1) = [p_0(1), \ldots, p_0(S)]^T$, where $p_0(i)$ is the probability of starting in state $i$, as given by the initial state probabilities (assumed to be known). Also, define $\mathbf{x}_\pi(t) \in \mathbb{R}^d$ as the probability of being in each state-action pair at time-step $t$. Then, for each $t = 1, 2, \ldots, h$:

1. Obtain $\mathbf{x}_\pi(t)$ from $\mathbf{s}_\pi(t)$: for each state-action pair $(s, a)$, the corresponding element of $\mathbf{x}_\pi(t)$ is equal to $[\mathbf{s}_\pi(t)]_s\, \mathbb{I}_{[\pi(s,t)=a]}$; that is, the probability that the agent visits state $s$ at time $t$ is multiplied by 0 or 1 depending on whether the deterministic policy $\pi$ takes action $a$ in state $s$ at time $t$.

2. Define the matrix $P^{(t)}$ such that $[P^{(t)}]_{ij} = P(s_{t+1} = j \mid s_t = i,\ a_t = \pi(i, t))$.

3. Calculate the probability of being in each state at the next time-step: $\mathbf{s}_\pi(t+1) = (P^{(t)})^T \mathbf{s}_\pi(t)$.

Finally, the expected number of visits to each state-action pair over the entire episode is given by summing the probabilities of encountering each state-action pair at each time-step:
$$\mathbf{x}_\pi = \sum_{t=1}^{h} \mathbf{x}_\pi(t).$$
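As an illustration of the recursion above, the occupancy vector of a deterministic, time-varying policy can be computed with a few lines of NumPy. This is a minimal sketch assuming 0-indexed states, actions, and time-steps and a policy stored as an $S \times h$ action table; names are illustrative.

import numpy as np

def occupancy_vector(policy, P, p0, h):
    """Compute the occupancy vector x_pi for a deterministic, time-varying policy.

    policy : (S, h) integer array; policy[s, t] is the action taken in state s at time t.
    P      : (S, A, S) array of known transition probabilities, P[s, a, s'] = P(s' | s, a).
    p0     : (S,) initial state distribution.
    h      : episode horizon.
    """
    S, A, _ = P.shape
    s_dist = p0.copy()                       # s_pi(1): state distribution at t = 1
    x_pi = np.zeros((S, A))
    for t in range(h):
        # Step 1: spread state probabilities onto the chosen state-action pairs.
        x_t = np.zeros((S, A))
        x_t[np.arange(S), policy[:, t]] = s_dist
        x_pi += x_t
        # Steps 2-3: propagate the state distribution one step forward.
        P_t = P[np.arange(S), policy[:, t], :]   # (S, S): row s gives P(. | s, pi(s, t))
        s_dist = P_t.T @ s_dist
    return x_pi.reshape(-1)                  # flatten to a d = S*A vector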

These occupancy vectors $\mathbf{x}_\pi$ can be pre-computed for each policy $\pi \in \Pi$. Below, Lemma 2 will show that these vectors can be used in place of action vectors in Algorithm 11. Regarding implementation, it is convenient to index the policies from 1 to $|\Pi| = A^{Sh}$, and to convert between these policy indices and the policies' parameters. To do so efficiently, one can represent each policy's parameters as a vector in $\{1, \ldots, A\}^{Sh}$, in which each element is the action taken in a specific state and time-step. Then, one can treat each policy vector in $\{1, \ldots, A\}^{Sh}$ as a base-$A$ integer, and convert between bases $A$ and 10 as needed to interchange policy parameters with policy indices.
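A minimal sketch of this base-$A$ indexing is shown below, assuming 0-indexed actions and policy indices (the thesis indexes actions from 1); names are illustrative.

import numpy as np

def index_to_policy(index, S, A, h):
    """Decode a policy index in {0, ..., A**(S*h) - 1} into an (S, h) action table."""
    digits = np.zeros(S * h, dtype=int)
    for k in range(S * h):
        index, digits[k] = divmod(index, A)
    return digits.reshape(S, h)

def policy_to_index(policy, A):
    """Encode an (S, h) action table back into its base-A policy index."""
    digits = policy.reshape(-1)
    return int(sum(int(d) * A**k for k, d in enumerate(digits)))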

Finally, the following lemma is analogous to Lemma 1 for the generalized linear dueling bandit setting, but is adapted to the preference-based RL setting. Though the proof of Lemma 2 is similar to that of Lemma 1, it requires several additional steps. Recall that the utility of a policy $\pi$, given the state-action utilities $\mathbf{r}$ and known dynamics, is defined as: $u(\pi) := \mathbf{r}^T\mathbb{E}_{\mathbf{x}\sim\pi}[\mathbf{x}] = \mathbf{r}^T\mathbf{x}_\pi$.

Lemma 2. When running DPS in preference-based RL with known transition dynamics and initial state probabilities, the following two statements hold:
$$\mathbb{E}_i[2u(\pi^*) - u(\pi_{i1}) - u(\pi_{i2})] = 2\sum_{\pi\in\Pi} P_i(\pi^* = \pi)\,\mathbf{x}_\pi^T\left(\mathbb{E}_i[\mathbf{r} \mid \pi^* = \pi] - \mathbb{E}_i[\mathbf{r}]\right), \ \text{and}$$
$$I_i(\pi^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)) \geq 2\sum_{\pi,\pi'\in\Pi} P_i(\pi^* = \pi)P_i(\pi^* = \pi')(\mathbf{x}_\pi - \mathbf{x}_{\pi'})^T\left\{\sum_{\pi''\in\Pi} P_i(\pi^* = \pi'')\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)^T\right\}(\mathbf{x}_\pi - \mathbf{x}_{\pi'}).$$

Proof. The first statement is proven as follows:

$$\mathbb{E}_i[2u(\pi^*) - u(\pi_{i1}) - u(\pi_{i2})] \overset{(a)}{=} 2\mathbb{E}_i[u(\pi^*) - u(\pi_{i1})]$$
$$= 2\sum_{\pi\in\Pi}\left(P_i(\pi^* = \pi)\mathbb{E}_i[u(\pi^*) \mid \pi^* = \pi] - P_i(\pi_{i1} = \pi)\mathbb{E}_i[u(\pi_{i1}) \mid \pi_{i1} = \pi]\right)$$
$$\overset{(b)}{=} 2\sum_{\pi\in\Pi} P_i(\pi^* = \pi)\left(\mathbb{E}_i[u(\pi) \mid \pi^* = \pi] - \mathbb{E}_i[u(\pi)]\right)$$
$$\overset{(c)}{=} 2\sum_{\pi\in\Pi} P_i(\pi^* = \pi)\left(\mathbb{E}_i[\mathbf{x}_\pi^T\mathbf{r} \mid \pi^* = \pi] - \mathbb{E}_i[\mathbf{x}_\pi^T\mathbf{r}]\right) \overset{(d)}{=} 2\sum_{\pi\in\Pi} P_i(\pi^* = \pi)\,\mathbf{x}_\pi^T\left(\mathbb{E}_i[\mathbf{r} \mid \pi^* = \pi] - \mathbb{E}_i[\mathbf{r}]\right),$$

where (a) holds because $\pi_{i1}$ and $\pi_{i2}$ are both sampled from the same posterior, and thus are identically-distributed; (b) holds because posterior sampling selects each policy according to its posterior probability of being optimal, so that $P_i(\pi^* = \pi) = P_i(\pi_{i1} = \pi)$; and (c) applies the definition of the utility under the linear link function. Finally, (d) holds because the transition dynamics are known.

To prove the second statement, denote by $\Upsilon \subset \mathbb{R}^d$ the set of all possible length-$h$ trajectory feature vectors. Analogously to the bandit case, it is notationally-convenient to define a function $y: \Upsilon \times \Upsilon \to \{-\frac{1}{2}, \frac{1}{2}\}$, where for each pair of trajectory feature vectors $\mathbf{x}, \mathbf{x}' \in \Upsilon$, $y(\mathbf{x}, \mathbf{x}')$ is the outcome of a preference query between $\mathbf{x}$ and $\mathbf{x}'$:
$$y(\mathbf{x},\mathbf{x}') = \begin{cases} \ \ \tfrac{1}{2} & \text{with probability } P(\mathbf{x}\succ\mathbf{x}'), \\ -\tfrac{1}{2} & \text{with probability } P(\mathbf{x}\prec\mathbf{x}'). \end{cases}$$
Using these definitions, one can see that:

$$I_i(\pi^*; (\mathbf{x}_{i2} - \mathbf{x}_{i1}, y_i)) \overset{(a)}{=} I_i(\pi^*; \mathbf{x}_{i2} - \mathbf{x}_{i1}) + I_i(\pi^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1}) \overset{(b)}{=} I_i(\pi^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1})$$
$$\overset{(c)}{=} \sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P_i(\mathbf{x}_{i1} = \mathbf{x}')P_i(\mathbf{x}_{i2} = \mathbf{x})\, I_i(\pi^*; y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1} = \mathbf{x} - \mathbf{x}')$$
$$= \sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P_i(\mathbf{x}_{i1} = \mathbf{x}')P_i(\mathbf{x}_{i2} = \mathbf{x})\, I_i(\pi^*; y(\mathbf{x},\mathbf{x}'))$$
$$= \sum_{\pi,\pi'\in\Pi} P_i(\pi_{i1} = \pi')P_i(\pi_{i2} = \pi)\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P_i(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P_i(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\, I_i(\pi^*; y(\mathbf{x},\mathbf{x}'))$$
$$\overset{(d)}{=} \sum_{\pi,\pi'\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\, I_i(\pi^*; y(\mathbf{x},\mathbf{x}'))$$
$$\overset{(e)}{=} \sum_{\pi,\pi'\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\sum_{\pi''\in\Pi} P_i(\pi^* = \pi'')\, D_{KL}\!\left(P_i(y(\mathbf{x},\mathbf{x}')\mid\pi^*=\pi'')\,\big\|\,P_i(y(\mathbf{x},\mathbf{x}'))\right)$$
$$\overset{(f)}{\geq} 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\left(\mathbb{E}_i[y(\mathbf{x},\mathbf{x}')\mid\pi^*=\pi''] - \mathbb{E}_i[y(\mathbf{x},\mathbf{x}')]\right)^2$$
$$\overset{(g)}{=} 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\left(\mathbb{E}_i[(\mathbf{x}-\mathbf{x}')^T\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[(\mathbf{x}-\mathbf{x}')^T\mathbf{r}]\right)^2$$
$$\overset{(h)}{=} 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\sum_{\mathbf{x},\mathbf{x}'\in\Upsilon} P(\mathbf{x}_{i1} = \mathbf{x}'\mid\pi_{i1}=\pi')P(\mathbf{x}_{i2} = \mathbf{x}\mid\pi_{i2}=\pi)\left((\mathbf{x}-\mathbf{x}')^T\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)\right)^2$$
$$= 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\,\mathbb{E}_{\mathbf{x}\sim\pi,\,\mathbf{x}'\sim\pi'}\!\left[\left((\mathbf{x}-\mathbf{x}')^T\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)\right)^2\right]$$
$$\overset{(i)}{\geq} 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\left(\mathbb{E}_{\mathbf{x}\sim\pi,\,\mathbf{x}'\sim\pi'}\!\left[(\mathbf{x}-\mathbf{x}')^T\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)\right]\right)^2$$
$$= 2\sum_{\pi,\pi',\pi''\in\Pi} P_i(\pi^* = \pi')P_i(\pi^* = \pi)P_i(\pi^* = \pi'')\left((\mathbf{x}_\pi - \mathbf{x}_{\pi'})^T\left(\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi''] - \mathbb{E}_i[\mathbf{r}]\right)\right)^2,$$
where (a) applies the chain rule for mutual information (Fact 5), and (b) uses that $I_i(\pi^*; \mathbf{x}_{i2} - \mathbf{x}_{i1}) = 0$, which holds because $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ are determined by independently sampling two policies from the posterior of $\pi^*$ and rolling them out (via known transition dynamics) to obtain two trajectories. Then, (c) holds by definition of conditional mutual information, and (d) holds because posterior sampling selects each policy according to its posterior probability of optimality, so that $P_i(\pi^* = \pi) = P_i(\pi_{i1} = \pi)$ for each $\pi$, and further uses that the probabilistic mapping from policies to trajectories is known. Next, (e) applies Fact 6 from Section 2.3 (which is Fact 6 in Russo and Van Roy, 2016), and step (f) applies Fact 7 from Section 2.3 (Fact 9 in Russo and Van Roy, 2016) to lower-bound the Kullback-Leibler divergence in terms of utilities. Finally, (g) applies the definition of the linear link function, (h) utilizes that the transition dynamics are known, and (i) holds via Jensen's inequality.

It is straightforward to rearrange the final expression into the form given in the statement of the lemma.

Finally, Algorithm 12 details the procedure for estimating the information ratio for RL with known dynamics. It is analogous to the procedure for the linear dueling bandit setting, given by Algorithm 11.

Algorithm 12: Empirically estimating the information ratio in RL with known dynamics
1: Input: $\Pi$ = policy space, $\nu$ = posterior distribution over $\mathbf{r}$, $M$ = number of samples to draw from the posterior, $\{\mathbf{x}_\pi \mid \pi \in \Pi\}$ = set of occupancy vectors for each policy in $\Pi$
2: $\mathbf{r}_1, \ldots, \mathbf{r}_M \sim \nu$  ⊲ Draw samples from utility posterior
3: $\hat{\mu} \leftarrow \frac{1}{M}\sum_m \mathbf{r}_m$  ⊲ Estimate $\mathbb{E}_i[\mathbf{r}]$
4: $\hat{\Theta}_{\pi} \leftarrow \{m : \mathbf{x}_\pi^T\mathbf{r}_m = \max_{\pi'\in\Pi}(\mathbf{x}_{\pi'}^T\mathbf{r}_m)\}\ \forall\,\pi\in\Pi$  ⊲ Samples $\mathbf{r}_m$ for which $\pi$ is optimal
5: $p^*(\pi) \leftarrow |\hat{\Theta}_{\pi}|/M\ \forall\,\pi\in\Pi$  ⊲ Empirical posterior probability of policies being optimal
6: $\hat{\mu}^{(\pi)} \leftarrow \frac{1}{|\hat{\Theta}_{\pi}|}\sum_{m\in\hat{\Theta}_{\pi}}\mathbf{r}_m\ \forall\,\pi\in\Pi$  ⊲ Estimate $\mathbb{E}_i[\mathbf{r}\mid\pi^*=\pi]$
7: $\hat{L} \leftarrow \sum_{\pi\in\Pi} p^*(\pi)(\hat{\mu}^{(\pi)}-\hat{\mu})(\hat{\mu}^{(\pi)}-\hat{\mu})^T$
8: $r^* \leftarrow \sum_{\pi\in\Pi} p^*(\pi)\,\mathbf{x}_\pi^T\hat{\mu}^{(\pi)}$  ⊲ Estimate posterior reward of optimal policy
9: $\Delta_{\pi_1,\pi_2} \leftarrow 2r^* - (\mathbf{x}_{\pi_1}+\mathbf{x}_{\pi_2})^T\hat{\mu}\ \forall\,\pi_1,\pi_2\in\Pi$
10: $\Delta \leftarrow \sum_{\pi_1,\pi_2\in\Pi} p^*(\pi_1)p^*(\pi_2)\Delta_{\pi_1,\pi_2}$  ⊲ Expected instantaneous regret
11: $v_{\pi_1,\pi_2} \leftarrow (\mathbf{x}_{\pi_2}-\mathbf{x}_{\pi_1})^T\hat{L}(\mathbf{x}_{\pi_2}-\mathbf{x}_{\pi_1})\ \forall\,\pi_1,\pi_2\in\Pi$
12: $v \leftarrow \sum_{\pi_1,\pi_2\in\Pi} p^*(\pi_1)p^*(\pi_2)v_{\pi_1,\pi_2}$  ⊲ Information gain lower bound
13: $\hat{\Gamma} \leftarrow \Delta^2/(2v)$  ⊲ Estimated information ratio

Empirically Upper-Bounding the Information Ratio

This subsection presents a set of simulations which empirically estimate the information ratio for DPS with a linear link function. Recall that under the linear link function, the utility of a bandit action or RL trajectory $\mathbf{x}$ is given by $u(\mathbf{x}) = \frac{\mathbf{r}^T\mathbf{x}}{c}$. All of the simulation results support the following conjecture:

Conjecture 1. Consider running DPS in the linear dueling bandit setting and in preference-based RL with known dynamics. With the linear link function, the information ratio in Eq. (4.16) is upper-bounded as follows during all iterations $i \in \{1, \ldots, N\}$:
$$\Gamma_i \leq d,$$
where in the bandit setting, $d$ is the dimensionality of the action space, whereas $d = SA$ in the preference-based RL setting.

This conjecture leads to the following bounds on the Bayesian regret of DPS:

Theorem 3. Combining Conjecture 1 with Theorem 2 yields the following upper bound on the Bayesian regret of DPS in the linear dueling bandit setting as a function of the number of learning iterations, $N$:
$$\mathrm{BayesReg}(N) \leq \sqrt{d N H(a^*)} \leq \sqrt{d N \log A}.$$
Furthermore, combining Conjecture 1 with Corollary 1 results in the following Bayesian regret bound for DPS in the preference-based RL setting with known transition dynamics, as a function of the number of learning iterations $N$:
$$\mathrm{BayesReg}(N) \leq S\sqrt{A N h \log A}.$$
Using that $T = 2Nh$, the Bayesian regret in terms of the total number of actions (or time-steps) $T$ is:
$$\mathrm{BayesReg}_{\mathrm{RL}}(T) \leq S\sqrt{\frac{A T \log A}{2}},$$
where $\mathrm{BayesReg}_{\mathrm{RL}}(T)$ is expressed in terms of the total number of time-steps, while $\mathrm{BayesReg}(N)$ is in terms of the number of learning iterations.

The remainder of this subsection presents empirical evidence for Conjecture 1. The simulations span four cases, each of which will be further detailed below:

A) Linear bandit setting with relative feedback and a Gaussian prior, likelihood, and posterior;

B) Linear preference-based bandit setting with posterior approximation via MCMC;

C) RL with known transition dynamics and relative feedback, with a Gaussian prior, likelihood, and posterior; and,

D) Preference-based RL with known transition dynamics and posterior approxi- mation via MCMC.

Before presenting the results corresponding to each of these four cases, this section describes the posterior inference techniques utilized in the information ratio simulations, and details the procedure for constructing the action space $\mathcal{A}$ in the bandit simulations. Furthermore, all of the simulations use 10,000 posterior samples in each iteration to estimate the information ratio (i.e., $M = 10{,}000$ in Algorithm 11).

Among the four cases, two classes of model posterior are considered: a conjugate Gaussian posterior in A) and C), and MCMC in B) and D). As explained in Remark 6, the information ratio-based regret analysis techniques assume that the posterior is the exact product of a prior and likelihood. As discussed next, both the conjugate Gaussian and MCMC posterior inference approaches indeed satisfy this requirement.

Conjugate Gaussian posterior inference. In the Gaussian posterior inference approach (A and C), the information ratio is estimated under relative feedback when the prior, likelihood, and posterior are all Gaussian. Firstly, the utility vector is assumed to be drawn from a Gaussian prior:
$$\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} I), \qquad (4.17)$$
for some hyperparameter $\lambda > 0$. The likelihood of the outcome $y_i$ given $\mathbf{x}_{i1}$ and $\mathbf{x}_{i2}$ is equal to its expectation with i.i.d., additive Gaussian noise:
$$P(y_i \mid \mathbf{x}_{i2} - \mathbf{x}_{i1}, \mathbf{r}) = \mathbf{r}^T(\mathbf{x}_{i2} - \mathbf{x}_{i1}) + \varepsilon_i, \quad \text{where } \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad (4.18)$$
where $\sigma$ is a second hyperparameter, and $c$ is subsumed into $\mathbf{r}$ without loss of generality.

This Gaussian prior and likelihood yield a conjugate Gaussian posterior which can be computed in closed-form, and is given in Eq.s (4.1)-(4.2).
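For concreteness, the conjugate update implied by the prior in Eq. (4.17) and the likelihood in Eq. (4.18) is the standard Bayesian linear regression posterior over $\mathbf{r}$, sketched below in NumPy. The exact scaling conventions of Eq.s (4.1)-(4.2) may differ slightly; this sketch only illustrates the standard form, and the names are illustrative.

import numpy as np

def gaussian_posterior(X_diff, y, lam, sigma):
    """Conjugate Gaussian posterior over r from relative Gaussian feedback.

    X_diff : (n, d) array whose i-th row is x_{i2} - x_{i1}.
    y      : (n,) array of noisy relative utilities.
    lam    : prior precision parameter lambda (prior r ~ N(0, lambda^{-1} I)).
    sigma  : feedback noise standard deviation.
    """
    d = X_diff.shape[1]
    precision = lam * np.eye(d) + (X_diff.T @ X_diff) / sigma**2
    cov = np.linalg.inv(precision)
    mean = cov @ (X_diff.T @ y) / sigma**2
    return mean, cov

# Posterior samples for DPS and Algorithm 11 can then be drawn via, e.g.,
# np.random.multivariate_normal(mean, cov, size=10_000).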

Clearly, under the likelihood in Eq. (4.18), the feedback $y_i$ is not binary. Similarly to pairwise preferences, however, the feedback is determined with respect to the difference of the two actions or trajectory feature vectors under comparison. Thus, the feedback is relative to the selected pairs of actions or policies.

MCMC posterior inference. Secondly, in cases B) and D), the algorithm receives binary preference feedback. Since with binary feedback the exact posterior is no longer Gaussian, this set of simulations performs MCMC posterior approximation to both estimate the information ratio and select actions. Under the linear link function model, the likelihood of each preference outcome $y_i$ is given by:
$$P(y_i \mid \mathbf{x}_i, \mathbf{r}) = \frac{2 y_i \mathbf{r}^T\mathbf{x}_i}{c} + \frac{1}{2}, \qquad (4.19)$$
where the parameter $c > 0$ must be large enough to guarantee that $\frac{2 y_i \mathbf{r}^T\mathbf{x}_i}{c} \in (-0.5, 0.5)$ for any $\mathbf{x}_i$ and $\mathbf{r}$ supported by the prior $p(\mathbf{r})$; this ensures that the likelihood probabilities lie within $(0, 1)$. Then, in the $n$th learning iteration, the exact posterior takes the form:
$$p(\mathbf{r}\mid\mathcal{D}_n) \propto p(\mathbf{r})\prod_{i=1}^{n-1} P(y_i\mid\mathbf{x}_i,\mathbf{r}) = p(\mathbf{r})\prod_{i=1}^{n-1}\left(\frac{2 y_i\mathbf{r}^T\mathbf{x}_i}{c} + \frac{1}{2}\right). \qquad (4.20)$$
Importantly, the utility prior $p(\mathbf{r})$ cannot be Gaussian as in A) and C), since the linear link function's posterior in Eq. (4.20) requires that $\frac{2 y_i\mathbf{r}^T\mathbf{x}_i}{c} \in (-0.5, 0.5)$. Equivalently, we must have $\frac{|\mathbf{r}^T\mathbf{x}_i|}{c} < \frac{1}{2}$ for any vector $\mathbf{r}$ that is supported by the prior over $\mathbf{r}$. Thus, the prior for $\mathbf{r}$ must have bounded support. In this set of simulations,

a truncated Gaussian prior is placed over the utilities to enforce $\|\mathbf{r}\|_2 \leq 1$:
$$p(\mathbf{r}) \propto \mathcal{N}(\mathbf{0}, \lambda^{-1}I)\,\mathbb{I}_{[\|\mathbf{r}\|_2\leq 1]}, \qquad (4.21)$$
for some parameter $\lambda > 0$.

In practice, one can easily sample from the truncated Gaussian distribution by drawing samples from the Gaussian distribution $\mathcal{N}(\mathbf{0}, \lambda^{-1}I)$ and using rejection sampling to discard any sample $\mathbf{r}$ for which $\|\mathbf{r}\|_2 > 1$.

All MCMC simulations utilize the emcee Python package (Foreman-Mackey et al., 2013) to perform the MCMC sampling. Each instance of MCMC uses 32 walkers and an initial point sampled from $\mathcal{N}(\hat{\mathbf{r}}, 10^{-8}I)$ for each walker, where $\hat{\mathbf{r}}$ is equal to the prior mean ($\mathbf{0}$) in the first learning iteration, while in subsequent iterations, it is the mean of the previous iteration's MCMC samples.
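A minimal sketch of how such an MCMC approximation might be set up with emcee is shown below, combining the truncated Gaussian prior of Eq. (4.21), rejection of samples outside the unit ball, and the linear link likelihood of Eq. (4.19). The log-posterior implementation and the $10^{-4}$ standard deviation of the initial walker perturbation (matching a $\mathcal{N}(\hat{\mathbf{r}}, 10^{-8}I)$ initialization) are illustrative assumptions rather than the thesis's exact code.

import numpy as np
import emcee

def make_log_posterior(X_diff, y, lam, c):
    """Unnormalized log of the exact posterior in Eq. (4.20)."""
    def log_post(r):
        if np.linalg.norm(r) > 1.0:                    # truncated Gaussian prior, Eq. (4.21)
            return -np.inf
        log_prior = -0.5 * lam * r @ r
        probs = 2.0 * y * (X_diff @ r) / c + 0.5       # linear link likelihood, Eq. (4.19)
        if np.any(probs <= 0.0) or np.any(probs >= 1.0):
            return -np.inf
        return log_prior + np.sum(np.log(probs))
    return log_post

def sample_posterior(X_diff, y, lam, c, r_init, n_samples=10_000, n_warmup=500, n_walkers=32):
    """Draw approximate posterior samples of r with the emcee ensemble sampler."""
    d = len(r_init)
    p0 = r_init + 1e-4 * np.random.randn(n_walkers, d)   # small ball around the initial point
    sampler = emcee.EnsembleSampler(n_walkers, d, make_log_posterior(X_diff, y, lam, c))
    sampler.run_mcmc(p0, n_warmup + n_samples // n_walkers)
    return sampler.get_chain(discard=n_warmup, flat=True)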

Finally, as the simulations below will demonstrate, MCMC can estimate the model posterior with significantly-higher fidelity than the Laplace approximation.

Action space construction for information ratio simulations in the bandit setting. In the linear dueling bandit simulations, the action space consists of $A = |\mathcal{A}|$ vectors sampled from the surface of the $d$-dimensional unit sphere, given a specified number of actions $A$. Rather than sampling the actions uniformly from the sphere's surface, the actions are sampled via Poisson disk sampling (Bridson, 2007), which samples the actions sequentially and rejects candidate actions that are too close to existing samples; this ensures that the samples are spread evenly over the sphere's surface, rather than clustered together (see Figure 4.1). In particular, this work utilizes Mitchell's approximation to the Poisson disk sampling distribution (Dunbar and Humphreys, 2006), due to its simplicity of implementation. In this algorithm, points are sampled sequentially; at each step, some number of candidate points are uniformly sampled, and the one which is furthest away from any existing point is added to the set of sampled points. For large-enough numbers of candidate samples, this method is known to closely approximate the Poisson disk sampling distribution (Dunbar and Humphreys, 2006). All of the simulations sample 1,000 candidate points at each step to generate the action space via Mitchell's approximation.

Figure 4.1: Comparison of Poisson disk sampling (a) and uniform random sampling (b) over the surface of the 3-dimensional unit sphere. Both plots show 100 samples. While the uniformly random samples often cluster together, the Poisson disk samples are more uniformly spaced over the sphere's surface.
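A minimal sketch of Mitchell's approximation on the unit sphere, as described above, is given below; the particular random number generator calls and defaults are illustrative assumptions.

import numpy as np

def sample_sphere(n, d, rng):
    """Uniform samples on the surface of the d-dimensional unit sphere."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def poisson_disk_actions(num_actions, d, num_candidates=1000, seed=0):
    """Mitchell's approximation to Poisson disk sampling on the unit sphere.

    At each step, num_candidates uniform candidates are drawn and the one
    farthest from the existing action set is kept.
    """
    rng = np.random.default_rng(seed)
    actions = sample_sphere(1, d, rng)          # first action: uniform
    while actions.shape[0] < num_actions:
        candidates = sample_sphere(num_candidates, d, rng)
        # distance from each candidate to its nearest existing action
        dists = np.linalg.norm(candidates[:, None, :] - actions[None, :, :], axis=2).min(axis=1)
        actions = np.vstack([actions, candidates[np.argmax(dists)]])
    return actions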

A) Linear bandit setting with posterior inference via relative Gaussian feedback. In each simulation run, the action space is independently sampled using the Poisson disk method described above. Each simulation run also independently samples the utility vector $\mathbf{r}$ from the prior in Eq. (4.17): $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1}I)$. In each simulation, the information ratio is estimated via Algorithm 11 in every learning iteration. Samples are drawn from the Gaussian posterior (described above) to estimate $\Gamma_i$ and to select actions in DPS.

This set of simulations includes 20 repetitions of each of the following parameter combinations: $\lambda \in \{0.1, 1\}$, $\sigma \in \{0.1, 1\}$, $d \in \{3, 5, 10\}$, and $A \in \{3, 10, 100, 1000\}$. These parameter combinations result in a total of 960 simulations with 10,000 learning iterations each. In each case, the same values of $(\lambda, \sigma)$ are used to generate the feedback (i.e., the outcomes $y_i$) and as the hyperparameters that DPS uses to learn the utilities; thus, the algorithm knows the true environment prior and likelihood.

Figure 4.2 displays the cumulative regret and estimated information ratio values for three example simulation runs. One can see that the values of $\Gamma_i$ decrease as learning progresses.

Figure 4.3 plots the maximum information ratio estimate from each simulation; more specifically, for each of the 960 simulations, the information ratio value is calculated at each of the 10,000 learning iterations, and the maximum of these values is added to the plot. In all cases, the resulting information ratio estimates are strictly less than

Figure 4.2: Cumulative regret (a) and estimated information ratio values (b) in the linear bandit setting with relative Gaussian feedback over pairs of actions. Values are plotted over the entire learning process for three representative experimental repetitions (colors are identical for corresponding experiment runs between the two plots). These three experiments use $d = 3$, $\lambda = 1$, $\sigma = 1$, and $A = 100$.

$d$, though they sometimes come extremely close. One can see from the figure that the $\Gamma_i$ estimates tend to be larger for larger action space sizes $A$ and noise parameters $\sigma$, but are equivalent for the two tested values of $\lambda$. Intuitively, the learning problem is harder for larger $\sigma$ and $A$, and information ratios tend to be larger under higher outcome uncertainty, as seen in Figure 4.2.

Finally, when estimating the information ratio with a sufficiently-concentrated utility posterior $p(\mathbf{r} \mid \mathcal{D}_i)$, it is possible that all 10,000 samples from this posterior—which are used to estimate $\Gamma_i$—induce the same optimal action. As a result, the probabilities $p^*(\mathbf{x})$ (Line 5 of Algorithm 11), which empirically estimate each action's probability of being optimal, will equal one for the predicted optimal action and zero for all other actions. In this situation, Algorithm 11 is guaranteed to estimate that the information gain is zero, and therefore to return a $\Gamma_i$ estimate of either $\infty$ or NaN ("not a number"). This phenomenon is guaranteed to occur after sufficiently many iterations, since the posterior becomes increasingly concentrated over time, but it only happens in a minority of steps for the simulations in this thesis. In these and subsequent information ratio simulations (cases A, B, C, and D), any such learning iterations are therefore excluded when identifying the maximum $\Gamma_i$ estimate in the simulation run (e.g., the values shown in Figure 4.3) and when conjecturing an upper bound to $\Gamma_i$.

B) Linear dueling bandits with posterior inference via MCMC. Similarly to case A),

Figure 4.3: Scatter plots of the maximum estimated information ratio value in each of the 960 relative Gaussian bandit simulations (each dot corresponds to one of these simulations). All 960 trials are shown in each of the three plots, (a)-(c). Values are color-coded according to: (a) the action space size $A$, (b) $\sigma$, and (c) $\lambda$.

each simulation run independently samples an action space and a utility vector $\mathbf{r}$. In this case, however, $\mathbf{r}$ is sampled from the truncated Gaussian prior given in Eq. (4.21). The hyperparameter $c$ in the posterior, given by Eq. (4.20), is set to 4 to ensure that $\frac{2 y_i \mathbf{r}^T\mathbf{x}_i}{c} \in (-0.5, 0.5)$ as needed. In each case, the same values of $(\lambda, c)$ are used to generate the preference feedback and as the hyperparameters that DPS uses for utility inference (i.e., the algorithm knows the true environment prior and likelihood).

MCMC sampling has a "burn-in" or "warm-up" phase, in which the sampler converges to accurately drawing samples from the model posterior. To ensure that sufficiently many warm-up steps are used to estimate $\Gamma_i$, the estimates are first compared across different numbers of warm-up steps, as shown in Figure 4.4. Each color on the plot represents an experimental run, for five such runs. Within each of the five runs, DPS is executed with $\lambda = 1$, $d = 10$, and $A = 10$; since none of the information ratio simulations in either the bandit or RL settings exceed $d = 10$, these simulations utilize the highest dimensionality considered in this work. In each DPS learning iteration, the information ratio is estimated independently six times, each with a different number of warm-up samples: {0, 5, 10, 50, 100, 500}. As seen in the plot, the identically-colored lines from the same run mostly overlap. Thus, the number of warm-up steps does not appear to strongly influence $\Gamma_i$, perhaps because a large-enough number of posterior samples ($M$ = 10,000) is used to estimate each information ratio.

Figure 4.4: Estimating the information ratio with different numbers of MCMC warm-up steps. In each of the five DPS runs with 100 learning iterations each, the information ratio is independently estimated six times per learning iteration, once with each of the following numbers of warm-up samples: {0, 5, 10, 50, 100, 500}. On the plot, each color corresponds to one of the five simulation runs (i.e., for each run, the six different numbers of warm-up samples are plotted in the same color). Clearly, as the similarly-colored lines mostly overlap, the number of warm-up samples does not heavily impact the information ratio estimates.

In the remainder of this section (which also includes case D), all MCMC sampling uses 500 warm-up steps in the first 50 learning iterations, and 250 warm-up steps subsequently; after the first 50 steps, it is assumed that less warm-up is needed, as the MCMC sampler will by then use an improved initial point.

Notably, the Laplace approximation to the posterior is less computationally-intensive than MCMC; however, this work uses MCMC for utility posterior approximation to estimate $\Gamma_i$ values because it is more accurate than the Laplace approximation. To illustrate, Figures 4.5-4.6 depict several examples of the differences between these two posterior approximation methods.

Firstly, Figure 4.5 compares posterior samples from the Laplace approximation and MCMC in two different cases. In these examples, the environment has dimensionality $d = 3$ and the learning algorithm uses $\lambda = 1$. Rather than sampling utility vectors from the prior, two different choices of $\mathbf{r}$ are hard-coded: one that is close to the utility prior $p(\mathbf{r})$'s support boundary (recall that $p(\mathbf{r})$ is only supported on $\|\mathbf{r}\|_2 \leq 1$), and one that lies further from this boundary. In each of the two cases, a dataset of actions and preference outcomes is randomly-generated: actions are sampled uniformly-at-random from the surface of the unit sphere to obtain 10,000 pairs of points, while pairwise preferences are generated via the linear link function likelihood in Eq. (4.19). Within each comparison between the Laplace approximation and MCMC, the two posterior approximations are trained over the same data. As seen in the figure, the MCMC and Laplace posteriors are similar when $\mathbf{r}$ is far from the support boundary, since in this case, the true posterior is concentrated about $\mathbf{r}$ in a Gaussian-like manner. When $\mathbf{r}$ is closer to the support boundary, however, the MCMC posterior reflects the $\|\mathbf{r}\|_2 \leq 1$ constraint, while the Gaussian Laplace approximation fails to capture this facet of the posterior.

Furthermore, with small numbers of preferences, the posterior is typically highly non-Gaussian. In these cases, the Laplace approximation is expected to be inaccurate. To illustrate, Figure 4.6 displays the difference between the Laplace and MCMC posteriors when running DPS after 10 and 100 preferences, respectively.

Now that the use of MCMC has been justified, a set of simulations is performed to estimate $\Gamma_i$ values for DPS in the linear dueling bandit setting. For each of the following parameter combinations, 20 simulation runs of DPS are executed, each with 1,000 learning iterations: $\lambda = 1$, $d \in \{3, 5, 10\}$, and $A \in \{10, 100\}$. This results in a total of 120 simulation runs.

Figure 4.7 displays the cumulative regret and estimated information ratio values for

three example simulation runs. As before, the values of $\Gamma_i$ tend to be highest at the start of learning.

Lastly, for each of the 120 simulations, the maximum information ratio estimate is identified: in each case, this maximum is taken over the 1,000 learning iterations.

Figure 4.5: Corner plots comparing posterior samples from the Laplace approximation and MCMC. The prior over utility vectors $\mathbf{r} = [r_1, r_2, r_3]^T$ is only supported on $\|\mathbf{r}\|_2 \leq 1$. (a)-(b) Case 1 (Laplace approximation and MCMC): $\mathbf{r} = [0.7, 0.5, 0.2]^T$. The true posterior is close to Gaussian, so the two approximations appear similar. (c)-(d) Case 2 (Laplace approximation and MCMC): $\mathbf{r} = [0.8, 0.5, 0.1]^T$, which is closer to the boundary of allowable utility vectors, $\|\mathbf{r}\|_2 = 1$. Because the true posterior is highly non-Gaussian, the two approximations are visibly different (compare $r_1$ versus $r_2$ in the plots). Importantly, the MCMC posterior respects the constraint $\|\mathbf{r}\|_2 \leq 1$, while the Laplace approximation does not.

Figure 4.8 displays these numbers in a scatter plot. In every case, the maximum information ratio is strictly less than $d$ (though in some cases, it comes close).

C) Preference-based RL with posterior inference via relative Gaussian feedback. In

Figure 4.6: Corner plots comparing posterior samples from the Laplace approximation and MCMC under small numbers of samples: (a)-(b) after 10 preferences, and (c)-(d) after 100 preferences. The Laplace and MCMC posteriors are visibly different, as the utility posterior is highly non-Gaussian. The preference data was generated by running DPS with $d = 3$, $c = 4$, $\lambda = 1$, and $A = 10$.

the RL simulations, MDP parameters are generated randomly and independently for each simulation run. Firstly, the utilities $\mathbf{r} \in \mathbb{R}^d$, where $d = SA$, are generated from a Gaussian distribution: $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1}I)$. Secondly, the MDP's state transition probabilities are sampled independently for each state-action pair from a Dirichlet distribution of order $S$, with all $S$ parameters equal to 1. This Dirichlet prior is in fact a uniform distribution over all possible sets of $S$ probabilities, $p_1, \ldots, p_S \in [0, 1]$,

Figure 4.7: Cumulative regret (a) and estimated information ratio values (b) in the preference-based linear bandit setting with posterior modeling via MCMC. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for corresponding experimental runs between the two plots). The experiments in this plot use $d = 10$, $\lambda = 1$, and $A = 10$.

Figure 4.8: Scatter plot of the maximum information ratio estimates when running DPS in the linear dueling bandit setting with MCMC posterior inference. Each dot corresponds to one of 120 simulation runs, and values are color-coded according to the action space size, $A$.

such that $\sum_{j=1}^{S} p_j = 1$. Importantly, in this section, the MDP's transition probabilities are revealed to the algorithm, so that it must only infer the utilities. For each pair of trajectories, relative feedback is generated via the Gaussian likelihood in Eq. (4.18) just as in the bandit setting, for some parameter $\sigma$. For each of the following parameter combinations, 20 DPS simulation runs were performed, with the information ratio estimated in each step: $\lambda \in \{0.1, 1\}$, $\sigma \in \{0.1, 1, 10\}$, and the five settings of the MDP parameters $(S, A, h)$ listed in Table 4.1. This results in a total of 600 simulations with 1,000 learning iterations each. In each case, the same values of $(\lambda, \sigma)$ are used to generate the feedback and as hyperparameters that DPS uses to learn the utilities (i.e., the algorithm knows the true environment prior and likelihood). Note that larger MDP sizes than those in Table 4.1 are not considered because calculating all the sums and differences of the policies' occupancy vectors (see Algorithm 11) requires considering $\binom{|\Pi|}{2}$ pairs of distinct policies. Storing these quantities in advance of online learning creates storage issues for larger MDPs (on a machine with 32 GB of RAM).

MDP label   S   A   h   A^{Sh} = |Π|   C(|Π|, 2)
0           2   3   4   6,561          21,520,080
1           3   2   4   4,096          8,386,560
2           4   2   3   4,096          8,386,560
3           2   4   3   4,096          8,386,560
4           3   3   2   729            265,356

Table 4.1: The labels 0-4 are assigned to the five MDP structures used in the RL information ratio simulations. For each of the five MDP structures, this table specifies the number of states $S$, the number of actions $A$, and the episode horizon $h$. For each $(S, A, h)$ triple, the total numbers of deterministic policies, $|\Pi| = A^{Sh}$, and of distinct policy pairs, $\binom{|\Pi|}{2}$ (on which information ratio computations depend), are also displayed.
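The policy and policy-pair counts in Table 4.1 can be reproduced with a few lines of Python; this is a simple sanity check rather than code from the thesis.

from math import comb

# (S, A, h) for MDP structures 0-4 in Table 4.1
mdp_structures = [(2, 3, 4), (3, 2, 4), (4, 2, 3), (2, 4, 3), (3, 3, 2)]

for label, (S, A, h) in enumerate(mdp_structures):
    num_policies = A ** (S * h)          # deterministic, time-varying policies
    num_pairs = comb(num_policies, 2)    # distinct policy pairs
    print(label, S, A, h, num_policies, num_pairs)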

Figure 4.9 displays the cumulative regret and estimated information ratio values for three example simulation runs. Figure 4.10 depicts the maximum information ratio values for each simulation. One can see that higher values of $\sigma$ tend to correspond to larger information ratios. Furthermore, as seen in the figure, all of the information ratio values are less than $d = SA$ by a significant margin.

D) Preference-based RL with posterior inference via MCMC. For each simulation run, the MDP parameters are generated independently. The MDPs' transition dynamics are sampled from Dirichlet distributions as in the Gaussian RL case (C), and are similarly revealed to the algorithm. Meanwhile, as in the MCMC bandit case (B), the utilities $\mathbf{r}$ are sampled from a truncated Gaussian prior to ensure that they have bounded support: $\|\mathbf{r}\|_2 \leq 1$.

The following set of simulations is performed to estimate the information ratio, each with 1,000 learning iterations. For each of the MDP structures 1-4 in Table

Figure 4.9: Cumulative regret (a) and information ratio estimates (b) in the RL setting with relative feedback, known dynamics, and posterior inference via relative Gaussian feedback. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for each experimental run between the two plots). These experiments utilize $\lambda = 1$, $\sigma = 10$, and MDP structure 1 (see Table 4.1), for which $S = 3$, $A = 2$, and $h = 4$.

Figure 4.10: Scatter plots of the maximum information ratio estimates in each of the 600 relative Gaussian RL simulations (each dot corresponds to one of the 600 simulations). The x-axis labels indicate the MDP labels 0-4 (from Table 4.1) and the values of $d = SA$. All 600 trials are shown in both of the two plots. Values are color-coded according to: (a) $\sigma$ and (b) $\lambda$. In all cases, the information ratios are significantly less than $d$.

4.1, 20 repetitions are executed with $\lambda = 1$ and $c = 2\sqrt{2}h$, to obtain 80 simulations in total. This particular setting of $c$ is selected to ensure that $\frac{2 y_i \mathbf{r}^T\mathbf{x}_i}{c} \in (-0.5, 0.5)$, as required by the likelihood:
$$\left|\frac{2 y_i \mathbf{r}^T \mathbf{x}_i}{c}\right| = \frac{|\mathbf{r}^T \mathbf{x}_i|}{c} \overset{(a)}{\leq} \frac{1}{c}\|\mathbf{r}\|_2 \|\mathbf{x}_i\|_2 \overset{(b)}{\leq} \frac{1}{c}\|\mathbf{x}_i\|_2 = \frac{1}{c}\|\mathbf{x}_{i2} - \mathbf{x}_{i1}\|_2$$
$$= \frac{1}{c}\sqrt{\sum_{j=1}^{d}\left([\mathbf{x}_{i2}]_j - [\mathbf{x}_{i1}]_j\right)^2} = \frac{1}{c}\sqrt{\sum_{j=1}^{d}\left([\mathbf{x}_{i2}]_j^2 + [\mathbf{x}_{i1}]_j^2 - 2[\mathbf{x}_{i1}]_j[\mathbf{x}_{i2}]_j\right)}$$
$$\overset{(c)}{\leq} \frac{1}{c}\sqrt{\sum_{j=1}^{d}\left([\mathbf{x}_{i2}]_j^2 + [\mathbf{x}_{i1}]_j^2\right)} = \frac{1}{c}\sqrt{\|\mathbf{x}_{i2}\|_2^2 + \|\mathbf{x}_{i1}\|_2^2} \overset{(d)}{\leq} \frac{1}{c}\sqrt{\|\mathbf{x}_{i2}\|_1^2 + \|\mathbf{x}_{i1}\|_1^2} \overset{(e)}{=} \frac{1}{c}\sqrt{h^2 + h^2} = \frac{\sqrt{2}\,h}{c},$$
where (a) follows from the Cauchy-Schwarz inequality and (b) applies the prior constraint that $\|\mathbf{r}\|_2 \leq 1$. Next, (c) holds because in the RL setting, $\mathbf{x}_{i1}, \mathbf{x}_{i2}$ contain state-action visitation counts, and so all of their elements are non-negative; thus, $[\mathbf{x}_{i1}]_j[\mathbf{x}_{i2}]_j \geq 0$. Fourthly, (d) follows because $\|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1$ holds for any vector $\mathbf{x} \in \mathbb{R}^d$. Finally, (e) holds because the sum of a trajectory's state-action visitation counts must equal the episode horizon $h$. Thus, the required condition is satisfied by setting $\frac{\sqrt{2}\,h}{c} = \frac{1}{2}$, resulting in $c = 2\sqrt{2}\,h$.

Figure 4.11: Cumulative regret (panel a) and estimated information ratio (panel b) in the preference-based RL setting with known dynamics and posterior inference via MCMC. Values are plotted over the learning process for three representative experimental repetitions (colors are identical for corresponding simulation runs between the two plots). These experiments use λ = 1 and MDP structure 4 (see Table 4.1), for which S = 3, A = 3, and h = 2. The value of c is set to 2√2·h = 4√2.

In each case, the same values of (λ, c) are both used to generate the feedback and given to DPS as hyperparameters for learning the utilities; thus, the algorithm knows the true environment prior and likelihood.

Figure 4.11 displays the cumulative regret and estimated information ratio values for three example simulation runs. As before, information ratio values tend to decrease as learning progresses. Figure 4.12 depicts the maximum information ratio value observed in each of the 80 simulations, with the maximum taken over the 1,000 learning iterations. All of the information ratio values are significantly less than the dimensionality, d = SA.

Figure 4.12: Scatter plot of the maximum information ratio estimates for the preference-based RL setting with known dynamics and MCMC posterior inference. Each dot corresponds to one of the 80 simulation runs, with the maximum taken over the 1,000 learning iterations. The x-axis labels indicate the MDP structure labels (from Table 4.1) and the values of d = SA. In all cases, the information ratios are significantly less than d.

4.5 Empirical Performance of DPS

This section presents empirical results of executing DPS in both the preference-based generalized linear bandit and RL settings.

Generalized Linear Dueling Bandit Simulations

The empirical performance of DPS is first evaluated in the linear and logistic dueling bandit settings, in which utility inference is respectively performed via Bayesian linear regression and Bayesian logistic regression. The results demonstrate that DPS generally performs well, and is competitive against baseline algorithms.

Experimental setup. Recall that in the linear and logistic dueling bandit settings, DPS learns over an action space A ⊂ R^d of cardinality |A|. These experiments evaluate DPS with action spaces of size |A| = 1,000 and dimensionalities d ∈ {3, 5, 10}. For each dimension d considered, a set of 100 environments was randomly and independently sampled. Each sampled environment consists of an action space A ⊂ R^d and a utility vector r ∈ R^d. The action spaces were sampled from the surface of the d-dimensional unit sphere using the Poisson disk sampling method described in the previous section and illustrated in Figure 4.1. The utility vectors were sampled uniformly from the surface of the unit sphere.

Each simulation run consisted of 500 learning iterations, such that in each iteration, the algorithm selected two actions from A and received a pairwise preference between them. Preferences were generated in simulation by comparing the two actions' utilities, where the utility of an action x ∈ A is given by u(x) = r^⊤x; note that this utility information was hidden from the learning algorithm, which only observed the pairwise preferences. For each action space dimensionality d ∈ {3, 5, 10}, preference feedback was generated via two different models: a) a logistic model, in which for x, x′ ∈ A, P(x ≻ x′) = {1 + exp[−(u(x) − u(x′))/c]}⁻¹, and b) a linear model, P(x ≻ x′) = 0.5 + (u(x) − u(x′))/c. In both cases, the temperature c > 0 controls the degree of noisiness, and for the linear model, c is assumed to be large enough that P(x ≻ x′) ∈ [0, 1]. Note that in ties where u(x) = u(x′), preferences are generated uniformly-at-random.
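As a concrete illustration, the following minimal Python sketch shows how such simulated preference feedback might be generated from hidden utilities. The function name, the explicit 0.5 offset in the linear link, and the tie-breaking implementation are illustrative assumptions rather than the exact code used in the experiments.

```python
import numpy as np

def preference(x1, x2, r, c, model="logistic", rng=None):
    """Return 1 if x1 is (noisily) preferred to x2, else 0.

    x1, x2 : action feature vectors; r : hidden utility vector;
    c : temperature controlling the preference noise.
    """
    rng = rng if rng is not None else np.random.default_rng()
    du = r @ x1 - r @ x2               # utility difference u(x1) - u(x2)
    if du == 0:                        # ties are broken uniformly at random
        return int(rng.random() < 0.5)
    if model == "logistic":
        p = 1.0 / (1.0 + np.exp(-du / c))
    else:                              # linear link; c must keep p within [0, 1]
        p = 0.5 + du / c
    return int(rng.random() < p)       # sample the binary preference label
```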

Methods compared. With logistic preference feedback, the simulations consider three values of c for each dimension d: c ∈ {1, 0.1, 0.01}. With linear preference feedback, meanwhile, the simulations use c = 4. This latter value of c ensures that P(x ≻ x′) ∈ [0, 1] for any x, x′ ∈ A, but yields significantly-noisier preference feedback than the logistic feedback models considered.

For each combination of the three dimensionalities (d ∈ {3, 5, 10}) and four types of preference feedback (logistic with c ∈ {1, 0.1, 0.01} and linear with c = 4), the following learning algorithms are considered. Firstly, DPS is tested with utility inference via both Bayesian linear regression and Bayesian logistic regression. While Bayesian linear and logistic regression are both introduced in Section 4.3, implementation details of the exact Bayesian models considered in the experiments are given in Appendices A.1 and A.3, respectively.

Furthermore, two baseline algorithms are considered for each combination of d and type of user feedback. Both of these baselines utilize the Sparring framework introduced in Ailon, Karnin, and Joachims (2014), in which two standard bandit algorithms that expect absolute reward feedback compete against one another. In every learning iteration, each of the two bandit algorithms selects an action, and these two actions are dueled against one another. The algorithm whose action was preferred receives a reward of 1, while the algorithm whose action was dominated receives a reward of 0. Sparring is coupled with two upper confidence bound (UCB) algorithms: the linear UCB algorithm from Abbasi-Yadkori, Pál, and Szepesvári (2011) and the generalized linear bandits UCB algorithm from Filippi et al. (2010), specialized to the logistic link function (i.e., logistic UCB). To my knowledge, there is no prior work combining Sparring with Linear or Logistic UCB; however, this seems a natural confidence-based approach for the generalized linear dueling bandit problem, against which DPS can be compared.
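The Sparring reduction described above can be sketched as follows, with a generic bandit interface standing in for Linear or Logistic UCB. The `select`/`update` method names are assumptions made for illustration, not the interface of any particular library.

```python
def sparring_round(bandit_a, bandit_b, duel):
    """One Sparring iteration: two standard bandit learners each propose an
    action, the actions are dueled, and the winner's learner receives
    reward 1 while the loser's learner receives reward 0."""
    x_a = bandit_a.select()            # each learner picks an action
    x_b = bandit_b.select()
    a_wins = duel(x_a, x_b)            # pairwise preference query: 1 if x_a preferred
    bandit_a.update(x_a, reward=float(a_wins))
    bandit_b.update(x_b, reward=float(1 - a_wins))
```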

The implementation details for these two baseline algorithms are given in Appendix C. Notably, for both baselines, the algorithmic hyperparameters were optimized independently for each dimension d and user feedback model. In contrast, the linear DPS and logistic DPS simulations each use a single set of hyperparameters, which was found to be well-performing across all dimensions d and user feedback models considered. In this respect, the UCB simulations were given an advantage over DPS.

Results. Figure 4.13 depicts performance curves for DPS with linear and logistic utility inference over the different action space dimensions and user feedback mod- els considered. Appendix C also contains details about the hyperparameter ranges tested for the different methods and their performance. DPS yields robust perfor- mance under noisy preference feedback, and performs favorably compared to the two baselines considered. This may be because unlike DPS, Sparring involves two competing learning algorithms, which each only receive information about whether their actions “beat” the competing algorithms’ actions, but do not account for the competing algorithm’s choices of action. For all algorithms tested, performance tends to degrade as noise in the preference feedback increases.

Notably, with DPS, both the linear and logistic utility inference methods yield competitive performance when the preference feedback is generated via linear and logistic noise models. This suggests that the performance of DPS is not overly- sensitive to the model by which feedback is generated. Overall, these results imply that DPS yields promising performance in the generalized linear dueling bandit setting, and is robust under various types of preference feedback. 98

Figure 4.13: Empirical performance of DPS in the generalized linear dueling bandit setting (mean ± std over 100 runs). Panels (a)–(c) show d = 3, 5, 10 with logistic feedback and c = 0.01; panels (d)–(f) show c = 0.1; panels (g)–(i) show c = 1; and panels (j)–(l) show d = 3, 5, 10 with linear feedback and c = 4. The plots show DPS with utility inference via both Bayesian logistic and linear regression, and under both logistic (a–i) and linear (j–l) user feedback noise. The plots normalize the rewards for each simulation run such that the best and worst actions in A have rewards of 1 and 0, respectively. For both baselines, the hyperparameters were optimized independently in each of the 12 cases, while for linear and logistic DPS, the plots depict results for a single set of well-performing hyperparameters. DPS is competitive against both baselines and is robust to the utility inference method and to noise in the preference feedback.

Preference-Based Reinforcement Learning Simulations

The empirical performance of DPS for preference-based RL is validated in three simulated domains with varying degrees of preference noise and using three alternative credit assignment models. The results show that generally, DPS performs well and compares favorably against standard preference-based RL baselines. Python code for reproducing the experiments in this section is located online (Novoseller et al., 2020a).

Experimental setup. DPS is evaluated on three simulated environments: RiverSwim and random MDPs (described in Osband, Russo, and Van Roy, 2013) and the Mountain Car problem as detailed in Wirth (2017). The RiverSwim environment has six states and two actions (actions 0 and 1); the optimal policy always chooses action 1, which maximizes the probability of reaching a goal state-action pair. Meanwhile, a suboptimal policy—yielding a small reward compared to the goal—is quickly and easily discovered and incentivizes the agent to always select action 0. The algorithm must demonstrate sufficient exploration to have hope of discovering the optimal policy quickly.

The second environment consists of randomly-generated MDPs with 10 states and 5 actions. The transition dynamics and rewards are, respectively, generated from Dirichlet (all parameters set to 0.1) and exponential (rate parameter = 5) distri- butions. These distribution parameters were chosen to generate MDPs with sparse dynamics and rewards. For each random MDP, the sampled reward values were shifted and normalized so that the minimum reward is zero and their mean is one.

Thirdly, in the Mountain Car problem, an under-powered car in a valley must reach the top of a hill by accelerating in both directions to build its momentum. The state space is two-dimensional (position and velocity), while there are three actions (left, right, and neutral). The implementation, which is as described in Wirth (2017), begins each episode in a uniformly-random state and has a maximum episode length of 500. The state space is discretized into 10 states in each dimension. Each episode terminates either when the car reaches the goal or after 500 steps, and rewards are -1 in every step.
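As an illustration of this kind of state discretization, the sketch below maps a continuous Mountain Car state to one of the 10 × 10 grid cells. The position and velocity bounds shown are the values commonly used for this domain and are assumptions here, as are the helper names.

```python
import numpy as np

# Assumed Mountain Car state bounds (standard for this domain).
POS_BOUNDS = (-1.2, 0.6)
VEL_BOUNDS = (-0.07, 0.07)
BINS = 10  # 10 discrete states per dimension

def discretize(position, velocity):
    """Map a continuous (position, velocity) state to a discrete grid cell."""
    def to_bin(value, lo, hi):
        frac = (value - lo) / (hi - lo)
        return int(np.clip(frac * BINS, 0, BINS - 1))
    return to_bin(position, *POS_BOUNDS), to_bin(velocity, *VEL_BOUNDS)
```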

In each environment, preferences between trajectory pairs were generated by (noisily) comparing their total accrued rewards; this reward information was hidden from the learning algorithm, which observed only the trajectory preferences and state transitions. For trajectories τᵢ and τⱼ with total rewards u(τᵢ) and u(τⱼ), two models were considered for generating preferences: a) a logistic model, P(τᵢ ≻ τⱼ) = {1 + exp[−(u(τᵢ) − u(τⱼ))/c]}⁻¹, and b) a linear model, P(τᵢ ≻ τⱼ) = 0.5 + (u(τᵢ) − u(τⱼ))/c, where in both cases, the temperature c controls the degree of noisiness. In the linear case, c is assumed to be large enough that P(τᵢ ≻ τⱼ) ∈ [0, 1]. Note that in ties where u(τᵢ) = u(τⱼ), preferences are uniformly-random.

Methods compared. DPS was evaluated under three credit assignment models (described in Appendix A): 1) Bayesian linear regression, 2) Gaussian process regression, and 3) a Gaussian process preference model. User noise was generated via the logistic model, with noise levels c ∈ {10, 2, 1, 0.001} for RiverSwim and random MDPs and c ∈ {100, 20, 10, 0.001} for the Mountain Car. Higher values of c were selected for the Mountain Car because |u(τᵢ) − u(τⱼ)| has a wider range. Additionally, the linear preference noise model was evaluated with c = 2hΔr, where Δr is the difference between the maximum and minimum element of r for each MDP; this choice of c guarantees that P(τᵢ ≻ τⱼ) ∈ [0, 1], but yields noisier preferences than the logistic noise models considered.

As discussed in Section 2.6, many existing preference-based RL algorithms handle a somewhat distinct setting from the one considered here, as they assume that agent-environment interactions can be simulated between preference queries and/or prioritize minimizing preference queries rather than online regret. As a baseline, this work evaluates the Every-Visit Preference Monte Carlo (EPMC) algorithm with probabilistic credit assignment (Wirth and Fürnkranz, 2013a; Wirth, 2017). While EPMC does not require simulations between preference queries, it has several limitations, including: 1) the exploration approach always takes uniformly-random actions with some probability, and thus, the authors’ plots do not depict online reward accumulation, and 2) EPMC assumes that compared trajectories start in the same state. Lastly, DPS is compared against the posterior sampling RL algorithm (PSRL) from Osband, Russo, and Van Roy (2013), which observes true numerical rewards at each step, and thus upper-bounds the achievable performance of a preference-based algorithm.

Results. Figure 4.14 depicts performance curves for the three environments, each with two noise models; Appendix D contains additional experimental details, as well as results for the other noise parameters and from varying the algorithm's hyperparameters. DPS performs well in all simulations, and significantly outperforms the EPMC baseline. Typically, performance degrades as noise in the user feedback increases. In RiverSwim, however, most credit assignment models perform best when receiving the second-to-least-noisy user feedback (logistic noise, c = 1), since it is harder to escape the local minimum under the least-noisy preferences. DPS is also competitive with PSRL, which has access to the full cardinal rewards at each state-action. Additionally, while the theoretical guarantees for DPS assume fixed-horizon episodes, the Mountain Car results demonstrate that it also succeeds with variable episode lengths. Finally, the performance of DPS is robust to the choice of credit assignment model, and in fact using Gaussian processes (for which we do not currently have a regret analysis) often leads to the best empirical performance. These results suggest that DPS is a practically-promising approach that can robustly incorporate many models as subroutines.

Figure 4.14: Empirical performance of DPS; each simulated environment is shown under the two least-noisy user preference models evaluated: (a) RiverSwim, c = 0.0001; (b) Random MDPs, c = 0.0001; (c) Mountain Car, c = 0.0001; (d) RiverSwim, c = 1; (e) Random MDPs, c = 1; (f) Mountain Car, c = 0.1. The plots show DPS with three credit assignment models: Gaussian process regression (GPR), Bayesian linear regression, and a Gaussian process preference model. PSRL is an upper bound that receives numerical rewards, while EPMC is a baseline. Plots display the mean +/- one standard deviation over 100 runs of each algorithm tested. Results from the remaining user noise parameters are plotted in Appendix D. For RiverSwim and Random MDPs, normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model.

4.6 Discussion

This work investigates the preference-based reinforcement learning problem, in which an RL agent receives comparative preferences instead of absolute real-valued rewards as feedback. It develops the Dueling Posterior Sampling (DPS) algorithm, which optimizes policies in a highly efficient and flexible way. DPS integrates Bayesian credit assignment with preference-based feedback and posterior sampling, and this work proposes and evaluates several credit assignment models for DPS. The DPS framework also applies to the generalized linear dueling bandit setting, and can straightforwardly be extended to the multi-dueling bandit and RL settings, where in each iteration, the learning agent can select more than two actions or policies to elicit more than one preference. This work applies information-theoretic techniques introduced in Russo and Van Roy (2016) to analyze the regret of DPS. This involves empirically evaluating the information ratio introduced therein to conjecture that it is upper-bounded by the dimension of the observations. As a result, this approach establishes regret bounds both in the generalized linear dueling bandit setting and in preference-based RL with known transition dynamics. DPS also performs well in simulation, making it both a theoretically-justified and practically-promising algorithm.

Chapter 5

MIXED-INITIATIVE LEARNING FOR EXOSKELETON GAIT OPTIMIZATION

This chapter describes the CoSpar and LineCoSpar frameworks for learning per- sonalized exoskeleton walking gaits to maximize user comfort. This work develops a mixed-initiative human-in-the-loop system that learns from both preference and coactive feedback to achieve efficient online optimization. The framework is de- ployed on the Atalante lower-body exoskeleton to learn customized, user-preferred walking trajectories in human subject experiments.

This work was done jointly with Professor Aaron Ames’ research group, and in particular, in collaboration with Maegan Tucker and Myra Cheng. In addition, it is published in Tucker, Novoseller, et al. (2020b) and Tucker, Cheng, et al. (2020b).

5.1 Introduction

The field of human-robot interaction is receiving increasing attention in many application domains, from mobility assistance to autonomous driving, and from education to dialog systems. In many such domains, for a robotic system to interact optimally with a human user, it must adapt to user feedback. In particular, learning from user feedback could help to improve robotic assistive devices.

This work focuses on optimizing walking gaits for a lower-body exoskeleton, Ata- lante (pictured in Figure 5.1), in order to maximize user comfort. Atalante, developed by Wandercraft (Wandercraft, n.d.), uses 12 actuated joints to restore mobility to individuals with lower-limb mobility impairments. Previous work with Atalante demonstrated dynamically-stable walking using the method of partial hybrid zero dynamics (PHZD), originally designed for bipedal robots (Harib et al., 2018; Gur- riet, Finet, et al., 2018; Agrawal, Harib, et al., 2017). While this method generates stable bipedal locomotion, it lacks the ability to optimize for the user’s comfort; yet, user comfort should be a critical objective of gait optimization for exoskeleton walking. While existing methods (Ames, 2014) can generate human-like walking gaits for bipedal robots, it is unlikely that these methods fulfill the preferences of individuals using robotic assistance.

Figure 5.1: Atalante Exoskeleton with and without a user. The user is wearing a mask to measure metabolic expenditure.

The exoskeleton gait optimization problem is challenging, as it involves searching over the vast space of all possible walking trajectories, accounting for user feedback reliability, and learning from limited data obtained from time-intensive human trials. Several existing approaches for customizing walking with various exoskeletons optimize quantitative metrics, including body parameters and targeted walking speeds (Wu, Liu, et al., 2018; Ren et al., 2019) and metabolic expenditure (Kim et al., 2017; Zhang et al., 2017). However, since the goal of this work is to optimize for user comfort, our learning approach instead queries the user for preferences between sequential gait trials. Directly incorporating personalized feedback avoids making overly-strong assumptions about gait preference, or optimizing for a numerical quantity not aligned to personalized comfort.

In many real-world settings that involve learning from human feedback, it is challenging or impossible for people to reliably specify numerical scores or to provide demonstrations (Amodei et al., 2016; Argall et al., 2009; Basu, Yang, et al., 2017; Joachims et al., 2005). In particular, this is true in the exoskeleton application, as it is difficult for users to remember many gaits at once. In contrast, the users' relative preferences can measure their comfort more accurately. Indeed, previous studies have found preferences to be more reliable than numerical scores in a range of domains, including information retrieval (Chapelle, Joachims, et al., 2012) and autonomous driving (Basu, Yang, et al., 2017). In the exoskeleton domain, querying users for pairwise preferences only requires them to remember the current and previous gait trials. Conversely, prompting users for numerical scores requires them to remember all gait trials to ensure that the scoring is consistent over time.

While interactive preference learning has previously been used to tune parameters for an ankle exoskeleton in Thatte, Duan, and Geyer (2018), their approach utilizes domain knowledge to narrow the search space before performing online learning. To generate preference queries, Thatte et al. employ Double Thompson Sampling (Wu and Liu, 2016). This algorithm operates in the K-armed dueling bandit setting, in which outcomes corresponding to different actions are assumed to be independent. Yet, pairwise preferences provide a sparse feedback signal, as the algorithm only receives one bit of information per preference query. In contrast to Thatte, Duan, and Geyer (2018), we aim to optimize gaits over large gait parameter spaces without prior assumptions on the users' preferences.

Building upon techniques from dueling bandits (Sui, Zhuang, et al., 2017; Sui, Zoghi, et al., 2018; Yue, Broder, et al., 2012) and coactive learning (Shivaswamy and Joachims, 2012; Shivaswamy and Joachims, 2015), this work proposes the CoSpar algorithm to learn user-preferred exoskeleton gaits. CoSpar is a mixed-initiative approach, which both queries the user for preferences and allows the user to suggest improvements via coactive feedback. By combining multiple types of user feedback within a Gaussian process-based learning framework, CoSpar is able to identify well-performing gaits within relatively few trials. CoSpar is validated both in simulation and in human subject experiments with the Atalante exoskeleton, in which CoSpar finds user-preferred gaits within a gait library.

This work also presents the LineCoSpar algorithm, which integrates CoSpar with techniques from high-dimensional Bayesian optimization (Kirschner, Mutny, et al., 2019) to create a unified framework for performing high-dimensional preference- based learning. In simulation, LineCoSpar exhibits sample-efficient convergence to user-preferred actions in high-dimensional spaces. The algorithm is then deployed experimentally to optimize exoskeleton walking over six gait parameters (shown in Figure 5.2) for six able-bodied subjects.

In summary, the CoSpar and LineCoSpar algorithms perform sample-efficient, mixed-initiative human-in-the-loop learning to identify preferred actions in pos- sibly high-dimensional spaces. This work not only identifies exoskeleton users’ preferred walking trajectories, but can also provide insights into their preferences for certain gaits. Such knowledge could potentially help to design more comfortable 106

Figure 5.2: Human subject experiments with the LineCoSpar algorithm exploring six exoskeleton gait parameters: step length, step duration, step width, maximum step height, pelvis roll, and pelvis pitch. exoskeleton gaits in the future.

5.2 Background on the Atalante Exoskeleton and Gait Generation for Bipedal Robots

Atalante (Harib et al., 2018; Duburcq et al., 2019; Gurriet, Tucker, et al., 2019), developed by Wandercraft and pictured in Figure 5.1, has 12 actuated joints: three at each hip, one at each knee, and two in each ankle. Gurriet, Finet, et al. (2018) describe the device's mechanical components and control architecture in detail. In the Atalante exoskeleton, walking is achieved using pre-computed walking gaits, generated using the partial hybrid zero dynamics framework (Ames, 2014) and a nonlinear constrained optimization process that utilizes direct collocation.

The configuration space of the human-exoskeleton system is modeled as q = (p, φ, q_b) ∈ R^18, where p ∈ R^3 and φ ∈ SO(3) denote the position and orientation of the exoskeleton floating base frame with respect to the world frame, and q_b ∈ R^12 denotes the relative angles of the actuated joints. The generated gaits are realized on the exoskeleton using PD control at the joint level and a high-level controller that adjusts joint targets based on state feedback. The controller is executed by an embedded computer unit running a real-time operating system. Gaits are sent to the exoskeleton over a wireless connection via a custom graphical user interface.

Many existing lower-body exoskeletons either require the use of arm-crutches (Ekso Bionics, n.d.; ReWalk, n.d.; Indego, n.d.) or use slow static gaits with speeds around 0.05 m/s (Rex Bionics, n.d.). Using the PHZD method, dynamically-stable, crutchless exoskeleton walking gaits have been demonstrated. This section briefly explains this method to illustrate how exoskeleton gaits can be adapted in response to user preferences; for more details on the gait generation process, refer to Harib et al. (2018), Gurriet, Finet, et al. (2018), and Agrawal, Harib, et al. (2017).

Partial Hybrid Zero Dynamics Method

Systems with impulse effects, such as ground impacts, can be represented as hybrid control systems (Westervelt, Grizzle, and Koditschek, 2003; Bainov and Simeonov, 1989; Ye, Michel, and Hou, 1998). Summarizing from Gurriet, Finet, et al. (2018), the natural system dynamics can then be represented on an invariant reduced-dimensional surface, termed the zero dynamics surface (Westervelt, Grizzle, Chevallereau, et al., 2018), by appropriately defining the virtual constraints and using a feedback-linearizing controller to drive them to zero. Since the exoskeleton's desired forward hip velocity is constant and its actual velocity experiences a jump at impact, the partial zero dynamics surface is considered. The virtual constraints are defined as the difference between the actual and desired outputs:

$$
y_1(q, \dot{q}) = y_1^a(q, \dot{q}) - v_d,
$$
$$
y_2(q, \alpha) = y_2^a(q) - y_2^d(\tau(q), \alpha),
$$

where the actual outputs y_1^a and y_2^a are velocity-regulating and position-modulating terms, respectively. The output y_1^a is driven to a constant desired velocity v_d, while y_2^a is driven to a vector of desired trajectories, y_2^d. The trajectories y_2^d are represented using a Bézier polynomial with coefficients α and a state-based timing variable τ(q).
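For concreteness, the sketch below evaluates a vector of Bézier-polynomial desired outputs y_2^d(τ, α) from a coefficient matrix using the Bernstein basis. The degree and coefficient layout are illustrative assumptions, not the exact parameterization used on Atalante.

```python
import numpy as np
from math import comb

def bezier_outputs(alpha, tau):
    """Evaluate Bezier-polynomial desired outputs y2_d(tau, alpha).

    alpha : (n_outputs, M + 1) coefficient matrix, one row per output.
    tau   : state-based phase variable, assumed normalized to [0, 1].
    Returns an (n_outputs,) vector of desired output values.
    """
    M = alpha.shape[1] - 1
    # Bernstein basis polynomials of degree M evaluated at tau.
    basis = np.array([comb(M, k) * tau**k * (1 - tau)**(M - k) for k in range(M + 1)])
    return alpha @ basis
```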

According to Theorem 2 in Ames (2014), if there exist virtual constraints that yield an impact-invariant periodic orbit on the partial zero dynamics surface, then the outputs—when properly controlled on the exoskeleton—yield stable periodic walking. The orbit is impact-invariant if it returns to the partial zero dynamics surface PZ_α after an impact event. To find the polynomials α that yield an impact-invariant periodic orbit on the reduced-order manifold, we formulate an optimization problem of the form:

$$
\alpha^* = \underset{\alpha}{\mathrm{argmin}} \; J(\alpha), \qquad (5.1)
$$
$$
\text{s.t.} \quad \Delta(S \cap PZ_\alpha) \subset PZ_\alpha, \qquad (5.2)
$$
$$
W_i x \leq b_i, \qquad (5.3)
$$

where J(α) is a user-determined cost, Eq. (5.2) is the impact invariance condition, Eq. (5.3) contains other physical constraints, S is the guard defining the conditions under which impulsive behavior occurs, and Δ is the reset map governing the system's dynamical response to hitting the guard.

The optimization in Eqs. (5.1)-(5.3) produces a gait that can be altered by varying the cost function J(α) and/or adding physical constraints. In bipedal walking, this cost is frequently the mechanical cost of transport defined by Eqs. (17)-(18) in Reher et al. (2020). To create the desired motion, one must add physical constraints such as step length and foot height.

Gait Generation Applied to Lower-Body Exoskeletons

To translate the gait generation process to lower-body exoskeletons, one must choose an optimization cost function and physical constraints to obtain user-preferred gaits. While it is possible to optimize generated gaits for mechanical properties such as the cost of transport, currently, there is no well-understood relationship between user preferences and the parameters of the optimization problem. Additionally, due to the time-consuming nature of gait generation—including both the time to tune the optimization problem's constraints and the time to run the program—the problem of generating human-preferred dynamically-stable walking gaits remains largely unexplored.

Gait Library

It has become increasingly common to pre-compute a set of nominal walking gaits over a grid of various parameters (Da, Hartley, and Grizzle, 2017). These pre-computed gaits are combined to form a "gait library," from which gaits can be selected and executed immediately. For the purpose of exoskeleton walking, a gait library allows the operator to select a gait that is comfortable for the user; however, it is not yet clear how to select an appropriate walking gait to optimize user comfort and preference. Thus, we consider learning from the users' qualitative feedback, as discussed below, to select their most-preferred gaits.

This work utilizes a pre-computed gait library, in which gaits are specified by parameters that include (among others) step dimensions (step length, width, height), step duration, and pelvis roll and pitch.
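As a simple illustration of how such a library can be organized, the sketch below indexes precomputed gaits by a tuple of discretized parameters and retrieves the nearest precomputed gait to a requested parameter setting. The grid values echo the step-length and step-duration ranges used in the experiments later in this chapter, but the data layout, key structure, and helper names are hypothetical.

```python
import numpy as np

step_lengths = np.linspace(0.08, 0.18, 15)    # meters (grid used in the step-length experiment)
step_durations = np.linspace(0.85, 1.15, 10)  # seconds (grid used in the multi-feature experiment)

# Each entry would hold the precomputed gait data (e.g., Bezier coefficients);
# a placeholder is used here.
gait_library = {
    (round(sl, 3), round(sd, 3)): {"bezier_coeffs": None}
    for sl in step_lengths
    for sd in step_durations
}

def lookup_gait(step_length, step_duration):
    """Return the precomputed gait whose parameters are closest to the request."""
    key = min(gait_library,
              key=lambda k: (k[0] - step_length) ** 2 + (k[1] - step_duration) ** 2)
    return gait_library[key]
```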

5.3 The CoSpar Algorithm for Preference-Based Learning

We leverage preference-based learning (e.g., does the user prefer gait A over gait B?) to determine the gait parameters most preferred by the user (Sui, Zoghi, et al., 2018; Yue, Broder, et al., 2012; Shivaswamy and Joachims, 2015; Fürnkranz and Hüllermeier, 2010; Sadigh et al., 2017; Fürnkranz, Hüllermeier, et al., 2012), since preference feedback has been shown to be much more reliable than absolute feedback when learning from subjective human responses (Sui, Zoghi, et al., 2018; Joachims et al., 2005). Thus, our goal to personalize the exoskeleton's gait can be framed as a dueling bandit (Sui, Zoghi, et al., 2018; Yue, Broder, et al., 2012) and coactive learning (Shivaswamy and Joachims, 2012; Shivaswamy and Joachims, 2015) problem.

Our work builds upon the SelfSparring algorithm (Algorithm 3), a Bayesian duel- ing bandits approach that enjoys both competitive theoretical convergence guaran- tees and empirical performance (Sui, Zhuang, et al., 2017). SelfSparring learns a Bayesian posterior over each action’s utility to the user and draws multiple samples from the model’s posterior to “duel” or “spar” via preference elicitation. The SelfS- parring algorithm iteratively: a) draws multiple samples from the posterior model of the actions’ utilities; b) for each sampled model, executes the action with the highest sampled utility; c) queries for preference feedback between the executed actions; and d) updates the posterior according to the acquired preference data. Section 2.5 describes the SelfSparring algorithm in more detail.

To collect more feedback beyond just one bit per preference query, we also allow the user to suggest improvements during their trials. This approach resembles the coactive learning framework (Shivaswamy and Joachims, 2012; Shivaswamy and Joachims, 2015), in which the user identifies an improved action as feedback to each presented action. Coactive learning has been applied to robot trajectory planning (Jain, Sharma, et al., 2015; Somers and Hollinger, 2016), but has not, to our knowl- edge, yet been applied to robotic gait generation or in concert with either preference learning or Gaussian processes.

To optimize exoskeleton gaits over the gait library (Section 5.2), we propose the CoSpar algorithm, a mixed-initiative learning approach (Wolfman et al., 2001; 110 Algorithm 13 CoSpar 1: Input: A = action set, = = number of actions to select in each iteration, 1 = buffer size, (Σpr, 2) = utility prior parameters, V = coactive feedback weight 2: D = ∅ ⊲ Initialize preference dataset pr 3: Initialize prior over A: (-0, Σ0) = (0, Σ ) 4: for 8 = 1, 2,...,# do 5: for 9 = 1, . . . , = do 6: Sample utility function 5 9 from (-8−1, Σ8−1) 7: Select action x 9 (8) = argmaxx ∈A 5 9 (x) 8: end for 9: Execute = actions {x1 (8),..., x= (8)} 10: Observe pairwise preference feedback matrix ' ∈ {0, 1, ∅}}=×(=+1) 11: for 9 = 1, . . . , =; : = 1, . . . , = + 1 do 12: if ' 9: ≠ ∅ then 13: Append preference to dataset D 14: end if 15: end for 16: for 9 = 1, . . . , = do 0 17: Obtain coactive feedback x 9 (8) ∈ A ∪ ∅ ⊲ ∅ = no coactive feedback given 0 18: if x 9 (8) ≠ ∅ then 0 19: Add to D: x 9 (8) preferred to x 9 (8), weight V 20: end if 21: end for 22: Update Bayesian posterior over D to obtain (-8, Σ8) 23: end for

Lester, Stone, and Stelling, 1999) which extends the SelfSparring algorithm to incor- porate coactive feedback. Similarly to SelfSparring, CoSpar maintains a Bayesian preference relation function over the possible actions, which is fitted to observed preference feedback. CoSpar updates this model with user feedback and uses it to select actions for new trials and to elicit feedback. We first define the Bayesian preference model, and then detail the steps of Algorithm 13.

Bayesian Modeling of Utilities from Preference Data

We adopt the preference-based Gaussian process model of Chu and Ghahramani (2005b). Gaussian process modeling is beneficial, as it enables us to model a Bayesian posterior over a class of smooth, non-parametric functions.

Let A ⊂ R^d be the finite set of available actions, with cardinality |A|. At any point in time, CoSpar has collected a preference feedback dataset D = {x_{k1} ≻ x_{k2} | k = 1, ..., N} consisting of N preferences, where x_{k1} ≻ x_{k2} indicates that the user prefers action x_{k1} ∈ A to action x_{k2} ∈ A in the kth preference. Furthermore, we assume that each action x ∈ A has a latent, underlying utility to the user, f(x). For finite action spaces, the utilities can be written in vector form: f := [f(x_1), f(x_2), ..., f(x_{|A|})]^⊤. Given preference data D, we are interested in the posterior probability of f:

$$
P(\mathbf{f} \mid \mathcal{D}) \propto P(\mathcal{D} \mid \mathbf{f})\, P(\mathbf{f}). \qquad (5.4)
$$

We define a Gaussian prior over f:

$$
P(\mathbf{f}) = \frac{1}{(2\pi)^{|A|/2} |\Sigma^{pr}|^{1/2}} \exp\left( -\frac{1}{2} \mathbf{f}^\top [\Sigma^{pr}]^{-1} \mathbf{f} \right),
$$

where Σ^pr ∈ R^{|A|×|A|} is the prior covariance matrix, such that [Σ^pr]_{jk} = K(x_j, x_k) for a kernel function K, for instance the squared exponential kernel given in Eq. (A.4).

For computing the likelihood P(D | f), we assume that the user's preference feedback may be corrupted by noise:

$$
P(x_{k1} \succ x_{k2} \mid \mathbf{f}) = g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right), \qquad (5.5)
$$

where g(·) ∈ [0, 1] is a monotonically-increasing link function, and c > 0 is a hyperparameter indicating the degree of noise in the preferences. Note that the likelihood in Eq. (5.5) generalizes the one given in Chu and Ghahramani (2005b), which corresponds specifically to a Gaussian noise model as described in Section 2.2. The likelihood from Chu and Ghahramani (2005b) is obtained by setting c = √2·σ and g = Φ in Eq. (5.5), where Φ is the standard Gaussian cumulative distribution function.

Thus, the full expression for the likelihood is:

$$
P(\mathcal{D} \mid \mathbf{f}) = \prod_{k=1}^{N} g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right). \qquad (5.6)
$$

The posterior P(f | D) can be estimated via the Laplace approximation as a multivariate Gaussian distribution; see Section 2.1 and Chu and Ghahramani (2005b) for background on the Laplace approximation. The next subsection discusses mathematical details of the Laplace approximation for the specific posterior in Eq. (5.4), and derives a condition on the link function g that is necessary and sufficient in order for the Laplace approximation to exist.

Finally, in formulating the posterior, preferences can be weighted relative to one another if some are thought to be noisier than others. This is accomplished by changing c to c_k in Eq. (5.6) to model differing values of the preference noise parameter among the data points, and is analogous to weighted Gaussian process regression (Hong et al., 2017).

The Laplace Approximation

The Laplace approximation yields a Gaussian distribution N(f̂, Σ̂) centered at the MAP estimate f̂:

$$
\hat{\mathbf{f}} = \underset{\mathbf{f}}{\mathrm{argmin}} \; \{ -\log P(\mathbf{f} \mid \mathcal{D}) \} = \underset{\mathbf{f}}{\mathrm{argmin}} \; S(\mathbf{f}), \quad \text{where}
$$
$$
S(\mathbf{f}) := \frac{1}{2} \mathbf{f}^\top [\Sigma^{pr}]^{-1} \mathbf{f} - \sum_{k=1}^{N} \log g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right).
$$

Note that S(f) simply drops the constant terms from −log P(f | D) that do not depend on f. The Laplace approximation's posterior covariance Σ̂ is the inverse of the Hessian matrix of S(f), given by:

$$
\hat{\Sigma} = \left( \nabla_{\mathbf{f}}^2 S(\mathbf{f}) \right)^{-1} = \left( [\Sigma^{pr}]^{-1} + \Lambda \right)^{-1}, \quad \text{where} \quad \Lambda = \nabla_{\mathbf{f}}^2 \left\{ -\sum_{k=1}^{N} \log g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right) \right\},
$$
$$
[\Lambda]_{jl} = \frac{1}{c^2} \sum_{k=1}^{N} s_k(j)\, s_k(l) \left[ \frac{-g''\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} \right)^{\!2} \right],
\quad \text{and} \quad
s_k(j) = \begin{cases} 1, & j = k_1 \\ -1, & j = k_2 \\ 0, & \text{otherwise.} \end{cases}
$$

Thus, each term in the sum defining [Λ]_{jl} is a matrix M with only four nonzero elements, of the form:

$$
[M]_{k_1,k_1} = [M]_{k_2,k_2} = a, \qquad [M]_{k_1,k_2} = [M]_{k_2,k_1} = -a, \qquad [M]_{jl} = 0 \; \text{otherwise}, \qquad (5.7)
$$

where

$$
a = \frac{1}{c^2} \left[ \frac{-g''\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} \right)^{\!2} \right].
$$

It can be shown that the matrix in Eq. (5.7) is positive semidefinite if and only if a ≥ 0. Therefore, it suffices to show that:

$$
\frac{-g''\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} + \left( \frac{g'\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)}{g\!\left( \frac{f(x_{k1}) - f(x_{k2})}{c} \right)} \right)^{\!2} \geq 0.
$$

Since this condition must hold for all input arguments of g(·), we arrive at the following final necessary and sufficient convexity condition for validity of the Laplace approximation:

$$
\frac{-g''(x)}{g(x)} + \left( \frac{g'(x)}{g(x)} \right)^{2} \geq 0, \quad \forall\, x \in \mathbb{R}. \qquad (5.8)
$$

Thus, in order to show that the Laplace approximation is valid for some candidate link function g, one must simply calculate its derivatives and show that they satisfy the convexity condition in Eq. (5.8). Both the Gaussian link function g = Φ and the sigmoidal link function g_log(x) = (1 + e^{−x})^{−1} satisfy Eq. (5.8).
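A minimal numerical sketch of this Laplace approximation is given below, using the sigmoidal link g_log and a quasi-Newton optimizer for the MAP estimate. The kernel construction, jitter term, and hyperparameter values are illustrative assumptions, and the sketch is not the exact implementation used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_posterior(K_prior, prefs, c=0.02):
    """Laplace approximation to the preference posterior of Eq. (5.4).

    K_prior : (A, A) prior covariance matrix over the actions' utilities.
    prefs   : list of (winner_idx, loser_idx) preference pairs.
    Returns the posterior mean (MAP estimate) and covariance.
    """
    A = K_prior.shape[0]
    K_inv = np.linalg.inv(K_prior + 1e-8 * np.eye(A))   # jitter for numerical stability

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def S(f):  # negative log-posterior, up to additive constants
        z = np.array([(f[w] - f[l]) / c for w, l in prefs])
        return 0.5 * f @ K_inv @ f - np.sum(np.log(sigmoid(z)))

    f_hat = minimize(S, np.zeros(A), method="L-BFGS-B").x

    # Hessian of the negative log-likelihood (Lambda); for the sigmoid link,
    # -d^2/dz^2 log(sigmoid(z)) = sigmoid(z) * (1 - sigmoid(z)).
    Lam = np.zeros((A, A))
    for w, l in prefs:
        z = (f_hat[w] - f_hat[l]) / c
        a = sigmoid(z) * (1.0 - sigmoid(z)) / c**2
        Lam[w, w] += a; Lam[l, l] += a
        Lam[w, l] -= a; Lam[l, w] -= a
    Sigma_hat = np.linalg.inv(K_inv + Lam)
    return f_hat, Sigma_hat
```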

The CoSpar Learning Algorithm

The tuple (Σ^pr, c) contains the prior parameters of the Bayesian preference model, as defined above. These parameters are, respectively, the covariance matrix of the Gaussian process prior and a hyperparameter quantifying the degree of noise in the user's preferences. From these parameters, one obtains the prior mean and covariance, (μ_0, Σ_0) (Line 3 in Alg. 13). In each iteration i, CoSpar updates the utility model (Line 22) via the Laplace approximation to the posterior in Eq. (5.4) to obtain N(μ_i, Σ_i).

To select actions in the ith iteration (Lines 5-8), the algorithm first draws n samples from the posterior, N(μ_{i−1}, Σ_{i−1}). Each of these is a utility function f_j, j ∈ {1, ..., n}, which assigns a utility value to each action in A. The corresponding selected action is simply the one maximizing f_j (Line 7): x_j(i) = argmax_{x ∈ A} f_j(x) for j ∈ {1, ..., n}. The n actions are executed (Line 9), and the user provides pairwise preference feedback between pairs of these actions (the user can also state "no preference").

We extend SelfSparring (Sui, Zhuang, et al., 2017) to extract more preference comparisons from the available trials by assuming that additionally, the user can remember the b actions preceding the current n actions:

Assumption 7 (Recall buffer). The user remembers the b trials preceding the current iteration, and can therefore give preferences (or state "no preference") between any pair of actions among the n trials in the current iteration and the b previous trials.

The user thus provides preferences between any combination of actions within the current n trials and the previous b trials. For instance, with n = 1, b > 0, one can interpret b as a buffer of previous trials that the user remembers, such that each new sample is compared against all actions in the buffer. For n = b = 1, the user can report preferences between any pair of two consecutive trials, i.e., the user is asked, "Did you like this trial more or less than the previous trial?" For n = 1, b = 2, the user would additionally be asked, "Did you like this trial more or less than the second-to-last trial?" Compared to b = 0, a positive buffer size tends to extract more information from the available trials.

We expect that setting n = 1 while increasing b to as many trials as the user can accurately remember would minimize the trials required to reach a preferred gait. In Line 10, the pairwise preferences from iteration i form a matrix R ∈ {0, 1, ∅}^{n×(n+b)}; the values 0 and 1 express preference information, while ∅ denotes the lack of a preference between the actions concerned.

Finally, the user can suggest improvements in the form of coactive feedback (Line 17). For example, the user could request a longer or shorter step length. In Line 17, ∅ indicates that no coactive feedback was provided. Otherwise, the user's suggestion is appended to the data D as preferred to the most recently executed action. In learning the model posterior, one can assign the coactive preferences a smaller weight relative to pairwise preferences via the input parameter β > 0.
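To tie the pieces together, the following compact sketch shows the flavor of one CoSpar iteration with n = 1 and b = 1: sample a utility function from the Laplace-approximated posterior, execute its maximizer, and fold the resulting preference (and any coactive suggestion) back into the dataset. The `query_user` and `fit_posterior` helpers are hypothetical stand-ins for the user-feedback query and the posterior update described above.

```python
import numpy as np

def cospar_iteration(data, mu, Sigma, prev_idx, rng, query_user, fit_posterior, beta=0.5):
    """One CoSpar iteration with n = 1 selected action and buffer b = 1.

    data : list of (winner_idx, loser_idx, weight) preferences;
    (mu, Sigma) : current posterior over the finite action set's utilities.
    """
    f_sample = rng.multivariate_normal(mu, Sigma)      # Thompson-style utility sample
    idx = int(np.argmax(f_sample))                     # execute the sampled maximizer
    pref, coactive_idx = query_user(idx, prev_idx)     # preference vs. previous trial, plus optional suggestion
    if pref is not None:                               # pref = True if the new action is preferred
        winner, loser = (idx, prev_idx) if pref else (prev_idx, idx)
        data.append((winner, loser, 1.0))
    if coactive_idx is not None:                       # coactive suggestion preferred to the executed action
        data.append((coactive_idx, idx, beta))
    mu, Sigma = fit_posterior(data)                    # refit posterior on all weighted preferences
    return mu, Sigma, idx
```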

5.4 Simulation Results for CoSpar

CoSpar's performance is evaluated in two sets of simulations, with: (1) the compass-gait biped's cost of transport,1 and (2) a set of synthetic optimization objective functions.2 Both cases used a Gaussian link function in the likelihood (i.e., g = Φ), with a preference noise parameter of c/√2 = σ = 0.01 (where σ is the standard deviation of the Gaussian noise as described in Section 2.2). In both cases, CoSpar efficiently converges to the optimum.

Python code for reproducing the experiments in this section is located on Github (Tucker, Novoseller, et al., 2020a).

1 Gaussian process kernel: squared exponential with lengthscale = 0.025, signal variance = 0.0001, noise variance = 1e-8.
2 Gaussian process kernel: squared exponential with lengthscale = [0.15, 0.15], signal variance = 0.0001, noise variance = 1e-5.

Optimizing the Compass-Gait Biped's Cost of Transport

We first evaluate our approach with a simulated compass-gait biped, optimizing its cost of transport over the step length via preference feedback (Figure 5.3). Preferences are determined by comparing cost of transport values, calculated by simulating gaits for multiple step lengths, each at a fixed forward hip velocity of 0.2 m/s. These simulated gaits were synthesized via a single-point shooting partial hybrid zero dynamics method (Westervelt, Grizzle, Chevallereau, et al., 2018).

Figure 5.3: Leftmost: Cost of transport (COT) for the compass-gait biped at different step lengths and a fixed 0.2 m/s velocity. Remaining plots: posterior utility estimates of CoSpar (n = 2, b = 0; without coactive feedback) after varying iterations of learning (posterior mean +/- 2 standard deviations). The plots each show three posterior samples, which lie in the high-confidence region (mean +/- 2 standard deviations) with high probability. The posterior utility estimate quickly converges to identifying the optimal action.

We use CoSpar to minimize the one-dimensional objective function in Figure 5.3, using pairwise preferences obtained by comparing cost of transport values for the different step lengths. Here, we use CoSpar with n = 2, b = 0, and without coactive feedback. Note that without a buffer or coactive feedback, CoSpar reduces to SelfSparring (Sui, Zhuang, et al., 2017) coupled with the Gaussian process preference model from Chu and Ghahramani (2005b). At each iteration, two new samples are drawn from the Bayesian posterior, and the resultant two step lengths are compared to elicit a preference. Using the new preferences, CoSpar updates its posterior over the utility of each step length.

Figure 5.3 depicts the evolution of the posterior preference model, where each iteration corresponds to a preference between two new trials. With more preference data, the posterior utility increasingly peaks at the point of lowest cost of transport. These results suggest that CoSpar can efficiently identify high-utility actions from preference feedback alone.

Optimizing Synthetic Two-Dimensional Functions

We next test CoSpar on a set of 100 synthetic two-dimensional utility functions, such as the one shown in Figure 5.4a. Each utility function was generated from a Gaussian process prior on a 30-by-30 grid. These experiments evaluate: 1) the potential to scale CoSpar to higher dimensions, and 2) the advantages of coactive feedback.

Figure 5.4: CoSpar models two-dimensional utility functions using preference data. a) Example of a synthetic 2D objective function. b) Utility model posterior learned after 150 iterations of CoSpar in simulation (n = 1; b = 1; coactive feedback). CoSpar prioritizes identifying and exploring the optimal region, rather than learning a globally-accurate utility landscape.

We compare three settings for CoSpar's (n, b) parameters: (2, 0), (3, 0), (1, 1). For each setting—as well as with and without coactive feedback—we simulate CoSpar on each of the 100 random objective functions. In each case, the number of objective function evaluations, or experimental trials, was held constant at 150.

Coactive feedback suggests improvements to actions selected by the algorithm. In simulation, such feedback is generated using a 2nd-order differencing approximation to the objective function's gradient. In each of the two dimensions, we chose the 50th and 75th percentiles of the gradient magnitudes as thresholds to determine when to give coactive feedback. If CoSpar selects a point at which both dimensions' gradient components have magnitudes below their respective 50th percentile thresholds, then no coactive feedback is given. Otherwise, we consider the higher-magnitude gradient component, and depending on the highest threshold that it exceeds (50th or 75th), simulate coactive feedback as either a 5% or 10% increase in the appropriate direction and dimension.

Figure 5.5 shows the simulation results. In each case, the mixed-initiative simulations involving coactive feedback improve upon those with only preferences. Learning is slowest for n = 2, b = 0 (Figure 5.5), since that case elicits the fewest preferences. Figure 5.4b depicts the utility model's posterior mean for the objective function in Figure 5.4a, learned in the simulation with n = 1, b = 1, and mixed-initiative feedback. In comparing Figure 5.4b to Figure 5.4a, we see that CoSpar learns a sharp peak around the optimum, as it is designed to converge to sampling preferred regions, rather than giving the user undesirable options by exploring elsewhere.

Figure 5.5: CoSpar simulation results on 100 2D synthetic objective functions, comparing CoSpar with and without coactive feedback for three settings of the parameters n and b (see Algorithm 13). Mean +/- standard error of the objective values achieved over the 100 repetitions. The maximal and minimal objective function values are normalized to 0 and 1. We see that coactive feedback always helps, and that n = 2, b = 0—which receives the fewest preferences—performs worst.
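The simulated coactive-feedback rule described above can be sketched as follows. The percentile thresholds are assumed to be precomputed from the objective's gradient magnitudes over the grid, and interpreting the 5%/10% step as a fraction of each parameter's range is an assumption of this illustrative reconstruction.

```python
import numpy as np

def simulated_coactive_feedback(x, grad, thresholds_50, thresholds_75, bounds):
    """Suggest an improved action near x, or return None.

    grad : 2D gradient estimate at x (e.g., from 2nd-order differencing);
    thresholds_50/75 : per-dimension 50th/75th percentile gradient magnitudes;
    bounds : per-dimension (low, high) ranges used to scale the 5%/10% steps.
    """
    mags = np.abs(grad)
    if np.all(mags < thresholds_50):          # gradient too flat: no suggestion given
        return None
    dim = int(np.argmax(mags))                # act along the higher-magnitude dimension
    frac = 0.10 if mags[dim] >= thresholds_75[dim] else 0.05
    step = frac * (bounds[dim][1] - bounds[dim][0]) * np.sign(grad[dim])
    x_new = np.array(x, dtype=float)
    x_new[dim] = np.clip(x_new[dim] + step, *bounds[dim])
    return x_new
```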

5.5 Deployment of CoSpar in Human Subject Exoskeleton Experiments

After its validation in simulation, CoSpar was deployed on a lower-body exoskeleton, Atalante, in two personalized gait optimization experiments with human subjects.3 Both experiments aimed to determine gait parameter values that maximize user comfort, as captured by preference and coactive feedback. The first experiment,4 repeated for three able-bodied subjects, used CoSpar to determine the users' preferred step lengths, i.e., optimizing over a one-dimensional feature space. The second experiment5 demonstrates CoSpar's effectiveness in two-dimensional feature spaces, and optimizes simultaneously over two different gait feature pairs. Importantly, CoSpar operates independently of the choice of gait features. The subjects' metabolic expenditure was also recorded via direct calorimetry as shown in Figure 5.1, but this data was uninformative of user preferences, as users are not required to expend effort toward walking.

3 A video of the experimental results is located at Tucker, Novoseller, et al. (2020c).
4 Gaussian process kernel: squared exponential with lengthscale = 0.03, signal variance = 0.005, and noise variance = 1e-7. Preference noise hyperparameter: c/√2 = σ = 0.02.

Each participant completed 20 gait trials, providing preference and coactive feedback after each trial. Figure 5.6 illustrates the posterior’s evolution over the experiment. After only five exoskeleton trials, CoSpar was already able to identify a relatively- compact preferred step length subregion. After the 20 trials, three points along the utility model’s posterior mean were selected: the maximum, mean, and minimum. The user walked in the exoskeleton with each of these step lengths in a randomized ordering, and gave a blind ranking of the three, as shown in Figure 5.6. For each subject, the blind ranking matches the preference posterior obtained by CoSpar, indicating effective learning of individual user preferences.

Figure 5.6: Experimental results for optimizing step length with three subjects (one row per subject). Columns 1-4 illustrate the evolution of the preference model posterior (mean +/- standard deviation), shown at various trials. CoSpar converges to similar but distinct optimal gaits for different subjects. Column 5 depicts the subjects' blind ranking of the three gaits executed after 20 trials. The rightmost column displays the experimental trials in chronological order, with the background depicting the posterior preference mean at each step length. CoSpar draws more samples in the region of higher posterior preference.

Learning Preferences over Multiple Features

We further demonstrate CoSpar's practicality to personalize over multiple features, by optimizing over two different feature pairs: 1) step length and step duration, and 2) step length and step width. The protocol of the one-dimensional experiment was repeated for Subject 1, with step lengths discretized as before, step duration discretized into 10 equally-spaced values between 0.85 and 1.15 seconds (with 10% and 20% modifications under coactive feedback), and step width into 6 values between 0.25 and 0.30 meters (20% and 40% modifications). After each trial, the user was queried for both a pairwise preference and coactive feedback. Figure 5.7 shows the results for both feature spaces. The estimated preference values were consistent with a three-sample blind ranking evaluation, suggesting that CoSpar successfully identified user-preferred parameters. Figure 5.8 displays phase diagrams of the gaits with minimum, mean, and maximum posterior utility values to illustrate the difference between preferred and non-preferred gaits.

5 Gaussian process kernel: same parameters as in footnote 4, except for step duration lengthscale = 0.08 and step width lengthscale = 0.03.

Figure 5.7: Experimental results from two-dimensional feature spaces (top row: step length and duration; bottom row: step length and width). Columns 1-4 illustrate the evolution of the preference model's posterior mean. Column 4 also shows the subject's blind rankings of the three gaits executed after 20 trials. Column 5 depicts the experimental trials in chronological order, with the background as in Figure 5.6. CoSpar draws more samples in the region of higher posterior preference.

Figure 5.8: Experimental phase diagrams of the left leg joints over 10 seconds of walking. The gaits shown correspond to the maximum, mean, and minimum preference posterior values for both of Subject 1's 2D experiments. For instance, Subject 1 preferred gaits with longer step lengths, as shown by the larger range in sagittal hip angles in the phase diagram.

5.6 The LineCoSpar Algorithm for High-Dimensional Preference-Based Learning

While the CoSpar algorithm reliably identifies user-preferred gaits in one- and two-dimensional action spaces, the preference-based gait optimization problem can become intractable in larger action spaces. CoSpar must jointly maintain and sample from a posterior over every action, resulting in a computational complexity that increases exponentially in the action space dimension d. Specifically, CoSpar optimizes over the d-dimensional action space A by discretizing the entire space before beginning the learning process. With m uniformly-spaced points in each dimension of A, this discretization results in an action space of cardinality |A| = m^d, where larger m enables finer-grained search at a higher computational cost. The Bayesian preference model is updated over all |A| points during each iteration. This update is intractable for higher values of d, since computing the posterior over all |A| points involves expensive matrix operations, such as inverting Σ^pr, Σ_i ∈ R^{|A|×|A|}.

The LineCoSpar algorithm (Alg. 14) integrates the CoSpar framework with techniques from high-dimensional Gaussian process learning to model users' preferences in high-dimensional action spaces. Drawing inspiration from the LineBO algorithm in Kirschner, Mutny, et al. (2019), LineCoSpar exploits low-dimensional structure in the search space by sequentially considering one-dimensional subspaces from which to sample actions. This allows the algorithm to maintain its Bayesian preference relation function over a subset of the action space in each iteration. Compared to CoSpar, LineCoSpar learns the model posterior much more efficiently and can be scaled to higher dimensions. Figure 5.9 compares computation times for CoSpar and LineCoSpar.

Algorithm 14 LineCoSpar
1: Input: A = action set; utility prior parameters (c and kernel hyperparameters); m = granularity of discretization
2: D = ∅, W = ∅  ⊲ D: preference data, W: actions in D
3: Set p_1 to a uniformly-random action in A
4: for i = 1, 2, ..., N do
5:    L_i = random line through p_i, discretized via m
6:    V_i = L_i ∪ W  ⊲ Points over which to update posterior
7:    (μ_i, Σ_i) = posterior over points in V_i, given D
8:    Sample utility function f_i ∼ N(μ_i, Σ_i)
9:    Execute action x_i = argmax_{x ∈ V_i} f_i(x)
10:   Add pairwise preference between x_i and x_{i−1} to D
11:   Add coactive feedback x′_i ≻ x_i to D
12:   Set W = W ∪ {x_i} ∪ {x′_i}  ⊲ Update set of actions in D
13:   Set p_{i+1} = argmax_{x ∈ V_i} μ_i(x)
14: end for
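The key geometric step—drawing and discretizing a random one-dimensional subspace through a point (Line 5)—can be sketched as follows. The line length, the granularity interpretation, and the clipping to a box-shaped action space are illustrative assumptions.

```python
import numpy as np

def random_line(p, bounds, m=30, rng=None):
    """Discretize a random line through point p within a box-shaped action space.

    p : (d,) anchor point (e.g., the current posterior maximizer);
    bounds : (d, 2) array of per-dimension [low, high] limits;
    m : number of candidate points generated along the line.
    """
    rng = rng if rng is not None else np.random.default_rng()
    d = len(p)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)          # uniformly-random direction on the sphere
    ts = np.linspace(-1.0, 1.0, m)                  # signed offsets along the line (assumed unit half-length)
    points = p + np.outer(ts, direction)            # candidate points on the line
    inside = np.all((points >= bounds[:, 0]) & (points <= bounds[:, 1]), axis=1)
    return points[inside]                           # keep only in-bounds points
```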


Figure 5.9: Curse of dimensionality for CoSpar. Average time per iteration of CoSpar versus LineCoSpar. The y-axis is on a logarithmic scale. For LineCoSpar, the time is roughly constant in the number of dimensions d, while the runtime of CoSpar increases exponentially. For d = 4, the duration of a CoSpar iteration is inconvenient in the human-in-the-loop learning setting, and for d ≥ 5, it is intractable.

This section provides background on existing approaches for high-dimensional Gaussian process learning, and then describes the LineCoSpar algorithm, including 1) defining the posterior updating procedure, 2) achieving high-dimensional learning, and 3) incorporating posterior sampling and coactive feedback.

High-Dimensional Bayesian Optimization

Bayesian optimization is a powerful approach for optimizing expensive-to-evaluate black-box functions. It maintains a model posterior over the unknown function, and cycles through a) using the posterior to acquire actions at which to query the function, b) querying the function, and c) updating the posterior using the obtained data. This procedure is challenging in high-dimensional search spaces due to the computational cost of the acquisition step (a), which often requires solving a non-convex optimization problem over the search space, and of maintaining the posterior in the update step (c), which can require manipulating matrices that grow exponentially with the action space's dimension. Dimensionality reduction techniques are therefore an area of active interest. Solutions vary from optimizing variable subsets (DropoutBO) (Li, Gupta, et al., 2017) to projecting into lower-dimensional spaces (REMBO) (Wang et al., 2016) to sequentially optimizing over one-dimensional subspaces (LineBO) (Kirschner, Mutny, et al., 2019). We draw upon the approach of LineBO because of its state-of-the-art performance in high-dimensional spaces. Furthermore, it is especially sample-efficient in spaces with underlying low-dimensional structure. In the case of exoskeleton walking, low-dimensional structure could appear as linear relationships between two gait parameters in the user's utility function; for instance, users who prefer short step lengths may also prefer short step durations.

The LineCoSpar Algorithm

Modeling Utilities Using Pairwise Preference Data. Similarly to CoSpar, LineCoSpar uses pairwise comparisons to learn a Bayesian model posterior over the relative utilities of actions (i.e., gait parameter combinations) to the user, based upon the Gaussian process preference model in Chu and Ghahramani (2005b). We focus on Gaussian process methods because they model smooth, non-parametric utility functions.

As previously, A ⊂ R^d represents the set of possible actions. In iteration i of the algorithm, we consider a subset of the actions V_i ⊂ A, with cardinality V_i = |V_i|. Though we will define V_i later, we note that it includes all points in the dataset D; the posterior is specifically modeled over points in V_i. As in the CoSpar framework, we assume that each action x ∈ A has a latent utility to the user, denoted as f(x). Throughout the learning process, LineCoSpar stores a dataset of all user feedback, D = {x_{k1} ≻ x_{k2} | k = 1, ..., N}, consisting of N preferences, where x_{k1} ≻ x_{k2} indicates that the user prefers action x_{k1} to action x_{k2}. The preference data D is used to update the posterior utilities of the actions in V_i. Defining f = [f(x_{i1}), f(x_{i2}), ..., f(x_{iV_i})]^T ∈ R^{V_i}, where x_{ij} is the jth action in V_i, the utilities f have the posterior:

P(f | D) ∝ P(D | f) P(f).    (5.9)

In each iteration i, we define a Gaussian process prior over the utilities f of actions in V_i:

P(f) = (2π)^{−V_i/2} |Σ_i^pr|^{−1/2} exp(−(1/2) f^T [Σ_i^pr]^{−1} f),    (5.10)

where Σ_i^pr ∈ R^{V_i × V_i} is the prior covariance matrix, which must now be recalculated in each iteration i: [Σ_i^pr]_{jk} = K(x_{ij}, x_{ik}) for an appropriate kernel function K. Our experiments use the squared exponential kernel.
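As a concrete illustration of recomputing the prior covariance in each iteration, the sketch below builds Σ_i^pr over the current points in V_i using a squared exponential kernel. The default hyperparameter values are placeholders (the values used in the simulations are listed in footnote 6), and the diagonal jitter term is an assumption of this sketch.

```python
import numpy as np

def prior_covariance(V, lengthscale=0.15, signal_variance=1e-4, jitter=1e-8):
    """Squared exponential prior covariance over the points in V_i (rows of V).

    [Sigma_i^pr]_{jk} = signal_variance * exp(-||x_j - x_k||^2 / (2 * lengthscale^2)).
    The hyperparameter defaults and the diagonal jitter are placeholder choices.
    """
    sq_dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    Sigma_pr = signal_variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)
    # Small diagonal jitter keeps the matrix numerically positive definite.
    return Sigma_pr + jitter * np.eye(V.shape[0])
```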

The likelihood P(D | f) is calculated identically to the likelihood in CoSpar.

Importantly, V_i contains all points in the dataset D, and therefore the likelihood is well-defined:

P(x_{k1} ≻ x_{k2} | f) = g((f(x_{k1}) − f(x_{k2})) / c),

where g(·) ∈ [0, 1] is a monotonically-increasing link function, and c > 0 is a hyperparameter indicating the magnitude of the preference noise.

While the previous CoSpar results utilize the Gaussian cumulative distribution function for g, we empirically found that using the heavier-tailed sigmoid distribution, g_log(x) := 1 / (1 + e^{−x}), as the link function improves performance. The sigmoid link function g_log(x) satisfies the convexity conditions for the Laplace approximation described in Section 5.3 and has been used to model preferences in other contexts (Wirth, Akrour, et al., 2017). The full likelihood expression becomes:

P(D | f) = ∏_{k=1}^{N} g_log((f(x_{k1}) − f(x_{k2})) / c).
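For illustration, the sketch below evaluates the corresponding negative log-likelihood under the sigmoidal link; this is the data term that is combined with the Gaussian prior when forming the Laplace approximation discussed next. Representing the preference data as index pairs into the utility vector is an assumption made for this sketch.

```python
import numpy as np

def neg_log_likelihood(f, prefs, c):
    """Negative log-likelihood of the preference data under the sigmoidal link g_log.

    f:     utility vector over the points in V_i.
    prefs: list of (k1, k2) index pairs into f, meaning x_{k1} was preferred to x_{k2}.
    c:     preference noise hyperparameter.
    """
    nll = 0.0
    for k1, k2 in prefs:
        z = (f[k1] - f[k2]) / c
        # -log g_log(z) = log(1 + exp(-z)), evaluated stably via logaddexp
        nll += np.logaddexp(0.0, -z)
    return nll
```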

As with CoSpar, the posterior in Eq. (5.9) is estimated via the Laplace approximation to yield a multivariate Gaussian distribution, N(μ_i, Σ_i).

Sampling Approach for Higher Dimensions. Inspired by Kirschner, Mutny, et al. (2019), LineCoSpar overcomes CoSpar's computational intractability by iteratively modeling the posterior over one-dimensional subspaces (lines), rather than considering the full action space A at once. In each iteration i, LineCoSpar selects uniformly-spaced points along a new random line L_i within the action space, which lies along a uniformly-random direction and intersects the action p_i that maximizes the posterior mean. Including p_i in the subspace L_i encourages exploration of higher-utility areas. The posterior P(f | D) is calculated over V_i := L_i ∪ W, where W is the set of actions that appear in the preference feedback dataset D. Critically, this approach reduces the model's covariance matrices Σ_i^pr, Σ_i from size |A| × |A| to V_i × V_i. Rather than growing exponentially in d, which is impractical for online learning, LineCoSpar's complexity is constant in the dimension d and linear in the number of iterations N. Since queries are expensive in many human-in-the-loop robotics settings, N is typically low.
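The following sketch shows one plausible way to construct the random line L_i through p_i, as described above: draw a uniformly-random direction, place m uniformly-spaced points along it, and keep the line inside the action space. The extent parameter and the clipping to the box bounds are simplifying assumptions of this sketch; the actual LineCoSpar implementation may discretize the line differently.

```python
import numpy as np

def random_line_through(p, action_bounds, m, extent=2.0):
    """Return m points on a random line through p, kept inside the action space.

    p:             (d,) action maximizing the current posterior mean.
    action_bounds: (d, 2) array of per-dimension [lower, upper] bounds.
    m:             number of uniformly-spaced points along the line.
    extent:        how far the line extends on either side of p (an arbitrary
                   choice for this sketch).
    """
    d = p.shape[0]
    direction = np.random.randn(d)
    direction /= np.linalg.norm(direction)        # uniformly-random direction on the unit sphere
    offsets = np.linspace(-extent, extent, m)     # uniformly-spaced offsets along the line
    points = p[None, :] + offsets[:, None] * direction[None, :]
    # Clip to the per-dimension bounds so that all points lie in the action space.
    points = np.clip(points, action_bounds[:, 0], action_bounds[:, 1])
    return np.unique(points, axis=0)              # drop duplicates created by clipping
```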

Posterior Sampling Framework. Utilities are learned using the SelfSparring (Sui, Zhuang, et al., 2017) approach to posterior sampling. Specifically, in each iteration, we calculate the posterior of the utilities f over the points in V_i = L_i ∪ W, obtaining the posterior N(μ_i, Σ_i) over V_i. The algorithm then samples a utility function f_i from the posterior, which assigns a utility to each action in V_i. Next, LineCoSpar executes the action x_i that maximizes f_i: x_i = argmax_{x ∈ V_i} f_i(x). The user provides a preference (or indicates indifference, i.e., "no preference") between x_i and the preceding action x_{i−1}.

In addition, for each executed action x_i, the user can provide coactive feedback, specifying the dimension, direction (higher or lower), and degree in which to change x_i. The user's suggested action x_i' is added to W, and the feedback is added to D as x_i' ≻ x_i. In each iteration, preference and coactive feedback each add at most one action to W. Thus, in iteration i, V_i contains at most m + 2(i − 1) actions, so its size is independent of the dimensionality d. In the subsequent analysis, x_max is defined as the action maximizing the final posterior mean after N iterations, i.e., x_max := argmax_{x ∈ V_{N+1}} μ_{N+1}(x).

Note that LineCoSpar can be generalized to include the n and b hyperparameters in CoSpar, which respectively allow the algorithm to sample multiple actions per learning iteration and to query the user for preferences between trials in non-consecutive iterations. The LineCoSpar description in Algorithm 14 sets n = b = 1, since it is hard for exoskeleton users to remember more than the current and previous gait trial at any given time.

5.7 Performance of LineCoSpar in Simulation

We validate the performance of LineCoSpar in simulation using both standard Bayesian optimization benchmarks and randomly-generated polynomials.6 The simulations show that LineCoSpar is sample-efficient, converges to sampling high-valued actions, and learns a preference relation function such that actions with higher objective values have high posterior utilities.

Python code for reproducing the LineCoSpar simulation results is available on Github (Tucker, Cheng, et al., 2020c).

LineCoSpar Performance on Standard Bayesian Optimization Benchmarks

We evaluated the performance of LineCoSpar on the standard Hartmann3 (H3) and Hartmann6 (H6) benchmarks (3 and 6 dimensions, respectively). We do not compare LineCoSpar to other optimization methods because there are no other preference-based Gaussian process methods that are tractable in high dimensions. We validate LineCoSpar with noiseless preferences and then demonstrate its robustness to noisy user preferences. Preferences are generated in simulation by comparing objective function values.

Under ideal preference feedback, x_{k1} ≻ x_{k2} if f(x_{k1}) > f(x_{k2}). The true objective values f are invisible to the algorithm, which observes only the preference dataset D. Compared to CoSpar, LineCoSpar converges to sampling actions with higher objective values at a faster rate (Figure 5.10). Thus, LineCoSpar not only enables higher-dimensional optimization, but also improves the speed and accuracy of learning.

Since human preferences may be noisy, we tested the algorithm's robustness to noisy preference feedback. In simulation, this is modeled via P(x_{k1} ≻ x_{k2}) = (1 + e^{−s_k / c_h})^{−1}, where s_k = f(x_{k1}) − f(x_{k2}) and c_h is a hyperparameter capturing the noise level. As c_h → ∞, the preferences approach uniform randomness (i.e., become noisier). Also, actions become less distinguishable when the distance between f(x_{k1}) and f(x_{k2}) decreases. This reflects human preference generation, since it is more difficult to give consistent preferences between actions with similar utilities. By simulating noisy preferences, we demonstrate that LineCoSpar is robust to noisy feedback. Figure 5.11 displays the results.
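The sketch below generates synthetic preferences under this noise model from a known objective f; treating c_h = 0 as the ideal, noiseless case is a convention adopted here for convenience.

```python
import numpy as np

def simulate_preference(f, x1, x2, c_h=0.0):
    """Return True if x1 is synthetically preferred to x2 under the objective f.

    c_h = 0 gives ideal (noiseless) preferences; larger c_h makes the
    preferences approach uniform randomness, as described above.
    """
    s = f(x1) - f(x2)
    if c_h == 0.0:
        return s > 0.0                              # ideal feedback: prefer the higher objective value
    p_prefer_x1 = 1.0 / (1.0 + np.exp(-s / c_h))    # P(x1 preferred to x2)
    return np.random.rand() < p_prefer_x1
```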

6All experiments use the squared exponential Gaussian process kernel with lengthscale 0.15 in every dimension, signal variance 1e−4, noise variance 1e−5, and preference noise c = 0.005 with a sigmoidal link function.

[Figure 5.10 plot: "Sampled values over time" — normalized objective function value versus number of iterations t, for CoSpar (H3), LineCoSpar (H3), and LineCoSpar (H6).]

Figure 5.10: Convergence to higher objective values on standard benchmarks. Mean objective value ± standard deviation using H3 and H6, averaged over 100 runs. Compared to CoSpar, LineCoSpar converges to sampling actions with higher objective values at a faster rate, as it employs an improved sampling approach and link function. It is intractable to run CoSpar on a 6-dimensional action space.

Figure 5.11: Robustness to noisy preferences. Mean objective value ± standard deviation of the action x_max with the highest posterior utility. This is averaged over 100 runs of LineCoSpar on H6 with varying preference noise, as quantified by c_h. Higher performance correlates with less noise (lower c_h). The algorithm is robust to noise up to a certain degree (c_h ≤ 0.5).

LineCoSpar Simulations with Randomly-Generated Objective Functions

We also tested LineCoSpar using randomly-generated d-dimensional polynomials (for d = 6) as objective functions: p(x) = Σ_{l=1}^{d} α_l Σ_{j=1}^{d} β_j x_j, where x_j denotes the jth element of x ∈ A, and α_j, β_j, j ∈ {1, ..., d}, are sampled independently from the uniform distribution U(−1, 1). The dimensions' ranges and discretizations match those in the exoskeleton experiments, so that LineCoSpar's performance in these simulations approximates the number of human trials needed to find optimal gaits.

Coactive feedback was simulated for each sampled action x_i by finding an action x_i' with a higher objective value that differs from x_i along only one dimension. The action x_i' is determined by randomly choosing a dimension in {1, ..., d} and a direction (positive or negative), and taking a step from x_i along this vector. If the resulting action x_i' has a higher objective value, it is added to the dataset D as x_i' ≻ x_i. This is a proxy for the human coactive feedback acquired in the exoskeleton experiments described below, in which the user can suggest a dimension and direction in which to modify an action to obtain an improved gait.
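A minimal sketch of this simulated coactive feedback is given below. The per-dimension step sizes stand in for the grid spacing used in the simulations and are placeholders here.

```python
import numpy as np

def simulate_coactive_feedback(x, objective, step_sizes, action_bounds):
    """Propose an action differing from x along one randomly-chosen dimension.

    Returns the suggested action if it improves the objective (to be added to
    the dataset as a preference over x), and None otherwise. step_sizes is a
    (d,) array of per-dimension step sizes, a placeholder for the grid spacing
    used in the simulations.
    """
    d = x.shape[0]
    dim = np.random.randint(d)                     # randomly-chosen dimension
    direction = np.random.choice([-1.0, 1.0])      # randomly-chosen direction
    x_new = x.copy()
    x_new[dim] = np.clip(x_new[dim] + direction * step_sizes[dim],
                         action_bounds[dim, 0], action_bounds[dim, 1])
    if objective(x_new) > objective(x):
        return x_new                               # improved action: suggested over x
    return None
```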

Figure 5.12 displays LineCoSpar's performance over 100 randomly-generated polynomials (10 repetitions each), with computation time shown in Figure 5.9. The results demonstrate that LineCoSpar samples high-valued actions within relatively few iterations (≈ 20 trials with coactive feedback).

[Figure 5.12 plot: "Sampled values over time" — normalized objective function value versus number of iterations t, with and without coactive feedback.]

Figure 5.12: Coactive feedback improves convergence of LineCoSpar. Mean objective value ± standard deviation of the sampled actions using random objective functions. Results are averaged over 1,000 runs of LineCoSpar over 100 randomly-generated six-dimensional functions (d = 6; 10 runs per synthetic function). The sampled actions converge to high objective values in relatively few iterations, and coactive feedback accelerates this process.

5.8 Deployment of LineCoSpar in Human Subject Exoskeleton Experiments

After LineCoSpar's performance was demonstrated in simulation, the algorithm was experimentally deployed on the lower-body exoskeleton Atalante (Figure 5.2) to optimize six gait parameters for six able-bodied users.7

7Please see Tucker, Cheng, et al. (2020a) for a video of the experimental results.

Experimental Procedure

LineCoSpar optimized exoskeleton gaits for six self-identified able-bodied subjects over the six gait parameters shown in Figure 5.2: step length, step duration, step width, maximum step height, pelvis roll, and pelvis pitch. These parameters were chosen from the pre-computed gait library because they are relatively intuitive for users to understand when giving coactive feedback. The parameter ranges, respectively, are: 0.08-0.18 meters, 0.85-1.15 seconds, 0.25-0.3 meters, 0.065-0.075 meters, 5.5-9.5 degrees, and 10.5-14.5 degrees. Figure 5.13 illustrates the experimental procedure for testing and validating LineCoSpar.

Figure 5.13: LineCoSpar experimental procedure. After setup of the subject-exoskeleton system, subjects were queried for preferences between all consecutive gait pairs, along with coactive feedback, in 30 gait trials (in total, at most 29 pairwise preferences and 30 pieces of coactive feedback). After these 30 trials, the subject unknowingly entered the validation portion of the experiment, in which he/she validated the posterior-maximizing gait, x_max, against four randomly-selected gaits.

All subjects were volunteers without prior exoskeleton exposure. For each subject, the testing procedure lasted approximately two hours, with one hour of setup and one hour of exoskeleton testing. The setup consisted of explaining the procedure (including how to provide preference and coactive feedback), measuring subject parameters, and adjusting the thigh and shank lengths of the exoskeleton to the subject. During testing, the subjects had control over initiating and terminating each instance of exoskeleton walking and were instructed to try each walking gait until they felt comfortable giving feedback. The subjects could choose to test each gait multiple times to confirm their preference. They could also specify “no preference” between two trials, in which case no new preference was added to the dataset D.

After completing 30 trials (including trials with no preference, but excluding voluntary gait repetitions), the subject began a set of "validation" trials; the subject was not informed of the start of the validation phase. Validation consisted of six additional trials and yielded four pairwise preferences, each between the posterior-maximizing action x_max and a randomly-generated action. This validation step verifies that x_max is preferred over other parameter combinations across the search space.

Gait Optimization Results

Figure 5.14 shows that the LineCoSpar algorithm both explores across the gait parameter space and exploits regions with higher posterior utility. Over time, LineCoSpar increasingly samples actions concentrated in regions of the search space that are preferred based on previous feedback. This results in a significant correlation between visitation frequencies and posterior utilities across these regions (Pearson's p-value = 1.22e-10).

Figure 5.14: Exploration versus exploitation in the LineCoSpar human trials. Each row depicts the distribution of a particular gait parameter's values across all gaits that the subject tested. Each dimension is discretized into 10 bins. Note that the algorithm explores different parts of the action space for each subject. These visitation frequencies exhibit a statistically-significant correlation with the posterior utilities across these regions (Pearson's p-value = 1.22e-10).

For each subject, Table 5.1 lists the parameters of the predicted optimal gaits, x_max, identified by LineCoSpar. Table 5.1 also illustrates the results of the validation trials for each subject. These results show that x_max was predominantly preferred over the randomly-selected actions during validation. For four of the six subjects, all four validation preferences matched the posterior, while the other two subjects matched three and one of the four preferences, respectively. Incorrect validation preferences may be due to noise in the users' preference feedback or because 30 trials is insufficient to completely explore the entire gait parameter space.

5.9 Discussion

This work presents several contributions. Firstly, it develops the CoSpar interactive learning framework for efficient, mixed-initiative learning from human preferences and coactive feedback.

Table 5.1: Gait parameters optimizing LineCoSpar’s posterior mean (xmax) for each able-bodied subject.

Subject   Step Length (m)   Step Duration (s)   Step Width (m)   Max Step Height (m)   Pelvis Roll (deg)   Pelvis Pitch (deg)   Validation Accuracy (%)
1         0.0835            0.943               0.278            0.0674                6.38                10.9                 75
2         0.136             1.04                0.285            0.0679                6.41                12.4                 100
3         0.137             0.922               0.279            0.0688                8.56                11.4                 100
4         0.127             0.989               0.268            0.065                 6.68                12.7                 25
5         0.161             1.05                0.258            0.0689                7.32                13.2                 100
6         0.177             1.11                0.256            0.0663                7.71                13.5                 100

Secondly, building upon the CoSpar algorithm, we develop the LineCoSpar algorithm, which presents the first demonstration of high-dimensional preference-based learning to our knowledge. We evaluate both algorithms in simulation, showing that they learn to select optimal actions efficiently and robustly.

Thirdly, CoSpar and LineCoSpar are deployed in human subject experiments with the Atalante lower-body exoskeleton to learn personalized, user-preferred walking gaits.8 In particular, LineCoSpar optimizes gaits over six gait parameters for six able-bodied subjects. This work demonstrates the first application of preference-based learning for optimizing dynamic crutchless walking. As seen in Figures 5.6 and 5.7, CoSpar successfully models the users' preferences, identifying compact subregions of preferred gaits. Furthermore, as seen in Table 5.1, different subjects have varying most-preferred gaits, underscoring the importance of individualized gait optimization and of studying the mechanisms underlying exoskeleton users' preferences. This research presents promising advancements for human-in-the-loop optimization in clinical trials and for the broader rehabilitation community.

8Videos of the experimental results can be seen at Tucker, Novoseller, et al. (2020c) and Tucker, Cheng, et al. (2020a) for CoSpar and LineCoSpar, respectively.

Chapter 6

CONCLUSIONS AND FUTURE DIRECTIONS

6.1 Conclusion

This dissertation addresses the problem of learning from qualitative, human-in-the-loop feedback, with the goal of developing algorithmic frameworks that robustly and efficiently learn to optimize their interactions with humans in real-world settings. Toward this goal, this thesis begins by presenting the Dueling Posterior Sampling (DPS) framework for learning from human preferences in the generalized linear dueling bandit and preference-based reinforcement learning (RL) settings. The DPS framework integrates preference-based posterior sampling with Bayesian inference of: 1) the utilities underlying the user's preferences and 2) the environment's transition dynamics in the RL setting. Several Bayesian credit assignment models are developed and evaluated together with DPS.

In addition, the DPS algorithm is developed in concert with a theoretical analysis framework, which adapts the information-theoretic posterior sampling analysis in Russo and Van Roy (2016) to the preference-based feedback setting. This type of analysis depends upon upper-bounding the information ratio, a quantity which directly balances between exploration and exploitation. Experiments suggest that DPS has a bounded information ratio when coupled with a linear link function, and therefore, that it has a competitive, finite-time Bayesian regret guarantee. Furthermore, DPS performs well empirically in a range of simulations, making it both a theoretically-grounded and practically-promising algorithm.

This thesis also presents the CoSpar and LineCoSpar algorithms for mixed-initiative learning from pairwise preferences and coactive feedback, in which the user suggests improved actions to the algorithm. These algorithms assume that the user's feedback is explained by an underlying utility function, and use Gaussian process modeling to learn a Bayesian posterior over these utilities. These algorithms demonstrate sample-efficient convergence to well-performing actions and are robust to noise in the users' preferences. Furthermore, the LineCoSpar algorithm reliably learns over high-dimensional action spaces, and presents the first framework for high-dimensional Gaussian process learning from preference feedback to our knowledge.

Both the CoSpar and LineCoSpar algorithms are deployed in human subject experiments with the Atalante exoskeleton to identify gaits that optimize the users' comfort. In particular, the LineCoSpar algorithm optimizes exoskeleton walking over a six-dimensional gait parameter space for six subjects. These algorithms achieve state-of-the-art performance in converging to personalized, user-preferred exoskeleton gaits within relatively few trials. This work provides the first demonstration of preference-based learning for optimizing dynamic crutchless exoskeleton walking.

6.2 Future Work

Theoretical Analysis of Preference-Based RL

There are many interesting directions for extending the work on DPS. Because the current regret analysis relies upon a conjectured upper bound for the information ratio, deriving an analytical information ratio upper bound remains an unsolved problem. It could also be interesting to develop techniques for relating the Bayesian regret and information ratio when DPS uses a sampling distribution that is not an exact Bayesian posterior.

Meanwhile, the presented regret analysis specifically considers the linear link function. Therefore, another direction involves studying regret under other models for credit assignment and user feedback generation, for instance using Bayesian logistic regression or the Gaussian process credit assignment models discussed in Appendix A. More broadly, I expect that DPS would likely perform well with any asymptotically-consistent reward model that sufficiently captures users' preference behavior. Notably, Theorem 2, which relates the Bayesian regret and information ratio, does not depend on a specific credit assignment model. Moreover, the techniques for estimating the information ratio in simulation could be straightforwardly adapted to other credit assignment models and link functions.

Another future direction involves estimating information ratio values for preference-based RL with unknown MDP transition dynamics. In contrast to the current procedure for estimating the information ratio (Algorithm 11), this would require drawing samples from the transition dynamics posterior in addition to the utility posterior. These dynamics samples would be used to estimate the occupancy vectors corresponding to each policy (i.e., the expected number of visits to each state-action pair in a trajectory roll-out), which are needed to estimate the policies' posterior utilities. Furthermore, to determine each policy's posterior probability of optimality, one would need to sample a large number of transition dynamics and utility parameters from their respective posteriors. For each sampled environment, consisting of both sampled dynamics and utility parameters, one would perform value iteration to determine the optimal policy under that sample. Such results would be of interest, as there is currently no work on applying the information-theoretical regret analysis techniques from Russo and Van Roy (2016) to the RL setting with unknown MDP transition dynamics.

Finally, simulating the information ratio in the RL setting is very computationally intensive for all but the smallest MDPs, as the number of deterministic policies grows exponentially with the state space size and episode length, and polynomially with the number of actions. Therefore, developing new techniques to estimate this ratio could also be beneficial.

Preference-Based Learning in More General Settings

DPS assumes that the users' preferences are explained by underlying utilities, and that these utilities govern their preferences via standard link functions. It would be helpful to better understand in what situations these modeling assumptions are valid, and to develop a more general framework that relaxes them. Meanwhile, our team is developing an extension to DPS that is tractable with larger state and action spaces, by incorporating kernelized input spaces and methods such as continuous value iteration (Deisenroth, Rasmussen, and Peters, 2009) to further improve sample efficiency.

Exoskeleton Gait Optimization

Future steps for exoskeleton gait optimization include conducting studies involving patients with paraplegia, whose preferences likely differ from those of able-bodied subjects. Additionally, user preferences could change over time; for instance, in the exoskeleton experiments, we observed that able-bodied users often prefer slower walking gaits when they are less accustomed to walking inside of an exoskeleton. Thus, creating algorithms that account for such adaptations is another important direction for future study. Thirdly, the CoSpar and LineCoSpar frameworks could also be extended beyond working with precomputed gait libraries to generate entirely new gaits or controller designs.

The CoSpar and LineCoSpar algorithms both aim to minimize the regret of online learning by converging to the optimal gait as quickly as possible; therefore, over time, their gait samples become increasingly biased toward user-preferred gaits. For instance, Figure 5.4 illustrates that in synthetic function simulations with 150 iterations, CoSpar's learned objective function posterior consists primarily of a sharp peak at the optimum. While this behavior is desirable for the gait optimization problem, it does not yield data that helps to understand the mechanisms underlying the users' preferences. Since developing CoSpar and LineCoSpar, our team has developed an active learning approach (ROIAL) that aims to learn the user's preference landscape as accurately as possible, rather than converging to the optimal gait (Li, Tucker, et al., 2020). This extension takes steps toward better understanding the underlying mechanics of user-preferred walking and toward gaining insight into the science of walking with respect to exoskeleton gait design.

Mixed-Initiative Systems

The work on CoSpar and LineCoSpar develops a mixed-initiative system that integrates preference and coactive feedback. It would be interesting to further study the interaction between various types of user feedback in such systems, for instance when different types of feedback queries—preferences, suggestions, ordinal feedback, demonstrations, etc.—should be used, and how their combination impacts the learning process. Additionally, while developing mixed-initiative approaches, it is important to construct theoretical analysis frameworks that account for integrating multiple user feedback modalities. One could also develop frameworks that decide among several different types of feedback queries at every step in order to extract the most information under a limited query budget.

Human Feedback Modeling

The DPS, CoSpar, and LineCoSpar algorithms all learn a Bayesian posterior over the utility function underlying the users' feedback. To use such Bayesian approaches under various types of qualitative feedback, algorithms must robustly model the processes by which users generate such feedback. With coactive feedback, for instance, it is not currently clear how to model the processes behind its generation; furthermore, these processes could vary depending upon specific application domains and could also depend upon human psychology. In addition, as mentioned previously, the users' feedback could change over time.

Finally, sample efficiency could be further maximized via learning systems capable not only of individualizing their interactions for different human users, but also of generalizing across users, leveraging existing user feedback to accelerate the learning process for new users.

BIBLIOGRAPHY

Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári. “Improved algorithms for linear stochastic bandits.” In: Conference on Neural Information Processing Systems. 2011, pp. 2312–2320. Abeille, Marc and Alessandro Lazaric. “Linear Thompson sampling revisited.” In: Electronic Journal of Statistics 11.2 (2017), pp. 5165–5197. Agrawal, Ayush, Omar Harib, et al. “First steps towards translating HZD control of bipedal robots to decentralized control of exoskeletons.” In: IEEE Access 5 (2017), pp. 9919–9934. Agrawal, Shipra and Navin Goyal. “Analysis of Thompson sampling for the multi- armed bandit problem.” In: Conference on Learning Theory (COLT). 2012, pp. 39–1. – “Thompson sampling for contextual bandits with linear payoffs.” In: International Conference on Machine Learning. 2013, pp. 127–135. Agrawal, Shipra and Randy Jia. “Optimistic posterior sampling for reinforcement learning: Worst-case regret bounds.” In: Conference on Neural Information Pro- cessing Systems. 2017, pp. 1184–1194. Ailon, Nir, Zohar Karnin, and Thorsten Joachims. “Reducing dueling bandits to car- dinal bandits.” In: International Conference on Machine Learning. 2014, pp. 856– 864. Akrour, Riad, Marc Schoenauer, and Michèle Sebag. “APRIL: Active preference learning-based reinforcement learning.” In: Joint European Conference on Ma- chine Learning and Knowledge Discovery in Databases. Springer. 2012, pp. 116– 131. Akrour, Riad, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. “Pro- gramming by feedback.” In: International Conference on Machine Learning. 2014, pp. 1503–1511. Ames, Aaron D. “Human-inspired control of bipedal walking robots.” In: IEEE Transactions on Automatic Control 59.5 (2014), pp. 1115–1130. Amodei, Dario et al. “Concrete problems in AI safety.” In: arXiv preprint arXiv: 1606.06565 (2016). Argall, Brenna D. et al. “A survey of robot learning from demonstration.” In: Robotics and Autonomous Systems 57.5 (2009), pp. 469–483. Armour, Brian S. et al. “Prevalence and causes of paralysis—United States, 2013.” In: American Journal of Public Health 106.10 (2016), pp. 1855–1857. 136 Audibert, Jean-Yves and Sébastien Bubeck. “Regret bounds and minimax policies under partial monitoring.” In: The Journal of Machine Learning Research 11 (2010), pp. 2785–2836. Audibert, Jean-Yves, Rémi Munos, and Csaba Szepesvári. “Exploration–exploitation tradeoff using variance estimates in multi-armed bandits.” In: Theoretical Com- puter Science 410.19 (2009), pp. 1876–1902. Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” In: Machine Learning 47.2-3 (2002), pp. 235–256. Bainov, Dromi D. and Pavel S. Simeonov. Systems with Impulse Effect: Stability, Theory and Applications. John Wiley & Sons, 1989. Bartlett, Peter L. and Ambuj Tewari. “REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs.” In: Conference on Uncertainty in Artificial Intelligence. 2012. Basu, Chandrayee, Erdem Bıyık, et al. “Active learning of reward dynamics from hierarchical queries.” In: IEEE International Conference on Intelligent Robots and Systems (IROS). 2019, pp. 120–127. Basu, Chandrayee, Qian Yang, et al. “Do you want your autonomous car to drive like you?” In: ACM/IEEE International Conference on Human-Robot Interaction. IEEE. 2017, pp. 417–425. Billingsley, P. Convergence of Probability Measures. John Wiley & Sons, 1968. Bıyık, Erdem, Nicolas Huynh, et al. 
“Active preference-based Gaussian process regression for reward learning.” In: arXiv preprint arXiv:2005.02575 (2020). Bıyık, Erdem, Malayandi Palan, et al. “Asking Easy Questions: A User-Friendly Approach to Active Reward Learning.” In: Conference on Robot Learning. 2020, pp. 1177–1190. Bridson, Robert. “Fast Poisson disk sampling in arbitrary dimensions.” In: SIG- GRAPH Sketches 10 (2007), p. 1. Brochu, Eric, Tyson Brochu, and Nando de Freitas. “A Bayesian interactive optimiza- tion approach to procedural animation design.” In: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 2010, pp. 103– 112. Brochu, Eric, Vlad M. Cora, and Nando de Freitas. “A tutorial on Bayesian opti- mization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning.” In: arXiv preprint arXiv:1012.2599 (2010). Brochu, Eric, Nando de Freitas, and Abhijeet Ghosh. “Active preference learning with discrete choice data.” In: Conference on Neural Information Processing Systems. 2008, pp. 409–416. 137 Brost, Brian et al. “Multi-dueling bandits and their application to online ranker evaluation.” In: ACM Conference on Information and Knowledge Management. 2016, pp. 2161–2166. Bubeck, Sébastien and Che-Yu Liu. “Prior-free and prior-dependent regret bounds for Thompson sampling.” In: Conference on Neural Information Processing Sys- tems. 2013, pp. 638–646. Burges, Christopher, Tal Shaked, et al. “Learning to rank using gradient descent.” In: International Conference on Machine Learning. 2005, pp. 89–96. Burges, Christopher J., Robert Ragno, and Quoc V. Le. “Learning to rank with nonsmooth cost functions.” In: Conference on Neural Information Processing Systems. 2007, pp. 193–200. Busa-Fekete, Róbert et al. “Preference-based reinforcement learning: Evolutionary direct policy search using a preference-based racing algorithm.” In: Machine Learning 97.3 (2014), pp. 327–351. Cesa-Bianchi, Nicolo and Gábor Lugosi. Prediction, Learning, and Games. Cam- bridge University Press, 2006. Chapelle, Olivier, Thorsten Joachims, et al. “Large-scale validation and analysis of interleaved search evaluation.” In: ACM Transactions on Information Systems (TOIS) 30.1 (2012), 6:1–6:41. Chapelle, Olivier and Lihong Li. “An empirical evaluation of Thompson sampling.” In: Conference on Neural Information Processing Systems. 2011, pp. 2249–2257. Chowdhury, Sayak Ray and Aditya Gopalan. “On kernelized multi-armed bandits.” In: International Conference on Machine Learning. 2017, pp. 844–853. Christiano, Paul F. et al. “Deep reinforcement learning from human preferences.” In: Conference on Neural Information Processing Systems. 2017, pp. 4299–4307. Chu, Wei and Zoubin Ghahramani. “Gaussian processes for ordinal regression.” In: Journal of Machine Learning Research 6 (2005), pp. 1019–1041. – “Preference learning with Gaussian processes.” In: International Conference on Machine Learning. 2005, pp. 137–144. Chua, Kurtland et al. “Deep reinforcement learning in a handful of trials using prob- abilistic dynamics models.” In: Conference on Neural Information Processing Systems. 2018, pp. 4754–4765. Clark, Jack and Dario Amodei. Faulty Reward Functions in the Wild. https : //openai.com/blog/faulty-reward-functions/. 2016. Cover, Thomas M. and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012. 138 Da, Xingye, Ross Hartley, and Jessy W. Grizzle. 
“Supervised learning for stabilizing underactuated bipedal robot locomotion, with outdoor experiments on the wave field.” In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2017, pp. 3476–3483. Dani, Varsha, Thomas P. Hayes, and Sham M. Kakade. “Stochastic linear optimiza- tion under bandit feedback.” In: Conference On Learning Theory. 2008, pp. 355– 366. Dann, Christoph and Emma Brunskill. “Sample complexity of episodic fixed- horizon reinforcement learning.” In: Conference on Neural Information Pro- cessing Systems. 2015, pp. 2818–2826. Deisenroth, Marc Peter, Carl Edward Rasmussen, and Jan Peters. “Gaussian process dynamic programming.” In: Neurocomputing 72.7-9 (2009), pp. 1508–1524. Dollar, Aaron M. and Hugh Herr. “Lower extremity exoskeletons and active orthoses: Challenges and state-of-the-art.” In: IEEE Transactions on Robotics 24.1 (2008), pp. 144–158. Donati, Ana R. C. et al. “Long-term training with a brain-machine interface-based gait protocol induces partial neurological recovery in paraplegic patients.” In: Scientific Reports 6 (2016), p. 30383. Dong, Shi and Benjamin Van Roy. “An information-theoretic analysis for Thompson sampling with many actions.” In: Conference on Neural Information Processing Systems. 2018, pp. 4157–4165. Duburcq, Alexis et al. “Online trajectory planning through combined trajectory op- timization and function approximation: Application to the exoskeleton Atalante.” In: arXiv preprint arXiv:1910.00514 (2019). Dunbar, Daniel and Greg Humphreys. Using scalloped sectors to generate Poisson- disk sampling patterns. University of Virginia, Department of Computer Science. 2006. Ebert, Frederik et al. “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control.” In: arXiv preprint arXiv:1812.00568 (2018). Ekso Bionics. https://eksobionics.com/, Last accessed on 2019-09-14. Facts and Figures at a Glance. National Spinal Cord Injury Statistical Center. Birmingham, AL: University of Alabama at Birmingham. https://www.sci- info - pages . com / wp - content / media / NSCISC - 2019 - Spinal - Cord - Injury-Facts-and-Figures-at-a-Glance.pdf. 2019. Filippi, Sarah et al. “Parametric bandits: The generalized linear case.” In: Conference on Neural Information Processing Systems. 2010, pp. 586–594. Foreman-Mackey, Daniel et al. “emcee: The MCMC hammer.” In: Publications of the Astronomical Society of the Pacific 125.925 (2013), p. 306. 139 Frazier, Peter I., Warren B. Powell, and Savas Dayanik. “A knowledge-gradient policy for sequential information collection.” In: SIAM Journal on Control and Optimization 47.5 (2008), pp. 2410–2439. Fürnkranz, Johannes and Eyke Hüllermeier. Preference Learning. Springer, 2010. Fürnkranz, Johannes, Eyke Hüllermeier, et al. “Preference-based reinforcement learning: A formal framework and a policy iteration algorithm.” In: Machine Learning 89.1-2 (2012), pp. 123–156. Gad, Parag et al. “Weight bearing over-ground stepping in an exoskeleton with non-invasive spinal cord neuromodulation after motor complete paraplegia.” In: Frontiers in Neuroscience 11 (2017), p. 333. Garivier, Aurélien and Olivier Cappé. “The KL-UCB algorithm for bounded stochas- tic bandits and beyond.” In: Proceedings of the 24th Annual Conference on Learn- ing Theory. 2011, pp. 359–376. Gittins, John. “A dynamic allocation index for the sequential design of experiments.” In: Progress in Statistics (1974), pp. 241–266. 
– “Bandit processes and dynamic allocation indices.” In: Journal of the Royal Statistical Society: Series B (Methodological) 41.2 (1979), pp. 148–164. Goemaere, Stefan et al. “Bone mineral status in paraplegic patients who do or do not perform standing.” In: Osteoporosis International 4.3 (1994), pp. 138–143. Golovin, Daniel and Andreas Krause. “Adaptive submodularity: Theory and appli- cations in active learning and stochastic optimization.” In: Journal of Artificial Intelligence Research 42 (2011), pp. 427–486. Gopalan, Aditya and Shie Mannor. “Thompson sampling for learning parameterized Markov decision processes.” In: Conference on Learning Theory. 2015, pp. 861– 898. Gourieroux, Christian and Alain Monfort. “Asymptotic properties of the maximum likelihood estimator in dichotomous logit models.” In: Journal of Econometrics 17.1 (1981), pp. 83–97. Granmo, Ole-Christoffer. “Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton.” In: International Journal of Intelligent Computing and Cybernetics 3.2 (2010), p. 207. Grimmett, Geoffrey R. and David R. Stirzaker. Probability and Random Processes. Oxford University Press, 2001. Gupta, Shanti S. and Klaus J. Miescke. “Bayesian look ahead one-stage sampling allocations for selection of the best population.” In: Journal of Statistical Planning and Inference 54.2 (1996), pp. 229–244. Gurriet, Thomas, Sylvain Finet, et al. “Towards restoring locomotion for para- plegics: Realizing dynamically stable walking on exoskeletons.” In: International Conference on Robotics and Automation. IEEE. 2018, pp. 2804–2811. 140 Gurriet, Thomas, Maegan Tucker, et al. “Towards variable assistance for lower body exoskeletons.” In: IEEE Robotics and Automation Letters 5.1 (2019), pp. 266– 273. Harib, Omar et al. “Feedback control of an exoskeleton for paraplegics: Toward robustly stable, hands-free dynamic walking.” In: IEEE Control Systems Magazine 38.6 (2018), pp. 61–87. Hazan, Elad and Kfir Levy. “Bandit convex optimization: Towards tight bounds.” In: Conference on Neural Information Processing Systems. 2014, pp. 784–792. Hennig, Philipp and Christian J. Schuler. “Entropy search for information-efficient global optimization.” In: The Journal of Machine Learning Research 13.1 (2012), pp. 1809–1837. Herbrich, Ralf, Thore Graepel, and Klaus Obermayer. “Support vector learning for ordinal regression.” In: International Conference on Artificial Neural Networks. IET. 1999, pp. 97–102. Hoffman, Matthew, Eric Brochu, and Nando de Freitas. “Portfolio allocation for Bayesian optimization.” In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. 2011, pp. 327–336. Hong, Xiaodan et al. “A weighted Gaussian process regression for multivariate mod- elling.” In: 2017 6th International Symposium on Advanced Control of Industrial Processes (AdCONIP). IEEE. 2017, pp. 195–200. Houlsby, Neil et al. “Bayesian active learning for classification and preference learning.” In: arXiv preprint arXiv:1112.5745 (2011). Ibarz, Borja et al. “Reward learning from human preferences and demonstrations in Atari.” In: Conference on Neural Information Processing Systems. 2018, pp. 8011–8023. Indego. http://www.indego.com/indego/en/home, Last accessed on 2019-09- 14. Jain, Ashesh, Shikhar Sharma, et al. “Learning preferences for manipulation tasks from online coactive feedback.” In: The International Journal of Robotics Re- search 34.10 (2015), pp. 1296–1313. Jain, Shomik, Balasubramanian Thiagarajan, et al. 
“Modeling engagement in long- term, in-home socially assistive robot interventions for children with autism spec- trum disorders.” In: Science Robotics 5.39 (2020). doi: 10.1126/scirobotics. aaz3791. eprint: https : / / robotics . sciencemag . org / content / 5 / 39/eaaz3791.full.pdf. url: https://robotics.sciencemag.org/ content/5/39/eaaz3791. Jaksch, Thomas, Ronald Ortner, and Peter Auer. “Near-optimal regret bounds for reinforcement learning.” In: Journal of Machine Learning Research 11 (2010), pp. 1563–1600. 141 Joachims, Thorsten et al. “Accurately interpreting clickthrough data as implicit feedback.” In: SIGIR. 2005, pp. 154–161. Jones, Donald R., Matthias Schonlau, and William J. Welch. “Efficient global opti- mization of expensive black-box functions.” In: Journal of Global Optimization 13.4 (1998), pp. 455–492. Jordan, Michael Irwin. Learning in Graphical Models. The MIT Press, 1999. Kaiser, Lukasz et al. “Model-based reinforcement learning for Atari.” In: arXiv preprint arXiv:1903.00374 (2019). Published as a conference paper at the Inter- national Conference on Learning Representations (ICLR), 2020. Kanagawa, Motonobu et al. “Gaussian processes and kernel methods: A review on connections and equivalences.” In: arXiv preprint arXiv:1807.02582 (2018). Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. “On Bayesian upper confidence bounds for bandit problems.” In: International Conference on Artificial Intelligence and Statistics. 2012, pp. 592–600. Kearns, Michael and Satinder Singh. “Near-optimal reinforcement learning in poly- nomial time.” In: Machine Learning 49.2-3 (2002), pp. 209–232. Kim, Myunghee et al. “Human-in-the-loop Bayesian optimization of wearable de- vice parameters.” In: PloS One 12.9 (2017), e0184054. Kirschner, Johannes and Andreas Krause. “Information directed sampling and ban- dits with heteroscedastic noise.” In: Conference on Learning Theory. 2018, pp. 358–384. Kirschner, Johannes, Mojmir Mutny, et al. “Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces.” In: International Conference on Machine Learning. 2019, pp. 3429–3438. Kleinberg, Robert, Aleksandrs Slivkins, and Eli Upfal. “Multi-armed bandits in metric spaces.” In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing. 2008, pp. 681–690. Kolter, J. Zico and Andrew Y. Ng. “Near-Bayesian exploration in polynomial time.” In: Proceedings of the 26th Annual International Conference on Machine Learn- ing. 2009, pp. 513–520. Komiyama, Junpei et al. “Regret lower bound and optimal algorithm in dueling bandit problem.” In: Conference on Learning Theory. 2015, pp. 1141–1154. Kupcsik, Andras, David Hsu, and Wee Sun Lee. “Learning dynamic robot-to-human object handover from human feedback.” In: Robotics Research. Springer, 2018, pp. 161–176. Lai, Tze Leung and Herbert Robbins. “Asymptotically efficient adaptive allocation rules.” In: Advances in Applied Mathematics 6.1 (1985), pp. 4–22. 142 Lattimore, Tor and Marcus Hutter. “PAC bounds for discounted MDPs.” In: Inter- national Conference on Algorithmic Learning Theory. Springer. 2012, pp. 320– 334. Lester, James C., Brian A. Stone, and Gary D. Stelling. “Lifelike pedagogical agents for mixed-initiative problem solving in constructivist learning environments.” In: User Modeling and User-Adapted Interaction 9.1-2 (1999), pp. 1–44. Li, Cheng, Sunil Gupta, et al. “High dimensional Bayesian optimization using dropout.” In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 
2017, pp. 2096–2102. Li, Kejun, Maegan Tucker, et al. “ROIAL: Region of Interest Active Learning for Characterizing Exoskeleton Gait Preference Landscapes.” In: arXiv preprint arXiv:2011.04812 (2020). Lillicrap, Timothy P. et al. “Continuous control with deep reinforcement learning.” In: arXiv preprint arXiv:1509.02971 (2015). Published as a conference paper at the International Conference on Learning Representations (ICLR), 2016. Liu, Tie-Yan. “Learning to rank for information retrieval.” In: Foundations and Trends® in Information Retrieval 3.3 (2009), pp. 225–331. May, Benedict C., Nathan Korda, et al. “Optimistic Bayesian sampling in contextual- bandit problems.” In: The Journal of Machine Learning Research 13.1 (2012), pp. 2069–2106. May, Benedict C. and David S. Leslie. “Simulation studies in optimistic Bayesian sampling in contextual-bandit problems.” In: Statistics Group, Department of Mathematics, University of Bristol 11.02 (2011). McCullagh, Peter and John A Nelder. Generalized Linear Models. 2nd Edition. London, UK: Chapman and Hall, 1989. Mehrholz, Jan et al. “Electromechanical-assisted training for walking after stroke (review).” In: Cochrane Database of Systematic Reviews 5 (2017). John Wiley & Sons, pp. 1–125. Metropolis, Nicholas et al. “Equation of state calculations by fast computing ma- chines.” In: The Journal of Chemical Physics 21.6 (1953), pp. 1087–1092. Minka, Thomas and John Lafferty. “Expectation-propagation for the generative aspect model.” In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. 2002, pp. 352–359. Minka, Thomas P. “Expectation propagation for approximate Bayesian inference.” In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intel- ligence. 2001, pp. 362–369. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, et al. “Playing Atari with deep reinforcement learning.” In: arXiv preprint arXiv:1312.5602 (2013). 143 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. “Human- level control through deep reinforcement learning.” In: Nature 518.7540 (2015), pp. 529–533. Murphy, Kevin P. Conjugate Bayesian Analysis of the Gaussian Distribution. Tech. rep. University of British Columbia, 2007. – Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. Neu, Gergely and Gábor Bartók. “An efficient algorithm for learning with semi- bandit feedback.” In: International Conference on Algorithmic Learning Theory. Springer. 2013, pp. 234–248. Nikolov, Nikolay et al. “Information-directed exploration for deep reinforcement learning.” In: arXiv preprint arXiv:1812.07544 (2018). Published as a conference paper at the International Conference on Learning Representations (ICLR), 2019. Novoseller, Ellen R. et al. Dueling Posterior Sampling for Preference-Based Re- inforcement Learning. Code for the DPS algorithm. Available at: https:// github.com/ernovoseller/DuelingPosteriorSampling. 2020. – “Dueling posterior sampling for preference-based reinforcement learning.” In: Conference on Uncertainty in Artificial Intelligence (UAI). PMLR. 2020, pp. 1029– 1038. url: http://proceedings.mlr.press/v124/novoseller20a.html. Osband, Ian, Charles Blundell, et al. “Deep exploration via bootstrapped DQN.” In: Conference on Neural Information Processing Systems. 2016, pp. 4026–4034. Osband, Ian, Daniel Russo, and Benjamin Van Roy. “(More) efficient reinforce- ment learning via posterior sampling.” In: Conference on Neural Information Processing Systems. 2013, pp. 3003–3011. 
Osband, Ian and Benjamin Van Roy. “Why is posterior sampling better than op- timism for reinforcement learning?” In: International Conference on Machine Learning. 2017, pp. 2701–2710. Osband, Ian, Benjamin Van Roy, and Zheng Wen. “Generalization and exploration via randomized value functions.” In: International Conference on Machine Learn- ing. 2016, pp. 2377–2386. Ouyang, Yi et al. “Learning unknown Markov decision processes: A Thompson sampling approach.” In: Conference on Neural Information Processing Systems. 2017, pp. 1333–1342. Payne, John W. et al. The Adaptive Decision Maker. Cambridge University Press, 1993. Plaat, Aske, Walter Kosters, and Mike Preuss. “Model-based deep reinforcement learning for high-dimensional problems, a survey.” In: arXiv preprint arXiv:2008.05598 (2020). Popov, Ivaylo et al. “Data-efficient deep reinforcement learning for dexterous ma- nipulation.” In: arXiv preprint arXiv:1704.03073 (2017). 144 Radlinski, Filip and Thorsten Joachims. “Query chains: Learning to rank from implicit feedback.” In: ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM. 2005, pp. 239–248. Raman, Karthik et al. “Stable coactive learning via perturbation.” In: International Conference on Machine Learning. 2013, pp. 837–845. Rasmussen, Carl Edward and Christopher K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. Reher, Jacob P.et al. “Algorithmic foundations of realizing multi-contact locomotion on the humanoid robot DURUS.” In: Algorithmic Foundations of Robotics XII. Springer, Cham, 2020, pp. 400–415. url: http : / / ames . caltech . edu / reher2020algorithmic.pdf. Ren, Shixin et al. “Personalized gait trajectory generation based on anthropometric features using Random Forest.” In: Journal of Ambient Intelligence and Human- ized Computing (2019), pp. 1–12. doi: https://doi.org/10.1007/s12652- 019-01390-3. ReWalk. https://rewalk.com/, Last accessed on 2019-09-14. Rex Bionics. https://www.rexbionics.com/, Last accessed on 2019-09-14. Robbins, Herbert. “Some aspects of the sequential design of experiments.” In: Bulletin of the American Mathematical Society 58.5 (1952), pp. 527–535. Ruan, Xiaogang et al. “Mobile robot navigation based on deep reinforcement learn- ing.” In: 2019 Chinese Control and Decision Conference (CCDC). IEEE. 2019, pp. 6174–6178. Rusmevichientong, Paat and John N. Tsitsiklis. “Linearly parameterized bandits.” In: Mathematics of Operations Research 35.2 (2010), pp. 395–411. Russo, Daniel and Benjamin Van Roy. “An information-theoretic analysis of Thomp- son sampling.” In: The Journal of Machine Learning Research 17.1 (2016), pp. 2442–2471. – “Learning to optimize via information-directed sampling.” In: Conference on Neural Information Processing Systems. 2014, pp. 1583–1591. – “Learning to optimize via posterior sampling.” In: Mathematics of Operations Research 39.4 (2014), pp. 1221–1243. Ryzhov, Ilya O., Warren B. Powell, and Peter I. Frazier. “The knowledge gradient al- gorithm for a general class of online learning problems.” In: Operations Research 60.1 (2012), pp. 180–195. Sadigh, Dorsa et al. “Active preference-based learning of reward functions.” In: Con- ference on Robotics: Science and Systems XIII. July 12 - July 16, 2017, Cambridge, Massachusetts, USA. 2017. url: http://www.roboticsproceedings.org/ rss13/p53.pdf. 145 Saha, Ankan and Ambuj Tewari. 
“Improved regret guarantees for online smooth convex optimization with bandit feedback.” In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 636– 642. Schuth, Anne, Harrie Oosterhuis, et al. “Multileave gradient descent for fast online learning to rank.” In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. 2016, pp. 457–466. Schuth, Anne, Floor Sietsma, et al. “Multileaved comparisons for fast online evalua- tion.” In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. 2014, pp. 71–80. Shivaswamy, Pannaga and Thorsten Joachims. “Coactive learning.” In: Journal of Artificial Intelligence Research 53 (2015), pp. 1–40. – “Online structured prediction via coactive learning.” In: International Conference on Machine Learning. 2012, pp. 59–66. Somers, Thane and Geoffrey A. Hollinger. “Human–robot planning and learning for marine data collection.” In: Autonomous Robots 40.7 (2016), pp. 1123–1137. Srinivas, Niranjan et al. “Gaussian process optimization in the bandit setting: No re- gret and experimental design.” In: International Conference on Machine Learning (ICML). 2010, pp. 1015–1022. Stroke Facts. Centers for Disease Control and Prevention. https://www.cdc. gov/stroke/facts.htm. 2020. Sui, Yanan and Joel W. Burdick. “Clinical online recommendation with subgroup rank feedback.” In: Proceedings of the 8th ACM Conference on Recommender Systems. 2014, pp. 289–292. Sui, Yanan, Yisong Yue, and Joel W. Burdick. “Correlational dueling bandits with application to clinical treatment in large decision spaces.” In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. 2017, pp. 2793–2799. Sui, Yanan, Vincent Zhuang, et al. “Multi-dueling bandits with dependent arms.” In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. Aug. 11-15, 2017, Sydney, Australia. 2017. url: http://auai.org/uai2017/ proceedings/papers/155.pdf. – “Stagewise safe Bayesian optimization with Gaussian processes.” In: Interna- tional Conference on Machine Learning. 2018, pp. 4781–4789. Sui, Yanan, Masrour Zoghi, et al. “Advancements in dueling bandits.” In: Interna- tional Joint Conference on Artificial Intelligence. 2018, pp. 5502–5510. Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018. 146 Szörényi, Balázs et al. “Online rank elicitation for Plackett-Luce: A dueling bandits approach.” In: Conference on Neural Information Processing Systems. 2015, pp. 604–612. Thatte, Nitish, Helei Duan, and Hartmut Geyer. “A method for online optimization of lower limb assistive devices with high dimensional parameter spaces.” In: International Conference on Robotics and Automation. IEEE. 2018, pp. 1–6. Thompson, William R. “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.” In: Biometrika 25.3/4 (1933), pp. 285–294. Tucker, Maegan, Myra Cheng, et al. Human Preference-Based Learning for High- dimensional Optimization of Exoskeleton Walking Gaits. Video of the experi- mental results for Tucker, Cheng, et al., “Human preference-based learning for high-dimensional optimization of exoskeleton walking gaits,” IROS. Available at: https://www.youtube.com/watch?v=c6a0kXMyML0&feature=youtu.be. 2020. 
– “Human preference-based learning for high-dimensional optimization of ex- oskeleton walking gaits.” In: IEEE International Conference on Intelligent Robots and Systems (IROS). 2020. url: https://arxiv.org/pdf/2003.06495.pdf. – LineCoSpar. Code for the LineCoSpar algorithm. Available at: https : / / github.com/myracheng/linecospar. 2020. Tucker, Maegan, Ellen R. Novoseller, et al. CoSpar: Online Learning from Human Preference and Coactive Feedback. Code for the CoSpar algorithm. Available at: https://github.com/ernovoseller/CoSpar. 2020. – “Preference-based learning for exoskeleton gait optimization.” In: IEEE Interna- tional Conference on Robotics and Automation (ICRA). 2020. doi: 10.1109/ ICRA40945 . 2020 . 9196661. url: https : / / ieeexplore . ieee . org / document/9196661. – Preference-Based Learning on Lower-Body Exoskeletons. Video of the experi- mental results for Tucker, Novoseller, et al., “Preference-based learning for ex- oskeleton gait optimization,” ICRA. Available at: https://www.youtube.com/ watch?v=-27sHXsvONE&feature=youtu.be. 2020. Urvoy, Tanguy et al. “Generic exploration and k-armed voting bandits.” In: Interna- tional Conference on Machine Learning (ICML). 2013, pp. 91–99. Wainwright, Martin J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48. Cambridge University Press, 2019. Wainwright, Martin J. and Michael I. Jordan. “Graphical models, exponential fam- ilies, and variational inference.” In: Machine Learning 1.1-2 (2008), pp. 1–305. Wandercraft. http://www.wandercraft.eu/, Last accessed on 2017-09-15. 147 Wang, Ziyu et al. “Bayesian optimization in a billion dimensions via random em- beddings.” In: Journal of Artificial Intelligence Research 55 (2016), pp. 361– 387. Westervelt, Eric R., Jessy W.Grizzle, Christine Chevallereau, et al. Feedback Control of Dynamic Bipedal Robot Locomotion. CRC press, 2018. Westervelt, Eric R., Jessy W. Grizzle, and Daniel E. Koditschek. “Hybrid zero dynamics of planar biped walkers.” In: IEEE Transactions on Automatic Control 48.1 (2003), pp. 42–56. Wilson, Aaron, Alan Fern, and Prasad Tadepalli. “A Bayesian approach for policy learning from trajectory preference queries.” In: Conference on Neural Informa- tion Processing Systems. 2012, pp. 1133–1141. Wirth, Christian. “Efficient Preference-based Reinforcement Learning.” PhD thesis. Technische Universität, 2017. Wirth, Christian, Riad Akrour, et al. “A survey of preference-based reinforcement learning methods.” In: The Journal of Machine Learning Research 18.1 (2017), pp. 4945–4990. Wirth, Christian and Johannes Fürnkranz. “A policy iteration algorithm for learning from preference-based feedback.” In: International Symposium on Intelligent Data Analysis. Springer. 2013, pp. 427–437. – “EPMC: Every visit preference Monte Carlo for reinforcement learning.” In: Asian Conference on Machine Learning. 2013, pp. 483–497. Wirth, Christian, Johannes Fürnkranz, and Gerhard Neumann. “Model-free preference- based reinforcement learning.” In: Thirtieth AAAI Conference on Artificial Intel- ligence. 2016, pp. 2222–2228. Wolff, Jamie et al. “A survey of stakeholder perspectives on exoskeleton technology.” In: Journal of Neuroengineering and Rehabilitation 11.1 (2014), p. 169. Wolfman, Steven A. et al. “Mixed initiative interfaces for learning tasks: SMARTedit talks back.” In: Proceedings of the 6th International Conference on Intelligent User Interfaces. ACM. 2001, pp. 167–174. Wu, Huasen and Xin Liu. 
“Double Thompson sampling for dueling bandits.” In: Conference on Neural Information Processing Systems. 2016, pp. 649–657. Wu, Xinyu, Du-Xin Liu, et al. “Individualized gait pattern generation for sharing lower limb exoskeleton robot.” In: IEEE Transactions on Automation Science and Engineering 15.4 (2018), pp. 1459–1470. Ye, Hui, Anthony N. Michel, and Ling Hou. “Stability theory for hybrid dynamical systems.” In: IEEE Transactions on Automatic Control 43.4 (1998), pp. 461–474. Yue, Yisong, Josef Broder, et al. “The k-armed dueling bandits problem.” In: Journal of Computer and System Sciences 78.5 (2012), pp. 1538–1556. 148 Yue, Yisong, Thomas Finley, et al. “A support vector method for optimizing average precision.” In: International SIGIR Conference on Research and Development in Information Retrieval. ACM. 2007, pp. 271–278. Yue, Yisong and Thorsten Joachims. “Beat the Mean Bandit.” In: International Conference on Machine Learning (ICML). 2011, pp. 241–248. – “Interactively optimizing information retrieval systems as a dueling bandits prob- lem.” In: International Conference on Machine Learning (ICML). 2009, pp. 1201– 1208. Zanette, Andrea and Rahul Sarkar. Information Directed Reinforcement Learning. Tech. rep. Stanford University, 2017. Zhang, Juanjuan et al. “Human-in-the-loop optimization of exoskeleton assistance during walking.” In: Science 356.6344 (2017), pp. 1280–1284. Zoghi, Masrour, Zohar S. Karnin, et al. “Copeland dueling bandits.” In: Conference on Neural Information Processing Systems. 2015, pp. 307–315. Zoghi, Masrour, Shimon A. Whiteson, Maarten De Rijke, et al. “Relative confidence sampling for efficient on-line ranker evaluation.” In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining. 2014, pp. 73–82. Zoghi, Masrour, Shimon A. Whiteson, Rémi Munos, et al. “Relative upper confi- dence bound for the k-armed dueling bandit problem.” In: International Confer- ence on Machine Learning. 2014, pp. 10–18. Zoghi, Masrour, Shimon A. Whiteson, and Maarten de Rijke. “MergeRUCB: A method for large-scale online ranker evaluation.” In: ACM International Confer- ence on Web Search and Data Mining (WSDM). 2015, pp. 17–26. 149 A p p e n d i x A

MODELS FOR UTILITY INFERENCE AND CREDIT ASSIGNMENT

This appendix contains the mathematical details of the credit assignment mod- els evaluated in the Dueling Posterior Sampling (DPS) experiments, presented in Section 4.5. These Bayesian models can be used together with either the preference- based generalized linear bandit or RL settings.

A.1 Bayesian Linear Regression While Bayesian linear regression was already introduced in Section 4.3, this section briefly reviews the details of the experimental implementation.

Define - ∈ R#×3 as the observation matrix after # preferences, in which the 8th row # contains observation x8 = x82 − x81, while y ∈ R is the vector of corresponding 8th H  1 , 1 preference labels, with element 8 ∈ − 2 2 . Section 4.3 defines the Bayesian linear regression credit assignment models to which the theoretical guarantees apply. Because the V8 (X) factor necessary for the asymp- totic consistency guarantees results in a conservative covariance matrix leading to over-exploration, the DPS simulations implement the more practical variant given in Eq. (4.1) and restated here. A Gaussian prior is defined over the reward vector r ∈ R3: r ∼ N (0,_−1). The likelihood of the data conditioned upon r is also Gaussian: 1  1  ?(y|-, r; f2) = exp − ||y − - r||2 . # f2 (2cf2) 2 2 This conjugate prior and likelihood lead to the following closed-form posterior:

r|-, y, f2,_ ∼ N (-, Σ), where - = (-) - + f2_)−1 -) y and Σ = f2(-) - + f2_)−1.

A.2 Gaussian Process Regression Credit assignment via Gaussian processes (Rasmussen and Williams, 2006) ex- tends the linear credit assignment model in A.1 to larger numbers of features 3 by generalizing across similar features. For instance, in the RL setting, one could 150 learn over larger state and action spaces by generalizing across nearby states and actions. This section and the following one consider two Gaussian process-based credit assignment approaches.

To perform credit assignment via Gaussian process regression, one can assign binary labels to each observation based on whether it is preferred or dominated. A Gaussian process prior is placed upon the underlying utilities r; for instance, in the RL setting, this prior is placed on the utilities of the individual state-action pairs. Using that a bandit action or RL trajectory’s total utility is a sum over the utilities in each dimension (e.g., over each state-action pair in RL), this section shows how to perform inference over sums of Gaussian process variables to infer the component utilities in r from the total utilities. As the total utility of each bandit action or RL trajectory is not observed in practice, the obtained binary preference labels are substituted as approximations in their place.

To avoid having to constantly distinguish between the bandit and RL settings, the rest of this section adapts all notation and terminology for preference-based RL. For instance, the dimensions of r are referred to as utilities of state-action pairs, while in the bandit setting, they are utility weights corresponding to each dimension of

the action space. Similarly, in the RL setting, the observations x81, x82 correspond to trajectory features, while in the bandit setting, these are actions. However, the derived posterior update equations (Eq.s (A.2) and (A.3)) apply as-is to the generalized linear dueling bandit setting, and the two cases are mathematically-identical with respect to the methods introduced in this section.

Let {B˜1,..., B˜3 } denote the 3 = ( state-action pairs. In this section, the data matrix 2#×3 / ∈ R holds all state-action visitation vectors x:1, x:2, for DPS iterations : ∈ {1,...,#}. (This contrasts with the other credit assignment methods, which ) 8th / learn from their differences, x:2 − x:1.) Let z8 be the row of , such that ) / = [z1 ..., z2# ] , and z8 = x: 9 for some DPS iteration : and 9 ∈ {1, 2}, that is, z8 contains the state-action visit counts for the 8th trajectory rollout. In particular, the th th 8 9 matrix element I8 9 = [/]8 9 is the number of times that the 8 observed trajectory

z8 visits state-action B˜9 . 0 2# 8th H0 The label vector is y ∈ R , where the element 8 is the preference label th corresponding to the 8 -observed trajectory. For instance, if x82 x81, then x82 1 1 A B receives a label of 2 , while x81 is labelled − 2 . As before, ( ˜) denotes the underlying utility of state-action pair B˜, with D(g) being trajectory g’s total utility along the 151

state-action pairs it encounters.1 To infer r, each total utility D(g8) is approximated H0 with its preference label 8.

A Gaussian process prior is placed upon the rewards r: r ∼ GP(-r, A ), where 3 3×3 -r ∈ R is the prior mean and A ∈ R is the prior covariance matrix, such that [ A ]8 9 models the prior covariance between A(B˜8) and A(B˜9 ). The total utility of

trajectory g8, denoted D(g8), is modeled as a sum over the latent state-action utilities: D g Í3 I A B ' D g ' D g Y ( 8) = 9=1 8 9 ( ˜9 ). Let 8 be a noisy version of ( 8): 8 = ( 8) + 8, where Y , f2 8 ∼ N (0 Y ) is i.i.d. noise. Then, given rewards r:

3 Õ '8 = I8 9A(B˜9 ) + Y8. 9=1

Because any linear combination of jointly Gaussian variables is Gaussian, '8 is a 2# Gaussian process over the values {I81,...,I83 }. Let X ∈ R be the vector with th 8 element equal to '8. This section will calculate the relevant expectations and

covariances to show that r ∼ GP(-r, A ) and X have the following jointly-Gaussian distribution: " # " # " #! r - /) ∼ N r , A A . (A.1) - / ) / /) f2 X -r A A + Y The standard approach for obtaining a conditional distribution from a joint Gaussian distribution (Rasmussen and Williams, 2006) yields r|X ∼ N (-, Σ), where:

/) / /) f2 −1 / - = -r + A [ A + Y ] (X − -r) (A.2)

/) / /) f2 −1/ ) . Σ = A − A [ A + Y ] A (A.3)

In practice, the variable X is not observed. Instead, X is approximated with the observed preference labels y0, X ≈ y0, to perform credit assignment inference.

Next, this section derives the posterior inference equations (A.2) and (A.3) used in Gaussian process regression credit assignment. The state-action rewards r are inferred given noisy observations X of the trajectories’ total utilities via the following four steps, corresponding to the next four subsections:

1The concept of a trajectory’s total utility is analogous to a 3-dimensional action’s utility in the bandit setting, r) x for an action x ∈ A. A state-action utility A(B˜) is equal to a particular component ) th of r: r e 9 for some 9, where e 9 is a vector with 1 in the 9 component and zeros elsewhere. A state-action utility A(B˜) corresponds to the utility weight of an action space dimension in the bandit ) setting, which is also r e 9 (for some 9). 152 A) Model the state-action utilities A(B˜) as a Gaussian process over state-action pairs B˜.

B) Model the trajectory utilities X as a Gaussian process that results from sum- ming the state-action utilities A(B˜).

C) Using the two Gaussian processes defined in A) and B), obtain the covariance

matrix between the values of {A(B˜)|B˜ ∈ 1, . . . , 3} and {'8 |8 ∈ 1,..., 2#}.

D) Write the joint Gaussian distribution in Eq. (A.1) between the values of

{A(B˜)|B˜ ∈ 1, . . . , 3} and {'8 |8 ∈ 1,..., 2#}, and obtain the posterior distribu- tion of r over all state-action pairs given X (Eq.s (A.2) and (A.3)).

The State-Action Utility Gaussian Process The state-action utilities r are modeled as a Gaussian process over B˜, with mean

E[A(B˜)] = `A (B˜) and covariance kernel Cov(A(B˜8), A(B˜9 )) = KA (B˜8, B˜9 ) for all state- action pairs B˜8, B˜9 . For instance, KA could be the squared exponential kernel:

2! 1  || 5 (B˜8) − 5 (B˜9 )||  K (B˜ , B˜ ) = f2 exp − + f2X , (A.4) A 8 9 5 2 ; = 8 9

f2 ; f2 where 5 is the signal variance, is the kernel lengthscale, = is the noise variance, < X8 9 is the Kronecker delta function, and 5 : {1,...,(} × {1, . . . , } −→ R maps each state-action pair to an <-dimensional representation that encodes proximity between the state-action pairs. For instance, in the Mountain Car problem, each state-action pair could be represented by a position and velocity (encoding the state) and a one-dimensional action, so that < = 3. Thus,

A(B˜8) ∼ GP(`A (B˜8), KA (B˜8, B˜9 )).

3 th Define -A ∈ R such that the 8 element is [-A ]8 = `A (B˜8), the prior mean of state- 3×3 action B˜8’s utility. Let A ∈ R be the covariance matrix over state-action utilities, such that [ A ]8 9 = KA (B˜8, B˜9 ). Therefore, the reward vector r is also a Gaussian process:

r ∼ GP(-A , A ). 153 The Trajectory Utility Gaussian Process By assumption, the trajectory utilities X ∈ R2# are sums of the latent state-action utilities via the following relationship between X and r:

3 Õ '(z8) := '8 = I8 9A(B˜9 ) + Y8, 9=1 Y , f2 where 8 are i.i.d. noise variables distributed according to N(0 Y ). Note that 3 '(z8) is a Gaussian process over z8 ∈ R because {A(B˜9 ), ∀9} are jointly normally distributed by definition of a Gaussian process, and any linear combination of jointly Gaussian variables has a univariate normal distribution. Next, the expectation and th covariance of X is calculated. The expectation of the 8 element '8 = '(z8) can be expressed as:

 3  3 3 Õ  Õ Õ E['8] = E  I8 9A(B˜9 ) + Y8 = I8 9 E[A(B˜9 )] = I8 9 `A (B˜9 ).    9=1  9=1 9=1  

The expectation over X can thus be written as E[X(/)] = / -A . Next, the covariance th matrix of X is computed. The 8 9 element of this matrix is the covariance of '(z8)

and '(z 9 ):

Cov('(z8),'(z 9 )) = E['(z8)'(z 9 )] − E['(z8)]E['(z 9 )] " 3 ! 3 !# 3 ! 3 ! Õ Õ Õ Õ = E I8:A(B˜: ) + Y8 I 9

One can then write the covariance matrix of X as ', where:

' ,' ) f2 . [ ']8 9 := Cov( (z8) (z 9 )) = z8 A z 9 + Y I[8= 9] 154 / /) f2 From here, it can be seen that ' = A + Y : )  z1  ) )    z A z1 ... z A z2#   z)   1 1  )  2  h i  . . . .  2 / A / =  .  A z1 z2 ... z2# =  . . .  = ' − f .  .    Y    ) ... )    z A z1 z A z2#  z)   2# 2#   2#  Covariance between State-Action and Trajectory Utilities

This subsection considers the covariance between r and X, denoted A,':

[ A,']8 9 = Cov([r]8, [X] 9 ) = Cov(A(B˜8),'(z 9 )).

This covariance matrix can be expressed in terms of /, A , and -A : 3 ! Õ [ A,']8 9 = Cov(A(B˜8),'(z 9 )) = Cov A(B˜8), I 9 :A(B˜: ) + Y 9 :=1 " 3 # " 3 # Õ Õ = E A(B˜8) I 9 :A(B˜: ) + Y 9A(B˜8) − E[A(B˜8)]E I 9 :A(B˜: ) + Y 9 :=1 :=1 3 Õ I A B A B ` B ) = 9 : E[ ( ˜8) ( ˜: )] − [ A ( ˜8)][z 9 -A ] :=1 3 Õ I A B , A B A B A B ` B ) = 9 : {Cov( ( ˜8) ( ˜: )) + E[ ( ˜8)]E[ ( ˜: )]} − A ( ˜8)z 9 -A :=1 3 Õ I B , B ` B ` B ` B ) = 9 : [KA ( ˜8 ˜: ) + A ( ˜8) A ( ˜: )] − A ( ˜8)z 9 -A :=1 3 3 Õ I B , B ` B ) ` B ) Õ I B , B ) ) , = 9 : KA ( ˜8 ˜: ) + A ( ˜8)z 9 -A − A ( ˜8)z 9 -A = 9 : KA ( ˜8 ˜: ) = z 9 [ A ]8,: :=1 :=1 ) 8th where [ A ]8,: is the column vector obtained by transposing the row of A . It is ) evident that A,' = A / .

Posterior Inference over State-Action Utilities Merging the previous three subsections’ results, one obtains the following joint probability density between r and X: " # " # " #! r - /) ∼ N A , A A . / / ) / /) f2 X -A A A + Y

This relationship expresses all components of the joint Gaussian density in terms

of /, A , and -A , or in other words, in terms of the observed state-action visitation 155 counts (i.e., /) and the Gaussian process prior on r. The standard approach for obtaining a conditional distribution from a joint Gaussian distribution yields r|X ∼ N(-, Σ), where the expressions for - and Σ are given by Eq.s (A.2) and (A.3) above. By substituting y0 for X, the conditional posterior density of r can be expressed in 0 terms of /, y , A , and -r, that is, in terms of observed data and the Gaussian process prior parameters.

A.3 Gaussian Process Preference Model Lastly, this section shows how to extend the preference-based Gaussian process model defined in Chu and Ghahramani (2005b) from the dueling bandit setting to the preference-based RL setting to perform credit assignment. As in the previous section, the notation and terminology is adapted to the preference-based RL setting; however, the same mathematics apply to the generalized linear dueling bandit problem.

Similarly to Section A.2, this approach places a Gaussian prior over possible utility vectors r; in contrast, however, this method explicitly models the likelihood of the observed preferences given r, and thus it is a more theoretically-justified approach for handling preference data.

After # preferences have been elicited, the algorithm has collected a preference , ,H 8 ,...,# H 1 feedback dataset D = {(x81 x82 8) | = 1 }, where 8 = 2 indicates that x82 x81, that is, trajectory g82 is preferred to g81 in preference 8. As before, each

state-action pair B˜9 , 9 ∈ {1, . . . , 3}, is assumed to have a latent, underlying utility ) A(B˜9 ). In vector form, these are written: r = [A(B˜1), A(B˜2),..., A(B˜3)] . A Gaussian process prior is defined over the utilities r:

1  1  ?( ) − ) −1 , r = 3 1 exp r Σ r (A.5) (2c) 2 |Σ| 2 2

3×3 where Σ ∈ R and [Σ]8 9 = K(A(B˜8), A(B˜9 )) for some kernel function K, such as the squared exponential kernel defined in Eq. (A.4). Next, the likelihood of the 8th preference given utilities r is assumed to take the following form, where H0 H , 8 = 2 8 ∈ {−1 1}:  D(g ) − D(g )  %(g g | ) = %( | ) = 6 H0 ∗ 82 81 , 82 81 r x82 x81 r 8 2

where 6(·) is a monotonically-increasing link function with a range within [0, 1], and 2 > 0 is a model hyperparameter controlling the degree of preference noise.

The total return D(g81) of trajectory g81 can be written in terms of the corresponding 156 ) state-action visit counts, x81: D(g81) = r x81. Thus, the full likelihood expression is: # Ö %(D | r) = 6(I8), (A.6) 8=1 0 0 ) 0 ) H (D(g82) − D(g81)) H r (x82 − x81) H r x8 I = 8 = 8 = 8 . 8 : 2 2 2

Given the preference dataset D, one can model the posterior probability of r:

?(r | D) ∝ %(D | r)?(r),

where the expressions for the prior ?(r) and likelihood %(D | r) are given by Eq.s (A.5) and (A.6), respectively. This posterior can be estimated by the Laplace approximation, from which samples r˜ of the utilities r can easily be drawn:

r˜ ∼ N (rˆMAP,UΣMAP), where: (A.7) ( , rˆMAP = argminr (r) (A.8)  −1 2( , ΣMAP = ∇r (r)|rˆMAP (A.9) ( 1 ) −1 Í# 6 I and (r) := 2 r Σ r − 8=1 log ( 8) is the negative log posterior, neglecting con- stant terms with respect to r; lastly, U > 0 is a tunable hyperparameter that influences the balance between exploration and exploitation. In order for the Laplace approxi- mation to be valid, ((r) must be convex in r: this guarantees that the optimization problem in Eq. (A.8) is convex and that the covariance matrix defined by Eq. (A.9) is positive definite, and therefore a valid Gaussian covariance matrix. Convexity of ((r) can be established by demonstrating that its Hessian matrix is positive definite. 2( −1 It can be shown that for any r, ∇r (r) = Σ + Λ, where: # " # 1 Õ 600(I ) 60(I ) 2 [Λ] := [x ] [x ] − 8 + 8 , (A.10) <= 22 8 < 8 = 6(I ) 6(I ) 8=1 8 8

for x8 = x82 − x81. Because the prior covariance Σ is positive definite, to show that 2( ∇r (r) is positive definite, it suffices to show that Λ is positive semidefinite. From Eq. (A.10), one can see that: # " # 1 Õ 600(I ) 60(I ) 2 Λ = − 8 + 8 x x) . 22 6(I ) 6(I ) 8 8 8=1 8 8 ) Clearly, x8x8 is positive semidefinite, and thus, the following statement is a sufficient condition for convexity of ((r):  600(I) 60(I)  − + ≥ 0 for all I ∈ R. 6(I) 6(I) 157

In particular, this condition is satisfied for the Gaussian link function, 6Gaussian(·) = Φ(·), where Φ is the standard Gaussian CDF, as well as for the sigmoidal link 6 G f G 1 function, log( ) := ( ) = 1+exp(−G) . In this work, the experiments utilize the sigmoidal link function.

Bayesian Logistic Regression Notably, the Bayesian logistic regression inference model discussed in Section 4.3 is a special case of the Gaussian process preference model, in which 2 = 1, 6 is the sigmoidal link function, and the prior covariance matrix is diagonal, i.e. Σ = _; for instance, the latter condition occurs with the squared exponential kernel defined in Eq. (A.4) when its lengthscale ; is set to zero. In this thesis, a number of the experiments with the Gaussian process preference model fall under the special case of Bayesian logistic regression, and therefore, this model is briefly reviewed here.

In Bayesian logistic regression, the Gaussian prior over possible reward vectors 3 , _ _ > 8th H 1 r ∈ R is: r ∼ N (0 ), where 0. Setting the preference label 8 equal to 2 H 1 if x82 x81, while 8 = − 2 if x81 x82, the logistic regression likelihood is:

# # Ö Ö 1 ?(D|r) = ?(H |r, x ) = . 8 8 + (− H ) ) 8=1 8=1 1 exp 2 8x8 r

The experiments approximate the posterior, ?(r | D) ∝ ?(D | r)?(r), as Gaussian via the Laplace approximation:

?(r | D) ≈ N (rˆMAP,UΣMAP), where: rˆMAP = argmin 5 (r), 5 (r) := −log ?(D, r) = −log ?(r) − log ?(D|r), r (A.11)  −1 MAP 2 5 , Σ = ∇r (r) rˆ where the optimization in Eq. (A.11) is convex, and U > 0 is a tunable hyperpa- rameter that influences the balance between exploration and exploitation. Note that multiplying the covariance by a well-tuned U is more practical than using the V8 (X) parameters considered in the asymptotic consistency analysis (Section 4.4), as the latter results in overly-conservative covariance matrices in practice. 158 A p p e n d i x B

PROOFS OF ASYMPTOTIC CONSISTENCY FOR DUELING POSTERIOR SAMPLING

This appendix proves the asymptotic consistency results stated in Section 4.4. The details are organized into three sections, which prove:

1. In the preference-based RL setting, samples from the model posterior over transition dynamics parameters converge in distribution to the true transition probabilities.

2. In both the preference-based generalized linear bandit and RL settings, sam- ples from the utility posterior converge in distribution to the true utilities.

3. DPS’s selected policies converge in distribution to the optimal policy in the preference-based RL setting. DPS’s selected actions converge to the optimal action in the generalized linear bandit setting with a finite action space A.

Please refer to Section 4.2 to review relevant notation, e.g. for the posterior samples drawn in each iteration. In addition, the following notation is used for the value function and for policies given by value iteration:

Definition 5 (Value function given transition dynamics, rewards, and a policy). Define + ( p, r, c) as the value function over a length-ℎ episode—i.e., the expected total reward in the episode—under transition dynamics p ∈ R(2 , rewards r ∈ R(, and policy c:

" ℎ # Õ Õ + ( , , c) ? (B) A(B , c(B ,C)) B B, , . p r = 0 E C C 1 = p = p r = r B∈S C=1 Definition 6 (Optimal deterministic policy given transition dynamics and rewards). c , + , , c Define E8 ( p r) := argmaxc ( p r ) as the optimal deterministic policy given transition dynamics p ∈ R(2  and rewards r ∈ R( (breaking ties randomly if multiple deterministic policies achieve the maximum). Note that cE8 ( p, r) can be found via finite-horizon value iteration: defining +c,C (B) as in Eq. (3.6), set

+c,ℎ+1(B) := 0 for each B ∈ S and use the Bellman equation to calculate +c,C (B) 159 successively for C ∈ {ℎ, ℎ − 1,..., 1} given p and r: " # c B,C A B, 0 Õ % B B0 B B, 0 0 + B0 , ( ) = argmax0∈A ( ) + ( C+1 = | C = C = ) c,C+1( ) B0∈S " # Õ Õ 0 0 +c,C (B) = I[c(B,C)=0] A(B, 0) + %(BC+1 = B | BC = B, 0C = 0)+c,C+1(B ) . 0∈A B0∈S

As value iteration results in only deterministic policies, of which there are finitely- (ℎ c , many (more precisely, there are ), the maximum argument E8 ( p r) := argmaxc + ( p, r, c) is taken over a finite policy class.

" _ Í=−1 ) Recall that = := + 8=1 x8x8 (see Eq. (4.2)). For the linear link function, the (8) V X 2"−1 posterior sampling distribution’s covariance is given by Σ = = ( ) = . For the (8) V X 2"0 logistic link function, the posterior covariance is given by Σ = = ( ) =, where  60 (G) 2 600 (G) "0 = _ +Í=−1 6˜(2H x) rˆ )x x) , 6˜(G) := log − log comes from the Laplace = 8=1 8 8 = 8 8 6log (G) 6log (G) approximation, and 6log is the sigmoid function.

3×3 As in Section 4.4, it is notationally convenient to define a matrix "˜ = ∈ R such that:

 Í=−1 ) "˜ = = "= = _ + x8x for the linear link function, and  8=1 8 (B.1) "˜ "0 _ Í=−1 6 H ) )  = = = = + 8=1 ˜(2 8x8 rˆ=)x8x8 for the logistic link function.  Then, under either link function, the posterior sampling distribution has covariance (=) V X 2"˜ −1 Σ = = ( ) = .

Finally, notation is defined for the eigenvectors and eigenvalues of the matrices "8 and "˜ 8:

a(8) 9th Definition 7 (Eigenvalue notation). Let 9 refer to the -largest eigenvalue of " (8) _(8) 9th 8, and u 9 denote its corresponding eigenvector. Similarly, let 9 refer to the - "˜ (8) largest eigenvalue of 8, and v 9 denote its corresponding eigenvector. Note that "−1 (8) 1 " 8 also has eigenvectors u 9 , with corresponding eigenvalues (8) . Because 8 is a 9 (8) a(8) > positive definite, the eigenvectors {u 9 } form an orthonormal basis, and 9 0 for all 8, 9. The equivalent statements also hold for "˜ 8, which is also positive definite because 6˜(G) > 0 for all possible inputs. 160 B.1 Facts about Convergence in Distribution Before proceeding with the asymptotic consistency proofs, two facts about conver- gence in distribution are reviewed; these will be applied later.

Recall that for a random variable - and a sequence of random variables (-=), = ∈ N,  % -= −→ - denotes that -= converges to - in distribution, while -= −→ - denotes

that -= converges to - in probability.

3 Fact 8 (Billingsley, 1968). For random variables x, x=, ∈ R , where = ∈ N, and any 3   continuous function 6 : R −→ R, if x= −→ x, then 6(x=) −→ 6(x).

3 Fact 9 (Billingsley, 1968). For random variables x= ∈ R , = ∈ N, and constant 3  % vector c ∈ R , x= −→ c is equivalent to x= −→ c. Convergence in probability means that for any Y > 0,%(||x= − c||2 ≥ Y) −→ 0 as = −→ ∞.

B.2 Asymptotic Consistency of the Transition Dynamics in DPS in the Preference- Based RL Setting DPS models state transition dynamics independently for each state-action pair. For a given state-action pair, a Dirichlet model estimates the probability of transitioning to each possible subsequent state. The prior and posterior distributions are both Dirichlet; because the Dirichlet and multinomial distributions are conjugate (see Section 2.1), each state-action pair’s posterior can be updated easily using the observed transitions from that state-action. Each time that DPS draws a sample from the dynamics distribution, values are sampled for all (2  transition parameters, 0 0 {%(BC+1 = B | BC = B, 0C = 0) | B,B ∈ S, 0 ∈ A}.

To demonstrate convergence in distribution of the sampled transition dynamics parameters, first, Lemma 3 shows that if every state-action pair is visited infinitely often, the desired result holds. Then, Lemma 6 completes the argument by showing that DPS indeed visits each state-action pair infinitely often.

Lemma 3. If every state-action pair is visited infinitely often, then the sampled tran-  sition dynamics parameters converge in distribution to their true values: p˜81, p˜82 −→  p as 8 −→ ∞, where −→ denotes convergence in distribution.

Proof. Denote the 3 = ( state-action pairs as B˜1,..., B˜3. At a particular DPS episode, let = 9 be the number of visits to B˜9 and = 9 : be the number of observed transitions th th ( 9) from B˜9 to the : subsequent state. For the 9 state-action pair at iteration 8, let p , p˜( 9), pˆ( 9), pˆ0( 9) ∈ R( be the true, sampled, posterior mean, and maximum likelihood 161 dynamics parameters, respectively (hiding the dependency on the DPS episode 81 ( 9) or 82 for the latter three quantities); thus, [ p ]: denotes the true probability of th th transitioning from state-action pair B˜9 to the : state, and analogously for the : elements of p˜( 9), pˆ( 9), and pˆ0( 9). Then, from the Dirichlet model,

= 9 : + U 9 :,0 [ pˆ( 9)] = , : = Í( U 9 + <=1 9<,0

( 9) 1 U ,...,U ) where the prior for p is Í( [ 91,0 9(,0] for user-defined hyperparam- <=1 U 9<,0 eters U > 0. Meanwhile, the maximum likelihood is given by [ pˆ0( 9)] = = 9: 9 :,0 : max(= 9 ,1) ( 9) (this is equivalent to [ pˆ ]: , except with the prior parameters set to zero). Consider the sampled dynamics at state-action pair B˜9 . For any Y > 0,

( 9) ( 9)  ( 9) ( 9) ( 9) 0( 9) 0( 9) ( 9)  % || p˜ − p ||1 ≥ Y = % || p˜ − pˆ + pˆ − pˆ + pˆ − p ||1 ≥ Y (0) ( 9) ( 9) ( 9) 0( 9) 0( 9) ( 9)  ≤ % || p˜ − pˆ ||1 + || pˆ − pˆ ||1 + || pˆ − p ||1 ≥ Y  Y Ø Y Ø Y  ≤ % || p˜( 9) − pˆ( 9) || ≥ || pˆ( 9) − pˆ0( 9) || ≥ || pˆ0( 9) − p( 9) || ≥ 1 3 1 3 1 3 (1)  Y   Y   Y  ≤ % || p˜( 9) − pˆ( 9) || ≥ + % || pˆ( 9) − pˆ0( 9) || ≥ + % || pˆ0( 9) − p( 9) || ≥ , 1 3 1 3 1 3 (B.2) where (a) holds due to the triangle inequality and (b) follows from the union bound.

This proof will upper-bound each term in Eq. (B.2) in terms of = 9 and show that it decays as = 9 −→ ∞, that is, as B˜9 is visited infinitely often. For the first term, this bound is achieved via Chebyshev’s inequality:

( !  Y  Ø n Y o % || p˜( 9) − pˆ( 9) || ≥ ≤ % [ p˜( 9)] − [ pˆ( 9)] ≥ 1 3 : : 3( :=1 ( ( (0) Õ  Y  (1) Õ 9(2 h i ≤ % [ p˜( 9)] − [ pˆ( 9)] ≥ ≤ Var [ p˜( 9)] , : : 3( Y2 : :=1 :=1 where (a) follows from the union bound and (b) is an application of Chebyshev’s inequality. For a Dirichlet variable - with parameters (U1,...,U(), U: > 0 for each th :, the variance of the : component -: is given by: U˜ (1 − U˜ ) 1 1 Var[- ] = : : ≤ ∗ , : Í( U 2 Í( U 1 + <=1 < 1 + <=1 <

U U: ( 9) where ˜ : := Í( . In the DPS algorithm, p˜ is drawn from a Dirichlet distribu- <=1 U< 162

tion with parameters (U 91,...,U 9() = (U 91,0 + = 91,...,U 9(,0 + = 9(), so that, h i 1 1 1 1 Var [ p˜( 9)] ≤ ∗ = ∗ : 2 Í( U 2 Í( U = 1 + <=1 9< 1 + <=1( 9<,0 + 9<) 1 1 1 ≤ ∗ = . 2 Í( = 2(1 + = ) 1 + <=1 9< 9 Thus, ! ( Y Õ 9(2 1 9(3 % || p˜( 9) − pˆ( 9) || ≥ ≤ = . 1 3 Y2 2(1 + = ) 2Y2(1 + = ) :=1 9 9 Considering the second term in Eq. (B.2),

! ( ! Y Ø n Y o % || pˆ( 9) − pˆ0( 9) || ≥ ≤ % [ pˆ( 9) − pˆ0( 9)] ≥ 1 3 : 3( :=1 ( ( Í( ! (0) Õ  Y  (1) Õ U 9 :,0 + U 9<,0 Y ≤ % [ pˆ( 9)] − [ pˆ0( 9)] ≥ ≤ % <=1 ≥ , : : 3( = Í( U 3( :=1 :=1 9 + <=1 9<,0

where (a) holds via the union bound and (b) follows for = 9 ≥ 1 because when = 9 ≥ 1: Í( = 9 : + U 9 :,0 = 9 : U 9 :,0 = 9 : U 9<,0 [ pˆ( 9)] − [ pˆ0( 9)] = − = − <=1 : : = Í( U = = Í( U = = Í( U 9 + <=1 9<,0 9 9 + <=1 9<,0 9 ( 9 + <=1 9<,0) U = Í( U U + Í( U ≤ 9 :,0 + 9 : <=1 9<,0 ≤ 9 :,0 <=1 9<,0 . = Í( U = = Í( U = Í( U 9 + <=1 9<,0 9 9 + <=1 9<,0 9 + <=1 9<,0

For the third term in Eq. (B.2), one can apply the following concentration inequality for Dirichlet variables (see Appendix C.1 in Jaksch, Ortner, and Auer, 2010):

2 ! −= 9 Y %(|| pˆ0( 9) − p( 9) || ≥ Y) ≤ (2( − 2) exp . 1 2

Therefore: 2 !  Y  −= 9 Y % || pˆ0( 9) − p( 9) || ≥ ≤ (2( − 2) exp . 1 3 18

Thus, to upper-bound the right-hand side of Eq. (B.2), for any Y > 0:

3 ( Í( ! 2 ! 9( Õ U 9 :,0 + U 9<,0 Y −= 9 Y %|| p˜( 9)− p( 9) || ≥ Y ≤ + % <=1 ≥ +(2(−2) exp . 1 2Y2(= + 1) = Í( U 3( 18 9 :=1 9 + <=1 9<,0

On the right hand side, the first and third terms clearly decay as = 9 −→ ∞. The middle term is identically zero for = 9 large enough, since the U 9 :,0 values are user- defined constants. Given this inequality, it is clear that for any Y > 0, as = 9 −→ ∞, 163 ( 9) ( 9)  % || p˜ − p ||1 ≥ Y −→ 0. If every state-action pair is visited infinitely often, ( 9) ( 9) then = 9 −→ ∞ for each 9, and therefore, p˜ converges in probability to p : % p˜( 9) −→ p( 9). Convergence in probability implies convergence in distribution, the desired result.

To continue proving that DPS’s model of the transition dynamics converges, this

analysis uses that the magnitude of the utility estimator ||rˆ=||2, the mean of the utility posterior sampling distribution, is uniformly upper-bounded; in other words, there

exists 1 < ∞ such that ||rˆ=||2 ≤ 1.

Lemma 4. When preferences are given by a linear or logistic link function, across all = ≥ 1, there exists some 1 < ∞ such that estimated reward at DPS trial = is

bounded by 1: ||rˆ=||2 ≤ 1.

Proof. Firstly, if the link function is logistic, the desired result holds automatically by the definition of rˆ= given in Eq. (4.10): the quantity is projected onto the compact set Θ ⊂ R3 of all possible values of r, and a compact set on R3 must be bounded.

Secondly, the result is proven in the case of a linear link function. In this case, recall that the MAP reward estimate rˆ= is the solution to a ridge regression problem:

(=−1 ) (=−1 ) Õ Õ  1  rˆ = arg inf (x) r − H )2 + _||r||2 = arg inf (x) r − H )2 + _||r||2 . = r 8 8 2 r 8 8 = − 1 2 8=1 8=1 (B.3)

The desired result is proven by contradiction. Assuming that there exists no upper 1 bound , the proof will identify a subsequence (rˆ=8 ) of MAP estimates whose lengths increase unboundedly, but whose directions converge. Then, it will show that such vectors fail to minimize the objective in Eq. (B.3), achieving a contradiction.

Firstly, the vectors x8 = x82 − x81 have bounded magnitude: in the bandit case,

x81, x82 ∈ A, and the action space A is compact, while in the RL setting, ||x8 9 ||1 = ℎ 9 , H  1 , 1 for ∈ {1 2}. The binary labels 8 are also bounded, as they take values in − 2 2 . ) H 2 1 _ 2 1 Note that for r = 0, (x8 r − 8) + =−1 ||r||2 = 4 . The desired statement is proven by contradiction: assume that there is no 1 < ∞ such that ||rˆ=||2 ≤ 1 for all =.

Then, the sequence rˆ1, rˆ2,... must have a subsequence indexed by (=8) such that rˆ=8 lim8−→∞ ||rˆ=8 ||2 = ∞. Consider the sequence of unit vectors . This sequence ||rˆ=8 ||2 lies within the compact set of unit vectors in R3, so it must have a convergent = = subsequence; we index this subsequence of the sequence ( 8) by ( 8 9 ). Then, the 164

rˆ8 9 sequence (rˆ8 9 ) is such that lim 9−→∞ ||rˆ8 9 ||2 = ∞ and lim 9−→∞ = rˆD=8C, where ||rˆ8 9 ||2 3 rˆD=8C ∈ R is a fixed unit vector.

For any x such that |x) rˆ | ≠ 0, lim (x) rˆ − H )2 = ∞, and thus, the 8 8 D=8C =8 9 −→∞ 8 =8 9 8 corresponding terms in Eq. (B.3) approach infinity. However, a lower value of the optimization objective in Eq. (B.3) can be realized by replacing rˆ with the =8 9 assignment r = 0. Meanwhile, for any x such that |x) rˆ| = 0, replacing rˆ with 8 8 =8 9 r = 0 would also decrease the value of the optimization objective in Eq. (B.3). Therefore, for large 9, r = 0 results in a smaller objective function value than rˆ . =8 9 This is a contradiction, proving that the elements of the sequence rˆ cannot have =8 9 arbitrarily large magnitudes. Thus, the elements of the original sequence rˆ8 also cannot become arbitrarily large, and ||rˆ8 || ≤ 1 for some 1 < ∞.

The next intermediate result relates the matrix "˜ =, defined in Eq. (B.1), and the " _ Í=−1 ) matrix = = + 8=1 x8x8 .

Lemma 5. On iteration = of DPS, the posterior covariance matrix for the rewards is (=) 2 Σ = V= (X) "˜ =; if the link function 6 is linear, then "˜ = = "=, while if 6 is logistic, "˜ _ Í=−1 6 H ) ) then = = + 8=1 ˜(2 8x8 rˆ=)x8x8 . In both cases, there exist two constants

Proof. Firstly, if 6 is linear, then "˜ = = "=, so the desired result clearly holds with

If 6 is logistic, the desired statement is equivalent to:

=−1 ! =−1 =−1 ! < _ Õ ) _ Õ 6 H ) ) < _ Õ ) . min + x8x8  + ˜(2 8x8 rˆ=)x8x8  max + x8x8 8=1 8=1 8=1

By definition of 6˜, 6˜(G) ∈ (0, ∞) for all G ∈ R. Moreover, the domain of 6˜ has 6 H ) bounded magnitude, since all possible inputs to ˜ are of the form 2 8x8 rˆ=, in which H 1 1 | 8 | = 2 , x8 belongs to a compact set, and ||rˆ=|| ≤ by Lemma 4. Therefore, all possible inputs to 6˜ belong to a compact set. A continuous function over a compact set always attains its maximum and minimum values; therefore, there exist values

6˜min, 6˜max such that 0 < 6˜min ≤ 6˜(G) ≤ 6˜max < ∞ for all possible inputs G to 6˜(G). 165 Therefore,

=−1 =−1 " =−1 # _ Õ 6 H ) ) _ Õ 6 ) 6 , _ Õ ) , + ˜(2 8x8 rˆ=)x8x8  + ˜minx8x8  min{ ˜min 1} + x8x8 and 8=1 8=1 8=1 =−1 =−1 " =−1 # _ Õ 6 H ) ) _ Õ 6 ) 6 , _ Õ ) , + ˜(2 8x8 rˆ=)x8x8  + ˜maxx8x8  max{ ˜max 1} + x8x8 8=1 8=1 8=1

which proves the desired result for

To finish proving convergence of the transition dynamics Bayesian model, Lemma 6 demonstrates that every state-action pair is visited infinitely often.

Lemma 6. Under DPS with preference-based RL, assume that the dynamics are modeled via a Dirichlet model and that the utilities are modeled either via the linear or logistic link functions, with posterior sampling distributions given in Eq. (4.13). Then, every state-action pair is visited infinitely often. V X V0 X This consistency result also holds when removing the = ( ) and = ( ) factors from the distributions in Eq. (4.13).

Proof. The proof proceeds by assuming that there exists a state-action pair that is visited only finitely-many times. This assumption will lead to a contradiction1: once this state-action pair is no longer visited, the posterior sampling distribution for the utilities r is no longer updated with respect to it. Then, DPS is guaranteed to eventually sample a high enough reward for this state-action that the resultant policy will prioritize visiting it.

First, note that DPS is guaranteed to reach at least one state-action pair infinitely often: given the problem’s finite state and action spaces, at least one state-action pair must be visited infinitely often during DPS execution. If all state-actions are not visited infinitely often, there must exist a state-action pair (B, 0) such that B is visited infinitely often, while (B, 0) is not. Otherwise, if all actions are selected infinitely often in all infinitely-visited states, the finitely-visited states are unreachable (in which case these states are irrelevant to the learning process and regret minimization,

1Note that in finite-horizon MDPs, the concept of visiting a state finitely-many times is not the same as that of a transient state in an infinite Markov chain, because: 1) due to a finite horizon, the state is resampled from the initial state distribution ?0 (B) every ℎ timesteps, and 2) the policy— which determines which state-action pairs can be reached in an episode—is also resampled every ℎ timesteps. 166 and can be ignored). Without loss of generality, this state-action pair (B, 0) is labeled

as B˜1. To reach a contradiction, it suffices to show that B˜1 is visited infinitely often.

Let r1 be the utility vector with a reward of 1 in state-action pair B˜1 and rewards of zero elsewhere. From Definition 6, cE8 ( p˜, r1) is the policy that maximizes the expected number of visits to B˜1 under dynamics p˜ and utility vector r1:

c , + , , c , E8 ( p˜ r1) = argmaxc ( p˜ r1 ) where + ( p˜, r1, c) is the expected total reward of a length-ℎ trajectory under p˜, r1, and c, or equivalently (by definition of r1), the expected number of visits to state- action B˜1.

Next, it will be shown that there exists a d > 0 such that %(c = cE8 ( p˜, r1)) > d for all possible values of p˜. That is, for any sampled parameters p˜, the probability of selecting policy cE8 ( p˜, r1) is uniformly lower-bounded, implying that DPS must eventually select cE8 ( p˜, r1).

Let A˜9 be the sampled utility (also referred to as reward) associated with state- action pair B˜9 in a particular DPS episode, for each state-action 9 ∈ {1, . . . , 3}, with 3 = (. The proof will show that conditioned on p˜, there exists E > 0 such that if A˜1 exceeds max{EA˜2,EA˜3,...,EA˜3 }, then value iteration returns the policy

cE8 ( p˜, r1), which is the policy maximizing the expected amount of time spent in B˜ . This can be seen by setting E := ℎ , where ℎ is the time horizon and d is the 1 d1 1 expected number of visits to B˜1 under cE8 ( p˜, r1). Under this definition of E, the event

{A˜1 ≥ max{EA˜2,EA˜3,...,EA˜3 }} is equivalent to {A˜1d1 ≥ ℎ max{A˜2, A˜3,..., A˜3 }}; the latter inequality implies that given p˜ and r˜, the expected reward accumulated solely

in state-action B˜1 exceeds the reward gained by repeatedly (during all ℎ time-steps)

visiting the state-action pair in the set {B˜2,..., B˜3 } having the highest sampled

reward. Clearly, in this situation, value iteration results in the policy cE8 ( p˜, r1).

Next, it is shown that E = ℎ is continuous in the sampled dynamics p˜ by showing d1 that d1 is continuous in p˜. Recall that d1 is defined as expected number of visits to B˜1

under cE8 ( p˜, r1). This is equivalent to the expected reward for following cE8 ( p˜, r1)

under dynamics p˜ and rewards r1:

d1 = + ( p˜, r1, cE8 ( p˜, r1)) = max+ ( p˜, r1, c). (B.4) c The value of any policy c is continuous in the transition dynamics parameters, so

+ ( p˜, r1, c) is continuous in p˜. The maximum in Eq. (B.4) is taken over the finite set 167 of deterministic policies; because a maximum over a finite number of continuous

functions is also continuous, d1 is continuous in p˜. Next, recall that a continuous function on a compact set achieves its maximum and minimum values on that set. The set of all possible dynamics parameters p˜ is such 9 Í( ? ? : that for each state-action pair , :=1 9 : = 1 and 9 : ≥ 0 ∀ ; the set of all possible vectors p˜ is clearly closed and bounded, and hence compact. Therefore, E achieves its maximum and minimum values on this set, and for any p˜, E ∈ [Emin,Emax], where

Emin > 0 (E is nonnegative by definition, and E = 0 is impossible, as it would imply

that B˜1 is unreachable).

Then, %(c = cE8 ( p˜, r1)) can be expressed in terms of E and the parameters of the reward posterior. Firstly,

3 Ö %(c = cE8 ( p˜, r1)) ≥ %(A˜1 > max{EA˜2,EA˜3,...,EA˜3 }) ≥ %(A˜1 > EA˜9 ). 9=2

In the =Cℎ DPS iteration, the sampled rewards are drawn from a jointly Gaussian (=), (=) (=) (=) (=) `(=) posterior: r˜ ∼ N (- Σ ) for some - and Σ , where [- ] 9 = 9 and [ (=)] (=) (A − EA ) ∼ N (`(=) − E`(=), (=) + E2 (=) − E (=)) Σ 9 : = Σ 9 : . Then, ˜1 ˜9 1 9 Σ11 Σ 9 9 2 Σ1 9 , so that:

3  −`(=) + E`(=)  Ö  © 1 9 ª %(c=1 = cE8 ( p˜, r1)) ≥ 1 − Φ ­ ®  ­q (=) (=) (=) ® 9=2  Σ + E2Σ − 2EΣ   11 9 9 1 9   « ¬ 3 `(=) − E`(=) Ö © 1 9 ª = Φ ­ ® , (B.5) ­q (=) (=) (=) ® 9=2 + E2 − E Σ11 Σ 9 9 2 Σ1 9 « ¬ where Φ is the standard Gaussian cumulative distribution function. For the right- hand expression in Eq. (B.5) to have a lower bound greater than zero, the argument of Φ(·) must be lower-bounded. It suffices to upper-bound the numerator’s magnitude and to lower-bound the denominator above zero for each product factor 9 and over all iterations =.

(=) The numerator can be upper-bounded using Lemma 4. Since - equals rˆ= at = || (=) || ≤ 1 |`(=) |, |`(=) | ≤ 1 < E ≤ E iteration , - 2 ; therefore, 1 9 . Because 0 max, |` − E` | ≤ |`(=) | + E|`(=) | ≤ ( + E )1 1 9 1 9 1 max .

q ) (=) To lower-bound the denominator, first note that it is equal to w 9 Σ w 9 , in which 3 th w 9 ∈ R is defined as a vector with 1 in the first position, −E in the 9 position for 168 some 9 ∈ {2, . . . , 3}, and zero elsewhere:

) w 9 := [1, 0,..., 0, −E, 0,..., 0] . (B.6)

) (=) Equivalently, it must be shown that w 9 Σ w 9 is lower-bounded above zero. By 2 2 Lemma 5, it holds that Σ(=)  V= (X) "−1, implying that w) Σ(=) w ≥ V= (X) w) "−1w .

3 Õ U(=) (=), w 9 = : u: (B.7) :=1 U(=) ∈ for some coefficients : R. Using Eq. (B.7), the quantity to be lower-bounded can now be written as: 3 ! 3 ! 3 ! ) −1 Õ (=) (=)) Õ 1 (=) (=)) Õ (=) (=) w " w 9 = U u u u U u 9 = : : a(=) ; ; < < :=1 ;=1 ; <=1 3 (0) Õ  2 1 (1)  2 1 = U(=) ≥ U(=) , (B.8) : a(=) :0 a(=) :=1 : :0 where equality (a) follows by orthonormality of the eigenvector basis, and (b) holds −1 for any :0 ∈ {1, . . . , 3} due to positivity of the eigenvalues (a: ) . Therefore, to show that the denominator is bounded away from zero, it suffices to show that for 2 −1  (=)   (=)  every =, there exists some :0 such that U a is bounded away from zero. :0 :0

To prove the previous statement, note that by definition of "=, the eigenvalues (a(=))−1 = : are non-increasing in . Below, the proof will show that for any eigenvalue (a(=))−1 (a(=))−1 : such that lim=−→∞ : = 0, the first element of its corresponding h (=) i eigenvector, u , also converges to zero. Since the first element of w 9 equals 1, : 1 h (=) i (=) Eq. (B.6) implies that there must exist some :0 such that u 6−→ 0 and U is :0 1 :0 bounded away from 0. If these implications did not hold, then w 9 would not have a value of 1 in its first element, contradicting its definition. These observations imply (=) −1 that for every =, there must be some :0 such that as = −→ ∞, (a ) 6−→ 0 and :0 U(=) is bounded away from zero. :0 169 h i) Let -= denote the observation matrix after =−1 observations: -= := x1 ... x=−1 . "−1 -) - _ −1 "−1 -) - Then, = = ( = = + ) . The matrices = and = = have the same eigen- a(=) −1 "−1 -) - vectors. Meanwhile, for each eigenvalue ( 8 ) of = , = = has an eigenvalue b (=) a(=) _ 8 := 8 − ≥ 0 corresponding to the same eigenvector. We aim to characterize "−1 the eigenvectors of = whose eigenvalues approach zero. Since these eigenvectors -) - are identical to those of = = whose eigenvalues approach infinity, the latter can be considered instead.

Without loss of generality, assume that all finitely-visited state-action pairs (in-

cluding B˜1) occur in the first < < = − 1 iterations, and index these finitely-visited state-action pairs from 1 to A ≥ 1, so that the finitely-visited state-actions are: <×3 {B˜1, B˜2, ··· , B˜A }. Let -1:< ∈ R denote the matrix containing the first < rows of =−<×3 -=, while -<+1:= ∈ R denotes the remaining rows of -=. With this notation,

=−1 -) - Õ ) -) - -) - . = = = x8x8 = 1:< 1:< + <+1:= <+1:= 8=1

Because the first A state-action pairs, {B˜1, B˜2, ··· , B˜A }, are unvisited after iteration < A 8 > < -) - , the first elements of x8 are zero for all . Therefore, <+1:= <+1:= can be written in the following block matrix form: " $ $ # -) - A×A A×(3−A) , <+1:= <+1:= = $(3−A)×A = where $0×1 denotes the all-zero matrix with dimensions 0 × 1. The matrix = includes elements that are unbounded as = −→ ∞. In particular, the diagonal  = -) - elements of = approach infinity as −→ ∞. The matrix = = can be written in the following block matrix form:

-) - -) - -) - = = = 1:< 1:< + <+1:= <+1:= " ) ) # " # [- - ] ( ) [- - ] ( + )   = 1:< 1:< 1:A,1:A 1:< 1:< 1:A,A 1:3 := , -) - -) -  )  [ 1:< 1:<] (A+1:3,1:A) [ 1:< 1:<] (A+1:3,A+1:3) + = = where "(0:1,2:3) denotes the submatrix of " obtained by extracting rows 0 through

1 and columns 2 through 3. Because matrices  and  only depend upon -1:<, they are fixed as = increases, while matrix = contains values that grow towards infinity with increasing =. In particular, all elements along =’s diagonal are unbounded.    -) - Intuitively, in the limit, and are close to zero compared to =, and = = (when normalized) increasingly resembles a matrix in which only the bottom-right block is nonzero. This intuitive notion is formalized next. 170 (=), b (=) -) - b (=) Consider an eigenpair (u8 8 ) of = = such that lim=−→∞ 8 = ∞. The fol- (=) lowing argument shows that the first element of u8 must approach 0. Letting h i) (=) (=)) (=)) (=) ∈ < (=) ∈ =−1−< u8 = z8 q8 , where z8 R and q8 R : " # " #" # " # " # z(=)   z(=) z(=) + q(=) z(=) (-) - )u(=) = -) - 8 = 8 = 8 8 = b (=) 8 . = = 8 = = (=) )  (=) ) (=)  (=) 8 (=) q8 = q8 z8 + =q8 q8

b (=) Dividing both sides by 8 ,

 (=) (=)  " (=) #  1 z + q  " (=) # 1 z  b (=) 8 8  z -) - 8 8 8 . = = ( ) =    = ( ) b (=) =  1 ) (=)  (=)  = q  (=) z + =q  q 8 8  b 8 8  8  8  b (=)   = In the upper matrix block: lim=−→∞ 8 = ∞, and are fixed as increases, and z(=) and q(=) have upper-bounded elements because u(=) is a unit vector. Thus, 8 8   8 (=) 1  (=)  (=) lim=−→∞ z8 = lim=−→∞ (=) z8 + q8 = 0. In particular, the first element of b8 (=) (=) z8 converges to zero, implying that the same is true of u8 . As justified above, this result implies that for each iteration =, there exists an index

:0 ∈ {1, . . . , 3} such that the right-hand side of Eq. (B.8) has a lower bound above zero. This completes the proof that the denominator in Eq. (B.5) does not decay to

zero. As a result, there exists some d > 0 such that %(c = cE8 ( p˜, r1)) ≥ d > 0. In consequence, DPS is guaranteed to infinitely often sample pairs ( p˜, c) such that

c = cE8 ( p˜, r1). As a result, DPS infinitely often samples policies that prioritize

reaching B˜1 as quickly as possible. Such a policy always takes action 0 in state

B. Furthermore, because B is visited infinitely often, either a) ?0(B) > 0 or b) the infinitely-visited state-action pairs include a path with a nonzero probability of reaching B. In case a), since the initial state distribution is fixed, the MDP will

infinitely often begin in state B under the policy c = cE8 ( p˜, r1), so B˜1 will be visited infinitely often. In case b), due to Lemma 3, the transition dynamics parameters for state-actions along the path to B converge to their true values (intuitively, the

algorithm knows how to reach B). In episodes with the policy c = cE8 ( p˜, r1), DPS is

thus guaranteed to reach B˜1 infinitely often. Since DPS selects cE8 ( p˜, r1) infinitely

often, it must reach B˜1 infinitely often. This presents a contradiction, proving that every state-action pair must be visited infinitely often.

The direct combination of Lemmas 3 and 6 prove asymptotic consistency of the transition dynamics model: 171 Proposition 3. Assume that DPS is executed in the preference-based RL setting, with transition dynamics modeled via a Dirichlet model, utilities modeled via either the linear or logistic link function, and utility posterior sampling distributions given in

Eq. (4.13). Then, the sampled transition dynamics p˜81, p˜82 converge in distribution to  the true dynamics, p˜81, p˜82 −→ p. This consistency result also holds when removing

the V= (X) factors from the distributions in Eq. (4.13).

B.3 Asymptotic Consistency of the Utilities in DPS This section shows that in both the bandit and RL cases, reward samples drawn from the utility model posterior converge to the true utility values, r. The analysis first considers the bandit setting, and then utilizes asymptotic consistency of the MDP transition dynamics to extend the results to the preference-based RL problem.

Asymptotic Consistency of the Utilities for Generalized Linear Dueling Bandits The first step, encapsulated in the lemma below, leverages results from Abbasi- Yadkori, Pál, and Szepesvári (2011) and Filippi et al. (2010) to derive a sufficient condition for asymptotic consistency of the utilities.

2  V8 (X) −→ 8 −→ ∞ _(8) "˜ Lemma 7. If (8) 0 as , where 3 is the minimum eigenvalue of 8, _3  then r˜81, r˜82 −→ r with probability 1 − X.

Proof. From Propositions 1 and 2, with probability at least 1−X, the utility estimator V X rˆ8 belongs to a confidence ellipsoid centered at r: ||rˆ8 − r||"8 ≤ 8 ( ). The proof  will show that under this high-probability event, r˜81, r˜82 −→ r. " 1 "˜ 1 Firstly, by Lemma 5, 8  8; thus, ||rˆ8 − r|| 1 "˜ = ||rˆ8 − r||"˜ ≤

||rˆ8 − r||"8 . Meanwhile, the posterior sampling distribution is given by,   , ,V X 2"˜ −1 . r˜81 r˜82 ∼ N rˆ8 8 ( ) 8 (B.9)

Letting z8 ∼ N (0, ) be independent for each 8, the sample r˜81 (and similarly, r˜82) can equivalently be expressed as:

− 1 V X "˜ 2 , r˜81 = rˆ8 + 8 ( ) 8 z8 (B.10) since the random variable in Eq. (B.10) has the same distribution as that in Eq. (B.9). The quantity ||r˜ − rˆ || can be rewritten as: 81 8 "˜ 8 r − 1 − 1 − 1 ||r˜ − rˆ || = V (X)"˜ 2 z = V (X) z) "˜ 2 "˜ "˜ 2 z = V (X)||z || . 81 8 "˜ 8 8 8 8 8 8 8 8 8 8 8 8 2 "˜ 8 172

Because the probability distribution of ||z8 ||2 is fixed, there exists some fixed 0 > 0

such that with probability at least 1 − X, ||z8 ||2 ≤ 0. So, for each 8, with probability at least 1 − X, ||r˜ − rˆ || = V (X)||z || ≤ V (X)0. 81 8 "˜ 8 8 8 2 8

The previous statement can be combined with the high-probability inequality ||rˆ8 − r|| ≤ < V (X) to obtain that for each 8, with probability at least 1 − X, "˜ 8 max 8

||r˜ − r|| ≤ ||r˜ − rˆ || + ||rˆ − r|| ≤ (0 + < )V (X). 81 "˜ 8 81 8 "˜ 8 8 "˜ 8 max 8

Taking squares and dividing by V8 (X) yields that for each 8, with probability at least 1 − X, 1 (r˜ − r)) "˜ (r˜ − r) ≤ (0 + < )2. 2 81 8 81 max V8 (X)

_(8)  3 8 (8), 9 By assumption, 2 −→ ∞ as −→ ∞. Recall from Definition 7 that v ∈ V8 (X) 9 , . . . , 3 "˜ _(8) {1 }, represent the eigenvectors of 8 corresponding to the eigenvalues 9 . Then, with probability at least 1 − X for each 8:

3 1 1 Õ (r˜ − r)) "˜ (r˜ − r) = (r˜ − r)) © _(8) v(8) v(8)) ª (r˜ − r) V (X)2 81 8 81 V (X)2 81 ­ 9 9 9 ® 81 8 8 9=1 « ¬ 3 1 Õ  2 = _(8) (r˜ − r)) v(8) ≤ (0 + < )2. V (X)2 9 81 9 max 8 9=1 (B.11)

_(8) 9 8 9 (8) Since 2 −→ ∞ as −→ ∞ for each , and v is an orthonormal basis, the V8 (X) 9 2  constant bound of (0+

 The analysis will show convergence in distribution of the reward samples, r˜81, r˜82 −→ 1 _(8) 8 r, by applying Lemma 7 and demonstrating that 2 −→ ∞ as −→ ∞. This V8 (X) 3 1 _(8) result is proven by contradiction: intuitively, if 2 is upper-bounded, then V8 (X) 3 173 _(8) DPS has a lower-bounded probability of selecting policies that increase 3 . First, Lemma 8 demonstrates that under the contradiction hypothesis, there is a non-

decaying probability of sampling rewards r˜81, r˜82 that are highly-aligned with the (8) "˜ −1 (_(8))−1 eigenvector v3 of 8 corresponding to its largest eigenvalue, 3 .  −1 8 V (X)2 _(8) ≥ U Lemma 8. Assume that for a given iteration , 8 3 . Then for any 0 > 0, the reward samples r˜81, r˜82 satisfy:   % ) (8) 0 ) (8) 2 0 > , r˜8 v ≥ max |r˜8 v | ≥ ( ) 0 (B.12) 1 3 9<3 1 9   % ) (8) 0 ) (8) 2 0 > , r˜8 v ≤ − max |r˜8 v | ≥ ( ) 0 (B.13) 2 3 9<3 2 9

where 2 : R+ −→ R+ is a continuous, monotonically-decreasing function.

Proof. Recall that the reward samples r˜81, r˜82 are drawn according to the distribu- tion in Eq. (B.9). The proof begins by demonstrating that the reward samples can equivalently be expressed as:

3 1  − 2 V X Õ _(8) I (8),I , , r˜81 = rˆ8 + 8 ( ) 9 8 9 v 9 8 9 ∼ N (0 1) i.i.d. (B.14) 9=1

and similarly for r˜82. Similarly to Eq. (B.9), the expression in Eq. (B.14) has a multivariate Gaussian distribution. One can take the expectation and covariance of

Eq. (B.14) with respect to the variables {I8 9 } to show that they match the expressions in Eq. (B.9):

 3 − 1  3 − 1  Õ  (8)  2 (8) Õ  (8)  2 (8) E rˆ8 + V8 (X) _ I8 9 v  = rˆ8 + V8 (X) _ E[I8 9 ]v = rˆ8,  9 9  9 9  9=1  9=1   )  3 − 1   3 − 1 3 − 1 !   Õ  (8)  2 (8) (0)  Õ  (8)  2 (8) Õ  (8)  2 (8)  Cov rˆ8 + V8 (X) _ I8 9 v  = E ©V8 (X) _ I8 9 v ª V8 (X) _ I8: v   9 9  ­ 9 9 ® : :   9=1   9=1 :=1    « ¬  3 3 − 1 − 1 Õ Õ   2   2 V (X)2 _(8) _(8) (8) (8)) [I I ] = 8 9 : v 9 v: E 8 9 8: 9=1 :=1 3 ( )  −1 1 V X 2 Õ _(8) (8) (8)) V X 2"˜ −1, = 8 ( ) 9 v 9 v 9 = 8 ( ) 8 9=1

which match the expectation and covariance in Eq. (B.9). In the above, (a) applies the ) definition Cov[x] = E[(x − E[x])(x − E[x]) ], and (b) holds because E[I8 9 I8: ] =

Cov[I8 9 I8: ] = X 9 : , where X 9 : is the Kronecker delta function. 174 (8) Next, it is shown that the probability that r˜81 is arbitrarily-aligned with v3 is lower- bounded above zero: that is, there exists 2 : R+ −→ R+ such that for any 0 > 0, %( ) (8) ≥ 0 | ) (8) |) ≥ 2(0) > r˜81v3 max 9<3 r˜81v 9 0. This can be shown by bounding the | ) (8) |, 9 < 3 ) (8) | ) (8) | 9 < 3 terms r˜81v 9 , and r˜81v3 . Firstly, the term r˜81v 9 , , can be upper- bounded:

3 − 1 − 1 ) (8) (0) ) (8) Õ  (8)  2 (8)) (8) (1) ) (8)  (8)  2 |r˜ v | = rˆ v + V8 (X) _ I8: v v = rˆ v + V8 (X) _ I8 9 81 9 8 9 : : 9 8 9 9 :=1 1 1  − 2 (2)  − 2 ) (8) V X _(8) I (8) V X _(8) I ≤ |rˆ8 v 9 | + 8 ( ) 9 | 8 9 | ≤ ||rˆ8 ||2||v 9 ||2 + 8 ( ) 9 | 8 9 | 1 (3)  − 2 1 V X _(8) I , ≤ + 8 ( ) 9 | 8 9 | where (a) applies Eq. (B.14), (b) follows from orthonormality of the eigenbasis, (c) follows from the Cauchy-Schwarz inequality, and (d) uses Lemma 4, which states || || ≤ 1 ) (8) that rˆ8 2 . Similarly, r˜81v3 can be lower-bounded:

3 − 1 − 1 (0) Õ   2 (1)   2 ) (8) ) (8) + V (X) _(8) I (8)) (8) ) (8) + V (X) _(8) I r˜81v3 = rˆ8 v3 8 9 8 9 v 9 v3 = rˆ8 v3 8 3 81 9=1 1 1  − 2 (2)  − 2 ≥ −| ) (8) | + V (X) _(8) I ≥ −|| || || (8) || + V (X) _(8) I rˆ8 v3 8 3 81 rˆ8 2 v3 2 8 3 81 1 (3)  − 2 ≥ −1 + V (X) _(8) I , 8 3 81 where as before, (a) applies Eq. (B.14), (b) follows from orthonormality of the eigen- basis, (c) follows from the Cauchy-Schwarz inequality, and (d) holds via Lemma 4.   % ) (8) ≥ 0 | ) (8) | Given these upper and lower bounds, the probability r˜81v3 max 9<3 r˜81v 9 can be lower-bounded:    1  1  (0)  − 2  − 2 % ) (8) 0 ) (8) % 1 V X _(8) I 0 1 V X _(8) I r˜8 v ≥ max |r˜8 v | ≥ − + 8 ( ) 81 ≥ max + 8 ( ) | 8 9 | 1 3 9<3 1 9 3 9<3 9 q ( ) q ( ) uv 1 _ 8 1 _ 8 u_(8)  3  3 t  = % ©I ≥ + 0 max  + 3 |I |ª ­ 81 V X  V X (8) 8 9 ® ­ 8 ( ) 9<3  8 ( ) _ ®  9  «  ¬ (1)  1  1  ≥ % I81 ≥ √ + 0 max √ + |I8 9 | U 9<3 U  1(1 + 0)  = % I81 ≥ √ + 0 max |I8 9 | := 2(0) > 0, U 9<3 where (a) results from the upper and lower bounds derived above, and (b) follows (8) − 1 _   2 √ 3 ≤ V (X) _(8) ≥ U 2(0) > because (8) 1 and 8 3 by assumption. The function 0 is _ 9 continuous and decreasing in 0. 175 %( ) (8) ≤ −0 | ) (8) |) ≥ 2(0) By identical arguments, r˜82v3 max 9<3 r˜82v 9 . Thus, for any 0 > (8) 0 and set of eigenvectors v 9 :

% ) (8) 0 ) (8) 2 0 > , (r˜8 v ≥ max |r˜8 v |) ≥ ( ) 0 1 3 9<3 1 9 % ) (8) 0 ) (8) 2 0 > . (r˜8 v ≤ − max |r˜8 v |) ≥ ( ) 0 2 3 9<3 2 9

(8) This alignment between sampled rewards r˜81, r˜82 and v3 can also be expressed as follows by taking 0 high enough:  −1 8 V (X)2 _(8) ≥ U Lemma 9. Assume that for a given iteration , 8 3 . Then, for any Y > 0, the following events have nonzero, uniformly-lower-bounded probability:

r˜81 (8) < Y r˜82 (8) < Y || || − v and || || − (−v ) . r˜81 2 3 2 r˜82 2 3 2

Proof. By Lemma 8, Eq.s (B.12) and (B.13) both hold. The events in Eq.s (B.12) { ) (8) ≥ 0 | ) (8) |} { ) (8) ≤ −0 | ) (8) |} and (B.13), r˜81v3 max 9<3 r˜81v 9 and r˜82v3 max 9<3 r˜82v 9 , will be referred to as events (0) and (0), respectively. From Lemma 8, (0) and (0) have positive, lower-bounded probability for any 0. The proof argues that regardless of the specific eigenbasis, the values of 0 required for the result to hold have some fixed upper bound.

(0) (0) 0 −→ ∞ r˜81 −→ (8) First, note that under events and , as , ||r˜ || v3 and 81 2 r˜82 (8) Y >  0 0 r˜81 (8) < || || −→ −v . Let 0. Under event ( ) for sufficiently-large , || || − v r˜82 2 3 r˜81 2 3 2 (8) (8) Y. Define 0min,1(Y, v ,..., v ) as the minimum value of 0 such that (0) implies 1 3 r˜81 (8) Y (8),..., (8) || || − v ≤ , given the eigenbasis {v v }. Because the inequality r˜81 2 3 2 2 1 3 (0) 0 { (8),..., (8) } defining is continuous in , r˜81, and the eigenbasis v1 v3 , the function 0 (Y, (8),..., (8)) { (8),..., (8) } min,1 v1 v3 is also continuous in the eigenbasis v1 v3 . Because 0 (Y, (8),..., (8)) { (8),..., (8) } min,1 v1 v3 is positive for all v1 v3 , and the set of all eigenbases (8) (8) {v ,..., v } is compact, there exists 0min,1(Y) such that for any eigenbasis, if (0) 1 3 0 0 Y r˜81 (8) < Y holds for ≥ min,1( ), then || || − v . r˜81 2 3 2

By the same arguments, there exists 0min,2(Y) such that for any eigenbasis, if

 0 0 0 Y r˜82 (8) < Y 0 Y ( ) holds for ≥ min,2( ), then || || − (−v ) . Taking min( ) := r˜82 2 3 2 max{0min,1(Y), 0min,2(Y)}, then for any 0 ≥ 0min(Y), under events (0) and (0),

r˜81 (8) < Y r˜82 (8) < Y both || || − v and || || − (−v ) hold. r˜81 2 3 2 r˜82 2 3 2 176

The next step shows that given sampled rewards r˜81, r˜82 that are highly-aligned with (8) "˜ the eigenvector v3 of 8 as in Lemma 9, there is a lower-bounded probability of sampling actions that have non-zero projection onto this eigenvector.

(8) Importantly, for each possible eigenvector v3 , the analysis will assume that there exist two actions x1, x2 ∈ A such that:

|( − )) (8) | ≥ 2 > , x2 x1 v3 0 (B.15) 2 (8) where the constant is independent of the eigenvector v3 . ) (8) If this is not satisfied, then x v3 is the same (or arbitrarily-similar) for all actions ∈ A (8) x . In this case, the eigenvector v3 lies along a dimension that is irrelevant _(8) for learning the utilities r. The corresponding eigenvalue 3 is necessarily 0, since (8) all vectors x8 are orthogonal to v3 . Thus, the learning problem would remain A (8) equivalent if were reduced to a lower-dimensional subspace to which v3 is orthogonal. Therefore, without loss of generality, this proof assumes that no such (8) "˜ eigenvectors exist, that is, Eq. (B.15) is satisfied for all eigenvectors v3 of 8.

Lemma 10 (Generalized linear dueling bandit). Assume that there exists 80 such  −1 8 > 8 V (X)2 _(8) ≥ U 20 > that for 0, 8 3 . Then, there exists a constant 0 such that for 8 > 80: h i E x) v(8) ≥ 20 > 0, (B.16) 8 3 where 20 > 0 depends only on the true utility parameters r, so that in particular, Eq. (8) (B.16) holds for any eigenvector v3 .

Proof. By Lemma 9, for any Y > 0, the following events have nonzero, uniformly-

r˜81 (8) < Y r˜82 (8) < Y lower-bounded probability: || || − v and || || − (−v ) . r˜81 2 3 2 r˜82 2 3 2 Recall that in the bandit setting, actions are selected as follows in each iteration 8:

) ) x81 = argsup x r˜81, and x82 = argsup x r˜82. (B.17) x∈A x∈A

) Note that for any x ∈ A and unit vector r, 51(r) := x r is uniformly-continuous in

r: if ||r1 − r2||2 ≤ X, then,

) | 51(r1) − 51(r2)| ≤ |x (r1 − r2)| ≤ ||x||2||r1 − r2||2 ≤ !X, 177 ≤ ! ∈ A (8) Y0 > Y where x for all x . Therefore, for any v3 and 0, there exists 1 such that when Y < Y1: r˜ r˜ ) (8) − ) 81 < Y0, ) (− (8)) − ) 82 < Y0. x81v3 x81 and x82 v3 x82 (B.18) ||r˜81||2 ||r˜82||2

5 ) Similarly, for any unit vector r, 2(r) := supx∈A x r is uniformly-continuous in r, since a continuous function over a compact set is uniformly-continuous, and the set (8) Y0 > Y of all unit vectors is compact. Therefore, for any v3 and 0, there exists 2 such that when Y < Y2: r˜ ) (8) − ) 81 < Y0, sup x v3 sup x and (B.19) x∈A x∈A ||r˜81||2 r˜ ) (− (8)) − ) 82 < Y0. sup x v3 sup x x∈A x∈A ||r˜82||2

Y ≤ {Y ,Y } | ) (8) | 2 Letting min 1 2 , x8 v3 can be lower-bounded as follows, where is defined as in Eq. (B.15):

) (8) ) (8) ) (8) 0 < 2 ≤ sup |(x2 − x1) v | = sup x v − inf x v 3 3 x∈A 3 x1,x2∈A x∈A (0) r˜ r˜ ) (8) + ) (− (8)) ≤ Y0 + ) 81 + ) 82 = sup x v3 sup x v3 2 sup x sup x x∈A x∈A x∈A ||r˜81||2 x∈A ||r˜82||2 (1) r˜ r˜ (2) Y0 + ) 81 + ) 82 ≤ Y0 + ) (8) + ) (− (8)) Y0 + | ) (8) |, = 2 x81 x82 4 x81v3 x82 v3 = 4 x8 v3 ||r˜81||2 ||r˜82||2 where (a), (b), and (c) apply Eq.s (B.19), (B.17), and (B.18), respectively.

Y0 < 2 | ) (8) | ≥ 2 > Let 8 . Then, x8 v3 2 0 when the events of Lemma 9 hold. Because the latter have a non-decaying probability of occurrence d > 0: h i 2 E x) v(8) ≥ d = 20 > 0. 8 3 8

With this result, one can finally prove the desired contradiction.

2  8 −→ ∞ V8 (X) −→ _(8) "˜ Lemma 11. As , (8) 0, where 3 is the minimum eigenvalue of 8. _3 178 −1 2  (8)  Proof. First, one can show via contradiction that lim inf V8 (X) _ = 0. Assume 8−→∞ 3 that: −1 2  (8)  lim inf V8 (X) _ = 2U > 0. (B.20) 8−→∞ 3  −1 8 8 ≥ 8 V (X)2 _(8) ≥ U If Eq. (B.20) is true, then there exists 0 such that for all 0, 8 3 . The following assumes that 8 ≥ 80. Since V8 (X) increases at most logarithmically 8 _(8) in , it suffices to show that 3 increases at least linearly on average to achieve a contradiction with Eq. (B.20).

Under the contradiction hypothesis, Lemmas 8 and 10 both hold. Due to Lemma

10, DPS will infinitely often, and at a non-decaying rate, sample pairs x81, x82 such | ) (8) | = that x8 v3 is lower-bounded away from zero. At iteration , this proof analyzes _(=) _(=) the effect of this guarantee upon 3 . Using that 3 corresponds to the eigenvector (=) "˜ v3 of =:

=−1 ! (0) (1) Õ _(=) (=)) "˜ (=) ≥ (=)) (< " )) (=) < (=)) _ + ) (=) 3 = v3 =v3 v3 min = v3 = minv3 x8x8 v3 8=1 " =−1 # Õ  2 ≥ < _ + ) (=) , min x8 v3 (B.21) 8=1

where (a) follows from Lemma 5 and (b) from the definition of "=. Note that while ) (=) 8 < = the right-hand side expression in Eq. (B.21) depends upon x8 v3 for , Eq. ) (8) (8) 8 (B.16) depends upon x8 v3 . Clearly, if v3 were constant in , then the combination _(=) −→ ∞ of Eq.s (B.16) and (B.21) would suffice to prove that 3 with at least a (8) 8 3 linear average rate; however, v3 can vary with over the space of unit vectors in R . (8) 3 This proof leverages that v3 is a unit vector, that the set of unit vectors in R is compact, and that any infinite cover of a compact set has a finite subcover. In 3 particular, for any Y > 0, there exist sets (1,...,( ⊂ R , < ∞, such that:

3 1. For v ∈ R such that ||v||2 = 1, v ∈ (: for some : ∈ {1, . . . , }, and

2. If v1, v2 ∈ (: , then ||v1 − v2|| < Y.

(= ) ∈ (=8) ∈ ( It will be shown that there exists a sequence 8 N such that v3 : for : ∈ { , . . . , } (=8) ∈ ( (= ) fixed 1 , with the events v3 : corresponding to the indices 8 occurring at some non-decaying frequency. Then, by appropriately choosing Y, the 179

(=8) proof will use Eq. (B.21) and the mutual proximity of the vectors v3 to show that _(=) 3 increases with an at-least linear rate. Observe that for any number of total iterations #, there exists an integer : ∈ { , . . . , } (8) ∈ ( # 1 such that v3 : during at least iterations. Because is a constant # # (8) ∈ ( and is linear in , the number of iterations in which v3 : is at least linear in # for some :. The right-hand sum in Eq. (B.21) can then be divided according to

the indices (=8) and the remaining indices:

= −1  9 2 (= 9 )  Õ  ) (=8)   _ ≥

The latter sum in Eq. (B.22) is non-decreasing in = 9 , as all of its terms are non- ( ) || = 9 − (=8) || < Y 8 ∈ { , . . . , 9 − } negative. In the former sum, v3 v3 for each 1 1 . Defining ( ) = 9 − (=8) || || ≤ Y % := v3 v3 , so that % 2 :  2   2  2  2 x) v(=8) = x) v(=8) + % = x) v(=8) + x) % ≥ x) v(=8) − x) % . =8 3 =8 3 =8 3 =8 =8 3 =8 (B.23)

) Yℎ ℎ By the Cauchy-Schwarz inequality, x= % ≤ ||%||2 ∗ ||x=8 ||2 ≤ 2 , where is the 8 h i trajectory horizon. Because Eq. (B.16) requires that E x) v(=8) ≥ 20 > 0, one can =8 3 h i choose Y small enough that E x) v(=8) − x) % ≥ 20 − 2Yℎ ≥ 200 > 0, and: =8 3 =8

 2 (0)  2 (1) h i2 E x) v(=8) ≥ E x) v(=8) − x) % ≥ E x) v(=8) − x) % ≥ (200)2 > 0, =8 3 =8 3 =8 =8 3 =8

where (a) takes expectations of both sides of Eq. (B.23), and (b) follows from _(= 9 ) Jensen’s inequality. Merging this result with Eq. (B.22) implies that 3 is expected to increase at least linearly on average, according to the positive constant 200, over

the indices (=8). Recall that there always exists an (: such that the number of times (8) ∈ ( # when v3 : is at least linear in the total number of iterations . Thus, the rate (= ) # _(= 9 ) at which indices 8 occur is always (at least) linear in on average, and 3 increases at least linearly in # in expectation.

2 2 V8 (X) V8 (X) One can now demonstrate that lim inf (8) = 0 holds: the numerator of (8) is the 8−→∞ _3 _3 square of a quantity that increases at most logarithmically in 8, while the denominator 180 increases at least linearly in 8 on average. This contradicts the assumption in Eq. −1 2  (8)  (B.20), and thereby proves that lim inf V8 (X) _ = 0 must hold. 8−→∞ 3 −1 −1 2  (8)  2  (8)  Finally, one can leverage that lim inf V8 (X) _ = 0 to show that lim V8 (X) _ 8−→∞ 3 8−→∞ 3  −1 V (X)2 _(8) = 0. Consider the following two possible cases: 1) 8 3 converges to zero  −1 V (X)2 _(8) in probability, and 2) 8 3 does not converge to zero in probability. In case 1), because convergence in probability implies convergence in distribution, the desired result holds.   −1  Y > % V (X)2 _(8) ≥ Y 6−→ In case 2), there exists some 0 such that 8 3 0. In this −1 2  (8)  case, one can apply the same arguments used to show that lim inf V8 (X) _ = 8−→∞ 3  −1 V (X)2 _(8) ≥ Y 0, but specifically over time indices where 8 3 . Due to the non- convergence in probability, these time indices must occur at some non-decaying rate; _(8) 8 hence, the same analysis applies. Thus, 3 increases in with at least a minimum linear average rate, while V8 (X) increases at most logarithmically in 8. This violates the non-convergence assumption of case 2), resulting in a contradiction. As a result, only case 1) can hold.

Combining Lemmas 7 and 11 leads to the desired result:

Proposition 4. When running DPS in the generalized linear dueling bandit setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13), then with probability 1 − X, the sampled  utilities r˜81, r˜82 converge in distribution to their true values, r˜81, r˜82 −→ r, as 8 −→ ∞.

Asymptotic Consistency of the Utilities for Preference-Based RL Proposition 4 can also be extended to the preference-based RL setting; in fact, several of the previous lemmas will be re-used in this analysis. The main complication is that policies are now selected rather than actions, and each policy corresponds to a distribution over possible trajectories.

The next result enables the proof to leverage convergence in distribution of the dy-  namics samples, p˜81, p˜82 −→ p (as guaranteed by Proposition 3), in characterizing the impact of sampled policies upon convergence of the reward model. 181

2 2 Lemma 12. Let 5 : R(  ×R( −→ R be a function of transition dynamics p ∈ R(  and reward vector r ∈ R(, 5 ( p, r), where 5 is continuous in p and uniformly-  continuous in r. Assume that Proposition 3 holds: p˜81, p˜82 −→ p. Then, for any 0 0 X, Y > 0, there exists 8 such that for 8 > 8 , | 5 ( p, r) − 5 ( p˜8 9 , r)| < Y for any unit vector r and 9 ∈ {1, 2} with probability at least 1 − X.

Proof. Without loss of generality, the result is shown for 9 = 1 (the steps are identical  for 9 = 1 and 9 = 2). Applying Proposition 3, p˜81 −→ p. By continuity of 5 , one  can apply Fact 8 from Appendix B.1 to obtain that 5 ( p˜81, r) −→ 5 ( p, r) for any % r. Further applying Fact 9 from Appendix B.1, 5 ( p˜81, r) −→ 5 ( p, r) for any r. By

definition of convergence in probability, given X, there exists 8r such that for 8 ≥ 8r: 1 | 5 ( p˜ , r) − 5 ( p, r)| < Y with probability at least 1 − X. (B.24) 81 3 To obtain a high-probability bound that applies over all unit vectors r, one can use compactness of the set of unit vectors. Any infinite cover of a compact set has a finite subcover; in particular, for any X0 > 0, the set of unit vectors in R3 has a finite 0 0 cover of the form {B(r1,X ),..., B(r ,X )}, where {r1,..., r } are unit vectors, 0 0 3 0 0 0 and B(r,X ) := {r ∈ R | ||r − r||2 < X } is the 3-dimensional sphere of radius X

centered at r. Thus, there exists a finite set of unit vectors U = {r1,..., r } such 0 0 0 that for any unit vector r , ||r8 − r ||2 < X for some 8 ∈ {1, . . . , }. Because 5 is

uniformly-continuous in r, for any transition dynamics p, there exists X p > 0 such 0 0 that for any two unit vectors r, r such that ||r − r ||2 < X p: 1 | 5 ( p, r) − 5 ( p, r0)| < Y. (B.25) 3

0 Without loss of generality, for each p, define X p := sup G such that ||r − r ||2 < G 5 , 5 , 0 1 Y 5 X implies | ( p r) − ( p r )| ≤ 6 . Then, because is continuous in p, p is also continuous in p. Because the set of all possible transition probability vectors p is compact, and a continuous function over a compact set achieves its minimum value, there exists Xmin > 0 such that X p ≥ Xmin > 0 over all p. Let U be defined such that 0 0 0 X ≤ Xmin; then, for any unit vector r , there exists r ∈ U such that ||r − r ||2 < Xmin, and thus Eq. (B.25) holds for any p. 8 8 8 5 , By Eq. (B.24), for each r 9 ∈ U, there exists r 9 such that for ≥ r 9 : || ( p˜81 r) − 5 , < 1 Y X ( p r)||2 3 with probability at least 1 − . Because U is a finite set, there exists 80 > 8 , . . . , 8 8 > 80 max{ r1 r } such that for r ∈ U and : 1 | 5 ( p˜ , r) − 5 ( p, r)| < Y for each r ∈ U with probability at least 1 − X. (B.26) 81 3 182 0 0 0 Therefore, for any unit vector r , there exists r ∈ U such that ||r − r ||2 < X ≤ Xmin, and with probability at least 1 − X for 8 > 80:

0 0 0 0 | 5 ( p, r ) − 5 ( p˜81, r )| = | 5 ( p, r ) − 5 ( p, r) + 5 ( p, r) − 5 ( p˜81, r) + 5 ( p˜81, r) − 5 ( p˜81, r )| (0) 0 0 ≤ | 5 ( p, r ) − 5 ( p, r)| + | 5 ( p, r) − 5 ( p˜81, r)| + | 5 ( p˜81, r) − 5 ( p˜81, r )| (1) 1 1 1 ≤ Y + Y + Y = Y, 3 3 3 where (a) holds due to the triangle inequality, and (b) holds via Eq.s (B.25) and

(B.26), where the proof showed that there exists Xmin such that 0 < Xmin ≤ X p for all possible transition dynamics parameters p.

0 Applying Lemma 12 to the two functions + ( p, r, cE8 ( p, r)) = maxc0 + ( p, r, c ) and + ( p, r, c), for any fixed policy c, yields the following result.

Lemma 13. For any Y, X > 0, any policy c, and any unit reward vector r, both of the following hold with probability at least 1 − X for sufficiently-large 8 and 9 ∈ {1, 2}:

|+ ( p, r, c) − + ( p˜8 9 , r, c)| < Y

|+ ( p, r, cE8 ( p, r)) − + ( p˜8 9 , r, cE8 ( p˜8 9 , r))| < Y.

Proof. Both statements follow by applying Lemma 12. First, consider the function

51( p, r) := + ( p, r, c) for a fixed policy c. The value function + ( p, r, c) is contin- uous in both p and r, and furthermore, it is linear in r. Using these observations, the proof demonstrates that + ( p, r, c) is uniformly-continuous in r. Indeed, for any 6 ) Y0 > X0 Y0 linear function of the form (z) = a z, for any 0, for := ||a|| , and for any 0 z1, z2 such that ||z1 − z2|| < X :

) 0 0 |6(z1) − 6(z2)| = |a (z1 − z2)| ≤ ||a||2||z1 − z2||2 < ||a||2X = Y .

Thus, 51 satisfies the conditions of Lemma 12 for any fixed c, so that for 8 > 8c,

|+ ( p, r, c) − + ( p˜8 9 , r, c)| < Y with probability at least 1 − X. Because there are

finitely-many deterministic policies c, one can set 8 > maxc 8c, so that the statement holds jointly over all c.

Next, let 52( p, r) = maxc + ( p, r, c) = + ( p, r, cE8 ( p, r)). A maximum over finitely- many continuous functions is continuous, and a maximum over finitely-many uniformly-

continuous functions is uniformly-continuous. Therefore, 52 also satisfies the con- ditions of Lemma 12. 183  As in the bandit case, convergence in distribution of the reward samples, r˜81, r˜82 −→ 1 _(8) r, will be shown by applying Lemma 7 and demonstrating that 2 −→ ∞ as V8 (X) 3 8 1 _(8) −→ ∞. Similarly, this result is proven by contradiction: intuitively, if 2 is V8 (X) 3 upper-bounded, then DPS has a lower-bounded probability of selecting policies that _(8) tend to increase 3 . _(8) (8) Importantly, abbreviating 3 ’s eigenvector v3 as v, this proof is contingent upon there existing a pair of policies c1, c2 such that:

) c c , c c ) c c , c c E[x8 v | 81 = 1 82 = 2] = E[(x82 − x81) v | 81 = 1 82 = 2] (0) = |+ ( p, v, c1) − + ( p, v, c2)| > 0,

where (a) holds because the value function + ( p, v, c) gives the expected total reward of c under the reward vector v. In other words, the proof will require,

) c c , c c + , , c + , , c > . max E[x8 v | 81 = 1 82 = 2] = max | ( p v 1) − ( p v 2)| 0 (B.27) c1,c2 c1,c2 If this does not hold, then it is impossible to select a pair of policies under which

the observation x8 is not expected to be orthogonal to the eigenvector v.

This work argues that without loss of generality, Eq. (B.27) can be assumed to hold "˜ ) c for all eigenvectors of 8. Note that if Eq. (B.27) does not hold, then E[x81v | 81 = c] = + ( p, v, c) is fixed for all c. Given p, by linearity of the value function + in the rewards, any v-directed component of r does not affect policy selection: + ( p, r, c) = + ( p, rv + rv⊥, c) = + ( p, rv, c) + + ( p, rv⊥, c), where rv is the projection of r onto the v-direction and rv⊥ is its orthogonal complement in R3. Because + ( p, rv, c) c c , + , , c + , v⊥, c does not depend on , E8 ( p r) = argmaxc ( p r ) = argmaxc ( p r ). This thesis calls any vector v which does not satisfy Eq. (B.27) an irrelevant di- mension of the rewards: given p, removing the v-directed component of r does not influence policy selection. The following lemma demonstrates that such vectors remain irrelevant towards policy selection even when p is unknown, given posterior

samples of the transition dynamics, p˜81, p˜82, that have sufficiently converged to the true values p in distribution.

Lemma 14. For any reward vector r ∈ R3, let rrel be the projection of r onto the relevant subspace (for which Eq. (B.27) holds), and r⊥ be its orthogonal complement in R3, such that r = rrel+r⊥, and r⊥ belongs to the subspace of irrelevant dimensions

(where Eq. (B.27) does not hold). The reward samples on iteration 8 are r˜8 9 , 9 ∈ 184

{1, 2}. Then, for any Y, X > 0, there exists 80 such that for 8 > 80, with probability at least 1 − X: + , , c + , , c , rel < Y. | ( p r˜8 9 8 9 ) − ( p r˜8 9 E8 ( p˜8 9 r˜8 9 ))|

In other words, with respect to the sampled rewards r˜8 9 , the expected reward of the selected policy c8 9 = cE8 ( p˜8 9 , r˜8 9 ) is close to the expected reward of the policy that rel would have been selected were r˜8 9 replaced by r˜8 9 .

Proof. The result is proven for 9 = 1 (the proof is identical for 9 = 2). Because + , ⊥, c c F + , ⊥, c ( p r˜81 ) is constant for all , the variable := ( p r˜81 ) is defined for convenience. First, the proof shows that under the true transition dynamics p, the

irrelevant dimensions of r˜81 do not affect policy selection. For any c,

(0)  rel ⊥  + ( p,r˜81, cE8 ( p, r˜81)) = max+ ( p, r˜81, c) = max + ( p, r˜ , c) + + ( p, r˜ , c) c c 81 81 (1) + , rel, c , rel + , ⊥, c , rel (2) + , , c , rel , = ( p r˜81 E8 ( p r˜81 )) + ( p r˜81 E8 ( p r˜81 )) = ( p r˜81 E8 ( p r˜81 )) (B.28) rel + ⊥ where (a) and (c) hold because r˜81 = r˜81 r˜81 and the value function is linear in the + , ⊥, c F c rewards, and (b) holds because ( p r˜81 ) = is constant across all policies . |+ ( , , c ) − + ( , , c ( , rel))| To upper-bound p r˜81 81 p r˜81 E8 p˜81 r˜81 : + , , c + , , c , rel | ( p r˜81 81) − ( p r˜81 E8 ( p˜81 r˜81 ))| (B.29) (0) + , , c , + , rel, c , rel F = | ( p r˜81 E8 ( p˜81 r˜81)) − ( ( p r˜81 E8 ( p˜81 r˜81 )) + )| + , , c , + , rel, c , rel F = | ( p r˜81 E8 ( p˜81 r˜81)) − ( ( p r˜81 E8 ( p˜81 r˜81 )) + )

− + ( p˜81, r˜81, cE8 ( p˜81, r˜81)) + + ( p˜81, r˜81, cE8 ( p˜81, r˜81))|

− + ( p, r˜81, cE8 ( p, r˜81)) + + ( p, r˜81, cE8 ( p, r˜81))| + , rel, c , rel F + , rel, c , rel F − ( ( p˜81 r˜81 E8 ( p˜81 r˜81 )) + ) + ( ( p˜81 r˜81 E8 ( p˜81 r˜81 )) + )| (1) ≤ |+ ( p, r˜81, cE8 ( p˜81, r˜81)) − + ( p˜81, r˜81, cE8 ( p˜81, r˜81)|

+ |+ ( p˜81, r˜81, cE8 ( p˜81, r˜81)) − + ( p, r˜81, cE8 ( p, r˜81)| + , , c , + , rel, c , rel F + | ( p r˜81 E8 ( p r˜81) − ( ( p˜81 r˜81 E8 ( p˜81 r˜81 )) + )| (B.30) + , rel, c , rel F + , rel, c , rel F + |( ( p˜81 r˜81 E8 ( p˜81 r˜81 )) + ) − ( ( p r˜81 E8 ( p˜81 r˜81 )) + )| (2) ≤ |+ ( p, r˜81, cE8 ( p˜81, r˜81)) − + ( p˜81, r˜81, cE8 ( p˜81, r˜81)|

+ |+ ( p˜81, r˜81, cE8 ( p˜81, r˜81)) − + ( p, r˜81, cE8 ( p, r˜81)| + , rel, c , rel + , rel, c , rel + | ( p r˜81 E8 ( p r˜81 )) − ( p˜81 r˜81 E8 ( p˜81 r˜81 ))| + , rel, c , rel + , rel, c , rel , + | ( p˜81 r˜81 E8 ( p˜81 r˜81 )) − ( p r˜81 E8 ( p˜81 r˜81 ))| (B.31) 185 rel+ ⊥ where (a) applies r˜81 = r˜81 r˜81, linearity of the value function in the rewards, and the definition of F; (b) rearranges terms and uses the triangle inequality; and (c) applies + ( , , c ( , )) + ( , rel, c ( , rel)) + F Eq. (B.28) to line (B.30), that is, p r˜81 E8 p r˜81 = p r˜81 E8 p r˜81 . Each of the four terms in Eq. (B.31) can be upper-bounded with high probability 8 1 Y using Lemma 13. In particular, for large enough , each term is less than 4 with 1 X probability at least 1 − 4 . Therefore, the desired result holds.

Remark 7. The reward dimensions could be irrelevant due to a number of reasons.

For instance, in preference-based RL, because the elements of x8 := x82−x81 must sum ) to zero, the vector [1, 1,..., 1] must always be orthogonal to every observation x8. Alternatively, the MDP’s transition dynamics could constrain the expected number of visits to a particular state to be constant regardless of the policy.

Such constraints result in a subspace of R3 that is irrelevant to learning the optimal policy once the transition dynamics model has converged sufficiently. Therefore,

Lemma 7 must only be satisfied for eigenvalues of "8 along relevant dimensions in order to asymptotically select the optimal policy. Thus, the analysis can assume that

sampled reward vectors r˜81, r˜82 have been projected onto the relevant subspace of 2  3 V8 (X) 8 R . As a result, in proving that (8) −→ 0 as −→ ∞, the analysis assumes that _3 all eigenvectors of "˜ 8 belong to the relevant subspace without loss of generality. (8) "˜ More formally, without loss of generality, all eigenvectors {v 9 } of 8 are assumed to satisfy Eq. (B.27).

The analysis leverages Lemmas 8 and 9, which were proven in the bandit analysis, but also apply to the preference-based RL setting. Analogously to Lemma 10 in the h i bandit setting, the next step is to prove that E x) v(8) ≥ 20 > 0; however, the proof 8 3 is more involved in the RL setting, as it must account for the stochastic translation of policies to trajectories via the MDP transition dynamics. Intuitively, Lemmas 8 and 9 prove that under the contradiction hypothesis, there is a non-decaying probability , (8) "˜ of sampling rewards r˜81 r˜82 that are highly-aligned with the eigenvector v3 of 8. Lemma 15, proven below, demonstrates that given such reward samples, there is a lower-bounded probability of sampling trajectories that have a non-zero projection onto this eigenvector.

Lemma 15 (Preference-based RL). Assume that there exists 80 such that for 8 > 80,  −1 V (X)2 _(8) ≥ U 80 ≥ 8 20 > 8 3 . Then, there exists 0 and a constant 0 such that for 186 8 > 80: h i E x) v(8) ≥ 20 > 0, (B.32) 8 3 where 20 > 0 depends only on the MDP parameters p and r, so that in particular, (8) Eq. (B.32) holds for any eigenvector v3 .

Proof. By Lemma 9, for any Y > 0, the following events have nonzero, uniformly- (8) lower-bounded probability for all unit vectors v3 : r˜ r˜ 81 − (8) < Y 82 − (− (8)) < Y. v3 and v3 ||r˜81||2 2 ||r˜82||2 2

Y r˜81 (8) < Y Next, this proof shows that for small-enough , the event || || − v r˜81 2 3 2 c (8) implies that the expected reward accrued by 81 with respect to v3 —that is, + ( , (8), c ) p v3 81 —is close to the maximum possible expected reward with respect (8) + ( , (8), c) + ( , (8), c ( , (8))) to v3 : maxc p v3 = p v3 E8 p v3 . (The same approach yields an equivalent result for r˜82.)

r˜81 (8) < Y Y0 > Y Assume that || || − v , and let 0. For small-enough and when r˜81 2 3 2 p˜81 has sufficiently converged in distribution to p (as is guaranteed to occur by |+ ( , (8), c ( , (8))) − + ( , (8), c )| < Y0 Proposition 3), one can show that p v3 E8 p v3 p v3 81 :

+ ( p,v(8), c ( p, v(8))) − + ( p, v(8), c ) (B.33) 3 E8 3 3 81 (0)     r˜  + , (8), c ( , (8)) − + , (8), c , 81 = p v3 E8 p v3 p v3 E8 p˜81 ||r˜81||2

   r˜  r˜  = + p, v(8), c ( p, v(8)) − + p, 81 , c p, 81 3 E8 3 E8 ||r˜81||2 ||r˜81||2       r˜81 r˜81 r˜81 r˜81 + + p, , cE8 p, − + p˜81, , cE8 p˜81, ||r˜81||2 ||r˜81||2 ||r˜81||2 ||r˜81||2       r˜81 r˜81 r˜81 r˜81 + + p˜81, , cE8 p˜81, − + p, , cE8 p˜81, ||r˜81||2 ||r˜81||2 ||r˜81||2 ||r˜81||2

 r˜  r˜    r˜  + + p, 81 , c p˜ , 81 − + p, v(8), c p˜ , 81 E8 81 3 E8 81 ||r˜81||2 ||r˜81||2 ||r˜81||2 (1)  r˜  r˜    r˜  ≤ + , 81 , c , 81 − + , (8), c , 81 p E8 p˜81 p v3 E8 p˜81 ||r˜81||2 ||r˜81||2 ||r˜81||2 (B.34) 187    r˜  r˜  + + , (8), c ( , (8)) − + , 81 , c , 81 p v3 E8 p v3 p E8 p (B.35) ||r˜81||2 ||r˜81||2       r˜81 r˜81 r˜81 r˜81 + + p, , cE8 p, − + p˜81, , cE8 p˜81, ||r˜81||2 ||r˜81||2 ||r˜81||2 ||r˜81||2 (B.36)       r˜81 r˜81 r˜81 r˜81 + + p˜81, , cE8 p˜81, − + p, , cE8 p˜81, , ||r˜81||2 ||r˜81||2 ||r˜81||2 ||r˜81||2 (B.37)

where (a) uses that c81 = cE8 ( p˜81, r˜81) by definition, and also that positive scaling

of the reward argument of cE8 ( p, r) does not affect its output; and (b) applies the triangle inequality. Next, the proof will show that each of Eq.s (B.34)-(B.37) can 1 Y0 be upper-bounded by 4 (with high probability for Eq.s (B.36) and (B.37)) by an  appropriate choice of Y and by utilizing that p˜81 −→ p (Proposition 3).

Beginning with line (B.34), because + ( p, r, c) is linear in r, it is uniformly con- tinuous in r for fixed transition dynamics p and policy c. So, for fixed dynamics p 0 and policy c and for any reward vector r, there exists Yc such that if ||r − r || < Yc, + , , c + , 0, c < 1 Y0 then | ( p r ) − ( p r )| 4 . Because there are finitely-many deterministic policies, there exists Y1 > 0 such that Y1 ≤ Yc for all c. Therefore, for any policy c 0 < Y + , , c + , 0, c < 1 Y0 , if ||r − r ||2 1, then | ( p r ) − ( p r )| 4 . The expression in line 1 Y0 Y < Y (B.34) is thus upper-bounded by 4 if 1.

To upper-bound line (B.35), observe that + ( p, r, cE8 ( p, r)) = maxc + ( p, r, c) is also uniformly continuous in r for fixed transition dynamics p: the maximum over finitely-many uniformly continuous functions is also uniformly continuous. Thus, 0 there exists Y2 > 0 such that if ||r − r ||2 < Y2, then: 1 |+ ( p, r, c ( p, r)) − + ( p, r0, c ( p, r0))| < Y0. E8 E8 4 1 Y0 Y < Y The expression in line (B.35) is thus upper-bounded by 4 if 2. To obtain high-probability upper bounds for lines (B.36) and (B.37), one can apply Lemma 13. For sufficiently-high 8, each of the following holds with probability at 1 X0 least 1 − 2 :       r˜81 r˜81 r˜81 r˜81 1 0 + p, , cE8 p, − + p˜81, , cE8 p˜81, < Y , ||r˜81||2 ||r˜81||2 ||r˜81||2 ||r˜81||2 4     r˜81 r˜81 1 0 + p˜81, , c − + p, , c < Y , ||r˜81||2 ||r˜81||2 4 188 where the second statement holds for any policy c, and in particular for c =   c p˜ , r˜81 . E8 81 ||r˜81||2

Next, combine the bounds for lines (B.34)-(B.37) while setting Y < min{Y1,Y2} and 8 > 80. Then, for any Y0,X0 > 0, the above analysis has shown that by setting Y small enough and taking 8 > 80:

+ ( p, v(8), c ( p, v(8))) − + ( p, v(8), c ) < Y0 with probability at least 1 − X0. 3 E8 3 3 81

c − (8) Y0,X0 > Combining with the analogous result for 82 and v3 yields that for any 0, there exists sufficiently-small Y and large-enough 80 such that for 8 > 80:

+ ( p, v(8), c ( p, v(8))) − + ( p, v(8), c ) < Y0 with probability at least 1 − X0, and 3 E8 3 3 81 (B.38)

+ ( p, −v(8), c ( p, −v(8))) − + ( p, −v(8), c ) < Y0 with probability at least 1 − X0. 3 E8 3 3 82

Next, the proof will set Y0 to a small enough number to achieve E[x) v(8)] > Y0 > 0. 8 3 | [ ) (8)]| c c ( , (8)) Firstly, note that E x8 v3 is maximized when setting 81 = E8 p v3 and c c ( , − (8)) 82 = E8 p v3 :

[ ) (8)] | c c , c c [ ) (8) − ) (8)] | c c , c c max E x8 v3 81 = 1 82 = 2 = max E x81v3 x82v3 81 = 1 82 = 2 c1,c2 c1,c2

= E[x) v(8) − x) v(8) | c = argmax E[x) v(8)], c = argmin E[x) v(8)]] 81 3 82 3 81 c 81 3 82 c 82 3

= E[x) v(8) − x) v(8) | c = argmax E[x) v(8)], c = argmax E[x) (−v(8))]] 81 3 82 3 81 c 81 3 82 c 82 3

= E[x) v(8) − x) v(8) | c = argmax + ( p, v(8), c), c = argmax + ( p, −v(8), c)] 81 3 82 3 81 c 3 82 c 3

= E[x) v(8) − x) v(8) | c = c ( p, v(8)), c = c ( p, −v(8))] . 81 3 82 3 81 E8 3 82 E8 3

From Lemma 14 (also, see Remark 7), one can assume without loss of generality

that for all v(8), max E[x) v(8) | c = c , c = c ] > 0. 3 c1,c2 8 3 81 1 82 2

Because E[x) v(8) | c = c , c = c ] is continuous in v(8) for fixed c , c , and a 8 3 81 1 82 2 3 1 2 maximum over finitely-many continuous functions is continuous:

[ ) (8) | c c , c c ] (8). max E x8 v3 81 = 1 82 = 2 is also continuous in v3 c1,c2

(8) Because v3 belongs to the compact set of unit vectors, the expression achieves a (8) [ > minimum positive value on the set of possible v3 , and thus, there exists 0 such 189

that max E[x) v(8) | c = c , c = c ] ≥ [ > 0. Setting Y0 := [ : c1,c2 8 3 81 1 82 2 3

< Y0 [ ≤ [ ) (8) | c c , c c ] 0 3 = max E x8 v3 81 = 1 82 = 2 c1,c2

[ ) (8) | c c ] − [ ) (8) | c c ] = max E x81v3 81 = 1 min E x82v3 82 = 2 c1 c2

+ ( , (8), c ) − + ( , (8), c ) = max p v3 1 min p v3 2 c1 c2

+ ( , (8), c ) + [−+ ( , (8), c )] = max p v3 1 max p v3 2 c1 c2

+ ( , (8), c ) + + ( , − (8), c ) = max p v3 1 max p v3 2 c1 c2

= + ( p, v(8), c ( p, v(8))) + + ( p, −v(8), c ( p, −v(8))) 3 E8 3 3 E8 3 (0) ≤ + ( p, v(8), c ) + + ( p, −v(8), c ) + 2Y0, 3 81 3 82

where (a) holds with probability at least 1 − 2X0 by Eq. (B.38). Subtracting 2Y0 from the right-hand side of inequality (a) and from [ = 3Y0 yields that with probability at least 1 − 2X0,

Y0 < + ( p, v(8), c ) + + ( p, −v(8), c ) = + ( p, v(8), c ) − + ( p, v(8), c ) = E[x) v(8)] . 3 81 3 82 3 81 3 82 8 3

This implies that:

h i (0) h i E x) v(8) ≥ E x) v(8) ≥ 20 > 0 for some positive 20 and for all 8 > 80 and v(8), 8 3 8 3 3

where (a) holds via Jensen’s inequality and 20 := Y0.

To finish proving that in preference-based RL, the utility model is asymptotically consistent, Lemma 11 can be applied exactly as in the bandit case. The proof of Lemma 11 is the same in the RL setting as in the bandit case, except that Lemma 10 must be replaced by Lemma 15, and Eq. (B.16) must be replaced by Eq. (B.32).

Combining Lemmas 7 and 11 leads to the final result:

Proposition 5. When running DPS in the preference-based RL setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13), then with probability 1 − X, the sampled utilities  r˜81, r˜82 converge in distribution to their true values, r˜81, r˜82 −→ r, as 8 −→ ∞. 190 B.4 Asymptotic Consistency of the Selected Policies in DPS From the asymptotic consistency of the dynamics and reward samples, one can show that the sampled policies converge in distribution to the optimal policy: Theorem 1. Assume that DPS is executed in the preference-based RL setting, with utilities given via either the linear or logistic link functions and with posterior sampling distributions given in Eq. (4.13).

With probability 1 − X, the sampled policies c81, c82 converge in distribution to the ∗ ∗ ∗ optimal policy, c , as 8 −→ ∞. That is, %(c81 = c ) −→ 1 and %(c82 = c ) −→ 1 as 8 −→ ∞.

Under the same conditions, with probability 1 − X, in the generalized linear duel- ing bandit setting with a finite action space A, the sampled actions converge in ∗ ∗ distribution to the optimal actions: %(x81 = x ) −→ 1 and %(x82 = x ) −→ 1 as 8 −→ ∞.

Proof. First, consider the preference-based RL case.

∗ It suffices to show that %(c81 = c ) −→ 1 as 8 −→ ∞, as the proof is identical   for c82. From Propositions 3 and 5, respectively, p˜81 −→ p and r˜81 −→ r with  probability 1 − X. The proof proceeds under the assumption that r˜81 −→ r, i.e., that the probability-(1 − X) event occurs.  By Fact 8 in Appendix B.1, for each fixed c, + ( p˜81, r˜81, c) −→ + ( p, r, c), as value functions are continuous in the dynamics and reward parameters. Applying Fact 9 in Appendix B.1, for each fixed c and Y > 0:

%(|+ ( p˜81, r˜81, c) − + ( p, r, c)| > Y) −→ 0 as 8 −→ ∞. (B.39)

Next, the value of Y is set to less than half of the smallest gap between the value of the optimal policy and the value of any suboptimal policy: 1   Y < max+ ( p, r, c) − max + ( p, r, c) . 0 2 c c s.t. + ( p,r,c) Y} c (1) Õ (2) ≤ % (|+ ( p˜81, r˜81, c) − + ( p, r, c)| > Y) −→ 0 as 8 −→ ∞, c 191 where (a) follows from the definition of Y, (b) follows from the union bound, and (c) holds due to Eq. (B.39).

In the generalized linear dueling bandit setting with a finite action space, the proof follows along identical lines, with the following differences: policies c are replaced by actions x ∈ A, there are no transition dynamics, and the bandit value function ) is defined as +1 (r, x) := r x for x ∈ A and reward vector r. Proposition 4 is also used in place of Proposition 5. With these substitutions, the steps of the proof are the same as those above. 192 A p p e n d i x C

ADDITIONAL DETAILS ABOUT THE DUELING POSTERIOR SAMPLING EXPERIMENTS IN THE LINEAR AND LOGISTIC DUELING BANDIT SETTINGS

This appendix provides additional information about the DPS simulations in the linear and logistic dueling bandit settings, discussed in Chapter 4. In particular, it describes the two baseline algorithms compared against DPS and gives hyperpa- rameter optimization details for both DPS and the baselines algorithms.

C.1 Baselines: Sparring with Upper Confidence Bound (UCB) Algorithms Sparring, introduced in Ailon, Karnin, and Joachims (2014), is a framework for extending standard bandit algorithms—which expect to receive absolute, numerical reward feedback—to the dueling bandit setting. Sparring instantiates two standard bandit algorithms, called Player 1 and Player 2, which learn by directly competing

against each other. In each learning iteration 8, Player 1 selects an action x81 ∈ A,

while Player 2 selects an action x82 ∈ A. These two actions are dueled against one another, and each of the players learns whether its action was preferred or dominated.

The two baseline algorithms couple Sparring with the linear upper confidence bound (UCB) algorithm (Abbasi-Yadkori, Pál, and Szepesvári, 2011) and the logistic UCB algorithm (Filippi et al., 2010), respectively. The linear and logistic UCB algorithms can be viewed as subroutines that fit within the Sparring framework proposed in Ailon, Karnin, and Joachims (2014). The linear and logistic UCB Sparring approaches are each described in further detail below.

Sparring with Linear UCB This section describes the procedure for Sparring with linear UCB (Abbasi-Yadkori, Pál, and Szepesvári, 2011), outlined in Algorithm 15. Without loss of generality, consider Player 2 in the Sparring framework. In the 8th learning iteration, Player 2 H  1 , 1 H 1 selects action x82 and receives 8 ∈ − 2 2 as feedback, where 8 = 2 if x82 x81, H 1 H and 8 = − 2 if x81 x82. (Analogously, Player 1 selects action x81 and receives − 8 as reward feedback.)

On each iteration, Player 2 uses Linear UCB to select an action. First, Player 2 193 Algorithm 15 Sparring with Linear UCB H (1) ∅ 1: 0 = ⊲ Initialize Player 1’s history H (2) ∅ 2: 0 = ⊲ Initialize Player 2’s history 3: for 8 = 1, 2,... do 4: for 9 ∈ {1, 2} do ( 9) ( 9) 5: Player 9 calculates "8 via Eq. (C.2) and H8 ( 9) ( 9) 6: Player 9 calculates rˆ8 via Eq. (C.1) and H8 7: Player 9 selects x8 9 via Eq. (C.4) 8: end for  1 1 9: Execute actions x81, x82 and observe feedback H8 ∈ − 2 , 2 H (1) H (1) ∪ ( − ) 10: 8 = 8−1 x81, H8 ⊲ Update Player 1’s history H (2) H (2) ∪ ( ) 11: 8 = 8−1 x82,H8 ⊲ Update Player 2’s history 12: end for

calculates a MAP estimate of the utilities r, analogously to Eq. (4.2). When Player 2’s history contains = − 1 trials, this is given by: − −1 = 1 (2)  (2)  Õ rˆ= := "= H8x82, where (C.1) 8=1 =−1 " (2) _ Õ ) _ . = := + x82x82 for ≥ 1 (C.2) 8=1

(1) Player 1 calculates a MAP estimate rˆ= analogously, except that each iteration 8 ∈ {1, . . . , = − 1}, x81 and −H8 are substituted for x82 and H8 in Eq.s (C.1) and (C.2). In this work, the implementation of linear UCB adds an intercept parameter to the utility vector. This is accomplished by appending the element 1 to each action in A in the utility inference step, to obtain a (3 + 1)-dimensional action space; then, the inferred utility vector r belongs to R3+1, and its (3 + 1)th element is the intercept. Including the intercept significantly improves the resulting algorithm’s performance. (2) Player 2 then estimates a confidence region centered at rˆ= , and intuitively, chooses an action which optimistically maximizes the expected reward within that confidence set. More formally, let R=2 be Player 2’s confidence set in iteration =. Then, the action

x=2 is selected by solving the following optimization problem:

) x=2 = argmax max r x. (C.3) x∈A r∈R=2

The approach in Abbasi-Yadkori, Pál, and Szepesvári (2011) utilizes ellipsoidal confidence sets, of the form: n (2) o R=2 = r ||rˆ= − r|| (2) ≤ V= (X) , "= 194 p where V= (X) is a function that increases slowly in = according to log =, and whose

definition is given in Eq. (4.4) (where in Eq. (4.4), "= is now given by Eq. (C.2)).

It can be shown (Filippi et al., 2010) that the optimization in Eq. (C.3) is equivalent to: h ) (2) i = + V (X)|| || (2) . x=2 argmax x rˆ= = x (" )−1 (C.4) x∈A = (1) This optimization problem is analogous for Player 1, except that Player 1 uses rˆ= (1) (2) (2) and "= instead of rˆ= and "= . In practice, it is straightforward to solve the optimization in Eq. (C.4) over a finite action space, since the objective can simply be computed over all possible actions.

Importantly, the values V= (X) in Eq. (C.4) enable Abbasi-Yadkori, Pál, and Szepesvári (2011) to prove worst-case regret guarantees, but yield relatively-conservative per- formance in practice. In fact, the performance can be significantly improved by replacing V= (X) by a constant V > 0, and then treating V as a tunable hyperpa- rameter. This approach is taken in the current implementation, which optimizes the performance of Sparring with Linear UCB over two hyperparameters: V and _.

In Algorithm 15, lines 5-7 summarize Linear UCB’s procedure for selecting actions.

Sparring with Logistic UCB Algorithm 16 outlines the procedure for Sparring with logistic UCB (Filippi et al., 2010), which extends linear UCB to model feedback generation via a sigmoidal link function. As described for linear UCB above, an intercept parameter is added to the utility vector. Within the Sparring framework, Player 2 calculates a utility estimate (2) th rˆ= on the = learning iteration as follows:

− = 1   1  (2) Õ 6 ) H , rˆ= = argmin log(x82 r)x82 − 8 + x82 (C.5) 2 −1 r∈Θ =  (2)  8 1 "=

−G −1 (2) where 6log is the sigmoidal link function, i.e. 6log(G) = (1 + 4 ) , and "= is 1 calculated using Eq. (C.2) as in the linear UCB case. The additive 2 translates the preference outcomes to 0/1 labels. Lastly, Θ is a compact set of allowable utility vector estimates r, where Θ ∈ R3+1 due to the intercept term added to the utility (2) model. Projecting the utility estimate onto the compact set Θ ensures that rˆ= cannot have an arbitrarily-large magnitude. 195 Algorithm 16 Sparring with Logistic UCB H (1) ∅ 1: 0 = ⊲ Initialize Player 1’s history H (2) ∅ 2: 0 = ⊲ Initialize Player 2’s history 3: for 8 = 1, 2,... do 4: for 9 ∈ {1, 2} do ( 9) ( 9) 5: Player 9 calculates "8 via Eq. (C.2) and H8 ( 9)MAP ( 9) 6: Player 9 calculates rˆ= via Eq. (C.7) and H8 ( 9) ( 9) 7: Player 9 calculates rˆ= via Eq. (C.6) and H8 8: Player 9 selects x8 9 via Eq. (C.8) 9: end for  1 1 10: Execute actions x81, x82 and observe feedback H8 ∈ − 2 , 2 H (1) H (1) ∪ ( − ) 11: 8 = 8−1 x81, H8 ⊲ Update Player 1’s history H (2) H (2) ∪ ( ) 12: 8 = 8−1 x82,H8 ⊲ Update Player 2’s history 13: end for

Because solving the optimization problem in Eq. (C.5) is challenging in practice, the simulations use a computationally-simpler estimate of the utilities:

(2) MAP − , rˆ= = argmin rˆ= r 2 (C.6) r∈Θ

(2)MAP (2) where rˆ= is the MAP estimate of the utilities. Letting D= be Player 2’s dataset (2) after = − 1 preferences, such that D= := {x12,H1, x22,H2,..., x(=−1)2,H=−1}, the MAP estimate is given as follows (analogously to Eq. (4.7) for DPS):

(2)MAP h  (2) i h  (2) i rˆ= = argmin − log ? r | D= = argmin − log ?(r) − log ? D= | r r r ( "=−1 #) (0) Ö 1 = argmin _−1||r||2 − log 2 + (− H ) ) r 8=1 1 exp 2 8x82 r ( =−1 ) _−1 2 Õ H ) , = argmin ||r||2 + log(1 + exp(−2 8x82 r)) (C.7) r 8=1 where step (a) applies the Gaussian prior ?(r) ∼ N (0, _) for some prior hyperpa- rameter _ > 0. This optimization procedure is identical for Player 1, except that x81 and −H8 are substituted for x82 and H8, respectively. The optimization problem in Eq. (C.7) is convex, and therefore can be solved via standard convex optimizers.

th (2)MAP (2) In the = iteration, given rˆ= and "= , Player 2 selects an action by solving the following optimization problem:

h ) (2)MAP i = 6 ( ) + V (X)|| || (2) , x=2 argmax log x rˆ= = x (" )−1 (C.8) x∈A = 196 where V= (X) is defined in Filippi et al. (2010) and given in Eq. (4.11), and increases p in = according to log =. As in the linear UCB case, the V= (X) values allow Filippi et al. (2010) to prove worst-case regret guarantees, but yield conservative behavior in practice. Thus, faster convergence can be achieved by replacing V= (X) by a constant, tunable hyperparameter V > 0. In this work, the performance of Sparring with logistic UCB is optimized over two hyperparameters in total: V and _.

In Algorithm 16, lines 5-8 summarize the action selection process for logistic UCB.

C.2 Hyperparameter Optimization The linear and logistic dueling bandit simulations compare four algorithms: DPS with linear utility inference, DPS with logistic utility inference, Sparring with linear UCB, and Sparring with logistic UCB.

Table C.1 lists the hyperparameter ranges tested for these four algorithms, as well as the optimized hyperparameter values (these are the values used to generate the performance curve plots in Section 4.5). For linear and logistic UCB, the hyper- parameters were optimized individually for each dimension 3 and preference noise level. In contrast, for each of the two versions of DPS, a single set of hyperparame- ters was identified that performed well across the different values of 3 and feedback noise levels considered. Note that for Sparring with logistic UCB, the simulations set Θ equal to a sphere of radius 103 centered at the origin. Hyperparameters were optimized by considering 20 simulation runs of each candidate set of values.

Finally, Figure C.1 compares the performance of both linear and logistic DPS across several of the different hyperparameter settings evaluated. Overall, DPS performs well and is robust to the choice of hyperparameter values to a certain degree. 197

Table C.1: Hyperparameters in the linear and logistic dueling bandit experiments. For the linear and logistic UCB algorithms, the hyperparameters were optimized individually for each dimension 3 and preference noise level. For each 3, the best- performing hyperparameters are listed in the following order: logistic noise with 2 = 0.01, logistic noise with 2 = 0.1, logistic noise with 2 = 1, and linear noise with 2 = 4. For each of the two versions of DPS, a single set of hyperparameters was found that performed well across the different values of 3 and noise levels considered. Hyperparameters were optimized by performing 20 simulation runs of each candidate set of values.

Algorithm Hyperparameter Range Tested Optimized Value

DPS (linear) f [0.01, 5] 0.5 _ [0.01, 30] 10

DPS (logistic) U [0.01, 10] 0.03 _ [0.005, 10] 0.005

Sparring (linear UCB) V [0.001, 20] 3 = 3: [0.5, 1, 1, 0.5] 3 = 5: [0.5, 1, 1, 5] 3 = 10: [1, 1, 1, 1] _ [0.001, 20] 3 = 3: [10, 5, 5, 0.05] 3 = 5: [0.01, 0.5, 20, 0.01] 3 = 10: [1, 1, 0.01, 0.001]

Sparring (linear UCB) V [0.001, 20] 3 = 3: [0.5, 1, 0.01, 0.005] 3 = 5: [1, 1, 1, 0.005] 3 = 10: [0.5, 0.5, 0.5, 0.005] _ [0.001, 20] 3 = 3: [1, 5, 0.1, 0.005] 3 = 5: [1, 1, 0.5, 0.05] 3 = 10: [1, 1, 0.5, 0.001] 198

(a) 3 = 3, linear DPS, 2 = 1 (b) 3 = 5, linear DPS, 2 = 0.1 (c) 3 = 10, linear DPS, 2 = 0.01

(d) 3 = 3, logistic DPS, 2 = 1 (e) 3 = 5, logistic DPS, 2 = 0.1 (f) 3 = 10, logistic DPS, 2 = 0.01 Figure C.1: Hyperparameter sensitivity of DPS in the linear and logistic dueling bandit settings (mean ± std over 20 runs). The plots show DPS with utility inference via both Bayesian linear (a-c) and logistic (d-f) regression. The learning curves compare several sets of hyperparameters among those evaluated, for several different action space dimensionalities 3 and levels of logistic user feedback noise 2. The plots normalize the rewards for each simulation run such that the best action in A has a reward of 1, while the worst action has a reward of 0. Overall, DPS performs well and is robust to the hyperparameter values to a certain degree. 199 A p p e n d i x D

ADDITIONAL DETAILS ABOUT THE DUELING POSTERIOR SAMPLING EXPERIMENTS IN THE PREFERENCE-BASED RL SETTING

Python code for reproducing the DPS experiments in preference-based RL is lo- cated at: https://github.com/ernovoseller/DuelingPosteriorSampling (Novoseller et al., 2020a).

Experiments were conducted in three simulated environments, described in Section 4.5: RiverSwim and random MDPs (Osband, Russo, and Van Roy, 2013) and the simplified version of the Mountain Car problem described in Wirth, 2017. The first two cases used a fixed episode horizon of 50, while for the Mountain Car, episodes have a maximum length of 500, but terminate sooner if the agent reaches the goal state. Figures D.1, D.2, and D.3 display performance in the three environments for the five degrees of user preference noise evaluated. Experiments were run on an Ubuntu 16.04.3 machine with 32 GB of RAM and an Intel i7 processor. Some experiments were also run on an AWS server.

This section details the ranges of hyperparameter values tested for the different DPS credit assignment models, as well as the particular hyperparameters used in the displayed performance curves. Hyperparameters were tuned by considering mean performance over 30 experiment repetitions for each parameter setting considered. Only the least-noisy simulated user preference feedback (logistic noise, 2 = 0.001) was used to tune the hyperparameters; this value of 2 is sufficiently-small that the preferences are nearly-deterministic except for in tie cases, in which preferences are generated uniformly at random. For both Gaussian process-based credit assignment models, the squared exponential kernel was used:

2! 1  || 5 (B˜8) − 5 (B˜9 )||  :(B˜ , B˜ ) = f2 exp − + f2X , 8 9 5 2 ; = 8 9

f2 ; f2 where 5 is the signal variance, is the kernel lengthscale, = is the noise variance, < X8 9 is the Kronecker delta function, and 5 : {1,...,(} × {1, . . . , } −→ R maps each state-action pair to an <-dimensional representation that encodes proximity between the state-action pairs. In the RiverSwim and Random MDP environments, 200

(a) 2 = 0.0001, logistic (b) 2 = 1, logistic (c) 2 = 2, logistic

(d) 2 = 10, logistic (e) 2 = 100, linear Figure D.1: Empirical performance of DPS in the RiverSwim environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model.

; : B , B f2 f2 X only lengthscales of = 0 were considered, such that ( ˜8 ˜9 ) = ( 5 + = ) 8 9 ; in these cases, 5 maps each state-action pair onto some unique index, and by 0 5 B convention, log 0 = 0. For the Mountain Car, ( ˜8) mapped each state-action pair to a 3-dimensional vector of the form (position, velocity, action).

Please see Appendix A for definitions of the other hyperparameters in the credit assignment models. Tables D.2, D.3, and D.4 display both the tested ranges and optimized values (those appearing in the performance curves) for each case.

To model the transition dynamics, DPS uses a Dirichlet prior and posterior. To make the same prior assumptions with respect to each state-action pair, all parameters of the Dirichlet prior were set to the same value, U0 > 0: thus, the Dirichlet prior ) parameters are a length-3 vector of the form, U0 ∗ [1, 1,..., 1] . For the RiverSwim and Random MDP environments, U0 is set to 1, creating a uniform prior over all dynamics models. Because U0 = 1 is a reasonable choice when the numbers of states and actions are small, this work does not optimize performance over different U0 values in the RiverSwim and Random MDP environments. For the Mountain Car problem, smaller prior values perform better because they favor sparse dynamics 201

distributions. For this environment, values of U0 ranging from 0.0001 to 1 were tested, and 0.0005 was found to be the best-performing value among the ones tested.

The EMPC algorithm (Wirth and Fürnkranz, 2013a) has two hyperparameter values, U and [. Both of these were optimized jointly via a grid search over the values (0.1, 0.2,..., 0.9), with 100 repetitions of each pair of values. The best-performing hyperparameter values (i.e. those achieving the highest total reward) are displayed in Table D.1; these are the hyperparameter values depicted in the performance curve plots.

Finally, Figures D.4, D.5, and D.6 illustrate how DPS’s performance varies as the hyperparameters are modified over a set of representative values from the tested ranges. These plots demonstrate that DPS is largely robust across many choices of model hyperparameters.

Table D.1: Hyperparameters for the EPMC baseline algorithm (Wirth and Fürnkranz, 2013a). Each table element shows the best-performing U/[ values for the corre- sponding simulation environment and type of simulated user feedback (logistic or linear noise). For preference feedback with logistic noise, values of 2 are given in parentheses; larger values correspond to noisier preference feedback.

Noise: Logistic (10) Logistic (2) Logistic (1) Logistic (0.0001) Linear

RiverSwim 0.1/0.8 0.3/0.7 0.1/0.2 0.8/0.8 0.3/0.1 Random MDPs 0.2/0.2 0.7/0.7 0.6/0.4 0.2/0.8 0.7/0.1

Noise: Logistic (100) Logistic (20) Logistic (10) Logistic (0.0001) Linear

MountainCar 0.1/0.8 0.1/0.7 0.1/0.6 0.1/0.4 0.2/0.5 202

Table D.2: Credit assignment hyperparameters tested for the RiverSwim Environ- ment.

Model Hyperparameter Range Tested Optimized Value

Bayesian linear regression f [0.05, 5] 0.5 _ [0.01, 10] 0.1

2 GP regression f5 [0.001, 0.5] 0.1 ; [0, 0] ([state, action]) 0 2 f= [0.0001, 0.1] 0.001

2 2 GP preference (special case: _ = f5 + f= [0.1, 30] 1 Bayesian logistic regression) U [0.01, 1] 1

GP preference (varying 2) 2 [1, 13] N/A 2 f5 [1] ; [0, 0] ([state, action]) 2 f= [0.001] U [1]

Table D.3: Credit assignment hyperparameters tested for the Random MDP Envi- ronment.

Model Hyperparameter Range Tested Optimized Value

Bayesian linear regression f [0.05, 5] 0.1 _ [0.01, 20] 10

2 GP regression f5 [0.001, 1] 0.05 ; [0, 0] ([state, action]) 0 2 f= [0.0001, 0.1] 0.0005

2 2 GP preference (special case: _ = f5 + f= [1, 15] 0.1 Bayesian logistic regression) U [0.01, 1] 0.01

GP preference (varying 2) 2 [0.0001, 1000] N/A 2 f5 [1] ; [0, 0] ([state, action]) 2 f= [0.03] U [1] 203

(a) 2 = 0.0001, logistic (b) 2 = 1, logistic (c) 2 = 2, logistic

(d) 2 = 10, logistic (e) 2 varies, linear Figure D.2: Empirical performance of DPS in the Random MDP environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Normalization is with respect to the total reward achieved by the optimal policy. Overall, DPS performs well and is robust to the choice of credit assignment model.

(a) 2 = 0.0001, logistic (b) 2 = 10, logistic (c) 2 = 20, logistic

(d) 2 = 100, logistic (e) 2 = 1, 000, linear (f) Legend Figure D.3: Empirical performance of DPS in the Mountain Car environment. Plots display mean +/- one standard deviation over 100 runs of each algorithm tested. Overall, DPS performs well and is robust to the choice of credit assignment model. 204

Table D.4: Credit assignment hyperparameters tested for the Mountain Car Envi- ronment.

Model Hyperparameter Range Tested Optimized Value

Bayesian linear regression f [0.001, 30] 10 _ [0.001, 10] 1

2 GP regression f5 [0.0001, 10] 0.01 ; [G, G, 0], G ∈ [1, 3] G = 2 ([position, velocity, action]) 2 f= [1e-7, 0.01] 1e-5

2 2 GP preference (special case: _ = f5 + f= [0.0001, 10] 0.0001 Bayesian logistic regression) U [0.0001, 1] 0.01

GP preference (varying 2) 2 [10, 10000] N/A 2 f5 [1] ; [2, 2, 0] ([position, velocity, action]) 2 f= [0.001] U [1] 205

(a) Bayesian linear regression (b) Gaussian process regression

(c) Gaussian process preference model (d) Gaussian process preference model (special case: Bayesian logistic regres- (varying 2) sion) Figure D.4: Empirical performance of DPS in the RiverSwim environment for dif- ferent hyperparameter combinations. Plots display mean +/- one standard deviation over 30 runs of each algorithm tested with logistic user noise and 2 = 0.001. Over- all, DPS is robust to the choice of hyperparameters. The hyperparameter values depicted in each plot are (from left to right): for Bayesian linear regression, (f, _) = . , . . , . , . . , , . f2, f2 {(0 5 0 1), (0 5 10), (0 1 0 1), (0 1 10), (1 0 1)}; for GP regression, ( 5 = ) = {(0.1, 0.001), (0.1, 0.1), (0.01, 0.001), (0.001, 0.0001), (0.5, 0.1)}; for Bayesian logistic regression (special case of the GP preference model), (_,U) = {(1, 1), (30, 1), (20, 0.5), (1, 0.5), (30, 0.1)}; and additionally for the GP preference model, 2 ∈ {0.5, 1, 2, 5, 13}. See Table D.2 for the values of any hyperparameters not specifically mentioned here. 206

(a) Bayesian linear regression (b) Gaussian process regression

(c) Gaussian process preference model (d) Gaussian process preference model (special case: Bayesian logistic regres- (varying 2) sion) Figure D.5: Empirical performance of DPS in the Random MDP environment for different hyperparameter combinations. Plots display mean +/- one standard devia- tion over 30 runs of each algorithm tested with logistic user noise and 2 = 0.001. Overall, DPS is robust to the choice of hyperparameters. The hyperparameter val- ues depicted in each plot are (from left to right): for Bayesian linear regression, (f, _) = {(0.1, 10), (0.1, 0.1), (0.05, 0.01), (0.5, 20), (1, 10)}; for GP regression, f2, f2 . , . . , . . , . . , . , . ( 5 = ) = {(0 05 0 0005), (0 001 0 0001), (0 05 0 1), (0 001 0 0005), (1 0 1)}; for Bayesian logistic regression (special case of the GP preference model), (_,U) = {(0.1, 0.01), (1, 0.01), (0.1, 1), (30, 0.1), (5, 0.5)}; and additionally for the GP preference model, 2 ∈ {1, 10, 15, 19, 100}. See Table D.3 for the values of any hyperparameters not specifically mentioned here. 207

(a) Bayesian linear regression (b) Gaussian process regression

(c) Gaussian process preference model (d) Gaussian process preference model (special case: Bayesian logistic regres- (varying 2) sion) Figure D.6: Empirical performance of DPS in the Mountain Car environment for dif- ferent hyperparameter combinations. Plots display mean +/- one standard deviation over 30 runs of each algorithm tested with logistic user noise and 2 = 0.001. Overall, DPS is robust to the choice of hyperparameters. The hyperparameter values depicted in each plot are (from left to right): for Bayesian linear regression, (f, _) = {(10, 1), , , . . , . , . f2, ;, f2 (10 10), (30 0 001), (0 001 10), (0 1 0 1)}; for GP regression, ( 5 = ) = {(0.01, 2, 10−5), (0.01, 1, 10−5), (0.1, 2, 0.01), (1, 2, 0.001), (0.001, 3, 10−6)}; for Bayesian logistic regression (special case of the GP preference model), (_,U) = {(0.0001, 0.01), (0.1, 0.01), (0.0001, 0.0001), (0.001, 0.0001), (0.001, 0.01)}; and additionally for the GP preference model, 2 ∈ {10, 300, 400, 700, 1000}. See Table D.4 for the values of any hyperparameters not specifically mentioned here.