A Multi-Modal Intelligent Agent that Learns from Demonstrations and Natural Language Instructions

Toby Jia-Jun Li
Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]
http://toby.li/

CMU-HCII-21-102
May 3, 2021

Thesis Committee:
Brad A. Myers (Chair), Carnegie Mellon University
Tom M. Mitchell, Carnegie Mellon University
Jeffrey P. Bigham, Carnegie Mellon University
John Zimmerman, Carnegie Mellon University
Philip J. Guo, University of California San Diego

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2021 Toby Jia-Jun Li

Keywords: end user development, end user programming, interactive task learning, programming by demonstration, programming by example, multi-modal interaction, verbal instruction, natural language programming, task automation, intelligent agent, instructable agent, conversational assistant, human-AI interaction, human-AI collaboration.

Abstract

Intelligent agents that can perform tasks on behalf of users have become increasingly popular with the growing ubiquity of “smart” devices such as phones, wearables, and smart home devices. They allow users to automate common tasks and to perform tasks in contexts where the direct manipulation of traditional graphical user interfaces (GUIs) is infeasible or inconvenient. However, the capabilities of such agents are limited by their available skills (i.e., the procedural knowledge of how to do something) and conceptual knowledge (i.e., what a concept means). Most current agents (e.g., Siri, Google Assistant, Alexa) either have fixed sets of capabilities or provide mechanisms that allow only skilled third-party developers to extend the agent’s capabilities. As a result, they fall short in supporting “long-tail” tasks and suffer from a lack of customizability and flexibility.