Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior

Noriyuki Kojima, Alane Suhr, and Yoav Artzi
Department of Computer Science and Cornell Tech
[email protected] {suhr, yoav}@cs.cornell.edu

Abstract

We study continual learning for natural language instruction generation, by observing human users' instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication of the system's success in communicating its intent. We show how to use this signal to improve the system's ability to generate instructions via contextual bandit learning. In interaction with real users, our system demonstrates dramatic improvements in its ability to generate language over time.

[Figure 1 diagram. Panels: Initialization; Learning from User Behavior (rounds $r = 1, 2, 3, \dots$). Labels: Supervised Datasets; Dataset $\mathcal{D}_0$ Construction; Supervised Training; GPT-2 Weights; Natural Language Generation Model, $\bar{x} \sim P(\cdot \mid \cdot, \cdot; \theta_r)$; User Interactions; Dataset $\mathcal{D}_r$ Construction; Datasets $\mathcal{D}_0, \dots, \mathcal{D}_r$; Contextual Bandit Training.]

Figure 1: Diagram of our learning process. We initialize a generation model using supervised learning, and continually learn through interaction with users, by alternating between observing user execution of generated instructions and training.
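As a rough illustration of the alternation that Figure 1 describes, the following minimal sketch shows one way the rounds could be organized. All names (`continual_learning`, `collect_round`, `train`) are hypothetical and not taken from the released code. The sketch assumes that each training step sees all datasets collected so far and starts from the previous parameters; both are assumptions made for illustration only.

```python
from typing import Any, Callable, List

Dataset = List[Any]


def continual_learning(
    initial_params: Any,
    supervised_data: Dataset,
    collect_round: Callable[[Any, int], Dataset],
    train: Callable[[Any, List[Dataset]], Any],
    num_rounds: int,
) -> List[Any]:
    """Alternate between observing users and training (hypothetical sketch).

    collect_round(params, r): deploys the model with parameters theta_r and
        returns the dataset D_r built from observed user executions.
    train(params, datasets): returns new parameters estimated from the
        datasets D_0, ..., D_r (assumption: all datasets are reused).
    """
    datasets: List[Dataset] = [list(supervised_data)]  # D_0
    params = train(initial_params, datasets)  # supervised initialization
    history = [params]
    for r in range(1, num_rounds + 1):
        d_r = collect_round(params, r)  # interact with users using theta_r
        datasets.append(list(d_r))
        params = train(params, datasets)  # estimate theta_{r+1}
        history.append(params)
    return history
```

Passing `collect_round` and `train` as callables keeps the loop independent of how interactions are gathered or how the model is optimized, which the sketch does not attempt to specify.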

arXiv:2108.04812v1 [cs.CL] 10 Aug 2021

1 Introduction

Natural language provides an expressive and accessible avenue to instruct non-expert users. The ability to generate instructions is critical for systems that collaborate with users, for example to delegate tasks. In such scenarios, the system generates language to communicate a latent intent to the user. When users are cooperative and proficient in the language, whether they accomplish the system's intent provides an informative, albeit noisy, signal of the quality of instruction generation.

This implicit signal is fundamentally different from supervised data, including data obtained via active learning, in that it does not label the system's intent with a written instruction, but only provides evidence of the quality of a given instruction in relaying this intent. As a natural byproduct of interaction with users, it also differs from explicit user feedback in not requiring user action beyond what they already do as part of the interaction. Despite its potential and prevalence, this signal is understudied for learning to generate natural language.

In this paper, we study this learning signal. We formalize continually improving instruction generation by observing human users executing generated instructions. We learn by comparing instruction execution to the system intent, and demonstrate how this results in a system that continually improves its natural language generation ability through interaction with users. Figure 1 illustrates our learning process.

We design a task-oriented collaborative scenario using the CEREALBAR game environment (Suhr et al., 2019). In CEREALBAR, two agents, a leader and a follower, work together to complete tasks. The leader plans the tasks to complete, and communicates goals to the follower using natural language. CEREALBAR was originally introduced for studying follower instruction execution. We modify it to focus on generation of leader instructions, which are then executed by human followers. The collaborative, embodied setup effectively engages users, and aligns their incentives with executing the system's instructions to the best of their abilities.

A major challenge is inferring a learning signal from observed user behavior. Given the user execution, we create positive and negative examples, depending on how the user execution aligns with the system's plan and the user's perceived correctness of their own execution. For example, consider an execution that does not align well with the system's plan, but that the user considers correct given the instruction. Because of the misalignment, we cannot consider the instruction as a successful example given the system's plan. However, given the user's perceived correctness, we can generate a positive example treating the user's execution as a plan paired with the instruction. In contrast to supervised learning with gold-standard per-token labels (Sutskever et al., 2014), such utterance-level binary labels form a challenging signal for learning, because they do not distinguish between correct and incorrect tokens.

We do not make the typical distinction between training and deployment; as human users follow generated instructions, we continually collect new data, periodically train using this data, and evaluate the system through the interaction itself. We formalize learning as an off-policy contextual bandit learning problem. We show that positive examples can be treated in a manner that reduces to supervised learning, allowing for simple, effective use of the data. However, using negative examples is more challenging, because simply minimizing their likelihood gives an unbounded negative loss. We weight negative examples using an inverse propensity score (IPS; Horvitz and Thompson, 1952; Wang et al., 2017) to address this issue.

We experiment with our approach through interaction with human users, tracking both task performance and how the generated language changes. We observe dramatic improvements in the quality of instructions generated as reflected in users' execution: task completion in accordance with the system intent increases from 44.7% to 79.3%. This is accompanied by significant language change: the occurrence of erroneous phrases decreases as desired, but the effective system vocabulary gradually shrinks.

Although using user feedback for improving language generation has been studied, as we discuss in Section 8, to the best of our knowledge, this study is the first to show effective instruction generation learning by observing user execution. Our experiments demonstrate the effectiveness of our process, but also illustrate limitations and important directions for future work. Code and data are available at https://lil.nlp.cornell.edu/cerealbar/.

2 Technical Overview and Notation

Our goal is to continually improve a natural language instruction generation model, by observing human executions of generated instructions.

Interaction Scenario. We focus on a collaborative scenario, where two agents, a leader and a follower, complete tasks in an environment. The system is the leader, and the human user is the follower. The leader plans tasks to accomplish, acts in the world, and instructs the follower using natural language. We use a deterministic procedure for planning and executing leader actions, and focus on learning the leader instruction generation model. The human follower acts in the world following the system instructions. We instantiate this scenario using CEREALBAR (Section 3), a collaborative game, where two agents collect sets of cards together by moving in a 3D environment.

Task. A world state $s$ describes the current environment; in CEREALBAR, this includes the location of landmarks, cards, and both agents. A plan $\bar{p}$ is a sequence of poses $\langle p_1, \dots, p_{|\bar{p}|} \rangle$ the system intends for the human user to take starting from a start state $s_1$. In CEREALBAR, a plan includes moving in the environment with the intent of collecting cards; each pose $p_j$ is a tuple $(h_j, w_j, \alpha_j)$, where $h_j$ and $w_j$ are height and width coordinates, and $\alpha_j$ is a discrete orientation angle. An instruction $\bar{x}$ is a sequence of tokens $\langle x_1, \dots, x_{|\bar{x}|} \rangle$. An instruction execution $\bar{e}$ is the sequence of poses $\langle p_1, \dots, p_{|\bar{e}|} \rangle$ a user takes executing $\bar{x}$, starting in a start state $s_1$. The generation distribution $P(\bar{x} \mid s_1, \bar{p}; \theta)$ is parameterized by $\theta$. The goal of instruction generation is that given a generated instruction $\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta)$, the user execution $\bar{e}$ from $s_1$ will follow the plan $\bar{p}$. The user does not have access to $\bar{p}$, but only to its description $\bar{x}$.

Learning. We use an encoder-decoder neural network model (Section 4), which we continually improve by observing user behavior. This process proceeds in rounds. At each round $r$, we first collect data and then train our model by estimating the model parameters $\theta_{r+1}$. During data collection in round $r$, we sample from our model to generate instructions, and observe a human user's execution of each instruction. An execution of an instruction $\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)$ generated for the plan $\bar{p}$ with start state $s_1$ creates a tuple $(s_1, \bar{p}, \bar{x}, \bar{e}, f)$, where $\bar{e}$ is the user execution and $f$ is structured user feedback solicited using binary questions (e.g., about the grammaticality of $\bar{x}$). The learner does not observe the user's actions executing $\bar{x}$, but only their poses along the execution. Given these tuples, we create a dataset $\mathcal{D}_r = \{(s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|\mathcal{D}_r|}$, where $y^{(i)} \in \{-1, +1\}$ is a binary label. Depending on the user execution and feedback, the plan $\bar{\rho}^{(i)}$ is either the original plan $\bar{p}^{(i)}$ used for generating $\bar{x}^{(i)}$ or the user execution $\bar{e}^{(i)}$ of $\bar{x}^{(i)}$. We formulate estimating $\theta_{r+1}$ as a contextual bandit learning problem with $y$ as the reward. Section 5 describes the complete learning process.

Evaluation. Throughout the system's lifetime, we measure how well human users complete tasks, and also use earth mover's distance (EMD; Rubner et al., 1998) to quantify the similarity of the user execution $\bar{e}$ to the plan $\bar{p}$. We characterize language change over time by tracking vocabulary size, instruction length, and other statistics.

3 Interaction Scenario

Suhr et al. (2019) describe CEREALBAR in detail. CEREALBAR is a two-player, turn-based game where a leader and follower collaborate to collect sets of matching cards. The game objective is to collect as many valid sets as possible in a 3D environment. The environment includes landmarks (e.g., houses, mountains, ponds) that the players must move around, and that may obscure a player's view. A valid set consists of three cards with three distinct colors, shapes, and counts. Players move onto cards to select or deselect them. When the selected cards comprise a valid set, the players earn a point, all cards disappear,¹ and new cards appear. The two players must collaborate effectively using natural language. The leader observes the entire environment, plans who should select which cards for the next set, executes their own part of this plan, and issues instructions to the follower. The follower executes leader instructions, only seeing a partial first-person view of the environment. Leader instructions must make use of the observed spatial environment, including landmarks, for the follower to be able to execute them given their partial view. Each interaction includes multiple instructions. Figure 2 shows the game and example generated instructions.

[Figure 2 image. Panels: Leader's View; Follower's View. Generated leader instructions shown in the figure:
$\bar{x}_8$: get the two orange diamonds and the one green plus. then turn around and grab the three black hearts
$\bar{x}_9$: go around the lake and get the 2 green hearts and the 1 red triangle
[Set made. New score: 6]]

Figure 2: Interaction snapshot in CEREALBAR, with instructions generated by our model. The current instruction is $\bar{x}_9$. The leader plan is illustrated with red arrows in the leader's view. The user sees only the follower's view during execution.

CEREALBAR was originally used for learning a follower instruction execution model from human demonstrations (Suhr et al., 2019). In contrast, we learn an instruction generation model for the leader, with the human user as the follower. The generated instructions must often specify multiple tasks to complete (i.e., when the follower is to select multiple cards), and how to navigate to the target cards, because the follower has only partial observability of the environment. This includes references to landmarks, spatial relations, and descriptions of paths. We focus on language generation, and use a deterministic planner to generate the plan, including which cards to select and how each player should move in their next turn, and to execute the planned leader actions. The system uses the model we learn to map the follower's part of the plan to a natural language instruction.

We learn through interactions with non-expert human followers, for which CEREALBAR is particularly well suited. The utility-maximizing game objective, to earn a high score by collecting as many valid sets as possible, incentivizes followers to execute the generated instructions as accurately as possible. In addition, CEREALBAR players need no expert knowledge to participate in the game, beyond familiarity with the simple game rules.

¹ In Suhr et al. (2019), only the selected cards disappear. We introduced this modification to minimize inter-turn effects for the follower (i.e., having to memorize card locations).

4 Model

We design a relatively simple encoder-decoder architecture to model the generation distribution $P(\cdot \mid s_1, \bar{p}; \theta)$, leaving more complex model development for future work. The inputs are a start state $s_1$ and a plan $\bar{p}$. The model parameters are $\theta$.
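To make the notation of Section 2 concrete, the sketch below writes the pose, interaction tuple, and labeled example as Python types, together with one possible labeling rule. Every class and function name here is hypothetical. The rule follows only the example given in the introduction (a misaligned execution that the user considers correct becomes a positive example with the execution as the plan); treating aligned executions as positive and all remaining cases as negative is an assumption, and how alignment itself is decided is left to the caller. Section 5 describes the complete process.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Pose:
    h: int      # height coordinate h_j
    w: int      # width coordinate w_j
    alpha: int  # discrete orientation angle alpha_j


@dataclass
class Interaction:
    """One observed tuple (s_1, p, x, e, f)."""
    start_state: Any           # s_1 (representation left abstract here)
    plan: List[Pose]           # system plan
    instruction: List[str]     # generated instruction tokens
    execution: List[Pose]      # poses the user took
    feedback: Dict[str, bool]  # answers to binary questions (keys hypothetical)


@dataclass
class Example:
    """One dataset entry (s_1, rho, x, y)."""
    start_state: Any
    plan: List[Pose]           # rho: original plan or user execution
    instruction: List[str]
    label: int                 # y in {-1, +1}


def to_example(t: Interaction, aligned: bool, user_says_correct: bool) -> Example:
    # Assumption: an execution aligned with the plan is a positive example.
    if aligned:
        return Example(t.start_state, t.plan, t.instruction, +1)
    # From the introduction: misaligned but perceived as correct yields a
    # positive example with the user's execution treated as the plan.
    if user_says_correct:
        return Example(t.start_state, t.execution, t.instruction, +1)
    # Assumption: everything else is a negative example for the original plan.
    return Example(t.start_state, t.plan, t.instruction, -1)
```

Keeping the alignment decision outside `to_example` avoids committing the sketch to any particular threshold or distance measure.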

[Figure diagram. Recoverable labels: World State $S$; Plan $\bar{p} = \langle (h_1, w_1, \alpha_1), \dots, (h_{|\bar{p}|}, w_{|\bar{p}|}, \alpha_{|\bar{p}|}) \rangle$; Crop Patches; CNN (per patch); Flatten; Split; Duplicate / Distribute; BiLSTM; Feedforward; Token + Position Embeddings; Pseudo-Self Attention; Linear / Softmax; example output "Pick up one red square . . ."]

/ Rotate AAACJHicbVDLSsNAFJ34rPUVdelmsAouSklUVHBTcOOygn1AE8JkOmmHTiZhZiKUtB/jxl9x48IHLtz4LU7SLGrrgYHDuffcuff4MaNSWda3sbS8srq2Xtoob25t7+yae/stGSUCkyaOWCQ6PpKEUU6aiipGOrEgKPQZafvD26zefiRC0og/qFFM3BD1OQ0oRkpLnnnjMMT7jEAnRGrgB2k88ewqdHqRktVZMR07PhKajSfQEbnHMytWzcoBF4ldkAoo0PDMDz0XJyHhCjMkZde2YuWmSCiKGZmUnUSSGOEh6pOuphyFRLppfuQEnmilB4NI6McVzNVZR4pCKUehrzuzreV8LRP/q3UTFVy7KeVxogjH04+ChEEVwSwx2KOCYMVGmiAsqN4V4gESCCuda1mHYM+fvEhaZzX7snZ+f1GpHxdxlMAhOAKnwAZXoA7uQAM0AQZP4AW8gXfj2Xg1Po2vaeuSUXgOwB8YP781QqW5

AAACA3icbVBNS8NAEN3Ur1q/ot70EqyCp5KoqMeCCB4r2A9oS9lsJ+3SzSbsTsQSCl78K148KOLVP+HNf+M27UFbHwzzeG+G3Xl+LLhG1/22cguLS8sr+dXC2vrG5pa9vVPTUaIYVFkkItXwqQbBJVSRo4BGrICGvoC6P7ga+/V7UJpH8g6HMbRD2pM84IyikTr2XgvhAVWYZt0P0mvJoi6o0ajQsYtuyc3gzBNvSopkikrH/mp1I5aEIJEJqnXTc2Nsp1QhZwJGhVaiIaZsQHvQNFTSEHQ7zW4YOUdG6TpBpExJdDL190ZKQ62HoW8mQ4p9PeuNxf+8ZoLBZTvlMk4QJJs8FCTCwcgZB+J0uQKGYmgIZYqbvzqsTxVlaGIbh+DNnjxPaicl77x0entWLB9O48iTfXJAjolHLkiZ3JAKqRJGHskzeSVv1pP1Yr1bH5PRnDXd2SV/YH3+AOk5mDw= s0 s0 AAACA3icbVDLSsNAFJ3UV62vqDvdBKvgqiQq6rKgC5cV7APaUCbTm3bo5MHMjVhCwI2/4saFIm79CXf+jUmahbYeuNzDOfcyc48TCq7QNL+10sLi0vJKebWytr6xuaVv77RUEEkGTRaIQHYcqkBwH5rIUUAnlEA9R0DbGV9lfvsepOKBf4eTEGyPDn3uckYxlfr6Xg/hAaUX591x42tgwQBkklT6etWsmTmMeWIVpEoKNPr6V28QsMgDH5mgSnUtM0Q7phI5E5BUepGCkLIxHUI3pT71QNlxfkNiHKXKwHADmZaPRq7+3oipp9TEc9JJj+JIzXqZ+J/XjdC9tGPuhxGCz6YPuZEwMDCyQIwBl8BQTFJCmeTpXw02opIyTGPLQrBmT54nrZOadV47vT2r1g+LOMpknxyQY2KRC1InN6RBmoSRR/JMXsmb9qS9aO/ax3S0pBU7u+QPtM8f2cCYMg== p1 ,...,p p¯ p1,...,p p¯ Encoder Decoder h | |i h | |i Figure 3: Model illustration. Section4 describes the model.

Our design considers the environment and plan to generate relevant, grounded instructions. Figure 3 illustrates the model.

Inputs. Similar to Suhr et al. (2019), we represent the world state \(s_1 \in \{0, 1\}^{P \times H \times W}\) as a binary 3D tensor, where \(P\) is the number of position properties, and \(H\) and \(W\) are the environment's height and width. Each of the \(W \times H\) positions is represented as a binary properties vector of length \(P\) (e.g., encoding the type of object in the position, its color, etc.). The system plan \(\bar{p} = \langle p_1, \dots, p_{|\bar{p}|} \rangle\) is a sequence of follower poses along the intended execution. Each pose \(p_j\) is a tuple \((h_j, w_j, \alpha_j)\) of height \(h_j\) and width \(w_j\) coordinates, and a discrete orientation angle \(\alpha_j\).

Encoder. The encoder computes a set of hidden states, which the decoder attends to during generation. We use a learned embedding function \(\phi^s\) to map each position vector to a dense embedding of size \(N^s\) by summing the embeddings of each of the position's properties. We combine the embeddings into a tensor \(S \in \mathbb{R}^{N^s \times H \times W}\), and compute \(S' = \mathrm{CNN}^1(S)\), where \(\mathrm{CNN}^1\) is a learned convolution and \(S' \in \mathbb{R}^{N^{s'} \times H \times W}\). Because the CEREALBAR environment is a grid of hexagons, we use HEXACONV (Hoogeboom et al., 2018). We encode the plan positions into a sequence of tensors \(\langle p^{s'}_1, \dots, p^{s'}_{|\bar{p}|} \rangle\) by cropping \(N^{s'} \times N^p \times N^p\)-sized tensors from \(S'\), each centered around \((h_j, w_j)\) and rotated by \(\alpha_j\). These tensors represent the pose of the follower and its surroundings during execution. Each \(p^{s'}_j\) is encoded to \(p_j = \mathrm{CNN}^2(p^{s'}_j)\), while retaining the dimensionality of \(p^{s'}_j\).

We concatenate an orientation embedding \(\phi^\alpha(\alpha_j)\) to each \(p_j\), and process \([p_1; \phi^\alpha(\alpha_1)], \dots, [p_{|\bar{p}|}; \phi^\alpha(\alpha_{|\bar{p}|})]\) with a bidirectional LSTM to compute \(h_1, \dots, h_{|\bar{p}|}\). We construct the set of hidden states \(\mathcal{P}\) the decoder attends to by concatenating each \(h_j\) with the \(N^p \times N^p\) position vectors encoded in each \(p_j\):

\[
\mathcal{P} = \left\{ [h_j; p_j[x, y]] \;\middle|\; 1 \le j \le |\bar{p}|,\ 1 \le x, y \le N^p \right\}, \tag{1}
\]

where \(p_j[x, y]\) is a position vector of size \(N^{s'}\).

Decoder. The decoder computes a probability distribution over token types conditioned on the prefix generated so far and the set \(\mathcal{P}\), which represents the environment state and plan. The decoder uses the first four layers of the GPT-2 Transformer architecture (Radford et al., 2019), which enables initializing with GPT-2 weights. We extend it with pseudo self-attention (Ziegler et al., 2019) to condition the generation on the encoder outputs \(\mathcal{P}\). This adds a linear layer that projects the encoder outputs \(\mathcal{P}\) into the decoder self-attention space.

Inference. We decode instructions from \(P(\cdot \mid s_1, \bar{p}; \theta)\) using temperature sampling with a temperature of \(\tau\) (Kreutzer et al., 2018b). This sharpens the sampling distribution, to focus on higher probability outputs. We do not use beam search.

5 Learning

We continually improve our model by observing users following generated instructions and re-estimating the model parameters. We initialize the model parameters \(\theta_1\) using an existing language model and training on a static dataset of instructions \(\mathcal{D}_0\) (Section 5.1). We then perform a series of rounds. Each round \(r\) includes deploying the model with human users and training on the collected interactions (Section 5.2). In round \(r\), we collect interactions between our model parameterized by \(\theta_r\) and human followers, to create a dataset \(\mathcal{D}_r = \{(s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|\mathcal{D}_r|}\) of start states \(s_1^{(i)}\), plans \(\bar{\rho}^{(i)}\), instructions \(\bar{x}^{(i)}\), and binary labels \(y^{(i)}\). We estimate \(\theta_{r+1}\) using all data collected so far, \(\bigcup_{q=0}^{r} \mathcal{D}_q\). Figure 1 illustrates our learning process.
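The round structure described in Section 5 can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `train` and `deploy` are hypothetical callbacks standing in for parameter estimation and for data collection with users, and the toy usage replaces model parameters with a simple count.

```python
from typing import Callable, List, Sequence, Tuple

# One example: (start_state, plan, instruction, label), with label in {+1, -1}.
Example = Tuple[str, str, str, int]


def continual_learning(
    initial_data: Sequence[Example],
    train: Callable[[List[Example]], object],
    deploy: Callable[[object, int], Sequence[Example]],
    num_rounds: int,
) -> Tuple[List[object], List[List[Example]]]:
    """Alternate deployment and training for `num_rounds` rounds.

    `train` and `deploy` are hypothetical callbacks: `train(data)` returns
    model parameters estimated from `data`, and `deploy(theta, r)` returns
    the dataset D_r built from interactions with the model in round r.
    """
    datasets = [list(initial_data)]      # D_0
    theta = train(datasets[0])           # theta_1, from the static dataset only
    parameters = [theta]
    for r in range(1, num_rounds + 1):
        # The model is not updated while it interacts with users.
        datasets.append(list(deploy(theta, r)))      # D_r
        # theta_{r+1} is estimated on the union of D_0, ..., D_r.
        all_data = [example for d in datasets for example in d]
        theta = train(all_data)
        parameters.append(theta)
    return parameters, datasets


# Toy usage: the "parameters" are just the number of training examples seen.
def toy_train(data):
    return len(data)


def toy_deploy(theta, r):
    return [("state", "plan", f"instruction-{r}", +1)] * 2


parameters, datasets = continual_learning(
    [("state", "plan", "instruction-0", +1)] * 3, toy_train, toy_deploy, num_rounds=2
)
print(parameters)  # [3, 5, 7]
```

Passing the full union of datasets to `train` in every round mirrors the re-training setup described in the text; a fine-tuning variant would instead start each call from the previous parameters.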
5.1 Initialization

User interaction requires some level of minimal performance. Pilot experiments showed that a poorly initialized system is likely to frustrate users, who in turn provide little learning signal. Our initialization provides a sufficient level of grammaticality and plausibility to support user interaction, and thereby further learning.

We initialize the decoder weights with the first four layers of GPT-2 (Radford et al., 2019). All other weights, including those of the encoder and the pseudo self-attention linear layers, are initialized randomly. We then train with a supervised dataset \(\mathcal{D}_0 = \{(s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\}_{i=1}^{|\mathcal{D}_0|}\) of human plans \(\bar{\rho}^{(i)}\) starting at start states \(s_1^{(i)}\) and instructions \(\bar{x}^{(i)}\), all with positive labels \(y^{(i)} = +1\). We use limited data, just sufficient to effectively interact with users for further learning. We estimate \(\theta_1\) by minimizing a supervised loss:

\[
\mathcal{L}^{I}(\theta_1, \mathcal{D}_0) = -\frac{1}{|\mathcal{D}_0|} \sum_{i=1}^{|\mathcal{D}_0|} \log P(\bar{x}^{(i)} \mid s_1^{(i)}, \bar{\rho}^{(i)}; \theta_1). \tag{2}
\]

[Figure 4 box]
Perceived correctness: Did you follow all parts of the Leader's command and find everything correct?
Users are instructed to answer yes only if they consider their execution correct given the instruction, if they perceive that the instruction describes the relevant cards and objects correctly, and if the specified actions make sense.
Grammaticality: Was the instruction grammatical and well written?
Users are shown examples of errors before the game, and largely interpret this criterion as language correctness independent of the world state.

Figure 4: The binary questions displayed to the user at the end of instruction execution.

5.2 Learning from User Behavior

Learning from interacting with human users alternates between generating instructions in interaction with users and training the model.

Interaction with Users. In each round \(r\), we first deploy the model with parameters \(\theta_r\) to interact with human users, with our system as the leader and the user as the follower. We do not update the model during this interaction phase.

The game environment is randomly generated for each interaction. Each game continues until it concludes, either when the user leaves or the turns are exhausted. A game often includes collecting multiple sets of cards, and generating multiple instructions. Each instruction is generated for the current state as the start state \(s_1\);^2 as both agents move and change the status of cards, the environment state changes throughout the game. At state \(s_1\), we generate the plan \(\bar{p}\) using a deterministic planner that determines (a) which cards should be selected or de-selected to make the next valid set, and (b) the shortest paths the leader and follower should take to visit all target cards. The actions the planner assigns to the follower form the plan \(\bar{p}\). The actions assigned to the leader are executed by the leader agent deterministically during its turn. The model is used to sample an instruction \(\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)\), which is displayed to the user. The human user has no access to \(\bar{p}\), the set of target cards, or the game state \(s_1\). They only observe the instruction and what is ahead (Figure 2).

During their turn, the user executes \(\bar{x}\) to the best of their ability, and indicates when done. If the user determines that the instruction cannot be followed, they can terminate the execution, which is treated just like marking the instruction as complete. The user execution \(\bar{e}\) is the entire sequence of poses they take while following the instruction.

When the user concludes or terminates an instruction \(\bar{x}\), we show them a top-down view of the entire environment with their execution path highlighted. They do not see the original system plan. We ask the user two binary feedback questions about the perceived correctness of their execution and grammaticality (Figure 4).

We create a tuple \((s_1, \bar{p}, \bar{x}, \bar{e}, f)\) for each execution \(\bar{e}\), where \(s_1\) is the start state of the environment, \(\bar{p}\) is the plan generated in that state, \(\bar{x} \sim P(\cdot \mid s_1, \bar{p}; \theta_r)\) is the sampled instruction, and \(f\) is the set of responses to the feedback questions. Once the user submits the answers to the feedback questions, the next instruction is generated.

^2 For simplicity, we do not index the game time step.
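Each instruction shown to the user is drawn with temperature sampling (Section 4). The sketch below shows how a temperature below one sharpens a categorical distribution. It assumes, as is common but not stated in the text, that the temperature divides per-token logits before the softmax; the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def temperature_distribution(logits: Sequence[float], tau: float) -> List[float]:
    """Softmax over logits / tau; tau < 1 concentrates mass on likely tokens."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [value / tau for value in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits: Sequence[float], tau: float, rng: random.Random) -> int:
    """Sample one token index from the temperature-scaled distribution."""
    probs = temperature_distribution(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
base = temperature_distribution(logits, tau=1.0)
sharp = temperature_distribution(logits, tau=0.5)
# The most likely token receives more probability mass at tau = 0.5.
print(sharp[0] > base[0])  # True
token = sample_token(logits, tau=0.5, rng=random.Random(0))
```

Sampling rather than taking the most likely output keeps some variety in the generated instructions, which matters when the same model later learns from how users respond to them.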
Dataset Construction. We use all interactions in round \(r\) to construct the dataset \(\mathcal{D}_r\), which is made of tuples \((s_1, \bar{\rho}, \bar{x}, y)\), where \(\bar{\rho}\) is a plan and \(y\) is a binary label. Given a tuple \((s_1, \bar{p}, \bar{x}, \bar{e}, f)\), we use three heuristics to add examples to \(\mathcal{D}_r\):

1. If any feedback answer in \(f\) is negative, the instruction does not reflect the user's execution or is not well written (i.e., ungrammatical). We add a negative example to \(\mathcal{D}_r\) with the system plan \(\bar{p}\): \((s_1, \bar{p}, \bar{x}, -1)\).
2. If both feedback answers are positive, the user considers their execution \(\bar{e}\) accurate and the instruction well formed. This does not necessarily indicate the execution follows the system plan, but that we can treat the execution as a plan. We add a positive example with the execution as the plan: \((s_1, \bar{e}, \bar{x}, +1)\).
3. If both answers are positive and the execution \(\bar{e}\) follows the plan \(\bar{p}\),^3 the instruction communicates the plan well. We add a positive example with the system plan: \((s_1, \bar{p}, \bar{x}, +1)\).

Overall, we add examples to \(\mathcal{D}_r\) using both the original system plan and the user execution. The heuristics utilize the observational learning signal as much as possible while avoiding examples not beneficial for learning. For example, we do not add negative examples using the user execution, because these are less likely to be useful for learning. Although such executions can form negative examples if the user answered negatively to the correctness question, they tend to be relatively arbitrary, and it is unlikely the model conditioned on them will assign significant probability to the generated instruction, which is the behavior negative examples are meant to suppress.

^3 For instructions that target cards, we require getting the card selection right, and ignore the follower position. For instructions that require waiting (e.g., hold still), we require the position to remain the same, but allow orientation deviation.
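The three heuristics above map directly onto a small function. This is a hedged sketch: the `Interaction` type and the `follows_plan` predicate are hypothetical names introduced for illustration, and the predicate stands in for the plan-matching criterion of footnote 3, which is not implemented here.

```python
from typing import Callable, List, NamedTuple, Tuple

Plan = Tuple[str, ...]
Example = Tuple[str, Plan, str, int]  # (start_state, plan, instruction, label)


class Interaction(NamedTuple):
    start_state: str
    system_plan: Plan
    instruction: str
    execution: Plan
    feedback: Tuple[bool, bool]  # (perceived correctness, grammaticality)


def build_examples(
    interaction: Interaction,
    follows_plan: Callable[[Plan, Plan], bool],
) -> List[Example]:
    """Convert one observed interaction into zero or more labeled examples."""
    s, p, x, e, f = interaction
    # Heuristic 1: any negative answer yields a negative example
    # paired with the system plan.
    if not all(f):
        return [(s, p, x, -1)]
    # Heuristic 2: both answers positive, so the user's execution is
    # treated as a plan that the instruction describes.
    examples = [(s, e, x, +1)]
    # Heuristic 3: the execution also follows the system plan, so the
    # instruction is a positive example for that plan as well.
    if follows_plan(e, p):
        examples.append((s, p, x, +1))
    return examples


# Toy usage with exact equality as a stand-in for the plan-matching test.
same = lambda execution, plan: execution == plan
rejected = Interaction("s1", ("forward",), "go ahead", ("left",), (True, False))
accepted = Interaction("s1", ("forward",), "go ahead", ("left",), (True, True))
print(build_examples(rejected, same))  # [('s1', ('forward',), 'go ahead', -1)]
print(build_examples(accepted, same))  # [('s1', ('left',), 'go ahead', 1)]
```

Note that a negative example is never paired with the user execution, which matches the reasoning given in the text.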
Parameter Estimation. We estimate the model parameters for the next round \(\theta_{r+1}\) using all available data \(\mathcal{D} = \bigcup_{q=0}^{r} \mathcal{D}_q\). We re-train our model, starting with GPT-2 parameters (Section 5.1).^4

We formulate learning as an offline contextual bandit problem, treating the sentence labels \(y\) as rewards. Learning from the positive examples in \(\mathcal{D}\) forms a straightforward supervised learning problem, albeit one where the data is generated from system interaction. A key challenge is using the negative examples. Treating them like supervised examples requires optimizing the probability of their instructions to zero. Because \(\lim_{P(\cdot) \to 0} \log P(\cdot) = -\infty\), this leads to an unbounded negative loss that quickly dominates the objective. This is in contrast to positive examples, for which the loss is bounded by zero. This issue is not present in existing work using offline contextual bandits to improve machine translation (Lawrence et al., 2017; Kreutzer et al., 2018b), where rewards are always non-negative.

We address this issue by adding an inverse propensity score (IPS; Horvitz and Thompson, 1952; Wang et al., 2017) coefficient to negative examples in a policy gradient objective. The gradient for estimating parameters \(\theta_{r+1}\) is:

\[
\nabla \mathcal{L}(\theta_{r+1}, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \ell^{(i)}_{\theta_{r+1}} \, y^{(i)} \, \nabla \log P(\bar{x}^{(i)} \mid s_1^{(i)}, \bar{\rho}^{(i)}; \theta_{r+1}), \tag{3}
\]

where, given an example \((s_1^{(i)}, \bar{\rho}^{(i)}, \bar{x}^{(i)}, y^{(i)})\) acquired in round \(q\) with parameters \(\theta_q\), \(\ell^{(i)}_{\theta}\) is:

\[
\ell^{(i)}_{\theta} =
\begin{cases}
1 & y^{(i)} = +1 \\[4pt]
\dfrac{P(\bar{x}^{(i)} \mid s_1^{(i)}, \bar{p}^{(i)}; \theta)}{P(\bar{x}^{(i)} \mid s_1^{(i)}, \bar{p}^{(i)}; \theta_q)} & y^{(i)} = -1.
\end{cases} \tag{4}
\]

As the probability of a negative example (i.e., \(y = -1\)) decreases, so does its impact on the loss. While IPS is commonly used in bandit learning to de-bias the loss estimate (Lawrence et al., 2017), our motivation is different, and we do not add it to positive examples. Because of the large combinatorial space, sentence probabilities are generally small. The IPS coefficient of a positive example can become very large as its probability increases during learning. Instead, we use a supervised-like term, which is known to behave well.^5

^4 Pilot studies showed re-training to be more stable than fine-tuning given new data, and we conduct the majority of our experiments with this method. However, we also observe that our process is overall robust to the initially observed instabilities of fine-tuning (Section 7).
^5 An alternative, and an important direction for future study, is to add IPS to all examples, but clip it at a certain maximal value, similar to clipping in PPO (Schulman et al., 2017).
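The coefficient in Equation 4 and its role in Equation 3 can be illustrated with scalar log-probabilities. This is a sketch under stated assumptions: the function names are hypothetical, sentence probabilities are passed as log-probabilities to avoid underflow, and the coefficient is treated as a constant multiplier of the gradient term, as is typical when implementing such objectives with automatic differentiation.

```python
import math


def ips_coefficient(label: int, logp_current: float, logp_behavior: float) -> float:
    """Coefficient from Equation 4.

    `logp_current` is log P(x | s, p; theta) under the parameters being
    trained, and `logp_behavior` is log P(x | s, p; theta_q) under the
    parameters that generated the instruction in round q.
    """
    if label == +1:
        return 1.0  # positive examples use a supervised-like term
    return math.exp(logp_current - logp_behavior)


def gradient_weight(label: int, logp_current: float, logp_behavior: float) -> float:
    """Scalar that multiplies grad log P for one example in Equation 3,
    before averaging over the dataset."""
    return ips_coefficient(label, logp_current, logp_behavior) * label


# A negative example whose probability has already dropped contributes less.
print(gradient_weight(-1, logp_current=-10.0, logp_behavior=-10.0))  # -1.0
print(gradient_weight(-1, logp_current=-12.0, logp_behavior=-10.0))  # about -0.135
print(gradient_weight(+1, logp_current=-12.0, logp_behavior=-10.0))  # 1.0
```

The second call shows the bounding effect described in the text: once the model assigns the negative example a lower probability than the model that generated it, the pressure to lower it further fades instead of growing without limit.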
6 Experimental Setup

Initialization Data. We create the supervised initialization dataset \(\mathcal{D}_0\) by sampling 360 interactions from the original CEREALBAR data (Suhr et al., 2019), which was collected in a wizard-of-oz (WOZ; Kelley, 1984) setup via human-human games. We select this number through pilot studies and qualitative analysis to minimize the amount of initialization data, while still maintaining sufficient model performance for early interactions to facilitate learning. Our goal is to use as little data as possible to study the target scenario where investment in supervised data is minimal, and most learning is left to interaction with users. This data includes 7,147 examples. We use the human demonstrations in the original data as plans.

Evaluation. Similar to Zhao et al. (2021), we observe that automated metrics, such as BLEU (Papineni et al., 2002) or BERTScore (Zhang et al., 2020), computed over a static held-out validation set are unreliable for evaluating instruction generation. Instead, we focus on task-completion measures via human execution. We measure task completion by considering the user execution as completing the intended task if the user visits all card locations included in the system plan; or, if the plan includes no target cards, the user stays in the starting position. We quantify the similarity of the user execution to the path in the system plan by computing earth mover's distance (EMD; Rubner et al., 1998)^6 between the two (Blukis et al., 2019). We also track the user answers to the feedback questions (Figure 4). We average each measure over the number of instructions in each round.

Language Analysis. We quantitatively analyze how generated instructions change throughout training. For each round, we report mean instruction length, vocabulary size, and three measures of syntactic complexity using dependency trees (Xu and Reitter, 2016): (a) maximum depth: the longest path from root to a leaf; (b) maximum width: the maximum out-degree of any word in the tree; and (c) average branching factor: the average out-degree of non-leaf words. We normalize the three measures by instruction length. We qualitatively analyze errors in generated instructions by comparing 100 randomly sampled examples where the user failed to complete the intended task from the first and final rounds.

Interaction Setup. Except for initialization, learning and evaluation are done through live interaction with users on MTurk. All workers passed a tutorial and a qualification quiz. We pay $0.15 per interaction, with a bonus of $0.10 per instruction to workers who follow our guidelines.

Implementation Details. Similar to performance evaluation, automated measures are unreliable for model selection. Instead, for both initialization and in each round, we train for \(N = 400\) epochs, and take the final model. We find \(N\) via qualitative analysis of the initial model. We use an ensemble of four models. We uniformly sample one of the four models to sample each instruction, and take its probability to use in IPS for negative examples. We use a sampling temperature \(\tau = 0.5\), and AdamW (Loshchilov and Hutter, 2018) for learning.

^6 We use POT (Flamary et al., 2021) to compute EMD.

7 Results and Analysis

We conduct a long-term experiment with 14 rounds using our approach, and separate seven-round experiments to compare system variants. In both experiments, we collect roughly 100 interactions for each system per round. In the seven-round experiments, we deploy methods simultaneously to ensure that our observations are not sensitive to changes in user behavior, for example because of adaptation and increased expertise. We do not inform workers about the model they are interacting with. We train each system only on data collected by the same method in previous rounds.

7.1 Long-term Study

We experiment with our approach for 14 rounds. We collect a total of 27,031 instructions from 1,445 interactions, with 103.2 interactions per round on average. The total cost is $2,895. Figure 5 shows both performance measures and language trends. For task measures and user feedback, we also break down performance according to the number of target cards in the system plan to evaluate performance changes for plans which may be more difficult to describe (e.g., because they require specifying more cards).^7

^7 0-card plans target no cards (e.g., hold still).

[Figure 5 plots, per round (1 to 14): task completion rate (%), EMD, game score, positive perceived correctness feedback (%), positive grammaticality feedback (%), vocabulary size and sentence length, numbers of positive and negative examples, and syntactic complexity (max depth, max width, branching factor). Legend for the task and feedback panels: Overall, 0-card^7, 1-card, 2-card, 3-card.]

Figure 5: The system's lifetime statistics from the long-term experiment (14 rounds). The system improves on task completion (↑), EMD (↓), positive response rate for the two feedback questions (↑), and game score (↑). Section 7.1 discusses these results in detail.

Our learning method significantly improves the system performance across all measures. Task completion rate improves from 44.7% at round one to 79.3% at round 14, while EMD decreases from 1.73 to 0.88, showing increasing similarity between the execution and the plan. The user perception of the system also improves: the positive response rate for the perceived correctness question improves from 47.9% to 78.6%, and for grammaticality from 88.9% to 99.2%. The overall collaborative system performance improves as well; the game score increases from 4.5 to 10.4. The number of positive examples per round gradually increases, as the system improves and the interactions become longer. In contrast, the number of negative examples decreases over time.

We observe that the initial model struggles to describe plans containing more target cards, with a particularly low task completion rate of 1.6% for 3-card plans in the first round. This is potentially because only 0.7% of human follower executions in \(\mathcal{D}_0\) demonstrate picking up three cards, while the planner generates 3-card plans 7.9% of the time. While initial improvement is slow for 3-card instructions, it picks up around round eight, and reaches a 32.9% task completion rate.

Language Analysis. We observe a consistent trend of decreasing sentence length and vocabulary size. Overall, these trends accompany a reduction in the overgeneration of erroneous phrases that are not grounded well in the environment. We also qualitatively observe that the system gradually generates slightly more underspecified instructions, for example by dropping mentions of landmarks crucial for navigating to a target card. This may explain the slight decrease in 1-card task completion rate in later rounds (Figure 5), because the planner usually has the follower travel further for 1-card instructions, which requires relying more on landmarks. A potential explanation for the decrease in vocabulary size is the ever increasing presence of system-generated sentences in training, which reinforces the system's word choices. Alternatively, our learning signal may not account for the need for more descriptive language. For example, humans may compensate with exploration for omitted descriptions, which is not distinguished by how we convert the observed behavior to a learning signal. These trends outline important directions for future work.

We observe a small increase in syntactic complexity over the system's lifetime with regard to the branching factor, which shows a significant increase (\(p < 0.00001\)).^8 We also see a slight decrease in maximum tree depth (\(p < 0.0001\)), and no significant change in max width.

^8 We use a t-test (\(\alpha = 0.01\)) comparing rounds 1 and 14.

Error Analysis. We analyze errors in the generated instructions at the first and final rounds. For each round, we randomly sample 100 instructions that the user did not execute according to the plan or answered negatively to a feedback question. Table 1 shows error types and example instructions. Overall, the frequency of erroneous instructions decreases from 68.5% of instructions in the first round to 26.8% in the final round. From the first to the final round, we observe a noticeable decrease in errors related to grounding of cards and landmarks. The overall frequency of errors related to incorrect directions and incorrect actions or conditions also decreases, and implausible instructions diminish close to zero percent. However, there is an overall increase in underspecified instructions. This aligns with the decrease in the vocabulary size and landmark use we discuss above.

| Error Type | r = 1 | r = 14 | Example |
| Incorrect, missing, or extra cards | 75 | 39 | turn left and go to the yellow star triangles |
| Irrelevant landmarks | 13 | 1 | Head toward the windmill house. grab 2 red and triangle |
| Incorrect direction | 30 | 35 | grab the black heart to your left in front of you. |
| Incorrect actions or conditions | 28 | 14 | After the two red triangles, get the 3 red triangles. |
| Underspecification | 8 | 26 | turn right and go straight toward red trees collect two orange triangle. |
| Implausible instructions | 11 | 1 | Turn left and get the two pink hearts and the two pink hearts near the pink hearts. |
| Proportion of erroneous instructions | 68.5% | 26.8% | |

Table 1: The types of errors observed in erroneous instructions generated during the first (r = 1) and final (r = 14) rounds of deployment. We show error counts from the 100 randomly-sampled erroneous instructions. Examples illustrate error categories; red strikethrough shows erroneous segments, and blue fragments show possible corrections. Instructions that fit into multiple categories are double counted.
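The three dependency-tree measures defined in Section 6 and tracked in the language analysis above can be sketched on a tree given as a list of head indices. This is an illustrative sketch, not the authors' analysis code: the head-index encoding (with -1 for the root), the function name, and counting depth in edges are assumptions, and the input is assumed to be a valid single-rooted tree.

```python
from typing import Dict, List


def syntactic_measures(heads: List[int]) -> Dict[str, float]:
    """Length-normalized maximum depth, maximum width, and branching factor.

    heads[i] is the index of the head of token i, or -1 for the root.
    """
    n = len(heads)
    if n == 0:
        raise ValueError("empty instruction")
    children: List[List[int]] = [[] for _ in range(n)]
    roots = []
    for token, head in enumerate(heads):
        if head == -1:
            roots.append(token)
        else:
            children[head].append(token)
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    def depth(node: int) -> int:
        # Number of edges on the longest path from `node` down to a leaf.
        if not children[node]:
            return 0
        return 1 + max(depth(child) for child in children[node])

    out_degrees = [len(c) for c in children]
    non_leaf = [d for d in out_degrees if d > 0]
    raw = {
        "max_depth": float(depth(roots[0])),
        "max_width": float(max(out_degrees)),
        "branching_factor": sum(non_leaf) / len(non_leaf) if non_leaf else 0.0,
    }
    # Normalize every measure by instruction length, as described in the text.
    return {name: value / n for name, value in raw.items()}


# Four tokens: token 0 is the root with children 1 and 2; token 3 attaches to 2.
measures = syntactic_measures([-1, 0, 0, 2])
print(measures)
# {'max_depth': 0.5, 'max_width': 0.5, 'branching_factor': 0.375}
```

In practice the head indices would come from a dependency parser; any parser that exposes a head for each token can be adapted to this encoding.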

Confounding Factors. We identify two mechanisms unrelated to our approach that could explain the observed performance changes. We deploy two additional systems alongside our system during the final round. For each interaction, one of the three systems is randomly chosen. We do not inform the workers of the identity of the model for each interaction. First, we deploy the system following initialization during the final round to study if performance might be explained by user improvement over time. Second, because we train with a fixed number of epochs, later rounds have many more gradient updates, which may allow for better parameter estimation, even with the same amount of data. We train a system on the initialization dataset \(\mathcal{D}_0\) for the same number of gradient updates as when training the final full system.

Table 2 shows these confounding factors do not explain the observed gains. We find minimal differences between evaluating the initial model (\(\theta_1\)) at the beginning and end of deployment, showing no significant effect from user improvement. Training the initial system longer (\(\theta'_1\)) shows a slight overall improvement, but it is negligible compared to the final system (\(\theta_{14}\)).

| Model | r | Overall | 0-card^7 | 1-card | 2-card | 3-card |
| \(\theta_1\) | 1 | 44.8 | 84.1 | 64.9 | 9.6 | 1.7 |
| \(\theta_1\) | 14 | 45.1 | 84.5 | 62.1 | 9.3 | 0.8 |
| \(\theta'_1\) | 14 | 49.6 | 76.6 | 63.8 | 24.8 | 7.4 |
| \(\theta_{14}\) | 14 | 79.4 | 99.6 | 81.9 | 72.1 | 33.0 |

Table 2: The effect of confounding factors on task completion rate (%). The initial model \(\theta_1\) is evaluated both in the first (r = 1) and final (r = 14) rounds, showing no effect of user adaptation. In the final round, we also evaluate \(\theta'_1\), which is trained on the same data as \(\theta_1\) but using more gradient updates. We also show results for the final-round model \(\theta_{14}\).

7.2 System Variants Study

We vary different design decisions, and experiment for seven interaction rounds.^9 We experiment with five system variants: (a) FULL: our full approach described in Section 5; (b) POS-ONLY: use only examples with positive labels \(y = +1\); (c) TC-ONLY: ignore the feedback questions; instead, if the user completes the task according to our task success measure, we add positive examples with both the system plan and user execution, otherwise we add a negative example using the system plan; (d) NO-ENSEMBLE: train and deploy a single model each round, starting from an initial model randomly sampled from those we use for FULL; and (e) FINE-TUNING: train model parameters \(\theta_{r+1}\) on \(\mathcal{D}_r\) for \(N\) epochs, starting from \(\theta_r\), avoiding overfitting with rehearsal (Rebuffi et al., 2017; Hawkins et al., 2020a). In rehearsal, in each batch, half the examples are sampled randomly from the previous datasets \(\mathcal{D}_0, \dots, \mathcal{D}_{r-1}\). Except for the variations specified, the systems are identical. We do not deploy a system ablating IPS, because we observe that training with negative examples without IPS results in a largely unusable system.

^9 This study is similar to ablation analysis, but aims to study different learning design decisions. Full-fledged repetitive ablations to identify the ideal system design are particularly challenging in this work, both because of experiment costs and the complex dynamics of interacting with users.

We collect a total of 63,189 instructions across all systems, with 3,173 interactions. Each round includes 453.2 interactions on average. The total cost is $7,165. All systems are used concurrently in each round, including re-deploying FULL again starting from initialization. Figure 6 shows the results. Despite some differences between the system variants, our method is largely robust to variations in learning design decisions.

[Figure 6 plots, per round (1 to 7): task completion rate (%), positive perceived correctness feedback (%), vocabulary size, and sentence length. Legend: FULL, POS-ONLY, TC-ONLY, NO-ENSEMBLE, FINE-TUNING.]

Figure 6: Comparison of system variants.

All systems achieve comparable improvements in task completion rate, except for POS-ONLY, which slightly underperforms. We observe a faster decrease in the vocabulary size and instruction length for POS-ONLY, which does not use negative examples. This is possibly because the loss from negative examples encourages a more uniform generation distribution, potentially slowing down the overall trend of making the generation distribution more peaky. TC-ONLY, which ignores the answers to user feedback questions when constructing the dataset, shows fewer positive responses to the perceived correctness feedback, although task completion remains comparable.

We observe that using a single model (NO-ENSEMBLE) rather than an ensemble leads to limited difference in overall performance. However, because of the challenge of identifying a good automated metric to stop training, the performance of models following training varies significantly. This can lead to deploying a bad model, which provides users with a poor experience. Using an ensemble of models incurs higher computational cost, but makes such a worst-case scenario less likely. For example, in our long-term experiment, the maximum task completion performance gap we observe between the best and worst models in each round is 13%.

Finally, we observe that fine-tuning (FINE-TUNING) works as well as our re-training approach (FULL), potentially with a more stable vocabulary size. This is in contrast to our initial experiments, which showed it is harder to get consistent improvements through fine-tuning. While the fine-tuning process is harder to design because it requires choosing the fine-tuning procedure (e.g., rehearsal (Robins, 1995) or KL regularization (Yu et al., 2013)) and carefully optimizing additional hyperparameters, it can work just as well as re-training. Because fine-tuning is faster to train between rounds, it may be preferable in future work.

7.3 Comparison to Supervised Learning

We also separately study the learning trends of our method compared to training on an equivalent amount of supervised WOZ data. Supervised data is fundamentally different from our bandit data, for two main reasons: (a) it is significantly costlier because it requires a dedicated instruction-writing effort, whereas our data arises naturally from the system interaction with users during deployment; and (b) it provides per-token labels, whereas our data includes only utterance-level binary labels. For the supervised system, after each round, we expand the dataset by randomly drawing an equivalent amount of additional data from the complete dataset of Suhr et al. (2019), which includes 19,112 examples from 960 interactions.^10 This dataset allows for seven rounds. We concurrently deploy a no-ensemble variant of our continual learning system. We collect a total of 22,216 instructions across both systems, with 1,166 interactions. This experiment's total cost is $2,230.

^10 Interactions with the supervised system are not used for learning, but only for evaluation.

[Figure 7 plots task completion rate (%) per round (1 to 7) for the Continual and Supervised systems, broken down into Overall, 0-card^7, 1-card, 2-card, and 3-card plans.]

Figure 7: Comparison to supervised learning. The continual learning system is competitive in task completion rates with systems trained on an equivalent amount of supervised data.

Figure 7 shows our continual learning system consistently outperforms this supervised alternative in overall task completion rate. There are two potential explanations for this gap. First, the data our approach uses is made of examples the system is likely to generate, potentially providing a more effective learning signal. Second, there is a difference between the plans of human leaders and our planner. Our training is better suited to adapt to how the complete system is designed, whereas training on human-annotated data is bound to suffer from a distribution shift. However, the continual learning system did not consistently outperform the supervised alternative on 2-card and 3-card instructions, especially in early rounds. This is likely because the continual learning system generates few positive examples for more complex system plans (i.e., 2-card or 3-card) in earlier rounds. In later rounds, as the system improves, we observe more positive examples for such plans, creating an accelerating effect of improvement, which is best observed in our long-term experiment (Figure 5).

8 Related Work

Learning for instruction generation has been studied using supervised methods, with examples of task specifications (i.e., contexts) paired with human-written instructions (e.g., Daniele et al., 2016; Narayan-Chen et al., 2019), including to improve instruction following (Fried et al., 2018; Tan et al., 2019).

bined, for example by using rule-based methods to generate initialization data for our approach.

Bandit learning has been studied with simulated user ratings for machine translation (Nguyen et al., 2017; Lawrence et al., 2017; Kreutzer et al., 2017) and semantic parsing (Lawrence and Riezler, 2018). We learn from real users, similar to recent studies in machine translation (Kreutzer et al., 2018a,b). In general, such learning assumes users can judge the system output, for example via proficiency in the language they wish to translate to. Our learning signal does not require such expertise, and is available naturally from the interaction.

Explicit human feedback has also been incorporated into reinforcement learning methods (Knox and Stone, 2009; Pilarski et al., 2011; Daniel et al., 2015; Mathewson and Pilarski, 2016; Warnell et al., 2018; MacGlashan et al., 2017; Arumugam et al., 2019), including in the context of dialogue system learning (Liu et al., 2018). Jaques et al. (2020) study forming a reward from implicit feedback for non-task-oriented dialogue language generation, by training multiple models to detect linguistic signals, such as sentiment and lexical overlap, that correlate with explicit user feedback. Learning from users has also been studied by asking users to rank system outputs (e.g., Wilson et al., 2012; Christiano et al., 2017), including for instruction following (Wang et al., 2016) and summarization (Stiennon et al., 2020). Unlike our
We focus on continually learning approach, such ranking requires knowing the true by observing users executing generated instruc- system intent, and is not part of the system’s nor- tions. This reduces annotation needs, and del- mal operation (i.e., instructing users in our case). egates much of the learning to interaction with Incorporating human users into learning is re- users during system deployment. Language gener- lated to active learning (Settles, 2009), where a ation in context was also studied in scenarios that policy selects examples for an oracle to label dur- are not explicitly instructional, but aim to elicit ing learning. Unlike common active learning sce- specific behavior, such as negotiation games (e.g., narios we do not select examples from a static un- Lewis et al., 2017) and referring expression gener- derlying distribution (i.e., a training set) for an- ation (e.g., Dale and Reiter, 1995). notation, but generate examples with the learned Gatt and Krahmer(2017) survey existing work model. This is similar to query synthesis ac- on language generation, including using rule- tive learning (Angluin, 1988), where examples are based methods. Similar to our approach, some generated for annotation, rather than being se- rule-based methods were evaluated with human lected from a set of unannotated examples. A followers in situated environments using task suc- more significant difference is that active learning cess (e.g., Koller et al., 2010; Janarthanam and methods solicit model output annotations by pre- Lemon, 2011). Such methods are accurate and re- senting an oracle with model inputs. In contrast, liable, but are limited to pre-specified rules and re- our approach exposes users to model outputs (i.e., main static following development. Our focus is generated instructions). It does not solicit written on studying the potential for learning by observing instructions, as would be expected if requesting human behavior. 
The two approaches can be com- labels. We also do not show model inputs (i.e., plans) to users. Finally, our model interacts with panied by a reduction of language complexity, in- users during system operation, while completing cluding reducing the effective vocabulary and sen- its task. It does not require oracle annotators. tence length. While much of the decrease in the ef- Language learning from behavioral signals has fective vocabulary size throughout the system life- been studied in the cognitive science and psychol- time relates to generating fewer erroneous phrases, ogy literature.11 Krauss and Weinheimer(1966) it also reduces the language diversity and descrip- study two types of feedback in human studies: tiveness. Our experiments show that this trend can concurrent linguistic feedback and behavioral in- be slowed down by using negative examples, and tent confirmation, and show how both influence appears to be less pronounced when using fine- linguistic adaptation in an interaction over time. tuning. The combination of this decrease with the Studies of reference games reproduced the effect preference for shorter instructions makes it diffi- of confirmation feedback, showing that successful cult for the system to describe longer, complex intent communication reinforces convention for- trajectories. Qualitatively, we observe this open mation in the form of shorter references (Clark and problem is responsible for a significant portion of Wilkes-Gibbs, 1986; Hawkins et al., 2020b). Our the remaining errors. An important direction for learning signal is a type of confirmation feedback. future work is experimenting with directly encour- However, our interaction procures and makes use aging more diverse language. 
This can be com- of more complex learning signals than a simple bi- bined with approaches that allow for introducing nary intent communication success, by using the new word types, which is unlikely in our approach, path the listener takes in response to the generated even though it uses sub-word tokenization. A po- instruction as an alternative intent when construct- tential direction in this vein is combining active ing data for learning (Section 5.2).12 learning to solicit human-written oracle instruc- tions for plans the system fails to communicate. 9 Discussion We propose a methodology to continually improve Our work highlights several other directions for an instruction generation model by observing hu- future work. There is a strong need for a reli- man users executing natural language instructions, able automated metric to evaluate instruction gen- and demonstrate its efficacy within a collaborative eration. In absence of such a metric, we use a instruction following scenario. Our study shows simple, but likely sub-optimal stopping criteria for that observation of user behavior is an informa- learning. Beyond the learning signal we explored tive signal for generating language to relay instruc- in our experiments, there are additional potential tional intent. To the best of our knowledge, this cues available during interaction. For example, us- type of learning signal has not been studied before. ing continuous-valued similarity between system This learning setting facilitates continual learn- intent and user execution, modeling follower qual- ing through interaction with users, and is partic- ity to discount the learning signal from interac- ularly compelling for interactions with collabora- tions with bad followers, or weighing the feedback tive agents, including robots and software agents. questions differently for more nuanced reward. 
Such agents are likely to operate in constantly changing environments (e.g., robots in homes), Finally, the decrease in utterance length and where continual learning is necessary to adjust to vocabulary size mirrors similar trends observed changes. Our continual learning approach also in studies of human communication (Clark and provides systems the flexibility to co-adapt to hu- Wilkes-Gibbs, 1986; Hawkins et al., 2020b). This man users, who are likely to change preferences illustrates the potential of continual learning sys- and behaviors in response to system behavior. tems to reflect the dynamics of language change Our experiments demonstrate the learning pro- human participants expect in natural language in- cess is robust to various learning and process de- teractions. Observations of human learning also sign choices. However, they also show it is accom- indicate the potential of integrating our approach with conversational self-repair (Clark, 2020) and 11 This review is not comprehensive, and only aims to high- partner reformulation (Clark, 2018), both impor- light the relation to problems studied in related disciplines. tant components of child language acquisition that 12In more recent reference games (Hawkins et al., 2020b), unlike in Krauss and Weinheimer(1966), the choice of a bad likely provide better credit assignment for learning referent can be seen as related to our use of listener execution. compared to our binary bandit signal. Acknowledgments Robert Dale and Ehud Reiter. 1995. Com- putational interpretations of the gricean max- This research was supported by ARO W911NF- ims in the generation of referring expressions. 21-1-0106, a Focused Award, Masason Cognitive Science, 19:233–263. Foundation, a Facebook Fellowship, and NSF un- der grants No. 1750499 and DGE-1650441. We Christian Daniel, Oliver Kroemer, M. Viering, thank Jonathan Chang, Sasha Rush, the Cornell Jan Metz, and Jan Peters. 2015. 
Active re- NLP Group, Robert Hawkins, Dipendra Misra, ward learning with a novel acquisition function. and John Langford for discussion and comments; Autonomous Robots, 39:389–405. Suyi Diao for Unity development; Anna Effen- Andrea F Daniele, Mohit Bansal, and Matthew R berger for code to compute syntax complexity; Ge Walter. 2016. Natural language generation in Gao, Koji Shiono, and Takayuki Kojima for feed- the context of providing indoor route instruc- back on our interaction platform; and the crowd- tions. In Proceedings of the Robotics: Science sourcing workers for participating in our data col- and Systems Workshop on Model Learning for lection. Finally, we thank the action editor and the Human-Robot Communication. anonymous reviewers for detailed comments. Rémi Flamary, Nicolas Courty, Alexandre Gram- fort, Mokhtar Z. Alaya, Aurélie Boisbunon, References Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo D. Angluin. 1988. Queries and concept learning. Gautheron, Nathalie T.H. Gayraud, Hicham Ja- Machine Learning, 2:319–342. nati, Alain Rakotomamonjy, Ievgen Redko, An- toine Rolet, Antony Schutz, Vivien Seguy, Dan- Dilip Arumugam, Jun Ki Lee, Sophie Saskin, and ica J. Sutherland, Romain Tavenard, Alexander Michael L. Littman. 2019. Deep reinforcement Tong, and Titouan Vayer. 2021. Pot: Python learning from policy-dependent human feed- optimal transport. Journal of Machine Learning back. CoRR, abs/1902.04257. Research, 22(78):1–8. Valts Blukis, Eyvind Niklasson, Ross A. Knepper, Daniel Fried, Jacob Andreas, and Dan Klein. and Yoav Artzi. 2019. Learning to map natu- 2018. Unified pragmatic models for generat- ral language instructions to physical quadcopter ing and following instructions. In Proceedings control using simulated flight. In Proceedings of the Conference of the North American of the Conference on Robot Learning, pages Chapter of the Association for Computational 1415–1438. 
Linguistics: Human Language Technologies, pages 1951–1963. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Albert Gatt and Emiel Krahmer. 2017. Survey of Deep reinforcement learning from human pref- the state of the art in natural language gener- erences. In Proceedings of the Advances in ation: Core tasks, applications and evaluation. Neural Information Processing Systems. Curran Journal Artificial Intelligence Research, 61:65– Associates, Inc. 170. Robert Hawkins, Minae Kwon, Dorsa Sadigh, Eve V. Clark. 2018. Conversation and language and Noah Goodman. 2020a. Continual acquisition: A pragmatic approach. Language adaptation for efficient machine communica- Learning and Development, 14:170–185. tion. In Proceedings of the Conference on Computational Natural Language Learning, Eve V. Clark. 2020. Conversational repair and the pages 408–419. Association for Computational acquisition of language. Discourse Processes, Linguistics. 57:441 – 459. Robert D. Hawkins, Michael C. Frank, and Herbert H Clark and Deanna Wilkes-Gibbs. 1986. Noah D. Goodman. 2020b. Characterizing Referring as a collaborative process. Cognition, the dynamics of learning in repeated reference 22(1):1–39. games. Cognitive science, 44(6):e12845. Emiel Hoogeboom, Jorn WT Peters, Taco S Co- machine translation be improved with user feed- hen, and Max Welling. 2018. Hexaconv. In back? In Proceedings of the Conference of the Proceedings of the International Conference on North American Chapter of the Association for Learning Representations. Computational Linguistics: Human Language Technologies, pages 92–105. Association for Daniel G Horvitz and Donovan J Thompson. Computational Linguistics. 1952. A generalization of sampling without re- placement from a finite universe. Journal of the Julia Kreutzer, Artem Sokolov, and Stefan Riezler. American Statistical Association, 47(260):663– 2017. Bandit structured prediction for neural 685. 
sequence-to-sequence learning. In Proceedings of the Annual Meeting of the Association for Srini Janarthanam and Oliver Lemon. 2011. The Computational Linguistics, pages 1503–1513. GRUVE challenge: Generating routes un- Association for Computational Linguistics. der uncertainty in virtual environments. In Proceedings of the European Workshop on Julia Kreutzer, Joshua Uyheng, and Stefan Rie- Natural Language Generation, pages 208–211. zler. 2018b. Reliability and learnability of hu- Association for Computational Linguistics. man bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of Natasha Jaques, Judy Hanwen Shen, Asma Ghan- the Annual Meeting of the Association for deharioun, Craig Ferguson, Agata Lapedriza, Computational Linguistics, pages 1777–1788. Noah Jones, Shixiang Gu, and Rosalind Picard. Association for Computational Linguistics. 2020. Human-centric dialog training via offline reinforcement learning. In Proceedings of the Carolin Lawrence and Stefan Riezler. 2018. Im- Conference on Empirical Methods in Natural proving a neural semantic parser by coun- Language Processing, pages 3985–4003. Asso- terfactual learning from human bandit feed- ciation for Computational Linguistics. back. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, John F Kelley. 1984. An iterative design method- pages 1820–1830. Association for Computa- ology for user-friendly natural language office tional Linguistics. information applications. ACM Transactions on Information Systems, 2(1):26–41. Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from W. Bradley Knox and Peter Stone. 2009. Interac- bandit feedback under deterministic logging tively shaping agents via human reinforcement: : A case study in statistical machine trans- the TAMER framework. In Proceedings of lation. 
In Proceedings of the Conference the fifth international conference on Knowledge on Empirical Methods in Natural Language capture, pages 9–16. Processing, pages 2566–2576. Association for Alexander Koller, Kristina Striegnitz, Andrew Computational Linguistics. Gargett, Donna Byron, Justine Cassell, Robert Mike Lewis, Denis Yarats, Yann Dauphin, Devi Dale, Johanna Moore, and Jon Oberlander. Parikh, and Dhruv Batra. 2017. Deal or no 2010. Report on the second NLG challenge on deal? End-to-end learning of negotiation di- generating instructions in virtual environments alogues. In Proceedings of the Conference (GIVE-2). In Proceedings of International on Empirical Methods in Natural Language Natural Language Generation Conference. As- Processing, pages 2443–2453. Association for sociation for Computational Linguistics. Computational Linguistics. Robert M. Krauss and Sidney Weinheimer. 1966. Concurrent feedback, confirmation, and the en- Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, coding of referents in verbal communication. Pararth Shah, and Larry Heck. 2018. Dialogue Journal of personality and social psychology, 4 learning with human teaching and feedback in 3:343–6. end-to-end trainable task-oriented dialogue sys- tems. In Proceedings of the Conference of the Julia Kreutzer, Shahram Khadivi, Evgeny Ma- North American Chapter of the Association for tusov, and Stefan Riezler. 2018a. Can neural Computational Linguistics: Human Language Technologies, pages 2060–2069. Association Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, for Computational Linguistics. Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation Ilya Loshchilov and Frank Hutter. 2018. De- learning. In Proceedings of the Conference coupled weight decay regularization. In on Computer Vision and Pattern Recognition, Proceedings of the International Conference on pages 2001–2010. IEEE. Learning Representations. Anthony Robins. 1995. 
Catastrophic forgetting, James MacGlashan, Mark K. Ho, Robert Tyler rehearsal and pseudorehearsal. Connection Loftin, Bei Peng, David L. Roberts, Matthew E. Science, 7(2):123–146. Taylor, and Michael L. Littman. 2017. Inter- active learning from policy-dependent human Yossi Rubner, Carlo Tomasi, and Leonidas J feedback. In Proceedings of the International Guibas. 1998. A metric for distributions with Conference on Machine Learning. applications to image databases. In Proceedings of the International Conference on Computer K. Mathewson and P. Pilarski. 2016. Simultane- Vision. IEEE. ous control and human feedback in the train- ing of a robotic agent with actor-critic reinforce- John Schulman, Filip Wolski, Prafulla Dhariwal, ment learning. arXiv, abs/1606.06979. Alec Radford, and Oleg Klimov. 2017. Prox- imal policy optimization algorithms. arXiv, Anjali Narayan-Chen, Prashant Jayannavar, and abs/1707.06347. Julia Hockenmaier. 2019. Collaborative di- alogue in Minecraft. In Proceedings of Burr Settles. 2009. Active learning literature sur- the Annual Meeting of the Association for vey. Computational Linguistics, pages 5405–5415. Association for Computational Linguistics. Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Rad- Khanh Nguyen, Hal Daumé III, and Jordan Boyd- ford, Dario Amodei, and Paul F Christiano. Graber. 2017. Reinforcement learning for 2020. Learning to summarize with human feed- bandit neural machine translation with simu- back. In Proceedings of the Advances in Neural lated human feedback. In Proceedings of the Information Processing Systems, pages 3008– Conference on Empirical Methods in Natural 3021. Curran Associates, Inc. Language Processing, pages 1464–1474. Asso- ciation for Computational Linguistics. Alane Suhr, Claudia Yan, Jack Schluger, Stan- ley Yu, Hadi Khader, Marwa Mouallem, Iris Kishore Papineni, Salim Roukos, Todd Ward, Zhang, and Yoav Artzi. 2019. Executing and Wei-Jing Zhu. 2002. 
Bleu: a method instructions in situated collaborative interac- for automatic evaluation of machine transla- tions. In Proceedings of the Conference tion. In Proceedings of the Annual Meeting of on Empirical Methods in Natural Language the Association for Computational Linguistics, Processing, pages 2119–2130. Association for pages 311–318. Association for Computational Computational Linguistics. Linguistics. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. P. M. Pilarski, M. R. Dawson, T. Degris, F. Fahimi, 2014. Sequence to sequence learning with neu- J. P. Carey, and R. S. Sutton. 2011. Online hu- ral networks. In Proceedings of the Advances in man training of a myoelectric prosthesis con- Neural Information Processing Systems, pages troller via actor-critic reinforcement learning. 3008–3021. Curran Associates, Inc. In Proceedings of the International Conference on Rehabilitation Robotics, pages 1–7. Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to navigate unseen environments: Alec Radford, Jeffrey Wu, Rewon Child, David Back translation with environmental dropout. Luan, Dario Amodei, and Ilya Sutskever. 2019. In Proceedings of the Conference of the Language models are unsupervised multitask North American Chapter of the Association for learners. Computational Linguistics: Human Language Technologies, pages 2610–2621. Association navigation instructions. In Proceedings of for Computational Linguistics. the European Chapter of the Association for Computational Linguistics, pages 1302–1316. Sida I. Wang, Percy Liang, and Christopher D. Association for Computational Linguistics. Manning. 2016. Learning language games through interaction. In Proceedings of Zachary M Ziegler, Luke Melas-Kyriazi, Sebas- the Annual Meeting of the Association for tian Gehrmann, and Alexander M Rush. 2019. Computational Linguistics, pages 2368–2378. Encoder-agnostic adaptation for conditional Association for Computational Linguistics. language generation. arXiv, abs/1908.06938.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. 2017. Optimal and adaptive off- policy evaluation in contextual bandits. In Proceedings of International Conference on Machine Learning, pages 3589–3597. Proceed- ings of Machine Learning Research.

Garrett Warnell, Nicholas R Waytowich, Vernon Lawhern, and Peter Stone. 2018. Deep tamer: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence.

Aaron Wilson, Alan Fern, and Prasad Tade- palli. 2012. A bayesian approach for pol- icy learning from trajectory preference queries. In Proceedings of the Advances in Neural Information Processing Systems. Curran Asso- ciates, Inc.

Yang Xu and David Reitter. 2016. Conver- gence of syntactic complexity in conversa- tion. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 443–448. Association for Computational Linguistics.

Dong Yu, Kaisheng Yao, Hang Su, Gang Li, and Frank Seide. 2013. Kl-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7893– 7897. IEEE.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kil- ian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations.

Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alex Ku, Jason Baldridge, and Eugene Ie. 2021. On the evaluation of vision-and-language