Arxiv:2108.04812V1 [Cs.CL] 10 Aug 2021 from Supervised Data, Including Via Active Learn- Ing Natural Language
Total Page:16
File Type:pdf, Size:1020Kb
Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior Noriyuki Kojima, Alane Suhr, and Yoav Artzi Department of Computer Science and Cornell Tech, Cornell University [email protected] {suhr, yoav}@cs.cornell.edu <latexit sha1_base64="XR5Yz8Ha2D6x8ktwsYyJHNrBqjs=">AAACDXicbVA9SwNBEN3zM8avqKXNYhSswp2FWoo2FhYRvCgkR9jbzCWLe7vH7lwwHPkDNv4VGwtFbO3t/DduPgq/Hgw83pthZl6cSWHR9z+9mdm5+YXF0lJ5eWV1bb2ysdmwOjccQq6lNjcxsyCFghAFSrjJDLA0lnAd356N/Os+GCu0usJBBlHKukokgjN0Uruy20K4wzgpLoAZJVSXJkanNLRg6Cn0WF9oM2xXqn7NH4P+JcGUVMkU9Xblo9XRPE9BIZfM2mbgZxgVzKDgEoblVm4hY/yWdaHpqGIp2KgYfzOke07p0EQbVwrpWP0+UbDU2kEau86UYc/+9kbif14zx+Q4KoTKcgTFJ4uSXFLUdBQN7QgDHOXAEcaNcLdS3mOGcXQBll0Iwe+X/5LGQS04rAWXB9WT02kcJbJNdsg+CcgROSHnpE5Cwsk9eSTP5MV78J68V+9t0jrjTWe2yA9471/ZeJwL</latexit> Initialization<latexit sha1_base64="+XeULmdxQVYSlfEOkBPApaVE+xA=">AAACAHicbZA7T8MwFIUdnqW8AgwMLBYVElOVdADGChbYikQfUhtVjuu0Vh0nsm8QJcrCX2FhACFWfgYb/wY3zQAtR7L06Zx7Zd3jx4JrcJxva2l5ZXVtvbRR3tza3tm19/ZbOkoUZU0aiUh1fKKZ4JI1gYNgnVgxEvqCtf3x1TRv3zOleSTvYBIzLyRDyQNOCRirbx/2gD2AH6Q3kgMngj/mQda3K07VyYUXwS2gggo1+vZXbxDRJGQSqCBad10nBi8lCjgVLCv3Es1iQsdkyLoGJQmZ9tL8gAyfGGeAg0iZJwHn7u+NlIRaT0LfTIYERno+m5r/Zd0Eggsv5TJOgEk6+yhIBIYIT9vAA64YBTExQKgyDVBMR0QRCqazsinBnT95EVq1qntWdW9rlfplUUcJHaFjdIpcdI7q6Bo1UBNRlKFn9IrerCfrxXq3PmajS1axc4D+yPr8AeXDlz4=</latexit> Learning from User Behavior Abstract <latexit sha1_base64="3JiF7F+rKxpyzzztwWXoSeOlDOc=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2AruChlpoK6EYpuXFaxD2iHkslk2tBMMiQZoYxd+CtuXCji1t9w59+YtrPQ1gOBwzn3cG+OHzOqtON8W7ml5ZXVtfx6YWNza3vH3t1rKpFITBpYMCHbPlKEUU4ammpG2rEkKPIZafnD64nfeiBSUcHv9SgmXoT6nIYUI22knn1wJxIeKFiSl265Wj4tdwOhValnF52KMwVcJG5GiiBDvWd/mSBOIsI1ZkipjuvE2kuR1BQzMi50E0VihIeoTzqGchQR5aXT+8fw2CgBDIU0j2s4VX8nUhQpNYp8MxkhPVDz3kT8z+skOrzwUsrjRBOOZ4vChEEt4KQMGFBJsGYjQxCW1NwK8QBJhLWprGBKcOe/vEia1Yp7VnFvq8XaVVZHHhyCI3ACXHAOauAG1EEDYPAInsEreLOerBfr3fqYjeasLLMP/sD6/AE6+pRN</latexit> Rounds r =1, 2, 3,... User<latexit sha1_base64="/rdkM82DieGD7BktcDWwmh29QOE=">AAACAXicbVA9SwNBEN3zM8avqI1gsxgEq3AnopZBG+0imA9IQtjbTJIle3vH7pwYjtj4V2wsFLH1X9j5b9xLrtDEBwOP92aYmedHUhh03W9nYXFpeWU1t5Zf39jc2i7s7NZMGGsOVR7KUDd8ZkAKBVUUKKERaWCBL6HuD69Sv34P2ohQ3eEognbA+kr0BGdopU5hv4XwgEnVgKY3CkEznhpm3CkU3ZI7AZ0nXkaKJEOlU/hqdUMeB6CQS2ZM03MjbCdMo+ASxvlWbCBifMj60LRUsQBMO5l8MKZHVunSXqhtKaQT9fdEwgJjRoFvOwOGAzPrpeJ/XjPG3kU7ESqKERSfLurFkmJI0zhoV2jgKEeWMK6FvZXyAUtTsKHlbQje7MvzpHZS8s5K3u1psXyZxZEjB+SQHBOPnJMyuSYVUiWcPJJn8krenCfnxXl3PqatC042s0f+wPn8AUVnl2w=</latexit> Interactions We study continual learning for natural lan- <latexit sha1_base64="wLaDS53RIYIfKWhct78XfXM1NS8=">AAACGXicbVC7TsMwFHXKq4RXgJHFokViqpIOwFhBB8Yi6ENqqspxb1urjhPZTqUq6m+w8CssDCDECBN/g9N2gJYjWTo6517fe08Qc6a0635bubX1jc2t/La9s7u3f+AcHjVUlEgKdRrxSLYCooAzAXXNNIdWLIGEAYdmMLrJ/OYYpGKReNCTGDohGQjWZ5RoI3Ud16cgNEgmBvZ9EoMcMwU97Pt2lWjzr8ZFPyR6SAlPq9OuW+w6BbfkzoBXibcgBbRAret8+r2IJqEZQzlRqu25se6kRGpGOUxtP1EQEzoiA2gbKkgIqpPOLpvisyTbph9J84TGM/V3R0pCpSZhYCqzLdWyl4n/ee1E9686KRNxokHQ+aB+wrGOcBYT7jEJVPOJIYRKZnbFdEgkoSYrZZsQvOWTV0mjXPIuSuW7cqFyvYgjj07QKTpHHrpEFXSLaqiOKHpEz+gVvVlP1ov1bn3MS3PWoucY/YH19QN4IZ/w</latexit> <latexit sha1_base64="uAHfSldyAZLG1P0LGIqJtnGVqVY=">AAACHXicbVDLSgMxFM3UVx1fVZdugq3gqswUUZfFduGygn1Ap5RMmrahmcyQ3BHK0B9x46+4caGICzfi35hpZ6GtBwKHc+7Nvff4keAaHOfbyq2tb2xu5bftnd29/YPC4VFLh7GirElDEaqOTzQTXLImcBCsEylGAl+wtj+ppX77gSnNQ3kP04j1AjKSfMgpASP1CxceZRKY4nJk1wmYnwCXvIDAmBKR1Gd9VcKeh+1aKDWomKZddr9QdMrOHHiVuBkpogyNfuHTG4Q0DswoKojWXdeJoJcQBZwKNrO9WLOI0AkZsa6hkgRM95L5dTN8ZpQBHobKPAl4rv7uSEig9TTwTWW6t172UvE/rxvD8LqXcBnFwCRdDBrGAkOI06jwgCtGQUwNIVRxsyumY6IINXnpNAR3+eRV0qqU3cuye1cpVm+yOPLoBJ2ic+SiK1RFt6iBmoiiR/SMXtGb9WS9WO/Wx6I0Z2U9x+gPrK8fRcehZA==</latexit> guage instruction generation, by observing SupervisedDatasets Dataset r D Dataset 0 Construction human users’ instruction execution. We fo- D cus on a collaborative scenario, where the <latexit sha1_base64="0/DXBOmpnV5/rAwEIMfTGn4nwi4=">AAACYHicbZBPbxMxEMW9W/6E0NIUbnCxSJGKhKLdHgCpl6ocygFQkEhbKY6iWe8kseq1V/ZsabTKl+yNA5d+EpzNHqBlJMs/vTcjj19WauUpSX5F8daDh48ed550n27vPNvt7T0/87ZyEkfSausuMvColcERKdJ4UTqEItN4nl1+WvvnV+i8suYHLUucFDA3aqYkUJCmvZ9CoiF0ysy7gvCa6m9AlQPNv4CZVzDHFReitU7RoGsG+Vebo26sfZGBq68De1Xw4YGQuSUuCpXzBt9tLn7EBS2QYFq71dv9aa+fDJKm+H1IW+iztobT3o3IrayKsK3U4P04TUqa1OBISY2rrqg8liAvw8bjgAYK9JO6CWjF3wQl5zPrwjHEG/XviRoK75dFFjoLoIW/663F/3njimYfJ7UyZUVo5OahWaU5Wb5Om+fKoSS9DADSqbArlwtwIEPkvhtCSO9++T6cHQ7S94P0+2H/+KSNo8NesdfsgKXsAztmn9mQjZhkv6OtaDvaiW7jTrwb721a46idecH+qfjlH3qutWg=</latexit> <latexit sha1_base64="RruFSUnH/9uEUc0zynrlQ/gIawM=">AAACCnicbVC7TsMwFHXKq4RXgJElUCExVUkHYKxgYSyiL6mJKse9aa06TmQ7laqoMwu/wsIAQqx8ARt/g9NmgJYjWTo65z58T5AwKpXjfBultfWNza3ytrmzu7d/YB0etWWcCgItErNYdAMsgVEOLUUVg24iAEcBg04wvs39zgSEpDFvqmkCfoSHnIaUYKWlvnXqEeAKBOVD8yFNQEyohIHteWZTYMq13LcqTtWZw14lbkEqqECjb315g5ikkZ5LGJay5zqJ8jMsFCUMZqaXSkgwGeMh9DTlOALpZ/NTZvZ5mq8PY6EfV/Zc/d2R4UjKaRToygirkVz2cvE/r5eq8NrPKE9SBZwsFoUps1Vs57nYAyqAKDbVBBNB9V9tMsICEx2ONHUI7vLJq6Rdq7qXVfe+VqnfFHGU0Qk6QxfIRVeoju5QA7UQQY/oGb2iN+PJeDHejY9Fackoeo7RHxifP+wSmmY=</latexit> Natural Language system both acts and delegates tasks to hu- Supervised <latexit sha1_base64="+NGIPhLziU6XwLTdmWPpfMBIYU0=">AAACJXicbVC7TgJBFJ3FF64v1NJmIphYELJLoRYWRCksMZFHwhIyOwwwYXZ2M3PXhGz4GRt/xcZCYkys/BVngQLBk9zk5Nz38SPBNTjOt5XZ2Nza3snu2nv7B4dHueOThg5jRVmdhiJULZ9oJrhkdeAgWCtSjAS+YE1/dJ/mm89MaR7KJxhHrBOQgeR9TgkYqZu79SiTwBSXA7tKwEwCjT0P2wUvIDCkRCTVSdcper0QdHFZUwW7m8s7JWcGvE7cBcmjBWrd3NTMoXFgVlJBtG67TgSdhCjgVLCJ7cWaRYSOyIC1DZUkYLqTzL6c4Auj9HA/VCYk4Jm63JGQQOtx4JvK9Ey9mkvF/3LtGPo3nYTLKAYm6XxRPxYYQpxahntcMQpibAihiptbMR0SRajxTacmuKsvr5NGueReldzHcr5yt7Aji87QObpELrpGFfSAaqiOKHpBb+gDTa1X6936tL7mpRlr0XOK/sD6+QUpOqRr</latexit> Generation Model Datasets man users using natural language. We com- Training x¯ P ( , ; ✓ ) 0,..., r pare user execution of generated instruc- ⇠ ·|· · r D D tions to the original system intent as an indi- <latexit sha1_base64="f5fpd9LH40a9VXlIHe4kj1pwqZw=">AAACA3icbVDLSsNAFJ3UV42vqDvdBIvgxpJkoS6LLnRZoS9oQplMb9qhk0mYmQglFNz4K25cKOLWn3Dn3zhts9DWAxcO59w7c+8JU0alcpxvo7Syura+Ud40t7Z3dves/YOWTDJBoEkSlohOiCUwyqGpqGLQSQXgOGTQDkc3U7/9AELShDfUOIUgxgNOI0qw0lLPOvIJcAWC8oF5W2+ce75vtoEOhkr2rIpTdWawl4lbkAoqUO9ZX34/IVmsHyQMS9l1nVQFORaKEgYT088kpJiM8AC6mnIcgwzy2Q0T+1QrfTtKhC6u7Jn6eyLHsZTjONSdMVZDuehNxf+8bqaiqyCnPM0UcDL/KMqYrRJ7GojdpwKIYmNNMBFU72qTIRaY6FSkqUNwF09eJi2v6l5UvXuvUrsu4iijY3SCzpCLLlEN3aE6aiKCHtEzekVvxpPxYrwbH/PWklHMHKI/MD5/AHGNlsA=</latexit> GPT-2 <latexit sha1_base64="HKC48qWozmgZTYxXmBLYcHdnuaY=">AAACDnicbZC7TsMwFIYdriXcAowsFlUlpirpAIxVuzAWqTepraoTx2mtOk5kO4gq6hOw8CosDCDEyszG2+C2GaDlSJY+/f85ts/vJ5wp7brf1sbm1vbObmHP3j84PDp2Tk7bKk4loS0S81h2fVCUM0FbmmlOu4mkEPmcdvxJfe537qlULBZNPU3oIIKRYCEjoI00dEp9QoWmkomRXY8NPegUOK6BCJjGTQlMGGvoFN2yuyi8Dl4ORZRXY+h89YOYpJG5m3BQque5iR5kIDUjnM7sfqpoAmQCI9ozKCCiapAt1pnhklECHMbSHKHxQv09kUGk1DTyTWcEeqxWvbn4n9dLdXgzyJhIUk0FWT4UphzrGM+zwQGTlGg+NQBEMvNXTMYggZiAlG1C8FZXXod2pexdlb27SrFay+MooHN0gS6Rh65RFd2iBmohgh7RM3pFb9aT9WK9Wx/L1g0rnzlDf8r6/AFAoZw7</latexit> cation to the system’s success communicat- Contextual Bandit Training Weights ing its intent. We show how to use this sig- nal to improve the system’s ability to gener- Figure 1: Diagram of our learning process. We ini- ate instructions via contextual bandit learn- tialize a generation model using supervised learn- ing. In interaction with real users, our sys- ing, and continually learn through interaction with tem demonstrates dramatic improvements in users, by alternating between observing user exe- its ability to generate language over time. cution of generated instructions and training. 1 Introduction generation by observing human users executing generated instructions. We learn by comparing Natural language provides an expressive and ac- instruction execution to the system intent, and cessible avenue to instruct non-expert users. The demonstrate how this results in a system that con- ability to generate instructions is critical for sys- tinually improves its natural language generation tems that collaborate with users, for example to ability through interaction with users. Figure1 il- delegate tasks. In such scenarios, the system gen- lustrates our learning process. erates language to communicate to the user a latent We design a task-oriented collaborative sce- intent. When users are cooperative and proficient nario using the CEREALBAR game environ- in the language, whether they accomplish the sys- ment (Suhr et al., 2019). In CEREALBAR, two tem’s intent provides an informative, albeit noisy agents, a leader and a follower, work together to signal to the quality of instruction generation. complete tasks. The leader plans the tasks to com- This implicit signal is fundamentally different plete, and communicates goals to the follower us- arXiv:2108.04812v1 [cs.CL] 10 Aug 2021 from supervised data, including via active learn- ing natural language. CEREALBAR was originally ing, in that it does not label the system’s intent introduced for studying follower instruction exe- with a written instruction, but only provides evi- cution. We modify it to focus on generation of dence to the quality of a given instruction in re- leader instructions, which are then executed by hu- laying this intent. As a natural byproduct of inter- man followers. The collaborative, embodied setup action with users, it also differs from explicit user effectively engages users, and aligns their incen- feedback in not requiring user action beyond what tives with executing the system’s instructions to they already do as part of the interaction. Despite the best of their abilities. its potential and prevalence, this signal is under- A major challenge is inferring a learning sig- studied for