ACL 2020

NLP for Conversational AI

Proceedings of the 2nd Workshop

July 9, 2020. © 2020 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Order copies from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360, USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-952148-08-8

Introduction

Welcome to the ACL 2020 Workshop on NLP for Conversational AI.

Ever since the invention of the intelligent machine, hundreds of thousands of mathematicians, linguists, and computer scientists have dedicated their careers to empowering human-machine communication in natural language. Although this idea finally seems around the corner, with the proliferation of virtual personal assistants such as Siri, Alexa, Google Assistant, and Cortana, the development of these conversational agents remains difficult, and plenty of questions and challenges remain unanswered.

Conversational AI is hard because it is an interdisciplinary subject. Initiatives have been started in different research communities, from the Dialogue State Tracking Challenges to the NIPS Conversational Intelligence Challenge live competition and the Amazon Alexa Prize. However, various fields within the NLP community, such as semantic parsing, coreference resolution, sentiment analysis, question answering, and machine reading comprehension, have seldom been evaluated or applied in the context of conversational AI.

The goal of this workshop is to bring together NLP researchers and practitioners in different fields, alongside experts in speech and machine learning, to discuss the current state of the art and new approaches, to share insights and challenges, to bridge the gap between academic research and real-world product deployment, and to shed light on future directions. "NLP for Conversational AI" will be a one-day workshop including keynotes, spotlight talks, posters, and panel sessions. In the keynote talks, senior technical leaders from industry and academia will share insights on the latest developments in the field. An open call for papers will be announced to encourage researchers and students to share their prospects and latest discoveries. The panel discussion will focus on the challenges and future directions of conversational AI research, on bridging the gap between research and industrial practice, as well as on audience-suggested topics.

With the rising interest in conversational AI, NLP4ConvAI 2020 was competitive: we received 27 submissions and, after a rigorous review process, accepted only 15. In total, there are 13 accepted regular workshop papers and 2 cross-submissions or extended abstracts, giving an overall acceptance rate of about 55.5%. We hope you will enjoy NLP4ConvAI 2020 at ACL and contribute to the future success of our community!

NLP4ConvAI 2020 Organizers:
Tsung-Hsien Wen, PolyAI
Asli Celikyilmaz, Microsoft
Zhou Yu, UC Davis
Alexandros Papangelis, Uber AI
Mihail Eric, Amazon Alexa AI
Anuj Kumar, Facebook
Iñigo Casanueva, PolyAI
Rushin Shah, Google


Organizers:
Tsung-Hsien Wen, PolyAI (UK)
Asli Celikyilmaz, Microsoft (USA)
Zhou Yu, UC Davis (USA)
Alexandros Papangelis, Uber AI (USA)
Mihail Eric, Amazon Alexa AI (USA)
Anuj Kumar, Facebook (USA)
Iñigo Casanueva, PolyAI (UK)
Rushin Shah, Google (USA)

Program Committee:
Pawel Budzianowski, University of Cambridge and PolyAI
Sam Coope, PolyAI
Ondřej Dušek, Heriot-Watt University
Nina Dethlefs, University of Hull
Daniela Gerz, PolyAI
Pei-Hao Su, PolyAI
Matthew Henderson, PolyAI
Simon Keizer, AI Lab, Vrije Universiteit Brussel
Ryan Lowe, McGill University
Julien Perez, Naver Labs
Marek Rei, Imperial College London
Gokhan Tür, Uber AI
Ivan Vulić, University of Cambridge and PolyAI
David Vandyke, Apple
Zhuoran Wang, Trio.io
Tiancheng Zhao, CMU
Alborz Geramifard, Facebook
Shane Moon, Facebook
Jinfeng Rao, Facebook
Bing Liu, Facebook
Pararth Shah, Facebook
Ahmed Mohamed, Facebook
Vikram Ramanarayanan, ETS
Rahul Goel, Google
Shachi Paul, Google
Seokhwan Kim, Amazon
Andrea Madotto, HKUST
Dian Yu, UC Davis
Piero Molino, Uber
Chandra Khatri, Uber
Yi-Chia Wang, Uber
Huaixiu Zheng, Uber
Fei Tao, Uber
Abhinav Rastogi, Google
Stefan Ultes, Daimler
José Lopes, Heriot-Watt University
Behnam Hedayatnia, Amazon
Dilek Hakkani-Tür, Amazon
Ta-Chung Chi, CMU
Shang-Yu Su, NTU
Pierre Lison, NR
Ethan Selfridge, Interactions
Teruhisa Misu, HRI
Svetlana Stoyanchev, Toshiba
Ryuichiro Higashinaka, NTT
Hendrik Buschmeier, University of Bielefeld
Kai Yu, Baidu
Koichiro Yoshino, NAIST
Abigail See, Stanford University
Ming Sun, Amazon Alexa AI

Invited Speakers:
Yun-Nung Chen, National Taiwan University
Dilek Hakkani-Tür, Amazon Alexa AI
Jesse Thomason, University of Washington
Antoine Bordes, Facebook AI Research
Jacob Andreas, Massachusetts Institute of Technology
Jason Williams, Apple

Table of Contents

Using Alternate Representations of Text for Natural Language Understanding
Venkat Varada, Charith Peris, Yangsook Park and Christopher DiPersio ........ 1

On Incorporating Structural Information to improve Dialogue Response Generation
Nikita Moghe, Priyesh Vijayan, Balaraman Ravindran and Mitesh M. Khapra ........ 11

CopyBERT: A Unified Approach to Question Generation with Self-Attention
Stalin Varanasi, Saadullah Amin and Guenter Neumann ........ 25

How to Tame Your Data: Data Augmentation for Dialog State Tracking
Adam Summerville, Jordan Hashemi, James Ryan and William Ferguson ........ 32

Efficient Intent Detection with Dual Sentence Encoders
Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson and Ivan Vulić ........ 38

Accelerating Natural Language Understanding in Task-Oriented Dialog
Ojas Ahuja and Shrey Desai ........ 46

DLGNet: A Transformer-based Model for Dialogue Response Generation
Olabiyi Oluwatobi and Erik Mueller ........ 54

Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors
Longshaokan Wang, Maryam Fazel-Zarandi, Aditya Tiwari, Spyros Matsoukas and Lazaros Polymenakos ........ 63

Automating Template Creation for Ranking-Based Dialogue Models
Jingxiang Chen, Heba Elfardy, Simi Wang, Andrea Kahn and Jared Kramer ........ 71

From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap
Shuyang Gao, Sanchit Agarwal, Di Jin, Tagyoung Chung and Dilek Hakkani-Tür ........ 79

Improving Slot Filling by Utilizing Contextual Information
Amir Pouran Ben Veyseh, Franck Dernoncourt and Thien Huu Nguyen ........ 90

Learning to Classify Intents and Slot Labels Given a Handful of Examples
Jason Krone, Yi Zhang and Mona Diab ........ 96

MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines
Xiaoxue Zang, Abhinav Rastogi and Jindong Chen ........ 109

Sketch-Fill-A-R: A Persona-Grounded Chit-Chat Generation Framework
Michael Shum, Stephan Zheng, Wojciech Kryscinski, Caiming Xiong and Richard Socher ........ 118

Probing Neural Dialog Models for Conversational Understanding
Abdelrhman Saleh, Tovly Deutsch, Stephen Casper, Yonatan Belinkov and Stuart Shieber ........ 132


Workshop Program

July 9, 2020

06:00 Opening Remarks

06:10 Invited Talk Yun-Nung Chen

06:40 Invited Talk Antoine Bordes

07:10 Invited Talk Jacob Andreas

07:40 Using Alternate Representations of Text for Natural Language Understanding Venkat Varada, Charith Peris, Yangsook Park and Christopher Dipersio

07:50 On Incorporating Structural Information to improve Dialogue Response Generation Nikita Moghe, Priyesh Vijayan, Balaraman Ravindran and Mitesh M. Khapra

08:00 CopyBERT: A Unified Approach to Question Generation with Self-Attention Stalin Varanasi, Saadullah Amin and Guenter Neumann

08:10 How to Tame Your Data: Data Augmentation for Dialog State Tracking Adam Summerville, Jordan Hashemi, James Ryan and William Ferguson

08:20 Efficient Intent Detection with Dual Sentence Encoders Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson and Ivan Vulić

08:30 Accelerating Natural Language Understanding in Task-Oriented Dialog Ojas Ahuja and Shrey Desai

08:40 DLGNet: A Transformer-based Model for Dialogue Response Generation Olabiyi Oluwatobi and Erik Mueller

08:50 Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors Longshaokan Wang, Maryam Fazel-Zarandi, Aditya Tiwari, Spyros Matsoukas and Lazaros Polymenakos

July 9, 2020 (continued)

09:00 Panel

10:00 Invited Talk Jesse Thomason

10:30 Invited Talk Dilek Hakkani-Tür

11:00 Invited Talk Jason Williams

11:30 Automating Template Creation for Ranking-Based Dialogue Models Jingxiang Chen, Heba Elfardy, Simi Wang, Andrea Kahn and Jared Kramer

11:40 From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap Shuyang Gao, Sanchit Agarwal, Di Jin, Tagyoung Chung and Dilek Hakkani-Tür

11:50 Improving Slot Filling by Utilizing Contextual Information Amir Pouran Ben Veyseh, Franck Dernoncourt and Thien Huu Nguyen

12:00 Learning to Classify Intents and Slot Labels Given a Handful of Examples Jason Krone, Yi Zhang and Mona Diab

12:10 MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines Xiaoxue Zang, Abhinav Rastogi and Jindong Chen

12:20 Sketch-Fill-A-R: A Persona-Grounded Chit-Chat Generation Framework Michael Shum, Stephan Zheng, Wojciech Kryscinski, Caiming Xiong and Richard Socher

12:30 Probing Neural Dialog Models for Conversational Understanding Abdelrhman Saleh, Tovly Deutsch, Stephen Casper, Yonatan Belinkov and Stuart Shieber

12:40 Closing Remarks

Using Alternate Representations of Text for Natural Language Understanding

Venkata Sai Varada, Charith Peris, Yangsook Park, Christopher DiPersio
Amazon Alexa AI, USA
{vnk, perisc, yangsoop, dipersio}@amazon.com

Abstract

One of the core components of voice assistants is the Natural Language Understanding (NLU) model. Its ability to accurately classify the user's request (or "intent") and recognize named entities in an utterance is pivotal to the success of these assistants. NLU models can be challenged in some languages by code-switching or morphological and orthographic variations. This work explores the possibility of improving the accuracy of NLU models for Indic languages via the use of alternate representations of input text for NLU, specifically ISO-15919 and IndicSOUNDEX, a custom SOUNDEX designed to work for Indic languages. We used a deep neural network based model to incorporate the information from alternate representations into the NLU model. We show that using alternate representations significantly improves the overall performance of NLU models when the amount of training data is limited.

1 Introduction

Recent times have seen a significant surge of interest in voice-enabled smart assistants, such as Amazon's Alexa, Google Assistant, and Apple's Siri. These assistants are powered by several components, which include Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) models. The input to an NLU model is the text returned by the ASR model. With this input text, there are two major tasks for an NLU model: 1) Intent Classification (IC), and 2) Named Entity Recognition (NER).

Code-switching is a phenomenon in which two or more languages are interchanged within a sentence or between sentences. An example of code-switching from Hindi is 'मेरी shopping list में dove soap bar जोड़ो' ('add dove soap bar to my shopping list'). NLU models expect ASR to return the text in the above form, which matches the transcription of the training data for NLU models. The NLU model would then return the action Add Items To Shopping List, while also recognizing 'dove soap bar' as the actual item name to be added to the shopping list. However, ASR could inconsistently return 'मेरी shopping list में dove soap बार जोड़ो', where the English word 'bar' is recognized as the Hindi word 'बार'. Note that while the Hindi word 'बार' means something different than the English 'bar', their pronunciations are similar enough to mislead the ASR model in the context of a mixed-language utterance. In this case, the NLU model should become robust against such ASR mistakes and learn that 'dove soap बार' is equivalent to 'dove soap bar' in order to correctly recognize it as an item name. The phenomenon of code-switching is common amongst other Indic languages as well.

Building NLU models can be more challenging for languages that involve code-switching. ASR inconsistencies can cause significant challenges for NLU models by introducing new data that the models were not trained on. Such inconsistencies can occur due to 1) code-switching in the data, especially when the input text contains more than one script (character set), 2) orthographic or morphological variations that exist in the input text, or 3) requirements for multilingual support. In a language such as English in the United States, where code-switching is less common, both ASR and NLU models can perform quite well, as input data tend to be more consistent. However, when it comes to Indic languages such as Hindi, where code-switching is much more common, the question arises as to which representation of the input text would work best for NLU models.

There is ongoing research on solving modeling tasks for data with code-switching. In Geetha et al. (2018), the authors explore using character- and word-level embeddings for NER tasks. Some of this research focuses on using alternate representations of text for NLU modeling. For example, Johnson et al. (2019) explore the use of cross-lingual transfer learning from English to Japanese for NER by romanizing Japanese input into a Latin-based character set to overcome the script differences between the language pair. Zhang and LeCun (2017) showed that using romanization for Japanese text for sentiment classification hurt the performance of the monolingual model.

In this paper, we explore the possibility of mitigating the problems related to ASR inconsistency and code-switching in our input data by using two alternate representations of text in our NLU model: ISO-15919 and IndicSOUNDEX. ISO-15919¹ was developed as a standardized Latin-based representation for Indic languages and scripts. SOUNDEX is an algorithm that provides phonetic-like representations of words. Most work on SOUNDEX algorithms has focused primarily on monolingual solutions. One of the best-known implementations with a multilingual component is Beider-Morse Phonetic Matching (Beider and Morse, 2010); however, it only identifies the language in which a given word is written to choose which pronunciation rules to apply. Other attempts at multilingual SOUNDEX algorithms, particularly for Indic languages, were smaller studies focused on two Indic languages, with or without English as a third language. Chaware and Rao (2012) developed a custom SOUNDEX algorithm for monolingual Hindi and Marathi word pair matching. Shah and Singh (2014) describe an actual multilingual SOUNDEX implementation designed to cover Hindi, Gujarati, and English, which, in addition to the actual algorithm implementation, was aided by a matching threshold declaring two conversions a match even if they differed in up to two characters.

In this study, we use ISO-15919 and IndicSOUNDEX representations of text in a deep neural network (DNN) to perform multi-task modeling of IC and NER. We experiment on one high-resource Indic language (Hindi) and three low-resource Indic languages (Tamil, Marathi, Telugu). In Section 2, we describe the two alternate representations of text that we explore. In Section 3, we describe our data and model architecture, and detail our experimental setup. In Section 4, we present our results, followed by our conclusions in Section 5.

2 Alternate representations of text

Using data transcribed in the original script of a language can cause problems for both monolingual and multilingual NLU models. For a monolingual model in a language where code-switching, orthographic variations, or rich morphological inflections are common, NLU models may not be able to perform well on all the variations, depending on the frequency of these variations in the training data. For multilingual models, words with similar sound and meaning across different languages (e.g., loanwords, common entities) cannot be captured if the words are written in their original script. For example, the same proper noun 'telugu' is written as 'तेलुगु' in Hindi, as 'தெலுங்கு' in Tamil, and as 'తెలుగు' in Telugu. Although these forms sound similar and mean the same thing, the NLU model will see them as unrelated tokens if they are represented in their original scripts in the input data.

From the NLU point of view, a text representation that can minimize variations of the same or similar words within a language and across different languages would be beneficial for both IC and NER tasks. In this section, we explore two alternative text representations for Indic languages: ISO-15919 and a SOUNDEX-based algorithm, which we call IndicSOUNDEX. Compared to using the original scripts, these two alternatives can represent the variants of the same word or root in the same way. For example, in the original Hindi script (i.e., Devanagari), the word for 'volume/voice' can be represented in two forms: 'आवाज़' and 'आवाज'. These two forms, however, are uniformly represented as 'āvāj' in ISO-15919 and as the string 'av6' in IndicSOUNDEX. Similarly, the English word 'list' may be transcribed as 'list' or in Telugu script; both map to the same IndicSOUNDEX representation, 'ls8'.

¹ https://en.wikipedia.org/wiki/ISO_15919

2.1 ISO-15919

ISO-15919 is a standardized representation of Indic scripts based on Latin characters, which maps the corresponding Indic characters onto the same Latin character. For example, 'क' in Devanagari, 'క' in Telugu, and 'க' in Tamil are all represented by 'k' in the ISO-15919 standard. Consequently, ISO-15919 can be used as a neutral common representation between Indic scripts. However, ISO-15919 has some downsides. Indic scripts often rely on implicit vowels which are not represented in the orthography, which means they cannot be reliably added to a transliterated word. Additionally, Indic scripts have a character called a halant, or virama, which is used to suppress an inherent vowel. This character, although usually included orthographically in a word, does not have an ISO-15919 representation and so is lost in an exact, one-to-one conversion. Finally, it is not always the case that the same word in two different Indic scripts will have the same ISO-15919 conversion, due to script-to-script and language-to-language differences and irregularities. Table 1 shows some examples of conversion using ISO-15919.

2.2 IndicSOUNDEX

SOUNDEX algorithms provide phonetic-like representations of words that attempt to reduce spelling and other orthographic variations for the same or similarly-sounding words, mainly to increase recall in information retrieval or text search tasks. This is done by following a set of simple steps, which may include removing vowels, reducing character duplication, and mapping sets of characters to a single character, based on whether they 1) sound similar or 2) are used in similar environments in similar-sounding words. For example, at its most extreme, American SOUNDEX maps 'c', 'g', 'j', 'k', 'q', 's', 'x', and 'z' to the same character: '2'.

IndicSOUNDEX is a custom SOUNDEX approach designed to work on Hindi, Marathi, Telugu, Tamil, Malayalam, Punjabi, Bengali, Kannada, Gujarati, and English. Table 1 shows some examples of conversion using IndicSOUNDEX.

Table 1: Examples of conversion using ISO-15919 and IndicSOUNDEX

Script | Indic original | ISO-15919 | IndicSOUNDEX
Devanagari (Hindi) | इंडिके | indikē | i034
Devanagari (Marathi) | इतिहासाचा | itihāsācā | i8hs2
Telugu | పడింది | pḍindi | p303
Tamil | பிர்வீண் | pirvīṇ | plv0
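To make the two schemes concrete, here is a minimal, illustrative Python sketch. The character tables are tiny invented stand-ins (the real ISO-15919 standard covers the full Indic character inventories, and the actual IndicSOUNDEX rules are not published in the paper); the consonant buckets below were chosen only so that the example reproduces the 'av6' output quoted above.

```python
# Toy sketch only: these tables are invented stand-ins, NOT the real
# ISO-15919 tables or the authors' IndicSOUNDEX rules.

ISO15919 = {
    "क": "k", "క": "k", "க": "k",      # same consonant in three scripts
    "आ": "ā", "ा": "ā", "व": "v",
    "ज": "j", "ज़": "j", "़": "",       # nukta variants fold together
}

def to_iso15919(word: str) -> str:
    # Map each character through the table; pass unknown characters through.
    return "".join(ISO15919.get(ch, ch) for ch in word)

BUCKETS = {"k": "2", "g": "2", "j": "6", "v": "v"}   # invented buckets
DEMACRON = {"ā": "a", "ē": "e", "ī": "i", "ō": "o", "ū": "u"}

def soundex_like(word: str) -> str:
    # SOUNDEX-style reduction: keep the first letter, drop vowels,
    # bucket similar consonants, and collapse consecutive duplicates.
    latin = to_iso15919(word)
    out = [DEMACRON.get(latin[0], latin[0])]
    for ch in latin[1:]:
        code = BUCKETS.get(ch)          # vowels/unknown marks have no bucket
        if code and out[-1] != code:
            out.append(code)
    return "".join(out)

print(to_iso15919("क"), to_iso15919("క"), to_iso15919("க"))  # k k k
print(soundex_like("आवाज़"), soundex_like("आवाज"))            # av6 av6
```

Both spelling variants of the Hindi word for 'volume/voice' collapse to a single form under either scheme, which is exactly the property the paper exploits.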

3 Experimental Setup

3.1 Data

For our experiments, we chose datasets from four Indic languages: Hindi, Tamil, Marathi, and Telugu. For Hindi, we use an internal large-scale real-world dataset; for the other three languages, we use relatively small datasets collected and annotated by third-party vendors. We perform a series of experiments to evaluate the use of the alternate representations of text, ISO-15919 and IndicSOUNDEX, described in Section 2.

Our Hindi dataset consists of 6M utterances. We separate a portion (~1%) of the data into an independent test set. We execute a stratified split on the remainder, based on intent, and choose 10% of this data for validation and the rest as training data. For the three smaller-scale datasets, we execute a stratified split with 60% for training, 20% for validation, and 20% for testing. Table 2 shows the data partitions across the different languages used for our experiments.

Table 2: Data partition across languages

Language | Train | Test | Valid | Total
Hindi | 5.4M | 600K | 54K | 6M
Tamil | 27K | 9K | 9K | 45K
Marathi | 27K | 8.5K | 8.5K | 42K
Telugu | 27K | 9K | 9K | 46K

Each of the four datasets contains code-switching. The transcription was done either in the original script of the language (for words from that language) or in standard Latin script (for words from other languages, including English). However, the transcription was not always consistent, especially in the third-party data, so some Indic words were transcribed in so-called 'colloquial' Latin (i.e., a casual, non-standard way of representing Indic languages in online resources) and some English words were represented in the original script of the Indic languages (e.g., a Telugu-script rendering of the English word 'list'). See Table 3 for the total counts of tokens in each script in each of the training and test sets, which reflects the use of code-switching in each dataset. Note that Hindi and Marathi both use the same script (Devanagari) in their writing systems.

3.2 Model architecture

For our NLU models, we used a multi-task modeling framework that predicts Intent and Named Entities, given an input sentence. A schematic of our model is given in Figure 1 and Figure 2. For a given corpus, we built alternate text representations using the ISO-15919 and IndicSOUNDEX approaches described in Section 2. We used both the original sentence and the alternate representations of the sentence to inform our models via added embedding layers. First, we built a baseline model, without any alternate representations, using the Baseline Embedding Block architecture below. Next, we built two candidate models: the first with an embedding layer using alternate representations from IndicSOUNDEX, and the second with the alternate representations from ISO-15919. Our modeling architecture is designed as follows:

Baseline Embedding Block: For our baseline model, we designed our embedding block with two layers and concatenated the outputs. The first layer is a token embedding layer of dimension 100, while the second is a character-based token embedding built using convolutional neural network (CNN) subsequence embedding with dimension 16. The final embedding used in our models is generated by concatenating these two embeddings.

Embedding block with alternate representations: For our alternate representation models, we modified the baseline embedding block to accommodate these representations. All other components, including the encoder and decoder, stayed the same. The embedding block for this setting was modified to have four layers, with the final embedding used in our model being their concatenation. A schematic is shown in Figure 1. The four layers are as follows:

1. Token embedding layer of dimension 100.

2. Token embedding layer for alternate representations of tokens, of dimension 100.

3. Character embedding built using convolutional neural network (CNN) subsequence embedding with dimension 16.

4. Alternate-representation character embedding built using convolutional neural network (CNN) subsequence embedding with dimension 16.

Finally, we concatenated the outputs of these embedding layers to obtain the final word embedding.

Figure 1: Embedding for a token

Figure 2: Modeling architecture with tokens and alternate representation (IndicSOUNDEX) as input
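As a rough sketch of this embedding block, here is a PyTorch-style implementation. The framework choice, vocabulary sizes, kernel width, and pooling are our assumptions; only the dimensions (100/100/16/16) and the concatenation come from the paper.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-based token embedding: embed characters, apply a 1-D
    convolution over each token's character sequence, then max-pool."""
    def __init__(self, n_chars: int, dim: int = 16, kernel: int = 3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim, padding_idx=0)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)

    def forward(self, char_ids):                    # (n_tokens, max_chars)
        x = self.emb(char_ids).transpose(1, 2)      # (n_tokens, dim, max_chars)
        return torch.relu(self.conv(x)).max(dim=2).values  # (n_tokens, dim)

class AltRepEmbedding(nn.Module):
    """The four layers of Section 3.2, concatenated into one word embedding
    of size 100 + 100 + 16 + 16 = 232."""
    def __init__(self, n_tok, n_alt_tok, n_chars, n_alt_chars):
        super().__init__()
        self.tok = nn.Embedding(n_tok, 100)          # 1. token embedding
        self.alt_tok = nn.Embedding(n_alt_tok, 100)  # 2. alternate-representation tokens
        self.char_cnn = CharCNN(n_chars)             # 3. character CNN
        self.alt_char_cnn = CharCNN(n_alt_chars)     # 4. alternate-representation chars

    def forward(self, tok_ids, alt_tok_ids, char_ids, alt_char_ids):
        return torch.cat([self.tok(tok_ids),
                          self.alt_tok(alt_tok_ids),
                          self.char_cnn(char_ids),
                          self.alt_char_cnn(alt_char_ids)], dim=-1)
```

The baseline block is the same sketch with only the first and third layers.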

Table 3: The number of tokens in each script within each training and test set

Script | Hindi Train | Hindi Test | Tamil Train | Tamil Test | Marathi Train | Marathi Test | Telugu Train | Telugu Test
Latin | 13.9M | 101K | 22.5K | 7.5K | 29.1K | 9.7K | 40.7K | 13.4K
Devanagari | 15.5M | 102K | 0 | 0 | 110K | 36K | 0 | 0
Tamil | 8 | 0 | 108K | 36K | 0 | 0 | 0 | 0
Telugu | 0 | 0 | 0 | 0 | 0 | 0 | 112K | 37.5K
Other | 1K | 0 | 2 | 0 | 0 | 0 | 0 | 0
Total | 29.5M | 203K | 131K | 43.5K | 139K | 46K | 153K | 51K

Encoder: We defined a biLSTM (bidirectional Long Short-Term Memory) encoder with 5 hidden layers and a hidden dimension of 512. A schematic is given in Figure 2. In order to handle our multi-task prediction of Intent and NER, we have two output layers:

1. A multi-layer perceptron (MLP) classifier with the number of classes set to the vocabulary size of Intent.

2. A conditional random field (CRF) sequence labeler with the number of classes set to the vocabulary size of Named Entities.

Decoder: We used a joint task decoder for our purpose. The Intent-prediction task is accomplished by a single-class decoder, while the label-prediction task is achieved by a sequence labeling decoder. For this setup, the loss function is a multi-component loss, with cross-entropy loss for Intent and CRF loss (Lafferty et al., 2001) for NER.

Evaluation metric: We use Semantic Error Rate (SemER) (Su et al., 2018) to evaluate the performance of our models. The definition of SemER is as follows:

    SemER = (D + I + S) / (C + D + S)    (1)

where D denotes deletions, I insertions, S substitutions, and C correct slots. As the Intent is treated as a slot in this metric, an Intent error is considered a substitution.

Experimental Process: Our experimental setup consists of the following process. For each language, we build three models with the model architecture explained above:

• Baseline model: a baseline model with the architecture explained in the model architecture section, using word and character embeddings from tokens.

• IndicSOUNDEX model: a model with IndicSOUNDEX alternate representations, i.e., using an embedding layer with tokens, the IndicSOUNDEX representation of tokens, characters from tokens, and the IndicSOUNDEX representations of characters.

• ISO-15919 model: a model with ISO-15919 alternate representations, i.e., using an embedding layer with tokens, the ISO-15919 representation of tokens, characters from tokens, and the ISO-15919 representations of characters.

4 Results

To evaluate each model, we used our withheld test dataset and measured the SemER scores associated with the three different models. To obtain confidence intervals, we bootstrapped our test dataset by sampling ten times with replacement, and evaluated our models on each of these ten datasets. The final SemER was taken as the average over all ten iterations. We used a Student's t-test to calculate the significance of the improvements of the IndicSOUNDEX and ISO-15919 models with respect to the baseline. Here we present the results of our three models (the baseline DNN model, the IndicSOUNDEX model, and the ISO-15919 model) on each of the four languages.
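A minimal sketch of the metric and this evaluation loop is shown below. The paired t-test variant and the hypothetical `eval_baseline` / `eval_candidate` callables are our assumptions; only the formula, the ten resamples, and the averaging come from the paper.

```python
import random
from statistics import mean
from scipy.stats import ttest_rel   # paired t-test is our assumption

def semer(correct, deletions, insertions, substitutions):
    # Semantic Error Rate, Eqn. 1; an intent error counts as a substitution.
    return (deletions + insertions + substitutions) / (correct + deletions + substitutions)

def bootstrap_scores(test_set, evaluate, n_rounds=10):
    # Resample the test set with replacement ten times, as in the paper,
    # collecting the SemER that `evaluate` reports on each resample.
    return [evaluate(random.choices(test_set, k=len(test_set)))
            for _ in range(n_rounds)]

# Usage sketch (eval_baseline / eval_candidate are hypothetical callables
# that run a model on a sample and return its SemER):
#   base = bootstrap_scores(test_set, eval_baseline)
#   cand = bootstrap_scores(test_set, eval_candidate)
#   final_semer = mean(cand)                 # average over the ten iterations
#   t_stat, p_value = ttest_rel(base, cand)  # significance w.r.t. the baseline

print(semer(correct=8, deletions=1, insertions=0, substitutions=1))  # 0.2
```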

Table 4: Comparison of ISO-15919 and IndicSOUNDEX model performance on Hindi w.r.t. baseline. '+' indicates degradation, '-' indicates improvement. '*' denotes results that are not statistically significant.

Language | Model | Topics improved | Avg. improvement | Topics degraded | Avg. degradation | Overall change
Hindi | IndicSOUNDEX | 3 | 8% | 1 | 5% | -0.07%*
Hindi | ISO-15919 | 3 | 4% | 4 | 9% | +1.08%

4.1 Results for a high-resource language - Hindi

Our Hindi data consisted of data from 22 topics, including Music, Reminders, Alarms, Weather, Search, and News. See the Appendix for the full list of topics. Table 4 shows the performance on Hindi. Results revealed that, out of the 22 topics, the use of IndicSOUNDEX resulted in 3 topics showing improvement in SemER, with an average improvement of 8%, and 1 topic showing a 5.36% degradation. The use of ISO-15919 resulted in 3 topics showing an average improvement of 4% and 4 topics showing an average degradation of 9%. The rest of the topics showed results that are not statistically significant. We note that two topics showed improvement across both models: SmartHome and Global.

Our results show that there is no overall (i.e., across all topics together) statistically significant improvement for the IndicSOUNDEX model. However, we note that the improvements in three topics, Global, SmartHome, and ToDoLists, are significant. The one topic that showed a significant degradation was Shopping.

On the other hand, the ISO-15919 model shows an overall 1.08% degradation that is statistically significant. The ISO-15919 model shows statistically significant improvements in the Global, Video, and SmartHome topics, and degradation in the Weather, Music, QuestionAndAnswer, and Media topics.

In summary, for a high-resource language such as Hindi, we find that neither IndicSOUNDEX nor ISO-15919 shows an overall significant improvement. However, there are certain topics that could benefit from using these alternate representations, either for IC or NER. Note that the majority of the training data for Hindi were transcribed by well-trained internal human transcribers and went through some cleaning processes for common transcription errors. Also, given the size of the training data, the NLU models were well trained on the various spelling variations represented in the original script. Owing to this relatively high consistency in transcription and the existence of various tokens with similar meaning and sound in the training data, we believe that using the alternate representations of text was not effective for improving the performance of the NLU model.

4.2 Results for low-resource languages - Tamil, Marathi, and Telugu

Unlike the case of Hindi, we see much more significant overall improvements where training data are sparse. All three languages showed significant improvement in overall performance for the ISO-15919 model, whereas IndicSOUNDEX showed significant improvement for Tamil and Marathi. Within Tamil and Marathi, IndicSOUNDEX showed a larger improvement than ISO-15919.

Our data for the low-resource languages consisted of 19 different topics. Table 5 shows the performance for each language. At the topic level for Tamil, we found that using the IndicSOUNDEX model improved the performance in 15 out of 19 topics, with an average SemER drop of 11%. With the ISO-15919 representation, 13 out of 19 topics showed improvement, with an average SemER drop of 13%. For Marathi, IndicSOUNDEX improved 7 topics with an average drop in SemER of 18%, whereas ISO-15919 improved 9 topics but with a lower average drop in SemER (~10%). For Telugu, using IndicSOUNDEX or ISO-15919 improved the performance of 4 topics each, with the same average drop in SemER of 15%. There were two topics that showed improvement across all three languages with the IndicSOUNDEX model: MovieTimes and Media. Furthermore, the Calendar and Music topics showed significant improvement across all three languages with the ISO-15919 model.

Table 5: Comparison of ISO-15919 and IndicSOUNDEX model performance w.r.t. baseline on low-resource languages. '+' indicates degradation, '-' indicates improvement. '*' denotes results that are not statistically significant.

Language | Model | Topics improved | Avg. improvement | Topics degraded | Avg. degradation | Overall change
Tamil | IndicSOUNDEX | 15 | 11% | 1 | 7% | -7.6%
Tamil | ISO-15919 | 13 | 13% | 2 | 2% | -6.2%
Marathi | IndicSOUNDEX | 7 | 18% | 6 | 8% | -2.30%
Marathi | ISO-15919 | 9 | 10% | 6 | 11% | -1.50%
Telugu | IndicSOUNDEX | 4 | 15% | 5 | 7% | -0.20%*
Telugu | ISO-15919 | 4 | 15% | 4 | 8% | -2.4%

Table 6: Percentage change in SemER for candidate models w.r.t. the baseline model across languages. '+' indicates degradation, '-' indicates improvement.

Language | % Change with IndicSOUNDEX | % Change with ISO-15919
Hindi | -0.07% | +1.08%
Tamil | -7.6% | -6.2%
Marathi | -2.3% | -1.5%
Telugu | -0.2% | -2.4%

In Table 6, we provide the relative change in performance w.r.t. the baseline for all the models across the high- and low-resource languages.

5 Conclusion

In this work, we explored the effect of using alternate representations of text for IC and NER models on a high-resource Indic language (Hindi) and three low-resource Indic languages (Tamil, Telugu, Marathi). We adopted a neural network based model architecture, where the alternate representations are incorporated in the embedding layers.

Based on the performance analysis over the baseline models, we saw that the alternate representations, while helping specific topics, do not help as much for the high-resource language overall. This is possibly due to the relatively high consistency in transcription and the existence of various tokens with similar meaning and sound in the training data. However, they helped significantly boost performance in the low-resource languages, thus creating potential applications in bootstrapping new languages quickly and cost-effectively.

In the case of the low-resource languages, we saw significant improvements overall when using either ISO-15919 or IndicSOUNDEX. This suggests that smaller datasets stand to benefit from the use of alternative representations. Based on the Tamil and Marathi results, where IndicSOUNDEX performed better than both the baseline and ISO-15919, we can conclude that the mitigation of the impact of multiple different scripts and inconsistent Latin usage accomplished by IndicSOUNDEX seems to produce better results for our NLU models. The lack of impact of IndicSOUNDEX for Telugu merits further investigation.

Our future work includes exploring different model architectures, including transformer models, and exploring these representations further by pre-training models with a combination of original tokens and alternate representations of tokens. We also plan to explore the use of character bigrams (of both the original text and the alternate representations of text) instead of unigram characters in the embedding layers.

6 Acknowledgments

We would like to thank Karolina Owczarzak for her support and invaluable discussions on this work, and Wael Hamza for useful discussions. We would also like to thank our extended teammates for helpful discussions and the anonymous reviewers for valuable feedback.

References

Beider, A. and Morse, S. P. (2010). Phonetic matching: A better Soundex. Association of Professional Genealogists Quarterly.

Chaware, S. M. and Rao, S. (2012). Analysis of phonetic matching approaches for Indic languages.

Geetha, P., Chandu, K., and Black, A. W. (2018). Tackling code-switched NER: Participation of CMU. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 126–131.

Johnson, A., Karanasou, P., Gaspers, J., and Klakow, D. (2019). Cross-lingual transfer learning for Japanese named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pages 182–189.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Shah, R. and Singh, D. K. (2014). Improvement of Soundex algorithm for Indian languages based on phonetic matching. International Journal of Computer Science, Engineering and Applications, 4(3).

Su, C., Gupta, R., Ananthakrishnan, S., and Matsoukas, S. (2018). A re-ranker scheme for integrating large scale NLU models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 670–676.

Zhang, X. and LeCun, Y. (2017). Which encoding is the best for text classification in Chinese, English, Japanese and Korean? CoRR.

Appendix A: Results across different topics on all languages

Table 7: Results for Hindi. Relative change in performance w.r.t. baseline, averaged over ten bootstrapped test sets. '+' indicates degradation, '-' indicates improvement.

Topic | % Change (IndicSOUNDEX) | Significant? | % Change (ISO-15919) | Significant?
Reservations | -1.64% | NO | -0.65% | NO
Books | -0.43% | NO | -5.18% | NO
Calendar | -7.59% | NO | -7.05% | NO
CallingAndMessaging | -4.36% | NO | -0.08% | NO
News | +2.78% | NO | +0.92% | NO
Photos | -6.85% | NO | -12.16% | NO
Media | +7.79% | NO | +11.13% | YES
Global | -2.89% | YES | -4.09% | YES
Help | +6.52% | NO | +5.76% | NO
SmartHome | -5.58% | YES | -4.32% | YES
QuestionAndAnswer | +3.10% | NO | +4.99% | YES
Search | -3.98% | NO | +3.78% | NO
Music | -0.45% | NO | +2.18% | YES
Notifications | +2.74% | NO | -2.49% | NO
OriginalContent | -3.66% | NO | -1.63% | NO
Recipes | -0.44% | NO | -0.39% | NO
Shopping | +5.36% | YES | -0.82% | NO
Sports | 0% | NO | 0% | NO
ToDoLists | -16.50% | YES | +1.08% | NO
Video | -2.73% | NO | -4% | YES
Weather | -3.89% | NO | +16.36% | YES
Overall | -0.07% | NO | +1.08% | YES

Table 8: Results for Tamil. Relative change in performance w.r.t. baseline, averaged over ten bootstrapped test sets. '+' indicates degradation, '-' indicates improvement.

Topic | % Change (IndicSOUNDEX) | Significant? | % Change (ISO-15919) | Significant?
Books | -14.06% | YES | -18.28% | YES
Calendar | -8.52% | YES | -11.93% | YES
MovieTimes | -7.91% | YES | -7.24% | NO
CallingAndMessaging | -12.44% | YES | -6.38% | YES
News | -15.50% | YES | -6.14% | YES
Media | -12.23% | YES | 0% | NO
Global | -4.20% | YES | -4.41% | YES
Help | -20.84% | YES | -17.72% | YES
SmartHome | -7.29% | YES | +0.05% | YES
Search | +7.15% | YES | +3.08% | NO
Music | -8.03% | YES | -8.08% | YES
Notifications | -10.98% | YES | -6.81% | YES
OriginalContent | -9.92% | YES | -23.21% | YES
Shopping | -16.22% | YES | -13.44% | YES
Sports | +8.01% | NO | -30.10% | YES
ToDoLists | -2.14% | NO | +3.03% | YES
Video | -4.01% | YES | -14.65% | YES
Weather | -10.23% | YES | -10.91% | YES
Overall | -7.56% | YES | -6.23% | YES

Table 9: Results for Marathi. Relative change in performance w.r.t. baseline, averaged over ten bootstrapped test sets. '+' indicates degradation, '-' indicates improvement.

Topic | % Change (IndicSOUNDEX) | Significant? | % Change (ISO-15919) | Significant?
Books | -6.95% | YES | -11.81% | YES
Calendar | +9.28% | YES | -0.11% | YES
MovieTimes | -13.74% | YES | -11.03% | YES
CallingAndMessaging | +3.09% | YES | +4.97% | YES
News | +16.76% | YES | +1.90% | YES
Media | -24.50% | YES | -12.84% | YES
Global | +0.13% | NO | -2.81% | YES
Help | -8.51% | YES | -0.76% | NO
SmartHome | -14.36% | YES | -5.04% | YES
Search | -0.63% | NO | -1.70% | YES
Music | -0.45% | NO | -4.66% | YES
Notifications | +2.49% | YES | +5.50% | YES
OriginalContent | -22.84% | YES | +8.10% | YES
Shopping | +2.97% | NO | +12.17% | YES
Sports | -38.48% | YES | -35.97% | YES
ToDoLists | +0.70% | NO | -0.23% | NO
Video | +5.25% | YES | +1.59% | NO
Weather | +12.29% | YES | +32.18% | YES
Overall | -2.29% | YES | -1.47% | YES

Table 10: Results for Telugu. Relative change in performance w.r.t. baseline, averaged over ten bootstrapped test sets. '+' indicates degradation, '-' indicates improvement.

Topic | % Change (IndicSOUNDEX) | Significant? | % Change (ISO-15919) | Significant?
Books | +2.42% | NO | +0.37% | NO
Calendar | +4.51% | YES | -10.33% | YES
MovieTimes | -13.57% | YES | -8.83% | YES
CallingAndMessaging | +5.41% | YES | +3.79% | YES
News | +0.97% | NO | +15.24% | YES
Media | -13.91% | YES | -0.61% | NO
Global | +2.93% | YES | +3.57% | NO
Help | +12.35% | YES | -0.45% | NO
SmartHome | -0.15% | NO | +0.31% | NO
Search | -7.41% | YES | +2.90% | YES
Music | +1.02% | NO | -4.55% | YES
Notifications | +1.33% | NO | -4.15% | NO
OriginalContent | -3.94% | NO | +2.10% | NO
Shopping | +0.55% | NO | +1.59% | NO
Sports | -4.43% | NO | -14.31% | NO
ToDoLists | +10.44% | YES | +10.05% | YES
Video | -0.72% | NO | -0.97% | NO
Weather | -25.98% | YES | -35.37% | YES
Overall | -0.21% | NO | -2.42% | YES

On Incorporating Structural Information to improve Dialogue Response Generation

Nikita Moghe*1, Priyesh Vijayan2, Balaraman Ravindran3,4, and Mitesh M. Khapra3,4

1School of Informatics, University of Edinburgh
2School of Computer Science, McGill University and Mila
3Indian Institute of Technology Madras
4Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI), Indian Institute of Technology Madras
[email protected] [email protected]

Abstract

We consider the task of generating dialogue responses from background knowledge comprising domain-specific resources. Specifically, given a conversation around a movie, the task is to generate the next response based on background knowledge about the movie, such as the plot, review, Reddit comments, etc. This requires capturing structural, sequential, and semantic information from the conversation context and background resources. We propose a new architecture that uses the ability of BERT to capture deep contextualized representations in conjunction with explicit structure and sequence information. More specifically, we use (i) Graph Convolutional Networks (GCNs) to capture structural information, (ii) LSTMs to capture sequential information, and (iii) BERT for the deep contextualized representations that capture semantic information. We analyze the proposed architecture extensively. To this end, we propose a plug-and-play Semantics-Sequences-Structures (SSS) framework which allows us to effectively combine such linguistic information. Through a series of experiments, we make some interesting observations. First, we observe that the popular adaptation of the GCN model for NLP tasks, where structural information (GCNs) is added on top of sequential information (LSTMs), performs poorly on our task. This leads us to explore interesting ways of combining semantic and structural information to improve performance. Second, we observe that while BERT already outperforms other deep contextualized representations such as ELMo, it still benefits from the additional structural information explicitly added using GCNs. This is a bit surprising given the recent claims that BERT already captures structural information. Lastly, the proposed SSS framework gives an improvement of 7.95% BLEU score over the baseline.

1 Introduction

Neural conversation systems that treat dialogue response generation as a sequence generation task (Vinyals and Le, 2015) often produce generic and incoherent responses (Shao et al., 2017). The primary reason for this is that, unlike humans, such systems do not have any access to background knowledge about the topic of conversation. For example, while chatting about movies, we use our background knowledge about the movie in the form of plot details, reviews, and comments that we might have read. To enrich such neural conversation systems, some recent works (Moghe et al., 2018; Dinan et al., 2019; Zhou et al., 2018) incorporate external knowledge in the form of documents which are relevant to the current conversation. For example, Moghe et al. (2018) released a dataset containing conversations about movies where every alternate utterance is extracted from a background document about the movie. This background document contains plot details, reviews, and Reddit comments about the movie. The focus thus shifts from sequence generation to identifying relevant snippets from the background document and modifying them suitably to form an appropriate response given the current conversational context.

Intuitively, any model for this task should exploit semantic, structural, and sequential information from the conversation context and the background document. For illustration, consider the chat shown in Figure 1 from the Holl-E movie conversations dataset (Moghe et al., 2018). In this example, Speaker 1 nudges Speaker 2 to talk about how James's wife was irritated because of his career. The right response to this conversation comes from the line beginning at "His wife Mae . . . ". However, to generate this response, it is essential to understand that (i) His refers to James from the previous sentence; (ii) quit boxing is a contiguous phrase; and (iii) quit and he would stop mean the same.

* The work was done by Nikita and Priyesh at the Indian Institute of Technology Madras.


Source Doc: ... At this point James Braddock (Russel Crowe) was a light heavyweight boxer, who was forced to retired from the ring after breaking his hand in his last fight. His wife Mae had prayed for years that he would quit boxing, before becoming permanently injured. ...

Conversation:

Speaker 1 (N): Yes very true, this is a real rags to riches story. Russell Crowe was excellent as usual.

Speaker 2 (R): Russell Crowe owns the character of James Bradock, the unlikely hero who makes the most of his second chance. He's a good fighter turned hack.

Speaker 1 (N): Totally! Oh by the way do you remember his wife ... how she wished he would stop

Speaker 2 (P): His wife Mae had prayed for years that he would quit boxing, before becoming permanently injured.

Figure 1: Sample conversation from the Holl-E dataset. The text in bold in the first block is the background document which is used to generate the last utterance in this conversation. N, P, and R correspond to the type of background knowledge used: None, Plot, and Review, as per the dataset definitions. For simplicity, we show only a few of the edges for the background knowledge at the bottom. The edge in blue corresponds to the co-reference edge, the edges in green are dependency edges, and the edge in red is the entity edge.

We need to exploit (i) structural information, such as the co-reference edge between His and James, (ii) the sequential information in quit boxing, and (iii) the semantic similarity (or synonymy relation) between quit and he would stop. To capture such multi-faceted information from the document and the conversation context, we propose a new architecture that combines BERT with explicit sequence and structure information. We start with the deep contextualized word representations learnt by BERT, which capture distributional semantics. We then enrich these representations with sequential information by allowing the words to interact with each other by passing them through a bidirectional LSTM, as is the standard practice in many NLP tasks. Lastly, we add explicit structural information in the form of dependency graphs, co-reference graphs, and entity co-occurrence graphs. To allow interactions between words related through such structures, we use GCNs, which essentially aggregate information from the neighborhood of a word in the graph.

Of course, combining BERT with LSTMs in itself is not new and has been tried in the original work (Devlin et al., 2019) for the task of Named Entity Recognition. Similarly, Bastings et al. (2017) combine LSTMs with GCNs for the task of machine translation. To the best of our knowledge, this is the first work that combines BERT with explicit structural information. We investigate several interesting questions in the context of dialogue response generation. For example:

1. Are BERT-based models best suited for this task?

2. Should BERT representations be enriched with sequential information first or structural information?

3. Are dependency graph structures more important for this task or entity co-occurrence graphs?

4. Given the recent claims that BERT captures syntactic information, does it help to explicitly enrich it with syntactic information using GCNs?

To systematically investigate such questions, we propose a simple plug-and-play Semantics-Sequences-Structures (SSS) framework which allows us to combine different semantic representations (GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), ELMo (Peters et al., 2018a)) with different structural priors (dependency graphs, co-reference graphs, etc.). It also allows us to use different ways of combining structural and sequential information, e.g., LSTM first followed by GCN or vice versa, or both in parallel. Using this framework, we perform a series of experiments on the Holl-E dataset and make some interesting observations. First, we observe that the conventional adaptation of GCNs for NLP tasks, where contextualized embeddings obtained through LSTMs are fed as input to a GCN, exhibits poor performance. To overcome this, we propose some simple alternatives and show that they lead to better performance.

Second, we observe that while BERT performs better than GloVe and ELMo, it still benefits from explicit structural information captured by GCNs. We find this interesting because some recent works (Tenney et al., 2019; Jawahar et al., 2019; Hewitt and Manning, 2019) suggest that BERT captures syntactic information, but our results suggest that there is still more information to be captured by adding explicit structural priors. Third, we observe that certain graph structures are more useful for this task than others. Lastly, our best model, which uses a specific combination of semantic, sequential, and structural information, improves over the baseline by 7.95% on the BLEU score.

2 Related work

There is an active interest in using external knowledge to improve the informativeness of responses for goal-oriented as well as chit-chat conversations (Lowe et al., 2015; Ghazvininejad et al., 2018; Moghe et al., 2018; Dinan et al., 2019). Even the teams participating in the annual Alexa Prize competition (Ram et al., 2017) have benefited from using several knowledge resources. This external knowledge can be in the form of knowledge graphs or unstructured texts such as documents.

Many NLP systems, including conversation systems, use RNNs as their basic building block, which typically captures n-gram or sequential information. Adding structural information through tree-based structures (Tai et al., 2015) or graph-based structures (Marcheggiani and Titov, 2017) on top of this has shown improved results on several tasks. For example, GCNs have been used to improve neural machine translation (Marcheggiani et al., 2018) by exploiting the semantic structure of the source sentence. Similarly, GCNs have been used with dependency graphs to incorporate structural information for semantic role labelling (Marcheggiani and Titov, 2017) and neural machine translation (Bastings et al., 2017), to incorporate entity relation information in question answering (De Cao et al., 2019), and to incorporate temporal information for neural dating of documents (Vashishth et al., 2018).

There have been advances in learning deep contextualized word representations (Peters et al., 2018b; Devlin et al., 2019) with the hope that such representations will implicitly learn structural and relational information through the interaction between words at multiple layers (Jawahar et al., 2019; Peters et al., 2018c). These recent developments have led to many interesting questions about the best way of exploiting rich information from sentences and documents. We try to answer some of these questions in the context of background aware dialogue response generation.

3 Background

In this section, we provide a background on how GCNs have been leveraged in NLP to incorporate different linguistic structures.

The Syntactic-GCN proposed in Marcheggiani and Titov (2017) is a GCN (Kipf and Welling, 2017) variant which can model multiple edge types and edge directions. It can also dynamically determine the importance of an edge. It works with only one graph structure at a time, the most popular structure being the dependency graph of a sentence. For convenience, we refer to Syntactic-GCNs as GCNs from here on.

Let G denote a graph defined on a text sequence (sentence, passage, or document), with words as nodes and edges representing directed relations between words. Let N denote a dictionary of lists of neighbors, with N(v) referring to the neighbors of a specific node v, including itself (self-loop). Let dir(u, v) ∈ {in, out, self} denote the direction of the edge (u, v). Let L be the set of different edge types, and let L(u, v) ∈ L denote the label of the edge (u, v). The (k+1)-hop representation of a node v is computed as

    h_v^(k+1) = σ( Σ_{u ∈ N(v)} g_(u,v) ( W_dir(u,v)^(k) h_u^(k) + b_L(u,v)^(k) ) )    (1)

where σ is the activation function, g_(u,v) ∈ R is the predicted importance of the edge (u, v), and h_v is node v's embedding. W_dir(u,v) ∈ {W_in, W_out, W_self} depending on the direction dir(u, v), with W_in, W_self, W_out ∈ R^(m×m). The importance of an edge, g_(u,v), is determined by an edge gating mechanism w.r.t. the node of interest, u, as given below:

    g_(u,v) = sigmoid( h_u · W_dir(u,v) + b_L(u,v) )    (2)

In summary, a GCN computes a new representation of a node v by aggregating information from its neighborhood N(v). When k = 0, the aggregation happens only from immediate neighbors, i.e., 1-hop neighbors. As the value of k increases, the aggregation implicitly happens from a larger neighborhood.
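For concreteness, here is a hedged PyTorch sketch of one such gated GCN hop (Eqns. 1 and 2). The edge-list format, the use of ReLU for σ, and the gate parameterization (one bias per direction rather than per edge label) are our assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    """One Syntactic-GCN hop: direction-specific weights W_dir, per-label
    biases b_L, and a scalar sigmoid gate per edge (Eqn. 2)."""
    def __init__(self, dim: int, n_labels: int):
        super().__init__()
        dirs = ("in", "out", "self")
        self.W = nn.ModuleDict({d: nn.Linear(dim, dim, bias=False) for d in dirs})
        self.b = nn.Embedding(n_labels, dim)                 # b_{L(u,v)}
        self.gate = nn.ModuleDict({d: nn.Linear(dim, 1) for d in dirs})

    def forward(self, h, edges):
        # h: (n_nodes, dim); edges: (u, v, direction, label_id) tuples,
        # including a "self" self-loop edge for every node.
        out = torch.zeros_like(h)
        for u, v, d, lbl in edges:
            msg = self.W[d](h[u]) + self.b.weight[lbl]       # W_dir h_u + b_L
            g = torch.sigmoid(self.gate[d](h[u]))            # edge gate, Eqn. 2
            out[v] = out[v] + g * msg
        return torch.relu(out)                               # sigma = ReLU (assumed)
```

A real implementation would vectorize the edge loop with scatter-add operations; the loop is kept here only to mirror the summation in Eqn. 1.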

4 Proposed Model

Given a document D and a conversational context Q, the task is to generate the response y = y1, y2, ..., ym. This can be modeled as the problem of finding a y that maximizes the probability P(y | D, Q), which can be further decomposed as

    y = argmax_y ∏_{t=1}^{m} P(y_t | y_1, ..., y_{t-1}, Q, D)

As has become standard practice in most NLG tasks, we model the above probability using a neural network comprising an encoder, a decoder, an attention mechanism, and a copy mechanism. The copy mechanism essentially helps to directly copy words from the document D instead of predicting them from the vocabulary. Our main contribution is in improving the document encoder, where we use a plug-and-play framework to combine semantic, structural, and sequential information from different sources. This enriched document encoder could be coupled with any existing model. In this work, we couple it with the popular Get To The Point (GTTP) model (See et al., 2017), as used by the authors of the Holl-E dataset. In other words, we use the same attention mechanism, decoder, and copy mechanism as GTTP but augment it with an enriched document encoder. Below, we first describe the document encoder and then very briefly describe the other components of the model. We also refer the reader to the supplementary material for more details.

4.1 Encoder

Our encoder contains a semantics layer, a sequential layer, and a structural layer to compute a representation for the document words, a sequence w1, w2, ..., wm. We refer to this as a plug-and-play document encoder simply because it allows us to plug in different semantic representations, different graph structures, and different simple but effective mechanisms for combining structural and semantic information.

Semantics Layer: Similar to almost all NLP models, we capture semantic information using word embeddings. In particular, we utilize the ability of BERT to capture deep contextualized representations and later combine it with explicit structural information. This allows us to evaluate (i) whether BERT is better suited for this task as compared to other embeddings such as ELMo and GloVe, and (ii) whether BERT already captures syntactic information completely (as claimed by recent works) or whether it can benefit from additional syntactic information, as described below.

Structure Layer: To capture structural information, we propose multi-graph GCN (M-GCN), a simple extension of GCN to extract relevant multi-hop multi-relational dependencies from multiple structures/graphs efficiently. In particular, we generalize G to denote a labelled multi-graph, i.e., a graph which can contain multiple (parallel) labelled edges between the same pair of nodes. Let R denote the set of different graphs (structures) considered, and let G = {N_1, N_2, ..., N_|R|} be a set of dictionaries of neighbors from the |R| graphs. We extend the Syntactic GCN defined in Eqn. 1 to multiple graphs by having |R| graph convolutions at each layer, as given in Eqn. 3. Here, g_conv(·) is the graph convolution defined in Eqn. 1 with σ as the identity function. Further, we remove the individual node (or word) i from the neighbourhood list N(i) and model the node information separately using the W_self parameter.

    h_i^(k+1) = ReLU( h_i^(k) W_self^(k) + Σ_{N ∈ G} g_conv(N) )    (3)

This formulation is advantageous over having |R| different GCNs, as it can extract information from multi-hop pathways and can use information across different graphs with every GCN layer (hop). Note that h_i^(0) is the embedding obtained for word i from the semantics layer. For ease of notation, we use the following functional form to represent the final representation computed by M-GCN after k hops, starting from the initial representation h_i^(0), given G:

    h_i = M-GCN(h_i^(0), G, k)

Sequence Layer: The purpose of this layer is to capture sequential information. Once again, following standard practice, we pass the word representations computed by the previous layer through a bidirectional LSTM to compute a sequence contextualized representation for each word. As described in the next subsection, depending upon the manner in which we combine these layers, the previous layer could be either the structure layer or the semantics layer.
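A compact sketch of the M-GCN computation in Eqn. 3 follows. It assumes per-hop parameters and one per-graph convolution callable per hop, for example the gated layer sketched in Section 3 with its final nonlinearity removed (since the paper sets σ to identity inside g_conv); the data layout is our assumption.

```python
import torch

def m_gcn(h0, graphs, layers):
    # h0: (n_words, dim) embeddings from the semantics layer.
    # graphs: one edge list per structure (e.g., dependency / coref / entity).
    # layers: one (w_self, convs) pair per hop, where w_self is a linear map
    #         and convs holds one convolution callable per graph.
    h = h0
    for w_self, convs in layers:                  # one iteration per hop k
        agg = w_self(h)                           # h_i^(k) W_self^(k)
        for conv, edges in zip(convs, graphs):    # sum of g_conv(N) over N in G
            agg = agg + conv(h, edges)
        h = torch.relu(agg)                       # Eqn. 3
    return h
```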

Figure 2: The SSS framework. The Word Embeddings include GloVe, ELMo, and BERT. Seq-GCN obtains an LSTM representation first, which is then passed through an M-GCN module. In Str-LSTM, we compute the M-GCN representation first, which is then passed to an LSTM layer, while Par-GCN-LSTM computes both LSTM and M-GCN representations independently and then combines them into a final representation.

4.2 Combining structural and sequential information

As mentioned earlier, for a given document D containing words w_1, w_2, w_3, ..., w_m, we first obtain word representations x_1, x_2, x_3, ..., x_m using BERT (or ELMo or GloVe). At this point, we have three different choices for enriching the representations using structural and sequential information: (i) structure first followed by sequence, (ii) sequence first followed by structure, or (iii) structure and sequence in parallel. We depict these three choices pictorially in Figure 2 and describe them below with appropriate names for future reference. Please note that the choice of "Seq" denotes the sequential nature of LSTMs while "Str" denotes the structural nature of GCNs. Though we use a specific variant of GCN, described as M-GCN in the previous section, any other variant of GCN can be used in the "Str" layer.

4.2.1 Sequence contextualized GCN (Seq-GCN)

Seq-GCN is similar to the model proposed in (Bastings et al., 2017; Marcheggiani and Titov, 2017), where the word representations x_1, x_2, x_3, ..., x_m are first fed through a BiLSTM to obtain sequence contextualized representations as shown below.

$$h_i^{seq} = \mathrm{BiLSTM}(h_{i-1}^{seq}, x_i)$$

These representations h_1, h_2, h_3, ..., h_m are then fed to the M-GCN along with the graph G to compute a k-hop aggregated representation as shown below:

$$h_i^{str} = \text{M-GCN}(h_i^{seq}, G, k)$$

The final representation $h_i^{final} = h_i^{str}$ for the i-th word thus combines semantic, sequential, and structural information in that order. This is a popular way of combining GCNs with LSTMs, but our experiments suggest that it does not work well for our task. We thus explore two other variants as explained below.

4.2.2 Structure contextualized LSTM (Str-LSTM)

Here, we first feed the word representations x_1, x_2, x_3, ..., x_m to M-GCN to obtain structure aware representations as shown below.

$$h_i^{str} = \text{M-GCN}(x_i, G, k)$$

These structure aware representations are then passed through a BiLSTM to capture sequence information as shown below:

$$h_i^{seq} = \mathrm{BiLSTM}(h_{i-1}^{seq}, h_i^{str})$$

The final representation $h_i^{final} = h_i^{seq}$ for the i-th word thus combines semantic, structural, and sequential information in that order.

4.2.3 Parallel GCN-LSTM (Par-GCN-LSTM)

Here, both M-GCN and BiLSTMs are fed with the word embeddings x_i as input to aggregate structural and sequential information independently as shown below:

$$h_i^{str} = \text{M-GCN}(x_i, G, k)$$
$$h_i^{seq} = \mathrm{BiLSTM}(h_{i-1}^{seq}, x_i)$$

The final representation, $h_i^{final}$, for each word is computed as $h_i^{final} = h_i^{str} + h_i^{seq}$ and combines structural and sequential information in parallel, as opposed to the serial combination in the previous two variants.
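The three variants differ only in how the two modules are composed. Below is a minimal sketch, reusing m_gcn/MGCNLayer from the earlier listing and assuming all layers are sized so the BiLSTM output and the M-GCN output share one dimension (so the parallel sum is well-defined); bilstm stands for an nn.LSTM(..., bidirectional=True, batch_first=True).

```python
def seq_gcn(x, bilstm, gcn_layers, adjs, k):
    # Seq-GCN: BiLSTM first, then k M-GCN hops on top of its output.
    h_seq, _ = bilstm(x.unsqueeze(0))              # (1, m, d)
    return m_gcn(h_seq.squeeze(0), adjs, gcn_layers, k)

def str_lstm(x, bilstm, gcn_layers, adjs, k):
    # Str-LSTM: k M-GCN hops first, then a BiLSTM over the structure-aware states.
    h_str = m_gcn(x, adjs, gcn_layers, k)
    h_seq, _ = bilstm(h_str.unsqueeze(0))
    return h_seq.squeeze(0)

def par_gcn_lstm(x, bilstm, gcn_layers, adjs, k):
    # Par-GCN-LSTM: both branches read the raw embeddings; outputs are summed.
    h_str = m_gcn(x, adjs, gcn_layers, k)
    h_seq, _ = bilstm(x.unsqueeze(0))
    return h_str + h_seq.squeeze(0)
```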

4.3 Decoder, Attention, and Copy Mechanism

Once the final representation for each word is computed, an attention weighted aggregation, c_t, of these representations is fed to the decoder at each time step t. The decoder itself is an LSTM which computes a new state vector s_t at every timestep t as

$$s_t = \mathrm{LSTM}(s_{t-1}, c_t)$$

The decoder then uses this s_t to compute a distribution over the vocabulary, where the probability of the i-th word in the vocabulary is given by $p_i = \mathrm{softmax}(V s_t + W c_t + b)_i$. In addition, the decoder also has a copy mechanism wherein, at every timestep t, it could either choose the word with the highest probability p_i or copy the word from the input which was assigned the highest attention weight at timestep t. Such a copying mechanism is useful in tasks such as ours where many words in the output are copied from the document D. We refer the reader to the GTTP paper for more details of the standard copy mechanism.

5 Experimental setup

In this section, we briefly describe the dataset and task setup, followed by the pre-processing steps we carried out to obtain different linguistic graph structures on this dataset. We then describe the different baseline models. Our code is available at: https://github.com/nikitacs16/horovod_gcn_pointer_generator

5.1 Dataset description

We evaluate our models using Holl-E, an English language movie conversation dataset (Moghe et al., 2018) which contains ∼9k movie chats and ∼90k utterances. Every chat in this dataset is associated with a specific background knowledge resource from among the plot of the movie, the review of the movie, comments about the movie, and occasionally a fact table. Every even utterance in the chat is generated by copying and/or modifying sentences from this unstructured background knowledge. The task here is to generate/retrieve a response using the conversation history and the appropriate background resources. Here, we focus only on the oracle setup where the correct resource from which the response was created is provided explicitly. We use the same train, test, and validation splits as provided by the authors of the paper.

5.2 Construction of linguistic graphs

We consider leveraging three different graph-based structures for this task. Specifically, we evaluate the popular syntactic word dependency graph (Dep-G), the entity co-reference graph (Coref-G), and the entity co-occurrence graph (Ent-G). Unlike the word dependency graph, the two entity-level graphs can capture dependencies that may span across sentences in a document. We use the dependency parser provided by SpaCy (https://spacy.io/) to obtain the dependency graph (Dep-G) for every sentence. For the construction of the co-reference graph (Coref-G), we use the NeuralCoref model (https://github.com/huggingface/neuralcoref) integrated with SpaCy. For the construction of the entity graph (Ent-G), we first perform named-entity recognition using SpaCy and connect all the entities that lie within a window of k = 20.
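The described pre-processing can be sketched with spaCy as follows; this is our reconstruction, not the released pipeline. The model name en_core_web_sm is an assumption (any English pipeline with a parser and NER works), and NeuralCoref must be added to the pipeline separately for the Coref-G part.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; neuralcoref added separately

def build_graphs(text, window=20):
    """Return Dep-G, Ent-G and Coref-G as lists of (i, j) token-index edges."""
    doc = nlp(text)
    # Dep-G: one edge per intra-sentence dependency arc.
    dep_g = [(tok.head.i, tok.i) for tok in doc if tok.head.i != tok.i]
    # Ent-G: connect named entities whose head tokens are <= `window` apart.
    heads = [ent.root.i for ent in doc.ents]
    ent_g = [(a, b) for a in heads for b in heads if a < b and b - a <= window]
    # Coref-G: link every mention in a NeuralCoref cluster to the main mention.
    coref_g = []
    if doc.has_extension("coref_clusters") and doc._.coref_clusters:
        for cluster in doc._.coref_clusters:
            main = cluster.main.root.i
            coref_g += [(main, m.root.i) for m in cluster.mentions
                        if m.root.i != main]
    return dep_g, ent_g, coref_g
```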
5.3 Baselines

We categorize our baseline methods as follows:

Without Background Knowledge: We consider the simple Sequence-to-Sequence (S2S) (Vinyals and Le, 2015) architecture that conditions the response generation only on the previous utterance and completely ignores the other utterances as well as the background document. We also consider HRED (Serban et al., 2016), a hierarchical variant of the S2S architecture which conditions the response generation on the entire conversation history in addition to the last utterance. Of course, we do not expect these models to perform well as they completely ignore the background knowledge, but we include them for the sake of completeness.

With Background Knowledge: To the S2S architecture we add an LSTM encoder to encode the document.

The output is now conditioned on this representation in addition to the previous utterance. We refer to this architecture as S2S-D. Next, we use GTTP (See et al., 2017), which is a variant of the S2S-D architecture with a copy-or-generate decoder; at every time-step, the decoder decides to copy from the background knowledge or generate from the fixed vocabulary. We also report the performance of the BiRNN + GCN architecture that uses the dependency graph only, as discussed in (Marcheggiani and Titov, 2017). Finally, we note that in our task many words in the output need to be copied sequentially from the input background document, which makes it very similar to the task of span prediction as used in Question Answering. We thus also evaluate BiDAF (Seo et al., 2017), a popular question-answering architecture that extracts a span from the background knowledge as a response using complex attention mechanisms. For a fair comparison, we evaluate the spans retrieved by the model against the ground truth responses.

We use BLEU-4 and ROUGE (1/2/L) as the evaluation metrics, as suggested in the dataset paper. Using automatic metrics is more reliable in this setting than in the open domain conversational setting, as the variability in responses is limited to the information in the background document. We provide implementation details in Appendix A.

6 Results and Discussion

In Table 1, we compare our architecture against the baselines discussed above. SSS(BERT) is our proposed architecture in terms of the SSS framework. We report the best results within SSS, chosen across 108 configurations comprising four different graph combinations, three different contextual and structural infusion methods, three M-GCN layers, and three embeddings. The best model was chosen based on performance on the validation set.

Model        BLEU   ROUGE-1  ROUGE-2  ROUGE-L
S2S           4.63    26.91     9.34    21.58
HRED          5.23    24.55     7.61    18.87
S2S-D        11.71    26.36    13.36    21.96
GTTP         13.97    36.17    24.84    31.07
BiRNN+GCN    14.70    36.24    24.60    31.29
BiDAF        16.79    26.73    18.82    23.58
SSS(GloVe)   18.96    38.61    26.92    33.77
SSS(ELMo)    19.32    39.65    27.37    34.86
SSS(BERT)    22.78    40.09    27.83    35.20

Table 1: Results of automatic evaluation. The architectures within the SSS framework outperform the baseline methods, with our proposed architecture SSS(BERT) performing the best.

From Table 1, it is clear that incorporating structural and sequential information with BERT in the SSS encoder framework significantly outperforms all other models.

6.1 Qualitative Evaluation

We conducted a human evaluation of the SSS models from Table 1 against the generated responses of GTTP. We presented 100 randomly sampled outputs to three different annotators. The annotators were asked to pick from four options: A, B, both, and none. The annotators were told these were conversations between friends. Tallying the majority vote, we obtain win/loss/both/none for SSS(BERT) as 29/25/29/17, SSS(GloVe) as 24/17/47/12, and SSS(ELMo) as 22/23/41/14. This suggests a qualitative improvement using the SSS framework. We also provide some generated examples in Appendix B1. We found that the SSS framework had less confusion in generating the opening responses than the GTTP baseline. These "conversation starters" have a unique template for every opening scenario and thus have different syntactic structures. We hypothesize that the presence of dependency graphs over these respective sentences helps to alleviate the confusion, as seen in Example 1. The second example illustrates why incorporating structural information is important for this task. We also observed that the SSS encoder framework does not improve on aspects of human creativity such as diversity, initiating a context-switch, and commonsense reasoning, as seen in Example 3.

6.2 Ablation studies on the SSS framework

We report the component-wise results for the SSS framework in Table 2. The Sem models condition the response generation directly on the word embeddings. As expected, we observe that ELMo and BERT perform much better than GloVe embeddings.

The Sem+Seq models condition the decoder on the representation obtained after passing the word embeddings through the LSTM layer. These models outperform their respective Sem models. The gain with ELMo is not significant because the underlying architecture already has two BiLSTM layers which are already being fine-tuned for the task.

Hence the addition of one more LSTM layer may not contribute to learning any new sequential word information. It is clear from Table 2 that the SSS models, which use structure information as well, obtain a significant boost in performance, validating the need for incorporating all three types of information in the architecture.

Emb    Paradigm  BLEU   ROUGE-1  ROUGE-2  ROUGE-L
GloVe  Sem        4.40    29.72    11.72    22.99
       Sem+Seq   14.83    36.17    24.84    31.07
       SSS       18.96    38.61    26.92    33.77
ELMo   Sem       14.36    32.04    18.75    26.71
       Sem+Seq   14.61    35.54    24.58    30.71
       SSS       19.32    39.65    27.37    34.86
BERT   Sem       11.26    33.86    16.73    26.44
       Sem+Seq   18.49    37.85    25.32    32.58
       SSS       22.78    40.09    27.83    35.20

Table 2: Performance of components within the SSS framework. BERT based models outperform both ELMo and GloVe based architectures in the respective paradigms. Notably, adding all three levels of information (semantic, sequential, and structural) is useful.

6.3 Combining structural and sequential information

The response generation task of our dataset is a span based generation task where phrases of text are expected to be copied or generated as they are. Sequential information is thus crucial to reproduce these long phrases from the background knowledge. This is strongly reflected in Table 3, where Str-LSTM, which has the LSTM layer on top of GCN layers, performs the best across the hybrid architectures discussed in Figure 2. The Str-LSTM model can better capture sequential information with structurally and syntactically rich representations obtained through the initial GCN layer. The Par-GCN-LSTM model performs second best. However, in the parallel model, the LSTM cannot leverage the structural information directly and relies only on the word embeddings. The Seq-GCN model performs the worst among all three, as the GCN layer at the top is likely to modify the sequence information from the LSTMs.

6.4 Understanding the effect of structural priors

While a combination of intra-sentence and inter-sentence graphs is helpful across all the models, the best performing model with BERT embeddings relies only on the dependency graph. In the case of the GloVe based experiments, the entity and co-reference relations were not independently useful with the Str-LSTM and Par-GCN-LSTM models, but when used together gave a significant performance boost, especially for Str-LSTM. However, most of the BERT based and ELMo based models achieved competitive performance with the individual entity and co-reference graphs. There is no clear trend across the models. Hence, probing these embedding models is essential to identify which structural information is captured implicitly by the embeddings and which structural information needs to be added explicitly. For the quantitative results, please refer to Appendix B2.

6.5 Structural information in deep contextualised representations

Earlier work has suggested that deep contextualized representations capture syntax and co-reference relations (Peters et al., 2018c; Jawahar et al., 2019; Tenney et al., 2019; Hewitt and Manning, 2019). We revisit Table 2 and consider the Sem+Seq models with ELMo and BERT embeddings as two architectures that implicitly capture structural information. We observe that the SSS model using the simpler GloVe embedding outperforms the ELMo Sem+Seq model and performs slightly better than the BERT Sem+Seq model.

Given that the SSS models outperform the corresponding Sem+Seq models, the extent to which deep contextualized word representations implicitly learn syntax and other linguistic properties is questionable. This also calls for better loss functions for learning deep contextualized representations that can incorporate structural information explicitly.

More importantly, all the configurations of SSS(GloVe) have a smaller memory footprint than both the ELMo and BERT based models. Validation and training of the GloVe models require one-half, sometimes even one-fourth, of the computing resources. Thus, the simple addition of structural information through the GCN layer to the established Sequence-to-Sequence framework, which can perform comparably to stand-alone expensive models, is an important step towards Green AI (Schwartz et al., 2019).

Emb     Seq-GCN                      Str-LSTM                     Par-GCN-LSTM
        BLEU   R-1    R-2    R-L     BLEU   R-1    R-2    R-L     BLEU   R-1    R-2    R-L
GloVe   15.61  36.60  24.54  31.68   18.96  38.61  26.92  33.77   17.10  37.04  25.70  32.20
ELMo    18.44  37.92  26.62  33.05   19.32  39.65  27.37  34.86   16.35  37.28  25.67  32.12
BERT    20.43  40.04  26.94  34.85   22.78  40.09  27.83  35.20   21.32  39.90  27.60  34.87

Table 3: Performance of the different hybrid architectures for combining structural information with sequence information (R-1/R-2/R-L denote ROUGE-1/2/L). We observe that using structural information followed by sequential information, Str-LSTM, provides the best results.

7 Conclusion and Future Work

We demonstrated the usefulness of incorporating structural information for the task of background aware dialogue response generation. We infused the structural information explicitly into the standard semantic+sequential model and observed a performance boost. We studied different structural linguistic priors and different ways to combine sequential and structural information. We also observe that explicit incorporation of structural information helps the richer deep contextualized representation based architectures. The framework provided in this work is generic and can be applied to other background aware dialogue datasets and several tasks such as summarization and question answering. We believe that the analysis presented in this work can serve as a blueprint for analyzing future work on GCNs, ensuring that the gains reported are robust and evaluated across different configurations.

Acknowledgements

We would like to thank the anonymous reviewers for their useful comments and suggestions. We would like to thank the Department of Computer Science and Engineering, and the Robert Bosch Center for Data Sciences and Artificial Intelligence, IIT Madras (RBC-DSAI), for providing us with adequate resources. Lastly, we extend our gratitude to the volunteers who participated in the human evaluation experiments.

References

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima'an. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1957–1967, Copenhagen, Denmark. Association for Computational Linguistics.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5110–5117.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Ryan Lowe, Nissan Pow, Iulian Serban, Laurent Charlin, and Joelle Pineau. 2015. Incorporating unstructured textual knowledge sources into neural dialogue systems. In Neural Information Processing Systems Workshop on Machine Learning for Spoken Language Understanding.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 486–492, New Orleans, Louisiana. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515, Copenhagen, Denmark. Association for Computational Linguistics.

Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018c. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, and Art Pettigrue. 2017. Conversational AI: The science behind the Alexa Prize. Alexa Prize Proceedings.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784.

Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2210–2219, Copenhagen, Denmark. Association for Computational Linguistics.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. 2018. Dating documents using graph convolution networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1605–1615, Melbourne, Australia. Association for Computational Linguistics.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. In ICML Deep Learning Workshop.

Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.

A Implementation Details

A.1 Base Model

The authors of Moghe et al. (2018) adapted the architecture of Get to the Point (See et al., 2017) for the background aware dialogue response generation task. In the summarization task, the input is a document and the output is a summary, whereas in our case the input is a {resource/document, context} pair and the output is a response. Note that the context includes the previous two utterances (dialog history) and the current utterance. Since, in both tasks, the output is a sequence (summary vs. response), we do not need to change the decoder (i.e., we can use the decoder from the original model as it is). However, we need to change the input fed to the decoder. We use an RNN to compute a representation of the conversation history. Specifically, we consider the previous k utterances as a single sequence of words and feed these to an RNN. Let M be the total length of the context (i.e., all the k utterances taken together); then the RNN computes representations $h_1^d, h_2^d, \ldots, h_M^d$ for all the words in the context. The final representation of the context is then the attention weighted sum of these word representations:

$$f_i^t = v^T \tanh(W_c h_i^d + V s_t + b_d)$$
$$m^t = \mathrm{softmax}(f^t) \qquad (4)$$
$$d_t = \sum_i m_i^t h_i^d$$

Similar to the original model, we use an RNN to compute the representation of the document. Let N be the length of the document; then the RNN computes representations $h_1^r, h_2^r, \ldots, h_N^r$ for all the words in the resource (we use the superscript r to indicate resource). We then compute the query aware resource representation as follows:

$$e_i^t = v^T \tanh(W_r h_i^r + U s_t + V d_t + b_r)$$
$$a^t = \mathrm{softmax}(e^t) \qquad (5)$$
$$c_t = \sum_i a_i^t h_i^r$$

where $c_t$ is the attended context representation. Thus, at every decoder time-step, the attention on the document words is also based on the currently attended context representation.

The decoder then uses $r_t$ (the document representation) and $s_t$ (the decoder's internal state) to compute a probability distribution over the vocabulary, $P_{vocab}$. In addition, the model also computes $p_{gen}$, which indicates that there is a probability $p_{gen}$ that the next word will be generated and a probability $(1 - p_{gen})$ that the next word will be copied. We use the following modified equation to compute $p_{gen}$:

$$p_{gen} = \sigma(w_r^T r_t + w_s^T s_t + w_x^T x_t + b_g) \qquad (6)$$

where $x_t$ is the previous word predicted by the decoder and fed as input to the decoder at the current time step. Similarly, $s_t$ is the current state of the decoder computed using this input $x_t$. The final probability of a word w is then computed using a combination of two distributions, viz., $P_{vocab}$ as described above and the attention weights assigned to the document words, as shown below:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t \qquad (7)$$

where $a_i^t$ are the attention weights assigned to every word in the document as computed in Equation 5. Thus, effectively, the model could learn to copy a word i if $p_{gen}$ is low and $a_i^t$ is high. This is the baseline with respect to the LSTM architecture (Sem + Seq). For the GCN based encoders, $h_i^r$ is the final outcome after the desired GCN/LSTM configuration.
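Eqns. 4-7 map onto one decoder step roughly as follows. This is an illustrative sketch only: shapes are our own, the paper overloads v and V across Eqns. 4 and 5 (disambiguated here as v4/V4 and v5/V5), and the copy mass returned for the document positions still has to be scatter-added onto the vocabulary indices of the document words to realize Eqn. 7.

```python
import torch

def gttp_step(h_d, h_r, s_t, x_t, p, vocab_proj):
    """One decoder step of the modified GTTP model (Eqns. 4-7).
    h_d: (M, dim) context word states; h_r: (N, dim) resource word states;
    s_t: (dim,) decoder state; x_t: (dim,) embedding of the previous word;
    p: dict of learned parameters; vocab_proj: linear map to vocab logits."""
    # Eqn. 4: attention over the conversation context.
    f = torch.tanh(h_d @ p["W_c"] + s_t @ p["V4"] + p["b_d"]) @ p["v4"]
    m = torch.softmax(f, dim=0)
    d_t = m @ h_d
    # Eqn. 5: context-aware attention over the resource/document.
    e = torch.tanh(h_r @ p["W_r"] + s_t @ p["U"] + d_t @ p["V5"] + p["b_r"]) @ p["v5"]
    a = torch.softmax(e, dim=0)
    c_t = a @ h_r                                   # r_t, the attended document rep.
    # Eqn. 6: probability of generating rather than copying.
    p_gen = torch.sigmoid(p["w_r"] @ c_t + p["w_s"] @ s_t + p["w_x"] @ x_t + p["b_g"])
    # Eqn. 7 (per-source-word form): generation part and copy part.
    p_vocab = torch.softmax(vocab_proj(torch.cat([s_t, c_t])), dim=0)
    return p_gen * p_vocab, (1.0 - p_gen) * a
```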
Since, in both t probability of a word w is then computed using a the tasks, the output is a sequence (summary v/s combination of two distributions, viz.,(P ) as response) we don’t need to change the decoder (i.e., vocab described above and the attention weights assigned we can use the decoder from the original model as to the document words as shown below it is). However, we need to change the input fed to the decoder. We use an RNN to compute a repre- t P (w) = pgenPvocab(w)+(1 pgen) ai (7) sentation of the conversation history. Specifically, − i:wi=w we consider the previous k utterances as a single X t sequence of words and feed these to an RNN. Let where ai are the attention weights assigned to every M be the total length of the context (i.e., all the k word in the document as computed in Equation5. utterances taken together) then the RNN computes Thus, effectively, the model could learn to copy d d d t representations h1, h2, ..., hM for all the words in a word i if pgen is low and ai is high. This is 21 the baseline with respect to the LSTM architecture is XYZ” where XYZ was the respective crowd- r (Sem + Seq). For, GCN based encoders, the hi sourced phrase. The presence of dependency is the final outcome after the desired GCN/LSTM graphs over the respective sentences may help to configuration. alleviate the confusion. Now consider the example under Hannibal in A.2 Hyperparameters Table4. We find that the presence of a co-reference We selected the hyper-parameters using the valida- graph between “Anthony Hopkins” in the first sen- tion set. We used Adam optimizer with a learning tence and “he” in the second sentence can help rate of 0.0004 and a batch size of 64. We used in continuing the conversation on the actor “An- GloVe embeddings of size 100. For the RNN-based thony Hopkins”. Moreover, connecting tokens encoders and decoders, we used LSTMs with a hid- in “Anthony Hopkins” to refer to “he” in the sec- den state of size 256. We used gradient clipping ond sentence is possible because of the explicit with a maximum gradient norm of 2. We used a entity-entity connection between the two tokens. hidden state of size 512 for Seq-GCN and 128 for However, this is applicable only to SSS(GloVe) the remaining GCN-based encoders. We ran all the and SSS(ELMo) as their best performing versions experiments for 15 epochs and we used the check- use these graphs along with the dependency graph point with the least validation loss for testing. For while the best performing SSS(BERT) only uses models using ELMo embeddings, a learning rate dependency graph and may have learnt the inter- of 0.004 was most effective. For the BERT-based sentence relations implicitly. models, a learning rate of 0.0004 was suitable. Rest There is a limited diversity of responses gener- of the hyper-parameters and other setup details re- ated by the SSS framework as it often resorts to the main the same for experiments with BERT and patterns seen during training while it is not copying ELMo. Our work follows a task specific architec- from the background knowledge. We also iden- ture as described in the previous section. Following tify that SSS framework cannot handle the cases the definitions in (Peters et al., 2019), we use the where Speaker2 initiates a context switch, i,e; when “feature extraction” setup for both ELMo and BERT Speaker2 introduces a topic that has not been dis- based models. cussed in the conversation so far. 
B Extended Results

B.1 Qualitative examples

We illustrate different scenarios from the dataset to identify the strengths and weaknesses of our models under the SSS framework in Table 4. We compare the outputs from the best performing model for each of the three embeddings and use GTTP as our baseline. The best performing combination of sequential and structural information for all three models in the SSS framework is Str-LSTM. The best performing SSS(GloVe) and SSS(ELMo) architectures use all three graphs, while SSS(BERT) uses only the dependency graph.

We find that the SSS framework improves over the baseline for the cases of opening statements (see Example 1). The baseline had confusion in picking opening statements and often mixed the responses for "Which is your favorite character?", "Which is your favorite scene?", and "What do you think about the movie?". The responses to these questions have different syntactic structures: "My favorite character is XYZ", "I liked the one in which XYZ", and "I think this movie is XYZ", where XYZ was the respective crowd-sourced phrase. The presence of dependency graphs over the respective sentences may help to alleviate the confusion.

Now consider the example under Hannibal in Table 4. We find that the presence of a co-reference graph between "Anthony Hopkins" in the first sentence and "he" in the second sentence can help in continuing the conversation on the actor "Anthony Hopkins". Moreover, connecting the tokens in "Anthony Hopkins" to "he" in the second sentence is possible because of the explicit entity-entity connection between the two tokens. However, this is applicable only to SSS(GloVe) and SSS(ELMo), as their best performing versions use these graphs along with the dependency graph, while the best performing SSS(BERT) uses only the dependency graph and may have learnt the inter-sentence relations implicitly.

There is limited diversity in the responses generated by the SSS framework, as it often resorts to the patterns seen during training when it is not copying from the background knowledge. We also identify that the SSS framework cannot handle the cases where Speaker2 initiates a context switch, i.e., when Speaker2 introduces a topic that has not been discussed in the conversation so far. In the chat on The Road Warrior in Table 4, we find that Mad Max: Fury Road has been used to initiate a discussion that compares the themes of both movies. All the models produce irrelevant responses.

B.2 Quantitative results

We explore the effect of using different graphs in Table 5.

Example 1: Rocky V
  Resource: Rocky V, terribly underrated. I liked the one in which Rocky .... My favorite character was Adrian because she was ...
  Chat: Speaker 1: Which is your favourite character in this?
  Ground Truth: my favorite character was adrian because she was strong and did her best to keep her family together through tough times.
  Baseline: i liked the one in which rocky and gunn fight in the street.
  SSS(GloVe): my favorite character was adrian because she was strong and did her best to keep her together through tough times.
  SSS(ELMo): my favorite character was adrian because she was strong and did her best to keep her family together through tough times.
  SSS(BERT): my favorite character was adrian because she was strong and did her best to keep her family together through tough times.

Example 2: Hannibal
  Resource: Anthony Hopkins gave .... However, the material he was given to work .... almost seemed like an excuse to present us ... I think it was enjoyable, though slightly less than th.
  Chat: Speaker 1: What is your opinion about the movie? / Speaker 2: I think it was a fair sequel. / Speaker 1: Anthony Hopkins gave an impeccable performance.
  Ground Truth: however, the material he was given to work with was not as good as silence of the lambs.
  Baseline: in fairness, perhaps there was no way it could be.
  SSS(GloVe): however, the material he was given to work with was not as good as silence of the lambs.
  SSS(ELMo): the material he was given to work with was not as good as silence of the lambs.
  SSS(BERT): the material he was given to work with was not as good as silence of the lambs.

Example 3: The Road Warrior
  Resource: Box Office: $9,003,011. Similar Movies: Lethal Weapon, Mad Max: Fury Road ...
  Chat: Speaker 1: Which is your favourite character in this? / Speaker 2: My favorite character was Dog. / Speaker 1: I liked the feral kid even after he's found in the car and Max evicts him without ceremony he doesn't give up.
  Ground Truth: Have you seen the new Mad Max: Fury Road?
  Baseline: lethal weapon.
  SSS(GloVe): it made $9,003,011
  SSS(ELMo): [UNK] only one man can make the difference in the future.
  SSS(BERT): yes. [UNK] only one man can make the difference in the future.

Table 4: Sample outputs from the SSS framework compared with baseline and ground truth responses.

Graph           GloVe                        ELMo                         BERT
                BLEU   R-1    R-2    R-L     BLEU   R-1    R-2    R-L     BLEU   R-1    R-2    R-L
Dep             16.79  37.77  25.89  32.88   17.00  37.56  26.14  32.77   22.78  40.09  27.83  35.20
Dep+Ent         14.44  35.14  24.61  30.43   18.34  39.55  28.00  34.76   19.33  39.37  27.52  34.33
Dep+Coref       16.58  37.60  25.72  32.63   18.56  40.08  28.42  35.06   20.99  40.10  28.66  35.11
Dep+Ent+Coref   18.96  38.61  26.92  33.77   19.32  39.65  27.37  34.86   20.37  39.11  27.20  34.19

Table 5: Comparing the performance of different structural priors across different semantic information on the Str-LSTM architecture (R-1/R-2/R-L denote ROUGE-1/2/L).

CopyBERT: A Unified Approach to Question Generation with Self-Attention

Stalin Varanasi, Saadullah Amin, Günter Neumann

German Research Center for Artificial Intelligence (DFKI)
Multilinguality and Language Technology Lab
{stalin.varanasi,saadullah.amin,guenter.neumann}@dfki.de

Abstract

Contextualized word embeddings provide better initialization for neural networks that deal with various natural language understanding (NLU) tasks, including question answering (QA) and, more recently, question generation (QG). Apart from providing meaningful word representations, pre-trained transformer models such as BERT also provide self-attentions which encode syntactic information that can be probed for dependency parsing and POS-tagging. In this paper, we show that the information from the self-attentions of BERT is useful for language modeling of questions conditioned on paragraph and answer phrases. To control the attention span, we use a semi-diagonal mask and utilize a shared model for encoding and decoding, unlike sequence-to-sequence. We further employ a copy mechanism over self-attentions to achieve state-of-the-art results for question generation on the SQuAD dataset.

Figure 1: CopyBERT architecture for conditional question generation: Given a sequence of length n, with question tokens $\{q_i\}_{i=1}^{Q}$, paragraph tokens $\{p_i\}_{i=1}^{P}$ with answer phrase $\{a_i\}_{i=1}^{A}$ and semi-diagonal mask M (§3.2), the model explicitly uses H multi-headed self-attention matrices from L layers of transformers to create $\mathcal{A} \in \mathbb{R}^{n \times n \times L \times H}$. This matrix, along with $S \in \mathbb{R}^{n \times L \times H}$ obtained from the BERT sequence output $H \in \mathbb{R}^{n \times h}$, is used to learn the copy probability $p_c(q_i \mid .)$ (§3.3.2). Finally, a weighted combination $p(q_i \mid .)$ is obtained with the simple generation probability $p_g(q_i \mid .)$ (§3.4).

1 Introduction

Automatic question generation (QG) is the task of generating meaningful questions from text. With more question answering (QA) datasets like SQuAD (Rajpurkar et al., 2016) having been released recently (Trischler et al., 2016; Choi et al., 2018; Reddy et al., 2019; Yang et al., 2018), there has been an increased interest in QG, as these datasets can be used not only for creating QA models but also for QG models.

QG, similar to QA, gives an indication of a machine's ability to comprehend natural language text. Both QA and QG are used by conversational agents. A QG system can be used in the creation of artificial question answering datasets, which in turn helps QA (Duan et al., 2017). It can specifically be used in conversational agents for starting a conversation or drawing attention to specific information (Mostafazadeh et al., 2016). Yao et al. (2012) and Nouri et al. (2011) use QG to create and augment conversational characters. In a similar approach, Kuyten et al. (2012) create a virtual instructor to explain clinical documents. In this paper, we propose a QG model with the following contributions:

• We introduce a copy mechanism for BERT-based models with a unified encoder-decoder framework for question generation. We further extend this copy mechanism using self-attentions.

• Without losing performance, we improve the speed of training BERT-based language models by choosing predictions on output embeddings that are offset by one position.


2 Related Work

Most of the QG models that use neural networks rely on a sequence-to-sequence architecture where a paragraph and an answer are encoded appropriately before decoding the question. Sun et al. (2018) use an answer-position aware attention to enrich the encoded input representation. Recently, Liu et al. (2019) showed that learning to predict clue words based on answer words helps in creating a better QG system. With similar motivation, gated self-networks were used by Zhao et al. (2018) to fuse appropriate information from the paragraph before generating a question. More recently, the self-attentions of a transformer have been shown to perform answer-agnostic question generation (Scialom et al., 2019).

The pre-training task of masked language modeling for BERT (Devlin et al., 2019) and other such models (Joshi et al., 2019) makes them suitable for natural language generation tasks. Wang and Cho (2019) argue that BERT can be used as a generative model. However, only a few attempts have been made so far to make use of these pre-trained models for conditional language modeling. Dong et al. (2019) and Chan and Fan (2019) use a single BERT model for both encoding and decoding and achieve strong results on QG.

3 Model

In the sequence-to-sequence learning framework, a separate encoder and decoder model is used. Such an application to BERT will lead to high computational complexity. To alleviate this, we use a shared model for encoding and decoding (Dong et al., 2019). This not only leads to a reduced number of parameters but also allows for cross attentions between source and target words in each layer of the transformer model. While such an architecture can be used in any conditional natural language generation task, here we apply it for QG.

3.1 Question Generation

For a sequence of paragraph tokens $P = [p_1, p_2, \ldots, p_P]$, start and end positions of an answer phrase $s_a = (a_s, a_e)$ in the paragraph, and question tokens $Q = [q_1, q_2, \ldots, q_Q]$, with $p_1 = bop$, $p_P = eop$ and $q_Q = eoq$ representing begin of paragraph, end of paragraph, and end of question respectively, the task of question generation is to maximize the likelihood of Q given P and $s_a$. To this end, with m such training examples, we maximize the following objective:

$$\max \sum_{j=1}^{m} \sum_{i=1}^{n} \log p\big(q_i^{(j)} \mid q_{<i}^{(j)}, P^{(j)}, s_a^{(j)}\big)$$
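This objective reduces to token-level cross-entropy over the question positions of the packed [paragraph; question] sequence. Below is a minimal sketch under our own naming; the one-position offset corresponds to the training speed-up mentioned in the introduction, where position i-1 predicts token i.

```python
import torch
import torch.nn.functional as F

def qg_loss(logits, token_ids, question_mask):
    """Negative log-likelihood of the question tokens only (Sec. 3.1).
    logits: (n, V) next-token scores from the shared BERT encoder-decoder
    run under the semi-diagonal mask; token_ids: (n,) the packed
    [paragraph; question] ids; question_mask: (n,) bool, True at
    question positions. Shapes and names are illustrative."""
    shifted = logits[:-1]            # position i-1 scores token i
    targets = token_ids[1:]
    mask = question_mask[1:].float()
    nll = F.cross_entropy(shifted, targets, reduction="none")
    return (nll * mask).sum() / mask.sum()
```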

3.2 Semi-Diagonal Mask

Each word in the paragraph attends to all the words in the paragraph but to none in the question, and each word in the question only attends to previous words in the question in addition to all the words in the paragraph. This results in a semi-diagonal mask, which is also proposed by Dong et al. (2019) and shown in Figure 1.

Formally, from S in §3.1, we have $I_p = [1, 2, \ldots, P]$ as the sequence of paragraph indices and $I_q = [P+1, P+2, \ldots, P+Q]$ as the sequence of question indices, with $n = P + Q$ (ignoring the pad tokens). The semi-diagonal mask $M \in \mathbb{R}^{n \times n}$ is defined as:

$$M_{i,j} = \begin{cases} -\infty, & (i \in I_p \wedge j \in I_q) \vee (i \in I_q \wedge j > i) \\ 1, & \text{else} \end{cases}$$

3.3 Copy Mechanism

Pre-trained transformer models not only yield better contextual word embeddings but also give informative self-attentions (Hewitt and Manning, 2019; Reif et al., 2019). We explicitly make use of these pre-trained self-attentions in our QG models. This also matches our motivation for using the copy mechanism (Gu et al., 2016) with BERT, as the self-attentions can be used to obtain attention probabilities over the input paragraph text, which are necessary for the copy mechanism.

For the input sequence S with the semi-diagonal mask $M \in \mathbb{R}^{n \times n}$ and segment ids D, we first encode with BERT(S, M, D) to obtain hidden representations of the sequence $H = \{h_i\}_{i=1}^{n} \in \mathbb{R}^{n \times h}$. We then define the copy probability $p_c(y_i \mid .) := p_c(y_i \mid q_{<i}, P, s_a)$ as:

where $W_n \in \mathbb{R}^{h \times h}$ is a parameter matrix.

3.3.2 Self-Copy

In a transformer architecture (Vaswani et al., 2017), if there are L layers and H attention heads at each layer, there will be $M = L \times H$ self-attention matrices of size $n \times n$. For example, in the BERT-Large model (Devlin et al., 2019), there would be $24 \times 16 = 384$ such matrices. Each of these self-attention matrices carries unique information. In this method for the copy mechanism, called self-copy, we obtain $P_a$ as a weighted average of all these self-attentions.

At each time step, we obtain a probability score for each of the M self-attention matrices in $\mathcal{A} \in \mathbb{R}^{n \times n \times M}$, signifying their corresponding importance. Given a parameter matrix $W_a \in \mathbb{R}^{h \times M}$, we obtain:

$$S = \mathrm{softmax}(H W_a) \in \mathbb{R}^{n \times M}$$
$$\tilde{P}_a = [\tilde{S}_1 \mathcal{A}_1^T; \ldots; \tilde{S}_n \mathcal{A}_n^T] \in \mathbb{R}^{n \times 1 \times n}$$

where $\tilde{S} \in \mathbb{R}^{n \times 1 \times M}$ is a 3D tensor obtained by adding dimension 2 to S, and $\mathcal{A}^T \in \mathbb{R}^{n \times M \times n}$ is the transposed tensor of the 3D self-attention matrices $\mathcal{A}$. $\tilde{S}_i \in \mathbb{R}^{1 \times M}$ and $\mathcal{A}_i^T \in \mathbb{R}^{M \times n}$ are the i-th slices of the tensors $\tilde{S}$ and $\mathcal{A}^T$. The final attention probabilities $P_a$ are obtained by removing dimension 2 from $\tilde{P}_a$. Thus, the final attention probabilities are obtained as a weighted average over all self-attention matrices.

3.3.3 Two-Hop Self-Copy
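Both the semi-diagonal mask and the self-copy average are easy to state in code. The sketch below follows the definitions above under our own naming; note that the mask's "allowed" entries are written as 1 following the text, whereas additive attention masks conventionally use 0 for allowed positions.

```python
import torch

def semi_diagonal_mask(P, Q):
    """M in R^{n x n} for P paragraph tokens followed by Q question tokens."""
    n = P + Q
    M = torch.ones(n, n)
    M[:P, P:] = float("-inf")            # paragraph never attends to question
    future = torch.triu(torch.ones(Q, Q, dtype=torch.bool), diagonal=1)
    M[P:, P:][future] = float("-inf")    # question attends only to the past
    return M

def self_copy_attention(H, A, W_a):
    """P_a as a weighted average of the M self-attention maps.
    H: (n, h) BERT sequence output; A: (n, n, M) stacked self-attentions
    (layers x heads flattened into M); W_a: (h, M)."""
    S = torch.softmax(H @ W_a, dim=-1)   # (n, M): per-position map weights
    # P_a[i, j] = sum_m S[i, m] * A[i, j, m], i.e. S_i @ A_i^T for each row i.
    return torch.einsum("im,ijm->ij", S, A)   # (n, n)
```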