Gunrock: Building A Human-Like Social Bot By Leveraging Large Scale Real User Data

Chun-Yen Chen∗, Dian Yu†, Weiming Wen‡, Yi Mang Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse, Austin Chau, Antara Bhowmick, Shreenath Iyer, Giritheja Sreenivasulu, Runxiang Cheng, Ashwin Bhandare, Zhou Yu§

Department of Computer Science, University of California, Davis, Davis, CA 95616

∗[email protected] †[email protected] ‡[email protected] §[email protected]

Abstract

Gunrock is a social bot designed to engage users in open-domain conversations. We improved our bot iteratively using large-scale user interaction data to make it more capable and human-like. Our system engaged in over 40,000 conversations during the semi-finals period of the 2018 Alexa Prize. We developed a context-aware hierarchical dialog manager to handle a wide variety of user behaviors, such as topic switching and question answering. In addition, we designed a robust three-step natural language understanding module, which includes techniques such as sentence segmentation and automatic speech recognition (ASR) error correction. Furthermore, we improved the human-likeness of the system by adding prosodic speech synthesis. As a result of these contributions and large-scale user interaction analysis, we achieved an average score of 3.62 on a 1-5 Likert scale as of Aug 14th, an average of 22.14 turns per conversation, and an average conversation duration of 5.22 minutes.

1 Introduction

One considerable challenge in dialog system research is training and testing dialog systems with a large number of users. To address this, most researchers have previously trained and evaluated their systems with simulated users or paid subjects on crowd-sourcing platforms [27]. However, systems trained and evaluated in such a manner can yield precarious results when directly deployed as a real product. The Alexa Prize provided a platform to attract a large number of volunteer users with real intent to interact with social conversational systems. We obtained on average more than 500 conversations per day over the course of the 45-day evaluation period. In total, we collected 487,314 conversation turns throughout the entire development period, with collection ending on Aug 14th. Anyone in the US who has an Alexa-powered device, such as an Amazon Echo, can interact with our system. Our system needed to handle a large pool of diverse users. As humans are accustomed to the communication patterns of one another, most users would likely transfer their human-human communicative behavioral patterns and expectations to interactions with a system. For example, while users quickly learned that Microsoft Cortana (a personal assistant) could not handle social content, 30% of the total user utterances addressing it consisted of social content [10]. Therefore, one possible way to improve conversational system performance is to imitate


human communicative behaviors. We propose an open-domain social bot, Gunrock, that imitates natural human-human conversation, covers a wide variety of social topics, and can converse in depth on specific and popular subjects. We made a number of contributions in open-domain spoken language understanding, dialog management, and language generation.

The two major challenges in open-domain spoken language understanding are 1) ASR errors and 2) entity ambiguity. We designed a novel three-phase natural language understanding (NLU) pipeline to address these obstacles. Users can utter several sentences in one turn, but ASR decodes the speech without providing the punctuation found in textual input. Our NLU therefore first breaks the complex input down into smaller segments to reduce the difficulty of understanding. It then applies various NLP techniques to these segments to extract information including named entities, dialog intent, and sentiment. Finally, it leverages context and phonetic information to resolve coreference, ASR errors, and entity ambiguity.

We also designed a hierarchical stack-based dialog manager to handle the multitude of conversations users engage in. The dialog manager first makes a high-level decision on which topic (e.g., movies) the user requests, using information obtained from the NLU. The system then activates the domain-specific topic dialog module that handles that topic. Each topic dialog module has a pre-defined conversation flow that serves to engage users in a more detailed and comprehensive conversation. To accommodate various user behaviors and keep conversations coherent, the system can jump in and out of the flow to answer factual and personal questions at any time. Additionally, user intent switches are accommodated through tunnels created between the different domain-specific topic dialog modules.

The presentation of the system utterance is equally as important as its content. To create a more lively, human-like interaction, we created a library of prosodic effects, such as "aha", using Amazon's speech synthesis markup language (SSML). From user interviews, we found that people evaluated the system as sounding more natural with these prosodic effects and interjections.

2 Related Work

Task-oriented and open-domain dialog systems have been widely studied. The former operate in specific domains, such as restaurant reservation [2][7]. In the open domain, early chatbots such as Alice [26] aim to pass the Turing Test, while recent dialog systems such as Amazon Alexa focus on short-turn, question-answer types of interactions with users [20]. In comparison, social chatbots require in-depth communication skills with emotional support [21]. Gunrock utilizes state-of-the-art practices in both domains and emphasizes dynamic user conversations.

Many neural models [25] and reinforcement learning models [14] have been proposed for understanding and generation. With large datasets available, including Cornell Movie Dialogs [5] and Reddit5, these models improve dialog performance in an end-to-end approach. However, these methods suffer from incoherent and generic responses [29]. To solve those problems, some research has combined rule-based and end-to-end approaches [19]. Other relevant work leverages individual mini-skills and knowledge graphs [6]. In the 2017 Alexa Prize6, Sounding Board reported the highest weekly average feedback rating (one to five stars, indicating how likely the user is willing to talk to the system again) of 3.37 across all conversations [6]. This combination of approaches enhances user experience and prolongs conversations; however, such systems are not flexible in adapting to new domains and cannot robustly handle opinion-related requests. Our system takes full advantage of connected datasets across different domains and tunnels to transition seamlessly between topic dialog modules. We trained our NLU and natural language generation (NLG) models with those datasets in addition to data collected from users. These novel concepts contributed to our last-week rating of 3.62.

3 Architecture

We leveraged the Amazon Conversational Bot Toolkit (cobot) to build the system architecture.

5https://www.reddit.com/ 6https://developer.amazon.com/alexaprize/2017-alexa-prize

The toolkit provides a zero-effort scaling framework, allowing developers to focus on building a user-friendly bot. The event-driven system is implemented on top of an AWS Lambda function7 and is triggered when a user sends a request to the bot. The cobot infrastructure also has a state manager interface that stores both user data and dialog state information in DynamoDB8. We also utilized Redis9 and Amazon's newly released graph database, Neptune10, to build the internal system's knowledge base. In this section, we discuss each system component.

3.1 System Overview

[Figure 1 diagram: the Natural Language Understanding pipeline (segmentation, noun phrase extraction, intent classifier, context, sentiment, profanity check, NER, coreference, ASR correction, dialog act detection); the Dialog Manager with topic dialog modules (animals, movies, news, retrieval, etc.), EVI, backstory, feedback, and the knowledge base (Google Knowledge Graph, Concept Graph, Concept Net, data scraper on EC2); and the Natural Language Generation stack (template manager, profanity check, prosody, post processor) feeding TTS.]

Figure 1: Social Bot Framework

Figure 1 depicts the social bot dialog system framework. Amazon provides the user utterance through an ASR model at run time, and Amazon Alexa's Text-To-Speech (TTS) generates the spoken responses; our system mainly handles text input and output. Due to the possibility of long latency, we bypass some modules to generate responses in certain scenarios, such as low ASR confidence or profane or incomplete user input. For example, if we detect a low confidence score in the ASR results, we generate prompts that directly ask the user to repeat or clarify.

After going through the ASR, the user input is processed by multiple NLU components, such as the Amazon toolkit services and the dialog act detector. We discuss these in detail in Section 3.3. In both face-to-face and online contexts, people engage in conversations that include profane content. We use the Amazon Offensive Speech Classifier to detect offensive content. If content exhibits signs of profanity, we inform the user of the inappropriateness of the topic and recommend an alternate subject to continue the conversation.

Figure 1 shows that there are 12 components in the NLU. To balance the trade-offs between latency and dependencies, we use cobot's built-in natural language processing pipeline with a thread pool design. There are three steps in the NLU pipeline. First, the input utterance is segmented into multiple sentences, and then the noun phrases are detected. Finally, the noun phrases are further analyzed by several NLP components, as shown in Figure 1. In Section 3.3, we discuss the NLU components in detail.

In the Dialog Manager, the Intent Classifier is used to direct different user intents to corresponding topic dialog modules. They cover several specific topics, including movies, sports, animals, etc.

7http://aws.amazon.com/lambda 8https://aws.amazon.com/dynamodb/ 9https://redis.io 10https://aws.amazon.com/neptune/

Each topic dialog module has its own dialog flow, giving users the flexibility to have deeper conversations. All the topic dialog modules use Amazon's EVI service to respond to factual questions and the backstory module to answer questions associated with the bot's persona, such as "What is your favorite color?" In addition, we use Amazon EC2 instances11 to scrape data from different sources and store them in our knowledge base. All the information from the NLU, along with context information, is used to determine the appropriate topic dialog module, which calls the NLG to generate a response. The NLG system uses a template manager to centralize our system's response templates. To ensure the appropriateness of the response, we include a profanity checker to filter the content of the response. A post processor is also included in the NLG to modify the original topic dialog module response. Finally, we use Amazon SSML to format the prosody of the response. In the following subsections, we describe the major system blocks in detail.

3.2 Automatic Speech Recognition

Before user utterances pass through the NLU, our system preprocesses the input utterance based on the overall ASR confidence score and each word's confidence score in order to handle ASR errors. We define three ASR error responses based on the confidence score range; a minimal sketch of this routing logic follows the list:

• Critical Range: If the overall confidence score and each word's confidence score are below 0.1, the system directly interrupts the overall pipeline and asks the user to repeat or rephrase their utterance or request.
• Warning Range: If the overall confidence score is lower than 0.4 but not in the Critical Range, the utterance is allowed to pass through to ASR correction, which is discussed in Section 3.3.5.
• Safe Range: For all other cases, we define the result as safe and directly use the ASR output.
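Below is a minimal sketch of this three-range gating, assuming the thresholds above (0.1 and 0.4); the function and constant names are illustrative, not taken from our codebase.

```python
# A minimal sketch of the three-range ASR gating logic described above.
# Thresholds (0.1, 0.4) follow the text; names are illustrative.

CRITICAL, WARNING, SAFE = "critical", "warning", "safe"

def classify_asr_result(overall_score: float, word_scores: list[float]) -> str:
    """Map ASR confidence scores to one of the three handling ranges."""
    if overall_score < 0.1 and all(s < 0.1 for s in word_scores):
        return CRITICAL   # interrupt the pipeline and ask the user to repeat
    if overall_score < 0.4:
        return WARNING    # pass through to ASR correction (Section 3.3.5)
    return SAFE           # use the ASR hypothesis as-is
```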

We also handle unexpected intents from users, such as complaints or incomplete utterances. In these cases, we found that simply providing information in response can lead to a poor user experience. Therefore, our pre-processor in the ASR step detects the anomalous behavior and asks the user for clarification.

3.3 Natural Language Understanding

The Alexa Skills Kit (ASK) [12] provides topic classification, sentiment analysis, profanity checking, and NER detection for NLU. We add a sentence segmentation model to separate user input into semantic units and perform NLU on each semantic unit. After extracting noun phrases, we implemented named entity recognition, coreference resolution, ASR correction, and dialog act detection to support language understanding. We present the technical details of each model in the order of the NLU pipeline (Section 3.1).

3.3.1 Sentence Segmentation

To make the conversation more interesting, we ask more open-domain questions without restricting users to a fixed dialog flow. This type of question encourages users to talk more, which makes the user's request longer and more complex to handle. We therefore trained a segmentation model to break utterances into smaller segments with complete semantic meaning. We trained a sequence-to-sequence model [24] on the Cornell Movie-Quotes Corpus [5], which contains 304,713 turns of dialog, with 23,760 manually labeled utterances (29 July - 31 July) as a validation set. The data was pre-processed by adding a special token for a break in the sentence. The model was pre-trained with 300-dimension word embeddings from fastText [16] trained on Common Crawl (2 million word vectors). It uses a 2-layer bidirectional LSTM as the encoder and a 2-layer RNN decoder with input feeding and global attention. Because the pre-trained embeddings can generate similar but unintended words, we constrain the output to contain the same words as the input plus the special break tokens. This model converged in 30 epochs and reached an accuracy of 95.95% on 220 randomly chosen unique utterances, outperforming a pre-trained language model. For example, "Alexa that is cool what do you think of the Avengers" is segmented into "Alexa that is cool" and

11http://aws.amazon.com/ec2

"what do you think of the Avengers". Furthermore, in order to detect breaks more reliably, we label user data and align the user utterance transcript with the ASR output to make use of the relative break time between each word. Specifically, we maximize the probability p(x_i | x_1, ..., x_{i-1}, t_i / t̄), where x_i is the next word or a break symbol in the same order as the input, t_i is the time lapse between x_i and x_{i-1} as reported by the ASR, and t̄ is the average time lapse between two consecutive words. The major remaining problems caused by segmentation errors involve named entity recognition (NER) and incomplete input sentences. We plan to solve these problems and consider more context in future work.
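As an illustration of the timing feature t_i / t̄, the sketch below computes it from hypothetical per-word ASR timestamps; the data and function are ours, not part of the deployed system.

```python
# Sketch of the timing feature t_i / t-bar used by the segmentation model,
# computed from (hypothetical) per-word ASR start/end timestamps.

def timing_features(words, start_times, end_times):
    """Return t_i / t_bar for each word boundary, where t_i is the pause
    before word i and t_bar is the average pause in the utterance."""
    pauses = [start_times[i] - end_times[i - 1] for i in range(1, len(words))]
    t_bar = sum(pauses) / len(pauses) if pauses else 0.0
    return [p / t_bar if t_bar > 0 else 0.0 for p in pauses]

# Example: a long pause between "cool" and "what" suggests a sentence break.
words = ["alexa", "that", "is", "cool", "what", "do", "you", "think"]
starts = [0.00, 0.40, 0.65, 0.80, 1.90, 2.10, 2.25, 2.40]
ends   = [0.35, 0.60, 0.75, 1.20, 2.05, 2.20, 2.35, 2.70]
print(timing_features(words, starts, ends))
```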

3.3.2 Noun Phrase Extraction

We used the Stanford CoreNLP constituency parser [15] to extract noun phrases and local noun phrases (the leaf level of the parse tree) from the input sentence. We filtered out some stopwords (e.g., "it", "all") and treated the rest as the keywords for the other NLU modules and the module-selection strategy. For future work, we plan to use a dependency parser to identify the subject and object when there are multiple noun phrases.
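As a minimal illustration (not our production pipeline), the sketch below extracts NP subtrees from a constituency parse string; we use nltk here only to walk the tree, assuming the parse itself comes from a tool such as CoreNLP.

```python
# Illustration of extracting noun phrases from a constituency parse string.
from nltk import Tree

parse = "(ROOT (S (NP (PRP I)) (VP (VBP like) (NP (DT the) (NN movie)))))"
tree = Tree.fromstring(parse)

STOPWORDS = {"it", "all", "i"}  # illustrative stopword list

# Collect the leaves of every NP subtree, then drop stopword-only phrases.
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
keywords = [np for np in noun_phrases if np.lower() not in STOPWORDS]
print(keywords)  # ['the movie']
```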

3.3.3 Entity Recognition

NER tools such as Stanford CoreNLP [15] and spaCy [9] rely heavily on the letter case of the words in the sentence (e.g., capital letters), which is not available in ASR output. In addition, these tools cannot recognize general entities or provide accurate, detailed labels for an entity. To provide more information to the module-selection strategy and the different modules, we run three recognizers in parallel on the extracted noun phrases:

• Google Knowledge Graph12: We query noun phrases against the Google Knowledge Graph to generate a detailed description and confidence score, and cache the result in Redis. We also map the description to one of our modules. For example, the noun phrase "tomb raider" has the label "video game series" with a high confidence score, so we can map it to our game module. In addition, we extract multiple labels to disambiguate noun phrases (e.g., "tomb raider" can also be a movie); context is considered when several labels have a high confidence score.
• Microsoft Concept Graph13: We also use the Microsoft Concept Graph to categorize the noun phrases. Compared to the Google Knowledge Graph, it provides a more general category that is useful for assigning modules.
• ASR Correction: Apart from using knowledge graphs to obtain entities, we also use an ASR corrector, which is explained in more detail in Section 3.3.5. This is very important for homophones (words that sound the same but have different spellings). For instance, the user input "mama mia" from ASR is more likely referring to "Mamma Mia," the movie. Fuzzy search is more likely to hit a phrase with a similar spelling, but may not yield the correct option due to speech recognition errors. Using phonetics to find a match increases the accuracy of NER in specific domains.
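A hedged sketch of the first recognizer follows: the endpoint and response fields mirror the public Knowledge Graph Search API, while key handling, result limits, and the cache key format are illustrative.

```python
# Sketch: look a noun phrase up in the Google Knowledge Graph Search API and
# cache the (description, score) results in Redis, as described above.
import json
import requests
import redis

cache = redis.Redis()
KG_URL = "https://kgsearch.googleapis.com/v1/entities:search"

def kg_lookup(noun_phrase: str, api_key: str) -> list[dict]:
    cached = cache.get(f"kg:{noun_phrase}")
    if cached:
        return json.loads(cached)
    resp = requests.get(KG_URL, params={"query": noun_phrase,
                                        "key": api_key, "limit": 5}).json()
    results = [
        {"name": e["result"].get("name"),
         "description": e["result"].get("description"),  # e.g. "Video game series"
         "score": e.get("resultScore", 0.0)}
        for e in resp.get("itemListElement", [])
    ]
    cache.set(f"kg:{noun_phrase}", json.dumps(results))
    return results
```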

In addition, we take context into consideration for entity recognition. For example, we can detect "her" as a movie in Table 1 because of the movie-related question the system asked. As described in Section 3.4.1, the information received from these methods is combined for intent classification.

3.3.4 Coreference Resolution

The state-of-the-art models from Stanford CoreNLP and NeuralCoref14 are trained on non-conversational data and do not work well for anaphora resolution in dialog. By analyzing the data gathered from 2 July - 9 July, we identified words such as "more" and "one" that require coreference resolution. We replace such words by considering both the user utterance and the system response. Specifically, we store noun phrases from the request and the named entities with detailed descriptions from

12https://developers.google.com/knowledge-graph/ 13https://concept.research.microsoft.com/Home/Introduction 14https://github.com/huggingface/neuralcoref

the system response in user attributes. Depending on what the user refers to (e.g., person or event, male or female), we provide the corresponding coreference resolution. Depending on the request, we give priority to noun phrases from the user and named entities from our response, respectively. For future work, we plan to consider more context and train a model that generalizes beyond our selected word list with better-defined priorities.

3.3.5 ASR Correction

ASR errors have a large impact on NLU quality. ASK provides an overall ASR confidence score by incorporating both the confidence score for each word and a score generated by a language model. The overall score indicates how likely it is that the whole utterance was recognized correctly. However, there are two types of false positives in which a low confidence score triggers error handling even though no ASR error occurred. The first is when the mentioned word is not frequently seen in the training data, so the word receives a low weight. The second involves homophones that the ASR cannot capture even if the user repeats the request.

We used the double metaphone algorithm [18] to compare the noun phrases (ignoring the most frequent stopwords) mentioned by the user against a knowledge base. The knowledge base covers both the context and the domain (e.g., sports genres, movie titles, and game names). We stored the primary and secondary double metaphone codes of each word as keys, with the word as the value. We also added a tertiary code to words with certain patterns based on observations (e.g., the word "jalapeno" will have a tertiary code "HLPN" in addition to "JLPN" and "ALPN"). If the overall confidence from ASR is below a threshold (set to 0.4), we propose a candidate by matching the metaphone codes of the noun phrases to those of the knowledge base. For example, suppose we propose the topic "obscure holidays" and in the next turn receive the ASR input "let's talk about secure holiday". The primary code for "secure holiday" is "SKRLT" and the primary code for "obscure holidays" is "APSKRLTS". Since it is very likely that the ASR misses the beginning or ending of a phrase, we treat this as a match with relatively high confidence. Another example is "let's talk about the sport high ally" received from ASR. Because we know the context is sports, we map the code of "high ally" to the code of "jai alai" in the knowledge base (sports list).
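The matching step can be sketched as follows, assuming the third-party `metaphone` package for double metaphone codes; our tertiary-code extension and the prefix/suffix relaxation are omitted for brevity.

```python
# Sketch of phonetic matching against a domain knowledge base using double
# metaphone codes (via the third-party `metaphone` package).
from metaphone import doublemetaphone

SPORTS = ["jai alai", "ice hockey", "table tennis"]  # illustrative KB slice

# Index: metaphone code -> canonical phrase.
index = {}
for phrase in SPORTS:
    primary, secondary = doublemetaphone(phrase)
    for code in (primary, secondary):
        if code:
            index[code] = phrase

def correct(noun_phrase: str):
    """Propose a knowledge-base entry whose phonetic code matches."""
    primary, secondary = doublemetaphone(noun_phrase)
    for code in (primary, secondary):
        if code in index:
            return index[code]
    return None

print(correct("high ally"))  # -> 'jai alai' if the codes collide as intended
```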

3.3.6 Dialog Act Prediction

Each segmented sentence from the NLU is associated with a dialog act. The dialog act is the function of the utterance in the dialog given the context of the conversation, e.g., opinion or statement. We trained an LSTM and a CNN model to predict the dialog act. The former is a 2-layer bi-LSTM model with fastText pre-trained embeddings of size 300 and a hidden size of 500. The latter is a 2-layer CNN model, also with fastText pre-trained embeddings, with a kernel width of 3. Without enough annotations from our dialog logs, we turned to the Switchboard Dialog Act Corpus (SWDA) [23]. The SWDA dataset contains 205,000 utterances of open-domain telephone conversations labeled with 60 dialog act tags. To accommodate our use case, where the utterance arrives by turn, we carefully pre-processed the data down to 156,809 utterances and reduced the number of dialog act labels to 40. The most frequent ones are statement-non-opinion, statement-opinion, and yes-no-question. The two models are trained and evaluated on this dataset. The LSTM model achieves an accuracy of 85.60% on the validation set, while the CNN model achieves 85.25%. The target dimension is then reduced from 40 to 19 easily interpretable labels (see Appendix A), each with a confidence score. For instance, the utterance "awesome i like books why do you think the great gatsby is a great novel" is segmented as in Section 3.3.1, and the dialog acts are labeled as "awesome [appreciation] | i like books [opinion] | why do you think the great gatsby is a great novel [open question]". The LSTM outperformed the CNN model, with an accuracy of 83.77% on our validation set of 143 randomly chosen unique utterances. We anticipate that the majority of errors correlate with statement/opinion ambiguity ("I like the movie Avengers") and incomplete sentences.

The dialog act depends on contextual information, and optimal results are achieved by maximizing the conditional probability over the previous and current dialog segmentation units. We are optimizing our models with pre-trained ELMo embeddings [17] and a recurrent convolutional neural network model [13], and are still evaluating these results. We also plan to do multi-task learning on both sentence segmentation and dialog act prediction. Furthermore, we will combine the dialog act prediction

model with a language model to detect whether the utterance is uninterpretable and whether the response interrupts the user.
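For concreteness, here is a minimal PyTorch sketch of a 2-layer bi-LSTM dialog act classifier with the hyperparameters above (300-d embeddings, hidden size 500, 19 labels); data loading, fastText initialization, and training are omitted, and the vocabulary size is a placeholder.

```python
# Minimal sketch of the 2-layer bi-LSTM dialog act classifier described above.
import torch
import torch.nn as nn

class DialogActLSTM(nn.Module):
    def __init__(self, vocab_size, num_labels=19, emb_dim=300, hidden=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # init from fastText
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)              # (batch, seq, emb_dim)
        _, (h_n, _) = self.lstm(embedded)             # h_n: (layers*2, batch, hidden)
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # final fwd + bwd states
        return self.out(last)                         # label logits

model = DialogActLSTM(vocab_size=30000)
logits = model(torch.randint(0, 30000, (1, 12)))  # one 12-token segment
print(logits.argmax(dim=-1))
```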

3.3.7 Topic Expansion

We use ConceptNet [22] as a knowledge graph for entity expansion on the extracted noun phrases. During the early implementation stage, we frequently received feedback that the system switched topics abruptly. Apart from asking users to share more and storing that information in our database as a learning process, we can talk about similar topics. For example, if a user wants to talk about cars, we can retrieve from ConceptNet a list of different car types (such as a Volvo). After asking what type of car the user likes and commenting on the input according to its dialog act (e.g., whether the user is telling a story, giving an opinion, or asking a question), we can expand to Volvo and extract information from our knowledge base.
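A hedged sketch of this expansion against the public ConceptNet API follows; the query parameters and edge fields mirror the public API, and the example output is not guaranteed.

```python
# Sketch: retrieve things that are kinds of the user's topic (e.g., types of
# car) from the public ConceptNet API.
import requests

def expand_topic(concept: str, limit: int = 10) -> list[str]:
    """Return surface labels of ConceptNet nodes X such that 'X IsA concept'."""
    resp = requests.get("https://api.conceptnet.io/query",
                        params={"end": f"/c/en/{concept}",
                                "rel": "/r/IsA", "limit": limit}).json()
    return [edge["start"]["label"] for edge in resp.get("edges", [])]

print(expand_topic("car"))  # may include entries such as 'a volvo'
```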

3.4 Dialog Management

We created a two-level hierarchical dialog manager to handle users' conversations. The high-level system selects the best topic dialog module for each user request, leveraging the output from the NLU. The low-level system then activates this topic dialog module to generate a response.

3.4.1 High-Level System Dialog Management

The system first identifies user intent based on the NLU output; then, combined with each sub-module's feedback, the high-level dialog manager decides which sub-module should handle the user utterance.

Intent Classifier: We defined three levels of user intent based on the Common Alexa Prize Chats (CAPC) dataset, an anonymized human-bot conversation dataset collected during the 2017 Alexa Prize competition. We first handle system requests outside the social chat domain, such as "play music", "set the temperature", and "turn on the lights". Because our social bot cannot execute such tasks, our system explains to the user how to exit social mode so they can use Alexa's built-in functions.

Next, we detect topic intents from user utterances. We combine regular expressions, the Google Knowledge Graph, the Microsoft Concept Graph, and the Amazon Alexa Prize Topic Classifier to detect topics. We assign the highest priority to regular expressions, since they can easily be adjusted when misdetected utterances are found later. We then recognize topics from the Google Knowledge Graph, the Microsoft Concept Graph, and the Amazon Alexa Prize Topic Classifier sequentially. We also tune the confidence thresholds of these three detectors based on heuristics and the dialog data we collected. If we detect multiple topics in one utterance and one of them is the same as the last selected topic, we prefer to keep that topic. Otherwise, we pick the topic with the highest priority.

The last level consists of lexical intents. We use regular expressions to analyze user requests, such as whether the user is asking about the preference or opinion of our social bot. We designed lexical intents so that the topic dialog modules in Section 3.4.2 can choose different strategies.

Dialog Module Selector: Our dialog module selector first picks a topic dialog module responsible for the topic intent detected by our intent classifier. To keep our responses consistent, the selected module provides a signal called "propose_continue" to the system after generating a response. If it is set to "CONTINUE", we select this module for the user's next utterance. If it is set to "UNCLEAR", we select this module only when we cannot detect any other topic. If it is set to "STOP", meaning the module cannot handle the user's further requests, our system does not select this module for the next turn. In this case, the module should propose another module that may better handle the request. Otherwise, our system retrieves a special template that proposes one topic dialog module that has not yet been proposed or talked about, and concatenates it to the module's response. The priority of the modules to be proposed is tuned based on their everyday performance. The selector selects the proposed module once the user adopts our proposal in the next turn. A sketch of this selection logic appears at the end of this section.

Our dialog module selector also deals with some explicit, strong user intents specifically. We designed regular expression patterns to catch utterances with such intents and send them directly to the module

responsible for the requested topic. For example, if the utterance is "let's talk about movies", the dialog module selector selects the movie module immediately.
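The sketch below illustrates the selector's use of the "propose_continue" signal; the signal values mirror the text, while the function and its arguments are hypothetical simplifications.

```python
# Illustrative sketch of the "propose_continue" selection logic.

def select_module(detected_topics, last_module, last_signal, proposal_priority):
    """Pick the topic dialog module for the next turn."""
    if last_signal == "CONTINUE" and last_module:
        return last_module                      # keep the current module
    if detected_topics:
        if last_module in detected_topics:
            return last_module                  # prefer the ongoing topic
        return detected_topics[0]               # highest-priority new topic
    if last_signal == "UNCLEAR" and last_module:
        return last_module                      # no other topic detected
    # last_signal == "STOP" or nothing detected: propose a fresh module
    return proposal_priority[0]

nxt = select_module([], last_module="movie", last_signal="STOP",
                    proposal_priority=["animal", "sport"])
print(nxt)  # -> 'animal'
```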

3.4.2 Low-Level Dialog Management

Fact and Backstory Delivery: We use two services, Backstory and EVI, to answer questions about our chatbot's background and general factual questions, respectively.

• Backstory: A service designed to retrieve responses for questions related to our chatbot's background and preferences, such as "what's your favorite sport". We use Google's Universal Sentence Encoder [4] to embed both the user's question and our pre-defined questions. We then retrieve the answer corresponding to the pre-defined question with the smallest cosine distance to the user's question. For each question, the Backstory module also handles follow-up requests, such as "why do you like basketball?".
• EVI: A service provided by Amazon. It can answer factual questions such as "how old is Lebron James?". EVI returns "I don't have opinion on that" or "I don't know about that" if it does not have a corresponding answer. Also, since it sometimes returns links to Alexa skills directly, we post-process the result instead of returning it directly.
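A minimal sketch of the Backstory retrieval step is shown below, assuming the TF-Hub release of the Universal Sentence Encoder; the question-answer pairs are illustrative placeholders, not our curated persona.

```python
# Sketch: retrieve the persona answer whose pre-defined question is closest
# (in cosine similarity) to the user's question.
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

QA = {  # illustrative persona entries
    "what is your favorite sport": "I love watching basketball!",
    "what is your favorite color": "I like blue.",
}
questions = list(QA)
q_vecs = encoder(questions).numpy()

def backstory_answer(user_question: str) -> str:
    u = encoder([user_question]).numpy()[0]
    sims = q_vecs @ u / (np.linalg.norm(q_vecs, axis=1) * np.linalg.norm(u))
    return QA[questions[int(np.argmax(sims))]]

print(backstory_answer("what's your favorite sport"))
```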

Topic Dialog Modules: Each topic dialog module has its own dialog flow design. Here we list the topics the system covers:

• Animal: The animal module is designed to have a conversation about animals. A combination of responses retrieved using the Reddit API and hard-coded information is used to generate trivia about a variety of animals. Since the topic may be of interest to younger users, the responses pass through a stricter profanity filter. The module can also engage the user in casual chat about their pets, favorite animals, etc.
• Movie and Book: Both modules share a similar design. They are capable of detecting a wide selection of movies or books using the TMDB API15 and the Goodreads API16, respectively. They can also talk about facts for a specific movie or book using trivia retrieved from IMDB and the Reddit todayilearned subreddit, respectively. Finally, they can engage in chit chat about opinions and experiences using predefined responses.
• Music: The music module is designed to engage users in conversations about music. We have curated relevant content based on the artists present in Spotify's Million Playlist dataset17 and IMDB18.
• Sport: The sport module is designed to have a conversation with users about sports, and builds a hierarchical conversational experience. With this design, we can easily interleave between talking about a type of sport and discussing a specific sport-related entity in depth, based on the user's interest. We also leverage data from Reddit, Twitter Moments, and the News API to provide users with interesting and up-to-date content about trending events for different types of sports and popular sport entities, such as basketball stars.
• Game: The game module covers popular video games. It can discuss fun facts and trivia about different video games with users, or ask gaming-related questions. Most facts are manually selected from crawled online content to ensure response quality. Part of our knowledge grounding for games, such as game names, genres, publishers, and available platforms, is crawled from the IGDB website19. Based on the demographics of our users, we assume video games are a popular conversation topic, especially among younger users.
• Psychology and Philosophy: We crawled the meta information and transcripts of over 2,800 videos from TED Talks20. We picked two of them to design a conversation with

15https://www.themoviedb.org/documentation/api 16https://www.goodreads.com/api 17https://recsys-challenge.spotify.com/ 18https://www.imdb.com/interfaces/ 19https://www.igdb.com/ 20https://www.ted.com/talks

users. During the conversation, we ask questions and tell stories over multiple turns about the meaning of life and happiness.
• Holiday: The holiday module aims to engage users by providing updated, relevant, and interesting content about cultural holidays every day. We leverage data scraped from the National Today website21, which tracks fun holidays and special moments on the cultural calendar. Interestingly, we found users had great interest in talking about cultural holidays as a conversation starter.
• Travel: The travel module handles the dialog flow for travel-related user requests. The module converses with users about travel-related information, such as whether they like to travel alone. The module can also give travel destination recommendations.
• Technology and Science: The focus of the TechScience module is to provide users with facts on different fields of science and technology, such as AI. The primary source of content for this module is Reddit; for each individual subject, we extracted information using the corresponding subreddit. The content targets a general audience, including people without any prior knowledge of the subject, so the module provides a lighter take on scientific subjects.
• News: Users are directed to relevant news from various subreddits and Twitter. Gunrock provides uplifting news when the user specifies no preference. Users can specify topics such as wildfires or celebrity gossip. To make the news conversation more engaging, we elicit user opinions and provide relevant Twitter Moments responses.
• Retrieval: The retrieval module serves as a backup to handle information that cannot be assigned to any other module. Because of the diversity of the context, it has a flexible dialog flow that asks open-ended questions, such as "What's your opinion on this?". Depending on the detected dialog act, it first retrieves keywords (noun phrases) from different subreddits. After carefully processing the retrieved data, it returns the titles and comments of a post to the user. Retrieval also leverages ConceptNet to propose related topics, as illustrated in Section 3.3.7, to encourage the user to continue the conversation.

Overall, the topic dialog modules follow the dialog flow design discussed in Section 4. In addition, each topic dialog module queries our knowledge base to handle the relationships between different named entities and events. Section 3.5 describes in detail how we built our own knowledge base.

3.5 Knowledge Base

Our knowledge base consists of unified datasets stored in DynamoDB tables by topic. The datasets come from Reddit, Twitter Moments, debate opinions, IMDB, Spotify, etc. The datasets are unified in a knowledge graph by detecting matching entities, e.g., Donald Trump or wildfires. We leveraged Amazon's graph database Neptune22 to build relationships between entities and the Gremlin query language23 to traverse them.
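A hedged sketch of such a traversal with gremlinpython follows; the endpoint, vertex labels, and property names are illustrative, not our actual schema.

```python
# Sketch: traverse a Neptune knowledge graph with gremlinpython, ordering
# neighbors by the sentiment weight stored on the connecting edges.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

related = (g.V().has("entity", "name", "Donald Trump")
            .outE().order().by("sentiment")
            .inV().values("name").limit(5).toList())
print(related)
conn.close()
```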

• Factual Content
– Reddit: We compose a large collection of events daily from various subreddits. Subreddits crawled include: Science, Technology, Politics, UpliftingNews, News, WorldNews, BusinessNews, FinanceNews, Sports, entertainment, FashionNews, Health, MusicNews, TIL, ShowerThoughts, Travel.
– Twitter Moments: Twitter Moments in Gunrock are intended to help users keep up with what the world is talking about in real time. Gunrock is capable of speaking about movies, books, politics, music, celebrities, and more as events happen.
– General Information: For general information on movies and music, we use IMDB database dumps and Spotify's One Million Playlist dataset. Spotify's dataset gives Gunrock the ability to transition to relevant artists within a playlist and to understand popular songs and artists. We also leverage the TMDB API and Goodreads API for detecting movie and book titles, respectively. For APIs and large datasets, Redis and

21https://nationaltoday.com/ 22https://aws.amazon.com/neptune/ 23https://tinkerpop.apache.org/gremlin.html

DynamoDB are used to cache results to avoid exceeding API limits and to reduce response latency. Redis is particularly useful for tasks such as retrieving the list of Reddit tags a particular module can propose, e.g., aerospace engineering.
• Opinionated Content
– Twitter Opinions: We accompany Twitter Moments with mined opinions, so Gunrock has an opportunity to chime in on real-time events, making the conversation more engaging and interesting.
– Debate Opinions: Gunrock attempts to match statements and opinions against over 71,000 topics and 460,000 opinions using a universal sentence encoder [4]. When the debate topic recognition confidence is high and there is general opinion consensus, Gunrock answers the topic or user opinion directly. For controversial topics where opinions split between 40% and 60%, we request the user's opinion and explain that Gunrock is still forming its own opinion. This provides an opportunity to tread lightly on polarizing issues while accumulating a general consensus across users. We will be performing A/B testing on this in select modules shortly.

We used OpenIE [1] to unify our knowledge graph. It automatically extracts binary relations between source and target entities from plain text. These relations are abstracted in the graph database, and all events between entities are stored in DynamoDB. Each event is assigned a sentiment score using VADER [8], which is well suited to processing tweets and Reddit posts. This sentiment score is used as a traversal weight for the knowledge graph.
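The scoring step can be sketched with the vaderSentiment package as below; the event tuples are illustrative, and the relation extraction itself (OpenIE) is omitted.

```python
# Sketch: assign VADER sentiment scores to extracted events; the compound
# score is then used as a traversal weight in the knowledge graph.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

events = [  # (source entity, relation, target entity) - illustrative
    ("firefighters", "contain", "wildfires in California"),
    ("celebrity", "criticized for", "controversial remarks"),
]
for source, relation, target in events:
    text = f"{source} {relation} {target}"
    compound = analyzer.polarity_scores(text)["compound"]  # in [-1, 1]
    print(f"{text!r}: weight={compound:.3f}")
```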

3.6 Natural Language Generation (NLG)

Our system’s natural language generation module is template-based. It selects a manually designed template and fills out specific slots with information retrieved from the knowledge base by the dialog manager. The template manager module avoids response repetition and generates utterances with a variety of surface forms. We also use Amazon Speech Synthesis Markup Language (SSML) to provide our generated responses with prosodic variations.

3.6.1 Template Manager

The template manager module stores and parses the response templates used by the system. It centralizes all response templates from our system's several parallel dialog flow components (e.g., movie and music), ensures no duplicate templates are chosen for a response, and allows dynamic template formation as specified by the modules.

One of the main goals of the template manager is to avoid duplicate responses. We design multiple surface forms for each template, and the template manager ensures they are picked at random and not repeated within a conversation. This is done by storing an MD5 hash of each used template in a per-user hash table; we use MD5 because it is fast and widely adopted. By varying the responses, our system sounds more natural and human-like, and avoids situations where the system has only one response to a user request.

The template manager also allows dynamic template formation by providing named slots for substitution by the modules. For instance, our movie module provides movie fun facts by filling facts pulled from the module's database into pre-defined response templates. Another instance is our weather module: we have pre-written templates for providing temperature information, and the temperature is filled in after the module passes the specific data (i.e., location or date). The template manager also allows templates to swap out simple segments, such as acknowledgement phrases, through specific slots. This dynamic template formatting ensures variation in our responses and lowers the number of templates we need to write for the modules.
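A minimal sketch of this dedup-and-fill behavior follows; the surface forms and in-memory storage are simplified stand-ins for our template store.

```python
# Sketch: pick a random unused surface form per user (tracked by MD5 hash)
# and fill its named slots.
import hashlib
import random

SURFACE_FORMS = [  # illustrative surface forms for one response
    "Here is a fun fact about {movie}: {fact}",
    "I just learned something about {movie}. {fact}",
]

used: dict[str, set] = {}  # user_id -> hashes of already-used templates

def fill_template(user_id: str, **slots) -> str:
    candidates = SURFACE_FORMS[:]
    random.shuffle(candidates)
    for template in candidates:
        digest = hashlib.md5(template.encode()).hexdigest()
        if digest not in used.setdefault(user_id, set()):
            used[user_id].add(digest)
            return template.format(**slots)
    return candidates[0].format(**slots)  # all used: fall back rather than fail

print(fill_template("user1", movie="Her", fact="Spike Jonze wrote the script."))
```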

3.6.2 Prosody Synthesis

Our system relies on Amazon Alexa's speech synthesis system. We use the Amazon SSML format to enhance our templates, such as when reading out phone numbers or correctly pronouncing homographs and acronyms. In addition, we add fillers such as "whoops" and "uh-oh", and speechcons such as "Okey dokey", to

make our responses more human-like. We also insert pauses to break down long sentences to make them sound more natural, and add pauses before joke punchlines to build user anticipation. Please refer to Amazon's official Speechcon Reference for audio examples.24
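A small sketch of how a response can be decorated this way follows; the tags mirror Amazon's public SSML documentation (speechcon interjections and timed breaks), and the joke itself is a placeholder.

```python
# Sketch: wrap a punchline in Alexa SSML with a speechcon and a pre-punchline
# pause, as described above.

def add_prosody(punchline: str) -> str:
    return (
        "<speak>"
        '<say-as interpret-as="interjection">okey dokey</say-as>. '
        'Here is one for you. <break time="800ms"/>'
        f"{punchline}"
        "</speak>"
    )

print(add_prosody("Why did the scarecrow win an award? "
                  "He was outstanding in his field."))
```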

4 An Example Dialog

1 SYS: Oscar Wilde once said: life imitates art [fact]. Through watching movies, I feel like I've experienced being human [experience]. I'm so glad to talk to you about films. What's the name of your favorite movie?
2 USR: her
3 SYS: Ah, I've heard of Her [acknowledgement]! Here is something I just learned. Director Spike Jonze so loved his role as Alien Child that he would often give notes to his editor and producer in character [fact]. Any thoughts?
4 USR: that's cute
5 SYS: Maybe you will find this interesting. This is the only movie Spike Jonze has written as well as directed, completely from his own original script [fact]. Can you believe that?
6 USR: yeah spike jonze is a great director who's your favorite director
7 SYS: Ava DuVernay. I loved her film 13th and Selma and I think she's very talented [opinion].
8 USR: cool
9 SYS: Who's your favorite actor or actress?
10 USR: tom hanks
11 SYS: Mm hm, I can see why you would say Tom Hanks. Good choice [acknowledgement]. But my favorite actor is Leonardo DiCaprio. He consistently gives a great performance [opinion]. What do you think is the best acting performance of all time?
12 USR: let's talk about something else
13 SYS: Ooh, I've been meaning to tell you! I really like sports, but am unable to physically play [experience]! Do you play any sports?

Table 1: An example conversation with Gunrock.

Table 1 presents a simulated example conversation. To engage the user, our system interleaves facts, experiences, and opinions centered on interesting content. Facts present the user with amusing information, and the user can exchange opinions and experiences with our bot. We believe having opinions and experiences is crucial for humanizing the bot. We also found acknowledgment to be important for user engagement. A dialog can be considered a process of information interchange, which only takes place when the speaker and listener understand each other. Acknowledgment signals to the user that the bot understands, and also acts as implicit grounding, making it easy for the user to detect when the bot has misunderstood. To prolong the conversation, we designed the system to take mixed initiative. If the user requests a specific topic or asks a question, we make sure to respond appropriately. If the user is interested in the current topic, the system continues to delve into more detail. At the same time, our system can also take the initiative: if the user does not have a clear intent or actionable request, our system proposes a topic or asks a relevant question.

5 Results and Analysis

Figure 2 shows the overall trend of our chatbot's daily and trailing 7-day average ratings during the semi-finals. Our chatbot's last-week performance improved by around 17% and reached 3.56 on average; the most recent day's average is 3.63. We also achieved 28% and 30% improvements in median conversation duration (from 1:33 min to 2:00 min) and 90th-percentile conversation duration (from 5:58 min to 7:44 min), respectively, during the semi-finals. The large improvement in both ratings and duration reflects the changes we made to our system.

24https://developer.amazon.com/docs/custom-skills/speechcon-reference-interjections-english-us. html

Figure 2: Daily and last 7-day average user ratings over time

We believe three main factors led to the significant performance improvements during the semi-finals. At the system level, the three-phase NLU pipeline, the hierarchical topic transition management, and the prosody-enhanced responses had a significant impact on performance. At the topic dialog module level, we optimized the modules every day using our dialog data and visualization tool; this daily optimization steadily improved conversation quality. Resolving technical issues such as latency also improved the user experience significantly. As shown in Figure 3, these improvements led to longer conversations, with a significantly improved error rate of 8.32% (reported as of 8/10).

Figure 3: Daily number of user turns over time

To further understand the system improvement over time, we performed analysis on three aspects: module performance, dialog strategy effectiveness, and latency.

5.1 Module Performance Analysis

We use the average rating per turn to evaluate the performance of each domain-specific module, where we assume each turn contributes equally to the rating of a complete dialog. We chose this criterion over the average rating per dialog because each module's impact on the rating should correlate with the number of turns it receives.

5.1.1 Topic-level analysis

Table 2 shows the number of turns and average ratings for the different topic dialog modules. MOVIE was the most popular topic during the competition; ANIMAL was the highest-rated topic. Figure 4 shows the improvement of the 9 topic-level modules over time. Two of the modules, MOVIE and

topic | num_turns | avg_rating | l7d_num_turns | l7d_avg_rating
MOVIE | 56,568 | 3.84 | 20,399 | 3.89
BOOK | 24,292 | 3.75 | 7,451 | 3.83
ANIMAL | 17,622 | 3.90 | 9,037 | 3.97
SPORT | 12,788 | 3.65 | 2,675 | 3.95
HOLIDAY | 10,445 | 3.56 | 2,673 | 3.85
GAME | 9,391 | 3.57 | 2,337 | 3.79
MUSIC | 9,183 | 3.63 | 3,207 | 3.62
NEWS | 5,042 | 3.12 | 711 | 3.47
TECH & SCIENCE | 4,396 | 3.50 | 1,518 | 3.79
PSYCHOLOGY & PHILOSOPHY | 1,529 | 3.47 | 897 | 3.36
TRAVEL | 657 | 3.61 | 217 | 4.08
CONTROVERSIAL OPINION | 554 | 3.14 | 275 | 3.28

Table 2: Overall performance of each topic dialog module, in order of popularity ("l7d" denotes the last 7 days)

Figure 4: Daily average ratings and last 7-day average ratings for nine topics over time

BOOK, consistently have high ratings over time (above 3.8). Furthermore, all the topic modules improved significantly, especially in terms of the last 7-day average rating.

5.1.2 Lexical analysis

We also applied unsupervised learning and text mining to understand which entities people like to talk about. Figure 5 shows a qualitative study we did early in our development cycle to understand the correlation between ratings and the entities discussed, using the scattertext [11] tool. The x-axis represents word frequency and the y-axis measures the association between a word and the rating. We found that a large group of users were interested in talking about technology, games, and animals, but gave low ratings; this is reflected by characteristic words such as "robots", "undertale", and "dog", and is why we put time and emphasis into improving these modules. We also found a correlation between low ratings and controversial topics such as "religion", "gossip", and "poop", which is why we tuned our bot to avoid and divert away from these topics. Furthermore, we found a correlation between trending topics and higher ratings, implying that users are interested in talking about current events. This inspired us to create a "today's holiday" module and "flash news" using Twitter Moments. In the end, we leveraged this large-scale user data as one of the bases for our development decisions and research directions.

Figure 5: Text mining for correlation between ratings and topic words

5.2 Dialog Strategy Effectiveness

We chose to prioritize feature updates and bug fixes in our agile development, since this lets us improve our chatbot at a much faster turnaround rate. This, however, limited our means of conducting randomized controlled experiments on the changes we made. We used the R package CausalImpact [3] by Google to analyze the impact of our changes on user experience and rating metrics over our development cycle. This package allows us to estimate the effect of a change in our system even when other testing methods are incompatible with our feature release cycle. It is worth mentioning that we also run pilot tests for any new features we implement, and we push these features one module at a time, so changes in ratings can be correlated with our feature updates.

5.2.1 Acknowledgement with Knowledge Graph Reasoning

Grounding has been shown to be an effective dialog strategy for non-task-oriented dialog systems [28]. Acknowledging a user's intent or preference is more meaningful because it requires the chatbot to perform knowledge acquisition and reasoning; it is also challenging because it relies on a reliable NLU and knowledge graph to create a smooth and engaging conversation. Once we had a better system structure for acquiring and organizing knowledge, we experimented with adding acknowledgement to different modules while leveraging knowledge graphs. For example, in the book module, when the user says they prefer fiction over non-fiction books, our bot acknowledges this by saying, "Great, I remember reading Harry Potter and the Sorcerer's Stone from the Harry Potter series."

Figure 6: Acknowledgement Effect on the Book module from 7/26 to 8/14

Figure 6 shows the time series of the smoothed 7-day rolling average rating for the book module, with the intervention on 7/31 when acknowledgement was added. During the post-intervention period, the daily rating had an average value of approximately 3.90. By contrast, in the absence of the intervention, we would have expected an average of 3.50; the 95% interval of this counterfactual prediction is [3.19, 3.82]. Subtracting this prediction from the observed response yields an estimate of the causal effect of the intervention on the daily rating: 0.40, with a 95% interval of [0.078, 0.71]. In relative terms, the daily rating showed an increase of +11%, with a 95% interval of [+2%, +20%]. The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.008), so the causal effect can be considered statistically significant.
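For readers who prefer Python, a comparable analysis can be approximated with a Python port of CausalImpact (e.g., the pycausalimpact package); the ratings series below is synthetic and for illustration only, not our data.

```python
# Hedged sketch of the intervention analysis with a Python CausalImpact port.
import numpy as np
import pandas as pd
from causalimpact import CausalImpact

rng = np.random.default_rng(0)
pre = 3.5 + 0.1 * rng.standard_normal(20)    # synthetic daily ratings before 7/31
post = 3.9 + 0.1 * rng.standard_normal(15)   # synthetic daily ratings after
data = pd.DataFrame({"rating": np.concatenate([pre, post])})

ci = CausalImpact(data, pre_period=[0, 19], post_period=[20, 34])
print(ci.summary())  # estimated causal effect and posterior intervals
```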

5.3 System Latency

Figure 7: Distributions of ratings after latency improvement on 7/29

Engineering plays an important role in building a product that ensures a great user experience. Our team maintains an uptime of 94%. We found that reducing system latency when responding to users brings large rating improvements. To avoid release regressions, we monitor the system latency whenever a new feature is added. Before 7/29, we had a significant latency issue caused by a bug that passed large user attributes to the NLP module in cobot; we resolved it by increasing the memory and CPU capacity of the AWS Lambda function. We also analyzed whether reducing latency yields rating improvements. Figure 7 shows the distributions of ratings from the two periods 7/27-7/28 and 7/29-7/30 (improved latency). Because the ratings are not normally distributed, we used the Wilcoxon rank-sum test to check whether the ratings from 7/29-7/30 are higher overall than those from 7/27-7/28. Since the p-value is 0.0006, there is a statistically significant difference, and we conclude that improving latency improves user ratings.
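The comparison can be reproduced with SciPy's rank-sum (Mann-Whitney U) implementation, as sketched below; the rating lists are placeholders, not our data.

```python
# Sketch: one-sided rank-sum test comparing ratings before and after the
# latency fix, via SciPy's Mann-Whitney U implementation.
from scipy.stats import mannwhitneyu

ratings_before = [3, 4, 2, 3, 5, 3, 4, 2]   # 7/27-7/28 (illustrative)
ratings_after  = [4, 5, 4, 3, 5, 4, 5, 4]   # 7/29-7/30 (illustrative)

stat, p_value = mannwhitneyu(ratings_after, ratings_before,
                             alternative="greater")
print(f"U={stat}, p={p_value:.4f}")  # small p -> ratings improved after the fix
```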

6 Visualization Tool

We built an automatic dashboard on top of Athena25 and QuickSight26 to monitor, analyze, and improve dialog system performance. We track and visualize data on the performance and trends of different components in our system. Figure 8 shows an example of our topic breakdown dashboard. Based on the dashboard results, we prioritize proposing highly rated dialog modules, while also improving dialog modules with low ratings. Figure 9 shows another view of our dashboard: hourly-aggregated ratings on the top graph and daily-aggregated ratings on the bottom. This allows us to track trends at different time scales. Such statistics will be useful for A/B testing experiments in the future.

25http://aws.amazon.com/athena 26https://aws.amazon.com/quicksight

Figure 8: Topic Breakdown Metrics

Figure 9: Rating Metrics

7 Conclusion

We designed an open-domain social conversation system that achieved an average rating of 3.56 out of 5 in the last seven days of the semi-finals. We made many contributions in spoken language understanding, dialog management, and prosodic speech synthesis. Specifically, we proposed a three-phase spoken language understanding pipeline to handle open-domain spoken language understanding; a hierarchical dialog manager that utilizes dialog context to enable a flexible dialog flow that interleaves facts and opinions seamlessly; and a prosodic speech synthesizer that constructs more natural responses through tone adjustments and filler interjections.

8 Future Work

Due to the intensity and time limit of the competition, we were not able to perform rigorous experimental analysis on the innovative features of Gunrock, including but not limited to dialog act classification, ASR correction, and the application of prosody. We will conduct A/B testing right after the end of the semi-finals.

There are several areas in which we want to improve our system. We aim to improve the selection of proposed topics for different users based on their profile information, such as gender, personality, and topic interests, to create an adaptive and unique conversation experience for each user. To further enhance this, we will build a robust recommendation system to provide suggestions on upcoming events or interesting content with respect to each user's interests. We also plan to use reinforcement learning to train a better dialog policy for sub-module selection and conversation content planning. We expect to interleave social chitchat and task-oriented conversation to better handle cases where users request help with a specific objective, such as recommending an appropriate restaurant or tourist attraction. We will also build a data-driven opinion answering and topic debate subsystem, which will enable us to hold proper subjective discussions with users on popular topics. Finally, we will explore online learning techniques to make our system automatically learn from dialog data with users, especially on topics the system is not familiar with.

9 Acknowledgement

We would like to acknowledge the help from Amazon in terms of financial and technical support. We would also like to thank Grace Wolff and Sam Davidson for helping revise the system dialog templates, and Sonia Bitar and Hekang Jia for providing feedback on our system.

References

[1] Angeli, G., Premkumar, M. J. J., and Manning, C. D. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2015), vol. 1, pp. 344–354.
[2] Bordes, A., and Weston, J. Learning end-to-end goal-oriented dialog. CoRR abs/1605.07683 (2016).
[3] Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., and Scott, S. L. Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics 9 (2014), 247–274.
[4] Cer, D., Yang, Y., Kong, S.-Y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018).
[5] Danescu-Niculescu-Mizil, C., and Lee, L. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011 (2011).
[6] Fang, H., Cheng, H., Sap, M., Clark, E., Holtzman, A., Choi, Y., Smith, N. A., and Ostendorf, M. Sounding Board: A user-centric and content-driven social chatbot. CoRR abs/1804.10202 (2018).
[7] Gašić, M., Kim, D., Tsiakoulis, P., Breslin, C., Henderson, M., Szummer, M., Thomson, B., and Young, S. Incremental on-line adaptation of POMDP-based dialogue managers to extended domains. pp. 140–144.
[8] Hutto, C. J., and Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14) (2014).
[9] Honnibal, M., and Johnson, M. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural

Language Processing (Lisbon, Portugal, September 2015), Association for Computational Linguistics, pp. 1373–1378.
[10] Jiang, J., Hassan Awadallah, A., Jones, R., Ozertem, U., Zitouni, I., Gurunath Kulkarni, R., and Khan, O. Z. Automatic online evaluation of intelligent assistants. In WWW (2015).
[11] Kessler, J. S. Scattertext: a browser-based tool for visualizing how corpora differ.
[12] Kumar, A., Gupta, A., Chan, J., Tucker, S., Hoffmeister, B., and Dreyer, M. Just ASK: Building an architecture for extensible self-service spoken language understanding. CoRR abs/1711.00549 (2017).
[13] Lai, S., Xu, L., Liu, K., and Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015), AAAI'15, AAAI Press, pp. 2267–2273.
[14] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., and Jurafsky, D. Deep reinforcement learning for dialogue generation. CoRR abs/1606.01541 (2016).
[15] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations (2014), pp. 55–60.
[16] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018).
[17] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. CoRR abs/1802.05365 (2018).
[18] Philips, L. The double metaphone search algorithm. C/C++ Users Journal 18, 6 (June 2000), 38–43.
[19] Pichl, J., Marek, P., Konrád, J., Matulík, M., Nguyen, H. L., and Sedivý, J. Alquist: The Alexa Prize socialbot. CoRR abs/1804.06705 (2018).
[20] Ram, A., Prasad, R., Khatri, C., Venkatesh, A., Gabriel, R., Liu, Q., Nunn, J., Hedayatnia, B., Cheng, M., Nagar, A., King, E., Bland, K., Wartick, A., Pan, Y., Song, H., Jayadevan, S., Hwang, G., and Pettigrue, A. Conversational AI: The science behind the Alexa Prize. CoRR abs/1801.03604 (2018).
[21] Shum, H., He, X., and Li, D. From Eliza to XiaoIce: Challenges and opportunities with social chatbots. CoRR abs/1801.01957 (2018).
[22] Speer, R., and Havasi, C. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012) (2012), European Language Resources Association (ELRA).
[23] Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R. A., Jurafsky, D., Taylor, P., Martin, R., Ess-Dykema, C. V., and Meteer, M. Dialogue act modeling for automatic tagging and recognition of conversational speech. CoRR cs.CL/0006023 (2000).
[24] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014).
[25] Vinyals, O., and Le, Q. V. A neural conversational model. CoRR abs/1506.05869 (2015).
[26] Wallace, R. The elements of AIML style. ALICE AI Foundation, 2004.
[27] Yu, Z., Xu, Z., Black, A., and Rudnicky, A. Strategy and policy learning for non-task-oriented conversational systems. In SIGDIAL (2016).
[28] Yu, Z., Xu, Z., Black, A. W., and Rudnicky, A. Strategy and policy learning for non-task-oriented conversational systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2016), pp. 404–412.
[29] Zhao, T., Zhao, R., and Eskénazi, M. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. CoRR abs/1703.10960 (2017).

A Appendix

This table presents the Gunrock dialog act tags with definitions and examples. These dialog acts help the dialog topic modules understand user utterances.

Dialog Act Tag | Description | Example
statement | fact or story-like utterances | I have a dog named Max
acknowledgement | acknowledgement of the previous utterance | Uh-huh
opinion | opinion towards some entities | Dogs are adorable
appreciation | appreciation of the previous utterance | That's cool
abandoned | not a complete sentence | So uh
yes_no_question | yes or no questions | Do you like pizza
pos_answer | positive answers | yes
opening | opening of a conversation | Hello my name is Tom
closing | closing of a conversation | Nice talking to you
open_question | general question | What's your favorite book
neg_answer | negative response to a previous question | I don't think so
other_answers | answers that are neither positive nor negative | I don't know
other | utterances that cannot be assigned other tags | I'm okay
commands | command to do something | Let's talk about sports
hold | a pause before saying something | Let me see
not_understanding | cannot understand | I can't hear you
apology | apology | I'm sorry
thanking | express thankfulness | thank you
respond_to_apologize | response to apologies | That's all right
