Automatic Online Evaluation of Intelligent Assistants
Jiepu Jiang1*, Ahmed Hassan Awadallah2, Rosie Jones2, Umut Ozertem2, Imed Zitouni2, Ranjitha Gurunath Kulkarni2 and Omar Zia Khan2
1 Center for Intelligent Information Retrieval, University of Massachusetts Amherst
2 Microsoft, Redmond, WA, USA
[email protected], {hassanam, rosie.jones, umuto, izitouni, raguruna, omarzia.khan}@microsoft.com

* Work done during an internship at Microsoft Research.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2015, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05.
http://dx.doi.org/10.1145/2736277.2741669

ABSTRACT
Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, it is challenging to evaluate them due to the varied and evolving set of tasks supported, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answers, it is expensive to evaluate each task individually. This paper is the first attempt to solve this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant as well as its components, i.e., speech recognition and intent classification. Using this approach, we can potentially evaluate and compare different tasks within and across intelligent assistants according to the predicted user satisfaction rates. Our approach is characterized by an automatic scheme of categorizing user-system interaction into task-independent dialog actions, e.g., the user is commanding, selecting, or confirming an action. We use the action sequence in a session to predict user satisfaction and the quality of speech recognition and intent classification. We also incorporate other features to further improve our approach, including features derived from previous work on web search satisfaction prediction, and those utilizing acoustic characteristics of voice requests. We evaluate our approach using data collected from a user study. Results show our approach can accurately identify satisfactory and unsatisfactory sessions.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces – evaluation/methodology, interaction styles, voice I/O.

Keywords
Voice-activated intelligent assistant; evaluation; user experience; mobile search; spoken dialog system.

1. INTRODUCTION
Intelligent assistants are becoming a prevalent feature on mobile devices. They provide voice control and feedback for mobile device functions (e.g., making phone calls, calendar management, finding places). Users can also search the web or even chat with intelligent assistants. While these novel applications are useful and attractive for users, it is challenging to evaluate and compare them due to the large variability of tasks.

Evaluation is a central component of many related applications, e.g., search engines, Q&A systems, and recommendation systems. These applications are usually evaluated by comparing system-generated answers with "correct" answers labeled by human annotators. For example, in web search, we annotate relevant webpages and evaluate using metrics such as mean average precision (MAP) and normalized discounted cumulative gain (nDCG) [14].

However, intelligent assistants differ from these applications in that they can involve a wide variety of tasks, ranging from making phone calls and managing calendars, to finding places, finding answers to general questions, and web search. These tasks have different forms of "correct" answers. It is expensive to evaluate each task separately using different human judgments and metrics. It is also difficult to use one single setup to evaluate all tasks. In addition, the tasks performed can be personal in nature, and the performance of the system depends heavily on users. These factors make it challenging to conduct manual ground-truth-based evaluation.

To solve these challenges, we adopt approaches similar to recent studies of user satisfaction prediction in web search [1, 3, 6, 7, 17, 32]. These studies developed alternative evaluation approaches by finding and using correlation between explicit ratings of user experience and implicit behavioral signals such as click and dwell time [4, 7]. However, we cannot simply apply user behavior signals in web search to evaluate intelligent assistants due to the wider range and different nature of tasks. These tasks may involve a variety of user intents, diverse topics, and distinct user interaction modes. For example, the process of making a phone call, or navigating to a place, involves a dialog-style conversation between user and system, with user requests and system responses very different from those in web search. In addition, intelligent assistants heavily use voice interactions, and it is important to consider the voice signal to assess user experience.

We introduce a model for evaluating user experience in voice-activated intelligent assistants. We consider satisfaction as the major indicator of user experience because our study shows that it is consistently correlated with changes in user interest towards the system. Our model predicts whether the user has a satisfactory or unsatisfactory experience with an intelligent assistant based on user interaction patterns. Once the model is trained, it can evaluate real traffic of intelligent assistants without human judgments of correct answers for tasks. This makes it a useful and cheap evaluation approach for intelligent assistants' developers, who have abundant user traffic and user logs. Our model includes a sub-model for automatically classifying user-system interaction into dialog actions, a Markov model over action transitions, as well as features related to requests, responses, clicks, and those using acoustic signals.
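To make the action-transition component concrete, the following is a minimal sketch of how a first-order Markov model over task-independent dialog actions could score a session. The action label set, the SessionScorer class, and the add-one smoothing are illustrative assumptions for this sketch, not the implementation described later in this paper.

from collections import defaultdict
from math import log

# Hypothetical task-independent dialog action labels (for illustration only).
ACTIONS = ["command", "select", "confirm", "repeat", "web_search", "end"]

class SessionScorer:
    """First-order Markov model over dialog-action transitions in a session."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # add-one style smoothing constant
        self.counts = defaultdict(lambda: defaultdict(float))  # counts[prev][curr]

    def fit(self, sessions):
        """Count action-to-action transitions from training sessions."""
        for actions in sessions:
            for prev, curr in zip(actions, actions[1:]):
                self.counts[prev][curr] += 1.0
        return self

    def avg_log_likelihood(self, actions):
        """Average smoothed log-probability of the transitions observed in one session."""
        total, n = 0.0, 0
        for prev, curr in zip(actions, actions[1:]):
            row = self.counts[prev]
            prob = (row[curr] + self.smoothing) / (sum(row.values()) + self.smoothing * len(ACTIONS))
            total += log(prob)
            n += 1
        return total / max(n, 1)

# Usage: train one transition model per satisfaction label, then compare likelihoods.
sat_model = SessionScorer().fit([["command", "confirm", "end"]])
unsat_model = SessionScorer().fit([["command", "repeat", "repeat", "web_search", "end"]])
session = ["command", "repeat", "confirm", "end"]
predicted_satisfied = sat_model.avg_log_likelihood(session) > unsat_model.avg_log_likelihood(session)

One natural design choice, and the one closer in spirit to our model, is to feed such transition statistics into a trained satisfaction classifier as features alongside the request, response, click, and acoustic signals, rather than using the two likelihoods as a standalone decision rule.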
Our contributions can be summarized as follows:
• An accurate model for predicting user satisfaction with an intelligent assistant and its components, i.e., speech recognition and intent classification.
• A scheme of categorizing user-system interaction into task-independent dialog actions, and a model to automatically map different actions to this scheme.
• Analysis of user behavior and patterns indicating user experience in intelligent assistants.

2. RELATED WORK
There are a number of areas of related work relevant to the research described in this paper. These include (1) methods and metrics for the evaluation of search systems, (2) inferring satisfaction from observed search behavior, and (3) dialog act modeling and classification in conversational speech. We cover these in turn in this section.

User behavior modeling has been used extensively for evaluating search systems [1, 4, 6, 7]. Traditionally, search systems have been evaluated using retrieval metrics such as MAP and nDCG [14], where a collection of documents, queries, and human-labeled relevance judgments is used to evaluate search system performance. These metrics are expensive to collect and potentially noisy, given that third-party judges have limited knowledge of the individual user's intent. Additionally, these metrics are query-based. Previous research has shown that search tasks often contain multiple queries related to the same information need [7]. Unfortunately, the connections between these queries are ignored by these metrics. Session-based DCG (sDCG) [15] does consider the session context, but still requires manual relevance judgments.
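As a point of reference for the judgment-based metrics being contrasted here, the standard DCG/nDCG definitions and the session-level discount used by sDCG can be written roughly as follows (the notation is ours; see [14, 15] for the exact formulations):

DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad nDCG@k = \frac{DCG@k}{IDCG@k}, \qquad sDCG = \sum_{j=1}^{m} \frac{DCG(q_j)}{1 + \log_{b_q} j}

where rel_i is the judged relevance of the result at rank i, IDCG@k is the DCG of the ideal ranking, q_j is the j-th query of an m-query session, and b_q is the base of the query-position discount. Every term depends on human relevance labels, which is precisely the cost that the behavior-based approaches discussed next aim to avoid.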
In addition served search behavior and (3) dialog act modeling and classifica- to voice system response, intelligent assistants provide answers or tion in conversational speech. We cover these in turn in this section. options in the display, and users can type in the requests and select User behavior modeling has been used extensively for evaluating a displayed result or option. In this sense, intelligent assistants are search systems [1, 4, 6, 7]. Traditionally search system have been related to multi-modal conversational systems [11, 20, 30]. evaluated using retrieval metrics such as MAP and nDCG [14], Note that many different taxonomies of dialog acts have been where a collection of documents, queries and human labeled rele- proposed [28]. We do not intend here to propose a new one, but vance judgments are used to evaluate search system performance. rather to model user and system interaction with the goal of pre- These metrics are expensive to collect and potentially noisy, given dicting user satisfaction. Our model of system interaction and user that third-party judges have limited knowledge of the individual is designed independently from any dialog model the system uses. user’s intent. Additionally, these metrics are query-based. Previous Hence, it is independent of any specific implementation. In contrast research has shown that search tasks often contain multiple queries with work on learning dialog model transitions [27] we do not at- related to the same information need [7]. Unfortunately the connec- tempt to model the most likely dialog sequence, but to use the dia- tions between these queries are ignored by these metrics. Session- log sequence to model user satisfaction. Our work differs from pre- based DCG (sDCG) [15] does consider the session-context, but still vious work in offline evaluation of dialog systems [31], as we do requires manual relevance judgments. not require manual transcription of speech, and thus once trained, Another line of research has focused on using implicit feedback our models can be run online, at-scale, evaluating voice assistants from user behavior to evaluate search engine. These methods have in an automated, unsupervised fashion. lower cost, are more scalable, and sourced from the actual users. 3. INTELLIGENT ASSISTANTS Early research on implicit feedback [4] used an instrumented Intelligent assistants are emerging and evolving applications lack- browser to determine if there was an association between explicit ing a precise definition.