Automatic Online Evaluation of Intelligent Assistants

Jiepu Jiang1*, Ahmed Hassan Awadallah2, Rosie Jones2, Umut Ozertem2, Imed Zitouni2, Ranjitha Gurunath Kulkarni2 and Omar Zia Khan2
1 Center for Intelligent Information Retrieval, University of Massachusetts Amherst
2 Microsoft, Redmond, WA, USA
[email protected], {hassanam, rosie.jones, umuto, izitouni, raguruna, omarzia.khan}@microsoft.com

ABSTRACT
Voice-activated intelligent assistants, such as Siri, Google Now, and Cortana, are prevalent on mobile devices. However, it is challenging to evaluate them due to the varied and evolving number of tasks supported, e.g., voice command, web search, and chat. Since each task may have its own procedure and a unique form of correct answers, it is expensive to evaluate each task individually. This paper is the first attempt to solve this challenge. We develop consistent and automatic approaches that can evaluate different tasks in voice-activated intelligent assistants. We use implicit feedback from users to predict whether users are satisfied with the intelligent assistant.

These applications are usually evaluated by comparing system-generated answers with “correct” answers labeled by human annotators. For example, in web search, we annotate relevant webpages and evaluate using metrics such as mean average precision (MAP) and normalized discounted cumulative gain (nDCG) [14].

However, intelligent assistants differ from these applications in that they can involve a wide variety of tasks, ranging from making phone calls and managing calendars, to finding places, finding answers to general questions, and web search. These tasks have different forms of “correct” answers. It is expensive to evaluate each task separately using different human judgments and metrics.
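To make the label-based offline paradigm above concrete, the following is a minimal sketch, not code from the paper, of how a ranked list with human relevance labels is scored with average precision (the per-query component of MAP) and nDCG. The graded labels, the log2 rank discount, and all function names and example values are assumptions of this illustration.

```python
import math

def average_precision(binary_rels):
    # AP for binary labels: the mean of precision@k taken at each rank
    # holding a relevant result. MAP averages this value over queries.
    hits, precisions = 0, []
    for rank, rel in enumerate(binary_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def dcg(relevances):
    # Discounted cumulative gain: each result's graded relevance label,
    # discounted by log2 of its (1-based) rank.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances, k=None):
    # nDCG: DCG of the system's ranking divided by the DCG of the
    # ideal ordering (labels sorted from most to least relevant).
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical judged rankings: binary labels for AP, graded labels
# (2 = highly relevant, 1 = relevant, 0 = not relevant) for nDCG.
print(average_precision([1, 0, 1, 1, 0]))  # ~0.806
print(ndcg([2, 0, 1, 2, 0], k=5))          # ~0.894
```

As the sketch shows, both metrics presuppose a single fixed notion of a “correct” answer per result, which is exactly the assumption that breaks down across the heterogeneous tasks an intelligent assistant supports.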