Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols

Sarah E. Finch and Jinho D. Choi
Department of Computer Science, Emory University, Atlanta, GA, USA

Abstract

As conversational AI-based dialogue management has increasingly become a trending topic, the need for a standardized and reliable evaluation procedure grows even more pressing. The current state of affairs suggests various evaluation protocols to assess chat-oriented dialogue management systems, rendering it difficult to conduct fair comparative studies across different approaches and gain an insightful understanding of their values. To foster this research, a more robust evaluation protocol must be set in place. This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions. A total of 20 papers from the last two years are surveyed to analyze three types of evaluation protocols: automated, static, and interactive. Finally, the evaluation dimensions used in these papers are compared against our expert evaluation on the system-user dialogue data collected from the Alexa Prize 2020.

1 Introduction

Most successful automated dialogue systems follow task-oriented dialogue management methodology, which defines an explicit goal that the system is seeking to fulfill through the conversation with the user (Gao et al., 2019). Recently, research in chat-oriented dialogue management has experienced a substantial increase in popularity. Unlike task-oriented dialogues, where success is generally measured as the ability to complete the goal of the task, evaluation of chat-oriented dialogues is much less straightforward, since the conversational goals can be highly subjective (Huang et al., 2019).

The evaluation of chat-oriented dialogue systems has typically been accomplished through the use of automated metrics and human evaluation (Section 2). Automated evaluation requires no human labor once the evaluation script is written (Section 3). For automated evaluation to be a reliable measurement of dialogue system quality, however, it needs to be shown to be a close approximation of human judgments (Section 4). Unfortunately, commonly used automated metrics correlate weakly with human judgments, indicating poor utility of such metrics (Liu et al., 2016). Human evaluation has become more commonplace in recent dialogue system works; however, it presents its own challenges. For one, it is time-consuming and expensive to obtain human judgments. More critically, there is a lack of standardized protocol for such human evaluation, which makes it challenging to compare different approaches to one another.

There have been many previous attempts at standardizing dialogue system evaluations. A major limitation has been their focus on task-oriented dialogue systems, which does not translate well to chat-oriented dialogue systems (Walker et al., 1997; Malchanau et al., 2019). Previous works which have included chat-oriented evaluations have lacked comprehensive coverage over the many varieties of such evaluation procedures that are currently in use. Instead, the emphasis has rested primarily on automated metrics at the expense of detailed analysis of human evaluation (Deriu et al., 2019). At this stage in conversational AI, it is probable that automated and human metrics reveal different aspects of dialogue systems (Hashimoto et al., 2019). It would be remiss to focus on a single evaluation category when assessing the state of the field.

For this reason, our work aims to fill in the gaps of previous dialogue system evaluation surveys by identifying and comparing human evaluation protocols for chat-oriented dialogue systems. To this end, we present a comparative analysis of the evaluations used for chat-oriented dialogue systems over the past several years. Since the field of conversational AI has experienced rapid growth in these years, it presents a unique opportunity to observe and assess which evaluation metrics have been most widely adopted by the larger community in this period of expeditious development. We provide a detailed survey of both automated and human evaluations in order to present the most accurate depiction of the current evaluation protocols. However, our in-depth analysis is limited to that of the human evaluations due to the abundance of previous work in automated metric analysis. As such, we defer to work such as Liu et al. (2016), Ghandeharioun et al. (2019), and Ghazarian et al. (2019) for more detail on automated metrics.

As a part of our analysis, we also present a case study of real human-machine dialogues which explores the significance of different human evaluation metrics in terms of overall user satisfaction through an expert analysis. As a result of our work, the most commonly used evaluation metrics in contemporary literature, both automated and human, are revealed in detail, and our findings towards the prevalence, impact, and applicability of human evaluation metrics are illustrated.
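The analyses deferred to above typically quantify how well an automated metric tracks human judgment by correlating per-response metric scores with human ratings. The following minimal sketch illustrates that procedure only; it is not taken from any of the cited works, and all scores, ratings, and variable names are hypothetical.

```python
# Illustrative only: how correlation between an automated metric and human
# judgments is commonly measured (cf. analyses such as Liu et al., 2016).
# All values below are hypothetical.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-response automated metric scores (e.g., BLEU) and human
# quality ratings for the same set of system responses.
metric_scores = [0.12, 0.05, 0.31, 0.08, 0.22, 0.15]
human_ratings = [4, 2, 3, 5, 3, 4]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
# Correlations near zero indicate that the metric is a poor proxy for
# human judgment of response quality.
```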
2 Evaluation Protocols

For a holistic understanding of current evaluation protocols on dialogue systems, we have carefully selected 20 relevant papers since 2018, primarily from top-tier venues, and synthesized their methods. These papers focus on open-domain (or non-task-oriented) dialogue, and employ a variety of approaches including:

• Incorporation of knowledge bases [2, 4, 7, 18, 20]
• Integration of personality [8, 12]
• Handling of emotion-driven responses [10]
• Purely depending on neural-based sequence-to-sequence models [19]

Throughout the paper, the following numbers are used to refer to the related work: 1: Li and Sun (2018), 2: Liu et al. (2018), 3: Luo et al. (2018), 4: Moghe et al. (2018), 5: Parthasarathi and Pineau (2018), 6: Xu et al. (2018), 7: Young et al. (2018), 8: Zhang et al. (2018), 9: Du and Black (2019), 10: Li et al. (2019), 11: Lin et al. (2019), 12: Madotto et al. (2019), 13: Qiu et al. (2019), 14: Tian et al. (2019), 15: Wu et al. (2019), 16: Zhang et al. (2019), 17: Zhou et al. (2019), 18: Zhu et al. (2019), 19: Adiwardana et al. (2020), 20: Wang et al. (2020).

Based on these papers, three main categories are found as evaluation protocols for open-domain dialogue systems: automated, static, and interactive. Automated evaluation is performed systematically by a batch script such that no human effort is required once the script is written (Section 2.1). Static evaluation is done by a human evaluator who assesses a dialogue whose last utterance is generated by the dialogue system (Section 2.2). Interactive evaluation is also done by a human, although the evaluator assesses the quality of the dialogue after directly interacting with the dialogue system (Section 2.3).

Table 1 shows the distributions of the three evaluation protocols. Most recent approaches adopt both automated and human evaluations, with only 2 papers not including any form of human evaluation. The most common protocol for human evaluation is static evaluation, with very few papers conducting interactive assessments of dialogue systems. No work has adopted all three types of evaluation protocols.

Method      References                                                     #
AUT         [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20]   17
STA         [1, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]    16
INT         [8, 19]                                                         2
AUT & STA   [1, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20]            14
AUT & INT   []                                                              0
STA & INT   [19]                                                            1

Table 1: Distributions of the three evaluation protocols. #: number of papers using the corresponding protocol; AUT/STA/INT: automated/static/interactive evaluation; &: approaches using both protocols.

2.1 Automated Evaluation

Automated evaluation provides an objective quantitative measurement of dialogue systems by operationalizing various dimensions of dialogue into mathematical formulations. Depending on the specific objectives behind different systems, a few studies define novel automated metrics to capture the benefit of their proposed approaches. Automated evaluation provides the most straightforward and undemanding methods by which to evaluate dialogue systems; however, such metrics are generally viewed as poor indicators of true dialogue quality, following results from Liu et al. (2016).
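As a concrete illustration of such a formulation, the sketch below computes perplexity, one commonly reported automated metric, from per-token log-probabilities. It is a minimal example rather than code from any surveyed system; the input values are hypothetical, and natural-log probabilities are assumed.

```python
# Minimal sketch of perplexity: the inverse likelihood of the test-set
# responses under the model being evaluated (lower is better).
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the
    reference responses under the evaluated dialogue model."""
    n = len(token_log_probs)
    avg_neg_log_likelihood = -sum(token_log_probs) / n
    return math.exp(avg_neg_log_likelihood)

# Hypothetical log-probabilities assigned to the tokens of one held-out
# response; real evaluations aggregate over an entire test set.
example_log_probs = [-2.1, -0.7, -1.3, -3.0, -0.9]
print(f"perplexity = {perplexity(example_log_probs):.2f}")
```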
2.2 Static Evaluation

Static evaluation is an offline procedure where the evaluators never directly interact with the dialogue systems under review; instead, they are provided with dialogue excerpts. These excerpts are generated by first randomly sampling dialogues from a corpus consisting of human-to-human conversations, then having the systems produce responses to the sampled dialogues. The sampled dialogues together with the system responses are provided to human evaluators to assess. Because only the last utterance in these excerpts is generated by the dialogue system, it is difficult to evaluate sequential aspects of dialogue management through static evaluation (e.g., coherence among responses generated by the same system).

2.3 Interactive Evaluation

Unlike static evaluation, interactive evaluation has the same person play the role of both the user (the one who interacts with the system) and the evaluator. In this setup, the evaluator has a conversation with the dialogue system.

• Inertia: inertia on the clusters of embeddings of responses (Du and Black, 2019)
• Perplexity: inverse likelihood of predicting the responses of the test set (Chen et al., 1998)
• ROUGE: a subset of ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004)

The automated metrics in Table 2 fall into the following five categories:

Ground Truth Response Similarity: Most commonly used automated metrics focus on assessing how well system responses match the ground truth human responses, using word overlap (BLEU, ROUGE) or embedding similarity.
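As a rough sketch of the embedding-similarity variant of this category, the example below scores a system response against the ground-truth response by the cosine similarity of their averaged word vectors. The toy embeddings and token lists are hypothetical; real evaluations typically use pre-trained vectors such as word2vec or GloVe.

```python
# Illustrative embedding-average similarity between a system response and
# the ground-truth human response (not code from the surveyed papers).
import numpy as np

def embedding_average_similarity(response_tokens, reference_tokens, embeddings):
    """Cosine similarity between the mean word vectors of two token lists."""
    def mean_vector(tokens):
        vectors = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vectors, axis=0)

    r = mean_vector(response_tokens)
    g = mean_vector(reference_tokens)
    return float(np.dot(r, g) / (np.linalg.norm(r) * np.linalg.norm(g)))

# Toy 3-dimensional embeddings for demonstration only.
toy_embeddings = {
    "i": np.array([0.10, 0.30, 0.50]),
    "love": np.array([0.70, 0.20, 0.10]),
    "enjoy": np.array([0.60, 0.30, 0.20]),
    "movies": np.array([0.20, 0.80, 0.40]),
    "films": np.array([0.25, 0.75, 0.45]),
}
score = embedding_average_similarity(
    ["i", "enjoy", "films"], ["i", "love", "movies"], toy_embeddings)
print(f"embedding-average similarity = {score:.3f}")
```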