Fundamental Exploration of Evaluation Metrics for Persona Characteristics of Text Utterances
Chiaki Miyazaki, Saya Kanno, Makoto Yoda, Junya Ono, Hiromi Wakaki
Sony Group Corporation, Japan
{chiaki.miyazaki, saya.kanno, makoto.yoda, junya.ono, [email protected]

Abstract

To maintain the utterance quality of a persona-aware dialog system, utterances that are inappropriate for the persona should be thoroughly filtered out. When evaluating the appropriateness of a large number of arbitrary utterances to be registered in the utterance database of a retrieval-based dialog system, evaluation metrics that require a reference (or a “correct” utterance) for each evaluation target cannot be used. In addition, practical utterance filtering requires the ability to select utterances based on the intensity of persona characteristics. Therefore, we are developing metrics that can capture the intensity of persona characteristics and can be computed without references tailored to the evaluation targets. To this end, we explore existing metrics and propose two new metrics: persona speaker probability and persona term salience. Experimental results show that our proposed metrics have weak to moderate correlations with persona-characteristic scores based on human judgments and outperform the other metrics overall in filtering utterances that are inappropriate for particular personas.

1 Introduction

Maintaining utterance quality is important for commercial dialog systems. To achieve better quality, methods of filtering inappropriate utterances have been proposed from the perspectives of offensive language (Xu et al., 2020), grammar and topics (Tsunomori et al., 2020), discourse relations (Otsuka et al., 2017), and so on. In addition to these perspectives, we need a filter for the personas of dialog systems. Persona-aware dialog systems are important in that having a consistent persona makes a dialog system believable (Higashinaka et al., 2018) and entertaining (Miyazaki et al., 2016). Throughout this paper, we use the term persona to indicate individuals such as real-life people and fictional characters. In addition, we use the term persona characteristics to indicate the distinctive qualities of a persona.

[Figure 1: Process of selecting appropriate utterances for dialog system responses. A pool of utterances varying in quality is created, each utterance is automatically scored, and the top-scoring utterances are selected for the utterance DB of the dialog system.]

Figure 1 shows how we would like to automatically evaluate the appropriateness of a large number of arbitrary utterances and select utterances to be registered in the utterance database of a retrieval-based dialog system. Doing this is preferable for commercial use in terms of preventing unexpected utterances from being output. Evaluation metrics based on word overlap between an evaluation target and a reference (or a “correct” utterance) are often used to evaluate persona-aware utterance generation (e.g., F1, BLEU, and ROUGE in Wolf et al., 2019; Madotto et al., 2019; Olabiyi et al., 2019). However, these metrics are not applicable to utterance selection because preparing references for a large number of arbitrary utterances is extremely time-consuming. In other words, these metrics are not meant to evaluate utterances outside a predefined evaluation dataset. Therefore, the metrics need to be computed without references tailored to the evaluation targets. In addition, practical utterance selection requires the ability to select utterances based on the intensity of persona characteristics.
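To make the intended selection process concrete, below is a minimal sketch of the flow in Figure 1: a pool of candidate utterances is scored by a reference-free metric, and only the candidates whose scores clear a threshold are kept for the utterance database. The `score_utterance` callable and the threshold value are illustrative placeholders, not part of the original system; any of the metrics discussed in Sections 3 and 4 could serve as the scoring function.

```python
from typing import Callable, List, Tuple

def select_utterances(
    pool: List[str],
    score_utterance: Callable[[str], float],
    threshold: float = 0.3,
) -> List[Tuple[str, float]]:
    """Score each candidate with a reference-free metric and keep
    those whose persona score clears the threshold (cf. Figure 1)."""
    scored = sorted(
        ((utt, score_utterance(utt)) for utt in pool),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(utt, score) for utt, score in scored if score >= threshold]
```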
Accordingly, we explore metrics that can be used to capture the intensity of persona characteristics and can be computed without references tailored to the evaluation targets. The contributions of this paper are as follows:

• We provide summaries of the existing metrics used for evaluating persona-aware utterances.
• We propose two new metrics for evaluating persona characteristics without references tailored to the evaluation targets.
• We investigate how effectively the existing metrics and our proposed metrics capture the intensity of persona characteristics.

The rest of this paper is structured as follows. In Section 2, we introduce related work. In Section 3, we give an overview of the existing evaluation metrics. In Section 4, we propose two new metrics. In Section 5, we investigate the correlation of the metrics with human judgments. In Section 6, we investigate the filtering of inappropriate utterances with the practicality of utterance selection in mind.

2 Related Work

Since the release of the PERSONA-CHAT dataset (Zhang et al., 2018), many studies have been conducted on persona-aware utterance generation (Song et al., 2019; Jiang et al., 2020; Liu et al., 2020), including studies by the 23 teams that participated in the ConvAI2 competition (Dinan et al., 2019). The PERSONA-CHAT dataset was created by crowdworkers who were asked to converse as the personas described in given descriptions. Each description consisted of five sentences on average, such as “I am a vegetarian,” “I like swimming,” “My father used to work for Ford,” “My favorite band is Maroon5,” and “I got a new job last month, which is about advertising design.” In this manner, facts about the personas are described; however, the linguistic styles of the personas were not a focus.

Linguistic style is also an important aspect of persona-aware utterances. For example, Big Five personality traits (Mairesse and Walker, 2007) as well as gender, age, and area of residence (Miyazaki et al., 2015) can affect the linguistic styles of utterances. In text style transfer, transfer success is often measured by transfer accuracy (Krishna et al., 2020). For example, when transferring negative sentences into positive ones, transfer success is measured as the fraction of sentences that are classified as positive (Fu et al., 2018).

The same idea can be used to evaluate persona-aware utterances. In fact, one study uses a similar evaluation metric called personality classification accuracy (Su et al., 2019), which is the accuracy of speaker classification on the evaluation target utterances. We adopt and modify this idea so that we can measure the persona characteristics of each utterance.
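To illustrate the classifier-based idea that the following sections build on, here is a minimal sketch of how transfer accuracy (and, analogously, personality classification accuracy) is typically computed: the fraction of evaluation targets that a pre-trained classifier assigns the desired label. The duck-typed `classifier` with a `predict` method is an assumption for illustration, not an implementation from any of the cited studies.

```python
from typing import Iterable, List

def classification_accuracy(classifier, texts: Iterable[str],
                            desired_label: str = "positive") -> float:
    """Fraction of texts assigned the desired label, e.g., the share of
    transferred sentences classified as positive (transfer accuracy)."""
    texts = list(texts)
    if not texts:
        return 0.0
    predictions: List[str] = list(classifier.predict(texts))
    return sum(label == desired_label for label in predictions) / len(texts)
```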
3 Existing Metrics

This section introduces the existing evaluation metrics for persona-aware utterances that can be computed without references tailored to the evaluation targets. Table 1 lists these metrics. They are roughly divided into metrics based on persona descriptions, as used in the PERSONA-CHAT dataset, and metrics based on sample monologues of the personas. In addition, they can be categorized by the involvement of machine learning, i.e., trained or untrained. Hereinafter, we use the term monologue to refer to a set of independent utterances that are not associated with the preceding or following utterances in a dialog.

Category                    Training    Metric
Persona-description-based   Trained     Persona accuracy
                            Untrained   P-F1
                            Untrained   P-Cover
Sample-monologue-based      Trained     Personality classification accuracy
                            Trained     uPPL
                            Untrained   MaxBLEU

Table 1: List of existing metrics.

3.1 Metrics Based on Persona Descriptions

3.1.1 Persona Accuracy

Persona accuracy (Zheng et al., 2020) is the accuracy with which a binary classifier can determine whether a persona description is expressed in the evaluation target utterances.

3.1.2 Persona F1 (P-F1)

P-F1 is an untrained evaluation metric used by Jiang et al. (2020) that was adapted from a previous study (Dinan et al., 2018). P-F1 is the harmonic mean of persona precision and persona recall, which are computed based on the number of non-stop words shared between an evaluation target and a persona description.

3.1.3 Persona Coverage (P-Cover)

P-Cover is another untrained metric used by Jiang et al. (2020) that was adapted from a previous study (Song et al., 2019). Although it is also based on the non-stop words shared between an evaluation target and the persona description, it utilizes inverse term frequency to weight the words.

3.2 Metrics Based on Sample Monologues

3.2.1 Personality Classification Accuracy

Personality classification accuracy (Su et al., 2019) is the speaker classification accuracy for the evaluation targets. The speaker classification can be achieved by building a classifier to distinguish the speakers of the utterances in a monologue corpus of the target personas.

3.2.2 User Language Perplexity (uPPL)

uPPL (Wu et al., 2020) is a metric that evaluates whether an utterance satisfies the linguistic style of a given persona. It can be obtained by building a statistical language model for a persona using the persona's sample monologue.

4 Proposed Metrics

We propose a trained persona speaker probability (PSProb) metric and an untrained persona term salience (PTSal) metric.

4.1 Persona Speaker Probability (PSProb)

To measure the intensity of the persona characteristics of an utterance, we use the probability of the utterance being said by a persona. Figure 2 shows the process of obtaining an utterance score.

[Figure 2: Process of obtaining an utterance score using PSProb. An utterance (e.g., 俺/は/学生/だ, “I'm a student.”) is fed to a speaker classifier, which outputs the probability of the utterance being said by each persona (e.g., Persona A: 0.35, Persona B: 0.15).]

[Figure 3: Process of obtaining an utterance score using PTSal. Per-persona salience scores are looked up for each term (e.g., bigrams such as 俺/は and は/学生) in an utterance, and the scores are averaged into per-persona utterance scores (e.g., Persona A: 0.002, Persona B: 0.001).]
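As a rough sketch of how a PSProb-style score could be computed (not the authors' actual implementation), the following trains a toy speaker classifier on per-persona monologue corpora and reads off the predicted probability that a given persona said the utterance. The TF-IDF character n-gram features, the logistic regression model, and the toy corpora are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy monologue corpora keyed by persona (real corpora would be far larger).
corpora = {
    "Persona A": ["俺は学生だ", "今日も勉強するぜ"],
    "Persona B": ["私は教師です", "授業の準備をします"],
}
texts = [utt for utts in corpora.values() for utt in utts]
speakers = [p for p, utts in corpora.items() for _ in utts]

# Character n-grams avoid the need for Japanese word segmentation.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, speakers)

def psprob(utterance: str, persona: str) -> float:
    """Probability that the utterance was said by the given persona."""
    probabilities = clf.predict_proba([utterance])[0]
    return float(probabilities[list(clf.classes_).index(persona)])

print(psprob("俺は学生だ", "Persona A"))  # high for Persona A, low for B
```

With real monologue corpora, a stronger speaker classifier could be substituted without changing this interface.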