To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures

Pieter Wolfert∗ (IDLab, Ghent University - imec, Ghent, Belgium), Jeffrey M. Girard (Department of Psychology, University of Kansas, Kansas, USA), Taras Kucherenko (KTH Royal Institute of Technology, Stockholm, Sweden), and Tony Belpaeme (IDLab, Ghent University - imec, Ghent, Belgium)

∗Corresponding Author

arXiv:2108.05709v2 [cs.HC] 13 Aug 2021

ABSTRACT
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be broadly distinguished: one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for the evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowdsourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with varying levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings significantly correlated, showing that in terms of quality there is no preference of one method over the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.

CCS CONCEPTS
• Human-centered computing → HCI design and evaluation methods; Human computer interaction (HCI).

KEYWORDS
virtual agents, user study, nonverbal behaviour, evaluation methodology

ACM Reference Format:
Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. 2021. To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures. In Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI ’21), October 18–22, 2021, Montréal, QC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3462244.3479889

1 INTRODUCTION
When we interact with embodied conversational agents, we expect a similar manner of nonverbal communication as when interacting with humans. One way to achieve more human-like nonverbal behaviour in conversational agents is through the use of data-driven methods, which learn model parameters from data and have gained popularity over the past few years [1, 17, 18, 40]. Data-driven methods have been used to generate lip synchronisation, eye gaze, and facial expressions; in this work, however, we take co-speech gestures as a test bed for comparing evaluation methods. Data-driven methods are able to generate a wider range of gestures and behaviours, as behaviour is no longer restricted to pre-coded animation or procedurally generated behaviour but is instead generated from models trained on large amounts of data of human movement. These behaviours are often used to drive both virtual and physical conversational agents, as they improve interaction [3, 11, 21, 28, 30]. Co-speech gestures are traditionally divided into four dimensions: iconic gestures, beat gestures, deictic gestures, and metaphoric gestures [24]. The approach to produce each of these categories often differs, but, using data-driven methods, it becomes possible to generate multiple categories of gestures with a single model.

The quality of generated human-like behaviour can be assessed using objective or subjective measures. Objective measures rely on an algorithmic approach to return a quantitative measure of the quality of the behaviour and are entirely automated, while subjective measures instead rely on ratings by human observers. Most recent papers on co-speech gesture generation report objective measures to assess the quality of the generated behaviour, with measures such as velocity diagrams or average jerk being popular [1, 16, 40]. These measures are not only easy to automate but also allow comparisons across models. For example, Yoon et al. [40] trained models from other authors on the same dataset to compare objective metric results.
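To make such objective measures concrete, the following is a minimal sketch, written for this discussion rather than taken from any of the cited papers, of how average jerk might be computed for a generated motion clip. The array layout (frames × joints × 3D positions) and the frame rate are assumptions about the motion data format.

```python
import numpy as np

def average_jerk(positions: np.ndarray, fps: float = 20.0) -> float:
    """Mean magnitude of jerk (third derivative of position) over a motion clip.

    positions: array of shape (n_frames, n_joints, 3), joint positions per frame.
    fps: frame rate of the motion data (an assumed value; use the dataset's rate).
    """
    dt = 1.0 / fps
    velocity = np.diff(positions, axis=0) / dt        # finite-difference velocity
    acceleration = np.diff(velocity, axis=0) / dt     # finite-difference acceleration
    jerk = np.diff(acceleration, axis=0) / dt         # finite-difference jerk
    # Average the Euclidean norm of jerk over all remaining frames and joints.
    return float(np.linalg.norm(jerk, axis=-1).mean())

# Hypothetical usage: compare a generated clip against ground-truth motion capture,
# where `generated` and `reference` are (n_frames, n_joints, 3) arrays.
# print(average_jerk(generated), average_jerk(reference))
```

A statistic like this can be compared across models or against ground-truth motion capture, but it captures smoothness rather than how appropriate a gesture is for the accompanying speech, which is one reason subjective evaluation remains the gold standard.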
For this reason, objective measures are often preferred over subjective evaluations, as the latter are harder to compare due to their potentially high variability. Yet, subjective evaluations are crucial when evaluating the behaviour of agents interacting with humans. This is because social communication is much more complex than current objective measures are capable of capturing, and subjective evaluations are still considered to be the gold standard. There may also be a large subjective component to how observers interpret generated behaviour, which we would like to capture. Thus, the “final stretch” in quality evaluations still relies heavily on subjective evaluations [34].

While the value of subjective evaluations is widely accepted, there is little consensus on how to collect and analyse such evaluations in relation to the evaluation of data-driven generated stimuli. Recently, several authors working on data-driven methods for nonverbal behaviour generation moved from rating scales (e.g., having observers rate how human-like two generated stimuli are from 1-to-5) to the use of pairwise comparisons (e.g., having observers select which of two generated stimuli is more human-like) [17, 26, 36, 40]. For example, Yoon et al. [40] argued that “co-speech gestures are so subtle, so participants would have struggled to rate them on a five- or seven-point scale” and promoted the use of pairwise comparisons over rating scales. Relatively little empirical attention has been devoted to this methodological topic in regard to the evaluation of data-driven generated stimuli, however, and it is still unknown how much the methods actually differ in terms of usability and informativeness.
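To illustrate the difference between the two response formats, the sketch below, which is our own illustration and not the analysis code of this study, aggregates each format into a per-condition score: a mean rating for the scale format and a simple win rate for the pairwise format (more principled pairwise models such as Bradley–Terry exist; the win rate is just the simplest summary). The condition names and data structures are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating-scale responses: (condition, rating on a 1-to-5 scale).
ratings = [("model_A", 4), ("model_A", 5), ("model_B", 2), ("model_B", 3)]

# Hypothetical pairwise responses: (left condition, right condition, chosen condition).
pairs = [("model_A", "model_B", "model_A"), ("model_B", "model_A", "model_A")]

def mean_ratings(responses):
    """Average rating per condition (rating-scale analysis)."""
    scores = defaultdict(list)
    for condition, rating in responses:
        scores[condition].append(rating)
    return {condition: mean(values) for condition, values in scores.items()}

def win_rates(responses):
    """Fraction of comparisons won per condition (a simple pairwise analysis)."""
    wins, shown = defaultdict(int), defaultdict(int)
    for left, right, winner in responses:
        shown[left] += 1
        shown[right] += 1
        wins[winner] += 1
    return {condition: wins[condition] / shown[condition] for condition in shown}

print(mean_ratings(ratings))  # {'model_A': 4.5, 'model_B': 2.5}
print(win_rates(pairs))       # {'model_A': 1.0, 'model_B': 0.0}
```

Both summaries induce a ranking over conditions; what is compared in this study is whether those rankings agree and which format is faster and more reliable for raters.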
In the current study, we seek to explore the similarities and differences between the rating scale and pairwise comparison approaches. We take generated co-speech gestures as a test bed for our evaluations but note that these findings may also apply to stimulus evaluation in other areas. Our hypotheses, design, and methodology were pre-registered before the data was gathered. We present short video clips to human participants, with each video clip showing an avatar displaying combined verbal and nonverbal behaviour. The movements are generated using three data-driven methods of varying quality, and we expect the subjective evaluations to clearly reflect this difference. In order to gain more insight into the effectiveness of the two subjective evaluation methods, we formulated the following five hypotheses.

Our aim is to quantify the pros and cons of these subjective evaluation methods and to provide empirical recommendations for the community working on gesture generation on when to use each method. Although the concept of comparing these evaluation strategies is not novel on its own [8], it is novel in relation to the evaluation of gesture generation for ECAs. We hope that this work can both highlight similarities and differences between the evaluated methods and function as a bridge between the fields of psychometrics and gesture generation research.

2 RELATED WORK
In this section, we cover work that compared rating scale evaluations with pairwise comparisons, and look at their specific use in the field of gesture generation for embodied conversational agents. To our knowledge, there has not been a comparison between rating scales and pairwise comparisons for the evaluation of co-speech gestures in Embodied Conversational Agents (ECAs) in particular. However, subjective evaluation methods have been studied and compared in several other fields. We want to highlight that we believe there is a difference in the type of stimuli being evaluated: we consider data-driven behaviour in virtual agents. We are aware of the overlap with psychology and psychometrics, but want to zoom in specifically on the use of both rating and pairwise assessments of generated gesticular behaviour for ECAs.

There is a rich history of work in psychology on related topics. DeCoster et al. [5] compared analysing continuous variables directly with analysing them after dichotomisation (e.g., re-coding them as two-class variables such as high-or-low). Although there were a few edge cases where dichotomisation was similar to direct analysis, they demonstrated that dichotomisation throws away important information and concluded that the use of the original continuous variables is to be preferred in most circumstances. Simms et al. [32] randomly assigned participants to complete the same personality rating scales with different numbers of response options ranging from two to eleven. They found that including four or fewer response options often attenuates psychometric precision, while including more than six response options generally provides no improvement in precision. Finally, Rhemtulla et al.