To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures

Pieter Wolfert∗
IDLab, Ghent University - imec
Ghent, Belgium
[email protected]

Jeffrey M. Girard
Department of Psychology, University of Kansas
Kansas, USA
[email protected]

Taras Kucherenko
KTH Royal Institute of Technology
Stockholm, Sweden
[email protected]

Tony Belpaeme
IDLab, Ghent University - imec
Ghent, Belgium
[email protected]

ABSTRACT
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be broadly distinguished: one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for the evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use

ACM Reference Format:
Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, and Tony Belpaeme. 2021. To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures. In Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI ’21), October 18–22, 2021, Montréal, QC, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3462244.3479889

1 INTRODUCTION
When we interact with embodied conversational agents, we expect a similar manner of nonverbal communication as when interacting with humans.