FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics

Katri Leino (1), Juho Leinonen (1), Mittul Singh (1), Sami Virpioja (2), Mikko Kurimo (1)

(1) Department of Signal Processing and Acoustics, Aalto University, Finland
(2) Department of Digital Humanities, University of Helsinki, Finland

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development; otherwise, comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such resources for English, resources in other languages are not yet available. In this work, we provide a starting point for Finnish open-domain chatbot research. We describe our collection efforts to create the Finnish chat conversation corpus FinChat, which is made available publicly. FinChat includes unscripted conversations on seven topics from people of different ages. Using this corpus, we also construct a retrieval-based evaluation task for Finnish chatbot development. We observe that off-the-shelf chatbot models trained on conversational corpora do not perform better than chance at choosing the right answer based on automatic metrics, while humans can do the same task almost perfectly. Similarly, in a human evaluation, responses to questions from the evaluation set generated by the chatbots are predominantly marked as incoherent. Thus, FinChat provides a challenging evaluation set, meant to encourage chatbot development in Finnish.

Index Terms: Finnish corpora, chatbot evaluation, open-domain chatbots, conversational language modeling

1. Introduction

Recently, open-domain conversational agents, or chatbots, capable of casual conversation have received much attention from NLP researchers. This trend has been supported by regularly organized chatbot challenges such as the Dialogue State Tracking Challenges (https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge), the NeurIPS Conversational Intelligence Challenge (http://convai.io/), and the Amazon Alexa Prize (https://developer.amazon.com/alexaprize). Nevertheless, training good open-domain chatbots is challenging: it requires large amounts of conversational data and an evaluation setup for model development. It is often easier to extract conversational data from online sources for training than to construct standardized evaluations. Yet, the latter is essential for model development, because the alternative is to employ expensive human evaluation to compare different conversational agents.

Chatbot challenges have overcome this issue by providing standardized evaluation setups to compare models [1, 2, 3]. The growth of resources, however, has been restricted to English. For Finnish, like many other languages, there are no chatbot evaluation setups available. Meanwhile, using machine-translated corpora from well-resourced languages depends on the translation quality, which for Finnish is a concern at the moment. In this work, our focus is to bridge this gap for Finnish and to bootstrap open-domain chatbot research in the language. We provide the FinChat corpus and evaluation setup to support this aim.

The FinChat corpus consists of Finnish chat conversations on everyday topics collected from voluntary participants. Our goal is to collect conversations that are natural and engaging, which are two important qualities of a human conversation [4]. To ensure naturalness, we do not restrict our participants to a script, specify the language style (formal or informal), or restrict them to specific discussion points. To ensure engaging conversations, we provide participants with seven broad and diverse topics to guide their conversation. Afterwards, the participants self-evaluate whether each conversation was engaging or not.

The FinChat evaluation setup includes a retrieval task to help automatically compare chatbot models. The task provides the chatbot with a sentence from a conversation and asks it to predict the correct continuation from a given list. The task is easy for humans, who achieve 95.1% accuracy, whereas off-the-shelf chatbot models trained on large Finnish conversational datasets perform much worse, barely achieving the accuracy of a random choice, 10% (a schematic sketch of the task format is given at the end of this section).

We also perform a human evaluation where responses generated by the chatbots to the questions from the FinChat evaluation set are marked for grammatical correctness, intelligibility, and coherence. The best generated responses score high on grammar and intelligibility, close to the original human responses, but much worse on coherence. Thus, FinChat poses a challenging task for chatbot modelling. To support further research, we publicly release the FinChat corpus, our evaluation setup, and training recipes to recreate the considered chatbots at https://github.com/aalto-speech/FinChat.
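To make the retrieval task format concrete, the sketch below builds one illustrative evaluation item and measures a uniform random-choice baseline, which for ten candidate replies lands at the 10% chance level quoted above. The dictionary layout, field names, and distractor strings are our own assumptions for illustration, not the released FinChat format.

```python
import random

# One illustrative retrieval-task item: a conversation message plus ten
# candidate continuations, of which exactly one is the true reply.
# The layout is hypothetical; see the FinChat repository for the real format.
eval_items = [
    {
        "message": "Mitä urheilua harrastat?",  # "What sports do you do?"
        "candidates": ["Käyn salilla pari kertaa viikossa."]  # true reply
                      + [f"distractor {i}" for i in range(9)],
        "correct_index": 0,
    },
]

def random_choice_accuracy(items, trials=10_000):
    """Accuracy of picking a candidate uniformly at random (~10% for N=10)."""
    hits = 0
    for _ in range(trials):
        item = random.choice(items)
        guess = random.randrange(len(item["candidates"]))
        hits += guess == item["correct_index"]
    return hits / trials

print(f"Random baseline: {random_choice_accuracy(eval_items):.1%}")
```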
2. Related Work

Conversational chat corpora in English are mostly knowledge-grounded [5, 6, 7] or partly scripted [8]. Conversations are knowledge-grounded by providing participants with a highly specific topic and background reading beforehand. This data generation style results in topical and more coherent conversations. Such data is useful for creating chatbots suited to information retrieval. For more casual conversations, there are only a few corpora, such as PersonaChat [8], that have scripted small-talk conversations. Alternatively, real conversations can be extracted from social media. However, they require a significant filtering effort to ensure the quality and appropriateness of the content. Our approach to a conversational chat corpus is to collect diverse casual conversations by not restricting the content and only providing a broad topic as guidance. We also promote diversity by having participants of different ages and giving them the freedom to converse in the conversational language they use in real life. Additionally, we aim for longer conversations by giving participants more time. In most of the above data sets, the conversations are often short, with only 5-8 turns, except for Topical-Chat [5], which has 20-turn conversations. In comparison, the FinChat corpus has a conversation length of 14 turns on average.

Table 1: FinChat data statistics: the number of conversations (Conv), messages (Mes), and words, and the rate of interesting conversations for each topic and group. The groups are university staff, university students, and high school students (HS).

Topics          Conv   Mes     Words    Interesting
Sports          24     1,054   5,703    77 %
Literature      15     655     3,179    61 %
TV              15     900     4,132    71 %
Traveling       12     463     4,418    83 %
Food            9      240     2,140    78 %
Movies          7      209     1,546    57 %
Music           4      149     1,263    100 %

Univ. staff     41     1,526   12,700   77 %
Univ. students  10     239     2,769    85 %
HS students     34     1,863   6,733    66 %

All             86     3,630   22,210   74 %

Crowd-sourcing is a popular option for gathering chat conversations because it gives easy access to a large number of participants, and the content of the conversations can be controlled for quality. Unfortunately, not all languages are well represented on crowd-sourcing platforms, and therefore, gathering substantial amounts of data for them is challenging. As this concerns Finnish as well, we created a collection setup and recruited volunteers to have a casual chat with each other.

For chatbot evaluation, perplexity, cross-entropy, and translation metrics such as BLEU [9], METEOR [10], and ROUGE [11] are straightforward to calculate but show little correlation with human evaluation [12, 13]. In addition, the PersonaChat corpus introduced hits@1/N, the success rate of predicting the correct reply from a list containing it and N − 1 random sentences. N-choose-k [14] can be seen as an extension of the hits metric, where it is enough for the correct response to appear in the top k. In our work, we employ these different metrics to evaluate chatbots on the evaluation task; a small sketch of the hits computation follows below.
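As a concrete reading of these retrieval metrics, the following sketch computes hits@k from per-candidate model scores: with k = 1 it corresponds to hits@1/N, and a larger k gives the top-k relaxation in the spirit of n-choose-k. The function name and scoring convention (higher score = better candidate) are our assumptions, not the paper's released implementation.

```python
def hits_at_k(scores, correct_index, k=1):
    """Return 1 if the correct reply is among the k highest-scoring candidates.

    scores: one model score per candidate reply (higher = better).
    With k=1 this matches hits@1/N from PersonaChat; larger k gives the
    n-choose-k style relaxation where top-k containment suffices.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(correct_index in ranked[:k])

# Toy usage: 10 candidates, the correct one (index 3) ranked second overall.
scores = [0.1, 0.7, 0.2, 0.6, 0.0, 0.3, 0.1, 0.2, 0.4, 0.05]
print(hits_at_k(scores, correct_index=3, k=1))  # 0 -- not the single best
print(hits_at_k(scores, correct_index=3, k=3))  # 1 -- within the top 3
```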
Table 2: Group statistics: the average word length, the average number of messages in each conversation (Mes / Conv), and the average number of words in each message (Words / Mes) for each age group.

Groups          Word length   Mes / Conv   Words / Mes
Univ. staff     6.0           37.2         8.3
Univ. students  5.5           23.9         11.6
HS students     5.0           54.8         3.6

In conversation modeling, most recent advances employ Transformer models [15, 16]. However, many current approaches still use an RNN-based encoder-decoder architecture [17, 18]. In our work, we test off-the-shelf systems from both of these approaches on the FinChat evaluation task.

3. FinChat dataset

The FinChat corpus provides a set of natural, diverse, and engaging conversations on seven topics collected from voluntary participants. To promote natural conversations, we did not provide any scripts to the participants except for the [...]

[...] maintain anonymity. At the beginning of the session, participants were given a topic to discuss. They were also instructed to (1) not reveal any personal information, (2) ask one question at a time and wait for their partner's reply, (3) use conversational language, and (4) not use any abusive language. After chatting for 10-15 minutes, conversation partners were switched and a new topic was given. In a session, each participant had two or three conversations. After each conversation, participants [...]