
Empathetic BERT2BERT Conversational Model: Learning Arabic Language Generation with Little Data


Tarek Naous, Wissam Antoun, Reem A. Mahmoud, and Hazem Hajj
Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
{tnn11,wfa07,ram79,hh63}@aub.edu.lb

Abstract

Enabling empathetic behavior in Arabic dialogue agents is an important aspect of building human-like conversational models. While Arabic Natural Language Processing has seen significant advances in Natural Language Understanding (NLU) with language models such as AraBERT, Natural Language Generation (NLG) remains a challenge. The shortcomings of NLG encoder-decoder models are primarily due to the lack of Arabic datasets suitable to train NLG models such as conversational agents. To overcome this issue, we propose a transformer-based encoder-decoder initialized with AraBERT parameters. By initializing the weights of the encoder and decoder with AraBERT pre-trained weights, our model was able to leverage knowledge transfer and boost performance in response generation. To enable empathy in our conversational model, we train it using the ArabicEmpatheticDialogues dataset and achieve high performance in empathetic response generation. Specifically, our model achieved a low perplexity value of 17.0 and an increase of 5 BLEU points compared to the previous state-of-the-art model. Also, our proposed model was rated highly by 85 human evaluators, validating its high capability in exhibiting empathy while generating relevant and fluent responses in open-domain settings.

[Figure 1: Example of empathetic behavior in a conversational agent. The user, feeling proud, says "I have just been appointed as a researcher at Google!"; the empathetic agent replies "Congratulations! You truly deserve this job!" and asks "Why did they appoint you?"]

1 Introduction

Conversational models with empathetic responding capabilities are crucial in making human-machine interactions closer to human-human interactions, as they can lead to increased engagement, more trust, and reduced frustration (Yalçın and DiPaola, 2018). These characteristics are highly desirable in open-domain conversational models, as they can boost user satisfaction and make chatbots look less boorish. While empathy can be attributed to a range of behaviors, it can be generally described as the innate human capacity of relating to another person's feelings and making sense of their emotional state (Yalçın, 2020). An important factor towards developing human-like dialogue agents is enabling their empathetic capability (Huang et al., 2020). To this end, there has been significant interest in developing empathetic conversational models (Majumder et al., 2020; Sharma et al., 2020; Ma et al., 2020; Yalçın and DiPaola, 2019). These models infer the emotions of a human user and provide a suitable empathetic response. The desired behavior of an empathetic conversational agent is illustrated in Figure 1, where the empathetic agent recognizes that the user is feeling proud and, thus, generates an empathetic response that congratulates the user with enthusiasm.

Recent work on open-domain empathetic conversational models has adopted neural-based sequence generation approaches (Rashkin et al., 2019). These approaches are based on encoder-decoder neural network architectures such as Sequence-to-Sequence (Seq2Seq) recurrent neural network models (Shin et al., 2020) or transformers (Lin et al., 2020). Despite the significant work done in this direction, the focus so far has been mostly on the English language, with fewer efforts directed towards low-resource languages such as Arabic.

Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 164–172, Kyiv, Ukraine (Virtual), April 19, 2021.

The first dataset of Arabic utterances and empathetic responses was recently introduced by Naous et al. (2020), where a Bidirectional Long Short-Term Memory (Bi-LSTM) Seq2Seq model was trained on the dataset. However, the model proposed by Naous et al. (2020) delivered sub-optimal performance due to the limited size of the dataset. Additional challenges in developing neural-based empathetic conversational models for Arabic are the lack of open-domain conversational data that can be used for pre-training (Li et al., 2017) and, consequently, the unavailability of pre-trained conversational models that can be used directly for fine-tuning (Zhang et al., 2020b).

To address the challenges of small dataset size and lack of conversational resources, in terms of both datasets and pre-trained models, we propose a transformer-based encoder-decoder model initialized with AraBERT (Antoun et al., 2020) pre-trained weights. Our work extends the English BERT2BERT architecture (Rothe et al., 2020) to Arabic response generation. We fine-tune our proposed model on the limited-sized dataset of empathetic responses in Arabic (Naous et al., 2020). By using the pre-trained weights of the AraBERT language model to initialize the encoder and decoder, our proposed BERT2BERT model is expected to leverage knowledge transfer and show enhanced performance in empathetic response generation compared to the baseline Bi-LSTM model proposed by Naous et al. (2020).

The rest of this paper is organized as follows: Section 2 reviews the recent literature on empathetic conversational models in both English and Arabic. Our proposed BERT2BERT approach for empathetic response generation is presented in Section 3, including the dataset and pre-processing steps. Section 4 analyzes the performance of our model and compares its results to several benchmark models. Concluding remarks and future directions are presented in Section 5.

2 Related Work

2.1 English Empathetic Conversational Models

Interest in enabling empathy in conversational agents has increased over the last few years with the introduction of the EmpatheticDialogues dataset by Rashkin et al. (2019). EmpatheticDialogues is a crowdsourced dataset of open-domain conversations where a group of workers was instructed to select an emotion, describe a situation in which they have felt that way, and carry out a conversation related to the emotion. The authors used the collected conversations to train retrieval-based and generative models, which showed higher levels of empathy in their responses compared with models trained on spontaneous conversational data gathered from the Internet.

The release of EmpatheticDialogues (Rashkin et al., 2019) stimulated further research in this area, with multiple attempts in the literature to improve the empathetic capability of conversational models. Shin et al. (2020) formulated the empathetic responding task as a reinforcement learning problem. Their approach, named "Sentiment look-ahead", employs a Seq2Seq policy model with Gated Recurrent Units to generate an empathetic response based on an input utterance, and updates the policy using the REINFORCE method. Lin et al. (2020) fine-tuned a GPT model on the EmpatheticDialogues dataset. The GPT model was pre-trained on the BooksCorpus dataset (Zhu et al., 2015), improving its NLU capability, as well as on the PersonaChat dataset (Zhang et al., 2018), giving it improved performance on response generation.

2.2 Arabic Empathetic Conversational Models

While many works have focused on enabling empathetic capabilities in conversational models for English, there are far fewer attempts to build similar models for Arabic. In general, research on Arabic conversational models is still in its infancy, mainly due to the complexity of the language and the lack of the resources and pre-trained models that are available in abundance for English. Despite the availability of Arabic pre-trained language models such as hULMonA (ElJundi et al., 2019) and AraBERT (Antoun et al., 2020), which have proven useful for Arabic NLU tasks, the lack of pre-trained models for Arabic NLG makes the development of neural-based Arabic conversational models a challenging task. Hence, existing works on Arabic chatbots have mainly focused on retrieval-based methods (Ali and Habash, 2016) or rule-based approaches (Hijjawi et al., 2014; Fadhil and AbuRa'ed, 2019). While these approaches work well on task-oriented objectives, they are limited by the size of the manually crafted rules they follow or the richness of the data they can retrieve responses from.

[Figure 2 diagram: a stack of encoder blocks ("AraBERT-initialized Encoder") over input tokens x_1 ... x_{n_x}, and a stack of decoder blocks ("AraBERT-initialized Decoder") producing output tokens y_1 ... y_{n_y}.]
Figure 2: Architecture of the proposed BERT2BERT model initialized with AraBERT checkpoints for Arabic empathetic response generation.

This makes it difficult for such models to operate well in open-domain conversational settings, where generative neural-based models would be more suitable.

Recently, the first empathy-driven Arabic conversational model was proposed by Naous et al. (2020), who released ArabicEmpatheticDialogues, a dataset of Arabic utterances and their corresponding empathetic responses. The authors trained a Seq2Seq model with bidirectional LSTM units on the dataset. While the model succeeded in generating empathetic responses, it showed an average Relevance score, which indicates that the responses can sometimes go off-topic and may not suit the emotional context of the input utterance. The limitations of this work were mainly due to the limited size of the dataset.

In this work, we adopt the BERT2BERT architecture (Rothe et al., 2020) and leverage the pre-trained AraBERT (Antoun et al., 2020) model to improve the performance of empathetic Arabic conversational models.

3 Proposed Method

3.1 Proposed BERT2BERT Model

Our proposed model for Arabic empathetic response generation is a transformer-based Seq2Seq model (Vaswani et al., 2017), an architecture that has been shown to boost performance on several Seq2Seq tasks (Raffel et al., 2020; Lewis et al., 2019). However, such an architecture would require massive pre-training before being fine-tuned on the desired task (Zhang et al., 2020a). It was shown by Rothe et al. (2020) that warm-starting a transformer-based encoder-decoder model with the checkpoints of a pre-trained encoder (e.g. BERT) allows the model to deliver competitive results on sequence generation tasks while skipping the costly pre-training. Inspired by this idea, and given the unavailability of Arabic conversational datasets that could be used for pre-training, we adopt the BERT2BERT architecture (Rothe et al., 2020) and warm-start the encoder and decoder with the AraBERT checkpoint (Antoun et al., 2020). The encoder-decoder attention is randomly initialized. The architecture of the proposed model is illustrated in Figure 2.

The input to the proposed model is a sequence x = [x_1, x_2, ..., x_{n_x}] of one-hot representations, with the maximum input length n_x chosen to be 150 tokens. This sequence is fed to the AraBERT-initialized encoder. At the decoder side, the model generates an empathetic response represented by a sequence y = [y_1, y_2, ..., y_{n_y}], where the maximum output length n_y is also set to 150. We optimize the log-likelihood loss over the output tokens.

To generate empathetic responses from our model, we adopt the Top-K sampling scheme (Fan et al., 2018) where, at each time step, the model randomly samples from the K most likely candidates in the probability distribution over the vocabulary. This decoding strategy has been

found more effective than conventional approaches such as beam search, which tends to yield common responses found repetitively in the training set, or similar, slightly-varying versions of the same high-likelihood sequences (Ippolito et al., 2019).

3.2 Dataset

We use the ArabicEmpatheticDialogues dataset (Naous et al., 2020), which was translated from the English version introduced by Rashkin et al. (2019). ArabicEmpatheticDialogues contains 36,628 samples of speaker utterances and their corresponding empathetic responses in Arabic. Each sample is also labeled with the emotion of the speaker's utterance. Three examples from the dataset for three different emotion labels are provided in Table 1.

[Table 1: Samples of utterances and empathetic responses from the ArabicEmpatheticDialogues dataset for three emotion labels: Excited, Furious, and Embarrassed. The Arabic sample text is not reproduced here.]
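The Top-K sampling scheme described in Section 3.1 can be sketched as a self-contained function. This is an illustrative implementation of the general scheme (the function name and the toy logits are ours, not code from the paper):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`.

    The k most likely candidates are kept, their scores are renormalized
    with a softmax, and one candidate is drawn at random.
    """
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With k = 1 this reduces to greedy decoding:
print(top_k_sample([0.1, 3.2, -1.0, 0.5], k=1))  # 1
```

At generation time, `logits` would be the model's vocabulary distribution at the current time step, and the sampled index would be appended to the output sequence before the next step.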
By training a sequence generation model on the samples of utterances and their corresponding responses from the dataset, the model will be able to infer the emotions in input utterances and provide suitable empathetic responses. Thus, the empathetic capability of the model would be enhanced.

The dataset is originally labeled with 32 emotion labels, many of which are very similar, such as "joyful" and "content", or "angry" and "furious". To reduce the number of classes, we follow the tree-structured list of emotions defined by Parrott (2001) to map the 32 emotion labels to their 6 primary emotions, which are "Joy", "Love", "Surprise", "Sadness", "Anger", and "Fear". This grouping is shown in Table 2.

Table 2: Grouping of emotion labels in the ArabicEmpatheticDialogues dataset as per Parrott's characterization of tree-structured emotions (Parrott, 2001).

Joy: Excited, Proud, Grateful, Hopeful, Confident, Joyful, Content, Prepared, Anticipating
Love: Caring, Sentimental, Trusting, Faithful, Nostalgic
Surprise: Surprised, Impressed
Sadness: Sad, Lonely, Guilty, Disappointed, Devastated, Embarrassed, Ashamed
Anger: Angry, Annoyed, Furious, Disgusted, Jealous
Fear: Afraid, Terrified, Anxious, Apprehensive

To reduce lexical sparsity, the utterances and responses in the dataset are segmented using the Farasa segmenter (Abdelali et al., 2016).

Given the morphological complexity of the Arabic language, segmentation is an important pre-processing step that can greatly enhance the performance of neural-based sequence generation models. An example of this process is shown in Table 3. By performing segmentation, the vocabulary size is drastically reduced from 47K tokens to around 13K tokens.

[Table 3: Example of an Arabic utterance segmented with Farasa, showing the pre-segmentation and post-segmentation forms. The Arabic text is not reproduced here.]

Table 4: Performance of the models on the test set in terms of PPL and BLEU score.

Model                          | PPL   | BLEU
Baseline (Naous et al., 2020)  | 38.6  | 0.5
EmoPrepend                     | 24.1  | 3.16
BERT2BERT-UN                   | 159.8 | 0.1
BERT2BERT                      | 17.0  | 5.58
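For reference, the perplexity (PPL) reported in Table 4 is conventionally the exponential of the average per-token negative log-likelihood on the test set. A minimal sketch of that relationship, using made-up loss values rather than the paper's numbers:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every test token probability 1/17 scores PPL 17:
nlls = [-math.log(1 / 17.0)] * 10
print(round(perplexity(nlls), 1))  # 17.0
```

Lower perplexity means the model assigns higher probability to the reference responses, which is why a PPL of 17.0 improves on the Baseline's 38.6.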

4 Experiments & Results

We evaluate the proposed BERT2BERT model in comparison to three benchmark models. We conduct numerical as well as human evaluation of the different conversational models.

4.1 Benchmark Models

We train several neural-based sequence generation models on the ArabicEmpatheticDialogues dataset and consider them as benchmarks for performance comparison. The benchmark models are denoted as follows:

Baseline: The baseline model, illustrated in Figure 3, is a Seq2Seq Bi-LSTM model with Attention, following the prior state-of-the-art model proposed by Naous et al. (2020).

EmoPrepend: In this setup, illustrated in Figure 3, we prepend the emotion label to each utterance before feeding it as input to the baseline model described above, and we denote this approach as EmoPrepend. This allows us to add supervised information to the data without having to introduce any modifications to the architecture. The existing emotion labels are prepended to the utterances in the train and validation sets. For the test set and at inference, we fine-tune AraBERT for emotion classification using the utterances and their labels in the dataset. The fine-tuned AraBERT model is then used as an external predictor to classify the emotion in the utterance and prepend it as a token before the utterance is used as input to the EmoPrepend model. We note that the step of grouping the emotion labels into 6 main labels, as discussed in Section 3, makes the emotion classification task easier.

BERT2BERT-UN: This stands for BERT2BERT-Uninitialized. This model is a regular transformer-based encoder-decoder model that shares the same architecture as the BERT2BERT model shown in Figure 2, but is not initialized with AraBERT pre-trained weights.

Figure 3: Architectures of the Baseline and EmoPrepend models used for comparative evaluation against the proposed BERT2BERT model.

4.2 Experimental Setup

The proposed BERT2BERT model was developed using the Huggingface transformers library [1]. We train the model for 5 epochs with a batch size of 32 [2]. Model training was done on a 16GB V100 NVidia GPU. The Baseline Bi-LSTM Seq2Seq (Naous et al., 2020), EmoPrepend, and BERT2BERT-UN benchmark models were developed using the OpenNMT library (Klein et al., 2017).

Dataset Partitioning: All models were trained and evaluated on common data splits of ArabicEmpatheticDialogues. We randomly partitioned the dataset into 90% training, 5% validation, and 5% testing using a seed of 42.

4.3 Numerical Evaluation

Table 4 summarizes the perplexity (PPL) and Bilingual Evaluation Understudy (BLEU) scores for the proposed and benchmark models when evaluated on the test set. It is clear from the numerical evaluation results that the proposed BERT2BERT model consistently outperforms the benchmark models. This is reflected through both a lower PPL score and a higher BLEU score.

[1] https://github.com/huggingface/transformers
[2] https://github.com/aub-mind/Arabic-Empathetic-Chatbot
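The warm-starting described in Sections 3.1 and 4.2 maps onto the `EncoderDecoderModel` API of the Huggingface transformers library. The sketch below wires together a small randomly initialized BERT-style encoder and decoder purely for illustration (the tiny config sizes are ours); the paper's actual setup would instead load a pre-trained AraBERT checkpoint via `from_encoder_decoder_pretrained`, which warm-starts the encoder and decoder weights while leaving the cross-attention randomly initialized:

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Tiny illustrative configs (sizes are ours, far smaller than AraBERT's).
enc_cfg = BertConfig(vocab_size=128, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64)
dec_cfg = BertConfig(vocab_size=128, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64,
                     is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)  # BERT2BERT-UN-style random init

# Warm-started variant (downloads a checkpoint; the hub name is an assumption):
# model = EncoderDecoderModel.from_encoder_decoder_pretrained(
#     "aubmindlab/bert-base-arabert", "aubmindlab/bert-base-arabert")

print(model.config.is_encoder_decoder)  # True
```

A model built this way is trained with the usual seq2seq cross-entropy (log-likelihood) loss over the response tokens, matching the objective described in Section 3.1.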

Table 5: Average evaluation of the collected human ratings.

Model                          | Empathy | Relevance | Fluency
Baseline (Naous et al., 2020)  | 2.04    | 1.68      | 2.44
EmoPrepend                     | 2.81    | 2.18      | 3.28
BERT2BERT                      | 4.0     | 3.59      | 4.30
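The averages in Table 5 are plain means of the per-rater scores on the 0-to-5 scale described in Section 4.4. As a trivial worked example with made-up ratings (not the survey data):

```python
def average_rating(scores):
    """Mean of per-rater scores on the 0-5 survey scale."""
    return sum(scores) / len(scores)

# Hypothetical scores from five raters for one model/criterion pair:
print(average_rating([4, 5, 3, 4, 4]))  # 4.0
```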

[Table 6: Examples of responses generated by the BERT2BERT model for multiple utterances with various emotional states and domain contexts (the emotions shown include Sadness, Joy, Fear, Surprise, and Anger). The Arabic utterances and responses are not reproduced here.]

With EmoPrepend, the addition of supervised information in the form of prepended emotion labels showed performance improvements in comparison to the Baseline model, reflected by an increase of 2.6 BLEU points and a reduction of 14.5 points in the PPL score. Nevertheless, the PPL score of EmoPrepend, at 24.1, is still considered high and could potentially lead to sub-optimal performance. BERT2BERT showed significant performance improvements in comparison to the baseline Seq2Seq Bi-LSTM, highlighted by a much reduced PPL value of 17.0 and an increase of 5 BLEU points. BERT2BERT also achieved better scores than the EmoPrepend model.

The BERT2BERT-UN model resulted in a very high PPL score of 159.8 and a very low BLEU score of 0.1. These poor results are due to the nature of transformer networks, which require huge amounts of data samples to deliver good performance. Initializing BERT2BERT with pre-trained AraBERT weights showed very significant enhancements compared with the uninitialized BERT2BERT-UN model. This performance boost is expected, given that AraBERT's initialization parameters were pre-trained on a massive 24 GB Arabic corpus.

The numerical results achieved by the BERT2BERT model are particularly impressive since, despite the limited size of the ArabicEmpatheticDialogues dataset, BERT2BERT was able to leverage knowledge transfer through fine-tuning to achieve state-of-the-art performance on the task of open-domain empathetic response generation in Arabic, without requiring additional empathetic samples to train on or pre-training on conversational data.

4.4 Human Evaluation

Automated metrics such as PPL and BLEU are not sufficient on their own to evaluate a model's ability to exhibit empathetic behavior. Given the unavailability of specific metrics to evaluate empathy in a conversational model, we resort to evaluation based on the judgment of human subjects. Through human evaluation, we can assess the emotional communication capability of the models, which is their ability to recognize the emotion in the input utterance and generate a suitable expression of emotion in the corresponding response (Yalçın, 2019). To this end, we conducted a survey to collect ratings from 85 native Arabic speakers.

[Table 7: Examples of responses generated by the BERT2BERT model for multiple utterances with neutral emotions. The Arabic utterances and responses are not reproduced here.]

The raters were shown various utterances and their corresponding responses generated by the Baseline, EmoPrepend, and BERT2BERT models. The BERT2BERT-UN model was excluded from the survey given its poor results on the numerical metrics. The raters were asked to evaluate each model's ability to show Empathy, Relevance, and Fluency in the generated response, by answering the following questions:

• Empathy: Does the generated response show an ability to infer the emotions in the given utterance?

• Relevance: How relevant is the generated response to the input utterance?

• Fluency: How understandable is the generated response? Is it linguistically correct?

For each question, the raters were asked to score the responses of the models on a scale of 0 to 5, where 0 reflects extremely poor performance and 5 reflects excellent performance.

The results of the survey are summarized in Table 5, where we report the average of the collected ratings. The EmoPrepend model showed higher average Empathy and Relevance scores than the Baseline model. However, these scores are below 3, meaning the EmoPrepend model was seen to deliver below-average performance. On the other hand, the average ratings of the BERT2BERT model can be considered high and are much superior to those of both the Baseline and EmoPrepend models, which indicates BERT2BERT's ability to deliver highly empathetic responses while abiding by linguistic correctness. This is reflected in the examples of responses generated by BERT2BERT shown in Table 6. The responses demonstrate the model's ability to produce empathetic, relevant, and fluent responses when prompted with input utterances covering various emotional states and domain contexts, which also proves its ability to handle open-domain conversations.

4.5 Performance on Inputs with Neutral Emotional States

Despite the promising results achieved by the BERT2BERT model in generating relevant empathetic responses in open-domain settings, it was shown to handle poorly regular chit-chat utterances with neutral emotions, such as "Hey, how are you?" or "What are you doing?". Instead of providing a regular response, the BERT2BERT model will opt to generate an empathetic response, as we show in Table 7. This issue can be explained by the fact that the model was fine-tuned on a dataset comprised of utterances with purely emotional context and corresponding empathetic responses. Moreover, the AraBERT-initialized parameters did not help mitigate this issue, since AraBERT is pre-trained in a self-supervised fashion on news articles and later fine-tuned on a task-specific dataset that does not contain regular chit-chat samples. Thus, it is clear why the BERT2BERT model is not able to handle neutral chit-chat conversations, as they fall outside the scope of the training data and the task at hand.

5 Conclusion

In this paper, we address the limitation in resources for Arabic conversational systems, in particular empathetic conversations. Unlike English, which has seen great advancements in language generation models thanks to large corpora and million-parameter pre-trained models like GPT, Arabic is considered a low-resource language with limited availability of conversational datasets and pre-trained models for response generation.

We propose an empathetic BERT2BERT, a transformer-based model of which the encoder and decoder are warm-started using AraBERT pre-trained parameters and fine-tuned for Arabic empathetic response generation using the ArabicEmpatheticDialogues dataset. By adopting this transfer learning strategy, the proposed BERT2BERT model was able to address the challenges of building an open-domain neural-based empathetic conversational model for a low-resource language such as Arabic. BERT2BERT achieved significant performance improvements in comparison to three benchmark models: a baseline Seq2Seq Bi-LSTM model, a Seq2Seq Bi-LSTM model with supervised information about the emotion label prepended during training, and a transformer-based encoder-decoder that is not initialized with pre-trained weights.

The proposed BERT2BERT model achieved a low PPL value of 17.0 and a BLEU score of 5.58, and was rated highly by human evaluators with a score of 4.3/5.0, reflecting its ability to generate empathetic, relevant, and fluent responses. Hence, our results show that high-performing conversational models can be developed in low-resource settings by adopting the BERT2BERT strategy.

Despite its high performance in empathetic response generation, BERT2BERT showed a limitation in its ability to handle regular chit-chat conversations with neutral emotional states. To this end, future directions include the development of a strategy that improves the model's ability to determine when an empathetic response is suitable and when it is not.

Acknowledgments

This work has been funded by the University Research Board (URB) at the American University of Beirut (AUB).

References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16.

Dana Abu Ali and Nizar Habash. 2016. Botta: An Arabic dialect chatbot. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 208–212.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.

Obeida ElJundi, Wissam Antoun, Nour El Droubi, Hazem Hajj, Wassim El-Hajj, and Khaled Shaban. 2019. hULMonA: The universal language model in Arabic. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 68–77.

Ahmed Fadhil and Ahmed AbuRa'ed. 2019. OlloBot - towards a text-based Arabic health conversational agent: Evaluation and results. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 295–303.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Mohammad Hijjawi, Zuhair Bandar, Keeley Crockett, and David Mclean. 2014. ArabChat: An Arabic conversational agent. In 2014 6th International Conference on Computer Science and Information Technology (CSIT), pages 227–237. IEEE.

Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in building intelligent open-domain dialog systems. ACM Transactions on Information Systems (TOIS), 38(3):1–32.

Daphne Ippolito, Reno Kriz, Joao Sedoc, Maria Kustikova, and Chris Callison-Burch. 2019. Comparison of diverse decoding methods from conditional language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3752–3762.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995.

Zhaojiang Lin, Peng Xu, Genta Indra Winata, Farhad Bin Siddique, Zihan Liu, Jamin Shin, and Pascale Fung. 2020. CAiRE: An end-to-end empathetic chatbot. In AAAI, pages 13622–13623.

Zhiqiang Ma, Rui Yang, Baoxiang Du, and Yan Chen. 2020. A control unit for emotional conversation generation. IEEE Access, 8:43168–43176.

Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. 2020. MIME: Mimicking emotions for empathetic response generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8968–8979.

Tarek Naous, Christian Hokayem, and Hazem Hajj. 2020. Empathy-driven Arabic conversational chatbot. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 58–68.

W. Gerrod Parrott. 2001. Emotions in social psychology: Essential readings. Psychology Press.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.

Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. 2020. A computational approach to understanding empathy expressed in text-based mental health support. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5263–5276.

Jamin Shin, Peng Xu, Andrea Madotto, and Pascale Fung. 2020. Generating empathetic responses by looking ahead the user's sentiment. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7989–7993. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Özge Nilay Yalçın. 2019. Evaluating empathy in artificial agents. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–7. IEEE.

Özge Nilay Yalçın. 2020. Empathy framework for embodied conversational agents. Cognitive Systems Research, 59:123–132.

Özge Nilay Yalçın and Steve DiPaola. 2018. A computational model of empathy for interactive agents. Biologically Inspired Cognitive Architectures, 26:20–25.

Özge Nilay Yalçın and Steve DiPaola. 2019. M-Path: A conversational system for the empathic virtual agent. In Biologically Inspired Cognitive Architectures Meeting, pages 597–607. Springer.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B. Dolan. 2020b. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
