
Procesamiento del Lenguaje Natural, Revista nº 61, septiembre de 2018, pp. 83-89. ISSN 1135-5948. DOI 10.26342/2018-61-9. Received 04-04-2018, revised 29-04-2018, accepted 04-05-2018.

Bi-modal annoyance level detection from speech and text

Detección del nivel de enfado mediante un sistema bi-modal basado en habla y texto

Raquel Justo, Jon Irastorza, Saioa Pérez, M. Inés Torres
Universidad del País Vasco UPV/EHU. Sarriena s/n, 48940 Leioa, Spain
{raquel.justo, manes.torres}@ehu.eus

Abstract: The main goal of this work is the identification of emotional hints from speech. Machine learning researchers have analysed sets of acoustic parameters as potential cues for the identification of discrete emotional categories or, alternatively, of the dimensions of emotions. However, the semantic information gathered in the text message associated with each utterance can also provide valuable information that can be helpful for emotion detection. In this work this information is combined with the acoustic information, leading to better system performance. Moreover, it is worth noting the use of a corpus that includes spontaneous emotions gathered in a realistic environment. It is well known that emotion expression depends not only on cultural factors but also on the individual and on the specific situation. Thus, the conclusions drawn from the present work can be more easily extrapolated to a real system than those obtained from a classical corpus with simulated emotions.

Keywords: speech processing, semantic information, emotion detection in speech, annoyance tracking, machine learning

Resumen: El principal objetivo de este trabajo es la detección de cambios emocionales a partir del habla. Diferentes trabajos basados en aprendizaje automático han analizado conjuntos de parámetros acústicos como potenciales indicadores en la identificación de categorías emocionales discretas o en la identificación dimensional de las emociones. Sin embargo, la información semántica recogida en el mensaje textual asociado a cada intervención puede proporcionar información valiosa para la detección de emociones. En este trabajo se combina la información textual y acústica dando lugar a mejoras en el rendimiento del sistema. Es importante recalcar, por otra parte, el uso de un corpus que incluye emociones espontáneas recogidas en un entorno realista. Es bien sabido que la expresión de la emoción depende no solo de factores culturales sino también de factores individuales y de situaciones particulares. Por lo tanto, las conclusiones extraídas en este trabajo se pueden extrapolar más fácilmente a un sistema real que aquellas obtenidas a partir de un corpus clásico en el que se simula el estado emocional.

Palabras clave: procesamiento del habla, información semántica, reconocimiento emocional en el habla, rastreo del enfado, aprendizaje automático

1 Introduction

The detection of emotional status has been widely studied in the last decade within the machine learning framework. The goal of researchers is to be able to recognise emotional information from the analysis of voice, language, face, gestures or ECG (Devillers, Vidrascu, y Lamel, 2005). One of the most important challenges that need to be faced in this area is the need for supervised data, i.e. corpora including human data annotated with emotional labels (Devillers, Vidrascu, y Lamel, 2005) (Vidrascu y Devillers, 2005), and this is not a straightforward task due to the subjectivity of emotion perception by humans (Devillers, Vidrascu, y Lamel, 2005) (Eskimez et al., 2016). Many works considered corpora that consist of data from professional actors simulating the emotions to be analyzed. However, this usually leads to poor results due to many factors, such as the differences between the real situations the detection system has to deal with and the emotional status picked up in the corpus. Moreover, the selection of valuable data including spontaneous emotions depends on the goals of the involved research, and it is difficult to find an appropriate corpus that matches the specific goal of each task.

Focusing on emotion identification from speech and language, a wide range of potential applications and research objectives can be found (Valstar et al., 2014) (Wang et al., 2015) (Clavel y Callejas, 2016). Some examples are the early detection of Alzheimer's disease (Meilán et al., 2014) and the detection of valence onsets in medical emergency calls (Vidrascu y Devillers, 2005) or in Stock Exchange Customer Service Centres (Devillers, Vidrascu, y Lamel, 2005). Emotion recognition from speech signals relies on a number of short-term features such as pitch, additional excitation signals due to the non-linear air flow in the vocal tract, vocal tract features such as formants (Wang et al., 2015) (Ververidis y Kotropoulos, 2006), prosodic features (Ben-David et al., 2016) such as pitch, loudness, speaking rate, rhythm, voice quality and articulation (Vidrascu y Devillers, 2005) (Girard y Cohn, 2016), latency to speak, pauses (Justo et al., 2014) (Esposito et al., 2016), features derived from energy (Kim y Clements, 2015), as well as feature combinations. Regarding methodology, statistical analysis of feature distributions has traditionally been carried out, and classical classifiers such as Bayesian classifiers or SVMs have been proposed for the identification of emotional characteristics from speech signals. The modelling of continuous affective dimensions is also an emerging challenge when dealing with continuous ratings of emotion labelled during real interaction (Mencattini et al., 2016). In this approach, recurrent neural networks have been proposed to integrate contextual information and then predict emotion in continuous time, dealing with arousal and valence (Wollmer et al., 2008) (Ringeval et al., 2015).

When regarding text, there are numerous works dealing with sentiment analysis, whose application domains range from business and security to well-being, politics or software engineering (Cambria, 2016). However, there are few works considering the recognition of more specific emotional states (Medeiros y van der Wal, 2017; Gilbert y Karahalios, 2010; Marsden y Campbell, 2012). Moreover, it seems reasonable to think that the combination of acoustic and textual information might lead to improved system performance. However, although there are plenty of research articles on audio-visual emotion recognition, only a few research works have been carried out on multimodal emotion recognition using textual clues together with the visual and audio modalities (Eyben et al., 2010; Poria et al., 2016).

In this work we deal with a problem proposed by a Spanish company providing customer assistance services through the telephone (Justo et al., 2014). They want to automatically detect annoyance rates during customer calls for further analysis, which is a novel and challenging goal. Their motivation is to verify whether the policies applied by operators to deal with annoyed and angry customers lead to shifts in customer behaviour. Thus, an automatic procedure to detect those shifts will allow the company to evaluate their policies through the analysis of the recorded audios. Moreover, they were interested in providing this information to the operators during the conversation. As a consequence, this work is aimed at detecting different levels of annoyance during real phone calls to Spanish complaint services. Mainly, we wanted to analyse the effect of including textual information into an annoyance detection system based on acoustic signals.

The paper is organised as follows: Sec. 2 describes the previous work carried out to solve the presented problem with the specific dataset we are dealing with. In Sec. 3 the annotation procedure in terms of speech signal and text is described, and Sec. 4 details the experiments carried out and the obtained results. Finally, Sec. 5 summarises the concluding remarks and future work.

2 Dataset and previous work

The Spanish call center company involved in this work offers customer assistance for several phone, TV and internet service providers. The customer complaint services of these companies receive a certain number of phone calls from angry or annoyed customers, but the way of expressing annoyance is not the same for all the customers. Some of them are furious and shout; others speak quickly with frequent and very short micro-pauses but do not shout (Justo et al., 2014); others seem to be more fed-up than angry; others feel impotent after a number of service failures and calls to the customer service. The dataset for this study consisted of seven conversations between customers and the call-centre operators, identified and selected by experienced operators. All the selected customers were very angry with the service provider because of unsolved and repeated service failures that caused serious trouble to them. In a second step, each recording was named according to the particular way the customer expresses his or her annoyance degree. Thus, call-center operators qualified the seven subjects in the conversations as follows: Disappointed, Angry (2 records), Extremely angry, Fed-up, Impotent and Annoyed in disagreement. All these labels correspond to the different ways the customers in the study expressed their annoyance with the service provided; more specifically, they correspond to the way the human operators perceived customer feelings. The duration of the conversations was 42s, 42s, 35s, 16m20s, 1m08s, 1m02s and 1m35s respectively, resulting in a total of 22.1 minutes.

In a previous work (Irastorza y Torres, 2016) the different records were manually annotated. Two members of the research group acted as expert annotators. They first identified customer speech segments, agent speech segments and overlapping segments. Only intelligible customer speech segments were considered for the experiments. In a second step, annotators were asked to identify the changes in the degree of perceived emotion in each recording, using zero for neutral or very low, one for medium and two for a high degree. They were asked to mark the time steps where they perceived a change in the degree of expression and then label each segment with the corresponding perceived level. The annotator agreement was high in the identification of the time steps where they perceived changes in the degree of expression, so just one of the two annotations was chosen to fix segment bounds. However, the procedure resulted in a significant level of disagreement regarding the label given to each segment. Thus, the most frequent disagreements were considered as new levels in the proposed scale of annoyance expression. The resulting set of categories consists of five degrees defined as follows: very low, agreed by annotators; low, which corresponds to a low-medium disagreement; medium, agreed by annotators; high, which corresponds to a medium-high disagreement; and very high, agreed by annotators. Less frequent disagreements were not considered. The right side of Table 1 (SPEECH-BASED) shows the final number of segments identified for each audio file and annoyance level.
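The merging of the two annotators' 0/1/2 ratings into this five-level scale is deterministic, so it can be made explicit with a short sketch (ours, not the authors' code; the function and dictionary names are illustrative):

```python
# Illustrative sketch (not the authors' code): merging the two annotators'
# 0/1/2 ratings into the five-level annoyance scale described above.
# Strong disagreements (0/2 or 2/0) are discarded, as in the paper.

FIVE_LEVEL = {
    (0, 0): "very low",   # agreement on very low / neutral
    (0, 1): "low",        # low-medium disagreement
    (1, 0): "low",
    (1, 1): "medium",     # agreement on medium
    (1, 2): "high",       # medium-high disagreement
    (2, 1): "high",
    (2, 2): "very high",  # agreement on high
}

def merge_labels(a1: int, a2: int):
    """Return the merged five-level label, or None for a strong disagreement."""
    return FIVE_LEVEL.get((a1, a2))

# Example: one annotator rates a segment as 1 (medium), the other as 2 (high).
assert merge_labels(1, 2) == "high"
assert merge_labels(0, 2) is None  # 0/2 segments are left out
```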
An automatic classification was carried out in (Irastorza y Torres, 2016) using acoustic parameters extracted from the audio files. The acoustic signal was divided into 20 ms overlapping windows (frames), from which a set of features was extracted, and the classification procedure was carried out over those frames. A combination of intensity and intensity-based suprasegmental features along with LPC coefficients achieved the highest frame classification accuracies for all the expressions of annoyance analysed. The obtained results validated the annotation procedure and also showed that shifts in customer annoyance rates could potentially be tracked during phone calls.
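As a rough illustration of this kind of frame-level feature extraction (a sketch under our assumptions: the exact window shift, LPC order and intensity-based features of (Irastorza y Torres, 2016) are not reproduced here), LPC coefficients can be computed over 20 ms windows with librosa:

```python
# Sketch only: frame-level LPC features over 20 ms overlapping windows.
# Window shift, LPC order and the intensity proxy are our assumptions,
# not the exact configuration used in the previous work.
import numpy as np
import librosa

def lpc_frame_features(wav_path, sr=16000, win_ms=20, hop_ms=10, order=12):
    y, sr = librosa.load(wav_path, sr=sr)
    frame_length = int(sr * win_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop_length = int(sr * hop_ms / 1000)     # overlapping windows (50% here)
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    feats = []
    for frame in frames.T:                   # one feature vector per frame
        if not np.any(frame):                # skip silent frames (LPC undefined)
            continue
        lpc = librosa.lpc(frame, order=order)[1:]   # drop the leading 1.0
        energy = np.log(np.sum(frame ** 2))         # crude intensity proxy
        feats.append(np.concatenate([lpc, [energy]]))
    return np.vstack(feats)                  # shape: (n_frames_kept, order + 1)
```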
3 Text annotation process

The transcription of the utterances provides additional information related to the semantics, the language style and other aspects that cannot be found in the acoustics and that might help in the detection of annoyance or other emotional categories. Thus, in this work the text associated with the utterances was also considered and annotated, so that it could be included in the classification process.

3.1 Transcription

The audio files described in the previous section were transcribed in order to use text as an additional information source. The transcriptions were carried out using the Praat software tool (Boersma y Weenink, 2016). The segments obtained from the aforementioned annotation, in terms of time steps, were also employed here, and only the intelligible customer speech segments were transcribed.

In this case, given that there is no ambiguity in the transcription task, only one member of the research group listened to all the audios, segment by segment, and provided the corresponding transcriptions.
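The paper does not state how the transcribed segments are stored. Purely for illustration, assuming each segment is exported as one tab-separated row (start time, end time, speech-based label, transcription), the material could be loaded as follows:

```python
# Illustrative only: the storage format below is our assumption, not the paper's.
# Each line: start_time \t end_time \t annoyance_level \t transcription
import csv
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    level: str        # five-level speech-based label, e.g. "high"
    text: str         # manual transcription of the customer's words

def load_segments(tsv_path):
    segments = []
    with open(tsv_path, encoding="utf-8") as f:
        for start, end, level, text in csv.reader(f, delimiter="\t"):
            segments.append(Segment(float(start), float(end), level, text))
    return segments
```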

3.2 Annotation

Once the transcription was obtained, the segments were annotated with an emotional label extracted from the text. In order to obtain a label that only considers textual information, the labelling was carried out by annotators who were not involved in the transcription process. In this way, the annotators had not previously listened to the audio and focused only on the text. Two members of the research group carried out the annotation independently. The same levels of anger employed in the acoustic annotation were also considered here: zero for very low, one for medium and neutral, and two for a high degree of anger. Given the ambiguity associated with this annotation procedure, a method was also needed to deal with disagreement between the two annotators. Thus, the set of categories consisting of five degrees defined above (very low, agreed by annotators; low, corresponding to a low-medium disagreement; medium, agreed by annotators; high, corresponding to a medium-high disagreement; and very high, agreed by annotators) was also employed here.

An analysis of the labeled segments showed that one of the two annotators always used a higher level of anger than the other one. Therefore, one/zero or two/one label pairs appeared more frequently than zero/one and one/two pairs. It is noteworthy that segments labeled as zero/two and two/zero (strong disagreement) were very infrequent. The left side of Table 1 (TEXT-BASED) shows the number of segments for each audio file.

3.3 Analyzing the annoyance

People's behaviour can be very different with regard to the way in which they express annoyance. Some people vary the pitch or intensity of their utterances very easily when they are upset, while others keep these variables unaltered and emphasize the meaning of their message. Thus, both issues have to be considered to identify annoyance rates. For instance, in the audio Very angry there is a speech segment where the speaker says: "sorry, sorry but you are the impolite". Regarding the acoustic annotation, it was labeled with a low annoyance rate due to an unaltered acoustic signal, whereas the text annotation indicated a high annoyance rate, because its meaning clearly denotes that the user is upset.

Table 1 shows that significant differences can be observed between text-based and speech-based annotations. Mainly, it seems that higher anger rates are associated with the speech-based annotations: there are 57 very high segments annotated in speech versus 13 in text, while at the other end there are 5 very low segments annotated in speech versus 78 in text. An example of this behaviour is a segment of the audio "Angry 2" where the speaker says: "all things you will tell me, they have already told me". This fragment was annotated as high in the speech annotation and low in the textual annotation. It seems, according to the obtained results, that annoyed people tend to vary the features of their utterances (pitch, intensity, etc.) more easily while keeping the meaning of their messages unaltered. It might be because changes in the acoustic signal occur in a spontaneous and involuntary way, which does not happen with the content of the message. Thus, only when the annoyance is kept over a longer period of time do annoyance signals appear in the text messages. The combination of speech annotator information and text annotator information provides the chance to complement the information provided by the isolated sources, leading to nuanced results in cases of discrepancy.

A bi-modal annoyance recognizer, combining acoustics and text, was developed in this work, and the results provided by its evaluation are given in Section 4.
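The text-versus-speech comparison summarised in Table 1 and discussed above can be reproduced, given per-segment label pairs, with a simple cross-tabulation; the sketch below is ours and only illustrates the bookkeeping:

```python
# Sketch (not from the paper): tabulating how the speech-based and text-based
# five-level labels of the same segments relate to each other.
from collections import Counter

def label_crosstab(segments):
    """segments: iterable of (speech_label, text_label) pairs."""
    counts = Counter(segments)
    levels = ["very low", "low", "medium", "high", "very high"]
    header = "speech \\ text".ljust(14) + "".join(l.rjust(11) for l in levels)
    rows = [header]
    for s in levels:
        rows.append(s.ljust(14) + "".join(str(counts[(s, t)]).rjust(11) for t in levels))
    return "\n".join(rows)

# Example with two toy segments:
print(label_crosstab([("very high", "low"), ("high", "medium")]))
```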
4 Experimental Results

The experiments carried out aimed at analysing the validity of the assumptions made in the annotation procedure and also the performance obtained when combining acoustic features and textual information. We used the Naive Bayes (NB) classifier and the Support Vector Machine (SVM), which had already proven their efficiency in classifying emotional hints (Irastorza y Torres, 2016). To this end, we shuffled the set of frames of each audio file and then split this set into a training and a test set that included 70% and 30% of the frames, respectively. We used the frame classification accuracy as the evaluation metric.
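A minimal sketch of this 70/30 frame-level protocol with scikit-learn is given below; the classifier hyper-parameters are our assumptions, since the paper does not report them:

```python
# Sketch of the 70/30 frame-level protocol with scikit-learn.
# Hyper-parameters (SVM kernel, C, etc.) are assumptions, not the paper's settings.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate(X, y, seed=0):
    """X: (n_frames, n_features) frame features; y: per-frame annoyance labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=seed)
    results = {}
    for name, clf in [("SVM", SVC(kernel="rbf", C=1.0)), ("NB", GaussianNB())]:
        clf.fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, clf.predict(X_te))
    return results   # frame classification accuracy per model
```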

                                        TEXT-BASED                           SPEECH-BASED
Category (activation level)   v low   low  medium  high  v high    v low   low  medium  high  v high   Total
Disappointed                      1     1       5     2       0        0     6       1     2       0       9
Angry 1                           0     4       0     1       1        0     0       4     2       0       6
Angry 2                           0     1       2     4       1        1     2       1     4       0       8
Very Angry                       65    61      53    35       9        0    95      26    45      57     223
Fed-up                            6     2       4     1       2        0     3      11     1       0      15
Impotent                          6     3       0     0       0        4     4       1     0       0       9
All                              78    72      64    43      13        5   110      44    54      57     270

Table 1: Number of segments for each audio file and annoyance level, according to the text-based (left) and speech-based (right) annotations.

Two series of classification experiments were carried out in order to include the textual information in different ways. Firstly, we chose a combination of acoustic (Linear Prediction Coefficients, LPC) and textual (the labels provided in the text annotations) features; the labels obtained from the speech annotations were employed to train the classifiers. In a second stage, only the LPC acoustic features were considered, but in this case the labels used to train the classifiers were those obtained from the text annotation. We aimed at analysing the behaviour of the selected sets of features and annotation schemes when classifying frames into the categories that represent the customer annoyance degree in each particular call. Moreover, speaker-dependent and speaker-independent experiments were also considered. In speaker-dependent experiments, only the frames obtained from a specific speaker (an audio file) were involved in the classification process (training and test).
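How the textual label is included as a feature is not detailed in the paper. One simple reading, sketched below under that assumption, is that every frame inherits the five-level text-annotation label of the segment it belongs to, encoded as an extra component appended to its LPC vector:

```python
# Sketch under our assumption of how the textual label is added as a feature:
# each frame inherits the five-level text-annotation label of its segment,
# appended (as an integer code) to the frame's acoustic vector.
import numpy as np

LEVEL_CODE = {"very low": 0, "low": 1, "medium": 2, "high": 3, "very high": 4}

def add_text_feature(lpc_frames, segment_text_label):
    """lpc_frames: (n_frames, n_lpc) array of LPC features for one segment."""
    code = LEVEL_CODE[segment_text_label]
    text_col = np.full((lpc_frames.shape[0], 1), code, dtype=float)
    return np.hstack([lpc_frames, text_col])   # (n_frames, n_lpc + 1)
```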

4.1 First series of experiments

[Figure 1: bar chart "Accuracy - Speaker Dependent I"; frame classification accuracy (0-1) per audio (Disap., Angry1, Angry2, V. Angry, Fed-up, Impotent) for LPC (SVM), LPC + T.I. (SVM), LPC (NB) and LPC + T.I. (NB).]

Figure 1: Comparison of the SVM and NB frame classification approaches. Bar graphs show the frame identification accuracies based on the sets of features selected for the first series of experiments. The two bar graphs on the left side of each audio correspond to the results obtained by the SVM classifier, whereas the ones on the right side correspond to the results obtained by NB.

In this first set of frame classification experiments, Figure 1 confirms that classification with Linear Prediction Coefficients plus textual information outperforms classification with Linear Prediction Coefficients alone, for both the SVM and NB models in the speaker-dependent setting. For instance, we can see an accuracy improvement of up to 0.3 in the Disappointed audio when combining acoustic parameters with textual information using the SVM classifier. We also carried out speaker-independent experiments, and the results again showed better performance using acoustic features plus textual information, improving the accuracy from 0.46 to 0.52. Looking at these results, it seems that there is information within the text, missing in the acoustic signal, that could be useful for detecting the annoyance rate.

4.2 Second series of experiments

A second set of frame classification experiments was then carried out using both the SVM and Naive Bayes models. This series was aimed at evaluating the speaker-dependent and speaker-independent models using acoustic parameters with the text-based labelling.
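Under our reading, this second configuration keeps the LPC frame features but takes the text-annotation labels as the classification target; the grouped split below is only one possible way to make the evaluation speaker independent (training and test frames coming from different audio files):

```python
# Sketch of the second series under our assumptions: LPC features only,
# text-annotation labels as targets, with a speaker-independent split in
# which training and test frames come from different audio files.
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def speaker_independent_eval(X, y_text, audio_ids, seed=0):
    """X, y_text: numpy arrays; audio_ids: one recording id per frame."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y_text, groups=audio_ids))
    clf = SVC(kernel="rbf").fit(X[train_idx], y_text[train_idx])
    return accuracy_score(y_text[test_idx], clf.predict(X[test_idx]))
```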

Figure 2 shows that the classification based on acoustic features (LPC) along with the use of labels based on the text annotation provides lower accuracy values. This loss of accuracy reinforces the idea that cognitive processing diverges depending on which senses are involved in the annotation process. On the other hand, speaker-independent results also showed worse performance, since the accuracy did not exceed 0.31.

[Figure 2: bar chart "Accuracy - Speaker Dependent II"; frame classification accuracy (0-1) per audio (Disap., Angry1, Angry2, V. Angry, Fed-up, Impotent) for LPC (SVM) with acoustic class and LPC (SVM) with text class.]

Figure 2: Comparison of SVM and NB frame classification approaches. Bar graphs show the frame identification accuracies based on the sets of features selected for the second series of experiments. The bar graph on the left side of each audio corresponds to the results obtained by the SVM classifier using the class based on the text, whereas the right one corresponds to the results obtained using the class based on the speech.

5 Conclusions and future work

In conclusion, two main ideas resulted from the experiments. On the one hand, the possibility of joining textual and acoustic information in order to predict annoyance rates was explored and validated: the experiments show that the inclusion of labels extracted from the text as a feature improves the classification accuracy. However, using acoustic features with text-based annotation does not provide a good system performance; acoustic features are linked to the acoustic signal, and that is why categories based on semantic analysis are meaningless for them. On the other hand, the use of a corpus that includes spontaneous emotions gathered in a realistic environment leads to an easy extrapolation of the obtained results to a real system.

As further work we propose to explore alternative ways of integrating the textual information, along with deep learning based classification paradigms.

References

Ben-David, B. M., N. Multani, V. Shakuf, F. Rudzicz, y P. H. H. M. van Lieshout. 2016. Prosody and semantics are separate but not separable channels in the perception of emotional speech: Test for rating of emotions in speech. Journal of Speech, Language, and Hearing Research, 59(1):72-89.

Boersma, P. y D. Weenink. 2016. Praat: doing phonetics by computer. Software tool, University of Amsterdam. Version 6.0.15.

Cambria, E. 2016. Affective computing and sentiment analysis. IEEE Intelligent Systems, 31(2):102-107, Mar.

Clavel, C. y Z. Callejas. 2016. Sentiment analysis: From opinion mining to human-agent interaction. IEEE Transactions on Affective Computing, 7(1):74-93, Jan.

Devillers, L., L. Vidrascu, y L. Lamel. 2005. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4):407-422. Emotion and Brain.

Eskimez, S. E., K. Imade, N. Yang, M. Sturge-Apple, Z. Duan, y W. Heinzelman. 2016. Emotion classification: how does an automated system compare to naive human coders? En Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), páginas 2274-2278, March.

Esposito, A., A. M. Esposito, L. Likforman-Sulem, M. N. Maldonato, y A. Vinciarelli. 2016. Recent Advances in Nonlinear Speech Processing, capítulo On the Significance of Speech Pauses in Depressive Disorders: Results on Read and Spontaneous Narratives, páginas 73-82. Springer International Publishing, Cham.

Eyben, F., M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, y R. Cowie. 2010. On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues. Journal on Multimodal User Interfaces, 3(1):7-19, Mar.

Gilbert, E. y K. Karahalios. 2010. Widespread worry and the stock market. En ICWSM 2010 - Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, páginas 58-65.

Girard, J. M. y J. F. Cohn. 2016. Automated audiovisual depression analysis. Current Opinion in Psychology, 4:75-79.

Irastorza, J. y M. I. Torres. 2016. Analyzing the expression of annoyance during phone calls to complaint services. En 2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), páginas 000103-000106, Oct.

Justo, R., O. Horno, M. Serras, y M. I. Torres. 2014. Tracking emotional hints in spoken interaction. En Proc. of VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop (IberSpeech 2014), páginas 216-226.

Kim, J. C. y M. A. Clements. 2015. Multimodal affect classification at various temporal lengths. IEEE Transactions on Affective Computing, 6(4):371-384, Oct.

Marsden, P. V. y K. E. Campbell. 2012. Reflections on conceptualizing and measuring tie strength. Social Forces, 91(1):17-23.

Medeiros, L. y C. N. van der Wal. 2017. An agent-based model predicting group emotion and misbehaviours in stranded passengers. En E. Oliveira, J. Gama, Z. Vale, y H. Lopes Cardoso, editores, Progress in Artificial Intelligence, páginas 28-40, Cham. Springer International Publishing.

Meilán, J. J. G., F. Martínez-Sánchez, J. Carro, D. E. López, L. Millian-Morell, y J. M. Arana. 2014. Speech in Alzheimer's disease: Can temporal and acoustic parameters discriminate dementia? Dementia and Geriatric Cognitive Disorders, 37(5-6):327-334.

Mencattini, A., E. Martinelli, F. Ringeval, B. Schuller, y C. Di Natale. 2016. Continuous estimation of emotions in speech by dynamic cooperative speaker models. IEEE Transactions on Affective Computing, PP(99):1-1.

Poria, S., I. Chaturvedi, E. Cambria, y A. Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. En 2016 IEEE 16th International Conference on Data Mining (ICDM), páginas 439-448, Dec.

Ringeval, F., F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, D. Lalanne, y B. Schuller. 2015. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 66:22-30.

Valstar, M., B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, y M. Pantic. 2014. AVEC 2014: 3D dimensional affect and depression recognition challenge. En Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, AVEC '14, páginas 3-10, New York, NY, USA. ACM.

Ververidis, D. y C. Kotropoulos. 2006. Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9):1162-1181.

Vidrascu, L. y L. Devillers. 2005. Detection of real-life emotions in call centers. En Proceedings of INTERSPEECH'05: the 6th Annual Conference of the International Speech Communication Association, páginas 1841-1844, Lisbon, Portugal. ISCA.

Wang, K., N. An, B. N. Li, Y. Zhang, y L. Li. 2015. Speech emotion recognition using Fourier parameters. IEEE Transactions on Affective Computing, 6(1):69-75, Jan.

Wollmer, M., F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, y R. Cowie. 2008. Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies. Páginas 597-600, Sep.