Bi-Modal Annoyance Level Detection from Speech and Text
Procesamiento del Lenguaje Natural, Revista nº 61, septiembre de 2018, pp. 83-89. ISSN 1135-5948. DOI 10.26342/2018-61-9. Recibido 04-04-2018; revisado 29-04-2018; aceptado 04-05-2018. © 2018 Sociedad Española para el Procesamiento del Lenguaje Natural.

Detección del nivel de enfado mediante un sistema bi-modal basado en habla y texto

Raquel Justo, Jon Irastorza, Saioa Pérez, M. Inés Torres
Universidad del País Vasco UPV/EHU. Sarriena s/n. 48940 Leioa. Spain
{raquel.justo, ...}[email protected]

Abstract: The main goal of this work is the identification of emotional hints from speech. Machine learning researchers have analysed sets of acoustic parameters as potential cues for the identification of discrete emotional categories or, alternatively, of the dimensions of emotions. However, the semantic information gathered in the text message associated with each utterance can also provide valuable cues for emotion detection. In this work, this information is combined with the acoustic information, leading to better system performance. Also noteworthy is the use of a corpus that includes spontaneous emotions gathered in a realistic environment. It is well known that the expression of emotion depends not only on cultural factors but also on the individual and on the specific situation. Thus, the conclusions drawn from the present work can be more easily extrapolated to a real system than those obtained from a classical corpus with simulated emotions.
Keywords: speech processing, semantic information, emotion detection in speech, annoyance tracking, machine learning

Resumen: El principal objetivo de este trabajo es la detección de cambios emocionales a partir del habla. Diferentes trabajos basados en aprendizaje automático han analizado conjuntos de parámetros acústicos como potenciales indicadores en la identificación de categorías emocionales discretas o en la identificación dimensional de las emociones. Sin embargo, la información semántica recogida en el mensaje textual asociado a cada intervención puede proporcionar información valiosa para la detección de emociones. En este trabajo se combina la información textual y acústica, dando lugar a mejoras en el rendimiento del sistema. Es importante recalcar, por otra parte, el uso de un corpus que incluye emociones espontáneas recogidas en un entorno realista. Es bien sabido que la expresión de la emoción depende no solo de factores culturales sino también de factores individuales y de situaciones particulares. Por lo tanto, las conclusiones extraídas en este trabajo se pueden extrapolar más fácilmente a un sistema real que aquellas obtenidas a partir de un corpus clásico en el que se simula el estado emocional.
Palabras clave: procesamiento del habla, información semántica, reconocimiento emocional en el habla, rastreo del enfado, aprendizaje automático

1 Introduction

The detection of emotional status has been widely studied in the last decade within the machine learning framework. The goal of researchers is to be able to recognise emotional information from the analysis of voice, language, face, gestures or ECG (Devillers, Vidrascu, y Lamel, 2005). One of the main challenges that needs to be faced in this area is the need for supervised data, i.e. corpora including human data annotated with emotional labels (Devillers, Vidrascu, y Lamel, 2005) (Vidrascu y Devillers, 2005), and this is not a straightforward task due to the subjectivity of emotion perception by humans (Devillers, Vidrascu, y Lamel, 2005) (Eskimez et al., 2016). Many works have considered corpora that consist of data from professional actors simulating the emotions to be analyzed. However, this usually leads to poor results due to many factors, such as the differences between the real situations the detection system has to deal with and the emotional status captured in the corpus. Moreover, the selection of valuable data including spontaneous emotions depends on the goals of the research involved, and it is difficult to find an appropriate corpus that matches the specific goal of each task.

Focusing on emotion identification from speech and language, a wide range of potential applications and research objectives can be found (Valstar et al., 2014) (Wang et al., 2015) (Clavel y Callejas, 2016). Some examples are the early detection of Alzheimer's disease (Meilán et al., 2014) and the detection of valency onsets in medical emergency calls (Vidrascu y Devillers, 2005) or in Stock Exchange Customer Service Centres (Devillers, Vidrascu, y Lamel, 2005). Emotion recognition from speech signals relies on a number of short-term features: pitch; additional excitation signals due to the non-linear air flow in the vocal tract; vocal tract features such as formants (Wang et al., 2015) (Ververidis y Kotropoulos, 2006); prosodic features (Ben-David et al., 2016) such as pitch, loudness, speaking rate, rhythm, voice quality and articulation (Vidrascu y Devillers, 2005) (Girard y Cohn, 2016); latency to speak and pauses (Justo et al., 2014) (Esposito et al., 2016); features derived from energy (Kim y Clements, 2015); as well as combinations of these features.
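To make this feature inventory concrete, the following is a minimal sketch of per-frame extraction of a few such short-term features (pitch, energy and zero-crossing rate) with the librosa library. The 16 kHz sampling rate, the window sizes and the choice of features are assumptions made for illustration, not the configuration used in this work.

import numpy as np
import librosa

def short_term_features(wav_path, frame_ms=20, hop_ms=10):
    """Per-frame pitch (F0), RMS energy and zero-crossing rate.

    Illustrative only: frame/hop sizes and the 16 kHz rate are assumptions.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    frame = int(sr * frame_ms / 1000)  # 320 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)      # 50% overlap

    # Fundamental frequency contour via the pYIN pitch tracker
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr, hop_length=hop)
    # Short-term energy and zero-crossing rate on the same frame grid
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=hop)[0]

    n = min(len(f0), len(rms), len(zcr))
    # Unvoiced frames have no F0 estimate; replace NaN with 0
    return np.column_stack([np.nan_to_num(f0[:n]), rms[:n], zcr[:n]])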
Regarding methodology, statistical analysis of feature distributions has traditionally been carried out, and classical classifiers such as Bayesian classifiers or SVMs have been proposed for the identification of emotional characteristics from speech signals. The modelling of continuous affective dimensions is also an emerging challenge when dealing with continuous ratings of emotion labelled during real interaction (Mencattini et al., 2016). In this approach, recurrent neural networks have been proposed to integrate contextual information and then predict emotion in continuous time, dealing only with arousal and valence (Wollmer et al., 2008) (Ringeval et al., 2015).

Regarding text, there are numerous works dealing with sentiment analysis, whose application domains range from business to security, considering well-being, politics or software engineering (Cambria, 2016). However, there are few works considering the recognition of specific emotions such as joy, love or anger (Medeiros y van der Wal, 2017; Gilbert y Karahalios, 2010; Marsden y Campbell, 2012). Moreover, it seems reasonable to think that the combination of acoustic and textual information might improve the performance of emotion recognition systems. However, although there are plenty of research articles on audio-visual emotion recognition, only a few research works have been carried out on multimodal emotion recognition using textual clues together with the visual and audio modalities (Eyben et al., 2010; Poria et al., 2016).

In this work we deal with a problem proposed by a Spanish company providing customer assistance services over the telephone (Justo et al., 2014). The company wants to automatically detect annoyance rates during customer calls for further analysis, which is a novel and challenging goal. Its motivation is to verify whether the policies applied by operators to deal with annoyed and angry customers lead to shifts in customer behavior. An automatic procedure to detect those shifts would thus allow the company to evaluate its policies through the analysis of the recorded audios. Moreover, the company was also interested in providing this information to the operators during the conversation. As a consequence, this work is aimed at detecting different levels of annoyance during real phone calls to Spanish complaint services. Mainly, we wanted to analyse the effect of including textual information in an annoyance detection system based on acoustic signals.

The paper is organised as follows: Sec. 2 describes the previous work carried out to solve the presented problem with the specific dataset we are dealing with. In Sec. 3 the annotation procedure in terms of speech signal and text is described, and Sec. 4 details the experiments carried out and the results obtained. Finally, Sec. 5 summarises the concluding remarks and future work.

2 Dataset and previous work

The Spanish call center company involved in this work offers customer assistance for several phone, TV and internet service providers. The customer complaint services of these companies receive a certain number of phone calls from angry or annoyed customers, but the way of expressing annoyance is not the same for all customers. Some of them are furious and shout; others speak quickly with frequent and very short micro-pauses but do not shout (Justo et al., 2014); others seem to be more fed-up than angry; others feel impotent after a number of service failures and calls to the customer service. The dataset for this study consisted of seven conversations between customers and call-centre operators that were identified and selected by experienced operators. All of the selected customers were very angry with the service provider because of unsolved and repeated service failures that had caused them serious trouble. In a second step, each recording was named according to the particular way the customer expresses his annoyance degree. Thus, call-center operators qualified the seven subjects in the conversations as follows: Disappointed, Angry (2 records), Extremely …

… defined as follows: very low, agreed by the annotators; low, which corresponds to a low-medium disagreement; medium, agreed by the annotators; high, which corresponds to a medium-high disagreement; and very high, agreed by the annotators. Less frequent disagreements were not considered. The right side of Table 1 (SPEECH-BASED) shows the final number of segments identified for each audio file and annoyance level.
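This five-level scale can be read as a deterministic merge of two annotators' three-level ratings. The sketch below makes that reading explicit; the function, the label names and the assumption of exactly two annotators are ours for illustration, not taken from the paper.

from typing import Optional

# Three-level ratings assumed per annotator, merged into the five-level
# scale described above (agreements plus adjacent disagreements).
LEVELS = {"low": 0, "medium": 1, "high": 2}
MERGED = {
    (0, 0): "very low",   # both annotators agree on low
    (0, 1): "low",        # low-medium disagreement
    (1, 1): "medium",     # both agree on medium
    (1, 2): "high",       # medium-high disagreement
    (2, 2): "very high",  # both agree on high
}

def merge_annotations(label_a: str, label_b: str) -> Optional[str]:
    """Merged annoyance level, or None for less frequent disagreements
    (e.g. low vs. high), which were not considered."""
    pair = tuple(sorted((LEVELS[label_a], LEVELS[label_b])))
    return MERGED.get(pair)

# Example: merge_annotations("medium", "high") -> "high"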
An automatic classification was carried out in (Irastorza y Torres, 2016) using acoustic parameters extracted from the audio files. The acoustic signal was divided into 20 ms overlapping windows (frames), from which a set of features was extracted, and the classification procedure was carried out over those frames.
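A minimal sketch of this frame-level setup is given below: splitting the signal into 20 ms overlapping windows and classifying each frame's feature vector. The 50% overlap, the toy features and the SVM classifier are assumptions made for illustration; the paper specifies neither the hop size nor the exact feature set here.

import numpy as np
from sklearn.svm import SVC

def frame_signal(y, sr, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping 20 ms frames (50% overlap assumed)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - frame) // hop
    return np.stack([y[i * hop:i * hop + frame] for i in range(n_frames)])

def frame_features(frames):
    """Toy per-frame features: log-energy and zero-crossing rate."""
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([log_energy, zcr])

# Per-frame classification: one annoyance level predicted per frame.
# X_train / y_train would come from frames of the annotated segments.
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# y_pred = clf.predict(frame_features(frame_signal(y_test, sr=8000)))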