Utterance-Level Multimodal Sentiment Analysis
Verónica Pérez-Rosas and Rada Mihalcea
Computer Science and Engineering, University of North Texas
[email protected], [email protected]

Louis-Philippe Morency
Institute for Creative Technologies, University of Southern California
[email protected]

Abstract

During real-life interactions, people naturally gesture and modulate their voice to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams. Using a new multimodal dataset consisting of sentiment-annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.

1 Introduction

Video reviews represent a growing source of consumer information that has gained increasing interest from companies, researchers, and consumers. Popular web platforms such as YouTube, Amazon, Facebook, and ExpoTV have reported a significant increase in the number of consumer reviews in video format over the past five years. Compared to traditional text reviews, video reviews provide a more natural experience, as they allow the viewer to better sense the reviewer's emotions, beliefs, and intentions through richer channels such as intonations, facial expressions, and body language.

Much of the work to date on opinion analysis has focused on textual data, and a number of resources have been created, including lexicons (Wiebe and Riloff, 2005; Esuli and Sebastiani, 2006) or large annotated datasets (Maas et al., 2011). Given the accelerated growth of other media on the Web and elsewhere, which includes massive collections of videos (e.g., YouTube, Vimeo, VideoLectures), images (e.g., Flickr, Picasa), and audio clips (e.g., podcasts), the ability to identify opinions in the presence of diverse modalities is becoming increasingly important. This has motivated researchers to start exploring multimodal cues for the detection of sentiment and emotions in video content (Morency et al., 2011; Wagner et al., 2011).

In this paper, we explore the addition of the speech and visual modalities to text analysis in order to identify the sentiment expressed in video reviews. Given the non-homogeneous nature of full-video reviews, which typically include a mixture of positive, negative, and neutral statements, we perform our experiments and analyses at the utterance level. This is in line with earlier work on text-based sentiment analysis, where it has been observed that full-document reviews often contain both positive and negative comments, which led to a number of methods addressing opinion analysis at the sentence level. Our results show that relying on the joint use of linguistic, acoustic, and visual modalities allows us to better sense the sentiment being expressed, as compared to the use of only one modality at a time.

Another important aspect of this paper is the introduction of a new multimodal opinion database annotated at the utterance level which is, to our knowledge, the first of its kind. In our work, this dataset enabled a wide range of multimodal sentiment analysis experiments, addressing the relative importance of modalities and individual features.
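Since the central idea of the paper is the joint use of linguistic, acoustic, and visual information at the utterance level, the following minimal Python sketch illustrates what such feature-level fusion can look like in general. It is not the authors' actual pipeline: the feature dimensions, the random placeholder data, and the choice of a linear SVM are assumptions made purely for illustration.

    # Minimal sketch of utterance-level feature fusion (illustrative only).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_utterances = 200

    # Placeholder per-utterance feature matrices; in practice these would come
    # from modality-specific extractors (e.g., bag-of-words for the linguistic
    # stream, pitch/MFCC statistics for the acoustic stream, facial cues for
    # the visual stream).
    linguistic = rng.normal(size=(n_utterances, 50))
    acoustic = rng.normal(size=(n_utterances, 28))
    visual = rng.normal(size=(n_utterances, 10))
    labels = rng.integers(0, 2, size=n_utterances)  # 0 = negative, 1 = positive

    # Early (feature-level) fusion: concatenate the three views per utterance
    # and train a single classifier on the fused representation.
    fused = np.hstack([linguistic, acoustic, visual])
    clf = SVC(kernel="linear")
    print(cross_val_score(clf, fused, labels, cv=5).mean())

With real features, later fusion schemes (e.g., combining the outputs of per-modality classifiers) are an equally common design choice.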
The following section presents related work in text-based sentiment analysis and audio-visual emotion recognition. Section 3 describes our new multimodal dataset with utterance-level sentiment annotations. Section 4 presents our multimodal sentiment analysis approach, including details about our linguistic, acoustic, and visual features. Our experiments and results on multimodal sentiment classification are presented in Section 5, followed by a detailed discussion and analysis in Section 6.

2 Related Work

In this section we provide a brief overview of related work on text-based sentiment analysis, as well as on audio-visual emotion analysis.

2.1 Text-based Subjectivity and Sentiment Analysis

The techniques developed so far for subjectivity and sentiment analysis have focused primarily on the processing of text, and consist of either rule-based classifiers that make use of opinion lexicons, or data-driven methods that assume the availability of a large dataset annotated for polarity. These tools and resources have already been used in a large number of applications, including expressive text-to-speech synthesis (Alm et al., 2005), tracking sentiment timelines in on-line forums and news (Balog et al., 2006), analysis of political debates (Carvalho et al., 2011), question answering (Oh et al., 2012), conversation summarization (Carenini et al., 2008), and citation sentiment detection (Athar and Teufel, 2012).

One of the first lexicons used in sentiment analysis is the General Inquirer (Stone, 1968). Since then, many methods have been developed to automatically identify opinion words and their polarity (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Hu and Liu, 2004; Taboada et al., 2011), as well as n-grams and more linguistically complex phrases (Yang and Cardie, 2012).

For data-driven methods, one of the most widely used datasets is the MPQA corpus (Wiebe et al., 2005), which is a collection of news articles manually annotated for opinions. Other datasets are also available, including two polarity datasets consisting of movie reviews (Pang and Lee, 2004; Maas et al., 2011), and a collection of newspaper headlines annotated for polarity (Strapparava and Mihalcea, 2007).

While difficult problems such as cross-domain (Blitzer et al., 2007; Li et al., 2012) or cross-language (Mihalcea et al., 2007; Wan, 2009; Meng et al., 2012) portability have been addressed, not much has been done to extend the applicability of sentiment analysis to other modalities, such as speech or facial expressions. The only exceptions that we are aware of are the findings reported in (Somasundaran et al., 2006; Raaijmakers et al., 2008; Mairesse et al., 2012; Metze et al., 2009), where speech and text have been analyzed jointly for the purpose of subjectivity or sentiment identification, without, however, addressing other modalities such as visual cues; and the work reported in (Morency et al., 2011; Perez-Rosas et al., 2013), where multimodal cues have been used for the analysis of sentiment in product reviews, but where the analysis was done at the much coarser level of full videos, rather than at the level of individual utterances as we do in our work.

2.2 Audio-Visual Emotion Analysis

Also related to our work is the research done on emotion analysis. Emotion analysis of speech signals aims to identify the emotional or physical state of a person by analyzing his or her voice (Ververidis and Kotropoulos, 2006). Proposed methods for emotion recognition from speech focus both on what is being said and on how it is being said, and rely mainly on an analysis of the speech signal sampled at the utterance or frame level (Bitouk et al., 2010). Several researchers have used prosody (e.g., pitch, speaking rate, Mel frequency cepstral coefficients) for speech-based emotion recognition (Polzin and Waibel, 1996; Tato et al., 2002; Ayadi et al., 2011).
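To make this kind of acoustic representation concrete, the short sketch below computes a pitch contour and MFCCs for a single audio file and collapses them into a fixed-length, utterance-level feature vector. It assumes the librosa library and a hypothetical file named utterance.wav, and it illustrates the general approach rather than the exact feature sets used in the works cited above.

    # Hedged example of simple utterance-level acoustic features.
    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=None)  # hypothetical input file

    # Fundamental frequency (pitch) contour via the pYIN estimator; unvoiced
    # frames are returned as NaN, hence the nan-aware statistics below.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

    # 13 Mel frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # One fixed-length vector per utterance: pitch statistics plus per-
    # coefficient MFCC means and standard deviations.
    features = np.concatenate([
        [np.nanmean(f0), np.nanstd(f0)],
        mfcc.mean(axis=1),
        mfcc.std(axis=1),
    ])
    print(features.shape)  # (28,) with the settings above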
There are also studies that have analyzed visual cues, such as facial expressions and body movements (Calder et al., 2001; Rosenblum et al., 1996; Essa and Pentland, 1997). Facial expressions are among the most powerful and natural means for human beings to communicate their emotions and intentions (Tian et al., 2001). Emotions can also be expressed unconsciously, through subtle movements of facial muscles such as smiling or eyebrow raising, often measured and described using the Facial Action Coding System (FACS) (Ekman et al., 2002).

De Silva et al. (1997) and Chen et al. (1998) presented some of the earliest work integrating both acoustic and visual information for emotion recognition. In addition to work that considered individual modalities, there is also a growing body of work concerned with multimodal emotion analysis (Silva et al., 1997; Sebe et al., 2006; Zhihong et al., 2009; Wollmer et al., 2010).

More recently, two challenges have been organized focusing on the recognition of emotions using audio and visual cues (Schuller et al., 2011a; Schuller et al., 2011b), which included sub-challenges on audio-only, video-only, and audio-video recognition, and drew the participation of many teams from around the world.

Utterance transcription (original / English translation) | Label
En este color, creo que era el color frambuesa. / In this color, I think it was raspberry. | neu
Pinta hermosísimo. / It looks beautiful. | pos
Sinceramente, con respecto a lo que pinta y a que son hidratante, sí son muy hidratantes. / Honestly, regarding how they look and whether they are hydrating, yes, they are very hydrating. | pos
Pero el problema de estos labiales es que cuando uno se los aplica, te dejan un gusto asqueroso en la boca. / But the problem with these lipsticks is that when you apply them, they leave a very nasty taste in your mouth. | neg
Sinceramente, es no es que sea el olor sino que es más bien el gusto. / Honestly, it is not the smell, it is rather the taste. | neg

Table 1: Sample utterance-level annotations. The labels used are: pos(itive), neg(ative), neu(tral).

Among all the videos returned by the YouTube search, we selected only videos that respected the following guidelines: the speaker should be in front of the camera; her face should be clearly visible, with a minimum amount of face occlusion during the recording; there should not be any background