Utterance-Level Multimodal Sentiment Analysis Veronica´ Perez-Rosas´ and Rada Mihalcea Louis-Philippe Morency Computer Science and Engineering Institute for Creative Technologies University of North Texas University of Southern California
[email protected],
[email protected] [email protected] Abstract Riloff, 2005; Esuli and Sebastiani, 2006) or large annotated datasets (Maas et al., 2011). Given the During real-life interactions, people are accelerated growth of other media on the Web and naturally gesturing and modulating their elsewhere, which includes massive collections of voice to emphasize specific points or to videos (e.g., YouTube, Vimeo, VideoLectures), im- express their emotions. With the recent ages (e.g., Flickr, Picasa), audio clips (e.g., pod- growth of social websites such as YouTube, casts), the ability to address the identification of Facebook, and Amazon, video reviews are opinions in the presence of diverse modalities is be- emerging as a new source of multimodal coming increasingly important. This has motivated and natural opinions that has been left al- researchers to start exploring multimodal clues for most untapped by automatic opinion anal- the detection of sentiment and emotions in video ysis techniques. This paper presents a content (Morency et al., 2011; Wagner et al., 2011). method for multimodal sentiment classi- In this paper, we explore the addition of speech fication, which can identify the sentiment and visual modalities to text analysis in order to expressed in utterance-level visual datas- identify the sentiment expressed in video reviews. treams. Using a new multimodal dataset Given the non homogeneous nature of full-video consisting of sentiment annotated utter- reviews, which typically include a mixture of posi- ances extracted from video reviews, we tive, negative, and neutral statements, we decided show that multimodal sentiment analysis to perform our experiments and analyses at the ut- can be effectively performed, and that the terance level.