
Emotion Detection and Sentiment Analysis of Images

Vasavi Gajarla, Georgia Institute of Technology, [email protected]
Aditi Gupta, Georgia Institute of Technology, [email protected]

Abstract

If we search for a tag "love" on Flickr, we get a wide variety of images: roses, a mother holding her baby, images with hearts, etc. These images are very different from one another and yet depict the same emotion of "love". In this project, we explore the possibility of using deep learning to predict the emotion depicted by an image. Our results look promising and indicate that neural nets are indeed capable of learning the emotion conveyed by an image. Such predictions can be used in applications like automatic tag prediction for images uploaded on social media websites and understanding the sentiment of people during or after an event such as an election.

1 Introduction

Nowadays, people share a lot of content on social media in the form of images - be it personal, everyday scenes, or their opinions depicted in the form of cartoons or memes. Analyzing such content from social media and photo-sharing websites like Flickr, Twitter, Tumblr, etc., can give insight into the general sentiment of people about, say, presidential elections. It would also be useful to understand the emotion an image depicts in order to automatically predict emotional tags for it - like happiness, sadness, etc.

As a part of this project, we aim to predict the emotional category an image falls into from 5 categories - Love, Happiness, Violence, Fear, and Sadness. We do this by fine-tuning 3 different convolutional neural networks for the tasks of emotion prediction and sentiment analysis.

2 Related Work

There exists an affective gap in Emotion Semantic Image Retrieval (ESIR) between low-level features and the emotional content of an image reflecting a particular sentiment, similar to the well-known semantic gap. A lot of previous work tries to address this issue, using both handcrafted features and neural networks. We briefly describe some of the most influential work that addresses this issue.

Visual Sentiment Prediction with Deep Convolutional Neural Networks by Xu et al. [2] uses Convolutional Neural Networks pretrained on object recognition data to perform sentiment analysis on images collected from Twitter and Tumblr. Robust image sentiment analysis using progressively trained and domain transferred deep networks by You et al. [3] uses VGG-ImageNet and its architectural variations to study sentiment analysis on Twitter and Flickr datasets. Recognizing image style by Karayev et al. [4] experiments with handcrafted features like L*a*b color space features, GIST and saliency features on Flickr Style, Wikipaintings and AVA Style data. Emotion based classification of natural images by Dellagiacoma et al. [5] uses MPEG7 color and edge descriptors to perform emotion classification and compares the results with a bag-of-emotions method.

3 Our Work

First, we finalized the emotion categories to perform the classification on. We then collected data from Flickr for these categories. We experimented with various classification methods on our data - an SVM on high-level features from VGG-ImageNet, and fine-tuning of pretrained models like ResNet-50, Places205-VGG16 and VGG-ImageNet. More details follow in Section 4.

3.1 Deciding Emotion Categories

We were inspired by the 6 emotion categories defined by the famous psychologist Ekman in [1]: Happiness, Sadness, Fear, Anger, Disgust and Surprise. These emotions are used by several other papers in this domain ([5], [7]). However, instead of 'Anger' we chose 'Violence', as we feel it is better suited to the applications we are looking at: more and more people are turning to social media to show their support for causes or to protest against something (for example, a lot of social messages poured in after the terrorist attacks in Paris in November 2015). For the same reason, we added 'Love' as an emotion category. To limit ourselves to 5 emotions, we dropped 'Disgust' and 'Surprise'. So our final emotion categories are: Love, Happiness, Violence, Fear and Sadness.

3.2 Data Collection

We collected data for the selected 5 emotion categories - Love, Happiness, Violence, Fear, and Sadness - from Flickr. The following are the steps involved in collecting this data.

Step 1: Querying the image metadata.
We used Flickr's API service to query for images using each of the emotion categories - Fear, Happiness, Love, Sad, and Violence - as the search query parameter (along with the license flag set to 'Creative Commons') and collected the image metadata (like the server and farm ID numbers on which the image is stored), sorting the result set by interestingness to make sure that images fitting the emotion categories are retrieved first.

Step 2: Downloading the images from the Flickr server.
Once the image metadata was retrieved, we used it to access the images directly from the Flickr servers instead of going through the Flickr API [8].

For this project, we collected 9854 images in total, with ~1900 images in every category. We split the data in each category such that 75% of the data (~8850) is used for training, and 25% of the data (~1000) is used for testing.
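To make these two steps concrete, the sketch below shows one way they could be implemented in Python. It assumes the third-party flickrapi package and a valid API key; the license IDs, the interestingness sort, and the static farm/server URL pattern follow Flickr's public API documentation, and fetch_category is an illustrative helper of our own, not the exact script used for this project.

```python
# Illustrative implementation of Steps 1 and 2 (not the exact project script).
# Assumes the third-party `flickrapi` package (pip install flickrapi).
import os
import urllib.request

import flickrapi

API_KEY, API_SECRET = "your-key", "your-secret"  # placeholders
flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET, format="parsed-json")

def fetch_category(tag, pages=10):
    """Step 1: query image metadata for one emotion tag; Step 2: download."""
    os.makedirs(tag, exist_ok=True)
    for page in range(1, pages + 1):
        resp = flickr.photos.search(
            tags=tag,
            license="1,2,4,5",            # Creative Commons license IDs
            sort="interestingness-desc",  # retrieve well-fitting images first
            per_page=200,
            page=page,
        )
        for photo in resp["photos"]["photo"]:
            # Step 2: build the static image URL from the farm/server/id/secret
            # metadata and fetch it directly, bypassing the API [8].
            url = ("https://farm{farm}.staticflickr.com/"
                   "{server}/{id}_{secret}.jpg").format(**photo)
            urllib.request.urlretrieve(url, os.path.join(tag, photo["id"] + ".jpg"))

for tag in ["fear", "happiness", "love", "sad", "violence"]:
    fetch_category(tag)
```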

4 Experiments

In this section we describe the experiments we conducted on our data.

4.1 One vs All SVM

The idea is to use a pretrained neural network to get a feature representation of our dataset, and then use this representation as input to one vs all SVM classifiers. We used VGG-ImageNet [9] as the pretrained model.

The following are the steps involved in this experiment:
Step 1: Pass the data (both train and test) as input to the VGG-ImageNet model.
Step 2: Store the activations from the second-to-last fully connected layer of the network as feature vectors.
Step 3: Train a one vs all SVM classifier for each emotion category.
Step 4: For each test image, take the maximum of the scores from the SVMs to get its predicted category label.

We obtained an accuracy of 32.9% on the test data. The obtained confusion matrix is shown in Figure 1.
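One way to realize Steps 1-4 is sketched below, using torchvision's pretrained VGG-16 and scikit-learn rather than the exact toolchain used for the project; the preprocessed image tensors and integer labels (train_images, train_labels, test_images, test_labels) are assumed to be prepared elsewhere.

```python
# Sketch of the one vs all SVM pipeline (Steps 1-4). Uses torchvision's
# VGG-16 as a stand-in for the VGG-ImageNet model [9].
import numpy as np
import torch
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from torchvision import models

vgg = models.vgg16(pretrained=True).eval()
# Step 2: keep the classifier up to the second-to-last fully connected
# layer, so that its 4096-d activations serve as feature vectors.
fc7 = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

def features(batch):
    """batch: (N, 3, 224, 224) ImageNet-normalized tensors -> (N, 4096)."""
    with torch.no_grad():
        x = vgg.avgpool(vgg.features(batch)).flatten(1)  # Step 1
        return fc7(x).numpy()                            # Step 2

# Step 3: LinearSVC trains one-vs-rest classifiers, one per emotion category.
svm = LinearSVC()
svm.fit(features(train_images), train_labels)

# Step 4: score each test image with all five SVMs; argmax gives the label.
scores = svm.decision_function(features(test_images))
predictions = np.argmax(scores, axis=1)
print(confusion_matrix(test_labels, predictions))  # cf. Figure 1
```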

4.2 Fine-tuning VGG-ImageNet

The results obtained with the one vs all SVM classifier are better than chance (20%) but still far from satisfactory. Therefore, we experimented with fine-tuning the VGG-ImageNet model. We added a dropout layer followed by a fully connected layer and trained it on our dataset. We obtained a maximum testing accuracy of 38.4%.

4.3 Sentiment Analysis

The confusion matrix (Figure 1) indicates that there is more confusion within the positive (Love, Happiness) and negative (Violence, Fear, Sadness) emotions than between them. To understand our results better, we performed sentiment analysis on our dataset. We obtained a testing accuracy of 67.8%, which indicates that the network is learning to distinguish between the overall positive and negative categories, but has trouble classifying within each of the positive and negative sentiments. We suspect this is because the network is learning colors - positive sentiments usually have bright-colored images, and negative sentiments correspond to dark images.
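A minimal sketch of the head modification described in Section 4.2, again in PyTorch for illustration; the dropout probability is an assumption, as the text does not report it. The same recipe with 2 output classes yields the positive/negative model used for the sentiment analysis of Section 4.3.

```python
# Fine-tuning setup for Section 4.2: replace the final classifier layer with
# a dropout layer followed by a fresh fully connected layer over our classes.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # 5 emotions; use 2 (positive/negative) for Section 4.3

model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Sequential(
    nn.Dropout(p=0.5),             # assumed probability, not reported
    nn.Linear(4096, NUM_CLASSES),
)
# Training then proceeds with cross-entropy loss on our Flickr data;
# see the optimizer and augmentation sketch after Section 4.5.
```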

4.4 Fine-tuning VGG-Places205

VGG-ImageNet is pretrained on ImageNet, which consists mostly of iconic images. However, our dataset has both iconic and non-iconic images (for example, in the Violence category many images contain protests with a lot of people). Therefore, we experimented with fine-tuning a model pretrained on a scene-based database called Places205. We hypothesized that this would work better, since our task is closer to scene classification than to object recognition or detection. We fine-tuned the VGG model trained on the Places205 database. Using this method, we obtained a testing accuracy of 40.9% on the emotion classification task and an accuracy of 68.7% on the sentiment analysis task. The network's confusion matrix can be seen in Figure 1.

4.5 Fine-tuning ResNet-50

We also experimented with fine-tuning a pretrained ResNet-50. Because this network is very deep (50 layers) and is trained on both scene-centric data (MS COCO) and object-centric data (ImageNet), we expected it to give better results. We did get better results with this approach - 40.9% testing accuracy on the emotion classification task and an accuracy of 73% on the sentiment analysis task. The confusion matrix of this network can be seen in Figure 1.

In all of the fine-tuning experiments, we observed that standard data augmentation techniques like mirroring and random cropping of images increase the accuracy. Having a higher learning rate for the later layers also yields better accuracy. We used Stochastic Gradient Descent optimization for all the networks, and a base learning rate of 0.0001 gave better accuracies.

Figure 1: (row-wise, starting from top left) Confusion matrix of emotion classification for the One vs All SVM; confusion matrices of emotion classification and sentiment analysis for VGG-ImageNet, Places205-VGG16 and ResNet-50. [Matrices omitted.]

5 Analysis of Results

In this section we examine the results of the ResNet-50 model fine-tuned on our dataset, to get a better understanding of what our network is learning. Based on the results, we can draw the following conclusions about what the model is learning for each category. See Table 1 for more examples.

Love: The model seems to be learning sunsets, hearts, red colors, flowers and serene landscapes for this category. A lot of sunsets that are tagged as "Happiness" get classified as Love.

Happiness: The model seems to be learning faces, bright colors and images depicting Bhutanese people! Hence an image depicting a sad Bhutanese woman gets classified as Happiness. We also observe that 80% of the images tagged as Happiness are face images, so this category is biased towards faces.

Violence: This category has the best accuracy. A close look at the true positives of this category tells us why: images depicting Violence are very different from images in other categories. The model seems to be learning posters, demonstration scenes, officers and text - 'Je Suis Charlie' in particular. This is probably because the labeling for this class is unambiguous.

Fear: The model seems to be learning dark, haunting images and spiders.

Sadness: This class has the maximum variation. The model seems to be learning a lot of face images.

Table 2 shows some misclassified instances. These examples seem to have been rightly classified emotion-wise, but count as misclassifications with respect to the actual tagged label. For example, the first image in the Fear row is dark and dull, which we suspect led the network to classify it as Fear, although it was actually tagged as Happiness. These examples reiterate the importance of having a clean dataset.

6 Comparison to Baseline

We compared our results to some of the approaches mentioned in the 'Related Work' section ([2], [3], [4] and [5]). These methods were trained and tested on different datasets, so the comparison is not entirely fair, but we believe it gives an idea of how methods using hand-crafted features for emotion classification, and a few methods using deep learning for sentiment analysis, perform. Table 3 and Table 4 show the datasets used by the various methods and the accuracies achieved.
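The augmentation and layer-wise learning-rate observations above might translate to PyTorch as follows; the 10x multiplier on the new head and the momentum value are illustrative assumptions, while the base rate of 0.0001 and the mirroring/random-crop transforms follow the text. Here `model` refers to the network from the Section 4.2 sketch.

```python
# Training recipe sketch: mirroring + random cropping for augmentation,
# SGD with base lr 0.0001 and a higher lr for the later (new) layers.
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),         # random cropping
    transforms.RandomHorizontalFlip(),  # mirroring
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

base_lr = 0.0001
optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": base_lr},
        {"params": model.classifier.parameters(), "lr": 10 * base_lr},  # assumed 10x
    ],
    lr=base_lr,
    momentum=0.9,  # assumed; not reported in the text
)
```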


Method              Datasets Used      Accuracy
MPEG7               Manually Curated   0.591
Fusion x Content*   AVA Style          0.581
Fusion x Content*   Flickr             0.368
Fusion x Content*   Wikipaintings      0.473
Places205-VGG16     Flickr             0.409
VGG-ImageNet        Flickr             0.384
ResNet-50           Flickr             0.409
*Fusion x Content: L*a*b, Hist, GIST, Saliency

Table 3: Comparison of our experiments on emotion classification with a few methods using hand-crafted features for the same task.

Figure 2a: Image which causes fear in the viewer. [Image omitted.]

Method              Datasets Used      Accuracy
FC7                 Twitter & Tumblr   0.649
3CONV-4FC           Flickr & Twitter   0.644
3CONV-2FC           Flickr & Twitter   0.657
2CONV-3FC           Flickr & Twitter   0.654
2CONV-4FC           Flickr & Twitter   0.665
Places205-VGG16     Flickr             0.687
VGG-ImageNet        Flickr             0.678
ResNet-50           Flickr             0.733

Table 4: Comparison of our experiments on sentiment analysis with a few methods using deep learning for the same task.

Figure 2b: Image with a woman experiencing fear. [Image omitted.]

7 Challenges

The problem of labeling images with the emotion they depict is very subjective and can differ from person to person. Also, due to cultural or geographical differences, some images might invoke different emotions in different people. For example, in India people light candles to celebrate a festival called "Diwali", whereas in western countries candles are most often lit to mark an occasion of mourning.

In addition, there is another fundamental challenge in extracting an emotional category from an image. A person could have tagged an image as, say, "Fear" either because the image makes them feel that emotion when they look at it (Figure 2a), or because the subjects/objects (people, animals, abstract art, etc.) in the image show that emotion (Figure 2b). This kind of image tagging can confuse the network.

These problems could be addressed by collecting clean data using a crowd-sourcing framework like Amazon Mechanical Turk and labeling the images based on the consensus reached by the workers, provided they are instructed to tag the images purely based on the emotion the image depicts rather than on any personal opinions.

We also observed that the results improved when we augmented the input data to the pretrained neural networks by mirroring and random cropping. This shows that a large amount of data definitely helps in fine-tuning the network better.

8 Future Work

Our experiments demonstrated that deep learning gives promising results both in the classification of emotions and in sentiment analysis, even on raw data collected directly from Flickr. The next step could be to run these experiments on bigger and cleaner datasets and see if they improve the results; we believe they would.


It would also be interesting to see what the salient regions in the images are, perhaps by using a saliency model such as [6], and feed these preprocessed images to pretrained neural networks. The idea behind this is that humans decide the emotion depicted by an image by looking at certain salient regions; informing the network of these regions beforehand might help the training process.
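As a rough illustration of this idea - not the learned model of [6] - one could weight each pixel by a predicted saliency value before feeding images to the network. The sketch below uses the simple spectral-residual saliency detector from the opencv-contrib package, which is purely an assumed stand-in.

```python
# Rough sketch of saliency-based preprocessing (illustrative only; [6]
# proposes a learned saliency model, while this uses OpenCV's simple
# spectral-residual detector from the opencv-contrib package).
import cv2
import numpy as np

def saliency_weighted(image_bgr):
    """Weight each pixel by its predicted saliency before training."""
    detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, saliency_map = detector.computeSaliency(image_bgr)
    if not ok:
        return image_bgr
    # saliency_map is float32 in [0, 1]; emphasize salient regions.
    weighted = image_bgr.astype(np.float32) * saliency_map[..., None]
    return weighted.astype(np.uint8)
```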

9 Conclusion

Our results show that deep learning provides promising performance, comparable to some methods using handcrafted features on the emotion classification task and to a few methods using deep learning for sentiment analysis. Emotion classification in images has applications in automatically tagging images with emotional categories and in automatically categorizing video sequences into genres like thriller, comedy, romance, etc.

References
[1] P. Ekman, W. Friesen, and P. Ellsworth. Emotion in the Human Face. Pergamon Press, New York, 1972.
[2] Xu, Can, et al. "Visual sentiment prediction with deep convolutional neural networks." arXiv preprint arXiv:1411.5731, 2014.
[3] You, Quanzeng, et al. "Robust image sentiment analysis using progressively trained and domain transferred deep networks." arXiv preprint arXiv:1509.06041, 2015.
[4] Karayev, Sergey, et al. "Recognizing image style." arXiv preprint arXiv:1311.3715, 2013.
[5] Dellagiacoma, Michela, et al. "Emotion based classification of natural images." Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web. ACM, 2011.
[6] Judd, Tilke, et al. "Learning to predict where humans look." 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009.
[7] Ulinski, Morgan, Victor Soto, and Julia Hirschberg. "Finding emotion in image descriptions." Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM, 2012.
[8] For downloading Flickr images: http://graphics.cs.cmu.edu/projects/im2gps/flickr_code.html
[9] VGG very deep networks: http://www.robots.ox.ac.uk/~vgg/research/very_deep/


Table 1: Examples of true positives for each of the categories: Love, Happiness, Violence, Fear and Sadness. [Image grid omitted.]


Predicted Love:      Fear (sunset)          Happiness (dog)              Sadness (flower)          Violence (bright)
Predicted Happiness: Fear (faces)           Love (face)                  Sadness (Bhutanese)       Violence (bright, colorful)
Predicted Violence:  Fear (officers)        Happiness (text)             Love ('Je suis Charlie')  Sadness (officers)
Predicted Fear:      Love (dark, haunting)  Happiness (dark, grayscale)  Sadness (spider-like)     Violence (face)
Predicted Sadness:   Happiness (face)       Love (grave scene?)          Fear (face)               Violence (face)

Table 2: Misclassified examples; each row is the predicted class, and each entry gives the actual tagged class with a brief description of the image. [Images omitted.]
