A Multi-Modal Approach to Infer Image Affect
Ashok Sundaresan, Sugumar Murugesan, Sean Davis, Karthik Kappaganthu, ZhongYi Jin, Divya Jain, Anurag Maunder
Johnson Controls - Innovation Garage, Santa Clara, CA

ABSTRACT
The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state of the art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN-extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.

Our approach combines top-down and bottom-up features. Due to the presence of multiple subjects in an image, a key problem that needs to be resolved is how to combine individual human emotions into a group-level emotion. The next step is to combine multiple group-level modalities. We address the first problem by using a Bag-of-Visual-Words (BOV) based approach. Our BOV approach comprises developing a code book using two well-known clustering algorithms: k-means clustering and Gaussian mixture model (GMM) based clustering. The code books from these clustering algorithms are used to fuse individual subjects' features into group image-level features.

We evaluate the performance of individual features using 4 different classification algorithms, namely, random forests, extra trees, gradient boosted trees and SVM. Additionally, the predictions from these classifiers are used to create an ensemble of classifiers to obtain the final classification results. While developing the ensemble model, we also employ predictions from an individually trained convolutional neural network (CNN) model using the group-level images.

KEYWORDS
Group affect, scene extraction, pose extraction, facial feature extraction, CENTRIST, feature aggregation, Gaussian mixture model, convolutional neural networks, deep learning, image captions

1 INTRODUCTION
Using machine learning to infer the emotion of a group of people in an image or video has the potential to enhance our understanding of social interactions and improve civil life. Using modern computer vision techniques, image retrieval can be made more intuitive, doctors will be able to better diagnose psychological disorders, and security services will be able to better respond to social unrest before it escalates to violence. Currently, even highly skilled professionals have a difficult time recognizing and categorizing the exact emotional state of a group. Consistent, automatic categorization of the affect of an image can only be achieved with the latest advances in machine learning coupled with precise feature extraction and fusion.

Inferring the overall emotion that an image elicits from a viewer requires an understanding of details at vastly differing scales in the image. Traditional approaches to this problem have focused on small-scale features, namely investigating the emotion of individual faces [13, 20, 33]. However, both large-scale features [8, 17] and additional action/scene recognition must be represented in the description of the image emotion [18, 25]. None of these feature extraction methods, or modalities, can capture the full emotion of an image by itself.

Previous research has investigated a combination of facial features, scene extraction [17] and even audio tonality [27]. This paper combines three additional modalities with the commonly used facial and scene modalities, namely, human pose, text-based tagging and CNN-extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks, with the exception of the baseline CENsus TRansform hISTogram (CENTRIST) [31] model.

2 MODALITIES

2.1 Facial Feature Extraction
Facial features are extracted from isolated human images on an image-by-image basis. First, a Faster R-CNN [21] is employed to extract each of the full human frames from the image. These extracted frames are used for both the pose estimation and the facial feature extraction. Each face is then extracted and aligned with the frontal view using DeepFace [29]. Once the faces are isolated and pre-processed, a deep residual network (ResNet) [14] is employed to extract facial features.

2.1.1 Human Frame Extraction. Isolation of each significant human frame is performed on the image using a Faster R-CNN [21]. The model is trained on the Pascal VOC 2007 data set [9] using the very deep VGG-16 model [26]. The procedure for building and training this model is presented in the original Faster R-CNN manuscript [21] and is not detailed here.

2.1.2 DeepFace. For extracting and aligning human faces in the images, Facebook's DeepFace [29] algorithm is utilized. The alignment is performed using explicit 3D facial modeling coupled with a 9-layer deep neural network. For more details on the implementation and training of the model, please see [29].

2.1.3 ResNet-based feature extraction. The ResNet model is trained on the Face Emotion Recognition (FER13) dataset [13], which contains 35,887 aligned face images with a predefined training and validation set. Each face in FER13 is labeled with one of the following 7 emotions: anger, disgust, fear, happiness, sadness, surprise or neutral. To balance training time and efficacy, the ResNet-50 topology from [14] is implemented. The ResNet model is trained from scratch, with an initial learning rate of 0.1, a learning rate decay factor of 0.9 applied every 2 epochs, and a batch size of 64. This network architecture achieves a 62.8% top-1 accuracy on the FER13 validation set when predicting the emotion of an image. In lieu of a fully connected layer, ResNet-50 uses global average pooling as its penultimate layer. The feature set, a vector of 2048 elements, is the output of the global average pooling layer when running inference on the isolated faces from each image.
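For concreteness, the sketch below shows how such a 2048-element feature can be read out once the classifier head of a ResNet-50 is removed. This is a minimal illustration rather than the authors' training code: it uses torchvision's ResNet-50 as a stand-in for the network trained from scratch on FER13, and the checkpoint path is hypothetical.

```python
# Minimal sketch: read out the 2048-d global-average-pooling feature of a
# ResNet-50 for one aligned face crop. The paper trains ResNet-50 from scratch
# on FER13; torchvision's architecture is used here only as a stand-in.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(num_classes=7)     # 7 FER13 emotion classes
# model.load_state_dict(torch.load("fer13_resnet50.pth"))  # hypothetical checkpoint
model.fc = nn.Identity()                   # drop the classifier; keep the GAP output
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def face_feature(face_img: Image.Image) -> torch.Tensor:
    """Return the 2048-element feature vector for one aligned face."""
    x = preprocess(face_img).unsqueeze(0)  # shape (1, 3, 224, 224)
    with torch.no_grad():
        return model(x).squeeze(0)         # shape (2048,)
```

Features produced per face in this way are what the BOV codebooks described in the abstract would then aggregate into a single group-level vector.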
2.2 CENTRIST
The CENsus TRansform hISTogram (CENTRIST) [31] is a visual descriptor predominantly used to describe topological places and scenes in an image [4, 7, 32, 34]. The Census Transform, in essence, is calculated by comparing a pixel's gray-scale intensity with that of its eight neighboring pixels, encoding the results as an 8-bit binary codeword, and converting the codeword to decimal (Fig. 1). CENTRIST, in turn, is the histogram of the Census Transform calculated over various rectangular sections of the image via spatial pyramid techniques [16]. By construction, CENTRIST is expected to capture the high-level, global features of an image such as the background, the relative locations of the persons, etc. CENTRIST is adopted by the challenge organizers as the baseline algorithm for emotion detection. The baseline accuracy provided by the organizers is 52.97% on the Validation set and 53.62% on the Test set. To achieve these accuracies, a support vector regression model was trained using features extracted by CENTRIST. In our work, we use CENTRIST as a scene-level descriptor and build various modalities on top of it to extract complementary features from the image.
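Because CENTRIST is the baseline descriptor, a compact sketch of the Census Transform and a whole-image CENTRIST histogram may help. It is an illustrative NumPy reimplementation of the definition above, not the organizers' code, and it omits the spatial-pyramid pooling of [16].

```python
# Sketch of the Census Transform and a whole-image CENTRIST histogram.
# Each interior pixel is compared with its 8 neighbours; the comparison bits
# form an 8-bit codeword whose decimal value replaces the pixel. CENTRIST is
# the 256-bin histogram of these codewords (spatial-pyramid pooling omitted).
import numpy as np

def census_transform(gray: np.ndarray) -> np.ndarray:
    """gray: 2-D grayscale array. Returns 8-bit codewords for interior pixels."""
    c = gray[1:-1, 1:-1]                         # centre pixels
    # 8 neighbours in a fixed (but arbitrary) bit order.
    neighbours = [
        gray[:-2, :-2], gray[:-2, 1:-1], gray[:-2, 2:],
        gray[1:-1, :-2],                 gray[1:-1, 2:],
        gray[2:, :-2],  gray[2:, 1:-1],  gray[2:, 2:],
    ]
    code = np.zeros(c.shape, dtype=np.int32)
    for bit, nb in enumerate(neighbours):
        # bit is 1 when the centre pixel is at least as bright as the neighbour
        code |= (c >= nb).astype(np.int32) << bit
    return code

def centrist(gray: np.ndarray) -> np.ndarray:
    """L1-normalised 256-bin histogram of Census Transform codewords."""
    code = census_transform(gray)
    hist = np.bincount(code.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()
```

A spatial-pyramid variant would simply concatenate such histograms computed over sub-rectangles of the image at several scales.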
2.3 Human Pose
The intention of including the pose modality in this emotion detection task is to detect similar and dissimilar poses of the individuals in the image and to capture any effect these poses may have on the group emotion.

2.4 Image Captions
2.4.1 Clarifai. One of the modalities used in our model is to generate captions for images and use those captions to infer emotion. As a first step, we used the commercially available Clarifai API [1] to answer the following preliminary questions:

• Whether reasonably accurate captions could be generated for group-level emotion images
• Whether captions are descriptive of underlying emotions

Fig. 2 illustrates the top four tags generated by Clarifai for some of the images in the EmotiW 2017 Training Dataset. These results, along with those generated for several other images in the dataset, give us confidence that a deep learning model can be reasonably accurate in generating tags for images, thus answering the first question above.

As for the second question, we refer to Fig. 3, where the distribution of top tags generated by Clarifai across the three emotion classes is plotted. Note that, for several tags such as 'administration', 'election' and 'competition', there is a notable difference in the prior probability of occurrence given each class. This could translate to a discriminative probability of classes given these tags, assuming a uniform prior on the classes. The observation is even more pronounced for more obvious tags such as 'festival' and 'battle'. This answers the second question above and motivates us to explore further; a small illustration of this idea is given at the end of this section.

2.4.2 im2txt. Having established that image tags can be of value in inferring emotions for a group-level image, we turned our attention to the more general 'image to caption' models. Towards this, we focused on the im2txt TensorFlow open-source model [24]. This model combines techniques from both computer vision and natural language processing to form a complete image captioning algorithm. More specifically, in machine translation between languages, a Recurrent Neural Network (RNN) is used to transform a sentence in the 'source language' into a vector representation [3, 5, 28]. This vector is then fed into a second RNN that generates a target sentence in the 'target language'. In [30], the authors derived from this approach and replaced the first RNN with a visual representation of an image obtained from a deep Convolutional Neural Network (CNN) [23] up to the penultimate layer. This CNN was originally trained to detect objects, and the penultimate layer is used to feed the second RNN designed to produce phrases in the 'target language'.
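The observation in Section 2.4.1 can be made concrete with a toy computation: if per-class tag frequencies are estimated from training images, then under a uniform class prior the tags of a new image yield class scores in a naive-Bayes fashion. The sketch below is purely illustrative; the tag lists and labels are hypothetical placeholders, not data from the EmotiW corpus.

```python
# Illustration of Section 2.4.1: per-class tag likelihoods estimated from
# training data give discriminative class scores for a new image's tags under
# a uniform class prior (naive-Bayes-style, with add-one smoothing).
# The tags and labels below are hypothetical placeholders.
from collections import Counter, defaultdict
import math

train = [
    (["festival", "crowd", "celebration"], "positive"),
    (["administration", "meeting"],        "neutral"),
    (["battle", "protest", "crowd"],       "negative"),
]

tag_counts = defaultdict(Counter)      # class -> Counter over tags
class_totals = Counter()               # class -> total number of tags seen
for tags, label in train:
    tag_counts[label].update(tags)
    class_totals[label] += len(tags)

vocab = {t for tags, _ in train for t in tags}

def class_scores(tags):
    """Log-score of each class given an image's tags (uniform class prior)."""
    scores = {}
    for cls in class_totals:
        log_p = 0.0
        for t in tags:
            # add-one smoothed likelihood P(tag | class)
            p = (tag_counts[cls][t] + 1) / (class_totals[cls] + len(vocab))
            log_p += math.log(p)
        scores[cls] = log_p            # the uniform prior only adds a constant
    return scores

print(class_scores(["crowd", "festival"]))
```

In practice such tag-based scores would be only one input among the several modalities fused by the ensemble described in the abstract.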