A Multi-Modal Approach to Infer Image Affect

Ashok Sundaresan, Sugumar Murugesan, Sean Davis, Karthik Kappaganthu, ZhongYi Jin, Divya Jain, Anurag Maunder
Johnson Controls - Innovation Garage, Santa Clara, CA

ABSTRACT
The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state-of-the-art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.

KEYWORDS
Group affect, scene extraction, pose extraction, facial feature extraction, CENTRIST, feature aggregation, Gaussian mixture model, convolutional neural networks, image captions

1 INTRODUCTION
Using machine learning to infer the emotion of a group of people in an image or video has the potential to enhance our understanding of social interactions and improve civil life. Using modern computer vision techniques, image retrieval can be made more intuitive, doctors will be able to better diagnose psychological disorders, and security will be able to better respond to social unrest before it escalates to violence. Currently, even highly skilled professionals have a difficult time recognizing and categorizing the exact emotional state of a group. Consistent, automatic categorization of the affect of an image can only be achieved with the latest advances in machine learning coupled with precise feature extraction and fusion.

Inferring the overall emotion that an image elicits from a viewer requires an understanding of details from vastly differing scales in the image. Traditional approaches to this problem have focused on small-scale features, namely investigating the emotion of individual faces [13, 20, 33]. However, both large-scale features [8, 17] and additional action/scene recognition must be represented in the description of the image emotion [18, 25]. Each of these feature extraction methods, or modalities, cannot capture the full emotion of an image by itself.

Previous research has investigated a combination of facial features, scene extraction [17] and even audio tonality [27]. This paper combines three additional modalities with the commonly used facial and scene modalities, namely, human pose, text-based tagging and CNN extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks, with the exception of the baseline CENsus TRansform hISTogram (CENTRIST) [31] model.

Our approach combines top-down and bottom-up features. Due to the presence of multiple subjects in an image, a key problem that needs to be resolved is how to combine individual human emotions into a group level emotion. The next step is to combine multiple group level modalities. We address the first problem by using a Bag-of-Visual-Words (BOV) based approach. Our BOV based approach comprises developing a code book using two well known clustering algorithms: k-means clustering and Gaussian mixture model (GMM) based clustering. The code books from these clustering algorithms are used to fuse individual subjects' features into group image level features.

We evaluate the performance of the individual features using 4 different classification algorithms, namely, random forests, extra trees, gradient boosted trees and SVM. Additionally, the predictions from these classifiers are used to create an ensemble of classifiers to obtain the final classification results. While developing the ensemble model, we also employ predictions from an individually trained convolutional neural network (CNN) model using the group level images.

2 MODALITIES
2.1 Facial Feature Extraction
Facial features are extracted individually from isolated human images on an image-by-image basis. First, a Faster R-CNN [21] is employed to extract each of the full human frames from the image. These extracted frames are used for both the pose estimation and the facial feature extraction. Each face is then extracted and aligned with the frontal view using DeepFace [29]. Once the faces are isolated and pre-processed, a deep residual network (ResNet) [14] is employed to extract facial features.

2.1.1 Human Frame Extraction. Isolation of each significant human frame is performed on the image using a Faster R-CNN [21]. The model is trained on the Pascal VOC 2007 data set [9] using the very deep VGG-16 model [26]. The procedure for building and training this model is presented in the original Faster R-CNN manuscript [21] and will not be detailed here.

2.1.2 DeepFace. For extracting and aligning human faces in the images, Facebook's DeepFace [29] algorithm is utilized. The alignment is performed using explicit 3D facial modeling coupled with a 9-layer deep neural network. For more details on the implementation and training of the model, please see [29].

2.1.3 ResNet based feature extraction. The ResNet model is trained with the Face Emotion Recognition (FER13) dataset [13], which contains 35,887 aligned face images with a predefined training and validation set. Each face in FER13 is labeled with one of the following 7 emotions: anger, disgust, fear, happiness, sadness, surprise or neutral. To balance training time and efficacy, the ResNet-50 topology from [14] is implemented. The ResNet model is trained from scratch, with an initial learning rate of 0.1, a learning rate decay factor of 0.9 applied every 2 epochs, and a batch size of 64. This network architecture achieves 62.8% top-1 accuracy on the FER13 validation set when predicting the emotion from an image. In lieu of a fully connected layer, ResNet-50 uses global average pooling as its penultimate layer. The feature set, a vector of 2048 elements, is the output of the global average pooling layer when running inference on the isolated faces from each image.
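For illustration, the sketch below extracts 2048-element descriptors from aligned face crops by removing the final fully connected layer of a torchvision ResNet-50. This is not the trained model from the paper: the FER13-trained weights are not reproduced here, so ImageNet weights are used as a stand-in, and the preprocessing constants are the standard torchvision defaults.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-50 with the final fully connected layer replaced by an identity, so the
# forward pass returns the 2048-d output of the global average pooling layer.
# NOTE: ImageNet weights are a stand-in for the FER13-trained weights used in the paper.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_embedding(face_image: Image.Image) -> torch.Tensor:
    """Return a 2048-element descriptor for one aligned face crop."""
    x = preprocess(face_image).unsqueeze(0)   # (1, 3, 224, 224)
    with torch.no_grad():
        feature = backbone(x)                 # (1, 2048)
    return feature.squeeze(0)
```

In the paper's pipeline this step would be applied to every DeepFace-aligned face crop, yielding one such vector per detected person.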
2.2 Centrist
The CENsus TRansform hISTogram (CENTRIST) [31] is a visual descriptor predominantly used to describe topological places and scenes in an image [4, 7, 32, 34]. The Census Transform, in essence, is calculated by comparing a pixel's gray-scale intensity with that of its eight neighboring pixels, encoding the results as an 8-bit binary codeword, and converting the codeword to decimal (Fig. 1). CENTRIST, in turn, is the histogram of the Census Transform calculated over various rectangular sections of the image via spatial pyramid techniques [16]. By construction, CENTRIST is expected to capture the high-level, global features of an image such as the background, the relative locations of the persons, etc. CENTRIST is adopted by the challenge organizers as the baseline algorithm for emotion detection. The baseline accuracy provided by the organizers is 52.97% on the Validation set and 53.62% on the Test set. To achieve these accuracies, a support vector regression model was trained using features extracted by CENTRIST. In our work, we use CENTRIST as a scene-level descriptor and build various modalities on top of it to extract complementary features from the image.

2.3 Human Pose
The intention of including the pose modality in this emotion detection task is to detect similar and dissimilar poses of the individuals in the image and capture any effect these poses may have on the group emotion. The pose features are expected to work only as an indirect complement to features from other modalities. Human pose estimation is a regression problem in which the locations of specific key points in a human body are predicted. In the literature, several neural network and non-neural network methods have been used. State of the art results on PASCAL VOC have been demonstrated by Gkioxari et al. [12], [11]. For this work, we utilize the work presented in [11], which is based on an R-CNN. This method obtained a mean AP of 15.2% on the PASCAL VOC 2009 validation dataset. The method uses AlexNet [15] and builds an R-CNN [10] with region proposals generated by [2], fine-tuned from an ImageNet pretrained model on 1000 classes. The R-CNN in this work is trained to predict key points that lie within a distance of 0.2H (H is the height of the torso) from the ground truth. During test time, the fc7 embeddings are used as features to combine with other modalities.

2.4 Image Captions
2.4.1 Clarifai. One of the modalities we use in our model is to generate captions for images and use the captions to infer emotion. As a first step towards that, we first used the commercially available Clarifai API [1] to answer the following preliminary questions:

• Whether reasonably accurate captions could be generated for group-level emotion images
• Whether captions are descriptive of underlying emotions

Fig. 2 illustrates the top four tags generated by Clarifai for some of the images in the Emotiw2017 Training Dataset. These results, along with those generated for several other images in the dataset, give us confidence that a deep learning model can be reasonably accurate in generating tags for images, thus answering the first question above. As for the second question, we refer to Fig. 3, where the distribution of top tags generated by Clarifai across the three emotion classes is plotted. Note that, for several tags such as 'administration', 'election' and 'competition', there is a notable difference in the prior probability of occurrence given each class. This could translate to discriminative probability of the classes given these tags, assuming a uniform prior on the classes. This observation is even more pronounced for more obvious tags such as 'festival' and 'battle'. This answers the second question above, and motivates us to explore further.

2.4.2 im2txt. Having established that image tags can be of value in inferring emotions for a group-level image, we turned our attention to the more general 'image to caption' models. Towards this, we focused on the im2txt TensorFlow open-source model [24]. This model combines techniques from both computer vision and natural language processing to form a complete image captioning algorithm. More specifically, in machine translation between languages, a recurrent neural network (RNN) is used to transform a sentence in the 'source language' into a vector representation [3, 5, 28]. This vector is then fed into a second RNN that generates a target sentence in the 'target language'. In [30], the authors built on this approach and replaced the first RNN with a visual representation of the image produced by a deep Convolutional Neural Network (CNN) [23] up to its penultimate layer. This CNN was originally trained to detect objects, and the penultimate layer is used to feed the second RNN, which was designed to produce phrases in the original machine translation model. This end-to-end system is further trained directly on images and their captions to maximize the likelihood that the description of an image best matches the training descriptions for that image. The open-source model im2txt [24] is an implementation of the algorithm in [30]. We specifically used a Dockerized version of the im2txt model available in [22].

Fig. 4 illustrates the captions generated by im2txt in [22] for some of the images in the training dataset. In our approach, these captions are further encoded into sparse bag-of-words vectors [35]. For instance, suppose the dictionary passed to the im2txt model is made up of the words {w1, w2, ..., wn}. Then if, for an image, the caption is w3 w4 w2 w4, the bag-of-words representation of the image is {w1: 0, w2: 1, w3: 1, w4: 2, w5: 0, ..., wn: 0}. This vector is further normalized during the concatenation stage, effectively converting it into a term frequency representation [6].
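As a concrete toy example of this encoding, the snippet below maps a caption to a normalized term-frequency vector over a given vocabulary; the five-word vocabulary and the caption are the hypothetical w1...w5 example from the text, not the actual im2txt dictionary.

```python
from collections import Counter

def caption_to_tf(caption, vocabulary):
    """Encode a caption as a normalized term-frequency vector over `vocabulary`."""
    counts = Counter(caption.lower().split())
    raw = [counts.get(word, 0) for word in vocabulary]     # sparse bag-of-words counts
    total = sum(raw)
    return [c / total for c in raw] if total else raw      # term-frequency normalization

# Toy example mirroring the text: caption "w3 w4 w2 w4" over dictionary {w1, ..., w5}
vocabulary = ["w1", "w2", "w3", "w4", "w5"]
print(caption_to_tf("w3 w4 w2 w4", vocabulary))            # [0.0, 0.25, 0.25, 0.5, 0.0]
```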

(a) Image from the emotiw2017 training dataset. (b) Census Transform of the image on the left.

Figure 1: Centrist illustration
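Figure 1 illustrates the Census Transform underlying CENTRIST. A minimal NumPy sketch of the per-pixel transform and a single-cell histogram is shown below; it omits the spatial pyramid described in Section 2.2, and the comparison direction and bit order are one common convention rather than a detail taken from [31].

```python
import numpy as np

def census_transform(gray):
    """Census Transform of a 2-D grayscale image (interior pixels only).

    Each pixel is compared with its eight neighbours; every comparison
    contributes one bit, and the 8-bit codeword is returned as a decimal value.
    """
    h, w = gray.shape
    center = gray[1:h-1, 1:w-1]
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Bit is set when the neighbour does not exceed the centre pixel.
        out += (neighbour <= center).astype(np.int32) * (1 << (7 - bit))
    return out

def centrist_descriptor(gray):
    """256-bin histogram of Census Transform values (one spatial cell only)."""
    ct = census_transform(gray)
    hist, _ = np.histogram(ct, bins=256, range=(0, 256))
    return hist.astype(np.float64) / max(ct.size, 1)
```

The full CENTRIST descriptor repeats this histogram over the rectangular cells of a spatial pyramid [16] and concatenates the results.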

(a) Traditional: 0.99906313, Festival: 0.9946226, Culture: 0.99122095, Religion: 0.99001175. (b) People: 0.97925866, Drag Race: 0.964833, Police: 0.9458773, Battle: 0.9399004.

Figure 2: Top four tags generated from Clarifai in 'tag:probability' format. Image source: emotiw2017 training dataset

Figure 3: Distribution of top tags generated by Clarifai across three emotion classes for images in the emotiw2017 training dataset
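Figure 3 effectively reports, for each emotion class, how often each top tag occurs. A small sketch of that computation is given below, assuming a hypothetical list of (class label, tag list) records rather than the actual Clarifai responses.

```python
from collections import Counter, defaultdict

def tag_distribution(records):
    """Estimate the relative frequency of each tag within each class.

    `records` is an iterable of (class_label, tag_list) pairs, e.g.
    ("Positive", ["festival", "crowd"]).  Returns {class: {tag: frequency}}.
    """
    counts = defaultdict(Counter)
    for label, tags in records:
        counts[label].update(tags)
    distributions = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        distributions[label] = {tag: n / total for tag, n in counter.items()}
    return distributions

# Toy usage with made-up records
records = [("Positive", ["festival", "crowd"]),
           ("Negative", ["battle", "police"]),
           ("Neutral",  ["administration", "crowd"])]
print(tag_distribution(records)["Positive"]["festival"])   # 0.5
```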

2.5 CNN
CNNs are a proven technique for image classification. The complexity and sparsity of affect-related features in images suggest that a CNN-only solution would need a deeper and wider network and a large training dataset. However, even with a relatively small training dataset, CNNs can still be used as an effective modality for feature extraction. To reduce overfitting, the ResNet-18 architecture [14] is employed as the CNN and two models are built with the training dataset: one is trained on the original colored images and one is trained on grayscale images converted from the original ones. In cross-validation, the color model has an accuracy of 52% and the gray model 50%. This suggests that color carries affect information. For the remainder of the paper, we will only refer to the color model as our CNN model.

3 FEATURE AGGREGATION
As previously discussed in Section 2, given a group level image, our methodology consists of extracting scene related features using CENTRIST, facial features using ResNet, human pose features using a pre-trained NN and BOW descriptors using the im2txt neural network. In this section, we describe our strategy for combining the features extracted from all the group level images to build the training data for classification.

Our feature combination strategy involves concatenating all the features for each group level image. It is important to note that, in this problem of group level emotion recognition, feature concatenation is not straightforward. The feature vectors extracted from CENTRIST and the bag-of-words extracted from the im2txt NN are already on a per image basis. However, this is not the case for the facial and pose features, since these features are extracted for each human in the images and need to be aggregated across all the humans in each group level image before concatenation.

Inspired by [references], we adopt a bag-of-visual-words (BOV) based approach to construct image level aggregated feature vectors using the facial and pose features for each group level image, evaluated using the methods described in Sections 2.1 and 2.3. The BOV based feature transformation is carried out separately for the facial and pose features.

In our BOV based approach, the feature vector corresponding to each face or pose is regarded as a visual word, as a result of which the number of visual words is equal to the total number of humans extracted over all the group level images. Having defined the visual word, the BOV based feature aggregation is described as follows.

(1) k-means Based Methodology. The set of visual words is clustered using the k-means algorithm to reduce the vocabulary, which consists of the cluster labels, also referred to as visual codes. From the clusters resulting from the k-means clustering algorithm, we develop three group level feature vectors as follows.

(a) A term frequency (TF) matrix, which consists of associating each face in the group level image to a visual code and counting the occurrence of each visual code in the vocabulary. The values are also normalized by dividing each count by the total number of faces (visual words) in the image. Denoting the raw count of a visual code l for a given image m as t_{l,m}, the TF is simply given as

    tf(l, m) = t_{l,m} / \sum_{l' \in m} t_{l',m}    (1)

(b) The second feature that we evaluate using the cluster centers is the vector of locally aggregated descriptors (VLAD). VLAD is a feature encoding method that takes into account the strength of association of each visual word to its cluster center. Defining the strength of association of visual word w_i (i = 1, ..., n) to cluster k as q_{ik}, where q_{ik} \geq 0 and \sum_{i=1}^{n} q_{ik} = 1, the VLAD encoding using the cluster centroids \mu_k is defined as

    v_k = \sum_{i=1}^{n} q_{ik} (w_i - \mu_k)    (2)

The primary advantage of VLAD over the TF matrix is that a more discriminative property is added to the feature vector by taking the difference of each descriptor from the mean of its Voronoi cell. This first order statistic adds more information to the feature vector and may give us better discrimination during classification. The VLAD encodings are usually normalized (e.g. l2 normalization) before usage.

(c) The third feature vector resulting from the k-means clustering algorithm is a weighted average (WA) of the cluster centers for each visual word in the image. The weights are simply the normalized term frequencies corresponding to each visual word.

(2) GMM Based Methodology. In addition to the k-means clustering method, we also perform a dimensionality reduction of the visual codes using a GMM. The aggregation involves computing the posterior probability of each component for a given visual word, which is also defined as the responsibility that a particular component takes in explaining that visual word. The responsibilities of a given component resulting from all the faces in a given image are averaged to compute the group level image responsibility for that component. Performing the same computation for all the components using the visual words in a given image results in an image level aggregated feature vector.

It can be noted that all four feature vectors described above, namely, the TF matrix, VLAD, the weighted average of the cluster centers and the GMM based features, have dimensions N × K, where K is either the number of clusters for the k-means algorithm or the number of Gaussians for the GMM model. Since all the features are now consistent in terms of their dimensions and are group-image level features, they can be easily concatenated. The series of steps described above is carried out for both the training and the validation image sets to construct the features for classification.
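A compact sketch of the k-means based aggregation of Eqs. (1) and (2) is shown below, assuming a matrix of per-person descriptors for one group image and a codebook learned with scikit-learn's KMeans. Hard assignments are used here (q_{ik} in {0, 1}); the vocabulary size and the l2 normalization follow the choices described in the text, but this is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_person_features, k=1000):
    """Learn the visual codebook from the descriptors of every detected person."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_person_features)

def aggregate_group_image(person_features, codebook):
    """Aggregate the per-person descriptors of one group-level image.

    Returns the normalized term-frequency vector of Eq. (1) and a
    hard-assignment VLAD encoding in the spirit of Eq. (2).
    """
    k = codebook.n_clusters
    centers = codebook.cluster_centers_               # shape (k, d)
    codes = codebook.predict(person_features)         # one visual code per person

    # Eq. (1): visual-code counts normalized by the number of people in the image.
    tf = np.bincount(codes, minlength=k).astype(np.float64)
    tf /= max(len(person_features), 1)

    # Eq. (2) with hard assignments: sum of residuals within each Voronoi cell.
    vlad = np.zeros_like(centers)
    for x, code in zip(person_features, codes):
        vlad[code] += x - centers[code]
    vlad = vlad.flatten()
    norm = np.linalg.norm(vlad)
    if norm > 0:
        vlad /= norm                                  # l2 normalization
    return tf, vlad
```

The GMM variant replaces the hard assignments with component responsibilities (for example, GaussianMixture.predict_proba) averaged over the people in the image.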

(a) a woman sitting on a couch with a child in front of her; a woman sitting on a couch with a child on her lap; a woman sitting on a couch with a child in her lap. (b) a group of people sitting around a table; a group of people sitting around a table with wine glasses; a group of people sitting around a table with food.

Figure 4: Captions generated by the im2txt model. Image source: emotiw2017 training dataset

4 CLASSIFICATION RESULTS
Given the feature vectors for the group level images corresponding to both the training and the validation set, the next task is to build a classification model which can assign a given image to one of the three classes: positive, negative or neutral. To build the classification model, we adopt a two tier approach: tier-1, consisting of learning multiple independent classifiers, and tier-2, learning from the tier-1 classifiers. Such an ensemble approach has previously been used in many data science competitions.

The first tier involves training individual classifiers on each of the different modalities and optimizing their parameters for maximum validation accuracy. To this end, four different well-known classifiers are considered, namely, a random forest (RF) classifier, a gradient boosted tree (GBT) based classifier, a support vector machine (SVM) classifier and a variant of random forests called extra trees (ET). The classifiers considered are inherently different in the sense that they may learn different characteristics of the feature set to perform classification. For example, the RF classifier may be well suited for a particular group level emotion whereas the performance of the other classifiers may be below par. In a different scenario, the SVM based classifier may be better suited. Motivated by this possibility, our second tier model consists of building an ensemble of the above mentioned classifiers using a technique referred to as stacking. Also, note that the input features for these classifiers are made up of different modalities, thereby providing a diversity of information for the second tier algorithm to learn. A logistic regression model is used as the second tier classifier.

In this section, we provide more details about the features extracted from the group level images using the methods outlined in the previous sections. From the training dataset of 3630 images, the human detection and face detection pipeline extracted 19464 faces. For the validation set we extracted 11696 faces from 2066 images. For these images, we use vocabulary sizes of 200, 500 and 1000 to compute the BOV derived features. The performance of the final classification was best for a vocabulary size of 1000 and hence, for conciseness, we explain the rest of the results based on this vocabulary size. The list of all feature vectors used in tier-1, along with their dimensions, is shown in Table 1.

Table 1: Features extracted for tier-1 classifiers

Modality        Feature                               Dimension
Face emotion    Term Frequency                        1000
Face emotion    VLAD                                  1000
Face emotion    Weighted Average of Cluster Centers   2048
Face emotion    GMM                                   512
Pose            Term Frequency                        1000
Pose            VLAD                                  1000
Pose            Weighted Average of Cluster Centers   2048
Pose            GMM                                   512
Scene           Centrist                              7905
Image to Text   Bag of Words                          364

Next, we determine the performance of each of the modalities separately with the different classifiers. These results are listed in Tables 2 to 4.

Table 2: Centrist Based Features Classification Performance

                               RF     ET     GBT    SVM
Training                       .99    .98    .98    .95
Validation                     .55    .53    .52    .50

Table 3: Face Emotion Based Features Classification Performance

                               RF     ET     GBT    SVM
TF                Training     .99    .99    .90    .95
                  Validation   .56    .58    .57    .56
VLAD              Training     .99    .99    .99    .97
                  Validation   .58    .58    .59    .53
Weighted Average  Training     .99    .99    .99    .99
                  Validation   .60    .58    .57    .58
GMM               Training     .99    .99    .90    .87
                  Validation   .56    .56    .55    .58

Table 4: Pose Based Features Classification Performance

                               RF     ET     GBT    SVM
TF                Training     .99    .99    .99    .98
                  Validation   .39    .40    .42    .38
VLAD              Training     .99    .99    .99    .99
                  Validation   .41    .40    .41    .40
Weighted Average  Training     .99    .99    .99    .99
                  Validation   .40    .41    .41    .42
GMM               Training     .99    .99    .95    .96
                  Validation   .40    .43    .41    .40
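The two-tier setup described in this section maps directly onto scikit-learn's stacking API. The sketch below is illustrative only: `X_train` and `y_train` stand for the concatenated group-level features and their three-class labels, and the hyperparameters are placeholders rather than the tuned values behind the reported results.

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Tier-1: the four base classifiers considered in the paper.
tier1 = [
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=500, random_state=0)),
    ("gbt", GradientBoostingClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Tier-2: logistic regression stacked on the tier-1 class probabilities.
ensemble = StackingClassifier(
    estimators=tier1,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)

# X_train / y_train (and X_val / y_val) are placeholders for the concatenated
# group-level feature matrices and the positive/negative/neutral labels.
# ensemble.fit(X_train, y_train)
# print(ensemble.score(X_val, y_val))
```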

The classification performance for the features apart from pose meets or exceeds the benchmark performance [8]. Since the performance of the pose based features is poor, we do not include them in the concatenated features. Also, it is interesting to note that the performance of the concatenated features is better than the performance of the individual features. Among the individual modality features, the weighted average features for the face emotion modality using random forests achieve the best validation accuracy of 60.22%. We observed that the concatenated features provide the best classification accuracy on the validation set (64%). Hence, we use them as the basis for the stacking classifier.

The probabilities predicted by the individual classifiers, with the concatenated feature set as the input, were used to train a three class logistic regression classifier to perform stacking. This classifier improves the validation accuracy by 4%, resulting in an overall validation accuracy of 68%.

5 CONCLUSIONS AND FUTURE DIRECTIONS
The problem of determining the affect of a group of people is considered in this paper. A multi-modal approach consisting of features extracted from the scene, human face, human pose and image tags is adopted. To extract the human face based features, we developed a pipeline that consists of R-CNNs for extracting humans from the group level images, a CNN [29] to crop the faces and a conventional face alignment method that uses regression trees.

One of the main contributions of this work is the training and usage of deep neural networks (DNNs) to extract face and pose features. Furthermore, the DNNs were trained on external datasets that contained a larger number of images, suitable for learning the large number of parameters in a DNN. This approach overcomes the inadequacy of the current dataset, which does not have sufficient samples to train a DNN. Additionally, we also employed the bag-of-visual-words based approach to translate multiple human level features to group level features. Using a combination of these group level features from multiple modalities, a validation accuracy of 64% was achieved. Finally, a stacking methodology was employed to build an ensemble of classifiers, which resulted in a validation accuracy of 68%.

One key observation in this work is that our methodology was not able to exploit the pose based features in an optimal fashion. This needs to be investigated further. Also, improvements could be made to the methodology adopted for the image to text based feature extraction. Our current method extracted too few image tags, which could be limiting the performance of the classifier.

Weak supervision [19] has been proven to be an effective way of obtaining additional labeled data. As future work, we are planning to use Google reverse-image search as our first labeling function to collect images similar to the ones in the training dataset. In the same process, context information such as human annotated tags and summary information will also be gathered for each newly collected image. We will then apply a second NLP based labeling function to label the new images. Our initial experiment shows that with the newly labeled images, the ResNet-18 model has a 1% improvement in terms of accuracy.

REFERENCES
[1] 2017. Clarifai. https://www.clarifai.com/. (2017).
[2] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. 2014. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 328–335.
[3] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (2015).
[4] G. Cheng, J. Han, L. Guo, and T. Liu. 2015. Learning coarse-to-fine sparselets for efficient object detection and scene classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1173–1181. DOI: http://dx.doi.org/10.1109/CVPR.2015.7298721
[5] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proc. Empirical Methods Natural Lang. Process. (2014).
[6] Tommy W.S. Chow, Haijun Zhang, and M.K.M. Rahman. 2009. A new document representation using term frequency and vectorized graph connectionists with application to document retrieval. Expert Systems with Applications 36, 10 (2009), 12023–12035.
[7] G. Costante, T. A. Ciarfuglia, P. Valigi, and E. Ricci. 2013. A transfer learning approach for multi-cue semantic place recognition. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2122–2129. DOI: http://dx.doi.org/10.1109/IROS.2013.6696653
[8] Abhinav Dhall, Jyoti Joshi, Karan Sikka, Roland Goecke, and Nicu Sebe. 2015. The more the merrier: Analysing the affect of a group of people in images. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1. IEEE, 1–8.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. (2007).
[10] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[11] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. 2014. R-CNNs for Pose Estimation and Action Detection.
[12] Georgia Gkioxari, Bharath Hariharan, Ross Girshick, and Jitendra Malik. 2014. Using k-poselets for detecting people and localizing their keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3582–3589.
[13] Ian J. Goodfellow, Dumitru Erhan, Pierre-Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Jingjing Xie, Lukasz Romaszko, Bing Xu, Zhang Chuang, and Yoshua Bengio. 2015. Challenges in representation learning: A report on three machine learning contests. Neural Networks 64 (2015), 59–63. DOI: http://dx.doi.org/10.1016/j.neunet.2014.09.005. Special Issue on Deep Learning of Representations.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[16] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2. 2169–2178. DOI: http://dx.doi.org/10.1109/CVPR.2006.68
[17] Jianshu Li, Sujoy Roy, Jiashi Feng, and Terence Sim. 2016. Happiness Level Prediction with Sequential Inputs via Multiple Regressions. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016). ACM, New York, NY, USA, 487–493. DOI: http://dx.doi.org/10.1145/2993148.2997636
[18] Alexander Patrick Mathews, Lexing Xie, and Xuming He. 2016. SentiCap: Generating Image Descriptions with Sentiments. In AAAI. 3574–3580.
[19] Alexander J. Ratner, Stephen H. Bach, Henry R. Ehrenberg, and Chris Ré. 2017. Snorkel: Fast Training Set Generation for Information Extraction. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 1683–1686. DOI: http://dx.doi.org/10.1145/3035918.3056442
[20] Bhargava Reddy, Ye-Hoon Kim, Sojung Yun, Junik Jang, and Soonhyuk Hong. 2016. End to End Deep Learning for Single Step Real-Time Facial Expression Recognition. In Video Analytics. Face and Facial Expression Recognition and Audience Measurement - Third International Workshop, VAAM 2016, and Second International Workshop, FFER 2016, Cancun, Mexico, December 4, 2016, Revised Selected Papers. 88–97. DOI: http://dx.doi.org/10.1007/978-3-319-56687-0_8
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[22] Siavash Sefid-Rodi. 2016. im2txt demo. https://github.com/siavash9000/im2txt_demo. (2016).
[23] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. 2014. OverFeat: Integrated recognition, localization and detection using convolutional networks. International Conference on Learning Representations (2014).
[24] Chris Shallue. 2016. im2txt. https://github.com/tensorflow/models/tree/master/im2txt. (2016).
[25] Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. 2016. Image Captioning with Sentiment Terms via Weakly-Supervised Sentiment Dataset. In BMVC.
[26] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[27] Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu. 2016. LSTM for dynamic emotion and group emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 451–457.
[28] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. Proc. Neural Inf. Process. Syst. (2014).
[29] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2017. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 4 (April 2017), 652–663. DOI: http://dx.doi.org/10.1109/TPAMI.2016.2587640
[31] Jianxin Wu and Jim M Rehg. 2011. CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 8 (2011), 1489–1501.
[32] M. Yang, B. Li, H. Fan, and Y. Jiang. 2015. Randomized spatial pooling in deep convolutional networks for scene recognition. In 2015 IEEE International Conference on Image Processing (ICIP). 402–406. DOI: http://dx.doi.org/10.1109/ICIP.2015.7350829
[33] Zhiding Yu and Cha Zhang. 2015. Image Based Static Facial Expression Recognition with Multiple Deep Network Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). ACM, New York, NY, USA, 435–442. DOI: http://dx.doi.org/10.1145/2818346.2830595
[34] L. Zhang, X. Zhen, and L. Shao. 2014. Learning Object-to-Class Kernels for Scene Classification. IEEE Transactions on Image Processing 23, 8 (Aug 2014), 3241–3253. DOI: http://dx.doi.org/10.1109/TIP.2014.2328894
[35] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1, 1 (Dec 2010), 43–52. DOI: http://dx.doi.org/10.1007/s13042-010-0001-0
