A Multi-Modal Approach to Infer Image Aect
Ashok Sundaresan, Sugumar Murugesan Sean Davis, Karthik Kappaganthu, ZhongYi Jin, Divya Jain Anurag Maunder Johnson Controls - Innovation Garage Santa Clara, CA
ABSTRACT Our approach combines the top-down and boom-up features. e group aect or emotion in an image of people can be inferred Due to the presence of multiple subjects in an image, a key problem by extracting features about both the people in the picture and the that needs to be resolved is to combine individual human emotions overall makeup of the scene. e state-of-the-art on this problem to a group level emotion. e next step is to combine multiple investigates a combination of facial features, scene extraction and group level modalities. We address the rst problem by using a Bag- even audio tonality. is paper combines three additional modali- of-Visual-Words (BOV) based approach. Our BOV based approach ties, namely, human pose, text-based tagging and CNN extracted comprises developing a code book using two well known clustering features / predictions. To the best of our knowledge, this is the algorithms: k-means clustering and a Gaussian mixture model rst time all of the modalities were extracted using deep neural (GMM) based clustering. e code books from these clustering networks. We evaluate the performance of our approach against algorithms are used to fuse individual subjects’ features to a group baselines and identify insights throughout this paper. image level features We evaluate the performance of individual features using 4 dif- KEYWORDS ferent classication algorithms, namely, random forests, extra trees, gradient boosted trees and SVM. Additionally, the predictions from Group aect, scene extraction, pose extraction, facial feature ex- these classiers are used to create an ensemble of classiers to traction, CENTRIST, feature aggregation, Gaussian mixture model, obtain the nal classication results. While developing the ensem- convolutional neural networks, deep learning, image captions ble model, we also employ predictions from a individually trained convolutional neural network (CNN) model using the group level 1 INTRODUCTION images. Using machine learning to infer the emotion of a group of people in an image or video has the potential to enhance our understanding 2 MODALITIES of social interactions and improve civil life. Using modern com- puter vision techniques, image retrieval can be made more intuitive, 2.1 Facial Feature Extraction doctors will be able to beer diagnose psychological disorders and Facial features are extracted individually from isolated human im- security will be able to beer respond to social unrest before it ages on an image-by-image basis. First, a Faster R-CNN [21] is escalates to violence. Currently, even highly skilled professionals employed to extract the each of the full human frames from the have a dicult time recognizing and categorizing the exact emo- image. ese extracted frames are used for both the pose estimation tional state of a group. Consistent, automatic categorization of the and the facial feature extraction. Each face is then extracted and aect of an image can only be achieved with the latest advances aligned with the frontal view using Deepface [29]. Once the faces in machine learning coupled with precise feature extraction and are isolated and pre-processed, a deep residual network (ResNet) fusion. [14] is employed to extract facial features. Inferring the overall emotion that an image elicits from a viewer requires an understanding of details from vastly diering scales in 2.1.1 Human Frame Extraction. Isolation of each signicant hu- the image. Traditional approaches to this problem have focused on man frame is performed on the image using a Faster R-CNN [21]. arXiv:1803.05070v1 [cs.CV] 13 Mar 2018 small-scale features, namely investigating the emotion of individual e model is trained on the Pascal VOC 2007 data set [9] using the faces [13, 20, 33]. However, both large-scale features [8, 17] and very deep VGG-16 model [26]. e procedure for building and train- additional action/scene recognition must be represented in the ing this model is presented in the original Faster R-CNN manuscript description of the image emotion [18, 25]. Each of these feature [21] and will not be detailed here. extraction methods, or modalities, cannot capture the full emotion of an image by itself. 2.1.2 Deepface. For extracting and aligning human faces in the Previous research has investigated a combination of facial fea- images, Facebook’s DeepFace [29] algorithm is utilized. e align- tures, scene extraction [17] and even audio tonality [27]. is paper ment is performed using explicit 3D facial modeling coupled with combines three additional modalities with commonly used facial a 9-layer deep neural network. For more details on the implemen- and scene modalities, namely, human pose, text-based tagging and tation and training of the model, please see [29]. CNN extracted features / predictions. To the best of our knowl- edge, this is the rst time all of the modalities were extracted using 2.1.3 ResNet based feature extraction. e ResNet model is trained deep neural networks with the exception of the baseline CENsus with the Face Emotion Recognition (FER13) dataset [13], which TRansform hISTogram (CENTRIST) [31] model. contains 35,887 aligned face images with a predened training and validation set. Each face in FER13 is labeled with one of the follow- 2.4 Image Captions ing 7 emotions: anger, disgust, fear, happiness, sadness, surprise 2.4.1 ClarifAI. One of the modalities we used in our model is or neutral. To balance training time and ecacy, the ResNet-50 to generate captions for images and use captions to infer emotion. topology from 14 is implemented. e ResNet model is trained from As a rst step towards that, we rst proceeded to use commercially scratch, with an initial learning rate of 0.1, a learning rate decay available ClarifAI API [1] to answer the following preliminary factor of 0.9 which decays every 2 epochs and a batch size of 64. questions: is network architecture is able to achieve a 62.8% top-1 accuracy on the validation FER13 set when predicting the emotion from an • Whether reasonably accurate captions could be generated image. In lieu of a fully connected layer, ResNet-50 uses global for group-level emotion images average pooling as its penultimate layer. e feature set, which is • Whether captions are descriptive of underlying emotions a vector of 2048 elements, is the output from the global average Fig. 2 illustrates the top four tags generated by Clarifai for some pooling layer when running inference on the isolated faces from of the images in the Emotiw2017 Training Dataset. ese results each image. along with those generated for several other images in the dataset provide us condence that a deep learning model can be reason- ably accurate in generating tags for images, thus answering the 2.2 Centrist rst question above. As for the second question, we refer to Fig. 3. Here, the distribution of top tags generated by Clarifai across e CENsus TRansform hISTogram (CENTRIST) [31] is a visual three emotion classes is ploed. Note that, for several tags such descriptor predominantly used to describe topological places and as ‘administration’, ‘election’, ‘competition’, there is a notable dif- scenes in an image [4, 7, 32, 34]. Census Transform, in essence, is ference in prior probability of occurrence given each class. is calculated by comparing a pixel’s gray-scale intensity with that of could translate to discriminative probability of classes given these its eight neighboring pixels, encoding the results as 8-bit binary tags, assuming uniform prior on the classes. is observation is codeword, and converting the codeword to decimal (Fig. 1). CEN- even more pronounced for more obvious tags such as ‘festival’ and TRIST, in turn, is the histogram of the Census Transform calculated ‘bale’. is answers the second question above, and motivates us over various rectangular sections of the image via Spatial Pyramid to explore further. techniques [16]. By construction, CENTRIST is expected to capture the high-level, global features of an image such as the background, 2.4.2 im2txt. Having motivated ourselves that image tags can the relative locations of the persons, etc. CENTRIST is adopted by be of value in inferring emotions on a group-level image, we turned the challenge organizers as the baseline algorithm for emotion de- our aention to the more general ‘image to caption’ models. To- tection. e baseline accuracy provided by the organizers is 52.97% wards this, we focused on the im2txt TensorFlow open-source model on the Validation set and 53.62% on the Test set. To achieve these [24]. is model combined techniques from both computer vision accuracies, a support vector regression model was trained using and natural language processing to form a complete image caption- features extracted by CENTRIST. In our work, we use CENTRIST ing algorithm. More specically, in machine translation between as a scene-level descriptor and build various modalities on top of it languages, a Recurrent Neural Network (RNN) is used to trans- to extract complementary features from the image. form a sentence in ‘source language’ into a vector representation [3, 5, 28]. is vector is further fed into a second RNN that gener- ates a target sentence in the ‘target language’. In [30], the authors derived from this approach and replaced the rst RNN with a vi- 2.3 Human Pose sual representation of an image using a deep Convolutional Neural e intention of including pose modality in this emotion detection Network (CNN) [23] until the pen-ultimate layer. is CNN was task is to detect similar and dissimilar poses of the individuals in the originally trained to detect objects, and the pen-ultimate layer is image and capture any eect of these poses may have towards the used to feed the second RNN designed to produce phrases in the group emotion. e pose features are expected to work only as an original machine translation model. is end-to-end system is fur- indirect complement to features from other modalities. Human pose ther trained directly on images and their captions to maximize estimation is a regression problem in which the location of specic the likelihood that the description on an image best matches the key points in a human body are predicted. In literature several training descriptions for that image. e open-source model im2txt neural network and non-neural network methods have been used. [24] is an implementation of the algorithm in [30]. We specically State of the art results on PASCAL VOC have been demonstrated used a Dockerized version of the im2txt model available in [22]. by Gkioxari et.al. [12], [11]. For this work, we utilize the work Fig. 4 illustrates the captions generated by im2txt in [22] for presented in [11] which is based on a R-CNN. is method obtained some of the images in the training dataset. In our approach, these a mean AP of 15.2% PASCAl VOC 2009 validation dataset. e captions are further encoded into sparse bag-of-word vectors [35]. method uses AlexNet [15] and builds an R-CNN [10] with region For instance, if the dictionary passed to the im2txt model is made of proposals generated by [2] and ne tune from ImageNet pertained words {w1,w2,... wn }. en if, for an image, caption is w3w4w2w4, model on 1000 classes. e R-CNN in this work is trained to predict the bag-of-words representation of the image is: {w1 : 0,w2 : 1,w3 : key points to be within a distance of .2 H (H is the height of the 1,w4 : 2,w5 : 0 ... wn : 0}. is vector is further normalized during torso) from the ground truth. During test time, the fc7 embeddings the concatenation stage, eectively converting this into a term are used as features to combine with other modalities. frequency representation [6]. 2 (a) Image from emotiw2017 training dataset (b) Centroid Transform of image to the le
Figure 1: Centrist illustration
(a) Traditional: 0.99906313, Festival: 0.9946226, Culture: 0.99122095, Religion: (b) People: 0.97925866, Drag Race: 0.964833, Police:0.9458773, Bale:0.9399004 0.99001175
Figure 2: Top four tags generated from ClarifAI in ‘tag:probability’ format. Image source: emotiw2017 training dataset
2.5 CNN 3 FEATURE AGGREGATION CNNs are a proven technique for image classication. e com- As previously discussed in Section , given a group level image, plexity and sparsity of aection related features in images suggest our methodology consists of extracting scene related features using a CNN only solution will need a deeper and wider network and CENTRIST, facial features using RESNET, human pose features us- a large training dataset. However, even with a relatively small ing a pre-trained NN and BOW descriptors using the im2txt neural training dataset, CNNs can still be used as an eective modality for network. In this section, we describe our strategy for combining fea- feature extraction. To reduce overing, the Resenet18 architecture tures extracted from all the group level images to build the training [14] is employed as the CNN and two models are built with the data for classication. training dataset: one is trained on original colored image and one Our feature combination strategy involves concatenating all is trained on grayscale images converted from the original ones. In the features for each group level image. It is important to note cross-validation, the color-model has an accuracy of 52% and the that, in this problem of group level emotion recognition, feature gray-model 50%. is suggests color contains aection information. concatenation is not straight-forward. e feature vectors extracted For the remaining of the paper, we will only refer the color model from CENTRIST and the bag-of-words extracted from the im2txt as our CNN model. NN are already on a per image basis.However, this is not the case
3 takes into account the strength of association of each visual word to its cluster center. Dening the strength of association of each visual word wi (i = 1,...,n) to cluster k as qik , where qik ≥ Ín 0 and i=1 qik = 1, the VLAD encoding using the cluster centroids µk is dened as,
n Õ vk = qik (wik − µk ) (2) i=1 e primary advantage of VLAD over the TF matrix is that more discriminative property is added in our feature vector by taking the dierence of each descrip- tor from the mean in its voronoi cell. is rst order statistic adds more information in the feature vector and may give us beer discrimination during classi- cation. e VLAD encodings are usually normalized (e.g. l2 normalization) before usage. Figure 3: Distribution of top tags generated by Clarifai, (c) e third feature vector, resulting from the k-means across three emotion classes for images in emotiw 2017 clustering algorithm is a weighted average (WA) of training dataset the cluster centers for each visual word in the im- age. e weights are nothing but the normalized term frequencies corresponding to each visual word. for the facial and pose features since these features are extracted (2) GMM Based Methodology In addition to the k-means clus- for each human in the images and need to be aggregated across all tering method, we also perform a dimensionality reduction the humans for each group level image, before concatenation. of the visual codes using GMM. e aggregation involves Inspired by [references], we adopt a bag-of-visual-words (BOV) computing the posterior probability of each component for based approach to construct image level aggregated feature vec- a given visual word, which is also dened as the responsi- tors using the facial and pose features for each group level image bility that a particular component takes in explaining that evaluated using the methods described in Section and Section. e visual word. e responsibilities of a given component BOV based feature transformation is carried out separately for both resulting from all the faces in a given image are averaged facial and pose features. to compute the group level image responsibility for that In our BOV based approach, the feature vector corresponding component. Performing the same computation for all the to each face or pose, is regarded a visual word as a result of which components using the visual words in a given image results the number of visual words is equal to the total number of humans in an image level aggregated feature vector. extracted over all the group level images. Having dened the visual It can be noted that all the four feature vectors described above, word, the BOV based feature aggregation is described as follows. namely, the TF matrix, VLAD, the weighted average of the cluster (1) k-means Based Methodology e set of visual words is centers and the GMM based features have dimensions N ×K, where clustered using the k-means algorithm to reduce the vocab- K is either the number of clusters for the k-means algorithm or the ulary which consists of the cluster labels, also referred to number of Gaussians for the GMM model. Since all the features as visual codes. From clusters resulting from the k-means are now consistent in terms of their dimensions and are group- clustering algorithm, we develop three group level feature image level features, they can be easily concatenated. e series of vectors as follows. above described steps are carried out for both the training and the (a) A term frequency (TF) matrix which consists of the validation image sets to construct the features for classication. associating each face in the group level image to a visual code and the counting the occurrence of each 4 CLASSIFICATION RESULTS visual code in the vocabulary. e values are also Given the feature vectors for group level images corresponding to normalized by dividing each count by the total number both the training and the validation set, the next task is to build a of faces (visual words) in the image. Denoting the raw classication model which can assign a given image to one of the count of a visual code l for a given image m as tl,m, three classes: positive, negative or neutral. To build the classi- the TF is simply given as, cation model, we adopt a two tier approach: tier-1, consisting of learning multiple independent classiers and, tier-2 learning from tl,m t f (i,m) = (1) tier-1 classiers. Such an ensemble approach has previously been Í 0 t 0 l ∈m l ,m used in many data science competitions. (b) e second feature, that we evaluate using the clus- e rst tier involves training individual classiers on each of ter center is the vector of locally aggregated descrip- the dierent modalities and optimizing their parameters that result tors (VLAD). VLAD is a feature encoding method that in the maximum validation accuracy. To this end, four dierent 4 (a) a woman siing on a couch with a child in front of her, a woman siing on a couch (b) a group of people siing around a table, a group of people siing around a table with with a child on her lap, a woman siing on a couch with a child in her lap wine glasses, a group of people siing around a table with food
Figure 4: Captions generated by im2txt model. Image source: emotiw2017 training dataset well-known classiers are considered, namely, a random forest (RF) Modalities Feature Dimension classier, gradient boosted tree (GBT) based classier, support vec- Term Frequency 1000 tor machine (SVM) classier and a variant of random forests called Face VLAD 1000 as extra trees (ET). Each of the classiers considered are inherently emotion Weighted Average of Cluster Center 2048 dierent in the sense that they may learn dierent characteristics GMM 512 of the feature set to perform classication. For example, the RF Term Frequency 1000 VLAD 1000 classier may be well suited for a particular group level emotion Pose whereas the performance of other classiers may be below par. In Weighted Average of Cluster Centers 2048 a dierent scenario, we may have a situation where the SVM based GMM 512 classier is beer suited. Motivated by this possibility, our second Scene Centrist 7905 tier model consists of building an ensemble of the above mentioned Image to Text Bag of Words 364 classiers using a technique referred to as stacking. Also, note that Table 1: Features extracted for tier-1 classiers the input features for each of these classiers are also made up of dierent modalities thereby providing a diversity of information for the second tier algorithm to learn. A Logistic regression model RF ET GBT SVM is used as the second tier classier. Training .99 .98 .98 .95 In this section, we provide more details about the features ex- Validation .55 .53 .52 .50 tracted from the group level images using the methods outlined in Table 2: Centrist Based Features Classication Performance the previous sections. From the training dataset of 3630 images, the human detection and face detection pipeline extracted 19464 faces. For the validation RF ET GBT SVM set we extracted 11696 faces from 2066 images. For these images, TF Training .99 .99 .90 .95 we use vocabulary sizes of 200, 500 and 1000 to compute the BOV Validation .56 .58 .57 .56 derived features. e performance of the nal classication was VLAD Training .99 .99 .99 .97 best for a vocabulary size of 1000 and hence for conciseness we will Validation .58 .58 .59 .53 explain the rest of the results based on this vocabulary size. e Weighted Average Training .99 .99 .99 .99 list of all feature vectors used in tier-1, along with their dimensions Validation .60 .58 .57 .58 are shown in Table 1. GMM Training .99 .99 .90 .87 Validation .56 .56 .55 .58 Next, we determine the performance of each of the modalities Table 3: Face Emotion Based Features Classication Perfor- separately with dierent classiers. ese results are listed in Ta- mance bles 2 to 4.
5 RF ET GBT SVM One key observation that was noted in this work is that our TF Training .99 .99 .99 .98 methodology was not able to exploit pose based features in an opti- Validation .39 .40 .42 .38 mal fashion. is needs to be investigated further. Also, improve- VLAD Training .99 .99 .99 .99 ments could be made to the methodology adopted for the image Validation .41 .40 .41 .40 to text based feature extraction. our current method extracted too Weighted Average Training .99 .99 .99 .99 few image tags which could be limiting the performance of the Validation .40 .41 .41 .42 classier. GMM Training .99 .99 .95 .96 Weak supervision [19] has been proven as an eective way of ob- Validation .40 .43 .41 .40 taining additional labeled data. As for future work, we are planning Table 4: Pose Based Features Classication Performance to use google reverse-image search as our rst labeling function to collect similar images as the one in the training dataset. In the same process, context information such as human annotated tags and summary information will also be gathered for each newly collected image. We will then apply a second NLP based labeling function to label the new images. Our initial experiment shows that with the newly labeled images, the Restnet18 model has 1% improvement in terms of accuracy.
e classication performance for the features apart from pose REFERENCES meet or exceed the benchmark performance [8]. Since the pose [1] 2017. Clarifai. hps://www.clarifai.com/. (2017). based features is poor, we do not include it in the concatenated [2] Pablo Arbelaez,´ Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jiten- features. Also, it is interesting to note that the performance of the dra Malik. 2014. Multiscale combinatorial grouping. In Proceedings of the IEEE conference on computer vision and paern recognition. 328–335. concatenated features is beer than the performance of individual [3] D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by features. Among the individual modality features, the weighted jointly learning to align and translate. International Conference on Learning average features for face emotion modality using random forests Representations (2015). [4] G. Cheng, J. Han, L. Guo, and T. Liu. 2015. Learning coarse-to-ne sparselets achieves the best validation accuracy of 60.22%. We observed that for ecient object detection and scene classication. In 2015 IEEE Conference on the concatenated features provide the best classication accuracy Computer Vision and Paern Recognition (CVPR). 1173–1181. DOI:hp://dx.doi. on the validation set (64%). Hence, we use it as the basis for the org/10.1109/CVPR.2015.7298721 [5] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. stacking classier. 2014. Learning phrase representations using RNN encoder-decoder for statistical e probabilities predicted from the individual classiers with machine translation. Proc. Empirical Methods Natural Lang. Process. (2014). [6] Tommy W.S. Chow, Haijun Zhang, and M.K.M. Rahman. 2009. A new document the concatenated feature set as the input, was used to train a three representation using term frequency and vectorized graph connectionists with class Logistic Regression classier to perform stacking. is classi- application to document retrieval. Expert Systems with Applications 36, 10 (2009), er improves the validation accuracy by 4% resulting in an overall 12023 – 12035. [7] G. Costante, T. A. Ciarfuglia, P. Valigi, and E. Ricci. 2013. A transfer learning validation accuracy of 68%. approach for multi-cue semantic place recognition. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2122–2129. DOI:hp://dx.doi.org/ 10.1109/IROS.2013.6696653 5 CONCLUSIONS AND FUTURE DIRECTIONS [8] Abhinav Dhall, Jyoti Joshi, Karan Sikka, Roland Goecke, and Nicu Sebe. 2015. e more the merrier: Analysing the aect of a group of people in images. e problem of determining the aect of a group of people is consid- In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International ered in this paper. A multi-modal approach consisting of features Conference and Workshops on, Vol. 1. IEEE, 1–8. [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, extracted from the scene, human face, human pose and image tags and A. Zisserman. 2007. e PASCAL Visual Object Classes is adopted. To extract the human face based features, we developed Challenge 2007 (VOC2007) Results. hp://www.pascal- a pipeline that consists of R-CNNs for extracting humans from the network.org/challenges/VOC/voc2007/workshop/index.html. (2007). [10] Ross Girshick, Je Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich group level images, a CNN [29] to crop the faces and a conventional feature hierarchies for accurate object detection and semantic segmentation. In face alignment method that uses regression trees. Proceedings of the IEEE conference on computer vision and paern recognition. One of the main contributions of this work is the training and 580–587. [11] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. 2014. R-CNNs for Pose usage of deep neural networks (DNNs) to extract face and pose Estimation and Action Detection. features. Furthermore, the DNNs were trained on external datasets [12] Georgia Gkioxari, Bharath Hariharan, Ross Girshick, and Jitendra Malik. 2014. Us- ing k-poselets for detecting people and localizing their keypoints. In Proceedings that contained more number of images suitable for learning the of the IEEE Conference on Computer Vision and Paern Recognition. 3582–3589. large number of parameters in a DNN. is approach overcomes of [13] Ian J. Goodfellow, Dumitru Erhan, Pierre-Luc Carrier, Aaron Courville, Mehdi the inadequacy of the current dataset, which does not have sucient Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David aler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, samples to train a DNN. Additionally, we also employed the bag- Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu of-visual-words based approach to translate multiple human level Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Jingjing Xie, Lukasz features to group level features. Using a combination of these group Romaszko, Bing Xu, Zhang Chuang, and Yoshua Bengio. 2015. Challenges in representation learning: A report on three machine learning contests. Neural level features from multiple modalities, a validation accuracy of Networks 64 (2015), 59 – 63. DOI:hp://dx.doi.org/10.1016/j.neunet.2014.09.005 64% percent was achieved. Finally, a stacking methodology was Special Issue on Deep Learning of Representations. [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual employed to build an ensemble of classiers which resulted in a Learning for Image Recognition. CoRR abs/1512.03385 (2015). hp://arxiv.org/ validation accuracy of 68%. abs/1512.03385 6 [15] Alex Krizhevsky, Ilya Sutskever, and Georey E Hinton. Imagenet classication with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105. [16] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In 2006 IEEE Com- puter Society Conference on Computer Vision and Paern Recognition (CVPR’06), Vol. 2. 2169–2178. DOI:hp://dx.doi.org/10.1109/CVPR.2006.68 [17] Jianshu Li, Sujoy Roy, Jiashi Feng, and Terence Sim. 2016. Happiness Level Prediction with Sequential Inputs via Multiple Regressions. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016). ACM, New York, NY, USA, 487–493. DOI:hp://dx.doi.org/10.1145/2993148.2997636 [18] Alexander Patrick Mathews, Lexing Xie, and Xuming He. 2016. SentiCap: Gen- erating Image Descriptions with Sentiments.. In AAAI. 3574–3580. [19] Alexander J. Ratner, Stephen H. Bach, Henry R. Ehrenberg, and Chris Re.´ 2017. Snorkel: Fast Training Set Generation for Information Extraction. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). ACM, New York, NY, USA, 1683–1686. DOI:hp://dx.doi.org/10.1145/3035918. 3056442 [20] Bhargava Reddy, Ye-Hoon Kim, Sojung Yun, Junik Jang, and Soonhyuk Hong. 2016. End to End Deep Learning for Single Step Real-Time Facial Expression Recognition. In Video Analytics. Face and Facial Expression Recognition and Au- dience Measurement - ird International Workshop, VAAM 2016, and Second International Workshop, FFER 2016, Cancun, Mexico, December 4, 2016, Revised Selected Papers. 88–97. DOI:hp://dx.doi.org/10.1007/978-3-319-56687-0 8 [21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99. [22] Siavash Sed-Rodi. 2016. im2txt demo. hps://github.com/siavash9000/im2txt demo. (2016). [23] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. 2014. Overfeat: Integrated recognition localization and detection using convolutional networks. International Conference on Learning Representations (2014). [24] Chris Shallue. 2016. im2txt. hps://github.com/tensorow/models/tree/master/ im2txt. (2016). [25] Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. 2016. Image Captioning with Sentiment Terms via Weakly-Supervised Sentiment Dataset.. In BMVC. [26] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional net- works for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014). [27] Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu. 2016. LSTM for dynamic emotion and group emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 451–457. [28] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. Proc. Neural Inf. Process. Syst. (2014). [29] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. DeepFace: Closing the Gap to Human-Level Performance in Face Verication. In e IEEE Conference on Computer Vision and Paern Recognition (CVPR). [30] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2017. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Paern Analysis and Machine Intelligence 39, 4 (April 2017), 652–663. DOI: hp://dx.doi.org/10.1109/TPAMI.2016.2587640 [31] Jianxin Wu and Jim M Rehg. 2011. CENTRIST: A visual descriptor for scene categorization. IEEE transactions on paern analysis and machine intelligence 33, 8 (2011), 1489–1501. [32] M. Yang, B. Li, H. Fan, and Y. Jiang. 2015. Randomized spatial pooling in deep convolutional networks for scene recognition. In 2015 IEEE International Con- ference on Image Processing (ICIP). 402–406. DOI:hp://dx.doi.org/10.1109/ICIP. 2015.7350829 [33] Zhiding Yu and Cha Zhang. 2015. Image Based Static Facial Expression Recogni- tion with Multiple Deep Network Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI ’15). ACM, New York, NY, USA, 435–442. DOI:hp://dx.doi.org/10.1145/2818346.2830595 [34] L. Zhang, X. Zhen, and L. Shao. 2014. Learning Object-to-Class Kernels for Scene Classication. IEEE Transactions on Image Processing 23, 8 (Aug 2014), 3241–3253. DOI:hp://dx.doi.org/10.1109/TIP.2014.2328894 [35] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics 1, 1 (01 Dec 2010), 43–52. DOI:hp://dx.doi.org/10.1007/ s13042-010-0001-0
7