
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

User-Centric Emotion Prediction of Images

Sicheng Zhao, Hongxun Yao, Wenlong Xie, Xiaolei Jiang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
[email protected]; [email protected]; [email protected]; [email protected]
Appendix: https://sites.google.com/site/schzhao/

Abstract

We propose to predict the personalized emotion perceptions of images for each viewer. Different factors that may influence emotion perceptions, including visual content, social context, temporal evolution, and location influence, are jointly investigated via the presented rolling multi-task hypergraph learning. For evaluation, we set up a large-scale image emotion dataset from Flickr, named Image-Emotion-Social-Net, with over 1 million images and about 8,000 users. Experiments conducted on this dataset demonstrate the superiority of the proposed method compared with the state of the art.

Figure 1: Illustration of personalized image emotion perceptions in social networks. Related emotion labels are obtained using the keywords in red. [The figure shows (a) an original Flickr image; (b) its metadata: title "Light in Darkness", tags "london, stormy, dramatic weather, ...", and description "...there was a break in the clouds ... while the sky behind remained somewhat foreboding. I thought it made for a pretty intense scene..."; (c) the expected emotion: sentiment negative, valence 4.1956, arousal 4.4989, dominance 4.8378; and (d) personalized emotion labels from different viewers' comments, e.g., "That sky is amazing" (positive; V 7.121, A 4.479, D 6.635), "Exciting drama at its best" (excitement, positive; V 7.950, A 6.950, D 7.210), and "Hey, it really frightened me!" (fear, negative; V 2.625, A 5.805, D 3.625).]

Motivation

Images can convey rich semantics and evoke strong emotions in viewers. Most existing works on image emotion analysis have tried to find features that can better express emotions so as to bridge the affective gap (Zhao et al. 2014). These methods are mainly image-centric, focusing on the dominant emotions of the majority of viewers.
However, the emotions that an image evokes in viewers are highly subjective and differ from person to person (Zhao et al. 2015), as shown in Figure 1. Therefore, predicting the personalized emotion perceptions for each viewer is more reasonable and important. In such cases, the emotion prediction task becomes user-centric. We build a large-scale dataset for this task and classify personalized emotion perceptions.

The Image-Emotion-Social-Net Dataset

We set up the first large-scale dataset on personalized image emotion perception, named Image-Emotion-Social-Net, with over 1 million images downloaded from Flickr. To obtain the personalized emotion labels, we first use traditional lexicon-based methods to obtain the text segmentation results of the titles, tags, and descriptions from uploaders for expected emotions, and of the comments from viewers for actual emotions. We then compute the average valence, arousal, and dominance (VAD) values of the segmentation results as ground truth for the dimensional emotion representation, based on the VAD norms of 13,915 English lemmas (Warriner, Kuperman, and Brysbaert 2013). See Appendix A1 for details.
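To make the labeling step above concrete, the following is a minimal sketch (not the authors' released code) of the lexicon-based VAD averaging, assuming the Warriner, Kuperman, and Brysbaert (2013) norms have been loaded into a word-to-(V, A, D) dictionary and the text has already been segmented; the function name, the neutral fallback, and the toy lexicon values are illustrative assumptions.

```python
# Minimal sketch of the VAD labeling step: average the lexicon norms over the
# words segmented from an uploader's metadata or a viewer's comment.
from typing import Dict, List, Tuple

def vad_label(words: List[str],
              vad_lexicon: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """Return the mean (valence, arousal, dominance) over the words found in the lexicon."""
    hits = [vad_lexicon[w] for w in words if w in vad_lexicon]
    if not hits:
        return (5.0, 5.0, 5.0)  # assumed neutral midpoint when no lemma matches
    n = len(hits)
    return (sum(v for v, _, _ in hits) / n,
            sum(a for _, a, _ in hits) / n,
            sum(d for _, _, d in hits) / n)

# Example: one viewer comment, already segmented into lemmas.
toy_lexicon = {"exciting": (7.3, 6.7, 6.0), "drama": (5.1, 5.6, 5.2)}  # illustrative values only
print(vad_label(["exciting", "drama", "best"], toy_lexicon))  # -> (6.2, 6.15, 5.6)
```

The same routine would be applied to the uploader's title, tags, and description for expected emotions and to each viewer's comments for personalized (actual) emotions.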
Problem Definition

Formally, a user $u_i$ in a social network observes an image $x_{it}$ at time $t$, and her perceived emotion after viewing the image is $y_{it}$. Before viewing $x_{it}$, the user $u_i$ may have seen many other images. Among them we select the recent past ones, which are believed to influence the current emotion; these selected images comprise a set $S_i$. The emotional social network is formulated as a hybrid hypergraph $\mathcal{G} = \{\{\mathcal{U}, \mathcal{X}, \mathcal{S}\}, \mathcal{E}, \mathbf{W}\}$. Each vertex $v$ in the vertex set $\mathcal{V} = \{\mathcal{U}, \mathcal{X}, \mathcal{S}\}$ is a compound triple $(u, x, S)$, where $u$ represents the user, and $x$ and $S$ are the current image and the recent past images, named the 'Target Image' and the 'History Image Set', respectively. Note that in this triple, both $x$ and $S$ are viewed by user $u$. $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ is the hyperedge set. Each hyperedge $e \in \mathcal{E}$ represents a link between two vertices based on one component of the triple and is assigned a weight $w(e)$. $\mathbf{W}$ is the diagonal matrix of the edge weights.
Mathematically, the task of personalized emotion prediction is to find the appropriate mapping for each user $u_i$:

    $f: (\mathcal{G}, y_{i1}, \ldots, y_{i(t-1)}) \rightarrow y_{it}$.    (1)
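As a concrete illustration of the formulation above, the sketch below shows one possible way to represent a compound vertex $(u, x, S)$ and a weighted hyperedge; the class and field names are hypothetical and only mirror the notation, not the authors' implementation.

```python
# Minimal sketch of the compound vertex (u, x, S) and a weighted hyperedge
# of the hybrid hypergraph; names and fields are illustrative assumptions.
from dataclasses import dataclass, field
from typing import FrozenSet, Set

@dataclass(frozen=True)
class CompoundVertex:
    user: str                                      # u: the viewer
    target_image: str                              # x: the 'Target Image' being viewed
    history_images: FrozenSet[str] = frozenset()   # S: the 'History Image Set' of u

@dataclass
class Hyperedge:
    component: str                                 # which component of the triple induces the link
    vertices: Set[CompoundVertex] = field(default_factory=set)
    weight: float = 1.0                            # w(e); all weights form the diagonal matrix W

# Example: two vertices of the same user connected by a user-based hyperedge.
v1 = CompoundVertex("u_i", "x_i1")
v2 = CompoundVertex("u_i", "x_i2", frozenset({"x_i1"}))
edge = Hyperedge(component="user", vertices={v1, v2}, weight=0.8)
```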

Rolling Multi-Task Hypergraph Learning

Rolling multi-task hypergraph learning (RMTHG) is presented to jointly combine the various factors that may influence emotion perceptions: visual content, social context, temporal evolution, and location influence. The framework is shown in Figure 2.

Figure 2: The framework of the proposed method for personalized image emotion prediction. [The diagram shows compound vertex generation, hyperedge construction of the hybrid hypergraph (with temporal evolution), rolling multi-task hypergraph learning, and the predicted emotions.]
The compound vertex formulation enables our system to model all of these factors. Different types of hyperedges can be constructed for the three components of the compound vertex; please see Appendix A2 for details.
Given $N$ users $u_1, \ldots, u_N$ and the related images, our objective is to explore the correlation among all involved images and the user relations. Suppose the training vertices are $\{(u_1, x_{1j}, S_{1j})\}_{j=1}^{m_1}, \ldots, \{(u_N, x_{Nj}, S_{Nj})\}_{j=1}^{m_N}$, the training labels are $Y_1 = [y_{11}, \ldots, y_{1m_1}]^T, \ldots, Y_N = [y_{N1}, \ldots, y_{Nm_N}]^T$, and the to-be-estimated relevance values of all images related to the specified users are $R_1 = [R_{11}, \ldots, R_{1m_1}]^T, \ldots, R_N = [R_{N1}, \ldots, R_{Nm_N}]^T$. Let

    $\mathbf{Y} = [Y_1^T, \ldots, Y_N^T]^T$, $\mathbf{R} = [R_1^T, \ldots, R_N^T]^T$.    (2)

The proposed RMTHG can be conducted as a semi-supervised learning process that minimizes the empirical loss and the regularizer on the hypergraph structure simultaneously:

    $\arg\min_{\mathbf{R}} \{\Psi + \lambda \Gamma\}$,    (3)

where $\lambda$ is a trade-off parameter, $\Gamma = \|\mathbf{R} - \mathbf{Y}\|^2$ is the empirical loss, and $\Psi = \mathbf{R}^T (\mathbf{I} - \mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^T \mathbf{D}_v^{-1/2}) \mathbf{R}$ is the regularizer on the hypergraph structure. $\mathbf{H}$ is the incidence matrix, and $\mathbf{D}_v$ and $\mathbf{D}_e$ are two diagonal matrices whose diagonal elements denote the vertex and edge degrees.
By setting the derivative of Equ. (3) with respect to $\mathbf{R}$ to zero, the solution of Equ. (3) can be obtained as

    $\mathbf{R} = \left(\mathbf{I} + \tfrac{1}{\lambda} \Delta\right)^{-1} \mathbf{Y}$,    (4)

where $\Delta = \mathbf{I} - \mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^T \mathbf{D}_v^{-1/2}$ is the hypergraph Laplacian. Using the relevance scores in $\mathbf{R}$, we can rank the related images for each user; the top results with high relevance scores are assigned the related emotion category.
Suppose the predicted results of the test images are $\hat{E} = F(\mathbf{R})$. We can then iteratively update Equ. (4) based on the emotions of the history image set until convergence.
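The following is a schematic NumPy sketch of the hypergraph Laplacian and the closed-form solution of Equ. (4), together with a rolling loop in the spirit of the iterative update described above. The matrix shapes, the value of $\lambda$, the convergence test, and the update_history_labels hook are assumptions on our part; hyperedge construction itself is deferred to Appendix A2 in the paper.

```python
# Schematic sketch of the hypergraph Laplacian, the closed-form update of
# Equ. (4), and a rolling re-estimation loop; shapes and hooks are assumptions.
import numpy as np

def hypergraph_laplacian(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for incidence matrix H (|V| x |E|)."""
    W = np.diag(w)
    dv = H @ w                      # weighted vertex degrees (assumed nonzero)
    de = H.sum(axis=0)              # edge degrees (assumed nonzero)
    Dv_isqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    return np.eye(H.shape[0]) - Dv_isqrt @ H @ W @ De_inv @ H.T @ Dv_isqrt

def solve_relevance(H: np.ndarray, w: np.ndarray, Y: np.ndarray, lam: float = 0.1) -> np.ndarray:
    """Equ. (4): R = (I + Delta / lambda)^{-1} Y, the minimizer of Psi + lambda * Gamma."""
    Delta = hypergraph_laplacian(H, w)
    return np.linalg.solve(np.eye(H.shape[0]) + Delta / lam, Y)

def rolling_predict(H, w, Y, update_history_labels, lam=0.1, tol=1e-4, max_iter=20):
    """Re-estimate history-set emotions from the current R and re-solve until convergence."""
    R = solve_relevance(H, w, Y, lam)
    for _ in range(max_iter):
        Y = update_history_labels(Y, R)       # hypothetical hook refreshing history labels
        R_new = solve_relevance(H, w, Y, lam)
        if np.linalg.norm(R_new - R) < tol:   # assumed convergence criterion
            return R_new
        R = R_new
    return R
```

At the scale of the Image-Emotion-Social-Net dataset, a sparse representation of the incidence matrix would presumably be required rather than the dense one used in this sketch.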
Experiments

We use precision, recall, and F1-measure to evaluate the performance of different methods. The first 50% of the involved images of each viewer in the Image-Emotion-Social-Net dataset are used for training and the rest are used for testing. Naive Bayes (NB), a Support Vector Machine (SVM) with an RBF kernel, and a Graph Model (GM) are adopted as baseline methods. Please see Appendix A3 for detailed results.
We first test the performance of different visual features and learning models on the 8 emotion categories. From the results, we observe that: (1) generally, the fusion of different features outperforms single-feature-based methods, possibly because it can exploit the advantages of different aspects; (2) the proposed RMTHG greatly outperforms the baselines on almost all features.
We also evaluate the influence of different factors on emotion classification by comparing the performance with all factors against that with one factor removed. Here all the visual features are considered. From the results, we can see that: (1) by incorporating the various factors, the classification performance is greatly improved, which indicates that social emotions are affected by these factors; (2) even without using the visual content, we can still obtain competitive results, which demonstrates that the social features and the temporal evolution play an important role in emotion perception.

Conclusion

In this paper, we proposed to predict personalized perceptions of image emotions by incorporating various factors together with visual content. Rolling multi-task hypergraph learning was presented to jointly combine these factors. A large-scale personalized emotion dataset of social images was constructed and several baselines were provided. Experimental results demonstrated that the proposed method is superior to state-of-the-art approaches.
This work was supported by the National Natural Science Foundation of China (No. 61472103 and No. 61133003).

References

Warriner, A. B.; Kuperman, V.; and Brysbaert, M. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45(4):1191-1207.
Zhao, S.; Gao, Y.; Jiang, X.; Yao, H.; Chua, T.-S.; and Sun, X. 2014. Exploring principles-of-art features for image emotion recognition. In ACM MM, 47-56.
Zhao, S.; Yao, H.; Jiang, X.; and Sun, X. 2015. Predicting discrete distribution of image emotions. In IEEE ICIP.
