arXiv:1708.05851v1 [cs.CV] 19 Aug 2017

Image2song: Song Retrieval via Bridging Image Content and Lyric Words

Xuelong Li∗, Di Hu†, Xiaoqiang Lu∗
∗Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, P. R. China
†School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, P. R. China

Abstract

Image is usually taken for expressing some kinds of emotions or purposes, such as love, celebrating Christmas. There is another better way that combines the image and a relevant song to amplify the expression, which has drawn much attention in the social network recently. Hence, the automatic selection of songs should be expected. In this paper, we propose to retrieve semantic relevant songs just by an image query, which is named as the image2song problem. Motivated by the requirements of establishing correlation in semantic/content, we build a semantic-based song retrieval framework, which learns the correlation between image content and lyric words. This model uses a convolutional neural network to generate rich tags from image regions, a recurrent neural network to model the lyric, and then establishes the correlation via a multi-layer perceptron. To reduce the content gap between image and lyric, we propose to make the lyric modeling focus on the main image content via a tag attention. We collect a dataset from the social-sharing multimodal data, which consists of (image, music clip, lyric) triplets, to study the proposed problem. We demonstrate that our proposed model shows noticeable results in the image2song retrieval task and provides suitable songs. Besides, the song2image task is also performed.

1. Introduction

Images are usually taken for the purpose of memorizing, which could contain some specific contents and convey some kinds of emotions. For example, the captured pictures of the Christmas day commonly contain dressed up celebrating people and Christmas trees covered with gifts, which are used to remember the happiness existing at that time. However, images only exist in the visual modality, which appears to be weak in expressing the above purpose. Inspired by visual music and musical vision [22], we consider that the stimulations coming from different senses (e.g. vision, auditory, tactile) may share similar performance. Hence, if the captured image is combined with a relevant song that expresses a similar purpose, the expression will be enhanced, which obviously results from the more useful information merged from multimodal data [40, 23]. For example, showing the Christmas image while playing a relevant song is easier to touch viewers than the single image. Hence, this kind of combination has attracted much attention nowadays, which is simpler than video but richer than photo. But almost all existing approaches of song selection for a given image are based on a manual manner. Such methods often cost users a lot of time to decide, but could still suffer from the small song library of users and the lack of comprehension. Hence, the technique of automatic image-based song recommendation should be expected, which is named as image2song in this paper.

Figure 1. A Christmas-relevant image and its coupled song. Several image tags (e.g. tree, dress, flower, smiling, happy, red) are coupled with words appearing in the song lyric. Best viewed in color.
in lyric appearing (lyri and color) tags song different age (in coupled words and several are image There Christmas-relevant A 1. Figure mg otn nlss ec,ti akblnst the to belongs task specific this the Hence, on relies analysis. therefore semantic-related content which the image images, find to for im- aims proposed songs the task paper, ac- retrieval this music In age2song proper criteria. indexing specific on to cording focuses which field, search object: attribute: ui nomto Retrieval Information Music t r e e smiling flower,, , , g r e e n ball lssadLann (OPTIMAL), Learning and alysis dress, , , happy ∗ , , red… stock … China . All I AllWant ChristmasIsFor You 40 And I don't AndIdon't careabout the SantaClaus won'tmake me I don't Idon't need to hangmy pt.ac.cn I don't Idon't wantalot for MR sataiinlre- traditional a is (MIR) Underneath the , Witha Thereis justone thing I need 23 Thereupon fireplacethe rcWords yric .Freape show- example, For ]. toy on … … Christmas Christmastree -- MariahCarey Christmas igeBells Jingle stocking presents Day happy him- th c). semantic/tag-based music retrieval category [32]. Such re- semantic representation, many approaches based on it have trieval tasks utilize multiple textual data sources as the mu- shown great success in image understanding and descrip- sic modality for meeting the semantic requirements, such tion [31, 25]. However, these methodsjust focus on describ- as music blog in SerachSounds [3, 38] and web pages ing the image in specific fixed labels, while our work aims in Gedoodle [19]. In this paper, we focus on retrieving at providing a richer description of images, which points songs (not instrumental music) for images, which contain out the detailed image contents. Wu et al. [46] proposed sufficient textual data in lyric. More importantly, these tex- an attribute predictor with the same purpose, which viewed tual data contains multiple common words in image tags (as the prediction as a multi-label regression task but without shown in Fig. 1), which are considered as the related content focusing on the image contents in specific regions. across the modalities. Hence, lyric is adopted as the textual Lyric modeling. A number of approaches have been pro- data source for retrieval. However, there still remains an- posed to extract the semantic information of lyrics. Most of other two problems. First, image and lyric are different, these works viewed lyric as a kind of text, therefore Bag- where the former is non-temporal and the latter is temporal- of-Words (BoW) approach was usually used to describe the based. More importantly, lyrics are not specifically created lyrics [4, 13, 16], which accounted for the frequency of for images, which results in the content description gap be- word across the corpus, e.g. Term Frequency-Inverse Docu- tween them [32]. Second, there is barely dataset providing ment Frequency (TF-IDF) [44]. However, almost all these correspondingimages and songs, which makes it difficult to works aimed at emotion recognition [16] or sentiment clas- learn the correlation via a data-driven fashion. sification [47]. Recently, Schwarz et al. [33] proposed to In this paper, to overcome the above problems, our con- embed the lyric words into vector space and extract relevant tributions are threefold: semantic representation. But, this work just focused on sen- • We leverage the lyric as the textual data modality for tences rather than the whole lyric. What is more, Schwarz et semantic-based song retrieval task, which provides an al. 
[33] just performed pooling operation over the sentence effective way to establish the correlation between im- words, which ignored to model the contextual information age and song in semantic. of lyric. Compared with the previous works, our work fo- • We develop a multimodal model based on neural net- cuses on the lyric content analysis for providing sufficient work, which learns the latent correspondence by em- information to learn the correlation with image, which takes bedding the image and lyric representations into a consideration of both the semantic and context information common space. To reduce the content gap, we intro- of words. duce a tag attention approach that makes the bidirec- Multimodal learning across image and text. Several tional Recurrent Neural Network (RNN) of lyric focus works have studied the problem of annotating image with on the main content of image regions, especially for text. Srivastava and Salakhutdinov [39] proposed to use the related lyric words. Multimodal Deep Boltzmann Machine (MDBM) to jointly • We build a dataset that consists of (image, music clip, model the images and corresponding tags, which could be lyric) triplets, which are collected from the social- used to infer the missing text (tags) from image queries, or sharing multimodal data on the Internet1. inverse. Sohn et al. [37] and Hu et al. [12] extended such Experimental results verify that our model can provide no- framework to explore and enhance the similarity across im- ticeable retrieval results. In addition, we also perform the age and text. Recently, amounts of works focused on using song2image task and our model has an improvement over natural language to annotate image instead of tags, which is state-of-the-art method in this task. usually named as image caption. These works [17, 18, 26] commonly utilized deep CNN and RNN to encode the im- 2. Related Work age and correspondingsentences into a common embedding space, respectively. The aforementioned works aimed to Image description. An explicit and sufficient description generate relevant tags or sentences to describe the content of of image is necessary for establishing the correlation with images. In contrast, our work focuses on learning to max- lyric in content. There have been many attempts over the imize the correlation between image and lyric, where the years to provide detailed and high level description of im- lyric is not specially generated for describing the concrete ages effectively and efficiently. Sivic and Zisserman [36] image content but a text containing several related words. represented the image by integrating the low-level visual Music visualization. Music visualization has an opposite features into bag-of-visual-words, which has been widely goal of our work, which aims at generating imagery to de- applied in the scene classification [21] and object recog- pict the music characteristic. Yoshii and Goto [50] pro- nition [35]. Recently, in view of the advantages of Con- posed to transform musical pieces into a color and gradation volutional Neural Network (CNN) in producing high-level visual effects, i.e., a visual thumbnail image. Differently, 1The dataset is available at https://dtaoo.github.io/ Miyazaki and Matsuda [28] utilized a moving icon to visu- 200 alize the music. However, these methods are weak in con- 28 358 veying semantic information in music, such as content and 3125 emotion. 
A possible way to dealwith such defectsis to learn [50,) 3408 [10,50) the semantic from lyrics and establish correlation with real 5787 8124 [5,10) 2316 images. Cai et al. [2] manually extracted the salient-words [1,5) from lyrics, e.g. location and non phrases, then took them as the key words to retrieve images. Similar frameworks could Number of Songs Number of Triplets be found in [45, 49], while Schwarz et al. [33] focused on Figure 2. The statistics of the frequency of song occurrence. For the sentences of lyrics and organized the retrieved images example, there are 5,787 songs appearing less than 5 times, which of each sentence into a video. In addition, Chu et al. [5] results in 8,124 triplets. attempted to sing the generated description of an image for the first time, but it was too limited for various music style Triplet Song Available Mood Favorite Count and lack of natural melody. In contrast to these methods 16,973 6,373 3,158 [1, 8964] which are based on rough tag description [2, 45, 49, 33] and direct similarity measure between features of different Table 1. Aggregated statistics of the Shuttersong dataset. modalities [33], we propose a multimodal learning frame- work to jointly learn the correlation between lyric and im- age in a straightforward way while analyzing the specific ent songs among these triplets, but only 3,158 (18.6%) ones content for both of them. have available moods. The favorite counts of the shared image-song pairs created by users vary from 1 to 8,964, 3. The Shuttersong Dataset which could be used as a reference for estimating the qual- ity of the pairs. In order to explore the correlation between image and We also perform statistical analysis about the frequency song (lyric), we collect a dataset that contains amounts of of song occurrence, as shown in Fig. 2. Although there are pairwise images and songs from the Shuttersong applica- tion. Shuttersong is a social sharing software, just like In- 2 stagram . However, the shared content contains not only an All I Want For image, but also a corresponding song clip selected by users, Christmas Is You -- Mariah Carey which is for strengthening the expression purpose. A rele- … I don't want a lot for Christmas vantmoodcan also be appendedby users. We collect almost There is just one thing I need And I don't care about the presents the entire updated data from Shuttersong, which consists of Underneath the Christmas tree 36,646 pairs of images and song clips. Some optional mood I don't need to hang my stocking There upon the fireplace and favorite count information are also included. Santa Claus won't make me happy With a toy on Christmas Day In this paper, the lyric is viewed as a bridge connecting … image and song, but it is not contained in the collected data Best Friends from Shuttersong. To acquire the lyrics, we develop a soft- -- The Janoskians ware to automatically search and download them from the ... we said last night, last night. Internet based on song title and artist. However, there exist we probably some abnormal ones in the collected lyrics, such as non- won't remember what we did but one thing i'll never forget. English songs, undetected ones, etc. Hence, we ask twenty i'd rather be right here, tonight with you participants to refine the lyrics. Specifically, the participants ... first exclude the non-English song, then judge whether the lyric matches the song clip. For the incorrect or undetected Paradise ones, the participants manually search the lyrics via the In- -- Jack & Jack ... ternet. 
Then, the refined lyrics update the original ones in When she was just a girl, She expected the world, the dataset. A detailed explanation of the collected data can But it flew away from her reach, be found in the supplementary material. So she ran away in her sleep. Dreamed of para-para-paradise, Statistics and analysis. The full dataset consists of 16,973 Para-para-paradise, Para-para-paradise, available triplets after excluding the abnormal ones, where Every time she closed her eyes. each triplet contains corresponding music clip3, lyric, and ... image. As shown in Table 1, there are totally 6,373 differ- Figure 3. Examples of songs and corresponding images in the dataset. One song could relate to multiple images. It is easy to 2www.shuttersong.com, www.instagram.com find out the images belonging to the same song have similar con- 3Due to the relevant legal, it is difficult to obtain the complete audio. tent that expresses the song to some extent. 6,373 different songs in the dataset, 586 songs that appear top K regions [15, 48], we use the whole prediction re- at least 5 times take up more than half the triplets, which sults to provide a richer description. We adopt the pow- means each of these songs relates to at least 5 images. For erful VGG net [34] as the CNN architecture. As shown example, Fig. 3 shows some examples of image-song pairs, in Fig. 4, the CNN is first initialized from ImageNet, then where these songs appear more than 5 times in the built fine-tuned on the Real-World Scene Graphs Dataset [14] dataset. It is obvious that the images belong to the song All that provides more object and attribute classes compared I Want For Christmas Is You are Christmas relevant, where with COCO 2014 [24] and Pascal VOC [7], where 266 ob- trees, ribbons, and lights commonly appear among them. ject classes and 249 attribute types4 are employed for fine- Meanwhile, these objects or attributes are related to some tuning, respectively. The images in the Shuttersong dataset words of the corresponding lyric to some extent and pro- are then fed into the network. The top prediction probabil- vide a similar expression with the song. We can also find ities of each class constitute the final image representation the similar situations in the other two groups, as shown in as a 515-dimensional vector v. Fig. 3. These relevant images of the same song could pro- vide an efficient way to explore the valuable information in 4.2. Lyric representation lyrics and establish correlation with songs. Therefore, we Considering the song lyric is a kind of text containing conduct our experiments based on these songs with high dozens of words, we expect to generate a sufficient and effi- occurrence frequency. cient representation to establish inner-modalities correlation with the image representation. RNN-like architectures have 4. Our Model recently shown advantages in encoding sequence while con- taining enough information for a range of natural language In this paper, our proposed model aims at learning the processing tasks, such as language modeling [41] and trans- correlation between images and lyrics, which can be used lation [8]. In our work, in view of the remarkable ability for song retrieval by image query, and vice versa. We first of LSTM [11] in encoding long sequence, we employ it to introduce a CNN-based model for fully representing image embed the lyric into a vector representation but with minor with amounts of tags, then encode the lyric sequence into modification [9]. 
a vector representation via a bi-directional LSTM model. A proposed multimodal model finally embeds the encoded lyric representation into the image tag space to correlate them under tag attention.

4.1. Image representation

We aim to generate a rich description of image, which could point out the specific content in certain text, i.e., image tags. A common strategy is to detect objects and their attributes via extracting proposal regions and feeding them into a specific classifier [20]. Hence, we propose and classify the regions of each image via a common Region CNN (RCNN) [30]. However, different from most of the previous works that just make use of the CNN features in the top K regions [15, 48], we use the whole prediction results to provide a richer description. We adopt the powerful VGG net [34] as the CNN architecture. As shown in Fig. 4, the CNN is first initialized from ImageNet, then fine-tuned on the Real-World Scene Graphs Dataset [14] that provides more object and attribute classes compared with COCO 2014 [24] and Pascal VOC [7], where 266 object classes and 249 attribute types4 are employed for fine-tuning, respectively. The images in the Shuttersong dataset are then fed into the network. The top prediction probabilities of each class constitute the final image representation as a 515-dimensional vector v.

Figure 4. Overview of the image tag prediction network. The model is developed based on faster-rcnn, which is firstly fine-tuned on the Scene Graph Dataset [14], then used to generate the tag prediction result for images in the Shuttersong dataset via parameter transferring.

4Different from the settings in [14], we select the attributes that appear at least 30 times to provide a more detailed description.

Words of the lyric encoded in one-hot representations are first embedded into a continuous vector space, where nearby points share similar semantics,

x_t = W_e l_t,    (1)

where l_t is the t-th word in the song lyric, and the embedding matrix W_e is pre-trained on part of the Google News dataset (about 100 billion words) [27]. The weight W_e is kept fixed during the training due to overfitting concerns. Then the word vectors constitute the lyric matrix representation and are fed into the LSTM network,

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (2)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (3)
C̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)    (4)
C_t = i_t ∗ C̃_t + f_t ∗ C_{t−1}    (5)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (6)
h_t = o_t ∗ tanh(C_t).    (7)

Three gates (input i, forget f, and output o) and one cell memory C constitute an LSTM cell and work in cooperation to determine whether to remember or not. σ is the sigmoid function, and the network parameters W∗, U∗, and b∗ will be learned during the training procedure.
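The lyric branch above can be summarized in a few lines of code. Below is a minimal PyTorch sketch (not the authors' released code) of the embedding lookup in Eq. (1) followed by a bi-directional LSTM whose final forward state and first backward state are concatenated into the lyric vector; the vocabulary size, hidden size, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LyricEncoder(nn.Module):
    """Sketch of the lyric branch: frozen word embedding (Eq. 1) + bi-directional LSTM (Eqs. 2-7)."""

    def __init__(self, pretrained_emb, hidden_size=256):
        super().__init__()
        # W_e: pre-trained word2vec matrix, kept fixed to avoid over-fitting
        self.embed = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        self.lstm = nn.LSTM(input_size=pretrained_emb.size(1),
                            hidden_size=hidden_size,
                            batch_first=True,
                            bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) indices of lyric words
        x = self.embed(word_ids)                    # x_t = W_e l_t
        outputs, _ = self.lstm(x)                   # h_t for every time step
        half = outputs.size(2) // 2
        forward_last = outputs[:, -1, :half]        # forward state at the last word
        backward_first = outputs[:, 0, half:]       # backward state at the first word
        return torch.cat([forward_last, backward_first], dim=1)

# usage with hypothetical sizes: 10k-word vocabulary, 300-d word2vec vectors
emb = torch.randn(10000, 300)
encoder = LyricEncoder(emb)
lyric_vec = encoder(torch.randint(0, 10000, (4, 120)))  # 4 lyrics, 120 words each
print(lyric_vec.shape)                                   # torch.Size([4, 512])
```

The concatenation of the two end states follows the representation l = h→_final ‖ h←_1 used later in the paper.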


Figure 5. The diagram of the proposed model. The image content tags are first predicted via the R-CNN, meanwhile, a bi-directional LSTM is utilized to model the corresponding lyric. Then the generated lyric representation is mapped into the space of the image tags via a MLP. To reduce the content gap between image and lyric, the top K image tags are embedded into a tag matrix and represented as tag attention for the lyric modeling by performing max/average pooling.
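To make the tag-attention branch of Fig. 5 concrete, the sketch below pools the embedded top-K tags into an attention vector and rescales every LSTM output by a scalar attention weight, mirroring Eqs. (8)-(10) introduced below; the layer sizes and the use of softmax to realize the proportionality in Eq. (9) are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TagAttention(nn.Module):
    """Sketch: pool embedded top-K tags into a vector v~, then reweight each LSTM output h_t."""

    def __init__(self, hidden_size=512, emb_size=300, attn_size=128):
        super().__init__()
        self.W_hm = nn.Linear(hidden_size, attn_size, bias=False)
        self.W_vm = nn.Linear(emb_size, attn_size, bias=False)
        self.w_ms = nn.Linear(attn_size, 1, bias=False)

    def forward(self, lstm_outputs, tag_matrix):
        # lstm_outputs: (batch, seq_len, hidden); tag_matrix: (batch, K, emb_size)
        v_tilde = tag_matrix.mean(dim=1)                      # average pooling over tags
        m = torch.sigmoid(self.W_hm(lstm_outputs)
                          + self.W_vm(v_tilde).unsqueeze(1))  # Eq. (8)
        s = torch.softmax(self.w_ms(m).squeeze(-1), dim=1)    # Eq. (9), normalized over words
        return lstm_outputs * s.unsqueeze(-1)                 # Eq. (10): h~_t = h_t * s_t
```

The refined outputs can then be fed forward through the rest of the lyric encoder exactly as the unweighted outputs would be.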

Considering the relevant images of a given lyric have high variance in the content, such as the examples in Fig. 3, it is difficult to directly correlate the image tags with each embedded lyric word, especially for the longer lyric compared with a normal sentence [10, 42]. Hence, we just take the final output of the LSTM, which retains the context of single words and provides efficient information of the whole lyric. Experiments (Sec. 5.2) indicate its effectiveness in establishing the correlation with images. Besides, both the forward and backward LSTM are employed to simultaneously model the history and future information of the lyric, and the final lyric representation can be denoted as l = h→_final ‖ h←_1, where ‖ indicates concatenation of the forward and backward states.

4.3. Multimodal learning across image and lyric

The image and lyric representations have been extracted via their respective models, so it is intuitive to embed them into a common space to establish their correlation via a Multi-Layer Perceptron (MLP) model, as shown in the upper part of Fig. 5. The effectiveness of such a model has been verified in image-text retrieval [29]; this is because the text information is specially written to describe the image content, hence there are several common features representing the same content across both modalities. However, the MLP model is not entirely suitable for our task. As lyric is not specially generated for images, it can be found that there are few lyric words directly used to describe the corresponding image but multiple related words sharing a similar meaning with the image content, which could result in a content description gap. For example, we can find such situations in the Paradise group in Fig. 3. There exist few specific lyric words that could be used for describing the beach, wave, or sky in the images, but the words paradise, flew, and girl are related to some image contents.

To address the aforementioned problem, we propose to use the tag attention to make the lyric words focus on the image content, as shown in Fig. 5. We first sort the image prediction results of all the tag classes, then choose the top-K tags that are assumed as the correct prediction of the corresponding image content5. To share the representation space with lyric words, the selected tags are also embedded into vectors of continuous values via Eq. 1, which results in a tag matrix T ∈ R^{K×300}. Then we perform the pooling operation over the matrix T into a tag attention vector ṽ. Inspired by the work of Hermann et al. [10] and Tan et al. [42], the tag attention is designed to modify the output vector of the LSTM and make the lyric words focus on the concrete image content. To be specific, the output vector h_t at time step t is updated as follows,

m_t = σ(W_hm h_t + W_vm ṽ)    (8)
s_t ∝ exp(w_ms^⊤ m_t)    (9)
h̃_t = h_t s_t,    (10)

where the weights W_hm, W_vm, and w_ms are considered as the attention degree of the lyric words given the image content (tags). During modeling the lyrics, the related words are paid more attention via the attention weights, which acts like TF-IDF in document retrieval based on key words. However, different from the pooling operation over the entire sequence in the previous works [10, 42], the output vector with attention, h̃_t, just flows through the entire lyric words, which results in a refined lyric representation.

5The top 5 tags are validated and employed in this paper.

During training, we aim at minimizing the difference between the image and lyric pair in the tag space, which is essentially the Mean Squared Error (MSE),

l_mse = Σ_{i=1}^{T} ‖v_i − l̃_i‖²₂,    (11)

where v_i and l̃_i are the generated image and projected lyric representation, and T is the number of training pairs. Besides the MSE loss, we also employ the cosine proximity and marginal ranking loss. The relevant experiment results are reported in the supplementary materials. For the retrieval, both the query and retrieved items are fed into the proposed model, then the cosine similarity is computed as the relevance.

4.4. Optimization

The proposed model is optimized by employing stochastic gradient descent with RMSprop [43]. The algorithm adaptively rescales the step size for updating trainable weights according to the corresponding gradient history, which achieves the best result when faced with the word frequency disparity in lyric modeling. The parameters are empirically set as: the learning rate l = 0.001, the weight decay ρ = 0.9, the tiny constant ε = 10^−8, and the model is trained with mini-batches of 100 image-lyric pairs.

5. Experiments

5.1. Data preprocessing and setup

Dataset. In this paper, we choose the triplets whose lyrics appear at least 5 times, which results in 8,849 triplets (586 songs). Such operation is a common preprocessing method and also better for learning the correlation across modalities. To reduce the influence of the imbalanced number of images, we choose five triplets with top favorite counts for each song, which are considered to have more reliable correlation. Within these filtered triplets, 100 songs and corresponding images are randomly selected for testing, and the rest for training.

Baselines. In our experiments, we compare with the following models: Bag-of-Words [1]: The image features and lyric BoW representation are mapped into a common subspace. CONSE [29]: The lyric representation is obtained by performing a pooling operation over all words and then establishing the correlation with image features. Attentive-Reader [10]: This method is mainly developed for the Question-Answering task, which performs a weighted pooling over the encoded words via LSTM with question attention. Here, the question attention is replaced with the image tag attention, and a non-linear combination is used to measure the correlation.

Our Models. We evaluate two variants of our models: Our baseline model: Our proposed model except the tag attention, as shown in the upper part of Fig. 5. Our-attention model: The proposed complete model. Note that average pooling is employed for obtaining the tag attention, which could retain more image content information compared with the max manner.

5.2. Image2song retrieval

This experiment aims to evaluate the song retrieval performance given an image query. There are two kinds of tags used for representing images, i.e., object and attribute. It is expected to explore which kind of them influences the image description most and provides more valuable information to establish the correlation across the two modalities. In view of this, we group the image tags into three categories, i.e., object, attribute, and both of them. Table 2 shows the comparison results on both datasets, where both the model variants and other methods are considered. We also provide example retrieval results in Fig. 6.

For each tag category, there are three points we should pay attention to. First, the bi-directional LSTM model provides better lyric representations than direct pooling (CONSE [29]) and BoW [1], as the LSTM takes consideration of both word semantic and context information.
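A compact sketch of the training objective in Eq. (11) and the cosine-similarity retrieval step described in Secs. 4.3-4.4 is given below; the two-layer perceptron standing in for the MLP and the dimensions (512-d lyric vector, 515-d tag space) follow the text, but the concrete architecture is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical MLP mapping the 512-d lyric representation into the 515-d image tag space
project = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 515))

def mse_loss(image_tags, lyric_vecs):
    # Eq. (11): squared error between image tag vectors v_i and projected lyrics l~_i
    return ((image_tags - project(lyric_vecs)) ** 2).sum(dim=1).mean()

def rank_songs(query_image_tags, lyric_vecs):
    # retrieval: cosine similarity between one image query and every candidate lyric
    scores = F.cosine_similarity(query_image_tags.unsqueeze(0),
                                 project(lyric_vecs), dim=1)
    return torch.argsort(scores, descending=True)  # song indices, best match first
```

In training, mse_loss would be minimized with RMSprop over mini-batches of image-lyric pairs, as described in Sec. 4.4.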
Sec- responding images are randomly selected for testing, and ond, the proposed complete model shows the best result in the rest for training, which forms dataset†. Note that, we most conditions. When we employ the tag attention for also employ another kind of train/test partition to constitute lyric modeling, more related words in the lyrics will be dataset§: we randomly select one from five images of each emphasized, which shrinks the content gap and improves song for testing, and the rest for training. In the dataset§, the the relevant performance, especially on the dataset§. Al- train and test set share the same 586 songs but with different though Attentive-Reader [10] also employ similar manner, images, which is developed to exactly evaluate the models it takes the attentive pooling result as the lyric representa- in shrinking the content gap, when faced with variable im- tion. The direct pooling operation in Attentive-Reader will age content and lack of related lyric words. make it suffer from the difficulty in establishing correlation, Preprocessing. For the lyric data preprocessing, we remove this is because the variable image contents change the at- all the non-alphanumeric and stop words. tention weight. While our model takes the final output of Metric. We employ the rank-based evaluation metrics. LSTM, where the attention weight is not immediately uti- R@K is Recall@K that computes the percentage of a cor- lized but conveyed by the updated output vector h˜t. Third, rect result found in the top-K retrieved items (higher is bet- although our methods show the best performance on both ter), and Med r is the median rank of the closest correct datasets, overall performance is not excellent as expected. retrieved item (lower is better). But this is a common situation in the semantic-based mu- Image tags obj-tags attr-tags obj-attr-tags Dataset† R@1 R@5 R@10 Med r R@1 R@5 R@10 Med r R@1 R@5 R@10 Med r BoW [1] 1.4 7.6 13.4 45.06 1.6 7.0 13.0 46.11 1.4 7.2 13.2 45.01 CONSE [29] 1.8 8.0 13.0 46.11 1.8 6.6 12.2 47.38 2.0 6.8 14.8 46.13 Attentive-Reader [10] 1.8 7.4 14.0 44.45 1.8 7.6 13.0 46.97 1.8 7.2 15.0 43.78 Our 2.2 7.2 14.2 43.13 1.6 7.8 14.0 46.32 2.2 7.6 14.8 43.30 Our-attention 2.0 9.2 17.6 41.36 2.2 8.4 14.4 45.21 2.6 9.4 16.8 41.50 Dataset§ R@10 R@50 R@100 Med r R@10 R@50 R@100 Med r R@10 R@50 R@100 Med r BoW [1] 4.27 13.48 24.23 251.77 3.07 12.12 23.55 260.17 4.27 14.33 23.72 250.63 CONSE [29] 3.41 14.33 25.27 248.88 3.24 12.97 23.38 262.17 3.58 14.51 25.09 245.32 Attentive-Reader [10] 4.10 14.51 24.57 254.53 3.75 13.14 23.04 267.12 3.92 14.51 25.43 243.53 Our 4.61 15.02 26.28 246.51 4.27 13.65 23.89 254.17 4.94 15.36 28.33 240.00 Our-attention 4.47 15.36 26.86 245.61 4.61 13.99 25.76 249.42 5.63 17.58 29.35 233.82

Table 2. Image2song retrieval experiment result in R@K and Med r on dataset† and dataset§. Three kinds of image representation are considered, i.e., object (obj), attribute (attr), and both of them (obj-attr).
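For reference, the R@K and Med r values reported in Table 2 can be computed from the rank of the ground-truth song in each query's sorted retrieval list; a small plain-Python sketch, assuming one relevant item per query:

```python
def recall_at_k(ranks, k):
    """ranks: 1-based rank of the correct item for each query; returns percentage."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    ordered = sorted(ranks)
    mid = len(ordered) // 2
    return ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

ranks = [1, 4, 12, 60, 3]                          # toy example
print(recall_at_k(ranks, 5), median_rank(ranks))   # 60.0 4
```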


Figure 6. Image2song retrieval examples generated by our model. The generated object and attribute tags are shown next to each image query, and the songs with red triangle are the ground truth in the dataset. sic retrieval [32]. This is because the song retrieval based which makes it difficult to correlate images with songs, es- on textual data has to estimate the semantic labels from pecially for the attribute tags. Third, most of the images lyric, which is charactered as a low specificity and long- share similar attribute tags with high prediction scores (e.g., term granularity [32]. Even so, our proposed models still black, blue, white). This is actually a long-tailed distri- enjoy a relatively big improvement, and the retrieved exam- bution6 and therefore it becomes difficult to establish the ples show the effectiveness of the models in Fig. 6. correlation between image and specific lyric. However, the group with both tags nearly shows the best performance Across the three groups of image tags, we could find that across different models, which results from more detailed the attribute tags almost always perform worse than object description for images. ones. We consider there are mainly three reasons to ex- Apart from the influence of image tag property, the lyric plain this phenomenon. First, the object words are usually also impacts the performance significantly. We show the employed for image description, which is more powerful in specific results of 28 songs with more 50 times occurrence identifying the image content compared with attribute ones. in Fig. 7 and Fig. 8. As shown, some song lyrics are of Second, as the Shuttersong is actually a kind of sharing ap- plication, most of the updated images are self-photography, 6A detailed illustration can be found in the supplementary material. 1 Song Query Top 4 retrieved result


Figure 7. Detailed comparison results among song examples in R@3. The complete model with attention improves the average performance, especially for the songs with zero score.

Figure 9. Example retrieval results by our model. The images in red bounding boxes are the corresponding ones in the dataset.


modeling during training, while Schwarz et al. [33] directly use the pooling results of lyric word vectors for similarity


retrieval without the inner-model interaction like ours. Second, although our model gets better performance, it still suffers from the lack of related words in some songs, just like the image2song task.

In addition, we also show some examples of the song query, and the top 4 retrieved results are illustrated in Fig. 9. Although some retrieved results are not correct, they share similar content conveyed by the lyric. For example, in terms of the song Cedarwood Road about tree and Christmas, the top two retrieved images indeed have tree-relevant content and the left two are about Christmas.

Figure 8. Detailed comparison results among the three groups of tags. All the experiments are conducted with the proposed baseline model and evaluated by R@3.

Models         R@1    R@5    R@10   R@20   Med r
Schwarz [33]   0.10   0.27   0.45   0.71   16
Our            0.19   0.34   0.52   0.74   15

Table 3. The image retrieval results given lyric query. Here, the 6. Discussion image tags in the method of Schwarz et al. [33] are generated by the R-CNN approach [30] for a fair comparison. In this paper, we introduce a novel problem that retrieves semantic-related songs based on given images, which is remarkable performance, while some fail to establish the named as the image2song task. We collect a dataset that correlation with relevant images. The potential reason is consists of pairwise images and songs to study this prob- that parts of the lyrics are weak in providing obvious or lem. We propose a semantic-based song retrieval frame- even related words. For example, the lyric words of Best work, which employs the lyric as the textual data source for Friends are about forget, tonight, remember, etc, which are estimating the semantic label of songs, then a deep neural not specifically related to the image content, as shown in network based multimodal framework is proposed to learn Fig. 3. Even so, our proposed tag attention mechanism can the correlation, where the lyric modeling is proposed to fo- still reduce the content gap and improvethe performance, as cus on the main image content to reduce the content gap show in Fig. 7. While for the song Paradise and All I Want between them. The experiment results show that our model For Christmas Is You, the lyric words and image content are can recommend suitable songs for a given image. In addi- closely related, hence this case achieves better results. tion, the proposed approach can also retrieve relevant im- ages according to a song query with a better performance 5.3. Song2image retrieval than other methods. In this experiment, we aims to retrieve relevant images There still remains a direction that should be explored for a given song (lyric) query, which is similar to the music in the future. Song is about several minutes long, which is visualization task. And we employ the proposed baseline too long for just showing an image. A possible way could model, which could perform more efficiently without the be expected is to correlate the image with only parts of the attention of each image. Table. 6 shows the comparison re- corresponding song, which is more natural for expression. sults on dataset†. First, our proposed model outperforms Furthermore, we also hope to perform other kinds of cross- the method of Schwarz et al. [33]. In our model, the im- modal retrieval task, which essentially attempts to establish age content information are employed to supervise the lyric the correlation among different senses of human. 7. Acknowledgement [12] D. Hu, X. Lu, and X. Li. Multimodal learning via exploring deep semantic similarity. In Proceedings of We thank Yaxing Sun for crawling the raw multimodal the 2016 ACM on Multimedia Conference, pages 342– data from the Shuttersong application. We also thank 346. ACM, 2016. 2 Chengtza Wang for editing the video demo for presentation. [13] X. Hu, J. S. Downie, and A. F. Ehmann. Lyric text References mining in music mood classification. American music, 183(5,049):2–209, 2009. 2 [1] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, C. Cortes, and M. Mohri. Poly- [14] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. nomial semantic indexing. In Advances in Neural In- Shamma, M. S. Bernstein, and L. Fei-Fei. Image re- formation Processing Systems, pages 64–72, 2009. 6, trieval using scene graphs. In 2015 IEEE Conference 7, 12 on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678. 
IEEE, 2015. 4 [2] R. Cai, L. Zhang, F. Jing, W. Lai, and W.-Y. Ma. Au- tomated music video generation using web image re- [15] A. Karpathy and L. Fei-Fei. Deep visual-semantic source. In 2007 IEEE International Conference on alignments for generating image descriptions. In Pro- Acoustics, Speech and Signal Processing-ICASSP’07, ceedings of the IEEE Conference on Computer Vision volume 2, pages II–737. IEEE, 2007. 3 and Pattern Recognition, pages 3128–3137, 2015. 4 [3] O.` Celma, P. Cano, and P. Herrera. Search sounds [16] Y.E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, an audio crawler focused on weblogs. In 7th Inter- P. Richardson, J. Scott, J. A. Speck, and D. Turnbull. national Conference on Music Information Retrieval Music emotion recognition: A state of the art review. (ISMIR), 2006. 2 In Proc. ISMIR, pages 255–266. Citeseer, 2010. 2 [4] R. Chen, Z. Xu, Z. Zhang, and F. Luo. Content based [17] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Multi- music emotion analysis and recognition. In Proc. of modal neural language models. In ICML, volume 14, 2006 International Workshop on Computer Music and pages 595–603, 2014. 2 Audio Technology , volume 68275, 2006. 2 [18] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unify- [5] H. Chu, R. Urtasun, and S. Fidler. Song from pi: A ing visual-semantic embeddings with multimodal neu- musically plausible network for pop music generation. ral language models. arXiv preprint arXiv:1411.2539, arXiv preprint arXiv:1611.03477, 2016. 3 2014. 2 [6] J. Dong, X. Li, and C. G. Snoek. Word2visualvec: [19] P. Knees, T. Pohle, M. Schedl, and G. Widmer. A Cross-media retrieval by visual feature prediction. music search engine built upon audio-based and web- arXiv preprint arXiv:1604.06838, 2016. 11 based similarity measures. In Proceedings of the [7] M. Everingham,L. Van Gool, C. K. Williams, J. Winn, 30th annual international ACM SIGIR conference on and A. Zisserman. The pascal visual object classes Research and development in information retrieval, (voc) challenge. International journal of computer vi- pages 447–454. ACM, 2007. 2 sion, 88(2):303–338, 2010. 4 [20] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, [8] A. Graves. Neural networks. In Supervised Sequence Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Un- Labelling with Recurrent Neural Networks, pages 15– derstanding and generating simple image descriptions. 35. Springer, 2012. 4 IEEE Transactions on Pattern Analysis and Machine [9] A. Graves, A.-r. Mohamed, and G. Hinton. Speech Intelligence, 35(12):2891–2903, 2013. 4 recognition with deep recurrent neural networks. In [21] L.-J. Li and L. Fei-Fei. What, where and who? classi- 2013 IEEE international conference on acoustics, fying events by scene and object recognition. In 2007 speech and signal processing, pages 6645–6649. IEEE 11th International Conference on Computer Vi- IEEE, 2013. 4 sion, pages 1–8. IEEE, 2007. 2 [10] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espe- [22] X. Li, D. Tao, S. J. Maybank, and Y. Yuan. Vi- holt, W. Kay, M. Suleyman, and P. Blunsom. Teach- sual music and musical vision. Neurocomputing, ing machines to read and comprehend. In Advances in 71(10):2023–2028, 2008. 1 Neural Information Processing Systems, pages 1693– 1701, 2015. 5, 6, 7, 12 [23] C. C. S. Liem, M. Larson, and A. Hanjalic. When [11] S. Hochreiter and J. Schmidhuber. Long short-term music makes a scene. IJMIR, 2:15–30, 2012. 1 memory. Neural computation, 9(8):1735–1780, 1997. [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, 4 D. 
Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Con- [37] K. Sohn, W. Shang, and H. Lee. Improved multi- ference on Computer Vision, pages 740–755.Springer, modal deep learning with variation of information. In 2014. 4 Advances in Neural Information Processing Systems, [25] J. Long, E. Shelhamer, and T. Darrell. Fully convo- pages 2141–2149, 2014. 2 lutional networks for semantic segmentation. CoRR, [38] M. Sordo, O.` Celma, and C. Laurier. Querybag: Us- abs/1411.4038, 2015. 2 ing different sources for querying large music collec- [26] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and tions. In Proceedings of the 10th International Society A. Yuille. Deep captioning with multimodal re- for Music Information Retrieval Conference (ISMIR), current neural networks (m-rnn). arXiv preprint 2009. 2 arXiv:1412.6632, 2014. 2 [39] N. Srivastava and R. R. Salakhutdinov. Multi- [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and modal learning with deep boltzmann machines. In J. Dean. Distributed representations of words and Advances in neural information processing systems, phrases and their compositionality. In Advances in pages 2222–2230, 2012. 2 neural information processing systems, pages 3111– [40] B. E. Stein and M. A. Meredith. The merging of the 3119, 2013. 4 senses. The MIT Press, 1993. 1 [28] R. Miyazaki and K. Matsuda. Dynamicicon: A visu- [41] I. Sutskever, O. Vinyals, and Q. V.Le. Sequence to se- alizing technique for musical pieces in moving icons quence learning with neural networks. In Advances in based on acoustic features. Journal of Information neural information processing systems, pages 3104– Processing, 51(5):1283–1293, 2010. 2 3112, 2014. 4 [29] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, [42] M. Tan, B. Xiang, and B. Zhou. Lstm-based deep J. Shlens, A. Frome, G. S. Corrado,and J. Dean. Zero- learning models for non-factoid answer selection. shot learning by convex combination of semantic em- arXiv preprint arXiv:1511.04108, 2015. 5 beddings. arXiv preprint arXiv:1312.5650, 2013. 5, [43] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Di- 6, 7, 12 vide the gradient by a running average of its recent [30] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: magnitude. COURSERA: Neural Networks for Ma- Towards real-time object detection with region pro- chine Learning, 4(2), 2012. 6 posal networks. In Advances in neural information [44] M. Van Zaanen and P. Kanters. Automatic mood clas- processing systems, pages 91–99, 2015. 4, 8 sification using tf* idf based on lyrics. In ISMIR, pages [31] O. Russakovsky, J. Deng, H. Su, J. Krause, 75–80, 2010. 2 S. Satheesh, S. Ma, Z. Huang, A. Karpathy,A. Khosla, [45] Z.-K. Wang, R. Cai, L. Zhang, Y. Zheng, and J.-M. Li. M. Bernstein, et al. Imagenet large scale visual recog- Retrieving web images to enrich music representation. nition challenge. International Journal of Computer In ICME, 2007. 3 Vision, 115(3):211–252, 2015. 2 [46] Q. Wu, C. Shen, L. Liu, A. Dick, and A. v. d. Hen- [32] M. Schedl, E. G´omez, J. Urbano, et al. Music informa- gel. What value do explicit high level concepts have tion retrieval: Recent developments and applications. in vision to language problems? arXiv preprint Foundations and Trends R in Information Retrieval, arXiv:1506.01144, 2015. 2 8(2-3):127–261, 2014. 2, 7 [47] Y. Xia, L. Wang, and K.-F. Wong. Sentiment vector [33] K. Schwarz, T. L. Berg, and H. P. A. Lensch. 
Auto- space model for lyric-based song sentiment classifica- illustrating poems and songs with style. In Asian Con- tion. International Journal of Computer Processing ference on Computer Vision (ACCV), 2016. 2, 3, 8 Of Languages, 21(04):309–330, 2008. 2 [34] K. Simonyan and A. Zisserman. Very deep convo- [48] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, lutional networks for large-scale image recognition. R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, arXiv preprint arXiv:1409.1556, 2014. 4 attend and tell: Neural image caption generation with [35] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and visual attention. arXiv preprint arXiv:1502.03044, W. T. Freeman. Discovering object categories in im- 2(3):5, 2015. 4 age collections. 2005. 2 [49] S. Xu, T. Jin, and F. C.-M. Lau. Automatic generation [36] J. Sivic and A. Zisserman. Video google: A text re- of music slide show using personal photos. In ISM, trieval approach to object matching in videos. In Com- 2008. 3 puter Vision, 2003. Proceedings. Ninth IEEE Interna- [50] K. Yoshii and M. Goto. Music thumbnailer: Visu- tional Conference on, pages 1470–1477. IEEE, 2003. alizing musical pieces in thumbnail images based on 2 acoustic features. In ISMIR, pages 211–216, 2008. 2 Supplementary Material lyric, which even outperforms the attention model in some cases. This is because the mood tag directly points out the 8. The Shuttersong Dataset core information of the shared image-song pair and there- fore makes the pair become closer. 8.1. Favorite Count 9.2. Pooling Operation Apart from the song clip, image, and mood, we also col- lect the favorite count for each image-song pair from the The tag attention is obtained by performing the pooling Shuttersong application. The favorite counts vary from 1 to operation over the tag matrix, which plays an important role 8,964, which could be used to estimate the quality of image- in establishing the correlation between image and lyric. In song pairs as a reference. The specific statistics can be view of this, the average and max pooling strategy are com- found in Fig. 10. There are 6,043 (image, music clip, lyric) pared to evaluate their performances in remaining effective triplets owning at least 3 favorite counts, which are consid- image content. Table. 5 shows the comparison results. It ered to jointly show better expressions compared with the is clear that using average pooling is much better than max others. pooling. The potential reason is that the average pooling could extract more tag semantic values from the tag matrix, 1547 1936 so that more tag values provide a more complete description [10,) for images. 2560 [5,10) 10930 [3,5) 9.3. Loss Comparison [1,3) In addition to the Mean Squared Error (MSE) loss func- tion employed in the paper, Cosine Proximity Loss (CPL) NumberNumber of TripletsTuples and Marginal Ranking Loss (MRL) are also considered. Figure 10. The statistics of triplet number in favorite counts. There CPL is based on the cosine distance, which is commonly are 1,547 triplets owning at least 10 favorite counts, which could used in vector space model and written as follow, be considered as the image-song pair with high quality. T l = − cos v , l˜ . (12) cpl X  i i 8.2. Lyric Refinement i=1 As there are some abnormal lyrics existing in the au- As for MRL, it takes consideration of both positive and neg- tomatically searched set, it is necessary to verify each of ative samples with respect to the images query and is more them. 
Hence, we ask twenty participants to refine the lyrics, and the corresponding flow chart of the refinement is shown in Fig. 11. First, the participants judge whether the song is in English or not. Then they select the mismatched ones and conduct manual searching for the filtered English songs. The websites used for searching in this paper are www.musixmatch.com and search.azlyrics.com. Finally, both the correct matching and successfully updated ones constitute the refined lyric set. The rest lyrics are the abnormal ones, e.g. non-English songs, unfound lyrics.

9. Additional Experiments

We have shown the specific comparison results of the 28 songs with more than 50 times occurrence in the paper. The following subsections show more results of our models with these songs, as well as other compared models.

9.1. More Retrieval Results

Apart from the lyric words and image features, we also take consideration of the mood information, which is combined with the encoded lyric representation, but only 18.6% is available. As shown in Table 4, the extra mood information indeed strengthens the correlation between image and lyric.

prevalent in retrieval tasks. It belongs to the hinge loss and is written as,

l_mrl = Σ_{i=1}^{T} max{0, 1 + cos(v_i, l̃_i⁻) − cos(v_i, l̃_i⁺)},    (13)

where l̃_i⁺ is the ground truth lyric for the current image representation v_i, and l̃_i⁻ is a negative one that is randomly selected from the entire lyric database.

Table 7 shows the comparison results among the three introduced loss functions. It is obvious that MSE performs the best in both the Recall@K and Med r metrics, while MRL has the worst performance. We consider that the main reason comes from the diversity of images, e.g. the examples in Fig. 12. The images related to the same lyrics have high variance in the appearance, which makes these two modalities lack the content correspondence to each other. Hence, it becomes more challenging to deal with the positive and negative samples simultaneously. Such conditions can also be found in the image-to-text retrieval task [6].

9.4. Attribute Property

In our paper, the attribute tags perform worse than the object ones, one of the potential reasons is due to the im-

No Yes No

Yes Lyric match No Successful Yes English song? Manual search song? search?

Figure 11. The flow chart of manual lyric refinement. The automatically collected lyrics are divided into two parts, one is the abnormal ones that contains non-English song and undetected lyric, while the other is the refined lyrics used to constitute the final Shuttersong dataset.

Image tags             obj-tags                      attr-tags                     obj-attr-tags
Models                 R@1    R@5    R@10   Med r    R@1    R@5    R@10   Med r    R@1    R@5    R@10   Med r
BoW [1]                10.71  31.21  52.62  9.34     9.32   30.03  51.34  10.06    9.42   34.51  55.73  9.15
CONSE [29]             10.44  30.93  52.42  9.50     9.13   29.61  51.19  10.20    9.39   34.24  55.19  9.35
Attentive-Reader [10]  11.45  32.81  52.02  9.47     9.13   30.26  51.47  9.91     12.95  37.16  61.79  8.62
Our                    11.34  32.52  51.44  9.61     8.92   29.82  51.18  9.81     10.95  36.31  57.51  8.87
Our-mood               12.13  34.52  54.60  8.83     9.70   31.31  52.84  9.13     12.13  37.46  61.85  8.23
Our-attention          12.71  35.14  57.37  8.37     9.26   33.64  52.26  8.97     13.10  38.38  62.50  7.82

Table 4. Image2song retrieval experiment result in R@K and Med r. Three kinds of image representation are considered, i.e., object (obj), attribute (attr), and both of them (obj-attr).


I've got a question, tell me why, you always fall for the bad guy? Everywhere I go, I see love turnin' into somethin' it's not It's cuz you like it, yeah you like it. People caught with the image, when's this shit gon' stop? When I look at other girls walking by I say hello, When he gets down on his knee and that question gets popped cause you don't mind it, I think you like it I don't know why you like it. She's like, "Yes, babe!" Eyes still locked at the rock By the way, last week Saturday, was our Anniversary, I was playing GTA. He's got bands, he wants a pretty lady who wants a Mercedes-Benz 0.15 Damn I'm Sorry. So they can both show off to their friends Would you love me, Would you love me. They're in the whip, they're in the penthouse and they're in the jet to France If I wasn't such a jerk? Damn, it seems like love's the only thing that they ain't in But you love me, yeah you love me cuz you like it when it hurts. Divin' in headfirst into some shallow waters Try to be, all but nice. I'm just touchin' on a subject that I feel is kinda catastrophic 0.1 But it doesn't seem to work. I'm not sayin' every couple with money's got this problem Would you tell me that you love me, If I wasn't such a jerk? But in some cases without the label, she'd be straight up robbin' I DON' Takin' vows to the coffin on the wedding day Probabilities Average Prediction T THINK SO!! Ain't no prenupt needed, she convinced him it's true I used to treat you like a queen, but it never worked out right for me. Perfectly curved, believin' every word that she spewed You didn't like it, you never liked it. Was blinded by the beauty, can't see through 0.05 I know you're parent's hate my life, but it always seems to make you smile, Because you like it, I guess you like it I don't know why you like it. Is this shallow love? By the way, coming up on Saturday, isn't it your birthday? Or is it something real? I'll be playing GTA. You make it hard to deal Damn I'm Sorry. Not knowin' if this is shallow love 0 Would you love me, Would you love me. The way we live so well If I wasn't such a jerk? Can make it hard to tell But you love me, yeah you love me cuz you like it when it hurts. What she in it for? Attributes Try to be, all but nice. What she really in it for? But it doesn't seem to work. I don't even fucking know... Would you tell me that you love me, If I wasn't such a jerk? Figure 13. The average attribute prediction results over all the im- Figure 12. Examples of songs with high frequency appearance in ages in dataset†. The results are sorted in the descend order. the Shuttersong dataset. Multiple corresponding images are also shown for each of them. whose corresponding lyrics appear at least 5 times are con- Pooling R@1 R@3 R@5 R@10 Med r sidered. There are 249 attribute types employed in this pa- Average 13.10 28.30 38.38 62.50 7.82 per, and Fig. 13 shows the average prediction results. It is Max 12.08 26.54 35.40 59.74 8.37 clear to find that only a few types have high value, while most remain the low probabilities, which is actually a kind Table 5. The performance of the proposed model with different pooling strategies over the tag matrix. of long-tailed distribution. The imbalanced results could make it difficult to distinguish the images that belong to different songs. More importantly, the top 9 attributes are balanced attributes. We perform a statistical analysis with almost color-related, as shown in Table. 6. 
These attributes the attribute prediction probabilities, where all the images commonly appear in colorful images, and therefore become Attributes white black blue brown red green pink blonde smiling ··· Average Probabilities 0.30 0.25 0.20 0.19 0.14 0.12 0.09 0.09 0.08 ···

Table 6. The top 9 detected attributes with corresponding prediction probabilities.

Loss   R@1    R@3    R@5    R@10   Med r
MRL    9.90   22.70  36.04  57.84  8.94
CPL    11.29  26.25  37.07  60.92  8.29
MSE    13.10  28.30  38.38  62.50  7.82

Table 7. The retrieval performance of our model with distinct loss functions.

weaker in describing the specific image appearance compared with other ones, e.g. happy, messy. Hence, only employing attribute tags may suffer from the aforementioned problems and result in the unreliable correlation.
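As a companion to Table 7, the sketch below writes out the three training objectives compared there, following Eqs. (11)-(13); the margin of 1 comes from Eq. (13), while the batching details are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mse(v, l_pos):
    # Eq. (11): squared error between image tag vectors and projected positive lyrics
    return ((v - l_pos) ** 2).sum(dim=1).mean()

def cosine_proximity(v, l_pos):
    # Eq. (12): negative cosine similarity, so that minimizing it pulls the pair together
    return -F.cosine_similarity(v, l_pos, dim=1).mean()

def marginal_ranking(v, l_pos, l_neg):
    # Eq. (13): hinge loss with margin 1 over randomly sampled negative lyrics
    pos = F.cosine_similarity(v, l_pos, dim=1)
    neg = F.cosine_similarity(v, l_neg, dim=1)
    return torch.clamp(1.0 + neg - pos, min=0.0).mean()
```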