Separating Self-Expression and Visual Content in Hashtag Supervision

Andreas Veit* (Cornell University), Maximilian Nickel (Facebook AI Research), Serge Belongie (Cornell University), Laurens van der Maaten (Facebook AI Research)
[email protected]  [email protected]  [email protected]  [email protected]

*This work was performed while Andreas Veit was at Facebook.

Abstract

The variety, abundance, and structured nature of hashtags make them an interesting data source for training vision models. For instance, hashtags have the potential to significantly reduce the problem of manual supervision and annotation when learning vision models for a large number of concepts. However, a key challenge when learning from hashtags is that they are inherently subjective because they are provided by users as a form of self-expression. As a consequence, hashtags may have synonyms (different hashtags referring to the same visual content) and may be polysemous (the same hashtag referring to different visual content). These challenges limit the effectiveness of approaches that simply treat hashtags as image-label pairs. This paper presents an approach that extends upon modeling simple image-label pairs with a joint model of images, hashtags, and users. We demonstrate the efficacy of such approaches in image tagging and retrieval experiments, and show how the joint model can be used to perform user-conditional retrieval and tagging.

Figure 1. Image-retrieval results obtained using our user-specific hashtag model for the query #rock, issued by two different users (user A, whose previous hashtags include #livemusic, #live, #music, #band, and #switzerland; user B, whose previous hashtags include #arizona, #sport, #climbing, #bouldering, and #az). The box above the query shows hashtags frequently used by the user in the past. Hashtag usage varies widely among users because hashtags are a means of self-expression, not just a description of visual content. By jointly modeling users, hashtags, and images, our model disambiguates the query for a specific user. We refer the reader to the supplementary material for license information on the photos.

1. Introduction

Convolutional networks have shown great success on image-classification tasks involving a small number of classes (1000s). An increasingly important question is how this success can be extended to tasks that require the recognition of a larger variety of visual content. An important obstacle to increasing variety is that successful recognition of the long tail of visual content [11] may require manual annotation of hundreds of millions of images into hundreds of thousands of classes, which is difficult and time-consuming.

Images annotated with hashtags provide an interesting alternative source of training data because: (1) they are available in great abundance, and (2) they describe the long tail of visual content that we would like to recognize. Furthermore, hashtags appear in the sweet spot between capturing much of the rich information contained in natural-language descriptions [15] whilst being nearly as structured as image labels in datasets like ImageNet.

However, using hashtags as supervision comes with its own set of challenges. In addition to the missing-label problem that hampers many datasets with multi-label annotations (e.g., [4, 13, 17]), hashtag supervision has the problem that hashtags are inherently subjective. Since hashtags are provided by users as a form of self-expression, some users may be using different hashtags to describe the same content (synonyms), whereas other users may be using the same hashtag to describe very different content (polysemy). As a result, hashtags cannot be treated as oracle descriptions of the visual content of an image, but must be viewed as user-dependent descriptions of that content.

Motivated by this observation, we develop a user-specific hashtag model that takes the hashtag usage patterns of a user into account [31]. Instead of training on simple image-hashtag pairs, we train our model on image-user-hashtag triplets. This allows our model to learn patterns in the hashtag usage of a particular user, and to disambiguate the learning signal. After training, our model can perform a kind of intent determination that personalizes image retrieval and tagging, which allows us to retrieve more relevant images and hashtags for a particular user. Figure 1 demonstrates how our user-specific hashtag model can disambiguate the ambiguous #rock hashtag by modeling the user.

Figure 2 provides an overview of this model. It comprises a convolutional network that feeds image features into a three-way tensor model, which is responsible for modeling the interaction between image features, hashtag embeddings, and an embedding that represents the user. Multiplying the three-way interaction tensor by a user embedding yields a user-specific bilinear model. This personalized bilinear mapping between images and hashtags can take user-specific hashtag usage patterns into account.

Figure 2. Overview of the proposed user-specific hashtag model (a convolutional network produces an image embedding, which is combined with hashtag embeddings, e.g., for #goldengate, and user embeddings through the interaction tensor W). The three-way tensor product models the interactions between image features, hashtag embeddings, and user embeddings.
Our model produces a single score for an image-hashtag-user triplet; we use this score in a ranking loss in order to learn parameters that discriminate between observed and unobserved triplets. The user embeddings are learned jointly with the weights of the three-way tensor model.

We investigate the efficacy of our models in (user-specific) image tagging and retrieval experiments on the publicly available YFCC100M dataset [26]. We demonstrate that: (1) we can learn to recognize visual concepts ranging from simple shapes to specific instances such as celebrities and architectural landmarks by using hashtags as supervision; (2) our models successfully learn to discriminate synonyms and resolve hashtag ambiguities; and (3) we can improve accuracy on tasks such as image tagging by taking the user that uploaded the photo into account.

2. Related Work

Our study is related to prior work on (1) hashtag prediction and recommendation, (2) large-scale weakly supervised training, and (3) three-way tensor models.

Several prior works have studied hashtag prediction and recommendation for text posts [7, 23], infographics [2], and images [5, 32]. The study most closely related to ours is [5], which studies hashtag prediction conditioned on image and user features. The main differences between our work and [5] are (1) that we train the convolutional network end-to-end with hashtag supervision rather than using pre-trained ImageNet features and (2) that the user embeddings in our model are learned based solely on the photos users posted and the hashtags they used. Our model does not receive any metadata about the user, whereas [5] assumes access to detailed user metadata. This allows us to model intent on the level of individual users, which helps in disambiguating hashtags.

Our hashtag-prediction study is an example of large-scale weakly supervised training, which has been the topic of several recent studies. Specifically, [24] trains convolutional networks on 300 million images with noisy labels and shows that the resulting models transfer to a range of other vision tasks. Similarly, [12, 15] train networks on the YFCC100M dataset to predict words or n-grams in user posts from image content, and explore transfer of these models to other vision tasks. Further, [30] explores augmenting large-scale weak supervision with a small set of verified labels. Our study differs from these prior works both in terms of the type of supervision used (hashtags rather than manual annotation or n-grams from user comments) and in terms of its final objective (hashtag prediction rather than transfer to other vision problems).

Tensor models have a long history in psychological data analysis [9, 27] and have increasingly been used in a wide range of machine-learning problems, including link prediction in relational and temporal graphs [6, 18], higher-order recommendation systems [20], and parameter estimation in latent variable models [1]. In computer vision, prominent examples of tensor models include the modeling of style and content [25], the joint analysis of image ensembles [29], sparse image coding [21], and gait recognition [8, 28].

3. Learning from Hashtags

Our goal is to train image-recognition models that can capture a large variety of visual concepts. In particular, we aim to learn from hashtags as supervisory signal. Formally, we assume access to a set of N images I = {I_1, ..., I_N} with height H, width W, and C channels, so that I_i ∈ [0, 1]^{H×W×C}; a vocabulary of K hashtags H = {h_1, ..., h_K}; and a set of U users U = {u_1, ..., u_U}. Each image is associated with a unique user, and with one or more hashtags (we discard images without associated hashtags from the dataset). The resulting dataset comprises a set of N triplets T, in which each triplet contains an image I ∈ I, a user u ∈ U, and a hashtag set H ⊆ H. Formally, T = {(I_1, u_{m(1)}, H_1), ..., (I_N, u_{m(N)}, H_N)}, in which m(n) maps from the image/triplet index n to the corresponding user index in {1, ..., U}.
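To make the data layout concrete, the following minimal Python sketch shows one way to represent the triplet dataset T described above; the class and field names are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the triplet dataset T; all names are illustrative,
# the paper does not prescribe an implementation.
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Example:
    image_path: str           # image I_n
    user_index: int           # m(n), index into the user set U
    hashtags: FrozenSet[int]  # hashtag set H_n, indices into the vocabulary H

def build_triplet_dataset(raw_examples: List[Example]) -> List[Example]:
    # Images without associated hashtags are discarded from the dataset.
    return [ex for ex in raw_examples if ex.hashtags]
```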

Hashtag supervision differs from traditional image annotations in that it is not intended to objectively describe the image content, but merely to serve as a medium for self-expression by the user. This self-expression leads to user-specific variation in hashtag supervision that is independent of the image content. We first study convolutional networks that are agnostic to the subjective nature of hashtags and simply treat them as image labels. Subsequently, we develop a user-specific model that explicitly incorporates the user as part of the hashtag-prediction model in order to capture variations in self-expression.

Throughout this work, we focus on two tasks: (1) a tagging task in which, given a query image I, we aim to retrieve the most relevant hashtags for that image; and (2) a retrieval task in which, given a hashtag query h ∈ H, we aim to retrieve the most relevant images for that hashtag.

3.1. User-Agnostic Hashtag Modeling

We investigate two approaches for training image-recognition models using user-agnostic hashtag supervision: (1) softmax multi-class classification and (2) hashtag-embedding regression [3]. In both cases, we learn an image model f(·; θ): [0, 1]^{H×W×C} → R^D that maps images into a D-dimensional embedding space. The image model f(·; θ) is implemented by a residual network [10] with parameters θ. In addition to the image model, we learn hashtag embeddings h_i ∈ R^D for all hashtags h_i ∈ H.

Multi-Class Classification. Several prior studies [12, 24] suggest that softmax classification can be very effective even in multi-label settings with large numbers of classes such as ours. Motivated by this, we train f(·; θ) with a softmax over the 100,000 most frequent hashtags by minimizing the multi-class logistic loss. Following [12], we select a single hashtag uniformly at random from the hashtag set H_n as the target class for each image when training the softmax model. In particular, let f_j = f(I_j; θ) ∈ R^D be the image embedding, and h_i ∈ H_j the randomly selected hashtag. We then jointly learn the embeddings h_i and the parameters θ of the vision model f(·; θ) by minimizing the negative log-likelihood of the probability distribution:

    P(h_i | I_j) = exp(h_i^T f_j) / Σ_ℓ exp(h_ℓ^T f_j).    (1)

Hashtag-Embedding Regression. This training method comprises two main stages. First, we learn an embedding h_i ∈ R^D for each hashtag h_i ∈ H. Second, we follow [3] and learn the parameters θ of the image model f(·; θ) by minimizing the negative cosine similarity between the image embedding, f_j = f(I_j; θ) ∈ R^D, and the sum of the embeddings of the hashtags corresponding to image I_j, denoted h_j:

    ℓ(f_j, h_j; θ) = − (h_j^T f_j) / (‖h_j‖ ‖f_j‖).    (2)

A potential advantage of this approach is that the embeddings of synonymous hashtags are likely very similar: this implies that the loss used for training the convolutional network, in contrast to the multi-class logistic loss, does not substantially penalize predicting a synonymous hashtag that the user did not happen to use to describe the image.

We experiment with two methods for learning the hashtag embeddings h_i. The first method computes the D principal singular vectors of the positive pointwise mutual information (PPMI) matrix [14]. The second method [16] explicitly models ambiguous hashtags (i.e., hashtags with multiple meanings) by learning multi-sense hashtag embeddings; we follow [16] and use the global embedding vectors in their model as hashtag embeddings. We train all models using mini-batch stochastic gradient descent (SGD).
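As an illustration, the following PyTorch-style sketch implements the two user-agnostic losses under stated assumptions: a batch of image embeddings produced by f(·; θ), a (K, D) matrix of hashtag embeddings, and per-image lists of hashtag indices. The function and variable names are ours, not the authors'.

```python
# Sketch of the two user-agnostic losses (Eq. 1 and Eq. 2).
import random
import torch
import torch.nn.functional as F

def mcll_loss(image_emb, hashtag_emb, hashtag_sets):
    """Multi-class logistic loss of Eq. 1. Following [12], one target
    hashtag is sampled uniformly at random from each image's set H_n.
    image_emb: (B, D); hashtag_emb: (K, D); hashtag_sets: B lists of ints."""
    targets = torch.tensor([random.choice(h) for h in hashtag_sets])
    logits = image_emb @ hashtag_emb.t()  # (B, K) scores h_i^T f_j
    return F.cross_entropy(logits, targets)

def ncsl_loss(image_emb, hashtag_emb, hashtag_sets):
    """Negative cosine similarity loss of Eq. 2, computed against the sum
    of the (pre-learned) embeddings of each image's hashtags."""
    h_sum = torch.stack([hashtag_emb[h].sum(dim=0) for h in hashtag_sets])
    return -F.cosine_similarity(image_emb, h_sum, dim=1).mean()
```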
3.2. User-Specific Hashtag Modeling

The models described above do not explicitly capture variations in hashtag labels that are due to variations in how users self-express. Here, we present a model that aims to capture these variations by modeling images, hashtags, and users jointly. We will show that this can help in disambiguating the meaning of hashtags assigned to images. As before, the model represents images via a convolutional network, f_j = f(I_j; θ) ∈ R^F, and hashtags via embeddings h_i ∈ R^D. In addition, we learn user embeddings, u_k ∈ R^E. We aim to learn a scoring function s(t; W) with parameters W ∈ R^{D×F×E} that combines all three representations to predict whether or not an image-hashtag-user triplet t is correct.

Specifically, we select a hashtag h_i from the hashtag set H_j uniformly at random, and model the score of the resulting triplet t = (I_j, u_k, h_i) as:

    s(t; W) = Σ_{r1=1}^{D} Σ_{r2=1}^{F} Σ_{r3=1}^{E} w_{r1 r2 r3} h_{i r1} f_{j r2} u_{k r3},    (3)

where w_{r1 r2 r3}, h_{i r1}, f_{j r2}, and u_{k r3} are elements of W, h_i, f_j, and u_k, respectively. Equation 3 is a three-way tensor product between the embeddings of the image, hashtag, and user, in which the weights w_{r1 r2 r3} specify the (positive or negative) interactions of all possible feature combinations. The user-specific aspect of Equation 3 can be observed by considering the summation over the user dimension. In particular, when summing over the user dimension, weighted by the embedding for user u_k, we obtain a user-specific weight matrix W^(k) ∈ R^{D×F} with entries:

    w_{ab}^(k) = Σ_{r=1}^{E} u_{kr} w_{abr}.    (4)
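The following PyTorch-style sketch (with illustrative names; the dimensions follow the paper's experiments) computes the three-way score of Equation 3 with a single einsum and checks that it agrees with the user-conditioned bilinear form obtained from Equation 4:

```python
# Sketch of the three-way tensor score (Eq. 3) and the user-specific
# bilinear matrix W^(k) (Eq. 4); D = F = 300 and E = 50 as in the paper.
import torch

D, F, E = 300, 300, 50
W = torch.randn(D, F, E)  # three-way interaction tensor

def score(h_i, f_j, u_k):
    # Eq. 3: sum over all (r1, r2, r3) of w * h * f * u.
    return torch.einsum('dfe,d,f,e->', W, h_i, f_j, u_k)

def user_matrix(u_k):
    # Eq. 4: contract the user dimension to obtain W^(k) of shape (D, F).
    return torch.einsum('dfe,e->df', W, u_k)

h_i, f_j, u_k = torch.randn(D), torch.randn(F), torch.randn(E)
# The bilinear form h_i^T W^(k) f_j yields the same score (Eq. 5 below).
assert torch.allclose(score(h_i, f_j, u_k),
                      h_i @ user_matrix(u_k) @ f_j, rtol=1e-3)
```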
Table 1. Frequency of the most common hashtags in the data set.

    Hashtag       Frequency   Unique Users
    #california     905,715         15,785
    #travel         826,366         15,944
    #usa            825,641         13,400
    #london         764,277         21,516
    #japan          732,859         11,652
    #france         650,436         17,265
    #wedding        580,605         19,599
    #music          552,645         23,359
    #beach          547,038         44,695

Figure 3. Left: Frequency of hashtags in hashtag vocabulary H (hashtags ordered by frequency; frequencies range from over 1M into a long tail). Right: Number of photos per user in user set U (users ordered by number of images in the training set).

The score function of Equation 3 is then equivalent to:

    s(t; W) = h_i^T W^(k) f_j.    (5)

Hence, our proposed model learns user-conditioned bilinear models between images and hashtags, by conditioning the weight matrix of the bilinear model on the user embedding.

Given a dataset of M triplets¹ T = {t_1, ..., t_M}, we estimate the parameters W using a ranking approach. In particular, we want the score of a true, observed triplet t+ ∈ T to be higher than that of an unobserved triplet t− ∉ T. We achieve this by minimizing the following loss:

    ℓ(t+; W) = max(0, max_{t− ∉ T} s(t−; W) − s(t+; W) + 1).

¹Note that T contains image-hashtag-user triplets, whereas the dataset of Section 3 contains image-(hashtag set)-user triplets.

This ranking loss is better suited for our problem than a per-triplet binary logistic loss, because the latter would consider any unobserved triplet as a "negative". This is problematic because (1) the hashtag annotations for an image are generally not exhaustive and (2) there are far more unobserved than observed triplets. The ranking loss only aims to assign a lower score to unobserved triplets and, as a result, is not nearly as much affected by these problems.

In practice, the maximization over negative triplets t− can only be approximated. For our ranking loss to be effective, it is essential to develop good approximations for the maximization by mining "hard negatives" [22]. We perform online hard negative mining along all three axes, i.e., we rank tags, images, and users. Specifically, we sample six negative triplets per positive sample, and use each of them as a negative in the loss: three "intermediate" and three "hard" negatives. In an "intermediate" negative, one of the three elements (the image, hashtag, or user) of the positive triplet is replaced by another element that is selected uniformly at random from the training batch; the other two elements remain the same. In a "hard" negative, we replace one of the three elements in the triplet by the (non-identical) element in the training batch that maximizes the score s(t; W).
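A PyTorch-style sketch of this margin ranking loss with in-batch hard-negative mining along the hashtag axis follows; mining along the image and user axes is analogous. The batch layout and names are our assumptions.

```python
# Sketch of the ranking loss with in-batch hard negative mining.
# H: (B, D) hashtag embeddings, F_img: (B, F) image embeddings,
# U: (B, E) user embeddings; row b of each forms positive triplet t_b.
import torch

def ranking_loss_hard_hashtags(W, H, F_img, U, margin=1.0):
    B = H.size(0)
    Wk = torch.einsum('dfe,be->bdf', W, U)           # per-user W^(k)
    # s[a, b] = score of (hashtag a, image b, user b): each positive
    # triplet with its hashtag swapped for every other in-batch hashtag.
    s = torch.einsum('ad,bdf,bf->ab', H, Wk, F_img)  # (B, B) scores
    pos = s.diagonal()                               # s(t_b; W)
    neg = s.clone()
    neg.fill_diagonal_(float('-inf'))                # exclude the positive
    hard = neg.max(dim=0).values                     # best wrong hashtag per b
    # "Intermediate" negatives would instead draw the replacement uniformly
    # at random, e.g. s[torch.randint(B, (B,)), torch.arange(B)].
    return torch.clamp(margin + hard - pos, min=0).mean()
```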
As before, we train our user-specific hashtag model using mini-batch SGD. We first learn the parameters of the convolutional network, θ, by minimizing one of the losses from Section 3.1. We then jointly learn the image, hashtag, and user embeddings, as well as the parameters of the scoring function, W, in a subsequent training stage. In our experiments, we use image and hashtag embeddings with D = F = 300 dimensions and user embeddings of size E = 50.

Once we have inferred the embeddings for users, hashtags, and images, as well as W, we can approach the aforementioned image tagging and retrieval tasks in the following way. Given a user u_k and an image I_j, we compute the most likely hashtag according to our model as:

    argmax_{h_i ∈ H} h_i^T W^(k) f_j.    (6)

The most likely image given a hashtag-user pair can be retrieved analogously.
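In code, both inference tasks reduce to precomputing the user-specific matrix W^(k) once per user and ranking candidates by the bilinear score. A PyTorch-style sketch with illustrative names:

```python
# Sketch of user-conditional inference with Eq. 6; precomputing W^(k)
# turns tagging and retrieval into two matrix products each.
import torch

def user_matrix(W, u_k):
    return torch.einsum('dfe,e->df', W, u_k)            # W^(k), Eq. 4

def top_hashtags(W, u_k, f_j, hashtag_emb, k=10):
    """Rank the vocabulary for one (user, image) pair, Eq. 6."""
    scores = hashtag_emb @ (user_matrix(W, u_k) @ f_j)  # h_i^T W^(k) f_j
    return scores.topk(k).indices

def top_images(W, u_k, h_i, image_emb, k=10):
    """Rank a gallery of image embeddings for one (user, hashtag) pair."""
    scores = image_emb @ (user_matrix(W, u_k).t() @ h_i)
    return scores.topk(k).indices
```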
4. Experiments

The aim of our experiments is: (1) to compare the strategies for training user-agnostic convolutional networks using hashtag supervision introduced in Section 3.1 and (2) to investigate the effectiveness of the user-specific hashtag model introduced in Section 3.2.

4.1. Dataset

We conduct experiments on the YFCC100M dataset [26] of approximately 99.2 million photos. More than 60 million of these photos have one or more associated hashtags, and each photo has an associated user: the user who uploaded it. We start by removing numerical hashtags, and also remove the 10 most frequent tags because they are non-visual and non-informative (e.g., #iphonography, #…, #square, and #canon). We define the hashtag set H as the set of the 100,000 most frequent (remaining) hashtags.

The left plot in Figure 3 shows the resulting hashtag frequencies, and Table 1 lists the most frequent hashtags. The hashtag distribution is heavily skewed towards a few frequent hashtags and has a long tail of less frequent tags. For example, the most frequent hashtag, #california, appears over 900,000 times, i.e., in about 1.78% of the images, whereas the least frequent hashtags in our hashtag set only appear 260 times. Another characteristic of the hashtags is that while the most frequent tags are English, the less frequent tags tend to be increasingly multilingual.

We select all photos with at least one hashtag from H, and filter out photos by "spammers", i.e., users that assigned more than 15 hashtags per image on average. This results in a dataset of 55.6 million images and a user set U of 315,745 users; taken together, the dataset contains an average of about 4.6 tags per photo. As shown in the right plot of Figure 3, the number of photos per user is also heavy-tailed.

To model a realistic use case, we split the photos for training and testing according to the time stamps of their upload. We sort the photos of each user by timestamp, assign the first 90% of each user's photos to the training set, and assign the remaining photos to the validation and test sets; the resulting test set contains 4.7 million images.

4.2. Experiment 1: Hashtag Prediction

In the first set of experiments, we use our models to predict hashtags that are relevant to a given image. We measure the tagging quality of our models by their ability to predict the hashtags associated with a given image, in terms of the accuracy of the highest-scoring hashtags for that image. We denote the set of the k highest-scoring hashtags for image I by R_k(I) and, as before, denote the set of hashtags that are associated with that image by H(I). Accuracy@k (A@k) is then defined as:

    A@k = (1/N) Σ_{n=1}^{N} 1[R_k(I_n) ∩ H(I_n) ≠ ∅].    (7)

We evaluate accuracy at k = 1 and k = 10 to measure (1) how often the top-ranked hashtag is in the ground-truth hashtag set and (2) how often at least one of the 10 highest-ranked predictions appears in the ground-truth hashtags. A key challenge in this task is that different users assign different hashtags to similar visual content: ideally, tagging methods assign hashtags that are relevant to the image content and of importance to the user under consideration.
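A small sketch of this metric, assuming a score matrix over the vocabulary and per-image ground-truth sets (illustrative names):

```python
# Sketch of accuracy@k (Eq. 7): an image counts as correct if any of
# its k highest-scoring hashtags appears in its ground-truth set H(I_n).
import torch

def accuracy_at_k(scores, hashtag_sets, k=10):
    """scores: (N, K) hashtag scores per image;
    hashtag_sets: list of N sets of ground-truth hashtag indices."""
    topk = scores.topk(k, dim=1).indices          # R_k(I_n)
    hits = [bool(set(row.tolist()) & gt)          # R_k(I_n) ∩ H(I_n) ≠ ∅
            for row, gt in zip(topk, hashtag_sets)]
    return sum(hits) / len(hits)
```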
In addition to the user-specific model of Section 3.2, we evaluate four user-agnostic models: (1) a model that trains a linear logistic regressor on features extracted by a baseline convolutional network trained on ImageNet (ImageNet); (2) a network that is trained end-to-end for multi-class hashtag prediction using the multi-class logistic loss (MCLL); (3) an end-to-end trained network that uses the negative cosine similarity loss of Equation 2 with the PPMI hashtag embeddings of [14] (NCSL-PPMI); and (4) an end-to-end trained network that uses the same loss but employs multi-sense hashtag embeddings [16] (NCSL-MS). In all experiments, our convolutional network is a ResNet-50. We evaluate three user-specific models that share the same training approach, but that vary in terms of the convolutional network that feeds image features into the three-way tensor model (those networks were trained using MCLL, NCSL-PPMI, and NCSL-MS, respectively).

Figure 4 presents the tagging accuracy (A@1) of the four user-agnostic and three user-specific models on the test set. Additionally, Table 2 presents the accuracy@10 of these models and of three additional baselines: (1) a frequency baseline that predicts tags according to their frequency in the training set; (2) a baseline that predicts tags according to their frequency for the specific user under consideration; and (3) a series of MLP models in which we concatenate the embeddings of the three modalities and score them using a multi-layer perceptron (MLP) rather than the three-way tensor model.

Figure 4. Image tagging: Accuracy@1 (A@1) of four user-agnostic and three user-specific hashtag prediction models on the YFCC100M test set; see the text for details. Higher is better. (User-agnostic: ImageNet 17.2%, MCLL 29.2%, NCSL-PPMI 28.7%, NCSL-MS 27.9%; user-specific: Tensor (MCLL) 41.2%, Tensor (NCSL-PPMI) 40.4%, Tensor (NCSL-MS) 43.7%.)

Table 2. Image tagging: Accuracy@1 (A@1) and accuracy@10 (A@10) of two frequency baselines, four user-agnostic hashtag prediction models, and six user-specific hashtag prediction models; see text for details. Higher is better.

    Method                      A@1      A@10
    Global frequency           1.68%     9.65%
    User-specific frequency   38.07%    62.55%
    User-agnostic models:
    ImageNet                  17.21%    40.01%
    MCLL                      29.24%    56.47%
    NCSL-PPMI                 28.72%    47.70%
    NCSL-MS                   27.94%    46.65%
    User-specific models:
    MLP (MCLL)                35.58%    65.58%
    MLP (NCSL-PPMI)           37.31%    67.68%
    MLP (NCSL-MS)             41.66%    71.34%
    Tensor (MCLL)             41.24%    70.75%
    Tensor (NCSL-PPMI)        40.43%    68.86%
    Tensor (NCSL-MS)          43.65%    72.12%

From the results presented in the figure and the table, we make five main observations. First, all models clearly outperform the (global) frequency baseline and generally perform quite well, given that each image can be assigned 1 of 100,000 different hashtags. Second, the results show that training networks from scratch substantially outperforms ImageNet-trained networks for hashtag prediction, suggesting that the visual variety in ImageNet does not suffice for hashtag prediction. Third, the user-agnostic model trained using the multi-class logistic loss (MCLL) outperforms the user-agnostic models trained using the negative cosine similarity loss (NCSL). Fourth, in terms of accuracy at k = 1 in particular, all user-specific models significantly outperform the user-agnostic models, which demonstrates the ability of the user-specific models to capture user-specific features in their predictions. Fifth, the three-way tensor models substantially outperform the user-specific frequency baseline and generally outperform the user-specific MLP baseline models, which suggests that three-way tensor models are best suited for tailoring predictions based on visual content to a particular user. The highest accuracy is obtained by a three-way tensor model on top of a convolutional network trained using NCSL-MS, which is surprising because that network has the lowest accuracy among the user-agnostic models.

Figure 6. Image tagging: Example tagging results from the user-agnostic (MCLL) and user-specific (Tensor NCSL-MS) models. Panel (a) shows user-agnostic tagging examples (e.g., #nyc, #night, #new york, #bw; #golden gate bridge, #san francisco, #goldengatebridge; #autumn, #fall, #river, #trees). Panel (b) shows user-specific examples for three users, listing each user's previous hashtags together with user-agnostic predictions, user-specific predictions, and ground-truth tags (e.g., for user A the prediction changes from #nature to #central park; for user B, predictions change from English to Spanish; for user C, a rugby stadium is distinguished from a soccer stadium).

Figure 5. Image tagging: Accuracy@1 (A@1) of the user-agnostic model and the respective user-specific tensor model as a function of the number of images by the user (x-axis: number of images of the user, 1 to 143,547; y-axis: A@1, 0.1 to 0.5; curves: MCLL and Tensor (MCLL)).

In Figure 5, we break down the tagging accuracy of our models per user by measuring accuracy as a function of the number of training images the models observed for that user. We show the accuracy break-down for the best-performing user-agnostic model (MCLL) and its corresponding tensor model. The figure shows that the user-agnostic model works well across all users, but tends to perform better for users with large image libraries. We surmise this effect is due to the fact that those users have provided the majority of the images in our training set, as a result of which they dominate the data distribution. For the user-specific tagging model, we observe a stronger relationship between accuracy and the number of images per user. Whilst the user-specific model outperforms the user-agnostic one for all users, the main benefits of user-specific modeling are for users with more than approximately 27 uploaded photos. For users with many photos, the tensor model has more data that it can use to pin down the user embeddings that capture their hashtag usage patterns.

Figure 6 shows examples of user-agnostic and user-specific tagging results. The tag predictions were obtained using the MCLL model and the Tensor (MCLL) model, respectively. The figure highlights the wide range of visual concepts that our convolutional networks learned to recognize. This range encompasses objects such as "people", "river", and "trees"; specific instances and locations such as the "Golden Gate Bridge", "San Francisco", and "New York"; whole-image concepts such as "autumn"; and image styles such as "black and white".
Figure 7. Image tagging: Accuracy@1 (A@1) of the Tensor (NCSL-MS) model as a function of the user embedding size, E (x-axis: dimensionality of the user embedding, 25 to 200; y-axis: A@1, approximately 40% to 45%).
The bottom part of the figure highlights the differences between the user-agnostic and user-specific models. Specifically, it shows tag predictions the user-agnostic model makes for a photo, and predictions the user-specific model makes for that same photo for a particular user; to provide insight into the user's "profile", we also show the most frequent hashtags for that user.

We observe that the user-specific model can help in disambiguating the (most likely) location of a photo: e.g., it changes its prediction from #nature to #central park for a user that often tags photos with concepts related to New York. The user-specific model can also change predictions into the user's preferred language (e.g., from English to Spanish), and it can help in disambiguating fine-grained categories, such as recognizing the difference between a rugby and a soccer stadium. We emphasize that all the information the user-specific model uses to make these disambiguations comes from image-hashtag-user triplets; the model does not employ any additional user metadata.

A key component of the user-specific model are the user embeddings that personalize the mapping between images and hashtags. Figure 7 shows the accuracy of the top-performing user-specific model (Tensor NCSL-MS) as a function of the dimensionality of the user embedding. The results show that a substantial number of dimensions is needed, suggesting that the user embeddings play an important role in the accuracy of the model.

4.3. Experiment 2: Hashtag-Based Image Retrieval

In a second set of experiments, we study hashtag-based image retrieval and measure the quality of our models by their ability to retrieve relevant images given a hashtag query, in terms of precision@k (P@k). We define the set of the k highest-scoring images for hashtag h as R_k^(h), and the set of photos that are labeled with hashtag h as GT^(h) = {I : ∃u (I, u, h) ∈ T}. P@k is then defined as:

    P@k = (1/|H|) Σ_{h ∈ H} |R_k^(h) ∩ GT^(h)| / k.    (8)

We measure P@10 (i.e., k = 10 in our experiments): the fraction of the 10 top-scoring images that have the query hashtag associated with them. A key challenge in this task is that hashtags can have multiple meanings: ideally, retrieval methods retrieve photos corresponding to all meanings of a hashtag.

Figure 8. Hashtag-based image retrieval: Top-scoring photos and corresponding ground-truth hashtags for six hashtag queries (#circle, #green, #sign, #lego, #star wars, and #empire state building). Results obtained using the user-agnostic MCLL model.
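A small sketch of this retrieval metric, mirroring the accuracy@k sketch above (illustrative names):

```python
# Sketch of precision@k (Eq. 8): the fraction of the k top-scoring
# images per hashtag query that carry that hashtag, GT^(h), averaged
# over the query vocabulary.
import torch

def precision_at_k(scores, gt_sets, k=10):
    """scores: (Q, N) image scores for Q hashtag queries;
    gt_sets: list of Q sets of image indices labeled with the query."""
    topk = scores.topk(k, dim=1).indices          # R_k^(h)
    per_query = [len(set(row.tolist()) & gt) / k  # |R_k ∩ GT| / k
                 for row, gt in zip(topk, gt_sets)]
    return sum(per_query) / len(per_query)
```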
Figure 9 presents the P@10 on the test set for the four user-agnostic models that were also used in the first experiment of Section 4.2. From the results, we make three main observations, similar to those in Section 4.2. First, the visual variety in ImageNet does not suffice for hashtag-based image retrieval, as reflected in the low retrieval precision of the ImageNet model. Second, multi-sense embeddings (MS) seem more suitable for training with the negative cosine similarity loss (NCSL) than PPMI embeddings, presumably because they are better at modeling ambiguous hashtags. Third, we observe that the network that was trained using the multi-class logistic loss (MCLL) substantially outperforms all other models.

Figure 9. Hashtag-based image retrieval: Precision@10 (P@10) of four convolutional networks on the test set; see text for details. Higher is better. (ImageNet: 2.5%; NCSL-PPMI: 3.5%; NCSL-MS: 6.0%; MCLL: 9.9%.)

We emphasize that not every photo relevant to a hashtag query is also labeled with that hashtag, which gives rise to relatively low precision values. The qualitative image-retrieval results produced by the MCLL model, shown in Figure 8, suggest that many of the retrieved photos are actually relevant to the hashtag queries, even if they are not labeled as such. More importantly, Figure 8 illustrates the wide variety of visual concepts our models learned to recognize; the recognized concepts range from simple shapes and colors to fine-grained concepts and individual instances of architectural landmarks. Figure 1 shows an example of images retrieved for the same query, #rock, by our user-specific model for two different users; it demonstrates how modeling the user can disambiguate hashtags.

In Figure 10, we break down the image-retrieval precision by the frequency of the query hashtags. The plot shows that: (1) the retrieval performance is higher for frequent hashtags; and (2) the difference between the MCLL model and the NCSL models lies primarily in the frequent tags: when evaluated on the long tail of less frequent tags, the classification model and the multi-sense embedding model achieve a very similar precision@10.

Figure 10. Hashtag-based image retrieval: Precision@10 (P@10) of four convolutional networks (ImageNet, NCSL-PPMI, NCSL-MS, MCLL) as a function of the frequency of the query hashtag in YFCC100M (from 905,715 down to 264). Higher is better.

5. Conclusion and Future Work

This paper trained convolutional networks from scratch to perform hashtag prediction, and extended these networks with a three-way tensor model that learns user embeddings jointly with the final prediction model. This allows us to tailor the model's predictions to a specific user at test time. We used two different approaches for training the convolutional networks: a standard classification approach and an approach that regresses onto pre-learned hashtag embeddings. The classification approach performs consistently well across all tasks, whereas the embedding-regression approach mainly performs well for (user-specific) image tagging. Generally, the user-specific approach substantially outperforms the user-agnostic models, demonstrating its ability to capture user-specific features in its predictions.

We surmise that the relatively poor performance of the embedding-regression (NCSL) models in the image-retrieval experiments is due to the hashtag embeddings being fixed in those models, whereas they are learned jointly with the visual features in the classification model. This reduces the effective model capacity of the embedding-regression models, resulting in weaker retrieval performance. This limitation is alleviated in the user-specific model, in which all embeddings are learned jointly; for example, we observe competitive performance of the tensor model that builds on NCSL-trained convolutional networks in the tagging experiments.

In future work, we plan to re-visit user-specific image retrieval in settings with explicit relevance information. Other directions for future work include modeling user metadata [5] as well as spatial and temporal patterns [19].

References

[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research (JMLR), 15:2773-2832, 2014.
[2] Z. Bylinskii, S. Alsheikh, S. Madan, A. Recasens, K. Zhong, H. Pfister, F. Durand, and A. Oliva. Understanding infographics through textual and visual tag prediction. 2017.
[3] F. Chollet. Information-theoretical label embeddings for large-scale image classification. arXiv preprint arXiv:1607.05691, 2016.
[4] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In ACM International Conference on Image and Video Retrieval (CIVR), 2009.
[5] E. Denton, J. Weston, M. Paluri, L. Bourdev, and R. Fergus. User conditional hashtag prediction for images. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015.
[6] D. M. Dunlavy, T. G. Kolda, and E. Acar. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):10, 2011.
[7] F. Godin, V. Slavkovikj, W. De Neve, B. Schrauwen, and R. Van de Walle. Using topic models for Twitter hashtag recommendation. In International Conference on World Wide Web (WWW), 2013.
[8] W. Gong, M. Sapienza, and F. Cuzzolin. Fisher tensor decomposition for unconstrained gait recognition. In ECML/PKDD Workshops, 2013.
[9] R. A. Harshman and M. E. Lundy. PARAFAC: Parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39-72, 1994.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] G. Van Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. 2017.
[12] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In European Conference on Computer Vision (ECCV). Springer, 2016.
[13] I. Krasin, T. Duerig, N. Alldrin, A. Veit, S. Abu-El-Haija, S. Belongie, D. Cai, Z. Feng, V. Ferrari, V. Gomes, et al. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2016.
[14] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2014.
[15] A. Li, A. Jabri, A. Joulin, and L. van der Maaten. Learning visual n-grams from web data. In International Conference on Computer Vision (ICCV), 2017.
[16] J. Li and D. Jurafsky. Do multi-sense embeddings improve natural language understanding? In Conference on Empirical Methods on Natural Language Processing (EMNLP), 2015.
[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Springer, 2014.
[18] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning (ICML), 2011.
[19] T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from Flickr tags. In International Conference on Research and Development in Information Retrieval (SIGIR), 2007.
[20] S. Rendle. Factorization machines. In International Conference on Data Mining (ICDM), 2010.
[21] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In International Conference on Machine Learning (ICML), 2005.
[22] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] C. Stanley and M. Byrne. Comparing vector-based and Bayesian memory models using large-scale datasets: User-generated hashtag and tag prediction on Twitter and Stack Overflow. Psychological Methods, 21(4):542-565, 2016.
[24] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In International Conference on Computer Vision (ICCV), 2017.
[25] J. B. Tenenbaum and W. T. Freeman. Separating style and content. In Advances in Neural Information Processing Systems (NIPS).
[26] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64-73, 2016.
[27] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279-311, 1966.
[28] M. A. O. Vasilescu. Human motion signatures: Analysis, synthesis, recognition. In International Conference on Pattern Recognition (ICPR), 2002.
[29] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision (ECCV). Springer, 2002.
[30] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[31] P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems (NIPS), 2010.
[32] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: Learning to rank with joint word-image embeddings. Machine Learning, 81(1):21-35, 2011.
