
A Sequence-Oblivious Generation Method for Context-Aware Recommendation

Junmo Kang, Jeonghwan Kim, Suwon Shin, Sung-Hyon Myaeng
School of Computing, KAIST, Daejeon, Republic of Korea
{junmo.kang, jeonghwankim123, ssw0093, myaeng}@kaist.ac.kr

Abstract

Like search, a recommendation task accepts an input query or cue and provides desirable items, often based on a ranking function. Such a ranking approach rarely considers explicit dependency among the recommended items. In this work, we propose a generative approach to tag recommendation, where semantic tags are selected one at a time conditioned on the previously generated tags to model inter-dependency among the generated tags. We apply this tag recommendation approach to an Instagram data set where an array of context feature types (image, location, time, and text) are available for posts. To exploit the inter-dependency among the distinct types of features, we adopt a simple yet effective architecture using self-attention, making deep interactions possible. Empirical results show that our method is significantly superior to not only the usual ranking schemes but also autoregressive models for tag recommendation. They indicate that it is critical to fuse mutually supporting features at an early stage to induce an extensive and comprehensive view on inter-context interaction in generating tags in a recurrent feedback loop.

Figure 1: Ranking vs. Generation model.

1 Introduction

From traditional term-based methods to deep neural network based models, recommendation functions in the Internet domain are widely adopted. While categorical features are often used for collaborative filtering types along with a user cue, recommending text or image, i.e. content-based recommendation, requires a different kind of handling, based on the relevance of items toward a user query. Like search, most of these recommendation systems rank and return items with top-k predicted logit values (Covington et al., 2016; Weston et al., 2014; Wu et al., 2018b). We posit that tag recommendation touches on the middle ground because while the tags can be seen as predefined categories, they can be generated as a sequence of tags like a natural language sentence.

Social networking service (SNS) platforms like Instagram and Twitter are example cases in which tag recommendation plays a significant role in aggregating and distributing information. Users tend to include hashtags when they post daily life stories or advertisements, expecting they would serve as keywords for the semantics and pragmatics of the unstructured content like image and text. Recommending appropriate hashtags to the users would increase global coherency of the tags used in the user population and subsequently facilitate grouping the posts of similar topics for easier navigation/search.

A number of hashtag recommendation methods have been proposed to date (Weston et al., 2014; Gong and Zhang, 2016; Wu et al., 2018a; Wang et al., 2019; Zhang et al., 2019; Yang et al., 2020b; Kaviani and Rahmani, 2020; Yang et al., 2020a). These methods focus on modeling latent topic distribution over hashtags (Godin et al., 2013), heavily relying on a late fusion approach to model the interaction between image and text input (Zhang et al., 2017; Yang et al., 2020b), or project words within the given SNS post and the hashtag embeddings (i.e. tag embeddings) to a common high-dimensional space and update the embeddings with a pairwise ranking loss (Weston et al., 2014). However, prior studies taking the ranking approach to tag recommendation neglect inter-dependency among the generated hashtags for the given context (i.e., post).

We propose a recurrent hashtag feedback approach to tag recommendation, which enables the recommendation model to repeatedly consider previously generated tags in generating the next "relevant" tag. Our recurrent model using BERT (Devlin et al., 2019) generates hashtags conditioned on the assorted context information and previously generated hashtags as in Figure 1. Note that this recurrent BERT model is devised for the unique nature of syntax-free tag generation, instead of the usual RNN or BERT approach for language generation. On a different note, this approach can be seen as analogous to pseudo-relevance feedback (PRF) (Xu and Croft, 1996), where previously retrieved items are deemed relevant and additional query terms are extracted as relevance feedback. For tag recommendation, we also assume that previously generated tags can be trusted to be relevant and incorporate them for generating the next one. Nonetheless, they can be seen as part of the new "query" for generation of the next tag.

Our work also proposes an early fusion of multi-modal context features of Instagram posts (image, location, time, and text). Prior works (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016) on combining multi-modal features in hashtag recommendation tend to merge the representations with either a co-attention (Lu et al., 2016) or a bi-attention (Seo et al., 2017) mechanism after independently encoding features of differing modalities. Based on our intuition that context input features should affect the representation modeling process as early as possible, we exploit the self-attention based pre-trained BERT to fuse the different features and encode the relationships at an early stage of building representations. This approach has an added benefit of allowing for an investigation of how the features influence each other for tag generation, in addition to a usual ablation study where we can only reveal the role of each feature type in the overall performance.

Our experimental work shows that the proposed method outperforms the ranking approaches by a significant margin. To further differentiate and evaluate our model against a generative approach that has been used in other studies (Wang et al., 2019; Yang et al., 2020b), we build a BERT-based autoregressive (AR) model with a Transformer (Vaswani et al., 2017) decoder. The experimental results show that our model also outperforms the AR model by a large margin.

Our key contributions are summarized as follows:

• A generation framework, recurrent hashtag feedback, for tag recommendation, considering inter-tag dependency

• An early fusion approach enabling deep interactions among context features and tags, and

• Experiments showing the superiority of the proposed approaches and shedding light on the way the context features interact among each other for tag generation.

2 Related Work

The hashtag recommendation problem has been studied as a ranking problem (Park et al., 2016; Zangerle et al., 2011; Denton et al., 2015; Sedhai and Sun, 2014; Li et al., 2016; Weston et al., 2014; Wu et al., 2018b; Gong and Zhang, 2016). A representative approach is to use a visual feature extractor on the input image, employ a multi-label classifier to calculate the score of each hashtag, and provide top-k hashtag recommendations (Park et al., 2016). Others handle multiple input feature types (i.e. image, text) by mapping them into a common representation space and applying a pairwise ranking loss algorithm (Denton et al., 2015; Weston et al., 2014; Wu et al., 2018b), such as the weighted approximate-rank pairwise (WARP) loss (Weston et al., 2011), as the training objective. Many of the prior studies on hashtag recommendation (Godin et al., 2013; Ding et al., 2012; Li et al., 2019; Zhao et al., 2016) take topic modeling approaches with Latent Dirichlet Allocation (LDA), which is often used to discover general topics in a large collection of documents. Unlike our approach, however, such ranking approaches do not explicitly consider the inter-dependency among the generated hashtags.

The use of multi-modal features is also evident (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016; Zhang et al., 2017; Yang et al., 2020a,b). The types of multi-modal features and

the ways they are fused are quite distinct. For instance, Denton et al. incorporate user metadata (e.g. age, gender) with 3-way multiplicative gating along with image for hashtag recommendation. Another example (Wang et al., 2019) uses both the text description of a given tweet and the thread conversation by employing bi-attention (Seo et al., 2017). A more recent approach makes use of audio features (Yang et al., 2020a) for short video information. A common drawback of these previous models is that they capture a very limited amount of relationship among the input features of different modalities. Our approach of using a self-attention mechanism by employing a pre-trained BERT ensures that inter-dependency among the given image, location, time and text description is learned at the early stage of the layer hierarchy to model the contextual information dispersed throughout the input features.

Figure 2: Overall architecture of the proposed approach (sequential tag generation).

3 Approach

3.1 Recommendation as a Generation Task

In contrast to the ranking approaches, our model generates a sequence of interrelated hashtags given an assortment of context features. As in Figure 2, unlike autoregressive generation in which only the representation of the immediately preceding token is pooled (Vaswani et al., 2017), our model fuses mutually supporting context features "directly" with the self-attention mechanism along with the incrementally appended hashtags. This early fusion approach is conducive to modeling our representation space because the output space is jointly modeled with the combined context-tag representation for every generation step. On the contrary, the commonly used late fusion of latent representations of multi-modal input vectors is limited to the aggregation of the projected information. The expected benefit of the early fusion is the extensive and comprehensive view on the contextual information in generating a hashtag.

Inspired by the generative application of BERT in (Chan and Fan, 2019), our model uses a pre-trained BERT model that generates tags one after another and recurrently feeds them back as input. Our BERT-based model is trained on a hashtag prediction task. Given context features and co-occurring tags as input, it predicts a relevant hashtag while taking into account what hashtags have been generated. More formally, the s-th hashtag is generated as follows:

$\hat{ht}_s = \operatorname{argmax}_{ht_s} P(ht_s \mid img, loc, time, txt, \hat{ht}_{<s})$  (1)

$h_i = \mathrm{BERT}(X_i)$  (4)

$z = h_i^{[\mathrm{MASK}]} \cdot W_{vocab}^{T} + b$  (5)

$p_k = \frac{\exp(z_k)}{\sum_{ht \in HT} \exp(z_{ht})}$  (6)

$\hat{y}_i = \operatorname{argmax}_{y_i}(p)$  (7)

where $h_i \in \mathbb{R}^n$, $z \in \mathbb{R}^{|V|}$, and $p \in \mathbb{R}^{|V|}$. Then the generated $\hat{y}_i$ is again used to expand the input sequence for the next step:

$X_{i+1} = [x, [\mathrm{SEP}], \hat{y}_1, \hat{y}_2, ..., \hat{y}_{i-1}, \hat{y}_i, [\mathrm{MASK}]]$  (8)

• Text. A set of words are collected from a user's textual description in a post. We strip hashtags from the description and use the remaining text only as the text feature.

We then enter the input context $C$ of image $img$, location $loc$, time $time$ and text description $txt$, with a delimiter token for each ([IMG], [LOC], [TIME], [SEP]). Note that the hashtag(s) generated from the previous stages $\hat{ht}_{<s}$ and a [MASK] token are appended to the end:

$img = [img_1, img_2, ..., img_{|img|-1}, [\mathrm{IMG}]]$  (9)

$loc = [loc_1, loc_2, ..., loc_{|loc|-1}, [\mathrm{LOC}]]$  (10)

$time = [time_1, time_2, ..., time_{|time|-1}, [\mathrm{TIME}]]$  (11)

$txt = [txt_1, txt_2, ..., txt_{|txt|-1}, [\mathrm{SEP}]]$  (12)

$C = [img, loc, time, txt]$  (13)

$I_s = [C, \hat{ht}_{<s}, [\mathrm{MASK}]]$  (14)

3.3 Early Fusion of Different Features

As implied by Eq. 1, where a hashtag is generated for an input context consisting of $img$, $loc$, $time$, and $txt$, a precursor of our early fusion strategy is the construction of the single input sequence $I_s$ (Eq. 14), in which all feature types are processed jointly by self-attention from the first layer onward.
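To tie Eqs. (4)-(14) together, the following is a minimal sketch of the recurrent feedback loop, assuming a BERT masked-LM already fine-tuned for hashtag prediction and single-token hashtags; the checkpoint name, the added special tokens, and the helper names (`generate_tags`, `context_tokens`, `hashtag_ids`) are illustrative assumptions rather than the authors' released implementation.

```python
# A minimal sketch of the recurrent tag-generation loop (Eqs. 4-14), assuming a
# BERT masked-LM fine-tuned for hashtag prediction. The checkpoint name, the
# added special tokens, and the helper names are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[IMG]", "[LOC]", "[TIME]"]})
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))
model.eval()

def generate_tags(context_tokens, hashtag_ids, num_tags=10):
    """Greedily generate `num_tags` hashtags, feeding each prediction back.

    context_tokens: C = [img, loc, time, txt], already delimited as in Eq. 13.
    hashtag_ids: token ids of the hashtag vocabulary HT (restricting Eq. 6).
    """
    generated = []
    for _ in range(num_tags):
        # I_s = [C, ht_<s, [MASK]] (Eq. 14): previously generated tags are
        # appended to the context, followed by a single [MASK] token.
        tokens = ["[CLS]"] + context_tokens + generated + ["[MASK]"]
        input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
        with torch.no_grad():
            logits = model(input_ids).logits   # h_i -> z (Eqs. 4-5)
        mask_logits = logits[0, -1]            # scores at the [MASK] position
        # Softmax over the hashtag vocabulary only (Eq. 6), then argmax (Eq. 7).
        probs = torch.softmax(mask_logits[hashtag_ids], dim=-1)
        best_id = hashtag_ids[int(torch.argmax(probs))]
        generated.append(tokenizer.convert_ids_to_tokens(best_id))
    return generated
```

Note that, because every step re-encodes the whole sequence $I_s$ with self-attention, the new tag is conditioned on all context features and all previously generated tags at once, which is exactly the early-fusion property argued for above.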

Table 1: Precision@K and Recall@K evaluation on baseline models and our model.

4.1.2 Metrics

We conduct our evaluation with precision-at-k (P@K) and recall-at-k (R@K), both of which are widely used for recommendation tasks. With k at 1, 3, 5 and 10, we calculate the scores for the given metrics to compare our model performance against several baselines.

$P@K = \frac{1}{N} \sum \frac{|\text{Ranked top-}K \cap \text{Ground Truth}|}{K}$

$R@K = \frac{1}{N} \sum \frac{|\text{Ranked top-}K \cap \text{Ground Truth}|}{|\text{Ground Truth}|}$
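Both metrics follow directly from the definitions above; the short sketch below uses our own function and variable names.

```python
def precision_recall_at_k(ranked_lists, ground_truths, k):
    """Compute P@K and R@K averaged over N posts, as defined above.

    ranked_lists: list of per-post tag lists, best-first.
    ground_truths: list of per-post non-empty sets of gold hashtags.
    """
    n = len(ranked_lists)
    p_at_k = r_at_k = 0.0
    for ranked, gold in zip(ranked_lists, ground_truths):
        hits = len(set(ranked[:k]) & set(gold))  # |top-K ∩ Ground Truth|
        p_at_k += hits / k
        r_at_k += hits / len(gold)
    return p_at_k / n, r_at_k / n

# Example: one post with gold tags {#park, #fall, #nature}.
p, r = precision_recall_at_k([["#park", "#sun", "#fall"]],
                             [{"#park", "#fall", "#nature"}], k=3)
print(p, r)  # 0.666..., 0.666...
```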

4.1.3 Baselines

To validate our hypotheses and conduct a comparative analysis of our recurrent BERT hashtag generation model, we design and evaluate the following baselines:

• Frequency-Based. We generate the most frequent tags regardless of the given context.

• Context-Tag Mapping (Ranking). Conventional tag recommendation models (Weston et al., 2014; Denton et al., 2015; Wu et al., 2018b; Yang et al., 2020a) project input embeddings and tag embeddings onto the same representation space and learn with a pairwise ranking loss. Building upon this scheme, this generalized context-tag mapping model merges encoded features to represent the context. At inference time, it selects the top-k tags nearest to the context embedding in the jointly projected space. It uses a CNN for image and an LSTM for text encoding, and is trained with a triplet loss that pulls positive tags closer to the context and pushes negative tags away from it at the same time.

• BERT-based Ranking (EF vs. LF). This model sends only the context features to BERT and produces a top-k list of hashtags based on the [CLS] representation. To investigate the difference between the early fusion (EF) strategy proposed by our model and the commonly used late fusion (LF), we design two different versions of this model. The EF version takes the input $C$ (Eq. 13) whereas the LF version passes each feature separately through the BERT to independently encode each feature.

• BERT-based AR (1-to-1 vs. 1-to-M). For the autoregressive (AR) tag generation model, we employ the Transformer architecture. For the sake of fairness, we use BERT as the encoder and a single-layer decoder from the Transformer to match the number of parameters with the other BERT-based baselines. This model is divided into 1-to-1 and 1-to-M approaches.

4.1.4 Implementation details

Our implementations of the BERT-based models are based on the transformers library (Wolf et al., 2019) and we use an NVIDIA V100 GPU for training. With the batch size set to 8, the hidden size to 768 and the learning rate to 5e-5, we use the Adam optimizer and a seed equal to 42. The maximum input sequence length of our model is set to 384.
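As a rough sketch of how the setup in 4.1.4 translates into code: only the stated hyperparameters (batch size 8, hidden size 768, learning rate 5e-5, Adam, seed 42, maximum length 384) come from the paper; the checkpoint name and the masked-position cross-entropy wiring are our assumptions.

```python
# A sketch of the training configuration in Section 4.1.4. The loss wiring is
# an assumption: cross-entropy at the [MASK] position against the gold hashtag.
import random
import torch
from transformers import BertForMaskedLM

torch.manual_seed(42)
random.seed(42)

model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # hidden size 768
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

MAX_LEN, BATCH_SIZE = 384, 8

def training_step(batch):
    """batch: input_ids/attention_mask of shape (8, 384), plus labels that are
    -100 everywhere except the [MASK] position, which holds the gold tag id."""
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])  # cross-entropy, -100 is ignored
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```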

4.2 Lessons from the Comparisons

Table 1 presents P@K and R@K scores of our model against the baselines, showing very clearly that our model outperforms the baseline models by significant margins. The most salient outcome of the experiment is that the particular generative approach in our proposed model is much superior to the ranking and the auto-regressive generation approaches. Still another significant result is that the early fusion strategy is superior to late fusion for both ranking and generative approaches. Further analyses follow:

• Context-Tag Mapping vs. BERT-based Ranking (LF). This comparison is intended to ensure that even for late fusion without deep interaction among the features, the language modeling capabilities of BERT contribute significantly to the hashtag recommendation problem, compared to the joint space approach.

• EF vs. LF. We assess the early fusion and late fusion approaches for inter-context feature modeling by comparing the BERT-based ranking (LF) and BERT-based ranking (EF) baselines. In the LF approach, the model separately encodes each feature (image, location, time and text) with a shared-parameter BERT and averages over them to form a single, aggregated context representation. On the other hand, the EF approach jointly feeds the context features in a single step in ranking the top-k hashtags based on the fused context information. As the results between these two models imply, early fusion provides an extensive and more comprehensive view of the given features, unlike the delayed aggregation of separately modeled context representations.

• Ranking vs. Generation. The experimental result shows that casting recommendation as generation using recurrent BERT has a significant benefit in performance. This performance gain is attributed to the generation aspect that considers the dependency among the generated tags. Note that the BERT-based AR models are weaker than the ranking model, which reinforces the importance of the particular way the proposed model handles generation. The BERT-based AR models are directly influenced by the immediately preceding input token (i.e., hashtag). This property is suitable for sequentially dependent natural language generation but keeps the AR models from using the entire context including the generated hashtags, which causes their weaker performance than the ranking model. On the contrary, our model uses a [MASK] token representation which learns to aggregate the entire contextual and previous tag information in generating the output hashtags. The result validates the effectiveness of our generation model specifically suited for the hashtag recommendation task.

• 1-to-1 vs. 1-to-M. For the BERT-based AR model and our model, we see 1-to-M models outperform 1-to-1 models by a significant gap, which shows the effectiveness of our approach under the orderlessness assumption. To test orderlessness, in addition, we shuffled and reversed the order of hashtags within the posts and trained our model under the same setting. There was no meaningful gap in performance from the original result, validating our assumption.

4.3 Ablation Study

We also conduct an ablation study to see how each feature contributes to the model performance and the hashtag recommendation. As can be seen in Table 2, every evaluation score decreases when we remove one of the input features, implying all of the features contribute to the task and the model. It shows that Text is the most important feature, probably because it comes directly from the users and is most native to the BERT language model. On the other hand, the location and time features appear to be less important because they are secondary descriptions derived from the original descriptor. Usually the text form of location is too specific and diverse for the model to capture the patterns. We leave the issue of extracting more innate representations for those features for future work.

Model         P@1     P@3     P@5     P@10    R@1     R@3     R@5     R@10
All features  0.5185  0.3015  0.2203  0.1390  0.2229  0.3238  0.3662  0.4263
w/o Image     0.4057  0.2169  0.1572  0.0992  0.1744  0.2350  0.2629  0.3026
w/o Location  0.4979  0.2851  0.2085  0.1299  0.2129  0.3061  0.3464  0.3990
w/o Time      0.4323  0.2596  0.1927  0.1229  0.1849  0.2769  0.3164  0.3714
w/o Text      0.3724  0.2048  0.1474  0.0913  0.1573  0.2182  0.2435  0.2792

Table 2: Ablation over the feature types using our model. Only one feature is removed at a time.

4.4 Attention Analysis

We examine the level of interactions among the feature types, the context-to-tag correlation, and the importance of the generated tags as well as the feature types in the [MASK] token representation. Table 3 shows the averaged attention scores for each feature over the test set, which indicate how much each feature attends to the other features (row-wise). It is evident that the features interact with one another quite actively, as expected. We note that a major portion of the attention from [MASK] (54.01%) is on the previous tags, witnessing that the dependency among tags manifests itself in our model. In Figure 3 (a) and (b), which illustrate how much the [MASK] token attends to other tokens, we observe that it largely focuses on the previous hashtags. This shows that the generated hashtags play a critical role in generating a related tag, which is in accordance with our claim.

          Image    Location  Time     Text     Tag
Image     39.74%   13.65%    22.33%   9.76%    14.52%
Location  13.48%   47.01%    19.37%   8.80%    11.34%
Time      11.07%   11.61%    51.67%   10.77%   14.88%
Text      12.39%   10.37%    16.71%   39.19%   21.34%
Tag       9.29%    6.72%     10.27%   10.41%   63.31%
[MASK]    14.27%   9.28%     11.58%   10.86%   54.01%

Table 3: Interactions among features.

Figure 3: Visualization of attention scores for encoding a [MASK] token to be used for generating the next hashtag.
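A sketch of how the feature-to-feature scores in Table 3 could be computed from BERT's attention maps follows; the span bookkeeping and the choice to average over all layers and heads are our assumptions, not a procedure documented in the paper.

```python
# A sketch of the feature-to-feature attention aggregation behind Table 3,
# assuming the input layout of Eq. 14 and known token spans for each feature.
import torch

def feature_attention_table(attentions, spans):
    """attentions: tuple of (1, heads, seq, seq) tensors, one per layer
    (e.g. from model(..., output_attentions=True)).
    spans: dict mapping feature name -> (start, end) token index range,
    e.g. {"img": (1, 50), "loc": (50, 60), ..., "tag": (200, 210)}.
    Returns row-normalized attention mass from each feature to every other."""
    # Average attention over all layers (dim 0) and heads (dim 2): (seq, seq).
    attn = torch.stack(attentions).mean(dim=(0, 2))[0]
    names = list(spans)
    table = {}
    for src in names:
        s0, s1 = spans[src]
        row = {}
        for dst in names:
            d0, d1 = spans[dst]
            # Total attention mass flowing from src tokens to dst tokens.
            row[dst] = attn[s0:s1, d0:d1].sum().item()
        total = sum(row.values())
        table[src] = {dst: mass / total for dst, mass in row.items()}
    return table
```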

4.5 Greedy vs. Beam Search

As a way to confirm our hypothesis that our proposed model is more amenable to hashtag generation than the generation models developed for Natural Language Generation (NLG), we compare the beam search and greedy search adopted for our model. We apply beam search with different width settings in generating the tag sequence. In Figure 4, we observe a substantial performance drop in our model when we applied beam search to tag generation. This result can be explained in terms of the characteristic differences between a natural language sequence and a tag sequence. In an NLG task, making a sentence natural and understandable overall is important. Thus, beam search, which maximizes the probability of the entire sequence (sentence), improves the NLG performance. However, tag generation inherently prefers greedy search because it requires increasing the number of relevant hashtags from the early steps of generation, without having to consider complex syntactic and semantic constraints. This outcome substantiates our orderlessness assumption.

Figure 4: Beam search results with different beam widths (B = 1, i.e. greedy search, and B = 3, 5, 10; P@1 and R@1).
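To make the greedy-versus-beam contrast concrete, a minimal beam decoder over tag sequences might look like the sketch below, where `score_next` is a hypothetical helper that scores candidate next tags given the fixed context and the tags generated so far; setting `beam_width=1` recovers the greedy search our model uses.

```python
# A minimal sketch of beam search over tag sequences, under the stated
# assumption that score_next(prefix) returns {tag: log probability} for the
# current input I_s with the context C fixed inside it.
def beam_search_tags(score_next, num_tags, beam_width=3):
    """Keep the `beam_width` highest-scoring tag sequences at each step."""
    beams = [([], 0.0)]  # (tag prefix, cumulative log probability)
    for _ in range(num_tags):
        candidates = []
        for prefix, logp in beams:
            for tag, tag_logp in score_next(prefix).items():
                candidates.append((prefix + [tag], logp + tag_logp))
        # Maximize whole-sequence probability; beam_width = 1 is greedy search.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```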

4.6 Qualitative Analysis

Aside from the extensive quantitative analysis, we show a few test cases where we can examine the way different models generate lists of hashtags given the posts in Figure 5. The predicted tags in red are the correct ones that actually exist for the post. For the top post, we see that the BERT-based Ranking model produces relevant hashtags like #disney but fails to predict others like #fear and #anger, which require considering inter-tag dependency (Joy, Fear and Anger are characters from the Disney movie, Inside Out). However, our model successfully generates those hashtags because it can make use of information from previously generated hashtags. For the bottom post, the BERT-based AR model fails to generate any of the gold hashtags because of its autoregressive property that heavily relies on the immediately preceding hashtag instead of seeing the entire context. Since the model initially produced incorrect hashtags (#disgust, #insideout), the autoregressive characteristic propagates the erroneous tag information through the subsequent generation steps. In contrast, our model generates many of the correct tags successfully, showing the benefit of its direct view of the contextual information.

Figure 5: Qualitative analysis for comparing generated tags by different models (hashtag lists produced by the BERT-based Ranking, BERT-based AR, and our model for two example posts).

5 Conclusion

We introduce a novel hashtag feedback/generation approach that explicitly considers the dependency among the tags, while accounting for the characteristic difference between tag generation and language generation. To exploit the rich information among the assorted features, we adopt an early fusion approach by converting distinct features to the same modality (i.e., text) and leverage BERT's language modeling capabilities. Through an extensive analysis over the baseline models, we show significant improvements. We also show that the 1-to-M generation approach under the orderlessness assumption is more appropriate for the tag recommendation task. For future work, we plan to evaluate our model on other benchmarks of a similar kind to test the robustness of our tag generation approach. Our generative framework is also expected to generalize to other recommendation tasks that deal with items with inter-dependency.

References

Ying-Hong Chan and Yao-Chung Fan. 2019. A recurrent BERT-based model for question generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 154–162, Hong Kong, China. Association for Computational Linguistics.

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198.

Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2015. User conditional hashtag prediction for images. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1731–1740.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Zhuoye Ding, Qi Zhang, and Xuan-Jing Huang. 2012. Automatic hashtag recommendation for microblogs using topic-specific translation model. In Proceedings of COLING 2012: Posters, pages 265–274.

Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web, pages 593–596.

Yeyun Gong, Qi Zhang, and Xuanjing Huang. 2018. Hashtag recommendation for multimodal microblog posts. Neurocomputing, 272:170–177.

Yuyun Gong and Qi Zhang. 2016. Hashtag recommendation using attention-based convolutional neural network. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2782–2788. AAAI Press.

Mohadeseh Kaviani and Hossein Rahmani. 2020. EmHash: Hashtag recommendation using neural network based on BERT embedding. In 2020 6th International Conference on Web Research (ICWR), pages 113–118. IEEE.

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, and Rui Fang. 2016. Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 2085–2088.

Yang Li, Ting Liu, Jingwen Hu, and Jing Jiang. 2019. Topical co-attention networks for hashtag recommendation on microblogs. Neurocomputing, 331:356–365.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Minseok Park, Hanxiang Li, and Junmo Kim. 2016. HARRISON: A benchmark on hashtag recommendation for real-world images in social networks. arXiv preprint arXiv:1605.05054.

Surendra Sedhai and Aixin Sun. 2014. Hashtag recommendation for hyperlinked tweets. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 831–834.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yue Wang, Jing Li, Irwin King, Michael R Lyu, and Shuming Shi. 2019. Microblog hashtag generation via encoding conversation contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1624–1633.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #TagSpace: Semantic embeddings from hashtags. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1822–1827, Doha, Qatar. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Gaosheng Wu, Yuhua Li, Wenjin Yan, Ruixuan Li, Xiwu Gu, and Qi Yang. 2018a. Hashtag recommendation with attention-based neural image hashtagging network. In International Conference on Neural Information Processing, pages 52–63. Springer.

Ledell Yu Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2018b. StarSpace: Embed all the things! In AAAI.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA. Association for Computing Machinery.

Chao Yang, Xiaochan Wang, and Bin Jiang. 2020a. Sentiment enhanced multi-modal hashtag recommendation for micro-videos. IEEE Access, 8:78252–78264.

Qi Yang, Gaosheng Wu, Yuhua Li, Ruixuan Li, Xiwu Gu, Huicai Deng, and Junzhuang Wu. 2020b. AMNN: Attention-based multimodal neural network model for hashtag recommendation. IEEE Transactions on Computational Social Systems.

Eva Zangerle, Wolfgang Gassler, and Günther Specht. 2011. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb 2011), CEUR Workshop Proceedings, volume 730, pages 67–78.

Qi Zhang, Jiawen Wang, Haoran Huang, Xuanjing Huang, and Yeyun Gong. 2017. Hashtag recommendation for multimodal microblog using co-attention network. In IJCAI, pages 3420–3426.

Suwei Zhang, Yuan Yao, Feng Xu, Hanghang Tong, Xiaohui Yan, and Jian Lu. 2019. Hashtag recommendation for photo sharing services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5805–5812.

Feng Zhao, Yajun Zhu, Hai Jin, and Laurence T Yang. 2016. A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems, 65(C):196–206.