
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Hashtag Recommendation Using Attention-Based Convolutional Neural Network

Yuyun Gong, Qi Zhang
School of Computer Science, Fudan University
Shanghai Key Laboratory of Intelligent Information Processing
825 Zhangheng Road, Shanghai, P.R. China
{yygong12, qz}@fudan.edu.cn

Abstract

Along with the increasing requirements, the hashtag recommendation task for microblogs has been receiving considerable attention in recent years. Various researchers have studied the problem from different aspects. However, most of these methods usually need handcrafted features. Motivated by the successful use of convolutional neural networks (CNNs) for many natural language processing tasks, in this paper we adopt CNNs to perform the hashtag recommendation problem. To incorporate the trigger words, whose effectiveness has been experimentally evaluated in several previous works, we propose a novel architecture with an attention mechanism. The results of experiments on data collected from a real-world microblogging service demonstrate that the proposed model outperforms state-of-the-art methods. By incorporating trigger words into the consideration, the relative improvement of the proposed method over the state-of-the-art method is around 9.4% in the F1-score.

Introduction

On social networks and microblogging services, users usually use hashtags, which are types of labels or metadata tags, to make it easier to find messages with a specific theme or content. In most microblogging services, users can place the hash character # in front of words or unspaced phrases to create and use hashtags. Hashtags can occur anywhere in a microblog: at the beginning, middle, or end. Moreover, along with the growth of such services, microblogs have also been widely used as data sources for public opinion analyses [Bermingham and Smeaton, 2010; Jiang et al., 2011], prediction [Asur and Huberman, 2010; Bollen et al., 2011], reputation management [Pang and Lee, 2008; Otsuka et al., 2012], and many other applications [Sakaki et al., 2010; Becker et al., 2010; Guy et al., 2010; 2013]. Many approaches for various applications have also demonstrated the usefulness of hashtags, including microblog retrieval [Efron, 2010], query expansion [A.Bandyopadhyay et al., 2011], and sentiment analysis [Davidov et al., 2010; Wang et al., 2011].

However, only a portion of microblogs contain hashtags created by their authors. Hence, the task of automatically recommending hashtags for microblogs has received considerable attention in recent years. Existing works have studied discriminative models with various kinds of features and models [Heymann et al., 2008; Liu et al., 2011], collaborative filtering [Kywe et al., 2012], and generative models [Krestel et al., 2009; Ding et al., 2013; Godin et al., 2013] based on textual and social information. Most of these methods are commonly based on lexical-level features, including the bag-of-words (BoW) model and exquisitely designed patterns. In addition to these methods, the effectiveness of the word trigger assumption [Liu et al., 2011; Ding et al., 2013] has also been demonstrated. This means that the substance of a given sentence can be captured by some important words in it.

In recent years, the rapid development of deep neural networks with word embeddings has made it possible to perform various NLP tasks and achieve remarkable results. Among these methods, convolutional neural networks (CNNs) [LeCun et al., 1998], which were originally invented for computer vision, have shown their effectiveness for various NLP tasks, including semantic parsing [Yih et al., 2014], machine translation [Meng et al., 2015], sentence modeling [Kalchbrenner et al., 2014], and a variety of traditional NLP tasks [Dos Santos et al., 2015; Chen et al., 2015]. Instead of building hand-crafted features, these methods utilize layers of convolving filters that are applied on top of pre-trained word embeddings. Moreover, compared to standard feedforward neural networks, CNNs have far fewer parameters and are thus easier to train.

Inspired by the advantages of CNNs in processing NLP tasks, we propose to use them to solve the hashtag recommendation task. Previous methods for hashtag and keyphrase recommendation [Liu et al., 2011; Ding et al., 2013] have also demonstrated the effectiveness of the trigger word mechanism; for example, in the tweet "#ipad #iphone How to share calendar events on iPhone and iPad", the trigger words are iPhone and iPad. However, standard CNNs cannot exploit such words directly. Motivated by the attention mechanism, which has been used in speech recognition [Chorowski et al., 2014] and machine translation [Luong et al., 2015], in this work we introduce a novel CNN that incorporates an attention mechanism. The hashtag recommendation task is modelled as a classification problem. We employ an attention layer to produce a weight for each word based on its surrounding context. Then, the trigger words selected by the attention layer and the whole microblog are transformed into fixed-length vectors respectively. In the final step, a fully connected layer with softmax outputs is constructed. Through experiments using a dataset collected from real online services, we demonstrate the effectiveness of the CNNs and the attention mechanism. The CNN method achieves better performance than the state-of-the-art methods, and the proposed attention-based CNN yields an additional improvement over the standard CNN.

The main contributions of this work can be summarized as follows:

• To take advantage of deep neural networks, we adopted CNNs to perform the hashtag recommendation task.
• To incorporate a trigger word mechanism, we proposed a novel attention-based CNN architecture, which incorporates a local attention channel and a global channel.
• Experimental results using a dataset collected from a real microblogging service showed that the proposed method can achieve significantly better performance than the state-of-the-art methods.

Related Works

Hashtag recommendation

Due to the requirements of hashtag recommendation, a variety of methods have been proposed from different perspectives [Zangerle et al., 2011; Ding et al., 2013; Sedhai and Sun, 2014; Gong et al., ]. Zangerle et al. [2011] introduced a similarity-based method: first, they retrieve the set of hashtags used within the most similar messages; then, heuristic ranking methods are used to select hashtags from these candidates. Ding et al. [2013] proposed to cast the hashtag recommendation task as a translation process. They assume that the hashtags and the trigger words of a tweet are written in two different languages and carry the same meaning, and they integrate a topical translation model to perform this task.

Most of the works mentioned above are based on textual information. Besides these methods, Zhang et al. [2014] observed that when picking hashtags, users have different perspectives, which are impacted by their own interests or the global topic trend. To model these factors, they proposed a topic-model-based method that incorporates temporal and personal information. Since many tweets contain not only textual information but also hyperlinks, Sedhai and Sun [2014] studied the problem of hashtag recommendation for hyperlinked tweets.

From the brief descriptions given above, we can observe that deep neural networks have not yet been applied to hashtag suggestion tasks. In this work, we propose to use a convolutional network with an attention mechanism to achieve the hashtag recommendation task.

Attention Mechanism

In recent years, attention-based neural network architectures, which learn to focus their "attention" on specific parts of the input, have shown promising results on various tasks, such as speech recognition [Bahdanau et al., 2015b; Chorowski et al., 2015], machine translation [Bahdanau et al., 2015a], visual object classification [Mnih et al., 2014], caption generation [Xu et al., 2015], and so on.

Bahdanau et al. [2015b] proposed an attention-based recurrent sequence generator for sequence prediction and applied it to a speech recognition task at the character level; the attention mechanism scans the input sequence and chooses relevant frames. Chorowski et al. [2015] also used an attention-based RNN for a phoneme recognition task. On image classification tasks, Mnih et al. [2014] introduced a recurrent neural network model which can select a sequence of regions or locations from an image or video; only the selected regions are incorporated into further processing. Motivated by these successful usages, in this work we adopt an attention mechanism to scan input microblogs and select trigger words. We further combine both the selected words and the whole microblog to achieve the task.

The Proposed Models

In this work, we formulate the hashtag recommendation task as a multi-class classification problem. The network handles input microblogs of varying length, and each dimension of the output layer represents the probability of a hashtag being recommended. As discussed in the introduction, we also follow the trigger words assumption, which has been successfully used in previous studies [Liu et al., 2012; Ding et al., 2013]. Hence, the proposed model incorporates two channels: a local attention channel and a global channel. Figure 1 shows the architecture of the proposed model.

In the global channel, all the words are encoded, while the local attention channel only encodes a few trigger words. Common to these two types of channels is the fact that for each input microblog, both channels first operate on the input; the goal is then to derive a feature vector that captures the relevant information. However, the channels differ in how the feature vector is derived. The global channel has to attend to all the words for each tag, whereas a tag may only have relations with some trigger words; the local attention mechanism chooses to focus only on a small subset of the words for each tag, depending on gate scores. Specifically, we employ a simple convolutional layer to combine the information from both vectors as follows:

    h = \tanh(M \ast [h_g; h_l] + b),    (1)

where h_g is the feature vector extracted from the global channel, h_l is the feature vector of the local attention channel, M is a filter matrix for the convolutional operation, and b is a bias. Each filter produces one feature; we use multiple filters to produce a feature vector.

In our model, we use a CNN to encode a sentence for the global channel, whereas we construct a local attention network for the local attention channel. The parameters of the global and local attention channels are learned jointly with our final objective function instead of being trained separately.

Figure 1: The architecture of the attention-based Convolutional Neural Network

Local Attention Channel

In the local attention channel, we consider the attention problem as a decision process. Given an input microblog m, we take the embedding w_i \in R^d of each word in the microblog to obtain the first layer, where d is the dimension of the word vectors. A microblog of length n is represented as w_{1:n}, the concatenation of the words w_1, w_2, ..., w_n. In general, let w_{i:i+j} refer to the concatenation of the words w_i, w_{i+1}, ..., w_{i+j}.

After converting words into embeddings, the next step is the local attention layer. Given a threshold value \eta and an input microblog m, the attention layer generates a sequence of trigger words (w_i, ..., w_j), each extracted from a small window. In general, given an attention window size h, we define M^l \in R^{h \times d} to be the parameter matrix. At the i-th step, the local attention layer generates the score of the i-th word in the microblog by focusing on the words in the window. The score of the central word of the window is obtained as follows:

    s_{(2i+h-1)/2} = g(M^l \ast w_{i:i+h} + b),    (2)

where s_{(2i+h-1)/2} is the score of the word w_{(2i+h-1)/2}, b is a bias, and g is a non-linear function. We extract words depending on their scores: if the score of a word is greater than the threshold \eta, it is extracted as a trigger word. The operation is defined as follows:

    \hat{w}_i = \begin{cases} w_i & \text{if } s_i > \eta \\ 0 & \text{otherwise,} \end{cases}    (3)

    \eta = \lambda \max\{s\} + (1 - \lambda) \min\{s\},    (4)

where s is the sequence of scores of the words in the microblog, \min\{s\} is the minimum score, and \max\{s\} is the maximum score. \lambda \in [0, 1] is a parameter that balances the minimum and the maximum.

The local attention layer makes it possible to extract the most important words in the microblog. We apply the local attention layer as the first layer of the attention channel, which ensures that the following layers only operate on the trigger words.

After the local attention layer, the extracted trigger words are fed into the folding layer, which can operate on varying numbers of words. The folding layer abstracts the features of the trigger words by the following operation:

    z = g(M^l \ast \mathrm{folding}(\hat{w}) + b),    (5)

where \hat{w} are the trigger words, M^l \in R^{d \times r} is the parameter matrix, d is the dimension of the word vectors, r is the dimension of the output vector, b \in R^r is a bias, and g is a non-linear function. folding is the sum operation over each dimension of all the trigger words, f_i = \sum_j \hat{w}_{j,i}, where \hat{w}_{j,i} is the value in the i-th position of the embedding of the j-th trigger word.

The final output of the local attention channel is a fixed-length vector, which represents the embedding of the trigger words \hat{w}.
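The following numpy sketch walks through Eqs. (2)-(4) and the folding step. Note two assumptions on our part: we read M^l \ast w_{i:i+h} as an inner product that contracts the whole window to a scalar score, and we take the threshold of Eq. (4) to be \eta = \lambda max{s} + (1-\lambda) min{s}, reconstructed from the surrounding text; treat this as a sketch rather than the authors' exact implementation.

```python
import numpy as np

def local_attention_channel(W, M_l, b, lam=0.8):
    """Local attention channel sketch.
    W: (n, d) embeddings of one microblog; M_l: (h, d) attention matrix.
    Scores each central word from its window (Eq. 2), keeps words whose
    score exceeds eta (Eqs. 3-4), and folds the trigger embeddings by
    summing each dimension (the folding step inside Eq. 5)."""
    n, d = W.shape
    h = M_l.shape[0]
    pad = h // 2
    Wp = np.vstack([np.zeros((pad, d)), W, np.zeros((pad, d))])  # zero-pad ends
    # Eq. (2): score of the central word of each window (inner-product reading)
    s = np.array([np.tanh(np.sum(M_l * Wp[i:i + h]) + b) for i in range(n)])
    eta = lam * s.max() + (1 - lam) * s.min()    # Eq. (4), assumed form
    triggers = W[s > eta]                        # Eq. (3): keep trigger words
    return triggers.sum(axis=0)                  # folding: per-dimension sum

# illustrative usage: 20 words with 100-dim embeddings, attention size 5
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 100))
M_l = rng.normal(size=(5, 100)) * 0.01
f = local_attention_channel(W, M_l, b=0.0)       # shape (100,)
```

One consequence of this threshold form is that it adapts to each microblog: as long as the scores are not all equal, the highest-scoring word always survives as a trigger word.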

Global Channel

In the global channel, all the words of the microblog are encoded by a convolutional network. A filter M^g is applied to a window of l words, so that a feature z_i is calculated as follows:

    z_i = g(M^g \ast w_{i:i+l-1} + b),    (6)

where g is a non-linear function and b \in R is a bias term. We apply this filter to all the windows of words in the microblog, \{w_{1:l}, w_{2:l+1}, ..., w_{n-l+1:n}\}, to produce a feature map:

    z = [z_1, z_2, ..., z_{n-l+1}].    (7)

In the pooling layer, a max-over-time pooling operation is applied over the feature map z to produce one feature for this filter. Using this operation, we can extract the most important feature of each feature map, and it lets the model deal with microblogs of different lengths.

From the process described above, we can see that one feature is extracted by one filter. To obtain multiple features, we use multiple filters with varying window sizes in the model. After the max-over-time pooling operation, a bias and the non-linear function tanh are applied to the pooled matrix.

The final output of our global channel is also a fixed-length vector, which represents the embedding of the input microblog m.

After the local attention channel and the global channel, we use a convolutional layer with multiple feature maps to combine the outputs of the two channels. We regard the output of this convolutional layer as the embedding of the microblog produced by our deep neural network.
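Here is a minimal sketch of the global channel under the same assumptions as above (one scalar feature per filter, each window contracted by an inner product). The paper also applies a bias and tanh after pooling, which this sketch folds into the per-window non-linearity for brevity; names and dimensions are illustrative.

```python
import numpy as np

def global_channel(W, filters, b=0.0):
    """Global channel sketch.
    W: (n, d) word embeddings; filters: list of (l, d) matrices M_g.
    For each filter, z_i = tanh(M_g . w_{i:i+l-1} + b) over every window
    (Eqs. 6-7); max-over-time pooling then keeps one feature per filter,
    making the output length independent of the microblog length."""
    n, _ = W.shape
    feats = []
    for M_g in filters:
        l = M_g.shape[0]
        z = [np.tanh(np.sum(M_g * W[i:i + l]) + b) for i in range(n - l + 1)]
        feats.append(max(z))             # max-over-time pooling
    return np.array(feats)               # fixed-length vector h_g

# illustrative usage: one filter each for window sizes 1, 2 and 3
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 100))
filters = [rng.normal(size=(l, 100)) * 0.01 for l in (1, 2, 3)]
h_g = global_channel(W, filters)         # shape (3,)
```

In the configuration used in the experiments below there would be 100 such filters for each window size in {1, 2, 3}, so h_g would have 300 dimensions.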
Training

In this work, we jointly learn the parameters \Theta of the local attention and global channels:

    \Theta = \{W, M^l, M^g\},    (8)

where W are the word embeddings, and M^l and M^g are the parameters of the local attention channel and the global channel respectively; the remaining parameters belong to the fully connected layer. Our training objective function is formulated as follows:

    J = \sum_{(m,a) \in D} \log p(a \mid m),    (9)

where D is the training corpus and a is the hashtag of microblog m. To optimize the objective function, we use the robust learning method AdaDelta [Zeiler, 2012].

Hashtag Recommendation

We perform hashtag recommendation as follows. Given an unlabelled dataset, we first train our model on the training data and save the model that performs best on the validation dataset. For a microblog in the unlabelled data, we encode the microblog through the local attention channel and the global channel using the saved model. After the encoding step, we combine the features generated by both channels. Then we obtain the scores of the hashtags for the d-th microblog in the unlabelled data from the fully connected layer:

    P(y^d = a \mid h^d; \theta) = \frac{\exp(\theta^{(a)T} h^d)}{\sum_{j \in A} \exp(\theta^{(j)T} h^d)},    (10)

where \theta are the parameters, h^d is the feature vector combined from the two channels, and A is the set of candidate hashtags. According to the scores output by the fully connected layer, we can rank the hashtags for each microblog and recommend the top-ranked hashtags to users.

Experiments

Dataset and Setup

We use the microblog dataset collected by Ding et al. [2013] for evaluation. There are 110,000 microblogs in the dataset which contain hashtags annotated by users. The vocabulary of words in the dataset is 106,323, the vocabulary of hashtags is 37,224, and the average numbers of words and hashtags in each microblog are 20.45 and 1.20 respectively. The dataset has been split into a training set (100,000 microblogs) and a test set (10,000 microblogs). In our experiments, we randomly select 10% of the training set as the development set.

To evaluate the performance, we use Precision (P), Recall (R), and F-score (F1):

    P = \frac{N_r}{N_s}, \quad R = \frac{N_r}{N_m}, \quad F1 = \frac{2PR}{P + R},    (11)

where N_r is the number of hashtags recommended correctly, N_s is the total number of hashtags recommended by the system, and N_m is the total number of hashtags assigned in the corpus.

The parameters of our model include the parameters of both the global channel and the local attention channel. For the global channel, after trying different single and multiple window sizes, we empirically set the filter windows to the multiple window sizes 1, 2, 3, and we use 100 feature maps for each window size. For the local attention channel, we set the width of the attention matrix to 5 and \lambda to 0.8. For both the global and local attention channels, we use the hyperbolic tangent as the non-linear function and set the mini-batch size to 200.

We trained the word vectors on a corpus of 10 million words. The vectors have a dimensionality of 100 and were trained using the architecture proposed in [Mikolov et al., 2013]. New words are initialized randomly.
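To tie the pipeline to the evaluation protocol, here is a toy sketch of the ranking step of Eq. (10) and the corpus-level metrics of Eq. (11); the variable names and the top-k selection are ours.

```python
import numpy as np

def recommend(h, theta, k=1):
    """Eq. (10): softmax over the candidate hashtags given the combined
    feature vector h; returns the indices of the top-k hashtags.
    theta: (|A|, dim) parameters of the fully connected layer."""
    scores = theta @ h
    p = np.exp(scores - scores.max())    # stable softmax
    p /= p.sum()
    return list(np.argsort(-p)[:k])      # rank and keep the top k

def evaluate(recommended, gold):
    """Eq. (11): corpus-level precision, recall and F1.
    recommended/gold: dicts mapping microblog id -> set of hashtags."""
    n_r = sum(len(recommended[i] & gold[i]) for i in gold)  # correct ones
    n_s = sum(len(recommended[i]) for i in gold)            # all recommended
    n_m = sum(len(gold[i]) for i in gold)                   # all gold tags
    p, r = n_r / n_s, n_r / n_m
    return p, r, 2 * p * r / (p + r)

# toy usage with two microblogs
gold = {0: {"ipad", "iphone"}, 1: {"nlp"}}
rec = {0: {"ipad"}, 1: {"nlp", "cnn"}}
print(evaluate(rec, gold))   # -> (0.666..., 0.666..., 0.666...)
```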

Figure 2: Precision, Recall and F1 with the number of recommended hashtags ranging from 1 to 5

Baseline

In this section, we compare our attention-based model with several baseline models and two degenerate variants of our model. We consider the following methods:

• Naive Bayes (NB): We use NB to model hashtag recommendation as a classification task. Given the textual content of a microblog, we estimate the posterior probability of each hashtag.
• LDA: We use the LDA-based method proposed in [Krestel et al., 2009] to recommend hashtags.
• Translation model (IBM-1): IBM model 1, used by [Liu et al., 2011], is directly applied to obtain the alignment probability between words and hashtags.
• TopicWA: TopicWA is a topical word alignment model proposed in [Ding et al., 2012]. In this method, standard LDA is employed to discover the latent topics, combined with the alignment method for this task.
• TTM: TTM was proposed by [Ding et al., 2013] for hashtag recommendation.
• CNN: CNN was proposed by [Kim, 2014] for sentence classification. We use the published code to accomplish the hashtag recommendation task.
• Attention-5: Attention-5 is the local attention channel of our model with the window size equal to 5.

The first method, NB, is a traditional method. LDA, IBM-1, TopicWA, and TTM are methods based on topic models and translation models, which have attracted a lot of attention in recent years. CNN and Attention-5 are variants of our method.

Results and Discussion

Table 1: Evaluation results of different methods on the evaluation collection.

    Methods            Precision   Recall   F1
    NB                 0.217       0.197    0.203
    LDA                0.064       0.060    0.062
    IBM1               0.236       0.214    0.220
    TopicWA            0.310       0.285    0.292
    TTM                0.382       0.357    0.364
    CNN                0.416       0.338    0.373
    Attention-5        0.410       0.335    0.369
    CNN+Attention-5    0.443       0.362    0.398

Table 1 shows comparisons of the results of the proposed method with those of the state-of-the-art discriminative and generative methods on the evaluation dataset. "CNN+Attention-5" denotes the method proposed in this paper; "CNN" and "Attention-5" represent the convolutional neural network and the attention model, respectively. The results show that the proposed method is significantly better than the other methods. "TTM", "TopicWA", and "IBM1" are the models based on trigger words. From the results of "TTM", "TopicWA", "IBM1", and "NB", we can observe that the trigger word methods improve the performance of the hashtag recommendation task. A comparison of the "CNN" and "TTM" results shows that "CNN" achieves a significantly better F1 than "TTM", demonstrating that the neural network can achieve better performance on this task. From the results of "Attention-5", "TTM", "TopicWA", and "IBM1", we can observe that our local attention channel improves the performance of the trigger word approach. From the results of CNN and CNN+Attention-5, we observe that the attention model benefits the task, and that the multiple channels achieve better performance than a single channel. The relative improvement of the proposed CNN+Attention-5 over TTM is around 9.4% in the F1 score, which demonstrates the effectiveness of our method.

Figure 2 shows the precision, recall, and F1 curves of NB, LDA, IBM1, TopicWA, TTM, and CNN+Attention-5 on the test data. Each point of a curve represents the extraction of a different number of hashtags, ranging from 1 to 5. The highest curve in each graph indicates the best performance. Based on the results, we can observe that CNN+Attention-5 is the highest of all the curves. This indicates that the proposed method was significantly better than the other methods, and that we obtained the highest F1 score when recommending the top 1 hashtag for each microblog.

Table 2 lists the static and non-static results. These results show that the non-static models achieve better results across the board. The non-static method tunes the word vectors more specifically to the task at hand, and for randomly initialized tokens that are not in the set of pre-trained vectors, we can learn more meaningful representations through fine-tuning.

Table 2: Evaluation results of static and non-static models.

                        Non-static                     Static
    Methods             Precision   Recall   F1        Precision   Recall   F1
    CNN                 0.416       0.338    0.373     0.412       0.337    0.371
    Attention-5         0.410       0.335    0.369     0.382       0.313    0.344
    CNN+Attention-5     0.443       0.362    0.398     0.425       0.350    0.384

Parameter sensitivity analysis

From the description of the proposed model, we know that it has several hyperparameters. To evaluate their impact, we examined two crucial ones: the window size l of the global channel and the attention size h.

Table 3 lists the results of different window sizes for the global channel.

Table 3: Performance with various window sizes of the global channel.

    window sizes    Precision   Recall   F1
    1               0.422       0.346    0.380
    2               0.428       0.351    0.386
    3               0.425       0.348    0.383
    1-2             0.440       0.361    0.397
    1-2-3           0.443       0.362    0.398

We modeled the global channel with single window sizes of 1, 2, and 3. To show the performance of the model with multiple window sizes, we used the combinations (1,2) and (1,2,3), and we obtained the best performance with the window sizes (1,2,3). From the results of window sizes 1, 2, and 3, we observe that the best single window size is 2 for hashtag recommendation. The reason is that when the window size equals 1, the convolutional operation extracts unigram information while ignoring the context; the results for window sizes 2 and 3 show that bigram information is more important than trigram information for our task. From the results of the multiple window sizes, we can observe that they achieve better performance, which demonstrates the advantage of models with multiple window sizes over single-window-size models. Comparing the results of the window sizes (1,2) and (1,2,3), we can observe that the performance of the two multiple window sizes was similar, so the multiple window sizes can be chosen conveniently.

Table 4 lists the comparisons of different attention sizes on the constructed evaluation dataset.

Table 4: Evaluation results of different attention sizes.

    Methods            Precision   Recall   F1
    Attention-1        0.326       0.267    0.294
    Attention-3        0.406       0.332    0.365
    Attention-5        0.410       0.335    0.369
    Attention-7        0.405       0.331    0.364
    CNN+Attention-1    0.395       0.323    0.355
    CNN+Attention-3    0.435       0.356    0.392
    CNN+Attention-5    0.443       0.362    0.398
    CNN+Attention-7    0.438       0.358    0.394

"Attention-h" represents the methods that only consider the local attention channel. "CNN+Attention-h" represents the proposed methods with the multiple window sizes (1,2,3) in the global channel, where h ∈ {1, 3, 5, 7} is the attention size. Based on the results, we can see that the performance increases with the number of words considered, and we obtain the best performance when the attention size is set to 5. "Attention-1" achieved poor performance because, in this model, the importance of the central word depends only on the word itself and ignores the context. From the results of "Attention-5" and "Attention-7", we can observe that when the distance between a word and the central word is greater than 2, the influence of that word on the central word can be ignored.

Conclusions

In this paper, we investigated a novel attention-based CNN for performing the hashtag recommendation task. We adopted the architecture of CNNs to avoid hand-crafted features and to take advantage of the other benefits of CNNs. To incorporate a trigger word mechanism, we proposed a novel attention-based CNN architecture, which consists of a local attention channel and a global channel. Experimental results on data collected from a real-world microblogging service demonstrated that the proposed method outperforms the methods which take only global or local information into consideration, as well as the state-of-the-art methods.

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by the National Natural Science Foundation of China (No. 61532011, 61473092, and 61472088) and the National High Technology Research and Development Program of China (No. 2015AA015408).

References

[A.Bandyopadhyay et al., 2011] A. Bandyopadhyay, M. Mitra, and P. Majumder. Query expansion for microblog retrieval. In Proceedings of TREC, 2011.

[Asur and Huberman, 2010] S. Asur and B. A. Huberman. Predicting the future with social media. In WI-IAT'10, volume 1, pages 492–499, 2010.

[Bahdanau et al., 2015a] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

[Bahdanau et al., 2015b] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. arXiv preprint arXiv:1508.04395, 2015.

[Becker et al., 2010] Hila Becker, Mor Naaman, and Luis Gravano. Learning similarity metrics for event identification in social media. In Proceedings of WSDM '10, 2010.

[Bermingham and Smeaton, 2010] Adam Bermingham and Alan F. Smeaton. Classifying sentiment in microblogs: is brevity an advantage? In Proceedings of CIKM '10, 2010.

[Bollen et al., 2011] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[Chen et al., 2015] Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of ACL-IJCNLP, 2015.

[Chorowski et al., 2014] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014.

[Chorowski et al., 2015] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503, 2015.

[Davidov et al., 2010] Dmitry Davidov, Oren Tsur, and Ari Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In COLING, 2010.

[Ding et al., 2012] Zhuoye Ding, Qi Zhang, and Xuanjing Huang. Automatic hashtag recommendation for microblogs using topic-specific translation model. In Proceedings of COLING, 2012.

[Ding et al., 2013] Zhuoye Ding, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. Learning topical translation model for microblog hashtag suggestion. In Proceedings of IJCAI, 2013.

[Dos Santos et al., 2015] Cicero Dos Santos, Bing Xiang, and Bowen Zhou. Classifying relations by ranking with convolutional neural networks. In Proceedings of ACL-IJCNLP, 2015.

[Efron, 2010] Miles Efron. Hashtag retrieval in a microblogging environment. In Proceedings of SIGIR '10, 2010.

[Godin et al., 2013] Frédéric Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. Using topic models for twitter hashtag recommendation. In Proceedings of WWW '13 Companion, 2013.

[Gong et al., ] Yeyun Gong, Qi Zhang, and Xuanjing Huang. Hashtag recommendation using dirichlet process mixture models incorporating types of hashtags.

[Guy et al., 2010] Ido Guy, Naama Zwerdling, Inbal Ronen, David Carmel, and Erel Uziel. Social media recommendation based on people and tags. In Proceedings of SIGIR, 2010.

[Guy et al., 2013] Ido Guy, Uri Avraham, David Carmel, Sigalit Ur, Michal Jacovi, and Inbal Ronen. Mining expertise and interests from social media. In Proceedings of WWW, 2013.

[Heymann et al., 2008] Paul Heymann, Daniel Ramage, and Hector Garcia-Molina. Social tag prediction. In SIGIR, 2008.

[Jiang et al., 2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In Proceedings of ACL, 2011.

[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[Krestel et al., 2009] Ralf Krestel, Peter Fankhauser, and Wolfgang Nejdl. Latent dirichlet allocation for tag recommendation. In Proceedings of RecSys '09, 2009.

[Kywe et al., 2012] Su Mon Kywe, Tuan-Anh Hoang, Ee-Peng Lim, and Feida Zhu. On recommending hashtags in twitter networks. In Social Informatics, pages 337–350. Springer, 2012.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Liu et al., 2011] Zhiyuan Liu, Xinxiong Chen, and Maosong Sun. A simple word trigger method for social tag suggestion. In Proceedings of EMNLP, 2011.

[Liu et al., 2012] Zhiyuan Liu, Chen Liang, and Maosong Sun. Topical word trigger model for keyphrase extraction. In Proceedings of COLING, 2012.

[Luong et al., 2015] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, 2015.

[Meng et al., 2015] Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. Encoding source language with convolutional neural network for machine translation. In Proceedings of ACL-IJCNLP, 2015.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[Mnih et al., 2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In NIPS, 2014.

[Otsuka et al., 2012] Takanobu Otsuka, Takuya Yoshimura, and Takayuki Ito. Evaluation of the reputation network using realistic distance between data. In WI-IAT, 2012.

[Pang and Lee, 2008] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, January 2008.

[Sakaki et al., 2010] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW '10, 2010.

[Sedhai and Sun, 2014] Surendra Sedhai and Aixin Sun. Hashtag recommendation for hyperlinked tweets. In Proceedings of SIGIR, 2014.

[Wang et al., 2011] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach. In Proceedings of CIKM '11, 2011.

[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

[Yih et al., 2014] Wen-tau Yih, Xiaodong He, and Christopher Meek. Semantic parsing for single-relation question answering. In Proceedings of ACL, 2014.

[Zangerle et al., 2011] Eva Zangerle, Wolfgang Gassler, and Günther Specht. Recommending #-tags in twitter. In Proceedings of SASWeb, 2011.

[Zeiler, 2012] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[Zhang et al., 2014] Qi Zhang, Yeyun Gong, Xuyang Sun, and Xuanjing Huang. Time-aware personalized hashtag recommendation on social media. In Proceedings of COLING, Dublin, Ireland, 2014.
