
#TAGSPACE: Semantic Embeddings from Hashtags

Jason Weston, Sumit Chopra, Keith Adams
Facebook AI Research
[email protected], [email protected], [email protected]

Abstract

We describe a convolutional neural network that learns feature representations for short textual posts using hashtags as a supervised signal. The proposed approach is trained on up to 5.5 billion words, predicting 100,000 possible hashtags. As well as strong performance on the hashtag prediction task itself, we show that its learned representation of text (ignoring the hashtag labels) is useful for other tasks as well. To that end, we present results on a document recommendation task, where it also outperforms a number of baselines.

1 Introduction

Hashtags (single tokens often composed of natural language n-grams or abbreviations, prefixed with the character '#') are ubiquitous on social networking services, particularly in short textual documents (a.k.a. posts). Authors use hashtags to diverse ends, many of which can be seen as labels for classical NLP tasks: disambiguation (chips #futurism vs. chips #junkfood); identification of named entities (#sf49ers); sentiment (#dislike); and topic annotation (#yoga). Hashtag prediction is the task of mapping text to its accompanying hashtags. In this work we propose a novel model for hashtag prediction, and show that this task is also a useful surrogate for learning good representations of text.

Latent representations, or embeddings, are vectorial representations of words or documents, traditionally learned in an unsupervised manner over large corpora. For example, LSA (Deerwester et al., 1990) and its variants, and more recent neural-network inspired methods like those of Bengio et al. (2006), Collobert et al. (2011) and word2vec (Mikolov et al., 2013), learn word embeddings. In the word embedding paradigm, each word is represented as a vector in R^n, where n is a hyper-parameter that controls capacity. The embeddings of words comprising a text are combined using a model-dependent, possibly learned function, producing a point in the same embedding space. A similarity measure (for example, inner product) gauges the pairwise relevance of points in the embedding space.
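To make the embedding paradigm concrete, the sketch below is an illustrative toy example (not from the paper): word vectors are combined by a simple mean, and an inner product scores the relevance of two texts. The vocabulary, dimensionality, and pooling function are arbitrary choices for illustration.

```python
import numpy as np

# Toy bag-of-embeddings model, for illustration only (not the paper's method).
rng = np.random.default_rng(0)
vocab = {"yoga": 0, "gym": 1, "fitness": 2, "cars": 3}
n = 8                                    # embedding dimension (hyper-parameter)
word_vectors = rng.normal(size=(len(vocab), n))

def embed(text):
    """Combine word embeddings into a text embedding; here, a simple mean."""
    ids = [vocab[w] for w in text.split() if w in vocab]
    return word_vectors[ids].mean(axis=0)

def similarity(a, b):
    """Inner product gauges relevance between points in the embedding space."""
    return float(embed(a) @ embed(b))

print(similarity("yoga fitness", "gym fitness"))
```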
Unsupervised word embedding methods train with a reconstruction objective in which the embeddings are used to predict the original text. For example, word2vec tries to predict all the words in the document, given the embeddings of surrounding words. We argue that hashtag prediction provides a more direct form of supervision: the tags are a labeling by the author of the salient aspects of the text. Hence, predicting them may provide stronger semantic guidance than unsupervised learning alone. The abundance of hashtags in real posts provides a huge labeled dataset for learning potentially sophisticated models.

In this work we develop a convolutional network for large-scale ranking tasks, and apply it to hashtag prediction. Our model represents both words and the entire textual post as embeddings as intermediate steps. We show that our method outperforms existing unsupervised (word2vec) and supervised (WSABIE (Weston et al., 2011)) embedding methods, and other baselines, at the hashtag prediction task.

We then probe our model's generality by transferring its learned representations to the task of personalized document recommendation: for each of M users, given N previous positive interactions with documents (likes, clicks, etc.), predict the (N+1)'th document the user will positively interact with. To perform well on this task, the representation should capture the user's interest in textual content. We find that representations trained on hashtag prediction outperform representations from unsupervised learning, and that our convolutional architecture performs better than WSABIE trained on the same hashtag task.

Figure 1: #TAGSPACE convolutional network f(w, t) for scoring a (document, hashtag) pair.

2 Prior Work

Some previous work (Davidov et al., 2010; Godin et al., 2013; She and Chen, 2014) has addressed hashtag prediction. Most such work applies to much smaller sets of hashtags than the 100,000 we consider, with the notable exception of Ding et al. (2012), which uses an unsupervised method.

As mentioned in Section 1, many approaches learn unsupervised word embeddings. In our experiments we use word2vec (Mikolov et al., 2013) as a representative scalable model for unsupervised embeddings. WSABIE (Weston et al., 2011) is a supervised embedding approach that has shown promise in NLP tasks (Weston et al., 2013; Hermann et al., 2014). WSABIE is shallow, linear, and ignores word order information, and so may have less modeling power than our approach.

Convolutional neural networks (CNNs), in which shared weights are applied across the input, are popular in the vision domain and have recently been applied to semantic role labeling (Collobert et al., 2011) and parsing (Collobert, 2011). Neural networks in general have also been applied to part-of-speech tagging, chunking, named entity recognition (Collobert et al., 2011; Turian et al., 2010), and sentiment detection (Socher et al., 2013). All these tasks involve predicting a limited (2-30) number of labels. In this work we make use of CNNs, but apply them to the task of ranking a very large set of tags. We thus propose a model and training scheme that can scale to this class of problem.

3 Convolutional Embedding Model

Our model, #TAGSPACE (see Figure 1), like other word embedding models, starts by assigning a d-dimensional vector to each of the l words of an input document w_1, ..., w_l, resulting in a matrix of size l × d. This is achieved using a matrix of N × d parameters, termed the lookup-table layer (Collobert et al., 2011), where N is the vocabulary size. In this work N is 10^6, and each row of the matrix represents one of the million most frequent words in the training corpus.

A convolution layer is then applied to the l × d input matrix, which considers all successive windows of text of size K, sliding over the document from position 1 to l. This requires a further K × d × H weights and H biases to be learned. To account for words at the two boundaries of the document, we also apply a special padding vector at both ends. In our experiments K was set to 5 and H was set to 1000. After the convolutional step, a tanh nonlinearity followed by a max operation over the l × H features extracts a fixed-size (H-dimensional) global feature vector, which is independent of document size. Finally, another tanh non-linearity followed by a fully connected linear layer of size H × d is applied to represent the entire document in the original embedding space of d dimensions.

Hashtags are also represented using d-dimensional embeddings via a lookup table. We represent the top 100,000 most frequent tags. For a given document w, we then rank any given hashtag t using the scoring function:

    f(w, t) = e_conv(w) · e_lt(t)

where e_conv(w) is the embedding of the document by the CNN just described and e_lt(t) is the embedding of a candidate tag t. We can thus rank all candidate hashtags via their scores f(w, t), largest first.
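For concreteness, a minimal PyTorch sketch of this architecture is given below. It is our own illustrative reimplementation, not the authors' code: the embedding dimension d is not specified in this excerpt and is left as a free parameter, and zero padding stands in for the special padding vector described above.

```python
import torch
import torch.nn as nn

class TagSpace(nn.Module):
    # Sketch of the #TAGSPACE architecture of Section 3. Hyper-parameters follow
    # the text (N = 10^6 words, 100,000 tags, K = 5, H = 1000); d is assumed.
    def __init__(self, vocab_size=1_000_000, num_tags=100_000, d=64, K=5, H=1000):
        super().__init__()
        self.word_lookup = nn.Embedding(vocab_size, d)   # N x d lookup-table layer
        self.tag_lookup = nn.Embedding(num_tags, d)      # d-dim hashtag embeddings
        # Convolution over windows of K words: K*d*H weights and H biases.
        # Zero padding of (K-1)/2 per side approximates the paper's padding vector.
        self.conv = nn.Conv1d(d, H, kernel_size=K, padding=(K - 1) // 2)
        self.linear = nn.Linear(H, d)                    # H x d projection back to R^d

    def embed_doc(self, word_ids):
        # word_ids: (batch, l) integer word indices
        x = self.word_lookup(word_ids)                   # (batch, l, d)
        x = torch.tanh(self.conv(x.transpose(1, 2)))     # (batch, H, l)
        x = x.max(dim=2).values                          # max over positions -> (batch, H)
        return self.linear(torch.tanh(x))                # e_conv(w): (batch, d)

    def score(self, word_ids, tag_ids):
        # f(w, t) = e_conv(w) . e_lt(t)
        return (self.embed_doc(word_ids) * self.tag_lookup(tag_ids)).sum(dim=-1)
```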
To train the above scoring function, and hence the parameters of the model, we minimize a ranking loss similar to the one used in WSABIE as a training objective: for each training example, we sample a positive tag, compute f(w, t+), then sample random tags t- up to 1000 times until f(w, t-) > m + f(w, t+), where m is the margin. A gradient step is then made to optimize the pairwise hinge loss:

    L = max{0, m - f(w, t+) + f(w, t-)}

We use m = 0.1 in our experiments. This loss function is referred to as the WARP loss in (Weston et al., 2011) and is used to approximately optimize the top of the ranked list, which is useful for metrics like precision and recall@k. In particular, the search for a negative candidate tag means that more energy is spent on improving the ranking performance of positive labels already near the top of the ranked list, compared to random sampling of negatives alone, which would optimize the average rank instead. Minimizing our loss is achieved with parallel stochastic gradient descent using the hogwild algorithm (Niu et al., 2011).

The pages dataset comprises 35.3 million posts, totaling 1.6 billion words. These posts' authorial voice is a public entity, such as a business, celebrity, brand, or product. The posts in the pages dataset are presumably intended for a wider, more general audience than the posts in the people dataset. Both datasets are summarized in Table 1.

Dataset | Posts (millions) | Words (billions) | Top 4 tags
Pages   | 35.3             | 1.6              | #fitness, #beauty, #luxury, #cars
People  | 201              | 5.5              | #FacebookIs10, #love, #tbt, #happy

Table 1: Datasets used in hashtag prediction.

Both corpora comprise posts between February 1st and February 17th, 2014. Since we are not attempting a multi-language model, we use a simple trigram-based language prediction model to consider only posts whose most likely language is English.

The two datasets use hashtags very differently. The pages dataset has a fatter head, with popular tags covering more examples. The people dataset uses obscure tags more heavily. For example, the top 100 tags account for 33.9% of page tags, but only 13.1% of people tags.
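Returning to the training procedure described in Section 3, the sketch below illustrates one WARP-style update for the TagSpace sketch defined earlier. It is one possible reading of the sampling scheme, not the authors' implementation: a negative tag is accepted as soon as it makes the hinge loss positive (i.e., violates the margin), capped at 1000 draws.

```python
import torch

def warp_step(model, optimizer, word_ids, pos_tag, num_tags=100_000,
              margin=0.1, max_draws=1000):
    """One illustrative WARP-style update: draw random negative tags until one
    violates the margin, then take a gradient step on max{0, m - f(w,t+) + f(w,t-)}."""
    f_pos = model.score(word_ids, pos_tag)                # f(w, t+)
    for _ in range(max_draws):
        neg_tag = torch.randint(0, num_tags, pos_tag.shape)
        f_neg = model.score(word_ids, neg_tag)            # f(w, t-)
        loss = torch.clamp(margin - f_pos + f_neg, min=0).mean()
        if loss.item() > 0:                               # margin violated
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()
    return 0.0                                            # no violating negative found

# Hypothetical usage with the TagSpace sketch above:
# model = TagSpace()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# warp_step(model, optimizer, word_ids, pos_tag)
```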