Character-Based Neural Embeddings for Tweet Clustering
Svitlana Vakulenko (Vienna University of Economics and Business, Vienna, Austria, and MODUL Technology GmbH), [email protected]
Lyndon Nixon (MODUL Technology GmbH), [email protected]
Mihai Lupu (TU Wien), [email protected]

Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 36-44, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Abstract

In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to vocabulary explosion in word-based models and allows for seamless processing of multilingual content. Our evaluation results and code are available online (https://github.com/vendi12/tweet2vec_clustering).

1 Introduction

Our use case scenario, as part of the InVID project (http://www.invid-project.eu), originates from the needs of professional journalists responsible for reporting breaking news in a timely manner. News often appears on social media exclusively, or before it reaches the traditional news media. Social media is also responsible for the rapid propagation of inaccurate or incomplete information (rumors). It is therefore important to provide efficient tools that enable journalists to rapidly detect breaking news in social media streams (Petrovic et al., 2013).

The SNOW 2014 Data Challenge posed the task of extracting newsworthy topics from Twitter. The results of the challenge confirmed that the task is ambitious: the best result was an F-measure of 0.4. Breaking-news detection involves three subtasks: selection, clustering, and ranking of tweets. In this paper, we address tweet clustering as one of the pivotal subtasks required for effective breaking news detection on Twitter.

Traditional approaches to clustering textual documents involve the construction of a document-term matrix, which represents each document as a bag of words. These approaches also require language-specific sentence and word tokenization. Word-based approaches fall short when applied to social media data, e.g., Twitter, where many infrequent or misspelled words occur within very short documents, so the document representation matrix becomes increasingly sparse.

One way to overcome sparseness in a tweet-term matrix is to consider only the terms that appear frequently across the collection and drop all infrequent terms. This procedure effectively removes a considerable amount of information content. As a result, all tweets that do not contain any of the frequent terms receive a null-vector representation. These tweets are then ignored by the model and cannot influence clustering outcomes in subsequent time intervals, where the frequency distribution may change, which hinders the detection of emerging topics.

Artificial neural networks (ANNs) make it possible to generate dense vector representations (embeddings), which can be computed efficiently at the word level as well as the character level (dos Santos and Zadrozny, 2014; Zhang et al., 2015; Dhingra et al., 2016). The main advantage of the character-based approaches is their language independence, since they do not require any language-specific parsing.

The major contribution of our work is the evaluation of character-based neural embeddings on the tweet clustering task. We show how to employ character-based tweet embeddings for tweet clustering and demonstrate in an experimental evaluation that the proposed approach significantly outperforms the current state of the art in tweet clustering for breaking news detection.

The remainder of this paper is structured as follows: Section 2 provides an overview of related work; Section 3 describes the setup of an extensive clustering evaluation; Sections 4 and 5 report and discuss the results, respectively; and the conclusion (Section 6) summarizes our findings and directions for future work.

2 Related Work

2.1 Breaking news detection

There has been a continuous effort in recent years to design effective and efficient algorithms capable of detecting newsworthy topics in the Twitter stream (Hayashi et al., 2015; Ifrim et al., 2014; Vosecky et al., 2013; Wurzer et al., 2015). The current state-of-the-art approaches build upon the bag-of-words document model, which results in high-dimensional, sparse representations that do not scale well and are not aware of semantic similarities, such as paraphrases. The problem becomes evident in the case of tweets, which contain short texts with a long tail of infrequent slang and misspelled words. The performance of such approaches on Twitter datasets is very low, with an F-measure of up to 0.2 against annotated Wikipedia articles as reference topics (Wurzer et al., 2015) and 0.4 against a curated topic pool (Papadopoulos et al., 2014).

2.2 Neural embeddings

Artificial neural networks (ANNs) make it possible to generate dense vector representations (embeddings). Word2vec (Mikolov et al., 2013) is by far the most popular approach. It accumulates co-occurrence statistics of words, which efficiently summarize their semantics. Brigadir et al. (2014) demonstrated encouraging results using the word2vec Skip-gram model to generate event timelines from tweets. Moran et al. (2016) achieved an improvement over the state-of-the-art first story detection (FSD) results by expanding tweets with semantically related terms using word2vec.

Neural embeddings can be efficiently generated on the character level as well. They have repeatedly outperformed word-level baselines on the tasks of language modeling (Kim et al., 2016), part-of-speech tagging (dos Santos and Zadrozny, 2014), and text classification (Zhang et al., 2015). The main advantage of the character-based approach is its language independence, since it does not depend on any language-specific preprocessing.

Dhingra et al. (2016) proposed training a recurrent neural network on the task of hashtag prediction. Vosoughi et al. (2016) demonstrated improved performance of a character-based neural autoencoder on the task of paraphrase and semantic similarity detection in tweets.

Our work extends the evaluation of the Tweet2Vec model (Dhingra et al., 2016) to the tweet clustering task, comparing it against the traditional document-term matrix representation. To the best of our knowledge, this work is the first attempt to evaluate the performance of character-based neural embeddings on the tweet clustering task.

3 Experimental Evaluation

3.1 Dataset

Description and preprocessing. We use the SNOW 2014 test dataset (Papadopoulos et al., 2014) in our evaluation. It contains the IDs of about 1 million tweets produced within 24 hours. We retrieved 845,626 tweets from the Twitter API, since the other tweets had already been deleted from the platform. The preprocessing procedure: remove RT prefixes, URLs, and user mentions; bring all characters to lower case; and separate punctuation with spaces (the latter step is necessary only for the word-level baseline).

The dataset is further separated into 5 subsets corresponding to the 1-hour time intervals (18:00, 22:00, 23:15, 01:00 and 01:30) that are annotated with the list of breaking news topics. In total, we have 48,399 tweets for clustering evaluation; the majority of them (42,758 tweets) are in English.

The dataset comes with a list of breaking news topics. These topics were manually selected by independent evaluators from the topic pool collected from all challenge participants (external topics). The list contains 70 breaking news headlines extracted from tweets (e.g., "The new, full Godzilla trailer has roared online"). Each topic is annotated with only a few (at most 4) tweet IDs, which is not sufficient for an adequate evaluation of a tweet clustering algorithm.

Dataset extension. We enrich the topic annotations by collecting larger tweet clusters using fuzzy string matching (https://github.com/seatgeek/fuzzywuzzy) for each of the topic labels. Fuzzy string matching uses the Levenshtein (edit) distance (Levenshtein, 1966) between the two input strings as the measure of similarity. The Levenshtein distance corresponds to the minimum number of character edits (insertions, deletions, or substitutions) required to transform one string into the other. We choose only the tweets whose similarity ratio with the topic string is greater than the 0.9 threshold.

A sample tweet cluster produced with fuzzy string matching for the topic "Justin Trudeau apologizes for Ukraine joke":

- Justin Trudeau apologizes for Ukraine joke: Justin Trudeau said he's spoken the head...
- Justin Trudeau apologizes for Ukraine comments http://t.co/7ImWTRONXt
- Justin Trudeau apologizes for Ukraine hockey joke #cdnpoli

In total, we matched 2,585 tweets to 132 clusters using this approach. The resulting tweet clusters represent the ground-truth topics within different time intervals. The cluster size varies from 1 to 361 tweets, with an average of 20 tweets per cluster (median: 6.5).

... information flow. The gates (reset and update gate) regulate how much the previous output state (h_{t-1}) influences the current state (h_t). The two GRUs are identical, but the backward GRU receives the same sequence of tweet characters in reverse order. Each GRU computes its own vector representation for every substring (h_t), using the current character vector (x_t) and the vector representation it computed a step before (h_{t-1}). These two representations of the same tweet are combined in the next layer of the neural network to produce the final tweet embedding (see Dhingra et al. (2016) for more details). The network is trained in minibatches with an objective function to predict the previously removed hashtags. A hashtag can be considered the ground-truth cluster label for a tweet; therefore, the network is trained to optimize for the cor-
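The bidirectional GRU encoder described above can be illustrated with a minimal NumPy sketch. The toy alphabet, hidden size, random weight initialization, and combining the two final states by simple concatenation are assumptions for illustration only; Tweet2Vec learns its parameters by backpropagation on the hashtag-prediction objective, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz #@"
CHAR_DIM, HIDDEN_DIM = len(ALPHABET), 16

def one_hot(ch):
    """Character vector x_t: a one-hot over the (toy) alphabet."""
    v = np.zeros(CHAR_DIM)
    v[ALPHABET.index(ch)] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRU:
    """A single GRU; weights are randomly initialized for illustration."""
    def __init__(self):
        s = lambda *shape: rng.normal(0.0, 0.1, shape)
        self.Wz, self.Uz = s(HIDDEN_DIM, CHAR_DIM), s(HIDDEN_DIM, HIDDEN_DIM)
        self.Wr, self.Ur = s(HIDDEN_DIM, CHAR_DIM), s(HIDDEN_DIM, HIDDEN_DIM)
        self.Wh, self.Uh = s(HIDDEN_DIM, CHAR_DIM), s(HIDDEN_DIM, HIDDEN_DIM)

    def run(self, chars):
        """Compute h_t for every prefix of the character sequence;
        return the final state."""
        h = np.zeros(HIDDEN_DIM)
        for ch in chars:
            x = one_hot(ch)
            z = sigmoid(self.Wz @ x + self.Uz @ h)         # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h)         # reset gate
            h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1.0 - z) * h + z * h_tilde                # h_t from h_{t-1}
        return h

def tweet_embedding(tweet, fwd, bwd):
    """Run the forward GRU over the characters and the backward GRU over
    the reversed sequence, then combine the two final states."""
    return np.concatenate([fwd.run(tweet), bwd.run(reversed(tweet))])

fwd, bwd = GRU(), GRU()
emb = tweet_embedding("godzilla trailer #news", fwd, bwd)
print(emb.shape)  # (32,)
```

With trained weights, the combination of the two final states would be a learned layer rather than a concatenation, but the data flow is the same.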
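The fuzzy-matching step used for the dataset extension can be sketched as follows. This is a plain dynamic-programming Levenshtein distance with a simple length-based normalization; the fuzzywuzzy library used in the paper computes its ratio slightly differently, so the threshold behavior here is only indicative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b (O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity_ratio(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# A tweet joins a topic cluster when its ratio exceeds the 0.9 threshold.
topic = "justin trudeau apologizes for ukraine joke"
tweet = "justin trudeau apologizes for ukraine jokes"
print(similarity_ratio(topic, tweet) > 0.9)  # one edit apart -> True
```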
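The preprocessing procedure described in Section 3.1 (remove RT prefixes, URLs, and user mentions; lower-case; pad punctuation for the word-level baseline) can be sketched with regular expressions. The exact patterns below are illustrative assumptions, not the authors' code.

```python
import re

def preprocess(tweet: str, word_level: bool = False) -> str:
    """Clean one tweet following the steps in Section 3.1; the regexes
    are a hypothetical reconstruction for illustration."""
    t = re.sub(r"^RT\s+", "", tweet)        # drop the retweet prefix
    t = re.sub(r"https?://\S+", "", t)      # drop URLs
    t = re.sub(r"@\w+", "", t)              # drop user mentions
    t = t.lower()                           # lower-case all characters
    if word_level:                          # only for the word-level baseline
        t = re.sub(r"([.,!?;:()\"'])", r" \1 ", t)  # separate punctuation
    return re.sub(r"\s+", " ", t).strip()   # normalize whitespace

print(preprocess("RT @user: Godzilla trailer! http://t.co/x",
                 word_level=True))  # prints ": godzilla trailer !"
```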