arXiv:2012.03468v1 [cs.CL] 7 Dec 2020

An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data

Lili Wang1, Chongyang Gao2, Jason Wei3, Weicheng Ma4, Ruibo Liu5, and Soroush Vosoughi6
1,2,4,5,6Department of Computer Science, Dartmouth College
3ProtagoLabs
1,2,4,[email protected] [email protected] [email protected]

Abstract

The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed.

1 Introduction

Recent years have witnessed an exponential increase in the usage of social media platforms. These platforms have become an important part of politics, business, entertainment, and general social life. Correspondingly, the amount of data generated by users on these platforms has also grown exponentially. Though data on social media includes various modalities, such as images, videos, and graphs, text is by far the largest type of data generated by users. Thus, in order to extract knowledge and insight from social media, sophisticated text processing models are needed. Luckily, in parallel to the growth of social media, there has been a rapid rise in the development of sophisticated text representation techniques, the most recent being large-scale pre-trained language models that use the Transformer architecture (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019). These methods can generate general-purpose vector representations of documents that can be used for any downstream task (e.g., sentiment classification).

However, the representation power of these methods for data from social media is not well understood. This is especially true for tweets, which are usually short, noisy, and idiosyncratic. This paper is an attempt to evaluate and catalogue the representation power of a wide range of methods for tweets, starting from very simple bag-of-words representations (or embeddings) to representations generated by recent Transformer-based models, such as BERT. Since we are interested in the general representation power of the methods and not their performance on any specific downstream tasks, we do not fine-tune any of the methods using downstream tasks and use unsupervised evaluation (i.e., clustering) for our survey.

2 Text Representation Methods

In this section, we briefly introduce the methods used in our survey, sorted from oldest to newest. For word embedding methods like word2vec, GloVe, and fastText, which do not explicitly support sentence embeddings, we average the word embeddings to get sentence embeddings. For deep models like ELMo, BERT, ALBERT, and XLNet, we take the average of the hidden state of the last layer on the input sequence axis. Some other works use the hidden state of the first token ([CLS]) instead, but because we use the pre-trained models without fine-tuning, the hidden state of [CLS] is not a good sentence representation in our setting. We use all of these deep neural models without fine-tuning because fine-tuning is usually based on specific downstream tasks, which bias the information in the hidden states and weaken the general representation. When we refer to n-gram models, we mean models that capture all grams up to and including the n-gram (e.g., bigram models will include bigrams and unigrams).

1. bag-of-words (BoW).
This is a representation of text that describes the occurrence of words within a document. In our experiments, we use a random sample of 5 million tweets collected from the Internet Archive Twitter dataset (IAT; see footnote 1) to create a vocabulary. We also remove stop words from the tweets. We try unigram, bigram, and trigram models.

2. TF-IDF. Term frequency–inverse document frequency (TF-IDF) reflects how important a word is with respect to documents in a collection or corpus. We use an experimental setup similar to that of BoW.

3. LDA (Hoffman et al., 2010). Latent Dirichlet allocation (LDA) is a generative statistical model for capturing the topic distribution of documents in a corpus. We train this model on the IAT dataset. We also remove stop words and train models with 5, 10, 20, and 100 topics.

4. word2vec (Mikolov et al., 2013). word2vec is a distributed representation of words based on a model trained on predicting the current word from surrounding context words (CBOW). We train unigram, bigram, and trigram word2vec models using the IAT dataset.

5. doc2vec (Le and Mikolov, 2014). This model extends word2vec by adding another document vector based on ID. Our model is trained on the IAT dataset.

6. GloVe (Pennington et al., 2014). This model combines global matrix factorization and local context window methods for training distributed representations. We use the 200-dimensional version that was pre-trained on 2 billion tweets.

7. fastText (Joulin et al., 2016). fastText is another word embedding method that extends word2vec by representing each word as a bag of character n-grams. We use the 300-dimensional off-the-shelf version, which was pre-trained on Wikipedia.

8. Tweet2vec (Dhingra et al., 2016). This model finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. In our experiments, we use the pre-trained best model provided by the authors (see footnote 2).

9. Universal Sentence Encoder (USE) (Cer et al., 2018). USE encodes sentences into high-dimensional vectors. The pre-trained encoder comes in two versions, one trained with a deep averaging network (DAN) (Iyyer et al., 2015) and one with a Transformer. We use the DAN version of USE.

10. ELMo (Peters et al., 2018). This method provides context-dependent word representations based on bidirectional language models. We use the version pre-trained on the One Billion Word Benchmark.

11. BERT (Devlin et al., 2018). BERT is a large-scale Transformer-based language representation model (Vaswani et al., 2017). We use two off-the-shelf pre-trained versions, BERT-base and BERT-large, both pre-trained on the BooksCorpus and English Wikipedia.

12. ALBERT (Lan et al., 2019). This is a lite version of BERT, with far fewer parameters. We use two off-the-shelf versions, ALBERT-base and ALBERT-large, both pre-trained on the BooksCorpus and English Wikipedia.

13. XLNet (Yang et al., 2019). This is an autoregressive Transformer-based language model. Like BERT, XLNet is a large-scale language model with millions of parameters. We use the off-the-shelf versions pre-trained on the BooksCorpus and English Wikipedia.

14. Sentence-BERT (Reimers and Gurevych, 2019). Sentence-BERT modifies BERT by using siamese and triplet network structures to derive semantically meaningful sentence embeddings. We use five off-the-shelf versions provided by the authors, Sentence-BERT-base, Sentence-BERT-large, Sentence-DistilBERT, Sentence-RoBERTa-base, and Sentence-RoBERTa-large, all pre-trained on NLI data.

Footnotes:
1 https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate
2 https://github.com/bdhingra/tweet2vec/tree/master/tweet2vec/best_model There is another tweet2vec model that uses a character-level CNN-LSTM encoder-decoder (Vosoughi et al., 2016), but for the sake of brevity we only show the results for one of the tweet2vec models.

3 Experiments

Since we are interested in measuring the general text representation power of our methods, we use clustering as a way to evaluate the representations generated by each model (instead of any downstream supervised tasks). We use the vector representations of each tweet to run k-means clustering for different values of k. We use two tweet datasets for our evaluation. The tweets in these datasets have labels corresponding to their topic, which we use as cluster ground truth for evaluation purposes.

Dataset 1 (Zubiaga et al., 2015): This dataset includes 356,782 tweets belonging to 1,036 topics. We use k ∈ {200, 400, 600, 800, 1000} for this dataset.

Dataset 2 (Rosenthal et al., 2017): This dataset includes 35,323 tweets belonging to 374 topics. We use k ∈ {100, 200, 300, 400, 500} for this dataset.

3.1 Evaluation Metrics

We use a total of six metrics for evaluating the “goodness” of our clusters, described below. All metrics except the Silhouette score rely on ground-truth labels.

Silhouette score (Rousseeuw, 1987): A good clustering will produce clusters where the elements inside the same cluster are close to each other and the elements in different clusters are far from each other.

              Homogeneity  Completeness  V-measure  ARI   AMI   Silhouette
Homogeneity   1            0.94          0.99       0.86  0.88  0.21
Completeness  0.94         1             0.98       0.89  0.97  0.28
V-measure     0.99         0.98          1          0.89  0.93  0.25
ARI           0.86         0.89          0.89       1     0.91  0.2
AMI           0.88         0.97          0.93       0.91  1     0.31
Silhouette    0.21         0.28          0.25       0.2   0.31  1

Figure 1: Confusion matrix of the correlation (Pearson’s r) between each pair of methods.
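As an illustration of the averaging step described in Section 2 (word embeddings, or last-layer hidden states, averaged along the sequence axis to obtain one vector per tweet), the following is a minimal sketch in plain Python. It is not the authors' code: the function names `mean_pool` and `embed_tweet`, the toy vocabulary, and the handling of out-of-vocabulary tokens are assumptions made for the example.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a sequence of equal-length vectors along the sequence axis."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def embed_tweet(tokens: Sequence[str],
                word_vectors: Dict[str, Vector],
                dim: int) -> Vector:
    """Embed a tweet as the mean of the vectors of its in-vocabulary tokens.

    Assumption (not stated in the paper): out-of-vocabulary tokens are
    skipped, and a tweet with no known tokens maps to the zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return mean_pool(known) if known else [0.0] * dim


# Hypothetical 2-dimensional toy vocabulary, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "morning": [0.0, 1.0]}
tweet_embedding = embed_tweet(["good", "morning", "twitter"], toy_vectors, dim=2)
print(tweet_embedding)  # [0.5, 0.5]
```

The same `mean_pool` call would apply to the per-token hidden states of a deep model, with one vector per input token instead of one per vocabulary word.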

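To make the Silhouette score of Section 3.1 concrete, here is a minimal sketch of the standard definition from Rousseeuw (1987): for each element, a is its mean distance to the other elements of its own cluster, b is the smallest mean distance to the elements of any other cluster, and the element's score is (b - a) / max(a, b). The Euclidean distance and the toy points are assumptions for the example; this excerpt does not state which distance or implementation the authors used.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Rousseeuw, 1987)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i: int, cluster: int) -> float:
        # Mean distance from point i to the other members of `cluster`.
        dists = [euclidean(points[i], points[j])
                 for j in range(len(points))
                 if labels[j] == cluster and j != i]
        return sum(dists) / len(dists) if dists else 0.0

    scores: List[float] = []
    for i, own in enumerate(labels):
        if sum(1 for l in labels if l == own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)                                   # cohesion
        b = min(mean_dist(i, c) for c in clusters if c != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated pairs of 2-D points.
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # clusters match the pairs
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # clusters split the pairs
print(good > bad)  # True
```

Scores close to 1 indicate tight, well-separated clusters, and negative scores indicate elements that sit closer to another cluster than to their own. Unlike the other five metrics, this computation never consults the ground-truth topic labels.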