Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics Too!


Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
[email protected], [email protected], [email protected]

arXiv:2004.14914v2 [cs.CL] 6 Oct 2020

Abstract

Topic models are a useful analysis tool to uncover the underlying themes within document collections. The dominant approach is to use probabilistic topic models that posit a generative story, but in this paper we propose an alternative way to obtain topics: clustering pretrained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.

1 Introduction

Topic models are the standard approach for exploratory document analysis (Boyd-Graber et al., 2017), which aims to uncover main themes and underlying narratives within a corpus. But in times of distributed and even contextualized embeddings, are they the only option?

This work explores an alternative to topic modeling by casting 'key themes' or 'topics' as clusters of word types under the modern distributed representation learning paradigm: unsupervised pre-trained word embeddings provide a representation for each word type as a vector, allowing us to cluster them based on their distance in high-dimensional space. The goal of this work is not to strictly outperform, but rather to benchmark standard clustering of modern embedding methods against the classical approach of Latent Dirichlet Allocation (LDA; Blei et al., 2003).

We restrict our study to influential embedding methods and focus on centroid-based clustering algorithms as they provide a natural way to obtain the top words in each cluster based on distance from the cluster center.1

(Footnote 1: We found that using non-centroid-based hierarchical or density-based clustering algorithms like DBScan resulted in worse performance and more hyperparameters to tune.)

Aside from reporting the best performing combination of word embeddings and clustering algorithm, we are also interested in whether there are consistent patterns: embeddings which perform consistently well across clustering algorithms might be good representations for unsupervised document analysis, and clustering algorithms that perform consistently well are more likely to generalize to future word embedding methods.

To make our approach reliably work as well as LDA, we incorporate corpus frequency statistics directly into the clustering algorithm, and quantify the effects of two key methods: 1) weighting terms during clustering and 2) reranking terms for obtaining the top J representative words. Our contributions are as follows:

• We systematically apply centroid-based clustering algorithms on top of a variety of pre-trained word embeddings and embedding methods for document analysis.

• Through weighted clustering and reranking of top words we obtain sensible topics; the best performing combination is comparable with LDA, but with smaller time complexity and empirical runtime.

• We show that further speedups are possible by reducing the embedding dimensions by up to 80% using PCA.

2 Related Work and Background

Analyzing documents by clustering word embeddings is a natural idea—clustering has been used for readability assessment (Cha et al., 2017), argument mining (Reimers et al., 2019), document classification and document clustering (Sano et al., 2017), inter alia. So far, however, clustering word embeddings has not seen much success for the purposes of topic modeling. While many modern efforts have attempted to incorporate word embeddings into the probabilistic LDA framework (Liu et al., 2015; Nguyen et al., 2015; Das et al., 2015; Zhao et al., 2017; Batmanghelich et al., 2016; Xun et al., 2017; Dieng et al., 2019), relatively little work has examined the feasibility of clustering embeddings directly.

Xie and Xing (2013) and Viegas et al. (2019) first cluster documents and subsequently find words within each cluster for document analysis. Sridhar (2015) targets short texts where LDA performs poorly in particular, fitting GMMs to learned word2vec representations. De Miranda et al. (2019) cluster using self-organising maps, but provide only qualitative results.

In contrast, our proposed approach is straightforward to implement, feasible for regular-length documents, requires no retraining of embeddings, and yields qualitatively and quantitatively convincing results. We focus on centroid-based k-means (KM), spherical k-means (SK), and k-medoids (KD) for hard clustering, and von Mises-Fisher models (VMFM) and Gaussian mixture models (GMM) for soft clustering; as pre-trained embeddings we consider word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017), Spherical (Meng et al., 2019), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018).

3 Methodology

After preprocessing and extracting the vocabulary from our training documents, each word type is converted to its embedding representation (averaging all of its tokens for contextualized embeddings; details in §5.3). Following this we apply the various clustering algorithms on the entire training corpus vocabulary to obtain k clusters, using weighted (§3.2) or unweighted word types. After the clustering algorithm has converged, we obtain the top J words (§3.1) from each cluster for evaluation. Note that one potential shortcoming of our approach is the possibility of outliers forming their own cluster, which we leave to future work.

Crucial to our method is the incorporation of corpus statistics on top of vanilla clustering algorithms, which we describe in the remainder of this section.
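As a concrete illustration of this pipeline, the sketch below embeds a toy training vocabulary with pretrained static vectors and clusters the word types with k-means. It is a minimal sketch under stated assumptions, not the authors' released implementation: the toy `docs` corpus, the small `glove-wiki-gigaword-50` vectors from gensim, and the cluster count `k` are illustrative choices.

```python
# Minimal sketch of the Section 3 pipeline (illustrative, not the paper's code).
# Assumes gensim (with gensim-data access) and scikit-learn are installed.
import gensim.downloader as api
import numpy as np
from sklearn.cluster import KMeans

# Toy tokenized, preprocessed corpus; in practice this would be e.g. 20NG.
docs = [["nasa", "launched", "the", "orbiter", "into", "orbit"],
        ["the", "team", "scored", "late", "to", "win", "the", "game"],
        ["the", "senate", "passed", "the", "budget", "bill"]]

vectors = api.load("glove-wiki-gigaword-50")      # pretrained static embeddings
vocab = sorted({w for doc in docs for w in doc if w in vectors})
X = np.stack([vectors[w] for w in vocab])         # one embedding per word *type*

k = 3                                             # number of topics (clusters)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
```

For contextualized embeddings (ELMo, BERT), the paper instead averages all token occurrences of a word type (§5.3); the same matrix X would then be built from those averaged vectors.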
3.1 Obtaining top-J words

In traditional topic modeling (LDA), the top J words are those with highest probability under each topic-word distribution. For centroid-based clustering algorithms, the top words of some cluster i are naturally those closest to the cluster center c^{(i)}, or with highest probability under the cluster parameters. Formally, this means choosing the set of types J as

$$\operatorname*{argmin}_{J:\,|J|=10}\ \sum_{j \in J} \begin{cases} \lVert c^{(i)} - x_j \rVert_2^2 & \text{for KM/KD,} \\ -\cos\!\big(c^{(i)}, x_j\big) & \text{for SK,} \\ -f\!\big(x_j \mid c^{(i)}, \Sigma_i\big) & \text{for GMM/VMFM.} \end{cases}$$

Our results in §6 focus on KM and GMM, as we observe that k-medoids, spherical KM and von Mises-Fisher tend to perform worse than KM and GMM (see App. A and App. B).

Note that it is possible to extend this approach to obtain the top topics given a document: compute similarity scores between learned topic cluster centers and all word embeddings from that particular document, and normalize them using softmax to obtain a (non-calibrated) probability distribution.
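Continuing the sketch above, the KM/KD case of this selection rule amounts to taking the J word types with smallest squared distance to each center, and the softmax construction from the previous paragraph gives a per-document topic distribution. This is only an illustrative sketch that reuses the hypothetical `docs`, `vocab`, `X`, `vectors`, and `km` objects from the earlier block.

```python
import numpy as np
from scipy.special import softmax

J = 10
centers = km.cluster_centers_                                        # shape (k, dim)

# Top-J words per cluster: smallest squared Euclidean distance to the center
# (the KM/KD case of the selection rule above).
dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)    # (|vocab|, k)
topics = [[vocab[j] for j in np.argsort(dists[:, i])[:J]]
          for i in range(centers.shape[0])]

# Topic distribution for one document (cf. the last paragraph of Section 3.1):
# softmax over the mean cosine similarity between the document's word
# embeddings and each cluster center.
def doc_topic_dist(doc_tokens):
    emb = np.stack([vectors[w] for w in doc_tokens if w in vectors])
    sims = emb @ centers.T / (np.linalg.norm(emb, axis=1, keepdims=True)
                              * np.linalg.norm(centers, axis=1))
    return softmax(sims.mean(axis=0))                # non-calibrated distribution

print(topics[0], doc_topic_dist(docs[0]))
```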
3.2 Weighting while clustering

The intuition of weighted clustering is based on the formulation of classical LDA, which models the probability of word type t belonging to topic i as $\frac{N_{t,i} + \beta_t}{\sum_{t'} N_{t',i} + \beta_{t'}}$, where N_{t,i} refers to the number of times word type t has been assigned to topic i, and β is a parameter of the Dirichlet prior on the per-topic word distribution. In our case, illustrated by the schematic in Fig. 1, weighting is a natural way to account for the frequency effects of vocabulary terms during clustering.

[Figure 1: The figure on the left shows the cluster center without weighting, while the figure on the right shows that after weighting (larger points have higher weight) a hopefully more representative cluster center is found. Note that top words based on distance from the cluster center could still very well be low-frequency word types, motivating reranking (§3.3).]

3.3 Reranking when obtaining topics

When obtaining the top-J words that make up a cluster's topic, we also consider reranking terms, as there is no guarantee that words closest to cluster centers are important word types. We will show in Table 2 that without reranking, clustering yields "sensible" topics but low NPMI scores.

3.4 Which corpus statistics?

To incorporate corpus statistics into the clustering algorithm, we examine three different schemes to assign weights to word types, where n_t is the count of word type t in corpus D, and d is a document:

$$\mathrm{tf} = \frac{n_t}{\sum_{t'} n_{t'}} \quad (1)$$
$$\text{tf-df} = \mathrm{tf} \cdot \frac{|\{d \in D : t \in d\}|}{|D|} \quad (2)$$
$$\text{tf-idf} = \mathrm{tf} \cdot \log \frac{|D|}{|\{d \in D : t \in d\}| + 1} \quad (3)$$

These scores can now be used for weighting word types when clustering (w), for reranking the top 100 words afterwards (r), for both (wr), or for neither.

[...] O(tkN), where N is the number of all tokens, so when N ≫ n, clustering methods can potentially achieve better performance-complexity tradeoffs. Note that running ELMo and BERT over documents also requires iterating over all tokens, but only once, and not for every topic and iteration.

4.1 Cost of obtaining Embeddings

For readily available pretrained word embeddings such as word2vec, FastText, GloVe and Spherical, the embeddings can be considered as 'given', as the practitioner does not need to generate these embeddings from scratch. However, for contextual embeddings such as ELMo and BERT, there is additional computational cost in obtaining these embeddings before clustering, which requires passing through RNN and transformer layers respectively. This can be trivially parallelised by batching the context window (usually a sentence). We use standard pretrained ELMo and BERT models in our experiments and therefore do not consider the runtime of training these models from scratch.

5 Experimental Setup

Our implementation is freely available online.

5.1 Datasets

We use the 20 newsgroup dataset (20NG), which contains around 18000 documents and 20 categories, and a subset of Reuters21578, which con[...]
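Putting §3.2-§3.4 together, the sketch below computes the three weighting schemes of Eqs. (1)-(3), passes one of them to k-means as per-sample weights (scikit-learn's KMeans.fit accepts a sample_weight argument), and reranks the 100 words nearest each center by corpus weight to obtain the final top J. As before, this is an illustrative sketch reusing the hypothetical docs, vocab, X, and k from the earlier blocks; which scheme is used for weighting and which for reranking is a hyperparameter choice, and the specific picks here (tf for weighting, tf-idf for reranking) are assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Corpus statistics per word type (Eqs. 1-3).
counts = Counter(w for doc in docs for w in doc)          # n_t
df = Counter(w for doc in docs for w in set(doc))         # |{d in D : t in d}|
total = sum(counts.values())

tf = np.array([counts[w] / total for w in vocab])                          # Eq. (1)
tf_df = tf * np.array([df[w] / len(docs) for w in vocab])                  # Eq. (2)
tf_idf = tf * np.log(len(docs) / (np.array([df[w] for w in vocab]) + 1))   # Eq. (3)

# Weighting while clustering (Section 3.2): frequent word types pull the
# cluster centers towards them via per-sample weights.
km_w = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X, sample_weight=tf)

# Reranking (Section 3.3): take the 100 words closest to each center, then
# keep the J of them with the largest corpus weight (tf-idf here).
J = 10
dists = ((X[:, None, :] - km_w.cluster_centers_[None, :, :]) ** 2).sum(axis=-1)
topics = []
for i in range(k):
    nearest = np.argsort(dists[:, i])[:100]
    reranked = sorted(nearest, key=lambda j: -tf_idf[j])[:J]
    topics.append([vocab[j] for j in reranked])
```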