Investigation of Latent Semantic Analysis for Clustering of Czech News Articles
Michal Rott, Petr Cerva
Institute of Information Technology and Electronics
Technical University of Liberec
Studentska 2, 461 17, Liberec, Czech Republic, https://www.ite.tul.cz/itee/
Email: [email protected], [email protected]

Abstract—This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task, as it allows us to reduce the problem of synonymy. This is a particularly important factor for Czech, which belongs to the group of highly inflective and morphologically rich languages. The experimental evaluation of our clustering scheme and investigation of LSA is performed on query- and category-based test sets. The obtained results demonstrate that the automatic system yields Rand index values that are 20% lower, in absolute terms, than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging and what effect dimension reduction has on clustering accuracy.

I. INTRODUCTION

The task of automatic clustering of text documents has attracted a lot of attention recently. Since 1988, when one of the first studies on automatic clustering came out [1], various methods have been proposed in this field [2], [3]. These approaches can be divided into several categories [4], e.g., model-based, density-based, hierarchical or partitioning types.

The goal of this work is to find an approach suitable for clustering documents in the highly inflective Czech language with its rich vocabulary. Thus, we take advantage of Latent Semantic Analysis (LSA) [5], which allows us to find a low-rank approximation of a term-document matrix describing the number of occurrences of words (terms) in the input documents [6]. In the case of inflective languages, rank lowering is important for two reasons. First, it should allow us to merge the dimensions associated with terms of similar meanings: the estimated term-document matrix is presumed overly sparse in comparison to the "true" term-document matrix, because it contains only the words actually seen in each document, whereas it should list a much larger set (due to synonymy) of all words related to each document. Second, rank lowering is expected to de-noise the original term-document matrix, which contains semantically unimportant (noisy) terms.

For these reasons, LSA has also been successfully applied in the tasks of automatic summarisation of documents and information retrieval. In the former case, it allows us to find a decomposition that describes a degree of importance for each topic of the document in each sentence. The resulting summary of the input document is then created by choosing the most important sentences [7], [8]. In the latter task, given a query composed of several words (terms), LSA translates it into a low-dimensional space and finds matching documents [9], [10].

The LSA-based clustering approach adopted in this work combines hierarchical and model-based methods. First, LSA is used to create a vector representation of the input documents in the concept space. These vectors are then merged within an agglomerative hierarchical clustering approach. We also take advantage of a text-preprocessing module, which performs language-dependent operations such as substitution of synonyms and lemmatisation.

The rest of the paper is structured as follows: The next section describes the adopted clustering scheme. Section III then describes the measures used for assessment of document similarity and the metrics used for evaluation of clustering. Experimental evaluation on query-based and category-based test sets is then given in Section IV. In this section, we investigate various possibilities of how LSA-based decomposition and clustering can be performed. Section V then concludes the paper.

II. ADOPTED CLUSTERING SCHEME

The clustering scheme used within this work is based on the use of LSA and consists of three phases. In the first phase, a term-document matrix is constructed and decomposed to a concept space using LSA. Next, the dimensionality of the concept space is reduced and, after that, hierarchical clustering is performed in the third phase. All three of these steps are detailed in the following subsections.
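For orientation before the detailed subsections, the three phases can be sketched in a few lines of Python. This is our own illustration, not the authors' implementation: scikit-learn's TfidfVectorizer, TruncatedSVD and AgglomerativeClustering stand in for the paper's TF-IDF weighting, SVD decomposition and agglomerative clustering, and the language-dependent preprocessing (lemmatisation, synonym substitution) is omitted.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

def cluster_documents(docs, n_clusters, k=100):
    """Three-phase LSA clustering sketch (illustrative only)."""
    # Phase 1: weighted term-document matrix. Note that scikit-learn
    # builds the transpose of the paper's A: documents x terms.
    X = TfidfVectorizer().fit_transform(docs)
    # Phase 2: SVD decomposition to a concept space, truncated to k
    # dimensions (k must be smaller than the number of terms; the
    # paper instead selects k from the singular-value mass).
    X_k = TruncatedSVD(n_components=k).fit_transform(X)
    # Phase 3: agglomerative (bottom-up) clustering of the reduced
    # vectors until the demanded number of clusters is reached.
    agg = AgglomerativeClustering(n_clusters=n_clusters, linkage="average")
    return agg.fit_predict(X_k)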
A. The term-document matrix and its decomposition

At first, all the input documents are preprocessed and lemmatised. The resulting text then contains not only lemmas, but also word forms, such as numbers or typing errors, which cannot be lemmatised. We refer to all of these items as terms.

Given the preprocessed set of input documents, the frequency of occurrence, i.e., the term frequency (TF), is calculated for every unique term from these documents, excluding terms on the stop list. After that, the frequency of each term is weighted by its inverse document frequency (IDF) [11], [12]. These IDF values can be expressed as:

IDF(l) = \log \frac{|D_b|}{|\{d_b \in D_b : l \in d_b\}|}    (1)

where |D_b| is the total number of training documents in the background corpus and |\{d_b \in D_b : l \in d_b\}| is the number of background documents containing the term l.

Given the weighted term frequency values, the term-document matrix A is constructed, where each column vector represents the weighted term frequency vector of one input document. Therefore, the size of A is t × d, where t is the number of all unique terms in all input documents and d is the total number of input documents.

After that, the LSA is performed. This method applies the Singular Value Decomposition (SVD) to the matrix A as follows:

A = U \Sigma V^T    (2)

where U is a t × m column-orthonormal matrix of left singular vectors, \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, ..., \sigma_m) is an m × m diagonal matrix whose diagonal elements represent the non-negative singular values sorted in descending order, and V^T is an m × d orthonormal matrix of right singular vectors.

It has been shown [13] that the matrices U, \Sigma and V^T represent a concept (semantic) space of the input documents: the matrix U describes a mapping of concepts to the space of terms, V^T captures how concepts are mapped to documents, and the singular values of \Sigma represent the significance of individual concepts. Note that a more detailed description of the SVD can be found in [9].

B. Concept space dimension reduction

In practice, the term-document matrix A is usually sparse because individual input documents can contain a) synonyms or b) partially or completely different words and word forms. Note that the latter problem occurs particularly for inflective languages such as Czech. The next issue is that A also contains noise that is represented by common and/or meaningless terms.

From a linguistic point of view, these issues can partially be eliminated by using:

• a stop list of meaningless terms
• a processing module for text normalisation and lemmatisation
• a minimum occurrence threshold for terms

Within this work, we utilised the first two options to reduce the sparsity of A (see section IV-F for details and the obtained results).

From the mathematical point of view, this problem can also be addressed by a low-rank approximation that reduces the number of dimensions of the concept space from m to k (see Fig. 1). Unfortunately, the problem of finding the proper value of k is not trivial and its solution is usually based on heuristic knowledge of the given task. In section IV-E we present our results obtained for values of k ranging from 10% to 100% of the sum of all singular values.

C. Hierarchical clustering

The adopted hierarchical clustering approach is based on the assumption that documents belonging to the same cluster should have similar concepts (topics). Therefore, after dimensionality reduction, clustering can be performed on the vectors of A_K or on the vectors of the reduced matrix V^T, which describes the mapping of concepts to documents (see section IV-D for results).

In both cases, we perform clustering in an agglomerative way, where each document represents one cluster at the beginning. Then, pairs of the most similar clusters are merged in consecutive steps until the demanded number of clusters is reached. As a similarity measure, we employ the various metrics described in section III-A. The outcome of clustering is a set of clusters where every cluster contains a list of documents that should have a similar concept (topic).
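The merging loop just described can be made concrete with the following sketch (again our own illustration, not the authors' code). The `similarity` argument is any of the measures from Section III-A, under the convention that larger values mean more similar, so distance measures would be negated.

import numpy as np

def agglomerate(doc_vectors, n_clusters, similarity):
    # Each document starts as its own cluster; every cluster is
    # represented by the centroid of its member vectors.
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    clusters = [[i] for i in range(len(doc_vectors))]
    centroids = [v.copy() for v in doc_vectors]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose centroids are most similar.
        best_score, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = similarity(centroids[i], centroids[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        i, j = best_pair
        # Merge cluster j into cluster i and recompute the centroid
        # over all documents belonging to the new cluster.
        clusters[i].extend(clusters[j])
        centroids[i] = doc_vectors[clusters[i]].mean(axis=0)
        del clusters[j], centroids[j]
    return clusters

For instance, agglomerate(vectors, 10, lambda a, b: -np.linalg.norm(a - b)) merges by negated Euclidean distance, while the cosine similarity of eq. (5) below can be passed directly.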
III. METRICS

A. Similarity measures used for hierarchical clustering

The similarity metrics are used to find the pair of closest documents (clusters) during the process of agglomerative hierarchical clustering. For this purpose, each pair of documents is represented by vectors \vec{a} and \vec{b} that correspond to columns of A or V^T. When two documents (or clusters) are merged to form a new cluster, this cluster is then represented by the average vector (centroid) calculated over all documents belonging to the cluster.

Within this work, the following similarity metrics were used:

Euclidean distance:

d_e(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{k} (a_i - b_i)^2}    (3)

Normalised Euclidean distance:

d_n(\vec{a}, \vec{b}) = \sqrt{\sum_{i=1}^{k} \frac{(a_i - b_i)^2}{\sigma_i^2}}    (4)

Cosine similarity:

d_c(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}    (5)

where k (≤ t) is the number of dimensions of the reduced concept space and \sigma_i^2 is the variance of the i-th element across the given space.

B. Clustering evaluation metrics

Silhouette: Silhouette is a clustering evaluation metric that describes how well documents are assigned to clusters [14]. Silhouette takes on values inside the interval ⟨−1, 1⟩ and is defined as:

SI = \frac{1}{N} \sum_{i=1}^{N} \frac{n_i - c_i}{\max\{c_i, n_i\}}    (6)
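The metrics above translate directly into NumPy; the sketch below is our own illustration. In the standard Silhouette definition [14], c_i is the mean distance from document i to the other members of its own cluster and n_i the mean distance to the members of the nearest other cluster; singleton-cluster edge cases are ignored here, and `variances` holds the per-dimension variances \sigma_i^2 needed by eq. (4).

import numpy as np

def d_e(a, b):                                    # Euclidean distance, eq. (3)
    return np.sqrt(np.sum((a - b) ** 2))

def d_n(a, b, variances):                         # normalised Euclidean, eq. (4)
    return np.sqrt(np.sum((a - b) ** 2 / variances))

def d_c(a, b):                                    # cosine similarity, eq. (5)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def silhouette_index(vectors, labels):            # Silhouette, eq. (6)
    vectors, labels = np.asarray(vectors, dtype=float), np.asarray(labels)
    scores = []
    for i in range(len(vectors)):
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        own = (labels == labels[i])
        own[i] = False                            # exclude the document itself
        c_i = dists[own].mean()                   # mean distance within own cluster
        n_i = min(dists[labels == other].mean()   # mean distance to the
                  for other in set(labels.tolist())
                  if other != labels[i])          # nearest other cluster
        scores.append((n_i - c_i) / max(c_i, n_i))
    return np.mean(scores)                        # SI lies in <-1, 1>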