Analysis of Word Embeddings Using Fuzzy Clustering
Shahin Atakishiyev and Marek Z. Reformat
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
[email protected], [email protected]

Abstract – In data-dominated systems and applications, the concept of representing words in a numerical format has gained a lot of attention. There are a few approaches used to generate such representations. An interesting issue that should be considered is the ability of such representations – called embeddings – to imitate human-based semantic similarity between words. In this study, we perform a fuzzy-based analysis of vector representations of words, i.e., word embeddings. We use two popular fuzzy clustering algorithms on count-based word embeddings, known as GloVe, of different dimensionality. Words from WordSim-353, called the gold standard, are represented as vectors and clustered. The results indicate that fuzzy clustering algorithms are very sensitive to high-dimensional data and that parameter tuning can dramatically change their performance. We show that by adjusting the value of the fuzzifier parameter, fuzzy clustering can be successfully applied to vectors of high – up to one hundred – dimensions. Additionally, we illustrate that fuzzy clustering provides interesting results regarding the membership of words in different clusters.

Keywords – fuzzy clustering, fuzzy C-means, fuzzy Gustafson-Kessel, cluster validity, word embeddings, word vectors

I. INTRODUCTION

Word embeddings have become an essential element of methods that focus on the analysis and comparison of texts. One of the most popular embeddings is GloVe [10]. The embedding is obtained via analysis of word-word co-occurrences in a text corpus. A natural question concerns the ability of such embeddings to represent human-based semantic similarity between words.

Data clustering is the process of grouping objects in such a way that the similarity between data points belonging to the same group (cluster) becomes as high as possible, while the similarity between points from different groups becomes as small as possible. It is an important task in data analysis and has been successfully applied to pattern recognition [1][2], image segmentation [3], fault diagnosis, and search engines [4]. Fuzzy clustering, which allows data points to belong to several clusters with different membership grades, has proved useful in many areas. Specifically, fuzzy C-means (FCM) clustering and its augmented version, fuzzy Gustafson-Kessel (FGK) clustering, are the most popular fuzzy clustering techniques. High-dimensional spaces often have a devastating effect on data clustering in terms of performance and quality; this issue is known as the curse of dimensionality.

Our study sheds some light on the comparative analysis of the above fuzzy clustering methods, observing if and how the results of fuzzy clustering change with the dimensionality of word embeddings. Additionally, we illustrate the usefulness of fuzzy clustering via an analysis of the degrees of belonging of words to different clusters.

The paper is structured as follows: Section II contains the description of the fuzzy C-means (FCM) and fuzzy Gustafson-Kessel (FGK) algorithms. In Section III, we provide an overview of the validity indices applied to fuzzy clustering processes. The theoretical background behind the construction of GloVe embeddings is covered in Section IV. Section V outlines the methodology, while Section VI shows our experimental results. Finally, conclusions are presented in Section VII.

II. FUZZY CLUSTERING

Zadeh's fuzzy set theory [5] has triggered a number of studies focused on applying theoretical and empirical concepts of fuzzy logic to data clustering. In contrast to hard clustering techniques, where each point is assigned to exactly one cluster, fuzzy clustering allows data points to pertain to several clusters with different grades of membership, as the sketch below makes concrete.
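The following minimal sketch (our illustration, not part of the paper; Python with NumPy and invented toy memberships) contrasts the two views: a fuzzy membership matrix keeps a degree of belonging for every cluster, with each point's degrees summing to 1, while hard clustering collapses this to a single label.

```python
import numpy as np

# Toy example: memberships of 3 data points in 2 clusters.
# In fuzzy clustering, each row holds the degrees of membership
# of one point in every cluster, and each row sums to 1.
fuzzy_membership = np.array([
    [0.90, 0.10],   # point clearly in cluster 0
    [0.55, 0.45],   # borderline point shared by both clusters
    [0.05, 0.95],   # point clearly in cluster 1
])
assert np.allclose(fuzzy_membership.sum(axis=1), 1.0)

# Hard clustering keeps only the strongest membership per point,
# discarding the information that point 1 is a borderline case.
hard_labels = fuzzy_membership.argmax(axis=1)
print(hard_labels)  # [0 0 1]
```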
We have analyzed the behavior of FCM and FGK in the clustering of vector representations of words of different dimensionality. The details of these clustering methods are described in the following subsections.

A. Fuzzy C-means clustering

The fuzzy C-means algorithm was introduced by Bezdek [6] in 1981. It allows an observation to belong to multiple clusters with varying grades of membership. Having $D$ as the number of data points, $N$ as the number of clusters, $m$ as the fuzzifier parameter, $x_i$ as the $i$-th data point, $c_j$ as the center of the $j$-th cluster, and $\mu_{ij}$ as the degree of membership of $x_i$ in the $j$-th cluster, FCM aims to minimize

$$ J = \sum_{i=1}^{D} \sum_{j=1}^{N} \mu_{ij}^{m} \, \lVert x_i - c_j \rVert^2 \qquad (1) $$

The FCM clustering proceeds in the following way:

1. The cluster membership values $\mu_{ij}$ and the initial cluster centers are initialized randomly.
2. The cluster centers are computed according to the formula:

$$ c_j = \frac{\sum_{i=1}^{D} \mu_{ij}^{m} x_i}{\sum_{i=1}^{D} \mu_{ij}^{m}} \qquad (2) $$

3. The membership grades $\mu_{ij}$ are updated in the following way:

$$ \mu_{ij} = \frac{1}{\sum_{k=1}^{N} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \quad \mu_{ij} \in [0,1] \ \text{and} \ \sum_{j=1}^{N} \mu_{ij} = 1 \qquad (3) $$

4. The objective function $J$ is calculated.
5. Steps 2, 3, and 4 are repeated until the value of the objective function drops below a specified threshold.

Fuzzy C-means has many useful applications in medical image analysis, pattern recognition, and software quality prediction [6,7], to name just a few. The most important factors affecting the performance of this algorithm are the fuzzifier parameter $m$ and the size and dimensionality of the data. The performance of the algorithm for high-dimensional clustering is discussed in detail in Section VII.

B. Fuzzy Gustafson-Kessel clustering

Fuzzy Gustafson-Kessel (FGK) clustering extends FCM by introducing an adaptive distance norm that allows the algorithm to identify clusters with different geometrical shapes [8]. Note that in this subsection and the next, $N$ denotes the number of data points and $c$ the number of clusters. The distance metric is defined in the following way:

$$ D_{GK}^2 = (x_k - v_i)^T A_i (x_k - v_i) \qquad (4) $$

where $A_i$ itself is computed from the fuzzy covariance matrix of each cluster ($d$ being the dimensionality of the data):

$$ A_i = (\rho_i |C_i|)^{1/d} \, C_i^{-1} \qquad (5) $$

$$ C_i = \frac{\sum_{k=1}^{N} \mu_{ik}^{m} (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} \mu_{ik}^{m}} \qquad (6) $$

Here the parameter $\rho_i$ is the constrained form of the determinant of $A_i$, and $v_i$ is the center of cluster $i$:

$$ |A_i| = \rho_i, \quad \rho_i > 0, \ \forall i \qquad (7) $$

Allowing the matrix $A_i$ to vary while its determinant is fixed optimizes the shape of the clusters while keeping each cluster's volume constant [8]. Gustafson-Kessel clustering minimizes the following criterion:

$$ J = \sum_{k=1}^{N} \sum_{i=1}^{c} \mu_{ik}^{m} D_{GK}^2 = \sum_{k=1}^{N} \sum_{i=1}^{c} \mu_{ik}^{m} (x_k - v_i)^T A_i (x_k - v_i) \qquad (8) $$

Like FCM, this optimization is subject to the following constraints:

$$ \mu_{ik} \in [0,1], \ \forall i,k \quad \text{and} \quad \sum_{i=1}^{c} \mu_{ik} = 1, \ \forall k \qquad (9) $$

We see that the computation of the FGK algorithm is more involved than that of FCM.

III. VALIDITY INDICES

There are several validity indices for analyzing the performance of fuzzy clustering algorithms. One of them, proposed by Bezdek [6], is called the fuzzy partition coefficient (FPC). This index is calculated as follows:

$$ FPC = \frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{c} \mu_{ik}^2 \qquad (10) $$

FPC takes values in the $[0,1]$ range, and the maximum value indicates the best clustering quality. Another popular index for measuring fuzzy clustering quality was proposed by Xie and Beni (XB) in 1991 [9]. It focuses on two properties, cluster compactness and separation:

$$ XB = \frac{\sum_{i=1}^{c} \sum_{k=1}^{N} \mu_{ik}^{m} \, \lVert x_k - v_i \rVert^2}{N \cdot \min_{i \neq j} \lVert v_i - v_j \rVert^2} \qquad (11) $$

The numerator quantifies the compactness of the fuzzy clusters, and the denominator quantifies the strength of the separation between those clusters.
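As an illustration of equations (1) through (3) and (10), the following short sketch (ours, not from the paper; plain NumPy, with random toy data standing in for word vectors) implements the FCM update loop and computes the fuzzy partition coefficient. The fuzzifier m and the convergence threshold are the tunable parameters discussed above; the stopping rule on the change in J is one common choice of threshold test.

```python
import numpy as np

def fcm(x, n_clusters, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy C-means: returns cluster centers and membership grades."""
    rng = np.random.default_rng(seed)
    n_points = len(x)
    # Step 1: random memberships, normalized so each row sums to 1 (eq. 3 constraint).
    u = rng.random((n_points, n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    j_prev = np.inf
    for _ in range(max_iter):
        um = u ** m
        # Step 2: centers as membership-weighted means (eq. 2).
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Step 3: membership update (eq. 3).
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)
        # Steps 4-5: evaluate the objective (eq. 1), stop when the change is small.
        j = np.sum((u ** m) * dist ** 2)
        if abs(j_prev - j) < tol:
            break
        j_prev = j
    return centers, u

def fpc(u):
    """Fuzzy partition coefficient (eq. 10): mean of squared memberships."""
    return np.sum(u ** 2) / u.shape[0]

# Toy stand-in for 100 word vectors of dimension 50.
x = np.random.default_rng(1).normal(size=(100, 50))
centers, u = fcm(x, n_clusters=5, m=2.0)
print("FPC:", fpc(u))
```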
If a range of numbers of clusters $\{k_1, k_2, \dots, k_i\}$ is examined, the $k_i$ minimizing this index is the optimal number of clusters for the dataset.

IV. GloVe VECTORS

One of the best-known unsupervised learning algorithms for producing vector representations of words is GloVe. It is based on word-word co-occurrence in text corpora. The name stands for Global Vectors, as the representation is able to capture global corpus statistics.

A. Overview of GloVe

Let us start by defining the word-word co-occurrence count matrix $X$, where $X_{ij}$ is the number of times the word $j$ occurs in the context of the word $i$. Let us denote by $X_i = \sum_k X_{ik}$ the number of times any word appears in the context of the word $i$. Lastly, let $P_{ij} = P(j|i) = X_{ij}/X_i$ be the probability that the word $j$ occurs in the context of the word $i$. Using a simple example, we demonstrate how aspects of meaning can be extracted from word-word co-occurrence statistics. Pennington et al. [10] show this with good examples. Assuming we have a text corpus related to thermodynamics, we may take the words $i = ice$ and $j = steam$. We investigate the relationship between these words by examining their co-occurrence probabilities with other sample words. For instance, taking the word $ice$ and a probe word $t = solid$, we can expect the ratio $P_{it}/P_{jt}$ to be large. Likewise, if we select a probe word $t$ that is related to steam but not to ice, such as $t = gas$, then we expect the value of the ratio to be small. Global vectors leverage a series of functions, called $F$, that represent those ratios [10][11]; a toy numerical sketch is given at the end of this subsection.

The training process produces two sets of word vectors, $W$ and $\tilde{W}$; one way to handle these two vectors is to sum them and assign the sum as the unique representation of a word:

$$ W_{final} = W + \tilde{W} \qquad (16) $$

Summing the two sets of vectors into one effectively represents words in the embedding space. The authors built a vocabulary of the 400,000 most frequent words and made the resulting embeddings publicly available.
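To make the ratio argument concrete, here is a minimal sketch (ours, not from [10]; the co-occurrence counts are invented purely for illustration) that computes the $P_{it}/P_{jt}$ ratios from a toy count matrix and applies the summation of equation (16) to two toy vector sets:

```python
import numpy as np

words = ["ice", "steam", "solid", "gas", "water"]
idx = {w: k for k, w in enumerate(words)}

# Toy co-occurrence counts (invented numbers for illustration):
# X[i, j] = how often word j appears in the context of word i.
X = np.array([
    [0,  2, 19,  1, 30],   # contexts of "ice"
    [2,  0,  1, 22, 31],   # contexts of "steam"
    [19, 1,  0,  1,  4],
    [1, 22,  1,  0,  5],
    [30, 31, 4,  5,  0],
])

P = X / X.sum(axis=1, keepdims=True)   # P[i, j] = P(j|i) = X_ij / X_i

i, j = idx["ice"], idx["steam"]
for t in ("solid", "gas", "water"):
    # "solid" yields a large ratio, "gas" a small one, and a word
    # related to both ("water") a ratio close to 1.
    ratio = P[i, idx[t]] / P[j, idx[t]]
    print(f"P(ice->{t}) / P(steam->{t}) = {ratio:.2f}")

# Equation (16): combine the two trained vector sets by summation.
rng = np.random.default_rng(0)
W, W_tilde = rng.normal(size=(5, 50)), rng.normal(size=(5, 50))  # toy stand-ins
W_final = W + W_tilde
```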