Topology and Word Spaces
Mapper and Betti 0 Barcodes Applied to Random Indexing Word-Spaces - a First Survey

DAVID NILSSON
ARIEL EKGREN

Bachelor’s Thesis at CSC
Supervisors: Hedvig Kjellström, Jussi Karlgren, Mikael Vejdemo-Johansson
Examiner: Mårten Olsson

Abstract

This paper introduces analytic methods for linguistic data represented as word-spaces constructed with the random indexing model. It presents two methods: a visualisation method derived from an algorithm called Mapper, and a word-space property measure derived from Betti numbers. The methods are explained and then implemented in order to demonstrate their behaviour on a smaller set of linguistic data. The implementations are intended as a foundation for future research.

Contents

1 Introduction
1.1 Problem
1.2 Thesis
1.3 Goal
1.4 Background
1.4.1 Computational Linguistics
1.4.2 Topological data analysis
1.4.3 Topology and Algebraic Topology
1.4.4 Mapper
2 The Topology of Text
2.1 Data Extraction
2.2 Barcodes
2.2.1 The Algorithm
2.3 Mapper
3 Experimental Results
3.1 Barcodes
3.2 Mapper
4 Conclusions
4.1 Discussion
4.1.1 Betti 0
4.1.2 Mapper
4.2 Summary
4.3 Future Work
4.3.1 Barcodes
4.3.2 Mapper
Bibliography

1. Introduction

Analysis of vast amounts of high dimensional data is a growing field in both science and industry and is often referred to as Big Data analytics. Information generated by enterprises, social media, the Internet of Things and many other entities is increasing in volume and detail and will fuel exponential growth in data for the foreseeable future [14]. The data analytic tools in use today are not always suited for vast amounts of high dimensional data. Thus there exists a demand for novel analytical methods.

Computational linguistics is one of the areas of research faced with the task of handling the ever-growing streams of data. Automated and scalable methods for analysing dynamic data are of high interest in a world where newspapers, blogs, Facebook, Twitter and the rest of the internet generate increasingly large amounts of text every second. In collaboration with the Swedish company Gavagai AB, which works with high performance dynamic text analytics, a set of interesting linguistic data analysis problems was defined.

This paper focuses on the analysis of a certain type of linguistic data: mathematical representations of words from a large source of text and their surrounding contexts. The tools already available for analysis and visualisation of this type of data are limited, mainly due to the amount and high dimensionality of the data. Thus we turn to novel approaches of analysis and the field of topology. We hope that our work will contribute to further understanding of the intersection between topology and linguistics and inspire future research on the subject.

1.1 Problem

One way that Gavagai processes its linguistic data is through special high dimensional vector spaces. Distinguishing changes in the vector spaces as new data is processed and determining classes of subsets of the data in the vector spaces are two interesting problems from a data analytic, linguistic and semantic point of view. Questions of interest such as “Is this data set generated from this or that source?” or “What type of distinguishing features does this data set contain?” are hard to answer because of the size and high dimensionality of the data. The tools available today include manually picking out subsets of interest and keeping track of them as the data changes over time, and more or less manually examining relations between data points in the vector spaces.
Identifying subtle changes in the signals and hidden features of this type of data is an open problem.

1.2 Thesis

We believe that it is possible to classify linguistic vector space data by applying ideas from topological data analysis (TDA). We think that by using TDA algorithms and concepts it will be possible to develop diagnostic tools and to visualise high dimensional data in ways that show relevant features and properties and reveal interesting subsets of linguistic vector space data. We think that two key concepts will be applicable: Betti 0 Barcodes and the Mapper algorithm [4] [21].

1.3 Goal

The aim of this paper is to introduce two data analytic methodologies and apply them to computational linguistic data. The scope of this thesis is not to develop a complete and rigorous method, but rather to bridge work done in topological data analysis [4] to computational linguistics [11]. The two methodologies we aim to introduce and implement are Betti 0 Barcodes, a global measure of connectedness, and the Mapper algorithm, a visualisation attached to point cloud data that allows for a qualitative understanding through direct visualisation. We wish to apply these to a specific type of high dimensional vector space, a Random Indexing (RI) word-space.

1.4 Background

Our work combines ideas from two different academic disciplines, computational linguistics and topological data analysis, in order to perform topological data analysis on computational linguistic data sets. We start by covering some basic and some not so basic concepts in a lighter manner. For a more thorough explanation of the concepts and subjects we direct the reader to our references.

The idea of applying topological data analysis to computational linguistics arose from the need to find coordinate invariant properties in semantic word-spaces.
A novel approach to the problem was needed in order to identify and compare both global and small scale features in a high dimensional vector space with a large number of data points.

1.4.1 Computational Linguistics

Computational linguistics is the academic discipline where language is examined with the aid of computers, statistics and mathematics [8]. One way to study text is to use a word-space model, a linguistic model used to make mathematical representations of written language. Word-spaces are often high-dimensional vector spaces where words are represented as points. One important property of these vector spaces is that words with similar meaning are located closer to each other than words with no similarity at all.

Word-spaces and Random Indexing

There are different ways to construct a word-space, but the one referred to throughout this paper is the Random Indexing approach. The following explanation of RI word-spaces is prerequisite knowledge for later sections of the paper. For a more detailed theory of the word-space model and RI we refer to [19].

A corpus $\mathcal{T}$ is a set of sequences of words. Given a corpus $\mathcal{T}$ we can define

\[ W = \{\, w : w \in T \text{ for some } T \in \mathcal{T} \,\} \]

\[ C_n = \{\, (w_1, \ldots, w_n) : (w_1, \ldots, w_n) \text{ is a subsequence of some } T \in \mathcal{T} \,\} \]

We then define a function $f$ with domain $W$ and co-domain the $N$-dimensional vector space $K^N$. The function $f$ maps each word to a unique index vector; when the vector is created, its entries are randomly generated as described in [19].

\[ f : W \to K^N, \qquad w \mapsto v_{\mathrm{index}} \]

We then define a function $g$ such that $g(w)((w_1, w_2, \ldots, w_n))$ returns the number of times the context $(w_1, w_2, \ldots, w_n)$ occurs as a context centered on $w$.

\[ g : W \to \mathrm{Hom}(C_n, \mathbb{N}) \]

Lastly, we define a function $h$ that maps each word $w$ to a context vector, which can be described as the sum over all the word’s different contexts.

\[ h : W \to K^N, \qquad w \mapsto \sum_{c \in C_n} \sum_{w' \in c} g(w)(c) \, f(w') \]

Finally we have obtained the set of context vectors $\{\, h(w) : w \in W \,\}$ derived from $\mathcal{T}$, our Random Indexing word-space.

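The construction of f, g and h can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the thesis: the index vectors are assumed to be sparse ternary vectors (a few entries set to +1 or -1, the rest 0), contexts are assumed to be contiguous windows of odd length n, and the names `make_index_vector`, `centered_contexts` and `build_word_space`, as well as the tiny dimension and the toy corpus, are hypothetical. As in the formula for h, the inner sum runs over every word in the window, including the centre word.

```python
import random
from collections import Counter


def make_index_vector(dim, nonzeros, rng):
    """Assumed form of an index vector: `nonzeros` entries are +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def centered_contexts(sentence, n):
    """Yield (centre word, window) for every contiguous window of odd length n."""
    half = n // 2
    for i in range(half, len(sentence) - half):
        yield sentence[i], tuple(sentence[i - half:i + half + 1])


def build_word_space(corpus, dim=16, nonzeros=4, n=3, seed=0):
    rng = random.Random(seed)
    # W: every word that occurs in some sequence of the corpus.
    vocab = sorted({w for sentence in corpus for w in sentence})
    # f: word -> random index vector.
    f = {w: make_index_vector(dim, nonzeros, rng) for w in vocab}
    # g: word -> (context -> number of times that context is centered on the word).
    g = {w: Counter() for w in vocab}
    for sentence in corpus:
        for w, c in centered_contexts(sentence, n):
            g[w][c] += 1
    # h: word -> sum over contexts c and words w' in c of g(w)(c) * f(w').
    h = {}
    for w in vocab:
        context_vector = [0] * dim
        for c, count in g[w].items():
            for other in c:
                for k, value in enumerate(f[other]):
                    context_vector[k] += count * value
        h[w] = context_vector
    return f, g, h


# Toy corpus: two sequences of three words each.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
f, g, h = build_word_space(corpus)
```

In this toy corpus only "cat" and "dog" ever sit at the centre of a window, so words such as "the" keep an all-zero context vector; a real corpus would use far longer sequences and a much larger dimension.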
1.4.2 Topological data analysis

In the past ten years the field of topology has been developed to fit the needs of data analysis and has been applied in widely differing fields, from identification of subgroups of breast cancer patients [16] to computer vision recognition [5]. Topology is well suited for data analysis in the sense that it allows us to look past disturbance in the form of coordinate dependence and instead distinguish qualitative features of the data.

1.4.3 Topology and Algebraic Topology

Topology is the branch of mathematics concerned with the general study of continuity and closeness in topological spaces. A topological space is the most general notion of a mathematical “space”; it consists of a set of points, along with a set of neighbourhoods for each point, that satisfy a set of axioms relating points and neighbourhoods.

A key concept in topology is the similarity property called homeomorphism. A homeomorphism is a one-to-one continuous mapping of one topological space onto another whose inverse is also continuous. If there exists a homeomorphism between two spaces, A and B, they are said to be homeomorphic (equivalent in a topological sense) [20]. In order for two topological spaces to be homeomorphic they have to fulfil constraints that can be complicated to validate. An easier approach is via algebraic topology. In algebraic topology we have defined homotopy and homology, which give notions of equivalence similar to homeomorphism but less strict. To compute homology, linear algebra is used, and it is therefore preferable to homotopy when working with large amounts of data [10].

To begin the topological examination of point cloud data, the data first has to be converted to a topological space.
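
The remark above that homology is computed with linear algebra can be illustrated in the simplest case, Betti 0, which counts connected components. For a graph (a space built from vertices and edges) Betti 0 equals the number of vertices minus the rank of the boundary matrix that sends each edge to its two endpoints. The sketch below is an illustration only, not the algorithm of this thesis: it assumes the graph is already given, works with coefficients in the field with two elements, and uses the hypothetical helper names `rank_gf2` and `betti_0`.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = row
                break
            # Cancel the leading bit with the stored pivot row and keep reducing.
            row ^= pivots[lead]
    return len(pivots)


def betti_0(num_vertices, edges):
    """Betti 0 = number of vertices - rank of the edge-to-vertex boundary matrix."""
    # Over GF(2) the boundary of an edge {u, v} is u + v: bits u and v are set.
    boundary_rows = [(1 << u) ^ (1 << v) for u, v in edges]
    return num_vertices - rank_gf2(boundary_rows)


# A triangle on vertices 0, 1, 2 and a separate edge between vertices 3 and 4.
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
components = betti_0(5, example_edges)
```

Here `components` counts the two connected pieces of the example graph, and a graph with no edges at all has as many components as vertices.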