Topology and Word Spaces

Mapper and Betti 0 Barcodes Applied to Random Indexing Word-Spaces - a First Survey

DAVID NILSSON ARIEL EKGREN

Bachelor’s Thesis at CSC Supervisors: Hedvig Kjellström Jussi Karlgren Mikael Vejdemo-Johansson Examiner: Mårten Olsson

Abstract

This paper will introduce analytic methods for linguistic data that is represented in the form of word-spaces constructed from the random indexing model. The paper will present two different methods: a visualisation method derived from an algorithm called Mapper, and a word-space property measure derived from Betti numbers. The methods will be explained and thereafter implemented in order to demonstrate their behaviour with a smaller set of linguistic data. The implementations will constitute a foundation for future research.

Contents

1 Introduction
  1.1 Problem
  1.2 Thesis
  1.3 Goal
  1.4 Background
    1.4.1 Computational Linguistics
    1.4.2 Topological data analysis
    1.4.3 Topology and Algebraic Topology
    1.4.4 Mapper

2 The Topology of Text
  2.1 Data Extraction
  2.2 Barcodes
    2.2.1 The Algorithm
  2.3 Mapper

3 Experimental Results
  3.1 Barcodes
  3.2 Mapper

4 Conclusions
  4.1 Discussion
    4.1.1 Betti 0
    4.1.2 Mapper
  4.2 Summary
  4.3 Future Work
    4.3.1 Barcodes
    4.3.2 Mapper

Bibliography

1. Introduction

Analysis of vast amounts of high-dimensional data is a growing field in both science and industry and is often referred to as Big Data analytics. Information generated by enterprises, social media, the Internet of Things and many other entities is increasing in volume and detail and will fuel exponential growth in data for the foreseeable future [14]. The data analytic tools in use today are not always suited for vast amounts of high-dimensional data. Thus there exists a demand for novel analytical methods.

Computational linguistics is one of the areas of research faced with the task of handling these ever-growing streams of data. Automated and scalable methods for analysing dynamic data are of high interest in a world where newspapers, blogs, Facebook, Twitter and the rest of the internet generate increasingly large amounts of text every second. In collaboration with the Swedish company Gavagai AB, which works with high-performance dynamic text analytics, a set of interesting linguistic data analysis problems was defined.

This paper will focus on the analysis of a certain type of linguistic data: mathematical representations of words from a large source of text and their surrounding contexts. The tools already available for analysing and visualising this type of data are limited, mainly due to the amount and high dimensionality of the data. Thus we turn to novel approaches of analysis and the field of topology. Our work will hopefully contribute to further understanding of the intersection between topology and linguistics and inspire future research on the subject.

1.1 Problem

One way that Gavagai processes its linguistic data is through special high-dimensional vector spaces. Distinguishing changes in the vector spaces as new data is processed, and determining classes of subsets of the data in the vector spaces, are two interesting problems from a data analytic, linguistic and semantic point of view.


Questions of interest such as “Is this data set generated from this or that source?” or “What type of distinguishing features does this data set contain?” are hard to answer because of the size and high dimensionality of the data. The tools of today include manually picking out subsets of interest and keeping track of them as the data changes over time, and more or less manually examining relations between data points in the vector spaces. Identifying subtle changes in the signals and hidden features of this type of data is an open problem.

1.2 Thesis

We believe that it is possible to classify linguistic vector space data by applying ideas from topological data analysis (TDA). We think that by using TDA algorithms and concepts it will be possible to develop diagnostic tools and to visualise high-dimensional data in ways that show relevant features and properties, as well as revealing interesting subsets of linguistic vector space data. We think that two key concepts will be applicable: Betti 0 Barcodes and the Mapper algorithm [4, 21].

1.3 Goal

The aim of this paper is to introduce two data analytic methodologies and apply them to computational linguistic data. The scope of this thesis is not to develop a complete and rigorous method, but rather to bridge work done in topological data analysis [4] to computational linguistics [11]. The two methodologies we aim to introduce and implement are a global measure of connectedness, and a visualisation attached to point cloud data which allows for a qualitative understanding through direct visualisation: Betti 0 Barcodes and the Mapper algorithm. We wish to apply these to a specific type of high-dimensional vector space, a Random Indexing (RI) word-space.

1.4 Background

Our work consists of combining ideas from two different academic disciplines, computational linguistics and topological data analysis, in order to perform topological data analysis on computational linguistic data sets. We will start by covering some basic and some not so basic concepts in a lighter manner. For a more thorough explanation of the concepts and subjects we direct the reader to our references.

The idea of applying topological data analysis to computational linguistics arose from the need to find coordinate invariant properties in semantic word-spaces. A novel approach to the problem was needed in order to identify and compare both global and small scale features in a high dimensional vector space with a lot of data points.


1.4.1 Computational Linguistics

Computational linguistics is the academic discipline where language is examined with the aid of computers, statistics and mathematics [8]. One way to study text is to use a word-space model, which is a linguistic model used to make mathematical representations of written language. Word-spaces are often high-dimensional vector spaces where words are represented as points. One important property of these vector spaces is that words with similar meaning are located closer to each other than words with no similarity at all.

Word-spaces and Random Indexing

There are different ways to construct a word-space, but the one referred to through- out this paper is the Random Indexing approach. The following explanation of RI word-spaces will be prerequisite knowledge to understand later sections of the paper. For a more detailed theory of the word-space model and RI we refer to [19].

A corpus 𝕋 is a set of word sequences T. Given a corpus 𝕋 we can define

W = {w : w ∈ T, T ∈ 𝕋}

Cₙ = {(w₁, …, wₙ) : (w₁, …, wₙ) is a subsequence of T, T ∈ 𝕋}

We then define a function f with domain W and co-domain the N-dimensional vector space Kᴺ. f maps a unique index vector to each word; when the vector is created, its entries are randomly generated as described in [19]:

f : W → Kᴺ,  w ↦ v_index

We then define a function g such that g(w)((w₁, …, wₙ)) returns the number of times the context (w₁, …, wₙ) occurs centred on w:

g : W → Hom(Cₙ, ℕ)

Lastly, we define a function h that maps each word w to a context vector, which can be described as the weighted sum of all the word’s different contexts:

h : W → Kᴺ,  w ↦ ∑_{c ∈ Cₙ} g(w)(c) · ∑_{w′ ∈ c} f(w′)

Finally we have obtained the set of context vectors {h(w) : w ∈ W} derived from 𝕋: our Random Indexing word-space.
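The construction above can be condensed into a short Python sketch. This is our own illustration, not Gavagai's implementation: the dimensionality, the number of nonzero entries and the window size are illustrative parameter choices, and the index vectors use the sparse ±1 scheme described in [19].

```python
import numpy as np

def random_indexing(corpus, dim=512, nonzero=8, window=2, seed=0):
    """Toy Random Indexing sketch (parameter names are ours).

    corpus: a list of token lists. Each word w gets a sparse random
    index vector f(w); its context vector h(w) is the sum of the index
    vectors of the words seen within `window` positions of w.
    """
    rng = np.random.default_rng(seed)
    vocab = {w for text in corpus for w in text}
    # f : W -> K^N, a unique random index vector per word
    index = {}
    for w in vocab:
        v = np.zeros(dim)
        pos = rng.choice(dim, size=nonzero, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=nonzero)
        index[w] = v
    # h : W -> K^N, summing index vectors over all observed contexts
    context = {w: np.zeros(dim) for w in vocab}
    for text in corpus:
        for i, w in enumerate(text):
            lo, hi = max(0, i - window), min(len(text), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[w] += index[text[j]]
    return context

vecs = random_indexing([["the", "cat", "sat"], ["the", "dog", "sat"]])
```

In this toy corpus “cat” and “dog” occur in identical contexts, so their context vectors coincide, illustrating the proximity property of word-spaces.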


1.4.2 Topological data analysis

In the past ten years the field of topology has been developed to fit the needs of data analysis and has been applied in widely differing fields, ranging from the identification of subgroups of breast cancer patients [16] to computer vision [5]. Topology is well suited for data analysis in the sense that it allows us to look past disturbances in the form of coordinate dependence and instead distinguish qualitative features of the data.

1.4.3 Topology and Algebraic Topology

Topology is the branch of mathematics concerned with the general study of continuity and closeness of topological spaces. A topological space is the most general notion of a mathematical “space”: it consists of a set of points, along with a set of neighbourhoods for each point, that satisfy a set of axioms relating points and neighbourhoods. A key concept in topology is the similarity property called homeomorphism. A homeomorphism is a one-to-one continuous mapping of one topological space onto another whose inverse is also continuous. If there exists a homeomorphism between two spaces A and B, they are said to be homeomorphic (equivalent in a topological sense) [20].

In order for two topological spaces to be homeomorphic they have to fulfil constraints that can be complicated to validate. An easier approach is via algebraic topology, where homotopy and homology are defined; these are similar to homeomorphism but less strict. Homology can be computed with linear algebra, and it is therefore preferable to homotopy when working with large amounts of data [10].

To begin the topological examination of point cloud data, the data has to be converted into a topological space. This is achieved by creating simplicial complexes from the data.

Simplicial Complexes

One can think of simplicial complexes as triangular structures connected under certain constraints. In the word-space case, these triangular shapes are built by letting edges connect points in the point cloud under the constraints defined in [15]. A simplex is the generalisation of a tetrahedral region of space to n dimensions: a 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, a 3-simplex a tetrahedron, and so on. A lot more can be said about the simplex, but for our general and brief explanation of the simplicial complex this will suffice. The following definition of a simplicial complex is cited from [15].

Definition. A simplicial complex is a finite collection of simplices K such that


σ ∈ K and τ ≤ σ implies τ ∈ K, and
σᵢ, σⱼ ∈ K implies σᵢ ∩ σⱼ is either empty or a face of both.

A practical method of dividing the point cloud data into simplicial complexes is to use the Vietoris-Rips complex. The following definition of the Rips complex is cited from [7].

Definition. Given a collection of points {x_α} in Euclidean space Eⁿ, the Rips complex, R_ε, is the abstract simplicial complex whose k-simplices are determined by unordered (k + 1)-tuples of points {x_α}₀ᵏ which are pairwise within distance ε.
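As an illustration of the definition, the 1-skeleton of the Rips complex (its vertices and edges) can be read off directly from the pairwise distances. The sketch below is ours, for illustration only, and is not code used in the thesis:

```python
from scipy.spatial.distance import pdist, squareform

def rips_edges(points, eps):
    """1-skeleton of the Rips complex R_eps: an edge joins every pair
    of points within distance eps of each other (illustrative sketch)."""
    d = squareform(pdist(points))     # condensed -> square distance matrix
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if d[i, j] <= eps]
```

Higher simplices follow the same rule: a k-simplex is present exactly when all its pairwise edges are.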

Betti Numbers

Betti numbers are a way to describe connectivity in a topological space and are often denoted βₖ. Informally speaking, the k-th Betti number describes how many independent k-dimensional surfaces a space contains, and homotopy equivalent spaces have the same Betti numbers [4]. This thesis focuses only on the β₀ number, which counts the connected components of the points.
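A minimal way to see β₀ in code is to count the connected components of the graph that joins points lying within a scale ε of each other. This sketch is our own illustration (not the thesis implementation), using a small union-find structure:

```python
from scipy.spatial.distance import pdist, squareform

def betti0_at_scale(points, eps):
    """beta_0 at scale eps: the number of connected components of the
    graph joining points at distance <= eps (illustrative sketch)."""
    d = squareform(pdist(points))
    n = len(points)
    parent = list(range(n))
    def find(a):                      # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if d[i, j] <= eps:
                parent[find(i)] = find(j)   # merge the two components
    return len({find(i) for i in range(n)})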

Barcodes

A barcode is a parametrised version of the Betti numbers. As Robert Ghrist points out in his 2008 survey of research on barcodes, we can motivate the use of barcodes and partially explain them in the following way [7]:

1. It is beneficial to replace a set of data points with a family of simplicial complexes, indexed by a proximity parameter. This converts the data set into global topological objects.

2. It is beneficial to view these topological complexes through the lens of algebraic topology, specifically, via a novel theory of persistent homology adapted to parameterized families.

3. It is beneficial to encode the persistent homology of a data set in the form of a parameterized version of a Betti number, a barcode.

Example. A barcode example from artificially generated point cloud data. The data consists of 30 points in three clusters randomly generated with centres at [x, y] coordinates [−5, −5], [−5, 10] and [5, −5] and a standard deviation of 0.5. The points can be seen in a scatter plot, together with a Betti 0 barcode showing how they connect.
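The example can be reproduced in miniature with scipy: for β₀, the merge heights of a single-linkage clustering are exactly the finite bar deaths. The snippet below is an illustrative reconstruction under our own random seed, not the code that produced the figures:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three Gaussian clusters of 10 points each, like the example above.
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0], [-5.0, 10.0], [5.0, -5.0]])
pts = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# For beta_0, every bar is born at scale 0 and dies when its component
# merges into another; the single-linkage merge heights are exactly the
# finite bar deaths (one bar lives forever).
deaths = np.sort(linkage(pts, method="single")[:, 2])
long_bars = int((deaths > 2.0).sum())   # bars surviving past scale 2
```

With 30 points there are 29 finite bars; the two longest correspond to the two cluster merges, mirroring how three clusters show up as two long bars plus the omitted infinite one.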


Figure 1.1: Point cloud data
Figure 1.2: Barcode data

Figure 1.3: Barcode example with artificially generated 2d point cloud data

1.4.4 Mapper

Mapper, first described in [21], is a method for constructing combinatorial representations of geometric information about high-dimensional point cloud data. By filtering and clustering the data set, one obtains a different representation of the data than by acting on it directly. The method bears similarities to density clustering trees, disconnectivity graphs and Reeb graphs, but is a more generalised approach.

The method consists of a number of steps. Given a point cloud X with N points x ∈ X:

1. We start with a function f : X → ℝ whose value is known for the N data points. We call this function a filter. The function should convey some interesting geometric, or otherwise relevant, properties of the data.

2. Citing from [21]: “Finding the range (I) of the filter f restricted to the set X and creating a cover of X by dividing I into a set of smaller intervals (S) which overlap. This gives us two parameters which can be used to control resolution namely the length of the smaller intervals (l) and the percentage overlap between successive intervals (p).”

3. Citing from [21]: “Now, for each interval Iⱼ ∈ S, we find the set Xⱼ = {x | f(x) ∈ Iⱼ} of points which form its domain. The set {Xⱼ} forms a cover of X, and X ⊆ ⋃ⱼ Xⱼ.”

4. Choosing a metric d(·, ·) to get the set of all interpoint distances Dj = {d(xa, xb)|xa, xb ∈ Xj}


5. For each Xj together with the set of distances Dj we find clusters {Xjk}

6. Each cluster then becomes a vertex in our complex, and an edge is created between vertices if Xⱼₖ ∩ Xₗₘ ≠ ∅, meaning that two clusters share a common point.

Example. A Mapper example from artificially generated point cloud data. The data consists of 5000 points randomly generated from a Gaussian distribution surrounding three centroids at [x, y] coordinates [10, 20], [−10, −10] and [17, −10], with a standard deviation of 9. The filter function f was chosen to be a Gaussian kernel density estimation, and the Mapper parameters were set to 7 intervals and an overlap of 10 percent. See Figure 1.4.

Figure 1.4: Mapper example with artificially generated 2D point cloud data with corre- sponding mapper output
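The six steps above can be condensed into a short sketch. This is our minimal reading of the procedure, not the reference implementation from [21]; the parameter names and defaults are illustrative:

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def mapper(X, filt, n_intervals=7, overlap=0.1, eps=0.5, min_samples=1):
    """Minimal Mapper sketch following steps 1-6 above."""
    f = np.asarray(filt, dtype=float)
    lo, length = f.min(), (f.max() - f.min()) / n_intervals
    G = nx.Graph()
    membership = {}                       # vertex id -> set of point indices
    for k in range(n_intervals):          # step 2: overlapping intervals I_j
        a = lo + k * length - overlap * length
        b = lo + (k + 1) * length + overlap * length
        idx = np.where((f >= a) & (f <= b))[0]   # step 3: the domain X_j
        if len(idx) == 0:
            continue
        # steps 4-5: cluster the points of X_j
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X[idx]).labels_
        for lab in set(labels):
            if lab == -1:                 # skip DBSCAN noise points
                continue
            vid = len(membership)         # step 6: one vertex per cluster
            membership[vid] = set(idx[labels == lab].tolist())
            G.add_node(vid)
    verts = list(membership)              # step 6: edges on shared points
    for i in range(len(verts)):
        for j in range(i + 1, len(verts)):
            if membership[verts[i]] & membership[verts[j]]:
                G.add_edge(verts[i], verts[j])
    return G
```

On two well-separated groups with the coordinate itself as filter, the output graph has one vertex per group and no connecting edge, which is the behaviour the example above relies on.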


2. The Topology of Text

The data extraction section below will describe the linguistic data followed by an explanation and description of the implementation of our Betti 0 barcode algorithm and the implementation of the Mapper algorithm.

2.1 Data Extraction

The data in the study were RI word-spaces constructed from three different corpora:

• British National Corpus, short BNC. 82 070 645 words [6].

• Touchstone Applied Science Associates, Inc., short TASA. 10 861 774 words [22].

• Reuters Corpus, Volume 1, short Reuters. 200 144 390 words [18].

Subsets of words were then extracted from these word-spaces, using word lists, in order to get a more manageable set of words that would work as a representation of each word-space. Here follows a short explanation of these word lists; for more detailed information see the references.

• Swadesh is a classic compilation of basic concepts for the purposes of historical-comparative linguistics.

• Abstract terms is a list of terms that many academic writing guides advise against using in academic reports because they are considered too abstract. The list is compiled by Karlgren [12].


2.2 Barcodes

Despite existing libraries for the calculation of simplicial complexes and barcodes, the following barcode algorithm was developed from scratch and implemented in Python. The Dionysus C++ library for computing persistent homology [1] was considered but omitted, due to its installation dependencies and the project’s need for cross-platform compatibility (Windows 7 and Mac OS X 10.7).

Since the algorithm only derives Betti 0 barcodes, some underlying mathematical concepts could be disregarded, giving rise to simplifications that would not be possible for calculations of Betti 1 barcodes or higher.

The algorithm will first be described in plain text followed by a pseudocode descrip- tion. The barcode plots can be viewed in the Results section.

All the barcodes contain one bar that goes to infinity, from when all points have coalesced. All the barcode plots presented in this paper are missing the infinite bar.

2.2.1 The Algorithm

After a set of word vectors was loaded into a Numpy array, the pairwise distances were calculated and stored in a pairwise distance matrix M, where Mᵢⱼ is the distance between vectors i and j.

Each row in the distance matrix was then examined in order to determine the closest neighbour j of each vector i. A bar bᵢ in the barcode was then generated by letting the barcode value be the distance between i and j and, by convention, letting the higher-indexed vector “die”. Each distance in the lower-indexed vector’s row was then compared to the corresponding distance in the other vector’s row, substituting larger values with smaller values. In this way the distance information in the higher-indexed vector was transferred to the lower-indexed vector. This operation was looped until every index except one was “dead”. The only surviving index got the value infinity.

After the iteration was done, every bar bᵢ ∈ β₀ had a value 0 < bᵢ ≤ ∞. In order to visualise the bars they were plotted in bar plots, Figures 3.1–3.6. Algorithm 1 shows in pseudocode how the algorithm was implemented.
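The merge procedure described above amounts to single-linkage agglomeration on the distance matrix. The following sketch is our compact reconstruction of that idea, not the thesis code itself:

```python
import numpy as np

def betti0_barcode(dist):
    """Betti 0 bars from a pairwise distance matrix: repeatedly take
    the closest pair of living indices, record a bar, and fold the
    higher index into the lower one (single-linkage agglomeration)."""
    D = np.array(dist, dtype=float)
    np.fill_diagonal(D, np.inf)
    alive = set(range(len(D)))
    bars = []
    while len(alive) > 1:
        idx = sorted(alive)
        sub = D[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = sorted((idx[a], idx[b]))         # the higher index j "dies"
        bars.append(D[i, j])
        D[i, :] = np.minimum(D[i, :], D[j, :])  # transfer j's distances to i
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        alive.remove(j)
    bars.append(np.inf)                         # the single surviving bar
    return sorted(bars)
```

For three points on a line at 0, 1 and 3, the finite bars die at scales 1 and 2, matching the single-linkage merge heights.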

2.3 Mapper

The implementation of Mapper was done in Python, utilising the already available packages Numpy, Scipy, Modular toolkit for Data Processing (MDP) [23], scikit-learn [17], NetworkX [9] and matplotlib. The final visualisations were done in Gephi [3], a graph visualisation software.


Algorithm 1 Calculate Barcodes
  distance_matrix = pairwise_distances(vectors)
  for i in distance_matrix do
      j = i.mindex()
      if j.mindex() == i.mindex() then
          bl = i.min()
          if j < i and value in jk < value in ik then
              β0j = bl
              replace values and kill j
          else if j > i and value in jk > value in ik then
              β0i = bl
              replace values and kill i
          end if
      end if
  end for
  return β0

Each of the nine sets of text, described in the Data Extraction section of this chapter, was examined with the authors’ implementation of Mapper.

The filter function f : X → ℝ was chosen to be the projection of the data points onto the first eigenvector of a singular value decomposition (SVD) based principal component analysis (PCA); we call this filter PCA1. There were two reasons for choosing PCA1: when examining new data, looking at the components with high variance might reveal interesting properties of the data, and the high dimensionality of the data does not allow for a more basic approach such as filtering by density. There are many interesting filter functions that could be used, but due to time constraints the examination of the data was restricted to PCA1 only.

The clustering algorithm was chosen to be DBSCAN. DBSCAN seemed like a good candidate for the Mapper implementation because it is a dynamic clustering algorithm in the sense that the number of clusters does not have to be pre-specified, the metric for similarity may be defined by the user, and it is readily available in the Python package scikit-learn [17]. For the choice of metric, cosine similarity was used: d(u, v) = 1 − (u · v)/(‖u‖₂‖v‖₂). The cosine metric was chosen because it projects the word-space points onto a unit sphere. This property is particularly well suited for our high-dimensional word-spaces due to the elimination of unwanted word-frequency effects.
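A small illustration of this clustering choice (on synthetic vectors, not a word-space): scikit-learn's DBSCAN accepts the cosine metric directly, and since the metric ignores vector length it behaves as if all points sat on the unit sphere, which is what removes the frequency effect:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two direction-based clusters. The second vector in each pair is
# much longer, but the cosine metric only sees the direction.
X = np.array([[1.0, 0.0],
              [9.0, 1.0],
              [0.0, 1.0],
              [1.0, 9.0]])
labels = DBSCAN(eps=0.3, min_samples=1, metric="cosine").fit(X).labels_
```

With a Euclidean metric the long vectors would be pushed far from the short ones; with cosine, each pair of nearly parallel vectors falls into one cluster.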

The adjustable parameters in the filter and clustering function were adjusted for TASA with the abstract terms filtering, Figure 3.13. These parameter settings were then used for all word-spaces, with the expectation that they would give rise to meaningful visualisations for all six plots. The Mapper plots, Figures 3.7–3.13, convey the opposite.

Adjustable mapper parameters:

• The number of segments Iⱼ to divide the range I into was set to 7.

• The percentage of overlap p between the segments was set to p = 0.4.

• The ε parameter of the DBSCAN clustering algorithm was set to ε = 0.51.

• The minimum number of points required to form a cluster by DBSCAN was set to 1.

After the Mapper algorithm was implemented, a comparison between the PCA1 projection and the word frequency was done and plotted for all three Swadesh subsets. This was done because PCA1 seemed to give a similar representation as if a word-frequency mapping had been used instead. The correlation in Figure 3.10 confirms this assumption for low-frequency words. The outliers in this correlation plot are only high-frequency words and are not presented, as there is just a very small fraction of them.
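The comparison can be sketched as follows on synthetic data; the function name pca1 is ours, and the "frequency" column here simply stands in for a word-frequency count:

```python
import numpy as np

def pca1(X):
    """Project the rows of X onto the first principal component via
    SVD, a sketch of the PCA1 filter described above."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

# Synthetic check in the spirit of Figure 3.10: if one coordinate
# (standing in for word frequency) dominates the variance, then the
# PCA1 value correlates almost perfectly with it.
freq = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([freq, 0.01 * np.array([1.0, -1.0, 1.0, -1.0, 1.0])])
r = np.corrcoef(pca1(X), freq)[0, 1]
```

The sign of the correlation is arbitrary (SVD determines the component only up to sign), so it is the magnitude of r that matters.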

3. Experimental Results

3.1 Barcodes

The following plots show the normed Betti 0 barcodes for texts from Reuters, BNC and TASA. The words are filtered with the Swadesh and abstract terms lists.

Swadesh Words

Figure 3.1: Swadesh in Reuters texts.
Figure 3.2: Swadesh in BNC texts.

Figure 3.3: Swadesh in TASA texts.


Abstract terms

Figure 3.4: Abstract terms in Reuters texts.
Figure 3.5: Abstract terms in BNC texts.

Figure 3.6: Abstract terms in TASA texts.

3.2 Mapper

The following plots are the result of applying the Mapper algorithm to the word-spaces with the Swadesh and abstract terms filters. The PCA1 word-frequency plot is also presented in this section.


Swadesh Words

Figure 3.7: Swadesh in Reuters texts.
Figure 3.8: Swadesh in BNC texts.

Figure 3.9: Swadesh in TASA texts.

Figure 3.10: Frequency vs PCA1 for the Swadesh subsets of TASA, Reuters and BNC.


Abstract Terms

Figure 3.11: Abstract terms in Reuters texts.
Figure 3.12: Abstract terms in BNC texts.

Figure 3.13: Abstract terms in TASA texts.

4. Conclusions

4.1 Discussion

4.1.1 Betti 0

The barcode plots show the connectedness of each word-space subset. A curvature can be seen by looking at the ends of the bars, and each barcode plot has a curvature that reflects the corresponding points’ relative positioning in the word-space. A possible conclusion is that the more curvature, the more separation between the points.

4.1.2 Mapper

We believe that the Mapper algorithm might be useful for representing the topological structure of word-spaces, finding interesting subspaces and displaying relations among topics. Since the combination of Mapper and RI word-spaces is previously uncharted territory it is, without further examination, hard to draw any general conclusions about the feasibility of the approach. The questions are many and the answers are few.

The filter function was chosen to be the projection of the data onto the first eigenvector of a principal component analysis. As shown in Figure 3.10, there is an almost linear correlation between the PCA projection and word frequency for “low” frequencies, which raises doubts about the semantic relevance of filtering by PCA. Further investigation into the properties of relevant filter functions for semantic word-spaces is thus needed.

The clustering algorithm DBSCAN is most likely not well suited for the Mapper algorithm. The positive features of DBSCAN are outweighed by the lack of control over the process and poor performance for high-dimensional clustering. Since the initial implementation was done with DBSCAN, the research had to be carried out with it, but for future research single linkage clustering is proposed as a more suitable alternative.


The cosine distance d(u, v) = 1 − (u · v)/(‖u‖₂‖v‖₂) was used for the distance matrix fed to the clustering algorithm. The cosine distance seems to be a good choice when working with RI word-spaces. With the filter, distance metric and clustering algorithm chosen, it was hard to fine-tune the parameters of the Mapper algorithm to convey the underlying structures of the data in a relevant way. The results are either large clusters connected one to one, or many small disconnected clusters. It may be that a different choice of metric and clustering algorithm would make it easier to hone in on favourable parameter choices. But looking at the barcode plots provided in the results section, it is also evident that there is no single parameter choice that will catch all of the underlying structure, since the curves are all quite smooth from the smallest to the largest barcode bar.

4.2 Summary

Even if no clear validation of Mapper and Betti 0 Barcodes applied to linguistic data could be made, the foundation for future work has been laid. The result of this thesis provides insight into the nature of topological data analysis and computational linguistics, and more specifically into Betti 0 Barcodes, Mapper and Random Indexing word-spaces. This paper can hopefully provide enough prerequisites in these fields for a continuation of this work.

4.3 Future Work

4.3.1 Barcodes

One direction is extracting features, in the machine learning sense, from the barcodes, in line with the work described in [2], and examining whether barcodes for text can act as features for recognising writing styles, authors, etc.

In order to confirm our assertion about Betti 0 barcodes, further tests are required. Many different types of word-spaces should be considered, but with known linguistic structure, in order to associate the barcode shape with the corresponding word-space. We then believe that a proper machine learning algorithm can be applied to these barcodes in order to find how different barcodes “should” look for texts of a certain type. One important tool that has to be considered is the filtering method. This work was restricted to Swadesh, abstract terms and food filtrations, but other filtrations might reveal new aspects that were not mentioned in this thesis.

As for the Betti 0 algorithm described in Chapter 2, it possesses one weakness that makes calculations for larger data sets time-consuming, namely the pairwise distance calculation: for each point, the distance to all other points is calculated. This method was sufficient for this particular thesis because of the size limitation of the test sets, but a better alternative is needed for larger data sets. The other aspect of this algorithm is the fact that many simplifications were made due to the limited interest in Betti numbers higher than Betti 0. This implies that for Betti numbers higher than Betti 0, one might need to extend the code with a much broader foundation in algebraic topology theory.

4.3.2 Mapper

There are many possible directions for future work on applying the Mapper algorithm to Random Indexing word-spaces, or word-spaces in general; we list a few crucial ones for determining whether Mapper and word-spaces are a fruitful combination.

• Finding novel filter functions: finding semantically relevant filter functions might prove both challenging and extremely useful for displaying the true topology of word-spaces and, by extension, the topology of language. An interesting approach, which was not tested, is to project the word-space points onto the first two or three PCA vectors and thereafter take a point cloud density measure. By doing this, we believe that the Mapper output will be more related to the intrinsic dimension of the word-space.

• Breaking up clustering: investigating the possibility of using a recursive strategy of breaking up large clusters and applying Mapper to them again, to allow for a more complete and adaptive strategy for visualising the word-spaces.

• Quantifying results: whether examining subspaces alone, connections between subspaces, topological changes over time, comparisons of topologies between different spaces or other interesting investigations, the need for a quantifiable result is of the utmost importance. Yet, in these early investigations it has been hard to define what properties to examine. It is easier done with data of a raw statistical nature. But how do we measure the relevance of a set of words compared to another set of words? How do we interpret the graph distance between two sets of words? The mapping of the results, from the intersection between Mapper and word-spaces, to the real world contains a lot of future work.

• Showing RI topology: examining if Mapper can be configured to show the filament structure of the Random Indexing word-spaces described in [13].


Bibliography

[1] Dionysus. http://www.mrzv.org/software/dionysus/#, 06 2012.

[2] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. 2012.

[3] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks, 2009.

[4] Gunnar Carlsson. Topology and data. Bulletin of The American Mathematical Society, 46(2):255–308, January 29 2009.

[5] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International journal of com- puter vision, 76(1):1–12, 2008.

[6] The British National Corpus. The British National Corpus.

[7] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.

[8] Ralph Grishman. Computational linguistics: an introduction. Cambridge Uni- versity Press, 1986.

[9] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA, August 2008.

[10] Allen Hatcher. Algebraic topology. Cambridge UP, Cambridge, 2002.

[11] Pentti Kanerva, Jan Kristofersson, and Anders Holst. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd annual conference of the cognitive science society, volume 1036. Citeseer, 2000.

[12] Jussi Karlgren. New measures to investigate term typology by distributional data. 19th Nordic Conference on Computational Linguistics. Oslo, 2013.


[13] Jussi Karlgren, Anders Holst, and Magnus Sahlgren. Filaments of meaning in word space. In Advances in Information Retrieval, pages 531–538. Springer, 2008.

[14] J Manyika, M Chui, B Brown, J Bughin, R Dobbs, C Roxburgh, and AH Byers. Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.

[15] James R Munkres. Elements of algebraic topology, volume 2. Addison-Wesley Reading, 1984.

[16] Monica Nicolau, Arnold J Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational pro- file and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[18] Reuters. Reuters corpus, volume 1, english language, 1996-08-20 to 1997-08-19.

[19] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.

[20] G.F. Simmons. Introduction to topology and modern analysis. International series in pure and applied mathematics. R.E. Krieger Pub. Co., 1983.

[21] Gurjeet Singh, Facundo Mémoli, and Gunnar Carlsson. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In Eu- rographics Symposium on Point-Based Graphics, volume 22. The Eurographics Association, 2007.

[22] Touchstone Applied Science Associates, Inc. Touchstone Applied Science Associates, Inc., 05 2013.

[23] Tiziano Zito, Niko Wilbert, Laurenz Wiskott, and Pietro Berkes. Modular toolkit for data processing (mdp): a python data processing framework. Fron- tiers in neuroinformatics, 2, 2008.
