Random Projection and Geometrization of String Distance Metrics
Daniel Devatman Hromada
Université Paris 8 – Laboratoire Cognition Humaine et Artificielle
Slovak University of Technology – Faculty of Electrical Engineering and Information Technology
[email protected]

Proceedings of the Student Research Workshop associated with RANLP 2013, pages 79–85, Hissar, Bulgaria, 9-11 September 2013.

Abstract

Edit distance is not the only approach by which the distance between two character sequences can be calculated. Strings can also be compared in somewhat subtler geometric ways. A procedure inspired by Random Indexing can attribute a D-dimensional geometric coordinate to any character N-gram present in the corpus and can subsequently represent a word as the sum of the N-gram fragments which the string contains. Thus, any word can be described as a point in a dense D-dimensional space, and the calculation of their distance can be realized by applying traditional Euclidean measures. Strong correlation exists, within the Keats Hyperion corpus, between such a cosine measure and Levenshtein distance. Overlaps between the centroid of the Levenshtein distance matrix space and the centroids of vector spaces generated by Random Projection were also observed. Contrary to the standard non-random "sparse" method of measuring cosine distances between two strings, the method based on Random Projection tends naturally to promote not the shortest but rather the longer strings. The geometric approach yields a finer output range than Levenshtein distance, and the retrieval of the nearest neighbor of a text's centroid could have, due to the limited dimensionality of the Randomly Projected space, smaller complexity than other vector methods.

Μηδείς αγεωμέτρητος εισίτω μου την στέγην ("Let no one ignorant of geometry enter my roof")

1 Introduction

Transformation of qualities into still finer and finer quantities belongs among the principal hallmarks of the scientific method. In a world where even "deep" entities like "word-meanings" are quantified and co-measured by an ever-growing number of researchers in computational linguistics (Kanerva et al., 2000; Sahlgren, 2005) and the cognitive sciences (Gärdenfors, 2004), it is of no surprise that "surface" entities like "character strings" can also be compared one with another according to a certain metric.

Traditionally, the distance between two strings is most often evaluated in terms of edit distance (ED), which is defined as the minimum number of operations like insertion, deletion or substitution required to change one string-word into the other. A prototypical example of such an edit distance approach is the so-called Levenshtein distance (Levenshtein, 1966). While many variants of Levenshtein distance (LD) exist, some extended with other operations like that of "metathese" (Damerau, 1964), some exploiting probabilist weights (Jaro, 1995), some introducing dynamic programming (Wagner & Fischer, 1974), all these ED algorithms take for granted that notions of insertion, deletion etc. are crucial in order to operationalize similarity between two strings.

Within this article we shall argue that one can successfully calculate similarity between two strings without taking recourse to any edit operation whatsoever. Instead of discrete insert&delete operations, we shall focus the attention of the reader upon a purely positive notion, that of the "occurrence of a part within the whole" (Harte, 2002). Any string-to-be-compared shall be understood as such a whole, and any continuous N-gram fragment observable within it shall be interpreted as its part.

2 Advantages of Random Projection

Random Projection is a method for projecting high-dimensional data into representations with fewer dimensions. In theoretical terms, it is founded on the Johnson-Lindenstrauss lemma (Johnson & Lindenstrauss, 1984), stating that a small set of points in a high-dimensional space can be
embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In practical terms, solutions based on Random Projection, or the closely related Random Indexing, tend to yield high performance when confronted with diverse NLP problems like synonym-finding (Sahlgren & Karlgren, 2002), text categorization (Sahlgren & Cöster, 2004), unsupervised bilingual lexicon extraction (Sahlgren & Karlgren, 2005), discovery of implicit inferential connections (Cohen et al., 2010) or automatic keyword attribution to scientific articles (El Ghali et al., 2012). RP distinguishes itself from other word space models in at least one of these aspects:

1. Incremental: RP allows one to inject on-the-fly new data-points (words) or their ensembles (texts, corpora) into an already constructed vector space. One is not obliged to execute heavy computations (like Singular Value Decomposition in the case of Latent Semantic Analysis) every time new data is encountered.

2. Multifunctional: Like other vector-space models, RP can be used in many diverse scenarios. In RI, for example, words are often considered to be the terms and sentences are understood as documents. In this article, words (or verses) shall be considered as documents and the N-gram fragments which occur in them shall be treated like terms.

3. Generalizable: RP can be applied in any scenario where one needs to encode into vectorial form the set of relations between discrete entities observable at diverse levels of abstraction (words / documents, parts / wholes, features / objects, pixels / images etc.).

4. Absolute: N-grams and terms, words and sentences, sentences and documents – in RP all these entities are encoded in the same randomly constructed yet absolute space. Similarity measurements can therefore be realized even among entities which would be considered incommensurable in more traditional approaches.[1]

There is, of course, a price which is being paid for these advantages. Primo, RP involves stochastic aspects, and its application thus does not guarantee replicability of results. Secundo, it involves two parameters, D and S, and the choice of such parameters can significantly modify the model's performance (in relation to the corpus upon which it is applied). Tertio: since even the most minute "features" are initially encoded in the same way as more macroscopic units like words, documents or texts, i.e. by a vector of length D "seeded" with S non-zero values, RP can be susceptible to certain limitations if ever applied on data discretisable into millions of distinct observable features.

3 Method

The method of geometrization of strings by means of Random Projection (RP) consists of four principal steps. Firstly, the strings contained within the corpus are "exploded" into fragments. Secondly, a random vector is assigned to every fragment according to RP's principles. Thirdly, the geometric representation of the string is obtained as the sum of its fragment-vectors. Finally, the distance between two strings can be obtained by calculating the cosine of the angle between their respective geometric representations.

3.1 String Fragmentation

We define a fragment F of a word W having the length N as any continuous 1-, 2-, 3-...N-gram contained within W.[2] Thus, a word of length 1 contains 1 fragment (the fragment is the word itself), words of length 2 contain 3 fragments, and, more generally, there exist N(N+1)/2 fragments for a word of length N. Pseudo-code of the fragmentation algorithm is as follows:

    function fragmentator;
    for frag_length (1..word_length) {
        for offset (0..(word_length - frag_length)) {
            frags[] = substr(word, offset, frag_length);
        }
    }

where substr() is a function returning from the word a fragment of length frag_length starting at the specified offset.

[2] Note that in this introductory article we exploit only continuous N-gram fragments.
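For concreteness, the fragmentation routine above can be restated in executable form. The following Python sketch (our naming and our reimplementation, not the author's original code) mirrors the pseudo-code:

```python
def fragmentator(word):
    """Return every continuous 1-, 2-, ..., N-gram contained in `word`."""
    frags = []
    for frag_length in range(1, len(word) + 1):
        for offset in range(len(word) - frag_length + 1):
            frags.append(word[offset:offset + frag_length])
    return frags

# A word of length N yields N*(N+1)/2 fragments, e.g. 3*4/2 = 6 for "dog":
print(fragmentator("dog"))  # ['d', 'o', 'g', 'do', 'og', 'dog']
```

Note that, unlike the pseudo-code's 0-based offset bound, Python's `range` excludes its upper limit, hence the `+ 1` in the inner loop.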
Interaction of RP with possibly other relevant patterns observable in the word – like N-grams with gaps, or sequences of members of diverse equivalence classes [e.g. consonants/vowels] – shall, we hope, be addressed in our doctoral Thesis or other publications.

[1] In traditional word space models, words are considered to be represented by the rows (vectors/points) of the word-document matrix, and documents by its columns (axes). In RP, both words (or word-fragments) and documents are represented by rows.

3.2 Stochastic fragment-vector generation

Once fragments are obtained, we transform them into geometric entities by following the fundamental precept of Random Projection: to every fragment-feature F present in the corpus, let us assign a random vector of length D containing D-S elements having zero values and S elements whose value is either -1 or 1. The number of dimensions (D) and the seed (S) are the parameters of the model. It is recommended that S << D. Table 1 illustrates how all fragments of a corpus containing only the word[3] "DOG" could be, given that S=2, randomly projected in a 5-dimensional space.

    Fragment    Vector
    D           0, 1, 0, 0, -1
    O           1, 1, 0, 0, 0
    G           0, 0, -1, 0, -1
    DO          -1, 0, -1, 0, 0
    OG          0, 1, 0, 1, 0
    DOG         0, 0, 0, -1, -1

Table 1: Vectors possibly assigned to the fragments of the word "dog" by RP(5,2)

3.3 String geometrization

Once random "init" vectors have been assigned to all word-fragments contained within the corpus, the geometrization of all word-strings is [...] each other. While other measures like the Jaccard index are sometimes also applied in relation to RI, the distance between words X and Y shall be calculated, in the following experiment, in the most traditional way: id est, as the cosine of the angle between vectors V_X and V_Y.

4 Experiment(s)

Two sets of simulations were conducted to test our hypothesis. The first experiment looked for both correlations and divergences between three different word-couple similarity data-sets obtained by applying three different measures upon the content of the corpus. The second experiment focused more closely upon overlaps among the centroids of the three diverse metric spaces under study.

4.1 Corpus and word extraction

An ASCII-encoded version of the poem "The Fall of Hyperion" (Keats, 1819) was used as the corpus, from which the list of words was extracted by:

1. Splitting the poem into lines (verses).
2. Splitting every verse into words, considering the characters [ :;,.?!()] as word-separator tokens.
3. In order to mark the word boundaries, prefixing every word with a ^ sign and post-fixing it with a $ sign.
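Putting sections 3.1–3.3 together with the boundary-marking convention just described, the whole geometrization pipeline admits a compact sketch. The Python below is an illustrative reimplementation under assumed parameters (D=1000, S=5, a fixed seed), not the configuration used in the paper's experiments:

```python
import math
import random

random.seed(42)   # RP is stochastic; fixing the seed makes this sketch repeatable
D, S = 1000, 5    # illustrative model parameters; S << D is recommended

def fragmentator(word):
    # Section 3.1: all continuous 1..N-grams of the word
    return [word[o:o + l] for l in range(1, len(word) + 1)
            for o in range(len(word) - l + 1)]

_init = {}  # fragment -> its random "init" vector, assigned once

def init_vector(frag):
    # Section 3.2: a vector with D-S zero elements and S elements equal to +/-1
    if frag not in _init:
        v = [0] * D
        for pos in random.sample(range(D), S):
            v[pos] = random.choice((-1, 1))
        _init[frag] = v
    return _init[frag]

def geometrize(word):
    # Section 3.3: the word's vector is the sum of its fragment vectors
    v = [0] * D
    for frag in fragmentator(word):
        for i, x in enumerate(init_vector(frag)):
            v[i] += x
    return v

def cosine(u, v):
    # the similarity measure used in the experiments
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# words are boundary-marked as in Section 4.1
dog, dogs, cat = geometrize("^dog$"), geometrize("^dogs$"), geometrize("^cat$")
print(cosine(dog, dogs) > cosine(dog, cat))  # shared fragments raise the cosine
```

Because "^dog$" and "^dogs$" share eleven fragments while "^dog$" and "^cat$" share only the boundary marks, the first cosine exceeds the second with overwhelming probability; the random cross-terms merely add zero-mean noise.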