Persistent Homology: an Introduction and a New Text Representation for Natural Language Processing

Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin-Madison
Madison, Wisconsin, USA 53706
[email protected]

Abstract

Persistent homology is a mathematical tool from topological data analysis. It performs multi-scale analysis on a set of points and identifies clusters, holes, and voids therein. These latter topological structures complement standard feature representations, making persistent homology an attractive feature extractor for artificial intelligence. Research on persistent homology for AI is in its infancy, and is currently hindered by two issues: the lack of an accessible introduction for AI researchers, and the paucity of applications. In response, the first part of this paper presents a tutorial on persistent homology specifically aimed at a broader audience without sacrificing mathematical rigor. The second part contains one of the first applications of persistent homology to natural language processing. Specifically, our Similarity Filtration with Time Skeleton (SIFTS) algorithm identifies holes that can be interpreted as semantic “tie-backs” in a text document, providing a new document structure representation. We illustrate our algorithm on documents ranging from nursery rhymes to novels, and on a corpus with child and adolescent writings.

1 Introduction

Imagine dividing a document into smaller units such as paragraphs. A paragraph can be represented by a point in some space, for example, as the bag-of-words vector in R^d where d is the vocabulary size. All paragraphs in the document form a point cloud in this space. Now let us “connect the dots” by linking the point for the first paragraph to the second, the second to the third, and so on. What does the curve look like?

Certain structures of the curve capture information relevant to Natural Language Processing (NLP). For instance, a good essay may have a conclusion paragraph that “ties back” to the introduction paragraph. Thus the starting point and the ending point of the curve may be close in the space. If we further connect all points within some small diameter ε, the curve may become a loop with a hole in the middle. In contrast, an essay without any tying back may not contain holes, no matter how large ε is.

There have been geometric methods for visualizing documents and information flow, e.g. based on differential geometry [Lebanon et al., 2007; Lebanon, 2006; Gous, 1999; Hall and Hofmann, 2000]. In contrast, we introduce an algebraic method based on persistent homology. As a branch of topological data analysis, persistent homology has the advantage of capturing novel invariant structural features of documents. Intuitively, persistent homology can identify clusters (0-th order holes), holes (1st order, as in our loopy curve), voids (2nd order holes, the inside of a balloon), and so on in a point cloud. Considering the importance of clustering today, the value of these higher order structures is tantalizing. Indeed, in the last few years persistent homology has found applications in data analysis, including neuroscience [Singh et al., 2008], bioinformatics [Kasson et al., 2007], sensor networks [de Silva and Ghrist, 2007a; de Silva and Ghrist, 2007b], medical imaging [Chung et al., 2009], shape analysis [Gamble and Heo, 2010], and computer vision [Freedman and Chen, 2011].

Unfortunately, existing homology literature requires advanced mathematical background not easily accessible to a broader audience. Our first contribution is an accessible yet rigorous tutorial that contains much unpublished material. Although a tutorial is unconventional in a technical paper, we feel that there is value to the AI community as it paves the way to further interdisciplinary research. Our second contribution is a novel text representation using persistent homology. It formalizes the curve-and-loop intuition based on Vietoris-Rips filtration over semantic similarity. We hope this paper inspires future innovations on topology and AI.

2 Persistent Homology

We aim for mathematical rigor and intuition, but have to sacrifice completeness. Readers can follow up with [Singh et al., 2008; Giblin, 2010; Freedman and Chen, 2011; Zomorodian, 2001; Rote and Vegter, 2006; Edelsbrunner and Harer, 2010; Hatcher, 2001; Carlsson, 2009; Edelsbrunner and Harer, 2007; Balakrishnan et al., 2012; 2013] for detailed treatment.

Persistent homology finds “holes” by identifying equivalent cycles: Consider the following space in yellow with a small white hole. Imagine the blue cycle as a rubber band. It can be stretched and bent within the space into the green cycle, but not the red one without tearing itself.

[Figure: a yellow space with a small white hole, containing a blue, a green, and a red cycle.]

There are two equivalence classes of rubber bands: some surround the hole and others do not. Conversely, two equivalence classes indicate one hole. To formalize this idea, we need to introduce some algebraic concepts.

2.1 Group Theory

Definition 1. A group ⟨G, ∗⟩ is a set G with a binary operation ∗ such that (1. associative) a ∗ (b ∗ c) = (a ∗ b) ∗ c for all a, b, c ∈ G. (2. identity) ∃e ∈ G so that e ∗ a = a ∗ e = a for all a ∈ G. (3. inverse) ∀a ∈ G, ∃a′ ∈ G where a ∗ a′ = a′ ∗ a = e.

For example, integer addition ⟨Z, +⟩ and real number addition ⟨R, +⟩ are groups with identity 0 and a’s inverse −a. The positive real numbers under multiplication form a group ⟨R⁺, ×⟩ with identity 1 and a’s inverse 1/a. However, ⟨R, ×⟩ is not a group since 0 ∈ R does not have an inverse under ×. The real numbers except 0 again form a group ⟨R\{0}, ×⟩. Z₂ is the only group (up to element renaming) of size two:

+₂ | 0  1
 0 | 0  1
 1 | 1  0

We can think of +₂ as the XOR function or mod-2 addition. For any set A = {a₁, ..., aₙ}, its power set forms a group ⟨2^A, +₂⟩ where +₂ is the symmetric difference: B +₂ C = (B ∪ C)\(B ∩ C). The identity is the empty set ∅, and the inverse of any B ⊆ A is B itself.

Definition 2. A group G is abelian if the operation ∗ is commutative: ∀a, b ∈ G, a ∗ b = b ∗ a.

All groups in this paper are abelian. For an example of non-abelian groups, consider n × n invertible matrices under matrix multiplication.

Definition 3. A subset H ⊆ G of a group ⟨G, ∗⟩ is a subgroup of G if ⟨H, ∗⟩ is itself a group.

{e} is the trivial subgroup of any group G (we often omit the operation when it is clear). ⟨R⁺, ×⟩ is a subgroup of ⟨R\{0}, ×⟩ by restricting multiplication to positive numbers. Note however that the negative numbers under multiplication, ⟨R⁻, ×⟩, do not form a subgroup because the product of two negative numbers is not in R⁻.

Definition 4. Given a subgroup H of an abelian group G, for any a ∈ G, the set a ∗ H = {a ∗ h | h ∈ H} is the coset of H represented by a.

Consider H = R⁺ and G = R\{0}. Then 3.14 × R⁺ is a coset which is the same as R⁺. In fact for any a > 0, a × R⁺ = R⁺, i.e., many different a’s represent the same coset. On the other hand, −1 × R⁺ = R⁻, so R⁻ is a coset represented by −1 (or any negative number, for that matter).

Definition 5. A map φ : G → G′ between groups ⟨G, ∗⟩ and ⟨G′, ⋆⟩ is a homomorphism if φ(a ∗ b) = φ(a) ⋆ φ(b) for all a, b ∈ G.

For example, the groups ⟨R⁺, ×⟩ and ⟨Z₂, +₂⟩ do not look similar at all. But there is a trivial homomorphism φ(a) = 0, ∀a ∈ R⁺. Note the last 0 is in Z₂. This simply says that we map all positive real numbers to the “0” in mod-2 addition. Obviously 0 = φ(a × b) = φ(a) +₂ φ(b) = 0 +₂ 0 = 0 for all a, b ∈ R⁺.

As another example, consider the group of (somewhat artificial) negation in natural language: G_N = {␣, not} with the following operation, where ␣ stands for whitespace:

 ∗  | ␣    not
 ␣  | ␣    not
not | not  ␣

i.e., single negation stays while double negation cancels. There is a homomorphism between G_N and Z₂: φ(␣) = 0, φ(not) = 1. In fact, G_N and Z₂ are identical up to renaming. There is a name for such homomorphisms:

Definition 6. A homomorphism that is a one-to-one correspondence is called an isomorphism.

Definition 7. The kernel of a homomorphism φ : G → G′ is ker φ = {a ∈ G | φ(a) = e′}, where e′ is the identity of G′. In other words, the kernel is the set of elements that map to the identity.

Theorem 1. For any homomorphism φ : G → G′, ker φ is a subgroup of G.

[Figure: G drawn as a grid of squares, with ker φ as the blue square; φ maps ker φ to the identity e′ in G′.]

Because ker φ is a subgroup (depicted as the blue square above), we can partition G into cosets of the form a ∗ ker φ for a ∈ G. These cosets are the white or blue squares. For example, if φ : ⟨R\{0}, ×⟩ → G_N with φ(a) = ␣ if a > 0 and “not” if a < 0, then ker φ = R⁺ is one coset and R⁻ is the only other coset.

We need one more definition. Let ⟨H, ∗⟩ be a subgroup of an abelian group ⟨G, ∗⟩. We can introduce a new binary operation ⋆ not on the elements of G but on the cosets of H: (a ∗ H) ⋆ (b ∗ H) = (a ∗ b) ∗ H, ∀a, b ∈ G. The operation ⋆ is well-defined and does not depend on the particular choice of representative.

Definition 8. The cosets {a ∗ H | a ∈ G} under the operation ⋆ form a group, called the quotient group G/H.

It is useful to think of quotient groups as “higher level” groups defined on the squares in the previous picture. ker φ (the blue square) is a subgroup of G. The elements of the
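For finite sets, the three axioms of Definition 1 can be checked by brute force. The sketch below is only an illustration, not part of the paper: the helper names `is_group` and `power_set` are hypothetical, and the three-element set standing in for A is an arbitrary choice. It confirms that Z2 under XOR and a power set under symmetric difference are groups, while {0, 1} under ordinary multiplication is not, because 0 has no inverse.

```python
from itertools import combinations, product


def is_group(elements, op):
    """Brute-force check of Definition 1 on a finite set (hypothetical helper)."""
    elements = list(elements)
    # Closure: op must really be a binary operation on the set.
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # (1) Associativity.
    if any(op(a, op(b, c)) != op(op(a, b), c)
           for a, b, c in product(elements, repeat=3)):
        return False
    # (2) Identity: some e with e*a == a*e == a for every a.
    identities = [e for e in elements
                  if all(op(e, a) == a and op(a, e) == a for a in elements)]
    if not identities:
        return False
    e = identities[0]
    # (3) Inverse: every a has some b with a*b == b*a == e.
    return all(any(op(a, b) == e and op(b, a) == e for b in elements)
               for a in elements)


def power_set(items):
    """All subsets of items, as frozensets so they can be compared and hashed."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def xor(a, b):
    return a ^ b                      # mod-2 addition on {0, 1}


def sym_diff(b, c):
    return (b | c) - (b & c)          # (B ∪ C) \ (B ∩ C)


z2 = [0, 1]
subsets = power_set({"a1", "a2", "a3"})   # illustrative 3-element set A

print(is_group(z2, xor))                        # True
print(is_group(subsets, sym_diff))              # True
print(is_group([0, 1], lambda a, b: a * b))     # False: 0 has no inverse
```

The check is cubic in the size of the set, which is fine for toy examples like these; infinite groups such as the positive reals under multiplication cannot be verified this way.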

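Homomorphisms, kernels, cosets, and the quotient operation can likewise be explored on small finite groups. The following sketch makes several assumptions that are not in the paper: the string " " stands in for the whitespace element of the negation group G_N, all function names are hypothetical, and the group Z6 (addition mod 6) with the parity map is used as a finite stand-in for the R\{0} example, which cannot be enumerated.

```python
from itertools import product

SPACE, NOT = " ", "not"      # elements of G_N; " " plays the whitespace element
G_N = [SPACE, NOT]


def negate(a, b):
    """Operation of G_N: single negation stays, double negation cancels."""
    return SPACE if a == b else NOT


def xor(a, b):
    return a ^ b             # the operation +2 of Z2


def is_homomorphism(phi, G, op_G, op_H):
    """phi(a * b) == phi(a) (op_H) phi(b) for all a, b in G."""
    return all(phi[op_G(a, b)] == op_H(phi[a], phi[b])
               for a, b in product(G, repeat=2))


def kernel(phi, G, identity_H):
    """Elements of G that phi sends to the identity of the target group."""
    return frozenset(a for a in G if phi[a] == identity_H)


def coset_of(a, op, H):
    return frozenset(op(a, h) for h in H)


def cosets(G, op, H):
    return {coset_of(a, op, H) for a in G}


def coset_product(A, B, op, H):
    """(a*H) ⋆ (b*H) = (a*b)*H, computed from every pair of representatives."""
    results = {coset_of(op(a, b), op, H) for a in A for b in B}
    assert len(results) == 1, "result would depend on the representative"
    return results.pop()


# G_N -> Z2 from the text: phi(whitespace) = 0, phi(not) = 1.
phi = {SPACE: 0, NOT: 1}
print(is_homomorphism(phi, G_N, negate, xor))        # True
print(len(set(phi.values())) == len(G_N))            # True: one-to-one

# Finite stand-in (an assumption): Z6 -> Z2 by parity.
Z6 = list(range(6))
add6 = lambda a, b: (a + b) % 6
parity = {a: a % 2 for a in Z6}
ker = kernel(parity, Z6, 0)
print(is_homomorphism(parity, Z6, add6, xor))        # True
print(sorted(ker))                                   # [0, 2, 4]
print(sorted(sorted(c) for c in cosets(Z6, add6, ker)))  # [[0, 2, 4], [1, 3, 5]]

odd = coset_of(1, add6, ker)
print(coset_product(odd, odd, add6, ker) == ker)     # True: odd ⋆ odd = even
```

In this stand-in the kernel and its single other coset play the roles of R+ and R− in the text, and the two cosets under the coset operation behave like Z2, with the kernel as the identity.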