KTH Royal Institute of Technology
Doctoral Thesis in Mathematics
Topological and geometrical methods in data analysis
OLIVER GÄFVERT
Stockholm, Sweden 2021
Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 11th of June 2021, at 2:00 p.m. in lecture hall F3, Kungl Tekniska Högskolan, Lindstedtsvägen 26, Stockholm.
© Oliver Gäfvert
ISBN 978-91-7873-853-3 TRITA-SCI-FOU 2021:14
Printed by: Universitetsservice US-AB, Sweden 2021
Abstract
This thesis concerns two related data analysis pipelines, using topological and geometrical methods respectively, to extract relevant information. The first pipeline, referred to as the topological data analysis (TDA) pipeline, constructs a filtered simplicial complex on a given data set in order to describe its shape. The shape is described using a persistence module, which characterizes the topological features of the filtration, and the final step of the pipeline extracts algebraic invariants from this object. The second pipeline, referred to as the geometric data analysis (GDA) pipeline, associates an algebraic variety to a given data set and aims to describe the structure of this variety. Its structure is described using homology, an invariant which for most algebraic varieties can only be computed numerically using sampling methods.

In Paper A we consider invariants of multi-parameter persistence modules. We explain how to convert discrete invariants into stable ones via what we call hierarchical stabilization. We illustrate this process by constructing stable invariants for multi-parameter persistence modules with respect to the interleaving distance and so-called simple noise systems. For one parameter, we recover the standard barcode information. For more than one parameter we prove that the constructed invariants are in general NP-hard to calculate. A consequence is that computing the feature counting function, proposed by Scolamiero et al. (2016), is NP-hard.

In Paper B we introduce an efficient algorithm to compute a minimal presentation of a multi-parameter persistent homology module, given a chain complex of free modules as input. Our approach extends previous work on this problem in the 2-parameter case, and draws on ideas underlying the F4 and F5 algorithms for Gröbner basis computation.
In the r-parameter case, our algorithm computes a presentation for the homology of $C \xrightarrow{F} A \xrightarrow{G} B$, with modules of ranks $l, n, m$ respectively, in $O(r^2 n^{r+1} + n^r m + n^{r-1} m^2 + r n^2 l)$ arithmetic operations. We implement this approach in our new software Muphasa, written in C++. In preliminary computational experiments on synthetic TDA examples, we compare our approach to a version of a classical approach based on Schreyer's algorithm, and find that ours is substantially faster and more memory efficient. In the course of developing our algorithm for computing presentations, we also introduce algorithms for the closely related problems of computing Gröbner bases for the image and kernel of the morphism $G$. This algorithm runs in time $O(n^r m + n^{r-1} m^2)$ and memory $O(n^2 + mn + nr + K)$, where $K$ is the size of the output.

Paper C analyzes the complexity of fitting a variety, coming from a class of varieties, to a configuration of points in $\mathbb{R}^N$. The complexity measure, called the algebraic complexity, computes the Euclidean Distance Degree (EDD) of a certain variety called the hypothesis variety as the number of points in the configuration increases. Finally, we establish a connection to the complexity of architectures of polynomial neural networks. For the problem of fitting an $(N-1)$-sphere to a configuration of $m$ points in $\mathbb{R}^N$, we give a closed formula for the algebraic complexity of the hypothesis variety as $m$ grows for the case $N = 1$. For the case $N > 1$ we conjecture a generalization of this formula, supported by numerical experiments.
In Paper D we present an efficient algorithm to produce a provably dense sample of a smooth compact variety. The procedure is partly based on computing bottlenecks of the variety. Using geometric information such as the bottlenecks and the local reach, we also provide bounds on the density of the sample needed in order to guarantee that the homology of the variety can be recovered from the sample. An implementation of the algorithm is provided together with numerical experiments and a computational comparison to the algorithm by Dufresne et al. (2019).
Sammanfattning
This thesis concerns two related data analysis pipelines, using topological and geometric methods respectively, to extract relevant information. The first pipeline, called the topological data analysis (TDA) pipeline, constructs a so-called filtered simplicial complex on a given data set in order to describe the shape of the data set. The shape is described by a so-called persistence module, which characterizes the topological properties of the filtration, and the final step of the pipeline is to extract algebraic invariants from this object. The second pipeline, called the geometric data analysis (GDA) pipeline, associates an algebraic variety to a given data set and aims to describe its structure. The structure is described using homology, an invariant which for most algebraic varieties can only be computed numerically using sampling methods.

In Paper A we study invariants of persistence modules with several parameters. We explain how to convert discrete invariants into stable ones via what we call hierarchical stabilization. We illustrate this process by constructing stable invariants for multi-parameter persistence modules with respect to the interleaving metric and so-called simple noise systems. For a single parameter we recover the barcode invariant. For more than one parameter we prove that the constructed invariants are in general NP-hard to compute. A consequence is that computing the feature counting function, proposed by Scolamiero et al. (2016), is NP-hard.

In Paper B we introduce an efficient algorithm for computing a minimal presentation of a multi-parameter persistence module, given a chain complex of free modules as input. Our approach builds on earlier work on this problem in the two-parameter case and draws on ideas underlying the F4 and F5 algorithms for computing Gröbner bases.
In the r-parameter case we compute a presentation of the homology of $C \xrightarrow{F} A \xrightarrow{G} B$, with modules of ranks $l, n$ and $m$ respectively, in $O(r^2 n^{r+1} + n^r m + n^{r-1} m^2 + r n^2 l)$ arithmetic operations. We have implemented our algorithm in the new software Muphasa, written in C++. In preliminary experiments on synthetic TDA examples we compare our approach with a version of a classical algorithm based on Schreyer's algorithm, with the result that our algorithm is substantially faster and more memory efficient. In the course of developing our algorithm for minimal presentations we also introduce algorithms for the closely related problems of computing a Gröbner basis for the image and kernel of the morphism $G$. Our algorithm in that case has time complexity $O(n^r m + n^{r-1} m^2)$ and memory complexity $O(n^2 + mn + nr + K)$, where $K$ is the size of the output.

Paper C analyzes the complexity of fitting an algebraic variety, coming from a class of varieties, to a configuration of points in $\mathbb{R}^N$. The complexity measure, called the algebraic complexity, computes the Euclidean Distance Degree (EDD) of a certain variety called the hypothesis variety as the number of points in the configuration increases. Finally, we establish a connection to the complexity of architectures of polynomial neural networks. For the problem of fitting an $(N-1)$-sphere to a configuration of $m$ points in $\mathbb{R}^N$, we give
a closed formula for the algebraic complexity of the hypothesis variety as $m$ grows in the case $N = 1$. For the case $N > 1$ we formulate a conjecture generalizing this formula, supported by numerical experiments.

In Paper D we present an efficient algorithm for producing a provably dense sample of a smooth compact algebraic variety. The algorithm is partly based on computing bottlenecks of the variety. Using geometric information such as bottlenecks and the local reach, we also give upper bounds on the density required of the sample in order to guarantee that the homology of the variety can be recovered. An implementation of the algorithm is provided together with numerical experiments and a comparison with the algorithm of Dufresne et al. (2019).

Contents
Acknowledgements

Part I: Introduction and Summary

1 Introduction

2 Background
  2.1 Topological data analysis
  2.2 Geometric data analysis

3 Summary of Results

References
Part II: Scientific papers
Paper A Stable invariants for multiparameter persistence (joint with W. Chachólski) Pre-print: arXiv:1703.03632.
Paper B Efficient computation of multiparameter persistence (joint with M. R. Bender and M. Lesnick) Pre-print.
Paper C Computational complexity of learning algebraic varieties Advances in Applied Mathematics, vol. 121 (2020), 102100.
Paper D Sampling and homology via bottlenecks (joint with S. Di Rocco and D. Eklund) Pre-print: arXiv:2010.07976.

Acknowledgements
I would like to begin by thanking my advisor Sandra Di Rocco and co-advisors Wojciech Chachólski and David Eklund for their guidance and support, interesting discussions and valuable advice during the years leading up to this thesis. I am very grateful to my co-authors, in particular Matías Bender and Michael Lesnick, for our many interesting and long discussions. I would also like to thank my other colleagues at the mathematics department, who made my time there much more enjoyable. In particular, and in no particular order, Samuel Fromm, Gerard Farré, Julian Mauersberger, Parikshit Upadhyaya, Johan Wärnegård, Martina Favero, Lena Leitenmaier, Gustav Zickert, Carl Ringqvist, Nasrin Altafi, Marcus Nordström, Philippe Moreillon, Federico Izzo and Eric Ahlqvist. Finally I would like to thank Ella, Salem and my family for their encouragement and support.
1 Introduction
Understanding the underlying structure generating observed data is the aim of data analysis. The methods of topological data analysis (TDA) and nonlinear algebra can unveil this structure without a specific choice of parameters or model. The power of these methods comes from the formulation of computable and visualizable invariants describing the shape of an object, which in this setting means a data set. Their application has been seen in a wide range of areas including, but not limited to, medicine [4, 33, 37], biology [16, 21, 20], robotics [7, 35, 22, 39] and image processing [6, 3, 2]. The field of topological data analysis originates from the study of similarity of Euclidean submanifolds and estimation of their homology [30, 36]. The main tool of TDA, the persistence module, was formulated by Edelsbrunner et al. [28], and Carlsson et al. later showed that it has the structure of a graded module over a polynomial ring [43, 12]. The premise of TDA is that shape matters. Data in high dimensions is usually sparse but may have relevant low-dimensional topological features. An illustrative example can be found e.g. in [41], where TDA is used to discover periodicity in dynamical systems. In their setting, periodic behavior corresponds to a homological 1-cycle, i.e. a closed circle in the state space. Detecting such structure using other methods is difficult, as formalizing a description of shape is a non-trivial task often dependent on a choice of parameters. TDA accomplishes this with the use of tools coming from algebraic topology and commutative algebra. It avoids making a choice of parameters by considering all parameter choices and condensing this usually very large quantity of information into a concise and visualizable invariant. For problem instances involving multiple measurements on the data (also referred to as filtration parameters), the construction of TDA invariants and efficient computation of the persistence module is an open area of research.
In Paper B we make use of the observation that the persistence module has the structure of a graded module in order to compute it. We formulate a novel algorithm based on the theory of Gröbner bases to efficiently compute presentations of persistence modules. In the case of invariants, it was shown by Carlsson and Zomorodian [13] that no complete discrete invariant exists in the multi-parameter case, that is, no invariant that completely determines the isomorphism class of the object. In
Paper A we show how to construct versions of classical commutative algebra invariants for the multi-parameter setting and analyze their computational complexity. Our main result shows that the invariant of Scolamiero et al. [14] and its generalization are in general NP-hard to compute. TDA has recently been applied in other areas of mathematics and, conversely, tools from e.g. algebraic geometry have been used to enrich the information extracted from the persistence module; see for instance the work of [8, 26, 41]. If the observed data comes from an algebraic variety, it has additional structure that we can make use of in our analysis. This information could for example be used to estimate the tangent spaces of the underlying variety in order to guide the construction of the persistence module and reduce memory usage and computation time [8, 32]. Yet another use of the algebraic structure is to directly estimate equations of the underlying variety, produce a dense sample and calculate its homology. Analyzing the complexity of estimating the equations of the variety is the subject of Paper C. In Paper D we consider the latter part of this pipeline: we construct a sampling algorithm to efficiently sample the variety with a density that guarantees that its homology can be recovered. Sampling an algebraic variety is in itself an important problem, as geometric invariants such as homology groups and Betti numbers can almost never be computed symbolically. It is however a non-trivial problem, as it scales poorly with the degree and dimension of the variety and the ambient space. Ideally, one would like to have a uniform sample of the variety with a prescribed density. Formulating an algorithm to produce a uniform sample of a variety is the subject of [9, 18], and an algorithm to produce a provably dense sample is described in [26].
The main result of Paper D is the construction of a novel provably dense sampling algorithm which in numerical experiments proves to be substantially more efficient than the algorithm of [26].

2 Background
In this chapter we introduce the methods and tools used throughout this thesis. The primary tool binding this thesis together is homology. Homology can be used to classify and describe the geometrical shape of objects and to answer questions such as how many connected components an object consists of or how many holes and voids it contains. The work in this thesis concerns two related data analysis pipelines, illustrated below: the topological data analysis (TDA) pipeline and the geometric data analysis (GDA) pipeline. The output of each pipeline is a homological invariant, while the tools used to compute it come from algebraic topology and algebraic geometry respectively.
Topological data analysis (TDA) pipeline
Data → Construct filtered simplicial complex → Compute persistence module → Invariant
Figure 2.1 – The TDA pipeline for computing homological invariants.
Geometric data analysis (GDA) pipeline
Data → Estimate equations of an algebraic variety $V \subset \mathbb{R}^N$ → Compute sample $E \subset V$ with density δ → Compute homology of $E$
Figure 2.2 – The GDA pipeline for computing homological invariants.
Throughout this chapter, a data set $U$ is a finite subset $U \subset \mathbb{R}^N$, where $N$ is a positive natural number.
2.1 Topological data analysis
The fundamental goal of topological data analysis is to describe the shape of a data set. As observed by Edelsbrunner et al. [28], one can formalize a description of the shape of a data set amenable to computation using simplicial complexes and homology. Starting from a data set $U$, the question is how to construct a simplicial complex. While any such choice can be made, we would like to construct a simplicial complex that best reflects the structure of the object that $U$ was sampled from. The strategy of persistent homology, the main tool of TDA, is to construct a filtration of simplicial complexes on $U$. The homological generators that persist through the filtration are believed to represent significant topological features of the underlying object. The filtration is constructed starting from a metric on the points of $U$. There are several possible strategies for constructing a filtration, such as the Vietoris-Rips, Čech or α-complexes and their witness constructions (see [11, 27] for details).
Example 2.1. The most widely used strategy for constructing a filtration of simplicial complexes in TDA is the Vietoris-Rips complex. Let $d: U \times U \to \mathbb{R}$ be a metric on $U$. The Vietoris-Rips complex at parameter value $\epsilon$ is defined by

$$F(\epsilon) = \{\sigma \in P(U) \mid d(x, y) < \epsilon \text{ for all } x, y \in \sigma\}$$

where $P(U)$ denotes the power set of $U$, consisting of all subsets. A property of the
Figure 2.3 – Illustration of a filtered Vietoris-Rips complex.
Vietoris-Rips complex, which makes it attractive from a computational perspective, is that it is fully determined by its 1-skeleton. This means that in order to describe it one only has to consider individual pairs of points. This is not the case for either the Čech or the α-complex.
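Since the Vietoris-Rips complex is determined by pairwise distances alone, it can be enumerated directly. The following minimal sketch is our own illustration, not code from the thesis; the function name and the cutoff on simplex dimension are our own choices.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, eps, max_dim=2):
    """Enumerate all simplices of the Vietoris-Rips complex F(eps):
    subsets sigma whose pairwise distances are all < eps.
    Only simplices of dimension <= max_dim are listed."""
    simplices = []
    for k in range(1, max_dim + 2):  # a k-vertex simplex has dimension k-1
        for sigma in combinations(range(len(points)), k):
            if all(dist(points[i], points[j]) < eps
                   for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Four corners of the unit square: at eps = 1.1 only the four side edges
# appear, while at eps = 1.5 the diagonals (length sqrt(2)) enter as well.
corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
edges_small = [s for s in vietoris_rips(corners, 1.1) if len(s) == 2]
edges_large = [s for s in vietoris_rips(corners, 1.5) if len(s) == 2]
```

Because the complex is determined by its 1-skeleton, the higher simplices at a given parameter are exactly the cliques of the distance graph, which is what the pairwise test above checks.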
Let $SC$ denote the category of simplicial complexes, and let $R$ denote the category whose objects are the real numbers, with an arrow from $a$ to $b$ whenever $a \le b$ in the usual ordering. A filtered simplicial complex on $U$ can be represented as a functor $F: R \to SC$. One can then compute the homology of each simplicial complex in the filtration. In persistent homology one normally computes the homology with coefficients in some field $K$, making the homology group into a $K$-vector space (see [11] for details). An important property of homology is functoriality: a map $f: S \to S'$ of simplicial complexes induces a map $H_k(f): H_k(S, K) \to H_k(S', K)$ on homology, compatible with composition, where $H_k(S, K)$ denotes the $k$-th homology group of $S$ computed with coefficients in $K$. By taking homology of $F$ (with coefficients in a field $K$) we thus get a functor $H_k(F, K): R \to \mathrm{Vect}_K$ with values in the category of $K$-vector spaces, called the $k$-th persistence module of $U$. It was shown by Carlsson and Zomorodian in [43] that the $k$-th persistence module has the structure of a graded module over $K[x]$. Since $K[x]$ is a principal ideal domain (PID), a finitely generated graded module over it admits a unique decomposition into irreducible components, called bars. This means that
$$H_k(F, K) \cong \bigoplus_{t=1}^{n_1} x^{a_t} K[x] / x^{b_t} K[x] \;\oplus\; \bigoplus_{t=1}^{n_2} x^{c_t} K[x]$$

for some $a_t, b_t, c_t, n_1, n_2 \in \mathbb{N}$. Since the above decomposition is unique, the values $n_1, n_2$ and $a_t, b_t, c_t$ completely characterize the module. These numbers constitute an invariant of the module called the barcode: a summand $x^{a_t} K[x]/x^{b_t} K[x]$ corresponds to a class born at $a_t$ that dies at $b_t$, i.e. the bar $[a_t, b_t)$, while a free summand $x^{c_t} K[x]$ corresponds to a class born at $c_t$ that never dies, i.e. the bar $[c_t, \infty)$. The barcode is a complete invariant of the module, which means that it completely determines its isomorphism type. Figure 2.4 illustrates the barcodes of the zeroth, first and second persistence modules of a filtration defined in [31].

In a more general setting we may construct the filtration using several metrics on $U$. Let $\mathbb{R}^r$ denote the category of $r$-tuples of real numbers, with an arrow from $a$ to $b$ in $\mathbb{R}^r$ if $a \le b$ in the poset ordering, meaning that $a_i \le b_i$ for all $1 \le i \le r$. Suppose now that we construct the filtration using $r$ metrics on $U$ (see [13] for details); we will refer to the number of metrics used as the number of parameters of the filtration. The result is a functor $F: \mathbb{R}^r \to SC$. Figure 2.5 illustrates an example of a simplicial complex filtered by two parameters, i.e. with $r = 2$. Taking the homology of this functor as above yields the $k$-th multi-parameter persistence module $H_k(F, K): \mathbb{R}^r \to \mathrm{Vect}_K$. It was shown by Carlsson et al. [13] that no complete discrete invariant exists for multi-parameter persistence modules. This is because the multi-parameter persistence module, constructed using $r$ metrics, has the structure of a $\mathbb{Z}^r$-graded module over a polynomial ring in $r$ variables, and the moduli space of such modules is a very complicated algebraic variety for which no complete discrete invariant exists. To analyze such modules one must thus resort to incomplete invariants, which do not completely determine the isomorphism type but can still extract relevant information from the module. We are now ready to formulate the TDA pipeline by restating Figure 2.1.
Figure 2.4 – Illustration due to [31] of the barcodes of the zeroth, first and second persistence modules.
Figure 2.5 – Illustration due to [38] of a filtration $F: \mathbb{R}^2 \to SC$.
Topological data analysis (TDA) pipeline
Data → Construct filtered simplicial complex → Compute persistence module → Invariant
In the following we explain the last two stages in the pipeline.
Computing the persistence module

Computing the $k$-th persistence module means computing the $k$-th homology of all simplicial complexes in the filtration $F: \mathbb{R}^r \to SC$ together with all of their induced maps. For the case $r = 1$ this problem has been studied extensively and several software packages have been made available; see [34] for an overview. The problem essentially comes down to Gaussian elimination, together with a series of tricks to avoid performing reductions on columns.

To compute the $k$-th persistence module of a filtration $F: \mathbb{R}^r \to SC$ one proceeds as follows. Let $C_k(F(\epsilon), K)$ denote the $k$-th chain module of the simplicial complex $F(\epsilon)$. The $k$-th chain modules for all $\epsilon$ in fact also define a functor $C_k(F, K): \mathbb{R}^r \to \mathrm{Vect}_K$, where the maps are induced by the inclusion maps of $F$. Moreover, the $k$-th boundary map $\partial_k(\epsilon): C_k(F(\epsilon), K) \to C_{k-1}(F(\epsilon), K)$ extends to a natural transformation from $C_k(F, K)$ to $C_{k-1}(F, K)$. Computing the persistence module then comes down to computing $\ker(\partial_k)$ and $\mathrm{im}(\partial_{k+1})$ and taking their quotient.

For the case $r = 1$ it is possible to perform all of these computations within the same matrix reduction; see [43] for details. The only restriction when performing reductions is that the filtration values of the simplices have to be respected: one cannot add a column to a column whose associated simplex has a smaller filtration value. This is resolved by ordering the columns by the filtration values of their associated simplices in increasing order and then performing the reductions left to right.

For the case $r > 1$ things are more complicated. In Paper B we detail the connection between computing the $k$-th persistence module and the theory of Gröbner bases. The left-to-right column reductions in the $r = 1$ case essentially correspond to a Gröbner basis computation. This connection was observed already by Carlsson et al.
in [12], but since then little effort has been made to further connect TDA with the results of Gröbner basis theory.

One may generalize the persistence module computation for $r = 1$ to the multi-parameter case by taking the perspective of the persistence module as a graded module over a polynomial ring. In [13] it is shown that there is a correspondence between persistence modules and finitely generated $\mathbb{Z}^r$-graded modules over $K[x_1, \dots, x_r]$. In fact, the filtration also has the structure of such a module, and computing the persistence module can be expressed as computing the homology of a chain complex of free $\mathbb{Z}^r$-graded $K[x_1, \dots, x_r]$-modules, as shown by Scolamiero et al. in [15]. Computing the $k$-th persistence module thus means computing the kernel of a map between such modules and then computing the quotient with the image of the map above.

To compute the kernel of a map $\varphi: A \to B$ of free $\mathbb{Z}^r$-graded $K[x_1, \dots, x_r]$-modules one typically has to compute a Gröbner basis. This can in a sense be seen as a generalization of Gaussian elimination, but it has in the multigraded case a doubly exponential worst-case time complexity. In Paper B we formulate an algorithm to compute the kernel and image of $\varphi$ with time complexity of order $O(n^r m + n^{r-1} m^2)$ and memory complexity $O(n^2 + mn + nr + K)$, where the input is given as a boundary matrix with $n$ columns and $m$ rows, and $K$ is the size of the output. This algorithm is then used to construct an algorithm that computes a presentation of the multi-parameter persistence module.
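The left-to-right reduction for $r = 1$ mentioned above can be sketched in a few lines. The following is a generic illustration of the standard persistence reduction over the two-element field, not the algorithm of Paper B; the column representation and names are our own choices.

```python
def persistence_pairs(boundaries):
    """Standard persistence matrix reduction over F_2 (the case r = 1).
    boundaries[j] is the set of facet indices of simplex j (empty for
    vertices); simplices are assumed ordered by filtration value.
    Returns (i, j) pairs: simplex i creates a class that simplex j kills."""
    reduced = []      # reduced[j]: current column j as a set of row indices
    pivot_of = {}     # lowest nonzero row index -> column owning that pivot
    pairs = []
    for j, col in enumerate(boundaries):
        col = set(col)
        # add earlier columns (F_2 addition = symmetric difference) until
        # the lowest nonzero entry of column j is a fresh pivot
        while col and max(col) in pivot_of:
            col ^= reduced[pivot_of[max(col)]]
        reduced.append(col)
        if col:
            pivot_of[max(col)] = j
            pairs.append((max(col), j))
    return pairs

# Filtration of a triangle: vertices 0, 1, 2, then the three edges,
# then the 2-cell; each simplex is added after all of its faces.
triangle = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
```

On this filtration the reduction pairs vertices 1 and 2 with the edges that merge them, and pairs the edge creating the 1-cycle with the triangle that fills it; vertex 0 remains unpaired, giving the infinite bar in degree 0.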
Invariants

For $r = 1$ the barcode is a complete invariant that contains all of the information about the module. It was shown by Cohen-Steiner et al. [19] that this invariant is stable, which means that small variations in the input yield small variations of the invariant. In order to talk about stability one needs to define a metric. The metric most commonly used on barcodes is the bottleneck distance, defined as follows. The barcode of a 1-parameter persistence module $F$ can be embedded into the space $\mathbb{R} \times \mathbb{R}_\infty$, where $\mathbb{R}_\infty := \mathbb{R} \cup \{\infty\}$, by letting each bar correspond to a point in this space. The barcode of $F$ can thus be represented as
$$B(F) := \{(a_t, b_t) \in \mathbb{R} \times \mathbb{R}_\infty \mid 1 \le t \le n_1\} \cup \{(c_t, \infty) \in \mathbb{R} \times \mathbb{R}_\infty \mid 1 \le t \le n_2\}.$$

This representation is called a persistence diagram [28]. Let $F$ and $G$ be 1-parameter persistence modules; the bottleneck distance is then defined as
$$d_B(F, G) := \inf_{\gamma} \sup_{b \in B(F)} \|b - \gamma(b)\|_\infty$$
where $\gamma$ ranges over all bijections from $B(F)$ to $B(G)$ and $\|\cdot\|_\infty$ denotes the $L^\infty$-norm. Note that this means that $F$ and $G$ have to have the same number of bars; if this is not the case one simply adds a set of "empty" bars. In order for invariants in TDA to be useful in practice they need to be stable. For the case $r > 1$ the most prominent stable invariant is the rank invariant, proposed by Carlsson and Zomorodian [43]. This invariant can be seen as collecting all one-dimensional barcode information in a big table. Consequently, it captures the persistent information of the module, but extracting multivariate persistent information from the table and visualizing it requires more work. Another stable invariant is the feature counting function [14], which extracts persistent information of the module in a readable form. In Paper A we generalize its construction through the process of hierarchical stabilization, a process for stabilizing classical invariants from commutative algebra in the setting of multi-parameter persistent homology.
Further, we define an algorithm to compute it and show that it is in general NP-hard to compute. This is an indication that extracting multivariate persistent information from the rank invariant, and perhaps in general, might be NP-hard as well.
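For small diagrams the bottleneck distance can be evaluated by brute force over all bijections, after padding each diagram with the diagonal projections of the other's points (playing the role of the "empty" bars). This hypothetical sketch is exponential in the number of bars and intended only to illustrate the definition; it handles finite bars only.

```python
from itertools import permutations

def bottleneck_distance(D1, D2):
    """Brute-force bottleneck distance between two finite persistence
    diagrams, given as lists of (birth, death) pairs with death < inf.
    Each diagram is padded with the diagonal projections of the other's
    points; matching two diagonal points costs 0."""
    def diag(p):
        m = (p[0] + p[1]) / 2
        return (m, m)
    # (point, is_diagonal) pairs
    A = [(p, False) for p in D1] + [(diag(q), True) for q in D2]
    B = [(q, False) for q in D2] + [(diag(p), True) for p in D1]
    def cost(a, b):
        if a[1] and b[1]:            # diagonal matched to diagonal
            return 0.0
        return max(abs(a[0][0] - b[0][0]), abs(a[0][1] - b[0][1]))
    best = float('inf')
    for perm in permutations(range(len(B))):
        worst = max((cost(A[i], B[perm[i]]) for i in range(len(A))),
                    default=0.0)
        best = min(best, worst)
    return best
```

For example, matching the bar $[0, 2)$ to $[0, 3)$ costs 1 in the $L^\infty$-norm, which is cheaper than sending both bars to the diagonal.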
2.2 Geometric data analysis
If we have more information about the source of $U$, we can make use of it when computing homological invariants. In this section we assume that $U$ is sampled from an algebraic variety, whose defining equations are either known or unknown. If the variety is unknown we need to estimate its defining equations. This is a non-trivial task, as determining the dimension and degree of the variety is a challenging problem. Once the defining equations have been estimated, computing the homology can almost surely not be done symbolically, and thus one has to resort to numerical methods such as sampling. From a sample of the variety one can construct a simplicial complex and compute its homology. In order to guarantee that the "correct" homology can be recovered from the sample, we require it to have a prescribed density. This outlines the process of the GDA pipeline. In the following we introduce the necessary definitions.
Geometric invariants
Let $V \subseteq \mathbb{C}^N$ be a smooth subvariety and let $q \in \mathbb{R}^N$. For $x \in V$, let $N_x V \subseteq \mathbb{C}^N$ denote the embedded normal space to $V$ at $x$. Let

$$d_E(u, v) := \sum_{i=1}^{N} (u_i - v_i)^2 : \mathbb{C}^N \times \mathbb{C}^N \to \mathbb{C}$$

denote the (squared) Euclidean distance function. Further, let $d_V(u)$ denote the minimal distance, with respect to $d_E$, from $u$ to any point on $V$. Similarly, we refer to the distance between two varieties $V, W$ as the minimal distance, with respect to $d_E$, between any two points $x \in V$, $y \in W$. A critical point of $d_E$ with respect to $q$ and $V$ is a point $x \in V$ such that $q \in N_x V$. Consider the set of critical points of $d_E$ with respect to $q$ and $V$:

$$I(V, q) = \{x \in V : q \in N_x V\}.$$

We call $I(V, q)$ the normal locus of $V$ with respect to $q$. Recall that the degree of a zero-dimensional ideal is equal to the degree of the corresponding zero-scheme, i.e. the number of points defined by the ideal counted with multiplicity. For generic $q \in \mathbb{C}^N$, the ideal defining $I(V, q)$ is zero-dimensional [25]. The degree of $I(V, q)$ for generic $q \in \mathbb{C}^N$ is constant and is called the Euclidean Distance Degree (EDD) of $V$; see [25] for more details.
Example 2.2. Consider the variety in $\mathbb{R}^2$ defined by the polynomial

$$f(x, y) = (0.6x^4 + 7y^4 - 1)(x^4 + y^4 + 4x^2y^2 - 0.9) + xy - 0.1$$
It is worth noting that the use of the algebraically closed field $\mathbb{C}$ is important to the theory. Over $\mathbb{C}$ the EDD is constant and in this case equal to 64. Figure 2.6 illustrates the real points of the normal locus for the choice $q = (-0.4, -0.2)$. For this specific choice of $q$, only 6 points out of 64 are real.
Figure 2.6 – Illustration of the real part of the normal locus for a specific choice of $V$ and $q$.
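As a numerical illustration of critical points (ours, not from the thesis): for an ellipse parametrized as $(a\cos t, b\sin t)$, the critical points of the squared distance from a point $q$ are the zeros of a trigonometric polynomial in $t$, and the real ones can be counted by sign changes. The EDD counts the complex critical points for generic $q$; for an ellipse it equals 4, and for the point chosen below all four critical points happen to be real.

```python
from math import sin, cos, pi

def real_distance_critical_points(a, b, qx, qy, n=200000):
    """Count real critical points of the squared distance from q = (qx, qy)
    to the ellipse (a cos t, b sin t), by counting sign changes of
    g(t) = (1/2) d/dt |c(t) - q|^2 over a fine grid of [0, 2*pi)."""
    def g(t):
        # (b^2 - a^2) sin t cos t + a*qx sin t - b*qy cos t
        return (b*b - a*a) * sin(t) * cos(t) + a*qx*sin(t) - b*qy*cos(t)
    count, prev = 0, g(0.0)
    for i in range(1, n + 1):
        cur = g(2 * pi * i / n)
        if prev * cur < 0:
            count += 1
        prev = cur
    return count
```

For the ellipse with $a = 2$, $b = 1$ and $q = (0.1, 0.1)$ near the center, the count is 4, matching the number of complex critical points; for other choices of $q$ some critical points move off the real locus while the complex count stays constant.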
The normal locus $I(V, q)$ is interesting for sampling since it contains at least one point on every connected component of $V$. Next we consider a very important invariant of the embedding $V \subset \mathbb{R}^N$ of a variety, called the reach of $V$. For $p \in \mathbb{R}^N$, let $\pi_V(p)$ denote the set of points $x \in V$ such that $\|x - p\| = \min_{x' \in V} \|x' - p\|$. Also, for $r \ge 0$ let $V_r$ denote the tubular neighborhood of $V$ of radius $r$, that is, the set of points $x \in \mathbb{R}^N$ at distance less than $r$ from $V$. Then the reach $\tau_V$ of $V$ may be defined as
$$\tau_V := \sup\{r \ge 0 : |\pi_V(p)| = 1 \text{ for all } p \in V_r\}$$
For a point q ∈ V we have a local notion called the local reach, or the local feature size, which is defined as
$$\tau_V(q) = \sup\{r \ge 0 : |\pi_V(p)| = 1 \text{ for all } p \in B_q(r)\}$$
where $B_q(r) = \{p \in \mathbb{R}^N : \|p - q\| < r\}$. Then $\tau_V = \inf_{q \in V} \tau_V(q)$. If $V$ is compact then the reach is positive, and if we in addition assume that $\dim V > 0$, then $V$ is not convex and therefore the reach of $V$ is finite.
Let $\Delta_V = \{p \in \mathbb{R}^N : |\pi_V(p)| > 1\}$ and define the medial axis $M_V$ as the closure of $\Delta_V$. Let $U_V = \mathbb{R}^N \setminus M_V$ and note that the projection $\pi_V: U_V \to V$ is a well-defined map taking a point $p$ to the unique closest point of $V$. Also, the local reach $\tau_V(q)$ is the distance from $q \in V$ to the medial axis, and the reach $\tau_V$ is the distance from $V$ to the medial axis.

Example 2.3. Consider the variety defined by the polynomial

$$f(x, y) = x^4 - 2xy - 2(x^2 - y^2) + 5y + 1$$
Figure 2.7 illustrates its reach and the set $\Delta_V$, whose closure is its medial axis. The radius of the dashed circle equals the reach. The solid circles indicate other candidates for the reach.
Figure 2.7 – The reach and medial axis of an algebraic variety.
The reach can also be computed as the minimum of two quantities [1]: one derives from local considerations in terms of the curvature of $V$, and the other is a global quantity arising from the bottlenecks of $V$. More precisely, the global quantity is the radius of the narrowest bottleneck of $V$, defined as

$$\frac{1}{2} \inf\{\|x - y\| : (x, y) \in V \times V,\; x \ne y,\; y \in N_x V,\; x \in N_y V\}.$$

A bottleneck of $V$ is a pair $(x, y)$ with $x \ne y$, $x \in N_y V$ and $y \in N_x V$. The local quantity, which we call $\rho$, is the minimal radius of curvature on $V$, where the radius of curvature at a point $x \in V$ is the reciprocal of the maximal curvature of a geodesic passing through $x$.
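As a sanity check of this decomposition (our own numerical sketch under stated assumptions, not from the thesis): for the ellipse $x^2/a^2 + y^2/b^2 = 1$ with $a \ge b > 0$, the minimal radius of curvature is $b^2/a$, attained at $(\pm a, 0)$, while the narrowest bottleneck is the antipodal pair $(0, \pm b)$ with radius $b$; the reach is the smaller of the two.

```python
from math import sin, cos, pi

def ellipse_reach(a, b, n=100000):
    """Reach of the ellipse x^2/a^2 + y^2/b^2 = 1 (a >= b > 0), as the
    minimum of the local quantity rho (minimal radius of curvature) and
    the global quantity (narrowest bottleneck radius, here b for the
    antipodal pair (0, b), (0, -b))."""
    # radius of curvature at parameter t: (a^2 sin^2 t + b^2 cos^2 t)^(3/2) / (a*b)
    rho = min(((a * sin(t))**2 + (b * cos(t))**2) ** 1.5 / (a * b)
              for t in (2 * pi * i / n for i in range(n)))
    bottleneck_radius = b
    return min(rho, bottleneck_radius)
```

For $a = 2$, $b = 1$ this gives $\tau_V = b^2/a = 0.5$: the curvature at the ends of the major axis, not the bottleneck through the center, determines the reach. For the circle $a = b$ the two quantities coincide.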
Example 2.4. Consider the variety in $\mathbb{R}^2$ defined by the polynomial $f(x, y) = x^4 + y - 2(x^2 - y^2)$. Figure 2.8 illustrates the narrowest bottleneck of this variety and its tubular neighbourhood (dashed lines).
Figure 2.8 – The narrowest bottleneck and the tubular neighbourhood of a variety.
Weak feature size

The weak feature size can be seen as a generalization of the bottlenecks. For the convenience of the reader we recall the construction introduced in Paper D. For a set of points $x_1, \dots, x_k \in V$ to define a bottleneck of degree $k$ we first require there to be a point $y \in \mathbb{C}^N$ such that the Euclidean distance from $y$ to $x_i$ is the same for each $1 \le i \le k$, and that $y$ is in the normal space $N_{x_i} V$ for each $1 \le i \le k$. Moreover, we require $y$ to be critical in some sense. It turns out that a good notion of critical is, as explained in [17], that $y$ lies in the convex hull of $x_1, \dots, x_k$. This means that there exist real non-negative numbers $a_1, \dots, a_k$ such that $\sum_{i=1}^k a_i = 1$ and $y = \sum_{i=1}^k a_i x_i$.

Let $C := \{(y, r) \in \mathbb{C}^N \times \mathbb{C} \mid \|x_i - y\|^2 - r = 0 \text{ for } 1 \le i \le k\}$ and denote the closure of the image of the projection $C \to \mathbb{C}^N$ by $\rho(x_1, \dots, x_k)$. Then $\rho(x_1, \dots, x_k)$ is the set of centers of all $(N-1)$-spheres passing through the points $x_1, \dots, x_k$. Next, let $\mathrm{Conv}(x_1, \dots, x_k)$ denote the convex hull of the points $x_1, \dots, x_k$. Define the $k$-th bottleneck locus to be the following set:
$$B_k := \{(x_1, \dots, x_k) \in \mathbb{C}^{N \times k} \setminus \Delta \mid \textstyle\bigcap_{i=1}^k N_{x_i} V \cap \rho(x_1, \dots, x_k) \cap \mathrm{Conv}(x_1, \dots, x_k) \ne \emptyset\}$$
where $\Delta := \{(x_1, \dots, x_k) \mid x_i = x_j \text{ for some } i \ne j\} \subset \mathbb{C}^{N \times k}$. We refer to the elements of $B_2$ as the bottlenecks of $V$ and the elements of $B_k$ for $k > 2$ as the generalized bottlenecks of $V$. In the case $k = 2$ the conditions in the above definition simplify to algebraic conditions, which means that $B_2$ is an algebraic variety in $\mathbb{C}^N \times \mathbb{C}^N$. For $k > 2$ the set $B_k$ may not be algebraic, but if we restrict $B_k$ to real tuples $(x_1, \dots, x_k) \in \mathbb{R}^{N \times k}$ it is a semi-algebraic set. By restricting the sets $B_k$ to tuples in $\mathbb{R}^{N \times k}$ we can define a notion of narrowest bottleneck. Let $B_k^{\mathbb{R}}$ denote this restriction, and let $\ell_k: B_k^{\mathbb{R}} \to \mathbb{R}$ be the function that takes $(x_1, \dots, x_k)$ to the radius of the smallest $(N-1)$-sphere in $\rho(x_1, \dots, x_k)$. Define $b_k$ to be the minimal value (or infimum, in case $B_k^{\mathbb{R}}$ is not finite) of the image of $\ell_k$, called the radius of the narrowest bottleneck of degree $k$.
We now define the weak feature size as the following quantity:
$$\mathrm{wfs}(V) := \min_{2 \le k \le \mathrm{EDD}(V)} b_k$$

Computing the generalized bottlenecks of $V$ is in general much harder than computing $b_2$. It is therefore interesting to understand in which cases the equality $\mathrm{wfs}(V) = b_2$ holds.
Example 2.5. Let $f_1(x, y) = 3x^2 y - y^3 + 2(x^2 + y^2)(x^2 + y^2 - 1)$ and $f_2(x, y) = 10x^2 y^3 - 5x^4 y - y^5 + 5(x^2 + y^2)^2(x^2 + y^2 - 1)$. The curves $\{f_1 = 0\}$ and $\{f_2 = 0\}$ are illustrated in Figure 2.9 and are two examples of varieties in $\mathbb{R}^2$ for which the weak feature size, $\mathrm{wfs}(V)$, is smaller than the narrowest bottleneck radius $b_2$.