Sketching the of a Probability

Joachim Giesen Lars K¨uhne S¨orenLaue Friedrich-Schiller-Universit¨atJena

Abstract where (Bi) is a countable family of disjoint Borel sub- sets. The measure µ is finite if µ(Rd) < ∞ and it is a d We want to sketch the support of a probabil- probability measure if µ(R ) = 1. ity measure on Euclidean space from samples The support of a probability measure µ is the set that have been drawn from the measure. This d problem is closely related to certain manifold supp(µ) = {x ∈ R | µ(B(x, r)) > 0 for all r > 0}, learning problems, where one assumes that where B(x, r) is the closed ball with radius r that is the sample points are drawn from a manifold centered at x. Note that supp(µ) is always closed. that is embedded in Euclidean space. Here we propose to sketch the support of the prob- Given samples x1, . . . , xn drawn from µ, a natural ap- ability measure (that does not need to be a proach to sketch supp(µ) is to approximate it with a manifold) by some gradient flow complex, or union of balls centered at the sample points, i.e., by n more precisely by its Hasse diagram. The [ gradient flow is defined with respect to the Xα = B(xi, α), distance function to the sample points. We i=1 prove that a gradient flow complex (that can where B(xi, α) is the ball of radius α > 0 centered be computed) is homotopy equivalent to the at xi. The obvious problem with this approach is to support of the measure for sufficiently dense determine a good value for α. One contribution of our samplings, and demonstrate the feasibility of paper is a simple method for choosing such a good our approach on real world data sets. value for α.

The homotopy type of the union of balls Xα can be 1 Introduction computed from the nerve of the ball covering, i.e., the simplicial complex that is determined by the intersec- Our goal is to compute a sketch, i.e., an approxima- tion pattern of the balls, see for example [11] for an tion, of the support supp(µ) of a probability measure introduction to computational topology. The latter complex is known as the Cech complex of the union. µ on Rd from a finite set of sample points that are drawn from µ. More specifically, we want to compute A smaller simplicial complex from which the homo- a complex (not necessarily geometrically realized) that topy type of the union of balls can also be computed has the same homotopy type as supp(µ). is the α-complex which is the nerve of the ball cover- ing where the balls are restricted to the Voronoi cells of Formally, a probability measure and its support are their centers. Yet another complex from which the ho- defined as follows. motopy type can be computed is the α-flow complex. This complex is even smaller than the α-complex, i.e., Probability measure. A non-negative measure µ does contain a smaller number of cells, but it is no on Rd is an additive function that maps every Borel longer simplicial. The flow complex can be derived d B ⊆ R to R≥0. Additivity means that from the distance function to the sample points. It ! contains a cell for every critical point of the distance [ X function. We show that the critical points of the dis- µ Bi = µ(Bi), i∈ i∈ tance function are either close to supp(µ) or close to a N N dual structure of supp(µ) that is called the medial axis Appearing in Proceedings of the 17th International Con- of supp(µ). If the support of µ does not exhibit several ference on Artificial Intelligence and Statistics (AISTATS) geometric scales, then the critical points that belong to 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copy- supp(µ) and the critical points that belong to the me- right 2014 by the authors. dial axis can be separated by simple thresholding, i.e.,

257 Sketching the Support of a Probability Measure all critical points at which the distance function takes the data points that can be computed efficiently also in values less than the threshold value belong to supp(µ) high dimensions, but does not come with approxima- and the remaining critical points belong to the medial tion guarantees. Laplacian SVMs have been shown to axis. Restricting the flow complex to the critical points achieve state of the art performance in semi-supervised with distance function values less than α constitutes classification. the α-flow complex. If α is a threshold value at which Our approach here is more abstract. We are also in the the two types of critical points can be separated, then probabilistic setting, but do not assume that the sup- the α-flow complex is homotopy equivalent to supp(µ) port of the probability measure from which the sam- for sufficiently dense samplings. ples are drawn is a manifold. Still, also in this setting We have computed α-flow complexes for real data sets we can provide topological reconstruction guarantees. and all values for α in the [0, ∞). On these data sets we have not observed geometric multi-scale 2 Distance Function behavior. Hence, in these situations the simple thresh- olding was enough to compute a complex that is homo- Here we briefly review the theory of distance functions topy equivalent to supp(µ) for sufficiently dense sam- to a compact set that has been developed within the plings. We report details on the data sets and our fields of differential and computational geometry over findings at the end of this paper. the last years [14, 10, 22, 12, 9, 16, 8, 13, 4]. In the following let K always denote a compact subset of Rd. Related work. Our work is related to certain mani- Distance function. The distance function d to fold learning problems that we briefly summarize here. K the compact set K assigns to any point x ∈ d its In machine learning manifold learning is often used R distance to K, i.e., synonymously with non-linear dimensionality reduc- tion, but there is also quite some work (mostly in com- d dK : R → [0, ∞), x 7→ min kx − yk. putational geometry) that aims at learning a manifold y∈K from samples (that need to satisfy certain conditions), where learning a manifold refers to computing an ap- The function dk characterizes K completely since K = d−1(0). proximation from a finite sampling that is guaranteed K to be topologically equivalent and geometrically close d to the manifold. Exemplary for this line of work is Gradient. For any point x ∈ R let the technique by Boissonnat and Ghosh [3]. The body N (x) = {p ∈ K : kx − pk = d (x)} of work in computational geometry does not consider K K the probabilistic setting where the sample points are be the set of nearest neighbors in K and let c(x) be drawn at random from the manifold. The probabilistic the center of the smallest enclosing ball of NK (x). The setting was first considered by Niyogi et al. [19] who gradient of the distance function at x is given as show how to compute the homology of a randomly sampled manifold with high confidence. Later Niyogi x − c(x) ∂K (x) = , if x 6= c(x), et al. [20] have extended this approach for recover- dK (x) ing the geometric core of Gaussian noise concentrated around a low dimensional manifold, i.e., to the case and 0 otherwise. The norm of the gradient is always where the samples are not necessarily drawn from the upper bounded by 1, i.e., k∂K (x)k ≤ 1. manifold itself. This can be seen as a topological ap- proach to unsupervised learning. Medial axis. The medial axis of K is the following set Manifold learning plays an important role in semi-  d supervised classification, where a manifold assump- ma(K) = x ∈ R \ K k∂K (x)k < 1 , tion, see [1], can be used to exploit the availability of unlabeled data in classification tasks. The assump- i.e., the set of all center points of maximal empty open tion requires that the support of the marginal prob- balls in the complement Rd \ K of K. ability distribution underlying the data is a manifold (or close to a manifold). The manifold assumption for Reach. The reach of K is defined as semi-supervised learning has been exploited by Belkin inf kx − yk. et al. [1] in form of a support vector machine with x∈K, y∈ma(K) an additional Laplacian regularization term (Laplacian SVM), see also [17]. For Laplacian SVMs the manifold If the reach of K is positive, then we say that K has is approximated just by some neighborhood graph on finite reach.

258 Joachim Giesen, Lars K¨uhne,S¨orenLaue

Figure 1: Example flow line in two dimensions for a set X of five sample points. Shown on the left is the starting point x of the flow line together with the starting flow direction that is determined by the smallest enclosing ball of NX (x), which here is just a sample point. Then, shown from left to right are the points where the set NX (x), and thus the direction of flow, changes. The flow line ends (shown on the right) in a local maximum of the distance function, i.e., x is contained in the convex hull of NX (x).

The approximation guarantees for our sketch, i.e., pre- 3 Flow Complex serving the homotopy type, are based on a reconstruc- tion theorem for compact of Euclidean space The flow complex of a finite point set X ⊂ Rd is a cell that has been proved in [5]. This theorem topologi- complex that contains a cell for every critical point of cally relates the off-sets of two compact sets that need the distance function to X. We show that a properly to satisfy certain conditions. chosen sub-complex of the flow complex of X, namely some α-flow complex, provides a homotopy equivalent d reconstruction of the support of a probability measure α-offset. For any set K ⊂ R and α > 0 let Kα be the Minkowski sum of K and B(0, α), i.e., µ given that X is a sufficiently dense sampling drawn from µ.  d 0 0 Kα = x ∈ R | x ∈ B(x , α), x ∈ K . Critical points. Let K be a compact subset of Rd. d The points x ∈ R with ∂K (x) = 0, i.e., the points Now we are equipped with the necessary definitions to for which x = c(x), are called the critical points of the state the reconstruction theorem. distance function dK (cf. the definition of the gradient of dK in the previous section). Critical points x of Theorem 1. [Chazal et al.] Let ρ > 0 be the reach dK are always contained in the convex hull of their 0 d of K and let K ⊂ R be a compact set such that the neighbors in K, i.e., in conv(NK (x)). Note that all the 0 ρ Hausdorff distance between K and K is less than 17 , critical points of the distance function dK are either 0 ρ d 0 i.e., dH (K,K ) < 17 , then the complement R \ Kα of points of K, or are contained in the medial axis ma(K). K0 is homotopy equivalent to the complement d \ K α R If K is a finite point set X, then a meaningful in- of K, and K0 is homotopy equivalent to K for all α η dex i(x) can be assigned to any critical point x of d , sufficiently small η > 0, provided that X namely the dimension of the affine hull of NX (x). Crit- ical points of index 0 are the points in X, i.e., the min- 4 · d (K,K0) ≤ α ≤ ρ − 3 · d (K,K0). H H ima of the distance function. Critical points of index d are maxima of dX , and all other critical points are saddle points of dX . All critical points with positive As we have already indicated in the introduction, of- index are contained in ma(X). ten K0 is a finite sampling of K. Let X be such a d 0 Flow complex. Let X ⊂ R be a finite point set. finite sampling, then Xα = Kα is a union of balls with The flow induced by the gradient vector field ∂X is a radius α that is homotopy equivalent to Kη for all suffi- mapping ciently small η > 0, if the Hausdorff distance between d d φX : [0, ∞) × → Xα and K is small, and α is in the range given by R R Theorem 1. Note that the practical problem of choos- defined by the equations φX (0, x) = x and ing a good value for α, i.e., a value that falls into the φX (t, x) − φX (t0, x) range specified by Theorem 1 still remains open. In lim = ∂X (φX (t0, x)). t↓t0 t − t the following we address this problem by considering 0 the critical points of the distance function to the set The set φX (x) = {φX (t, x) | t ≥ 0} is called the flow X ⊂ K. line of the point x, see Figure 1. The stable manifold

259 Sketching the Support of a Probability Measure

S(x) of a critical point x is the set of all points in Rd that flow into x, i.e.,

d S(x) = {y ∈ : lim φX (t, y) = x}. R t→∞

The dimension of S(x) is given by the index i(x). The flow complex is given by the stable manifolds of all critical points together with the following incidence information that is defined using the unstable mani- Figure 2: On the left: A finite point set of four points folds of critical points. Given a neighborhood U of a (in black) in two dimensions together with the critical critical point x and setting points of its distance function (index-0 (black), index- 2 (light blue), and index-2 (pink)). The arrows repre- V (U) = sent flow lines that witness the incidence relationships d between the critical points. On the right: The Hasse {y ∈ R | ∃z ∈ U, t ≥ 0 such that φX (t, z) = y}, diagram of the flow complex. the unstable manifold of x is the set \ U(x) = V (U). 4 Topological Guarantees neighborhood U of x In this section we specify conditions under which an α- The stable manifold of a critical point y is incident to flow complex of the sample points in X = {x1, . . . , xn} the stable manifold of a critical point x if drawn from µ is homotopy equivalent to supp(µ). We do so by using the following lemma, see [6](Lemma S(x) ∩ U(y) 6= ∅, 5.1). i.e., if there is a point in the unstable manifold of y that flows into x. The incidence structure on the stable Lemma 2. Given a sequence of sample points manifolds of the critical points is a binary relation that x1, . . . , xn drawn independently from a probability is measure µ on Rd. Then, for every ε > 0 and any x ∈ supp(µ),

1. reflexive, because S(x) ∩ U(x) = {x} for any crit-   lim P kx1(x) − xk > ε = 0, ical point x. n→∞

2. antisymmetric, because S(x) ∩ U(y) 6= ∅ and where x1(x) is the nearest neighbor of x in S(y) ∩ U(x) 6= ∅ implies x = y. {x1, . . . , xn}.

3. transitive, because S(x)∩U(y) 6= ∅ implies U(x) ⊆ An immediate consequence of this lemma is the fol- U(y), and hence if x is incident to z, i.e., S(z) ∩ lowing corollary. U(x) 6= ∅, then also y is incident to z. Corollary 3. Given a sequence of sample points Hence, the combinatorial structure of the flow complex x1, . . . , xn drawn independently from a probability induces a partial order on the set of stable manifolds, measure µ with compact support on Rd. Then, for or the critical points of dX , respectively, that can be every ε > 0, encoded in a Hasse diagram, see Figure 2.    lim P dH supp(µ),X > ε = 0, α-flow complex. For α ≥ 0, the α-flow complex of n→∞ d of a finite point set X ⊂ R is the Hasse diagram of  the flow complex restricted to the critical points x of where dH supp(µ),X is the Hausdorff distance be- tween supp(µ) and X = {x1, . . . , xn}. dX for which dX (x) ≤ α.

In [7] it has been shown that the union of balls Xα and the α-flow complex of the finite point set X are Proof. Since supp(µ) is compact there is a finite set homotopy equivalent. Hence, also the α-flow complex of points y1, . . . , ym ∈ supp(µ) such that the union Sm of X is homotopy equivalent to Kη for small η > 0, of balls i=1 B(yi, ε/2) covers supp(µ). Assume there if the Hausdorff distance between Xα and K is small, exists y ∈ supp(µ) such that minx∈X kx − yk > ε. By and α is in the range given by Theorem 1. construction there exists yi such that ky − yik ≤ ε/2,

260 Joachim Giesen, Lars K¨uhne,S¨orenLaue

and thus minx∈X kx − yik > ε/2. It follows that for every ε > 0. Hence, " #    lim P 4 · dH supp(µ),X > α = 0, P sup min kx − yk > ε n→∞ x∈X y∈supp(µ) and   ≤ P max min kx − yk > ε/2   lim P ρ − 3 · dH supp(µ),X < α] = 0, y∈{y1,...,ym} x∈X n→∞ m   X which implies the claim about the homotopy equiv- ≤ P min kx − yik > ε/2 x∈X Sn i=1 alence of i=1 B(xi, α) and suppη(µ) and hence the m claim of the theorem. X   = P min kx1(yi) − yik > ε/2 , x∈X i=1 5 Choosing a good value for α where the second inequality follows from a simple union bound. From Lemma 2 we have In this section we prove a theorem that allows one to chose a good value for α in practice. The theorem lim P [kx1(yi) − yik > ε/2] = 0, states that the critical points of the distance function n→∞ to the set X = {x1, . . . , xn} of sample points can be for all yi, and thus partitioned into two subsets. The first set contains the critical points that are close to supp(µ), and the " # second set contains the critical points that are close lim P sup min kx − yk > ε = 0, to the medial axis ma(µ) of supp(µ), i.e., there are no n→∞ y∈supp(µ) x∈X critical points in the complement of supp(µ) ∪ ma(µ) which implies the claim on the Hausdorff distance since or more precisely in we also have x ∈ supp(µ) for all sample points and i compl (µ) thus ε  max inf kx − yk = 0. = conv(supp(µ)) \ suppε(µ) ∪ maε(µ) x∈{x1,...,xn} y∈supp(µ) for any small enough ε > 0. Hence, for large samplings only the critical points with small distance values are Now we are prepared to state and prove our topological relevant for sketching supp(µ). approximation guarantees. Theorem 5. Given a sequence of sample points x , . . . , x drawn independently from a probability Theorem 4. Given a sequence of sample points 1 n measure µ with compact support on d. If the reach ρ x , . . . , x drawn independently from a probability R 1 n of the support is positive, then, for every 0 < ε < ρ/2, measure µ with compact support on Rd whose reach ρ is positive. Then, for every 0 < α < ρ and sufficiently   lim P complε(µ) contains a critical point of dn = 0, small η > 0, n→∞ d  where dn : → is the distance function to the set lim P the α-flow complex of {x1, . . . , xn} R R n→∞ X = {x1, . . . , xn}. is not homotopy equivalent to suppη(µ) ] = 0. Proof. Let (xn) be a sequence of points in supp(µ) Proof. Since the α-flow complex of X = {x1, . . . , xn} such that cn ∈ complε(µ) is a critical point of dn, i.e., is homotopy equivalent to the union of balls the distance function to the first n points of the se- B(xi, α), i = 1, . . . , n it suffices to show that quence. Since the closure of complε(µ) is compact we can assume by turning to an appropriate subsequence " n [ that the sequence (cn) converges to c ∈ complε(µ). By lim P B(xi, α) is not homotopy equivalent n→∞ the same argument we can even assume that all the cn i=1 i have the same index i ∈ {1, . . . , d}. Let y0n, . . . , yin be to suppη(µ) = 0. the points in N(cn) ⊂ X such that cn is the center of the smallest enclosing ball of {y0n, . . . , yin}, i.e., this For that we check that α satisfies the conditions of the ball is given as B(cn, kcn − y0k) and does not contain reconstruction theorem (Theorem 1). By Corollary 3, any point from X in its interior. By the compactness of supp(µ) we can assume that the sequence (y ) con-    jn lim P dH supp(µ),X > ε = 0 n→∞ verges to yj ∈ supp(µ). Since cn is the center of the

261 Sketching the Support of a Probability Measure

Figure 3: First three rows of the scatter plot matrix of the 14-dimensional embedding of the MovieLens data set.

smallest enclosing ball of the points y0n, . . . , yin it can In practice we expect that the number of critical points be written as a convex combination of these points, whose distance value is at most α ≥ 0 is increasing fast i.e., with growing α for small values of α. Once α is large i X enough such that all the critical points that belong cn = λjnyjn to supp(µ) have been found, the number of critical j=0 points remains constant for growing α, and is only in- with creasing again once the critical points that belong to i ma(µ) are being discovered. There are two things one X λjn = 1 and λjn ≥ 0, j = 0, . . . , i. should bear in mind though. First, this behavior is j=0 only expected if supp(µ) does not exhibit geometric features on different scales, because otherwise critical That is, the vector λn = (λ0n, . . . , λin) is from the points that belong to the medial axis can be discov- i-dimensional standard simplex which is compact. ered before critical points that belong to the support, Hence by turning to yet another subsequence we can and second, by construction the medial axis ma(µ) is assume that λn converges in the standard simplex. sampled much more sparsely by critical points than Let λ = (λ0, . . . , λi) be the limit of (λn), then we supp(µ). Hence, if supp(µ) does not exhibit geometric Pi have c = j=0 λjyj and thus c is the center of the features on different scales, then we expect the number smallest enclosing ball B(c, kc − y0k) of the points of critical points to grow at first with growing α and y0, . . . , yi. If B(c, kc − y0k) does not contain any point to remain almost constant once all the critical points from supp(µ) in its interior, then c must be a point that belong to supp(µ) have been discovered. A good of the medial axis ma(µ) which is impossible since value for α should be the point at which the number the points cn ∈ complε(µ) are at distance at least of critical points stays almost constant. ε from the medial axis, and hence (cn) can not con- verge to c. Thus, B(c, kc − y0k) must contain a point z ∈ supp(µ) in its interior, i.e., there exists δ > 0 such 6 Implementation that B(z, δ) ⊂ B(c, kc − y0k). Since cn converges to c and the radii kcn−y0nk converge to the radius kcn−y0k We have designed and implemented an algorithm for we also have B(z, δ) ⊂ B(cn, kcn − y0nk) for n large computing the Hasse diagram of the whole flow com- enough, and thus limn→∞ kx1(z) − zk ≥ δ. That is, plex. The experimental results that we report in the for n large enough the event following section have been obtained using this imple- mentation.   complε(µ) contains a critical point of dn There is a straightforward way to distribute our algo-   implies the event kx1(z) − zk ≥ δ . Hence, rithm if we are only interested in computing the α-flow   complex for small values of α > 0, e.g., when sketch- lim P complε(µ) contains a critical point of dn > 0. n→∞ ing the support supp(µ) of a probability measure µ on d from a given finite sampling {x , . . . , x } drawn implies that R 1 n independently from µ. The idea for distributing the   lim P kx1(z) − zk ≥ δ > 0 for z ∈ supp(µ), algorithm is based on the following simple observation n→∞ which is implied by the triangle inequality. which contradicts Lemma 2. Thus we have Observation 6. For any α > 0, if c is a criti-   cal point of the distance function d : d → to lim P complε(µ) contains a critical point of dn = 0. n R R n→∞ the set {x1, . . . , xn} whose distance value dn(c) is at most α and whose nearest neighbor set N(c) contains

262 Joachim Giesen, Lars K¨uhne,S¨orenLaue

Figure 4: The number of critical points of the α-flow complex as a function of α for the first three to six dimensions (from top-left to bottom-right) for the MovieLens data set. Note that in dimension d we can only have critical points of index up to d.

xi, then N(c) is contained in the ball B(xi, 2α), i.e., the map- and reduce step computes the α-flow complex N(c) ⊂ B(xi, 2α). of {x1, . . . , xn}.

The distributed algorithm can now be implemented through the following map- and reduce steps. Proof. We need to argue that the algorithm finds all

Map. For every xi ∈ X = {x1, . . . , xn} let Xi = critical points of the distance function dn and connects B(xi, 2α)∩{x1, . . . , xn}. For i = 1, . . . , n, compute the them in the right way. By Observation 6 the α-flow α-flow complex for Xi. This can be done by computing complex of Xi does contain any critical point c of the the whole flow complex, i.e., the ∞-flow complex, for distance function dn with xi ∈ N(c) whose distance Xi and removing all critical points with distance value function value is at most α. Hence, any critical point larger than α. of dn with distance value at most α is contained in the union of the α-flow complexes of the sets Xi, where Reduce. Let G = (V,E) be the graph whose ver- they are also connected in the right way. tex set is V = [n] = {1, . . . , n} and whose edge set  is E = (i, j) ∈ [n] × [n] | Xi ∩ Xj 6= ∅ . Combine the α-flow complexes for the sets Xi by traversing the 7 Experiments and Discussion connected components of the graph G in a breadth- first manner. Note that the α-flow complex is itself We have tested our approach on the publicly avail- a graph, namely a Hasse diagram. The combination able data sets Movielens, Body Dimensions and of two α-flow complexes is achieved by identifying all Ecoli that we describe in the following. These data common vertices in the respective Hasse diagrams. sets either reside or can be embedded, respectively, in Theorem 7. The distributed algorithm that comprises medium dimensional Euclidean space.

263 Sketching the Support of a Probability Measure

Figure 5: The number of critical points of the α-flow complex as a function of α for the Body Dimensions data set restricted to the first five dimensions (on the left) and the Ecoli data set (on the right).

MovieLens. The MovieLens 100k data set [21] was dimension of the data set is four since almost no criti- collected by the GroupLens Research Project at the cal points of index six and only very few critical points University of Minnesota. It consists of 100,000 ratings of index five can be found. from 943 users on 1,682 movies. The data set can be viewed as an incomplete matrix that is indexed by the Body Dimensions. The Body Dimensions data users and the movies, respectively, where the matrix set [15] contains 507 points with 21 attributes (exclud- entries are the ratings. Therefore, the MovieLens data ing four nominal attributes) that represent measure- set is not a point cloud data set itself, but it is straight- ments of the human body. The first nine attributes forward to derive point clouds for the movies and for are skeletal measurements, whereas the latter 12 are the users, respectively, from the completed ratings ma- girth measurements. The first five skeletal measure- trix using principal component analysis. We used a ments regard the body’s torso. Here we restricted our- technique by Bell et al. [2] (called ComputeNextFac- selves to these first five dimensions. Figure 5 (on the tor) for completing the ratings matrix that at the same left) shows the number of critical points as a function time computes a low dimensional spectral embedding of the value α for this data set. The function again for the movies, i.e., every point in the low dimensional looks like expected. We observe a fast increase in the point cloud corresponds to a movie. Using this tech- number of critical points up to a threshold value for α nique we created a 14-dimensional embedding of the which is ≈ 2. 1682 movies. Figure 3 shows three rows out of a scatter plot matrix visualization of the data set, with scatter Ecoli. The Ecoli data set [18] contains 336 points in plots of the first three dimensions against all 14 di- eight dimensions. From these dimensions we removed mensions. As can be seen, the trailing dimensions are two binary attributes and the sequence number and correlated with the leading three dimensions and thus considered only the remaining five metric (Euclidean) contribute less geometric information than the leading dimensions. Figure 5 (on the right) shows the number dimensions. We therefore computed α-flow complexes of critical points as a function of the value α for this only for the data sets in three to six dimensions. Fig- data set. Again, this function looks like expected. The ure 4 shows the number of critical points as a func- threshold value for α here is ≈ 1.75. tion of the value α. The functions look like expected, namely we observe a fast increase in the number of crit- Acknowledgements ical points up to a threshold value for α. Beyond the threshold value the number of critical points stays al- This work has been funded by the Computational Ge- most constant. Note that the threshold value increases ometric Learning (CG Learning) project. The project with the dimension from ≈ 2 in three dimensions to CG Learning acknowledges the financial support of the ≈ 3 in six dimensions. This increase is expected since Future and Emerging Technologies (FET) programme the distances between the points that represent the within the Seventh Framework Programme for Re- movies also increase with the dimension. Another in- search of the European Commission, under FET-Open teresting observation is the following: the plots in Fig- grant number: 255827”. ure 4 for Dimensions 5 and 6 indicate that the intrinsic

264 Joachim Giesen, Lars K¨uhne,S¨orenLaue

References [11] Herbert Edelsbrunner and John Harer. Compu- tational Topology - an Introduction. American [1] Mikhail Belkin, Partha Niyogi, and Vikas Sind- Mathematical Society, 2010. hwani. Manifold regularization: A geometric [12] Joachim Giesen and Matthias John. The flow framework for learning from labeled and unla- complex: a data structure for geometric model- beled examples. Journal of Machine Learning Re- ing. In Annual ACM-SIAM Symposium on Dis- search, 7:2399–2434, 2006. crete Algorithms (SODA), pages 285–294, 2003. [2] Robert Bell, Yehuda Koren, and Chris Volinsky. [13] Joachim Giesen, Edgar A. Ramos, and Bardia Modeling relationships at multiple scales to im- Sadri. Medial axis approximation and unstable prove accuracy of large recommender systems. In flow complex. In Symposium on Computational Proceedings of the ACM SIGKDD International Geometry, pages 327–336, 2006. Conference on Knowledge Discovery and Data Mining (KDD), pages 95–104. ACM, 2007. [14] K. Grove. Critical point theory for distance func- tions. In Differential geometry: Riemannian ge- [3] Jean-Daniel Boissonnat and Arijit Ghosh. Man- ometry, volume 54 of Proc. Sympos. Pure Math., ifold reconstruction using tangential Delaunay pages 357–385. Amer. Math. Soc., 1993. complexes. In Proceedings of the ACM Sympo- sium on Computational Geometry (SOCG), pages [15] Grete Heinz, Louis J. Peterson, Roger W. 324–333, 2010. Johnson, and Carter J. Kerk. Exploring relation- ships in body dimensions. Journal of Statistics [4] Fr´ed´ericCazals and David Cohen-Steiner. Recon- Education, 11(2), 2003. Data set available at structing 3D compact sets. Computational Ge- http://www.amstat.org/publications/jse/ ometry: Theory and Applications, 45(1-2):1–13, v11n2/datasets.heinz.html. 2012. [16] Andr´eLieutier. Any open bounded subset of rn [5] Fr´ed´ericChazal, David Cohen-Steiner, and Andr´e has the same homotopy type as its medial axis. Lieutier. A sampling theory for compact sets Computer-Aided Design, 36(11):1029–1046, 2004. in Euclidean space. In Symposium on Compu- [17] Stefano Melacci and Mikhail Belkin. Laplacian tational Geometry (SOCG), pages 319–326, 2006. support vector machines trained in the primal. Journal of Machine Learning Research, 12:1149– [6] Luc Devroye, Laszlo Gy¨orfi,and Gabor Lugosi. 1184, 2011. A Probabilistic Theory of Pattern Recognition. Springer, 1991. [18] Kenat Nakai. Ecoli data set, 1996. Data set available at http://archive.ics.uci.edu/ml/ [7] Tamal K. Dey, Joachim Giesen, and Matthias datasets/Ecoli. John. Alpha-shapes and flow shapes are homo- topy equivalent. In Annual ACM Symposium on [19] Partha Niyogi, Stephen Smale, and Shmuel Wein- Theory of Computing (STOC), pages 493–502, berger. Finding the Homology of Submani- 2003. folds with High Confidence from Random Sam- ples. Discrete & Computational Geometry, 39(1- [8] Tamal K. Dey, Joachim Giesen, Edgar A. Ramos, 3):419–441, 2008. and Bardia Sadri. Critical points of the dis- tance to an epsilon-sampling of a surface and flow- [20] Partha Niyogi, Stephen Smale, and Shmuel Wein- complex-based surface reconstruction. In Sympo- berger. A Topological View of Unsupervised sium on Computational Geometry (SOCG), pages Learning from Noisy Data. SIAM Journal on 218–227, 2005. Computing, 40(3):646–663, 2011.

[9] Herbert Edelsbrunner. Surface Reconstruction by [21] GroupLens research group University of Min- Wrapping Finite Sets in Space. In The Goodman- nesota. Movielens data set, 2011. Data set avail- Pollack Festschrift (Algorithms and Combina- able at http://www.grouplens.org/node/73. torics), pages 379–404. Springer, 2003. [22] Dirk Siersma. Voronoi diagrams and morse the- ory of the distance function. In Geometry in [10] Herbert Edelsbrunner, Michael Facello, and Jie Present Day Science, pages 187–208. World Sci- Liang. On the Definition and the Construction entific, 1999. of Pockets in Macromolecules. Discrete Applied Mathematics, 88:83–102, 1998.

265