The World is not always Flat or

Learning Curved Manifolds

Evgeni Begelfor and Michael Werman

Abstract. Manifold learning and finding low-dimensional structure in data is an important task. Most algorithms for this purpose embed the data in Euclidean space, an approach which is destined to fail on non-flat data. This paper presents a non-iterative algebraic method for embedding the data into hyperbolic and spherical spaces. We argue that these spaces are often better than Euclidean space at capturing the geometry of the data. We also demonstrate the utility of these embeddings by showing how some of the standard data-handling algorithms translate to these curved manifolds.

1 Introduction and motivation

Embedding data in a low-dimensional space (also known as manifold learning) is important for many applications such as visualization, data manipulation (smoothing, noise reduction, etc.) and data exploration (clustering, etc.). Most algorithms tackle the problem by searching for an embedding into a Euclidean space $\mathbb{R}^d$ such that $d$ is as small as possible, yet the data maintains some of its geometry (either locally or globally). The main advantage of Euclidean space is that it is simple and well understood. However, the inherent geometry of the data is often not flat, so any embedding in $\mathbb{R}^d$ will necessarily have a large distortion. For example, embedding a regular tree with the shortest-path metric in Euclidean space requires high dimension (Fig. 1). This is because the volume of a ball in $\mathbb{R}^d$ grows only polynomially, thus not having "enough space" for the tree. On the other hand, the volume in hyperbolic space grows exponentially, allowing a good embedding. For another example, points on a sphere, such as cities on Earth with their natural distances, will likewise be distorted by an embedding in Euclidean space (Figure 2).

This paper suggests an efficient non-iterative algorithm for choosing the "right" space and embedding data points in it. We also provide a method of computing distances, means and covariances in these spaces, which are useful for data manipulation and understanding. Our algorithm can be used either for embedding data given its pairwise dissimilarities, or for dimensionality reduction through classical metric MDS. A number of papers have discussed the advantages of visualizing data using $\mathbb{H}^2$ or $\mathbb{H}^3$ [24, 12, 14]. Our algorithm enhances the embedding of the data for these visualizations.

School of Engineering and Computer Science, Hebrew University of Jerusalem, Jerusalem 91904, Israel


Figure 1: The distortion of embedding the nodes of a tree, using the pairwise path length as a metric, versus the dimension of the embedding space (classical Euclidean MDS vs. curved MDS). Euclidean embedding requires much higher dimension than the curved one.

Figure 2: Embedding cities with their airline distances. The distortion of the embedding is shown versus the curvature of the host space. The best curvature is $K = 1/R^2$, where $R$ is the radius of the Earth.

The paper will review relevant properties of $\mathbb{S}^m$ and $\mathbb{H}^m$, then we will show how to choose the "best" space and embed the data into it, based on a theorem of Blumenthal, Menger and Schoenberg [18, 2]. We will then show how k-means, EM and mean-shift can be effectively computed in these spaces. The clustering is based on the possibility of computing geodesic distances, means and covariances in these spaces.

2 Previous work

Embedding data in a low-dimensional space has two variants. In dimensionality reduction or manifold learning we want to embed high-dimensional data in a lower-dimensional space while keeping some properties of the points' geometry, either globally or locally. In the second variant, called Multidimensional Scaling (MDS) (see [5] for a general introduction), we only have the points' pairwise dissimilarity measures, and our goal is to embed the data according to them.

There are numerous approaches to MDS and we will briefly mention a few. Classical metric MDS (sometimes referred to as PCA, principal component analysis) [22] aims to minimize the $L_2$ norm of the differences between the inter-point distances given by the dissimilarity matrix and those calculated between the embedded points. The main advantage of classical metric MDS is that it is non-iterative and guaranteed to minimize the $L_2$ norm functional. Non-classical metric MDS, for example [17], minimizes different functionals at the price of using iterative methods that are both computationally expensive and not guaranteed to converge to a global minimum. Non-metric MDS [20, 11] tries to preserve the order of the distances instead of their values. Often the metric space data (pairwise distances) has some missing values. There are a few methods to "fill in" the missing values, such as using geodesic distances and SDE [25].

One of the methods for solving the dimensionality reduction problem is to reduce it to MDS by constructing the inter-point dissimilarity matrix. PCA can be looked upon as MDS using the global Euclidean distance. Isomap [21] tries to capture the geometry of the data by taking "geodesic" distances between data points. Some non-metric methods for dimensionality reduction are LLE [16], HLLE [7] and Laplacian eigenmaps [1], which try to preserve relationships in the local neighborhoods of the data. All these methods work well if the data lies on or close to a manifold which is isometric to a subset of $\mathbb{R}^d$. Learning of curved manifolds has been discussed in [23], although that paper addresses non-flatness arising from the sampling of the manifold, and not from its own geometry.

There have been papers related to embedding finite metric spaces into hyperbolic spaces [19], especially for visualization [12, 24], spherical spaces [6] and general non-linear manifolds. This is a highly non-convex problem and all solutions to date have been based on iterative methods that are prone to local minima. To emphasize this point, we performed the following experiment: we took 100 random points in the hyperbolic space, computed the distance matrix and then tried to find the embedding using iterative methods. Even in this case, when the distortion of the best embedding is zero, brute-force optimization results in an embedding with high distortion.

3 Geometrical Background

There are three types of Riemannian spaces with constant (sectional) curvature: spherical (curvature $K > 0$), Euclidean (curvature $K = 0$) and hyperbolic (curvature $K < 0$).

3.1 Spherical Geometry

We model the $m$-dimensional sphere with curvature $K > 0$ (footnote 1) in the following way:
$$\mathbb{S}_K^m = \Big\{ x \in \mathbb{R}^{m+1} : \sum_{i=1}^{m+1} x_i^2 = \tfrac{1}{K} \Big\}$$
with the distance
$$d(x, y) = \frac{1}{\sqrt{K}} \arccos\big(K \langle x, y \rangle\big),$$
where $\langle x, y \rangle$ is the standard Euclidean inner product.

3.2 Hyperbolic Geometry

There are different models of the hyperbolic space; in this paper we shall use the hyperboloid model. For visualization purposes [14, 24] the Klein or Poincaré disk models may be more appropriate. On $\mathbb{R}^{m+1}$ we define a bilinear form in the following way:
$$\langle x, y \rangle = x_{m+1} y_{m+1} - \sum_{i=1}^{m} x_i y_i .$$
The hyperboloid consisting of all vectors $x$ such that $\langle x, x \rangle = -\tfrac{1}{K}$ has two sheets. We take $\mathbb{H}_K^m$ to be the one containing the points with positive $x_{m+1}$, with the distance
$$d(x, y) = \frac{1}{\sqrt{-K}}\,\operatorname{arccosh}\big(\!-K \langle x, y \rangle\big),$$
where $K < 0$ and $\operatorname{arccosh}$ is the inverse function of $\cosh$.

3.3 Euclidean Geometry

The well-known Euclidean space $\mathbb{R}^m$ can also be seen as the limit of the hyperbolic spaces $\mathbb{H}_K^m$ or the spherical spaces $\mathbb{S}_K^m$ as $K$ tends to zero.
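To make the two metrics concrete, here is a minimal NumPy sketch of both distance functions, written for points stored in ambient coordinates of $\mathbb{R}^{m+1}$; the function names and numerical guards are our own additions, not part of the paper.

```python
import numpy as np

def spherical_dist(x, y, K):
    """Geodesic distance on S^m_K; points satisfy <x, x> = 1/K, with K > 0."""
    c = np.clip(K * np.dot(x, y), -1.0, 1.0)        # guard against round-off
    return np.arccos(c) / np.sqrt(K)

def hyperbolic_dist(x, y, K):
    """Geodesic distance on H^m_K (hyperboloid model); <x, x> = -1/K, with K < 0.
    The Lorentzian form is <x, y> = x_{m+1} y_{m+1} - sum_i x_i y_i."""
    form = x[-1] * y[-1] - np.dot(x[:-1], y[:-1])
    c = np.maximum(-K * form, 1.0)                  # arccosh needs an argument >= 1
    return np.arccosh(c) / np.sqrt(-K)

# Sanity check: both distances approach the planar one (0.5 here) as |K| -> 0.
p, q = np.array([0.3, 0.0]), np.array([0.0, 0.4])
for K in (1e-2, 1e-4):
    sp = lambda v: np.append(v, np.sqrt(1.0 / K - v @ v))   # lift onto the sphere
    hp = lambda v: np.append(v, np.sqrt(1.0 / K + v @ v))   # lift onto the hyperboloid
    print(K, spherical_dist(sp(p), sp(q), K), hyperbolic_dist(hp(p), hp(q), -K))
```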

1 The curvature $K$ may seem artificial in the way it is presented here. See [4, 13] for more information.
2 For further information the reader is referred to [8].

4 C-MDS: Curved Multidimensional Scaling

The goal of the well-known classical MDS is to embed a given metric space in $\mathbb{R}^m$ equipped with the Euclidean metric, minimizing the average distortion of the distances. We shall describe a procedure generalizing MDS to non-flat spaces with constant curvature (spherical and hyperbolic spaces), which is very similar in spirit to the original MDS and also depends on a generalization of the PCA/SVD method. We need a theorem connecting the distances in a metric space $X$ and its isometric embeddability in a space with constant curvature. The theorem goes back to Blumenthal, Menger and Schoenberg [18, 2] and the proof reproduced here was taken from [10].

Let $X = \{x_1, \ldots, x_n\}$ be a finite space of size $n$ with a metric $\rho$, and let $K \neq 0$. Define an $n \times n$ matrix $G^K$ as follows (footnote 3):
$$G^K_{ij} = \cos\big(\sqrt{K}\,\rho(x_i, x_j)\big).$$

Theorem 1. Let $X$, $\rho$, $K$ and $G^K$ be as above.

1. $K < 0$: The metric space $X$ can be isometrically embedded in $\mathbb{H}_K^m$ iff the signature of $G^K$ is $(1, k, n-k-1)$ with $k \le m$.

2. $K > 0$: The metric space $X$ can be isometrically embedded in $\mathbb{S}_K^m$ iff the signature of $G^K$ is $(k, 0, n-k)$ with $k \le m+1$.

Signature $(p, q, z)$ means that the matrix has $p$ positive eigenvalues, $q$ negative eigenvalues and $z$ zero eigenvalues.

Proof. The hyperbolic case, $K < 0$. Let $x_1, \ldots, x_n$ be points in $\mathbb{H}_K^m$. We look at the hyperboloid model of $\mathbb{H}_K^m$, thus each $x_i$ is a vector in $\mathbb{R}^{m+1}$. As defined above, the distances between the $x_i$ are
$$\rho(x_i, x_j) = \frac{1}{\sqrt{-K}}\,\operatorname{arccosh}\big(\!-K \langle x_i, x_j \rangle\big),$$
thus $G^K_{ij} = -K \langle x_i, x_j \rangle$. In other words, if we write the rescaled points $\sqrt{-K}\,x_i$ as the rows of a matrix $A$ and define the $(m+1) \times (m+1)$ matrix $J = \operatorname{diag}(-1, \ldots, -1, 1)$, then one has $G^K = A J A^T$ and the condition on the signature follows (footnote 4).

In the other direction, if $G^K$ has signature $(1, k, n-k-1)$ with $k \le m$, then there exists an $n \times (m+1)$ matrix $A$ such that $G^K = A J A^T$. The rows of $A$, divided by $\sqrt{-K}$, will then be points in $\mathbb{H}_K^m$ having the right distances between them. The proof of the spherical case is similar.

We have omitted the Euclidean case from the theorem, as the proof is different for that case, and for all real problems the Euclidean space can be approximated by a spherical or hyperbolic space.

3 This is just a fancy way to write both the $K > 0$ and $K < 0$ cases together, as for $K < 0$ we have $\cos(\sqrt{K}\,x) = \cosh(\sqrt{-K}\,x)$.
4 If $A$ is of full column rank, then the signature of $A J A^T$ is $(1, m, n-m-1)$. Linear dependence of the columns of $A$ changes the signature of $A J A^T$ to $(1, k, n-k-1)$ for some $k < m$.
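As a quick numerical illustration of the criterion (our own sketch, not code from the paper), one can form $G^K$ from a distance matrix and inspect its signature:

```python
import numpy as np

def curved_gram(D, K):
    """G^K_ij = cos(sqrt(K) * D_ij); for K < 0 this equals cosh(sqrt(-K) * D_ij)."""
    return np.cosh(np.sqrt(-K) * D) if K < 0 else np.cos(np.sqrt(K) * D)

def signature(G, tol=1e-8):
    """Return (#positive, #negative, #zero) eigenvalues of the symmetric matrix G."""
    w = np.linalg.eigvalsh(G)
    return int((w > tol).sum()), int((w < -tol).sum()), int((np.abs(w) <= tol).sum())

# Example: four points on a unit circle (K = 1) with their arc-length distances.
theta = np.array([0.0, 0.5, 1.1, 2.0])
D = np.abs(theta[:, None] - theta[None, :])
print(signature(curved_gram(D, K=1.0)))   # expect (2, 0, 2): embeddable in S^1 with K = 1
```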

What if the metric space $X$ cannot be embedded exactly in $\mathbb{S}_K^m$ or $\mathbb{H}_K^m$? As in classical MDS, we use the spectral decomposition to embed the space with minimal distortion, taking only the dominant eigenvalues. The algorithms for the spherical and hyperbolic cases are outlined below. The bottleneck of the running time is the eigenvalue decomposition. Fortunately, we do not have to compute the entire decomposition to obtain the required eigenvectors and eigenvalues, which makes the algorithm realistic for very large data sets.

Hyperbolic C-MDS
Input: dissimilarity matrix $\rho$, dimension $m$, curvature $K < 0$.
Output: points $\{x_i\}_{i=1}^n \subset \mathbb{H}_K^m$.
1. Set $G_{ij} = \cosh(\sqrt{-K}\,\rho_{ij})$.
2. Compute the eigenvalue decomposition $G = V \Lambda V^T$, eigenvalues in increasing order.
3. Let $V'$ consist of the eigenvectors of the $m$ most negative eigenvalues together with the eigenvector of the largest (positive) eigenvalue, and let $\Lambda'$ be the corresponding diagonal block.
4. Set $A = V' \sqrt{|\Lambda'|}$; the column coming from the positive eigenvalue plays the role of the coordinate $x_{m+1}$.
5. Rescale each row of $A$ onto the hyperboloid, i.e. so that $\langle x_i, x_i \rangle = -1/K$, and output the rows.

Spherical C-MDS
Input: dissimilarity matrix $\rho$, dimension $m$, curvature $K > 0$.
Output: points $\{x_i\}_{i=1}^n \subset \mathbb{S}_K^m$.
1. Set $G_{ij} = \cos(\sqrt{K}\,\rho_{ij})$.
2. Compute the eigenvalue decomposition $G = V \Lambda V^T$.
3. Let $V'$ consist of the eigenvectors of the $m+1$ largest eigenvalues and $\Lambda'$ the corresponding diagonal block (negative eigenvalues are discarded).
4. Set $A = V' \sqrt{\Lambda'}$.
5. Rescale each row of $A$ onto the sphere, i.e. so that $\|x_i\|^2 = 1/K$, and output the rows.
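Below is a compact NumPy sketch of the two procedures as we read them from the outline above. It is our own illustration under the stated conventions (hyperboloid coordinate stored last, only the dominant eigenpairs kept, rows rescaled onto the model surface), not reference code from the paper.

```python
import numpy as np

def cmds_hyperbolic(rho, m, K):
    """Embed an n x n dissimilarity matrix rho into H^m_K (K < 0), hyperboloid model."""
    G = np.cosh(np.sqrt(-K) * rho)
    w, V = np.linalg.eigh(G)                      # eigenvalues in increasing order
    idx = list(range(m)) + [len(w) - 1]           # m most negative + the single largest
    A = V[:, idx] * np.sqrt(np.abs(w[idx]))       # last column plays the role of x_{m+1}
    A[A[:, -1] < 0] *= -1                         # keep every point on the positive sheet
    q = A[:, -1] ** 2 - np.sum(A[:, :-1] ** 2, axis=1)
    A *= np.sqrt((-1.0 / K) / np.maximum(q, 1e-12))[:, None]   # onto <x, x> = -1/K
    return A

def cmds_spherical(rho, m, K):
    """Embed an n x n dissimilarity matrix rho into S^m_K (K > 0)."""
    G = np.cos(np.sqrt(K) * rho)
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[-(m + 1):]                # the m+1 largest eigenvalues
    A = V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
    A *= (1.0 / np.sqrt(K)) / np.linalg.norm(A, axis=1, keepdims=True)   # onto ||x||^2 = 1/K
    return A
```

Given a dissimilarity matrix `rho`, `cmds_hyperbolic(rho, 2, -1.0)` or `cmds_spherical(rho, 2, 1.0)` returns an $n \times 3$ array of ambient coordinates on the respective model surface, which can then be fed to the distance functions of Section 3.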

To show that curved manifolds appear naturally in applications, we plot the distortion versus the curvature for several datasets coming from diverse areas (Figures 3-5). The distortion of an embedding $f$ of $(X, \rho)$ into $(Y, d)$ was defined to be
$$\operatorname{dist}(f) = \frac{1}{|X|^2} \sum_{i,j} \Big( \rho(x_i, x_j) - d\big(f(x_i), f(x_j)\big) \Big)^2 .$$

It is clearly seen in these examples that the datasets do not have a Euclidean structure, and thus call for learning curved manifolds.

Figure 3: The distortion of embedding the internet in dimension two as a function of the curvature of the embedding space (negative values are hyperbolic spaces and positive are spherical). The distance between hubs was taken to be the minimal number of hops between them. Data from the DIMES project, supplied by Scott Kirkpatrick.

Figure 4: The distortion of the embedding of actors in dimension two as a function of the curvature of the embedding space. The distance between close actors is inversely proportional to the number of movies they jointly acted in; the further distances were computed by shortest paths. The data is taken from the UCI KDD archive.

Figure 5: The distortion of the embedding of ORFs (potential genes) in the M. tuberculosis bacterium in dimension two (blue), three (green) and six (magenta) with respect to the curvature of the embedding space. The distances were computed by BLAST and the data taken from [9].

4.1 Choosing the curvature K

Although we do not have an analytical proof of this fact, in the experiments we made (Figures 3-5) the variation of distortion with curvature is well behaved and unimodal. This suggests using binary search to find the optimal curvature. Another approach, at least for the hyperbolic case, is to estimate the growth of the volume of a ball of radius $r$ in the data as a function of the radius. Knowing the formula for the growth of the volume of a ball in $\mathbb{H}_K^m$, we can estimate the best curvature $K$.
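A minimal sketch of the first approach, assuming the C-MDS and distance sketches given earlier in this paper are available: scan candidate curvatures and keep the one with the smallest distortion (a real implementation would refine the winning region by bisection, as suggested above). Function names are ours.

```python
import numpy as np

def distortion(rho, X, dist_fn):
    """Average squared difference between the input distances and the embedded ones."""
    n = len(X)
    d = np.array([[dist_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.mean((rho - d) ** 2)

def best_curvature(rho, m, Ks):
    """Scan candidate curvatures (negative = hyperbolic, positive = spherical)."""
    best = None
    for K in Ks:
        if K < 0:
            X = cmds_hyperbolic(rho, m, K)                      # sketch from Section 4
            err = distortion(rho, X, lambda x, y: hyperbolic_dist(x, y, K))
        else:
            X = cmds_spherical(rho, m, K)
            err = distortion(rho, X, lambda x, y: spherical_dist(x, y, K))
        if best is None or err < best[1]:
            best = (K, err)
    return best   # (curvature, distortion)

# e.g. best_curvature(rho, 2, [K for K in np.linspace(-2.0, 2.0, 41) if abs(K) > 1e-3])
```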

4.2 Semi-definite Embedding

When doing manifold learning one wants to ”trust” only the local information and use it to obtain the global structure of the manifold. The approach of semi-definite embedding [25] is to use the local distances to build the global distance matrix by semi-definite programming and then use classical MDS. The same method can be used for spherical embeddings.

5 After the embedding

Everyone knows why embedding the data in a low-dimensional Euclidean space is useful: all the algorithms for handling data work in Euclidean space. We seek to show that a lot of them can be transferred to the spherical/hyperbolic case. For example, linear separation of points in $\mathbb{R}^m$ by a hyperplane translates into separation by a geodesic manifold of codimension 1.

Clustering in Euclidean spaces is often done using k-means, EM for a mixture of Gaussians, or mean-shift. In order to carry out these algorithms in the curved spaces $\mathbb{S}_K^m$ or $\mathbb{H}_K^m$, for k-means or mean-shift we need to compute means and distances, and for EM we need to compute covariance matrices.

Figure 6: Comparison of clustering with k-means after embedding in the Euclidean (solid line) and hyperbolic (dashed line) plane.

The notion of the mean of a set of points in a metric space can be defined in different ways. One of them, usually called the Karcher mean [3] (though originally defined and studied by Cartan), comes from noticing that the mean of a set of points in the Euclidean space $\mathbb{R}^m$ minimizes the sum of the squared distances to the points in the set. This is still well defined in any metric space, thus we set
$$\operatorname{mean}(x_1, \ldots, x_n) = \arg\min_{y} \sum_{i=1}^{n} d(y, x_i)^2 .$$
A set of points in a space of non-positive curvature (such as hyperbolic space) has a unique mean (footnote 6). Summarizing: to compute the mean of a set of points in $\mathbb{S}_K^m$ or in $\mathbb{H}_K^m$ we minimize the sum of the squared distances. As we can analytically obtain the gradient and the Hessian of the function we minimize, the convergence to the minimum is quadratic.

The notion of Gaussian distributions can be generalized to Riemannian manifolds, in particular to spheres and hyperbolic spaces. The computation of the covariance matrix of data is straightforward [15], but estimating the parameters of the Gaussian distributions is more technical and will be published elsewhere.
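As an illustration of the minimization just described, here is a simple first-order iteration for the Karcher mean on the sphere $\mathbb{S}^m_K$, using the standard exponential and logarithm maps. This is our own sketch; the Newton-type scheme with the analytic Hessian referred to above would converge faster.

```python
import numpy as np

def sphere_log(p, q, K):
    """Log map at p on S^m_K: tangent vector at p pointing to q with length d(p, q)."""
    R = 1.0 / np.sqrt(K)
    c = np.clip(np.dot(p, q) / R**2, -1.0, 1.0)
    theta = np.arccos(c)                       # angle between p and q
    v = q - c * p                              # component of q orthogonal to p
    n = np.linalg.norm(v)
    return np.zeros_like(p) if n < 1e-12 else (R * theta) * v / n

def sphere_exp(p, v, K):
    """Exp map at p on S^m_K: follow the geodesic from p in direction v."""
    R = 1.0 / np.sqrt(K)
    t = np.linalg.norm(v) / R
    return p if t < 1e-12 else np.cos(t) * p + np.sin(t) * R * v / np.linalg.norm(v)

def karcher_mean_sphere(points, K, iters=50):
    """Repeatedly move the estimate along the average of the log-mapped points."""
    mu = points[0].copy()
    for _ in range(iters):
        g = np.mean([sphere_log(mu, q, K) for q in points], axis=0)
        mu = sphere_exp(mu, g, K)
    return mu
```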

In Figure 6 we show the results of a toy clustering experiment. We take a random set of $k$ points $\{c_i\}$ in the hyperbolic plane. A set of points is drawn randomly with standard deviation $\sigma$ around each of the $c_i$'s, giving us a set $X$ of points. The goal is to cluster $X$ having only its distance matrix. We use a straightforward implementation of the k-means algorithm and compare the results of the clustering after embedding in $\mathbb{R}^2$ and in $\mathbb{H}^2$. To evaluate the clustering we used the mutual information score: if $C : X \to \{1, \ldots, k\}$ is a clustering and $T : X \to \{1, \ldots, k\}$ is the ground truth, then the score is the mutual information $I(C, T)$ between $C$ and $T$ (normalized; the "M.I. score" of Figure 6).
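For completeness, the evaluation step can be reproduced with a standard library call (assuming scikit-learn is available; the paper does not specify which normalization of the mutual information it used):

```python
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]        # ground-truth center index of each point
labels_pred = [1, 1, 0, 0, 2, 2]        # k-means labels after embedding
print(normalized_mutual_info_score(labels_true, labels_pred))   # 1.0: a perfect relabeled clustering
```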

6 Discussion

In this work we proposed an algorithm for faithfully embedding data in low-dimensional hyperbolic and spherical spaces. Our embedding algorithm is a generalization of the classical metric MDS and is hence non-iterative and fast. As demonstrated by the examples, the geometry of many datasets is better captured by these curved manifolds than by the Euclidean space. A lot of tools currently used for the Euclidean space can be transferred to the hyperbolic and the spherical case, thus enlarging the possibilities of data mining, understanding and visualizing complex datasets.

6 If $S$ lies in a convex set in the manifold, the mean is unique; see [13].

References
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering, 2002.
[2] L. M. Blumenthal. Theory and Applications of Distance Geometry, 1970.
[3] P. Buser and H. Karcher. Gromov's almost flat manifolds. Astérisque, 1981.
[4] M. P. Do Carmo. Riemannian Geometry. Birkhäuser, Boston, 1992.
[5] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. London, 1995.
[6] B. A. Falissard. A spherical representation of a correlation matrix. Journal of Classification, 13(2):267–280, 1996.
[7] C. Grimes and D. Donoho. Hessian eigenmaps: New locally-linear embedding techniques for high-dimensional data. Technical report, Stanford University, 2003.
[8] B. Iversen. Hyperbolic Geometry. Cambridge University Press, 1992.
[9] R. King, A. Karwath, A. Clare, and L. Dehaspe. Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining. Comparative and Functional Genomics, 17, 2000.
[10] S. L. Kokkendorf. Gram matrix analysis of finite distance spaces in constant curvature. Discrete and Computational Geometry, 31:515–543, 2004.
[11] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1–27, 1964.
[12] J. Lamping, R. Rao, and P. Pirolli. A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In ACM SIGCHI Conference on Human Factors in Computing Systems, pages 401–408, 1995.
[13] M. Berger. A Panoramic View of Riemannian Geometry. Springer, Berlin, 2003.
[14] T. Munzner and P. Burchard. Visualizing the structure of the world wide web in 3D hyperbolic space. In Proceedings of VRML 1995, pages 33–38, 1995.
[15] X. Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proc. of Nonlinear Signal and Image Processing 99, pages 194–198.
[16] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[17] J. W. Sammon. A nonlinear mapping algorithm for data structure analysis. IEEE Transactions on Computers, 18(5):401–409, 1969.
[18] I. J. Schoenberg. A remark to Maurice Fréchet's article. Ann. of Math., 36(3), 1935.
[19] Y. Shavitt and T. Tankel. On the curvature of the internet and its usage for overlay construction and distance estimation. In Proceedings of IEEE INFOCOM 2004, 2004.
[20] R. N. Shepard. The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika, 27:125–140, 1962.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[22] W. S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 1952.
[23] V. de Silva and J. B. Tenenbaum. Unsupervised learning of curved manifolds. In Proceedings of the MSRI Workshop on Nonlinear Estimation.
[24] J. A. Walter. On interactive visualization of high dimensional data using the hyperbolic plane. In Proceedings of the Eighth ACM SIGKDD, pages 123–132, 2002.
[25] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), volume 2, pages 988–995, Washington D.C., 2004.