Arxiv:2006.09858V3 [Cs.LG] 16 Jun 2021

Geometry of Similarity Comparisons Puoya Tabaghi Jianhao Peng Olgica Milenkovic Coordinated Science Lab Coordinated Science Lab Coordinated Science Lab ECE Department, UIUC ECE Department, UIUC ECE Department, UIUC [email protected] [email protected] [email protected] Ivan Dokmanic´ Department of Mathematics and Computer Science University of Basel [email protected] Abstract Many data analysis problems can be cast as distance geometry problems in space forms – Euclidean, spherical, or hyperbolic spaces. Often, absolute distance measurements are often unreliable or simply unavailable and only proxies to absolute distances in the form of similarities are available. Hence we ask the following: Given only comparisons of similarities amongst a set of entities, what can be said about the geometry of the underlying space form? To study this question, we introduce the notions of the ordinal capacity of a target space form and ordinal spread of the similarity measurements. The latter is an indicator of complex patterns in the measurements, while the former quantifies the capacity of a space form to accommodate a set of measurements with a specific ordinal spread profile. We prove that the ordinal capacity of a space form is related to its dimension and the sign of its curvature. This leads to a lower bound on the Euclidean and spherical embedding dimension of what we term similarity graphs. More importantly, we show that the statistical behavior of the ordinal spread random variables defined on a similarity graph can be used to identify its underlying space form. We support our theoretical claims with experiments on weighted trees, single-cell RNA expression data and spherical cartographic measurements. 1 Introduction arXiv:2006.09858v4 [cs.LG] 29 Jul 2021 Distances reveal the geometry of their underlying space. They are at the core of many machine learning algorithms. In particular, finding a geometrical representation for point sets based on pairwise distances is the subject of distance geometry problems (DGPs). Euclidean DGPs have a rich history of applications in robotics [1, 2], wireless sensor networks [3], molecular conformation analysis [4] and dimensionality reduction [5]. One is typically concerned with finding a geometric representation for a set of measured Euclidean distances [6]. Beyond Euclidean DGPs, recent works have focused of hyperbolic geometry methods in data analysis, most notably when dealing with hierarchical data. Social and FoodWeb networks [7, 8], gene ontologies [9], and Hearst graphs of hypernyms [10] are interesting examples of hierarchical datasets. Spherical embeddings represent sets of points on a (hyper)sphere [11], and have found applications in astronomy [12], distance problems on Earth [13], and texture mapping [14]. Euclidean, spherical and hyperbolic geometries are categorical examples of constant curvature spaces, or space forms, which are characterized by their curvature and dimension. The above examples represent instances of metric embeddings in space forms, as opposed to what is termed nonmetric embeddings. In the latter setting, one is provided with nonmetric information about data points, such as quantized distances or ordinal measurements such as comparisons or rankings. Preprint. Under review. We argue that nonmetric information such as distance comparisons carries valuable information about the space the data points originated from. To formally state our claims, assume that we are given a set of points x1; : : : ; xN in an unknown metric space S. In nonmetric embedding problems [15, 16], we work with dissimilarity (similarity) measurements of the form def 8m; n 2 [N] = f1;:::;Ng : ym;n = φ d(xm; xn) ; where d(xm; xn) is the distance between xm and xn in S, and φ(·) is an unknown monotonically increasing (or decreasing) function. Since φ is unknown, we can only interpret the measurements as distance comparisons or ordinal measurements, i.e., if the entities indexed by n1; n2 are more similar than those indexed by n3; n4; then increasing φ yn1;n2 ≤ yn3;n4 (==========) d(xn1 ; xn2 ) ≤ d(xn3 ; xn4 ): We hence ask: What do distance/similarity comparisons as those described above reveal about the space S? Our work shows, for the first time, that one can use ordinal measurements to deduce the sign of the curvature and a lower-bound for the dimension of the underlying space form (in Euclidean and spherical spaces). The main results of our analysis are as follows. 1. We introduce the notion of ordinal spread of the sorted distance list, which is of fundamental importance in the study of the geometry of distance comparisons. The spread of ordinal measurements describes a pattern in which entities appear in the sorted list of distances, i.e., the ordinal spread gives the ranking of the first appearance of a data point in the list. This notion is related to another important geometric entity termed the ordinal capacity. 2. We define the notion of ordinal capacity of a space form to characterize the space’s ability to host extreme patterns of ordinal spreads (computed from similarity measurements). We show that the ordinal capacity of a space form is related to its dimension and curvature sign. The ordinal capacity of Euclidean and spherical spaces are equal and grow exponentially with their dimensions, while the ordinal capacity of a hyperbolic space is infinite for any possible dimension of the space. 3. We derive a deterministic lower bound for Euclidean and spherical embedding dimensions using ordinal spreads and the (finite) ordinal capacity. We also associate an ordinal spread random variable with (1) a set of random points in a space form, and (2) a set of random vertex subsets from a similarity graph – a complete graph with edge weights corresponding to similarity scores of their defining nodes. The distributions of these random variables serves as a practical tool to identify the underlying space form given a similarity graph. 4. We illustrate the utility of our theoretical analysis by using them to correctly uncover the hyperbol- icity of weighted trees. Moreover, we use them to detect Euclidean and spherical geometries for ordinal measurements derived from local and global cartographic data. Finally, we use the ordinal spread variables to determine the degree of heterogeneity of cell populations based on noisy scRNAseq data and how data imputation influences the geometry of the cell space trajectories. Due to space limitations, all proofs, algorithms, and extended discussions are delegated to the Supplement. 1.1 Related Works In many applications we seek a representation for a group of entities based on their distances, but the exact magnitudes of the distances may be unavailable.What often is available (and prevalent) in applied sciences are nonmetric – dissimilarity or similarity – measurements: In neural coding [17], developmental biology [18], learning from perceptual data [19], and cognitive psychology [20]. Unfortunately, the datasets used in most of these studies are small (often involving less than 100 entities) and have limited utility for learning tasks that require sufficiently large sample complexity. Nonmetric embedding problems date back to the works of Shepard [21, 22] and Kruskal [15]. Inspired by the Shepard-Kruskal scaling problem, Agarwal et al. [16] introduce generalized nonmetric multi- dimensional scaling, a semidefinite relaxation used to embed dissimilarity (or similarity) ratings of a set of entities in Euclidean space. Stochastic triplet embeddings [23] and crowd kernel learning [24] are used to embed triadic comparisons using probabilistic information. Tabaghi et al. [25] propose a semidefinite relaxation for metric and nonmetric embedding problems in hyperbolic space. In all 2 these scenarios, the embedding space has to properly represent the measured data. For example, in developmental biology and cancer genomics, single-cell RNA sequencing (scRNAseq) is used to differentiate cell types and cycles. The classification results have important implications for lineage identification and monitoring cell trajectories and dynamic cellular processes [26]. Klimovskaia et al. [18] use hyperbolic rather than Euclidean spaces for low-distortion embedding of complex cell trajectories (hierarchical structures). Learning from distance comparisons is an active area of research. Among the relevant research topics are ranking objects from pairwise comparisons [27, 28], theoretical analysis of necessary number of distance comparisons to uniquely determine the embedding [29], nearest neighbor search [30], random forests [31], and classification based on triplet comparisons [32]. Understanding the underlying geometry of ordinal measurements is important in designing relevant algorithms. Related to nonmetric embedding problems are the various techniques that study topological properties of point clouds independently of the choice of metric and of the geometric properties such as curvature [33]. An important problem in this domain is to detect intrinsic structure in neural firing patterns, invariant under nonlinear monotone transformations of measurements. Giusti et al. [17] propose a method based on clique topology of the graph of correlations between pairs of neurons. The clique topology of a weighted graph describes the behavior of cycles in its order complex [17] as a function of edge densities; these entities are also known as Betti curves. The statistical behavior of Betti curves is used to distinguish random and geometric structures of moderate sizes in Euclidean space.

Arxiv:2006.09858V3 [Cs.LG] 16 Jun 2021

Applied Category Theory for Genomics – an Initiative

Gene Prediction: the End of the Beginning Comment Colin Semple

Meeting Review: Bioinformatics and Medicine – from Molecules To

The Ethos and Effects of Data-Sharing Rules: Examining The

Contcenter for Genomic Regul

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective

Computational Biology: Plus C'est La Même Chose, Plus Ça Change

CSHL Audio Visual Collection Inventory

Flymine: an Integrated Database for Drosophila and Anopheles Genomics

Molecular Genetics & Genomics

Infrastructure Needs of Systems Biology

Prolango: Protein Function Prediction Using Neural~ Machine