Università degli Studi di Milano

FACULTY OF MATHEMATICAL, PHYSICAL AND NATURAL SCIENCES
Master's Degree Course in Physics

Random Geometric Graph in High Dimension

Candidate: Sebastiano Ariosto (student ID 901936)
Supervisor: Marco Gherardi

Co-supervisors: Vittorio Erba, Pietro Rotondo

Academic Year 2018-2019

Contents

1 Random geometric graph
  1.1 Introduction to random geometric graphs
  1.2 Observables in graph theory
  1.3 RGGs in data science and high dimensionality

2 Central limit theorem for distances
  2.1 Central limit theorems
  2.2 Multivariate central limit theorem for distances
  2.3 The covariance matrix Σ
    2.3.1 The terms of Σ
    2.3.2 Characterizing the covariance matrix
  2.4 Reliability of the developed CLT by comparison with simulations

3 Random geometric graph in high dimension
  3.1 Clique density in high dimensionality
  3.2 Clique density in high dimensional Soft RGGs: an Erdős-Rényi behaviour
  3.3 Clique density in high dimensional Hard RGGs: a non Erdős-Rényi behaviour

4 A RGGs approach to a Real-World dataset
  4.1 Real-World dataset
    4.1.1 Real-World vs random dataset
    4.1.2 Moments of Real-World dataset

A Relationships between α, β and γ and eigenvalues

B Hubbard-Stratonovich transformation

C Numerical methods
  C.1 Multivariate Gaussian integration
  C.2 Simulations for Hard RGGs
  C.3 Simulations for Soft RGGs

Motivation

In today's world, one of the most difficult issues to deal with is extracting information from the massive amount of data that every field of scientific research, and in general every human activity, produces and collects every day. To appreciate the size of the problem, consider the LHC experiment at CERN: every second it can generate 40 terabytes of information, the same amount of memory occupied by one million books. Likewise, in bioinformatics, a single microarray that measures gene expression can contain up to ten thousand features. It is clear from these examples that the key issue today is how to infer, extract and visualize meaningful information in an efficient and effective way from this overabundant quantity of data. Thanks to this enormous production and storage of data, graph theory is living a renaissance in the information age. This is because a graph is the simplest, but no less effective, way to model an ensemble of data. In graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects are called vertices (or nodes) and each link between vertices is called an edge. However, in the vast majority of cases a dataset does not include the relationships between its objects, relationships that are necessary in many data science techniques. One of the most used methods to create relationships (and therefore edges) in a dataset that does not show them is to immerse the graph that represents the dataset in a metric space and use the distance between two elements of the dataset as the measure of the relationship between them: the closer two points are, the greater their probability

of being linked. In the literature, a graph immersed in a metric space is called a geometric graph. However, this method can lead to situations that are very difficult to deal with: in the vast majority of cases, the natural ambient space of a dataset turns out to be a high dimensional space, and studying a high dimensional system implies having to face a series of counterintuitive phenomena known under the evocative name of Curse of Dimensionality. Among these, one of the most troublesome phenomena for data science is certainly the concentration of distances: in the limit of high dimensionality the distances tend to concentrate around their average value. This is very problematic, because many data science techniques involve the search for nearest neighbours and therefore use distance as a measure of similarity. For example, as we said before, geometric graphs use the concept of distance to link similar points to each other, but if all the points are roughly equidistant, the graph that is generated will not be representative of the true relationships between the points. Despite this, the concentration of distances is not necessarily bad. In Physics, concentration phenomena are used to make predictions on the behaviour of variables related to the one that concentrates. In this Thesis we will see how the concentration of distances can be demonstrated, quantified and used to make predictions on the behaviour of random geometric graphs, which are generated starting from randomly drawn points and are often used as benchmarks for data science techniques. The Thesis has the following structure:

• Chapter 1 formally introduces random geometric graphs, their use in data science, and the most commonly used observables to characterize them;

• Chapter 2 presents the development of a central limit theorem for distances, focusing on its key features and demonstrating the reliability of this result through simulations;

• Chapter 3 develops a theoretical technique capable of making predictions on the observables of random geometric graphs in both the Soft and Hard cases; moreover, this technique is compared with simulated data;

• Chapter 4 finally applies the theoretical techniques developed to real- world datasets, highlighting their strengths and limitations.

Chapter 1

Random geometric graph

1.1 Introduction to random geometric graphs

We want to start by introducing what will be the main ingredient of this work: in Mathematics a graph is a structure amounting to a set of nodes (also called vertices or points) in which some pairs of them are connected by edges (also called links or lines). Graphs can be very different: directed or undirected, simple or multiple, with or without loops. But of all the possible distinctions, the most important is the one that separates purely deterministic graphs from those that include random elements. The latter, known by the name of random graphs, will be one of the main subjects of this work. These objects were introduced in two forms in 1959 by Gilbert, Erdős and Rényi in their works on the so-called Erdős-Rényi random graph model [20, 26]. In the version G(N,p) of this model, an Erdős-Rényi random graph (ERG) is a random graph with N vertices where each possible edge has probability p of existing. These graphs, given their simplicity, are widely studied in the literature and act as a basic model for the study of the characteristics of random graphs. Furthermore, as shown by [15] and as we will see in this work, they are limit cases of much more complex systems. Nevertheless, Erdős-Rényi random graphs do not give us access to a geometric measure of the characteristics of each point and consequently they do

not allow us to quantify the differences between two nodes. To overcome this problem, in 1961 Gilbert first introduced [27] the concept of metric space in random graph theory, thus giving rise to random geometric graphs. So, by definition, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space (according to a specified probability distribution) and connecting two nodes via a link based on a probability distribution that depends on their mutual distance.

With the aim of better describing these graphs, we want to introduce one of the simplest cases: the creation of an RGG starting from points uniformly distributed in a hypercube. Let $x_1, \dots, x_N$ be independent, uniformly random points from $[0,1]^d$. Now, having fixed a maximum distance r, we connect all the points which are less distant from each other than r, i.e. such that $d(i,j) < r$. The structure created in this way connects only the "similar" points together and leaves the "different" points separate (where the concept of diversity is measured by distance).

Now, we want to introduce a more general setting for the ingredients of the RGGs, starting with the nodes. In RGGs, the nodes $\{\vec{x}_i\}_{i=1}^{N}$ are N points drawn from a probability distribution $\nu$ over $\mathbb{R}^d$. Note that if we consider a probability distribution that is factorized and equally distributed over the coordinates (i.e. $\nu(\vec{x}) = \prod_{k=1}^{d} \tau(x^k)$, where $\tau$ is a probability distribution on $\mathbb{R}$), the coordinates of all nodes $\{x_i^k\}$, with $1 \le i \le N$ and $1 \le k \le d$, are i.i.d. random variables with law $\tau$.

Now, for every pair of nodes x, y we compute the distance d(x, y) and we link the points with probability h(d(x, y)), where $h : \mathbb{R}^+ \to [0,1]$ is the so-called activation function of the RGG. The activation function tells us the probability of linking two nodes based on their distance, and it will typically be a monotone decreasing function (with $h(0) = 1$ and $h(+\infty) = 0$), with the idea that closer nodes will be linked with higher probability than farther ones.

In the literature, instead of a single activation function, a family of parametric activation functions $h_r$ is usually used. The parameter r describes the typical distance that is associated with a definite and non-trivial linking probability, for example $h_r(r) = \frac{1}{2}$. This is useful because it allows us to

study the properties of RGGs as a function of the parameter r. Based on the choice of the activation function, RGGs are divided into two different types:

• In the case of a hard activation function, defined as:

$$h_r^{\mathrm{hard}}(x) = \theta(r - x) \tag{1.1}$$

we have the so-called Hard Random Geometric Graph. In this case we link a pair of points only if the distance between them is lower than the parameter r.

• Instead, if the activation function is not boolean, we have the Soft Random Geometric Graph. In this case the linking of a pair of points is not deterministic. An example of soft activation functions widely used in the literature [18, 33, 28] is the so-called Rayleigh fading activation function, i.e.:

$$h_r^{\mathrm{Rayleigh}}(x) = \exp\left[-\xi\left(\frac{x}{r}\right)^{\eta}\right] \tag{1.2}$$

where $\xi = \log(2)$ guarantees that $h_r(r) = \frac{1}{2}$.

Note that in this work, when the adjectives "hard" or "soft" are omitted, we refer generically to both. In order to better understand the behaviour of RGGs, given their random nature, it is more significant to study the statistical properties of some observable than to study a single realization. In the next section we will present some of the most used observables in random graph theory, and we will explain the choice of one of them for the statistical analysis made in this work.
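To make the two constructions concrete, here is a minimal Python sketch (not part of the original text) that builds the adjacency matrices of a hard and of a soft Rayleigh-fading RGG from the same uniformly drawn points; N, d, r and η below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(x, p=2):
    """p-norm distances between all pairs of rows of x."""
    diff = np.abs(x[:, None, :] - x[None, :, :]) ** p
    return diff.sum(axis=-1) ** (1.0 / p)

def hard_rgg(x, r, p=2):
    """Hard RGG: two nodes are linked iff their distance is below the cutoff r."""
    dist = pairwise_distances(x, p)
    A = (dist < r).astype(int)
    np.fill_diagonal(A, 0)          # no self-loops
    return A

def soft_rgg(x, r, p=2, eta=2.0):
    """Soft RGG with Rayleigh fading activation h_r(x) = exp(-log(2) * (x/r)**eta)."""
    dist = pairwise_distances(x, p)
    prob = np.exp(-np.log(2) * (dist / r) ** eta)               # h_r(r) = 1/2 by construction
    A = np.triu((rng.random(dist.shape) < prob).astype(int), 1)  # one coin flip per pair
    return A + A.T

# Example: N = 50 points drawn uniformly in the unit square, cutoff r = 0.2
x = rng.random((50, 2))
print("hard edges:", hard_rgg(x, r=0.2).sum() // 2)
print("soft edges:", soft_rgg(x, r=0.2).sum() // 2)
```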

1.2 Observables in graph theory

In the literature there are many statistics-based observables that are useful to understand the behaviour of graphs. Usually these observables are studied


[Figure 1.1: two panels plotting $h_r(x)$ as a function of x — the hard activation $\theta(r - x)$ (left, Hard RGG) and a soft activation of the form $\exp[-(x/r)^2]$ (right, Soft RGG) — above the corresponding graph representations.]

Figure 1.1: Example of a hard and a soft random geometric graph. Small circles denote nodes drawn randomly with flat probability distribution on $[0,1]^2$, and the larger shaded circles identify the region of radius r/2 around each node. Dashed lines denote all possible edges between nodes, and solid lines highlight the actual edges of the represented graphs. Above the graph representations, the activation functions used to build them are plotted. Left: In a hard random geometric graph the only selected edges are those between nodes no more distant than r (in the picture, the nodes whose shaded regions intersect). Right: In a soft random geometric graph, edges are selected based on a continuous activation function $h_r(x)$. If two nodes are at distance d from each other, then the edge that connects them will be chosen with probability $h_r(d)$. In the picture, yellow edges are those that have been chosen by the soft random geometric graph even though the distance between the nodes is greater than r. Vice versa, blue dashed edges are those at distance less than r that were not chosen by the soft random geometric graph.

as a function of the parameter r of the activation function. In this section we will present some of them, focusing on their utility in real-world problems and outlining the most important results for the most studied random graph, i.e. the Erdős-Rényi random graph:

• Probability of isolated vertex and connectivity threshold: by definition a graph is said to be connected if every pair of points in the graph is connected, i.e. for every two vertices u and v the graph contains a path from u to v. If a graph has at least one isolated vertex it is surely not connected, so these two observables are strictly related. The statistical behaviour of both observables in the limit of the number of points N tending to infinity is studied case by case, but the most general treatment for random geometric graphs is given by [43]. The result for the Erdős-Rényi graph is instead given by [20], where the authors find the threshold for connectivity:

If $p < \frac{(1-\epsilon)\log N}{N}$, then a graph will almost surely contain isolated vertices, and thus be disconnected.

If $p > \frac{(1+\epsilon)\log N}{N}$, then a graph will almost surely be connected.

where $\epsilon \to 0^+$. The study of this property is very important for wireless sensor networks: in this field it is crucial to find the power a node needs to transmit in order to ensure that the network is connected with high probability, even if the number of nodes in the network goes to infinity [30, 4, 18].

• Minimal spanning tree: in general, the minimal spanning tree T of a connected graph G is a subgraph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. In the case of geometric graphs the distance between points is usually used as weight, and it is useful to find the value of the minimum spanning tree of a graph, but, as shown by [23], this is not a trivial problem. The minimal spanning tree has direct applications in the design of networks, including computer networks, telecommunications networks, transportation networks, water supply networks, and electrical grids, as shown in [29]. Furthermore, this observable is also used in taxonomy [50], gene expression analysis [59], computer vision [53], circuit design [38] and ecotoxicology [14].

• Hamiltonicity: a Hamiltonian path is a path in a connected graph that visits each vertex exactly once (see fig. 1.2). A Hamiltonian cycle is a Hamiltonian path that is a cycle (i.e. a closed path). A graph is said to be Hamiltonian if it contains a Hamiltonian cycle. Finding out whether a graph is Hamiltonian is an NP-complete problem [24]. In the case of RGGs the problem is simpler: using statistical methods it is possible to show, as done by [16], that there exists a threshold value of the parameter r above which the graph is asymptotically almost surely Hamiltonian.

• Clustering coefficient: it is a measure of the degree to which nodes in a graph tend to cluster together. Two versions of this measure exist: the global one, which was designed to give an overall indication of the clustering in the network, and the local one, which gives an indication of the embeddedness of single nodes (i.e. the density of connections among the neighbours of a node). In the case of RGGs it has been shown that, in the limit of N tending to infinity, the average clustering coefficient is only a function of dimensionality [11].

• Clique number and k-clique density: in graph theory, a clique is a subset of vertices of a graph such that every two distinct vertices in the clique are adjacent (i.e. are linked); that is, the induced subgraph is complete. The maximum clique of a graph is the clique that has the largest number of vertices, and the clique number of a graph is the number of vertices in the maximum clique. Note that finding the maximum clique and the clique number are NP-complete problems. For RGGs, in the limit where the number of points N tends to infinity, there exist concentration theorems for the clique number [43].


Instead of using the clique number, it is more interesting to study the trend of the density of k-cliques as the parameter r changes. A k-clique is a clique with k vertices (see fig. 1.2) and their density $\rho_k$ is given by the ratio between the number of k-cliques actually present in the graph and all the possible ones, which are $\binom{N}{k}$. For the Erdős-Rényi random graph there exists the exact solution [8]:

$$\rho_k^{\mathrm{ER}}(p) = \binom{N}{k}\, p^{\binom{k}{2}} \tag{1.3}$$

where p is the linking probability, N the number of points and k the dimension of the clique. Note that finding k-cliques is a problem that can be solved in polynomial time ($O(N^k)$, where N is the number of points), so it is generally faster than finding the maximum clique. For generic RGGs there exist some bounds for particular cases, and only in the limit where the number of points N tends to infinity [15, 5, 9]. Cliques are widely used in many different fields. The concept of clique was introduced in 1949 by Luce and Perry in social network analysis [37], an interdisciplinary field halfway between sociology and graph theory, and it has become a widely used tool in this field [3, 42, 19] thanks to the fact that, being densely connected, cliques are able to describe the highly connected areas of a graph, such as "communities" or "clusters". They are also used extensively in bioinformatics, where they are employed to model a great variety of different systems: for example, the problem of clustering gene expression data [6, 55], the ecological niches in food webs [52], protein structure prediction [47] and the problem of inferring evolutionary trees [12]. Furthermore, cliques can also be used to extract information from real networks, as done for protein-protein interaction networks [51] and communication networks [45]. They are also employed in chemistry [40, 46], statistical mechanics [13, 39] and electrical engineering [41].

We decided to focus our study on the trend of the density of k-cliques as the parameter r changes, for any number of points N and in the limit of dimensionality d tending to infinity.
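As a hedged illustration of how this observable can be measured on a small graph, the sketch below builds a hard RGG and estimates $\rho_k(r)$ by brute-force enumeration of all $\binom{N}{k}$ subsets (feasible only for small N and k); every parameter value and helper name here is our own choice, not taken from the text.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def clique_density(A, k):
    """Fraction of the k-subsets of nodes that form a clique in adjacency matrix A."""
    n = A.shape[0]
    total = cliques = 0
    for subset in combinations(range(n), k):
        total += 1
        # a subset is a k-clique iff every pair inside it is linked
        if all(A[i, j] for i, j in combinations(subset, 2)):
            cliques += 1
    return cliques / total

# Hard RGG: N = 30 points, uniform in the unit hypercube of dimension d = 10, p = 1 distance
N, d, r, k = 30, 10, 3.0, 3
x = rng.random((N, d))
dist = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)
A = (dist < r).astype(int)
np.fill_diagonal(A, 0)
print("estimated rho_k(r) =", clique_density(A, k))
```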


Figure 1.2: Examples of observables in graph theory. Given a hard random geometric graph (N = 100 points drawn with flat probability in the unit square) we highlight some examples of observables. Top left: the connectivity of an RGG is a function only of the cut-off radius r (r = 0.05 in dark blue and r = 0.2 in light blue). Top right: the Minimal Spanning Tree is highlighted in dark green. Bottom left: the shortest (according to the Euclidean metric) path that links all the points of the graph (Hamiltonian Path) is highlighted in dark green. Bottom right: some of the 5-cliques are highlighted.


1.3 RGGs in data science and high dimensionality

According to [31], the main purpose of Data Science is to analyze and understand real phenomena dealing with complex datasets. In other words, the goal of data science is to use statistics, data analysis, machine learning and their related methods in order to extract knowledge and meaningful insights from datasets. Although this field is very young (its modern definition was given for the first time only in 1992), it has been able to become one of the most developed and studied sciences thanks to its possible uses in every field of knowledge. In fact, from industry to research, from business to urban management, all human activities create, collect and analyze data in order to make data-based forecasts or take data-based decisions. In order to do this, it is necessary to be able to transform a huge amount of data into a small number of variables that can characterize the system. Obviously, this is not a trivial task. The first problem is to find the best way to model the dataset: the type of information that can be extracted will also depend on how the data is modeled. For example, suppose we want to extract information on the relationships between the objects in a dataset: to do this the best way to represent the dataset is to make the relationships between objects explicit, that is, to use a graph. Indeed, as we have seen in the previous section, some of the observables of graph theory allow us to extract a lot of information from a dataset. But it is also true that it is not always possible to find explicit relationships in a dataset. However, there is a way to highlight these relationships: incorporating the dataset into a metric space, and expressing the relationship between two points as a function of their distance. In other words, one way to generate relationships from a dataset that does not reveal any is to immerse it in a metric space and study the resulting geometric graph. This idea is fundamental for the study of real networks, such as wireless networks, and for the development of many algorithms, as in the case of manifold learning in machine learning. But, if it is simple to find the embedding dimension in the case of a real wireless network (a real system is at

most three-dimensional), finding the embedding dimension for a dataset of images or proteins is a non-trivial problem (although it may seem strange, finding the space in which to embed these datasets is an everyday problem in data science). The simplest way, and one that guarantees no loss of information, is to link each degree of freedom of the individual data to a dimension. For example, let's consider a dataset of digital images with a size of 28x28 pixels: how is it possible to represent them as objects belonging to a metric space? The most used way is to represent each image as a vector of dimension 784 (28x28), where each component of the vector corresponds to the intensity value of a specific pixel. In this way, the dataset is represented as a set of points in a metric space of dimension 784. As shown in this example, if it is necessary to consider all the degrees of freedom, it is evident that the embedding metric space will be high dimensional. Depending on one's point of view or use, having a high dimensional dataset can be a curse or a blessing. The curse of dimensionality refers to a set of various phenomena that arise in several subfields of the mathematical sciences when analyzing and organizing data in high-dimensional spaces, and that do not occur in low-dimensional settings. The common theme of cursed phenomena is that when the dimensionality of the system increases, the volume of the space increases so fast that the available data become sparse and therefore statistically less significant, as shown in fig. 1.3. The blessings of dimensionality are less widely noted, but they rest on the same concept as the main idea of statistical physics: if a system can be represented as a union of many weakly (but not negligibly) interacting subsystems, then, if the number of subsystems tends to infinity (i.e. in the thermodynamic limit), the whole system can be described by a few macroscopic variables (that are low dimensional). In the same way, because of the so-called concentration of measure phenomenon, we can say that even in a high dimensional dataset the fluctuations will be well controlled and asymptotic methods can be used. In other words, we can say that statistically we know more about a high dimensional dataset than about a lower dimensional one. According to [56], the blessing of dimensionality and the curse of dimensionality are two sides of the same coin.


Figure 1.3: As the dimensionality of the system increases (leaving the number of points fixed), the minimum distance and the maximum distance measured in the system remain close to the average instead of following the extremal distances. This plot highlights the main effect of the Concentration of Distances. Both the average value of the distance between two points (in blue) and its maximum and minimum values (in light green) grow much more slowly than the maximum possible distance in the system (in black), which is $\sqrt{d}$ in the Euclidean case. This is due to the fact that, leaving the number of points fixed, as the dimensionality increases the system appears to be much more sparse, and therefore it is unable to sample the total space well. This is also highlighted by the fact that the standard deviation (in dark green, plotted around the average) does not grow as the dimensionality of the system increases.

For example, the Concentration of Distances phenomenon: given a high dimensional dataset, the distance between two points will be, with high probability, close to the average distance [7]. This phenomenon significantly simplifies the expected geometry of data (blessing) [32], but, at the same time, it makes the search for neighbours in high dimensions difficult and even useless (curse) [44].
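A quick way to see the Concentration of Distances numerically (a sketch of ours, not part of the original text) is to track the relative variance of the pairwise distances of uniformly drawn points as the dimension grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)

def relative_variance_of_distances(d, n_points=200, p=2):
    """std/mean of the pairwise p-norm distances of n_points uniform points in [0,1]^d."""
    x = rng.random((n_points, d))
    dvals = pdist(x, metric="minkowski", p=p)
    return dvals.std() / dvals.mean()

for d in (1, 10, 100, 1000):
    print(f"d = {d:4d}   relative variance = {relative_variance_of_distances(d):.3f}")
# the ratio shrinks roughly like 1/sqrt(d): distances concentrate around their mean
```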

As we have seen in this chapter, the study of high dimensional random geometric graphs is a very important aim for random graph theory and for data science. In order to pursue it, we decided to investigate the properties of the object that allows us to define the relationships between the points, i.e. the distance, and the scaling properties of the observables of RGGs as functions of it.

Chapter 2

Central limit theorem for distances

The purpose of this chapter is to derive a multivariate Central Limit Theorem for the distances between N points extracted with a given probability distribution. In order to do this, the path we took was to exploit the generality of the proof of the multivariate central limit theorem given by [57], which is reproduced in the next section.

2.1 Central limit theorems

We first introduce a theorem that allows us to connect convergence in distribution of a sequence of random vectors with pointwise convergence of their characteristic functions.

Theorem 2.1.1 (Lévy's continuity theorem). Let $X_m$ and $X$ be random vectors in $\mathbb{R}^k$. Then $X_m \rightsquigarrow X$ if and only if $\mathbb{E}\, e^{it^T X_m} \to \mathbb{E}\, e^{it^T X}$ for every $t \in \mathbb{R}^k$. Moreover, if $\mathbb{E}\, e^{it^T X_m}$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic function of a random vector $X$, and $X_m \rightsquigarrow X$.

This theorem implies that weak convergence of vectors is equivalent to

weak convergence of linear combinations:

$$X_m \rightsquigarrow X \iff t^T X_m \rightsquigarrow t^T X \quad \forall\, t \in \mathbb{R}^k \tag{2.1}$$

This is known as the Cramér-Wold device, and it allows us to reduce higher-dimensional problems to the one-dimensional case, so it will be a key ingredient in the proof of the multivariate central limit theorem. But before doing so, we need to add one more piece, namely the univariate version of the same theorem.

Theorem 2.1.2 (Central limit theorem). Let $Y_1, Y_2, \dots$ be i.i.d. random variables with $\mathbb{E}Y_i = 0$ and $\mathbb{E}Y_i^2 = 1$. Then the sequence:

$$\sqrt{m}\,\overline{Y}_m = \frac{\sum_{i=1}^{m} Y_i}{\sqrt{m}} \tag{2.2}$$

converges in distribution to the standard normal distribution.

Proof. Differentiating twice the characteristic function of Y, $\phi(t) = \mathbb{E}\, e^{itY}$, under the expectation sign shows that $\phi''(0) = i^2\,\mathbb{E}Y^2$. Because $\phi'(0) = i\,\mathbb{E}Y = 0$, we obtain

$$\mathbb{E}\, e^{it\sqrt{m}\,\overline{Y}_m} = \phi\!\left(\frac{t}{\sqrt{m}}\right)^{m} = \left(1 - \frac{t^2}{2m}\,\mathbb{E}Y^2 + o\!\left(m^{-1}\right)\right)^{m} \longrightarrow e^{-\frac{t^2}{2}\mathbb{E}Y^2}. \tag{2.3}$$

This is possible because the characteristic function of independent random variables factorizes. The right side is the characteristic function of the normal distribution with mean zero and variance $\mathbb{E}Y^2$. The proposition follows from Lévy's continuity theorem.

Now we can use the previous theorem and the Cramér-Wold device to prove the multivariate CLT.

Theorem 2.1.3 (Multivariate central limit theorem). Let $Y_1, \dots, Y_l, \dots, Y_m$ be i.i.d. random vectors in $\mathbb{R}^S$ with mean $\mathbb{E}Y_l = \mu$ and covariance matrix $\Sigma = \mathbb{E}(Y_l - \mu)(Y_l - \mu)^T$. Then

$$\frac{1}{\sqrt{m}}\sum_{l=1}^{m} (Y_l - \mu) = \sqrt{m}\,(\overline{Y}_m - \mu) \rightsquigarrow \mathcal{N}_S(0, \Sigma) \tag{2.4}$$

(Note that the sum is taken coordinatewise.)


Proof. To prove this theorem we need to find the limit distribution of the random vectors. Using the Cramér-Wold device,

$$t^T\left(\frac{1}{\sqrt{m}}\sum_{l=1}^{m}(Y_l - \mu)\right) = \frac{1}{\sqrt{m}}\sum_{l=1}^{m}\left(t^T Y_l - t^T\mu\right) \tag{2.5}$$

Because the random variables $t^T Y_l - t^T\mu$ are i.i.d. with mean zero and variance $t^T\Sigma t$, this sequence is asymptotically $\mathcal{N}_1(0, t^T\Sigma t)$-distributed by the univariate central limit theorem, where $\mathcal{N}_1(0, t^T\Sigma t)$ is the univariate normal distribution. This is exactly the distribution of $t^T X$ if $X$ possesses an $\mathcal{N}_S(0,\Sigma)$ distribution, where $\mathcal{N}_S(0,\Sigma)$ is the multivariate normal distribution, defined as:

$$\mathcal{N}_S(0,\Sigma)(x) = \frac{e^{-\frac{1}{2}x^T\Sigma^{-1}x}}{\sqrt{(2\pi)^S\,|\Sigma|}} \tag{2.6}$$

2.2 Multivariate central limit theorem for distances

Now we want to specialize the previous multivariate CLT (Theorem 2.1.3) to the case of distances. Let $\vec{x}_1, \dots, \vec{x}_N$ be i.i.d. random points in $\mathbb{R}^d$ sampled with probability distribution $\nu(\vec{x})$ such that $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$, where the upper indices indicate the spatial coordinates (as will be better specified below). We will consider the p-norms in $\mathbb{R}^d$:

$$\|\vec{x}\|_p = \sqrt[p]{\sum_{k=1}^{d}|x^k|^p}. \tag{2.7}$$

Notice that p-norms are norms only for $p \ge 1$, as for $0 < p < 1$ the triangle inequality is not satisfied. In this case we can use $\|\vec{x}_i - \vec{x}_j\|_p^p$, which nonetheless defines a distance. Thus, we will consider the distance function:

$$d_p(\vec{x}_i, \vec{x}_j) = \|\vec{x}_i - \vec{x}_j\|_p^{\min(1,p)} \tag{2.8}$$


We expect that, in the high-dimensional limit, the distances concentrate around their mean value, so we prefer to use a rescaled distance:

$$q(\vec{x}_i, \vec{x}_j) = \frac{\left[d_p(\vec{x}_i, \vec{x}_j)\right]^{\max(1,p)} - d\mu}{\sqrt{d}} = \frac{1}{\sqrt{d}}\sum_{k=1}^{d}\left(|x_i^k - x_j^k|^p - \mu\right) = \frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(\vec{x}_i, \vec{x}_j) \tag{2.9}$$

where

$$\mu = \int dx_i\, dx_j\,\tau(x_i)\tau(x_j)\,|x_i - x_j|^p \tag{2.10}$$

Notice that the random vectors $\mathbf{q}^k = \big(q^k(\vec{x}_1,\vec{x}_2), q^k(\vec{x}_1,\vec{x}_3), \dots, q^k(\vec{x}_{N-1},\vec{x}_N)\big) \in \mathbb{R}^{\binom{N}{2}}$ (where we have considered that the distance is symmetric under point exchange), with $1 \le k \le d$, are statistically independent and identically distributed due to the fact that $\nu$ is a factorized distribution, and that the definition of $\mu$ implies that the expected value of $\mathbf{q}^k$ is the null vector. Notice also that the components of the vectors $\mathbf{q}^k$ are naturally indexed by lexicographically ordered multi-indices, as they are related to the distances between pairs of points along the k-th dimension; to distinguish such vectors from the Euclidean ones, we type them in boldface. So, we can use $q^k(\vec{x}_i, \vec{x}_j)$ as the random variable $Y_l^{(s)}$ seen in the previous theorem (Theorem 2.1.3), where the index s (the old coordinate) is represented by the double index (i,j) (i.e. the pair of points), while the sum is now made over the index k. Thus, the vector $\mathbf{q} = \big(q(\vec{x}_1,\vec{x}_2), q(\vec{x}_1,\vec{x}_3), \dots, q(\vec{x}_{N-1},\vec{x}_N)\big)$ is a sum of i.i.d. multivariate random variables, and satisfies the following central limit theorem:

Theorem 2.2.1 (Multivariate central limit theorem for distances). Let $\vec{x}_1, \dots, \vec{x}_N$ be i.i.d. random vectors in $\mathbb{R}^d$ sampled with probability measure $\nu(\vec{x})$ such that $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$. Then

$$\mathbf{q} = \frac{1}{\sqrt{d}}\sum_{k=1}^{d}\mathbf{q}^k \;\rightsquigarrow\; \mathcal{N}_{\binom{N}{2}}(0, \Sigma) \tag{2.11}$$


where

$$\Sigma_{(i,j),(l,m)} = \mathbb{E}\left[q^1(x_i, x_j)\, q^1(x_l, x_m)\right] = \Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big)\Big\rangle. \tag{2.12}$$

Note that $\Sigma_{(i,j),(l,m)}$ represents a matrix despite its four indices: each row and column is identified by a double index, i.e. by a pair of points.

Proof. The general proof is the same as that of Theorem 2.1.3. Here, we sketch a more down-to-earth argument. The probability distribution of the rescaled distances is defined as:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) =$$

$$= \int \prod_{k=1}^{d}\prod_{i=1}^{N} dx_i^k\,\tau(x_i^k) \int \prod_{i<j}^{N}\frac{d\lambda_{i,j}}{2\pi}\,\exp\left[i\lambda_{i,j}\left(q_{i,j} - \frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j)\right)\right]$$

where we have used the integral representation of the Dirac delta, and where we define $D\lambda = \prod_{i<j}^{N}\frac{d\lambda_{i,j}}{2\pi}\, e^{i\lambda_{i,j} q_{i,j}}$. Since the coordinates are i.i.d., the integral factorizes over the coordinate index k:

$$= \int D\lambda \left[\int \prod_{i=1}^{N} dx_i\,\tau(x_i)\,\exp\left(-i\sum_{i<j}^{N}\lambda_{i,j}\,\frac{|x_i - x_j|^p - \mu}{\sqrt{d}}\right)\right]^{d} \tag{2.16}$$

Now, if the second moment of $q^k(x_i, x_j)$ is finite, in the limit of d tending to infinity the argument of the exponential will be very small, so we can expand it in series:

$$\simeq \int D\lambda\left[\int \prod_{i=1}^{N} dx_i\,\tau(x_i)\left(1 - \frac{i}{\sqrt{d}}\sum_{i<j}^{N}\lambda_{i,j}\big(|x_i - x_j|^p - \mu\big) - \frac{1}{2d}\sum_{i<j}^{N}\sum_{l<m}^{N}\lambda_{i,j}\lambda_{l,m}\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big) + o(d^{-1})\right)\right]^{d}$$

Now we can solve the integrals in $dx_i$, obtaining expectation values (the linear term vanishes because $\langle|x_i - x_j|^p\rangle = \mu$):

$$\simeq \int D\lambda\left[1 - \frac{1}{2d}\sum_{i<j}^{N}\sum_{l<m}^{N}\lambda_{i,j}\lambda_{l,m}\Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big)\Big\rangle + o(d^{-1})\right]^{d}$$

Re-exponentiating the bracket in the limit $d \to \infty$ and carrying out the resulting Gaussian integral over $\lambda$ (a Hubbard-Stratonovich transformation, see Appendix B) we obtain:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) \simeq \mathcal{N}_{\binom{N}{2}}(0, \Sigma)(q)$$

In other words, we have obtained:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) \;\xrightarrow{\;d\to\infty\;}\; \mathcal{N}_{\binom{N}{2}}(0, \Sigma)(q) \tag{2.22}$$

This theorem proves that, in the limit of dimension tending to infinity, the probability distribution of the normalized distances among N points tends to a $\binom{N}{2}$-dimensional multivariate normal distribution with mean 0 and covariance matrix given by Σ. This matrix is the most important object in our result, so in the following sections we study its form in depth.
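As a quick numerical sanity check of this statement (our own sketch, with arbitrary N, d and p, assuming the uniform distribution on [0,1] for τ), one can estimate the covariance of the rescaled distance vector q over many independent draws and verify the three-value pattern of Σ derived in the following sections; for p = 1 and the uniform distribution, the closed forms of section 2.4 give α = 1/18, β = 1/180 and γ = 0.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N, d, p, samples = 4, 200, 1, 20000
mu = 1.0 / 3.0                      # E|x-y|^p for p = 1 and x, y uniform on [0,1]

pairs = list(combinations(range(N), 2))   # (0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
qs = np.empty((samples, len(pairs)))
for s in range(samples):
    x = rng.random((N, d))
    for a, (i, j) in enumerate(pairs):
        qs[s, a] = (np.abs(x[i] - x[j]) ** p - mu).sum() / np.sqrt(d)

cov = np.cov(qs, rowvar=False)
print("alpha (same pair)       ~", cov[0, 0])   # theory: 1/18  ~ 0.0556
print("beta  (pairs sharing 0) ~", cov[0, 1])   # theory: 1/180 ~ 0.0056
print("gamma (disjoint pairs)  ~", cov[0, 5])   # theory: 0
```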

2.3 The covariance matrix Σ

In this section we will present the different terms of the covariance matrix and then we will describe in detail its main characteristics.

2.3.1 The terms of Σ

In general Σ has the following form:

$$\Sigma_{(i,j),(k,l)} = \Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_k - x_l|^p - \mu\big)\Big\rangle \tag{2.23}$$

Based on the different types of index combinations, only three cases are possible:

• Diagonal correlation (i = k and j = l):

$$\Sigma_{(i,j),(i,j)} := \alpha = \Big\langle\big(|x - y|^p - \mu\big)^2\Big\rangle = \int dx\, dy\,\tau(x)\tau(y)\,|x - y|^{2p} - \mu^2 \tag{2.24}$$


• Triangular correlations (i = k and j ≠ l, or i ≠ k and j = l):

$$\Sigma_{(i,j),(i,l)} = \Sigma_{(i,j),(k,j)} := \beta = \Big\langle\big(|x - y|^p - \mu\big)\big(|x - z|^p - \mu\big)\Big\rangle \tag{2.25}$$
$$= \int dx\, dy\, dz\,\tau(x)\tau(y)\tau(z)\,|x - y|^p\,|x - z|^p - \mu^2 \tag{2.26}$$

• Pair-pair correlations (i, j, k, l are all distinct):

$$\Sigma_{(i,j),(k,l)} := \gamma = 0 \tag{2.27}$$

Using the Kronecker delta we can write the exact form of Σ:

$$\begin{aligned}
\Sigma_{(i,j),(k,l)} ={}& \alpha\left(\delta_{i,k}\delta_{j,l} + \delta_{i,l}\delta_{j,k}\right) \\
&+ \beta\left(\delta_{i,k}(1-\delta_{j,l}) + \delta_{i,l}(1-\delta_{j,k}) + \delta_{j,l}(1-\delta_{i,k}) + \delta_{j,k}(1-\delta_{i,l})\right) \\
&+ \gamma\,(1-\delta_{i,k})(1-\delta_{i,l})(1-\delta_{j,k})(1-\delta_{j,l}) \\
={}& (\alpha - 2\beta + \gamma)\,\delta_{i,k}\delta_{j,l} + (\beta - \gamma)\left(\delta_{i,k} + \delta_{i,l} + \delta_{j,k} + \delta_{j,l}\right) + \gamma.
\end{aligned} \tag{2.28}$$

where, in the last step, we have used the conditions $i < j$ and $k < l$.

2.3.2 Characterizing the covariance matrix

Here we describe in detail the main characteristics of the covariance matrix.

1. Symmetric: Σ is symmetric with respect to the exchange of pairs of points. This peculiarity is clear in Figure 2.1.

2. Elements: on each row/column, Σ contains:

• α: 1 time;
• β: 2(N − 2) times;
• γ: (N − 2)(N − 3)/2 times.


Figure 2.1: (Left) The shape of the Σ matrix. Σ matrix for N = 10 (45 pairs of indices), where α, β and γ (schematically represented in the legend) were coloured differently to make the symmetric form of this matrix explicit. (Right) Example of a Johnson graph with N = 5. The Johnson graph J(N, 2) is the line graph of the complete graph over N nodes. It has all the distinct pairs of the original nodes as its vertices, and two vertices are linked if their pairs share an original node.

3. Eigenvalues and multiplicities: with the aim of finding these, it is necessary to consider a representation of our matrix. Indeed it is possible to write Σ as:

$$\Sigma = (\alpha - \gamma)\,I + (\beta - \gamma)\,J + \gamma\, U. \tag{2.29}$$

where I is the identity matrix, U is the all-ones matrix and J is the adjacency matrix (with null diagonal) of the Johnson graph J(N, 2), which is the line graph of the complete graph over N vertices. It is clear that I, J and U commute with each other, so they can be diagonalized simultaneously. The contribution of I is trivial. J and U share a non-degenerate eigenvector (the one with all components equal to one), which accounts for $\lambda_1$. In the orthogonal subspace, U represents the null operator (its other eigenvectors are associated to eigenvalue 0), and does not contribute. Thus, the remainder of the spectrum is determined by that of J, which is known [21]. Following


this, the eigenvalues are:

• $\lambda_1 = \alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma$, with multiplicity $g_{\lambda_1} = 1$;

• $\lambda_2 = \alpha + (N-4)\beta - (N-3)\gamma$, with multiplicity $g_{\lambda_2} = N - 1$;

• $\lambda_3 = \alpha - 2\beta + \gamma$, with multiplicity $g_{\lambda_3} = \frac{N(N-3)}{2}$.

4. Positive semi-definiteness: in statistics, the covariance matrix of a multivariate probability distribution (like Σ) must be positive semi-definite, as shown by [17]. This implies that $\lambda_i \ge 0$. Knowing this, we have deduced some characteristics of α and β that will be useful in Appendix A.

5. Trace: $\mathrm{Tr}[\Sigma] = \frac{N(N-1)}{2}\,\alpha$

6. Determinant:

$$\mathrm{Det}[\Sigma] = \left(\alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma\right)\cdot \tag{2.30}$$
$$\cdot\left(\alpha + (N-4)\beta - (N-3)\gamma\right)^{N-1}\left(\alpha - 2\beta + \gamma\right)^{\frac{N(N-3)}{2}} \tag{2.31}$$

7. Inverse of the matrix: note that the probability distribution formula contains the inverse of Σ. In order to calculate it, we can take advantage of the following matrix identity:

$$\Sigma^{-1} = (O^T D\, O)^{-1} = O^T D^{-1} O, \tag{2.32}$$

where O is the orthogonal matrix of eigenvectors and D the diagonal matrix of eigenvalues. So $\Sigma^{-1}$ has the same eigenvectors of Σ, but with inverse eigenvalues; equivalently, it has the same structure with new parameters $(\bar\alpha, \bar\beta, \bar\gamma)$, i.e. $\Sigma^{-1}(\alpha, \beta, \gamma) = \Sigma(\bar\alpha, \bar\beta, \bar\gamma)$. The inverse eigenvalues can be evaluated by solving the following system of equations:

$$\lambda_i(\bar\alpha, \bar\beta, \bar\gamma) = \frac{1}{\lambda_i(\alpha, \beta, \gamma)} \qquad \text{for } i = 1, 2, 3. \tag{2.33}$$


The solution of this system is:

$$\mathrm{DEN}(\alpha,\beta,\gamma) = 2\,\big(\alpha - 2\beta + \gamma\big)\big(\alpha + (N-4)\beta - (N-3)\gamma\big)\left(\alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma\right) \tag{2.34}$$

$$\bar\alpha(\alpha,\beta,\gamma) = \frac{1}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg(-\gamma^2(N-4)(N-3)^2 + \gamma\beta(N-4)\big(22 + (N-11)N\big) + \gamma\alpha\big(14 + (N-7)N\big) + 2\Big(2\big(14 + (N-8)N\big)\beta^2 + \beta\alpha(3N-10) + \alpha^2\Big)\Bigg) \tag{2.35}$$

$$\bar\beta(\alpha,\beta,\gamma) = \frac{1}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg((N-4)(N-3)\gamma^2 - \big(26 + (N-11)N\big)\gamma\beta - 2\beta\big(2(N-4)\beta + \alpha\big)\Bigg) \tag{2.36}$$

$$\bar\gamma(\alpha,\beta,\gamma) = -\frac{2}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg((N-3)\gamma^2 - 4\beta^2 + \gamma\big(\alpha - (N-6)\beta\big)\Bigg) \tag{2.37}$$

2.4 Reliability of the developed CLT by comparison with simulations

In this section we will discuss the reliability of our result by comparing it with simulations. In order to make these simulations, we draw N i.i.d. samples $\{\vec{x}\}$ from $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$. With these points, we can calculate the distances and compare them with our CLT, where α, β and γ are calculated from ν. We use two different probability distributions, namely the uniform

distribution in the unit hypercube and the multivariate Gaussian distribution with mean 0 and variance σ². These can be written as:

$$\nu^{\mathrm{cube}}(\vec{x}) = \prod_{k=1}^{d}\theta(x^k)\,\theta(1 - x^k) \tag{2.38}$$

$$\nu^{\mathrm{gauss}}(\vec{x}) = \prod_{k=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x^k)^2}{2\sigma^2}} \tag{2.39}$$

Given ν, the first ingredients needed are µ, α and β.

• Flat distribution on the hypercube. From the definition of µ, using the transformation z = x − y, we can easily calculate it:

$$\begin{aligned}
\mu^{\mathrm{cube}} &= \int dx\, dy\,\tau(x)\tau(y)\,|x-y|^p = \int dx\, dy\,\theta(x)\theta(1-x)\theta(y)\theta(1-y)\,|x-y|^p \\
&= \int dx \int (-dz)\,|z|^p\,\theta(x)\theta(1-x)\theta(x-z)\theta(1-x+z) \\
&= \int_0^1 dx \int_{x-1}^{x} dz\,|z|^p = \int_0^1 dx\left(\int_0^x dz\, z^p + \int_{x-1}^0 dz\,|z|^p\right) \\
&= \int_0^1 dx\,\frac{x^{p+1} + (1-x)^{p+1}}{p+1} = \frac{2}{(p+1)(p+2)}.
\end{aligned} \tag{2.40}$$


Now, we can calculate the other ingredients, starting from α:

$$\alpha^{\mathrm{cube}} = \int dx\, dy\,|x-y|^{2p} - \mu^2 = \frac{2}{(2p+1)(2p+2)} - \frac{4}{(p+1)^2(p+2)^2} = \frac{p^2(p+5)}{(p+1)^2(p+2)^2(2p+1)} \tag{2.41}$$

Lastly, $\beta^{\mathrm{cube}}$:

$$\begin{aligned}
\beta^{\mathrm{cube}} &= \int dx\, dy\, dz\,\tau(x)\tau(y)\tau(z)\,|x-y|^p\,|x-z|^p - \mu^2 \\
&= \int_0^1 dx\left(\int_{x-1}^{x} d\eta\,|\eta|^p\right)\left(\int_{x-1}^{x} d\xi\,|\xi|^p\right) - \mu^2 = \int_0^1 dx\left(\int_{x-1}^{x} d\eta\,|\eta|^p\right)^2 - \mu^2 \\
&= \int_0^1 dx\,\frac{x^{2p+2} + (1-x)^{2p+2} + 2\,x^{p+1}(1-x)^{p+1}}{(p+1)^2} - \mu^2 \\
&= \frac{2}{(p+1)^2}\left(\frac{p^2 - 2}{(p+2)^2(2p+3)} + \frac{\Gamma^2(p+2)}{\Gamma(2p+4)}\right)
\end{aligned} \tag{2.42}$$

where Γ(x) is the Euler gamma function (these closed forms are cross-checked numerically in the sketch placed right after this list).

• Multivariate Gaussian distribution. In order to find the elements of the covariance matrix in this case we have used Mathematica 12:

$$\mu^{\mathrm{gauss}} = \int dx\, dy\,\frac{e^{-\frac{x^2}{2\sigma^2}}\, e^{-\frac{y^2}{2\sigma^2}}}{2\pi\sigma^2}\,|x-y|^p = \frac{(2\sigma)^p\,\Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}$$

$$\alpha^{\mathrm{gauss}} = \frac{(2\sigma)^{2p}}{\pi}\left(\sqrt{\pi}\,\Gamma\!\left(\frac{1}{2} + p\right) - \Gamma^2\!\left(\frac{p+1}{2}\right)\right)$$

$$\beta^{\mathrm{gauss}} = \frac{2^{p}\,\sigma^{2p}\,\Gamma^2\!\left(\frac{p+1}{2}\right)}{\pi\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} dx\; e^{-\frac{3x^2}{2\sigma^2}}\left[{}_1F_1\!\left(\frac{p+1}{2};\,\frac{1}{2};\,\frac{x^2}{2\sigma^2}\right)\right]^2 - \mu^2 \tag{2.43}$$

where ${}_1F_1$ is the confluent hypergeometric function. From now on, we use σ = 1 for simplicity.
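As promised above, here is a minimal Monte Carlo cross-check (our own, with arbitrary sample size) of the closed forms obtained for the flat distribution; for p = 1 the expected values are µ = 1/3, α = 1/18 and β = 1/180.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 1, 2_000_000
x, y, z = rng.random((3, n))

mu_mc    = np.mean(np.abs(x - y) ** p)
alpha_mc = np.mean(np.abs(x - y) ** (2 * p)) - mu_mc ** 2
beta_mc  = np.mean(np.abs(x - y) ** p * np.abs(x - z) ** p) - mu_mc ** 2

# closed forms of eqs. (2.40) and (2.41); eq. (2.42) gives 1/180 at p = 1
mu_th    = 2 / ((p + 1) * (p + 2))
alpha_th = p**2 * (p + 5) / ((p + 1) ** 2 * (p + 2) ** 2 * (2 * p + 1))
print("mu   :", mu_mc, "vs", mu_th)
print("alpha:", alpha_mc, "vs", alpha_th)
print("beta :", beta_mc, "vs", 1 / 180)
```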

Now we can finally make the comparison between our central limit theorem and a simulation. The simplest way to do that is to draw only N = 2 points: in this case the rescaled distance vector q is one-dimensional, so we can easily plot it. We have repeated the draw $10^5$ times, we have plotted the distances in a histogram and we have compared these plots with the Normal distribution that comes from the CLT. In order to improve the comparison, we have also calculated the mean and the variance from the dataset of distances and we have plotted the Normal distribution with these parameters: this is useful to cut off the fluctuations and make a better comparison. It is possible to see the results in Figures 2.2, 2.3, 2.4 and 2.5. We have repeated this procedure for different values of the dimension and of the norm exponent p. From these comparisons we can extract three pieces of information:

• The first thing that one notices (figs. 2.2 and 2.3) is that, at fixed p, with increasing dimensionality the simulation and the theoretical result become more and more similar. This is not unexpected, because we know that the CLT works in the limit of infinite dimension. Nevertheless, we have a good similarity even in dimension 5, which is very far from infinity. Although at the lowest dimensions the shape of the Normal distribution is very different from the shape of the histogram, the mean and the variance of the simulation are equal to the theoretical ones.

• The scaling in p seems to be in stark contrast with what we read in the literature about the concentration of distances. In the literature [22] we can read that as p increases, the effects of distance concentration intensify, that is, the system concentrates at lower dimensionality. Instead, in our case it is observed (figs. 2.5 and 2.4) that with increasing p the speed of convergence in dimensionality becomes slower. In other words, the higher p, the slower the concentration of the histogram around the normal distribution. The reason for this difference is simple: we are using a different definition of distance than the classic one, which is closer to a p-norm. In the literature one generally uses:

$$d_p^{\mathrm{lit}}(\vec{x}, \vec{y}) = \sqrt[p]{\sum_{k=1}^{d}|x^k - y^k|^p}. \tag{2.44}$$

Note that this is not a distance but a semi-distance for p < 1, as it does not satisfy the triangle inequality. Instead, as we said in the previous section, we use:

$$d_p^{\mathrm{our}}(\vec{x}, \vec{y}) = \sum_{k=1}^{d}\big|x^k - y^k\big|^p, \tag{2.45}$$

which is a distance for each value of p. To be sure that our result changes coherently with the change of definition of the distance, we studied the behaviour of an observable when d and p change, both in the case of our distance and of the classic one. This observable, the so-called relative variance, is defined as:

$$RV_{p,d}(X) = \frac{\sqrt{\mathrm{Var}(X)}}{\mathbb{E}(X)}. \tag{2.46}$$

The RV is really important for the concentration of distances because it is able to give a measure of this phenomenon. Moreover, being a well studied observable, we know its theoretical trend for random systems. In fact, using the classic distance that derives from p-norms, we expect [22] a trend like:

$$RV_{p,d}^{\mathrm{teo}} \simeq \frac{1}{\sqrt{d}}\,\frac{1}{p}\,\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}. \tag{2.47}$$

In our simulations, in figure 2.8, we observed that in the case of the classical distance the trend is coherent with what is foreseen by the theory. Instead, in our case, the trend of RV seems to follow a law like:

$$RV_{p,d} \simeq \frac{1}{\sqrt{d}}\,\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}. \tag{2.48}$$

This behaviour mainly affects the scaling in p. In fact we can see how, while the scaling in d is the same in both cases, the trend in p is very different: in the classic case it is observed that the concentration effects are maximally reduced in the case of p = 0.5, while in our case the same effect is observed for p = 5. This is consistent with the trend of $\frac{1}{p}\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}$ in the first case and of $\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}$ in the second one, as can be seen in the insets of fig. 2.8. This suggests that the effects of the concentration of distances in our case cannot be compared with what has been seen in the literature, due to the different definition of distance.

• The last piece of information is that the scaling in dimensionality and in p is the same for points drawn with a flat distribution in a hypercube or with a multivariate Gaussian, even if the speed of scaling is different in the two cases. This too is not unexpected, because our CLT holds for a generic probability distribution, with the only requirement that it be factorizable.

We also followed the same procedure by extracting N = 3 points at a time. However, in this case the vector of the rescaled distances is three-dimensional, so in order to draw a plot we had to marginalize the probability distribution so as to have only one free variable. Even in this case, the information that we can extract from the comparison is the same that we found in the case of N = 2, as shown by Figures 2.6 and 2.7.
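The N = 2 comparison described above can be reproduced in a few lines; the sketch below (our own parameter choices, uniform points and p = 1, for which µ = 1/3 and α = 1/18) overlays the histogram of the rescaled distance q with the Gaussian limit N(0, α).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
d, p, samples = 100, 1, 50_000
mu, alpha = 1 / 3, 1 / 18            # closed forms for the uniform case with p = 1

x = rng.random((samples, d))
y = rng.random((samples, d))
q = (np.abs(x - y) ** p - mu).sum(axis=1) / np.sqrt(d)

grid = np.linspace(q.min(), q.max(), 300)
gauss = np.exp(-grid**2 / (2 * alpha)) / np.sqrt(2 * np.pi * alpha)

plt.hist(q, bins=100, density=True, alpha=0.5, label="simulated rescaled distances")
plt.plot(grid, gauss, label="CLT prediction N(0, alpha)")
plt.legend()
plt.show()
```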


Figure 2.2: Increasing the dimensionality of the system, the simulation of distances between two points drawn from the flat distribution in the hypercube and the theoretical result become more and more similar. The comparison at different dimensionality (from top left: d = 100, 10, 5, 3, 2, 1 and p = 1) between the histogram of distances between two points drawn from the flat distribution in the hypercube and the normal distribution that comes from the CLT (in blue) shows that in the limit of high dimensionality our theoretical result approximates well the distribution of the distances between randomly drawn points. On the other hand, at low d, the finite dimensionality effects are very evident. Nevertheless, at every dimensionality the mean and the variance of the simulation are equal to the theoretical ones; indeed the normal distribution with the parameters extracted from the simulation (in black, dashed) and the theoretical result always overlap.


Figure 2.3: Changing the probability distribution from which the points are extracted, from the flat distribution in a hypercube to a multivariate Gaussian, changes the scaling speed in d but leaves the general behaviour unchanged. The comparison at different dimensionality (from top left: d = 100, 10, 5, 3, 2, 1 and p = 1) between the histogram of distances between two points drawn from the multivariate Gaussian distribution in $\mathbb{R}^d$ and the normal distribution that comes from the CLT (in blue) shows that in the limit of high dimensionality our theoretical result approximates well the distribution of the distances between randomly drawn points, as we see in fig. 2.2. But, compared to the previous case, the speed of convergence is slowed down; in fact the finite dimensionality effects are visible even at d = 10.


Figure 2.4: In the case of points drawn from the flat distribution in a hypercube, higher p values slow the convergence of the distribution of distances to the normal limit distribution of our theoretical result. The slowdown of the speed of convergence is clear when comparing the plots at the same dimensionality but with different p values (from top: p = 0.5, 2, 5; on the left d = 100 and on the right d = 10). Indeed, it is evident that as p increases (with fixed dimensionality) the simulation seems to move away from the theoretical line. Implicitly, the system requires, with increasing p, a higher dimensionality to reach a good level of convergence.


Figure 2.5: Even in this case, changing the probability distribution from which the points are extracted, from the flat distribution in a hypercube to a multivariate Gaussian, changes the scaling speed in p but leaves the general behaviour unchanged, as we have seen for d. The slowdown of the speed of convergence is clear when comparing the plots at the same dimensionality but with different p values (from top: p = 0.5, 2, 5; on the left d = 100 and on the right d = 10). Compared to the case in fig. 2.4, we can see how in this case the convergence is even slower, i.e. the system requires an even higher dimensionality to reach a good level of convergence.


Figure 2.6: Even in the case of N = 3 points, with a three-dimensional distribution of distances, we find the same scaling in dimensionality and in p for the convergence of the probability distribution of distances between points drawn from the flat distribution in a hypercube to our theoretical result. We plotted only 2 of the 3 distances (the third was ignored) in a density histogram (lighter colour corresponds to a more frequent event), from which we extracted the marginal histograms (left and bottom in each graph) that were compared with the marginalization of the theoretical probability distribution (in black). From these comparisons (from top: d = 1, 3, 100; left: p = 1 and right: p = 2) we are able to deduce that the scaling remains unchanged even for a larger number of points N.


Figure 2.7: Even in the case of N = 3 points, changing the probability distribution of the points does not change the scaling in d and in p. We plotted only 2 of the 3 distances in a density histogram (lighter colour corresponds to a more frequent event), from which we extracted the marginal histograms (left and bottom in each graph) that were compared with the marginalization of the theoretical probability distribution (in black) (from top: d = 1, 3, 100; left: p = 1 and right: p = 2). Compared to the case in fig. 2.6, we can see how in this case the convergence in both d and p is even slower. Note: these plots refer to the Gaussian distribution in $\mathbb{R}^d$.


[Figure 2.8: the relative variance $RV_p(d)$ as a function of d for p = 0.2, 0.5, 1, 2, 5, for the classical distance and for our distance, with insets showing the corresponding scaling laws in p.]

Figure 2.8: Having defined the distance differently from the classic way, the effects of the concentration of distances are very different, especially in the scaling in p. Nevertheless, the concentration effects with increasing dimensionality remain present even when changing the distance. Top: the relative variance in the case of the classical distance, $RV^C$, shows a nonlinear trend in p with a maximum at p = 0.5, as evidenced by its scaling law (in the inset). At this maximum we can see how the concentration effects are slowed down. Bottom: using instead our distance, we observe that the scaling of $RV^O$ is monotonically increasing in p, a scaling justified by our empirical law (in the inset). This trend is consistent with what has been seen in the previous figures (2.5 and 2.4). Furthermore, the fact that changing the distance also changes the trend of the concentration effects explains why we observe a different trend in p.

Chapter 3

Random geometric graph in high dimension

As we saw in Chapter 1, the concept of distance is fundamental in random geometric graphs: the activation function depends on the distance, both directly and parametrically through r. In the same chapter we also saw how the connectivity of the graph is a function of r. Consequently, the study of observables as a function of the parameter r is very important, because it allows us to characterize graphs based on their connectivity. In Chapter 2 we saw how the distances between N randomly drawn points, in the limit of high dimensionality, behave following a central limit theorem. In this chapter we will see how, by applying our CLT for distances to random geometric graphs, we will be able to make predictions on connectivity and consequently on observables, specifically on the k-clique density.

3.1 Clique density in high dimensionality

In general, our result allows us to calculate in a simple way the average number of subgraphs with a given structure in the limit of high dimensionality. In order to do that, we need to introduce the adjacency matrix $A_{i,j}$


of a subgraph with M points, i.e. the M × M matrix with entry $A_{i,j} = 1$ if (i,j) is linked and $A_{i,j} = 0$ otherwise. The average number of occurrences of a certain subgraph g with M nodes in a random graph can be factored into two terms. The first one is a combinatorial factor $\binom{N}{M}$, which accounts for the number of ways in which one can extract M nodes from a set of N of them. The second one is the so-called density $\rho_g(r)$ of the subgraph g at scale r, which is the probability (considering the cut-off radius r) that M points are linked together to form a subgraph with the same adjacency matrix as g:

$$\rho_g(r) = \int d\mathbf{d}\;\Pi(\mathbf{d}) \prod_{1\le i<j\le M}\big[h_r(d_{(i,j)})\big]^{A_{(i,j)}} = \tag{3.1}$$

$$= \int d\mathbf{q}\;\mathcal{N}(0,\Sigma)(\mathbf{q}) \prod_{1\le i<j\le M}\left[h_r\!\left(\Big(d\mu + \sqrt{d}\, q_{(i,j)}\Big)^{\min\left(1,\frac{1}{p}\right)}\right)\right]^{A_{(i,j)}}$$

where $\Pi(\mathbf{d})$ is the joint probability distribution of the pairwise distances $d_{(i,j)}$ and, in the second line, we have changed variables to the rescaled distances and used our central limit theorem to replace it with $\mathcal{N}(0,\Sigma)$. In the particular case of a k-clique, all the entries of the adjacency matrix are equal to 1, so the density reads:

$$\rho_k(r) = \int d\mathbf{q}\;\mathcal{N}(0,\Sigma)(\mathbf{q}) \prod_{1\le i<j\le k} h_r\!\left(\Big(d\mu + \sqrt{d}\, q_{(i,j)}\Big)^{\min\left(1,\frac{1}{p}\right)}\right) \tag{3.2}$$

Now that we have an explicit formula for the density of cliques, we can perform some simulations to see whether our result is consistent with real systems (for an overview of the integration methods, see section C.1). In order to do this we have drawn N random points with a certain probability distribution τ to form a Hard RGG and we have calculated the number of k-cliques present in the graph as the parameter r changes. We have repeated this a considerable number of times and averaged over realizations, then we have compared the simulation result with the theoretical result.
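One practical way to evaluate the theoretical prediction (3.2) is by Monte Carlo sampling of the multivariate Gaussian; the code below is our own hedged sketch of this idea, not the actual implementation of appendix C.1. It uses the uniform-distribution values α = 1/18, β = 1/180, µ = 1/3 valid for p = 1, and all other choices are arbitrary.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

def theoretical_rho_k_hard(r, k, d, alpha=1/18, beta=1/180, mu=1/3, samples=200_000):
    """Monte Carlo estimate of eq. (3.2) for a hard RGG with p = 1 and uniform points."""
    pairs = list(combinations(range(k), 2))
    M = len(pairs)
    Sigma = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (l, m) in enumerate(pairs):
            shared = len({i, j} & {l, m})
            Sigma[a, b] = alpha if shared == 2 else (beta if shared == 1 else 0.0)
    q = rng.multivariate_normal(np.zeros(M), Sigma, size=samples)
    dist = d * mu + np.sqrt(d) * q          # (d*mu + sqrt(d)*q)^{min(1,1/p)} with p = 1
    return np.mean(np.all(dist < r, axis=1))

# density of 3-cliques around the typical distance scale d*mu for d = 100
for r in (30.0, 33.0, 36.0):
    print(r, theoretical_rho_k_hard(r, k=3, d=100))
```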


As shown in fig. 3.1, it is evident that our result is excellent in the high dimension limit, but a very good approximation is obtained even in the case of d = 10. For the scaling in p, we observed a behaviour similar to that in the previous section (fig. 3.2): to have a good approximation as p increases, it is necessary to have a higher dimensionality in the system. It is also interesting to study the scaling in the dimensionality of the clique considered, i.e. with the variation of k. It is possible to see (fig. 3.1) how, as k increases, keeping the dimensionality of the system fixed, the goodness of our result turns out to be slightly lower. Furthermore, also in this case, by changing the probability distribution of the points, the same things are observed, as shown in fig. 3.3. We therefore demonstrated the goodness of our result for the density of k-cliques. Despite the results obtained, the explicit dependence of $\rho_k(r)$ on r, p and µ makes it more difficult to perform further, more detailed analyses. In fact, this dependence makes it impossible to compare the densities of the k-cliques obtained from systems with different parameters. To avoid this problem, we must define a new variable that accurately describes the density of k-cliques without explicitly depending on these parameters. The new variable, called rescaled k-clique density, is defined as follows:

$$\omega_k(x) = \big(\rho_k \circ \rho_2^{-1}\big)(x) \tag{3.3}$$

In this variable all these dependences cancel out, and the curves at different values of the parameters all lie in the domain x ∈ [0,1]. Indeed, by changing variable from r to $x = \rho_2(r)$, we are expressing our observables as functions of the average probability that two nodes are linked, so that we are explicitly factoring out the typical scale of separation between the nodes from the observables. This variable is also particularly convenient because, as we will see in the next sections, it will allow us to take full advantage of the different characteristics of Hard and Soft RGGs, so as to be able to perform more in-depth analyses and find the limiting behaviour in the case of high dimensional systems.
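In practice, given two curves $\rho_2(r)$ and $\rho_k(r)$ sampled on the same grid of r values (simulated or theoretical), $\omega_k$ can be obtained by monotone interpolation; the sketch below is a generic helper of ours, and the input curves in the example are placeholders chosen so that the expected output is $x^3$.

```python
import numpy as np

def rescaled_clique_density(r_grid, rho2, rhok, x):
    """Evaluate omega_k(x) = rho_k(rho_2^{-1}(x)) by interpolation.

    r_grid, rho2, rhok: arrays sampling rho_2(r) and rho_k(r) on the same r values;
    x: points in [0, 1] where omega_k is evaluated.
    """
    order = np.argsort(rho2)                 # rho_2(r) is monotone in r, so it is invertible
    r_of_x = np.interp(x, rho2[order], r_grid[order])
    order_r = np.argsort(r_grid)
    return np.interp(r_of_x, r_grid[order_r], rhok[order_r])

# placeholder Erdos-Renyi-like curves: rho_2 = r^2 and rho_3 = rho_2^3, so omega_3(x) = x^3
r = np.linspace(0.01, 1.0, 200)
x = np.linspace(0.0, 1.0, 5)
print(rescaled_clique_density(r, r**2, r**6, x))   # ~ [0, 0.016, 0.125, 0.42, 1]
```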


Figure 3.1: By applying the central limit theorem for distances to the calculation of the k-clique density in a Hard RGG, we are able to predict the behaviour of this density as the cutoff radius r varies, at least in the high dimensionality limit (d ⩾ 10). We extracted N = 1000 points with a uniform distribution in the unit hypercube and with these we calculated the k-clique density $\rho_k$ as the cutoff radius r varied. This operation was repeated S = 500 times (in green) and these simulations were subsequently averaged (in dark blue) in order to be compared with our theoretical prediction (in yellow). It is evident that the predictive capacity of our result improves with the increase of the dimensionality of the system, even if at fixed dimensionality there is a better approximation at lower k (k = 2 on the left and k = 5 on the right).



Figure 3.2: As p increases, a higher system dimensionality is required to obtain a good approximation. This is consistent with what was observed for the CLT in the previous chapter. We repeated the same procedure as in fig. 3.1, but this time we changed the exponent of the distance p (p = 0.5 on the left and p = 2 on the right). Consistent with the reduced accuracy of our CLT at larger p, it is observed that as p increases, keeping the dimensionality fixed, our prediction for the density of the k-cliques gets worse.



Figure 3.3: Changing the probability distribution of the points does not affect the accuracy of our result for the density of the k-cliques. We repeated the same procedure followed in figures 3.1 and 3.2, but this time the points were extracted from a Gaussian probability distribution with zero mean and unit variance. Similarly to what happened for the CLT in the previous chapter, the accuracy of our prediction follows the same scaling observed in the case of points extracted uniformly from a hypercube.


3.2 Clique density in high dimensional Soft RGGs: an Erdős-Rényi behaviour

The main difference between Hard and Soft RGGs is their activation functions. Indeed, while the hard activation h_r^hard is a boolean function, the soft one h_r^soft is in general a continuous function, as in the case of the Rayleigh fading activation function. Starting from the analytical formula of the k-clique density, we can take advantage of the continuity of the activation function to find the leading order of ω_k^soft in the high dimensional case.

$$\rho_k(r) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_r\!\left(\left[d\mu + \sqrt{d}\, q_{(i,j)}\right]^{\min\left(1,\frac{1}{p}\right)}\right) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_r\!\left(\left[d\left(\mu + \frac{q_{(i,j)}}{\sqrt{d}}\right)\right]^{\min\left(1,\frac{1}{p}\right)}\right) \tag{3.4}$$

Now, in the high dimension limit, we can expand the activation function in powers of $1/\sqrt{d}$:

$$\rho_2^{soft}(r) \simeq \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j} \left( h_r\!\left((d\mu)^{\min\left(1,\frac{1}{p}\right)}\right) + O\!\left(\frac{1}{\sqrt{d}}\right) \right) \tag{3.5}$$

$$\rho_2^{soft}(r) \simeq h_r\!\left((d\mu)^{\min\left(1,\frac{1}{p}\right)}\right). \tag{3.6}$$

So, in the case of a generic k-clique:

$$\rho_k^{regular}(r) = \left[\rho_2^{regular}(r)\right]^{\binom{k}{2}} \tag{3.7}$$

In this case, the relation between ρ_k and ρ_2 reduces to that of Erdős-Rényi graphs with linking probability ρ_2^soft, i.e.:

$$\omega_k^{soft}(x) = x^{\binom{k}{2}} = \omega_k^{ER}(x). \tag{3.8}$$


Indeed, as shown in [8], the k-clique density of an Erdős-Rényi graph is $p^{\binom{k}{2}}$, where p is the linking probability.
To verify the validity of this result, we carried out simulations, repeating the procedure of the previous section but studying the trend of the rescaled k-clique density (for an overview of the numerical methods, see section C.2). To perform these simulations we generated the soft RGG using the Rayleigh fading activation function h^rayleigh with η = 2 and ξ = 1/2, therefore having:

$$\rho_2^{rayleigh}(r) = \exp\left[-\frac{1}{2}\left(\frac{(d\mu)^{\min\left(1,\frac{1}{p}\right)}}{r}\right)^{2}\right]. \tag{3.9}$$
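For reference, the limiting predictions of eqs. (3.8) and (3.9) are straightforward to evaluate; the sketch below (with function names of our own choosing) computes the leading-order Rayleigh-fading link probability and the corresponding Erdős-Rényi rescaled k-clique density.

```python
import numpy as np
from math import comb

def rho2_rayleigh(r, d, mu, p):
    """Leading-order link probability of eq. (3.9), Rayleigh fading with
    eta = 2 and xi = 1/2."""
    scale = (d * mu) ** min(1.0, 1.0 / p)
    return np.exp(-0.5 * (scale / np.asarray(r, dtype=float)) ** 2)

def omega_k_er(x, k):
    """Erdos-Renyi rescaled k-clique density of eq. (3.8)."""
    return np.asarray(x) ** comb(k, 2)

# illustrative check for uniform points in the unit hypercube with p = 1,
# where the single-coordinate mean distance is mu = 1/3 (cf. Table 4.1)
d, mu, p = 100, 1.0 / 3.0, 1
r = np.linspace(10.0, 80.0, 8)
x = rho2_rayleigh(r, d, mu, p)
print(omega_k_er(x, k=3))   # predicted 3-clique density at those radii
```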

As we can see in fig. 3.4, the convergence to the limit is very fast, and even at d = 10 our analytical prediction provides a good approximation of the simulated observables, that is, ω_k^soft follows the same trend as ω_k^ER. Furthermore, we observed that this behaviour does not change with the parameter p of the distance, even if a slight slowdown in convergence is observed for low values of p.

3.3 Clique density in high dimensional Hard RGGs: a non-Erdős-Rényi behaviour

Hard RGGs are very different from Soft RGGs: in fact, as we said in the previous section, the activation function in this case is boolean, therefore it is neither continuous nor differentiable. On the other hand, in this case we are free to rescale the cut-off radius. Indeed, for h^hard we observe that:

$$h_r^{hard}(x) = h_{r^p}^{hard}(x^p), \qquad h_r^{hard}(x) = h_{cr}^{hard}(cx) \quad \forall c \in \mathbb{R}^{+}. \tag{3.10}$$

We can therefore take advantage of this freedom to rewrite ρ_k^hard in a form convenient for calculating ω_k^hard, that is, to absorb all the parameters into a single variable. In fact, by carrying out a translation and



Figure 3.4: In the high dimension limit (d ≳ 10), the rescaled k-clique density of a soft RGG tends to that of an ERG. Furthermore, this behaviour does not seem to be influenced by the choice of the exponent of the distance p.
Left: the simulated ω_k^soft is plotted for systems with different dimensionality (triangles for d = 2, diamonds for d = 10 and circles for d = 100) and for different values of k (k = 3 in blue, k = 4 in dark green and k = 5 in light green). As we can see, the convergence to the limit is fast, and even at d = 10 the simulations are very close to the rescaled k-clique density of an ERG, ω_k^ER (in black). Note that for these simulations we used p = 2.
Right: in a 10-dimensional system we studied the trend of ω_k^soft as the parameter p of the distance varies (triangles for p = 0.5, diamonds for p = 1 and circles for p = 2). The choice of this parameter does not seem to influence the limit behaviour, even if there is a slight slowdown in convergence for the smallest value of p considered.

eliminating the root, we obtain that the integral reduces to:

$$\rho_k^{hard}(r) = \rho_k^{hard}\!\left(\frac{r^{\max(1,p)} - d\mu}{\sqrt{d}}\right), \tag{3.11}$$

with

$$\rho_k^{hard}(x) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_x^{hard}\!\left(q_{(i,j)}\right). \tag{3.12}$$



Figure 3.5: In the case of Hard RGGs, the rescaled k-clique density does not converge to ω^ER, even in the high dimension limit. The simulated ω_k^hard is plotted for systems with different dimensionality (triangles for d = 2, diamonds for d = 10 and circles for d = 100), for different values of k (k = 3 in blue, k = 4 in dark green and k = 5 in light green), and different choices of p (top: p = 2 on the left and p = 1 on the right; bottom: p = 0.5 on the left). In this case it is observed that ω_k^hard, in the case of high dimensional systems, does not converge to the ER limit, but to a slightly different limit function (solid line). The choice of p does not seem to be a determining factor for this convergence, although it is observed that the limit function tends to approach ω^ER as p decreases (bottom right, smaller p correspond to lighter colors).

Chapter 4

A RGGs approach to a Real-World dataset

4.1 Real-World dataset

As we said in the previous chapters, the purpose of this thesis is to study RGGs and their behaviour in the high dimension limit from a statistical point of view, keeping in mind their wide use in data science as a null model. In fact, we saw in Chapter 1 how it is possible to express a dataset as a high-dimensional geometric graph. Now that we have found the limit behaviour of high dimensional RGGs, we can check whether these predictions also apply to the geometric graphs created from real datasets.
Among all the possible Real-World datasets, i.e. the datasets obtained from real events or objects, there are some particularly simple ones that are used as a benchmark for many data science techniques, especially for artificial intelligence. In this thesis we decided to compare our predictions with 3 of these datasets.
The first, certainly the best known and most used, is the handwritten digit dataset known as the MNIST database (Modified National Institute of Standards and Technology database). This dataset contains 70 000 grayscale images written as 28 × 28 matrices whose entries are integers that identify the color intensity, from 0 (white) to 255 (intense black). This dataset is widely used [48, 36, 54] to test the predictive capabilities of neural networks. In fact, the set is divided into 60,000 training images and 10,000 testing images, and each of these images is associated with the written digit.
A dataset very similar to MNIST, but containing slightly more complex objects than handwritten numbers, is Fashion-MNIST. This dataset is made up of 70 000 images (28 × 28 pixels, in grayscale as MNIST) of Zalando's articles divided into 10 different categories of clothing, such as T-shirts, sweatshirts, bags, shoes and sandals. This dataset is also widely used [58, 49, 2]; in fact it was created specifically to overcome the critical issues of MNIST: it was observed that MNIST is too simple, overused and, most of all, not representative of modern computer vision tasks.
The last dataset we have chosen is even more complex and also widely used [34, 1, 10, 35]. This dataset, known as CIFAR-10 (Canadian Institute For Advanced Research), contains 60 000 color images of size 32 × 32 divided into 10 different categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. In this case each image is written as a 32 × 32 × 3 tensor, because each pixel is associated with 3 numbers, corresponding to the three colors of the RGB system. In order to compare the three datasets in a coherent way, we converted the images of this dataset to grayscale, thus obtaining that each image can be represented as a matrix like the others, where each pixel is associated with a single entry that can take values between 0 and 255.

4.1.1 Real-World vs random dataset

Having represented the elements of these datasets as high-dimensional vectors (784 for MNIST and Fashion-MNIST and 1024 for CIFAR-10), we can use them to create the corresponding geometric graphs. To do this, it is enough to consider these vectors as points in a metric space and link them using the hard activation function, calculating the distances between the points with our definition of the distance (eq. 2.8).
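A minimal sketch of this construction is shown below, assuming the p-distance convention implied by the max(1, p) and min(1, 1/p) exponents used in the text (no p-th root is taken for p < 1); the function names, the rescaling by 255 and the sub-sampling of the dataset are our own illustrative choices.

```python
import numpy as np

def p_distance_matrix(X, p=1.0):
    """Pairwise p-distances between the rows of X, without the p-th root
    for p < 1 (consistent with the min(1, 1/p) exponents in the text)."""
    diff = np.abs(X[:, None, :] - X[None, :, :]) ** p   # |x_i^k - x_j^k|^p
    s = diff.sum(axis=-1)                                # sum over coordinates
    return s ** (1.0 / p) if p >= 1 else s

def hard_geometric_graph(X, r, p=1.0):
    """Adjacency matrix of the hard geometric graph on the rows of X:
    two points are linked when their p-distance is at most r."""
    dist = p_distance_matrix(X, p)
    return (dist <= r) & ~np.eye(len(X), dtype=bool)     # no self-loops

# hypothetical usage on a dataset rescaled to the unit hypercube
# X = mnist_vectors / 255.0            # shape (N, 784), not defined here
# A = hard_geometric_graph(X[:1000], r=80.0, p=1.0)
```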



Figure 4.1: Matrix form of some of the images contained in the various datasets. We show the matrices relating to various elements of the 3 datasets, colouring the pixels with darker colors when they have a higher entry.

Using these geometric graphs we carried out simulations to see the trend of the rescaled k-clique density. As we can see in fig. 4.2, the scaling of the rescaled k-clique densities of the three datasets is very similar in each case.



Figure 4.2: The trend of the rescaled k-clique density of the three datasets is very similar in each case, but it follows neither the trend of the hard RGGs nor that of the ER graph. We plotted the three datasets (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) as k (k = 3 in blue, k = 4 in dark green and k = 5 in light green) and p (p = 1 on the left and p = 2 on the right) change, and it is possible to see how the behaviour of the datasets is very different from those seen for soft and hard RGGs.

It is also evident that none of them follows the limit behaviour seen in the case of Hard and Soft RGGs, suggesting that the internal order of the datasets does not allow convergence to the random case.

4.1.2 Moments of Real-World dataset

With the aim of extracting further information from the datasets and trying to understand why they do not converge to the random case, we decided to extract the µ, α and β parameters from the datasets. These parameters are the moments of a one-dimensional probability distribution, as discussed in section 2.3. In order to extract them we calculated the following expectation values:


• for µ, which is the mean value of the distance, we used:

$$\mu = \mathbb{E}\left[\mathrm{dist}_p(x_i^k, x_j^k)^{\max(1,p)}\right]_{(i,j),\,k} = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \frac{1}{d} \sum_{k=1}^{d} \left|x_i^k - x_j^k\right|^{p} \tag{4.1}$$

• for α, which is the variance of the distance, we used:

$$\alpha = \mathbb{E}\left[\frac{1}{d}\left(\mathrm{dist}_p(x_i, x_j)^{\max(1,p)} - d\mu\right)^{2}\right]_{(i,j),\,k} = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \frac{1}{d} \left(\sum_{k=1}^{d}\left|x_i^k - x_j^k\right|^{p} - d\mu\right)^{2} \tag{4.2}$$

• finally, for β, which represents the correlations between distances, we used:

$$\beta = \mathbb{E}\left[\frac{1}{d}\left(\mathrm{dist}_p(x_i, x_j)^{\max(1,p)} - d\mu\right)\left(\mathrm{dist}_p(x_i, x_l)^{\max(1,p)} - d\mu\right)\right]_{(i,j,l),\,k} = \frac{1}{\binom{N}{3}} \sum_{1\le i<j<l\le N} \frac{1}{d} \left(\sum_{k=1}^{d}\left|x_i^k - x_j^k\right|^{p} - d\mu\right)\left(\sum_{k=1}^{d}\left|x_i^k - x_l^k\right|^{p} - d\mu\right) \tag{4.3}$$

In these equations E[•]_{(i,j),k} indicates the expectation value computed over the pairs of points and over the dimensions, while the distance evaluated on single coordinates, as in eq. (4.1), denotes the distance taken along that coordinate. In order to compare them with what we saw in the previous chapters, we rescaled the datasets to the d-dimensional cube with unit edge. The extracted parameters can be found in Table 4.1. As we can see in table 4.1, the averages µ of the datasets are about half of the random case, while the other moments are much larger. As we saw in the previous section, using the moments related to the hypercube leads to a poor level of approximation.
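The following sketch illustrates how the estimators of eqs. (4.1)-(4.3) can be evaluated in practice; the Monte Carlo sub-sampling of pairs and triples is only a computational shortcut of ours, not necessarily the procedure used to produce Table 4.1.

```python
import numpy as np

def dataset_moments(X, p=1.0, n_pairs=20000, n_triples=20000, seed=0):
    """Monte Carlo estimate of mu, alpha and beta (eqs. 4.1-4.3) for a
    dataset X of shape (N, d), assumed rescaled to the unit hypercube."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    def s(i, j):               # sum_k |x_i^k - x_j^k|^p = dist_p^{max(1,p)}
        return np.sum(np.abs(X[i] - X[j]) ** p)

    pairs = rng.choice(N, size=(n_pairs, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    S = np.array([s(i, j) for i, j in pairs])
    mu = S.mean() / d
    alpha = np.mean((S - d * mu) ** 2) / d

    triples = rng.choice(N, size=(n_triples, 3))
    ok = ((triples[:, 0] != triples[:, 1]) & (triples[:, 0] != triples[:, 2])
          & (triples[:, 1] != triples[:, 2]))
    beta = np.mean([(s(i, j) - d * mu) * (s(i, l) - d * mu)
                    for i, j, l in triples[ok]]) / d
    return mu, alpha, beta

# hypothetical usage: mu, alpha, beta = dataset_moments(X, p=1.0)
```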


Dataset               d      µ       α       β        µ_d/µ_t   α_d/α_t   β_d/β_t
Random point (teo)    ind.   1/3     1/18    1/180    1         1         1
Random point (sim)    1024   0.333   0.056   0.0057   1.000     0.999     1.018
MNIST                 784    0.167   1.198   0.330    0.501     21.57     59.43
Fashion-MNIST         784    0.278   5.627   0.695    0.834     101.3     125.2
CIFAR-10              1024   0.136   1.593   0.444    0.409     28.67     79.96

Table 4.1: Through a simulation, we extracted the moments µ, α and β from the three datasets described above and compared them with the random case obtained under the same conditions (high-dimensional hypercube with unit edge). By comparing the random simulation with the datasets the differences are evident: although the averages µ of the datasets are of the same order as the value seen for the random case (although all smaller), we cannot say the same of the other moments. In fact, α is at least 20 times larger than the theoretical value, and β at least 60 times larger. Note: for these simulations we used p = 1.

Now we can compare the k-clique density of the datasets with our theoretical prediction, using however the parameters extracted from the datasets themselves. Let us start with the case k = 2, which is the simplest. In fact, in chapter 3 we calculated an analytical form for ρ_2:

$$\rho_2^{hard}(r) = \frac{1}{2}\left[1 + \mathrm{Erf}\!\left(\frac{r^{\max(1,p)} - d\mu}{\sqrt{2d\alpha}}\right)\right] \tag{4.4}$$

We can now compare this theoretical result, using the parameters extracted from the dataset, with a simulation made starting from the dataset itself. As we can see in figure 4.3, the behaviour of the simulated ρ_2^data is very similar to what is predicted by our theory, for all three datasets. Therefore, despite the very different values of µ_data and α_data with respect to µ_cube and α_cube, our theoretical result is, in first approximation, able to predict the trend of the 2-clique density when the parameters extracted from the datasets are used.
The next step is to repeat the same analysis in the case k > 2. Although we do not have access to an analytical form for ω_k with k > 2, we can proceed as in the previous chapter and numerically solve the integral (see section C.3).
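Evaluating eq. (4.4) with the extracted moments is immediate; a small sketch follows, where the numerical values of d, µ and α are taken from Table 4.1 (MNIST, p = 1) and the radius grid is arbitrary.

```python
import numpy as np
from scipy.special import erf

def rho2_hard(r, d, mu, alpha, p=1.0):
    """2-clique (link) density of eq. (4.4) for a hard RGG."""
    z = (np.asarray(r, dtype=float) ** max(1.0, p) - d * mu) / np.sqrt(2.0 * d * alpha)
    return 0.5 * (1.0 + erf(z))

# illustrative values from Table 4.1 (MNIST, p = 1); the radii are arbitrary
d, mu, alpha = 784, 0.167, 1.198
print(rho2_hard(np.linspace(50.0, 250.0, 5), d, mu, alpha, p=1.0))
```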



Figure 4.3: Using the moments µ_data and α_data extracted from the dataset, our prediction for the 2-clique density improves. The simulated 2-clique density ρ_2^data of the datasets (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) is compared with our theoretical result ρ_2^teo(µ_data, α_data). We can see how for all three datasets our theoretical result provides a good approximation. (Bottom right:) the same comparison is made for ω_2^data and ω_2^teo(µ_data, α_data).



Figure 4.4: In the case k > 2, using the parameters extracted from the dataset does not significantly improve the accuracy of our theoretical result. For all three datasets and for various k (k = 3 in blue, k = 4 in dark green and k = 5 in light green) we compared ω_k^data (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) with ω_k^teo(µ_cube, α_cube, β_cube) (solid line) and ω_k^teo(µ_data, α_data, β_data) (dashed line). Neither of our predictions is able to reproduce the real trend of the datasets.


As we can see in fig. 4.4, using the moments extracted from the datasets does not change the situation much. It is true that the prediction is better, but it is still not able to describe the real trend of ω_k^data. Looking for a reason for this behaviour, which is observed for k > 2 and not for k = 2, we can say that our result is probably able to provide a good approximation only when α and β are small. This is because a system in which these two quantities take particularly high values is necessarily strongly correlated. Consequently, in the proof of the CLT, stopping at the second order of the expansion of the exponential is not sufficient, because the terms of order 3 are not negligible.

Conclusion.

In this thesis we developed a central limit theorem for distances and applied it to random geometric graphs in high dimension. This allowed us to develop theoretical techniques capable of making predictions for the observables used to characterize geometric graphs. Subsequently these techniques were specialized to Hard RGGs and Soft RGGs, showing their different limit behaviours. The reliability of both the theorem and the developed techniques has been confirmed by extensive simulations. Finally, these techniques were applied to real-world datasets and their limits were quantified. Furthermore, this thesis work also shows how RGGs are intrinsically different from the geometric graphs generated by Real-World datasets: we have observed that the internal order of the latter does not make the distances concentrate even at high dimension. This thesis nevertheless leaves open some questions to be developed in the future:

• as we suggested at the end of Chapter 4, the fact that our technique is not able to provide a good approximation for the k-clique density in the case of datasets can be due to a premature truncation in the central limit theorem. It would be interesting to find the next-order correction and verify whether our approximations improve;

• in this work we have focused on the density of k-cliques, but we could extend our results also to other observables of the graphs;

• it would be interesting to extend the analyses made in Chapter 4 to other datasets (above all, these results can be useful for datasets


present in sociology, biology and physics);

• we could apply these results to some recently developed machine learning techniques, which exploit the concept of distance and the search for nearest neighbors.

Appendix A

Relationships between α, β and γ and eigenvalues

We want to extract a relationship linking the three parameters of the covariance matrix. To do this we will look for the conditions under which the eigenvalues are non-negative, knowing that this must always be true for a covariance matrix, as stated in [17]. Keep in mind that (α, β) > 0 and γ ≥ 0.

• $\vartheta_1 = \alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma$. This eigenvalue is always positive because N ≥ 2.

• $\vartheta_2 = \alpha + (N-4)\beta - (N-3)\gamma$. This eigenvalue exists only if N ≥ 3. If N = 3, we obtain β ≤ α. In the other cases, we consider the generic form and build a system of inequalities with the third eigenvalue.

• $\vartheta_3 = \alpha - 2\beta + \gamma$. This one exists only if N ≥ 4. The system between ϑ_2 and ϑ_3 is:

$$\begin{cases} \alpha + (N-4)\beta - (N-3)\gamma \ge 0 \\ \alpha - 2\beta + \gamma \ge 0 \end{cases}$$


We add N times the second inequality to the first one.

$$\begin{cases} (1+N)\alpha - (N+4)\beta + 3\gamma \ge 0 \\ \alpha - 2\beta + \gamma \ge 0 \end{cases}$$

We then solve both inequalities for β.

$$\begin{cases} \beta \le \dfrac{3\gamma + (1+N)\alpha}{N+4} \\[2mm] \beta \le \dfrac{\gamma + \alpha}{2} \end{cases}$$

The next step is to find when the first condition is less restrictive than the second one:

$$\frac{\gamma + \alpha}{2} \le \frac{3\gamma + (1+N)\alpha}{N+4} \tag{A.1}$$

$$(N+4)(\alpha + \gamma) \le 6\gamma + 2(1+N)\alpha \tag{A.2}$$
$$(N-2)\gamma \le (N-2)\alpha \tag{A.3}$$

We obtain that, if α ≥ γ, the only necessary condition for the positivity of the eigenvalues is:

$$\beta \le \frac{\gamma + \alpha}{2} \tag{A.4}$$

This condition also holds for N = 3, because it is more restrictive than the one found before. We note that there is no need to consider the case α ≤ γ (i.e. when the first condition is more restrictive than the second one in eq. A.1), because in all the real cases γ = 0 and we have imposed α > 0.

Knowing that the covariance matrix is positive semi-definite, we can therefore say that for all real systems (in which γ is zero) it holds that α ≥ 2β.
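The three eigenvalues used above can be checked numerically by building the pair covariance matrix with the structure described in section 2.3 (α on the diagonal, β for pairs of pairs sharing a node, γ for disjoint pairs); the parameter values in the sketch below are arbitrary.

```python
import numpy as np
from itertools import combinations

def pair_covariance(N, alpha, beta, gamma):
    """Covariance over the N(N-1)/2 pair variables: alpha on the diagonal,
    beta for pairs sharing one node, gamma for disjoint pairs."""
    pairs = list(combinations(range(N), 2))
    M = len(pairs)
    S = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            shared = len({i, j} & {k, l})
            S[a, b] = alpha if shared == 2 else beta if shared == 1 else gamma
    return S

# numerical check of the three eigenvalues quoted above (arbitrary values)
N, alpha, beta, gamma = 6, 1.0, 0.3, 0.1
eig = np.linalg.eigvalsh(pair_covariance(N, alpha, beta, gamma))
theta1 = alpha + 2 * (N - 2) * beta + (N - 2) * (N - 3) / 2 * gamma
theta2 = alpha + (N - 4) * beta - (N - 3) * gamma
theta3 = alpha - 2 * beta + gamma
print(sorted(set(np.round(eig, 10))), (theta3, theta2, theta1))
```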

Appendix B

Hubbard-Stratonovich transformation

Note that if one wants to use our distribution of distances to compute expectation values, one has to perform (N−1)(N−2)/2 integrals, which can be heavy even for a computer. In this section we look for a transformation that reduces this number, by identifying the true degrees of freedom of the system, which are not explicit at the moment because we have not fully exploited the symmetries. To do this we must first remove all the symmetries that we have included in Σ and rewrite the exponential as:

$$\tau \sum_{i,j}^{N} q_{i,j}^{2} + \eta \sum_{i}^{N}\Big(\sum_{j}^{N} q_{i,j}\Big)^{2} + \tilde{\gamma}\Big(\sum_{i,j}^{N} q_{i,j}\Big)^{2} \tag{B.1}$$

where $\tau = \frac{\bar{\alpha} - 2\bar{\beta} + \bar{\gamma}}{2}$, $\eta = \bar{\beta} - \bar{\gamma}$, $\tilde{\gamma} = \bar{\gamma}/2$, and θ is an overall scale factor depending on d and N. The overlined parameters are the inverted parameters, given by the system of equations in the previous section. Now we can write the formula for the expectation value in a smarter way, adding Dirac deltas to restore the symmetries that we have just removed from the exponential:


$$
\begin{aligned}
\Big\langle \prod_{i<j} f_{i,j} \Big\rangle
={}& \mathcal{N} \int \Big(\prod_{i,j} dq_{i,j}\, f_{i,j}\Big) \prod_{i}\delta(q_{i,i}) \prod_{i>j}\delta(q_{i,j}-q_{j,i})\,
e^{-\frac{1}{2\theta}\big(\tau \sum_{i,j}^{N} q_{i,j}^{2} + \eta \sum_{i}^{N}(\sum_{j}^{N} q_{i,j})^{2} + \tilde{\gamma}(\sum_{i,j}^{N} q_{i,j})^{2}\big)} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i \int dz \int \Big(\prod_{i,j} dq_{i,j}\, f_{i,j}\Big) \prod_{i}\delta(q_{i,i}) \prod_{i>j}\delta(q_{i,j}-q_{j,i}) \\
&\quad\times e^{-\frac{\tau}{2\theta}\sum_{i,j}^{N} q_{i,j}^{2}}\;
e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2} - i\sum_{i}^{N} y_i \sum_{j}^{N} q_{i,j}}\;
e^{-\frac{\theta}{2\tilde{\gamma}} z^{2} - i z \sum_{i,j}^{N} q_{i,j}} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i\, e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2}} \int dz\, e^{-\frac{\theta}{2\tilde{\gamma}} z^{2}}
\int \prod_{i<j} dq_{i,j}\, f_{i,j}\;
e^{-\frac{\tau}{\theta} q_{i,j}^{2} - \big(\sqrt{-\operatorname{sign}[\eta]}\, y_i + \sqrt{-\operatorname{sign}[\eta]}\, y_j + \sqrt{-\operatorname{sign}[\tilde{\gamma}]}\, z\big) q_{i,j}} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i\, e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2}} \int dz\, e^{-\frac{\theta}{2\tilde{\gamma}} z^{2}}
\prod_{i<j}\bigg(\int dq\, f_{i,j}\;
e^{-\frac{\tau}{\theta} q^{2} - \big(\sqrt{-\operatorname{sign}[\eta]}\, y_i + \sqrt{-\operatorname{sign}[\eta]}\, y_j + \sqrt{-\operatorname{sign}[\tilde{\gamma}]}\, z\big) q}\bigg)
\end{aligned}
\tag{B.2}
$$


where $\mathcal{N} = \left[(2\pi)^{\binom{N}{2}} \det(\Sigma)\right]^{-\frac{1}{2}}$, $\mathcal{N}_y = \left[2\pi\,\frac{\eta}{\theta}\right]^{-\frac{N}{2}}$ and $\mathcal{N}_z = \left[2\pi\,\frac{\tilde{\gamma}}{\theta}\right]^{-\frac{1}{2}}$. Here we have used the Hubbard-Stratonovich transformation:

$$e^{-\operatorname{sign}(a)\frac{|a|\,x^{2}}{2}} = \frac{1}{\sqrt{2\pi |a|}} \int dy\, e^{-\frac{y^{2}}{2|a|} - \sqrt{-\operatorname{sign}[a]}\, x y} \tag{B.3}$$
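Equation (B.3) can be verified numerically for both signs of a; in the following sketch the values of a and x are arbitrary, and for a > 0 the real and imaginary parts of the oscillatory integrand are integrated separately.

```python
import numpy as np
from scipy.integrate import quad

def hs_lhs(x, a):
    return np.exp(-np.sign(a) * abs(a) * x ** 2 / 2)

def hs_rhs(x, a):
    """Right-hand side of eq. (B.3), integrated numerically."""
    coupling = np.sqrt(complex(-np.sign(a)))          # i for a > 0, 1 for a < 0
    f = lambda y: np.exp(-y ** 2 / (2 * abs(a)) - coupling * x * y)
    re, _ = quad(lambda y: f(y).real, -np.inf, np.inf)
    im, _ = quad(lambda y: f(y).imag, -np.inf, np.inf)
    return (re + 1j * im) / np.sqrt(2 * np.pi * abs(a))

for a in (-0.7, 0.9):                                  # arbitrary test values
    print(a, hs_lhs(1.3, a), hs_rhs(1.3, a))
```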

Using this transformation we have reduced the total number of integrals from (N−1)(N−2)/2 to N + 1. This shows that the actual number of degrees of freedom is lower than we would have expected. Unfortunately, the result found in this section is not very useful from a computational point of view. In fact, while (imposing γ = 0):

$$\eta = \bar{\beta} - \bar{\gamma} = \frac{\beta}{(2\beta - \alpha)\,(\alpha + \beta(N-4))} \tag{B.4}$$

is always negative (we found in the previous section that α > 2β), we have that:

$$\tilde{\gamma} = \bar{\gamma}/2 = \frac{4\beta^{2}}{(\alpha - 2\beta)\,(\alpha + \beta(N-4))\,(\alpha + 2\beta(N-2))} \tag{B.5}$$

is always positive. This can be problematic from the computational point of view, because it brings out imaginary units in the first integral.

Appendix C

Numerical methods

C.1 Multivariate Gaussian integration

To numerically evaluate the integrals of Equation 3.2, we implemented the algorithm described in [25], allowing very fast run times for the small values of k (k ≲ 10) we were interested in; notice that the dimension of the integral is already of order 10^2 for k = 10. Higher values of k would require finer techniques.
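As an illustration, for a hard activation the Gaussian integral defining the k-clique density (eq. 3.12) is a multivariate normal CDF with the structured covariance of section 2.3, which can be evaluated with Genz-type routines; the sketch below uses scipy's CDF implementation and assumes the boolean activation is the step function Θ(x − q). The parameter values are illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def rho_k_hard_rescaled(x, k, alpha, beta, gamma=0.0):
    """Rescaled hard-RGG k-clique density: the probability that all the
    k(k-1)/2 Gaussian pair variables lie below x, i.e. a multivariate
    normal CDF with the structured covariance (alpha, beta, gamma)."""
    pairs = list(combinations(range(k), 2))
    M = len(pairs)
    cov = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (u, v) in enumerate(pairs):
            shared = len({i, j} & {u, v})
            cov[a, b] = alpha if shared == 2 else beta if shared == 1 else gamma
    mvn = multivariate_normal(mean=np.zeros(M), cov=cov)
    return mvn.cdf(np.full(M, x))

# illustrative values: uniform points in the unit hypercube with p = 1
print(rho_k_hard_rescaled(x=0.1, k=4, alpha=1/18, beta=1/180))
```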

C.2 Simulations for Hard RGGs

To compute the density of k-cliques in simulated hard RGGs, we implemented a simple random sampling procedure. For each realization of the nodes (∼ 10^4), we extracted ∼ 10^5 k-tuples of nodes, computing the minimum cut-off distance at which they formed a clique. The cumulative distribution of the minimal distances obtained, averaged over different realizations of the nodes, reconstructs ρ_k^hard(r). We noticed that as N grows, the last average is well approximated by a single realization of the nodes, suggesting a self-averaging property for the density of k-cliques; in practice, not averaging does not affect the results of the simulations.
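A single-realization version of this procedure might look as follows; array sizes, the uniform point distribution and the function name are our own choices for illustration.

```python
import numpy as np

def hard_clique_density_curve(N, d, k, r_grid, n_tuples=10_000, p=1.0, seed=0):
    """Empirical rho_k(r) for one realization of a hard RGG on uniform points:
    for each sampled k-tuple we record the smallest cutoff at which it becomes
    a clique (its largest pairwise distance), and return the empirical CDF of
    these values evaluated on r_grid."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, d))                       # uniform points in the hypercube
    iu = np.triu_indices(k, 1)
    min_cutoff = np.empty(n_tuples)
    for t in range(n_tuples):
        P = X[rng.choice(N, size=k, replace=False)]
        s = (np.abs(P[:, None, :] - P[None, :, :]) ** p).sum(axis=-1)
        dist = s ** (1.0 / p) if p >= 1 else s
        min_cutoff[t] = dist[iu].max()           # largest pairwise distance
    return np.array([(min_cutoff <= r).mean() for r in r_grid])

# rho5 = hard_clique_density_curve(N=1000, d=100, k=5,
#                                  r_grid=np.linspace(0.0, 40.0, 200))
```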


C.3 Simulations for Soft RGGs

To compute the density of cliques in simulated soft RGGs with a generic activation function, we implemented a different random sampling procedure. This time, for each realization of the nodes (∼ 10^4) and for a fixed radius r, we counted how many of ∼ 10^4 k-tuples of nodes form a k-clique, considering each of them to be a k-clique with probability

$$\prod_{1\le i<j\le k} h_r\!\left(\mathrm{dist}(x_i, x_j)\right). \tag{C.1}$$
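A minimal sketch of this estimator is given below, assuming the Rayleigh fading form h_r(x) = exp(−ξ (x/r)^η) used in Chapter 3; the default parameter values and the function name are ours.

```python
import numpy as np

def soft_clique_density(X, k, r, n_tuples=10_000, eta=2.0, xi=0.5, p=2.0, seed=0):
    """Monte Carlo estimate of rho_k(r) for a soft RGG on the points X,
    using the Rayleigh fading activation h_r(x) = exp(-xi * (x / r)**eta)
    and the clique probability of eq. (C.1)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    iu = np.triu_indices(k, 1)
    acc = 0.0
    for _ in range(n_tuples):
        P = X[rng.choice(N, size=k, replace=False)]
        s = (np.abs(P[:, None, :] - P[None, :, :]) ** p).sum(axis=-1)
        dist = s ** (1.0 / p) if p >= 1 else s
        acc += np.prod(np.exp(-xi * (dist[iu] / r) ** eta))  # prob. of a clique
    return acc / n_tuples

# rho3 = soft_clique_density(points, k=3, r=5.0)   # points: (N, d) array, not defined here
```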

Bibliography

[1] Yehya Abouelnaga, Ola S Ali, Hager Rady, and Mohamed Moustafa. Cifar-10: Knn-based ensemble of classifiers. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1192–1195. IEEE, 2016.

[2] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.

[3] Richard D Alba. A graph-theoretic definition of a sociometric clique. Journal of Mathematical Sociology, 3(1):113–126, 1973.

[4] S. A. Aldosari and J. M. F. Moura. Distributed detection in sensor networks: Connectivity graph and small world networks. In Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, pages 230–234, Oct 2005.

[5] Konstantin Avrachenkov and Andrei Bobu. Cliques in high-dimensional random geometric graphs. In COMPLEX NETWORKS 2019 - 8th International Conference on Complex Networks and Their Applications, Lisbon, Portugal, December 2019.

[6] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.


[7] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In International Conference on Database Theory, pages 217–235. Springer, 1999.

[8] Béla Bollobás and Paul Erdős. Cliques in random graphs. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 80, pages 419–427. Cambridge University Press, 1976.

[9] Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z. Rácz. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 49(3):503–532, 2016.

[10] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[11] Jesper Dall and Michael Christensen. Random geometric graphs. Physical Review E, 66(1):016121, 2002.

[12] William HE Day and David Sankoff. Computational complexity of inferring phylogenies by compatibility. Systematic Biology, 35(2):224–229, 1986.

[13] Imre Derényi, Gergely Palla, and Tamás Vicsek. Clique percolation in random networks. Physical Review Letters, 94(16):160202, 2005.

[14] J Devillers and JC Doré. Heuristic potency of the minimum spanning tree (MST) method in toxicology. Ecotoxicology and Environmental Safety, 17(2):227–235, 1989.

[15] Luc Devroye, András György, Gábor Lugosi, and Frederic Udina. High-dimensional random geometric graphs and their clique number. Electron. J. Probab., 16:2481–2508, 2011.

[16] Josep Díaz, Dieter Mitsche, and Xavier Pérez. Sharp threshold for hamiltonicity of random geometric graphs. SIAM Journal on Discrete Mathematics, 21(1):57–65, 2007.


[17] Chuong B Do. The multivariate gaussian distribution. Section Notes, Lecture on Machine Learning, CS 229, 2008.

[18] Jingbo Dong, Qing Chen, and Zhisheng Niu. Random graph theory based connectivity analysis in wireless sensor networks with rayleigh fading channels. In 2007 Asia-Pacific Conference on Communications, pages 123–126. IEEE, 2007.

[19] Patrick Doreian and Katherine L Woodard. Defining and locating cores and boundaries of social networks. Social Networks, 16(4):267–293, 1994.

[20] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

[21] Yuval Filmus. An orthogonal basis for functions over a slice of the boolean hypercube. The Electronic Journal of Combinatorics, 23(1):P1–23, 2016.

[22] Damien Francois, Vincent Wertz, and Michel Verleysen. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19(7):873–886, 2007.

[23] Alan M Frieze. On the value of a random minimum spanning tree problem. Discrete Applied Mathematics, 10(1):47–56, 1985.

[24] Michael R Garey and David S Johnson. Computers and Intractability, volume 174. Freeman, San Francisco, 1979.

[25] Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2):141–149, 1992.

[26] Edgar N Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.

[27] Edward N Gilbert. Random plane networks. Journal of the Society for Industrial and Applied Mathematics, 9(4):533–543, 1961.


[28] Alexander P Giles, Orestis Georgiou, and Carl P Dettmann. Betweenness in dense random geometric networks. In 2015 IEEE International Conference on Communications (ICC), pages 6450–6455. IEEE, 2015.

[29] Ronald L Graham and Pavol Hell. On the history of the minimum spanning tree problem. Annals of the History of Computing, 7(1):43–57, 1985.

[30] Piyush Gupta and P. R. Kumar. Critical Power for Asymptotic Connectivity in Wireless Networks, pages 547–566. Birkhäuser Boston, Boston, MA, 1999.

[31] Chikio Hayashi, Keiji Yajima, Hans H Bock, Noboru Ohsumi, Yutaka Tanaka, and Yasumasa Baba. Data Science, Classification, and Related Methods: Proceedings of the Fifth Conference of the International Federation of Classification Societies (IFCS-96), Kobe, Japan, March 27–30, 1996. Springer Science & Business Media, 2013.

[32] Robert Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. 1994.

[33] Alexander P Kartun-Giles, Marc Barthelemy, and Carl P Dettmann. Shape of shortest paths in random spatial networks. Physical Review E, 100(3):032315, 2019.

[34] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010.

[35] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[36] Ernst Kussul and Tatiana Baidyk. Improved method of handwritten digit recognition tested on mnist database. Image and Vision Computing, 22(12):971–981, 2004.

[37] R. Duncan Luce and Albert D. Perry. A method of matrix analysis of group structure. Psychometrika, 14:95–116, 1949.


[38] Henrik Ohlsson, Oscar Gustafsson, and Lars Wanhammar. Implementation of low complexity fir filters using a minimum spanning tree. In Proceedings of the 12th IEEE Mediterranean Electrotechnical Conference (IEEE Cat. No. 04CH37521), volume 1, pages 261–264. IEEE, 2004.

[39] Gergely Palla, Imre Derényi, and Tamás Vicsek. The critical point of k-clique percolation in the erdős–rényi graph. Journal of Statistical Physics, 128(1-2):219–227, 2007.

[40] Marvin C Paull and Stephen H Unger. Minimizing the number of states in incompletely specified sequential switching functions. IRE Transactions on Electronic Computers, (3):356–367, 1959.

[41] Marvin C Paull and Stephen H Unger. Minimizing the number of states in incompletely specified sequential switching functions. IRE Transactions on Electronic Computers, (3):356–367, 1959.

[42] Edmund R Peay. Hierarchical clique structures. Sociometry, pages 54–65, 1974.

[43] Mathew Penrose et al. Random geometric graphs, volume 5. Oxford university press, 2003.

[44] Vladimir Pestov. Is the k-nn classifier in high dimensions affected by the curse of dimensionality? Computers & Mathematics with Applications, 65(10):1427–1437, 2013.

[45] Zvi Prihar. Topological properties of telecommunication networks. Pro- ceedings of the IRE, 44(7):927–933, 1956.

[46] Nicholas Rhodes, Peter Willett, Alain Calvet, James B Dunbar, and Christine Humblet. Clip: similarity searching of 3d databases using clique detection. Journal of Chemical Information and Computer Sciences, 43(2):443–448, 2003.


[47] Ram Samudrala and John Moult. A graph-theoretic algorithm for comparative modeling of protein structure. Journal of Molecular Biology, 279(1):287–302, 1998.

[48] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on mnist. arXiv preprint arXiv:1805.09190, 2018.

[49] Yian Seo and Kyung-shik Shin. Hierarchical convolutional neural networks for fashion image classification. Expert Systems with Applications, 116:328–339, 2019.

[50] P Sneath. A computer approach to numerical taxonomy. J Gen Microbiol, 17:201–226, 1957.

[51] Victor Spirin and Leonid A Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the national Academy of sciences, 100(21):12123–12128, 2003.

[52] G. Sugihara. Graph theory, homology and food webs. Proc. Symp. App. Math., 30:83–101, 1984.

[53] Minsoo Suk and Ohyoung Song. Curvilinear feature extraction using minimum spanning trees. Computer Vision, Graphics, and Image Processing, 26(3):400–411, 1984.

[54] Siham Tabik, Daniel Peralta, Andrés Herrera-Poyatos, and Francisco Herrera. A snapshot of image pre-processing for convolutional neural networks: case study of mnist. International Journal of Computational Intelligence Systems, 10(1):555–568, 2017.

[55] Amos Tanay, Roded Sharan, and Ron Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(suppl_1):S136–S144, 2002.

[56] Ivan Tyukin. Blessing of dimensionality: Mathematical foundations of the statistical physics of data. Philosophical Transactions of The


Royal Society A Mathematical Physical and Engineering Sciences, 376, 03 2018.

[57] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1 edition, 1998.

[58] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[59] Ying Xu, Victor Olman, and Dong Xu. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18(4):536–545, 2002.
