Università degli Studi di Milano

FACULTY OF MATHEMATICAL, PHYSICAL AND NATURAL SCIENCES
Master's Degree Course in Physics

Random Geometric Graph in High Dimension

Candidate: Sebastiano Ariosto (student ID 901936)
Supervisor: Marco Gherardi

Co-supervisors: Vittorio Erba, Pietro Rotondo

Academic Year 2018-2019

Contents

1 Random geometric graph
  1.1 Introduction to random geometric graphs
  1.2 Observables in graph theory
  1.3 RGGs in data science and high dimensionality

2 Central limit theorem for distances
  2.1 Central limit theorems
  2.2 Multivariate central limit theorem for distances
  2.3 The covariance matrix Σ
    2.3.1 The terms of Σ
    2.3.2 Characterizing the covariance matrix
  2.4 Reliability of the developed CLT by comparison with simulations

3 Random geometric graph in high dimension
  3.1 Clique density in high dimensionality
  3.2 Clique density in high dimensional Soft RGGs: an Erdős-Rényi behaviour
  3.3 Clique density in high dimensional Hard RGGs: a non Erdős-Rényi behaviour

4 A RGGs approach to a Real-World dataset
  4.1 Real-World dataset
    4.1.1 Real-World vs random dataset
    4.1.2 Moments of Real-World dataset

A Relationships between α, β and γ and eigenvalues

B Hubbard-Stratonovich transformation

C Numerical methods
  C.1 Multivariate Gaussian integration
  C.2 Simulations for Hard RGGs
  C.3 Simulations for Soft RGGs

Motivation

In today's world, one of the most difficult issues to deal with is extracting information from the massive amount of data that every field of scientific research, and in general every human activity, produces and collects every day. To appreciate the size of the problem, consider the LHC experiment at CERN: every second it can generate 40 terabytes of information, the same amount of memory occupied by one million books. Likewise, in bioinformatics, a single microarray that measures gene expression can contain up to ten thousand features. It is clear from these examples that the key issue today is how to infer, extract and visualize meaningful information in an efficient and effective way from this overabundant quantity of data. Thanks to this enormous production and storage of data, graph theory is living a renaissance in the information age. This is because a graph is the simplest, but no less effective, way to model an ensemble of data. In graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects are called vertices (or nodes) and each link between vertices is called an edge. However, in the vast majority of cases a dataset does not include the relationships between its objects, relationships that are necessary in many data science techniques. One of the most used methods to create relationships (and therefore edges) in a dataset that does not show them is to immerse the graph that represents the dataset in a metric space and use the distance between two elements of the dataset as the measure of the relationship between them: the closer two points are, the greater their probability

of being linked. In the literature, a graph immersed in a metric space is called a geometric graph. However, this method can lead to situations that are very difficult to deal with: in the vast majority of cases, the natural ambient space of a dataset turns out to be a high dimensional space, and studying a high dimensional system implies having to face a series of counterintuitive phenomena known under the evocative name of Curse of Dimensionality. Among these, one of the most troublesome phenomena for data science is certainly the concentration of distances: in the limit of high dimensionality the distances tend to concentrate around their average value. This is very problematic, because many data science techniques involve the search for nearest neighbours and therefore use distance as a measure of similarity. For example, as we said before, geometric graphs use the concept of distance to link similar points to each other, but if all the points are roughly equidistant, the graph that is generated will not be representative of the true relationships between the points. Despite this, the concentration of distances is not necessarily bad. In Physics, concentration phenomena are used to make predictions on the behaviour of variables related to the one that concentrates. In this Thesis we will see how the concentration of distances can be demonstrated, quantified and used to make predictions on the behaviour of random geometric graphs, which are generated starting from randomly drawn points and are often used as benchmarks for data science techniques. The Thesis has the following structure:

• Chapter 1 formally introduces random geometric graphs, their use in data science, and the most commonly used observables to characterize them;

• Chapter 2 presents the development of a central limit theorem for distances, focusing on its key features and demonstrating the reliability of this result through simulations;

• Chapter 3 develops a theoretical technique capable of making predictions on the observables of random geometric graphs in both the Soft and Hard cases; moreover, this technique is compared with simulated data;

• Chapter 4 finally applies the theoretical techniques developed to real- world datasets, highlighting their strengths and limitations.

Chapter 1

Random geometric graph

1.1 Introduction to random geometric graphs

We want to start by introducing what will be the main ingredient of this work: in Mathematics a graph is a structure amounting to a set of nodes (also called vertices or points) in which some pairs of them are connected by edges (also called links or lines). Graphs can be very different: directed or undirected, simple or multiple, with or without loops. But of all the possible distinctions, the most important is the one that separates purely deterministic graphs from those that include random elements. The latter, known by the name of random graphs, will be one of the main subjects of this work. These objects were introduced in two forms in 1959 by Gilbert, Erdős and Rényi in their works on the so-called Erdős-Rényi random graph model [20, 26]. In the version G(N,p) of this model, an Erdős-Rényi random graph (ERG) is a random graph with N vertices where each possible edge has probability p of existing. These graphs, given their simplicity, are widely studied in the literature and act as a basic model for the study of the characteristics of random graphs. Furthermore, as shown by [15] and as we will see in this work, they are limit cases of much more complex systems. Nevertheless, Erdős-Rényi random graphs do not give us access to a geometric measure of the characteristics of each point and consequently they do

not allow us to quantify the differences between two nodes. To overcome this problem, in 1961 Gilbert first introduced [27] the concept of metric space in random graph theory, thus giving rise to random geometric graphs. So, by definition, a random geometric graph (RGG) is the mathematically simplest spatial network, namely an undirected graph constructed by randomly placing N nodes in some metric space (according to a specified probability distribution) and connecting two nodes via a link based on a probability distribution that depends on their mutual distance.

With the aim of better describing these graphs, we want to introduce one of the simplest cases: the creation of an RGG starting from points uniformly distributed in a hypercube. Let $x_1, \dots, x_N$ be independent, uniformly random points from $[0,1]^d$. Now, having fixed a maximum distance r, we connect all the points which are less distant from each other than r, i.e. such that $d(i,j) < r$. The structure created in this way connects only the "similar" points together and leaves the "different" points separate (where the concept of diversity is measured by distance).

Now, we want to introduce a more general setting for the ingredients of the RGGs, starting with the nodes. In RGGs, the nodes $\{\vec{x}_i\}_{i=1}^{N}$ are N points drawn from a probability distribution $\nu$ over $\mathbb{R}^d$. Note that if we consider a probability distribution that is factorized and equally distributed over the coordinates (i.e. $\nu(\vec{x}) = \prod_{k=1}^{d} \tau(x^k)$, where $\tau$ is a probability distribution on $\mathbb{R}$), the coordinates of all nodes $\{x_i^k\}$, with $1 \le i \le N$ and $1 \le k \le d$, are i.i.d. random variables with law $\tau$.

Now, for every pair of nodes x, y we compute the distance d(x, y) and we link the points with probability h(d(x, y)), where $h : \mathbb{R}^+ \to [0,1]$ is the so-called activation function of the RGG. The activation function tells us the probability of linking two nodes based on their distance, and it will typically be a monotone decreasing function (with $h(0) = 1$ and $h(+\infty) = 0$), with the idea that closer nodes will be linked with higher probability than farther ones.

In the literature, instead of a single activation function, a family of parametric activation functions $h_r$ is usually used. The parameter r describes the typical distance that is associated with a definite and non-trivial linking probability, for example $h_r(r) = \frac{1}{2}$. This is useful because it allows us to

study the properties of RGGs as a function of the parameter r. Based on the choice of the activation function, RGGs are divided into two different types:

• In the case of a hard activation function, defined as:

$$h_r^{\mathrm{hard}}(x) = \theta(r - x) \tag{1.1}$$

we have the so-called Hard Random Geometric Graph. In this case we link a pair of points only if the distance between them is lower than the parameter r.

• Instead, if the activation function is not boolean, we have the Soft Random Geometric Graph. In this case the linking of a pair of points is not deterministic. An example of soft activation functions widely used in the literature [18, 33, 28] is the so-called Rayleigh fading activation function, i.e.:

$$h_r^{\mathrm{Rayleigh}}(x) = \exp\left[-\xi\left(\frac{x}{r}\right)^{\eta}\right] \tag{1.2}$$

where $\xi = \log(2)$ guarantees that $h_r(r) = \frac{1}{2}$.

Note that in this work, when the adjectives "hard" or "soft" are omitted, we refer generically to both. In order to better understand the behaviour of RGGs, given their random nature, it is more significant to study the statistical properties of some observable than to study a single realization. In the next section we will present some of the most used observables in random graph theory, and we will explain the choice of one of them for the statistical analysis made in this work.
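To make the two constructions concrete, here is a minimal Python sketch (not part of the original text) that builds the adjacency matrices of a hard and of a soft Rayleigh-fading RGG from the same uniformly drawn points; N, d, r and η below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(x, p=2):
    """p-norm distances between all pairs of rows of x."""
    diff = np.abs(x[:, None, :] - x[None, :, :]) ** p
    return diff.sum(axis=-1) ** (1.0 / p)

def hard_rgg(x, r, p=2):
    """Hard RGG: two nodes are linked iff their distance is below the cutoff r."""
    dist = pairwise_distances(x, p)
    A = (dist < r).astype(int)
    np.fill_diagonal(A, 0)          # no self-loops
    return A

def soft_rgg(x, r, p=2, eta=2.0):
    """Soft RGG with Rayleigh fading activation h_r(x) = exp(-log(2) * (x/r)**eta)."""
    dist = pairwise_distances(x, p)
    prob = np.exp(-np.log(2) * (dist / r) ** eta)               # h_r(r) = 1/2 by construction
    A = np.triu((rng.random(dist.shape) < prob).astype(int), 1)  # one coin flip per pair
    return A + A.T

# Example: N = 50 points drawn uniformly in the unit square, cutoff r = 0.2
x = rng.random((50, 2))
print("hard edges:", hard_rgg(x, r=0.2).sum() // 2)
print("soft edges:", soft_rgg(x, r=0.2).sum() // 2)
```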

1.2 Observables in graph theory

In the literature there are many statistics-based observables that are useful to understand the behaviour of graphs. Usually these observables are studied


[Figure 1.1: two panels plotting $h_r(x)$ as a function of x — the hard activation $\theta(r - x)$ (left, Hard RGG) and a soft activation of the form $\exp[-(x/r)^2]$ (right, Soft RGG) — above the corresponding graph representations.]

Figure 1.1: Example of a hard and a soft random geometric graph. Small circles denote nodes drawn randomly with flat probability distribution on $[0,1]^2$, and the larger shaded circles identify the region of radius r/2 around each node. Dashed lines denote all possible edges between nodes, and solid lines highlight the actual edges of the represented graphs. Above the graph representations, the activation functions used to build them are plotted. Left: In a hard random geometric graph the only selected edges are those between nodes no more distant than r (in the picture, the nodes whose shaded regions intersect). Right: In a soft random geometric graph, edges are selected based on a continuous activation function $h_r(x)$. If two nodes are at distance d from each other, then the edge that connects them will be chosen with probability $h_r(d)$. In the picture, yellow edges are those that have been chosen by the soft random geometric graph even though the distance between the nodes is greater than r. Vice versa, blue dashed edges are those at distance less than r that were not chosen by the soft random geometric graph.

as a function of the parameter r of the activation function. In this section we will present some of them, focusing on their utility in real-world problems and outlining the most important results for the most studied random graph, i.e. the Erdős-Rényi random graph:

• Probability of isolated vertex and connectivity threshold: by definition a graph is said to be connected if every pair of points in the graph is connected, i.e. for every two vertices u and v the graph contains a path from u to v. If a graph has at least one isolated vertex it is surely not connected, so these two observables are strictly related. The statistical behaviour of both observables in the limit of the number of points N tending to infinity is studied case by case, but the most general treatment for random geometric graphs is given by [43]. The result for the Erdős-Rényi graph is instead given by [20], where the authors find the threshold for connectivity:

If $p < \frac{(1-\epsilon)\log N}{N}$, then a graph will almost surely contain isolated vertices, and thus be disconnected.

If $p > \frac{(1+\epsilon)\log N}{N}$, then a graph will almost surely be connected.

where $\epsilon \to 0^+$. The study of this property is very important for wireless sensor networks: in this field it is crucial to find the power a node needs to transmit in order to ensure that the network is connected with high probability, even if the number of nodes in the network goes to infinity [30, 4, 18].

• Minimal spanning tree: in general, the minimal spanning tree T of a connected graph G is a subgraph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. In the case of geometric graphs the distance between points is usually used as weight, and it is useful to find the value of the minimum spanning tree of a graph, but, as shown by [23], this is not a trivial problem. The minimal spanning tree has direct applications in the design of networks, including computer networks, telecommunications networks, transportation networks, water supply networks, and electrical grids, as shown in [29]. Furthermore, this observable is also used in taxonomy [50], gene expression analysis [59], computer vision [53], circuit design [38] and ecotoxicology [14].

• Hamiltonicity: a Hamiltonian path is a path in a connected graph that visits each vertex exactly once (see fig. 1.2). A Hamiltonian cycle is a Hamiltonian path that is a cycle (i.e. a closed path). A graph is said to be Hamiltonian if it contains a Hamiltonian cycle. Finding out whether a graph is Hamiltonian is an NP-complete problem [24]. In the case of RGGs the problem is simpler: using statistical methods it is possible to show, as done by [16], that there exists a threshold value of the parameter r above which the graph is asymptotically almost surely Hamiltonian.

• Clustering coefficient: it is a measure of the degree to which nodes in a graph tend to cluster together. Two versions of this measure exist: the global one, which was designed to give an overall indication of the clustering in the network, and the local one, which gives an indication of the embeddedness of single nodes (i.e. the density of connections among the neighbours of a node). In the case of RGGs it has been shown that, in the limit of N tending to infinity, the average clustering coefficient is only a function of dimensionality [11].

• Clique number and k-clique density: in graph theory, a clique is a subset of vertices of a graph such that every two distinct vertices in the clique are adjacent (i.e. are linked); that is, the induced subgraph is complete. The maximum clique of a graph is the clique that has the largest number of vertices, and the clique number of a graph is the number of vertices in the maximum clique. Note that finding the maximum clique and the clique number are NP-complete problems. For RGGs, in the limit where the number of points N tends to infinity, there exist concentration theorems for the clique number [43].


Instead of using the clique number, it is more interesting to study the trend of the density of k-cliques as the parameter r changes. A k-clique is a clique with k vertices (see fig. 1.2) and their density $\rho_k$ is given by the ratio between the number of k-cliques actually present in the graph and all the possible ones, which are $\binom{N}{k}$. For the Erdős-Rényi random graph there exists the exact solution [8]:

$$\rho_k^{\mathrm{ER}}(p) = \binom{N}{k}\, p^{\binom{k}{2}} \tag{1.3}$$

where p is the linking probability, N the number of points and k the dimension of the clique. Note that finding k-cliques is a problem that can be solved in polynomial time ($O(N^k)$, where N is the number of points), so it is generally faster than finding the maximum clique. For generic RGGs there exist some bounds for particular cases, and only in the limit where the number of points N tends to infinity [15, 5, 9]. Cliques are widely used in many different fields. The concept of clique was introduced in 1949 by Luce and Perry in social network analysis [37], an interdisciplinary field halfway between sociology and graph theory, and it has become a widely used tool in this field [3, 42, 19] thanks to the fact that, being densely connected, cliques are able to describe the highly connected areas of a graph, such as "communities" or "clusters". They are also used extensively in bioinformatics, where they are employed to model a great variety of different systems: for example, the problem of clustering gene expression data [6, 55], the ecological niches in food webs [52], protein structure prediction [47] and the problem of inferring evolutionary trees [12]. Furthermore, cliques can also be used to extract information from real networks, as done for protein-protein interaction networks [51] and communication networks [45]. They are also employed in chemistry [40, 46], statistical mechanics [13, 39] and electrical engineering [41].

We decided to focus our study on the trend of the density of k-cliques as the parameter r changes, for any number of points N and in the limit of dimensionality d tending to infinity.
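As a hedged illustration of how this observable can be measured on a small graph, the sketch below builds a hard RGG and estimates $\rho_k(r)$ by brute-force enumeration of all $\binom{N}{k}$ subsets (feasible only for small N and k); every parameter value and helper name here is our own choice, not taken from the text.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def clique_density(A, k):
    """Fraction of the k-subsets of nodes that form a clique in adjacency matrix A."""
    n = A.shape[0]
    total = cliques = 0
    for subset in combinations(range(n), k):
        total += 1
        # a subset is a k-clique iff every pair inside it is linked
        if all(A[i, j] for i, j in combinations(subset, 2)):
            cliques += 1
    return cliques / total

# Hard RGG: N = 30 points, uniform in the unit hypercube of dimension d = 10, p = 1 distance
N, d, r, k = 30, 10, 3.0, 3
x = rng.random((N, d))
dist = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=-1)
A = (dist < r).astype(int)
np.fill_diagonal(A, 0)
print("estimated rho_k(r) =", clique_density(A, k))
```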


Figure 1.2: Examples of observables in graph theory. Given a hard random geometric graph (N = 100 points drawn with flat probability in the unit square) we highlight some examples of observables. Top left: the connectivity of an RGG is a function only of the cut-off radius r (r = 0.05 in dark blue and r = 0.2 in light blue). Top right: the Minimal Spanning Tree is highlighted in dark green. Bottom left: the shortest (according to the Euclidean metric) path that links all the points of the graph (Hamiltonian Path) is highlighted in dark green. Bottom right: some of the 5-cliques are highlighted.


1.3 RGGs in data science and high dimensionality

According to [31], the main purpose of Data Science is to analyze and understand real phenomena dealing with complex datasets. In other words, the goal of data science is to use statistics, data analysis, machine learning and their related methods in order to extract knowledge and meaningful insights from datasets. Although this field is very young (its modern definition was given for the first time only in 1992), it has been able to become one of the most developed and studied sciences thanks to its possible uses in every field of knowledge. In fact, from industry to research, from business to urban management, all human activities create, collect and analyze data in order to make data-based forecasts or take data-based decisions. In order to do this, it is necessary to be able to transform a huge amount of data into a small number of variables that can characterize the system. Obviously, this is not a trivial task. The first problem is to find the best way to model the dataset: the type of information that can be extracted will also depend on how the data is modeled. For example, suppose we want to extract information on the relationships between the objects in a dataset: to do this the best way to represent the dataset is to make the relationships between objects explicit, that is, to use a graph. Indeed, as we have seen in the previous section, some of the observables of graph theory allow us to extract a lot of information from a dataset. But it is also true that it is not always possible to find explicit relationships in a dataset. However, there is a way to highlight these relationships: incorporating the dataset into a metric space, and expressing the relationship between two points as a function of their distance. In other words, one way to generate relationships from a dataset that does not reveal any is to immerse it in a metric space and study the resulting geometric graph. This idea is fundamental for the study of real networks, such as wireless networks, and for the development of many algorithms, as in the case of manifold learning in machine learning. But, if it is simple to find the embedding dimension in the case of a real wireless network (a real system is at

most three-dimensional), finding the embedding dimension for a dataset of images or proteins is a non-trivial problem (although it may seem strange, finding the space in which to embed these datasets is an everyday problem in data science). The simplest way, and one that guarantees no loss of information, is to link each degree of freedom of the individual data to a dimension. For example, let's consider a dataset of digital images with a size of 28x28 pixels: how is it possible to represent them as objects belonging to a metric space? The most used way is to represent each image as a vector of dimension 784 (28x28), where each component of the vector corresponds to the intensity value of a specific pixel. In this way, the dataset is represented as a set of points in a metric space of dimension 784. As shown in this example, if it is necessary to consider all the degrees of freedom, it is evident that the embedding metric space will be high dimensional. Depending on one's point of view or use, having a high dimensional dataset can be a curse or a blessing. The curse of dimensionality refers to a set of various phenomena that arise in several subfields of the mathematical sciences when analyzing and organizing data in high-dimensional spaces, and that do not occur in low-dimensional settings. The common theme of cursed phenomena is that when the dimensionality of the system increases, the volume of the space increases so fast that the available data become sparse and therefore statistically less significant, as shown in fig. 1.3. The blessings of dimensionality are less widely noted, but they rest on the same concept as the main idea of statistical physics: if a system can be represented as a union of many weakly (but not negligibly) interacting subsystems, then, if the number of subsystems tends to infinity (i.e. in the thermodynamic limit), the whole system can be described by a few macroscopic variables (that are low dimensional). In the same way, because of the so-called concentration of measure phenomenon, we can say that even in a high dimensional dataset the fluctuations will be well controlled and asymptotic methods can be used. In other words, we can say that statistically we know more about a high dimensional dataset than about a lower dimensional one. According to [56], the blessing of dimensionality and the curse of dimensionality are two sides of the same coin.


Figure 1.3: As the dimensionality of the system increases (leaving the number of points fixed), the minimum distance and the maximum distance measured in the system remain close to the average instead of following the extremal distances. This plot highlights the main effect of the Concentration of Distances. Both the average value of the distance between two points (in blue) and its maximum and minimum values (in light green) grow much more slowly than the maximum possible distance in the system (in black), which is $\sqrt{d}$ in the Euclidean case. This is due to the fact that, leaving the number of points fixed, as the dimensionality increases the system appears to be much more sparse, and therefore it is unable to sample the total space well. This is also highlighted by the fact that the standard deviation (in dark green, plotted around the average) does not grow as the dimensionality of the system increases.

For example, the Concentration of Distances phenomenon: given a high dimensional dataset, the distance between two points will be, with high probability, close to the average distance [7]. This phenomenon significantly simplifies the expected geometry of data (blessing) [32], but, at the same time, it makes the search for neighbours in high dimensions difficult and even useless (curse) [44].
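A quick way to see the Concentration of Distances numerically (a sketch of ours, not part of the original text) is to track the relative variance of the pairwise distances of uniformly drawn points as the dimension grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)

def relative_variance_of_distances(d, n_points=200, p=2):
    """std/mean of the pairwise p-norm distances of n_points uniform points in [0,1]^d."""
    x = rng.random((n_points, d))
    dvals = pdist(x, metric="minkowski", p=p)
    return dvals.std() / dvals.mean()

for d in (1, 10, 100, 1000):
    print(f"d = {d:4d}   relative variance = {relative_variance_of_distances(d):.3f}")
# the ratio shrinks roughly like 1/sqrt(d): distances concentrate around their mean
```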

As we have seen in this chapter, the study of high dimensional random geometric graphs is a very important aim for random graph theory and for data science. In order to pursue it, we decided to investigate the properties of the object that allows us to define the relationships between the points, i.e. the distance, and the scaling properties of the observables of RGGs as functions of it.

Chapter 2

Central limit theorem for distances

The purpose of this chapter is to derive a multivariate Central Limit Theorem for the distances between N points extracted with a given probability distribution. In order to do this, the path we took was to exploit the generality of the proof of the multivariate central limit theorem given by [57], which is reproduced in the next section.

2.1 Central limit theorems

We first introduce a theorem that allows us to connect convergence in distribution of a sequence of random vectors with pointwise convergence of their characteristic functions.

Theorem 2.1.1 (Lévy's continuity theorem). Let $X_m$ and $X$ be random vectors in $\mathbb{R}^k$. Then $X_m \rightsquigarrow X$ if and only if $\mathbb{E}\, e^{it^T X_m} \to \mathbb{E}\, e^{it^T X}$ for every $t \in \mathbb{R}^k$. Moreover, if $\mathbb{E}\, e^{it^T X_m}$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic function of a random vector $X$, and $X_m \rightsquigarrow X$.

This theorem implies that weak convergence of vectors is equivalent to

weak convergence of linear combinations:

$$X_m \rightsquigarrow X \iff t^T X_m \rightsquigarrow t^T X \quad \forall\, t \in \mathbb{R}^k \tag{2.1}$$

This is known as the Cramér-Wold device, and it allows us to reduce higher-dimensional problems to the one-dimensional case, so it will be a key ingredient in the proof of the multivariate central limit theorem. But before doing so, we need to add one more piece, namely the univariate version of the same theorem.

Theorem 2.1.2 (Central limit theorem). Let $Y_1, Y_2, \dots$ be i.i.d. random variables with $\mathbb{E}Y_i = 0$ and $\mathbb{E}Y_i^2 = 1$. Then the sequence:

$$\sqrt{m}\,\overline{Y}_m = \frac{\sum_{i=1}^{m} Y_i}{\sqrt{m}} \tag{2.2}$$

converges in distribution to the standard normal distribution.

Proof. Differentiating twice the characteristic function of Y, $\phi(t) = \mathbb{E}\, e^{itY}$, under the expectation sign shows that $\phi''(0) = i^2\,\mathbb{E}Y^2$. Because $\phi'(0) = i\,\mathbb{E}Y = 0$, we obtain

$$\mathbb{E}\, e^{it\sqrt{m}\,\overline{Y}_m} = \phi\!\left(\frac{t}{\sqrt{m}}\right)^{m} = \left(1 - \frac{t^2}{2m}\,\mathbb{E}Y^2 + o\!\left(m^{-1}\right)\right)^{m} \longrightarrow e^{-\frac{t^2}{2}\mathbb{E}Y^2}. \tag{2.3}$$

This is possible because the characteristic function of independent random variables factorizes. The right side is the characteristic function of the normal distribution with mean zero and variance $\mathbb{E}Y^2$. The proposition follows from Lévy's continuity theorem.

Now we can use the previous theorem and the Cramér-Wold device to prove the multivariate CLT.

Theorem 2.1.3 (Multivariate central limit theorem). Let $Y_1, \dots, Y_l, \dots, Y_m$ be i.i.d. random vectors in $\mathbb{R}^S$ with mean $\mathbb{E}Y_l = \mu$ and covariance matrix $\Sigma = \mathbb{E}(Y_l - \mu)(Y_l - \mu)^T$. Then

$$\frac{1}{\sqrt{m}}\sum_{l=1}^{m} (Y_l - \mu) = \sqrt{m}\,(\overline{Y}_m - \mu) \rightsquigarrow \mathcal{N}_S(0, \Sigma) \tag{2.4}$$

(Note that the sum is taken coordinatewise.)


Proof. To prove this theorem we need to find the limit distribution of the random vectors. Using the Cramér-Wold device,

$$t^T\left(\frac{1}{\sqrt{m}}\sum_{l=1}^{m}(Y_l - \mu)\right) = \frac{1}{\sqrt{m}}\sum_{l=1}^{m}\left(t^T Y_l - t^T\mu\right) \tag{2.5}$$

Because the random variables $t^T Y_l - t^T\mu$ are i.i.d. with mean zero and variance $t^T\Sigma t$, this sequence is asymptotically $\mathcal{N}_1(0, t^T\Sigma t)$-distributed by the univariate central limit theorem, where $\mathcal{N}_1(0, t^T\Sigma t)$ is the univariate normal distribution. This is exactly the distribution of $t^T X$ if $X$ possesses an $\mathcal{N}_S(0,\Sigma)$ distribution, where $\mathcal{N}_S(0,\Sigma)$ is the multivariate normal distribution, defined as:

$$\mathcal{N}_S(0,\Sigma)(x) = \frac{e^{-\frac{1}{2}x^T\Sigma^{-1}x}}{\sqrt{(2\pi)^S\,|\Sigma|}} \tag{2.6}$$

2.2 Multivariate central limit theorem for distances

Now we want to specialize the previous multivariate CLT (Theorem 2.1.3) to the case of distances. Let $\vec{x}_1, \dots, \vec{x}_N$ be i.i.d. random points in $\mathbb{R}^d$ sampled with probability distribution $\nu(\vec{x})$ such that $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$, where the upper indices indicate the spatial coordinates (as will be better specified below). We will consider the p-norms in $\mathbb{R}^d$:

$$\|\vec{x}\|_p = \sqrt[p]{\sum_{k=1}^{d}|x^k|^p}. \tag{2.7}$$

Notice that p-norms are norms only for $p \ge 1$, as for $0 < p < 1$ the triangle inequality is not satisfied. In this case we can use $\|\vec{x}_i - \vec{x}_j\|_p^p$, which nonetheless defines a distance. Thus, we will consider the distance function:

$$d_p(\vec{x}_i, \vec{x}_j) = \|\vec{x}_i - \vec{x}_j\|_p^{\min(1,p)} \tag{2.8}$$


We expect that, in the high-dimensional limit, the distances concentrate around their mean value, so we prefer to use a rescaled distance:

$$q(\vec{x}_i, \vec{x}_j) = \frac{\left[d_p(\vec{x}_i, \vec{x}_j)\right]^{\max(1,p)} - d\mu}{\sqrt{d}} = \frac{1}{\sqrt{d}}\sum_{k=1}^{d}\left(|x_i^k - x_j^k|^p - \mu\right) = \frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(\vec{x}_i, \vec{x}_j) \tag{2.9}$$

where

$$\mu = \int dx_i\, dx_j\,\tau(x_i)\tau(x_j)\,|x_i - x_j|^p \tag{2.10}$$

Notice that the random vectors $\mathbf{q}^k = \big(q^k(\vec{x}_1,\vec{x}_2), q^k(\vec{x}_1,\vec{x}_3), \dots, q^k(\vec{x}_{N-1},\vec{x}_N)\big) \in \mathbb{R}^{\binom{N}{2}}$ (where we have considered that the distance is symmetric under point exchange), with $1 \le k \le d$, are statistically independent and identically distributed due to the fact that $\nu$ is a factorized distribution, and that the definition of $\mu$ implies that the expected value of $\mathbf{q}^k$ is the null vector. Notice also that the components of the vectors $\mathbf{q}^k$ are naturally indexed by lexicographically ordered multi-indices, as they are related to the distances between pairs of points along the k-th dimension; to distinguish such vectors from the Euclidean ones, we type them in boldface. So, we can use $q^k(\vec{x}_i, \vec{x}_j)$ as the random variable $Y_l^{(s)}$ seen in the previous theorem (Theorem 2.1.3), where the index s (the old coordinate) is represented by the double index (i,j) (i.e. the pair of points), while the sum is now made over the index k. Thus, the vector $\mathbf{q} = \big(q(\vec{x}_1,\vec{x}_2), q(\vec{x}_1,\vec{x}_3), \dots, q(\vec{x}_{N-1},\vec{x}_N)\big)$ is a sum of i.i.d. multivariate random variables, and satisfies the following central limit theorem:

Theorem 2.2.1 (Multivariate central limit theorem for distances). Let $\vec{x}_1, \dots, \vec{x}_N$ be i.i.d. random vectors in $\mathbb{R}^d$ sampled with probability measure $\nu(\vec{x})$ such that $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$. Then

$$\mathbf{q} = \frac{1}{\sqrt{d}}\sum_{k=1}^{d}\mathbf{q}^k \;\rightsquigarrow\; \mathcal{N}_{\binom{N}{2}}(0, \Sigma) \tag{2.11}$$


where

$$\Sigma_{(i,j),(l,m)} = \mathbb{E}\left[q^1(x_i, x_j)\, q^1(x_l, x_m)\right] = \Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big)\Big\rangle. \tag{2.12}$$

Note that $\Sigma_{(i,j),(l,m)}$ represents a matrix despite its four indices: each row and column is identified by a double index, i.e. by a pair of points.

Proof. The general proof is the same as that of Theorem 2.1.3. Here, we sketch a more down-to-earth argument. The probability distribution of the rescaled distances is defined as:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) =$$

$$= \int \prod_{k=1}^{d}\prod_{i=1}^{N} dx_i^k\,\tau(x_i^k) \int \prod_{i<j}^{N}\frac{d\lambda_{i,j}}{2\pi}\,\exp\left[i\lambda_{i,j}\left(q_{i,j} - \frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j)\right)\right]$$

where we have used the integral representation of the Dirac delta, and where we define $D\lambda = \prod_{i<j}^{N}\frac{d\lambda_{i,j}}{2\pi}\, e^{i\lambda_{i,j} q_{i,j}}$. Since the coordinates are i.i.d., the integral factorizes over the coordinate index k:

$$= \int D\lambda \left[\int \prod_{i=1}^{N} dx_i\,\tau(x_i)\,\exp\left(-i\sum_{i<j}^{N}\lambda_{i,j}\,\frac{|x_i - x_j|^p - \mu}{\sqrt{d}}\right)\right]^{d} \tag{2.16}$$

Now, if the second moment of $q^k(x_i, x_j)$ is finite, in the limit of d tending to infinity the argument of the exponential will be very small, so we can expand it in series:

$$\simeq \int D\lambda\left[\int \prod_{i=1}^{N} dx_i\,\tau(x_i)\left(1 - \frac{i}{\sqrt{d}}\sum_{i<j}^{N}\lambda_{i,j}\big(|x_i - x_j|^p - \mu\big) - \frac{1}{2d}\sum_{i<j}^{N}\sum_{l<m}^{N}\lambda_{i,j}\lambda_{l,m}\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big) + o(d^{-1})\right)\right]^{d}$$

Now we can solve the integrals in $dx_i$, obtaining expectation values (the linear term vanishes because $\langle|x_i - x_j|^p\rangle = \mu$):

$$\simeq \int D\lambda\left[1 - \frac{1}{2d}\sum_{i<j}^{N}\sum_{l<m}^{N}\lambda_{i,j}\lambda_{l,m}\Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_l - x_m|^p - \mu\big)\Big\rangle + o(d^{-1})\right]^{d}$$

Re-exponentiating the bracket in the limit $d \to \infty$ and carrying out the resulting Gaussian integral over $\lambda$ (a Hubbard-Stratonovich transformation, see Appendix B) we obtain:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) \simeq \mathcal{N}_{\binom{N}{2}}(0, \Sigma)(q)$$

In other words, we have obtained:

$$P\left(\left\{\frac{1}{\sqrt{d}}\sum_{k=1}^{d} q^k(x_i, x_j) = q_{(i,j)}\right\}_{\forall\, i<j}\right) \;\xrightarrow{\;d\to\infty\;}\; \mathcal{N}_{\binom{N}{2}}(0, \Sigma)(q) \tag{2.22}$$

This theorem proves that, in the limit of dimension tending to infinity, the probability distribution of the normalized distances among N points tends to a $\binom{N}{2}$-dimensional multivariate normal distribution with mean 0 and covariance matrix given by Σ. This matrix is the most important object in our result, so in the following sections we study its form in depth.
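As a quick numerical sanity check of this statement (our own sketch, with arbitrary N, d and p, assuming the uniform distribution on [0,1] for τ), one can estimate the covariance of the rescaled distance vector q over many independent draws and verify the three-value pattern of Σ derived in the following sections; for p = 1 and the uniform distribution, the closed forms of section 2.4 give α = 1/18, β = 1/180 and γ = 0.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N, d, p, samples = 4, 200, 1, 20000
mu = 1.0 / 3.0                      # E|x-y|^p for p = 1 and x, y uniform on [0,1]

pairs = list(combinations(range(N), 2))   # (0,1),(0,2),(0,3),(1,2),(1,3),(2,3)
qs = np.empty((samples, len(pairs)))
for s in range(samples):
    x = rng.random((N, d))
    for a, (i, j) in enumerate(pairs):
        qs[s, a] = (np.abs(x[i] - x[j]) ** p - mu).sum() / np.sqrt(d)

cov = np.cov(qs, rowvar=False)
print("alpha (same pair)       ~", cov[0, 0])   # theory: 1/18  ~ 0.0556
print("beta  (pairs sharing 0) ~", cov[0, 1])   # theory: 1/180 ~ 0.0056
print("gamma (disjoint pairs)  ~", cov[0, 5])   # theory: 0
```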

2.3 The covariance matrix Σ

In this section we will present the different terms of the covariance matrix and then we will describe in detail its main characteristics.

2.3.1 The terms of Σ

In general Σ has the following form:

$$\Sigma_{(i,j),(k,l)} = \Big\langle\big(|x_i - x_j|^p - \mu\big)\big(|x_k - x_l|^p - \mu\big)\Big\rangle \tag{2.23}$$

Based on the different types of index combinations, only three cases are possible:

• Diagonal correlation (i = k and j = l):

$$\Sigma_{(i,j),(i,j)} := \alpha = \Big\langle\big(|x - y|^p - \mu\big)^2\Big\rangle = \int dx\, dy\,\tau(x)\tau(y)\,|x - y|^{2p} - \mu^2 \tag{2.24}$$


• Triangular correlations (i = k and j ≠ l, or i ≠ k and j = l):

$$\Sigma_{(i,j),(i,l)} = \Sigma_{(i,j),(k,j)} := \beta = \Big\langle\big(|x - y|^p - \mu\big)\big(|x - z|^p - \mu\big)\Big\rangle \tag{2.25}$$
$$= \int dx\, dy\, dz\,\tau(x)\tau(y)\tau(z)\,|x - y|^p\,|x - z|^p - \mu^2 \tag{2.26}$$

• Pair-pair correlations (i, j, k, l are all distinct):

$$\Sigma_{(i,j),(k,l)} := \gamma = 0 \tag{2.27}$$

Using the Kronecker delta we can write the exact form of Σ:

$$\begin{aligned}
\Sigma_{(i,j),(k,l)} ={}& \alpha\left(\delta_{i,k}\delta_{j,l} + \delta_{i,l}\delta_{j,k}\right) \\
&+ \beta\left(\delta_{i,k}(1-\delta_{j,l}) + \delta_{i,l}(1-\delta_{j,k}) + \delta_{j,l}(1-\delta_{i,k}) + \delta_{j,k}(1-\delta_{i,l})\right) \\
&+ \gamma\,(1-\delta_{i,k})(1-\delta_{i,l})(1-\delta_{j,k})(1-\delta_{j,l}) \\
={}& (\alpha - 2\beta + \gamma)\,\delta_{i,k}\delta_{j,l} + (\beta - \gamma)\left(\delta_{i,k} + \delta_{i,l} + \delta_{j,k} + \delta_{j,l}\right) + \gamma.
\end{aligned} \tag{2.28}$$

where, in the last step, we have used the conditions $i < j$ and $k < l$.

2.3.2 Characterizing the covariance matrix

Here we describe in detail the main characteristics of the covariance matrix.

1. Symmetric: Σ is symmetric with respect to the exchange of pairs of points. This peculiarity is clear in Figure 2.1.

2. Elements: on each row/column, Σ contains:

• α: 1 time;
• β: 2(N − 2) times;
• γ: (N − 2)(N − 3)/2 times.


Figure 2.1: (Left) The shape of the Σ matrix. Σ matrix for N = 10 (45 pairs of indices), where α, β and γ (schematically represented in the legend) were coloured differently to make the symmetric form of this matrix explicit. (Right) Example of a Johnson graph with N = 5. The Johnson graph J(N, 2) is the line graph of the complete graph over N nodes. It has all the distinct pairs of the original nodes as its vertices, and two vertices are linked if their pairs share an original node.

3. Eigenvalues and multiplicities: with the aim of finding these, it is necessary to consider a representation of our matrix. Indeed it is possible to write Σ as:

$$\Sigma = (\alpha - \gamma)\,I + (\beta - \gamma)\,J + \gamma\, U. \tag{2.29}$$

where I is the identity matrix, U is the all-ones matrix and J is the adjacency matrix (with null diagonal) of the Johnson graph J(N, 2), which is the line graph of the complete graph over N vertices. It is clear that I, J and U commute with each other, so they can be diagonalized simultaneously. The contribution of I is trivial. J and U share a non-degenerate eigenvector (the one with all components equal to one), which accounts for $\lambda_1$. In the orthogonal subspace, U represents the null operator (its other eigenvectors are associated to eigenvalue 0), and does not contribute. Thus, the remainder of the spectrum is determined by that of J, which is known [21]. Following


this, the eigenvalues are:

• $\lambda_1 = \alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma$, with multiplicity $g_{\lambda_1} = 1$;

• $\lambda_2 = \alpha + (N-4)\beta - (N-3)\gamma$, with multiplicity $g_{\lambda_2} = N - 1$;

• $\lambda_3 = \alpha - 2\beta + \gamma$, with multiplicity $g_{\lambda_3} = \frac{N(N-3)}{2}$.

4. Positive semi-definiteness: in statistics, the covariance matrix of a multivariate probability distribution (like Σ) must be positive semi-definite, as shown by [17]. This implies that $\lambda_i \ge 0$. Knowing this, we have deduced some characteristics of α and β that will be useful in Appendix A.

5. Trace: $\mathrm{Tr}[\Sigma] = \frac{N(N-1)}{2}\,\alpha$

6. Determinant:

$$\mathrm{Det}[\Sigma] = \left(\alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma\right)\cdot \tag{2.30}$$
$$\cdot\left(\alpha + (N-4)\beta - (N-3)\gamma\right)^{N-1}\left(\alpha - 2\beta + \gamma\right)^{\frac{N(N-3)}{2}} \tag{2.31}$$

7. Inverse of the matrix: note that the probability distribution formula contains the inverse of Σ. In order to calculate it, we can take advantage of the following matrix identity:

$$\Sigma^{-1} = (O^T D\, O)^{-1} = O^T D^{-1} O, \tag{2.32}$$

where O is the orthogonal matrix of eigenvectors and D the diagonal matrix of eigenvalues. So $\Sigma^{-1}$ has the same eigenvectors of Σ, but with inverse eigenvalues; equivalently, it has the same structure with new parameters $(\bar\alpha, \bar\beta, \bar\gamma)$, i.e. $\Sigma^{-1}(\alpha, \beta, \gamma) = \Sigma(\bar\alpha, \bar\beta, \bar\gamma)$. The inverse eigenvalues can be evaluated by solving the following system of equations:

$$\lambda_i(\bar\alpha, \bar\beta, \bar\gamma) = \frac{1}{\lambda_i(\alpha, \beta, \gamma)} \qquad \text{for } i = 1, 2, 3. \tag{2.33}$$


The solution of this system is:

$$\mathrm{DEN}(\alpha,\beta,\gamma) = 2\,\big(\alpha - 2\beta + \gamma\big)\big(\alpha + (N-4)\beta - (N-3)\gamma\big)\left(\alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma\right) \tag{2.34}$$

$$\bar\alpha(\alpha,\beta,\gamma) = \frac{1}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg(-\gamma^2(N-4)(N-3)^2 + \gamma\beta(N-4)\big(22 + (N-11)N\big) + \gamma\alpha\big(14 + (N-7)N\big) + 2\Big(2\big(14 + (N-8)N\big)\beta^2 + \beta\alpha(3N-10) + \alpha^2\Big)\Bigg) \tag{2.35}$$

$$\bar\beta(\alpha,\beta,\gamma) = \frac{1}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg((N-4)(N-3)\gamma^2 - \big(26 + (N-11)N\big)\gamma\beta - 2\beta\big(2(N-4)\beta + \alpha\big)\Bigg) \tag{2.36}$$

$$\bar\gamma(\alpha,\beta,\gamma) = -\frac{2}{\mathrm{DEN}(\alpha,\beta,\gamma)}\Bigg((N-3)\gamma^2 - 4\beta^2 + \gamma\big(\alpha - (N-6)\beta\big)\Bigg) \tag{2.37}$$

2.4 Reliability of the developed CLT by comparison with simulations

In this section we will discuss the reliability of our result by comparing it with simulations. In order to make these simulations, we draw N i.i.d. samples $\{\vec{x}\}$ from $\nu(\vec{x}) = \prod_{k=1}^{d}\tau(x^k)$. With these points, we can calculate the distances and compare them with our CLT, where α, β and γ are calculated from ν. We use two different probability distributions, namely the uniform

distribution in the unit hypercube and the multivariate Gaussian distribution with mean 0 and variance σ². These can be written as:

$$\nu^{\mathrm{cube}}(\vec{x}) = \prod_{k=1}^{d}\theta(x^k)\,\theta(1 - x^k) \tag{2.38}$$

$$\nu^{\mathrm{gauss}}(\vec{x}) = \prod_{k=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x^k)^2}{2\sigma^2}} \tag{2.39}$$

Given ν, the first ingredients needed are µ, α and β.

• Flat distribution on the hypercube. From the definition of µ, using the transformation z = x − y, we can easily calculate it:

$$\begin{aligned}
\mu^{\mathrm{cube}} &= \int dx\, dy\,\tau(x)\tau(y)\,|x-y|^p = \int dx\, dy\,\theta(x)\theta(1-x)\theta(y)\theta(1-y)\,|x-y|^p \\
&= \int dx \int (-dz)\,|z|^p\,\theta(x)\theta(1-x)\theta(x-z)\theta(1-x+z) \\
&= \int_0^1 dx \int_{x-1}^{x} dz\,|z|^p = \int_0^1 dx\left(\int_0^x dz\, z^p + \int_{x-1}^0 dz\,|z|^p\right) \\
&= \int_0^1 dx\,\frac{x^{p+1} + (1-x)^{p+1}}{p+1} = \frac{2}{(p+1)(p+2)}.
\end{aligned} \tag{2.40}$$


Now, we can calculate the other ingredients, starting from α:

$$\alpha^{\mathrm{cube}} = \int dx\, dy\,|x-y|^{2p} - \mu^2 = \frac{2}{(2p+1)(2p+2)} - \frac{4}{(p+1)^2(p+2)^2} = \frac{p^2(p+5)}{(p+1)^2(p+2)^2(2p+1)} \tag{2.41}$$

Lastly, $\beta^{\mathrm{cube}}$:

$$\begin{aligned}
\beta^{\mathrm{cube}} &= \int dx\, dy\, dz\,\tau(x)\tau(y)\tau(z)\,|x-y|^p\,|x-z|^p - \mu^2 \\
&= \int_0^1 dx\left(\int_{x-1}^{x} d\eta\,|\eta|^p\right)\left(\int_{x-1}^{x} d\xi\,|\xi|^p\right) - \mu^2 = \int_0^1 dx\left(\int_{x-1}^{x} d\eta\,|\eta|^p\right)^2 - \mu^2 \\
&= \int_0^1 dx\,\frac{x^{2p+2} + (1-x)^{2p+2} + 2\,x^{p+1}(1-x)^{p+1}}{(p+1)^2} - \mu^2 \\
&= \frac{2}{(p+1)^2}\left(\frac{p^2 - 2}{(p+2)^2(2p+3)} + \frac{\Gamma^2(p+2)}{\Gamma(2p+4)}\right)
\end{aligned} \tag{2.42}$$

where Γ(x) is the Euler gamma function (these closed forms are cross-checked numerically in the sketch placed right after this list).

• Multivariate Gaussian distribution. In order to find the elements of the covariance matrix in this case we have used Mathematica 12:

$$\mu^{\mathrm{gauss}} = \int dx\, dy\,\frac{e^{-\frac{x^2}{2\sigma^2}}\, e^{-\frac{y^2}{2\sigma^2}}}{2\pi\sigma^2}\,|x-y|^p = \frac{(2\sigma)^p\,\Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}$$

$$\alpha^{\mathrm{gauss}} = \frac{(2\sigma)^{2p}}{\pi}\left(\sqrt{\pi}\,\Gamma\!\left(\frac{1}{2} + p\right) - \Gamma^2\!\left(\frac{p+1}{2}\right)\right)$$

$$\beta^{\mathrm{gauss}} = \frac{2^{p}\,\sigma^{2p}\,\Gamma^2\!\left(\frac{p+1}{2}\right)}{\pi\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} dx\; e^{-\frac{3x^2}{2\sigma^2}}\left[{}_1F_1\!\left(\frac{p+1}{2};\,\frac{1}{2};\,\frac{x^2}{2\sigma^2}\right)\right]^2 - \mu^2 \tag{2.43}$$

where ${}_1F_1$ is the confluent hypergeometric function. From now on, we use σ = 1 for simplicity.
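As promised above, here is a minimal Monte Carlo cross-check (our own, with arbitrary sample size) of the closed forms obtained for the flat distribution; for p = 1 the expected values are µ = 1/3, α = 1/18 and β = 1/180.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 1, 2_000_000
x, y, z = rng.random((3, n))

mu_mc    = np.mean(np.abs(x - y) ** p)
alpha_mc = np.mean(np.abs(x - y) ** (2 * p)) - mu_mc ** 2
beta_mc  = np.mean(np.abs(x - y) ** p * np.abs(x - z) ** p) - mu_mc ** 2

# closed forms of eqs. (2.40) and (2.41); eq. (2.42) gives 1/180 at p = 1
mu_th    = 2 / ((p + 1) * (p + 2))
alpha_th = p**2 * (p + 5) / ((p + 1) ** 2 * (p + 2) ** 2 * (2 * p + 1))
print("mu   :", mu_mc, "vs", mu_th)
print("alpha:", alpha_mc, "vs", alpha_th)
print("beta :", beta_mc, "vs", 1 / 180)
```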

Now we can finally make the comparison between our central limit theorem and a simulation. The simplest way to do that is to draw only N = 2 points: in this case the rescaled distance vector q is one-dimensional, so we can easily plot it. We have repeated the draw $10^5$ times, we have plotted the distances in a histogram and we have compared these plots with the Normal distribution that comes from the CLT. In order to improve the comparison, we have also calculated the mean and the variance from the dataset of distances and we have plotted the Normal distribution with these parameters: this is useful to cut off the fluctuations and make a better comparison. It is possible to see the results in Figures 2.2, 2.3, 2.4 and 2.5. We have repeated this procedure for different values of the dimension and of the norm exponent p. From these comparisons we can extract three pieces of information:

• The first thing that one notices (figs. 2.2 and 2.3) is that, at fixed p, with increasing dimensionality the simulation and the theoretical result become more and more similar. This is not unexpected, because we know that the CLT works in the limit of infinite dimension. Nevertheless, we have a good similarity even in dimension 5, which is very far from infinity. Although at the lowest dimensions the shape of the Normal distribution is very different from the shape of the histogram, the mean and the variance of the simulation are equal to the theoretical ones.

• The scaling in p seems to be in stark contrast with what we read in the literature about the concentration of distances. In the literature [22] we can read that as p increases, the effects of distance concentration intensify, that is, the system concentrates at lower dimensionality. Instead, in our case it is observed (figs. 2.5 and 2.4) that with increasing p the speed of convergence in dimensionality becomes slower. In other words, the higher p, the slower the concentration of the histogram around the normal distribution. The reason for this difference is simple: we are using a different definition of distance than the classic one, which is closer to a p-norm. In the literature one generally uses:

$$d_p^{\mathrm{lit}}(\vec{x}, \vec{y}) = \sqrt[p]{\sum_{k=1}^{d}|x^k - y^k|^p}. \tag{2.44}$$

Note that this is not a distance but a semi-distance for p < 1, as it does not satisfy the triangle inequality. Instead, as we said in the previous section, we use:

$$d_p^{\mathrm{our}}(\vec{x}, \vec{y}) = \sum_{k=1}^{d}\big|x^k - y^k\big|^p, \tag{2.45}$$

which is a distance for each value of p. To be sure that our result changes coherently with the change of definition of the distance, we studied the behaviour of an observable when d and p change, both in the case of our distance and of the classic one. This observable, the so-called relative variance, is defined as:

$$RV_{p,d}(X) = \frac{\sqrt{\mathrm{Var}(X)}}{\mathbb{E}(X)}. \tag{2.46}$$

The RV is really important for the concentration of distances because it is able to give a measure of this phenomenon. Moreover, being a well studied observable, we know its theoretical trend for random systems. In fact, using the classic distance that derives from p-norms, we expect [22] a trend like:

$$RV_{p,d}^{\mathrm{teo}} \simeq \frac{1}{\sqrt{d}}\,\frac{1}{p}\,\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}. \tag{2.47}$$

In our simulations, in figure 2.8, we observed that in the case of the classical distance the trend is coherent with what is foreseen by the theory. Instead, in our case, the trend of RV seems to follow a law like:

$$RV_{p,d} \simeq \frac{1}{\sqrt{d}}\,\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}. \tag{2.48}$$

This behaviour mainly affects the scaling in p. In fact we can see how, while the scaling in d is the same in both cases, the trend in p is very different: in the classic case it is observed that the concentration effects are maximally reduced in the case of p = 0.5, while in our case the same effect is observed for p = 5. This is consistent with the trend of $\frac{1}{p}\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}$ in the first case and of $\frac{\sqrt{\alpha_p^{\mathrm{teo}}}}{\mu_p^{\mathrm{teo}}}$ in the second one, as can be seen in the insets of fig. 2.8. This suggests that the effects of the concentration of distances in our case cannot be compared with what has been seen in the literature, due to the different definition of distance.

• The last piece of information is that the scaling in dimensionality and in p is the same for points drawn with a flat distribution in a hypercube or with a multivariate Gaussian, even if the speed of scaling is different in the two cases. This too is not unexpected, because our CLT holds for a generic probability distribution, with the only requirement that it be factorizable.

We also followed the same procedure by extracting N = 3 points at a time. However, in this case the vector of the rescaled distances is three-dimensional, so in order to draw a plot we had to marginalize the probability distribution so as to have only one free variable. Even in this case, the information that we can extract from the comparison is the same that we found in the case of N = 2, as shown by Figures 2.6 and 2.7.
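The N = 2 comparison described above can be reproduced in a few lines; the sketch below (our own parameter choices, uniform points and p = 1, for which µ = 1/3 and α = 1/18) overlays the histogram of the rescaled distance q with the Gaussian limit N(0, α).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
d, p, samples = 100, 1, 50_000
mu, alpha = 1 / 3, 1 / 18            # closed forms for the uniform case with p = 1

x = rng.random((samples, d))
y = rng.random((samples, d))
q = (np.abs(x - y) ** p - mu).sum(axis=1) / np.sqrt(d)

grid = np.linspace(q.min(), q.max(), 300)
gauss = np.exp(-grid**2 / (2 * alpha)) / np.sqrt(2 * np.pi * alpha)

plt.hist(q, bins=100, density=True, alpha=0.5, label="simulated rescaled distances")
plt.plot(grid, gauss, label="CLT prediction N(0, alpha)")
plt.legend()
plt.show()
```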


Figure 2.2: Increasing the dimensionality of the system, the simulation of distances between two points drawn from the flat distribution in the hypercube and the theoretical result become more and more similar. The comparison at different dimensionality (from top left: d = 100, 10, 5, 3, 2, 1 and p = 1) between the histogram of distances between two points drawn from the flat distribution in the hypercube and the normal distribution that comes from the CLT (in blue) shows that in the limit of high dimensionality our theoretical result approximates well the distribution of the distances between randomly drawn points. On the other hand, at low d, the finite dimensionality effects are very evident. Nevertheless, at every dimensionality the mean and the variance of the simulation are equal to the theoretical ones; indeed the normal distribution with the parameters extracted from the simulation (in black, dashed) and the theoretical result always overlap.


Figure 2.3: Changing the probability distribution from which the points are extracted, from the flat distribution in a hypercube to a multivariate Gaussian, changes the scaling speed in d but leaves the general behaviour unchanged. The comparison at different dimensionality (from top left: d = 100, 10, 5, 3, 2, 1 and p = 1) between the histogram of distances between two points drawn from the multivariate Gaussian distribution in $\mathbb{R}^d$ and the normal distribution that comes from the CLT (in blue) shows that in the limit of high dimensionality our theoretical result approximates well the distribution of the distances between randomly drawn points, as we see in fig. 2.2. But, compared to the previous case, the speed of convergence is slowed down; in fact the finite dimensionality effects are visible even at d = 10.


Figure 2.4: In the case of points drawn from the flat distribution in a hypercube, higher p values slow the convergence of the distribution of distances to the normal limit distribution of our theoretical result. The slowdown of the speed of convergence is clear when comparing the plots at the same dimensionality but with different p values (from top: p = 0.5, 2, 5; on the left d = 100 and on the right d = 10). Indeed, it is evident that as p increases (with fixed dimensionality) the simulation seems to move away from the theoretical line. Implicitly, the system requires, with increasing p, a higher dimensionality to reach a good level of convergence.


Figure 2.5: Even in this case, changing the probability distribution from which the points are extracted, from the flat distribution in a hypercube to a multivariate Gaussian, changes the scaling speed in p but leaves the general behaviour unchanged, as we have seen for d. The slowdown of the speed of convergence is clear when comparing the plots at the same dimensionality but with different p values (from top: p = 0.5, 2, 5; on the left d = 100 and on the right d = 10). Compared to the case in fig. 2.4, we can see how in this case the convergence is even slower, i.e. the system requires an even higher dimensionality to reach a good level of convergence.


Figure 2.6: Even in the case of N = 3 points, with a three-dimensional distribution of distances, we find the same scaling in dimensionality and in p for the convergence of the probability distribution of distances between points drawn from the flat distribution in a hypercube to our theoretical result. We plotted only 2 of the 3 distances (the third was ignored) in a density histogram (lighter colour corresponds to a more frequent event), from which we extracted the marginal histograms (left and bottom in each graph) that were compared with the marginalization of the theoretical probability distribution (in black). From these comparisons (from top: d = 1, 3, 100; left: p = 1 and right: p = 2) we are able to deduce that the scaling remains unchanged even for a larger number of points N.


Figure 2.7: Even in the case of N = 3 points, changing the probability distribution of the points does not change the scaling in d and in p. We plotted only 2 of the 3 distances in a density histogram (lighter colour corresponds to a more frequent event), from which we extracted the marginal histograms (left and bottom in each graph) that were compared with the marginalization of the theoretical probability distribution (in black) (from top: d = 1, 3, 100; left: p = 1 and right: p = 2). Compared to the case in fig. 2.6, we can see how in this case the convergence in both d and p is even slower. Note: these plots refer to the Gaussian distribution in $\mathbb{R}^d$.


[Figure 2.8: the relative variance $RV_p(d)$ as a function of d for p = 0.2, 0.5, 1, 2, 5, for the classical distance and for our distance, with insets showing the corresponding scaling laws in p.]

Figure 2.8: Having defined the distance differently from the classic way, the effects of the concentration of distances are very different, especially in the scaling in p. Nevertheless, the concentration effects with increasing dimensionality remain present even when changing the distance. Top: the relative variance in the case of the classical distance, $RV^C$, shows a nonlinear trend in p with a maximum at p = 0.5, as evidenced by its scaling law (in the inset). At this maximum we can see how the concentration effects are slowed down. Bottom: using instead our distance, we observe that the scaling of $RV^O$ is monotonically increasing in p, a scaling justified by our empirical law (in the inset). This trend is consistent with what has been seen in the previous figures (2.5 and 2.4). Furthermore, the fact that changing the distance also changes the trend of the concentration effects explains why we observe a different trend in p.

Chapter 3

Random geometric graph in high dimension

As we saw in Chapter 1, the concept of distance is fundamental in random geometric graphs: the activation function depends on the distance, both directly and parametrically through r. In the same chapter we also saw how the connectivity of the graph is a function of r. Consequently, the study of observables as a function of the parameter r is very important, because it allows us to characterize graphs based on their connectivity. In Chapter 2 we saw how the distances between N randomly drawn points, in the limit of high dimensionality, behave following a central limit theorem. In this chapter we will see how, by applying our CLT for distances to random geometric graphs, we will be able to make predictions on connectivity and consequently on observables, specifically on the k-clique density.

3.1 Clique density in high dimensionality

In general, our result allows us to calculate in a simple way the average number of subgraphs with a given structure in the limit of high dimensionality. In order to do that, we need to introduce the adjacency matrix $A_{i,j}$


of a subgraph with M points, i.e. the M × M matrix with entry $A_{i,j} = 1$ if (i,j) is linked and $A_{i,j} = 0$ otherwise. The average number of occurrences of a certain subgraph g with M nodes in a random graph can be factored into two terms. The first one is a combinatorial factor $\binom{N}{M}$, which accounts for the number of ways in which one can extract M nodes from a set of N of them. The second one is the so-called density $\rho_g(r)$ of the subgraph g at scale r, which is the probability (considering the cut-off radius r) that M points are linked together to form a subgraph with the same adjacency matrix as g:

$$\rho_g(r) = \int d\mathbf{d}\;\Pi(\mathbf{d}) \prod_{1\le i<j\le M}\big[h_r(d_{(i,j)})\big]^{A_{(i,j)}} = \tag{3.1}$$

$$= \int d\mathbf{q}\;\mathcal{N}(0,\Sigma)(\mathbf{q}) \prod_{1\le i<j\le M}\left[h_r\!\left(\Big(d\mu + \sqrt{d}\, q_{(i,j)}\Big)^{\min\left(1,\frac{1}{p}\right)}\right)\right]^{A_{(i,j)}}$$

where $\Pi(\mathbf{d})$ is the joint probability distribution of the pairwise distances $d_{(i,j)}$ and, in the second line, we have changed variables to the rescaled distances and used our central limit theorem to replace it with $\mathcal{N}(0,\Sigma)$. In the particular case of a k-clique, all the entries of the adjacency matrix are equal to 1, so the density reads:

$$\rho_k(r) = \int d\mathbf{q}\;\mathcal{N}(0,\Sigma)(\mathbf{q}) \prod_{1\le i<j\le k} h_r\!\left(\Big(d\mu + \sqrt{d}\, q_{(i,j)}\Big)^{\min\left(1,\frac{1}{p}\right)}\right) \tag{3.2}$$

Now that we have an explicit formula for the density of cliques, we can perform some simulations to see whether our result is consistent with real systems (for an overview of the integration methods, see section C.1). In order to do this we have drawn N random points with a certain probability distribution τ to form a Hard RGG and we have calculated the number of k-cliques present in the graph as the parameter r changes. We have repeated this a considerable number of times and averaged over realizations, then we have compared the simulation result with the theoretical result.
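One practical way to evaluate the theoretical prediction (3.2) is by Monte Carlo sampling of the multivariate Gaussian; the code below is our own hedged sketch of this idea, not the actual implementation of appendix C.1. It uses the uniform-distribution values α = 1/18, β = 1/180, µ = 1/3 valid for p = 1, and all other choices are arbitrary.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

def theoretical_rho_k_hard(r, k, d, alpha=1/18, beta=1/180, mu=1/3, samples=200_000):
    """Monte Carlo estimate of eq. (3.2) for a hard RGG with p = 1 and uniform points."""
    pairs = list(combinations(range(k), 2))
    M = len(pairs)
    Sigma = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (l, m) in enumerate(pairs):
            shared = len({i, j} & {l, m})
            Sigma[a, b] = alpha if shared == 2 else (beta if shared == 1 else 0.0)
    q = rng.multivariate_normal(np.zeros(M), Sigma, size=samples)
    dist = d * mu + np.sqrt(d) * q          # (d*mu + sqrt(d)*q)^{min(1,1/p)} with p = 1
    return np.mean(np.all(dist < r, axis=1))

# density of 3-cliques around the typical distance scale d*mu for d = 100
for r in (30.0, 33.0, 36.0):
    print(r, theoretical_rho_k_hard(r, k=3, d=100))
```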


As shown in fig. 3.1, it is evident that our result is excellent in the high dimension limit, but a very good approximation is obtained even in the case of d = 10. For the scaling in p, we observed a behaviour similar to that in the previous section (fig. 3.2): to have a good approximation as p increases, it is necessary to have a higher dimensionality in the system. It is also interesting to study the scaling in the dimensionality of the clique considered, i.e. with the variation of k. It is possible to see (fig. 3.1) how, as k increases, keeping the dimensionality of the system fixed, the goodness of our result turns out to be slightly lower. Furthermore, also in this case, by changing the probability distribution of the points, the same things are observed, as shown in fig. 3.3. We therefore demonstrated the goodness of our result for the density of k-cliques. Despite the results obtained, the explicit dependence of $\rho_k(r)$ on r, p and µ makes it more difficult to perform further, more detailed analyses. In fact, this dependence makes it impossible to compare the densities of the k-cliques obtained from systems with different parameters. To avoid this problem, we must define a new variable that accurately describes the density of k-cliques without explicitly depending on these parameters. The new variable, called rescaled k-clique density, is defined as follows:

$$\omega_k(x) = \big(\rho_k \circ \rho_2^{-1}\big)(x) \tag{3.3}$$

In this variable all these dependences cancel out, and the curves at different values of the parameters all lie in the domain x ∈ [0,1]. Indeed, by changing variable from r to $x = \rho_2(r)$, we are expressing our observables as functions of the average probability that two nodes are linked, so that we are explicitly factoring out the typical scale of separation between the nodes from the observables. This variable is also particularly convenient because, as we will see in the next sections, it will allow us to take full advantage of the different characteristics of Hard and Soft RGGs, so as to be able to perform more in-depth analyses and find the limiting behaviour in the case of high dimensional systems.
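In practice, given two curves $\rho_2(r)$ and $\rho_k(r)$ sampled on the same grid of r values (simulated or theoretical), $\omega_k$ can be obtained by monotone interpolation; the sketch below is a generic helper of ours, and the input curves in the example are placeholders chosen so that the expected output is $x^3$.

```python
import numpy as np

def rescaled_clique_density(r_grid, rho2, rhok, x):
    """Evaluate omega_k(x) = rho_k(rho_2^{-1}(x)) by interpolation.

    r_grid, rho2, rhok: arrays sampling rho_2(r) and rho_k(r) on the same r values;
    x: points in [0, 1] where omega_k is evaluated.
    """
    order = np.argsort(rho2)                 # rho_2(r) is monotone in r, so it is invertible
    r_of_x = np.interp(x, rho2[order], r_grid[order])
    order_r = np.argsort(r_grid)
    return np.interp(r_of_x, r_grid[order_r], rhok[order_r])

# placeholder Erdos-Renyi-like curves: rho_2 = r^2 and rho_3 = rho_2^3, so omega_3(x) = x^3
r = np.linspace(0.01, 1.0, 200)
x = np.linspace(0.0, 1.0, 5)
print(rescaled_clique_density(r, r**2, r**6, x))   # ~ [0, 0.016, 0.125, 0.42, 1]
```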


Figure 3.1: By applying the central limit theorem for distances to the calculation of the k-clique density in a Hard RGG, we are able to predict the behaviour of this density as the cutoff radius r varies, at least in the high dimensionality limit (d ⩾ 10). We extracted N = 1000 points with a uniform distribution in the unit hypercube and with these we calculated the k-clique density $\rho_k$ as the cutoff radius r varied. This operation was repeated S = 500 times (in green) and these simulations were subsequently averaged (in dark blue) in order to be compared with our theoretical prediction (in yellow). It is evident that the predictive capacity of our result improves with the increase of the dimensionality of the system, even if at fixed dimensionality there is a better approximation at lower k (k = 2 on the left and k = 5 on the right).



Figure 3.2: As p increases, a higher system dimensionality is required to obtain a good approximation. This is consistent with what was observed for the CLT in the previous chapter. We repeated the same procedure as in fig. 3.1, but this time we changed the exponent of the distance p (p = 0.5 on the left and p = 2 on the right). Consistent with the reduced accuracy of our CLT at larger p, it is observed that as p increases, keeping the dimensionality fixed, our prediction for the density of the k-cliques gets worse.



Figure 3.3: Changing the probability distribution of the points does not affect the accuracy of our result for the density of the k-cliques. We repeated the same procedure followed in figures 3.1 and 3.2, but this time the points were extracted from a Gaussian probability distribution with zero mean and unit variance. Similarly to what happened for the CLT in the previous chapter, the accuracy of our prediction follows the same scaling observed in the case of points extracted uniformly from a hypercube.


3.2 Clique density in high dimensional Soft RGGs: an Erdős-Rényi behaviour

The main difference between Hard and Soft RGGs is their activation functions. Indeed, while the hard activation h_r^hard is a boolean function, the soft one h_r^soft is in general a continuous function, as in the case of the Rayleigh fading activation function. Starting from the analytical formula of the k-clique density, we can take advantage of the continuity of the activation function to find the leading order of ω_k^soft in the high dimensional case.

$$\rho_k(r) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_r\!\left(\left[d\mu + \sqrt{d}\, q_{(i,j)}\right]^{\min\left(1,\frac{1}{p}\right)}\right) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_r\!\left(\left[d\left(\mu + \frac{q_{(i,j)}}{\sqrt{d}}\right)\right]^{\min\left(1,\frac{1}{p}\right)}\right) \tag{3.4}$$

Now, in the high dimension limit, we can expand the activation function in powers of $1/\sqrt{d}$:

$$\rho_2^{soft}(r) \simeq \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j} \left( h_r\!\left((d\mu)^{\min\left(1,\frac{1}{p}\right)}\right) + O\!\left(\frac{1}{\sqrt{d}}\right) \right) \tag{3.5}$$

$$\rho_2^{soft}(r) \simeq h_r\!\left((d\mu)^{\min\left(1,\frac{1}{p}\right)}\right). \tag{3.6}$$

So, in the case of a generic k-clique:

$$\rho_k^{regular}(r) = \left[\rho_2^{regular}(r)\right]^{\binom{k}{2}} \tag{3.7}$$

In this case, the relation between ρ_k and ρ_2 reduces to that of Erdős-Rényi graphs with linking probability ρ_2^soft, i.e.:

$$\omega_k^{soft}(x) = x^{\binom{k}{2}} = \omega_k^{ER}(x). \tag{3.8}$$


Indeed, as shown in [8], the k-clique density of an Erdős-Rényi graph is $p^{\binom{k}{2}}$, where p is the linking probability.
To verify the validity of this result, we carried out simulations, repeating the procedure of the previous section but studying the trend of the rescaled k-clique density (for an overview of the numerical methods, see section C.2). To perform these simulations we generated the soft RGG using the Rayleigh fading activation function h^rayleigh with η = 2 and ξ = 1/2, therefore having:

$$\rho_2^{rayleigh}(r) = \exp\left[-\frac{1}{2}\left(\frac{(d\mu)^{\min\left(1,\frac{1}{p}\right)}}{r}\right)^{2}\right]. \tag{3.9}$$
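For reference, the limiting predictions of eqs. (3.8) and (3.9) are straightforward to evaluate; the sketch below (with function names of our own choosing) computes the leading-order Rayleigh-fading link probability and the corresponding Erdős-Rényi rescaled k-clique density.

```python
import numpy as np
from math import comb

def rho2_rayleigh(r, d, mu, p):
    """Leading-order link probability of eq. (3.9), Rayleigh fading with
    eta = 2 and xi = 1/2."""
    scale = (d * mu) ** min(1.0, 1.0 / p)
    return np.exp(-0.5 * (scale / np.asarray(r, dtype=float)) ** 2)

def omega_k_er(x, k):
    """Erdos-Renyi rescaled k-clique density of eq. (3.8)."""
    return np.asarray(x) ** comb(k, 2)

# illustrative check for uniform points in the unit hypercube with p = 1,
# where the single-coordinate mean distance is mu = 1/3 (cf. Table 4.1)
d, mu, p = 100, 1.0 / 3.0, 1
r = np.linspace(10.0, 80.0, 8)
x = rho2_rayleigh(r, d, mu, p)
print(omega_k_er(x, k=3))   # predicted 3-clique density at those radii
```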

As we can see in fig. 3.4, the convergence to the limit is very fast, and even at d = 10 our analytical prediction provides a good approximation of the simulated observables, that is, ω_k^soft follows the same trend as ω_k^ER. Furthermore, we observed that this behaviour does not change with the parameter p of the distance, even if a slight slowdown in convergence is observed for low values of p.

3.3 Clique density in high dimensional Hard RGGs: a non-Erdős-Rényi behaviour

Hard RGGs are very different from Soft RGGs: in fact, as we said in the previous section, the activation function in this case is boolean, therefore it is neither continuous nor differentiable. On the other hand, in this case we are free to rescale the cut-off radius. Indeed, for h^hard we observe that:

$$h_r^{hard}(x) = h_{r^p}^{hard}(x^p), \qquad h_r^{hard}(x) = h_{cr}^{hard}(cx) \quad \forall c \in \mathbb{R}^{+}. \tag{3.10}$$

We can therefore take advantage of this freedom to rewrite ρ_k^hard in a form convenient for calculating ω_k^hard, that is, to absorb all the parameters into a single variable. In fact, by carrying out a translation and



Figure 3.4: In the high dimension limit (d ≳ 10), the rescaled k-clique density of a soft RGG tends to that of an ERG. Furthermore, this behaviour does not seem to be influenced by the choice of the exponent of the distance p.
Left: the simulated ω_k^soft is plotted for systems with different dimensionality (triangles for d = 2, diamonds for d = 10 and circles for d = 100) and for different values of k (k = 3 in blue, k = 4 in dark green and k = 5 in light green). As we can see, the convergence to the limit is fast, and even at d = 10 the simulations are very close to the rescaled k-clique density of an ERG, ω_k^ER (in black). Note that for these simulations we used p = 2.
Right: in a 10-dimensional system we studied the trend of ω_k^soft as the parameter p of the distance varies (triangles for p = 0.5, diamonds for p = 1 and circles for p = 2). The choice of this parameter does not seem to influence the limit behaviour, even if there is a slight slowdown in convergence for the smallest value of p considered.

eliminating the root, we obtain that the integral reduces to:

$$\rho_k^{hard}(r) = \rho_k^{hard}\!\left(\frac{r^{\max(1,p)} - d\mu}{\sqrt{d}}\right), \tag{3.11}$$

with

$$\rho_k^{hard}(x) = \int dq\, \mathcal{N}(0,\Sigma)(q) \prod_{1\le i<j\le k} h_x^{hard}\!\left(q_{(i,j)}\right). \tag{3.12}$$



Figure 3.5: In the case of Hard RGGs, the rescaled k-clique density does not converge to ω^ER, even in the high dimension limit. The simulated ω_k^hard is plotted for systems with different dimensionality (triangles for d = 2, diamonds for d = 10 and circles for d = 100), for different values of k (k = 3 in blue, k = 4 in dark green and k = 5 in light green), and different choices of p (top: p = 2 on the left and p = 1 on the right; bottom: p = 0.5 on the left). In this case it is observed that ω_k^hard, in the case of high dimensional systems, does not converge to the ER limit, but to a slightly different limit function (solid line). The choice of p does not seem to be a determining factor for this convergence, although it is observed that the limit function tends to approach ω^ER as p decreases (bottom right, smaller p correspond to lighter colors).

Chapter 4

A RGGs approach to a Real-World dataset

4.1 Real-World dataset

As we said in the previous chapters, the purpose of this thesis is to study RGGs and their behaviour in the high dimension limit from a statistical point of view, keeping in mind their wide use in data science as a null model. In fact, we saw in Chapter 1 how it is possible to express a dataset as a high-dimensional geometric graph. Now that we have found the limit behaviour of high dimensional RGGs, we can check whether these predictions also apply to the geometric graphs created from real datasets.
Among all the possible Real-World datasets, i.e. the datasets obtained from real events or objects, there are some particularly simple ones that are used as a benchmark for many data science techniques, especially for artificial intelligence. In this thesis we decided to compare our predictions with 3 of these datasets.
The first, certainly the best known and most used, is the handwritten digit dataset known as the MNIST database (Modified National Institute of Standards and Technology database). This dataset contains 70 000 grayscale images written as 28 × 28 matrices whose entries are integers that identify the color intensity, from 0 (white) to 255 (intense black). This dataset is widely used [48, 36, 54] to test the predictive capabilities of neural networks. In fact, the set is divided into 60,000 training images and 10,000 testing images, and each of these images is associated with the written digit.
A dataset very similar to MNIST, but containing slightly more complex objects than handwritten numbers, is Fashion-MNIST. This dataset is made up of 70 000 images (28 × 28 pixels, in grayscale as MNIST) of Zalando's articles divided into 10 different categories of clothing, such as T-shirts, sweatshirts, bags, shoes and sandals. This dataset is also widely used [58, 49, 2]; in fact it was created specifically to overcome the critical issues of MNIST: it was observed that MNIST is too simple, overused and, most of all, not representative of modern computer vision tasks.
The last dataset we have chosen is even more complex and also widely used [34, 1, 10, 35]. This dataset, known as CIFAR-10 (Canadian Institute For Advanced Research), contains 60 000 color images of size 32 × 32 divided into 10 different categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. In this case each image is written as a 32 × 32 × 3 tensor, because each pixel is associated with 3 numbers, corresponding to the three colors of the RGB system. In order to compare the three datasets in a coherent way, we converted the images of this dataset to grayscale, thus obtaining that each image can be represented as a matrix like the others, where each pixel is associated with a single entry that can take values between 0 and 255.

4.1.1 Real-World vs random dataset

Having represented the elements of these datasets as high-dimensional vectors (784 for MNIST and Fashion-MNIST and 1024 for CIFAR-10), we can use them to create the corresponding geometric graphs. To do this, it is enough to consider these vectors as points in a metric space and link them using the hard activation function, calculating the distances between the points with our definition of the distance (eq. 2.8).
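A minimal sketch of this construction is shown below, assuming the p-distance convention implied by the max(1, p) and min(1, 1/p) exponents used in the text (no p-th root is taken for p < 1); the function names, the rescaling by 255 and the sub-sampling of the dataset are our own illustrative choices.

```python
import numpy as np

def p_distance_matrix(X, p=1.0):
    """Pairwise p-distances between the rows of X, without the p-th root
    for p < 1 (consistent with the min(1, 1/p) exponents in the text)."""
    diff = np.abs(X[:, None, :] - X[None, :, :]) ** p   # |x_i^k - x_j^k|^p
    s = diff.sum(axis=-1)                                # sum over coordinates
    return s ** (1.0 / p) if p >= 1 else s

def hard_geometric_graph(X, r, p=1.0):
    """Adjacency matrix of the hard geometric graph on the rows of X:
    two points are linked when their p-distance is at most r."""
    dist = p_distance_matrix(X, p)
    return (dist <= r) & ~np.eye(len(X), dtype=bool)     # no self-loops

# hypothetical usage on a dataset rescaled to the unit hypercube
# X = mnist_vectors / 255.0            # shape (N, 784), not defined here
# A = hard_geometric_graph(X[:1000], r=80.0, p=1.0)
```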



Figure 4.1: Matrix form of some of the images contained in the various datasets. We show the matrices relating to various elements of the 3 datasets, colouring the pixels with darker colors when they have a higher entry.

Using these geometric graphs we carried out simulations to see the trend of the rescaled k-clique density. As we can see in fig. 4.2, the scaling of the rescaled k-clique densities of the three datasets is very similar in each case.



Figure 4.2: The trend of the rescaled k-clique density of the three datasets is very similar in each case, but it follows neither the trend of the hard RGGs nor that of the ER graph. We plotted the three datasets (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) as k (k = 3 in blue, k = 4 in dark green and k = 5 in light green) and p (p = 1 on the left and p = 2 on the right) change, and it is possible to see how the behaviour of the datasets is very different from those seen for soft and hard RGGs.

It is also evident that none of them follows the limit behaviour seen in the case of Hard and Soft RGGs, suggesting that the internal order of the datasets does not allow convergence to the random case.

4.1.2 Moments of Real-World dataset

With the aim of extracting further information from the datasets and trying to understand why they do not converge to the random case, we decided to extract the µ, α and β parameters from the datasets. These parameters are the moments of a one-dimensional probability distribution, as discussed in section 2.3. In order to extract them we calculated the following expectation values:


• for µ, which is the mean value of the distance, we used:

$$\mu = \mathbb{E}\left[\mathrm{dist}_p(x_i^k, x_j^k)^{\max(1,p)}\right]_{(i,j),\,k} = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \frac{1}{d} \sum_{k=1}^{d} \left|x_i^k - x_j^k\right|^{p} \tag{4.1}$$

• for α, which is the variance of the distance, we used:

$$\alpha = \mathbb{E}\left[\frac{1}{d}\left(\mathrm{dist}_p(x_i, x_j)^{\max(1,p)} - d\mu\right)^{2}\right]_{(i,j),\,k} = \frac{1}{\binom{N}{2}} \sum_{1\le i<j\le N} \frac{1}{d} \left(\sum_{k=1}^{d}\left|x_i^k - x_j^k\right|^{p} - d\mu\right)^{2} \tag{4.2}$$

• finally, for β, which represents the correlations between distances, we used:

$$\beta = \mathbb{E}\left[\frac{1}{d}\left(\mathrm{dist}_p(x_i, x_j)^{\max(1,p)} - d\mu\right)\left(\mathrm{dist}_p(x_i, x_l)^{\max(1,p)} - d\mu\right)\right]_{(i,j,l),\,k} = \frac{1}{\binom{N}{3}} \sum_{1\le i<j<l\le N} \frac{1}{d} \left(\sum_{k=1}^{d}\left|x_i^k - x_j^k\right|^{p} - d\mu\right)\left(\sum_{k=1}^{d}\left|x_i^k - x_l^k\right|^{p} - d\mu\right) \tag{4.3}$$

In these equations E[•]_{(i,j),k} indicates the expectation value computed over the pairs of points and over the dimensions, while the distance evaluated on single coordinates, as in eq. (4.1), denotes the distance taken along that coordinate. In order to compare them with what we saw in the previous chapters, we rescaled the datasets to the d-dimensional cube with unit edge. The extracted parameters can be found in Table 4.1. As we can see in table 4.1, the averages µ of the datasets are about half of the random case, while the other moments are much larger. As we saw in the previous section, using the moments related to the hypercube leads to a poor level of approximation.
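The following sketch illustrates how the estimators of eqs. (4.1)-(4.3) can be evaluated in practice; the Monte Carlo sub-sampling of pairs and triples is only a computational shortcut of ours, not necessarily the procedure used to produce Table 4.1.

```python
import numpy as np

def dataset_moments(X, p=1.0, n_pairs=20000, n_triples=20000, seed=0):
    """Monte Carlo estimate of mu, alpha and beta (eqs. 4.1-4.3) for a
    dataset X of shape (N, d), assumed rescaled to the unit hypercube."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    def s(i, j):               # sum_k |x_i^k - x_j^k|^p = dist_p^{max(1,p)}
        return np.sum(np.abs(X[i] - X[j]) ** p)

    pairs = rng.choice(N, size=(n_pairs, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]
    S = np.array([s(i, j) for i, j in pairs])
    mu = S.mean() / d
    alpha = np.mean((S - d * mu) ** 2) / d

    triples = rng.choice(N, size=(n_triples, 3))
    ok = ((triples[:, 0] != triples[:, 1]) & (triples[:, 0] != triples[:, 2])
          & (triples[:, 1] != triples[:, 2]))
    beta = np.mean([(s(i, j) - d * mu) * (s(i, l) - d * mu)
                    for i, j, l in triples[ok]]) / d
    return mu, alpha, beta

# hypothetical usage: mu, alpha, beta = dataset_moments(X, p=1.0)
```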


Dataset               d      µ       α       β        µ_d/µ_t   α_d/α_t   β_d/β_t
Random point (teo)    ind.   1/3     1/18    1/180    1         1         1
Random point (sim)    1024   0.333   0.056   0.0057   1.000     0.999     1.018
MNIST                 784    0.167   1.198   0.330    0.501     21.57     59.43
Fashion-MNIST         784    0.278   5.627   0.695    0.834     101.3     125.2
CIFAR-10              1024   0.136   1.593   0.444    0.409     28.67     79.96

Table 4.1: Through a simulation, we extracted the moments µ, α and β from the three datasets described above and compared them with the random case obtained under the same conditions (high-dimensional hypercube with unit edge). By comparing the random simulation with the datasets the differences are evident: although the averages µ of the datasets are of the same order as the value seen for the random case (although all smaller), we cannot say the same of the other moments. In fact, α is at least 20 times larger than the theoretical value, and β at least 60 times larger. Note: for these simulations we used p = 1.

Now we can compare the k-clique density of the datasets with our theoretical prediction, using however the parameters extracted from the datasets themselves. Let us start with the case k = 2, which is the simplest. In fact, in chapter 3 we calculated an analytical form for ρ_2:

$$\rho_2^{hard}(r) = \frac{1}{2}\left[1 + \mathrm{Erf}\!\left(\frac{r^{\max(1,p)} - d\mu}{\sqrt{2d\alpha}}\right)\right] \tag{4.4}$$

We can now compare this theoretical result, using the parameters extracted from the dataset, with a simulation made starting from the dataset itself. As we can see in figure 4.3, the behaviour of the simulated ρ_2^data is very similar to what is predicted by our theory, for all three datasets. Therefore, despite the very different values of µ_data and α_data with respect to µ_cube and α_cube, our theoretical result is, in first approximation, able to predict the trend of the 2-clique density when the parameters extracted from the datasets are used.
The next step is to repeat the same analysis in the case k > 2. Although we do not have access to an analytical form for ω_k with k > 2, we can proceed as in the previous chapter and numerically solve the integral (see section C.3).
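Evaluating eq. (4.4) with the extracted moments is immediate; a small sketch follows, where the numerical values of d, µ and α are taken from Table 4.1 (MNIST, p = 1) and the radius grid is arbitrary.

```python
import numpy as np
from scipy.special import erf

def rho2_hard(r, d, mu, alpha, p=1.0):
    """2-clique (link) density of eq. (4.4) for a hard RGG."""
    z = (np.asarray(r, dtype=float) ** max(1.0, p) - d * mu) / np.sqrt(2.0 * d * alpha)
    return 0.5 * (1.0 + erf(z))

# illustrative values from Table 4.1 (MNIST, p = 1); the radii are arbitrary
d, mu, alpha = 784, 0.167, 1.198
print(rho2_hard(np.linspace(50.0, 250.0, 5), d, mu, alpha, p=1.0))
```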



Figure 4.3: Using the moments µ_data and α_data extracted from the dataset, our prediction for the 2-clique density improves. The simulated 2-clique density ρ_2^data of the datasets (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) is compared with our theoretical result ρ_2^teo(µ_data, α_data). We can see how for all three datasets our theoretical result provides a good approximation. (Bottom right:) the same comparison is made for ω_2^data and ω_2^teo(µ_data, α_data).



Figure 4.4: In the case k > 2, using the parameters extracted from the dataset does not significantly improve the accuracy of our theoretical result. For all three datasets and for various k (k = 3 in blue, k = 4 in dark green and k = 5 in light green) we compared ω_k^data (triangles for MNIST, diamonds for Fashion-MNIST and circles for CIFAR-10) with ω_k^teo(µ_cube, α_cube, β_cube) (solid line) and ω_k^teo(µ_data, α_data, β_data) (dashed line). Neither of our predictions is able to reproduce the real trend of the datasets.


As we can see in fig. 4.4, using the moments extracted from the datasets does not change the situation much. It is true that the prediction is better, but it is still not able to describe the real trend of ω_k^data. Looking for a reason for this behaviour, which is observed for k > 2 and not for k = 2, we can say that our result is probably able to provide a good approximation only when α and β are small. This is because a system in which these two quantities take particularly high values is necessarily strongly correlated. Consequently, in the proof of the CLT, stopping at the second order of the expansion of the exponential is not sufficient, because the terms of order 3 are not negligible.

Conclusion.

In this thesis we developed a central limit theorem for distances and applied it to random geometric graphs in high dimension. This allowed us to develop theoretical techniques capable of making predictions for the observables used to characterize geometric graphs. Subsequently these techniques were specialized to Hard RGGs and Soft RGGs, showing their different limit behaviours. The reliability of both the theorem and the developed techniques has been confirmed by extensive simulations. Finally, these techniques were applied to real-world datasets and their limits were quantified. Furthermore, this thesis work also shows how RGGs are intrinsically different from the geometric graphs generated by Real-World datasets: we have observed that the internal order of the latter does not make the distances concentrate even at high dimension. This thesis nevertheless leaves open some questions to be developed in the future:

• as we suggested at the end of Chapter 4, the fact that our technique is not able to provide a good approximation for the k-clique density in the case of datasets can be due to a premature truncation in the central limit theorem. It would be interesting to find the next-order correction and verify whether our approximations improve;

• in this work we have focused on the density of k-cliques, but we could extend our results also to other observables of the graphs;

• it would be interesting to extend the analyses made in Chapter 4 to other datasets (above all, these results can be useful for datasets


present in sociology, biology and physics);

• we could apply these results to some recently developed machine learning techniques, which exploit the concept of distance and the search for nearest neighbors.

Appendix A

Relationships between α, β and γ and eigenvalues

We want to extract a relationship linking the three parameters of the covariance matrix. To do this we will look for the conditions under which the eigenvalues are non-negative, knowing that this must always be true for a covariance matrix, as stated in [17]. Keep in mind that (α, β) > 0 and γ ≥ 0.

• $\vartheta_1 = \alpha + 2(N-2)\beta + \frac{(N-2)(N-3)}{2}\gamma$. This eigenvalue is always positive because N ≥ 2.

• $\vartheta_2 = \alpha + (N-4)\beta - (N-3)\gamma$. This eigenvalue exists only if N ≥ 3. If N = 3, we obtain β ≤ α. In the other cases, we consider the generic form and build a system of inequalities with the third eigenvalue.

• $\vartheta_3 = \alpha - 2\beta + \gamma$. This one exists only if N ≥ 4. The system between ϑ_2 and ϑ_3 is:

$$\begin{cases} \alpha + (N-4)\beta - (N-3)\gamma \ge 0 \\ \alpha - 2\beta + \gamma \ge 0 \end{cases}$$


We add N times the second inequality to the first one.

$$\begin{cases} (1+N)\alpha - (N+4)\beta + 3\gamma \ge 0 \\ \alpha - 2\beta + \gamma \ge 0 \end{cases}$$

We then solve both inequalities for β.

$$\begin{cases} \beta \le \dfrac{3\gamma + (1+N)\alpha}{N+4} \\[2mm] \beta \le \dfrac{\gamma + \alpha}{2} \end{cases}$$

The next step is to find when the first condition is less restrictive than the second one:

$$\frac{\gamma + \alpha}{2} \le \frac{3\gamma + (1+N)\alpha}{N+4} \tag{A.1}$$

$$(N+4)(\alpha + \gamma) \le 6\gamma + 2(1+N)\alpha \tag{A.2}$$
$$(N-2)\gamma \le (N-2)\alpha \tag{A.3}$$

We obtain that, if α ≥ γ, the only necessary condition for the positivity of the eigenvalues is:

$$\beta \le \frac{\gamma + \alpha}{2} \tag{A.4}$$

This condition also holds for N = 3, because it is more restrictive than the one found before. We note that there is no need to consider the case α ≤ γ (i.e. when the first condition is more restrictive than the second one in eq. A.1), because in all the real cases γ = 0 and we have imposed α > 0.

Knowing that the covariance matrix is positive semi-definite, we can therefore say that for all real systems (in which γ is zero) it holds that α ≥ 2β.
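The three eigenvalues used above can be checked numerically by building the pair covariance matrix with the structure described in section 2.3 (α on the diagonal, β for pairs of pairs sharing a node, γ for disjoint pairs); the parameter values in the sketch below are arbitrary.

```python
import numpy as np
from itertools import combinations

def pair_covariance(N, alpha, beta, gamma):
    """Covariance over the N(N-1)/2 pair variables: alpha on the diagonal,
    beta for pairs sharing one node, gamma for disjoint pairs."""
    pairs = list(combinations(range(N), 2))
    M = len(pairs)
    S = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            shared = len({i, j} & {k, l})
            S[a, b] = alpha if shared == 2 else beta if shared == 1 else gamma
    return S

# numerical check of the three eigenvalues quoted above (arbitrary values)
N, alpha, beta, gamma = 6, 1.0, 0.3, 0.1
eig = np.linalg.eigvalsh(pair_covariance(N, alpha, beta, gamma))
theta1 = alpha + 2 * (N - 2) * beta + (N - 2) * (N - 3) / 2 * gamma
theta2 = alpha + (N - 4) * beta - (N - 3) * gamma
theta3 = alpha - 2 * beta + gamma
print(sorted(set(np.round(eig, 10))), (theta3, theta2, theta1))
```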

Appendix B

Hubbard-Stratonovich transformation

Note that if one wants to use our distribution of distances to compute expectation values, one has to perform (N−1)(N−2)/2 integrals, which can be heavy even for a computer. In this section we look for a transformation that reduces this number, by identifying the true degrees of freedom of the system, which are not explicit at the moment because we have not fully exploited the symmetries. To do this we must first remove all the symmetries that we have included in Σ and rewrite the exponential as:

$$\tau \sum_{i,j}^{N} q_{i,j}^{2} + \eta \sum_{i}^{N}\Big(\sum_{j}^{N} q_{i,j}\Big)^{2} + \tilde{\gamma}\Big(\sum_{i,j}^{N} q_{i,j}\Big)^{2} \tag{B.1}$$

where $\tau = \frac{\bar{\alpha} - 2\bar{\beta} + \bar{\gamma}}{2}$, $\eta = \bar{\beta} - \bar{\gamma}$, $\tilde{\gamma} = \bar{\gamma}/2$, and θ is an overall scale factor depending on d and N. The overlined parameters are the inverted parameters, given by the system of equations in the previous section. Now we can write the formula for the expectation value in a smarter way, adding Dirac deltas to restore the symmetries that we have just removed from the exponential:


$$
\begin{aligned}
\Big\langle \prod_{i<j} f_{i,j} \Big\rangle
={}& \mathcal{N} \int \Big(\prod_{i,j} dq_{i,j}\, f_{i,j}\Big) \prod_{i}\delta(q_{i,i}) \prod_{i>j}\delta(q_{i,j}-q_{j,i})\,
e^{-\frac{1}{2\theta}\big(\tau \sum_{i,j}^{N} q_{i,j}^{2} + \eta \sum_{i}^{N}(\sum_{j}^{N} q_{i,j})^{2} + \tilde{\gamma}(\sum_{i,j}^{N} q_{i,j})^{2}\big)} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i \int dz \int \Big(\prod_{i,j} dq_{i,j}\, f_{i,j}\Big) \prod_{i}\delta(q_{i,i}) \prod_{i>j}\delta(q_{i,j}-q_{j,i}) \\
&\quad\times e^{-\frac{\tau}{2\theta}\sum_{i,j}^{N} q_{i,j}^{2}}\;
e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2} - i\sum_{i}^{N} y_i \sum_{j}^{N} q_{i,j}}\;
e^{-\frac{\theta}{2\tilde{\gamma}} z^{2} - i z \sum_{i,j}^{N} q_{i,j}} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i\, e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2}} \int dz\, e^{-\frac{\theta}{2\tilde{\gamma}} z^{2}}
\int \prod_{i<j} dq_{i,j}\, f_{i,j}\;
e^{-\frac{\tau}{\theta} q_{i,j}^{2} - \big(\sqrt{-\operatorname{sign}[\eta]}\, y_i + \sqrt{-\operatorname{sign}[\eta]}\, y_j + \sqrt{-\operatorname{sign}[\tilde{\gamma}]}\, z\big) q_{i,j}} \\
={}& \mathcal{N}\,\mathcal{N}_y\,\mathcal{N}_z \int \prod_{i} dy_i\, e^{-\frac{\theta}{2\eta}\sum_{i}^{N} y_i^{2}} \int dz\, e^{-\frac{\theta}{2\tilde{\gamma}} z^{2}}
\prod_{i<j}\bigg(\int dq\, f_{i,j}\;
e^{-\frac{\tau}{\theta} q^{2} - \big(\sqrt{-\operatorname{sign}[\eta]}\, y_i + \sqrt{-\operatorname{sign}[\eta]}\, y_j + \sqrt{-\operatorname{sign}[\tilde{\gamma}]}\, z\big) q}\bigg)
\end{aligned}
\tag{B.2}
$$


where $\mathcal{N} = \left[(2\pi)^{\binom{N}{2}} \det(\Sigma)\right]^{-\frac{1}{2}}$, $\mathcal{N}_y = \left[2\pi\,\frac{\eta}{\theta}\right]^{-\frac{N}{2}}$ and $\mathcal{N}_z = \left[2\pi\,\frac{\tilde{\gamma}}{\theta}\right]^{-\frac{1}{2}}$. Here we have used the Hubbard-Stratonovich transformation:

$$e^{-\operatorname{sign}(a)\frac{|a|\,x^{2}}{2}} = \frac{1}{\sqrt{2\pi |a|}} \int dy\, e^{-\frac{y^{2}}{2|a|} - \sqrt{-\operatorname{sign}[a]}\, x y} \tag{B.3}$$
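Equation (B.3) can be verified numerically for both signs of a; in the following sketch the values of a and x are arbitrary, and for a > 0 the real and imaginary parts of the oscillatory integrand are integrated separately.

```python
import numpy as np
from scipy.integrate import quad

def hs_lhs(x, a):
    return np.exp(-np.sign(a) * abs(a) * x ** 2 / 2)

def hs_rhs(x, a):
    """Right-hand side of eq. (B.3), integrated numerically."""
    coupling = np.sqrt(complex(-np.sign(a)))          # i for a > 0, 1 for a < 0
    f = lambda y: np.exp(-y ** 2 / (2 * abs(a)) - coupling * x * y)
    re, _ = quad(lambda y: f(y).real, -np.inf, np.inf)
    im, _ = quad(lambda y: f(y).imag, -np.inf, np.inf)
    return (re + 1j * im) / np.sqrt(2 * np.pi * abs(a))

for a in (-0.7, 0.9):                                  # arbitrary test values
    print(a, hs_lhs(1.3, a), hs_rhs(1.3, a))
```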

Using this transformation we have reduced the total number of integrals from (N−1)(N−2)/2 to N + 1. This shows that the actual number of degrees of freedom is lower than we would have expected. Unfortunately, the result found in this section is not very useful from a computational point of view. In fact, while (imposing γ = 0):

$$\eta = \bar{\beta} - \bar{\gamma} = \frac{\beta}{(2\beta - \alpha)\,(\alpha + \beta(N-4))} \tag{B.4}$$

is always negative (we found in the previous section that α > 2β), we have that:

$$\tilde{\gamma} = \bar{\gamma}/2 = \frac{4\beta^{2}}{(\alpha - 2\beta)\,(\alpha + \beta(N-4))\,(\alpha + 2\beta(N-2))} \tag{B.5}$$

is always positive. This can be problematic from the computational point of view, because it brings out imaginary units in the first integral.

Appendix C

Numerical methods

C.1 Multivariate Gaussian integration

To numerically evaluate the integrals of Equation 3.2, we implemented the algorithm described in [25], allowing very fast run times for the small values of k (k ≲ 10) we were interested in; notice that the dimension of the integral is already of order 10^2 for k = 10. Higher values of k would require finer techniques.
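As an illustration, for a hard activation the Gaussian integral defining the k-clique density (eq. 3.12) is a multivariate normal CDF with the structured covariance of section 2.3, which can be evaluated with Genz-type routines; the sketch below uses scipy's CDF implementation and assumes the boolean activation is the step function Θ(x − q). The parameter values are illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def rho_k_hard_rescaled(x, k, alpha, beta, gamma=0.0):
    """Rescaled hard-RGG k-clique density: the probability that all the
    k(k-1)/2 Gaussian pair variables lie below x, i.e. a multivariate
    normal CDF with the structured covariance (alpha, beta, gamma)."""
    pairs = list(combinations(range(k), 2))
    M = len(pairs)
    cov = np.empty((M, M))
    for a, (i, j) in enumerate(pairs):
        for b, (u, v) in enumerate(pairs):
            shared = len({i, j} & {u, v})
            cov[a, b] = alpha if shared == 2 else beta if shared == 1 else gamma
    mvn = multivariate_normal(mean=np.zeros(M), cov=cov)
    return mvn.cdf(np.full(M, x))

# illustrative values: uniform points in the unit hypercube with p = 1
print(rho_k_hard_rescaled(x=0.1, k=4, alpha=1/18, beta=1/180))
```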

C.2 Simulations for Hard RGGs

To compute the density of k-cliques in simulated hard RGGs, we implemented a simple random sampling procedure. For each realization of the nodes (∼ 10^4), we extracted ∼ 10^5 k-tuples of nodes, computing the minimum cut-off distance at which they formed a clique. The cumulative distribution of the minimal distances obtained, averaged over different realizations of the nodes, reconstructs ρ_k^hard(r). We noticed that as N grows, the last average is well approximated by a single realization of the nodes, suggesting a self-averaging property for the density of k-cliques; in practice, not averaging does not affect the results of the simulations.
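A single-realization version of this procedure might look as follows; array sizes, the uniform point distribution and the function name are our own choices for illustration.

```python
import numpy as np

def hard_clique_density_curve(N, d, k, r_grid, n_tuples=10_000, p=1.0, seed=0):
    """Empirical rho_k(r) for one realization of a hard RGG on uniform points:
    for each sampled k-tuple we record the smallest cutoff at which it becomes
    a clique (its largest pairwise distance), and return the empirical CDF of
    these values evaluated on r_grid."""
    rng = np.random.default_rng(seed)
    X = rng.random((N, d))                       # uniform points in the hypercube
    iu = np.triu_indices(k, 1)
    min_cutoff = np.empty(n_tuples)
    for t in range(n_tuples):
        P = X[rng.choice(N, size=k, replace=False)]
        s = (np.abs(P[:, None, :] - P[None, :, :]) ** p).sum(axis=-1)
        dist = s ** (1.0 / p) if p >= 1 else s
        min_cutoff[t] = dist[iu].max()           # largest pairwise distance
    return np.array([(min_cutoff <= r).mean() for r in r_grid])

# rho5 = hard_clique_density_curve(N=1000, d=100, k=5,
#                                  r_grid=np.linspace(0.0, 40.0, 200))
```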


C.3 Simulations for Soft RGGs

To compute the density of cliques in simulated soft RGGs with a generic activation function, we implemented a different random sampling procedure. This time, for each realization of the nodes (∼ 10^4) and for a fixed radius r, we counted how many of ∼ 10^4 k-tuples of nodes form a k-clique, considering each of them to be a k-clique with probability

$$\prod_{1\le i<j\le k} h_r\!\left(\mathrm{dist}(x_i, x_j)\right). \tag{C.1}$$
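A minimal sketch of this estimator is given below, assuming the Rayleigh fading form h_r(x) = exp(−ξ (x/r)^η) used in Chapter 3; the default parameter values and the function name are ours.

```python
import numpy as np

def soft_clique_density(X, k, r, n_tuples=10_000, eta=2.0, xi=0.5, p=2.0, seed=0):
    """Monte Carlo estimate of rho_k(r) for a soft RGG on the points X,
    using the Rayleigh fading activation h_r(x) = exp(-xi * (x / r)**eta)
    and the clique probability of eq. (C.1)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    iu = np.triu_indices(k, 1)
    acc = 0.0
    for _ in range(n_tuples):
        P = X[rng.choice(N, size=k, replace=False)]
        s = (np.abs(P[:, None, :] - P[None, :, :]) ** p).sum(axis=-1)
        dist = s ** (1.0 / p) if p >= 1 else s
        acc += np.prod(np.exp(-xi * (dist[iu] / r) ** eta))  # prob. of a clique
    return acc / n_tuples

# rho3 = soft_clique_density(points, k=3, r=5.0)   # points: (N, d) array, not defined here
```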

Bibliography

[1] Yehya Abouelnaga, Ola S Ali, Hager Rady, and Mohamed Moustafa. Cifar-10: Knn-based ensemble of classifiers. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pages 1192–1195. IEEE, 2016.

[2] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.

[3] Richard D Alba. A graph-theoretic definition of a sociometric clique. Journal of Mathematical Sociology, 3(1):113–126, 1973.

[4] S. A. Aldosari and J. M. F. Moura. Distributed detection in sensor networks: Connectivity graph and small world networks. In Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, pages 230–234, Oct 2005.

[5] Konstantin Avrachenkov and Andrei Bobu. Cliques in high-dimensional random geometric graphs. In COMPLEX NETWORKS 2019 - 8th International Conference on Complex Networks and Their Applications, Lisbon, Portugal, December 2019.

[6] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.


[7] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In International Conference on Database Theory, pages 217–235. Springer, 1999.

[8] Béla Bollobás and Paul Erdős. Cliques in random graphs. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 80, pages 419–427. Cambridge University Press, 1976.

[9] Sébastien Bubeck, Jian Ding, Ronen Eldan, and Miklós Z. Rácz. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms, 49(3):503–532, 2016.

[10] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[11] Jesper Dall and Michael Christensen. Random geometric graphs. Physical Review E, 66(1):016121, 2002.

[12] William HE Day and David Sankoff. Computational complexity of inferring phylogenies by compatibility. Systematic Biology, 35(2):224–229, 1986.

[13] Imre Derényi, Gergely Palla, and Tamás Vicsek. Clique percolation in random networks. Physical Review Letters, 94(16):160202, 2005.

[14] J Devillers and JC Doré. Heuristic potency of the minimum spanning tree (MST) method in toxicology. Ecotoxicology and Environmental Safety, 17(2):227–235, 1989.

[15] Luc Devroye, András György, Gábor Lugosi, and Frederic Udina. High-dimensional random geometric graphs and their clique number. Electron. J. Probab., 16:2481–2508, 2011.

[16] Josep Díaz, Dieter Mitsche, and Xavier Pérez. Sharp threshold for hamiltonicity of random geometric graphs. SIAM Journal on Discrete Mathematics, 21(1):57–65, 2007.


[17] Chuong B Do. The multivariate gaussian distribution. Section Notes, Lecture on Machine Learning, CS 229, 2008.

[18] Jingbo Dong, Qing Chen, and Zhisheng Niu. Random graph theory based connectivity analysis in wireless sensor networks with rayleigh fading channels. In 2007 Asia-Pacific Conference on Communications, pages 123–126. IEEE, 2007.

[19] Patrick Doreian and Katherine L Woodard. Defining and locating cores and boundaries of social networks. Social Networks, 16(4):267–293, 1994.

[20] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.

[21] Yuval Filmus. An orthogonal basis for functions over a slice of the boolean hypercube. The Electronic Journal of Combinatorics, 23(1):P1–23, 2016.

[22] Damien Francois, Vincent Wertz, and Michel Verleysen. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19(7):873–886, 2007.

[23] Alan M Frieze. On the value of a random minimum spanning tree problem. Discrete Applied Mathematics, 10(1):47–56, 1985.

[24] Michael R Garey and David S Johnson. Computers and Intractability, volume 174. Freeman, San Francisco, 1979.

[25] Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2):141–149, 1992.

[26] Edgar N Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.

[27] Edward N Gilbert. Random plane networks. Journal of the Society for Industrial and Applied Mathematics, 9(4):533–543, 1961.


[28] Alexander P Giles, Orestis Georgiou, and Carl P Dettmann. Betweenness in dense random geometric networks. In 2015 IEEE International Conference on Communications (ICC), pages 6450–6455. IEEE, 2015.

[29] Ronald L Graham and Pavol Hell. On the history of the minimum spanning tree problem. Annals of the History of Computing, 7(1):43–57, 1985.

[30] Piyush Gupta and P. R. Kumar. Critical Power for Asymptotic Connectivity in Wireless Networks, pages 547–566. Birkhäuser Boston, Boston, MA, 1999.

[31] Chikio Hayashi, Keiji Yajima, Hans H Bock, Noboru Ohsumi, Yutaka Tanaka, and Yasumasa Baba. Data Science, Classification, and Related Methods: Proceedings of the Fifth Conference of the International Federation of Classification Societies (IFCS-96), Kobe, Japan, March 27–30, 1996. Springer Science & Business Media, 2013.

[32] Robert Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. 1994.

[33] Alexander P Kartun-Giles, Marc Barthelemy, and Carl P Dettmann. Shape of shortest paths in random spatial networks. Physical Review E, 100(3):032315, 2019.

[34] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010.

[35] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[36] Ernst Kussul and Tatiana Baidyk. Improved method of handwritten digit recognition tested on mnist database. Image and Vision Computing, 22(12):971–981, 2004.

[37] R. Duncan Luce and Albert D. Perry. A method of matrix analysis of group structure. Psychometrika, 14:95–116, 1949.


[38] Henrik Ohlsson, Oscar Gustafsson, and Lars Wanhammar. Implementation of low complexity fir filters using a minimum spanning tree. In Proceedings of the 12th IEEE Mediterranean Electrotechnical Conference (IEEE Cat. No. 04CH37521), volume 1, pages 261–264. IEEE, 2004.

[39] Gergely Palla, Imre Derényi, and Tamás Vicsek. The critical point of k-clique percolation in the erdős–rényi graph. Journal of Statistical Physics, 128(1-2):219–227, 2007.

[40] Marvin C Paull and Stephen H Unger. Minimizing the number of states in incompletely specified sequential switching functions. IRE Transactions on Electronic Computers, (3):356–367, 1959.

[41] Marvin C Paull and Stephen H Unger. Minimizing the number of states in incompletely specified sequential switching functions. IRE Transactions on Electronic Computers, (3):356–367, 1959.

[42] Edmund R Peay. Hierarchical clique structures. Sociometry, pages 54–65, 1974.

[43] Mathew Penrose et al. Random geometric graphs, volume 5. Oxford university press, 2003.

[44] Vladimir Pestov. Is the k-nn classifier in high dimensions affected by the curse of dimensionality? Computers & Mathematics with Applications, 65(10):1427–1437, 2013.

[45] Zvi Prihar. Topological properties of telecommunication networks. Pro- ceedings of the IRE, 44(7):927–933, 1956.

[46] Nicholas Rhodes, Peter Willett, Alain Calvet, James B Dunbar, and Christine Humblet. Clip: similarity searching of 3d databases using clique detection. Journal of Chemical Information and Computer Sciences, 43(2):443–448, 2003.


[47] Ram Samudrala and John Moult. A graph-theoretic algorithm for comparative modeling of protein structure. Journal of Molecular Biology, 279(1):287–302, 1998.

[48] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on mnist. arXiv preprint arXiv:1805.09190, 2018.

[49] Yian Seo and Kyung-shik Shin. Hierarchical convolutional neural networks for fashion image classification. Expert Systems with Applications, 116:328–339, 2019.

[50] P Sneath. A computer approach to numerical taxonomy. J Gen Microbiol, 17:201–226, 1957.

[51] Victor Spirin and Leonid A Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the national Academy of sciences, 100(21):12123–12128, 2003.

[52] G. Sugihara. Graph theory, homology and food webs. Proc. Symp. App. Math., 30:83–101, 1984.

[53] Minsoo Suk and Ohyoung Song. Curvilinear feature extraction using minimum spanning trees. Computer Vision, Graphics, and Image Processing, 26(3):400–411, 1984.

[54] Siham Tabik, Daniel Peralta, Andrés Herrera-Poyatos, and Francisco Herrera. A snapshot of image pre-processing for convolutional neural networks: case study of mnist. International Journal of Computational Intelligence Systems, 10(1):555–568, 2017.

[55] Amos Tanay, Roded Sharan, and Ron Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(suppl_1):S136–S144, 2002.

[56] Ivan Tyukin. Blessing of dimensionality: Mathematical foundations of the statistical physics of data. Philosophical Transactions of The


Royal Society A Mathematical Physical and Engineering Sciences, 376, 03 2018.

[57] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1 edition, 1998.

[58] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[59] Ying Xu, Victor Olman, and Dong Xu. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18(4):536–545, 2002.
