This electronic thesis or dissertation has been downloaded from the King’s Research Portal at https://kclpure.kcl.ac.uk/portal/
Quantitative semantics and graph theory as a framework for complex systems modeling
Gramatica, Ruggero
Awarding institution: King's College London
The copyright of this thesis rests with the author and no quotation from it or information derived from it may be published without proper acknowledgement.
King’s College London
Doctoral Thesis
Quantitative semantics and graph theory as a framework for complex systems modeling
Author: Ruggero Gramatica
Supervisor: Prof. Tiziana Di Matteo
A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics
Department of Mathematics
December 2015
Published Articles
I list here the peer-reviewed articles that I have authored or co-authored in the course of my PhD and whose material I have used in this thesis. The articles are listed in chronological order of appearance and can be reviewed in Annex 1:
1. R. Morales, T. Di Matteo, R. Gramatica and T. Aste, Dynamical Generalized Hurst Exponent as a Tool to Monitor Unstable Periods in Financial Time Series, Physica A 391 (2012) 3180-3189.
2. T. Aste, R. Gramatica and T. Di Matteo, Exploring complex networks via topological embedding on surfaces, Physical Review E 86 (2012) 036109.
3. T. Aste, R. Gramatica and T. Di Matteo, Random and frozen states in complex triangulations, Philosophical Magazine 92 (1-3) (2012) 244-254.
4. R. Gramatica, T. Di Matteo, S. Giorgetti, M. Barbiani, D. Bevec and T. Aste, Graph Theory Enables Drug Repurposing - How a Mathematical Model Can Drive the Discovery of Hidden Mechanisms of Action, PLoS ONE 9 (1) (2014) e84912.
5. R. Gramatica, H. Dindo, T. Di Matteo and T. Aste, A quantitative semantic and graph theoretical approach for the analysis of financial and economic unstructured data, in preparation (2015) - see Chapter 6.
Abstract
The study of Complex Systems focuses on how interactions among the constituents of a system, individually or grouped into clusters, produce behavioural patterns locally or globally, and on how these patterns interact with the external environment. Over the last few decades the study of Complex Systems has attracted growing interest and today, given sufficiently large data sets, we are able to construct comprehensive models describing the emerging characteristics and properties of complex phenomena across the domains of the physical, biological and social sciences.
Network theory, in particular, has proven well suited to describing static and dynamical correlations in complex data sets because of its ability to deal not only with deterministic quantities but also with probabilistic methods. A complex system is generally an open system, flexible in adapting to variable external conditions: it exchanges information with the environment and adjusts its internal structure in a process of self-organization. Moreover, it has been shown that real-world phenomena represented by complex systems display interesting statistical properties such as power-law distributions, long-range interactions, scale invariance, criticality, multifractality and hierarchical structure.
In the era of big data, where a large effort goes into collecting data sets carrying relevant information about the phenomena to be studied and analysed, the field of quantitative semantics, i.e. the quantitative treatment of information expressed in natural language, is becoming increasingly relevant, particularly in the social sciences. Recent studies, however, are expanding these techniques into a tool for structuring and organising information across a number of disparate disciplines.
In this Thesis I propose a methodology that (i) extracts a structured complex data set from large corpora of descriptive language sources, efficiently exploiting quantitative semantics techniques to map the essence of a complex phenomenon into a network representation, and (ii) combines the induced knowledge network with a graph-theoretical framework, using a number of graph theory tools to study the emerging properties of complex systems. Thus, leveraging developments in Computational Linguistics and Network Theory, the proposed approach builds a graph representation of knowledge, which is analysed with the aim of observing correlations between any two nodes or across clusters of nodes, and highlights emerging properties by means of both topological structure analysis and dynamic evolution, i.e. the change in connectivity. Under this framework I provide two real-world applications:
- The first application deals with the creation of a structured network of biomedical concepts starting from an unstructured corpus of biological text-based data (peer-reviewed articles); it then retrieves known pathophysiological Modes of Action by applying a stochastic random-walk measure and finds new ones by meaningfully selecting and aggregating contributions from known bio-molecular interactions. By exploiting the proposed graph-theoretic model, this approach has proven an innovative way to find emergent mechanisms of action aimed at drug repurposing, where existing biologic compounds originally intended for certain pathophysiologic actions are redirected to treat other types of clinical indication.
- The second application consists of representing a financial and economic system through a network of interacting entities and devising a novel semantic index influenced by the topological properties of agglomerated information in a semantic graph. I show how it is possible to capture the dynamical aspects of the phenomena under investigation by identifying clusters carrying influential information and tracking them over time. By computing graph-based statistics over such clusters, I turn the evolution of textual information into a mathematically well-defined multivariate time series, where each component encodes the evolution of particular structural, topological and semantic properties of the set of concepts previously extracted and filtered. Eventually an autoregressive model with vectorial exogenous inputs is defined, which linearly mixes previous values of an index with the evolution of other time series induced by the semantic information in the graph.
The methodology briefly described above summarises the contribution of my research work in the field of Complex Systems; it has been instrumental in defining a graph-theoretical model for the study of drug repurposing [1] and in constructing a framework for the analysis of financial and economic unstructured data (see Chapter 6).
Acknowledgements
It had been a long-term dream of mine to study Applied Mathematics and gain a PhD, and I would first of all like to thank Prof. William Shaw, who at my first interview at King’s College gave me the chance to make this a reality by approving my application as a postgraduate student. He kindly reassured me that even in one’s forties it is not too late to undertake a doctoral programme, as long as one brings passion, commitment and sufficient background to tackle such a challenging endeavour.
Next I would like to thank my supervisor, Prof. Tiziana Di Matteo, who provided me with good day-to-day advice and a direction for my research, teaching me how to focus on innovative thinking while maintaining rigour in the investigation of adjacent domains of research. Having Dr. Tomaso Aste as adjunct supervisor was a real blessing, and I appreciated his breadth of knowledge, his perspective, his strong scientific background in statistical mechanics and his sense of humour.
Because of my engineering background and, more recently, my experience as an entrepreneur, it was important to me that my research was not abstract but closely linked to the business world. I would like to thank everyone I met throughout my doctoral journey, from disparate industry fields, for bridging the gap between the academic world and the business universe and for believing in the broad-ranging applications of my work to real business problems.
I would like to thank my colleagues and fellow students Dr. Raffaello Morales and Nicolò Musmeci for sharing the PhD experience with me and for being around to discuss formal aspects of the theory of multiscaling and of complex systems in general.
Also inspiring me on my journey was my old friend and fellow mathematician Guido Previde, whose shared interest in my work made him a great sounding board for ideas; he helped me work through some of the challenges I faced with both his knowledge and his sense of humour.
Last, but not least, on the academic list is Dr. Haris Dindo, whom I came to know about half way through my research. A great researcher in the fields of Machine Learning and Cognitive Science, he has been an invaluable help in devising new perspectives on how to tackle some challenging parts of my research.

On a personal note, I want to say thank you to my mother, who throughout so many years has been a constant source of optimism and fed me with the genuine ambition that got me where I am now.
Finally, I would like to thank my wife. Having heard me talk for many years about my plan to pursue a doctorate, it was she who pushed me out of the door to apply at King’s College. I embarked on my research as a single man and I am now a husband with two children; I am grateful that I started my PhD when I did, as I am not sure I would find the time now, or in the near future.
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
Symbols
Terminology
1 Introduction

2 Complex Systems
  2.1 Introduction
  2.2 Multiscaling
  2.3 Scaling properties
  2.4 Generalized Hurst exponent
  2.5 Complex Networks
    2.5.1 Characteristics of Complex Networks
      2.5.1.1 Definition of a Network (Graph)
    2.5.2 Measures and Metrics
      2.5.2.1 Degree and density
      2.5.2.2 Diameter and average distance
      2.5.2.3 Connectivity
      2.5.2.4 Distance and Shortest Path
      2.5.2.5 Random Walk
    2.5.3 Centrality measures
      2.5.3.1 Degree centrality
      2.5.3.2 Eigenvector Centrality
      2.5.3.3 Closeness Centrality
      2.5.3.4 Betweenness Centrality
    2.5.4 Degree Distribution of a Graph
    2.5.5 Clustering
    2.5.6 Models of Complex Networks
    2.5.7 Random Networks
    2.5.8 Scale-free Networks
    2.5.9 Small-World Networks
    2.5.10 Filtered Networks
    2.5.11 Minimum Spanning Tree
    2.5.12 Planar Maximally Filtered Graph
    2.5.13 Scale-Free Networks Embedded in Hyperbolic Metric Spaces
  2.6 Summary

3 A Linguistic and Graph theoretic approach
  3.1 Introduction
  3.2 An epistemological challenge
    3.2.1 The Semantic Problem
    3.2.2 Graph-theoretic analysis of semantic networks
  3.3 The model
    3.3.1 The Paradigm of Distributional Hypothesis
    3.3.2 Conceptualization & Disambiguation
      3.3.2.1 Conceptualization
      3.3.2.2 Disambiguation
    3.3.3 Building a knowledge representation
  3.4 The Methodology
    3.4.1 Data availability and ingestion
  3.5 Natural Language Processing and Parsing models
    3.5.1 The probabilistic Context-Free Parser
    3.5.2 The Lexicalized dependency parser
    3.5.3 The Lexicalized probabilistic context-free parser
    3.5.4 Extracting a knowledge graph from a large corpus of descriptive language based data set
    3.5.5 Similarities and distance in a semantic network
      3.5.5.1 The overall structure of the graph
      3.5.5.2 The "shortest paths"
      3.5.5.3 Random walk: a stochastic semantic distance
  3.6 Summary

4 The semantic - graph theoretical framework as a model for biomedical drug repurposing
  4.1 Introduction
  4.2 The semantic data ingestion
    4.2.0.4 Dictionaries and ontology
    4.2.0.5 Use of Natural Language Processing
  4.3 Building Connections
    4.3.1 Retrieving the appropriate similarity measure
      4.3.1.1 Occurrences
      4.3.1.2 Co-occurrences
      4.3.1.3 Fine grained measures
      4.3.1.4 Coarse grained measures
    4.3.2 Semantic interrelation, Dependency, Similarity
    4.3.3 Inferences and paths
    4.3.4 Building inferential paths
  4.4 Biological mechanisms through shortest paths
    4.4.1 Direct links
    4.4.2 Indirect links: searching for shortest paths
      4.4.2.1 Paths from the first case
      4.4.2.2 Paths from the second case
      4.4.2.3 Paths from the second case but with symmetrized average weights
      4.4.2.4 Paths from the second case with symmetrized minimum weights
  4.5 Constrained shortest paths as an inference measure
    4.5.1 Shortest paths on the biomedical knowledge graph
    4.5.2 Constrained shortest paths in the data set under investigation
  4.6 Most abundant paths
    4.6.1 Shortest path as the most probable paths
    4.6.2 Stochastic measures. The Random Walk distance
      4.6.2.1 Biological mechanism from paths based on random walk distance
      4.6.2.2 Paths associated with larger levels of similarity from average random walk distance
      4.6.2.3 Paths associated with larger levels of similarity from maximum random walk distance
  4.7 Network approach to information filtering
    4.7.1 Overall structure analysis
    4.7.2 Information filtering
    4.7.3 Applying Filtered Networks Techniques
      4.7.3.1 Minimum spanning tree
      4.7.3.2 Planar Maximally Filtered Graph
    4.7.4 Analysing the structure of the induced semantic graph through Minimum Spanning Tree
      4.7.4.1 MST global structure and average random walk distance
      4.7.4.2 MST global structure and maximum random walk distance
  4.8 Summary

5 A further step into the biological model
  5.1 Introduction
    5.1.1 Biomedical corpus
    5.1.2 Biomedical background - Why Linguistics and Network Theory
    5.1.3 Information retrieval and construction of the knowledge graph
    5.1.4 Analysis of the Knowledge Graph
    5.1.5 Ranking paths using Random Walk
    5.1.6 Results
    5.1.7 The Path VIP - SARCOIDOSIS
    5.1.8 The Path α-MSH - SARCOIDOSIS
    5.1.9 The Path CNP - SARCOIDOSIS
    5.1.10 The Path IMATINIB - CREUTZFELDT-JAKOB Disease
    5.1.11 Acknowledgements
    5.1.12 Summary

6 A Semantic - graph theoretical approach for the analysis of financial and economics unstructured data
  6.1 Introduction
  6.2 Prior work
    6.2.1 Our approach
  6.3 From plain text to semantic graphs
    6.3.1 Assessing the strength of association between concepts
  6.4 Cluster identification
    6.4.1 Clustering
    6.4.2 Random walk clustering
      6.4.2.1 Random walk distance
      6.4.2.2 Community detection algorithm
  6.5 Cluster tracking
  6.6 Semantic Index
    6.6.1 Comparison with financial time series
    6.6.2 Correlation analysis
    6.6.3 A note on stationarity and correlation
  6.7 Building the Semantic Index
    6.7.1 Semantic index as an auto-regressive process with exogenous variables
      6.7.1.1 Multivariate case
    6.7.2 Fitting the parameters of the VARX model
    6.7.3 Semantic content of the CSI
  6.8 Summary
  6.9 Acknowledgments

Conclusions
Bibliography

List of Figures
2.1 Illustration of a graph with N = 5 nodes and E = 4 edges
2.2 Scale-free networks: cumulative degree distributions for six different networks
2.3 Small-world network: representation of a one-dimensional lattice network
2.4 Exemplification of T1 and T2 elementary moves on a triangulation: edge switching (T1) and vertex insertion and removal (T2)
2.5 Exemplification of a T1 elementary move
3.1 An example of a structured sentence parse tree representation
3.2 Projection of the Knowledge Graph onto a semantic space
3.3 An example of a structured parse tree showing Part of Speech tags according to the Stanford NLP parser
3.4 PCFG, Lexicalized dependency and Combined parser
4.1 Schematic representation of a biological Mechanism of Action (MoA)
4.2 Distribution of the frequency of keyword occurrence in a sample of 208,583 articles
4.3 Distribution of the frequency of keyword co-occurrence in a sample of 208,583 articles
4.4 Distribution of the number of co-occurrences
4.5 A global view of the network of non-zero co-occurrences
4.6 A schematic 'mask' used to mimic a Mechanism of Action, used as a constraint in a Constrained Shortest Path inference
4.7 All constrained paths peptide ← PROCESS → diseases
4.8 Distribution of the weights for the non-zero entries of the dependency matrix S
4.9 caption
4.10 Minimum Spanning Tree graph for the case study [LU]
4.11 PMFG network for the case study [LU]
4.12 Minimum Spanning Tree obtained from the co-occurrences on symmetric weights (average operator)
4.13 Minimum Spanning Tree obtained from the co-occurrences on asymmetric weights
4.14 Minimum Spanning Tree obtained from the co-occurrences on symmetric weights (max operator)
4.15 Global structure of the average random walk distances
4.16 Global structure of the maximum random walk distance
5.1 Centrality measures for the sample under examination: 3 million papers, 1606 concepts, comprising 127 peptides, 300 rare diseases and 1179 other biological entities, 1576 vertices, 158,428 edges
5.2 Conceptual outline of the knowledge graph building process
5.3 Paths identification and selection
5.4 The Sarcoidosis knowledge network
5.5 The VIP - SARCOIDOSIS path and other closely related concepts
5.6 The α-MSH - SARCOIDOSIS path and other closely related concepts
5.7 The CNP - SARCOIDOSIS path and other closely related concepts
5.8 Imatinib (GLEEVEC) - Creutzfeldt-Jakob Disease path and other closely related concepts
6.1 Example of the walktrap community detection algorithm
6.2 An example of the cluster tracking procedure on a synthetic example
6.3 Semantic Index computed on the 2007 Thomson Reuters news headlines
6.4 Pearson correlation coefficient between the DJIA time series and five different semantic time series
6.5 Cross-correlation coefficient between the DJIA time series and five different semantic time series
6.6 Correlation analysis between different financial indicators and semantic indices on the 2007 data
6.7 Comprehensive Semantic Index fitted on the 2003-2011 DJIA series
6.8 Comprehensive Semantic Index fitted on the 2003-2011 S&P 500 series
6.9 Comprehensive Semantic Index fitted on the 2003-2011 VIX series

List of Tables
4.1 The twenty most abundant co-occurrences within the sample under examination
4.2 Hub vertices with largest degree

To my daughters Emma and Julia
∀φ [P(¬φ) ↔ ¬P(φ)]
∀φ ∀ψ [(P(φ) ∧ □∀x[φ(x) → ψ(x)]) → P(ψ)]
∀φ [P(φ) → ♦∃x φ(x)]
G(x) ↔ ∀φ [P(φ) → φ(x)]
P(G)
♦∃x G(x)
φ ess. x ↔ φ(x) ∧ ∀ψ (ψ(x) → □∀y (φ(y) → ψ(y)))
∀x [G(x) → G ess. x]
NE(x) ↔ ∀φ [φ ess. x → □∃y φ(y)]
P(NE)
□∃x G(x)
From: "Formalization, Mechanization and Automation of Gödel's Proof of God's Existence", C. Benzmüller and B. Woltzenlogel Paleo, 2013

K. Gödel, Collected Works, Volume III: Unpublished Essays and Letters (Ontological proof), Oxford University Press, 1970
Symbols
E(·)          expected value
f             probability density function
F<            cumulative distribution function
F>            complementary cumulative distribution function
H             Hurst exponent
H(q)          generalized Hurst exponent
K_m           Bernoulli random variable
ℓ             lag
M_q(t, τ)     q-moments
M*_q(t, τ)    empirical q-moments
n             hierarchical order
p_m           probability associated with the Bernoulli variable K_m
r_t           log-returns
P_t           price at time t
S_t           a generic stochastic process
X             a generic random variable
x             a realisation of X
α             tail exponent; also generically denotes exponents of exponential functions f(x) ∼ e^(−αx)
β             ACF decay exponent
γ             tail exponent; also generically denotes exponents of power-law functions f(x) ∼ x^(−γ)
κ             kurtosis
κ̃             excess kurtosis
λ             eigenvalue
ζ_q           scaling function
ζ*_q          empirical scaling function
σ_t           volatility
Σ             covariance matrix
Σ_ij          covariance matrix entries
τ             time scale
i, j          indices labelling individual elements/vertices of a data set/graph; when i and j are reserved for some objects, v and u label further types of objects
ij            an edge between the vertices i and j; if the graph is directed, the edge is written with an arrow, i⃗j
N             number of elements/vertices in a data set/network
G             G(V, E) denotes a graph with vertex set i, j ∈ V and edge set ij ∈ E; for weighted graphs it becomes a triplet G(V, E, W) including the weight set w_ij ∈ W, or a quadruple G(V, E, W, D) including the edge-distance set d_ij ∈ D; in short it is denoted G, and G⃗ for directed graphs
A_ij          adjacency matrix of a network/graph, specifying linking information between vertices i and j; when the label of a specific graph needs to be highlighted, it takes the form A_G; for linking information between sub-graphs of G, the extended form A(S_i, S_j) is used, where S_i and S_j are disjoint sub-graphs of G; in general, capitals of the form M_ij denote matrices specifying pair-wise entries
P(k)          degree distribution of a network/graph; in general, P is reserved for distributions of stochastic variables
C_i           clustering coefficient of vertex i
l_ij          shortest-path length between vertices i and j; the directed path length between i and j is denoted l⃗_ij
x_i           the ith data point in a specific proximity space; all data points belong to a set x_i ∈ X
D, S          dissimilarity and similarity matrices, where D(x_i, x_j) and S(x_i, x_j) denote the pairwise dissimilarity and similarity between the ith and jth data points; D is sometimes also used for the diameter of a network
log           logarithm to base e; logarithm to base n is specified by log_n
Terminology
Vertex              A vertex is a constituent of a graph and represents an individual element of a complex system.
Edge                An edge is a link between a pair of vertices, representing a relevant interaction between the corresponding elements of a complex system.
Graph/Network       A graph is a discrete mathematical object consisting of vertices and edges linking the vertices pair-wise. The term 'graph' is widely used in the discrete mathematics literature, whereas 'network' is common in the physics literature. In this thesis I use the two words interchangeably.
Subgraph            A subgraph is a part of a graph consisting of a subset of the vertices and a subset of the edges linking them.
Community/Cluster   In the complex networks literature, a sub-graph characterized by a distinctive topological property is called a 'community'. In the clustering literature, the term covers such sub-graphs as well as general subgroups of elements sharing particular patterns/features. In this thesis I use the two words interchangeably where appropriate.
Path                A path is a sequence of connected edges starting from a vertex i and terminating at a vertex j.
Planar graph        A planar graph is a graph that can be drawn on a topological sphere without edge crossings.
Hyperbolic surface  A hyperbolic surface is a generalization of a spherical surface with g handles attached; g is called the genus.
Hyperbolic graph    A hyperbolic graph is a generalization of a planar graph that can be drawn on a hyperbolic surface.
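To make the graph notation above concrete, the following is a minimal illustrative sketch (my own toy example, not code from the thesis), in Python using only the standard library: a graph with N = 5 vertices and E = 4 edges, its adjacency matrix A_ij, the vertex degrees, and a shortest-path length l_ij computed by breadth-first search.

```python
from collections import deque

# Toy undirected graph G(V, E): N = 5 vertices, E = 4 edges.
V = [0, 1, 2, 3, 4]
E = [(0, 1), (1, 2), (2, 3), (1, 4)]

# Adjacency matrix A_ij: A[i][j] = 1 iff vertices i and j are linked.
N = len(V)
A = [[0] * N for _ in range(N)]
for i, j in E:
    A[i][j] = A[j][i] = 1  # undirected graph: A is symmetric

# The degree of vertex i is the row sum of A.
degree = [sum(row) for row in A]

def shortest_path_length(A, source, target):
    """Shortest path length l_ij via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v, linked in enumerate(A[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # source and target are disconnected

print(degree)                         # [1, 3, 2, 1, 1]
print(shortest_path_length(A, 0, 3))  # 3  (path 0-1-2-3)
```

Vertex 1 is the hub here (degree 3); in the filtered-network chapters this kind of degree and path information is what the centrality measures and spanning-tree constructions are built from.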
Introduction
“In God we trust, all others bring data”
– W. Edwards Deming
The amount of information produced nowadays has reached a level that only a couple of decades ago was unimaginable. In 2012 the global data supply reached 2.8 zettabytes (ZB), or 2.8 trillion GB, yet just 0.5% of it was used for analysis. Volumes of data are projected to reach 40 ZB by 2020, or 5,247 GB per person, with emerging economies accounting for an increasingly large proportion of the world's total [2]. But it is not only the increasing amount of data that floods us every minute; the key characteristic is its complexity, which makes the interpretation of hidden information challenging because of the unstructured format of the data and the speed at which it is gathered and analysed.
Many tools and modern business intelligence methods are quite common these days and generally allow analysts to delve into complex datasets and provide answers to predetermined questions. The emergent field of data science, however, is concerned with finding the questions that should be asked when huge and often unstructured data sets are collected, whose analysis leads to non-conclusive and sometimes hidden results.
Moreover, much of the complexity we face today determines the general difficulty of finding consistent rules that answer certain queries. In fact, we can generally assume that complex answers derive from complex phenomena; problems that are difficult to solve are often hard to understand because causes and effects are not obviously related, and when the number of correlated variables becomes overwhelming, any traditional approach to analysing a problem becomes insufficient to provide a thorough solution. The field of complex systems has become very useful in understanding unknown and non-evident effects between elements of a system, and its study provides a number of sophisticated tools to deal with such systems and their complexity; these provide a very useful framework for capturing features and emergent properties that would otherwise remain unknown.
Emergent properties therefore become key to the interpretation of knowledge. Emergent characteristics in a complex system arise from the identification of higher-level properties and from the dynamic behaviour of the individual components of the system itself. For this reason emergent properties are properties of the 'whole' that are not strictly driven by any of the individual parts making up that whole; such phenomena exist in various domains and can be described using complexity concepts and thematic knowledge [3]. We are witnessing several studies focussing on complex modelling of dynamic or behavioural systems in the attempt to identify relevant emergent properties.
To this aim, delving into the main characteristics of Complex Systems and their features is key: we look into the intrinsic memory of a phenomenon due to its dynamic nature, and we analyse its non-linear and scaling patterns, trying to create a predictability model by tracking over time the non-equilibrium status stemming from its intimately open-system nature. Complexity is becoming the common frontier of the physical, biological and social sciences, and this leads us to investigate how subsystems self-organise into new emergent structures. Outcomes will include new technologies and tools [4].
The non-linear nature of complex systems, and the fact that these systems are open, i.e. they interchange information with the environment and constantly modify their internal structure and patterns of activity in a process of self-organization, suggest that they are flexible and easily adapt to variable external conditions. Furthermore, the emergent phenomena that cannot be derived solely from knowledge of a system's structure and of the interactions among its individual elements require a more intimate analysis of the different levels of its organization (multi-scaling). There is strong evidence that different complex systems can have similar characteristics both in their structure and in their behaviour. One can thus expect the existence of some common, universal laws that govern their properties [5, 6].
In this research work I have approached two types of complex data sets and used a number of instruments to analyse their emergent behaviour. I examined statistical correlations and topological measures in financial time series and investigated how best to describe and quantify these phenomena, expanding the footprint of such investigations to other disciplines. In this respect, I looked at how mechanisms of action mapped within a biological data set emerge under a stochastic approach, and at a method to build a semantic index representing financial or economic indicators starting from a large corpus of unstructured data.
In particular, I propose a methodology to construct a graph-based knowledge framework derived from semantically processing large corpora of unstructured data expressed in natural language, representing two different complex systems in the fields of biology and economics. Leveraging this graph-theoretic platform, the goal of this thesis is to address to what extent the methodology reflects the complexity and the content of meaningful information hidden in the data (emergent properties), and to develop tools that highlight that information.
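As a rough sketch of the first half of this pipeline, the hypothetical Python fragment below links concepts whenever they co-occur in a sentence, yielding the weighted edges of a small semantic graph. The concept list, the three-sentence "corpus" and the plain token-matching rule are my own simplifications for illustration; they stand in for the actual natural language processing and disambiguation machinery described in Chapters 3 and 4.

```python
from collections import Counter
from itertools import combinations

# Hypothetical vocabulary of domain concepts and a tiny toy "corpus".
CONCEPTS = {"peptide", "receptor", "inflammation", "disease"}
corpus = [
    "The peptide binds the receptor and reduces inflammation.",
    "Chronic inflammation is a hallmark of the disease.",
    "The receptor is implicated in the disease.",
]

# Edge weight = number of sentences in which two concepts co-occur.
weights = Counter()
for sentence in corpus:
    tokens = {w.strip(".,").lower() for w in sentence.split()}
    for a, b in combinations(sorted(tokens & CONCEPTS), 2):
        weights[(a, b)] += 1

for (a, b), w in sorted(weights.items()):
    print(f"{a} -- {b}: {w}")
```

On this toy corpus the result is a five-edge graph in which, for instance, "inflammation" links both "peptide" and "disease", so indirect paths between a compound and a disease already appear even when the two never co-occur in the same sentence, which is the intuition behind the path-based inference used later in the thesis.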
The Thesis is structured as follows. Chapter 2 provides an overview of complex systems, addressing both properties such as multiscaling and the types of complex networks, focussing on measures, metrics and, in general, statistical models of network topology, which play central roles in the complex networks literature and provide insights for understanding the dynamics and underlying mechanisms governing the respective complex systems. Moreover, a graph-theoretic filtering technique is introduced as an instrument to extract meaningful information from complex systems.
In Chapter 3 I review the linguistic theory behind the semantic approach that has been widely used in my research work. I introduce the basics of the Distributional Hypothesis as a fundamental component of extracting and correlating concepts from descriptive language sources, together with the various techniques used to construct a semantic graph. I then describe the statistical measures of semantic graphs and how these are applied to real-world cases.
Chapter 4 introduces a semantic graph-theoretical framework built on a real case analysing a biomedical and pharmacological data set. This chapter describes how such a framework leverages computational linguistics and graph theory in the context of a biological complex system, and shows how techniques derived from graph theory (such as information filtering, analysis of connections, paths and stochastic distances) will treat the induced biological network with the goal of simulating a biological Mechanism of Action.
In Chapter 5, I expand the framework of Chapter 4 by applying the semantic graph-theoretical model to an extended biomedical data set, and I show how this can efficiently serve as a framework for repurposing existing drugs towards diseases for which they were not initially intended. Full details are reported in the article published by Gramatica et al. [1].
Chapter 6 describes a second real-world case, which consists of applying the framework previously presented to an economic system. A large corpus of unstructured data describing economic, political and financial information is extracted from a historical database of Thomson Reuters (http://www.reuters.com/), providing an instrument to analyse unstructured complex data. I discuss a methodology to construct a quantitative semantic index readily comparable to real financial and economic indicators, utilising computational linguistics, machine learning techniques and a graph-theoretical framework.
Complex Systems
“Science is a differential equation, Religion is a boundary condition”
– Alan Turing
2.1 Introduction
Complex systems are difficult to interpret and their dynamics hard to predict. They obey the laws of physics, although their behaviour goes well beyond the simple rationalisation of classic models, as each element of a complex system participates in many different interactions, generating emergent properties [7].
One of the theoretical approaches to tackle such complexity is to understand the dynamics of how the systems transform. However, another way of explaining a complex phenomenon, beyond the analysis of its sequential transformations, is to look at the whole dynamic of a system as a reactive system [8]; this view allows us to observe that the system does not behave according to a pre-programmed chain of simple events but rather reacts in parallel to many concurrent inputs, and its behaviour, outputs and range of effects are not just a function of its inputs but also of their frequency, magnitude, history and the order in which they arrive. In this
sense, a function performed by a system that is not the result of a single element in the system, but rather the result of interacting elements, is called an emergent property. This characteristic leads us to consider that the output of a system's transformational process is not ruled by single lower-scale elements which are part of the system itself; emergence, in other words, is a matter of scale, e.g. interactions at one scale create objects at a higher scale.
A system is defined as complex if its behaviour crucially depends on the details of the system [6]. However, finding a unique definition of Complex Systems is a daunting task, and most of the time it is easier to find descriptions of such phenomena that summarise their main characteristics. A complex system is a system built on a large number of elements capable of interacting both with each other and with their environment. Such interactions amongst elements occur either with neighbours or with distant ones. The common characteristic of all complex systems is that they display organisation without any external "organising" principle being applied. As suggested earlier, the whole is much more than the sum of its parts [6, 9].
In this chapter I discuss certain characteristics of Complex Systems, particularly focussing on: 1. multiscaling properties of complex data sets; 2. structural features of networks and topological properties of graphs in general as a model for analysing complex systems. Throughout this Thesis I will use the terms Network and Graph, and Vertex and Node, interchangeably.
2.2 Multiscaling
As the scaling analysis of complex data sets has been the ignition point of my research efforts, which eventually expanded through adjacent areas of complex systems analysis and led to the Thesis herewith, I will now introduce a few relevant concepts about this subject as an introduction to the part of the theory studied through my research. In particular, during this phase, I have focussed on financial complex data sets to investigate their multiscaling characteristics, and here I give a brief background of the theory and instruments explored.
The analysis of time series is fundamental in the study of many disciplines, such as economics, finance, biology and physics, and the main objective of studying time series is the discovery of regularities in apparently unpredictable signals and the construction of a synthetic model which can be used to forecast and understand the intimate mechanisms governing the processes under study. Discovering regularities means, from the perspective of a scientist who analyses an event at different scales, looking for similarities of local/global measures which are expressed in the form of a scaling law, defined as a power law with a scaling exponent α describing the behaviour of a quantity F as a function of a scale parameter s: F(s) ≈ s^α, for a large range of s values [10, 11].

The concept of scaling leads to the important concept of a fractal, which represents a system characterised by a scaling law with a unique non-integer scaling exponent. The uniqueness of the scaling exponent simplifies the fractal modelling approach to complex data sets; modelling in turn becomes more difficult in the presence of time series displaying multi-fractal characteristics, i.e. their scaling properties are not described by a single number, but rather are defined by a function of scaling exponents [12–14].

As mentioned earlier, by analysing scaling regularities, for example of financial data sets, it is possible to derive interesting information concerning the underlying mechanisms that generate data trends, and consequently to aim at building a theoretical model that can eventually reproduce certain behaviours [15, 16] useful for predicting trends or analysing certain events [17]. In this respect, I have investigated the scaling concept, which has its origin in physics but is increasingly applied to other disciplines [18–22].
In recent years, the application of the scaling concept to financial markets has largely increased, also as a consequence of the abundance of available data [18]. In this part of my research I have focussed on the scaling properties of different financial market data by computing the scaling exponents; in the following paragraphs I include a brief description of the scaling properties and of the techniques used to compute the scaling exponents (namely the Generalized Hurst exponent method). The main goal of these studies was to understand the very large fluctuations in the market resulting in particular values of the scaling exponents, how markets react to such shocks, and how different markets react to the same crisis.
2.3 Scaling properties
The scaling properties of time series have been studied by means of several techniques [23], and many estimators have been proposed and used for the investigation of scaling properties in the financial and economic literature. Let us start with the seminal work on rescaled range statistical analysis (R/S) [24, 25], which gives an estimator for the Hurst exponent. Indeed, R/S analysis was first introduced by Hurst himself to describe the long-term dependence of water levels in rivers and reservoirs. It provides a sensitive method for revealing long-run correlations in random processes, and can distinguish uncorrelated from correlated time series. What mainly makes the Hurst analysis appealing is that all this information about a complex signal is contained in one parameter only: the Hurst exponent.
The original Hurst R/S approach is very sensitive to the presence of short memory, heteroskedasticity and multiple-scale behaviours. Such a lack of robustness has been largely discussed in the literature (see for instance [26–29]) and several alternative approaches have been proposed. Moreover, the fact that the range relies on maxima and minima makes the method error-prone, because any outlier present in the data has a strong influence on the range.
Lo [26] suggested a modified version of the R/S analysis that can detect long-term memory in the presence of short-term dependence [30]. The modified R/S statistic differs from the classical R/S statistic only in its denominator, adding some weights and covariance estimators to the standard deviation [31]. In this modified R/S, a problem is choosing the truncation lag q. Andrews [32] showed that when q becomes large relative to the sample size N, the finite-sample distribution of the estimator can be radically different from its asymptotic limit. However, the value chosen for q must not be too small, since the autocorrelation beyond lag q may be substantial and should be included in the weighted sum. The truncation lag must therefore be chosen with some consideration. Despite these difficulties, several authors still use this estimator, trying to avoid Lo's critique and proposing filtering procedures [33, 34].
In recent years there has been a proliferation of papers proposing different techniques and providing comparison studies between them [35]. Let us start by mentioning the most popular ones: the detrended fluctuation analysis (DFA) [36–48] and its generalization [49, 50]; the moving-average analysis technique [51] and its comparison with the DFA [52, 53]; the periodogram regression (GPH method) [54]; the (m, k)-Zipf method [55]; the Average Wavelet Coefficient Method [56–59]; and the ARFIMA estimation by exact maximum likelihood (ML) [60–62].

Let us stress that there exists no method whose performance is free of deficiencies; each of the above-mentioned estimators comes with both advantages and disadvantages. For instance, simple traditional estimators can be seriously biased. On the other hand, asymptotically unbiased estimators derived from Gaussian ML estimation are available, but these are parametric methods which require a parameterized family of model processes to be chosen a priori, and which cannot be implemented exactly in practice for large data sets due to high computational complexity and memory requirements [63–65]. Analytic approximations have been suggested (the Whittle estimator), but in most cases (see [66]) computational difficulties remain, motivating a further approximation: the discretization of the frequency-domain integration. Even with all these approximations, the Whittle estimator retains a significantly high overall computational cost, and problems of convergence to local minima rather than to the absolute minimum may also be encountered. In this framework, connections to multi-scaling/multi-affine analysis (the q-order height-height correlation) have been made in various papers such as [67–69].
Most of the work in the twentieth century on large complex data sets showed a general consensus on the Brownian motion approach [70] as a stochastic scaling model to describe the behaviour of financial assets (e.g. prices or returns); these models, although innovative, assumed a normal distribution with stable mean and finite variance. However, over the years empirical evidence demonstrated that returns are not normally distributed, but have a higher peak around the mean and fatter tails [71–73]. Later on, further sophistications of that embryonic approach considered fractional Brownian motion [74–76], involving fractal analysis; however, the presence of several scaling features in the analysis of complex data sets, such as financial ones, proved that a robust modelling framework was more complex [77]. For this reason, new approaches focussed on the estimation of the tail index [72], showing how the tails of the distribution of returns behave as a function of the size of the movement; moreover, a scaling exponent related to the fractal dimension (the Hurst exponent) was deemed to be an effective indication of the behaviour of volatility measures (variance of returns, absolute value of returns, etc.) as a function of the time interval over which such series of returns are measured [78].
Focussing therefore on the more complex scenario of a multi-fractal data set (in the following section we refer to the data set as a time series), we introduce the definition of a multi-scaling stochastic process X(t) (or multi-fractal process): a stochastic process X(t) exhibits multi-scaling characteristics if it has stationary increments and satisfies the following equation:
E(|X(t)|^q) = c(q) t^{τ(q)+1} .   (2.1)

(referenced from [79])
The function τ(q) is called the scaling function of the multi-scaling process X(t). When τ(q) is determined by a unique parameter identifying its slope, the associated stochastic process is called uniscaling or unifractal. For uniscaling processes, τ(q) is a linear function of q identified by a single exponent, and therefore all sizes of fluctuations display the same scaling law. In general, when a multiscaling process shows fluctuation behaviour, one can vary the exponent q and empirically observe that the higher q, the larger the fluctuations it emphasises [11]. Moreover, a time-dependent process X(t) is said to be self-affine when its fluctuations on different timescales can be re-scaled so that the original signal X(t) is statistically equivalent to its re-scaled version c^{-H} X(ct), ∀c > 0. Similarly, substituting the invariance condition X(t) = t^H X(1) into Eq. 2.1 leads to E(|X(t)|^q) = t^{qH} E(|X(1)|^q). Eq. 2.1 still holds with c(q) = E(|X(1)|^q) and τ(q) = qH − 1, showing that τ(q) is a linear function and the process is therefore uniscaling. H is the Hurst exponent (described below) and represents the self-affinity index, also called the scaling exponent, of the process X(t). For non-linear τ(q) we are in the realm of stochastic multi-scale processes.
The Hurst exponent H is a statistical measure used to classify time series: the larger H, the stronger the trend of the series, hence providing a measure of predictability. Formally, let X(t) be a Gaussian random process with E(X(t)) = 0 and second moment E(X^2) = 1, and let C(∆t) = E[X(t) · X(t + ∆t)] be its auto-correlation function; the asymptotic behaviour of C(∆t) as ∆t → ∞ quantifies the presence or absence of long-range dependence. If C(∆t) ∼ |∆t|^{−β} as ∆t → ∞ for β ∈ (0, 1), then the process X(t) has long memory with Hurst exponent H = 1 − β/2.

For the same type of stochastic process, the local behaviour of the autocorrelation function C(∆t) = E[X(t) · X(t + ∆t)] characterises the Euclidean profile roughness of the process X(t), also called its fractal dimension D. If the correlation function behaves as 1 − |∆t|^α as |∆t| → 0, with α ∈ (0, 2], then we can associate to the correlation function the fractal dimension D = 2 − α/2.

A very important relation links together the concept of fractal dimension D and the Hurst exponent H: H + D = 2, where these parameters can be used to exhibit the hidden statistical properties of a random process. Observing many natural phenomena in the study of complex systems, it is not unlikely to measure a Hurst exponent H in the region of 0.72 and a fractal dimension D of about 1.28; conversely, statistical processes composed of n i.i.d. variables {x_i} with finite variance show H ≃ 0.5 and D ≃ 1.5.
It was observed [80] that whenever H deviates from the value 0.5 we are in the presence of a "move" in the data evolution. In particular, 0.5 < H < 1 (and 1 < D < 1.5) indicates persistent behaviour: for example, during a growth period with H in this range, it is likely that the growth will persist for another period. For 0 < H < 0.5 we have, on the contrary, anti-persistent behaviour. For H = 1 the series appears as a straight line with zero slope and the process is likely to be totally predictable, while as H → 0 we have total unpredictability and the process exhibits white-noise behaviour.
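These regimes can be illustrated numerically. The following sketch (assuming Python with numpy; the function name and the simple scaling-of-absolute-increments estimator are my choices, not the estimator discussed later in this chapter) fits E|X(t+τ) − X(t)| ∼ τ^H on a walk with i.i.d. Gaussian increments and recovers the expected H ≃ 0.5:

```python
import numpy as np

def estimate_hurst(x, taus):
    """Estimate H from the scaling E|X(t+tau) - X(t)| ~ tau^H
    via a log-log least-squares fit (an illustrative estimator)."""
    moments = [np.mean(np.abs(x[tau:] - x[:-tau])) for tau in taus]
    slope, _ = np.polyfit(np.log(taus), np.log(moments), 1)
    return slope

rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(100_000))  # i.i.d. increments
H = estimate_hurst(walk, np.arange(1, 50))
print(round(float(H), 2))  # close to 0.5, consistent with D close to 1.5
```

On real data, values persistently above 0.5 would, by the discussion above, point at trend persistence, and values below 0.5 at anti-persistence.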
However, although the Hurst exponent H and the fractal dimension D are strictly related, they are quite independent of one another, as D represents a local property whereas H captures global characteristics at large scales. For self-similar processes, however, local and global properties coincide [81, 82].
What we have seen thus far represents the global scaling characteristics of a process; however, we can also define the local scaling properties of a process, and we do this by means of the Hölder exponent α(t) [80]. Given a stochastic process X(t), we have:
|X(t + dt) − X(t)| ∼ C_t (dt)^{α(t)} .   (2.2)
A single Hölder exponent α(t) represents a uniscaling process, whereas a continuing series of local Hölder exponents determines a multiscaling phenomenon.
In the Econophysics literature, the two most widely used methods to directly quantify the multifractal properties of a time series are the Multifractal Detrended Fluctuation Analysis (MF-DFA) [83] and the Generalized Hurst Exponent (GHE) [84]. Both methods aim at estimating the scaling function by means of a scaling exponent, which is a q-dependent generalisation of the Hurst exponent H. Barunik and Kristoufek [85] show that the GHE method outperforms the MF-DFA in returning an estimate of the scaling exponent with lower variance and bias, regardless of the presence of heavy tails in the data. For this reason we have chosen to use this method in order to give a quantitative description of multifractality. We describe the GHE method in the next section.
2.4 Generalized Hurst exponent
The generalized Hurst exponent method [14, 86] is essentially a tool to study directly the scaling properties of the data via the qth-order moments of the distribution of the increments. The q-order moments are much less sensitive to outliers than the maxima/minima, and different exponents q are associated with different characterizations of the multi-scaling complexity of the signal. This type of analysis combines sensitivity to any type of dependence in the data with a computationally straightforward and simple algorithm.
The Hurst analysis examines whether some statistical properties of a time series x_k (with k = ν, 2ν, ..., kν, ..., T) scale with the time-resolution (ν) and the observation-period (T). Such a scaling is characterized by an exponent H which is commonly associated with the long-term statistical dependence of the signal. A generalization of the approach proposed by Hurst should therefore be associated with the scaling behavior of statistically significant variables constructed from the time series. In this case, the qth-order moments of the distribution of the increments are used [87, 88]. This is a good quantity to characterize the statistical evolution of a stochastic variable X_k. It is defined as:

K_q(τ) = ⟨|X_{k+τ} − X_k|^q⟩ / ⟨|X_k|^q⟩ ,   (2.3)

where X_k is the detrended signal and the time-interval τ can vary between ν and τ_max. (Note that, for q = 2, K_q(τ) is proportional to the autocorrelation function a(τ) = ⟨X_{k+τ} X_k⟩.)
The generalized Hurst exponent H(q) can be defined from the scaling behavior of K_q(τ) if it follows the relation [89]:

K_q(τ) ∼ (τ/ν)^{qH(q)} .   (2.4)
Within this framework, two kinds of processes can be distinguished: (i) a process where H(q) = H, a constant independent of q; (ii) a process where H(q) is not constant. The first case is characteristic of uni-scaling or uni-fractal processes, whose scaling behavior is determined by a unique constant H that coincides with the Hurst coefficient or the self-affine index. This is indeed the case for self-affine processes, where qH(q) is linear (H(q) = H) and fully determined by its index H. In the second case, when H(q) depends on q, the process is commonly called multi-scaling (or multi-fractal) [90, 91] and different exponents characterize the scaling of different q-moments of the distribution. Therefore, the non-linearity of the empirical function qH(q) is a solid argument against Brownian, fractional Brownian, Lévy, and fractional Lévy models, which are all additive models and therefore give straight lines, or portions of straight lines, for qH(q).
For some values of q, the exponents are associated with special features. For instance, when q = 1, H(1) describes the scaling behavior of the absolute values of the increments; the value of this exponent is expected to be closely related to the original Hurst exponent H, which is indeed associated with the scaling of the absolute spread in the increments. (We use H without parentheses for the original Hurst exponent and H(q) for the generalized Hurst exponent.) The exponent at q = 2 is associated with the scaling of the autocorrelation function and is related to the power spectrum [92]. A special case is associated with the value q = q∗ at which q∗H(q∗) = 1: at this value of q, the moment K_{q∗}(τ) scales linearly in τ [87]. Since qH(q) is in general a monotonically growing function of q, all the moments K_q(τ) with q < q∗ will scale slower than τ, whereas all the moments with q > q∗ will scale faster than τ. The point q∗ is therefore a threshold value. Clearly, in the uni-fractal case H(1) = H(2) = H(q∗). All these quantities equal 1/2 for Brownian motion, and they equal H ≠ 0.5 for fractional Brownian motion. However, for more complex processes, these coefficients do not in general coincide.
This method produces a set of generalised Hurst exponents H(q), similar to h(q) from the MF-DFA technique, that characterises the scaling properties of the time series at different orders q. However, in this case the value of q cannot be negative: the function K_q(τ) defined in equation 2.3 diverges when q is negative.
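The GHE procedure described above can be sketched as follows (a minimal illustration in Python with numpy, not the reference implementation of [84]: it performs a single log-log fit per q, whereas in practice one would also average the fitted exponents over several values of τ_max to reduce the estimate's sensitivity):

```python
import numpy as np

def generalized_hurst(x, qs, tau_max=19):
    """Estimate H(q) from K_q(tau) ~ (tau/nu)^(q H(q)) (Eq. 2.4).
    The denominator <|X_k|^q> is constant in tau, so it only shifts
    the intercept of the log-log fit, not the slope q*H(q)."""
    taus = np.arange(1, tau_max + 1)
    result = {}
    for q in qs:
        Kq = [np.mean(np.abs(x[t:] - x[:-t]) ** q) / np.mean(np.abs(x) ** q)
              for t in taus]
        slope, _ = np.polyfit(np.log(taus), np.log(Kq), 1)  # slope = q * H(q)
        result[q] = slope / q
    return result

rng = np.random.default_rng(1)
bm = np.cumsum(rng.standard_normal(200_000))  # Brownian walk: uniscaling
H_est = generalized_hurst(bm, [1, 2, 3])
print({q: round(float(h), 2) for q, h in H_est.items()})  # all close to 0.5
```

For a uniscaling series such as this Brownian walk, H(q) is flat in q; for a multifractal series, H(q) would instead vary with q, making the empirical qH(q) non-linear.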
2.5 Complex Networks
We discussed earlier that complex systems, like those within biology, economics or social disciplines, to name a few [93], are comprised of many interacting elements, and we have seen that the study of individual properties could be misleading if not properly compared with the information from the rest of the elements constituting the system itself. Moreover, one of the most significant breakthroughs in complex systems studies has been the discovery that all these systems share similar structures in the network of interactions between their constituent elements [9, 93, 94]. The study of these similarities has led to an ongoing development of network models and graph-theoretical analysis techniques instrumental to characterise and understand complexity.
The birth of network (or graph) theory is due to the famous mathematician Euler and his theory and solution of the celebrated Königsberg bridge puzzle [95]. Without entering into the details of Euler's famous problem, we can here emphasise that Euler noticed that physical distance was not the important factor in the description of the problem; rather, it was the representation of the topological constraints of the problem itself in the form of a graph. Euler's work became of seminal importance in identifying topological properties as key to solving complex problems, thus opening a new field of mathematics that turned into a powerful tool for analysing complex systems. If Euler was the first to formalise graph theory, the development of the discipline is due to Paul Erdős and Alfréd Rényi, who pursued the theoretical analysis of the properties of graphs and random graphs, obtaining a number of important results by introducing the concept of probability into Euler's original work [96, 97].
In this section I will recall the main characteristics of a network (graph) and its main properties that have been widely used throughout my research work and this Thesis; I will omit the basic definitions of network components and network taxonomy, e.g. the definition of a node, a link, and types of networks (directed, undirected, bipartite, etc.), although a vast literature can be found in several introductory materials such as Newman's work [94, 98].
2.5.1 Characteristics of Complex Networks
As we are dealing with complex phenomena, one of the most productive approaches in the study of complex networks is to characterize the topology of a network at multiple scales of complexity; this can be done through graph-theoretical measures at a local or global level, which together provide an important instrument in the analysis of a complex system representation [99].
Over the past few decades a lot of effort has been put into the search for underlying laws governing the dynamics and evolution of complex systems, and scientists have adopted a systematic analysis and characterization of many of their aspects; in particular, several studies have focused on how correlated data can be represented by means of a rigorous instrument such as a graph, and this network representation has been thoroughly explored. Generally speaking, when representing a complex system through a network, this can be defined as an abstract graph-shaped mathematical representation whose vertices (nodes) identify the elements of the observed phenomenon and whose connecting links (edges) translate the relationships amongst those elements.

Figure 2.1: Illustration of a graph G = (V, E) with N = 5 nodes and E = 4 edges. The node set is V = {1, 2, 3, 4, 5} and the edge set is E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}}. Adapted from [100].
By utilizing such a graph-theoretic platform, we are able to discover the emerging properties of a complex system and its hidden information, and to recognize several characteristics concerning the underlying complex phenomena; the available mathematical tools then provide interesting ways to derive information about the emergence of power-law distributions, the robustness of systems, and the scaling characteristics of the network in its fundamental properties. In this section I will describe the main properties observed in real-world complex networks, and in particular in scale-free networks.
2.5.1.1 Definition of a Network (Graph)
We consider here undirected graphs (we make no distinction between (u, v) and (v, u)), since most classical properties are defined on such graphs only. Formally, given a graph G(V, E), with V defining the vertex set and E the edge set, we will assume that V = {1, ..., N}. With this notation in mind, the most natural matrix associated with a graph G is the adjacency matrix A, which has been universally chosen as the formal way to represent a graph.
The adjacency matrix, with entries A(i, j), is an N×N matrix defined by:

a_{ij} = 1 if an edge exists between vertex i and vertex j, and a_{ij} = 0 otherwise.   (2.5)
Similarly, using the values of the edges in a weighted network, we can construct a matrix of weights W whose entries are the weights w_{ij} (i, j = 1, ..., N) of the connections between vertices. Now that a network is formally defined, let us look at the set of measures and metrics that help quantify network structures.
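As a concrete illustration, the adjacency matrix of Eq. 2.5 and a weighted counterpart W can be built as follows (a sketch in Python with numpy, using the graph of Figure 2.1 with 0-based labels; the edge weights are hypothetical):

```python
import numpy as np

# Graph of Figure 2.1 relabelled 0-based: V = {0,...,4},
# E = {{0,1}, {0,4}, {1,2}, {1,4}}.
N = 5
edges = [(0, 1), (0, 4), (1, 2), (1, 4)]

A = np.zeros((N, N), dtype=int)          # adjacency matrix of Eq. 2.5
for i, j in edges:
    A[i, j] = A[j, i] = 1                # undirected: A is symmetric

W = np.zeros((N, N))                     # weighted counterpart
for (i, j), w in zip(edges, [0.5, 1.0, 2.0, 0.3]):  # hypothetical weights
    W[i, j] = W[j, i] = w

print(int(A.sum()) // 2)  # 4 edges
```

Symmetry of both matrices encodes the undirectedness assumed throughout this section.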
2.5.2 Measures and Metrics
There are several measures applicable to a graph that describe certain features of a network topology; the following are the most well known and the ones thoroughly used throughout my research.
2.5.2.1 Degree and density
The degree d⁰(v) of a node v is its number of links or, equivalently, its number of neighbours: d⁰(v) = |N(v)|; the average degree d⁰ of a graph is the average over all its nodes: d⁰ = (1/N) Σ_v d⁰(v).

The density of a graph G(V, E) is defined as the number of edges in the graph divided by the total number of possible links: δ = 2E / (N(N − 1)); it indicates to what extent the graph is fully connected (all the links exist). In particular, it gives the probability that two randomly chosen nodes are linked in the graph.

We can relate the average degree and the density by d⁰ = δ(N − 1). In general, the average degree of complex networks is small and independent of the sample size; this implies that the density δ goes to zero when the sample grows, since δ = d⁰/(N − 1) [101, 102].
More generally, the degree k_i of the i-th vertex is defined as the number of edges attached to vertex i. A similar quantity is the vertex strength s_i, defined as the sum of the weights of the edges linked to vertex i. For unweighted networks these two quantities coincide at each vertex; edge weights, on the other hand, add another layer of complexity, and the two should be addressed separately.
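For illustration, degree, average degree and density, and the relation d⁰ = δ(N − 1), can be checked on the graph of Figure 2.1 (a sketch assuming Python with numpy; 0-based vertex labels):

```python
import numpy as np

# Degree, average degree and density on the Figure 2.1 graph.
N = 5
A = np.zeros((N, N), dtype=int)
for i, j in [(0, 1), (0, 4), (1, 2), (1, 4)]:
    A[i, j] = A[j, i] = 1

k = A.sum(axis=1)                   # degree k_i of each vertex
avg_degree = k.mean()               # d0 = (1/N) * sum over v of d0(v)
E = A.sum() / 2                     # number of edges
density = 2 * E / (N * (N - 1))     # delta = 2E / (N(N-1))

assert np.isclose(avg_degree, density * (N - 1))  # d0 = delta * (N - 1)
print(k.tolist(), float(avg_degree), float(density))  # [2, 3, 1, 0, 2] 1.6 0.4
```

With weights, the vertex strengths would instead be obtained as the row sums of the weight matrix W.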
2.5.2.2 Diameter and average distance
We define d(i, j) as the distance between vertices i and j, i.e. the number of links on a shortest path between them. We define d(i) = (1/n) Σ_j d(i, j) as the average distance from i to all nodes, and d = (1/n) Σ_i d(i) as the average distance in the graph. Finally, we define D = max_{i,j} d(i, j) as the diameter of the graph, i.e. the largest distance [102].
2.5.2.3 Connectivity
We define a connected component of a graph as a maximal set of nodes such that a path exists between any pair of nodes in this set. The connected components and their sizes can be computed using a graph traversal (like a breadth-first search [103]). In most real-world complex networks, it has been observed that there is one large connected component, often called the giant component, together with a number of small components containing no more than a few percent of the nodes, often much less (if any).
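The traversal mentioned above can be sketched as follows (plain Python; the function name and the adjacency-dictionary representation are my choices). On the 0-based relabelling of the Figure 2.1 graph, one vertex is isolated and the remaining four form the largest component:

```python
from collections import deque

def connected_components(adj):
    """Connected components via breadth-first traversal;
    adj maps each node to the list of its neighbours."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Figure 2.1 graph, 0-based: vertex 3 is isolated, the others form
# the largest ("giant") component.
adj = {0: [1, 4], 1: [0, 2, 4], 2: [1], 3: [], 4: [0, 1]}
sizes = sorted(len(c) for c in connected_components(adj))
print(sizes)  # [1, 4]
```

Each node and edge is visited once, so the component decomposition costs O(|V| + |E|).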
2.5.2.4 Distance and Shortest Path
The distance d(u, v) between two vertices u and v of a graph is the minimum length of the paths connecting them, also named shortest paths; we also say that it is the length of a graph geodesic [93, 98]. Should the two vertices lie in different connected components, i.e. no such path exists, the distance is set equal to infinity. The matrix d_{ij}, consisting of all distances from vertex v_i to vertex v_j, represents the collection of all shortest paths and is called the graph distance matrix [104]. The diameter of a graph is the length of the longest geodesic path between any pair of vertices in the graph [98, 105].
Amongst the various techniques to compute the distance between any two nodes of a graph, it is worth mentioning the Floyd-Warshall algorithm [106], which efficiently and simultaneously finds the shortest paths (i.e., graph geodesics) between every pair of vertices in a weighted directed graph. Another famous algorithm for finding a graph geodesic, i.e., the shortest path between two graph vertices, is due to Edsger Dijkstra [107]; it constructs a shortest-path tree from the initial vertex to every other vertex in the graph. It is worth emphasising that in graph theory the shortest path problem is the problem of finding a path between two nodes such that the sum of the distances of its constituent edges is minimised or, if edges carry weights representing the strength of the relation between two vertices, the shortest path can instead be defined as the path that maximises the total relation weight.

Amongst the most used algorithms to compute shortest paths in a graph, since we will be searching for all shortest paths between every pair of nodes (the all-pairs shortest path problem), the Floyd-Warshall approach, with certain adaptations, will be used throughout the examples later in this Thesis. The Floyd-Warshall algorithm compares all possible paths through the graph between each pair of vertices in O(|V|³) comparisons. It does so by incrementally improving an estimate of the shortest path between two vertices, until the estimate is optimal. By means of this algorithm we obtain the value of the shortest-path distance for each pair of nodes in each connected component of G. The algorithm also returns the predecessor matrix P, such that p_{i,k} is the index of the node preceding k on the shortest path from i to k. From this matrix, any path between any pair of nodes i, j in a connected component of G can be reconstructed.
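The Floyd-Warshall scheme with predecessor bookkeeping can be sketched as follows (illustrative Python with hypothetical edge lengths; this is the standard distance-minimising variant, not the weight-maximising adaptation mentioned above):

```python
INF = float("inf")

def floyd_warshall(w):
    """All-pairs shortest paths in O(|V|^3) on an edge-length matrix w
    (w[i][j] = INF if no edge). Returns distances d and predecessors p,
    where p[i][j] is the node preceding j on a shortest path from i."""
    n = len(w)
    d = [[0 if i == j else w[i][j] for j in range(n)] for i in range(n)]
    p = [[i if i != j and d[i][j] < INF else None for j in range(n)]
         for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:  # improve the estimate via k
                    d[i][j] = d[i][k] + d[k][j]
                    p[i][j] = p[k][j]
    return d, p

def reconstruct(p, i, j):
    """Rebuild the shortest path from i to j using the predecessor matrix."""
    path = [j]
    while path[-1] != i:
        path.append(p[i][path[-1]])
    return path[::-1]

# Hypothetical weighted graph on 4 vertices.
w = [[INF, 3, INF, 7],
     [3, INF, 1, INF],
     [INF, 1, INF, 2],
     [7, INF, 2, INF]]
d, p = floyd_warshall(w)
print(d[0][3], reconstruct(p, 0, 3))  # 6 [0, 1, 2, 3]
```

Note that the indirect route 0-1-2-3 (length 6) beats the direct edge 0-3 (length 7), and the predecessor matrix recovers it.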
2.5.2.5 Random Walk
Given a graph, we imagine a "walker" which starts at a vertex and moves at random to one of its neighbours; from the new vertex the walker again picks a neighbour at random, moves to it, and so on. The random sequence of vertices visited in this way is a random walk on the graph. Random walks arise in many models in mathematics and physics [108].
A random walk is a finite Markov chain [109, 110]. In fact, there is little difference between the theory of random walks on graphs and the theory of finite Markov chains; every Markov chain can be viewed as a random walk on a directed graph, if we allow weighted edges. The random walk is also interesting as a possible mechanism of transport and search on networks. Such processes would be optimal if one followed the shortest path between the two nodes under consideration, i.e., among all paths connecting two nodes, the one with the smallest number of links. However, the shortest path can be found only once the global connectivity is known at each node, which is rarely the case in practice. The random walk becomes important in the opposite extreme, where only the local connectivity is known at each node [111]. The random walk is also a useful tool for studying the structure of networks [112].
Formally, let G = (V, E) be an undirected connected graph with N vertices, and consider a random walk on G.
A walker starts at vertex $v_0$ and takes $t$ random steps; if at the $t$-th step the walker is at vertex $v_t$, it moves to a neighbour of $v_t$ with probability $1/d(v_t)$, where $d(v_t)$ is the degree of $v_t$. The sequence of random nodes $(v_t : t = 0, 1, \ldots)$ obtained in this way is a Markov chain.
We then define:
- the $n \times n$ adjacency matrix $A$, where $A(i,j)$ is the weight of the edge from vertex $i$ to vertex $j$. As the graph is here undirected, $A(i,j) = A(j,i)$, i.e. $A$ is symmetric;
- the probability $p_i(t)$ that the walker is at vertex $i$ at time $t$, defined as:
$$p_i(t) = \sum_j \frac{A(i,j)}{k_j}\, p_j(t-1)\,, \qquad (2.6)$$
where $1/k_j$ is the probability of taking a step along any given edge attached to $j$ [98];
- the $n \times n$ transition matrix $P$; $P$ is row-stochastic, with $P(i,j)$ the probability of stepping to vertex $j$ from vertex $i$: $P(i,j) = A(i,j) / \sum_j A(i,j)$.
We can also write eq. 2.6 in matrix form, where $p$ denotes the vector of the $p_i$ and $D$ the diagonal matrix with the vertex degrees on its diagonal: $p(t) = AD^{-1}p(t-1)$.
In the long run, for t → ∞, the probability distribution over all the vertices is:
$$p_i(\infty) = \sum_j A_{ij}\, p_j(\infty)/k_j\,, \qquad (2.7)$$
or, in matrix form: $p = AD^{-1}p$. Note that the vector $D^{-1}p$ is an eigenvector of $D^{-1}A$ with eigenvalue 1. The sequence of vertices generated according to this matrix of transition probabilities (of this Markov chain) is a walk in the graph; we call such a walk a random walk on the graph or digraph G.
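The iteration of eq. 2.6 towards the limit of eq. 2.7 can be sketched in a few lines of Python (the function names and the dense adjacency-matrix representation are illustrative assumptions):

```python
def transition_matrix(A):
    """Row-stochastic matrix with entry (i, j) = A(i, j) / k_i, the
    probability of stepping from vertex i to vertex j."""
    return [[a / sum(row) for a in row] for row in A]

def step_distribution(P, p):
    """One step of eq. 2.6: p_i(t) = sum_j P(j, i) * p_j(t - 1);
    for a symmetric A, P(j, i) = A(i, j) / k_j."""
    n = len(P)
    return [sum(P[j][i] * p[j] for j in range(n)) for i in range(n)]

def stationary(P, iters=1000):
    """Iterate from the uniform distribution to approximate p(infinity).
    For a connected, non-bipartite undirected graph this converges to
    p_i = k_i / (2m), i.e. proportional to the vertex degrees."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = step_distribution(P, p)
    return p
```

On a triangle with one pendant vertex, for instance, the limit distribution is proportional to the degrees, (3, 2, 2, 1)/8, illustrating that the walker spends more time at better-connected vertices.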
2.5.3 Centrality measures
Centrality measures are useful metrics that help answer the question of which nodes are most central relative to the rest of the network. The general formula for centralization is Freeman's formula [113].
However, there are many different types of centrality that together provide a good representation of the properties of a network, in particular when these quantities are measured over time (i.e. in an evolving network). The centrality measures listed below are the ones most widely used in network analysis.
2.5.3.1 Degree centrality
The degree centrality $C_d(i)$ counts the number of links incident upon a node, so that nodes with a large number of neighbours (i.e., edges) have high centrality. We define $C_d(i) = \deg(i)/(n-1)$, where $n$ is the number of vertices in the graph and $\deg(i)$ is the number of edges from $i$ to other vertices [101, 114]. In certain types of network, biological networks for example, nodes with high degree centrality are considered more relevant than those with lower degree centrality [98, 101]. Degree centrality, however, can be misleading because it is a purely local measure: it relies solely on the number of edges attached to a node as the feature characterizing it. If we want to measure the extent to which a graph is centralized, we must look at other measures, such as eigenvector centrality.
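As a toy illustration, degree centrality reduces to a normalised row sum of the adjacency matrix (a minimal sketch; the 0/1 dense-matrix input is an assumption):

```python
def degree_centrality(A):
    """C_d(i) = deg(i) / (n - 1), computed from an n x n 0/1 adjacency
    matrix of an undirected graph without self-loops."""
    n = len(A)
    return [sum(row) / (n - 1) for row in A]
```

On a four-vertex star, for example, the centre attains the maximum value 1 while each leaf scores 1/3, which makes the purely local character of the measure evident.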
2.5.3.2 Eigenvector Centrality
Eigenvector centrality measures the importance of a node through the importance of its neighbours [115]. In other words, eigenvector centrality combines the edges of a given node with the weights of the nodes those edges connect to. This provides a centrality weighted by the nodes involved: if the neighbourhood of a node is crowded with important nodes, its own centrality rises proportionally, and vice versa. This is a powerful measure because, applied to each vertex of a graph, it returns a score proportional to the sum of the scores (centrality weights) of its neighbours; in other words, the assumption is that each vertex's centrality is the sum of the centrality values of the vertices it is connected to [101, 114]. Formally, eigenvector centrality is defined through the principal eigenvector of the adjacency matrix defining the network. The defining equation of an eigenvector is:
$$\lambda v = A v\,, \qquad (2.8)$$
where $A$ is the adjacency matrix of the graph, $\lambda$ is a constant (the eigenvalue), and $v$ is the eigenvector. The interpretation of the above equation is that a node with a high eigenvector score is one that is adjacent to nodes that are themselves high scorers. The centrality vector is the eigenvector of the adjacency matrix all of whose elements are positive [116].
2.5.3.3 Closeness Centrality
This measure of centrality is based on the concept of geodesic path, i.e. the distance between two vertices in a graph measured as the number of edges in a shortest path, and it looks at the mean distance from a vertex to the other vertices [98]. Closeness is based on the inverse of the distance of each vertex to every other vertex in the network [101, 114].
If we denote by $d_{ij}$ the length of a geodesic path from $i$ to $j$, then the inverse of the total geodesic distance from $i$ to all vertices $j$ (up to a factor $n$, the inverse mean geodesic distance) defines the closeness centrality in the graph as:
$$C_c(i) = \left[ \sum_{j=1}^{n} d_{ij} \right]^{-1}\,. \qquad (2.9)$$
2.5.3.4 Betweenness Centrality
The geodesic shortest path mentioned earlier also serves to define another important topological quantity, called betweenness centrality. We distinguish two types of betweenness centrality: vertex betweenness centrality and edge betweenness centrality. The vertex betweenness centrality BC(v) of a vertex v ∈ V is the sum, over all pairs of vertices i, j ∈ V, of the fraction of shortest paths between i and j that pass through v:
$$BC(v) = \sum_{i,j \in V} \frac{l_{ij}(v)}{l_{ij}}\,, \qquad (2.10)$$
where $l_{ij}(v)$ is the number of shortest paths between $i$ and $j$ that pass through vertex $v$, and $l_{ij}$ is the total number of shortest paths between $i$ and $j$. One may regard this measure as a measure of "bottleneck": how much traffic passes through each vertex. Similarly, the edge betweenness centrality BC(e) is defined as:
$$BC(e) = \sum_{i,j \in V} \frac{l_{ij}(e)}{l_{ij}}\,, \qquad (2.11)$$
where $l_{ij}(e)$ is the number of shortest paths between $i$ and $j$ that pass through edge $e$.
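On unweighted graphs, one standard way to evaluate eq. 2.10 is to count shortest paths with breadth-first search and accumulate the pair dependencies backwards (the scheme due to Brandes); the following minimal sketch assumes an adjacency-list input and is not the implementation used later in this Thesis:

```python
from collections import deque

def vertex_betweenness(adj):
    """BC(v) per eq. 2.10 for an unweighted, undirected graph given as
    an adjacency list. Every ordered pair (i, j), i != j, contributes,
    so each unordered pair is counted twice; halve the result for the
    unordered convention. Endpoints do not count as lying 'on' a path."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        dist = [-1] * n; dist[s] = 0
        sigma = [0] * n; sigma[s] = 1     # number of shortest s -> v paths
        order = []                        # BFS order: nondecreasing distance
        q = deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] == -1:
                    dist[w] = dist[u] + 1
                    q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
        # accumulate pair dependencies walking back from the leaves
        delta = [0.0] * n
        for w in reversed(order):
            for v in adj[w]:
                if dist[v] == dist[w] - 1:   # v precedes w on a geodesic
                    delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the path graph 0-1-2, only the middle vertex lies between the two endpoints, so it alone receives a nonzero score; on a triangle, every pair of vertices is adjacent and all betweenness values vanish.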
2.5.4 Degree Distribution of a Graph
We have seen earlier the definition of the degree of a vertex as the number of edges attached to it. Let $k$ be the degree of a vertex; the degree distribution of a graph is the proportion $p_k$ of nodes of degree exactly $k$ in the graph, for all $k$ [102, 117]: $p_k = P[K = k]$.
$p_k$ is the fraction of vertices in a network with degree $k$, and also the probability that a randomly selected vertex has degree $k$ [118]. Degree distributions of a graph include: