
King's College London

Doctoral Thesis

Quantitative semantics and graph theory as a framework for complex systems modeling

Author: Ruggero Gramatica
Supervisor: Prof. Tiziana Di Matteo

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in

Applied Mathematics Department of Mathematics

December 2015

Published Articles

I list here the peer-reviewed articles I have authored and co-authored in the course of my PhD and whose material I have used in this thesis. The articles are listed in chronological order of appearance and can be reviewed in Annex 1:

1. R. Morales, T. Di Matteo, R. Gramatica and T. Aste, Dynamical Generalized Hurst Exponent as a Tool to Monitor Unstable Periods in Financial Time Series, Physica A 391 (2012) 3180-3189.

2. T. Aste, R. Gramatica and T. Di Matteo, Exploring complex networks via topological embedding on surfaces, Physical Review E 86 (2012) 036109.

3. T. Aste, R. Gramatica and T. Di Matteo, Random and frozen states in complex triangulations, Philosophical Magazine 92 (1-3) (2012) 244-254.

4. R. Gramatica, T. Di Matteo, S. Giorgetti, M. Barbiani, D. Bevec and T. Aste, Graph Theory Enables Drug Repurposing - How a Mathematical Model Can Drive the Discovery of Hidden Mechanisms of Action, PLoS ONE 9 (1) (2014) e84912.

5. R. Gramatica, H. Dindo, T. Di Matteo and T. Aste, A quantitative semantic and graph theoretical approach for the analysis of financial and economic unstructured data, in preparation, 2015 (see Chapter 6).

Abstract

The study of Complex Systems focuses on how interactions of constituents within a system, individually or grouped into clusters, produce behavioural patterns locally or globally, and on how these interact with the external environment. Over the last few decades the study of Complex Systems has attracted a growing rate of interest, and today, given a sufficiently large set of data, we are able to construct comprehensive models describing the emerging characteristics and properties of complex phenomena, transcending the different domains of the physical, biological and social sciences.

The use of graph theory has shown, amongst others, a particular fit in describing the static and dynamical correlations of complex data sets, because of its ability to deal not only with deterministic quantities but also with probabilistic methods. A complex system is generally an open system, flexible in adapting to variable external conditions, in the way that it exchanges information with the environment and adjusts its internal structure in the process of self-organization. Moreover, it has been shown how real-world phenomena that are represented by complex systems display interesting statistical properties such as power-law distributions, long-range interactions, scale invariance, criticality, multifractality and hierarchical structure.

In the era of big data, where effort is largely put into collecting large data sets carrying relevant information about given phenomena to be studied and analysed, the interesting field of quantitative semantics, i.e. dealing with information expressed in natural language, is becoming more and more relevant, particularly in the social sciences. Moreover, recent studies are expanding these techniques into a tool for structuring and organising information across a number of disparate disciplines.

In this Thesis I propose a methodology that (i) extracts a structured complex data set from large corpora of descriptive language sources and efficiently exploits the power of quantitative semantics techniques to map the essence of a complex phenomenon into a network representation, and (ii) combines such an induced knowledge network with a graph theoretical framework, utilising a number of graph theory tools to study the emerging properties of complex systems. Thus, leveraging developments in Computational Linguistics and Network Theory, the proposed approach builds a graph representation of knowledge, which is analyzed with the aim of observing correlations between any two nodes or across clusters of nodes, and highlights emerging properties by means of both topological structure analysis and dynamic evolution, i.e. the change in connectivity. Under this framework I will provide two real-world applications:

- The first application deals with the creation of a structured network of biomedical concepts starting from an unstructured corpus of biological text-based data (peer-reviewed articles); it then retrieves known pathophysiological Modes of Action by applying a stochastic random-walk measure, and finds new ones by meaningfully selecting and aggregating contributions from known bio-molecular interactions. By exploiting the proposed graph-theoretic model, this approach has proven to be an innovative way to find emergent mechanisms of action aimed at drug repurposing, where existing biologic compounds originally intended to deal with certain pathophysiologic actions are redirected towards treating other types of clinical indications.

- The second application consists of representing a financial and economic system through a network of interacting entities and devising a novel semantic index influenced by the topological properties of agglomerated information in a semantic graph. I have shown how it is possible to fully capture the dynamical aspects of the phenomena under investigation by identifying clusters carrying influential information and tracking them over time. By computing graph-based statistics over such clusters, I turn the evolution of textual information into a mathematically well-defined, multivariate time series, where each time series encodes the evolution of particular structural, topological and semantic properties of the set of concepts previously extracted and filtered. Eventually, an autoregressive model with vectorial exogenous inputs is defined, which linearly mixes previous values of an index with the evolution of other time series induced by the semantic information in the graph.

The methodology briefly described above constitutes the contribution of my research work in the field of Complex Systems; it has been instrumental in successfully defining a graph-theoretical model for the study of drug repurposing [1] and in the construction of a framework for the analysis of financial and economic unstructured data (see Chapter 6).

Acknowledgements

It had been a long-term dream of mine to study Applied Mathematics and gain a PhD, and I would first of all like to thank Prof. William Shaw, who at my first interview at King's College gave me the chance to make this a reality by approving my application as a postgraduate student. He kindly reassured me that even in one's 40s it is not too late to undertake a doctoral programme, as long as one brings passion, commitment and sufficient background to tackle such a challenging endeavour.

Next I would like to thank my supervisor, Prof. Tiziana Di Matteo: she provided me with good day-to-day advice and a direction for my research, teaching me how to focus on innovative thinking while maintaining rigour in the investigation of adjacent domains of research. Having Dr. Tomaso Aste as adjunct supervisor was really a blessing, and I appreciated his breadth of knowledge, his perspective, his strong scientific background in statistical mechanics and his sense of humour.

Because of my engineering background and, more recently, my experience as an entrepreneur, it was important to me that my research was not abstract but closely linked to the business world. I would like to thank everyone I met throughout my doctoral journey, from disparate industry fields, for bridging the gap between the academic world and the business universe and for believing in the broad-ranging applications of my work to real business problems.

I would like to thank my colleagues and fellow students Dr. Raffaello Morales and Nicolò Musmeci for sharing the PhD experience with me and for being around to discuss some formal aspects of the theory of multiscaling and complex systems in general.

Also inspiring me on my journey was my old friend and fellow mathematician Guido Previde, whose shared interest in my work made him a great sounding board for ideas; he helped me work through some of the challenges I faced with both his knowledge and his sense of humour.

Last, but not least, on the academic list is Dr. Haris Dindo, whom I came to know about halfway through my research. As a great researcher in the field of Machine Learning and Cognitive Science, he has been an invaluable help in devising new perspectives on how to tackle some challenging parts of my research.

On a more personal note, I want to say thank you to my mother, who throughout so many years has been a constant source of optimism and fed me with a genuine ambition that got me where I am now.

Finally, I would like to thank my wife. Having heard me talk for many years about my plan to pursue a doctorate, it was she who pushed me out of the door to apply at King's College. I embarked on my research as a single man and I am now a husband with two children, and I am grateful that I started my PhD when I did, as I am not sure I would find the time now, or in the near future.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

Symbols

Terminology

1 Introduction

2 Complex Systems
  2.1 Introduction
  2.2 Multiscaling
  2.3 Scaling properties
  2.4 Generalized Hurst exponent
  2.5 Complex Networks
    2.5.1 Characteristics of Complex Networks
      2.5.1.1 Definition of a Network (Graph)
    2.5.2 Measures and Metrics
      2.5.2.1 Degree and density
      2.5.2.2 Diameter and average distance
      2.5.2.3 Connectivity
      2.5.2.4 Distance and Shortest Path
      2.5.2.5 Random Walk
    2.5.3 Centrality measures
      2.5.3.1 Degree centrality
      2.5.3.2 Eigenvector Centrality
      2.5.3.3 Closeness Centrality
      2.5.3.4 Betweenness Centrality
    2.5.4 Degree Distribution of a Graph


    2.5.5 Clustering
    2.5.6 Models of Complex Networks
    2.5.7 Random Networks
    2.5.8 Scale-free Networks
    2.5.9 Small-World Networks
    2.5.10 Filtered Networks
    2.5.11 Minimum Spanning Tree
    2.5.12 Planar Maximally Filtered Graph
    2.5.13 Scale-Free Networks Embedded in Hyperbolic Metric Spaces
  2.6 Summary

3 A Linguistic and Graph theoretic approach
  3.1 Introduction
  3.2 An epistemological challenge
    3.2.1 The Semantic Problem
    3.2.2 Graph-theoretic analysis of semantic networks
  3.3 The model
    3.3.1 The Paradigm of Distributional Hypothesis
    3.3.2 Conceptualization & Disambiguation
      3.3.2.1 Conceptualization
      3.3.2.2 Disambiguation
    3.3.3 Building a knowledge representation
  3.4 The Methodology
    3.4.1 Data availability and ingestion
  3.5 Natural Language Processing and Parsing models
    3.5.1 The probabilistic Context-Free Parser
    3.5.2 The Lexicalized dependency parser
    3.5.3 The Lexicalized probabilistic context-free parser
    3.5.4 Extracting a knowledge graph from a large corpus of descriptive language based data set
    3.5.5 Similarities and distance in a semantic network
      3.5.5.1 The overall structure of the graph
      3.5.5.2 The "shortest paths"
      3.5.5.3 Random walk: a stochastic semantic distance
  3.6 Summary

4 The semantic - graph theoretical framework as a model for biomedical drug repurposing
  4.1 Introduction
  4.2 The semantic data ingestion
    4.2.0.4 Dictionaries and ontology
    4.2.0.5 Use of Natural Language Processing
  4.3 Building Connections
    4.3.1 Retrieving the appropriate similarity measure

      4.3.1.1 Occurrences
      4.3.1.2 Co-occurrences
      4.3.1.3 Fine grained measures
      4.3.1.4 Coarse grained measures
    4.3.2 Semantic interrelation, Dependency, Similarity
    4.3.3 Inferences and paths
    4.3.4 Building inferential paths
  4.4 Biological mechanisms through shortest paths
    4.4.1 Direct links
    4.4.2 Indirect links: searching for shortest paths
      4.4.2.1 Paths from the first case
      4.4.2.2 Paths from the second case
      4.4.2.3 Paths from the second case but with symmetrized average weights
      4.4.2.4 Paths from the second case with symmetrized minimum weights
  4.5 Constrained shortest paths as an inference measure
    4.5.1 Shortest paths on the biomedical knowledge graph
    4.5.2 Constrained shortest paths in the data set under investigation
  4.6 Most abundant paths
    4.6.1 Shortest path as the most probable paths
    4.6.2 Stochastic measures. The Random Walk distance
      4.6.2.1 Biological mechanism from paths based on random walk distance
      4.6.2.2 Paths associated with larger levels of similarity from average random walk distance
      4.6.2.3 Paths associated with larger levels of similarity from maximum random walk distance
  4.7 Network approach to information filtering
    4.7.1 Overall structure analysis
    4.7.2 Information filtering
    4.7.3 Applying Filtered Networks Techniques
      4.7.3.1 Minimum spanning tree
      4.7.3.2 Planar Maximally Filtered Graph
    4.7.4 Analysing the structure of the induced semantic graph through Minimum Spanning Tree
      4.7.4.1 MST global structure and average random walk distance
      4.7.4.2 MST global structure and maximum random walk distance
  4.8 Summary

5 A further step into the biological model
  5.1 Introduction

    5.1.1 Biomedical corpus
    5.1.2 Biomedical background - Why Linguistics and Network Theory
    5.1.3 Information retrieval and construction of the knowledge graph
    5.1.4 Analysis of the Knowledge Graph
    5.1.5 Ranking paths using Random Walk
    5.1.6 Results
    5.1.7 The Path VIP - SARCOIDOSIS
    5.1.8 The Path α-MSH - SARCOIDOSIS
    5.1.9 The Path CNP - SARCOIDOSIS
    5.1.10 The Path IMATINIB - CREUTZFELDT-JAKOB Disease
    5.1.11 Acknowledgements
    5.1.12 Summary

6 A Semantic - graph theoretical approach for the analysis of financial and economic unstructured data
  6.1 Introduction
  6.2 Prior work
    6.2.1 Our approach
  6.3 From plain text to semantic graphs
    6.3.1 Assessing the strength of association between concepts
  6.4 Cluster identification
    6.4.1 Clustering
    6.4.2 Random walk clustering
      6.4.2.1 Random walk distance
      6.4.2.2 Community detection algorithm
  6.5 Cluster tracking
  6.6 Semantic Index
    6.6.1 Comparison with financial time series
    6.6.2 Correlation analysis
    6.6.3 A note on stationarity and correlation
  6.7 Building the Semantic Index
    6.7.1 Semantic index as an auto-regressive process with exogenous variables
      6.7.1.1 Multivariate case
    6.7.2 Fitting the parameters of the VARX model
    6.7.3 Semantic content of the CSI
  6.8 Summary
  6.9 Acknowledgments

Conclusions

Bibliography

List of Figures

2.1 Illustration of a graph with N = 5 nodes and E = 4 edges
2.2 Scale-free networks: cumulative degree distributions for six different networks
2.3 Small-world network: representation of a one-dimensional lattice network
2.4 Exemplification of T1 and T2 elementary moves on a triangulation: edge switching (T1) and vertex insertion and removal (T2)
2.5 Exemplification of a T1 elementary move

3.1 An example of a structured sentence parse tree representation
3.2 Projection of the Knowledge Graph onto a semantic space
3.3 An example of a structured parse tree showing Part of Speech tags according to the Stanford NLP parser
3.4 PCFG, Lexicalized dependency and Combined parser

4.1 Schematic representation of a biological Mechanism of Action (MoA)
4.2 Distribution of the frequency of keyword occurrence in a sample of 208,583 articles
4.3 Distribution of the frequency of keyword co-occurrence in a sample of 208,583 articles
4.4 Distribution of the number of co-occurrences
4.5 A global view of the network of non-zero co-occurrences
4.6 A schematic 'mask' used to mimic a Mechanism of Action, used as a constraint on a Constrained Shortest Path inference
4.7 All constrained paths peptide ← PROCESS → diseases
4.8 Distribution of the weights for the non-zero entries of the dependency matrix S
4.9 caption
4.10 Minimum Spanning Tree graph for the case study [LU]
4.11 PMFG network for the case study [LU]
4.12 Minimum Spanning Tree obtained from the co-occurrences on symmetric weights (average operator)
4.13 Minimum Spanning Tree obtained from the co-occurrences on asymmetric weights
4.14 Minimum Spanning Tree obtained from the co-occurrences on symmetric weights (max operator)
4.15 Global structure of the average random walk distances


4.16 Global structure of the maximum random walk distance

5.1 Centrality measures for the sample under examination: 3 million papers, 1606 concepts, comprising 127 peptides, 300 rare diseases and 1179 other biological entities, 1576 vertices, 158,428 edges
5.2 Conceptual outline of the knowledge graph building process
5.3 Path identification and selection
5.4 The Sarcoidosis knowledge network
5.5 The VIP - SARCOIDOSIS path and other closely related concepts
5.6 The α-MSH - SARCOIDOSIS path and other closely related concepts
5.7 The CNP - SARCOIDOSIS path and other closely related concepts
5.8 The Imatinib (GLEEVEC) - Creutzfeldt-Jakob Disease path and other closely related concepts

6.1 Example of the walktrap community detection algorithm
6.2 An example of the cluster tracking procedure on a synthetic example
6.3 Semantic Index computed on the 2007 Thomson Reuters news headlines
6.4 Pearson correlation coefficient between the DJIA time series and five different semantic time series
6.5 Cross-correlation coefficient between the DJIA time series and five different semantic time series
6.6 Correlation analysis between different financial indicators and semantic indices on the 2007 data
6.7 Comprehensive Semantic Index fitted on the 2003-2011 DJIA series
6.8 Comprehensive Semantic Index fitted on the 2003-2011 S&P 500 series
6.9 Comprehensive Semantic Index fitted on the 2003-2011 VIX series

List of Tables

4.1 The twenty most abundant co-occurrences within the sample under examination
4.2 Hub vertices with largest degree

To my daughters Emma and Julia

∀φ [P(¬φ) ↔ ¬P(φ)]

∀φ ∀ψ [(P(φ) ∧ □∀x[φ(x) → ψ(x)]) → P(ψ)]

∀φ [P(φ) → ♦∃x φ(x)]

G(x) ↔ ∀φ [P(φ) → φ(x)]

P(G)

♦∃x G(x)

φ ess. x ↔ φ(x) ∧ ∀ψ (ψ(x) → □∀y (φ(y) → ψ(y)))

∀x [G(x) → G ess. x]

NE(x) ↔ ∀φ [φ ess. x → □∃y φ(y)]

P(NE)

□∃x G(x)

From: "Formalization, Mechanization and Automation of Gödel's Proof of God's Existence", C. Benzmüller and B. Woltzenlogel Paleo, 2013

K. Gödel, Collected Works, Volume III: Unpublished Essays and Letters (Ontological Proof), Oxford University Press, 1970

Symbols

E(·) - expected value
f - probability density function
F< - cumulative distribution function
F> - complementary cumulative distribution function
H - Hurst exponent
H(q) - generalized Hurst exponent
Km - Bernoulli random variable
ℓ - lag
Mq(t, τ) - q-moments
M*q(t, τ) - empirical q-moments
n - hierarchical order
pm - probability associated with the Bernoulli variable Km
rt - log-returns
Pt - price at time t
St - a generic stochastic process
X - a generic random variable
x - a realisation of X
α - tail exponent, or generic exponent of exponential functions f(x) ∼ e^(−αx)
β - ACF decay exponent
γ - tail exponent, or generic exponent of power-law functions f(x) ∼ x^(−γ)
κ - kurtosis
κ̃ - excess kurtosis
λ - eigenvalue
ζq - scaling function
ζ*q - empirical scaling function
σt - volatility
Σ - covariance matrix
Σij - covariance matrix entries
τ - time scale
i, j - indices labelling individual elements/vertices of a data set/graph; when i and j are reserved for some objects, u and v label further objects
ij - an edge between the vertices i and j; if the graph is directed, the edge is written with an arrow over the pair
N - number of elements/vertices in a data set/network
G(V, E) - a graph with vertex set V (i, j ∈ V) and edge set E (ij ∈ E); for weighted graphs this becomes a triplet G(V, E, W) including the weight set wij ∈ W, or a quadruple G(V, E, W, D) including the edge distance set dij ∈ D; in short it is denoted G, with an arrow for directed graphs
Aij - adjacency matrix of a network/graph, specifying linking information between vertices i and j; when the label of a specific graph needs highlighting, it takes the form AG; for linking information between sub-graphs of G, the extension A(Si, Sj) is used, where Si and Sj are disjoint sub-graphs of G; in general, capitals of the form Mij denote matrices of pair-wise entries
P(k) - degree distribution of a network/graph; in general, P is reserved for distributions of stochastic variables
Ci - clustering coefficient of vertex i
lij - shortest path length between vertices i and j; the directed path length is written with an arrow
xi - the ith data point in a specific proximity space; all data points belong to a set, xi ∈ X
D, S - dissimilarity and similarity matrices, where D(xi, xj) and S(xi, xj) denote pairwise dissimilarity and similarity between the ith and jth data points; D is sometimes also used for the diameter of a network
log - logarithm to base e; logarithm to base n is written logn

Terminology

Vertex - A vertex is a constituent of a graph and represents an individual element of a complex system.

Edge - An edge is a link between a pair of vertices, representing a relevant interaction between the corresponding elements of a complex system.

Graph/Network - A graph is a discrete mathematical object that consists of vertices and edges linking the vertices pair-wise. The term 'graph' has been widely used in the discrete mathematics literature, whereas 'network' has been used extensively in the physics literature. In this thesis, I use the two words interchangeably.

Subgraph - A subgraph is a part of a graph consisting of a subset of the vertices and a subset of the edges linking them.

Community/Cluster - In the network literature, a sub-graph characterized by a distinctive topological property is called a 'community'. In the clustering literature, the term also covers general subgroups of elements which share particular patterns/features. In this thesis, I use the two words interchangeably where appropriate.

Path - A path is a sequence of connected edges starting from a vertex i and terminating at a vertex j.

Planar graph - A planar graph is a graph whose topology can be drawn on a topological sphere without edge-crossings.

Hyperbolic surface - A hyperbolic surface is a generalization of a spherical surface with g handles attached; g is called the genus.

Hyperbolic graph - A hyperbolic graph is a generalization of a planar graph: a graph that can be drawn on a hyperbolic surface.

1 Introduction

“In God we trust, all others bring data”

– W. Edwards Deming

The amount of information produced nowadays has reached a level that only a couple of decades ago was unimaginable. In 2012 the global data supply reached 2.8 zettabytes (ZB), or 2.8 trillion GB, but just 0.5% of this was used for analysis. Volumes of data are projected to reach 40 ZB by 2020, or 5,247 GB per person, with emerging economies accounting for an increasingly large proportion of the world's total [2]. But it is not only the increasing amount of data that floods us every minute; the key characteristic is in fact its complexity, and that makes the interpretation of hidden information challenging, because of the unstructured format and the speed at which data is gathered and analysed.

Many tools and modern business intelligence methods are quite common these days and generally allow analysts to delve into complex datasets and provide general answers to predetermined questions; the emergent field of data science, by contrast, is concerned with finding the questions that should be asked when huge and often unstructured data is collected, data which leads to non-conclusive and sometimes hidden results.


Moreover, much of the complexity we face today determines the general difficulty of finding consistent rules that explain certain queries. In fact, we can generally assume that complex answers derive from complex phenomena; problems that are difficult to solve are often hard to understand because causes and effects are not obviously related, and when the number of correlated variables becomes overwhelming, any traditional approach to analysing a problem becomes insufficient to provide a thorough solution.

The field of complex systems has become very useful in understanding unknown and non-evident effects between elements of a system, and its study provides a number of sophisticated tools to deal with these systems and their complexity; these provide a very useful framework for capturing certain features and emergent properties that would otherwise remain unknown.

Emergent properties therefore become key in the interpretation of knowledge. Emergent characteristics in a complex system arise from the identification of higher-level properties and from the dynamic behaviour of the individual components of the system itself. For this reason, emergent properties are properties of the "whole" that are not strictly driven by any of the individual parts making up that whole; such phenomena exist in various domains and can be described using complexity concepts and thematic knowledge [3]. We are witnessing several studies focussing on complex modelling of dynamic or behavioural systems in the attempt to identify relevant emergent properties.

To this aim, delving into the main characteristics of Complex Systems and their features is key: we look into the intrinsic memory of a phenomenon due to its dynamic nature, and we analyse the non-linear and scaling patterns, trying to create a predictability model by tracking over time the non-equilibrium status of their intimate open-system nature. Complexity becomes the common frontier of the physical, biological and social sciences, and this leads us to investigate how subsystems self-organise into new emergent structures. Outcomes will include new technologies and tools [4].

The non-linear nature of complex systems, and the fact that these systems are open, i.e. they interchange information with the environment and constantly modify their internal structure and patterns of activity in the process of self-organization, suggests that they are flexible and easily adapt to variable external conditions. Furthermore, the emergent phenomena that cannot be derived solely from knowledge of the systems' structure and of the interactions among their individual elements require a more intimate analysis of the different levels of organization (multi-scaling). There is strong evidence that different complex systems can have similar characteristics both in their structure and in their behaviour. One can thus expect the existence of some common, universal laws that govern their properties [5, 6].

In this research work, I have approached two types of complex data sets and used a number of instruments to analyze their emergent behaviour. I examined statistical correlations and topological measures in financial time series and investigated how best to describe and quantify these phenomena, expanding the footprint of such investigations to other disciplines. In this respect, I looked at how mechanisms of action mapped within a biological data set emerge when examined under a stochastic approach, and at a method to build a semantic index representing financial or economic indicators starting from a large corpus of unstructured data.

In particular, I propose a methodology to construct a graph-based knowledge framework derived from semantically processing large corpora of unstructured data expressed in natural language, representing two different complex systems in the fields of biology and economics. By leveraging the graph-theoretic platform, the goal of this thesis is to address to what extent this methodology reflects the complexity and the content of meaningful information hidden in the data (emergent properties), and to develop tools that highlight that meaningful information.

The Thesis is structured as follows. Chapter 2 provides an overview of complex systems, addressing both properties such as multiscaling and the types of complex networks, focussing on measures and metrics and, in general, on statistical models of network topology, which play central roles in the complex networks literature and provide insights for understanding the dynamics and underlying mechanisms governing the respective complex systems. Moreover, a graph-theoretic filtering technique is introduced as an instrument to extract meaningful information from complex systems.

In Chapter 3 I review the linguistic theory behind the semantic approach that has been widely used in my research work. I introduce the basics of the Distributional Hypothesis as a fundamental component of extracting and correlating concepts from descriptive language sources, and the various techniques to construct a semantic graph. I then describe the statistical measures of semantic graphs and how these are applied to real-world cases.

Chapter 4 introduces a semantic - graph theoretical framework built on a real case analysing a biomedical and pharmacological data set. This chapter describes how the framework leverages computational linguistics and graph theory in the context of a biological complex system, and shows techniques derived from graph theory, such as information filtering, analysis of connections, paths and stochastic distances, that treat the induced biological network with the goal of simulating a biological Mechanism of Action.

In Chapter 5, I expand the framework of Chapter 4 by applying the semantic graph-theoretical model to an extended biomedical data set, and I show how this can efficiently serve as a framework for repurposing existing drugs towards diseases for which they were not initially intended. Full details are reported in the article published by Gramatica et al. [1].

Chapter 6 describes a second real-world case, which consists of applying the framework previously presented to an economic system. A large corpus of unstructured data describing economic, political and financial information is extracted from a historical database of Thomson Reuters (http://www.reuters.com/), providing an instrument to analyse unstructured complex data. I discuss a methodology to construct a quantitative semantic index readily comparable to real financial and economic indicators, utilising computational linguistics, machine learning techniques and a graph theoretical framework.

2 Complex Systems

“Science is a differential equation, Religion is a boundary condition”

– Alan Turing

2.1 Introduction

Complex systems are difficult to interpret and their dynamics hard to predict. They obey the laws of physics, although their behaviour goes well beyond the simple rationalisation of classic models, as each element of a complex system participates in many different interactions, generating emergent properties [7].

One of the theoretical approaches to tackling such complexity is to understand the dynamics of how the systems transform. However, another way of explaining a complex phenomenon, beyond the analysis of its sequential transformations, is to look at the whole dynamic of a system as a reactive system [8]; this allows us to observe that the system does not behave according to a pre-programmed chain of simple events, but rather reacts in parallel to many concurrent inputs, and its behaviour, outputs and range of effects are not just a function of its inputs but also of their frequency, magnitude, history and the order in which they arrive.

In this sense, functions performed by a system that are not the result of a single element in the system, but rather of interacting elements, are said to be emergent properties. This characteristic leads us to consider that the output of a system's transformational process is not ruled by single lower-scale elements which are part of the system itself; emergence, in other words, is a matter of scale, i.e. interactions at one scale create objects at a higher scale.

A system is defined as complex if its behaviour crucially depends on the details of the system [6]. However, finding a unique definition of Complex Systems is a daunting task, and most of the time it is easier to describe such phenomena by summarising their main characteristics. A complex system is a system built on a large number of elements capable of interacting both with each other and with their environment; such interactions occur either with neighbours or with distant elements. The common characteristic of all complex systems is that they display organisation without any external "organising" principle being applied. As suggested earlier, the whole is much more than the sum of its parts [6, 9].

In this chapter I discuss certain characteristics of Complex Systems, particularly focussing on:

1. the multiscaling properties of complex data sets;
2. the structural features of networks and the topological properties of graphs in general, as a model for analysing complex systems.

Throughout this Thesis I will use the terms Network - Graph and Vertex - Node interchangeably.

2.2 Multiscaling

As the scaling analysis of complex data sets was the ignition point of my research efforts, which eventually expanded through adjacent areas of complex systems analysis and led to this Thesis, I will now introduce a few relevant concepts about this subject as an introduction to the part of the theory studied through my research. In particular, during this phase, I focussed on financial complex data sets to investigate their multiscaling characteristics, and here I give a brief background of the theory and instruments explored.

The analysis of time series is fundamental in the study of many disciplines, like economics, finance, biology, physics, etc., and the main objective of studying time series is the discovery of regularities in apparently unpredictable signals and the construction of a synthetic model which can be used to forecast and understand the intimate mechanisms governing the processes under study. Discovering regularities means, from the perspective of a scientist who analyses an event at different scales, looking for similarities of local/global measures which are expressed in the form of a scaling law, defined as a power law with a scaling exponent α describing the behaviour of a quantity F as a function of a scale parameter s: F(s) ≈ s^α, for a large range of s values [10, 11]. The concept of scaling leads to the important concept of a fractal, which represents a system characterised by a scaling law with a unique non-integer scaling exponent. The uniqueness of the scaling exponent simplifies the fractal modelling approach to complex data sets; modelling in turn becomes more difficult in the presence of time series displaying multi-fractal characteristics, i.e. scaling properties that are not described by a single number but rather defined by a function of scaling exponents [12-14].

As mentioned earlier, by analysing scaling regularities, for example of financial data sets, it is possible to derive interesting information concerning the underlying mechanisms that generate data trends, and consequently to aim at building a theoretical model that can eventually reproduce certain behaviours [15, 16], useful for predicting trends or analysing certain events [17].

In this respect, I have investigated the scaling concept, which has its origin in physics but is increasingly applied to other disciplines [18-22]. In recent years, the application of the scaling concept to financial markets has increased considerably, also as a consequence of the abundance of available data [18]. In this part of my research I focussed on the scaling properties of different financial market data by computing the scaling exponents; the following paragraphs include a brief description of the scaling properties and of the techniques used to compute the scaling exponents (namely the Generalized Hurst exponent method). The main goal of these studies was to understand the very large fluctuations in the market resulting in particular values of the scaling exponents, and to understand how markets react to such shocks and how different markets react to the same crisis.

2.3 Scaling properties

The scaling properties of time series have been studied by means of several techniques [23], and many estimators have been proposed and used to investigate scaling properties in the financial and economic literature. Let us start with the seminal work on rescaled range statistical analysis, R/S [24, 25], which gives an estimator for the Hurst exponent. Indeed, the rescaled range statistical analysis (R/S analysis) was first introduced by Hurst himself to describe the long-term dependence of water levels in rivers and reservoirs. It provides a sensitive method for revealing long-run correlations in random processes and can distinguish uncorrelated from correlated time series. What mainly makes the Hurst analysis appealing is that all this information about a complex signal is contained in one parameter only: the Hurst exponent.
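To make the procedure concrete, here is a minimal numerical sketch of the classical R/S estimation of H (an illustrative implementation, assuming non-overlapping windows and a plain least-squares fit; it is not the exact estimator used in the works cited above, and all names are mine):

```python
import numpy as np

def rs_hurst(x, window_sizes):
    """Estimate the Hurst exponent of a series x via classical R/S analysis."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_vals = []
        # split the series into non-overlapping windows of length n
        for start in range(0, len(x) - n + 1, n):
            w = x[start:start + n]
            z = np.cumsum(w - w.mean())   # cumulative deviations from the window mean
            r = z.max() - z.min()         # range of the cumulative deviate series
            s = w.std(ddof=1)             # window standard deviation
            if s > 0:
                rs_vals.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_vals)))
    # (R/S)_n grows as c * n^H, so H is the slope of log(R/S) versus log(n)
    H, _ = np.polyfit(log_n, log_rs, 1)
    return H

# i.i.d. Gaussian noise should give H close to 0.5
rng = np.random.default_rng(0)
print(rs_hurst(rng.standard_normal(10_000), [16, 32, 64, 128, 256, 512]))
```

The fragility discussed next (sensitivity of the range to outliers and to short memory) is visible directly in this sketch: the quantity r depends only on the extremes of the cumulative deviate series.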

The original Hurst R/S approach is very sensitive to the presence of short memory, heteroskedasticity and multiple-scale behaviours. Such a lack of robustness has been largely discussed in the literature (see for instance [26-29]) and several alternative approaches have been proposed. The fact that the range relies on maxima and minima also makes the method error-prone, because any outlier present in the data has a strong influence on the range.

Lo [26] suggested a modified version of the R/S analysis that can detect long-term memory in the presence of short-term dependence [30]. The modified R/S statistic differs from the classical R/S statistic only in its denominator, adding some weights and covariance estimators to the standard deviation [31]. In this modified R/S, a problem is choosing the truncation lag q. Andrews [32] showed that when q becomes large relative to the sample size N, the finite-sample distribution of the estimator can be radically different from its asymptotic limit. However, the value chosen for q must not be too small, since the autocorrelation beyond lag q may be substantial and should be included in the weighted sum. The truncation lag must therefore be chosen with some care. Despite these difficulties, several authors still use this estimator, trying to address Lo's critique and proposing filtering procedures [33, 34].

In recent years there has been a proliferation of papers proposing different techniques and providing comparative studies between them [35]. Let us mention the most popular ones: the detrended fluctuation analysis (DFA) [36-48] and its generalization [49, 50]; the moving-average analysis technique [51] and its comparison with the DFA [52, 53]; the periodogram regression (GPH method) [54]; the (m, k)-Zipf method [55]; the Average Wavelet Coefficient Method [56-59]; and the ARFIMA estimation by exact maximum likelihood (ML) [60-62]. Let us stress that no method exists whose performance has no deficiencies: the use of each of the above-mentioned estimators is subject to both advantages and disadvantages. For instance, simple traditional estimators can be seriously biased. On the other hand, asymptotically unbiased estimators derived from Gaussian ML estimation are available, but these are parametric methods which require a parameterized family of model processes to be chosen a priori, and which cannot be implemented exactly in practice for large data sets due to high computational complexity and memory requirements [63-65]. Analytic approximations have been suggested (the Whittle estimator), but in most cases (see [66]) computational difficulties remain, motivating a further approximation: the discretization of the frequency-domain integration. Even with all these approximations the Whittle estimator retains a significantly high overall computational cost, and problems of convergence to local minima rather than to the absolute minimum may also be encountered. In this framework, connections to multi-scaling/multi-affine analysis (the q-order height-height correlation) have been made in various papers, e.g. [67-69].

Most of the work in the 1900s on large complex data sets showed a general consensus on the Brownian motion approach [70] as a stochastic scaling model to describe the behaviour of financial assets (e.g. prices or returns); these models, although innovative, assumed a normal distribution with stable mean and finite variance. However, over the years empirical evidence demonstrated that returns are not normally distributed, but have a higher peak around the mean and fatter tails [71-73]. Later on, further sophistications of that embryonic approach considered fractional Brownian motion [74-76], involving fractal analysis; however, the presence of several scaling features in the analysis of complex data sets, like the financial ones, proved that a robust modelling framework was more complex [77]. For this reason, new approaches focussed on the estimation of the tail index [72], showing how the tails of the distribution of returns behave as a function of the size of the movement; moreover, the definition of a scaling exponent related to the fractal dimension (the Hurst exponent) was deemed to be an effective indication of the behaviour of volatility measures (variance of returns, absolute value of returns, etc.) as a function of the time interval over which such series of returns are measured [78].

Focussing therefore on the more complex scenario of a multi-fractal data set (in the following section the data set under consideration is a time series), we introduce the definition of a multi-scaling stochastic process X(t) (or multi-fractal process): a stochastic process X(t) exhibits multi-scaling characteristics if it has stationary increments and satisfies the following equation:

E(|X(t)|^q) = c(q) t^{τ(q)+1} ,   (2.1)

(referenced from [79]).

The function τ(q) is defined as the scaling function of the multi-scaling process X(t). When τ(q) is determined by a unique parameter identifying the slope, the associated stochastic process is called uniscaling or unifractal: for uniscaling processes, τ(q) is a linear function of q, identified by a single exponent, and therefore all sizes of fluctuations display the same scaling law. In general, when a multiscaling process shows fluctuation behaviour one can observe the q exponent and empirically measure that the higher q, the larger the fluctuations [11]. Moreover, a time-dependent process X(t) is said to be self-affine when it shows fluctuations on different timescales that can be re-scaled so that the original signal X(t) is statistically equivalent to its re-scaled version c^{-H} X(ct), ∀c > 0. Similarly, substituting the invariance condition X(t) = t^H X(1) into Eq. 2.1 leads to E(|X(t)|^q) = t^{qH} E(|X(1)|^q); Eq. 2.1 still holds with c(q) = E(|X(1)|^q) and τ(q) = qH − 1, showing that τ(q) is a linear function and the process is therefore uniscaling. H is the Hurst exponent (described below) and represents the self-affinity index, also called the scaling exponent, of the process X(t). For non-linear τ(q) we are in the realm of stochastic multi-scale processes.
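Spelling out the substitution above as a short worked derivation (a restatement of the step just described, using Eq. 2.1):

```latex
% Self-affinity: X(ct) is statistically equivalent to c^{H} X(t) for all c > 0;
% choosing c = 1/t gives the invariance condition X(t) \overset{d}{=} t^{H} X(1), hence
\mathbb{E}\bigl(|X(t)|^{q}\bigr) = t^{qH}\,\mathbb{E}\bigl(|X(1)|^{q}\bigr).
% Matching this with Eq. (2.1), \mathbb{E}(|X(t)|^{q}) = c(q)\, t^{\tau(q)+1}, yields
c(q) = \mathbb{E}\bigl(|X(1)|^{q}\bigr), \qquad \tau(q) = qH - 1,
% so \tau(q) is linear in q and the process is uniscaling.
```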

The Hurst exponent H is a statistical measure used to classify time series: the larger H, the stronger the trend of the series, hence providing a measure of predictability.

Formally, let X(t) be a Gaussian random process with E(X(t)) = 0, second moment E(X²) = 1, and auto-correlation function C(∆t) = E[X(t) · X(t + ∆t)]. The asymptotic behaviour of C(∆t) as ∆t → ∞ quantifies the presence or absence of long-range dependence: if C(∆t) ∼ |∆t|^{−β} as ∆t → ∞ for β ∈ (0, 1), then the process X(t) has long memory with Hurst exponent H = 1 − β/2.

The short-lag behaviour of the same autocorrelation function characterises instead the local profile roughness of the process X(t), measured by its fractal dimension D: if the correlation function behaves as 1 − |∆t|^α as |∆t| → 0, with α ∈ (0, 2], then we can associate with the correlation function the fractal dimension D = 2 − α/2.

A very important relation links the fractal dimension D with the Hurst exponent H: H + D = 2, where these parameters can be used to exhibit the hidden statistical properties of a random process. Observing many natural phenomena in the study of complex systems, it is not unusual to measure a Hurst exponent H in the region of 0.72 and a fractal dimension D of about 1.28; conversely, statistical processes composed of n i.i.d. variables {xi} with finite variance show H ≃ 0.5 and D ≃ 1.5.
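As a concrete instance of the relation H + D = 2, consider fractional Brownian motion (a standard example, stated here for illustration only):

```latex
% For fractional Brownian motion B_H(t) the increments satisfy
\mathbb{E}\bigl[\,|B_H(t+\Delta t) - B_H(t)|^{2}\,\bigr] = |\Delta t|^{2H},
% so, up to a constant factor, the correlation behaves locally as
% 1 - |\Delta t|^{\alpha} with \alpha = 2H, and the fractal dimension of the graph is
D = 2 - \tfrac{\alpha}{2} = 2 - H \quad\Longrightarrow\quad H + D = 2 .
```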

It was observed [80] that whenever H deviates from the value 0.5 we are in the presence of a "move" in the data evolution. In particular, 0.5 < H < 1 (and 1 < D < 1.5) indicates persistent behaviour: for example, during a growth period with H in this range, it is likely that such growth will persist for another period. For 0 < H < 0.5 we have, on the contrary, anti-persistent behaviour. For H = 1 we can observe a straight line from the series and the process is likely to be totally predictable, while for H → 0 we have total unpredictability and the process exhibits white-noise behaviour.

However, although the Hurst exponent H and the fractal dimension D are strictly related, they are quite independent of one another, as D represents a local property whereas H captures global characteristics at the large scale. For self-similar processes, however, local and global properties coincide [81, 82].

What we have seen thus far represents the global scaling characteristics of a process; however, we can also define the local scaling properties of a process by means of the Hölder exponent α(t) [80]. Given a stochastic process X(t), we have:

|X(t + dt) − X(t)| ∼ C_t (dt)^{α(t)} .   (2.2)

A single Hölder exponent α(t) represents a uniscaling process, whereas a continuum of local Hölder exponents determines a multiscaling phenomenon.

In the Econophysics literature, the two most widely used methods to directly quantify the multifractal properties of a time series are the Multifractal Detrended Fluctuation Analysis (MF-DFA) [83] and the Generalized Hurst Exponent (GHE) [84]. Both methods aim at estimating the scaling function ζq by means of a scaling exponent which is a q-dependent generalisation of the Hurst exponent H. Barunik and Kristoufek [85] show that the GHE method outperforms the MF-DFA in returning an estimate of the scaling exponent with lower variance and bias, regardless of the presence of heavy tails in the data. For this reason we have chosen this method in order to give a quantitative description of multifractality. We describe the GHE method in the next section.

2.4 Generalized Hurst exponent

The generalized Hurst exponent method [14, 86] is essentially a tool to study directly the scaling properties of the data via the qth-order moments of the distribution of the increments. The q-order moments are much less sensitive to outliers than the maxima/minima, and different exponents q are associated with different characterizations of the multi-scaling complexity of the signal. This type of analysis combines sensitivity to any type of dependence in the data with a computationally straightforward and simple algorithm.

The Hurst analysis examines whether some statistical properties of a time series x_k (with k = ν, 2ν, ..., kν, ..., T) scale with the time-resolution (ν) and the observation period (T). Such a scaling is characterized by an exponent H which is commonly associated with the long-term statistical dependence of the signal. A generalization of the approach proposed by Hurst should therefore be associated with the scaling behaviour of statistically significant variables constructed from the time series. In this case, the qth-order moments of the distribution of the increments are used [87, 88]. This is a good quantity to characterize the statistical evolution of a stochastic variable X_k. It is defined as:

K_q(τ) = ⟨|X_{k+τ} − X_k|^q⟩ / ⟨|X_k|^q⟩ ,   (2.3)

where X_k is the detrended signal and the time interval τ can vary between ν and τ_max. (Note that, for q = 2, K_q(τ) is proportional to the autocorrelation function a(τ) = ⟨X_{k+τ} X_k⟩.)

The generalized Hurst exponent H(q)¹ can be defined from the scaling behaviour of K_q(τ), if it follows the relation [89]:

K_q(τ) ∼ (τ/ν)^{qH(q)} .   (2.4)

Within this framework, two kinds of processes can be distinguished: (i) a process with H(q) = H, constant and independent of q; (ii) a process with H(q) not constant. The first case is characteristic of uni-scaling or uni-fractal processes, whose scaling behaviour is determined by a unique constant H that coincides with the Hurst coefficient or the self-affine index. This is indeed the case for self-affine processes, where qH(q) is linear (H(q) = H) and fully determined by its index H. In the second case, when H(q) depends on q, the process is commonly called multi-scaling (or multi-fractal) [90, 91], and different exponents characterize the scaling of different q-moments of the distribution. Therefore, the non-linearity of the empirical function qH(q) is a solid argument against Brownian, fractional Brownian, Lévy, and fractional Lévy models, which are all additive models and therefore give for qH(q) straight lines or portions of straight lines.

For some values of q, the exponents are associated with special features. For instance, when q = 1, H(1) describes the scaling behaviour of the absolute values of the increments. The value of this exponent is expected to be closely related to the original Hurst exponent, H, which is indeed associated with the scaling of the absolute spread of the increments. The exponent at q = 2 is associated with the scaling of the autocorrelation function and is related to the power spectrum [92]. A special case is associated with the value q = q* at which q*H(q*) = 1; at this value of q, the moment K_{q*}(τ) scales linearly in τ [87]. Since qH(q) is in general a monotonically growing function of q, all the moments K_q(τ) with q < q* will scale slower than τ, whereas all the moments with q > q* will scale faster than τ; the point q* is therefore a threshold value. Clearly, in the uni-fractal case H(1) = H(2) = H(q*). All these quantities are equal to 1/2 for Brownian motion, and equal to H ≠ 0.5 for fractional Brownian motion. However, for more complex processes these coefficients do not in general coincide.

¹ We use H without parentheses for the original Hurst exponent and H(q) for the generalized Hurst exponent.

This method produces a set of generalised Hurst exponents H(q), similar to the h(q) of the MF-DFA technique, that characterise the scaling properties of the time series at different orders q. Note, however, that in this case the value of q cannot be negative: the function K_q(τ) defined in equation 2.3 diverges when q is negative.
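As with the R/S sketch earlier, the estimation in Eqs. 2.3-2.4 can be illustrated with a minimal implementation (an illustrative sketch, not the exact code of [14, 86]; the range of τ and the single log-log fit are simplifying assumptions, and the input is assumed to be already detrended):

```python
import numpy as np

def generalized_hurst(x, q_values, tau_max=19):
    """Estimate H(q) from the scaling of the q-moments K_q(tau) of Eqs. 2.3-2.4."""
    x = np.asarray(x, dtype=float)      # assumed to be already detrended
    taus = np.arange(1, tau_max + 1)
    H = {}
    for q in q_values:                  # q must be positive: K_q diverges for q < 0
        norm = np.mean(np.abs(x) ** q)  # denominator <|X_k|^q> of Eq. 2.3
        log_kq = [np.log(np.mean(np.abs(x[t:] - x[:-t]) ** q) / norm) for t in taus]
        # K_q(tau) ~ (tau/nu)^{qH(q)}: the log-log slope estimates q*H(q)
        slope, _ = np.polyfit(np.log(taus), log_kq, 1)
        H[q] = slope / q
    return H

# Brownian motion is uni-scaling: H(q) should be close to 0.5 for every q
rng = np.random.default_rng(1)
bm = np.cumsum(rng.standard_normal(20_000))
print(generalized_hurst(bm, q_values=[1, 2, 3]))
```

A roughly constant H(q) across q signals a uni-fractal process, whereas a clear dependence of H(q) on q indicates multi-scaling, as discussed above.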

2.5 Complex Networks

We discussed earlier that complex systems, like those within biology, economics or the social disciplines, to name a few [93], are comprised of many interacting elements, and we have seen that the study of an individual property could be misleading if not properly compared with the information from the rest of the elements constituting the system. Moreover, one of the most significant breakthroughs in complex systems studies has been the discovery that all these systems share similar structures in the network of interactions between their constituent elements [9, 93, 94]. The study of these similarities has led to an ongoing development of network models and graph-theoretical analysis techniques instrumental to characterising and understanding complexity.

The birth of network (or graph) theory is due to the famous mathematician Euler and his formulation and solution of the celebrated Königsberg bridge puzzle [95]. Without entering into the details of Euler's famous problem, we can emphasise here that Euler noticed that physical distance was not the important factor in the description of the problem; rather, it was the representation of the topological constraints of the problem itself in the form of a graph. Euler's work became of seminal importance in identifying topological properties as the key to solving complex problems, thus opening a new field of mathematics that turned into a powerful tool for analysing complex systems. If Euler was the first to formalise graph theory, the development of the discipline is due to Paul Erdős and Alfréd Rényi, who pursued the theoretical analysis of the properties of graphs and random graphs, obtaining a number of important results by introducing the concept of probability into Euler's original work [96, 97].

In this section I will recall the main characteristics of a network (graph) and the main properties that have been widely used throughout my research work and this Thesis; I will omit the basic definitions of network components and network taxonomy, e.g. the definition of a node, a link, and types of networks such as directed, undirected, bipartite, etc., although a vast literature can be found in introductory material such as Newman's work [94, 98].

2.5.1 Characteristics of Complex Networks

As we are dealing with complex phenomena, one of the most productive approaches in the study of complex networks is to characterize the topology of a network at multiple scales of complexity; this can be done through graph-theoretical measures at a local or global level, which together provide an important instrument in the analysis of a complex system representation [99].

Over the past few decades much effort has been put into the search for the underlying laws governing the dynamics and evolution of complex systems, and scientists have adopted a systematic analysis and characterization of many of their aspects; in particular, several studies have focused on how correlated data can be represented by means of a rigorous instrument such as a graph, and this network representation has been thoroughly explored. Generally speaking, when representing a complex system through a network, this can be defined as an abstract graph-shaped mathematical representation whose vertices (nodes) identify the elements of the observed phenomenon and whose connecting links (edges) translate the relationships amongst those elements.

Figure 2.1: Illustration of a graph with N = 5 nodes and E = 4 edges. The set of nodes is V = {1, 2, 3, 4, 5} and the edge set is E = {{1, 2}, {1, 5}, {2, 3}, {2, 5}}, so that G = (V, E). Adapted from [100].

By utilizing such a graph-theoretic platform, we are able to discover the emerging properties of a complex system and its hidden information, and to recognize several characteristics of the underlying complex phenomena; the mathematical tools available then provide interesting ways to derive information about the emergence of power-law distributions, the robustness of systems, and the scaling characteristics of the network in its fundamental properties. In this section I will describe the main properties observed in real-world complex networks and in particular in scale-free networks.

2.5.1.1 Definition of a Network (Graph)

We consider here undirected graphs (we make no distinction between (u, v) and (v, u)) since most classical properties are defined on such graphs only. Formally, given a graph G(V, E), with V defining the vertex set and E the edge set, we will assume that V = {1, ..., N}. With this notation in mind, the most natural matrix associated with a graph G is the adjacency matrix A, which has been universally adopted as the formal way to represent a graph.

The adjacency matrix, with entries $a_{ij}$, is defined as the $n \times n$ matrix such that

$$a_{ij} = \begin{cases} 1 & \text{if an edge exists between vertex } i \text{ and vertex } j \\ 0 & \text{otherwise} \end{cases} \qquad (2.5)$$

Similarly, using the values of the edges in a weighted network, we can construct a matrix of weights W whose entries $w_{ij}$ (i, j = 1, ..., N) are the weights of the connections between vertices. Now that a network is formally defined, let us look at the set of measures and metrics that help quantify network structures.
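As a minimal illustration (my own sketch, not code from the thesis), the adjacency matrix of the graph in Figure 2.1 can be built and queried with numpy; the node labels are shifted to 0-4 for indexing:

```python
import numpy as np

# Edge set of the graph in Figure 2.1, with nodes relabelled 0..4.
edges = [(0, 1), (0, 4), (1, 2), (1, 4)]

n = 5
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1          # undirected: A is symmetric

degrees = A.sum(axis=1)            # degree of each vertex
density = A.sum() / (n * (n - 1))  # 2E / (V(V-1)); A.sum() counts each edge twice

print(degrees)   # [2 3 1 0 2]
print(density)   # 0.4
```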

2.5.2 Measures and Metrics

There are several measures applicable to a graph that describe certain features of a network's topology; the following are the best known and the ones used extensively throughout my research.

2.5.2.1 Degree and density

The degree $d^\circ(v)$ of a node v is its number of links or, equivalently, its number of neighbours: $d^\circ(v) = |N(v)|$; the average degree of a graph is the average over all its nodes: $\overline{d^\circ} = \frac{1}{N}\sum_v d^\circ(v)$.

The density of a graph G(V, E) is defined as the number of edges in the graph divided by the total number of possible links, $\delta = \frac{2E}{|V|(|V|-1)}$, and it indicates to what extent the graph is fully connected (all the links exist). In particular, it gives the probability that two randomly chosen nodes are linked in the graph.

We can define a relation between the average degree and the density: $\overline{d^\circ} = \delta(n-1)$. In general the average degree of complex networks is small and independent of the sample size; this implies that the density δ goes to zero when the sample grows, since $\delta = \frac{\overline{d^\circ}}{n-1}$ [101, 102].

More generally, the degree $k_i$ of the i-th vertex is defined as the number of edges attached to vertex i. A similar quantity is the vertex strength $s_i$, defined as the sum of the weights of the edges linked to vertex i. For unweighted networks these two quantities coincide at each vertex; on the other hand, edge weights add another layer of complexity, and the two should be addressed separately.

2.5.2.2 Diameter and average distance

We define d(i, j) as the distance between vertices i and j, i.e. the number of links on a shortest path between them. We define $d(i) = \frac{1}{n}\sum_j d(i,j)$ as the average distance from i to all nodes, and $d = \frac{1}{n}\sum_i d(i)$ as the average distance in the graph. Finally, we define $D = \max_{i,j} d(i,j)$, the diameter of the graph, i.e. the largest distance [102].

2.5.2.3 Connectivity

We define a connected component of a graph as a maximal set of nodes such that a path exists between any pair of nodes in the set. The connected components and their sizes are computed using a graph traversal (such as a breadth-first search) [103]. In most real-world complex networks it has been observed that there is one large connected component, often called the giant component, together with a number of small components containing no more than a few percent of the nodes, often much less (if any).
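A minimal sketch of this traversal (my own illustration), finding the components of the Figure 2.1 graph with a breadth-first search over an adjacency-list representation:

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph.

    adj maps each node to the list of its neighbours; each component
    is grown by a breadth-first search from an unvisited node.
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return components

# The graph of Figure 2.1: node 4 is isolated.
adj = {1: [2, 5], 2: [1, 3, 5], 3: [2], 4: [], 5: [1, 2]}
print(connected_components(adj))   # [[1, 2, 5, 3], [4]]
```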

2.5.2.4 Distance and Shortest Path

The distance d(u, v) between two vertices u and v of a graph is the minimum length of the paths connecting them, also named shortest paths; we also say that it is the length of a graph geodesic [93, 98]. Should the vertices lie in different connected components, i.e. no such path exists, the distance is set equal to infinity. The matrix $d_{ij}$ consisting of all distances from vertex $v_i$ to vertex $v_j$ represents the collection of all shortest paths and is called the graph distance matrix [104]. The diameter of a graph is the length of the longest geodesic path between any pair of vertices in the graph [98, 105].

Amongst the various techniques to compute the distance between any two nodes of a graph, it is worth mentioning the Floyd-Warshall algorithm [106], which efficiently and simultaneously finds the shortest paths (i.e., graph geodesics) between every pair of vertices in a weighted directed graph. Another famous algorithm for finding a graph geodesic, i.e., the shortest path between two graph vertices, is due to Edsger Dijkstra [107]; it constructs a shortest-path tree from the initial vertex to every other vertex in the graph. It is worth emphasising that in graph theory the shortest path problem is the problem of finding a path between two nodes such that the sum of the distances of its constituent edges is minimised or, if the distance is represented by the strength (weight) of the relation between two vertices, the shortest path is then a measure that maximises the total relation weight along a path. Since we will be searching for all shortest paths between each pair of nodes (a.k.a. the all-pairs shortest path problem), the Floyd-Warshall approach, with certain adaptations, will be used throughout the examples later in this Thesis.

The Floyd-Warshall algorithm compares all possible paths through the graph between each pair of vertices in $O(|V|^3)$ comparisons. It does so by incrementally improving an estimate of the shortest path between two vertices until the estimate is optimal. By means of this algorithm we obtain the value of the shortest path distance for each pair of nodes in each connected component of G. The algorithm also returns the predecessor matrix P such that $p_{i,k}$ is the index of the node preceding k on the shortest path from i to k. From this matrix, the path between any pair of nodes i, j in a connected component of G can be reconstructed.
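The following compact sketch (my own illustration, not the adapted version used later in this Thesis) implements Floyd-Warshall with the predecessor matrix and the path reconstruction just described:

```python
import math

def floyd_warshall(w):
    """All-pairs shortest paths with predecessor matrix.

    w is an n x n matrix of edge distances (math.inf where no edge).
    Returns (dist, pred) where pred[i][k] is the node preceding k
    on the shortest path from i to k.
    """
    n = len(w)
    dist = [row[:] for row in w]
    pred = [[i if w[i][k] < math.inf and i != k else None for k in range(n)]
            for i in range(n)]
    for m in range(n):                       # allow node m as an intermediate stop
        for i in range(n):
            for k in range(n):
                if dist[i][m] + dist[m][k] < dist[i][k]:
                    dist[i][k] = dist[i][m] + dist[m][k]
                    pred[i][k] = pred[m][k]
    return dist, pred

def reconstruct_path(pred, i, k):
    """Rebuild the shortest path i -> k by walking the predecessor matrix."""
    path = [k]
    while k != i:
        k = pred[i][k]
        path.append(k)
    return path[::-1]

inf = math.inf
w = [[0, 3, inf, 7],
     [3, 0, 2, inf],
     [inf, 2, 0, 1],
     [7, inf, 1, 0]]
dist, pred = floyd_warshall(w)
print(dist[0][3])                    # 6, better than the direct edge of weight 7
print(reconstruct_path(pred, 0, 3))  # [0, 1, 2, 3]
```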

2.5.2.5 Random Walk

Given a graph, we imagine a "walker" which starts at a vertex and moves randomly to a neighbouring vertex; the walker then iterates its move from the new vertex, again choosing the next vertex at random, and so on. The random sequence of vertices touched by the walker in this way is a random walk on the graph. Random walks arise in many models in mathematics and physics [108].

A random walk is a finite Markov chain [109, 110]. In fact, there is not much difference between the theory of random walks on graphs and the theory of finite Markov chains; every Markov chain can be viewed as a random walk on a directed graph if we allow weighted edges. The random walk is also interesting since it can be a mechanism of transport and search on networks. Such processes would be optimal if one followed the shortest path between the two nodes under consideration, i.e. among all paths connecting two nodes, the shortest path is the one with the smallest number of links. However, the shortest path can be found only once the global connectivity is known at each node, which is improbable in practice. The random walk becomes important in the extreme opposite case, where only local connectivity is known at each node [111]. We also note that the random walk is a useful tool for studying the structure of networks [112].

Formally, given G = (V, E) an undirected connected graph with N vertices and E edges, let us consider a random walk on such a graph G.

A walker starts at vertex $v_0$ and takes t random steps; if at the t-th step we are at a vertex $v_t$, we move to a neighbour of $v_t$ with probability $1/d^\circ(v_t)$. The sequence of random nodes $v_t$, t ∈ {0, 1, ...}, obtained in this way is a Markov chain.

We then define:

- an $n \times n$ adjacency matrix A, with A(i, j) the weight on the edge from vertex i to vertex j. As we consider the graph undirected, A(i, j) = A(j, i), i.e. A is symmetric;

- we denote by $p_i(t)$ the probability that the walker is at vertex i at time t, defined as:

$$p_i(t) = \sum_j \frac{A(i,j)}{k_j}\, p_j(t-1) \,, \qquad (2.6)$$

where $1/k_j$ is the probability of taking a step along any edge attached to j [98];

- an $n \times n$ transition matrix P; P is row stochastic, with P(i, j) the probability of stepping onto vertex j from vertex i: $P(i,j) = \frac{A(i,j)}{\sum_j A(i,j)}$.

We can also write eq. 2.6 in matrix form, where p represents the vector of the $p_i$ and D the diagonal matrix with the degrees of the vertices on its diagonal: $p(t) = AD^{-1}p(t-1)$.

In the long run, for t → ∞, the probability distribution over all the vertices is:

$$p_i(\infty) = \sum_j A_{ij}\, p_j(\infty)/k_j \,, \qquad (2.7)$$

or, in matrix form: $p = AD^{-1}p$. Note that $D^{-1}p$ is then an eigenvector, with eigenvalue 0, of the graph Laplacian $D - A$. A sequence of vertices drawn according to this matrix of transition probabilities (of this Markov chain) is a walk in the graph; we call such a walk a random walk on the graph or digraph G.
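To make this concrete, a small sketch (my own, assuming a connected, non-bipartite undirected graph so that the iteration converges) iterates $p(t) = AD^{-1}p(t-1)$ to its fixed point; for such graphs the stationary probability of a vertex is proportional to its degree:

```python
import numpy as np

# A small connected, non-bipartite graph: a triangle (0,1,2) plus leaf 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D_inv = np.diag(1.0 / A.sum(axis=1))   # D^{-1}, inverse degrees on the diagonal
p = np.full(n, 1.0 / n)                # start from the uniform distribution

for _ in range(1000):                  # iterate p(t) = A D^{-1} p(t-1)
    p = A @ D_inv @ p

print(p)                               # -> [0.25  0.25  0.375 0.125]
print(A.sum(axis=1) / A.sum())         # stationary distribution = degree / 2E
```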

2.5.3 Centrality measures

Centrality measures are useful metrics that help answer the question of which nodes are most central relative to the rest of the network. The general formula for centralization is Freeman's formula [113].

However, there are many different types of centrality that together provide quite a good representation of the properties of a network, in particular when those quantities are measured over time (i.e. in an evolving network). The following centrality measures are the ones most used in works of network analysis.

2.5.3.1 Degree centrality

Degree centrality $C_d(i)$ captures the idea that nodes with a large number of neighbours (i.e., edges) have high centrality; it is based on the number of links incident upon a node. We can define $C_d(i) = \deg(i)/(n-1)$, where n is the number of vertices in the graph and deg(i) is the number of edges from i to other vertices [101, 114]. In certain types of networks, biological networks for example, nodes with high degree centrality are considered to be more relevant than others with lower degree centrality [98, 101]. Degree centrality, however, can be misleading because it is a purely local measure: it relies solely on the number of connected edges as the feature characterizing the entire node. If we want to measure the extent to which a graph is centralized, we must look at other types of measures, such as eigenvector centrality.

2.5.3.2 Eigenvector Centrality

Eigenvector centrality measures the importance of a node by the importance of its neighbours [115]. In other words, eigenvector centrality combines the edges of a given node with the weight of the nodes linked through those edges. This approach provides a centrality weighted by the nodes involved; therefore, if the neighbourhood of a node is crowded with important nodes, its centrality rises proportionally, and vice versa. This centrality measure is a powerful one: when applied to each vertex of a graph it returns a score proportional to the sum of the scores (centrality weights) of its neighbours. In other words, the assumption is that each vertex's centrality is the sum of the centrality values of the vertices it is connected to [101, 114]. Formally, eigenvector centrality is defined via the principal eigenvector of the adjacency matrix defining the network. The defining equation of an eigenvector is:

$$\lambda v = Av \,, \qquad (2.8)$$

where A is the adjacency matrix of the graph, λ is a constant (the eigenvalue), and v is the eigenvector. The interpretation of the above equation is that a node with a high eigenvector score is one that is adjacent to nodes that are themselves high scorers. The centrality vector is the eigenvector of the adjacency matrix whose elements are all positive [116].
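A hedged sketch of the standard computation (my own illustration): power iteration on the adjacency matrix converges to the principal eigenvector for connected, non-bipartite graphs, and by Perron-Frobenius its entries are all positive.

```python
import numpy as np

def eigenvector_centrality(A, iterations=200):
    """Power iteration: repeatedly applying A amplifies the principal
    eigenvector, normalised at each step to keep the values bounded."""
    v = np.ones(A.shape[0])
    for _ in range(iterations):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# Triangle (0,1,2) with a pendant vertex 3 attached to node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = np.zeros((4, 4))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Node 2 scores highest: it has the most neighbours and they are well connected.
print(eigenvector_centrality(A).round(3))
```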

2.5.3.3 Closeness Centrality

This measure of centrality is based on the concept of the geodesic path, i.e. the distance between two vertices in a graph measured as the number of edges in a shortest path, and it looks at the mean distance from a vertex to the other vertices [98]. Closeness is based on the inverse of the distance of each vertex to every other vertex in the network [101, 114].

If we define $d_{ij}$ as the length of a geodesic path from i to j, then the inverse of the sum of the geodesic distances from i to all other vertices j defines the closeness centrality of vertex i in the graph as:

$$C_c(i) = \left[ \sum_{j=1}^{n} d_{ij} \right]^{-1} . \qquad (2.9)$$

2.5.3.4 Betweenness Centrality

The geodesic shortest path mentioned earlier also serves to define another important topological quantity called betweenness centrality. We define two different types of betweenness centrality: vertex betweenness centrality and edge betweenness centrality. The vertex betweenness centrality BC(v) of a vertex v ∈ V is the sum, over all pairs of vertices i, j ∈ V, of the fraction of shortest paths between i and j that pass through v:

$$BC(v) = \sum_{i,j \in V} \frac{l_{ij}(v)}{l_{ij}} \,, \qquad (2.10)$$

where $l_{ij}(v)$ is the number of shortest paths between i and j that pass through vertex v and $l_{ij}$ is the total number of shortest paths between i and j. One may look at this as a measure of "bottlenecks", i.e. of how much traffic passes through each vertex. Similarly, the edge betweenness centrality BC(e) is defined as:

$$BC(e) = \sum_{i,j \in V} \frac{l_{ij}(e)}{l_{ij}} \,, \qquad (2.11)$$

where $l_{ij}(e)$ is the number of shortest paths between i and j that pass through edge e.
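In practice these centrality measures need not be implemented by hand; as an illustration (the toy graph is my own), the networkx library exposes all of them directly:

```python
import networkx as nx

# A small graph where node 2 bridges a triangle and a short chain.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

print(nx.degree_centrality(G))
print(nx.closeness_centrality(G))
print(nx.betweenness_centrality(G))   # node 2 dominates: it lies on most shortest paths
print(nx.eigenvector_centrality(G))
```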

2.5.4 Degree Distribution of a Graph

We have seen earlier the definition of the degree of a vertex as the number of edges attached to it. Let k be the degree of a vertex; the degree distribution of a graph is the proportion $p_k$ of nodes of degree exactly k in the graph, for all k [102, 117]: $p_k = P[K = k]$.

$p_k$ is thus the fraction of vertices in a network with degree k and also the probability that a randomly selected vertex has degree k [118]. Degree distributions of a graph include:

- Binomial Degree Distribution: $p_k = \binom{N-1}{k}\, p^k (1-p)^{N-k-1}$, with k = 0, 1, ..., N−1, where p is the probability that a given vertex has an edge to another given vertex out of the N−1 possible edges [117, 119].

- Poisson Degree Distribution: $p_k = \frac{\beta^k}{k!}\, e^{-\beta}$, with k = 0, 1, ..., an approximation of the binomial distribution in the limit N → ∞, where β = (N−1)p is the average degree and is a constant [120].

- Power-Law Degree Distribution: $p_k = A\, k^{-\gamma}$, with k = 1, 2, ..., where γ > 1 is a parameter tracking the rate at which the probability decays with connectivity and is typically similar for similar networks. A is a normalising constant that ensures that the $p_k$ values sum to 1 and is given by

$$\frac{1}{A} = \sum_{k=1}^{\infty} k^{-\gamma}$$

[117, 119].

Although some networks exhibit exponential degree distributions, the majority of real networks show degree distributions that are scale-free [121, 122], meaning that the degree distribution behaves as a power law. The scale-free form of the degree distribution of most real networks indicates that there is a high diversity of node degrees and no typical node in the network that could be used to characterize the rest of the nodes [100].

2.5.5 Clustering

Despite having a small global density, a graph may display a high local density, e.g. two vertices close to each other in the graph are linked together with a higher probability than two randomly chosen vertices [117]. In other words, a common feature observed in many real networks is that if a vertex is connected to one vertex of a connected pair of vertices, then it is likely to be connected to the other vertex of that pair as well. The clustering coefficient is the most widely used measure of this property [101, 114]. The clustering coefficient C accounts for the number of triangles in the network; specifically, $C_i$ is the ratio between the number of edges $E_i$ connecting the nearest neighbours of vertex i and the total number of possible edges between these neighbours: $C_i = \frac{2E_i}{k_i(k_i-1)}$.

The clustering coefficient of the whole network is the average of the individual $C_i$ over all vertices: $C = \frac{1}{N}\sum_{i=1}^{N} C_i$.
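A quick check of these formulas on a toy graph (my own example), using networkx:

```python
import networkx as nx

# Triangle (0,1,2) plus a pendant vertex 3: nodes 0 and 1 have both of their
# neighbours connected (C_i = 1), node 2 has 1 edge out of the 3 possible among
# its neighbours (C_i = 1/3), and the pendant vertex has C_i = 0.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

print(nx.clustering(G))          # {0: 1.0, 1: 1.0, 2: 0.333..., 3: 0}
print(nx.average_clustering(G))  # (1 + 1 + 1/3 + 0) / 4 ~ 0.583
```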

2.5.6 Models of Complex Networks

In this section we briefly describe different models of complex networks.

2.5.7 Random Networks

Paul Erdős and Alfréd Rényi (ER random graphs) were the first to study and describe large networks with no particular distribution of nodes and links and whose organisation principles were not easily definable; they called these networks "random graphs". Formally, let us define G(N, p) as a graph with N vertices where each edge exists with independent random probability 0 < p < 1; for each vertex $N_i$, there is an edge to node $N_j$ with probability p [96, 97]. The degree distribution arising from this type of graph is binomial, and the probability of a node having degree k is $P(k_i = k) = \binom{N-1}{k}\, p^k (1-p)^{N-1-k}$; the expected degree of a vertex is given by (N − 1)p ≈ Np [117, 119]. For large N this probability is approximated by the Poisson distribution:

$$P(k) \approx \frac{(Np)^k}{k!}\, e^{-Np} = \frac{\langle k \rangle^k}{k!}\, e^{-\langle k \rangle} \,. \qquad (2.12)$$
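This is easy to verify numerically; the sketch below (parameter choices are mine) generates an ER graph with networkx and compares the empirical degree histogram with the Poisson approximation of equation 2.12:

```python
import networkx as nx
from collections import Counter
from math import exp, factorial

N, p = 2000, 0.004                    # expected degree (N-1)p ~ 8
G = nx.erdos_renyi_graph(N, p, seed=42)

degrees = [d for _, d in G.degree()]
mean = (N - 1) * p
hist = Counter(degrees)

# Empirical fraction of degree-k nodes vs the Poisson prediction.
for k in range(5, 12):
    poisson = mean ** k * exp(-mean) / factorial(k)
    print(k, round(hist[k] / N, 4), round(poisson, 4))
```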

2.5.8 Scale-free Networks

Although the random model was used as a reference for many years, real-world networks of complex topology are not exactly random as described by the ER model. In fact, random graph theory was rarely tested on real-world complex networks, and further studies [123, 124] showed the existence of a high degree of self-organization characterizing the large-scale properties of complex networks. This led to describing these systems from an evolving-networks standpoint, and measures of the degree distribution consistently showed power-law and exponential behaviour. This behaviour was unpredicted by previous random network models.

Scale-free networks are networks typically characterized by the presence of large hubs and, as mentioned earlier, they display a power-law degree distribution [124]. They are called scale-free because the power-law distribution has the same functional form at all scales: the power law $P_{deg}(k)$ defined below remains unchanged (other than by a multiplicative factor) when rescaling the independent variable k. For an undirected network, we can define the degree distribution as $P_{deg}(k) \approx k^{-\gamma}$.

$P_{deg}(k)$ decays as a power law as the degree k increases, which keeps the likelihood of finding a vertex with a very large degree non-negligible.
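For illustration, a network with such hubs can be generated with the Barabási-Albert preferential-attachment model as implemented in networkx (the parameter choices are my own):

```python
import networkx as nx

G = nx.barabasi_albert_graph(n=10000, m=3, seed=1)   # preferential attachment

degrees = sorted((d for _, d in G.degree()), reverse=True)
print(degrees[:5])     # a few very large hubs ...
print(degrees[-5:])    # ... while most vertices stay near the minimum degree m = 3
```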

2.5.9 Small-World Networks

A small-world network refers to a network in which the average geodesic distance between nodes (i.e., the shortest path) increases slowly as a function of the number of nodes in the network. The characteristics of such networks are quite different from those of regular or random networks; in fact, the average distance $\langle d \rangle$ between two vertices scales with the total number of vertices N in the network as $\langle d \rangle \sim \log N$, with γ ≥ 2, and for a random network for γ → ∞.

When the average distance $\langle d \rangle$ scales as $\langle d \rangle \sim \log \log N$ (with 1 < γ < 2) we are in the situation of an ultra-small network, which means that the network is very compact and only a few steps are necessary to pass from one vertex to another. Finally, regular lattices in D dimensions are large worlds, with $\langle d \rangle \sim V^{1/D}$ [125].

The term "small-world network" is frequently used to refer specifically to a Watts-Strogatz toy network [126]. The Watts-Strogatz (WS) model is a random graph generation model that helps to construct networks with small-world properties, including short average path lengths and high clustering. The model was conceived to mitigate the issues raised by ER random graphs, which do not satisfy two important properties observed in many real-world networks: they do not generate local clustering and have a low clustering coefficient. In fact, ER models do not account for the formation of hubs, as their degree distribution converges to a Poisson distribution rather than the power law observed in many real-world, scale-free networks.

Figure 2.2: Cumulative degree distributions for six different networks. The horizontal axis for each panel is the vertex degree k (or in-degree for the citation and Web networks, which are directed) and the vertical axis is the cumulative probability distribution of degrees, i.e., the fraction of vertices that have degree greater than or equal to k. The networks shown are: (a) the collaboration network of mathematicians; (b) citations between 1981 and 1997 to all papers cataloged by the Institute for Scientific Information; (c) a 300 million vertex subset of the World Wide Web, circa 1999; (d) the Internet at the level of autonomous systems, April 1999; (e) the power grid of the western United States; (f) the interaction network of proteins in the metabolism of the yeast S. cerevisiae. Of these networks, three, (c), (d) and (f), appear to have power-law degree distributions, as indicated by their approximately straight-line forms on the doubly logarithmic scales, and one (b) has a power-law tail but deviates markedly from power-law behavior for small degree. Network (e) has an exponential degree distribution (note the log-linear scales used in this panel) and network (a) appears to have a truncated power-law degree distribution of some type, or possibly two separate power-law regimes with different exponents. Taken from [93].

Watts and Strogatz investigated the emergence of clustering and small-worldness by rewiring edges on a one-dimensional regular lattice in which each vertex links to its $k_i$ nearest neighbours, with each edge either randomly rewired or added with probability p. The regular lattice is used as a template of a "large-world" network with high clustering coefficients, whereas the rewired or extra edges act as bridges that improve accessibility between the vertices (see Figure 2.3).

Figure 2.3: (a) A one-dimensional lattice with connections between all vertex pairs separated by k or fewer lattice spacings, with k = 3 in this case. (b) The small-world model is created by choosing at random a fraction p of the edges in the graph and moving one end of each to a new location, also chosen uniformly at random. (c) A slight variation on the model in which shortcuts are added randomly between vertices, but no edges are removed from the underlying one-dimensional lattice. Adapted from [93].

Watts and Strogatz found that there exists a range of p where the network is simultaneously small-world and highly clustered [126], exploring the changes in the average clustering coefficient and the average shortest path length as p increases. In other words, the WS model shows that it takes a remarkably small number of the above-mentioned bridges to make the network "small-world".
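This transition is easy to reproduce with networkx's implementation of the WS model (the parameter values below are mine): as the rewiring probability p grows, the average shortest path collapses long before the clustering coefficient does.

```python
import networkx as nx

# Ring lattice of n nodes, each joined to its k nearest neighbours, with a
# fraction p of edges rewired (the 'connected' variant keeps the graph in
# one component so that path lengths are well defined).
for p in (0.0, 0.01, 0.1, 1.0):
    G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=p, seed=0)
    print(p,
          round(nx.average_clustering(G), 3),
          round(nx.average_shortest_path_length(G), 2))
```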

2.5.10 Filtered Networks

As discussed in the previous paragraphs, a complex system is in general an interwoven mesh of several interacting elements whose complexity forms an intricately connected structure, often redundant, that may contain some false-positive information. Several studies have therefore focused on the need for methods able to single out the emergent information by filtering such a complex graph into a simpler, relevant sub-graph that can be further processed with topological clustering or stochastic methods [127].

In the last few years we have witnessed a proliferation of methods for filtering relevant information by extracting a structure of interactions from weighted adjacency matrices [128-130]; in particular, two methods that have proved to be very effective are the Minimum Spanning Tree (MST) and the Planar Maximally Filtered Graph (PMFG) [101, 127]. Both methods are based on an iterative construction of a constrained graph (a tree or a planar graph) which retains the largest correlations between connected nodes.

2.5.11 Minimum Spanning Tree

The simplest way to obtain a connected graph retaining the greatest amount of information is what is called a spanning tree (a graph with no cycles that connects all vertices), and the network derived from this technique is called the Minimum Spanning Tree (MST). A typical MST is built starting from a weighted network with given weights $w_{ij}$ based on a distance (or similarity) matrix, which can be identified with a weight matrix or a correlation matrix.

The MST algorithm encountered in part of this research work is named after Kruskal [131], whose general approach is first to rank and then to connect the most correlated pairs of nodes while maintaining a "tree" structure in the network. The resulting MST of the given weighted graph is therefore a spanning tree in which the sum of the weights (or distances) of the edges is minimal. We can argue that the MST describes the cheapest network that connects all nodes. Kruskal's algorithm builds a minimum spanning tree by inserting edges connecting two disjoint components in order of increasing cost, so that all the vertices connected this way have minimum total weight:

$$w(G^{MST}) = \min \left[ \sum_{(i,j) \in G^{MST}} w(i,j) \right] \,, \qquad (2.13)$$

where G is a connected graph with weighted edges w(i, j) and $G^{MST}$ is a minimum spanning tree of G.
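A self-contained sketch of Kruskal's procedure (my own illustration; in practice a library routine such as networkx's minimum_spanning_tree can be used instead), with a union-find structure tracking the disjoint components:

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: sort edges by weight, then greedily add each
    edge that joins two disjoint components (tracked with union-find).

    edges is a list of (weight, i, j) tuples over nodes 0..n-1.
    """
    parent = list(range(n))

    def find(v):                      # root of v's component, with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:                  # the edge joins two disjoint components
            parent[ri] = rj
            mst.append((w, i, j))
    return mst

edges = [(1.0, 0, 1), (2.0, 1, 2), (0.5, 0, 2), (3.0, 2, 3)]
print(kruskal_mst(4, edges))          # [(0.5, 0, 2), (1.0, 0, 1), (3.0, 2, 3)]
```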

2.5.12 Planar Maximally Filtered Graph

Although powerful and very useful, the MST shows some shortfalls: the resulting spanned graph must be a tree, which has lost important pieces of information present in the cycles left out during the spanning process. Studies have shown that it is possible to determine a family of graphs with the same hierarchical tree associated with the MST but comprising a larger number of links and allowing closed loops [127].

Because any graph can be embedded on a surface with sufficiently high genus, it was observed that the local properties of the graph itself are affected by the surface genus, which determines the average degree, influences the degree distribution and controls the clustering coefficient. The global properties of the graph are also strongly affected by the surface genus, which constrains the degree of interwovenness, changing the scaling properties of the network from the large-world kind (small genus) to the small- and ultrasmall-world kind (large genus). Therefore graphs with different degrees of complexity can be constructed by embedding them on a surface of a given genus g = k [127, 128]. The genus g is generally a topologically invariant property of a surface, defined as the largest number of non-isotopic simple closed curves that can be drawn on the surface without separating it; in other words, the number of handles in the surface.

As proposed by recent studies [132], we may consider constructing graphs on surfaces instead of looking at them on a plane; such surfaces are then categorized by their genus (i.e. the number of holes present in the 3D surface, for instance embedding the graph on the surface of a sphere, genus g = 0, on a torus, g = 1, etc.; in other words, the genus of a graph is the minimum number of handles of a surface needed to embed the graph). We can infer that the information in a graph is directly linked with the genus of the surface, and in the case of g = 0 we obtain the maximum information stored in the graph. In this case the resulting graph is planar (a planar graph is a graph which can be drawn in the plane without any edges crossing), i.e. it can be embedded on the sphere, and it is called the Planar Maximally Filtered Graph (PMFG) [133] (pp. 247-281).

2.5.13 Scale-Free Networks Embedded in Hyperbolic Metric Spaces

As discussed earlier in this chapter, many complex networks observed in the real world exhibit a well-defined scale-free topology, characterised by a power-law node degree distribution $P(k) \approx k^{-\gamma}$ and a high clustering coefficient, the latter resulting in the emergence of many triangles in the network [99, 124]. As the topology of a network is strictly tied to various metrics and measures, we observe in complex natural networks how information self-organises its routes within the network (think of the movement of people or goods in a transportation network, or the spreading of news in social networks or of diseases in a given population, etc.), and it is interesting to analyse how the network builds a path across vertices without having a global view of the network, having only local information about the neighbours [134, 135].

An example of a strongly clustered network is a triangular lattice on a planar surface [128]: in such a network each of the N nodes is connected with its local neighbours only, and the average distance between two individuals scales as $N^{1/2}$, representing a large-world organization. On the other hand, Erdős and Rényi [96] pointed out that random graphs are closely connected systems where the average distance scales as log(N), i.e. a small world. Between the two extremes we can find intermediate structures that can be derived from the planar lattice by adding edges across distant nodes, in this way making shortcuts. But such an insertion of a shortcut on the triangular lattice has an important consequence: the network can no longer be drawn on the plane without edge crossings, thus becoming non-planar. The embedding surface must be modified accordingly by creating a worm-hole which connects two distant parts of the surface and through which the new edge can travel. Such worm-holes create short-cut tunnels in the (2D) universe, transforming it into a small world.

There are studies (T. Aste 2005) supporting the idea that networks evolve on a hyperbolic surface [128]. The complexity of the network itself is in this way associated with the complexity of the surface, and the evolution of the network is then constrained to a given overall topological organization. More precisely, we explore the relation between the properties of a network and its embedding on a surface. An orientable surface (an intersection-free, two-sided, two-dimensional manifold) can be topologically classified in terms of its genus, which is the largest number of non-intersecting simple closed cuts that can be made on the surface without disconnecting a portion of it (equal to the number of handles in the surface). The genus g is a good measure of complexity for a surface: under such a classification, the sphere (g = 0) is the simplest system; the torus (g = 1) is the next simplest; etc. To any given network a genus can always be assigned, defined as the minimum number of handles that must be added to the plane to embed the graph without edge crossings. (Accordingly, a planar graph has genus 0 and can be minimally embedded on the sphere.) [128] This approach proved to be a useful instrument to measure the complexity of real-world graphs [128, 132, 136].

An interesting approach to analysing the topology of a network in relation to its properties was recently proposed by Krioukov et al. [137]. This approach assumes that a hidden metric space is mapped so that it influences the topology of the related complex network. In particular, the hidden metric space defines a distance function between two entities (namely nodes in the network associated with such a hidden space) that influences the probability that the two entities will be connected in the resulting network. Of all possible hidden spaces, hyperbolic spaces are the ones most studied in association with scale-free networks [138]. During the study of the main properties of real-world networks I delved into several theories and techniques, and I analysed how graphs embedded on surfaces are powerful and practical tools to characterise and simulate networks with a broad range of properties. In particular, I studied how two elementary moves (T1 and T2) allow the exploration of all networks embeddable on a given surface and provide an instrument to develop a statistical mechanics description of these networks [132, 136]; see Fig. 2.4.

Figure 2.4: Exemplification of the T1 and T2 elementary moves on a triangulation: edge switching (T1) and vertex insertion and removal (T2).

Within such a framework, we analysed the properties of topologically embedded graphs and studied how they dynamically tend to lower their energy towards a ground state with a given reference degree distribution; finally, topologically embedded graphs can be built in such a way as to contain arbitrary complex networks as subgraphs.

2.6 Summary

In this Chapter I gave a general introduction to Complex Systems and highlighted the main characteristics associated with complex phenomena. Throughout the various sections, I discussed the features deriving from the internal structure of a system consisting of many components interacting with each other at multiple scales of space and time. Such components tend to form hierarchical subsystems, and these in turn also behave differently. Indeed, a complex system displays certain behaviours that are characteristic of complex phenomena and arise from the interaction of subsystems which do not themselves show such behaviour. The term emergent is therefore used to describe these "global" behaviours arising from a variety of locally distributed interacting elements.

A number of interesting mathematical properties of complex systems, mainly derived from statistics, have been observed and studied; throughout the chapter I discussed some of these properties and two theoretical frameworks for studying these systems: Multiscaling and Complex Network Theory. I presented the basic property of a scaling feature, directly related to the concept of a fractal, and subsequently how the characteristic of (multi)scaling amounts to discovering regularities in a complex data set at different scales; I then discussed the scaling law for these systems as a power law with a scaling exponent α describing the behaviour of a certain quantity F as a function of a scale parameter s: $F(s) \approx s^{\alpha}$. We then discussed techniques to extract uni-scaling and multi-scaling measures, amongst which we focussed on the fractal dimension D, the Hurst exponent H (and its homologous Hölder exponent for local scaling measures) as a statistical measure used to classify time series, with its ability to provide a measure of predictability in a stochastic process, and finally the generalised Hurst exponent (GHE) as a pure multi-scaling measurement technique.

The second part of this chapter discussed the fundamental characteristics of Complex Network theory, a very important theoretical framework that is used for the analysis and study of many natural complex phenomena. By utilizing such a graph-theoretic platform, I discussed the opportunity to hunt for emerging properties in a complex system and to recognise several characteristics concerning the underlying complex phenomena: the emergence of power-law distributions, the robustness of systems, and the scaling characteristics of the network in its fundamental properties.

I introduced the definition of a graph (network) and its basic set of properties. I then presented a general description of the various metrics used to analyse the topology of a network; a classification of networks was given, and I described the main properties observed in real-world complex networks, in particular random networks (an introductory view), scale-free networks and small-world networks. Finally, I presented a summary of network filtering tools such as the Minimum Spanning Tree (MST) and the Planar Maximally Filtered Graph (PMFG), and an outlook on scale-free networks embedded in hyperbolic spaces.

Figure 2.5: Exemplification of a T1 elementary move, which consists in the switching of one edge between four vertices (a, b, c, d).

3

A Linguistic and Graph theoretic approach

3.1 Introduction

In this chapter I will provide a thorough introduction to linguistic theory [139-141] and show how, by means of certain computational techniques, it is possible to construct a semantic network extracted from a large corpus of descriptive, language-based data. In fact, one of the objectives of my research is to build a framework where large complex data sets described in natural language can indeed be represented via a network whose vertices are semantic concepts and whose edges are semantic correlations.

I will then describe various semantic models, and in particular I will focus on a robust approach called the distributional hypothesis [139, 142, 143] and the statistical properties associated with those models. Finally, I will show how a graph constructed upon those models represents a complex system and how we can use the graph-theoretical approach to analyse such a network.

In particular: in section 3.2 I describe the epistemological challenge, i.e. how fundamental knowledge is indeed embedded in large corpora of unstructured data (e.g. descriptive language in the form of text), and I highlight the similarity between the properties displayed by an induced semantic network and certain emerging features typical of scale-free networks; I will then treat the semantic problem and how it has been dealt with in the domain of semantic networks by previous works. In section 3.3 I describe the fundamental paradigm of the Distributional Hypothesis, facing the challenges of conceptualisation and disambiguation, and I conclude the section by introducing the techniques for building a knowledge graph. In sections 3.4 and 3.5 a thorough description of these techniques is given; I discuss the ingestion process and the probabilistic parsing approach, concluding with the semantic analysis approach, treating the concepts of semantic similarity [144] and semantic distance [145] measures, including a stochastic metric.

3.2 An epistemological challenge

The scientific framework presented in this thesis proposes an approach dealing with the internal structure of language, which can be exploited to detect relations among concepts by working on the relations of the words and phrases representing them.

It is a well-known fact in the field of linguistics that the meaning of a word must be inferred by examining its occurrences over large corpora of text [139]. Reducing this concept to its minimal scope, we can say that in linguistics the meaning of a word is the result of an elaboration over its usage, at least at the sentence level, and not an a priori definition. Adopting this perspective, one can say that the meaning of a word ultimately depends on the words it mostly goes along with: this is the basis of the so-called distributional hypothesis that we will discuss in detail later in this chapter¹.

This hypothesis has its shortcomings and intellectual difficulties, as for example in cases where a word often appears alongside its opposites. This situation arises when a certain word has many meanings that can co-occur with widely different sets of words; its usage may also vary with time (and across different corpora).

¹"You shall know a word by the company it keeps" [140]

Nonetheless, the focus on the connections among words still holds, because one can imagine a sort of network of all the connections among a huge set of words and conceive the meaning of a word as something depending on the whole and not on the word itself. We can therefore recognize a similarity with the characteristics of complex phenomena seen in Chapter 2, where a node (in our case a word or concept) is interwoven in a super-set of other nodes (other words or concepts) through links (the semantic correlations) at a global scale of interaction; we could then infer, in the domain of language representing complex data sets, that the meaning of a word/concept can be considered an emergent property of the language.

The image of a network of words immediately suggests the possibility of associating a word not only with those it appears directly connected with (its nearest neighbours in the network) but also with those connected through intermediate words. Indeed, one can imagine hopping through the network of words and drawing a path between any two of them: many paths will be available for any given couple of words/concepts, thus suggesting an inference.

In general it might be far-fetched to draw an analogy between paths constructed upon a network of words and concepts and a sentence proposition expressing a relation among their constituent words, particularly when a long path is involved. However, intuition suggests that shorter (i.e. less indirect) paths may indeed be more likely to express meaningful propositions. In addition, interesting results arise when a path of words linked to each other is coupled with the frequency of co-occurrence of a pair of words/concepts, thereby introducing a measure of similarity among words and thus suggesting a new tool to gauge the "likeliness" of a path (highly co-occurring words most likely represent bounded concepts, or even the same one) [144].

As we said, this approach may show some weaknesses when considering synonymy, polysemy and the fact that a word is easily associated with its opposite. The definition of a dictionary, though, which collects synonyms under a common designation (which in the scope of this work is called a concept), solves the synonymy and polysemy issues at the cost of restricting the interpretation to a specific context (semantic domain): the network of words becomes a network of concepts.

Through the methodology described here, we obtain a knowledge representation built, through certain semantic algorithms, upon the elaboration of large corpora within a given domain (this being intrinsically a dynamic and data-driven complex data set). This epistemological representation is perfectly described by a linguistic approach leading to a mathematical model known as a graph [146].

3.2.1 The Semantic Problem

Approaching the linguistic problem from a semantic point of view, such a semantic model can be based on the assumption that modern linguistic theory, coupled with advanced computational and machine-learning algorithms, reaches an interesting level of understanding of the meaning of propositions expressed in natural language.

Today we are witnessing a big effort from a consortium of researchers (the Semantic Web²) aiming at defining a system enabling machines to understand and answer complex human queries. This paradigm faces the issue of defining a way to structure metadata (i.e. annotations on the document enabling the information to be collectively captured and described) by means of ontologies³ (i.e. complex, man-made and predefined knowledge representations describing hierarchical relationships among text objects) in order to provide machines with a basic understanding of the textual (semantic) structure of the objects dealt with. This approach, although still the most complete and structured, suffers some limitations, particularly concerning its representational complexity: in order to maintain flexibility and completeness of information, the end results are corpora of text that are both verbose and difficult to analyse. Moreover, they are nearly impossible for laymen to understand. As a result, ontologies are expensive and cumbersome to maintain, and semantically far from being standardized.

Finally, the issues related to natural language lead to tools and linguistic structures that lack precision due to disambiguation challenges and are not suitable for a precise representation of the concepts described. When dealing with the uncertainty of the interpretation of language because of the above-mentioned issues, this approach forces a certain logical consistency even where there is still uncertainty or contradiction. The result is therefore a predefined and static representation of knowledge.

²The Semantic Web is a "man-made woven web of data" that facilitates machines to understand the semantics, or meaning, of information on the World Wide Web. (Source: http://en.wikipedia.org/wiki/Semantic_web)
³In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain, and may be used to describe the domain. (Source: http://en.wikipedia.org/wiki/Ontology_(computer_science))

3.2.2 Graph-theoretic analysis of semantic networks

We earlier introduced the similarity of knowledge representation with a graph of words or concepts and their relationships; as such, and as a natural extension of complex systems, semantic knowledge and related inferences find an intuitive representation through the powerful framework of network theory. We will refer from now on to semantic networks [147], i.e. a framework implementing a formal correlation structure amongst concepts (words) related through certain similarity measures on their meaning and characterized by certain statistical features typical of networks. The statistical features of semantic networks constructed upon large corpora of data sets show properties typical of both small-world and scale-free structures, with typically a short diameter (and in general short average distances among nodes), high degrees of clustering and quite a definite power-law degree distribution.

The first time the semantic challenge was approached by means of graph theory was in Collins and Quillian's work [148], where nodes represented concepts in a tree-structured hierarchy. However, because the semantic hierarchy was forced into this rigid structure, several constraints emerged in the proper semantic representation.

A more comprehensive framework was further analyzed in order to provide the principles governing the structure of network representations for natural language semantics. These are eventually defined as semantic structural behaviours based on statistical regularities, represented by distributions of statistical measures analyzed over nodes, pairs or groups of nodes, and across a number of correlation measures. In fact, the number of connections per word, the length of the shortest path between two words, and the distribution of a node's neighbours are a few of the important indicators of the nature of a semantic network, which behaves like many other natural networks [149]. It is indeed in this respect that these networks show a small-world structure characterized by the combination of highly clustered neighbourhoods, a short average path length, scale-free organization (also found in many other systems [150, 151], where small numbers of well-connected nodes act as hubs), and a distribution of node connectivities that clearly follows a power function. It is not surprising that power-law behavior is observed in semantic networks, as it is well known from the work of Zipf (1965) that in human language the distribution of word frequencies and derivative lexical connotations is ruled by a power-law characteristic.

In the general analysis of the semantic network presented in this Thesis, much emphasis is given to certain graph-theoretical measures such as centrality, sparsity, connectedness, clustering, shortest paths, power-law degree distributions and random-walk similarity, with the objective of highlighting the emergent properties of the complex system. I will deal with the analysis of the intrinsic information represented via a semantic network, and I will show the implications of its distinctive characteristics, including local/global network properties; by means of a stochastic approach, leading us to assume certain statistical behaviour across concepts (i.e. nodes) and their correlations (i.e. edges), we will recognise similarities with natural phenomena involving scale-free or small-world structures [149, 150, 152].

3.3 The model

3.3.1 The Paradigm of Distributional Hypothesis

In linguistics the meaning of a word is oftentimes the result of an elaboration about its usage in sentences and not necessarily an a priori definition. This meaning therefore ultimately depends on the words it mostly goes along with; this is the basis of the previously mentioned distributional hypothesis [139, 142, 143].

This hypothesis is often stated in terms of "you shall know a word by the company it keeps" (Firth, 1957); "words which are similar in meaning occur in similar contexts" (Rubenstein & Goodenough, 1965); "words with similar meanings will occur with similar neighbours if enough text material is available" (Schütze & Pedersen, 1995); "a representation that captures much of how words are used in natural context will capture much of what we mean by meaning" (Landauer & Dumais, 1997); and "words that occur in the same contexts tend to have similar meanings" (Pantel, 2005), just to quote a few examples.

The fundamental rationale behind the distributional hypothesis is based on the observation that there is a correlation between distributional similarity and meaning similarity, which allows us to utilize the former in order to estimate the latter, so that the distributional view of meanings in sentences from the linguistic perspective can be represented by statistical distributional analysis [153]. In particular, we can observe that differences of meaning are reflected in differences of distribution, and that through the distributional methodology we can map (semantic) similarities amongst a set of words [153]. In other words, difference of meaning correlates with difference of distribution [142].

As mentioned earlier in the introduction, this approach may show weaknesses, as it is subject to a wide range of different challenges due to semantic relations like synonymy, antonymy, hyponymy, hypernymy, etc. Thus it may seem that the concept of semantic similarity is too generalistic, and that simple distributional models cannot distinguish between, e.g., synonyms and antonyms [153].

However, it was observed that elements of language relate to each other by means of the functional differences which appear within the structure of a sentence. The relations emerging from the functional role within the sentence are classified as syntagmatic and paradigmatic relations [154]. This classification is the subject of a linguistic discipline called Semiotics, which focuses on the formalism of textual and structural analysis. Structuralist semiotic analysis involves identifying the constituent units in a semiotic system (such as a text or socio-cultural practice) and the structural relationships between them (oppositions, correlations and logical relations).

Syntagmatic relations, from a semiotic standpoint, concern positioning and relate to entities that co-occur in the text. Syntagmatic relations are therefore combinatorial relations, which means that words that enter into such relations can be combined with each element of the sentence. For example, written words as we read them individually from any text source are syntagms of letters, i.e. a series of linguistic units consisting of a set of linguistic forms that are in a sequential relationship to one another; similarly, sentences are syntagms of words, and paragraphs are syntagms of sentences [153]. Paradigmatic relations, on the contrary, concern substitutions and relate entities that do not co-occur in the text; in other words, a paradigmatic relation exists between linguistic entities that occur in the same context but not at the same time. A paradigm is thus a set of such substitutable entities [153].

In the two specific applications of the proposed semantic-graph theoretical framework, the biological semantic network and the economic-financial network, which I will present later in this Thesis, I will make use of both paradigms (syntagmatic and paradigmatic), empirically choosing the one with the best results. A more in-depth analysis of which fits better in the context of different semantic data sets is not within the scope of this research work.

3.3.2 Conceptualization & Disambiguation

3.3.2.1 Conceptualization

As the fundamental component of this research relies on the epistemological meaning of knowledge, eventually represented as a graph of interconnected concepts, and not on language itself, we need to bridge the linguistic perspective, focused on words, with the semantic perspective, which deals with concepts. The linguistic approach allows for a measure of versatility, being applicable to virtually any subject, but to work with knowledge it is necessary to define the items pertaining to the specific topic. In the presented model, the linguistic meaning of a "term" is given by a selection of words, or phrases, that represents, and means, a concept.

This approach is called conceptualization [155], already known in linguistics as lemmatization [156]: "the process of grouping together the different inflected forms of a word so they can be represented (and therefore analysed) by a single item"⁴. Lemmatization and a similar approach known as stemming are closely related, but the latter operates on a single word without knowledge of the context, extracting the root of a term by "pruning" the inflection part, and therefore cannot discriminate between different meanings in a given sentence or context. Typical examples are the following. The lemma is the abstraction of the meaning of the word; for example, "better" has "good" as its lemma, but the stemming process would miss this link, as it only focuses on the root of the original word: without a dictionary look-up it may wrongly return the word "bet". The word "walk" is the base form of the word "walking", and hence this is matched by both stemming and lemmatization. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization selects the right lemma depending on the context [157].

Lemmatization is an important starting point for the compilation of language dictionaries that precede the creation of one or more definitions. In the model presented in this thesis, the lemmatized dictionary is structured as a collection of words (think of synonyms), acronyms and, generally, phrases which we claim to be referring to the same thing. For every concept one such collection is to be provided. This man-made conceptualization is a strong hypothesis of this research work: words, usage and meaning change in time and are not the same in different contexts (even dictionaries report many definitions per lemma, depending on the context); moreover, it is an input provided a priori to the model, which requires a great deal of effort, though not comparable to the production of a fully fledged ontology.

⁴Collins English Dictionary, entry for "lemmatise".
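For illustration only (this is not the tooling used in the thesis), the contrast between stemming and lemmatization can be reproduced with NLTK's Porter stemmer and WordNet lemmatizer, assuming the WordNet corpus has been downloaded:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # requires nltk.download("wordnet") once

# Lemmatization uses part-of-speech context; stemming only prunes suffixes.
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   (adjective lemma)
print(stemmer.stem("better"))                    # 'better' (no link to 'good')

print(lemmatizer.lemmatize("meeting", pos="n"))  # 'meeting' (noun reading)
print(lemmatizer.lemmatize("meeting", pos="v"))  # 'meet'    (verb reading)
```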

3.3.2.2 Disambiguation

Restricting to a specific context or topic does not resolve all disambiguation problems - i.e. some words or acronyms may still refer to different concepts. There are several methods and techniques that help to disambiguate concepts or words [158, 159]. As previously stated, a very important notion is that paradigmatic relations relate entities that do not co-occur in the text; this implies that related words connected through a paradigm tend to be words that do not co-occur themselves, but whose neighbour terms are often the same. Let's take the following two sentences as an example:

• This is good news.

• This is wonderful news, indeed!

Notice that good and wonderful are a paradigmatically related word pair in this example, and that looking at the immediately preceding and succeeding words (in this case a 1+1-sized context window) is enough to establish the meaning of the sentence and to see that the two sentences are paradigmatically equivalent. An interesting practical experiment [160] asked people to identify the sense of a polysemous word when they were shown only the words in its immediate vicinity. People were almost always able to determine the sense of the word when shown a string of five words, i.e. using a 2+2-sized context window. Those results were afterwards confirmed by other linguistic studies - e.g. [159, 161].

There are controversial opinions about the ideal size of the context window, i.e. how many words to the left and to the right should be considered, and in general there is no agreement on a theoretical motivation for a particular window size. Recent studies - e.g. [161] - indicate that it is preferable to use a narrow context window for acquiring paradigmatic information rather than measuring semantic similarity over a large context region spanning hundreds of words. In this thesis, and in particular in the biological model described later, I have adopted a paradigmatic model that identifies the sense of a polysemous word by considering only the words in its immediate vicinity, i.e. a small window.
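A minimal sketch of the context-window extraction just discussed; the whitespace tokenization and the function name context_window are illustrative assumptions.

    # Minimal sketch: collect the 2+2-sized context window around a target word,
    # the amount of context that [160] found sufficient for human sense judgments.
    def context_window(tokens, target, size=2):
        """Return up to `size` tokens on each side of every occurrence of `target`."""
        windows = []
        for pos, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, pos - size):pos]
                right = tokens[pos + 1:pos + 1 + size]
                windows.append((left, right))
        return windows

    sentence = "we are meeting again tomorrow".split()
    print(context_window(sentence, "meeting"))
    # [(['we', 'are'], ['again', 'tomorrow'])]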

3.3.3 Building a knowledge representation

As previously stated, in semantic analysis a dictionary is associated to a semantic domain which defines the relevant concepts and the expressions identifying them. Applying a specific dictionary - i.e. searching for concepts represented by words - means highlighting the concepts the author is referring to and considering them correlated by the fact that they display certain statistical features in their co-occurrence dynamics within the text or even, as per the distributional hypothesis, within the sentences. We can initially imagine a completely disconnected graph of concepts; edges are then added whenever two concepts co-occur in a sentence, and if a co-occurrence appears more than once the "weight" of the link is increased. Since each (text) document is composed of a set of sentences, we can link the concepts according to the whole document; thus we obtain a graph whose connected parts - that is, the ones we are interested in - are directly inferred from the document under analysis.

Figure 3.1: An example of parse tree representation derived from NLP analysis utilising the Stanford parser.

In a similar way, considering all the sentences in several documents, we obtain a graph which is an overall representation of those documents within the selected semantic domain. Such a basic representation may be enriched in several ways.
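The edge-building rule just described can be sketched as follows. The networkx library and the toy dictionary are assumptions made for illustration only; the thesis does not prescribe a specific implementation.

    # Sketch of the edge-building rule described above, using networkx (an
    # assumption). Concepts are detected with a hypothetical toy dictionary
    # mapping surface forms to concept labels.
    import itertools
    import networkx as nx

    dictionary = {"epidermis": "SKIN", "derma": "SKIN", "cutaneous": "SKIN",
                  "erk": "ERK", "phosphorylation": "PHOSPHORYLATION"}  # hypothetical

    def concepts_in(sentence):
        """Map the words of a sentence onto the concepts they express."""
        return {dictionary[w] for w in sentence.lower().split() if w in dictionary}

    def add_sentence(graph, sentence):
        """Add one edge per co-occurring concept pair; repeats add weight."""
        for a, b in itertools.combinations(sorted(concepts_in(sentence)), 2):
            w = graph.get_edge_data(a, b, {"weight": 0})["weight"]
            graph.add_edge(a, b, weight=w + 1)

    g = nx.Graph()
    add_sentence(g, "ERK phosphorylation was observed in cutaneous tissue")
    add_sentence(g, "ERK phosphorylation increases")
    print(g.edges(data=True))   # the ERK-PHOSPHORYLATION link has gained weight 2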

Let's define the strength, or weight, of the edges between a pair of nodes of the above mentioned representation by considering the recurrence of the concept relations obtained through the co-occurrence approach. The most straightforward way to do this is to count co-occurrences, which translates as 'the more a connection occurs, the more it gains strength'. This is viewed as a syntagmatic approach [154], previously mentioned, which ultimately leads to the extensive similarity measure described earlier. However, counting co-occurrences, although useful, is a somewhat coarse technique. A finer similarity measure should embrace some notion of the distance between the words within the sentence. In fact, the distance between words is commonly adopted, but the difficulty is to overcome language peculiarities or style habits that may affect the result. Leveraging the syntactic structure of a sentence - known in computational linguistics as a parse tree - we are able to overcome several language and style peculiarities.

We define the syntactic distance [145, 162] as the number of steps required to link two words with a path through the syntactic tree-like structure representing a sentence [163, 164]; see Figure 3.1. As the length of this path can be normalized by the total tree depth, the measure is made quite independent of the author's style and habits [165]. Semantic distance is therefore a measure of how close or distant the meanings of two words, phrases or concepts are. We can also state that the syntactic distance defines an intensive measure of the relation between concepts; for instance, averaging the syntactic distance over a given corpus (set of sentences) provides a further degree of depth in assessing the relation between or amongst concepts.

As for the syntactic measures used in this thesis, I opted for two approaches, both distributional measures of word distance (a.k.a. knowledge-lean measures): these rely on the cosine and on the α-skew divergence [166], within the distributional hypothesis framework seen earlier in this chapter (i.e. two words tend to be semantically close if they occur in similar contexts [139]). Distributional measures relying on text and sentences give the distance between any two words that co-occur at least a threshold number of times; the semantic distances between words or concepts are a powerful instrument used as a building block for measuring semantic distance between larger units of language. The ability to mimic human judgments of semantic distance is useful in numerous natural language tasks including machine translation, word sense disambiguation, thesaurus creation, information retrieval, text summarization, and identifying discourse structure.

So far, we have shown that the application of the dictionary - i.e. the conceptualisation described above - over sentences and parse trees is key to dynamically creating interrelations among the defined concepts by directly inferring them from naturally expressed texts. This is enough for a semantic interpretation and for representing the data in graph form (the semantic graph). Moreover, the definition of the similarity distances allows us to draw more precise quantitative information with which to elaborate more sophisticated analysis and synthesis techniques.

By iterating the above mentioned framework and considering all the sentences in a given corpus of text, the obtained weighted graph is an overall representation of the semantic domain; we call it a Knowledge graph. It is therefore possible to gather and automatically structure information from several unstructured complex data sources and construct a multi-layered, multidimensional, multidiscipline knowledge representation with a lexically-induced topology naturally represented by the graph. In this form information is made available to machines in a way that is suitable for retrieval and automated analysis. Such a representation of information is:

a) Complete, no information is discarded;

b) Structured, the distributional approach and syntax ”naturally” give structure to the whole;

c) Persistent;

d) Flexibly organized, open to a wide range of analysis algorithms and tools.

Formally, we can represent our process as a projection of the graph database - or a part of it - onto a semantic subspace, resulting in the set of syntactic relations (among words) being collapsed into a subset of semantic links (among concepts). As different dictionaries are taken into account, different knowledge graphs are obtained. This process can be schematically represented as follows:

\Gamma_{dict} = \left\{ s \in \Omega_{semantic} \;\middle|\; s \simeq P(\sigma), \; \sigma \in \Sigma_{syntax} \right\} . \qquad (3.1)

With the aid of several dictionaries, the "general" Knowledge Graph can be easily projected onto a specific self-consistent and homogeneous semantic space upon which a variety of analyses can be performed; see Figure 3.2. Meaningful analyses and conclusions may be drawn from the unfiltered graph, where the global knowledge of a given domain is represented as a whole and emerging global properties are analysed; this means that we can draw a preliminary semantic analysis by looking at the properties of the knowledge graph, which inherently represents a domain of knowledge.

Figure 3.2: Projection of the Knowledge Graph onto a semantic space. On the left, a knowledge domain has been syntactically analyzed and each single sentence is extracted and atomically defined, forming a superset of individual units of information. Through a set of rules defined by a specific dictionary, the corpus of syntactic elements is semantically processed, creating an induced graph where nodes are concepts and links are semantic relations between concepts.

3.4 The Methodology

3.4.1 Data availability and ingestion

The input required by the above described framework is text: more precisely, natural language text. Being based upon a linguistic perspective, it is essential for the analysis to be carried out over natural language, with no references to images, tables of data or formatting features. Any language can be supported, as the linguistic features we exploit are universal, in that they apply (or should apply) equally well to all languages. The limits to the adoption or rejection of a given language are just the availability of a dictionary and - a stronger requirement - of a parse engine for it. Text expressed in natural language comes naturally from documents: they may be books, scientific papers, newspaper or magazine articles, press notes, corporate documents, official statements, blog posts or even transcripts of speech. The basic input module of the process is a retriever of text sources and a tool for the extraction and storage of the text therein included. Of course it is very important to be able to identify and isolate each individual sentence that constitutes the source and to apply to them the syntagmatic approach.

3.5 Natural Language Processing and Parsing models

Parsing is a fundamental problem in language processing, particularly when it comes to defining a universal interpretation of language. Several factors come into play when parsing in a broad domain rather than in a narrow specific one, and typically challenges arise from the large rule-based grammar set, long sentences and multiple word senses leading to ambiguity. The latter represents one of the most complicated problems to solve, and for this reason several statistical models for parsing natural language and dealing with disambiguation have been studied; they assign a probability to each parse tree deriving from the ambiguous term and then rank the output in order of plausibility. The probability of each candidate parse tree is calculated as a product of terms, each term corresponding to some sub-structure within the tree.

The tree in Figure 3.3 represents several levels of information. The non-terminal symbol directly above each word in the sentence is the part-of-speech for that word: for example, the tree indicates that "Lotus" is an NNP (proper noun), "acquired" is a VBD (past-tense verb), "long-time" is a JJ (adjective). The tree describes the hierarchical grouping of words into phrases: "IBM" is an NP (noun phrase), "IBM, long-time rival of Microsoft" is also an NP; "acquired Lotus on Wednesday" is a VP (verb phrase); and so on. Finally, the tree represents grammatical relations between phrases or words.

Figure 3.3: An example of a structured parse tree showing Part-of-Speech tags according to the Stanford NLP parser. In this sentence the parser identifies the following tags: IN preposition or subordinating conjunction; JJ adjective; NN noun, singular or mass; NNS noun, plural; NP Noun Part (or Noun Phrase); NNP proper noun, singular; NNPS proper noun, plural; SYM symbol; VP Verbal Part (or Verbal Phrase); VB verb, base form; VBD verb, past tense.

In a rule ⟨S → NP VP⟩ the NP is the subject of the verb within the VP (by this rule "IBM, long-time rival of Microsoft" is the subject of "acquired"). Similarly, ⟨VP → VBD NP⟩ represents an object-verb relationship ("Lotus" is the object of "acquired") and ⟨VP → VB ... PP⟩ represents prepositional-phrase modification of a verb ("on Wednesday" modifies "acquired"). These syntactic rules allow us to directly read predicate-argument relations from the tree: that IBM, by virtue of being the subject, is doing the acquiring (rather than Microsoft); that Lotus, by virtue of being the object, is the entity being acquired; and so on.

3.5.1 The probabilistic Context-Free Parser

In formal language theory, a context-free grammar (CFG) is a grammar that naturally generates a formal language in which clauses can be nested inside clauses arbitrarily deeply, but where grammatical structures are not allowed to overlap [167–169]. A context-free grammar G consists of a 4-tuple (N, Σ, A, R), where N is a set of non-terminal symbols; Σ is an alphabet; A is a distinguished start symbol in N; and R is a finite set of rules of the form X → β, for some X ∈ N and β ∈ (N ∪ Σ)*, i.e. β is a string of symbols of the set Σ ∪ N.

It is important to notice that the grammar defines a set of possible strings in the language and also a set of possible derivations under the grammar itself, which correspond to tree-sentence pairs well defined under the grammar [170, 171]. A probabilistic context-free grammar (PCFG) - also called a stochastic context-free grammar, SCFG - is a context-free grammar with the inherent characteristic of rules carrying a probability [168, 169, 172]:

P(\beta \mid X) = \frac{\mathrm{Count}(X \rightarrow \beta)}{\mathrm{Count}(X)} , \qquad (3.2)

where the counting is taken from a predefined training text corpus with syntactic and semantic annotation (a.k.a. a treebank). In other words, a probabilistic parsing method aims to find the most likely parse of a given sentence, and it does so by assigning a probability to each parse tree T; for a tree-sentence pair (T, S) derived by n applications of context-free rules LHS_i → RHS_i, with 1 ≤ i ≤ n, where LHS is the Left Hand Side and RHS is the Right Hand Side, its probability under the PCFG is defined by the following probabilistic model:

P(T, S) = \prod_{i=1}^{n} P(RHS_i \mid LHS_i) . \qquad (3.3)

Once the model has been trained, P(T, S) is defined for any sentence-tree pair in the grammar.

The most likely tree under this model given a new sentence S, is:

T_{best} = \operatorname*{argmax}_{T} P(T \mid S) = \operatorname*{argmax}_{T} \frac{P(T, S)}{P(S)} = \operatorname*{argmax}_{T} P(T, S) . \qquad (3.4)

We can then conclude that a probabilistic context-free parser is a model that searches for the tree T_best that maximizes P(T, S); it is quite a powerful instrument to define, under the rules of the PCFG, the valid Part-of-Speech (PoS) sequences such as NP-VP-PP (NP = Noun Part, VP = Verb Part, PP = Predicative Part). A PCFG is useful in disambiguation as it chooses the most likely parse; however, it does not adequately model lexical dependencies. For instance, the sentence "General Motors exported more than 1M automobiles to Europe..." translates under a PCFG into NP → NP PP or VP → V NP PP, which means that the PP can be linked either to the NP or to the VP, where the attachment choice depends on the verb.
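A minimal sketch of Eqs. 3.2-3.3 follows, assuming a hypothetical toy treebank of flattened rule applications: rule probabilities are estimated by counting, and a parse tree is scored as the product of its rule probabilities.

    # Minimal sketch of Eqs. 3.2-3.3: estimate rule probabilities from counts in
    # a (hypothetical) toy treebank, then score a parse tree as a product of
    # rule probabilities. Rules are written as (LHS, RHS) pairs.
    from collections import Counter
    from math import prod

    treebank_rules = [  # rule applications observed in a toy treebank (assumed)
        ("S", ("NP", "VP")), ("S", ("NP", "VP")), ("S", ("VP",)),
        ("VP", ("VBD", "NP")), ("VP", ("VBD", "NP", "PP")),
    ]

    rule_count = Counter(treebank_rules)
    lhs_count = Counter(lhs for lhs, _ in treebank_rules)

    def p_rule(lhs, rhs):
        """P(rhs | lhs) = Count(lhs -> rhs) / Count(lhs), Eq. 3.2."""
        return rule_count[(lhs, rhs)] / lhs_count[lhs]

    def p_tree(rule_applications):
        """P(T, S) as the product over the n rules deriving the tree, Eq. 3.3."""
        return prod(p_rule(lhs, rhs) for lhs, rhs in rule_applications)

    tree = [("S", ("NP", "VP")), ("VP", ("VBD", "NP"))]
    print(p_tree(tree))  # (2/3) * (1/2) = 1/3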

3.5.2 The Lexicalized dependency parser

A lexicalized dependency parser solves the above mentioned problem with dependency structures in a sentence; it analyses a sentence looking for links (dependencies) between its constituent words [167, 169]; it is based on an iterative search for a head word in the sentence, which leads to a complete analysis of the sentence itself. The rules for head-word selection are again built from the elaboration of a corpus of text, which means that the headword of a node in the parse tree is set to the headword of its head child.5 The result of the dependency analysis is a dependency tree. The lexical dependency tree so produced provides an alternative point of view on the sentence that can be used to resolve ambiguities in the PCFG results [173].

3.5.3 The Lexicalized probabilistic context-free parser

As previously discussed, PCFG analysis fails to provide a correct parse tree for a sentence with dependencies and ambiguities, such as, for instance, the following sentence: "Workers dumped sacks into a bin". While its meaning (and structure) is clear to a human, from the PCFG perspective an alternative admissible structure can lead to a wrong interpretation of the sentence (that is: workers dumped the sacks that were in a bin). In the correct parse "sacks" is a Predicative Part and "into a bin" another predicate, while in the wrong one "sacks into a bin" is a (composite) predicative part. A probability is assigned to the alternative rules, and therefore one parse or the other will finally be chosen by the PCFG parser, but that probability may point to one rule or the other depending on the training set used and not on the specific words involved in the sentence.

5http://nlp.stanford.edu/software/lex-parser.shtml

Figure 3.4: PCFG, Lexicalized dependency and Combined parser under the Stanford NLP parsing process.

The solution to this shortcoming is to build a parser that uses both the PCFG technique and the dependency tree, which can resolve the ambiguities by employing statistics on the single words: such a parser is called a lexicalized PCFG parser [174]. Many such parsers have been proposed, developed and evaluated in recent years, but in this Thesis I have decided to rely on the one produced by the Stanford Natural Language Processing Group: the Stanford Parser.6,7

The common approach to producing a lexicalized PCFG parser is to build a sophisticated model that integrates the two approaches seen earlier. In the Stanford Parser, instead, the PCFG and the lexicalized dependency parser are run separately and then combined to produce a lexicalized dependency tree, which can be seen as a pair L = (T, D) of a phrase structure tree T and a dependency tree D. A lexicalized PCFG parser assigns a probability to the pair, which in the case of the Stanford Parser is a factorized probability P(T, D) = P(T)P(D); see Figure 3.4.

The approach to lexicalized PCFG parsing is to act as if the lexicalized PCFG were a large non-lexical PCFG, with many more symbols than its non-lexicalized PCFG backbone. For example, while the original PCFG might have a symbol NP, the lexicalized one has a symbol NP_x for every possible head x in the vocabulary. Further, rules like S → NP VP become a family of rules S_x → NP_y VP_x. Within a dynamic program, the core parse item in this case is an edge, which is specified by its start, end, root symbol and head position. Adjacent edges combine to form larger edges. There are O(n^3) edges, and two edges are potentially compatible whenever the left one ends where the right one starts; this gives O(n^5) combinations to check and hence an O(n^5) dynamic program. An A* search model has then been employed to efficiently parse the combined tree and to obtain the optimal parse of the sentence.

6http://nlp.stanford.edu/software/stanford-dependencies.shtml
7http://nlp.stanford.edu/software/lex-parser.shtml

3.5.4 Extracting a knowledge graph from a large corpus of descriptive language-based data

As discussed earlier, a distributional model can be created with a syntagmatic relation approach when words co-occur in a sentence, and with a paradigmatic relation approach when they share neighbours; thus, it is possible to populate a distributional model with the former by collecting information about which words tend to co-occur, and with the latter by collecting information about which words tend to share neighbours [153].
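The two ways of populating a distributional model can be sketched as follows; the scoring functions and the tiny two-sentence corpus (the "good news" example of Section 3.3.2.2) are illustrative assumptions.

    # Sketch of the two relation types: syntagmatic relatedness from direct
    # sentence co-occurrence, paradigmatic relatedness from shared neighbours.
    from collections import defaultdict
    from itertools import combinations

    sentences = [["this", "is", "good", "news"],
                 ["this", "is", "wonderful", "news"]]

    cooc = defaultdict(int)
    neighbours = defaultdict(set)
    for s in sentences:
        for a, b in combinations(sorted(set(s)), 2):
            cooc[(a, b)] += 1                       # words co-occur in a sentence
        for i, w in enumerate(s):
            neighbours[w].update(s[max(0, i - 1):i] + s[i + 1:i + 2])

    def syntagmatic(a, b):
        return cooc[tuple(sorted((a, b)))]          # direct co-occurrence count

    def paradigmatic(a, b):
        shared = neighbours[a] & neighbours[b]      # overlap of context words
        return len(shared) / len(neighbours[a] | neighbours[b])

    print(syntagmatic("good", "news"))        # 1: they co-occur
    print(syntagmatic("good", "wonderful"))   # 0: never in the same sentence
    print(paradigmatic("good", "wonderful"))  # 1.0: identical neighbour sets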

The search for co-occurrences is used in this first part of the thesis as the basic step for understanding the meaning of a text: every co-occurrence induces a relation between two concepts. As more and more text is analysed, more and more co-occurrences are found, thus strengthening the co-occurrences already found or adding new ones. On large corpora of texts, spurious or misleading co-occurrences will be statistically less significant than their "true" counterparts. This process leads to the generation of a knowledge graph where nodes are the concepts and relations (co-occurrences) are the links. The frequency of co-occurrences is an extensive property of the relationship between two concepts and can be used as a similarity measure on the graph.

It is interesting also to define an intensive measure of the relationship, to further characterise the nature of the concepts' relations. It seems intuitive, in fact, that if we consider a fairly large and complex sentence, several co-occurrences are expected to be found, yet some of them may be stronger than others. The distance between words is commonly adopted in standard approaches to semantic applications, but language peculiarities (i.e. style habits and individual preferences) may be misleading, for instance when the subject and the object of a sentence, which are supposed to be quite closely related, are positioned very far apart. This issue can be avoided by leveraging instead the syntactic structure of a sentence, which can always be represented by a tree in meaningful propositions. This distance, called syntactic distance, can be defined as the number of steps required to link two words - representing concepts - with a path through the tree representing the syntactic structure of the sentence.

One of the formal challenges in tackling the semantic framework described above is how to produce a computational model of semantic similarity that reflects the nature of the concept domain. One way of representing semantic relations between words is to regard each word as a distinct entity that occurs in a sentence set. This approach leads us to measure the statistical correlation between the common co-occurrences of such entities (i.e. words) within the corpus of a given text.

A very interesting approach that is used throughout this thesis is the concept of Word-Space, defined as a spatial representation of semantic similarity in an n-dimensional space. For the sake of argument, a 1-dimensional space is a line with a given word/concept preceded and followed by similar words; a 2-dimensional space is a geometric representation on a plane where the extra dimension introduces a second qualifier of similarity; and so on. This spatial proximity across words indicates how similar their meanings are and is at the basis of the geometric metaphor (GM) of meaning: meanings are locations in a semantic space, and semantic similarity is proximity between the locations [154]. This representation is seen as particularly effective because the space is constructed with no human intervention and with no a priori knowledge or constraints about meaning similarities, as similarities are automatically derived from language data by looking at empirical evidence of real language use. This approach is then a perfect fit for the Distributional Hypothesis previously discussed, as this methodology scatters terms with similar distributional properties in similar regions, so that word proximity reflects distributional similarity. From the above description of the characteristics of the distributional hypothesis, we can conclude that the combination of the DH's discovery of similarities in meaning with the GM's ability to represent similarities in meaning leads to an ideal framework; a formal mathematical model should be defined accordingly.

The underlying idea is that appropriate vectors constructed from word co-occurrence statistics (i.e., frequency, distribution) over large corpora of spoken or written text can provide a relevant representation of the semantic meaning of a given corpus. The vectors, namely context vectors, are derived from word co-occurrence counts, which are simply the number of times in the given corpus that each context word c appears in a window of a particular shape and size s around each target word t [154, 175]. The context here can assume several forms, but in this case we define the context as the surrounding terms (as previously seen, with window size equal to 1), i.e. for each word we consider the preceding and succeeding words.

Iterating this approach for each word of a text, we end up constructing a series of context co-occurrence vectors, where a vector v is an element of the n-dimensional vector space, v = (a_1, a_2, ..., a_n). As the co-occurrences of word pairs are collected in word-by-word vectors (the elements of each vector display the number of times two words co-occur within a set window of word tokens), these context vectors are then defined as the rows or the columns of a symmetric matrix, called the co-occurrence matrix. It is normally defined as a frequency matrix, where its element f_{ij} identifies the frequency of co-occurrence of word i and word j in a given corpus of text. In other words, in the traditional vector-space model, the element f_{ij} of a co-occurrence matrix is the weight of two terms co-occurring within the text.
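A minimal numpy sketch of such a frequency matrix, built with the 1+1 window mentioned above; the toy corpus is an assumption.

    # Minimal sketch of the co-occurrence (frequency) matrix described above:
    # element f_ij counts how often words i and j co-occur within a 1+1 window.
    import numpy as np

    corpus = [["this", "is", "good", "news"],
              ["this", "is", "wonderful", "news"]]

    vocab = sorted({w for s in corpus for w in s})
    index = {w: k for k, w in enumerate(vocab)}

    F = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sentence in corpus:
        for i in range(len(sentence) - 1):          # window size 1: adjacent pairs
            a, b = index[sentence[i]], index[sentence[i + 1]]
            F[a, b] += 1
            F[b, a] += 1                            # keep the matrix symmetric

    # the row F[index[w]] is the context vector of word w
    print(vocab)
    print(F[index["good"]])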

Once the semantic analysis is completed, the knowledge extracted from the documents is represented by a graph of concepts linked together by their co-occurrences, whose degree of relevance is expressed by a positive real number, the syntactic distance: the search for semantic information is therefore "reduced" to the analysis of a so-called "weighted graph". Graph theory provides many ways to explore and analyse the properties of a graph and of its constituent nodes, but essentially we are interested in three main domains:

1. The overall structure of the graph

2. The analysis of the ”shortest paths” among the nodes

3. The ”stochastic” distance among nodes.

In the following subsection we will examine each of these domains and the questions they can answer in our analysis.

3.5.5 Similarities and distance in a semantic network

The Geometric Metaphor's approach does not provide an indication of the topological location of 'meanings' in a semantic space, but rather states that similarity between the meanings of words can be expressed in spatial terms, as proximity in a high-dimensional space. Luckily, the context vectors method described earlier allows us to represent distributional information geometrically, and it also facilitates the computation of an important measure, the distributional semantic proximity between words, which captures the similarity of words and concepts.

Several methods have been studied to compute the similarity between vectors, and the measures obtained are generally divided into similarity measures, which produce a high value for similar objects, and distance measures, which display the opposite behaviour, i.e. the lower the measure the shorter the distance between two vectors (concepts); large similarity equals small distance, and vice versa. As a similarity measure can be defined as the inverse of a distance measure, we can transform a distance measure d(i, j) into a similarity measure s(i, j) by computing:

s(i, j) = \frac{1}{d(i, j)} . \qquad (3.5)

The above leads us to define the similarity between two vectors as [154, 176]:

s(\vec{i}, \vec{j}) = \vec{i} \cdot \vec{j} = i_1 j_1 + i_2 j_2 + \dots + i_n j_n , \qquad (3.6)

and a vector distance as:

d(\vec{a}, \vec{b}) = \left( \sum_{i=1}^{n} |a_i - b_i|^N \right)^{1/N} . \qquad (3.7)

This distance metric is the general form of the Minkowski metric; a simpler form, obtained for N = 2, is the Euclidean distance [177]:

d_{euclidean}(\vec{a}, \vec{b}) = \sqrt{ \sum_{i=1}^{n} (a_i - b_i)^2 } . \qquad (3.8)

Both the similarity and the distance measures are however affected by the size of the vectors. In the first case, the scalar product becomes an issue when the elements of the vectors represent words/concepts, as in this scenario, e.g. with word-space vector representations, the scalar product favours frequent words, and therefore words with high-frequency co-occurrences will resemble most of the other words. In the second scenario, with distance metrics, the opposite effect is observable, i.e. frequent terms will result being too distant from the other words. Thus, in order to mitigate these vector-length flaws of the similarity and distance measures, we can normalize the size of the vectors by introducing the norm in their definition, by means of a variety of approaches. For example, a similarity measure that will be used later in this thesis, and that factors out the length of the vectors by dividing the scalar product by the norms, is the Cosine Distance [178] (or Cosine Similarity), defined as the cosine of the angle between two word vectors:

d_{cos}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} , \qquad (3.9)

d_cos varies from 0 for orthogonal vectors to ±1 for identical or opposite vectors. The meaning of these values with respect to the similarity of words is intuitive: the cosine distance generates a metric that says how related two documents (or words) are by looking at the angle between their vectors instead of their magnitude.
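The measures of Eqs. 3.7-3.9 transcribe directly into a few lines of code; the toy context vectors below are assumptions.

    # Sketch of the vector measures of Eqs. 3.7-3.9 applied to context vectors.
    import numpy as np

    def minkowski(a, b, n=2):
        """General Minkowski distance, Eq. 3.7; n=2 gives the Euclidean case, Eq. 3.8."""
        return np.sum(np.abs(a - b) ** n) ** (1.0 / n)

    def cosine(a, b):
        """Cosine similarity, Eq. 3.9: length-normalised scalar product."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([2.0, 0.0, 1.0])   # context vector of word/concept 'a' (toy values)
    b = np.array([1.0, 1.0, 1.0])   # context vector of word/concept 'b'
    print(minkowski(a, b, n=1))     # Manhattan distance
    print(minkowski(a, b))          # Euclidean distance
    print(cosine(a, b))             # angle-based similarity, unaffected by magnitude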

3.5.5.1 The overall structure of the graph

We have seen in the previous paragraphs that a corpus of natural language data can be semantically analysed to derive an induced semantic network where concepts are represented by vertices and semantic relations by edges. Examining the overall structure of a knowledge graph means understanding local and global characteristics that eventually represent a semantic inference. Throughout this research project I analyse which are the more strongly connected nodes and the strongest links among them, and derive a synthetic structural semantic representation highlighting, for example, the centrality of nodes versus the peripheral ones. An important role is also played by measures of the graph structure such as the Minimum Spanning Tree (MST) and the Planar Maximally Filtered Graph (PMFG). The main purpose of analysing the graph with these tools is to verify that our representation complies with the basic features of the subject it represents. For example, it must be expected that a very important concept behaves as a hub in the graph and appears very close to the root of the MST structure; one can also analyse the richness of "secondary links" in the graph compared to the MST ones. The intuition is that a graph with few "secondary links" represents a field of sparse knowledge. Moreover, links that display high weight and that are present in the MST should represent well-known and much cited relations in the literature. A number of graph measures have been extensively used throughout this Thesis; see Section 2.5, introduced earlier in Chapter 2.

3.5.5.2 The ”shortest paths”

The most natural way to topologically explore a graph (note that such a graph now represents a network of concepts and the relations amongst them) is to hop from one node to another, thereby producing a path between any two nodes (i.e. concepts). Generally, given two nodes, there are many paths connecting them. The path connecting two nodes with the least number of hops is called the shortest path (see Section 2.5.2.4). In the case of a weighted graph - with a little abuse of language - the shortest path is not necessarily the path with the least number of hops but the one having the minimum (or maximum) overall weight. In the real-case applications described later on, I have chosen to use as weight a similarity measure defined by the syntactic distance:

s_{i,j} = \sum_{\nu} f_{i,j}^{\nu} , \qquad (3.10)

where f_{i,j}^ν is the frequency of co-occurrence of the concepts i and j in the document ν.

This is clearly a symmetric and positive (s_{i,j} ≥ 0) measure which returns larger values for pairs of concepts that appear more frequently in the same parse tree. More refined analyses can be carried out by imposing constraints on paths and taking the minimum (or maximum) among the paths complying with the constraints. This is the so-called resource-constrained shortest path (RCSP). The resource-constrained shortest path problem asks for the computation of a least cost path obeying a set of resource constraints; the problem is NP-complete. Formally, the constraints are described by an additional set of conditions in the general form: for any path p from s to t we introduce a 0-1 variable x_p, and we use c_p and r_p^{(i)}, with i = 1, ..., k, to denote the cost and the resource consumptions of p respectively [179]. For instance, for a minimum RCSP we have:

\min \sum_{p} c_p x_p , \quad \text{subject to} \quad \sum_{p} x_p = 1 ,

\sum_{p} r_p^{(i)} x_p \leq \lambda^{(i)} \quad \text{for } i = 1, \dots, k , \qquad x_p \in \{0, 1\} , \qquad (3.11)

where k is the number of resources and the sums are taken over the candidate paths. This problem is a generalisation of the one where the path is simply constrained to pass through a node or a set of nodes.

This is the kind of problem we are interested in. Dealing with paths in knowledge graphs means exploring indirect relations among concepts. In our case "indirect" means that there is a path connecting them but there is no co-occurrence link between them in the document base. In doing so we extend the concept of co-occurrence into something similar to an inference. This holds for any kind of path, but intuition suggests that long-wandering paths may be more affected by noise; they should therefore be kept as short as possible to be meaningful. Supposing we attach a suitable categorisation or labelling to nodes, we can constrain the path to cross certain categories in a given order. This means that we can set up a kind of syllogistic template with which our inferences/paths must comply. The template is built to convey the idea of a chain of reasoning, suggesting to the user an interpretation key for this end-to-end inference. Such categories can be imposed a priori to satisfy specific needs according to the semantic domain of reference, or be inferred from a cluster analysis of the graph.
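A sketch of path exploration on a toy knowledge graph follows, using networkx (an assumption): an unconstrained weighted shortest path, plus the special case named above in which the path is forced through a given node. The concept names and weights are hypothetical.

    # Sketch of path search on the knowledge graph with networkx: an
    # unconstrained weighted shortest path, and the special case where the
    # path is constrained to pass through a given intermediate concept.
    import networkx as nx

    g = nx.Graph()
    g.add_weighted_edges_from([  # hypothetical concept links, weight = distance
        ("PEPTIDE", "RECEPTOR", 1.0), ("RECEPTOR", "SIGNALING", 0.5),
        ("SIGNALING", "DISEASE", 1.0), ("PEPTIDE", "DISEASE", 3.0),
    ])

    # unconstrained shortest path between two concepts
    print(nx.shortest_path(g, "PEPTIDE", "DISEASE", weight="weight"))

    def path_through(graph, source, via, target):
        """Least-cost path forced through `via`: two shortest paths concatenated."""
        first = nx.shortest_path(graph, source, via, weight="weight")
        second = nx.shortest_path(graph, via, target, weight="weight")
        return first + second[1:]

    print(path_through(g, "PEPTIDE", "RECEPTOR", "DISEASE"))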

3.5.5.3 Random walk: a stochastic semantic distance

The shortest path approach, as per the previous paragraph, may suffer from a limitation due to its strongly local character - i.e. it is the minimum (or maximum) of a specific measure. Let's consider two concepts a and b in the knowledge graph connected by a "strong" shortest path that is the sole path connecting them. Let's also consider another concept c connected to a by a "weaker" shortest path but also through many other paths. Should we consider concepts a and b more strongly related than a and c? According to the shortest path approach, the answer is trivially affirmative. But the shortest (or strongest) path may be of little or no interest in a specific research question, e.g. it may be a well-known result or a false positive. Therefore, we need a synthetic measure to gauge the overall interrelation between two concepts. Indeed, it is possible to adopt a perspective complementary to the shortest path, where one looks at all the paths connecting two concepts, focusing on the "abundance" of these paths, together with their weights, as a measure of the strength of the interrelation. We can describe this interrelation as a probability to move from node i to node j. The probability to move from a node to a nearest neighbour can be derived from the similarity measure adopted:

p(i \leftarrow j) \propto s_{i,j} . \qquad (3.12)

By definition of probability, we require this to be normalised, thus obtaining:

p(i \leftarrow j) = \frac{s_{i,j}}{\sum_{k} s_{k,j}} . \qquad (3.13)

The matrix P with elements (P)_{i,j} = p(i ← j) is often referred to as the transfer matrix.

The elements of this matrix coincide with the above probabilities, which can also be interpreted as conditional probabilities, p(i ← j) = p(j|i), expressing the chance of finding concept i along with concept j. Considering two non-adjacent nodes, it can be proved that the average number of steps required to travel from node i to node j is given by

d_{i,j}^{RW} = \sum_{k} \left[ \frac{1}{I - B(j)} \right]_{i,k} . \qquad (3.14)

Eq. 3.14 defines the Random Walk semantic distance, where I is the identity matrix, whose elements are given by the Kronecker delta δ_{i,j}, and B(j) is a square matrix identical to the transfer matrix P except that B(j)_{i,j} = 0 for any i. This measure allows an overall ranking among couples of concepts, since it considers all the possible paths connecting the two arbitrarily chosen nodes.
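Eqs. 3.13-3.14 translate directly into a few lines of numpy; the similarity matrix below is a toy assumption, and the index conventions follow the text above.

    # Direct numpy transcription of Eqs. 3.13-3.14 (a sketch; the toy similarity
    # matrix is hypothetical). P is column-normalised so that p(i <- j) sums to
    # 1 over i, and B(j) is P with its j-th column set to zero.
    import numpy as np

    S = np.array([[0.0, 3.0, 1.0],   # symmetric similarity matrix s_ij (toy values)
                  [3.0, 0.0, 2.0],
                  [1.0, 2.0, 0.0]])

    P = S / S.sum(axis=0)            # Eq. 3.13: p(i <- j) = s_ij / sum_k s_kj

    def rw_distance(P, i, j):
        """Random-walk semantic distance d_ij^RW of Eq. 3.14."""
        B = P.copy()
        B[:, j] = 0.0                               # B(j)_{i,j} = 0 for any i
        fundamental = np.linalg.inv(np.eye(len(P)) - B)
        return fundamental[i, :].sum()              # sum over k of [1/(I-B(j))]_{i,k}

    print(rw_distance(P, 0, 2))      # average number of steps from concept 0 to 2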

3.6 Summary

In this chapter I have introduced the fundamental aspects of computational semantics, presenting some of the linguistic challenges and techniques used to extract semantic objects and relations from a large corpus of unstructured text-based data. I showed how the emerging properties arising from such descriptive language are represented by a semantic network; to this extent, a reference was made to the statistical features emerging from these semantic representations of a certain domain of information and to how they show properties typical of small-world and scale-free structures. A thorough description of the previous literature was given; in particular, I introduced the Distributional Hypothesis Model as a paradigm for analysing the semantic structure of a text and for setting up the framework for identifying concepts and relationships amongst them. The syntagmatic and paradigmatic paradigms (the former considering co-occurrences of words within the sentence, the latter substitutions of words not present in the sentence but contextually related) were described as two approaches within the distributional model. Throughout the chapter I introduced parsing techniques for identifying concepts from a probabilistic point of view; in particular, I described the probabilistic context-free parser as an instrument aiming to find the most likely parse tree of a given sentence.

Moreover, I described the methodology for extracting nodes and relations from the corpus of text, how to build a similarity matrix containing the weights of the node relations (i.e. word co-occurrences) and how to generate a semantic graph; in particular, a description of co-occurrence was given and I introduced the basic semantic similarity and distance measures based on the syntactic distance, concluding the chapter with two of the most important features for analysing the constructed graph: the semantic shortest path and the random walk as a measure of semantic stochastic distance.

The approach shown in the next chapters, inspired by many other works in this field (properly referenced throughout), addresses the issue of information overload by providing a new method of "connecting the dots", i.e. a synthesis tool allowing us to build inference-like concept paths across a multitude of unstructured data by exploiting the powerful framework of network theory. In the next chapter I will describe how the proposed methodology has proven to produce interesting results in the field of bio-medicine.

4 The semantic-graph theoretical framework as a model for biomedical drug repurposing

“Two roads diverged in a wood, and I took the one less traveled by. And that has made all the difference.”

– Robert Frost

4.1 Introduction

One of the outcomes expected from this research work is to deduce possible inferences among the entities of a given corpus of natural language text dealing with a given set of disciplines [180–182]. Essentially, the idea is to retrieve significant entities from a large amount of unstructured data expressed in natural language but containing rigorous knowledge, and to connect them in a graph on the basis of their co-occurrence in a multitude of text fragments (e.g. sentences). The resulting graph is then analysed, searching for and ranking paths travelling through the nodes (semantic concepts) that represent meaningful inferences.

With this in mind, there is an opportunity for literature-based scientific discovery [183, 184], as noted by Swanson back in 1986 [185]: important pieces of information, for example regarding chemical substances, biological processes and pathway interactions, are scattered among publications read by different communities of scientists who are not mutually aware of their findings. In other words, the extensive processing of millions of papers from the biomedical literature through the framework presented in this Thesis can reveal emerging paths that imply connections between known diseases and known peptides. Some of these connections have not yet been exploited pharmaceutically and may lead to new cures.

In this Chapter I will apply the framework built on computational linguistics and graph theory analysis (as introduced in the previous sections) to a real case of biomedical and pharmacological research. In particular, I introduce a preliminary analysis obtained by processing abstracts and titles of a corpus of 208,583 scientific bio-medical papers selected from the PubMed database, I show examples of the construction of biochemical Mechanisms of Action (MoA)1, see Figure 4.1, referring to certain peptides2 and diseases, and I suggest how such a framework can provide a different approach to drug research, highlighting pathophysiological courses of bio-chemical compounds and helping the investigation of certain biological behaviours. Throughout the sections of this chapter, I will describe step by step the semantic graph theoretical framework, starting from the statistical semantic analysis, where the distributional hypothesis theory and the Natural Language Processing parsing techniques play a fundamental role in the construction of the knowledge graph. Moreover, I will describe the mechanics for finding network paths associated with large levels of similarity between nodes (i.e. biomedical concepts extracted and represented in the induced knowledge graph) under a topological approach, including a powerful stochastic method to identify relevant biological paths by means of a random walk distance measure.

1The term mechanism of action (MoA) refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect.
2Peptides are naturally occurring biological molecules and appear as short chains of amino acids.

Figure 4.1: In the specific MoA dealt with in this research work, a chemical or biological compound (in our application a peptide, a short chain of amino acids) binds to specific receptors on the cell surface and by doing so activates a cascade of reactions inside the cell, called signaling, that leads to gene transcription by the DNA and thus to the synthesis of a protein that generates an action (a response) inside or outside the cell.

Finally, I conclude the chapter by introducing network filtering techniques applied to the biomedical knowledge graph and by showing preliminary results based on the structural analysis of the concept network.

4.2 The semantic data ingestion

The data source ingestion step introduced in the previous chapter may be defined in the present context as the retrieval and semantic processing of scientific papers in order to extract atomic, semantically linked, biomedical information concerning the bio-molecular processes and the physiology of certain compounds (in this particular case I analyse short chains of amino acids called peptides) with respect to their action on rare diseases. The literature that will be semantically processed and represented through a semantic graph is derived from public and private databases, in particular from

the PubMed NCBI archives3,4 and the Stanford University Libraries & Academic Information Resources. The two libraries are not composed of two entirely different sets of documents; on the contrary, they share a certain number of documents. In this chapter I omit the description of the computer-based tasks for ingesting the data set, dealing with issues concerning its format and other basic technical issues, which have been tackled by means of custom software solutions and are not within the scope of this research.

4.2.0.4 Dictionaries and ontology

Dictionaries and ontologies provide the interpretation framework, the means by which concepts and their relations can be identified and extracted. They are basically a configuration that, in its simplest form, can be thought of as a list of concept-words to each of which a variable number of related phrases is associated. For instance, to the concept "SKIN" the phrases "epidermis" and "derma", but also "cutaneous", can be associated. Already from this simple example it is clear that a lot of technical tuning is required (to handle different word forms and to avoid the mistaken interpretation of substrings of longer words as a concept), but also that the definition of a dictionary is a critical issue: a concept may be narrowly or broadly characterized (think for instance about associating "epidermis" to skin, when truly it is only a part of it) and the same word can mean different things in different contexts. An ontology is an extension of the concept of dictionary, where relations are added among the concepts in order to have a deeper embedded representation of the overall concept set. Both dictionaries and ontologies refer to a particular context, resulting in gene ontologies, medical-term ontologies and so on.

3PubMed is the US National Library of Medicine under the National Institutes of Health
4NCBI is the US National Centre for Biotechnology Information

4.2.0.5 Use of Natural Language Processing

NLP techniques are applied to the text of the articles in order to extract more detailed information, by means of the Stanford Parser framework5. One obvious advantage of this approach is that the noun parts of the sentences can thus be isolated, so that the search for keywords to be identified as concepts can be carried out only on them, avoiding the verb parts and consequently simplifying the disambiguation task. Of course the syntactic analysis can be used also to identify relations among concepts and to measure the "closeness" of two concepts in a sentence with a precision unattainable with ordinary Latent Semantic Analysis techniques [186]: in LSA one can only measure a distance "in words" between two words (i.e.: how many words lie between two selected words?), while here a distance can be defined as the number of steps required to link two words with a path through the parse tree. As the length of this path can be normalized by the total tree depth, the measure is also quite independent of the author's style: the latter is not a minor quality, since articles are often written in English by non-native English speakers, who are prone to carry over their own stylistic language habits.

4.3 Building Connections

4.3.1 Retrieving the appropriate similarity measure

From the syntagmatic analysis of the literature (see Chapter 3) we can identify several 'semantic objects', or keywords, which are of relevance for our research purposes. The following approach starts from the acquisition of information about diseases and their bio-molecular mechanisms from the biological and medical literature, and it aims to extract interesting relationships that may lead to the development of a possible cure. Specifically, our approach consists of two main parts:

5http://nlp.stanford.edu/software/lex-parser.shtml

- The computational semantic approach seen earlier in Chapter 3 allows us to build a dependency matrix S of similarity measures amongst the semantic "objects" extracted from the unstructured complex data set. The matrix S is a set of scalar entries s_{i,j}, with larger values corresponding to a larger dependence of object j on object i. In general, S is not symmetric.

- The second part of the framework is focused on discovering and extracting meaningful paths, π_{i,j} = i v_1 v_2 ... v_k j, between two semantic objects i and j, and on assessing the statistical strength, reliability and robustness of such paths.

Later in this Chapter I will show examples of how the above mentioned approach leads to the construction of Mechanisms of Action (MoA) referring to certain peptides and diseases.

4.3.1.1 Occurrences

The sample data set used to validate the presented semantic-graph theoretical framework is made of N = 208,583 biomedical papers, within which we identify 250 semantic objects (our keywords) which are relevant for our research purposes. Let us denote with f_i^ν the frequency of occurrence of keyword 'i' within the paper 'ν'. The distribution of the sum of the frequencies across the set of papers, f_i = Σ_ν f_i^ν, is reported in Figure 4.2. As one can see from the semi-log and log-log plots, the distribution is characterised by a small set of important keywords that appear with a large frequency, and it decreases faster than exponentially in the tail region. The first 20 most repeated keywords are listed in the figure together with their frequencies; we observe that they already account for over 50% of the total number of occurrences.


Figure 4.2: Distribution of the frequency of occurrence f_i of the keywords (i = 1..250) in the sample under examination (N = 208,583 papers). Clockwise, from the top-left: linear scale, log-linear, log-log and histogram plot. The keywords in the panel at the bottom-right refer to the list in Table 4.1.

4.3.1.2 Co-occurrences

In our semantic approach the co-occurrences define a measure used to create links in an undirected graph where the concepts are the nodes and the frequency of co-occurrence is the weight of the link. The number of nodes of this graph is given once the dictionary is set, while the number of edges is unpredictable, depending on the corpus of text and on the identifying expressions listed in the dictionary. Yet a well-known empirical law of distribution of the occurrence of words in a text, due to Zipf [187], suggests that a semantic graph will be a small-world network, possibly even displaying a power-law distribution

[188–190]. The graph immediately conveys the idea that the meaning of a concept is an emergent property of the whole, and almost visually depicts the gist of the distributional hypothesis: in fact it is natural to regard the neighbourhood of a given concept as its context, the set of other concepts it interacts with and by means of which it can be defined. More than that, the graph is a mathematical object susceptible to quantitative analysis, and as such we can study it both as an overall knowledge representation and for deducing inferences.

As mentioned earlier, the co-occurrence of keyword pairs within phrases is the starting measure from which we build the node relations. Let us, for the moment, neglect the relative positioning of the keywords in the phrase, which is described by the structure of the semantic tree. Here we will simply consider the frequency of co-occurrences f_{i,j}^ν, which counts the number of times that, in the paper 'ν', the two keywords 'i' and 'j' appear within the same phrase. The distribution of the sum of the co-occurrences across the set of papers,

f_{i,j} = \sum_{\nu} f_{i,j}^{\nu} ,

is reported in Figure 4.3. As one can see from the semi-log and log-log plots, the distribution is similar to that of the frequencies (Figure 4.2) but it has a fatter tail, with a decrease slower than exponential. The first 20 largest co-occurrences are reported in detail in the figure, and the corresponding couples of keywords are listed in Table 4.1.

Although obviously related and correlated, the two quantities f_i^ν and f_{i,j}^ν are distinct. Indeed, in a phrase a keyword can appear more than once, and the co-occurrences may count all the combinations of the same keyword 'i' with different others 'j1', 'j2', 'j3', ...


Figure 4.3: Distribution of the frequency of co-occurrence f_{i,j} of the keywords (i, j = 1..250) in the sample under examination (N = 208,583 papers). Clockwise, from the top-left: linear scale, log-linear, log-log and histogram plot. The indices (1-20) in the panel at the bottom-right refer to the list in Table 4.1.

In general, we observe that the two quantities f_i^ν and f_{i,j}^ν are highly correlated, but each can be larger than, smaller than, or equal to the other.

Table 4.1: The twenty most abundant co-occurrences within the sample under examination.

#CO-OCCURRENCES    i                     j
9641               ERK                   PHOSPHORYLATION
8701               HUMAN                 MOUSE
6731               ERK                   MAPK
4983               MAPK                  PHOSPHORYLATION
4599               CYTOKINE              INFLAMMATION
3739               HUMAN                 LIVER
3396               MTOR                  RAPAMYCIN
3375               CRF                   RAT
3155               CRF                   HYPOTHALAMUS
2863               LUNG                  SARCOIDOSIS
2837               LUNG                  MOUSE
2818               LIVER                 RAT
2578               HUMAN                 LUNG
2474               HUMAN                 RAT
2455               APOPTOSIS             HUMAN
2382               FIBROBLAST            HUMAN
2174               LIVER                 MOUSE
2167               EPITHELIUM            HUMAN
2159               SMOOTH MUSCLE CELL    SMOOTH MUSCLE CELL PROLIFERATION
2158               ACTH                  HYPOTHALAMUS

Several techniques to compute measures of co-occurrence can be studied; here I will provide two alternative classes of measures, looking both at co-occurrences within phrases ('fine grained' analysis) and at co-occurrences within papers ('coarse grained' analysis).

4.3.1.3 Fine grained measures

i) Co-occurrences in phrases taking into account multiple co-occurrences within the same phrase.

ii) Co-occurrences in phrases without taking into account multiple co-occurrences within the same phrase.

iii) Weighted co-occurrences, with weights given by the semantic distance: weight = 1 − (semantic distance − 1)/(length of sentence − 1), resulting in weight = 1 if the two keywords are adjacent in the phrase and smaller values for increasing distances (see the sketch below).
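A one-function sketch of the weight in measure (iii); the function name is an illustrative assumption.

    # Sketch of the weighted co-occurrence of measure (iii) above: weight 1 for
    # adjacent keywords, decaying linearly with their semantic distance.
    def cooccurrence_weight(semantic_distance, sentence_length):
        """weight = 1 - (semantic_distance - 1) / (sentence_length - 1)."""
        return 1.0 - (semantic_distance - 1) / (sentence_length - 1)

    print(cooccurrence_weight(1, 10))   # adjacent keywords -> 1.0
    print(cooccurrence_weight(5, 10))   # farther apart     -> 0.555...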

4.3.1.4 Coarse grained measures

iv) Co-occurrences in papers taking into account multiple co-occurrences within the same paper.

v) Co-occurrences in papers without taking into account multiple co-occurrences within the same paper.

Figure 4.4: Distribution of the frequencies of co-occurrences between keywords. Panel (a): frequency distribution of the number of co-occurrences. Panel (b): complementary cumulative distribution of co-occurrences. Here C is the number of co-occurrences of any couple of keywords i-j and f(C) is the relative frequency at which such a number of co-occurrences is measured.

The measure (i) is the quantity Σ_ν f_{i,j}^ν that we have previously mentioned and, for the sake of simplicity, it is the only measure discussed in this analysis. Interestingly, we observed that, despite the different levels of description involved in each of these different measures, the final path results are not very sensitive to the specific measure. We interpret this fact as a sign of robustness of our search procedure for bio-medical mechanisms of action.

4.3.2 Semantic interrelation, Dependency, Similarity

The ultimate goal of the semantic phase is to provide an accurate and robust indication of the semantic interrelations between the keywords, giving the strength and direction of the dependency between them. This is in principle achievable through an accurate syntactic analysis able to distinguish between the different semantic implications of the co-occurrences of keywords in the same phrase. However, the direction of a relation is not implemented in this work. Therefore, for the time being, we use the counting of co-occurrences of keywords in the same phrase as a measure of similarity, which we consider as a proxy for the relevance of semantic interrelation. We then implement a matching procedure that aims to extract links and paths associated with larger levels of similarity.

The distribution of the number of co-occurrences is shown in Figure 4.4. We can observe in Figure 4.4a that the majority of co-occurrences between keywords happen in fewer than five sentences within the whole body of the literature, and that the largest number of co-occurrences happen in only one sentence within the whole body of the literature (this accounts for about 1/4 of the total number of co-occurrences). However, we also observe in Figure 4.4b that there is a sizeable fraction of co-occurrences that happen in more than 20% of the papers (the most abundant being the couples listed in Table 4.1 and shown in Figure 4.3). Indeed, the tail of the complementary cumulative distribution behaves as a power law (linear trend in log-log scale) with exponent γ ≈ 1.81, indicating a fat-tailed distribution [191].
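For illustration, such a tail exponent can be estimated in a few lines of Python; the sketch below uses synthetic heavy-tailed counts in place of the real co-occurrence data, and a simple least-squares fit of the log-log tail (more robust estimators exist).

import numpy as np

rng = np.random.default_rng(0)
# Synthetic heavy-tailed counts standing in for the real co-occurrence data
counts = np.sort(rng.pareto(1.8, size=5000) + 1.0)

# Complementary cumulative distribution P(C > c), one point per observation
ccdf = 1.0 - np.arange(1, counts.size + 1) / counts.size

# Fit the tail (top 10% of values, excluding the last point where ccdf = 0)
tail = slice(int(0.9 * counts.size), counts.size - 1)
slope, _ = np.polyfit(np.log(counts[tail]), np.log(ccdf[tail]), 1)
print(f"estimated tail exponent gamma = {-slope:.2f}")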

4.3.3 Inferences and paths

A proposition is composed of nouns (concepts in our model) and of verbs and other parts of speech explicating their relation. If one leaves out the connections it is left with an ordered list of concepts, which, from the perspective of our graph, is a path

$$\pi_{i,j} = i\, v_1 v_2 \ldots v_k\, j$$

connecting the end-point nodes i and j. Every inference has an associated ordered list of concepts and, of course, more than one inference can be associated to the same list. Yet not all ordered lists of concepts need to be associated to an inference. It must be emphasized that this is valid for an inference which has the structure of a chain of reasoning: more complicated logics (e.g. where the same concept appears more than once in the propositions) cannot be modelled this way. As thoroughly explained in the following sections and in Chapter 5, a chain of actions among the entities is a good model for a biologist in pharmaceutical research, where our methodology has been applied.

The aim of this research work is to find a way to produce paths in the graph representing meaningful inferences. We will assume that we are dealing with small-world connected graphs and therefore that, in general, more than one path is available between any two nodes: the problem of getting an 'inference path' then becomes a problem of suitably constraining the set of possible paths and obtaining a ranking of confidence of the paths we have got. Two general considerations can be made on the form of such paths (see paragraph 2.5.2.4):

1. They should be in some way 'minimal'. In fact, longer paths represent longer chains of reasoning and, in the end, longer propositions, which are less frequent in natural language. Moreover, a path much longer than the network diameter is likely to cross more hubs, and the connections between hubs tend to obfuscate specific meanings and to highlight strong links in general, without being affected by the specificities of the end points of the path.

2. They should connect end-points which are 'close' enough. Of course this closeness cannot be metric, because in a small-world network every node is 'close' to every other. Rather, it must represent the strength of the 'real' relation between two given concepts, because the more two concepts are related, the higher is the probability that a meaningful inference among them exists. Conversely, the 'closeness' of two concepts is higher if more propositions connecting them are available: the measure of closeness must then depend on the abundance of possible paths, rather than on the availability of a very short one [192].

These considerations drive our strategy in the search for inferences. We will gauge the strength of the relation of each pair of concepts as a function of the abundance of paths connecting them. We will show that a random walker is a suitable tool to produce this ranking. For each pair we will also compute a suitable minimal path that will provide the sketch of an inference. The paths thus obtained will be ranked with the random walk distance technique and then submitted to experts in the field under examination for evaluation.

k     keyword
234   VEGF-C
219   VEGF-A
219   VASOPRESSIN
218   VASODILATION
217   VASOCONSTRICTION
211   UVEA
207   UTERUS
205   UROTENSIN
204   UROCORTIN
199   T KILLER CELL
197   T HELPER CELL
197   TUBULAR CELL
196   TSH
194   TSC2
194   TRYPTOPHAN HYDROXILASE
191   TROPHOBLAST
183   TRH
182   TRAF
182   TPH
180   TNF-R1

Table 4.2: Hub vertices with largest degree

4.3.4 Building inferential paths

In order to study the structure of the interrelations measured from the automatic reading of the literature, we can build a weighted graph where the vertices are the keywords and weighted links connect vertices with non-zero co-occurrences.
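Operationally, the construction of this weighted graph is straightforward; a minimal sketch with networkx, assuming the measured co-occurrences are available as a dictionary keyed by keyword pairs (the values shown here are taken from Table 4.1):

import networkx as nx

# Co-occurrence counts keyed by keyword pairs (values from Table 4.1)
cooc = {("ERK", "PHOSPHORYLATION"): 9641,
        ("HUMAN", "MOUSE"): 8701,
        ("ERK", "MAPK"): 6731,
        ("MAPK", "PHOSPHORYLATION"): 4983}

G = nx.Graph()
for (i, j), s in cooc.items():
    if s > 0:                                 # only non-zero co-occurrences become edges
        G.add_edge(i, j, weight=s)

print(G.number_of_nodes(), G.number_of_edges())
print(sorted(G.degree, key=lambda kv: -kv[1]))    # most connected vertices first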

Figure 4.5 shows the resulting network, which is very rich in links (23,233 edges, i.e. 37% of the complete graph) and has a rather uniform distribution of the vertex connectivity (degree distribution), with some hub vertices that are linked with almost all other vertices. A list of the twenty highest connected hubs is provided in Table 4.2.

As stated earlier, the objective is to highlight strong semantic relations between peptides and diseases that emerge from the analysis of this intricate set of connections; intuitively, a meaningful path (associated with larger levels of similarity) should link together a given peptide with a given disease through the path that maximizes similarity.

(a) Network of co-occurrences (red squares: peptides; blue triangles: diseases; green circles: other keywords)


(b) Distribution of the vertex connectivity (degree) in the network

Figure 4.5: A global view of the network of non-zero co-occurrences

However, it is also intuitive that a large number of weak paths between a given peptide and a given disease may also be of great significance. In the following sections we shall explore these two complementary approaches.

4.4 Biological mechanisms through shortest paths

Let us investigate the simplest and most logical strategy for finding paths associated with larger levels of similarity, which consists in associating a connection between a given peptide and a given disease by looking at the path of links with largest similarity. As mentioned earlier, we have chosen to use as similarity measure the total number of co-occurrences of keywords within phrases over the whole body of the investigated literature:

$$s_{i,j} = f_{i,j} = \sum_\nu f_{i,j}^\nu \; . \qquad (4.1)$$

This is a symmetric and positive ($s_{i,j} \geq 0$) measure which returns larger values of similarity for couples of keywords that appear more frequently in the same phrase.

4.4.1 Direct links

Let us first look for the peptides and diseases that are directly linked ($s_{i,j} \neq 0$). This is the simplest path (i.e. a biological mechanism of action) associated with larger levels of similarity, and it must reflect the known connections between some peptides and some diseases. Of course, this is of little relevance for what concerns the discovery of new unknown connections; it is however very important in order to provide an immediate feedback on the meaningfulness of the similarity measure. The complete list of direct links between peptides and diseases is:

ANGIOTENSIN ←→ 375 ←→ SARCOIDOSIS
ANGIOTENSIN ←→ 92 ←→ ATHEROSCLEROSIS
ALPHA-MSH ←→ 61 ←→ ISCHEMIA
ANGIOTENSIN ←→ 39 ←→ ISCHEMIA
ALPHA-MSH ←→ 39 ←→ ANOREXIA
BNP ←→ 30 ←→ ISCHEMIA
ANGIOTENSIN ←→ 27 ←→ INFARCTION
ANP ←→ 22 ←→ ISCHEMIA
ANGIOTENSIN ←→ 22 ←→ HYPERCALCAEMIA
GSH ←→ 15 ←→ SARCOIDOSIS
NEUROTENSIN ←→ 14 ←→ SARCOIDOSIS
GSH ←→ 13 ←→ ISCHEMIA
ALPHA-MSH ←→ 13 ←→ ARTHRITIS
BRADYKININ ←→ 12 ←→ ISCHEMIA
ALPHA-MSH ←→ 10 ←→ REPERFUSION INJURY
ANGIOTENSIN ←→ 10 ←→ LAM
BNP ←→ 10 ←→ INFARCTION
ANP ←→ 10 ←→ ATHEROSCLEROSIS
AGRP ←→ 10 ←→ ANOREXIA
AVIPTADIL ←→ 9 ←→ ISCHEMIA
GLP-1 ←→ 8 ←→ ISCHEMIA
ALPHA-MSH ←→ 8 ←→ ARDS
BNP ←→ 7 ←→ SARCOIDOSIS
ANP ←→ 7 ←→ REPERFUSION INJURY
ANGIOTENSIN ←→ 7 ←→ NON-CASEATING GRANULOMA
AG85A ←→ 6 ←→ SARCOIDOSIS
CNP ←→ 6 ←→ ISCHEMIA
CNP ←→ 5 ←→ REPERFUSION INJURY
CNP ←→ 5 ←→ INFARCTION
ANP ←→ 5 ←→ INFARCTION
ALPHA-MSH ←→ 5 ←→ INFARCTION
ANGIOTENSIN ←→ 5 ←→ ERYTHEMA NODOSUM
ANGIOTENSIN ←→ 5 ←→ CBD
ANGIOTENSIN ←→ 3 ←→ REPERFUSION INJURY
CART ←→ 3 ←→ ISCHEMIA
GSH ←→ 3 ←→ INFARCTION
GLP-1 ←→ 3 ←→ INFARCTION
ALPHA-MSH ←→ 3 ←→ BULIMIA
ANGIOTENSIN ←→ 3 ←→ ARTHRITIS
BRADYKININ ←→ 2 ←→ SARCOIDOSIS
BRADYKININ ←→ 2 ←→ REPERFUSION INJURY
ANGIOTENSIN ←→ 2 ←→ NHL
AVIPTADIL ←→ 2 ←→ LEIOMYOMA
BRADYKININ ←→ 2 ←→ INFARCTION
GSH ←→ 2 ←→ ILD
AVIPTADIL ←→ 1 ←→ SARCOIDOSIS
ANP ←→ 1 ←→ SARCOIDOSIS
GLP-1 ←→ 1 ←→ REPERFUSION INJURY
BNP ←→ 1 ←→ REPERFUSION INJURY
LYSINE ←→ 1 ←→ NHL
GSH ←→ 1 ←→ NHL
GSH ←→ 1 ←→ MULTIPLE SCLEROSIS
AVIPTADIL ←→ 1 ←→ MULTIPLE SCLEROSIS
ANGIOTENSIN ←→ 1 ←→ MULTIPLE SCLEROSIS
CNP ←→ 1 ←→ MCT PULMONARY HYPERTENSION
ANP ←→ 1 ←→ MCT PULMONARY HYPERTENSION
AVIPTADIL ←→ 1 ←→ INFARCTION
ALPHA-MSH ←→ 1 ←→ CSA INDUCED TUBULOINTERSTITIAL FIBROSIS

Here the numbers between the arrows are the measured co-occurrences (i.e. the number of times the two keywords appear together in the same phrase).

4.4.2 Indirect links: searching for shortest paths

If our measure of dependency between keywords ($s_{i,j}$) is significant, then the most relevant relation between a given peptide 'i' and a given disease 'j' must be associated with the path $\pi_{i,j} = i\, v_1 v_2 \ldots v_k\, j$ that connects the two by passing through edges with the largest similarity. In network theory this problem is generally referred to as the shortest path problem [180, 193, 194].

The general problem consists in associating 'weights' (or 'costs') to each path, and selecting the one between the starting object 'i' and the ending object 'j' that maximizes the weight (or minimizes the cost). In our case, the general guiding principle is that larger weights must be associated to edges with larger similarity, but the precise choice of the functional relationship between the weights and the similarity is arbitrary, and different choices may be used for different path-search strategies. First we start from a network as in Figure 4.5, where an edge is present if and only if a co-occurrence between two keywords has been measured. As a general rule, the edges in such a network must have weights which are proportional to the number of co-occurrences. Let us here focus on two possible choices of the weight $w_{i,j}$ of the edge between vertex i and vertex j.

- First case: symmetric weights proportional to the co-occurrences

$$w_{i,j} = \frac{s_{i,j}}{\sum_{k,h} s_{k,h}} \; ; \qquad (4.2)$$

- Second case: asymmetric weights scaled by the degree strengths

$$w_{i,j} = \frac{s_{i,j}}{\sum_{k} s_{k,j}} \; ; \qquad (4.3) \qquad\qquad w_{i,j} = \frac{s_{i,j}}{\sum_{k} s_{i,k}} \; . \qquad (4.4)$$

One can note that these weights have the common characteristic of being defined in the range $w_{i,j} \in [0, 1]$. Furthermore, these weights $w_{i,j}$ can be interpreted as probabilities. In particular, in the first case (4.2) we can associate to $w_{i,j}$ the joint probability $p(i, j)$ of finding the co-occurrence of the two keywords 'i' and 'j'. On the other hand, in the second case (4.3 and 4.4) we can associate $w_{i,j}$ with the conditional probability $p(i|j)$ of finding keyword 'i' given a phrase containing keyword 'j'. We will explore this aspect in further detail in the next chapter. From the above weights we can introduce a dissimilarity measure:

$$d_{i,j} = -\log w_{i,j} \; , \qquad (4.5)$$

which is correctly defined in the range $d_{i,j} \in [0, \infty)$ and, as opposed to the weights, returns larger values for smaller similarities. Let us note that in the first case $d_{i,j}$ is symmetric and can be interpreted as a distance measure; in the second case, however, $d_{i,j}$ is asymmetric and therefore it is not a proper distance measure [195-197]. Let us now use the dissimilarity measure in order to extract the shortest paths. We note that the shortest path calculated from the dissimilarity measure $d_{i,j}$ is a sequence of edges that minimizes the sum over the $d_{i,j}$ and therefore maximizes the product of the weights. We will see in the next chapter that such a product is associated with the probability of the path. Let us now investigate separately the outcomes of each of the two above choices of dissimilarity measure.

4.4.2.1 Paths from the first case

The twenty paths with the smallest total $d_{i,j}$ from (4.2) connecting a peptide indirectly with a disease are:

CNP ←→ 115 ←→ LUNG ←→ 5317 ←→ SARCOIDOSIS
ALPHA-MSH ←→ 451 ←→ SKIN ←→ 1193 ←→ SARCOIDOSIS
ALPHA-MSH ←→ 577 ←→ INFLAMMATION ←→ 774 ←→ ATHEROSCLEROSIS
ALPHA-MSH ←→ 587 ←→ HUMAN ←→ 452 ←→ LEIOMYOMA
UROTENSIN ←→ 46 ←→ LUNG ←→ 5317 ←→ SARCOIDOSIS
VASOPRESSIN ←→ 1452 ←→ RAT ←→ 127 ←→ REPERFUSION INJURY
ANGIOTENSIN ←→ 232 ←→ LUNG ←→ 784 ←→ ILD
CNP ←→ 38 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA
UROCORTIN ←→ 335 ←→ HUMAN ←→ 477 ←→ SARCOIDOSIS
UROCORTIN ←→ 335 ←→ HUMAN ←→ 452 ←→ LEIOMYOMA
CNP ←→ 115 ←→ LUNG ←→ 1253 ←→ LAM
VASOPRESSIN ←→ 214 ←→ HUMAN ←→ 669 ←→ ATHEROSCLEROSIS
ANP ←→ 106 ←→ LUNG ←→ 1253 ←→ LAM
SAUVAGINE ←→ 93 ←→ RAT ←→ 1364 ←→ ISCHEMIA
UROTENSIN ←→ 267 ←→ HUMAN ←→ 452 ←→ LEIOMYOMA
ANGIOTENSIN ←→ 262 ←→ HUMAN ←→ 452 ←→ LEIOMYOMA
GSH ←→ 239 ←→ HUMAN ←→ 452 ←→ LEIOMYOMA
NPY ←→ 1230 ←→ CRF ←→ 85 ←→ ARTHRITIS
CNP ←→ 527 ←→ RAT ←→ 193 ←→ ARTHRITIS
ANP ←→ 509 ←→ RAT ←→ 193 ←→ ARTHRITIS

where the numbers are the co-occurrences of the two keywords. We observe that the shortest path algorithm correctly picks paths with high co-occurrences. However, we notice that there is a tendency for these paths to pass through hub vertices, such as HUMAN, or through the keywords used in the literature search, such as SARCOIDOSIS or LUNG.

4.4.2.2 Paths from the second case

The tendency to pass through hubs is certainly reduced by using the second measure (4.3 and 4.4), which drastically reduces the weights associated to hubs by dividing by the strength of the vertex. In this case, the twenty paths with the smallest total $d_{i,j}$ (asymmetric measure) connecting a peptide indirectly with a disease are (the numbers are the co-occurrences of the two keywords):

AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 297 ←→ ARTHRITIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 252 ←→ HYPERCALCAEMIA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 228 ←→ NON-CASEATING GRANULOMA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 154 ←→ ILD
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 5317 ←→ LUNG ←→ 1253 ←→ LAM
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 124 ←→ CBD
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 124 ←→ ERYTHEMA NODOSUM
THYMOPENTIN ←→ 27 ←→ LYMPHOCYTE ←→ 821 ←→ SARCOIDOSIS
SRP ←→ 31 ←→ UROCORTIN ←→ 129 ←→ ISCHEMIA
UROTENSIN ←→ 46 ←→ LUNG ←→ 5317 ←→ SARCOIDOSIS
SAUVAGINE ←→ 30 ←→ SKIN ←→ 1193 ←→ SARCOIDOSIS
ALPHA-MSH ←→ 451 ←→ SKIN ←→ 1193 ←→ SARCOIDOSIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 58 ←→ PNEUMOCONIOSIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 52 ←→ MULTIPLE SCLEROSIS
AMYLIN ←→ 23 ←→ RAT ←→ 1364 ←→ ISCHEMIA
AGRP ←→ 134 ←→ NEURON ←→ 650 ←→ ISCHEMIA
CNP ←→ 115 ←→ LUNG ←→ 5317 ←→ SARCOIDOSIS
THYMOPENTIN ←→ 31 ←→ MOUSE ←→ 1305 ←→ ISCHEMIA
SAUVAGINE ←→ 93 ←→ RAT ←→ 1364 ←→ ISCHEMIA
CNP ←→ 38 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA

This is an asymmetric measure, and therefore the inverse paths between a disease and a peptide do not have the same weights, as shown in the next list (the numbers therein are the co-occurrences of the two keywords):

DYSPHORIA ←→ 6 ←→ CRF ←→ 6715 ←→ VASOPRESSIN
HYPERCALCURIA ←→ 7 ←→ HYPERCALCAEMIA ←→ 22 ←→ ANGIOTENSIN
DYSPHORIA ←→ 6 ←→ CRF ←→ 2539 ←→ UROCORTIN
DYSPHORIA ←→ 3 ←→ KOR ←→ 6 ←→ ALPHA-MSH
BULIMIA ←→ 6 ←→ HYPOTHALAMUS ←→ 1389 ←→ VASOPRESSIN
PNEUMOCONIOSIS ←→ 58 ←→ SARCOIDOSIS ←→ 375 ←→ ANGIOTENSIN
CSA INDUCED TUBULOINTERSTITIAL FIBROSIS ←→ 1 ←→ ALPHA-MSH ←→ 366 ←→ NPY
ERYTHEMA NODOSUM ←→ 51 ←→ SKIN ←→ 451 ←→ ALPHA-MSH
DYSPHORIA ←→ 3 ←→ KOR ←→ 4 ←→ LYSINE
BULIMIA ←→ 17 ←→ ANOREXIA ←→ 10 ←→ AGRP
BULIMIA ←→ 17 ←→ ANOREXIA ←→ 10 ←→ UROCORTIN
KLINEFELTER SYNDROME ←→ 3 ←→ LIVER ←→ 427 ←→ GSH
CSA INDUCED TUBULOINTERSTITIAL FIBROSIS ←→ 2 ←→ APOPTOSIS ←→ 255 ←→ GSH
DYSPHORIA ←→ 6 ←→ CRF ←→ 1230 ←→ NPY
CSA INDUCED TUBULOINTERSTITIAL FIBROSIS ←→ 1 ←→ ALPHA-MSH ←→ 249 ←→ AGRP
HEMOLYTIC ANEMIA ←→ 9 ←→ SARCOIDOSIS ←→ 375 ←→ ANGIOTENSIN
MCT PULMONARY HYPERTENSION ←→ 11 ←→ RAT ←→ 1452 ←→ VASOPRESSIN

Here we see that the outcome is still unsatisfactory. The hubs are less predominant, but now some loosely connected keywords with low co-occurrences get into the paths instead, due to the renormalization process that favours poorly connected vertices. We also see that the two directions give very different results, and this is difficult to interpret.

4.4.2.3 Paths from the second case but with symmetrized average weights

In order to perform the renormalization of the weights associated to hubs, while also assuring a symmetric outcome, we used a symmetrized dissimilarity measure based on the average dissimilarity $d^{avg}_{i,j} = (d_{i,j} + d_{j,i})/2$. The twenty paths with the smallest total $d^{avg}_{i,j}$ connecting a peptide indirectly with a disease are shown in the next list (the numbers again are the co-occurrences of the two keywords):

AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 228 ←→ NON-CASEATING GRANULOMA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 6 ←→ HYPERCALCURIA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 124 ←→ ERYTHEMA NODOSUM
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 252 ←→ HYPERCALCAEMIA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 58 ←→ PNEUMOCONIOSIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 9 ←→ HEMOLYTIC ANEMIA
VASOPRESSIN ←→ 6715 ←→ CRF ←→ 6 ←→ DYSPHORIA
SAUVAGINE ←→ 423 ←→ CRF ←→ 6 ←→ DYSPHORIA
UROCORTIN ←→ 2539 ←→ CRF ←→ 6 ←→ DYSPHORIA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 154 ←→ ILD
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 124 ←→ CBD
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 5317 ←→ LUNG ←→ 36 ←→ MCT PULMONARY HYPERTENSION
STRESSCOPIN ←→ 92 ←→ CRF ←→ 6 ←→ DYSPHORIA
SAUVAGINE ←→ 423 ←→ CRF ←→ 162 ←→ ANOREXIA
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 52 ←→ MULTIPLE SCLEROSIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 297 ←→ ARTHRITIS
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 5317 ←→ LUNG ←→ 1253 ←→ LAM
AG85A ←→ 6 ←→ SARCOIDOSIS ←→ 5317 ←→ LUNG ←→ 185 ←→ ARDS
NPY ←→ 1230 ←→ CRF ←→ 6 ←→ DYSPHORIA
UROTENSIN ←→ 364 ←→ CRF ←→ 6 ←→ DYSPHORIA

We can see that this list is similar to the previous one, with an unsatisfactory mixture of connections with large co-occurrences and connections with small co-occurrences but through vertices with a small number of neighbours.

4.4.2.4 Paths from the second case with symmetrized minimum weights

Another alternative is to use a symmetrized dissimilarity measure that retains only the largest values: $d^{max}_{i,j} = \max\{d_{i,j}, d_{j,i}\}$. The twenty paths with the smallest total $d^{max}_{i,j}$ connecting a peptide indirectly with a disease are in this case:

ALPHA-MSH ←→ 451 ←→ SKIN ←→ 1193 ←→ SARCOIDOSIS
CNP ←→ 38 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA
UROCORTIN ←→ 29 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA
ALPHA-MSH ←→ 451 ←→ SKIN ←→ 272 ←→ LEIOMYOMA
ANP ←→ 21 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA
GSH ←→ 487 ←→ SUPEROXIDE DISMUTASE ←→ 7 ←→ REPERFUSION INJURY
CNP ←→ 115 ←→ LUNG ←→ 5317 ←→ SARCOIDOSIS
VASOPRESSIN ←→ 99 ←→ SMOOTH MUSCLE CELL ←→ 1283 ←→ ATHEROSCLEROSIS
ALPHA-MSH ←→ 577 ←→ INFLAMMATION ←→ 774 ←→ ATHEROSCLEROSIS
UROCORTIN ←→ 197 ←→ HEART ←→ 278 ←→ SARCOIDOSIS
ALPHA-MSH ←→ 3227 ←→ MELANOCYTE ←→ 8 ←→ ANGIOMYOLIPOMA
ANGIOTENSIN ←→ 480 ←→ SMOOTH MUSCLE CELL ←→ 198 ←→ LEIOMYOMA
SRP ←→ 49 ←→ STRESSCOPIN ←→ 1 ←→ ANOREXIA
SAUVAGINE ←→ 111 ←→ UROTENSIN ←→ 21 ←→ ATHEROSCLEROSIS
ANGIOTENSIN ←→ 375 ←→ SARCOIDOSIS ←→ 154 ←→ ILD
BNP ←→ 1404 ←→ CNP ←→ 38 ←→ UTERUS ←→ 4577 ←→ LEIOMYOMA
LVP ←→ 73 ←→ LYSINE ←→ 5 ←→ ANOREXIA
AGRP ←→ 10 ←→ ANOREXIA ←→ 17 ←→ BULIMIA
SAUVAGINE ←→ 30 ←→ SKIN ←→ 1193 ←→ SARCOIDOSIS
GSH ←→ 5 ←→ ANGIOMYOLIPOMA ←→ 257 ←→ LAM

As one can see, the above list now appears quite satisfactory, revealing paths with large co-occurrences, without a predominance of hubs, and with only a few keywords from the original literature search. Clearly, to properly judge the 'goodness' of these paths a qualified bio-medical investigation is necessary, and this is beyond the scope of this Thesis. However, we can certainly say that the above study has revealed a strong dependency of the shortest paths on the choice of similarity and dissimilarity measures. Given that there is a relative degree of arbitrariness in these choices, the overall message is that at this stage a careful and broad exploration of all the possibilities is necessary, and it must be carried out in strict collaboration with experts in this bio-medical field.

4.5 Constrained shortest paths as an inference measure

In the knowledge graph, the most relevant relation between a given peptide i and a given disease j must be associated with the path $\pi_{i,j} = i\, v_1 v_2 \ldots v_k\, j$ connecting the pair by passing through edges with the largest similarity. In network theory this problem is generally referred to as the shortest path problem. Also in this case, the general problem consists in associating weights (or costs) to each path, and selecting the one between the starting node i and the ending node j that maximizes the weight (or minimizes the cost). In our case, the general guiding principle is that larger weights must be associated to edges with larger similarity, but the precise choice of the functional relation between the weights and the similarity is arbitrary, and different choices may be used for different matching strategies. We will use the measure:

$$w_{i,j} = \frac{s_{i,j}}{\max(s_{i,j}) + \varepsilon} \; , \qquad (4.6)$$

where $0 < \varepsilon \ll \min(s_{i,j})$ is a small arbitrary constant. This is a rescaling of the $s_{i,j}$ matrix so that its coefficients satisfy the inequality:

$$0 \leq w_{i,j} < 1 \; .$$

A distance can then be defined as in Eq. 4.5:

$$d_{i,j} = -\log(w_{i,j}) \; .$$

$d_{i,j}$ is a dissimilarity measure which is correctly defined in the interval $d_{i,j} \in [0, \infty)$ and, as opposed to the weights, returns larger values for smaller similarities. The constant $\varepsilon$ has been used to avoid having a zero distance for the edge with the maximum weight.

Let us now use the dissimilarity measure $d_{i,j}$ in order to extract the shortest paths. We note that the shortest path calculated from the dissimilarity measure is a sequence of edges that minimizes the sum over the $d_{i,j}$ and therefore maximizes the product of the weights.

Given the small-world nature of the knowledge network, a pure shortest-path approach would produce paths with a very low number of concepts, mainly crossing through hubs. In order to produce less cryptic inferences, we exploit the categorization of the concepts introduced in the dictionary. Once the concepts (nodes) have been partitioned into categories, it is possible to apply constraints to a path by limiting the possible inter-category links: operatively, it means applying a mask to the original graph which leaves out some of the links (Figure 4.6). The mask provides a general reasoning model and must be shaped in a way that mirrors the type of reasoning adopted in the discipline under focus.

In this specific biomedical case we have applied a mask as shown in Figure 4.6. The path is permitted to roam freely between enzymes, proteins, processes and hormones, but it is constrained to cross a receptor to connect a peptide to one of them, and to cross at least one of the categories listed above before reaching a disease.

Figure 4.6: A schematic 'mask' mimicking a Mechanism of Action, used as a constraint in the Constrained Shortest Path inference


Figure 4.7: All constrained paths peptide ← PROCESS → disease
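A sketch of how such a category mask can be applied in practice, assuming each keyword carries a category label and the mask is encoded as a set of permitted category pairs; the graph, the weights and the receptor label MC1R below are purely illustrative, not data from this study.

import networkx as nx

# Illustrative categories; MC1R is a hypothetical receptor label for this sketch
category = {"ALPHA-MSH": "peptide", "MC1R": "receptor",
            "INFLAMMATION": "process", "SARCOIDOSIS": "disease"}

# The mask: permitted inter-category links (an illustrative reasoning model)
allowed = {frozenset(p) for p in [("peptide", "receptor"),
                                  ("receptor", "process"),
                                  ("process", "disease")]}

G = nx.Graph()
G.add_weighted_edges_from([("ALPHA-MSH", "MC1R", 1.0),
                           ("ALPHA-MSH", "SARCOIDOSIS", 2.0),   # forbidden by the mask
                           ("MC1R", "INFLAMMATION", 1.0),
                           ("INFLAMMATION", "SARCOIDOSIS", 1.0)])

masked = nx.Graph()                           # the mask leaves out forbidden links
masked.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                      if frozenset((category[u], category[v])) in allowed)

# The path is now forced through a receptor and a process
print(nx.dijkstra_path(masked, "ALPHA-MSH", "SARCOIDOSIS", weight="weight"))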

4.5.1 Shortest paths on the biomedical knowledge graph

4.5.2 Constrained shortest paths in the data set under investigation

Shortest paths have the intrinsic limitation that, by selecting only the shortest and strongest links, they have a tendency to pass through trivial connections rather than through the significant ones. We have discussed above some methods to control the excessive tendency to pass through hubs such as HUMAN. However, it is quite clear that a more efficient way to avoid paths that pass through trivial keywords is to specifically search for paths that are 'constrained' to pass through some specific regions with a clear biological and physiological meaning. These are the constrained paths. Let us here, for example, consider the shortest paths that pass through LUNG.

Some of the shortest paths connecting peptides and LUNG are:

ANGIOTENSIN ←→ 232 ←→ LUNG
CNP ←→ 115 ←→ LUNG
ANP ←→ 106 ←→ LUNG
GSH ←→ 76 ←→ LUNG
AVIPTADIL ←→ 53 ←→ LUNG
UROTENSIN ←→ 46 ←→ LUNG
ALPHA-MSH ←→ 41 ←→ LUNG
BNP ←→ 28 ←→ LUNG
BRADYKININ ←→ 25 ←→ LUNG
UROCORTIN ←→ 25 ←→ LUNG
NEUROTENSIN ←→ 21 ←→ LUNG

Whereas some of the shortest paths connecting LUNG and diseases are:

LUNG ←→ 5317 ←→ SARCOIDOSIS
LUNG ←→ 1253 ←→ LAM
LUNG ←→ 784 ←→ ILD
LUNG ←→ 237 ←→ LEIOMYOMA
LUNG ←→ 233 ←→ CBD
LUNG ←→ 185 ←→ ARDS
LUNG ←→ 150 ←→ ISCHEMIA
LUNG ←→ 134 ←→ ANGIOMYOLIPOMA
LUNG ←→ 79 ←→ PNEUMOCONIOSIS

Here the numbers represent the co-occurrences of the two keywords. In this example the paths have been calculated by using the symmetrized dissimilarity measure $d^{max}_{i,j} = \max\{d_{i,j}, d_{j,i}\}$.

It is quite clear from the above list that the constraint has provided an important selection of meaningful paths. One must consider that the constraint can be implemented in several ways; for instance, one can constrain the passage through several keywords or through an entire category.

4.6 Most abundant paths

By construction, the shortest path algorithm extracts the strongest dependency chain between a peptide and a disease. It has been discussed in the previous chapter that different measures for dependency and different renormalizations of the similarity/dissimilarity measure can be used to explore alternative paths. However, one may adopt a complementary perspective where, instead of linking together keywords according to the heaviest path between them, one looks at all the paths that connect the two keywords and uses the abundance of these paths, together with their weights, as a measure of the strength of the linkage between the keywords. This can be achieved by developing a completely new framework based on the average number of steps required to go from one vertex to the other in the network, assuming that a walker moves at random and at each discrete time-step jumps from a vertex to one of its neighbours with a probability that is proportional to the dependency between the two keywords. Let us first discuss how to build this probability, and then we will see how to calculate the average walking times.

4.6.1 The shortest path as the most probable path

Before tackling the random walk problem, it is worth remarking that the shortest path is the one associated with the largest probability among all possible paths between two vertices. Indeed, the probability of a given path $\pi_{i,j} = i\, v_1 v_2 \ldots v_k\, j$ between vertex i and vertex j is:

$$p(\pi_{i,j}) = p(i \leftarrow v_1)\, p(v_1 \leftarrow v_2) \ldots p(v_k \leftarrow j) \; ; \qquad (4.7)$$

By definition, the shortest path is the one that minimizes the total distance

$$d(i, v_1) + d(v_1, v_2) + \ldots + d(v_k, j) \; , \qquad (4.8)$$

but we defined the distance as:

$$d_{u,v} = -\log w_{u,v} = -\log p(u \leftarrow v) \; , \qquad (4.9)$$

and therefore the shortest path minimizes

$$-\log\left[ p(i \leftarrow v_1)\, p(v_1 \leftarrow v_2) \ldots p(v_k \leftarrow j) \right] = -\log p(\pi_{i,j}) \; . \qquad (4.10)$$

We have therefore shown that the shortest path algorithm selects the chain that maximizes the path probability $p(\pi_{i,j})$.

4.6.2 Stochastic measures. The Random Walk distance

Let us now adopt a stochastic perspective in order to retrieve a measure of relation between vertices, seen as the average number of steps necessary to 'walk' from one vertex to the other. We shall call this measure the random walk distance [198].

The most relevant relations between a given peptide 'i' and a given disease 'j' should be associated with paths $\pi_{i,j} = i\, v_1 v_2 \ldots v_k\, j$ passing through links with the largest weights. In network theory this problem is generally referred to as the shortest path problem [180, 193, 194], which searches for the sequence of edges that minimizes the sum over a dissimilarity measure associated with each edge in the path. However, strong indirect connections between a given peptide i and a given disease j may also arise from paths with smaller weights that contribute in larger numbers. We therefore look at all the paths that connect the two concepts and use the abundance and redundancy of these paths, together with their weights, as a measure of the strength of the link between the concepts. This can be achieved by measuring the average number of steps required to go from one vertex to the other in the network, assuming that a walker moves at random and at each (discrete) time-step jumps from a vertex to one of its neighbours with a probability $p(j \leftarrow i)$, to move from vertex i to a neighbour j, which is proportional to the dependency measure $f_{i,j}$. To express it formally, we assume that the similarity matrix of the graph is:

$$S = \{s_{i,j}\} \; ,$$

and we want the probability to move from a vertex i to a neighbour j to be proportional to the dependency measure:

$$p(j \leftarrow i) \propto s_{i,j} \; .$$

If we assume that the walker must always move to one of its neighbours at each time step, then it is clear that the probabilities to move to its neighbours must sum to one: $\sum_j p(j \leftarrow i) = 1$. The combination of these two assumptions implies univocally:

$$p(j \leftarrow i) = p(j|i) = \frac{f_{i,j}}{\sum_k f_{k,i}} \; , \qquad (4.11)$$

these are conditional probabilities expressing the chance to find concept j within a phrase containing concept i. The matrix P with elements $(P)_{j,i} = p(j \leftarrow i)$ is often referred to as the transfer matrix [199].

The average number of steps necessary to 'walk' from a vertex 'i' to a vertex 'j' is the random walk distance [199], which can be computed from [200]:

$$d^{RW}_{i,j} = \sum_{k=1}^{N} \left[ \left( I - B(j) \right)^{-1} \right]_{i,k} \; ; \qquad (4.12)$$

where N is the number of vertices in the network, I is the identity matrix and B(j) is an N × N matrix which is identical to the transfer matrix P except that $(B(j))_{i,j} = 0$ (for any $i \in [1, N]$).
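Equation (4.12) translates directly into a few lines of linear algebra. The sketch below applies it verbatim with numpy on a toy symmetric similarity matrix, assuming the column-stochastic convention $(P)_{j,i} = p(j \leftarrow i)$ described above; the matrix values are illustrative only.

import numpy as np

def random_walk_distance(S):
    # Eq. (4.12) applied verbatim: d[i, j] collects the row sums of the
    # inverse of (I - B(j)). The transfer matrix is column-stochastic,
    # with P[j, i] = p(j <- i) = S[j, i] / sum_k S[k, i].
    n = S.shape[0]
    P = S / S.sum(axis=0, keepdims=True)
    d = np.zeros((n, n))
    for j in range(n):
        B = P.copy()
        B[:, j] = 0.0                          # (B(j))_{i,j} = 0 for every i
        d[:, j] = np.linalg.inv(np.eye(n) - B).sum(axis=1)   # sum over k
    return d

# Toy symmetric co-occurrence matrix standing in for the real data
S = np.array([[0., 5., 1.],
              [5., 0., 2.],
              [1., 2., 0.]])
D = random_walk_distance(S)
print(np.round(D, 2))                          # note: D is, in general, asymmetric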

Let us note that $d^{RW}_{i,j}$ is not a proper distance because it is asymmetric: the walker does not take, on average, the same number of steps to go in one direction as in the other. This is a direct consequence of the asymmetry of the transfer matrix P which, in turn, is due to the renormalization of the similarities by the vertex strength in Eq. 4.11. This renormalization makes it much less likely for the walker to take a path through a hub: although the walker is likely to get into a hub vertex, once inside it the walker is 'lost' in a large labyrinth of equivalent exit directions, and it is rather unlikely to get out in the right direction. If the knowledge graph were a regular graph (i.e. a graph where every node has the same degree) the asymmetry would not be present: asymmetry is induced by the presence of hubs and, in general, by the small-world structure.

To circumvent this issue, we conform to the typical solution of 'symmetrizing' $d^{RW}_{i,j}$ and thus computing the average:

$$d^{S\text{-}RW}_{i,j} = d^{S\text{-}RW}_{j,i} = \frac{d^{RW}_{i,j} + d^{RW}_{j,i}}{2} \; . \qquad (4.13)$$
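Continuing the numerical sketch above, the symmetrization of Eq. (4.13) is a one-liner; D here stands for a matrix of random walk distances such as the one returned by the random_walk_distance sketch earlier (a hypothetical helper, not part of the thesis code), with toy values shown:

import numpy as np

D = np.array([[0.0, 2.4, 3.1],
              [2.9, 0.0, 2.2],
              [3.6, 2.0, 0.0]])    # toy asymmetric random walk distances
D_sym = (D + D.T) / 2.0            # Eq. (4.13): average over the two directions
print(D_sym)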

The new distances are a natural ranking of the ”closeness” of two nodes in the graph.

4.6.2.1 Biological mechanism from paths based on random walk distance

Let us focus first on the pairings of peptides and diseases that are at the shortest random walk distance $d^{RW}_{i,j}$. The random walk distance is asymmetric, and therefore we must look both at the peptide → disease pairs and at the disease → peptide pairs. The pairs with the shortest times are reported here:

Peptide −→ Disease
AG85A ←− 26.4 −→ SARCOIDOSIS
ANGIOTENSIN ←− 101.9 −→ SARCOIDOSIS
THYMOPENTIN ←− 107.7 −→ SARCOIDOSIS
BRADYKININ ←− 107.9 −→ SARCOIDOSIS
GSH ←− 109.8 −→ SARCOIDOSIS
AMYLIN ←− 110.7 −→ SARCOIDOSIS
UROTENSIN ←− 110.9 −→ SARCOIDOSIS
CNP ←− 111.0 −→ SARCOIDOSIS
NEUROTENSIN ←− 111.1 −→ SARCOIDOSIS
ANP ←− 111.2 −→ SARCOIDOSIS
BNP ←− 111.3 −→ SARCOIDOSIS

Disease −→ Peptide
CSA INDUCED TUBULOINTERSTITIAL FIBROSIS ←− 102.1 −→ ALPHA-MSH
ANOREXIA ←− 102.4 −→ VASOPRESSIN
DYSPHORIA ←− 102.7 −→ VASOPRESSIN
BULIMIA ←− 103.0 −→ VASOPRESSIN
MULTIPLE SCLEROSIS ←− 110.4 −→ VASOPRESSIN
ISCHEMIA ←− 110.5 −→ VASOPRESSIN
INFARCTION ←− 110.7 −→ VASOPRESSIN
CSA INDUCED TUBULOINTERSTITIAL FIBROSIS ←− 110.9 −→ VASOPRESSIN
REPERFUSION INJURY ←− 111.3 −→ VASOPRESSIN
ARTHRITIS ←− 111.3 −→ VASOPRESSIN
MCT PULMONARY HYPERTENSION ←− 111.8 −→ VASOPRESSIN

Here the numbers displayed between the arrows are the random walk distances, i.e. the average number of steps that the walker takes to go from a given peptide to a given disease. As one can see, these paths reveal mixed results, with a very strong dependence on the direction. We observe that some connections which are known to be relevant are correctly highlighted in both directions, but others take very long times in one direction and very short times in the other, making the results difficult to analyze. This is a well known fact, and indeed in the literature the average time between the two directions is often used instead [198], [201], [202].

4.6.2.2 Paths associated with larger levels of similarity from average random walk distance

The first thirty pairs peptide ←→ disease connected by the shortest average random walk time, $(d^{RW}_{i,j} + d^{RW}_{j,i})/2$, are:

VASOPRESSIN ←− 112.6 −→ SARCOIDOSIS
VASOPRESSIN ←− 113.9 −→ ISCHEMIA
ALPHA-MSH ←− 123.5 −→ SARCOIDOSIS
ALPHA-MSH ←− 125.2 −→ ISCHEMIA
ANGIOTENSIN ←− 152.3 −→ SARCOIDOSIS
ANGIOTENSIN ←− 163.0 −→ ISCHEMIA
VASOPRESSIN ←− 190.8 −→ ATHEROSCLEROSIS
UROCORTIN ←− 192.9 −→ ISCHEMIA
UROCORTIN ←− 194.0 −→ SARCOIDOSIS
VASOPRESSIN ←− 194.8 −→ LEIOMYOMA
CNP ←− 201.5 −→ SARCOIDOSIS
ANP ←− 202.0 −→ SARCOIDOSIS
CNP ←− 202.1 −→ ISCHEMIA
ALPHA-MSH ←− 202.2 −→ ATHEROSCLEROSIS
ANP ←− 202.3 −→ ISCHEMIA
ALPHA-MSH ←− 206.5 −→ LEIOMYOMA
NPY ←− 207.0 −→ SARCOIDOSIS
NPY ←− 207.4 −→ ISCHEMIA
ANGIOTENSIN ←− 234.6 −→ ATHEROSCLEROSIS
ANGIOTENSIN ←− 244.0 −→ LEIOMYOMA
BNP ←− 266.4 −→ ISCHEMIA
BNP ←− 267.0 −→ SARCOIDOSIS
UROCORTIN ←− 272.0 −→ ATHEROSCLEROSIS
UROCORTIN ←− 276.1 −→ LEIOMYOMA
CNP ←− 278.2 −→ ATHEROSCLEROSIS
ANP ←− 279.3 −→ ATHEROSCLEROSIS
CNP ←− 284.1 −→ LEIOMYOMA
ANP ←− 284.8 −→ LEIOMYOMA
NPY ←− 284.9 −→ ATHEROSCLEROSIS
NPY ←− 289.1 −→ LEIOMYOMA

Where the numbers within the arrows are the average random walk distances.

4.6.2.3 Paths associated with larger levels of similarity from maximum random walk distance

Although the path structure obtained by using the average random walk distance is meaningful and informative, the presence of a large hub around HUMAN may hide other interesting features in the organization of the dataset. A way to mitigate the effect of hubs is to look at the least favorable case by using the symmetrized maximum distance $\max(d^{RW}_{i,j}, d^{RW}_{j,i})$.

The first thirty pairs peptide ←→ disease connected by the shortest maximum distance are:

VASOPRESSIN ←− 112.6 −→ SARCOIDOSIS
VASOPRESSIN ←− 117.3 −→ ISCHEMIA
ALPHA-MSH ←− 133.9 −→ ISCHEMIA
ALPHA-MSH ←− 135.5 −→ SARCOIDOSIS
ANGIOTENSIN ←− 202.6 −→ SARCOIDOSIS
ANGIOTENSIN ←− 209.9 −→ ISCHEMIA
ANGIOTENSIN ←− 262.9 −→ ATHEROSCLEROSIS
ALPHA-MSH ←− 268.8 −→ ATHEROSCLEROSIS
VASOPRESSIN ←− 269.5 −→ ATHEROSCLEROSIS
UROCORTIN ←− 271.4 −→ ISCHEMIA
UROCORTIN ←− 275.3 −→ ATHEROSCLEROSIS
UROCORTIN ←− 275.9 −→ SARCOIDOSIS
ANGIOTENSIN ←− 276.2 −→ LEIOMYOMA
UROCORTIN ←− 276.7 −→ LEIOMYOMA
ALPHA-MSH ←− 277.0 −→ LEIOMYOMA
VASOPRESSIN ←− 277.3 −→ LEIOMYOMA
CNP ←− 289.2 −→ ISCHEMIA
ANP ←− 289.8 −→ ISCHEMIA
CNP ←− 290.1 −→ ATHEROSCLEROSIS
ANP ←− 291.5 −→ ATHEROSCLEROSIS
CNP ←− 292.1 −→ SARCOIDOSIS
CNP ←− 292.2 −→ LEIOMYOMA
ANP ←− 292.8 −→ SARCOIDOSIS
ANP ←− 293.1 −→ LEIOMYOMA
NPY ←− 298.3 −→ ISCHEMIA
NPY ←− 300.5 −→ ATHEROSCLEROSIS
NPY ←− 300.9 −→ LEIOMYOMA
NPY ←− 301.3 −→ SARCOIDOSIS
BNP ←− 418.8 −→ ISCHEMIA
BNP ←− 421.3 −→ ATHEROSCLEROSIS

Here the numbers within the arrows represent the maximum random walk distances. We can see that there are some interesting overlaps with the previous pairings, but also several new paths that appear significant.

4.7 Network approach to information filtering

4.7.1 Overall structure analysis

The way to study the overall structure of the graph is to build a subgraph whose topological structure represents the dependency among the nodes but contains a lower number of edges than the original [117, 193, 194]. In such a network all the important dependency relations must be represented, but the network should be as 'simple' as possible. The simplest connected graph is a spanning tree (a graph with no cycles that connects all nodes) [117, 193, 194], and it is therefore natural to choose as representative network a spanning tree which retains the maximum possible correlations; such a network is called the Minimum Spanning Tree (MST) [101, 127]. We shall build the MST of our graph using the classical Kruskal algorithm [194]. The condition that the extracted network should be a tree is a strong constraint that greatly reduces the number of possible paths, sometimes eliminating links that are only marginally weaker than the ones kept. A recently proposed solution to keep a greater number of connections consists in building graphs embedded on surfaces with a given genus [195, 203]. It is known that, given a surface of large enough genus, any network can be embedded on it. From a general perspective, the larger the genus, the larger the complexity of the embedded triangulation. The simplest network is the one associated with g = 0, which is a triangulation of a topological sphere. Such planar graphs are called Planar Maximally Filtered Graphs (PMFG). We will compute the MST and the PMFG of our knowledge graph to gain insight about it, alongside other usual metrics employed in the study of complex networks: betweenness centrality, eigenvector centrality, degree distribution and link weight distribution.

[Panel annotation: $p(s) \sim s^{-\alpha}$, $\alpha = 1.69$]

(a) Frequency distribution (b) Complementary Cumulative Distribution

Figure 4.8: Distribution of the weights for the non-zero entries of the dependency matrix S.

4.7.2 Information filtering

Let us first focus on the structure and properties of the information embedded upon the construction of a knowledge graph. Typically, in these kinds of data structures the non-zero entries of the similarity matrix S are of the order of $N^2$, where N is the total number of semantic objects. Given that the number of known diseases is of the order of several thousands and the number of known peptides is of the order of tens of thousands, it is evident that, in order to have an overall vision of the data structure, the dependency matrix S must be filtered into a simpler structure with a lower number of links.

4.7.3 Applying Filtered Networks Techniques

From a network perspective, the dependency matrix S is associated with a rather dense weighted graph G with a number of edges $E = O(N^2)$. As an example we can process 1580 papers and focus on the paths emerging from the pair of concepts LAM - UROCORTIN [LU] (a disease and a peptide); the graph G for the case study [LU] is shown in Figure 4.9. It is clear that, even in this relatively small case study where N = 61, we have an overwhelming number of links that makes it quite difficult to extract any meaningful information. We therefore need a filtering method.

Figure 4.9: The graph G is a network representation of the dependency matrix S with elements $s_{i,j} = \sum_\nu d^\nu_{i,j}$, obtained by summing the semantic tree distance $d^\nu_{i,j}$ over the papers ν. Note that $d^\nu_{i,j}$ is the average distance in the semantic tree between the semantic object 'i' and the semantic object 'j' in paper ν. Sample case study LAM-UROCORTIN [LU] with N = 61.

The filtering method is a pruning process on G where links with lower weights in the dependency measure are removed and links with larger weights are kept. However, this cannot be done by means of a simple thresholding because, typically, these complex data structures are organized in a hierarchical way, with a heterogeneous coexistence of regions of redundant high dependency that may be crucially linked to regions of small but essential dependency, and with values of the dependency measure that span across scales. This is shown in Figure 4.8, where the complementary cumulative distribution $P_>(S)$ of the values of the non-zero elements of S is reported for the sample case study search [LU].

One can note an evident linear trend in the log-log plot of $P_>(S)$, which implies the scaling relation $p(s) \sim s^{-\alpha}$ for the pdf of the dependency entries.


Figure 4.10: Minimum Spanning Tree graph for the case study [LU]

This means that a characteristic scale cannot be associated with these data, and the dependency values $s_{i,j}$ are statistically characterized by a large occurrence of small values together with a small but finite probability of occurrence of very large values. This kind of statistical distribution is typical of a large class of complex data structures [191].

4.7.3.1 Minimum spanning tree

Here we want to build a network whose topological structure represents the dependency among the different elements but contains a lower number of edges than the original G. In such a network all the important dependency relations must be represented, but the network should be as 'simple' as possible. The simplest connected graph is a spanning tree (a graph with no cycles that connects all vertices). It is therefore natural to choose as a representative network a spanning tree which retains the maximum possible correlations. Such a network is called the Minimum Spanning Tree (MST). There are several algorithms to build an MST, the two most common being Prim's algorithm [204] and Kruskal's algorithm [205], both from the 1950s, but there are also older ones. Remarkably, there are also very recent ones, such as the algorithm by Chazelle [206], proposed in the year 2000, which is so far the most efficient, running in almost linear time in the number of edges.

The general approach for the construction of the MST is to connect the most correlated pairs while constraining the network to be a tree. Let us here describe a very simple algorithm (similar to Kruskal's), which is very intuitive and will help to clarify the concept; a sketch in code follows the list of steps.

1. Make an ordered list of pairs i, j, ranking them by decreasing correlation $\rho_{i,j}$ (first the largest, last the smallest).

2. Take the first element in the list and add the edge to the graph.

3. Take the next element and add the edge if the resulting graph is still a forest or a tree, otherwise discard it.

4. Iterate the process from Step 3 until all pairs have been exhausted.
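A sketch of these four steps in Python, assuming the pairwise similarities come as (similarity, i, j) triples; the forest condition of step 3 is checked with the standard union-find structure (the triples shown are toy data).

def mst_from_similarities(pairs, n_vertices):
    # Kruskal-like construction: scan the pairs by decreasing similarity and
    # keep an edge whenever it does not close a cycle (union-find forest test).
    parent = list(range(n_vertices))

    def find(x):
        # Root of x's component, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for s, i, j in sorted(pairs, reverse=True):   # step 1: decreasing similarity
        ri, rj = find(i), find(j)
        if ri != rj:                              # steps 2-3: accept if still a forest
            parent[ri] = rj
            tree.append((i, j, s))
        if len(tree) == n_vertices - 1:           # step 4: the tree is complete
            break
    return tree

# Toy (similarity, i, j) triples standing in for the measured dependencies
pairs = [(9.0, 0, 1), (7.0, 1, 2), (6.0, 0, 2), (3.0, 2, 3), (1.0, 0, 3)]
print(mst_from_similarities(pairs, 4))            # N - 1 = 3 edges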

The resulting MST has N − 1 edges and it is the spanning tree which maximizes the sum of the correlations over the connected edges. The MST for the case study [LU] seen before is shown in Figure 4.10. As one can notice, the original bias of the search in the literature (where LAM and UROCORTIN were used as keywords) is reproduced in the data structure, which shows two hubs for these objects. This is clearly a limit of this approach, which has however the advantage of giving a simple representation of the search results. This technique can nevertheless be used as a preliminary explorative tool that visually underlines the structure of important links between semantic objects. By means of this tool one can carry out a powerful exploration of the details of the data structure by selectively 'knocking down' some of the objects and looking at the construction of alternative paths.

4.7.3.2 Planar Maximally Filtered Graph

The MST approach is a very powerful method to study the data structure, but some of its aspects might be unsatisfactory. In particular, the condition that the extracted network should be a tree is a strong constraint that greatly reduces the number of possible paths, sometimes eliminating links that are only marginally weaker than the ones in the MST. Ideally, one would like to maintain the same powerful filtering properties of the MST while also allowing the presence of cycles and extra links in a controlled manner. A recently proposed solution consists in building graphs embedded on surfaces with a given genus [207]. (Roughly speaking, the genus of a surface is the number of holes in the surface: g = 0 corresponds to the embedding on a topological sphere; g = 1 on a torus; g = 2 on a double torus; etc.) The algorithm to build such a network is identical to the one for the MST discussed previously, except that at step 3 the condition to accept the link now requires that the resulting graph must be embeddable on a surface of genus g. The resulting graph has 3N − 6 + 6g edges and it is a triangulation of the surface.
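For g = 0 the embeddability condition of step 3 reduces to a planarity test, for which standard routines exist; a minimal sketch of the PMFG construction with networkx, using the same greedy scheme as the MST (the similarities are toy values, not data from this study):

import networkx as nx

def pmfg(pairs):
    # Greedy PMFG: add edges by decreasing similarity, keeping an edge only
    # if the growing graph remains planar (genus g = 0).
    G = nx.Graph()
    for s, i, j in sorted(pairs, reverse=True):
        G.add_edge(i, j, weight=s)
        planar, _ = nx.check_planarity(G)
        if not planar:
            G.remove_edge(i, j)                   # rejecting the link restores planarity
    return G

# Toy similarities; for N = 5 vertices the PMFG has at most 3N - 6 = 9 edges
pairs = [(9.0, 0, 1), (8.0, 0, 2), (7.0, 1, 2), (6.0, 2, 3), (5.0, 1, 3),
         (4.0, 0, 3), (3.0, 0, 4), (2.0, 3, 4), (1.0, 1, 4)]
print(pmfg(pairs).number_of_edges())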

It has been proven that the MST is always a sub-graph of such a graph [208]. It is known that, given a surface of large enough genus, any network can be embedded on it. From a general perspective, the larger the genus, the larger the complexity of the embedded triangulation. The simplest network is the one associated with g = 0, which is a triangulation of a topological sphere. Such planar graphs are called Planar Maximally Filtered Graphs (PMFG) [208]. PMFGs have the algorithmic advantage that planarity tests are relatively simple to perform. The PMFG network for the case study [LU] is shown in Figure 4.11. As one can see, we retrieve a structure similar to the MST graph in Figure 4.10, but with a richer structure of links and dependencies that must be studied.


Figure 4.11: PMFG network for the case study [LU]

4.7.4 Analysing the structure of the induced semantic graph through Minimum Spanning Tree

The overall structure of co-occurrences can be studied with a 'holistic' approach based on the spanning tree that maximises the total weight and therefore minimises the total dissimilarity $d_{i,j}$. This is the so-called Minimum Spanning Tree (MST). It should be quite clear that, overall, a shortest path between two distant vertices in the network is most likely to pass through the structure of this spanning tree (Figure 4.12).

Figure 4.13 shows the MST resulting from the 'first case' symmetric weights (4.2), proportional to the co-occurrences.


Figure 4.12: MST from the second measure (4.3 and 4.4), using the symmetrized dissimilarity measure $d^a_{i,j} = (d_{i,j} + d_{j,i})/2$

One can note that there are a few hub vertices. For instance, six keywords have more than 10 connections: CRF and HUMAN with 24 connections each, followed by SARCOIDOSIS with 19, LUNG with 14, MOUSE with 13 and TNF-ALPHA with 11. These hubs reflect both the importance of the keyword and the input to the search algorithm when the literature was selected. We observe that, meaningfully, most of the diseases are gathered in the region around the hub LUNG, because this has indeed been our main interest in the literature search. On the other hand, most of the peptides are around the hub CRF, which appears to play a very central role. We have also investigated the MST from the 'second case' measure (4.3 and 4.4). However, the MST can only be constructed from a symmetric dissimilarity measure; therefore, consistently with the previous study, we have symmetrized $d_{i,j}$ by using two different criteria:

i) the average dissimilarity $d^a_{i,j} = (d_{i,j} + d_{j,i})/2$; ii) the maximum dissimilarity $d^m_{i,j} = \max\{d_{i,j}, d_{j,i}\}$. The MST resulting from $d^a_{i,j}$ is shown in Figure 4.12, whereas the MST from $d^m_{i,j}$ is shown in Figure 4.14.

Figure 4.13: Minimum Spanning Tree obtained from the co-occurrences by using the first weight measure from 4.3 and 4.4 and the relative dissimilarity measure from 4.5 shown in Figure 4.14. We can observe that the two MSTs are very different from each other and different from Figure 4.13as well. Let us note that interestingly, in Figure 4.12 several diseases are now occupying the central branch of the network. The MST in Figure 4.14 is the one that mostly differentiates showing a more open branching structure with an intriguing mixture of peptides and diseases.

4.7.4.1 MST global structure and average random walk distance

In order to understand the properties of the average random walk distance $(d^{RW}_{i,j} + d^{RW}_{j,i})/2$, we build the minimum spanning tree upon this distance. As one can see from Figure 4.15, the MST is characterized by one very large hub made by the keyword HUMAN, which has 176 connections. Interestingly, the second largest hub is CRF with 32 connections, mostly to peptides. The third most connected vertex is LUNG with 9 connections, mostly to pulmonary diseases. Other relevant vertices are MOUSE with 7, RAT with 5, and ERK and LIVER with 4 connections.

Figure 4.14: MST from the second measure with symmetrized dissimilarity measure $d^{m}_{i,j} = \max\{d_{i,j}, d_{j,i}\}$.

4.7.4.2 MST global structure and maximum random walk distance

The minimum spanning tree derived from the maximum random walk distance $\max(d^{RW}_{i,j}, d^{RW}_{j,i})$ is shown in Fig. 4.16. In this case we see that the structure does not have one large hub. The most connected vertex is SARCOIDOSIS with 11 links, followed by CRF, LIVER, LYMPHOCYTE and T-CELL RECEPTOR, all with 6 links. We also observe a branch rich in peptides which expands from CRF, showing an interesting mixture with some important diseases.


Figure 4.15: Global structure of the average random walk distances

4.8 Summary

In this Chapter I introduced a real-world application of the semantic graph-theoretical framework, linking the computational linguistic approach and the instruments for analysing a network presented in the previous chapters, and I applied those to a sample data set derived from titles and abstracts of roughly two hundred thousand biomedical articles within the PubMed database. The main purpose of this analysis was to validate both the methodology and the preliminary results against a biomedically induced graph and to create a model that allows biomedical inferences to be drawn. I have described in detail the techniques associated with constructing a network whose nodes are biomedical concepts and whose edges are co-occurrences of biological concepts, tagged with a weight representing a quantitative semantic similarity measure.


Figure 4.16: Global structure of the maximum random walk distance

Several methods were analysed to isolate such co-occurrences of pairs of concepts (keywords) within a sentence, measuring the frequency with which two biomedical concepts appear within the same phrase, and the statistical distribution of the co-occurrences across the entire data set was analysed. I have also shown alternative measures of co-occurrence, looking at both fine-grained and coarse-grained analyses, and investigated the nature of the distribution of co-occurrences associated with each measure.

Following the construction of a semantic network and the preliminary statistical analysis of its constituent elements, I introduced the topic of the inferential path, which translates into the biological mechanism studied in this example. This problem is tackled through the application of shortest paths against a semantic similarity matrix. To this end I applied the concepts seen in Chapter 3 about semantic similarities and defined here a dissimilarity measure. The problem of identifying a biological path was introduced in section 4.7 with the Random Walk approach, where a stochastic measure is defined and applied to the sample data set under analysis, and I showed how a biological mechanism is derived by means of a comprehensive distance over the whole 'biological network' under examination.

Finally, a description of the network approach to information filtering is given for this specific biological example.

5

A further step into the biological model

5.1 Introduction

In this Chapter I apply the semantic graph-theoretical model presented in the previous chapter to an extended biomedical data set and show how this can efficiently represent a framework for researching and repurposing existing drugs towards diseases for which they were not initially intended. Indeed, this large-scale application has been implemented within the research models of a biotechnology firm based in Switzerland¹, whose significant biological results have been published in a peer-reviewed article [1].

Leveraging developments in Computational Linguistics and Graph Theory, a methodology is defined to build a graph representation of knowledge, which is automatically analysed to discover hidden relations between any drug and any disease: these relations are specific paths among the biomedical entities of the graph, representing possible Modes of Action for any given pharmacological compound. I will propose a measure for the likelihood of these paths based on a stochastic process on

¹ Therametrics SA - www.therametrics.com

the graph; this measure depends on the abundance of indirect paths between a peptide and a disease, rather than solely on the strength of the shortest path connecting them. Finally, I will provide real-world examples, showing how the method successfully retrieves known pathophysiological Modes of Action and finds new ones by meaningfully selecting and aggregating contributions from known bio-molecular interactions. Applications of this methodology are presented, and they prove the efficacy of the method for selecting drugs as treatment options for rare diseases.

(Some of the results described in this Chapter are thoroughly represented in: Ruggero Gramatica et al., Graph Theory Enables Drug Repurposing - How a Mathematical Model Can Drive the Discovery of Hidden Mechanisms of Action [1]).

5.1.1 Biomedical corpus

The corpus of unstructured data that has been used for this biomedical modelling comprises 3 million abstracts of papers dealing with biomedical research taken from PubMed. The aim of the research is to deduce Mechanisms of Action (MoA) that might indicate that a particular biological compound (a peptide in this application) can be exploited as a cure for a rare disease. The dictionary comprises 1606 concepts: 127 peptides, 300 rare diseases and 1179 other biological entities (i.e. organs, proteins, receptors, enzymes, hormones). Each concept has been assigned to one category; this task is generally straightforward, with the only exception being a certain degree of arbitrariness in some cases when distinguishing between a protein and a receptor. As a matter of fact, all receptors are proteins: they are singled out to stress the fact that they appear on cell surfaces and participate in cell signalling.

5.1.2 Biomedical background - Why Linguistics and Network Theory

In pharmaceutical research, drug repurposing means the redirection of clinically advanced or marketed products towards diseases other than their initially intended indications. A significant advantage of repurposed drugs is their demonstrated clinical pharmacological efficacy and safety profile. The hypothesis for drug repurposing is based on the drugs' side-effect profiles, indicating interaction with more than one cellular target. These pathway interactions open up the opportunity to exploit existing medicines towards other diseases. Extensive data sets describing drug effects have been published globally, resulting in a huge amount of information publicly available in large on-line collections of biomedical publications such as PubMed (http://www.ncbi.nlm.nih.gov/pubmed/). This is an opportunity for literature-based scientific discovery; see [183–185]. However, important pieces of information regarding chemical substances, biological processes and pathway interactions are scattered between publications from different communities of scientists, who are not always mutually aware of their findings. In order to generate a working hypothesis from such a body of literature, a researcher would need to read thoroughly all the relevant publications and to pick among them the relevant items of information. Search engines help scientists in this endeavour, but are unable to semantically aggregate information from different sources, leaving all the initiative to researchers; complex relation-focused and graph-like representations (ontologies) have been extensively produced and used to fill the gap since their introduction for the Semantic Web [209, 210]. Yet ontologies need to be man-made and they are difficult to integrate with each other and to maintain [211].

As introduced in Chapter 3, here we propose an approach to literature-based research based on the distributional hypothesis of linguistic theory [212, 213] - whose analysis relates the statistical properties of word associations to the intrinsic meaning of a concept - and network theory [214, 215] - a collection of versatile mathematical tools for representing interrelated concepts and analysing the structure of their connections. The main objective described in this Chapter is to provide a methodology for:

- capturing the essential entities occurring in a variety of publications and connecting them into a graph whenever they co-occur in a given sentence;

- creating a network representing certain biomedical knowledge whose elements are identified and correlated with one another;

- analysing the knowledge graph thus created in order to identify and rank statistically relevant indirect connections among prospect medicines and diseases.

We show that, with a suitable set of concepts specifically compiled in a dictionary, the linked biochemical entities in the network can be connected along paths that mimic a chain of reasoning and lead to prospect inferences about the mechanism of action of a chemical substance in the pathophysiology of a disease. The current Chapter elaborates a method to rank the relevance of the inferences, introducing a measure based on a stochastic process (random walk) defined on the graph: this measure takes into account all paths connecting two concepts and uses the abundance and redundancy of these paths, together with their weights, as a measure of the strength of the overall relation between the concepts. The proposed method consists of two steps: in the first step biomedical papers are collected and submitted to semantic analysis in order to retrieve co-occurrences and compute their similarity; in the second step the knowledge graph, which has been implicitly defined, is analysed to find paths between peptides and rare diseases using techniques derived from the network theory of complex systems.

5.1.3 Information retrieval and construction of the knowledge graph

In Chapter 3 we have extensively described the powerful techniques deriving from semantic analysis and natural language processing. We have seen that in the field of linguistics it is commonly accepted that the meaning of a word must be inferred by examining its (co-)occurrences over large corpora of text. As previously discussed, adopting this perspective [216, 217], one can say that the meaning of a word ultimately depends on the words it mostly goes along with (cf. the Distributional Hypothesis). This approach suggests the assumption that concepts occurring in the same unit of text are in some way semantically related. The general idea also shows that there is a correlation between distributional similarity and meaning similarity, which allows exploiting the former in order to derive the latter.

The network representation that is discussed in this Thesis provides new depth to the original linguistic idea because it suggests focusing not only on the relations of a word with those it materially co-occurs with (i.e. its nearest neighbours in the graph) but with every other word in the network, even those indirectly connected. The meaning of a word is suddenly revealed as something depending on the whole and not on the word itself: it is not farfetched to mirror this behaviour in the context of complex systems, where we recognise local versus global properties; e.g. the meaning of a word is an emergent property of the language [218, 219]. Let us note that the co-occurrence linguistic technique is nowadays a common method to find relationships between biomedical concepts; co-occurrence methods are commonly used to discover new and hidden relations, following the seminal work of Swanson [220, 221] and other more detailed works [210, 222]. Some authors (e.g. see [223, 224]) use networks to map specific biomedical entities such as protein-protein interactions, gene regulatory events and links between proteins and phosphorylation or gene interactions. However, our aim is to build and use a co-occurrence network of biomedical concepts to produce inferences that are new hypotheses for drug repurposing. The key idea exploited here is that hopping through this knowledge network and drawing a path between any two non-adjacent concepts can be interpreted as suggesting a possible statement that has never actually been uttered but that can implicitly carry a new and correct idea.

Let us start here by describing in detail how our knowledge network of peptides, related biological processes and rare diseases is built (see Fig. 5.2). Every abstract of the corpus of papers has been broken down into its constituent sentences. Entity recognition has been carried out on every sentence, following a dictionary-based approach [217, 221]. Specifically, the dictionary was built by enriching each item of our biomedical item list with a set of acronyms, synonyms and other identifying phrases gathered from MeSH (Medical Subject Headings), Orpha.net and the 'cope with cytokines' web site. Of course, different concepts may share some of the identifying expressions. This is the polysemy problem, i.e. the capacity for a word or a phrase to have multiple meanings, which leads to the necessity of disambiguation [221, 225]. This is a very complex problem in general and, to tackle disambiguation, we employed a version of the Lesk algorithm [226]. Whenever the Lesk disambiguation fails, we have chosen to keep both the possible concepts: this option reduces precision but maximizes recall - i.e. the quantity of relevant concepts that are retrieved. Many other errors in the detection of co-occurrences arise beyond the ones due to failed disambiguation: a sentence boundary may be misplaced, one of the occurrences may be a false positive, or the occurrences may be just part of a list (and therefore not semantically related). It is expected, though, that as more and more papers are analysed the meaningful co-occurrences will outgrow the spurious ones: in fact, 'real' co-occurrences are repeated consistently as more and more literature is considered, while spurious ones become statistically insignificant because the same concept is linked randomly to a great number of other concepts. In a figurative manner, we may think of a noise in the co-occurrence detection that becomes negligible as a large number of papers is considered.
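To make the extraction step concrete, the following is a minimal sketch of dictionary-based entity recognition and sentence-level co-occurrence counting. It is not the production pipeline used in this work: the toy dictionary, the regular-expression matching and the naive sentence splitting are illustrative assumptions, and no disambiguation is performed.

```python
import itertools
import re
from collections import Counter

# Toy dictionary: each concept maps to the synonyms and acronyms that
# identify it in text (entries are illustrative, not the real dictionary).
DICTIONARY = {
    "VIP": ["vasoactive intestinal peptide", "vip", "aviptadil"],
    "LUNG": ["lung", "lungs", "pulmonary"],
    "LAM": ["lymphangioleiomyomatosis", "lam"],
}

def concepts_in_sentence(sentence):
    """Return the set of dictionary concepts whose synonyms occur in the sentence."""
    text = sentence.lower()
    return {
        concept
        for concept, synonyms in DICTIONARY.items()
        if any(re.search(r"\b" + re.escape(s) + r"\b", text) for s in synonyms)
    }

def cooccurrence_counts(abstracts):
    """Count sentence-level co-occurrences of concept pairs over a corpus."""
    counts = Counter()
    for abstract in abstracts:
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):  # naive splitting
            concepts = sorted(concepts_in_sentence(sentence))
            counts.update(itertools.combinations(concepts, 2))  # pairwise links
    return counts

corpus = ["VIP is expressed in the lung. Aviptadil binds pulmonary receptors."]
print(cooccurrence_counts(corpus))  # Counter({('LUNG', 'VIP'): 2})
```

The returned counter is exactly the set of weighted edges from which the co-occurrence network described below is assembled.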

We thus obtain a co-occurrence network (Fig. 5.3a) where the biomedical concepts of our dictionary are the nodes and the co-occurrence frequency is the weight of the edges. The resulting network is sparse, with a small number of links (158,428) compared to a complete graph with N(N − 1)/2 links (12.7%), but, nonetheless, only 30 concepts are not connected to the giant component of the network, which thus comprises 1576 nodes (98.13% of the total). The diameter of this network (i.e. the maximum distance between any two nodes) is D = 4, with an average path length (i.e. average distance between any two nodes) of 1.95 [214, 215]. The network is a small-world scale-free network showing a distinct fat-tailed power law degree distribution ($\propto x^{-\gamma}$ with $\gamma = 0.55$) with an average degree of 200.92. The link weight distribution is again a power law distribution, but with a considerably slimmer tail (Fig. 5.1 a - b). These features are expected from previous studies on word distribution in a text corpus (Zipf's law [181, 189]), but the fatness of the tail is much greater in our case: we can take it as an indication that the synonym resolution increases the occurrence probability of concepts with many names more strongly than that of concepts with fewer names. It is observed that the graph contains hubs interpreted as physiological processes typical of diseases (e.g. inflammation, proliferation, necrosis), immune system-related items (e.g. white blood cells, cytokines) and the major organs - especially the ones dealing with the chemical elaboration of drugs (e.g. kidney, liver). A number of direct peptide - disease connections are present, such as ANGIOTENSIN - SARCOIDOSIS or ANGIOTENSIN - DIABETIC NEPHROPATHY (see Fig. 5.3b). The relations between those peptides and diseases are already known, as we expected on the ground that they appear together in a predicate. Indeed, Angiotensin is known to worsen Sarcoidosis symptoms, while it is of aid in diabetic nephropathy. These features are interpreted as positive feedback on the meaningfulness of the knowledge graph.
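The structural statistics quoted above (giant component size, diameter, average path length, average degree) can be reproduced from the co-occurrence counts with standard graph tooling. A minimal networkx sketch follows, assuming the counter produced earlier; distances here are unweighted hop counts, matching the definitions in the text.

```python
import networkx as nx

def network_summary(counts):
    """Basic structural statistics of the co-occurrence network."""
    G = nx.Graph()
    for (u, v), w in counts.items():
        G.add_edge(u, v, weight=w)
    # Restrict path-based statistics to the giant connected component.
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return {
        "nodes": G.number_of_nodes(),
        "links": G.number_of_edges(),
        "giant_component_nodes": giant.number_of_nodes(),
        "diameter": nx.diameter(giant),                          # D
        "avg_path_length": nx.average_shortest_path_length(giant),
        "avg_degree": 2.0 * G.number_of_edges() / G.number_of_nodes(),
    }
```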

Figure 5.1: Centrality measures for the sample under examination

5.1.4 Analysis of the Knowledge Graph

Once the Knowledge Graph is built, we are in the position to deal with its mathematical properties in order to highlight new relations between a peptide and a rare disease. We search for indirect relations in the network (Fig. 5.3a) and therefore for paths [124, 188] between a peptide and a rare disease (Fig. 5.3b). Since all nodes in the network are connected, these paths exist: the challenge is to rank them (in order to find the most significant ones) and to explore and choose those that suggest non-trivial inferences. Shorter paths must be considered more relevant, as more steps introduce new levels of indirection and magnify the effects of randomness and noise. Yet the paths cannot be too short, because they must be 'verbose' enough to suggest a rationale indicating the biological Mechanism of Action (MoA), i.e. a specific biochemical interaction through which a drug substance produces its pharmacological effect amongst molecular targets like cell receptors, proteins or enzymes; in other words, the MoA explains why and how a drug substance works.

Figure 5.2: Conceptual outline of the knowledge graph building process. (a) Every document is split into its constituent sentences and each of them is scanned to identify expressions registered in the dictionary. In the figure, two sentences are highlighted and the matching expressions are enclosed in coloured boxes. Every one of these expressions is associated with a concept in the dictionary. (b) The concepts co-occurring in a sentence are connected pairwise. A sentence is therefore abstracted as a complete graph where the occurring concepts are the nodes and a single co-occurrence is a link. The weight of a link is increased if more instances of the same co-occurrence are present. (c) The sentence graphs are then merged in such a way that each node (concept) appears only once in the graph. In the figure it is evident that the LAM node (abbreviation for Lymphangioleiomyomatosis - a rare disease) appears in every graph and the LUNG node in two of them. (d) The result of the merging is a new graph - which is no longer complete - where the weight of a link is associated with the frequency of the same co-occurrence.

Specifically, when dealing with peptides, the MoA that we aim to replicate is the one where a peptide binds to its specific receptors, thus activating or modulating a physiological process involved in the disease. To achieve such characteristics, we consider specific interactions (links) among nodes, filtering out unwanted information. For instance, a peptide may be connected to any node, but since we look for mechanisms of action, only links in the form peptide - cell receptor are allowed and therefore considered in the graph. Similarly, a receptor can either be involved in a pathway or influence a biological process directly, thus only links in the form cell receptor - process or cell receptor - protein are allowed. To this purpose, every item in our dictionary is assigned to one of the following categories: AMINOACID, BACTERIA, CELL, DISEASE, DRUG, ENZYME, GENE, HORMONE, NUCLEOTIDE, ORGAN, OTHER, PATHWAY, PEPTIDE, PROCESS, PROTEIN, RECEPTOR, VITAMIN.
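The category-based filter just described can be expressed as a whitelist of admissible link types. Below is a minimal sketch under the assumption that the allowed pairs are the ones named in the text (peptide - receptor, receptor - pathway, receptor - process, receptor - protein), plus a hypothetical process - disease rule so that paths can terminate on a disease; a complete rule set would enumerate every admissible category pair.

```python
# Category assigned to each dictionary concept (illustrative entries only).
CATEGORY = {
    "VIP": "PEPTIDE",
    "VIPR1": "RECEPTOR",
    "INFLAMMATION": "PROCESS",
    "SARCOIDOSIS": "DISEASE",
}

# Whitelist of admissible link types; edges of any other type are discarded.
ALLOWED = {
    frozenset({"PEPTIDE", "RECEPTOR"}),
    frozenset({"RECEPTOR", "PATHWAY"}),
    frozenset({"RECEPTOR", "PROCESS"}),
    frozenset({"RECEPTOR", "PROTEIN"}),
    frozenset({"PROCESS", "DISEASE"}),  # assumption: lets paths reach diseases
}

def filter_edges(weighted_edges):
    """Keep only edges whose endpoint categories form an allowed link type."""
    return {
        (u, v): w
        for (u, v), w in weighted_edges.items()
        if frozenset({CATEGORY.get(u, "OTHER"), CATEGORY.get(v, "OTHER")}) in ALLOWED
    }
```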

From the mathematical perspective, co-occurrences define the coefficients of the similarity matrix A representing the weighted graph. Through a suitable normalization of A we are able to find a probabilistic interpretation for the link weights. Specifically, posing:

$$w_{i,j} = \frac{a_{ij}}{\sum_k a_{ik}} \;, \qquad (5.1)$$

where $a_{ij}$ is the $i,j$ element of the similarity matrix, which is zero if the vertices $i, j$ are not directly connected and equal to the edge weight otherwise [227].

Therefore, as seen in the previous chapter, the components $w_{i,j}$ can be interpreted as the conditional probability $p(j|i)$ of finding concept $j$ in a sentence containing concept $i$. Since the coefficients of the matrix W with elements $w_{i,j}$ are in the range (0, 1], we can also introduce a dissimilarity measure:

$$d_{i,j} = -\log(w_{i,j}) \;, \qquad (5.2)$$

which is correctly defined in the range $d_{i,j} \in [0, \infty)$ and, as opposed to the standard interpretation of weights, returns larger values for smaller similarities.

This representation of the distance allows immediate application of the available algorithms for computing shortest paths (see [22]). With this definition, the shortest path between any two given nodes i and j represents the most probable path (and therefore, in our interpretation, the most probable MoA) connecting them. In fact, the shortest path is the one $\pi_{ij} = i\,v_1 \ldots v_k\,j$ that minimizes the total distance:

$$D(\pi_{ij}) = d(i, v_1) + d(v_1, v_2) + \ldots + d(v_k, j) \;; \qquad (5.3)$$

therefore, we have:

$$D(\pi_{ij}) = -\log w_{i v_1} - \ldots - \log w_{v_k j} = -\log \Big( \prod_{r,s \in \pi_{ij}} w_{r,s} \Big) \;. \qquad (5.4)$$

Since the $w_{ij}$ are conditional probabilities, the above is a product of conditional probabilities (a Markov chain); minimizing the total distance therefore maximizes the conditional probability associated with the shortest path.
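Since the shortest path under $d_{i,j} = -\log(w_{i,j})$ is the maximum-probability chain, it can be found with any standard Dijkstra implementation. A minimal sketch with networkx follows, assuming a dense similarity matrix A and a list of concept labels; function and variable names are illustrative.

```python
import numpy as np
import networkx as nx

def most_probable_path(A, labels, source, target):
    """Shortest path under d_ij = -log(w_ij), i.e. the most probable MoA chain."""
    W = A / A.sum(axis=1, keepdims=True)          # Eq. (5.1): row normalisation
    G = nx.DiGraph()                              # W is not symmetric in general
    for i, u in enumerate(labels):
        for j, v in enumerate(labels):
            if i != j and W[i, j] > 0:
                G.add_edge(u, v, dist=-np.log(W[i, j]))   # Eq. (5.2)
    # Minimising the sum of -log(w) maximises the product of the w's (Eq. 5.4).
    return nx.dijkstra_path(G, source, target, weight="dist")
```

A call such as most_probable_path(A, labels, "VIP", "SARCOIDOSIS") would return the chain of concepts ranked most probable under this model.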

5.1.5 Ranking paths using Random Walk

We have seen that shortest paths maximize the probability of a single MoA, but strong indirect connections between a given peptide and a given disease may also arise from a set of paths which are smaller in weight but that contribute in larger numbers. We have therefore devised a different ranking algorithm for a peptide-disease correlation that considers all paths connecting the two concepts and uses the abundance and redundancy of these paths, together with their weights, as a measure of the strength of the overall relation between the concepts.

Figure 5.3: (a) This figure shows a version of the graph - simplified for illustration purposes - built focusing on 300 concepts and 200,000 documents. (b) This figure shows three automatically retrieved and meaningful paths, identifying three - out of five - prospect candidate peptides for sarcoidosis. The paths are depicted in a further simplified version of the graph, obtained from the first one by filtering out nodes not relevant to the paths.

This can be achieved by measuring the average number of time-steps required to go from one vertex to the other in the network, assuming that a walker is moving at random and that at each discrete time-step it jumps from a vertex to one of its neighbours with a probability which depends on the number of available links and on their weights. This random walker produces a distance that is a function of both the length and the abundance of paths [228–230].

Intuitively, imagine two nodes connected by one short (one-step) path and many longer ones. A random walker trying the route many times will tread the longer paths more often, therefore perceiving a 'long' distance. Instead, if the end points are connected by a lot of medium-sized paths, the walker will tread those most of the time, thus perceiving a distance shorter than the previous one. A common-world example conveys this idea: imagine a drunkard trying to go home. He is likely to make many mistakes at the crossroads, effectively selecting the next lane at random. He is more likely to get home sooner if many roads converge to his destination than if only a short one goes there and the others lead astray. The random walk distances can be computed by pure algebraic means. The computation is carried out by defining a vector where each component is the likelihood that at a given time a random walker is on a given node. The step-by-step evolution of this vector is a representation of the shifting distribution of these walkers over the nodes in their random wandering. The probability to walk from vertex i to vertex j is defined in random walk theory by the transfer matrix P, computed from the similarity matrix A with the formula:

$$p_{i,j} = \frac{a_{ij}}{\sum_k a_{ik}} \;, \qquad (5.5)$$

with $i, j = 1, \ldots, N$, which is exactly the matrix we have previously denoted by W. It has been described in the previous chapters (see also [39]) that the random walk distance between two nodes i and j is given by:

$$d^{RW}_{i,j} = \sum_{k=1}^{N} \left[ (I - B(j))^{-1} \right]_{ik} \;, \qquad (5.6)$$

where I is the identity matrix and B(j) is a square matrix identical to P, having posed $B(j)_{i,j} = 0$ for any $i \in [1, N]$ - as seen in section 4.6.2. The random walk distance built this way is non-symmetric, but for our purposes we symmetrise it by taking the average of the two directions:

$$d^{sym\,RW}_{i,j} = \frac{d^{RW}_{i,j} + d^{RW}_{j,i}}{2} \;.$$

This distance defines an implicit ranking measure for each couple of distinct nodes, and therefore between any peptide-disease couple. Such a measure can be interpreted as the probability of finding that path, and thus the MoA, within the document base.
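Equations (5.5)-(5.6) and the symmetrisation translate directly into a few lines of linear algebra. The following is a minimal numpy sketch, assuming a dense similarity matrix small enough to invert directly (an iterative solver would be needed at the full scale of the knowledge graph); names are illustrative.

```python
import numpy as np

def random_walk_distance(A, j):
    """Average number of steps for a walker to first reach node j (Eq. 5.6).

    Row-normalising A gives the transfer matrix P of Eq. (5.5); B(j) is P
    with column j zeroed, so any walk entering j is suppressed.  The row sums
    of (I - B(j))^{-1} = I + B + B^2 + ... then accumulate the expected number
    of steps taken before absorption at j, for every starting node i.
    """
    P = A / A.sum(axis=1, keepdims=True)       # Eq. (5.5); assumes no isolated rows
    B = P.copy()
    B[:, j] = 0.0                              # make j absorbing
    F = np.linalg.inv(np.eye(A.shape[0]) - B)  # fundamental matrix
    return F.sum(axis=1)                       # d_RW(i, j) for all sources i

def symmetric_rw_distance(A, i, j):
    """Symmetrised ranking distance: the average of the two directions."""
    return 0.5 * (random_walk_distance(A, j)[i] + random_walk_distance(A, i)[j])
```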

5.1.6 Results

In this section we show four biological paths, each representing a Mechanism of Action produced by the methodology described earlier in this chapter, with regard to:

- the granulomatous disease Sarcoidosis and its pulmonary pathology, and

- Imatinib, a targeted-therapy agent for Creutzfeldt-Jakob disease.

Sarcoidosis is a disease in which abnormal collections of chronic inflammatory cells form as nodules (granulomas) in multiple organs. Sarcoidosis is present at various levels of severity in all ethnic and racial groups and is mainly caused by environmental agents in people with higher genetic sensitivity. The disease is a chronic inflammatory disease that primarily affects the lungs but can affect almost all organs of the body. Sarcoidosis is a complex disease displaying incorrect functionalities within immune cells, cytokines, and antigenic reactions; [231]. Fig. 5.4 shows a subgraph of the knowledge network comprising the concepts related to Sarcoidosis. We were interested in investigating which peptides may show meaningful paths for treating Sarcoidosis. Therefore a number of rationales have been obtained from a pool of peptides against sarcoid pathologies, and the most relevant findings are listed below, ranked according to the random walk distance:

1. VIP - VIPR1 - INFLAMMATION - SARCOIDOSIS

2. α-MSH - HGFR - INFECTION - SARCOIDOSIS

3. CNP - NPRB - GUANYLIN CYCLASE - INFLAMMATION - SARCOIDOSIS

5.1.7 The Path VIP - SARCOIDOSIS

Vasoactive Intestinal Peptide - VIP (also known as Aviptadil) - is an endogenous human peptide. It is predominantly localized in the lungs, where it binds specific receptors (VPAC-1, VPAC-2), which transform the signal into an increased production of intracellular cyclic adenosine monophosphate (cyclic AMP or cAMP), as well as into the inhibition of translocation of NF-κB from the cytoplasm into the nucleus. This process regulates the production of various cytokines responsible for the inflammatory reaction, such as TNF-α. Hence, VIP is responsible for preventing or attenuating a wide variety of exaggerated pro-inflammatory activities [232]. The path in Fig. 5.5 shows that VIP affects the inflammation processes related to Sarcoidosis.

Figure 5.4: A portion of the knowledge network showing the neighborhood of Sarcoidosis. The figure is intended as a bird's-eye view of the entities the system detected as related to Sarcoidosis.

Figure 5.5: The VIP - SARCOIDOSIS path and other closely related concepts.

The scientific evidence clearly suggests VIP as a potential treatment option for Sarcoidosis: the system has been able to retrieve the main receptor of VIP and its relevance in the inflammation process.

5.1.8 The Path α-MSH - SARCOIDOSIS

α-Melanocyte Stimulating Hormone (α-MSH) is an endogenous peptide originally described as stimulating melanogenesis, mainly for the pigmentation of the skin. Later it gained roles in feeding behaviour, sexual activity, immune responses, inflammation and fibrosis. Upon binding to its specific cell surface receptors it increases production of cAMP in the target cells and triggers four signalling pathways leading to the disruption of the transcription of several pro-inflammatory mediator genes; [232]. In addition, α-MSH also regulates the MET proto-oncogene expression in both melanoma cells and normal human melanocytes. The MET proto-oncogene encodes the Hepatocyte Growth Factor Receptor (HGFR), which is involved in melanocyte growth and melanoma development; [233]. There is evidence of an interrelation between Epstein-Barr Virus (EBV) infection and MET proto-oncogene expression, and to date several infectious agents have been suggested to have an implication as causes of Sarcoidosis. A role for a transmissible agent is also suggested by the finding of granulomatous inflammation in patients without Sarcoidosis who received heart transplantation from donors who had Sarcoidosis; [234, 235]. The system sees both these processes (as apparent from Fig. 5.6), assigning a better ranking to the second one. α-MSH is another candidate for the treatment of sarcoid pathology due to this double action.

5.1.9 The Path CNP - SARCOIDOSIS

CNP (C-type Natriuretic Peptide) is a human peptide which elicits a number of vascular, renal, and endocrine activities, regulating blood pressure and extracellular fluid volume. When CNP binds to its receptor, NPRB, on the cell surface², it activates cell signalling through a Guanyl cyclase that increases the intracellular cGMP level, activating specific pathways that ultimately modify cellular functions.

² See the description of Mechanisms of Action given earlier in this Chapter.

Figure 5.6: The α-MSH - SARCOIDOSIS path and other closely related concepts

cGMP (cyclic guanosine monophosphate) is known for its potent vasodilatory action in pulmonary vessels. Depending on the tissues involved, however, some of its effects are directly opposite to those of cAMP, which is a potent inhibitor of proinflammatory tumor necrosis factor (TNF-α) synthesis; [236]. The inference subtended by the path in Fig. 5.7 is sound and correctly traces a biological process. Yet CNP is not considered a treatment option for Sarcoidosis because of its potentially negative side-effect profile, due to its systemic vasodilatory characteristics.

5.1.10 The Path IMATINIB - CREUTZFELDT-JAKOB Disease

Imatinib (commercialized under the name GLEEVEC) is a rationally designed pyridylpyrimidine derivative, and a highly potent and selective competitive tyrosine kinase inhibitor, especially effective in the inhibition of the kinases c-Abl (Abelson proto-oncogene), c-kit, and PDGF-R (platelet-derived growth factor receptor); [237, 238].

Figure 5.7: The CNP - SARCOIDOSIS path and other closely related concepts

These kinases are enzymes involved in cellular signal transduction processes, whose dysregulation may lead to malfunctioning of cells and disease processes, as exemplified in a variety of hyperproliferative disorders and cancers. Imatinib has received regulatory approval for chronic myelogenous leukemia (CML), gastrointestinal stromal tumors (GISTs), aggressive systemic mastocytosis (ASM), hypereosinophilic syndrome (HES), chronic eosinophilic leukemia (CEL), dermatofibrosarcoma protuberans, and Acute Lymphoblastic Leukemia (ALL). Exploiting our methodology, we looked for rationales for the redirection of Imatinib; on the basis of the results of the stochastic measure, the system indicates the neurodegenerative transmissible spongiform encephalopathies - exemplified by Creutzfeldt-Jakob disease (CJD) - as promising targets for this drug. Transmissible spongiform encephalopathies are caused by the aberrant metabolism of the prion protein (PrP). Prions are seemingly infectious agents without a nucleic acid genome. Prion diseases belong to the group of neurodegenerative diseases acquired by exogenous infection and have a long incubation period followed by a clinical course of progressive dementia, myoclonal ataxia, delirious psychomotor excitement, and neuronal death; [239].

Figure 5.8: Imatinib (GLEEVEC) - Creutzfeldt-Jakob Disease path and other closely related concepts.

Moreover, the system selects the path (see Fig. 5.8) that indicates the kinase c-Abl effect on cell apoptosis as the key MoA for redirecting Imatinib towards CJD. In fact, the c-Abl tyrosine kinase is found to be over-activated in neurodegenerative diseases like Alzheimer's disease and Parkinson's disease, and overexpression of active c-Abl in adult mouse neurons results in neurodegeneration and neuroinflammation; [240]. There is clear experimental evidence that activation of c-Abl leads to neuronal cell death and neuronal apoptosis in experimental Creutzfeldt-Jakob disease; [241]. Imatinib has been shown to prevent c-Abl kinase induced apoptosis in animal models of neurodegeneration; [242]. Finally, Imatinib was shown to clear misfolded infectious protein from prion-infected cells in a time- and dose-dependent manner without influencing the normal biological features of the healthy PrP, and Imatinib activated the lysosomal degradation of pre-existing misfolded PrP; [243]. This provides a sound rationale for the proposed redirection.

The system also indicated Imatinib as a treatment option for pulmonary arterial hypertension (PAH), via its potent inhibitory effect on the PDGF Receptor (PDGF-R). For the PAH indication, however, the drug is not approved.

5.1.11 Acknowledgements

The biomedical application of the framework investigated and previously described has been possible thanks to the important contribution of Dr. Dorian Bevec, Professor of Molecular Biology at the University of Vienna and Chief Scientific Officer of Therametrics SA, whose expertise in biochemical interactions at the molecular level helped to map the results of the graph-theoretical model analysis into a pathophysiology analytical research framework. All the results presented in this work have been extensively validated by Prof. Bevec and his team at Therametrics SA - Stans, Switzerland.

5.1.12 Summary

A double-layer methodology is presented, consisting of semantic analysis leveraging developments in Computational Linguistics and of graph analysis exploiting tools from Graph Theory and Stochastic Process Theory. This methodology has allowed the screening of more than 3 million abstracts from PubMed-published biomedical papers and the detection of relevant concepts identified by dictionary-defined expressions; concepts have been mapped as nodes of a graph, whose links are defined by co-occurrences of concepts across roughly 30 million sentences. Specifically, the pathophysiological connections between peptides and diseases have been detected in order to provide inferences for biomedical rationales for drug repurposing. The proposed methodology provides an effective instrument to detect different MoAs of peptides and drugs; though it may not capture the full detail of the MoAs, it succeeds in making them recognizable by a short chain of biomedical entities. Moreover, the graph representation of biomedical knowledge seen above produces a sound and meaningful representation of the many interrelated concepts of the biomedical discipline; such a methodology successfully allows both the validation of existing rationales and the discovery of new ones, a feat usually left to serendipity and intuition. We have translated the scientific rationales into relevant clinical trial settings and into new potential treatment options for the affected patients in Sarcoidosis. We have looked for experimental evidence of our findings on sarcoid pathologies: we have chosen the peptides VIP and α-MSH for drug repurposing. In an open clinical phase II study, we treated 20 patients with histologically proven Sarcoidosis and active disease with nebulized VIP for 4 weeks. This study is the first to show that VIP has clear, positive, immune-regulatory effects in sarcoid patients without any obvious side effects and without systemic immuno-suppression. VIP should therefore be developed as an attractive therapeutic option for patients with pulmonary Sarcoidosis; [244]. In parallel, we have initiated a clinical ex-vivo trial to test α-MSH in a sarcoid pathology. Preliminary data suggest a beneficial outcome of the experimentation (unpublished data), clearly indicating α-MSH as another potential treatment option for this pathology. Moreover, the case for Imatinib as a treatment option for Creutzfeldt-Jakob disease shows how the system is able to produce a sound scientific rationale also for non-peptide drugs, with a mechanism of action quite different from the others, thus proving a much wider applicability. The results are all the more noteworthy if the relative slimness of the dictionary is taken into account. Better representations are to be expected from more detailed and more comprehensive dictionaries. Furthermore, Graph Theory provides quite an interesting arsenal of instruments that analyse a complex network of nodes (biological and medical concepts) and highlight hidden inferences across biochemical compounds, clinical data and medical concepts.

It is worth stressing that I have provided here only a very general characterization of the knowledge network and focused on very well consolidated tools of analysis. But studies of complex network theory and related applications have greatly increased over the last decade, providing ever more subtle indicators of features and related techniques of analysis. We therefore think that our reliance on graph representation puts this method in the best position to exploit these developments and may well prove to make it mainstream in the field of information retrieval. This methodology can be applied to other fields: it can certainly be extended over broader biomedical research, transcending peptides to study other chemical compounds and also focusing on diseases other than rare ones. We think it can be applied to any field of research - even outside natural science - provided that a suitable amount of literature is available and that the main issue is the association of a great number of particular facts and observations that do not yet fit into a comprehensive scheme.

6

A semantic-graph theoretical approach for the analysis of financial and economic unstructured data

6.1 Introduction

As seen in the previous chapters, natural language processing, and in particular computational semantics, has proven to be a versatile and reliable instrument to map large quantities of unstructured textual data into a network representation. In this chapter we will go a step further by using tools from network analysis and econophysics with the objective of turning the temporal evolution of structural properties of such networks into a semantic index readily comparable to other related time series. The underlying hypothesis is that fluctuations of certain structural parameters - computed over induced networks - could be put in correspondence with fluctuations of other related phenomena. For instance, indices computed from financial news headlines - such as those of Thomson Reuters - could be compared to a financial market index, say the Dow Jones Industrial Average, in search of significant (cross-)correlations.
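As an illustration of the kind of lagged (cross-)correlation scan meant here, the following is a minimal pandas sketch, assuming two date-aligned series (a semantic index and a market index); the function name and the choice of Pearson correlation are illustrative assumptions.

```python
import pandas as pd

def lagged_cross_correlation(semantic_index: pd.Series,
                             market_index: pd.Series,
                             max_lag: int = 10) -> dict:
    """Pearson correlation between the semantic index and the market index
    shifted by -max_lag..+max_lag periods.  At a positive lag, today's
    semantic index is compared with the market's *future* values, which is
    the configuration of interest for forecasting."""
    return {
        lag: semantic_index.corr(market_index.shift(-lag))
        for lag in range(-max_lag, max_lag + 1)
    }
```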

Additionally, we could train¹ a (stochastic) mathematical model that learns to predict fluctuations of a target time series from the fluctuations of a semantic index (or some combination thereof).

The starting point of our analysis is a corpus of unstructured text organized according to some predefined temporal scale (e.g. the set of all Thomson Reuters financial news split on a week-by-week or month-by-month basis). Our algorithms strive to discover concepts of interest in the text and to compute the strength of association between any pair of said concepts, whereby similar concepts are connected in a graph-like structure. However, isolated concepts bear little information regarding the evolution of the complex system they belong to. Our approach autonomously identifies clusters of influential concepts induced by the overall graph structure. Such clusters are tracked over time, hence reflecting the dynamical nature of complex phenomena. Each cluster is assigned a numerical value, called the Semantic index, that essentially captures the local connectivity of the cluster nodes as well as their global importance. The Semantic index is computed by using tools from network analysis, econophysics and machine learning, with the intent to mirror the evolution of particular 'topics' or 'memes' over time. As a natural application, we have compared the evolution of the Semantic index with the evolution of standard financial indices, discovering significant correlations. Finally, we push forward the speculation that a combination of indices extracted through the above-mentioned analysis of complex semantic networks could be used either in addition to traditional stock market indicators or as exogenous inputs to improve prediction and forecasting of stock market events.

¹ In machine learning, training is equivalent to the problem of fitting the free parameters of a model [245].

6.2 Prior work

Despite the fact that the efficient market hypothesis² has been one of the dominant theories dealing with the dynamics of financial markets, researchers were aware that the inertia between the onset of information that could affect the markets and its impact on the markets themselves would provide an arbitrage opportunity, and started looking at formal models that could take advantage of that. News (mainly political and financial, but not only) was soon recognized by behavioral finance advocates to be one of the major influences on markets (and investors). The quest was how to turn unstructured text into features that could be used to explain and/or forecast stock market events.

One of the first approaches merging automatic textual analysis with market indicators dates back to the seminal work of Wüthrich and colleagues [247, 248]. They use data mining techniques to collect statistics over a fixed set of keywords from financial news data and train a probabilistic rule-based expert system to predict changes in the target financial index and forecast its closing value. Despite its simplicity (interesting news items were manually fed into the system and the number of terms to track was limited to approximately four hundred n-grams), the methodology achieved above-chance predictive accuracy in the short term, and the authors used it to invest profitably on the Hong Kong stock exchange.

More recent works started digging into large quantities of readily available financial data by using techniques and tools from the nascent big data movement. The vast majority of works concentrates on extracting "sentiment", or mood, from various online social media (e.g. blogs, tweets, MySpace posts, etc.) [249]. The intuition is that global mood might be positively correlated with - or even predictive of - market indicators. In their influential work, Bollen et al. extract a fixed number of mood states from daily Twitter feeds by using publicly available tools for sentiment analysis [250]. The resulting mood time series is tested for (possibly lagged) correlations against the Dow Jones Industrial Average (DJIA). A Granger causality analysis is used to assess which mood states are predictive of changes in DJIA closing values, and a non-linear model (Self-Organizing Fuzzy Neural Network) is trained on the historical financial data augmented with mood data, obtaining promising forecasting results. In a related work, the sentiment extraction is simplified by concentrating on the states of anxiety, worry and fear extracted by processing 20 million LiveJournal posts [251]. The resulting "Anxiety Index" is tested for Granger causality, revealing its predictive power in the case of the Standard & Poor's 500 index. Similar approaches that use text mining and language models to classify news stories as positive or negative and predict market trends accordingly are given in [252, 253].

While the approaches above succeeded in improving predictive accuracy, they are still based on a small number of carefully handcrafted "sentiment" features extracted from either structured data (e.g. the type of web search queries related to a stock) or (short) unstructured textual messages (e.g. tweets or news headlines) that hopefully have an effect on the chosen time series (see for instance [254-258]). By looking for explicit sentiment indicators (e.g. expressions of joy or sadness, smilies, etc.), said approaches discard the causes that oriented the global opinion towards a particular mood state, which are typically contained in aseptic messages and communiqués from various microblogging platforms or mainstream mass media. In line with this, there have been some attempts to predict prices (or stock activity in general) from financial news articles. For instance, Schumaker et al. trained a modified Support Vector Machine to predict a discrete stock price twenty minutes after a news article was released [259]. The machine was trained on selected stock quotes from the S&P500 using three different textual representation schemas that convert plain text into a vector representation. Being based on a nonlinear machine, however, their approach does not provide any explanation as to why the stock should change. In addition, they limit the analysis to breaking-news stories only, which almost certainly contain significant information. Nevertheless, the results they obtained showed a net increase in accuracy (in both trend and value prediction) with respect to classical regression-based approaches. In a more recent work, Ruiz and his colleagues related Twitter feeds referring to a number of companies (e.g. Yahoo) to stock-market events for those companies [260]. They extracted two groups of features from the unstructured tweets. Features in the first group measure the overall activity in the micro-blogging platform (e.g. number of tweets and retweets). Of particular interest for us are the features in the second group, which measure properties of a graph induced by particular user-tweet interaction patterns. Indeed, they found that some graph-based properties, such as the number of connected components or statistics of the degree distribution, are strongly correlated with stock events. A slightly different approach studies which factors most influence the trading behavior of different types of investors (government, household, etc.) in a medium-sized market [256]. The authors investigate the role of typical endogenous variables (returns and volatility) as well as exogenous variables such as the total number of daily news articles and a sentiment analysis of the news. They reached the conclusion that governmental and non-profit organizations are weakly sensitive to news, returns and volatility, while households and companies, on the contrary, are very sensitive to both endogenous and exogenous factors (with volatility and returns, on average, more relevant than the number of news articles and sentiment). Finally, financial institutions and foreign organizations fall between these two cases.

6.2.1 Our approach

In all of the approaches above, semantic analysis is either performed to extract discrete sentiment features from text or is applied to a well-defined subset of messages (tweets, posts, news feeds) that contain explicit references to concepts defined beforehand as being of interest (for instance, one could process only posts that explicitly mention Yahoo! or Google). Our approach, on the other hand, tries to uncover more subtle relations embedded in the induced semantic network. We put our focus not on single concepts or on aggregate information, but rather on the behavior of the network (or significant parts of it) as a whole. In addition, to the best of our knowledge, our approach is the first to explicitly address the temporal evolution of said network. As we shall see, we merge paradigms from computational linguistics with those from the study of complex networks to turn unstructured text into computable graph-theoretic features that reflect the underlying semantic content and that can be used to mirror a financial market indicator or to uncover hidden mechanisms of action for repurposing existing drugs, as discussed in Chapter 5 [261].

The rest of the chapter follows the steps performed in our semantic analysis and is organized as follows:

• In section 6.3 we show how to turn a large corpus of unstructured textual data into an undirected graph whereby connected concepts have statistically significant patterns of co-occurrence in the text;

• In general, the graph induced at the previous step is too large and undifferentiated to be of immediate use. In section 6.4 we use a random-walk clustering algorithm to partition the graph into distinct communities, each of which can be studied separately;

• Section 6.5 introduces the algorithm that tracks the evolution of a cluster (or multiple clusters) over time to reconstruct the dynamics of the information propagation as induced by the textual corpus under investigation;

• In section 6.6 we show how the semantic content of each cluster can be quantified by computing various graph-theoretic properties; we also perform a (cross-)correlation analysis between the time series induced by the evolving clusters and representative financial indices to assess whether one series can be used to explain or predict the evolution of the other;

• In section 6.7 we show how different series of semantic information can be merged into a unique time series that mirrors a financial index of choice and - at the same time - provides a non-technical explanation (in terms of cluster concepts) for variations in the overall trend. We present examples of our index trained against different financial market indices;

• Finally, section 6.8 summarizes the content of this chapter and outlines future work.

6.3 From plain text to semantic graphs

The starting point of our analysis is a large corpus of unstructured textual data. As detailed in Chapter 3, we start by defining a dictionary of concepts we wish to attend to. The dictionary can be either static (i.e. fixed) or dynamic. Each dictionary entry can be composed of multiple words, with the convention that single words in complex concepts are separated by underscores (e.g. new_york). In addition, we define a list of synonyms, i.e. unambiguous substitutions of words belonging to the same concept (e.g. "ny" or "nyc" is a synonym for new_york). We do not adopt tools of formal semantic analysis. Indeed, our goal is not to infer the very meaning of the words, but rather to uncover intimate relations between various concepts from a temporally evolving corpus of data. As an example, we are not interested in knowing that "Lehman Brothers" was a US investment bank, but in the fact that the concept "Lehman Brothers", in mid 2008, suddenly became strongly linked to concepts like "global crisis", "bankruptcy" and "subprime mortgage crisis", thus reshaping its global meaning. This analytical process is akin to the previously discussed distributional hypothesis, which states that words with similar meanings should occur in similar contexts (cf. 3.3.1). Concepts that co-occur above chance are linked together in a graph-like structure, which allows us to readily adopt well-established analytical tools and finally to link local and global graph measures to various socio-economic events in a coherent and mathematically sound way. The process of building a graph of concepts out of their plain co-occurrence is detailed in the following section.

6.3.1 Assessing the strength of association between concepts

The very first step in our analysis is the computation of the strength of association between the various concepts in the text being analyzed. The approach we used is akin to the concept of mutual information, which can be viewed as a measure of the quantity of information gained about whether one concept will occur by knowing whether the other concept occurs [262]. The actual implementation uses the log-likelihood ratio (LLR) [263], which is particularly suitable for processing large data sets. LLR is known to be equivalent to the above-mentioned mutual information when the probabilities involved are estimated via maximum likelihood [264].

Suppose c_1 and c_2 represent two concepts for which we wish to estimate the strength of association in a given context (be it a sentence, a paragraph or the entire text). The idea here is to assess whether the two concepts occur together more often than by chance. In addition, we would like to be able to capture rare - but significant - concept co-occurrences. Testing rigorously whether two concepts occur significantly often together is usually done via hypothesis testing [265]. Here the null hypothesis (H_0) states that the probability of two concepts occurring together (in a context) is simply the product of the individual probabilities (i.e. H_0 : p(c_1, c_2) = p(c_1)p(c_2)). In other words, we assume independence between any two concepts. As usual, the goal is to reject the null, and the classical approaches adopted in text analysis are based on the t-test or the χ²-test. However, said approaches heavily depend on the assumption of normality [266], which is usually violated precisely for the rare co-occurrences we would like to account for.

The likelihood ratio is a well-known hypothesis testing method that does not depend on the assumption of normality discussed above [263]. In essence, it provides a mathematical tool for comparing two alternative hypotheses. In the case of concept co-occurrence, we wish to compare the hypothesis that two concepts are independent (null hypothesis), and thus not likely to co-occur, against the contrasting hypothesis that they are not (alternative hypothesis). Put in formal terms, given two concepts c_1 and c_2 we have the following hypotheses [266]:

• Null Hypothesis (H_0): p(c_2|c_1) = p = p(c_2|¬c_1)

• Alternative Hypothesis (H_1): p(c_2|c_1) = p_1 ≠ p_2 = p(c_2|¬c_1)

The likelihood ratio can then be computed as:

\[
\lambda = \frac{L(H_0)}{L(H_1)} \, , \tag{6.1}
\]

where L(H_0) and L(H_1) are, respectively, the likelihood functions under the null and the alternative hypothesis. Usually, the logarithm of the above quantity is computed.

We also need to fix a distribution to model the appearance of concepts in the corpus of interest. In the case of text, the probability of observing exactly k occurrences of a concept c out of N can be modeled as the binomial distribution shown in Eq. 6.2:

\[
\mathrm{bin}(k;\, x,\, N) = \binom{N}{k} x^k (1-x)^{N-k} \, , \tag{6.2}
\]

where x is the a priori probability of observing the concept c.

Let n_1, n_2 and n_{1,2} be the total number of occurrences of c_1 (concept c_1 alone), c_2 (concept c_2 alone) and of the pair (c_1, c_2) together in the corpus of size N. We use the maximum likelihood estimates of p, p_1 and p_2 in the hypotheses above: p = n_2/N, p_1 = n_{1,2}/n_1, p_2 = (n_2 − n_{1,2})/(N − n_1) [266]. Then the likelihood ratio (or, equivalently, the log-likelihood ratio, LLR) can be computed as (see [263, 266] for a full derivation):

\[
\log\lambda = \log\frac{L(H_0)}{L(H_1)} = \log\frac{\mathrm{bin}(n_{1,2};\, p,\, n_1)\;\mathrm{bin}(n_2 - n_{1,2};\, p,\, N - n_1)}{\mathrm{bin}(n_{1,2};\, p_1,\, n_1)\;\mathrm{bin}(n_2 - n_{1,2};\, p_2,\, N - n_1)} \, . \tag{6.3}
\]

Intuitively, what we are looking for are concepts that exhibit a significant co-occurrence pattern. In other words, we are interested neither in rather obvious co-occurring concepts, such as "Morgan Stanley" and "investment bank", nor in co-occurrences likely formed by chance simply because the constituent concepts are among the most frequent ones. The LLR score in Eq. 6.3 captures exactly this intuition. It computes the ratio between two hypotheses: the one that the two concepts under investigation are independent, and the competing one that they are indeed statistically related. It is important to stress that the quantity −2 log λ can be shown to be asymptotically χ² distributed in the large-sample limit [266], permitting one to use familiar tools from statistical hypothesis testing [267]. Thus, the higher the value of −2 log λ, the more pronounced the deviation from the null hypothesis - and, for our purposes, the more associated the two concepts.
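The computation is straightforward to implement. The following is a minimal sketch in Python (the function name and the use of SciPy's binomial log-pmf are our own choices, not part of the original pipeline):

```python
import numpy as np
from scipy.stats import binom

def llr_score(n1, n2, n12, N):
    """-2 log(lambda) for the co-occurrence of two concepts (Eq. 6.3).

    n1, n2 : total occurrences of concepts c1 and c2
    n12    : number of contexts where c1 and c2 co-occur
    N      : total number of contexts in the corpus
    """
    p = n2 / N                       # H0: p(c2|c1) = p(c2|~c1) = p
    p1 = n12 / n1                    # H1: p(c2|c1) = p1
    p2 = (n2 - n12) / (N - n1)       # H1: p(c2|~c1) = p2
    log_l0 = binom.logpmf(n12, n1, p) + binom.logpmf(n2 - n12, N - n1, p)
    log_l1 = binom.logpmf(n12, n1, p1) + binom.logpmf(n2 - n12, N - n1, p2)
    return -2.0 * (log_l0 - log_l1)  # asymptotically chi^2 distributed under H0
```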

The strength of association between concepts implicitly induces a graph-like network whereby similar concepts are directly connected by edges whose weights are proportional to the above-mentioned LLR score. This paradigm shift permits us to adopt powerful tools and algorithms from graph theory to quantify semantic properties of the text being analyzed, as will be detailed in the next section.

6.4 Cluster identification

Having translated unstructured textual data into a network of similar/related concepts opens the door to sophisticated graph-theoretic analysis, as outlined in section 3.2.2. For instance, we could readily compute the set of most influential concepts according to some centrality measure, or compute common statistical properties of the network as a whole.

However, neither a single concept nor the whole network provides the level of granularity we are interested in. Here we embrace the idea that valuable information is hidden in clusters of related concepts, and that the evolution of sets of such clusters provides insights on important structural perturbations we want to keep track of. The next section provides a brief overview of the clustering problem and details the random walk clustering algorithm used in this thesis.

6.4.1 Clustering

In general, clustering refers to the problem of partitioning a set of items into distinct groups without resorting to any a priori information (such algorithms are called 'unsupervised' in typical machine learning parlance). Given a data set, clustering algorithms tend to group together elements having high mutual similarity while being dissimilar from the members of other clusters.

Unfortunately, no single definition of a cluster in graphs is universally accepted, and different algorithms tend to produce different communities (we will use the terms cluster, partition and community interchangeably). However, there are a number of desirable properties that a cluster should possess [268, 269]:

• Clusters should be connected;

• Each cluster should be dense;

• Connections from the cluster to the rest of the graph should be sparse.

Clustering algorithms can be divided into flat (producing a single partitioning) and hierarchical (producing a tree-like sequence of partitions, called a dendrogram, where each subcluster belongs to one supercluster; see Fig. 6.1 for an example) [245, 270]. Typical flat clustering algorithms include k-means and k-medoids [271] and require the number of clusters to be set beforehand. In hierarchical algorithms, a cluster partition can be obtained by choosing the number of clusters and cutting the dendrogram at the appropriate level, or otherwise by choosing the cut point that maximizes some globally defined validity measure [271]. Hierarchical clustering algorithms can be further divided into divisive and agglomerative, depending on whether the partition is refined or coarsened during each iteration, respectively. Typical agglomerative hierarchical algorithms are based on the idea of Single Linkage [272] (and versions thereof, such as Average Linkage and Complete Linkage [273]). In the specific case of networked systems, such as the ones we deal with in this work, clustering is used as a synonym for community detection. Typical algorithms in this case include the well-known Girvan-Newman method [274], which is a divisive algorithm based on betweenness centrality, and WalkTrap [275], an agglomerative algorithm based on random walks, used throughout this chapter and described in the next section. Other approaches, such as the Directed Bubble Hierarchical Tree (DBHT) [276], which exploits the topological properties of the PMFG (Planar Maximally Filtered Graph), can also be investigated in order to partition the network into communities (for further details and a comparison with other clustering algorithms, see [277]).

6.4.2 Random walk clustering

In this work we chose an agglomerative clustering procedure based on the behavior of a random walker that allows us to find community structures at different scales [275]. The intuition is simple: since most real-world complex networks are globally sparse but locally dense (our network is no exception, as we will see in the experimental part), any random walker should get "trapped" in the densely connected parts corresponding to communities. The algorithm, called WalkTrap [275], reflects this intuition by using a distance r between vertices that captures the community structure of the graph and that is computed from the information given by random walks on the graph.

WalkTrap allows us to find modules without having to fix their number in advance. Moreover, being based on random walks, it is expected to give results related to the spectral partitioning of the graph [275]. While many alternative module-finding algorithms exist, WalkTrap is often faster and simpler to implement. Being an important component of our approach, we provide more details on its functioning below.

6.4.2.1 Random walk distance

To group the vertices into communities we need to define a distance that measures how dissimilar (or similar) vertices are. As in other clustering algorithms, such a metric should be larger for vertices belonging to different communities (thus dissimilar) than for vertices belonging to the same community. As we will shortly see, it is computed from the information given by random walks on the graph.

Let us first introduce some properties of random walks on graphs on which the distance above is based. Each graph G can be fully described through its adjacency matrix A. In the unweighted case, A_{ij} = 1 if two vertices i and j are connected and 0 otherwise (in the weighted case, A_{ij} ∈ ℝ⁺). It is natural to define the degree of a node i as the number of its neighbors (including the vertex i itself): d(i) = Σ_j A_{ij}. The degree of the node and the adjacency matrix permit us to define the probability of transiting from vertex i to vertex j as P_{ij} = A_{ij}/d(i). This defines the transition matrix P of a random walk process, the powers of which define the process itself. Indeed, the probability of going from i to j through a random walk of length t can be computed as (P^t)_{ij}. In the following, we will denote this probability simply by P^t_{ij}.

It is well known from the theory of Markov chains that the probability of being on a vertex j, starting from vertex i, depends only on the degree of vertex j as the length of the walk t approaches infinity, a property known as stationarity [112]. Analogously, the probabilities of going from i to j and vice versa through a random walk of a fixed length t have a ratio that depends only on the degrees d(i) and d(j), a property known as reversibility.

A random walk of length t implicitly induces a distance between any two vertices of the graph G. The length t of the random walks must be long enough to gather sufficient information about the topology of the graph. However, t must not be too large (compared to the mixing time of the Markov chain), to avoid the stationarity problem. It is intuitive that - for a suitable value of t - vertices in the same community will have a high probability P^t_{ij}. However, a high P^t_{ij} is not a sufficient condition for i and j to be in the same community. Another interesting fact is that nodes in the same community tend to see all other vertices in the same way. Mathematically, this can be expressed as ∀k, P^t_{ik} ≈ P^t_{jk}.


By taking into account all the previous remarks, the random walk distance between two vertices can be defined as [275]:

\[
r_{ij} = \sqrt{\sum_{k=1}^{n} \frac{\left(P^t_{ik} - P^t_{jk}\right)^2}{d(k)}} \, . \tag{6.4}
\]

Thus, nodes belonging to the same community will have a smaller distance by virtue of the fact that P^t_{ik} and P^t_{jk} will be similar. It is worth noting that the distance r above is directly related to the spectral properties of the transition matrix P (see [275] for further details).

It is also possible to generalize the distance between vertices to a distance between communities [275]:

\[
r_{C_1 C_2} = \sqrt{\sum_{k=1}^{n} \frac{\left(P^t_{C_1 k} - P^t_{C_2 k}\right)^2}{d(k)}} \, , \tag{6.5}
\]

where P^t_{Cj} is the probability of going from community C to vertex j in t steps, defined as [275]:

\[
P^t_{Cj} = \frac{1}{|C|} \sum_{i \in C} P^t_{ij} \, . \tag{6.6}
\]

6.4.2.2 Community detection algorithm

Having defined a distance between any two (sets of) vertices, the problem of community detection boils down to a classical clustering problem. WalkTrap is an agglomerative approach that produces a dendrogram-like structure of communities at various scales [275]. As with other hierarchical algorithms, one can decide to produce a fixed number of clusters or search for the community partition that maximizes some global cost criterion (in community detection algorithms this is usually modularity, a function that compares the density of links within each community against the one expected in a (null) random graph model [278]). In our experiments, we decided to (empirically) fix the overall number of communities, irrespective of the obtained modularity.

The algorithm starts by placing each vertex in a separate cluster and proceeds bottom-up by iteratively merging communities based on their reciprocal distance until the last partition holds the entire graph. In detail, we start from a partition P_1 = {{v}, v ∈ V} of the graph into n communities, where each community is reduced to a single vertex. We first compute the distances between all adjacent vertices. Then this partition evolves by repeating the following operations. At each step k:

• choose two communities C_1 and C_2 in P_k according to a criterion based on the distance σ_k between the communities (detailed below);

• merge these two communities into a new community C_3 = C_1 ∪ C_2 and create the new partition P_{k+1} = (P_k \ {C_1, C_2}) ∪ {C_3};

• update the distances between communities.

After n − 1 steps, the algorithm finishes and we obtain P_n = {V}. Choosing which communities to merge at each step is done according to Ward's method [279, 280]: at each step k the two communities that minimize the mean of the squared distances between each vertex and its community are merged:

\[
\sigma_k = \frac{1}{n} \sum_{C \in P_k} \sum_{i \in C} r_{iC}^2 \, , \tag{6.7}
\]

where the sum runs over all communities C in the partition P_k at step k (the quantity r_{iC} has been defined in Eq. 6.5, with one community collapsed to the single node i). It can be shown that an efficient algorithm exists for computing this quantity. Overall, the algorithm exhibits a time complexity of O(mn²), where m and n are the number of edges and vertices, respectively. Most real-world networks, however, are sparse (m = O(n)), yielding a worst-case time complexity of O(n² log n). Regarding memory consumption, the algorithm is quadratic in the number of nodes, O(n²).

As an example, Fig. 6.1 shows the result of applying the WalkTrap clustering algorithm to a subgraph obtained by processing the Reuters news feed for September 2003. We have randomly selected 33 concepts out of ≈ 5000 (see section 6.6.1 for a description of the full dataset). Two concepts are connected if their LLR score (see Sec. 6.3) is above a fixed threshold (set to 0.99 throughout this work). The figure shows a possible "cut" of the dendrogram producing three distinct clusters, highlighted with different colors.
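In practice, WalkTrap is readily available in standard network libraries. A minimal sketch using python-igraph (the edge list is a hypothetical toy example, not taken from our dataset):

```python
import igraph as ig

# Hypothetical (concept, concept, LLR weight) triples
edges = [("lehman_brothers", "bankruptcy", 11.2),
         ("lehman_brothers", "subprime_mortgage", 9.7),
         ("bankruptcy", "subprime_mortgage", 7.1),
         ("dow_jones", "sp500", 8.4)]

g = ig.Graph.TupleList(((u, v) for u, v, _ in edges), directed=False)
g.es["weight"] = [w for _, _, w in edges]

dendrogram = g.community_walktrap(weights="weight", steps=4)  # t = 4
clusters = dendrogram.as_clustering(n=2)  # cut the dendrogram into 2 communities
for k, members in enumerate(clusters):
    print(k, [g.vs[i]["name"] for i in members])
```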

6.5 Cluster tracking

Up to now we have tacitly ignored the problem of how to process data that change over time. For instance, news is published continuously and we expect the topology of the corresponding network to change accordingly, thus reflecting the evolution of trending concepts and/or topics, and such a notion should be captured by cluster-oriented graph statistics. However, to reconstruct the dynamics of the underlying process we should be able to track clusters across multiple time steps, taking into account the fact that clusters will likely change their morphology. We will call this the "correspondence problem". In the following we present a measure that takes into account both semantic and structural properties of clusters to solve said correspondence issue.

Formally, let C^t_1, C^t_2, ..., C^t_N be the set of clusters identified at time t. Similarly, let C^{t+1}_1, C^{t+1}_2, ..., C^{t+1}_M be the clusters at time t+1, where N and M denote the number of clusters at times t and t+1, respectively. The goal is to find, for every cluster at time t, the "best matching" cluster at time t+1 according to some criterion. While at first one could think of reusing tools from the vast cluster validation literature to detect similarities between partitions [281, 282], there is a subtle difference: the partitions to compare do not refer to the same data set, which usually changes from time t to time t+1.

Figure 6.1: Example of the WalkTrap community detection algorithm on a portion of the semantic graph obtained by processing Reuters financial news for September 2003 with approximately 30 concepts: (a) semantic graph; (b) dendrogram representation of the communities obtained using WalkTrap. Vertex dimensions are proportional to their "importance" as measured by the eigenvector centrality. A possible "cut" of the dendrogram at level 3, producing three distinct clusters, is also shown.

Intuitively, we wish the cluster at time t+1 to retain the majority of the nodes present at time t so as to preserve its semantic contribution. A particularly suited measure is the Jaccard similarity coefficient (6.8) [283], which measures the extent to which two sets, say C^t and C^{t+1}, overlap:

\[
\frac{|C^t \cap C^{t+1}|}{|C^t \cup C^{t+1}|} \, . \tag{6.8}
\]

Thus, the Jaccard index would prefer clusters having the highest percentage of nodes in common. However, it does not take into account the importance of such nodes.

Here we propose an Adjusted Jaccard Index (AJI) which - all other conditions being equal - assigns a higher rank to clusters that share "important" nodes. The importance of a node can be assessed by various centrality measures (see section 3.2.2) and our index is independent of the particular centrality chosen. More formally, let n be a node and c^t(n) its centrality value at time t; then our adjusted index can be defined as:

\[
\frac{\sum_{n \in C^t \cap C^{t+1}} \max\left(c^t(n),\, c^{t+1}(n)\right)}{|C^t \cup C^{t+1}|} \, . \tag{6.9}
\]

Thus, our index is computed as the proportion of elements two clusters have in common, weighted by their importance in the network as a whole. Figure 6.2 provides an example of cluster tracking. The adjusted Jaccard index thus governs the temporal evolution of clusters: by ranking the AJI computed between the target cluster at time t and all clusters at time t+1, it is possible to track a cluster from one time step to the next. Similarly, by setting a lower threshold on the AJI, some clusters will vanish, emulating a survival-of-the-fittest behavior: only clusters bearing semantically relevant information will survive, while others will vanish over time. We will see examples of this behavior in the experimental section.
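A minimal sketch of the matching step (function names and the survival threshold value are illustrative choices of ours):

```python
def adjusted_jaccard(c_t, c_t1, cent_t, cent_t1):
    """Adjusted Jaccard Index (Eq. 6.9) between two clusters (sets of nodes).

    cent_t, cent_t1 : dicts mapping node -> centrality at times t and t+1
    """
    shared = c_t & c_t1
    weight = sum(max(cent_t.get(n, 0.0), cent_t1.get(n, 0.0)) for n in shared)
    return weight / len(c_t | c_t1)

def best_match(cluster, candidates, cent_t, cent_t1, threshold=0.1):
    """Best-matching cluster at t+1, or None if the cluster 'vanishes'."""
    best = max(candidates,
               key=lambda c: adjusted_jaccard(cluster, c, cent_t, cent_t1))
    score = adjusted_jaccard(cluster, best, cent_t, cent_t1)
    return best if score >= threshold else None
```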

Figure 6.2: An example of cluster tracking over two consecutive time steps on a synthetic example. Shown are three clusters at each time step (t and t+1). According to Eq. 6.9, cluster #3 at time t+1 (right) is the best candidate match for cluster #2 at time t (left), as the two clusters share the highest number of important nodes, where the importance of a node is assessed by one of the centrality measures (in this work, we have used the betweenness centrality for cluster tracking).

Figure 6.3: Example of the semantic index showing the evolution of two different centrality measures on 2007 Thomson Reuters news headlines: (a) random walk betweenness; (b) eigenvector centrality.

6.6 Semantic Index

Evolving clusters can be seen as a time sequence of subgraphs induced by the random walk clustering and tracking algorithms. Let {C^i_1, ..., C^i_t, ..., C^i_T} be the temporal sequence of length T for the i-th cluster, and suppose there are N such sequences. In order to compare such sequences with - for instance - a financial time series, we need to turn the semantic information encoded in each subgraph into a numeric index, which we call the Semantic Index.

Once again, we will rely on various centrality measures - averaged over all cluster members - as indicators of the information content. Let n^i_j be the j-th node of the i-th cluster, and let c(n^i_j) be its centrality value. Then we can define the Semantic Index for the cluster C^i, under the centrality measure c, as

\[
SI^i_c = \frac{\sum_{j \in C^i} c(n^i_j)}{|C^i|} \, . \tag{6.10}
\]

Figure 6.3 shows two semantic indices computed over the same temporal sequence of clusters of Thomson Reuters financial news in 2007 using two different centrality measures, namely the random walk betweenness and the eigenvector centrality. While the two semantic indices measure different structural properties of the cluster's nodes, it is interesting to notice that they both exhibit a similar trend. The next section pushes this observation further by analyzing similarities between two different types of temporal sequences: those induced by various semantic indices and different publicly available financial time series (such as the Dow Jones, S&P and the like).
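Computing Eq. 6.10 is a one-liner once the graph is available. A sketch using NetworkX (the choice of library and of eigenvector centrality as the default are ours):

```python
import networkx as nx

def semantic_index(G, cluster, centrality=nx.eigenvector_centrality):
    """Semantic Index (Eq. 6.10): mean centrality of a cluster's nodes,
    with centralities computed over the whole graph G."""
    c = centrality(G, weight="weight")
    return sum(c[n] for n in cluster) / len(cluster)

# One data point of the index time series per time window, e.g.:
# series = [semantic_index(G_t, cluster_t) for G_t, cluster_t in tracked]
```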

6.6.1 Comparison with financial time series

Our approach permits us to turn a stream of unstructured textual data, grouped and processed along well-defined temporal boundaries (e.g. daily or weekly financial news), into a sequence of data points induced by computing the semantic index over the evolution of the semantically most influential clusters. Obviously, the number of time series is the total number of clusters being tracked, say N, times the number of different semantic indices, say F. In this section we investigate whether a particular semantic index time series, or a linear/non-linear combination of different indices, could be used to mirror the activity of, e.g., a financial market time series, so that the observed trend of the semantic index could be correlated with the trend of the target financial time series.

As is now widely accepted, markets are far from being truly "efficient" and it is commonly believed that - for instance - stock prices are at least partially predictable [284]. Many economists have started emphasizing certain "fundamental" psychological and behavioral elements of stock-price determination. Our index tries to capture exactly this information by processing large quantities of finance-related data. Thus, our primary intent is not to provide yet another tool for forecasting market evolution, but rather to provide a quantitative description of the above-mentioned psychological and behavioral elements hidden in large corpora of unstructured textual data and processed in real time. The goal is to efficiently capture macroscopic perturbations that could affect the behavior of traders and thus - indirectly - help in making more accurate analyses of the market itself.

For the purpose of the above analysis, we have ingested the whole Thomson Reuters corpus of financial news headlines for the period 2003-2012, for a total of 11.5M news dispatches and full stories. News items published during the same month are grouped together and processed separately. Processing is done on a 36-core MapR cluster (MapR, www.mapr.com, is a production-ready distribution of Apache Hadoop tailored to processing and storing huge quantities of data) using a fixed dictionary of 26k (mostly economic) terms. In the following we concentrate our analysis on the results obtained on the 2007 data, where a total of N = 13 clusters have been identified as relevant and tracked.

The outcome of our text processing and cluster tracking stage is N × F time series, one for each combination of cluster and semantic index. For simplicity, we will concentrate on the most influential cluster only (thus, N = 1), selected as the one having the highest average centrality value summed over the whole year. In this work we have used five different semantic indices (thus, F = 5), namely:

• Random walk betweenness [285]

• Clustering coefficient

• Eigenvector centrality

• Shortest-path betweenness

• PageRank

While in this work we mainly use centrality values over the selected cluster, there is no reason not to use other local and/or global graph measures, such as statistics of the node degree distribution, the number of connected components, the average cluster-wise diameter, or more involved measures that combine topological and semantic features such as the Katz similarity [286] and the like [287].

6.6.2 Correlation analysis

In our first analysis we are interested in finding whether there are significant correlations between one of the semantic index time series and the Dow Jones Industrial Average (DJIA) over the same time period. Here we do not speculate about possible causal relations between the events represented by such time series. We use the standard Pearson correlation coefficient, defined as the average product of the departures of two variables from their respective means divided by the product of the standard deviations of those variables. Figure 6.4 graphically depicts the close relation between the two time series under different semantic indices, together with the correlation coefficient.

In addition, we are interested in testing whether one time series contains predictive information about the other. In our second analysis we use the cross-correlation coefficient [288] to estimate the strength and direction of correlation between two random variables. Given two jointly stationary time series, X and Y, the cross-correlation is computed as

\[
R_{xy}(t, \tau) = \frac{E\left[(X_t - \mu_X)(Y_{t+\tau} - \mu_Y)\right]}{\sigma_X \sigma_Y} \, , \tag{6.11}
\]

where t is the time index, τ is the lag (i.e. delay) at which we wish to compute the correlation, and µ_X and σ_X are the mean and standard deviation of the process X_t (and similarly for Y_t).

Figure 6.4: Pearson correlation coefficients computed between each semantic index time series and the Dow Jones Industrial Average (DJIA) over the same time period. Panel (a) shows the DJIA 2007 monthly time series alone, while panels (b) eigenvector centrality, (c) clustering coefficient, (d) random walk betweenness, (e) shortest-path betweenness and (f) PageRank overlap the DJIA and the semantic time series computed over the Reuters news headlines during the same period. We present the evolution of the semantic index for the cluster having the highest correlation. The figure also shows p-values (at 5%) for each correlation coefficient. In each plot, the x axis represents time, while the y axis reports the actual value of the DJIA (left, black) and the actual value of the semantic index (right, red).

Figure 6.5: Cross-correlation coefficient - at varying time lags τ - between the DJIA time series (panel a) and five different semantic index time series for the same cluster used in Fig. 6.4: (b) random walk betweenness, (c) clustering coefficient, (d) eigenvector centrality, (e) shortest-path betweenness and (f) PageRank. Cross-correlation measures whether one time series contains predictive information about the other; at τ = 0 it reduces to the correlation coefficient. In panels (b-f), the x axis represents the time lag τ, while the y axis reports the cross-correlation coefficient in [−1, 1].

From Eq. 6.11 we see that the cross-correlation value at τ = 0 corresponds to the correlation coefficient; otherwise it measures the correlation of the first series with the second one shifted (lagged) by an amount τ. When computing the cross-correlation between the various semantic index time series and a financial index, we are particularly interested in significant correlations at a negative lag, as they indicate that the former could be used to predict the latter [288].
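A minimal sketch of the lagged cross-correlation computation (a direct finite-sample implementation of Eq. 6.11; the function is our own):

```python
import numpy as np

def cross_correlation(x, y, max_lag=6):
    """Sample cross-correlation R_xy(tau) for tau in [-max_lag, max_lag].

    A strong value at a negative lag suggests x leads (and may help
    predict) y. Both series are assumed (weakly) stationary.
    """
    x = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, float) - np.mean(y)) / np.std(y)
    out = {}
    for tau in range(-max_lag, max_lag + 1):
        if tau >= 0:
            a, b = x[:len(x) - tau], y[tau:]   # pair x_t with y_{t+tau}
        else:
            a, b = x[-tau:], y[:len(y) + tau]
        out[tau] = float(np.mean(a * b))
    return out
```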

Figure 6.5 shows the cross-correlation coefficient computed between the Dow Jones Industrial Average and the different semantic indices for one evolving cluster. As the figure suggests, there appears to be a significant correlation, and a significant cross-correlation at lag τ = 1, between the financial index and all our time series.

We have extended our analysis to the correlation between the semantic indices and various financial indicators over the same year. Figure 6.6 shows the results obtained.

Figure 6.6: We have taken various semantic indices built on the 2007 Reuters data and computed the Pearson correlation against some of the most popular financial indicators (DJIA, SP500, VIX, NASDAQ & DJ Commodities). The last column shows the two time series, financial and semantic, overlapped. For each comparison the best-performing cluster has been chosen (i.e. the one exhibiting the highest correlation value). The figure also shows p-values for each correlation. As before, the x axis represents time, while the y axis represents either the actual value of the financial index or the actual value of the chosen semantic index.

6.6.3 A note on stationarity and correlation

In the correlation analysis above, we tacitly assumed that the time series under investigation were stationary. Actually, under the augmented Dickey-Fuller test at the 5% significance level (the augmented Dickey-Fuller test is one of the most commonly used unit-root tests for stationarity [289]), neither the financial series nor the evolution of our semantic indices passed the stationarity test (at varying lag values). However, it was sufficient to difference all the series of interest to eliminate unit roots, and thus to restore stationarity. In the remainder of this work we will therefore assume that all time series are (weakly) stationary. See [288] for an exhaustive set of statistical tests to assess the stationarity of data and techniques to turn a non-stationary time series into a stationary one.
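This check is easy to automate. A hedged sketch using statsmodels (in our experiments a single differencing pass was enough; the loop cap is an illustrative safeguard):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def make_stationary(series, alpha=0.05, max_diff=2):
    """Difference a series until the augmented Dickey-Fuller test rejects
    the unit-root null hypothesis at significance level alpha."""
    x = np.asarray(series, dtype=float)
    for _ in range(max_diff):
        if adfuller(x)[1] <= alpha:    # adfuller returns (stat, p-value, ...)
            break
        x = np.diff(x)
    return x
```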

With regard to the actual correlation values, we have performed standard statistical tests to assess their significance. The significance of a sample correlation, r, depends on the sample size and on r itself. Significance can be tested with a t-test under the following assumptions:


• The samples, x and y, are drawn from populations that follow a bivariate normal distribution;

• The samples are random samples from the population;

• Under the null hypothesis, the population correlation coefficient, ρ, is zero.

If the above assumptions are met, the value of r is used to determine the statistical significance of the correlation by calculating the well-known t-test statistic [290]:

\[
T = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}} \, .
\]

T follows a t-distribution with N − 2 degrees of freedom, where N is the sample size and r is the sample correlation coefficient.

As usual, we need to define the null and alternative hypotheses for a significance test, which in our case are:

• H_0: ρ = 0 (the two series are uncorrelated)

• H_1: ρ ≠ 0

We have performed these statistical tests at the 95% confidence level (α = 0.05). In all experiments we obtained p ≪ 0.05 and therefore rejected the null hypothesis above.
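For reference, this is essentially the test that SciPy's pearsonr performs; a sketch making the t-statistic explicit:

```python
import numpy as np
from scipy import stats

def correlation_test(x, y):
    """Pearson r with a two-sided t-test of H0: rho = 0 (N - 2 dof)."""
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
    return r, p                            # comparable to scipy.stats.pearsonr(x, y)
```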

In addition, we have also performed Granger causality tests [291] between the various semantic indices and the target financial data under the hypothesis of an autoregressive model. The Granger causality test is a statistical hypothesis test used in time series analysis to determine whether a time series X_t is useful in forecasting another time series Y_t [292]. Indeed, X_t is said to Granger-cause Y_t if Y_t can be better predicted using the histories of both X_t and Y_t than using the history of Y_t alone. Granger causality can be assessed by regressing Y_t on its own time-lagged values and on those of X_t. Usually, an F-test is then used to examine whether the null hypothesis that Y_t is not Granger-caused by X_t can be rejected at a given confidence level [291]. As the results in Fig. 6.6 show, we confirm the predictive power of the semantic indices (at a 5% significance level) at various time lags, paving the way for the comprehensive semantic index detailed in the following section.
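The test is available off the shelf in statsmodels. A hedged sketch on synthetic placeholder data (in which the 'semantic' series leads the 'financial' one by construction):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
semantic = rng.normal(size=120)                    # placeholder series
financial = np.roll(semantic, 1) + rng.normal(scale=0.5, size=120)

# Column order is [effect, candidate cause]; F-tests are run at lags 1..3.
data = np.column_stack([financial, semantic])
results = grangercausalitytests(data, maxlag=3)
```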

6.7 Building the Semantic Index

In this section we investigate how the various indices can be combined to reflect overall topological and structural network properties mirroring a financial index of choice (e.g. DJIA, SP500, ...). As shown in the previous section, various semantic indices indeed reflect the evolution of the target financial data. However, no single measure is descriptive enough to be used as the unique descriptor of global and local perturbations, leaving room for a novel index which simultaneously takes into account both the semantic information encoded in the network and the historical data of a target financial indicator. Such an index, called the Comprehensive Semantic Index (CSI), should augment and enhance the information expressed in the target financial index by incorporating easily available exogenous data, such as financial news and the like.

The idea is to build a model that intelligently combines the various graph-induced time series into a unique index. How can such an index be built and its parameters learned? Here we assume that the evolution of our index can be described by an autoregressive model with multivariate exogenous inputs, described in the following section.

6.7.1 Semantic index as an auto-regressive process with exogenous variables

Since classical regression is often insufficient for explaining all of the interesting dynamics of a time series, in this work we adopt an autoregressive process with vectorial exogenous variables as the mathematical model that governs the evolution of our index [293].

Autoregressive models are based on the idea that the current value of the series, x_t, can be explained as a function of p past values, x_{t−1}, x_{t−2}, ..., x_{t−p}, where p determines the number of steps into the past needed to forecast the current value [291]. Thus, an autoregressive model of order p, abbreviated AR(p), is expressed as:

\[
x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t \, , \tag{6.12}
\]

where x_t is assumed to be stationary and φ_1, ..., φ_p are unknown, time-invariant coefficients. It is usually assumed that w_t is a Gaussian white noise series with mean zero and variance σ²_w. The mean of x_t is assumed to be zero (if not, it is sufficient to replace x_t by x_t − µ). Usually, the AR process is augmented with the so-called moving average (MA) model of order q, abbreviated MA(q), which assumes that the white noise w_t on the right-hand side of Eq. 6.12 is combined linearly to form the observed data. The resulting model, referred to as ARMA, is given by the following equation [291]:

\[
x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \, . \tag{6.13}
\]

The parameters p and q are called, respectively, the autoregressive and the moving average orders. Again, φ_1, ..., φ_p, θ_1, ..., θ_q are unknown, time-invariant coefficients.

Often the value of one variable is not only related to its own predecessors in time but, in addition, depends on past values of other variables. Thus, it can be the case that the target variable depends linearly on another time series u_t, ..., u_{t−r}, with order r and linear coefficients β_1, ..., β_r. Such a series is called exogenous, and the resulting model, called ARMAX, can be described by the following equation [293]:

\[
x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} + \beta_1 u_{t-1} + \dots + \beta_r u_{t-r} \, . \tag{6.14}
\]

In general, the unknown parameters φ, θ and β, the variance σ²_w, as well as the orders p, q and r, must be estimated from the available observations x_1, ..., x_n and u_1, ..., u_n of the target process and of the exogenous inputs, respectively. The estimation of the coefficients is usually done via Maximum Likelihood (ML) or least squares estimators (and variants thereof), given that the order of the model is known. The estimation of the order variables p, q and r, on the other hand, is akin to the problem of structural learning and is usually done by comparing various models against each other, balancing the error of the fit against the number of free parameters in the model. AIC (Akaike's Information Criterion) and BIC (Bayesian Information Criterion) are among the most used statistics for model comparison. A detailed explanation of various estimation algorithms can be found in [288, 293, 294].
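As an illustration of this selection step, the following sketch scans candidate ARMA orders and keeps the one with the lowest BIC (the search ranges are arbitrary placeholders):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def select_arma_order(x, max_p=4, max_q=4):
    """Choose ARMA(p, q) orders for series x by minimizing the BIC."""
    best_bic, best_order = np.inf, None
    for p in range(1, max_p + 1):
        for q in range(max_q + 1):
            fit = ARIMA(x, order=(p, 0, q)).fit()  # d = 0: a plain ARMA model
            if fit.bic < best_bic:
                best_bic, best_order = fit.bic, (p, q)
    return best_order
```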

6.7.1.1 Multivariate case

The equations above are valid in the univariate case only. However, there are cases in which we have more than one output or input time series, and often the value of one output series depends both on its own lagged values and on the (lagged) values of other output and input series. Thus, the mathematical models introduced above must be extended to deal with the multivariate case.

Unfortunately, the extension of moving-average based models to their multivariate counterpart is technically involved and is usually not used in practice. The multivariate autoregressive model, however, is a straightforward extension of the univariate AR model [295]:

\[
\mathbf{x}_t = \boldsymbol{\alpha} + \sum_{j=1}^{p} \Phi_j \mathbf{x}_{t-j} + \mathbf{w}_t \, . \tag{6.15}
\]

The model in Eq. 6.15 is called a vector autoregressive model (VAR) of order p. Φ_j is a transition matrix that expresses the dependence of the current realization of the output variable on its j-th lag. The vector white noise process w_t is assumed to be multivariate normal with mean zero and covariance matrix Σ_w.

Finally, the model above can easily be generalized to include a vector of exogenous inputs, u_t, yielding the so-called VARX model:

\[
\mathbf{x}_t = \sum_{j=0}^{r} \Gamma_j \mathbf{u}_{t-j} + \sum_{j=1}^{p} \Phi_j \mathbf{x}_{t-j} + \mathbf{w}_t \, . \tag{6.16}
\]

In this work, we have adopted the VARX model to accommodate the various semantic indices as exogenous inputs trained against a target time series of interest (i.e. a financial index). How the model is trained, which implies the estimation of the free parameters Γ and Φ, is shown in the following section.

6.7.2 Fitting the parameters of the VARX model

Imagine that the structure of the model, i.e. the values of p and r, is known beforehand and fixed. In order to estimate its parameters, any autoregressive model requires that historical data of the target time series be available for training. In our case, as this would require previous values of the very series we are trying to build, we bootstrap the learning problem by constraining our index to follow the perturbations of a financial index of interest (e.g. the Dow Jones or the S&P500). The net result is that our index mimics the trend of the target financial index by mixing all the available information related to both the exogenous events, such as those induced by the financial news, and the past values of the index itself. An advantage is that our index can anticipate market fluctuations by reading external sources of information in real time. In addition, we can provide a semantic explanation as to why our index behaves in a certain way.
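Since the comprehensive index tracks a single target series, the VARX of Eq. 6.16 effectively reduces to a univariate ARX in this fitting stage. A hedged sketch of the procedure using statsmodels' SARIMAX (data, orders and the train/test split are placeholders, not the fitted values reported below):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
target = pd.Series(rng.normal(size=120).cumsum(), name="djia")  # placeholder target
exog = pd.DataFrame(rng.normal(size=(120, 5)),                  # five semantic indices
                    columns=["rw_btw", "clust", "eig", "sp_btw", "pagerank"])

train, test = slice(0, 108), slice(108, 120)    # hold out the last 12 months
model = SARIMAX(target[train], exog=exog[train], order=(3, 0, 0))  # AR(3) + exog
fit = model.fit(disp=False)                     # maximum likelihood estimation
pred = fit.forecast(steps=12, exog=exog[test])  # out-of-sample comparison
```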

Figure 6.7: Comprehensive Semantic Index fitted on the 2003-2011 DJIA series and semantic exogenous inputs. (a) Comparison between the CSI and DJIA indices on training data; (b) comparison between the CSI and DJIA indices on test data. The black curve depicts the target financial series while the red one represents our index. In each plot, the x axis represents time, while the y axis represents the value of the semantic index and of the target time series (DJIA).

In our first experiment, we fitted a VARX model against the 2003-2012 monthly Dow Jones Industrial Average (DJIA) with the five different semantic indices introduced in section 6.6.1 acting as exogenous series. We decided to keep the last twelve months of the series as the test set, while the rest was used as the training set.

The parameters of the model, given its structure, are learned by a Maximum Likelihood algorithm [295]. The structure of the model, on the other hand, is estimated via the BIC model selection criterion, and the parameters p and r are set to 18 and 6, respectively. Figure 6.7(a) shows the results of the training phase obtained with the best model according to the BIC. It is interesting to notice how our index follows the trend of the Dow Jones index on the test data (Figure 6.7(b)). In particular, the index correctly predicts a significant market perturbation in June 2012 as well as a positive trend immediately after the shock. However, the two series are not perfectly aligned at all times, and we provide the end user with the opportunity to understand why our index forecast a certain trend, as will be shown in the following.

We have also performed a similar analysis on another commonly used financial series, namely the Standard & Poor's 500 (SP500), over the same time frame (2003-2012). As before, the last year was used for testing purposes while the VARX model was tuned on the remaining months. The BIC analysis reveals a different structure for the SP500 data, with the p and r model order parameters set to 12 and 4, respectively. As before, we show how the model fits the data from the training set and how well it mirrors the observed trends in the test set. Results are shown in Figure 6.8.

Figure 6.8: Comprehensive Semantic Index fitted on the 2003-2011 Standard & Poor's 500 series and semantic exogenous inputs. (a) Comparison between the CSI and SP500 indices on training data; (b) comparison between the CSI and SP500 indices on test data. The black curve depicts the target financial series while the red one represents our index. In each plot, the x axis represents time, while the y axis represents the value of the semantic index and of the target time series (SP500).

Since both the DJIA and the SP500 exhibit a similar trend across the period under examination (2003-2012), we have also tried to fit our index to a rather different series. For this comparison we have chosen the volatility index (VIX), which represents a measure of the market's expectation of stock market volatility over the next 30-day period and is computed as a weighted blend of prices for a range of options on the S&P 500 [296]. We adopted the identical learning scheme as before, with the same percentages of training and testing data. Through the BIC analysis we found the optimal structure of the model to be p = 12 and r = 9, and we trained the model using maximum likelihood learning. The results, depicted in Figure 6.9, are comparable to those shown for the other two series.

Figure 6.9: Comprehensive Semantic Index fitted on the 2003-2011 VIX series and semantic exogenous inputs. (a) Comparison between the CSI and VIX indices on training data; (b) comparison between the CSI and VIX indices on test data. As usual, the black curve depicts the target financial series while the red one represents our index. In each plot, the x axis represents time, while the y axis represents the value of the semantic index and of the target time series (VIX).

6.7.3 Semantic content of the CSI

For an end user of our index, it would be highly beneficial to understand why the index behaves the way it does. While the global trend of the index is obviously an interplay of different factors (in our case, the evolution of both the topological features of the semantic network and of the economic index the model has been trained against), we can expose the semantic content of the cluster nodes that act as exogenous variables in our model. Consider the last example, fitting the CSI against a volatility indicator (VIX). As shown in Figure 6.9, there is a sudden inversion of trend around May/June 2012 (this was captured by the other two indices as well). Inspecting the content of the cluster that was used to build the various semantic indices, as explained in section 6.6, the cluster in June 2012 contains concepts in common with the cluster in May 2012, such as government bonds, loan, property tax, price inflation, unemployment rate, ..., while it presents a set of new influential concepts entering the cluster that presumably influenced - in the short run - the negative trend of the market during that month: deflation, foreign bond, financial crisis, fiscal deficit, Mario Draghi, and the like.

6.8 Summary

In this concluding chapter we have shown how techniques developed for the analysis of complex networks can be extended to capture the evolution of semantic information over time. As stated in the introductory section, our goal is to seamlessly turn unstructured information from trustworthy textual datasets (e.g. financial news) into readily usable decision-making knowledge by quantifying semantics. Analogously, the evolution of (particular aspects of) a financial market is usually succinctly described via stock market indices used by investors (broadly understood) during the process of decision making. To take full advantage of such indices, over the last decades a great effort has been spent devising complex mathematical models able to forecast future values of this or that financial index based on its past values and - possibly - other external information.

Here we propose a radically different approach. In short, the goal is to devise a novel index influenced by the topological properties of agglomerates of information in a semantic graph. The graph itself is created by interconnecting semantically related concepts extracted from huge textual corpora. Our approach is akin to the distributional hypothesis, which relates two concepts if they co-occur in similar contexts with an above-chance probability. Thus, plain textual information can be turned into a graph, enabling practitioners to adopt a plethora of tools from the field of complex network analysis. We have also extended this approach to fully capture the dynamical aspects of the phenomena under investigation by identifying clusters holding influential information and tracking them over time. By computing graph-based statistics over such clusters we eventually turn the evolution of textual information into a mathematically well-defined multivariate time series, where each time series encodes the evolution of particular structural, topological and semantic properties of the set of concepts it has been computed on. The last step is to construct a model that combines said properties into a unique comprehensive index that possibly mimics the behavior of an existing index.
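
A minimal sketch of this step is given below; it assumes a sequence snapshots of (networkx graph, tracked cluster node set) pairs, one per time step, and uses a few simple descriptors for illustration (the actual statistics used in this Chapter are described in Section 6.6).

    import numpy as np
    import networkx as nx

    def cluster_features(G, nodes):
        # A few simple structural/topological descriptors of the tracked cluster.
        S = G.subgraph(nodes)
        degrees = [d for _, d in S.degree()]
        return np.array([
            S.number_of_nodes(),                          # cluster size
            S.number_of_edges(),                          # internal connectivity
            float(np.mean(degrees)) if degrees else 0.0,  # mean degree
            nx.density(S),                                # density
        ])

    # one feature vector per time step -> a well-defined multivariate time series
    series = np.vstack([cluster_features(G_t, cluster_t)
                        for G_t, cluster_t in snapshots])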

The model we have chosen, an autoregressive model with vectorial exogenous inputs (VARX), linearly mixes previous values of the index with the evolution of n other time series induced by the semantic information in the graph. To bootstrap the construction of such an index, the model is trained against the time series whose behavior, in terms of fluctuations in response to external events, we wish to mirror. The structure of the model is deduced by using classical model selection tools to prevent overfitting to the training data.
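
In the notation used here, a minimal form of such a model can be written as follows, where y_t is the comprehensive semantic index, \mathbf{x}_t collects the n semantic time series at time t, and c, the a_i and the vectors \mathbf{b}_j are the fitted coefficients (for the VIX experiment the BIC selected p = 12 and r = 9):

    y_t = c + \sum_{i=1}^{p} a_i \, y_{t-i} + \sum_{j=1}^{r} \mathbf{b}_j^{\top} \mathbf{x}_{t-j} + \varepsilon_t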

We fitted our model against different market indices (DJIA, SP500, VIX) with promising results. On the test data (corresponding in all experiments to the last available year) the model correctly forecasted major financial fluctuations and followed the trend of the target index. In addition, our model was able to provide a comprehensive explanation of its own behavior by providing an overview of the semantic information it encodes. For instance, we have retrospectively seen that in correspondence with market shocks the dominating concepts are those related to the observed effect on the markets (e.g. Crimea during the onset of the recent riots in Ukraine, subprime mortgage and overcollateralization during the financial crisis in late 2007, and so on).

While the system as of now processes data in batches with a time constant that in the examples above was set to a month, future work will concentrate on developing the real-time counterpart of our system, able to ingest and process huge quantities of data as they arrive. In addition, we plan to investigate the adoption of non-linear models (e.g. NARX [297]) to capture more subtle effects of semantic information on the evolution of stock markets.

6.9 Acknowledgments

The author would like to thank Thomson Reuters for kindly providing their 2003-2012 news headlines dataset.

Conclusions

The objective of this Thesis is to investigate a framework for analysing complex systems of different nature (domains) by means of a two-layer approach: firstly, the semantic conversion of a large unstructured data set into a text-induced knowledge network and, secondly, a graph-theoretical analysis of such a network. The combination of these two stages allowed me to construct a methodology able to investigate the statistical and emerging properties of certain complex phenomena and to devise a framework suitable for predictive analysis. The theoretical foundations of this research work span various theories and techniques applicable to complex data sets.

In Chapter 2 I presented the relevant concepts about multiscaling as an introductory part of the theory studied through my research. In particular, I focussed on financial complex data sets and investigated their multiscaling characteristics, giving a brief background of the theory and instruments explored with the objective of studying regularities of apparently unpredictable signals and of constructing a synthetic model which can be used to forecast and understand the mechanisms governing the processes under study.

Throughout the first part of this Chapter I showed how the concept of scaling leads to the important concept of a fractal, which represents a system characterised by a scaling law with a unique non-integer scaling exponent, and to scaling properties that are not described by a single number but rather by a function of scaling exponents, displaying multi-fractal characteristics.


The investigation of such statistical properties continued through the description of how scaling properties in time series have been studied by means of several techniques: I introduced the rescaled range statistical analysis (R/S), which provides a statistical rule to describe long-term dependence and to reveal long-run correlations in random processes, presented several further approaches, and finally focused on multiscaling (multi-fractal) stochastic processes. Particular attention was given to the Generalized Hurst Exponent (GHE), which is one of the most widely used methods to directly quantify the multifractal properties of a time series.
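
As an illustration, a minimal GHE estimator can be sketched as follows; it assumes x is a one-dimensional series as a NumPy array, exploits the scaling E[|x(t+tau) - x(t)|^q] ~ tau^{qH(q)}, and is not meant to replace the careful estimation procedures discussed in the Chapter.

    import numpy as np

    def ghe(x, q=2, taus=range(1, 20)):
        # Regress log q-th order moments of increments against log tau.
        logm, logt = [], []
        for tau in taus:
            diffs = np.abs(x[tau:] - x[:-tau])
            logm.append(np.log(np.mean(diffs ** q)))
            logt.append(np.log(tau))
        slope, _ = np.polyfit(logt, logm, 1)   # slope = q * H(q)
        return slope / q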

The second part of this Chapter was dedicated to the main properties of Graph Theory (referred to in this Thesis as Network Theory) that have been widely used throughout my research work and this Thesis, and I investigated how this field of mathematics lends itself well to analysing large complex systems representing natural real-world phenomena. In this respect I described the main properties observed in real-world complex networks and in particular in scale-free networks. I first gave a summary of the topological characteristics of a network from local and global perspectives, introducing various network measures and metrics, describing characteristics of graph connectivity, i.e. distance between nodes and paths, both from the geodesic perspective (such as the shortest path) and from a stochastic perspective (such as the Random Walk distance), and defining various centrality measures and the meaning of clustering measures.
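
For concreteness, the standard measures mentioned above are readily available in common libraries; a minimal sketch using networkx (on an arbitrary example graph) reads:

    import networkx as nx

    G = nx.karate_club_graph()                    # any example graph
    geodesics = dict(nx.shortest_path_length(G))  # shortest-path distances
    degree_c = nx.degree_centrality(G)            # degree centrality
    between_c = nx.betweenness_centrality(G)      # betweenness centrality
    local_clustering = nx.clustering(G)           # local clustering coefficient
    print(nx.average_shortest_path_length(G), nx.transitivity(G))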

The last part of this Chapter was dedicated to an introduction to models of complex networks. Here I introduced the Erdős and Rényi model (ER), which was the first to address and describe large networks with no particular distribution of nodes and links and whose organisation principles were not easily definable (random graphs). Although these models were used as reference models for several years, they are not an exact representation of the characteristics of real-world networks of complex topology, and two better models are described as follows (a short illustrative sketch follows the list):

- the scale-free networks, which are networks typically characterised by the presence of large hubs and displaying a power-law degree distribution, and,

- the small-world network - Watts-Strogatz model (WS) - referring to networks in which the average geodesic distance between nodes (i.e., the shortest path) increases slowly as a function of the number of nodes in the network.
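
A minimal sketch comparing the three reference models is given below; generator names are as in networkx and the parameter values are purely illustrative (the ER sample is assumed connected, which is very likely at mean degree 10).

    import networkx as nx

    n = 1000
    er = nx.erdos_renyi_graph(n, p=0.01)                    # ER random graph
    ba = nx.barabasi_albert_graph(n, m=5)                   # scale-free, large hubs
    ws = nx.connected_watts_strogatz_graph(n, k=10, p=0.1)  # small-world (WS)

    for name, g in [("ER", er), ("BA", ba), ("WS", ws)]:
        print(name, nx.average_shortest_path_length(g), nx.transitivity(g))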

Finally, I presented a summary of network filtering tools which have been widely used throughout my research work. Amongst those I introduced the Minimum Spanning Tree (MST) and the Planar Maximally Filtered Graph (PMFG), and I provided an outlook on scale-free networks embedded in hyperbolic spaces.
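
To give the flavour of the filtering step, the following is a minimal MST construction from a correlation matrix C, using the usual metric distance d_ij = sqrt(2(1 - rho_ij)); the PMFG additionally requires planarity checks and is omitted here.

    import numpy as np
    import networkx as nx

    def mst_from_correlation(C, labels):
        D = np.sqrt(2.0 * (1.0 - np.asarray(C)))   # correlation -> metric distance
        G = nx.Graph()
        n = len(labels)
        for i in range(n):
            for j in range(i + 1, n):
                G.add_edge(labels[i], labels[j], weight=D[i, j])
        return nx.minimum_spanning_tree(G)          # Kruskal-type filtering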

In Chapter 3 I gave a thorough introduction to certain aspects of computational linguistics by analysing modern techniques for semantic representation, and conceptualised a compelling methodology to transform large text-based information into a semantically induced graph. Various semantic models were introduced; in particular I focussed on a robust approach called the Distributional Hypothesis, the statistical properties associated with it, and how a graph constructed upon those models represents a complex system that can be analysed with a graph-theoretical approach. Through the Chapter I introduced various parsing techniques, such as the context-free grammar (CFG) parser, and showed how it naturally generates a formal language able to represent structured clauses, as well as more sophisticated methods such as the probabilistic context-free grammar (PCFG) and the lexicalised probabilistic context-free grammar (L-PCFG) parsers, which aim to find the most likely parse of a given sentence by assigning a probability to each parse tree. The study of parsing techniques alongside the distributional hypothesis led to the definition and generation of the semantically induced network, where the frequency of co-occurrences represents an extensive property of the relationship between two concepts, i.e. two vertices in the graph, and can be used as a similarity measure for the graph itself. Once the semantic graph is constructed, I provided the background for analysing such a network of concepts and defined geometric and probabilistic semantic similarities and semantic shortest-path distance measures, readily comparable to the known measures introduced in Chapter 2.
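
As a toy illustration of the construction (real corpora require the parsing and disambiguation steps described in the Chapter; the sentences below stand for lists of concepts already extracted), a co-occurrence network can be built as follows.

    import itertools
    import networkx as nx

    def cooccurrence_graph(sentences):
        # Edge weight counts how often two concepts co-occur in the same context.
        G = nx.Graph()
        for concepts in sentences:
            for u, v in itertools.combinations(sorted(set(concepts)), 2):
                w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
                G.add_edge(u, v, weight=w)
        return G

    G = cooccurrence_graph([["inflation", "bond", "ECB"],
                            ["bond", "ECB", "deficit"]])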

In Chapter 4 I introduced a preliminary analysis of a real-world application of the framework described in the previous chapters by processing abstracts and titles of a corpus of 208,583 scientific bio-medical papers selected from the PubMed database, and I described step by step the semantic graph-theoretical approach, starting from the statistical semantic analysis where the distributional hypothesis and Natural Language Processing parsing techniques are applied with the aim of building the induced knowledge graph. The following paragraphs introduced various methods for finding network paths associated with high levels of similarity between nodes (i.e. biomedical concepts), including a stochastic method to identify relevant biological paths by means of a random walk distance measure. Finally, I presented certain network filtering techniques applied to the biomedical knowledge graph and showed preliminary results based on the structural analysis of the resulting concept network.
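
A random-walk distance of this kind can be illustrated by the mean first-passage time of a walker on the concept graph. The following is a minimal sketch, assuming a dense adjacency matrix A with no isolated nodes; it uses the standard Markov-chain identity h = (I - Q)^{-1} 1.

    import numpy as np

    def mean_first_passage_times(A, target):
        # Row-stochastic transition matrix of the random walk on the graph.
        P = A / A.sum(axis=1, keepdims=True)
        others = [i for i in range(len(A)) if i != target]
        Q = P[np.ix_(others, others)]       # walk restricted to non-target nodes
        h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
        times = np.zeros(len(A))
        times[others] = h                   # expected steps to reach `target`
        return times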

Chapter 5 took the biomedical case study seen in the previous chapter a step further by demonstrating that the semantic graph-theoretical model presented earlier can indeed represent a valid framework for researching and repurposing existing drugs towards diseases for which they were not initially intended. By utilising the tested methods discussed in the previous chapter, I used a double-layer approach leveraging developments in computational linguistics and exploiting a stochastic distance measure to provide biologically relevant results. The corpus of the unstructured data set used for this model comprises 3 million abstracts of papers dealing with biomedical research taken from PubMed, and the proposed methodology provided an effective instrument to detect different Mechanisms of Action (MoAs) of peptides and drugs; though it may not capture the MoAs in full detail, this approach succeeded in making them recognisable as a short chain of biomedical entities.

Moreover, the induced graph representations of biomedical knowledge produced a sound and meaningful exemplification of the many interrelated concepts of the biomedical discipline; this methodology successfully allowed both the validation of existing rationales and the discovery of new ones, a feat usually left to serendipity and intuition. Through the proposed framework, the scientific rationale has been translated into relevant clinical trial settings aimed at providing new potential treatment options for patients affected by Sarcoidosis and Creutzfeldt-Jakob disease, and the most meaningful results deriving from the proposed methodology are shown. This real-world application of the framework in the field of biology, developed and explained in this Chapter, has been published in the article: Ruggero Gramatica, T. Di Matteo, S. Giorgetti, M. Barbiani, D. Bevec, Tomaso Aste, Graph Theory Enables Drug Repurposing - How a Mathematical Model Can Drive the Discovery of Hidden Mechanisms of Action, January 2014, PlosOne Volume 9, Issue 1, e84912.

Chapter 6 dealt with another real-world example, focussing on the domain of economic and financial complex systems. In this example, a large data set comprising 11 million financial and economic news headlines, spanning 10 years of Thomson Reuters data, is treated by applying the dual-layer framework previously discussed, i.e. 1) a semantic process aimed at extracting emerging concepts through computational linguistic techniques and at creating an induced semantic network, and 2) a graph-theoretical analysis of such a network. Here an additional element of complexity is added by considering the variable of time t, i.e. dealing with the peculiarity of evolving networks. Indeed, one of the main objectives was to turn the temporal evolution of the structural and topological properties of such semantically induced networks into a semantic economic/financial index readily comparable to other related time series (market indices such as the DJIA, NASDAQ COMPOSITE IXIC, DJ UBS COMMODITIES DJC, SP500 GSPC and VIX). The underlying hypothesis is that fluctuations of certain structural parameters - computed over the induced networks - can be put in correspondence with fluctuations of other related phenomena. Using the analytical techniques described, I suggest that a combination of indices extracted through the methodology described here upon complex semantic networks could be used either in addition to traditional stock market indicators or as exogenous inputs to improve prediction and forecasting of stock market events. By computing graph-based statistics over a set of clusters, the presented method eventually turned the evolution of textual information into a mathematically well-defined multivariate set of time series, where each time series encoded the evolution of particular structural, topological and semantic properties of the set of concepts it was computed on, and eventually combined said properties into a unique comprehensive economic/financial index that possibly mimics the behaviour of an existing index. To accomplish this last part, an autoregressive model with vectorial exogenous inputs (VARX) was chosen, which linearly mixed previous values of the index with the evolution of n other time series induced by the semantic information in the graph. The model correctly forecasted major financial fluctuations and followed the trend of the target indices. In addition, the presented model was able to provide a comprehensive explanation of its own behaviour by providing an overview of the semantic information it encoded. This work represents a new approach in this area; with the existing, and ever increasing, volumes of unstructured data created daily, the impact that it can have on the industry is enormous. The results on the sample data taken in this instance suggest promising areas of application that have immediate potential in the financial sector and more broadly.

To summarise, the approach proposed in this thesis exploits computational linguistics and graph theory in order to build a graph representation of knowledge, which is analysed with the aim of observing correlations across concepts (nodes) or across clusters of nodes and of highlighting emerging properties by means of both static topological structure analysis and dynamic evolution analysis, i.e. the change in connectivity patterns. The research carried out in this thesis has served in real-world applications as an empirical basis to investigate further the underlying mechanisms governing biological, economic or financial events.

Bibliography

[1] Ruggero Gramatica, Tiziana Di Matteo, Stefano Giorgetti, Massimo Barbiani, Dorian Bevec, and Tomaso Aste. Graph theory enables drug repurposing–how a mathematical model can drive the discovery of hidden mechanisms of action. PloS one, 9(1):e84912, 2014.

[2] John Gantz and David Reinsel. Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC iView, 2012.

[3] Moulay Aziz-Alaoui and Cyrille Bertelle. From System Complexity to Emergent Properties. Springer, 2009.

[4] Robert L Dewar and Frank Detering. Complex Physical, Biophysical and Econophysical Systems. World Scientific Publishing Company, 2010.

[5] Jaroslaw Kwapien and Stanislaw Drozdz. Physical approach to complex systems. Physics Reports, 515(3-4):115–226, 2012. doi: 10.1016/j.physrep.2012.01.007.

[6] Giorgio Parisi. Complex systems: a physicist’s viewpoint. Physica A: Statistical Mechanics and its Applications, 263(1):557–564, 1999.

[7] Irun R Cohen and David Harel. Explaining a complex living system: dynamics, multi-scaling and emergence. J R Soc Interface, 4(13):175–182, 2007. doi: 10.1098/rsif.2006.0173.

[8] David Harel and Amir Pnueli. On the development of reactive systems. Springer, 1985.

[9] Luis AN Amaral and Julio M Ottino. Complex networks. The European Physical Journal B-Condensed Matter and Complex Systems, 38(2):147–162, 2004.

[10] Benoit B Mandelbrot. The variation of certain speculative prices. Springer, 1997.

[11] Jan W Kantelhardt. Fractal and multifractal time series. In Encyclopedia of Complexity and Systems Science, pages 3754–3779. Springer, 2009.

[12] Benoit B Mandelbrot. The Fractal Geometry of Nature (revised and enlarged edition). W.H. Freeman and Co., New York, 1983.

[13] Benoit B Mandelbrot, Adlai J Fisher, and Laurent E Calvet. A multifractal model of asset returns, 1997.

[14] T. Di Matteo, T. Aste, and M.M. Dacorogna. Long-term memories of developed and emerging markets: Using the scaling analysis to characterize their stage of development. Journal of Banking and Finance, 29:827–851, 2005.

[15] T. Di Matteo. Multi-scaling in Finance. Quantitative Finance, 7(1):21–36, 2007.

[16] Benoit Mandelbrot and Howard M Taylor. On the distribution of stock price differences. Operations Research, 15(6):1057–1062, 1967.

[17] Ramazan Gençay, Michel Dacorogna, Ulrich A Müller, Olivier Pictet, and Richard Olsen. An introduction to high-frequency finance. Academic Press, 2001.

[18] U.A. Müller, M.M. Dacorogna, R.B. Olsen, O.V. Pictet, M. Schwarz, and C. Morgenegg. Statistical study of foreign exchange rates, empirical evidence of a price change scaling law, and intraday analysis. Journal of Banking and Finance, 14:1189–1208, 1990.

[19] R.N. Mantegna and H.E. Stanley. An Introduction to Econophysics: Correlations and Complexity in Finance. Cambridge University Press, Cambridge, UK, 2000.

[20] M.M. Dacorogna, R. Gencay, U. Müller, R.B. Olsen, and O.V. Pictet. An Introduction to High-Frequency Finance. Academic Press, San Diego, CA, 2001.

[21] A.I. Loffredo. On the statistical physics contribution to quantitative finance. International Journal of Modern Physics B, 18:705–713, 2004.

[22] J.-P. Bouchaud and M. Potters. Theory of Financial Risks: From Statistical Physics to Risk Management. Cambridge University Press, Cambridge, United Kingdom, 2004.

[23] T. Di Matteo. Multi-scaling in finance. Quantitative Finance, 7(1):21–36, 2007.

[24] H. E. Hurst. Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116:770–808, 1951.

[25] H. E. Hurst, R. Black, and Y. M. Sinaika. Long-Term Storage in Reservoirs: An Experimental Study. Constable, London, 1965.

[26] A. W. Lo. Long-term memory in stock market prices. Econometrica, 59: 1279–1313, 1991.

[27] V. Teverovsky, M. S. Taqqu, and W. Willinger. A critical look at Lo’s modified R/S statistic. Journal of Statistical Planning and Inference, 80:211–227, 1999.

[28] R. Weron and B. Przybyłowicz. Hurst analysis of electricity price dynamics. Physica A, 283:462–468, 2000.

[29] R. Weron. Estimating long-range dependence: finite sample properties and confidence intervals. Physica A, 312:285–299, 2002.

[30] J. Moody and L. Wu. Improved estimates for the rescaled range and Hurst exponents. In A. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, editors, Neural Networks in Financial Engineering, Proceedings of the Third International Conference, pages 537–553. World Scientific, London, 1996.

[31] Whitney K. Newey and Kenneth D. West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708, 1987.

[32] D. W. K. Andrews. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59:817–858, 1991.

[33] D. O. Cajueiro and B. M. Tabak. The Hurst exponent over time: testing the assertion that emerging markets are becoming more efficient. Physica A, 336:521–537, 2004.

[34] D. O. Cajueiro and B. M. Tabak. Possible causes of long-range dependence in the Brazilian stock market. Physica A, 345:635–645, 2005.

[35] M. S. Taqqu, V. Teverovsky, and W. Willinger. Estimators for long-range dependence: an empirical study. Fractals, 3:785–798, 1995.

[36] C.-K. Peng, S.V. Buldyrev, S. Havlin, M. Simons, H.E. Stanley, and A.L. Goldberger. Mosaic organization of DNA nucleotides. Physical Review E, 49: 1685–1689, 1994.

[37] H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, and M. Simons. Scaling features of noncoding DNA. Physica A, 273:1–18, 1999.

[38] G. M. Viswanathan, S. V. Buldyrev, S. Havlin, and H. E. Stanley. Quantification of DNA patchiness using long-range correlation measures. Biophysical Journal, 72:866–875, 1997.

[39] Y. Liu, P. Gopikrishnan, P. Cizeau, M. Meyer, C.-K. Peng, and H.E. Stanley. Statistical properties of the volatility of price fluctuations. Physical Review E, 60:1390–1399, 1999.

[40] K. Hu, P.Ch. Ivanov, Z. Chen, P. Carpena, and H.E. Stanley. Effect of trends on detrended fluctuation analysis. Physical Review E, 64:011114, 2001.

[41] Z. Chen, P.Ch. Ivanov, K. Hu, and H.E. Stanley. Effect of nonstationarities on detrended fluctuation analysis. Physical Review E, 65:041107, 2002.

[42] N. Vandewalle and M. Ausloos. Coherent and random sequences in financial fluctuations. Physica A, 246:454–459, 1997.

[43] M. Ausloos, N. Vandewalle, Ph. Boveroux, A. Minguet, and K. Ivanova. Applications of statistical physics to economic and financial topics. Physica A, 274:229–240, 1999.

[44] M. Ausloos. Statistical physics in foreign exchange currency and stock markets. Physica A, 285:48–65, 2000.

[45] K. Ivanova and M. Ausloos. Are EUR and GBP different words for the same currency? The European Physical Journal B, 27:239–247, 2002.

[46] M. Raberto, E. Scalas, G. Cuniberti, and M. Riani. Volatility in the Italian stock market: an empirical study. Physica A, 269:148–155, 1999.

[47] I.M. Jánosi, B. Janecskó, and I. Kondor. Statistical analysis of 5 s index data of the Budapest stock exchange. Physica A, 269:111–124, 1999.

[48] R.L. Costa and G.L. Vasconcelos. Long-range correlations and nonstationarity in the Brazilian stock market. Physica A, 329:231–248, 2003.

[49] J.W. Kantelhardt, S.A. Zschiegner, E. Koscielny-Bunde, S. Havlin, A. Bunde, and H.E. Stanley. Multifractal detrended fluctuation analysis of nonstationary time series. Physica A, 316:87–141, 2002.

[50] P. Oświęcimka, J. Kwapień, and S. Drożdż. Multifractality in the stock market: price increments versus waiting times. Physica A, 347:626–638, 2005.

[51] A. G. Ellinger. The Art of Investment. Bowers & Bowers, London, 1971.

[52] E. Alessio, A. Carbone, G. Castelli, and V. Frappietro. Second-order moving average and scaling of stochastic time series. The European Physical Journal B, 27:197–200, 2002.

[53] A. Carbone, G. Castelli, and H. E. Stanley. Analysis of clusters formed by the moving average of a long-range correlated time series. Physical Review E, 69: 026105–026109, 2004.

[54] N. Vandewalle, M. Ausloos, and Ph. Boveroux. The moving averages demystified. Physica A, 269:170–176, 1999.

[55] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison- Wesley, Cambridge MA, 1949.

[56] A. R. Mehrabi, H. Rassamdana, and M. Sahimi. Characterization of long-range correlations in complex distributions and profiles. Physical Review E, 56:712–722, 1997.

[57] I. Simonsen, A. Hansen, and O. Nes. Determination of the Hurst exponent by use of wavelet transforms. Physical Review E, 58:2779–2787, 1998.

[58] D. B. Percival and A. T. Walden. Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge, 2000.

[59] R. Gençay, F. Selçuk, and B. Whitcher. Scaling properties of foreign exchange volatility. Physica A, 289:249–266, 2001.

[60] F. B. Sowell. Maximum likelihood estimation of stationary univariate fractionally integrated time series models. Journal of Econometrics, 53:165–188, 1992.

[61] Craig Ellis. Estimation of the ARFIMA(p,d,q) fractional differencing parameter (d) using the classical rescaled adjusted range technique. International Review of Financial Analysis, 8(1):53–65, 1999.

[62] P. Grau-Carles. Empirical evidence of long-range correlations in stock returns. Physica A, 287:396–404, 2000.

[63] P. C. B. Phillips. Discrete Fourier transforms of fractional processes. Cowles Foundation Discussion Paper #1243, Yale University, 1999a.

[64] P. C. B. Phillips. Unit root log periodogram regression. Cowles Foundation Discussion Paper #1244, Yale University, 1999b.

[65] P. C. B. Phillips and K. Shimotsu. Local Whittle estimation in nonstationary and unit root cases. Cowles Foundation Discussion Paper #1266, Yale University, 2001.

[66] J. Beran. Statistics for Long-Memory Processes. Chapman & Hall, London, U. K., 1994.

[67] N. Vandewalle and M. Ausloos. Sparseness and roughness of foreign exchange rates. International Journal of Modern Physics C, 9:711–720, 1998.

[68] N. Vandewalle and M. Ausloos. Crossing of two mobile averages: A method for measuring the roughness exponent. Physical Review E, 58:6832–6834, 1998.

[69] K. Ivanova and M. Ausloos. Low q-moment multifractal analysis of gold price, Dow Jones Industrial Average and BGL-USD. European Physical Journal B, 8:665–669, 1999.

[70] M. F. M. Osborne. Brownian Motion in the Stock Market. Operations Re- search, 7(2):145–173, 1959.

[71] Rosario N. Mantegna and H. Eugene Stanley. An Introduction to Econophysics. Cambridge University Press, 2000.

[72] Michel Dacorogna, Ramazan Gençay, Ulrich A. Müller, Olivier Pictet, and Richard Olsen. An Introduction to High-Frequency Finance. Academic Press, 2001.

[73] Jean-Philippe Bouchaud and Marc Potters. Theory of Financial Risks. Academic Press, 2001.

[74] B. B. Mandelbrot. Gaussian self-affinity and fractals. C. R. Acad. Sci., Paris, 260:3274–3277, 1965.

[75] B. B. Mandelbrot and J. W. van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422–437, 1968.

[76] P. K. Clark. A subordinated stochastic process model with finite variance for speculative prices. Econometrica, 41:135–156, 1973.

[77] Thomas Lux. The socio-economic dynamics of speculative markets: interacting agents, chaos, and the fat tails of return distributions. Journal of Economic Behavior & Organization, 33(2):143–165, 1998. doi: 10.1016/S0167-2681(97)00088-7.

[78] T. Di Matteo. Multi-scaling in Finance. Quantitative Finance, 7(1):21–36, 2007.

[79] Laurent Calvet and Adlai Fisher. Multifractality in asset returns: theory and evidence. Review of Economics and Statistics, 84(3):381–406, 2002.

[80] T Di Matteo. Multi-scaling in finance. Quantitative finance, 7(1):21–36, 2007.

[81] Harry Eugene Stanley. Introduction to Phase Transitions and Critical Phenomena. Oxford University Press, 1971.

[82] Jens Feder. Fractals. Plenum Press, New York and London, 1988.

[83] Jan W Kantelhardt, Stephan A Zschiegner, Eva Koscielny-Bunde, Shlomo Havlin, Armin Bunde, and H Eugene Stanley. Multifractal detrended fluctuation analysis of nonstationary time series. Physica A: Statistical Mechanics and its Applications, 316(1):87–114, 2002.

[84] Tiziana Di Matteo, Tomaso Aste, and Michel M Dacorogna. Long-term memories of developed and emerging markets: Using the scaling analysis to characterize their stage of development. Journal of Banking & Finance, 29(4):827–851, 2005.

[85] Jozef Barunik and Ladislav Kristoufek. On Hurst exponent estimation under heavy-tailed distributions. Physica A: Statistical Mechanics and its Applications, 389(18):3844–3855, 2010.

[86] T. Di Matteo, T. Aste, and M.M. Dacorogna. Scaling behaviour in differently developed markets. Physica A, 324:183–188, 2003.

[87] B. B. Mandelbrot. Fractals and Scaling in Finance. Springer Verlag, New York, 1997.

[88] A.-L. Barabási and T. Vicsek. Multifractality of self-affine fractals. Physical Review A, 44:2730–2733, 1991.

[89] P. A. Groenendijk, A. Lucas, and C. G. de Vries. A hybrid joint moment ratio test for financial time series. Preprint of the Erasmus University, 1998. URL http://www.few.eur.nl/few/people/cdevries/

[90] B. J. West. The Lure of Modern Science: Fractal thinking. World Scientific, 1985.

[91] J. Feder. Fractals. Plenum Press, New York, 1988.

[92] P. Flandrin. On the spectrum of fractional Brownian motions. IEEE Transactions on Information Theory, 35:197–199, 1989.

[93] Mark EJ Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.

[94] Guido Caldarelli. Scale-free networks: complex webs in nature and technology. OUP Catalogue, 2007.

[95] Leonhard Euler. Solutio problematis ad geometriam situs pertinentis. Commentarii academiae scientiarum Petropolitanae, 8:128–140, 1741.

[96] Paul Erdős and Alfréd Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.

[97] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17–61, 1960.

[98] Mark Newman. Networks: an introduction. Oxford University Press, 2010.

[99] Guido Caldarelli and Alessandro Vespignani, editors. Large scale structure and dynamics of complex networks: from information technology to finance and natural science, volume 2. World Scientific, 2007.

[100] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.

[101] Guido Caldarelli and Michele Catanzaro. Networks: a very short introduction, volume 335. Oxford University Press, 2012.

[102] Matthieu Latapy and Clémence Magnien. Complex network measurements: Estimating the relevance of observed properties. In INFOCOM 2008. The 27th Conference on Computer Communications. IEEE. IEEE, 2008.

[103] Steven S Skiena. The algorithm design manual, 2008.

[104] Mark Newman, Albert-László Barabási, and Duncan J Watts. The structure and dynamics of networks. Princeton University Press, 2006.

[105] Béla Bollobás and Oliver Riordan. The diameter of a scale-free random graph. Combinatorica, 24(1):5–34, 2004.

[106] Robert W Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.

[107] Edsger W Dijkstra. The humble programmer. Communications of the ACM, 15(10):859–866, 1972.

[108] Karl Pearson. The problem of the random walk. Nature, 72(1865):294, 1905.

[109] Takis Konstantopoulos. Markov chains and random walks. Lecture notes, 2009.

[110] John G. Kemeny and James Laurie Snell. Finite Markov chains. Van Nostrand, 1967.

[111] Jae Dong Noh and Heiko Rieger. Random walks on complex networks. Physical review letters, 92(11):118701, 2004.

[112] László Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős is eighty, 2(1):1–46, 1993.

[113] Linton C Freeman. Centrality in social networks conceptual clarification. Social networks, 1(3):215–239, 1979.

[114] Mark EJ Newman. Analysis of weighted networks. Physical Review E, 70(5): 056131, 2004.

[115] Phillip Bonacich. Some unique properties of eigenvector centrality. Social Networks, 29(4):555–564, 2007.

[116] Stephen P Borgatti. Centrality and network flow. Social networks, 27(1): 55–71, 2005.

[117] Mark EJ Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.

[118] Oliver C Ibe. Fundamentals of stochastic networks. John Wiley & Sons, 2011.

[119] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.

[120] Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. Random graphs with arbitrary degree distributions and their applications. Physical review E, 64(2):026118, 2001.

[121] Réka Albert, István Albert, and Gary L Nakarado. Structural vulnerability of the North American power grid. Physical review E, 69(2):025103, 2004.

[122] Luís A Nunes Amaral, Antonio Scala, Marc Barthelemy, and H Eugene Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences, 97(21):11149–11152, 2000.

[123] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[124] Guido Caldarelli. Scale-free networks: complex webs in nature and technology. OUP Catalogue, 2007.

[125] T Aste and T Di Matteo. Introduction to complex and econophysics systems: A navigation map. Lecture Notes in Complex Systems, 9:1–35, 2010.

[126] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.

[127] Michele Tumminello, Tomaso Aste, Tiziana Di Matteo, and Rosario N Mantegna. A tool for filtering information in complex systems. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10421–10426, 2005.

[128] T Aste, T Di Matteo, and ST Hyde. Complex networks on hyperbolic surfaces. Physica A: Statistical Mechanics and its Applications, 346(1):20–26, 2005.

[129] T Di Matteo, T Aste, ST Hyde, and S Ramsden. Interest rates hierarchical structure. Physica A: Statistical Mechanics and its Applications, 355(1):21–33, 2005.

[130] Michele Tumminello, T Di Matteo, T Aste, and Rosario Nunzio Mantegna. Correlation based networks of equity returns sampled at different time horizons. The European Physical Journal B-Condensed Matter and Complex Systems, 55(2):209–217, 2007.

[131] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

[132] Tomaso Aste, Ruggero Gramatica, and T Di Matteo. Exploring complex networks via topological embedding on surfaces. Physical Review E, 86(3):036109, 2012.

[133] Douglas Brent West et al. Introduction to graph theory, volume 2. Prentice hall Upper Saddle River, 2001.

[134] Jon M Kleinberg. Navigation in a small world. Nature, 406(6798):845–845, 2000.

[135] Jon Kleinberg. Complex networks and decentralized search algorithms. In Proceedings of the International Congress of Mathematicians (ICM), volume 3, pages 1019–1044, 2006.

[136] Tomaso Aste, Ruggero Gramatica, and Tiziana Di Matteo. Random and frozen states in complex triangulations. Philosophical Magazine, 92(1-3):246–254, 2012.

[137] Fragkiskos Papadopoulos, Dmitri Krioukov, M Boguñá, and Amin Vahdat. Greedy forwarding in dynamic scale-free networks embedded in hyperbolic metric spaces. In INFOCOM, 2010 Proceedings IEEE, pages 1–9. IEEE, 2010.

[138] James W Anderson. Hyperbolic Geometry. Springer, 2005.

[139] J.R. Firth. Papers in Linguistics 1934-1951. Oxford University Press, 1957.

[140] John Rupert Firth. A synopsis of linguistic theory. Studies in Linguistic Analysis, pages 1–32, 1957.

[141] Roger C Schank. Conceptual dependency: A theory of natural language un- derstanding. Cognitive psychology, 3(4):552–631, 1972.

[142] Zellig Sabbettai Harris. Distributional structure. Springer, 1970.

[143] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008.

[144] George A Miller and Walter G Charles. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1):1–28, 1991.

[145] Jike Ge and Yuhui Qiu. Concept similarity matching based on semantic distance. In Semantics, Knowledge and Grid, 2008. SKG’08. Fourth International Conference on, pages 380–383. IEEE, 2008.

[146] Renaud Lambiotte and Pietro Panzarasa. Communities, knowledge creation, and information diffusion. Journal of Informetrics, 3(3):180–190, 2009.

[147] Mark Steyvers and Joshua B. Tenenbaum. Large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29:41–78, 2005. doi: 10.1207/s15516709cog2901_3.

[148] Allan M. Collins and M. R. Quillian. Retrieval time from semantic memory. Journal of verbal learning and verbal behavior, 8(2):240–247, 1969.

[149] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393:440–442, June 1998.

[150] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, October 1999. doi: 10.1126/science.286.5439.509.

[151] Steven H. Strogatz. Exploring complex networks. Nature, 410:268–276, March 2001.

[152] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, July 2000. doi: 10.1038/35019019.

[153] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008.

[154] Magnus Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Stockholm University, Department of Linguistics, Computational Linguistics, Stockholm, 2006.

[155] Jan Nuyts and Eric Pederson. Language and conceptualization, volume 1. Cambridge University Press, 2000.

[156] Joël Plisson, Nada Lavrač, D. Mladenić, et al. A rule based approach to word lemmatization. In Proceedings of the 7th International Multi-Conference Information Society, 2004.

[157] Wikipedia. Lemmatisation, 2014. URL http://en.wikipedia.org/wiki/Lemmatisation. [Online; accessed 14-December-2014].

[158] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.

[159] Yaacov Choueka and Serge Lusignan. Disambiguation by short contexts. Computers and the Humanities, 19(3):147–157, 1985.

[160] Abraham Kaplan. An experimental study of ambiguity and context. Rand Corporation, 1950.

[161] Jussi Karlgren and Magnus Sahlgren. From words to understanding. Foundations of Real-World Intelligence, pages 294–308, 2001.

[162] Lance J Rips, Edward J Shoben, and Edward E Smith. Semantic distance and the verification of semantic relations. Journal of verbal learning and verbal behavior, 12(1):1–20, 1973.

[163] Gerald Gazdar. Phrase structure grammar. In The nature of syntactic representation, pages 131–186. Springer, 1982.

[164] Gerald Gazdar. Generalized phrase structure grammar. Harvard University Press, 1985.

[165] Saif M Mohammad and Graeme Hirst. Distributional measures of semantic distance: A survey. arXiv preprint arXiv:1203.1858, 2012.

[166] Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics, pages 65–72, 2001.

[167] Eugene Charniak. Statistical techniques for natural language parsing. AI magazine, 18(4):33, 1997.

[168] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, third edition, 2009.

[169] Eugene Charniak. Statistical parsing with a context-free grammar and word statistics. In AAAI/IAAI, pages 598–603, 1997.

[170] Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637, December 2003.

[171] Michael Collins. Head-driven statistical models for natural language parsing. Computational linguistics, 29(4):589–637, 2003.

[172] Frederick Jelinek, John D Lafferty, and Robert L Mercer. Basic methods of probabilistic context free grammars. Springer, 1992.

[173] Norbert Bröker, Udo Hahn, and Susanne Schacht. Concurrent lexicalized dependency parsing: the parsetalk model. In Proceedings of the 15th conference on Computational linguistics-Volume 1, pages 379–385. Association for Computational Linguistics, 1994.

[174] Yves Schabes and Richard C Waters. Stochastic lexicalized context-free grammar. In Proceedings of the Third International Workshop on Parsing Technologies. Citeseer, 1993.

[175] W Caid and Joel L Carleton. Context vector-based text retrieval. Fair Isaac Corporation, pages 1–20, 2003.

[176] Vijay V Raghavan and SK Michael Wong. A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5):279–287, 1986.

[177] John Clifford Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67:81–97, 1985.

[178] Grigori Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, and David Pinto. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18(3):491–504, 2014.

[179] Kurt Mehlhorn and Mark Ziegelmann. Resource constrained shortest paths. In Algorithms-ESA 2000, pages 326–337. Springer, 2000.

[180] Jochen Fromm. The emergence of complexity. Kassel university press Kassel, 2004.

[181] Christopher D Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, 1999.

[182] Michael Uschold and Martin King. Towards a methodology for building ontologies. Citeseer, 1995.

[183] Don R Swanson. Intervening in the life cycles of scientific knowledge. Library Trends, 41(4):606–31, 1993.

[184] Don R Swanson and Neil R Smalheiser. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial intelligence, 91(2):183–203, 1997.

[185] Don R Swanson. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1):7–18, 1986.

[186] Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284, 1998.

[187] Marcelo A Montemurro. Beyond the Zipf–Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300(3):567–578, 2001.

[188] Mark Newman, Albert-László Barabási, and Duncan J Watts. The structure and dynamics of networks. Princeton University Press, 2006.

[189] Ramon Ferrer i Cancho, Ricard V Solé, and Reinhard Köhler. Patterns in syntactic dependency networks. Physical Review E, 69(5):051915, 2004.

[190] Dan Jurafsky and James H Martin. Speech & language processing. Pearson Education India, 2000.

[191] T. Aste and T. Di Matteo. Introduction to complex and econophysics systems: A navigation map. In World Scientific Lecture Notes in Complex Systems, volume 9 of Complex Physical, Biophysical and Econophysical Systems, pages 1–35, 2010.

[192] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

[193] G. Caldarelli. Scale-Free Networks: Complex Webs in Nature and Technology. Oxford University Press, 2007.

[194] L. Amaral and J. Ottino. Complex networks. Eur. Phys. J. B, 38, 2004.

[195] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3), 1999.

[196] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 1967.

[197] Rui Xu. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005. doi: 10.1109/TNN.2005.845141.

[198] L. Lovász. Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty, 2:1–46, 1993.

[199] László Lovász. Random walks on graphs: A survey. Combinatorics, 2(1):1–46, 1993.

[200] Dengyong Zhou and Bernhard Schölkopf. Learning from labeled and unlabeled data using random walks. In Pattern Recognition, pages 237–244. Springer, 2004.

[201] H. Zhou. Network landscape from a Brownian particle’s perspective. Phys. Rev. E, 67, 2003.

[202] Aykut Firat, Sangit Chatterjee, and Mustafa Yilmaz. Genetic clustering of social networks using random walks. Computational Statistics & Data Analysis, 51(12):6285–6294, August 2007.

[203] Robert Clay Prim. Shortest connection networks and some generalizations. Bell system technical journal, 36(6):1389–1401, 1957.

[204] Robert Clay Prim. Shortest connection networks and some generalizations. Bell system technical journal, 36(6):1389–1401, 1957.

[205] Joseph B Kruskal. On the shortest spanning subtree of a graph and the trav- eling salesman problem. Proceedings of the American Mathematical society, 7 (1):48–50, 1956.

[206] B. Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. Journal of the ACM (JACM), 47(6):1028–1047, 2000.

[207] T. Aste, T. Di Matteo, and S.T. Hyde. Complex networks on hyperbolic surfaces. Physica A, 346:20–26, 2005.

[208] M. Tumminello, T. Aste, T. Di Matteo, and R.N. Mantegna. A tool for filtering information in complex systems. In Proc. Natl. Acad. Sci., volume 102, pages 10421–10426, 2005.

[209] Padmini Srinivasan and Bisharah Libbus. Mining medline for implicit links between dietary substances and diseases. Bioinformatics, 20(suppl 1):i290–i296, 2004.

[210] Dimitar Hristovski, Borut Peterlin, Joyce A Mitchell, and Susanne M Humphrey. Using literature-based discovery to identify disease candidate genes. International journal of medical informatics, 74(2):289–298, 2005.

[211] Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B Cohen. Frontiers of biomedical text mining: current progress. Briefings in bioinformatics, 8(5):358–375, 2007.

[212] Andrey Rzhetsky, Michael Seringhaus, and Mark Gerstein. Seeking a new biology through text mining. Cell, 134(1):9–13, 2008.

[213] Ramón A. A. Erhardt, Reinhard Schneider, and Christian Blaschke. Status of text-mining techniques applied to biomedical text. Drug discovery today, 11(7):315–325, 2006.

[214] Christos Andronis, Anuj Sharma, Vassilis Virvilis, Spyros Deftereos, and Aris Persidis. Literature mining, ontologies and information visualization for drug repurposing. Briefings in bioinformatics, 12(4):357–368, 2011.

[215] David J Wild, Ying Ding, Amit P Sheth, Lee Harland, Eric M Gifford, and Michael S Lajiness. Systems chemical biology and the semantic web: what they mean for the future of drug discovery research. Drug discovery today, 17 (9):469–474, 2012.

[216] Lars Juhl Jensen, Jasmin Saric, and Peer Bork. Literature mining for the biologist: from information retrieval to biological discovery. Nature reviews genetics, 7(2):119–129, 2006.

[217] Dietrich Rebholz-Schuhmann, Anika Oellrich, and Robert Hoehndorf. Text-mining solutions for biomedical research: enabling integrative biology. Nature Reviews Genetics, 13(12):829–839, 2012.

[218] Arthur Lesk. Introduction to bioinformatics. Oxford University Press, 2013.

[219] K Bretonnel Cohen and Lawrence Hunter. Getting started in text mining. PLoS computational biology, 4(1):e20, 2008.

[220] Sašo Džeroski, Pat Langley, and Ljupčo Todorovski. Computational discovery of scientific knowledge. Springer, 2007.

[221] Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine. Citeseer, 2006.

[222] Eftychia Lekka, Spyros N Deftereos, Aris Persidis, Andreas Persidis, and Christos Andronis. Literature analysis for systematic drug repurposing: a case study from biovista. Drug Discovery Today: Therapeutic Strategies, 8(3):103–108, 2012.

[223] Christian Blaschke, Miguel A Andrade, Christos A Ouzounis, and Alfonso Valencia. Automatic extraction of biological information from scientific text: protein-protein interactions. In Ismb, volume 7, pages 60–67, 1999.

[224] Jasmin Saric, Lars J Jensen, and Isabel Rojas. Large-scale extraction of gene regulation for model organisms in an ontological context. In silico biology, 5 (1):21–32, 2005.

[225] Daniel Jurafsky and James H Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech. MIT Press, 2000.

[226] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):10, 2009.

[227] Wei Jin, Rohini K Srihari, and Xin Wu. Mining concept associations for knowledge discovery through concept chain queries. In Advances in Knowledge Discovery and Data Mining, pages 555–562. Springer, 2007.

[228] Yongjin Li and Jagdish C Patra. Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26(9):1219–1224, 2010.

[229] Xing Chen, Ming-Xi Liu, and Gui-Ying Yan. Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems, 8(7):1970–1978, 2012.

[230] Joaquín Goñi, Andrea Avena-Koenigsberger, Nieves Velez de Mendizabal, Martijn P van den Heuvel, Richard F Betzel, and Olaf Sporns. Exploring the morphospace of communication efficiency in complex networks. PLoS One, 8(3):e58070, 2013.

[231] J Müller-Quernheim. Sarcoidosis: immunopathogenetic concepts and their clinical application. European Respiratory Journal, 12(3):716–738, 1998.

[232] Elena Gonzalez-Rey, Alejo Chorny, and Mario Delgado. Regulation of immune tolerance by anti-inflammatory neuropeptides. Nature Reviews Immunology, 7(1):52–63, 2007.

[233] Laurent Beuret, Enrica Flori, Christophe Denoyelle, Karine Bille, Mauro Picardo, Corine Bertolotto, Robert Ballotti, et al. Up-regulation of Met expression by α-melanocyte-stimulating hormone and Mitf allows hepatocyte growth factor to protect melanocytes and melanoma cells from apoptosis. Journal of Biological Chemistry, 282(19):14140–14147, 2007.

[234] Bing Luo, Yun Wang, Xiao-Feng Wang, Yu Gao, Bao-Hua Huang, and Peng Zhao. Correlation of Epstein-Barr virus and its encoded proteins with Helicobacter pylori and expression of c-Met and c-Myc in gastric carcinoma. World journal of gastroenterology: WJG, 12(12):1842–1848, 2006.

[235] Yoshinobu Eishi, Moritaka Suga, Ikuo Ishige, Daisuke Kobayashi, Tetsuo Yamada, Tamiko Takemura, Touichiro Takizawa, Morio Koike, Shoji Kudoh, Ulrich Costabel, et al. Quantitative analysis of mycobacterial and propionibacterial DNA in lymph nodes of Japanese and European patients with sarcoidosis. Journal of clinical microbiology, 40(1):198–204, 2002.

[236] Tomohiro Nakayama. The genetic contribution of the natriuretic peptide system to cardiovascular diseases. Endocrine journal, 52(1):11–21, 2005.

[237] Elisabeth Buchdunger, Jürg Zimmermann, Helmut Mett, Thomas Meyer, Marcel Müller, Brian J Druker, and Nicholas B Lydon. Inhibition of the Abl protein-tyrosine kinase in vitro and in vivo by a 2-phenylaminopyrimidine derivative. Cancer research, 56(1):100–104, 1996.

[238] Thomas Schindler, William Bornmann, Patricia Pellicena, W Todd Miller, Bayard Clarkson, and John Kuriyan. Structural mechanism for STI-571 inhibition of Abelson tyrosine kinase. Science, 289(5486):1938–1942, 2000.

[239] Yasser Marandi, Neda Farahi, Amin Sadeghi, and Goudarz Sadeghi-Hashjin. Review paper: Prion diseases–current theories and potential therapies: a brief review. Folia Neuropathol, 50(1):46–49, 2012.

[240] Sarah D Schlatterer, Christopher M Acker, and Peter Davies. c-Abl in neurodegenerative disease. Journal of Molecular Neuroscience, 45(3):445–452, 2011.

[241] Dorota Jesionek-Kupnicka, Radzisław Kordek, Jarosław Buczyński, and Paweł P Liberski. Apoptosis in relation to neuronal loss in experimental Creutzfeldt-Jakob disease in mice. Acta neurobiologiae experimentalis, 61(1):13–20, 2000.

[242] Gonzalo I Cancino, Enrique M Toledo, Nancy R Leal, Diego E Hernandez, L Fernanda Yévenes, Nibaldo C Inestrosa, and Alejandra R Alvarez. STI571 prevents apoptosis, tau phosphorylation and behavioural impairments induced by Alzheimer’s β-amyloid deposits. Brain, 131(9):2425–2442, 2008.

[243] Alexa Ertmer, Sabine Gilch, Seong-Wook Yun, Eckhard Flechsig, Bert Klebl, Matthias Stein-Gerlach, Michael A Klein, and Hermann M Schätzl. The tyrosine kinase inhibitor STI571 induces cellular clearance of PrPSc in prion-infected cells. Journal of Biological Chemistry, 279(40):41918–41927, 2004.

[244] Antje Prasse, Gernot Zissel, Niklas Lützen, Jonas Schupp, Rene Schmiedlin, Elena Gonzalez-Rey, Anne Rensing-Ehl, Gerald Bacher, Vera Cavalli, Dorian Bevec, et al. Inhaled vasoactive intestinal peptide exerts immunoregulatory effects in sarcoidosis. American journal of respiratory and critical care medicine, 182(4):540–548, 2010.

[245] C.M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.

[246] Eugene F Fama. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417, 1970.

[247] B Wuthrich, V Cho, S Leung, D Permunetilleke, K Sankaran, and J Zhang. Daily stock market forecast from textual web data. In IEEE International Conference on Systems, Man, and Cybernetics, volume 3, pages 2720–2725. IEEE, 1998.

[248] Beat Wüthrich, D Permunetilleke, Steven Leung, W Lam, Vincent Cho, and J Zhang. Daily prediction of major stock indices from textual www data. HKIE Transactions, 5(3):151–156, 1998.

[249] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135, 2008.

[250] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[251] Eric Gilbert and Karrie Karahalios. Widespread worry and the stock market. In ICWSM, pages 59–65, 2010.

[252] Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan. Mining of concurrent text and time series. In KDD-2000 Workshop on Text Mining, pages 37–44. Citeseer, 2000.

[253] M-A Mittermayer. Forecasting intraday stock price trends with text mining techniques. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences, pages 10–pp. IEEE, 2004.

[254] Wesley S Chan. Stock price reaction to news and no-news: drift and reversal after headlines. Journal of Financial Economics, 70(2):223–260, 2003.

[255] Paul C Tetlock. Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3):1139–1168, 2007.

[256] Fabrizio Lillo, Salvatore Miccichè, Michele Tumminello, Jyrki Piilo, and Rosario N Mantegna. How news affects the trading behaviour of different categories of investors in a financial market. Quantitative Finance, (ahead-of-print):1–17, 2014.

[257] Gabriele Ranco, Ilaria Bordino, Giacomo Bormetti, Guido Caldarelli, Fabrizio Lillo, and Michele Treccani. Coupling news sentiment with web browsing data predicts intra-day stock prices. arXiv preprint arXiv:1412.3948, 2014.

[258] Ilaria Bordino, Stefano Battiston, Guido Caldarelli, Matthieu Cristelli, Antti Ukkonen, and Ingmar Weber. Web search queries can predict stock market volumes. PloS one, 7(7):e40014, 2012.

[259] Robert P Schumaker and Hsinchun Chen. Textual analysis of stock market prediction using breaking financial news: The AZFin Text system. ACM Transactions on Information Systems (TOIS), 27(2):12, 2009.

[260] Eduardo J Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. Correlating financial time series with micro-blogging activity. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 513–522. ACM, 2012.

[261] Ruggero Gramatica, Tiziana Di Matteo, Stefano Giorgetti, Massimo Barbiani, Dorian Bevec, and Tomaso Aste. Graph theory enables drug repurposing – how a mathematical model can drive the discovery of hidden mechanisms of action. PLoS ONE, 9(1):e84912, 2014.

[262] David J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2003.

[263] Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.

[264] Robert C Moore. On log-likelihood-ratios and the significance of rare events. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 333–340, 2004.

[265] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic Press, 2012.

[266] Christopher D Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.

[267] George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury, Pacific Grove, CA, 2002.

[268] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.

[269] Tore Opsahl and Pietro Panzarasa. Clustering in weighted networks. Social networks, 31(2):155–163, 2009.

[270] Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice Hall, Englewood Cliffs, NJ, 1988.

[271] Michael R Anderberg. Cluster analysis for applications. Probability and Mathematical Statistics, 1, 1973.

[272] Robin Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.

[273] John A Hartigan. Statistical theory in clustering. Journal of classification, 2(1):63–76, 1985.

[274] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[275] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10(2):191–218, 2006.

[276] Won-Min Song, T Di Matteo, and Tomaso Aste. Hierarchical information clustering by means of topologically embedded graphs. PLoS ONE, 7(3):e31929, 2012.

[277] Nicolo Musmeci, Tomaso Aste, and Tiziana Di Matteo. Relation between financial market structure and the real economy: Comparison between clustering methods. SSRN, in press.

[278] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[279] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[280] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[281] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.

[282] Marina Meilă. Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.

[283] Andrew R Webb. Statistical pattern recognition. John Wiley & Sons, 2003.

[284] Rosario Nunzio Mantegna and Harry Eugene Stanley. An introduction to econophysics: Correlations and complexity in finance. Cambridge University Press, Cambridge, 2000.

[285] Mark EJ Newman. A measure of betweenness centrality based on random walks. Social networks, 27(1):39–54, 2005.

[286] Diego R Amancio, Osvaldo N Oliveira Jr, and Luciano da F Costa. Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts. Physica A: Statistical Mechanics and its Applications, 391(18):4406–4419, 2012.

[287] Alan Mislove, Massimiliano Marcon, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 29–42. ACM, 2007.

[288] William Wu-Shyong Wei. Time series analysis. Addison-Wesley, 2005.

[289] David A Dickey and Wayne A Fuller. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74(366a):427–431, 1979.

[290] James Douglas Hamilton. Time series analysis. Princeton University Press, Princeton, 1994.

[291] Chris Chatfield. The analysis of time series: An introduction. CRC Press, 2013.

[292] Clive WJ Granger. Some recent developments in a concept of causality. Journal of Econometrics, 39(1):199–211, 1988.

[293] Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2007.

[294] Chris Chatfield. The analysis of time series: An introduction. CRC Press, 2013.

[295] Helmut Lütkepohl. New introduction to multiple time series analysis. Springer, 2007.

[296] Menachem Brenner and Dan Galai. New financial instruments for hedging changes in volatility. Financial Analysts Journal, 45(4):61–65, 1989.

[297] Sheng Chen, SA Billings, and PM Grant. Non-linear system identification using neural networks. International Journal of Control, 51(6):1191–1214, 1990.