The NOESIS Network-Oriented Exploration, Simulation, and Induction System

V´ıctor Mart´ınez,∗ Fernando Berzal,† and Juan-Carlos Cubero‡ Department of Computer Science and Artificial Intelligence, University of Granada, Spain

Network data mining has become an important area of study due to the large number of problems it can be applied to. This paper presents NOESIS, an open source framework for network data mining that provides a large collection of network analysis techniques, including the analysis of network structural properties, community detection methods, link scoring, and link prediction, as well as network visualization algorithms. It also features a complete stand–alone graphical user interface that facilitates the use of all these techniques. The NOESIS framework has been designed using solid object–oriented design principles and structured parallel programming. As a lightweight library with minimal external dependencies and a permissive software license, NOESIS can be incorporated into other software projects. Released under a BSD license, it is available from http://noesis.ikor.org.

Contents B. Structural properties of networks 4 C. Network visualization techniques 5 I. Introduction 1 IV. Network data mining techniques 6 II. The design of the NOESIS framework 2 A. Community detection 6 A. System architecture 2 B. Link scoring and prediction 8 B. Core classes 2 C. Supported data formats 2 V. Conclusion 11 D. Graphical user interface 3 Acknowledgments 11 III. Network analysis tools 3 A. Network models 3 References 11 ∗ Electronic address: [email protected] ‡Electronic address: [email protected] †Electronic address: [email protected]

I. INTRODUCTION developers as software libraries that collect network anal- ysis algorithms. Most tools allow the analysis of net- Data mining techniques are intended to extract infor- work topology and the computation of different struc- mation from large volumes of data (Tan et al., 2006). tural properties of networks having thousands or even Data mining includes tasks such as classification, regres- millions of nodes. Some of them also include implemen- sion, clustering, or anomaly detection, among others. tations of specific network data mining techniques, such Traditional data mining techniques are typically applied as community detection algorithms or predictive mod- to tabulated data. Novel techniques have also been de- els, including link prediction (L¨uand Zhou, 2011) and vised for semi-structured or structured data, since ex- epidemic models (Keeling and Eames, 2005). ploiting the relationships among instances from a dataset For instance, Pajek (Batagelj and Mrvar, 1998), leads to new research and development opportunities NodeXL (Smith et al., 2009), Gephi (Bastian et al., (Getoor and Diehl, 2005). 2009), and UCINET (Borgatti et al., 2002) are widely arXiv:1611.04810v2 [cs.SI] 23 Jun 2017 For example, network data mining has been used used for analysis (SNA). Graphviz (Ell- to predict previously unknown protein interactions in son et al., 2002) and Cytoscape (Shannon et al., 2003) protein-protein interaction networks (Mart´ınez et al., are two well–known alternatives for network visualiza- 2014). It has also been used to study and predict future tion. Finally, igraph (Csardi and Nepusz, 2006) and author collaborations and tendencies in co-authorship NetworkX (Schult and Swart, 2008) are two popular soft- networks (Pavlov and Ichise, 2007). Different network ware libraries of network algorithms. A more comprehen- mining techniques are used by popular internet search sive and up–to–date list of available software tools can engines to rank the most relevant websites (Page et al., be found at Wikipedia: https://en.wikipedia.org/ 1999). These are only some examples of the large number wiki/Social_network_analysis_software. of applications of network data mining. NOESIS, whose name stands for Network–Oriented There are many software tools that facilitate the anal- Exploration, Simulation, and Induction System, is a soft- ysis of networked data. Some tools provide closed so- ware framework for network analysis and mining. It tries lutions for end users who need to work with their own to combine the best features of closed social network anal- network data sets, whereas other tools cater to software ysis tools and extensible algorithm libraries, while provid- 2 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

write result = (double) Parallel.reduce( index ->x[index] * y[index], ADD, 0, SIZE-1) to compute the dot product of two vectors in parallel. The HAL does not only imple- ment structured parallel programming design patterns, DD but it is also responsible for task scheduling and parallel D execution. It allows the adjustment of parallel execution parameters, including the task scheduling algorithm. The reflective kernel is at the core of NOESIS and provides its main features. The reflective kernel pro- D vides the base models (data structures) and tasks (al- gorithms) needed to perform network data mining, as well as the corresponding meta-objects and meta-models, DD D which can be manipulated at run time. It is the under- lying layer that supports a large collection of network DDDDD analysis algorithms and data mining techniques, which are described in the following section. Different types of networks are dealt with using an unified interface, al- FIG. 1 The NOESIS framework architecture and its core sub- systems. lowing us to choose the particular implementation that is the most adequate for the spatial and computational requirements of each application. Algorithms provided ing support for parallel execution, something that most by this subsystem are built on top of the HAL building listed tools lack. NOESIS is completely written in Java blocks, allowing the parallelized execution of algorithms and its source code has been released under a permissive whenever possible. BSD open source license. The data access layer (DAL) provides an unified in- Our paper is structured as follows. In Section 2, the terface to access external data sources. This subsystem NOESIS architectural design principles are briefly de- allows reading and writing networks in different file for- scribed. Section 3 covers the network analysis techniques mats, providing implementations for some of the most included in NOESIS. Network data mining techniques are important standardized network file formats. This mod- surveyed in Section 4. Finally, Section 5 describes the ule also enables the development of data access compo- NOESIS project current status and roadmap. nents for other kinds of data sources, such as network streaming. Finally, an application generator is used to build II. THE DESIGN OF THE NOESIS FRAMEWORK a complete graphical user interface following a model driven software development (MDSD) approach. This NOESIS has been designed to be an easily–extensible provides a friendly user interface that al- framework whose architecture provides the basis for the lows users without programming skills to use most of the implementation of network data mining techniques. In NOESIS framework features. order to achieve this, NOESIS is designed around ab- stract interfaces and a set of core classes that provide essential functions, which allows the implementation of B. Core classes different features in independent components with strong cohesion and loose coupling. NOESIS components are The core classes and interfaces shown in Figure 2 pro- designed to be maintainable and reusable. vide the foundation for the implementation of different types of networks with specific spatial and computa- tional requirements. Basic network operations include A. System architecture adding and removing nodes, adding and removing links, or querying a node neighborhood. More complex opera- The NOESIS framework architecture and its core sub- tions are provided through specialized components. systems are displayed in Figure 1. These subsystems are NOESIS supports networks with attributes both in described below. their nodes and their links. These attributes are defined The lowest-level component is the hardware abstrac- according to predefined data models, including categori- tion layer (HAL), which provides support for the execu- cal and numerical values, among others. tion of algorithms in a parallel environment and hides im- plementation details and much of the underlying techni- cal complexity. This component provides different build- C. Supported data formats ing blocks for implementing well-studied parallel pro- gramming design patterns, such as MapReduce (Dean Different file formats have been proposed for network and Ghemawat, 2008). For example, we would just datasets. Some data formats are more space efficient, II The design of the NOESIS framework 3

III. NETWORK ANALYSIS TOOLS

NOESIS is designed to ease the implementation of net-

work analysis tools. It also includes reusable implemen- tations of a large collection of popular network–related techniques, from graph visualization (Tamassia, 2013) and common graph algorithms, to network structural FIG. 2 Some of the NOESIS core classes and interfaces rep- properties (Newman, 2010) and network formation mod- resented as an UML class diagram. els (Jackson, 2008). The network analysis tools included in NOESIS and the modules that implement them are introduced in this section. whereas others are more easily parseable.

NOESIS supports reading and writing network data A. Network models sets using the most common data formats. For example, the GDF file format is a CSV-like format used by some NOESIS implements a number of popular random net- software tools such as GUESS and Gephi. It supports work generation models, which are described by probabil- attributes in both nodes and links. Another supported ity distributions or random processes. Such models have file format is GML, which stands for Graph Modeling been found to be useful in the study and understanding Language. GML is a hierarchical ASCII-based file for- of certain properties or behaviors observed in real-world mat. GraphML is another hierarchical file format based networks. on XML, the ubiquitous eXtensible Markup Language Among the models included in NOESIS, the Erd¨os- developed by the W3C. R´enyi model (Erd˝osand R´enyi, 1959) is one of the sim- Other file formats are supported by NOESIS, such as plest ones. The Gilbert model (Gilbert, 1959) is similar the Pajek file format, which is similar to GDF, or the but a probability of existence is given for links instead. file format of the datasets from the Stanford Network The anchored network model is also similar to the two Analysis Platform (SNAP) (Leskovec and Krevl, 2014). previous models, with the advantage of reducing the oc- currence of isolated nodes, but at the cost of being less than perfectly random. Finally, the connected random model is a variation of the anchored model that avoids isolated nodes. D. Graphical user interface Other models included in NOESIS exhibit specific properties often found in real-world networks. For ex- ample, the Watts–Strogatz model (Watts and Strogatz, In order to allow users without programming knowl- 1998) generates networks with small-world properties, edge to use most of the NOESIS features, a lightweight that is, low diameter and high clustering. This model easy–to–use graphical user interface is included with the starts by creating a ring lattice with a given number of standard NOESIS framework distribution. The NOESIS nodes and a given mean , where each node is con- GUI allows non–technical end users loading, visualizing, nected to its nearest neighbors on both sides. In the and analyzing their own network data sets by applying following steps, each link is rewired to a new target node all the techniques provided with NOESIS. with a given probability, avoiding self-loops and link du- Some screenshots of this GUI are shown in Figure 3. plication. A canvas is used to display the network in every mo- Despite the small-world properties exhibited by net- ment. The network can be manipulated by clicking or works generated by the Watts–Strogatz model are closer dragging nodes. At the top of the window, a menu gives to real world networks than those generated by models access to different options and data mining algorithms. based on the Erd¨os-R´enyi approach, they still lack some The Network menu allows loading a network from an important properties observed in real networks. The external source and exporting the results using different Barab´asi–Albert model (Albert and Barab´asi,2002) is file formats, as well as creating images of the current another well-known model that generates networks whose network visualization both as raster and vector graphics node follows a power law, which leads image. The View menu allows the customization of the to scale-free networks. This model is driven by a pref- network appearance by setting specific layout algorithms erential attachment process, where new nodes are added and custom visualization styles. In addition, this menu and connected to existing nodes with a probability pro- allows binding the visual properties of nodes and links to portional to their current degree. Another model with their attributes. The Data menu allows the exploration very similar properties to Barab´asi–Albert’s model is the of attributes for each node and link. Finally, the Analysis Price’s citation model (Newman, 2003). menu gives access to most of the techniques that will be In addition to random network models, a number of described in the following sections. regular network models are included in NOESIS. These 4 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 3 Different screenshots of the NOESIS graphical user interface.

FIG. 4 Random networks generated using the Erd¨os-R´enyi model (left), the Watts–Strogatz model (center), and the Barab´asi– Albert model (right). models generate synthetic networks that are useful in the B. Structural properties of networks process of testing new algorithms. The networks regular models include complete networks, where all nodes are Network structural properties allow the quantification interconnected; star networks, where all nodes are con- of features or behaviors present in the network. They nected to a single hub node; ring networks, where each can be used, for instance, to measure network robustness node is connected to its closest two neighbors along a or reveal important nodes and links. NOESIS consid- ring; tandem networks, like ring model but without clos- ers three types of structural properties: node properties, ing the ; mesh network, where nodes are arranged in node pair properties (for pairs both with and without rows and columns, and connected only to their adjacent links among them), and global properties. nodes; toruses, meshes where nodes in the extremes of NOESIS provides a large number of techniques for an- the mesh are connected; hypercubes; binary trees; and alyzing network structural properties. Many structural isolates, a network without links. properties can be computed for nodes. For example, in- degree and out-degree, indicate the number of incoming and outgoing links, respectively. Related to node degree, two techniques to measure node degree assortativity have III Network analysis tools 5 been included: biased (Piraveenan et al., 2008) and unbi- 1953). Finally, diffusion (Kang et al., 2012) ased (Piraveenan et al., 2010) node degree assortativity. is another influence algorithm based on . Node assortativity is a score between −1 and 1 that mea- The main difference is that, while Katz considers infinite sures the degree correlation between pairs of connected length paths, diffusion centrality considers only paths of nodes. The clustering coefficient can also be computed a given limited length. for nodes. The clustering coefficient of a node is the frac- In the following example, we show how to load a net- tion of its neighbors that are also connected among them. work from a data file and compute its structural proper- Reachability scores are centrality measures that allow ties using NOESIS, its PageRank scores in particular: the analysis of how easy it is to reach a node from other nodes. The eccentricity of a node is defined as the max- FileReader fileReader = imum to any other node (Hage and Harary, new FileReader("karate.gml"); 1995). The closeness, however, is the inverse of the sum NetworkReader reader = of the distance from a given node to all others (Bavelas, new GMLNetworkReader(fileReader); 1950). An adjusted closeness value that normalizes the Network network = reader.read(); closeness according to the number of reachable nodes can PageRank task = new PageRank(network); also be used. Inversely to closeness, average path length NodeScore score = task.call(); is defined as the mean distance of all shortest paths to Different structural properties for links can also be any other node. Decay is yet another reachability score, computed by NOESIS. For example, link betweenness, computed as the summation of a delta factor powered by which is the count of shortest paths the link is involved the path length to any other node (Jackson, 2008). It is in, or link rays, which is the number of possible paths interesting to note that with a delta factor close to 0, the between two nodes that cross a given link. Some of these measure becomes the degree of the node, whereas with a properties are used by different network data mining al- delta close to 1, the measure becomes the component size gorithms. of the component the node is located at. A normalized decay score is also available. Betweenness, as reachability, is another way to mea- C. Network visualization techniques sure node centrality. Betweenness, also known as Free- man’s betweenness, is a score computed as the count of Humans are still better than machines at the recogni- shortest paths the node is involved in (Freeman, 1977). tion of certain patterns when analyzing data in a visual Since this score ranges from 2n − 1 to n2 − (n − 1) for way. Network visualization is a complex task since net- n the number of nodes in strongly-connected networks, a works tend to be huge, with thousands nodes and links. normalized variant is typically used. NOESIS enables the visualization of networks by provid- Finally influence algorithms provide a different per- ing the functionality needed to render the network and spective on node centrality. These techniques measure export the resulting visualization using different image the power of each node to affect others. The most popu- file formats. lar influence algorithm is PageRank (Page et al., 1999), NOESIS provides different automatic graph lay- since it is used by the Google search engine. PageR- out techniques, such as the well–known Fruchterman– ank computes a probability distribution based on the Reingold (Fruchterman and Reingold, 1991) and likelihood of reaching a node starting from any other Kamada–Kawai (Kamada and Kawai, 1989) force–based node. The algorithm works by iteratively updating node layout algorithms. Force–based layout algorithms assign probability based on direct neighbors probabilities, which forces among pairs of nodes and solve the system to reach leads to convergence if the network satisfies certain prop- an equilibrium point, which usually leads to an aesthetic erties. A similar algorithm is HITS (Kleinberg, 1999), visualization. which stands for hyperlink-induced topic search. It fol- Hierarchical layouts (Tamassia, 2013), which arrange lows an iterative approach, as PageRank, but computes nodes in layers trying to minimize edge crossing, are also two scores per node: the hub, which is a score related included. Different radial layout algorithms are included to how many nodes a particular node links, and the au- as well (Wills, 1999). These layouts are similar to the hi- thority, which is a score related to how many hubs link erarchical ones, but arrange nodes in concentric circles. a particular node. Both scores are connected by an it- Finally, several regular layouts are included. These lay- erative updating process: authority is updated accord- outs are common for visualizing regular networks, such ing to the hub scores of nodes connected by incoming as meshes or stars. links and hub is updated according to authority scores of NOESIS allows tuning the network visualization look nodes connected by outgoing links. Eigenvector central- and feel. The visual properties of nodes and links can ity is another iterative method closely related to PageR- be customized, including color, size, borders, and so on. ank, where nodes are assigned a centrality score based In addition, visual properties can be bound to static or on the summation of the centrality of their neighbors dynamic properties of the network. For example, node nodes. Katz centrality considers all possible paths, but sizes can be bound to a specific centrality score, allowing penalizes long ones using a given damping factor (Katz, the visual display of quantitative information. 6 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 5 A dolphin social network (Lusseau et al., 2003) represented using different network visualization algorithms: random layout (top left), Kamada–Kawai layout (top right), Fruchterman–Reingold layout (bottom left), and circular layout using average path length (bottom right).

IV. NETWORK DATA MINING TECHNIQUES A. Community detection

Community detection can be defined as the task of finding groups of densely connected nodes. A wide range of community detection algorithms have been proposed, exhibiting different pros and cons. NOESIS features dif- ferent families of community detection techniques and Network data mining techniques exist for both un- implements more than ten popular community detection supervised and supervised settings. NOESIS includes algorithms. The included algorithms, their time com- a wide array of community detection methods (Lanci- plexity, and their bibliographic references are shown in chinetti and Fortunato, 2009) and link prediction tech- Table I. niques (Liben-Nowell and Kleinberg, 2007). These algo- NOESIS provides hierarchical clustering algorithms. rithms are briefly described below. Agglomerative hierarchical clustering treats each node IV Network data mining techniques 7

FIG. 6 Community detection methods applied to Zachary’s karate club network (Zachary, 1977): Fast greedy partitioning (top left), Kernighan-Lin bi-partitioning (top right), average-link hierarchical partitioning (bottom left), and complete-link hierarchical partitioning (bottom right). as a cluster, and then iteratively merges clusters until all munities by attempting to maximize their nodes are in the same cluster (Fortunato, 2010). Different score (Newman and Girvan, 2004). Different greedy strategies for the selection of clusters to merge have been strategies, including fast greedy (Newman, 2004) and implemented, including single-link (Sibson, 1973), which multi-step greedy (Schuetz and Caflisch, 2008), are avail- selects the two clusters with the smallest minimum pair- able. These greedy algorithms merge pairs of clusters wise distance; complete-link (Defays, 1977), which selects that maximize the resulting modularity, until all possi- the two clusters with the smallest maximum pairwise dis- ble merges would reduce the network modularity. tance; and average-link (Liu, 2011), which selects the two Partitional clustering is another common approach. clusters with the smallest average pairwise distance. Partitioning clustering decomposes the network and per- Modularity-based techniques are also available in our forms an iterative relocation of nodes between clusters. framework. Modularity is a score that measures the For example, Kernighan-Lin bi-partitioning (Kernighan strength of particular division into modules of a given and Lin, 1970) starts with an arbitrary partition in two network. Modularity–based techniques search for com- clusters. Then, iteratively exchanges nodes between both 8 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

Type Name Time complexity Reference Single-link (SLINK) O(v2) (Sibson, 1973) Hierarchical Complete-link (CLINK) O(v2 log v) (Defays, 1977) Average-link (UMPGA) O(v2 log v) (Liu, 2011) Fast greedy O(kvd log v) (Newman, 2004) Modularity Multi-step greedy O(kvd log v) (Schuetz and Caflisch, 2008) Kernighan-Lin bi-partitioning O(v2 log v) (Kernighan and Lin, 1970) Partitional K-means O(kvd) (MacQueen et al., 1967) Ratio cut algorithm (EIG1) O(v3) (Hagen and Kahng, 1992) Spectral Jordan and Weiss NG algorithm (KNSC1) O(v3) (Ng et al., 2002) Spectral k-means O(v3) (Shi and Malik, 2000) Overlapping BigClam O(v2) (Yang and Leskovec, 2013)

TABLE I Computational time complexity and bibliographic references for the community detection techniques provided by NOESIS. In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and k is the desired number of clusters. clusters to minimize the number of links between them. new NJWCommunityDetector(network); This approach can be applied multiple times to subdi- Matrix results = task.call(); vide the obtained clusters. K-means community detec- tion (MacQueen et al., 1967) is an application of the tra- ditional k-means clustering algorithm to networks and B. Link scoring and prediction another prominent example of partitioning community detection. Link scoring and link prediction are two closely related Spectral community detection (Fortunato, 2010) is an- tasks. On the one hand, link scoring aims to compute a other family of community detection techniques included value or weight for a link according to a specific criteria. in NOESIS. These techniques use the Laplacian represen- Most link scoring techniques obtain this value by consid- tation of the network, which is a network representation ering the overlap or relationship between the neighbor- computed by subtracting the of the net- hood of the nodes at both ends of the link. On the other work to a diagonal matrix where each diagonal element hand, link prediction computes a value, weight, or prob- is equal to the degree of the corresponding node. Then, ability proportional to the likelihood of the existence of a the eigenvectors of the Laplacian representation of the certain link according to a given model of link formation. network are computed. NOESIS includes the ratio cut algorithm (EIG1) (Hagen and Kahng, 1992), the Jordan The NOESIS framework provides a large collection of and Weiss NG algorithm (KNSC1) (Ng et al., 2002), and methods for link scoring and link prediction, from local spectral k-means (Shi and Malik, 2000). methods, which only consider the direct neighborhood of Finally, the BigClam overlapping community detector nodes, to global methods, which consider the whole net- is also available in NOESIS (Yang and Leskovec, 2013). work topology. As the amount of information considered In this algorithm, each node has a profile, which consists is increased, the computational and spatial complexity of in a score between 0 and 1 for each cluster that is pro- the techniques also increases. The link scoring and pre- portional to the likelihood of the node belonging to that diction methods available in NOESIS are shown in Table cluster. Also, a score between pairs of nodes is defined II. yielding values proportional to their clustering assign- Among local methods, the most basic technique is the ment overlap. The algorithm iteratively optimizes each common neighbors score (Newman, 2001), which is equal node profile to maximize the value between connected to the number of shared neighbors between a pair of nodes and minimize the value among unconnected nodes. nodes. Most techniques are variations of the common In the following example, we show how to load a net- neighbors score. For example, the Adamic–Adar score work from a data file and detect communities with the (Adamic and Adar, 2003) is the sum of one divided by KNSC1 algorithm using NOESIS: the logarithm of the degree of each shared node. The resource–allocation index (Zhou et al., 2009) follows the FileReader fileReader = same expression, but directly considers the degree in- new FileReader("mynetwork.net"); stead of the logarithm of the degree. The adaptive de- NetworkReader reader = gree penalization score (Mart´ınezet al., 2016) also follows new PajekNetworkReader(fileReader); the same approach, yet automatically determines an ad- Network network = reader.read(); equate degree penalization by considering properties of CommunityDetector task = the . Other local measures consider the IV Network data mining techniques 9

Type Name Time complexity Reference Common Neighbors count O(vd3) (Newman, 2001) Adamic–Adar score O(vd3) (Adamic and Adar, 2003) Resource–allocation index O(vd3) (Zhou et al., 2009) 3 Local Adaptive degree penalization score O(vd ) (Mart´ınezet al., 2016) Jaccard score O(vd3) (Jaccard, 1901) Leicht-Holme-Newman score O(vd3) (Leicht et al., 2006) Salton score O(vd3) (Salton and McGill, 1986) Sorensen score O(vd3) (Sørensen, 1948) Hub promoted index O(vd3) (Ravasz et al., 2002) hub depressed index O(vd3) (Ravasz et al., 2002) score O(vd2) (Barab´asiand Albert, 1999) Katz index O(v3) (Katz, 1953) Leicht-Holme-Newman score O(cv2d) (Leicht et al., 2006) Random walk O(cv2d) (Pearson, 1905) Global Random walk with restart O(cv2d) (Tong et al., 2006) Flow propagation O(cv2d) (Vanunu and Sharan, 2008) Pseudoinverse Laplacian score O(v3) (Fouss et al., 2007) Average commute time score O(v3) (Fouss et al., 2007) Random forest kernel index O(v3) (Chebotarev and Shamis, 2006)

TABLE II Computational time complexity and bibliographic references for the link scoring and prediction methods provided by NOESIS. In the time complexity analysis, v is the number of nodes in the network, d is the maximum node degree, and c refers to the number of iterations required by iterative global link prediction methods. number of shared neighbors, but normalize their value each element corresponds to the probability of reaching according to certain criteria. For example, the Jaccard each node. The classical random walk iteratively mul- score (Jaccard, 1901) normalizes the number of shared tiplies the probability vector by the transition matrix, neighbors by the total number of neighbors. The local which is the row-normalized version of the adjacency ma- Leicht-Holme-Newman score (Leicht et al., 2006) normal- trix, until convergence. An interesting variant is the ran- izes the count of shared neighbors by the product of both dom walk with restart (Tong et al., 2006), which mod- neighborhoods sizes. Similarly, the Salton score (Salton els the possibility of returning to the seed node with a and McGill, 1986) also normalizes, this time using the given probability. Flow propagation is another variant square root of the product of both node degrees. The of random walk (Vanunu and Sharan, 2008), where the Sorensen score (Sørensen, 1948) considers the double of transition matrix is computed by performing both row the count of shared neighbors normalized by the sum and column normalization of the adjacency matrix. of both neighbors size. The hub promoted and hub de- pressed scores (Ravasz et al., 2002) normalize the count Finally, some spectral techniques are also available in of shared neighbors by the minimum and the maximum NOESIS. Spectral techniques, as we mentioned when dis- of both nodes degree respectively. Finally, the preferen- cussing community detection methods, are based on the tial attachment score (Barab´asi and Albert, 1999) only Laplacian matrix. The pseudoinverse Laplacian score considers the product of both node degrees. (Fouss et al., 2007) is the inner product of the rows of the corresponding pair of nodes from the Laplacian ma- Global link scoring and prediction methods are more trix. Other spectral technique is the average commute complex than local methods. For example, the Katz score time (Fouss et al., 2007), which is defined as the average (Katz, 1953) sums the influence of all possible paths be- number of steps that a random walker starting from a tween two nodes, incrementally penalizing paths by their particular node takes to reach another node for the first length according to a given damping factor. The global time and go back to the initial node. Despite it models Leicht-Holme-Newman score (Leicht et al., 2006) is quite a random walk process, it is considered to be a spectral similar to the Katz score, but resorts to the dominant technique because it is usually computed in terms of the eigenvalue to compute the final result. Laplacian matrix. Given the Laplacian matrix, it can be Random walk techniques simulate a Markov chain of computed as the diagonal element of the starting node randomly-selected nodes (Pearson, 1905). The idea is plus the diagonal element of the ending node, minus two that, starting from a seed node and randomly moving times the element located in the row of the first node and through links, we can obtain a probability vector where the column of the second one. 10 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

FIG. 7 Different link scoring techniques applied to Les Miserables coappearance network (Knuth, 2009): common neighbors (top left), preferential attachment score (top right), Sorensen score (bottom left), and Katz index (bottom right). Link width in the figure is proportional to the link score.

Finally, the random forest kernel score (Chebotarev Using network data mining algorithms in NOESIS is and Shamis, 2006) is a global technique based on the simple. In the following code snippet, we show how to concept of spanning tree, i.e. a connected undirected generate a Barabsi-Albert preferential attachment net- sub-network with no cycles that includes all the nodes work with 100 nodes and 1000 links, and then compute and some or all the links of the network. The matrix-tree the Resource Allocation score for each pair of nodes using theorem states that the number of spanning trees in the NOESIS: network is equal to any cofactor, which is a determinant obtained by removing the row and column of the given node, of an entry of its Laplacian representation. As a Network network = result of this, the inverse of the sum of the identity matrix new BarabasiAlbertNetwork(100, 1000); and the Laplacian matrix gives us a matrix that can be LinkPredictionScore method = interpreted as a measure of accessibility between pairs of new ResourceAllocationScore(network); nodes. Matrix result = method.call(); V Conclusion 11

V. CONCLUSION Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the Currently, the NOESIS project comprises more than ACM, 51(1):107–113. thirty five thousand lines of code organized in hundreds Defays, D. (1977). An efficient algorithm for a complete link of classes and dozens of packages. NOESIS relies on a li- method. The Computer Journal, 20(4):364–366. Ellson, J., Gansner, E., Koutsofios, L., North, S. C., and brary of reusable components that, with more than forty Woodhull, G. (2002). Graphviz - open source graph draw- thousand lines of Java code, provide a customizable col- ing tools. In , pages 483–484. Springer. lection framework, support for the execution of parallel Erd˝os,P. and R´enyi, A. (1959). On random graphs. Publica- algorithms, mathematical routines, and the model-driven tiones Mathematicae Debrecen, 6:290–297. application generator used to build the NOESIS graphi- Fortunato, S. (2010). Community detection in graphs. Physics cal user interface. Reports, 486(3):75–174. NOESIS can ease the development of applications that Fouss, F., Pirotte, A., Renders, J.-M., and Saerens, M. (2007). involve the analysis of any kind of data susceptible of Random-walk computation of similarities between nodes of being represented as a network. NOESIS provides an a graph with application to collaborative recommendation. efficient, scalable, and developer–friendly framework for IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369. network data mining, released under a permissive Berke- Freeman, L. C. (1977). A set of measures of centrality based ley Software Distribution free software license. Our on betweenness. Sociometry, pages 35–41. framework can be downloaded from its official website Fruchterman, T. M. and Reingold, E. M. (1991). Graph draw- at http://noesis.ikor.org. ing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164. Getoor, L. and Diehl, C. P. (2005). Link mining: a survey. Acknowledgments ACM SIGKDD Explorations Newsletter, 7(2):3–12. Gilbert, E. N. (1959). Random graphs. The Annals of Math- ematical Statistics, 30(4):1141–1144. The NOESIS project is partially supported by the Hage, P. and Harary, F. (1995). Eccentricity and centrality Spanish Ministry of Economy and the European Regional in networks. Social Networks, 17(1):57–63. Development Fund (FEDER), under grant TIN2012– Hagen, L. and Kahng, A. B. (1992). New spectral methods for 36951, and the Spanish Ministry of Education under the ratio cut partitioning and clustering. IEEE Transactions on program “Ayudas para contratos predoctorales para la Computer-Aided Design of Integrated Circuits and Systems, formaci´onde doctores 2013” (grant BES–2013–064699). 11(9):1074–1085. We are grateful to Aar´onRosas, Francisco–Javier Gij´on, Jaccard, P. (1901). Etude´ comparative de la distribution flo- and Julio–Omar Palacio for their contributions to the im- rale dans une portion des alpes et des jura. Bulletin de la plementation of community detection methods in NOE- Soci´et´eVaudoise des Sciences Naturelles, 37:547–579. SIS. Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press, Princeton, NJ, USA. Kamada, T. and Kawai, S. (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, References 31(1):7–15. Kang, C., Molinaro, C., Kraus, S., Shavitt, Y., and Subrah- Adamic, L. A. and Adar, E. (2003). Friends and neighbors manian, V. (2012). Diffusion centrality in social networks. on the web. Social Networks, 25(3):211–230. In Proceedings of the 2012 International Conference on Ad- Albert, R. and Barab´asi,A.-L. (2002). Statistical mechanics vances in Social Networks Analysis and Mining (ASONAM of complex networks. Reviews of Modern Physics, 74(1):47. 2012), pages 558–564. IEEE Computer Society. Barab´asi,A.-L. and Albert, R. (1999). Emergence of scaling Katz, L. (1953). A new status index derived from sociometric in random networks. Science, 286(5439):509–512. analysis. Psychometrika, 18(1):39–43. Bastian, M., Heymann, S., Jacomy, M., et al. (2009). Gephi: Keeling, M. J. and Eames, K. T. (2005). Networks and an open source software for exploring and manipulating epidemic models. Journal of the Royal Society Interface, networks. International AAAI Conference on Weblogs and 2(4):295–307. Social Media, 8:361–362. Kernighan, B. W. and Lin, S. (1970). An efficient heuristic Batagelj, V. and Mrvar, A. (1998). Pajek-program for large procedure for partitioning graphs. Bell System Technical network analysis. Connections, 21(2):47–57. Journal, 49(2):291–307. Bavelas, A. (1950). Communication patterns in task-oriented Kleinberg, J. M. (1999). Authoritative sources in a hy- groups. The Journal of the Acoustical Society of America, perlinked environment. Journal of the ACM (JACM), 22(6):725–730. 46(5):604–632. Borgatti, S. P., Everett, M. G., and Freeman, L. C. (2002). Knuth, D. E. (2009). The Stanford GraphBase: A Plat- UCINET for Windows: Software for social network analy- form for Combinatorial Computing. Addison-Wesley Pro- sis. Technical report, Analytic Technologies. fessional, 1st edition. Chebotarev, P. and Shamis, E. (2006). Matrix-forest theo- Lancichinetti, A. and Fortunato, S. (2009). Community detec- rems. arXiv preprint math/0602575. tion algorithms: a comparative analysis. Physical Review Csardi, G. and Nepusz, T. (2006). The igraph software pack- E, 80(5):056117. age for research. International Journal of Leicht, E. A., Holme, P., and Newman, M. E. (2006). Complex Systems, 1695(5):1–9. 12 The NOESIS Network-Oriented Exploration, Simulation, and Induction System

similarity in networks. Physical Review E, 73(2):026120. Barab´asi,A.-L. (2002). Hierarchical organization of mod- Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford ularity in metabolic networks. Science, 297(5586):1551– large network dataset collection. http://snap.stanford. 1555. edu/data. Salton, G. and McGill, M. J. (1986). Introduction to Modern Liben-Nowell, D. and Kleinberg, J. (2007). The link- Information Retrieval. McGraw-Hill, Inc. prediction problem for social networks. Journal of the Schuetz, P. and Caflisch, A. (2008). Efficient modularity op- American Society for Information Science and Technology, timization by multistep greedy algorithm and vertex mover 58(7):1019–1031. refinement. Physical Review E, 77(4):046112. Liu, B. (2011). Web Data Mining, 2 ed. Berlin, Germany: Schult, D. A. and Swart, P. (2008). Exploring network struc- Springer Berlin Heidelberg. ture, dynamics, and function using NetworkX. In Proceed- L¨u,L. and Zhou, T. (2011). Link prediction in complex net- ings of the 7th Python in Science Conferences (SciPy 2008), works: A survey. Physica A: Statistical Mechanics and its volume 2008, pages 11–16. Applications, 390(6):1150–1170. Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P., J. T., Ramage, D., Amin, N., Schwikowski, B., and Slooten, E., and Dawson, S. M. (2003). The bottlenose Ideker, T. (2003). Cytoscape: a software environment dolphin community of doubtful sound features a large pro- for integrated models of biomolecular interaction networks. portion of long-lasting associations. Behavioral Ecology and Genome Research, 13(11):2498–2504. Sociobiology, 54(4):396–405. Shi, J. and Malik, J. (2000). Normalized cuts and image seg- MacQueen, J. et al. (1967). Some methods for classification mentation. Transactions on Pattern Analysis and Machine and analysis of multivariate observations. In Proceedings of Intelligence, 22(8):888–905. the Fifth Berkeley Symposium on Mathematical Statistics Sibson, R. (1973). Slink: an optimally efficient algorithm and Probability, number 14 in 1, pages 281–297. Oakland, for the single-link cluster method. The Computer Journal, CA, USA. 16(1):30–34. Mart´ınez,V., Berzal, F., and Cubero, J.-C. (2016). Adaptive Smith, M. A., Shneiderman, B., Milic-Frayling, N., degree penalization for link prediction. Journal of Compu- Mendes Rodrigues, E., Barash, V., Dunne, C., Capone, T., tational Science, 13:1–9. Perer, A., and Gleave, E. (2009). Analyzing (social media) Mart´ınez,V., Cano, C., and Blanco, A. (2014). Prophnet: A networks with NodeXL. In Proceedings of the Fourth In- generic prioritization method through propagation of infor- ternational Conference on Communities and Technologies, mation. BMC Bioinformatics, 15(Suppl 1):S5. pages 255–264. ACM. Newman, M. (2010). Networks: An Introduction. Oxford Sørensen, T. (1948). A method of establishing groups of equal University Press. amplitude in plant sociology based on similarity of species Newman, M. E. (2001). Clustering and preferential at- and its application to analyses of the vegetation on danish tachment in growing networks. Physical Review E, commons. Biologiske Skrifter, 5:1–34. 64(2):025102. Tamassia, R. (2013). Handbook of Graph Drawing and Visu- Newman, M. E. (2003). The structure and function of com- alization. CRC Press. plex networks. Society for Industrial and Applied Mathe- Tan, P.-N., Steinbach, M., Kumar, V., et al. (2006). Intro- matics Review, 45(2):167–256. duction to data mining. Pearson, Addison Wesley, Boston. Newman, M. E. (2004). Fast algorithm for detecting Tong, H., Faloutsos, C., and Pan, J.-Y. (2006). Fast random in networks. Physical Review E, walk with restart and its applications. In Proceedings of 69(6):066133. the Sixth International Conference on Data Mining, ICDM Newman, M. E. and Girvan, M. (2004). Finding and evaluat- ’06, pages 613–622. IEEE Computer Society. ing community structure in networks. Physical Review E, Vanunu, O. and Sharan, R. (2008). A propagation-based al- 69(2):026113. gorithm for inferring gene-disease assocations. In German Ng, A. Y., Jordan, M. I., Weiss, Y., et al. (2002). On spectral Conference on Bioinformatics, pages 54–52. clustering: Analysis and an algorithm. Advances in Neural Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics Information Processing Systems, 2:849–856. of ‘small-world’ networks. Nature, 393(6684):440–442. Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). Wills, G. J. (1999). Nicheworksinteractive visualization of The pagerank citation ranking: Bringing order to the web. very large graphs. Journal of Computational and Graphical Technical report, Stanford InfoLab. Statistics, 8(2):190–212. Pavlov, M. and Ichise, R. (2007). Finding experts by link Yang, J. and Leskovec, J. (2013). Overlapping community prediction in co-authorship networks. Finding Experts on detection at scale: a nonnegative matrix factorization ap- the Web with Semantics Workshop, 290:42–55. proach. In Proceedings of the Sixth ACM International Pearson, K. (1905). The problem of the random walk. Nature, Conference on Web Search and Data Mining, pages 587– 72:342. 596. ACM. Piraveenan, M., Prokopenko, M., and Zomaya, A. (2008). Lo- Zachary, W. W. (1977). An information flow model for con- cal assortativeness in scale-free networks. Europhysics Let- flict and fission in small groups. Journal of Anthropological ters, 84(2):28002. Research, pages 452–473. Piraveenan, M., Prokopenko, M., and Zomaya, A. Y. (2010). Zhou, T., L¨u,L., and Zhang, Y.-C. (2009). Predicting missing Classifying complex networks using unbiased local assorta- links via local information. The European Physical Journal tivity. In ALIFE, pages 329–336. B, 71(4):623–630. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and