Graph Embedding and Classification Via Simplicial Complexes
algorithms
Article
(Hyper)Graph Embedding and Classification via Simplicial Complexes
Alessio Martino 1,*, Alessandro Giuliani 2 and Antonello Rizzi 1
1 Department of Information Engineering, Electronics and Telecommunications, University of Rome “La Sapienza”, Via Eudossiana 18, 00184 Rome, Italy; [email protected]
2 Department of Environment and Health, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161 Rome, Italy; [email protected]
* Correspondence: [email protected]; Tel.: +39-06-4458-5745
Received: 26 September 2019; Accepted: 24 October 2019; Published: 25 October 2019

Abstract: This paper investigates a novel graph embedding procedure based on simplicial complexes. Inherited from algebraic topology, simplicial complexes are collections of increasing-order simplices (e.g., points, lines, triangles, tetrahedrons) which can be interpreted as possibly meaningful substructures (i.e., information granules) on the top of which an embedding space can be built by means of symbolic histograms. In the embedding space, any Euclidean pattern recognition system can be used, possibly equipped with feature selection capabilities in order to select the most informative symbols. The selected symbols can be analysed by field-experts in order to extract further knowledge about the process to be modelled by the learning system, hence the proposed modelling strategy can be considered as a grey-box. The proposed embedding has been tested on thirty benchmark datasets for graph classification and, further, we propose two real-world applications, namely predicting proteins’ enzymatic function and solubility propensity starting from their 3D structure in order to give an example of the knowledge discovery phase which can be carried out starting from the proposed embedding strategy.
Keywords: granular computing; embedding spaces; graph embedding; topological data analysis; simplicial complexes; computational biology; protein contact networks; complex networks; complex systems

Algorithms 2019, 12, 223; doi:10.3390/a12110223 www.mdpi.com/journal/algorithms

1. Introduction

Graphs are powerful data structures that can capture topological and semantic information from data. This is one of the main reasons they are commonly used for modelling several real-world systems in fields such as biology and chemistry [1–8], social networks [9], telecommunication networks [10,11] and natural language processing [12–14]. However, solving pattern recognition problems in structured domains such as graphs poses additional challenges. Indeed, many structured domains are also non-metric in nature [15–17] and their patterns lack any geometrical interpretation. In brief, an input space is said to be non-metric if the pairwise dissimilarities between patterns lying in such a space do not satisfy the properties of a metric (non-negativity, identity, symmetry and triangle inequality) [17,18].

In the literature, several strategies can be found for performing pattern recognition tasks in structured domains [17], namely:
• feature generation and feature engineering, where numerical features are extracted ad hoc from the input patterns;
• ad-hoc dissimilarities in the input space, where custom dissimilarity measures (e.g., edit distances [19–22]) are designed in order to directly process patterns in the input space (without moving towards Euclidean spaces);
• dissimilarity representations [18,23], where each pattern is described by its pairwise distances with respect to the other patterns or to a properly chosen subset of pivotal training patterns [23–26];
• kernel methods, where the mapping between the original input space and the Euclidean space exploits positive-definite kernel functions [27–32];
• embedding via information granulation.
As far as the latter is concerned, embedding techniques have been gaining more and more attention, especially since the breakthrough of Granular Computing [33,34]. In short, Granular Computing is a human-inspired information processing paradigm which aims at the extraction of meaningful entities (information granules) arising from both the problem at hand and the data representation. The challenge with Granular Computing-based pattern recognition systems is that there are different levels of granularity according to which a given system can be observed [35–37]; nonetheless, one shall choose a suitable level of granularity for the problem at hand. These information granules are usually extracted in a data-driven manner and describe data aggregates, namely data which are similar according to structural and/or functional similarity [15–17]. Data clustering, for example, is a promising tool for extracting information granules [38], especially when clustering algorithms can be equipped with ad-hoc dissimilarity measures in order to deal with structured data [17,39–42]. Indeed, several works have focused on extracting information granules via motif clustering (see e.g., Refs. [43–47]), where a proper granulation module is in charge of extracting and clustering sub-structures (i.e., sub-graphs). The resulting clusters can be considered as information granules and the clusters’ representatives form an alphabet on top of which the embedding procedure is performed thanks to the symbolic histograms approach [46]: letting M be the size of the alphabet, each input pattern is transformed into an M-length integer-valued feature vector whose ith element contains the number of occurrences of the ith alphabet member within the pattern itself. Thanks to the embedding, the problem is moved towards a metric (Euclidean) space and plain pattern recognition algorithms can be used without alterations.
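The symbolic histogram described above can be illustrated with a minimal sketch. It assumes that a pattern has already been decomposed into a list of hashable sub-structure labels and that a sub-structure matches an alphabet symbol only when the two are identical; clustering-based granulators typically rely on a dissimilarity threshold instead. The function name `symbolic_histogram` and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def symbolic_histogram(substructures, alphabet):
    """Map a pattern to an M-length integer vector (M = len(alphabet)).

    Assumption: exact matching between sub-structures and symbols.
    Sub-structures that do not belong to the alphabet are ignored.
    """
    counts = Counter(substructures)
    # The i-th entry is the number of occurrences of the i-th symbol.
    return [counts[symbol] for symbol in alphabet]


# Toy alphabet with M = 3 symbols (hypothetical labels).
alphabet = ["A", "B", "C"]
# Sub-structures extracted from one pattern; "D" is not in the alphabet.
pattern = ["A", "C", "A", "D", "C", "A"]

print(symbolic_histogram(pattern, alphabet))  # [3, 0, 2]
```

Since every pattern is mapped to a vector of the same length M, the resulting vectors can be stacked into a matrix and fed to any Euclidean classifier.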
Symbol extraction and alphabet synthesis are crucial in granulation-based classifiers: the resulting embedding space must preserve (the vast majority of) the original input space properties (e.g., the more different two objects drawn from the input space are, the more distant they must appear in the embedding space) [17,18]. Also, for the sake of modelling complexity, the alphabet should be as small as possible while remaining informative. This aspect is crucial since Granular Computing-based pattern recognition systems aim to be human interpretable: the set of symbols forming the alphabet, which is pivotal for the embedding space, should allow field experts to gather further insights into the problem at hand [17]. The aim of this paper is to investigate a novel procedure for extracting meaningful information granules thanks to simplicial complexes. Unlike network motifs and graphlets, simplicial complexes are able to capture the multi-scale/higher-order organisation in complex networks [48,49], overcoming the main limitation of ‘plain graphs’; that is, graphs only consider pairwise relations, whereas simplicial complexes (and hypergraphs, in general) also consider multi-way relations. On top of simplicial complexes, an embedding space is built for pattern recognition purposes. In order to show the effectiveness of the proposed embedding procedure, a set of thirty open-access datasets for graph classification has been considered. Furthermore, the proposed technique has been benchmarked against two suitable competitors and a null-model for statistical assessment. In order to stress the knowledge discovery phase offered by Granular Computing-based classifiers, additional experiments are proposed.
Specifically, starting from real-world proteomic data, two problems will be addressed regarding the possibility of predicting the enzymatic function and the solution/folding propensity starting from proteins’ folded 3D structure.

This paper is organised as follows: in Section 2 the approach at the basis of this work is presented by giving a brief overview of simplicial complexes (Section 2.1) before diving into the proper embedding procedure (Section 2.2); in Section 3 the results over benchmark datasets (Section 3.1) and real-world problems (Section 3.2) are shown. Section 4 remarks on the interpretability of the proposed model and, finally, Section 5 concludes the paper, outlining future directions.

2. Information Granulation and Classification Systems

2.1. An Introduction to Simplicial Complexes

Let P be a set of points in a multi-dimensional space equipped with a notion of distance d(·, ·) and let X be the topological space enclosing P. The topological space X can be described by means of its simplices, which are multidimensional objects of different order (dimension) drawn from P. Formally, a k-simplex (simplex of order k) is a set of k + 1 points drawn from P. For example, 0-dimensional simplices correspond to points, 1-dimensional simplices correspond to lines, 2-dimensional simplices correspond to triangles, 3-dimensional simplices correspond to tetrahedrons and so on for higher-dimensional simplices. Every non-empty subset of the (k + 1) vertices of a k-simplex is a face of the simplex: a face is itself a simplex. Simplicial complexes [50,51] are properly constructed finite collections of simplices that are closed with respect to inclusion of the faces: if a given simplex s belongs to a given simplicial complex S, then all faces of s also belong to S. The order (dimension) of the simplicial complex is the maximum order of any of its simplices.
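The definitions of face, closure and order given above can be made concrete with a short sketch. It assumes that each simplex is represented as a set of vertex labels; the helper names `faces`, `is_simplicial_complex` and `order` are hypothetical and introduced here only for illustration.

```python
from itertools import combinations


def faces(simplex):
    """Return every non-empty subset of the vertices of a simplex.

    Following the definition in the text, the simplex itself is
    included among its faces.
    """
    vertices = tuple(simplex)
    return {
        frozenset(subset)
        for size in range(1, len(vertices) + 1)
        for subset in combinations(vertices, size)
    }


def is_simplicial_complex(simplices):
    """Check closure: all faces of every simplex belong to the collection."""
    collection = {frozenset(s) for s in simplices}
    return all(faces(s) <= collection for s in collection)


def order(simplices):
    """Order (dimension) of a complex: the maximum order of its simplices."""
    return max(len(s) for s in simplices) - 1


# A filled triangle on the vertices 0, 1, 2 together with all of its faces:
# three points, three lines and the triangle itself.
triangle = faces({0, 1, 2})

print(len(triangle))                       # 7
print(is_simplicial_complex(triangle))     # True
print(is_simplicial_complex([{0, 1, 2}]))  # False: edges and points are missing
print(order(triangle))                     # 2
```

Storing simplices as `frozenset` objects makes the closure test a plain subset check and keeps the representation independent of the vertex ordering.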
A graph G = (V, E), where E is the set of edges and V is the set of vertices, is also commonly known as a “1-skeleton” or “simplicial complex of order 1” since the only entities