<<

1

Topological Signal Processing over Simplicial Complexes Sergio Barbarossa, Fellow, IEEE, Stefania Sardellitti, Member, IEEE

Abstract—The goal of this paper is to establish the fundamental H(V, S) is known as a hypergraph. In particular, a class of tools to analyze signals defined over a , i.e. a hypergraphs that is particularly appealing for its rich algebraic of points along with a set of neighborhood relations. This setup structure is given by simplicial complexes, whose defining does not require the definition of a metric and then it is especially useful to deal with signals defined over non-metric spaces. We feature is the inclusion property stating that if a set A belongs focus on signals defined over simplicial complexes. Signal to S, then all subsets of A also belong to S [10]. Restricting Processing (GSP) represents a special case of Topological Signal the attention to simplicial complexes is a limitation with re- Processing (TSP), referring to the situation where the signals spect to a hypergraph model. Nevertheless, are associated only with the vertices of a graph. Even though models include many cases of interest and, most important, the theory can be applied to signals of any order, we focus on signals defined over the edges of a graph and show how building their rich algebraic structure makes possible to derive tools a simplicial complex of order two, i.e. including , yields that are very useful for analyzing signals over the complex. benefits in the analysis of edge signals. After reviewing the basic Learning simplicial complexes representing the complex inter- principles of algebraic , we derive a sampling theory for actions among sets of elements has been already proposed in signals of any order and emphasize the interplay between signals brain network analysis [6], neuronal morphologies [11], co- of different order. Then we propose a method to infer the topology of a simplicial complex from data. We conclude with applications authorship networks [12], collaboration networks [13], [14], to real edge signals and to the analysis of discrete vector fields tumor progression analysis [15]. More generally, the use of to illustrate the benefits of the proposed methodologies. tools for the extraction of information from Index Terms—Algebraic topology, graph signal processing, data is not new: The framework known as Topological Data topology inference. Analysis (TDA), see e.g. [16], has exactly this goal. Interesting applications of algebraic topology tools have been proposed to control systems [17], statistical ranking from incomplete I.INTRODUCTION data [18], [19], distributed coverage control of sensor networks Historically, signal processing has been developed for sig- [20]–[22], wheeze detection [23]. One of the fundamental tools nals defined over a , typically time or space. of TDA is the analysis of persistent homologies extracted from More recently, there has been a surge of interest to deal with data [24], [25]. Topological methods to analyze signals and signals that are not necessarily defined over a metric space. images are also the subjects of the two books [26] and [27]. Examples of particular interest are biological networks, social The goal of our paper is to establish a fundamental frame- networks, etc. The field of graph signal processing (GSP) has work to analyze signals defined over a simplicial complex. Our recently emerged as a framework to analyze signals defined approach is complementary to TDA: Rather than focusing, like over the vertices of a graph [2], [3]. A graph G(V, E) is TDA, on the properties of the simplicial complex extracted a simple example of topological space, composed of a set from data, we focus on the properties of signals defined of elements (vertices) V and a set of edges E representing over a simplicial complex. Our approach includes GSP as pairwise relations. However, notwithstanding their enormous a particular case: While GSP focuses on the analysis of success, graph-based representations are not always able to signals defined over the vertices of a graph, topological signal arXiv:1907.11577v2 [eess.SP] 13 Mar 2020 capture all the information present in interconnected systems processing (TSP) considers signals defined over simplices of in which the complex interactions among the system constitu- various order, i.e. signals defined over nodes, edges, triangles, tive elements cannot be reduced to pairwise interactions, but etc. Relevant examples of edge signals are flow signals, like require a multiway description, as suggested in [4]–[9]. blood flow between different areas of the brain [28], data To establish a general framework to deal with complex traffic over communication links [29], regulatory signals in interacting systems, it is useful to start from a topological gene regulatory networks [30], where it was shown that a space, i.e. a set of elements V, along with an ensemble of dysregulation of these regulatory signals is one of the causes multiway relations, represented by a set S containing subsets of cancer [30]. Examples of signals defined over triplets are of various cardinality, of the elements of V. The structure co-authorship networks, where a (filled) indicates the presence of papers co-authored by the authors associated to its The authors are with the Department of Information Engineering, Electronics, and Telecommunications, Sapienza University of Rome, Via three vertices [12] and the associated signal value counts the Eudossiana 18, 00184, Rome, Italy. E-mails: {sergio.barbarossa; stefa- number of such publications. Further examples of even higher nia.sardellitti}@uniroma1.it. This work has been supported by H2020 order structures are analyzed in [8], with the goal of predicting EU/Taiwan Project 5G CONNI, Nr. AMD-861459-3 and by MIUR, under the PRIN Liquid-Edge contract. Some preliminary results of this work were higher-order links. There are previous works dealing with the presented at the 2018 IEEE Workshop in Data Science [1]. analysis of edge signals, like [31]–[34]. More specifically, in 2

[31] the authors introduced a class of filters to analyze edge edge signal processing to filter discrete vector fields. Then, in signals based on the edge-Laplacian matrix [35]. A semi- Section VII, we propose some methods to infer the simplicial supervised learning method for learning edge flows was also complex structure from noisy observations, illustrating their suggested in [32], where the authors proposed filters highlight- performance over both synthetic and real data. Finally, in ing both divergence-free and curl-free behaviors. Other works Section VIII we draw some conclusions. analyzed edge signals using a line-graph transformation [33], [34]. Random walks evolving over simplicial complexes have II.REVIEWOFALGEBRAICTOPOLOGYTOOLS been analyzed in [36], [37], [38]. In [36], random walks and In this section we recall the basic principles of algebraic diffusion process over simplicial complexes are introduced, topology [10] and discrete [42], as they will form the while in [38] the authors focused on the study of diffusion background required for deriving the basic signal processing processes on the edge-space by generalizing the well-known tools to be used in later sections. We follow an algebraic relationship between the normalized graph Laplacian operator approach that is accessible also to readers with no specific and random walks on graphs. background on algebraic topology. Building a representation based on a simplicial complex is a straightforward generalization of a graph-based representation. Given a set of time series, it is well known that graph-based A. Discrete domains: Simplicial complexes representations are very useful to capture relevant correlations Given a finite set V , {v0, . . . , vN−1} of N points k or causality relations present between different signals [39], (vertices), a k- σi is an unordered set {vi0 , . . . , vik } of

[40]. In such a case, each time series is associated to a node k +1 points with 0 ≤ ij ≤ N −1, for j = 0, . . . , k, and vij 6= of a graph, and the presence of an edge is an indicator of vin for all ij 6= in.A face of the k-simplex {vi0 , . . . , vik } is the relations between the signals associated to its endpoints. a (k − 1)-simplex of the form {vi0 , . . . , vij−1 , vij+1 , . . . , vik }, As a direct generalization, if we have signals defined over the for some 0 ≤ j ≤ k. Every k-simplex has exactly k + 1 edges of a graph, capturing their relations requires inferring faces. An abstract simplicial complex X is a finite collection the presence of triangles associating triplets of edges. of simplices that is closed under inclusion of faces, i.e., if In summary, our main contributions are listed below: σi ∈ X , then all faces of σi also belong to X . The order (or 1) we show how to derive a real-valued function captur- dimension) of a simplex is one less than its cardinality. Then, ing triple-wise relations among data, to be used as a a is a 0-dimensional simplex, an edge has dimension 1, regularization function in the analysis of edge signals; and so on. The dimension of a simplicial complex is the largest 2) we derive a sampling theory defining the conditions for dimension of any of its simplices. A graph is a particular case the recovery of high order signals from a subset of of an abstract simplicial complex of order 1, containing only observations, highlighting the interplay between signals simplices of order 0 (vertices) and 1 (edges). of different order; If the set of points is embedded in a real space RD of 3) we propose inference algorithms to extract the structure dimension D, we can associate a geometric simplicial complex of the simplicial complex from high order signals; with the abstract complex. A set of points in a real space RD 4) we apply our edge signal processing tools to the analysis is affinely independent if it is not contained in a hyperplane; of vector fields, with a specific attention to the recovery an affinely independent set in RD contains at most D + 1 of the RiboNucleic Acid (RNA) velocity vector field, points. A geometric k-simplex is the convex hull of a set of useful to predict the evolution of a cell behavior [41]. k + 1 affinely independent points, called its vertices. Hence, a We presented some preliminary results of our work in [1]. is a 0-simplex, a is a 1-simplex, a triangle Here, we extend the work of [1], deriving a sampling theory is a 2-simplex, a is a 3-simplex, and so on. A for signals defined over complexes of various order, proposing geometric simplicial complex shares the fundamental property new inference methods, more robust against noise, and show- of an abstract simplicial complex: It is a collection of simplices ing applications to the analysis of wireless data traffic and to that is closed under inclusion and with the further property that the estimation of the RNA velocity vector field. the intersection of any two simplices in X is also a simplex The paper is organized as follows. Section II recalls the in X , assuming that the empty set is an element of every main algebraic principles that will form the basis for the simplicial complex. Although geometric simplicial complexes derivation of the signal processing tools carried out in the are easier to visualize and, for this reason, we will often use ensuing sections. In Section III, we will first recall the eigen- geometric terms like edges, triangles, and so on, as synonyms vectors properties of higher-order Laplacian and the Hodge of pairs, triplets, we do not require the simplicial complex to decomposition. Then, we derive the real-valued extension of be embedded in any real space, so as to leave the treatment an edge set function, capturing triple-wise relations among the as general as possible. elements of a discrete set, which will be used to design unitary The structure of a simplicial complex is captured by the bases to represent edge signals. In Section IV, we illustrate neighborhood relations of its subsets. As with graphs, it is some methods to recover the edge signal components from useful to introduce first the orientation of the simplices. Every noisy observations. Section V provides theoretical conditions simplex, of any order, can have only two orientations, de- to recover an edge signal from a subset of samples, high- pending on the permutations of its elements. Two orientations lighting the interplay between signals defined over structures are equivalent, or coherent, if each of them can be recovered of different order. In Section VI, we illustrate how to use from the other by an even number of transpositions, where 3 each transposition is a permutation of two elements [10]. A If we consider, for simplicity, a simplicial complex of order k k-simplex σi ≡ {vi0 , vi1 , . . . , vik } of order k, together with two, composed of a set V of vertices, a set E of edges, and an orientation is an oriented k-simplex and is denoted by a set T of triangles, having cardinalities V = |V|, E = |E|, k k [vi0 , vi1 , . . . , vik ]. Two simplices of order k, σi , σj ∈ X , and T = |T |, respectively, we need to build two incidence V ×E E×T are upper adjacent in X , if both are faces of a simplex of matrices B1 ∈ R and B2 ∈ R . k k order k + 1. Two simplices of order k, σi , σj ∈ X , are lower From (1), the property that the of a boundary is adjacent in X , if both have a common face of order k − 1 zero maps into the following matrix form k−1 k in X .A (k − 1)-face σj of a k-simplex σi is called a k k−1 k BkBk+1 = 0. (4) boundary element of σi . We use the notation σj ⊂ σi k−1 k The structure of a K-simplicial complex is fully described by to indicate that σj is a boundary element of σi . Given k−1 k k−1 k its high order combinatorial Laplacian matrices [43], of order a simplex σj ⊂ σi , we use the notation σj ∼ σi to k−1 k = 0,...,K, defined as indicate that the orientation of σj is coherent with that of k k−1 k T σi , whereas we write σj  σi to indicate that the two L0 = B1B1 , (5) orientations are opposite. T T Lk = Bk Bk + Bk+1Bk+1, k = 1,...,K − 1, (6) For each k, C (X , ) denotes the obtained k R L = BT B . (7) by the linear combination, using real coefficients, of the set of K K K oriented k-simplices of X . In algebraic topology, the elements It is worth emphasizing that all Laplacian matrices of inter- C (X , ) k chains {σk, . . . , σk } of k R are called - . If 1 nk is the set mediate order, i.e. k = 1,...,K − 1, contain two terms: The of k-simplices in X , a k-chain τk can be written as τk = first term, also known as lower Laplacian, expresses the lower Pnk α σk {σk, . . . , σk } τ i=1 i i . Then, given the basis 1 nk , a chain k adjacency of k-order simplices; the second terms, also known can be represented by the vector of its expansion coefficients as upper Laplacian, expresses the upper adjacency of k-order

(α1, . . . , αnk ). An important operator acting on ordered chains simplices. So, for example, two edges are lower adjacent if is the boundary operator. The boundary of the ordered k- they share a common vertex, whereas they are upper adjacent chain [vi0 , . . . , vik ] is a linear mapping ∂k : Ck(X , R) → if they are faces of a common triangle. Note that the vertices Ck−1(X , R) defined as of a graph can only be upper adjacent, if they are incident to L k the same edge. This is why the graph Laplacian 0 contains X j only one term. ∂k[vi0 , . . . , vik ] , (−1) [vi0 , . . . , vij−1 , vij+1 , . . . , vik ]. j=0 (1) III.SPECTRALSIMPLICIALTHEORY 2 So, for example, given an oriented triangle σi , [vi0 , vi1 , vi2 ], In this paper, we are interested in analyzing signals defined its boundary is over a simplicial complex. Given a set Sk, with elements of

2 cardinality k + 1, a signal is defined as a real-valued map on ∂2σ = [vi , vi ] − [vi , vi ] + [vi , vi ], (2) i 1 2 0 2 0 1 the elements of Sk, of the form i.e., a suitable linear combination of its edges. It is straight- fk : Sk → R, k = 0, 1,... (8) forward to verify, by simple substitution, that the boundary of The order of the signal is one less the cardinality of the a boundary is zero, i.e., ∂k∂k+1 = 0. S It is important to remark that an oriented simplex is dif- elements of k. Even though our framework is general, in ferent from a directed one. As with graphs, an oriented edge many cases we focus on simplices of order up to two. In that V E establishes which direction of the flow is considered positive case, we consider a set of vertices , a set of edges and T V E T or negative, whereas a directed edge only permits flow in a set of triangles , of dimension , , and , respectively. X (V, E, T ) one direction [42]. In this work we will considered oriented, We denote with the associated simplicial complex. k k = 0, 1 undirected simplices. The signals over each complex of order , with and 2, are defined as the following maps: s0 : V → RV , s1 : E → RE, and s2 : T → RT . B. Algebraic representation Spectral graph theory represents a solid framework to extract features of a graph looking at the eigenvectors of The structure of a simplicial complex X of dimension the combinatorial Laplacian L of order 0. The eigenvectors K, shortly named K-simplicial complex, is fully described 0 associated with the smallest eigenvalues of L are very useful, by the set of its incidence matrices B , k = 1,...,K. 0 k for example, to identify clusters [44]. Furthermore, in GSP Given an orientation of the simplicial complex X , the entries it is well known that a suitable basis to represent signals of the incidence matrix B establish which k-simplices are k defined over the vertices of a graph, i.e. signals of order 0, incident to which (k − 1)-simplices. Then B is the matrix k is given by the eigenvectors of L . In particular, given the representation of the boundary operator. Formally speaking, 0 eigendecomposition of L : its entries are defined as follows: 0 T  k−1 k L0 = U0Λ0U0 , (9) 0, if σi 6⊂ σj  k−1 k k−1 k 0 Bk(i, j) = 1, if σi ⊂ σj and σi ∼ σj . (3) the Graph Fourier Transform (GFT) of a signal s over an  k−1 k k−1 k undirected graph has been defined as the projection of the −1, if σi ⊂ σj and σi  σj 4

signal onto the space spanned by the eigenvectors of L0, i.e. 3) the eigenvectors associated with the nonzero eigenvalues T (see [3] and the references therein) λ of Lk are either the eigenvectors of Bk Bk or those T 0 T 0 of Bk+1Bk+1; s U0 s . (10) b , 4) the nonzero eigenvalues of Lk are either the eigenvalues T T Equivalently, a signal defined over the vertices of a graph can of Bk Bk or those of Bk+1Bk+1. be represented as Proof. All above properties are easy to prove. Property 1) is 0 0 BT B v = λv s = U0 bs . (11) straightforward: If k k , then T T T From graph spectral theory, it is well known that the eigen- Bk+1B λv = Bk+1B B Bkv = 0 (15) k+1 k+1 k ¯ vectors associated with the smallest eigenvalues of L encode 0 because of (4). Similarly, for the converse. Property 2) is also information about the clusters of the graph. Hence, the rep- straightforward: If v is an eigenvector of B BT associated resentation given by (11) is particularly suitable for signals k k with a nonvanishing eigenvalue λ, then that are smooth within each cluster, whereas they can vary T T T T T arbitrarily across different clusters. For such signals, in fact, (Bk Bk)Bk v = Bk (BkBk )v = λBk v. (16) the representation in (11) is sparse or approximately sparse. As a generalization of the above approach, we may represent Finally, properties 3) and 4) follow from the definition of k- T T signals of various order over bases built with the eigenvectors order Laplacian, i.e. Lk = Bk Bk + Bk+1Bk+1 and from of the corresponding high order Laplacian matrices, given in property 1). (6)-(7). Hence, using the eigendecomposition Remark: Recalling that the eigenvectors associated with the smallest nonzero eigenvalues of L0 are smooth within each T Lk = UkΛkUk , (12) cluster, applying property 2) to the case k = 1, it turns out that the eigenvectors of L associated with the smallest eigenvalues we may define the GFT of order k as the projection of a k- 1 of BT B are approximately null over the links within each order signal onto the eigenvectors of L , i.e. 1 1 k cluster, whereas they assume the largest (in modulus) values k T k on the edges across clusters. These eigenvectors are then useful bs , Uk s , (13) to highlight inter-cluster edges. so that a signal sk can be represented in terms of its GFT coefficients as k k B. Hodge decomposition s = Uk bs . (14) Let us consider the eigendecomposition of the k-th order Now we want to show under what conditions (14) is a Laplacian, for k = 1,...,K − 1, meaningful representation of a k-order signal and what is T T T the meaning of such a representation. More specifically, the Lk = Bk Bk + Bk+1Bk+1 = UkΛkUk . (17) goal of this section is threefold: i) we recall first the relations The structure of Lk, together with the property BkBk+1 = 0, between eigenvectors of various order of Lk; ii) we recall the induces an interesting decomposition of the space Dk of Hodge decomposition, which is a basic theory showing that R signals of order k of dimension D . First of all, the property the eigenvectors of any order can be split into three different k B B = 0 implies img(B ) ⊆ ker(B ). Hence, each classes, each representing a specific behavior of the signal; iii) k k+1 k+1 k vector x ∈ ker(B ) can be decomposed into two parts: one we provide a theory showing how to build a topology-aware k belonging to img(B ) and one orthogonal to it. Further- unitary basis to represent signals of various order starting only k+1 more, since the whole space Dk can always be written as from topological properties. R Dk T Dk R ≡ ker(Bk)⊕img(Bk ), it is possible to decompose R into the direct sum [47] A. Relations between eigenvectors of different order Dk T There are interesting relations between the eigenvectors R ≡ img(Bk ) ⊕ ker(Lk) ⊕ img(Bk+1), (18) of Laplacian matrices of different order [45], [46] which is where the vectors in ker(Lk) are also in ker(Bk) and useful to recall as they play a key role in spectral analysis. T k ker(Bk+1). This implies that, given any signal s of order k, The following properties hold true for the eigendecomposition k−1 k k+1 there always exist three signals s , sH , and s , of order of k-order Laplacian matrices, with k = 1,...,K − 1. k − 1, k, and k + 1, respectively, such that sk can always be expressed as the sum of three orthogonal components: Proposition 1: Given the Laplacian matrices Lk of any order k T k−1 k k+1 k, with k = 1,...,K − 1, it holds: s = Bk s + sH + Bk+1 s . (19) 1) the eigenvectors associated with the nonzero eigenvalues This decomposition is known as Hodge decomposition [48] BT B of k k are orthogonal to the eigenvectors associated and it is the extension of the for differential B BT with the nonzero eigenvalues of k+1 k+1 and vicev- forms on Riemannian to simplicial complexes. The ersa; T subspace ker(Lk) is called harmonic subspace since each 2) if v is an eigenvector of BkBk associated with the k T T sH ∈ ker(Lk) is a solution of the discrete Laplace equation eigenvalue λ, then Bk v is an eigenvector of Bk Bk, k T T k associated with the same eigenvalue; Lk sH = (Bk Bk + Bk+1Bk+1) sH = 0. 5

When embedded in a real space, a fundamental property of divergence, and then it may be called a solenoidal component, geometric simplicial complexes of order k is that the dimen- in analogy to the calculus terminology used for vector fields. 1 sions of ker(Lk), for k = 0,...,K are topological invariants The harmonic component sH is a flow vector that is both T 0 of the K-simplicial complex, i.e. topological features that are curl-free and divergence-free. Notice also that, in (24), B1 s preserved under homeomorphic transformations of the space. represents the (discrete) gradient of s0. The dimensions of ker(Lk) are also known as Betti numbers βk of order k: β0 is the number of connected components of the graph, β1 is the number of holes, β2 is the number of C. Topology-aware unitary basis cavities, and so on [48]. The decomposition in (19) shows an interesting interplay In this section, we propose a method to build a unitary basis between signals of different order, which we will exploit in to represent edge signals, reflecting topological properties of the ensuing sections. Before proceeding, it is useful to clarify the complex, more specifically triple-wise relations. The idea the meaning of the terms appearing in (19). Let us consider, underlying the method is to search a basis that minimizes for simplicity, the analysis of flow signals, i.e. the case k = 1. a real-valued function capturing triple-wise relations. For To this end, it is useful to introduce the curl and divergence example, in the graph case, a key topological property is operators, in analogy with their continuous time counterpart connectivity, which is well captured by the cut function. The operators applied to vector fields. More specifically, given an associated real-valued function can be built using the so called edge signal s1, the (discrete) curl operator is defined as Lovasz´ extension [49], [50] of the cut size. More specifically, 1 T 1 curl(s ) = B2 s . (20)

1 e v5 This operator maps the edge signal s onto a signal defined v4 11 over the triangle sets, i.e. in RT , and it is straightforward to e verify that the generic i-th entry of the curl(s1) is a measure 10 e9 of the flow circulating along the edges of the i-th triangle. As e8 v3 t3 an example for the simplicial complex in Fig. 1, we have e  0 0 1 −1 1 0 0 0 0 0 0  e2 7 T t2 B = 0 1 −1 0 0 0 1 0 0 0 0 (21) v 2   v2 6

0 0 0 0 0 0 −1 1 1 0 0 e3 t1 1 T 1 e5 and, defining s = [e , e , . . . , e , e ] , we get curl(s ) = e1 1 2 10 11 e4 [e − e + e , e − e + e , −e + e + e ]T . Each entry of s1 3 4 5 2 3 7 7 8 9 v7 v1 is then the circulation over the corresponding triangle. e6 Similarly, the (discrete) divergence operator maps the edge 1 V Fig. 1: Cut of order 1. signal s onto a signal defined over the vertex space, i.e. R , and it is defined as 1 1 given a graph G(V, E) and a partition of its vertex set V in div(s ) = B1s . (22) two sets A0 and A1, with A0 ∪ A1 = V and A0 ∩ A1 = ∅, Again, by direct substitution, it turns out that the i-th entry of the cut size is defined as 1 div(s ) represents the net-flow passing through the i-th vertex, X X i.e. the difference between the inflow and outflow at node i. F0(A0, A1) = cut(A0, A1) , aij (25) Thus a non-zero divergence reveals the presence of a source i∈A0 j∈A1 or sink node. For the example in Fig. 1, we get where a = 1 if (i, j) ∈ E and a = 0 otherwise. F (A , A )   ij ij 0 0 1 −1 0 0 0 0 −1 0 0 0 0 0 is a set function and is known to be a submodular function  1 −1 −1 −1 0 0 0 0 0 0 0  [49]. Now, we want to translate the set function F (A , A )   0 0 1  0 1 0 0 0 0 −1 0 −1 −1 0  onto a real-valued function f (x0), defined over V , to be   0 R B1 =  0 0 0 0 0 0 0 0 0 1 −1  (23) used for the subsequent optimization. This can be done using    0 0 0 0 0 0 0 −1 1 0 1  the so called Lovasz´ extension [49], which is equal to:    0 0 1 0 −1 0 1 1 0 0 0  00011100000 V V 0 X X 0 0 f0(x ) = aij|x − x |. (26) 1 i j so that div(s ) = [−e1 − e6, e1 − e2 − e3 − e4, e2 − e7 − e9 − i=1 j=1 T e10, e10 −e11, −e8 +e9 +e11, e3 −e5 +e7 +e8, e4 +e5 +e6] . If we consider equation (19) in the case k = 1, with x0 ∈ RV . The function in (26) measures the total variation of a signal defined over the nodes of a graph and then s1 = BT s0 + s1 + B s2, (24) 1 H 2 it can be used as a regularization function, whenever the signal recalling that B1B2 = 0, it is easy to check that the first of interest is known to be a smooth function. The function in component in (24) has zero curl, and then it may be called an (26) formed the basis of the method proposed in [51] to build irrotational component, whereas the third component has zero a unitary basis for analyzing signals defined over the vertices 6 of a graph. More specifically, in [51] the basis was built by R, evaluated at x1 ∈ RE, of the triple-wise coupling size solving the following optimization problem F1(A0, A1, A2) defined in (29), is

E T E V 1 X 1 1 1 X X 1 X f1(x ) = aijk|xi − xj + xk| = B2(i, j)xi U , (u1,..., uV ) = arg min f0(un) U∈ V ×V i,j,k=1 j=1 i=1 R n=2 (30) s.t. UT U = I, u = √1 1. where B (i, j) are the edge-triplet incidence coefficients de- 1 V 2 (27) fined in (3). In the above problem, the objective function to be minimized is Proof. Please see Appendix A. convex, but the problem is non-convex because of the unitarity The function in (30) represents the sum of the absolute values constraint. To simplify the search for the basis, the objective of the curls over all the triangles. Differently from [54], function in (27) can be relaxed to become where the total variation over a hypergraph was built from a bipartition of a discrete set and it was a function defined over V V RV , in our case, we start from a triparition of the original set r 0 X X 0 0 2 E f0 (x ) = aij(xi − xj ) . (28) and we build a real-valued extension, defined over R , i.e. a i=1 j=1 space of dimension equal to the number of edges. A suitable unitary basis for representing (curling) edge signals can then Substituting (28) in (27), we still have a non-convex problem. be found as the solution of the following optimization problem However, its solution is known to be given by the eigenvectors E of L0. From this perspective, the basis typically used in the X U , (u1,..., uE) = arg min f1(un) GFT, built as the eigenvectors of the Laplacian matrix, can be E×E U∈R n=1 (31) interpreted as the solution of the above optimization problem, T after relaxation. s.t. U U = I. Now, we extend the previous approach to find a regularization The objective function is convex, but the problem is non- function capturing triple-wise relations, useful to analyze sig- convex because of the unitarity constraint. The above problem nals defined over the edges of a graph. In the previous case, can be solved resorting to the algorithm proposed in [51], the analysis of node signals started from the bi-partition of tuned to the new objective function given in (30). Alternatively, a graph and then the construction of a real-valued extension similarly to what is typically done with graphs, f (x1) can be of the cut size. Here, to analyze edge signals, we need to 1 relaxed and replaced with start from the tri-partition of the discrete set and define a set triangles function counting the number of gluing the three sets T E !2 together. The function to be minimized should then be the real r 1 X X 1 1 T T 1 f (x ) = B2(i, j)x = x B2B x . (32) valued (Lovasz)´ extension of such a set function. 1 i 2 j=1 i=1 The combinatorial study of simplicial complexes has at- tracted increasing attention and the generalization of Cheeger Substituting this function back to (31), the solution is given T inequalities to simplicial complexes has been the subject of by the eigenvectors of B2B2 . T several works [52], [53]. In particular, Hein et al. introduced The above arguments show that the eigenvectors of B2B2 the total variation on hypergraphs as the Lovasz´ extension provide a suitable basis to analyze edge signals having a of the hypergraph cut, defined as the size of the hyperedges curling behavior. However, as we know from the Hodge set connecting a bipartition of the vertex set [54]. In this decomposition recalled in the previous section, edge sig- work, we derive a Lovasz´ extension defined over the edge set. nals contain other useful components that are orthogonal to More formally, we study a simplicial complex X (V, E, T ) of solenoidal signals, namely irrotational and harmonic compo- order two, as in the example sketched in Fig.1, and consider nents. Then, in general, it is useful to take as a unitary basis the partition of V in three sets (A0, A1, A2). The triple-wise for analyzing edge signals all the eigenvectors of L1, i.e. the T coupling function is now defined as eigenvectors associated to the nonzero eigenvalues of B2B2 T and of B1 B1, plus the eigenvectors associated to the of X X X L . In summary, the theory developed in this section provides F (A , A , A ) = a (29) 1 1 0 1 2 ijk a further argument, rooted on intrinsic topological properties i∈A j∈A i∈A 0 1 2 of the simplicial complex on which the signal is defined, to exploit the Hodge decomposition to find a suitable basis where aijk = 1 if (i, j, k) ∈ T and aijk = 0 otherwise. Our for the analysis of edge signals. Furthermore, the theory says main result, stated in the following theorem, is that the Lovasz´ that using the Laplacian eigenvectors comes as a consequence extension of the triple-wise coupling function gives rise to a of a relaxation step. In many cases, whenever the numerical measure of the curl of edge signals along triangles. complexity is not an issue, it may be better to keep the `1- Theorem 1: Let A0, A1, A2 be a partition of the vertex set norm objective functions given in (26) and (30), as this would V of the 2-dimensional simplicial complex X (V, E, T ) with yield more sparse, and then more informative, representations, |E| = E, |T | = T . Then the Lovasz´ extension f : RE → as a generalization of what done for directed graphs in [51]. 7

2 IV. EDGE FLOWS ESTIMATION This means that s and λ2 may differ only by an additive T 2 Let us consider now the observation of an edge signal vector lying in the nullspace of B2 B2. Let us set s = λ2+c2, T with B B2c2 = 0. Similarly, multiplying (c) by B1 and using affected by additive noise. Our goal in this section, is to formu- 2 ¯ late the denoising problem as a constrained problem, rooted on the first equality in (d) and condition (a), we obtain the Hodge decomposition. Denoising edge signals embedded T 0 T B1B1 s = B1B1 λ1, (38) in noise was already considered in [42], [31], [32] and, more 0 T specifically, in [18]. The formulation proposed in [18] was which implies that s = λ1+c1, with c1 such that B1B1 c1 = 0. Thus, condition (c) reduces to based on the definition of signals over simplicial complexes ¯ as skew-symmetric arrays of dimension equal to the order of 1 1 T 0 2 sH = x − B1 s − B2s (39) the corresponding simplex plus one. So, the edge flow was represented as a matrix (i.e., a two-dimensional array) satisfy- which says, as expected, that we can derive the harmonic ing the constraint X(i, j) = −X(j, i). A signal defined over component by subtracting the estimated solenoidal and irrota- 1 the triangles was represented as an array of dimension three, tional parts from the observed flow signal x . To recover the 1 0 satisfying the property Φ(i, j, k) = Φ(j, k, i) = Φ(k, i, j) = irrotational flow sirr from the 0-order signal s we need to −Φ(j, i, k) = −Φ(i, k, j) = −Φ(k, j, i). Our aim in this solve equation (a) in (36). Note that L0 is not invertible. For section, is to formulate the denoising optimization problem connected graphs, it has rank V −1 and its kernel is the span of 1 dealing only with vectors. Based on (19), the observed vector the vector 1 of all ones. However, since the vector b = B1x 0 1 can be modeled as is also orthogonal to 1, the normal equation L0s = B1x admits the nontrivial solution (at least for connected graphs): 1 T 0 1 2 1 x = B1 s + sH + B2 s + v (33) 0 † 1 s = L0B1x where v1 is noise. Let us suppose, for simplicity, that the noise where † denoted the Moore-Penrose pseudo-inverse. vector is Gaussian, with zero-mean entries all having the same Similarly, the 2-order signal s2, solution of the second variance σ2 . The optimal estimator can then be formulated as n equation in (36), can be obtained as the solution of the following problem 2 T † T 1 0 2 1 2 T 0 1 1 2 s = (B2 B2) B2 x (sˆ , sˆ , sˆH) = argmin k B2s + B1 s + sH − x k 0∈ V , 2∈ T , 1 ∈ E T 1 T s R s R sH R since B2 x is orthogonal to the null space of B2 B2. The 1 s.t. B1sH = 0 irrotational, solenoidal and harmonic components can then be T 1 recovered as follows B2 sH = 0 (Q). 1 T 0 T † 1 (34) sˆirr = B1 sˆ = B1 L0B1x Note that problem Q is convex. Then, there exists multipliers 1 2 T † T 1 sˆsol = B2sˆ = B2(B2 B2) B2 x (40) λ ∈ V , λ ∈ T such that the tuple (sˆ0, sˆ2, sˆ1 , λ , λ ) 1 1 1 1 1 R 2 R H 1 2 sˆH = x − sˆirr − sˆsol. satisfies the Karush-Kuhn-Tucker (KKT) conditions of Q (note that Slater’s constraint qualification is satisfied [55]). The Note that the first two conditions in (36) imply that the associated Lagrangian function is variables s0,s2 in Q are indeed decoupled so that the optimal 0 2 1 2 T 0 1 1 T solutions coincide with those of the following problems L(s , s , sH, λ1, λ2) = (B2s + B1 s + sH − x ) (35) 0 T 0 1 2 2 T 0 1 1 T 1 T T 1 sˆ = argmin k B1 s − x k (Q0) (41) ·(B2s + B1 s + sH − x ) + λ1 B1sH + λ2 B2 sH . 0 V s ∈R Exploiting the orthogonality property B1B2 = 0, it is easy to 2 2 1 2 sˆ = argmin k B2s − x k (Q2). (42) 2 T get the following KKT conditions s ∈R 0 2 1 T 0 1 (a) ∇s0 L(s , s , s , λ1, λ2) = B1B s − B1x = 0 H 1 An example of an edge (flow) signal is reported in Fig. 2 (a), 0 2 1 T 2 T 1 (b) ∇s2 L(s , s , sH, λ1, λ2) = B2 B2s − B2 x = 0 representing the simulation of data packet flow over a network, 0 2 1 1 1 T including measurement errors. The signal values are encoded (c) ∇s1 L(s , s , s , λ1, λ2)=sH − x + B1 λ1 + B2λ2 =0 H H in the gray color associated to each link. Suppose now, that 1 T 1 (d) B1sH = 0, B2 sH = 0 the goal of processing is to detect nodes injecting anomalous traffic in the network, starting from the traffic shown in Fig. (e) λ ∈ V , λ ∈ T . 1 R 2 R 2(a). If some node is a source of an anomalous traffic, that Note that conditions (a)-(c) reduce to node generates an edge signal with a strong irrotational (or divergence-like) component. To detect spamming nodes, we (a) L s0 = B x1 0 1 can then project the overall observed traffic vector s1 onto T 2 T 1 (b) B2 B2s = B2 x (36) the space orthogonal to the space spanned by the nonzero 1 1 T T (c) sH = x − B1 λ1 − B2λ2. eigenvectors of B2B2 , computing T 1  T  1 Multiplying both sides of condition (c) by B2 , and using the y = I − UsolUsol s (43) second equality in (d) and condition (b), we get where Usol is the matrix whose columns collect the eigenvec- T 2 T T B2 B2s = B2 B2λ2. (37) tors associated with the nonzero eigenvalues of B2B2 . The 8

1 where ΣF = diag(1F ). An edge signal s is F-bandlimited 1 1 over a frequency set F if FF s = s . The operators DS and FF are self-adjoint and idempotent and represents orthogonal projectors, respectively, on the sets S and F. If we look for edges signals which are perfectly localized in both the edge and frequency domains, some conditions for perfect localization have been derived in [56]. More specifically: a) s1 is perfectly localized over both the edge set S and the frequency set F if and only if the operator FF DS FF has an eigenvalue equal to one, i.e.

kDS FF k2 = kFF DS k2 = kFF DS FF k2 = 1 (46) (a) where kAk2 denotes the spectral norm of A; b) a sufficient condition for the existence of such signals is that |S| + |F| > E. Conversely, if |S|+|F| ≤ E, there could still exist perfectly localized vectors, when condition (46) holds. In the following we first consider sampling on a single layer by extracting samples from the edge signals and then, we consider multi-layer processing using samples of signals defined over different layers of the simplex.

A. Single-layer sampling In the following theorem [56] we provide a necessary and sufficient condition to recover the edge signal s1 from its (b) 1 1 samples sS , DS s . 1 1 Fig. 2: Observed flow on a simplicial complex (a) and recon- Theorem 2: Given the bandlimited edge signal s = FF s , struction of the irrotational flow (b). it is possible to recover s1 from a subset of samples collected over the subset S ⊆ E if and only if the following condition holds: result of this projection is reported in Fig. 2(b), where we can kD¯ S FF k2 = kFF D¯ S k2 < 1 (47) clearly see that the only edges with a significant contributions ¯ are the ones located around two nodes, i.e. the source nodes with DS = I − DS . injecting traffic in the network, whose divergence is encoded Proof. The proof is a straightforward extension of Th. 4.1 in by the node color. [56] to signals defined on the edges of the complex. In words, the above conditions mean that there can be V. SAMPLING AND RECOVERING OF SIGNAL DEFINED no F-bandlimited signals that are perfectly localized on the ¯ 1 OVERSIMPLICIALCOMPLEX complementary set S. Perfect recovery of the signal s from s1 can be achieved as Suppose now that we only observe a few samples of a S 1 1 k-order signal. The question we address here is to find the r = QS sS (48) conditions to recover the whole signal sk from a subset of where Q = (I − D¯ F )−1. The existence of the above samples. To answer this question, we may use the theory S S F inverse is ensured by condition (47). In fact, the reconstruction developed in [56] for signals on graph, and later extended error can be written as [56] to hypergraphs in [57]. For simplicity, we focus on signals 1 1 1 ¯ 1 1 ¯ 1 defined over a simplicial complex of order K = 2, i.e. on s −QS sS = s −QS (I−DS )s = s −QS (I−DS FF )s = 0, vertices, edges and triangles. Given a set of edges S ⊆ E we where we exploited in the second equality the bandlimited define an edge-limiting operator as a diagonal matrix D of S condition s1 = F s1. dimension equal to the number of edges, with a one in the F To make (47) holds true, we must guarantee that positions where we measure the flow, and zero elsewhere, i.e. 1 DS FF s 6= 0 or, equivalently, that the matrix DS FF is ¯ DS = diag{1S } (44) full column rank, i.e. rank(DS FF ) = |F|. Then, a necessary condition to ensure this holds is |S| ≥ |F|. where 1S is the set indicator vector whose i-th entry is equal An alternative way to retrieve the overall signal s1 from its to one if the edge ei ∈ S and zero otherwise. We say that an 1 samples can be obtained as follows. If (47) holds true, the edge signal s is perfectly localized over the subset S ⊆ E 1 1 1 1 entire signal s can be recovered from sS as follows (or S-edge-limited) if s = DS s . Similarly, given the matrix 1 T −1 T 1 U1 whose columns are the eigenvectors of L1, and a subset s = UF UF DS UF UF sS (49) of indices F, we define the operator where UF is the E × |F| matrix whose columns are the T FF = U1ΣF U1 (45) eigenvectors of L1 associated with the frequency set F. 9

Remark: It is worth to notice that, because of the Hodge Let us now consider the case where we want to recover decomposition (18), an edge signal always contains three s1 from samples collected from signals of order 0, 1, and 2, 0 1 2 2 2 components that are typically band-limited, as they reside on i.e. s , s and s . We denote by sM = DMs the vector a subspace of dimension smaller than E. This means that, of triangle signal samples with M ⊆ T . Furthermore, given if one knows a priori, that the edge signal contains only one the matrix U2 with columns the eigenvectors of the second- L F2 = U Σ UT component, e.g. solenoidal, irrotational, or harmonic, then it is order Laplacian 2, we define the operator F2 2 F2 2 2 2 possible to observe the edge signal and to recover the desired where |F2| denotes the bandwidth of s . Then, assuming s s2 = F2 s2 C component over all the edges, under the conditions established bandlimited, it holds F2 . Denote with 2 the set of by Theorem 2. indexes in F2 associated with the eigenvectors belonging to 2 ker(L2) and with U the matrix whose columns are the B. Multi-layer sampling F2−C2 eigenvectors of L2 associated with the index set F2 − C2. In this section, we consider the case where we take samples Theorem 4: Consider the second-order simplex X (V, E, T ), of signals of different order and we propose two alternative the edge signal s1 = B s2 + s1 + BT s0 and assume that: 1 2 H 1 strategies to retrieve an edge signal s from these samples. i) the vertex-signal s0, the edge signal s1 and the triangle 1 The first approach aims at recovering s by using both, the signal s2 are bandlimited with bandwidth, respectively, |F |, 0 0 0 vertex signal samples sA = DAs , with A ⊆ V, and the edge |F | = |F | + |F | + |F | − (c + c ) and |F |, with c ≥ 0 1 1 1 0 H 2 1 2 2 1 samples s = DS s . Hereinafter, we denote by Firr, Fsol and S and c2 ≥ 0 the number of eigenvectors in the bandwidth of FH the set of frequency indexes in F corresponding to the 0 2 s and s belonging, respectively, to ker(L0) and ker(L2); eigenvectors of L1 belonging, respectively, to the irrotational, ii) all conditions kD¯ F0 k < 1, kD¯ F k < 1 and A F0 2 S FH 2 solenoidal and harmonic subspaces. Note that, if s1 is F- 2 kD¯ MF k2 < 1 hold true. Then, it follows: bandlimited then also s1 , s1 and s1 are bandlimited with F2 H sol irr a) s1 can be perfectly recovered from a set of vertex signal bandwidth, respectively, |FH|, |Fsol| and |Firr|. Define FsH as 0 0 1 1 samples sA = DAs , from the edge samples sS = DS s the set of frequency indexes such that FsH = FH ∪ Fsol. Fur- 2 2 and from the triangle samples sM = DMs as thermore, given the matrix U0 with columns the eigenvectors of L , we define the operator F0 = U Σ UT where |F |  s0   s0  0 F0 0 F0 0 0 A 0 denotes the bandwidth of s . Let C1 be the set of indexes  1   1   sH  = R  sS  , (52) in F0 associated with the eigenvectors belonging to ker(L0)     U0 2 2 and denote by F0−C1 the matrix whose columns are the s sM eigenvectors of L associated with the frequency set F − C . 0 0 1 where Then, we can state the following theorem.  (I − D F0 )−1 OO  Theorem 3: Consider the second-order simplex X (V, E, T ) A F0 and the edge signal s1 = s1 +s1 +BT s0. Then, assume that:   sol H 1 R =  P1 P2 P3  i) the vertex-signal s0 and the edge signal s1 are bandlimited   OO (I − D F2 )−1 with bandwidth, respectively, |F0| and |F| = |FsH|+|F0|−c1, M F2 where |F | = |F | + |F | and c ≥ 0 denotes the number (53) sH sol H 1 −1 T 0 0 −1 0 and P1 = −(I−DS FF ) DS B F (I−DAF ) , of eigenvectors in the bandwidth of s belonging to ker(L0); H 1 F0 F0 0 −1 ¯ ¯ P2 = (I − DS FFH ) , P3 = −(I − ii) the conditions kDAFF k2 < 1 and kDS FFsH k2 < 1 hold 0 D F )−1D B F2 (I − D F2 )−1; true. Then, it follows that: S FH S 2 F2 M F2 a) s1 can be perfectly recovered from both a set of vertex b) There exist two sets of frequency indexes Fsol, Firr ⊂ F, 0 0 for which the eigenvectors of L1 stacked in the columns of signal samples sA = DAs and from the edge samples 1 1 the matrices UFsol ,UFirr satisfy, respectively, the equality sS = DS s via the set of equations 2 T 0 UFsol = B2UF −C , and UFirr = B1 UF −C . Then, any " 0 # " 0 # 2 2 0 1 s sA F-bandlimited edge signal with frequency set F = F ∪ = Q , (50) sol s¯1 s1 Firr ∪ FH, and |Firr| = |F0| − c1, |Fsol| = |F2| − c2, S 0 can be recovered by using N0 ≥ |F0| samples from s , 1 1 1 1 where s¯ = ssol + sH , N1 ≥ |FH| samples from s and N2 ≥ |F2| samples from 2 " (I − D F0 )−1 O # s . A F0 Q = (51) Proof. See Appendix B. −1 P (I − DS FFsH ) and P = −(I − D F )−1D BT F0 (I − D F0 )−1; VI.ESTIMATION OF DISCRETE VECTOR FIELDS S FsH S 1 F0 A F0 b) There exists a set of frequency indexes Firr ⊂ F, for Developing tools to process signals defined over simplicial

which the eigenvectors matrix UFirr of L1 satisfies the complexes is also useful to devise algorithms for processing equality U = BT U0 , and such that any F- Firr 1 F0−C1 discrete vector fields. The use of algebraic topology, and bandlimited edge signal with set of frequency indexes more specifically Discrete Exterior Calculus (DEC), has been F = Firr ∪ FsH and |Firr| = |F0| − c1 can be recovered already considered for the analysis of vector fields, especially 0 by using N0 ≥ |F0| samples from s and N1 ≥ |FsH| in the field of computer graphic. Exterior Calculus (EC) is a samples from s1. discipline that generalizes vector calculus to smooth Proof. See Appendix B. of arbitrary dimensions [58] and DEC is a discretization of EC 10 on simplicial complexes [59], [60]. DEC methodologies have that fits the observed vector x1, it is smooth and sparse. We been already proposed in [61] to produce smooth tangent vec- recover the vector s1 as the solution of tor fields in computer graphics. Smoothing tangential vector 1 1 1 T 1 1 min k s − x k2 +λs L1s + γ k s k1 (PF ) fields using techniques based on the spectral decomposition 1 E s ∈R of higher order Laplacian was also advocated in [62]. In this where λ, γ are positive penalty coefficients controlling, re- section, building on the basic ideas of [61], [62], we propose spectively, the signal smoothness and its sparsity. Since we an alternative approach to smooth discrete vector fields. More chose the Whitney form as interpolation basis, for any two specifically, as in [61], [62], the proposed approach is based on signals of order p, xp, yp ∈ Np , the induced inner product three main steps: i) map the vector field onto an edge signal; R is given by xpT M yp, where the N × N -dimensional ii) filter the edge signal; and iii) map the edge signal back p p p matrix M , incorporating the Whitney metric, is derived as in to a vector field. Steps i) and iii) are essentially the same as p [60][Prop. 9.7]. Therefore, the Hodge Laplacian L , weighted in [62]. The difference we introduce here is in step ii), i.e. in 1 with the appropriate metric matrices M , with p = 0, 1, 2, is the filtering of the edge signal. Filtering edge signals has been p a semidefinite positive matrix that can be written as considered before, see, e.g., [32], but in our case we consider a T T −1 different formulation and we incorporate the proper metrics in L1 = B2M2B2 + M1B1 M0 B1M1 (56) the Hodge Laplacian, dictated from the initial mapping from the vector field to the corresponding edge signal. while the fitting error norm becomes To filter vector fields, we need first to embed the vector field 1 1 1 1 T 1 1 k s − x k2= (s − x ) M1(s − x ). (57) and the simplicial complex into a real space. Let us denote with X a simplicial complex embedded in a real space Rn 3) Map the filtered signal back onto a discrete vector field 1 s1 ∀e ∈ E of dimension n. We assume that X is flat, i.e. all simplices Finally, given the discrete signal of order , eij , ij , are in the same affine n-subspace, and well-centered, which the generated piecewise linear vector field becomes [61] means the circumcenter of each simplex lies in its interior X 1 ~v(x) = s [φi(x)∇φj(x) − φj(x)∇φi(x)]. (58) [59]. A discrete vector field ~v can be considered as a map eij n eij from points xi of X to R . An illustrative example is shown in Fig. 3 (a), where the discrete vector field is represented by We show now an interesting application of the above pro- the set of arrows associated to the points in a regular grid in cedure to the analysis of the vector field representing the R2. In many applications, it is of interest to develop tools to RNA velocity field, defined as the time derivative of the extract relevant information from a vector field or to reduce gene expression state [41], useful to predict the future state the effect of noise. Our proposed approach is based on the of individual cells and then the direction of change of the following three main steps: entire transcriptome during the RNA sequencing dynamic 1) Map a vector field onto a simplicial complex signal process. The RNA velocity can be estimated by distinguishing Given the set of N points xi, we build the simplicial complex between nascent (unspliced) and mature (spliced) messenger X using a well-centered Delaunay with vertices RNA (mRNA) in common single-cell RNA sequencing pro- (0-simplicies) in the points xi [59]. Then, starting from the set tocols [41]. We consider the mRNA-seq dataset of mouse of vectors ~v(xi), we recover a continuous vector field through chromaffin cell differentation analysed in [41]. An example interpolation based on the Whitney basis functions φi(x) [63] of RNA velocity vector field is illustrated in Fig. 3(a). To analyze such a vector field using the proposed algorithms, N X we implemented steps 1) and 3) using the discrete exterior ~v(x) = ~v (x )φ (x) (54) i i calculus operators provided by the PyDEC software developed i=1 in [60]. We considered a Delaunay well-centered triangulation where φi(x) are affine piecewise functions. More specifically, of the continuous bi-dimensional space, which generates the φi is an hat function on vertex vi, which takes on the value one simplicial complexes in Fig. 3 composed of N = 400 nodes, at vertex vi, i.e. φi(xi) = 1, is zero at all other vertices, and where we fill all the triangles. The velocity field in Fig. 3(a) is affine over each 2-simplex having vi as vertex. Then, project observed over 156 vertices and the field vector at each vertex the vector field onto the set of 1-simplices (edges), giving rise has been obtained with a local Gaussian kernel smoothing [41]. to a set of scalar signals defined over the edges of the complex The underlying colored cells represented different states of the ! cell differentiation process. Then, to test our filtering strategy, ~v(σ0 ) + ~v(σ0 ) j0 j1 we added a strong Gaussian noise to the observed velocity x1 = · ~σ1, j = 1,...,E (55) j 2 j field, with SNR = 0 dB, as illustrated in Fig. 3(b). This noise is added to incorporate mRNA molecules degradation and model 1 where j0 and j1 are the end-points of edge j and ~σj stands inaccuracy. Then we apply the proposed filtering strategy σ1 = [σ0 , σ0 ] for the vector corresponding to j j0 j1 , having the by first reconstructing the edge signal as in (54) and then 1 1 E same direction as the orientation of σj . The vector x ∈ R recovering the edge signal as a solution of the optimization of edge signals associated with the vector field ~v, is then built problem PF . Finally, we reconstruct the vector field using the 1 1 1 T as x = [x1, . . . , xE] . interpolation formula (58), observed at the barycentric points 2) Filter the signal defined on the simplicial complex of each triangles. The result is reported in Fig. 3(c), where we The filtering strategy aims to recover an edge flow vector s1 can appreciate the robustness of the proposed filtering strategy. 11

Observed RNA velocity field (or it has been estimated). So, we start from the knowledge of L0, which implies, after selection of an orientation, knowledge T T of B1. Since L1 = B1 B1 + B2B2 , then we need to estimate B2. Before doing that, we check, from the data, if the term T B2B2 is really needed. Since, from (24), the only components that may depend on B2 are the solenoidal and harmonic components, we first project the observed flow signal onto the space orthogonal to the space spanned by the irrotational component, by computing 1 T  1 xsH(m) = I − UirrUirr x (m), m = 1,...,M, (59)

where Uirr is the matrix whose columns are the eigenvectors (a) T associated with the nonzero eigenvalues of B1 B1. Then, Noisy RNA velocity field 1  1 1  denoting with XsH = xsH(1),..., xsH(M) the signal matrix 1 of size E × M, we measure the energy of XsH by taking its 1 norm kXsHkF : If the norm is smaller than a threshold η of the averaged energy of the observed data set, we stop, otherwise we proceed to estimate B2. The first step in the estimation of B2 starts from the detection of all cliques of three elements present in the graph. h 3i Their number is T = trace (L0 − diag(L01)) /6. For each clique, we choose, arbitrarily, an orientation for the potential triangle filling it. The matrix B2 can then be written as T X T B2 = tnbnbn (60) (b) n=1

Filtered RNA velocity field where bn is the vector of size E associated with the n-th clique, whose entries are all zero except the three entries associated with the three edges of the n-th clique. Those entries assume the value 1 or −1, depending on the orientation of the triangle associated with the n-th clique. The coefficients tn in (60) are equal to one, if there is a (filled) triangle on the n-th clique, or zero otherwise. The goal of our inference algorithm is then to decide, starting from the data, which entries of t := [t1, . . . , tT ] are equal to one or zero. Our strategy is to make the association that enforces a small total variation of the observed flow signal on the inferred complex, using (32) as a measure of total variation on flow signals. We (c) propose two alternative algorithms: The first method infers the structure of B2 by minimizing the total variation of the Fig. 3: RNA velocity fields: (a) observed and (b) noisy fields; observed data; the second method performs first a Principal (c) reconstructed field. Component Analysis (PCA) and then looks for the matrix B2 and the coefficients of the expansion over the principal components that minimize the total variation plus a penalty VII.INFERENCEOFSIMPLICIALCOMPLEXTOPOLOGY on the model fitting error. FROM DATA Minimum Total Variation (MTV) Algorithm: The goal of The inference of the graph topology from (node) signals is this algorithm is to minimize the total variation over the a problem that has received significant attention, as shown in observed data set, assuming knowledge of the number of the recent tutorial papers [64], [39], [40] and in the references triangles. The set of coefficients t is found as solution of therein. In this section, we propose algorithms to infer the structure of a simplicial complex. Given the layer structure T X  1 T T 1  of a simplicial complex, we propose a hierarchical approach min q(t) , tntrace XsH bnbn XsH (PMTV) t∈{0,1}T that infers the structure of one layer, assuming knowledge n=1 ∗ of the lower order layers. For simplicity, we focus on the s.t. k t k0= t , tn ∈ {0, 1}, ∀n, inference of a complex of order 2 from the observation of (61) a set of M edge (flow) signals X1 := x1(1),..., x1(M), where t∗ is the number of triangles that we aim to detect. assuming that the topology of the underlying graph is given In practice, this number is not known, so it has to be found 12

through cross-validation. Even though problem PMTV is Algorithm PCA-BFMTV T ∗ non-convex, it can be solved in closed form. Introducing the Set γ > 0, t[0] ∈ {0, 1} , k t[0] k0= t , PM 1 T T 1 T nonnegative coefficients cn = i=1 xsH (i)bnbn xsH(i), the X T Lupp[0] = tn[0]bnbn , k = 1 solution can in fact be obtained by sorting the coefficients cn n=1 in increasing order and then selecting the triangles associated Repeat ∗ 1 T −1 T 1 with the indices of the t lowest coefficients cn. Note that Set SbsH[k] = (IF + γ Ub sHLupp[k − 1]Ub sH) Ub sHXsH; the proposed strategy infers the presence of triangles along Compute t[k] by sorting the coefficients 1 T T T 1 the cliques having the minimum curl along its edges. Hence, cn[k] = trace(SbsH[k] Ub sHbnbn Ub sHSbsH[k]), ∗ we expect better performance when the edge signal contains and setting to 1 the entries of t[k] corresponding to the t smallest coefficients, and 0 otherwise; only the harmonic components, whose curls along the filled Set k = k + 1, triangles is exactly null. until convergence.

PCA-based Best Fitting with Minimum Total Variation (PCA-BFMTV): To robustify the MTV algorithm in the case entries of t [k + 1] equal to 1 for the indices corresponding where the edge signal contains also a solenoidal component n to the first t∗ smallest coefficients of {c [k]}T , and 0 and is possibly corrupted by noise, we propose now the PCA- n n=1 otherwise. The iterative steps of the proposed strategy are BFMTV algorithm that infers the structure of B and the 2 reported in the box entitled Algorithm PCA-BFMTV. Now edge signal that best fits the observed data set X1, while we test the validity of our inference algorithms over both at the same time exhibiting a small total variation over the simulated and real data. inferred topology. The method starts performing a principal component analysis of the observed data by extracting the Performance on synthetic data: Some of the most critical eigenvectors associated with the largest eigenvalues of the parameters affecting the goodness of the proposed algorithms covariance matrix estimated from the observed data set. More are the dimension of the subspaces associated with the specifically, the proposed strategy is composed of two steps: solenoidal and harmonic components of the signal and the 1) estimate the covariance matrix Cb X from the edge signal number of filled triangles in the complex. In fact, in both MTV 1 data set XsH and builds the matrix Ub sH whose columns are and PCA-BFMTV a key aspect is the detection of triangles as the eigenvectors associated with the F largest eigenvalues of the cliques where the associated curl is minimum. Hence, if the 1 1 signal contains only the harmonic component and there is no Cb X ; 2) model the observed data set as XsH = Ub sHSbsH and 1 noise, the triangles can be identified with no error, because searches for the coefficient matrix SbsH and the vector t that solve the following problem the harmonic component is null over the filled triangles. However, when there is a solenoidal component or noise, there 1 1 1 2 min g(t, S ) +γ kX −UsHS k might be decision errors. To test the inference capabilities 1 bsH sH b bsH F T b F ×M t∈{0,1} ,SsH∈R of the proposed methods, in Fig. 4(a), we report the triangle error probability P , defined as the percentage of incorrectly s.t. k t k = t∗, t ∈ {0, 1}, ∀n, (P ) e 0 n TS estimated triangles with respect to the number of cliques with (62) three edges in the simplex, versus the signal-to-noise ratio T 1 X 1 T 1 (SNR), when the observation contains only harmonic flows g(t, S ) t (S UT b bT U S ) γ where bsH , n trace bsH b sH n n b sH bsH and plus noise. We considered a simplex composed of N = 50 n=1 is a non-negative coefficient controlling the trade-off between nodes and with a percentage of filled triangles with respect the data fitting error and the signal smoothness. Although to the number of second order cliques in the graph equal 50% M = 50 t∗ = 105 problem PTS is non-convex, it can be solved using an iter- to . We also set , and averaged our 3 ative alternating optimization algorithm returning successive numerical results over 10 zero-mean signal and noise random 1 realizations. The harmonic signal bandwidth |FH | is chosen estimates of SbsH, having fixed t, and alternately t, given 1 equal to 105, which is equal to the dimension of the kernel S bsH. Interestingly, each step of the alternating optimization of L . From Fig. 4(a), we can notice, as expected, that in the problem admits a closed form solution. More specifically, at 1 1 noiseless case the error probability is zero, since observing each iteration k, the coefficient matrix SbsH[k] can be found as only harmonic flows enables perfect recovery of the matrix 1 1 1 1 2 k B2. In the presence of noise, the MTV algorithm suffers and SbsH[k] = arg min g(t[k], SbsH) +γ kXsH−Ub sHSbsH kF (PS ). 1 in fact we observe a non negligible error probability at low Sb ∈ F ×M sH R SNR. However, applying the PCA-BFMTV algorithm enables PT T k a significant recovery of performance, as evidenced by the Defining Lupp[k] := n=1 tn[k]bnbn , problem PS admits the closed form solution blue curve that is entirely superimposed to the red curve, at least for the SNR values shown in the figure. In this example, 1 T −1 T 1 5 SbsH[k] = (IF + γ Ub sHLupp[k]Ub sH) Ub sHXsH. (63) the covariance matrix was estimated over 10 independent 1 observations of the edge signals. The optimal γ coefficient Then, given SbsH[k], we can find the vector t[k + 1] using the was chosen after a cross validation operation following a line same method used to solve problem MTV, in (61), i.e. setting search approach aimed to minimize the error probability. The 1 1 T T T cn[k] := trace(SbsH[k] Ub sHbnbn Ub sHSbsH[k]) and taking the improvement of the PCA-BFMTV method with respect to the 13

0.35 the number Nij of calls from node i to area j, as a function MTV of time. There is an edge between nodes i and j only if there 0.3 PCA-BFMTV noiseless MTV is a non null traffic between those points. The traffic has been 0.25 aggregated in time, over time intervals of one hour. We define the flow signal over edge (i, j) as Φij = Nij − Nji. We map 0.2 1

e all the values of matrix Φ into a vector of flow signals x . We P 0.15 observed the calls daily traffic during the month of December 2013. The data are aggregated for each day over an interval 0.1 of one hour. Our first objective is to show whether there is an 1 0.05 advantage in associating to the observed data set X a complex of order 2, i.e. a set of triangles, or it is sufficient to use a 0 −5 0 5 10 15 20 purely graph-based approach. In both cases, we rely on the SNR[dB] same graph structure, whose B1 comes from the data set, after (a) an arbitrary choice of the edges’ orientation. If we use a graph- 0.45 based approach, we can build a basis of the observed flow MTV, |Fsol| = 80 0.4 PCA-BFMTV, |Fsol| = 80 signals using the eigenvectors of the so called edge Laplacian noiseless MTV, |Fsol| = 80 low T low 0.35 in [35], i.e. L = B B1. We call this basis U . As an MTV, |Fsol| = 10 1 1 1 PCA-BFMTV, |Fsol| = 10 0.3 alternative, our proposed approach is to build a basis using the noiseless MTV, |Fsol| = 10 T T eigenvectors of L1 = B1 B1 + B2B2 , where B2 is estimated e 0.25 1 P from the data set X using our MTV algorithm. We call 0.2 this basis U1. To test the relative benefits of using U1 as low 0.15 opposed to U1 , we run a basis pursuit algorithm with the

0.1 goal of finding a good trade-off between the sparsity of the representation and the fitting error. More specifically, for any 0.05 given observed vector x1(m), we look for the sparse vector 0 1 −5 0 5 10 15 20 25 30 s as solution of the following basis pursuit problem [66]: SNR[dB] 1 (b) min k s k1 (B) 1 E s ∈R (64) 1 1 Fig. 4: Error probability by observing: (a) harmonic noisy s.t. k x − Vs kF ≤  signals; (b) harmonic plus solenoidal noisy signals. low where V = U1 in our case, while V = U1 in the graph- based approach. As a numerical result, in Fig. 5 we report the MTV method is due to the denoising made possible by the sparsity of the recovered edge signals versus the mean estima- 1 1 projection of the observed signal onto the space spanned by tion error k x − Vs kF considering as signal dictionary V the eigenvectors associated with the largest eigenvalues of the the eigenvectors of either the first-order Laplacian or the lower estimated covariance matrix. Laplacian. We used the MTV algorithm to infer the upper To test the proposed methods in the case where the observed Laplacian matrix by setting the number t∗ of triangles that we signal contains both the solenoidal and harmonic components, may detect equal to 800. This value is derived numerically in Fig. 4(b) we report Pe versus the SNR, for different values through cross-validation over a training data set, by choosing ∗ 1 of the dimension of the subspace associated with the solenoidal the value of t that yields the minimum norm k s k1 As part, indicated as |Fsol|. From Fig. 4(b), we can observe that can be observed from Fig. 5, using the set of the eigenvectors the performance of both algorithms MTV and PCA-BFMTV of L1 yields a much smaller MSE, for a given sparsity or, suffers when the bandwidth |Fsol| of the solenoidal compo- conversely, a much more sparse representation, for a given nent is large, whereas the performance degradation becomes MSE. An intuitive reason why our method performs so much negligible when |Fsol| is small. In all cases, PCA-BFMTV better than a purely graph-based approach is that the matrix low significantly outperforms the MTV algorithm, especially at L1 has a much reduced kernel space with respect to L1 and low SNR values, because of its superior noise attenuation the basis built on L1 captures much better some inner structure k capabilities. As further numerical test, we run algorithm PS present in the data by inferring the structure of the additional replacing the quadratic regularization term with the triple-wise term B2 from the data itself. low coupling regularization function in (30), by obtaining the same As a further test, we tested the two basis U1 and U1 in performance of the PCA-BFMTV algorithm. terms of the capability to recover the entire flow signal from Performance on real data: The real data set we used to test a subset of samples. To this end, we exploit the band-limited our algorithms is the set of mobile phone calls collected in the property enforced by the sparse representation, enabling the city of Milan, Italy, by Telecom Italia, in the context of the use of the theory developed in Section V.A. Starting with the Telecom Big Data Challenge [65]. The data are associated with representation of each input vector x1 as x1 = Vs1, with low a regular two-dimensional grid, composed of 100×100 points, either V = U1 in our case, or V = U1 in the graph-based superimposed to the city. Every point in the grid represents a approach, we used the Max-Det greedy sampling strategy in square, of size 235 meters. In particular, the data set collects [56] to select the subset of edges where to sample the flow 14

300 in applications over real wireless traffic data, the proposed L1 eigenvectors as signal bases Llow 1 eigenvectors as signal bases approach can significantly outperform methods based only on 250 graph representations. Furthermore, we proposed a method to analyze discrete vector fields and showed an application to the 200 recovery of the RNA velocity field to predict the evolution of living cells. In such a case, using the eigenvectors of L1 150 we have been able to highlight irrotational and solenoidal Sparsity behaviors that would have been difficult to highlight using only 100 the eigenvectors of L0. Further developments should include both theoretical aspects, especially in the statistical modeling 50 of random variables defined over a simplicial complex, and the generalization to higher order structures. 0 0 0.02 0.04 0.06 0.08 0.1 MSE APPENDIX A Fig. 5: Sparsity versus squared error. PROOFOF THEOREM 1 We begin by briefly recalling the basic properties of Lovasz´ extension [49], [67] and then we proceed to the proof of 0.22 Signal reconstruction L1 Llow Signal reconstruction 1 Theorem 1. 0.2 Definition 1: Given a set A, of cardinality A = |A|, and its 0.18 power set 2A, let us consider a set function Q : 2A → R, with A 0.16 Q(∅) = 0. Let x ∈ R be a vector of real variables, ordered w.l.o.g. in increasing order, such that x1 ≤ x2 ≤ ... ≤ xA. 0.14 Define C0 , A and Ci , {j ∈ A : xj > xi} for i > 0. Then, 0.12 the Lovasz´ extension f : RA → R of Q, evaluated at x, is 0.1 given by [49]: Normalized squared error 0.08 A−1 X f(x) = Q(A)x + Q(C )(x − x ). (65) 0.06 1 i i+1 i i=1 50 100 150 200 250 Number of samples An interesting property of set functions is submodularity, Fig. 6: Reconstruction error versus number of samples. defined as follows: Definition 2: A set function Q : 2A → R is submodular if and only if, ∀B, C ⊆ A, it satisfies the following inequality: signal and then we used the recovery rule in (49) to retrieve the overall flow signal from the samples. The numerical results are Q(B) + Q(C) ≥ Q(B ∪ C) + Q(B ∩ C). reported in Fig. 6, representing the normalized recovering error A fundamental property is that a set function Q is submodular of the edge signal versus the number Ns of samples used to iff its Lovasz´ extension f(x) is a convex function and that reconstruct the overall signal. We can notice how introducing A T minimizing f(x) on [0, 1] , is equivalent to minimizing the the term B2B2 , we can achieve a much smaller error, for the set function Q on 2A [49, p.172]. Now, we wish to apply same number of samples. Lovasz´ extension to a set function, defined over the edge set VIII.CONCLUSION E, counting the number of triangles gluing nodes belonging to In this paper we have presented an algebraic framework a tri-partition of the node set V. To this end, let us consider to analyze signals residing over a simplicial complex. In a tri-partition {A0, A1, A2} of the node set V. Extending the particular, we focused on signals defined over the edges of a approach used for graphs, where the cut-size is introduced as a complex of order two, i.e. including triangles, and we showed set function evaluated on the node set, to define a triangle-cut that exploiting the full algebraic structure of this complex size, we need to introduce a set function defined on the edge provides an advantage with respect to graph-based tools only. set. Given a simplex X , assume the faces of X to be positively oriented with the order of their vertices, i.e. for the k-simplex We have not analyzed higher order signals. Nevertheless, k k k−1 k−1 k−1 σ we have σ = [σ , σ , . . . , σ ] with i0 < i1 < looking at the structure of the higher order Laplacian and to i i i0 i1 ik the Hodge decomposition, the tools derived in this paper can . . . < ik. Given any non-empty tri-partition {A0, A1, A2} of be directly translated to analyze signals defined over higher the node set V, we define order structures. What would be missing in the higher order 2 2 τ(A0, A1, A2)={σi ∈ T : |σi ∩ Aj| = 1, 0 ≤ j ≤ 2} (66) cases would be a visual interpretation of the properties of the 2 higher order Laplacian eigenvectors, as we could not be talking as the set of triangles σi ∈ T with exactly one vertex in each 2 about solenoidal or irrotational behaviors. set Aj, for j = 0, 1, 2. Given any σi ∈ τ(A0, A1, A2), there π σ0 j = 0, 1, 2 We proposed a method to infer the structure of a second exists a permutation i of its vertices ij , , such that σ2 = [σ0 , σ0 , σ0 ] σ0 ∈ A j = 0, 1, 2 order simplicial complex from flow data and we showed that, i i0 i1 i2 and ij j, for . This returns 15 a linear ordering1 of the vertices of the partition such that for Then, from (65) we get: 0 0 0 0 all i < j, if σk ∈ Ai and σm ∈ Aj, then σk precedes σm in ˜ ˜ f(xi0 , xi1 , xi2 ) = Gi(C0)xi0 + Gi(C1)(xi1 − xi0 ) the vertex ordering. Given the tri-partition {A0, A1, A2}, we + G˜ (C )(x − x ) = x − x + x . define the set ET of oriented edges such that an oriented edge i 2 i2 i1 i0 i1 i2 [σ0, σ0] ∈ E if and only if σ0 ∈ A and σ0 ∈ A for some i j T i n j m Let us now assume xi1 ≤ xi0 ≤ xi2 . Thereby, we have C0 = n < m. Thus, we can think of the triangle cut size F1 in (29), {i0, i1, i2}, C1 = {i0, i2} and C2 = {i2}, so that it results: G(E , E ) equivalently, as a different function T T defined on sets 1 1 1 G˜i(C0) = 1E (σ ) − 1E (σ ) + 1E (σ ) = 1 of oriented edges, where ET denotes the complement set of ET T i0 T i1 T i2 G˜ (C ) = 1 (σ1 ) + 1 (σ1 ) = 2 (72) in E. Given an orientation on the simplicial complex, we can i 1 ET i0 ET i2 ˜ 1 Gi(C2) = 1ET (σ ) = 1 associate with the set of oriented edges ET the vector 1ET ∈ i2 E R , whose entries are 0 for edges in ET and ±1 for edges in and ET , depending on the edge orientation. It is straightforward to f(xi , xi , xi ) = 2(xi −xi )+(xi −xi )+xi = xi −xi +xi . T 0 1 2 0 1 2 0 1 0 1 2 check that B2 1ET , where B2 is the edge-triangle incidence By following similar derivations, it is not difficult to show that matrix defined in (3), computes the triangle cut for ET in the x ≤ x ≤ x x ≤ x ≤ x x ≤ x ≤ x sense that each entry of this vector, corresponding to a triangle, for i0 i2 i1 , i1 i2 i0 , i2 i1 i0 x ≤ x ≤ x f(x , x , x ) = x − is nonzero if and only if every vertex of that triangle is in a and i2 i0 i1 , it holds i0 i1 i2 i0 x + x different set of the partition. As an example, considering the i1 i2 . Therefore, from (69), defining the edge signal 1 T x = [x1, . . . , xE] , we can write the Lovasz´ extension of simplicial complex in Fig. 1, we have ET = {e2, e3, e4, e7, e8} T G(ET , E¯T ) exploiting the additive property [49] as and the associated vector 1ET is [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0] . T T 1 X 1 1 1 Then, using (21), we get B2 1ET = [0, 1, 0] and the triangle f1(x ) = xi − xi + xi (73) cut size is equal to 1. 0 1 2 σ2∈T Then, introducing the vector t := BT 1 , we define the i ET 2 ET or, equivalently, as edge set function as T E T 1 X X 1 X f1(x ) = B2(i, j)xi . (74) G(ET , ET ) = |tE (j)|. (67) T j=1 i=1 j=1 From the equality G(ET , E¯T ) = F1(A0, A1, A2) in (68), we It can be proved that G is invariant under any permutation 1 of the vertexes of the triangles, since the effect of any can state that f1(x ) is the Lovasz´ extension of F1. This completes the proof of Theorem 1. vertex permutation is a change of sign in the entries of tET . Therefore, we get REFERENCES [1] S. Barbarossa, S. Sardellitti, and E. Ceci, “Learning from signals defined G(ET , ET ) = |τ(A0, A1, A2)| = F1(A0, A1, A2). (68) over simplicial complexes,” in 2018 IEEE Data Sci. Work. (DSW), 2018, Now we wish to derive the Lovasz´ extension of the set pp. 51–55. [2] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Van- function G(E , E ). Denote with σ2 = [σ1 , σ1 , σ1 ] the T T i i0 i1 i2 dergheynst, “The emerging field of signal processing on graphs: oriented 2-order simplex where 1 ≤ ik ≤ E, for k = 0, 1, 2. Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, May To make explicit the dependence of the function G(ET , ET ) 2013. on the edge indexes i0, i1, i2, we rewrite (67) as [3] A. Ortega, P. Frossard, J. Kovaceviˇ c,´ J. M. F. Moura, and P. Van- dergheynst, “Graph signal processing: Overview, challenges, and ap- X ˜ G(ET , ET ) = Gi(ET , ET ) (69) plications,” Proc. of the IEEE, vol. 106, no. 5, pp. 808–828, May 2018. σ2∈T [4] S. Klamt, U.-U. Haus, and F. Theis, “Hypergraphs and cellular i networks,” PLoS Comput. Biol., vol. 5, no. 5, 2009. where, exploiting the structure of B2, given in (3), each term [5] O. T. Courtney and G. Bianconi, “Generalized network structures: G˜ can be written as (we omit the dependence of G˜ on the The configuration model and the canonical ensemble of simplicial i i complexes,” Phys. Rev. E, vol. 93, no. 6, pp. 062311, 2016. edge set partition, for notation simplicity) [6] C. Giusti, R. Ghrist, and D. S. Bassett, “Two’s company, three (or more) is a simplex,” J. Comput. Neurosci., vol. 41, no. 1, pp. 1–14, 2016. G˜ = 1 (σ1 ) − 1 (σ1 ) + 1 (σ1 ), (70) i ET i0 ET i1 ET i2 [7] T. Shen, Z. Zhang, Z. Chen, D. Gu, S. Liang, Y. Xu, R. Li, Y. Wei, Z. Liu, Y. Yi, and X. Xie, “A genome-scale metabolic network alignment 1 (σ1 ) k = 0, 1, 2 1 method within a hypergraph-based framework using a rotational tensor- where ET ik , , are the entries of ET correspond- ing to the edge σ1 . G˜ is then a set function defined only on vector product,” Sci. Rep., vol. 8, no. 1, pp. 1–16, 2018. ik i [8] A. R. Benson, R. Abebe, M. T. Schaub, A. Jadbabaie, and J. Kleinberg, the power set of {i0, i1, i2}. We can now derive its Lovasz´ “Simplicial closure and higher-order link prediction,” Proc. of Nat. Acad. of Sci., vol. 115, no. 48, pp. E11221–E11230, 2018. extension f(xi0 , xi1 , xi2 ). First, assume xi0 ≤ xi1 ≤ xi2 . C = {i , i , i } C = {i , i } [9] S. Agarwal, K. Branson, and S. Belongie, “Higher order learning with Then, from Def. 1, we have 0 0 1 2 , 1 1 2 graphs,” in Proc. 23rd Int. Conf. on Mach. Lear., 2006, pp. 17–24. and C2 = {i2}. Therefore, from (70), it holds: [10] J. R. Munkres, Topology, Prentice Hall, 2000. [11] L. Kanari, P. Dłotko, M. Scolamiero, R. Levi, J. Shillcock, K. Hess, G˜ (C ) = 1 (σ1 ) − 1 (σ1 ) + 1 (σ1 ) = 1 i 0 ET i0 ET i1 ET i2 and H. Markram, “A topological representation of branching neuronal 1 1 Neuroinfor. G˜i(C1) = −1E (σ ) + 1E (σ ) = 0 (71) morphologies,” , vol. 16, no. 1, pp. 3–13, 2018. T i1 T i2 [12] A. Patania, G. Petri, and F. Vaccarino, “The shape of collaborations,” G˜ (C ) = 1 (σ1 ) = 1. i 2 ET i2 EPJ Data Sci., vol. 6, no. 1, pp. 18, 2017. [13] R. Ramanathan, A. Bar-Noy, P. Basu, M. Johnson, W. Ren, A. Swami, 1A binary relation ≺ on a set X is a linear ordering on X if it is transitive, and Q. Zhao, “Beyond graphs: Capturing groups in networks,” in Proc. antisymmetric and for any two elements x, y ∈ X, either x ≺ y or y ≺ x. IEEE Conf. Comp. Commun. Work. (INFOCOM), 2011, pp. 870–875. 16

[14] T. J. Moore, R. J. Drost, P. Basu, R. Ramanathan, and A. Swami, [42] L. J. Grady and J. R. Polimeni, : Applied analysis on “Analyzing collaboration networks using simplicial complexes: A case graphs for computational science, Sprin. Sci. & Busin. Media, 2010. study,” in Proc. IEEE Conf. Comp. Commun. Work. (INFOCOM), 2012, [43] T. E. Goldberg, “Combinatorial Laplacians of simplicial complexes,” pp. 238–243. Senior Thesis, Bard College, 2002. [15] T. Roman, A. Nayyeri, B. T. Fasy, and R. Schwartz, “A simplicial [44] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. complex-based approach to unmixing tumor progression data,” BMC 17, no. 4, pp. 395–416, 2007. Bioinfor., vol. 16, no. 1, pp. 254, 2015. [45] D. Horak and J. Jost, “Spectra of combinatorial Laplace operators on [16] G. Carlsson, “Topology and data,” Bulletin Amer. Math. Soc., vol. 46, simplicial complexes,” Adv. Math., vol. 244, pp. 303–336, 2013. no. 2, pp. 255–308, 2009. [46] J. Steenbergen, Towards a spectral theory for simplicial complexes, [17] A. Muhammad and M. Egerstedt, “Control using higher order Laplacians Ph.D. thesis, Duke Univ., 2013. in network ,” in Proc. of 17th Int. Symp. Math. Theory Netw. [47] L.-H. Lim, “Hodge Laplacians on graphs,” S. Mukherjee (Ed.), Syst., 2006, pp. 1024–1038. Geometry and Topology in Statistical Inference, Proc. Sympos. Appl. [18] X. Jiang, L. H. Lim, Y. Yao, and Y. Ye, “Statistical ranking and Math., 76, AMS, 2015. combinatorial Hodge theory,” Math. Program., vol. 127, no. 1, pp. 203– [48] B. Eckmann, “Harmonische funktionen und randwertaufgaben in einem 244, 2011. komplex,” Comment. Math. Helv., vol. 17, no. 1, pp. 240–255, 1944. [19] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, “HodgeRank [49] F. Bach, Learning with Submodular Functions: A Convex Optimization on random graphs for subjective video quality assessment,” IEEE Trans. Perspective, vol. 6, Foundations and Trends in Machine Learning, 2013. Multimedia, vol. 14, no. 3, pp. 844–857, June 2012. [50] L. Lovasz,´ “Submodular functions and convexity,” in A. Bachem et al. [20] A. Tahbaz-Salehi and A. Jadbabaie, “Distributed coverage verification (eds.) Math. Program. The State of the Art, Springer Berlin Heidelberg, in sensor networks without location information,” IEEE Trans. Autom. pp. 235–257, 1983. Contr., vol. 55, no. 8, pp. 1837–1849, Aug. 2010. [51] S. Sardellitti, S. Barbarossa, and P. Di Lorenzo, “On the graph Fourier [21] H. Chintakunta and H. Krim, “Distributed localization of coverage holes transform for directed graphs,” IEEE J. Sel. Top. Signal Process., vol. using topological persistence,” IEEE Trans. Signal Process., vol. 62, no. 11, no. 6, pp. 796–811, Sept. 2017. 10, pp. 2531–2541, May 2014. [52] O. Parzanchevski, R. Rosenthal, and R. J. Tessler, “Isoperimetric [22] V. De Silva and R. Ghrist, “Coverage in sensor networks via persistent inequalities in simplicial complexes,” Combinatorica, vol. 36, no. 2, ,” Algeb. & Geom. Topol., vol. 7, no. 1, pp. 339–358, 2007. pp. 195–227, 2016. [23] S. Emrani, T. Gentimis, and H. Krim, “Persistent homology of delay [53] J. Steenbergen, C. Klivans, and S. Mukherjee, “A Cheeger-type in- and its application to wheeze detection,” IEEE Signal equality on simplicial complexes,” Adv. Appl. Math., vol. 56, pp. 56 Process. Lett., vol. 21, no. 4, pp. 459–463, Apr. 2014. –77, 2014. [24] H. Edelsbrunner and J. Harer, “Persistent homology-a survey,” Contemp. [54] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram, “The total variation Math., vol. 453, pp. 257–282, 2008. on hypergraphs - Learning on hypergraphs revisited,” in Adv. Neur. [25] D. Horak, S. Maletic,´ and M. Rajkovic,´ “Persistent homology of Inform. Proces. Syst. (NIPS), Dec. 5-8 2013. complex networks,” J. Statis. Mech.: Theory and Experiment, vol. 2009, [55] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge no. 3, pp. P03034, 2009. university press, 2004. [26] H. Krim and A. B. Hamza, Geometric methods in signal and image [56] M. Tsitsvero, S. Barbarossa, and P. Di Lorenzo, “Signals on graphs: IEEE Trans. Signal Process. analysis, Cambridge Univ. Press, 2015. Uncertainty principle and sampling,” , vol. 64, no. 18, pp. 4845–4860, Sept. 2016. [27] M. Robinson, Topological signal processing, Springer, 2016. [57] S. Barbarossa and M. Tsitsvero, “An introduction to hypergraph signal [28] W. Huang, T. A. W. Bolton, J. D. Medaglia, D. S. Bassett, A. Ribeiro, processing,” in Proc. of IEEE Int. Conf. Acoust., Speech, Signal Process. and D. Van De Ville, “A graph signal processing perspective on (ICASSP), 2016, pp. 6425–6429. functional brain imaging,” Proc. of the IEEE, vol. 106, no. 5, pp. 868– [58] Elie´ Cartan, “Sur certaines expressions differentielles´ et le probleme` de 885, May 2018. pfaff,” in Annales scientifiques de l’Ecole´ Normale Superieure´ , 1899, [29] K. K. Leung, W. A. Massey, and W. Whitt, “Traffic models for wireless vol. 16, pp. 239–332. IEEE J. Sel. Areas Commun. communication networks,” , vol. 12, no. 8, [59] A. N. Hirani, Discrete Exterior Calculus, PhD thesis, California Institute pp. 1353–1364, Oct. 1994. of Technology, May 2003. [30] R. Sever and J. S. Brugge, “Signal transduction in cancer,” Cold Spring [60] N. Bell and A. N. Hirani, “Pydec: software and algorithms for Harbor Perspec. in Medic., vol. 5, no. 4: a006098, 2015. discretization of exterior calculus,” ACM Trans. Math. Soft. (TOMS), [31] M. T. Schaub and S. Segarra, “Flow smoothing and denoising: Graph vol. 39, no. 1, pp. 3, 2012. signal processing in the edge-space,” in 2018 IEEE Global Conf. Signal [61] M. Fisher, P. Schroder,¨ M. Desbrun, and H. Hoppe, “Design of tangent Inf. Process. (GlobalSIP), Nov. 2018, pp. 735–739. vector fields,” in ACM Trans. Grap. (TOG), 2007, vol. 26, p. 56. [32] J. Jia, M. T. Schaub, S. Segarra, and A. R. Benson, “Graph-based semi- [62] C. Brandt, L. Scandolo, E. Eisemann, and K. Hildebrandt, “Spectral supervised & active learning for edge flows,” in Proc. of 25th ACM processing of tangential vector fields,” in Computer Graphics Forum. SIGKDD Int. Conf. Knowl. Disc. & Data Min., 2019, pp. 761–771. Wiley Online Library, 2017, vol. 36, pp. 338–353. [33] T. S. Evans and R. Lambiotte, “Line graphs, link partitions, and [63] J. Dodziuk, “Finite-difference approach to the Hodge theory of harmonic overlapping communities,” Phys. Rev. E, vol. 80, Jul. 2009. forms,” Amer. Jour. Math., vol. 98, no. 1, pp. 79–104, 1976. [34] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal [64] G. B. Giannakis, Y. Shen, and G. V. Karanikolas, “Topology iden- multiscale complexity in networks,” Nature, vol. 466, pp. 761–764, tification and learning over graphs: Accounting for nonlinearities and 2010. dynamics,” Proc. of the IEEE, vol. 106, no. 5, pp. 787–807, May 2018. [35] M. Mesbahi and M. Egerstedt, Graph theoretic methods in multiagent [65] Big data-Dandelion API datasets, available at: https://dandelion.eu/. networks, vol. 33, Princ. Univ. Press, 2010. [66] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition [36] S. Mukherjee and J. Steenbergen, “Random walks on simplicial by basis pursuit,” in SIAM J. Sci. Comput., 1998, vol. 20, pp. 33–61. complexes and harmonics,” Random Struct. & Algor., vol. 49, no. 2, [67] L. Jost, S. Setzer, and M. Hein, “Nonlinear eigenproblems in data pp. 379–405, 2016. analysis: Balanced graph cuts and the ratioDCA-Prox,” in Extrac. Quant. [37] O. Parzanchevski and R. Rosenthal, “Simplicial complexes: spectrum, Infor. Complex Syst., Spr. Intern. Publ., vol. 102, pp. 263–279, 2014. homology and random walks,” Random Struct. & Algor., vol. 50, no. 2, [68] D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas with pp. 225–261, 2017. Application to Linear Systems Theory, Princeton Univ. Press, 2005. [38] M. T. Schaub, A. R. Benson, P. Horn, G. Lippner, and A. Jadbabaie, “Random walks on simplicial complexes and the normalized Hodge Laplacian,” arXiv preprint arXiv:1807.05044, 2018. [39] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, “Connecting the dots: Identifying network structure via graph signal processing,” IEEE Signal Process. Mag., vol. 36, no. 3, pp. 16–43, May 2019. [40] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, “Learning graphs from data: A signal representation perspective,” IEEE Signal Process. Mag., vol. 36, no. 3, pp. 44–63, May 2019. [41] G. La Manno et al., “RNA velocity of single cells,” Nature, vol. 560, no. 7719, pp. 494–498, 2018. 17

APPENDIX B B. Proof of Theorem 3 SUPPORTING MATERIAL 1 0 Given the sampled signals sS and sA, we get 0 0 This document contains some supporting materials com- sA = DAs 1 1 1 T 0 (81) plementing the paper: “Topological Signal Processing over sS = DS s = DS s¯ + DS B1 s Simplicial Complexes” to appear in IEEE Transactions on with s¯1 = s1 + s1 . Then, since it holds s0 = F0 s0 and Signal Processing. Section A contains Proposition 2, and its sol H F0 s1 = F s1, we easily obtain proof, that will be instrumental in proving Theorems 3 and 4, F " 0 # " 0 # respectively, in Sections B and C. sA s = G , (82) 1 1 sS s¯ A. Proposition 2 where " 0 # DAFF O We need first to derive the relationship between the band- G = 0 1 T 0 D BT F0 D F width of the edge signal sirr = B1 s and that of the vertex S 1 F0 S FsH signal s0, as stated in the following proposition. (83) " (I − D F0 ) O # 1 T 0 1 A F0 Proposition 2: Let sirr = B1 s be the irrotational part of s = with s0 = F0 s0 a F -bandlimited vertex signal. Then, s1 is (I − D )BT F0 (I − D F ) F0 0 irr S 1 F0 S FsH a Firr-bandlimited edge signal with Firr the set of indexes of with FsH the set of frequency indexes in F correspond- the eigenvectors of L1 stacked in the columns of UF , such irr ing to the eigenvectors of L1 belonging to the solenoidal that U = BT U0 . The bandwidth of s1 is |F | = Firr 1 F0−C1 irr irr and harmonic subspaces. Assuming that both the conditions |F0| − c1, where c1 ≥ 0 is the number of eigenvectors in the kD¯ F0 k < 1 and kD¯ F k < 1 hold true, the matrix G 0 A F0 2 S FsH 2 bandwidth of s belonging to ker(L0). is invertible and from the inverse of partitioned matrices [68], Proof. Let us define U0 the N ×|F | matrix whose columns F0 0 we get 0 ui , ∀i ∈ F0 are the eigenvectors of the 0-order Laplacian " 0 −1 # T T (I − D F ) O 1 0 A F0 B1B1 . Since sirr = B1 s we get Q = G−1 = −1 P (I − DS FF ) s1 = BT s0 = BT U0 (U0 )T s0 sH irr 1 1 F0 F0 (75) (84) P = −(I−D F )−1D BT F0 (I−D F0 )−1 with S FsH S 1 F0 A F0 . Then where the last equality follows from the bandlimitedness of we can recovery the signals s0 and s¯1 as s0, i.e. s0 = U0 (U0 )T s0. From the property 2) in Prop. F0 F0 " 0 # " 0 # 0 T 0 T s sA 1, at each eigenvector ui of B1B1 with ui ∈/ ker(B1B1 ) = Q . (85) T 0 T 1 1 corresponds an eigenvector B1 ui of B1 B1 with the same s¯ sS eigenvalue. Let U denote the E × |F | matrix whose Firr irr This concludes the proof of point a). Let us prove next point columns are the eigenvectors of L1 associated to the frequency 0 b). From Proposition 2 it results that, if s is F0-bandlimited, index set Firr with 1 T 0 then sirr = B1 s is a Firr-bandlimted signal with Firr given in (76). This implies that the edge signal s1 = s1 +s1 +s1 is a F = {` ∈ F : u = BT u0, u0 ∈/ ker(B BT ), ∀i ∈ F −C }. sol H irr irr i `i 1 i i 1 1 0 1 F-bandlimited edge signal with F = F ∪F and bandwidth (76) sH irr |F| = |F | + |F | − c . Let us now consider the system in Then, if U0 = [U0 , U0 ], we get sH 0 1 F0 C1 F0−C1 (85). We get T 0 0 0 −1 0 B U = [O , U ]. s = (I − DAF ) s 1 F0 C1 Firr (77) F0 A s¯1 = −(I − D F )−1D BT F0 (I − D F0 )−1s0 + S FsH S 1 F0 A F0 A Therefore, equation (75) reduces to −1 1 (I − DS FFsH ) sS . s1 = U (U0 )T s0. (78) (86) irr Firr F0−C1 1 Using the first equation in (86) and the fact that DS s = 1 T 0 T DS s¯ + DS B1 s , it holds From the equality B1 B1u`i = λ`i u`i , multiplying both sides 0 T 1 −1 1 T 0 T 0 0 by B1, we easily derive u = B1u` , ∀u` ∈/ ker(B B1) so s¯ = (I − D F ) (D s¯ + D B s − D B F s ) i i i 1 S FsH S S 1 S 1 F0 that equation (78) is equivalent to −1 1 = (I − DS FFsH ) DS s¯ 1 T T 0 (87) s = UF U B s . (79) irr irr Firr 1 s0 = F0 s0 where the last equality follows from F0 . Hence, Since s1 = BT s0, we can rewrite (79) as from (87), it follows that to perfectly recover the solenoidal irr 1 1 and harmonic parts of s we need a number of samples N1 1 T 1 s = U U s . at least equal to the signal bandwidth |FsH |. Finally, from the irr Firr Firr irr (80) first equation in (86) it follows that to perfectly recovering 1 0 This last equality proves that sirr is a Firr-bandlimited edge s we need a number of samples N0 at least equal to the 0 signal with Firr as in (76) and bandwidth |Firr| = |F0| − c1, bandwidth |F0| of the vertex signal s . This concludes the defining c1 = |C1| ≥ 0. proof of point b) in the theorem. 18

C. Proof of Theorem 4 U2 c = |C | ≥ 0 where the columns of C2 are the 2 2 eigenvec- 2 0 1 2 tors in the kernel of L2 belonging to the bandwidth of s . Given the sampled signals sA, sS and sM, we have Therefore, it results s0 = D s0 A A 2 B2UF = [OC2 , UFsol ], (95) 1 1 2 1 T 0 2 sS = DS s = DS B2s + DS sH + DS B1 s (88) 2 2 2 2 sM = DMFF s . B U = U 2 where we used the equality 2 F2−C2 Fsol . Then equation (93) reduces to s0 = F0 s0 Then, using the bandlimitedness property, so that F0 , 1 1 2 2 2 1 2 T 2 s = FF s and s = F s , we get s = U (U ) s . H H H F2 sol Fsol F2−C2 (96)  s0   s0  T 2 2 A Since it holds B2 u`i = ui , ∀ui ∈/ ker(B2), we easily get  1  ¯  1  1 T 2 T 1  sS  = G  sH  (89) s = U (U ) B s = U (U ) s . (97)     sol Fsol Fsol 2 Fsol Fsol sol s2 s2 1 M Then ssol is a Fsol-bandlimited edge signal with Fsol given in 1 with (94). This implies that s is F-bandlimited with F = Fsol ∪  0  FH ∪ Firr and bandwidth |F| = |F0| + |FH| + |F2| − (c1 + c2). DAF OO F0 Let us now consider the system in (92). We get  T 0 2  G¯ = D B F D F D B F . (90) 0 0 −1 0  S 1 F0 S FH S 2 F2  s = (I − D F ) s   A F0 A 2 OODMF 1 −1 T 0 0 −1 0 F2 s = −(I − D F ) D B F (I − D F ) s + H S FH S 1 F0 A F0 A 0 Under the assumptions kD¯ F k < 1, kD¯ F k < 1 and −1 1 −1 2 A F0 2 S FH 2 (I − DS FF ) s − (I − DS FF ) DS B2F ¯ 2 ¯ H S H F2 kDMFF k2 < 1, the matrix G becomes invertible and from 2 ·(I − D F2 )−1s2 the inverse of partitioned matrices [68], we get M F2 M  0 −1  s2 = (I − D F2 )−1s2 . (I − DAF ) OO M F2 M F0 (98) −1 R = G¯ =  P P P  1 2  1 2 3  Using equations in (98) and the fact that sS = DS B2s +   T 0 1 OO (I − D F2 )−1 DS B1 s + DS sH , it holds M F2 (91) 1 −1 1 s = (I − DS FF ) DS s . (99) P = −(I − D F )−1D BT F0 (I − D F0 )−1 H H H and 1 S FH S 1 F0 A F0 , P = (I−D F )−1 P = −(I−D F )−1D B F2 (I− Hence, from (99), it follows that to perfectly recover the 2 S FH , 3 S FH S 2 F2 2 1 D F )−1 s0 s1 s2 harmonic part of s we need a number of edge samples N1 M F2 . Then we can recovery the signals , H and as at least equal to the signal bandwidth |FH |. Finally, from the  0   0  0 s sA first and last equation in (98), we conclude that to retrieve s 1  1   1  and the solenoidal signal ssol we need, respectively, a number  sH  = R  sS  . (92)     of node samples N0 ≥ |F0| and a number of triangle signal 2 2 s sM samples N2 ≥ |F2|. This concludes the proof of point a). Let us prove next point 1 b). From Proposition 2, sirr is a Firr-bandlimited signal with Firr given in (76). To find the bandwidth of the solenoidal part 1 2 ssol = B2s we can proceed in a similar way to the proof of Proposition 2. Using the bandlimitedness property of s2, so s2 = F2 s2 that F2 , it results s1 = B U2 U2 T s2 sol 2 F2 F2 (93) U2 ∈ T ×|F2| u2 ∀i ∈ with F2 R the matrix whose columns i , F2 are the eigenvectors of the second-order Laplacian L2 = T 2 2 B2 B2. From Proposition 1 at each eigenvector u with u ∈/ T 2 T ker(B2 B2) corresponds an eigenvector B2u of B2B2 with the same eigenvalue. Denote with UFsol the E × |Fsol| matrix whose columns are the eigenvectors of L1 associated to the frequency index set Fsol with 2 2 T Fsol = {`i ∈ F : u`i = B2ui , ui ∈/ ker(B2 B2), ∀i ∈ F2−C2}. (94) U2 Let us write F2 as U2 = [U2 , U2 ] F2 C2 F2−C2