PHYSICAL REVIEW E 86,056111(2012)

Alignment and integration of complex networks by -based spectral clustering

Tom Michoel* Freiburg Institute for Advanced Studies (FRIAS), University of Freiburg, Albertstrasse 19, D-79104 Freiburg, Germany and The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, Scotland, United Kingdom

Bruno Nachtergaele† Department of , University of California Davis, One Shields Avenue, Davis, California 95616-8366, USA (Received 16 May 2012; published 26 November 2012) Complex networks possess a rich, multiscale structure reflecting the dynamical and functional organization of the systems they model. Often there is a need to analyze multiple networks simultaneously, to model a system by more than one type of interaction, or to go beyond simple pairwise interactions, but currently there is a lack of theoretical and computational methods to address these problems. Here we introduce a framework for clustering and community detection in such systems using hypergraph representations. Our main result is a generalization of the Perron-Frobenius theorem from which we derive spectral clustering algorithms for directed and undirected . We illustrate our approach with applications for local and global alignment of protein-protein interaction networks between multiple species, for tripartite community detection in folksonomies, and for detecting clusters of overlapping regulatory pathways in directed networks.

DOI: 10.1103/PhysRevE.86.056111 PACS number(s): 89.75.Fb, 89.65. s, 87.18.Vf −

I. INTRODUCTION one suggestion has been to use hypergraphs, where edges are arbitrarily sized subsets of nodes. Although a number Complex networks in nature and society represent in- of studies have generalized various concepts, from graph teractions between entities in inhomogeneous systems and theory to hypergraphs [8,9,13–16], a rigorous mathematical understanding their structure and function has been the focus of foundation and general-purpose algorithm for clustering and much research. On the macroscopic scale, complex networks community detection in hypergraphs is still lacking. are characterized by, among others, a distribution, Here we present a framework for spectral clustering characteristic path length, and clustering coefficient, which are in hypergraphs, which is mathematically sound and markedly different from those of regular lattices or uniformly algorithmically efficient. It is based on a generalization of distributed Erdos-R˝ enyi´ random graphs [1,2], while on the the Perron-Frobenius theorem, which allows one to define microscopic scale, they contain network motifs, i.e., small and compute a dominant eigenvector for hypergraphs and use subgraphs occurring significantly more often than expected its values for optimally partitioning the hypergraph’s by chance [3]. The intermediate level usually exhibits the set, similar to the operation of standard spectral clustering presence of communities or modules, i.e., sets of nodes with algorithms in ordinary graphs [17]. We demonstrate the asignificantlyhigherthanexpecteddensityoflinksbetween validity of our approach through practical applications in the them, with typical examples being friendship circles in social analysis of real-world networks. In particular, we address networks, websites devoted to similar topics in the World Wide the following problems. First, if two networks are defined on Web, or protein complexes in protein interaction networks separate node sets with a many-to-many mapping between [4–7]. them (for instance, protein-protein interaction networks in However, the limitations of modeling a by different species), it is a natural question to find matching anetworkwithasingletypeofpairwiseinteractionarebe- communities in the two networks. This is the so-called coming more and more clear. Folksonomies, i.e., online alignment problem [12]. We show that this problem communities where users apply tags to annotate resources such can be solved by finding clusters in a hypergraph where each as images or scientific articles, have a tripartite structure with hyperedge consists of two matching edges, with one from three types of interactions [8,9]. In , cellular systems are each network (see Sec. VIII A). Second, if multiple networks characterized by different types of networks which represent are defined on the same node set (i.e., together they form an different physical interaction mechanisms operating on differ- edge-colored graph), there often exist functionally meaningful, ent time scales, intertwined with each other through extensive higher-order relations between the different edge types (for feedforward and feedback loops [10,11]. To understand how instance, tripartite relations in folksonomies [8,9] or network evolutionary dynamics shapes molecular interaction networks, motifs in biological networks [10,11]). Finding communities we need to compare them between multiple species with or modules with respect to these higher-order relations is what nontrivial many-to-many relations between their respective we call the network integration problem. Here we show that node sets [12]. In order to move beyond simple networks any higher-order edge relation between different networks of pairwise interactions to model these and other systems, defines a subgraph pattern in the corresponding edge-colored graph and that all instances of this pattern form a hypergraph. Hypergraph-based clustering can then be applied to identify *[email protected] modules in such edge-colored graphs (see Secs. VIII B and †[email protected] VIII C).

1539-3755/2012/86(5)/056111(14)056111-1 ©2012 American Physical Society TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

II. GRAPHS AND HYPERGRAPHS aclusterisasubsetofverticeswithahighnumberofedges between them, relative to its size. Mathematically, for a graph Agraph is defined as a pair ( , )ofvertices and edges with A,theedge-to-noderatioofasubset (pairs of vertices)G ,whichmayormaynotbedirected.InV E V E X can be written as aweightedgraph,anumberisassignedtoeachedgewhich ⊂ V may represent, e.g., the cost, length, or reliability of an edge. A i,j X Aij (X) ∈ , hypergraph is a generalization of a graph where an edge, called S = X hyperedgeinthiscase,canconnectanynumberofvertices,i.e., ! | | where X denotes the number of elements in X.Thenumber is a set of arbitrarily sized subsets of .Aparticularclassof | | hypergraphsE are so-called k-uniform hypergraphsV where each of subsets of a set with N elements grows exponentially in hyperedge has the same cardinality k.Algebraically,agraph N and hence finding the subset with a maximal edge-to-node can be represented by an adjacency matrix A of dimension ratio by exhaustive enumeration is computationally infeasible N N,withN the number of vertices, such that A 1 for large graphs. However, if we denote by uX the unit vector ij N 1/2 if ×i,j ,and0otherwise.Forundirectedgraphs,A =is a in R which has uX,i X − for i X,and0otherwise, { }∈E we can write as a scalar=| product| and obtain∈ the simple upper symmetric matrix, and for weighted graphs, Aij is defined to S be the weight of the edge i,j .Fork-uniform hypergraphs, the bound: notion of adjacency matrix{ can} be generalized to an adjacency x,Ax (X) u ,Au max λ , (1) T T i ,...,i X X " ' 2 ( max multiarray or tensor ,with i1,...,ik 1if 1 k ,and S =' ( x RN ,x 0 x = 0otherwise.Forageneralhypergraph,wedefineafunction= { }∈E w ∈ %= ) ) N on the set of subsets of such that w(E) 1forE ,and0 where x,y xi yi is the standard inner product on R , V = ∈ E ' (= i otherwise. In general, we allow weighted hypergraphs where x √ x,x is the length of x,andλmax is the largest w can be any non-negative function. )eigenvalue)= ' of (A!.BythePerron-Frobeniustheorem[18], if the Apathbetweentwoverticesi and j in a hypergraph graph is irreducible, then the dominant eigenvector x,which is defined as a sequence of vertices i i1,i2,...,ik 1 j satisfies λmax x Ax, is unique, strictly positive (xi > 0for = + = = and edges E1,...,Ek such that for all m, im,im 1 Em.A all i), and solves the variational problem on the right-hand side hypergraph is called connected if there exists{ a path+ }⊂ between of Eq. (1). any pair of vertices. A stronger constraint on the structure of Hence, to find an approximate maximizer X of ,wecan S ahypergraphisthatofirreducibility.Ahypergraphissaidto take the set X for which uX is as close as possible to the be reducible if there exists a proper vertex subset I such dominant eigenvector x,whichissimilartowhatisdonein ⊂ V that for any i I and j1,...,jm I, w( i,j1,...,jm ) 0, other spectral clustering algorithms based on the Laplacian or and is said to be∈ irreducible if it is%∈ not reducible.{ For ordinary} = matrices [17], i.e., define graphs, connectedness and irreducibility are equivalent, but for 1 hypergraphs this is not the case. An irreducible hypergraph is X˜ argmax u ,x argmax x . X 1/2 i = X ' (= X X clearly connected, but the opposite is not always true. Indeed, if ⊂V ⊂V | | i X there exists a subset of vertices I such that paths crossing from "∈ ˜ i I to j I can always be chosen to do so through an edge Since x>0, X is of the form Xc i : xi >c for some ∈ %∈ threshold value c.InsteadofX˜ ,wethereforechoosethe={ } of the form i1,...,ik,j1,...,jm ,withk ! 2, i1,...,ik I, { } ∈ solution of the restricted variational problem and j1,...,jm I,thenwecansetw( i,j1,...,jm ) 0for %∈ { } = all i I and j1,...,jm I,therebymakingthehypergraph ∈ %∈ Xmax argmax (Xc)(2) reducible, without breaking its connectivity. = c>0 S Directed hypergraphs can be defined in many ways. For as an approximate maximizer. Solving Eq. (2) is linear in the instance, for k-uniform hypergraphs, we can impose any form number of vertices, since we only need to consider the values of permutation symmetry, or lack thereof, between some or c equal to the entries of x.Moreover, (X ) (X˜ ), and all of the k dimensions in each edge. In this paper, we will max ! hence X is a better approximation toS the true maximizerS of only consider the case where each edge E can be written max than X˜ . as a pair (S,T ), where S is called the “source” vertex S Thus we obtain a numerically highly efficient spectral graph set and T is called the⊂ “target”V vertex set, with weight clustering algorithm: function w⊂(S,TV ). Underlying a directed hypergraph, there is (1) Calculate the dominant eigenvector x using, for in- always an undirected hypergraph with edges E S T for stance, a power method [19]. every directed edge (S,T ). As is the case for ordinary= directed∪ (2) Find the cluster X which solves the restricted graphs, a stronger notion of connectivity is usually needed max variational problem in Eq. (2). than simple connectivity of this undirected hypergraph. We (3) Store X ,removealledgesbetweennodesinX defer the somewhat technical definition of strong connectivity max max from the edge set ,andrepeattheprocedureuntilnomore of directed hypergraphs to Appendix A. edges remain. E This result of this algorithm is a partition of the edges of the input graph. Edge clustering algorithms have recently gained III. DOMINANT EIGENVECTORS AND SPECTRAL popularity, as they allow for overlapping communities where GRAPH CLUSTERING nodes may belong to more than one community [20,21]. Although countless measures have been designed to define This procedure generalizes immediately to directed or clusters in a graph [5–7], perhaps the simplest definition is that bipartite graphs. In this case, a cluster consists of a source

056111-2 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012) set X and target set Y with edge-to-node ratio x>0. Multiplying both sides of Eq. (5) by xi ,weobtain Eq. (4).SummingbothsidesinEq.(4) over i gives λp i X,j Y Aij = p(x) maxx p(x-). (X,Y ) ∈ ∈ . R = - R S = √ X Y Next assume y>0isanothermaximizerof p.Denote ! | |·| | R c mini (xi /yi ), u cy,andz x u ! 0. Since x p The dominant eigenvector is replaced by the dominant left = = = p − ) ) = y p 1, we have c<1andc c for p 1. Denote and right singular vectors x and y corresponding to the largest ) ) = " ! singular value of A,whichareagainuniqueandstrictlypositive I i : zi 0 .Foranyi I,bytheEuler-Lagrange equations,={ ∈ V = } ∈ [18]. Xmax and Ymax are found by maximizing (X,Y )oversets obtained by thresholding on the entries of x andS y. p p p 0 λp x c y = i − i 1 1 E E IV. PERRON-FROBENIUS THEOREM ( ) | | | | FOR HYPERGRAPHS w(E) xj uj . ! − E :i E *&j E ' & j E ' + Our aim is to generalize the previous graph spectral { ∈"E ∈ } #∈ #∈ clustering algorithm to arbitrary hypergraphs. For this purpose, we first need a generalization of the Perron-Frobenius theorem. Since each term in the last sum is non-negative, they must all be zero. Hence for any j1,...,jk I,if i,j1,...,jk , Let ( , )beanundirectedhypergraphonN vertices. %∈ { }∈E Define,H = forVx E RN and p 1, then ∈ ! 1 E k k xi | | (x) w(E) | | , (3) p 0 xj uj R = x p = m − m E i E $) ) % m 1 m 1 "∈E #∈ #= #= k m 1 k where w(E)isthenon-negativeweightofedgeE and x p − p 1/p ) ) = ( xi ) is the p norm of x.Wehavethefollowingkey ujn xjm ujm xjn . (6) i | | = − result: m 1 &n 1 ' &n m 1 ' ! "= #= ( ) =#+ Theorem. p attains its maximum on the set of unit N R N vectors Sp u R : u p 1 .If is connected, there Again each term in this sum is non-negative and must therefore ={ ∈ ) ) N = } H is a unique maximizer x S which is strictly positive and be zero, but this contradicts j1,...,jk I.Henceedgeswith ∈ p %∈ satisfies the Euler-Lagrange equations i I and j1,...,jk I do not exist, but this contradicts the ∈ %∈ 1 assumption of irreducibility. Since I ,wemusthaveI E %=∅ = V w(E) | | or x y. p = λp xi xj , (4) Next consider directed hypergraphs with hyperedges E = E E :i E & j E ' = { ∈E ∈ } | | ∈ (S,T ), S,T as defined before. Then, define p,q(x,y)for " # N ⊂ V R x,y R and p,q ! 1, subject to the constraint x p 1andwithλp p(x). ∈ By analogy with the matrix) ) case,= we call x the= dominantR 1 1 2 S 2 T eigenvector of . xi | | yj | | H p,q(x,y) w(S,T ) | | | | . For clarity, we first prove this theorem in the simpler case R = x y (S,T ) i S $ p % j T $ q % when is irreducible. The proof of the general case is given "∈E #∈ ) ) #∈ ) ) in AppendixH B. (7) N Proof.ExistenceofamaximizeronSp follows from Weierstrass’s theorem [18]. Clearly, since p(x) p( x ), R = R | | By identical arguments as for undirected hypergraphs, it can be we can always choose a maximizer x to have non-negative shown that for a strongly connected directed hypergraph, there entries. Hence we can find x as a stationary point of the N N exists a unique pair x Sp and y Sq such that p,q(x,y) ! Lagrangian ∈ N ∈ R p,q(x-,y-)forallx-,y- R . These maximizers are strictly 1 R ∈ E positive and satisfy the Euler-Lagrange equations | | λ p (x) w(E) xi x 1 , L = | | − p ) )p − 1 1 E &i E ' 2 S 2 T "∈E #∈ p w(S,T ) | | | | ( ) λp,qx xi yj , giving rise (for non-negative x)totheEuler-Lagrangeequa- i = 2 S - (S,T ) :i S &i S ' & j T ' tions { "∈E ∈ } | | #-∈ #∈ (8) 1 E w(E) | | 1 1 p 1 E − 1 1 λxi − xj xi| | . (5) 2 S 2 T = E q w(S,T ) | | | | E :i E & j E,j i ' | | λp,qyj xi yj - , { ∈"E ∈ } ∈#%= = 2 T & ' & ' (S,T ) :j T | | i S j - T Let I i : xi 0 and i I. Assume there exists an { "∈E ∈ } #∈ #∈ ={ ∈ V = } ∈ (9) edge E i,j1,...,jm with j1,...,jm I.Thentheleft- hand side={ of Eq. (5) is} 0, while the right-hand%∈ side is . ∞ Hence such an edge cannot exist, but this contradicts the subject to the constraints x p y q 1andwithλp,q ) ) =) ) = = assumption of irreducibility of .ItfollowsthatI or p,q(x,y). Details are given in Appendix B. H =∅ R 056111-3 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

V. SPECTRAL CLUSTERING AND BICLUSTERING and thus can be considered as a subhypergraph as well. Higher- IN HYPERGRAPHS scoring clusters can thus be obtained by recursively applying the previous procedure to each of the clusters itself until no Having a generalization of the Perron-Frobenius theorem, more subdivision that improves the score is found. it is straightforward to also generalize the spectral clustering For directed hypergraphs, we have a biclustering method. method. Define, for X , ⊂ V Define, for X,Y and p,q 1, ⊂ V ! E X w(E) p(X) ⊂ p(uX) p(x), (10) S X,T Y w(S,T ) 1 " (X,Y ) ⊂ ⊂ . S = X p = R R p,q 1 1 ! | | S = ! X 2p Y 2q | | | | with x the dominant eigenvector and u N now defined X Sp Approximate maximizers Xmax and Ymax are found by solving 1/p ∈ by uX,i X for i X,and0otherwise.Theparameter the restricted variational principle, =| |− ∈ p balances cluster size versus edge density. For p 1, p is = S (X ,Y ) argmax X ,Y , the ratio of edges to nodes in X.Takingp>1diminishesthe max max p,q c1 c2 = (c1,c2) S influence of the denominator and progressively favors a high ( ) number of edges rather than a high number of edges per node with Xc i : xi >c1 and Yc i : yi >c2 , 1 ={ ∈ V } 2 ={ ∈ V } in high-scoring clusters (further details are given in Sec. VII). where x and y are the unique solutions of the Euler-Lagrange The spectral clustering algorithm becomes as follows: equations (8) and (9),whichcanagainbecalculatedusinga (1) Calculate the maximizer x of p. power algorithm. R (2) Find the cluster Xmax which solves the restricted variational problem VI. RELATION TO PREVIOUS WORK

Xmax argmax p(Xc), The matrix algorithm for clustering in a simple graph has = c>0 S its roots in a method for image pattern recognition [23], and with Xc i : xi >c. the use of the singular value decomposition to detect densely ={ ∈ V } (3) Store Xmax,removeallhyperedgesbetweennodesin linked sets in directed networks goes back to the work of Xmax from the edge set ,andrepeattheprocedureuntilno Kleinberg [24]. The novelty here lies in the definition of more hyperedges remain.E adiscreteclusterthroughsolvingtherestrictedvariational The maximizer can be calculated using a generalization of problem, instead of using an ad hoc cutoff on the eigenvector the power method for matrices [19]ortensors[22]: starting entries. For k-uniform hypergraphs, we can define rescaled (0) (0) (0) 1/k with an initial vector x and defining λp x p 1, we variables yi xi such that maximizing p(x)becomes (n 1) (n) =) ) = equivalent to= maximizing R compute x + from x using the Euler-Lagrange equations (4) in the following steps: T y ,...,y i1,...,ik i1,...,ik i1 ik 1 1 p- (y) k , E p R - = y (n 1) w(E) (n) | | ! p- x + x , (11) ) ) i ← E j * E :i E & j E ' + with p- kp.Inthiscase,theTheorempresentedinSec.IV { ∈"E ∈ } | | #∈ = (n 1) (n 1) reduces to a multilinear extension of the Perron-Frobenius λ + x + p, (12) p =) ) theorem to non-negative irreducible tensors of arbitrary (n 1) dimension, which has been the subject of several recent (n 1) xi + + papers [25–27](whichalldependonthestrongirreducibility xi (n 1) , (13) ← λp + condition). The Proof given in Sec. IV is considerably simpler, iterated until the components of x(n) become stationary or, holds for general connected hypergraphs, and follows more (n) closely the proof of the matrix theorem [18]. In the unscaled equivalently, λp has converged to the dominant eigenvalue, variational problem for - ,themaximizerisuniquefor i.e., Rp- p- ! k and thus it is unsuited for generalizing to arbitrary (n 1) λ + hypergraphs where the uniqueness condition would become 1 p <", (14) − (n) p- ! kmax,whichisthemaximumedgesizeinthehypergraph. , λp , , , This explains why we introduced the geometric average over , , where " is a predefined, numerical, tolerance threshold. Due the values xj in Eq. (3). to the uniqueness of x,thechoiceofstartingvectorisnot For k 3, we have previously used a similar approach important. By taking a non-negative one, such as the uniform to find clusters= of three-node network motifs in integrated vector x(0) [1,1,...,1]T /N 1/p,weensurethatthepowers interaction networks [28,29]. In this case, an adjacency tensor = of 1/ E occurring in the Euler-Lagrange equations are always Trst is defined to be 1 if an instance of a three-node query defined| | unambiguously. Many of the hypergraphs occurring in motif or graph pattern exists between vertices (r,s,t), and 0 real-world applications are not connected. In such cases, it otherwise. More generally, we can define for any k-node query is important to ensure that x(0) has support only on a single pattern a k-uniform hypergraph consisting of all instances connected to obtain the unique maximizer for that of the query pattern in a given graph .Ouralgorithmwill component. identify clusters of vertices in withG a high number of Although we typically view a cluster as a subset of vertices, pattern instances between them, whichG often have a functional it is actually a subset of hyperedges (all hyperedges E Xmax) meaning in biological networks [28,30]. ⊂ 056111-4 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012)

Another example for k 3concernstheanalysisandclus- B. Edge-to-node scaling parameter = tering of multiply linked data [31,32] or multislice networks The transition in Fig. 1(d) as a function of the edge-to- [33]. Here we are given a set of M directed or undirected graphs (m) node scaling parameter p is a general feature, independent and define a hypergraph adjacency tensor as Tij m A , = ij of the actual hypergraphs used, and can be easily understood where A(m) denotes the adjacency matrix of the mth graph. as follows. Assume we have a hypergraph ( , )with Clustering in this case identifies vertex sets which are densely N nodes and M hyperedges. ThenH = theV E relative connected in multiple, but not necessarily all, graphs. score=| ofV| any set X with=|En| X nodes and m hyperedges compared to the score⊂ V of the total=| hypergraph| is

1 p(X) m N p α2 S 1/p sp(α1,α2), p( ) = M n = α ≡ VII. ALGORITHM VALIDATION S V $ % 1 A. Random geometric graphs with α1 and α2 the fractions of nodes and edges in X.The The dominant eigenvector of a graph’s adjacency matrix phase diagram of sp as a function of these two variables is often considered as a measure (“eigenvector is independent of the actual hypergraph under consideration centrality” [1]) and is, in essence, equal to a simplified (Fig. 2). Naturally, not all combinations of α1 and α2 are PageRank [34]forrankingglobalverteximportance.Itmay admissible. In general, there exists a boundary α2 " f (α1) thus come as a surprise to see it playing a role in identifying with f (α1) α1 for α1 1. In sparse hypergraphs, we ≈ 1 δ ≈ localized clusters (however, see the references in the previous typically have M N + with δ small, where often δ 0. section). In order to demonstrate the validity of our approach Locally, however,∼ the edge density can be much higher.= and illustrate the statements in Sec. V,weapplieditto For instance, in ordinary edge clustering, m n2,andin randomly generated geometric graphs of various sizes (see triangle-based clustering, m n3,forn not too∼ large. Hence, ∼ Appendix C1for details). as α1 decreases from 1, the boundary function f (α1)will For visualization purposes, we generated as a toy example deviate more and more from the diagonal α2 α1.InFig.2, arandomgeometricgraphwith100verticesandradiusr2 we have sketched a typical shape of a boundary= function (thick = 0.02 [Fig. 1(a)]. The graph is evidently modular and the six line). At p 1(Fig.2,topleft),thecontourlinesofsp are = highest-scoring edge clusters identified by our algorithm (with straight lines and sp will clearly be maximal at small values of p 1) are indicated in color. The profiles of the corresponding (α1,α2). As p increases, the contour lines become increasingly = dominant eigenvectors are clearly localized on a subset of more concave, pushing the value where sp attains its maximum nodes [Fig. 1(b)], illustrating that in a modular network, the towards α1 1. For the idealized boundary function in Fig. 2, dominant eigenvector indeed indicates the location of a single the transition= is in fact discontinuous and jumps from being cluster. Furthermore, comparing the edge-to-node ratio for at α1 0.1 (origin of the axes) to α1 1aroundp 1.95 each of the discovered edge clusters with the theoretical upper (bottom= left). = = bound in Eq. (1) shows that the solution of the restricted Since the transition is, in general, sharp as a function of p variational problem [Eq. (2)]mustbeclosetothetrue and can even be discontinuous, we will in practice only use maximum [Fig. 1(c)]. the default edge-to-node ratio score with p 1toidentify For a more systematic analysis, we performed triangle- dense hypergraph clusters, or use a large value= of p (typically based clustering on sequences of geometric graphs with p # 10) to identify connected hypergraph components. constant expected edge density and varying size. Triangle- based clustering searches for overlapping sets of triangles in an ordinary graph and corresponds to the simplest form of C. Algorithm efficiency k- clustering [35]. Here we considered each instance of For an undirected hypergraph with N nodes, M hyperedges, atriangleintheinputgraphasahyperedgeinathree-uniform and maximum edge size kmax,theupdatestepsinthepower hypergraph to which we applied our spectral clustering algorithm are at most of the order kmaxM [Eq. (11)]and algorithm. The parameter p can be used to identify clusters N [Eqs. (12) and (13)]. The number of steps needed to at different levels of resolution. Independent of network size, reach convergence depends on the convergence parameter " there is a low-p phase where the fraction of nodes in a cluster [Eq. (14)]andthereforepossiblyalsoonthehypergraphsize. is small compared to total network size, and a high-p phase In practice, a maximal number of iterations Imax is defined where a cluster consists of a macroscopic network portion and convergence manually inspected when Imax is exceeded. [Fig. 1(d)]. Interestingly, at p 1, cluster size does not Determining the optimal threshold value is, at most, of the depend on network size [Fig. 1(d)= ,inset].Henceclustering order of N (number of possible threshold values) times M based on (hyper)edge-to-node ratio scores [Eq. (10)]does (calculation of the edge-to-node ratio score). Taken together, not suffer from a resolution limit problem where cluster runtime is bounded by size grows with network size irrespective of the presence of “natural” clusters at smaller scales [36,37]. As in the previous trun " Imax[O(kmaxM) O(N)] O(MN). example, the cluster scores are always close to their theoretical + + upper bounds, demonstrating that the solution of the restricted For directed hypergraphs, determining the optimal thresh- variational problem is close to the true optimum in all cases old pair over all possible combinations of entries of the (see Fig. S1 of Supplemental Material [38]). dominant singular vector pair (x,y)isoftheorderofN 2M,

056111-5 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

(a) Randomly generated geometric graph (b) Dominant eigenvector pro!les

0.18

0.16

0.14

0.12

0.1 max x 0.08

0.06

0.04

0.02

0 10 20 30 40 50 60 70 Vertex ID

(c) Cluster scores and upper bounds (d) Relative cluster size vs. resolution parameter p

4 1 0.9 3.5 N=200 0.8 N=400 3 N=600 0.7 N=800 N=1000 2.5 0.6

2 Φ 0.5 Score 35 0.4 1.5 30

0.3 25 1 0.2 20 0.5 15 0.1 200 400 600 800 1000 N 0 0 5 10 15 20 25 1 1.5 2 2.5 3 3.5 4 4.5 5 Cluster ID p

FIG. 1. (Color online) (a) Example of a randomly generated geometric graph with 100 vertices and radius r2 0.02, showing the largest connected component with the six highest-scoring edge clusters indicated by filled nodes. (b) Dominant eigenvector= profiles for the six highest-scoring edge clusters. (c) Edge-to-node ratio scores (left blue bars) and theoretical upper bound (right red bars) for all 25 edge clusters. (d) Cluster size as the fraction % of total number of network nodes for the highest-scoring triangle-based cluster in random geometric graphs with N 200, 400, 600, 800, and 1000 nodes and constant edge density (ρ 4) as a function of p.Eachdatapointisanaverageover10 random= networks. The inset shows the absolute mean cluster size and standard deviation= over 10 random networks as a function of N for p 1. = which is often prohibitive. In such instances, taking order hypergraph edges. We illustrate this idea by showing that local and global alignment of complex networks with a Xmax argmax p,q uXc ,y , = c R bipartite many-to-many mapping between their vertex sets can ( ) be naturally viewed as a hypergraph clustering problem. Ymax argmax p,q x,uYc , = c R Network alignment is the problem of finding topologically where we used the same notation as( in Sec.) V, results in an similar regions between two or more networks. In local approximation which is again O(MN). network alignment, small subgraphs in each network are aligned independently of the alignment of other subgraphs, VIII. APPLICATIONS whereas global network alignment aims to find a maximal alignment for each connected component in the input graphs. A. Local and global alignment of complex networks Network alignment methods for comparing molecular in- The core idea for applying hypergraph clustering to the teraction networks between different species come in two analysis of edge-colored graphs is to translate the relation main flavors. Topological network alignment finds conserved between multiple interaction types (edge colors) into higher- regions between networks taking only the topology of each

056111-6 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012)

p=1.00 p=1.65 1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6 2 2 α α 0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 α1 α1

p=1.95 p=5.00 1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6 2 2 α α 0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 α1 α1

FIG. 2. (Color online) Phase diagrams of s (α ,α )forp 1,1.65,1.95, and 5 (left to right, top to bottom). More yellow (lighter gray) p 1 2 = indicates higher values of sp; the thin lines are contours of constant sp, while the thick line indicates a possible boundary of admissible states. Colors (gray scale levels) are relative to the minimum and maximum in each panel and not comparable between panels. network into account [39]. The second class of methods i,k , j,l [Fig. 3(a)]. Such alignment hyperedges are takes into account that networks in different species have also{ } called{ }∈interologsM .Interologmappingisroutinelyusedto evolved from a common ancestor through gene duplication transfer annotation information from one organism to another and divergence mechanisms and hence there exists a mean- [42]andinterologanalysisisattheheartofpreviousnetwork ingful mapping between the nodes in each network [12]. alignment methods [41,43]. Here we propose to address the Methods have been developed which assume a one-to-one network alignment problem by identifying hyperedge clusters mapping [40], but in general a many-to-many map should be in the alignment or interolog hypergraph. Indeed, in a local considered [41]. alignment, we search for small regions in each graph which More formally, consider two ordinary graphs 1 and map nearly perfectly onto each other, i.e., have a high density G 2,whoseverticesareconnectedbyabipartitegraph . of interologs between them. This corresponds to hypergraph G M The directed alignment hypergraph between 1 and 2 clusters which maximize p for values of p close to one. In a is defined as the four-uniform hypergraphH containingG theG global alignment, we searchS for maximally matching regions edges ( i,j , k,l ), if and only if i,j 1, k,l 2,and in each graph, i.e., connected components in the interolog { } { } { }∈G { }∈G 056111-7 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

(a) Network alignment hyperedge (c) Example of a local yeast-human functional network alignment

H PH1 PH2

Y PY1 PY2

(b) Examples of local yeast-human protein complex alignments

FIG. 3. (a) A (directed) hyperedge in the yeast-human protein interaction network alignment hypergraph is an interolog:apairofinteracting yeast (Y) proteins and a pair of interacting human (H) proteins connected by orthology relations (dashed lines). (b) Examples of aligned protein complexes (cluster no. 19 left, no. 1 right). (c) Example of a functional network alignment (cluster no. 48). In all panels, yeast proteins are white and human proteins are gray; protein interactions are solid lines and orthology relations are dashed lines. hypergraph. These correspond to hypergraph clusters which 766 human proteins and contains 90% of all interologs (see maximize p for large values of p. Table S4 of Supplemental Material [38]), showing that there We usedS our spectral clustering algorithm to locally and exists a high degree of network conservation at a global scale, globally align protein-protein interaction networks between which is consistent with previous findings using topological yeast and human, using orthology groups for mapping con- network alignment [39]. served proteins between both organisms (see Appendix C2 for details). Protein-protein interaction networks represent binary, undirected associations between proteins and they B. Tripartite community detection in online folksonomies are, at present, the most extensively characterized molecular Folksonomies, i.e., online communities where users col- interaction networks in biology [11,44]. Typical examples of laboratively create and annotate data, are examples of social high-scoring local alignment clusters are conserved protein systems that cannot be adequately modeled by ordinary graphs. complexes (see Supplemental Material [38]). Figure 3(b) For instance, tagged social networks such as Flickr [47]or shows two examples: first is a set of proteins that maps one- CiteULike [48] have a tripartite structure that is best modeled to-one between yeast and human from the minichromosome by a three-uniform hypergraph [8,9]. Using CiteULike as a maintenance (MCM) complex (cluster no. 19), which plays concrete example, each hyperedge consists of a user who an important role in DNA replication and is indeed conserved has annotated an academic article with a certain keyword or among all eukaryotes [45]; and second (cluster no. 1) is a tag [48][Fig.4(a)]. Traditionally, the set of components of the V-type ATPase (a proton pump), of such tripartite networks has been analyzed by considering which has expanded in human compared to yeast by gene one-mode ordinary graph projections of the hypergraph, e.g., duplications [46]. Other local alignment clusters reflect more by connecting two users if they have annotated the same general functional networks than protein complexes (see articles or connecting two tags if they have been applied to Table S3 of Supplemental Material [38]). Figure 3(c) shows the same articles [9]. In contrast, hypergraph-based clustering cluster no. 48, which is an example of a conserved network preserves the tripartite structure of folksonomy data and involved in nucleic acid metabolism centered around the reveals additional levels of community structure. We applied general transcription factor TBP (SPT15 in yeast), i.e., the our spectral clustering algorithm to a subset of the CiteULike TATA-binding protein. The largest connected component in data set containing more than 400 000 (user, article, tag) the network alignment hypergraph maps 651 yeast proteins to entries and identified nearly 14 000 hyperedge clusters (see

056111-8 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012)

(a) (b)

(c)

FIG. 4. (a) CiteULike hyperedge, which represents one instance of a user (hexagonal node) who has annotated an article (circular node) with a certain tag (rectangular node). (b) Example of two tripartite communities where the same set of users (top) has annotated two sets of articles (middle) with two sets of tags (bottom). Only the two central articles and one central tag (“pattern recognition”) overlap between the two clusters. User-tag edges have been omitted for clarity. (c) Coarse-grained view of the CiteULike hypergraph using the 100 highest-scoring hyperedge clusters. Each node represents a cluster (with node size proportional to the number of hyperedges in the cluster) and edges represent significant overlap between clusters (overlap score > 0.5; edge size proportional to overlap score). Solid edges: user overlap; dashed edges: tag overlap; wavy edges: article overlap.

056111-9 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

(a) Combinatorial path cluster

(b) Hierarchical path cluster

FIG. 5. (Color online) Examples of high-scoring combinatorial [(a), cluster no. 6] and hierarchical [(b), cluster no. 1] path clusters in the yeast transcriptional regulatory network. Red (dark gray) rectangular nodes: knocked-out transcription factors (TFs); yellow (light gray) circular nodes: genes differentially expressed upon knock out of the TFs; white diamond-shaped nodes: all other TFs. Node size is proportional to out-degree and edge width to edge “betweenness” (defined for the purposes of this figure as the number of shortest paths between all pairs of cluster nodes passing through a given edge).

Appendix C3for details). The additional level of detail present article, or tag overlap between clusters. Figure 4(b) shows in hyperedge clusters is illustrated by looking at the user, an example of two hyperedge clusters formed by the same

056111-10 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012) set of users who have annotated different sets of articles by Here we address the problem of identifying sets of nodes different sets of tags. Only one tag, i.e., “pattern recognition,” is which respond to the knock out of a TF through similar regu- common between both clusters. The remaining tags show that latory paths by defining a nonuniform hypergraph where each the articles in the first cluster are about collective computing hyperedge corresponds to a shortest path between two nodes in and swarm intelligence, whereas those in the second cluster the original regulatory network. Hypergraph-based clustering deal with image analysis (see Table S1 of Supplemental will then find sets of nodes with a high number of shortest paths Material [38]), which are indeed two distinct subjects within running through them and such clusters form potential “signal- the broad field of pattern recognition. propagation” modules, which is consistent with the notion that In general, we expect such subdivisions of one-mode high information flow in a network is associated to high values projected communities to occur at the level of users (i.e., the of a node’s “betweenness” centrality (defined as the number same set of users annotating different sets of articles using of shortest paths between all pairs of nodes passing through a different sets of tags), but much less at the level of articles or given node). To test this hypothesis, we calculated all directed tags (i.e., we do not expect different sets of users to annotate shortest paths in the regulatory network of yeast between a TF the same set of articles using different sets of tags, or to use the and the genes differentially expressed upon knock out of that same set of tags for different sets of articles). Indeed, the 100 TF. The resulting hypergraph contained 1332 hyperedges be- highest-scoring clusters (which together contain about 20% tween 788 nodes, and spectral clustering identified 25 nonsin- of all hyperedges) overlap predominantly at the user level, to gleton and 14 singleton clusters (see Appendix C4for details). amuchlesserextentatthetaglevel,andhardlyoverlapat Topologically, there appear to exist two distinct types of path the article level, while about 21 of these clusters do not have clusters. Combinatorial path clusters contain genes responding any significant overlap (overlap > 50%; see Appendix C3for to the knock out of multiple TFs and form a network of densely details) with any other cluster [Fig. 4(c)]. Significant article overlapping paths. Figure 5(a) shows a combinatorial cluster overlap occurs in only two instances. In both cases, it concerns of 199 shortest paths from 20 TFs to 186 genes involved in gly- asubsetofuserswhohaveannotatedasubsetofarticlesfroma colysis and gluconeogenesis. Hierarchical path clusters have larger cluster with an additional set of tags. Tag overlap occurs alayeredstructure,wheretheperturbationalsignalofusually more frequently than article overlap, but with lower overlap not more than one TF flows to its targets via a limited number of percentages than user overlaps. Overlapping tags are typically intermediate TFs, in a strictly hierarchical manner [Fig. 5(b)]. general tags which can be applied to a broad spectrum of The functional relevance of regulatory path clusters is demon- articles. For instance, the ten tags occurring most frequently strated by the fact that they contain a significant fraction of the in the top 100 clusters are bibtex-import, learning, social, genes affected by the deletion of the cluster’s TF and that they evolution, review, support, govt, non-us, collaboration, and strongly overlap with specific functional categories (see Tables design. Thus we conclude that hyperedge clusters capture S5 and S6 of Supplemental Material [38]). For simplicity, we topic-specific tripartite (user, article, tag) communities which considered here only the shortest paths in the transcriptional reveal more structure of the underlying data than user, article, regulatory network, but clearly the approach can be extended or tag communities based on a single data dimension only. to paths composed of multiple interaction types.

C. Path clustering in regulatory networks IX. CONCLUSIONS Unlike protein-protein interaction networks, which are Over the past decade, graph theory has become crucial undirected, regulatory networks, which control the cellular to represent and reason about complex network data. In response to external or internal perturbations, are directed and particular, clustering, i.e., the detection of densely inter- represent the flow of information within a cell [10]. In tran- connected groups of vertices with few connections to the scriptional regulatory networks, the response to perturbations rest of the network, has become a standard coarse-graining can be measured experimentally by genetically knocking out a procedure to understand the structure and function of complex transcription factor (TF) and measuring the resulting changes networks. With more and more data becoming available to in gene expression levels on a genome-wide scale [49]. In highlight different aspects of the same complex systems, a yeast, direct physical binding interactions between a TF and its need has arisen to analyze networks with multiple types of target genes [50]aswellasperturbationalresponsedataforthe interactions simultaneously. In this paper, we have proposed to same TF [49]areavailableforacomprehensivesetofalmost use hypergraphs to characterize higher-order relations between 200 TFs (see Appendix C4for details). On average, only 3% of simple graphs and we have introduced efficient algorithms for the genes which respond to a knock-out perturbation of a TF are clustering and biclustering in such hypergraphs. also direct physical targets of that TF, and various approaches Our main result is a spectral clustering algorithm for hy- have been proposed to understand the mechanisms of indirect pergraphs, based on a generalization of the Perron-Frobenius regulation and propagation of network perturbations in this theorem for directed and undirected hypergraphs. More pre- context [51–54]. It is thought that perturbational responses are cisely, we have shown that like in ordinary graphs, there exists organized in a modular way, in the sense that groups of genes aunique,positivevector,calledthedominanteigenvector,over will be affected by the knock out of a TF through the same in- the set of vertices of a hypergraph, which maximizes a natural termediate regulatory pathways. However, due to the variable generalization of the Rayleigh-Ritz ratio for matrices. The length of these pathways, previous approaches for clustering importance of this result lies in the fact that the ratio of the in directed networks (e.g., [55–57]), which identify densely in- number of edges to the number of nodes in any subset of teracting node sets, are not directly applicable to this problem. vertices can be expressed as the same Rayleigh-Ritz ratio, in

056111-11 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012) graphs and hypergraphs alike. Densely interconnected clusters generalizes the definition of a symmetric adjacency matrix can therefore be found very efficiently by first computing the B A AT from the asymmetric adjacency matrix A of a dominant eigenvector and then converting it to a discrete directed= + graph. Clearly, to call connected, we shall ask that set of vertices. Uniqueness of the dominant eigenvector ˜ is connected as defined in Sec.H II. guarantees unambiguity of the solution and rapid convergence H Now consider two subsets I,J such that I J is of the numerical procedure, whereas positivity implies that the neither empty nor equal to .Since ˜⊂isV connected, there∪ exist V H discretization can be achieved by setting an optimal threshold vertices i1,...,ik I, j1,...,j' J ,andh1,...,hm I J on its entries. such that ∈ ∈ %∈ ∪ Our work has been motivated by concrete problems of data w˜ ( i1,...,ik,j1,...,j',h1,...,hm ) > 0. integration in social and biological networks. We have given { } three practical examples for using hypergraph-based clustering This implies that there exists at least one partition of these in these contexts, namely, the alignment of protein-protein nodes into a source and target set with nonzero directed weight. interaction networks between multiple species using interolog We ask slightly more, namely, that there is a partition of the clustering, the detection of tripartite communities in form folksonomies, and the identification of overlapping regulatory w( i1,...,ik,h1,...,hn , j1,...,j',hn 1,...,hm ) > 0, pathways in perturbational expression data using shortest { } { + } path clustering. Undoubtedly, many more applications for i.e., the source as well as the target set should contain at least hypergraph-based clustering exist in the analysis of other bio- one element not in I or J .Notethattherequirementthatalli’s logical, social, computer, communication, or neural networks. go into the source set and all j’s go into the target set is purely From a theoretical point of view, we have considered the edge- notational convenience, since I or J are allowed to be empty, to-node ratio as a simple quality score for clusters in graphs and as long as their union is not. If the above condition is fulfilled hypergraphs. Although this score has many attractive proper- for all pairs of sets (I,J), we say that the directed hypergraph ties, such as its direct relation with the dominant eigenvector is strongly connected. and the absence of any resolution limit problems, it will still be H of interest to generalize clustering algorithms based on other APPENDIX B: GENERAL PROOF OF THE quality scores from graphs to hypergraphs as well. Popular PERRON-FROBENIUS THEOREM FOR CONNECTED methods like those based on minimal cutsets or modularity HYPERGRAPHS maximization also rely on spectral properties of, respectively, the graph Laplacian and modularity matrix. Although Consider a non-negative maximizer x of p(x)andwithout R certain mathematical aspects, such as eigenvalue multiplicity loss of generality assume x p 1. Let again I i : ) ) = ={ ∈ V and its implications on algorithm convergence and cluster xi 0 and assume I .Letk be the smallest integer for = } %=∅ discretization, are more complicated in these cases, we believe which there exists at least one set i1,...,ik I and at least ∈ our work lays the theoretical foundations for future studies one set j1,...,jm I such that w( i1,...,ik,j1,...,jm ) > 0. %∈ { } in this direction. Such k must exist, since is connected (see Appendix A). For ">0, define H

xi i I ACKNOWLEDGMENTS x˜i %∈ = "iI. T.M. thanks the Department of Mathematics of the Uni- - ∈ We will show that for " small enough, (x˜) > (x), which versity of California, Davis for warm hospitality during visits p p contradicts the assumption that thereR can existR a maximizer when part of this work was performed. The work of B.N. with zero elements. We have was supported in part by the National Science Foundation p p p p under Grant No. DMS-1009502. We thank the referees for x˜ p x p I " 1 I " , their detailed and constructive comments which led to a much ) ) =) ) +| | = +| | or, to leading order in ", improved version of our paper. 1 I 1 | |"p o("p). (B1) x˜ p = − p + APPENDIX A: STRONG CONNECTIVITY OF DIRECTED ) ) For the denominator of p(x˜), we have HYPERGRAPHS R 1 1 E E Consider first an undirected hypergraph ( , )on | | | | N vertices. Although connectedness of doesH = notV E imply w(E) x˜i w(E) xi = H E &i E ' E :E I &i E ' irreducibility, we do have the property that if there exists "∈E #∈ { ∈E"∩ =∅} #∈ 1 apropersubsetI such that for all i1,...,ik I and E j ,...,j I, w( ⊂i ,...,iV ,j ,...,j ) 0, then ∈ is not | | 1 m 1 k 1 m w(E) x˜i %∈ { } = H + connected (since there can then be no path that starts in I and E :E I &i E ' { ∈E"∩ %=∅} #∈ escapes from I). Hence, if is connected, no such set I exists. 1 H E For a directed hypergraph ( , ), we can define an | | H =˜ V E ˜ p(x) w(E) x˜i . underlying undirected hypergraph ( , )byconsidering = R + all possible partitions of a subset EH = VintoE source and tar- E :E I &i E ' ⊂ V { ∈E"∩ %=∅} #∈ get sets, i.e.,w ˜ (E) (S,T ):S T E w(S,T ). This procedure (B2) = { ∪ = } ! 056111-12 ALIGNMENT AND INTEGRATION OF COMPLEX NETWORKS ... PHYSICAL REVIEW E 86,056111(2012)

From the preceding discussion, it follows that the leading term of 2567 interolog hyperedges [cf. Fig. 3(a)]. At p q 1, k = = k m 180 clusters with at least two hyperedges were found; 119 in " of the second term in Eq. (B2) is of the order of " + for some k,m ! 1. Hence, for " small enough, the extra positive hyperedges had no connections in the hypergraph, forming k k m singleton clusters. The complete distribution of hyperedges, term of order " + in Eq. (B2) offsets the negative term of order "p in Eq. (B1),andweget,forsomec>0, nodes, and scores for all clusters is shown in Fig. S2 of the Supplemental Material [38]; the functional analysis of the local I p p k k and global alignment clusters is given in Tables S2 and S3 of p(x˜) 1 | |" o(" ) p(x) c" k m o(" k m ) R = − p + R + + + + the Supplemental Material [38]. . /. / k k p(x) c" k m o(" k m ) > p(x). = R + + + + R 3. Tripartite community detection in the CiteULike data Having established that a maximizer x must be positive, x>0, the remainder of the proof is the same as the proof for We obtained the complete “who-posted-what” data from irreducible hypergraphs, since in Eq. (6) it suffices that at least CiteULike [48], containing (as of 1 February 2012) 16 553 642 one jm I arrives at a contradiction, which is guaranteed by (user, article, tag) entries. To create a more manageable the connectedness%∈ of . data set, we considered all entries from 2005, resulting in a For directed hypergraphs,H the condition (and definition) of hypergraph of 466 948 (user, article, tag) hyperedges between strong connectivity in Appendix A is tailormade to ensure 4693 users, 121 071 articles, and 36 489 tags. Recursive that the above argument still goes through. More precisely, hypergraph spectral clustering with p 1identified13987 = if (x,y)areapairofnon-negativemaximizersof p,q(x,y) clusters with at least two hyperedges; 4616 hyperedges formed R [cf. Eq. (7)], then define I i : xi 0 and J j singleton clusters. The complete distribution of hyperedges, ={ ∈ V = } ={ ∈ : yj 0 .Settingthezeroelementsinx and y to a nodes, and scores for all clusters is shown in Fig. S3 of the smallV positive= } value ",strongconnectivityimpliesthatthe Supplemental Material [38]. While comparing the user, article, α numerator of p,q increases by a term of order " with α<1, and tag overlap of two hyperedge clusters, we were primarily whereas the denominatorR (the norms of x and y)canonly interested to detect when the set of users, articles, or tags of p q + asmallerclusterisentirelycontainedinalargercluster(cf. decrease p,q by a term of order " 2 with p,q ! 1. The uniquenessR argument again follows along the lines leading Fig. 4). We therefore used the overlap score defined for two to Eq. (6). sets X and Y as X Y ovlp(X,Y ) | ∩ | , APPENDIX C: NETWORK DATA AND = min( X , Y ) NUMERICAL SETTINGS | | | | Here we summarize the data sources and parameter settings which reaches its maximum value of 1 whenever X Y or ⊂ used in the example applications (Secs. VII and VIII). Y X. ⊂

1. Random geometric graphs 4. Path clustering in the yeast transcriptional AgeometricgraphwithN vertices and radius r is defined by regulatory network aset of points in a metric space and edges (u,v) : We obtained a network of 11 373 physical transcription 0 < Vu v r .WegeneratedrandomgeometricgraphsE ={ ∈ V " factor (TF) binding interactions between 198 TFs and 3535 by sampling) − ) with} uniform probability N points in the unit target genes in yeast from [50]andknock-outmicroarraydata square [0,1] [0,1] and taking the standard two-norm as the for 266 TFs from [49]. The knock-out data can be represented measure.× For a given vertex, the probability that it is as a directed network of perturbational interactions where connected to any other vertex is πr2.Hence,ifweincrease each TF is connected to the genes which respond to the N while keeping ρ Nr2 constant, we obtain a sequence knock-out perturbation of that TF. In addition, 182 TFs with of random geometric= graphs with constant average expected physical binding data also had knock-out data for a total of degree. 7090 perturbational interactions. We constructed a directed hypergraph consisting of 1332 hyperedges and 788 nodes, 2. Alignment of yeast and human PPI networks where each hyperedge is a shortest path in the regulatory We obtained physical protein-protein interactions (PPI) for network between a TF and a gene differentially expressed yeast from the BioGRID [58]databaseandphysicalandfunc- upon knock out of that TF. We defined the source set of a tional PPIs for human from the BioGRID and STRING [59] hyperedge as the knocked-out TF and the target set as the databases. The yeast network had 36 391 interactions between remainder of the path. Recursive spectral clustering identified 4847 proteins; the human network had 40 630 interactions 39 clusters of which 14 were singletons. between 9602 proteins. We integrated these networks with orthology mappings from the InParanoid database [60]. There were 3390 orthology relations between 2245 yeast and 3255 5. Supplementary data and algorithm implementation human proteins which had at least one interaction in their An implementation of the clustering algorithm in JAVA, respective PPI networks. We performed recursive spectral together with the input data and clustering results described in clustering on the directed alignment hypergraph consisting Sec. VIII,isavailablefromtheprojecthomepageinRef.[61].

056111-13 TOMMICHOELANDBRUNONACHTERGAELE PHYSICALREVIEWE 86,056111(2012)

[1] M. E. J. Newman, SIAM Rev. 45, 167 (2003). [34] S. Brin and L. Page, Comput. Networks 30, 107 (1998). [2] R. Albert and A-L. Barabasi,´ Rev. Mod. Phys. 74, 47 (2002). [35] G. Palla, I. Derenyi,´ I. Farkas, and T. Vicsek, Nature (London) [3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, 435, 814 (2005). and U. Alon, Science 298, 824 (2002). [36] S. Fortunato and M. Barthelemy,´ Proc. Natl. Acad. Sci. USA [4] M. E. J. Newman, Proc. Natl. Acad. Sci. USA 103, 8577 (2006). 104, 36 (2007). [5] S. Fortunato, Phys. Rep. 486, 75 (2010). [37] B. H. Good, Y-A. de Montjoye, and A. Clauset, Phys. Rev. E 81, [6] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, 046106 (2010). Internet Math. 6, 29 (2009). [38] See Supplemental Material at http://link.aps.org/supplemental/ [7] M. A. Porter, J.-P. Onnela, and P. J. Mucha, Notices AMS 56, 10.1103/PhysRevE.86.056111 for supplementary figures and 1082 (2009). tables. [8] G. Ghoshal, V.Zlatic,´ G. Caldarelli, and M. E. J. Newman, Phys. [39] O. Kuchaiev, T. Milenkovic, V. Memisevic, W. Hayes, and Rev. E 79, 066118 (2009). N. Przulj, J. R. Soc. Interface 7, 1341 (2010). [9] V.Zlatic,´ G. Ghoshal, and G. Caldarelli, Phys. Rev. E 80,036118 [40] J. Berg and M. Lassig,¨ Proc. Natl. Acad. Sci. USA 103,10967 (2009). (2006). [10] U. Alon, An Introduction to Systems Biology: Design Principles [41] R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, of Biological Circuits (Chapman & Hall/CRC, London, 2007). P. Uetz, T. Sittler, R. M. Karp, and T. Ideker, Proc. Natl. Acad. [11] X. Zhu, M. Gerstein, and M. Snyder, Genes Dev. 21,1010 Sci. USA 102, 1974 (2005). (2007). [42] H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. D. J. Han, [12] R. Sharan and T. Ideker, Nature Biotechnol. 24, 427 (2006). N. Bertin, S. Chung, M. Vidal, and M. Gerstein, Genome Res. [13] D. Zhou, J. Huang, and B. Scholkopf,¨ in Advances in 14, 1107 (2004). Neural Information Processing Systems (NIPS) 19, edited by [43] B. P. Kelley, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, B. Scholkopf,¨ J. C. Platt, and T. Hofmann (MIT Press, B. R. Stockwell, and T. Ideker, Proc. Natl. Acad. Sci. USA 100, Cambridge, MA, 2007), pp. 1601–1608. 11394 (2003). [14] A. Vazquez, Phys. Rev. E 77, 066106 (2008). [44] A-L. Barabasi´ and Z. N. Oltvai, Nature Rev. Genet. 5,101 [15] S. Klamt, U-U. Haus, and F. Theis, PLoS Comp. Biol. 5, (2004). e1000385 (2009). [45] B. K. Tye, Annu. Rev. Biochem. 68, 649 (1999). [16] A. Vazquez, J. Stat. Mech.: Theory Exp. (2009) P07006. [46] H. Kibak, L. Taiz, T. Starke, P. Bernasconi, and J. P. Gogarten, [17] M. E. J. Newman, Phys. Rev. E 74, 036104 (2006). J. Bioenerg. Biomemb. 24, 415 (1992). [18] R. A. Horn and C. R. Johnson, Matrix Analysis (Cambridge [47] http://www.flickr.com. University Press, Cambridge, England, 1985). [48] http://www.citeulike.org/faq/data.adp. [19] G. H. Golub and C. F. Van Loan, Matrix Computations,3rded. [49] Z. Hu, P. J. Killion, and V.R. Iyer, Nature Genet. 39, 683 (2007). (The Johns Hopkins University Press, Baltimore, MD, 1996). [50] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. [20] T. S. Evans and R. Lambiotte, Phys. Rev. E 80, 016105 (2009). Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. [21] Y. Y. Ahn, J. P. Bagrow, and S. Lehmann, Nature (London) 466, Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, 761 (2010). M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. [22] L. De Lathauwer, B. De Moor, and J. Vandewalle, SIAM J. Gifford, E. Fraenkel, and R. A. Young, Nature (London) 431,99 Matrix Anal. Appl. 21, 1324 (2000). (2004). [23] K. Inoue and K. Urahama, Pattern Recogn. Lett. 20, 699 (1999). [51] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, [24] J. M. Kleinberg, J. ACM 46, 604 (1999). Bioinformatics 18, S233 (2002). [25] L-H. Lim, in Proceedings of the IEEE International Workshop on [52] C. T. Workman, H. C. Mak, S. McCuine, J. B. Tagne, M. Computational Advances in Multi-Sensor Adaptive Processing Agarwal, O. Ozier, T. J. Begley, L. D. Samson, and T. Ideker, (IEEE, 2005), pp. 129–132. Science 312, 1054 (2006). [26] K. C. Chang, K. Pearson, and T. Zhang, Commun. Math. Sci. 6, [53] A. Gitter, Z. Siegfried, M. Klutstein, O. Fornes, B. Oliva, I. 507 (2008). Simon, and Z. Bar-Joseph, Mol. Syst. Biol. 5, 276 (2009). [27] S. Friedland, S. Gaubert, and L. Han, Lin. Alg. Appl., [54] A. Joshi, T. Van Parys, Y. Van de Peer, and T. Michoel, Genome doi: 10.1016/j.laa.2011.02.042 (2011). Biol. 11, R32 (2010). [28] T. Michoel, A. Joshi, B. Nachtergaele, and Y. Van de Peer, Mol. [55] R. Guimera,` M. Sales-Pardo, and L. A. Nunes Amaral, Phys. Bio. Syst. 7, 2769 (2011). Rev. E 76, 036102 (2007). [29] P. Audenaert, T. Van Parys, M. Pickavet, P. Demeester, Y. Van [56] G. Palla, I. J. Farkas, P. Pollner, I. Derenyi,´ and T. Vicsek, New de Peer, and T. Michoel, Bioinformatics 27, 1587 (2011). J. Phys. 9, 186 (2007). [30] L. V. Zhang, O. D. King, S. L. Wong, D. S. Goldberg, A. H. Y. [57] E. A. Leicht and M. E. J. Newman, Phys. Rev. Lett. 100,118703 Tong, G. Lesage, B. Andrews, H. Bussey, C. Boone, and F. P. (2008). Roth, J. Biol. 4, 6 (2005). [58] C. Stark, B-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, [31] Daniel M. Dunlavy, Tamara G. Kolda, and W. Philip and M. Tyers, Nucleic Acids Res. 34, D535 (2006). Kegelmeyer, Tech. Rep. No. SAND2006-2079 ( Sandia National [59] L. J. Jensen, M. Kuhn, M. Stark, S. Chaffron, C. Creevey, J. Laboratories, Albuquerque, NM and Livermore, CA, 2006). Muller, T. Doerks, P. Julien, A. Roth, M. Simonovic et al., [32] W. Li, C-C. Lie, T. Zhang, H. Li, M. S. Waterman, and X. J. Nucleic Acids Res. 37, D412 (2009). Zhou, PLoS Comp. Biol. 7, e1001106 (2011). [60] A. C. Berglund, E. Sjolund, G. Ostlund, and E. L. L. [33] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J. P. Sonnhammer, Nucleic Acids Res. 36, D263 (2008). Onnela, Science 328, 876 (2010). [61] http://schype.googlecode.com.

056111-14