Scalable Label Propagation for Multi-relational Learning on the Tensor Product of Graphs

Zhuliu Li*, Raphael Petegrosso*, Shaden Smith, David Sterling, George Karypis, Rui Kuang†

Abstract—Multi-relational learning on knowledge graphs infers high-order relations among the entities across the graphs. This learning task can be solved by label propagation on the tensor product of the knowledge graphs to learn the high-order relations as a tensor. In this paper, we generalize a widely used label propagation model to the normalized tensor product graph, and propose an optimization formulation and a scalable Low-rank Tensor-based Label Propagation algorithm (LowrankTLP) to infer multi-relations for two learning tasks, hyperlink prediction and multiple graph alignment. The optimization formulation minimizes the upper bound of the noisy tensor estimation error for multiple graph alignment, by learning with a subset of the eigen-pairs in the spectrum of the normalized tensor product graph. We also provide a data-dependent transductive Rademacher bound for binary hyperlink prediction. We accelerate LowrankTLP with parallel tensor computation, which enables label propagation on a tensor product of 100 graphs, each of size 1,000, in less than half an hour in the simulation. LowrankTLP was also applied to predicting author-paper-venue hyperlinks in publication records, alignment of segmented regions across up to 26 CT-scan images, and alignment of protein-protein interaction networks across multiple species. The experiments demonstrate that LowrankTLP closely approximates the original label propagation with better scalability and accuracy. Source code: https://github.com/kuanglab/LowrankTLP

Index Terms—multi-relational learning, tensor product graph, label propagation, hyperlink prediction, multiple graph alignment, tensor decomposition and completion

1 INTRODUCTION

Label propagation has been widely used for semi-supervised learning on the similarity graph of labeled and unlabeled samples [1], [2], [3]. As illustrated in Figure 1 (A), label propagation propagates training labels on a graph to learn a vector y predicting the labels of vertices. A generalization of label propagation on the Kronecker product of two graphs (also called bi-random walk) can infer the pairwise relations in a matrix Y between the vertices from the two graphs, as shown in Figure 1 (B). This approach has been applied to aligning biological and biomedical networks [4], [5]. Similar learning problems on Kronecker product graphs also exist in link prediction [6], [7], matching images [8], image segmentation [9], collaborative filtering [10], citation network analysis [11] and multi-language translation [12]. When applied on the tensor product of n graphs to predict the matching across the n graphs in an n-way tensor Y, as shown in Figure 1 (C), label propagation accomplishes n-way relational learning from knowledge graphs [13]. Label propagation on the tensor product graph explores the graph topologies for associating vertices across the graphs, assuming the global relations among the vertices reveal the vertex identities [5]. However, the tensor formulation of label propagation is computationally intensive to solve: each multiplication with the tensor exponentially increases the number of nonzero entries, and after one or two iterations a dense tensor is expected. Empirically, most of the existing methods are only scalable to learning 3-way relations in large graphs even if the graphs are sparse [6], [11]. The main objective of this study is to provide a principled and theoretical approximation approach, and scalable algorithms and implementations, to tackle the scalability issue. Our contributions in this paper are summarized as follows:

• We propose a novel optimization formulation to approximate the transformation matrix in the closed-form solution of label propagation on the tensor product graph, by learning with a subset of eigen-pairs from the normalized tensor product graph. We provide a theoretical justification that the globally optimal solution of the optimization problem minimizes an estimation error bound of recovering the true tensor that is structured by the tensor product graph manifolds, for multiple graph alignment. We also provide a data-dependent error bound using the transductive Rademacher complexity for binary hyperlink prediction.
• We develop an efficient eigenvalue selection algorithm to sequentially select the eigen-pairs from each individual normalized graph considering the global spectrum of the transformation matrix. We then show that the spectrum of the low-rank normalized tensor product graph constructed by the selected eigen-pairs is guaranteed to be the globally optimal solution to the proposed optimization formulation.
• We propose a Low-rank Tensor-based Label Propagation algorithm (LowrankTLP) using efficient tensor operations to compute the approximated solution based on the selected eigen-pairs from the knowledge graphs. We provide an efficient parallel implementation of LowrankTLP using the SPLATT library [14] with shared-memory parallelism to increase the scalability by a large margin.
• Our work comprehensively generalizes the label propagation model proposed in [3] to the tensor product of multiple graphs in several aspects, including the normalized tensor product graph, the regularization framework, the iterative tensor-based label propagation algorithm and the computation of the closed-form solution.
• We validate the effectiveness, efficiency and scalability of LowrankTLP on simulation data by controlling the graph size and topology. We also demonstrate the practical use of LowrankTLP on three real datasets for hyperlink prediction and multiple graph alignment across a large number of knowledge graphs.

• Z. Li, R. Petegrosso, S. Smith, G. Karypis and R. Kuang are with the Department of Computer Science and Engineering, University of Minnesota Twin Cities, MN, 55455, USA. E-mail: {lixx3617, peteg001, shaden, karypis, kuang}@umn.edu
• D. Sterling is with the Department of Radiation Oncology, University of Minnesota Twin Cities, MN, 55455, USA. E-mail: [email protected]
(*Co-first author, †Corresponding author)

Fig. 1. Label propagation generalized on tensor product graphs. (A) Label propagation on a graph predicts the labels of the vertices for semi-supervised learning; (B) Label propagation on the Kronecker product of two graphs predicts links between the vertices across the two graphs; (C) Label propagation on an n-way tensor product graph learns n-way multi-relations across the vertices in n graphs. Each n-way tuple of graph vertices in the same color is represented as an entry in the n-way tensor.

2 PROBLEM FORMULATION

We first define the notations and the tensor product graph. Several useful lemmas and the definitions of tensor CANDECOMP/PARAFAC (CP) decomposition and Tucker decomposition are also given in Appendix A. For more general tensor computations, we direct the readers to the survey paper [15].

Notations and operators: vectors are written as x, matrices as X, and tensors as calligraphic X; vec(X) denotes the vectorization of a tensor; ~ denotes the Hadamard product, ⊗ the Kronecker product, ⊙ the Khatri-Rao product, ◦ the outer product, and ×_i the i-mode product.

Tensor product graph (TPG): Let {W^(i) : i = 1, . . . , n} be n distinct symmetric matrices, where each I_i × I_i matrix W^(i) denotes the adjacency matrix of the i-th undirected graph with I_i vertices. The tensor product graph adjacency matrix is defined as W = ⊗_{i=1}^n W^(i) with dimension (∏_i I_i) × (∏_i I_i). The edge

W_{(a_1,a_2,...,a_n),(b_1,b_2,...,b_n)} = ∏_{l=1}^n W^(l)_{a_l,b_l}

of W encodes the similarity between a pair of n-way tuples of graph vertices (a_1, a_2, . . . , a_n) and (b_1, b_2, . . . , b_n), ∀ a_l, b_l ∈ [1, I_l], as illustrated in Figure 1 (C). The network properties of the TPG, including the spectral distribution, are discussed in [16].

Now, we introduce the two multi-relational learning problems studied in this paper. The objective is to score the queried n-way relations among the vertices across multiple undirected knowledge graphs {W^(i) : i = 1, . . . , n}, for either hyperlink prediction or multiple graph alignment, given as input 1) the labels of a small number of observed n-way relations, or 2) the similarity scores between the vertices of every pair of graphs, as shown in Figure 2 (A).

Task 1: hyperlink prediction
• Input relations: a small set Θ_h = {(i_1, i_2, . . . , i_n) : i_j ∈ [1, I_j], ∀j = 1, . . . , n} of labeled n-way relations (hyperlinks) among the vertices across the n graphs, where i_j denotes the i-th vertex of graph W^(j).
• Queried multi-relations: a set Ω_h = {(j_1, j_2, . . . , j_n) : j_l ∈ [1, I_l], ∀l = 1, . . . , n} of the queried n-way relations (hyperlinks), chosen per user's interests.
• Learning task: given the n knowledge graphs and the set Θ_h of labeled hyperlinks, predict the link strengths of the queried set Ω_h of hyperlinks in a sparse tensor O_h ∈ R^{I_n × I_{n-1} × ... × I_1} with |Ω_h| nonzero entries.

Task 2: multiple graph alignment
• Input relations: a set Θ_g = {R_ij ∈ R_+^{I_i × I_j} : i, j ∈ [1, n] and i < j} of non-negative matrices, where R_ij holds the similarity scores between the vertices of a pair of graphs W^(i) and W^(j).
• Queried multi-relations: a set Ω_g = {(j_1, j_2, . . . , j_n) : j_l ∈ [1, I_l], ∀l = 1, . . . , n} of the queried n-way relations, which can be derived from the pairwise relations by heuristics as in [17] and [18].
• Learning task: given the n knowledge graphs and the set Θ_g of pairwise relations, predict the alignment scores of the queried set Ω_g of n-way tuples of graph vertices in a sparse tensor O_g ∈ R^{I_n × I_{n-1} × ... × I_1} with |Ω_g| nonzero entries.

As outlined in Figure 2, given either the labels of the observed n-way relations or the similarity scores between the vertices of all the graph pairs, we propose a scalable label propagation algorithm for manifold learning on the tensor product graph (TPG) to infer the scores of a query set of n-way relations in a sparse output tensor, based on the topological information of the knowledge graphs.
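As a concrete illustration of the TPG definition above, the following minimal NumPy sketch (added here for illustration; it is not part of the paper's implementation) builds the adjacency matrix of the tensor product of three tiny graphs and checks the edge-product identity for one pair of vertex tuples. The graph sizes, random weights, and helper names are hypothetical.

```python
import numpy as np
from functools import reduce

def tensor_product_graph(adjs):
    """Adjacency of the tensor product graph W = W^(1) (x) ... (x) W^(n)."""
    return reduce(np.kron, adjs)

# three tiny symmetric adjacency matrices (hypothetical example graphs)
rng = np.random.default_rng(0)
adjs = []
for size in (3, 4, 2):
    A = rng.random((size, size))
    A = (A + A.T) / 2          # make symmetric
    np.fill_diagonal(A, 0)
    adjs.append(A)

W = tensor_product_graph(adjs)

# the TPG edge between tuples (a1,a2,a3) and (b1,b2,b3) equals the product of
# the individual edge weights W^(l)[a_l, b_l]
a, b = (1, 2, 0), (0, 3, 1)
sizes = [A.shape[0] for A in adjs]
row = np.ravel_multi_index(a, sizes)   # Kronecker ordering: the first graph varies slowest
col = np.ravel_multi_index(b, sizes)
expected = np.prod([adjs[l][a[l], b[l]] for l in range(3)])
assert np.isclose(W[row, col], expected)
```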

Fig. 2. The overview of LowrankTLP illustrated on the tensor product of three graphs. (A) Input: the initial input tensor Y^0 is 1) a sparse tensor of labeled multi-relations (given in set Θ_h) for hyperlink prediction, or 2) a CP-form estimated from pairwise similarities (given in set Θ_g) between every pair of the graphs for multiple graph alignment. (B) Knowledge Graphs: Three normalized undirected graphs S^(1), S^(2) and S^(3) are given. Algorithm 1 obtains the k optimal eigenvalues and the corresponding eigenvectors of the tensor product graph, based on the eigen-decomposition of each graph, for computing the approximation of the closed-form solution. (C) Efficient Tensor Operations: Y^0 and the selected eigen-pairs are used to perform compression and prediction operations to obtain the approximated closed-form solution of label propagation in CP-form. (D) Output: The scores of the hyperlink strengths queried in set Ω_h are predicted in a sparse tensor O_h; or, the scores of the graph alignment queried in the set Ω_g are predicted in a sparse tensor O_g, which are then used to derive the graph alignment.

3 LABEL PROPAGATION ON TENSOR PRODUCT GRAPH

In this section, we focus on generalizing the graph-based semi-supervised learning model proposed in [3] to the tensor product of multiple graphs. We first derive the normalized tensor product graph (TPG), and then generalize the objective function for graph-based semi-supervised learning on the TPG to solve the multi-relational learning tasks defined in Section 2. Next, we generalize the label propagation algorithm on the TPG and give its closed-form solution. Finally, we analyze the scalability issues of applying label propagation on the tensor product of multiple graphs.

3.1 Normalized TPG

Let S^(i) = [D^(i)]^{-1/2} W^(i) [D^(i)]^{-1/2} be the normalized graph of W^(i) for i = 1, . . . , n, where D^(i) is the degree matrix. The TPG W = ⊗_{i=1}^n W^(i) has its normalization S = D^{-1/2} W D^{-1/2} derived as follows:

S = (⊗_{i=1}^n [D^(i)]^{-1/2}) (⊗_{i=1}^n W^(i)) (⊗_{i=1}^n [D^(i)]^{-1/2})    (1)
  = ⊗_{i=1}^n ([D^(i)]^{-1/2} W^(i) [D^(i)]^{-1/2})    (2)
  = ⊗_{i=1}^n S^(i).

Equation (1) is obtained from the fact that the degree matrices are diagonal. Equation (2) is obtained by Appendix A Lemma 1. Using Appendix A Lemma 3, it can be shown that the eigenvalues of S are bounded between -1 and 1. This property will be used later in the derivations in the forthcoming sections.

3.2 Regularization framework with normalized TPG

Let Y^0 ∈ R^{I_n × I_{n-1} × ... × I_1} be an initial tensor which is either incomplete, with labels of the observed n-way relations, or complete, with noisy labels of all n-way relations. The true labels of all the n-way relations can then be inferred in a tensor Y ∈ R^{I_n × I_{n-1} × ... × I_1} by minimizing the following objective function:

J(Y) = 1/2 ( vec(Y)^T (I − ⊗_{i=1}^n S^(i)) vec(Y) + µ ||vec(Y) − vec(Y^0)||_2^2 ).    (3)

The first term in J(Y) is called the smoothness constraint or graph regularization, where I − ⊗_{i=1}^n S^(i) is the normalized graph Laplacian of the TPG; it ensures that the values (inferred multi-relations) in tensor Y are smooth on the manifolds of the normalized TPG S. In other words, the degree-normalized (a_1, a_2, . . . , a_n)-th and (b_1, b_2, . . . , b_n)-th entries in tensor Y are forced to be close if the edge weight W_{(a_1,a_2,...,a_n),(b_1,b_2,...,b_n)} = ∏_{l=1}^n W^(l)_{a_l,b_l} is large. The second term in J(Y) is called the fitting constraint, which penalizes the difference between the inferred tensor Y and its initialization Y^0, where µ > 0 is a hyperparameter balancing the impacts of both terms.

• Formulation of hyperlink prediction (Task 1): learning task 1 is a transductive learning problem [19], [20] of inferring a tensor Y from a sparse initial tensor Y^0. The nonzero entries in Y^0 are the labels (link types) of the observed n-way relations (hyperlinks) given in set Θ_h (defined in Section 2), such that the label of the (i_1, i_2, . . . , i_n)-th hyperlink is Y^0_{i_n, i_{n-1}, ..., i_1}. The zero entries in Y^0 represent the unobserved n-way relations. The inferred tensor Y is composed of the link strengths of all the n-way hyperlinks.

• Formulation of multiple graph alignment (Task 2): we convert the set Θ_g of pairwise similarity matrices to the rank-r CP-form (Appendix A Definition 1) of Y^0 as follows: first, symmetric NMF (symNMF) [21] is applied on a symmetric matrix R, built by stacking all the R_ij's, to obtain a nonnegative factor matrix F ∈ R_+^{(Σ_{i=1}^n I_i) × r} such that R ≈ FF^T, as illustrated in Figure 2 (A). Then, the rank-r CP-form of Y^0 is approximated as Y^0 = [[F^(n), F^(n-1), . . . , F^(1)]], where F^(i) ∈ R_+^{I_i × r} is the i-th submatrix of F. The assumption is that more similar pairwise relations between every pair (i_a, i_b) ⊂ (i_1, i_2, . . . , i_n) imply a stronger n-way relation in the tuple (i_1, i_2, . . . , i_n). This representation has been widely adopted in real graph alignment problems as in [17], [22] and [23]. As Y^0 is guessed from the pairwise relations, we call learning task 2 structured signal recovery from noisy observation. Our goal is to recover the true tensor Y, which is structured by the TPG manifolds, from its noisy observation Y^0.

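To make the inputs of the two tasks concrete, the following minimal NumPy sketch (an illustration we add here, not the paper's code) builds the symmetrically normalized graphs S^(i) of Section 3.1 and a sparse initial tensor Y^0 holding a few labeled hyperlinks for Task 1. The example graphs and labeled tuples are hypothetical.

```python
import numpy as np

def normalize_graph(W):
    """Symmetric normalization S = D^{-1/2} W D^{-1/2} of one knowledge graph."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# hypothetical input: three small symmetric adjacency matrices
sizes = [4, 3, 5]
rng = np.random.default_rng(1)
Ws = []
for I in sizes:
    W = rng.random((I, I)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    Ws.append(W)
S_list = [normalize_graph(W) for W in Ws]

# Task 1: a sparse initial tensor Y^0 holding a few labeled hyperlinks (Theta_h);
# the tensor is indexed as I_n x I_{n-1} x ... x I_1, as in the paper
Y0 = np.zeros(sizes[::-1])
theta_h = [((1, 0, 2), 1.0), ((3, 2, 4), -1.0)]   # ((i_1, i_2, i_3), label), hypothetical
for (i1, i2, i3), label in theta_h:
    Y0[i3, i2, i1] = label
```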
3.3 Label propagation algorithm on normalized TPG

The objective function J(Y) in Equation (3) can be minimized by performing the fixed-point iteration (4), which is a generalization of the graph-based semi-supervised learning algorithm in [3] to the TPG:

vec(Y^{t+1}) = α (⊗_{i=1}^n S^(i)) vec(Y^t) + (1 − α) vec(Y^0),    (4)

where α = 1/(1+µ) ∈ (0, 1) is a balancing hyperparameter and t denotes the iteration number. Applying the vectorization property of Tucker decomposition (Appendix A Definition 2), Equation (4) can be rewritten as

Y^{t+1} = α Y^t ×_1 S^(n) ×_2 S^(n-1) · · · ×_n S^(1) + (1 − α) Y^0.

Even if Y^0 is in a sparse form, the density of tensor Y increases exponentially in each iteration. Therefore, the space complexity is O(∏_{i=1}^n I_i) for storing the dense tensor and the time complexity is O((∏_{i=1}^n I_i)(Σ_{i=1}^n I_i)) per iteration. Due to the necessity of computing the full tensor, the same space and time complexities are required by the Link Propagation method proposed in [6], which applies conjugate gradient descent [24] to minimize a similar objective function defined on the original (unnormalized) TPG.

Since the eigenvalues of S are in [−1, 1] and α ∈ (0, 1), iteration (4) converges to the following closed-form solution of J(Y):

vec(Y*) = lim_{t→∞} vec(Y^t) = (1 − α)(I − αS)^{-1} vec(Y^0).    (5)

Furthermore, given the eigen-decomposition of each S^(i) as {S^(i) = Q^(i) Λ^(i) Q^(i)T : i = 1, . . . , n}, the eigen-decomposition of S can be expressed as

S = QΛQ^T = (⊗_{i=1}^n Q^(i))(⊗_{i=1}^n Λ^(i))(⊗_{i=1}^n Q^(i)T),

according to Appendix A Lemmas 1 and 3. Substituting S into Equation (5), we have

vec(Y*) = (1 − α)(⊗_{i=1}^n Q^(i))(I − α(⊗_{i=1}^n Λ^(i)))^{-1}(⊗_{i=1}^n Q^(i)T) vec(Y^0).    (6)

It is not hard to see that computing the closed-form solution in Equation (6) from right to left needs 2n matrix-tensor products in total, using the vectorization property of Tucker decomposition (Appendix A Definition 2). Therefore, the space and time complexity will be the same as running two iterations of label propagation, apart from computing the eigen-decompositions of all the normalized graphs {S^(i) : i = 1, . . . , n}. To tackle this challenge of computing label propagation of a high-order n-way tensor on the TPG, we propose the LowrankTLP algorithm in the next section, based on a principled approximation of the linear transformation matrix (I − αS)^{-1} in the closed-form solution in Equation (5).
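For intuition, the following tiny NumPy sketch (our illustration, only feasible for very small graphs) runs the fixed-point iteration of Equation (4) with an explicitly materialized Kronecker product S and checks that it converges to the closed-form solution of Equation (5). The graph sizes, labels, and α are hypothetical; this is exactly the brute-force computation that LowrankTLP is designed to avoid.

```python
import numpy as np
from functools import reduce

# tiny normalized graphs S^(i) (hypothetical); in practice these come from
# S^(i) = D^{-1/2} W^(i) D^{-1/2} as in Section 3.1
rng = np.random.default_rng(2)
S_list = []
for I in (4, 3, 5):
    W = rng.random((I, I)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    d = W.sum(1); S_list.append(W / np.sqrt(np.outer(d, d)))

S = reduce(np.kron, S_list)     # explicit TPG normalization: only feasible for tiny graphs
N = S.shape[0]
alpha = 0.9

y0 = np.zeros(N); y0[[3, 17, 40]] = 1.0   # vec(Y^0) with a few labeled n-way relations

# fixed-point iteration of Equation (4)
y = y0.copy()
for _ in range(200):
    y = alpha * S @ y + (1 - alpha) * y0

# closed-form solution of Equation (5)
y_star = (1 - alpha) * np.linalg.solve(np.eye(N) - alpha * S, y0)
assert np.allclose(y, y_star, atol=1e-6)
```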

4 LOW-RANK LABEL PROPAGATION

In this section, we first propose an optimization formulation to approximate the closed-form solution in Equation (5); then we develop Algorithm 1 to select a subset of eigen-pairs from the normalized tensor product graph S which is guaranteed to be the optimal solution to the proposed optimization formulation. Next, we propose the LowrankTLP algorithm (illustrated in Figure 2) for scalable label propagation on the TPG, using the selected eigen-pairs. Finally, we provide the theoretical justification of our optimization formulation for Task 2 by proposing an estimation error bound of recovering the true tensor that is structured by the TPG manifolds; we also provide a data-dependent error bound for a special case of Task 1.

4.1 Optimization formulation

We propose to approximate the closed-form solution in Equation (5) by minimizing the perturbation on the transformation matrix (I − αS)^{-1} as follows:

minimize_{eig(S_k)}  ||(I − αS)^{-1} − (I − αS_k)^{-1}||_{2,F}
subject to  rank(S_k) = k,  eig(S_k) ⊆ eig(S),    (7)

where S = ⊗_{i=1}^n S^(i) is the normalized TPG; eig(S_k) and eig(S) denote the sets of eigen-pairs of S_k and S respectively; ||·||_2 is the spectral norm and ||·||_F is the Frobenius norm. The objective is to find a low-rank matrix S_k, defined by a subset of eigen-pairs of S, that gives the lowest divergence on the overall transformation matrix (I − αS)^{-1}. In Section 4.5.1, we will show that this formulation minimizes the estimation error bound in Theorem 3. It is noteworthy that simply computing the best rank-k approximation to S per the Eckart-Young-Mirsky theorem does not guarantee the optimal solution. Instead, we will show that the globally optimal solution to the optimization problem (7) can be efficiently found by Algorithm 1.

4.2 Selection of the optimal eigen-pairs

Let S_k = Q_{1:k} Λ_{1:k} Q_{1:k}^T be the eigen-decomposition of S_k, where Λ_{1:k} and Q_{1:k} store the eigen-pairs {(λ_j, q_j) : j = 1, . . . , k} of S_k selected from eig(S). Also, define the diagonal matrix Λ_rest and matrix Q_rest to hold the remaining eigen-pairs {(λ_i, q_i) : i = k+1, . . . , N} of eig(S), where N = ∏_{l=1}^n I_l. According to Appendix A Lemma 3, we have

λ_j = ∏_{i=1}^n λ_j^(i)  and  q_j = ⊗_{i=1}^n q_j^(i),  ∀j = 1, . . . , k,

where λ_j^(i) and q_j^(i) is an eigen-pair of S^(i) contributing to λ_j and q_j of S_k. This implies that the eigen-pairs of S_k are composed of properly selected eigen-pairs from each S^(i), for i = 1, . . . , n.

Proposition 1. Define A = (I − αS)^{-1} and its approximation Â = (I − αS_k)^{-1}. According to the Woodbury formula [25], we have

Â = Q_{1:k}((I − αΛ_{1:k})^{-1} − I)Q_{1:k}^T + I.    (8)

Theorem 1. The optimal k eigenvalues {λ_j : j = 1, . . . , k} that solve the optimization problem (7) are among the union of the k largest (algebraic) and k smallest (algebraic) eigenvalues of S and satisfy the following condition:

α|λ_j| / (1 − αλ_j) ≥ α|λ_i| / (1 − αλ_i),  ∀j ∈ [1, k], ∀i ∈ [k+1, N].

Proof. Given Equation (8), the perturbation can be obtained as

Â − A = −Q_rest(I − (I − αΛ_rest)^{-1})Q_rest^T,

whose singular values are {α|λ_i| / (1 − αλ_i) : i = k+1, . . . , N} and k zeros. Thus, its spectral norm and Frobenius norm are

||Â − A||_2 = α|λ*| / (1 − αλ*)  and  ||Â − A||_F = sqrt( Σ_{i=k+1}^N (α|λ_i| / (1 − αλ_i))^2 ),    (9)

where λ* = argmax_{λ ∈ {λ_{k+1}, ..., λ_N}} α|λ| / (1 − αλ). To minimize both norms in Equation (9), the k selected eigenvalues {λ_j : j = 1, . . . , k} should produce the largest elements in the set {α|λ_j| / (1 − αλ_j) : j = 1, . . . , k} among all the eigenvalues of S. In addition, since α ∈ (0, 1) and λ_j ∈ [−1, 1] for j = 1, . . . , k, the function α|λ_j| / (1 − αλ_j) is monotonically increasing in the positive orthant and decreasing in the negative orthant with λ_j. Thus, {λ_j : j = 1, . . . , k} must be in the union of the k largest (algebraic) eigenvalues and the k smallest (algebraic) eigenvalues of S. (End of Proof)

Theorem 2. Define the function top_bot_2k(x) = top_k(x) ∪ bot_k(x), where top_k(x) and bot_k(x) return the k algebraically largest and smallest values of the vector x respectively. Given the vector λ^(i) of the eigenvalues of S^(i) for i = 1, . . . , n, we have

top_bot_2k(⊗_{i=1}^n λ^(i)) = top_bot_2k(λ^(n) ⊗ top_bot_2k(Γ^(n-1))),  where
Γ^(i) = λ^(i) ⊗ top_bot_2k(Γ^(i-1)) for i = 2, . . . , n−1, and Γ^(1) = λ^(1).

Proof. Theorem 2 can be proven by induction based on the observation that the k algebraically largest (smallest) elements in the outer product of two real vectors can only be among the multiplications between the union of the k largest and smallest values in the two vectors. Thus, only the numbers in top_bot_2k(Γ^(i-1)) are needed to compute the next Γ^(i) in the recursion. Taking the elements in top_bot_2k(Γ^(i-1)) in the multiplication with each λ^(i) guarantees that the numbers needed for computing the k largest (smallest) elements in ⊗_{i=1}^n λ^(i) will be kept in Γ^(i). The details of the proof are given in Appendix C.1. (End of Proof)

Algorithm 1 Select Eigenvalues
1: Input: {S^(i) : i = 1, . . . , n}, α ∈ (0, 1).
2: Output: λ_select and {Q^(i)_select : i = 1, . . . , n}.
3: Compute and store the eigenvalues and eigenvectors of S^(i) in the vector λ^(i) and the matrix Q^(i) respectively, for i = 1, . . . , n.
4: Γ ← λ^(1)
5: for i = 2 to n do
6:   Γ ← λ^(i) ⊗ top_bot_2k(Γ)
7: end for
8: λ_select ← k elements from Γ with the largest α|Γ_j| / (1 − αΓ_j), j = 1, . . . , k
9: for i = n to 1 do
10:   return Q^(i)_select from Q^(i) by looking up the indexes of the values output by function top_bot_2k().
11: end for

According to Theorem 1, the selected eigenvalues {λ_j : j = 1, . . . , k} from S satisfying α|λ_j|/(1 − αλ_j) ≥ α|λ_i|/(1 − αλ_i), ∀j ∈ [1, k], ∀i ∈ [k+1, N], must be contained in the union of the k largest and k smallest eigenvalues of S. Thus, we only need to find top_bot_2k(⊗_{i=1}^n λ^(i)) with Theorem 2, and select the k eigenvalues that give the largest elements in the set {α|λ_j|/(1 − αλ_j) : j = 1, . . . , k}. Based on this idea, we propose Algorithm 1 to select the eigen-pairs {(λ_j^(i), q_j^(i)) : i = 1, . . . , n, j = 1, . . . , k} efficiently in time O(Σ_{i=1}^n kI_i log(kI_i)), plus the time for the eigen-decomposition of each knowledge graph. Algorithm 1 starts with λ^(1), the eigenvalues of the first graph (line 4), and iteratively merges another λ^(i) one at a time in the for-loop between lines 5-7 to compute top_bot_2k(⊗_{l=1}^i λ^(l)) in Γ. Each merge step computes and sorts O(kI_i) numbers. Algorithm 1 outputs a vector λ_select of the selected eigenvalues from S and matrices Q^(i)_select of the selected eigenvectors from Q^(i), for i = 1, . . . , n, such that

λ_select = [λ_1, λ_2, . . . , λ_k]^T
Q^(i)_select = [q_1^(i), q_2^(i), . . . , q_k^(i)],  ∀i = 1, . . . , n.

Define M = (I − αΛ_{1:k})^{-1} − I, which is computed from λ_select as

M = diag( αλ_1/(1 − αλ_1), αλ_2/(1 − αλ_2), . . . , αλ_k/(1 − αλ_k) ).

The matrix Q_{1:k} can be computed from {Q^(i)_select : i = 1, . . . , n} as Q_{1:k} = ⊙_{i=1}^n Q^(i)_select.

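The following NumPy sketch illustrates the eigenvalue-selection recursion of Algorithm 1 (the top_bot_2k merging and the Theorem-1 scoring). It is an illustration we add under simplifying assumptions, not the paper's reference implementation: index bookkeeping is handled by carrying tuples of per-graph eigenvalue indexes alongside the products, and all function names are hypothetical.

```python
import numpy as np

def top_bot_2k(vals, idx, k):
    """Keep the k algebraically largest and k smallest entries (with their index tuples)."""
    order = np.argsort(vals)
    keep = np.unique(np.concatenate([order[:k], order[-k:]]))
    return vals[keep], [idx[i] for i in keep]

def select_eigenpairs(S_list, alpha, k):
    """Return k selected TPG eigenvalues and the per-graph eigenvector indexes forming them."""
    eigs = [np.linalg.eigh(S)[0] for S in S_list]           # eigenvalues of each S^(i)
    vals, idx = eigs[0], [(j,) for j in range(len(eigs[0]))]
    for lam in eigs[1:]:
        vals, idx = top_bot_2k(vals, idx, k)                 # prune before merging the next graph
        new_vals = np.outer(lam, vals).ravel()               # all products lam_a * Gamma_b
        # each tuple lists the last-merged graph's index first
        idx = [(a,) + idx[b] for a in range(len(lam)) for b in range(len(vals))]
        vals = new_vals
    score = alpha * np.abs(vals) / (1 - alpha * vals)        # criterion from Theorem 1
    top = np.argsort(score)[-k:]
    return vals[top], [idx[i] for i in top]
```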
By Equation (8), the closed-form solution in Equation (5) can be approximated as

vec(Ŷ*) = (1 − α) Â vec(Y^0)    (10)
        = (1 − α)(⊙_{i=1}^n Q^(i)_select) M (⊙_{i=1}^n Q^(i)_select)^T vec(Y^0) + (1 − α) vec(Y^0).    (11)

Algorithm 2 LowrankTLP
1: Input: {S^(i) : i = 1, . . . , n}, Y^0, α, k and Ω.
2: Output: Sparse tensor O.
3: Apply Algorithm 1 to obtain λ_select and {Q^(i)_select : i = 1, . . . , n}.
4: Initialize v to be a k-D vector with all zeros.
5: if Y^0 is sparse then
6:   for j = 1 to k do
7:     v_j ← Y^0 ×̄_1 q_j^(n) ×̄_2 q_j^(n-1) . . . ×̄_n q_j^(1)
8:   end for
9: else if Y^0 is in CP-form [[F^(n), F^(n-1), . . . , F^(1)]] then
10:   Ψ ← Q^(1)T_select F^(1)
11:   for j = 2 to n do
12:     Ψ ← Ψ ~ (Q^(j)T_select F^(j))
13:   end for
14:   v ← Ψ1
15: end if
16: m ← αλ_select / (1 − αλ_select)
17: v̂ ← v ~ m
18: Initialize Ŷ* = {} to be an empty tensor
19: for every tuple (i_1, i_2, . . . , i_n) in Ω do
20:   O_{i_n, i_{n-1}, ..., i_1} ← (1 − α)( Σ_{j=1}^k v̂_j ∏_{l=1}^n [Q^(l)_select]_{i_l, j} + Y^0_{i_n, i_{n-1}, ..., i_1} )
21: end for

4.3 LowrankTLP algorithm

Equation (11) implies a two-step tensor computation of the closed-form solution, given in Algorithm 2. The two steps are also illustrated in Figure 2.

Compression step (lines 4-15 in Algorithm 2):

• Y^0 is sparse in hyperlink prediction (Task 1) (lines 4-8): the role of (⊙_{i=1}^n Q^(i)_select)^T vec(Y^0) in Equation (11) is to compress the original tensor Y^0 to a k-D vector v with its j-th element

v_j = Y^0 ×̄_1 q_j^(n) ×̄_2 q_j^(n-1) . . . ×̄_n q_j^(1),    (12)

where each ×̄_i denotes the mode-i vector product of a tensor. In Equation (12), the original tensor Y^0 is compressed to a scalar by multiplying with n vectors, which is similar to computing the core tensor in Tucker decomposition. Denoting the number of nonzeros in Y^0 as |Y^0|, the time complexity of the compression step is O(|Y^0|nk).

⋆ Parallel implementation: The construction of v via Equation (12) performs k sequences of n-way tensor-vector products as

Z ← Y^0_(1) (Q^(2)_select ⊙ · · · ⊙ Q^(n)_select),    (13)
v_j ← q_j^(1)T z_j,  ∀j = 1, . . . , k,

where Y^0_(1) denotes the matrix flattened from Y^0. The kernel in Equation (13) is similar to the matricized tensor times Khatri-Rao product (MTTKRP) [26] involving n−1 products during the computation of the CP decomposition. Therefore, we can leverage parallel algorithms developed to compute the CP decomposition. We adopt SPLATT [27], a C library with shared-memory parallelism for fast MTTKRP computation. Parallelized over p threads, the parallel implementation reduces the complexity to O(|Y^0|nk/p).

• Y^0 is in CP-form in multiple graph alignment (Task 2) (lines 9-15): when the initial tensor Y^0 is in the CP-form [[F^(n), F^(n-1), . . . , F^(1)]], the k-D vector v can be obtained by

v = (⊙_{i=1}^n Q^(i)_select)^T (⊙_{i=1}^n F^(i)) 1    (14)
  = ~_{i=1}^n (Q^(i)T_select F^(i)) 1,    (15)

where Equation (14) is obtained by the vectorization property of the CP-form (Appendix A Definition 1) and Equation (15) can be derived from Appendix A Lemma 2. Since each Q^(i)T_select F^(i) takes O(krI_i) (recall r is the rank of Y^0 in CP-form), the time complexity of the compression step becomes only O(kr Σ_{i=1}^n I_i).

Expansion (prediction) step (lines 18-21 in Algorithm 2): After obtaining the k-D vector v, which is then multiplied by the diagonal matrix M to obtain another k-D vector v̂, the second step is to compute

vec(Ŷ*) = (1 − α)( (⊙_{i=1}^n Q^(i)_select) v̂ + vec(Y^0) ).    (16)

The left term of (16) has the same form as the vectorized CP decomposition with factor matrices Q^(i)_select ∈ R^{I_i × k} for i = 1, . . . , n (Appendix A Definition 1). Thus, the tensorized form can be obtained as

Ŷ* = (1 − α)( [[v̂′; Q^(n)_select, Q^(n-1)_select, . . . , Q^(1)_select]] + Y^0 ),    (17)

where v̂′ is a reversal of the elements in v̂. According to Equation (17), the k-D vector v̂ and the matrices {Q^(i)_select : i = 1, . . . , n}, together with Y^0, store all the information for computing any entry of Ŷ* with a time complexity of O(nk). Suppose the query set Ω (denoting either Ω_h or Ω_g defined in Section 2) has cardinality |Ω|; the total time complexity for predicting the queried n-way relations in a sparse tensor O is then O(nk|Ω|).

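The following NumPy sketch illustrates the two steps of Algorithm 2 for a sparse Y^0: compress the labeled entries into the k-D vector v (Equation (12)), scale by the diagonal of M, and score queried tuples entry by entry (Equation (17)). It is a simplified illustration, not the paper's MATLAB/SPLATT implementation: the eigen-pairs are selected by naively enumerating all eigenvalue products, which replaces Algorithm 1 and is only feasible for tiny graphs, and all names are hypothetical.

```python
import numpy as np

def lowrank_tlp_predict(S_list, y0_entries, queries, alpha, k):
    """Sketch of Algorithm 2 for a sparse Y^0 given as {(i_1,...,i_n): label}."""
    eig = [np.linalg.eigh(S) for S in S_list]                 # (lam^(i), Q^(i)) per graph
    # naive eigen-pair selection: enumerate all products (tiny graphs only)
    grids = np.meshgrid(*[lam for lam, _ in eig], indexing="ij")
    prod = np.ones_like(grids[0])
    for g in grids:
        prod = prod * g
    flat = prod.ravel()
    score = alpha * np.abs(flat) / (1 - alpha * flat)         # Theorem 1 criterion
    sel = np.argsort(score)[-k:]
    sel_idx = np.unravel_index(sel, prod.shape)               # per-graph eigenvector indexes
    lam_sel = flat[sel]
    m = alpha * lam_sel / (1 - alpha * lam_sel)               # diagonal of M

    # compression step (Eq. 12): v_j = sum over nonzeros of Y^0 of prod_l q_j^(l)[i_l]
    v = np.zeros(k)
    for tup, label in y0_entries.items():
        for j in range(k):
            w = np.prod([eig[l][1][tup[l], sel_idx[l][j]] for l in range(len(S_list))])
            v[j] += label * w
    v_hat = v * m

    # expansion step (Eq. 17): score each queried tuple in O(nk)
    out = {}
    for tup in queries:
        base = y0_entries.get(tup, 0.0)
        s = sum(v_hat[j] * np.prod([eig[l][1][tup[l], sel_idx[l][j]]
                                    for l in range(len(S_list))]) for j in range(k))
        out[tup] = (1 - alpha) * (s + base)
    return out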
4.4 Time and space complexity

The time complexity of selecting the optimal eigen-pairs using Algorithm 1 is O(Σ_{i=1}^n (I_i^3 + kI_i log(kI_i))). Therefore, the overall complexity of computing the compressed representation in Algorithm 2 (lines 1-17) is O(Σ_{i=1}^n (I_i^3 + kI_i log(kI_i)) + |Y^0|nk) for sparse initialization, with |Y^0| denoting the number of nonzeros in Y^0, and O(Σ_{i=1}^n (I_i^3 + kI_i log(kI_i) + krI_i)) for CP-form initialization. The time complexity of the prediction step in Algorithm 2 (lines 18-21) is O(nk|Ω|) for both sparse and CP-form initialization. Table 1 compares the time complexity of LowrankTLP with the other existing methods described in Sections 5 and 6.1. Note that, compared with LowrankTLP, ApproxLink [7] has a slightly lower compression complexity when n is small; however, when n is big, the term ∏_{i=1}^n k_i in ApproxLink becomes a bottleneck in the computation.

TABLE 1: Comparison of time complexity. The time complexities of LowrankTLP, ApproxLink [7] and GraphCP/GraphCP-W [28] are shown. Each method is also annotated by its applicability to hyperlink prediction (Task 1) or multiple graph alignment (Task 2).

                              Compression step                                                    Prediction step
LowrankTLP (Task 1)           O(Σ_{i=1}^n (I_i^3 + kI_i log(kI_i)) + |Y^0|nk)                     O(nk|Ω|)
LowrankTLP (Task 2)           O(Σ_{i=1}^n (I_i^3 + kI_i log(kI_i) + krI_i))                       O(nk|Ω|)
ApproxLink (Task 1)           O(Σ_{i=1}^n (I_i k_i^2 + k_i^3) + |Y^0| n (∏_{i=1}^n k_i))          O(n(∏_i k_i)|Ω|)
GraphCP/GraphCP-W (Task 1)    O(iters · n(|Y^0|nr + r^2 Σ_{i=1}^n I_i + r Σ_{i=1}^n I_i^2))       O(nr|Ω|)
GraphCP (Task 2)              O(iters · n(r^2 Σ_{i=1}^n I_i + r Σ_{i=1}^n I_i^2))                 O(nr|Ω|)

For GraphCP and GraphCP-W [28], we assumed a first-order method is applied to minimize the objective functions. Note that the overall complexity of GraphCP and GraphCP-W is the number of iterations multiplied by the per-iteration complexity (computing the gradient) in the compression step, while the empirical runtime also depends on the optimization method, line-search type, initialization, stopping condition, etc.

The space required to store the eigenvectors of all the normalized graphs is O(Σ_{i=1}^n I_i^2); to store the indexes of the selected eigen-pairs is O(k); and to store the initial tensor is O(|Y^0|) and O(Σ_{i=1}^n I_i r) for sparse and CP-form initial tensors respectively. Thus, the overall space complexity is O(|Y^0| + Σ_{i=1}^n I_i^2 + k) for sparse initialization and O(Σ_{i=1}^n I_i^2 + Σ_{i=1}^n I_i r + k) for CP-form initialization.

4.5 Error analysis of the LowrankTLP algorithm

In this section, we first present an estimation error bound of LowrankTLP for multiple graph alignment, assuming the initial tensor Y^0 is fully observed with Gaussian noise. The analysis provides a theoretical justification of the proposed optimization framework in Equation (7) in Section 4.1. Next, we use the transductive Rademacher complexity [19] to derive a data-dependent error bound of LowrankTLP for binary hyperlink prediction.

4.5.1 Estimation error bound of TPG-structured tensor recovery from noisy observation

Denote the transformation matrix (1 − α)(I − αS)^{-1} as P. We assume that the noisy tensor Y^0, approximated by the pairwise relations as described in Section 3.2, is generated from the true TPG-structured tensor Y^true and a noise tensor Z ∈ R^{I_n × I_{n-1} × ... × I_1} as

vec(Y^0) = P^{-1} vec(Y^true) + vec(Z),

where the entries of Z are drawn from the i.i.d. Gaussian distribution N(0, σ^2).

Theorem 3. Let P̂ = (1 − α)(I − αS_k)^{-1}, where S_k is defined in Section 4.1 with eig(S_k) selected from eig(S) by Algorithm 1. The inferred tensor Ŷ*, found by the LowrankTLP algorithm as vec(Ŷ*) = P̂ vec(Y^0) in Equation (10), has the following bounded recovery error with respect to the true tensor Y^true:

E_Z[ ||Ŷ* − Y^true||_F ] ≤ (1 − α)( (α|λ*| / (1 − αλ*)) ||Y^0||_F + σ sqrt( Σ_{i=1}^N 1/(1 − αλ_i)^2 ) ),    (18)

where the λ_i's are the eigenvalues of S, λ* is defined in Section 4.2 Equation (9), and ||·||_F denotes the Frobenius norm of a tensor.

Proof.

E_Z[ ||Ŷ* − Y^true||_F ] = E_Z[ ||(P̂ − P)vec(Y^0) + P vec(Z)||_2 ]
  ≤ ||(P̂ − P)vec(Y^0)||_2 + E_Z[ ||P vec(Z)||_2 ]    (19)
  ≤ ||(P̂ − P)vec(Y^0)||_2 + sqrt( tr( E_Z[vec(Z)vec(Z)^T] P^T P ) )    (20)
  = ||(P̂ − P)vec(Y^0)||_2 + σ ||P||_F
  ≤ ||P̂ − P||_2 ||vec(Y^0)||_2 + σ ||P||_F
  = (1 − α)( (α|λ*| / (1 − αλ*)) ||Y^0||_F + σ sqrt( Σ_{i=1}^N 1/(1 − αλ_i)^2 ) ),

where inequalities (19) and (20) are obtained with Minkowski's and Jensen's inequalities respectively. (End of Proof)

It is important to note that, since all λ_i ∈ [−1, 1] are constants, the second term on the right of inequality (18) is upper bounded by O(√N); the upper bound of the expected estimation error in inequality (18) can be minimized by properly choosing λ* to minimize α|λ*| / (1 − αλ*), which is the same as minimizing the optimization objective in (7) in Section 4.1 by Theorem 1. Thus, Theorem 3 also provides a theoretical justification of the proposed optimization formulation.

As discussed so far, we proposed to approximate the transformation matrix P by properly selecting k eigen-pairs of S to minimize the estimation error bound in Theorem 3. Note that, instead of using our approximation, another natural alternative is to directly find the best rank-k approximation to P. Proposition 2 below shows that our approximation strategy is a better solution (the proof is given in Appendix C.2).

Proposition 2. Follow the definitions of A and Â in Proposition 1, where eig(S_k) is selected from eig(S) by Algorithm 1. Define A_k as the best rank-k approximation to A in both spectral and Frobenius norm. Assuming k < ∏_i I_i, we have the following inequalities:

||Â − A||_2 < ||A_k − A||_2  and  ||Â − A||_F < ||A_k − A||_F.

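As a numerical illustration of Proposition 1 and Proposition 2, the following sketch (added by us; the graphs, α and k are hypothetical) builds Â from the selected eigen-pairs via the Woodbury form in Equation (8) and prints its error against A next to the error of the best rank-k approximation A_k on a tiny TPG.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
S_list = []
for I in (3, 4, 3):
    W = rng.random((I, I)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    d = W.sum(1); S_list.append(W / np.sqrt(np.outer(d, d)))

S = reduce(np.kron, S_list)
alpha, k = 0.9, 8
N = S.shape[0]
A = np.linalg.inv(np.eye(N) - alpha * S)

# A_hat from the k eigen-pairs maximizing alpha|lambda|/(1 - alpha*lambda), per Theorem 1
lam, Q = np.linalg.eigh(S)
sel = np.argsort(alpha * np.abs(lam) / (1 - alpha * lam))[-k:]
M = np.diag(1.0 / (1 - alpha * lam[sel]) - 1.0)          # (I - alpha*Lambda_{1:k})^{-1} - I
A_hat = Q[:, sel] @ M @ Q[:, sel].T + np.eye(N)           # Equation (8)

# best rank-k approximation of A (Eckart-Young), as compared in Proposition 2
U, s, Vt = np.linalg.svd(A)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

print(np.linalg.norm(A_hat - A), "vs", np.linalg.norm(A_k - A))   # Frobenius errors
```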
4.5.2 Transductive Rademacher bound for binary hyperlink prediction

Define Θ = Θ_h ∪ Θ̄_h = {(i_1, i_2, . . . , i_n) : i_j ∈ [1, I_j], ∀j = 1, . . . , n} as the set of all n-way relations among the vertices across the n knowledge graphs, where Θ̄_h denotes the complement of Θ_h. Define the tensor Y^true ∈ {+1, −1}^{I_n × I_{n-1} × ... × I_1}, which stores the true labels of all the hyperlinks in set Θ, where the label of the (i_1, i_2, . . . , i_n)-th hyperlink is either 1 (the link exists) or -1 (the link does not exist). Accordingly, Y^0 contains a subset of known entries (hyperlinks) sampled from Y^true and zeros for the other unknown entries. Define Y_out ⊂ R^{I_n × I_{n-1} × ... × I_1} as the set of tensors output by the LowrankTLP algorithm over all possible Θ_h / Θ̄_h partitions, such that for every Ŷ* ∈ Y_out we have vec(Ŷ*) = P̂ vec(Y^0) (as in Theorem 3). In the following derivations, we assume Y^0 is normalized by ||Y^0||_F so that its Frobenius norm is unit. This normalization is proper since it does not change the signs in Ŷ*. In Theorem 4, we provide a data-dependent error bound of LowrankTLP for binary hyperlink prediction, using the transductive Rademacher complexity proposed in [19].

Theorem 4. Denote l = |Θ_h| and u = |Θ̄_h|, the cardinalities of Θ_h and Θ̄_h respectively. Let c_0 = sqrt(32 ln(4e) / 3), Q = 1/l + 1/u and G = (l+u) / ((l+u−0.5)(1−0.5/max(l,u))). For any fixed positive real γ, with probability at least 1 − δ over the random choice of the set Θ_h, for all Ŷ* ∈ Y_out,

L_u^γ(Ŷ*) ≤ L̂_l^γ(Ŷ*) + (||P̂||_F / γ) sqrt(2/(lu)) + c_0 Q sqrt(min(l, u)) + sqrt((GQ/2) ln(1/δ)),    (21)

where L_u^γ(Ŷ*) and L̂_l^γ(Ŷ*) are the γ-margin test and empirical errors respectively, defined as

L_u^γ(Ŷ*) = (1/u) Σ_{(i_1,i_2,...,i_n) ∈ Θ̄_h} ℓ_γ( Ŷ*_{i_n,i_{n-1},...,i_1}, Y^true_{i_n,i_{n-1},...,i_1} )
L̂_l^γ(Ŷ*) = (1/l) Σ_{(i_1,i_2,...,i_n) ∈ Θ_h} ℓ_γ( Ŷ*_{i_n,i_{n-1},...,i_1}, Y^true_{i_n,i_{n-1},...,i_1} ),

with ℓ_γ(a, b) = 0 if ab > γ and ℓ_γ(a, b) = min(1, 1 − ab/γ) otherwise.

Proof. The bound (21) in Theorem 4 is based on the transductive Rademacher bound using the transductive Rademacher complexity [19] given in Appendix B Definition 3. According to Appendix B Theorem 5, we only need to bound the Rademacher complexity R_{l+u}(Y_out) as below:

R_{l+u}(Y_out) = (1/l + 1/u) E_σ[ sup_{Ŷ* ∈ Y_out} σ^T vec(Ŷ*) ]
  ≤ (1/l + 1/u) E_σ[ sup_{Y^0 : ||Y^0||_F = 1} σ^T P̂ vec(Y^0) ]
  = (1/l + 1/u) E_σ[ ||P̂ σ||_2 ]    (22)
  ≤ (1/l + 1/u) sqrt( tr( E_σ[σσ^T] P̂^T P̂ ) )    (23)
  = sqrt(2/(lu)) ||P̂||_F,

where (22) and (23) are obtained using the Cauchy-Schwarz and Jensen inequalities respectively. (End of Proof)

Given that the eigenvalues of P̂ are bounded within [(1−α)/(1+α), 1], it is clear that ||P̂||_F ≤ sqrt(l + u). Assuming l + u → ∞ and l ≪ u, the error bound (21) can be simplified to L_u^γ(Ŷ*) ≤ L̂_l^γ(Ŷ*) + O(sqrt(1/l)), which has a slower convergence rate compared with the estimation error bound of the convex tensor completion model [29] under certain conditions. It is also important to note that when l is very small, i.e. the labeled n-way relations are extremely sparse, the term sqrt(1/l) increases relatively slowly as l decreases. Thus, the O(sqrt(1/l)) bound implies that, empirically, the performance of LowrankTLP might deteriorate less with very sparse input tensors, which is consistent with our observations in both the simulations and the experiments on real datasets shown later in Section 6.

5 RELATED WORK

Two categories of tensor-based techniques have been previously applied to the problem of multi-relational learning with multiple knowledge graphs.

The first category of methods also leverages the same idea of semi-supervised manifold learning on the tensor product graph (TPG) for predicting hyperlinks across the graphs. Given a set of labeled n-way tuples of graph vertices, these methods aim at labeling/scoring the unlabeled n-way tuples by learning with the manifold structure in the TPG. [6] proposed a semi-supervised link propagation method using conjugate gradient descent to predict the hyperlinks in the multi-relational tensor. The link propagation method is only empirically scalable to three-way tensors due to the necessity of computing the full tensor in every iteration. Alternatively, several methods were proposed to apply low-rank approximation on each individual graph, rather than working with the original TPG, for better scalability to the tensor product of two or three large graphs. For example, approximate link propagation [7] applies label propagation on the product of two low-rank knowledge graphs for pairwise link prediction, and [30] makes a similar low-rank assumption on a single graph for pairwise link prediction. Transductive learning over product graph (TOP) [11] is a graph-based one-class transductive learning algorithm, in which the Gaussian random fields prior proposed in [31] is approximated as a regularization term with the product of multiple low-rank knowledge graphs, to overcome the bottleneck of evaluating the prior. In general, these methods are not applicable to a large number of knowledge graphs: first, low-rank approximation of each individual graph does not guarantee a globally optimal approximation of the TPG, as stated in Appendix A Lemma 4; and second, the rank of the TPG is exponential in the number of graphs, and therefore the approximation is not scalable to many graphs. Consequently, these approximation methods do not preserve the performance of learning with the original TPG and still suffer scalability issues when learning from a large number of knowledge graphs.

The second category comprises tensor decomposition methods regularized by graph Laplacians, which decompose a noisy complete tensor or an incomplete sparse tensor into low-rank factor matrices to estimate the true tensor. For example, [28] proposed two types of graph Laplacian regularization for both CP and Tucker decomposition. The first type is called within-mode regularization, which extends the graph-regularized matrix factorization methods to encourage the components in each factor matrix to be smooth among strongly connected vertices in its corresponding graph. The second type is called cross-mode regularization, which regularizes all the factor matrices jointly with the graph Laplacian of the TPG. For joint analysis of data from multiple sources including tensors, matrices and knowledge graphs, [32] introduced a coupled matrix-tensor factorization (CMTF) model with within-mode graph Laplacian regularization for collaborative filtering. Though these tensor decomposition models can be solved by any scalable first-order method based on the idea of all-at-once optimization [33], [34], they potentially lead to poor local minima, especially for high-order tensors, due to their non-convex formulations. Moreover, it has been observed that the accuracy of tensor completion tends to degrade severely when only a small fraction of the multi-relations is observed [28].

6 EXPERIMENTS

In the experiments, the performance of LowrankTLP for hyperlink prediction (Task 1) and multiple graph alignment (Task 2) was evaluated in simulation and on three real datasets. In Section 6.1, we first explain the baseline methods implemented for performance comparisons. In Section 6.2, we evaluate the effectiveness and efficiency of LowrankTLP for hyperlink prediction (Task 1) on the simulation data, by controlling the size and topology of multiple artificial graphs. In Section 6.3, we test the practical application of LowrankTLP for hyperlink prediction (Task 1) on the DBLP dataset of scientific publication records. In Sections 6.4 and 6.5, we evaluate the practical application of LowrankTLP to multiple graph alignment (Task 2). We first apply LowrankTLP to align up to 26 CT scan images in Section 6.4. Next, we evaluate the performance of LowrankTLP for the global alignment of up to 4 full protein-protein interaction (PPI) networks across 4 different species in Section 6.5. For better clarity, we also summarize the input/output of all the experiments in Appendix D Table 4.

6.1 Baseline methods and implementations

6.1.1 Baseline methods

We compared LowrankTLP with seven baseline methods in the simulations and the experiments, based on their applicability to hyperlink prediction and multiple graph alignment.

• Approximate link propagation (ApproxLink) [7]: ApproxLink was originally designed for pairwise link prediction in a matrix. We extended its operations for hyperlink prediction in a tensor. Given an incomplete initial tensor Y^0 ∈ R^{I_n × I_{n-1} × ... × I_1}, with zeros representing the missing entries, and n knowledge graphs {W^(i) : i = 1, . . . , n}, ApproxLink can score the queried entries in the tensor.

• Transductive learning over product graph (TOP) [11]: TOP is designed for one-class classification. Given a binary incomplete initial tensor Y^0 ∈ R^{I_n × I_{n-1} × ... × I_1}, with ones and zeros representing the observed positive entries and missing entries respectively, and n knowledge graphs {W^(i) : i = 1, . . . , n}, TOP can detect the queried positive entries in the tensor.

• CANDECOMP/PARAFAC decomposition (CP): An initial tensor Y^0 ∈ R^{I_n × I_{n-1} × ... × I_1}, which is either noisily complete or incomplete with zeros representing the missing entries, is decomposed into n factor matrices by solving a least-squares problem. The factor matrices are then used to construct the queried entries in the tensor.

• CP using weighted optimization (CP-W) [33]: CP-W is a variation of CP where the differences are that the initial tensor Y^0 is incomplete, and the n factor matrices are found by solving a weighted least-squares problem.

• Graph-regularized CP (GraphCP) [28]: Given n knowledge graphs {W^(i) : i = 1, . . . , n}, an initial tensor Y^0 ∈ R^{I_n × I_{n-1} × ... × I_1}, which is either noisily complete or incomplete with zeros representing the missing entries, is decomposed into n factor matrices by solving a least-squares problem with cross-mode regularization using the TPG. The factor matrices are then used to construct the queried entries in the tensor.

• Graph-regularized CP using weighted optimization (GraphCP-W) [28]: Similarly, GraphCP-W is a variation of GraphCP where the differences are that the initial tensor Y^0 is incomplete, and the n factor matrices are again learned by solving a weighted least-squares problem with cross-mode regularization using the TPG.

• Spectral method for multiple PPI network alignment (IsoRankN) [35]: Given n PPI networks {W^(i) : i = 1, . . . , n} of n species, and BLAST sequence similarities {R_ij ∈ R_+^{I_i × I_j} : i, j ∈ [1, n] and i < j} between proteins from each pair of species, IsoRankN finds a global alignment of the n PPI networks based on spectral clustering on the induced graph of pairwise alignment scores.

• Backbone extraction and merge strategy for multiple PPI network alignment (BEAMS) [36]: Given n PPI networks {W^(i) : i = 1, . . . , n} of n species, and BLAST sequence similarities {R_ij ∈ R_+^{I_i × I_j} : i, j ∈ [1, n] and i < j} between proteins from each pair of species, BEAMS finds a global alignment of the n PPI networks by solving a combinatorial optimization problem with a heuristic approach.

6.1.2 Implementation details

The graph regularization hyperparameter α defined in Section 3.3 was chosen from the pool {0.001, 0.1, 0.9, 0.99} for LowrankTLP, ApproxLink and TOP; a rank-⌈k^{1/n}⌉ approximation was applied to each individual graph for ApproxLink and TOP to guarantee that the approximated TPG has a rank equal to or larger than k.

For better scalability, we adopted the first-order method ADAM [37], based on the all-at-once optimization [33], [34], to minimize the objective functions of CP, CP-W, GraphCP and GraphCP-W. The factor matrices were randomly initialized; the stopping criterion was ||∇f(x^t)||_2 ≤ 10^-3 ||∇f(x^0)||_2, where x^t denotes the stack of all the vectorized factor matrices in the t-th iteration; the maximum number of iterations was set to 1000. Note that, for GraphCP and GraphCP-W, the gradient scale of the cross-mode regularization term increases much faster than the gradient scale of the decomposition term as the tensor order (the number of graphs) increases. Therefore, when the tensor order is high, the graph hyperparameter α as defined in [28] tends to be set very small. Unless otherwise stated, we chose α from {10^-5, 10^-4, 10^-3, 10^-2, 10^-1} as suggested in [28] and r (tensor rank) from {10, 50, 100} for all CP-based methods in Task 1; for Task 2 the CP rank r is equal to the rank of the initial CP-form tensor and is chosen by PCA to cover at least 90% of the spectral energy in the stacking matrix R defined in Section 3.2.

The baselines IsoRankN and BEAMS were developed specifically for PPI network alignment. We downloaded and ran the original packages [1], [2] to obtain the alignment scores, with the graph hyperparameter α selected from {0.1, 0.3, 0.5, 0.7, 0.9} as suggested in their packages.

1. IsoRankN package: http://cb.csail.mit.edu/cb/mna/.
2. BEAMS package: http://webprs.khas.edu.tr/~cesim/BEAMS.tar.gz.

All the experiments were performed on our server with an Intel(R) Xeon(R) CPU E5-2450 with 32 cores (2.10GHz) in 2 CPUs and 196GB of RAM. All the baseline methods except IsoRankN and BEAMS were implemented using MATLAB R2018b.

6.2 Simulations

Synthetic graphs were generated to evaluate the performance and the scalability. We started with a graph of density 0.1 and size I to generate n distinct graphs by randomly permuting 10% of the edges of the common "ancestor" graph, so that the n graphs share similar structures that can be utilized for matching the multi-relations. The inputs are the n graphs and a sparse n-way tensor Y^0 ∈ R^{I × ... × I} with I/2 (half) of its diagonal entries set to 1s. We set the other I/2 diagonal entries and I/2 randomly sampled off-diagonal entries to 0.9 and treated them as positive and negative test samples, respectively. The outputs are the scores of the I test entries after label propagation, which can be used to distinguish the positive and negative classes, based on the assumption that the vertices indexed by the diagonal entries of the tensor should have high similarities since they come from the same "ancestor" graph, and this information should be captured by the TPG.

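The following NumPy sketch illustrates this simulated-data generation (ancestor graph, 10% edge rewiring per derived graph, and a sparse diagonal initial tensor). It is an illustration we add under simplifying assumptions: the rewiring scheme, sizes, and variable names are hypothetical, not the exact generator used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(4)

def ancestor_graph(I, density=0.1):
    """Random symmetric 'ancestor' adjacency with roughly the given edge density."""
    A = (rng.random((I, I)) < density).astype(float)
    A = np.triu(A, 1); A = A + A.T
    return A

def permute_edges(A, frac=0.1):
    """Rewire a fraction of the edges to create one derived graph (illustrative scheme)."""
    B = A.copy()
    edges = np.transpose(np.triu_indices_from(B, 1))
    present = [tuple(e) for e in edges if B[e[0], e[1]] > 0]
    n_move = int(frac * len(present))
    for (i, j) in rng.permutation(present)[:n_move]:
        B[i, j] = B[j, i] = 0
        a, b = rng.integers(0, B.shape[0], 2)
        if a != b:
            B[a, b] = B[b, a] = 1
    return B

I, n = 100, 5
A0 = ancestor_graph(I)
graphs = [permute_edges(A0) for _ in range(n)]

# sparse n-way initial tensor: half of the diagonal entries labeled 1
diag = rng.permutation(I)
train, test_pos = diag[: I // 2], diag[I // 2 :]
Y0_entries = {tuple([i] * n): 1.0 for i in train}
test_neg = [tuple(rng.integers(0, I, n)) for _ in range(I // 2)]   # random off-diagonal tuples
```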
TABLE 2: Effectiveness comparison in simulations.

               5 nets              10 nets
               AUC      MAP        AUC      MAP
LowrankTLP     0.990    0.991      0.942    0.952
ApproxLink     0.610    0.679      0.554    0.672
GraphCP        0.540    0.539      0.536    0.533
GraphCP-W      0.543    0.535      0.533    0.550
CP             0.549    0.528      0.520    0.527
CP-W           0.522    0.486      0.472    0.496

Fig. 3. Simulation results. (A) Effectiveness comparisons by varying TPG ranks. (B) Efficiency and scalability comparisons. The curves are truncated if the method is not scalable to the larger sizes.

• Effectiveness: We compare LowrankTLP with ApproxLink, CP, CP-W, GraphCP and GraphCP-W using the same sparse tensor Y^0 as input. For fair comparisons, we fixed α = 0.1 for LowrankTLP and used the best hyperparameters for all the baseline methods. The area under the curve (AUC) and mean average precision (MAP) are the evaluation metrics. Each experiment was repeated five times and the average performances are reported. Table 2 shows that LowrankTLP clearly outperforms all the baselines in learning multi-relations among 5 and 10 graphs. The prediction of the CP-based methods is almost random, which is not surprising given that the tensor Y^0 is extremely sparse; this also agrees with previous observations that the accuracy of tensor decomposition degrades severely when only a small fraction of entries is observed [28], [38]. Note that ApproxLink performs better than the CP-based methods, which implies that label propagation is a more robust approach for sparse inputs than tensor decomposition for hyperlink prediction. Figure 3(A) shows that LowrankTLP outperforms ApproxLink at different TPG ranks, which validates the advantage of our optimization formulation (7). It also shows that LowrankTLP requires only a moderate rank k ≥ 10,000 when n = 5, and achieves a high performance with k ≥ 100,000 when n = 50, whereas ApproxLink is not applicable to such a large number of graphs.

• Efficiency and scalability: We further compared the running time of the MATLAB implementation of LowrankTLP using Tensor Toolbox [39] version 2.6 and the parallel implementation using the SPLATT library [14] (described in Section 4.3) with the baseline methods applicable to the knowledge graphs. We chose a small tensor rank r = 10 for GraphCP and GraphCP-W. The TPG rank k = 5n was chosen for LowrankTLP, which achieves AUC ≈ 0.9 empirically. In Figure 3(B), we observe that the parallel LowrankTLP results in a speedup of about one order of magnitude compared to the MATLAB version. The parallel implementation of LowrankTLP improved the running time to 10^3 s, compared with 10^4 s by the sequential implementation, to align 100 graphs of size 1,000 each. ApproxLink has a running time similar to sequential LowrankTLP on 3 and 10 graphs, while it is not applicable to more graphs due to the exponential growth of the number of components, as discussed in Section 4.4. The empirical running times of GraphCP/GraphCP-W are worse than LowrankTLP, even though the theoretical time complexity for computing the gradients in each iteration is low, as analyzed in Table 1.

6.3 Predicting multi-relations in scientific publications

We downloaded the DBLP dataset of scientific publication records from AMiner (Extraction and Mining of Academic Social Networks) [40]. We built three graphs: Author × Author (W^(1)), Paper × Paper (W^(2)) and Venue × Venue (W^(3)). In W^(1), the edge weight is the count of papers that the two authors have co-authored; in W^(2), the edge weight is the number of times the two papers were cited by another paper; and in W^(3), the edge weight is the Jaccard similarity between the vectors of the two venues, whose dimensions are a bag of citations. After filtering the vertices with zero and low degrees in each graph, we finally obtained W^(1) with 13,823 vertices and 266,222 edges; W^(2) with 11,372 vertices and 4,309,772 edges; and W^(3) with 10,167 vertices and 46,557,116 edges, similar to the dataset used in [11]. We also built 12,066 triples in the form (Paper, Author, Venue) as positive multi-relations, given by the natural relationship that a paper is written by an author and published in a specific venue. These triples were stored in the sparse initial tensor Y^0 as input, whose dimensions match the graph sizes.

We first performed 5-fold cross-validation, with 3 folds of training triples, 1 fold of validation triples to select the best hyperparameters and 1 fold of test triples for all the methods, using the 12,066 positive triples together with the same number of randomly sampled negative triples. Figure 4(A) shows the performance comparisons with standard deviations on all the 5 test folds. We observed that LowrankTLP clearly outperforms the baselines in every fold, and the methods utilizing the graph information outperform CP and CP-W, which do not use graph information. We also randomly sampled 0.1%, 10%, 50% and 90% of the positive triples as training data, to test on the rest of the positive triples together with the same number of randomly sampled negative triples. The random samplings were repeated 5 times for each percentage. Using the optimal hyperparameters chosen in the previous 5-fold cross-validation, we compared the performance of all the methods on the various percentages of training/test data. Figures 4(B)&(C) show that LowrankTLP consistently outperforms all the baselines at every training percentage. Remarkably, both LowrankTLP and ApproxLink achieve average AUC ≈ 0.76 and MAP ≈ 0.8 when there are only 0.1% of training data; LowrankTLP, ApproxLink and TOP are more robust to sparse input than the CP-based methods GraphCP and GraphCP-W, which also use TPG information; and the methods utilizing the knowledge graphs perform consistently better than CP and CP-W, which do not, demonstrating that the associations among the tensor entries carried by the manifolds in the TPG are useful for enhancing the prediction performance.

Fig. 4. DBLP results. (A) The performance of 5-fold cross-validation. The average and standard deviation across the 5 folds are shown. (B) & (C) The performance of using various percentages of training data. The average and standard deviation are shown for each percentage across different random samplings.
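To make the venue-graph construction described above concrete, the following short sketch (our illustration; the venue names and citation bags are hypothetical, and the bags are treated as sets for simplicity) computes Jaccard similarities between venues from their citations.

```python
import numpy as np

def jaccard(bag_a, bag_b):
    """Jaccard similarity between two citation bags (treated as sets here)."""
    a, b = set(bag_a), set(bag_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# hypothetical venue -> list of cited items; in the real dataset these come from DBLP records
venue_citations = {
    "V1": ["p1", "p2", "p3"],
    "V2": ["p2", "p3", "p4"],
    "V3": ["p5"],
}
venues = sorted(venue_citations)
W3 = np.zeros((len(venues), len(venues)))
for i, vi in enumerate(venues):
    for j, vj in enumerate(venues):
        if i != j:
            W3[i, j] = jaccard(venue_citations[vi], venue_citations[vj])
```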
The scans were acquired on with 11,372 vertices and 4,309,772 edges; and W (3) with a Philips Brilliance Big Bore CT Scanner, and each image has 10,167 vertices and 46,557,116 edges, similar to the dataset 512 512 pixels with a slice thickness of 3mm. We used a used in [11]. We also built 12,066 triples in the form (Paper, subset× of 26 images which contain the same set of four seg- Author, Venue) as positive multi-relations, given by the mented regions manually annotated by a radiologist. When natural relationship that a paper is written by an author, working with CT scan images, a radiologist is interested and published in a specific venue. These triples were stored in matching the segmented regions across the images. We in the sparse initial tensor 0 as input, whose dimensions represent this situation by aligning a set of sampled spots match with the graph sizes.Y across the images to detect if they belong to the same type of We first performed 5-fold cross-validation with 3-fold segmented region. To construct a graph for each CT image, training triples, 1-fold validation triples to select the best we first sampled from each segmented region in each image hyperparameters and 1-fold test triples for all the methods, a number (proportional to the region size) of spots. Then, 12 LowrankTLP k = 10 k = 102 k = 103 k = 104 CP-form Y0 GraphCP 4 images 0.59 0.91 0.91 0.91 0.61 0.73 5 images 0.67 0.80 0.89 0.89 0.66 0.75 6 images 0.78 0.78 0.84 0.84 0.69 0.78 7 images 0.75 0.73 0.80 0.83 0.72 0.76 TABLE 3 Performance of CT image alignment. The accuracy is measured as the % of accurately aligned spots in the first image to the correct spots in the rest of the images.

The initial tensor Y_0 was then generated in CP-form using these cross-image spot similarity matrices. For example, to align 7 images, $\binom{7}{2} = 21$ similarity matrices were generated. In this setting, the number of spots can be different across the images; therefore, it is possible that one spot in an image is matched to more than one spot in another image after the alignment. The set of query tuples was selected such that the color density similarity between each pair of spots in a tuple is above a threshold. The alignment accuracy was measured by the top-1 match of each spot. Specifically, for each spot in the first graph, we took the $(n-1)$-way sub-tensor associated with that entry in the first dimension and found the entry with the highest score in the sub-tensor. We then checked whether the aligned spots from all the other images in that maximum entry were of the same type of segmented region as the spot in the first dimension.

Table 3 shows the comparison of LowrankTLP and GraphCP, both using the CP-form tensor Y_0 as their initialization. With k ≥ 100, LowrankTLP achieves much higher accuracy than GraphCP in almost all the cases. It is also interesting that with k = 10,000, LowrankTLP is able to align 7 images with an accuracy of 0.83, which means 83% of the spots in the first graph are perfectly matched with a spot from the same type of segmented region in all the other 6 images. An example of 6 aligned images is shown in Figure 5. It is clear that the aligned spots are consistent across the images.

TABLE 3
Performance of CT image alignment. The accuracy is measured as the % of accurately aligned spots in the first image to the correct spots in the rest of the images.

          | LowrankTLP k=10 | k=10^2 | k=10^3 | k=10^4 | CP-form Y_0 | GraphCP
4 images  | 0.59            | 0.91   | 0.91   | 0.91   | 0.61        | 0.73
5 images  | 0.67            | 0.80   | 0.89   | 0.89   | 0.66        | 0.75
6 images  | 0.78            | 0.78   | 0.84   | 0.84   | 0.69        | 0.78
7 images  | 0.75            | 0.73   | 0.80   | 0.83   | 0.72        | 0.76

Fig. 5. Example of aligning 6 CT scan images. Each type of segmented region in the images is represented by a different color in the alignment. The links connect all the pairs of the spots in a 6-tuple, one from each image, to represent one alignment.
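The top-1 evaluation above reduces to an argmax over each $(n-1)$-way sub-tensor. A minimal sketch is given below, assuming the scores are available as a dense NumPy tensor (feasible only for toy sizes) and that region_labels[i] holds the region type of each spot in image i; the names are illustrative.

    import numpy as np

    def top1_alignment_accuracy(Y, region_labels):
        """Fraction of spots in the first image whose best-scoring alignment
        points to spots of the same region type in all other images."""
        n_correct = 0
        for s in range(Y.shape[0]):
            sub = Y[s]                                    # (n-1)-way sub-tensor
            idx = np.unravel_index(np.argmax(sub), sub.shape)
            target = region_labels[0][s]
            ok = all(region_labels[d + 1][idx[d]] == target
                     for d in range(len(idx)))
            n_correct += int(ok)
        return n_correct / Y.shape[0]

    # toy usage with random scores and labels (hypothetical data)
    rng = np.random.default_rng(0)
    labels = [rng.integers(0, 4, size=m) for m in (6, 5, 7)]
    Y = rng.random((6, 5, 7))
    print(top1_alignment_accuracy(Y, labels))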

More importantly, to further measure the scalability of LowrankTLP on a larger number of real graphs, we performed an additional evaluation by aligning 10 and 26 CT scan images. Since it is not computationally feasible to enumerate every entry of the 10-way tensor and the 26-way tensor, we generated a list of candidate n-tuples for the performance evaluation. Given n images to be aligned, we randomly sampled a list of n-tuples of spots. The n-tuples were then grouped by their homogeneity score h, where h is defined as the maximum number of spots in the n-tuple that are from the same type of segmented region. For example, if there are 3, 4, 10 and 9 spots in a 26-tuple from the 4 types of regions, respectively, the homogeneity score of this 26-tuple will be max(3, 4, 10, 9) = 10. Based on the homogeneity score, we generated sampling groups of varying h to evaluate the alignment of n = 10 and n = 26 images. We expect that the sampling groups of larger h also receive higher prediction scores on the n-tuples in the groups. GraphCP is the only baseline that is both applicable and scalable in this experiment for comparison.

Fig. 6. Results of aligning 10 and 26 CT scan images. The average and standard deviation of the prediction scores of the 10^4 n-tuples in each sampling group, ordered by the homogeneity score h, are shown. (A) Alignment of 10 CT scans (LowrankTLP with α = 0.999; GraphCP with α = 1e-08). (B) Alignment of 26 CT scans (LowrankTLP with α = 0.999; GraphCP with α = 1e-08).
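The homogeneity score of an n-tuple is simply the largest multiplicity of any region type among its spots; a minimal sketch:

    from collections import Counter

    def homogeneity_score(tuple_region_types):
        """h = largest number of spots in the n-tuple sharing one region type."""
        return max(Counter(tuple_region_types).values())

    # the 26-tuple example from the text: 3, 4, 10 and 9 spots from the four regions
    example = [0] * 3 + [1] * 4 + [2] * 10 + [3] * 9
    assert homogeneity_score(example) == 10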

The average and standard deviation of the prediction scores for each sampling group are shown in Figure 6. In the alignment of 10 images, shown in Figure 6(A), we observe that LowrankTLP generates a much larger average score for h = 10 compared with the sampling groups with h < 10, and the average score decreases consistently and monotonically as h decreases. GraphCP is also able to identify the group of h = 10, but the variance is large and a flatter tail is observed after h = 6. In the alignment of 26 images, shown in Figure 6(B), GraphCP completely fails to distinguish the most significant group h = 26 from the other groups, whereas LowrankTLP maintains the same clear decreasing trend as h decreases. This comparison implies that LowrankTLP is more applicable to high-order TPGs of a large number of graphs in real-world problems. As discussed in the implementation details in Section 7.1, the graph hyperparameter α of GraphCP was set to be very small when the number n of graphs is large. In Appendix D, Figure 8, we also provide more comprehensive comparisons of LowrankTLP and GraphCP by varying the graph hyperparameter α; very similar results are observed.

6.5 Alignment of PPI Networks

We downloaded the IsoBase dataset [4], [35], [41], containing the protein-protein interaction (PPI) networks of five species: H. sapiens (HS), D. melanogaster (DM), S. cerevisiae (SC), C. elegans (CE) and M. musculus (MM). The M. musculus network only contains 776 interactions and is dropped from the analysis. After removing proteins with no association in the PPI networks, there are 10,403, 7,396, 5,524 and 2,995 proteins and 109,822, 49,991, 165,588 and 9,711 interactions in the HS, DM, SC and CE PPI networks, respectively. The dataset also contains cross-species protein sequence similarities as BLAST Bit-values for all the pairs of species. Similar to the CT scan experiment, we generated the input tensor Y_0 in CP-form, whose dimensions match the number of proteins in the corresponding species, by using the pairwise BLAST sequence similarity scores. In addition, annotations of the proteins with 37,463 gene ontology (GO) terms below level five of GO are also provided for evaluation. We generated a set of query tuples of proteins with high sequence similarity between all the protein pairs in the tuple. These tuples can then be classified as true multi-relations if all the annotated proteins in the tuple share at least one common GO term, and otherwise as false multi-relations. The experiments were performed using three species (HS, DM and SC) and four species by adding CE. Around 3M tuples were generated among the three species and about 163M among the four species.

Similar to the post-processing in the evaluation in [36], after applying LowrankTLP to generate the prediction scores for all the query tuples, the tuples were sorted for a greedy merge into protein clusters for the standard evaluation of PPI network alignment. A cluster of size n is defined as a set of proteins with at least one protein from each of the n species. The greedy merge scans the tuples and adds a tuple that contains only proteins not seen yet as a new cluster; otherwise, the proteins that are already in some other clusters are removed from the tuple, and the remaining proteins are added as a smaller cluster. In the evaluation, the specificity is defined as the ratio between the number of consistent clusters and the number of annotated clusters, where an annotated cluster is a cluster in which at least two proteins are associated with at least one GO term, and a consistent cluster is one in which all of its annotated proteins share at least one GO term.

In the left plot in Figure 7(A), for both clusters of size 2 and size 3, LowrankTLP performs better than both BEAMS and IsorankN in the alignment of the three networks. The left plot in Figure 7(B) shows that LowrankTLP performed similarly to or slightly worse than BEAMS for every cluster size in the alignment of the four networks; IsorankN is not able to detect any cluster of size 4.

To further compare LowrankTLP with BEAMS, we analyzed the detailed ranking of the annotated clusters with at least one protein from each species reported by BEAMS. Specifically, we enumerated all the tuples containing one protein from each species from each cluster and then applied LowrankTLP to calculate the scores of all the tuples in the output tensor. We re-ranked these tuples by the scores and annotated them as consistent or inconsistent multi-relations by GO annotations. The AUC by their rankings is shown in the right plots in Figure 7. In both the three-network alignment and the four-network alignment, LowrankTLP ranks the consistent multi-relations above the inconsistent multi-relations with AUC larger than 0.5. Notice that, since we only check the very top of the predictions (those predicted as true multi-relations), the AUC is less than 0.5 for the BEAMS results.

Fig. 7. Results of PPI network alignment. In both (A) and (B), the figure on the left shows the specificity of the detected clusters containing different numbers of species, and the figure on the right shows the AUC curves between consistent and inconsistent query entries by prediction among the clusters reported by BEAMS. (A) Alignment of HS/DM/SC networks: LowrankTLP (AUC: 0.65), BEAMS (AUC: 0.47). (B) Alignment of HS/DM/SC/CE networks: LowrankTLP (AUC: 0.54), BEAMS (AUC: 0.35).
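The greedy merge described above admits a direct implementation. The following is a minimal sketch; the protein identifiers are hypothetical and the tuples are assumed to be pre-sorted by prediction score.

    def greedy_merge(sorted_tuples):
        """Greedily merge score-sorted protein tuples into disjoint clusters."""
        seen, clusters = set(), []
        for tup in sorted_tuples:
            remaining = [p for p in tup if p not in seen]
            if remaining:                 # whole tuple, or what is left of it
                clusters.append(set(remaining))
                seen.update(remaining)
        return clusters

    # toy example with hypothetical protein IDs, highest-scoring tuple first
    tuples = [("hs1", "dm2", "sc3"), ("hs1", "dm4", "sc5"), ("hs6", "dm4", "sc7")]
    print(greedy_merge(tuples))
    # three clusters: {hs1, dm2, sc3}, then {dm4, sc5} and {hs6, sc7} after
    # removing the proteins already assigned to earlier clusters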
7 CONCLUSION

In this study, we introduced a new algorithm, LowrankTLP, to improve the scalability and performance of label propagation on tensor product graphs for multi-relational learning. The theoretical analysis shows that the globally optimal solution minimizes an estimation error bound for recovering the true tensor from the noisy initial tensor for multiple graph alignment, and provides a data-dependent transductive Rademacher bound for binary hyperlink prediction. In the experiments, we demonstrated that LowrankTLP well approximates label propagation on the normalized tensor product graph to achieve both better scalability and better performance. We also demonstrated that LowrankTLP, capable of taking either a sparse tensor or a CP-form tensor as input, is a flexible approach that meets the requirements of multi-relational learning problems in a wide range of applications. In all the experiments, we also observed that it does not require a huge rank to achieve good prediction performance even though the size of a tensor product graph is exponential in the size of the individual graphs. This observation supports that the direct and efficient analysis of the entire spectrum of the tensor product graph is a better approach. In the future, we will analyze the spectrum of the tensor product graphs to develop an automatic strategy of choosing the rank k, for more efficient application of LowrankTLP to multi-relational learning problems.
REFERENCES

[1] M. Szummer and T. Jaakkola, "Partially labeled classification with markov random walks," in Advances in Neural Information Processing Systems, 2002, pp. 945–952.
[2] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Carnegie Mellon University, Tech. Rep., 2002.
[3] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, 2004, pp. 321–328.
[4] R. Singh, J. Xu, and B. Berger, "Global alignment of multiple protein interaction networks with application to functional orthology detection," Proceedings of the National Academy of Sciences, vol. 105, no. 35, pp. 12763–12768, 2008.
[5] M. Xie, T. Hwang, and R. Kuang, "Prioritizing disease genes by bi-random walk," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2012, pp. 292–303.
[6] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda, "Link propagation: A fast semi-supervised learning algorithm for link prediction," in Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM, 2009, pp. 1100–1111.
[7] R. Raymond and H. Kashima, "Fast and scalable algorithms for semi-supervised link prediction on static and dynamic graphs," Machine Learning and Knowledge Discovery in Databases, pp. 131–147, 2010.
[8] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce, "A tensor-based algorithm for high-order graph matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2383–2395, 2011.
[9] X. Yang, L. Prasad, and L. J. Latecki, "Affinity learning with diffusion on tensor product graph," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 28–38, 2013.
[10] H. Liu and Y. Yang, "Bipartite edge prediction via transductive learning over product graphs," in International Conference on Machine Learning, 2015, pp. 1880–1888.
[11] ——, "Cross-graph learning of multi-relational associations," in International Conference on Machine Learning, 2016, pp. 2235–2243.
[12] R. Xu, Y. Yang, H. Liu, and A. Hsi, "Cross-lingual text classification via model translation with limited dictionaries," in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016, pp. 95–104.
[13] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, "A review of relational machine learning for knowledge graphs," Proceedings of the IEEE, 2015.
[14] S. Smith and G. Karypis, "SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit," http://cs.umn.edu/~splatt/, 2016.
[15] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[16] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker graphs: An approach to modeling networks," Journal of Machine Learning Research, vol. 11, pp. 985–1042, 2010.
[17] V. Gligorijević, N. Malod-Dognin, and N. Pržulj, "Fuse: multiple network alignment via data fusion," Bioinformatics, vol. 32, no. 8, pp. 1195–1203, 2015.
[18] S. Hashemifar, Q. Huang, and J. Xu, "Joint alignment of multiple protein–protein interaction networks via convex optimization," Journal of Computational Biology, vol. 23, no. 11, pp. 903–911, 2016.
[19] R. El-Yaniv and D. Pechyony, "Transductive rademacher complexity and its applications," Journal of Artificial Intelligence Research, vol. 35, pp. 193–234, 2009.
[20] Q. Gu and J. Han, "Towards active learning on graphs: An error bound minimization approach," in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 882–887.
[21] C. Ding, X. He, and H. D. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," in Proceedings of the 2005 SIAM International Conference on Data Mining. SIAM, 2005, pp. 606–610.
[22] C. Chen, H. Tong, L. Xie, L. Ying, and Q. He, "Fascinate: Fast cross-layer dependency inference on multi-layered networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 765–774.
[23] A. Wang, H. Lim, S.-Y. Cheng, and L. Xie, "Antenna, a multi-rank, multi-layered recommender system for inferring reliable drug-gene-disease associations: Repurposing diazoxide as a targeted anti-cancer therapy," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 6, pp. 1960–1967, 2018.
[24] D. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[25] N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[26] B. W. Bader and T. G. Kolda, "Efficient matlab computations with sparse and factored tensors," SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205–231, 2007.
[27] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, "Splatt: Efficient and parallel sparse tensor-matrix multiplication," in 29th IEEE International Parallel & Distributed Processing Symposium, 2015.
[28] A. Narita, K. Hayashi, R. Tomioka, and H. Kashima, "Tensor factorization using auxiliary information," Data Mining and Knowledge Discovery, vol. 25, no. 2, pp. 298–324, 2012.
[29] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima, "Statistical performance of convex tensor decomposition," in Advances in Neural Information Processing Systems, 2011, pp. 972–980.
[30] D. M. Dunlavy, T. G. Kolda, and E. Acar, "Temporal link prediction using matrix and tensor factorizations," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, no. 2, p. 10, 2011.
[31] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using gaussian fields and harmonic functions," in International Conference on Machine Learning, 2003, pp. 912–919.
[32] V. W. Zheng, B. Cao, Y. Zheng, X. Xie, and Q. Yang, "Collaborative filtering meets mobile recommendation: A user-centered approach," in Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[33] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup, "Scalable tensor factorizations for incomplete data," Chemometrics and Intelligent Laboratory Systems, vol. 106, no. 1, pp. 41–56, 2011.
[34] E. Acar, T. G. Kolda, and D. M. Dunlavy, "All-at-once optimization for coupled matrix and tensor factorizations," in MLG'11: Proceedings of Mining and Learning with Graphs, August 2011. [Online]. Available: https://www.cs.purdue.edu/mlg2011/papers/paper_4.pdf
[35] C.-S. Liao, K. Lu, M. Baym, R. Singh, and B. Berger, "Isorankn: spectral methods for global alignment of multiple protein networks," Bioinformatics, vol. 25, no. 12, pp. i253–i258, 2009.
[36] F. Alkan and C. Erten, "Beams: backbone extraction and merge strategy for the global many-to-many alignment of multiple ppi networks," Bioinformatics, vol. 30, no. 4, pp. 531–539, 2013.
[37] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[38] R. Tomioka, K. Hayashi, and H. Kashima, "Estimation of low-rank tensors via convex optimization," arXiv preprint arXiv:1010.0789, 2010.
[39] B. W. Bader, T. G. Kolda et al., "Matlab tensor toolbox version 2.6," Available online, February 2015. [Online]. Available: http://www.sandia.gov/~tgkolda/TensorToolbox/
[40] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 990–998.
[41] D. Park, R. Singh, M. Baym, C.-S. Liao, and B. Berger, "Isobase: a database of functionally related proteins across ppi networks," Nucleic Acids Research, vol. 39, no. suppl_1, pp. D295–D300, 2010.
[42] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.

APPENDIX A
ADDITIONAL DEFINITIONS AND LEMMAS

Definition 1 (CANDECOMP/PARAFAC (CP) decomposition). An n-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_n}$ of rank $r$ can be written as
$$\mathcal{X} = \sum_{c=1}^{r} a_c^{(1)} \circ a_c^{(2)} \circ \cdots \circ a_c^{(n)} = [\![ A^{(1)}, A^{(2)}, \dots, A^{(n)} ]\!],$$
where $a_c^{(i)}$ is the $c$-th column of the factor matrix $A^{(i)} \in \mathbb{R}^{I_i \times r}$. Vectorization property: $\mathrm{vec}(\mathcal{X}) = (A^{(n)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})\mathbf{1}$, where $\mathbf{1}$ is a vector with all ones.

Definition 2 (Tucker decomposition). An n-way tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_n}$ can be decomposed into a core tensor $\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times \dots \times r_n}$ and factor matrices $\{A^{(i)} \in \mathbb{R}^{I_i \times r_i} : i = 1, \dots, n\}$ as
$$\mathcal{X} = \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_n A^{(n)} = [\![ \mathcal{G}; A^{(1)}, A^{(2)}, \dots, A^{(n)} ]\!].$$
Vectorization property: $\mathrm{vec}(\mathcal{X}) = (A^{(n)} \otimes A^{(n-1)} \otimes \cdots \otimes A^{(1)})\,\mathrm{vec}(\mathcal{G})$.

Lemma 1. If A, B, C and D are matrices of such size that one can form the matrix products AC and BD, then $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$.

Lemma 2. If matrices A, B, C and D are of such size that one can form the operations $(A \otimes B)$, $(C \otimes D)$, $(A^T C)$ and $(B^T D)$, then the equality $(A \otimes B)^T (C \otimes D) = (A^T C) \otimes (B^T D)$ holds.

Lemma 3. Let $\lambda_1, \dots, \lambda_n$ be the eigenvalues of A with corresponding eigenvectors $x_1, \dots, x_n$, and let $\mu_1, \dots, \mu_m$ be the eigenvalues of B with corresponding eigenvectors $y_1, \dots, y_m$. Then the eigenvalues and eigenvectors of $A \otimes B$ are $\lambda_i \mu_j$ and $x_i \otimes y_j$, $i = 1, \dots, n$, $j = 1, \dots, m$.

Lemma 4. Let the matrix $\tilde{W}^{(i)}$ denote the best rank-$k_i$ approximation to $W^{(i)}$ per the Eckart-Young-Mirsky theorem [42]. The matrix $\otimes_{i=1}^{n} \tilde{W}^{(i)}$ is not guaranteed to be the best rank-$\prod_{i=1}^{n} k_i$ approximation to $\otimes_{i=1}^{n} W^{(i)}$.
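The vectorization property in Definition 1 and the eigen-pair result in Lemma 3 can be verified numerically. The following NumPy sketch (the khatri_rao helper and the toy dimensions are our own illustrative choices, not from the released code) checks both on random factors.

    import numpy as np
    from functools import reduce

    def khatri_rao(mats):
        """Column-wise Kronecker product of matrices with equal column counts."""
        r = mats[0].shape[1]
        cols = [reduce(np.kron, [M[:, c] for M in mats]) for c in range(r)]
        return np.stack(cols, axis=1)

    rng = np.random.default_rng(0)
    dims, r = (4, 3, 5), 2
    A = [rng.standard_normal((d, r)) for d in dims]        # factor matrices A(1..3)

    # CP tensor X = sum_c a_c^(1) o a_c^(2) o a_c^(3)
    X = sum(np.einsum('i,j,k->ijk', A[0][:, c], A[1][:, c], A[2][:, c])
            for c in range(r))

    # Definition 1: vec(X) = (A(3) khatri-rao A(2) khatri-rao A(1)) 1, column-major vec
    v = khatri_rao([A[2], A[1], A[0]]) @ np.ones(r)
    assert np.allclose(X.reshape(-1, order='F'), v)

    # Lemma 3: eigenvalues of a Kronecker product are products of eigenvalues
    S1 = rng.standard_normal((3, 3)); S1 = (S1 + S1.T) / 2
    S2 = rng.standard_normal((4, 4)); S2 = (S2 + S2.T) / 2
    l1 = np.linalg.eigvalsh(S1)
    l2 = np.linalg.eigvalsh(S2)
    assert np.allclose(np.sort(np.linalg.eigvalsh(np.kron(S1, S2))),
                       np.sort(np.kron(l1, l2)))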
APPENDIX B
TRANSDUCTIVE RADEMACHER COMPLEXITY

The transductive Rademacher complexity and the data-dependent error bound for binary transductive learning proposed in [19] are given below in Definition 3 and Theorem 5.

Definition 3. Given a fixed set $\Phi_{l+u} = \{(x_i, y_i): i = 1, \dots, l+u\}$ of sample-label pairs drawn from an unknown distribution, w.l.o.g., the training set sampled uniformly without replacement from $\Phi_{l+u}$ is denoted as $\Phi_l = \{(x_i, y_i): i = 1, \dots, l\}$, and the test set is denoted as $X_u = \{x_i : i = l+1, \dots, l+u\}$. Define $\mathcal{H}_{out} \subseteq \mathbb{R}^{l+u}$ as a set of vectors $h = (h(x_1), \dots, h(x_{l+u}))^T$ output by a transductive algorithm using the sets $\Phi_l$ and $X_u$ over all possible training/test set partitions, such that $h(x_i)$ is the soft label of example $x_i$. The transductive Rademacher complexity is defined as
$$R_{l+u}(\mathcal{H}_{out}) = \Big(\frac{1}{l} + \frac{1}{u}\Big)\,\mathbb{E}_{\sigma}\Big[\sup_{h \in \mathcal{H}_{out}} \sigma^T h\Big],$$
where $\sigma = (\sigma_1, \dots, \sigma_{l+u})^T$ is a vector of i.i.d. random variables such that $\sigma_i = 1$ with probability $p$, $\sigma_i = -1$ with probability $p$, and $\sigma_i = 0$ with probability $1 - 2p$. We set $p = \frac{lu}{(l+u)^2}$ as in [19].

Theorem 5. Let $c_0 = \sqrt{\frac{32\ln(4e)}{3}}$, $Q = \frac{1}{l} + \frac{1}{u}$ and $G = \frac{l+u}{(l+u-0.5)(1-0.5/\max(l,u))}$. For any fixed positive real $\gamma$, with probability of at least $1-\delta$ over the random training/test set partitioning, $\forall h \in \mathcal{H}_{out}$,
$$\mathcal{L}_u^{\gamma}(h) \le \hat{\mathcal{L}}_l^{\gamma}(h) + \frac{R_{l+u}(\mathcal{H}_{out})}{\gamma} + c_0 Q \sqrt{\min(l,u)} + \sqrt{\frac{GQ}{2}\ln\Big(\frac{1}{\delta}\Big)},$$
where $\hat{\mathcal{L}}_l^{\gamma}(h) = \frac{1}{l}\sum_{i=1}^{l}\ell_{\gamma}(h(x_i), y_i)$ and $\mathcal{L}_u^{\gamma}(h) = \frac{1}{u}\sum_{i=l+1}^{l+u}\ell_{\gamma}(h(x_i), y_i)$ are the $\gamma$-margin empirical and test error respectively, with $\ell_{\gamma}(a, b) = 0$ if $ab > \gamma$ and $\ell_{\gamma}(a, b) = \min(1, 1 - \frac{ab}{\gamma})$ otherwise.

APPENDIX C
PROOFS

C.1 Proof of Theorem 1

Proof. We prove Theorem 1 by induction. When n = 2 we have
$$\begin{aligned}
\mathrm{top\_bot\_2k}(\otimes_{i=1}^{2}\lambda^{(i)}) &= \mathrm{top\_bot\_2k}(\lambda^{(2)} \otimes \lambda^{(1)}) \\
&= \mathrm{top\_bot\_2k}(\lambda^{(2)} \otimes \mathrm{top\_bot\_2k}(\lambda^{(1)})) \qquad (24) \\
&= \mathrm{top\_bot\_2k}(\lambda^{(2)} \otimes \mathrm{top\_bot\_2k}(\Gamma^{(1)})),
\end{aligned}$$
where Equation (24) is based on the observation that the $k$ largest (smallest) elements in the outer product can only involve at most $k$ different numbers from $\lambda^{(1)}$. Thus, taking the top-(bottom-)$k$ in $\lambda^{(1)}$ guarantees the $k$ largest (smallest) elements in the outer product will be kept. Suppose when $n = m > 2$ we have
$$\mathrm{top\_bot\_2k}(\otimes_{i=1}^{m}\lambda^{(i)}) = \mathrm{top\_bot\_2k}(\lambda^{(m)} \otimes \mathrm{top\_bot\_2k}(\Gamma^{(m-1)})) = \Gamma^{(m)}, \qquad (25)$$
then, when $n = m + 1$, the following equations hold:
$$\begin{aligned}
\mathrm{top\_bot\_2k}(\otimes_{i=1}^{m+1}\lambda^{(i)}) &= \mathrm{top\_bot\_2k}(\lambda^{(m+1)} \otimes (\otimes_{i=1}^{m}\lambda^{(i)})) \\
&= \mathrm{top\_bot\_2k}(\lambda^{(m+1)} \otimes \mathrm{top\_bot\_2k}(\otimes_{i=1}^{m}\lambda^{(i)})) \\
&= \mathrm{top\_bot\_2k}(\lambda^{(m+1)} \otimes \mathrm{top\_bot\_2k}(\Gamma^{(m)})).
\end{aligned}$$
(End of Proof)
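The sequential selection established by this proof can be checked against brute force on small spectra. In the sketch below, top_bot_2k is taken to return the k largest and k smallest entries of its argument; the code is an illustrative check, not the paper's implementation.

    import numpy as np
    from functools import reduce

    def top_bot_2k(v, k):
        """Keep the k smallest and k largest entries of v (sorted ascending)."""
        v = np.sort(v)
        return v if len(v) <= 2 * k else np.concatenate([v[:k], v[-k:]])

    def select_product_eigenvalues(eig_lists, k):
        """Gamma(m) = top_bot_2k(lambda(m) kron Gamma(m-1)), as in the proof."""
        gamma = top_bot_2k(np.asarray(eig_lists[0]), k)
        for lam in eig_lists[1:]:
            gamma = top_bot_2k(np.kron(np.asarray(lam), gamma), k)
        return gamma

    # sanity check against brute-force enumeration of all eigenvalue products
    rng = np.random.default_rng(1)
    eigs = [rng.uniform(-1, 1, size=6) for _ in range(3)]
    k = 4
    full = np.sort(reduce(np.kron, eigs))
    expected = np.concatenate([full[:k], full[-k:]])
    assert np.allclose(np.sort(select_product_eigenvalues(eigs, k)), expected)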

C.2 Proof of Proposition 2

Proof. Let $\sigma_1 > \sigma_2 > \cdots > \sigma_N$ be the sorted eigenvalues of matrix $S$. Since $\sigma_i \in [-1, 1]$ for $i = 1, \dots, N$ and $\alpha \in (0, 1)$, by the Eckart-Young-Mirsky theorem, the nonzero eigenvalues of $A_k$ are $\{\frac{1}{1-\alpha\sigma_i} : i = 1, \dots, k\}$ and the perturbations are given as
$$\|A_k - A\|_2 = \frac{1}{1-\alpha\sigma_{k+1}} \quad\text{and}\quad \|A_k - A\|_F = \sqrt{\sum_{i=k+1}^{N}\Big(\frac{1}{1-\alpha\sigma_i}\Big)^2}.$$

Using $\{\sigma_i : i = 1, \dots, k\}$ as eigenvalues and their corresponding eigenvectors of $S$ to construct a rank-$k$ matrix $L$, and defining $B = (I - \alpha L)^{-1}$, we have $\|\hat{A} - A\|_2 \le \|B - A\|_2$ and $\|\hat{A} - A\|_F \le \|B - A\|_F$ according to the definition of $S_k$ in Section 4.1. Thus, the inequalities in Proposition 2 hold if we can prove $\|B - A\|_2 < \|A_k - A\|_2$ and $\|B - A\|_F < \|A_k - A\|_F$. We first obtain the perturbations as
$$\|B - A\|_2 = \frac{\alpha|\sigma^*|}{1-\alpha\sigma^*} \quad\text{and}\quad \|B - A\|_F = \sqrt{\sum_{i=k+1}^{N}\Big(\frac{\alpha|\sigma_i|}{1-\alpha\sigma_i}\Big)^2},$$
where $\sigma^* = \arg\max_{\sigma \in \{\sigma_{k+1}, \dots, \sigma_N\}} \frac{\alpha|\sigma|}{1-\alpha\sigma}$. Now we need to show that the inequalities (26) and (27) are valid:
$$\|B - A\|_2 < \|A_k - A\|_2 \qquad (26)$$
$$\|B - A\|_F < \|A_k - A\|_F \qquad (27)$$
It is easy to prove Inequality (27) by the fact that $\alpha|\sigma_i| < 1$. To show Inequality (26), we have to consider three special cases: firstly, if $\sigma^* > 0$ and $\sigma_{k+1} \ge 0$, then we have $\sigma^* = \sigma_{k+1}$, and thus Inequality (26) holds by the fact that $\alpha|\sigma^*| < 1$; secondly, if $\sigma^* < 0$ and $\sigma_{k+1} \ge 0$, we have $\frac{\alpha|\sigma^*|}{1-\alpha\sigma^*} < 1$ and $\frac{1}{1-\alpha\sigma_{k+1}} \ge 1$, and thus Inequality (26) holds; finally, if $\sigma^* < 0$ and $\sigma_{k+1} < 0$, we have $|\sigma^*| \ge |\sigma_{k+1}|$ and $\frac{\alpha|\sigma^*|}{1-\alpha\sigma^*} < \frac{1}{1-\alpha\sigma^*} \le \frac{1}{1-\alpha\sigma_{k+1}}$, and thus Inequality (26) holds. Overall, we have shown
$$\|\hat{A} - A\|_2 \le \|B - A\|_2 < \|A_k - A\|_2 \quad\text{and}\quad \|\hat{A} - A\|_F \le \|B - A\|_F < \|A_k - A\|_F.$$
(End of Proof)
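These perturbation bounds can also be checked numerically. The sketch below uses a random symmetric S with eigenvalues in [-1, 1] standing in for a normalized graph; it is an illustrative check of the two inequalities, not part of the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(2)
    N, k, alpha = 50, 5, 0.9

    # random symmetric S with eigenvalues sigma_1 > ... > sigma_N in [-1, 1]
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
    sig = np.sort(rng.uniform(-1, 1, N))[::-1]
    S = (Q * sig) @ Q.T

    A = np.linalg.inv(np.eye(N) - alpha * S)

    # best rank-k approximation A_k of A (Eckart-Young-Mirsky)
    w, U = np.linalg.eigh(A)
    keep = np.argsort(np.abs(w))[::-1][:k]
    A_k = (U[:, keep] * w[keep]) @ U[:, keep].T

    # B = (I - alpha L)^{-1} with L built from the k largest eigen-pairs of S
    L = (Q[:, :k] * sig[:k]) @ Q[:, :k].T
    B = np.linalg.inv(np.eye(N) - alpha * L)

    for ord_ in (2, 'fro'):
        assert np.linalg.norm(B - A, ord_) < np.linalg.norm(A_k - A, ord_)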

APPENDIX D
SUPPLEMENTARY FILES

TABLE 4
Summary of datasets in the experiments.

Task 1: hyperlink prediction
Experiment | Input relations                                            | Query set                                       | Knowledge graph
Simulation | observed n-way relations                                   | held-out test n-way relations                   | n graphs generated by permuting a percentage of edges from a common random graph
DBLP       | sampled known (author, paper, venue)-relations             | held-out known (author, paper, venue)-relations | Author × Author, Paper × Paper and Venue × Venue graphs

Task 2: multiple graph alignment
Experiment | Input relations                                            | Query set                                       | Knowledge graph
CT scans   | RBF similarities across spots sampled from each pair of CT scan images | alignment scores of spots across multiple images | RBF similarities between spots sampled within each CT scan image
PPI        | BLAST sequence similarities between proteins from each pair of species | alignment scores of proteins across multiple species | protein-protein interaction (PPI) networks for different species

Fig. 8. CT scan image alignment by varying the graph hyperparameter α. The x-axis shows the sampling groups ordered by the value of h defined in Section 6.4; the y-axis shows the prediction scores. (A) Alignment of 10 CT scans: LowrankTLP with α = 0.001, 0.1 and 0.9; GraphCP with α = 1e-07, 1e-06 and 0.1. (B) Alignment of 26 CT scans: LowrankTLP with α = 0.001, 0.1 and 0.9; GraphCP with α = 1e-15, 1e-07 and 0.1.