Tensor Simrank for Heterogeneous Information Networks
Total Page:16
File Type:pdf, Size:1020Kb
Tensor SimRank for Heterogeneous Information Networks Ben Usman Ivan Oseledets Skolkovo Institute of Science and Technology, Skolkovo Institute of Science and Technology, Novaya St., 100, Skolkovo, 143025, Russian Novaya St., 100, Skolkovo, 143025, Russian Federation Federation Moscow Institute of Physics and Technology, Institute of Numerical Mathematics, Russian Institutskiy Lane 9, Dolgoprudny, Moscow, Academy of Sciences, Gubkina St., 8, Moscow, 141700, Russian Federation 119333 [email protected] [email protected] ABSTRACT A Type of: devised structured activity Instance of: candidate KB completeness node, clarifying We propose a generalization of SimRank similarity measure collection type, type of object, type of temporally stuff-like for heterogeneous information networks. Given the informa- thing tion network, the intraclass similarity score s(a; b) is high if Subtypes: board game, brand name game, card game, the set of objects that are related with a and the set of ob- child's game, coin-operated game, dice game, electronic jects that are related with b are pair-wise similar according game, fantasy sports, game for two or more people, game to all imposed relations. of chance, guessing game, memory game, non-competitive game, non-team game, outdoor game, party game, puzzle Categories and Subject Descriptors game, role-playing game, sports game, table game, trivia [Information systems]: Retrieval models and ranking{ game, word game Similarity measures Instances: ducks and drakes, ultimate frisbee, darts, pachinko, Crossword Puzzle Activity CW, pool, Snooker, General Terms mini golf SimRank, Probabilistic SVD, Tensor, Low-rank approxima- Figure 1: OpenCyc ontology node of concept tion "Game" 1. INTRODUCTION Most data in the modern world can be treated as an infor- that would reflect the closeness of objects based on "similar- mation network, thus network node similarity measuring has ity of relations" they enter, and at the same time not mix- wide range of applications: search [1], recommendation sys- ing different relations as soon as "objects of different types tems [2], research publication networks analysis [3], biology and links carry different semantic meanings, and it does not [4], transportation and logistics [5] and others. make sense to mix them to measure the similarity without Consider a semantic network: set of types T , each type distinguishing their semantics" [6]. t 2 T is a set of entities; set of relations R, each relation is 2-order predicate defined on two types from T : 1.1 Related work R 3 rtp : t × p 7! f1; 0g; t; p 2 T ; The basic graph structure similarity measure is the clas- both types in relation can be equal (rtt : t × t ! f0; 1g), few sical SimRank [7] over a homogeneous graph G = (V; E) (1) (2) which is defined as follows: relations can share the same pair of types (9rtp 6= rtp 2 t×p arXiv:1502.06818v1 [cs.AI] 24 Feb 2015 f0; 1g ). That structure may be considered as a graph with colored vertices and colored edges: vertex color is its NG(a) = fv 2 V :(v; a) 2 E(G)g; entity type, edge color correponds to a relation. The question that we address is how to define similarity C X functions s(a; b) = s(w; v): I(a)I(b) v2N(a) st : t × t ! R; 8t 2 T ; w2N(b) The main drawback of this approach is that we cannot in- duce multiple relations or object types, so the only option Permission to make digital or hard copies of all or part of this work for is mixing them up into blobs "relation exists" and "all ob- personal or classroom use is granted without fee provided that copies are jects" that is completely not applicable in the case we have not made or distributed for profit or commercial advantage and that copies multiple relations with different semantics, for example the bear this notice and the full citation on the first page. To copy otherwise, to OpenCyc ontology node of the concept "Game" (see Figure republish, to post on servers or to redistribute to lists, requires prior specific 1) cannot be easily expressed via a single type of relations permission and/or a fee. 21st ACM SIGKDD’15 Sydney and objects. Copyright 2014 ACM X-XXXXX-XX-X/XX/XX ...$15.00. Personalized PageRank [8] is also often used to measure similarity in homogeneous graphs: Similarity scores between elements of different classes are equal to zero by the definition. Relation between objects of X πa(w) πa(b) = εδa(b) + (1 − ") ; unrelated classes is equal to zero by definition too. Equa- αw;v (w;b)2E tion (1) is basically the classical SimRank equation with the adjacency tensor instead of the adjacency matrix: each non- that it same as PageRank, except random jumps are made zero layer of tensor encodes some relation on the same pair into some pre-chosen node b, rather then into random node. of types. If one has more than a single relation between Another option is PathRank [6] that measures path-similarity types p; t 2 T , then r would have multiple non-zero layers between objects a; b picked from the same class A of the on the intersection of indices associated with the classes t; p heterogeneous information network N given a symmetric | one adjacency matrix per layer. In (1) the index γ stands meta-path P (set of paths that satisfy composition of re- for (weighted) summation over all layers of the tensor. That M1 M2 Mn lations M1;M2;:::;Mn that A −−! C1 −−! C2 ::: −−! A, can be equivalently rewritten explicitly: M1◦M2◦···◦Mn so A −−−−−−−−−−! A) as a number of paths from the ob- X T ject a to the object b (each step i must satisfy corresponding S = wγ Wγ SWγ + D; (2) γ relation Mi in P) normed over the number of paths from a to a plus the number of paths from b to b given P: where the diagonal matrix D has to be chosen in a such way p that diag(S) = I. kfp 2 P : a =) bgk sP (a; b) = p p kfp 2 P : a =) agk + kfp 2 P : b =) bgk 2.2 Computational algorithm That approach can handle several relations and object types Simple iterations for (1) are computationally demanding and is very useful when we know the structure of relations due to large-scale matrix-by-matrix products, thus we pro- we want our similarity measure to be based on. In case we pose a a method that exploits the fact that s is block diago- want just to "put our relations into a black box" that would nal and r is a three-dimensional block tensor with size of the find similarity that would capture all network relations as a last dimension (number of layers) much less then the overall whole, we might want to use something different. Recently, amount of objects. On each iteration k for each r 2 R we an approach [9] for building an optimal linear combination recompute si updates independently (assuming all other sj of meta-paths has been proposed. fixed), see Algorithm 1. There are several works on measuring similarity between objects from different classes, see, for example, [10]. Algorithm 1: Idea under Tensor SimRank Data: T - classes, R - relations 2. TENSOR SIMRANK Result: S = fst(a; b)gt2T repeat 2.1 Problem statement for st 2 S do Let us consider a function st(a; b) that assigns similarity assume all S n st fixed score for two objects from the same class t as follows: objects for r 2 R : rtp : t × p 7! f1; 0g do a; b 2 t 2 T are similar (value st(a; b) is high) if they relate for (a; b) 2 t do to objects which are similar too. That interdependence can for (c; d) 2 p do be expressed via the following definition: next st (a; b) += rtp(a; c)sp(c; d)rtp(b; d) += rtp(a; c)sp(c; d)rpt(d; b) Nrtp (a) = fb 2 pjrtp(a; b) = 1g; 1 X X end st(a; b) = w(rtp) sp(c; d); Z end rtp2R c2Np(a) d2Np(b) end end where rtp is the relation between classes t; p 2 T , Nrtp is the next update all st st neighbourhood function that returns set of objects from the P next until kst − st k < δ; class p that are related to the object a via the relation rtp, t w(rtp) are the weights corresponding to the relation rtp, Z is the normalization constant. So we just update the similarity score for each class as- This can be rewritten as a Tensor SimRank equation: suming all other classes similarities are fixed in a way that X the objects from the target class (t) that are related to ob- sαβ = wαβγ rαβγ sαβ rβαγ ; jects from some other class (c; d) 2 p that are close (s (c; d) γ (1) p is high) become closer too (st(a; b) "). s = diag(fstgt2T ); sαα = 1; To show actual vectorized algorithm of similarity compu- where s is a block-diagonal matrix (one block per each entity tation, let us introduce some additional notations: set of entity types T = ft gN , each entity type t is a set of en- type), w are the relation weights, rαβγ are the stochastic re- i i=0 1 (j) L lation tensors (which have non-zero blocks where relations tities, set of symmetric relation functions R = frtp gj=0 exist). where r(j) : t × p ! f0; 1g; t; p 2 T , j is the order; column- tj pj 1We have to use tensors instead of matrices to have multiple stochastic matrix of pairwise types impacts (weights) w 2 N×N (j) ktk×kpk relations on the same pair of classes R ; operator W : rtp ! R that maps relation into corresponding column-stochastic adjacency matrix.