The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Adversarial Permutation Guided Node Representations for Link Prediction

Indradyumna Roy, Abir De, Soumen Chakrabarti Indian Institute of Technology Bombay {indraroy15, abir, soumen}@cse.iitb.ac.in

Abstract

After observing a snapshot of a social network, a link prediction (LP) algorithm identifies node pairs between which new edges will likely materialize in future. Most LP algorithms estimate a score for currently non-neighboring node pairs, and rank them by this score. Recent LP systems compute this score by comparing dense, low dimensional vector representations of nodes. Graph neural networks (GNNs), in particular graph convolutional networks (GCNs), are popular examples. For two nodes to be meaningfully compared, their embeddings should be indifferent to reordering of their neighbors. GNNs typically use simple, symmetric set aggregators to ensure this property, but this design decision has been shown to produce representations with limited expressive power. Sequence encoders are more expressive, but are permutation sensitive by design. Recent efforts to overcome this dilemma turn out to be unsatisfactory for LP tasks. In response, we propose PERMGNN, which aggregates neighbor features using a recurrent, order-sensitive aggregator and directly minimizes an LP loss while it is 'attacked' by an adversarial generator of neighbor permutations. PERMGNN has superior expressive power compared to earlier GNNs. Next, we devise an optimization framework to map PERMGNN's node embeddings to a suitable locality-sensitive hash, which speeds up reporting the top-K most likely edges for the LP task. Our experiments on diverse datasets show that PERMGNN outperforms several state-of-the-art link predictors, and can predict the most likely edges fast.

1 Introduction

In the link prediction (LP) task, we are given a snapshot of a social network, and asked to predict future links that are most likely to emerge between nodes. LP has a wide variety of applications, e.g., recommending friends in Facebook, followers in Twitter, products in Amazon, or connections on LinkedIn. An LP algorithm typically considers current non-edges as potential edges, and ranks them by decreasing likelihoods of becoming edges in future.

1.1 Prior Work and Their Limitations

LP methods abound in the literature, and predominantly follow two approaches. The first approach relies strongly on hand-engineering node features and edge likelihoods based on the network structure and domain knowledge (Katz 1997; Liben-Nowell and Kleinberg 2007; Backstrom and Leskovec 2011). However, such feature engineering often demands significant domain expertise. The second approach learns low dimensional node embeddings which serve as node features in LP tasks. Such embedding models include Node2Vec (Grover and Leskovec 2016), DeepWalk (Perozzi, Al-Rfou, and Skiena 2014), etc., and various graph neural networks (GNNs), e.g., GCN (Kipf and Welling 2016a), GraphSAGE (Hamilton, Ying, and Leskovec 2017), GAT (Veličković et al. 2017), etc.

Limited expressive power of GNNs. While deep graph representations have shown significant potential in capturing complex relationships between nodes and their neighborhoods, they lack representational power useful for LP. A key reason for this weakness is the use of symmetric aggregates over a node u's neighbors, driven by the desideratum that the representation of u should be invariant to a permutation of its neighbor nodes (Zaheer et al. 2017; Ravanbakhsh, Schneider, and Poczos 2016; Qi et al. 2017). Such networks have recently been established as low-pass filters (Wu et al. 2019; NT and Maehara 2019), which attenuate high frequency signals. This prevents LP methods based on such node representations from reaching their full potential. Although recent efforts (Lee et al. 2019; Bloem-Reddy and Teh 2019; Shi, Oliva, and Niethammer 2020; Stelzner, Kersting, and Kosiorek 2020; Skianis et al. 2020; Zhang and Chen 2018) on modeling inter-item dependencies have substantially improved the expressiveness of set representations in applications like image and text processing, they offer only modest improvement for LP, as we shall see in our experiments. Among these approaches, SEAL (Zhang and Chen 2018) improves upon GNN performance but does not readily lend itself to efficient top-K predictions via LSH.

Limitations of sequence driven embeddings. We could arrange the neighbors of u in some arbitrary canonical order, and combine their features sequentially using, say, a recurrent neural network (RNN). This would capture feature correlations between neighbors. But now, the representation of u will become sensitive to the order in which neighbors are presented to the RNN. In our experiments, we see loss degradation when neighbors are shuffled. We seek to resolve this central dilemma. An obvious attempted fix would be to present many permutations (as Monte Carlo samples) of neighbor nodes but, as we shall see, doing so in a data-oblivious manner is very inefficient in terms of space and time.

1.2 Our Proposal: PERMGNN

In response to the above limitations in prior work, we develop PERMGNN: a novel node embedding method specifically designed for LP. To avoid the low-pass nature of GNNs, we eschew symmetric additive aggregation over the neighbors of a node u, instead using a recurrent network to which neighbor node representations are provided sequentially, in some order. The representation of u is computed by an output layer applied on the RNN states.

To neutralize the order-sensitivity of the RNN, we cast LP as a novel min-max optimization, equivalent to a game between an adversary that generates worst-case neighbor permutations (to maximize LP loss) and a node representation learner that refines node representations (to minimize LP loss) until they become insensitive to neighborhood permutations. To facilitate end-to-end training and thus avoid exploration of huge permutation spaces, the adversarial permutation generator is implemented as a Gumbel-Sinkhorn neural network (Mena et al. 2018).

Next, we design a hashing method for efficient LP, using the node representations learnt thus far. We propose a smooth optimization to compress the learned embeddings into binary representations, subject to certain hash performance constraints. Then we leverage locality sensitive hashing (Gionis, Indyk, and Motwani 1999) to assign the bit vectors to buckets, such that nodes likely to become neighbors share buckets. Thus, we can limit the computation of pairwise scores to within buckets. In spite of this additional compression, our hashing mechanism is accurate and fast.

We evaluate PERMGNN on several real-world datasets, which shows that our embeddings can suitably distill information from node neighborhoods into compact vectors, and offer accuracy boosts beyond several state-of-the-art LP methods (code: https://www.cse.iitb.ac.in/~abir/codes/permgnn.zip), while achieving large speed gains via LSH. Further probing of the experimental results reveals insightful explanations behind the success of PERMGNN.

1.3 Summary of Contributions

(1) Adversarial permutation guided embeddings: We propose PERMGNN, a novel node embedding method, which provides high quality node representations for LP. In sharp contrast to additive information aggregation in GNNs, we start with a permutation-sensitive but highly expressive aggregator of the graph neighbors and then desensitize the permutation-sensitivity by optimizing a min-max ranking loss function with respect to the smooth surrogates of adversarial permutations.

(2) Hashing method for scalable predictions: We propose an optimized binary transformation of the learnt node representations, which readily admits the use of a locality-sensitive hashing method and shows fast and accurate predictions.

(3) Comprehensive evaluation: We provide a rigorous evaluation to test both the representational power of PERMGNN and the proposed hashing method, which shows that our proposal usually outperforms classical and recent methods.

2 Preliminaries

In this section, we describe necessary notation and the components of a typical LP system.

2.1 Notation

We consider a snapshot of an undirected social network G = (V, E). Each node u has a feature vector f_u. We use nbr(u) and \overline{nbr}(u) to indicate the set of neighbors and non-neighbors of u. Our graphs do not have self edges, but we include u in nbr(u) by convention. We define nbr(u) = {u} ∪ {v | (u, v) ∈ E}, \overline{nbr}(u) = {v | v ≠ u, (u, v) ∉ E}, and also \overline{E} to be the set of non-edges, i.e., \overline{E} = ∪_{u∈V} {(u, v) | v ∈ \overline{nbr}(u)}. Finally, we define Π_δ to be the set of permutations of the set [δ] = {1, 2, ..., δ}, and P_δ to be the set of all possible 0/1 permutation matrices of size δ × δ.

2.2 Scoring and Ranking

Given a graph snapshot G = (V, E), the goal of an LP algorithm is to identify node-pairs from the current set of non-edges \overline{E} (often called potential edges) that are likely to become edges in future. In practice, most LP algorithms compute a score s(u, v) for each potential edge (u, v) ∈ \overline{E}, which measures their likelihood of becoming connected in future. Recently invented network embedding methods (Kipf and Welling 2016a; Grover and Leskovec 2016; Salha et al. 2019) first learn a latent representation x_u of each node u ∈ V and then compute scores s(u, v) using some similarity or distance measure between the corresponding representations x_u and x_v. In the test fold, some nodes are designated as query nodes q. Each query's (current) non-neighbors v are sorted by decreasing s(q, v). We are primarily interested in LP systems that can retrieve a small number K of nodes with largest s(q, v) for all q in o(N²) time.
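To make the scoring-and-ranking step concrete, the following is a small illustrative sketch (not the authors' released code): it ranks a query node's current non-neighbors by cosine similarity of embeddings and keeps the top-K. The array and dictionary layout is our own assumption for the example.

```python
import numpy as np

def top_k_candidates(q, X, nbrs, K=10):
    """Rank current non-neighbors of query q by cosine similarity s(q, v).

    X    : (N, D) array; row u is the embedding x_u produced by some encoder.
    nbrs : dict mapping node id -> set of neighbor ids (including u itself).
    """
    N = X.shape[0]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)       # unit-normalize rows
    scores = Xn @ Xn[q]                                      # cos(x_q, x_v) for all v
    candidates = [v for v in range(N) if v not in nbrs[q]]   # potential edges only
    candidates.sort(key=lambda v: -scores[v])
    return candidates[:K]
```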

3 Proposed Approach

In this section, we first state the limitations of GNNs. Then, we present our method for obtaining high quality node embeddings, with better representational power than GNNs.

3.1 GNNs and Their Limitations

GNNs start with a graph and per-node features f_u to obtain a neighborhood-sensitive node representation x_u for u ∈ V. To meaningfully compare x_u and x_v and compute s(u, v), information from the neighbors of u (and v) should be aggregated in such a way that the embeddings become invariant to permutations of the neighbors of u (and v). GNNs ensure permutation invariance by additive aggregation. Given an integer K, for each node u, a GNN aggregates structural information k hops away from u to cast it into x_u for k ≤ K. Formally, a GNN first computes intermediate embeddings {z_u(k) | k ∈ [K]} in an iterative manner and then computes x_u, using the following recurrent propagation rule:

  ẑ_u(k−1) = AGGR({z_v(k−1) | v ∈ nbr(u)});   (1)
  z_u(k) = COMB1(z_u(k−1), ẑ_u(k−1));   (2)
  x_u = COMB2(z_u(1), ..., z_u(K)).   (3)

Here, for each node u with feature vector f_u, we initialize z_u(0) = f_u; AGGR and COMB1,2 are neural networks. To ensure permutation invariance of the final embedding x_u, AGGR aggregates the intermediate (k−1)-hop information z_v(k−1) with an additive (commutative, associative) function, guided by set function principles (Zaheer et al. 2017):

  AGGR({z_v(k−1) | v ∈ nbr(u)}) = σ1( Σ_{v∈nbr(u)} σ2(z_v(k−1)) ).   (4)

Here σ1, σ2 are nonlinear activations. In theory (Zaheer et al. 2017, Theorem 2), if COMB1,2 are given 'sufficient' hidden units, this set representation is universal. In practice, however, commutative-associative aggregation suffers from limited expressiveness (Pabbaraju and Jain 2019; Wagstaff et al. 2019; Garg, Jegelka, and Jaakkola 2020; Cohen-Karlik, David, and Globerson 2020), which degrades the quality of x_u and s(·, ·), as described below. Specifically, their expressiveness is constrained from two perspectives.

Attenuation of important network signals. GNNs are established to be intrinsically low pass filters (NT and Maehara 2019; Wu et al. 2019). Consequently, they can attenuate high frequency signals which may contain crucial structural information about the network. To illustrate, assume that the node u in Eqs. (1)–(3) has two neighbors v and w with z_v(k−1) = [+1, −1] and z_w(k−1) = [−1, +1], which induce high frequency signals around the neighborhood of u. In practice, these two representations may carry important signals about the network structure. However, popular choices of σ2 often diminish the effect of each of these vectors. In fact, the widely used linear form of σ2 (Hamilton, Ying, and Leskovec 2017; Kipf and Welling 2016a) would completely annul their effects (since σ2(z_v(k−1)) + σ2(z_w(k−1)) = 0) in the final embedding x_u, which would consequently lose the capacity for encapsulating neighborhood information.

Inability to distinguish between correlation structures. In Eq. (4), the outer nonlinearity σ1 operates over the sum of all representations of neighbors of u. Therefore, it cannot explicitly model the variations between the joint dependence of these neighbors. Suppose the correlation between z_v(k−1) and z_w(k−1) is different from that between z_{v'}(k−1) and z_{w'}(k−1) for {v, v', w, w'} ⊆ nbr(u). The additive aggregator in Eq. (4) cannot capture the distinction.

Here, we develop a mitigation approach which exploits sequential memory, e.g., LSTMs, even though they are order-sensitive, and then neutralizes the order sensitivity by presenting adversarial neighbor orders. An alternative mitigation approach is to increase the capacity of the aggregator (while keeping it order invariant by design) by explicitly modeling dependencies between neighbors, as has been attempted in image or text applications (Lee et al. 2019; Bloem-Reddy and Teh 2019; Shi, Oliva, and Niethammer 2020; Stelzner, Kersting, and Kosiorek 2020).
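The cancellation effect in Eq. (4) is easy to reproduce numerically. The sketch below is illustrative only; the choice of an identity σ2 and identity σ1 is ours. It shows that a symmetric sum aggregator returns the same value whether the neighborhood carries an oscillating signal or no signal at all, and is unchanged by any neighbor reordering.

```python
import numpy as np

# Two neighbor messages carrying a "high frequency" signal, as in the running example.
z_v = np.array([+1.0, -1.0])
z_w = np.array([-1.0, +1.0])

def aggr(neigh, sigma2=lambda z: z, sigma1=lambda s: s):
    # Eq. (4) with linear sigma2: symmetric, order-invariant sum aggregation.
    return sigma1(sum(sigma2(z) for z in neigh))

print(aggr([z_v, z_w]))                    # [0. 0.] -- the oscillating signal cancels out
print(aggr([np.zeros(2), np.zeros(2)]))    # [0. 0.] -- indistinguishable from no signal at all
print(aggr([z_w, z_v]))                    # same output under any neighbor permutation
```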
3.2 Our Model: PERMGNN

Responding to the above limitations of popular GNN models, we design PERMGNN, the proposed adversarial permutation guided node embedding method.

Overview. Given a node u, we first compute an embedding x_u using a sequence encoder, parameterized by θ:

  x_u = ρ_θ({f_v | v ∈ nbr(u)}),   (5)

where nbr(u) is presented in some arbitrary order (to be discussed). In contrast to the additive aggregator, ρ is modeled by an LSTM (Hochreiter and Schmidhuber 1997), followed by a fully-connected feedforward neural network (see Figure 1). Such a formulation captures the presence of high frequency signal in the neighborhood of u and the complex dependencies between the neighbors nbr(u) by combining their influence via the recurrent states of the LSTM.

However, now the embedding x_u is no longer invariant to the permutation of the neighbors nbr(u). As we shall see, we counter this by casting the LP objective as an instance of a min-max optimization problem. Such an adversarial setup refines x_u in an iterative manner, to ensure that the resulting trained embeddings are permutation invariant (at least as far as possible in a non-convex optimization setting).

PERMGNN architecture. Let us suppose π = [π_1, ..., π_{|nbr(u)|}] ∈ Π_{|nbr(u)|} is some arbitrary permutation of the neighbors of node u. We take the features of the neighbors of u in the order specified by π, i.e., (v_{π_1}, v_{π_2}, ..., v_{π_{|nbr(u)|}}), and pass them into an LSTM:

  [y_{u,1}, ..., y_{u,|nbr(u)|}] = LSTM_θ(f_{v_{π_1}}, ..., f_{v_{π_{|nbr(u)|}}}).   (6)

Here (y_{u,k})_{k∈[|nbr(u)|]} is a sequence of intermediate representations of node u, which depends on the permutation π. Such an approach ameliorates the limitations of GNNs in two ways:
(1) Unlike GNNs, the construction of y_• is not limited to symmetric aggregation, and is therefore able to capture crucial network signals, including those with high frequency (Borovkova and Tsiamas 2019).
(2) An LSTM (indeed, any RNN variant) is designed to capture the influence of one token of the sequence on the subsequent tokens. In the current context, the state variable h_k of the LSTM combines the influence of the first k−1 neighbors in the input sequence, i.e., v_{π_1}, ..., v_{π_{k−1}}, on the k-th neighbor v_{π_k}. Therefore, these recurrent states allow y_• to capture the complex dependence between the features f_•.

Next, we compute the final embedding x_u by applying an additional nonlinearity on top of the sequence (y_{u,k})_{k∈[|nbr(u)|]} output by the LSTM:

  x_{u;π} = σ_θ(y_{u,1}, y_{u,2}, ..., y_{u,|nbr(u)|}) ∈ R^D.   (7)

Note that the embeddings {x_u} computed above depend on π, the permutation of the neighbors nbr(u) given as the input sequence to the LSTM in Eq. (6).
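The following is a minimal PyTorch sketch of the encoder in Eqs. (6)–(7); it is not the authors' released implementation, and the layer sizes and the mean-over-states output head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeighborSeqEncoder(nn.Module):
    """Permutation-sensitive encoder: LSTM over neighbor features, then a feedforward head."""
    def __init__(self, feat_dim, hidden_dim, emb_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)           # Eq. (6)
        self.head = nn.Sequential(nn.Linear(hidden_dim, emb_dim), nn.Tanh())  # Eq. (7)

    def forward(self, neigh_feats):
        # neigh_feats: (1, |nbr(u)|, feat_dim), rows ordered by a permutation pi.
        y, _ = self.lstm(neigh_feats)       # intermediate states y_{u,k}
        return self.head(y.mean(dim=1))     # x_{u;pi}, order-sensitive by construction

# Reordering the neighbor rows generally changes the embedding:
enc = NeighborSeqEncoder(feat_dim=8, hidden_dim=16, emb_dim=4)
f = torch.randn(1, 5, 8)
perm = torch.randperm(5)
print(torch.allclose(enc(f), enc(f[:, perm, :])))   # usually False
```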

Removing the sensitivity to π. One simple way to ensure permutation invariance is to compute the average of x_{u;π} over all permutations π ∈ Π_{|nbr(u)|}, similar to Murphy et al. (2019a). At a time and space complexity of at least O(Σ_{u∈V} |Π_{|nbr(u)|}|), this is quite impractical for even moderate degree nodes. Replacing the exhaustive average by a Monte Carlo sample does improve representation quality, but is still very expensive. Murphy et al. (2019b) proposed a method called π-SGD, which samples one permutation per epoch. While it is more efficient than sampling multiple permutations, it shows worse robustness in practice.

Adversarial permutation-driven LP objective. Instead of brute-force sampling, we set up a two-party game, one party being the network for LP, vulnerable to π, and the other being an adversary, which tries to make the LP network perform poorly by choosing a 'bad' π at each node:

1: pick initial π^u at each node u
2: repeat
3:   fix {π^u : u ∈ V}; optimize θ for best LP accuracy
4:   fix θ; find next π^u at all u for worst LP accuracy
5: until LP performance stabilizes

Let π^u ∈ Π_{|nbr(u)|} be the permutation used to shuffle the neighbors of u in Eq. (6). Conditioned on u, v, we compute the score for a node-pair (u, v) as

  s_θ(u, v | π^u, π^v) = sim(x_{u;π^u}, x_{v;π^v}),   (8)

where sim(a, b) denotes the cosine similarity between a and b. To train our LP model to give a high quality ranking, we consider the following AUC loss surrogate (Joachims 2005):

  loss(θ; {π^w}_{w∈V}) = Σ_{(u,v)∈E} Σ_{(r,t)∈\overline{E}} [Δ + s_θ(r, t | π^r, π^t) − s_θ(u, v | π^u, π^v)]_+   (9)

where Δ is a tunable margin and [a]_+ = max{0, a}. As stated above, we aim to train the LP model parameters θ in such a way that the trained embeddings {x_u} become invariant to the permutations of nbr(u) for all nodes u ∈ V. This requirement suggests the following min-max loss:

  min_θ max_{{π^w}_{w∈V}} loss(θ; {π^w}_{w∈V}).   (10)
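A small sketch of the ranking surrogate in Eq. (9) follows, assuming batched edge and non-edge pairs rather than the full double sum (a standard mini-batch approximation that is our own simplification, not a detail stated above).

```python
import torch
import torch.nn.functional as F

def auc_surrogate_loss(x, edges, non_edges, margin=0.5):
    """Hinge-style AUC surrogate: sampled non-edges should score below sampled edges by `margin`.

    x         : (N, D) node embeddings (rows x_u).
    edges     : (B, 2) long tensor of observed edges (u, v).
    non_edges : (B, 2) long tensor of sampled non-edges (r, t).
    """
    s_pos = F.cosine_similarity(x[edges[:, 0]], x[edges[:, 1]])          # s(u, v) for edges
    s_neg = F.cosine_similarity(x[non_edges[:, 0]], x[non_edges[:, 1]])  # s(r, t) for non-edges
    # Pairwise hinge over all (edge, non-edge) combinations in the batch, cf. Eq. (9).
    return torch.relu(margin + s_neg.unsqueeze(0) - s_pos.unsqueeze(1)).mean()
```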
Neural permutation surrogate. As stated, the complexity of Eq. (10) seems no better than exhaustive enumeration of permutations. To get past this apparent blocker, just as max is approximated by softmax (a multinomial distribution), a 'hard' permutation (1:1 assignment) π^w is approximated by a 'soft' permutation matrix P^w — a doubly stochastic matrix — which allows continuous optimization. Suppose F_w = [f_{v_1}; f_{v_2}; ...; f_{v_{|nbr(w)|}}] is a matrix whose rows are formed by the features of nbr(w) presented in some canonical order. Then P^w F_w approximates a permuted feature matrix corresponding to some permuted sequence of neighbor feature vectors. The RHS of Eq. (6) can be written as LSTM_θ(P^w F_w), which eventually lets us express the loss as a function of P^w. We can thus rewrite the min-max optimization (10) as

  min_θ max_{{P^w | w∈V}} loss(θ; {P^w}_{w∈V}),   (11)

where the inner maximization is carried out over all 'soft' permutation matrices P^w, parameterized as follows. In deep network design, a trainable multinomial distribution is readily obtained by applying a softmax to trainable (unconstrained) logits. Analogously, a trainable soft permutation matrix P^w can be obtained by applying a Gumbel-Sinkhorn network 'GS' (Mena et al. 2018) to a trainable (unconstrained) 'seed' square matrix, say, A^w:

  P^w = lim_{n→∞} GS^n(A^w), where GS^0(A^w) = exp(A^w) and GS^n(A^w) = COLSCALE(ROWSCALE(GS^{n−1}(A^w))).

Here, COLSCALE and ROWSCALE represent column and row normalization. GS^n(A^w) is the doubly stochastic matrix obtained by consecutive row and column normalizations of A^w. It can be shown that

  lim_{n→∞} GS^n(A^w) = argmax_{P ∈ P_{|nbr(w)|}} Tr(P^⊤ A^w).   (12)

GS^n thus represents a recursive differentiable operator that permits backpropagation of the loss to {A^w}. In practice, n is a finite hyperparameter; the larger it is, the closer the output is to a 'hard' permutation.

Allocating a separate unconstrained seed matrix A^w for each node w would lead to an impractically large number of parameters. Therefore, we express A^w using a globally shared network T_φ with model weights φ, and the per-node feature matrix F_w already available. I.e., we define

  A^w := T_φ(F_w / τ),   (13)

where τ > 0 is a temperature hyperparameter that encourages GS^n(A^w) toward a 'harder' soft permutation. The above steps allow us to rewrite optimization (11) in terms of θ and φ in the form min_θ max_φ loss(θ; φ). After completing the min-max optimization, the embedding x_u of a node u can be computed using some arbitrary neighbor permutation. By design, the impact on sim(u, v) is small when different permutations are used. The extended version of our paper (Roy, De, and Chakrabarti 2021) contains further implementation details.

Figure 1: PERMGNN min-max loss and hashing schematic. (The original figure traces neighbor feature matrices F_v through T_φ, LSTM_θ, and σ_θ to embeddings x, computes sim(u, v+) and sim(u, v−) for an edge and a non-edge inside a ReLU margin loss with margin Δ, and compresses x_u via C_ψ into the hash code b_u.)
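A compact sketch of the finite-n GS operator (Sinkhorn normalization of a seed matrix) follows; the log-domain form, the fixed iteration count, and the Gumbel noise switch are our own choices for a self-contained example, not details taken from the paper.

```python
import torch

def gumbel_sinkhorn(A, n_iters=20, tau=1.0, noise=True):
    """Return a doubly-stochastic relaxation of a permutation, roughly GS^n(A / tau)."""
    if noise:                                   # Gumbel perturbation for differentiable sampling
        g = -torch.log(-torch.log(torch.rand_like(A) + 1e-20) + 1e-20)
        A = A + g
    log_p = A / tau
    for _ in range(n_iters):                    # alternate ROWSCALE and COLSCALE in log space
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)   # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)   # column normalization
    return log_p.exp()                          # rows and columns approximately sum to 1

P = gumbel_sinkhorn(torch.randn(5, 5), n_iters=50)
print(P.sum(dim=0), P.sum(dim=1))               # both close to a vector of ones
```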

4 Scalable LP by Hashing Representations

At this point, we have obtained representations x_u for each node u using PERMGNN. Our next goal is to infer some number of most likely future edges.

Prediction using exhaustive comparisons. Here, we first enumerate the scores of all possible potential edges (the current non-edges) and then report the top-K neighbors for each node. Since most real-life social networks are sparse, potential edges can be Θ(|V|²) in number. Scoring all of them in large graphs is impractical; we must limit the number of comparisons between potentially connecting node pairs to be as small as possible.

4.1 Data-Oblivious LSH with Random Hyperplanes

When, for two nodes u and v, sim(u, v) is defined as cos(x_u, x_v) with x_• ∈ R^D, the classic random hyperplane LSH can be used to hash the embeddings x_•. Specifically, we first draw H uniformly random hyperplanes passing through the origin, in the form of their unit normal vectors n_h ∈ R^D, h ∈ [H] (Charikar 2002). Then we set b_u[h] = sign(n_h · x_u) ∈ {±1} as a 1-bit hash, and b_u ∈ {±1}^H as the H-bit hash code of node u. Correspondingly, we set up 2^H hash buckets, with each node going into one bucket. If the buckets are balanced, we expect each to have N/2^H nodes. Now we limit pairwise comparisons to only node pairs within each bucket, which takes N²/2^H pair comparisons. By letting H grow slowly with N, we can thus achieve sub-quadratic time. However, such a hashing method is data oblivious — the hash codes are not learned from the distribution of the original embeddings x_•. It performs best when the embeddings are uniformly dispersed in the D-dimensional space, so that the random hyperplanes can evenly distribute the nodes among several hash buckets.

1: Input: Graph G = (V, E); binary hash codes {b_u}; query nodes Q; the number K of nodes to be recommended per query node
2: Output: Ranked recommendation list R_q for all q ∈ Q
3: initialize LSH buckets
4: for u ∈ V do
5:     add u to appropriate hash buckets
6: for q ∈ Q do
7:     initialize score heap H_q with capacity K
8: for each LSH bucket B do
9:     for (u, v) ∈ B do
10:        if u ∈ Q then
11:            insert ⟨v, s(u, v)⟩ into H_u; prune if |H_u| > K
12:        if v ∈ Q then
13:            insert ⟨u, s(u, v)⟩ into H_v; prune if |H_v| > K
14: for q ∈ Q do
15:    sort H_q by decreasing score to get ranked list R_q
16: return {R_q | q ∈ Q}

Algorithm 1: Reporting a ranked list of potential edges fast.
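A sketch of the random-hyperplane hashing and bucketing just described (Section 4.1); the dictionary-of-buckets layout is our own illustrative choice, not the paper's data structure.

```python
import numpy as np
from collections import defaultdict

def random_hyperplane_buckets(X, H=12, seed=0):
    """Assign each node an H-bit sign code b_u[h] = sign(n_h . x_u) and group nodes by code."""
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((H, X.shape[1]))   # normal directions; scale is irrelevant for sign
    bits = (X @ normals.T) >= 0                       # (N, H) boolean hash codes
    buckets = defaultdict(list)
    for u, code in enumerate(bits):
        buckets[code.tobytes()].append(u)             # one of up to 2^H buckets
    return buckets

# Pairwise scoring is then restricted to node pairs that share a bucket,
# roughly N^2 / 2^H comparisons if the buckets are balanced.
```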

4.2 Learning Data-Sensitive Hash Codes

To overcome the above limitation of random hyperplane based hashing, we devise a data-driven learning of hash codes, as explored in other applications (Weiss, Torralba, and Fergus 2009). Specifically, we aim to design an additional transformation of the vectors {x_u} into compressed representations {b_u}, with the aim of better balance across hash buckets and reduced prediction time.

Hashing/compression network. In what follows, we will call the compression network C_ψ : R^D → [−1, 1]^H, with model parameters ψ. We interpret sign(C_ψ(x_u)) as the required binary hash code b_u ∈ {−1, +1}^H, with the surrogate tanh(C_ψ(x_u)) to be used in the following smooth optimization:

  min_ψ  (α/|V|) Σ_{u∈V} |1^⊤ tanh(C_ψ(x_u))|
       + (β/|V|) Σ_{u∈V} ‖ |tanh(C_ψ(x_u))| − 1 ‖_1
       + (γ/|\overline{E}|) Σ_{(u,v)∈\overline{E}} |tanh(C_ψ(x_u)) · tanh(C_ψ(x_v))|   (14)

Here, \overline{E} is the set of non-edges, and α, β, γ ∈ (0, 1), with α + β + γ = 1, are tuned hyperparameters. The final binary hash code is b_u = sign(C_ψ(x_u)). The salient terms in the objective above seek the following goals.

Bit balance: If each bit position has as many −1s as +1s, that bit evenly splits the nodes. The term |1^⊤ tanh(C_ψ(x_u))| tries to bit-balance the hash codes.

No sitting on the fence: The optimizer is prevented from setting b = 0 (the easiest way to balance it) by including the term Σ_h | |b[h]| − 1 | = ‖ |b| − 1 ‖_1.

Weak supervision: The third term encourages currently unconnected nodes to be assigned dissimilar bit vectors.

Bucketing and ranking. Note that we do not expect the dot product between the learned hash codes, b_u · b_v, to be a good approximation of cos(x_u, x_v), merely that node pairs with large cos(x_u, x_v) will be found in the same hash buckets. We form the buckets using the recipe of Gionis, Indyk, and Motwani (1999). We adopt the high-recall policy that a node pair u, v should be scored if u and v share at least one bucket. Algorithm 1 shows how the buckets are traversed to generate and score node pairs, which are then placed in a heap for retrieving the top-K pairs. Details can be found in the extended version of our paper (Roy, De, and Chakrabarti 2021).
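An illustrative PyTorch sketch of the objective in Eq. (14) follows; treating C_ψ as a small feedforward network and sampling a batch of non-edges are our own simplifications.

```python
import torch

def hash_code_loss(C, x, non_edge_pairs, alpha=0.4, beta=0.3, gamma=0.3):
    """Smooth surrogate following Eq. (14).

    C              : compression network mapping R^D -> R^H, e.g. torch.nn.Linear(D, H).
    x              : (N, D) trained node embeddings.
    non_edge_pairs : (B, 2) long tensor of sampled non-edges (u, v).
    """
    t = torch.tanh(C(x))                                   # soft codes in (-1, 1)^H
    bit_balance = t.sum(dim=1).abs().mean()                # |1^T tanh(C(x_u))| averaged over nodes
    off_fence = (t.abs() - 1.0).abs().sum(dim=1).mean()    # push entries toward +/-1
    tu, tv = t[non_edge_pairs[:, 0]], t[non_edge_pairs[:, 1]]
    weak_sup = (tu * tv).sum(dim=1).abs().mean()           # non-edges get dissimilar codes
    return alpha * bit_balance + beta * off_fence + gamma * weak_sup

# After training, the binary code of node u is sign(C(x[u])).
```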

5 Experiments

We report on a comprehensive evaluation of PERMGNN and its accompanying hashing strategy. Specifically, we address the following research questions. RQ1: How does the LP accuracy of PERMGNN compare with classic and recent link predictors? Where are the gains and losses? RQ2: How does PERMGNN compare with brute-force sampling of neighbor permutations? RQ3: Exactly where in our adversarially trained network is permutation insensitivity getting programmed? RQ4: Does the hashing optimization reduce prediction time, compared to exhaustive computation of pairwise scores? Further experiments are reported by Roy, De, and Chakrabarti (2021).

5.1 Experimental Setup

Datasets. We consider five real world datasets: (1) Twitter (Leskovec and Mcauley 2012), (2) Google+ (Leskovec et al. 2010), (3) Cora (Getoor 2005; Sen et al. 2008), (4) Citeseer (Getoor 2005; Sen et al. 2008) and (5) PB (Ackland et al. 2005).

Baselines. We compare PERMGNN with several hashable LP algorithms. Adamic Adar (AA) and Common Neighbors (CN) (Liben-Nowell and Kleinberg 2007) are classic unsupervised methods. Node2Vec (Grover and Leskovec 2016) and DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) are node embedding methods based on random walks. Graph Convolutional Network (GCN) (Kipf and Welling 2016b), GraphSAGE (Hamilton, Ying, and Leskovec 2017) and Gravity (Salha et al. 2019) are node embedding methods based on GNNs. We highlight that SEAL (Zhang and Chen 2018) does not readily lend itself to a hashable LP mechanism and therefore we do not compare against it in this paper. We discuss it in the longer version (Roy, De, and Chakrabarti 2021).

              Mean Average Precision (MAP)               Mean Reciprocal Rank (MRR)
            Twitter  Google+   Cora   Citeseer   PB    Twitter  Google+   Cora   Citeseer   PB
AA           0.727    0.321   0.457    0.477   0.252    0.904    0.553   0.535    0.548   0.508
CN           0.707    0.292   0.377    0.401   0.218    0.911    0.553   0.460    0.462   0.516
Node2Vec     0.673    0.330   0.448    0.504   0.182    0.832    0.551   0.484    0.546   0.333
DeepWalk     0.624    0.288   0.432    0.458   0.169    0.757    0.482   0.468    0.492   0.303
GraphSAGE    0.488    0.125   0.393    0.486   0.077    0.638    0.233   0.425    0.523   0.156
GCN          0.615    0.330   0.408    0.464   0.200    0.789    0.482   0.444    0.505   0.345
Gravity      0.735    0.360   0.407    0.462   0.193    0.881    0.540   0.438    0.518   0.330
PERMGNN      0.735    0.385   0.480    0.560   0.220    0.880    0.581   0.524    0.600   0.397

Table 1: MAP and MRR for all LP algorithms (PERMGNN and baselines) on the ranked list of all potential edges (K = ∞) across all five datasets, with 20% test set. Numbers in bold font indicate the best performer.

Figure 2: Query-wise wins and losses in terms of AP(PERMGNN) − AP(baseline), the gain (above the x-axis) or loss (below the x-axis) of AP of PERMGNN with respect to competitive baselines. Queries Q are sorted by decreasing gain of PERMGNN along the x-axis. Panels: (a) Google+, (b) Citeseer.

Figure 3: Validation MAP against training epochs for Twitter and Google+. PERMGNN converges faster than MultiPerm. Panels: (a) Twitter, (b) Google+.

Evaluation protocol. Similar to the evaluation protocol of Backstrom and Leskovec (2011), we partition the edge (and non-edge) sets into training, validation and test folds as follows. For each dataset, we first build the set of query nodes Q, where each query has at least one triangle around it. Then, for each q ∈ Q, in the original graph, we partition the neighbors nbr(q) and the non-neighbors \overline{nbr}(q) which are within 2-hop distance from q into 70% training, 10% validation and 20% test sets, where the node pairs are sampled uniformly at random. We disclose the resulting sampled graph induced by the training and validation sets to the LP model. Then, for each query q ∈ Q, the trained LP model outputs a top-K list of potential neighbors from the test set. Using ground truth, we compute the average precision (AP) and reciprocal rank (RR) of each top-K list. Then we average over all query nodes to get mean AP (MAP) and mean RR (MRR).
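For clarity, the following sketch computes AP and RR for one query's ranked list and averages them into MAP and MRR; it assumes binary ground-truth relevance labels per returned node and is not the authors' evaluation script.

```python
def average_precision(ranked_labels):
    """ranked_labels: list of 0/1 relevance flags for a query's top-K list, in ranked order."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)     # precision at each relevant position
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(ranked_labels):
    return next((1.0 / i for i, rel in enumerate(ranked_labels, start=1) if rel), 0.0)

def map_and_mrr(per_query_labels):
    """per_query_labels: dict query -> ranked 0/1 labels; returns (MAP, MRR) over all queries."""
    aps = [average_precision(l) for l in per_query_labels.values()]
    rrs = [reciprocal_rank(l) for l in per_query_labels.values()]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
```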

5.2 Comparative Analysis of LP Accuracy

First, we address research question RQ1 by comparing the LP accuracy of PERMGNN against the baselines, in terms of MAP and MRR across the datasets.

MAP and MRR summary. Table 1 summarizes LP accuracy across all the methods. We make the following observations. (1) PERMGNN outperforms all the competitors in terms of MAP in four datasets, the exception being PB, where it is outperformed by AA. Moreover, in terms of MRR, it outperforms all the baselines for the Google+ and Citeseer datasets. (2) The performance of the GNNs is comparable for Cora and Citeseer. Due to its weakly supervised training procedure, the overall performance of GraphSAGE is poor among the GNN based methods. (3) The classic unsupervised predictors, i.e., AA and CN, often beat some recent embedding models. AA is the best performer in terms of MAP in PB and in terms of MRR in Twitter. Since AA and CN encourage triad completion, which is a key factor in the growth of several real life networks, they often serve as good link predictors (Sarkar, Chakrabarti, and Moore 2011). (4) The random-walk based embeddings, viz. Node2Vec and DeepWalk, show moderate performance. Notably, Node2Vec is the second best performer in Citeseer.

Drill-down. Next, we compare ranking performance at individual query nodes. For each query (node) q, we measure the gain (or loss) of PERMGNN in terms of average precision, i.e., AP(PERMGNN) − AP(baseline), for three competitive baselines, across the Google+ and Citeseer datasets. From Figure 2, we observe that, for Google+ and Citeseer respectively, PERMGNN matches or exceeds the baselines for 60% and 70% of the queries.

5.3 PERMGNN vs. Sampling Permutations

Next, we address research question RQ2 by establishing the utility of PERMGNN against its natural alternative MultiPerm, in which a node embedding is computed by averaging permutation-sensitive representations over several sampled permutations. Figure 3 shows that PERMGNN is >15× and >4.5× faster than the permutation averaging based method for the Twitter and Google+ datasets, respectively. MultiPerm also occupies significantly larger RAM than PERMGNN.

Figure 4: Effect of neighbor order perturbation on training loss. As we move away from the canonical permutation π0, training loss increases steeply for 1Perm, but remains roughly stable for MultiPerm and PERMGNN. Panels: (a) Cora, (b) Citeseer.

Figure 5: Insensitivity of the neighborhood features {f_v | v ∈ nbr(u)}, the LSTM output {y}, and the resultant node embeddings x_u with respect to neighbor order permutations. Panels: (a) Cora, (b) Citeseer.

5.4 Permutation Invariance of PERMGNN

Here, we answer research question RQ3. To that end, we first train PERMGNN along with its two immediate alternatives: (i) 1Perm, where a vanilla LSTM is trained with a single canonical permutation π0 of the nodes; and (ii) MultiPerm, where an LSTM is trained using several sampled permutations of the nodes. Then, given a different permutation π, we compute the node embedding x_{u;π} by feeding the corresponding sequence of neighbors π(nbr(u)) (sorted by the node IDs of nbr(u) assigned by π) as input to the trained models. Finally, we use these embeddings for LP and measure the relative change in training loss. Figure 4 shows a plot of (LOSS(π) − LOSS(π0)) / LOSS(π0) against the correlation between π and the canonical order π0, measured in terms of Kendall's τ, KTAU(π, π0).
It reveals that 1Perm suffers a significant rise in training loss when the input node order π substantially differs from the canonical order π0, i.e., when KTAU(π, π0) is low. Both MultiPerm and PERMGNN turn out to be permutation-insensitive across a wide range of node orderings.

To probe this phenomenon, we instrument the stability of f, y, x to different permutations. Specifically, we define insensitivity(z; π, π0) = Σ_{u∈V} SIM(z_{u;π}, z_{u;π0}) / |V| for any vector or sequence z. We compute the insensitivity of the input sequence {f_v : v ∈ nbr(u)}, the intermediate LSTM output {y} and the final embedding x_u with respect to different permutations π. Figure 5 summarizes the results, and shows that as information flows through PERMGNN stages, from input feature sequence to the final embeddings, the insensitivity of the underlying signals increases. Thus, our adversarial training smoothly turns permutation-sensitive input sequences into permutation invariant node embeddings, without any explicit symmetric aggregator.

5.5 Performance of Hashing Methods

Finally, we address RQ4 by studying the performance of our LSH method (Section 4.2). Specifically, we compare the time spent in similarity computation and heap operations by our hashing method against random hyperplane based hashing (Section 4.1), with exhaustive computation of pairwise scores as a slow but "relatively perfect" baseline. Since vectorized (tensorized) similarity computation may be faster than numpy, we provide results for both implementations. Figure 6 summarizes the results in terms of running time. It shows that: (1) hashing using C_ψ leads to considerable savings in reporting top-K node-pairs with respect to both random hyperplane based hashing and exhaustive enumeration, and (2) the gains increase with increasing graph sizes (from Google+ to PB). Because LSH-based top-K retrieval may discard relevant nodes after K, it is more appropriate to study ranking degradation in terms of the decrease in NDCG (rather than MAP). Suppose we insist that NDCG be at least 85, 90, or 95% of exhaustive NDCG. How selective is a hashing strategy, in terms of the factor of query speedup (because of buckets pruned in Algorithm 1)? Table 2 shows that our hashing method provides better pruning than random hyperplane hashing for a given level of NDCG degradation.

Figure 6: Running time (score time and heap time) for our LSH based scalable prediction, the random-hyperplane (RH) based LSH method, and exhaustive comparison, on Twitter, G+, Cora, Cite and PB. Panels: (a) Tensorized, (b) Non-tensorized.

               Minimum NDCG as % of exhaustive NDCG
                    85%               90%               95%
               Twitter  Google+   Twitter  Google+   Twitter  Google+
Our Hashing      6.67     12.5      6.67     10        6.25     5.5
RH               1.78     3.45      1.78     3.45      1.78     3.45

Table 2: Speedup achieved by different hashing methods under various permitted NDCG degradation limits.

6 Conclusion

We presented PERMGNN, a novel LP formulation that combines a recurrent, order-sensitive graph neighbor aggregator with an adversarial generator of neighbor permutations. PERMGNN achieves LP accuracy comparable to or better than sampling a number of permutations by brute force, and is faster to train. PERMGNN is also superior to a number of LP baselines. In addition, we formulate an optimization to map PERMGNN's node embeddings to a suitable locality-sensitive hash, which greatly speeds up reporting of the most likely edges. It would be interesting to extend PERMGNN to other downstream network analyses, e.g., node classification, community detection, or knowledge graph completion.

Acknowledgements

Partly supported by an IBM AI Horizons Grant. Thanks to Chitrank Gupta and Yash Jain for helping rectify an error in an earlier evaluation method.

References

Ackland, R.; et al. 2005. Mapping the US political blogosphere: Are conservative bloggers more prominent? In BlogTalk Downunder 2005 Conference, Sydney.
Adamic, L. A.; and Adar, E. 2003. Friends and neighbors on the Web. Social Networks 25(3): 211–230.
Backstrom, L.; and Leskovec, J. 2011. Supervised random walks: predicting and recommending links in social networks. In WSDM Conference, 635–644.
Bloem-Reddy, B.; and Teh, Y. W. 2019. Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082.
Borovkova, S.; and Tsiamas, I. 2019. An ensemble of LSTM neural networks for high-frequency stock market classification. Journal of Forecasting 38(6): 600–619.
Charikar, M. S. 2002. Similarity estimation techniques from rounding algorithms. In STOC, 380–388.
Cohen-Karlik, E.; David, A. B.; and Globerson, A. 2020. Regularizing Towards Permutation Invariance in Recurrent Models. In NeurIPS.
Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2292–2300.
Garg, V. K.; Jegelka, S.; and Jaakkola, T. 2020. Generalization and representational limits of graph neural networks. arXiv preprint arXiv:2002.06157.
Getoor, L. 2005. Link-based classification. In Advanced Methods for Knowledge Discovery from Complex Data, 189–207. Springer.
Gionis, A.; Indyk, P.; and Motwani, R. 1999. Similarity Search in High Dimensions via Hashing. In VLDB Conference, 518–529.
Grover, A.; and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In SIGKDD.
Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780.
Joachims, T. 2005. A support vector method for multivariate performance measures. In ICML, 377–384.
Katz, B. 1997. From Sentence Processing to Information Access on the World Wide Web. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, 77–94. Stanford, CA: Stanford University.
Kipf, T. N.; and Welling, M. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Kipf, T. N.; and Welling, M. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Kulis, B.; and Darrell, T. 2009. Learning to hash with binary reconstructive embeddings. In NeurIPS, 1042–1050.
Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh, Y. W. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML.
Leskovec, J.; Chakrabarti, D.; Kleinberg, J.; Faloutsos, C.; and Ghahramani, Z. 2010. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11(Feb): 985–1042.
Leskovec, J.; and Mcauley, J. J. 2012. Learning to discover social circles in ego networks. In NeurIPS.
Liben-Nowell, D.; and Kleinberg, J. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58(7): 1019–1031.
Lichtenwalter, R. N.; Lussier, J. T.; and Chawla, N. V. 2010. New perspectives and methods in link prediction. In SIGKDD Conference, 243–252.
Liu, W.; Wang, J.; Ji, R.; Jiang, Y.-G.; and Chang, S.-F. 2012. Supervised hashing with kernels. In IEEE CVPR, 2074–2081.
Mena, G.; Belanger, D.; Linderman, S.; and Snoek, J. 2018. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665.
Murphy, R. L.; Srinivasan, B.; Rao, V.; and Ribeiro, B. 2019a. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In ICLR.
Murphy, R. L.; Srinivasan, B.; Rao, V.; and Ribeiro, B. 2019b. Relational pooling for graph representations. arXiv preprint arXiv:1903.02541.
NT, H.; and Maehara, T. 2019. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550.
Pabbaraju, C.; and Jain, P. 2019. Learning Functions over Sets via Permutation Adversarial Networks. arXiv preprint arXiv:1907.05638.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In KDD, 701–710.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
Ravanbakhsh, S.; Schneider, J.; and Poczos, B. 2016. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500.
Roy, I.; De, A.; and Chakrabarti, S. 2021. Adversarial Permutation Guided Node Representations for Link Prediction. arXiv:2012.08974.
Salha, G.; Limnios, S.; Hennequin, R.; Tran, V.-A.; and Vazirgiannis, M. 2019. Gravity-Inspired Graph Autoencoders for Directed Link Prediction. In CIKM, 589–598.
Sarkar, P.; Chakrabarti, D.; and Moore, A. W. 2011. Theoretical justification of popular link prediction heuristics. In COLT.
Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 593–607.
Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3): 93–93.
Shi, Y.; Oliva, J.; and Niethammer, M. 2020. Deep Message Passing on Sets. In AAAI, 5750–5757.
Sinkhorn, R. 1967. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly 74(4): 402–405.
Skianis, K.; Nikolentzos, G.; Limnios, S.; and Vazirgiannis, M. 2020. Rep the set: Neural networks for learning set representations. In International Conference on Artificial Intelligence and Statistics, 1410–1420. PMLR.
Stelzner, K.; Kersting, K.; and Kosiorek, A. R. 2020. Generative Adversarial Set Transformers. In Workshop on Object-Oriented Learning at ICML 2020.
Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In WWW Conference, 1067–1077.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
Wagstaff, E.; Fuchs, F. B.; Engelcke, M.; Posner, I.; and Osborne, M. 2019. On the limitations of representing functions on sets. arXiv preprint arXiv:1901.09006.
Wang, Z.; Ren, Z.; He, C.; Zhang, P.; and Hu, Y. 2019. Robust Embedding with Multi-Level Structures for Link Prediction. In IJCAI, 5240–5246.
Weiss, Y.; Torralba, A.; and Fergus, R. 2009. Spectral hashing. In NeurIPS, 1753–1760.
Wu, F.; Zhang, T.; Souza Jr., A. H. d.; Fifty, C.; Yu, T.; and Weinberger, K. Q. 2019. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153.
Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018a. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018b. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
Yadati, N.; Nitin, V.; Nimishakavi, M.; Yadav, P.; Louis, A.; and Talukdar, P. 2018. Link prediction in hypergraphs using graph convolutional networks. Manuscript.
You, J.; Ying, R.; and Leskovec, J. 2019. Position-aware graph neural networks. arXiv preprint arXiv:1906.04817.
Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. In Advances in Neural Information Processing Systems, 3391–3401.
Zhang, M.; and Chen, Y. 2018. Link prediction based on graph neural networks. In NeurIPS.
