Graph Matching Networks for Learning the Similarity of Graph Structured Objects

Yujia Li 1 Chenjie Gu 1 Thomas Dullien 2 Oriol Vinyals 1 Pushmeet Kohli 1

Abstract

This paper addresses the challenging problem of retrieval and matching of graph structured objects, and makes two key contributions. First, we demonstrate how Graph Neural Networks (GNN), which have emerged as an effective model for various supervised prediction problems defined on structured data, can be trained to produce embeddings of graphs in vector spaces that enable efficient similarity reasoning. Second, we propose a novel Graph Matching Network model that, given a pair of graphs as input, computes a similarity score between them by jointly reasoning on the pair through a new cross-graph attention-based matching mechanism. We demonstrate the effectiveness of our models on different domains including the challenging problem of control-flow-graph based function similarity search, which plays an important role in the detection of vulnerabilities in software systems. The experimental analysis demonstrates that our models are not only able to exploit structure in the context of similarity learning, but they can also outperform domain-specific baseline systems that have been carefully hand-engineered for these problems.

1 DeepMind  2 Google. Correspondence to: Yujia Li.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Graphs are natural representations for encoding relational structures that are encountered in many domains. Expectedly, computations defined over graph structured data are employed in a wide variety of fields, from the analysis of molecules for computational biology and chemistry (Gilmer et al., 2017; Yan et al., 2005), to the analysis of knowledge graphs or graph structured parses for natural language understanding.

In the past few years graph neural networks (GNNs) have emerged as an effective class of models for learning representations of structured data and for solving various supervised prediction problems on graphs. Such models are invariant to permutations of graph elements by design and compute graph node representations through a propagation process which iteratively aggregates local structural information (Scarselli et al., 2009; Li et al., 2015; Gilmer et al., 2017). These node representations are then used directly for node classification, or pooled into a graph vector for graph classification. Problems beyond supervised classification or regression are relatively less well-studied for GNNs.

In this paper we study the problem of similarity learning for graph structured objects, which appears in many important real world applications, in particular similarity based retrieval in graph databases. One motivating application is the computer security problem of binary function similarity search, where given a binary which may or may not contain code with known vulnerabilities, we wish to check whether any control-flow graph in this binary is sufficiently similar to a database of known-vulnerable functions. This helps identify vulnerable statically linked libraries in closed-source software, a recurring problem (CVE, 2010; 2018) for which no good solutions are currently available. Figure 1 shows one example from this application, where the binary functions are represented as control flow graphs annotated with assembly instructions. This similarity learning problem is very challenging, as subtle differences can make two graphs semantically very different, while graphs with different structures can still be similar. A successful model for this problem should therefore (1) exploit the graph structures, and (2) be able to reason about the similarity of graphs both from the graph structures as well as from learned semantics.

In order to solve the graph similarity learning problem, we investigate the use of GNNs in this context, explore how they can be used to embed graphs into a vector space, and learn this embedding model to make similar graphs close in the vector space and dissimilar graphs far apart. One important property of this model is that it maps each graph independently to an embedding vector, and then all the similarity computation happens in the vector space. Therefore, the embeddings of graphs in a large database can be precomputed and indexed, which enables efficient retrieval with fast data structures like k-d trees (Bentley, 1975) or locality sensitive hashing (Gionis et al., 1999).

Figure 1. The binary function similarity learning problem. Checking whether two graphs are similar requires reasoning about both the structure as well as the semantics of the graphs. Here the left two control flow graphs correspond to the same function compiled with different compilers (and therefore similar), while the graph on the right corresponds to a different function.

We further propose an extension to GNNs which we call Graph Matching Networks (GMNs) for similarity learning. Instead of computing graph representations independently for each graph, the GMNs compute a similarity score through a cross-graph attention mechanism to associate nodes across graphs and identify differences. By making the graph representation computation dependent on the pair, this matching model is more powerful than the embedding model, providing a nice accuracy-computation trade-off.

We evaluate the proposed models and baselines on three tasks: a synthetic graph edit-distance learning task which captures structural similarity only, and two real world tasks - binary function similarity search and mesh retrieval - which require reasoning about both the structural and the semantic similarity. On all tasks, the proposed approaches outperform established baselines and structure agnostic models; in more detailed ablation studies, we found that the Graph Matching Networks consistently outperform the graph embedding model and Siamese networks.

To summarize, the contributions of this paper are: (1) we demonstrate how GNNs can be used to produce graph embeddings for similarity learning; (2) we propose the new Graph Matching Networks that compute similarity through cross-graph attention-based matching; (3) empirically we show that the proposed graph similarity learning models achieve good performance across a range of applications, outperforming structure agnostic models and established hand-engineered baselines.

2. Related Work

Graph Neural Networks and Graph Representation Learning - The history of graph neural networks (GNNs) goes back to at least the early work by Gori et al. (2005) and Scarselli et al. (2009), who proposed to use a propagation process to learn node representations. These models have been further developed by incorporating modern deep learning components (Li et al., 2015; Veličković et al., 2017; Bruna et al., 2013). A separate line of work focuses on generalizing convolutions to graphs (Bruna et al., 2013; Bronstein et al., 2017). Popular graph convolutional networks also compute node updates by aggregating information in local neighborhoods (Kipf & Welling, 2016), making them the same family of models as GNNs. GNNs have been successfully used in many domains (Kipf & Welling, 2016; Veličković et al., 2017; Battaglia et al., 2016; 2018; Niepert et al., 2016; Duvenaud et al., 2015; Gilmer et al., 2017; Dai et al., 2017; Li et al., 2018; Wang et al., 2018a;b). Most of the previous work on GNNs focuses on supervised prediction problems (with exceptions like (Dai et al., 2017; Li et al., 2018; Wang et al., 2018a)). The graph similarity learning problem we study in this paper and the new graph matching model can be good additions to this family of models. Independently, Al-Rfou et al. (2019) also proposed a cross-graph matching mechanism similar to ours, for the problem of unsupervised graph representation learning.

Recently Xu et al. (2018) and Morris et al. (2018) studied the discriminative power of GNNs and concluded that GNNs are as powerful as the Weisfeiler-Lehman (Weisfeiler & Lehman, 1968) algorithm in terms of distinguishing graphs (isomorphism test). In this paper, however, we study the similarity learning problem, i.e. how similar two graphs are, rather than whether two graphs are identical. In this setting, learned models can adapt to the tasks we have data for, while hand-coded algorithms can not easily adapt.

Graph Similarity Search and Graph Kernels - Graph similarity search has been studied extensively in the database and data mining communities (Yan et al., 2005; Dijkman et al., 2009). The similarity is typically defined by either exact matches (full-graph or sub-graph isomorphism) (Berretti et al., 2001; Shasha et al., 2002; Yan et al., 2004; Srinivasa & Kumar, 2003) or some measure of structural similarity, e.g. in terms of graph edit distances (Willett et al., 1998; Raymond et al., 2002). Most of the approaches proposed in this direction are not learning-based, and focus on efficiency.

Graph kernels are kernels on graphs designed to capture the graph similarity, and can be used in kernel methods for e.g. graph classification (Vishwanathan et al., 2010; Shervashidze et al., 2011). Popular graph kernels include those that measure the similarity between walks or paths on graphs (Borgwardt & Kriegel, 2005; Kashima et al., 2003; Vishwanathan et al., 2010), kernels based on limited-sized substructures (Horváth et al., 2004; Shervashidze et al., 2009), and kernels based on sub-tree structures (Shervashidze & Borgwardt, 2009; Shervashidze et al., 2011). A recent survey on graph kernels can be found in (Kriege et al., 2019). Graph kernels are usually used in models that may have learned components, but the kernels themselves are hand-designed and motivated by graph theory. They can typically be formulated as first computing the feature vectors for each graph (the kernel embedding), and then taking an inner product between these vectors to compute the kernel value. One exception is (Yanardag & Vishwanathan, 2015), where the co-occurrence of graph elements (substructures, walks, etc.) is learned, but the basic elements are still hand-designed. Compared to these approaches, our graph neural network based similarity learning framework learns the similarity metric end-to-end.

Distance Metric Learning - Learning a distance metric between data points is the key focus of the area of metric learning. Most of the early work on metric learning assumes that the data already lies in a vector space, and only a linear metric matrix is learned to properly measure the distance in this space, so that similar examples are grouped together and dissimilar examples are far apart (Xing et al., 2003; Weinberger & Saul, 2009; Davis et al., 2007). More recently the ideas of distance metric learning and representation learning have been combined in applications like face verification, where deep convolutional neural networks are learned to map similar images to similar representation vectors (Chopra et al., 2005; Hu et al., 2014; Sun et al., 2014). In this paper, we focus on representation and similarity metric learning for graphs, and our graph matching model goes one step beyond the typical representation learning methods by modeling the cross-graph matchings.

Siamese Networks - Siamese networks (Bromley et al., 1994; Baldi & Chauvin, 1993) are a family of neural network models for visual similarity learning. These models typically consist of two networks with shared parameters applied to two input images independently to compute representations; a small network is then used to fuse these representations and compute a similarity score. They can be thought of as learning both the representations and the similarity metric. Siamese networks have achieved great success in many visual recognition and verification tasks (Bromley et al., 1994; Baldi & Chauvin, 1993; Koch et al., 2015; Bertinetto et al., 2016; Zagoruyko & Komodakis, 2015). In the experiments we adapt Siamese networks to handle graphs, but found our graph matching networks to be more powerful, as they do cross-graph computations and therefore fuse information from both graphs early in the computation process. Independent of our work, (Shyam et al., 2017) recently proposed a cross-example attention model for visual similarity as an alternative to Siamese networks, based on similar motivations, and achieved good results.

3. Deep Graph Similarity Learning

Given two graphs G1 = (V1, E1) and G2 = (V2, E2), we want a model that produces a similarity score s(G1, G2) between them. Each graph G = (V, E) is represented as a set of nodes V and a set of edges E; optionally each node i ∈ V can be associated with a feature vector x_i, and each edge (i, j) ∈ E with a feature vector x_ij. These features can represent, e.g., the type of a node or the direction of an edge. If a node or an edge does not have any associated features, we set the corresponding vector to a constant vector of 1s. We propose two models for graph similarity learning: a model based on standard GNNs for learning graph embeddings, and the new and more powerful GMNs. The two models are illustrated in Figure 2.

3.1. Graph Embedding Models

Graph embedding models embed each graph into a vector, and then use a similarity metric in that vector space to measure the similarity between graphs. Our GNN embedding model comprises 3 parts: (1) an encoder, (2) propagation layers, and (3) an aggregator.

Encoder - The encoder maps the node and edge features to initial node and edge vectors through separate MLPs:

    h_i^(0) = MLP_node(x_i), ∀ i ∈ V,
    e_ij = MLP_edge(x_ij), ∀ (i, j) ∈ E.    (1)

Propagation Layers - A propagation layer maps a set of node representations {h_i^(t)}_{i∈V} to new node representations {h_i^(t+1)}_{i∈V}, as follows:

    m_{j→i} = f_message(h_i^(t), h_j^(t), e_ij),
    h_i^(t+1) = f_node(h_i^(t), Σ_{j:(j,i)∈E} m_{j→i}).    (2)

Here f_message is typically an MLP on the concatenated inputs, and f_node can be either an MLP or a recurrent neural network core, e.g. an RNN, GRU or LSTM (Li et al., 2015). To aggregate the messages, we use a simple sum, which may alternatively be replaced by other commutative operators such as mean, max or an attention-based weighted sum (Veličković et al., 2017). Through multiple layers of propagation, the representation for each node accumulates information from its local neighborhood.
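To make the propagation layer concrete, the following is a minimal NumPy sketch of one round of message passing as in Eq. 2. It assumes a one-layer MLP standing in for f_message on the concatenated inputs and a one-layer MLP update standing in for f_node (the paper also uses GRU cores); all weight names, shapes and the toy inputs are illustrative, not the exact parameterization used in the experiments.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def propagation_step(h, edges, e_feat, W_msg, W_node):
    """One round of message passing (cf. Eq. 2).

    h:      [num_nodes, D] node states h^(t)
    edges:  list of (j, i) pairs, message flows j -> i
    e_feat: [num_edges, De] edge vectors e_ij
    W_msg:  weights of a one-layer f_message stand-in
    W_node: weights of a one-layer f_node stand-in
    """
    num_nodes, D = h.shape
    agg = np.zeros((num_nodes, W_msg.shape[1]))
    for k, (j, i) in enumerate(edges):
        # f_message on the concatenation [h_i, h_j, e_ij]
        m_ji = relu(np.concatenate([h[i], h[j], e_feat[k]]) @ W_msg)
        agg[i] += m_ji                      # simple sum aggregation of incoming messages
    # f_node on [h_i, sum_j m_{j->i}]
    return relu(np.concatenate([h, agg], axis=1) @ W_node)

# toy usage: 3 nodes, 2 directed edges, node dimension D = 4
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
edges = [(0, 1), (2, 1)]
e_feat = rng.normal(size=(2, 2))
W_msg = rng.normal(size=(4 + 4 + 2, 8)) * 0.1
W_node = rng.normal(size=(4 + 8, 4)) * 0.1
h_next = propagation_step(h, edges, e_feat, W_msg, W_node)
```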

Figure 2. Illustration of the graph embedding (left) and matching models (right).

Aggregator - After a certain number T of rounds of propagation, an aggregator takes the set of node representations {h_i^(T)} as input and computes a graph level representation h_G = f_G({h_i^(T)}). We use the following aggregation module proposed in (Li et al., 2015),

    h_G = MLP_G( Σ_{i∈V} σ(MLP_gate(h_i^(T))) ⊙ MLP(h_i^(T)) ),    (3)

which transforms node representations and then uses a weighted sum with gating vectors to aggregate across nodes. The weighted sum can help filter out irrelevant information; it is more powerful than a simple sum and also works significantly better empirically.

After the graph representations h_G1 and h_G2 are computed for the pair (G1, G2), we compute the similarity between them using a similarity metric in the vector space, for example the Euclidean, cosine or Hamming similarity.

Note that without the propagation layers (or with 0 propagation steps), this model becomes an instance of Deep Sets (Zaheer et al., 2017) or PointNet (Qi et al., 2017), which do computation on the individual nodes and then pool the node representations into a representation for the whole graph. Such a model, however, ignores the structure and only treats the data as a set of independent nodes.
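A minimal sketch of the gated aggregator in Eq. 3, with single linear layers standing in for MLP_gate, MLP and MLP_G; the weight names and dimensions below are illustrative rather than the exact configuration used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_graph_aggregate(h, W_gate, W_val, W_out):
    """Graph-level readout h_G = MLP_G(sum_i sigmoid(MLP_gate(h_i)) * MLP(h_i))."""
    gates = sigmoid(h @ W_gate)              # [num_nodes, Dg] gating vectors
    values = h @ W_val                       # [num_nodes, Dg] transformed node states
    pooled = (gates * values).sum(axis=0)    # gated, weighted sum over nodes
    return pooled @ W_out                    # final transformation into the graph vector

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 32))                 # 5 node states of dimension 32
W_gate = rng.normal(size=(32, 128)) * 0.1
W_val = rng.normal(size=(32, 128)) * 0.1
W_out = rng.normal(size=(128, 128)) * 0.1
h_G = gated_graph_aggregate(h, W_gate, W_val, W_out)   # 128-dimensional graph vector
```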

3.2. Graph Matching Networks

Graph matching networks take a pair of graphs as input and compute a similarity score between them. Compared to the embedding models, these matching models compute the similarity score jointly on the pair, rather than first independently mapping each graph to a vector. Therefore these models are potentially stronger than the embedding models, at the cost of some extra computation.

We propose the following graph matching network, which changes the node update module in each propagation layer to take into account not only the aggregated messages on the edges of each graph, as before, but also a cross-graph matching vector which measures how well a node in one graph can be matched to one or more nodes in the other:

    m_{j→i} = f_message(h_i^(t), h_j^(t), e_ij), ∀ (i, j) ∈ E1 ∪ E2,    (4)
    μ_{j→i} = f_match(h_i^(t), h_j^(t)), ∀ i ∈ V1, j ∈ V2, or i ∈ V2, j ∈ V1,    (5)
    h_i^(t+1) = f_node(h_i^(t), Σ_j m_{j→i}, Σ_{j'} μ_{j'→i}),    (6)
    h_G1 = f_G({h_i^(T)}_{i∈V1}),    (7)
    h_G2 = f_G({h_i^(T)}_{i∈V2}),    (8)
    s = f_s(h_G1, h_G2).    (9)

Here f_s is a standard vector space similarity between h_G1 and h_G2. f_match is a function that communicates cross-graph information, for which we propose to use an attention-based module:

    a_{j→i} = exp(s_h(h_i^(t), h_j^(t))) / Σ_{j'} exp(s_h(h_i^(t), h_{j'}^(t))),
    μ_{j→i} = a_{j→i} (h_i^(t) − h_j^(t)),    (10)

and therefore

    Σ_j μ_{j→i} = Σ_j a_{j→i} (h_i^(t) − h_j^(t)) = h_i^(t) − Σ_j a_{j→i} h_j^(t).    (11)

s_h is again a vector space similarity metric, like Euclidean or cosine similarity, a_{j→i} are the attention weights, and Σ_j μ_{j→i} intuitively measures the difference between h_i^(t) and its closest neighbor in the other graph. Note that because of the normalization in a_{j→i}, the function f_match implicitly depends on the whole set {h_j^(t)}, which we omitted in Eq. 10 for cleaner notation. Since attention weights are required for every pair of nodes across the two graphs, this operation has a computation cost of O(|V1||V2|), while for the GNN embedding model the cost of each round of propagation is O(|V| + |E|). The extra power of the GMNs comes from utilizing this extra computation.
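The cross-graph matching signal of Eqs. 10-11 can be sketched as follows; here a dot-product s_h is used (the paper notes that Euclidean or cosine similarity work similarly), and the code is an illustration rather than the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_graph_match(h1, h2):
    """Summed attention-based matching vectors sum_j mu_{j->i} (Eq. 11).

    h1: [n1, D] node states of graph 1, h2: [n2, D] node states of graph 2.
    Returns, for each node i of graph 1, h1_i - sum_j a_{j->i} h2_j,
    with a_{j->i} softmax-normalized over the nodes j of graph 2.
    """
    scores = h1 @ h2.T                 # s_h(h_i, h_j) as a dot product, shape [n1, n2]
    a = softmax(scores, axis=1)        # attention of each node i over nodes j in graph 2
    attended = a @ h2                  # sum_j a_{j->i} h2_j
    return h1 - attended               # equals sum_j mu_{j->i}

rng = np.random.default_rng(0)
h1, h2 = rng.normal(size=(4, 32)), rng.normal(size=(6, 32))
mu1 = cross_graph_match(h1, h2)        # cross-graph signal fed into f_node for graph 1
mu2 = cross_graph_match(h2, h1)        # and symmetrically for graph 2
```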

Note - By construction, the attention module has a nice property: when the two graphs can be perfectly matched, and when the attention weights are peaked at the exact match, we have Σ_j μ_{j→i} = 0, which means the cross-graph communications are reduced to zero vectors, and the two graphs will continue to compute identical representations in the next round of propagation. On the other hand, differences across the graphs are captured in the cross-graph matching vector Σ_j μ_{j→i}, which is amplified through the propagation process, making the matching model more sensitive to these differences.

Compared to the graph embedding model, the matching model has the ability to change the representation of the graphs based on the other graph it is compared against. The model will adjust the graph representations to make them more different if they do not match.

3.3. Learning

The proposed graph similarity learning models can be trained on a set of example pairs or triplets. Pairwise training requires a dataset of pairs labeled as positive (similar) or negative (dissimilar), while triplet training only needs relative similarity, i.e. whether G1 is closer to G2 or to G3. We describe the losses on pairs and triplets we used below, which are then optimized with gradient descent based algorithms.

When using Euclidean similarity, we use the following margin-based pairwise loss:

    L_pair = E_{(G1,G2,t)}[ max{0, γ − t(1 − d(G1, G2))} ],    (12)

where t ∈ {−1, 1} is the label for the pair, γ > 0 is a margin parameter, and d(G1, G2) = ||h_G1 − h_G2||^2 is the Euclidean distance. This loss encourages d(G1, G2) < 1 − γ when the pair is similar (t = 1), and d(G1, G2) > 1 + γ when t = −1. Given triplets where G1 and G2 are closer than G1 and G3, we optimize the following margin-based triplet loss:

    L_triplet = E_{(G1,G2,G3)}[ max{0, d(G1, G2) − d(G1, G3) + γ} ].    (13)

This loss encourages d(G1, G2) to be smaller than d(G1, G3) by at least a margin γ.
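A sketch of the margin-based pair and triplet losses of Eqs. 12 and 13 evaluated on a single example; the graph vectors, labels and margin value below are placeholders.

```python
import numpy as np

def euclidean_dist(hg1, hg2):
    # d(G1, G2) = ||h_G1 - h_G2||^2 as used in Eqs. 12-13
    return np.sum((hg1 - hg2) ** 2)

def pair_loss(hg1, hg2, t, gamma=1.0):
    # t is +1 for a similar pair, -1 for a dissimilar pair
    return max(0.0, gamma - t * (1.0 - euclidean_dist(hg1, hg2)))

def triplet_loss(hg1, hg2, hg3, gamma=1.0):
    # encourages d(G1, G2) + gamma <= d(G1, G3)
    return max(0.0, euclidean_dist(hg1, hg2) - euclidean_dist(hg1, hg3) + gamma)

rng = np.random.default_rng(0)
g1, g2, g3 = rng.normal(size=(3, 128))
print(pair_loss(g1, g2, t=1), triplet_loss(g1, g2, g3))
```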
This loss encourages d(G1,G2) < tance is NP-hard in general (Zeng et al., 2009), therefore 1−γ when the pair is similar (t = 1), and d(G1,G2) > 1+γ approximations have to be used. Through this experiment when t = −1. Given triplets where G1 and G2 are closer we show that the GSL models can learn structural similarity than G1 and G3, we optimize the following margin-based between graphs on very challenging problems. triplet loss: Training Setup We generated training data by sampling random binomial graphs G with n nodes and edge prob- L = [max{0, d(G ,G )−d(G ,G )+γ}]. 1 triplet E(G1,G2,G3) 1 2 1 3 ability p (Erdös & Rényi, 1959), and then create positive (13) example G2 by randomly substituting kp edges from G1 This loss encourages d(G1,G2) to be smaller than with new edges, and negative example G3 by substituting d(G1,G3) by at least a margin γ. 1 kn edges from G1, where kp < kn . A model needs to For applications where it is necessary to search through predict a higher similarity score for positive pair (G1,G2) a large database of graphs with low latency, it is benefi- 1 cial to have the graph representation vectors be binary, i.e. Note that even though G2 is created with kp edge substitutions H from G1, the actual edit-distance between G1 and G2 can be hG ∈ {−1, 1} , so that efficient nearest neighbor search smaller than kp due to symmetry and isomorphism, same for G3 algorithms (Gionis et al., 1999) may be applied. In such and kn. However the probability of such cases is typically low and cases, we can minimize the Hamming distance of positive decreases rapidly with increasing graph sizes. Graph Matching Networks

Graph Distribution    WL kernel      GNN            GMN
n = 20, p = 0.2       80.8 / 83.2    88.8 / 94.0    95.0 / 95.6
n = 20, p = 0.5       74.5 / 78.0    92.1 / 93.4    96.6 / 98.0
n = 50, p = 0.2       93.9 / 97.8    95.9 / 97.2    97.4 / 97.6
n = 50, p = 0.5       82.3 / 89.0    88.5 / 91.0    93.8 / 92.6

Table 1. Comparing the graph embedding (GNN) and matching (GMN) models trained on graphs from different distributions with the baseline, measuring pair AUC / triplet accuracy (×100).

Baseline - We compare our models with the popular Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011), which has been shown to be very competitive on graph classification tasks; the Weisfeiler-Lehman algorithm behind this kernel is a strong method for checking graph isomorphism (edit distance of 0), a closely related task (Weisfeiler & Lehman, 1968; Shervashidze et al., 2011).

Evaluation - The performance of the different models is evaluated using two metrics: (1) pair AUC - the area under the ROC curve for classifying pairs of graphs as similar or not on a fixed set of 1000 pairs, and (2) triplet accuracy - the accuracy of correctly assigning higher similarity to the positive pair in a triplet than to the negative pair, on a fixed set of 1000 triplets.

Results - We trained and evaluated the GSL models on graphs from a few specific distributions with different n and p, with k_p = 1 and k_n = 2 fixed. The evaluation results are shown in Table 1. We can see that by learning on graphs from specific distributions, the GSL models are able to do better than generic baselines, and the GMNs consistently outperform the embedding model (GNNs). Note that this result does not contradict the conclusion by Xu et al. (2018); Morris et al. (2018), as we are learning a similarity metric, rather than doing an isomorphism test, and our model can do better than the WL kernel with learning.

For the GMNs, we can visualize the cross-graph attention to gain further insight into how it is working. Figure 3 shows two examples of this for a matching model trained with n sampled from [20, 50], tested on graphs of 10 nodes. The cross-graph attention weights are shown in green, with the scale of the weights shown as the transparency of the green edges. We can see that the attention weights align nodes well when the two graphs match, and tend to focus on nodes with higher degrees when they don't. However the pattern is not as interpretable as in standard attention models.

More experiments on the generalization capabilities of these models (train on small graphs, test on larger graphs; train on graphs with some k_p, k_n combinations, test on others) and visualizations are included in Appendix B.1 and C.

4.2. Control Flow Graph based Binary Function Similarity Search

Problem Background - Binary function similarity search is an important problem in computer security. The need to analyze and search through binaries emerges when we do not have access to the source code, for example when dealing with commercial or embedded software or suspicious executables. Combining a disassembler and a code analyzer, we can extract a control-flow graph (CFG) which contains all the information in a binary function in a structured format. See Figure 1 and Appendix B.2 for a few example CFGs. In a CFG, each node is a basic block of assembly instructions, and the edges between nodes represent the control flow, indicated by for example a jump or a return instruction used in branching, loops or function calls. In this section, we target the vulnerability search problem, where a piece of binary known to have some vulnerabilities is used as the query, and we search through a library to find similar binaries that may have the same vulnerabilities.² Accurate identification of similar vulnerabilities enables security engineers to quickly narrow down the search space and apply patches.

² Note that our formulation is general and can also be applied to source code directly if it is available.

In the past the binary function similarity search problem has been tackled with classical graph theoretical matching algorithms (Eschweiler et al., 2016; Pewny et al., 2015), and Xu et al. (2017) and Feng et al. (2016) proposed to learn embeddings of CFGs and do similarity search in the embedding space. Xu et al. (2017) in particular proposed an embedding method based on graph neural networks, starting with some hand selected feature vectors for each node. Here we study further the performance of graph embedding and matching models, with pair and triplet training, different numbers of propagation steps, and learning node features from the assembly instructions.

Training Setup and Baseline - We train and evaluate our models on data generated by compiling the popular open source video processing software ffmpeg using different compilers, gcc and clang, and different compiler optimization levels, which results in 7940 functions and roughly 8 CFGs per function. The average size of the CFGs is around 55 nodes per graph, with some larger graphs having up to a few thousand nodes (see Appendix B.2 for more detailed statistics). Different compiler optimization levels result in CFGs of very different sizes for the same function.
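Since each function yields several CFGs (one per compiler and optimization-level combination), similar and dissimilar training pairs can be derived from function identity. A hedged sketch of such pair sampling is given below; the dictionary structure and names are assumptions for illustration, not the exact data pipeline used in the experiments.

```python
import random

def sample_cfg_pair(cfgs_by_function, positive, rng=random.Random(0)):
    """cfgs_by_function: dict mapping a function name to the list of its CFGs
    (one per compiler / optimization-level combination)."""
    names = list(cfgs_by_function)
    if positive:
        # two different compilations of the same function -> label +1
        name = rng.choice([n for n in names if len(cfgs_by_function[n]) >= 2])
        g1, g2 = rng.sample(cfgs_by_function[name], 2)
        return g1, g2, +1
    # CFGs of two different functions -> label -1
    n1, n2 = rng.sample(names, 2)
    return rng.choice(cfgs_by_function[n1]), rng.choice(cfgs_by_function[n2]), -1

toy = {"f": ["cfg_f_gcc", "cfg_f_clang"], "g": ["cfg_g_gcc"], "h": ["cfg_h_O0", "cfg_h_O2"]}
print(sample_cfg_pair(toy, positive=True))
print(sample_cfg_pair(toy, positive=False))
```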

Figure 3. Visualization of cross-graph attention for GMNs after 5 propagation layers, for one pair with graph edit distance 1 and one pair with graph edit distance 2. In each pair of graphs the left figure shows the attention from the left graph to the right, and the right figure shows the opposite.

We split the data and used 80% of the functions and the associated CFGs for training, 10% for validation and 10% for testing. The models were trained to learn a similarity metric on CFGs such that the CFGs for the same function have high similarity, and low similarity otherwise. Once trained, this similarity metric can be used to search through a library of binaries and is invariant to compiler type and optimization level.

We compare our graph embedding and matching models with Google's open source function similarity search tool (Dullien, 2018), which has been used to successfully find vulnerabilities in binaries in the past. This tool computes representations of CFGs through a hand-engineered graph hashing process, which encodes the neighborhood structure of each node by hashing the degree sequence from a traversal of a 3-hop neighborhood, and also encodes the assembly instructions for each basic block by hashing the trigrams of assembly instruction types. These features are then combined using a SimHash-style (Charikar, 2002) algorithm with learned weights to form a 128-dimensional binary code. An LSH-based search index is then used to perform approximate nearest neighbor search using Hamming distance.

Following (Dullien, 2018), we also map the CFGs to 128-dimensional binary vectors, and use the Hamming similarity formulation described in Section 3 for training. We further studied two variants of the data: one that only uses the graph structure, and one that uses both the graph structure and the assembly instructions with learned node features. When assembly instructions are available, we embed each instruction type into a vector, and then sum up all the embedding vectors for the instructions in a basic block as the initial representation vector (the x_i's) for each node; these embeddings are learned jointly with the rest of the model.
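A small sketch of how the initial node features x_i could be built from assembly instructions, by summing learned embeddings of the instruction types in each basic block; the vocabulary and embedding table below are placeholders that would be learned jointly with the rest of the model.

```python
import numpy as np

def block_features(basic_blocks, instr_vocab, instr_embedding):
    """basic_blocks: list of lists of instruction mnemonics, one inner list per CFG node.
    instr_vocab: dict mapping mnemonic -> row index into instr_embedding.
    Returns the initial node feature matrix (the x_i's), one row per basic block."""
    feats = np.zeros((len(basic_blocks), instr_embedding.shape[1]))
    for i, block in enumerate(basic_blocks):
        for instr in block:
            feats[i] += instr_embedding[instr_vocab[instr]]   # sum of instruction-type embeddings
    return feats

rng = np.random.default_rng(0)
vocab = {"mov": 0, "push": 1, "call": 2, "jz": 3}
emb = rng.normal(size=(len(vocab), 16)) * 0.1       # learned jointly in the real model
x = block_features([["push", "mov", "call"], ["mov", "jz"]], vocab, emb)
```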
Results - Figure 4 shows the performance of the different models with different numbers of propagation steps and in different data settings. We again evaluate the performance of these models on pair AUC and triplet accuracy, on fixed sets of pairs and triplets from the test set. It is clear from the results that: (1) the performance of both the graph embedding and matching models consistently goes up with more propagation steps, in particular significantly outperforming the structure agnostic special case which uses 0 propagation steps; (2) the graph embedding model is consistently better than the baselines with enough propagation steps; and (3) the graph matching models outperform the embedding models across all settings and propagation steps. Additionally, we have tried the WL kernel on this task using only the graph structure, and it achieved 0.619 AUC and 24.5% triplet accuracy. This is not surprising, as the WL kernel is not designed for solving this task, while our models learn the features useful for the task of interest and can achieve better performance than generic similarity metrics.

4.3. More Baselines and Ablation Studies

In this section, we carefully examine the effects of the design decisions we made in the GMN model and compare it against a few more alternatives. In particular, we evaluate the popular Graph Convolutional Network (GCN) model by Kipf & Welling (2016) as an alternative to our GNN model, and Siamese versions of the GNN/GCN embedding models. The GCN model replaces the message passing in Eq. 2 with graph convolutions, and the Siamese model predicts a distance value by concatenating two graph vectors and passing them through a 2-layer MLP. The comparison with Siamese networks can in particular show the importance of cross-graph attention early in the similarity computation process, as Siamese networks fuse the representations of the two graphs only at the very end.

We focus on the function similarity search task, and also conduct experiments on an extra COIL-DEL mesh graph dataset (Riesen & Bunke, 2008), which contains 100 classes of mesh graphs corresponding to 100 types of objects. We treat graphs in the same class as similar, and use a setup identical to the function similarity search task for training and evaluation.
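For reference, the pair AUC and triplet accuracy metrics used throughout Sections 4.1-4.3 could be computed along the following lines; this is a self-contained sketch and the actual evaluation code may differ.

```python
import numpy as np

def pair_auc(scores, labels):
    """Area under the ROC curve for similarity scores against +1/-1 pair labels,
    computed via the rank statistic (probability a positive outranks a negative)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels > 0], scores[labels < 0]
    # fraction of (positive, negative) pairs ranked correctly, ties count half
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def triplet_accuracy(pos_scores, neg_scores):
    """Fraction of triplets where s(G1, G2) > s(G1, G3)."""
    return float(np.mean(np.asarray(pos_scores) > np.asarray(neg_scores)))

print(pair_auc([0.9, 0.2, 0.8, 0.1], [1, -1, 1, -1]))   # 1.0
print(triplet_accuracy([0.9, 0.3], [0.2, 0.5]))          # 0.5
```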

Figure 4. Performance (×100) of different models on the binary function similarity search task. The two panels plot pair AUC and triplet accuracy against the number of propagation steps, for the baseline, the embedding model and the matching model, each with struct only and struct + node features variants.

Function Similarity Search
Model           Pair AUC   Triplet Acc
Baseline        96.09      96.35
GCN             96.67      96.57
Siamese-GCN     97.54      97.51
GNN             97.71      97.83
Siamese-GNN     97.76      97.58
GMN             99.28      99.18

COIL-DEL
Model           Pair AUC   Triplet Acc
GCN             94.80      94.95
Siamese-GCN     95.90      96.10
GNN             98.58      98.70
Siamese-GNN     98.76      98.55
GMN             98.97      98.80

Table 2. More results on the function similarity search task and the extra COIL-DEL dataset.

Table 2 summarizes the experiment results, which clearly show that: (1) the GNN embedding model is a competitive model (more powerful than the GCN model); (2) using a Siamese network architecture to learn similarity on top of graph representations is better than using a prespecified similarity metric (Euclidean, Hamming, etc.); and (3) the GMNs outperform the Siamese models, showing the importance of cross-graph information communication early in the computation process.

5. Conclusions and Discussion

In this paper we studied the problem of graph similarity learning using graph neural networks. Compared to standard prediction problems for graphs, similarity learning poses a unique set of challenges and potential benefits. For example, the graph embedding models can be learned through a classification setting when we do have a set of classes in the dataset, but formulating it as a similarity learning problem can handle cases where we have a very large number of classes and only very few examples for each class. The representations learned from the similarity learning setting can also easily generalize to data from classes unseen during training (zero-shot generalization).

We proposed the new graph matching networks as a stronger alternative to the graph embedding models. The added power of the graph matching models comes from the fact that they are not independently mapping each graph to an embedding, but rather doing comparisons at all levels across the pair of graphs, in addition to the embedding computation. The model can then learn to properly allocate capacity toward the embedding part or the matching part. The price to pay for this expressivity is the added computation cost, in two aspects: (1) since each cross-graph matching step requires the computation of the full attention matrices, which requires at least O(|V1||V2|) time, this may be expensive for large graphs; (2) the matching models operate on pairs, and cannot directly be used for indexing and searching through large graph databases. Therefore it is best to use the graph matching networks when we (1) only care about the similarity between individual pairs, or (2) use them in a retrieval setting together with a faster filtering model, like the graph embedding model or standard graph similarity search methods, to narrow down the search to a smaller candidate set, and then use the more expensive matching model to rerank the candidates to improve precision.

Developing neural models for graph similarity learning is an important research direction with many applications. There are still many interesting challenges to resolve, for example improving the efficiency of the matching models, studying different matching architectures, adapting the GNN capacity to graphs of different sizes, and applying these models to new application domains. We hope our work can spur further research in this direction.
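The filter-then-rerank retrieval strategy described above could be organized as in the following sketch, where embed, match_score and the candidate index are stand-ins for the trained embedding model, the matching model, and any fast nearest-neighbor structure; it is an illustration of the idea, not the system used in the experiments.

```python
import numpy as np

def retrieve(query_graph, database, embed, match_score, k=100, top=10):
    """Two-stage retrieval: filter with cheap embeddings, rerank with the matching model.

    database: list of (graph, precomputed_embedding) pairs.
    embed(graph) -> vector; match_score(g1, g2) -> similarity (more expensive).
    """
    q = embed(query_graph)
    # stage 1: rank the whole database by embedding similarity (or use an ANN index)
    sims = np.array([float(np.dot(q, e)) for _, e in database])
    candidates = np.argsort(-sims)[:k]
    # stage 2: rerank only the k candidates with the pairwise matching model
    reranked = sorted(candidates, key=lambda i: -match_score(query_graph, database[i][0]))
    return reranked[:top]

# toy usage with graphs represented by integers and trivial stand-in models
db = [(g, np.ones(4) * g) for g in range(5)]
print(retrieve(3, db, embed=lambda g: np.ones(4) * g,
               match_score=lambda a, b: -abs(a - b), k=3, top=2))
```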

References

CVE-2010-0188. Available from MITRE, CVE-ID CVE-2010-0188, 2010. URL https://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2010-0188.
CVE-2018-0986. Available from MITRE, CVE-ID CVE-2018-0986, 2018. URL https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-0986.
Al-Rfou, R., Zelle, D., and Perozzi, B. Ddgk: Learning graph representations for deep divergence graph kernels. arXiv preprint arXiv:1904.09671, 2019.
Baldi, P. and Chauvin, Y. Neural networks for fingerprint recognition. Neural Computation, 5(3):402-418, 1993.
Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., and Kavukcuoglu, K. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502-4510, 2016.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
Bentley, J. L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
Berretti, S., Del Bimbo, A., and Vicario, E. Efficient matching and indexing of graph models in content-based retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1089-1105, 2001.
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pp. 850-865. Springer, 2016.
Borgwardt, K. M. and Kriegel, H.-P. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pp. 8 pp. IEEE, 2005.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737-744, 1994.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.
Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
Charikar, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC '02, pp. 380-388, New York, NY, USA, 2002. ACM. doi: 10.1145/509907.509965. URL http://doi.acm.org/10.1145/509907.509965.
Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pp. 539-546. IEEE, 2005.
Dai, H., Khalil, E., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6351-6361, 2017.
Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), pp. 209-216, 2007.
Dijkman, R., Dumas, M., and García-Bañuelos, L. Graph matching algorithms for business process model similarity search. In International Conference on Business Process Management, pp. 48-63. Springer, 2009.
Dullien, T. functionsimsearch. https://github.com/google/functionsimsearch, 2018. Accessed: 2018-05-14.
Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224-2232, 2015.
Erdős, P. and Rényi, A. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290-297, 1959.
Eschweiler, S., Yakdan, K., and Gerhards-Padilla, E. discovre: Efficient cross-architecture identification of bugs in binary code. In NDSS, 2016.
Feng, Q., Zhou, R., Xu, C., Cheng, Y., Testa, B., and Yin, H. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 480-491. ACM, 2016.
Gao, X., Xiao, B., Tao, D., and Li, X. A survey of graph edit distance. Pattern Analysis and Applications, 13(1):113-129, 2010.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
Gionis, A., Indyk, P., Motwani, R., et al. Similarity search in high dimensions via hashing. In VLDB, pp. 518-529, 1999.
Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks (IJCNN), volume 2, pp. 729-734. IEEE, 2005.
Horváth, T., Gärtner, T., and Wrobel, S. Cyclic pattern kernels for predictive graph mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 158-167. ACM, 2004.
Hu, J., Lu, J., and Tan, Y.-P. Discriminative deep metric learning for face verification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875-1882, 2014.
Kashima, H., Tsuda, K., and Inokuchi, A. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 321-328, 2003.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
Kriege, N. M., Johansson, F. D., and Morris, C. A survey on graph kernels. arXiv preprint arXiv:1903.11835, 2019.
Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. arXiv preprint arXiv:1810.02244, 2018.
Niepert, M., Ahmed, M., and Kutzkov, K. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014-2023, 2016.
Pewny, J., Garmany, B., Gawlik, R., Rossow, C., and Holz, T. Cross-architecture bug search in binary executables. In IEEE Symposium on Security and Privacy (SP), pp. 709-724. IEEE, 2015.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Raymond, J. W., Gardiner, E. J., and Willett, P. Rascal: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6):631-644, 2002.
Riesen, K. and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 287-297. Springer, 2008.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009.
Shasha, D., Wang, J. T., and Giugno, R. Algorithmics and applications of tree and graph searching. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT Symposium on Principles of Database Systems, pp. 39-52. ACM, 2002.
Shervashidze, N. and Borgwardt, K. M. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems, pp. 1660-1668, 2009.
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488-495, 2009.
Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539-2561, 2011.
Shyam, P., Gupta, S., and Dukkipati, A. Attentive recurrent comparators. arXiv preprint arXiv:1703.00767, 2017.
Srinivasa, S. and Kumar, S. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proceedings of the VLDB Conference, pp. 975-986. Elsevier, 2003.
Sun, Y., Chen, Y., Wang, X., and Tang, X. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pp. 1988-1996, 2014.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000-6010, 2017.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201-1242, 2010.
Wang, T., Liao, R., Ba, J., and Fidler, S. Nervenet: Learning structured policy with graph neural networks. In ICLR, 2018a.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018b.
Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207-244, 2009.
Weisfeiler, B. and Lehman, A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12-16, 1968.
Willett, P., Barnard, J. M., and Downs, G. M. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38(6):983-996, 1998.
Xing, E. P., Jordan, M. I., Russell, S. J., and Ng, A. Y. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 521-528, 2003.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
Xu, X., Liu, C., Feng, Q., Yin, H., Song, L., and Song, D. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 363-376. ACM, 2017.
Yan, X., Yu, P. S., and Han, J. Graph indexing: a frequent structure-based approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 335-346. ACM, 2004.
Yan, X., Yu, P. S., and Han, J. Substructure similarity search in graph databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 766-777, 2005.
Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365-1374, 2015.
Zagoruyko, S. and Komodakis, N. Learning to compare image patches via convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4353-4361. IEEE, 2015.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394-3404, 2017.
Zeng, Z., Tung, A. K., Wang, J., Feng, J., and Zhou, L. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB, 2(1):25-36, 2009.

A. Extra Details on Model Architectures for the edit distance learning task without further tuning. Using larger models there should further improve model In the propagation layers of the graph embedding and match- performance. ing models, we used an MLP with one hidden layer as the fmessage module, with a ReLU nonlinearity on the hidden (t) B.1. Learning Graph Edit Distances layer. For node state vectors (the hi vectors) of dimension D, the size of the hidden layer and the output is set to 2D. In this task the nodes and edges have no extra features We found it to be beneficial to initialize the weights of this associated with them, we therefore initialized the xi and xij fmessage module to be small, which helps stablizing train- vectors as vectors of 1s, and the encoder MLP in Eq.1 is ing. We used the standard Glorot initialization with an extra simply a linear layer for the nodes and an identity mapping scaling factor of 0.1. When not using this small scaling for the edges. factor, at the begining of training the message vectors when We searched through the following hyperparameters: (1) summed up can have huge scales, which is bad for learning. triplet vs pair training; (2) number of propagation layers; One extra thing to note about the propagation layers is that (3) share parameters on different propagation layers or not. we can make all the propagation layers share the same set of Learning rate is fixed at 0.001 for all runs and we used the parameters, which can be useful if this is a suitable inductive Adam optimizer (Kingma & Ba, 2014). Overall we found: bias to have. (1) triplet and pair training performs similarly, with pair training slightly better, (2) using more propagation layers We tried different fnode modules in both experiments, and consistently helps, and increasing the number of propaga- found GRUs to generally work better than one-hidden layer tion layers T beyond 5 may help even more, (3) sharing MLPs, and all the results reported uses GRUs as fnode, with P parameters is useful for performance more often than not. the sum over edge messages j mj→i treated as the input to the GRU for the embedding model, and the concatenation Intuitively, the baseline WL kernel starts by labeling each P P node by its degree, and then iteratively updates a node’s of j mj→i and j0 µj0→i as the input to the GRU for the matching model. representation as the histogram of neighbor node patterns, which is effectively also a graph propagation process. The In the aggregator module, we used a single linear layer for kernel value is then computed as a dot product of graph rep- the node transformation MLP and the gating MLPgate in resentation vectors, which is the histogram of different node Eq.3. The output of this linear layer has a dimensionality the representations. When using the kernel with T iterations same as the required dimensionality for the graph vectors. of computation, a pair of graphs of size |V | can have as σ(x) = 1 1+e−x is the logistic sigmoid function, and is large as a 2|V |T dimensional representation vector for each the element-wise product. After the weighted sum, another graph, and these sets of effective ‘feature’ types are differ- MLP with one hidden layers is used to further transform ent for different pairs of graphs as the node patterns can be the graph vector. The hidden layer has the same size as the very different. This is an advantage for WL kernel over our output, with a ReLU nonlinearity. 
In the aggregator module, we used a single linear layer for the node transformation MLP and for the gating MLP_gate in Eq. 3. The output of this linear layer has the same dimensionality as the required graph vector dimensionality. Here σ(x) = 1 / (1 + e^{-x}) is the logistic sigmoid function and ⊙ is the element-wise product. After the weighted sum, another MLP with one hidden layer is used to further transform the graph vector; the hidden layer has the same size as the output, with a ReLU nonlinearity.

For the matching model, the attention weights are computed as

    a_{j \to i} = \frac{\exp(s_h(h_i^{(t)}, h_j^{(t)}))}{\sum_{j'} \exp(s_h(h_i^{(t)}, h_{j'}^{(t)}))}.    (16)

We have tried both the Euclidean similarity s_h(h_i, h_j) = -\|h_i - h_j\|^2 and the dot-product similarity s_h(h_i, h_j) = h_i^\top h_j for s_h, and they perform similarly, without significant difference.
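The cross-graph attention of Eq. 16, with either similarity choice, can be sketched as follows. This is our illustration; we assume the summed matching vector takes the form μ_i = h_i − Σ_j a_{j→i} h_j, consistent with the observation in Appendix C that symmetric attention yields zero cross-graph communication vectors.

```python
import torch


def cross_graph_attention(h1, h2, similarity="dotproduct"):
    """Illustrative sketch of the attention in Eq. 16.

    h1: [n1, D] node states of graph 1; h2: [n2, D] node states of graph 2.
    Returns the summed cross-graph matching vectors mu for each graph.
    """
    if similarity == "dotproduct":
        s = h1 @ h2.t()                    # s_h(h_i, h_j) = h_i^T h_j
    elif similarity == "euclidean":
        s = -torch.cdist(h1, h2) ** 2      # s_h(h_i, h_j) = -||h_i - h_j||^2
    else:
        raise ValueError(f"unknown similarity: {similarity}")
    a_1to2 = torch.softmax(s, dim=1)       # a_{j->i}: graph-1 node i attends over graph-2 nodes j
    a_2to1 = torch.softmax(s.t(), dim=1)   # and symmetrically for graph 2
    # Assumed form of the summed matching vector: mu_i = h_i - sum_j a_{j->i} h_j.
    mu1 = h1 - a_1to2 @ h2
    mu2 = h2 - a_2to1 @ h1
    return mu1, mu2
```

In the matching model these μ vectors are concatenated with the summed within-graph messages before the GRU update, as described above.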

B. Extra Experiment Details

We fixed the node state vector dimensionality to 32 and the graph vector dimensionality to 128 throughout both the graph edit distance learning and the binary function similarity search tasks. We tuned these sizes initially on the function similarity search task, where this setting clearly performs better than smaller models; increasing the model size further, however, leads to overfitting on that task. We directly used the same setting for the edit distance learning task without further tuning, although using larger models there should further improve performance.

B.1. Learning Graph Edit Distances

In this task the nodes and edges have no extra features associated with them; we therefore initialized the x_i and x_ij vectors as vectors of 1s, and the encoder MLP in Eq. 1 is simply a linear layer for the nodes and an identity mapping for the edges.

We searched through the following hyperparameters: (1) triplet vs. pair training; (2) number of propagation layers; (3) whether to share parameters across propagation layers. The learning rate is fixed at 0.001 for all runs and we used the Adam optimizer (Kingma & Ba, 2014). Overall we found that (1) triplet and pair training perform similarly, with pair training slightly better; (2) using more propagation layers consistently helps, and increasing the number of propagation layers T beyond 5 may help even more; (3) sharing parameters is useful for performance more often than not.

Intuitively, the baseline WL kernel starts by labeling each node by its degree, and then iteratively updates a node's representation based on the histogram of its neighbors' node patterns, which is effectively also a graph propagation process. The kernel value is then computed as a dot product of graph representation vectors, where each graph representation is the histogram of the different node representations. When using the kernel with T iterations of computation, a pair of graphs of size |V| can have representation vectors with up to 2|V|T dimensions each, and these sets of effective 'features' differ from one pair of graphs to another, since the node patterns can be very different. This is an advantage of the WL kernel over our models, as we use a fixed-size graph vector regardless of the graph size. We evaluate the WL kernel for T up to 5 and report results for the best T on the evaluation set.
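For reference, a minimal sketch of a WL subtree kernel of this kind is given below. The degree-based initial labels and the histogram dot product follow the description above; the particular label-compression scheme (hashing) and the lack of normalization are our simplifying assumptions and may differ from the baseline implementation used in the experiments.

```python
from collections import Counter
import hashlib


def wl_features(adj, T):
    """WL subtree features of a graph given as {node: [neighbors]} (illustrative only)."""
    labels = {v: str(len(nbrs)) for v, nbrs in adj.items()}   # initial label = node degree
    feats = Counter((0, lab) for lab in labels.values())
    for t in range(1, T + 1):
        new_labels = {}
        for v, nbrs in adj.items():
            # A node's new label summarizes its own label and the multiset of neighbor labels.
            signature = labels[v] + "|" + ",".join(sorted(labels[u] for u in nbrs))
            new_labels[v] = hashlib.sha1(signature.encode()).hexdigest()
        labels = new_labels
        feats.update((t, lab) for lab in labels.values())
    return feats


def wl_kernel(adj1, adj2, T=5):
    """Kernel value: dot product of the two label histograms."""
    f1, f2 = wl_features(adj1, T), wl_features(adj2, T)
    return sum(count * f2[key] for key, count in f1.items())
```

In practice the kernel value would typically be normalized, e.g. k(G1, G2) / sqrt(k(G1, G1) k(G2, G2)), before being used as a similarity score.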

In addition to the experiments presented in the main paper, we have also tested the generalization capabilities of the proposed models; the extra results are presented in the following.

Train on small graphs, generalize to large graphs. In this experiment, we trained the GSL models on graphs with n sampled uniformly from 20 to 50 and p sampled from the range [0.2, 0.5], to cover more variability in graph sizes and edge density for better generalization, and we again fixed k_p = 1, k_n = 2. For evaluation, we tested the best embedding and matching models on graphs with n = 100, 200 and p = 0.2, 0.5, with results shown in Table 3. We can see that for this task the GSL models trained on small graphs can generalize to graphs larger than the ones they were trained on. The performance falls off a bit on much larger graphs with many more nodes and edges. This is partially caused by the fact that we use a fixed-size graph vector throughout the experiments, whereas the WL kernel has many more effective 'features' to use for computing similarity. On the other hand, as shown before, when trained on graphs from the distributions we care about, the GSL models can adapt and perform much better.

    Eval Graphs         WL kernel      GNN            GMN
    n = 100, p = 0.2    98.5 / 99.4    96.6 / 96.8    96.8 / 97.7
    n = 100, p = 0.5    86.7 / 97.0    79.8 / 81.4    83.1 / 83.6
    n = 200, p = 0.2    99.9 / 100.0   88.7 / 88.5    89.4 / 90.0
    n = 200, p = 0.5    93.5 / 99.2    72.0 / 72.3    68.3 / 70.1

Table 3. Generalization performance on large graphs for the GSL models trained on small graphs with 20 ≤ n ≤ 50 and 0.2 ≤ p ≤ 0.5.

Train on some k_p, k_n combinations, test on other combinations. We have also tested the model trained on graphs with n ∈ [20, 50], p ∈ [0.2, 0.5], k_p = 1, k_n = 2 on graphs with different k_p and k_n combinations. In particular, when evaluated on k_p = 1, k_n = 4, the models perform much better than on k_p = 1, k_n = 2, easily reaching 1.0 AUC and 100% triplet accuracy, as this setting is considerably simpler than k_p = 1, k_n = 2. When evaluated on graphs with k_p = 2, k_n = 3, the performance is worse than on k_p = 1, k_n = 2, as this is a harder setting.

In addition, we have also tried training on the more difficult setting k_p = 2, k_n = 3 and evaluating the models on graphs with k_p = 1, k_n = 2 and n ∈ [20, 50], p ∈ [0.2, 0.5]. The performance of these models is actually better than that of the models trained on the k_p = 1, k_n = 2 setting itself, which is surprising and clearly demonstrates the value of good training data. However, in terms of generalizing to larger graphs, models trained on k_p = 2, k_n = 3 do not have any significant advantage.
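For concreteness, the sketch below shows one way such training data could be generated, assuming (as in the main paper's edit-distance setup) that graphs are sampled from a G(n, p) model and that a positive example is obtained by applying k_p random edge substitutions while a negative example uses k_n > k_p substitutions. The helper names and the exact substitution procedure are our own choices, not the authors' code.

```python
import random
import networkx as nx


def edge_substitute(g, k, rng):
    """Return a copy of g with k existing edges replaced by k previously absent edges."""
    h = g.copy()
    for _ in range(k):
        u, v = rng.choice(list(h.edges()))
        h.remove_edge(u, v)
        while True:
            a, b = rng.sample(list(h.nodes()), 2)
            if not h.has_edge(a, b):
                h.add_edge(a, b)
                break
    return h


def sample_triplet(n_range=(20, 50), p_range=(0.2, 0.5), kp=1, kn=2, rng=None):
    """Sample (anchor, positive, negative): the positive differs from the anchor by
    kp edge substitutions, the negative by kn > kp substitutions."""
    rng = rng or random.Random()
    n = rng.randint(*n_range)
    p = rng.uniform(*p_range)
    g = nx.gnp_random_graph(n, p, seed=rng.randint(0, 2**31 - 1))
    return g, edge_substitute(g, kp, rng), edge_substitute(g, kn, rng)
```

Pair training would use the (anchor, positive) and (anchor, negative) pairs generated in the same way, with binary similarity labels.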
B.2. Binary Function Similarity Search

In this task the edges have no extra features, so we initialize them to constant vectors of 1s, and the encoder MLP for the edges is again just an identity mapping. When using the CFG graph structure only, the nodes are also initialized to constant vectors of 1s, and the encoder MLP is a linear layer. When using assembly instructions, we have a list of assembly instructions associated with each node. We extracted the operator type (e.g. add, mov, etc.) from each instruction and embedded each operator into a vector; the initial node representation is the sum of all operator embeddings.

We searched through the following hyperparameters: (1) triplet or pair training; (2) learning rate in {10^-3, 10^-4}; (3) number of propagation layers; (4) whether to share propagation layer parameters; (5) GRU vs. one-layer MLP for the f_node module. Overall we found that (1) triplet training performs slightly better than pair training in this case; (2) both learning rates can work, but the smaller learning rate is more stable; (3) increasing the number of propagation layers generally helps; (4) using different parameters in different propagation layers performs better than sharing parameters; (5) GRUs are more stable than MLPs and perform better overall.

In addition to the results reported in the main paper, we have also tried the same models on another dataset obtained by compiling the compression software unrar with different compilers and optimization levels. Our graph similarity learning methods also perform very well on the unrar data, but this dataset is a lot smaller, with only around 400 functions, so overfitting is a big problem for any learning-based model and the results on this dataset are not reliable enough to draw conclusions from.

A few more control flow graph examples are shown in Figure 5. The distribution of graph sizes in the training set is shown in Figure 6.

C. Extra Attention Visualizations

A few more attention visualizations are included in Figure 7, Figure 8 and Figure 9. The graph matching model used here has shared parameters for all the propagation and matching layers and was trained with 5 propagation layers. We can therefore unroll it for a number of steps T different from the number of propagation layers it was trained with and test its behavior. In these visualizations, we unrolled the propagation for up to 9 steps, and the model still computes sensible attention maps even with T > 5.

Note that the attention maps do not converge to very peaked distributions. This is partially due to the fact that we use the node state vectors both to carry information through the propagation process and, as is, in the attention mechanism. This makes it hard for the model to produce very peaked attention, as the scale of the node state vectors won't be very big. A better solution is to compute separate key, query and value vectors for each node, as done in the tensor2tensor self-attention formulation (Vaswani et al., 2017), which may further improve the performance of the matching model (a sketch of this alternative is given below).

Figure 7 shows another case where the attention maps do not converge to very peaked distributions, because of in-graph symmetries. Such symmetries are very typical in graphs. In this case, even though the attention maps are not peaked, the cross-graph communication vectors μ are still zero, and the two graphs still have identical representation vectors.
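As an illustration of the key/query/value suggestion above, the sketch below decouples the vectors used for attention from the node states used for propagation, in the spirit of Vaswani et al. (2017). This variant is not part of the evaluated models; the class name and dimensions are our own assumptions.

```python
import torch
import torch.nn as nn


class KQVCrossGraphAttention(nn.Module):
    """Sketch of a key/query/value variant of the cross-graph attention (not evaluated in the paper)."""

    def __init__(self, node_dim, attn_dim):
        super().__init__()
        self.query = nn.Linear(node_dim, attn_dim, bias=False)
        self.key = nn.Linear(node_dim, attn_dim, bias=False)
        self.value = nn.Linear(node_dim, attn_dim, bias=False)

    def forward(self, h1, h2):
        # Queries and keys are separate projections of the node states, so attention
        # can become peaked without requiring the node states themselves to grow large.
        logits = self.query(h1) @ self.key(h2).t() / (self.key.out_features ** 0.5)
        a_1to2 = torch.softmax(logits, dim=1)
        return a_1to2 @ self.value(h2)   # attention-weighted summary of graph 2 for each graph-1 node
```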


Figure 5. Example control flow graphs for the same binary function, compiled with different compilers (clang for the leftmost one, gcc for the others) and optimization levels. Note that each node in the graphs also contains a set of assembly instructions, which we also take into account when computing similarity using learned features.

[Figure 6: plot of graph size (log scale, y axis) against graphs sorted by size (x axis); graphical content not reproduced here.]

Figure 6. Control flow graph size distribution in the training set. In this plot the graphs are sorted by size on the x axis; each point in the figure corresponds to the size of one graph.

[Figure 7 panels: cross-graph attention maps after 0 to 9 propagation steps.]

Figure 7. The change of cross-graph attention over propagation layers. Here the two graphs are two isomorphic chains and there are some in-graph symmetries. Note that in the end the nodes are matched to two corresponding nodes with equal weight, except the one at the center of the chain, which can only match a single other node.

[Figure 8 panels: cross-graph attention maps after 0 to 9 propagation steps.]

Figure 8. The change of cross-graph attention over propagation layers. Here the two graphs are isomorphic, with graph edit distance 0. Note that in the end most of the attention concentrates on the correct matches.

[Figure 9 panels: cross-graph attention maps after 0 to 9 propagation steps.]

Figure 9. The change of cross-graph attention over propagation layers. The edit distance between these two graphs is 1.