
arXiv:1706.07450v2 [stat.ML] 30 Aug 2018

REVISED NOTE ON LEARNING QUADRATIC ASSIGNMENT WITH GRAPH NEURAL NETWORKS

Alex Nowak, Soledad Villar, Afonso S Bandeira, Joan Bruna

Sierra Team, INRIA and Ecole Normale Superieure, Paris
Center for Data Science, New York University, New York
Courant Institute, New York University, New York

Equal contribution. JB was partially supported by Samsung Electronics (Improving Deep Learning using Latent Structure) and DOA W911NF-17-1-0438. SB was partially supported by NSF DMS-1712730 and NSF DMS-1719545. SV was partially supported by the Simons Algorithms and Geometry (A&G) Think Tank.

ABSTRACT

Inverse problems correspond to a certain type of optimization problems formulated over appropriate input distributions. Recently, there has been a growing interest in understanding the computational hardness of these optimization problems, not only in the worst case, but in an average-complexity sense under this same input distribution. In this revised note, we are interested in studying another aspect of hardness, related to the ability to learn how to solve a problem by simply observing a collection of previously solved instances. These "planted solutions" are used to supervise the training of an appropriate predictive model that parametrizes a broad class of algorithms, with the hope that the resulting model will provide good accuracy-complexity tradeoffs in the average sense. We illustrate this setup on the Quadratic Assignment Problem, a fundamental problem in Network Science. We observe that data-driven models based on Graph Neural Networks offer intriguingly good performance, even in regimes where standard relaxation techniques appear to suffer.

1. INTRODUCTION

Many tasks, spanning from discrete geometry to statistics, are defined in terms of computationally hard optimization problems. Loosely speaking, computational hardness appears when the algorithms to compute the optimum solution scale poorly with the problem size, say faster than any polynomial. For instance, in high-dimensional statistics we may be interested in the task of estimating a given object from noisy measurements under a certain generative model. In that case, the notion of hardness contains both a statistical aspect, which asks above which signal-to-noise ratio the estimation is feasible, and a computational one, which restricts the estimation to be computed in polynomial time. An active research area in Theoretical Computer Science and Statistics is to understand the interplay between those statistical and computational detection thresholds; see [1] and references therein for an instance of this program in the community detection problem, or [3, 4, 7] for examples of statistical inference tradeoffs under computational constraints.

Instead of investigating a designed algorithm for the problem in question, we consider a data-driven approach to learn algorithms from solved instances of the problem. In other words, given a collection (x_i, y_i)_{i ≤ L} of problem instances drawn from a certain distribution, we ask whether one can learn an algorithm that achieves good accuracy at solving new instances of the same problem – also drawn from the same distribution – and to what extent the resulting algorithm can reach those statistical/computational thresholds.

The general approach is to cast an ensemble of algorithms as neural networks ŷ = Φ(x; θ) with specific architectures that encode prior knowledge on the algorithmic class, parameterized by θ ∈ R^S. The network is trained to minimize the empirical loss L(θ) = L^{-1} Σ_i ℓ(y_i, Φ(x_i; θ)), for a given measure of error ℓ, using stochastic gradient descent. This leads to another notion of learnability hardness, which measures to what extent the problem can be solved with no prior knowledge of the specific algorithm to solve it, but only a vague idea of which operations it should involve.

In this revised version of [20] we focus on a particular NP-hard problem, the Quadratic Assignment Problem (QAP), and study data-driven approximations to solve it. Since the problem is naturally formulated in terms of graphs, we consider the so-called Graph Neural Network (GNN) model [27]. This neural network alternates between applying linear combinations of local graph operators – such as the graph adjacency or the graph Laplacian – and pointwise nonlinearities, and has the ability to model some forms of non-linear message passing and spectral analysis, as illustrated for instance by the data-driven Community Detection methods in the Stochastic Block Model [6]. Existing tractable algorithms for the QAP include spectral alignment methods [30] and methods based on semidefinite programming relaxations [10, 33]. Our preliminary experiments suggest that the GNN approach taken here may be able to outperform the spectral and SDP counterparts on certain random graph models, at a lower computational budget. We also provide an initial analysis of the learnability hardness by studying the optimization landscape of a simplified GNN architecture. Our setup reveals that, for the QAP, the landscape complexity is controlled by the same concentration of measure phenomena that control the statistical hardness; see Section 4.

The rest of the paper is structured as follows. Section 2 presents the problem set-up and describes existing relaxations of the QAP. Section 3 describes the graph neural network architecture, Section 4 presents the landscape analysis, and Section 5 presents our numerical experiments. Finally, Section 6 describes some open research directions motivated by our initial findings.
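To make the setup of the previous paragraphs concrete, the following is a minimal sketch (in PyTorch, with names of our own choosing and not taken from the released code) of the supervised scheme described above: a parametric model Φ(x; θ) is fit to previously solved instances (x_i, y_i) by stochastic gradient descent on the empirical loss.

```python
# Minimal sketch (assumed names) of the supervised setup of Section 1:
# a model Phi(x; theta) is trained on previously solved instances (x_i, y_i)
# by stochastic gradient descent on the empirical loss L(theta).
import torch

def train(model, instances, epochs=10, lr=1e-3):
    """instances: list of (x, y) pairs, y being the planted solution for x."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()      # one possible choice of "ell"
    for _ in range(epochs):
        for x, y in instances:
            opt.zero_grad()
            y_hat = model(x)                   # y_hat = Phi(x; theta)
            loss = loss_fn(y_hat, y)           # ell(y, Phi(x; theta))
            loss.backward()
            opt.step()                         # stochastic gradient step
    return model
```

In the experiments of Section 5, Φ is the siamese GNN encoder of Section 3, ℓ is a cross-entropy loss over the planted permutation, and the optimizer used is Adamax rather than plain SGD.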

2. QUADRATIC ASSIGNMENT PROBLEM

Quadratic assignment is a classical problem in combinatorial optimization. For A, B n × n symmetric matrices it can be expressed as

maximize trace(AXBX⊤), subject to X ∈ Π,    (1)

where Π is the set of permutation matrices of size n × n. Many combinatorial optimization problems can be formulated as quadratic assignment. For instance, the network alignment problem consists of, given A and B the adjacency matrices of two networks, finding the best matching between them, i.e.:

minimize ‖AX − XB‖²_F, subject to X ∈ Π.    (2)

By expanding the square in (2) one can obtain an equivalent optimization of the form (1). The value of (2) is 0 if and only if the graphs A and B are isomorphic. The minimum bisection problem can also be formulated as a QAP. This problem asks, given a graph B, to partition it into two equal-sized subsets such that the number of edges across partitions is minimized. This problem, which is natural to consider in community detection, can be expressed as finding the best matching between A, a graph with two equal-sized disconnected cliques, and B.

The quadratic assignment problem is known to be NP-hard and also hard to approximate [24]. Several algorithms and heuristics have been proposed to address the QAP, with different levels of success depending on properties of A and B [11, 19, 23]. We refer the reader to [9] for a recent review of different methods and a numerical comparison. According to the experiments performed in [9], the most accurate algorithm for recovering the best alignment between two networks, in the distributions of problem instances considered below, is a semidefinite programming relaxation (SDP) first proposed in [33]. However, such relaxation requires lifting the variable X to an n² × n² matrix and solving an SDP that becomes practically intractable for n > 20. Recent work [10] has further relaxed the semidefinite formulation to reduce the complexity by a factor of n, and proposed an augmented Lagrangian alternative to the SDP which is significantly faster but not as accurate, and consists of an optimization algorithm with O(n³) variables.

There are known examples where the SDP is not able to prove that two non-isomorphic graphs are actually not isomorphic (i.e. the SDP produces pseudo-solutions that achieve the same objective value as an isomorphism but that do not correspond to permutations [21, 32]). Such adverse examples consist of highly regular graphs whose spectra have repeated eigenvalues, the so-called unfriendly graphs [2]. We find the QAP to be a good case study for our investigations for two reasons. It is a problem that is known to be NP-hard, but for which there are natural statistical models of inputs, such as models where one of the graphs is a relabelled small random perturbation of the other, on which the problem is believed to be tractable. On the other hand, producing algorithms capable of achieving this task for large perturbations appears to be difficult. It is worth noting that, for statistical models of this sort, when seen as inverse problems, the regimes on which the problem of recovering the original labeling is possible, impossible, or possible but potentially computationally hard are not fully understood.
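As a quick illustration of the equivalence between (1) and (2), the following NumPy sketch brute-forces a tiny instance and checks the identity ‖AX − XB‖²_F = ‖A‖²_F + ‖B‖²_F − 2 trace(AXBX⊤), which holds for any permutation matrix X and symmetric A, B. The random instance and the exhaustive loop are purely for illustration; this is not how instances are handled in this note.

```python
# Illustrative NumPy check (not the paper's code): minimizing (2) over
# permutations is equivalent to maximizing (1), since for orthogonal X and
# symmetric A, B: ||AX - XB||_F^2 = ||A||_F^2 + ||B||_F^2 - 2 trace(A X B X^T).
import itertools
import numpy as np

def qap_objective(A, X, B):
    return np.trace(A @ X @ B @ X.T)                       # objective of (1)

def alignment_objective(A, X, B):
    return np.linalg.norm(A @ X - X @ B, "fro") ** 2       # objective of (2)

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2                  # tiny symmetric instances
B = rng.random((n, n)); B = (B + B.T) / 2
const = np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2
for perm in itertools.permutations(range(n)):              # brute force: toy n only
    X = np.eye(n)[list(perm)]                              # permutation matrix
    assert np.isclose(alignment_objective(A, X, B),
                      const - 2 * qap_objective(A, X, B))
```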
3. GRAPH NEURAL NETWORKS

The Graph Neural Network (GNN), introduced in [27] and further simplified in [8, 18, 28], is a neural network architecture based on local operators of a graph G = (V, E), offering a powerful balance between expressivity and sample complexity; see [5] for a recent survey on models and applications of deep learning on graphs.

Given an input signal F ∈ R^{V×d} on the vertices of G, we consider graph intrinsic linear operators that act locally on this signal. The adjacency operator is the map A : F ↦ A(F), where (AF)_i := Σ_{j∼i} F_j, with i ∼ j iff (i, j) ∈ E. The degree operator is the diagonal linear map D(F) = diag(A1)F. Similarly, 2^J-th powers of A, A_J = min(1, A^{2^J}), encode 2^J-hop neighborhoods of each node and allow us to aggregate local information at different scales, which is useful in regular graphs. We also include the average operator (U(F))_i = |V|^{-1} Σ_j F_j, which broadcasts information globally at each layer, thus giving the GNN the ability to recover average degrees, or more generally moments of local graph properties. Denoting by A = {1, D, A, A_1, ..., A_J, U} the generator family, a GNN layer receives as input a signal x^{(k)} ∈ R^{V×d_k} and produces x^{(k+1)} ∈ R^{V×d_{k+1}} as

x^{(k+1)} = ρ( Σ_{B ∈ A} B x^{(k)} θ_B^{(k)} ),    (3)

where Θ = {θ_1^{(k)}, ..., θ_{|A|}^{(k)}}_k, θ_B^{(k)} ∈ R^{d_k × d_{k+1}}, are trainable parameters and ρ(·) is a pointwise nonlinearity, chosen in this work to be ρ(z) = max(0, z) for the first d_{k+1}/2 output coordinates and ρ(z) = z for the rest. We thus consider a layer with concatenated "residual connections" [13], both to ease the optimization when using a large number of layers and to give the model the ability to perform power iterations. Since the spectral radius of the learned linear operators in (3) can grow as the optimization progresses, the cascade of GNN layers can become unstable to training. In order to mitigate this effect, we use spatial batch normalization [14] at each layer. The network depth is chosen to be of the order of the graph diameter, so that all nodes obtain information from the entire graph. In sparse graphs with small diameter, this architecture offers excellent scalability and computational complexity.

Cascading layers of the form (3) gives us the ability to approximate a broad family of graph inference algorithms, including some forms of spectral estimation. Indeed, power iterations are recovered by bypassing the nonlinear components and sharing the parameters across the layers. Some authors have observed [12] that GNNs are akin to message passing algorithms, although the formal connection has not been established. GNNs also offer natural generalizations to process high-order interactions, for instance using graph hierarchies such as line graphs [6] or using tensorized representations of the permutation group [17], but these are out of the scope of this note.

The choice of graph generators encodes prior information on the nature of the estimation task. For instance, in the community detection task, the choice of generators is motivated by a model from Statistical Physics, the Bethe free energy [6, 26]. In the QAP, one needs generators that are able to detect distinctive and stable local structures. Multiplicity in the spectrum of the graph adjacency operator is related to the (in)effectiveness of certain relaxations [2, 19] (the so-called (un)friendly graphs), suggesting that generator families A that contain non-commutative operators may be more robust on such examples.
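A minimal NumPy sketch of one layer of the form (3) is given below, with the generator family {1, D, A, A_1, ..., A_J, U} built explicitly as dense matrices. It is only meant to make the dimensions and the half-ReLU / half-identity nonlinearity concrete; names, initialization, and the iterated min used for A_j are our own choices, and the trained model described in Section 5 additionally uses batch normalization and sparse operators.

```python
# Minimal NumPy sketch of one GNN layer of the form (3); dense matrices and
# assumed names, purely for illustration of the generator family and shapes.
import numpy as np

def generator_family(A, J):
    """Return dense matrices for {1, D, A, A_1, ..., A_J, U}."""
    n = A.shape[0]
    ops = [np.eye(n),                       # identity
           np.diag(A @ np.ones(n)),         # degree operator D = diag(A 1)
           A]                               # adjacency operator
    P = A.copy()
    for _ in range(J):
        P = np.minimum(1.0, P @ P)          # stand-in for A_j = min(1, A^(2^j))
        ops.append(P)
    ops.append(np.full((n, n), 1.0 / n))    # average operator U
    return ops

def gnn_layer(x, A, thetas, J=2):
    """x: (n, d_k) signal; thetas: one (d_k, d_{k+1}) matrix per generator."""
    z = sum(B @ x @ th for B, th in zip(generator_family(A, J), thetas))
    half = z.shape[1] // 2
    z[:, :half] = np.maximum(z[:, :half], 0.0)   # ReLU on half the channels,
    return z                                     # identity on the remaining half

# usage sketch: thetas = [rng.normal(size=(d_in, d_out)) for _ in range(J + 4)]
```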
4. LANDSCAPE OF OPTIMIZATION

In this section we sketch a research direction to study the landscape of the optimization problem by studying a simplified setup. Let us assume that A is the n × n adjacency matrix of a random weighted (symmetric) graph from some distribution D_n, and let B = ΠAΠ^{-1} + ν, where Π is a permutation matrix and ν represents a symmetric noise matrix.

For simplicity, let us assume that the optimization problem consists in finding a polynomial operator of degree d. If A ∈ R^{n×n} denotes the adjacency matrix of a graph, then its embedding E_A ∈ R^{n×k} is defined as

E_A = P_β(A) Y,  where  P_{β^{(t)}}(A) = Σ_{j=0}^{d} β_j^{(t)} A^j,  t = 1, ..., k,

and P_β(A) Y = (P_{β^{(1)}}(A) y_1, ..., P_{β^{(k)}}(A) y_k) ∈ R^{n×k}, where Y is a random n × k matrix with independent Gaussian entries of variance σ² and y_t is the t-th column of Y. Each column of Y is thought of as a random initialization vector for an iteration resembling a power method. This model is a simplified version of (3) where ρ is the identity and A = {A}. Note that P_β(ΠAΠ^{-1}) = Π P_β(A) Π^{-1}, so it suffices to analyze the case Π = I.

We consider the following loss

L(β) = −E_{A,B,Y} [ ⟨P_β(A)Y, P_β(B)Y⟩ / (‖P_β(A)Y‖² + ‖P_β(B)Y‖²) ].

Since A, B are symmetric, we can consider e_1, ..., e_n an eigenbasis for A with respective eigenvalues λ_1, ..., λ_n, and an eigenbasis ẽ_1, ..., ẽ_n for B, with eigenvalues λ̃_1, ..., λ̃_n. Decompose y^{(t)} = Σ_i y_i^{(t)} e_i = Σ_{i'} ỹ_{i'}^{(t)} ẽ_{i'}. Observe that

⟨P_β(A)Y, P_β(B)Y⟩ = Σ_{t=1}^{k} β^{(t)⊤} Q(A,B)^{(t)} β^{(t)},

where

Q(A,B)^{(t)}_{rs} = ⟨A^r y^{(t)}, B^s y^{(t)}⟩ = ⟨ Σ_i λ_i^r y_i^{(t)} e_i , Σ_{i'} λ̃_{i'}^s ỹ_{i'}^{(t)} ẽ_{i'} ⟩ = Σ_{i,i'} λ_i^r λ̃_{i'}^s y_i^{(t)} ỹ_{i'}^{(t)} ⟨e_i, ẽ_{i'}⟩.

Note that we can also consider a symmetric version of Q that leads to the same quadratic form. We observe that

L(β) = −E_{A,B,Y} [ β⊤ Q(A,B) β / (β⊤ (Q(A,A) + Q(B,B)) β) ].    (4)

For fixed A, B (let us not consider the expected value for a moment), the quotient of two quadratic forms β⊤Rβ / β⊤Sβ (where S is positive definite with Cholesky decomposition S = CC⊤) is maximized by the top eigenvector of C^{-1} R C^{-⊤}.

Say that A is appropriately normalized (for instance, A = n^{-3/2} W with W a Wigner matrix, i.e. a random symmetric matrix with i.i.d. normalized Gaussian entries). When n tends to infinity, the denominator of (4) rapidly concentrates around its expected value

E_Y Q(A,A)_{rs} = Σ_i λ_i^{r+s} σ²  →  (σ²/2π) ∫_{−2}^{+2} λ^{r+s} (4 − |λ|²)^{1/2} dλ,

where the convergence is almost sure, according to the semicircle law (see for instance [29]).

If A ∼ D_n for large enough n, due to concentration of measure, with high probability we have

β⊤ E_Y Q(A,A) β (1 − ε) ≤ β⊤ Q(A,A) β ≤ β⊤ E_Y Q(A,A) β (1 + ε),

and similarly for Q(B,B) (assuming that the noise level in B is small enough to allow concentration). Intuitively, we expect that the denominator concentrates faster than the numerator due to the latter's dependency on the inner products ⟨e_i, ẽ_{i'}⟩; a formal argument is needed to make this statement precise. With this in mind we have

−(1 − ε)^{-1} [ β⊤ {E_{A,B,Y} Q(A,B)} β / (β⊤ (E_Y Q(A,A) + E_Y Q(B,B)) β) ] ≤ L(β) ≤ −(1 + ε)^{-1} [ β⊤ {E_{A,B,Y} Q(A,B)} β / (β⊤ (E_Y Q(A,A) + E_Y Q(B,B)) β) ].

If ε = 0, since the denominator is fixed, one can use the Cholesky decomposition of S = E_Y Q(A,A) + E_Y Q(B,B) to find the solution, similarly to the case where A and B are fixed. In this case, the critical points are in fact the eigenvectors of Q(A,B), and one can show that L(β) has no poor local minima, since it amounts to optimizing a quadratic form on the sphere. When ε > 0, one could expect that the concentration of the denominator around its mean somewhat controls the presence of poor local minima of L(β). An initial approach is to use this concentration to bound the distance between the critical points of L(β) and those of the 'mean field' equivalent

g(β) = − β⊤ E_{A,B,Y} Q(A,B) β / (β⊤ E_{A,B,Y} (Q(A,A) + Q(B,B)) β).

If we denote S̃ = Q(A,A) + Q(B,B), we have

‖∇_β L(β) − ∇_β g‖ ≤ (1/2) |(β⊤ S̃ β)^{-1} − (β⊤ S β)^{-1}| ‖E_{A,B,Y} Q(A,B) β‖ + (1/2) |L(β)| ‖(1 + ε) S̃β/(β⊤ S̃ β) − Sβ/(β⊤ S β)‖.

A more rigorous analysis shows that both terms in the upper bound tend to zero as n grows, due to the fast concentration of S̃ around its mean S. Another possibility is to rely instead on topological tools such as those developed in [31] to control the presence of energy barriers as a function of ε.
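For readers who prefer to experiment, the following NumPy sketch instantiates the objects above for fixed A, B and a single column y of Y: it builds Q(A,B)_{rs} = ⟨A^r y, B^s y⟩, evaluates the quotient in (4) without the expectation, and recovers the maximizing β through the Cholesky/eigenvector characterization stated above. Function names are ours and the sketch is purely an illustrative aid under these simplifications, not code from the paper.

```python
# Illustrative NumPy sketch of the simplified landscape model for fixed A, B
# and a single initialization vector y (assumed names, d = polynomial degree).
import numpy as np

def Q(A, B, y, d):
    Ay, By = [y], [y]
    for _ in range(d):                       # A^r y and B^s y for r, s = 0..d
        Ay.append(A @ Ay[-1])
        By.append(B @ By[-1])
    return np.array([[Ay[r] @ By[s] for s in range(d + 1)] for r in range(d + 1)])

def loss(beta, A, B, y, d):
    num = beta @ Q(A, B, y, d) @ beta        # <P_beta(A)y, P_beta(B)y>
    den = beta @ (Q(A, A, y, d) + Q(B, B, y, d)) @ beta
    return -num / den

def best_beta(A, B, y, d):
    R = Q(A, B, y, d); R = (R + R.T) / 2     # symmetric version of Q(A, B)
    S = Q(A, A, y, d) + Q(B, B, y, d)        # denominator (positive definite)
    C = np.linalg.cholesky(S)                # S = C C^T
    Ci = np.linalg.inv(C)
    _, V = np.linalg.eigh(Ci @ R @ Ci.T)     # top eigenvector of C^{-1} R C^{-T}
    return np.linalg.solve(C.T, V[:, -1])    # map back: beta = C^{-T} v_top
```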
5. NUMERICAL EXPERIMENTS

We consider the GNN and train it to solve random planted problem instances of the QAP. Given a pair of graphs G1, G2 with n nodes each, we consider a siamese GNN encoder producing normalized embeddings E1, E2 ∈ R^{n×d}. Those embeddings are used to predict a matching as follows. We first compute the outer product E1 E2⊤, which we then map to a stochastic matrix by taking the softmax along each row/column. Finally, we use a standard cross-entropy loss to predict the corresponding permutation index. We perform experiments with the proposed data-driven model on the graph matching problem¹. Models are trained using Adamax [16] with lr = 10^{-3} and batches of size 32. We note that the complexity of this algorithm is at most O(n²).

¹Code available at https://github.com/alexnowakvila/QAP_pt
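The matching head just described admits a very small sketch; the PyTorch version below (shapes and names are of our own choosing, not those of the released code) computes the outer-product similarities, applies the row-wise softmax implicitly through the cross-entropy loss, and reads off a greedy row-wise matching at test time.

```python
# Hedged sketch of the matching head: outer product of the siamese embeddings,
# softmax along each row via cross-entropy, greedy matching at prediction time.
import torch
import torch.nn.functional as F

def matching_loss(E1, E2, perm):
    """E1, E2: (n, d) normalized embeddings; perm[i] = node of G2 matched to i."""
    scores = E1 @ E2.t()                     # (n, n) outer product E1 E2^T
    return F.cross_entropy(scores, perm)     # row-wise softmax + negative log-lik.

def predict_matching(E1, E2):
    scores = E1 @ E2.t()
    return torch.softmax(scores, dim=1).argmax(dim=1)
```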


Fig. 1: Comparison of recovery rates for the SDP [25], LowRankAlign (k = 4) [9] and our data-driven GNN, for the Erdos-Renyi model (left) and random regular graphs (right). All graphs have n = 50 nodes and edge density p = 0.2. The recovery rate is measured as the average fraction of matched nodes from the ground truth. Experiments were repeated 100 times for every noise level, except for the SDP, which was repeated 5 times due to its high computational cost.

5.1. Matching Erdos-Renyi Graphs

In this experiment we consider G1 to be a random Erdos-Renyi graph with edge density p. The graph G2 is a small perturbation of G1 according to the following error model, considered in [9]:

G2 = G1 ⊙ (1 − Q) + (1 − G1) ⊙ Q',    (5)

where Q and Q' are binary random matrices whose entries are drawn from i.i.d. Bernoulli distributions such that P(Q_ij = 1) = p_e and P(Q'_ij = 1) = p_{e2}, with p_{e2} = p_e · p / (1 − p). The choice of p_{e2} guarantees that the expected degrees of G1 and G2 are the same. We train a GNN with 20 layers and 20 feature maps per layer on a data set of 20k examples. We fix the input embeddings to be the degree of the corresponding node. In Figure 1 we report its performance in comparison with the SDP from [25] and the LowRankAlign method from [9].

5.2. Matching Random Regular Graphs

Regular graphs are an interesting example because they tend to be considered harder to align due to their more symmetric structure. Following the same experimental setup as in [9], G1 is a random regular graph generated using the method from [15], and G2 is a perturbation of G1 according to the noise model (5). Although G2 is in general not a regular graph, the "signal" to be matched to, G1, is a regular graph. Figure 1 shows that in this case the GNN is able to extract stable and distinctive features, outperforming the non-trainable alternatives. We used the same architecture as in 5.1, but now, due to the constant node degree, the embeddings are initialized with the 2-hop degree.
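The perturbation model (5) and the degree / 2-hop-degree input features of Sections 5.1 and 5.2 can be generated with a few lines of NumPy; the sketch below is illustrative (parameter names and the 2-hop convention are our own choices) and is not the released code.

```python
# Illustrative sketch of the noise model (5) and of the input node features.
import numpy as np

def erdos_renyi(n, p, rng):
    U = np.triu(rng.random((n, n)) < p, 1).astype(float)
    return U + U.T                                        # symmetric, no self-loops

def perturb(G1, p_e, p, rng):
    """G2 = G1*(1-Q) + (1-G1)*Q', with P(Q=1)=p_e and P(Q'=1)=p_e*p/(1-p)."""
    n = G1.shape[0]
    Q  = np.triu(rng.random((n, n)) < p_e, 1).astype(float)
    Qp = np.triu(rng.random((n, n)) < p_e * p / (1.0 - p), 1).astype(float)
    Q, Qp = Q + Q.T, Qp + Qp.T
    return G1 * (1.0 - Q) + (1.0 - G1) * Qp

def degree_features(G, two_hop=False):
    R = G if not two_hop else np.minimum(1.0, G + G @ G)  # nodes within 1 or 2 hops
    return R.sum(axis=1, keepdims=True)                   # (n, 1) input embedding

rng = np.random.default_rng(0)
G1 = erdos_renyi(50, p=0.2, rng=rng)
G2 = perturb(G1, p_e=0.03, p=0.2, rng=rng)
x1, x2 = degree_features(G1), degree_features(G2)
```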
6. DISCUSSION

Problems are often labeled to be as hard as their hardest instance. However, many hard problems can be efficiently solved for a large subset of inputs. This note attempts to learn an algorithm for the QAP from solved problem instances drawn from a distribution of inputs. The algorithm's effectiveness is evaluated by investigating how well it works on new inputs from the same distribution. This can be seen as a general approach, not restricted to the QAP. In fact, another notable example is the community detection problem under the Stochastic Block Model (SBM) [6]. That problem is another particularly good case study because there exist very precise predictions for the regimes where the recovery problem is (i) impossible, (ii) possible and efficiently solvable, or (iii) believed that, even though possible, it may not be solvable in polynomial time.

If one believes that a problem is computationally hard for most instances in a certain regime, then this would mean that no choice of parameters for the GNN could give a good algorithm. However, even when there exist efficient algorithms to solve the problem, it does not necessarily mean that an algorithm expressible by a GNN will exist. On top of all of this, even if such an algorithm exists, it is not clear whether it can be learned with Stochastic Gradient Descent on a loss function that simply compares with known solved instances. However, experiments in [6] suggest that GNNs are capable of learning algorithms for community detection under the SBM essentially up to optimal thresholds, when the number of communities is small. Experiments for larger numbers of communities show that GNN models are currently unable to outperform Belief Propagation, a baseline tractable estimation that achieves optimal detection up to the computational threshold. Whereas this preliminary result is inconclusive, it may guide future research attempting to elucidate whether the limiting factor is indeed a gap between statistical and computational thresholds, or between learnable and computational thresholds.

The performance of these algorithms depends on which operators are used in the GNN. Adjacency matrices and Laplacians are natural choices for the types of problems we considered, but different problems may require different sets of operators. A natural question is to find a principled way of choosing the operators, possibly querying graph hierarchies. Going back to the QAP, it would be interesting to understand the limits of this problem, both statistically [22] and computationally. In particular, the authors would like to better understand the limits of the GNN approach and, more generally, of any approach that first embeds the graphs and then does linear assignment.

In general, understanding whether the regimes for which GNNs produce meaningful algorithms match believed computational thresholds for some classes of problems is, in our opinion, a thrilling research direction. It is worth noting that this approach has the advantage that the algorithms are learned automatically. However, they may not generalize, in the sense that if the GNN is trained with examples below a certain input size, it is not clear that it will be able to interpret much larger inputs, which may need larger networks. This question requires non-asymptotic deviation bounds, which are for example well understood on problems such as the 'planted clique'.

7. REFERENCES

[1] Emmanuel Abbe. Community detection and stochastic block models: recent developments. arXiv preprint arXiv:1703.10146, 2017.
[2] Yonathan Aflalo, Alex Bronstein, and Ron Kimmel. Graph matching: relax or not? arXiv preprint arXiv:1401.7623, 2014.
[3] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 2454–2458. IEEE, 2008.
[4] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. Conference on Learning Theory (COLT), 2013.
[5] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. CoRR, abs/1611.08097, 2016.
[6] Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[7] Venkat Chandrasekaran and Michael I Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110(13):E1181–E1190, 2013.
[8] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[9] Soheil Feizi, Gerald Quon, Mariana Recamonde-Mendoza, Muriel Médard, Manolis Kellis, and Ali Jadbabaie. Spectral alignment of networks. arXiv preprint arXiv:1602.04181, 2016.
[10] Jose FS Ferreira, Yuehaw Khoo, and Amit Singer. Semidefinite programming approach for the quadratic assignment problem with a sparse graph. arXiv preprint arXiv:1703.09339, 2017.
[11] Marcelo Fiori and Guillermo Sapiro. On spectral properties for graph matching and graph isomorphism problems. Information and Inference: A Journal of the IMA, 4(1):63–76, 2015.
[12] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Jeong Han Kim and Van H Vu. Generating random regular graphs. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 213–222. ACM, 2003.
[16] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.
[18] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[19] Vince Lyzinski, Donniell E Fishkind, Marcelo Fiori, Joshua T Vogelstein, Carey E Priebe, and Guillermo Sapiro. Graph matching: Relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):60–73, 2016.
[20] Alex Nowak, Soledad Villar, Afonso S. Bandeira, and Joan Bruna. A note on learning algorithms for quadratic assignment with graph neural networks. ICML 2017 Workshop on Principled Approaches to Deep Learning (PADL), 2017.
[21] Ryan O'Donnell, John Wright, Chenggang Wu, and Yuan Zhou. Hardness of robust graph isomorphism, Lasserre gaps, and asymmetry of random graphs. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1659–1677. SIAM, 2014.
[22] Efe Onaran, Siddharth Garg, and Elza Erkip. Optimal de-anonymization in random graphs with community structure. arXiv preprint arXiv:1602.01409, 2016.
[23] Efe Onaran and Soledad Villar. Projected power iteration for network alignment. In Wavelets and Sparsity XVII, volume 10394, page 103941C. International Society for Optics and Photonics, 2017.
[24] Panos M Pardalos, Henry Wolkowicz, et al. Quadratic Assignment and Related Problems: DIMACS Workshop, May 20-21, 1993, volume 16. American Mathematical Soc., 1994.
[25] Jiming Peng, Hans Mittelmann, and Xiaoxue Li. A new relaxation framework for quadratic assignment problems based on matrix splitting. Mathematical Programming Computation, 2(1):59–77, 2010.
[26] Alaa Saade, Florent Krzakala, and Lenka Zdeborová. Spectral clustering of graphs with the Bethe Hessian. In Advances in Neural Information Processing Systems, pages 406–414, 2014.
[27] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[28] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
[29] Terence Tao. 254A Notes 4: The semicircular law. Terence Tao blog: https://terrytao.wordpress.com.
[30] Shinji Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(5):695–703, 1988.
[31] Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
[32] Soledad Villar, Afonso S Bandeira, Andrew J Blumberg, and Rachel Ward. A polynomial-time relaxation of the Gromov-Hausdorff distance. arXiv preprint arXiv:1610.05214, 2016.
[33] Qing Zhao, Stefan E Karisch, Franz Rendl, and Henry Wolkowicz. Semidefinite programming relaxations for the quadratic assignment problem. Journal of Combinatorial Optimization, 2(1):71–109, 1998.