A Note on Over-Smoothing for Graph Neural Networks

Chen Cai 1 Yusu Wang 1

Abstract

Graph Neural Networks (GNNs) have achieved a lot of success on graph-structured data. However, it is observed that the performance of graph neural networks does not improve as the number of layers increases. This effect, known as over-smoothing[1], has been analyzed mostly in linear cases. In this paper, we build upon previous results (Oono & Suzuki, 2019) to further analyze the over-smoothing effect in the general graph neural network architecture. We show that when the weight matrix satisfies the conditions determined by the spectrum of the augmented normalized Laplacian, the Dirichlet energy of embeddings will converge to zero, resulting in the loss of discriminative power. Using Dirichlet energy to measure the "expressiveness" of embeddings is conceptually clean; it leads to simpler proofs than (Oono & Suzuki, 2019) and can handle more non-linearities.

[1] Strictly speaking, over-smoothing is a misnomer. As we will show, what is decreasing is tr(X^T ∆̃ X), not the real smoothness tr(X^T ∆̃ X) / ||X||_2^2 of the graph signal X.

1. Introduction

Graph neural networks (GNNs) are a family of neural networks that can learn from graph-structured data. Starting with the success of the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) in achieving state-of-the-art performance on semi-supervised classification, several variants of GNNs have been developed for this task, including GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2017), SGC (Wu et al., 2019), CGCNN (Xie & Grossman, 2018) and GMNN (Qu et al., 2019), to name a few of the most recent ones. See (Gurukar et al., 2019; Wu et al., 2020; Zhou et al., 2018) for surveys.

However, a key issue with GNNs is their depth limitation. It has been observed that deeply stacking the layers often results in significantly worse performance for GNNs such as GCN and GAT. This drop is associated with many factors, including vanishing gradients during back-propagation, overfitting due to the increasing number of parameters, as well as the phenomenon called over-smoothing. (Li et al., 2018) was the first to call attention to the over-smoothing problem. Having shown that graph convolution is a type of Laplacian smoothing, they proved that after repeatedly applying Laplacian smoothing many times, the features of the nodes in a (connected) graph converge to similar values. Later, several others have alluded to the same problem (Li et al., 2019; Luan et al., 2019; Zhao & Akoglu, 2019).

The goal of this paper is to extend some of the analysis of GNNs in the ICLR 2020 spotlight paper (Oono & Suzuki, 2019) on the expressive power of GNNs for node classification. To the best of our knowledge, (Oono & Suzuki, 2019) is the first paper extending the analysis of over-smoothing from linear GNNs to nonlinear ones. However, only ReLU is handled. It is noted by the authors that extension to other non-linearities such as Sigmoid and Leaky ReLU is far from trivial.

In this paper, we propose a simple technique to analyze the embedding when the number of layers goes to infinity. The analysis is based on tracking the Dirichlet energy of node embeddings across layers. Our contributions are the following:

• Using Dirichlet energy to measure the expressiveness of embeddings is conceptually clean. Besides being able to recover the results in (Oono & Suzuki, 2019), our analysis can be easily applied to Leaky ReLU. In the special case of regular graphs, our proof can be extended to the most common nonlinearities. The proof is easy to follow and requires only elementary linear algebra. We discuss the key differences between our proof and the proofs in (Oono & Suzuki, 2019), as well as the benefits of introducing Dirichlet energy, in Section 4.

• Second, we perform extensive experiments on a variety of graphs to study the effect of basic edge operations on the Dirichlet energy. We find that in many cases dropping edges and increasing the weights of edges (to a high value) can increase the Dirichlet energy.

1 Department of Computer Science, Ohio State University, Ohio, USA. Correspondence to: Chen Cai.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

2. Notation

Let N_+ be the set of positive integers. We define A ∈ R^{N×N} to be the adjacency matrix and D to be the degree matrix of the graph G. Let Ã := A + I_N, D̃ := D + I_N be the adjacency and degree matrices of the graph G augmented with self-loops. We define the augmented normalized Laplacian of G by ∆̃ := I_N − D̃^{−1/2} Ã D̃^{−1/2} and set P := I_N − ∆̃ = D̃^{−1/2} Ã D̃^{−1/2}. Let L, C ∈ N_+ be the layer and channel sizes. We define a GCN associated with G by f = f_L ∘ ··· ∘ f_1, where f_l : R^{N×C_l} → R^{N×C_{l+1}} is defined by f_l(X) = MLP_l(PX). Here MLP_l(X) := σ(···σ(σ(X)W_{l1})W_{l2}···W_{lH_l}), where σ is an element-wise nonlinear function. Note that the weight matrices W_{l·} are not necessarily square. We consider the embeddings X^{(l+1)} := f_l(X^{(l)}) with initial value X^{(0)}. We are interested in the asymptotic behavior of the output X^{(L)} of the GCN as L → ∞.

We state the following lemma without proof.

Lemma 2.1. The eigenvalues of ∆̃ lie in [0, 2). The eigenvalues of P = I_N − ∆̃ lie in (−1, 1].

3. Main Result

The main idea of the proof is to track the Dirichlet energy of node embeddings w.r.t. the (augmented) normalized Laplacian at different layers. With some assumptions on the weight matrices of the GCN, we can prove that the Dirichlet energy decreases exponentially with respect to the number of layers. Intuitively, the Dirichlet energy of a function measures the "smoothness" of a function of unit norm, and eigenvectors of the normalized Laplacian are minimizers of the Dirichlet energy.

Definition 3.1. The Dirichlet energy E(f) of a scalar function f ∈ R^{N×1} on the graph G is defined as

E(f) = f^T ∆̃ f = (1/2) Σ_{i,j} w_{ij} (f_i/√(1+d_i) − f_j/√(1+d_j))^2.

For a vector field X_{N×c} = [x_1, ..., x_N]^T, x_i ∈ R^{1×c}, the Dirichlet energy is defined as

E(X) = tr(X^T ∆̃ X) = (1/2) Σ_{i,j} w_{ij} ||x_i/√(1+d_i) − x_j/√(1+d_j)||_2^2.

Without loss of generality, each layer of the GCN can be represented as f_l(X) = σ(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l}), with σ applied H_l times. Next we analyze the effects of P, W_l, and σ on the Dirichlet energy one by one.

Lemma 3.1. E(PX) ≤ (1 − λ)^2 E(X), where λ is the smallest non-zero eigenvalue of ∆̃.

Proof. Let us denote the eigenvalues of ∆̃ by λ_1, λ_2, ..., λ_N, and the associated eigenvectors of length 1 by v_1, ..., v_N. Suppose f = Σ_i c_i v_i where c_i ∈ R. Then

E(f) = f^T ∆̃ f = f^T Σ_i c_i λ_i v_i = Σ_i c_i^2 λ_i.    (1)

Therefore,

E(Pf) = f^T (I_N − ∆̃) ∆̃ (I_N − ∆̃) f = Σ_i c_i^2 λ_i (1 − λ_i)^2 ≤ (1 − λ)^2 E(f).    (2)

Extending the above argument from the scalar field to the vector field finishes the proof of E(PX) ≤ (1 − λ)^2 E(X).

Lemma 3.2. E(XW) ≤ ||W^T||_2^2 E(X).

Proof. By definition,

E(XW) = (1/2) Σ_{(i,j)∈E} w_{ij} ||x_i W/√(1+d_i) − x_j W/√(1+d_j)||_2^2,

where X_{n×c} = [x_1, ..., x_n]^T, x_i ∈ R^{1×c}, W ∈ R^{c×c'}. Since for each term

||x_i W/√(1+d_i) − x_j W/√(1+d_j)||_2^2 ≤ ||x_i/√(1+d_i) − x_j/√(1+d_j)||_2^2 ||W^T||_2^2,    (3)

we get

E(XW) ≤ E(X) ||W^T||_2^2.    (4)

Remark: Since ||A||_2 = σ_max(A), where σ_max(A) denotes the largest singular value of the matrix A, our result in Lemma 3.2 is essentially the same as Lemma 2 of (Oono & Suzuki, 2019)[2]. Note that our proof can handle weight matrices not only of dimension d × d but also of dimension d × d', while (Oono & Suzuki, 2019) assumes the embedding dimension to be fixed across layers. See the detailed discussion in Section 4.

[2] For any X ∈ R^{N×C}, we have d_M(XW_{lh}) ≤ s_{lh} d_M(X), where s_{lh} is the maximum singular value of W_{lh}.

Remark: The proof itself doesn't leverage the structure of the graph. In particular, only the fact that the Laplacian is a p.s.d. matrix is needed in the proof. See an alternative proof in the appendix. This also makes sense because W operates on the graph feature space and should be oblivious to the particular graph structure.
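To make these quantities concrete, the following is a minimal numerical sketch (ours, not the authors' released code), assuming networkx and numpy: it builds ∆̃ and P for a small graph, computes the Dirichlet energy, and prints both sides of the bounds in Lemmas 3.1 and 3.2 for random inputs.

```python
# Minimal numerical sketch (ours, not the authors' code), assuming networkx/numpy.
import numpy as np
import networkx as nx

def augmented_operators(G):
    """Return the augmented normalized Laplacian Delta~ and P = I - Delta~."""
    A = nx.to_numpy_array(G)
    N = A.shape[0]
    A_tilde = A + np.eye(N)                        # A~ = A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    P = D_inv_sqrt @ A_tilde @ D_inv_sqrt          # P = D~^{-1/2} A~ D~^{-1/2}
    return np.eye(N) - P, P                        # Delta~ = I - P

def dirichlet_energy(X, delta):
    """E(X) = tr(X^T Delta~ X)."""
    return np.trace(X.T @ delta @ X)

G = nx.erdos_renyi_graph(200, 0.05, seed=0)
delta, P = augmented_operators(G)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
W = rng.standard_normal((16, 8))

eigs = np.linalg.eigvalsh(delta)                   # Lemma 2.1: eigenvalues lie in [0, 2)
lam = eigs[eigs > 1e-8].min()                      # smallest non-zero eigenvalue
s = np.linalg.norm(W, 2) ** 2                      # squared largest singular value of W

# Lemma 3.1: E(PX) <= (1 - lambda)^2 E(X);  Lemma 3.2: E(XW) <= ||W||_2^2 E(X)
print(dirichlet_energy(P @ X, delta), (1 - lam) ** 2 * dirichlet_energy(X, delta))
print(dirichlet_energy(X @ W, delta), s * dirichlet_energy(X, delta))
```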

Lemma 3.3. E(σ(X)) ≤ E(X) when σ is ReLU or Leaky-ReLU.

Proof. We first prove that it holds for a scalar field f and then extend it to a vector field X. We have E(f) = Σ_{(i,j)∈E} w_{ij} (f_i/√(1+d_i) − f_j/√(1+d_j))^2, where w_{ij} ≥ 0. For all c_1, c_2 ∈ R_+ and a, b ∈ R,

|c_1 a − c_2 b| ≥ |σ(c_1 a) − σ(c_2 b)| = |c_1 σ(a) − c_2 σ(b)|.    (5)

The first inequality holds for all σ whose Lipschitz constant is no more than 1, including ReLU, Leaky-ReLU, Tanh, Sigmoid, etc. The second equality holds because for ReLU and Leaky-ReLU, σ(cx) = cσ(x) for all c ∈ R_+, x ∈ R. Therefore, by replacing c_1, c_2, a, b with 1/√(1+d_i), 1/√(1+d_j), f_i, f_j, we see that E(σ(f)) ≤ E(f) holds for ReLU and Leaky-ReLU. Extending the above argument to the vector field completes the proof.

Remark: For regular graphs, the above conclusion can be extended to more non-linearities such as ReLU, Leaky-ReLU, Tanh, and Sigmoid.

Remark: The proof hinges on the simple fact that for ReLU and Leaky-ReLU, σ(ca) = σ(c)a, where c ∈ R_+, a ∈ R. For other activation functions, as long as c_1 a = c_2 b and c_1 σ(a) ≠ c_2 σ(b) (it is easy to find examples for Sigmoid, Tanh[3], etc., since there are no strong restrictions on a, b, c_1, c_2[4]), we cannot guarantee E(σ(X)) ≤ E(X).

[3] Sigmoid: Sigmoid(x) = 1/(1 + e^{−x}). Tanh: Tanh(x) = (e^{2x} − 1)/(e^{2x} + 1).
[4] For example, c_1 = 1, a = 2, c_2 = 2, b = 1.

Combining the above three lemmas, denote the square of the maximum singular value of W_{lh} by s_{lh} and set s_l := ∏_{h=1}^{H_l} s_{lh}. Also let λ̄ := (1 − λ)^2. With those parameters, we arrive at the main theorem.

Theorem 3.4. For any l ∈ N_+, we have E(f_l(X)) ≤ s_l λ̄ E(X).

See the proof in Appendix A.

Corollary 3.4.1. Let s := sup_{l∈N_+} s_l. We have E(X^{(l)}) ≤ O((sλ̄)^l). In particular, E(X^{(l)}) converges exponentially to 0 when sλ̄ < 1.

Our result shares great similarity with the paper (Oono & Suzuki, 2019). The bounds are similar, but our result handles more general cases. As noted in (Oono & Suzuki, 2019), the eigengap plays an important role. The analysis of the Erdos-Renyi graph G_{N,p} (or any other graphs that have large eigengaps) when log N/(Np) = o(1) in (Oono & Suzuki, 2019) can also be directly applied to our case.

4. Key Differences

The key quantity that (Oono & Suzuki, 2019) looks at is d_M(X), where M is a subspace of R^{N×C} defined as M := U ⊗ R^C = {Σ_{m=1}^M e_m ⊗ w_m | w_m ∈ R^C}, and (e_m)_{m∈[M]} is an orthonormal basis of the null space U of a normalized graph Laplacian ∆̃. The original definition of d_M(X) is stated for the case of the same embedding dimension across layers. It needs to be modified to handle the case of varying dimensions. One way to achieve this is to define M = U ⊗ R^C and M' = U ⊗ R^{C'}, respectively. Lemma 2 of (Oono & Suzuki, 2019) can then be modified from d_M(XW) ≤ s d_M(X) (W ∈ R^{c×c}) to the following:

d_{M'}(XW) ≤ s d_M(X),    (6)

where s is the maximum singular value of W[5].

[5] Here, with a slight abuse of notation, W ∈ R^{c×c'}.

As for the nonlinearity, (Oono & Suzuki, 2019) mentions that their analysis is limited to graph neural networks with the ReLU activation function because they implicitly use the property that ReLU is a projection onto the cone {X > 0} (see Appendix A, Lemma 3 in (Oono & Suzuki, 2019) for details). This fact enables the ReLU function to get along with the nonnegativity of eigenvectors associated with the largest eigenvalues (Perron-Frobenius theorem). Therefore, the authors mentioned that it may not be easy to extend their results to other activation functions such as the sigmoid function or Leaky ReLU.

In contrast, the proof of Lemma 3.3 becomes trivial once we write out the Dirichlet energy as the sum of multiple terms, for each of which the effect of the nonlinearity can be easily analyzed.

5. Experiments

To investigate how basic edge operations, removing edges and increasing edge weights[6], affect Dirichlet energy and over-smoothing, we perform extensive experiments on both common benchmarks (Cora and CiteSeer) and synthetic graphs. See Appendix B for more details on the datasets.

[6] In this paper, we only consider the case where the edge weight is increased to a very high value (from 1 to 10000 in all experiments).

In particular, given a graph, we compute its eigenvalues before and after randomly dropping/increasing the weights of a certain percentage (10%-90%) of edges. This is shown in the first/third column of each figure. In the second/fourth column, we generate three signals x, Px (P'x), P^2x (P'^2x), where x = Σ_{i=1}^T c_i v_i, the v_i are the eigenvectors corresponding to the lowest T eigenvalues of the normalized Laplacian, and each c_i is a uniform random number between 0 and 1. In other words, x is a mix of low eigenvectors. We then compute the Dirichlet energy of the three signals both for the original graph (E0, E1, E2) and for the graph with edges removed/reweighted (E0', E1', E2').
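Below is a rough sketch (ours, not the released experiment code) of this protocol, assuming networkx and numpy: build x as a mix of the lowest-T eigenvectors, drop a fraction of edges, and compare the Dirichlet energies of x, Px, P^2x on the original and modified graphs. The drop ratio and seeds here are arbitrary choices for illustration.

```python
# Rough sketch (ours, not the released code) of the experiment protocol above.
import numpy as np
import networkx as nx

def operators(G):
    A = nx.to_numpy_array(G)
    N = A.shape[0]
    A_t = A + np.eye(N)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    P = D_inv_sqrt @ A_t @ D_inv_sqrt
    return np.eye(N) - P, P                        # (Delta~, P)

def energy(x, delta):
    return float(x @ delta @ x)

rng = np.random.default_rng(0)
G = nx.random_geometric_graph(200, 0.2, seed=0)
delta, P = operators(G)

T = 20                                             # x: mix of the T lowest eigenvectors
_, vecs = np.linalg.eigh(delta)                    # eigenvalues returned in ascending order
x = vecs[:, :T] @ rng.uniform(0, 1, size=T)

G_drop = G.copy()                                  # randomly drop 30% of the edges
edges = list(G.edges)
idx = rng.choice(len(edges), size=int(0.3 * len(edges)), replace=False)
G_drop.remove_edges_from([edges[i] for i in idx])
delta_d, P_d = operators(G_drop)

for name, (D_, P_) in [("original", (delta, P)), ("dropped", (delta_d, P_d))]:
    print(name, [round(energy(s, D_), 4) for s in (x, P_ @ x, P_ @ P_ @ x)])
```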

[Figure omitted: Random Geometric Graph (200 nodes, 1960 edges); panels for increasing edge weights (top) and dropping edges (bottom) for 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 1. The effects of dropping edges and increasing edge weights on eigenvalues / Dirichlet energy for the Random Geometric graph.

The same experiments are repeated 20 times, and 120 data points are shown in the scatter plot.

We make the following observations.

• First, for nearly all graphs and ratios (except for some cases of Cora and CiteSeer), dropping edges increases the Dirichlet energy of x, Px (P'x), P^2x (P'^2x). This coincides with the observation in DropEdge (Rong et al., 2019) that dropping edges helps relieve over-smoothing.

• Second, in most cases, the effect of increasing the weights of edges (from 1 to 10000) and dropping edges is "dual" to each other, i.e., increasing the weights of a few edges to a very high value is similar to dropping a lot of edges in terms of eigenvalues and Dirichlet energy. Intuitively, we can think of increasing the weight of an edge (u, v) to infinity as contracting the nodes u and v into a supernode. For a planar graph and its dual graph, edge deletion in one graph corresponds to contraction in the other graph and vice versa. We hypothesize that randomly increasing the weights of a few edges to a high value will also help to relieve over-smoothing. We leave the systematic verification as future work.

6. Conclusion

We provide an alternative proof that graph neural networks exponentially lose expressive power. Besides achieving the same bound as (Oono & Suzuki, 2019), our simple proof also handles Leaky ReLU. We also empirically explore the effect of basic edge operations on the Dirichlet energy.

Some interesting future directions are: 1) The key challenge of analyzing the over-smoothing effect lies in the non-linearity. How to extend our strategy to more general graph learning, such as other nonlinearities, normalization strategies (Zhao & Akoglu, 2019), graphs with both node and edge features, and attention mechanisms (Veličković et al., 2017), remains largely open. 2) The assumption on the norm of the weight matrices of GNNs is crucial (and may also be too strong) in our proof. Understanding how learning plays a role in resisting the over-smoothing effect is interesting. 3) Preserving the Dirichlet energy for the combinatorial Laplacian is well studied in the context of graph sparsification. Novel techniques in (Lee & Sun, 2017; Spielman & Srivastava, 2011; Spielman & Teng, 2004; 2011) may be applicable. Also, the Dirichlet energy itself is easy to compute and can serve as a useful quantity for practitioners to monitor during the training of graph networks. Finally, analyzing the real over-smoothing effect, i.e., the Rayleigh quotient tr(X^T ∆̃ X)/||X||_2^2, for deep GNNs is still an open and important question.
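As a small illustration of the kind of monitoring suggested above, the following sketch (ours; the layer width, weight scale, and depth are arbitrary assumptions) tracks E(X^{(l)}) = tr(X^{(l)T} ∆̃ X^{(l)}) and a Rayleigh-quotient version of it through a deep, randomly initialized GCN layer f_l(X) = ReLU(P X W_l). With weights small enough that s_l λ̄ < 1, the energy decays towards zero, as Theorem 3.4 and Corollary 3.4.1 predict.

```python
# Illustrative sketch (ours): monitor Dirichlet energy across deep GCN layers.
import numpy as np
import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=0)
A = nx.to_numpy_array(G) + np.eye(200)             # A~ = A + I
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
P = D_inv_sqrt @ A @ D_inv_sqrt
delta = np.eye(200) - P

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
for layer in range(1, 21):
    W = 0.05 * rng.standard_normal((32, 32))       # small weights keep s_l * lambda_bar < 1 here
    X = np.maximum(P @ X @ W, 0.0)                 # one GCN layer with ReLU
    if layer % 5 == 0:
        E = np.trace(X.T @ delta @ X)              # E(X^{(l)}), the quantity Theorem 3.4 bounds
        print(layer, E, E / np.linalg.norm(X) ** 2)  # energy and a Rayleigh-quotient variant
```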

A. Missing Proof

To show that Lemma 3.2 does not use any particular graph structure, we present an alternative proof of Lemma 3.2 that simply uses a generic matrix inequality.

Lemma A.1. E(XW) ≤ ||W^T||_2^2 E(X).

Proof. Expanding E(XW) in matrix form,

E(XW) = tr(W^T X^T ∆̃ X W) = tr(X^T ∆̃ X W W^T) ≤ tr(X^T ∆̃ X) σ_max(WW^T) = E(X) ||W||_2^2.

Here σ_max(WW^T) denotes the largest eigenvalue of WW^T, and ||A||_2 = √(λ_max(A^*A)) = σ_max(A).

Theorem A.2. For any l ∈ N_+, we have E(f_l(X)) ≤ s_l λ̄ E(X).

Proof. By Lemmas 3.1-3.3,

E(f_l(X)) = E(σ(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l}))
          ≤ E(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l})
          ≤ s_{lH_l} E(σ(···σ(σ(PX)W_{l1})W_{l2}···W_{l,H_l−1}))
          ≤ ···
          ≤ (∏_{h=1}^{H_l} s_{lh}) E(PX) = s_l E(PX)
          ≤ s_l λ̄ E(X).

B. Experiments

We perform experiments on both synthetic graphs and real graph benchmarks. The threshold T for the number of lower eigenvectors is set to 20 for the synthetic graphs. For Cora and Citeseer, it is set to 400 and 600 respectively (due to the large number of nearly zero eigenvalues). The code is available on GitHub[7]. The details of each graph are listed as follows:

[7] https://github.com/Chen-Cai-OSU/GNN-Over-Smoothing

• Random graph G(200, 0.05).

• Random geometric graph on the plane. There are 200 nodes placed uniformly at random in the unit cube. Two nodes are joined by an edge if the distance between the nodes is at most 0.2.

• Stochastic Block Model with 2 blocks. It consists of two blocks where each block has 100 nodes. The edge probability within each block is 0.1 and the edge probability between blocks is 0.01.

• Stochastic Block Model with 4 blocks. It consists of four blocks where each block has 50 nodes. The edge probabilities within the blocks are 0.1, 0.2, 0.3, 0.4. The edge probability between blocks is 0.08.

• Barabasi-Albert Graph. A graph of n nodes is grown by attaching new nodes, each with m edges that are preferentially attached to existing nodes with high degree. We set n, m to be 200 and 4.

• Cora is a citation graph where 2708 nodes are documents and 5278 edges are citation links.

• Citeseer is a citation graph where 3327 nodes are documents and 4552 edges are citation links.
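For reference, here is one possible way (ours, not the paper's code; any parameter not listed above, such as the random seeds, is an assumption) to instantiate the synthetic graphs above with networkx.

```python
# A possible instantiation (ours) of the synthetic benchmark graphs listed above.
import networkx as nx

graphs = {
    "erdos_renyi": nx.erdos_renyi_graph(200, 0.05, seed=0),
    "random_geometric": nx.random_geometric_graph(200, 0.2, seed=0),
    "sbm_2_blocks": nx.stochastic_block_model(
        [100, 100], [[0.1, 0.01], [0.01, 0.1]], seed=0),
    "sbm_4_blocks": nx.stochastic_block_model(
        [50, 50, 50, 50],
        [[0.1, 0.08, 0.08, 0.08],
         [0.08, 0.2, 0.08, 0.08],
         [0.08, 0.08, 0.3, 0.08],
         [0.08, 0.08, 0.08, 0.4]], seed=0),
    "barabasi_albert": nx.barabasi_albert_graph(200, 4, seed=0),
}
for name, G in graphs.items():
    print(name, G.number_of_nodes(), G.number_of_edges())
```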

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for Cora (2708 nodes, 5278 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 2. Cora.

References Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Gurukar, S., Vijayan, P., Srinivasan, A., Bajaj, G., Cai, C., Keymanesh, M., Kumar, S., Maneriker, P., Mitra, Luan, S., Zhao, M., Chang, X.-W., and Precup, D. Break the A., Patel, V., et al. Network representation learning: ceiling: Stronger multi-scale deep graph convolutional Consolidation and renewed bearing. arXiv preprint networks. In Advances in Neural Information Processing arXiv:1905.00987, 2019. Systems, pp. 10943–10953, 2019. Hamilton, W., Ying, Z., and Leskovec, J. Inductive repre- Oono, K. and Suzuki, T. Graph neural networks exponen- sentation learning on large graphs. In Advances in neural tially lose expressive power for node classification. arXiv information processing systems, pp. 1024–1034, 2017. preprint cs.LG/1905.10947, 2019. Kipf, T. N. and Welling, M. Semi-supervised classifica- Qu, M., Bengio, Y., and Tang, J. Gmnn: Graph markov tion with graph convolutional networks. arXiv preprint neural networks. arXiv preprint arXiv:1905.06214, 2019. arXiv:1609.02907, 2016. Rong, Y., Huang, W., Xu, T., and Huang, J. Dropedge: Lee, Y. T. and Sun, H. An sdp-based algorithm for linear- Towards deep graph convolutional networks on node clas- sized spectral sparsification. In Proceedings of the 49th sification. In International Conference on Learning Rep- annual acm sigact symposium on theory of computing, resentations, 2019. pp. 678–687, 2017. Spielman, D. A. and Srivastava, N. Graph sparsification by Li, G., Muller,¨ M., Thabet, A., and Ghanem, B. Deep- effective resistances. SIAM Journal on Computing, 40(6): gcns: Can gcns go as deep as cnns? arXiv preprint 1913–1926, 2011. arXiv:1904.03751, 2019. Spielman, D. A. and Teng, S.-H. Nearly-linear time algo- Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph rithms for graph partitioning, graph sparsification, and convolutional networks for semi-supervised learning. In solving linear systems. In Proceedings of the thirty-sixth A Note on Over-Smoothing for Graph Neural Networks

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for CiteSeer (3327 nodes, 4552 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 3. Citeseer.

Spielman, D. A. and Teng, S.-H. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981-1025, 2011.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Wu, F., Zhang, T., Souza Jr., A. H. d., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Xie, T. and Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical Review Letters, 120(14):145301, 2018.

Zhao, L. and Akoglu, L. PairNorm: Tackling oversmoothing in GNNs. arXiv preprint arXiv:1909.12223, 2019.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Erdos-Renyi Graph G(200, 0.05) (200 nodes, 942 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 4. Erdos-Renyi Graph G(200, 0.05).

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Random Geometric Graph (200 nodes, 1960 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 5. Random Geometric Graph.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Watts-Strogatz Graph (200 nodes, 2000 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 6. Watts-Strogatz Graph.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Stochastic Block Model with 2 Blocks (200 nodes, 1133 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 7. Stochastic Block Model with 2 Blocks.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Stochastic Block Model with 4 Blocks (200 nodes, 1329 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 8. Stochastic Block Model with 4 Blocks.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Barabasi-Albert Graph (200 nodes, 784 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 9. Barabasi-Albert Graph.