A Note on Over-Smoothing for Graph Neural Networks

Chen Cai 1 Yusu Wang 1

Abstract

Graph Neural Networks (GNNs) have achieved a lot of success on graph-structured data. However, it is observed that the performance of graph neural networks does not improve as the number of layers increases. This effect, known as over-smoothing[1], has been analyzed mostly in linear cases. In this paper, we build upon previous results (Oono & Suzuki, 2019) to further analyze the over-smoothing effect in the general graph neural network architecture. We show that when the weight matrix satisfies the conditions determined by the spectrum of the augmented normalized Laplacian, the Dirichlet energy of embeddings will converge to zero, resulting in the loss of discriminative power. Using Dirichlet energy to measure the "expressiveness" of embeddings is conceptually clean; it leads to simpler proofs than (Oono & Suzuki, 2019) and can handle more non-linearities.

[1] Strictly speaking, over-smoothing is a misnomer. As we will show, what is decreasing is tr(X^T ∆̃ X), not the real smoothness tr(X^T ∆̃ X) / ||X||_2^2 of the graph signal X.

1. Introduction

Graph neural networks (GNNs) are a family of neural networks that can learn from graph-structured data. Starting with the success of the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) in achieving state-of-the-art performance on semi-supervised classification, several variants of GNNs have been developed for this task, including GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2017), SGC (Wu et al., 2019), CGCNN (Xie & Grossman, 2018) and GMNN (Qu et al., 2019), to name a few of the most recent ones. See (Gurukar et al., 2019; Wu et al., 2020; Zhou et al., 2018) for surveys.

However, a key issue with GNNs is their depth limitation. It has been observed that deeply stacking the layers often results in significantly worse performance for GNNs such as GCN and GAT. This drop is associated with many factors, including vanishing gradients during back-propagation, overfitting due to the increasing number of parameters, as well as the phenomenon called over-smoothing. (Li et al., 2018) was the first to call attention to the over-smoothing problem. Having shown that graph convolution is a type of Laplacian smoothing, they proved that after repeatedly applying Laplacian smoothing many times, the features of the nodes in a (connected) graph converge to similar values. Later, several others have alluded to the same problem (Li et al., 2019; Luan et al., 2019; Zhao & Akoglu, 2019).

The goal of this paper is to extend some of the analysis of GNNs in the ICLR 2020 spotlight paper (Oono & Suzuki, 2019) on the expressive power of GNNs for node classification. To the best of our knowledge, (Oono & Suzuki, 2019) is the first paper extending the analysis of over-smoothing from linear GNNs to nonlinear ones. However, only ReLU is handled. It is noted by the authors that extension to other non-linearities such as Sigmoid and Leaky ReLU is far from trivial.

In this paper, we propose a simple technique to analyze the embedding when the number of layers goes to infinity. The analysis is based on tracking the Dirichlet energy of node embeddings across layers. Our contributions are the following:

• Using Dirichlet energy to measure the expressiveness of embeddings is conceptually clean. Besides being able to recover the results in (Oono & Suzuki, 2019), our analysis can be easily applied to Leaky ReLU. In the special case of regular graphs, our proof can be extended to the most common nonlinearities. The proof is easy to follow and requires only elementary linear algebra. We discuss the key differences between our proof and the proofs in (Oono & Suzuki, 2019), as well as the benefits of introducing Dirichlet energy, in Section 4.

• Second, we perform extensive experiments on a variety of graphs to study the effect of basic edge operations on the Dirichlet energy. We find that in many cases dropping edges and increasing the weights of edges (to a high value) can increase the Dirichlet energy.

1 Department of Computer Science, Ohio State University, Ohio, USA. Correspondence to: Chen Cai.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

2. Notation

Let N_+ be the set of positive integers. We define A ∈ R^{N×N} to be the adjacency matrix and D to be the degree matrix of the graph G. Let Ã := A + I_N, D̃ := D + I_N be the adjacency and degree matrices of the graph G augmented with self-loops. We define the augmented normalized Laplacian of G by ∆̃ := I_N − D̃^{−1/2} Ã D̃^{−1/2} and set P := I_N − ∆̃ = D̃^{−1/2} Ã D̃^{−1/2}. Let L, C ∈ N_+ be the layer and channel sizes. We define a GCN associated with G by f = f_L ∘ ··· ∘ f_1, where f_l : R^{N×C_l} → R^{N×C_{l+1}} is defined by f_l(X) = MLP_l(PX). Here MLP_l(X) := σ(···σ(σ(X)W_{l1})W_{l2}···W_{lH_l}), where σ is an element-wise nonlinear function. Note that the weight matrices W_{l·} are not necessarily square. We consider the embeddings X^{(l+1)} := f_l(X^{(l)}) with initial value X^{(0)}. We are interested in the asymptotic behavior of the output X^{(L)} of the GCN as L → ∞.

We state the following lemma without proof.

Lemma 2.1. The eigenvalues of ∆̃ lie in [0, 2). The eigenvalues of P = I_N − ∆̃ lie in (−1, 1].

3. Main Result

The main idea of the proof is to track the Dirichlet energy of node embeddings w.r.t. the (augmented) normalized Laplacian at different layers. With some assumptions on the weight matrices of the GCN, we can prove that the Dirichlet energy decreases exponentially with respect to the number of layers. Intuitively, the Dirichlet energy of a function measures the "smoothness" of a function of unit norm, and eigenvectors of the normalized Laplacian are minimizers of the Dirichlet energy.

Definition 3.1. The Dirichlet energy E(f) of a scalar function f ∈ R^{N×1} on the graph G is defined as

E(f) = f^T ∆̃ f = (1/2) Σ_{i,j} w_{ij} (f_i/√(1+d_i) − f_j/√(1+d_j))^2.

For a vector field X_{N×c} = [x_1, ..., x_N]^T, x_i ∈ R^{1×c}, the Dirichlet energy is defined as

E(X) = tr(X^T ∆̃ X) = (1/2) Σ_{i,j} w_{ij} ||x_i/√(1+d_i) − x_j/√(1+d_j)||_2^2.

Without loss of generality, each layer of the GCN can be represented as f_l(X) = σ(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l}), with σ applied H_l times. Next we analyze the effects of P, W_l, and σ on the Dirichlet energy one by one.

Lemma 3.1. E(PX) ≤ (1 − λ)^2 E(X), where λ is the smallest non-zero eigenvalue of ∆̃.

Proof. Let us denote the eigenvalues of ∆̃ by λ_1, λ_2, ..., λ_N, and the associated eigenvectors of length 1 by v_1, ..., v_N. Suppose f = Σ_i c_i v_i where c_i ∈ R. Then

E(f) = f^T ∆̃ f = f^T Σ_i c_i λ_i v_i = Σ_i c_i^2 λ_i.    (1)

Therefore,

E(Pf) = f^T (I_N − ∆̃) ∆̃ (I_N − ∆̃) f = Σ_i c_i^2 λ_i (1 − λ_i)^2 ≤ (1 − λ)^2 E(f).    (2)

Extending the above argument from the scalar field to the vector field finishes the proof of E(PX) ≤ (1 − λ)^2 E(X).

Lemma 3.2. E(XW) ≤ ||W^T||_2^2 E(X).

Proof. By definition,

E(XW) = (1/2) Σ_{(i,j)∈E} w_{ij} ||x_i W/√(1+d_i) − x_j W/√(1+d_j)||_2^2,

where X_{n×c} = [x_1, ..., x_n]^T, x_i ∈ R^{1×c}, W ∈ R^{c×c'}. Since for each term

||x_i W/√(1+d_i) − x_j W/√(1+d_j)||_2^2 ≤ ||x_i/√(1+d_i) − x_j/√(1+d_j)||_2^2 ||W^T||_2^2,    (3)

we get

E(XW) ≤ E(X) ||W^T||_2^2.    (4)

Remark: Since ||A||_2 = σ_max(A), where σ_max(A) denotes the largest singular value of the matrix A, our result in Lemma 3.2 is essentially the same as Lemma 2 of (Oono & Suzuki, 2019)[2]. Note that our proof can handle weight matrices not only of dimension d × d but also of dimension d × d', while (Oono & Suzuki, 2019) assumes the embedding dimension to be fixed across layers. See the detailed discussion in Section 4.

[2] For any X ∈ R^{N×C}, we have d_M(XW_{lh}) ≤ s_{lh} d_M(X), where s_{lh} is the maximum singular value of W_{lh}.

Remark: The proof itself doesn't leverage the structure of the graph. In particular, only the fact that the Laplacian is a p.s.d. matrix is needed in the proof. See an alternative proof in the appendix. This also makes sense because W operates on the graph feature space and should be oblivious to the particular graph structure.
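To make these quantities concrete, the following is a minimal numerical sketch (ours, not the authors' released code), assuming networkx and numpy: it builds ∆̃ and P for a small graph, computes the Dirichlet energy, and prints both sides of the bounds in Lemmas 3.1 and 3.2 for random inputs.

```python
# Minimal numerical sketch (ours, not the authors' code), assuming networkx/numpy.
import numpy as np
import networkx as nx

def augmented_operators(G):
    """Return the augmented normalized Laplacian Delta~ and P = I - Delta~."""
    A = nx.to_numpy_array(G)
    N = A.shape[0]
    A_tilde = A + np.eye(N)                        # A~ = A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    P = D_inv_sqrt @ A_tilde @ D_inv_sqrt          # P = D~^{-1/2} A~ D~^{-1/2}
    return np.eye(N) - P, P                        # Delta~ = I - P

def dirichlet_energy(X, delta):
    """E(X) = tr(X^T Delta~ X)."""
    return np.trace(X.T @ delta @ X)

G = nx.erdos_renyi_graph(200, 0.05, seed=0)
delta, P = augmented_operators(G)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
W = rng.standard_normal((16, 8))

eigs = np.linalg.eigvalsh(delta)                   # Lemma 2.1: eigenvalues lie in [0, 2)
lam = eigs[eigs > 1e-8].min()                      # smallest non-zero eigenvalue
s = np.linalg.norm(W, 2) ** 2                      # squared largest singular value of W

# Lemma 3.1: E(PX) <= (1 - lambda)^2 E(X);  Lemma 3.2: E(XW) <= ||W||_2^2 E(X)
print(dirichlet_energy(P @ X, delta), (1 - lam) ** 2 * dirichlet_energy(X, delta))
print(dirichlet_energy(X @ W, delta), s * dirichlet_energy(X, delta))
```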

Lemma 3.3. E(σ(X)) ≤ E(X) when σ is ReLU or Leaky-ReLU.

Proof. We first prove that it holds for a scalar field f and then extend it to a vector field X. We have E(f) = Σ_{(i,j)∈E} w_{ij} (f_i/√(1+d_i) − f_j/√(1+d_j))^2, where w_{ij} ≥ 0. For all c_1, c_2 ∈ R_+ and a, b ∈ R,

|c_1 a − c_2 b| ≥ |σ(c_1 a) − σ(c_2 b)| = |c_1 σ(a) − c_2 σ(b)|.    (5)

The first inequality holds for all σ whose Lipschitz constant is no more than 1, including ReLU, Leaky-ReLU, Tanh, Sigmoid, etc. The second equality holds because for ReLU and Leaky-ReLU, σ(cx) = cσ(x) for all c ∈ R_+, x ∈ R. Therefore, by replacing c_1, c_2, a, b with 1/√(1+d_i), 1/√(1+d_j), f_i, f_j, we see that E(σ(f)) ≤ E(f) holds for ReLU and Leaky-ReLU. Extending the above argument to the vector field completes the proof.

Remark: For regular graphs, the above conclusion can be extended to more non-linearities such as ReLU, Leaky-ReLU, Tanh, and Sigmoid.

Remark: The proof hinges on the simple fact that for ReLU and Leaky-ReLU, σ(ca) = σ(c)a, where c ∈ R_+, a ∈ R. For other activation functions, as long as c_1 a = c_2 b and c_1 σ(a) ≠ c_2 σ(b) (it is easy to find examples for Sigmoid, Tanh[3], etc., since there are no strong restrictions on a, b, c_1, c_2[4]), we cannot guarantee E(σ(X)) ≤ E(X).

[3] Sigmoid: Sigmoid(x) = 1/(1 + e^{−x}). Tanh: Tanh(x) = (e^{2x} − 1)/(e^{2x} + 1).
[4] For example, c_1 = 1, a = 2, c_2 = 2, b = 1.

Combining the above three lemmas, denote the square of the maximum singular value of W_{lh} by s_{lh} and set s_l := ∏_{h=1}^{H_l} s_{lh}. Also let λ̄ := (1 − λ)^2. With those parameters, we arrive at the main theorem.

Theorem 3.4. For any l ∈ N_+, we have E(f_l(X)) ≤ s_l λ̄ E(X).

See the proof in Appendix A.

Corollary 3.4.1. Let s := sup_{l∈N_+} s_l. We have E(X^{(l)}) ≤ O((sλ̄)^l). In particular, E(X^{(l)}) converges exponentially to 0 when sλ̄ < 1.

Our result shares great similarity with the paper (Oono & Suzuki, 2019). The bounds are similar, but our result handles more general cases. As noted in (Oono & Suzuki, 2019), the eigengap plays an important role. The analysis of the Erdos-Renyi graph G_{N,p} (or any other graphs that have large eigengaps) when log N/(Np) = o(1) in (Oono & Suzuki, 2019) can also be directly applied to our case.

4. Key Differences

The key quantity that (Oono & Suzuki, 2019) looks at is d_M(X), where M is a subspace of R^{N×C} defined as M := U ⊗ R^C = {Σ_{m=1}^M e_m ⊗ w_m | w_m ∈ R^C}, and (e_m)_{m∈[M]} is an orthonormal basis of the null space U of a normalized graph Laplacian ∆̃. The original definition of d_M(X) is stated for the case of the same embedding dimension across layers. It needs to be modified to handle the case of varying dimensions. One way to achieve this is to define M = U ⊗ R^C and M' = U ⊗ R^{C'}, respectively. Lemma 2 of (Oono & Suzuki, 2019) can then be modified from d_M(XW) ≤ s d_M(X) (W ∈ R^{c×c}) to the following:

d_{M'}(XW) ≤ s d_M(X),    (6)

where s is the maximum singular value of W[5].

[5] Here, with a slight abuse of notation, W ∈ R^{c×c'}.

As for the nonlinearity, (Oono & Suzuki, 2019) mentions that their analysis is limited to graph neural networks with the ReLU activation function because they implicitly use the property that ReLU is a projection onto the cone {X > 0} (see Appendix A, Lemma 3 in (Oono & Suzuki, 2019) for details). This fact enables the ReLU function to get along with the nonnegativity of eigenvectors associated with the largest eigenvalues (Perron-Frobenius theorem). Therefore, the authors mentioned that it may not be easy to extend their results to other activation functions such as the sigmoid function or Leaky ReLU.

In contrast, the proof of Lemma 3.3 becomes trivial once we write out the Dirichlet energy as the sum of multiple terms, for each of which the effect of the nonlinearity can be easily analyzed.

5. Experiments

To investigate how basic edge operations, removing edges and increasing edge weights[6], affect Dirichlet energy and over-smoothing, we perform extensive experiments on both common benchmarks (Cora and CiteSeer) and synthetic graphs. See Appendix B for more details on the datasets.

[6] In this paper, we only consider the case where the edge weight is increased to a very high value (from 1 to 10000 in all experiments).

In particular, given a graph, we compute its eigenvalues before and after randomly dropping/increasing the weights of a certain percentage (10%-90%) of edges. This is shown in the first/third column of each figure. In the second/fourth column, we generate three signals x, Px (P'x), P^2x (P'^2x), where x = Σ_{i=1}^T c_i v_i, the v_i are the eigenvectors corresponding to the lowest T eigenvalues of the normalized Laplacian, and each c_i is a uniform random number between 0 and 1. In other words, x is a mix of low eigenvectors. We then compute the Dirichlet energy of the three signals both for the original graph (E0, E1, E2) and for the graph with edges removed/reweighted (E0', E1', E2').
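Below is a rough sketch (ours, not the released experiment code) of this protocol, assuming networkx and numpy: build x as a mix of the lowest-T eigenvectors, drop a fraction of edges, and compare the Dirichlet energies of x, Px, P^2x on the original and modified graphs. The drop ratio and seeds here are arbitrary choices for illustration.

```python
# Rough sketch (ours, not the released code) of the experiment protocol above.
import numpy as np
import networkx as nx

def operators(G):
    A = nx.to_numpy_array(G)
    N = A.shape[0]
    A_t = A + np.eye(N)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    P = D_inv_sqrt @ A_t @ D_inv_sqrt
    return np.eye(N) - P, P                        # (Delta~, P)

def energy(x, delta):
    return float(x @ delta @ x)

rng = np.random.default_rng(0)
G = nx.random_geometric_graph(200, 0.2, seed=0)
delta, P = operators(G)

T = 20                                             # x: mix of the T lowest eigenvectors
_, vecs = np.linalg.eigh(delta)                    # eigenvalues returned in ascending order
x = vecs[:, :T] @ rng.uniform(0, 1, size=T)

G_drop = G.copy()                                  # randomly drop 30% of the edges
edges = list(G.edges)
idx = rng.choice(len(edges), size=int(0.3 * len(edges)), replace=False)
G_drop.remove_edges_from([edges[i] for i in idx])
delta_d, P_d = operators(G_drop)

for name, (D_, P_) in [("original", (delta, P)), ("dropped", (delta_d, P_d))]:
    print(name, [round(energy(s, D_), 4) for s in (x, P_ @ x, P_ @ P_ @ x)])
```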

[Figure omitted: Random Geometric Graph (200 nodes, 1960 edges); panels for increasing edge weights (top) and dropping edges (bottom) for 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 1. The effects of dropping edges and increasing edge weights on eigenvalues / Dirichlet energy for the Random Geometric graph.

The same experiments are repeated 20 times, and 120 data points are shown in the scatter plot.

We make the following observations.

• First, for nearly all graphs and ratios (except for some cases of Cora and CiteSeer), dropping edges increases the Dirichlet energy of x, Px (P'x), P^2x (P'^2x). This coincides with the observation in DropEdge (Rong et al., 2019) that dropping edges helps relieve over-smoothing.

• Second, in most cases, the effect of increasing the weights of edges (from 1 to 10000) and dropping edges is "dual" to each other, i.e., increasing the weights of a few edges to a very high value is similar to dropping a lot of edges in terms of eigenvalues and Dirichlet energy. Intuitively, we can think of increasing the weight of an edge (u, v) to infinity as contracting the nodes u and v into a supernode. For a planar graph and its dual graph, edge deletion in one graph corresponds to contraction in the other graph and vice versa. We hypothesize that randomly increasing the weights of a few edges to a high value will also help to relieve over-smoothing. We leave the systematic verification as future work.

6. Conclusion

We provide an alternative proof that graph neural networks exponentially lose expressive power. Besides achieving the same bound as (Oono & Suzuki, 2019), our simple proof also handles Leaky ReLU. We also empirically explore the effect of basic edge operations on the Dirichlet energy.

Some interesting future directions are: 1) The key challenge of analyzing the over-smoothing effect lies in the non-linearity. How to extend our strategy to more general graph learning, such as other nonlinearities, normalization strategies (Zhao & Akoglu, 2019), graphs with both node and edge features, and attention mechanisms (Veličković et al., 2017), remains largely open. 2) The assumption on the norm of the weight matrices of GNNs is crucial (and may also be too strong) in our proof. Understanding how learning plays a role in resisting the over-smoothing effect is interesting. 3) Preserving the Dirichlet energy for the combinatorial Laplacian is well studied in the context of graph sparsification. Novel techniques in (Lee & Sun, 2017; Spielman & Srivastava, 2011; Spielman & Teng, 2004; 2011) may be applicable. Also, the Dirichlet energy itself is easy to compute and can serve as a useful quantity for practitioners to monitor during the training of graph networks. Finally, analyzing the real over-smoothing effect, i.e., the Rayleigh quotient tr(X^T ∆̃ X)/||X||_2^2, for deep GNNs is still an open and important question.
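As a small illustration of the kind of monitoring suggested above, the following sketch (ours; the layer width, weight scale, and depth are arbitrary assumptions) tracks E(X^{(l)}) = tr(X^{(l)T} ∆̃ X^{(l)}) and a Rayleigh-quotient version of it through a deep, randomly initialized GCN layer f_l(X) = ReLU(P X W_l). With weights small enough that s_l λ̄ < 1, the energy decays towards zero, as Theorem 3.4 and Corollary 3.4.1 predict.

```python
# Illustrative sketch (ours): monitor Dirichlet energy across deep GCN layers.
import numpy as np
import networkx as nx

G = nx.erdos_renyi_graph(200, 0.05, seed=0)
A = nx.to_numpy_array(G) + np.eye(200)             # A~ = A + I
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
P = D_inv_sqrt @ A @ D_inv_sqrt
delta = np.eye(200) - P

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
for layer in range(1, 21):
    W = 0.05 * rng.standard_normal((32, 32))       # small weights keep s_l * lambda_bar < 1 here
    X = np.maximum(P @ X @ W, 0.0)                 # one GCN layer with ReLU
    if layer % 5 == 0:
        E = np.trace(X.T @ delta @ X)              # E(X^{(l)}), the quantity Theorem 3.4 bounds
        print(layer, E, E / np.linalg.norm(X) ** 2)  # energy and a Rayleigh-quotient variant
```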

A. Missing Proof

To show that Lemma 3.2 does not use any particular graph structure, we present an alternative proof of Lemma 3.2 that simply uses a generic matrix inequality.

Lemma A.1. E(XW) ≤ ||W^T||_2^2 E(X).

Proof. Expanding E(XW) in matrix form,

E(XW) = tr(W^T X^T ∆̃ X W) = tr(X^T ∆̃ X W W^T) ≤ tr(X^T ∆̃ X) σ_max(WW^T) = E(X) ||W||_2^2.

Here σ_max(WW^T) denotes the largest eigenvalue of WW^T, and ||A||_2 = √(λ_max(A^*A)) = σ_max(A).

Theorem A.2. For any l ∈ N_+, we have E(f_l(X)) ≤ s_l λ̄ E(X).

Proof. By Lemmas 3.1-3.3,

E(f_l(X)) = E(σ(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l}))
          ≤ E(σ(···σ(σ(PX)W_{l1})W_{l2}···)W_{lH_l})
          ≤ s_{lH_l} E(σ(···σ(σ(PX)W_{l1})W_{l2}···W_{l,H_l−1}))
          ≤ ···
          ≤ (∏_{h=1}^{H_l} s_{lh}) E(PX) = s_l E(PX)
          ≤ s_l λ̄ E(X).

B. Experiments

We perform experiments on both synthetic graphs and real graph benchmarks. The threshold T for the number of lower eigenvectors is set to 20 for the synthetic graphs. For Cora and Citeseer, it is set to 400 and 600 respectively (due to the large number of nearly zero eigenvalues). The code is available on GitHub[7]. The details of each graph are listed as follows:

[7] https://github.com/Chen-Cai-OSU/GNN-Over-Smoothing

• Random graph G(200, 0.05).

• Random geometric graph on the plane. There are 200 nodes placed uniformly at random in the unit cube. Two nodes are joined by an edge if the distance between the nodes is at most 0.2.

• Stochastic Block Model with 2 blocks. It consists of two blocks where each block has 100 nodes. The edge probability within each block is 0.1 and the edge probability between blocks is 0.01.

• Stochastic Block Model with 4 blocks. It consists of four blocks where each block has 50 nodes. The edge probabilities within the blocks are 0.1, 0.2, 0.3, 0.4. The edge probability between blocks is 0.08.

• Barabasi-Albert Graph. A graph of n nodes is grown by attaching new nodes, each with m edges that are preferentially attached to existing nodes with high degree. We set n, m to be 200 and 4.

• Cora is a citation graph where 2708 nodes are documents and 5278 edges are citation links.

• Citeseer is a citation graph where 3327 nodes are documents and 4552 edges are citation links.
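For reference, here is one possible way (ours, not the paper's code; any parameter not listed above, such as the random seeds, is an assumption) to instantiate the synthetic graphs above with networkx.

```python
# A possible instantiation (ours) of the synthetic benchmark graphs listed above.
import networkx as nx

graphs = {
    "erdos_renyi": nx.erdos_renyi_graph(200, 0.05, seed=0),
    "random_geometric": nx.random_geometric_graph(200, 0.2, seed=0),
    "sbm_2_blocks": nx.stochastic_block_model(
        [100, 100], [[0.1, 0.01], [0.01, 0.1]], seed=0),
    "sbm_4_blocks": nx.stochastic_block_model(
        [50, 50, 50, 50],
        [[0.1, 0.08, 0.08, 0.08],
         [0.08, 0.2, 0.08, 0.08],
         [0.08, 0.08, 0.3, 0.08],
         [0.08, 0.08, 0.08, 0.4]], seed=0),
    "barabasi_albert": nx.barabasi_albert_graph(200, 4, seed=0),
}
for name, G in graphs.items():
    print(name, G.number_of_nodes(), G.number_of_edges())
```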

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for Cora (2708 nodes, 5278 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 2. Cora.

References Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Gurukar, S., Vijayan, P., Srinivasan, A., Bajaj, G., Cai, C., Keymanesh, M., Kumar, S., Maneriker, P., Mitra, Luan, S., Zhao, M., Chang, X.-W., and Precup, D. Break the A., Patel, V., et al. Network representation learning: ceiling: Stronger multi-scale deep graph convolutional Consolidation and renewed bearing. arXiv preprint networks. In Advances in Neural Information Processing arXiv:1905.00987, 2019. Systems, pp. 10943–10953, 2019. Hamilton, W., Ying, Z., and Leskovec, J. Inductive repre- Oono, K. and Suzuki, T. Graph neural networks exponen- sentation learning on large graphs. In Advances in neural tially lose expressive power for node classification. arXiv information processing systems, pp. 1024–1034, 2017. preprint cs.LG/1905.10947, 2019. Kipf, T. N. and Welling, M. Semi-supervised classifica- Qu, M., Bengio, Y., and Tang, J. Gmnn: Graph markov tion with graph convolutional networks. arXiv preprint neural networks. arXiv preprint arXiv:1905.06214, 2019. arXiv:1609.02907, 2016. Rong, Y., Huang, W., Xu, T., and Huang, J. Dropedge: Lee, Y. T. and Sun, H. An sdp-based algorithm for linear- Towards deep graph convolutional networks on node clas- sized spectral sparsification. In Proceedings of the 49th sification. In International Conference on Learning Rep- annual acm sigact symposium on theory of computing, resentations, 2019. pp. 678–687, 2017. Spielman, D. A. and Srivastava, N. Graph sparsification by Li, G., Muller,¨ M., Thabet, A., and Ghanem, B. Deep- effective resistances. SIAM Journal on Computing, 40(6): gcns: Can gcns go as deep as cnns? arXiv preprint 1913–1926, 2011. arXiv:1904.03751, 2019. Spielman, D. A. and Teng, S.-H. Nearly-linear time algo- Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph rithms for graph partitioning, graph sparsification, and convolutional networks for semi-supervised learning. In solving linear systems. In Proceedings of the thirty-sixth A Note on Over-Smoothing for Graph Neural Networks

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for CiteSeer (3327 nodes, 4552 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 3. Citeseer.

Spielman, D. A. and Teng, S.-H. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981-1025, 2011.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Wu, F., Zhang, T., Souza Jr., A. H. d., Fifty, C., Yu, T., and Weinberger, K. Q. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S. Y. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Xie, T. and Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical Review Letters, 120(14):145301, 2018.

Zhao, L. and Akoglu, L. PairNorm: Tackling oversmoothing in GNNs. arXiv preprint arXiv:1909.12223, 2019.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Erdos-Renyi Graph G(200, 0.05) (200 nodes, 942 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 4. Erdos-Renyi Graph G(200, 0.05).

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Random Geometric Graph (200 nodes, 1960 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 5. Random Geometric Graph.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Watts-Strogatz Graph (200 nodes, 2000 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 6. Watts-Strogatz Graph.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Stochastic Block Model with 2 Blocks (200 nodes, 1133 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 7. Stochastic Block Model with 2 Blocks.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Stochastic Block Model with 4 Blocks (200 nodes, 1329 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 8. Stochastic Block Model with 4 Blocks.

[Figure omitted: "Low Eigenvector Mix + Reweight Edge" and "Low Eigenvector Mix + Drop Edge" panels for the Barabasi-Albert Graph (200 nodes, 784 edges), 10%-90% of the edges; curves w1, w2 (eigenvalues) and E0, E1, E2 vs. E0', E1', E2' (Dirichlet energies).]

Figure 9. Barabasi-Albert Graph.