Improving Graph Convolutional Networks with Non-Parametric Activation Functions

Simone Scardapane∗, Steven Van Vaerenbergh†, Danilo Comminiello∗ and Aurelio Uncini∗ ∗Dept. Information Eng., Electronics and Telecomm., Sapienza University of Rome, Italy. †Dept. Communications Eng., University of Cantabria, Spain. Corresponding author email: [email protected]

Abstract—Graph neural networks (GNNs) are a class of neural networks that allow to efficiently perform inference on data that is associated to a graph structure, such as, e.g., citation networks or knowledge graphs. While several variants of GNNs have been proposed, they only consider simple nonlinear activation functions in their layers, such as rectifiers or squashing functions. In this paper, we investigate the use of graph convolutional networks (GCNs) when combined with more complex activation functions, able to adapt from the training data. More specifically, we extend the recently proposed kernel activation function, a non-parametric model which can be implemented easily, can be regularized with standard ℓp-norm techniques, and is smooth over its entire domain. Our experimental evaluation shows that the proposed architecture can significantly improve over its baseline, while similar improvements cannot be obtained by simply increasing the depth or size of the original GCN.

I. INTRODUCTION

Efficient processing of graph-structured data (e.g., citation networks) has a range of different applications, going from bioinformatics to text analysis and sensor networks, among others. Of particular importance is the design of learning methods that are able to take into account both the numerical characteristics of each node in the graph and their inter-connections [1]. While graph signal processing techniques have a long history, recently we witness a renewed interest in models that are defined and operate directly in the graph domain (as compared to a Euclidean space), instead of including the graph information only a posteriori in the optimization process (e.g., through manifold regularization techniques [2]), or in a pre-processing phase via graph embeddings [3]. Examples of native graph models include graph linear filters [4], their kernel counterparts [5], and graph neural networks (GNNs) [6], [7]. GNNs are particularly interesting because they promise to bring the performance of deep learning models [8] to graph-based domains. In particular, convolutional neural networks (CNNs) are nowadays the de-facto standard for processing image data. CNNs exploit the image structure by performing spatial convolutions (with a limited receptive field) on the image, thus increasing parameter sharing and lowering their complexity. A number of authors have recently explored the possibility of extending CNNs to the graph domain through several generalizations of the convolution operator, a trend which has been generically called 'geometric deep learning' [6]. In one of the first proposals [9], the graph Fourier transform was used at every layer of the network to perform filtering operations in a graph-frequency domain. However, this approach was not scalable, as it scaled linearly with the size of the graph. Defferrard et al. [10] extended this approach by using polynomial filters on the frequency components, which can be rewritten directly in the graph domain, avoiding the use of the graph Fourier transform. A further set of modifications was proposed by Kipf and Welling [11] (described more in depth in Section II), resulting in a generic graph convolutional network (GCN). The GCN has been successfully applied to several real-world scenarios, including semi-supervised learning [11], matrix completion [12], program induction [13], modeling relational data [14], and several others.

Like most neural networks, GCNs interleave linear layers, wherein information is adaptively combined according to the topology of the graph, with nonlinear activation functions that are applied element-wise. The information over the nodes can then be combined to obtain a graph-level prediction, or kept separate to obtain a node-specific inference (e.g., predictions on unlabeled nodes from a small set of labeled ones). Most of the literature on GCNs has worked with very simple choices for the activation functions, such as the rectified linear unit (ReLU) [11]. However, it is well known that the choice of the function can have a large impact on the final performance of the neural network. In particular, there is a large literature for standard neural networks on the design of flexible schemes for adapting the activation functions themselves from the training data [15]–[18]. These techniques range from the use of simple parametrizations of known functions (e.g., the parametric ReLU [15]) to the use of more sophisticated non-parametric families, including the Maxout network [16], the adaptive piecewise linear unit [17], and the recently proposed kernel activation function (KAF) [18].

Contribution of the paper: In this paper, we conjecture that properly choosing the activation function (beyond simple ReLUs) can further improve the performance of GCNs, possibly by a significant margin. To this end, we enhance the basic GCN model by extending KAFs to the activation functions of the filters. Each KAF models a one-dimensional activation function in terms of a simple kernel expansion (see the description in Section III). By properly choosing the elements of this expansion beforehand, we can represent each function with a small set of (linear) mixing coefficients, which are adapted together with the weights of the convolutional layers using standard back-propagation.

Apart from flexibility, the resulting scheme has a number of advantages, including smoothness over its domain and the possibility of implementing the algorithm using highly vectorized GPU operations. We compare on two benchmarks for semi-supervised learning, showing that the proposed GCN with KAFs can significantly outperform all competing baselines with a marginal increase in computational complexity (which is offset by a faster convergence in terms of epochs).

Structure of the paper: Section II introduces the problem of inference over graphs and the generic GCN model. The proposed extension with KAFs is described in Section III. We evaluate and compare the model in Section IV before concluding in Section V.

II. GRAPH CONVOLUTIONAL NETWORKS

A. Problem setup

Consider a generic undirected graph G = (V, E), where V = {1, ..., N} is the set of N vertices, and E ⊆ V × V is the set of edges connecting the vertices. The graph is equivalently described by the (weighted) adjacency matrix A ∈ R^{N×N}, whose generic element a_{ij} ≥ 0 is nonzero if and only if nodes i and j are connected. A graph signal [1] is a function f : V → R^F associating to vertex n an F-dimensional vector x_n (each dimension being denoted as a channel). For a subset L ⊂ V of the vertices we also have available a node-specific label y_n, which can be either a real number (graph regression) or a discrete quantity (graph classification). Our task is to find the correct labels for the (unlabeled) nodes that are not contained in L. Note that if the graph is unknown and/or must be inferred from the data, this setup is equivalent to standard semi-supervised (or, more precisely, transductive) learning.

Ignoring for a moment the graph information, we could solve the problem by training a standard (feedforward) NN to predict the label y_n from the input x_n [8]. A NN is composed by stacking L layers, such that the operation of the l-th layer can be described by:

    h_l = g(W_l h_{l-1}),    (1)

where h_{l-1} ∈ R^{N_{l-1}} is the input to the layer, W_l ∈ R^{N_l × N_{l-1}} are trainable weights (ignoring for simplicity any bias term), and g(·) is an element-wise nonlinear function known as activation function (AF). The NN takes as input x = h_0, providing ŷ = h_L ∈ R^C as the final prediction. A number of techniques can be used to include unlabeled information in the training process of NNs, including ladder networks [19], pseudo-labels [20], manifold regularization [2], and neural graph machines [21]. As we stated in the introduction, however, the aim of GNNs is to include the proximity information contained in A directly inside the processing layers of NNs, in order to further improve the performance of such models. As shown in the experimental section, working directly in the graph domain can obtain vastly superior performances as compared to standard semi-supervised techniques.

Note that if the nodes of the graph are organized in a regular grid, spatial information can be included by adding standard convolutional operations to (1), as is common in CNNs [8]. In order to extend this idea to general unweighted graphs, we need some additional tools from the theory of graph signal processing.

B. Graph Fourier transform and graph neural networks

In order to define a convolutional operation in the graph domain, we can exploit the normalized graph Laplacian L = I − D^{-1/2} A D^{-1/2}, where D is a diagonal matrix with D_{nn} = \sum_{t=1}^{N} A_{tn}. We denote the eigendecomposition of L as L = U Λ U^T, where U is a matrix collecting column-wise the eigenvectors of L, and Λ is a diagonal matrix with the associated eigenvalues. We further denote by X ∈ R^{N×F} the matrix collecting the input signal across all nodes, and by X_i ∈ R^N its i-th column. The graph Fourier transform of X_i can be defined as [1]:

    \hat{X}_i = U^T X_i,    (2)

and, because the eigenvectors form an orthonormal basis, we can also define an inverse Fourier transform as X_i = U \hat{X}_i. We can exploit the graph Fourier transform and define a convolutional layer operating on graphs as [9]:

    H_1 = g( \sum_{i=1}^{F} U h(Λ; Θ_i) U^T X_i ),    (3)

where h(Λ; Θ_i) is a (channel-wise) filtering operation acting on the eigenvalues, having adaptable parameters Θ_i. The previous operation can then be iterated to get successive representations H_2, H_3, ..., up to the final node-specific predictions. The choice of h(·) determines the complexity of training. In particular, [10] proposed to make the problem tractable by working with polynomial filters over the eigenvalues:

    h(Λ; Θ_i) = \sum_{k=0}^{K-1} Θ_{ik} T_k(\tilde{Λ}),    (4)

where \tilde{Λ} = 2Λ/λ_max − I (with λ_max being the largest eigenvalue of L), and T_k(·) is the Chebyshev polynomial of order k, defined by the recurrence relation:

    T_k(\tilde{Λ}) = 2 \tilde{Λ} T_{k-1}(\tilde{Λ}) − T_{k-2}(\tilde{Λ}),    (5)

with T_0(\tilde{Λ}) = 1 and T_1(\tilde{Λ}) = \tilde{Λ}. The filtering operation defined by (4) is advantageous because each filter is parameterized with only K values (with K chosen by the user). Additionally, it is easy to show that the filter is localized over the graph, in the sense that the output at a given node depends only on nodes up to a maximum of K hops from it. Thanks to the choice of a polynomial filter, we can also avoid the expensive multiplications by U and U^T by rewriting the filtering operation directly in the original graph domain [10].
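For concreteness, the following NumPy sketch (ours, not part of the original paper; all function names are illustrative) applies the polynomial filter of (4)-(5) to a single graph signal directly in the vertex domain, i.e., without computing the eigendecomposition of L. It assumes a dense adjacency matrix with no isolated nodes; sparse matrices would be used in practice.

```python
import numpy as np

def normalized_laplacian(A):
    # L = I - D^{-1/2} A D^{-1/2}; assumes all nodes have degree > 0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(A, x, theta, lmax=2.0):
    """Polynomial filter of Eq. (4) applied to a signal x, using the
    Chebyshev recurrence of Eq. (5) in the vertex domain. theta holds the
    K coefficients of one filter; lmax approximates the largest eigenvalue."""
    L = normalized_laplacian(A)
    L_t = 2.0 * L / lmax - np.eye(A.shape[0])   # rescaled Laplacian (tilde)
    t_prev, t_cur = x, L_t @ x                  # T_0(L~) x and T_1(L~) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_cur
    for k in range(2, len(theta)):
        t_prev, t_cur = t_cur, 2.0 * L_t @ t_cur - t_prev
        out = out + theta[k] * t_cur
    return out
```

The output only involves repeated multiplications by the rescaled Laplacian, which is precisely what makes the filter K-hop localized and, with sparse storage, linear in the number of edges.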


In order to avoid the need for the recurrence relation, [11] proposed to further simplify the expression by setting K = 1, θ_{i0} = −θ_{i1} for all channels, and assuming λ_max = 2. Back-substituting in the original expression (3), we obtain:

    H_1 = g( (I + D^{-1/2} A D^{-1/2}) X Θ_1 ),    (6)

where Θ_1 ∈ R^{F×N_1} is the matrix of adaptable filter coefficients. Practically, we can further avoid some numerical instabilities (due to the range of the eigenvalues) by substituting I + D^{-1/2} A D^{-1/2} with \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, where \tilde{A} = A + I and \tilde{D}_{nn} = \sum_{t=1}^{N} \tilde{A}_{tn}. The previous expression can then be iterated as in (1) to obtain a graph convolutional network (GCN). As an example, a GCN with two layers is given by:

    \hat{Y} = g( \hat{A} g( \hat{A} X Θ_1 ) Θ_2 ),    (7)

where we defined \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}. The parameters {Θ_l}_{l=1}^{L} can be initialized randomly and trained by minimizing a suitable loss over the labeled examples, such as the cross-entropy for classification problems [11]. A single layer of the GCN can combine information coming only from the immediate neighbors of a node, while using multiple layers allows information to flow over several hops. In practice, two layers are found to be sufficient for many benchmark problems, and further depth does not provide a benefit [11]. Note, however, that the previous expression requires choosing a proper activation function. This is the subject of the next section.

III. PROPOSED GCN WITH KERNEL ACTIVATION FUNCTIONS

Up to this point, we assumed that the activation functions g(·) in (7) were given. Note that the functions for different layers (or even for different filters in the same layer) need not be the same. In particular, the outer function is generally chosen in a task-dependent fashion (e.g., softmax for classification [8]). However, choosing different activation functions for the hidden layers can vastly change the performance of the resulting models and their flexibility.

Most of the research on GNNs and GCNs has focused on simple activation functions, such as the ReLU function:

    g(s) = max(0, s),    (8)

where s is a generic element on which the function is applied. It is easy to add a small amount of flexibility by introducing a simple parametrization of the function. For example, the parametric ReLU [15] adds an adaptable slope α (independent for every filter) to the negative part of the function:

    g(s) = α min(0, s) + max(0, s).    (9)

As we stated in the introduction, such parametric functions might not provide a significant increase in performance in general, and a lot of literature has been devoted to designing non-parametric activation functions that can adapt to a very large family of shapes. A simple technique would be to project the activation s to a high-dimensional space φ(s), and then adapt a different linear function on this feature space for every filter. However, this method becomes easily unfeasible for a large dimensionality of φ(·).

The idea of KAFs [18] is to avoid the expensive feature computation by working instead with a kernel expansion. A known problem of kernel methods is that the elements used for computing the kernel values (what is called the dictionary in the kernel filtering literature [22]) can be extremely hard to select. The insight in [18] is to exploit the fact that we are working with one-dimensional functions, by fixing the dictionary beforehand, sampling uniformly D elements (with D chosen by the user) around zero, and only adapting the linear coefficients of the kernel expansion (see the equations below). In this way, the parameter D controls the flexibility of our approach: by increasing it, we increase the flexibility of each filter at the cost of a larger number of parameters per filter.

More formally, we propose to model each activation function inside the GCN as follows:

    g(s) = \sum_{i=1}^{D} α_i κ(s, d_i),    (10)

where d_i are the elements of the dictionary selected as described before, while the mixing coefficients {α_i}_{i=1}^{D} are initialized at every filter and adapted independently together with {Θ_l}_{l=1}^{L}. κ(·, ·) in (10) is a generic kernel function, and we use the 1D Gaussian kernel in our experiments:

    κ(s, d_i) = exp{ −γ (s − d_i)^2 },    (11)

where γ ∈ R is a free parameter. Since in our context the effect of γ only depends on the sampling of the dictionary, we select it according to the rule-of-thumb proposed in [18]:

    γ = 1 / (6Δ^2),    (12)

where Δ is the distance between two consecutive elements of the dictionary.

In order to avoid numerical problems during the initial phase of learning, we initialize the mixing coefficients to mimic as closely as possible an exponential linear unit (ELU), following the strategy in [18]. Denoting by t a sampling of the ELU at the same positions as the dictionary, we initialize the coefficients as:

    α = (K + εI)^{-1} t,    (13)

where α is the vector of mixing coefficients, K ∈ R^{D×D} is the kernel matrix computed between t and the dictionary, and we add a diagonal term with ε > 0 to avoid degenerate solutions with very large mixing coefficients (see Fig. 1 for an example).

For more details on (10) we refer to the original publication [18]. Here, we briefly comment on some advantages of using this non-parametric formulation. First of all, (10) can be implemented easily in most deep learning libraries using vectorized operations, adding only a marginal overhead (constant in D). Second, the functions defined by (10) are smooth over their entire domain. Finally, the mixing parameters {α_i}_{i=1}^{D} can be handled like all other parameters, particularly by applying standard regularization techniques (e.g., ℓ2 or ℓ1 regularization).
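To make (10)-(13) concrete, here is a minimal PyTorch sketch of a KAF layer written by us; it is an assumption of how such a module could look, not the authors' code, and class, argument names and the value of ε are illustrative. It uses D = 20 elements in [−2, 2] as in Section IV, the Gaussian kernel of (11), γ from (12), and the ELU-mimicking initialization of (13).

```python
import numpy as np
import torch
import torch.nn as nn

class KAF(nn.Module):
    """Kernel activation function of Eq. (10) for a layer with `num_filters`
    outputs: fixed dictionary, trainable mixing coefficients (one set per filter)."""

    def __init__(self, num_filters, D=20, boundary=2.0, eps=1e-4):
        super().__init__()
        d = np.linspace(-boundary, boundary, D)               # fixed dictionary
        self.register_buffer('d', torch.tensor(d, dtype=torch.float32))
        self.gamma = float(1.0 / (6.0 * (d[1] - d[0]) ** 2))  # Eq. (12)

        # ELU-mimicking initialization of the mixing coefficients, Eq. (13)
        t = np.where(d > 0, d, np.exp(d) - 1.0)               # ELU sampled on the dictionary
        K = np.exp(-self.gamma * (d[:, None] - d[None, :]) ** 2)
        alpha0 = np.linalg.solve(K + eps * np.eye(D), t)      # (K + eps*I)^{-1} t
        self.alpha = nn.Parameter(
            torch.tensor(alpha0, dtype=torch.float32).repeat(num_filters, 1))

    def forward(self, s):
        # s has shape (..., num_filters); the expansion is applied element-wise
        ker = torch.exp(-self.gamma * (s.unsqueeze(-1) - self.d) ** 2)  # Eq. (11)
        return (ker * self.alpha).sum(dim=-1)                           # Eq. (10)
```

In the two-layer GCN of (7), one would then replace the inner g(·) with, e.g., KAF(16) for a hidden layer of 16 filters, leaving the outer softmax untouched.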


TABLE I
BRIEF DESCRIPTION OF THE DATASETS (FOR MORE DETAILS SEE THE TEXT AND [3], [11]).

Dataset    Nodes   Edges   Classes   Features   Labels
Citeseer   3327    4732    6         3703       3.60%
Cora       2708    5429    7         1433       5.20%

Fig. 1. Example of initializing a KAF according to the ELU function, shown with D = 8 and γ chosen according to (12).

IV. EXPERIMENTAL RESULTS

A. Experimental setup

We evaluate the proposed model on two semi-supervised learning benchmarks taken from [11], whose characteristics are reported in Table I. Both represent citation networks, where vertices x_n are documents (each one represented by a bag-of-words describing the relative frequencies of words), edges are links (corresponding to citations across documents), and each document can belong to a given class. The last column in Table I is the percentage of labeled nodes in the datasets.

The baseline network is a two-layer GCN as in (7), while in our proposed method we simply replace the inner activation functions with KAFs, with a dictionary of D = 20 elements sampled uniformly in [−2.0, 2.0]. We optimize a cross-entropy loss given by:

    L = − \sum_{l ∈ L} \sum_{c=1}^{C} Y_{lc} log( \hat{Y}_{lc} ).    (14)

The loss is optimized using the Adam algorithm [8] in a batch fashion. The labeled dataset is split following the same procedure as [11], where we use 20 nodes per class for training, 1000 elements for validation, and the remaining elements for testing the accuracy. The validation set is used for evaluating the loss at each epoch, and we stop as soon as the loss has not decreased with respect to the last 10 epochs. All hyper-parameters are selected in accordance with [11], as they were already optimized for the Cora dataset: 16 filters in the hidden layer, an initial learning rate of 0.01, and an additional ℓ2 regularization on the weights with weighting factor 5e−4. In addition, we use dropout (i.e., we randomly remove filters from the hidden layer during training) with probability 0.5. The splits for the datasets are taken from [3], and we average over 15 different initializations for the networks.

For the implementation, we extend the original code for the GCN.¹ Specifically, multiplication with the adjacency matrix in (7) is done using efficient sparse data structures, allowing to perform it in linear time with respect to the number of edges. All experiments are run in Python using a Tesla K80 as backend.

¹ https://github.com/tkipf/pygcn

B. Experimental results

The results (in terms of accuracy computed on the test set) of GCN and the proposed GCN with KAFs (denoted as KAF-GCN) are given in Table II. We also report the performance of six additional baseline semi-supervised algorithms taken from [3], [11], including manifold regularization (ManiReg), semi-supervised embedding (SemiEmb), label propagation (LP), skip-gram based graph embeddings (DeepWalk), iterative classification (ICA), and Planetoid [11]. References and a full description of all the baselines can be found in the original papers.

Note how, in both cases, the proposed KAF-GCN is able to outperform all the other baselines in a stable fashion (for KAF-GCN, we also report the standard deviation with respect to the different weight initializations). In particular, the results on the Cora dataset represent (to the best of our knowledge) the state-of-the-art when using such a small labeled dataset. It is important to underline that the gap in performance between GCN and KAF-GCN cannot be reduced by merely increasing the depth (or size) of the former, since its architecture is already fine-tuned to the specific datasets. In particular, [11, Appendix B] reports results showing that neither GCN, nor a variant with residual connections, can obtain a higher accuracy when adding more layers. In our opinion, this points to the importance of having adaptable activation functions able to more efficiently process the information coming from the different filters. Due to a lack of space, we are not able to show the shapes of the functions resulting from the optimization process, although they are found to be similar to those obtained by [18], and the interested reader is referred there.

In terms of training time, our implementation of KAF-GCN requires roughly 10% more computation per epoch than the standard GCN for the Cora dataset, and 15% more for the Citeseer one. However, we show in Fig. 2 the (average) loss evolution on the Cora dataset for the two models. We see that KAF-GCN generally converges faster than GCN, possibly due to the higher flexibility allowed by the network. Due to this, KAF-GCN requires a lower number of epochs to achieve convergence (as measured by the early stopping criterion), requiring on average 25/30 epochs less to converge, which more than compensates for the increased computational time per epoch.
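As a reference for the training procedure described above, the sketch below is ours and assumes a model in the style of the pygcn code (i.e., model(X, A_hat) returning log-probabilities); it shows full-batch Adam optimization of (14) with the early-stopping rule on the validation loss. For simplicity, the ℓ2 penalty is applied through Adam's weight decay on all parameters.

```python
import torch
import torch.nn.functional as F

def train(model, X, A_hat, y, idx_train, idx_val, patience=10, max_epochs=300):
    # Full-batch training with Adam (lr=0.01, weight decay 5e-4), stopping when
    # the validation loss has not decreased for `patience` consecutive epochs.
    opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    best_val, wait = float('inf'), 0
    for epoch in range(max_epochs):
        model.train()
        opt.zero_grad()
        out = model(X, A_hat)                                 # log-probabilities
        F.nll_loss(out[idx_train], y[idx_train]).backward()   # cross-entropy, Eq. (14)
        opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = F.nll_loss(model(X, A_hat)[idx_val], y[idx_val]).item()
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
```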


TABLE II
RESULTS IN TERMS OF ACCURACY OVER THE TEST SET. ALL BASELINES ARE TAKEN FROM [3], [11]. THE PROPOSED ALGORITHM IS DENOTED AS KAF-GCN. BEST RESULTS ARE HIGHLIGHTED IN BOLD.

Dataset    ManiReg   SemiEmb   LP     DeepWalk   ICA    Planetoid   GCN    KAF-GCN
Citeseer   60.1      59.6      45.3   43.2       69.1   67.7        70.3   70.9 ± 0.1
Cora       59.5      29.0      68.0   67.2       75.1   75.7        81.5   83.0 ± 0.2

Fig. 2. Loss evolution for GCN and KAF-GCN on the Cora dataset. We show the mean with a solid line and the standard deviation with a lighter color.

V. CONCLUSIONS

In this paper we have shown that the performance of graph convolutional networks can be significantly improved with the use of more flexible activation functions for the hidden layers. Specifically, we evaluated the use of kernel activation functions on two semi-supervised benchmark tasks, resulting in faster convergence and higher classification accuracy.

Two limitations of the current work are the use of undirected graph topologies, and the training in a batch regime. While [11] handles the former case by rewriting a directed graph as an equivalent (undirected) bipartite graph, this operation is computationally expensive. We plan on investigating recent developments on graph Fourier transforms for directed graphs [23] to define a simpler formulation. More generally, we aim to test non-parametric activation functions on other classes of graph neural networks (such as gated graph neural networks), which would allow processing sequences of graphs or multiple graphs at the same time.

ACKNOWLEDGMENT

Simone Scardapane was supported in part by Italian MIUR, GAUChO project, under Grant 2015YPXH4W 004.

REFERENCES

[1] A. Sandryhaila and J. M. Moura, “Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure,” IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80–90, 2014.
[2] Y. Yuan, L. Mou, and X. Lu, “Scene recognition by manifold regularized deep learning architecture,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2222–2233, 2015.
[3] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in Proc. 33rd International Conference on Machine Learning (ICML), 2016.
[4] P. Di Lorenzo, S. Barbarossa, P. Banelli, and S. Sardellitti, “Adaptive least mean squares estimation of graph signals,” IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 4, pp. 555–568, 2016.
[5] D. Romero, M. Ma, and G. B. Giannakis, “Kernel-based reconstruction of graph signals,” IEEE Trans. Signal Process., vol. 65, no. 3, pp. 764–778, 2017.
[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, 2017.
[7] F. Gama, G. Leus, A. G. Marques, and A. Ribeiro, “Convolutional neural networks via node-varying graph filters,” in Proc. 2018 IEEE Data Science Workshop (DSW), 2018, pp. 1–5.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[9] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in Proc. International Conference on Learning Representations (ICLR), 2014.
[10] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[11] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. International Conference on Learning Representations (ICLR), 2017.
[12] R. v. d. Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” arXiv preprint arXiv:1706.02263, 2017.
[13] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” arXiv preprint arXiv:1711.00740, 2017.
[14] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” arXiv preprint arXiv:1703.06103, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. on Comput. Vision (ICCV), 2015, pp. 1026–1034.
[16] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” in Proc. 30th International Conference on Machine Learning (ICML), 2013.
[17] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
[18] S. Scardapane, S. Van Vaerenbergh, S. Totaro, and A. Uncini, “Kafnets: kernel-based non-parametric activation functions for neural networks,” arXiv preprint arXiv:1707.04035, 2017.
[19] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, 2015, pp. 3546–3554.
[20] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML, vol. 3, 2013, p. 2.
[21] T. D. Bui, S. Ravi, and V. Ramavajjala, “Neural graph machines: Learning neural networks using graphs,” arXiv preprint arXiv:1703.04818, 2017.
[22] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. John Wiley & Sons, 2011.
[23] S. Sardellitti, S. Barbarossa, and P. Di Lorenzo, “On the graph Fourier transform for directed graphs,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 6, pp. 796–811, 2017.
