Improving Graph Convolutional Networks with Non-Parametric Activation Functions
2018 26th European Signal Processing Conference (EUSIPCO)

Simone Scardapane∗, Steven Van Vaerenbergh†, Danilo Comminiello∗ and Aurelio Uncini∗
∗Dept. Information Eng., Electronics and Telecomm., Sapienza University of Rome, Italy.
†Dept. Communications Eng., University of Cantabria, Spain.
Corresponding author email: [email protected]

Abstract—Graph neural networks (GNNs) are a class of neural networks that allow efficient inference on data associated to a graph structure, such as, e.g., citation networks or knowledge graphs. While several variants of GNNs have been proposed, they only consider simple nonlinear activation functions in their layers, such as rectifiers or squashing functions. In this paper, we investigate the use of graph convolutional networks (GCNs) combined with more complex activation functions, able to adapt from the training data. More specifically, we extend the recently proposed kernel activation function, a non-parametric model which can be implemented easily, can be regularized with standard ℓp-norm techniques, and is smooth over its entire domain. Our experimental evaluation shows that the proposed architecture can significantly improve over its baseline, while similar improvements cannot be obtained by simply increasing the depth or size of the original GCN.

I. INTRODUCTION

Efficient processing of graph-structured data (e.g., citation networks) has a range of different applications, going from bioinformatics to text analysis and sensor networks, among others. Of particular importance is the design of learning methods that are able to take into account both the numerical characteristics of each node in the graph and their inter-connections [1]. While graph machine learning techniques have a long history, we have recently witnessed a renewed interest in models that are defined and operate directly in the graph domain (as compared to a Euclidean space), instead of including the graph information only a posteriori in the optimization process (e.g., through manifold regularization techniques [2]), or in a pre-processing phase via graph embeddings [3]. Examples of native graph models include graph linear filters [4], their kernel counterparts [5], and graph neural networks (GNNs) [6], [7].

GNNs are particularly interesting because they promise to bring the performance of deep learning models [8] to graph-based domains. In particular, convolutional neural networks (CNNs) are nowadays the de-facto standard for processing image data. CNNs exploit the image structure by performing spatial convolutions (with a limited receptive field) on the image, thus increasing parameter sharing and lowering their complexity. A number of authors have recently explored the possibility of extending CNNs to the graph domain through several generalizations of the convolution operator, a trend which has been generically called 'geometric deep learning' [6]. In one of the first proposals [9], the graph Fourier transform was used at every layer of the network to perform filtering operations in a graph-frequency domain. However, this approach was not scalable, as it scaled linearly with the size of the graph. Defferrard et al. [10] extended this approach by using polynomial filters on the frequency components, which can be rewritten directly in the graph domain, avoiding the use of the graph Fourier transform. A further set of modifications was proposed by Kipf and Welling [11] (described in more depth in Section II), resulting in a generic graph convolutional network (GCN). GCNs have been successfully applied to several real-world scenarios, including semi-supervised learning [11], matrix completion [12], program induction [13], modeling relational data [14], and several others.

Like most neural networks, GCNs interleave linear layers, wherein information is adaptively combined according to the topology of the graph, with nonlinear activation functions that are applied element-wise. The information over the nodes can then be combined to obtain a graph-level prediction, or kept separate to obtain node-specific inferences (e.g., predictions on unlabeled nodes from a small set of labeled ones). Most of the literature on GCNs has worked with very simple choices for the activation functions, such as the rectified linear unit (ReLU) [11]. However, it is well known that the choice of the activation function can have a large impact on the final performance of the neural network. In particular, there is a large literature for standard neural networks on the design of flexible schemes for adapting the activation functions themselves from the training data [15]–[18]. These techniques range from simple parametrizations of known functions (e.g., the parametric ReLU [15]) to more sophisticated non-parametric families, including the Maxout network [16], the adaptive piecewise linear unit [17], and the recently proposed kernel activation function (KAF) [18].

Contribution of the paper: In this paper, we conjecture that properly choosing the activation function (beyond simple ReLUs) can further improve the performance of GCNs, possibly by a significant margin. To this end, we enhance the basic GCN model by extending KAFs to the activation functions of the filters. Each KAF models a one-dimensional activation function in terms of a simple kernel expansion (see the description in Section III and the illustrative sketch below). By properly choosing the elements of this expansion beforehand, we can represent each function with a small set of (linear) mixing coefficients, which are adapted together with the weights of the convolutional layers using standard back-propagation. Apart from flexibility, the resulting scheme has a number of advantages, including smoothness over its domain and the possibility of implementing the algorithm using highly vectorized GPU operations. We compare the models on two benchmarks for semi-supervised learning, showing that the proposed GCN with KAFs can significantly outperform all competing baselines with a marginal increase in computational complexity (which is offset by a faster convergence in terms of epochs).
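To make the idea concrete before the formal treatment of Section III, the snippet below sketches a one-dimensional KAF as a fixed dictionary of points combined through a kernel with a small vector of trainable mixing coefficients. The Gaussian kernel, the uniformly spaced dictionary, and all names and parameter values are illustrative assumptions in the spirit of [18], not the exact configuration used in this paper.

```python
import numpy as np

def kaf(s, alpha, dictionary, gamma=1.0):
    """One-dimensional kernel activation function (sketch): a weighted sum of
    Gaussian kernel evaluations between the pre-activation s and a fixed
    dictionary of points. Only the mixing coefficients alpha are trainable."""
    # Broadcast: each element of s is compared against every dictionary point.
    k = np.exp(-gamma * (s[..., None] - dictionary) ** 2)
    return k @ alpha

# Illustrative usage: D dictionary points spaced uniformly around zero, and D
# mixing coefficients (random here; learned by back-propagation in practice).
D = 20
dictionary = np.linspace(-2.0, 2.0, D)
alpha = 0.1 * np.random.randn(D)
pre_activations = np.random.randn(5, 3)   # e.g., outputs of a (graph) linear layer
outputs = kaf(pre_activations, alpha, dictionary)
print(outputs.shape)                      # (5, 3): the function is applied element-wise
```

Because the dictionary is fixed and shared, each activation function is fully described by its D mixing coefficients, which is what makes standard ℓp-norm regularization and highly vectorized implementations straightforward.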
Structure of the paper: Section II introduces the problem of inference over graphs and the generic GCN model. The proposed extension with KAFs is described in Section III. We evaluate and compare the models in Section IV before concluding in Section V.

II. GRAPH CONVOLUTIONAL NETWORKS

A. Problem setup

Consider a generic undirected graph G = (V, E), where V = {1, ..., N} is the set of N vertices, and E ⊆ V × V is the set of edges connecting the vertices. The graph is equivalently described by the (weighted) adjacency matrix A ∈ R^{N×N}, whose generic element a_{ij} > 0 if and only if nodes i and j are connected. A graph signal [1] is a function f : V → R^F associating to vertex n an F-dimensional vector x_n (each dimension being denoted as a channel). For a subset L ⊂ V of the vertices we also have available a node-specific label y_n, which can be either a real number (graph regression) or a discrete quantity (graph classification). Our task is to find the correct labels for the (unlabeled) nodes that are not contained in L. Note that if the graph is unknown and/or must be inferred from the data, this setup is equivalent to standard semi-supervised (or, more precisely, transductive) learning.

Ignoring for a moment the graph information, we could solve the problem by training a standard (feedforward) NN to predict the label y from the input x [8]. An NN is composed by stacking L layers, such that the operation of the l-th layer can be described by:

    h_l = g(W_l h_{l−1}) ,    (1)

where h_{l−1} ∈ R^{N_{l−1}} is the input to the layer, W_l ∈ R^{N_l × N_{l−1}} are trainable weights (ignoring for simplicity any bias term), and g(·) is an element-wise nonlinear function known as the activation function (AF). The NN takes as input x = h_0, providing y = h_L ∈ R^C as the final prediction. A number of techniques can be used to include unlabeled information in the training of (1), e.g., the manifold regularization techniques mentioned in Section I [2]. Here, we instead consider networks that exploit the graph structure directly in their architecture, similarly to the way convolutional layers exploit the structure of images in CNNs [8]. In order to extend this idea to general unweighted graphs, we need some additional tools from the theory of graph signal processing.

B. Graph Fourier transform and graph neural networks

In order to define a convolutional operation in the graph domain, we can exploit the normalized Laplacian matrix L = I − D^{−1/2} A D^{−1/2}, where D is a diagonal matrix with D_{nn} = Σ_{t=1}^{N} A_{tn}. We denote the eigendecomposition of L as L = U Λ U^T, where U is a matrix collecting column-wise the eigenvectors of L, and Λ is a diagonal matrix with the associated eigenvalues. We further denote by X ∈ R^{N×F} the matrix collecting the input signal across all nodes, and by X_i ∈ R^N its i-th column. The graph Fourier transform of X_i can be defined as [1]:

    X̂_i = U^T X_i ,    (2)

and, because the eigenvectors form an orthonormal basis, we can also define an inverse Fourier transform as X_i = U X̂_i.
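The following is a minimal numpy sketch of these definitions on a toy graph, assuming a small symmetric adjacency matrix with no isolated nodes; the example graph and variable names are illustrative only.

```python
import numpy as np

# Toy symmetric adjacency matrix for N = 4 nodes (illustrative only).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
N = A.shape[0]

# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, with D_nn the degree of node n.
d = A.sum(axis=0)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition L = U Lambda U^T (L is symmetric, so eigh applies).
lam, U = np.linalg.eigh(L)

# Graph Fourier transform of a single-channel signal X_i (Eq. (2)) and its inverse.
X_i = np.random.randn(N)
X_hat = U.T @ X_i
X_rec = U @ X_hat
print(np.allclose(X_rec, X_i))   # True: the eigenvectors form an orthonormal basis
```

The convolutional layer of Eq. (3) below filters the signal in this spectral domain before transforming it back to the graph domain.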
We can exploit the graph Fourier transform to define a convolutional layer operating on graphs as [9]:

    H = g( Σ_{i=1}^{F} U h(Λ; Θ_i) U^T X_i ) ,    (3)

where h(Λ; Θ_i) is a (channel-wise) filtering operation acting on the eigenvalues, having adaptable parameters Θ_i. The previous operation can then be iterated to obtain successive representations H_2, H_3, ..., up to the final node-specific predictions. The choice of h(·) determines the complexity of training. In particular, [10] proposed to make the problem tractable by working with polynomial filters over the eigenvalues:

    h(Λ; Θ_i) = Σ_{k=0}^{K−1} Θ_{ik} T_k(Λ̃) ,    (4)

where Λ̃ = 2Λ/λ_max − I (with λ_max being the largest eigenvalue of L), and T_k(·) is the Chebyshev polynomial of order k, defined by the recurrence relation:

    T_k(Λ̃) = 2 Λ̃ T_{k−1}(Λ̃) − T_{k−2}(Λ̃) ,    (5)

with T_0(Λ̃) = 1 and T_1(Λ̃) = Λ̃. The filtering operation defined by (4) is advantageous because each filter is parameterized with only K values (with K chosen by the user).
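As a rough illustration of Eqs. (3)–(5), the sketch below applies a Chebyshev polynomial filter to the eigenvalues of the toy Laplacian from the previous snippet, for a single input channel (F = 1). The random coefficients, the choice K = 3, and the use of a ReLU for g(·) are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def chebyshev_filter(lam, theta):
    """Polynomial filter h(Lambda; theta) of Eq. (4): a Chebyshev expansion of the
    rescaled eigenvalues, parameterized by the K coefficients in theta."""
    lam_tilde = 2.0 * lam / lam.max() - 1.0                 # rescale eigenvalues to [-1, 1]
    T_prev, T_curr = np.ones_like(lam_tilde), lam_tilde     # T_0 and T_1
    out = theta[0] * T_prev + (theta[1] * T_curr if len(theta) > 1 else 0.0)
    for k in range(2, len(theta)):
        # Recurrence of Eq. (5): T_k = 2 * lam_tilde * T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * lam_tilde * T_curr - T_prev
        out = out + theta[k] * T_curr
    return out                                              # one filtered value per eigenvalue

# Toy graph from the previous sketch, single channel (F = 1), K = 3 coefficients.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
d = A.sum(axis=0)
L = np.eye(4) - np.diag(1.0 / np.sqrt(d)) @ A @ np.diag(1.0 / np.sqrt(d))
lam, U = np.linalg.eigh(L)
X_i = rng.standard_normal(4)
theta = rng.standard_normal(3)                              # learned by back-propagation in a GCN
H = np.maximum(0.0, U @ np.diag(chebyshev_filter(lam, theta)) @ U.T @ X_i)   # Eq. (3) with g = ReLU
print(H.shape)                                              # (4,): one value per node
```

As recalled in Section I, the same polynomial filter can be rewritten directly in the graph domain, avoiding the explicit eigendecomposition used in this didactic sketch; this reformulation is what makes the approach of [10] scalable.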