Towards Learning Powerful Deep Graph Neural Networks and Embeddings

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Saurabh Verma

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Professor Zhi-Li Zhang, Advisor

June, 2020

© Saurabh Verma 2020. All Rights Reserved.

Acknowledgements

Started in the year 2008, this has been such a fun and exciting journey. Looking back now, I see that I have grown so much and have learned so many enjoyable life lessons. This happy journey would not have been possible without the great mentors and good friends in my life.

I am one of the lucky people who is blessed with great mentors. Many of the proudest moments of my life, and this thesis, would not have been possible without my advisor, Zhi-Li Zhang. Undoubtedly, he is one of the great advisors who not only sharpened my research skills but always inspired me to do great research work. He is also one of the most enthusiastic people I have ever met, as can be seen from his passion for reading all kinds of mathematics books just for fun. The freedom to pursue wild ideas and his genuine care for his students' growth are his most admirable qualities. On several occasions, he went above and beyond to help me both academically and personally, for which I will always be grateful to him.

Another great mentor and inspiration in my life is Estevam Hruschka, who first introduced me to the world of research. He was the first person to show me that research is fun, and he worked with me to publish my first paper. Without him, I wouldn't be here.

Some mentors who will always have a special place in my life: Prabhat, Negi and Rakesh Sir; Hrishikesh Sharma; Jun and Raj; Aude, Nima and Shaili; Saurabh, Subhojit and Xia.

A big thanks goes to my thesis committee members, who not only helped me in polishing this thesis but also played a major role in boosting my research career in many ways. In particular, I was lucky to meet Prof. Jaideep Srivastava, who provided early guidance and helped me lay a strong foundation for my research career from the very beginning. Prof. Georgios Giannakis's deep knowledge of the field inspired me to work hard and keep exploring it. Prof. Abhishek Chandra's thought-provoking questions helped me realize the big picture.

The other half of this journey was made memorable by my dear friends. Old friends who have been there from the very beginning: Saurabh, Rohit, Arpan, Baba, JP, Prolok, Prithvi, Harish, Chacha, Ankit, Tatu, Kanjad, Jade, Bhaiaya, PCC, Soni, Deval, GG, Captain, Kunu, Ravi, Hema, Junglee, plus many more Tronix friends. New friends who cheered me on and kept the fun going: Guru, Anurag, Shalini, Taluk, Arvind, Pariya, Cheng, Braulio, Yang, Golshan, Hesham, Xinyue, Avinash and Malina.

Lastly, I would like to acknowledge the grants that supported my research: NSF grants CNS 1618339, CNS 1617729, CNS1814322 and CNS183677; US DoD DTRA grants HDTRA1-09-1-0050 and HDTRA1-14-1-0040; and ARO MURI Award W911NF-12-1-0385.

Dedication

To my loving parents, brother and sister, for always believing in and supporting me.

Abstract

Learning powerful data embeddings has recently become the core of many algorithms, especially in the natural language processing and computer vision domains. In the graph domain, the applications of learning graph embeddings are vast and span multiple areas such as bioinformatics, chemoinformatics, social networks and recommendation systems. To date, the graph remains the most fundamental data structure that can represent many forms of real-world datasets. However, due to its rich but complex structure, the graph presents a significant challenge in forging powerful graph embeddings. Even standard techniques such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) are not capable of operating on data lying beyond a 1D sequence (say, of words) or a 2D pixel grid (of images) and therefore cannot generalize to arbitrary graph structures. Recently, Graph Neural Networks (GNNs) have been proposed to alleviate such limitations, but the current state is far from mature in both theory and applications.

To that end, this thesis aims at developing powerful models for solving a wide variety of real-world problems on graphs. We study some of the major approaches for devising graph embeddings, namely graph kernels or spectra and GNNs. We expose and tackle some of their fundamental weaknesses and contribute several novel state-of-the-art graph embedding models. These models achieve superior performance over existing methods on many real-world problems on graphs, such as node classification, graph classification and link prediction, and come with desirable theoretical guarantees. We first study the capabilities of graph kernel or spectrum approaches for yielding powerful graph embeddings in terms of uniqueness, stability, sparsity and computational efficiency. Second, we propose the Graph Capsule Neural Network, which can yield powerful graph embeddings by capturing much more of the information encoded in the graph structure than existing GNNs. Third, we devise the first universal and transferable GNN, thus making transfer learning possible in the graph domain. With this particular GNN, graph embeddings can be shared and transferred across different models and domains, reaping the huge benefits of transfer learning. Lastly, there is a dearth of theoretical explorations of GNN models, such as their generalization properties. We take the first step towards developing a deeper theoretical understanding of GNN models by analyzing their stability and deriving their generalization guarantees. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive related generalization bounds for GNN models.

In summary, this thesis contributes several state-of-the-art graph embedding models and novel graph theory, specifically (i) a powerful graph embedding called the Family of Graph Spectral Distances (Fgsd), (ii) a highly informative GNN called the Graph Capsule Neural Network (GCAPS), (iii) a universal and transferable GNN called the Deep Universal and Transferable Graph Neural Network (Dugnn), and (iv) stability theory and generalization guarantees for GNNs.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
1.1 Core of Machine Learning: Data Embeddings
1.2 Thesis Statement
1.3 Thesis Outline and Original Contributions
1.4 Bibliographic Notes

2 Background and Motivation
2.1 Background
2.1.1 Learning Powerful Data Embeddings
2.1.2 Standard Deep Learning Techniques
2.1.3 Graph Neural Networks
2.1.4 Graph Kernels
2.2 Motivation
2.2.1 Limitation of Existing Graph Embedding Models

3 Learning Unique, Stable, Sparse and Computationally Fast Graph Embeddings
3.1 Introduction
3.2 Our Graph Spectrum Approach
3.3 Related Work
3.4 Family of Graph Spectral Distances and Graph Spectrum
3.5 Uniqueness of Family of Graph Spectral Distances and Embeddings
3.6 Unifying Relationship Between FGSD and Graph Embedding and Dimension Reduction
3.7 Stability of Family of Graph Spectral Distances and Embeddings
3.8 Sparsity of Family of Graph Spectral Distances and Embeddings
3.9 Fast Computation of Family of Graph Spectral Distances and Embeddings
3.10 Experiments and Results
3.11 Conclusion

4 Learning Highly Informative Graph Embeddings With Graph Capsule Neural Networks
4.1 Introduction
4.2 Related Work
4.3 Graph Capsule CNN Model
4.4 Graph Capsule Networks
4.5 Designing Graph Permutation Invariant Layer
4.5.1 Problems with Max-Sort Pooling Layer
4.5.2 Covariance as Permutation Invariant Layer
4.6 Designing GCAP-CNN with Global Features
4.7 Experiment and Results
4.8 Conclusion

5 Learning Universal and Transferable Graph Neural Network Embeddings
5.1 Introduction
5.2 Related Work
5.2.1 Input Layer
5.2.2 Universal Graph Encoder
5.2.3 Multi-Task Graph Decoder
5.3 Experiment and Results
5.4 Ablation Studies and Discussion
5.5 Conclusions

6 Stability and Generalization Guarantees of Graph Neural Networks
6.1 Introduction
6.2 Related Work
6.3 Graph Capsule & Graph Convolution Neural Networks
6.4 Main Result
6.5 Preliminaries
6.6 Uniform Stability of GCAPS & GCNN Models
6.7 Revisiting GCAPS & GCNN Model Architecture
6.8 Experiment and Results
6.9 Conclusion

7 Conclusion and Future Directions
7.1 Conclusion
7.2 Open Problems and Future Directions

Bibliography

8 Appendix
8.1 Chapter 3: Appendix
8.1.1 Proof of Theorem 1
8.1.2 Proof of Theorem 2
8.1.3 Proof of Theorem 3
8.1.4 Proof of Theorem 4
8.1.5 Proof of Theorem 5
8.1.6 Experiments and Results
8.2 Chapter 5: Appendix
8.2.1 Proof of Theorem 9
8.2.2 Experiment Baselines Settings

List of Tables

3.1 Fgsd complexity comparison with a few strong state-of-the-art algorithms (showing only the variables that depend on N and |E|). It reveals that the Fgsd complexity is better than most.

3.2 Classification accuracy on unlabeled bioinformatics datasets. Results in bold indicate all methods with accuracy within 2.0 of the top result, and blue color (for a gap > 2.0) indicates a new state-of-the-art result. Green color highlights the best computation time when it is 5x faster (among the methods mentioned). 'OMR' is an out-of-memory error; '> D' means the computation exceeded 24 hours.

3.3

3.4

3.5 Left side: classification accuracy on datasets, where Fgsd significantly outperforms other methods. Right side: classification accuracy on labeled bioinformatics datasets. * emphasizes that Fgsd did not utilize any node labels.

4.1 Classification accuracy on bioinformatics datasets. Results in bold indicate the best reported classification accuracy. The top half of the table compares results with various deep learning approaches, while the bottom half compares results with graph kernels. '> 1 day' means the computation exceeded 24 hours. 'OMR' is an out-of-memory error.

4.2 Classification accuracy on social network datasets. Results in bold indicate the best reported classification accuracy. The top half of the table compares results with various deep learning approaches, while the bottom half compares results with graph kernels. '> 1 day' means the computation exceeded 24 hours. 'OMR' is an out-of-memory error.

5.1 Graph classification accuracy on bioinformatics datasets. Results in bold indicate the best reported accuracy. The top half of the table compares results with graph kernels (GK), while the bottom half compares results with graph neural networks (GNN). *On the D&D dataset, we omit computing the adjacency reconstruction loss due to GPU memory constraints.

5.2 Ablation study of the Universal Graph Encoder on a quantum mechanics dataset. Dugnn - ℓA - ℓK (the model trained from scratch without the multi-task decoder) sets a new state-of-the-art result on the QM8 dataset.

5.3 Ablation study of transfer learning. Dugnn is the base model trained from scratch on both the NCI1 and PTC datasets via transfer learning. Dugnn - NCI1 / PTC represent models trained from scratch on the individual datasets.

5.4 Ablation study of the multi-task decoder. Dugnn - ℓ(·) represents the model trained from scratch without the ℓ(·) loss function. We pick one of the cross-validation splits to report the accuracy, and all hyper-parameters, random seeds and data splits were kept constant across the ablation experiments.

5.5 Ablation study of the supervised adaptive graph kernel loss. Dugnn is the base model trained with the non-adaptive kernel loss function. Dugnn - ℓK^(unsup) + ℓK^(sup) is trained with the adaptive loss in place of the non-adaptive graph kernel loss.

8.1 Bounds on Θ.

8.2 Preliminary experiment: classification accuracy on a few bioinformatics datasets. The harmonic-based feature space yields higher accuracy than the biharmonic one due to sparseness.

List of Figures

1.1 The figure shows the overall contributions of this dissertation to the graph embedding literature. The highlighted boxes represent the original thesis contributions along two directions, namely Models and Theory. The references in the figure are [1], [2], [3] and [4].

2.1 Learning different kinds of data embeddings, such as word embeddings or image embeddings.

2.2 Numerous applications of graph embeddings in different domains.

2.3 Learning graph embeddings solves many important problems on graphs.

2.4

2.5 Popular forms of representing data, such as sequences of words, images or time series, are instances of regular graphs.

3.1 Graph generation model: the graph spectrum is assumed to be encoded in pairwise node distances, which are generated from some distribution. Nodes connect together to form a graph in such a way that the pairwise node distances are preserved (e.g., a node pair with distance 0.75 is preserved even though the two nodes are not directly connected).

3.2 The figure shows the number of unique elements present in R formed by different f-spectral distances on all graphs with |V| = 9 (261,080 graphs in total). Graph enumeration indices are sorted according to R(1/λ)|G. We can observe that f(λ) = 1/λ increases in the form of a step function and lower bounds all other f(λ) up to an additive constant. (Best viewed in color when zoomed in.)

3.3 Harmonic distance based graph feature matrix (matrix sparsity = 97.12%). A blue dot indicates a feature value > 0.

3.4 Biharmonic distance based graph feature matrix (matrix sparsity = 94.28%). A blue dot indicates a feature value > 0.

3.5 Harmonic distance based feature matrix sparsity shown per class label.

3.6 Biharmonic distance based feature matrix sparsity shown per class label.

4.1 The figure shows that the graph capsule function at node 0 computes a capsule vector which encodes higher-order statistical information about its local neighborhood (per feature). Here {x0, x1, x2, x3} are the respective node feature values. For example, when a node has no more than two neighbors, it is possible to recover the input neighbor values from the first three statistical moments.

5.1 The figure shows two pairs of isomorphic graphs sampled from real, and more interestingly different, bioinformatics and quantum mechanics datasets, namely NCI1, MUTAG, PTC and QM8, suggesting the importance of learning universal graph embeddings and performing transfer learning and multi-tasking (for learning more generalized embeddings).

5.2 The figure shows the overall architecture of the Dugnn model in the form of tensor transformations. Starting from the left, we have a graph G(V, E) with node feature matrix X and adjacency matrix A. X is first transformed into a consistent feature dimension via the Input Transformer. Next, the Universal Graph Encoder computes the graph embedding z and output Y, which is passed down to the decoder. Our Multi-Task Graph Decoder minimizes graph kernel losses and an adjacency matrix reconstruction loss, along with an optional supervised task loss, for joint end-to-end learning.

5.3 Two regular graphs G1 and G2.

5.4 Some MUTAG graph data samples.

6.1 The figures show the generalization gap for three datasets. The generalization gap is measured with respect to the loss function, i.e., |training error - test error|. In this experiment, the cross-entropy loss is used.

6.2 The figures show the divergence in the weight parameters of a single-layer GCNN measured using the L2 norm on the three datasets. We surgically alter one sample point at index i = 0 in the training set S to generate S^i and run the SGD algorithm.

Chapter 1

Introduction

1.1 Core of Machine Learning: Data Embeddings

Existing Data Embeddings: Learning powerful data embeddings has become a core piece of machine learning for producing superior results. This trend of learning embeddings from data can be attributed to the huge success of word2vec [5, 6], with its unprecedented real-world performance in natural language processing (NLP). The importance of extracting high quality embeddings has now been realized in many other domains, such as computer vision (CV) and recommendation systems [7–9]. The crux of these embeddings is that they are pretrained in an unsupervised fashion on huge amounts of data, and thus can potentially capture all kinds of contextual information contained in the data. Such embedding learning is further advanced by the incorporation of multi-task learning and transfer learning, which allow more generalized embeddings to be learned across different datasets. These developments have brought major breakthroughs in natural language processing; decaNLP [10] and BERT [11] are recent prime examples. However, there are domains, in particular graphs, where the potential of designing powerful data embeddings is far from mature in both theory and applications.

Graph Embeddings: One of the most versatile and rich ways of capturing structural information is through a graph representation. Graphs are a fundamental data structure that can represent many forms of real-world datasets. However, forging data embeddings in the graph domain remains challenging due to its highly complex structure, and thus the potential of realizing powerful graph embeddings is a wide open frontier.

1.2 Thesis Statement

This thesis aims at developing powerful and theoretically guaranteed graph embedding models for solving a wide variety of real-world problems on graphs.

To that end, it contributes several state-of-the-art graph embedding models that can be trusted due to their provable theoretical guarantees and that achieve superior performance on many real-world graph tasks such as node classification, graph classification and link prediction.

1.3 Thesis Outline and Original Contributions

We start by providing the necessary background and the overall motivation of this dissertation in Chapter 2. The remaining outline and the major contributions of this dissertation to the graph embedding literature are shown in Figure 1.1 and discussed below in the given order.

Figure 1.1: The figure shows the overall contributions of this dissertation to the graph embedding literature. The highlighted boxes represent the original thesis contributions along two directions, namely Models and Theory. The references in the figure are [1], [2], [3] and [4].

(Chapter 3) Learning Unique, Stable, Sparse and Computationally Fast Graph Embeddings: In this chapter, we hunt for a graph embedding that exhibits certain uniqueness, stability and sparsity properties while also being amenable to fast computation for the purpose of learning on graphs. This leads to the discovery of the family of graph spectral distances (denoted Fgsd) and the graph feature representations based on them, which we prove to possess most of these desired properties. To both evaluate the quality of the graph embeddings produced by Fgsd and demonstrate their utility, we apply them to the graph classification problem. Through extensive experiments, we show that a simple SVM based classification algorithm, driven by our powerful Fgsd based graph embedding, significantly outperforms all the more sophisticated state-of-the-art algorithms on the datasets without node labels in terms of both accuracy and speed; it also yields very competitive results on the labeled datasets, despite the fact that it does not utilize any node label information.

(Chapter 4) Learning Highly Informative Graph Embeddings with the Graph Capsule Neural Network: In this chapter, we expose and tackle some of the basic weaknesses of the Graph Convolutional Neural Network (GCNN) models popularly utilized for yielding graph embeddings, using the capsule idea presented in [12], and propose our Graph Capsule Network (GCAPS-CNN) model. In addition, we design our GCAPS-CNN model to solve, in particular, the graph classification problem, which current GCNN models find challenging. Through extensive experiments, we show that our proposed Graph Capsule Network significantly outperforms both the existing state-of-the-art deep learning methods and graph kernels on graph classification benchmark datasets.

(Chapter 5) Learning Universal and Transferable Graph Neural Network Embeddings: In this chapter, we present the first powerful and theoretically guaranteed graph neural network designed to learn task-independent graph embeddings, hereafter referred to as deep universal graph embedding (Dugnn). Our Dugnn model incorporates a novel graph neural network (as a universal graph encoder) and leverages rich graph kernels (as a multi-task graph decoder) for both unsupervised learning and (task-specific) adaptive supervised learning. By learning task-independent graph embeddings across diverse datasets, Dugnn also reaps the benefits of transfer learning. Through extensive experiments and ablation studies, we show that the proposed Dugnn model consistently outperforms both the existing state-of-the-art GNN models and graph kernels by an increased accuracy of 3% to 8% on graph classification benchmark datasets.

(Chapter 6) Theoretical Guarantees on Graph Neural Network Embeddings: In this chapter, we take a first step towards developing a deeper theoretical understanding of GCNN models by analyzing the stability of single-layer GCNN models and deriving their generalization guarantees in a semi-supervised graph learning setting. In particular, we show that the algorithmic stability of a GCNN model depends upon the largest absolute eigenvalue of its graph convolution filter. Moreover, to ensure the uniform stability needed to provide strong generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new insights on the design of new and improved graph convolution filters with guaranteed algorithmic stability. We evaluate the generalization gap and stability on various real-world graph datasets and show that the empirical results indeed support our theoretical findings. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models.
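To make the stability criterion above concrete, the following minimal sketch (illustrative only, not code from this dissertation) computes the largest absolute eigenvalue of a graph convolution filter; the symmetrically normalized filter with self-loops is an assumed example of a filter whose largest absolute eigenvalue stays at 1 regardless of graph size, which is the regime the analysis favors.

import numpy as np

def filter_max_abs_eigenvalue(A):
    # Assumed example filter: g(A) = D^{-1/2} (A + I) D^{-1/2}, the symmetrically
    # normalized adjacency with self-loops used by many GCNN variants.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    g = d_inv_sqrt @ A_hat @ d_inv_sqrt
    # The stability analysis tracks this quantity: it should not grow with graph size.
    return float(np.max(np.abs(np.linalg.eigvalsh(g))))

# A 4-node path graph; the result is 1.0, independent of the number of nodes.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(filter_max_abs_eigenvalue(A))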

Lastly, in Chapter 7, we make concluding remarks and discuss future directions.

1.4 Bibliographic Notes

The findings of Chapter 3 on learning unique, stable, sparse and computationally fast graph embeddings have been published in the conference paper "Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs", which appeared in the Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017) [1]. The contributions made in Chapter 4 have been published in the paper "Graph Capsule Convolutional Neural Networks", presented at the Joint ICML and IJCAI Workshop on Computational Biology, Stockholm, Sweden, 2018 [2]. The work in Chapter 5 has been collected in the article "Towards Learning Universal and Transferable Graph Neural Network Embeddings", with a preprint available on arXiv [3]. Lastly, the contributions of Chapter 6 have been published in the conference paper "Stability and Generalization of Graph Convolutional Neural Networks", which appeared in the Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2019) [4].

Chapter 2

Background and Motivation

2.1 Background

2.1.1 Learning Powerful Data Embeddings

In recent years, much effort has been devoted to improving the data embeddings learned through various machine learning algorithms. As shown in Figure 2.1, this recent trend can be attributed to the immense success of word2vec embeddings [5] in natural language processing, followed by the success of learning powerful image embeddings in computer vision. Deep learning techniques have risen especially due to their strong ability to yield such powerful data embeddings, and they further allow embeddings to be shared across different models and domains via transfer learning.

Figure 2.1: Learning different kinds of data embeddings, such as word embeddings or image embeddings.

Figure 2.2: Numerous applications of graph embeddings in different domains.

Figure 2.3: Learning graph embeddings solves many important problems on graphs.

Likewise, the applications of learning powerful graph embeddings in the graph domain are vast. Many real-world forms of data are represented by graphs, as shown in Figure 2.2; examples include social networks, biological networks, chemical networks, web networks, transportation networks, knowledge graphs and brain functional networks. Furthermore, learning graph embeddings allows us to solve many real-world graph related problems, as illustrated in Figure 2.3: node classification for detecting community structure in social networks, graph classification for identifying characteristics of molecules, sequence learning on graphs for predicting traffic in a network [13, 14], link prediction for identifying hidden links in brain functional networks, graph generation for creating super drug molecules, building knowledge graphs, computer networks [4, 15–26], education data mining [27–29] and climate applications [30–46]. However, learning graph embeddings remains challenging due to the rich and complex structure of graphs, as evident from Section 2.2.1. Before delving into the motivation of this thesis, we discuss the difficulty of applying existing machine learning algorithms, especially deep learning, on graphs in order to yield graph embeddings.

2.1.2 Standard Deep Learning Techniques

The past decade has witnessed rapid advancement in the design of deep learning techniques, with a focus on applications in natural language processing, computer vision and recommendation systems. Standard deep learning techniques such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs) can operate either on 1D sequences of elements, like words or time series, or on images, which live on a 2D pixel grid, as shown in Figure 2.4. These data structures can be viewed as regular graphs, since the number of neighbors is the same for every node in the graph. As a result, the application of most deep learning techniques is limited to datasets that can be represented in such a regular data structure. For instance, the widely used time series data structure is a 1D sequence; in a specific example, the authors in [39] applied a multivariate time series classification algorithm via neural networks that learns discriminative patterns and demonstrated its utility on the real-world problem of mapping burned areas.

However, in the real world most datasets lie beyond a 1D sequence, a 2D grid or any regular structure. For example, social networks are represented through arbitrary graph structures with no restrictions on the number of neighbors of a node. This presents a significant challenge for applying standard deep learning and requires a more general deep learning model that adapts to irregular data structures, or simply put, applies to graphs. In the past few years, a new class of deep learning models called Graph Neural Networks (GNNs) has appeared on the horizon and is capable of yielding graph embeddings for any arbitrary graph structure. We discuss the rise of graph neural networks in deep learning and their implications for learning powerful graph embeddings in the next section.

Figure 2.4

Figure 2.5: Popular forms of representing data, such as sequences of words, images or time series, are instances of regular graphs.

2.1.3 Graph Neural Networks

Building upon the success of deep learning on images and words, graph neural network (GNN) based embeddings have recently been developed for various learning tasks on graph-structured datasets [47–50]. These convolutional networks on graphs are now commonly known as Graph Convolutional Neural Networks (GCNNs) or simply Graph Neural Networks (GNNs). The original idea of defining graph convolution operations comes from the graph signal processing domain [51]; it has since been recognized as the problem of learning filter parameters that appear in the graph Fourier transform in the form of a graph Laplacian [47, 48]. Various GCNN models such as [50, 52, 53] have been proposed, where traditional graph filters are replaced by a graph adjacency matrix with self-loops and the outputs of each neural network layer are computed using a propagation rule while updating the network weights. The authors in [49] extend such GCNN models by utilizing fast localized spectral filters and efficient pooling operations. A very different approach is proposed in [54], where a set of local nodes is converted into a sequence in order to create receptive fields, which are then fed into a 1D convolutional neural network.
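To make the propagation rule just described concrete, here is a minimal single-layer sketch in the spirit of the GCNN models above; the symmetric normalization with self-loops and the ReLU nonlinearity are illustrative assumptions rather than the exact formulation of any one cited model.

import numpy as np

def gcn_layer(A, X, W):
    # Add self-loops and symmetrically normalize the adjacency matrix.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
    # Propagation rule: aggregate neighboring node features, apply weights, then ReLU.
    return np.maximum(A_norm @ X @ W, 0.0)

# Toy graph with 4 nodes, 3 input features and 2 output channels.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
H = gcn_layer(A, X, W)   # shape (4, 2): one new embedding per node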

Another popular name for GCNNs is message passing neural networks (MPNNs) [55–58]. Though the authors in [56] suggest that GCNNs are a special case of MPNNs, we believe that the two are equivalent models in a certain sense; it is simply a matter of how the graph convolution operation is defined. In MPNNs, the hidden state of each node is updated in every iteration based on the messages received from its neighbors as well as the value of its previous hidden state. This is made possible by replacing the traditional neural network in a GCNN with a small recurrent neural network (RNN) whose weight parameters are shared across all nodes in the graph. Note that the number of iterations in an MPNN can be related to the depth of a GCNN model. In [59], the authors propose to condition the learning parameters of filters on edges rather than on nodes. This approach is similar to some instances of MPNNs, such as [56], where learning parameters are also associated with edges. All the above MPNN models employ aggregation as the graph permutation invariant layer for solving the graph classification problem. In contrast, the authors in [60, 61] employ a max-sort pooling layer and group theory to achieve graph permutation invariance.

2.1.4 Graph Kernels

We describe another popular approach for solving learning problems on graphs, namely graph kernels, although it does not directly produce graph embeddings in the process. In graph kernels, a given graph G is decomposed into (possibly different) sub-structures {Gs}. The graph kernel K(G1, G2) is defined based on the frequency of each sub-structure appearing in G1 and G2 respectively, i.e., K(G1, G2) = ⟨fGs1, fGs2⟩, where fGs is the vector containing the frequencies of the {Gs} sub-structures. Much work has gone into deciding which sub-structures are more suitable than others; among them are graphlets [62, 63], random walks or shortest paths [64, 65], and the Weisfeiler-Lehman subtree kernel [66]. Deep graph kernels [67], graph invariant kernels [68] and the multiscale Laplacian graph kernel [69] instead focus on re-defining kernel functions to appropriately measure sub-structural similarity at different levels. Another line of this work goes into efficiently computing these kernels, either by exploiting some structural dependency, or through approximation or randomization [70–72].
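As a toy illustration of the frequency-vector form of a graph kernel described above (and not any specific kernel from the cited literature), the sketch below uses node degrees as the sub-structures being counted; real kernels substitute graphlets, walks or subtrees.

import numpy as np

def substructure_frequencies(A, max_degree=10):
    # f_Gs: frequency vector of the chosen sub-structures (here, simply node degrees).
    degrees = A.sum(axis=1).astype(int)
    return np.bincount(degrees, minlength=max_degree + 1)

def graph_kernel(A1, A2):
    # K(G1, G2) = <f_Gs1, f_Gs2>: inner product of the two frequency vectors.
    return float(np.dot(substructure_frequencies(A1), substructure_frequencies(A2)))

triangle = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
path = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(graph_kernel(triangle, path))   # 3.0: both graphs share degree-2 nodes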

2.2 Motivation

To fully develop powerful graph embedding models, we need to overcome some of the major limitations present in existing graph embedding models. Hence, we motivate this thesis by exposing and tackling some of the fundamental weaknesses present in the graph kernel and graph neural network approaches.

2.2.1 Limitation of Existing Graph Embedding Models

Limitations of the Graph Kernel Approach: Most of the earlier work on solving learning problems on graphs is accomplished through graph kernels. However, graph kernels bypass computing any explicit graph embedding and only yield a kernel matrix in the process. As a result, graph embeddings cannot be extracted directly and hence cannot be consumed by non-kernel based machine learning algorithms. In addition, this makes graph kernel methods ineligible for transfer learning. Moreover, it prohibits understanding the quality of the embeddings produced in the intermediate step and limits theoretical investigation. In Chapter 3, we propose an alternative to graph kernels and introduce a novel graph embedding that overcomes the limitations discussed above.

Limited Information Captured by Graph Neural Network Embeddings: One of the main limitations of the standard GNN model stems from the basic graph convolution operation, which is defined, in its purest form, as the aggregation of node values in a local neighborhood corresponding to each feature (or channel). As such, there is a potential loss of information associated with the basic graph convolution operation. Hence, we seek ways to retain more information than pure aggregation alone provides. This problem has been noted before [12], but did not attract much attention until recently [73]. In Chapter 4, we propose our novel Graph Capsule Network model to overcome this significant limitation.
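The following sketch illustrates the information loss argued above and the remedy pursued in Chapter 4: plain aggregation reduces a node's neighborhood to one number per feature, whereas a capsule keeps a small vector of higher-order statistics. Using the first three raw moments is an illustrative assumption, not the full GCAPS-CNN design.

import numpy as np

def aggregate(neighbor_values):
    # Basic graph convolution: a single aggregate per feature, so distinct
    # neighborhoods can become indistinguishable.
    return float(np.mean(neighbor_values))

def capsule(neighbor_values, num_moments=3):
    # Graph capsule idea: retain higher-order statistics of the neighborhood
    # (here the first three raw moments) instead of one aggregate value.
    return np.array([np.mean(neighbor_values ** k) for k in range(1, num_moments + 1)])

n1 = np.array([1.0, 3.0])   # two different neighborhoods of the same size
n2 = np.array([2.0, 2.0])
print(aggregate(n1), aggregate(n2))   # both 2.0: the aggregate cannot tell them apart
print(capsule(n1), capsule(n2))       # second moments 5.0 vs 4.0: the capsule can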

Transfer Learning Challenges in Graph Neural Networks: Existing graph neural network models are not designed to perform transfer learning, i.e., GNN embeddings cannot be shared across different models or datasets. Unlike images or words, where the channels or the embedding layer have a fixed input size, in the context of graph learning the initial node feature dimension can vary across datasets. It is difficult to employ an embedding layer in GNNs, as this becomes circular, amounting to solving the graph isomorphism problem; furthermore, the graph vocabulary is potentially infinite. This presents significant challenges for performing transfer learning in the graph neural network domain. To tackle this, we propose the first universal and transferable graph neural network model in Chapter 5.
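One simple way around the varying input dimension, in the spirit of the input transformer used later in Chapter 5, is to map every dataset's node features into a common width before a shared GNN encoder; the random linear projection below is an illustrative assumption, not the exact Dugnn design.

import numpy as np

def to_common_dimension(X, d_common, seed=0):
    # Project node features of arbitrary width into a fixed width d_common so
    # that a single GNN encoder can be shared across datasets.
    d_in = X.shape[1]
    if d_in == d_common:
        return X
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((d_in, d_common)) / np.sqrt(d_in)
    return X @ P

X_a = np.random.rand(5, 7)   # dataset A: 7-dimensional node features
X_b = np.random.rand(8, 3)   # dataset B: 3-dimensional node features
print(to_common_dimension(X_a, 16).shape, to_common_dimension(X_b, 16).shape)   # (5, 16) (8, 16)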

Lack of Theoretical Guarantees for Graph Neural Networks: Lastly, there is a dearth of theoretical explorations of GNN models, such as their generalization properties. In Chapter 6, we take the first step towards developing a deeper theoretical understanding of GNN models by analyzing the stability of single-layer models and deriving their generalization guarantees in a semi-supervised graph learning setting.

Chapter 3

Learning Unique, Stable, Sparse and Computationally Fast Graph Embeddings

3.1 Introduction

In the past decade, there has been a thrust towards learning embeddings on graphs for various purposes, in particular for solving the graph classification problem. Applications of graph classification can be found in the domains of bioinformatics, chemoinformatics and social networks. The fundamental challenge in graph classification is comparing two graph structures, which dates back to the problem of graph isomorphism. The seminal paper [74] providing a quasipolynomial time algorithm for graph isomorphism has had a huge impact from the theoretical perspective, but its implications for developing practical algorithms are still far from reality. Nonetheless, the good news is that there might be other uncharted, asymptotically fast algorithms that can account for graph isomorphism and can successfully be applied to the graph classification problem as well. One way to get around both of these intimately tied problems together is to learn an explicit graph representation or embedding that is invariant to graph isomorphism (that is, invariant under permutation of graph vertex labels) and also useful for extracting graph features.

More specifically, given a graph G, we are interested in learning a graph embedding (or spectrum) R : G → (g1, g2, ..., gr) such that the variables {gi}, i = 1, ..., r, are invariant to graph isomorphism and represent the atomic (unique) sub-structures of the graph. Subsequently, we want to learn a feature function F : R → (f1, f2, ..., fd) from R such that the graph features {fi}, i = 1, ..., d, can be employed for solving the graph classification problem. However, in machine learning, not much attention has been given to learning R; most of the previous work focuses on designing graph kernels and thus bypasses computing any explicit graph embedding. The series of papers [75–77] are among the first (and few) to deal with constructing explicit graph features, using a group theoretic approach, that are invariant to graph isomorphism and can be successfully applied to the graph classification problem. Inspired by the same approach, we also explicitly deal with learning a graph embedding R and show how to derive graph features F from R.

3.2 Our Graph Spectrum Approach

Figure 3.1: Graph Generation Model: the graph spectrum is assumed to be encoded in pairwise node distances, which are generated from some distribution. Nodes connect together to form a graph in such a way that the pairwise node distances are preserved (e.g., a node pair with distance 0.75 is preserved even though the two nodes are not directly connected).

Our approach is quite novel and builds upon the following assumption: the graph atomic structure (or spectrum) is encoded in the multiset (a set in which an element can occur multiple times) of all node pairwise distances. Figure 3.1 shows the complete graph generation model based on this premise. The origin of our base assumption can be traced back to the study of homometric structures, i.e., structures with the same multiset of interatomic distances [78]. On graphs, two vertex sets are called non-homometric if the multisets of distances determined by them are different. It is an unexplored problem whether there exists any distance metric on graphs for which the vertex sets of two non-isomorphic graphs are always non-homometric (the converse is not true; the shortest path distance is an example). This argument supports the validity of our assumption that the graph atomic structure is encoded in pairwise distances. Further, we empirically found that biharmonic distance [79] multisets are unique at least up to 10-vertex simple connected graphs (∼11 million graphs), and it remains an open problem to exhibit a counterexample. Moreover, we show that for a certain distance function Sf on the graph, one can uniquely recover all the intrinsic properties of the graph while also being able to capture both local and global information about it. Thus, we define R as the multiset of node pairwise distances based on some distance function Sf, which will be the main focus of this chapter.
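As a minimal sketch of this pipeline (not the optimized Fgsd implementation evaluated later in the chapter), the code below computes the harmonic (effective resistance) distance multiset R of a graph from the pseudoinverse of its Laplacian and turns it into a fixed-length histogram feature vector F; the bin count and range are illustrative choices.

import numpy as np

def fgsd_features(A, num_bins=20, max_dist=5.0):
    # Harmonic (effective resistance) distance: S(x, y) = L+_xx + L+_yy - 2 L+_xy,
    # i.e. the f-spectral distance with f(lambda) = 1/lambda.
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    S = d[:, None] + d[None, :] - 2.0 * Lp
    # R is the multiset of pairwise distances; F is its histogram, which is
    # invariant to any permutation of the vertex labels.
    R = S[np.triu_indices_from(S, k=1)]
    F, _ = np.histogram(R, bins=num_bins, range=(0.0, max_dist))
    return F

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(fgsd_features(A))   # graphs of different sizes yield same-length feature vectors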

We hunt for such a family of distances on graphs and for its core members that satisfy most of the properties of an ideal graph spectrum (see Section 3.4), including invariance under graph isomorphism and the uniqueness property. This hunt leads to the discovery of the family of graph spectral distances (Fgsd), with the harmonic (resistance) and biharmonic distances on graphs emerging as suitable members of this family for the graph representation R. Finally, for solving graph classification (where graphs can have different numbers of nodes), we simply construct the feature vector F from the histogram of R (a

Our current work focuses on only unlabeled graphs but can be extended to labeled graphs using the same strategy as in shorted path kernel [65]. Nevertheless, our com- prehensive results shows that Fgsd graph features are powerful enough to outperform the current state-of-art algorithms significantly on unlabeled datasets and very compet- itive on labeled datasets – despite the fact it does not utilize any node label information. In summary, the major contributions of our graph kernel based embeddings are:

• Introducing a novel and conceptually simple yet powerful graph feature embedding (or spectrum) based on the multiset of node pairwise distances.

• Discovering Fgsd as a well suited choice for computing our proposed graph spectrum.

• Proving that Fgsd based graph features exhibit certain uniqueness, stability and sparsity properties and are computationally fast, with O(N^2) complexity, where N is the number of nodes in a graph.

• Showing the superior performance of Fgsd based graph features on graph classification tasks.

3.3 Related Work

Previous work on graph classification can be broken down into three main categories. The first category concerns constructing explicit graph features that are invariant under permutation of the graph vertex labels. The Skew Spectrum of Graphs [76], based on group-theoretic approaches, is one example of this category. Its successor, the Graphlet Spectrum [77], was introduced later to include label information in the spectrum and to account for the relative position of subgraphs within the graph. However, the main concern with the graphlet spectrum or the skew spectrum is their O(N^3) computational complexity.

The more popular category of work on graph classification deals with designing graph kernels, where a given graph G is decomposed into (possibly different) sub-structures {Gs}.

The graph kernel K(G1, G2) is defined based on the frequency of each sub-structure appearing in G1 and G2 respectively, i.e., K(G1, G2) = ⟨fGs1, fGs2⟩, where fGs is the vector containing the frequencies of the {Gs} sub-structures. Much work has gone into deciding which sub-structures are more suitable than others; among them are graphlets [62, 63], random walks or shortest paths [64, 65], and the Weisfeiler-Lehman subtree kernel [66]. Deep graph kernels [67], graph invariant kernels [68] and the multiscale Laplacian graph kernel [69] instead focus on re-defining kernel functions to appropriately measure sub-structural similarity at different levels. Another line of this work goes into efficiently computing these kernels, either by exploiting some structural dependency, or through approximation or randomization [70–72]. Our effort on learning R from Fgsd can be seen as part of the first category, since we explicitly investigate numerous properties of our proposed graph spectrum, while extracting F from R is more inspired by the work on graph kernels and can be considered part of the second category. As a result, our work falls under both categories, and our overall effort can be considered an amalgam of the two.

The third category involves developing convolutional neural networks (CNNs) for graphs, where several models have been proposed to define convolution networks on graphs. The most common model is based on generalizing convolutional networks through the graph Fourier transform via a graph Laplacian [47, 48]. The authors in [49] extend this model by constructing fast localized spectral filters and efficient graph coarsening as a pooling operation for CNNs on graphs. Some variants of these models were considered in [50, 52], where the output of each neural network layer is computed using a propagation rule that takes the graph adjacency matrix and the node feature vectors into account while updating the network weights. In [53], the convolution operation is defined by hashing local graph node features along with local structure information. Likewise, in [54] local node sequences are "canonicalized" to create receptive fields and then fed into a 1D convolutional neural network for classification. Among the aforementioned graph CNN models, only those in [52–54] are relevant to this work, since they are designed to account for graphs of different sizes, while the others assume a global structure where the one-to-one correspondence of input vertices is already known.

3.4 Family of Graph Spectral Distances and Graph Spectrum

Basic Setup and Notations: Consider a weighted, undirected (and connected) graph G = (V, E, W) of size N = |V|, where V is the vertex set, E the edge set (with no self-loops) and W = [wxy] the nonnegative weighted adjacency matrix. The standard graph Laplacian is defined as L = D − W, where D is the degree matrix. It is positive semi-definite and admits an eigendecomposition of the form L = ΦΛΦ^T, where Λ = diag[λk] is the diagonal matrix formed by the eigenvalues λ0 = 0 < λ1 ≤ · · · ≤ λN−1, and

Φ = [ϕ0, ..., ϕN−1] is an orthogonal matrix formed by the corresponding eigenvectors

ϕk’s. For x ∈ V , we use ϕk(x) to denote the x-entry value of ϕk. Let f be an arbitrary nonnegative (real-analytical) function on R+ with f(0) = 0, 1 = [1, .., 1]T is the all-one vector and J = 11T . Then, using slight abuse of notion, we define f(L) := ￿f(Λ)￿T and f(Λ) := diag[f(λk)]. Also, f(L)xy represent xy-entry value in f(L) matrix. Lastly, I is identity matrix and L+ is Moore-Penrose Pseudoinverse of L.

FGSD Definition: For x, y ∈ V, we define the f-spectral distance between x and y on G as

Sf(x, y) = Σ_{k=0}^{N−1} f(λk) (ϕk(x) − ϕk(y))^2.    (3.1)

We will refer to {Sf(x, y) | f} as the family of graph spectral distances. Without loss of generality, we assume that the derivative f′(λ) ≠ 0 for λ > 0; then, by the Lagrange inversion theorem [80], f is invertible and thus bijective. For reasons that will become clear shortly, we are particularly interested in two sub-families of Fgsd, where f is a monotonic (increasing or decreasing) function of λ. Depending on the sub-family, the f-spectral distance can capture different types of information in a graph.

FGSD Elements Encode Local Structure Information: For f(λ) = λ^p (p ≥ 1), one can show that Sf(x, y) = (L^p)xx + (L^p)yy − 2(L^p)xy. If the shortest path from x to y is longer than p, then (L^p)xy = 0. This is based on the fact that (L^p)xy captures only p-hop

FGSD Elements Encode Global Structure Information: On the other hand, f + p + p + p as a decreasing function yields Sf (x, y) = ((L ) )xx + (((L ) )yy − 2((L ) )xy. This + J −1 − J captures the global information, since the xy-entry of L = (L + N ) N accounts for all paths from node x to y (and so does (L+)p). Several known globally aware graph distances can be derived from this Fgsd sub-family. For f(λ) = 1/λ where

λ > 0, Sf (x, y) is the harmonic (or effective resistance) distance. More generally, for p f(λ) = 1/λ , p ≥ 1, Sf (x, y) is the polyharmonic distance (p = 2 is biharmonic distance). −2tλ Lastly f(λk) = e k yields Sf (x, y) that is equivalent to the heat diffusion distance. Also from signal processing point of view, f act as a band-pass filter [51] or can be viewed as a spectral graph wavelet kernel [81].

FGSD Graph Signal Processing Point of View: From graph signal processing perspective, Sf (x, y) is a distance computed based on spectral filter properties [51], where f(λ) act as a band-pass filter. Or, it can be viewed in terms of spectral graph ∑ S − N−1 wavelets [81] as: f (x, y) = ψf,x(x)+ψf,y(y) 2ψf,x(y), where ψf,x(y) = k=0 f(λk)ϕk(x)ϕk(y)

(and ψf,x(x), ψf,y(y) are similarly defined) is a spectral graph wavelet of scale 1, centered at node x and f(λ) act as a graph wavelet kernel.

FGSD Based Graph Spectrum: Using the Fgsd based distance matrix Sf =

[Sf (x, y)] directly, e.g., for graph classification, requires us being able to solve the graph isomorphism problem efficiently. But no known polynomial time algorithm isavail- able; the best algorithm today theoretically takes quasipolynomial time [74]. However, motivated from the study of homometric structure and the fact that each element of Fgsd encodes some local or global sub-structure information of the graph, inspired us to define the graph spectrum as R = {Sf (x, y)|∀(x, y) ∈ V }. Thus, comparing two R’s implicitly evaluates the sub-structural similarity between two graphs. For instance, R based on harmonic distance contains sub-structural properties related to the spanning trees of a graph [82]. 21 Our main concern in this chapter would be choosing an appropriate f(λ) function in order to generate R which can exhibit ideal graph spectrum properties as discuss below. Also, we want F to inherent these properties directly from R, which is made possible by defining F as the histogram of R. Finally, we lay down those important fundamental properties of an ideal graph spectrum that one would like R & F to obey on a graph G = (V,E,W ).

1. R & F must be invariant under any permutation π of vertex labels. That is, R(G) = R(Gπ) or R(W ) = R(PWP T ) for any permutation matrix P . 2. R & F must have a unique representation for non-isomorphic graphs. That is,

R(G1) ≠ R(G2) for any two non-isomorphic graphs G1 and G2.

3. R & F must be stable under small perturbation. That is, if graph G2(W2) = G1(W1+

∆), for a small perturbation norm matrix ∥∆∥, then the norm of ∥F(G2) − F(G1)∥ should also be small or bounded in order to maintain the stability. 4. F must be sparse (if high-dimensional) for all the sparsity reasons desirable in ma- chine learning. 5. R & F must be computationally fast for efficiency and scalability purposes.

3.5 Uniqueness of Family of Graph Spectral Distances and Embeddings

We first start with exploring the graph invariance and uniqueness properties of R & F based on Fgsd. Uniqueness is a very important (desirable) property, since it will determine whether the elements of R set are complete (i.e., how good they are), in the sense whether R is sufficient enough to recover all the intrinsic structural properties of a graph. We state the following important uniqueness theorem.

3 Theorem 1 (Uniqueness of Fgsd). The f-spectral distance matrix Sf = [Sf (x, y)] uniquely determines the underlying graph (up to graph isomorphism), and each graph

3 − 1 − 1 Variant of Theorem 1 also hold true for the normalized graph Laplacian Lnorm = D 2 LD 2 . 22 has a unique Sf (up to permutation). More precisely, two undirected, weighted (and connected) graphs G1 and G2 have the same Fgsd based distance matrix up to permuta- S S T tion, i.e., G1 = P G2 P for some permutation matrix P , if and only if the two graphs are isomorphic.

Implications: Our proof is based on establishing the following key relationship: f(L) = − 1 − 1 S − 1 2 (I N J) f (I N J). Since f is bijective, one can uniquely recover Λ from f(Λ). One of the consequence of Theorem 1 is that the R based on multiset of Fgsd is invariant under the permutation of graph vertex labels and thus, satisfies the graph invariance property. Also, F will inherent this property since R remains the same. Unfortunately, it is possible that the multiset of some Fgsd members can be same for non-isomorphic graphs (otherwise, we would have a O(N 2) polynomial time algorithm for solving graph isomorphism problem!). However, it is known that all non-isomorphic graphs with less than nine vertices have unique multisets of harmonic distance. While, for nine & ten vertex (simple) graphs, we have exactly 11 & 49 pairs of non-isomorphic graphs (out of total 274,668 & 12,005,168 graphs) with the same harmonic spectra. These examples show that there are significantly very low numbers of non-unique harmonic spectrums. Moreover, we empirically found that the biharmonic distance has all unique multisets for at-least upto ten vertices (∼ 11 million graphs) and we couldn’t find any non-isomorphic graphs with the same biharmonic multisets. Further, we have the following theorem regarding the uniqueness of R.

Theorem 2 (Uniqueness of Graph Harmonic Spectrum). Let G = (V,E,W ) be a graph of size |V | with an unweighted adjacency matrix W . Then, if two graphs G1 and G2 have the same number of nodes but different number of edges, i.e, |V1| = |V2| but |E1| ̸= |E2|, then with respect to the harmonic distance multiset, R(G1) ≠ R(G2).

Implications: Our proof relies on the fact that the effective resistance distance isa monotone function with respect to adding or removing edges. It shows that R based on some Fgsd members specially harmonic distance is atleast theoretically known to be unique to a certain degree. F also inherent this property, fully under the condition h → 0 (or for small enough h), where h is the histogram binwidth. 23 Overall the certain uniqueness of R along with containing local or global structural properties in its each element dictate that the R is capable enough to serve as the complete powerful Graph Spectrum.

3.6 Unifying Relationship Between FGSD and Graph Em- bedding and Dimension Reduction

Before delving into other properties, we uncover an essential relationship between Fgsd and Graph Embedding in Euclidean space and Dimension Reduction techniques. Let 1 √ 1 f(Λ) 2 = diag[ f(λk)] and define ￿ = ￿f(Λ) 2 . Then, the f-spectral distance can be S || − ||2 th expressed as f (x, y) = ￿(x) ￿(y) 2, where ￿(x) is the x row of ￿. Thus, ￿ represents an Euclidean embedding of G where each node x is represented by the vector ￿(x). Now for instance, if f(λ) = 1, then by taking the first p columns of ￿ yields embedding exactly equal to Laplacian Eigenmap (LE) [83] based on graph Laplacian −1 2t −1 (Lrw = D L). For f(λ) = λ and L = D W , we get the Diffusion Map [84]. Thus, f(λ) function has one-to-one correspondence relationship with spectral dimension reduction techniques. We have the following theorem concerning Graph Embedding based on Fgsd.

Theorem 3 (Uniqueness of Fgsd Graph Embedding). Each graph G can be iso- metrically embedded into a Euclidean space using Fgsd as an isometric measure. This isometric embedding is unique, if all the eigenvalues of G Laplacian are√ distinct and there ′ ′ ′ does not exist any other graph G with Laplacian eigenvectors ϕk = f(λj)/f(λj)ϕk, ∀k ∈ [1,N − 1].

Implications: The above theorem shows that Fgsd provides a unique way to embed the graph vertices into Euclidean space possibly without loosing any structural information of the graph. This could potentially serve as a cogent tool to convert an unstructured data into a structure data (similar to structure2vec 85 or node2vec 86 tool) which can enable us to perform standard inference tasks in Euclidean space. Note that the uniqueness condition is quite strict and holds for co-spectral graphs. In short, we have 24 following uniqueness relationship, where ￿ is the Euclidean embedding of G graph.

Sf f(LG) LG f(LG) ￿G

3.7 Stability of Family of Graph Spectral Distances and Embeddings

Next, we hunt for the stable members of the Fgsd that are robust against the pertur- bation or noise in the datasets. Specifically, we will look at the stability of R and F based on Fgsd from f(λ) perspective by first analyzing its influence on a single edge perturbation (or in other words analyzing rank one modification of Laplacian matrix). This will lead us to find the stable members and what restrictions we need to impose on f(λ) function for stability. We will further show that f-spectral distance function also satisfies the notion of uniform stability [87] in a certain sense. For our analysis, we will restrict f(λ) as a monotone function of λ, for λ > 0. Let △w ≥ 0 be the change ′ ′ after modifying w weight on any single edge to w on the graph, where △w = w − w.

Theorem 4 (Eigenfunction Stability of Fgsd). Let △Sxy be the change in Sf (x, y) distance with respect to △w change in weight of any single edge on the graph. Then,

△Sxy for any vertex pair (x, y) is bounded with respect to the function of eigenvalue as follows, ( △Sxy ≤ 2 |f(λN−1 + 2△w) − f(λ1)|

Implications: Since, R = {Sf (x, y)|∀(x, y) ∈ V }, then each element of R is itself bounded by △Sxy. Now, recall that F is a histogram of R, then F won’t change, if binwdith is large enough to accommodate the perturbation i.e., h ≥ 2△Sxy ∀(x, y) as- suming all elements of R are at the center of their respective histogram bins. Besides h, the other way to make R robust is by choosing a suitable f(λ) function. Lets consider ( ) △S p △S ≤ △ p − p the behavior xy on f(λ) = λ for p > 0. Then, xy 2 (λN−1 + 2 w) λ1 and as a result, △Sxy is an increasing function with respect to p which implies that 25 stability decreases with increase in p. For( p = 0, stability does not change) with re- △S ≤ |p| − △ |p| △S spect to λ. While, for p < 0, xy 2 1/λ1 1/(λN−1 + 2 w) . Here, xy is a decreasing function with respect to |p|, which implies that stability increases with decrease in p. The results conforms with the reasoning that eigenvectors correspond- ing to smaller eigenvalues are smoother (i.e., oscillates slowly) than large eigenvectors (corresponding to large eigenvalues) and decreasing p will attenuate the contribution of large eigenvectors, making the f-spectral distance more stable and less susceptible towards perturbation or noise. However, decreasing p too much could result in lost of local information contained in eigenvectors with larger eigenvalues and therefore, a balance needs to be maintained. Overall, Theorem 4 shows that either through suitable h or decreasing f(λ) function, stability of R & F can be controlled to satisfy the Ideal Spectrum Property 3.

In fact, we can further show that S_f(x, y), between any vertex pair (x, y) on a graph with bounded weights 0 < α ≤ w ≤ β, is tightly concentrated around a certain expected value.

Theorem 5 (Uniform Stability of Fgsd). Let E[S_f(x, y)] be the expected value of S_f(x, y) between vertex pair (x, y), over all possible graphs with a fixed ordering of N vertices. Then, with probability 1 − δ, where δ ∈ (0, 1) and θ depends upon α, β, N,

$$S_f(x, y) - \mathbb{E}[S_f(x, y)] \;\leq\; f(\theta)\,\sqrt{N(N-1)}\,\sqrt{\log\frac{1}{\delta}}$$

Implications: The above theorem is based on the fact that ΔS_xy can itself be upper bounded over all possible graphs generated on a fixed ordering of N vertices. This is very similar to the condition needed for a learning algorithm to satisfy the notion of uniform stability in order to give generalization guarantees. The f-spectral distance function can itself be thought of as a learning algorithm which admits uniform stability (precise definition in the supplementary material); this indicates a strong stability behavior over all possible graphs and suggests that it further acts as a generalizable learning algorithm on graphs. Theorem 5 also reveals that the deviation can be minimized by choosing a decreasing f(λ) function, and it would be suitable if f(λ) scales as $O\big(1/\sqrt{N(N-1)}\big)$ in order to maintain stability for large graphs. So far, we have narrowed down our interest to R & F based on bijective and decreasing f(λ) functions, in order to achieve both uniqueness and stability. This eliminates all forms of increasing polynomial functions as good choices of f(λ). As a result, we can focus on inverse (or rational) forms of polynomial functions such as the polyharmonic distances. A by-product of our analysis is a new class of stable dimension reduction techniques, obtained by scaling Laplacian eigenvectors with a decreasing function f(λ), although such connections have already been known before.

3.8 Sparsity of Family of Graph Spectral Distances and Embeddings


Figure 3.2: The number of unique elements present in R formed by different f-spectral distances on all graphs with |V| = 9 (261,080 graphs in total). Graph enumeration indices are sorted according to $|\mathcal{R}(1/\lambda)|_G$. We can observe that f(λ) = 1/λ increases in the form of a step function and lower bounds all other f(λ) up to an additive constant. (Best viewed in color and when zoomed in.)

Sparsity is desirable for both computational and statistical efficiency. In this section, we investigate the sparsity produced in F by choosing different f(λ) functions. Here, sparsity refers to its usual definition of "how many zero features are present in the F graph feature vector". Since F is a histogram of R, the number of non-zero elements in F will always be less than or equal to the number of unique (or distinct) elements in R. However, due to the lack of theoretical support, we rely on empirical evidence and conjecture the following statement.

Conjecture (Sparsity of Fgsd Graph Spectrum). For any graph G, let $|\mathcal{R}(f(\lambda))|_G$ represent the number of unique elements present in the multiset R, computed on an unweighted graph G based on some monotonically decreasing function f(λ). Then the following holds:

$$\big|\mathcal{R}(f(\lambda))\big|_G \;\geq\; \Big|\mathcal{R}\big(\tfrac{1}{\lambda}\big)\Big|_G + 2$$

Here, $|\mathcal{R}(f(\lambda))|_G$ denotes the size of the unique set formed by computing the spectral distances on an unweighted graph G based on a (monotonically decreasing) function f(λ).

The conjecture is based on the observation that, in Figure 3.2, $|\mathcal{R}(1/\lambda)|_G + 2$ lower bounds $|\mathcal{R}(f(\lambda))|_G$ for all the considered monotonically decreasing f(λ), i.e., up to an additive constant of 2. The same trend is observed for different graph sizes |V|. Interestingly, when the graph enumeration indices are sorted according to $|\mathcal{R}(1/\lambda)|_G$, we further observe that f(λ) = 1/λ increases in the form of a step function. Note that the inequality is satisfied for f(λ) = 1/λ². From this conjecture, we can directly conclude that the F based on f(λ) = 1/λ produces the sparsest features, because the number of unique elements in its R is always less than that of any other R. Figures 3.3, 3.4, 3.5 and 3.6 further support this conjecture by showing the feature space computed for the MUTAG dataset in the case of the harmonic and biharmonic spectra. However, this raises the question of a trade-off between maintaining uniqueness and sparsity, since biharmonic distance multisets are found to be unique for more graphs than the harmonic distance. Nonetheless, some preliminary experiments comparing harmonic vs. biharmonic performance on graph classification suggest that sparsity is more favorable than uniqueness, since it results in higher classification accuracy. In the rest of this chapter, we narrow our focus to computing the harmonic distance and evaluating its competitiveness with other state-of-the-art methods on the graph classification task. But before that, we also provide a general recipe for fast computation of any member of Fgsd in the next section.
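The quantity compared in the conjecture can be computed directly. The sketch below (illustrative only; the example graph and the round-off tolerance are assumptions) counts the number of unique elements of R for one unweighted graph under different decreasing choices f(λ) = 1/λ^p.

```python
import numpy as np
from itertools import combinations

def unique_R_size(A, f, decimals=6):
    """Number of unique f-spectral distances in the multiset R of graph A."""
    L = np.diag(A.sum(1)) - A
    lam, phi = np.linalg.eigh(L)
    lam, phi = lam[1:], phi[:, 1:]                 # drop the zero eigenvalue
    R = [np.sum(f(lam) * (phi[x] - phi[y]) ** 2)
         for x, y in combinations(range(A.shape[0]), 2)]
    return len(np.unique(np.round(R, decimals)))   # merge numerically equal values

# Example unweighted graph (a path with one chord).
N, edges = 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4)]
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
for p in (1, 2, 3):
    print(f"f(lambda) = 1/lambda^{p}:  |R|_G =",
          unique_R_size(A, lambda x, p=p: 1.0 / x ** p))
```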

Figure 3.3: Harmonic distance based graph feature matrix (matrix sparsity = 97.12%). A blue dot indicates a feature value > 0.

3.9 Fast Computation of Family of Graph Spectral Distances and Embeddings

Finally, we provide a general recipe for computing any member of Fgsd in a fast manner. In order to avoid direct eigenvalue decomposition, we can either perform approximation or leverage the structural properties and sparsity of f(L) for efficient exact computation of S_f and thus R.

Approximation: Inspired by the spectral graph wavelet work [51], the recipe for approximating Fgsd is to decompose f(λ) into an approximate polynomial series (for example, Chebyshev polynomials) as $f(\lambda) = \sum_{i=0}^{r} a_i T_i(\lambda)$, such that $T_i(x)$ can be computed recursively from a few lower-order terms $(T_{i-1}(x), T_{i-2}(x), \ldots, T_{i-c}(x))$. It then follows that $f(L) = \sum_{i=0}^{r} a_i T_i(L)$. In this case, the cost of computation reduces to O(r|E|) for a sparse L, which is far less expensive since O(r|E|) ≪ O(N²). However, if f(λ) is an inverse polynomial form of function, then computing f(L) boils down to efficiently computing a (single) Moore–Penrose pseudoinverse of a matrix.
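The following is a minimal sketch of this recipe (an illustration under assumptions, not the thesis implementation): it fits Chebyshev coefficients of a smooth decreasing function, here the assumed example f(λ) = e^{-λ}, and evaluates the series on the Laplacian with the three-term recursion. Dense matrices are used for clarity; in practice L is sparse and only sparse matrix products are needed.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_f_of_L(L, f, r=12):
    """Approximate f(L) with a degree-r Chebyshev series of f on [0, lmax],
    using T_i(x) = 2x T_{i-1}(x) - T_{i-2}(x) on the rescaled Laplacian."""
    N = L.shape[0]
    lmax = np.linalg.eigvalsh(L)[-1]
    t = np.cos(np.pi * (np.arange(200) + 0.5) / 200)      # sample points in [-1, 1]
    a = C.chebfit(t, f((t + 1) * lmax / 2), r)            # coefficients a_0, ..., a_r
    L_t = (2.0 / lmax) * L - np.eye(N)                    # rescale spectrum to [-1, 1]
    T_prev, T_curr = np.eye(N), L_t
    out = a[0] * T_prev + a[1] * T_curr
    for i in range(2, r + 1):
        T_prev, T_curr = T_curr, 2 * L_t @ T_curr - T_prev
        out += a[i] * T_curr
    return out

# Compare against the exact eigendecomposition.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
L = np.diag(A.sum(1)) - A
lam, phi = np.linalg.eigh(L)
exact = phi @ np.diag(np.exp(-lam)) @ phi.T
print(np.max(np.abs(cheb_f_of_L(L, lambda x: np.exp(-x)) - exact)))   # small error
```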

Figure 3.4: Biharmonic distance based graph feature matrix (matrix sparsity = 94.28%). A blue dot indicates a feature value > 0.

Efficient Exact Computation: By leveraging the structural properties and sparsity of f(L), we can perform exact computation of f(L⁺) much more efficiently than via eigenvalue decomposition. We propose such a method, which is a generalization of the work in [79]. We can show that $f(L)\,f(L^{+}) = I - \frac{J}{N}$. Therefore, $f(L)\,l_k^{+} = B_k$, where $l_k^{+}$ and $B_k$ are the $k$-th columns of the matrices $f(L^{+})$ and $B = I - \frac{J}{N}$, respectively. So, we first find a particular solution of the following (sparse) linear system: $f(L)\,x = B_k$, and then obtain $l_k^{+} = x - \frac{\mathbf{1}^{T}x}{\mathbf{1}^{T}\mathbf{1}}\,\mathbf{1}$. The particular solution x can be obtained by replacing any single row and the corresponding column of f(L) by zeros, setting the diagonal entry at their intersection to one, and replacing the corresponding row of B by zeros. This gives a (non-singular) sparse linear system which can be solved very efficiently by performing a Cholesky factorization and back-substitution, resulting in overall O(N²) complexity, as shown in [88]. Besides this, there are a few other fast methods for computing the pseudoinverse, in particular those given by [89].
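A minimal SciPy sketch of this recipe for the harmonic case f(λ) = 1/λ is given below (an assumption-laden illustration: it assumes a connected graph, grounds node 0, and uses a generic sparse solver where a production implementation would use the sparse Cholesky factorization discussed above).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def harmonic_pinv_column(A, k):
    """k-th column of L^+ (f(lambda) = 1/lambda) via one sparse solve."""
    N = A.shape[0]
    L = sp.lil_matrix(np.diag(A.sum(1)) - A)
    b = -np.ones(N) / N
    b[k] += 1.0                              # B_k = (I - J/N) e_k
    # Ground node 0: zero its row/column, put 1 on the diagonal, zero the rhs entry.
    L[0, :] = 0.0; L[:, 0] = 0.0; L[0, 0] = 1.0
    b[0] = 0.0
    x = spsolve(L.tocsr(), b)                # particular solution of f(L) x = B_k
    return x - x.mean()                      # l_k^+ = x - (1^T x / 1^T 1) 1

A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
L_dense = np.diag(A.sum(1)) - A
col = harmonic_pinv_column(A, 2)
print(np.allclose(col, np.linalg.pinv(L_dense)[:, 2]))   # True: matches the dense pseudoinverse
```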

As a result, this leads to a very efficient O(r|E|) complexity through approximation, with a worst-case O(N²) complexity for the exact computation of R.

Figure 3.5: Harmonic distance based feature matrix sparsity shown per class label.

Complexity  | SP [65] | GK [67] (k ∈ {3,4,5}, d ≤ N) | SGS [76] | GS [77] (k ∈ [2,6]) | DCNN [52] | MLG [69] (Ne < N) | Fgsd
Approximate | —       | O(Nd^(k−1))                  | —        | —                   | —         | O(Ne^3)           | O(r|E|)
Worst-case  | O(N^3)  | O(N^k)                       | O(N^3)   | O(N^(2+k))          | O(N^2)    | O(N^3)            | O(N^2)

Table 3.1: Fgsd complexity comparison with a few strong state-of-the-art algorithms (showing only the variables that depend on N & |E|). It reveals that the Fgsd complexity is better than most. Table 3.1 shows the complexity comparison with other state-of-the-art methods. Since the number of elements in R is O(N²), F is also bounded by O(N²) and thus satisfies the ideal graph spectrum Property 5.

3.10 Experiments and Results

Figure 3.6: Biharmonic distance based feature matrix sparsity shown per class label.

FGSD Graph Spectrum Settings: We chose the harmonic distance as the ideal candidate for F. For fast computation, we adopted our proposed efficient exact computation method. For computing the histogram, we fix the binwidth and set the number of bins such that the histogram range covers all elements of $\{\mathcal{R}_i\}_{1}^{M}$ over the M graphs of a dataset. Therefore, we had only one parameter, the binwidth, chosen from the set {0.001, 0.0001, 0.00001}. This results in an F feature vector dimension in the range 100–1,000,000, with feature matrix sparsity > 90% in all cases. Our Fgsd code is available on GitHub⁴.

Datasets: We employed a wide variety of datasets considered as benchmarks [52, 54, 67, 69] in the graph classification task to evaluate the quality of the produced Fgsd graph features. We adopted 7 bioinformatics datasets: Mutag, PTC, Proteins, NCI1, NCI109, D&D, MAO, and 5 social network datasets: Collab, REDDIT-Binary, REDDIT-Multi-5K, IMDB-Binary, IMDB-Multi. The D&D dataset contains 691 enzyme and 587 non-enzyme protein structures, while the MAO dataset contains 38 molecules that act as antidepressant drugs and 30 that do not. Details of the other datasets can be found in [67].

Experimental Set-up: All experiments were performed on a single machine with an Intel Core i7 CPU and 64GB RAM. We compare our method with 6 state-of-the-art graph kernels: Random Walk (RW) [90], Shortest Path Kernel (SP) [65], Graphlet Kernel (GK) [63], Weisfeiler-Lehman Kernel (WL) [66], Deep Graph Kernels (DGK) [67], and Multiscale Laplacian Graph Kernels (MLG) [69].

4 https://github.com/vermaMachineLearning/FGSD

Dataset (No. Graphs, Max. Nodes) | RW [90] | SP [65] | GK [63] | WL [66] | DGK [67] | MLG [69] (Wall-Time) | DCNN [52] | SGS [76] | Fgsd (Wall-Time)
MUTAG (188, 28)      | 83.50 | 87.23 | 84.04 | 87.28 | 86.17 | 87.23 (5s)    | 66.51 | 88.61 | 92.12 (0.3s)
PTC (344, 109)       | 55.52 | 58.72 | 60.17 | 55.61 | 59.88 | 62.20 (18s)   | 55.79 | —     | 62.80 (0.07s)
PROTEINS (1113, 620) | 68.46 | 72.14 | 71.78 | 70.06 | 71.69 | 71.35 (277s)  | 65.22 | —     | 73.42 (5s)
NCI1 (4110, 111)     | > D   | 68.15 | 62.07 | 77.23 | 64.40 | 77.57 (620s)  | 63.10 | 62.72 | 79.80 (31s)
NCI109 (4127, 111)   | > D   | 68.30 | 62.04 | 78.43 | 67.14 | 75.91 (600s)  | 60.67 | 62.62 | 78.84 (35s)
D&D (1178, 5748)     | > D   | > D   | 75.05 | 73.76 | 72.75 | 77.02 (7.5hr) | OMR   | —     | 77.10 (25s)
MAO (68, 27)         | 83.52 | 90.35 | 80.88 | 89.79 | 87.76 | 91.17 (13s)   | 76.10 | —     | 95.58 (0.1s)

Table 3.2: Classification accuracy on unlabeled bioinformatics datasets. Results in bold indicate all methods with accuracy within 2.0 of the top result, and blue color (for a gap > 2.0) indicates a new state-of-the-art result. Green color highlights the best computation time, if it is at least 5× faster (among the methods reporting time). ‘OMR’ is an out-of-memory error; ‘> D’ means the computation exceeded 24 hrs.

We also compare with 2 recent state-of-the-art graph convolutional networks: PATCHY-SAN (PSCN) [54] and Diffusion CNNs (DCNN) [52], and 2 strong graph spectrum methods: the Skew Spectrum (SGS) [76] and the Graphlet Spectrum (GS) [77]. We adopt the same procedure as previous works [54, 67] to make a fair comparison, and use 10-fold cross validation with the LIBSVM [91] library to test the classification performance. The parameters of the SVM are tuned independently using the training folds, and the best average classification accuracy is reported for each method. We provide the node degree as the node label for algorithms that do not operate directly on unlabeled data. Further details about parameter selection for the baseline methods are given in the appendix.

Classification Results: From Table 3.2, it is clear that Fgsd consistently outperforms every other state-of-the-art algorithm on the unlabeled bioinformatics datasets, and significantly so in many cases. Fgsd even performs better on social network graphs, as shown in Table 3.3, and achieves a very significant 7%–8% higher accuracy than the current state-of-the-art PSCN on the COLLAB and IMDB-M datasets. Also, from the run-time perspective (excluding any data loading or classification time for all algorithms), it is quite fast (2×–1000× faster) as compared to the others. These appealing results further motivated us to compare Fgsd on the labeled datasets (even though it is not a completely fair comparison). Table 3.4 shows that Fgsd is still very competitive with all the other strong (recent) algorithms that utilize node label data. In fact, on the MAO dataset, Fgsd sets a new state-of-the-art result, and it stays within 0%–2% of the best accuracy on all labeled datasets. On a few labeled datasets, we found MLG to have slightly better performance than the others, but it is 1000× slower than Fgsd when the graph size jumps to a few thousand nodes (see the D&D results). Altogether, Fgsd shows very promising results in both accuracy & speed on all types of datasets and over all the more sophisticated algorithms. These results also point to the fact that there is untapped hidden potential in the graph structure which current algorithms are not harnessing, despite having labeled data at their disposal.

Table 3.3
Dataset (No. Graphs) | GK [63] | DGK [67] | PSCN [54] | Fgsd
COLLAB (5000)        | 72.84 | 73.09 | 72.60 | 80.02
REDDIT-B (2000)      | 77.34 | 78.04 | 86.30 | 86.50
REDDIT-M (5000)      | 41.01 | 41.27 | 49.10 | 47.76
IMDB-B (1000)        | 65.87 | 66.96 | 71.00 | 73.62
IMDB-M (1500)        | 43.89 | 44.55 | 45.23 | 52.41

Table 3.4
Dataset | DCNN [52] | GS [77] | MLG [69]      | PSCN [54]    | Fgsd*
MUTAG   | 66.98     | 88.11   | 87.94 (4s)    | 92.63 (3s)   | 92.12 (0.3s)
PTC     | 56.60     | —       | 63.26 (21s)   | 62.90 (6s)   | 62.80 (0.07s)
NCI1    | 62.61     | 65.0    | 81.75 (621s)  | 78.59 (76s)  | 79.80 (31s)
D&D     | OMR       | —       | 78.18 (7.5hr) | 77.12 (154s) | 77.10 (25s)
MAO     | 75.14     | —       | 88.29 (12s)   | —            | 95.58 (0.1s)

Table 3.5: Classification accuracy on social network datasets (Table 3.3), where Fgsd significantly outperforms the other methods, and on labeled bioinformatics datasets (Table 3.4). * indicates that Fgsd did not utilize any node labels.

3.11 Conclusion

We present a conceptually simple yet powerful and theoretically motivated graph representation. In particular, our graph representation, based on the discovery of a family of graph spectral distances, exhibits uniqueness, stability and sparsity, and is computationally fast. Moreover, our hunt specifically leads to the harmonic and, next to it, the biharmonic distance as ideal members of this family for extracting graph features. Finally, our extensive results show that Fgsd based graph features are powerful enough to dominate the unlabeled graph classification task over all the more sophisticated algorithms, and competitive enough to yield high classification accuracy on labeled data even without utilizing any node labels.

Chapter 4

Learning Highly Informative Graph Embeddings With Graph Capsule Neural Networks

4.1 Introduction

Graphs are one of the most fundamental structures that have been widely used for representing many types of data. Learning on graphs, such as graph semi-supervised learning, graph classification or graph evolution, has found wide applications in domains such as bioinformatics, chemoinformatics, social networks, natural language processing and computer vision. With the remarkable successes of deep learning approaches in image classification and object recognition that attain “superhuman” performance, there has been a surge of research interest in generalizing convolutional neural networks (CNNs) to structures beyond regular grids, i.e., from 2D/3D images to arbitrary structures such as graphs [47–50]. These convolutional networks on graphs are now commonly known as Graph Convolutional Neural Networks (GCNNs). The principal idea behind graph convolution has been derived from the graph signal processing domain [51], and has since been extended in different ways for a variety of purposes [53, 56, 61].

In this chapter, we expose three major limitations of the standard GCNN model commonly used in existing deep learning approaches on graphs, especially when applied to the graph classification problem, and explore ways to overcome these limitations. In particular, we propose a new model, referred to as Graph Capsule Convolution Neural Networks (GCAPS-CNN). It is inspired by the notion of capsules developed in [12]: capsules are new types of neurons which encapsulate more information in a local pool operation (e.g., a convolution operation in a CNN) by computing a small vector of highly informative outputs rather than just taking a scalar output. Our graph capsule idea is quite general and can be employed in any version of the GCNN model, either designed for solving the graph semi-supervised problem or for doing sequence learning on graphs via Graph Convolution Recurrent Neural Network models (GCRNNs).

The first limitation of the standard GCNN model is due to the basic graph convolution operation, which is defined – in its purest form – as the aggregation of node values in a local neighborhood corresponding to each feature (or channel). As such, there is a potential loss of information associated with the basic graph convolution operation. This problem has been noted before [12], but has not attracted much attention until recently [73]. To address this limitation, we propose to improve upon the basic graph convolution operation by introducing the notion of graph capsules, which encapsulate more information about nodes in a local neighborhood, where the local neighborhood is defined in the same way as in the standard GCNN model. Similar to the original capsule idea proposed in [12], this is achieved by replacing the scalar output of a graph convolution operation with a small vector output containing higher-order statistical information per feature. Another source of inspiration for our proposed GCAPS-CNN model comes from one of the most successful graph kernels – the Weisfeiler-Lehman (WL) subtree graph kernel [66] – designed specifically for solving the graph classification problem. In the WL-subtree graph kernel, node labels (features) are collected from the neighbors of each node in a local neighborhood and compressed injectively to form a new node label in each iteration. The histograms of these new node labels are concatenated across iterations to serve as a graph-invariant feature vector. The important point to notice here is that, due to the injection process, one can recover the exact node labels of the local neighbors in each iteration without losing track of them. In contrast, this is not possible in the standard GCNN model, as the input feature values of node neighbors are lost after the graph convolution operation.

The second major limitation of the standard GCNN model is specific to its (in)ability to tackle the graph classification problem. GCNN models cannot be applied directly because they are equivariant (not invariant) with respect to the node order in a graph. To be precise, consider a graph G with Laplacian $L \in \mathbb{R}^{N \times N}$ and node feature matrix $X \in \mathbb{R}^{N \times d}$. Let $f(X, L) \in \mathbb{R}^{N \times h}$ be the output function of a GCNN model, where N, d, h are the number of nodes, the input dimension and the hidden dimension of node features, respectively. Then f(X, L) is a permutation equivariant function, i.e., for any permutation matrix P, $f(PX, PLP^{T}) = P f(X, L)$. This specific permutation equivariance property prevents us from directly applying a GCNN to a graph classification problem, since it cannot provide any guarantee that the outputs of any two isomorphic graphs are always the same. Consequently, a GCNN architecture needs an additional graph permutation invariant layer in order to perform the graph classification task successfully. This invariant layer also needs to be differentiable for end-to-end learning.

Very limited effort has been devoted to carefully designing such an invariant GCNN model for the purpose of graph classification. Currently, the most common method for achieving graph permutation invariance is performing aggregation (i.e., summing) over all graph node values [52, 57, 59, 92]. Though simple and fast, it can again incur a significant loss of information. Likewise, using a max-pooling layer to achieve graph permutation invariance encounters similar issues. A few attempts have been made [60, 61] that go beyond aggregation or max-pooling in designing graph permutation invariant GCNNs. In [60] the authors propose a global ordering of nodes by sorting them according to their values in the last hidden layer. This type of invariance is based on creating an order among nodes and has also been explored before in [54]. However, as discussed in Section 4.5.1, we show that there are some issues with this type of approach. A more tangential approach has been adopted in [61], based on group theory, to design transformation operations and tensor aggregation rules that result in permutation invariant outputs. However, this approach relies on computing high-order tensors, which is computationally expensive in many cases. To that end, we propose a novel permutation invariant layer based on computing the covariance of the data, whose output does not depend upon the order of nodes in the graph. It is also fast to compute, since it requires only a single dense matrix-multiplication operation.

Our last concern with the standard GCNN model is its limited ability to exploit global information for the purpose of graph classification. The filters employed in graph convolutions are in essence local in nature and hence can only provide an “average/aggregate view” of the local data. This shortcoming poses a serious difficulty in handling graphs where node labels are not present; approaches which initialize (node) feature values using, e.g., node degree, are not very helpful in this respect. We propose to utilize global features (features that account for the full graph structure), using the family of graph spectral distances proposed in [1], to remedy this problem.

In summary, the major contributions of this chapter are:

• We propose a novel Graph Capsule Convolution Neural Network model based on the capsule idea to capture highly informative output in a small vector, in place of the scalar output currently employed in GCNN models.

• We develop a novel graph permutation invariant layer based on computing the covariance of the data to solve the graph classification problem. We show that it is a better choice than performing node aggregation or max pooling, and at the same time it can be computed efficiently.

• Lastly, we advocate explicitly including global graph structure features at each graph node to enable the proposed GCAPS-CNN model to exploit them for graph learning tasks.

We organize this chapter into five sections. We start with the related work on graph kernels and GCNNs in Section 4.2, and present our core idea behind graph capsules in Section 4.3. In Section 4.5, we focus on building a graph permutation invariant layer, especially for solving the graph classification problem. In Section 4.6, we propose to equip our GCAPS-CNN model with enhanced global features to exploit the full graph structure for learning on graphs. Lastly, in Section 4.7 we conduct experiments and show the superior performance of our proposed GCAPS-CNN model.

4.2 Related Work

There are three main approaches for solving the graph classification problem. The most common approach is concerned with building graph kernels. In graph kernels, a graph G is decomposed into (possibly different) {Gs} sub-structures. The graph kernel

$K(G_1, G_2)$ is defined based on the frequency of each sub-structure appearing in $G_1$ and $G_2$, respectively. Namely, $K(G_1, G_2) = \langle f_{G_{s1}}, f_{G_{s2}} \rangle$, where $f_{G_s}$ is the vector containing the frequencies of the $\{G_s\}$ sub-structures, and $\langle \cdot, \cdot \rangle$ is an inner product in an appropriately defined normed vector space. Much work has been devoted to deciding which sub-structures are more suitable than others. Among the existing graph kernels, popular ones are graphlets [62, 63], random walk and shortest path kernels [64, 65], and the Weisfeiler-Lehman subtree kernel [66]. Furthermore, deep graph kernels [67], graph invariant kernels [68], optimal assignment graph kernels [93] and the multiscale Laplacian graph kernel [69] have been proposed with the goal of re-defining kernel functions to appropriately capture sub-structural similarity at different levels. Another line of research in this area focuses on efficiently computing these kernels, either through exploiting certain structural dependencies, or via approximation or randomization [70–72].

The second category involves constructing explicit graph features, such as the Fgsd features in [1], which are based on a family of graph spectral distances and come with certain theoretical guarantees. The Skew Spectrum of Graphs [76], based on group-theoretic approaches, is another example in this category. The Graphlet Spectrum [77] improves upon this work by including label information; it also accounts for the relative position of subgraphs within a graph. However, the main concern with the graphlet spectrum or skew spectrum is its computational $O(N^3)$ complexity.

The third – more recent and perhaps more promising – approach to graph classification is to develop convolutional neural networks (CNNs) for graphs. The original idea of defining graph convolution operations comes from the graph signal processing domain [51], which has since been recognized as the problem of learning filter parameters that appear in the graph Fourier transform in the form of a graph Laplacian [47, 48]. Various GCNN models such as [50, 52, 53] have been proposed, where traditional graph filters are replaced by a self-loop graph adjacency matrix and the output of each neural network layer is computed using a propagation rule while updating the network weights. The authors in [49] extend such GCNN models by utilizing fast localized spectral filters and efficient pooling operations. A very different approach is proposed in [54], where a set of local nodes is converted into a sequence in order to create receptive fields which are then fed into a 1D convolutional neural network.

Another popular name for GCNNs is message passing neural networks (MPNNs) [55–58]. Though the authors in [56] suggest that GCNNs are a special case of MPNNs, we believe that the two are equivalent models in a certain sense; it is simply a matter of how the graph convolution operation is defined. In MPNNs, the hidden states of each node are updated based on messages received from its neighbors as well as the values of the previous hidden states in each iteration. This is made possible by replacing the traditional neural networks in a GCNN with a small recurrent neural network (RNN) whose weight parameters are shared across all nodes in the graph. Note that the number of iterations in MPNNs can be related to the depth of a GCNN model. In [59] the authors propose to condition the learning parameters of filters on edges rather than on traditional nodes. This approach is similar to some instances of MPNNs, such as in [56], where learning parameters are also associated with edges. All the above MPNN models employ aggregation as the graph permutation invariant layer for solving the graph classification problem. In contrast, the authors in [60, 61] employ a max-sort pooling layer and group theory, respectively, to achieve graph permutation invariance.

4.3 Graph Capsule CNN Model

Basic Setup and Notations: Consider a graph G = (V,E, A) of size N = |V |, where

$V$ is the vertex set, $E$ the edge set (with no self-loops) and $A = [a_{ij}]$ the weighted adjacency matrix. The standard graph Laplacian is defined as $L = D - A \in \mathbb{R}^{N \times N}$, where $D$ is the degree matrix. Let $X \in \mathbb{R}^{N \times d}$ be the node feature matrix, where $d$ is the input dimension. When used, we will use $h$ to denote the dimension of the hidden (latent) variables/feature space.

General GCNN Model: We start by describing a general GCNN model before presenting our Graph Capsule CNN model. Let G be a graph with graph Laplacian L, and let $X \in \mathbb{R}^{N \times d}$ be a node feature matrix. Then the most general form of a GCNN layer output function $f(X, L) \in \mathbb{R}^{N \times h}$ equipped with polynomial filters is given by Equation (4.1),

 

$$f(X, L) \;=\; \sigma\Bigg(\underbrace{\big[\,X \;\; LX \;\; \cdots \;\; L^{K}X\,\big]}_{g(X,L)} \underbrace{\begin{bmatrix} W_0 \\ W_1 \\ \vdots \\ W_K \end{bmatrix}}_{\text{learning weight parameters}}\Bigg) \;=\; \sigma\Big(\sum_{k=0}^{K} L^{k} X W_k\Big) \qquad (4.1)$$

In Equation (4.1), $g(X, L) \in \mathbb{R}^{N \times (K+1)d}$ is defined as a graph convolution filter of polynomial form with degree $K$, while $[W_0, W_1, \ldots, W_K]$ are the learning weight parameters, with each $W_k \in \mathbb{R}^{d \times h}$.

Note that $g(X, L) = [X, LX, \ldots, L^{K}X] \in \mathbb{R}^{N \times (K+1)d}$ can be seen as a new node feature matrix with extended dimension $(K+1)d$¹. Furthermore, $L$ can be replaced by any other suitable filter matrix, as discussed in [50, 94].

A GCNN model with a depth of L layers can be expressed recursively as,

$$f^{(\ell)}(X, L) \;=\; \sigma\Big(g\big(f^{(\ell-1)}(X, L), L\big)\, W^{(\ell)}\Big) \qquad (4.2)$$

where $W^{(\ell)}$ is the weight parameter matrix for the $\ell$-th layer, $1 \le \ell \le L$.

¹ Also referred to as the breadth of a GCNN layer.
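A minimal NumPy sketch of the layer in Equations (4.1)–(4.2) is given below (an illustration only; the weight shapes, random initialization and the tanh nonlinearity are assumptions, not the thesis configuration).

```python
import numpy as np

def gcnn_layer(X, L, W, sigma=np.tanh):
    """One GCNN layer: sigma( sum_k L^k X W_k ), i.e. Equation (4.1).
    W has shape (K+1, d_in, h)."""
    out = np.zeros((X.shape[0], W.shape[2]))
    LkX = X
    for k in range(W.shape[0]):
        out += LkX @ W[k]
        LkX = L @ LkX                          # next power: L^{k+1} X
    return sigma(out)

# Toy forward pass: N=4 nodes, d=3 input features, K=2, h=5 hidden units, 2 layers.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
L = np.diag(A.sum(1)) - A
X = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 3, 5)) * 0.1          # layer 1 weights: (K+1, d, h)
W2 = rng.normal(size=(3, 5, 5)) * 0.1          # layer 2 weights: (K+1, h, h)
H1 = gcnn_layer(X, L, W1)
H2 = gcnn_layer(H1, L, W2)                     # recursion of Equation (4.2)
print(H2.shape)                                # (4, 5): N x h, permutation equivariant
```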


Figure 4.1: The figure shows that the graph capsule function at node 0 computes a capsule vector which encodes higher-order statistical information about its local neighborhood (per feature). Here $\{x_0, x_1, x_2, x_3\}$ are the respective node feature values. For example, when a node has no more than two neighbors, it is possible to recover the input node neighbor values from the first three statistical moments.

One can notice that, in any layer, the basic computational expression involved is $[L^{k} f^{(\ell-1)}(X, L)]_{ij}$. This expression means that the new $j$-th feature value of the $i$-th node (associated with the $i$-th row) is produced as a single (scalar) aggregated value based on its local neighborhood. This particular operation can incur a significant loss of information. We aim to remedy this issue by introducing our novel GCAPS-CNN model based on the fundamental capsule idea.

4.4 Graph Capsule Networks

The core idea behind our proposed graph capsule convolutional neural network is to capture more information in a local node pool beyond what is captured by aggregation, the graph convolution operation used in a standard GCNN model. This new information is encapsulated in so-called instantiation parameters, described in [12], which form a capsule vector of highly informative outputs.

The quality of these parameters is determined by their ability to encode the node feature values in a local neighborhood of each node as well as to decode (i.e., reconstruct) them from the capsule vector. For instance, one can take the histogram of neighborhood feature values as the capsule vector. If the histogram bandwidth is sufficiently small, we are guaranteed to recover all the original input node values. This strategy has been used in constructing a successful graph kernel. However, as a histogram is not a continuous, differentiable function, it cannot be employed in backpropagation for end-to-end deep learning.

Besides seeking representative instantiation parameters, we further impose two more constraints on a graph capsule function. First, we want our graph capsule function to be permutation invariant (unlike the equivariance discussed in [12]) with respect to the input node order, since we are interested in a model that can produce the same output for isomorphic graphs. Second, we would like to be able to compute these parameters efficiently.

Graph Capsule Function: To describe a general graph capsule function, consider the $i$-th node with value $x_0$ and the set of its neighborhood node values $\mathcal{N}(i) =$

$\{x_0, x_1, x_2, \ldots, x_k\}$, including itself. In the standard graph convolution operation, the output is a scalar function $f: \mathbb{R}^{k} \to \mathbb{R}$ which takes the $k$ input neighbors at the $i$-th node and yields an output given by

$$f_i(x_0, x_1, \ldots, x_k) \;=\; \frac{1}{|\mathcal{N}(i)|} \sum_{k \in \mathcal{N}(i)} a_{ik} x_k \qquad (4.3)$$

where $a_{ik}$ represents the edge weight between nodes $i$ and $k$.

In our graph capsule network, we replace $f(x_0, \ldots, x_k)$ with a vector-valued capsule function $f: \mathbb{R}^{k} \to \mathbb{R}^{p}$. For example, consider a capsule function that captures higher-order statistical moments as follows (for simplicity, we omit the mean and standard deviation),

$$f_i(x_0, \ldots, x_k) \;=\; \frac{1}{|\mathcal{N}(i)|} \begin{bmatrix} \sum_{k \in \mathcal{N}(i)} a_{ik} x_k \\ \sum_{k \in \mathcal{N}(i)} a_{ik} x_k^{2} \\ \vdots \\ \sum_{k \in \mathcal{N}(i)} a_{ik} x_k^{p} \end{bmatrix} \qquad (4.4)$$

Figure 4.1 shows an instance of applying our graph capsule function on a specific node. Consequently, for an input feature matrix $X \in \mathbb{R}^{N \times d}$, our graph capsule network will produce an output $f(X, L) \in \mathbb{R}^{N \times h \times p}$, where $p$ is the number of instantiation parameters. The representational power of our Graph Capsule Networks can be inferred from Theorem 6.

Theorem 6. If the graph capsule function is injective, then the resulting Graph Capsule Network is as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test.

The proof is based on the fact that the WL algorithm itself applies an injective hash function to test graph isomorphism (it follows from Theorem 3 of [95]).

Managing Graph Capsule Vector Dimension: In the first layer, our graph capsule network receives an input $X \in \mathbb{R}^{N \times d}$ and produces a non-linear output $f^{(1)}(X, L) \in \mathbb{R}^{N \times h_1 \times p}$. Since our graph capsule function produces a vector of $p$ dimensions (for each of the $d$ input dimensions), the feature dimension of the output in subsequent layers can quickly blow up to an unmanageable value. To keep it in check, we restrict the feature dimension of the output $f^{(\ell)}(X, L)$ to always be in $\mathbb{R}^{N \times h_\ell \times p}$ at any middle $\ell$-th layer of a GCAPS-CNN (here $h_\ell$ represents the hidden dimension of that layer). This can be accomplished in two ways: 1) either by flattening the last two dimensions of $f(X, L)$ and carrying out the graph convolution in the usual way (see Equation 4.5 for an example), or 2) by taking a weighted combination of the $p$-dimensional capsule vectors at each node (similar to an attention mechanism), as done in [73]. We leave the second approach for future work. Thus, in a nutshell, our graph capsule network in the $\ell$-th layer ($\ell > 1$) receives an input $f^{(\ell-1)}(X, L) \in \mathbb{R}^{N \times h_{\ell-1} \times p}$ and produces an output $f^{(\ell)}(X, L) \in \mathbb{R}^{N \times h_\ell \times p}$.

Graph Capsule Function with Statistical Moments: In this chapter, we consider higher-order statistical moments as instantiation parameters, because they are permutation invariant and can be computed quickly through matrix-multiplication operations. To see exactly how, let $f_p(X, L)$ be the output matrix corresponding to the $p$-th dimension. Then we can compute $f_p^{(\ell)}(X, L)$, containing statistical moments as instantiation parameters, as follows,

$$f_p^{(\ell)}(X, L) \;=\; \sigma\Big(\sum_{k=0}^{K} L^{k}\,\big(\underbrace{f_F^{(\ell-1)}(X, L) \odot \cdots \odot f_F^{(\ell-1)}(X, L)}_{p\ \text{times}}\big)\, W_{pk}^{(\ell)}\Big) \qquad (4.5)$$

where $\odot$ is the Hadamard product. Here, to keep the feature dimensions from growing, we flatten the last two dimensions of the input as $f_F^{(\ell-1)}(X, L) \in \mathbb{R}^{N \times h_{\ell-1}p}$ and perform the usual graph convolution operation, followed by a linear transformation with $W_{pk}^{(\ell)} \in \mathbb{R}^{h_{\ell-1}p \times h_\ell}$ as the learning weight parameter. Note that here $p$ is used to denote both the capsule dimension and the order of the statistical moments.
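A small NumPy sketch of the moment-based capsule layer of Equation (4.5) follows (illustrative only; the tensor shapes, nonlinearity and random initialization are assumptions): for each moment order $q \le p$, the elementwise $q$-th power of the flattened input is propagated through the usual polynomial graph convolution.

```python
import numpy as np

def capsule_layer(F_prev, L, W, p, sigma=np.tanh):
    """Graph capsule layer of Equation (4.5).
    F_prev: (N, h_prev, p_prev) input capsules; W: (p, K+1, h_prev*p_prev, h).
    Returns (N, h, p) output capsules, one channel per moment order."""
    N = F_prev.shape[0]
    F_flat = F_prev.reshape(N, -1)                     # flatten last two dimensions
    K = W.shape[1] - 1
    out = []
    for q in range(1, p + 1):                          # q-th statistical moment
        Xq = F_flat ** q                               # Hadamard product taken q times
        acc = np.zeros((N, W.shape[3]))
        LkXq = Xq
        for k in range(K + 1):
            acc += LkXq @ W[q - 1, k]
            LkXq = L @ LkXq
        out.append(sigma(acc))
    return np.stack(out, axis=-1)                      # (N, h, p)

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
L = np.diag(A.sum(1)) - A
F0 = rng.normal(size=(4, 3, 1))                        # input features as 1-dim capsules
W = rng.normal(size=(2, 3, 3 * 1, 5)) * 0.1            # p=2, K=2, h_prev*p_prev=3, h=5
print(capsule_layer(F0, L, W, p=2).shape)              # (4, 5, 2)
```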

Graph Capsule Function with Polynomial Coefficients: As mentioned earlier, the quality of the instantiation parameters depends upon their capability to encode and decode the input values. Therefore, we seek capsule functions which are bijective in nature, i.e., guaranteed to preserve everything about the local neighborhood. For instance, one can consider the coefficients of a polynomial as instantiation parameters by taking the set of local node feature values as its roots,

  ∑  xk   ∈N   k (i)   ∑   xk1 xk2   ∈N   k1,k2 (i)  1  ∑  fi(·) =   (4.6) |N (i)|  xk1 xk2 xk3   ∈N  k1,k2,k3 (i)   .   .   

x0x1 . . . xk−1xk

One can show that, from a given full set of polynomial coefficients, we are guaranteed to recover all the original node values (up to permutation). However, the first issue with this approach is that the coefficients are expensive to compute at each node. Specifically, a combinatorial algorithm without the fast Fourier transform takes $O(k^2)$ complexity, where $k$ is the number of roots. There is also a numerical instability issue associated with computing polynomial coefficients. There are ways to deal with these kinds of issues, but we leave pursuing this direction for future work.
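For illustration only (not part of the thesis pipeline), NumPy's np.poly returns the coefficients of the polynomial whose roots are a given set of values; these coefficients are the elementary symmetric polynomials of Equation (4.6) up to alternating signs, and np.roots inverts the map (up to permutation), which is the bijectivity property discussed above. The example neighborhood values are assumed.

```python
import numpy as np

# Neighborhood feature values of a node, treated as roots of a polynomial.
neighborhood = np.array([0.5, 2.0, 3.0, 1.5])

coeffs = np.poly(neighborhood)                  # [1, -e1, e2, -e3, e4]
signs = (-1.0) ** np.arange(1, len(coeffs))
esp = coeffs[1:] * signs                        # e1, e2, ..., ek: the sums in Equation (4.6)
capsule = esp / len(neighborhood)
print("capsule:", np.round(capsule, 4))

recovered = np.roots(coeffs)                    # decode: original values up to permutation
print("recovered roots:", np.sort(recovered.real))
```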

In short, our graph capsule idea is powerful and can be employed in any type of GCNN model, whether for solving graph semi-supervised learning problems, performing sequence learning on graphs using Graph Recurrent Neural Network models (GCRNNs), doing link prediction via Graph Autoencoders (GAEs), and/or generating synthetic graphs through Graph Generative Adversarial models (GGANs).

4.5 Designing Graph Permutation Invariant Layer

In this section, we focus on the second limitation of the GCNN model: achieving permutation invariance for the purpose of graph classification. Before presenting the novel invariant layer in our GCAPS-CNN model, we first discuss the shortcomings of the Max-Sort Pooling layer, which is the next most popular choice after aggregation for achieving invariance.

4.5.1 Problems with Max-Sort Pooling Layer

We design a test to determine whether the invariant graph features constructed by a model have any degree of certainty of producing the same output for sub-graph isomers.

Sub-Graph Isomorphism Feature Test: Consider two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ such that $G_1$ is isomorphic to a sub-graph of $G_2$. Let $f_1, f_2 \in \mathbb{R}^{k}$ be the invariant feature vectors (w.r.t. graph isomorphism) of $G_1, G_2$, respectively. Then we define the sub-graph isomorphism feature test as a criterion guaranteeing that each element of $f_1$ and $f_2$ is comparable under a certain notion, i.e., $f_{1i} \equiv f_{2i}$ for any $i \in [1, k]$. Here $\equiv$ represents a comparison operator defined in a sensible way. Satisfying this test is very desirable for the graph classification problem, since it is quite likely that sub-graph isomers of a graph belong to the same class label. This property helps the model learn the $w_i$ weight parameter appropriately, which is shared across the same input place, i.e., $f_{1i}$ and $f_{2i}$.

Proposition 1. Let $f_1, f_2 \in \mathbb{R}^{k}$ be the feature vectors containing the top-$k$ max node values in sorted order for graphs $G_1, G_2$, respectively, and let $G_1$ be sub-graph isomorphic to $G_2$. Then the Max-Sort Pooling layer fails the Sub-Graph Isomorphism Feature Test, owing to the comparison being done with respect to node ordering.

Remarks: The Max-Sort Pooling layer fails the test because it cannot guarantee that $f_{1i} \equiv f_{2i}$ for every $i \in [1, k]$. Here the $\not\equiv$ (not comparable) operator indicates that the nodes corresponding to the values $f_{1i}$ and $f_{2i}$ may not be the same in sub-graph isomers. Even including a single node (value) in the $f_2$ vector which is not present in $G_1$ can disrupt the whole comparison order of the elements of $f_1$ and $f_2$. As a result, in the Max-Sort Pooling layer the comparison is not always guaranteed to be sensible, which makes the problem of learning weight parameters harder. In general, any invariant graph feature vector that relies on node ordering will fail this test.

4.5.2 Covariance as Permutation Invariant Layer

Our novel idea for permutation invariant features in the GCAPS-CNN model is to compute the covariance of the $f(X, L)$ layer output, given as follows,

$$C(f(X, L)) \;=\; \frac{1}{N}\,\big(f(X, L) - \mu\big)^{T}\big(f(X, L) - \mu\big) \qquad (4.7)$$

Here $\mu$ is the mean of the $f(X, L)$ output and $C(\cdot)$ is the covariance function. Since the covariance function is differentiable and does not depend upon the order of the row elements, it can serve as a permutation invariant layer in the GCAPS-CNN model. It is also fast to compute, owing to a single matrix-multiplication operation. Note that we flatten the last two dimensions of the GCAPS-CNN layer output $f(X, L) \in \mathbb{R}^{N \times h \times p}$ in order to compute the covariance.

Moreover, the covariance provides much richer information about the data by including shape, norm and angle information (between node hidden features), rather than just providing the mean of the data. In fact, in the multivariate normal distribution it is used as a statistical parameter to approximate the normal density, and thus it also reflects information about the data distribution. This particular property, along with invariance, has been exploited before in [96] for computing the similarity between two sets of vectors. One could also think about fitting a multivariate normal distribution to $f(X, L)$, but that involves computing the inverse of the covariance matrix, which is computationally expensive.

Since each element of the covariance matrix is invariant to node order, we can flatten the symmetric covariance matrix $C \in \mathbb{R}^{hp \times hp}$ to construct the graph invariant feature vector $f \in \mathbb{R}^{(hp+1)hp/2}$. On another positive note, the output dimension of $f$ here does not depend upon the number of nodes $N$ and can be adjusted according to computational constraints.
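A minimal NumPy sketch of the covariance readout of Equation (4.7) and its flattening to the upper triangle is shown below (illustrative; the function name and toy shapes are assumptions). Permuting the node order leaves the output unchanged.

```python
import numpy as np

def covariance_readout(F):
    """Equation (4.7): C = (1/N)(F - mu)^T (F - mu), flattened to the upper triangle."""
    N = F.shape[0]
    Z = F.reshape(N, -1)                         # flatten (N, h, p) -> (N, h*p)
    Zc = Z - Z.mean(axis=0, keepdims=True)
    C = (Zc.T @ Zc) / N
    iu = np.triu_indices_from(C)                 # keep each of the (hp+1)hp/2 unique entries once
    return C[iu]

rng = np.random.default_rng(2)
F = rng.normal(size=(6, 4, 2))                   # node-level capsule outputs: N=6, h=4, p=2
perm = rng.permutation(6)
print(covariance_readout(F).shape)               # (36,) = (hp+1)hp/2 with hp = 8
print(np.allclose(covariance_readout(F), covariance_readout(F[perm])))   # True: invariant
```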

Proposition 2. Let $f_1, f_2 \in \mathbb{R}^{k}$ be the feature vectors containing the covariance elements of the node feature matrices for graphs $G_1, G_2$, respectively, and let $G_1$ be sub-graph isomorphic to $G_2$. Then the covariance invariant layer passes the Sub-Graph Isomorphism Feature Test, owing to the comparison being done with respect to feature dimensions.

Remarks: It is quite straightforward to see that the feature dimension order of a node does not depend upon the graph node ordering, and hence the order is the same across all graphs. As a result, each element of $f_1$ and $f_2$ is always comparable. To be more specific, the covariance output compares both the norms and the angles between the corresponding pairs of feature dimension vectors in two graphs.

4.6 Designing GCAP-CNN with Global Features

Besides guaranteeing permutation invariance in the GCAP-CNN model, another important desired characteristic of a graph classification model is to capture the global structure (or features) of a graph. For instance, considering only the node degree (as a node feature) is local information and is not very helpful towards solving the graph classification problem. On the other hand, considering the spectral embedding as a node feature takes a global piece of information into account, and this has proven successful in serving as a node vector for problems dealing with graph semi-supervised learning. We define global features as those that take the full graph structure into account during their computation, while local features only depend upon some (at most) $k$-hop node neighbors.

Unfortunately, the basic design of the GCNN model can only capture local structure information of the graph at each node. We make this loose statement more concrete with the following theorem.

Theorem 7. Let $G$ be a graph with graph Laplacian $L \in \mathbb{R}^{N \times N}$ and node feature matrix $X \in \mathbb{R}^{N \times d}$. Let $f^{(\ell)}(X, L)$ be the output function of an $\ell$-th GCNN layer equipped with polynomial filters of degree $k$. Then the output $[f^{(\ell)}(X, L)]_i$ at the $i$-th node (i.e., the $i$-th row of $f^{(\ell)}(\cdot)$) depends only on the input values of neighbors at most $k\ell$ hops away.

Proof: We prove this statement by mathematical induction. It is easy to see that the base case $\ell = 1$ holds. Assume it also holds for $f^{(\ell-1)}(X, L)$, i.e., the $i$-th node output depends upon neighbors up to $k \times (\ell - 1)$ hops away. Then in $f^{(\ell)}(X, L) = \sigma\big(g(f^{(\ell-1)}(X, L), L)\, W^{(\ell)}\big)$ we focus on the term

$$g(X, L) \;=\; \big[f^{(\ell-1)}(X, L), \ldots, L^{k} f^{(\ell-1)}(X, L)\big] \qquad (4.8)$$

particularly the last term, involving $L^{k} f^{(\ell-1)}(X, L)$. Matrix multiplication of $L^{k}$ with $f^{(\ell-1)}(X, L)$ results in the $i$-th node including all node information at most $k$ hops away. But since a node in $f^{(\ell-1)}(X, L)$ at a distance of $k$ hops (from the $i$-th node) can contain information from up to $k \times (\ell - 1)$ hops away, the $i$-th node contains information at most $k + k(\ell - 1) = k\ell$ hops away.

Remarks: Theorem 7 above establishes that a GCNN model with $\ell$ layers can capture only $k\ell$-hop local neighborhood structure information at each node. Thus, employing a GCNN for graph classification with, say, an aggregation layer can capture only the average variation of $k\ell$-hop local neighborhood information over the whole graph. To include more global information about the graph, one can either increase $k$ (i.e., choose higher-order graph convolution filters) or $\ell$ (i.e., the depth of the GCNN model). Both of these choices increase model complexity and thus require more data samples to reach satisfying results. However, among the two, we prefer increasing the depth of the GCNN model, because the first choice leads to an increase in the breadth of the GCNN layer (see footnote 1 about $g(X, L)$ in Section 4.3), and based on the current understanding of deep learning theory, increasing the depth is favored over the breadth.

For cases where graph node features are missing, it is common practice to take the node degree as a node feature. Such practices can work for problems like graph semi-supervised learning, where local structure information drives the node output labels (or classes). But in graph classification, global features govern the output labels, and hence taking the node degree is not sufficient. Of course, we can go for a very deep GCNN model that will allow us to exploit more global information, but this requires higher sample complexity to achieve satisfying results.

To balance the two (model complexity with depth vs. required sample complexity), we propose to incorporate Fgsd features, computed at each node, in our GCAP-CNN model. As shown in [1], Fgsd features capture global information about the graph and can also be computed in a fast manner. Specifically, at each $i$-th node, the Fgsd features are computed as the histogram of the multiset formed by taking the harmonic distance between all nodes and the $i$-th node. It is given by,

$$S(x, y) \;=\; \sum_{n=0}^{N-1} \frac{1}{\lambda_n}\,\big(\phi_n(x) - \phi_n(y)\big)^{2} \qquad (4.9)$$

where $S(x, y)$ is the harmonic distance, $x, y$ are any graph nodes, and $\lambda_n, \phi_n(\cdot)$ are the $n$-th eigenvalue and eigenvector, respectively.
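A small NumPy sketch of these node-level global features is shown below (an illustration under assumptions: the number of bins and the histogram range are not the thesis settings). For every node it histograms the harmonic distances of Equation (4.9) to all other nodes, computed via the Laplacian pseudoinverse.

```python
import numpy as np

def fgsd_node_features(A, bins=20):
    """Per-node histogram of harmonic distances S(x, y) from Equation (4.9)."""
    L = np.diag(A.sum(1)) - A
    Lp = np.linalg.pinv(L)                   # realizes sum_n (1/lambda_n)(phi_n(x)-phi_n(y))^2
    d = np.diag(Lp)
    S = d[:, None] + d[None, :] - 2 * Lp     # harmonic distance matrix
    edges = np.linspace(0.0, S.max(), bins + 1)
    return np.stack([np.histogram(S[i], bins=edges)[0] for i in range(A.shape[0])])

A = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
X_global = fgsd_node_features(A)             # (N, bins) node feature matrix
print(X_global.shape)
```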

In our experiments, we employ these features only for datasets where node features are missing (specifically, for the social network datasets in our case), although this strategy can always be used by concatenating Fgsd features with the original node feature values to capture more global information. Further, inspired by the Weisfeiler-Lehman graph kernel [66], which also concatenates features in each labeling iteration, we propose to pass the concatenated outputs from the intermediate layers to our covariance and fully connected layers. Finally, our whole end-to-end GCAP-CNN learning model is guaranteed to produce the same output for isomorphic graphs.

4.7 Experiment and Results

GCAPS-CNN Model Configuration: We build an $\ell$-layer GCAPS-CNN with the following configuration: Input → GC(h, p) → · · · → GC(h, p) → [M, C(·)] → FC(h) → FC(h) → Softmax. Here GC(h, p) represents a Graph Capsule CNN layer with $h$ hidden dimensions and $p$ instantiation parameters. As mentioned earlier, we take the intermediate output of each GC(h, p) layer and form a concatenated tensor, which is subsequently passed through the [M, C(·)] layer, which computes the mean and covariance of the input. The output of the [M, C(·)] layer is then passed to two fully connected FC layers, again with $h$ output dimensions, and finally connects to a softmax layer for computing class probabilities. Between intermediate layers, we use batch normalization and dropout to prevent overfitting, along with L2 norm regularization. We set $\ell \in \{2, 3, 4\}$ depending upon the dataset size (higher for larger datasets) and $h \in \{32, 64, 128\}$ for the hidden dimension. We restrict $p \in [1, 4]$ for computing higher-order statistical moments, due to computational constraints. Further, we employ the ADAM optimization technique with an initial learning rate chosen from the set $\{10^{-1}, \ldots, 10^{-7}\}$, with a decay factor of 0.1 after every few epochs. The batch size is set according to the given dataset size and memory requirements. The number of epochs is chosen from the set {100, 200, 500, 1000}. All of the above mentioned hyper-parameters are tuned based on the training loss. The average classification accuracy based on 10-fold cross validation error is reported for each dataset. Our GCAPS-CNN code and data will be made available at GitHub².

Datasets: To evaluate our GCAPS-CNN model, we perform graph classification tasks on a variety of benchmark datasets. In the first round, we used 6 bioinformatics datasets, namely: PTC, PROTEINS, NCI1, NCI109, D&D, and ENZYMES. In the second round, we used 5 social network datasets, namely: COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY and REDDIT-MULTI-5K. The D&D dataset contains 691 enzyme and 587 non-enzyme protein structures.

2 https://github.com/vermaMachineLearning/Graph-Capsule-CNN-Networks/

Dataset        | PTC          | PROTEINS     | NCI1         | NCI109       | D&D          | ENZYMES
(No. Graphs)   | 344          | 1113         | 4110         | 4127         | 1178         | 600
(Max. G. Size) | 109          | 620          | 111          | 111          | 5748         | 126
(Avg. G. Size) | 25.56        | 39.06        | 29.80        | 29.60        | 284.32       | 32.60

Deep Learning Methods
DCNN [52]      | 56.60 ± 2.89 | 61.29 ± 1.60 | 56.61 ± 1.04 | 57.47 ± 1.22 | 58.09 ± 0.53 | 42.44 ± 1.76
PSCN [54]      | 62.29 ± 5.68 | 75.00 ± 2.51 | 76.34 ± 1.68 | —            | —            | —
ECC [59]       | —            | —            | 76.82        | 75.03        | 72.54        | 45.67
DGCNN [60]     | 58.59 ± 2.47 | 75.54 ± 0.94 | 74.44 ± 0.47 | 75.03 ± 1.72 | 79.37 ± 0.94 | 51.00 ± 7.29
GCAPS-CNN      | 66.01 ± 5.91 | 76.40 ± 4.17 | 82.72 ± 2.38 | 81.12 ± 1.28 | 77.62 ± 4.99 | 61.83 ± 5.39

Graph Kernels
RW [90]        | 57.85 ± 1.30 | 74.22 ± 0.42 | > 1 Day      | > 1 Day      | > 1 Day      | 24.16 ± 1.64
SP [65]        | 58.24 ± 2.44 | 75.07 ± 0.54 | 73.00 ± 0.24 | 73.00 ± 0.21 | > 1 Day      | 40.10 ± 1.50
GK [63]        | 57.26 ± 1.41 | 71.67 ± 0.55 | 62.28 ± 0.29 | 62.60 ± 0.19 | 78.45 ± 1.11 | 26.61 ± 0.99
WL [66]        | 57.97 ± 0.49 | 74.68 ± 0.49 | 82.19 ± 0.18 | 82.46 ± 0.24 | 79.78 ± 0.36 | 52.22 ± 1.26
DGK [67]       | 60.08 ± 2.55 | 75.68 ± 0.54 | 80.31 ± 0.46 | 80.32 ± 0.33 | 73.50 ± 1.01 | 53.43 ± 0.91
MLG [69]       | 63.26 ± 1.48 | 76.34 ± 0.72 | 81.75 ± 0.24 | 81.31 ± 0.22 | 78.18 ± 2.56 | 61.81 ± 0.99
GCAPS-CNN      | 66.01 ± 5.91 | 76.40 ± 4.17 | 82.72 ± 2.38 | 81.12 ± 1.28 | 77.62 ± 4.99 | 61.83 ± 5.39

Table 4.1: Classification accuracy on bioinformatics datasets. A result in bold indicates the best reported classification accuracy. The top half of the table compares results with various deep learning approaches, while the bottom half compares results with graph kernels. ‘> 1 Day’ means the computation exceeded 24 hrs. ‘OMR’ is an out-of-memory error.

Details of the other datasets can be found in [67]. For each dataset, the number of graphs and the maximum and average number of nodes are shown in Table 4.1 and Table 4.2.

Dataset        | COLLAB       | IMDB-BINARY  | IMDB-MULTI   | REDDIT-BINARY | REDDIT-MULTI
(No. Graphs)   | 5000         | 1000         | 1500         | 2000          | 5000
(Max. G. Size) | 492          | 136          | 89           | 3783          | 3783
(Avg. G. Size) | 74.49        | 19.77        | 13.00        | 429.61        | 508.5

Deep Learning Methods
DCNN [52]      | 52.11 ± 0.71 | 49.06 ± 1.37 | 33.49 ± 1.42 | OMR           | OMR
PSCN [54]      | 72.60 ± 2.15 | 71.00 ± 2.29 | 45.23 ± 2.84 | 86.30 ± 1.58  | 49.10 ± 0.70
DGCNN [60]     | 73.76 ± 0.49 | 70.03 ± 0.86 | 47.83 ± 0.85 | 76.02 ± 1.73  | 48.70 ± 4.54
GCAPS-CNN      | 77.71 ± 2.51 | 71.69 ± 3.40 | 48.50 ± 4.10 | 87.61 ± 2.51  | 50.10 ± 1.72

Graph Kernels
GK [63]        | 72.84 ± 0.28 | 65.87 ± 0.98 | 43.89 ± 0.38 | 77.34 ± 0.18  | 41.01 ± 0.17
DGK [67]       | 73.09 ± 0.25 | 66.96 ± 0.56 | 44.55 ± 0.52 | 78.04 ± 0.39  | 41.27 ± 0.18
GCAPS-CNN      | 77.71 ± 2.51 | 71.69 ± 3.40 | 48.50 ± 4.10 | 87.61 ± 2.51  | 50.10 ± 1.72

Table 4.2: Classification accuracy on social network datasets. A result in bold indicates the best reported classification accuracy. The top half of the table compares results with various deep learning approaches, while the bottom half compares results with graph kernels. ‘> 1 Day’ means the computation exceeded 24 hrs. ‘OMR’ is an out-of-memory error.

Experimental Set-up: All experiments were performed on a single machine loaded with 2× recently launched NVIDIA TITAN VOLTA GPUs and 64 GB RAM. We compare our method with both deep learning models and graph kernels.

Deep Learning Baselines: For deep learning approaches, we adopted 4 recently proposed state-of-the-art graph convolutional neural networks, namely: PATCHY-SAN (PSCN) [54], Diffusion CNNs (DCNN) [52], Dynamic Edge CNN (ECC) [59] and Deep Graph CNN (DGCNN) [60].

Graph Kernel Baselines: We adopted 6 state-of-the-art graph kernels for comparison, namely: Random Walk (RW) [90], Shortest Path Kernel (SP) [65], Graphlet Kernel (GK) [63], Weisfeiler-Lehman Sub-tree Kernel (WL) [66], Deep Graph Kernels (DGK) [67] and Multiscale Laplacian Graph Kernels (MLG) [69].

Baseline Settings: We adopted the same procedure as previous works [54, 60, 67] to make a fair comparison, and used 10-fold cross validation with the LIBSVM [91] library to report the classification performance of the graph kernels. The parameters of the SVM are tuned independently using the training folds, and the best average classification accuracies are reported for each method. For the Random Walk (RW) kernel, the decay factor is chosen from $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$. For the Weisfeiler-Lehman (WL) kernel, we chose the height of the subtree kernel from $h \in \{2, 3, 4\}$. For the Graphlet Kernel (GK), we chose the graphlet size from {3, 5, 7}, and for Deep Graph Kernels (DGK) we report the best classification accuracy obtained among the deep graphlet kernel, the deep shortest path kernel and the deep Weisfeiler-Lehman kernel. For the Multiscale Laplacian Graph (MLG) kernel, we chose the η and γ parameters of the algorithm from {0.01, 0.1, 1}, the radius size from {1, 2, 3, 4}, and the level number from {1, 2, 3, 4}. For Diffusion-Convolutional Neural Networks (DCNN), we chose the number of hops from {2, 5}. For the rest, the best reported results were borrowed from the papers PATCHY-SAN (k = 10) [54], ECC [59] (without edge labels, since all other methods also rely only on node labels) and DGCNN (with sorting layer) [60], since the experimental setup was the same and a fair comparison can be made. In short, we follow the same procedure as mentioned in the previous papers. Note: some results are not present because either they were not previously reported or the source code was not available to run them.

Graph Classification Results: From Table 4.1, it is clear that our GCAPS-CNN model consistently outperforms most of the considered deep learning methods on the bioinformatics datasets (except on the D&D dataset), with a significant margin of 1%–6% classification accuracy gain (the highest being on the NCI1 dataset).

Again, this trend continues on the social network datasets, as shown in Table 4.2. Here, we were able to achieve up to a 4% accuracy gain on the COLLAB dataset, and the rest were consistently around a 1% gain when compared against the other deep learning approaches.

Our GCAPS-CNN is also very competitive with state-of-the-art graph kernel methods. It again shows a consistent performance gain of 1%–3% accuracy (the highest being on the PTC dataset) on many bioinformatics datasets when compared against strong graph kernels, while the other considered deep learning methods are not even close to beating graph kernels on many of these datasets. It is worth mentioning that most deep learning models (like ours) are also scalable, while graph kernels are more fine-tuned towards handling small graphs.

For the social network datasets, we have a significant gain of at least 4%–9% accuracy (the highest being on the REDDIT-MULTI dataset) over graph kernels, as observed in Table 4.2. But this is expected, as deep learning methods tend to do better with the large amount of data available for training on social network datasets. Altogether, our GCAPS-CNN model shows very promising results against both the current state-of-the-art deep learning methods and graph kernels.

4.8 Conclusion

In this chapter, we present a novel Graph Capsule Network (GCAPS-CNN) model based on the fundamental capsule idea to address some of the basic weaknesses of existing GCNN models. Our graph capsule network model, by design, captures more local structure information than a traditional GCNN and can provide a much richer representation of individual graph nodes or of the whole graph. For our purpose, we employ a capsule function that preserves statistical moment information, since such moments are fast to compute.

Furthermore, we propose a novel permutation invariant layer based on computing the covariance in our GCAPS-CNN architecture to deal with the graph classification problem, which most GCNN models find challenging. This covariance can again be computed in a fast manner and is shown to be a better choice than adopting an aggregation or max-sort pooling layer. On top of that, we also propose to equip our GCAPS-CNN model with Fgsd features explicitly, to capture more global information in the absence of node features. This is essential to consider, since non-deep GCNN models are not capable enough to exploit global information implicitly. Finally, we show the superior performance of GCAPS-CNN on many bioinformatics and social network datasets in comparison with existing deep learning methods as well as strong graph kernels, and set the current state-of-the-art. Our general idea of graph capsules is quite rich and can be taken to another level by designing more sophisticated capsule functions that are capable of preserving more information in a local pool.

Chapter 5

Learning Universal and Transferable Graph Neural Network Embeddings

5.1 Introduction

We envision a deep universal graph embedding neural network (DUGnn), which is capable of the following: 1) it can be trained on diverse datasets (e.g., with different node features) for a variety of tasks in an unsupervised fashion to learn task-independent graph embeddings; 2) the learned graph embedding model can be shared across different datasets, thus enjoying the benefits of transfer learning; 3) the learned model can further be adapted and improved for specific tasks using adaptive supervised learning. Figure 5.1 shows some sample graphs from four different real-world datasets, drawn from fields as diverse as bioinformatics and quantum mechanics. While the node features (and their meanings) can differ vastly, the underlying graphs governing them contain isomorphic graph structures. This suggests that learning universal graph embeddings across diverse datasets is not only possible, but can potentially offer the benefits of transfer learning. From a theoretical point of view, we establish the generalization guarantee of the DUGnn model for the graph classification task and discuss the role of transfer learning in reducing the generalization gap.

Figure 5.1: Two pairs of isomorphic graphs sampled from different real-world bioinformatics and quantum mechanics datasets, namely NCI1, MUTAG, PTC and QM8, suggesting the importance of learning universal graph embeddings and performing transfer learning & multi-tasking (for learning more generalized embeddings).

To the best of our knowledge, we are the first to propose transfer learning in the graph neural network domain. A similar but still significantly different work is [97]; however, it relies on traditional CNNs, where transfer learning is well established, whereas this chapter establishes transfer learning specifically with graph neural networks (the first of its kind).

In order to develop a universal graph embedding model, we need to overcome three main technical challenges. First, existing GNNs operate at the node feature-matrix level. Unlike images or words, where the channels or embedding layer has a fixed input size, in graph learning the initial node feature dimension (a feature vector defined on the nodes of a graph) can vary across datasets. It is difficult to employ an embedding layer in GNNs, as this becomes circular, amounting to solving the graph isomorphism problem; furthermore, the graph vocabulary is potentially infinite. Second, the model complexity of GNNs is often limited by the basic graph convolution operation, which in its purest form is the aggregation of neighboring node features and may suffer from the Laplacian smoothing problem [98]. Lastly, the major technical hurdle is to devise an unsupervised graph decoder capable of regenerating or reconstructing the original graph directly from its graph embedding with minimal loss of information.

In tackling these challenges, we propose the DUGnn model architecture (depicted in Figure 5.2) with three carefully designed core components: 1) the Input Layer, 2) the Universal Graph Encoder, and 3) the Multi-Task Graph Decoder. The input layer transforms an input node feature of arbitrary dimension into one with a consistent output dimension and feeds it to the universal graph encoder. The universal graph encoder is based on a GNN model with several innovations and returns the final graph embedding. The multi-task graph decoder employs a graph reconstruction mechanism and leverages various precomputed, rich graph kernels to fine-tune the learned graph embedding in an unsupervised fashion. Through this decoder design, we combine the power of graph neural networks with that of graph kernels and leverage the best of both worlds.

Through extensive experiments and ablation studies, we show that the DUGnn model consistently outperforms both the existing state-of-the-art GNN models and graph kernels, with a 3% − 8% increase in accuracy on graph classification benchmark datasets.

In summary, the major contributions of this chapter are:

• We propose a novel, theoretically guaranteed DUGnn model for universal graph embedding learning that can be trained in an unsupervised fashion and is also capable of transfer learning.
• We leverage rich graph kernels to design a multi-task graph decoder, which incorporates the power of graph kernels into graph neural networks to get the best of both worlds.
• Our DUGnn model achieves superior results in comparison with existing graph neural networks and graph kernels on various types of graph classification benchmark datasets.

5.2 Related Work

Recently, various graph neural networks (viewed as "graph encoders") have been developed, in particular for learning task-specific node embeddings from graph-structured datasets. In contrast, we are interested in learning task-independent graph embeddings. Furthermore, our problem also involves designing a graph decoder – graph reconstruction using the graph embedding – which is far more challenging and has not received much attention in the literature. We overview the key related areas along with graph kernels.

Graph Encoders: Under graph encoders, we consider both graph convolutional neural networks (GCNNs) and message passing neural networks (MPNNs). The early development of GNNs can be traced back to graph signal processing [51], cast in terms of learning filter parameters of the graph Fourier transform [47, 48]. Various GNN models have since been proposed [50, 53, 99, 100] that mainly attempt to improve the basic GNN model along two aspects: 1) enhancing the graph convolution operation by developing novel graph filters; and 2) designing appropriate graph pooling operations. For instance, [94, 101] employ complex graph filters based on Cayley polynomials and b-splines, respectively, as a basis for the filtering operation. Similarly, [99] parameterizes graph filters using a residual Laplacian matrix, and the authors of [102] simply use polynomials of the adjacency matrix. For graph pooling operations, pre-computed graph coarsening layers obtained via the Graclus multilevel clustering algorithm are employed in [49], while a differentiable pooling operation is developed in [103]. The authors of [95] propose a sum-aggregation pooling operation that is better justified in theory. The authors of [55–57] propose message passing neural networks (MPNNs), which can be viewed as equivalent to GCNN models, since the underlying notion of the graph convolution operation is the same. A similar line of studies has been developed in [104, 105] for learning node embeddings on large graphs. Likewise, graph attention networks have also been proposed [106]. In contrast, the authors of [2, 60] propose GNNs for handling the graph classification problem. More recently, a pre-training strategy for GNNs has been proposed in [?]. MPNNs can also be broken into a two-step process in which edge features are first updated through message passing, and node features are then updated using the information encoded in their nearby edges; this is similar to the embedding belief-propagation message passing algorithm proposed in [57]. All the above-mentioned models focus on learning node embeddings and employ a mean-aggregation pooling operation over nodes to obtain the graph embedding output. Several attempts have also been made to convert graphs into a regular grid structure so that standard 2D or 1D CNNs can be applied directly [54, 107]. A rather tangential approach was taken in [61], where the authors design a covariant neural network based on group theory for computing graph representations. The main limitation of all the aforementioned GNN models is that they must be trained from scratch on each new dataset, with the embeddings tuned according to a supervised (task-specific) objective function. As a result, the learned embeddings are task dependent, and there is no mechanism for sharing the embedding model across different datasets. We design our GNN based model to lift all these limitations. Though the authors of [56] suggest that GCNNs are a special case of MPNNs, we believe that the two are equivalent models in a certain sense; it is simply a matter of how the graph convolution operation is defined. In MPNNs, the hidden state of each node is updated in each iteration based on messages received from its neighbors as well as the value of its previous hidden state. This is made possible by replacing the traditional neural network in a GCNN with a small recurrent neural network (RNN) whose weight parameters are shared across all nodes in the graph.
Note that the number of iterations in MPNNs can be related to the depth of a GCNN model. In [59], the authors propose to condition the learning parameters of the filters on edges rather than on nodes; this approach is similar to some instances of MPNNs, such as [56], where learning parameters are also associated with edges. All the above MPNN models employ aggregation as the graph permutation-invariant layer for solving the graph classification problem.
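
To make the distinction concrete, the following toy sketch contrasts one graph convolution step with one two-step message-passing update. It is a minimal illustration under our own naming (gcn_step, mpnn_step) and is not tied to any of the cited implementations.

```python
# Illustrative sketch (not the thesis implementation) contrasting one graph-convolution
# step with one message-passing update on a toy graph; all names are ours.
import numpy as np

def gcn_step(A, X, W):
    """One GCNN step: sigma(g(L) X W) with g(L) = D^{-1/2}(A+I)D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    gL = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.tanh(gL @ X @ W)

def mpnn_step(A, X, W_msg, W_upd):
    """One MPNN step: edge messages m_ij = W_msg x_j, then node update from summed messages."""
    M = (A @ X) @ W_msg            # sum of neighbor messages for every node
    return np.tanh(X @ W_upd + M)  # combine previous hidden state with aggregated messages

# toy usage: a 4-node path graph with 3-dimensional node features
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
H_gcn  = gcn_step(A, X, np.random.randn(3, 8))
H_mpnn = mpnn_step(A, X, np.random.randn(3, 8), np.random.randn(3, 8))
```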

Graph Decoders: Research on graph decoder design is currently under-explored. A few existing studies [108–111] mostly fall under the category of graph generation, namely, generating graphs with characteristics similar to those in the training dataset, as opposed to recovering or regenerating the (original) graph structures. The exceptions are graph autoencoders [112–114], which are designed primarily for the link prediction or node classification problem and provide only a partial solution to the graph reconstruction problem. The reconstruction mechanism employed by these graph autoencoders rebuilds the graph adjacency matrix from the latent node features output by the hidden layer (before the final pooling operation) rather than from the final graph embedding learned by the graph encoder. Furthermore, it requires computing an adjacency matrix reconstruction loss (O(N^2) time and space complexity), which can be expensive for large graphs.

Graph Kernels: A graph kernel provides a way to compute the similarity between two graphs based on their shared sub-structural properties. The literature on graph kernels is vast; we only outline a few. Some of the most popular graph kernels are the Weisfeiler-Lehman kernel [66], graphlet kernels [62, 63], and kernels based on random walks, shortest paths or anonymous walks [64, 65, 115]. Several graph kernels based on more complex kernel functions have also been developed that can capture sub-structural similarities at multiple levels; these include deep graph kernels [67], graph invariant kernels [68] and the multiscale Laplacian graph kernel [69]. Instead of directly computing graph kernels, powerful graph-spectrum based methods have also been developed, for example, the graphlet spectrum [77] based on group theory, and the family of graph spectral distances (FGSD) based on spectral graph theory. Another line of research in this area focuses on efficiently computing these kernels, either by exploiting certain structural dependencies or via approximation or randomization [70–72]. Other studies, such as [116], which take atomic 3D space coordinates into account rather than operating on the graph structure to construct features, can also be classified under this category.

Basic Setup and Notations: Let G = (V, E, A) be a graph where V is the vertex set, E the edge set (with no self-loops) and A the adjacency matrix, with N = |V| the graph size. We define the standard graph Laplacian L ∈ R^{N×N} as L = D − A, where D is the degree matrix. Let X ∈ R^{N×d} be the node feature matrix with d the input dimension, and let h denote the hidden dimension. Further, let f(L) be a function of the graph Laplacian, i.e., f(L) = U f(Λ) U^T, where Λ is the diagonal matrix of eigenvalues of L and U the eigenvector matrix. Let σ(·) be an activation function.
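
The notation above can be made concrete with a few lines of NumPy; the particular graph and the choice f(λ) = exp(−λ) below are only illustrative.

```python
# A small numpy sketch of the notation above: the graph Laplacian L = D - A,
# its eigendecomposition L = U Lambda U^T, and a spectral function f(L) = U f(Lambda) U^T.
import numpy as np

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)  # toy adjacency
D = np.diag(A.sum(axis=1))
L = D - A                                   # standard graph Laplacian

lam, U = np.linalg.eigh(L)                  # Lambda = diag(lam), columns of U are eigenvectors
f = lambda x: np.exp(-x)                    # any function applied to the eigenvalues (example)
fL = U @ np.diag(f(lam)) @ U.T              # f(L) = U f(Lambda) U^T

assert np.allclose(fL, fL.T)                # f(L) inherits symmetry from L
```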

Figure 5.2: Overall architecture of the DUGnn model in the form of tensor transformations. Starting from the left, we have a graph G(V, E) with node feature matrix X and adjacency matrix A. X is first transformed into a consistent feature dimension X̃ via the Input Transformer. Next, the Universal Graph Encoder computes the graph embedding z and the output Y, which is passed down to the decoder. Our Multi-Task Graph Decoder comprises graph kernel losses and an adjacency matrix reconstruction loss, along with an optional supervised task loss, for joint end-to-end learning.

Figure 5.2 depicts the overall architecture of the DUGnn model. We describe its core components, namely 1) the Input Layer, 2) the Universal Graph Encoder and 3) the Multi-Task Graph Decoder, in more detail below.

5.2.1 Input Layer

The dimension of the graph input features is task-specific and may differ across datasets. To make our universal graph encoder task-independent (or dataset-independent), we devise a specific input layer which transforms a given input node feature matrix into a consistent hidden feature dimension, i.e., T : X ∈ R^{N×d} → X̃ ∈ R^{N×h}, and feeds it to the universal graph encoder.

It is well known that GNN models aggregate feature information within a K-hop local neighborhood of a node [49]. Hence, these models rely heavily on the initial node features to capture crucial global structural information of a graph. However, in many datasets the node features are absent (in other words, the input is merely the underlying graph structure). In such cases, many models choose X = I, the identity matrix. Instead, we propose to initialize X as a Gaussian random matrix (in the absence of node features), which is justified by the following theorem.

Theorem 8 (Graph Spectral Embedding Approx. with GNN Random Feature Initialization). Let f(L) ∈ R^{N×N} be a function of the graph Laplacian and let X ∈ R^{N×d} ∼ N(0, σ^2) be a Gaussian random matrix, initialized as the node feature (or embedding) matrix, where d ≤ N. Then the resultant embedding f(L)X is equal to a randomly projected graph spectral embedding in R^d space, i.e., f(L)X = R(U f(Λ)), where R(·) is some random projection into R^d space.

The proof of Theorem 8 follows immediately from a key result in [117] stated below.

Lemma 1 (Eigenspace Approx. Using Random Signals). Let U ∈ R^{N×N} be an orthonormal matrix and R ∈ R^{N×d} be a Gaussian random matrix with i.i.d. entries ∼ N(0, σ^2). Then the entries of UR are i.i.d. Gaussian random samples with the same pdf ∼ N(0, σ^2).

Proof: Let f(L) be a function of the graph Laplacian that operates on the eigenvalues and is defined as f(L) = U f(Λ) U^T, where Λ is the diagonal eigenvalue matrix and U is the eigenvector matrix, which is also an orthonormal matrix. Therefore, f(L)R = U f(Λ) U^T R = U f(Λ) R̂, where R̂ = U^T R ∈ R^{N×d} is again a Gaussian random matrix by Lemma 1. Here Û = U f(Λ) contains the eigenvector columns scaled by the eigenvalues f(Λ). Thus Û R̂ results in a random projection (or dimension reduction) of the scaled eigenvector space of the graph Laplacian into a d-dimensional space.

Remarks: One consequence of Theorem 8 is that we can approximate different spectral dimension-reduction techniques with a suitably chosen function f(·); a fast algorithm to compute f(L) has also been devised in [1]. Moreover, f(L)X is nothing but a graph convolution operation and thus provides an approximation of the graph spectral embedding. This offers a theoretical explanation for the competitive performance of a randomly initialized GNN model, as previously noted in [50, 105], and shows the theoretical significance of using a random matrix X in a GNN model.
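
A small numerical sketch of Theorem 8 and Lemma 1 is given below; the random graph, the spectral function f, and the dimensions are arbitrary illustrative choices.

```python
# Numerical sketch of Theorem 8 / Lemma 1: with a Gaussian random feature matrix X,
# the graph convolution f(L)X factorizes as (U f(Lambda)) (U^T X), i.e., a random
# projection of the scaled spectral embedding, and U^T X remains Gaussian-like with
# (approximately) the same variance.
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 200, 16, 1.0
A = (rng.random((N, N)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T                      # random symmetric adjacency
L = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(L)
f = lambda x: 1.0 / (1.0 + x)                       # an example spectral function

X = sigma * rng.standard_normal((N, d))             # Gaussian random "node features"
fLX = U @ np.diag(f(lam)) @ U.T @ X                 # graph convolution f(L) X
R_hat = U.T @ X                                     # rotated random matrix (Lemma 1)

print(np.allclose(fLX, (U @ np.diag(f(lam))) @ R_hat))  # True: f(L)X = (U f(Lambda)) R_hat
print(R_hat.std())                                      # close to sigma: still Gaussian-like
```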

With an appropriately initialized input feature matrix X, our input layer performs the following operation:

T(X) = MLP( f(L) X )    (5.1)

where MLP is a multi-layer perceptron neural network.
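
One possible realization of this input layer is sketched below in PyTorch; the module name InputTransformer and its arguments are ours and do not reflect the released implementation. It also folds in the Gaussian random initialization of Theorem 8 for the featureless case.

```python
# One possible realization of the input layer T(X) = MLP(f(L)X) in Equation (5.1),
# sketched in PyTorch. Module/argument names are illustrative, not the released code.
import torch
import torch.nn as nn

class InputTransformer(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        # maps an arbitrary input feature dimension to the consistent hidden dimension h
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, fL, X=None):
        # fL: (N, N) precomputed spectral filter f(L); X: (N, in_dim) node features or None
        if X is None:
            # no node features: fall back to a Gaussian random matrix (Theorem 8)
            X = torch.randn(fL.size(0), self.mlp[0].in_features, device=fL.device)
        return self.mlp(fL @ X)          # X_tilde in R^{N x h}
```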

5.2.2 Universal Graph Encoder

Our universal graph encoder is based on a GNN model but with several key improvements. Let g(L) be a graph filter function. Some early adopted graph filters are polynomial functions of L [49]. We simply choose g(L) = D^{-1/2} A D^{-1/2} + I as in [50]. The ℓ-th layer output of a GNN model can in general be written as F^(ℓ)(X̃, L) = MLP( g(L) F^(ℓ−1)(X̃, L) ), where F^(0)(X̃, L) = X̃ and F^(ℓ)(X̃, L) ∈ R^{N×h^(ℓ)}. To further improve the model complexity of our graph encoder, we capture higher-order statistical moment information of the features during the graph convolution operation, similar to graph capsule networks [2], as follows,

F^(ℓ)(X̃, L) = MLP( Σ_{p=1}^{P} MLP( ( g(L) F^(ℓ−1)(X̃, L) )^p ) )    (5.2)

where P is the number of instantiation parameters and the power p is applied element-wise. By capturing the higher-order statistical moment information of the (hidden) features output by an intermediate layer and feeding them to the subsequent layer, we attempt to overcome the problem of Laplacian smoothing [98] in GNN models. As such, our graph encoder can learn on multi-scale smoothed features of a node and thus avoids both under-smoothing and over-smoothing. To further mitigate the smoothing problem, we also concatenate the outputs of intermediate encoding layers and feed them into subsequent layers. This also brings the side benefits of alleviating the vanishing-gradient problem and strengthening feature propagation in deep networks, as in Dense-CNN models [118].
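
A sketch of one such encoder layer is given below, under our reading of Equation (5.2) in which the element-wise powers p = 1, ..., P of the filtered features are passed through per-order MLPs, summed, and fed to an outer MLP; the class name GraphCapsuleLayer and all other names are illustrative, not the released code.

```python
# Sketch of one Universal Graph Encoder layer following Equation (5.2): the filtered
# features g(L)F are raised to element-wise powers p = 1..P (statistical moments),
# passed through per-order linear maps, summed, and fed to an outer MLP.
import torch
import torch.nn as nn

class GraphCapsuleLayer(nn.Module):
    def __init__(self, in_dim, out_dim, P=2):
        super().__init__()
        self.inner = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(P)])
        self.outer = nn.Sequential(nn.Linear(out_dim, out_dim), nn.ReLU())
        self.P = P

    def forward(self, gL, F):
        # gL: (N, N) graph filter, e.g. D^{-1/2} A D^{-1/2} + I;  F: (N, in_dim)
        H = gL @ F                                                 # basic graph convolution
        moments = sum(self.inner[p - 1](H ** p) for p in range(1, self.P + 1))
        return self.outer(moments)                                 # F^{(l)} in R^{N x h^{(l)}}

def graph_embedding(node_features):
    # permutation-invariant readout: sum-pool the final layer's node features
    return node_features.sum(dim=0)
```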

Finally, to obtain the final graph embedding z ∈ R^{h^(ℓ)} that is permutation-invariant, we perform a sum pooling operation over all node features, z = Σ_{i=1}^{N} F_i^(ℓ)(X̃, L). The sum pooling operation is shown in [95] to perform better than mean or max pooling. More sophisticated permutation-invariant pooling operations, such as applying a set2set model, sorting, or computing a covariance matrix [2, 60, 119], may also be incorporated into our model. We show its theoretical representational power below.

Theorem 9 (Deep Universal Graph Embedding Representational Power). The Deep Universal Graph Embedding model initialized with graph spectral embeddings as node features is at least as powerful as the classical Weisfeiler-Lehman (WL) graph isomorphism test.

Remarks: The part of Theorem 9 stating that the DUGnn representation power matches the WL graph isomorphism test follows directly from Theorem 3 of [95]. To make DUGnn more powerful than the classical WL graph isomorphism test (where node features are either identical or equal to the respective node degrees), one can initialize the DUGnn node features with graph spectral embeddings. As a result, DUGnn can differentiate certain regular graphs on which the classical WL graph isomorphism test fails. The main trick is to initialize node features so that they depend on the full graph structure rather than only the local structure. Conveniently, Theorem 8 lets us approximate the graph spectral embedding in our DUGnn model, and in a fast manner. In all fairness, the same trick can also make WL more powerful.


Figure 5.3: Two Regular Graphs G1 and G2.

Proof: To show that DUGnn is at least as powerful as WL, we give a general condition on k-regular graphs under which their graph representations differ; it suffices to exhibit a single example where WL fails but DUGnn succeeds. Consider two k-regular non-isomorphic graphs of size N and let N(i) denote the neighbor indices of the i-th node, including itself. Since node features are initialized identically (or equal to the node degree) in classic WL, i.e., x_v^(0) = c for each node v, after the local aggregation operation the new node feature is x_v^(1) = ϕ(c, f({c, ..., c})) (k times) for each node in G1, where f is a multi-set function and ϕ is an injective function. Similarly, for G2 each node representation after the first iteration is x_v^(1) = ϕ(c, f({c, ..., c})) (k times), which is the same as in G1. As a result, the graph representations of G1 and G2 are the same, since the sum-pool operation Σ_v x_v^(1) yields the same value in both cases.

Now consider DUGnn initialized with graph spectral embeddings as node features. Let {u_1, u_2, ..., u_N} and {v_1, v_2, ..., v_N} be the spectral embeddings of the nodes in G1 and G2, respectively. Then the updated representation of the i-th node is x_i^(1) = ϕ( Σ_{j∈N(i)} u_j ), where ϕ is the universal MLP function. Suppose the MLP ϕ learns the identity function. Then the sum-pool representation of G1 is y_1 = Σ_{i=1}^{N} Σ_{j∈N(i)} u_j = k Σ_{i=1}^{N} u_i and, similarly, the final representation of G2 is y_2 = k Σ_{i=1}^{N} v_i. As a result, every pair of non-isomorphic k-regular graphs whose row-wise sums of the eigenvector matrices are not equal can be distinguished by DUGnn, which is not possible with classic WL. It is easy to verify numerically that the row-wise sums of the eigenvector matrices of the graphs shown in Figure 5.3 are not equal, which implies y_1 ≠ y_2 and hence that their graph representations differ.
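
Since the exact graphs of Figure 5.3 are not reproduced here, the following sketch uses a standard pair of non-isomorphic 2-regular graphs (a 6-cycle versus two triangles) to illustrate the failure mode numerically: constant initial features with sum aggregation and sum pooling (WL-style) cannot separate the two graphs, while a signature that depends on the full graph structure (here, the Laplacian spectrum, used as a permutation-invariant stand-in for spectral node features) does.

```python
# Minimal numerical illustration of the argument above on a standard pair of
# non-isomorphic 2-regular graphs (a 6-cycle vs. two triangles).
import numpy as np
import networkx as nx

G1 = nx.cycle_graph(6)                                          # one 6-cycle
G2 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))    # two triangles

def wl_style_readout(G, iters=3):
    A = nx.to_numpy_array(G) + np.eye(G.number_of_nodes())      # aggregation incl. self
    x = np.ones(G.number_of_nodes())                            # identical initial features
    for _ in range(iters):
        x = A @ x
    return x.sum()                                               # sum-pooled graph representation

def spectral_signature(G):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return np.sort(np.linalg.eigvalsh(L))                        # permutation-invariant global feature

print(wl_style_readout(G1) == wl_style_readout(G2))              # True: indistinguishable
print(np.allclose(spectral_signature(G1), spectral_signature(G2)))  # False: spectra differ
```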

Next, we provide the generalization guarantee of the DUGnn model and further discuss the role of transfer learning in reducing the generalization gap. Note that the details of obtaining the generalization bound of a general GNN are deferred to Chapter 6.

Theorem 10 (Deep Universal Graph Embedding Model Generalization Guarantee). Let A_S be a single layer Universal Graph Encoder equipped with the graph convolution filter g(L) and trained on a dataset S using the SGD algorithm for T iterations. Let the loss and activation functions be Lipschitz-continuous and smooth. Then the following expected generalization guarantee holds with probability at least 1 − δ, where δ ∈ (0, 1),

E_sgd[R(A_S)] ≤ E_sgd[R_emp(A_S)] + O( (1/m) P N^{T+1} (λ_G^max)^{2T} + √(log(1/δ)/(2m)) ) + C √(log(1/δ)/(2m))

where E_sgd[R(·)] is the expected risk taken over the randomness due to SGD, R_emp(·) is the empirical risk, m is the number of training graph samples, N is the maximum graph size, λ_G^max is the largest eigenvalue of the graph filter, and C is an upper bound on the loss function.

Remarks: Theorem 10 relies on showing that Universal Graph Encoders are uniformly stable [87] and has several implications. First, normalized graph filters are theoretically more stable, since λ_G^max ≤ 1, and the parameter P controls the classic bias-variance tradeoff. Moreover, the theorem places no restrictions on the type of graph datasets employed for training, and it establishes that transfer learning remains beneficial across different datasets for the graph classification task. Intuitively, GNN parameters are learned based on the local (k-hop) graph structure, so having more samples always helps the model generalize better on unseen data and reduces the generalization error at a rate of O(1/√m).

5.2.3 Multi-Task Graph Decoder

Multi-task learning has been shown to yield superior results in natural language processing [10, 11]. We want to equip our DUGnn model with built-in multi-task learning capabilities. To this end, we employ multiple prediction objectives in our graph decoder so that it learns more generalized graph embeddings useful for different learning tasks. Note that the supervised task-specific loss, i.e., the cross-entropy loss L_class, is not considered part of our multi-task decoder (see Figure 5.2).

Graph Adjacency Matrix Reconstruction Loss: The major technical challenge in devising a general-purpose graph decoder is reconstructing the original graph structure directly from its graph embedding vector z. Unfortunately, this is an unexplored problem and falls only partially under the graph generation area [108, 109]; moreover, current work on graph generation does not focus on recovering the exact graph structure but rather on generating graphs with similar characteristics. We consider minimizing the adjacency reconstruction loss L_A as the first task of our graph decoder. The following adjacency loss L_A is incurred during mini-batch processing,

L_A = λ_A Σ_{i=1}^{B} ℓ_CE( σ(Y_i Y_i^T), A_i )    (5.3)

where Y = F^(ℓ)(X̃, L) is the encoder output, ℓ_CE is the binary cross-entropy loss corresponding to the presence or absence of each edge, and λ_A is the loss weight. This task has two shortcomings. First, L_A does not take the graph embedding z directly into account, and as such the graph embedding may incur a significant loss of information after the pooling operation. Second, at O(N^2) cost, computing L_A in every batch iteration may become expensive on datasets with large graphs. In our experiments, we were able to compute the adjacency reconstruction loss on all datasets except D&D.
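
A sketch of this reconstruction loss for one mini-batch is given below; it assumes per-graph encoder outputs Y_i and (float) adjacency matrices A_i as PyTorch tensors, and uses the logits form of the binary cross-entropy for numerical stability. The helper name is ours.

```python
# Sketch of the adjacency reconstruction loss in Equation (5.3) for one mini-batch.
# Y_list[i]: (N_i, h) encoder output; A_list[i]: (N_i, N_i) float adjacency matrix.
import torch
import torch.nn.functional as F

def adjacency_reconstruction_loss(Y_list, A_list, lambda_A=1.0):
    """L_A = lambda_A * sum_i BCE(sigmoid(Y_i Y_i^T), A_i), in logits form."""
    loss = 0.0
    for Y, A in zip(Y_list, A_list):
        logits = Y @ Y.t()                                  # (N_i, N_i) edge scores
        loss = loss + F.binary_cross_entropy_with_logits(logits, A)
    return lambda_A * loss
```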

Graph Kernel Loss: We propose a novel solution which leverages rich graph kernels to overcome the shortcomings present in the first task. In essence, graph kernels are developed to capture various key sub-structural properties of a graph: for two graphs,

K(G_i, G_j) provides a measure of their similarity. In our decoder design, we incorporate multiple graph kernels {K^(k)} to jointly learn and predict the quality of the (universal) graph embedding z directly, by minimizing the following unsupervised joint graph kernel loss L_K^(unsup),

ℓ_{K^(k)} = Σ_{i=1}^{B} Σ_{j=1}^{B} ℓ_MSE( σ(z_i^T W_k z_j), K_{ij}^(k) )

L_K^(unsup) = λ_K Σ_{k=1}^{K} λ_k ℓ_{K^(k)}

where ℓ_MSE is the mean squared error loss, σ(z_i^T W_k z_j) is a (learned) similarity function between two graph embeddings, and W_k ∈ R^{h^(ℓ)×h^(ℓ)} is the associated kernel weight parameter. By leveraging precomputed graph kernels, this computation is cheap in every batch iteration, and it also takes the graph embedding z directly into account for the joint learning of our graph encoder-decoder model.
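
The unsupervised kernel loss can be sketched as follows for one mini-batch; Z is the batch of graph embeddings, K_list holds the precomputed per-kernel similarity blocks, and W_list the learnable bilinear weights. All names are illustrative.

```python
# Sketch of the unsupervised joint graph-kernel loss above for one mini-batch.
# Z: (B, h) graph embeddings; K_list[k]: (B, B) precomputed kernel block; W_list[k]: (h, h).
import torch
import torch.nn.functional as F

def graph_kernel_loss(Z, K_list, W_list, lambda_K=1.0, lambda_k=None):
    """L_K^(unsup) = lambda_K * sum_k lambda_k * MSE(sigmoid(Z W_k Z^T), K^(k))."""
    if lambda_k is None:
        lambda_k = [1.0] * len(K_list)
    loss = 0.0
    for K, W, w in zip(K_list, W_list, lambda_k):
        S = torch.sigmoid(Z @ W @ Z.t())        # learned pairwise similarities (B, B)
        loss = loss + w * F.mse_loss(S, K)
    return lambda_K * loss
```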

Adaptive Supervised Graph Kernel Loss: For supervised learning tasks, we augment the loss function with an adaptive supervised graph kernel loss. Specifically, we focus on graph kernels that are aligned with the task objective. As shown in Equation (5.4), if the class labels of two graphs are the same, we choose the graph kernel with the maximum similarity value, and vice versa. Thus, we make sure to pick only those sub-structural properties that are relevant to the given task. For instance, on the MUTAG dataset, counting the number of cycles is more important than computing the distribution of random walks (see Section 5.4 for more details).

L_K^(sup) = λ_K Σ_{i=1}^{B} Σ_{j=1}^{B} ℓ_MSE( σ(z_i^T W_k z_j), I_{(y_i=y_j)} max_k( K_{ij}^(k) ) + (1 − I_{(y_i=y_j)}) min_k( K_{ij}^(k) ) )    (5.4)
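
One way to realize the adaptive target in Equation (5.4) is sketched below: for every graph pair, the regression target is the maximum kernel value over k when the labels agree and the minimum otherwise. How the per-kernel similarity heads are combined is not fully specified above, so summing the loss over all W_k with this shared adaptive target is our assumption.

```python
# Sketch of the adaptive supervised kernel loss in Equation (5.4).
# Z: (B, h) embeddings; y: (B,) labels; K_stack: (num_kernels, B, B); W_list[k]: (h, h).
import torch
import torch.nn.functional as F

def adaptive_supervised_kernel_loss(Z, y, K_stack, W_list, lambda_K=1.0):
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()          # (B, B) label-agreement indicator
    target = same * K_stack.max(dim=0).values + (1 - same) * K_stack.min(dim=0).values
    loss = 0.0
    for W in W_list:                                           # one bilinear similarity per kernel
        S = torch.sigmoid(Z @ W @ Z.t())
        loss = loss + F.mse_loss(S, target)
    return lambda_K * loss
```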

Computational Complexity: Precomputed graph kernel representations are typically bounded by O(N^2) time per graph (examples include the WL and FGSD kernels), where N is the number of nodes in a graph. Further, computing the graph kernel loss in each batch iteration requires O(B^2) time and space, where B is the number of graphs in the batch. Overall, we have carefully designed our DUGnn model architecture to extract high-quality graph embeddings that can be shared and utilized by downstream applications, in a similar spirit to sharing embedding models for multi-task and transfer learning in the NLP and CV domains.

5.3 Experiment and Results

In this section, we evaluate our DUGnn model thoroughly on a variety of graph classification benchmark datasets.

DUGnn Model Configuration: In the Input Transformer, we set the hidden dimension to h ∈ {16, 32, 64}. For graph datasets without node features, we initialize X using a Gaussian random matrix and choose f(L) as the normalized symmetric Laplacian. For the Universal Graph Encoder, we build an ℓ ∈ {5, 7} layer deep GNN with internal hidden dimensions chosen from h^(ℓ) ∈ {16, 32, 64} and the instantiation parameter from p ∈ {1, 2, 4}. In the Multi-Task Decoder, we employ three graph kernels, 1) the WL-subtree graph kernel, 2) the shortest-path kernel and 3) FGSD, as they are relatively fast to compute. To keep the network outputs stable, we use batch normalization between layers along with L2 weight-norm regularization and dropout to prevent overfitting. In addition, we employ the ADAM optimizer with varying learning rates as proposed in [120], with its parameters set as: max epoch 3000, warmup epochs 2, initial learning rate 10^-4, max learning rate 10^-3 and final learning rate 10^-4. All models are trained with an early stopping criterion based on the validation loss. Our DUGnn code is available on GitHub^1. Average classification accuracy based on 10-fold cross-validation is reported for each dataset.

Datasets: We employ 6 benchmark bioinformatics datasets to evaluate the DUGnn model on the graph classification task, namely PTC, PROTEINS, NCI1, NCI109, D&D, and ENZYMES. The D&D dataset contains 691 enzyme and 587 non-enzyme protein structures; details for the remaining datasets are given in [67].

1 https://github.com/codeanonymous/deep-universal-graph-embedding-neural-network/

Model / Dataset      | PTC          | PROTEINS     | ENZYMES      | D&D*         | NCI1
(Number of Graphs)   | 344          | 1113         | 600          | 1178         | 4110
(Avg. nodes)         | 25.56        | 39.06        | 32.60        | 284.32       | 29.80
(Max. nodes)         | 109          | 620          | 126          | 5748         | 111
GK-RW [90]           | 57.85 ± 1.30 | 74.22 ± 0.42 | 24.16 ± 1.64 | > 24 hrs     | > 24 hrs
GK-SP [65]           | 58.24 ± 2.44 | 75.07 ± 0.54 | 40.10 ± 1.50 | > 24 hrs     | 73.00 ± 0.24
GK-GK [63]           | 57.26 ± 1.41 | 71.67 ± 0.55 | 26.61 ± 0.99 | 78.45 ± 1.11 | 62.28 ± 0.29
GK-WL [66]           | 57.97 ± 0.49 | 74.68 ± 0.49 | 52.22 ± 1.26 | 79.78 ± 0.36 | 82.19 ± 0.18
GK-DGK [67]          | 60.08 ± 2.55 | 75.68 ± 0.54 | 53.43 ± 0.91 | 73.50 ± 1.01 | 80.31 ± 0.46
GK-MLG [69]          | 63.26 ± 1.48 | 76.34 ± 0.72 | 61.81 ± 0.99 | 78.18 ± 2.56 | 81.75 ± 0.24
GK-FGSD [1]          | 62.80        | 73.42        | —            | 77.10        | 79.80
GK-AWE [115]         | —            | —            | 35.77 ± 5.93 | 71.51 ± 4.02 | —
GNN-PSCN [54]        | 62.29 ± 5.68 | 75.00 ± 2.51 | —            | —            | 76.34 ± 1.68
GNN-DCNN [52]        | 56.60 ± 2.89 | 61.29 ± 1.60 | 42.44 ± 1.76 | 58.09 ± 0.53 | 56.61 ± 1.04
GNN-ECC [59]         | —            | —            | 45.67        | 72.54        | 76.82
GNN-DIFFPOOL [103]   | —            | 76.25        | 62.5         | 80.64        | —
GNN-DGCNN [60]       | 58.59 ± 2.47 | 75.54 ± 0.94 | 51.00 ± 7.29 | 79.37 ± 0.94 | 74.44 ± 0.47
GNN-GIN [95]         | 64.60 ± 7.00 | 76.20 ± 2.80 | —            | —            | 82.70 ± 1.70
GNN-GCAPS [2]        | 66.01 ± 5.91 | 76.40 ± 4.17 | 61.83 ± 5.39 | 77.62 ± 4.99 | 82.72 ± 2.38
DUGnn                | 74.72 ± 6.02 | 81.74 ± 2.46 | 67.38 ± 4.84 | 82.49 ± 3.38 | 85.56 ± 1.22

Table 5.1: Graph classification accuracy on bioinformatics datasets. A result in bold indicates the best reported accuracy. The top half of the table compares results with graph kernels (GK), while the bottom half compares results with graph neural networks (GNN). *On the D&D dataset, we omit computing the adjacency reconstruction loss due to GPU memory constraints.

Graph Neural Network and Graph Kernel Baselines: We compare the DUGnn model performance against 7 recently proposed state-of-the-art GNNs, namely PATCHY-SAN (PSCN) [54], Diffusion CNN (DCNN) [52], Dynamic Edge CNN (ECC) [59], hierarchical graph differentiable pooling (DIFFPOOL) [103], Deep Graph CNN (DGCNN) [60], Graph Isomorphism Network (GIN) [95] and Graph Capsule Networks (GCAPS) [2]. We also compare it against 8 state-of-the-art graph kernels: Random Walk (RW) [90], Shortest Path (SP) [65], Graphlet Kernel (GK) [63], Weisfeiler-Lehman Subtree Kernel (WL) [66], Deep Graph Kernels (DGK) [67], Multiscale Laplacian Graph Kernel (MLG) [69], Family of Graph Spectral Distances (FGSD) [1] and Anonymous Walk Embeddings (AWE) [115].

Experimental Set-up: We first train DUGnn in an unsupervised fashion on the six bioinformatics datasets together. Here the universal graph encoder model is shared across all the datasets. Next, we fine-tune DUGnn for each dataset separately using a class-specific objective function and the adaptive supervised kernel loss. We report 10-fold cross-validation results obtained by closely following the experimental setup used in previous studies [95, 103]. To make a fair comparison, we either cite the best cross-validation results previously reported wherever possible or run the source code, if available, according to the authors' guidelines. For graph kernels, the LIBSVM [91] library is used to report the classification performance. Further details about the baseline experimental settings are provided in the appendix.

Graph Classification Results: Table 5.1 shows the classification results on the considered datasets for both graph neural network models and graph kernel methods. It is clear that DUGnn consistently outperforms every state-of-the-art GNN model by a margin of 3% − 8% in prediction accuracy on bioinformatics as well as social-network datasets (with the highest accuracy gain achieved on PTC). On relatively small datasets such as PTC and ENZYMES, the increase in accuracy is around 6% − 8%. This empirically confirms our hypothesis that transfer learning in the graph domain is quite beneficial, especially where available graph samples are limited.

Our DUGnn model also significantly outperforms all the state-of-the-art graph kernel methods. We again observe a consistent performance gain of 3% − 11% (again with the highest increase on the PTC dataset). Interestingly, our DUGnn, integrated with graph kernels in the multi-task decoder, outperforms the WL-subtree, FGSD and SP kernels. Altogether, DUGnn shows very promising results over all considered graph neural networks and graph kernel methods on all bioinformatics datasets.

Model / Results     | Val MAE (×10^-3) | Test MAE (×10^-3)
MPNN [56]           | 14.60            | 14.30
DTNN [121]          | 17.00            | 16.90
GCNN [122]          | 15.00            | 14.80
DUGnn - L_A - L_K   | 11.16            | 11.54

Table 5.2: Ablation study of the Universal Graph Encoder on the quantum mechanics dataset QM8. DUGnn - L_A - L_K (the model trained from scratch without the multi-task decoder) sets the new state-of-the-art result on QM8.

5.4 Ablation Studies and Discussion

We now take a closer look at the performance of each component of the DUGnn model by performing various ablation studies, showing their individual importance and contributions. We start by examining the performance of our proposed Universal Graph Encoder.

How powerful is our Universal Graph Encoder without multi-tasking and transfer learning?

We conduct an experiment in which the DUGnn model is trained from scratch without the multi-task decoder and evaluate its performance on a large quantum mechanics dataset, QM8 (containing 21786 compounds). For this purpose, we use the same experimental settings provided in [122], including the same data splits given by the DeepChem^2 implementation, and compare it against three state-of-the-art GNNs: 1) MPNN [56], 2) DTNN [121] and 3) GCNN^3. From Table 5.2, it is clear that our universal graph encoder significantly outperforms these GNNs by a large margin of 20% − 30% in terms of mean absolute error (MAE), achieving the new state-of-the-art result on the QM8 dataset.

How much gain do we see from sharing the pretrained DUGnn model via transfer learning?

Model / Dataset      | NCI1  | PTC
DUGnn                | 83.51 | 74.22
DUGnn - NCI1 / PTC   | 83.10 | 73.18

Table 5.3: Ablation study of transfer learning. DUGnn is the base model trained from scratch on both the NCI1 and PTC datasets via transfer learning. DUGnn - NCI1 / PTC denotes the models trained from scratch on the individual datasets.

We conduct an ablation study to determine the importance of utilizing the pretrained DUGnn model. In this experiment, we pick one of the cross-validation splits of the NCI1 and PTC datasets for training and validation, and fix all the hyper-parameters, including the random seeds, across the full ablation experiment. We first train a DUGnn model on both NCI1 and PTC. Then DUGnn models are trained on each dataset from scratch. Table 5.3 shows that training without transfer learning reduces accuracy by around 0.4% − 1% on both datasets. We also see a bigger accuracy jump on PTC, since its dataset size is smaller and it thus benefits more from transfer learning.

How much boost do we get with the Multi-Task Graph Decoder?

In this ablation study, we train DUGnn from scratch with different loss functions. Table 5.4 reveals that completely removing the graph decoder (i.e., L_A and L_K^(unsup)) reduces accuracy by around 3% − 4% on both datasets, while removing only the graph kernel loss (L_K^(unsup)) reduces accuracy by 2% − 3%. The accuracy drops by around 1% when L_A is removed. Lastly, removing the supervised loss (L_class) reduces the performance considerably; nonetheless, our model remains competitive and performs better than various graph kernel methods (see Table 5.1).

Model / Dataset             | PTC   | ENZYMES
DUGnn                       | 73.53 | 65.00
DUGnn - L_A - L_K^(unsup)   | 70.59 | 61.67
DUGnn - L_A                 | 72.68 | 64.10
DUGnn - L_K^(unsup)         | 71.59 | 62.83
DUGnn - L_class             | 64.71 | 56.53

Table 5.4: Ablation study of the Multi-Task Decoder. DUGnn - L(·) denotes the model trained from scratch without the L(·) loss function. We pick one of the cross-validation splits to report the accuracy; all hyper-parameters, random seeds and data splits were kept constant across the ablation experiments.

2 https://deepchem.io/
3 http://moleculenet.ai/models

How effective is the adaptive supervised graph kernel loss?

Model / Dataset                    | PTC   | ENZYMES
DUGnn                              | 73.53 | 65.00
DUGnn - L_K^(unsup) + L_K^(sup)    | 76.47 | 64.13

Table 5.5: Ablation study of the adaptive supervised graph kernel loss. DUGnn is the base model trained with the non-adaptive kernel loss function. DUGnn - L_K^(unsup) + L_K^(sup) is trained with the adaptive loss in place of the non-adaptive graph kernel loss.

We observe that employing the adaptive supervised graph kernel loss yields a smoother decay in the validation loss and produces more stable results. However, as evident from Table 5.5, it increases the performance on PTC by 3% but reduces the performance on ENZYMES by around 1%. As a result, we advise treating the adaptive supervised kernel loss as a hyper-parameter.

Figure 5.4: Some MUTAG graph data samples. (a) A typical label 1 sample. (b) A typical label 2 sample.

Significance of the Adaptive Supervised Graph Kernel Loss: We qualitatively demonstrate the significance of employing the adaptive supervised graph kernel loss on the MUTAG dataset. For this purpose, we take a deeper dive into the results to find out which sub-structures are more important for prediction. Figures 5.4a and 5.4b depict some representative graph structures for class labels 1 and 2 on the MUTAG dataset. It turns out that the graph samples with label 1 mostly have three or more cycles, while the graph samples with label 2 tend to contain only one cycle; graph samples with two cycles can belong to either class. By simply creating a learning rule that assigns label 1 to samples containing 3 or more cycles, we can obtain a prediction accuracy of around 84%. This simple rule alone beats the random-walk graph kernel based method, which achieves a prediction accuracy of 80.72% [66]. By employing an adaptive supervised graph kernel loss, the model can thus learn embeddings that are biased towards a graph kernel that better captures the number of cycles in a graph, and discount graph kernels that attempt to match random walk distributions, which do not help increase the prediction accuracy.
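
The cycle-counting rule itself is trivial to implement; the sketch below counts independent cycles via the cyclomatic number |E| − |V| + #components for a MUTAG-like graph given as a networkx object (the ~84% accuracy figure quoted above is from the thesis experiments and is not reproduced here).

```python
# Sketch of the simple cycle-counting rule discussed above.
import networkx as nx

def cycle_rule_predict(G):
    # number of independent cycles = cyclomatic number |E| - |V| + #components
    num_cycles = G.number_of_edges() - G.number_of_nodes() + nx.number_connected_components(G)
    return 1 if num_cycles >= 3 else 2      # label 1: three or more cycles, label 2 otherwise
```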

5.5 Conclusions

We have presented a powerful universal graph embedding neural network architecture with three carefully designed components that can learn task-independent graph embeddings in an unsupervised fashion. In particular, the universal graph encoder component can be re-utilized across different datasets by leveraging transfer learning, and the decoder component incorporates various graph kernels to capture rich sub-structural properties and enable multi-task learning. Through extensive experiments and ablation studies on benchmark graph classification datasets, we show that our proposed DUGnn model can significantly outperform both the existing state-of-the-art graph neural network models and graph kernel methods. This demonstrates the benefit of combining the power of graph neural networks in the design of a universal graph encoder with that of graph kernels in the design of a multi-task graph decoder.

Chapter 6

Stability and Generalization Guarantees of Graph Neural Networks

6.1 Introduction

In recent years, there has been a huge thrust towards learning on graphs, owing to the unprecedented real-world performance of Graph Convolutional Neural Networks (GCNNs). Building upon the success of deep learning in computer vision (CV) and natural language processing (NLP), GCNNs [50] have recently been developed for tackling various learning tasks on graph-structured datasets. These models have shown superior performance on real-world datasets from various domains, such as node labeling on social networks [112], link prediction in knowledge graphs [123] and molecular graph classification in quantum chemistry [56]. Due to the versatility of graph-structured data representation, GCNN models have been incorporated in many diverse applications, e.g., question-answering systems [124] in NLP and image semantic segmentation [125] in CV. While various

versions of GCNN models have been proposed, there is a dearth of theoretical exploration of GCNN models ([95] is one of the few exceptions, which explores the discriminative power of GCNN models), especially in terms of their generalization properties and (algorithmic) stability. The latter is of particular importance, as the stability of a learning algorithm plays a crucial role in generalization.

The generalization of a learning algorithm can be explored in several ways. One of the earliest and most popular approaches is Vapnik–Chervonenkis (VC) theory [126], which establishes generalization errors in terms of the VC dimension of a learning algorithm. Unfortunately, VC theory is not applicable to learning algorithms with unbounded VC dimension, such as neural networks. Another way to show generalization is to perform a Probably Approximately Correct (PAC) [127] analysis, which is generally difficult to do in practice. The third approach, which we adopt, relies on deriving stability bounds of a learning algorithm, often known as algorithmic stability [87]. The idea behind algorithmic stability is to understand how the learned function changes with small changes in the input data. Over the past decade, several definitions of algorithmic stability have been developed [87, 128–131], including uniform stability, hypothesis stability, pointwise hypothesis stability, error stability and cross-validation stability, each yielding either a tight or loose bound on the generalization error. For instance, learning algorithms based on Tikhonov regularization satisfy the uniform stability criterion (the strongest stability condition among all existing forms of stability) and thus are generalizable.

In this chapter, we take a first step towards developing a deeper theoretical understanding of GCNN models by analyzing the (uniform) stability of GCNN models and thereby deriving their generalization guarantees. For simplicity of exposition, we focus on single layer GCNN models in a semi-supervised learning setting. The main result of this chapter is that (single layer) GCNN models with stable graph convolution filters can satisfy the strong notion of uniform stability and thus are generalizable. More specifically, we show that the stability of a (single layer) GCNN model depends upon the largest absolute eigenvalue (the eigenvalue with the largest absolute value) of the graph filter it employs, or more generally the largest singular value if the graph filter is asymmetric, and that the uniform stability criterion is met if the largest absolute eigenvalue (or singular value) is independent of the graph size, i.e., the number of nodes in the graph. As a consequence of our analysis, we establish that (appropriately) normalized graph convolution filters, such as the symmetric normalized graph Laplacian or random-walk based filters, are all uniformly stable and thus are generalizable. In contrast, graph convolution filters based on the unnormalized graph Laplacian or adjacency matrix do not enjoy algorithmic stability, as their largest absolute eigenvalues grow with the graph size. Empirical evaluations based on real-world datasets support our theoretical findings: the generalization gap and the instability of the weight parameters in the case of unnormalized graph filters are significantly higher than those of normalized filters. Our results shed new insights on the design of new and improved graph convolution filters with guaranteed algorithmic stability.

We remark that our GCNN generalization bounds obtained from algorithmic stability are non-asymptotic in nature, i.e., they do not assume any form of data distribution. Nor do they hinge upon the complexity of the hypothesis class, unlike most uniform convergence bounds. We only assume that the activation and loss functions employed are Lipschitz-continuous and smooth. These criteria are readily satisfied by several popular activation functions such as ELU (with α = 1), Sigmoid and Tanh. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and to derive generalization bounds for GCNN models. Our analysis framework is general enough to be extended to theoretical stability analyses of GCNN models beyond the semi-supervised learning setting (where there is a single, fixed underlying graph structure), such as graph classification (where there are multiple graphs).

In summary, the major contributions of this Chapter are:

• We provide the first generalization bound on single layer GCNN models based on analysis of their algorithmic stability. We establish that GCNN models which employ graph filters with bounded eigenvalues that are independent of the graph size can satisfy the strong notion of uniform stability and thus are generalizable.

• Consequently, we demonstrate that many existing GCNN models that employ normalized graph filters satisfy the strong notion of uniform stability. We also justify the importance of employing batch normalization in a GCNN architecture.

• Empirical evaluations of the generalization gap and stability using real-world datasets support our theoretical findings.

This chapter is organized as follows. Section 6.2 reviews key generalization results for deep learning as well as regularized graphs and briefly discusses existing GCNN models. The main result is presented in Section 6.4, where we introduce the needed background and establish the GCNN generalization bounds step by step. In Section 6.7, we apply our results to existing graph convolution filters and GCNN architecture designs. In Section 6.8 we conduct empirical studies which complement our theoretical analysis. The chapter is concluded in Section 6.9.

6.2 Related Work

In this section, we take a look at GCNN models, also referred to as graph neural network (GNN) models. In the absence of prior theoretical work on GCNNs, we also review existing generalization bounds for other deep learning methods.

Generalization Bounds on Deep Learning: The literature on the theoretical understanding of deep learning is relatively large, and we only outline a few works. Many theoretical studies have been devoted to understanding the representational power of neural networks by analyzing their capability as universal function approximators as well as their depth efficiency [132–136]. In [136] the authors show that the number of hidden units in a shallow network has to grow exponentially (as opposed to linearly in a deep network) in order to represent the same function; thus depth yields a much more compact representation of a function than breadth. It is shown in [132] that convolutional neural networks with the ReLU activation function are universal function approximators with max pooling, but not with average pooling. The authors of [137] explore which complexity measure is most appropriate for explaining the generalization power of deep learning. The work closest to ours is [138], where the authors derive upper bounds on the generalization errors of stochastic gradient methods. While also utilizing the notion of uniform stability [87], their analysis is concerned with the impact of SGD learning rates. More recently, through empirical evaluations on real-world datasets, it has been argued in [139] that traditional measures of model complexity are not sufficient to explain the generalization ability of neural networks. Likewise, in [140] several open-ended questions are posed regarding the (yet unexplained) generalization capability of neural networks, despite their possible algorithmic instability, non-robustness, and sharp minima.

Generalization Bounds on Regularized Graphs: Another line of work concerns generalization bounds on regularized graphs in transductive settings [141–144]. Of most interest to us is [141], where the authors provide theoretical guarantees for the generalization error based on Laplacian regularization, also derived using the notion of algorithmic stability. Their generalization estimate is inversely proportional to the second smallest eigenvalue of the graph Laplacian. Unfortunately, this estimate may not yield a desirable guarantee, as the second smallest eigenvalue depends on both the graph structure and its size; it is in general difficult to remove this dependency via normalization. In contrast, our estimates are directly proportional to the largest absolute eigenvalue (or the largest singular value of an asymmetric graph filter), which can easily be made independent of the graph size by performing appropriate Laplacian normalization.

Graph Convolution Neural Networks: The development of GCNNs can be traced back to graph signal processing [51], cast in terms of learning filter parameters of the graph Fourier transform [47, 48]. Various GCNN models have since been proposed [50, 52, 53, 99, 100, 145, 146] that mainly attempt to improve the basic GCNN model along two aspects: 1) enhancing the graph convolution operation by developing novel graph filters; and 2) designing appropriate graph pooling operations. For instance, [94] employs complex graph filters via Cayley instead of Chebyshev polynomials, whereas [101] introduces b-splines as a basis for the filtering operation instead of the graph Laplacian. Similarly, [99] parameterizes graph filters using a residual Laplacian matrix, and the authors of [102] simply use polynomials of the adjacency matrix. Random walk and quantum walk based graph convolutions have also been proposed recently [100, 145, 146]. The authors of [104, 105] have applied graph convolution to large graphs. In terms of graph pooling operations, pre-computed graph coarsening layers obtained via the Graclus multilevel clustering algorithm are employed in [49], while a differentiable pooling operation that can generate hierarchical representations of a graph is developed in [103]. In [55–58], message passing neural networks (MPNNs) are developed, which can be viewed as equivalent to GCNN models, as the underlying notion of the graph convolution operation is essentially the same.

6.3 Graph Capsule & Graph Convolution Neural Networks

Notations: Let G = (V, E, A) be a graph where V is the vertex set, E the edge set and A the adjacency matrix, with N = |V| the graph size. We define the standard graph Laplacian L ∈ R^{N×N} as L = D − A, where D is the degree matrix. We define a graph filter g(L) ∈ R^{N×N} as a function of the graph Laplacian L or a normalized (using D) version of it. Let U Λ U^T be the eigendecomposition of L, with Λ = diag[λ_i] the diagonal matrix of L's eigenvalues. Then g(L) = U g(Λ) U^T, and its eigenvalues are λ_i^(g) = g(λ_i), 1 ≤ i ≤ N. We define λ_G^max = max_i {|λ_i^(g)|}, referred to as the largest absolute eigenvalue^1 of the graph filter g(L). Let P be the number of instantiation parameters of GCAPS, and let m be the number of training samples, with m ≤ N.

Let X ∈ R^{N×D} be a node feature matrix (D is the input dimension) and θ ∈ R^D the learning parameters. With a slight abuse of notation, we will represent both a node (index) in a graph G and its feature values by x ∈ R^D. N(x) denotes the set of neighbor indices at most 1-hop distance away from node x (including x itself); here the 1-hop neighbors are determined using the g(L) filter matrix. Finally, G_x represents the ego-graph extracted at node x from G.

1 This definition is valid when the graph filter g(L) is symmetric or the matrix is normal. More generally, λ_G^max is defined as the largest singular value of g(L).

To derive generalization guarantees of GCAPS and GCNN (or GNN) models based on algorithmic stability analysis, we adopt the strategy devised in [87]. It relies on bounding the output difference of a loss function under a single data point perturbation. As stated earlier, there exist several different notions of algorithmic stability [87, 131]. In this chapter, we focus on the strong notion of uniform stability (see Definition 1). We get started by presenting the GCNN model.

Single Layer GCAPS & GCNN (Full Graph View): The output function of a single layer GCAPS or GCNN model, over all graph nodes together, can be written in compact matrix form as follows,

f(X, θ) = σ( Σ_{p=1}^{P} g(L) (X ⊙ ... ⊙ X) θ_p )   (GCAPS, with p element-wise factors X ⊙ ... ⊙ X)

f(X, θ) = σ( g(L) X θ )   (GCNN)    (6.1)

where g(L) is a graph filter. Some commonly used graph filters are a linear function of A, g(L) = A + I [95] (here I is the identity matrix), or a Chebyshev polynomial of L [49].

Single Layer GCAPS & GCNN (Ego-Graph View): We will work with the notion of an ego-graph for each node (extracted from G), as it contains the complete information needed to compute the output of a single layer GCAPS or GCNN model. We can rewrite Equation (6.1) for a single node prediction as,

f(x, θ) = σ( Σ_{p=1}^{P} Σ_{j∈N(x)} e_{·j} (x_j^p)^T θ_p )   (GCAPS)

f(x, θ) = σ( Σ_{j∈N(x)} e_{·j} x_j^T θ )   (GCNN)    (6.2)

where e_{·j} ∈ R, given by [g(L)]_{·j}, is the weighted edge value between node x and its neighbor x_j, and j ∈ N(x) if and only if e_{·j} ≠ 0. The size of an ego-graph depends upon g(L). We assume that the filters are localized to 1-hop neighbors, but our analysis is applicable to k-hop neighbors. For further notational clarity, we will consider the case D = 1, and thus f(x, θ_S) = σ( Σ_{j∈N(x)} e_{·j} x_j θ_S ). Our analysis holds for the general D-dimensional case.
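
The equivalence of the two views for a single-layer GCNN (D = 1) is easy to check numerically; the graph, filter and parameter values below are arbitrary illustrative choices.

```python
# Sketch checking that the ego-graph view in Equation (6.2) matches the full-graph view
# in Equation (6.1) for a single-layer GCNN with scalar node features (D = 1).
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
N = A.shape[0]
gL = A + np.eye(N)                      # graph filter g(L) = A + I
x = rng.standard_normal(N)              # scalar feature per node (D = 1)
theta = 0.7
sigma = np.tanh

# Full-graph view: f(X, theta) = sigma(g(L) X theta), computed for all nodes at once.
full = sigma(gL @ x * theta)

# Ego-graph view: each node only needs its weighted 1-hop neighborhood entries e_{xj}.
ego = np.empty(N)
for i in range(N):
    nbrs = np.nonzero(gL[i])[0]                       # N(x), including x itself
    ego[i] = sigma(np.sum(gL[i, nbrs] * x[nbrs]) * theta)

assert np.allclose(full, ego)
```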

6.4 Main Result

The main result of the chapter is stated in Theorem 11, which provides a bound on the generalization gap for single layer GCNN models. This gap is defined as the difference between the generalization error R(·) and empirical error Remp(·) (see definitions in Section 6.5).

Theorem 11 (GCAPS Generalization Gap). Let A_S be a single layer GCAPS model equipped with the graph convolution filter g(L) and P instantiation parameters, trained on a dataset S using the SGD algorithm for T iterations. Let the loss and activation functions be Lipschitz-continuous and smooth. Then the following expected generalization gap holds with probability at least 1 − δ, with δ ∈ (0, 1),

E_sgd[R(A_S) − R_emp(A_S)] ≤ O( (1/m) P (λ_G^max)^{2T} ) + O( ( P (λ_G^max)^{2T} + M ) √(log(1/δ)/(2m)) )

where the expectation E_sgd is taken over the randomness inherent in SGD, m is the number of training samples and M is a constant depending on the loss function.

Corollary 11.1 (GCNN Generalization Gap). Let A_S be a single layer GCNN model equipped with the graph convolution filter g(L), and trained on a dataset S using the SGD algorithm for T iterations. Let the loss and activation functions be Lipschitz-continuous and smooth. Then the following expected generalization gap holds with probability at least 1 − δ, with δ ∈ (0, 1),

E_sgd[R(A_S) − R_emp(A_S)] ≤ O( (1/m) (λ_G^max)^{2T} ) + O( ( (λ_G^max)^{2T} + M ) √(log(1/δ)/(2m)) )

where the expectation E_sgd is taken over the randomness inherent in SGD, m is the number of training samples and M is a constant depending on the loss function.

Remarks: Theorem 11 establishes a key connection between the generalization gap and the graph filter eigenvalues. A GCNN model is uniformly stable if the bound converges to zero as m → ∞. In particular, we see that if λ_G^max is independent of the graph size, the generalization gap decays at the rate of O(1/√m), yielding the tightest bound possible. Theorem 11 thus sheds light on the design of stable graph filters with generalization guarantees.
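
The following small sketch illustrates the remark numerically: for star graphs of growing size, the largest absolute eigenvalue of the unnormalized Laplacian grows with N, whereas that of the symmetric normalized adjacency stays bounded by 1. The star graph is simply a convenient worst-case example.

```python
# Largest absolute eigenvalue of an unnormalized vs. a normalized graph filter as the
# graph grows: the unnormalized Laplacian scales with N, the normalized adjacency does not.
import numpy as np
import networkx as nx

for n in [10, 100, 1000]:
    G = nx.star_graph(n - 1)                                 # hub plus (n-1) leaves
    A = nx.to_numpy_array(G)
    L = np.diag(A.sum(1)) - A                                # unnormalized Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    print(n,
          round(np.abs(np.linalg.eigvalsh(L)).max(), 2),       # grows like N
          round(np.abs(np.linalg.eigvalsh(A_norm)).max(), 2))  # stays at 1
```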

Proof Strategy: We need to tackle several technical challenges in order to obtain the generalization bound in Theorem 11.

1. Analyzing GCAPS & GCNN Stability w.r.t. Graph Convolution: We analyze the stability of the graph convolution function under a single data point perturbation. For this purpose, we separate and bound the contribution of the weight parameters from that of the graph convolution operation in the GCAPS & GCNN output function.

2. Analyzing GCAPS & GCNN Stability w.r.t. the SGD Algorithm: GCAPS & GCNNs employ the randomized stochastic gradient descent (SGD) algorithm for optimizing the weight parameters. Thus, we need to bound the difference in the expected value over the learned weight parameters under a single data point perturbation and establish stability bounds. For this, we analyze the uniform stability of SGD in the context of GCAPS & GCNNs. We adopt the same strategy as in [138] to obtain uniform stability of GCAPS & GCNN models, but with fewer assumptions compared with the general case [138].

6.5 Preliminaries

Basic Setup: Let X and Y each be a subset of a Hilbert space and define Z = X × Y. We define X as the input space and Y as the output space. Let x ∈ X , y ∈ Y ⊂ R and

S be a training set S = {z_1 = (x_1, y_1), z_2 = (x_2, y_2), ..., z_m = (x_m, y_m)}. We introduce two more notations below. Removing the i-th data point from the set S is represented as,

S^{\i} = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_m}

Replacing the i-th data point in S by z_i′ is represented as,

S^i = {z_1, ..., z_{i−1}, z_i′, z_{i+1}, ..., z_m}

General Data Sampling Process: Let D denote an unknown distribution from which

data points {z_1, ..., z_m} are sampled to form a training set S. Throughout the chapter, we assume all samples (including the replacement sample) are i.i.d. unless mentioned otherwise. Let E_S[f] denote the expectation of the function f when m samples are drawn from D to form the training set S. Likewise, let E_z[f] denote the expectation of the function f when z is sampled according to D.

Graph Node Sampling Process: At first, it may not be clear how to describe the sampling procedure of nodes from a graph G in the context of GCAPS & GCNNs for performing semi-supervised learning. For our purpose, we consider the ego-graph formed by the 1-hop neighbors of each node as a single data point. This ego-graph is necessary and sufficient to compute the single layer GCAPS & GCNN output as shown in Equation (6.2). We assume node data points are sampled in an i.i.d. fashion by first choosing a node x and then extracting its neighbors from G to form an ego-graph.
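A minimal sketch of this sampling view (illustrative function name; a dense filter matrix is assumed) extracts, for every node, the edge weights, neighbor features and label that fully determine its single layer output:

    import numpy as np

    def ego_graph_samples(gL, X, y):
        # Each node's 1-hop ego-graph is treated as one data point: the nonzero entries
        # of row i of g(L) select exactly the neighbors entering Equation (6.2) for node i.
        samples = []
        for i in range(gL.shape[0]):
            nbrs = np.nonzero(gL[i])[0]
            samples.append((gL[i, nbrs], X[nbrs], y[i]))   # (edge weights, neighbor features, label)
        return samples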

Generalization Error: Let A_S be a learning algorithm trained on a dataset S; A_S is defined as a function from Z^m to (Y)^X. For GCAPS & GCNNs, we set A_S = f(x, θ_S). Then the generalization error or risk R(A_S) with respect to a loss function ℓ is defined as,

R(A_S) := E_z[ ℓ(A_S, z) ] = ∫ ℓ(A_S, z) p(z) dz.

Empirical Error: The empirical risk R_emp(A_S) is defined as,

R_emp(A_S) := (1/m) ∑_{j=1}^{m} ℓ(A_S, z_j).

Generalization Gap: When AS is a randomized algorithm, we consider the expected generalization gap as shown below,

ϵgen := EA[R(AS) − Remp(AS)].

Here the expectation EA is taken over the inherent randomness of AS. For instance, most learning algorithms employ Stochastic Gradient descent (SGD) to learn the weight parameters. SGD introduces randomness due to the random order it uses to choose samples for batch processing. In our analysis, we only consider randomness in AS due to SGD and ignore the randomness introduced by parameter initialization. Hence, we will replace EA with Esgd.

Uniform Stability of a Randomized Algorithm: For a randomized algorithm, uniform stability is defined as follows,

Definition 1 (Uniform Stability). A randomized learning algorithm AS is βm−uniformly stable with respect to a loss function ℓ, if it satisfies,

sup_{S,z} | E_A[ ℓ(A_S, z) ] − E_A[ ℓ(A_{S^{\i}}, z) ] | ≤ β_m

For our convenience, we will work with the following definition of uniform stability,

sup_{S,z} | E_A[ ℓ(A_S, z) ] − E_A[ ℓ(A_{S^i}, z) ] | ≤ 2β_m,

which follows immediately from the triangle inequality,

sup_{S,z} | E_A[ ℓ(A_S, z) ] − E_A[ ℓ(A_{S^i}, z) ] | ≤ sup_{S,z} | E_A[ ℓ(A_S, z) ] − E_A[ ℓ(A_{S^{\i}}, z) ] | + sup_{S,z} | E_A[ ℓ(A_{S^i}, z) ] − E_A[ ℓ(A_{S^{\i}}, z) ] |.

Remarks: Uniform stability imposes an upper bound on the difference in losses due to a removal (or change) of a single data point from the set (of size m) for all possible combinations of S, z. Here, β_m is a function of m (the number of training samples). Note that there is a subtle difference between Definition 1 above and the uniform stability of randomized algorithms defined in [130] (see Definition 13 in [130]). The authors in [130] are concerned with random elements associated with the cost function such as those induced by bootstrapping, bagging or the initialization process. However, we focus on the randomness due to the learning procedure, i.e., SGD.

Stability Guarantees: A randomized learning algorithm with uniform stability yields the following bound on the generalization gap:

Theorem 12 (Stability Guarantees). A uniformly stable randomized algorithm (A_S, β_m) with a bounded loss function 0 ≤ ℓ(A_S, z) ≤ M satisfies the following generalization bound with probability at least 1 − δ over the random draw of S, z, with δ ∈ (0, 1),

E_A[ R(A_S) − R_emp(A_S) ] ≤ 2β_m + ( 4mβ_m + M ) √( log(1/δ) / (2m) ).

Proof: The proof of Theorem 12 mirrors the corresponding result in [87] (shown there for deterministic learning algorithms). For the sake of completeness, we include the proof in the Appendix, based on our definition of uniform stability, sup_{S,z} | E_A[ ℓ(A_S, z) ] − E_A[ ℓ(A_{S^i}, z) ] | ≤ 2β_m.

Remarks: The generalization bound is meaningful only if it converges to 0 as m → ∞. This occurs when β_m decays faster than O(1/√m); otherwise, the generalization gap does not approach zero as m → ∞. Furthermore, the generalization gap yields the tightest bounds when β_m decays at O(1/m), which is the most stable state possible for a learning algorithm.

σ-Lipschitz Continuous and Smooth Activation Function: Our bounds hold for all activation functions which are Lipschitz-continuous and smooth. An activation function σ(x) is Lipschitz-continuous if |∇σ(x)| ≤ α_σ, or equivalently, |σ(x) − σ(y)| ≤ α_σ|x − y|. We further require σ(x) to be smooth, namely, |∇σ(x) − ∇σ(y)| ≤ ν_σ|x − y|. This assumption is stricter but necessary for establishing the strong notion of uniform stability. Some common activation functions satisfying the above conditions are ELU (with α = 1), Sigmoid, and Tanh.
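As a quick numerical sanity check (on a finite grid, so the constants are only approximate), the ELU activation with α = 1 satisfies both conditions with α_σ ≈ 1 and ν_σ ≈ 1:

    import numpy as np

    def elu(x, a=1.0):
        return np.where(x > 0, x, a * (np.exp(x) - 1.0))

    def elu_grad(x, a=1.0):
        return np.where(x > 0, 1.0, a * np.exp(x))

    x = np.linspace(-6.0, 6.0, 2001)
    alpha_sigma = np.max(np.abs(elu_grad(x)))                      # Lipschitz constant, approx. 1
    nu_sigma = np.max(np.abs(np.diff(elu_grad(x))) / np.diff(x))   # smoothness constant, approx. 1
    print(alpha_sigma, nu_sigma)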

ℓ-Lipschitz Continuous and Smooth Loss Function: We also assume that the loss function is Lipschitz-continuous and smooth,

| ℓ(f(·), y) − ℓ(f′(·), y) | ≤ α_ℓ | f(·) − f′(·) |,
and | ∇ℓ(f(·), y) − ∇ℓ(f′(·), y) | ≤ ν_ℓ | ∇f(·) − ∇f′(·) |.

Unlike in [138], we define Lipschitz-continuity with respect to the function argument rather than the weight parameters, a relatively weak assumption.

6.6 Uniform Stability of GCAPS & GCNN Models

The crux of our main result relies on showing that GCNN models are uniformly stable as stated in Theorem 13 below.

Theorem 13 (GCAPS & GCNN Uniform Stability). Let the loss & activation be Lipschitz-continuous and smooth functions. Then a single layer GCAPS & GCNN model trained using the SGD algorithm for T iterations is βm−uniformly stable, where

β_m ≤ ( P η α_ℓ α_σ ν_ℓ (λ_G^max)^2 ∑_{t=1}^{T} ( 1 + η ν_ℓ ν_σ (λ_G^max)^2 )^{t−1} ) / m.

Remarks: Plugging the bound on β_m into Theorem 12 yields the main result of our chapter. Setting P = 1 gives β_m for the GCNN model.
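To see how sensitive the stability bound is to the filter spectrum, the small sketch below evaluates the right-hand side of Theorem 13 numerically; the function name is illustrative and the constants α_ℓ, α_σ, ν_ℓ, ν_σ and η are all set to 1 purely for illustration.

    import numpy as np

    def beta_m_bound(lam_max, m, T, P=1, eta=1.0, a_l=1.0, a_s=1.0, n_l=1.0, n_s=1.0):
        # beta_m <= P * eta * a_l * a_s * n_l * lam^2 * sum_{t=1}^T (1 + eta*n_l*n_s*lam^2)^(t-1) / m
        lam2 = lam_max ** 2
        geom = sum((1.0 + eta * n_l * n_s * lam2) ** (t - 1) for t in range(1, T + 1))
        return P * eta * a_l * a_s * n_l * lam2 * geom / m

    for lam in (0.5, 1.0, 2.0):
        print(lam, beta_m_bound(lam, m=1000, T=10))   # the bound grows rapidly with lam_max

Even with all constants set to 1, the bound grows by orders of magnitude as λ_G^max moves from below one to two, which foreshadows the filter comparison in Section 6.7.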

Before we proceed to prove this theorem, we first explain what is meant by training a single layer GCAPS & GCNN using SGD on datasets S and S^i which differ in one data point, following the same line of reasoning as in [138]. Let Z = {z_1, ..., z_t, ..., z_T} be a sequence of samples, where z_t is an i.i.d. sample drawn from S at the t-th iteration of SGD during a training run of the model2. Training the same model using SGD on S^i means that we supply the same sample sequence except that, if z_t = (x_i, y_i) for some t (1 ≤ t ≤ T), we replace it with z_t′ = (x_i′, y_i′), where i is the (node) index at which S and S^i differ. We denote this sample sequence by Z′. Let {θ_{S,0}, θ_{S,1}, ..., θ_{S,T}} and {θ_{S^i,0}, θ_{S^i,1}, ..., θ_{S^i,T}} denote the corresponding sequences of weight parameters learned by running SGD on S and S^i, respectively. Since the parameter initialization is kept the same, θ_{S,0} = θ_{S^i,0}. In addition, if k is the first time that the sample sequences Z and Z′ differ, then θ_{S,t} = θ_{S^i,t} at each step t before k, and at the k-th and subsequent steps, θ_{S,t} and θ_{S^i,t} may diverge. The key to establishing the uniform stability of a GCAPS & GCNN model is to bound the difference in losses when training using SGD on S vs. S^i. As stated earlier in the proof strategy, we proceed in two steps.

2 One way to generate the sample sequence is to choose a node index i_t uniformly at random from the set {1, ..., m} at each step t. Alternatively, one can first choose a random permutation of {1, ..., m} and then process the samples accordingly. Our analysis holds for both cases.

Proof Part I (Single Layer GCAPS & GCNN Bound): We first bound the expected loss by separating the factors due to the graph convolution operation vs. the expected difference in the filter weight parameters learned via SGD on two datasets S and Si.

Let θ_S and θ_{S^i} represent the final GCAPS & GCNN filter weights learned on the training sets S and S^i, respectively. Define Δθ = θ_S − θ_{S^i}. Using the fact that the loss is Lipschitz-continuous, together with |E[x]| ≤ E[|x|], we have,

E_sgd[ |ℓ(A_S, y) − ℓ(A_{S^i}, y)| ] ≤ α_ℓ E_sgd[ |f(x, θ_S) − f(x, θ_{S^i})| ]
  ≤ α_ℓ E_sgd[ | σ( ∑_{p=1}^{P} ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S} ) − σ( ∑_{p=1}^{P} ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i} ) | ]
  ≤ α_ℓ E_sgd[ ∑_{p=1}^{P} | σ( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S} ) − σ( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i} ) | ]   (6.3)

Since the activation function is also Lipschitz-continuous,

  ≤ α_ℓ E_sgd[ ∑_{p=1}^{P} | ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S} − ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i} | ]
  ≤ α_ℓ E_sgd[ ∑_{p=1}^{P} | ∑_{j∈N(x)} e_{·j} x_j^p | | θ_{p,S} − θ_{p,S^i} | ]   (6.4)
  ≤ α_ℓ ∑_{p=1}^{P} ( | ∑_{j∈N(x)} e_{·j} x_j^p | ) E_sgd[ |Δθ_p| ]
  ≤ α_ℓ ∑_{p=1}^{P} g_{p,λ} E_sgd[ |Δθ_p| ]

where g_{p,λ} is defined as g_{p,λ} := sup_x | ∑_{j∈N(x)} e_{·j} x_j^p |. We will bound g_{p,λ} in terms of the largest absolute eigenvalue of the graph convolution filter g(L) later. Note that ∑_{j∈N(x)} e_{·j} x_j is nothing but a graph convolution operation; as such, reducing g_{p,λ} will be the contributing factor in improving the generalization performance.

Proof Part II (SGD Based Bounds For GCAPS & GCNN Weights): What remains is to bound Esgd[|∆θ|] due to the randomness inherent in SGD. This is proved through a series of three lemmas. We first note that on a given training set S, a GCAPS & GCNN minimizes the following objective function,

min_θ L( f(x, θ_S), y ) = (1/m) ∑_{i=1}^{m} ℓ( f(x_i, θ_S), y_i )   (6.5)

For this, at each iteration t, SGD performs the following update:

θ_{S,t+1} = θ_{S,t} − η ∇ℓ( f(x_{i_t}, θ_{S,t}), y_{i_t} )   (6.6)

where η > 0 is the learning rate.
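For the D = 1 case, a single SGD update of Equation (6.6) on one ego-graph sample reduces to a few lines. The squared loss and tanh activation below are illustrative stand-ins for any loss/activation pair satisfying our Lipschitz and smoothness assumptions, and all names are hypothetical.

    import numpy as np

    def sgd_step(theta, edge_w, x_nbrs, y, eta=1.0, sigma=np.tanh,
                 dsigma=lambda z: 1.0 - np.tanh(z) ** 2):
        # One update theta <- theta - eta * grad ell(f(x, theta), y) for the D = 1 model
        # f(x, theta) = sigma(s * theta) with s = sum_j e_{.j} x_j (Equation (6.2)).
        s = edge_w @ x_nbrs                    # graph convolution over the ego-graph
        z = s * theta
        grad = (sigma(z) - y) * dsigma(z) * s  # derivative of the squared loss 0.5*(f - y)^2
        return theta - eta * grad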

Given the two sequences of weight parameters, {θ_{S,0}, θ_{S,1}, ..., θ_{S,T}} and {θ_{S^i,0}, θ_{S^i,1}, ..., θ_{S^i,T}}, learned by the GCAPS & GCNN running SGD on S and S^i, respectively, we first find a bound on |Δθ_t| := |θ_{S,t} − θ_{S^i,t}| at each iteration step t of SGD.

There are two scenarios to consider. 1) At step t, SGD picks a sample z_t = (x, y) which is identical in Z and Z′; this occurs with probability (m − 1)/m. From Equation (6.6), we have |Δθ_{t+1}| ≤ |Δθ_t| + η | ∇ℓ( f(x, θ_{S,t}), y ) − ∇ℓ( f(x, θ_{S^i,t}), y ) |. We bound this term in Lemma 2 below. 2) At step t, SGD picks the only samples at which Z and Z′ differ, z_t = (x_i, y_i) and z_t′ = (x_i′, y_i′); this occurs with probability 1/m. Then |Δθ_{t+1}| ≤ |Δθ_t| + η | ∇ℓ( f(x_i, θ_{S,t}), y_i ) − ∇ℓ( f(x_i′, θ_{S^i,t}), y_i′ ) |. We bound the second term in Lemma 3 below.

Lemma 2 (GCAPS & GCNN Same Sample Loss Stability Bound). The difference in the loss derivatives of (single-layer) GCAPS & GCNN models trained with the SGD algorithm for T iterations on two training datasets S and S^i, respectively, with respect to the same sample is bounded by,

| ∇_p ℓ( f(x, θ_{S,t}), y ) − ∇_p ℓ( f(x, θ_{S^i,t}), y ) | ≤ ν_ℓ ν_σ g_{p,λ}^2 |Δθ_{p,t}|

Proof: The first order derivative of the single-layer GCAPS & GCNN output function with respect to the p-th parameter is given by,

∂f(x, θ)/∂θ_p = σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_p ) ∑_{j∈N(x)} e_{·j} x_j^p,   (6.7)

where σ′(·) is the first order derivative of the activation function.

Using Equation (6.7) and the fact that the loss function is Lipschitz-continuous and smooth, we have,

| ∂ℓ( f(x, θ_{S,t}), y )/∂θ_p − ∂ℓ( f(x, θ_{S^i,t}), y )/∂θ_p | ≤ ν_ℓ | ∂f(x, θ_{S,t})/∂θ_p − ∂f(x, θ_{S^i,t})/∂θ_p |
  ≤ ν_ℓ | σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S,t} ) ∑_{j∈N(x)} e_{·j} x_j^p − σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i,t} ) ∑_{j∈N(x)} e_{·j} x_j^p |
  ≤ ν_ℓ | ∑_{j∈N(x)} e_{·j} x_j^p | | σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S,t} ) − σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i,t} ) |

Since the activation function is Lipschitz-continuous and smooth, and plugging | ∑_{j∈N(x)} e_{·j} x_j^p | ≤ g_{p,λ}, we get,

  ≤ ν_ℓ ν_σ g_{p,λ} | ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S,t} − ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S^i,t} |
  ≤ ν_ℓ ν_σ g_{p,λ} ( | ∑_{j∈N(x)} e_{·j} x_j^p | | θ_{p,S,t} − θ_{p,S^i,t} | )   (6.8)
  ≤ ν_ℓ ν_σ g_{p,λ}^2 |Δθ_{p,t}|

This completes the proof of Lemma 2.

Note: Without the σ-smoothness assumption, it would not be possible to derive the above bound in terms of |Δθ_t|, which is necessary for showing uniform stability. Unfortunately, this constraint excludes the ReLU activation from our analysis.

Lemma 3 (GCAPS & GCNN Different Sample Loss Stability Bound). The difference in the loss derivatives of (single-layer) GCAPS & GCNN models trained with the SGD algorithm for T iterations on two training datasets S and S^i, respectively, with respect to the differing samples is bounded by,

| ∇_p ℓ( f(x_i, θ_{S,t}), y_i ) − ∇_p ℓ( f(x_i′, θ_{S^i,t}), y_i′ ) | ≤ 2 ν_ℓ α_σ g_{p,λ}

Proof: Again using Equation (6.7), the fact that the loss & activation functions are Lipschitz-continuous and smooth, and that for any a, b, |a − b| ≤ |a| + |b|, we have,

| ∂ℓ( f(x, θ_{S,t}), y )/∂θ_p − ∂ℓ( f(x̃, θ_{S^i,t}), ỹ )/∂θ_p | ≤ ν_ℓ | ∂f(x, θ_{S,t})/∂θ_p − ∂f(x̃, θ_{S^i,t})/∂θ_p |
  ≤ ν_ℓ | σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S,t} ) ∑_{j∈N(x)} e_{·j} x_j^p − σ′( ( ∑_{j∈N(x̃)} e_{·j} x̃_j^p ) θ_{p,S^i,t} ) ∑_{j∈N(x̃)} e_{·j} x̃_j^p |
  ≤ ν_ℓ | σ′( ( ∑_{j∈N(x)} e_{·j} x_j^p ) θ_{p,S,t} ) | | ∑_{j∈N(x)} e_{·j} x_j^p | + ν_ℓ | σ′( ( ∑_{j∈N(x̃)} e_{·j} x̃_j^p ) θ_{p,S^i,t} ) | | ∑_{j∈N(x̃)} e_{·j} x̃_j^p |

Using the fact that the first order derivative of the activation is bounded (|σ′(·)| ≤ α_σ) and that | ∑_{j∈N(x)} e_{·j} x_j^p | ≤ g_{p,λ},

  ≤ 2 ν_ℓ α_σ g_{p,λ}

This completes the proof of Lemma 3.

Summing over all iteration steps, and taking expectations over all possible sample sequences Z, Z′ from S and S^i, we have

Lemma 4 (GCAPS & GCNN SGD Stability Bound). Let the loss & activation functions be Lipschitz-continuous and smooth. Let θ_{S,T} and θ_{S^i,T} denote the graph filter parameters of (single-layer) GCAPS & GCNN models trained using SGD for T iterations on two training datasets S and S^i, respectively. Then the expected difference in the filter parameters is bounded by,

E_sgd[ |θ_{p,S,T} − θ_{p,S^i,T}| ] ≤ ( 2 η ν_ℓ α_σ g_{p,λ} / m ) ∑_{t=1}^{T} ( 1 + η ν_ℓ ν_σ g_{p,λ}^2 )^{t−1}

Proof: From Equation (6.6), and taking into account the probabilities of the two scenarios considered in Lemma 2 and Lemma 3 at step t, we have,

E_sgd[ |Δθ_{p,t+1}| ] ≤ ( 1 − 1/m ) E_sgd[ | ( θ_{p,S,t} − η ∇_p ℓ( f(x, θ_{S,t}), y ) ) − ( θ_{p,S^i,t} − η ∇_p ℓ( f(x, θ_{S^i,t}), y ) ) | ]
    + ( 1/m ) E_sgd[ | ( θ_{p,S,t} − η ∇_p ℓ( f(x̃, θ_{S,t}), ỹ ) ) − ( θ_{p,S^i,t} − η ∇_p ℓ( f(x̂, θ_{S^i,t}), ŷ ) ) | ]
  ≤ ( 1 − 1/m ) E_sgd[ |Δθ_{p,t}| ] + ( 1 − 1/m ) η E_sgd[ | ∇_p ℓ( f(x, θ_{S,t}), y ) − ∇_p ℓ( f(x, θ_{S^i,t}), y ) | ]
    + ( 1/m ) E_sgd[ |Δθ_{p,t}| ] + ( 1/m ) η E_sgd[ | ∇_p ℓ( f(x̃, θ_{S,t}), ỹ ) − ∇_p ℓ( f(x̂, θ_{S^i,t}), ŷ ) | ]
  = E_sgd[ |Δθ_{p,t}| ] + ( 1 − 1/m ) η E_sgd[ | ∇_p ℓ( f(x, θ_{S,t}), y ) − ∇_p ℓ( f(x, θ_{S^i,t}), y ) | ]
    + ( 1/m ) η E_sgd[ | ∇_p ℓ( f(x̃, θ_{S,t}), ỹ ) − ∇_p ℓ( f(x̂, θ_{S^i,t}), ŷ ) | ]   (6.9)

Plugging the bounds from Lemma 2 and Lemma 3 into Equation (6.9), we have,

E_sgd[ |Δθ_{p,t+1}| ] ≤ E_sgd[ |Δθ_{p,t}| ] + ( 1 − 1/m ) η ν_ℓ ν_σ g_{p,λ}^2 E_sgd[ |Δθ_{p,t}| ] + ( 1/m ) 2 η ν_ℓ α_σ g_{p,λ}
  = ( 1 + ( 1 − 1/m ) η ν_ℓ ν_σ g_{p,λ}^2 ) E_sgd[ |Δθ_{p,t}| ] + 2 η ν_ℓ α_σ g_{p,λ} / m
  ≤ ( 1 + η ν_ℓ ν_σ g_{p,λ}^2 ) E_sgd[ |Δθ_{p,t}| ] + 2 η ν_ℓ α_σ g_{p,λ} / m

Lastly, solving the first order recursion for E_sgd[ |Δθ_{p,t}| ] yields,

E_sgd[ |Δθ_{p,T}| ] ≤ ( 2 η ν_ℓ α_σ g_{p,λ} / m ) ∑_{t=1}^{T} ( 1 + η ν_ℓ ν_σ g_{p,λ}^2 )^{t−1}

This completes the proof of Lemma 4.

Bound on g_{p,λ}: We now bound g_{p,λ} in terms of the largest absolute eigenvalue of the graph filter matrix g(L). We first note that, at each node x, the ego-graph G_x can be represented as a sub-matrix of g(L). Let g_x(L) ∈ R^{q×q} be the submatrix of g(L) whose row and column indices are from the set {j ∈ N(x)}; the ego-graph size is q = |N(x)|. We use h_{x,p} ∈ R^q to denote the p-th moment graph signal (node features) on the ego-graph G_{x,p}. Without loss of generality, we will assume that node x is represented by index 0 in G_{x,p}. Thus, we can compute ∑_{j∈N(x)} e_{·j} x_j^p = [g_x(L) h_{x,p}]_0, a scalar value.

Here [·]0 ∈ R represents the value of a vector at index 0, i.e., corresponding to node x.

Then the following holds (assuming the graph signals are normalized, i.e., ∥hx,1∥2 = 1),

| [g_x(L) h_{x,p}]_0 | ≤ ∥ g_x(L) h_{x,p} ∥_2 ≤ ∥ g_x(L) ∥_2 ∥ h_{x,p} ∥_2 ≤ ∥ g_x(L) ∥_2 ∥ h_{x,1} ∥_2 = λ_{G_x}^max   (6.10)

where the third inequality follows from the fact that ∑_i |x_i^p| ≤ ∑_i |x_i|^2 ≤ 1 for p ≥ 2, since ∥x∥_2 = 1; as a result, the norm inequality ( ∑_i |x_i^p| )^{1/2} ≤ ( ∑_i |x_i|^2 )^{1/2} ≤ 1 holds. Also, ∥M∥_2 = sup_{∥x∥_2=1} ∥Mx∥_2 = σ_max(M) is the matrix operator norm, where σ_max(M) is the largest singular value of the matrix M. For a normal matrix M (such as a symmetric graph filter g(L)), σ_max(M) = max |λ(M)|, the largest absolute eigenvalue of M.

Lemma 5 (Ego-Graph Eigenvalue Bound). Let G = (V, E) be an (un)directed graph with (either symmetric or non-negative) weighted adjacency matrix g(L) and let λ_G^max be the maximum absolute eigenvalue of g(L). Let G_x be the ego-graph of a node x ∈ V with corresponding maximum absolute eigenvalue λ_{G_x}^max. Then the following eigenvalue (singular value) bound holds for all x,

λ_{G_x}^max ≤ λ_G^max

Proof: Notice that g_x(L) is the adjacency matrix of G_x, which also happens to be a principal submatrix of g(L). As a result, the above bound follows from the eigenvalue interlacing theorem for normal/Hermitian matrices and their principal submatrices [147, 148].
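The interlacing argument is easy to verify numerically. The sketch below (random symmetric data, unnormalized filter A + I, all names illustrative) confirms that no 1-hop ego-graph submatrix exceeds the spectral bound of the full filter.

    import numpy as np

    rng = np.random.default_rng(0)
    A = (rng.random((30, 30)) < 0.15).astype(float)
    A = np.triu(A, 1); A = A + A.T
    gL = A + np.eye(30)                                  # symmetric filter g(L) = A + I
    lam_G = np.max(np.abs(np.linalg.eigvalsh(gL)))       # lambda_G^max of the full graph
    for i in range(gL.shape[0]):
        nbrs = np.nonzero(gL[i])[0]                      # node i's ego-graph (self-loop included)
        gx = gL[np.ix_(nbrs, nbrs)]                      # principal submatrix g_x(L)
        assert np.max(np.abs(np.linalg.eigvalsh(gx))) <= lam_G + 1e-9   # Lemma 5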

Finally, plugging g_{p,λ} ≤ λ_G^max and Lemma 4 into the bound obtained in Proof Part I yields the remaining result,

2β_m ≤ α_ℓ λ_G^max ∑_{p=1}^{P} E_sgd[ |Δθ_p| ]

β_m ≤ ( P η α_ℓ α_σ ν_ℓ (λ_G^max)^2 ∑_{t=1}^{T} ( 1 + η ν_ℓ ν_σ (λ_G^max)^2 )^{t−1} ) / m

β_m ≤ O( (1/m) P (λ_G^max)^2 T )   ∀ T ≥ 1

This completes the full proof of Theorem 13.

6.7 Revisiting GCAPS & GCNN Model Architecture

In this section, we discuss the implications of our results for designing graph convolution filters and revisit the importance of employing batch-normalization layers in GCAPS & GCNN networks.

Unnormalized Graph Filters: One of the most popular graph convolution filters is g(L) = A + I [95]. The eigenspectrum of the unnormalized A is bounded by the maximal node degree, i.e., λ_G^max ≤ degree_G^max. This is concerning, as g_λ is now bounded by O(N), and as m approaches N, β_m tends towards O(N^c) complexity with c ≥ 0. As a result, the generalization gap of such a GCNN model is not guaranteed to converge.

Normalized Graph Filters: Numerical instabilities with the unnormalized adjacency matrix have already been suspected in [50]. Therefore, the symmetric normalized graph filter has been adopted: g(L) = D^{−1/2} A D^{−1/2} + I. The eigenspectrum of D^{−1/2} A D^{−1/2} is bounded between [−1, 1]. As a result, such a GCNN model is uniformly stable (assuming that the graph features are also normalized appropriately, e.g., ∥x∥_2 = 1).

Random Walk Graph Filters: Another widely used graph filter is based on random walks: g(L) = D^{−1} A + I [100]. The eigenvalues of D^{−1} A lie in [−1, 1], so those of g(L) are spread out in the interval [0, 2]; thus such a GCAPS & GCNN model is uniformly stable.
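The contrast between these three filters is easy to see empirically. The sketch below (random graph, illustrative only) computes λ_G^max for each filter; only the unnormalized A + I grows with the node degrees.

    import numpy as np

    rng = np.random.default_rng(1)
    A = (rng.random((200, 200)) < 0.05).astype(float)
    A = np.triu(A, 1); A = A + A.T
    deg = np.maximum(A.sum(axis=1), 1.0)
    D_inv, D_inv_sqrt = np.diag(1.0 / deg), np.diag(1.0 / np.sqrt(deg))
    filters = {
        "A + I (unnormalized)": A + np.eye(200),
        "D^-1/2 A D^-1/2 + I (normalized)": D_inv_sqrt @ A @ D_inv_sqrt + np.eye(200),
        "D^-1 A + I (random walk)": D_inv @ A + np.eye(200),
    }
    for name, gL in filters.items():
        lam = np.max(np.abs(np.linalg.eigvals(gL)))   # eigvals: the random-walk filter is not symmetric
        print(f"{name}: lambda_G^max = {lam:.2f}")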

Importance of Batch-Normalization in GCAPS & GCNN: Recall that g_{p,λ} = sup_x | ∑_{j∈N(x)} e_{·j} x_j^p |, and notice that in Equation (6.10) we assume that the graph signals are normalized in order to bound g_{p,λ}. This can easily be accomplished by normalizing the features during the data pre-processing phase for a single layer GCAPS & GCNN. However, for a multi-layer GCAPS & GCNN, the intermediate feature outputs are not guaranteed to be normalized. Thus, to ensure stability, it is crucial to employ batch-normalization layers in GCAPS & GCNN models. This has already been reported in [95] as an important factor for keeping the GCNN outputs stable.

[Figure 6.1 appears here: three panels, one per dataset, plotting the generalization gap (y-axis) against training epochs (x-axis).]

Figure 6.1: The above figures show the generalization gap for the three datasets. The generalization gap is measured with respect to the loss function, i.e., |training error − test error|. In this experiment, the cross-entropy loss is used.

6.8 Experiment and Results

In this section, we empirically evaluate the effect of graph filters on the GCNN stability bounds using four different GCNN filters. We employ three citation network datasets: Citeseer, Cora and Pubmed (see [50] for details about the datasets).

Experimental Setup: We extract 1-hop ego-graphs of each node in a given dataset to create samples and normalize the node graph features such that ∥x∥_2 = 1 in the data pre-processing step. We run the SGD algorithm with a fixed learning rate η = 1 and a batch size of 1 for 100 epochs on all datasets. We employ ELU (with α = 1) as the activation function and cross-entropy as the loss function.

[Figure 6.2 appears here: three panels, one per dataset, plotting the parameter norm difference (y-axis) against training epochs (x-axis).]

Figure 6.2: The above figures show the divergence in the weight parameters of a single layer GCNN measured using the L2-norm on the three datasets. We surgically alter one sample point at index i = 0 in the training set S to generate S^i and run the SGD algorithm.

Measuring Generalization Gap: In this experiment, we quantitatively measure the generalization gap, defined as the absolute difference between the training and test errors. From Figure 6.1, it is clear that unnormalized graph convolution filters such as g(L) = A + I show a significantly higher generalization gap than normalized ones such as D^{−1/2} A D^{−1/2} or random walk based filters such as g(L) = D^{−1} A + I. The results hold consistently across the three datasets. We note that the generalization gap becomes constant after a certain number of iterations. While this phenomenon is not reflected in our bounds, it can plausibly be explained by considering variable bounding parameters (as a function of the SGD iterations). This hints at the pessimistic nature of our bounds.
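The following sketch outlines how the generalization-gap curve of Figure 6.1 can be produced for a single layer GCNN. It is a simplified stand-in, using a squared loss and tanh instead of the cross-entropy/ELU pair actually used, and all function and variable names are hypothetical.

    import numpy as np

    def train_and_track_gap(gL, X, y, train_idx, test_idx, epochs=100, eta=1.0, seed=0):
        # Train node-by-node with SGD (batch size 1, fixed eta) and record
        # |training error - test error| after every epoch.
        rng = np.random.default_rng(seed)
        theta = rng.standard_normal(X.shape[1]) * 0.01

        def node_pred(i, th):
            nbrs = np.nonzero(gL[i])[0]
            return np.tanh(gL[i, nbrs] @ X[nbrs] @ th)

        def error(idx, th):
            return np.mean([(node_pred(i, th) - y[i]) ** 2 for i in idx])

        gaps = []
        for _ in range(epochs):
            for i in rng.permutation(train_idx):
                nbrs = np.nonzero(gL[i])[0]
                s = gL[i, nbrs] @ X[nbrs]
                z = s @ theta
                theta -= eta * (np.tanh(z) - y[i]) * (1 - np.tanh(z) ** 2) * s
            gaps.append(abs(error(train_idx, theta) - error(test_idx, theta)))
        return gaps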

Measuring GCNN Learned Filter-Parameter Stability Under the SGD Optimizer: In this experiment, we evaluate the difference between the learned weight parameters of two single layer GCNN models trained on datasets S and S^i which differ in precisely one sample point. We generate S^i by surgically altering one sample point in S at node index i = 0. For this experiment, we initialize the GCNN models on both datasets with the same parameters and random seeds, and then run the SGD algorithm. After each epoch, we measure the L2-norm difference between the weight parameters of the respective models. From Figure 6.2, it is evident that for the unnormalized graph convolution filters, the weight parameters tend to deviate by a large amount and therefore the network is less stable, while for the normalized graph filters the norm difference converges quickly to a fixed value. These empirical observations are reinforced by our stability bounds. However, the decreasing trend in the norm difference after a certain number of iterations, before convergence, remains unexplained due to the pessimistic nature of our bounds.
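Similarly, the weight-divergence curves of Figure 6.2 can be reproduced in outline by training two copies of the model with identical initialization and sample order but one surgically altered sample. Again, the squared loss, tanh activation and all names below are illustrative stand-ins rather than the exact experimental code.

    import numpy as np

    def parameter_divergence(gL, X, y, y_altered, train_idx, epochs=100, eta=1.0, seed=0):
        # y_altered differs from y at a single training index, mirroring S vs. S^i.
        rng = np.random.default_rng(seed)
        theta_S = rng.standard_normal(X.shape[1]) * 0.01
        theta_Si = theta_S.copy()                      # identical initialization

        def step(th, i, labels):
            nbrs = np.nonzero(gL[i])[0]
            s = gL[i, nbrs] @ X[nbrs]
            z = s @ th
            return th - eta * (np.tanh(z) - labels[i]) * (1 - np.tanh(z) ** 2) * s

        order = np.random.default_rng(seed + 1)
        diffs = []
        for _ in range(epochs):
            for i in order.permutation(train_idx):     # identical sample sequence for both runs
                theta_S = step(theta_S, i, y)
                theta_Si = step(theta_Si, i, y_altered)
            diffs.append(np.linalg.norm(theta_S - theta_Si))
        return diffs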

6.9 Conclusion

We have taken the first steps towards establishing a deeper theoretical understanding of GCNN models by analyzing their stability and establishing their generalization guarantees. More specifically, we have shown that the algorithmic stability of GCNN models depends upon the largest absolute eigenvalue of their graph convolution filters. To ensure uniform stability and thereby generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new insights on the design of new & improved graph convolution filters with guaranteed algorithmic stability. Furthermore, applying our results to existing GCNN models, we provide a theoretical justification for the importance of employing the batch-normalization process in a GCNN architecture. We have also conducted empirical evaluations based on real world datasets which support our theoretical findings. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models.

Chapter 7

Conclusion and Future Directions

In this final chapter, we present the thesis conclusion and further discuss open problems and future directions in the graph embedding area.

7.1 Conclusion

First, we presented a conceptually simple yet powerful and theoretically motivated graph embedding called Fgsd. In particular, our graph embedding, based on the discovery of the family of graph spectral distances, exhibits uniqueness, stability and sparsity, and is computationally fast. Moreover, our hunt specifically leads to the harmonic and, next to it, the biharmonic distance as ideal members of this family for extracting graph features. Finally, our extensive results show that Fgsd based graph features are powerful enough to dominate the unlabeled graph classification task over more sophisticated algorithms, and competitive enough to yield high classification accuracy on labeled data even without utilizing any node labels.

Second, we proposed a novel Graph Capsule Network (GCAPS-CNN) model based on the fundamental capsule idea to address some of the basic weaknesses of existing GCNN models. Our graph capsule network model by design captures more local structure information than traditional GNNs and can provide much richer information about individual graph nodes or the whole graph. For our purpose, we employ a capsule function that preserves statistical moment information, since such moments are fast to compute. Furthermore, we propose a novel permutation invariant layer based on computing covariance in our GCAPS-CNN architecture to deal with the graph classification problem, which most GNN models find challenging. This covariance can again be computed in a fast manner and has been shown to be better than adopting an aggregation or max-sort pooling layer. On top of that, we also equip our GCAPS-CNN model with Fgsd features explicitly to capture more global information in the absence of node features. This is essential to consider, since non-deep GCNN models are not capable enough to exploit global information implicitly. Finally, we show the superior performance of GCAPS-CNN on many bioinformatics and social network datasets in comparison with existing deep learning methods as well as strong graph kernels, setting the current state-of-the-art. Our general idea of the graph capsule is quite rich and can be taken to another level by designing more sophisticated capsule functions that are capable of preserving more information in a local pool.

Third, we built a powerful universal graph embedding neural network architecture with three carefully designed components that can learn task-independent graph embeddings in an unsupervised fashion. In particular, the universal graph encoder component can be re-utilized across different datasets by leveraging transfer learning, and the decoder component incorporates various graph kernels to capture rich sub-structural properties and enable multi-task learning. Through extensive experiments and ablation studies on benchmark graph classification datasets, we show that our proposed DUGnn model can significantly outperform both existing state-of-the-art graph neural network models and graph kernel methods. This demonstrates the benefit of combining the power of graph neural networks in the design of a universal graph encoder with that of graph kernels in the design of a multi-task graph decoder.

Lastly, we have taken the first steps towards establishing a deeper theoretical understanding of GCAPS & GCNN models by analyzing their stability and establishing their generalization guarantees. More specifically, we have shown that the algorithmic stability of GCAPS & GCNN models depends upon the largest absolute eigenvalue of their graph convolution filters. To ensure uniform stability and thereby generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new insights on the design of new & improved graph convolution filters with guaranteed algorithmic stability. Furthermore, applying our results to existing GCAPS & GCNN models, we provide a theoretical justification for the importance of employing the batch-normalization process in a GCAPS & GCNN architecture. We have also conducted empirical evaluations based on real world datasets which support our theoretical findings.

7.2 Open Problems and Future Directions

Future Work in Graph Kernel or Spectrum: Like Laplacian Eigenmaps, whose eigenvectors converge to the eigenfunctions of the Laplacian operator (or Diffusion Maps, whose eigenvectors are discrete approximations of the eigenfunctions of the Fokker-Planck operator), it would be interesting to understand the asymptotic behavior of the family of graph spectral distances Fgsd. There is also the matter of addressing a theoretical concern, raised in [149], about adopting the harmonic distance as a graph spectrum, in regard to their claim that the resistance distance is meaningless for large graphs. However, our results on large graphs (thousands of nodes) tell a different story, perhaps because their analysis is applicable only to random and very large graphs. Another main task would be to integrate node label information explicitly into the computation of Fgsd. While we can always use a shortest-path strategy to include labeled data when computing the graph kernel, this approach does not account for the structure around which the node label sits. This is also supported by the fact that on a few datasets, using graph node labels degrades the classification accuracy for some algorithms; one should instead try to combine the node label information with the graph sub-structures around it (as in the multiscale Laplacian graph kernel [69]) for better performance. Learning the optimal binwidth is also on the agenda, but the challenge lies in bringing down the optimization cost so that it can scale to large graphs.

Future Work in Graph Neural Networks: Our general idea of the graph capsule is quite rich and can be taken to another level by designing more sophisticated capsule functions that are capable of preserving more information in a local pool. In the future, one can investigate various other capsule functions, such as polynomial coefficients (as instantiation parameters), which come with theoretical guarantees. Another avenue of investigation is performing kernel density estimation in an end-to-end deep learning manner; understanding its theoretical significance would also be beneficial. One can also explore the alternative approach of managing the graph capsule vector dimension as discussed in [73]. Lastly, universal and transferable models such as the proposed DUGnn are only trained on limited real-world graph datasets. One can train such models on much larger datasets, possibly using synthetic graphs as a data augmentation technique, which opens up a whole lot of new directions to explore given the ability to synthetically generate arbitrary graphs.

Future Theory Work in Graph Neural Networks: Our current work is limited to a single layer GNN and needs to be extended to multi-layer GNN models. For a multi-layer GNN, one needs to bound the difference in weights at each layer according to the back-propagation algorithm. Therefore, the main technical challenge is to study the stability of the full-fledged back-propagation algorithm. Furthermore, it would be fruitful to study the stability and generalization properties of non-localized convolutional filters designed based on rational polynomials of the graph Laplacian, which so far remains uncharted territory.

Bibliography

[1] Saurabh Verma and Zhi-Li Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pages 87–97, 2017.

[2] Saurabh Verma and Zhi-Li Zhang. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090, 2018.

[3] Saurabh Verma and Zhi-Li Zhang. Learning universal graph neural network em- beddings with aid of transfer learning, 2019.

[4] Saurabh Verma and Zhi-Li Zhang. Stability and generalization of graph convolu- tional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1539–1548, 2019.

[5] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Ad- vances in neural information processing systems, pages 3111–3119, 2013.

[6] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[7] Douwe Kiela and Léon Bottou. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45, 2014.

[8] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pages 1889–1897, 2014.

[9] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 353–362. ACM, 2016.

[10] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto- encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[13] Arvind Narayanan, Saurabh Verma, and Zhi-Li Zhang. Mining latent patterns in geomobile data via epic. World Wide Web, 22(6):2771–2798, 2019.

[14] Saurabh Verma, Manjula Shivanna, Gyana Ranjan Dash, and Antonio Nucci. Spatio-temporal anomaly detection in computer networks using graph convolu- tional recurrent neural networks (gcrnns), October 10 2019. US Patent App. 15/949,198.

[15] Saurabh Verma, Gyana R Dash, Shamya Karumbaiah, Arvind Narayanan, Man- jula Shivanna, Sujit Biswas, and Antonio Nucci. Neural network-assisted com- puter network management, June 27 2019. US Patent App. 15/855,781.

[16] Saurabh Verma and Zhi-Li Zhang. Deep universal graph embedding neural network. arXiv preprint arXiv:1909.10086, 2019.

[17] Saurabh Agrawal, Saurabh Verma, Anuj Karpatne, Stefan Liess, Snigdhansu Chatterjee, and Vipin Kumar. A fast-optimal guaranteed algorithm for learning sub-interval relationships in time series. arXiv preprint arXiv:1906.01450, 2019.

[18] Arvind Narayanan, Saurabh Verma, Eman Ramadan, Pariya Babaie, and Zhi-Li Zhang. Making content caching policies’ smart’using the deepcache framework. ACM SIGCOMM Computer Communication Review, 48(5):64–69, 2019.

[19] Arvind Narayanan, Saurabh Verma, Eman Ramadan, Pariya Babaie, and Zhi- Li Zhang. Deepcache: A deep learning based framework for content caching. In Proceedings of the 2018 Workshop on Network Meets AI & ML, pages 48–53. ACM, 2018.

[20] Saurabh Agrawal, Saurabh Verma, Gowtham Atluri, Anuj Karpatne, Stefan Liess, Angus Macdonald III, Snigdhansu Chatterjee, and Vipin Kumar. Mining sub- interval relationships in time series data. arXiv preprint arXiv:1802.06095, 2018.

[21] Saurabh Verma, Arvind Narayanan, and Zhi-Li Zhang. Multi-low-rank approxi- mation for traffic matrices. In 2017 29th International Teletraffic Congress (ITC 29), volume 1, pages 72–80. IEEE, 2017.

[22] Jun Ho Huh, Saurabh Verma, Swathi Sri V Rayala, Rakesh B Bobba, Konstantin Beznosov, and Hyoungshick Kim. I don’t use apple pay because it’s less secure...: perception of security and usability in mobile tap-and-pay. In Proceedings of the Workshop on Usable Security (USEC), volume 12, 2017.

[23] Arvind Narayanan, Saurabh Verma, and Zhi-Li Zhang. Most calls are local (but some are regional): Dissecting cellular communication patterns. In 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2016.

[24] Saurabh Verma, Ali Hamieh, Jun Ho Huh, Henrik Holm, Siva Raj Rajagopalan, Maciej Korczynski, and Nina Fefferman. Stopping amplified dns ddos attacks through distributed query rate sharing. In 2016 11th International Conference on Availability, Reliability and Security (ARES), pages 69–78. IEEE, 2016.

[25] Arvind Narayanan, Saurabh Verma, and Zhi-Li Zhang. Mining spatial-temporal geomobile data via feature distributional similarity graph. In Proceedings of the First Workshop on Mobile Data, pages 13–18. ACM, 2016.

[26] Saurabh Verma and Estevam R Hruschka. Coupled bayesian sets algorithm for semi-supervised learning and information extraction. In Joint European Confer- ence on Machine Learning and Knowledge Discovery in Databases, pages 307–322. Springer, 2012.

[27] Shalini Pandey and George Karypis. A self-attentive model for knowledge tracing. arXiv preprint arXiv:1907.06837, 2019.

[28] Sahil Gupta, Shalini Pandey, and KK Shukla. Comparison analysis of link predic- tion algorithms in social network. International Journal of Computer Applications, 111(16):27–29, 2015.

[29] Shalini Pandey and George Karypis. Structured dictionary learning for energy disaggregation. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, pages 24–34, 2019.

[30] Varun Mithal, Ankush Khandelwal, Guruprasad Nayak, Vipin Kumar, Ramakr- ishna Nemani, and Nikunj Oza. A spatio-temporal data mining approach to global scale burned area monitoring. In AGU Fall Meeting Abstracts, 2014.

[31] Nikunj Oza, Vipin Kumar, Ramakrishna Nemani, Shyam Boriah, Kamalika Das, Ankush Khandelwal, Brian Matthews, Andrew Michaelis, Varun Mithal, Gu- ruprasad Nayak, et al. Integrating parallel and distributed data mining algorithms into the nasa earth exchange (nex). In AGU Fall Meeting Abstracts, 2014.

[32] Varun Mithal, Guruprasad Nayak, Ankush Khandelwal, Vipin Kumar, Nikunj Oza, and Ramakrishna Nemani. Global monitoring of tropical forest fires using a new predictive modeling approach for rare classes. In AGU Fall Meeting Abstracts, 2015.

[33] Vipin Kumar, Varun Mithal, Guruprasad Nayak, and Ankush Khandelwal. Classification of highly-skewed data, October 27 2016. US Patent App. 15/137,603.

[34] Guruprasad Nayak, Varun Mithal, and Vipin Kumar. Multiple instance learning for bags with ordered instances. 2017.

[35] Xiaowei Jia, Ankush Khandelwal, Guruprasad Nayak, James Gerber, Kimberly Carlson, Paul West, and Vipin Kumar. Incremental dual-memory lstm in land cover prediction. In Proceedings of the 23rd ACM SIGKDD International Confer- ence on Knowledge Discovery and Data Mining, pages 867–876. ACM, 2017.

[36] Xiaowei Jia, Ankush Khandelwal, Guruprasad Nayak, James Gerber, Kimberly Carlson, Paul West, and Vipin Kumar. Predict land covers with transition mod- eling and incremental learning. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 171–179. SIAM, 2017.

[37] Varun Mithal, Guruprasad Nayak, Ankush Khandelwal, Vipin Kumar, Nikunj C Oza, and Ramakrishna Nemani. Rapt: Rare class prediction in absence of true labels. IEEE Transactions on Knowledge and Data Engineering, 29(11):2484–2497, 2017.

[38] Varun Mithal, Guruprasad Nayak, Ankush Khandelwal, Vipin Kumar, Ramakr- ishna Nemani, and Nikunj Oza. Mapping burned areas in tropical forests using a novel machine learning framework. Remote Sensing, 10(1):69, 2018.

[39] Guruprasad Nayak, Varun Mithal, Xiaowei Jia, and Vipin Kumar. Classifying multivariate time series by learning sequence-level discriminative patterns. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 252–260. SIAM, 2018.

[40] Guruprasad Nayak, Sourav Dutta, Deepak Ajwani, et al. Automated knowledge hierarchy assessment. In The Second Workshop on Knowledge Graphs and Seman- tics for Text Retrieval, Analysis, and Understanding (KG4IR), Michigan, USA, 12 July 2018, volume 2127, pages 59–60. CEUR Workshop Proceedings, 2018.

[41] Guruprasad Nayak, Sourav Dutta, Deepak Ajwani, Patrick Nicholson, and Alessandra Sala. Automated assessment of knowledge hierarchy evolution: comparing directed acyclic graphs. Information Retrieval Journal, 2019.

[42] Xiaowei Jia, Guruprasad Nayak, Ankush Khandelwal, Anuj Karpatne, and Vipin Kumar. Classifying heterogeneous sequential data by cyclic domain adaptation: An application in land cover detection. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 540–548. SIAM, 2019.

[43] Xiaowei Jia, Sheng Li, Ankush Khandelwal, Guruprasad Nayak, Anuj Karpatne, and Vipin Kumar. Spatial context-aware networks for mining temporal discrimina- tive period in land cover detection. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 513–521. SIAM, 2019.

[44] Guruprasad Nayak, Rahul Ghosh, Xiaowei Jia, Varun Mithal, and Vipin Kumar. Spatio-temporal classification at multiple resolutions using multi-view regulariza- tion. In 2019 IEEE International Conference on (Big Data). IEEE, 2019.

[45] Guruprasad Nayak, Rahul Ghosh, Xiaowei Jia, Varun Mithal, and Vipin Ku- mar. Semi-supervised classification using attention-based regularization on coarse- resolution data. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020.

[46] Guruprasad Nayak. Learning with Weak Supervision for Land Cover Mapping Problems. PhD thesis, University of Minnesota, 2020.

[47] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral net- works and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[48] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[49] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3837–3845, 2016.

[50] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[51] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

[52] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.

[53] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Tim- othy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.

[54] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolu- tional neural networks for graphs. In Proceedings of the 33rd annual international conference on machine learning. ACM, 2016.

[55] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.

[56] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

[57] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.

[58] Alberto García-Durán and Mathias Niepert. Learning graph representations with embedding propagation. arXiv preprint arXiv:1710.03059, 2017.

[59] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, 2017.

[60] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In AAAI, pages 4438–4445, 2018.

[61] Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.

[62] Nataša Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.

[63] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, volume 5, pages 488–495, 2009.

[64] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between labeled graphs. In ICML, volume 3, pages 321–328, 2003.

[65] Karsten M Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE, 2005.

[66] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

[67] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

[68] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. Graph invariant kernels. In IJCAI, pages 3756–3762, 2015.

[69] Risi Kondor and Horace Pan. The multiscale laplacian graph kernel. In Advances in Neural Information Processing Systems, pages 2982–2990, 2016.

[70] Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, and Karsten Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pages 216–224, 2013.

[71] Gerben KD de Vries. A fast approximation of the weisfeiler-lehman graph kernel for rdf data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 606–621. Springer, 2013.

[72] Marion Neumann, Novi Patricia, Roman Garnett, and Kristian Kersting. Effi- cient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer, 2012.

[73] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859–3869, 2017.

[74] László Babai. Graph isomorphism in quasipolynomial time. CoRR, abs/1512.03547, 2015.

[75] Risi Kondor. A complete set of rotationally and translationally invariant features for images. CoRR, abs/cs/0701127, 2007.

[76] Risi Kondor and Karsten M Borgwardt. The skew spectrum of graphs. In Pro- ceedings of the 25th international conference on Machine learning, pages 496–503. ACM, 2008.

[77] Risi Kondor, Nino Shervashidze, and Karsten M Borgwardt. The graphlet spec- trum. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 529–536. ACM, 2009.

[78] Joseph Rosenblatt and Paul D Seymour. The structure of homometric sets. SIAM Journal on Algebraic Discrete Methods, 3(3):343–350, 1982.

[79] Yaron Lipman, Raif M Rustamov, and Thomas A Funkhouser. Biharmonic dis- tance. ACM Transactions on Graphics (TOG), 29(3):27, 2010.

[80] Edmund Taylor Whittaker and George Neville Watson. A course of modern anal- ysis. Cambridge university press, 1996.

[81] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

[82] Louis W Shapiro. An electrical lemma. Mathematics Magazine, 60(1):36–38, 1987.

[83] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduc- tion and data representation. Neural computation, 15(6):1373–1396, 2003.

[84] Boaz Nadler, Stephane Lafon, Ronald Coifman, and Ioannis Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In NIPS, pages 955–962, 2005.

[85] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, 2016.

[86] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for net- works. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

[87] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[88] Mario Botsch, David Bommes, and Leif Kobbelt. Efficient linear system solvers for mesh processing. In Mathematics of Surfaces XI, pages 62–83. Springer, 2005.

[89] Vasilios N Katsikis, Dimitrios Pappas, and Athanassios Petralias. An improved method for the computation of the moore–penrose inverse matrix. Applied Math- ematics and Computation, 217(23):9828–9834, 2011.

[90] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer, 2003.

[91] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector ma- chines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[92] Xiaohan Zhao, Bo Zong, Ziyu Guan, Kai Zhang, and Wei Zhao. Substructure assembling network for graph classification. 2018.

[93] Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.

[94] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664, 2017.

[95] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

[96] Risi Kondor and Tony Jebara. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 361– 368, 2003.

[97] Jaekoo Lee, Hyunjae Kim, Jongsun Lee, and Sungroh Yoon. Transfer learning for deep learning on graph-structured data. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[98] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convo- lutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018.

[99] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convo- lutional neural networks. arXiv preprint arXiv:1801.03226, 2018.

[100] Gilles Puy, Srdan Kitic, and Patrick Pérez. Unifying local and non-local signal processing with graph cnns. arXiv preprint arXiv:1702.07759, 2017.

[101] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.

[102] Felipe Petroski Such, Shagan Sah, Miguel Alexander Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan D Cahill, and Raymond Ptucha. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(6):884–896, 2017.

[104] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[105] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.

[106] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.

[107] Antoine J-P Tixier, Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Graph classification with 2d convolutional neural networks. 2018.

[108] Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.

[109] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In In- ternational Conference on Machine Learning, pages 5694–5703, 2018.

[110] Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

[111] Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473, 2018.

[112] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

[113] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. Mgae: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 889–898. ACM, 2017.

[114] Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang. Adversarially regularized graph autoencoder for graph embedding. In IJCAI, pages 2609–2615, 2018.

[115] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. In Proceedings of the 35th International Conference on Machine Learning, pages 2186–2195, 2018.

[116] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V Lilienfeld, and Klaus- Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 440–448, 2012.

[117] Johan Paratte and Lionel Martin. Fast eigenspace approximation using random signals. arXiv preprint arXiv:1611.00938, 2016.

[118] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.

[119] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[120] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[121] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature communications, 8:13890, 2017.

[122] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.

[123] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional net- works. In European Semantic Web Conference, pages 593–607. Springer, 2018.

[124] Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. Exploring graph-structured passage representation for multi-hop reading compre- hension with graph neural networks. arXiv preprint arXiv:1809.02040, 2018.

[125] Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 3d graph neural networks for rgbd semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5199–5208, 2017.

[126] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

[127] David Haussler. Probably approximately correct learning. University of California, Santa Cruz, Computer Research Laboratory, 1990.

[128] Shivani Agarwal and Partha Niyogi. Stability and generalization of bipartite ranking algorithms. In International Conference on Computational Learning Theory, pages 32–47. Springer, 2005.

[129] Shivani Agarwal and Partha Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10(Feb):441–474, 2009.

[130] André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(Jan):55–79, 2005.

[131] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193, 2006.

[132] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963, 2016.

[133] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[134] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural net- works. In Conference on Learning Theory, pages 907–940, 2016.

[135] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.

[136] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.

[137] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[138] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[139] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[140] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[141] Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In International Conference on Computational Learning Theory, pages 624–638. Springer, 2004.

[142] Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability of transductive regression algorithms. In Proceedings of the 25th international conference on Machine learning, pages 176–183. ACM, 2008.

[143] Rie K Ando and Tong Zhang. Learning on graph with laplacian regularization. In Advances in neural information processing systems, pages 25–32, 2007.

[144] Shiliang Sun, Zakria Hussain, and John Shawe-Taylor. Manifold-preserving graph reduction for sparse semi-supervised learning. Neurocomputing, 124:13–21, 2014.

[145] Stefan Dernbach, Arman Mohseni-Kabir, Siddharth Pal, and Don Towsley. Quantum walk neural networks for graph-structured data. In International Workshop on Complex Networks and their Applications, pages 182–193. Springer, 2018.

[146] Zhihong Zhang, Dongdong Chen, Jianjia Wang, Lu Bai, and Edwin R Hancock. Quantum-based subgraph convolutional neural networks. Pattern Recognition, 88:38–49, 2019.

[147] Thomas J Laffey and Helena Šmigoc. Spectra of principal submatrices of nonnegative matrices. Linear Algebra and its Applications, 428(1):230–238, 2008.

[148] Willem H Haemers. Interlacing eigenvalues and graphs. Linear Algebra and its applications, 226:593–616, 1995.

[149] Ulrike V Luxburg, Agnes Radl, and Matthias Hein. Getting lost in space: Large sample analysis of the resistance distance. In Advances in Neural Information Processing Systems, pages 2622–2630, 2010.

[150] P. Langley. Crafting papers on machine learning. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.

[151] T. M. Mitchell. The need for biases in learning generalizations. Technical report, Computer Science Department, Rutgers University, New Brunswick, NJ, 1980.

[152] M. J. Kearns. Computational Complexity of Machine Learning. PhD thesis, Department of Computer Science, Harvard University, 1989.

[153] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors. Machine Learning: An Artificial Intelligence Approach, Vol. I. Tioga, Palo Alto, CA, 1983.

[154] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2nd edition, 2000.

[155] N. N. Author. Suppressed for anonymity, 2011.

[156] A. Newell and P. S. Rosenbloom. Mechanisms of skill acquisition and the law of practice. In J. R. Anderson, editor, Cognitive Skills and Their Acquisition, chapter 1, pages 1–51. Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, 1981.

[157] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):211–229, 1959.

[158] Nathan D Monnig and Francois G Meyer. The resistance perturbation distance: A metric for the analysis of dynamic networks. arXiv preprint arXiv:1605.01091, 2016.

[159] Eric W Weisstein. Resistance-equivalent graphs. from mathworld–a wolfram web resource.

[160] James R Bunch, Christopher P Nielsen, and Danny C Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978.

[161] Giuseppe Patané. Laplacian spectral distances and kernels on 3d shapes. Pattern Recognition Letters, 47:102–110, 2014.

[162] Giuseppe Patané. Accurate and efficient computation of laplacian spectral distances and kernels. In Computer Graphics Forum. Wiley Online Library, 2016.

[163] Nino Shervashidze and Karsten M Borgwardt. Fast subtree kernels on graphs. In NIPS, pages 1660–1668, 2009.

[164] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.

[165] Michael O Albertson, Janos Pach, and Michael E Young. Disjoint homometric sets in graphs. Ars Mathematica Contemporanea, 4(1), 2011.

[166] William N Anderson Jr and Thomas D Morley. Eigenvalues of the laplacian of a graph. Linear and multilinear algebra, 18(2):141–145, 1985.

[167] Bin Dong. Sparse representation on graphs by tight wavelet frames and applica- tions. Applied and Computational Harmonic Analysis, 2015.

[168] Ming Zhong and Hong Qin. Sparse approximation of 3d shapes via spectral graph wavelets. The Visual Computer, 30(6-8):751–761, 2014.

[169] VY Pan, F Soleymani, and Liang Zhao. Highly efficient computation of generalized inverse of a matrix. arXiv preprint arXiv:1604.07893, 2016.

[170] Vasilios N Katsikis and Dimitrios Pappas. Fast computing of the moore-penrose inverse matrix. Electronic Journal of Linear Algebra, 17(1):637–650, 2008.

[171] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.

[172] Fabrizio Costa and Kurt De Grave. Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 26th International Conference on Machine Learning, pages 255–262. Omnipress, 2010.

[173] André Elisseeff, Massimiliano Pontil, et al. Leave-one-out error and stability of learning algorithms with applications. NATO science series sub series iii computer and systems sciences, 190:111–130, 2003.

[174] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.

[175] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

[176] Alexander Rakhlin, Sayan Mukherjee, and Tomaso Poggio. Stability results in learning theory. Analysis and Applications, 3(04):397–417, 2005.

[177] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems 13, pages 196–202, Cambridge, MA, USA, April 2001. Max-Planck-Gesellschaft, MIT Press.

[178] Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.

[179] Corinna Cortes, Mehryar Mohri, and Ashish Rastogi. Magnitude-preserving ranking algorithms. In Proceedings of the 24th international conference on Machine learning, pages 169–176. ACM, 2007.

[180] Luc Devroye and T Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.

[181] André Elisseeff. A study about algorithmic stability and their relation to generalization performances. 2000.

[182] Ryan Michael Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. PhD thesis, Massachusetts Institute of Technology, 2002.

[183] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[184] Dongsheng Li, Chao Chen, Qin Lv, Junchi Yan, Li Shang, and Stephen Chu. Low-rank matrix approximation with stability. In Proc. of, volume 951, page 16, 2016.

[185] Ben London, Bert Huang, Ben Taskar, and Lise Getoor. Collective stability in structured prediction: Generalization from one example. In ICML (3), pages 828–836, 2013.

[186] Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. arXiv preprint arXiv:1211.6158, 2012.

[187] Andrew Y Ng and H Jin Kim. Stable adaptive control with online learning.

[188] Yu-Xiang Wang, Jing Lei, and Stephen E Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of erm principle. Journal of Machine Learning Research, 17(183):1–40, 2016.

[189] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.

[190] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[191] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.

[192] Cătălina Cangea, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287, 2018.

Chapter 8

Appendix

8.1 Chapter 3: Appendix

8.1.1 Proof of Theorem 1

Notations: Let $G = (V, E, W)$ be an undirected graph with vertex set $V$, edge set $E$, and $W$ a (symmetric) nonnegative weighted adjacency (affinity) matrix. In particular, $W_{ij} = W(i, j) > 0$ if and only if $(i, j) \in E$. We assume $G$ is connected in all proofs unless mentioned otherwise. The (standard) graph Laplacian is defined as $L = D - W$, where $D = \mathrm{diag}[d_1, \ldots, d_N]$ and $d_i = d(i) = \sum_{j \neq i} W(i, j)$ is the (weighted) degree of node $i$. $L$ is positive semi-definite and admits an eigenvalue decomposition of the form $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}[\lambda_k]$ is the diagonal matrix formed by the eigenvalues $\lambda_0 = 0 < \lambda_1 \leq \cdots \leq \lambda_{N-1}$, and $\Phi = [\phi_0, \ldots, \phi_{N-1}]$ is an orthogonal matrix formed by the corresponding eigenvectors $\phi_k$. For $x \in V$, we use $\phi_k(x)$ to denote the $x$-entry of $\phi_k$. Let $f$ be an arbitrary nonnegative (real-analytic) function on $\mathbb{R}^+$ with $f(0) = 0$. Then $f(L)$ is defined as $f(L) = \Phi f(\Lambda) \Phi^T$. In addition, let $\mathbf{1} = [1, 1, \ldots, 1]^T = \sqrt{N}\,\phi_0$ be the all-one column vector and $J = \mathbf{1}\mathbf{1}^T = N \phi_0 \phi_0^T$. For any matrix $M$, we use $\mathrm{diag}(M)$ to denote the diagonal matrix consisting of the diagonal entries of $M$, namely $\mathrm{diag}(M) = \mathrm{diag}[M_{00}, M_{11}, \ldots, M_{N-1,N-1}]$. For conciseness, we drop the subscript $f$ when the context is clear.

Proof: We first note that the graph f-spectral distance matrix $S_f = [S_f(x, y)]$ can be expressed in matrix form using $f(L)$ as follows:

$$S_f = \mathrm{diag}\big(\Phi f(\Lambda)\Phi^T\big)\,J + J\,\mathrm{diag}\big(\Phi f(\Lambda)\Phi^T\big) - 2\,\Phi f(\Lambda)\Phi^T = \mathrm{diag}(f(L))\,J + J\,\mathrm{diag}(f(L)) - 2 f(L) \qquad (8.1)$$

Assume graphs $G_1$ and $G_2$ are isomorphic and do not contain self-loops. Then their respective adjacency matrices are equal up to permutation ($W_1 = P W_2 P^T$), which also implies the equality of their Laplacian matrices up to permutation ($L_1 = P L_2 P^T$). Then, we can specify $S_{G_1}$ in terms of $S_{G_2}$ as follows,

$$P S_{G_2} P^T = P\,\mathrm{diag}(f(L_2))\,J P^T + P J\,\mathrm{diag}(f(L_2))\,P^T - 2 P f(L_2) P^T = \mathrm{diag}\big(P f(L_2) P^T\big) J + J\,\mathrm{diag}\big(P f(L_2) P^T\big) - 2 P f(L_2) P^T \qquad (8.2)$$

Since $f$ only operates on the eigenvalues of a matrix and $P$ only permutes the rows of the $\Phi$ matrix, we have $f(L_1) = P f(L_2) P^T$. This proves that $S_{G_1} = P S_{G_2} P^T$ for any two isomorphic graphs. The above result also holds for $L_{norm}$. It now remains to show that the claim also holds in the opposite direction. More specifically, we prove that one can uniquely recover the Laplacian matrix $L$ from the f-spectral distance matrix $S_f$ as follows:

Since the eigenvectors of the symmetric $L$ are orthogonal to each other and to $\mathbf{1}$, we have

$$f(L)\,\mathbf{1} = \sum_{k=0}^{N-1} f(\lambda_k)\,\phi_k \phi_k^T\,\mathbf{1} = 0, \quad \text{since } f(\lambda_0) = 0 \qquad (8.3)$$

Similarly, we have $\mathbf{1}^T f(L) = 0$. Using Eq. 8.1 and Eq. 8.3, we can explicitly derive $f(L)$ in terms of $S_f$ as,

$$f(L) = -\frac{1}{2}\left(S_f - \frac{1}{N}\big(S_f J + J S_f\big) + \frac{1}{N^2}\,J S_f J\right) \qquad (8.4)$$

This shows that we can uniquely recover f(L) from Sf . Further since f(λ) is a bijective function, we can also uniquely recover the eigenvector matrix and eigenvalues of L from f(L).

Finally, we show that for two given graph f-spectral distance matrices $S_1 = S_{G_1}$ and $S_2 = S_{G_2}$ such that $S_2 = P S_1 P^T$ for some permutation matrix $P$, we have $L_2 = P L_1 P^T$.

$$P f(L_1) P^T = P\left(-\frac{1}{2}\Big(S_1 - \frac{1}{N}(S_1 J + J S_1) + \frac{1}{N^2} J S_1 J\Big)\right) P^T$$
$$= -\frac{1}{2}\left(P S_1 P^T - \frac{1}{N}\big(P S_1 P^T J + J P S_1 P^T\big) + \frac{1}{N^2}\,J P S_1 P^T J\right)$$
$$= -\frac{1}{2}\left(S_2 - \frac{1}{N}(S_2 J + J S_2) + \frac{1}{N^2}\,J S_2 J\right)$$
$$= f(L_2)$$

Since $f$ is bijective, we can conclude $P f(L_1) P^T = f(L_2) \implies P L_1 P^T = L_2$. This completes the full proof of Theorem 1.
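Eq. 8.1 and Eq. 8.4 can be sanity-checked numerically on a small example. The sketch below is illustrative code (not from the thesis); the choice $f(\lambda) = \lambda^2$ is an assumption that satisfies $f(0) = 0$, and all variable names are hypothetical. It builds $S_f$ from a random weighted graph and then recovers $f(L)$ from $S_f$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
W = rng.uniform(0.5, 2.0, size=(N, N))
W = np.triu(W, 1)
W = W + W.T                               # symmetric weighted adjacency, no self-loops
L = np.diag(W.sum(axis=1)) - W            # graph Laplacian

lam, Phi = np.linalg.eigh(L)              # L = Phi diag(lam) Phi^T, lam[0] ~ 0
f_lam = lam ** 2                          # f(lambda) = lambda^2, so f(0) = 0
fL = Phi @ np.diag(f_lam) @ Phi.T

J = np.ones((N, N))
# Eq. 8.1: S_f = diag(f(L)) J + J diag(f(L)) - 2 f(L)
S = np.diag(np.diag(fL)) @ J + J @ np.diag(np.diag(fL)) - 2 * fL

# Eq. 8.4: recover f(L) from S_f alone
fL_rec = -0.5 * (S - (S @ J + J @ S) / N + (J @ S @ J) / N ** 2)
print(np.allclose(fL, fL_rec))            # True: f(L) is uniquely recovered from S_f
```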

Variant of Theorem 1: We consider the variant of Theorem 1 for the normalized graph Laplacian $L_{norm} = \tilde{L} = D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$. Again, $\tilde{L}$ is positive semi-definite and admits an eigen-decomposition of the form $\tilde{L} = \tilde{\Phi}\tilde{\Lambda}\tilde{\Phi}^T$, where $\tilde{\Lambda} = \mathrm{diag}[\tilde{\lambda}_k]$ is the diagonal matrix formed by the eigenvalues $\tilde{\lambda}_0 = 0 < \tilde{\lambda}_1 \leq \cdots \leq \tilde{\lambda}_{N-1}$, and $\tilde{\Phi} = [\tilde{\phi}_0, \ldots, \tilde{\phi}_{N-1}]$ is an orthogonal matrix formed by the corresponding eigenvectors $\tilde{\phi}_k$.

With respect to $\tilde{L}$, for a given (real-)analytic function $f$ on $\mathbb{R}^+$ with $f(0) = 0$, we define the f-spectral distance between $x$ and $y$ on $G$ as follows:

$$\tilde{S}_f(x, y) = \sum_{k=0}^{N-1} f(\tilde{\lambda}_k)\big(\tilde{\phi}_k(x) - \tilde{\phi}_k(y)\big)^2 \qquad (8.5)$$

Define $\tilde{J} = D^{\frac{1}{2}} J$, $d = \mathrm{trace}(D^{\frac{1}{2}})$, and note that $\mathrm{trace}(L) = \sum_i \sum_j W_{ij} = \mathrm{vol}(G)$ (the sum of the weights of all edges). We note that $\tilde{\phi}_k^T D^{\frac{1}{2}} \mathbf{1} = 0,\ \forall k \neq 0$, and thus $f(\tilde{L})\,\tilde{J} = \tilde{J}^T f(\tilde{L}) = 0$. Corresponding to Eq. 8.1 and Eq. 8.4, the following expressions hold, relating the (normalized) graph f-spectral distance matrix $\tilde{S}_f = [\tilde{S}_f(x, y)]$ to the normalized graph Laplacian $\tilde{L}$:

$$\tilde{S}_f = \mathrm{diag}(f(\tilde{L}))\,J + J\,\mathrm{diag}(f(\tilde{L})) - 2 f(\tilde{L}) \quad \text{and} \quad f(\tilde{L}) = -\frac{1}{2}\left(\tilde{S}_f - \frac{1}{d}\big(\tilde{S}_f \tilde{J} + \tilde{J}^T \tilde{S}_f\big) + \frac{1}{d^2}\,\tilde{J}^T \tilde{S}_f \tilde{J}\right) \qquad (8.6)$$

Hence, we have the following variant of Theorem 1.

Theorem 14 (Uniqueness of Normalized Fgsd). The f-spectral distance matrix $\tilde{S}_f = [\tilde{S}_f(x, y)]$ uniquely determines the underlying graph (up to graph isomorphism and a scalar constant factor $d$ on the weight matrix). Thus, each graph has a unique $\tilde{S}_f$ up to permutation. More precisely, two undirected, weighted (and connected) graphs $G_1$ and $G_2$ have the same Fgsd-based distance matrix up to permutation, i.e., $S_{G_1} = P S_{G_2} P^T$ for some permutation matrix $P$, if and only if the two graphs are isomorphic and the weight matrices $W_{G_1}$ and $W_{G_2}$ satisfy $W_{G_1} = d\,P W_{G_2} P^T$ for some scalar constant $d > 0$.

8.1.2 Proof of Theorem 2

The proof builds directly upon the fact that the effective resistance distance on a graph is a monotone function with respect to adding or removing edges (or weights) (see [?], Theorem 2.6 for more details). In other words, pairwise effective resistances on a graph cannot increase when edges (or weights) are added. Suppose we add an edge (or increase the edge weight) $w_{ij}$ on a graph $G$ and obtain a new graph $G'$. Let $R$ and $R'$ be the multisets of pairwise effective resistance distances before and after adding/increasing the edge (weight) $w_{ij}$ in the graph, respectively.

Then, using the parallel law of resistance, we can show that $S_{ij}$ always decreases according to $S'_{ij} = \frac{S_{ij}\, w_{ij}}{S_{ij} + w_{ij}}$. As a result, if the elements of $R$ and $R'$ are in sorted order, then the following vector comparison strictly holds element-wise: $R' \prec R$ when we add edges. Likewise, $R' \succ R$ when we remove edges. Thus, we conclude that $R$ is unique up to a fixed number of edges.
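The monotonicity that this proof rests on can be checked directly. The sketch below is illustrative code (not from the thesis): it computes all pairwise effective resistances via the Laplacian pseudo-inverse, increases one edge weight, and verifies that no pairwise resistance increases.

```python
import numpy as np

def effective_resistances(W):
    """All-pairs effective resistances from a weighted adjacency matrix."""
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)                      # Moore-Penrose pseudo-inverse of L
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp     # R(x, y) = Lp_xx + Lp_yy - 2 Lp_xy

rng = np.random.default_rng(1)
N = 7
W = rng.uniform(0.5, 1.5, size=(N, N))
W = np.triu(W, 1); W = W + W.T

R_before = effective_resistances(W)
W2 = W.copy()
W2[0, 1] += 2.0; W2[1, 0] += 2.0                # increase a single edge weight
R_after = effective_resistances(W2)

print(np.all(R_after <= R_before + 1e-12))      # True: no pairwise resistance increases
```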

8.1.3 Proof of Theorem 3

Fgsd provides a graph embedding through the expression $\Psi = \Phi\sqrt{f(\Lambda)}$. Hence, we have $S_f(x, y) = \|\Psi(x) - \Psi(y)\|_2^2$. By the definition of graph isometric embedding, this shows that Fgsd serves as an isometric measure on the graph embedding (defined by $\Psi$) in a Euclidean space.
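A quick numerical illustration of this isometry follows; the code is illustrative only (the choice $f(\lambda) = \lambda^2$ and the names are assumptions). It builds $\Psi$ from the spectrum and checks that the pairwise squared Euclidean distances between its rows reproduce $S_f$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
W = rng.uniform(0.5, 2.0, size=(N, N)); W = np.triu(W, 1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, Phi = np.linalg.eigh(L)
f_lam = np.clip(lam, 0.0, None) ** 2            # f(lambda) = lambda^2, f(0) = 0

Psi = Phi * np.sqrt(f_lam)[None, :]             # row x of Psi is the embedding of node x

# f-spectral distances computed directly from the spectrum
S = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        S[x, y] = np.sum(f_lam * (Phi[x, :] - Phi[y, :]) ** 2)

# squared Euclidean distances between embedding rows
D2 = np.sum((Psi[:, None, :] - Psi[None, :, :]) ** 2, axis=-1)
print(np.allclose(S, D2))                       # True: the embedding is isometric w.r.t. S_f
```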

Next, we show the conditions under which the embedding $\Psi$ is unique for a graph $G$.

1) $\Psi$ is unique if there does not exist any other $\Psi' = \Psi$ with $f(L') \neq f(L)$. As a result, $\Phi'\sqrt{f(\Lambda')} \neq \Phi\sqrt{f(\Lambda)}$ must hold for all $L' \neq L$. This implies that $\phi'_k \neq \sqrt{f(\lambda_k)/f(\lambda'_k)}\,\phi_k,\ \forall k \in [1, N-1]$.

2) We must also make sure that one can always reconstruct the same $\Psi$ from $S_f$. Now, according to Theorem 1, one can recover $f(L)$ uniquely from $S_f$. If all the eigenvalues of $f(L)$ are distinct, then the eigenvalue decomposition of $L$ is also unique. This ensures that we can recover the same $\Phi$ after the decomposition and hence reconstruct the same $\Psi$. Note that we can compute $S_f$ directly from $\Psi$ as follows: $S_f = \mathrm{diag}(\Psi\Psi^T)\,J + J\,\mathrm{diag}(\Psi\Psi^T) - 2\,\Psi\Psi^T$. In short, we have the following uniqueness relationships under the aforementioned conditions, where $\Psi_{G_1}$ is the Euclidean embedding of a graph $G_1$:

$$S_{G_1} \Longleftrightarrow f(L_{G_1}) \Longleftrightarrow L_{G_1} \Longleftrightarrow f(L_{G_1}) \Longleftrightarrow \Psi_{G_1}$$

8.1.4 Proof of Theorem 4

Lemma 6 (Eigenvalue Interlacing Lemma [160]). Let $A' = A + \sigma z z^T$ be a rank-one perturbation of the matrix $A$ with $\|z\|_2 = 1$. Then the eigenvalues of $A'$ interlace with the eigenvalues of $A$: if $\sigma \geq 0$, then $\lambda_0 \leq \lambda'_0 \leq \lambda_1 \leq \lambda'_1 \leq \cdots \leq \lambda_{N-1} \leq \lambda'_{N-1} \leq \lambda_{N-1} + \sigma$; otherwise (i.e., if $\sigma < 0$), $\lambda_0 + \sigma \leq \lambda'_0 \leq \lambda_0 \leq \lambda'_1 \leq \lambda_1 \leq \cdots \leq \lambda'_{N-1} \leq \lambda_{N-1}$.

Let $S_{xy}$ and $S'_{xy}$ be the graph f-spectral distance before and after a single edge perturbation, i.e., a rank-one modification of the graph Laplacian $L$. The modified graph Laplacian $L'$ can be expressed as $L' = L + 2\triangle w_{ij}\, e e^T$, where $w'_{ij} = w_{ij} + \triangle w_{ij}$ is the modification of the edge weight $w_{ij} (\geq 0)$ to $w'_{ij} (\geq 0)$ and $e$ is an $N$-dimensional column vector with $e_i = \frac{1}{\sqrt{2}}$, $e_j = -\frac{1}{\sqrt{2}}$ and $0$ otherwise. Since our goal is to inspect the dependency from the $f(\lambda)$ perspective, we eliminate any dependency on the Laplacian eigenvectors by bounding them. We focus our analysis on $\triangle w_{ij} \geq 0$ and note that here $\sigma = 2\triangle w_{ij}$.

Case I: $f$ is an increasing function and $S'_{xy} > S_{xy}$.

$$|S'_{xy} - S_{xy}| = \sum_{k=0}^{N-1} f(\lambda'_k)\big(\phi'_k(x) - \phi'_k(y)\big)^2 - \sum_{k=0}^{N-1} f(\lambda_k)\big(\phi_k(x) - \phi_k(y)\big)^2$$

Applying Lemma 6, and using the fact that the rows of the eigenvector matrix are orthogonal to each other and have unit norm, we get,

$$\triangle S \leq f(\lambda_{N-1} + \sigma) \sum_{k=0}^{N-1}\big(\phi'_k(x) - \phi'_k(y)\big)^2 - f(\lambda_1) \sum_{k=0}^{N-1}\big(\phi_k(x) - \phi_k(y)\big)^2 = 2\big(f(\lambda_{N-1} + \sigma) - f(\lambda_1)\big)$$

Case II: $f$ is an increasing function and $S'_{xy} < S_{xy}$. Again applying Lemma 6 and using the orthonormality of the rows of the eigenvector matrix, we get,

$$|S'_{xy} - S_{xy}| = \sum_{k=0}^{N-1} f(\lambda_k)\big(\phi_k(x) - \phi_k(y)\big)^2 - \sum_{k=0}^{N-1} f(\lambda'_k)\big(\phi'_k(x) - \phi'_k(y)\big)^2$$
$$\leq f(\lambda_{N-1}) \sum_{k=0}^{N-1}\big(\phi_k(x) - \phi_k(y)\big)^2 - f(\lambda_1) \sum_{k=0}^{N-1}\big(\phi'_k(x) - \phi'_k(y)\big)^2 = 2\big(f(\lambda_{N-1}) - f(\lambda_1)\big)$$

Combining Case I and Case II, we get the following bound for $f$ as an increasing function:

$$\triangle S \leq 2\,\big|f(\lambda_{N-1} + 2\triangle w_{ij}) - f(\lambda_1)\big|$$

Case III: $f$ is a decreasing function of $\lambda$. The same result holds and can be derived in a similar fashion:

$$\triangle S \leq 2\,\big|f(\lambda_{N-1} + 2\triangle w_{ij}) - f(\lambda_1)\big|$$

Note: The above results are also valid for the normalized graph Laplacian $L_{norm}$. Lastly, when the weight change $\triangle w$ is negative (and constrained within a certain range), the stability bounds take a similar form but are looser than in the positive $\triangle w$ case. The bounds are looser because the Eigenvalue Interlacing Lemma does not explicitly bound the change in the largest eigenvalue with respect to a negative $\triangle w$ change. As a result, the change in the f-spectral distance (i.e., $\triangle S_{xy}$) is loosely upper bounded by $2|f(\lambda_{N-1})|$ in the case of an increasing function, while for a decreasing function it is upper bounded by $2|f(\lambda_1)|$.
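The Theorem 4 bound for a positive weight change can be probed numerically. The sketch below is illustrative only (the choice $f(\lambda) = \lambda^2$, the perturbed edge, and all names are assumptions): it perturbs one edge of a random weighted graph and compares the largest change in the f-spectral distance against $2|f(\lambda_{N-1} + 2\triangle w) - f(\lambda_1)|$.

```python
import numpy as np

def spectral_dist(W, f):
    """All-pairs f-spectral distances and the Laplacian spectrum of W."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Phi = np.linalg.eigh(L)
    fl = f(np.clip(lam, 0.0, None))
    fl[0] = 0.0                                     # enforce f(lambda_0) = 0
    diff = (Phi[:, None, :] - Phi[None, :, :]) ** 2  # (phi_k(x) - phi_k(y))^2
    return (diff * fl[None, None, :]).sum(axis=-1), lam

rng = np.random.default_rng(3)
N = 6
W = rng.uniform(0.5, 1.5, size=(N, N)); W = np.triu(W, 1); W = W + W.T
f = lambda t: t ** 2                                 # increasing on R+, f(0) = 0

S, lam = spectral_dist(W, f)
dw = 0.7                                             # positive single-edge weight change
W2 = W.copy(); W2[2, 3] += dw; W2[3, 2] += dw
S2, _ = spectral_dist(W2, f)

change = np.abs(S2 - S).max()
bound = 2 * abs(f(lam[-1] + 2 * dw) - f(lam[1]))     # 2|f(lambda_{N-1} + 2*dw) - f(lambda_1)|
print(change <= bound + 1e-9)                        # True in this run
```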

8.1.5 Proof of Theorem 5

Lemma 7 (McDiarmid's Inequality [178]). Let $X = (x_1, x_2, \ldots, x_m)$ be a set of independent random variables, let $F : X \to \mathbb{R}$, and let $x'_i$ denote a substitution of $x_i$. Now if,

$$\sup_{x_1, \ldots, x_i, \ldots, x_m,\, x'_i} \big|F(x_1, \ldots, x_i, \ldots, x_m) - F(x_1, \ldots, x'_i, \ldots, x_m)\big| = \sup_{x_1, \ldots, x_m,\, x'_i} |F_X - F_{X^i}| \leq c_i, \quad \forall i,$$

then the following exponential bound holds,

$$P\big(F(X) - \mathbb{E}_X[F(X)] \geq \epsilon\big) \leq e^{-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}}$$
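Before applying the lemma, a quick simulation can make the bound concrete. The sketch below is illustrative (not from the thesis) and uses the simplest bounded-difference function, the mean of $m$ independent variables in $[0, 1]$, for which $c_i = 1/m$ and $\mathbb{E}_X[F(X)] = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, trials, eps = 50, 100_000, 0.05
F = rng.uniform(0.0, 1.0, size=(trials, m)).mean(axis=1)   # F(X) = sample mean

empirical_tail = np.mean(F - 0.5 >= eps)
mcdiarmid_bound = np.exp(-2 * eps**2 / (m * (1.0 / m) ** 2))  # exp(-2 eps^2 / sum_i c_i^2)
print(empirical_tail, mcdiarmid_bound, empirical_tail <= mcdiarmid_bound)  # tail sits below the bound
```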

Consider the weighted adjacency matrix entries as random variables, and let $D$ be the distribution from which the weights are independently sampled. We fix the order of the $\frac{N(N-1)}{2}$ random variables in the weighted adjacency matrix as $\{w_1, w_2, \ldots, w_{N(N-1)/2}\}$ and assume the weights are bounded: $0 < \alpha \leq w_i \leq \beta,\ \forall i$. Our goal is to obtain exponential bounds on the expected value of $S(x, y)$. Now, we have $S(x, y)$ as a function of $f(L)$, i.e.,

Laplacian | $f$ as increasing function | $f$ as decreasing function
$L$ | $2f(2\beta N + 2\beta)$ | $2f(\alpha N)$
$L_{norm}$ | $2f(2 + 2\beta)$ | $2f(\alpha N)$

Table 8.1: Bounds on $\Theta$.

$$S_f(x, y) = F\big(f(L)\big) = F\big(f(D - W)\big) = F\big(w_1, w_2, \ldots, w_{N(N-1)/2}\big).$$

Then, in order to apply McDiarmid's inequality, we bound the following,

$$\sup|F - F^i| = \sup_{w_1, \ldots, w_{N(N-1)/2},\, w'_i}\Big|F\big(w_1, \ldots, w_i, \ldots, w_{N(N-1)/2}\big) - F\big(w_1, \ldots, w'_i, \ldots, w_{N(N-1)/2}\big)\Big| \leq \sup_{\lambda_{N-1},\, \lambda_1,\, \triangle w_i}\triangle S \ \ (\text{using Theorem 4}) \ = \Theta$$

Then, applying Lemma 7 gives the following bound with probability $1 - \delta$, where $\delta \in (0, 1)$.

$$P\big(S_f(x, y) - \mathbb{E}[S_f(x, y)] \geq \epsilon\big) \leq e^{-\frac{4\epsilon^2}{N(N-1)\Theta^2}} \implies S_f(x, y) - \mathbb{E}[S_f(x, y)] \leq \frac{\Theta}{2}\sqrt{N(N-1)\log\frac{1}{\delta}}$$

The same result also holds for the $L_{norm}$ case.

Now, $\Theta$ itself depends upon the bounds on $\lambda_1$, $\lambda_{N-1}$ and $\triangle w_i$. Let $\lambda_{N-1}^{up}$ be the largest upper bound on $\lambda_{N-1}$ and $\lambda_1^{low}$ the smallest lower bound possible over all graphs of size $N$. Since $w_i > 0\ \forall i$, we have a complete graph. Now, according to Lemma 6, the eigenvalue $\lambda_1$ always increases as edge weights increase. As a result, $\lambda_1$ achieves its lowest value when all weights are equal to $\alpha$, which implies $\lambda_1 \geq \lambda_1^{low} = \alpha N$.

Case I: Consider the Laplacian $L$ and $f$ as an increasing function. Then $\lambda_{N-1}^{up} \leq 2\beta N$ [166], and

$$\triangle S \leq 2\,\big|f(\lambda_{N-1} + 2\triangle w_i) - f(\lambda_1)\big| \leq 2\,\big|f(\lambda_{N-1}^{up} + 2\triangle w_i)\big| = 2\,\big|f(2\beta N + 2\beta)\big|$$

For a decreasing function $f$, we have,

$$\triangle S \leq 2\,\big|f(\lambda_1) - f(\lambda_{N-1} + 2\triangle w_i)\big| \leq 2\,\big|f(\lambda_1^{low})\big| = 2\,\big|f(\alpha N)\big|$$

Case II: Consider the Laplacian $L_{norm}$ and $f$ as an increasing function. Then $\lambda_{N-1}^{up} \leq 2$ and,

$$\triangle S \leq 2\,\big|f(2 + 2\beta)\big|$$

For a decreasing function $f$, we have,

$$\triangle S \leq 2\,\big|f(\alpha N)\big|$$

The summary of the $\Theta$ bounds for the different cases is given in Table 8.1. This completes the proof.
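The two spectral bounds used to instantiate $\Theta$ can be spot-checked numerically. The sketch below is illustrative (the constants and names are arbitrary assumptions): it samples complete graphs with weights in $[\alpha, \beta]$ and verifies that $\lambda_1 \geq \alpha N$ and $\lambda_{N-1} \leq 2\beta N$.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, N = 0.3, 1.7, 8

ok = True
for _ in range(100):
    W = rng.uniform(alpha, beta, size=(N, N))
    W = np.triu(W, 1); W = W + W.T                 # complete weighted graph, weights in [alpha, beta]
    lam = np.linalg.eigvalsh(np.diag(W.sum(axis=1)) - W)
    ok &= bool(lam[1] >= alpha * N - 1e-9)         # lambda_1 >= alpha * N
    ok &= bool(lam[-1] <= 2 * beta * N + 1e-9)     # lambda_{N-1} <= 2 * beta * N
print(ok)                                          # True
```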

Connection to Uniform Stability: The f-spectral distance function can be thought of as a learning algorithm F on a graph which satisfies the notion of uniform stability

given in [87] as $\sup_{X, z}\big|\ell(A_X, z) - \ell(A_{X^i}, z)\big| \leq \eta$, where $\ell$ is the loss function, $A$ is the learning algorithm, $z$ is any data point and $X$ is the set of $N$ data points. Uniform stability is a strong stability criterion which provides generalization guarantees by establishing an upper bound $\eta$ on the change in loss due to the modification of a single data point in the training set (and by subsequently taking the supremum over all possible training sets $X$ and data points $z$). To conjure the notion of uniform stability for a distance function operating on a graph, we replace $A$ by $F$ (the distance function), $\ell$ by the identity function, and $X$ by the set of $|E|$ edges in the graph; we define $X^i$ as the modification of any single edge and replace $z$ by the pairwise node input $(x, y)$. Then, in the case of the f-spectral distance function, $\eta$ is nothing but $\Theta$, and thus the f-spectral distance satisfies the notion of uniform stability in the sense defined above.

8.1.6 Experiments and Results

Datasets | MUTAG | PROTEIN | D&D
Harmonic | 92.12 | 73.42 | 77.10
Biharmonic | 89.24 | 70.06 | 75.34

Table 8.2: Preliminary experiment: classification accuracy on a few bioinformatics datasets. The harmonic-based feature space yields higher accuracy than the biharmonic one due to sparseness.

Parameter Selection: For the Random-Walk (RW) kernel, the decay factor is chosen from $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$. For the Weisfeiler-Lehman (WL) kernel, we chose $h = 2$ as the maximum number of iterations to limit the exponential increase in running time and feature-space size of the kernel; in the case of unlabeled classification, we fed the node degree as the node label. For the graphlet kernel (GK), we chose graphlet sizes $\{3, 5, 7\}$. For deep graph kernels (DGK), the window size and dimension are taken from the set $\{2, 5, 10, 25, 50\}$, and we report the best classification accuracy obtained among the deep graphlet kernel, the deep shortest-path kernel and the deep Weisfeiler-Lehman kernel. For the Multiscale Laplacian Graph (MLG) kernel, we chose the $\eta$ and $\gamma$ parameters of the algorithm from $\{0.01, 0.1, 1\}$, the radius size from $\{1, 2, 3, 4\}$, and the level number from $\{1, 2, 3, 4\}$; in the case of unlabeled classification, we provide the degree of the node as the label. For diffusion-convolutional neural networks (DCNN), we chose the number of hops from $\{2, 5\}$ and the AdaGrad algorithm (gradient descent) with parameters: learning rate 0.05, batch size 100 and number of epochs 500. Again, node degree served as the label in the case of unlabeled classification. For the rest, the best reported results were borrowed from the PATCHY-SAN (CNNs) [54], skewed graph spectrum (SGS) [76] and graphlet spectrum (GS) [77] papers, since the experimental setup was the same and a fair comparison can be made.

8.2 Chapter 5: Appendix

8.2.1 Proof of Theorem 9

Theorem 9 is based on the main result of [4] and its extension to graph classification as well as graph capsule networks. We show that single-layer graph capsule networks are also uniformly stable [87] for solving the graph classification problem. As in [4], we assume that the activation and loss functions are Lipschitz continuous and smooth. For convenience, we borrow the notation used in [4] and similarly divide the proof into two parts. In the first part, we separate out the terms due to the weight parameters and the graph convolution operation in order to bound their differences independently. In the second part, we bound the expected difference (due to the randomness of the SGD algorithm) in the weight parameters under a single data-point perturbation.

The following is a single-layer Universal Graph Encoder GNN function $f(\mathbf{x}, \theta) \in \mathbb{R}$ based on statistical moments, with sum as the read-out function for the graph classification task, where $N$ is the number of nodes in a graph, $x_i$ is the $i$th node feature value and $\theta$ is the capsule learning parameter. For simplicity, we show our results for the case $x \in \mathbb{R}$, but the proof remains applicable for general $x \in \mathbb{R}^d$:

$$f\big(\mathbf{x} = \{x_1, \ldots, x_N\}, \theta\big) = \sum_{n=1}^{N}\sum_{p=1}^{P} \sigma\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j}\, x_j^p\Big)\theta_p\Big) \qquad (8.7)$$

The first-order derivative with respect to the $p$th parameter is given as,

$$\frac{\partial f(\mathbf{x}, \theta)}{\partial \theta_p} = \sum_{n=1}^{N} \sigma'\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j}\, x_j^p\Big)\theta_p\Big) \sum_{j \in \mathcal{N}(x_n)} e_{\cdot j}\, x_j^p \qquad (8.8)$$

where $P$ is the number of instantiation parameters.
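As a sanity check on Eq. 8.7 and Eq. 8.8 as written above, the sketch below implements the encoder and verifies the analytic derivative against central finite differences. It is illustrative only: $\sigma = \tanh$ and a row-normalized adjacency standing in for the $e_{\cdot j}$ aggregation weights are assumptions, not the thesis code.

```python
import numpy as np

def f_encoder(x, theta, E):
    """Eq. 8.7 with sigma = tanh: sum_n sum_p tanh( (E x^p)[n] * theta_p )."""
    P = theta.shape[0]
    powers = np.stack([x ** p for p in range(1, P + 1)], axis=1)   # (N, P): entries x_j^p
    z = E @ powers                                                 # (N, P): sum_j E[n, j] x_j^p
    return np.sum(np.tanh(z * theta[None, :]))

def grad_theta(x, theta, E):
    """Eq. 8.8: d f / d theta_p = sum_n sigma'( z[n, p] * theta_p ) * z[n, p]."""
    P = theta.shape[0]
    powers = np.stack([x ** p for p in range(1, P + 1)], axis=1)
    z = E @ powers
    return np.sum((1.0 - np.tanh(z * theta[None, :]) ** 2) * z, axis=0)

rng = np.random.default_rng(6)
N, P = 6, 3
A = rng.uniform(0, 1, size=(N, N)); A = np.triu(A, 1); A = A + A.T
E = A / A.sum(axis=1, keepdims=True)     # row-normalized aggregation weights (an assumption)
x = rng.normal(size=N)
theta = rng.normal(size=P)

g = grad_theta(x, theta, E)
eps = 1e-6
g_fd = np.array([(f_encoder(x, theta + eps * np.eye(P)[p], E)
                  - f_encoder(x, theta - eps * np.eye(P)[p], E)) / (2 * eps)
                 for p in range(P)])
print(np.allclose(g, g_fd, atol=1e-6))   # True: the analytic gradient matches finite differences
```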

Proof Part I: Let $\mathbb{E}_{sgd}$ denote the expectation over the randomness of SGD, and let $\theta_S$ and $\theta_{S^i}$ represent the filter weights learned on the training sets $S$ and $S^i$, which differ in precisely one data point. Also, let $\Delta\theta = \theta_S - \theta_{S^i}$.

$$\mathbb{E}_{sgd}\big[|\ell(A_S, y) - \ell(A_{S^i}, y)|\big] \leq \alpha_\ell\, \mathbb{E}_{sgd}\big[|f(\mathbf{x}, \theta_S) - f(\mathbf{x}, \theta_{S^i})|\big]$$
$$\leq \alpha_\ell\, \mathbb{E}_{sgd}\Big[\Big|\sum_{n=1}^{N}\sum_{p=1}^{P} \sigma\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S}\Big) - \sigma\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S^i}\Big)\Big|\Big]$$

Since the activation function is also Lipschitz continuous,

$$\leq \alpha_\ell\, \mathbb{E}_{sgd}\Big[\sum_{n=1}^{N}\sum_{p=1}^{P} \Big|\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\, \theta_{p,S} - \sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\, \theta_{p,S^i}\Big|\Big] \qquad (8.9)$$
$$\leq \alpha_\ell\, \mathbb{E}_{sgd}\Big[\sum_{n=1}^{N}\sum_{p=1}^{P} \Big|\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\, \big(\theta_{p,S} - \theta_{p,S^i}\big)\Big|\Big]$$
$$\leq \alpha_\ell \sum_{n=1}^{N}\sum_{p=1}^{P} \Big|\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big|\, \mathbb{E}_{sgd}\big[|\Delta\theta_p|\big]$$
$$\leq \alpha_\ell\, N \sum_{p=1}^{P} g_{p,\lambda}\, \mathbb{E}_{sgd}\big[|\Delta\theta_p|\big]$$

where $g_{p,\lambda}$ is defined as $g_{p,\lambda} := \sup_x \big|\sum_{j \in \mathcal{N}(x)} e_{\cdot j}\, x_j^p\big|$. The term $g_{p,\lambda}$ will later be bounded in terms of the largest absolute eigenvalue of the graph convolution filter $g(L)$.

Proof Part II: We now move on to the second part of the proof, where we bound the difference in the weights learned due to a single data perturbation in Lemma 10, using Lemma 8 and Lemma 9.

Lemma 8 (Universal Graph Encoder Same-Sample Loss Stability Bound). The loss-derivative difference of the (single-layer) Universal Graph Encoder trained with the SGD algorithm for $T$ iterations on two training datasets $S$ and $S^i$, respectively, with respect to the same sample is bounded by,

$$\Big|\nabla_p \ell\big(f(\mathbf{x}, \theta_{S,t}), y\big) - \nabla_p \ell\big(f(\mathbf{x}, \theta_{S^i,t}), y\big)\Big| \leq \nu_\ell \nu_\sigma N g_{p,\lambda}^2\, |\Delta\theta_{p,t}|$$

Proof: Using Equation 8.8, we get,

$$\Big|\frac{\partial \ell\big(f(\mathbf{x}, \theta_{S,t}), y\big)}{\partial \theta_p} - \frac{\partial \ell\big(f(\mathbf{x}, \theta_{S^i,t}), y\big)}{\partial \theta_p}\Big| \leq \nu_\ell \Big|\frac{\partial f(\mathbf{x}, \theta_{S,t})}{\partial \theta_p} - \frac{\partial f(\mathbf{x}, \theta_{S^i,t})}{\partial \theta_p}\Big|$$
$$\leq \nu_\ell \sum_{n=1}^{N} \Big|\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\Big[\sigma'\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S,t}\Big) - \sigma'\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S^i,t}\Big)\Big]\Big|$$

Since the activation function is Lipschitz continuous and smooth, and plugging in $\big|\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\big| \leq g_{p,\lambda}$, we get,

$$\leq \nu_\ell \nu_\sigma \sum_{n=1}^{N} g_{p,\lambda} \Big|\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S,t} - \Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S^i,t}\Big|$$
$$\leq \nu_\ell \nu_\sigma \sum_{n=1}^{N} g_{p,\lambda} \Big|\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big|\, |\theta_{p,S,t} - \theta_{p,S^i,t}|$$
$$\leq \nu_\ell \nu_\sigma N g_{p,\lambda}^2\, |\Delta\theta_{p,t}|$$

This completes the proof of Lemma 8.

Lemma 9 (Universal Graph Encoder Different-Sample Loss Stability Bound). The loss-derivative difference of the (single-layer) Universal Graph Encoder trained with the SGD algorithm for $T$ iterations on two training datasets $S$ and $S^i$, respectively, with respect to different samples is bounded by,

$$\Big|\nabla_p \ell\big(f(\mathbf{x}_i, \theta_{S,t}), y_i\big) - \nabla_p \ell\big(f(\mathbf{x}'_i, \theta_{S^i,t}), y'_i\big)\Big| \leq 2\nu_\ell \alpha_\sigma N g_{p,\lambda}.$$

Proof:

$$\Big|\frac{\partial \ell\big(f(\mathbf{x}, \theta_{S,t}), y\big)}{\partial \theta_p} - \frac{\partial \ell\big(f(\tilde{\mathbf{x}}, \theta_{S^i,t}), \tilde{y}\big)}{\partial \theta_p}\Big| \leq \nu_\ell \Big|\frac{\partial f(\mathbf{x}, \theta_{S,t})}{\partial \theta_p} - \frac{\partial f(\tilde{\mathbf{x}}, \theta_{S^i,t})}{\partial \theta_p}\Big|$$
$$\leq \nu_\ell \sum_{n=1}^{N} \Big|\sigma'\Big(\Big(\sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big)\theta_{p,S,t}\Big) \sum_{j \in \mathcal{N}(x_n)} e_{\cdot j} x_j^p\Big| + \nu_\ell \sum_{n=1}^{N} \Big|\sigma'\Big(\Big(\sum_{j \in \mathcal{N}(\tilde{x}_n)} e_{\cdot j}\, \tilde{x}_j^p\Big)\theta_{p,S^i,t}\Big) \sum_{j \in \mathcal{N}(\tilde{x}_n)} e_{\cdot j}\, \tilde{x}_j^p\Big|$$

Using the fact that the first-order derivative is bounded,

$$\leq 2\nu_\ell \alpha_\sigma N g_{p,\lambda}$$

This completes the proof of Lemma 9.

Lemma 10 (Universal Graph Encoder SGD Stability Bound). Let the loss and activation functions be Lipschitz continuous and smooth. Let $\theta_{S,T}$ and $\theta_{S^i,T}$ denote the graph filter parameters of the (single-layer) Universal Graph Encoder trained using SGD for $T$ iterations on two training datasets $S$ and $S^i$, respectively. Then the expected difference in the filter parameters is bounded by,

$$\mathbb{E}_{sgd}\big[|\theta_{p,S,T} - \theta_{p,S^i,T}|\big] \leq \frac{2\eta\nu_\ell\alpha_\sigma N g_{p,\lambda}}{m} \sum_{t=1}^{T}\big(1 + \eta\nu_\ell\nu_\sigma N g_{p,\lambda}^2\big)^{t-1}$$

Proof: Following the bounding analysis presented in [4] for the expected weight difference due to SGD, we have,

$$\mathbb{E}_{sgd}\big[|\Delta\theta_{p,t+1}|\big] \leq \Big(1 - \frac{1}{m}\Big)\, \mathbb{E}_{sgd}\Big[\Big|\big(\theta_{p,S,t} - \eta\nabla_p\ell(f(\mathbf{x}, \theta_{S,t}), y)\big) - \big(\theta_{p,S^i,t} - \eta\nabla_p\ell(f(\mathbf{x}, \theta_{S^i,t}), y)\big)\Big|\Big]$$
$$\quad + \frac{1}{m}\, \mathbb{E}_{sgd}\Big[\Big|\big(\theta_{p,S,t} - \eta\nabla_p\ell(f(\tilde{\mathbf{x}}, \theta_{S,t}), \tilde{y})\big) - \big(\theta_{p,S^i,t} - \eta\nabla_p\ell(f(\hat{\mathbf{x}}, \theta_{S^i,t}), \hat{y})\big)\Big|\Big] \qquad (8.10)$$
$$\leq \mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big] + \Big(1 - \frac{1}{m}\Big)\eta\, \mathbb{E}_{sgd}\Big[\big|\nabla_p\ell(f(\mathbf{x}, \theta_{S,t}), y) - \nabla_p\ell(f(\mathbf{x}, \theta_{S^i,t}), y)\big|\Big] + \frac{1}{m}\eta\, \mathbb{E}_{sgd}\Big[\big|\nabla_p\ell(f(\tilde{\mathbf{x}}, \theta_{S,t}), \tilde{y}) - \nabla_p\ell(f(\hat{\mathbf{x}}, \theta_{S^i,t}), \hat{y})\big|\Big]$$

Plugging the bounds of Lemma 8 and Lemma 9 into Equation (8.10), we have,

$$\mathbb{E}_{sgd}\big[|\Delta\theta_{p,t+1}|\big] \leq \mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big] + \Big(1 - \frac{1}{m}\Big)\eta\nu_\ell\nu_\sigma N g_{p,\lambda}^2\, \mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big] + \frac{2\eta\nu_\ell\alpha_\sigma N g_{p,\lambda}}{m}$$
$$= \Big(1 + \Big(1 - \frac{1}{m}\Big)\eta\nu_\ell\nu_\sigma N g_{p,\lambda}^2\Big)\, \mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big] + \frac{2\eta\nu_\ell\alpha_\sigma N g_{p,\lambda}}{m}$$
$$\leq \big(1 + \eta\nu_\ell\nu_\sigma N g_{p,\lambda}^2\big)\, \mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big] + \frac{2\eta\nu_\ell\alpha_\sigma N g_{p,\lambda}}{m}$$

Lastly, solving the first-order recursion for $\mathbb{E}_{sgd}\big[|\Delta\theta_{p,t}|\big]$ yields,

$$\mathbb{E}_{sgd}\big[|\Delta\theta_{p,T}|\big] \leq \frac{2\eta\nu_\ell\alpha_\sigma N g_{p,\lambda}}{m}\sum_{t=1}^{T}\big(1 + \eta\nu_\ell\nu_\sigma N g_{p,\lambda}^2\big)^{t-1}$$

This completes the proof of Lemma 10.
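The last step solves a first-order recursion of the form $a_{t+1} \leq (1 + c)\,a_t + b$ with $a_0 = 0$, whose worst case is exactly the geometric sum above. A tiny check with illustrative stand-in values:

```python
import numpy as np

c, b, T = 0.05, 0.3, 20     # stand-ins for eta*nu_l*nu_s*N*g^2 and 2*eta*nu_l*alpha_s*N*g/m
a = 0.0
for _ in range(T):
    a = (1 + c) * a + b     # worst case of a_{t+1} <= (1 + c) a_t + b, starting at a_0 = 0
closed_form = b * sum((1 + c) ** (t - 1) for t in range(1, T + 1))
print(np.isclose(a, closed_form))   # True: the recursion equals the geometric sum
```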

Bound on $g_{p,\lambda}$: Let $q = |\mathcal{N}(x)|$ and let $g_x(L) \in \mathbb{R}^{q \times q}$ be the submatrix of $g(L)$ whose row and column indices come from the set $\{j \in \mathcal{N}(x)\}$. We use $h_{x,p} \in \mathbb{R}^q$ to denote the $p$th-moment graph signal (node features) on the ego-graph $G_{x,p}$. Without loss of generality, we assume that node $x$ is represented by index $0$ in $G_{x,p}$. Thus, we can compute $\sum_{j \in \mathcal{N}(x)} e_{\cdot j}\, x_j^p = [g_x(L) h_{x,p}]_0$, a scalar value. Here $[\cdot]_0$ represents the value of a vector at index $0$, i.e., corresponding to node $x$. Then the following holds

(assuming the graph signals are normalized, i.e., ∥hx,1∥2 = 1),

$$\big|[g_x(L) h_{x,p}]_0\big| \leq \|g_x(L) h_{x,p}\|_2 \leq \|g_x(L)\|_2\, \|h_{x,p}\|_2 \leq \|g_x(L)\|_2\, \|h_{x,1}\|_2 = \lambda_{G_x}^{max} \qquad (8.11)$$

where the third inequality follows from the fact that $|x_i^p|^2 \leq |x_i|^2 \leq 1$ for $p \geq 2$, since $\|x\|_2 = 1$; as a result, the norm inequality $\big(\sum_i |x_i^p|^2\big)^{1/2} \leq \big(\sum_i |x_i|^2\big)^{1/2} \leq 1$ holds.
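The chain of inequalities in Eq. 8.11 can be checked numerically. In the sketch below (illustrative only; a random symmetric matrix stands in for $g_x(L)$), every entry of $g_x(L) h_{x,p}$ stays below the largest absolute eigenvalue of the filter once $\|h_{x,1}\|_2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(7)
q, P = 5, 4
M = rng.normal(size=(q, q)); gx = (M + M.T) / 2       # symmetric stand-in for g_x(L)
lam_max = np.max(np.abs(np.linalg.eigvalsh(gx)))      # largest absolute eigenvalue

x = rng.normal(size=q); x = x / np.linalg.norm(x)     # normalize so that ||h_{x,1}||_2 = 1
ok = True
for p in range(1, P + 1):
    h_p = x ** p                                      # p-th moment signal, ||h_p||_2 <= 1
    ok &= bool(np.abs((gx @ h_p)[0]) <= lam_max + 1e-12)
print(ok)                                             # True
```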

Finally, plugging $g_{p,\lambda} \leq \lambda_G^{max}$ and Lemma 10 into Equation (8.9) yields the remaining result,

$$2\beta_m \leq \alpha_\ell N \lambda_G^{max} \sum_{p=1}^{P} \mathbb{E}_{sgd}\big[|\Delta\theta_p|\big]$$
$$\beta_m \leq \frac{P\,\eta\,\alpha_\ell\alpha_\sigma\nu_\ell\, N^2 (\lambda_G^{max})^2 \sum_{t=1}^{T}\big(1 + \eta\nu_\ell\nu_\sigma N(\lambda_G^{max})^2\big)^{t-1}}{m}$$
$$\beta_m \leq O\Big(\frac{1}{m}\, P\, N^{T+1}\, (\lambda_G^{max})^{2T}\Big) \quad \forall\, T \geq 1$$

This completes the full proof of Theorem 9.

8.2.2 Experiment Baselines Settings

For the Weisfeiler-Lehman (WL) kernel, we vary the iteration parameter $h \in \{2, 3, 4, 5\}$. For the Random-Walk (RW) kernel, the decay factor is chosen from $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$. For the graphlet kernel (GK), we choose the graphlet size from $\{3, 5, 7\}$. For the deep graph kernels (DGKs), the window size and dimension are taken from the set $\{2, 5, 10, 25, 50\}$, and we report the best classification accuracy obtained among i) the deep graphlet kernel, ii) the deep shortest-path kernel and iii) the deep Weisfeiler-Lehman kernel. For the Multiscale Laplacian Graph (MLG) kernel, we vary the $\eta$ and $\gamma$ parameters of the algorithm over $\{0.01, 0.1, 1\}$, the radius size over $\{1, 2, 3, 4\}$, and the level number over $\{1, 2, 3, 4\}$. For the diffusion-convolutional neural networks (DCNN), we choose the number of hops from $\{2, 5\}$ and employ the AdaGrad algorithm (gradient descent) with the following parameters: learning rate 0.05, batch size 100 and number of epochs 500. We use the node degree as the node label in the cases where node labels are unavailable. For the rest, the best reported results are borrowed from their respective papers, since the experimental setup is the same and a fair comparison can be made.
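For convenience, the grids listed above can be collected in a single configuration object. The sketch below merely restates this section; the key names are illustrative and not taken from the thesis code.

```python
# Baseline hyperparameter grids as described in this section (key names are illustrative).
baseline_grids = {
    "WL":   {"h": [2, 3, 4, 5]},
    "RW":   {"decay": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]},
    "GK":   {"graphlet_size": [3, 5, 7]},
    "DGK":  {"window_size": [2, 5, 10, 25, 50], "dimension": [2, 5, 10, 25, 50]},
    "MLG":  {"eta": [0.01, 0.1, 1], "gamma": [0.01, 0.1, 1],
             "radius": [1, 2, 3, 4], "level": [1, 2, 3, 4]},
    "DCNN": {"hops": [2, 5], "optimizer": "AdaGrad",
             "learning_rate": 0.05, "batch_size": 100, "epochs": 500},
}
```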