CogDL: Toolkit for Deep Learning on Graphs

Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Yang Yang, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, Jie Tang, IEEE Fellow

Abstract—Deep learning on graphs has attracted tremendous attention from the graph learning community in recent years. It has been widely used in several real-world applications such as social network analysis and recommender systems. In this paper, we introduce CogDL, an extensive toolkit for deep learning on graphs that allows researchers and developers to easily conduct experiments and build applications. It provides standard training and evaluation for the most important tasks in the graph domain, including node classification, graph classification, etc. For each task, it provides implementations of state-of-the-art models. The models in our toolkit are divided into two major parts: graph embedding methods and graph neural networks. Most of the graph embedding methods learn node-level or graph-level representations in an unsupervised way and preserve graph properties such as structural information, while graph neural networks capture node features and work in semi-supervised or self-supervised settings. All models implemented in our toolkit can easily reproduce the leaderboard results. Most models in CogDL are developed on top of PyTorch, and users can leverage the advantages of PyTorch to implement their own models. Furthermore, we demonstrate the effectiveness of CogDL for real-world applications in AMiner, a large academic mining system.

Index Terms—Graph representation learning, Graph convolutional networks, Graph neural networks, Deep learning.

Author affiliations: Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, and Shiguang Guo are with the Department of Computer Science and Technology, Tsinghua University, China (e-mail: {cyk20, houzy21}@mails.tsinghua.edu.cn). Yang Yang is with the Department of Computer Science and Technology, Zhejiang University, China. Guohao Dai and Yu Wang are with the Department of Electronic Engineering, Tsinghua University, China. Chang Zhou and Hongxia Yang are with the Alibaba Group, China. Jie Tang (corresponding author, e-mail: [email protected]) is with the Department of Computer Science and Technology, Tsinghua University, and the Tsinghua National Laboratory for Information Science and Technology (TNList), China.

1 INTRODUCTION

Graph-structured data have been widely utilized in many real-world scenarios. For example, each user on Facebook can be seen as a vertex, and their relations such as friendship or followership can be seen as edges in the graph. We might be interested in predicting the interests of users, or whether a pair of nodes in a network should have an edge connecting them. However, traditional machine learning algorithms cannot be directly applied to such graph-structured data. Inspired by recent trends of representation learning in computer vision and natural language processing, graph representation learning [4, 5, 6] has been proposed as an efficient technique to address this issue. Representation learning on graphs aims to learn low-dimensional continuous vectors for vertices/graphs while preserving intrinsic graph properties.

One type of network embedding method is the Skip-gram [22] based model, such as DeepWalk [6], LINE [5], node2vec [4], and PTE [23]. DeepWalk [6] transforms a graph structure into a uniformly sampled collection of truncated random walks and optimizes with the Skip-gram model. LINE [5] proposes loss functions to preserve both first- and second-order proximities and concatenates the two learned embeddings. node2vec [4] conducts biased random walks to smoothly interpolate between breadth-first sampling (BFS) and depth-first sampling (DFS). Another type of network embedding method is matrix factorization (MF) based, such as GraRep [9], HOPE [8], NetMF [1], and ProNE [2], which construct a proximity matrix and use MF techniques such as singular value decomposition (SVD) [24] to obtain graph representations.

Recently, graph neural networks (GNNs) have been proposed and have achieved impressive performance in semi-supervised representation learning. Graph Convolutional Networks (GCNs) [18] utilize a convolutional architecture via a localized first-order approximation of spectral graph convolutions. GraphSAGE [25] is a general inductive framework that leverages node features to generate node embeddings for previously unseen nodes. Graph Attention Networks (GATs) [16] leverage the multi-head self-attention mechanism and enable (implicitly) specifying different weights to different nodes in a neighborhood.

There are several toolkits supporting graph representation learning algorithms, such as PyTorch Geometric (PyG) [26] and Deep Graph Library (DGL) [27]. PyG is a library for deep learning on irregularly structured input data such as graphs, point clouds, and manifolds, built upon PyTorch [28]. DGL provides several APIs allowing arbitrary message-passing computation over large-scale and dynamic graphs with efficient memory usage and high training speed. However, these popular graph representation learning libraries may not completely integrate various representation learning methods (e.g., Skip-gram or matrix factorization based network embedding methods). More importantly, these libraries only focus on specific downstream tasks in the graph domain (e.g., semi-supervised node classification) and do not provide sufficient reproducible evaluations of model performance.

In this paper, we introduce CogDL (available at https://github.com/thudm/cogdl), an extensive graph representation learning toolkit that allows researchers and developers to easily train and compare baseline or customized models for node classification, graph classification, and other important tasks in the graph domain.

TABLE 1: Micro-F1 score (%) reproduced by CogDL for unsupervised multi-label node classification, including matrix factorization and Skip-gram methods. 50% of nodes are labeled for training in PPI, Blogcatalog, and Wikipedia, and 5% in DBLP and Flickr. These datasets correspond to different downstream scenarios: PPI stands for protein-protein interactions; Wikipedia is a co-occurrence network of words; Blogcatalog and Flickr are social networks; DBLP is a citation network.

Rank | Method                 | PPI (50%)    | Wikipedia (50%) | Blogcatalog (50%) | DBLP (5%)    | Flickr (5%)  | Reproducible
1    | NetMF [1]              | 23.73 ± 0.22 | 57.42 ± 0.56    | 42.47 ± 0.35      | 56.72 ± 0.14 | 36.27 ± 0.17 | Yes
2    | ProNE [2]              | 24.60 ± 0.39 | 56.06 ± 0.48    | 41.16 ± 0.26      | 56.85 ± 0.28 | 36.56 ± 0.11 | Yes
3    | NetSMF [3]             | 23.88 ± 0.35 | 53.81 ± 0.58    | 40.62 ± 0.35      | 59.76 ± 0.41 | 35.49 ± 0.07 | Yes
4    | Node2vec [4]           | 20.67 ± 0.54 | 54.59 ± 0.51    | 40.16 ± 0.29      | 57.36 ± 0.39 | 36.13 ± 0.13 | Yes
5    | LINE [5]               | 21.82 ± 0.56 | 52.46 ± 0.26    | 38.06 ± 0.39      | 49.78 ± 0.37 | 31.61 ± 0.09 | Yes
6    | DeepWalk [6]           | 20.74 ± 0.40 | 49.53 ± 0.54    | 40.48 ± 0.47      | 57.54 ± 0.32 | 36.09 ± 0.10 | Yes
7    | SpectralClustering [7] | 22.48 ± 0.30 | 49.35 ± 0.34    | 41.41 ± 0.34      | 43.68 ± 0.58 | 33.09 ± 0.07 | Yes
8    | HOPE [8]               | 21.43 ± 0.32 | 54.04 ± 0.47    | 33.99 ± 0.35      | 56.15 ± 0.22 | 28.97 ± 0.19 | Yes
9    | GraRep [9]             | 20.60 ± 0.34 | 54.37 ± 0.40    | 33.48 ± 0.30      | 52.76 ± 0.42 | 31.83 ± 0.12 | Yes

We summarize the contributions of CogDL as follows:
• High Efficiency: CogDL utilizes well-optimized sparse kernel operators to speed up the training of GNN models.
• Easy-to-Use: CogDL provides easy-to-use APIs for running experiments with the given models and datasets using hyper-parameter search.
• Extensibility: The design of CogDL makes it easy to apply GNN models to new scenarios based on our framework.
• Reproducibility: CogDL provides reproducible leaderboards for state-of-the-art models on most of the important tasks in the graph domain.

2 OVERVIEW

Although there is an enormous body of research work on graph machine learning, we find some existing issues such as inconsistent evaluation settings. Different papers may use their own evaluation settings for the same graph dataset, making results hard to compare. For example, for Cora [29], a widely used dataset, some papers use the "standard" splits following Planetoid [30], while others adopt random splits. Reported results of the same model on the same dataset may thus differ across papers, making it challenging to compare performance reported across various studies [31, 32, 33, 34]. Therefore, we propose CogDL as an open standard toolkit for graph benchmarks. The key point of CogDL is to build reproducible benchmarks for representation learning on graphs. We formalize the standard training and evaluation modules for the most important tasks in the graph domain. The overall framework is described in Figure 1. Our framework is built on PyTorch [28], which is the most popular deep learning library. PyTorch provides an imperative and Pythonic programming style that supports code as a model and makes debugging easy; our toolkit can therefore leverage the advantages of PyTorch. CogDL provides implementations of several kinds of models based on Python and PyTorch, including network embedding methods such as DeepWalk, NetMF, and ProNE, and GNNs such as GCN and GAT. It also supports several genres of datasets for node classification and graph classification. All the models and datasets can be utilized for experiments under different task settings in CogDL. Each task provides a standard training and evaluation process for comparison.

TABLE 2: Accuracy (%) reproduced by CogDL for semi-supervised and self-supervised node classification on citation datasets. ↓ and ↑ mean our results are lower or higher than the results in the original papers. Repro* is short for Reproducible.

Rank | Method          | Cora   | Citeseer | Pubmed | Repro*
1    | GRAND [10]      | 84.8   | 75.1     | 82.4   | Yes
2    | GCNII [11]      | 85.1   | 71.3     | 80.2   | Yes
3    | MVGRL [12]      | 83.6 ↓ | 73.0     | 80.1   | Partial
4    | APPNP [13]      | 84.3 ↑ | 72.0     | 80.0   | Yes
5    | Graph-Unet [14] | 83.3 ↓ | 71.2 ↓   | 79.0   | Partial
6    | GDC [15]        | 82.5   | 72.1     | 79.8   | Yes
7    | GAT [16]        | 82.9   | 71.0     | 78.9   | Yes
8    | DropEdge [17]   | 82.1   | 72.1     | 79.7   | Yes
9    | GCN [18]        | 81.5   | 71.4 ↑   | 79.5   | Yes
10   | DGI [19]        | 82.0   | 71.2     | 76.5   | Yes
11   | JK-net [20]     | 81.8   | 69.5     | 77.7   | Yes
12   | Chebyshev [21]  | 79.0   | 69.8     | 68.6   | Yes

To demonstrate the design and usage of CogDL, we will answer the following three questions, which correspond to the first three contributions summarized in Section 1:
• Question 1: How to efficiently train GNN models via sparse kernel acceleration on GPUs?
• Question 2: Which GNN model performs best on a specific dataset for a certain task?
• Question 3: How to easily extend an existing GNN model to apply it to a new scenario?

3 COGDL FRAMEWORK

In this section, we present the efficiency optimization as well as the design and usage of the CogDL framework by answering the aforementioned questions.

Graph Notations. Denote a network as G = (V, E), where V is a set of n nodes and E ⊆ V × V is a set of edges between nodes. Each node v may be accompanied with its feature x_v. We use A to denote the adjacency matrix (binary or weighted), and D to denote the diagonal degree matrix, with D_ii = Σ_j A_ij. Each edge e_ij = (v_i, v_j), associated with a weight A_ij ≥ 0, indicates the strength of the relationship between v_i and v_j. In practice, the network could be either directed or undirected. If G is directed, we have A_ij ≠ A_ji and e_ij ≠ e_ji; if G is undirected, we have e_ij = e_ji and A_ij = A_ji.

RQ1: Optimize the sparse kernel operators. We introduce an important operator used in CogDL and its corresponding optimization. The Sparse Matrix-Matrix multiplication (SpMM) operator is widely used in most GNNs, because many GNNs apply an aggregation operation for a given node over the nodes of its incoming edges:

    h_u^(l+1) = AGGREGATE({h_v^(l), ∀v ∈ N(u)}),    (1)

where N(u) is the set of neighbors of node u and h_u^(l) is the representation vector of node u at layer l. When the AGGREGATE is a summation, such an operation can be described as an SpMM operator H^(l+1) ← AH^(l), where the sparse matrix A represents the adjacency matrix of the input graph G, and each row of H^(l) is the representation vector of a node (e.g., h_u^(l) and h_v^(l)).
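To make this concrete, the following minimal sketch expresses the sum aggregation of Eq. (1) as a single sparse-dense product using plain PyTorch sparse tensors; it only illustrates the operator, not CogDL's optimized GE-SpMM kernel, and the toy graph is made up.

import torch

# Toy directed graph with 4 nodes; edges run src -> dst.
src = torch.tensor([0, 1, 2, 3, 1])
dst = torch.tensor([1, 2, 3, 0, 0])

# Build the sparse adjacency A with A[i, j] = 1 iff there is an edge j -> i,
# so row i of A selects the incoming neighbors of node i.
indices = torch.stack([dst, src])
values = torch.ones(src.numel())
A = torch.sparse_coo_tensor(indices, values, size=(4, 4)).coalesce()

H = torch.randn(4, 8)            # node representations H^(l) with d = 8
H_next = torch.sparse.mm(A, H)   # SpMM: H^(l+1) <- A H^(l), i.e. sum aggregation
print(H_next.shape)              # torch.Size([4, 8])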

When the AGGREGATE takes other operations, the SpMM operator can still describe the dataflow of representation vectors between nodes, while the element-wise operation varies. The AGGREGATE can be described using a message-passing model or a Gather-Apply-Scatter (GAS) model in a general-purpose graph processing system. Many frameworks for GNNs adopt these models for their SpMM implementation (e.g., the spmm used in PyG, based on the message-passing model). However, such models fail to utilize the parallelism lying in the elements of a representation vector, leading to poor performance when executing the SpMM operator on GPUs. Many previous works have focused on improving the performance of the SpMM operator on GPUs, such as ASpT [35] and GraphBlast [36]. GE-SpMM [37] is an optimized SpMM operator designed for GPUs. By avoiding redundant loading from GPU global memory and uncoalesced data access patterns, GE-SpMM improves the performance of both the SpMM operator on GPUs and GNNs built on SpMM. The original GE-SpMM only supports sparse matrices with "0-1" values, and we integrate GE-SpMM into our CogDL toolkit by extending its implementation (e.g., the data format).

When considering a GAT model, we also need a sampled dense-dense matrix multiplication (SDDMM) operator T = A ⊙ (PQ^T), where T and A are sparse matrices, P and Q are two dense matrices, and ⊙ is element-wise matrix multiplication. The SDDMM operator is used for back-propagating the gradients to the sparse adjacency matrix, since the adjacency matrix of the GAT model is computed by the attention mechanism. In addition, CogDL further speeds up multi-head graph attention by optimizing edge-based softmax and multi-head SpMM operators.

We compare the training and inference time of GCN and GAT models on several datasets with other popular GNN frameworks: CogDL with GE-SpMM, CogDL with torch.spmm, PyTorch Geometric (PyG) v1.6.3, and Deep Graph Library (DGL) v0.6.0 with the PyTorch backend. We conduct experiments using Python 3.7.5 and PyTorch v1.7.0 on servers with an Nvidia GeForce RTX 2080 Ti (11GB GPU memory) and an RTX 3090 (24GB GPU memory). From Table 3, CogDL with GE-SpMM achieves a 2.50×–5.26× speedup in training and 3.42×–5.59× in inference compared to torch.spmm. CogDL with GE-SpMM is also faster than the existing GNN frameworks for GCN. The results in Table 3 further show that the efficiency of CogDL is comparable to state-of-the-art GNN frameworks and even has a significant advantage in model inference.

Fig. 1: Overview of CogDL. CogDL mainly comprises three modules: models, datasets, and downstream tasks (node classification, heterogeneous node classification, graph classification, link prediction, and more). At the bottom, CogDL efficiently implements basic operators related to graph computation, such as SpMM, SDDMM, and edge_softmax. Datasets store graph data with Graph and specify the evaluation metric with Evaluator; they cover different types of real-world graphs, including social networks, molecular graphs, and academic graphs. CogDL implements 23 graph embedding methods and 38 GNN-related models with specialized trainers. Based on these models and datasets, various downstream tasks are supported and are applicable to many real scenarios.

Fig. 2: Architecture of CogDL. (The figure shows raw data and Graph flowing into build_dataset with get_loss_fn and get_evaluator; build_model with get_trainer, model.loss, and model.predict; and build_task/init with task.train and task.evaluate.)

TABLE 3: Epoch running time in seconds (full-graph training). The GCN model includes 2 GCN layers and the hidden size is set to 128. The GAT model includes 2 GAT layers with hidden size 128 and 4 attention heads. ‡ means out of memory and * means the hidden size is set to 64.

Model | GPU          | Dataset | torch (train) | PyG (train) | DGL (train) | CogDL (train) | torch (infer) | PyG (infer) | DGL (infer) | CogDL (infer)
GCN   | 2080Ti (11G) | Flickr  | 0.025 | 0.0084 | 0.012 | 0.085 | 0.012 | 0.0034 | 0.007  | 0.0035
GCN   | 2080Ti (11G) | Reddit  | 0.445 | 0.122  | 0.102 | 0.081 | 0.218 | 0.045  | 0.049  | 0.039
GCN   | 2080Ti (11G) | Yelp    | 0.412 | 0.151  | 0.151 | 0.110 | 0.191 | 0.053  | 0.063  | 0.040
GCN   | 3090 (24G)   | Flickr  | 0.017 | 0.006  | 0.008 | 0.007 | 0.008 | 0.002  | 0.004  | 0.002
GCN   | 3090 (24G)   | Reddit  | 0.263 | 0.062  | 0.060 | 0.050 | 0.127 | 0.022  | 0.0314 | 0.022
GCN   | 3090 (24G)   | Yelp    | 0.230 | 0.081  | 0.080 | 0.062 | 0.106 | 0.029  | 0.036  | 0.023
GAT   | 2080Ti (11G) | PubMed  | 0.017 | 0.016  | 0.011 | 0.012 | 0.004 | 0.006  | 0.004  | 0.003
GAT   | 2080Ti (11G) | Flickr  | 0.082 | 0.090  | 0.047 | 0.056 | 0.023 | 0.030  | 0.019  | 0.014
GAT   | 2080Ti (11G) | *Reddit | ‡     | ‡      | 0.406 | 0.537 | ‡     | ‡      | 0.163  | 0.086
GAT   | 3090 (24G)   | PubMed  | 0.043 | 0.011  | 0.011 | 0.016 | 0.004 | 0.004  | 0.003  | 0.002
GAT   | 3090 (24G)   | Flickr  | 0.097 | 0.059  | 0.033 | 0.044 | 0.021 | 0.024  | 0.013  | 0.009
GAT   | 3090 (24G)   | Reddit  | 0.671 | ‡      | 0.301 | 0.373 | 0.112 | ‡      | 0.113  | 0.088
GAT   | 3090 (24G)   | Yelp    | 0.614 | ‡      | 0.294 | 0.404 | 0.118 | ‡      | 0.105  | 0.113

RQ2: Easy-to-use design and usage. We introduce the design of CogDL and show how to use CogDL to find the best model. Based on the backend and the bottom operators, CogDL comprises three modules: Datasets, Models, and Tasks. The architecture of CogDL is illustrated in Figure 2. The three components work together to serve downstream tasks and are defined independently.

Graph. The Graph is the basic data structure in CogDL; it stores graph data and supports abundant graph operations. Graph supports both convenient graph modification, such as removing/adding edges, and efficient computation, such as SpMM. We also provide common graph manipulations used in graph learning, illustrated in the sketch after this list:
• add_remaining_self_loops(): add self-loops to the graph.
• degrees(): get node degrees.
• normalize(): normalize the adjacency matrix, including row-/symmetric-/column-normalization.
• sample_adj(): sample a fixed number of neighbors for the given nodes.
• subgraph(): generate a subgraph containing only the given nodes and the edges between them.
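As a rough usage sketch of these operations (the import path and some signatures, e.g. the arguments of normalize() and sample_adj(), are assumptions based on the names above rather than the documented API):

import torch
from cogdl.data import Graph  # import path assumed

x = torch.randn(4, 16)                               # node features
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])            # (2, num_edges) edge list
g = Graph(x=x, edge_index=edge_index)

g.add_remaining_self_loops()               # add self-loops before propagation
print(g.degrees())                         # per-node degrees
g.normalize("sym")                         # symmetric normalization (argument assumed)
nbrs = g.sample_adj([0, 1], size=2)        # sample 2 neighbors per node (signature assumed)
sub = g.subgraph([0, 1, 2])                # induced subgraph on a node subset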
Task. The task module puts the model and dataset together. The model and dataset are specified in args and built in the task through APIs in the package. train is the only exposed API of a task and integrates training and evaluation. Flexibly, optimizing loss functions can also be implemented in train for task-specific targets. task.train() runs the training and evaluation of the models and datasets specified in args and returns the performance. After training, the parameters of the model and the node/graph representations are saved for further usage.

Dataset. The dataset component reads in data and processes it to produce tensors of appropriate types. Each dataset specifies its loss function and evaluator through the functions get_loss_fn and get_evaluator for later usage in training and evaluation. In this way, the dataset is general and easily used in CogDL. Our package provides many real-world datasets and interfaces for users to define customized datasets with register_dataset. The library provides two ways to fit a customized dataset. One way is to convert raw data files into the required format in CogDL and specify the argument data_path; CogDL will then read and process the data with pre-defined functions. In addition, the library allows developers to define a customized dataset class, as long as the customized dataset is "registered" to CogDL.

@register_dataset("cora")
class CoraDataset(Dataset):
    def process(self):
        """Preprocess data to cogdl.Graph"""

    def get_evaluator(self):
        """Return evaluator defined in CogDL"""
        return accuracy

    def get_loss_fn(self):
        """Return loss function defined in CogDL"""
        return cross_entropy_loss

Model. CogDL includes diverse graph neural network layers as research baselines and for other applications. A GNN layer is implemented based on Graph and the sparse operators. A model in the package comprises the model builder, forward propagation, and loss computation. The APIs described below are implemented in each model to provide a unified paradigm for usage. add_args and build_model_from_args are used to build up a model with model-specific hyper-parameters. loss takes in the data, calls the core forward function, and returns the loss in one propagation. In addition, the library supports specifying a customized trainer for a model through get_trainer; details of the trainer are given below. Similar to Dataset, the model should be registered to make it known to the framework.

from cogdl.layers import GCNLayer

@register_model("gcn")
class GCN(BaseModel):
    def add_args(parser):
        parser.add_argument("--wd", type=float)

    def __init__(self):
        self.layer = GCNLayer(nfeats, nclasses)

    def loss(self, data):
        pred = self.forward(data, data.x)
        return loss_fn(pred, data.y)

Trainer. A trainer is a supplementary component for Task. The training or evaluation of some models is special and incompatible with the general paradigm in CogDL. In such cases, a custom-built trainer can be constructed in CogDL and specified in the model with get_trainer, as shown in the model snippet. trainer.fit, which is similar to the function task.train and covers training and evaluation, takes the model and dataset as input and returns the performance. A trainer is used only when it is specified in a model, and then the trainer takes over the whole process. The code sketch below, together with the model snippet, shows how a trainer is specified and used.
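The original trainer listing is not reproduced here; the following is a minimal sketch of what such a custom trainer could look like under the fit(model, dataset) contract described above. The class name, the builder method, and the training loop are illustrative, not CogDL's actual implementation.

import torch

class MyTrainer(object):
    """Illustrative custom trainer: trains a model on a dataset and
    returns the evaluation score, mirroring the trainer.fit contract."""

    def __init__(self, args):
        self.max_epoch = args.max_epoch

    @classmethod
    def build_trainer_from_args(cls, args):   # builder name assumed
        return cls(args)

    def fit(self, model, dataset):
        data = dataset[0]
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        for _ in range(self.max_epoch):
            optimizer.zero_grad()
            model.loss(data).backward()
            optimizer.step()
        # Evaluate with the metric supplied by the dataset and return it.
        return dataset.get_evaluator()(model.predict(data), data.y)

# In the model snippet above, get_trainer would return MyTrainer so that
# this class takes over training and evaluation.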

Now that we have prepared the necessary components to train and evaluate a GNN model, how do we make a run, and how do we find the best GNN for a specific dataset in response to Question 2? First, we provide an easier entry point for experiments through the experiment API. We can pass a specific task, dataset, and model with hyper-parameters to the experiment API, which calls the low-level APIs (e.g., build_task and task.train). Producing the results with one line of settings provides a convenient and efficient way to search for the best models. Furthermore, we integrate a popular library, optuna [38], into CogDL to enable hyper-parameter search. Optuna is a fast AutoML library based on Bayesian optimization. CogDL implements hyper-parameter search and even model search for graph learning. As shown below, by passing a func_search function that defines the search space of the hyper-parameters, CogDL will start the run and output the best result and model.

from cogdl import experiment

# basic usage
experiment(task="node_classification", dataset="cora", model="gcn",
           hidden_size=32, max_epoch=200)

# basic usage for hyper-parameter search
def func_search(trial):
    return {
        "lr": trial.suggest_categorical("lr", [0.01, 0.1]),
        "hidden_size": trial.suggest_categorical("hs", [32, 64, 128]),
        "dropout": trial.suggest_uniform("dp", 0.1, 0.8),
    }

experiment(task="node_classification", dataset="cora", model="gcn",
           func_search=func_search)

We put all the hyper-parameters used for reproducibility in the configuration file (https://github.com/THUDM/cogdl/blob/master/cogdl/configs.py). Most of these parameters are obtained from the hyper-parameter search in CogDL. Users can simply set use_best_config=True in the experiment API to train the model using the best parameters in the configuration file.
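For example (following the experiment calls shown above):

from cogdl import experiment

# Train GCN on Cora with the stored best hyper-parameters instead of
# passing hidden_size, lr, etc. explicitly.
experiment(task="node_classification", dataset="cora", model="gcn",
           use_best_config=True)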

RQ3: Define new modules. The design of CogDL makes it easy to incrementally add new or customized modules. In this part, we show how to extend an existing graph representation algorithm in CogDL to a new scenario. If existing GNN algorithms can be quickly tested in new scenarios, both academia and industry benefit from developing and leveraging existing methods. We provide simple interfaces to embed a customized model/dataset/task into the current framework while reusing the other components implemented in CogDL.

The following code snippets show how to define a new task, community detection, with an existing dataset, and how to define a customized dataset that is used with an existing model and task in CogDL. The new task and dataset are collected by our framework when @register is called, and the experiment API supports the mixed use of existing and customized modules. In this way, one can easily apply any module in CogDL to a new scenario.

# from cogdl import build_model, build_dataset
@register_task("community_detection")
class CommunityDetection(BaseTask):
    def __init__(self, args):
        dataset = build_dataset(args)[0]
        self.data = dataset.data
        self.evaluator = dataset.get_evaluator()
        self.model = build_model(args)

    def train(self):
        for i in range(self.max_epoch):
            self.model.loss(self.data).backward()
            self.optimizer.step()
        return self.evaluator(self.model.predict(self.data), self.data.y)

# Run your task with model and dataset in CogDL
experiment(task="community_detection", dataset="dblp", model="gcn")

@register_dataset("new_data")
class MyDataset(NodeDataset):
    def __init__(self):
        self.path = "mydata.pt"
        super(MyDataset, self).__init__(self.path)

    def process(self):
        x, edge_index, y = load_raw_data()
        data = Graph(x=x, edge_index=edge_index, y=y)
        torch.save(data, self.path)
        return data

# Run your dataset with model and task in CogDL
experiment(task="node_classification", dataset="new_data", model="gcn")

4 GRAPH BENCHMARKS

In this section, we use CogDL to provide several downstream tasks, including node classification and graph classification, to evaluate the implemented methods. We also build a reliable leaderboard for each task, which maintains benchmarks and state-of-the-art results for that task. Table 4 summarizes the implemented models for node classification and graph classification in CogDL.

4.1 Unsupervised Node Classification

The unsupervised node classification task aims to learn a mapping function f : V → R^d that projects each node to a d-dimensional space (d ≪ |V|) in an unsupervised manner. Structural properties of the network should be captured by the mapping function.

TABLE 4: Summarization of implemented models for node classification and graph classification tasks.

Task                               | Setting         | Characteristics              | Models
Node Classification (Unsupervised) | MF based        |                              | SpectralClustering [7]
                                   | MF based        | With high-order neighborhood | NetMF [1], ProNE [2], NetSMF [3], HOPE [8], GraRep [9]
                                   | Skip-gram       |                              | LINE [5]
                                   | Skip-gram       | With high-order neighborhood | DeepWalk [6], Node2vec [4]
Node Classification (with GNN)      | Semi-supervised |                              | GCN [18], GAT [16], JK-Net [20], ChebyNet [21], GCNII [11]
                                   | Semi-supervised | With random propagation      | DropEdge [17], Graph-Unet [14], GraphSAINT [39], GRAND [10]
                                   | Semi-supervised | With diffusion               | GDC [15], APPNP [13], GRAND [10], PPRGo [40]
                                   | Self-supervised | Contrastive methods          | MVGRL [12], DGI [19]
Graph Classification                | Supervised      | CNN method                   | PATCHY_SAN [41]
                                   | Supervised      | Global pooling               | GIN [42], SortPool [43], DGCNN [44]
                                   | Supervised      | Hierarchical pooling         | DiffPool [45], SAGPool [46]
                                   | Unsupervised    | Kernel methods               | DGK [47], graph2vec [48]
                                   | Unsupervised    | GNN method                   | Infograph [49]

TABLE 5: Datasets for node classification ("m" stands for multi-label classification).

Setting                        | Dataset     | #Nodes  | #Edges     | #Features | #Classes | Train / Val / Test
Semi-supervised (Transductive) | Cora        | 2,708   | 5,429      | 1,433     | 7        | 140 / 500 / 1,000
                               | Citeseer    | 3,327   | 4,732      | 3,703     | 6        | 120 / 500 / 1,000
                               | Pubmed      | 19,717  | 44,338     | 500       | 3        | 60 / 500 / 1,000
Supervised (Inductive)         | PPI         | 56,944  | 818,736    | 50        | 121 (m)  | 0.79 / 0.11 / 0.10
                               | Flickr      | 89,350  | 899,756    | 500       | 7        | 0.50 / 0.25 / 0.25
                               | Reddit      | 232,965 | 11,606,919 | 602       | 41       | 0.66 / 0.10 / 0.24
                               | Yelp        | 716,847 | 6,977,410  | 300       | 100 (m)  | 0.75 / 0.10 / 0.15
Unsupervised                   | PPI         | 3,890   | 76,584     | -         | 50 (m)   | 0.50 / - / 0.50
                               | Wikipedia   | 4,777   | 184,812    | -         | 40 (m)   | 0.50 / - / 0.50
                               | Blogcatalog | 10,312  | 333,983    | -         | 39 (m)   | 0.50 / - / 0.50
                               | DBLP        | 51,264  | 127,968    | -         | 60 (m)   | 0.05 / - / 0.95
                               | Flickr      | 80,513  | 5,899,882  | -         | 195 (m)  | 0.05 / - / 0.95

Datasets. We collect the most popular datasets used in the unsupervised node classification task. Table 5 shows the statistics of these datasets.
• BlogCatalog [50] is a social blogger network, where nodes and edges stand for bloggers and their social relationships, respectively. Bloggers' interests are used as labels.
• Wikipedia (http://www.mattmahoney.net/dc/text.html) is a co-occurrence network of words in the first million bytes of the Wikipedia dump. The labels are the Part-of-Speech (POS) tags inferred by the Stanford POS-Tagger [51].
• PPI [52] is a subgraph of the PPI network for Homo Sapiens. Node labels are extracted from hallmark gene sets and represent biological states.
• DBLP [53] is an academic citation network where authors are treated as nodes and their dominant conferences as labels.
• Flickr [54] is the contact network between users on Flickr. The labels represent the interest groups of the users.

Models. We implement and compare the following methods for the unsupervised node classification task. These methods can be divided into two categories: Skip-gram based models and matrix factorization based models.
• SpectralClustering [7] generates node representations from the d smallest eigenvectors of the normalized graph Laplacian.
• DeepWalk [6] transforms a graph structure into linear sequences by truncated random walks and processes the sequences using Skip-gram with hierarchical softmax.
• LINE [5] defines loss functions to preserve first-order or second-order proximity separately and concatenates the two representations.
• node2vec [4] designs a biased random walk procedure with breadth-first sampling (BFS) and depth-first sampling (DFS) to make a trade-off between homophily similarity and structural equivalence similarity.
• GraRep [9] decomposes the k-step probability transition matrix to train the node embeddings and then concatenates all k-step representations.
• HOPE [8] approximates high-order proximity based on factorizing the Katz matrix.
• NetMF [1] shows that Skip-gram models with negative sampling, like DeepWalk and LINE, can be unified into a matrix factorization framework with closed forms.
• ProNE [2] first transforms graph representation learning into the decomposition of a sparse matrix and further improves performance through spectral propagation.

• NetSMF [3] addresses the efficiency and scalability challenges faced by the NetMF model by sparsifying the (dense) NetMF matrix.

Skip-gram network embedding treats the vertex paths traversed by random walks over the network as sentences and leverages Skip-gram to learn latent vertex representations. Matrix factorization based methods first compute a proximity matrix and perform matrix factorization to obtain the embedding. In fact, NetMF [1] has shown that the aforementioned Skip-gram models with negative sampling can be unified into the matrix factorization framework with closed forms.

Results and Analysis. We build a leaderboard for the unsupervised multi-label node classification setting. We run all algorithms on several real-world datasets and report the sorted experimental Micro-F1 results (%) using logistic regression with L2 normalization, as sketched below.
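For reference, a minimal sketch of this evaluation protocol with scikit-learn, reading "L2 normalization" as L2-normalizing the embeddings before fitting a one-vs-rest logistic regression; the actual CogDL evaluation code may differ in details such as label thresholding.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import normalize

def micro_f1(emb, labels, train_idx, test_idx):
    """emb: (num_nodes, dim) embeddings; labels: binary indicator matrix."""
    emb = normalize(emb, norm="l2")
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(emb[train_idx], labels[train_idx])
    pred = clf.predict(emb[test_idx])
    return f1_score(labels[test_idx], pred, average="micro")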

Table 1 shows the results, and we find some interesting observations.
• Matrix factorization vs. Skip-gram. The leaderboard demonstrates that matrix factorization (MF) methods like NetMF and ProNE are very powerful and full of vitality, as they outperform Skip-gram based methods on almost all datasets. ProNE and NetSMF are also highly efficient and scalable, able to embed super-large graphs in feasible time on a single machine, and there are many ways to further optimize these matrix-related operations. The main advantage of Skip-gram methods is that they have good parallelism and can be applied online, while MF needs to recompute the embedding when the network changes.
• Exploring neighborhoods. Exploring a node's network neighborhood is important in network embedding. DeepWalk and node2vec consider vertex paths traversed by random walks to reach high-order neighbors. NetMF and NetSMF factorize a diffusion matrix Σ_{i=0}^{K} α_i A^i rather than the adjacency matrix A. ProNE and LINE are essentially first-order methods, but ProNE further propagates the embeddings to enlarge the receptive field. Incorporating global information can improve performance but may hurt efficiency. The propagation in ProNE, which is similar to graph convolution, shows that incorporating global information as a post-processing operation is effective. In our experiments, stacking such propagation on existing methods indeed improves their performance on downstream tasks.

4.2 Node Classification with GNNs

This task is node classification with GNNs in semi-supervised and self-supervised settings. Different from the previous part, nodes in these graphs, like Cora and Reddit, have node features and are fed into GNNs, with predictions or representations as output. Cross-entropy loss and contrastive loss are used for the semi-supervised and self-supervised settings, respectively. For evaluation, we use prediction accuracy for multi-class datasets and Micro-F1 for multi-label datasets.

Datasets. The datasets consist of two parts, covering both transductive and inductive settings.
• Transductive datasets include three citation networks: Citeseer, Cora, and Pubmed [29]. These datasets contain sparse bag-of-words feature vectors for each document and a list of citation links between documents. We treat the citation links as (undirected) edges and construct a binary, symmetric adjacency matrix A. Each document has a class label. For training, we only use 20 labels per class, but all feature vectors.
• Inductive datasets include social networks (Reddit, Yelp, and Flickr) and a bioinformatics network (PPI) from GraphSAINT [39]. Reddit contains posts belonging to different communities with user comments. Flickr categorizes types of images based on the descriptions and common properties of online images, and Yelp categorizes types of businesses based on customers, reviewers, and friendship. PPI aims to classify protein functions across various biological protein-protein interaction graphs.

Models. GCNs [18] extend the convolution operation to graph-structured data by applying the layer-wise propagation rule:

    H^(l+1) = σ(Â H^(l) W^(l)),    (2)

where Â = D̃^(-1/2) Ã D̃^(-1/2) is the normalized adjacency matrix, Ã = A + I_n is the adjacency matrix with augmented self-connections, I_n is the identity matrix, D̃ is the diagonal degree matrix with D̃_ii = Σ_j Ã_ij, and W^(l) is a layer-specific learnable weight matrix. The function σ(·) denotes a nonlinear activation. H^(l) ∈ R^{n×d_l} is the matrix of d_l-dimensional hidden node representations in the l-th layer, with H^(0) = X, where X is the initial node feature matrix.
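A dense, self-contained sketch of this propagation rule (Eq. 2) in PyTorch; CogDL's GCNLayer uses the sparse operators described in Section 3 rather than dense matrices.

import torch
import torch.nn as nn

def normalized_adjacency(A):
    """A_hat = D~^(-1/2) (A + I) D~^(-1/2), computed densely for clarity."""
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class DenseGCNLayer(nn.Module):
    """One layer of Eq. (2): H^(l+1) = sigma(A_hat H^(l) W^(l))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        return torch.relu(A_hat @ self.linear(H))

# Stacking two such layers (with a softmax head) gives the usual 2-layer GCN.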
We implement all of the following models:
• Chebyshev [21] presents a formulation of CNNs in the context of spectral graph theory, providing the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs.
• GCN [18] proposes a well-behaved layer-wise propagation rule for neural network models that operate directly on graphs, motivated by a first-order approximation of spectral graph convolutions.
• GAT [16] presents graph attention networks (GATs), convolution-style neural networks that operate on graph-structured data and leverage masked self-attentional layers.
• GraphSAGE [25] introduces an approach that allows embeddings to be efficiently generated for unseen nodes by aggregating feature information from a node's local neighborhood.
• APPNP [13] derives a propagation scheme from personalized PageRank by adding an initial residual connection to balance locality and leverage information from a large neighborhood.
• DGI [19] introduces an approach to maximize the mutual information between local representations and corresponding summaries of graphs, learning node representations in an unsupervised manner.
• GCNII [11] extends GCN to a deep model by using identity mapping and initial residual connections to resolve over-smoothing.
• MVGRL [12] proposes to use graph diffusion for data augmentation and contrasts structural views of graphs for self-supervised learning. MVGRL also maximizes local-global mutual information.

• GRAND [10] proposes to combine random propagation and consistency regularization to optimize the prediction consistency of unlabeled data across different data augmentations.
• DropEdge [17] randomly removes a certain number of edges from the input graph at each training epoch, acting as a data augmenter and also a message-passing reducer to alleviate over-fitting and over-smoothing issues.
• Graph-Unet [14] uses novel graph pooling (gPool) and unpooling (gUnpool) operations, where gPool adaptively selects some nodes to form a smaller graph based on their scalar projection values on a trainable projection vector, and gUnpool restores the graph.
• GDC [15] leverages generalized graph diffusion, such as the heat kernel and personalized PageRank, to alleviate the problem of noisy and often arbitrarily defined edges in real graphs.
• PPRGo [40] utilizes an efficient approximation of information diffusion in GNNs based on personalized PageRank, resulting in significant speed gains.
• GraphSAINT [39] constructs minibatches by sampling the training graph and trains a full GCN on the sampled subgraphs.

TABLE 6: Results of node classification on inductive datasets, including full-batch and sampling-based methods. "-" means that the performance of the model on these datasets is not reported in the paper. Flickr and Reddit use the accuracy metric, whereas PPI uses the Micro-F1 metric. Repro* is short for Reproducible.

Method       | Flickr     | PPI        | Reddit     | Repro*
GCNII        | 52.6 ± 0.1 | 96.5 ± 0.2 | 96.4 ± 0.0 | Partial
GraphSAINT   | 52.0 ± 0.1 | 99.3 ± 0.1 | 96.1 ± 0.0 | Yes
Cluster-SAGE | 50.9 ± 0.1 | 97.9 ± 0.1 | 95.8 ± 0.1 | Yes
GAT          | 51.9 ± 0.3 | 97.3 ± 0.2 | 95.9 ± 0.1 | Yes
GCN          | 52.4 ± 0.1 | 75.7 ± 0.1 | 95.1 ± 0.0 | -
APPNP        | 52.3 ± 0.1 | 62.3 ± 0.1 | 96.2 ± 0.1 | -
PPRGo        | 51.1 ± 0.2 | 48.6 ± 0.1 | 94.5 ± 0.1 | -
SGC          | 50.2 ± 0.1 | 47.2 ± 0.1 | 94.5 ± 0.1 | -

Results and Analysis. We implement all the aforementioned GNN models and build a leaderboard for the node classification task. Table 2 and Table 6 summarize the evaluation results of all compared models on the transductive and inductive datasets, respectively, under the node classification setting. We have the following observations:
• High-order neighbors. Plenty of studies on GNNs have focused on designing a better aggregation paradigm to incorporate neighborhood information at different distances. In citation datasets (Cora, CiteSeer, and PubMed), which are of relatively small scale, incorporating high-order information plays an important role in improving model performance. Most high-order models, such as GRAND, APPNP, and GDC, use a graph diffusion matrix Ā = Σ_{i=0}^{K} α_i Â^i to collect information from distant neighbors. Methods based on diffusion are inspired by spectral graph theory and are not troubled by the over-smoothing problem; diffusion is light-weight and can also be applied to large-scale or unsupervised training. On the other hand, GCNII extends GCN to a deep model and uses residual connections with identity mapping to resolve over-smoothing in GNNs. As shown in Table 2, these methods achieve remarkable results and all outperform GNNs (like GCN and GAT) that only use immediate neighborhood information. This indicates that, in these graphs, incorporating high-order information might be of great importance.
• Dropout vs. DropEdge vs. DropNode. Random propagation, such as Dropout, DropEdge, and DropNode, is critical in semi-supervised graph learning. These random methods help avoid overfitting and improve performance. In our experiments, we found that using only dropout on the same model architecture can achieve results comparable to the original models that use other random propagation techniques. Theoretically, as shown in [10], random propagation in fact enforces the consistency of the classification confidence between each node and its multi-hop neighborhoods. All these random propagation methods achieve higher gains when combined with the consistency loss proposed in [10] to better leverage unlabelled data in semi-supervised settings.
• Self-supervised learning on graphs. Contrastive methods have been applied to graph learning and have achieved remarkable results. In general, both mutual information maximization and InfoNCE have been attempted in graph representation learning. DGI and MVGRL maximize local and global mutual information. MVGRL performs better by replacing the adjacency matrix with a graph diffusion matrix but is less scalable. On the contrary, GRACE, which optimizes InfoNCE, does not perform well on citation datasets with the public split. However, in our experiments, by replacing Dropout and DropEdge in GRACE with DropNode, and applying graph diffusion and mini-batch training, optimizing InfoNCE can also reach nearly 0.82 on PubMed. This indicates that graph diffusion and mini-batch training, instead of full-batch training, might benefit graph self-supervised learning.
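For concreteness, the InfoNCE objective mentioned above can be written as the following generic two-view loss; this is a standard formulation, not the exact GRACE or CogDL implementation.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """z1, z2: (N, d) node embeddings from two augmented views.
    The positive pair for node i is (z1[i], z2[i]); all other rows of z2
    act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                  # (N, N) scaled cosine similarities
    labels = torch.arange(z1.size(0))        # positives lie on the diagonal
    return F.cross_entropy(sim, labels)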
4.3 Graph Classification

Graph classification assigns a label to each graph and aims to map graphs into vector spaces. Graph kernels are historically dominant; they employ a kernel function to measure the similarity between pairs of graphs and map graphs into vector spaces with deterministic mapping functions, but they suffer from computational bottlenecks. Recently, graph neural networks have attracted much attention and indeed show promising results on this task. In the context of graph classification, GNNs often employ a readout operation to obtain a compact graph-level representation:

    h_G = READOUT({h_v^(K) | v ∈ V}).    (3)

GNNs directly apply classification based on the readout representation and are thus more efficient.

Datasets. We collect 8 popular benchmarks often used in graph classification tasks. Table 7 shows the statistics.
• Bioinformatics datasets. The PROTEINS dataset contains graphs of protein structures, where edges represent the interaction between sub-structures of proteins. Each graph in MUTAG, PTC, and NCI1 is a chemical compound, with nodes and edges representing atoms and chemical bonds, respectively. These datasets usually come with node features.

TABLE 7: Dataset statistics for graph classification.

Type            | Dataset  | #Graphs | #Classes | #Features | Avg. #Nodes | Avg. #Edges
Bioinformatics  | MUTAG    | 188     | 2        | 7         | 17.9        | 19.8
                | PTC      | 344     | 2        | 18        | 14.3        | 14.7
                | PROTEINS | 1,113   | 2        | 3         | 39.1        | 72.8
                | NCI1     | 4,110   | 2        | 37        | 29.8        | 32.3
Social Networks | IMDB-B   | 1,000   | 2        | -         | 19.8        | 96.5
                | IMDB-M   | 1,500   | 3        | -         | 13.0        | 65.9
                | REDDIT-B | 2,000   | 2        | -         | 429.6       | 497.8
                | COLLAB   | 5,000   | 3        | -         | 74.5        | 2,457.8

• Social networks. IMDB-BINARY and IMDB-MULTI are movie collaboration datasets. Nodes correspond to actors/actresses, and an edge means they appear in the same movie. Graphs in REDDIT-BINARY represent online discussions on Reddit, where nodes correspond to users and an edge is drawn between two nodes if one responded to another's comment. Nodes in these datasets directly use the node degree as features.

Models. We implement the following methods for graph classification:
• GIN [42] presents the graph isomorphism network, which adjusts the weight of the central node by learning and aims to make GNNs as powerful as the Weisfeiler-Lehman graph isomorphism test.
• DiffPool [45] proposes a differentiable pooling that generates hierarchical representations of graphs. It learns a cluster assignment matrix and can be implemented on top of any GNN.
• SAGPool [46] proposes a hierarchical graph pooling method based on self-attention that considers both node features and graph topology.
• SortPool [43] rearranges nodes by sorting them according to their structural roles within the graph and then performs pooling on these nodes. Node features derived from graph convolutions are used as continuous WL colors for sorting nodes.
• PATCHY_SAN [41] orders the neighbors of each node according to their graph labelings and selects the top q neighbors. The graph labelings are derived from degree, centrality, and other node scores.
• DGCNN [44] builds a subgraph for each node with KNN based on node features and then applies graph convolution to the reconstructed graph.
• Infograph [49] applies contrastive learning to graph learning by maximizing the mutual information between the graph-level representation and node-level representations in an unsupervised manner.
• graph2vec [48] follows Skip-gram's training process and considers the set of all rooted subgraphs around each node as its vocabulary.
• Deep Graph Kernels (DGK) [47] learns latent representations for subgraph structures based on graph kernels with the Skip-gram method.

As for evaluation, for supervised methods we adopt 10-fold cross-validation with a 90%/10% split and repeat it 10 times; for unsupervised methods, we perform 10-fold cross-validation with LIB-SVM. We then report the classification accuracy. Table 8 reports the results of the aforementioned models on this task, covering both unsupervised and supervised graph classification. We run all algorithms on the 8 datasets and report the sorted experimental results.

Results and Analysis. The development of GNNs for graph classification is mainly along two lines. One line (like GIN) aims to design more powerful convolution operations to improve expressiveness. The other line develops effective pooling methods to generate the graph representation.
• Neural network vs. kernel methods. Neural network based methods show promising results on the bioinformatics datasets (MUTAG, PTC, PROTEINS, and NCI1), where nodes come with given features. But on the social networks (IMDB-B, IMDB-M, COLLAB, REDDIT-B), which lack node features, methods based on graph kernels achieve very good performance and even surpass neural networks. Graph kernels are more capable of capturing structural information to discriminate non-isomorphic graphs, while GNNs are better encoders when features are available. Most GNNs directly perform classification based on the extracted graph representations and are therefore much more efficient than graph kernel methods.
• Comparison between pooling methods. Graph pooling aims to scale down the size of representations and generate the graph representation from node features. Global pooling, which is used in GIN and SortPool, collects node features and applies a readout function. Hierarchical pooling, such as DiffPool and SAGPool, is proposed to capture structural information at different graph levels, including nodes and subgraphs. The experimental results indicate that, although hierarchical pooling seems more complex and would intuitively capture more information, it does not show significant advantages over global pooling.

4.4 Other Graph Tasks

CogDL also provides other important tasks in the graph domain, including heterogeneous node classification, link prediction, multiplex link prediction, and knowledge graph completion. The latest results of all these tasks are updated in real time at https://github.com/THUDM/cogdl/blob/master/cogdl/tasks/README.md.

Heterogeneous Node Classification. This task is built for heterogeneous GNNs on heterogeneous graphs. For heterogeneous node classification, we use Macro-F1 to evaluate the performance of all heterogeneous models under the setting of Graph Transformer Networks (GTN) [55]. The heterogeneous datasets include DBLP, ACM, and IMDB. We implement the following models: GTN [55], HAN [56], PTE [23], metapath2vec [57], and hin2vec [58].

TABLE 8: Accuracy results (%) of both unsupervised and supervised graph classification. ↓ and ↑ mean our results are lower or higher than the results in the original papers. ‡ means the experiment did not finish within 24 hours for one seed.

Algorithm       | MUTAG   | PTC     | NCI1    | PROTEINS | IMDB-B  | IMDB-M  | COLLAB  | REDDIT-B | Reproducible
GIN [42]        | 92.06   | 67.82   | 81.66   | 75.19    | 76.10   | 51.80   | 79.52   | 83.10 ↓  | Yes
InfoGraph [49]  | 88.95   | 60.74   | 76.64   | 73.93    | 74.50   | 51.33   | 79.40   | 76.55    | Yes
graph2vec [48]  | 83.68   | 54.76 ↓ | 71.85   | 73.30    | 73.90   | 52.27   | 85.58 ↑ | 91.77    | Yes
SortPool [43]   | 87.25   | 62.04   | 73.99 ↑ | 74.48    | 75.40   | 50.47   | 80.07 ↑ | 78.15    | Yes
DiffPool [45]   | 85.18   | 58.00   | 69.09   | 75.30    | 72.50   | 50.50   | 79.27   | 81.20    | Yes
PATCHY_SAN [41] | 86.12   | 61.60   | 69.82   | 75.38    | 76.00 ↑ | 46.40   | 74.34   | 60.61    | Yes
DGCNN [44]      | 83.33   | 56.72   | 65.96   | 66.75    | 71.60   | 49.20   | 77.45   | 86.20    | Yes
SAGPool [46]    | 71.73 ↓ | 59.92   | 72.87   | 74.03    | 74.80   | 51.33   | ‡       | 89.21    | Yes
DGK [47]        | 85.58   | 57.28   | ‡       | 72.59    | 55.00 ↓ | 40.40 ↓ | ‡       | ‡        | Partial

Link Prediction. Liben-Nowell and Kleinberg [59] studied the underlying social-network evolution and proposed a basic computational problem named link prediction: given a snapshot of a social network at time t, it seeks to accurately predict the edges that will be added to the network during the interval from time t to a given future time. In CogDL, we remove 15 percent of the edges of each dataset and adopt ROC AUC [60] as the evaluation metric, which is the area under the receiver operating characteristic curve.

Multiplex Link Prediction. GATNE [61] formalizes the problem of embedding learning for attributed multiplex heterogeneous networks and proposes a unified framework to solve it in both transductive and inductive settings. We follow the setting of GATNE [61] to build the multiplex heterogeneous link prediction task. In CogDL, for those methods that only handle homogeneous networks, we feed them separate graphs with different edge types and obtain different node representations for each separate graph. We also adopt ROC AUC as the evaluation metric.

Knowledge Graph Completion. This task aims to predict missing links in a knowledge graph through knowledge graph embedding. In CogDL, we implement two families of knowledge graph embedding algorithms: triplet-based knowledge graph embedding and knowledge-based GNNs. The former includes TransE [62], DistMult [63], ComplEx [64], and RotatE [65]; the latter includes RGCN [66] and CompGCN [67]. We evaluate all implemented algorithms on standard benchmarks, including FB15k-237, WN18, and WN18RR, with the Mean Reciprocal Rank (MRR) metric.

5 APPLICATIONS

In this section, we demonstrate the easy-to-use pipeline API and the effectiveness of our toolkit for real-world applications, including paper tagging and recommendation, in AMiner (https://www.aminer.cn/), a large online academic search and mining system.

Paper Tagging. Each publication in AMiner has several tags, extracted by the AMiner team from the raw texts (e.g., title and abstract) of each publication. However, publications with citation links often have similar tags, and we can utilize the citation network to improve the quality of the tags. Formally, the publication tagging problem can be considered as a multi-label node classification task, where each label represents a tag. Thus, we can utilize powerful graph representation learning methods in CogDL to handle this problem. There are 4,833,171 papers in the field of computer science in the AMiner database. We conduct experiments on these papers to show how graph representation learning can help the tagging problem. We can build an embedding generator via generator = pipeline("generate-emb", model="prone") and obtain the network embedding with ProNE by feeding the citation network to the generator. We choose logistic regression as the multi-label classifier. The node embeddings and the raw text features can be combined to predict the tags of a given paper. The result shows that the fused features increase recall by 12.8%, which indicates that structural information plays a vital role in tagging papers.

Recommendation. The AMiner Subscribe page gives paper recommendations to AMiner users based on their historical behaviors. CogDL supports the AMiner Subscribe application via GNN models, since GNNs have shown great success in recommender systems [68]. We implement several state-of-the-art GNN models for recommendation, such as LightGCN [69]. We can build a recommendation server via recsys = pipeline("recommendation", model="lightgcn", data=data): we feed a user-item interaction graph into this API and obtain a trained GNN model for serving. We can then query the recsys with users, and the system outputs the recommended items. A sketch of both pipelines is shown below.
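A rough sketch of the two pipeline calls described above; the import path, the data placeholders, and the way the generator and recsys objects are invoked are assumptions for illustration, not the documented interface.

from cogdl import pipeline  # import path assumed

# Paper tagging: embed a (placeholder) citation network with ProNE and
# combine the embeddings with text features in a multi-label classifier.
citation_edges = [(0, 1), (1, 2), (2, 0)]          # placeholder edge list
generator = pipeline("generate-emb", model="prone")
embeddings = generator(citation_edges)             # calling convention assumed

# Recommendation: train LightGCN on (placeholder) user-item interactions
# and query recommended items for a list of users.
interactions = [(0, 10), (0, 11), (1, 10)]         # placeholder (user, item) pairs
recsys = pipeline("recommendation", model="lightgcn", data=interactions)
recommended = recsys([0, 1])                       # query interface as described above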

In addition, CogDL has gained users from academia, such as Carnegie Mellon University and Shanghai Jiao Tong University, for study and research in graph computing, graph analysis, and related areas. Our package has also been forked by developers in industry, such as Microsoft and Tencent, to support their work.

6 CONCLUSIONS

In this paper, we introduce CogDL, an extensive toolkit for graph representation learning that provides easy-to-use APIs and efficient sparse kernel operators for researchers and developers to conduct experiments and build real-world applications. It provides standard training, evaluation, and reproducible leaderboards for the most important tasks in the graph domain, including node classification, link prediction, graph classification, and other graph tasks.

ACKNOWLEDGMENTS

The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013).

REFERENCES

[1] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, "Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec," in WSDM'18, 2018, pp. 459–467.
[2] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding, "Prone: Fast and scalable network representation learning," in IJCAI'19, 2019, pp. 4278–4284.
[3] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang, "Netsmf: Large-scale network embedding as sparse matrix factorization," in WWW'19, 2019, pp. 1509–1520.
[4] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in KDD'16. ACM, 2016, pp. 855–864.
[5] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," in WWW'15. ACM, 2015, pp. 1067–1077.
[6] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in KDD'14. ACM, 2014, pp. 701–710.
[7] L. Tang and H. Liu, "Leveraging social media networks for classification," Data Mining and Knowledge Discovery, vol. 23, no. 3, pp. 447–478, 2011.
[8] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in KDD'16, 2016, pp. 1105–1114.
[9] S. Cao, W. Lu, and Q. Xu, "Grarep: Learning graph representations with global structural information," in CIKM'15. ACM, 2015, pp. 891–900.
[10] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang, "Graph random neural networks for semi-supervised learning on graphs," in NeurIPS'20, 2020.
[11] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, "Simple and deep graph convolutional networks," in ICML'20, 2020.
[12] K. Hassani and A. H. Khasahmadi, "Contrastive multi-view representation learning on graphs," in ICML'20, 2020.
[13] J. Klicpera, A. Bojchevski, and S. Günnemann, "Predict then propagate: Graph neural networks meet personalized pagerank," in ICLR'19, 2019.
[14] H. Gao and S. Ji, "Graph u-nets," in ICML'19, 2019, pp. 2083–2092.
[15] J. Klicpera, S. Weißenberger, and S. Günnemann, "Diffusion improves graph learning," in NeurIPS'19, 2019.
[16] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in ICLR'18, 2018.
[17] Y. Rong, W. Huang, T. Xu, and J. Huang, "Dropedge: Towards deep graph convolutional networks on node classification," in ICLR'20, 2020.
[18] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR'17, 2017.
[19] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, "Deep graph infomax," in ICLR'19, 2019.
[20] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, "Representation learning on graphs with jumping knowledge networks," in ICML'18, 2018, pp. 5453–5462.
[21] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in NeurIPS'16, 2016.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[23] J. Tang, M. Qu, and Q. Mei, "Pte: Predictive text embedding through large-scale heterogeneous text networks," in KDD'15, 2015, pp. 1165–1174.
[24] G. H. Golub and C. Reinsch, "Singular value decomposition and least squares solutions," in Linear Algebra. Springer, 1971, pp. 134–151.
[25] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in NeurIPS'17, 2017, pp. 1025–1035.
[26] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," arXiv preprint arXiv:1903.02428, 2019.
[27] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma et al., "Deep graph library: Towards efficient and scalable deep learning on graphs," arXiv preprint arXiv:1909.01315, 2019.
[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," in NeurIPS'19, 2019, pp. 8024–8035.
[29] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
[30] Z. Yang, W. Cohen, and R. Salakhudinov, "Revisiting semi-supervised learning with graph embeddings," in ICML'16. PMLR, 2016, pp. 40–48.
[31] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, "Pitfalls of graph neural network evaluation," arXiv preprint arXiv:1811.05868, 2018.
[32] F. Errica, M. Podda, D. Bacciu, and A. Micheli, "A fair comparison of graph neural networks for graph classification," in ICLR'20, 2020.
[33] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson, "Benchmarking graph neural networks," arXiv preprint arXiv:2003.00982, 2020.
[34] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, "Open graph benchmark: Datasets for machine learning on graphs," in NeurIPS'20, 2020.
[35] C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, "Adaptive sparse tiling for sparse matrix multiplication," in Symposium on Principles and Practice of Parallel Programming (PPoPP), 2019, pp. 300–314.
[36] C. Hong, A. Sukumaran-Rajam, B. Bandyopadhyay, J. Kim, S. E. Kurt, I. Nisa, S. Sabhlok, Ü. V. Çatalyürek, S. Parthasarathy, and P. Sadayappan, "Efficient sparse-matrix multi-vector product on gpus," in HPDC'18, 2018, pp. 66–79.
[37] G. Huang, G. Dai, Y. Wang, and H. Yang, "Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks," in SC'20. IEEE Press, 2020.
[38] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A next-generation hyperparameter optimization framework," in KDD'19, 2019, pp. 2623–2631.
[39] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. Prasanna, "GraphSAINT: Graph sampling based inductive learning method," in ICLR'20, 2020.
[40] A. Bojchevski, J. Klicpera, B. Perozzi, A. Kapoor, M. Blais, B. Rozemberczki, M. Lukasik, and S. Günnemann, "Scaling graph neural networks with approximate pagerank," in KDD'20, 2020.
[41] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in ICML'16, 2016.
[42] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in ICLR'19, 2019.
[43] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, "An end-to-end deep learning architecture for graph classification," in AAAI'18, 2018.
[44] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph cnn for learning on point clouds," ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019.
[45] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec, "Hierarchical graph representation learning with differentiable pooling," in NeurIPS'18, 2018.
[46] J. Lee, I. Lee, and J. Kang, "Self-attention graph pooling," in ICML'19, 2019.
[47] P. Yanardag and S. Vishwanathan, "Deep graph kernels," in KDD'15, 2015.
[48] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, "graph2vec: Learning distributed representations of graphs," arXiv preprint arXiv:1707.05005, 2017.
[49] F.-Y. Sun, J. Hoffmann, V. Verma, and J. Tang, "Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization," in ICLR'20, 2020.
[50] R. Zafarani and H. Liu, "Social computing data repository at ASU," 2009.
[51] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, "Feature-rich part-of-speech tagging with a cyclic dependency network," in NAACL'03, 2003, pp. 252–259.
[52] B.-J. Breitkreutz, C. Stark et al., "The biogrid interaction database," Nucleic Acids Research, vol. 36, no. suppl 1, pp. D637–D640, 2008.
[53] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: Extraction and mining of academic social networks," in KDD'08. ACM, 2008, pp. 990–998.
[54] L. Tang and H. Liu, "Relational learning via latent social dimensions," in KDD'09, 2009.
[55] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, "Graph transformer networks," in NeurIPS'19, 2019.
[56] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu, "Heterogeneous graph attention network," in WWW'19, 2019, pp. 2022–2032.
[57] Y. Dong, N. V. Chawla, and A. Swami, "metapath2vec: Scalable representation learning for heterogeneous networks," in KDD'17, 2017.
[58] T.-y. Fu, W.-C. Lee, and Z. Lei, "Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning," in CIKM'17, 2017.
[59] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[60] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (roc) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[61] Y. Cen, X. Zou, J. Zhang, H. Yang, J. Zhou, and J. Tang, "Representation learning for attributed multiplex heterogeneous network," in KDD'19, 2019, pp. 1358–1368.
[62] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in NeurIPS'13, 2013.
[63] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," in ICLR'15, 2015.
[64] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in ICML'16, 2016.
[65] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, "Rotate: Knowledge graph embedding by relational rotation in complex space," in ICLR'19, 2019.
[66] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," Lecture Notes in Computer Science, pp. 593–607, 2018.
[67] S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar, "Composition-based multi-relational graph convolutional networks," in ICLR'20, 2020.
[68] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," in KDD'18, 2018, pp. 974–983.
[69] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, "Lightgcn: Simplifying and powering graph convolution network for recommendation," in SIGIR'20, 2020, pp. 639–648.