Graph Convolutional Networks
Yunsheng Bai

Overview

1. Improve GCN itself
   a. Filters
      i. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)
      ii. Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)
      iii. Dynamic Filters in Graph Convolutional Networks (2017)
   b. Pooling (Unpooling)
      i. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)
2. Apply to new/larger datasets/graphs
   a. Use GCN as an auxiliary module
      i. Structured Sequence Modeling with Graph Convolutional Recurrent Networks (2017)
   b. Use GCN only
      i. Node/link classification/prediction: http://tkipf.github.io/misc/GCNSlides.pdf
         1. Directed graphs
            a. Modeling Relational Data with Graph Convolutional Networks (2017)
      ii. Graph classification, e.g. MNIST (with or without pooling)

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
   b. Improvement 2: Deconvolution

Define Convolution for Graph

Laplace

http://www.norbertwiener.umd.edu/Research/lectures/2014/MBegue_Prelim.pdf

Graph Laplacian

[Figure: a labeled graph with its degree matrix and adjacency matrix]

https://en.wikipedia.org/wiki/Laplacian_matrix

Graph

L: (Normalized) Graph Laplacian

D: Degree Matrix

W: Adjacency Matrix

U: Eigenvectors of L (orthonormal because L is symmetric PSD)

Λ: Eigenvalues of L

x̂: Fourier Transform of x
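Putting the notation together, these are the standard graph-signal-processing definitions the rest of the deck relies on (using the symmetric normalized Laplacian):

$$ L = I_N - D^{-1/2} W D^{-1/2} = U \Lambda U^T, \qquad \hat{x} = U^T x, \qquad x = U \hat{x} $$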

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)

1-D Convolution

Ex. x = [0 1 2 3 4 5 6 7 8]

f = [4 -1 0]

y = [0*4 1*4+0*(-1) 2*4+1*(-1)+0*0 3*4+2*(-1)+1*0 ...]

= [0 4 7 10 ...]
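As a quick sanity check (not from the original slides), the same numbers fall out of NumPy's convolution:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
f = np.array([4, -1, 0])
y = np.convolve(x, f)   # full linear convolution
print(y[:4])            # [ 0  4  7 10] -- matches the worked example above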

I made this based on my EECS 351 lecture notes.

Convolution <--> Multiplication in Fourier Domain

View X and F as vectors
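A minimal NumPy sketch of the convolution theorem (my own addition; note the zero-padding, so that the circular convolution implied by the DFT matches the linear one):

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
f = np.array([4, -1, 0])
n = len(x) + len(f) - 1                    # pad to the full output length
y = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(f, n)).real
print(np.allclose(y, np.convolve(x, f)))   # True: multiplication in the Fourier domain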

I made this based on my EECS 351 lecture notes.

Spectral Filtering

Analogous to the convolution on the previous slide.

Filter a signal:

"As we cannot express a meaningful translation operator in the vertex domain, the convolution operator on graph G is defined in the Fourier domain."

$$ y = U \,\mathrm{diag}\big(\hat g(\lambda_1), \hat g(\lambda_2), \hat g(\lambda_3)\big)\, U^T x $$

(left to right: inverse Fourier transform U, non-parametric filter, Fourier transform Uᵀx)

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)

Spectral Filtering

$$ U\,\mathrm{diag}\big(\hat g(\lambda_1), \hat g(\lambda_2), \hat g(\lambda_3)\big)\,U^T x \;=\; \hat g(\lambda_1)\,\hat x(1)\,e_1 \;+\; \hat g(\lambda_2)\,\hat x(2)\,e_2 \;+\; \hat g(\lambda_3)\,\hat x(3)\,e_3 $$

where e₁, e₂, e₃ — the columns of U — form the Fourier basis.

The result of the convolution is the original signal: (1) first Fourier transformed, (2) then multiplied by a filter, (3) finally inverse Fourier transformed.
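To make the pipeline concrete, here is a small NumPy sketch on a 3-node path graph (my own example, using the unnormalized Laplacian for simplicity):

import numpy as np

# Path graph 1 - 2 - 3
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W      # graph Laplacian
lam, U = np.linalg.eigh(L)          # eigenvalues and orthonormal eigenvectors

x = np.array([1., 2., 3.])          # a signal on the nodes
g_hat = np.array([1.0, 0.5, 0.0])   # non-parametric filter: one value per eigenvalue

x_hat = U.T @ x                     # (1) Fourier transform
y = U @ (g_hat * x_hat)             # (2) filter, (3) inverse Fourier transform
print(y)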

Spectral Filtering

Convolution:

$$ x *_G g = U\big((U^T x) \odot (U^T g)\big) $$

Filter a signal:

$$ y = U\,\mathrm{diag}\big(\hat g(\lambda_1), \hat g(\lambda_2), \hat g(\lambda_3)\big)\,U^T x $$

(inverse Fourier transform · non-parametric filter · Fourier transform of x)

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)

Better Filters

Convolution:

$$ x *_G g_\theta = U\, g_\theta(\Lambda)\, U^T x $$

Filter a signal:

$$ y = U\, g_\theta(\Lambda)\, U^T x, \qquad g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k $$

(inverse Fourier transform · localized & polynomial filter · Fourier transform of x)

Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016)

Better Filters: Localized

[Figure: a labeled graph with its degree matrix, adjacency matrix, and Laplacian matrix]

L²: powers of the Laplacian are localized — (L^K)_{ij} = 0 whenever nodes i and j are more than K hops apart, so a degree-K polynomial filter is exactly K-localized.

Wavelets on Graphs via Spectral Graph Theory (2011)

Better Filters: Localized

Filter:

$$ g_\theta(L) = \sum_{k=0}^{K-1} \theta_k L^k $$

Filter a signal (K = 3):

$$ y = \theta_0\, x \;+\; \theta_1\,(L\, x) \;+\; \theta_2\,(L^2\, x) $$

(L x mixes in 1-step neighbors; L² x mixes in 2-step neighbors)

Fixed θ_k for every neighbor :( (Dynamic Filters in Graph Convolutional Networks (2017))
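A small NumPy sketch (my own example) of this polynomial filtering, which also shows the K-hop localization — and needs no eigendecomposition:

import numpy as np

# Path graph 1 - 2 - 3 - 4
W = np.diag([1., 1., 1.], 1)
W = W + W.T                                   # adjacency matrix
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian

print((L @ L)[0])    # node 1's row of L^2: nonzero only up to 2-step neighbors

x = np.array([0., 1., 2., 3.])                # a signal on the nodes
theta = [0.5, -1.0, 0.25]                     # K = 3 polynomial coefficients
y = theta[0]*x + theta[1]*(L @ x) + theta[2]*(L @ (L @ x))
print(y)                                      # filtered signal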

Better Filters, but O(n²)

Convolution:

Filter a signal:

Filter: [figure: the filter s(1), s(2), s(3) expressed over the eigenvectors e1, e2, e3]

Computing eigenvectors: O(n³) :(

I am actually confused: they used Chebyshev polynomials to approximate the filter, but at the end of the day the filtered signal is the same as on the previous slide, which is O(n²) :( In fact, the authors of Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017) set K = 1, so no Chebyshev at all.

Approximations

If K = 1, filtering becomes:

$$ g_\theta \star x \;\approx\; \theta'_0\, x + \theta'_1\,(L - I_N)\, x \;=\; \theta'_0\, x - \theta'_1\, D^{-1/2} A D^{-1/2}\, x $$

If we set θ = θ'₀ = −θ'₁ (further approximation), then filtering becomes:

$$ g_\theta \star x \;\approx\; \theta\,\big(I_N + D^{-1/2} A D^{-1/2}\big)\, x $$

(with the renormalization trick I_N + D^{-1/2} A D^{-1/2} → D̃^{-1/2} Ã D̃^{-1/2}, where Ã = A + I_N and D̃ᵢᵢ = Σⱼ Ãᵢⱼ)

If the input is a matrix X ∈ ℝ^{N×C}, then filtering becomes:

$$ Z = \tilde D^{-1/2} \tilde A \tilde D^{-1/2} X \Theta $$

Filter parameters: Θ ∈ ℝ^{C×F}

Convolved signal matrix: Z ∈ ℝ^{N×F}

Filtering complexity: O(|E|·F·C)

Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

Illustration of D^{-1/2} · A · D^{-1/2} · X · Θ

[Figure: Z = L̂ · X · Θ, where L̂ = D̃^{-1/2} Ã D̃^{-1/2}; each row of X is a node's feature vector (e.g. feature 2 of node 1 is part of the word embedding of node 1, and likewise for node 6), and each column of Θ is one filter producing one output feature of Z.]
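A minimal NumPy sketch of this propagation rule (my own example; the graph and features are random placeholders):

import numpy as np

rng = np.random.default_rng(0)
N, C, F = 6, 3, 2                       # nodes, input channels, filters

A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T          # symmetric adjacency, no self-loops

A_tilde = A + np.eye(N)                 # renormalization trick: add self-loops
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))   # D̃^{-1/2} Ã D̃^{-1/2}

X = rng.normal(size=(N, C))             # node feature matrix
Theta = rng.normal(size=(C, F))         # filter parameters
Z = A_hat @ X @ Theta                   # convolved signal matrix
print(Z.shape)                          # (6, 2)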

Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
   b. Improvement 2: Deconvolution

Architecture of Graph Convolutional Networks

Schematic Depiction

Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
   b. Improvement 2: Deconvolution

Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
   b. Improvement 2: Deconvolution

Improvement 1: Dynamic Filters -> Generalizable

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
      i. Basics
      ii. Ideas
      iii. Ordering
      iv. Example
   b. Improvement 2: Deconvolution

Baseline Filter: Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017)

[Figure: the 6-node example graph and its 6×6 filter matrix; nonzero entries sit exactly at the adjacency positions, and within a channel they all share the same parameter.]

Poor Filter: Parameters All Different (No Sharing)

[Figure: the same 6×6 filter matrix, but every nonzero entry θ₁₁ᵢⱼ is a distinct parameter for each edge (i, j) — no sharing at all. Annotation: "What if they share?"]

Proposed Filter: Parameters Shared across Nodes with Same # of Neighbors

[Figure: the same matrix where all nodes with the same number of neighbors share one row of parameters — e.g. every degree-2 node shares θ'₁₁₂, and every degree-4 node shares θ'₁₁₄.]

Proposed Filter: Total Size O(N²·F·C)

[Figure: the full parameter matrix viewed without adjacency info (N = 6, F = 1, C = 2); row i holds i distinct parameters, so the total size is O(N²·F·C).]

Proposed Filter: Total Size O(n_max²·F·C) ≤ O(N²·F·C)

[Figure: keeping only the rows for node degrees that actually occur (here degrees 2 and 4; N = 6, F = 1, C = 2, n_max = 4), the total size drops to O(n_max²·F·C).]

Proposed Filter: Generalizable to Regular CNN

[Figure: on a regular 2-D image (a zero-padded grid), every pixel has the same number of neighbors, so the per-degree shared weights collapse to a single filter moving over the image with stride 1 — a regular CNN.]

Proposed Filter: More Sharing of Weights

[Figure: the same per-degree filter matrix (view without adjacency info); idea — weights from previous (smaller) rows are related to later (larger) rows.]

Proposed Filter: More Sharing of Weights

[Figure: same matrix; open question — should smaller rows copy weights from the larger ones, or relate to them some other way? If copy, randomly copy?]

Proposed Filter: Soft Assignment (Dynamic Filters in Graph Convolutional Networks (2017))

[Figure: each weight is a weighted sum of later ones — essentially a soft assignment.]

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
      i. Basics
      ii. Ideas
      iii. Ordering
      iv. Example
   b. Improvement 2: Deconvolution

Proposed Filter: Summary and More Ideas

- Make them look equal, so treat them the same:
  - Add 2-step neighbors as less important 1-step neighbors.
  - Duplicate 1-step neighbors as less important dummy neighbors.
  - Convert all the 1-step neighbors into one neighbor.
  - ...
- Nodes are not created equal — they have different # of neighbors. Respect diversity; treat them differently:
  - Share weights if same # of neighbors; nodes with a small # of neighbors have weights independent of the large.
  - Share weights if same # of neighbors; nodes with a small # of neighbors randomly copy weights from the large.
  - All nodes kind of share the same weights but actually have different weights: nodes with a small # of neighbors have weights softly assigned from the large.
  - ...
- Respect diversity, yet treat the same:
  - Sequences from random walks and neighbors are fed to an LSTM.
  - Graph LSTM: a variant of LSTM on graphs.
  - ...

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
      i. Basics
      ii. Ideas
      iii. Ordering
      iv. Example
   b. Improvement 2: Deconvolution

Proposed Filter: Ordering of Neighbors

[Figure: node 1 with unordered neighbors marked "?"]

1. No ordering.

No generalizability :(

But it works well in Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (NIPS 2016) and Semi-Supervised Classification with Graph Convolutional Networks (ICLR 2017).

Proposed Filter: Ordering of Neighbors

[Figure: node 1 with unordered neighbors marked "?"]

2. Soft assign every neighbor to all weights. Assignments are learnable.

Implicit ordering.

O(N*M) additional parameters to learn.
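A toy NumPy sketch of the soft-assignment idea (my own illustration, not the paper's exact formulation): each neighbor carries a learnable assignment over the M weight slots, and its effective weight is the resulting mixture.

import numpy as np

rng = np.random.default_rng(0)
M = 4                                # weight slots in the shared filter
w = rng.normal(size=M)               # shared weights, one per slot

q = rng.normal(size=(5, M))          # learnable logits: node 1 has 5 neighbors
a = np.exp(q) / np.exp(q).sum(axis=1, keepdims=True)   # softmax -> soft assignments

eff_w = a @ w                        # each neighbor's effective weight (a mixture)
print(eff_w)                         # an implicit, learnable ordering of neighbors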

Dynamic Filters in Graph Convolutional Networks (2017)

Proposed Filter: Ordering of Neighbors

[Figure: node 1 with unordered neighbors marked "?"]

3. Hard assign every neighbor to an ordering. Assignments are learnable.

[Figure: each neighbor of node 1 (nodes 2, 3, 4, 5, 6) is assigned a one-hot vector, e.g. [0 0 1 0 0 0 0], [0 0 0 0 0 1 0], [0 0 0 1 0 0 0], [0 0 0 0 1 0 0], [0 0 0 0 0 0 1].]

Explicit ordering ([0 0 0 … 0 1 0 … 0 0 0]).

O(N*M)/O(N) additional parameters to learn. (5 neighbors already admit 5! = 120 possible orderings.)

Proposed Filter: Ordering of Neighbors

[Figure: node 1 with unordered neighbors marked "?"]

4. Hard assign every neighbor to an ordering. Assignments are fixed, e.g. by rank.

[Figure: the 6-node example graph with node 1's neighbors in a fixed order.]

Explicit ordering.

No additional parameters to learn.

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
      i. Basics
      ii. Ideas
      iii. Ordering
      iv. Example
   b. Improvement 2: Deconvolution

Example: Share Weights If Same # of Neighbors

import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

# Filters: shape (F=2, C=2, N=6, N=6). For clarity, only node 1's row is
# nonzero, with one shared weight per (filter, channel) at its neighbors
# (nodes 2 and 6).
matrix1 = tf.constant([
    [[[0, 1, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
     [[0, 2, 0, 0, 0, 2], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]],
    # Another filter/feature
    [[[0, 3, 0, 0, 0, 3], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
     [[0, 4, 0, 0, 0, 4], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]]])

Shape: (2, 2, 6, 6)

[Figure: Filter 1 stores weights 1 (channel 1) and 2 (channel 2) at node 1's adjacency positions; Filter 2 stores weights 3 and 4.]

# Input data: node n carries value n in channel 1 and n + 6 in channel 2.
matrix2 = tf.constant([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]])
matrix2 = tf.reshape(matrix2, (2, 1, 6))

Shape: (2, 1, 6)

Convolution: Step 1

product = tf.multiply(matrix1, matrix2)   # element-wise, broadcasting the data over the filters
with tf.Session() as sess:
    result = sess.run(product)

[Figure: the element-wise product, shape (2, 2, 6, 6); in node 1's row, filter 1 gives [0 2 0 0 0 6] (channel 1) and [0 16 0 0 0 24] (channel 2), filter 2 gives [0 6 0 0 0 18] and [0 32 0 0 0 48].]

Convolution: Step 2

reduced = tf.transpose(tf.reduce_sum(product, [1, 3]))   # sum over channel and column axes
with tf.Session() as sess:

    result = sess.run(reduced)
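For readers without TensorFlow 1.x at hand, a NumPy equivalent (my own translation) reproduces the same numbers as the result shown next:

import numpy as np

W = np.zeros((2, 2, 6, 6))             # (filter, channel, node, node)
W[0, 0, 0, [1, 5]] = 1                 # filter 1, channel 1: node 1's neighbors
W[0, 1, 0, [1, 5]] = 2                 # filter 1, channel 2
W[1, 0, 0, [1, 5]] = 3                 # filter 2, channel 1
W[1, 1, 0, [1, 5]] = 4                 # filter 2, channel 2

X = np.arange(1, 13).reshape(2, 1, 6)  # (channel, 1, node)

Z = (W * X).sum(axis=(1, 3)).T         # multiply, sum over channel and column, transpose
print(Z[0])                            # node 1's output features: [ 48. 104.]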

[Figure: the result has one row per node and one column per output feature (shape (6, 2)); node 1's outputs are feature 1 = 48 and feature 2 = 104, and all other rows are 0.]

Roadmap

1. Define Convolution for Graph
2. Architecture of Graph Convolutional Networks
3. Improvements: Generalizable Graph Convolutional Networks with Deconvolutional Layers
   a. Improvement 1: Dynamic Filters -> Generalizable
   b. Improvement 2: Deconvolution

Improvement 2: Deconvolution

Why Deconvolution?

To visualize/understand/probe an existing GCN.

Pooling <--> Unpooling

https://www.quora.com/How-does-a-deconvolutional-neural-network-work

In progress.