Special Networks
Hung-yi Lee (李宏毅)

Announcement

• 11/13 (next Monday) 14:00 ~ 17:00: visit to Microsoft Taiwan
• Address: 19F, No. 68, Sec. 5, Zhongxiao E. Rd., Xinyi District, Taipei (MRT Taipei City Hall Station, Exit 3)
• 14:00: meet at Exit 3 of MRT Taipei City Hall Station
• Registration form: https://docs.google.com/forms/d/e/1FAIpQLSfs2zloGanjWjJvVkJu8DUe9BlVZ5ugLPIs3FUmMbR9VkF8Fw/viewform?fbzx=-8767653761190698000

Outline

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory

Convolutional Layer

Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Receptive Field: different neurons have different, but overlapping, receptive fields.

Convolutional Layer

Sparse Connectivity: each neuron only connects to part of the output of the previous layer.
Parameter Sharing: neurons with different receptive fields can use the same set of parameters.

Fewer parameters than a fully connected layer.

Convolutional Layer

Consider neurons 1 and 3 as "filter 1" (kernel 1), and neurons 2 and 4 as "filter 2" (kernel 2).
Filter (kernel) size: the size of a neuron's receptive field.
Stride = 2 in this example.
Kernel size, number of filters, and stride are all chosen by the developers.
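To make sparse connectivity and parameter sharing concrete, here is a minimal NumPy sketch (my own illustration, not code from the slides) of a 1D convolutional layer with kernel size 2 and stride 2, matching the picture above where neurons 1 and 3 share "filter 1" and neurons 2 and 4 share "filter 2".

```python
import numpy as np

def conv1d(x, filters, stride):
    """Minimal 1D convolution: each output neuron sees only a window of x
    (sparse connectivity) and every window reuses the same filter weights
    (parameter sharing)."""
    k = filters.shape[1]                      # kernel size = receptive field size
    n_windows = (len(x) - k) // stride + 1
    out = np.zeros((filters.shape[0], n_windows))
    for f, w in enumerate(filters):           # one row per filter
        for i in range(n_windows):
            window = x[i * stride : i * stride + k]
            out[f, i] = np.dot(w, window)     # the same w for every window
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # e.g. x1 ... x6
filters = np.array([[1.0, -1.0],              # "filter 1" (kernel size 2)
                    [0.5,  0.5]])             # "filter 2"
print(conv1d(x, filters, stride=2))           # 2 filters x 3 positions
```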

Example – 1D Signal + Single Channel

Classification, predicting the future, …
Input: an audio signal, stock values, … (a sequence $x_1, x_2, x_3, x_4, x_5$; the figure shows four neurons, each connected to a window of the input)

Example – 1D Signal + Multiple Channels

A document: each word is a vector

[Figure: the sentence "I like this movie very much"; each word is a vector $x_1, \dots, x_7$]
Does this kind of receptive field make sense?

Example – 2D Signal + Single Channel

Receptive field size: 3×3, stride: 1
[Figure: a 6 × 6 black & white image of 0/1 pixels; only one filter's outputs are shown]

Example – 2D Signal + Multiple Channels

Receptive field size: 3×3×3, stride: 1
[Figure: a 6 × 6 color image with three channels; only one filter's outputs are shown]

Padding

Zero padding, reflection padding
Source of images: https://github.com/vdumoulin/conv_arithmetic

Pooling Layer

$k$ outputs in layer $l-1$ are grouped together; each output in layer $l$ "summarizes" the $k$ inputs.

Average pooling: $a_1^{l} = \frac{1}{k}\sum_{j=1}^{k} a_j^{l-1}$
Max pooling: $a_1^{l} = \max\left(a_1^{l-1}, a_2^{l-1}, \dots, a_k^{l-1}\right)$
L2 pooling: $a_1^{l} = \sqrt{\frac{1}{k}\sum_{j=1}^{k} \left(a_j^{l-1}\right)^2}$
Layer $l-1$ has $N$ nodes; layer $l$ has $N/k$ nodes.
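A minimal NumPy sketch (mine, following the three formulas above) that groups every $k$ consecutive outputs of layer $l-1$ and summarizes each group:

```python
import numpy as np

def pool(a_prev, k, mode="max"):
    """Group every k values of layer l-1 and summarize each group,
    so N inputs become N / k outputs."""
    groups = a_prev.reshape(-1, k)                   # shape: (N / k, k)
    if mode == "average":
        return groups.mean(axis=1)                   # (1/k) * sum_j a_j
    if mode == "max":
        return groups.max(axis=1)                    # max_j a_j
    if mode == "l2":
        return np.sqrt((groups ** 2).mean(axis=1))   # sqrt((1/k) * sum_j a_j^2)
    raise ValueError(mode)

a_prev = np.array([1.0, -2.0, 3.0, 4.0, -1.0, 0.0])  # N = 6 outputs of layer l-1
print(pool(a_prev, k=2, mode="average"))             # [-0.5  3.5 -0.5]
print(pool(a_prev, k=2, mode="max"))                 # [ 1.   4.   0. ]
print(pool(a_prev, k=2, mode="l2"))
```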

Pooling Layer: which outputs should be grouped together?
• Option 1: group the neurons corresponding to the same filter with nearby receptive fields (subsampling).
• Option 2: group the neurons with the same receptive field; this is the Maxout Network. But how can you know whether the neurons detect the same pattern?

Unpooling & Deconvolution

Auto-encoder for a CNN: input image → convolution → pooling → … → code → … → deconvolution → unpooling → output (e.g. 28 × 28 → 14 × 14 → … → 28 × 28), trained so that the input and the output are as close as possible.

Unpooling (e.g. 14 × 14 → 28 × 28): put each pooled value back at the location it came from and fill the rest with zeros. Alternative: simply repeat the values.
Source of image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_segmentation.html
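As a rough 1D sketch (mine, not the slides' code) of the two unpooling variants just mentioned: restoring each value to the location remembered from max pooling, versus simply repeating the values.

```python
import numpy as np

def max_pool_1d(x, k=2):
    """Max pooling that also remembers where each max came from."""
    groups = x.reshape(-1, k)
    idx = groups.argmax(axis=1)                  # location of the max in each group
    return groups.max(axis=1), idx

def unpool_with_locations(pooled, idx, k=2):
    """Put each value back at its remembered location; zeros elsewhere."""
    out = np.zeros(len(pooled) * k)
    out[np.arange(len(pooled)) * k + idx] = pooled
    return out

def unpool_by_repeating(pooled, k=2):
    """The alternative on the slide: simply repeat the values."""
    return np.repeat(pooled, k)

x = np.array([1.0, 3.0, 2.0, 0.0])
pooled, idx = max_pool_1d(x)                     # pooled = [3., 2.], idx = [1, 0]
print(unpool_with_locations(pooled, idx))        # [0. 3. 2. 0.]
print(unpool_by_repeating(pooled))               # [3. 3. 2. 2.]
```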

Deconvolution: actually, deconvolution is convolution.
[Figure: deconvolution as a convolution; the contributions from each input value overlap and are added]

Combination of Different Structures
Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, Oriol Vinyals, "Learning the Speech Front-end With Raw Waveform CLDNNs," INTERSPEECH 2015

CNN for Sequence-to-sequence
[Figure: a stack of 3 CNN layers over the input sequence; see https://arxiv.org/abs/1705.03122]

• Encoder
[Figure: an RNN encoder and a CNN encoder both map the input sequence 機 器 學 習 ("machine learning") to hidden vectors $h^1, h^2, h^3, h^4$]

CNN for Sequence-to-sequence

• Decoder: WaveNet

Outline

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory

Spatial Transformer Layer

• CNN is not invariant to scaling and rotation

[Figure: digits fed to a CNN; scaled or rotated versions illustrate that the CNN is not invariant to these transformations]
The spatial transformer is an NN layer that is learned end-to-end; it can also transform feature maps.

Spatial Transformer Layer

• How to transform an image/feature map
[Figure: a 3 × 3 feature map $a^{l-1}$ in layer $l-1$ is translated to produce the 3 × 3 feature map $a^{l}$ in layer $l$]

General layer: $a_{nm}^{l} = \sum_{i=1}^{3}\sum_{j=1}^{3} w_{nm,ij}^{l}\, a_{ij}^{l-1}$

If we want the translation above: $a_{nm}^{l} = a_{(n-1)m}^{l-1}$, i.e. $w_{nm,ij}^{l} = 1$ if $i = n-1,\ j = m$, and $w_{nm,ij}^{l} = 0$ otherwise.

Spatial Transformer Layer

• How to transform an image/feature map
[Figure: two examples; in each, an NN controls the connections between the layer $l-1$ feature map and the layer $l$ feature map]

Image Transformation

Expansion, compression, translation:
$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ (expansion by a factor of 2)

$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$ (compression by a factor of 2, plus a translation)
Source of image: https://home.gamer.com.tw/creationDetail.php?sn=792585

Image Transformation

• Rotation (rotate by $\theta$°):
$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
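The three transformations above are all affine maps of the form "multiply the coordinate by a matrix and add an offset". A small NumPy sketch (mine) that applies them to a single coordinate:

```python
import numpy as np

def affine(xy, A, t):
    """Map a coordinate [x, y] to [x', y'] = A @ [x, y] + t."""
    return np.asarray(A, float) @ np.asarray(xy, float) + np.asarray(t, float)

# Expansion by a factor of 2, no translation.
print(affine([1.0, 1.0], [[2, 0], [0, 2]], [0, 0]))          # [2. 2.]

# Compression by a factor of 2 plus a translation of (0.5, 0.5).
print(affine([1.0, 1.0], [[0.5, 0], [0, 0.5]], [0.5, 0.5]))  # [1. 1.]

# Rotation by 90 degrees.
theta = np.deg2rad(90)
R = [[np.cos(theta), -np.sin(theta)],
     [np.sin(theta),  np.cos(theta)]]
print(np.round(affine([1.0, 0.0], R, [0, 0]), 6))            # [0. 1.]
```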

Spatial Transformer Layer

6 parameters describe the affine transformation:
$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} e \\ f \end{bmatrix}$
$(x, y)$: index of layer $l$; $(x', y')$: index of layer $l-1$.

[Figure: an NN outputs the six parameters $a, b, c, d, e, f$, which determine the connections between the layer $l-1$ and layer $l$ feature maps]

Spatial Transformer Layer

Example: the NN outputs $\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and $\begin{bmatrix} e \\ f \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$, i.e. $\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \end{bmatrix}$.
[Figure: the resulting connections between the layer $l-1$ and layer $l$ feature maps]

Spatial Transformer Layer

Example: the NN outputs $\begin{bmatrix} 0 & 0.5 \\ 1 & 0 \end{bmatrix}$ and $\begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix}$, so the layer $l$ index $(2, 2)$ maps to $\begin{bmatrix} 1.6 \\ 2.4 \end{bmatrix} = \begin{bmatrix} 0 & 0.5 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix}$, which is not an integer index of layer $l-1$. What is the problem? If the fractional index is simply rounded to the nearest cell, the gradient with respect to the NN's outputs is always zero.

Interpolation

$\begin{bmatrix} 1.6 \\ 2.4 \end{bmatrix} = \begin{bmatrix} 0 & 0.5 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix}$
Now we can interpolate: the layer $l$ output at index $(2, 2)$ is a weighted sum of the four layer $l-1$ values around $(1.6, 2.4)$:
$a_{22}^{l} = (1-0.4)(1-0.4)\,a_{22}^{l-1} + (1-0.6)(1-0.4)\,a_{12}^{l-1} + (1-0.6)(1-0.6)\,a_{13}^{l-1} + (1-0.4)(1-0.6)\,a_{23}^{l-1}$
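A minimal sketch (mine) of this bilinear interpolation, using 0-based indices so the slide's position $(1.6, 2.4)$ becomes $(0.6, 1.4)$:

```python
import numpy as np

def bilinear_sample(a_prev, row, col):
    """Sample a fractional position (row, col) from a 2D feature map
    by interpolating between its four integer neighbours."""
    r0, c0 = int(np.floor(row)), int(np.floor(col))
    r1 = min(r0 + 1, a_prev.shape[0] - 1)
    c1 = min(c0 + 1, a_prev.shape[1] - 1)
    dr, dc = row - r0, col - c0
    return ((1 - dr) * (1 - dc) * a_prev[r0, c0] +
            (1 - dr) * dc       * a_prev[r0, c1] +
            dr       * (1 - dc) * a_prev[r1, c0] +
            dr       * dc       * a_prev[r1, c1])

# A 3x3 layer l-1 feature map (0-based indexing here, 1-based on the slide).
a_prev = np.arange(1.0, 10.0).reshape(3, 3)
print(bilinear_sample(a_prev, 0.6, 1.4))   # same weights 0.36/0.24/0.16/0.24 as above
```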

Experiments: Street View House Numbers ("Single": one transformation layer; "Multi": multiple transformation layers) and Bird Recognition.

Outline

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory

Feedforward vs. Recurrent

1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.

Feedforward: $x \rightarrow f_1 \rightarrow a^1 \rightarrow f_2 \rightarrow a^2 \rightarrow f_3 \rightarrow a^3 \rightarrow f_4 \rightarrow y$

$a^{t} = f_{t}\left(a^{t-1}\right) = \sigma\left(W^{t} a^{t-1} + b^{t}\right)$, where $t$ is the layer index.

Recurrent: $h^0 \rightarrow f \rightarrow h^1 \rightarrow f \rightarrow h^2 \rightarrow f \rightarrow h^3 \rightarrow f \rightarrow y^4$

with an input $x^1, x^2, x^3, x^4$ at each step; $t$ is the time step.

$h^{t} = f\left(h^{t-1}, x^{t}\right) = \sigma\left(W^{h} h^{t-1} + W^{i} x^{t} + b^{i}\right)$
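A small sketch (mine) contrasting the two formulas: the feedforward network uses a different $W^t$, $b^t$ for each layer, while the recurrent network reuses the same $W^h$, $W^i$, $b^i$ at every time step and receives an input $x^t$ at each step.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
dim = 4

# Feedforward: a^t = sigma(W^t a^{t-1} + b^t); a different W^t, b^t per layer t.
a = rng.normal(size=dim)                       # the input x
for W, b in [(rng.normal(size=(dim, dim)), rng.normal(size=dim)) for _ in range(3)]:
    a = sigmoid(W @ a + b)

# Recurrent: h^t = sigma(W_h h^{t-1} + W_i x^t + b_i); the same parameters at
# every step, plus an input x^t at every step.
Wh = rng.normal(size=(dim, dim))
Wi = rng.normal(size=(dim, dim))
bi = rng.normal(size=dim)
h = np.zeros(dim)
for xt in rng.normal(size=(3, dim)):
    h = sigmoid(Wh @ h + Wi @ xt + bi)

print(a, h)
```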

No input xt at each t-1 t step ha ⨀ + ha No output yt at ⨀ each step 1- ⨀ at-1 is the output of the (t-1)-th layer reset update at is the output of r z h' the t-th layer No reset gate ahtt--11 xt xt ℎ′ = 휎 푊푎푡−1 Highway Network 푧 = 휎 푊′푎푡−1 푎푡 = 푧 ⊙ 푎푡−1 + 1 − 푧 ⊙ ℎ • Highway Network • Residual Network 푎푡 푎푡 + 푧 푎푡−1 ℎ′ ℎ′ Gate copy controller copy

푎푡−1 푎푡−1
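A minimal NumPy sketch (mine, following the formulas above) of a highway layer and, for comparison, a residual layer:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_layer(a_prev, W, W_gate):
    """a^t = z * a^{t-1} + (1 - z) * h', with h' and the gate z computed from a^{t-1}."""
    h = sigmoid(W @ a_prev)           # h' = sigma(W a^{t-1})
    z = sigmoid(W_gate @ a_prev)      # z  = sigma(W' a^{t-1})
    return z * a_prev + (1 - z) * h

def residual_layer(a_prev, W):
    """For comparison: always copy the input and add h'."""
    return a_prev + sigmoid(W @ a_prev)

rng = np.random.default_rng(0)
dim = 5
a = rng.normal(size=dim)
for _ in range(10):                   # stack many layers; the gates decide how many are "used"
    a = highway_layer(a, rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)))
print(a, residual_layer(a, rng.normal(size=(dim, dim))))
```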

Training Very Deep Networks: https://arxiv.org/pdf/1507.06228v2.pdf
Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385

[Figure: networks with different numbers of layers] The Highway Network automatically determines the layers needed!

Grid LSTM

Memory for both time and depth.
[Figure: an LSTM block takes $c$, $h$ and the input $x^t$ and outputs $c'$, $h'$ and $y^t$; a Grid LSTM block takes $c$, $h$, $a$, $b$ and outputs $c'$, $h'$, $a'$, $b'$]

[Figure: Grid LSTM blocks unrolled over time, passing $c^{t-1}, h^{t-1} \rightarrow c^{t}, h^{t} \rightarrow c^{t+1}, h^{t+1}$, and stacked over depth, passing $a^{l-1}, b^{l-1} \rightarrow a^{l}, b^{l} \rightarrow a^{l+1}, b^{l+1}$]

Grid LSTM

[Figure: inside a Grid LSTM block, the gates $z^{f}$, $z^{i}$, $z$, $z^{o}$ are computed from the inputs $a$ and $b$; the memory is updated from $c$ to $c'$ and the hidden state from $h$ to $h'$, giving the outputs $a'$, $b'$, $c'$, $h'$]

3D Grid LSTM

[Figure: a 3D Grid LSTM block with inputs $a$, $b$, $c$, $e$, $f$, $h$ and outputs $a'$, $b'$, $c'$, $e'$, $f'$, $h'$]

3D Grid LSTM
• Images are composed of pixels (e.g. a 3 × 3 image)

Outline

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory

Pointer Network

[Figure: an NN reads the coordinates of points $(x_1, y_1), (x_2, y_2), \dots$ (e.g. the coordinate of P1) and outputs a sequence of point indices such as 4 2 7 6 5 3]

The story of "硬train" (just train it hard anyway)

• Fizz Buzz in Tensorflow: http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/

https://ochronus.com/fizzbuzz-in-css/

Sequence-to-sequence?

[Figure: an encoder reads 機 器 學 習 ("machine learning") and a decoder generates "machine", "learning", …]
Problem?

Sequence-to-sequence?

[Figure: the encoder reads the point coordinates $(x_1, y_1), \dots, (x_4, y_4)$ and the decoder outputs tokens from the set {1, 2, 3, 4, END}, e.g. 1 4 2]
Of course, one can add attention.

Pointer Network

$(x_0, y_0)$: END

[Figure: the encoder turns $(x_0, y_0), \dots, (x_4, y_4)$ into $h^0, h^1, h^2, h^3, h^4$; the decoder state $z^0$ acts as a key and computes an attention weight (e.g. 0.5) for each encoder position]

Pointer Network

$(x_0, y_0)$: END

What the decoder can output depends on the input: the attention weights (e.g. 0.0, 0.5, 0.3, 0.2, 0.0 over positions 0 to 4) are treated as the output distribution, and the argmax of this distribution is emitted, here output 1. The chosen point $(x_1, y_1)$ is then fed to the decoder as the next input, giving the next state $z^1$.

Pointer Network

$(x_0, y_0)$: END

At the next step the attention weights might be 0.0, 0.0, 0.1, 0.2, 0.7, so the output is 4 and $(x_4, y_4)$ is fed in, and so on. The process stops when "END" has the largest attention weight.
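A toy sketch (mine; the encoder states, decoder step, and scoring function are all stand-ins, not the paper's architecture) of greedy pointer-network decoding, where the attention distribution over input positions is the output distribution:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def pointer_decode(enc_states, points, z_init, step_fn, max_steps=10):
    """Greedy pointer-network decoding: at each step the attention distribution
    over the encoder positions *is* the output distribution, so the decoder can
    only point at elements of the input; position 0 stands for END."""
    z, outputs = z_init, []
    for _ in range(max_steps):
        probs = softmax(enc_states @ z)        # one attention weight per input position
        choice = int(np.argmax(probs))         # argmax of the attention weights
        if choice == 0:                        # END has the largest attention weight
            break
        outputs.append(choice)
        z = step_fn(z, points[choice])         # feed the chosen point back to the decoder
    return outputs

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(5, 8))           # h^0 ... h^4 (position 0 = END)
points = rng.normal(size=(5, 2))               # (x_i, y_i) for each position
step_fn = lambda z, p: np.tanh(z + np.repeat(p, 4))   # stand-in for the real decoder step
print(pointer_decode(enc_states, points, z_init=rng.normal(size=8), step_fn=step_fn))
```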

Applications - Summarization
https://arxiv.org/abs/1704.04368

More Applications
Machine Translation

Chat-bot
User: Hello X寶, I am 庫洛洛.
Machine: Hello 庫洛洛, nice to meet you.

Outline

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory

External Memory

[Figure: a DNN/RNN maps the input to the output; a reading head controller positions a reading head over the machine's memory]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html

Reading Comprehension

[Figure: a DNN/RNN maps the query to the answer; a reading head controller reads from a memory obtained by semantic analysis of the document, where each sentence becomes a vector]

Memory Network

Each sentence of the document becomes a vector $x^1, x^2, x^3, \dots, x^N$; the sentence-to-vector module can be jointly trained. The query is encoded as a vector $q$, and each $x^n$ is matched against $q$ to give an attention weight $\alpha_n$. The extracted information $= \sum_{n=1}^{N} \alpha_n x^n$ is fed to a DNN that produces the answer.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, "End-To-End Memory Networks", NIPS, 2015

Memory Network

A refined version jointly learns two sets of sentence vectors: the attention $\alpha_n$ is still computed by matching $x^1, \dots, x^N$ against the query $q$, but the extracted information $= \sum_{n=1}^{N} \alpha_n h^n$ uses a second set of vectors $h^1, \dots, h^N$. A DNN maps the extracted information to the answer, and the extracted information can also be fed back as a new query (hopping).

Memory Network

Multiple hops: compute attention with the query $q$, extract information from the memory, use the result to compute attention and extract information again, and finally feed the extracted vector to a DNN that produces the answer $a$.
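A minimal NumPy sketch (mine; the "match" is assumed to be a dot product followed by a softmax) of the multi-hop memory read described above:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def memory_network_read(x_vecs, h_vecs, q, hops=2):
    """Attention alpha_n = softmax(match(x^n, q)); extracted = sum_n alpha_n h^n;
    the extracted vector becomes the query for the next hop."""
    extracted = q
    for _ in range(hops):
        alphas = softmax(x_vecs @ q)        # match each sentence vector against q
        extracted = alphas @ h_vecs         # sum_n alpha_n h^n
        q = extracted                       # hopping: re-query the memory
    return extracted                        # fed to a DNN to produce the answer

rng = np.random.default_rng(1)
N, d = 6, 8                                 # 6 sentences, vectors of size 8
x_vecs = rng.normal(size=(N, d))            # used for matching
h_vecs = rng.normal(size=(N, d))            # used for extraction
q = rng.normal(size=d)
print(memory_network_read(x_vecs, h_vecs, q))
```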

• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.
The position of the reading head is determined by the attention weights.
Keras has an example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py

Visual Question Answering

Source: http://visualqa.org/

Visual Question Answering

[Figure: a DNN/RNN maps the query to the answer; a reading head controller attends over a memory produced by a CNN, which gives a vector for each image region]

Visual Question Answering

• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015.

External Memory v2

[Figure: a DNN/RNN maps the input to the output; now a reading head controller and a writing head controller position a reading head and a writing head over the machine's memory]

Neural Turing Machine

[Figure: at the first time step, $x^1$ and the read vector $r^0$ enter $f$ (initial state $h^0$), which produces $y^1$]
Retrieval process: $r^0 = \sum_i \hat{\alpha}_0^{i}\, m_0^{i}$, where $\hat{\alpha}_0^{1}, \dots, \hat{\alpha}_0^{4}$ are attention weights over the memory cells $m_0^{1}, \dots, m_0^{4}$.

Neural Turing Machine

[Figure: besides $y^1$, $f$ also outputs a key $k^1$, an erase vector $e^1$, and an add vector $a^1$]
New attention: $\alpha_1^{i} = \cos\left(m_0^{i}, k^1\right)$, followed by a softmax to obtain $\hat{\alpha}_1^{1}, \dots, \hat{\alpha}_1^{4}$.

Memory update (element-wise; the entries of $e^1$ are between 0 and 1):
$m_1^{i} = m_0^{i} - \hat{\alpha}_1^{i}\, e^1 \odot m_0^{i} + \hat{\alpha}_1^{i}\, a^1$
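A minimal NumPy sketch (mine, following the read, addressing, and write formulas above) of one Neural Turing Machine memory interaction:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def ntm_address(memory, key):
    """alpha_i = cos(m_i, k), then a softmax over the cells gives alpha_hat."""
    return softmax(np.array([cosine(m, key) for m in memory]))

def ntm_read(memory, alpha_hat):
    """Retrieval process: r = sum_i alpha_hat_i * m_i."""
    return alpha_hat @ memory

def ntm_write(memory, alpha_hat, erase, add):
    """m_i <- m_i - alpha_hat_i * (e ⊙ m_i) + alpha_hat_i * a  (element-wise)."""
    return (memory
            - alpha_hat[:, None] * erase[None, :] * memory
            + alpha_hat[:, None] * add[None, :])

rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 6))                    # cells m^1 ... m^4, each of size 6
key, add = rng.normal(size=6), rng.normal(size=6)
erase = 1.0 / (1.0 + np.exp(-rng.normal(size=6)))   # entries between 0 and 1

alpha_hat = ntm_address(memory, key)
r = ntm_read(memory, alpha_hat)
memory = ntm_write(memory, alpha_hat, erase, add)
print(alpha_hat, r)
```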

Neural Turing Machine

[Figure: the process unrolled over time: $(x^1, r^0)$ enter $f$ with $h^0$ to produce $y^1$ and $h^1$, then $(x^2, r^1)$ enter $f$ to produce $y^2$, with the attention $\hat{\alpha}_t$ and the memory $m_t$ updated at every step]

Neural Turing Machine for LM

Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee, "Recurrent Neural Network based Language Modeling with Controllable External Memory", ICASSP, 2017
Armand Joulin, Tomas Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets", arXiv Pre-Print, 2015

Stack RNN

[Figure: at each step $f$ reads $x^t$ and the top of the stack, produces $y^t$, the information to store, and a distribution over the actions Push, Pop, Nothing (e.g. 0.7, 0.2, 0.1); the new stack is the weighted combination of the three resulting stacks (× 0.7, × 0.2, × 0.1)]
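A minimal sketch (mine; only the differentiable stack update, not the controller $f$) of how the three actions are mixed with their predicted probabilities:

```python
import numpy as np

def soft_stack_update(stack, probs, value):
    """New stack = p_push * (value pushed) + p_pop * (top popped)
    + p_nothing * (unchanged), mixed with the predicted probabilities."""
    p_push, p_pop, p_nothing = probs
    pushed = np.concatenate(([value], stack[:-1]))    # shift down, new top = value
    popped = np.concatenate((stack[1:], [0.0]))       # shift up, pad the bottom
    return p_push * pushed + p_pop * popped + p_nothing * stack

stack = np.zeros(5)                                   # fixed-depth differentiable stack
stack = soft_stack_update(stack, (0.7, 0.2, 0.1), value=1.0)
stack = soft_stack_update(stack, (0.6, 0.3, 0.1), value=2.0)
print(stack)                                          # the top of the stack is stack[0]
```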

Concluding Remarks

• Convolutional Neural Network (Review)
• Spatial Transformer
• Highway Network & Grid LSTM
• Pointer Network
• External Memory