Neural Turing Machines
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Presented By: Tinghui Wang (Steve)

Introduction
• Neural Turing Machine: couples a neural network with external memory resources
• The combined system is analogous to a Turing Machine, but is differentiable end to end
• NTM can infer simple algorithms such as copying, sorting, and associative recall

In Computation Theory… Finite State Machine $(\Sigma, S, s_0, \delta, F)$

• $\Sigma$: Alphabet
• $S$: Finite set of states
• $s_0$: Initial state
• $\delta$: State transition function, $\delta: S \times \Sigma \to P(S)$
• $F$: Set of final states

In Computation Theory… Push Down Automata $(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$

• $Q$: Finite set of states
• $\Sigma$: Input alphabet
• $\Gamma$: Stack alphabet
• $q_0$: Initial state
• $Z$: Initial stack symbol
• $\delta$: State transition function, $\delta: Q \times \Sigma \times \Gamma \to P(Q \times \Gamma)$
• $F$: Set of final states

Turing Machine $(Q, \Gamma, b, \Sigma, \delta, q_0, F)$

• $Q$: Finite set of states
• $q_0$: Initial state
• $\Gamma$: Tape alphabet (finite)
• $b$: Blank symbol (occurs infinitely often on the tape)
• $\Sigma$: Input alphabet ($\Sigma = \Gamma \setminus \{b\}$)
• $\delta$: Transition function, $\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \{L, R\}$
• $F$: Set of final (accepting) states
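For concreteness, a minimal Python sketch of this 7-tuple as a deterministic machine; run_tm and the example transition table are illustrative assumptions, not part of the original slides or the NTM paper.

from collections import defaultdict

def run_tm(delta, q0, accept, tape, blank='_', max_steps=100):
    """delta maps (state, symbol) -> (new state, written symbol, 'L' or 'R')."""
    cells = defaultdict(lambda: blank, enumerate(tape))   # blank-filled tape
    q, head = q0, 0
    for _ in range(max_steps):
        if q in accept:
            return True
        key = (q, cells[head])
        if key not in delta:              # no transition defined: halt and reject
            return False
        q, cells[head], move = delta[key]
        head += 1 if move == 'R' else -1
    return False

# Example machine: scan right over 1s and accept on the first blank symbol.
delta = {('scan', '1'): ('scan', '1', 'R'),
         ('scan', '_'): ('accept', '_', 'R')}
print(run_tm(delta, 'scan', {'accept'}, '111'))   # True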

Neural Turing Machine – Add Learning to TM!!

Finite State Machine (Program) → Neural Network (I Can Learn!!)

A Little History on Neural Networks
• 1950s: Frank Rosenblatt – the Perceptron, classification based on a linear predictor
• 1969: Minsky – proof that the Perceptron sucks! Not able to learn XOR
• 1980s: Back propagation
• 1990s: Long Short-Term Memory
• Late 2000s – now: fast computers

Feed-Forward Neural Net and Back Propagation

• Total input of unit $j$: $x_j = \sum_i y_i w_{ji}$
• Output of unit $j$ (logistic function): $y_j = \frac{1}{1 + e^{-x_j}}$
• Total error (mean square): $E_{total} = \frac{1}{2} \sum_c \sum_j (d_{c,j} - y_{c,j})^2$
• Gradient by the chain rule: $\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{ji}}$

Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
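A small numpy sketch of these equations for a single logistic layer; the shapes, names, and numbers below are illustrative assumptions, not from the paper.

import numpy as np

def forward(y_in, W):
    x = W @ y_in                      # total input x_j = sum_i y_i w_ji
    y = 1.0 / (1.0 + np.exp(-x))      # logistic output y_j
    return x, y

def backprop_grad(y_in, W, d):
    _, y = forward(y_in, W)
    dE_dy = y - d                     # dE/dy_j for E = 1/2 sum_j (d_j - y_j)^2
    dy_dx = y * (1.0 - y)             # derivative of the logistic function
    delta = dE_dy * dy_dx             # dE/dx_j by the chain rule
    return np.outer(delta, y_in)      # dE/dw_ji = (dE/dx_j) * y_i

y_in = np.array([0.2, 0.7, 0.1])      # outputs of the previous layer
W = 0.1 * np.random.randn(2, 3)       # weights w_ji
d = np.array([1.0, 0.0])              # desired outputs d_j
grad = backprop_grad(y_in, W, d)      # same shape as W; use for gradient descent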

Recurrent Neural Network

• $y(t)$: $n$-tuple of outputs at time $t$
• $x^{net}(t)$: $m$-tuple of external inputs at time $t$
• $U$: set of indices $k$ such that $x_k$ is the output of a unit in the network
• $I$: set of indices $k$ such that $x_k$ is an external input
• $x_k(t) = \begin{cases} x_k^{net}(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases}$
• $s_k(t+1) = \sum_{i \in I \cup U} w_{ki}\, x_i(t)$
• $y_k(t+1) = f_k(s_k(t+1))$, where $f_k$ is a differentiable squashing function

R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
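A minimal numpy sketch of this update for a single fully connected recurrent layer, with tanh assumed as the squashing function $f_k$; the sizes and names are illustrative.

import numpy as np

def rnn_step(y_prev, x_ext, W):
    """One step: s_k(t+1) = sum_{i in I u U} w_ki x_i(t), y_k(t+1) = f_k(s_k(t+1))."""
    x = np.concatenate([x_ext, y_prev])   # x_i(t): external inputs (I), then unit outputs (U)
    s = W @ x                             # s_k(t+1)
    return np.tanh(s)                     # y_k(t+1), with tanh playing the role of f_k

n_units, n_inputs = 4, 3
W = 0.1 * np.random.randn(n_units, n_inputs + n_units)
y = np.zeros(n_units)
for x_t in np.random.randn(5, n_inputs):  # run five time steps
    y = rnn_step(y, x_t, W)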

Recurrent Neural Network – Time Unrolling

• Error function:
  $J(t) = -\frac{1}{2} \sum_{k \in U} e_k(t)^2 = -\frac{1}{2} \sum_{k \in U} (d_k(t) - y_k(t))^2$
  $J^{Total}(t', t) = \sum_{\tau = t'+1}^{t} J(\tau)$
• Back propagation through time:
  $\varepsilon_k(t) = \frac{\partial J(t)}{\partial y_k(t)} = e_k(t) = d_k(t) - y_k(t)$
  $\delta_k(\tau) = \frac{\partial J(t)}{\partial s_k(\tau)} = \frac{\partial J(t)}{\partial y_k(\tau)} \frac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau)$
  $\varepsilon_k(\tau - 1) = \frac{\partial J(t)}{\partial y_k(\tau - 1)} = \sum_{l \in U} \frac{\partial J(t)}{\partial s_l(\tau)} \frac{\partial s_l(\tau)}{\partial y_k(\tau - 1)} = \sum_{l \in U} w_{lk}\, \delta_l(\tau)$
  $\frac{\partial J^{Total}}{\partial w_{ji}} = \sum_{\tau = t'+1}^{t} \frac{\partial J(t)}{\partial s_j(\tau)} \frac{\partial s_j(\tau)}{\partial w_{ji}}$
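A sketch of these recursions in Python, reusing the tanh RNN layout from the previous sketch and assuming the target is given only at the final step $t$, as in $J(t)$; this is an illustration under those assumptions, not the original training code.

import numpy as np

def bptt(W, x_seq, y0, d_final):
    """Gradient of J(t) w.r.t. W for a fully connected tanh RNN.
    W: (n_units, n_inputs + n_units); x_seq: sequence of external inputs."""
    n_units = W.shape[0]
    xs, ss, ys = [], [], [y0]
    for x_ext in x_seq:                          # forward pass, storing activations
        x = np.concatenate([x_ext, ys[-1]])      # x_i(tau-1): inputs and unit outputs
        s = W @ x                                # s_k(tau)
        xs.append(x); ss.append(s); ys.append(np.tanh(s))
    grad = np.zeros_like(W)
    eps = d_final - ys[-1]                       # eps_k(t) = d_k(t) - y_k(t)
    for x, s in zip(reversed(xs), reversed(ss)): # walk back from tau = t to t'+1
        delta = (1.0 - np.tanh(s) ** 2) * eps    # delta_k(tau) = f'(s_k(tau)) eps_k(tau)
        grad += np.outer(delta, x)               # accumulate delta_j(tau) * x_i(tau-1)
        eps = W[:, -n_units:].T @ delta          # eps_k(tau-1) = sum_l w_lk delta_l(tau)
    return grad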

RNN & Long Short-Term Memory

• RNN is Turing complete
  – With a proper configuration, it can simulate any sequence generated by a Turing Machine
• Error vanishing and exploding
  – Consider a single unit with a self loop:
    $\delta_k(\tau) = \frac{\partial J(t)}{\partial s_k(\tau)} = \frac{\partial J(t)}{\partial y_k(\tau)} \frac{\partial y_k(\tau)}{\partial s_k(\tau)} = f_k'(s_k(\tau))\, \varepsilon_k(\tau) = f_k'(s_k(\tau))\, w_{kk}\, \delta_k(\tau + 1)$
• Long Short-Term Memory

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
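A tiny numeric illustration of this recursion for a single self-looping tanh unit: whenever $|f'(s)\, w_{kk}| < 1$, the back-propagated delta shrinks geometrically with the number of unrolled steps. The values below are arbitrary.

import numpy as np

w_kk, s = 0.9, 0.5                 # self-loop weight and a fixed unit input
fprime = 1.0 - np.tanh(s) ** 2     # f'(s) for tanh
delta = 1.0                        # error signal injected at the last step
for step in range(1, 51):
    delta *= fprime * w_kk         # delta_k(tau) = f'(s) * w_kk * delta_k(tau+1)
    if step % 10 == 0:
        print(step, delta)         # decays roughly like (f'(s) * w_kk) ** step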

Purpose of Neural Turing Machine

• Enrich the capabilities of standard RNNs to simplify the solution of algorithmic tasks

NTM Read/Write Operation

(Figure: memory matrix $M$ with $N$ locations $1, 2, 3, \ldots, N$, each storing a vector of size $M$, addressed by the weights $w_1, w_2, w_3, \ldots, w_N$)

• Read: $r_t = \sum_i w_t(i)\, M_t(i)$
• Erase: $M_t(i) = M_{t-1}(i)\,[\mathbf{1} - w_t(i)\, e_t]$
• Add: $M_t(i) \leftarrow M_t(i) + w_t(i)\, a_t$
• $w_t(i)$: vector of weights over the $N$ locations emitted by a head at time $t$
• $e_t$: erase vector, emitted by the write head
• $a_t$: add vector, emitted by the write head
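A minimal numpy sketch of the read, erase, and add operations on a memory bank with $N$ locations of size-$M$ vectors; the function names and shapes are assumptions for illustration, not the authors' implementation.

import numpy as np

def read(M, w):
    """r_t = sum_i w_t(i) M_t(i): a weighted sum of memory rows."""
    return w @ M

def write(M, w, e, a):
    """Erase then add: M(i) <- M(i) * (1 - w(i) e) + w(i) a, per location i."""
    M_erased = M * (1.0 - np.outer(w, e))   # erase, elementwise per row
    return M_erased + np.outer(w, a)        # add

n_locations, vec_size = 8, 4
M = np.zeros((n_locations, vec_size))
w = np.eye(n_locations)[2]                  # weighting sharply focused on location 2
M = write(M, w, e=np.ones(vec_size), a=np.array([1.0, 0.0, 1.0, 0.0]))
r = read(M, w)                              # returns the add vector in this case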

Attention & Focus

• Attention and focus are adjusted by the weight vector across the whole memory bank!

Content Based Addressing

• Find similar data in memory
• $\mathbf{k}_t$: key vector, $\beta_t$: key strength

$w_t^c(i) = \frac{\exp\left(\beta_t\, K[\mathbf{k}_t, M_t(i)]\right)}{\sum_j \exp\left(\beta_t\, K[\mathbf{k}_t, M_t(j)]\right)}$

$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert\mathbf{u}\rVert \cdot \lVert\mathbf{v}\rVert}$
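A sketch of content addressing in Python: cosine similarity $K[\mathbf{u}, \mathbf{v}]$ between the key and every memory row, scaled by $\beta_t$ and normalized with a softmax. The small epsilon and the example values are my own assumptions.

import numpy as np

def content_weights(M, k, beta):
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)  # K[k_t, M_t(i)]
    z = beta * sim - (beta * sim).max()      # shift for a numerically stable softmax
    w = np.exp(z)
    return w / w.sum()                       # w_t^c(i)

M = np.random.randn(8, 4)                    # memory bank
k = M[3] + 0.05 * np.random.randn(4)         # key close to the content of row 3
w_c = content_weights(M, k, beta=10.0)       # weighting peaks near location 3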

May not depend on content only…

• Gating content addressing
• $g_t$: interpolation gate in the range $(0, 1)$

$w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}$
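The interpolation itself is a one-liner; a hedged sketch with assumed argument names:

def gate(w_content, w_prev, g):
    """w_t^g = g_t * w_t^c + (1 - g_t) * w_{t-1}, with the gate g in (0, 1)."""
    return g * w_content + (1.0 - g) * w_prev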

Convolutional Shift – Location Addressing

• $s_t$: shift weighting – a convolution vector of size $N$

$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$
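A sketch of this circular convolution, with indices taken modulo $N$ as in the paper; in practice $s_t$ is usually nonzero for only a few shifts (e.g. -1, 0, +1). The function below is illustrative.

import numpy as np

def shift(w_g, s):
    """w~_t(i) = sum_j w_t^g(j) s_t(i - j), indices modulo N."""
    N = len(w_g)
    return np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                     for i in range(N)])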

Convolutional Shift – Location Addressing

• $\gamma_t$: sharpening factor

$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$
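A sketch of the sharpening step; $\gamma_t \ge 1$, and larger values concentrate the weighting on fewer locations. Names are illustrative.

import numpy as np

def sharpen(w_tilde, gamma):
    """w_t(i) = w~_t(i)^gamma_t / sum_j w~_t(j)^gamma_t."""
    w = w_tilde ** gamma
    return w / w.sum()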

Neural Turing Machine – Experiments

• Goal: demonstrate that NTM is
  – able to solve the problems
  – by learning compact internal programs
• Three architectures:
  – NTM with a feedforward controller
  – NTM with an LSTM controller
  – Standard LSTM network
• Applications:
  – Copy
  – Repeat Copy
  – Associative Recall
  – Dynamic N-Grams
  – Priority Sort

NTM Experiments: Copy

• Task: store and recall a long sequence of arbitrary information
• Training: 8-bit random vectors, with sequence lengths of 1-20
• No inputs are presented while the network generates the targets

(Figure: copy performance for NTM and LSTM)

NTM Experiments: Copy – Learning Curve

NTM Experiments: Repeated Copy

• Task: repeat a sequence a specified number of times
• Training: random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies

NTM Experiments: Repeated Copy – Learning Curve

NTM Experiments: Associative Recall

• Task: after a sequence of items has been propagated to the network, ask the network to produce the next item given the current item
• Training: each item is composed of three six-bit binary vectors, with 2-8 items per episode

NTM Experiments: Dynamic N-Grams

• Task: learn an N-gram model – rapidly adapt to a new predictive distribution
• Training: 6-gram distributions over binary sequences; 200 successive bits generated using a look-up table whose 32 probabilities are drawn from a Beta(0.5, 0.5) distribution
• Compare to the optimal estimator:
  $P(B = 1 \mid N_1, N_0, \mathbf{c}) = \frac{N_1 + 0.5}{N_1 + N_0 + 1}$

(Figure: predictions for contexts c = 01111 and c = 00010)
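A one-line sketch of this estimator, where n_ones and n_zeros stand for the counts $N_1$ and $N_0$ observed so far in context $c$ (names are illustrative):

def optimal_estimate(n_ones, n_zeros):
    """P(B = 1 | N_1, N_0, c) = (N_1 + 0.5) / (N_1 + N_0 + 1)."""
    return (n_ones + 0.5) / (n_ones + n_zeros + 1.0)

print(optimal_estimate(3, 1))   # 0.7: context seen 4 times, followed by a 1 three times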

NTM Experiments: Priority Sorting

• Task: sort binary vectors based on priority
• Training: 20 binary vectors with corresponding priorities; output the 16 highest-priority vectors

NTM Experiments: Learning Curve

Some Detailed Parameters

(Parameter tables: NTM with a feedforward controller, NTM with an LSTM controller, standard LSTM)

Additional Information

• Literature reference:
  – Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
  – R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
  – Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
  – Y. Bengio and Y. LeCun. Scaling Learning Algorithms towards AI. In Large-Scale Kernel Machines, Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., Eds., MIT Press, 2007.
• NTM implementation:
  – https://github.com/shawntan/neural-turing-machines
• NTM Reddit discussion:
  – https://www.reddit.com/r/MachineLearning/comments/2m9mga/implementation_of_neural_turing_machines/
