Online slides: http://bit.ly/nnstruct (password: bayes)

Neural networks: from LSTM to Neural Computer
Daniil Polykovskiy, CMC MSU, 16 December 2016

Goal

Narrow the gap between 1) neural networks and 2) algorithms

Cat-dog discrimination: neural networks ✓, algorithms ✘

Sequence sorting: neural networks ✘, algorithms ✓

Algorithms side: extract features

[Figure: example images with hand-crafted features (pointy ears, lop-eared) leading to the conclusions "This is a cat" / "This is a dog"]


Sorting: easy case

Sorting: hard case

Appetizer

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Outline

1. Attention
2. Neural Turing machines
3. Neural stacks, lists, etc.
4. Differentiable Neural Computer

Attention

Salient regions

Attention: idea

• An auxiliary network generates a distribution over places to attend to (regions for images, words for texts)

• Return expectation over attended places.

Example. Context: "The red boat sank." Question: "What color was the boat?"
Attention weights: The 0.033, red 0.9, boat 0.033, sank 0.034

r = 0.033 ∙ e_The + 0.9 ∙ e_red + 0.033 ∙ e_boat + 0.034 ∙ e_sank
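A minimal numpy sketch of this step (the scores, embeddings, and names below are illustrative, not from the slides): softmax the attention scores into weights, then return the expected embedding.

```python
import numpy as np

def soft_attention(scores, embeddings):
    """Turn raw attention scores into weights and return the expected embedding."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the words
    return weights @ embeddings           # r = sum_i w_i * e_i

# Toy run for "The red boat sank": 4 words, 5-dimensional embeddings.
embeddings = np.random.randn(4, 5)        # e_The, e_red, e_boat, e_sank
scores = np.array([0.1, 3.4, 0.1, 0.15])  # produced by the auxiliary network
r = soft_attention(scores, embeddings)    # roughly 0.9 of the weight lands on "red"
```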

Attention: image captioning

[Figure: image captioning pipeline: 1. input image; 2. convolutional feature extraction (14x14 feature map); 3. LSTM with attention over the image; 4. word-by-word generation: "A bird flying over a body of water"]

Neural Turing Machines

Turing machines


Turing complete system: can simulate any single-taped Turing machine.

Are RNNs Turing complete? YES

Is it possible to train a Turing machine with backpropagation? NO

Neural Turing machines


Memory model

• Memory is a matrix with rows M[1], …, M[N]
• In a read-only scenario its elements are trained
• Memory can learn knowledge useful for processing


Reading from ROM

• Generate weights for each element, and a temperature
• Normalize the distribution with softmax
• Return a weighted sum over memory elements
  (example weighting: 0.3 on M[2], 0.5 on M[15], 0.2 on M[N], 0 elsewhere)
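A minimal sketch of this read, assuming the controller has already emitted one raw score per memory row plus a temperature beta (all names are illustrative):

```python
import numpy as np

def read_memory(memory, scores, beta=1.0):
    """Softmax the scores (sharpened by beta) and return a blend of memory rows."""
    w = np.exp(beta * (scores - scores.max()))
    w /= w.sum()                          # attention weights over the N rows
    return w @ memory                     # weighted sum of memory rows

memory = np.random.randn(16, 8)           # N = 16 rows, 8-dimensional cells
scores = np.random.randn(16)              # emitted by the controller
r = read_memory(memory, scores, beta=5.0)
```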

Focusing by content

• Weights are generated as w[i] ∝ exp(β ∙ K(k, M[i])) for some similarity measure K
• The rows most similar to the query key k receive most of the weight
• This recalls associative arrays
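Content-based weights can be sketched with cosine similarity as the similarity measure K, as in the NTM paper; the key vector and beta below are assumed to come from the controller:

```python
import numpy as np

def content_weights(memory, key, beta):
    """Cosine-similarity lookup: rows similar to the key get most of the weight."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.random.randn(16, 8)
key = np.random.randn(8)                  # query key emitted by the controller
w_c = content_weights(memory, key, beta=10.0)
```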

Focusing by index (bad version)

• Set the first number of each memory row to its index
• Use content-based focusing to look that index up

Focusing by index

• Use the previous attention weights
• Emit another distribution: a shift vector
• Convolve the previous weights with the shift vector (the * in the figure)
• A one-position shift will shift attention right (like i++)

Indexing: combining
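The slide's formulas are not in the extracted text; the sketch below follows the addressing pipeline of the NTM paper, combining content focusing with index shifts (interpolation gate g, circular shift distribution, sharpening gamma), with all names assumed:

```python
import numpy as np

def address(w_prev, w_content, g, shift, gamma):
    """Blend content and previous weights, shift, then sharpen (NTM-style addressing)."""
    w = g * w_content + (1 - g) * w_prev         # interpolation gate
    n = len(w)
    w = np.array([                               # circular convolution with the shift
        sum(w[(i - j) % n] * shift[j]            # distribution; shift[j] = probability
            for j in range(len(shift)))          # of moving the focus j cells forward
        for i in range(n)
    ])
    w = w ** gamma                               # sharpening
    return w / w.sum()

N = 16
w_prev = np.random.dirichlet(np.ones(N))
w_cont = np.random.dirichlet(np.ones(N))
w_new = address(w_prev, w_cont, g=0.5, shift=np.array([0.0, 1.0, 0.0]), gamma=2.0)
# shift = [0, 1, 0] moves the focus one cell forward, i.e. the i++ behaviour above.
```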

Read scenarios

• Content-based focusing —> content retrieval

• Index-based (shift) focusing —> iteration

• Combination of iteration and content lookup. Context: "The red boat sank." Question: "What color was the boat?"


Writing

• A write is an erase step followed by an add step
• There may be multiple read/write heads
• In this case erasure must be performed over all heads first, followed by addition
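A sketch of the erase-then-add write from the NTM paper; the write weights w, erase vector and add vector below are assumed controller outputs. With several heads, all erase terms would be applied before any of the add terms, matching the note above.

```python
import numpy as np

def write_memory(memory, w, erase, add):
    """M[i] <- M[i] * (1 - w[i] * erase) + w[i] * add, applied to all rows at once."""
    memory = memory * (1 - np.outer(w, erase))   # erase, scaled by the write weights
    return memory + np.outer(w, add)             # then add the new content

memory = np.random.randn(16, 8)
w = np.random.dirichlet(np.ones(16))             # write weights over rows
erase = np.random.rand(8)                        # erase vector, entries in [0, 1]
add = np.random.randn(8)                         # add vector
memory = write_memory(memory, w, erase, add)
```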


Controller: recurrent (LSTM) or feed-forward

• Can analyze read/write operations
• Can mimic an RNN with a lot of layers (but the network will be very deep)

Analogy: the controller is the CPU, the LSTM cell the registers, the memory bank the RAM

Note: a trained model can operate with arbitrary memory size

Experiments on NTM

Copy task

[Figure: copy-task input/output for LSTM vs. NTM]


Copy task: learning curves

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Repeat copy

Associative recall

Sorting

[Figure: hypothesised locations, write weightings and read weightings (location vs. time)]

Sorting: learning curves

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Structured memory NTM

Zhang et al., Structured Memory for Neural Turing Machines

Neural Data Structures

Turing machines vs. data structures

• Turing machines are hard to program

• Neural Turing machine does not allow dynamic memory growth

• Data structures are often useful: stacks, lists, queues, etc.

Stack

Running example (v_t = pushed value, d_t = push strength, u_t = pop strength; the stack grows upwards):
t = 1: u1 = 0, d1 = 0.8; push v1 with strength 0.8
t = 2: u2 = 0.1, d2 = 0.5; v1 drops to strength 0.7, push v2 with strength 0.5
t = 3: u3 = 0.9, d3 = 0.9; v2 is removed from the stack, v1 drops to 0.3, push v3 with strength 0.9

r1 = 0.8 ∙ v1
r2 = 0.5 ∙ v2 + 0.5 ∙ v1
r3 = 0.9 ∙ v3 + 0 ∙ v2 + 0.1 ∙ v1

Push

[Figure: the stack diagram above, with a new value being pushed with strength s = 0.5]

• Push v with strength s: put the vector on top of the stack and assign it strength s.

Pop

[Figure: the stack diagram above, with a pop of strength s = 1.0 applied]

• Pop with strength s: subtract a total of s from the strengths of the top vectors, keeping strengths non-negative.

Head

[Figure: the stack diagram above, with the read head covering the topmost vectors until a total strength of 1 is reached]

• Head (read): compute a convex sum of the top vectors until a total strength of 1 is reached (sketched below).
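The push, pop and read (head) operations can be sketched numerically as follows. This is a plain-Python illustration that reproduces the running example's numbers, not the paper's differentiable min/max formulation, and the function and variable names are mine:

```python
import numpy as np

def stack_step(values, strengths, v, d, u):
    """One neural-stack update: pop with strength u, push v with strength d, then read."""
    # Pop: consume a total strength of u from the top of the stack downwards.
    kept, remaining = [], u
    for s in reversed(strengths):
        removed = min(s, remaining)
        remaining -= removed
        kept.append(s - removed)
    strengths = list(reversed(kept)) + [d]       # push v on top with strength d
    values = values + [v]
    # Read: convex combination of the topmost vectors until a total strength of 1.
    r, budget = np.zeros_like(v), 1.0
    for val, s in zip(reversed(values), reversed(strengths)):
        take = min(s, budget)
        r = r + take * val
        budget -= take
    return values, strengths, r

# Reproduces the running example above:
v1, v2, v3 = (np.random.randn(4) for _ in range(3))
V, S = [], []
V, S, r1 = stack_step(V, S, v1, d=0.8, u=0.0)    # r1 = 0.8*v1
V, S, r2 = stack_step(V, S, v2, d=0.5, u=0.1)    # r2 = 0.5*v2 + 0.5*v1
V, S, r3 = stack_step(V, S, v3, d=0.9, u=0.9)    # r3 = 0.9*v3 + 0*v2 + 0.1*v1
```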

Stack

[Figure: the neural stack as a differentiable block: the previous state (values V_{t-1}, strengths s_{t-1}) and the input (push strength d_t, pop strength u_t, value v_t) go in; the next state (V_t, s_t) and the read output r_t come out]

Neural data structures

• Queues and deques are constructed similarly

• During training the network tends to ignore the stack

• This is fixed with a bias towards pushing

Double-linked lists

Operations on the HEAD pointer:
• Move HEAD left
• Move HEAD right
• Insert a new element at position HEAD


Differentiable update of the HEAD pointer (a sketch follows):
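The update formula itself lives in the figure; one plausible differentiable version keeps the head as a probability distribution over cells and mixes left/right/stay moves, with the move probabilities assumed to come from the controller:

```python
import numpy as np

def move_head(head, p_left, p_right, p_stay):
    """Soft head move: convex mix of shifting the pointer left, right, or staying."""
    return (p_left * np.roll(head, -1)    # head distribution shifted one cell left
            + p_right * np.roll(head, 1)  # shifted one cell right
            + p_stay * head)              # unchanged

head = np.zeros(8); head[3] = 1.0         # pointer currently at cell 3
head = move_head(head, p_left=0.1, p_right=0.8, p_stay=0.1)
```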

Differentiable Neural Computer

• John Locke: memories are connected if they were formed nearby in time and space.


• So memory can naturally be organized as a graph

Graves et al., Hybrid computing using a neural network with dynamic external memory

DNC: Memory graph

• Add an adjacency matrix L

• L[i, j] is close to 1 if location i was written right after location j

• Move focus to the next memory moment: multiply the read weights by L

• Move focus to the previous memory moment: multiply the read weights by Lᵀ (see the sketch below)
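A simplified sketch of this bookkeeping, loosely following the DNC paper: a precedence vector remembers where the last write went, the link matrix L is updated after each write, and multiplying the read weights by L or Lᵀ moves the focus one write forward or backward.

```python
import numpy as np

def update_links(link, precedence, w_write):
    """Update the temporal link matrix L after a write (simplified DNC bookkeeping)."""
    outer = np.add.outer(w_write, w_write)
    link = (1 - outer) * link + np.outer(w_write, precedence)
    np.fill_diagonal(link, 0.0)                    # a location never follows itself
    precedence = (1 - w_write.sum()) * precedence + w_write
    return link, precedence

def shift_focus(link, w_read):
    """Move the read focus one write step forward / backward in time."""
    return link @ w_read, link.T @ w_read          # (forward, backward)

N = 16
link, precedence = np.zeros((N, N)), np.zeros(N)
w_write = np.random.dirichlet(np.ones(N))          # where the last write went
link, precedence = update_links(link, precedence, w_write)
forward, backward = shift_focus(link, np.random.dirichlet(np.ones(N)))
```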

DNC: Allocator

• Memory usage vector u: values in [0, 1] indicating presence of data

• Increases after writes

• Decreases after reads

• Writing to used positions is not allowed
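A sketch of usage-based allocation in the spirit of the DNC (the exact usage update with write, read and free gates is omitted): the least-used locations receive almost all of the allocation weight, which is how writes avoid used positions.

```python
import numpy as np

def allocation_weights(usage):
    """Give most of the write-allocation weight to the least-used locations."""
    order = np.argsort(usage)                    # free locations first
    alloc = np.zeros_like(usage)
    prod = 1.0
    for idx in order:
        alloc[idx] = (1.0 - usage[idx]) * prod   # weight shrinks as fuller cells pile up
        prod *= usage[idx]
    return alloc

usage = np.array([0.9, 0.1, 1.0, 0.4])           # u in [0, 1]: how "full" each cell is
print(allocation_weights(usage))                 # almost all weight goes to cell 1
```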

DNC Experiments: Underground

DNC Experiments

Family tree inference task (video)


Structures summary

Neural Turing Machines (Graves et al., 2014): arrays, associative arrays
Learning to Transduce with Unbounded Memory (Grefenstette et al., 2015): stack, queue, deque
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (Mikolov et al., 2015): stack, double-linked list

Is it possible to have a Universal Neural Turing Machine?

Thank you for your attention