Online slides: http://bit.ly/nnstruct (password: bayes)

Neural networks: from LSTM to Neural Computer
Daniil Polykovskiy, CMC MSU, 16 December 2016

Goal

Narrow the gap between 1) neural networks and 2) algorithms

Cat-dog discrimination: neural networks ✓, algorithms ✘

Sequence sorting: neural networks ✘, algorithms ✓

Algorithms side: extract features

[Figure: example images with hand-crafted features (pointy ears, lop-eared) leading to the conclusions "This is a cat" / "This is a dog"]


Sorting: easy case

Sorting: hard case

Appetizer

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Outline

1. Attention
2. Neural Turing machines
3. Neural stacks, lists, etc.
4. Differentiable Neural Computer

Attention

Salient regions

Attention: idea

• An auxiliary network generates a distribution over places to attend to (regions for images, words for texts)

• Return expectation over attended places.

Example. Context: "The red boat sank." Question: "What color was the boat?"
Attention weights: The 0.033, red 0.9, boat 0.033, sank 0.034

r = 0.033 ∙ e_The + 0.9 ∙ e_red + 0.033 ∙ e_boat + 0.034 ∙ e_sank
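A minimal numpy sketch of this step (the scores, embeddings, and names below are illustrative, not from the slides): softmax the attention scores into weights, then return the expected embedding.

```python
import numpy as np

def soft_attention(scores, embeddings):
    """Turn raw attention scores into weights and return the expected embedding."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the words
    return weights @ embeddings           # r = sum_i w_i * e_i

# Toy run for "The red boat sank": 4 words, 5-dimensional embeddings.
embeddings = np.random.randn(4, 5)        # e_The, e_red, e_boat, e_sank
scores = np.array([0.1, 3.4, 0.1, 0.15])  # produced by the auxiliary network
r = soft_attention(scores, embeddings)    # roughly 0.9 of the weight lands on "red"
```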

Attention: image captioning

[Figure: image captioning pipeline: 1. input image; 2. convolutional feature extraction (14x14 feature map); 3. LSTM with attention over the image; 4. word-by-word generation: "A bird flying over a body of water"]

Neural Turing Machines

Turing machines


Turing complete system: can simulate any single-taped Turing machine.

Are RNNs Turing complete? YES

Is it possible to train a Turing machine with backpropagation? NO

Neural Turing machines


Memory model

• Memory is a matrix with rows M[1], …, M[N]
• In a read-only scenario its elements are trained
• Memory can learn knowledge useful for processing


Reading from ROM

• Generate weights for each element, and a temperature
• Normalize the distribution with softmax
• Return a weighted sum over memory elements
  (example weighting: 0.3 on M[2], 0.5 on M[15], 0.2 on M[N], 0 elsewhere)
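A minimal sketch of this read, assuming the controller has already emitted one raw score per memory row plus a temperature beta (all names are illustrative):

```python
import numpy as np

def read_memory(memory, scores, beta=1.0):
    """Softmax the scores (sharpened by beta) and return a blend of memory rows."""
    w = np.exp(beta * (scores - scores.max()))
    w /= w.sum()                          # attention weights over the N rows
    return w @ memory                     # weighted sum of memory rows

memory = np.random.randn(16, 8)           # N = 16 rows, 8-dimensional cells
scores = np.random.randn(16)              # emitted by the controller
r = read_memory(memory, scores, beta=5.0)
```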

Focusing by content

• Weights are generated as w[i] ∝ exp(β ∙ K(k, M[i])) for some similarity measure K
• The rows most similar to the query key k receive most of the weight
• This recalls associative arrays
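Content-based weights can be sketched with cosine similarity as the similarity measure K, as in the NTM paper; the key vector and beta below are assumed to come from the controller:

```python
import numpy as np

def content_weights(memory, key, beta):
    """Cosine-similarity lookup: rows similar to the key get most of the weight."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.random.randn(16, 8)
key = np.random.randn(8)                  # query key emitted by the controller
w_c = content_weights(memory, key, beta=10.0)
```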

Focusing by index (bad version)

• Set the first number of each memory row to its index
• Use content-based focusing to look that index up

Focusing by index

• Use the previous attention weights
• Emit another distribution: a shift vector
• Convolve the previous weights with the shift vector (the * in the figure)
• A one-position shift will shift attention right (like i++)

Indexing: combining
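The slide's formulas are not in the extracted text; the sketch below follows the addressing pipeline of the NTM paper, combining content focusing with index shifts (interpolation gate g, circular shift distribution, sharpening gamma), with all names assumed:

```python
import numpy as np

def address(w_prev, w_content, g, shift, gamma):
    """Blend content and previous weights, shift, then sharpen (NTM-style addressing)."""
    w = g * w_content + (1 - g) * w_prev         # interpolation gate
    n = len(w)
    w = np.array([                               # circular convolution with the shift
        sum(w[(i - j) % n] * shift[j]            # distribution; shift[j] = probability
            for j in range(len(shift)))          # of moving the focus j cells forward
        for i in range(n)
    ])
    w = w ** gamma                               # sharpening
    return w / w.sum()

N = 16
w_prev = np.random.dirichlet(np.ones(N))
w_cont = np.random.dirichlet(np.ones(N))
w_new = address(w_prev, w_cont, g=0.5, shift=np.array([0.0, 1.0, 0.0]), gamma=2.0)
# shift = [0, 1, 0] moves the focus one cell forward, i.e. the i++ behaviour above.
```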

Read scenarios

• Content-based focusing —> content retrieval

• Index-based (shift) focusing —> iteration

• Combination of iteration and content lookup. Context: "The red boat sank." Question: "What color was the boat?"


Writing

• A write is an erase step followed by an add step
• There may be multiple read/write heads
• In this case erasure must be performed over all heads first, followed by addition
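A sketch of the erase-then-add write from the NTM paper; the write weights w, erase vector and add vector below are assumed controller outputs. With several heads, all erase terms would be applied before any of the add terms, matching the note above.

```python
import numpy as np

def write_memory(memory, w, erase, add):
    """M[i] <- M[i] * (1 - w[i] * erase) + w[i] * add, applied to all rows at once."""
    memory = memory * (1 - np.outer(w, erase))   # erase, scaled by the write weights
    return memory + np.outer(w, add)             # then add the new content

memory = np.random.randn(16, 8)
w = np.random.dirichlet(np.ones(16))             # write weights over rows
erase = np.random.rand(8)                        # erase vector, entries in [0, 1]
add = np.random.randn(8)                         # add vector
memory = write_memory(memory, w, erase, add)
```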


Controller: recurrent (LSTM) or feed-forward

• Can analyze read/write operations
• Can mimic an RNN with a lot of layers (but the network will be very deep)

Analogy: the controller is the CPU, the LSTM cell the registers, the memory bank the RAM

Note: a trained model can operate with arbitrary memory size

Experiments on NTM

Copy task

[Figure: copy-task input/output for LSTM vs. NTM]


Copy task: learning curves

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Repeat copy

Associative recall

Sorting

[Figure: hypothesised locations, write weightings and read weightings (location vs. time)]

Sorting: learning curves

[Plot: cost per sequence (bits) vs. sequence number (thousands) for LSTM, NTM with LSTM controller, and NTM with feedforward controller]

Structured memory NTM

Zhang et al., Structured Memory for Neural Turing Machines

Neural Data Structures

Turing machines vs. data structures

• Turing machines are hard to program

• Neural Turing machine does not allow dynamic memory growth

• Data structures are often useful: stacks, lists, queues, etc.

Stack

Running example (v_t = pushed value, d_t = push strength, u_t = pop strength; the stack grows upwards):
t = 1: u1 = 0, d1 = 0.8; push v1 with strength 0.8
t = 2: u2 = 0.1, d2 = 0.5; v1 drops to strength 0.7, push v2 with strength 0.5
t = 3: u3 = 0.9, d3 = 0.9; v2 is removed from the stack, v1 drops to 0.3, push v3 with strength 0.9

r1 = 0.8 ∙ v1
r2 = 0.5 ∙ v2 + 0.5 ∙ v1
r3 = 0.9 ∙ v3 + 0 ∙ v2 + 0.1 ∙ v1

Push

[Figure: the stack diagram above, with a new value being pushed with strength s = 0.5]

• Push v with strength s: put the vector on top of the stack and assign it strength s.

Pop

[Figure: the stack diagram above, with a pop of strength s = 1.0 applied]

• Pop with strength s: subtract a total of s from the strengths of the top vectors, keeping strengths non-negative.

Head

[Figure: the stack diagram above, with the read head covering the topmost vectors until a total strength of 1 is reached]

• Head (read): compute a convex sum of the top vectors until a total strength of 1 is reached (sketched below).
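The push, pop and read (head) operations can be sketched numerically as follows. This is a plain-Python illustration that reproduces the running example's numbers, not the paper's differentiable min/max formulation, and the function and variable names are mine:

```python
import numpy as np

def stack_step(values, strengths, v, d, u):
    """One neural-stack update: pop with strength u, push v with strength d, then read."""
    # Pop: consume a total strength of u from the top of the stack downwards.
    kept, remaining = [], u
    for s in reversed(strengths):
        removed = min(s, remaining)
        remaining -= removed
        kept.append(s - removed)
    strengths = list(reversed(kept)) + [d]       # push v on top with strength d
    values = values + [v]
    # Read: convex combination of the topmost vectors until a total strength of 1.
    r, budget = np.zeros_like(v), 1.0
    for val, s in zip(reversed(values), reversed(strengths)):
        take = min(s, budget)
        r = r + take * val
        budget -= take
    return values, strengths, r

# Reproduces the running example above:
v1, v2, v3 = (np.random.randn(4) for _ in range(3))
V, S = [], []
V, S, r1 = stack_step(V, S, v1, d=0.8, u=0.0)    # r1 = 0.8*v1
V, S, r2 = stack_step(V, S, v2, d=0.5, u=0.1)    # r2 = 0.5*v2 + 0.5*v1
V, S, r3 = stack_step(V, S, v3, d=0.9, u=0.9)    # r3 = 0.9*v3 + 0*v2 + 0.1*v1
```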

Stack

[Figure: the neural stack as a differentiable block: the previous state (values V_{t-1}, strengths s_{t-1}) and the input (push strength d_t, pop strength u_t, value v_t) go in; the next state (V_t, s_t) and the read output r_t come out]

Neural data structures

• Queues and deques are constructed similarly

• During training the network tends to ignore the stack

• This is fixed with a bias towards pushing

Double-linked lists

Operations on the HEAD pointer:
• Move HEAD left
• Move HEAD right
• Insert a new element at position HEAD


Differentiable update of the HEAD pointer (a sketch follows):
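The update formula itself lives in the figure; one plausible differentiable version keeps the head as a probability distribution over cells and mixes left/right/stay moves, with the move probabilities assumed to come from the controller:

```python
import numpy as np

def move_head(head, p_left, p_right, p_stay):
    """Soft head move: convex mix of shifting the pointer left, right, or staying."""
    return (p_left * np.roll(head, -1)    # head distribution shifted one cell left
            + p_right * np.roll(head, 1)  # shifted one cell right
            + p_stay * head)              # unchanged

head = np.zeros(8); head[3] = 1.0         # pointer currently at cell 3
head = move_head(head, p_left=0.1, p_right=0.8, p_stay=0.1)
```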

Differentiable Neural Computer

• John Locke: memories are connected if they were formed nearby in time and space.


• So memory can naturally be organized as a graph

Graves et al., Hybrid computing using a neural network with dynamic external memory

DNC: Memory graph

• Add an adjacency matrix L

• L[i, j] is close to 1 if location i was written right after location j

• Move focus to the next memory moment: multiply the read weights by L

• Move focus to the previous memory moment: multiply the read weights by Lᵀ (see the sketch below)
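A simplified sketch of this bookkeeping, loosely following the DNC paper: a precedence vector remembers where the last write went, the link matrix L is updated after each write, and multiplying the read weights by L or Lᵀ moves the focus one write forward or backward.

```python
import numpy as np

def update_links(link, precedence, w_write):
    """Update the temporal link matrix L after a write (simplified DNC bookkeeping)."""
    outer = np.add.outer(w_write, w_write)
    link = (1 - outer) * link + np.outer(w_write, precedence)
    np.fill_diagonal(link, 0.0)                    # a location never follows itself
    precedence = (1 - w_write.sum()) * precedence + w_write
    return link, precedence

def shift_focus(link, w_read):
    """Move the read focus one write step forward / backward in time."""
    return link @ w_read, link.T @ w_read          # (forward, backward)

N = 16
link, precedence = np.zeros((N, N)), np.zeros(N)
w_write = np.random.dirichlet(np.ones(N))          # where the last write went
link, precedence = update_links(link, precedence, w_write)
forward, backward = shift_focus(link, np.random.dirichlet(np.ones(N)))
```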

DNC: Allocator

• Memory usage vector u: values in [0, 1] indicating presence of data

• Increases after writes

• Decreases after reads

• Writing to used positions is not allowed
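A sketch of usage-based allocation in the spirit of the DNC (the exact usage update with write, read and free gates is omitted): the least-used locations receive almost all of the allocation weight, which is how writes avoid used positions.

```python
import numpy as np

def allocation_weights(usage):
    """Give most of the write-allocation weight to the least-used locations."""
    order = np.argsort(usage)                    # free locations first
    alloc = np.zeros_like(usage)
    prod = 1.0
    for idx in order:
        alloc[idx] = (1.0 - usage[idx]) * prod   # weight shrinks as fuller cells pile up
        prod *= usage[idx]
    return alloc

usage = np.array([0.9, 0.1, 1.0, 0.4])           # u in [0, 1]: how "full" each cell is
print(allocation_weights(usage))                 # almost all weight goes to cell 1
```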

DNC Experiments: Underground

DNC Experiments

Family tree inference task (video)


Structures summary

Neural Turing Machines (Graves et al., 2014): arrays, associative arrays
Learning to Transduce with Unbounded Memory (Grefenstette et al., 2015): stack, queue, deque
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets (Mikolov et al., 2015): stack, double-linked list

Is it possible to have a Universal Neural Turing Machine?

Thank you for your attention