
Modular multitask reinforcement learning with policy sketches

Jacob Andreas, Sergey Levine and Dan Klein

The learning problem

make planks
make sticks

Learning from sketches

[Figure: the two tasks annotated with sketch symbols, including "get wood" and "use"]

The options framework

[Figure: reward of +1]

[Sutton et al. 99, Bacon & Precup 16]

Learning from intermediate rewards

[Figure: intermediate rewards r]

[Kearns & Singh 02, Kulkarni et al. 16]

Learning from demonstrations

[Stolle & Precup 02, Fox & Krishnan et al. 16]

Learning from policy sketches

get wood    use saw

Under review as a conference paper at ICLR 2017

A  TASKS AND SKETCHES

The complete list of tasks, sketches, and symbols is given below. Tasks marked with an asterisk (*) are held out for the generalization experiments described in Section 4.4, but included in the multitask training experiments in Sections 4.2 and 4.3.

Goal           Sketch

Maze environment
goal1          left, left
goal2          left, down
goal3          right, down
goal4          up, left
goal5          up, right
goal6          up, right, up
goal7          down, right, up
goal8          left, left, down
goal9          right, down, down
goal10         left, up, right

Crafting environment
make plank     get wood, use toolshed
make stick     get wood, use workbench
make cloth     get grass, use factory
make rope      get grass, use toolshed
make bridge    get iron, get wood, use factory
make bed*      get wood, use toolshed, get grass, use workbench
make axe*      get wood, use workbench, get iron, use toolshed
make shears    get wood, use workbench, get iron, use workbench
get gold       get iron, get wood, use factory, use bridge
get gem        get wood, use workbench, get iron, use toolshed, use axe

Why sketches?

Easy to collect
Portable

Learning from policy sketches

make planks:    get wood    use saw

make sticks:    get wood    use axe

get wood    use saw
       πa

get wood    use axe
       πb

[e.g. Branavan et al. 09, Oh et al. 17, Hermann et al. 17]

get wood    use saw
π1          π2

get wood    use axe
π1          π3
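The sharing shown above, where "get wood" maps to the same π1 in both sketches, can be pictured as a table keyed by sketch symbol. The following is only an illustrative sketch: `Subpolicy`, `build_subpolicies` and `policy_for_task` are hypothetical names, and the learned parameters are left as an empty placeholder.

```python
# Illustrative only: these names are hypothetical, not taken from the psketch code.
from dataclasses import dataclass, field


@dataclass
class Subpolicy:
    """Placeholder for one learned subpolicy (parameters omitted)."""
    symbol: str
    params: list = field(default_factory=list)


def build_subpolicies(sketches):
    """Create exactly one subpolicy per distinct symbol across all sketches."""
    table = {}
    for symbols in sketches.values():
        for symbol in symbols:
            if symbol not in table:
                table[symbol] = Subpolicy(symbol)
    return table


def policy_for_task(task, sketches, table):
    """Return the sequence of shared subpolicies named by the task's sketch."""
    return [table[symbol] for symbol in sketches[task]]


SKETCHES = {
    "make planks": ["get wood", "use saw"],
    "make sticks": ["get wood", "use axe"],
}

table = build_subpolicies(SKETCHES)
planks = policy_for_task("make planks", SKETCHES, table)
sticks = policy_for_task("make sticks", SKETCHES, table)
```

Both task policies start with the very same `Subpolicy` object for "get wood", so anything learned for it on one task is available to the other.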

get wood
π1

Policy representation

π1 (get wood): ???

π1 (get wood): action probabilities
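As a rough illustration of a subpolicy that outputs action probabilities, the sketch below uses a linear layer followed by a softmax. The architecture, the action set `ACTIONS`, the feature vector and the class name `LinearSubpolicy` are all assumptions made for this example; the slides do not specify them.

```python
import math
import random

# Hypothetical action set, assumed for illustration only.
ACTIONS = ["up", "down", "left", "right", "use"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearSubpolicy:
    """Maps a state feature vector to a probability for each action."""

    def __init__(self, n_features, n_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(n_features)]
            for _ in range(n_actions)
        ]

    def action_probabilities(self, state):
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.weights]
        return softmax(logits)

    def sample(self, state, rng):
        """Draw an action index according to the action probabilities."""
        probs = self.action_probabilities(state)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]


pi_get_wood = LinearSubpolicy(n_features=4, n_actions=len(ACTIONS))
probs = pi_get_wood.action_probabilities([1.0, 0.0, 0.5, -0.5])
action = pi_get_wood.sample([1.0, 0.0, 0.5, -0.5], random.Random(1))
```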

Policy search

\[
\sum_{\text{tasks}} \sum_{\text{steps}} \big(\nabla \log \pi(a_t \mid s_t)\big)\,(r_t - b)
\]

Here a_t is the action, s_t the state, r_t the reward and b the baseline. The same update is applied to each subpolicy, for example \pi_{\text{get wood}} or \pi_{\text{use axe}}.

Reward: .40

\[
\sum_{\text{tasks}} \sum_{\text{steps}} \big(\nabla \log \pi_{\text{SUBPOLICY}}(a_t \mid s_t)\big)\,(r_t - b)
\]
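A minimal sketch of the update above, assuming each subpolicy is a linear-softmax policy (an assumption for this example, not something the slides state). Each step is credited to whichever subpolicy was active, and r_t is taken as given per step, as written on the slide. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(weights, state, action):
    """Gradient of log pi(action | state) for a linear-softmax policy.

    With logits z_k = weights[k] . state, the derivative with respect to
    weights[k][j] is (1[k == action] - pi(k | state)) * state[j].
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
    probs = softmax(logits)
    return [
        [((1.0 if k == action else 0.0) - probs[k]) * x for x in state]
        for k in range(len(weights))
    ]


def policy_gradient(subpolicies, steps, baseline):
    """Accumulate (grad log pi_SUBPOLICY(a|s)) * (r_t - b) per subpolicy."""
    grads = {
        symbol: [[0.0] * len(row) for row in weights]
        for symbol, weights in subpolicies.items()
    }
    for symbol, state, action, reward in steps:
        step_grad = grad_log_prob(subpolicies[symbol], state, action)
        advantage = reward - baseline
        for k, row in enumerate(step_grad):
            for j, value in enumerate(row):
                grads[symbol][k][j] += value * advantage
    return grads


# Two subpolicies, two actions, two state features, all-zero weights.
subpolicies = {
    "get wood": [[0.0, 0.0], [0.0, 0.0]],
    "use axe": [[0.0, 0.0], [0.0, 0.0]],
}
# Each step: (active subpolicy, state, action index, reward r_t).
steps = [
    ("get wood", [1.0, 0.0], 0, 0.0),
    ("use axe", [0.0, 1.0], 1, 1.0),
]
grads = policy_gradient(subpolicies, steps, baseline=0.4)
```

In this toy run the rewarded "use axe" step pushes its chosen action up, while the unrewarded "get wood" step is pushed down because its reward falls below the shared baseline.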

Improving policy search

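\[
\begin{array}{ll}
\big(\nabla \log \pi_{\text{use saw}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make planks}}) & \big(\nabla \log \pi_{\text{use saw}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make nails}}) \\
\big(\nabla \log \pi_{\text{use axe}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make planks}}) & \big(\nabla \log \pi_{\text{use axe}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make nails}}) \\
\big(\nabla \log \pi_{\text{get wood}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make planks}}) & \big(\nabla \log \pi_{\text{get wood}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make nails}}) \\
\big(\nabla \log \pi_{\text{get iron}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make planks}}) & \big(\nabla \log \pi_{\text{get iron}}(a_t \mid s_t)\big)\,(r_t - b_{\text{make nails}})
\end{array}
\]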
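The terms above pair every subpolicy with a baseline that belongs to the task, not to the subpolicy. One simple way to keep such baselines is a running mean of observed rewards per task. That choice, and the class name `TaskBaselines`, are assumptions for this sketch; the slides only indicate that the baseline is indexed by task.

```python
from collections import defaultdict


class TaskBaselines:
    """Keeps one running-mean reward baseline per task (assumed form)."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, task, reward):
        """Record one observed reward for the given task."""
        self.total[task] += reward
        self.count[task] += 1

    def value(self, task):
        """Mean reward seen so far for the task, or 0.0 if none yet."""
        n = self.count[task]
        return self.total[task] / n if n else 0.0

    def advantage(self, task, reward):
        """The (r_t - b_TASK) factor that scales the gradient term."""
        return reward - self.value(task)


baselines = TaskBaselines()
observed = [("make planks", 1.0), ("make planks", 0.0), ("make nails", 0.0)]
for task, reward in observed:
    baselines.update(task, reward)

# The same reward is judged against a different baseline for each task.
adv_planks = baselines.advantage("make planks", 1.0)
adv_nails = baselines.advantage("make nails", 1.0)
```

Because "get wood" may be used by both tasks, the same subpolicy receives differently scaled updates depending on which task the episode came from.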

[Figure: reward, with values .40 and .89]

\[
\sum_{\text{tasks}} \sum_{\text{steps}} \big(\nabla \log \pi_{\text{SUBPOLICY}}(a_t \mid s_t)\big)\,(r_t - b_{\text{TASK}})
\]

Do sketches help?

The maze navigation task

[Figure: reward vs. episodes (0 to 3 × 10^6) on the maze navigation task. Curves: Sketches: modular; Sketches: joint; Unsupervised.]

The mini-craft task

[Figure: reward vs. episodes (0 to 3 × 10^6) on the mini-craft task. Curves: Sketches: modular; Sketches: joint; Unsupervised.]

The cliff-walking task

[Figure: log reward vs. timesteps (0 to 3 × 10^8) on the cliff-walking task. Curves: Sketches: modular; Sketches: joint; Unsupervised.]

Zero-shot generalization

What if I see a sketch I’ve never seen before?

get iron

use axe
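With modular subpolicies, an unseen sketch such as the one above can in principle be run by chaining subpolicies that were trained on other tasks. The sketch below only shows that lookup step. The training sketches listed (in particular the one for "make nails") are hypothetical, and strings stand in for trained subpolicies.

```python
def compose(sketch, trained):
    """Return the trained subpolicies for a sketch, in order.

    Raises KeyError if the sketch uses a symbol with no trained subpolicy.
    """
    missing = [symbol for symbol in sketch if symbol not in trained]
    if missing:
        raise KeyError(f"no trained subpolicy for: {missing}")
    return [trained[symbol] for symbol in sketch]


# Hypothetical training sketches, assumed for illustration.
TRAINING_SKETCHES = {
    "make planks": ["get wood", "use saw"],
    "make sticks": ["get wood", "use axe"],
    "make nails": ["get iron", "use saw"],
}

# Stand-ins for trained subpolicies: one entry per symbol seen in training.
trained = {
    symbol: f"pi[{symbol}]"
    for symbols in TRAINING_SKETCHES.values()
    for symbol in symbols
}

# A sketch that never appeared during training, built from known symbols.
new_sketch = ["get iron", "use axe"]
plan = compose(new_sketch, trained)
```

The new sketch is not among the training sketches, yet every symbol in it already has a subpolicy, so no further training is needed to attempt the task.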

[Figure: bar chart, vertical axis 0 to 100. Groups: Multitask, Zero-shot. Series: Joint, Modular. Bar labels: 89, 77, 49, 1.]

Fast adaptation

What if I don’t get a sketch at test time?

???

[Figure: bar chart, vertical axis 0 to 100. Groups: Multitask, Adaptation. Series: Unsupervised, Sketches. Bar labels: 89, 76, 47, 42.]

Conclusions

Any bit of data goes a long way

Thank you!

https://github.com/jacobandreas/psketch