Parallel & Scalable Machine Learning
Introduction to Machine Learning
Dr. Jenia Jitsev
Head of Cross-Sectional Team Deep Learning (CST-DL), Scientific Lead Helmholtz AI Local (HAICU Local)
Institute for Advanced Simulation (IAS), Juelich Supercomputing Centre (JSC)

@Jenia Jitsev

LECTURE 9: Introduction to Deep Reinforcement Learning

Feb 19th, 2020, JSC, Germany

Machine Learning: Forms of Learning
● Supervised learning: correct responses Y ("teacher" signal, correct "outcomes", "labels" for the data) are given
  - find unknown predictors f: X→Y, y = f(x; W), for input data X by tuning free parameters W
  - in general, estimate the unknown probability density p(Y|X)
  - classical frameworks: classification, regression
● Unsupervised learning: only data X is given
  - find "hidden" structure, patterns in the data
  - in general, estimate the unknown density p(X); find a model that underlies / generates X; broad class of latent ("hidden") variable models
  - classical frameworks: clustering, dimensionality reduction (e.g., PCA)
● Reinforcement learning: data X and (sparse) reward r(X) are given
  - estimate p(r|X), V(X), Q(X, a) – future reward predictors
  - learn actions a that maximize total future reward R
  - active learning: experience depends on the choice of actions
● For all forms holds: define a loss L(D, W) and optimize it by tuning free parameters W

Deep Neural Networks: Forms of Learning
● Supervised learning: correct responses Y are given for input data X
  - find unknown f: y = f(x; W), or estimate the density p(Y|X; W)
  - deep CNN for visual object recognition (e.g., Inception, ResNet, ...)
● Unsupervised learning: only data X is given
  - general setting: estimate the unknown density p(X; W); find a model that underlies / generates X; broad class of latent ("hidden") variable models
  - Variational Autoencoder (VAE), Generative Adversarial Networks (GAN): data generation and inference
  - autoregressive generative models: PixelRNN, PixelCNN, ...
● Reinforcement learning: find predictors of total future reward R from data X and reward r(x, y)
  - Deep Q-Learning Network (DQN – Atari games: Breakout, ...)
  - Actor-Critic networks (A3C, PPO, SAC, ...)
  - parts of AlphaGo; AlphaZero; AlphaFold; ...
● For all forms holds: define a loss L(x, y, W) and optimize it by tuning parameters W

Deep Reinforcement Learning: breakthroughs (Mnih et al, Nature, 2015)
[figures: Deep Q-Learning Network (DQN) on Atari games]

Deep Learning: transforming the field (Silver et al, Nature, 2016)
● March 2016: DeepMind's AlphaGo wins 4:1 against 18-time Go world champion Lee Sedol

Deep Reinforcement Learning: breakthroughs (Silver et al, Nature, 2017)
● AlphaGo: surpassing the previous state of the art by a dramatic margin
● Learning a function instead of hard-wiring it through previous insight
● "Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games."
● AlphaGo Zero: learning from self-play only – reward for winning games; human play data is no longer necessary for training

Versions       | Hardware                  | Elo rating | Matches
AlphaGo Fan    | 176 GPUs, distributed     | 3,144      | 5:0 against Fan Hui
AlphaGo Lee    | 48 TPUs, distributed      | 3,739      | 4:1 against Lee Sedol
AlphaGo Master | 4 TPUs v2, single machine | 4,858      | 60:0 against professional players (Future of Go Summit)
AlphaGo Zero   | 4 TPUs v2, single machine | 5,185      | 100:0 against AlphaGo Lee; 89:11 against AlphaGo Master

Deep Reinforcement Learning: breakthroughs (Silver et al, Nature, 2017; Segler et al, Nature, 2018)
● AlphaZero employs deep neural networks for generic control / optimization problems (e.g., chemical synthesis, ...)
  - suitable for any type of problem with state transitions where some states are desired ("good", "correct") or undesired ("bad", "incorrect") as outcome
[figure panels: Game, Chemistry]
Deep Reinforcement Learning: breakthroughs (Evans et al, 2018)
● AlphaFold

Deep Reinforcement Learning: breakthroughs (Vinyals et al, Nature, 2019)
● AlphaStar

Deep RL: JURECA
● Copy the DeepRL folder:
  cp -r /p/project/training2001/practicals_DeepRL /p/project/training2001/$USER/DeepRL
● Prepare the environment:
  cd /p/project/training2001/$USER/DeepRL
  source ./scripts/load_module_DeepRL.sh
  source ./scripts/init_env_DeepRL.sh
  source ./scripts/load_env_DeepRL.sh
  source ./scripts/install_packages_DeepRL.sh
● Testing the OpenAI Gym environment (https://gym.openai.com):
  python ./Test/test_opengym_CartPole.py
  python ./Test/test_opengym_pong_breakout.py
  (the variables render, steps, episodes may be adapted accordingly)

Deep Neural Networks: Forms of Learning
● Supervised learning: for each input example x, the correct y is provided
● Unknown function: find unknown f: Y = f(X; W)
● Probabilistic view: unknown density P(Y|X, W)
● Define a loss L(x, y, W) for data (x, y), optimize by tuning parameters W
● Learning: loss L(x, y, W) is defined over observed data; e.g., gradient descent adapting W to minimize the loss
[figure: feed-forward network with input X, weights W1 ... Wk, layer activations Z1 ... Zk, output Y]
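To make the supervised setting above concrete, here is a minimal sketch (added for illustration, not part of the original slides) of loss minimization by gradient descent for a simple linear model y = f(x; W) = Wx with a squared-error loss; the data and hyperparameters are toy placeholders.

    import numpy as np

    # Toy data: targets generated by an unknown linear function plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # input data X
    true_W = np.array([1.5, -2.0, 0.5])
    y = X @ true_W + 0.1 * rng.normal(size=100)   # correct responses Y

    W = np.zeros(3)     # free parameters W, to be tuned
    lr = 0.1            # learning rate

    for step in range(200):
        y_pred = X @ W                            # y = f(x; W)
        loss = np.mean((y_pred - y) ** 2)         # loss L(x, y, W) over observed data
        grad = 2 * X.T @ (y_pred - y) / len(y)    # gradient of the loss w.r.t. W
        W -= lr * grad                            # gradient descent step adapting W

    print(loss, W)      # W approaches true_W as the loss is minimized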

Deep Neural Networks: Forms of Learning
● Learn y = f(x) from data (X, Y) by adapting W
● However: delivering explicit data with a specific teacher signal is often implausible or impractical (expensive labeling required)
● What if the correct output y for inputs x to the network is unknown?
● What are desired, relevant f(x) in this case anyway?

Reinforcement Learning
● Reinforcement learning: use available sensory input and the relevant consequences ("good"/"bad") of self-generated actions to extract a loss L(x, W)
  - using "reward/punishment" signals to define the loss L(x, W)
● Prediction error driven learning: use the mismatch between own predictions (about states, rewards, etc.) and incoming observations to update beliefs/expectations

Reinforcement Learning (RL) (Barto et al, 1983; Niv et al, 2006)
● MDP: Markov Decision Process <S, A, P, R, γ>
● Objective: "solve" the MDP towards an optimal "policy" π(s,a) – getting maximal reward in the given environment
● For each state s, sensory input X, select a response, an action a that is "optimal"
● Estimating Q(s,a), V(s), π(s,a) would provide the full solution
[figure: agent/control – environment/tasks loop: the agent receives sensory input X, s and reward r, and emits action a]
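For reference (standard definitions following Sutton and Barto, added here; the slide itself only names these quantities), the total discounted future reward and the value functions being estimated are:

    R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1
    V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s \right]
    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s, a_t = a \right]
    \pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[ R_t \right]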

Deep Reinforcement Learning
● Use the notion of a "good" or "bad" outcome: "reward/punishment" signals
● What if the correct output y for inputs x to the network is unknown?
  - expected outcome vs. "true" outcome: Q(s,a), V(s), π(s,a)
  - preferred action vs. "optimal" action: Q(s,a), V(s), π(s,a)
● Parametrize Q(s,a), V(s), π(s,a) via a deep neural network f(X; W)
● "Reward/punishment" signals R enter the loss function L(x, R, W)
● Learning: loss L(x, R, W) is defined over observed data; gradient descent adapting W to minimize the loss
[figure: deep network f(X; W) with weights W1 ... Wk mapping input X to the estimates]

Deep Reinforcement Learning
● What can be the loss function L(x, R, W)?
[figure: deep network mapping sensory input X to V(s), Q(s,a), π(s,a); the environment returns reward r for action a]

Reinforcement Learning (RL) (Bellman, 1957; Sutton and Barto, 1997)
● Bellman equations: mathematical foundation for optimizing decision making and behavior
  - a value function V(s) describes the expected outcome (given a policy for action selection)
  - the optimal value function returns the best possible outcome

Reinforcement Learning (RL) (Barto et al, 1983; Sutton and Barto, 1997)
● Learning can be driven by the prediction error signal: better or worse than expected
● Self-consistency condition for optimal V(s), Q(s,a)
● Computing the prediction error (difference between the internal expectation and the following observation)
● Updating beliefs/expectations and action preferences (actor-critic)

Reinforcement Learning and the Brain (Barto et al, 1983; Sutton and Barto, 1998)
● Basal ganglia and reinforcement learning [figure: made with BodyParts3D, Integrated Database Project, MEXT, Japan]
● Dopamine and reward-based learning (Schultz et al, 1997; Doya, 2002)

Reinforcement Learning (RL)
● Learning driven by prediction error: an example
[figure: agent/control – environment/tasks loop with V(s), Q(s,a), π(s,a), sensory input X, s, action a, reward r]
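For reference (standard textbook form, not written out on the slides), the Bellman self-consistency condition and the resulting temporal-difference prediction error are:

    V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right]
    V^{*}(s) = \max_{a} \mathbb{E}\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \right]
    \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

with δ_t > 0 meaning "better than expected" and δ_t < 0 "worse than expected".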

Reinforcement Learning (RL)
● Main point: the state description s as well as all essential estimates Q(s,a), V(s), π(s,a) in more realistic environments are unknown and complex. The available input is usually very high dimensional.
● → Use deep neural networks to learn the estimates Q(s,a), V(s), π(s,a).

Deep Reinforcement Learning (Deep RL)
● Value-based RL: estimate expected outcomes Q(s,a), V(s); compute π(s,a) from these (an additional, though straightforward step)
● Policy-based RL: directly estimate π(s,a) (action probabilities that maximize long-term reward)
● Actor-Critic approaches to RL: mixtures of those two different approaches to overcome their limitations

Deep Reinforcement Learning (Deep RL)
● Different architectures and losses, depending on which estimates drive learning and in what way
[figure: deep network mapping sensory input X to V(s), Q(s,a), π(s,a); the environment returns reward r for action a]

Deep Reinforcement Learning (Deep RL)
● Pure value-based learning: gradient descent on Q(s,a), V(s) → slow convergence
● Pure policy-based learning: gradient descent on the policy π(s,a) → high variance
● Combining both: a plethora of actor-critic approaches
  - maintaining an explicit policy and its gradient: actor → action selection
  - maintaining an explicit value estimation: critic
  - policy gradient modulated by value-based responses; the value estimate drives both its own gradient and modulates the policy gradient of the actor

Deep RL: pure policy gradient
● Learning can be driven directly by the policy gradient
  - observe the sampled outcome (rewards) and change the probability of actions a being selected in given states s
  - policy gradient
  - objective (loss with reversed sign) and gradient ascent
● Can converge fast, but high variance
[figure: deep network mapping sensory input X to π(s,a); the environment returns reward r for action a]
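In its simplest (REINFORCE-style) form, which the slide does not write out explicitly, the policy-gradient objective and its sampled gradient estimate are:

    J(W) = \mathbb{E}_{\pi_W}\left[ R \right]
    \nabla_W J(W) \approx \sum_{t} \nabla_W \log \pi_W(a_t \mid s_t)\, R_t

where R_t is the observed return from step t; gradient ascent on J is equivalent to gradient descent on the loss -J (the "loss with reversed sign").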

Deep RL: pure value-based gradient
● Learning can be driven by the prediction error of the estimated value
  - estimate the expected value (total return), observe the sampled outcome (rewards), correct for the deviation from the expectation
  - prediction error and value gradient
  - loss and gradient descent
  - action selection from the estimated values

Deep RL: Deep Q-Learning Network (DQN)
● Learning driven by the reward prediction error and the gradient of the predicted value Q(s,a)
● Accurate, but slow convergence
[figure: deep network mapping sensory input X to Q(s,a); the environment returns reward r for action a]

Deep RL: DQN (Mnih et al, Nature, 2015)
[figures: DQN network architecture, Q(s,a) outputs, Atari results]
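The DQN loss referred to above can be written (standard form from Mnih et al, 2015; added here for reference) as a squared temporal-difference error between the Q-network and a bootstrapped target:

    L(W) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; W^{-}) - Q(s, a; W) \right)^2 \right]

where W⁻ are the parameters of a periodically updated target network and (s, a, r, s') are sampled transitions.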

Deep RL: Actor-Critic architectures
● Actor-critic: learning via both value Q(s,a) and policy π(s,a) gradients
● Combines both good accuracy and fast convergence
[figure: deep network mapping sensory input X to Q(s,a), V(s) (critic) and π(s,a) (actor); the environment returns reward r for action a]

Deep RL: Actor-Critic architectures
● Actor-critic RL with value Q(s,a) and policy π(s,a) gradient
  - loss (for the critic), objective (for the actor)
  - prediction error, value gradient, and value-modulated policy gradient
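One standard way to make this concrete (a common actor-critic formulation, not taken verbatim from the slides) uses the TD prediction error as an advantage estimate that serves both roles:

    \delta_t = r_{t+1} + \gamma V(s_{t+1}; W_c) - V(s_t; W_c)
    L_{\mathrm{critic}}(W_c) = \delta_t^2
    \nabla_{W_a} J_{\mathrm{actor}} \approx \nabla_{W_a} \log \pi(a_t \mid s_t; W_a)\, \delta_t

The critic parameters W_c are updated by gradient descent on its loss, the actor parameters W_a by gradient ascent on its objective, with the policy gradient modulated by the value-based error δ_t.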

Deep RL: Actor-Critic architectures (Silver et al, Nature, 2016)
● Key ingredient of AlphaGo: deep convolutional neural networks that learn from gameplay experience to estimate the value of game positions

Deep RL: Actor-Critic architectures (Silver et al, Nature, 2017; Segler et al, Nature, 2018)
● AlphaZero for generic control / optimization problems (e.g., chemical synthesis, ...)
  - suitable for any type of problem with defined state transitions, where some state outcomes have desired ("good", "correct") or undesired ("bad", "incorrect") status
[figure panels: Game, Chemistry]

Complex tasks with complex input data
● Realistic input data for task learning
  - OpenAI: Gym (https://gym.openai.com), Universe
  - DeepMind Lab; StarCraft2, ...
● Realistic input data for real-world task learning
  - full city-scale scenario (Improbable, Immense Simulations, UK: SpatialOS)
  - CarCraft: car traffic simulator (Waymo, Google)

Deep Reinforcement Learning: Frontiers
● Advanced topics out of the course's scope:
  - Distributed RL: utilizing multiple CPUs and GPUs for training
  - Hierarchical RL
  - Multi-task RL, transfer RL
  - Generative, model-based RL ("world models", fusion with unsupervised learning): model and simulate the environment
  - Reinforcement learning for meta-learning (learning to learn, learning task-specific cost functions)

Deep Learning: Distributed Training with HPC
● Distributed execution over multiple GPUs
● Data parallelism: the dominating distributed training scheme
  - can drastically speed up the training phase

Deep Learning: Distributed Training with HPC (Jitsev, Strube, 2018)
● Training with multiple GPUs (TensorFlow + Horovod, JUWELS, V100): supervised learning on ImageNet
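As an illustration of the data-parallel scheme mentioned above, here is a minimal sketch in the spirit of TensorFlow + Horovod (a generic pattern, not the actual course training script; the model, data and hyperparameters are placeholders):

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per GPU, launched e.g. via srun/mpirun

    # Pin each process to a single local GPU (if GPUs are available).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder model and synthetic data standing in for the actual ImageNet setup.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    x = np.random.rand(1024, 32).astype('float32')
    y = np.random.randint(0, 10, size=(1024,))

    # Scale the learning rate with the number of workers and wrap the optimizer so
    # that gradients are averaged across workers: this is the data-parallel scheme.
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

    # Broadcast initial weights from rank 0 so all workers start from identical parameters.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)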
Distributed Deep RL (Espeholt et al, 2018)
● Different approaches: A3C, Impala (DeepMind), Ape-X, CULE (NVIDIA), ...
● Impala decouples acting from learning
  - otherwise: low GPU utilisation due to rendering time variance within a batch

Deep Reinforcement Learning: Frontiers (Ha, Schmidhuber, 2018)
● Generative, model-based RL (model and simulate the environment, fusion with unsupervised learning): "world models"

Growing General AI
● Self-organized, continual, transferable learning
  - Realistic virtual environments or customizable, procedurally generated worlds for obtaining data: large data sets that include rich structure (e.g., temporal dynamics of events)
  - Open-ended learning: improving both forward and inverse models from data and from experience with different tasks; learning in a closed action-perception loop, never "off"
  - Multi-task and transfer learning: treating learning of task structure itself as a further step in hierarchical inference (task as meta-object)
  - Meta-learning ("learning to learn"): architectures that improve their core ability to learn from various data

Growing General AI
● Cross-Sectional Team Deep Learning (CST-DL): basic and applied research, support
  - embedding into the Helmholtz Artificial Intelligence Cooperation Unit (HAICU Local, together with the High Level Support Team, HLST): a cooperation hub for deep learning

Growing General AI
● Create a common "light house" project with a long-term perspective:
  - research activity that is beneficial across different domains
  - seamless integration of collaborators with different interests
  - extensive use and accumulation of experience with continuous operation and maintenance of large-scale learning on HPC
● → Self-organized, continual, transferable learning

Deep RL: JURECA
● Testing the OpenAI Gym environment:
  cd /p/project/training2001/$USER/DeepRL
  python ./Test/test_opengym_CartPole.py
  python ./Test/test_opengym_pong_breakout.py
  (the variables render, steps, episodes may be adapted accordingly)
● Job submission:
  cd ./CartPole/1-dqn/
  - Run training using DQN in the CartPole environment using CPU:
    sbatch /p/project/training2001/$USER/DeepRL/job_scripts/jobRun_CartPole-DQN_1Node_CPU.sh
  - Run training using DQN in the CartPole environment using GPU:
    sbatch /p/project/training2001/$USER/DeepRL/job_scripts/jobRun_CartPole-DQN_1Node_1GPU.sh
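For orientation, the Gym test scripts above roughly do the following (a minimal sketch using the classic Gym API from around 2020 with a random agent; the handling of render/steps/episodes is an assumption, not the actual script):

    import gym

    env = gym.make("CartPole-v1")
    episodes, max_steps, render = 5, 200, False   # the variables mentioned on the slide

    for episode in range(episodes):
        obs = env.reset()                  # classic Gym API: reset() returns the first observation
        total_reward = 0.0
        for step in range(max_steps):
            if render:
                env.render()
            action = env.action_space.sample()          # random agent, stands in for DQN/Reinforce/A3C
            obs, reward, done, info = env.step(action)  # classic 4-tuple step() signature
            total_reward += reward
            if done:                       # pole fell over or the time limit was reached
                break
        print(f"episode {episode}: {int(total_reward)} successful time steps")

    env.close()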

OpenAI Gym: CartPole
● CartPole: Reinforce (vanilla policy gradient)
[plot: successful time steps vs. episodes]

OpenAI Gym: CartPole
● CartPole: DQN (Deep Q-Learning Network – value learning)

[plot: successful time steps vs. episodes]

OpenAI Gym: CartPole
● CartPole: A3C (actor-critic, multiple workers for gathering data)

[plot: successful time steps vs. episodes]

Resources
● Free book (pdf available on the site):
  Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, Cambridge, MA, 2018
  http://www.incompleteideas.net/book/the-book-2nd.html
● Free lectures:
  - David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
  - Sergey Levine: http://rail.eecs.berkeley.edu/deeprlcourse/
  - ...
● Plenty of free code:
  - https://spinningup.openai.com
  - https://github.com/hill-a/stable-baselines
