arXiv:2003.01431v1 [cs.NE] 3 Mar 2020

Embodied Synaptic Plasticity with Online Reinforcement Learning

Jacques Kaiser¹, Michael Hoff¹,², Andreas Konle², J. Camilo Vasquez Tieck¹, David Kappel²,³,⁴, Daniel Reichard¹, Anand Subramoney², Robert Legenstein², Arne Roennau¹, Wolfgang Maass², Rüdiger Dillmann¹

¹ FZI Research Center for Information Technology
² Graz University of Technology
³ Georg-August-Universität Göttingen
⁴ Technische Universität Dresden

March 4, 2020

Abstract

The endeavor to understand the brain involves multiple collaborating research fields. Classically, synaptic plasticity rules derived by theoretical neuroscientists are evaluated in isolation on pattern classification tasks. This contrasts with the biological brain, whose purpose is to control a body in closed-loop. This paper contributes to bringing the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework allows to evaluate the validity of biologically-plausible plasticity models in closed-loop robotics environments. We demonstrate this framework by evaluating Synaptic Plasticity with Online REinforcement learning (SPORE), a reward-learning rule based on synaptic sampling, on two visuomotor tasks: reaching and lane following. We show that SPORE is capable of learning to perform policies within the course of simulated hours for both tasks. Provisional parameter explorations indicate that the learning rate and the temperature driving the stochastic processes that govern synaptic learning dynamics need to be regulated for performance improvements to be retained. We conclude by discussing the recent deep reinforcement learning techniques which would be beneficial to increase the functionality of SPORE on visuomotor tasks.

Keywords: Neurorobotics, Synaptic Plasticity, Spiking Neural Networks, Neuromorphic Vision, Reinforcement Learning

1 Introduction

The brain evolved over millions of years for the sole purpose of controlling the body in a goal-directed fashion. Computations are performed relying on neural dynamics and asynchronous communication. Spiking neural network models base their computations on these computational principles. Biologically plausible synaptic plasticity rules for functional learning in spiking neural networks are regularly proposed ([46, 17, 32, 35, 41]). In general, these rules are derived to minimize a distance (referred to as error) between the output of the network and a target. Therefore, the evaluation of these rules is usually carried out on open-loop pattern classification tasks. By neglecting the embodiment, this type of evaluation disregards the closed-loop dynamics between the network and its environment.

This is the author’s version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081

In a closed loop, the decisions taken by the brain have an impact on the environment, and this change is sensed back by the brain. To get a deeper understanding of synaptic learning rules, an embodied evaluation is therefore necessary. Such an evaluation is technically complicated, since spiking neurons are dynamical systems that must be synchronized with the environment. Additionally, sensory information and motor commands need to be encoded and decoded as spikes, respectively.

In the last years, important progress has been made on learning motor control commands from visual input with deep learning. However, deep learning approaches rely on batch learning and biologically implausible mechanisms such as dense synchronous communication. For networks of spiking neurons, learning visuomotor tasks online with synaptic plasticity remains challenging.

In this paper, we bring the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework is capable of learning online the control of simulated and real robots with a spiking neural network in a modular fashion. We demonstrate this framework with the promising learning rule Synaptic Plasticity with Online REinforcement learning (SPORE), an instantiation of the synaptic sampling scheme introduced in [21], which models the growth of dendritic spines with respect to dopamine influx. SPORE learns online from precise spike times and is entirely implemented with spiking neurons. We evaluate it in a closed-loop reaching and a lane following ([18, 4]) setup. In both tasks, an end-to-end visuomotor policy is learned, mapping visual input to motor commands.

The main contribution of this paper is the embodiment of SPORE ([21, 19, 22, 45]) and its evaluation on two neurorobotic tasks. Visual input is encoded as address events provided by a simulation of the Dynamic Vision Sensor (DVS) ([27]). This representation drastically reduces the redundancy of the visual input, as only motion is sensed, allowing more efficient learning. It agrees with the two-pathways hypothesis, which states that motion is processed separately from color and shape in the visual cortex. Previously, SPORE was only evaluated on simple proof-of-concept learning tasks. Our embodiment, realized using a combination of open-source software components, allowed us to identify crucial techniques to regulate the SPORE learning dynamics that were not discussed in the previous works introducing this learning rule ([21, 19, 22, 45]). In particular, our results suggest that learning rate annealing is beneficial to retain a performing policy on advanced problems such as the lane following task.

This paper is structured as follows. We provide a review of the related work in Section 2. In Section 3, we give a brief overview of SPORE and discuss the contributed techniques required for its embodiment. The implementation and evaluation on the two chosen neurorobotic tasks are presented in Section 4. Finally, we discuss in Section 5 how the method could be improved.

2 Related Work

The year 2015 marked a significant breakthrough in deep reinforcement learning. Artificial neural networks of analog neurons are now capable of solving a variety of tasks, ranging from playing video games ([30, 29]) to controlling multi-joint robots ([28]). Most recent methods ([39, 38]) are based on policy-gradients: policy parameters are updated by performing ascending gradient steps with backpropagation to maximize the probability of taking rewarding actions. While functional, these methods are not based on biologically plausible processes. First, a large part of the neural dynamics is ignored. Additionally, these methods do not learn online – weight updates are performed with respect to entire trajectories stored in rollout memory. Second, backpropagation is not considered a biologically plausible learning mechanism, as stated in [3].

Spiking network models inspired by deep reinforcement learning techniques were introduced in [18] and [4]. In both papers, the spiking networks are implemented with deep learning frameworks (PyTorch and TensorFlow, respectively) and rely on automatic differentiation. Their policy-gradient approach is based on Proximal Policy Optimization (PPO) ([39]). As the learning mechanism consists of backpropagating the PPO loss (through-time in the case of [4]), most biological constraints stated in [3] are still violated. Indeed, the computations are based on spikes (1), but backpropagation requires precise knowledge of the derivatives (4), purely linear feedforward and feedback weights (2), and the phases of the feedforward and the feedback paths to alternate synchronously (3), (5) (the enumeration refers to the corresponding points of [3]).

Only a small body of work focused on reinforcement learning with spiking neural networks while addressing biological constraints. Groundwork of reinforcement learning with spiking networks was presented in [16, 40] and [33, 26]. In these works, a mathematical formalization of dopamine-modulated spike-timing-dependent plasticity (DA-STDP) is introduced, characterizing how this form of plasticity solves the distal reward problem. Specifically, since the reward is received only after a rewarding action is performed, the problem of assigning credit to the previously chosen actions is solved with eligibility traces, a concept which has also been observed in the brain ([11, 34]). Fewer works evaluated DA-STDP in an embodiment for reward maximization – a recent survey encompassing this topic is available in [5].

The closest previous works related to this paper are [18] and [31]. In [18], a neurorobotic lane following task is presented, where a simulated vehicle is controlled end-to-end from event-based vision to motor commands. The task is solved with a hard-coded spiking network of 16 neurons implementing a simple Braitenberg vehicle, and the performance is evaluated with respect to the distance and orientation differences to the middle of the lane. In this paper, these performance metrics are combined into a reward signal which the network maximizes with the SPORE learning rule. In [31], the authors evaluate DA-STDP (referred to as R-STDP for reward-modulated STDP) in a similar lane following environment. The two motor neurons controlling the steering receive different (mirrored) reward signals depending on whether the vehicle is on the left or on the right of the lane. This way, the reward provides the information of which motor command should be taken, similar to a supervised learning setup. Their spiking approach outperforms the hard-coded Braitenberg vehicle presented in [18]. Conversely, the approach presented in this paper is more generic, since a global reward is distributed to all synapses and does not indicate which action the agent should take.

A similar plasticity rule implementing a policy-gradient approach is derived in [2]. Also relying on eligibility traces, this reward-learning rule uses a "slow" noise term to drive the exploration. With proper tuning of the noise term, the rule is demonstrated on a target reaching task and achieves impressive learning times (in the order of 100s) comparable to the ones presented in this paper. This is discussed in Section 4.1.1.

In [6], a spiking version of the free-energy-based reinforcement learning framework is evaluated on discrete-action tasks, where the observations consist of MNIST digits processed by a pre-trained feature extractor. In this framework, a spiking Restricted Boltzmann Machine (RBM) is trained with a reward-modulated plasticity rule which decreases the free-energy of rewarding state-action pairs. However, the implementation of the approach has some cumbersome characteristics: symmetric synapses and clocked network activity. With our approach, network activity does not have to be manually synchronized into observation and action phases of arbitrary duration for learning to take place.
In [13], a supervised synaptic learning rule named Feedback-based Online Local Learning Of Weights (FOLLOW) is introduced. This rule is used to learn the inverse dynamics of a two-link arm: the model predicts control commands (torques) for a given arm trajectory. In contrast, the loop is closed in [14] by feeding the predicted torques to the arm as control commands.

3 Method

In this section, we give a brief overview of the reward-based learning rule SPORE that we use throughout our experiments. We then discuss how SPORE was embodied in closed-loop, along with our modifications to increase the robustness of the learned policy.

3.1 Synaptic Plasticity with Online Reinforcement Learning (SPORE)

Throughout our experiments, we use an implementation of the reward-based online learning rule for spiking networks named synaptic sampling that was introduced in [21]. SPORE ([20]) is an implementation of this reward-based synaptic sampling rule [21, 19] that uses the NEST neural simulator ([12]). SPORE is optimized for closed-loop applications to form an online policy-gradient approach. We briefly review here the main features of the synaptic sampling algorithm.

We consider the goal of reinforcement learning to maximize the expected future discounted reward V(θ) given by

  V(θ) = E[ ∫₀^∞ e^(−τ/τₑ) r(τ) dτ ],   (1)

where r(τ) denotes the reward at time τ and τₑ is a time constant that discounts remote rewards. We consider non-negative rewards r(τ) ≥ 0 at any time, such that V(θ) ≥ 0 for all θ. The expectation is taken over the probability p(r | θ) of observing the sequence of rewards r under a given synaptic parameter vector θ. Note that computing this expectation involves averaging over a number of experimental trials and network responses.

As proposed in [21], we replace the standard goal of reinforcement learning to maximize the objective function V(θ) by a probabilistic framework that generates samples θ ∼ p*(θ) from a target distribution that peaks at parameter vectors that likely yield high reward,

  p*(θ) ∝ p(θ) × V(θ),   (2)

where p(θ) is a prior distribution over the network parameters that allows us, for example, to introduce constraints on the sparsity of the network parameters. We will focus on sampling from the target distribution p*(θ). It has been shown in [21] that the learning goal in Equation (2) is achieved if all synaptic parameters θᵢ obey the stochastic differential equation

  dθᵢ = β ( ∂/∂θᵢ log p(θ) + ∂/∂θᵢ log V(θ) ) dt + √(2Tβ) dWᵢ,   (3)

where β is a scaling parameter that functions as a learning rate, the dWᵢ are the stochastic increments and decrements of a Wiener process, and T is the temperature parameter. The temperature allows to make the distribution p*(θ) flatter (high exploration) or more peaked (high exploitation). Note that the learning rule does not converge to a single local maximum of the expected reward, but continuously samples different solutions, with synaptic updates modulated by a global reward signal.
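As an illustration of Equation (3), the stochastic differential equation can be integrated with an Euler–Maruyama scheme. The sketch below is a toy one-dimensional stand-in: the quadratic log V(θ) and all constants are illustrative assumptions, not the reward-gradient estimate that SPORE computes online, and a flat prior is assumed so the log p(θ) term vanishes.

```python
import numpy as np

# Toy Euler-Maruyama discretization of Equation (3) for a single parameter.
# Assumption: V(theta) = exp(-(theta - 2)^2), so d/dtheta log V = -2*(theta - 2);
# with a flat prior the log p(theta) drift term is zero.
rng = np.random.default_rng(0)

def grad_log_V(theta):
    # Gradient of log V for the illustrative objective above.
    return -2.0 * (theta - 2.0)

beta = 0.05   # learning rate (scaling parameter)
T = 0.1       # temperature: higher -> flatter target distribution
dt = 0.01
theta = 0.0
samples = []
for _ in range(50_000):
    dW = rng.normal(0.0, np.sqrt(dt))  # Wiener increment
    theta += beta * grad_log_V(theta) * dt + np.sqrt(2.0 * T * beta) * dW
    samples.append(theta)
```

After a burn-in period, the trajectory keeps fluctuating around the maximum of V(θ) instead of converging to it, which is the sampling behavior described above: lowering T narrows the fluctuations (exploitation), raising it widens them (exploration).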
simulation of sake the for ms 100 width step with where variable The distribution target the of optima local the to close probability high with where hc cln n ffe parameters offset and scaling which oecnrbto fti ae steipeetto fafaeokalwn oeaut h aiiyof validity the evaluate to allowing framework a of implementation the is paper this [ of tasks contribution classification core pattern open-loop on evaluated solely are rules learning synaptic Implementation Usually, Embodiment Closed-Loop 3.2 at clipped ± are parameters Synaptic updates. two term postsynaptic the of dynamics g The ms. 1 of step [ in introduced as reward paper. the this to in baseline investigated a not Gaussian subtracting was by a this reduced as although be modeled could is (5)) distribution (Equation prior estimation the general, In around prior. centered (flat) non-informative a to corresponds distribution prior the of influence time at ( neuron parameters post-synaptic synaptic of mean prior the synaptic the to respect with (1) Equation signal in reward function global objective the activity. of neural gradient parameter the pre-/post of i.e. gradient, history reward brief the a maintains that trace eligibility synapse each of state The rules. update parameter synaptic the to and respect process Wiener a of decrements i ∆ ( t In projection the by given are weights synaptic the Finally [ in shown further been has It r pae tec iese.Tednmc of dynamics The step. time each at updated are ) θ max SPORE z β post h aaeesue noreauto r ttdi als1t 3. to 1 Tables in stated are evaluation our in used parameters The . sasaigprmtrta ucin salann rate, learning a as functions that parameter scaling a is θ i i ( ( t y t sasmo ia et usspae ttefiigtmso h otsnpi neuron, post-synaptic the of times firing the at placed pulses delta Dirac of sum a is ) ). 
i h ieeta qain qain 4 o()aesle sn h ue ehdwt time a with method Euler the using solved are (6) to (4) Equations equations differential the ( t µ w stepesnpi pk ri lee ihapssnpi-oeta kernel. postsynaptic-potential a with filtered train spike pre-synaptic the is ) : i ( p t r ( eoe h egto synapse of weight the denotes ) ( θ t .Tevariables The ). ) = dg de dθ N dt dt i i i t w ( ( ( ( h constants The . µ, t t t i ) ) = ) ( t 21 c 1 = ) = = p θ p eused We . ) htEuto 3 a eipeetduigasnpemdlwt local with model synapse a using implemented be can (3) Equation that ] i ( h tcatcpoesi qain()gnrtssmlsof samples generates (3) Equation in process stochastic The . T w − − β θ 0 )     τ τ stetmeaueparameter. temperature the is 1 1 e g e and gis h nuneo h eadmdltdtr.Setting term. reward-modulated the of influence the against c p i i w otherwise 0 e g p ( ( ossso h yai variables dynamic the of consists i i t ( 0 θ ( ( ), µ t t ) exp( θ + ) + ) − g 0 nEq. in c i respectively. , p µ ( θ t i θ and and ) norsmltos h aineo h eadgradient reward the of variance The simulations. our in 0 = w r ( i t ( ( θ i )+ )) t t ( min ) ) t (2) c ) e 5 i y − g i y i θ ttime at ( ( c and ) i r uigprmtr fteagrtmta cl the scale that algorithm the of parameters tuning are θ i t and θ t ( g ( (5) ) 0 ,teeiiiiytrace eligibility the ), i t t g ( if ) ( ) r pae ysligtedffrnilequations: differential the solving by updated are ) t i and ) ( z θ t post ) max ρ  post t nadto ahsnpehsacs othe to access has synapse each addition In . dt i aaee gradients Parameter . ( w i t + ( ) i θ d t ( − steisatnosfiigrt fthe of rate firing instantaneous the is ) i W t ( p ∂θ r pae nacasrtm grid time coarser a on updated are ) t ∂ ρ i ) i 2 post > r h tcatciceet and increments stochastic the are T p eoe h ata eiaiewith derivative partial the denotes ∗ θ 0 y ( β i θ ( i ( t W g ). 
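The local update scheme of Equations (4) to (7) can be sketched as follows, using the two time grids described above (1 ms Euler steps for the traces, 100 ms steps for the parameters). The surrogate pre-/post-synaptic activity, the reward, and all constants are illustrative placeholders, not the values of Tables 1 to 3.

```python
import numpy as np

# Illustrative sketch of the local synapse updates (Equations 4-7).
# Assumptions: surrogate Poisson presynaptic activity, a Gaussian stand-in
# for (z_post - rho_post), constant reward, and placeholder constants.
rng = np.random.default_rng(1)
n_syn = 4
dt = 1e-3                      # 1 ms Euler step for the traces
steps_per_theta = 100          # theta updated every 100 ms
tau_e, tau_g = 0.5, 50.0       # trace time constants (illustrative)
beta, T = 1e-2, 0.1            # learning rate and temperature
c_p, c_g, mu = 0.0, 1.0, 0.0   # c_p = 0: non-informative (flat) prior
w0, theta0, theta_max = 1.0, 0.0, 2.0

theta = rng.normal(0.0, 0.5, n_syn)
e = np.zeros(n_syn)
g = np.zeros(n_syn)

def weights(theta):
    # Equation (7): exponential projection; retracted synapses have w = 0.
    return np.where(theta > 0.0, w0 * np.exp(theta - theta0), 0.0)

for step in range(1, 1001):                # simulate 1 s
    y = rng.poisson(0.02, n_syn)           # filtered presynaptic spikes (toy)
    post = rng.normal(0.0, 1.0, n_syn)     # (z_post - rho_post) stand-in
    r = 1.0                                # global reward signal
    e += dt * (-e / tau_e + weights(theta) * y * post)   # Equation (4)
    g += dt * (-g / tau_g + r * e)                       # Equation (5)
    if step % steps_per_theta == 0:
        # Equation (6) on the coarse 100 ms grid, with Wiener increments.
        dt_theta = steps_per_theta * dt
        dW = rng.normal(0.0, np.sqrt(dt_theta), n_syn)
        theta += beta * (c_p * (mu - theta) + c_g * g) * dt_theta \
                 + np.sqrt(2.0 * T * beta) * dW
        theta = np.clip(theta, -theta_max, theta_max)    # clipping

w = weights(theta)
```

Note how all quantities are local to the synapse except the scalar reward r(t), which is exactly what makes the rule compatible with streaming a single reward signal to every synapse of the network.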
, )(4) )) t i e ), ( i i t ( , savral oestimate to variable a is ) e t n h eadgradient reward the and ) i ( t ), g i g ( i t ( ), t r lpe at clipped are ) θ 46 i ( t , and ) e 32 θ i ( , t htare that sthe is ) 35 c p w , µ i 0 = 43 41 (6) (7) ( t is ). ], ]. This is the author’s version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081 ([ diinly otnosrwr inli mte rmteenvironment. the from emitted is signal reward continuous a Additionally, ihrsett time: to respect with ([ MUSIC on rely We processes. different in run network neural the and simulator robotic The spiking the setup, learning reinforcement classical Unlike reward. overall more yield to expected are which eprtr rno xlrto) e qain() pcfial,w ea h erigrate learning the decay we lane Specifically, the (3). For Equation improvements. see policy rate exploration), previous learning (random converge the forget temperature to paper, to weights this network in synaptic the presented the prevents experiment allows following and regulation configuration This stable regulated mechanism. a often annealing to are an rates from learning benefit learning, and deep learning ([ in that processes optimization Note the process. by learning the throughout constant kept presenting work original the Annealing In Rate real-time. Learning in RAM 16GB with 3.3 i7-4790K Core Intel many core the Quad with a a on coupled for conducted robotics substituted was real be framework in can simulation supervision by human robotic needed the necessary hours that the However, is framework seamlessly. this of real advantage clear A evaluate rules. to learning vision event-based using experiments ROS-MUSIC synapses, all the to signal reward a streaming support experiments. [ our not (NRP) in does Most Platform required NRP time. Neurorobotics the network the Therefore, spiking in with tool-chain. 
integrated ROS-MUSIC time also ROS are [ synchronizes components by also these tool-chain latter of ROS-MUSIC The the frameworks. employ communication we two and the spikes the transform and communicate ([ Gazebo ([ do Gazebo reset. We by the task). of following notified lane are the SPORE (in nor off-track agent the goes the when agent neither only the and agent when the episodes or reset finite-time task) we enforce reaching Similarly, not reinforcement the iterations. classical (in of unlike completed number time, is in simulated task (biological) progress This to learning commands. respect reports motor with which outputting progress learning and learning input report receiving to continuously us system allows dynamical a as configurations treated towards is network network the of agent. processes plasticity the synaptic by ongoing sensed the steering is by by environment executed online the signal are reward of commands modification motor This The environment. commands. processed spikes, motor the as to modifies encoded converted which sensed, are agent, is spikes the input output sensory and Visual network, the embodiment. by an evaluate in to rules framework learning this network on spiking rely We rule sampling environments. synaptic robotics the closed-loop in models plasticity bio-plausibe 12 2 1 spr fti ok ecnrbtdt h aeoDSpui yitgaigi oRSMSC n to and ROS-MUSIC, to it integrating by plugin DVS Gazebo the to contributed we work, this of part As NEST use we simulator neural As components: software open-source many on relies framework This https://github.com/HBPNeurorobotics/gazebo_dvs_plugin https://github.com/IGITUGraz/spore-nest-module )cmie ihteoe-oreipeetto of implementation open-source the with combined ]) SPORE 18 ] 2 oueb nertn twt UI.Teecnrbtoseal eerhr odsg new design to researchers enable contributions These MUSIC. with it integrating by module 24 .Ti lgneisplrzdadeseet hnvrain npxlitniycosathreshold. 
a cross intensity pixel in variations when events address polarized emits plugin This ). )adRS([ ROS and ]) SPORE olanapromn oiyi urnl rhbtv.Tesmlto ftewhole the of simulation The prohibitive. currently is policy performing a learn to SPORE 36 23 )advsa ecpini elzduigteoe-oreDSpui for plugin DVS open-source the using realized is perception visual and ]) SPORE ) efudta h erigrate learning the that found We ]). ([ 20 ) sdpce nFgr .nTi rmwr stioe o evaluating for tailored is framework This n 1. Figure in depicted as ]), ([ 21 , dβ 19 dt ( , t 22 ) = , 6 45 − ) h erigrate learning the ]), λβ SPORE ( t ) β . sdcesdoe ie hc lordcsthe reduces also which time, over decreased is SPORE ([ 21 β ] 1 .Terbtcsmlto smanaged is simulation robotic The ). of rteronbiologically-plausible own their or SPORE β SPORE 9 n h temperature the and ,ecp o UI n the and MUSIC for except ], ly nipratrl in role important an plays re omxmz this maximize to tries 42 obig between bridge to ] β exponentially 7 , T 8 were )to ]) (8)

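The decay of Equation (8), applied on the 10-minute update grid used in the lane following experiment, can be sketched as below; β₀ and λ are illustrative placeholders rather than the values used in the paper.

```python
import math

# Equation (8): d(beta)/dt = -lambda * beta has the closed-form solution
# beta(t) = beta0 * exp(-lambda * t). Following the text, the learning rate
# is only updated every 10 simulated minutes, so it is piecewise constant.
# beta0 and lam are illustrative placeholders.
beta0 = 1e-2
lam = 1e-4              # decay constant in 1/s (illustrative)
update_period = 600.0   # 10 minutes, in seconds

def annealed_beta(t):
    """Piecewise-constant learning rate sampled on the 10-minute grid."""
    n_updates = int(t // update_period)
    return beta0 * math.exp(-lam * n_updates * update_period)
```

Because the noise amplitude in Equation (3) is √(2Tβ), shrinking β simultaneously shrinks the random-walk exploration, which is the stabilizing effect exploited here.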
4 Evaluation

We evaluate our approach on two neurorobotic tasks: a reaching task and the lane following task presented in [18, 4]. In the following sections, we describe these tasks and the ability of SPORE to solve them. Additionally, we analyze the performance and stability of the learned policies with respect to the prior distribution p(θ) and the learning rate β, see Equation (3).

Figure 1: Left: Implementation of our embodied closed-loop evaluation framework based on open-source software components. The reward-based learning rule SPORE ([20]) is implemented with the NEST neural simulator ([12]), which communicates spikes with MUSIC ([7, 8]). The reward is streamed to all synapses of the spiking network. Address events are emitted by the DVS plugin ([18]) within the simulated robotic environment Gazebo ([24]), which communicates with ROS ([36]). Spikes are encoded and motor commands decoded with ROS-MUSIC tool-chain adapters ([42]). Right: Encoding visual information to spikes for the lane following experiment, see Section 4.1.2 for more information. Address events (red and blue pixels on the rendered image) are downscaled and fed as spikes to the visual neurons.

Figure 2: Visualization of the setup for the two experiments. Left: the reaching experiment. The ball is controlled with Cartesian velocity vectors, and the goal of the task is to control the ball towards the center of the plane. Visual input is provided by a DVS simulation above the plane looking downward. Right: the lane following experiment. The vehicle is controlled with steering angles, and the goal of the task is to keep the vehicle on the right lane of the road. Visual input is provided by a DVS simulation attached to the vehicle looking forward to the road.
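The event downscaling shown in Figure 1 (right) can be sketched as a pooling of address events into the coarser grid of visual neurons. The resolutions follow the lane following setup (128x32 pixels pooled into 16x4 windows of 8x8 pixels); the function name and the (x, y, polarity) event layout are assumptions for illustration.

```python
import numpy as np

# Sketch of the event downscaling of Figure 1 (right): address events from
# the simulated DVS are pooled into a coarser grid, and each visual neuron
# receives spikes at a rate proportional to the events in its window.
# Polarity is ignored here, as in the reaching task description.

def downscale_events(events, in_shape=(32, 128), out_shape=(4, 16)):
    """events: iterable of (x, y, polarity) tuples.
    Returns per-visual-neuron event counts on the coarse grid."""
    sy = in_shape[0] // out_shape[0]   # vertical pooling factor (8)
    sx = in_shape[1] // out_shape[1]   # horizontal pooling factor (8)
    counts = np.zeros(out_shape, dtype=int)
    for x, y, _pol in events:
        counts[y // sy, x // sx] += 1
    return counts
```

In the actual framework this conversion happens in the ROS-MUSIC adapters, which turn the pooled events into spikes streamed to NEST.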
not and of does neurons periods it motor long prevents the neuron excites set which This network we neurons. the experiment, to our neuron For exploration additional vector. displacement the to contribution τ are neurons feature the axis in 2x16 plastic spikes and 10 each neurons with for visual neurons fire 16x16 motor neurons Both input 8 These cover. the all column. they enhance to each neurons additionally connected and We of to row column events. corresponding each or OFF neuron for row and visual neuron respective ON one feature located between is axis pixels distinctions There an 16x16 plane no with plane. of 20mx20m make space entire resolution we a the a of – and with center pixel ball DVS radius DVS the simulated each 2m perceives a the which by towards center provided move the is to above input has which Sensory radius walls. 2m with of enclosed ball a controls agent the [ in Task with Reaching together tasks the describe 4.1.1 we paragraphs, for next used the parameters functions. In The reward and 3. commands. schemes to motor decoding 1 to Tables their DVS in simulated [ presented a considered are of evaluation complexity events the task address the the for maps and sufficient end-to-end was architecture this with that trained shown is neurons spiking of network Setup Experimental 4.1 hsdcdn ceecnit feulydistributing equally of consists scheme decoding This . err 45 h ali ee oarno oiino h ln fi a ece h etr hsrsti o signaled not is reset This center. the reached has it if plane the on position random a to reset is ball The eoeteaslt au fteagebtentesrih iet h oladtedrcintknby taken direction the and goal the to line straight the between angle the of value absolute the denote a .Asmlrvsa rcigts a rsne n[ in presented was task tracking visual similar A ]. 
k h ciiyo oo neuron motor of activity the β v k = = " 2 N x y kπ ˙ ˙ # k , = bandb pligalwps le ntesie ihtm constant time with spikes the on filter low-pass a applying by obtained " sin cos SPORE ( ( β β 1 1 ) ) SPORE sin cos omxmz akseicrwr.Peiu okhas work Previous reward. task-specific a maximize to 8 ( ( 6 β β ,wt ieetvsa nu noig norsetup, our In encoding. input visual different a with ], 2 2 ) ) yass eutn n200lanbeparameters. learnable 23040 in resulting synapses, N sin . . . cos . . . oo ern nacrl ersnigtheir representing circle a on neurons motor ( ( β β N N N ) ) # oo ern.W d an add We neurons. motor 8 =       a a a . . . N 2 1       18 β err , SPORE 4 , β < 6 .Tentokis network The ]. lim a evaluated was tasufficient a at (9)

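The linear decoder of Equation (9) amounts to summing the filtered motor activities along unit vectors equally distributed on a circle. A minimal sketch with N = 8, taking the activities a_k directly as inputs instead of low-pass filtering spike trains:

```python
import numpy as np

# Equation (9): the N = 8 motor neurons are placed on a circle at angles
# beta_k = 2*pi*k/N, and the planar velocity is the sum of their low-pass
# filtered activities a_k along the corresponding unit vectors. Here the
# activities are given directly instead of being filtered from spikes.
N = 8
beta_k = 2.0 * np.pi * np.arange(1, N + 1) / N

def decode_velocity(a):
    """a: N filtered motor activities -> Cartesian velocity (x_dot, y_dot)."""
    a = np.asarray(a, dtype=float)
    return float(np.cos(beta_k) @ a), float(np.sin(beta_k) @ a)
```

With this scheme, a motor neuron only biases the ball in its own direction, and uniform activity over all neurons cancels out to a near-zero velocity, which is why the exploration neuron alone does not impose a systematic drift.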
This is the author’s version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081 64vsa ern oeigtepxl,ec ernrsosbefra88pxlwno.Ec iulneuron visual Each window. pixel 8x8 a for responsible neuron each pixels, the covering neurons visual 16x4 h first The [ in controllers neural spiking demonstrate to used already was task following lane The formulation This agent. the to streamed being before filter exponential an with smoothed is signal This velocity r ratio (0 the straight discretizing by obtained is command set We side. other decoders: from linear decoded following of are the consist commands between agent steering ratio the Specifically, a to populations. as sent two spikes commands these output steering between the The activity to populations. of signaled neural difference explicitly two the not into separated is the is reset leaves layer episode. this vehicle output learning task, the a reaching as of on the soon end driving As in the starts As track. mark vehicle the not point. The of does starting lane 1. and the right Figure network to the see reset on window, velocity is its constant it are in simulated a track, events There a with of by front. point amount provided in starting the is track fixed to input the a correlated Sensory showing rate vehicle track. a the a at of of spikes top lane on right mounted the pixels on 128x32 stay of to resolution vehicle a a with steer DVS to is task the of Task following Lane 4.1.2 for required be would online. tuning by rewards hyperparameter supported distal or are mechanisms rewards performing additional distal learn [ that hand, to in other suggests agent demonstrated the the was On for as time. suffice traces, the of not eligibility reaching amount did upon reasonable rewards reward a terminal terminal in discrete discrete policies experiments, a our delivering In unlike agent, state. 
The network controls the angle of the vehicle by steering it, while its linear velocity is constant. The output layer is separated into two neural populations: N/2 motor neurons pull the steering to one side, while the remaining N/2 neurons pull the steering to the other side. We set N = 8, so that there are 4 left motor neurons and 4 right motor neurons. The steering command sent to the agent consists of a ratio between the activity of these two populations, decoded from the output spikes with the following linear decoders:

$$ a_L = \sum_{i=1}^{N/2} a_i, \qquad a_R = \sum_{i=N/2+1}^{N} a_i, \tag{10} $$

$$ r = \frac{a_L - a_R}{a_L + a_R}. \tag{11} $$

The steering command is obtained by discretizing this ratio into five possible commands: hard left (-30°), left (-15°), straight (0°), right (15°) and hard right (30°). The decision boundaries between these steering angles are {-10, -2.5, 2.5, 10}, respectively. This discretization is similar to the one used in [18] and [44], and it yielded better performance than directly using the ratio as a continuous-space steering command as in [18]. The steering signal is smoothed with an exponential filter before being streamed to the agent.
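The two-population decoder and its discretization can be sketched as follows. The sign convention (positive ratio steers right) and the factor scaling the ratio onto the decision-boundary range are assumptions of this sketch.

```python
import numpy as np

STEERING_ANGLES = [-30.0, -15.0, 0.0, 15.0, 30.0]  # hard left ... hard right
BOUNDARIES = [-10.0, -2.5, 2.5, 10.0]              # decision boundaries

def steering_command(activities):
    """Decode a discrete steering angle from N = 8 motor activities.

    The left/right population activities a_L, a_R follow Equation (10);
    their contrast ratio (Equation (11)) is scaled onto the boundary range
    and binned into one of the five commands."""
    a = np.asarray(activities, dtype=float)
    a_l, a_r = a[:4].sum(), a[4:].sum()
    ratio = 30.0 * (a_r - a_l) / (a_l + a_r + 1e-9)  # assumed scaling
    return STEERING_ANGLES[int(np.searchsorted(BOUNDARIES, ratio))]
```

Balanced populations yield the straight command; a fully dominant population saturates into one of the hard-turn commands.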

In our case, the same reward is delivered to all synapses, and a particular reward value does not indicate whether the vehicle is on the left or on the right of the lane. The decay rate λ of the learning rate is given in Table 2.

On (b), the overall magnitude of the weights has largely increased and strong columns emerge, as could be expected from the weight histogram. The policy improved towards the center (see inner pixel colors), while the network employs random sampling: rather weak actions rise and drop in random directions. In particular, patterns can be seen in the center columns of pixels, due to the 2x16 axis feature neurons described in Section 4.1.1. One drawback of this encoding is that a single axis feature neuron is responsible for a whole row or column of pixels.

The number of weak synaptic weights (below 0.07) increased from 3329 to 7557 after 1h of learning, to 14753 after 5h of learning (out of 23040 synapses in total). In other words, only 36% of all synapses remain active after 5h of learning: most synapses retract, while few synapses form strong weights, without trading off performance. The same trend is observed for the lane following task, where only 33% of all synapses are active after 4h of learning, see Figure 5. Additionally, the number of retracting synapses increases over time, even in the flat prior case, reducing the computational overhead. This is less important in our simulated setup, but would be beneficial for a neuromorphic hardware implementation ([2]).
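The lane following reward of Equation (12) can be written directly from its two Gaussian-shaped terms; the angular error is taken in degrees and the distance in meters, matching the halving constants stated in the text.

```python
import math

def lane_reward(beta_err, d_err):
    """Lane following reward (Equation (12)):
    r(t) = exp(-0.03 * beta_err^2) * exp(-70 * d_err^2),
    with beta_err the absolute angle (degrees) between vehicle and right
    lane, and d_err the distance (meters) to the center of the right lane."""
    return math.exp(-0.03 * beta_err ** 2) * math.exp(-70.0 * d_err ** 2)
```

With these constants the reward is halved roughly every 5° of angular error (e^-0.75 ≈ 0.47) or every 0.1 m of distance error (e^-0.7 ≈ 0.50).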
As in the reaching task, the reward depends on two terms, the angular error β_err and the distance error d_err. The angular error β_err is the absolute value of the angle between the right lane and the vehicle. The distance error d_err is the distance between the vehicle and the center of the right lane. Specifically, the reward r(t) is computed as:

$$ r(t) = e^{-0.03\,\beta_{err}^2} \; e^{-70\,d_{err}^2}, \tag{12} $$

multiplied with a scaling constant. The constants are chosen so that the score is halved every 5° of angular error or every 0.1m of distance error. This reward signal is equivalent to the performance metrics used in [18] to evaluate the policy. Note that this reward function is comprised between [0, 1] and is less informative than the error used in [18].

4.2 Results

Our results show that SPORE is capable of learning policies online for moderately difficult embodied tasks within some simulated hours. We first discuss the results on the reaching task, where we evaluated the impact of the prior weight distribution. We then present the results on the lane following task, where the impact of the learning rate was evaluated.

4.2.1 Impact of the Prior Distribution

For the reaching task, the impact of the prior distribution is depicted in Figure 3. In the flat prior case, the performance improves rapidly, and within a few simulated hours of learning the ball reaches the center about 90 times every 250s. Conversely, a strong prior (c_p = 1), forcing the synaptic weights close to 0, prevented performing policies to emerge: the ball reaches the center only about 10 times on average every 250s, comparable to the performance of a random policy, even after 13h of learning. A medium prior c_p = 0.25 is similar to the no-prior case c_p = 0 in its course of learning, and yielded the policy with highest performance, see Figure 3; with c_p = 0.25, the ball reaches the center about 60 times on average every 250s.

The analysis of a single trial with c_p = 0.25 is depicted in Figure 4. The performance does not converge monotonically. After 4750s of learning (c), a first local maximum is reached: vector directions have largely turned towards the center, and the weights grew even stronger, pointing in the right direction. However, many pixels flipped direction, mostly at a low vector strength, and some strong vectors pointing in wrong directions can lead to the ball constantly moving up and down without receiving any reward. At its best, strong motion vectors around the whole grid push the ball towards the center, and the ball reaches the center around 140 times every 250s; however, the overall strength has largely decreased.

4.2.2 Impact of Learning Rate

For the lane following experiment, we show that regulating the learning rate enables policy improvements to be retained. Specifically, when the learning rate β remains at its initial value, the policy does not improve compared to random: the vehicle remains on the right lane only about 10 seconds until triggering a reset, see Figure 5. With annealing, after about 3h of learning, the learning rate has decreased to 40% of its initial value and the policy starts to improve. After 5h of learning, the learning rate approaches 20% of its initial value and the performance improvements are retained. Indeed, while the weights are not frozen, the amplitude of subsequent synaptic updates is drastically reduced. In this case, the policy is significantly better than random and the vehicle remains on the right lane about 60s on average.

Figure 3: Results for the reaching task, averaged over 8 trials. Left: comparing the effect of different prior configurations on the overall learning performance, measured as the rate at which the target is reached (the ball moves to the center and is reset at a random position). Right: development of the synaptic weights over the course of learning for c_p = 1 (top) and c_p = 0 (bottom); in both cases, the number of weak synaptic weights (below 0.07) increases over time.
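The learning rate regulation above can be sketched as exponential annealing. The exponential form and the decay constant below are assumptions chosen only to reproduce the reported course (about 40% of the initial rate after 3h and about 20% after 5h), not the exact schedule used in the experiments.

```python
import math

def annealed_learning_rate(beta0, t_hours, decay_per_hour=0.3):
    """Exponentially annealed learning rate beta(t) = beta0 * exp(-d * t).

    decay_per_hour = 0.3 is a hypothetical value roughly matching the
    reported decay to ~40% after 3h and ~20% after 5h of learning."""
    return beta0 * math.exp(-decay_per_hour * t_hours)
```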

Between 7500s and 9250s (d), the performance has further increased; the policy shown at this second peak has grown stronger. Just before the end of the trial, the performance drops again (g). Most vectors still point towards the center.

Later (e), the performance drops to half its previous value, as we can see from the policy.


After 20000s of simulation time (f), the policy has partially recovered. In the static case, β remains constant over the course of learning.

Figure 4: Policy development in time for a single, well-performing trial. On top, the performance over time is depicted; the red lines indicate selected points in time, for which the policy is shown in the bottom 6 figures. The policy consists of a 2D grid of vectors corresponding to the DVS pixels. Every pixel contains a vector indicating the motion which an event emitted by this pixel contributes to. The magnitude of the contribution (vector strength) is indicated by the outer area of the pixel, while the color of the inner circle represents the assessment of the vector direction (angular correctness).

Figure 5: Results for the lane following task with a medium prior c_p = 0.25, averaged over 6 trials. Left: comparing the effect of learning rate annealing on the overall performance. Without annealing, the learning improvements are not retained and the network does not learn to perform the task. With annealing, β decreases over time and the performance improvements are retained. Right: development of the number of weak synaptic weights (below 0.07) over the course of learning; their number increases from 41 to 231 after 1h of learning, to 342 after 4h of learning (out of 512 synapses in total).

Inspired by simulated annealing, we presented a simple method decreasing the learning rate over time. This method does not model a particular biological mechanism, but seems to work better in practice. On the other hand, novelty is known to modulate plasticity through a number of mechanisms ([37, 15]). Therefore, a decrease in learning rate after familiarization with the task is reasonable.

On a functional scale, deep learning methods still outperform biologically plausible learning rules such as SPORE. For future work, the performance gap between SPORE and deep learning methods should be tackled by taking inspiration from deep learning. Specifically, the online learning method is impacted by the high variance inherent to its policy evaluation. In deep reinforcement learning, this problem is alleviated by introducing a critic trained to estimate the expected return of a given state, used as a baseline to reduce the variance of the policy-gradient evaluation. Decreasing the variance of the updates could also be achieved by considering an action-space noise instead of the parameter-space noise implemented as a Wiener process in Equation (3). Lastly, an automatic mechanism to regulate the learning rate could be inspired by trust-region methods ([38]), which constrain weight updates to alter the policy little by little. These improvements should increase learning performance, so that more complex tasks supported by design by the proposed framework, such as multi-joint effector control and discrete terminal rewards, could be considered.

5 Conclusion

The endeavor to understand the brain spans over multiple research fields. In this paper, we successfully implemented a framework allowing biologically plausible learning rules derived by theoretical neuroscientists to be evaluated in closed-loop embodiment, an important milestone of this endeavor. This framework integrates open-source software components for synchronization and communication ([7, 8, 42]), spiking network simulation ([12, 20]) and robotic simulation ([24]). The resulting framework is open-source and allows the control of simulated and real robots with a spiking network in a modular fashion. We demonstrate the validity of this framework by evaluating the reward-learning rule SPORE ([19, 21, 22, 45]) on two closed-loop visuomotor tasks. Overall, we have shown that SPORE was capable of learning shallow feedforward policies online for moderately difficult embodied tasks within some simulated hours. This evaluation allowed us to characterize the influence of the prior distribution on the learning dynamics and the performance of the learned policy: strong priors prevent performing policies to emerge, while constraining the synaptic weights allows learning improvements to be retained, see Figure 3. Additionally, for the lane following experiment, we have shown how learning rate regulation enabled policy improvements to be retained.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

All the authors participated in writing the paper. JK, MH, AK, JCVT and DK conceived the experiments and analyzed the data.

Funding

This research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and No. 785907 (Human Brain Project SGA2). In addition, this work was supported by a fellowship within the FITweltweit programme of the German Academic Exchange Service (DAAD) [MH], as well as by the H2020-FETPROACT project Plan4Act (#732266) [DK].

Acknowledgments

The collaboration between the different institutes that led to the results reported in the present paper was carried out under the CoDesign Project 5 (CDP5 – Biological Deep Learning) of the Human Brain Project.

Data Availability Statement

No datasets were generated for this study.
Table 1: NEST Parameters

time-step/resolution: 1 ms
synapse update interval: 100 ms
exploration noise (reaching): 35 Hz
exploration to visual exc. (reaching): 750
exploration to visual inh. (reaching): -500
exploration to motor exc. (reaching): 10

Table 2: SPORE Parameters

temperature (T): 0.1
initial learning rate (β): –
learning rate decay (λ): 1.5 × 10^-5
max synaptic parameter (θ_max): 5
min synaptic parameter (θ_min): -2 (clipped at 0)
integration time: 50s
episode length (reaching): 1s
episode length (lane following): 2s
visual to motor exc.: –
visual to motor mul.: –

Table 3: ROS-MUSIC Parameters

MUSIC time-step: 1 ms
DVS adapter time-step: 3 ms
decoder time constant: 100 ms

References

[1] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Conference on Neural Information Processing Systems (NIPS), 2018.
[2] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.
[3] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
[4] Zhenshan Bing, Claus Meschede, Kai Huang, Guang Chen, Florian Rohrbein, Mahmoud Akl, and Alois Knoll. End to end learning of spiking neural network based on R-STDP for a lane keeping vehicle. In IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
[5] Zhenshan Bing, Claus Meschede, Florian Röhrbein, Kai Huang, and Alois C. Knoll. A survey of robotics control based on learning-inspired spiking neural networks. Frontiers in Neurorobotics, 12:35, 2018.
[6] Emmanuel Daucé. A model of neuronal specialization using hebbian policy-gradient with "slow" noise. In International Conference on Artificial Neural Networks, pages 218–228. Springer, 2009.
[7] Mikael Djurfeldt, Johannes Hjorth, Jochen M. Eppler, Niraj Dudani, Moritz Helias, Tobias C. Potjans, Upinder S. Bhalla, Markus Diesmann, Jeanette Hellgren Kotaleski, and Örjan Ekeberg. Run-time interoperability between neuronal network simulators based on the MUSIC framework. Neuroinformatics, 8(1):43–60, mar 2010.
[8] Örjan Ekeberg and Mikael Djurfeldt. MUSIC – multisimulation coordinator: Request for comments. 2008.
[9] Egidio Falotico, Lorenzo Vannucci, Alessandro Ambrosano, Ugo Albanese, Stefan Ulbrich, Juan Camilo Vasquez Tieck, et al. Connecting artificial brains to robots in a comprehensive simulation framework: The neurorobotics platform. Frontiers in Neurorobotics, 11:2, 2017.
[10] Răzvan V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.
[11] Uwe Frey, Richard G. M. Morris, et al. Synaptic tagging and long-term potentiation. Nature, 385(6616):533–536, 1997.
[12] Marc-Oliver Gewaltig and Markus Diesmann. NEST (NEural Simulation Tool). Scholarpedia, 2(4):1430, 2007.
[13] Aditya Gilra and Wulfram Gerstner. Predicting non-linear dynamics by stable local learning in a recurrent spiking neural network. eLife, 6:e28295, 2017.
[14] Aditya Gilra and Wulfram Gerstner. Non-linear motor control by local learning in spiking neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1773–1782, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[15] Arif A. Hamid, Jeffrey R. Pettibone, Omar S. Mabrouk, Vaughn L. Hetrick, Robert Schmidt, Caitlin M. Vander Weele, Robert T. Kennedy, Brandon J. Aragona, and Joshua D. Berke. Mesolimbic dopamine signals the value of work. Nature Neuroscience, 19(1):117, 2016.
[16] Eugene M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, 17(10):2443–2452, 2007.
[17] Jacques Kaiser, Hesham Mostafa, and Emre Neftci. Synaptic plasticity dynamics for deep continuous local learning. arXiv preprint arXiv:1811.10766, 2018.
[18] Jacques Kaiser, J. Camilo Vasquez Tieck, Christian Hubschneider, Peter Wolf, Michael Weber, Michael Hoff, Alexander Friedrich, Konrad Wojtasik, Arne Roennau, Ralf Kohlhaas, et al. Towards a framework for end-to-end control of a simulated vehicle with spiking neural networks. In 2016 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), pages 127–134. IEEE, 2016.
[19] David Kappel, Stefan Habenschuss, Robert Legenstein, and Wolfgang Maass. Network plasticity as Bayesian inference. PLoS Computational Biology, 11(11):e1004485, nov 2015.
[20] David Kappel, Michael Hoff, and Anand Subramoney. IGITUGraz/spore-nest-module: SPORE version 2.14.0, Nov 2017.
[21] David Kappel, Robert Legenstein, Stefan Habenschuss, Michael Hsieh, and Wolfgang Maass. A dynamic connectome supports the emergence of stable computational function of neural circuits through reward-based learning. eNeuro, pages ENEURO.0301–17.2018, apr 2018.
[22] David Kappel, Bernhard Nessler, and Wolfgang Maass. STDP installs in winner-take-all circuits an online approximation to hidden Markov model learning. PLoS Computational Biology, 10(3):e1003511, mar 2014.

This is the author’s version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081 [30] [29] [26] [24] Mra uge,KnCne,BinGre,Js as,TlyFoe eeyLis o hee,and Wheeler, Rob Leibs, Jeremy Foote, Tully Faust, Josh Gerkey, Brian Conley, Ken Quigley, Morgan [36] [35] [33] [32] [31] [28] [27] [25] [23] [34] a.N.04CH37566) No. Cat. arXiv:1509.02971 arXiv:1412.6980 etn ID) 07IE International IEEE 2017 (IEDM), Meeting lSone PloS nentoa ofrneo ahn learning machine on conference International 86:3814,2006. 18(6):1318–1348, 2008. 10 lxGae,Mri idilr nra ijln,GogOtosi ta.Hmnlvlcontrol Human-level al. et Ostrovski, Georg Fidjeland, K Andreas Riedmiller, Martin Graves, Alex nrwYN.Rs noe-orerbtoeaigsse.In system. operating robot open-source an Ros: Ng. Y Andrew eprlCnrs iinSensor. Vision Contrast Temporal oue3 ae5 oe aa,2009. Japan, Kobe, 5. page 3, volume ooyy nh oa aucol,DvdSle,Ade uu olVns,Mr Bellemare, learning. reinforcement G Marc deep Veness, through Joel Rusu, A Andrei Silver, David In Kavukcuoglu, Koray learning. Mnih, reinforcement Volodymyr deep for methods Asynchronous Harley, Tim Kavukcuoglu. Lillicrap, Koray Timothy Graves, and Alex Silver, Mirza, David Mehdi Badia, Puigdomenech Adria Tassa, Mnih, Yuval Volodymyr Erez, learning. reinforcement Tom deep Heess, with Nicolas control Continuous Pritzel, Wierstra. Alexander Daan Hunt, and Silver, J David Jonathan Lillicrap, P Timothy 128 A Delbruck. Tobi and Posch, Christoph Lichtsteiner, Patrick spike- reward-modulated for biofeedback. theory to learning application A with Maass. plasticity Wolfgang timing-dependent and Pecevski, we Dejan can Legenstein, What Robert cortex: 2013. visual 35(8):1847–1871, J. primate Antonio vision? the Piater, computer in Justus for hierarchies Leonardis, learn Deep Ales Wiskott. 
Lappe, Laurenz Markus and Kalkan, Rodriguez-Sanchez, Sinan Janssen, Peter Kruger, Norbert multi-robot open-source an gazebo, for paradigms use and In Design simulator. Howard. optimization. Andrew stochastic and for Koenig method Nathan A Adam: Ba. Jimmy and Kingma P Diederik eedn lsiiyfrpeieato oeta rn nsprie learning. supervised in spike-timing- firing Optimal potential action Gerstner. precise Wulfram for and plasticity Barber, dependent David Toyoizumi, reward-learning Taro the Pfister, in Jean-Pascal to traces respond eligibility cells for Dopamine evidence Hyland. conditioning: I network. classical Brian and during Wickens, events R predicted Jeffery Schmidt, Robert Pan, Wei-Xing a in learning In reinforcement Free-energy-based environment. observable Doya. partially Kenji and Yoshimoto, Junichiro Otsuka, In Makoto machines. learning deep efficient for resource as synapses Stochastic Neftci. ambiguity. Emre perceptual and input sensory model network high-dimensional neural with spiking learning A reinforcement Doya. model-free Kenji of and Yoshimoto, Junichiro Otsuka, Makoto Nakano, Takashi 03:0160 2015. 10(3):e0115620, , ora fNeuroscience of Journal 04IE/S nentoa ofrneo nelgn oosadSses(IROS)(IEEE Systems and Robots Intelligent on Conference International IEEE/RSJ 2004 2014. , 2015. , oue3 ae 1925.IE,2004. IEEE, 2149–2154. pages 3, volume , 52)63–22 2005. 25(26):6235–6242, , EEJunlo oi-tt Circuits Solid-State of Journal IEEE ESANN Nature ae 11 EE 2017. IEEE, 11–1. pages , 1(50:2,2015. 518(7540):529, , 2010. , ae 9813,2016. 1928–1937, pages , 17 LSCmuainlBiology Computational PLOS × CAwrso noe oresoftware source open on workshop ICRA 2 2 B15 dB 120 128 32:6–7,2008. 43(2):566–576, , µ aec Asynchronous Latency s erlcomputation Neural lcrnDevices Electron ri preprint arXiv ri preprint arXiv 4(10):1–27, , , , This is the author’s version. 
This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081 [44] [42] [46] [45] [43] [41] [40] [39] [38] [37] (IV) ora fpsychopharmacology of Journal Neuroinformatics Networks Neural rv nara ol iuainwt epqntok.In to q-networks. how Z¨ollner. deep Learning M. with simulation J. D¨urr, world H¨artl, and F. real J. a Bauer, in A. drive Weber, M. Hubschneider, C. Wolf, P. reinforcement connectionist for algorithms gradient-following statistical learning. Simple Williams. J ROS. Ronald and MUSIC on Based Interactions Simulators Loop Robotic Closed and Morrison. Network Abigail Neural and Spiking Duarte, C. between Renato Djurfeldt, Mikael Weidel, spiking. Philipp somatic of prediction dendritic the by 2014. Learning 81(3):521–528, Senn. Walter and Urbanczik Robert In machine. state extending by liquid arm a multi-joint with a optimization for policy control muscle proximal continuous Marc-Oliver Learning Roennau, R¨udiger Dillmann. Poganˇci´c, Arne and Vlastelica Kaiser, Gewaltig, Marin Jacques policy Tieck, Proximal Vasquez Camilo Klimov. Juan Oleg and Radford, Alec Dhariwal, algorithms. optimization Prafulla Wolski, Filip policy Schulman, region John Trust Moritz. Philipp and Jordan, Michael In Abbeel, optimization. Pieter Levine, review. Sergey Schulman, systematic John a novelty: and Meeter. Martijn and Rangel-Gomez Mauricio reeanZneadSraGnui uesie uevsdlann nmliae pkn neural spiking multilayer in learning Supervised Superspike: networks. Ganguli. Surya and Zenke Friedemann Camkii Maass. sampling. Wolfgang hamiltonian and through arXiv:1606.00157 Chen, optimization preprint Feng network Song, neural Sen reward-based Legenstein, supports activation Robert Kappel, David Yu, Zhaofei ae 4–5,Jn 2017. June 244–250, pages , ahn learning Machine erlcomputation Neural ae 1–2.Srne,2018. Springer, 211–221. pages , nentoa ofrneo ahn Learning Machine on Conference International 03)11,ag2016. 
aug 10(31):1–19, , ri rpitarXiv:1707.06347 preprint arXiv 2016. , (-)2926 1992. 8(3-4):229–256, , 06:5414,2018. 30(6):1514–1541, , 01:–2 2016. 30(1):3–12, , 18 2017. , 07IE nelgn eilsSymposium Vehicles Intelligent IEEE 2017 nentoa ofrneo Artificial on Conference International ae 8919,2015. 1889–1897, pages , rnir in Frontiers Neuron arXiv , This is the author’s version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081