

Embodied Neuromorphic Vision with Event-Driven Random Back-Propagation

Jacques Kaiser1, Alexander Friedrich1, J. Camilo Vasquez Tieck1, Daniel Reichard1, Arne Roennau1, Emre Neftci2, Rüdiger Dillmann1

¹FZI Research Center For Information Technology, Karlsruhe, Germany
²Department of Cognitive Sciences, University of California Irvine, Irvine, USA

Abstract—Spike-based communication between biological neurons is sparse and unreliable. This enables the brain to process visual information from the eyes efficiently. Taking inspiration from biology, artificial spiking neural networks coupled with silicon retinas attempt to model these computations. Recent findings in machine learning allowed the derivation of a family of powerful synaptic plasticity rules approximating backpropagation for spiking networks. Are these rules capable of processing real-world visual sensory data? In this paper, we evaluate the performance of Event-Driven Random Back-Propagation (eRBP) at learning representations from event streams provided by a Dynamic Vision Sensor (DVS). First, we show that eRBP matches state-of-the-art performance on the DvsGesture dataset with the addition of a simple covert attention mechanism. By remapping visual receptive fields relative to the center of the motion, this attention mechanism provides translation invariance at low computational cost compared to convolutions. Second, we successfully integrate eRBP in a real robotic setup, where a robotic arm grasps objects according to detected visual affordances. In this setup, visual information is actively sensed by a DVS mounted on a robotic head performing microsaccadic eye movements. We show that our method classifies affordances within 100 ms after microsaccade onset, which is comparable to human performance reported in a behavioral study. Our results suggest that advances in neuromorphic technology and plasticity rules enable the development of autonomous learning robots operating at high speed and low energy consumption.

Index Terms—Spiking Neural Networks, Event-Based Vision, Learning, Synaptic Plasticity

Figure 1: Our robotic setup embodying the synaptic learning rule eRBP [1]. The DVS is mounted on a robotic head performing microsaccadic eye movements. The spiking network is trained (dashed-line connections) in a supervised fashion to classify visual affordances from the event streams online. During training, output neurons and label neurons project to an error neuron population, which conveys a random feedback to hidden layers. The output neurons of the network correspond to the four types of affordances: ball-grasp, bottle-grasp, pen-grasp or do nothing. At test time, a Schunk LWA4P arm equipped with a five-finger Schunk SVH gripper performs the detected reaching and grasping motion.

I. INTRODUCTION

There are important discrepancies between the computational paradigms of the brain and modern computer architectures.

Remarkably, evolution discovered a way to learn and perform energy-efficient computations from large networks of neurons. Particularly, the communication of spikes between neurons is asynchronous, sparse and unreliable. Understanding what computations the brain performs and how they are implemented on a neural substrate is a global endeavor of our time. From an engineering perspective, this would enable the design of autonomous learning robots operating at high speed for a fraction of the energy budget of current solutions.

Recently, a family of synaptic learning rules building on groundbreaking machine learning research has been proposed in [1]–[4]. These rules implement variations of Back-Propagation (BP) adapted to spiking networks by approximating the gradient computations. As was shown in [5], [6], approximations of the gradient can be made without significantly deteriorating the accuracy of the network. These approximations allow weight updates to depend only on local information available at the synapse, enabling online learning with an efficient neuromorphic hardware implementation. Are these rules capable of efficiently processing complex visual information like the brain? Synaptic learning rules are rarely evaluated in an embodiment, and often report below state-of-the-art performance compared to classical methods [7].

In this paper, we evaluate the ability of one of these rules – eRBP [1] – to learn from real visual event-based sensor data. eRBP was the first synaptic plasticity rule formally derived from Random Backpropagation [5], its equivalent for analog neurons. First, we benchmark eRBP on the popular visual event-based dataset DvsGesture [8] from IBM. We introduce a covert attention mechanism which further improves the performance by providing translation invariance without convolutions.

Second, we integrate the rule in a real-world closed-loop robotic grasping setup involving a robotic head, arm and a five-finger hand. The spiking network learns to classify different types of affordances based on visual information obtained with microsaccades, and communicates this information to the arm for grasping. This real-world task has the potential to enhance neuromorphic and neurorobotics research, since many functional components such as reaching, grasping and depth perception can be easily segregated and implemented with brain models. This work paves the way towards the integration of brain-inspired computational paradigms into the field of robotics.

This paper is structured as follows. We give a brief introduction to the synaptic learning rule eRBP in Section II together with the network architecture. We further present our covert attention mechanism and the microsaccadic eye movements. Our approach is evaluated in Section III, first on the DvsGesture [8] benchmark, second in our real-world grasping setup. We conclude in Section IV.

II. APPROACH

A. Event-Driven Random Backpropagation

eRBP [1] is an interpretation for spiking neurons of feedback alignment presented in [5] for analog neurons. Feedback alignment is an approximation of backpropagation, where the prediction error is fed back to all neurons directly with fixed, random feedback weights. With a mean-square error loss, the update for a hidden weight from neuron j to neuron i can be formulated as follows:

\Delta w_{ij}(t) = y_j \, \phi'\Big(\sum_{j'} w_{ij'}(t)\, y_{j'}(t)\Big)\, T_i(t), \qquad T_i(t) = \sum_k e_k(t)\, g_{ik} \qquad (1)

with y_j the output of neuron j, φ the activation function of neuron i, e_k the prediction error for output neuron k and g_ik a fixed random feedback weight. This rule is interpreted for spiking neurons with y_j the spiking output (0 or 1) of neuron j and φ the function mapping membrane potential to spikes. The weighted sum of input spikes \sum_{j'} w_{ij'}(t) y_{j'}(t) is interpreted as the membrane potential I_i of neuron i. eRBP relies on hard-threshold Leaky Integrate-And-Fire neurons, leading φ to be non-differentiable. Following the approach of surrogate gradients [9], φ' is approximated as the boxcar function, equal to 1 between b_min and b_max and 0 otherwise. Note that the temporal dynamics of the Leaky Integrate-And-Fire neuron – post-synaptic potentials and refractory period – are not taken into account in Equation (1). More recent plasticity rules now include an eligibility trace [2]–[4] accounting for some of these dynamics.

This rule can be efficiently implemented in a spiking network with weight updates triggered by pre-synaptic spikes y_j as:

\Delta w_{ij}(t) \propto \begin{cases} -\sum_k e_k(t)\, g_{ik} & \text{if } b_{min} < I_i < b_{max} \\ 0 & \text{otherwise} \end{cases} \qquad (2)

with b_min and b_max the window of the boxcar function. The weight update for the last layer is an exception, since there are no random feedback weights. Each time a pre-synaptic neuron j with a connection to an output neuron k spikes, the weight update of this connection is calculated by:

\Delta w_{kj}(t) \propto \begin{cases} -e_k(t) & \text{if } b_{min} < I_k < b_{max} \\ 0 & \text{otherwise} \end{cases} \qquad (3)

Since eRBP only uses two comparisons and one addition for each pre-synaptic spike to perform the weight update, it allows a real-time, energy-efficient and online learning implementation running on neuromorphic hardware.
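To make the update rule concrete, the following is a minimal NumPy sketch of the hidden-layer update of Equation (2). It is not the Auryn implementation used in the experiments; the layer sizes, learning rate and boxcar window values are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of the eRBP hidden-layer update (Equation (2)).
# Shapes and parameter values are assumptions, not the paper's settings.
rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 8192, 200, 4
W = rng.normal(0.0, 0.1, (n_hidden, n_in))   # learned feedforward weights w_ij
G = rng.normal(0.0, 1.0, (n_hidden, n_out))  # fixed random feedback weights g_ik
b_min, b_max = 0.0, 1.0                      # boxcar window of the surrogate gradient
lr = 1e-3                                    # proportionality constant of Equation (2)

def erbp_hidden_update(W, pre_spikes, I_hidden, errors):
    """Apply Equation (2) for the synapses of spiking pre-synaptic neurons.

    pre_spikes : (n_in,) binary spike vector y_j at time t
    I_hidden   : (n_hidden,) membrane potentials I_i of the hidden neurons
    errors     : (n_out,) error signals e_k(t) integrated from the error neurons
    """
    T = G @ errors                                    # T_i = sum_k e_k(t) g_ik
    boxcar = (I_hidden > b_min) & (I_hidden < b_max)  # surrogate derivative phi'
    # The update is triggered only for synapses whose pre-synaptic neuron spiked.
    return W - lr * np.outer(T * boxcar, pre_spikes)
```

Only the comparison against the boxcar window and the random-feedback sum appear per update, which is what makes the rule cheap enough for online neuromorphic learning.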
B. Network Architecture

We propose a network architecture similar to the one proposed in [1]. Specifically, the network consists of spiking neurons organized in feedforward layers, performing classification with one-hot encoding. In other words, the k output neurons of the last layer correspond to the k classes of the dataset. The error signals e_k(t) are encoded in spikes and provided to all hidden neurons with backward connections from error neurons. These error spikes are integrated locally in a dedicated leaky dendritic compartment for every learning neuron. The weights of the backward connections are drawn from a random distribution and fixed – they correspond to the factors g_ik in Equations (1) and (2).

For each class, there are two error neurons conveying positive and negative error respectively. This is required, since spikes are not signed events. A pair of positive and negative error neurons is connected to the corresponding output neuron and label neuron. There is one label neuron for each class, which spikes repeatedly during training when a sample of the respective class is presented. Formally, the error e_k(t) is approximated as:

e_k(t) = \nu_k^+(t) - \nu_k^-(t), \qquad \nu_k^+(t) \propto \nu_k^P(t) - \nu_k^L(t), \qquad \nu_k^-(t) \propto -\nu_k^P(t) + \nu_k^L(t) \qquad (4)

where ν_k^P(t), ν_k^L(t) are the firing rates of the output neurons and label neurons respectively, and ν_k^+(t), ν_k^-(t) are the firing rates of the positive and negative error neurons. Within this framework, all computations are performed with spiking neurons and communicated as spikes, including the computation of errors. As in the brain, all synapses are stochastic, with a probability of dropping spikes. The network is depicted in Figure 1.

In this work, the network learns from event streams provided by a DVS. Since spikes are not signed events, we associate two neurons with each pixel to convey ON- and OFF-events separately. This distinction is important since event polarities carry the instantaneous information of direction of motion (see Figure 4). Since events are generated only upon light change, two different setups are analyzed: a dataset where changes originate from motion in the scene, and a dataset where changes originate from fixational eye movements. The evaluation on these two types of dataset is important since they can lead to different performance [3].
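As an illustration of this polarity segregation, the sketch below routes a batch of DVS address events into two input channels. The (x, y, polarity) event layout and the sensor size are assumptions for the example, not a prescribed interface.

```python
import numpy as np

def split_polarities(events, height=128, width=128):
    """Route DVS address events to two input channels (OFF and ON).

    events : (N, 3) integer array of (x, y, polarity), polarity in {0, 1}
    Returns spike-count maps of shape (2, height, width), one neuron per
    pixel and polarity, as described in Section II-B.
    """
    channels = np.zeros((2, height, width))
    for x, y, polarity in events:
        channels[int(polarity), int(y), int(x)] += 1
    return channels

# Example: two ON-events and one OFF-event.
events = np.array([[10, 12, 1], [11, 12, 1], [40, 5, 0]])
print(split_polarities(events).sum(axis=(1, 2)))  # -> [1. 2.]
```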

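Looking back at the error encoding of Equation (4), the rate-based illustration below shows how a pair of rectified error neurons can represent a signed error with unsigned spike rates. The rectification and the example rates are assumptions for illustration, not the paper's neuron model.

```python
import numpy as np

def error_neuron_rates(output_rates, label_rates):
    """Rate-based illustration of Equation (4).

    output_rates : (k,) firing rates nu^P of the output (prediction) neurons
    label_rates  : (k,) firing rates nu^L of the label neurons
    Returns the positive/negative error neuron rates and the error e_k.
    """
    # Spike rates cannot be negative, hence one rectified neuron per sign.
    nu_pos = np.maximum(output_rates - label_rates, 0.0)
    nu_neg = np.maximum(label_rates - output_rates, 0.0)
    return nu_pos, nu_neg, nu_pos - nu_neg

# Example: class 2 is the presented class, so its label neuron fires strongly.
output_rates = np.array([10.0, 40.0, 5.0, 8.0])
label_rates = np.array([0.0, 0.0, 100.0, 0.0])
print(error_neuron_rates(output_rates, label_rates))
```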
C. Covert Attention Window

It was shown in biology that visual receptive fields of frontal eye field neurons are constantly remapped [10], [11]. Inspired by this insight, we introduce a simple covert attention mechanism which consists of moving an attention window across the input stream. Covert attention, as opposed to overt attention, signifies an attention shift which is not marked by eye movements. Particularly suited to event streams, the center of the attention window is computed online as the median address event of the last n_attention events, see Figure 4. By remapping receptive fields relative to the center of the motion, this technique enables translation invariance at low computational cost compared to convolutions. This method also allows reducing the dimension of the event stream without rescaling.

A similar method was already introduced in [12] for classifying a dataset of three human motions (bend, sit/stand, walk) recorded with a DVS. Their approach consists of remapping the address of their feature neurons (C1) with respect to their mean activation before being fed to the classifier. Instead, our method consists of remapping the address events directly, with respect to the median event. Unlike the median, the mean activation can result in an event-less attention window in case of multiple objects in motion, such as two-hand gestures. Additionally, since our attention window is smaller than the event stream, eccentric events are not processed by the network. We show in this paper how this biologically motivated technique boosts the performance, even on DvsGesture, where multiple body parts are simultaneously in motion. We note that a similar mechanism could be integrated in a robotic head such as the one used in this paper to perform actual eye movements (see Figure 2). However, an additional mechanism to discard events resulting from the ego-motion would be required, which could be based on visual prediction [11], [13], [14].
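A minimal sketch of this covert attention mechanism is given below, assuming events arrive as (x, y) addresses. The dropping of eccentric events at the window border follows the description above, but the function and parameter names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def covert_attention(events_xy, window_size=64, n_attention=1000):
    """Remap the most recent address events relative to the median event.

    events_xy : (N, 2) integer array of event addresses (x, y), newest last
    Returns the last n_attention events expressed in window coordinates,
    dropping eccentric events that fall outside the attention window.
    """
    recent = events_xy[-n_attention:]
    center = np.round(np.median(recent, axis=0)).astype(int)  # window center
    remapped = recent - center + window_size // 2              # translate receptive fields
    inside = np.all((remapped >= 0) & (remapped < window_size), axis=1)
    return remapped[inside]
```

Because the median tracks the bulk of the event activity, the window stays on the moving body parts even when two hands move simultaneously, which is where a mean-based remapping can fail.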

Figure 2: Microsaccadic motion of the DVS performed by the robotic head. The motion consists of three phases: a negative tilt of α combined with a negative pan of α/2, followed by a tilt of α and a negative pan of α/2, followed by a final pan of α moving the DVS back to its initial position. We chose the angle α = 1.833°. Each motion is effectuated in 0.2 s.

D. Microsaccadic eye movements

For our real-world grasping experiment, address events are sensed from static scenes by performing microsaccadic eye movements. This technique was already used to convert images to event streams [15], essentially extracting edge features [16]. To this end, we mounted the DVS on the robotic head presented in [16], see Figure 2. One Dynamixel servo MX-64AT is used to tilt both DVS simultaneously, while two other Dynamixel servos MX-28AT are used to pan each DVS independently. The center of all rotations is approximately the optical center of each DVS. In this work, only the events of the right DVS are processed. The microsaccadic motion consists of an isosceles triangle in joint space. The three motions defined by the edges of the triangle last 0.2 s each. The first motion consists of a negative tilt of α and a negative pan of α/2. The second motion is a tilt of α and a negative pan of α/2. The third motion moves the DVS back to its initial position with a pan of α. We chose the angle α = 1.833°. This angle is much smaller in biology, but DVS pixels are much larger than the photoreceptors of the retina [17].

The microsaccades are triggered either manually for recording training data, or automatically in a loop at test time. We allow the events to flow through the network only when a microsaccade is triggered. No information about the properties of the microsaccade is passed as an input to the network.

III. EVALUATION

Throughout the experiments, we rely on a dense 4-layer architecture with two hidden layers of 200 neurons each (see Figure 1). As described in Section II-B, the ON- and OFF-events obtained from the DVS are segregated in two separate input layers. All synapses are stochastic, with a 35% chance of dropping each spike. All the presented experiments relied on the open-source implementation of eRBP¹ based on the neural simulator Auryn [18].

¹https://gitlab.com/eneftci/erbp_auryn

A. DvsGesture

DvsGesture is an action recognition dataset recorded by IBM using a DVS [8], [19]. It consists of 29 subjects performing 11 diverse actions in three different illumination conditions. We reduce the dimensionality of the input event stream to 64 × 64, both in the rescaling case and the covert attention window case. We split the whole dataset into 1176 training samples and 288 test samples, leaving out the data when users do not perform a labeled motion. This training set consists of 7602 s (approximately 2 h) of recordings, versus 1960 s for the test set. The duration of the actions varies greatly across samples, see Table I and Figure 3. A single sample may be about 1 second or over 18 seconds long. Despite this high variance, all samples were used in full without temporal modifications, both for training and testing. The number of events used to calculate the position of the attention window was set to n_attention = 1000.

Our evaluation on DvsGesture shows that eRBP efficiently learns to classify motions from raw event streams with the attention mechanism introduced in Section II-C.
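For reference, the configuration below summarizes the evaluation setup described above for the DvsGesture case, assuming a 64 × 64 window with two polarity channels (giving 8192 input neurons, consistent with Figure 6). The dictionary layout is an illustrative convention, not the simulator's configuration format.

```python
# Summary of the evaluation setup of Section III (DvsGesture case).
dvs_gesture_config = {
    "input_neurons": 2 * 64 * 64,      # ON- and OFF-events, one neuron per pixel and polarity
    "hidden_layers": [200, 200],       # two hidden layers of 200 neurons each
    "output_neurons": 11,              # one-hot classes of DvsGesture
    "error_neurons": 2 * 11,           # one positive and one negative error neuron per class
    "synapse_drop_probability": 0.35,  # stochastic synapses drop 35% of the spikes
    "n_attention": 1000,               # events used to locate the attention window
    "epochs": 100,
}
```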

Label  #Training  #Test  Description
1      97         24     Clapping
2      98         24     Right hand waving
3      98         24     Left hand waving
4      98         24     Right arm circling clockwise
5      98         24     Right arm circling anticlockwise
6      98         24     Left arm circling clockwise
7      99         24     Left arm circling anticlockwise
8      196        48     Arms rolling
9      98         24     Air drum
10     98         24     Air guitar
11     98         24     Other gestures

Table I: Labels of all the different classes in the DvsGesture dataset and their description. The amount of samples of label 8 is doubled, since arms rolling were recorded and labeled with 8 for both rotation directions.

Figure 4: Aggregation of 1000 events for two samples of the DvsGesture dataset (user 10): (a) arm circling clockwise, (b) arm circling anticlockwise. Yellow pixels symbolize ON-events, blue pixels are OFF-events. The information about direction of motion is contained in the event polarity, hence the importance of their segregation in the input layer. The red square represents the attention window of size 64 × 64, calculated as the median of the last 1000 events.

Figure 3: Statistics on sample duration for each label in the DvsGesture dataset. The red horizontal lines represent the median sample duration. Each box indicates the interquartile range (IQR = Q3 − Q1) per label, which is the range between the first quartile Q1 and the third quartile Q3. Whisker pairs show the range of all sample durations within Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. Outliers are represented by small circles.

Figure 5: Classification accuracy over 100 epochs of training on DvsGesture. The accuracy converges to 86.1% and 92.7% for the rescaling and the attention window approaches respectively. The dimension of the event stream is 64 × 64 in both cases. The drops in accuracy in early stages of learning in the attention window case are due to the stochastic synapses, which have a 35% chance of dropping spikes. The rescaling approach is resilient to this stochasticity since all events in the original stream are squeezed into macro-pixels, leading to redundant events.

An accuracy of 92.7% is achieved after only 60 epochs, corresponding to approximately 127 h of training data, see Figure 5. This accuracy is comparable to state-of-the-art deep networks (IBM EEDN [8], 94.49%) and other synaptic learning rules taking temporal dynamics into account (DCLL [3], 94.19%), both relying on convolutions. Without the attention mechanism, the accuracy of the network drops to 86.1%. These results confirm that our simple covert attention mechanism provides translation invariance without convolutions, at a low computational cost. Additionally, unambiguous samples are classified in under 0.1 s, with an increasing confidence over time, see Figure 6.

Since DvsGesture is a classification task, not accounting for neural temporal dynamics in the learning rule (Equation (1)) does not impact the performance much. Indeed, the target output signal as encoded by label neurons is constant across a training sample for durations of several seconds, see Figure 3. We expect this omission to decrease performance significantly for a temporal regression task – such as learning a time sequence – where the temporal dynamics of the target signal are relevant.
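The confidence-over-time readout mentioned above can be sketched as a running spike count over the output layer, as shown below. The (time, neuron) spike representation is an assumption for the example, not the authors' evaluation script.

```python
import numpy as np

def running_prediction(spike_times, spike_neurons, n_classes, eval_times):
    """Classify by accumulating output spikes up to each evaluation time.

    spike_times   : (N,) spike times of the output layer in seconds
    spike_neurons : (N,) output neuron index of each spike
    eval_times    : (M,) times at which the current prediction is reported
    """
    predictions = []
    for t in eval_times:
        counts = np.bincount(spike_neurons[spike_times <= t], minlength=n_classes)
        predictions.append(int(np.argmax(counts)))
    return predictions

# Example: output neuron 2 dominates the spike count after 0.1 s.
times = np.array([0.02, 0.05, 0.08, 0.09, 0.12])
neurons = np.array([1, 2, 2, 2, 2])
print(running_prediction(times, neurons, n_classes=11, eval_times=[0.05, 0.1, 0.5]))
```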


Figure 6: Spiketrains and classification results for a test sample of class "left hand waving" from the DvsGesture dataset. The network manages to correctly classify the sample in less than 0.1 s, with increasing confidence over time. The rhythm of the "left hand waving" motion is clearly visible in the input spiketrain. The neurons in the hidden layers spike close to their maximum frequency, as limited by the refractory period.

B. Grasp-type Recognition

In this experiment, we embody eRBP in the real-world grasping robotic setup depicted in Figure 7. In this setup, the spiking network is trained to recognize four labels corresponding to four different types of grasp: ball-grasp, bottle-grasp, pen-grasp or do nothing [20]. During training, an object of a particular class is placed on a table at a specific position. The robotic head performs microsaccadic eye movements (similar to the N-MNIST dataset [15]) to extract visual information from the static object. The event streams are recorded together with the corresponding object affordance. In this experiment, the attention window has dimension 32 × 32 and is fixed to match the position of the objects on the table, see Figure 8 for example samples. During testing, a microsaccade is performed and the detected object affordance triggers the adequate predefined reaching and grasping motion on a Schunk LWA4P arm equipped with a five-finger Schunk SVH gripper. This demonstrator was implemented with the ROS framework [21] and the ROS DVS driver introduced in [22].

Figure 7: Real-world grasp-type recognition experiment setup integrating a Schunk LWA4P arm equipped with a five-finger Schunk SVH gripper and a DVS head. The DVS head performs microsaccadic eye movements to sense event streams from static scenes. We recorded a small four-class dataset (ball, bottle, pen, background) of 20 samples per class. At test time, the detected grasp-type triggers the corresponding predefined reaching and grasping motion.

With only 20 samples per class and 30 training epochs, the network was capable of learning the four visual affordances, see the supplementary video². Example spiketrains and classification results at test time are shown in Figure 9. The network classifies the ball and the pen with high confidence, but with low confidence for the bottle and the background. This is due to the fact that we trained with a transparent bottle, generating only a few events, thus visually similar to the background.

While the microsaccade leads to sparse activity in the input layer, both hidden layers have strong, sustained activity (see Figure 9). This indicates some form of short-term memory emerging from neural dynamics, as the network is capable of remembering the object despite receiving no events between two microsaccadic phases. The neurons in the hidden layers spike close to their maximum frequency, as limited by the refractory period. This is not surprising, since we did not use a regularization term to enforce sparsity in the learning rule and labels were encoded into spikes with rate coding, see Section II-B. Except for the background class, counting the spikes at the output leads to successful classification about 100 ms after microsaccade onset, comparable to human performance reported in a behavioral study [23].

²https://neurorobotics-files.net/index.php/s/sBQzWFrBPoH9Dx7
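At test time, the demonstrator reduces to a trigger–sense–classify–act loop. The sketch below captures that control flow; trigger_microsaccade, collect_events, classify and execute_grasp are hypothetical placeholders for the ROS-based integration, not functions of the actual demonstrator.

```python
GRASP_TYPES = ("ball-grasp", "bottle-grasp", "pen-grasp", "do nothing")

def grasp_demo_step(trigger_microsaccade, collect_events, classify, execute_grasp,
                    saccade_duration=0.6):
    """One test-time iteration of the grasp-type recognition demonstrator.

    The four callables are placeholders for the robot integration. `classify`
    maps an event batch to one of the four affordance labels, e.g. by counting
    the output spikes of the trained spiking network.
    """
    trigger_microsaccade()                     # three-phase pan/tilt motion, 0.2 s per phase
    events = collect_events(saccade_duration)  # events only flow during the microsaccade
    affordance = classify(events)
    if affordance != "do nothing":
        execute_grasp(affordance)              # predefined reaching and grasping motion
    return affordance
```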

Figure 8: Example samples and learned weights for the grasp-type recognition experiment. Columns from left to right: ball, bottle, pen, background. Top row: camera image of the objects. Middle row: integration of the address events during 15 ms after microsaccade onset. Bottom row: projection of the synaptic weights of our 4-layer network for each label neuron onto the input after training. Green denotes excitation (positive influence) and pink denotes inhibition (negative influence).

Since the DVS does not sense colors, the network only relies on shape information, crucial for affordances. This allowed the network to moderately generalize despite the small amount of training samples. Indeed, a single object per affordance was used during training, but the network could recognize different objects of the same shape. Recognition also worked when the objects were slightly moved from the reference point used for grasping. However, the network was not robust to changes in background or unexpected background motions happening during the microsaccade. This is due to the background being learned as an additional class for the "do nothing" affordance.

IV. CONCLUSION

Neuromorphic engineering technology enables the design of autonomous learning robots operating at high speed for a fraction of the energy consumption of current solutions. Until recently, the advantages of this technology were limited due to the lack of efficient learning rules for spiking networks. This bottleneck has been addressed since the realization that gradient backpropagation can be approximated with spikes.

In this paper, we demonstrated the ability of eRBP to learn from real-world event streams provided by a DVS. First, with the addition of a simple covert attention mechanism, we have shown that eRBP achieves comparable accuracy to state-of-the-art methods on the DvsGesture benchmark [8]. This attention mechanism improved performance by a large margin in comparison to classical rescaling approaches, by providing translation invariance at a low computational cost compared to convolutions. Second, we integrated eRBP in a real-world robotic grasping experiment, where affordances are detected from microsaccadic eye movements and conveyed to a robotic arm and hand setup for execution. Our results show that correct affordances are detected within about 100 ms after microsaccade onset, matching biological reaction time [23].

For future work, eRBP could be replaced by more recent learning rules accounting for neural temporal dynamics [1]–[4]. This would enable the setup to be extended to temporal sequence learning and reinforcement learning tasks [24], [25]. Additionally, other components of the grasp-type recognition experiment could be implemented with spiking networks, such as reaching motions [26], [27], grasping motions [28] and depth perception [16]. This work paves the way towards the integration of brain-inspired computational paradigms into the field of robotics.

ACKNOWLEDGMENT

This research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 785907 (Human Brain Project SGA2).


Figure 9: Spiketrains and classification results for four test samples of our grasp-type dataset ("ball", "bottle", "pen" and "background"). The network manages to correctly classify the four test samples. The ball and the pen are easily classified despite the small amount of training data. However, the transparent bottle generates few events, leading to more uncertainty in the classification with the background. The three phases of the microsaccadic motion are clearly visible in the input spiketrains (first row of each plot), see Section II-D. However, the activity of the hidden layers does not drop instantaneously even when the input is sparse, indicating a form of short-term memory induced by the neural dynamics. The neurons in the hidden layers spike close to their maximum frequency, as limited by the refractory period.

REFERENCES

[1] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis, "Event-driven random back-propagation: Enabling neuromorphic deep learning machines," Frontiers in Neuroscience, vol. 11, p. 324, 2017.
[2] F. Zenke and S. Ganguli, "SuperSpike: Supervised learning in multi-layer spiking neural networks," arXiv preprint arXiv:1705.11146, 2017.
[3] J. Kaiser, H. Mostafa, and E. Neftci, "Synaptic plasticity dynamics for deep continuous local learning," arXiv preprint arXiv:1811.10766, 2018.
[4] G. Bellec, F. Scherr, E. Hajek, D. Salaj, R. Legenstein, and W. Maass, "Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets," arXiv preprint arXiv:1901.09049, 2019.
[5] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, "Random synaptic feedback weights support error backpropagation for deep learning," Nature Communications, vol. 7, p. 13276, 2016.
[6] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, and K. Kavukcuoglu, "Decoupled neural interfaces using synthetic gradients," arXiv preprint arXiv:1608.05343, 2016.
[7] Z. Bing, C. Meschede, F. Röhrbein, K. Huang, and A. C. Knoll, "A survey of robotics control based on learning-inspired spiking neural networks," Frontiers in Neurorobotics, vol. 12, p. 35, 2018.
[8] A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza et al., "A low power, fully event-based gesture recognition system," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7243–7252.
[9] E. O. Neftci, H. Mostafa, and F. Zenke, "Surrogate gradient learning in spiking neural networks," arXiv preprint arXiv:1901.09948, 2019.
[10] M. Zirnsak, N. A. Steinmetz, B. Noudoost, K. Z. Xu, and T. Moore, "Visual space is compressed in prefrontal cortex before eye movements," Nature, vol. 507, no. 7493, p. 504, 2014.
[11] M. A. Sommer and R. H. Wurtz, "Influence of the thalamus on spatial visual processing in frontal cortex," Nature, vol. 444, no. 7117, p. 374, 2006.
[12] B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang, "Feedforward categorization on AER motion events using cortex-like features in a spiking neural network," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, p. 1, 2014.
[13] J. Kaiser, R. Stal, A. Subramoney, A. Roennau, and R. Dillmann, "Scaling up liquid state machines to predict over address events from dynamic vision sensors," Bioinspiration & Biomimetics, vol. 12, no. 5, p. 055001, 2017.
[14] J. Kaiser, S. Melbaum, J. C. V. Tieck, A. Roennau, M. V. Butz, and R. Dillmann, "Learning to reproduce visually similar movements by minimizing event-based prediction error," in Int. Conf. on Biomedical Robotics and Biomechatronics. IEEE, 2018.
[15] G. Orchard, A. Jayawant, G. Cohen, and N. Thakor, "Converting static image datasets to spiking neuromorphic datasets using saccades," arXiv preprint arXiv:1507.07629, 2015.
[16] J. Kaiser, J. Weinland, P. Keller, L. Steffen, J. C. V. Tieck, D. Reichard, A. Roennau, J. Conradt, and R. Dillmann, "Microsaccades for neuromorphic stereo vision," in International Conference on Artificial Neural Networks. Springer, 2018, pp. 244–252.
[17] S. Martinez-Conde, S. L. Macknik, and D. H. Hubel, "The role of fixational eye movements in visual perception," Nature Reviews Neuroscience, vol. 5, no. 3, pp. 229–240, 2004.
[18] F. Zenke and W. Gerstner, "Limits to high-speed simulations of spiking neural networks using general-purpose computers," Frontiers in Neuroinformatics, vol. 8, p. 76, 2014.
[19] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
[20] J. Kaiser, D. Zimmerer, J. C. V. Tieck, S. Ulbrich, A. Roennau, and R. Dillmann, "Spiking convolutional deep belief networks," in International Conference on Artificial Neural Networks. Springer, 2017, pp. 3–11.
[21] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source Robot Operating System," in ICRA Workshop on Open Source Software, vol. 3, no. 3.2. Kobe, Japan, 2009, p. 5.
[22] E. Mueggler, B. Huber, and D. Scaramuzza, "Event-based, 6-DOF pose tracking for high-speed maneuvers," in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 2761–2768.
[23] J. G. Martin, C. E. Davis, M. Riesenhuber, and S. J. Thorpe, "Zapping 500 faces in less than 100 seconds: Evidence for extremely fast and sustained continuous visual search," Scientific Reports, vol. 8, no. 1, p. 12482, 2018.
[24] J. C. V. Tieck, M. V. Pogančić, J. Kaiser, A. Roennau, M.-O. Gewaltig, and R. Dillmann, "Learning continuous muscle control for a multi-joint arm by extending proximal policy optimization with a liquid state machine," in International Conference on Artificial Neural Networks. Springer, 2018, pp. 211–221.
[25] J. Kaiser, M. Hoff, A. Konle, J. C. V. Tieck, D. Kappel, D. Reichard, A. Subramoney, R. Legenstein, A. Roennau, W. Maass, and R. Dillmann, "Embodied synaptic plasticity with online reinforcement learning," Frontiers in Neurorobotics, submitted, 2019.
[26] J. C. V. Tieck, L. Steffen, J. Kaiser, D. Reichard, A. Roennau, and R. Dillmann, "Combining motor primitives for perception driven target reaching with spiking neurons," International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), vol. 13, no. 1, pp. 1–12, 2019.
[27] J. C. V. Tieck, L. Steffen, J. Kaiser, A. Roennau, and R. Dillmann, "Controlling a robot arm for target reaching without planning using spiking neurons," in 2018 IEEE 17th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 2018, pp. 111–116.
[28] J. C. V. Tieck, H. Donat, J. Kaiser, I. Peric, S. Ulbrich, A. Roennau, M. Zöllner, and R. Dillmann, "Towards grasping with spiking neural networks for anthropomorphic robot hands," in International Conference on Artificial Neural Networks. Springer, 2017, pp. 43–51.