Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Event-based Action Recognition Using Motion Information and Spiking Neural Networks

Qianhui Liu1,2, Dong Xing1,2, Huajin Tang1,2, De Ma1,∗ and Gang Pan1,2,∗
1College of Computer Science and Technology, Zhejiang University, Hangzhou, China
2Zhejiang Lab, Hangzhou, China
{qianhuiliu, dongxing, htang, made, gpan}@zju.edu.cn
∗Corresponding author

Abstract

Event-based cameras have attracted increasing attention due to their advantages of biologically inspired paradigm and low power consumption. Since event-based cameras record the visual input as asynchronous discrete events, they are inherently suitable to cooperate with the spiking neural network (SNN). Existing works of SNNs for processing events mainly focus on the task of object recognition. However, events from the event-based camera are triggered by dynamic changes, which makes it an ideal choice to capture actions in the visual scene. Inspired by the dorsal stream in the visual cortex, we propose a hierarchical SNN architecture for event-based action recognition using motion information. Motion features are extracted and utilized from events to local and finally to global perception for action recognition. To the best of the authors' knowledge, it is the first attempt of SNN to apply motion information to event-based action recognition. We evaluate our proposed SNN on three event-based action recognition datasets, including our newly published DailyAction-DVS dataset comprising 12 actions collected under diverse recording conditions. Extensive experimental results show the effectiveness of motion information and our proposed SNN architecture for event-based action recognition.

1 Introduction

Event-based cameras are a novel class of vision devices imitating the mechanism of the human retina. Contrary to conventional cameras, which record the visual input from all pixels as images at a fixed rate, with event-based cameras each pixel individually emits events when it observes sufficient changes of light intensity in its receptive field. Thus, event-based cameras naturally respond to moving objects and ignore static redundant information, resulting in a significant reduction of memory usage and energy consumption. The final output of the camera is a stream of events collected from each pixel, forming an asynchronous and sparse representation of the scene.

This event-based representation is inherently suitable to cooperate with the spiking neural network (SNN), since the SNN also has the event-based property [Hu et al., 2018]. SNN uses discrete spikes to transmit information between units, which mimics the behavior of biological neural systems. Benefiting from this event-driven processing paradigm, SNN is energy efficient on neuromorphic hardware and has a powerful ability in processing spatio-temporal information. In recent years, SNN has been increasingly applied to tasks related to event-based cameras.

Existing works of SNNs cooperating with event-based cameras mainly focus on object recognition tasks [Orchard et al., 2015; Xiao et al., 2019; Liu et al., 2020b]. However, since the event-based camera naturally captures movements in the visual scene, it is a good fit for the action recognition task. Nevertheless, works of SNN on event-based action recognition are still limited.

Humans can recognize actions accurately, which motivates us to explore the biological visual system to gain experience for event-based recognition. The visual cortex is organized in two different pathways [Jhuang et al., 2007]. One is a ventral stream dealing with shape information, which has been widely used in existing spiking object recognition models [Orchard et al., 2015; Liu et al., 2020b]. The other is a dorsal stream involved with the analysis of motion information. Since the event streams representing the actions contain rich motion information, motion features of the event stream may be an ideal choice for action recognition tasks. Further, the organization of the dorsal stream is hierarchical: neurons gradually increase their receptive fields along the hierarchy, as well as their selectivity and invariance to features. Inspired by the current theory of the visual cortex, we make steps towards the solution of event-based action recognition.

We propose a hierarchical SNN architecture for event-based action recognition using motion information. Motion features are extracted and utilized from events to local and finally to global perception for action recognition. Specifically, we first adopt motion-sensitive neurons to estimate optical flow for the purpose of local motion (direction and speed) perception. Then, we perform a motion pooling and a spatial pooling to mitigate the effect of the aperture problem [Orchard et al., 2013] and increase the spatial invariance, respectively.

As the final stage of the architecture, a SNN classifier, in which the spiking neurons are fully connected to the previous pooling layer, is adopted as global perception for recognition results. To the best of the authors' knowledge, it is the first attempt of SNN to apply motion information to action recognition tasks.

Besides, due to the lack of event-based action datasets and their importance for algorithm development, we present a new event-based action recognition dataset called DailyAction-DVS. The dataset comprises 15 subjects performing 12 daily actions under 2 lighting conditions and 2 camera positions (with different distances and angles to the subjects). This setting increases the challenge of the dataset while gaining more practical significance.

We evaluate the proposed SNN on the new event-based action recognition dataset and two other challenging ones. Experimental results show the effectiveness of motion information and our proposed SNN architecture for event-based action recognition.

2 Related Work

2.1 Event-based Action Recognition

The action recognition task has drawn a significant amount of attention from the academic community, owing to its applications in many areas like security and behavior analysis. With the popularity of event-based cameras, they have been found to be ideal choices to capture human actions, since they only record the activity in the field of view and automatically partition the foreground and background. Recently, research on event-based action recognition has emerged progressively. One approach is to convert the output of an event camera into frames and use standard computer vision methods, such as [Innocenti et al., 2020]. However, these works mainly focus on how to aggregate events and deal with frames. Another approach is to deal with events directly. [Maro et al., 2020] introduced a framework for dynamic gesture recognition relying on the concept of time-surfaces introduced in [Lagorce et al., 2017]. In addition, SNNs are being applied to event-based gesture recognition. [George et al., 2020] presented a SNN which uses the idea of convolution and reservoir computing in order to classify human hand gestures. Since SNN has the event-based property and the ability in processing spatio-temporal information, it has great potential for event-based action recognition, but related works are still limited.

2.2 Event-based Features

[Lagorce et al., 2017] proposed spatio-temporal features, called time-surfaces, based on the recent temporal activity of events within a local spatial neighborhood. [Sironi et al., 2018] presented local memory time surfaces to leverage past temporal information and improve robustness. [Ramesh et al., 2019] encoded the structural context of the event stream using log-polar grids, which is robust to moderate scale and rotation variations. Inspired by the function of the ventral stream in the visual cortex, [Orchard et al., 2015] proposed a spiking architecture to extract HMAX-based shape features for event-based object recognition. This work pioneered a series of related works [Xiao et al., 2019; Liu et al., 2020a; Liu et al., 2020b], and the proposed features have become among the most commonly used when applying SNNs to event-based recognition.

We are inspired by the function of the dorsal stream, also in the visual cortex, and make steps towards using motion information for action recognition tasks. Since event-based cameras provide an efficient way of encoding light and its temporal variations [Benosman et al., 2012], we introduce optical flow estimation for motion (direction and speed) perception. Existing works on SNN-based optical flow estimation adopted motion-sensitive neurons with synaptic delays [Orchard et al., 2013; Paredes-Vallés et al., 2019]. We here adopt the neurons in [Orchard et al., 2013] due to their effectiveness and simplicity.

2.3 Event-based Datasets

Existing datasets on event-based action recognition can be divided into two categories. One category is recorded with a static event-based camera facing a monitor on which video-based datasets are set to play automatically [Hu et al., 2016]. However, this recording method loses the real dynamics of moving objects between two frames, and there is no guarantee that a method tested on this kind of artificial data will behave similarly in real-world conditions [Sironi et al., 2018]. The other category is recorded directly by event-based cameras in a real scene. Among them, several datasets are proposed for gestures. [Amir et al., 2017] proposed an event-based hand gesture dataset captured by a fixed DVS camera [Lichtsteiner et al., 2008]. [Maro et al., 2020] also proposed a gesture dataset, but it was recorded by an ATIS camera [Posch et al., 2011] connected to a smartphone. As for human actions, [Miao et al., 2019] proposed an event-based action recognition dataset using a DAVIS camera [Brandli et al., 2014] with 3 different positions. However, this dataset was recorded under a single lighting condition and is relatively small (291 recordings released). Our proposed event-based action recognition dataset, DailyAction-DVS, has 1440 recordings of 12 daily actions. The dataset is captured by a DVS camera under 2 different lighting conditions and 2 different camera positions (with different distances and angles), which brings more challenges to the dataset and is also more in line with realistic situations.

3 Method

In this section, we introduce the proposed SNN for event-based action recognition, which extracts and utilizes motion information from events to local and finally to global perception. The architecture of the proposed SNN is shown in Figure 1.

Figure 1: The architecture of the proposed SNN for event-based action recognition. The network consists of 5 layers: the first layer is the Input layer, the following 3 layers are used for local perception, and the last layer provides global perception for recognition results. Events from the event-based camera are first encoded in the Input layer to a compatible format and then sent to the Motion Perception layer (MP1). MP1 consists of motion-sensitive neurons, and neurons of the same sensitive motion (direction and speed) are organized into one neural map. If the membrane voltage of a neuron exceeds its threshold, the neuron will fire a spike to the Motion Pooling layer (MP2) and all neurons in the same position of different neural maps will be reset. Neurons in MP2 have the same motion sensitivity but a larger receptive field. Neurons in the Spatial Pooling layer (SP) fuse spikes from their receptive field and transmit them to the Global Perception layer (GP). Finally, a SNN classifier, in which the spiking neurons are fully connected with SP, receives all the extracted motion feature spikes and outputs action recognition results.

3.1 Events From Event-based Camera

Given an event-based camera with pixel grid size N × M, the i-th event can be described as:

e_i = [e_{t_i}, e_{x_i}, e_{y_i}, e_{p_i}],   i ∈ {1, 2, ..., I}    (1)

where e_{t_i} ≥ 0 is the timestamp at which the event is generated, (e_{x_i}, e_{y_i}) ∈ {1, 2, ..., N} × {1, 2, ..., M} is the position of the pixel generating the i-th event, e_{p_i} ∈ {−1, 1} is the polarity of the event, with −1 and 1 meaning OFF and ON events respectively, and I is the number of events. Figure 2 shows a visualization of the event stream representing the left arm clockwise gesture in the DvsGesture dataset [Amir et al., 2017].

Figure 2: Visualization of one event stream representing the left arm clockwise gesture in the DvsGesture dataset. ON and OFF events are represented by blue and gray respectively.
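To make the notation of Eq. (1) concrete, the following is a minimal sketch that stores an event stream as a NumPy structured array with one record per event. The field names, time unit and toy values are illustrative assumptions, not the interface of any particular dataset loader.

```python
import numpy as np

# One record per event: timestamp, pixel coordinates and polarity,
# mirroring e_i = [e_ti, e_xi, e_yi, e_pi] in Eq. (1).
event_dtype = np.dtype([("t", np.int64),   # timestamp (e.g. microseconds)
                        ("x", np.int16),
                        ("y", np.int16),
                        ("p", np.int8)])   # -1 = OFF, 1 = ON

def make_event_stream(ts, xs, ys, ps):
    """Pack parallel lists of timestamps/coordinates/polarities into one stream."""
    events = np.zeros(len(ts), dtype=event_dtype)
    events["t"], events["x"], events["y"], events["p"] = ts, xs, ys, ps
    return np.sort(events, order="t")      # events are processed in time order

# A toy 4-event stream on a small pixel grid.
stream = make_event_stream([1000, 1250, 1300, 2000],
                           [10, 11, 10, 12],
                           [5, 5, 6, 5],
                           [1, 1, -1, 1])
print(stream["t"], stream["p"])
```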

The events from the camera are first encoded in the Input layer to a compatible format for the following processing. This layer can be seen as having two neural maps, one per polarity. Each map is comprised of N × M spiking neurons with no internal dynamics that emit spikes when receiving the corresponding events.

3.2 Local Perception

Motion Perception (MP1)

The events are sent to the Motion Perception layer (MP1) for local motion estimation. The synaptic connections between Input and MP1 have transmission delays, which are related to the motion that the postsynaptic neurons are sensitive to. As in [Orchard et al., 2013], we specify the delay as a function of the speed s and direction θ to which the neuron is tuned, as well as the position of the pixel relative to the neuron. The delay function can be described with the following equation:

D(Δx, Δy; s, θ) = −(Δx cos θ + Δy sin θ) / s    (2)

where Δx and Δy are the spatial offsets between the neuron position (x, y) and the event address (e_x, e_y). Considering both the coverage of various directions and speeds and the complexity of implementing the algorithm, we set the directions θ to vary in increments of 45 degrees and the speeds s to vary by a factor of 2. Neurons of the same sensitive direction and speed are organized into one neural map. On each map, the membrane voltage of the neuron at position (x, y) and time t can be described as:

V(x, y, t) = Σ_i 1{x ∈ X(e_{x_i})} · 1{y ∈ Y(e_{y_i})} · W(−(t − e_{t_i}) / τ_m)    (3)

where 1{·} is the indicator function, X(e_{x_i}) = [e_{x_i} − r_{mp1}, e_{x_i} + r_{mp1}] and Y(e_{y_i}) = [e_{y_i} − r_{mp1}, e_{y_i} + r_{mp1}] denote the receptive field of the neuron in this layer, r_{mp1} denotes the receptive field size, W denotes the synaptic weight and τ_m denotes the decay time constant.

Each spiking neuron has equally weighted connections to r_{mp1} × r_{mp1} neurons in the previous layer, but the connections have specific delays such that when the neuron's sensitive motion (direction and speed) pattern of events occurs, all these events arrive at the neuron within a small time interval and drive its membrane voltage to respond. When the neuron voltage exceeds its threshold V^{thr}_{mp1}, the neuron will fire a spike. The threshold V^{thr}_{mp1} is set according to the sum of weights of input events in the receptive field. In this way, when a stimulus arrives, the neuron which most closely matches the stimulus will most likely fire first. Meanwhile, the spike will inhibit neurons in other neural maps sharing the same position and force them to reset. This Winner-Take-All (WTA) mechanism eliminates erroneous feature spikes when the next stimulus passes.
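The delay tuning of Eq. (2) and the decaying accumulation of Eq. (3) can be sketched as follows, reusing the structured event layout from the earlier example. The direction/speed bank, the receptive-field radius, the exponential form assumed for the weight contribution W(·), the threshold handling and the hard winner-take-all selection are illustrative choices under stated assumptions, not the exact settings of the paper.

```python
import numpy as np

def delay(dx, dy, s, theta):
    """Transmission delay of Eq. (2): D(dx, dy; s, theta) = -(dx*cos(theta) + dy*sin(theta)) / s."""
    return -(dx * np.cos(theta) + dy * np.sin(theta)) / s

# Illustrative tuning bank: 8 directions in 45-degree steps, speeds doubling.
directions = np.deg2rad(np.arange(0, 360, 45))
speeds = np.array([0.02, 0.04, 0.08])        # pixels per microsecond (assumed unit)

def mp1_voltage(events, x, y, s, theta, r=2, w=1.0, tau_m=5e3, t_now=None):
    """Voltage of one MP1 neuron at (x, y), in the spirit of Eq. (3): each event inside the
    (2r+1)x(2r+1) receptive field contributes a weight that decays with its delay-shifted age."""
    if t_now is None:
        t_now = events["t"].max()
    v = 0.0
    for e in events:
        dx, dy = int(e["x"]) - x, int(e["y"]) - y
        if abs(dx) > r or abs(dy) > r:
            continue                                       # outside the receptive field
        age = t_now - (e["t"] + delay(dx, dy, s, theta))   # age after the tuned delay
        if age >= 0:
            v += w * np.exp(-age / tau_m)                  # decayed contribution
    return v

def mp1_wta(events, x, y, threshold, r=2):
    """Evaluate all (speed, direction) maps at (x, y); only the best-matching map may spike,
    mirroring the winner-take-all reset described above (illustrative)."""
    volts = np.array([[mp1_voltage(events, x, y, s, th, r=r)
                       for th in directions] for s in speeds])
    best = np.unravel_index(np.argmax(volts), volts.shape)
    return best if volts[best] > threshold else None       # (speed index, direction index)
```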

Motion Pooling (MP2)

Spiking neurons in this layer have the same neural dynamics and connection delays as the Motion Perception layer, but a larger receptive field size r_{mp2} (r_{mp2} > r_{mp1}). A large receptive field size increases the probability that edges in different directions will be included in the receptive field [Orchard et al., 2013]. Therefore, local motion pooling mitigates the aperture problem and provides accurate motion features of event streams.

Figure 3: Visualization of the extracted motion feature spikes. Left: direction response of one left arm clockwise sample in the DvsGesture dataset. Right: speed response of one bend sample in the DailyAction-DVS dataset. The color bar shows the direction or speed represented by each color.

We employ the Leaky Integrate-and-Fire (LIF) model. The neural dynamics can be described as:

V(t) = Σ_i w_i Σ_{t_i} K(t − t_i) + V_{rest}    (4)

where w_i and t_i are the synaptic weight and the firing times of the afferent i, and V_{rest} denotes the resting potential of the neuron. K is the normalized postsynaptic potential (PSP) kernel, which is defined as follows:

K(t − t_i) = V_0 (exp(−(t − t_i)/τ_m) − exp(−(t − t_i)/τ_s))    (5)

where V_0 denotes the coefficient that normalizes the maximum value of the kernel to 1, and τ_m and τ_s denote the decay time constants of membrane integration and synaptic currents respectively.

The SPA learning algorithm aims to train the connection weights so that the neurons respond more actively to the input patterns of the class they represent. Given an input pattern of class c_k and the neuron representing class j, the weights are updated in the following way. First, we find the peak membrane voltage V^j_{peak} of the neuron representing class j and label the corresponding time stamp as t_{peak}. Second, we define the normalized output firing rate f^j_{out} of the neuron representing class j as:

f^j_{out} = log(exp(V^j_{peak}) + 1)    (6)

Third, we update the weights using the equation:

Δw_i = λ (f^j_{out})′ · (f_{sum} − f^j_{out}) / (f_{sum} f^j_{out}) · Σ_{t_i} K(t_{peak} − t_i),   if j = c_k    (7)
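The classifier dynamics of Eqs. (4)-(6) translate directly into code. In the minimal sketch below, V_0 is computed so that the kernel peak equals 1 (as stated above); the time constants, weights and spike times are toy values, and the SPA weight update of Eq. (7) and its training loop are not implemented here.

```python
import numpy as np

def psp_kernel(dt, tau_m=20.0, tau_s=5.0):
    """Normalized PSP kernel of Eq. (5); dt = t - t_i, in the same time unit as the taus."""
    t_peak = (tau_m * tau_s / (tau_m - tau_s)) * np.log(tau_m / tau_s)
    v0 = 1.0 / (np.exp(-t_peak / tau_m) - np.exp(-t_peak / tau_s))  # scales the peak to 1
    dt = np.asarray(dt, dtype=float)
    k = v0 * (np.exp(-dt / tau_m) - np.exp(-dt / tau_s))
    return np.where(dt >= 0, k, 0.0)        # causal: no contribution before the spike

def lif_voltage(t, spike_times, weights, v_rest=0.0, tau_m=20.0, tau_s=5.0):
    """Membrane voltage of Eq. (4): sum over afferents i and their spike times t_i."""
    v = v_rest
    for w_i, times_i in zip(weights, spike_times):
        v += w_i * psp_kernel(t - np.asarray(times_i), tau_m, tau_s).sum()
    return v

def normalized_firing_rate(v_peak):
    """Softplus readout of Eq. (6): f_out = log(exp(V_peak) + 1)."""
    return np.log(np.exp(v_peak) + 1.0)

# Toy usage: one output neuron with two afferents.
weights = [0.8, 0.3]
spike_times = [[5.0, 12.0], [7.0]]
ts = np.linspace(0.0, 60.0, 601)
voltages = np.array([lif_voltage(t, spike_times, weights) for t in ts])
v_peak, t_peak = voltages.max(), ts[voltages.argmax()]
print(t_peak, normalized_firing_rate(v_peak))
```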


Figure 4: Sample snapshots from the datasets used in this work. Black and white pixels represent OFF and ON events, respectively. (a) Our proposed DailyAction-DVS dataset; (b) DvsGesture dataset; (c) Action Recognition dataset.

Method                                               DailyAction-DVS   DvsGesture   Action Recognition
[Amir et al., 2017]           Deep SNN (16 layers)   -                 91.8%        -
[Shrestha and Orchard, 2018]  Deep SNN (8 layers)    -                 93.6%        -
[Gu et al., 2019]             Deep SNN (6 layers)    -                 -            71.2%
[Xiao et al., 2019]           HMAX-based SNN         68.3%             -            55.0%
[Xing et al., 2020]           Conv-RNN SNN (5 layers) -                92.0%        -
[George et al., 2020]         Conv+Reservoir SNN     -                 65.0%        -
[Liu et al., 2020b]           HMAX-based SNN         76.9%             70.1%        -
This Work                     Motion-based SNN       90.3%             92.7%        78.1%

Table 1: Comparison of recognition accuracy on three datasets.

down, stand up, walk and pick up. A DVS camera [Lichtsteiner et al., 2008] was placed at 2 positions that had different distances and angles to the subject, as shown in Figure 5. The actions were captured under 2 lighting conditions, including natural light and LED light. Each subject performed each action under the same camera position and lighting condition. The duration of each recording is within 6 s.

The actions we choose in this dataset are common in daily life, thus we call it DailyAction-DVS. With the diverse recording conditions, the dataset gains more practical significance. Notice that some actions are potentially dangerous (such as fall down and climb), so this dataset may also contribute to detecting whether someone in the environment is in danger.

DvsGesture dataset [Amir et al., 2017]: It comprises 1342 recordings of 29 subjects performing 11 different actions (including one rejected class with random gestures) under 3 different lighting conditions. The DVS camera [Lichtsteiner et al., 2008] was mounted on a stand while the subjects stood still in front of it performing gestures.

Action Recognition dataset [Miao et al., 2019]: It has released 291 effective recordings of 15 subjects acting 10 different actions. The DAVIS camera [Brandli et al., 2014] was set to 3 positions relative to the subject for recording each action.

Figure 5: Environmental setup of the DailyAction-DVS dataset.

4.2 Performance on Event-based Action Recognition Tasks

On DailyAction-DVS Dataset

Table 1 shows that our model achieves a recognition accuracy of 90.3% on average. Although the diverse recording conditions of this dataset increase the difficulty of recognition, our model outperforms [Xiao et al., 2019] and [Liu et al., 2020b] by 22.0% and 13.4%, respectively. This suggests that our model is robust against these environmental variances and can be generalized to more realistic scenarios. Notice that [Xiao et al., 2019], [Liu et al., 2020b] and our model all employ a single-layer classifier, but the adopted features are different. The results indicate that the motion features used in our model are more effective than the HMAX-based shape features used in [Xiao et al., 2019] and [Liu et al., 2020b]. This suggests that in the action recognition task, how the action moves is potentially more important than what the action looks like.

On DvsGesture Dataset

Our model achieves a recognition accuracy of 92.7% on average. Table 1 shows that our model outperforms [George et al., 2020] and [Liu et al., 2020b] by a large margin of 27.7% and 22.6%, respectively. In addition, our model (with one trained layer) also achieves better performance than some deep SNNs. The recognition accuracy of our model is 0.9% and 0.7% higher than [Amir et al., 2017] (with 16 trained layers) and [Xing et al., 2020] (with 5 trained layers). This indicates that the motion features extracted in our model are representative enough to compare with those of deep SNNs. Moreover, they are more lightweight, making our model more efficient in computation.


On Action Recognition Dataset

The recognition task on this dataset is relatively more challenging because of its limited training samples (about 5 times fewer than the other two datasets); therefore, the recognition results of all the compared methods on this dataset are relatively poor. Nevertheless, our model achieves a recognition accuracy of 78.1%, which is higher than [Gu et al., 2019] and [Xiao et al., 2019]. Notice that [Gu et al., 2019] reaches a lower accuracy of 71.2% with a 6-layer fully connected SNN. The reason is that [Gu et al., 2019] is not designed for event-based data, which has high temporal resolution and a sparse representation. Thus, it loses the precise temporal information which is critical for event-based action recognition.

4.3 Ablation Study of Proposed SNN

In this section, we present the ablation study of our proposed SNN on the DailyAction-DVS and DvsGesture datasets. Based on the full model, we bypass the Spatial Pooling layer (SP), the Motion Pooling layer (MP2), and the Motion Perception and Pooling layers (MP1&2) to verify their effects on the original architecture. Table 2 reports the recognition accuracy and the number of activated synapses in each layer under different settings.

We can observe from Table 2 that for the setting of our proposed SNN bypassing the SP layer, the recognition accuracy on the two datasets drops by 0.7% and 0.8% respectively, meanwhile the required computation increases to nearly three times. This indicates that the SP layer contributes to a more compact feature format, which is beneficial to both accuracy and computation. We can observe a similar phenomenon when bypassing the MP2 layer. For example, on the DailyAction-DVS dataset, the average number of activated synapses increases to 1.3 and 1.6 times for SP and GP respectively. This indicates that the MP2 layer filters out redundant and erroneous motion feature spikes. For the setting of our proposed SNN bypassing the MP1 and MP2 layers, a minimum amount of computation is required since there are no motion-sensitive neurons used to extract motion information. However, the recognition accuracy drops significantly, which in turn indicates the effectiveness of motion information for action recognition.

Task             Acc.    Input Synapse Activations
                         MP1     MP2     SP      GP
DailyAction-DVS
Full model       90.3%   8.0k    4.0k    3.2k    229.5k
Bypass SP        89.6%   8.0k    4.0k    -       610.4k
Bypass MP2       90.0%   8.0k    -       4.0k    364.3k
Bypass MP1&2     58.1%   -       -       8.0k    216.6k
DvsGesture
Full model       92.7%   8.8k    2.7k    2.8k    313.3k
Bypass SP        91.9%   8.8k    2.7k    -       913.6k
Bypass MP2       92.1%   8.8k    -       2.7k    415.5k
Bypass MP1&2     57.0%   -       -       8.8k    157.8k

Table 2: Recognition accuracy and required computation.

4.4 Effectiveness of Motion Information for Action Recognition

In this section, we validate the effectiveness of our extracted motion features for event-based action recognition. We compare our motion-based features with the shape-based features, which have been widely adopted in previous works [Orchard et al., 2015; Xiao et al., 2019; Liu et al., 2020b], and employ the SPA classifier [Liu et al., 2020b] for both kinds of features. The experiments are conducted on the DailyAction-DVS dataset. We use the full recordings (within 6000 ms) for training and observe the performance of each method within the first 1000 ms of the recordings.

Figure 6: Performance comparison between two kinds of features on the DailyAction-DVS dataset.

As the events in a recording flow in, the recognition accuracy of both features keeps increasing. Nevertheless, the motion-based method has a higher performance than the shape-based one. We also notice that within the first 400 ms, when the input information is extremely incomplete for describing the actions, our motion-based method also has a higher growth rate and a smaller variance. The results indicate that our extracted motion features are more representative of actions, compared to HMAX-based shape features. Therefore, we conclude that motion information is more effective for action recognition.

5 Conclusion

In this paper, we propose a SNN architecture for event-based action recognition, which utilizes motion information hierarchically from events to local and finally to global perception. The ablation study validates our proposed SNN architecture. This work is the first attempt of SNN to apply motion information to event-based action recognition. Experimental results show that this attempt is not only feasible but also effective.

Acknowledgments

This work is supported by the Natural Science Foundation of China (No. 61925603), the Key Research and Development Program of Zhejiang Province in China (2020C03004) and Zhejiang Lab. Qianhui Liu is partly supported by the Zhejiang Lab's International Talent Fund for Young Professionals.


References

[Amir et al., 2017] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, et al. A low power, fully event-based gesture recognition system. In CVPR, pages 7243–7252, 2017.
[Benosman et al., 2012] Ryad Benosman, Sio-Hoi Ieng, Charles Clercq, Chiara Bartolozzi, and Mandyam Srinivasan. Asynchronous frameless event-based optical flow. Neural Networks, 27:32–37, 2012.
[Brandli et al., 2014] Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014.
[George et al., 2020] Arun M George, Dighanchal Banerjee, Sounak Dey, Arijit Mukherjee, and P Balamurali. A reservoir-based convolutional spiking neural network for gesture recognition from DVS input. In IJCNN, pages 1–9. IEEE, 2020.
[Gu et al., 2019] Pengjie Gu, Rong Xiao, Gang Pan, and Huajin Tang. STCA: Spatio-temporal credit assignment with delayed feedback in deep spiking neural networks. In IJCAI, pages 1366–1372, 2019.
[Hu et al., 2016] Yuhuang Hu, Hongjie Liu, Michael Pfeiffer, and Tobi Delbruck. DVS benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience, 10:405, 2016.
[Hu et al., 2018] Yangfan Hu, Huajin Tang, Yueming Wang, and Gang Pan. Spiking deep residual network. arXiv preprint arXiv:1805.01352, 2018.
[Innocenti et al., 2020] Simone Undri Innocenti, Federico Becattini, Federico Pernici, and Alberto Del Bimbo. Temporal binary representation for event-based action recognition. arXiv preprint arXiv:2010.08946, 2020.
[Jhuang et al., 2007] Hueihan Jhuang, Thomas Serre, Lior Wolf, and Tomaso Poggio. A biologically inspired system for action recognition. In ICCV, pages 1–8. IEEE, 2007.
[Lagorce et al., 2017] Xavier Lagorce, Garrick Orchard, Francesco Galluppi, Bertram E Shi, and Ryad B Benosman. HOTS: A hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1346–1359, 2017.
[Lichtsteiner et al., 2008] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
[Liu et al., 2020a] Qianhui Liu, Gang Pan, Haibo Ruan, Dong Xing, Qi Xu, and Huajin Tang. Unsupervised AER object recognition based on multiscale spatio-temporal features and spiking neurons. IEEE Trans. Neural Networks Learn. Syst., 31(12):5300–5311, 2020.
[Liu et al., 2020b] Qianhui Liu, Haibo Ruan, Dong Xing, Huajin Tang, and Gang Pan. Effective AER object classification using segmented probability-maximization learning in spiking neural networks. In AAAI, pages 1308–1315, 2020.
[Maro et al., 2020] Jean-Matthieu Maro, Sio-Hoi Ieng, and Ryad Benosman. Event-based gesture recognition with dynamic background suppression using smartphone computational capabilities. Frontiers in Neuroscience, 14:275, 2020.
[Miao et al., 2019] Shu Miao, Guang Chen, Xiangyu Ning, Yang Zi, Kejia Ren, Zhenshan Bing, and Alois Knoll. Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection. Frontiers in Neurorobotics, 13:38, 2019.
[Orchard et al., 2013] Garrick Orchard, Ryad Benosman, Ralph Etienne-Cummings, and Nitish V Thakor. A spiking neural network architecture for visual motion estimation. In 2013 IEEE Biomedical Circuits and Systems Conference (BioCAS), pages 298–301. IEEE, 2013.
[Orchard et al., 2015] Garrick Orchard, Cedric Meyer, Ralph Etienne-Cummings, Christoph Posch, Nitish Thakor, and Ryad Benosman. HFirst: A temporal approach to object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(10):2028–2040, 2015.
[Paredes-Vallés et al., 2019] Federico Paredes-Vallés, Kirk Yannick Willehm Scheper, and Guido Cornelis Henricus Eugene De Croon. Unsupervised learning of a hierarchical spiking neural network for optical flow estimation: From events to global motion perception. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[Posch et al., 2011] Christoph Posch, Daniel Matolin, and Rainer Wohlgenannt. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE Journal of Solid-State Circuits, 46(1):259–275, 2011.
[Ramesh et al., 2019] Bharath Ramesh, Hong Yang, Garrick Michael Orchard, Ngoc Anh Le Thi, Shihao Zhang, and Cheng Xiang. DART: Distribution aware retinal transform for event-based cameras. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[Shrestha and Orchard, 2018] Sumit Bam Shrestha and Garrick Orchard. SLAYER: Spike layer error reassignment in time. In NeurIPS, pages 1412–1421, 2018.
[Sironi et al., 2018] Amos Sironi, Manuele Brambilla, Nicolas Bourdis, Xavier Lagorce, and Ryad Benosman. HATS: Histograms of averaged time surfaces for robust event-based object classification. In CVPR, pages 1731–1740, 2018.
[Xiao et al., 2019] Rong Xiao, Huajin Tang, Yuhao Ma, Rui Yan, and Garrick Orchard. An event-driven categorization model for AER image sensors using multispike encoding and learning. IEEE Trans. Neural Networks Learn. Syst., 2019.
[Xing et al., 2020] Yannan Xing, Gaetano Di Caterina, and John Soraghan. A new spiking convolutional recurrent neural network (SCRNN) with applications to event-based hand gesture recognition. Frontiers in Neuroscience, 14, 2020.
