Recurrent Neural Networks for Person Re-identification Revisited
Jean-Baptiste Boin (Stanford University, Stanford, CA, U.S.A.), André Araujo (Google AI, Mountain View, CA, U.S.A.), Bernd Girod (Stanford University, Stanford, CA, U.S.A.)
[email protected]  [email protected]  [email protected]

Abstract

The task of person re-identification has recently received rising attention due to the high performance achieved by new methods based on deep learning. In particular, in the context of video-based re-identification, many state-of-the-art works have explored the use of Recurrent Neural Networks (RNNs) to process input sequences. In this work, we revisit this tool by deriving an approximation which reveals the small effect of recurrent connections, leading to a much simpler feed-forward architecture. Using the same parameters as the recurrent version, our proposed feed-forward architecture obtains very similar accuracy. More importantly, our model can be combined with a new training process to significantly improve re-identification performance. Our experiments demonstrate that the proposed models converge substantially faster than recurrent ones, with accuracy improvements of up to 5% on two datasets. The performance achieved is better than or on par with other RNN-based person re-identification techniques.

1. Introduction

Person re-identification consists of associating different tracks of a person as they are captured across a scene by different cameras, which is useful for video surveillance or crowd dynamics understanding. The challenges inherent to this task are the variations in background, body pose, illumination and viewpoint. It is important to represent a person using a descriptor that is as robust as possible to these variations, while still being discriminative enough to be characteristic of a single person's identity. A sub-class of this problem is video-based re-identification, where the goal is to match a video of a person against a gallery of videos captured by different cameras, as opposed to image-based (or single-shot) re-identification.

Person re-identification has recently received rising attention due to the much improved performance achieved by methods based on deep learning. For video-based re-identification, it has been shown that representing videos by aggregating visual information across the temporal dimension is particularly effective. Recurrent Neural Networks (RNNs) have shown promising results for performing this aggregation in multiple independent works [12, 18, 15, 25, 20, 17, 2]. In this paper, we analyze one type of architecture that uses RNNs for video representation. Our contributions are the following. We show that the recurrent network architecture can be replaced with a simpler non-recurrent architecture without sacrificing performance. Not only does this lower the complexity of the forward pass through the network, making the feature extraction easier to parallelize, but we also show that this model can be trained with an improved process that boosts the final performance while converging substantially faster. Finally, we obtain results that are on par with or better than other published work based on RNNs, but with a much simpler technique.

2. Related work

The majority of the traditional approaches to image-based person re-identification follow a two-step strategy. The first step is feature representation, which aims at representing an input in a way that is as robust as possible to variations in illumination, pose and viewpoint. Common techniques include SIFT, used in [21]; SILTP, used in [10]; Local Binary Patterns, used in [16, 18]; and color histograms, used in [10, 16, 21, 18]. This step is followed by metric learning, which transforms the features in a way that maximizes intra-class similarities while minimizing inter-class similarities. Some metric learning algorithms specifically introduced for person re-identification are XQDA [10], LFDA [13], and its kernelized version k-LFDA [16]. See [24] for a more detailed survey of these techniques.

More recently, with the successes of deep learning in computer vision as well as the release of larger datasets for re-identification (VIPeR [5], CUHK03 [9], Market-1501 [23]), this field has shifted more and more towards neural networks. In particular, the Siamese network architecture provides a straightforward way to simultaneously tackle the tasks of feature representation and metric learning in a unified end-to-end system. This architecture was introduced to the field of person re-identification by the pioneering works of [19] and [9]. This powerful tool can learn an embedding where inputs corresponding to the same class (or identity) are closer to each other than inputs corresponding to different classes. Unlike classification approaches, it has the added benefit that it can be used even if a low number of images is available per class (such as a single pair of images). Variants of the Siamese network have been used for re-identification: [1] achieved very good results by complementing the Siamese architecture with a layer that computes neighborhood differences across the inputs, and instead of using pairs of images as inputs, [3] uses triplets of images whose representations are optimized with a triplet loss.

Although less explored, video-based re-identification has followed a similar path, since many techniques from image-based re-identification are applicable, ranging from low-level hand-crafted features [11, 14] to deep learning, made possible by the release of large datasets (PRID2011 [7], iLIDS-VID [14], MARS [22]). In order to represent a video sequence, most works consider some form of pooling that aggregates frame features into a single vector representing the video. Some approaches such as [22] do not explicitly make use of the temporal information, but other works have shown promising results when learning spatio-temporal features. In particular, [12, 18, 15] all propose to use Recurrent Neural Networks (RNNs) to aggregate the temporal information across the duration of the video, and [25] showed promising results by combining an RNN-based temporal attention model with a spatial attention model.

In this work, we will focus on [12], which directly inspired more recent papers that built upon it: [20] replaces the RNN with a bi-directional RNN; [17] computes the frame-level features with an extra spatial pyramid pooling layer to generate a multi-scale spatial representation, and aggregates these features with an RNN and a more complex attentive temporal pooling algorithm; [2] aggregates the features at the output of the RNN with the frame-level features used as the input to the RNN, and also processes the upper-body, lower-body and full-body sequences separately, with late fusion of the three sequence descriptors. All propose a more complex system compared to [12], but showed improved performance.

3. Proposed framework

3.1. General architecture of the network

In video-based person re-identification, a query video of a person is matched against a gallery of videos, either by using a multi-match strategy, where frame descriptors from the query and gallery videos are compared pairwise, or by a single-match strategy, where the information is first pooled across frames to represent each video with a single descriptor. The latter strategies have slowly superseded the former ones and have proved more successful and efficient.

In order to extract a fixed-length one-dimensional descriptor from a variable-length sequence of images, we introduce a parametric model called the feature extraction network. The general architecture of that network is shown in Fig. 1: it is made up of three distinct stages, namely frame feature extraction, sequence processing and temporal pooling. This multiple-stage architecture is considered because it is a good generalization of the systems used for video representation in the related works [12] (as well as its extensions [20, 17, 2]), [18] and [15].

Figure 1. General architecture of the feature extraction network: per-frame feature extraction, followed by sequence processing, followed by temporal pooling into a single sequence feature.

For each frame of the input video, the first stage (frame feature extraction) independently extracts a descriptor of dimension d1 that should capture the appearance information to be aggregated across the sequence. The second stage (sequence processing) takes the sequence of frame descriptors and outputs a (different) sequence of fixed-length vectors of dimension d2 (it is possible that d2 ≠ d1). In general, this stage mixes the information contained in all the frames of the sequence. One example would be to compute the pairwise differences between consecutive frame descriptors, but this stage can perform more complex operations. Last, the temporal pooling stage outputs a single fixed-length vector from the variable-length sequence at the output of the previous stage. Note that in the general formulation this temporal pooling could depend on the order of the input sequence, but in practice much simpler strategies are used: [12] only considers max and average pooling, showing that average pooling generally outperforms max pooling, and [18] and [15] both use average pooling only. Other works explored more complex temporal pooling strategies [17, 25].

Here we will only focus on the specific architecture used in [12], which has shown great success. The frame feature extraction stage is a convolutional neural network (CNN), the sequence processing stage is a recurrent neural network (RNN), and the temporal pooling is performed by averaging the outputs at each time step. The full network is trained using the Siamese architecture framework.

We call f^(t) ∈ R^{d1} (resp. o^(t) ∈ R^{d2}) the inputs (resp. outputs) of the sequence processing stage, t = 1, ..., T. The output at each time step is given by the RNN equations:

    o^(t) = W_i f^(t) + W_s r^(t-1)    (1)

where r^(t-1) = tanh(o^(t-1)) is obtained from the previous output.

The last output of the RNN technically contains information from all the time steps of the input sequence, so it could be used to represent that sequence (the temporal pooling stage would then just ignore o^(t) for t ≤ T - 1 and directly output o^(T)). However, in practice we observe that the contribution of a given input to each of the time steps of the output of an RNN decreases over time, as the information of the earlier time steps gets diluted and the later time steps dominate. This means that the value of o^(T) is much more strongly dependent on f^(T) than on f^(1). This is not suitable when designing a sequence descriptor that should be as representative of the start of the sequence as of its end. In order to prevent this phenomenon, [12] shows that a good choice is to use average pooling as the temporal pooling stage. A sequence can thus be represented as:

    v_s = (1/T) sum_{t=1}^{T} o^(t)    (2)
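To make this concrete, the following is a minimal PyTorch-style sketch of the sequence processing and temporal pooling stages defined by (1) and (2). It is an illustrative reimplementation, not the code released with [12]; the module and variable names are placeholders, the frame feature extraction CNN is omitted, and the initial state is assumed to be zero.

```python
# Sketch of the recurrent sequence processing stage (Eq. 1) followed by
# average temporal pooling (Eq. 2). Illustrative only, not the authors' code.
import torch
import torch.nn as nn

class RecurrentSequenceStage(nn.Module):
    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.W_i = nn.Linear(d1, d2)  # input projection W_i (with bias b_i)
        self.W_s = nn.Linear(d2, d2)  # recurrent projection W_s (with bias b_s)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (T, d1) frame descriptors of one video, in temporal order.
        T, _ = f.shape
        o_prev = f.new_zeros(self.W_s.out_features)  # assume o^(0) = 0
        outputs = []
        for t in range(T):
            r_prev = torch.tanh(o_prev)              # r^(t-1) = tanh(o^(t-1))
            o_t = self.W_i(f[t]) + self.W_s(r_prev)  # Eq. (1)
            outputs.append(o_t)
            o_prev = o_t
        return torch.stack(outputs).mean(dim=0)      # Eq. (2): v_s, shape (d2,)
```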

3.2. Proposed feed-forward approximation

This limitation of RNNs regarding long-term dependencies has been widely studied and is related to the problem of vanishing gradients. Motivated by this phenomenon, we propose to approximate the RNN by considering only a one-step dependency. Under this approximation, the effect of the inputs f^(1), ..., f^(t-2) on o^(t) is in general negligible for a trained RNN. In other words, o^(t) may only depend on the previous and the current time steps, f^(t-1) and f^(t). Hence, (1) can be rewritten as:

    o^(t) = W_i f^(t) + W_s tanh(o^(t-1))
          = W_i f^(t) + W_s tanh(W_i f^(t-1) + W_s r^(t-2))
          ≈ W_i f^(t) + W_s tanh(W_i f^(t-1))

The appearance descriptor of the sequence from (2) can now be approximated as:

    v_s ≈ (1/T) sum_{t=1}^{T} [ W_i f^(t) + W_s tanh(W_i f^(t-1)) ]
        = (1/T) sum_{t=1}^{T} W_i f^(t) + (1/T) sum_{t=0}^{T-1} W_s tanh(W_i f^(t))
        = (1/T) sum_{t=1}^{T} [ W_i f^(t) + W_s tanh(W_i f^(t)) ] - (1/T) W_s tanh(W_i f^(T))

(the t = 0 term vanishes under the convention that the initial state of the RNN is zero). For large enough values of T, the contribution of the last term is small compared to the sum, so it can be neglected. We now introduce õ^(t) = W_i f^(t) + W_s tanh(W_i f^(t)), which only depends on f^(t). Under the assumptions that were made, we can rewrite:

    v_s ≈ (1/T) sum_{t=1}^{T} õ^(t)    (3)

In other words, the sequence processing stage can be approximated by a simpler non-recurrent network if the RNN is swapped with the feed-forward network shown in Fig. 2. From now on, this architecture of the sequence processing stage will be referred to as FNN (feed-forward neural network). It is important to note that both architectures have the same number of parameters, so their memory footprint is the same.

Figure 2. Architecture of the sequence processing stage. Left: the recurrent architecture used as a baseline (RNN). Right: our non-recurrent architecture (FNN). The blocks labeled W_i and W_s correspond to fully connected layers.

In order to keep the equations simple, the bias terms were omitted for clarity, but a very similar approximation can be derived when they are taken into consideration, using the full equation:

    o^(t) = W_i f^(t) + b_i + W_s tanh(o^(t-1)) + b_s    (4)

In this case, we still ignore the recurrent part when substituting o^(t-1) with its expression, o^(t) ≈ W_i f^(t) + b_i + W_s tanh(W_i f^(t-1) + b_i) + b_s, and we then perform the second approximation, which again gives us equation (3) with õ^(t) = W_i f^(t) + b_i + W_s tanh(W_i f^(t) + b_i) + b_s.

It is interesting to notice that our proposed FNN architecture contains a shortcut connection, reminiscent of a ResNet block, as introduced in [6]. In fact, preliminary experiments showed that this residual connection is critical to the success of the training of this network, and removing it causes a considerable drop in re-identification performance.
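As an illustration of (3), the sketch below implements the FNN version of the sequence processing stage from Fig. 2 (right). Again, this is an illustrative reimplementation with placeholder names, not the original code: every frame is mapped independently, so a whole sequence or a single frame can be processed in one batched pass, and the shortcut connection around the W_s branch is explicit.

```python
# Sketch of the feed-forward (FNN) sequence processing stage:
# o~^(t) = W_i f^(t) + W_s tanh(W_i f^(t)), followed by average pooling (Eq. 3).
import torch
import torch.nn as nn

class FeedForwardSequenceStage(nn.Module):
    def __init__(self, d1: int, d2: int):
        super().__init__()
        self.W_i = nn.Linear(d1, d2)  # same parameter shapes as the RNN version
        self.W_s = nn.Linear(d2, d2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (T, d1); T = 1 (a single frame) is allowed, unlike for the RNN.
        x = self.W_i(f)                     # W_i f^(t) + b_i, shape (T, d2)
        o = x + self.W_s(torch.tanh(x))     # shortcut connection (ResNet-like)
        return o.mean(dim=0)                # Eq. (3): v_s, shape (d2,)
```

Because the parameter shapes are identical to those of the recurrent version, the weights of a trained RNN stage can be copied directly into this module (for instance with fnn.W_i.load_state_dict(rnn.W_i.state_dict()) and likewise for W_s), which is the substitution used in the experiment of Section 4.2.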

3.3. Improved training pipeline

Compared to the RNN, our proposed FNN architecture keeps the same inputs and outputs, so it can be trained just like the RNN by using a Siamese network. The loss that is used is unchanged: a combination of a contrastive loss for a pair of inputs and of an identification loss for each input of the pair.

It is impossible to train an RNN with input sequences of length 1. However, by removing the time dependency, our reformulation of the sequence processing stage as an FNN removes this constraint and allows us to process individual frames as sequences of length 1. Computing the representation of a sequence of length L requires about the same amount of computation time and memory as computing the representations of L individual frames.

We denote by B the size of our mini-batch when training the RNN, and by L the length of the sequences used for training. When training in sequence mode (denoted SEQ), within a mini-batch it is required to load 2BL images (the factor 2 is due to the Siamese architecture) from up to 2B distinct sequences. If individual frames are used as inputs for training instead (frame mode, denoted FRM), it is possible to load and process BL pairs of images with roughly the same memory requirement. This means that a FRM mini-batch is much more diverse, since it can now include images from up to 2BL distinct sequences. The two modes are illustrated in Fig. 3.

Figure 3. Illustration of the sampling strategy for the two training modes: SEQ (sequences as inputs) and FRM (individual frames as inputs). The dataset is represented in black: each row represents the two videos available for a given person. We consider a mini-batch with the settings B = 1 and L = 4. The frames loaded in the mini-batch are the ones with the colored outline. The linked sequences or frames correspond to the pairs used as inputs to the Siamese network. Positive pairs (same identity) are shown in orange; negative pairs (different identity) in blue. SEQ only samples adjacent frames from a maximum of 2B = 2 distinct sequences, while FRM can sample unconstrained frames from up to 2BL = 8 sequences.
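The sampling difference between the two modes can be sketched as follows. This snippet is illustrative only and assumes the training data is stored as a dictionary mapping each identity to its two camera sequences (lists of frames); it is not the data loader used in the experiments.

```python
# Sketch of mini-batch sampling for the SEQ and FRM training modes (Fig. 3).
# data: {identity: (frames_cam_a, frames_cam_b)}, each value a list of frames.
import random

def seq_minibatch(data, B=1, L=16, positive=True):
    """SEQ: B pairs of L consecutive frames -> 2*B*L images from <= 2B sequences."""
    ids = list(data)
    pairs = []
    for _ in range(B):
        id_a = random.choice(ids)
        id_b = id_a if positive else random.choice([i for i in ids if i != id_a])
        cam_a, cam_b = data[id_a][0], data[id_b][1]
        sa = random.randrange(max(1, len(cam_a) - L + 1))
        sb = random.randrange(max(1, len(cam_b) - L + 1))
        pairs.append((cam_a[sa:sa + L], cam_b[sb:sb + L], positive))
    return pairs

def frm_minibatch(data, BL=16):
    """FRM: BL single-frame pairs (half positive) from up to 2*BL sequences."""
    ids = list(data)
    pairs = []
    for k in range(BL):
        positive = k < BL // 2
        id_a = random.choice(ids)
        id_b = id_a if positive else random.choice([i for i in ids if i != id_a])
        pairs.append((random.choice(data[id_a][0]),
                      random.choice(data[id_b][1]),
                      positive))
    return pairs
```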
4. Experiments

4.1. Datasets and experimental protocol

Datasets. We compare the network architectures and training modes on two datasets: the PRID2011 [7] and the iLIDS-VID [14] datasets. Both datasets were created from videos of pedestrians observed in two non-overlapping camera views, and the matching identities in both views are provided. There are 200 such identities (pairs of video sequences) in the PRID2011 dataset, and 300 in the iLIDS-VID dataset. Note that the PRID2011 dataset contains additional distractor videos of people appearing in only one of the views. We follow the standard testing protocol for this dataset as used in [14] and ignore these distractor videos. Both datasets have comparable video sequence lengths: they range from 5 to 675 frames with an average of 100 frames for the PRID2011 dataset, and from 22 to 192 frames with an average of 71 frames for the iLIDS-VID dataset. We follow the usual testing protocol for these datasets [14]: we randomly split the data into training and test sets so that each set contains a distinct half of the identities, and we repeat the training and evaluation for 20 trials.

Raw input data. The data used for all experiments is the color and optical flow data (horizontal and vertical), so in practice each frame of a video is represented as a 5-channel image. The optical flow was extracted as described in [12].

Evaluation metric. As is common practice for these datasets, the reported metric is the Cumulated Matching Characteristics (CMC) curve, averaged over all trials.

Dropout. In the RNN architecture introduced by [12], two dropout layers (with dropout probability 0.6) are inserted before the inputs of the first and second fully connected layers W_i and W_s pictured in Fig. 2. In order to keep the FNN architecture as similar as possible, dropout layers with the same probability are inserted at these locations.
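For reference, the CMC metric mentioned above can be computed as in the simplified sketch below. It assumes the standard single-shot test protocol for PRID2011 and iLIDS-VID (one gallery sequence per identity, gallery index i being the correct match for query i); the exact evaluation code of the compared works may differ.

```python
# Sketch of Cumulated Matching Characteristics (CMC) computation.
import numpy as np

def cmc_curve(dist, max_rank=30):
    # dist[i, j]: distance between query descriptor i and gallery descriptor j,
    # with gallery j = i being the correct match (single-shot protocol).
    order = np.argsort(dist, axis=1)                    # gallery ranked per query
    hits = order == np.arange(dist.shape[0])[:, None]   # True where correct match
    match_rank = hits.argmax(axis=1)                    # 0-based rank of the match
    return np.array([100.0 * np.mean(match_rank < r)    # CMC in %, ranks 1..max_rank
                     for r in range(1, max_rank + 1)])
```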
4.2. Influence of the recurrent connection

In this section, we investigate the validity of the simplifying assumptions made in Section 3.2 by showing that the approximated descriptors defined in equation (3) yield a similar performance to the original descriptors.

For this experiment, we first train a network with our baseline architecture (where the sequence processing stage is an RNN) and then substitute the RNN with an FNN architecture, without changing the values of the weights. This is possible because there is a natural mapping between the parameters of the two architectures (see Fig. 2).

The performance of the two architectures can be compared by evaluating them on the same test set. We show the results on the PRID2011 dataset across 20 trials in Fig. 4. The first plot shows the average CMC value (along with 95% confidence intervals) for both the RNN and FNN architectures, while the second plot shows the distribution of the difference between the CMC values of the two architectures. The average CMC curves for both architectures are very similar, and the difference in the CMC values at each trial is usually very close to 0, with 0 being included in the 95% confidence intervals for all rank values. If the same weights are used, there is no value of the rank where one architecture consistently outperforms the other in a statistically significant manner. This means that, out of the box, removing the recurrent connection does not have any measurable effect on the performance.

Figure 4. Performance of networks with two different architectures for the sequence processing stage (RNN and FNN) using the same weights. The weights are trained using the RNN architecture. Results are presented on the PRID2011 dataset (left: CMC (%) vs. rank; right: distribution of the difference FNN - RNN).

This is a critical result because it challenges the assumptions that are typically made regarding the reasons why a recurrent architecture gives improved performance. It is commonly assumed that the structure of the RNN allows for non-trivial temporal processing of an ordered input sequence, but what we show here is that the performance can be replicated with a non-recurrent network that processes inputs individually and just aggregates them naively, without taking into account the temporal relationship between the frames. Surprisingly, the ordering of the sequence of frame features does not matter.

4.3. Comparison of training modes and architectures

In this section we provide quantitative evidence that training the feature extraction network without the recurrent connection in the sequence processing stage can boost the re-identification performance (accuracy as well as speed).

4.3.1. Training implementation

Our proposed architecture (FNN) is trained using either of the two training modes described in Section 3.3. The hyperparameters have to be carefully chosen to keep the comparison between the modes fair.

SEQ. For the first mode, called SEQ, we use the hyperparameter values reported in [12]. The feature dimension is set to 128 and the margin in the contrastive loss function to 2. The inputs are extracted from the full sequences by selecting random subsequences of L = 16 consecutive frames. At training time, the inputs are loaded in a mini-batch of size B = 1. We alternate between positive and negative pairs of sequences. The network is trained using stochastic gradient descent with a learning rate of 1e-3 for 1000 epochs. In this case, an epoch is defined as the number of iterations it takes to show N positive examples (N being the number of identities, or pairs of sequences, in the training set). This ensures that all identities are shown as positive pairs exactly once per epoch. The negative pairs are chosen randomly.

FRM. The FRM training mode, in contrast to SEQ, relies on single frames as inputs. This mode uses the same values as SEQ for the feature dimension and the margin of the contrastive loss. FRM enables training with much lower memory and computational requirements, since the processed input is a pair of images instead of a pair of video sequences. Therefore, larger mini-batches can be used. A mini-batch of size BL requires the same resources as a mini-batch of size B for SEQ, so in practice the comparison between the modes is fair if each FRM mini-batch contains BL = 16 pairs of images. These pairs are split in half between positive and negative pairs (8 of each). The positive (resp. negative) pairs are extracted in a similar fashion as in SEQ: one frame from each of the two camera sequences corresponding to matching (resp. non-matching) identities is selected. In SEQ, the number of iterations required per epoch is 2N/B (2N pairs need to be shown at a rate of B per mini-batch), but in FRM this number is only 2N/(BL). So in order to show the same amount of data (same number of mini-batches / iterations for SEQ and FRM), it is necessary to train for L times more epochs in FRM mode. In practice, this means that this network is trained for 16000 epochs.
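The training objective used in both modes, a contrastive loss on the pair of descriptors plus an identification (softmax) loss on each input, can be sketched as below. The exact formulation and weighting follow [12] and may differ in detail from this version, which is only meant to show how the margin of 2 enters the loss.

```python
# Sketch of the Siamese training loss: contrastive term on the descriptor
# pair (margin = 2) plus an identification cross-entropy term per input.
import torch
import torch.nn.functional as F

def siamese_loss(v1, v2, same, logits1, logits2, ids1, ids2, margin=2.0):
    # v1, v2: (N, d) descriptors of the two inputs of each pair.
    # same:   (N,) float tensor, 1.0 for positive pairs, 0.0 for negative pairs.
    # logits*: (N, num_identities) classifier outputs; ids*: (N,) identity labels.
    d = F.pairwise_distance(v1, v2)
    contrastive = (same * d.pow(2) +
                   (1.0 - same) * F.relu(margin - d).pow(2)).mean()
    identification = F.cross_entropy(logits1, ids1) + F.cross_entropy(logits2, ids2)
    return contrastive + identification
```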
It is important to adjust the learning rate in FRM mode. A rule of thumb is that when the batch size increases by a factor k, the learning rate should increase by a factor √k [8] or k [8, 4]. When switching to FRM, the batch size is increased by a factor L = 16, so it makes sense to increase the learning rate to some value in the range [4e-3, 16e-3]. We found that a good value was 16e-3 for the PRID2011 dataset and 8e-3 for the iLIDS-VID dataset. Finally, for both training conditions, the input data is diversified by performing data augmentation: random cropping and horizontal mirroring are applied to the inputs.

4.3.2. Training with SEQ mode

In this section we compare how the RNN and FNN architectures perform when trained using the SEQ training mode. Results are reported in Fig. 5. The values are very similar for both datasets, and the confidence intervals overlap almost perfectly. This shows that the RNN and FNN architectures are functionally equivalent. The good performance obtained by the feature extraction network does not depend on the presence of the recurrent connections in the sequence processing stage. A feed-forward network that processes the frames of a sequence independently and then pools the outputs across the sequence works just as well.

Figure 5. CMC curves obtained on each dataset (PRID2011 and iLIDS-VID) when training an RNN network and an FNN network in SEQ mode.

4.3.3. Training with more diverse mini-batches

We showed that the performance of the network is the same whether the sub-network in the sequence processing stage is recurrent (RNN) or not (FNN). Without the constraints imposed by the RNN, it makes sense to consider a different training process. Here we examine how the performance of the FNN varies if the training is performed in FRM mode (FNN-FRM), compared to our baseline (RNN trained in SEQ mode). The results for 20 trials are shown in Fig. 6.

Figure 6. CMC curves obtained on each dataset (PRID2011 and iLIDS-VID) when training an RNN network in SEQ mode and an FNN network in FRM mode.

For PRID2011, the results are very clear: training using single frames gives a noticeable boost. For all values of the rank k, the average CMC values always exceed (or equal) the baseline values, and the 95% confidence intervals have no overlap for k ≤ 19 (except for a very small overlap for k = 13). The results for iLIDS-VID are also encouraging. There is an improvement of the mean performance for all values of the rank k ≤ 24, and the 95% confidence intervals have no overlap for k ≤ 10. For the few values of k where the baseline performs slightly better than FNN-FRM, the difference never exceeds a statistically insignificant 0.1%. This experiment shows that switching from SEQ to FRM improves the quality of the training and gives a significant performance boost for some values of k. This illustrates the importance of having more diverse data within a mini-batch.

This points to another limitation of using sequences to train the network. A subsequence of L consecutive frames from a video of a pedestrian will have a large amount of redundancy, so the frame features f^(t) will have a high degree of similarity with each other. In our FNN formulation, the output of the sequence processing stage at each time step o^(t) only depends on f^(t), so these outputs will also be very similar. It is therefore expected that, after average pooling, the feature representation of a sequence (the mean of all the o^(t)) will not be substantially different from the feature representation of a single frame (any o^(t)). In the extreme case, if a person does not move at all and the inputs are all identical, then all o^(t) will be identical and the descriptor for the sequence will also be equal to each o^(t). In general, a single frame may yield a descriptor that is slightly noisier than the descriptor for a sequence, but it should still be sufficient to train our network, while avoiding much of the wasteful computational cost of computing features for many frames of the same sequence. In fact, the noise added by computing a descriptor from a single frame rather than from a subsequence can be seen as a form of data augmentation.
4.3.4. Speeding up training

One last advantage of FRM over SEQ is that each epoch takes fewer iterations for FRM than for SEQ (L times fewer), so the training accuracy improves much faster. See Fig. 7 for the evolution over the training phase of the CMC values at rank 1 and rank 10. The training progress shown on the x-axis is proportional to the number of iterations, which by design is the same for FRM and SEQ. We can see that after around 25% of the iterations, FRM has already converged and the performance does not noticeably improve after that, while SEQ requires the full number of iterations to fully converge. To summarize, not only does FRM converge to a higher performance than SEQ, it also gets there much faster.

Figure 7. Evolution over the training phase of the CMC values on the iLIDS-VID dataset at rank 1 and rank 10, for both RNN-SEQ and FNN-FRM.

4.3.5. Comparison with other RNN-based methods

Table 1 compares our proposed approach with recent RNN-based person re-identification techniques. We note that FNN-FRM always outperforms the baseline (RNN-SEQ). These results also show that our architecture performs better than or on par with other methods that are usually considerably more complex.

Table 1. Comparison against RNN-based methods. Reported results are mean CMC values (in %) obtained on the PRID2011 and iLIDS-VID datasets. Best results are shown in blue, second best in cyan. Our method systematically outperforms its recurrent version [12] and performs better than or on par with other RNN-based re-identification techniques.

                      PRID2011 dataset         iLIDS-VID dataset
Rank                  1     5     10    20     1     5     10    20
RNN [12]              70    90    95    97     58    84    91    96
  - (reproduced)      71.6  92.8  96.6  98.5   57.1  83.4  91.8  97.1
RFA-Net [18]          58.2  85.8  93.4  97.9   49.3  76.8  85.3  90.0
Deep RCN [15]         69.0  88.4  93.2  96.4   46.1  76.8  89.7  95.6
Zhou et al. [25]      79.4  94.4  -     99.3   55.2  86.5  -     97.0
BRNN [20]             72.8  92.0  95.1  97.6   55.3  85.0  91.7  95.1
ASTPN [17]            77    95    99    99     62    86    94    98
Chen et al. [2]       77    93    95    98     61    85    94    97
FNN-FRM (ours)        76.4  95.3  98.0  99.1   58.0  87.5  93.7  97.5

5. Conclusion and future work

In this work we revisited the use of RNNs in the context of video-based person re-identification. After deriving a non-recurrent approximation of a simple RNN architecture, we demonstrated that training our proposed architecture could reach similar performance. Furthermore, we showed that we could significantly improve that performance by training on individual frames instead of sequences, while achieving convergence substantially faster. The obtained performance is better than or on par with other RNN-based techniques.

In future work we plan to explore and experiment on other video-level recognition problems, such as action recognition or video-based object recognition. Indeed, these tasks can also be framed as a high-dimensional embedding of a time-dependent image sequence, which is conceptually not different from what is performed in this paper. Some works in action recognition showed improvements over previous state-of-the-art methods by using recurrent neural networks, and it would be interesting to explore whether simple FNNs can replace RNNs for these applications as well.

References

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Computer Vision and Pattern Recognition, 2015.
[2] L. Chen, H. Yang, J. Zhu, Q. Zhou, S. Wu, and Z. Gao. Deep spatial-temporal fusion network for video-based person re-identification. In Computer Vision and Pattern Recognition Workshops, 2017.
[3] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Computer Vision and Pattern Recognition, 2016.
[4] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[5] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision, 2008.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[7] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis. Springer, 2011.
[8] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
[9] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Computer Vision and Pattern Recognition, 2014.
[10] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Computer Vision and Pattern Recognition, 2015.
[11] K. Liu, B. Ma, W. Zhang, and R. Huang. A spatio-temporal appearance representation for video-based pedestrian re-identification. In International Conference on Computer Vision, 2015.
[12] N. McLaughlin, J. M. del Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification. In Computer Vision and Pattern Recognition, 2016.
[13] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local Fisher discriminant analysis for pedestrian re-identification. In Computer Vision and Pattern Recognition, 2013.
[14] T. Wang, S. Gong, X. Zhu, and S. Wang. Person re-identification by video ranking. In European Conference on Computer Vision, 2014.
[15] L. Wu, C. Shen, and A. van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
[16] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In European Conference on Computer Vision, 2014.
[17] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In International Conference on Computer Vision, 2017.
[18] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang. Person re-identification via recurrent feature aggregation. In European Conference on Computer Vision, 2016.
[19] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Deep metric learning for person re-identification. In International Conference on Pattern Recognition, 2014.
[20] W. Zhang, X. Yu, and X. He. Learning bidirectional temporal cues for video-based person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[21] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. In Computer Vision and Pattern Recognition, 2013.
[22] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision, 2016.
[23] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In International Conference on Computer Vision, 2015.
[24] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[25] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In Computer Vision and Pattern Recognition, 2017.