<<

Pose-Aware Person Recognition

Vijay Kumar ? Anoop Namboodiri ? Manohar Paluri † C. V. Jawahar ? ? CVIT, IIIT Hyderabad, India † Facebook AI Research

Abstract

(a) Person recognition methods that use multiple body re- gions have shown significant improvements over traditional face-based recognition. One of the primary challenges in full-body person recognition is the extreme variation in pose and view point. In this work, (i) we present an approach that tackles pose variations utilizing multiple models that (b) are trained on specific poses, and combined using pose- aware weights during testing. (ii) For learning a person representation, we propose a network that jointly optimizes a single loss over multiple body regions. (iii) Finally, we in- troduce new benchmarks to evaluate person recognition in Figure 1: (a) Different body regions provide cues that help diverse scenarios and show significant improvements over in recognition. (b) Each row shows the appearance of a per- previously proposed approaches on all the benchmarks in- son in different poses. Note that the distinguishing features cluding the photo album setting of PIPA. of a person that help classification are different in each pose.

1. Introduction alignment of different object parts. The appearance of the same object changes drastically with different poses and People are ubiquitous in our images and videos. They view-points (rows of Figure 1(b)) causing a serious chal- appear in photographs, entertainment videos, sport broad- lenge for recognition. One way to overcome this problem casts, and surveillance and authentication systems. This is through pose normalization, where objects in different makes person recognition an important step towards auto- poses and view-points are transformed to a canonical pose matic understanding of media content. Perhaps, the most [6, 10, 18, 40, 48, 57]. Another popular strategy is to model straight-forward and popular way of recognizing people is the appearance of objects in individual poses by learning through facial cues. Consequently, there is a large literature view-specific representations [2, 22, 31, 52]. focused on face recognition [25, 56]. Current face recogni- In this work, we aim to learn pose-aware representations tion algorithms [33, 35, 39, 40] achieve impressive perfor- for person recognition. While it is straight-forward to align mance on verification and recognition benchmarks [20, 44], objects such as faces, it is harder to align human body parts arXiv:1705.10120v1 [cs.CV] 29 May 2017 and are close to human-level performance. that exhibit large variations. Hence we design view-specific Face based recognition approaches require that faces are models to obtain pose-aware representations. We partition visible in images, which is often not the case in many prac- the space of human pose into finite clusters (columns of Fig- tical scenarios. For instance, in social media photos, movies ure 1(b)) each containing samples in a particular body ori- and sport videos, faces may be occluded, be of low resolu- entation or view-point. We then learn multi-region convo- tion, facing away from the camera or even cropped from the lutional neural network (convnet) representations for each view (Figure 1(a)). Hence it becomes necessary to look be- view-point. However, unlike previous approaches that train yond faces for additional identity cues. It has been recently a convnet for each body region, we jointly optimize the net- demonstrated [23, 26, 53] that different body parts provide work over multiple body regions with a single identifica- complementary information and significantly improve per- tion loss. This provides additional flexibility to the network son recognition when used in conjunction with face. to make predictions based on a few informative body re- One of the major challenges in person recognition or gions. This is in contrast to separate training which strictly fine-grained object recognition in general is the pose and enforces correct predictions from each body region. Dur- ing testing, we obtain the identity predictions of a sample before recogniton. Pose-normalization is also applied to the through a linear combination of classifier scores, each of similar problem of fine-grained bird classification [6, 51]. which is trained using a pose-specific representation. The Unlike rigid objects such as faces, it is difficult to align hu- weights for combining the classifiers are obtained by a pose man body parts due to large deformations. Hence we follow estimator that computes the likelihood of each view. a multi-view representation approach where the objects are Our approach overcomes some of the limitations of modeled independently in different views. This has also the previously proposed approaches, PIPER [53] and been employed in face recognition [2, 31], where training naeil [23]. Although poselet-based representation of faces are grouped into different poses and pose-aware CNN PIPER normalizes the pose; individual poselet patches [8] representations are learnt for each group. by themselves are not discriminative enough for recognition Person recognition with multiple body cues is the prob- tasks, and under-perform compared to fixed body regions lem of interest in this work. We make direct compar- such as head and upper body. On the other hand, naeil isons with the recent efforts that use multiple body cues: learns a pose-agnostic representation using more informa- PIPER [53], naeil [23] and Li et al. [26]. PIPER uses a tive body regions. Our framework is able to combine the complex pipeline with 109 classifiers, each predicting iden- best of both approaches by generating pose-specifc repre- tities based on different body part representations. These sentations based on discriminative body regions, which are include one representation based on Deep Face [40] archi- combined using pose-aware weights. tecture trained on millions of images, one AlexNet [24] Another major contribution of the work is the rigor- trained on the full body and 107 AlexNets trained on pose- ous evaluation of person recognition. Current approaches let patches, the latter two using PIPA [53] trainset. On the [23, 26, 53] have solely focused on the photo albums sce- other hand, naeil is based on fixed body regions such nario, reporting primarily on PIPA dataset [53]. However, as face, head and body along with scene and human at- this setting is very limited due to the similar appearance tribute cues trained using four different datasets, namely of people in albums, clothing and scene cues. To create PIPA, CASIA [50], CACD [9] and PETA [11]. While a more challenging evaluation, we consider three different poselets (used in PIPER) normalizes the pose, they are less scenarios of photo albums, movies and sports and show sig- discriminative compared to fixed body regions employed by nificant performance improvements with our proposed ap- naeil. We combine the strengths of both approaches us- proach. The datasets are available at http://cvit.iiit. ing pose-aware representations based on fixed body regions. ac.in/research/projects/personrecognition Person identification using context is another popular direction of work, where domain-specific information is ex- 2. Related Work ploited. Li et al. [26] focus on person recognition in photo- Person recognition has been attempted in multiple set- albums exploiting context at multiple levels. They propose tings, each assuming the availability of specific types of in- a transductive approach where a spectral embedding of the formation regarding the subjects to be recognized. training and test examples is used to find the nearest neigh- Face recognition is by far the most widely studied bors of a test sample. An online classifier is then trained to form of person recognition. The area has witnessed great classify each test sample. They also exploited photo meta- progress with several techniques proposed to solve the prob- data such as time-stamp and the co-occurrence of people to lem, varying from hand-crafted feature design [4, 34, 45], improve the performance. In [5, 47], meta-data and cloth- metric learning [17, 36], sparse representations [46, 54] to ing information are exploited to identify people in photo state-of-the-art deep representations [33, 35, 40]. collections. Similarly in [15], the authors use timestamp, Person re-identification is the task of matching pedes- camera-pose, and people co-occurrence to find all the in- trians captured in non-overlapping camera views; a primary stances of a specific person from a community-contributed requirement in video-surveillance applications. Most popu- set of photos of a crowded public event. Sivic et al. [38] im- lar existing works employ metric learning [16, 19, 49] using prove the recall by modeling the appearance of cloth, hair, hand-crafted [13, 28, 30] or data-driven [3, 27, 55] features skin of people in repeated shots of the same scene. to achieve invariance with respect to view-point, pose and People identification in videos may use cues such as photometric transformations. The approaches in [42, 49] sub-title [12] or appearance models [14, 37], in addition to also optimized a joint architecture with siamese loss on non- clothing, audio, face [41]. Similarly, a combination of jer- overlapping body regions for re-identification. sey, face identification and contextual constraints are used Pose normalization and multi-view representation are to identify players in broadcast videos [7, 29]. the two common approaches in dealing with object pose We focus on the generic person recognition problem sim- variations. Frontalization [18, 40, 48, 57] is a pose normal- ilar to [23, 53] that work in diverse settings without using ization scheme used commonly in face recognition, where any domain level information and demonstrate the effective- faces in arbitrary poses are transformed to a canonical pose ness of the pose-aware models in different scenarios. Training Prominent images views view 1 view 2 view n test H UB

PSM1 PSM2 PSMn

wi Pose ∑ estimator

identity

Figure 2: Overview of our approach: The database is partitioned into a set of prominent views (poses) based on keypoints. A PSM is trained for each pose based on multiple body regions. During testing, predictions from multiple classifiers, each based on a particular PSM representation, are obtained and combined using pose-aware weights provided by the pose estimator.

3. Pose-Aware Person Recognition as opposed to the PIPER, which uses fixed weights com- puted from a validation set. Second, PIPER extracts fea- The primary challenge in person recognition is the vari- 1 tures from a single model for a given localized poselet ation in pose of the subjects. The appearance of the body patch, however, we extract features from different pose parts change significantly with pose. We aim to tackle this models but combine them softly using the pose weights. by learning pose-specific models (PSMs), where each PSM This allows multiple PSMs that are very near in pose space focuses on specific discriminative features that are relevant (e.g. a semi-left and left) to contribute during the prediction. to a particular pose. We fuse the information from different PSMs to make an identity prediction. 3.1. Learning Prominent Views Our proposed framework is shown in Figure 2. Given a database of training images with identity labels and key- To facilitate pose-aware representations, we partition the points, we cluster the images into a set of prominent views training images into prominent views using body keypoints. (poses) based on keypoint features (§ 3.1). A pose estima- Although people exhibit large variations in arm and leg tor is learned on these clusters for view classification. For positions, we consider only the informative regions such learning person representation in each view, a PSM is then as head and torso. We construct a 24-D feature for pose trained for identity recognition that makes use of multiple clustering using 14 key points and visibility annotations as body regions (§ 3.2). We train multiple linear classifiers that shown in Figure 3. It consists of - predict the identities based on PSM representations (§ 3.3). 1. 10-D orientation feature based on the relative Given an input image x, we first compute the pose-specific location of different body parts computed as th [cos(θ ), . . . , cos(θ ), sign(x − x ), sign(x − x )] identity scores, si(y, x), each based on the i PSM rep- 1 8 6 3 7 4 , resentation. The final score for each identity y is a linear where θi, i = {1, 2, 3,..., 8} denote the angle be- combination of the pose-specific scores. tween the line joining two key points and the x-axis. X For example, θ6 is the angle between head midpoint s(y, x) = wisi(y, x), (1) and right shoulder. The last two elements distinguish i front and back views, and are based on the sign of where wis are the pose-aware weights predicted by the x-coordinate differences of left and right points of pose estimator (§ 3.1). To allow robustness to rare views shoulder and elbow, respectively. with limited training examples, we also incorporate a base 2. 14-D visibility feature, where each element is either 1 model in the above equation similar to [53] which is trained or 0 depending upon whether the corresponding key- on the entire train set, whose scores and weights are referred point is visible or not. This provides strong pose cues as so(y, x) and wo respectively. The predicted label of the as certain body parts are not visible in particular views. sample is computed as: arg maxy s(y, x). To identify meaningful views, we apply k-means algo- Our framework differs from PIPER in two aspects. rithm to cluster the images based on the above features. We First, our pose-aware weights are specific to each instance first obtain a large number (30) of highly similar groups, which are then hierarchically merged to obtain seven promi- 1In this work, we use the terms pose and view interchangeably. We use these terms to refer the overall orientation of the body with respect to the nent views. Figure 1(b) shows an example from each of camera and not the location of keypoints within the body. these views for two different people. The views from left to » ! head 1 x! 1 4096 3 2 x + » 8192 2000

2 x head

3 »

x 6 !3 1 4 + 9 *» CNN !6

4 7 3 softmax * !4 1 loss 5 10 8 x + !7

6 11 13 upper-body !5 6 x + CNN fc7plus 12 14 concat fc7 head mid point 7 + !8 layer * torso mid point sign( (x,y)6— (x,y)3 ) x» x shoulder mid point sign( (x,y)7— (x,y)4) upper body

Figure 3: We use body keypoints (left) to learn prominent Figure 4: PSM: Our network consists of two AlexNets for views using a set of features (right) based on orientation of head and upper body with a single output layer. The last informative body parts and keypoints. fully connected layer of the two regions are concatenated and passed to an joint hidden layer with 2000 nodes. right are: frontal, semi-left, left, semi-right, right, back and partial views, respectively. The gion even if the other region is noisy or less informative. partial cluster contains images where only the head and As we show in our experiments (Table 5), the joint training possibly part of shoulders are visible. approach performs better than separate training of regions. Once we obtain the prominent views or poses, we train 3.3. Identity Prediction with PSMs an pose-estimator based on AlexNet that takes full body im- age as input and computes the likelihood of each pose. Dur- We derive multiple features from each PSM and train ing testing, the pose likelihood estimated from the pose- classifiers on these feature vectors. The primary feature estimator provide the pose-aware weights wi, in Eqn. 1. vector (F) consists of the sixth and seventh layers of the We noticed from our experiments that, it is critical to l2- head (h) and upper body (u), and the joint fully connected normalize the weights to obtain the improved performance. layer (F: ). In addition to F, we define two additional feature vectors solely based 3.2. Learning a PSM on the head and upper body layers - Fh: F : < , > To train a pose-specific model (PSM), we select the train- and u fc6u fc7u . We train linear SVM classifiers ing samples that belong to a specific pose cluster. We con- on each of the above feature vectors to obtain the identity s (y, x) sider the head and upper body regions as these are the most predictions. The pose-specific identity score, i , is informative cues for recognition [23, 26]. Given a head at simply the sum of the three SVM classifier outputs. location (lx, ly) with dimensions (δx, δy), we estimate the X si(y, x) = Pi(y|f; x), (2) upper body to be a box at location (lx − 0.5α, ly) of dimen- f∈{F,Fh,Fu} sions (2α, 4α) where α = min(δx, δy).

Given different body parts, one possibility is to train in- where Pi(y|f; x) is the class y score of the sample x pre- dependent convnets on each of these regions [23, 26, 53]. dicted by the classifier trained on the feature f in i-th view. However, discriminative body regions that help in recogni- tion may vary across training instances. For example, Fig- 4. Experiments and Results ure 1(a)-4 contains an occluded face region and is less in- formative. Similarly, upper body may be less informative in 4.1. Datasets and Setup some other instances. If such noisy or less informative re- We select three datasets from the domain of photo- gions influence the optimization process, it may reduce the albums, movies and sport broadcast videos as shown in Fig- generalization ability of the networks. ure 9. Each of these settings have their own set of advan- We propose an approach to improve the generalization tages and challenges as summarized in Table 1. To the best ability by allowing the network to selectively focus on infor- of our knowledge, this is the first work that evaluates person mative body regions during the training process. The idea is recognition in such diverse scenarios. to optimize both the head and upper body networks jointly over a single loss function. Our PSM contains two AlexNets 4.1.1 Photo Album Dataset corresponding to the head and upper-body regions (see Fig- ure 4). The final fc7 layers of each region are concatenated PIPA [53] consists of 37,107 photos containing 63,188 and passed to a joint hidden layer (fc7plus) with 2000 nodes instances of 2,356 identities collected from user-uploaded before the classification layer. This provides more flexibil- photos in Flickr. The dataset consists of four splits with ity to the network to make the predictions based on one re- an approximate ratio of 45:15:20:20. The larger split is PIPA [53] Hannah [32] Soccer Train instances 6,443 2,385 19,813 Train subjects 581 26 28 Test instances 6,443 202,178 51,051 Test subjects 581 41 28 Annotations Head Face Body Domain variation No Yes No Clothing Yes No No Age gap No Yes No Head resolution High Medium Low Motion blur No Moderate Severe Deformation Less Moderate Severe Figure 5: Few images from (top) PIPA, (middle) Hannah Table 1: Comparison of the datasets in terms of statistics, and (bottom) Soccer datasets. annotations, merits and challenges. primarily used to train convnets, second split to optimize important events of the match. We further filtered the re- parameters during validation and the third split to evaluate play clips to retain only those clips that are shot in close-up recognition algorithms. The evaluation set is further divided and medium views. We used VATIC toolbox [43] to an- into two equal subsets, each with 6,443 instances belonging notate the players in videos. Our soccer dataset consists of to 581 subjects for training and testing the classifiers. We 37 video clips with an average duration of 30 secs. It con- follow PIPA experimental protocol and train the classifiers sists of 28 subjects with 13 players from Germany team, 14 on one fold and test on another fold, and vice-versa. We players from Argentina team, and a referee. also conduct experiments on challenging splits introduced Unlike PIPA, we marked full-body bounding boxes for by Oh et al. [23] based on album, time and day information. each player since head is not visible or out-of-view in many instances, and it also is difficult to estimate the bounding 4.1.2 Hannah Movie Dataset boxes of different body regions from head, due to large de- We consider “” dataset [32] to rec- formations. We followed PIPA annotation protocol and la- ognize the actors appearing in the movie. The dataset con- beled the players regardless of their pose, resolution and sists of 153,833 movie frames containing 245 shots and visibility. We annotated the players to generate continuous 2,002 tracks with 202,178 face bounding boxes. We regress tracks even in the presence of severe occlusion. Whenever it the face annotations to get the rough estimate of head. There is difficult to recognize the players, we relied on additional are a total of 254 labels of which 41 are the named subjects. clues such as hair, shoes, jersey number and accessories. The remaining labels are the unnamed characters (boy1, However, we do not rely on any of these domain-specific girl1, etc.) and miscellaneous regions (crowd, etc.). cues in this work. For evaluation purposes, we randomly To consider a more practical recognition setting, we cre- select 10 clips into training and remaining 27 clips into test- ate another dataset for classifier training using IMDB pho- ing. This resulted in 19,813 instances in training set and tos. For each named character, we collect photos from ac- 51,051 instances in testing set. tor’s profile in IMDB[1] and annotate the head bounding 4.2. Results and Analysis boxes. Of the 41 named characters, only 26 prominent ac- tors had profiles in IMDB. The IMDB train set consists of For all our experiments, we use the PSM models trained 2,385 images belonging to 26 prominent actors appeared in on larger set of PIPA consisting of 29,223 instances. We the movie. There are a total of 159,458 instances belonging annotate PIPA train instances with keypoint locations to to these 26 actors in the test set. There is a significant age learn prominent views as discussed in § 3.1. The number variation between train and test instances since the Hannah of instances in each view after pose clustering is shown instances are created from a particular year (1986) while the in Figure 6(a). We train a separate PSM on frontal, IMDB photos are captured over a long period of time. semi-left, left, semi-right, right, back and partial views. The base model is trained on the entire 4.1.3 Soccer Dataset PIPA train set. We augment each view by horizontal flip- ping of the instances from its symmetrically opposite view. We create soccer dataset from the broadcast video of World For instance, images in left view are flipped and augmented cup 2014 final match played between Argentina and Ger- to right view. We use Caffe library [21] for our implemen- many. We considered only replay clips as these capture the tation. For optimization, we use stochastic gradient descent h erigrt siiilystto set initially is rate learning of The size batch a with eo nteAlto td (III). study Ablation discussed the is in This below en- models. the view-specific not different and of strategy semble combination primarily pose-aware is the improvement to proposed due the Second, datasets). and [40]) DeepFace ing to ( compared smaller data ad- much extra the is the regarding First, points important data. two ditional make We num- The initialize 6(b). to images. ure used VGG examples in- the of in those ber body of only We full selected subset have that and a networks. stances annotations the used face initializing we the for extended this, [33] overcome dataset To VGGFace over-fitting. to to set is of total a for networks of factor a by iteipoeetover observe improvement also little We a variations. age motion and resolution, occlusions, lower heavy to blur, owing PIPA to compared lower performance. the improve to used information track be the that should available suggest all results if reassign of The voting We track. majority a simple in frames on recognition. based on labels frame impact the their datasets understand Soccer and to Hannah on labels track merging by sults be- motivation the hind strengthening recognition in helps gions in recognition 4. player Table created in newly soccer the on results the show outperforms approach our dataset, naeil Hannah the On split. nal all an outperforms achieving approaches approach previous our the in- considered, contextual not the is When formation review. Li comprehensive includ- of a results 2, provide contextual Table and in baseline splits both test ing PIPA on approaches various #examples #examples 11000 vrl performance: Overall the that noticed We h cuaiso anhadSce aaesaemuch are datasets Soccer and Hannah on accuracies The re- body multiple of use that datasets the all on notice We 1500 2250 3000 2750 5500 8250 750 PIPER 0 0 iue6 oesaitc fdfeetdatasets. different of statistics Pose 6: Figure yalremri ssoni al .Fnly we Finally, 3. Table in shown as margin large a by 1 (a) (a) PIPA train o riigSMadtebs weight base the and SVM training for (c) PIPA test and 10 fe every after naeil 50 10000 20000 30000 40000 12500 25000 37500 50000 300 PSM naeil n oetmcefiin of coefficient momentum and 0 0 loihs eas hwtere- the show also We algorithms. naeil , ecmaetepromneof performance the compare We 000 (b) (b) VGG-sub train rie nve ape lead samples view on trained s (d) Hannah test (d) Hannah 50 PIPER , trtos h parameter The iterations. ( 000 ∼ 0 500 PSM nsce aae u to due dataset soccer on . 001 ∼ trtos etanthe train We iterations. 160 ( 89 rmfu different four from K ∼ hc sdecreased is which , r hw nFig- in shown are s 4 . 12500 18750 25000 )w considered we K) 05 6250 ae o train- for faces M 0 nteorigi- the on % Partial body Partial Back Left Semi-left Frontal Semi-right Right tal et (e) Soccer test (e) Soccer w 0 2]to [26] . to 1 0 . . C 9 . eonzn xmlsfo ifrn iw.I hw that corresponding shows It in views. model for different each from of performance examples the recognizing shows 8 We Figure testing. models. into half remaining extract and training classifier for etst nec xeiet ecnie nytoeexam- those in only are consider that we experiment, ples each In set. test pose-specific of performance the eot nacrc of accuracy an reports outperforms itself classifiers of combination note We single our improvement. that, performance the bring further tures ( classifiers ( training separate upper and ( head features of body concatenation per- the Similarly, two almost points. by cent training separate through obtained tures ( head the outperform training head The regions. both body ( the of all use for performance The the improve with 5. Table strategy in optimization model joint and features ferent improvement. minimal only noticed we performance by the our “back-view” Since of 6(e)). as Figure predicted (see pose-estimator are the images these of majority falling, (kicking, poses unusual ap- various of dictionary. IMDB (%) using dataset performance movie Hannah on Recognition proaches 3: Table ap- various of (%) splits. comparison PIPA different Performance on proaches 2: Table F u prah4.544.46 40.95 Accuracy approach Our Accuracy naeil of training Joint of training Separate ( body Upper ( Face ( Head Method approach Our Li Li naeil PIPER Method h bainsuy( study Ablation ( study Ablation frontal n pe oy( body upper and ) tal et al et back F F ihcnet[26] context with . [26] context w/o . H u 65 31.55 26.53 ) 2]3.137.57 31.41 [23] [23] [53] 75 31.91 27.52 ) etr o hs xmlsusing examples these for feature s 0 iwmdli ordet iie riigdata, training limited to due poor is model view ( , ,x y, U F base semi-left PSM i 64 17.72 16.49 ) t oeadslc afo hmrandomly them of half select and pose -th hog on riigprombte than better perform training joint through ) H ) S rmha,uprbd n on fea- joint and upper-body head, from ) II and I 1 ): oesotefr te oesin- models other outperform models .Fnly h obnto fthree of combination the Finally, ). ): oe ihjittann taeyand strategy joint-training with model H 86 F eaayeteefcieeso dif- of effectiveness the analyze We ecnuteprmnst measure to experiments conduct We and U u . 78% etrsotie hog joint through obtained features ) Original U 89.05 88.75 83.86 86.78 83.05 and h with etc 2 18 36.10 31.86 tracks with tracks without 29 37.74 32.92 n pe oy( body upper and ) semi-right PSM )o h ujcs large A subjects. the of .) 17 fc Album models. oesuigPIPA using models 82.37 83.33 78.23 78.72 6 and - PSM naeil fc and 74.84 77.00 70.29 69.29 examples, Time 7 u features - which , 2 base base fea- ) 56.73 59.35 56.40 46.61 Day - PIPA - Performance across pose Hannah per pose accuracy Soccer per pose accuracy

100 60 30 Our approach Joint training Head-Upper body Our approach Joint training Our approach Joint training Head-Upper body Head Upper body Face Head - Upper body Head Head Upper body Upper body Face

80 48 24

60 36 18 Accuracy(%) Accuracy(%) 40 Accuracy(%) 24 12

20 12 6

0 0 0 Right Semi-right Frontal Semi-left Left Back Partial Upper body Right Semi-right Frontal Semi-left Left Back Partial Upper body Right Semi-right Frontal Semi-left Left Back Partial Upper body Figure 7: Pose-wise recognition performance on PIPA (left), Hannah (middle) and Soccer (right) datasets.

model 0 model 1 model 2 model 2 model 3 Base Method Accuracy Accuracy (Right) (Semi-right) (Frontal) (Semi-left) (Left) without tracks with tracks 56.1 40.7 53.8 51.4 51 44.9 (Right) Head (H) 17.68 20.54 Pose0 Upper body (U) 18.01 19.76 62 47.2 64.5 63.8 63.9 48.2

Separate training of H and U 17.62 20.68 Pose1 (Semi-right) Joint training of H and U 18.35 20.18 71.4 58.7 75.5 78.5 74.3 61.6 Pose2 naeil [23] 19.45 23.77 (Frontal) Our approach 20.15 24.31 68.3 53.8 68.1 68 71.7 57.1 Pose3 Table 4: Performance comparison (%) on Soccer dataset. (Semi-left)

61.8 51.9 58 61.3 56.3 52.7 (Left) Feature Accuracy Pose4 Figure 8: Effectiveness of PSMs: Each row shows the per- Face (F) fc7f 66.83 formance of test examples in a particular pose represented [fc6f fc7f] 70.40 using the different PSMs. Head (H) fc7h ...(h1) 76.81 [fc6h fc7h]...(h2) 79.54 Upper body (U) fc7u ...(u1) 72.26 Fusion type (I) (II) [fc6u fc7u]...(u2) 75.19 Average pooling 83.57% 87.78% Separate training [h1 u1] 82.90 Max pooling 80.54% 85.51% H U h u Elementwise multi- of and [ 2 2]...(S1) 84.01 81.71% 86.32% fc7plus 85.98 plication [fc6h fc7h]...(Fh) 82.22 Concatenation 84.44% 87.62% Joint training [fc6u fc7u]...(Fu) 77.62 Pose-aware weights – 89.05% H U fc7 fc7 of and [ h u] 85.05 Table 6: Comparison of different fusion schemes for com- j fc7 fc7 [ 1 h u] 85.22 bining (i) features during joint training (using frontal fc7 fc6 fc6 [ plus h u] 86.10 PSM) and (ii) pose-aware classifier scores during testing. [fc7plus Fh Fu]...(F) 86.27 s0(y, x) 86.96 Table 5: Performance (%) of different features obtained information pooling, one during PSM training and another from separate and joint training of regions on PIPA test set. for combining classifiers. For joint-training, we show the effect of different head (fc7h) and upper body (fc7u) com- bination strategies in Table 6 (I). The simple concatenation cluding the base model. This show that the person represen- of fc7h and fc7u worked better, and hence considered. tations obtained from pose-specifc models are more robust Similarly, we tried multiple strategies to combine the clas- than the pose-agnostic representations. However, for ex- sifiers during testing. As can be seen in Table 6 (II), pose- treme profile views, we noticed that base model performed aware weighting outperform other strategies including av- better than the corresponding profile models. We attribute erage pooling of the ensemble of classifiers. this to the non-availability of enough profile images while Pose-wise recognition performance: The statistics of training the PSM. For this reason we include the base model different poses is given in Figure 6 (bottom) for different along with PSMs to bring more robustness when handling datasets. Frontal images dominate PIPA due to which algo- rear and non-prominent view images. rithms already achieve high performance (>80%). Hannah Ablation study (III): Our approach uses two kinds of consists of different poses in similar proportion while the Figure 9: Success and failure cases on (top) PIPA and (bottom) Hannah. First five columns show the success cases of our approach where the improvement is primarily due to the specific-pose model. Green and yellow boxes indicate the success and failure result of naeil respectively. Last column in red shows the failure cases for both our approach and naeil. Rank-n all in one sheet

100% 100 70 On Hannah, we noticed a big difference of 12% between

95% 80 58 rank-1 and rank-2 performance. The performance gap be-

90% 60 46 tween different approaches tend to reduce with higher rank. 85% Our approach Our approach Our approach Joint training 40 Joint training 34 Joint training 80% Handling unknown instances: In the movie scenario, Accuracy(%) Head - Upper body Head - Upper body Head-Upper body Head 20 Head 22 Head 75% Upper body Upper body Upper body the test set has 41 ground truth labels while there are only 26 Face Face 70% 0 10 1 12 23 34 45 56 67 78 89 100 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 subjects in the trainset. Therefore, recognition algorithms rank rank rank should have the ability to reject such unknown instances. Figure 10: CMC curves of various approaches on PIPA To achieve this, we l2-normalized the predicted class scores ROC curve (left), Hannah (middle) and Soccer (right) datasets. 1 and considered the maximum score as a confidence mea- ROC curve 1 sure. The confidence score obtained on pose-aware repre- Our approach (AUC = 0.61) 0.8 Joint training (AUC = 0.56) Head - Upper body (AUC = 0.54) sentations are more robust in rejecting unknown, the perfor- 0.8 Head (AUC = 0.49) Upper body (AUC = 0.40) Figure 11: ROC curves of mance being measured using ROC curve in Figure 11. 0.6 Face (AUC = 0.53) 0.6 various approaches in re- Computational complexity: The number of features 0.4 0.4

True Positive rate True Positive rate True jecting unknown Hannah extracted from each PSM is 18,384 (4096 × 4 + 2000). Our approach (AUC = 0.61) Joint training (AUC = 0.56) 0.2 instances based on nor- With 7 pose-aware models and a base model, our total fea- 0.2 Head - Upper body (AUC = 0.54) Head (AUC = 0.49) Upper body (AUC = 0.40) 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 malized prediction scores. 18, 384 × 8 ∼3 Face (AUC = 0.53) ture dimension is ( ) which is times smaller False positive rate 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 than PIPER (4096 × 109) and ∼2 times larger than naeil False positive rate (4096×17). For memory critical applications, fc7plus alone PR curvesoccer dataset contains majority of back view images. Con- can be used as feature. We achieve an accuracy of 87.01% 1 sequently, we observe a low performance on these datasets. Hannah unknown on PIPA with fc7plus still outperforming naeil with a actor rejection 0.8 In Figure 7, we show pose-wise recognition performance. feature dimension of just 16,000 (2000 × 8). curves PIPA and Hannah have a similar trend in which frontal and 0.6 semi-profile images are recognized with greater accuracies, 5. Conclusion Precision 0.4 while profile and back-views with less accuracy. The up- Our approach (AP = 86.28%) perJoint training body (AP = 82.84) seems to be less informative in case of Hannah We show that learning a pose-specific person represen- Head - Upper body (AP = 81.93) 0.2 Head (AP = 78.59%) asUpper the body (AP clothing = 73.58%) is completely different between Hannah and tation helps to better capture the discriminative features in Face (AP = 81.26)

0 IMDB. The proportion of back views that are correctly rec- different poses. A pose-aware fusion strategy is proposed 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Recallognized is slightly better in soccer setting due to large num- to combine the classifiers using weights obtained from a ber of back view images in the classifier train set. pose estimator. The person representations obtained us- Rank-n identification rates: The Cumulative matching ing a joint optimization strategy is shown to be more pow- Characteristic (CMC) curves are shown in Fig 10. Our ap- erful compared to separate training of body regions. We proach achieves rank-10 accuracies of 96.56%, 84.83% and achieve state-of-the-art results on three different datasets 63.2% on PIPA, Hannah and Soccer datasets respectively. from photo-albums, movies and sport domains. References [21] Y.Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir- shick, S. Guadarrama, and T. Darrell. Caffe: Convolu- [1] Hannah and her sisters (1986), full cast and crew. tional architecture for fast feature embedding. arXiv preprint http://www.imdb.com/title/tt0091167/ arXiv:1408.5093, 2014. fullcredits. [22] S. Johnson and M. Everingham. Clustered pose and nonlin- [2] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, ear appearance models for human pose estimation. In BMVC, I. Masi, J. Choi, J. Lekust, J. Kim, P. Natarajan, et al. Face 2010. recognition using deep multi-pose representations. In WACV, [23] S. Joon Oh, R. Benenson, M. Fritz, and B. Schiele. Person 2016. recognition in personal photo collections. In ICCV, 2015. [3] E. Ahmed, M. Jones, and T. K. Marks. An improved deep [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet learning architecture for person re-identification. In CVPR, classification with deep convolutional neural networks. In 2015. NIPS, 2012. [4] T. Ahonen, A. Hadid, and M. Pietikainen. Face description [25] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, with local binary patterns: Application to face recognition. and G. Hua. Labeled faces in the wild: A survey. In Advances PAMI, 2006. in Face Detection and Facial Image Analysis, 2016. [5] D. Anguelov, K.-c. Lee, S. B. Gokturk, and B. Sumengen. [26] H. Li, J. Brandt, Z. Lin, X. Shen, and G. Hua. A multi-level Contextual identity recognition in personal photo albums. In contextual model for person recognition in photo albums. In CVPR, 2007. CVPR, 2016. [6] T. Berg and P. Belhumeur. Poof: Part-based one-vs.-one fea- [27] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter tures for fine-grained categorization, face verification, and pairing neural network for person re-identification. In CVPR, attribute estimation. In CVPR, 2013. 2014. [7] M. Bertini, A. Del Bimbo, and W. Nunziati. Player identifi- cation in soccer videos. In SIGMM workshop on Multimedia [28] C. Liu, S. Gong, C. C. Loy, and X. Lin. Person re- information retrieval, 2005. identification: What features are important? In ECCV, 2012. [8] L. Bourdev and J. Malik. Poselets: Body part detectors [29] W.-L. Lu, J.-A. Ting, J. J. Little, and K. P. Murphy. Learning trained using 3d human pose annotations. In ICCV, 2009. to track and identify players from broadcast sports videos. PAMI [9] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Cross-age reference , 2013. coding for age-invariant face recognition and retrieval. In [30] B. Ma, Y. Su, and F. Jurie. Local descriptors encoded by European Conference on Computer Vision, 2014. fisher vectors for person re-identification. In ECCV, 2012. [10] G. Cheron,´ I. Laptev, and C. Schmid. P-CNN: Pose-based [31] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware CNN Features for Action Recognition. In ICCV, 2015. face recognition in the wild. In CVPR, 2016. [11] Y. Deng, P. Luo, C. C. Loy, and X. Tang. Pedestrian attribute [32] A. Ozerov, J.-R. Vigouroux, L. Chevallier, and P. Perez.´ On recognition at far distance. In MM, 2014. evaluating face tracks in movies. In ICIP, 2013. [12] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite [33] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face out of automated naming of characters in tv video. Image recognition. BMVC, 2015. and Vision Computing, 2009. [34] N. Pinto, J. J. DiCarlo, and D. D. Cox. How far can you [13] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and get with a modern face recognition test set using only simple M. Cristani. Person re-identification by symmetry-driven ac- features? In CVPR, 2009. cumulation of local features. In CVPR, 2010. [35] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni- [14] V. Gandhi and R. Ronfard. Detecting and naming actors in fied embedding for face recognition and clustering. In CVPR, movies using generative appearance models. In CVPR, 2013. 2015. [15] R. Garg, S. M. Seitz, D. Ramanan, and N. Snavely. Where’s [36] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. waldo: Matching people in images of crowds. In CVPR, Fisher vector faces in the wild. In BMVC, 2013. 2011. [37] J. Sivic, M. Everingham, and A. Zisserman. “Who are you?” [16] S. Gong, M. Cristani, S. Yan, and C. C. Loy. Person re- Learning person specific classifiers from video. In CVPR. identification. Springer, 2014. IEEE, 2009. [17] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? [38] J. Sivic, C. L. Zitnick, and R. Szeliski. Finding people in re- metric learning approaches for face identification. In ICCV, peated shots of the same scene. In BMVC, volume 2, page 3, 2009. 2006. [18] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face [39] Y. Sun, X. Wang, and X. Tang. Deep learning face repre- frontalization in unconstrained images. In CVPR, 2015. sentation from predicting 10,000 classes. In CVPR, pages [19] M. Hirzer, P. M. Roth, M. Kostinger,¨ and H. Bischof. Re- 1891–1898, 2014. laxed pairwise learned metric for person re-identification. In [40] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: ECCV, 2012. Closing the gap to human-level performance in face verifica- [20] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. tion. In CVPR, 2014. Labeled faces in the wild: A database for studying face [41] M. Tapaswi, M. Bauml,¨ and R. Stiefelhagen. knock! knock! recognition in unconstrained environments. Technical report, who is it? probabilistic person identification in tv-series. In University of Massachusetts, Amherst, 2007. CVPR, 2012. [42] E. Ustinova, Y. Ganin, and V. Lempitsky. Multire- gion bilinear convolutional neural networks for person re- identification. arXiv preprint arXiv:1512.05300, 2015. [43] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scal- ing up crowdsourced video annotation. IJCV, 2012. [44] L. Wolf, T. Hassner, and I. Maoz. Face recognition in un- constrained videos with matched background similarity. In CVPR, 2011. [45] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based meth- ods in the wild. In Workshop on faces in ‘real-life’ images: Detection, alignment, and recognition, 2008. [46] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. PAMI, 2009. [47] R. B. Yeh, A. Paepcke, H. Garcia-Molina, and M. Naaman. Leveraging context to resolve identity in photo albums. In JCDL, 2005. [48] D. Yi, Z. Lei, and S. Z. Li. Towards pose robust face recog- nition. In CVPR, 2013. [49] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Deep metric learning for person re-identification. In ICPR, 2014. [50] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen- tation from scratch. arXiv preprint arXiv:1411.7923, 2014. [51] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for sub-category recognition. In CVPR, 2012. [52] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014. [53] N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In CVPR, 2015. [54] Q. Zhang and B. Li. Discriminative k-svd for dictionary learning in face recognition. In CVPR, 2010. [55] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level fil- ters for person re-identification. In CVPR, 2014. [56] W.-Y. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM computing sur- veys, 2003. [57] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015. Supplementary This supplementary document provides additional qual- itative and quantitative results that provide further insights into the results discussed in the main paper. A more de- tailed description of the datasets is given in § I and the pose clusters discussed in § 3.1 of the main paper are visualized in § II. Additional experimental results and visualizations that show the effectiveness of different components of our framework are given in § III and § IV, respectively.

I. Datasets I.1. IMDB We created the IMDB database to train the actor clas- sifier in the movie scenario. The scenario is different from most of the person recognition in that the test set contains a single movie with lesser variation in appearance between multiple instances of an actor in terms of age, style of cloth- ing, etc. We assume that there are no labeled images within the movie and hence training data is not a part of the movie. Figure 12: IMDB: Each row shows few images of an actor To create a training set, the images are collected from the from the dataset. We used IMDB dataset to train classifiers IMDB profile2 of actors appearing in the movie, which are for actor recognition in the Hannah movie. then manually cropped and annotated. Few images from IMDB database are shown in Figure 12. We relied on text tags associated with photos for annotation whenever the photos contain multiple confusing identities. Apart from il- lumination, resolution, and pose variations, there is a large age variations among IMDB instances. In addition, there is a large domain contrast between IMDB and Hannah test set in terms of lighting, camera and imaging conditions. This creates a more challenging setting to match identites between IMDB and Hannah instances. I.2. Soccer Soccer is another scenario where there are a significant number of frames in which the face is not visible and the subjects are often occluded by other players. We show more examples from our soccer dataset in Figure 14. In many in- stances, head is largely occluded, and in back-view unlike PIPA and Hannah instances, which contain visible head and torso regions. Also, soccer instances exhibit large body de- formations, are of low resolution with significant blur. The soccer dataset therefore offers different kinds of challenges for recognition that are not seen in PIPA and Hannah.

II. Pose clusters We obtain a set of prominent views to facilitate pose- specific representations as discussed in § 3.1. To achieve Figure 13: Pose clusters: Each row from top to bottom this, we annotated 14 body keypoints for 29,223 PIPA shows people from PIPA with particular body orientation train instances which are then used for clustering. More clustered using orientation and keypoint visibility features.

2http://www.imdb.com/title/tt0091167/ fullcredits Figure 14: Images from soccer dataset. It offers a challenging person recognition scenario due to low resolution, high occlusion, deformation and motion blur exhibited by soccer instances. examples of our pose clusters are shown in Figure 13. III. Quantitative Results and Analysis Each row from top to bottom contain images from right, semi-right, frontal, semi-left, left, back We provide more insightful results that help to under- and partial body views. The orientation and keypoint stand merits and challenges of different recognition settings visibility features produced tight clusters containing images that are considered. with particular body orientation. The last cluster captures Recognition per subject: Figure 15 shows the number the instances with partial upper body such as head or shoul- of images for each actor in IMDB and Hannah test sets der, etc, in the images that are commonly seen in social me- along with their individual recognition performances. We dia photos and movies. While we considered seven promi- observe that, for those subjects with sufficiently large num- nent views in this work, we note that generating a large ber of training instances (Michael Caine, Barbara Harshey, number of views can be helpful, provided there are enough , Julia Louis-Dreyfus, and ), the training samples in each cluster to train the convnets. performance is high as expected. For subjects with less than 20 training instances, the performance is very low. etn.Fratr ihls hnei perneoe time over appearance movie in the ( change in less useful with less actors are For color setting. hair and glasses, attributes human age, and like scene as such clues photo-albums, like with than accu- an of reaches approach racy Our respectively. 18, Figure and on approaches various 5 of performance recognition the pare next. discussed are which cues, clothing to due near a ( observe keepers also We stances. iia rn fhg efrac o ujcs( subjects see for We performance high Huguain 16. of Figure trend similar in a performances individual their with examples. training enough be- age Jenkins ( in Richard instances difference test large and train a tween is there whenever However, set. test the on (bottom) actor the each show of We performance set. recognition IMDB test (top) movie in Hannah actor (middle) each for and images of Number 15: Figure ihe Caine Michael otocrigmveadsce ujcsi iue17 Figure in subjects soccer and movie occurring most eonto efrac ftpsubjects: top of performance Recognition iial,w hwtesaitc fsce lyr along players soccer of statistics the show we Similarly, #images in Hannah movie naeil 17

Recognition accuracy 12000 18000 24000 30000 #images in IMDB 6000 100 100 200 300 400 20 40 60 80 61 oesi oprbet edaduprbd.Un- body. upper and head to comparable is models aulNeuer Manuel and

. Carrie Fisher 17

oeta h vrl efrac of performance overall the that Note . Joanna Gleason ntpatr,wihi infiatybetter significantly is which actors, top on %

org Palacio Rodrigo Sam Waterston

and Julia Louis-Dreyfus ,tepromnei ordsiehaving despite poor is performance the ), Verna O. Hobson Woody Allen od Allen Woody Maureen OSullivan Tony Roberts and Lewis Black Paul Bates

egoRomero Sergio Fred Melamed ihsfcettann in- training sufficient with ) areFse,Dan West, Dianne Fisher, Carrie Daniel Stern

e o he n v in five and three row See : J.T. Walsh 100 Michael Caine John Turturo

cuayfrgoal for accuracy % Lloyd Nolan Ken Costigan Max Von Sydow n h referee the and ) Julie Kavner Barry Gibb Mia Farrow Diane Wiest ecom- We

Gonzalo Christian Clemenson naeil Bobby Short Barbara Hershey sn h ocrdtst eso h efrac fdif- ( Romero of subjects performance three the on show approaches We ferent qualitative study dataset. a soccer a such the perform recognition, using We previously. in done not helps is evaluation clothing that obvious distance. a at people recognize to repre- better able develop are to that suggest sentations This approaches. the all for and informative extremely be head. to to compared found robust is face 12), Figure player. each of show performance also We recognition dataset. (bottom) Soccer the training of the split in (middle) test player and each (top) for images of Number 16: Figure fps wr oesand models to performance aware overall flexibility pose the of more Finally, provide regions. selective robust it on more as focus is better model much trained performs jointly and using concatena- features the hand, of other tion the On alone. than feature worse body is upper training separate and through head obtained of body a concatenation upper The by clothing. head to due outperforms margin head, large to compared informative less and Germany the of keepers goal respectively. the Argentina, are subjects two first o nomtv sclothing? is informative How poor is performance overall the dataset, soccer the On sse nFgr 9 pe oyrgo,wihi often is which region, body upper 19, Figure in seen As Recognition accuracy (%) #images in test clips #images in train clips 1000 2000 3000 4000 5000 1000 2000 3000 4000 100 20 40 60 80 and

Manuel Neuer

Referee Benedikt Hoewedes Mats Hummels Schweinsteiger Mesut Oezil Miroslav Klose

ihuiu ltigi iue1.The 19. Figure in clothing unique with ) Thomas Mueller Phillip Lahm Toni Kroos Jerome Boateng naeil Christoph Kramer Mario Gotze Andre Schurrle Sergio Romero Ezequiel Garay r identical. are Pabli Zabaleta huhi sintuitively is it Though Lucas Biglia aulNeuer Manuel Enzo Perez Gonzalo Huguain Lionel Messi Javier MAscherano Martin Demichelis Marcos Rojo Ezequiel Lavezzi Fernando Gago

, Sergio Aguero

Sergio Rodrigo Palacio Referee Soccer Clothing

Performance on 5 lead actors

100 Our approach (61.17%) 100 naeil (49.25%) Joint training (50.41%) Head-Upper body (49.03%) 80 80 Head (41.80%) Upper body (22.02%) Face (42.08%) 60 60

40 Accuracy(%) 40 20

20 0 Manuel Neuer (GER) Sergio Romero (ARG) Referee

0 Hannah: effectOur approach (97.09%)of seed imagesnaeil (97.09%) from movie Diane Wiest Woody Allen Barbara Hershey Mia Farrow Michael Caine ( needJoint to training recompute (95.53%) withHead-Upper pose body networks) (89.96%) Head (86.61%) Upper body (95.25%) Figure 17: Recognition performance of five lead actors in Hannah dataset. Figure 19: Effect of clothing on recognition. Performance on 5 lead players

40 100 Our approach (14.58%) naeil (12.40%) Joint training (10.16%) Head-Upper body (11.57%) 32 Head (12.54%) 80 Upper body (8.53%)

24 60

Accuracy(%) 16 40

8

Recognitionaccuracy (%) 20

0 Schweinsteiger Javier Mascherano Lionel Messi Marcos Rojo Martin Demichelis 0 0 1 3 5 10 20 Figure 18: Recognition performance of five most occurring Number of examples players in Soccer dataset. Figure 20: Recognition performance of Hannah movie set using IMDB plus samples from the Hannah test set. It is interesting to note that, convnets that are trained for identity recognition can distinguish clothing without any IV. Qualitative Results explicit modeling or hand-crafted features [29]. Confusion between identities: We show the recogni- We show some qualitative results in Figures 23 to 27. tion confusion matrix for Hannah and Soccer datasets in Figure 23 shows the success and failure cases of joint train- Figure 21 and Figure 22 respectively, with and without ing and separate training of body regions. We notice an tracking. We notice two important points related to gender over-influence of clothing while using separately trained and clothing. As seen from Figure 21, female subjects are and concatenated regional features, compared to the jointly mostly getting confused with female subjects, and similarly training features. In Figure 24, we show the effectiveness the male subjects are confused with male subjects. In Fig- of using multiple classifiers from each PSM. As seen in the ure 22, we notice that players from each team are mislabeled figure, the concatenated head and upper body features (F) with the members from the same team. These studies show may predict incorrect labels even when one (or two) of these the effectiveness of convnets in capturing human attributes features predict correctly, due to the over influence of less without any explicit training. Finally, majority voting over informative body region. Combining these three features is a track helps to produce consistent predictions. found to be more robust. Domain gap: To understand the effect of domain con- We show the top scoring predictions obtained from each trast between train and test instances, we conduct an ex- pose-specific PSM in Figure 25. It clearly shows how each periment adding different number of Hannah instances per PSM helps in the prediction of instances in that particular subject to the IMDB training gallery. The results are shown pose when the base model is unable to predict correctly. Fi- in Figure 20. As seen from the graph, the addition of even nally, we show the success and failure cases of our approach a few instances from the test domain results in a very large on Hannah and Soccer datasets in Figure 26 and Figure 27 improvement in the recognition performance. respectively, and compare with the naeil. iue21: Figure Christian Clemenson Christian Christian Clemenson Christian Maureen OSullivan Maureen Julia Louis-Dreyfus Julia Maureen OSullivan Maureen Julia Louis-Dreyfus Julia Verna O. Hobson Verna O. Hobson Barbara Hershey Barbara Barbara Hershey Barbara Joanna Gleason Joanna Joanna Gleason Joanna Max Max Von Sydow Richard Jenkins Richard Max Max Von Sydow Richard Jenkins Richard Sam Sam Waterston Sam Sam Waterston Fred Fred Melamed Fred Fred Melamed Michael Caine Michael Michael Caine Michael Ken Costigan Ken Ken Costigan Ken Tony Roberts Tony Roberts Carrie Fisher Carrie Carrie Fisher Carrie John John Turturo John John Turturo Woody Allen Daniel Stern Daniel Kavner Julie Woody Allen Bobby Short Bobby Julie Kavner Julie Daniel Stern Daniel Bobby Short Bobby Diane Wiest Diane Diane Wiest Diane Lloyd Nolan Lloyd Lewis Black Lewis Lloyd Nolan Lloyd Lewis Black Lewis ofso matrix Confusion Mia Farrow Mia Mia Farrow Mia Paul Bates Paul Paul Bates Paul Barry Barry Gibb Barry Barry Gibb J.T. Walsh J.T. Walsh 12.99% 32.36% 15.62% 11.11% 13.56% 0.00% 3.17% 4.97% 0.84% 3.04% 1.92% 2.09% 7.19% 6.66% 0.00% 2.74% 3.16% 1.38% 0.75% 2.51% 5.07% 3.31% 7.54% 1.71% 8.67% 4.73% 0.48% 0.00% 0.79% 4.17% 0.49% 0.00% 1.08% 0.17% 9.88% 0.00% 0.00% 2.74% 7.59% 0.39% 0.00% 1.35% 3.00% 3.85% 6.07% 0.60% 3.86% 6.07% 2.78% 0.00% 0.00% 3.75% Carrie Fisher (F) Carrie Fisher (F) 24.95% 33.45% 39.46% 0.79% 0.80% 0.67% 1.52% 0.00% 0.00% 0.07% 0.00% 0.59% 0.00% 0.00% 0.00% 1.57% 0.00% 0.09% 5.89% 0.69% 6.01% 1.01% 1.56% 3.25% 0.00% 3.78% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.04% 4.64% 0.00% 7.65% 0.00% 1.59% 9.39% 0.00% 0.00% Joanna Gleason (F) Joanna Gleason (F) 100.00% 60.90% 15.73% 33.05% 49.70% 43.14% 28.06% 36.94% 15.21% 18.57% 23.04% 24.86% 77.49% 10.01% 49.70% 44.33% 59.80% 20.14% 56.25% 12.72% 31.07% 29.34% 82.32% 39.03% 52.85% 16.11% 21.22% 30.58% 84.29% 85.98% 71.79% 18.15% 1.58% 0.53% 2.01% 6.84% 6.97% 2.24% 8.23% 4.31% 4.52% 0.00% 0.00% 2.03% 3.80% 4.69% 0.00% 0.00% 5.50% 2.34% 2.04% 2.03% Julia Louis-Dreyfus (F) Julia Louis-Dreyfus (F) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Verna O. Hobson (F) Verna O. Hobson (F) 0.31% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.03% 0.24% 0.00% 0.00% 0.60% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Maureen Osullivan (F) Maureen Osullivan (F) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Julie Kavner (F) Julie Kavner (F) nHna aae tp ihad(otm ihu tracking. without (bottom) and with (top) dataset Hannah on 79.29% 46.78% 88.12% 46.65% 0.00% 2.91% 1.76% 3.40% 0.38% 0.75% 0.00% 4.37% 0.00% 0.39% 0.00% 2.49% 0.00% 0.00% 0.25% 0.75% 0.00% 1.49% 8.45% 0.00% 0.00% 0.00% 1.18% 9.48% 0.00% 4.01% 3.75% 0.00% 0.00% 0.00% 0.32% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.21% 0.00% 1.23% 5.69% 0.00% 0.00% 0.00% 0.00% 6.90% 0.00% Mia Farrow (F) Mia Farrow (F) 21.96% 22.74% 19.02% 16.40% 0.00% 0.00% 1.63% 1.52% 0.09% 0.00% 2.58% 0.30% 1.42% 0.00% 0.00% 0.00% 0.98% 4.72% 0.34% 0.19% 0.52% 1.54% 0.50% 4.22% 0.10% 0.76% 2.13% 2.47% 0.00% 0.00% 7.50% 0.00% 0.00% 0.08% 0.00% 0.00% 2.43% 0.00% 0.00% 0.00% 1.25% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 3.63% 0.00% 0.00% 0.00% 0.00% Diane Wiest (F) Diane Wiest (F) 57.90% 15.81% 32.75% 10.37% 12.85% 10.58% 13.66% 19.06% 62.75% 15.61% 28.21% 20.00% 6.33% 9.04% 5.65% 5.16% 3.63% 1.96% 4.18% 1.89% 2.39% 8.68% 4.76% 2.67% 8.75% 1.03% 4.80% 5.69% 0.00% 0.00% 3.24% 0.00% 0.53% 0.00% 1.59% 9.32% 0.00% 0.00% 2.98% 0.00% 2.74% 0.00% 0.00% 0.00% 0.00% 3.41% 4.51% 1.87% 3.40% 0.85% 2.68% 0.00% Barbara Hershey (F) Barbara Hershey (F) 42.32% 51.61% 46.45% 49.30% 0.00% 0.00% 0.00% 0.68% 0.00% 0.05% 0.06% 0.00% 0.00% 0.05% 0.00% 0.00% 0.05% 5.42% 0.00% 0.00% 4.64% 0.00% 0.26% 0.85% 1.28% 0.00% 0.05% 1.00% 0.00% 0.00% 0.00% 0.62% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.05% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Sam Waterston Sam Waterston 12.25% 73.04% 11.98% 21.11% 22.46% 11.43% 82.05% 0.00% 2.75% 3.95% 0.13% 0.77% 0.07% 4.10% 0.09% 1.74% 0.00% 0.03% 0.35% 0.00% 2.65% 0.00% 4.46% 0.00% 0.00% 0.00% 4.36% 0.00% 2.74% 0.00% 0.00% 0.00% 4.12% 5.89% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.94% 0.00% 6.86% 0.00% 0.00% 0.00% 0.00% 0.00% Woody Allen Woody Allen 0.00% 0.00% 0.16% 0.05% 0.00% 0.00% 0.00% 0.02% 0.00% 0.05% 0.00% 0.00% 0.00% 0.00% 0.03% 0.02% 0.00% 1.12% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Tony Roberts Tony Roberts 0.00% 0.26% 0.00% 0.00% 0.57% 0.47% 0.00% 0.89% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.16% 0.00% 0.02% 0.02% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 2.40% 0.00% 0.00% 0.00% 0.00% Lewis Black Lewis Black 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Paul Bates Paul Bates 74.31% 97.26% 0.00% 0.00% 0.00% 0.36% 0.00% 0.11% 0.06% 0.04% 0.00% 0.00% 0.10% 0.00% 0.00% 0.00% 0.10% 0.00% 0.00% 0.22% 0.00% 1.38% 0.00% 0.06% 0.00% 0.39% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 3.08% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.28% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Fred Melamed Fred Melamed 22.49% 55.51% 10.97% 40.72% 20.32% 66.14% 75.86% 15.00% 60.78% 14.52% 0.00% 0.20% 0.00% 3.92% 0.24% 0.27% 0.10% 0.12% 0.04% 0.00% 0.10% 0.00% 0.00% 0.03% 1.08% 0.00% 6.10% 0.00% 1.26% 0.20% 7.48% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 3.61% 0.00% 0.08% 0.00% 0.02% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Richard Jenkins Richard Jenkins 44.08% 17.92% 52.42% 0.00% 5.30% 0.26% 1.17% 0.22% 0.66% 0.07% 0.11% 0.03% 0.35% 0.00% 0.00% 0.43% 3.06% 2.91% 2.40% 0.47% 4.18% 0.43% 0.00% 0.43% 0.00% 0.00% 0.00% 0.00% 5.03% 0.00% 0.18% 0.00% 0.00% 0.00% 0.00% 0.00% 6.42% 0.00% 0.00% 0.00% 0.00% 0.00% 0.21% 0.00% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Daniel Stern Daniel Stern 0.00% 0.20% 0.00% 0.00% 0.00% 0.04% 0.00% 0.01% 0.14% 0.00% 0.15% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.05% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.25% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% J.T. Walsh J.T. Walsh 81.65% 27.22% 18.44% 45.24% 52.33% 49.77% 92.41% 10.54% 10.68% 32.17% 10.49% 15.54% 63.22% 56.44% 61.10% 4.72% 1.41% 6.21% 3.85% 9.95% 8.93% 7.03% 3.11% 4.77% 2.77% 0.56% 8.10% 8.73% 8.65% 3.04% 0.85% 0.30% 7.10% 0.00% 0.25% 0.98% 0.59% 5.25% 2.12% 5.50% 0.00% 5.27% 0.00% 0.00% 0.00% 0.53% 5.13% 0.00% 0.00% 3.28% 0.00% 0.00% Michael Caine Michael Caine 36.94% 15.09% 14.94% 47.46% 26.72% 22.56% 20.11% 55.37% 0.00% 1.44% 0.39% 3.28% 0.51% 0.07% 0.49% 7.65% 0.00% 0.00% 0.19% 1.78% 0.26% 1.12% 1.93% 1.90% 5.52% 2.61% 0.90% 3.04% 0.00% 1.50% 0.00% 0.35% 0.11% 3.73% 0.07% 0.00% 0.36% 3.28% 0.00% 0.00% 0.00% 0.00% 0.00% 0.96% 0.54% 0.00% 5.84% 2.24% 0.00% 5.24% 0.00% 0.00% John Turturo John Turturo 0.10% 0.00% 0.00% 0.02% 0.00% 2.18% 0.00% 0.03% 0.00% 0.05% 0.00% 1.50% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.05% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.57% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Lloyd Nolan Lloyd Nolan 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Ken Costigan Ken Costigan 0.00% 5.29% 0.00% 7.98% 0.76% 1.40% 0.00% 0.37% 0.00% 0.05% 0.00% 0.00% 0.00% 0.00% 0.19% 0.21% 0.03% 0.21% 0.06% 0.19% 0.00% 0.04% 0.05% 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 6.44% 0.00% 0.22% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Max Von Sydow Max Von Sydow 19.07% 0.00% 0.00% 8.17% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Barry Gibb Barry Gibb 23.28% 17.60% 10.58% 17.01% 0.59% 0.48% 0.00% 5.54% 0.00% 0.42% 0.90% 0.93% 0.00% 0.25% 0.63% 0.00% 0.37% 0.70% 0.00% 0.58% 0.35% 1.87% 0.04% 0.62% 0.45% 0.00% 0.20% 1.30% 0.00% 0.00% 0.00% 4.05% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.07% 0.00% 0.13% 0.01% 1.80% 0.00% 0.05% 0.00% 0.00% 0.00% 0.00% Christian Clemenson Christian Clemenson 100.00% 59.17% 0.00% 0.00% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Bobby Short Bobby Short iue22: Figure Benedikt Hoewedes Benedikt Benedikt Hoewedes Benedikt Javier MAscherano Javier Javier MAscherano Javier Martin Demichelis Martin Martin Demichelis Martin Christoph Kramer Christoph Christoph Kramer Christoph Gonzalo Huguain Gonzalo Gonzalo Huguain Gonzalo Ezequiel Lavezzi Ezequiel Ezequiel Lavezzi Ezequiel Jerome Boateng Jerome Jerome Boateng Jerome Thomas Mueller Thomas Thomas Mueller Thomas Rodrigo Palacio Rodrigo Fernando Gago Fernando Rodrigo Palacio Rodrigo Fernando Gago Fernando Ezequiel Garay Ezequiel Schweinsteiger Ezequiel Garay Ezequiel Sergio Romero Sergio Schweinsteiger Andre Schurrle Andre Mats Hummels Sergio Romero Sergio Andre Schurrle Andre Mats Hummels Miroslav Klose Miroslav Miroslav Klose Miroslav Pabli Zabaleta Pabli Sergio Sergio Aguero Pabli Zabaleta Pabli Manuel Neuer Manuel Sergio Sergio Aguero Manuel Neuer Manuel Marcos Rojo Marcos Lionel Messi Lionel Marcos Rojo Marcos Phillip Lahm Phillip Lionel Messi Lionel Mario Gotze Mario Lucas Biglia Lucas Phillip Lahm Phillip Mario Gotze Mario Lucas Biglia Lucas Mesut Mesut Oezil Mesut Mesut Oezil Enzo Perez Enzo Enzo Perez Enzo Toni Kroos Toni Kroos ofso matrix Confusion Referee Referee 93.31% 100% 0.03% 0.00% 0.08% 0.00% 0.51% 0.04% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.73% 0.08% 0.00% 0.37% 0.00% 0.78% 0.00% 2.09% 0.13% 0.00% 0.00% 0.00% 0.00% 0.07% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Manuel Neuer Manuel Neuer 10.24% 16.38% 18.50% 0.00% 0.00% 0.00% 0.12% 0.24% 1.17% 0.43% 0.43% 2.80% 0.00% 1.22% 0.00% 0.09% 0.00% 9.28% 0.00% 0.63% 4.62% 7.63% 8.61% 3.76% 6.31% 7.95% 2.13% 0.80% 20% 19% 16% 0% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 0% 0% 6% 4% Benedikt Hoewedes Benedikt Hoewedes 24.47% 17.76% 13.53% 12.96% 13.34% 0.06% 0.00% 0.00% 0.00% 0.00% 0.00% 0.11% 0.00% 0.03% 0.00% 1.52% 0.00% 0.58% 0.00% 0.04% 0.21% 5.19% 0.27% 1.86% 0.31% 6.89% 0.37% 0.00% 10% 21% 13% 21% 48% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 9% 9% 0% 0% Mats Hummels Mats Hummels 23.55% 22.71% 62.77% 10.76% 10.92% 10.33% 26.42% 8.71% 0.00% 0.00% 0.00% 0.33% 0.00% 0.00% 0.48% 2.22% 2.52% 0.25% 0.61% 0.00% 0.00% 2.04% 0.00% 1.03% 9.76% 5.33% 7.69% 3.76% 90% 12% 35% 40% 0% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 7% 0% 6% 0% 9% 7% 1% 0% 0% Schweinsteiger Schweinsteiger 14.79% 18.41% 10.07% 0.10% 4.62% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.66% 8.10% 0.00% 0.00% 0.00% 0.00% 0.00% 2.77% 0.00% 0.27% 1.51% 9.54% 2.62% 1.77% 0.55% 23% 14% 10% 21% 0% 0% 0% 0% 0% 0% 0% 2% 8% 0% 0% 0% 0% 0% 8% 0% 0% 0% 0% 0% 0% 0% 0% 0% Mesut Oezil Mesut Oezil 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.03% 0.08% 0.00% 0.00% 0.17% 0.00% 0.09% 0.00% 0.00% 0.00% 0.00% 0.00% 0.06% 0.00% 0.12% 0.07% 0.02% 0.00% 0.31% 0.00% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Miroslav Klose Miroslav Klose 16.95% 39.69% 36.14% 15.73% 11.06% 11.55% 0.00% 0.38% 0.00% 1.42% 0.55% 2.91% 0.10% 0.16% 4.69% 0.38% 2.88% 2.37% 0.63% 0.00% 0.00% 8.82% 4.52% 8.66% 8.03% 2.97% 8.71% 0.00% 83% 64% 33% 17% 16% 11% 17% 23% 0% 0% 0% 0% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 7% 7% 2% 6% 0% 0%

nSce aae tp ihad(otm ihu tracking. without (bottom) and with (top) dataset Soccer on Thomas Mueller Thomas Mueller 12.83% 13.87% 18.56% 24.98% 19.36% 12.21% 18.50% 0.00% 0.00% 0.38% 0.00% 0.00% 0.00% 0.00% 0.98% 0.18% 0.00% 0.00% 0.00% 0.06% 0.00% 0.09% 0.04% 3.41% 6.49% 6.96% 1.72% 9.77% 13% 11% 13% 23% 33% 23% 13% 25% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 7% 0% 0% Phillip Lahm Phillip Lahm 39.93% 15.93% 10.04% 10.25% 0.25% 1.02% 0.46% 0.00% 0.00% 0.96% 0.33% 0.28% 8.10% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 5.76% 2.60% 2.17% 3.85% 6.56% 6.04% 8.87% 1.05% 0.00% 53% 20% 12% 10% 12% 0% 0% 2% 0% 0% 0% 0% 8% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 9% 0% 0% 0% 0% Toni Kroos Toni Kroos 16.09% 16.87% 15.45% 11.32% 12.08% 0.00% 0.00% 2.45% 0.00% 3.67% 7.10% 4.53% 4.13% 4.80% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.40% 0.38% 0.00% 0.00% 0.63% 0.00% 0.33% 14% 31% 25% 14% 12% 0% 0% 0% 0% 0% 6% 0% 7% 3% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Jerome Boateng Jerome Boateng 0.00% 0.00% 0.13% 0.00% 0.00% 0.00% 0.07% 0.10% 0.19% 0.00% 0.00% 0.00% 0.04% 1.35% 0.00% 2.45% 0.00% 0.00% 0.71% 0.35% 1.04% 1.85% 0.00% 0.44% 0.77% 1.67% 0.18% 0.00% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 0% 4% 0% 0% Christoph Kramer Christoph Kramer 13.42% 0.00% 0.04% 0.00% 0.00% 0.41% 0.13% 0.73% 0.17% 2.54% 6.56% 2.35% 0.21% 0.00% 0.00% 0.55% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.36% 0.00% 0.00% 0.00% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 6% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Mario Gotze Mario Gotze 13.92% 0.00% 0.00% 4.58% 0.00% 0.00% 7.05% 4.88% 2.06% 4.73% 0.22% 0.00% 0.31% 1.05% 0.00% 0.00% 1.02% 0.00% 0.00% 0.00% 0.03% 0.69% 1.40% 0.00% 0.00% 0.00% 0.00% 0.00% 25% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5% 5% 0% 7% 0% 0% 0% 0% 0% Andre Schurrle Andre Schurrle 92.29% 0.00% 0.26% 0.00% 0.00% 0.00% 0.03% 0.00% 0.99% 0.03% 0.83% 0.00% 0.32% 0.67% 0.00% 0.11% 0.00% 0.09% 0.00% 0.00% 0.00% 0.13% 0.00% 0.00% 1.34% 0.00% 0.00% 0.00% 95% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% Sergio Romero Sergio Romero 28.15% 39.87% 32.50% 21.82% 16.64% 16.13% 0.00% 8.74% 0.00% 2.90% 5.06% 1.16% 1.22% 3.70% 5.90% 0.11% 4.38% 2.71% 0.00% 0.00% 0.00% 2.72% 2.29% 1.66% 6.95% 0.45% 8.34% 9.45% 32% 27% 16% 19% 28% 46% 0% 0% 0% 3% 0% 0% 0% 8% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 4% 0% 2% 5% Ezequiel Garay Ezequiel Garay 15.27% 0.36% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.06% 0.00% 0.00% 0.00% 0.04% 0.00% 0.00% 0.00% 0.68% 0.51% 0.00% 0.00% 2.02% 0.11% 0.43% 0.00% 0.00% 0.00% 0.00% 3.33% 0% 0% 0% 0% 0% 0% 0% 1% 0% 0% 0% 0% 6% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Pabli Zabaleta Pabli Zabaleta 11.51% 0.00% 5.12% 0.53% 0.00% 2.14% 8.59% 4.64% 5.93% 6.96% 0.00% 5.12% 4.99% 7.05% 0.04% 0.00% 0.00% 0.27% 0.00% 0.00% 0.43% 0.29% 0.00% 0.29% 0.06% 1.41% 0.00% 0.30% 11% 10% 0% 0% 0% 0% 0% 2% 6% 0% 0% 1% 0% 4% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Lucas Biglia Lucas Biglia 24.68% 12.52% 23.35% 0.00% 0.00% 0.00% 0.09% 0.00% 0.00% 0.06% 0.21% 0.00% 0.00% 0.00% 0.00% 0.00% 1.22% 0.00% 2.82% 5.04% 0.00% 0.24% 8.03% 8.16% 1.30% 4.97% 1.73% 3.79% 12% 10% 10% 17% 12% 25% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 0% Enzo Perez Enzo Perez 55.34% 23.66% 58.36% 21.31% 11.87% 14.07% 11.61% 47.99% 38.05% 28.49% 31.46% 19.98% 0.00% 7.68% 0.99% 0.00% 0.00% 2.17% 0.71% 0.00% 1.77% 6.26% 0.40% 0.07% 0.13% 3.65% 2.65% 3.04% 21% 72% 68% 21% 15% 11% 61% 46% 28% 33% 0% 0% 0% 1% 0% 0% 0% 0% 0% 0% 0% 8% 0% 0% 0% 0% 0% 0% Gonzalo Huguain Gonzalo Huguain 10.77% 12.52% 3.55% 0.52% 0.00% 0.00% 0.00% 1.66% 0.22% 0.00% 1.81% 0.12% 0.00% 0.00% 1.04% 0.00% 0.00% 0.06% 0.13% 3.32% 4.58% 3.20% 1.47% 5.46% 5.94% 1.51% 1.34% 0.46% 15% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 0% 9% 1% 0% Lionel Messi Lionel Messi 0.36% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.49% 0.04% 0.00% 0.00% 0.17% 0.00% 0.00% 0.00% 0.86% 0.00% 0.00% 0.00% 0.12% 1.37% 1.60% 2.98% 1.37% 5.45% 0.00% 1.41% 1.37% 22% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5% 0% 0% 0% 0% Javier MAscherano Javier MAscherano 18.82% 0.00% 0.00% 0.00% 0.00% 0.10% 0.04% 0.25% 0.96% 0.00% 0.00% 0.00% 0.04% 0.22% 0.00% 0.00% 0.00% 0.00% 0.83% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 20% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 0% 0% 0% 0% Martin Demichelis Martin Demichelis 0.18% 0.00% 3.05% 0.00% 0.24% 8.91% 2.40% 1.17% 2.39% 0.00% 0.00% 2.94% 2.95% 2.87% 0.09% 0.00% 0.00% 0.00% 0.00% 0.17% 0.00% 0.38% 0.00% 3.54% 0.13% 0.00% 2.22% 0.00% 0% 0% 0% 0% 0% 8% 0% 0% 3% 0% 0% 6% 4% 8% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Marcos Rojo Marcos Rojo 0.00% 0.00% 0.00% 0.00% 0.12% 0.79% 1.49% 3.08% 2.39% 0.00% 0.00% 3.84% 0.00% 1.03% 0.00% 0.00% 0.00% 0.00% 0.06% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.25% 0.00% 0% 0% 0% 0% 0% 0% 0% 6% 2% 0% 0% 7% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Ezequiel Lavezzi Ezequiel Lavezzi 0.00% 0.00% 0.46% 0.00% 0.00% 0.00% 0.00% 0.00% 1.40% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.24% 0.00% 0.00% 0.00% 4.05% 0.00% 0.00% 0.00% 0.00% 0.00% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 9% 0% 0% 0% 0% 0% Fernando Gago Fernando Gago 16.32% 10.96% 10.40% 11.25% 0.62% 5.38% 0.00% 0.12% 2.15% 4.00% 0.08% 1.54% 4.79% 2.33% 0.22% 0.00% 0.00% 0.00% 1.07% 1.21% 0.73% 0.34% 0.76% 0.59% 2.48% 2.76% 0.00% 0.00% 20% 13% 10% 0% 0% 0% 0% 3% 0% 0% 9% 0% 0% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 3% 0% 0% Sergio Aguero Sergio Aguero 56.34% 12.21% 14.50% 17.69% 23.71% 45.53% 29.90% 32.39% 31.00% 16.63% 1.91% 1.42% 2.42% 0.00% 8.53% 4.39% 1.28% 9.52% 5.61% 0.00% 1.51% 7.45% 0.00% 2.98% 1.40% 0.05% 5.55% 0.00% 80% 17% 19% 35% 58% 40% 37% 52% 11% 12% 0% 8% 0% 0% 0% 5% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Rodrigo Palacio Rodrigo Palacio 91.93% 100% 0.00% 0.07% 0.00% 0.00% 0.55% 0.00% 0.46% 0.03% 0.00% 0.00% 0.06% 0.12% 0.04% 0.22% 0.11% 0.00% 0.00% 1.36% 0.22% 0.00% 0.00% 0.00% 0.00% 0.77% 3.03% 0.00% 0.00% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Referee Referee JT ✅ JT ❌

ST ST ❌ ✅

JT ✅ JT ❌

ST ❌ ST ✅

JT ✅ JT ❌

ST ❌ ST ✅

Figure 23: Success and failure cases of separate and joint training of body regions on PIPA dataset. Column one shows the test images and the column two shows the training images belonging to the predicted subject. (Left) shows the success and failure case of joint training (JT) and separate training (ST), respectively and the reverse is shown in (right). ∑i ∑i ✅ ✅

❌ ❌ F F

✅ ✅ Fh Fh

❌ ❌ Fu Fu

∑i ✅ ∑i ✅

❌ ❌ F F

❌ ✅ Fh Fh

✅ ✅ Fu Fu

Figure 24: Effectiveness of multiple classifiers from each PSM: Column one shows the PIPA test images and the column two shows the training images belonging to the predicted subject using different approaches. The four approaches considered are the classifiers trained on head (Fh) and upper body (Fu) features, a classifier trained on concatenated head and upper body P (F) feature, and linear combination of three classifiers ( i) trained on these features. It clearly shows that it is advantageous to consider individual classifiers trained on regional features and their combination for improved performance. base ❌ base + right ✅

base ❌ base + semi-right ✅

base ❌ base + frontal ✅

base ❌ base + semi-left ✅

base ❌ base + left ✅

Figure 25: Success cases of pose-specific models (PSMs) on PIPA dataset. Each row shows the success predictions of our approach where the improvement is obtained primarily due to the specific-pose model i.e., base model wrongly predicts but base + correct PSM predicts correctly. Green and yellow boxes indicate the success and failure result of naeil respectively. ours ours

naeil naeil

ours ours

naeil naeil

ours ours

naeil naeil

Figure 26: Comparison of our approach with naeil on Hannah dataset. Column one shows the test images and the column two shows the training images belonging to the predicted subject. (Left) in green shows the success case of our approach and the failure case of naeil. (Right) in red shows the failure case of our approach and the success case of naeil. ours ours

naeil naeil

ours ours

naeil naeil

ours ours

naeil naeil

Figure 27: Comparison of our approach with naeil on Soccer dataset. Column one shows the test images and the column two shows the training images belonging to the predicted subject. (Left) in green shows the success case of our approach and the failure case of naeil. (Right) in red shows the failure case of our approach and the success case of naeil.