Self-Supervised 3D Human Pose Estimation Via Part Guided Novel Image Synthesis
Jogendra Nath Kundu1*  Siddharth Seth1*  Varun Jampani2  Mugalodi Rakesh1  R. Venkatesh Babu1  Anirban Chakraborty1
1Indian Institute of Science, Bangalore  2Google Research
*Equal contribution. Project page: http://val.cds.iisc.ac.in/pgp-human/

Abstract

Camera-captured human pose is an outcome of several sources of variation. Performance of supervised 3D pose estimation approaches comes at the cost of dispensing with variations, such as shape and appearance, that may be useful for solving other related tasks. As a result, the learned model not only inculcates task-bias but also dataset-bias because of its strong reliance on the annotated samples, which also holds true for weakly-supervised models. Acknowledging this, we propose a self-supervised learning framework to disentangle such variations from unlabeled video frames. We leverage the prior knowledge on human skeleton and poses in the form of a single part-based 2D puppet model, human pose articulation constraints, and a set of unpaired 3D poses. Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, not only facilitates discovery of interpretable pose disentanglement but also allows us to operate on videos with diverse camera movements. Qualitative results on unseen in-the-wild datasets establish our superior generalization across multiple tasks beyond the primary tasks of 3D pose estimation and part segmentation. Furthermore, we demonstrate state-of-the-art weakly-supervised 3D pose estimation performance on both the Human3.6M and MPI-INF-3DHP datasets.

Figure 1. Our self-supervised framework not only produces 3D pose and part segmentation but also enables novel image synthesis via interpretable latent manipulation of the disentangled factors.

1. Introduction

Analyzing humans takes a central role in computer vision systems. Automatic estimation of 3D pose and 2D part-arrangements of highly deformable humans from monocular RGB images remains an important, challenging and unsolved problem. This ill-posed classical inverse problem has diverse applications in human-robot interaction [53], augmented reality [15], the gaming industry, etc.

In a fully-supervised setting [52, 39, 10], the advances in this area are mostly driven by recent deep learning architectures and the collection of large-scale annotated samples. However, unlike 2D landmark annotations, it is very difficult to manually annotate 3D human pose on 2D images. A usual way of obtaining 3D ground-truth (GT) pose annotations is through a well-calibrated in-studio multi-camera setup [20, 50], which is difficult to configure in outdoor environments. This results in limited diversity in the available 3D pose datasets, which greatly limits the generalization of supervised 3D pose estimation models.

To facilitate better generalization, several recent works [7, 46] leverage weakly-supervised learning techniques that reduce the need for 3D GT pose annotations. Most of these works use an auxiliary task such as multi-view 2D pose estimation to train a 3D pose estimator [8, 27]. Instead of using 3D pose GT for supervision, a 3D pose network is supervised with loss functions on multi-view projected 2D poses. To this end, several of these works still require considerable annotations in terms of paired 2D pose GT [7, 57, 42, 28], multi-view images [27], and known camera parameters [46]. Dataset bias still remains a challenge in these techniques as they use paired image and 2D pose GT datasets which have limited diversity. Given the ever-changing human fashion and evolving culture, the visual appearance of humans keeps varying, and we need to keep updating the 2D pose datasets accordingly.

In this work, we propose a differentiable and modular self-supervised learning framework for monocular 3D human pose estimation along with the discovery of 2D part segments. Specifically, our encoder network takes an image as input and outputs three disentangled representations: 1. a view-invariant 3D human pose in a canonical coordinate system, 2. camera parameters, and 3. a latent code representing the foreground (FG) human appearance. Then, a decoder network takes the above encoded representations, projects them onto 2D, and synthesizes the FG human image while also producing a 2D part segmentation. Here, a major challenge is to disentangle the representations for 3D pose, camera, and appearance.
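The decoder's projection step, mapping the canonical 3D pose and predicted camera parameters to 2D keypoints, can be sketched as follows. This is a minimal illustration only: the weak-perspective camera model and the function `project_pose` are our assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def project_pose(pose_3d, R, scale, trans):
    """Project a canonical 3D pose to 2D under an assumed weak-perspective
    camera: rotate into the camera frame, drop depth, then scale and shift.
    pose_3d: (J, 3) joints in the canonical coordinate system.
    R: (3, 3) camera rotation; scale: float; trans: (2,) image offset.
    Returns (J, 2) image-plane keypoints."""
    cam_pose = pose_3d @ R.T               # rotate into the camera frame
    return scale * cam_pose[:, :2] + trans  # orthographic drop of depth

# Toy example: a 3-joint "pose" viewed by an identity camera.
pose = np.array([[0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.5],
                 [1.0, 1.0, -0.5]])
kps = project_pose(pose, np.eye(3), scale=2.0, trans=np.array([0.5, 0.5]))
# With the identity rotation, x and y are simply scaled and shifted.
```

Because every operation is a matrix product or elementwise arithmetic, this projection stays differentiable, which is what lets reconstruction losses on the synthesized 2D outputs flow back to the 3D pose and camera representations.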
We achieve this disentanglement by training on video frame pairs depicting the same person, but in varied poses. We self-supervise our network with consistency constraints across different network outputs and across image pairs. Compared to recent self-supervised approaches that either rely on videos with static background [45] or work with the assumption that temporally close frames have similar background [22], our framework is robust enough to learn from large-scale in-the-wild videos, even in the presence of camera movements. We also leverage the prior knowledge on human skeleton and poses in the form of a single part-based 2D puppet model, human pose articulation constraints, and a set of unpaired 3D poses. Fig. 1 illustrates the overview of our self-supervised learning framework.

Self-supervised learning from in-the-wild videos is challenging due to the diversity in human poses and backgrounds in a given pair of frames, which may be further complicated by missing body parts. We achieve the ability to learn on these wild video frames with a pose-anchored deformation of the puppet model that bridges the representation gap between the 3D pose and the 2D part maps in a fully differentiable manner. In addition, the part-conditioned appearance decoding allows us to reconstruct only the FG human appearance, resulting in robustness to changing backgrounds.

Another distinguishing factor of our technique is the use of well-established pose prior constraints. In our self-supervised framework, we explicitly model 3D rigid and non-rigid pose transformations by adopting a differentiable parent-relative local limb kinematic model, thereby reducing ambiguities in the learned representations. In addition, for the predicted poses to follow the real-world pose distribution, we make use of an unpaired 3D pose dataset. We interchangeably use predicted 3D pose representations and sampled real 3D poses during training to guide the model towards a plausible 3D pose distribution.

Our network also produces useful part segmentations. With the learned 3D pose and camera representations, we model depth-aware inter-part occlusions, resulting in robust part segmentation. To further improve the segmentation beyond what is estimated with pose cues, we use a novel differentiable shape uncertainty map that enables extraction of limb shapes from the FG appearance representation.

We make the following main contributions:
• We present techniques to explicitly constrain the 3D pose by modeling it at its most fundamental form of rigid and non-rigid transformations. This results in interpretable 3D pose predictions, even in the absence of any auxiliary 3D cues such as multi-view or depth.
• We propose a differentiable part-based representation which enables us to selectively attend to the foreground human appearance, which in turn makes it possible to learn on in-the-wild videos with changing backgrounds in a self-supervised manner.
• We demonstrate the generalizability of our self-supervised framework on unseen in-the-wild datasets, such as LSP [23] and YouTube. Moreover, we achieve state-of-the-art weakly-supervised 3D pose estimation performance against existing approaches on both the Human3.6M [20] and MPI-INF-3DHP [36] datasets.

2. Related Works

Human 3D pose estimation is a well-established problem in computer vision, specifically in the fully supervised paradigm. Earlier approaches [43, 63, 56, 9] proposed to infer the underlying graphical model for articulated pose estimation. More recent CNN-based approaches [5, 40, 37] focus on regressing spatial keypoint heat-maps, without explicitly accounting for the underlying limb connectivity information. However, the performance of such models heavily relies on a large set of paired 2D or 3D pose annotations. As a different approach, [26] proposed to regress the latent representation of a trained 3D pose autoencoder to indirectly endorse a plausibility bound on the output predictions. Recently, several weakly-supervised approaches utilize a varied set of auxiliary supervision other than direct 3D pose supervision (see Table 1). In this paper, we address a more challenging scenario where we consider access to only a set of unaligned 2D pose data to facilitate the learning of a plausible 2D pose prior (see Table 1).

In the literature, while several supervised shape and appearance disentangling techniques [3, 34, 33, 13, 49] exist, the available unsupervised pose estimation works (i.e., in the absence of multi-view or camera extrinsic supervision) are mostly limited to 2D landmark estimation [11, 22] for rigid or mildly deformable structures, such as facial landmark detection, constrained torso pose recovery, etc. The general idea [25, 47, 55, 54] is to utilize the relative transformation between a pair of images depicting a consistent appearance with varied pose. Such image pairs are usually sampled from a video satisfying appearance invariance [22] or synthetically generated deformations [47].

Table 1. Characteristic comparison of our approach against prior weakly-supervised human 3D pose estimation works, in terms of access to direct (paired) or indirect (unpaired) supervision levels.

3.1. Joint-anchored spatial part representation
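The idea of a joint-anchored spatial part representation, a 2D part map deformed by anchoring it to joint locations, can be illustrated with a soft limb map rendered between two joints. This is a sketch under our own assumptions (a Gaussian falloff around each limb segment and the hypothetical helper `limb_part_map`), not the paper's exact puppet parameterization.

```python
import numpy as np

def limb_part_map(j_a, j_b, size=32, sigma=1.5):
    """Render a soft spatial map for one limb anchored at 2D joints j_a, j_b.
    Each pixel gets exp(-d^2 / sigma^2), where d is its distance to the
    segment j_a--j_b; the map varies smoothly with the joint locations."""
    ys, xs = np.mgrid[0:size, 0:size]
    p = np.stack([xs, ys], axis=-1).astype(float)       # (H, W, 2) pixel coords
    ab = j_b - j_a
    # Fraction along the segment of each pixel's projection, clamped to [0, 1].
    t = np.clip(((p - j_a) @ ab) / (ab @ ab + 1e-8), 0.0, 1.0)
    closest = j_a + t[..., None] * ab                   # nearest point on limb
    d2 = ((p - closest) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

# One limb between two hypothetical joints; the response peaks on the segment
# and decays smoothly away from it.
m = limb_part_map(np.array([8.0, 8.0]), np.array([24.0, 8.0]))
```

Stacking one such map per limb yields spatial part maps that are differentiable in the projected joint positions, which is the property that lets a part-based decoder pass gradients from image reconstruction back to the pose.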