Human De-Occlusion: Invisible Perception and Recovery for Humans
Qiang Zhou1,*  Shiyin Wang2  Yitong Wang2  Zilong Huang1  Xinggang Wang1†
1School of EIC, Huazhong University of Science and Technology   2ByteDance Inc.

arXiv:2103.11597v1 [cs.CV] 22 Mar 2021

*The work was mainly done during an internship at ByteDance Inc.
†Corresponding author.

Abstract

In this paper, we tackle the problem of human de-occlusion, which reasons about the occluded segmentation masks and invisible appearance content of humans. In particular, a two-stage framework is proposed to estimate the invisible portions and recover the content inside. For the stage of mask completion, a stacked network structure is devised to refine inaccurate masks from a general instance segmentation model and predict integrated masks simultaneously. Additionally, guidance from human parsing and typical pose masks is leveraged to bring prior information. For the stage of content recovery, a novel parsing guided attention module is applied to isolate body parts and capture context information across multiple scales. Besides, an Amodal Human Perception (AHP) dataset is collected to settle the task of human de-occlusion. AHP has the advantages of providing annotations from real-world scenes, and its number of humans is comparatively larger than in other amodal perception datasets. Based on this dataset, experiments demonstrate that our method outperforms the state-of-the-art techniques in both tasks of mask completion and content recovery. Our AHP dataset is available at https://sydney0zq.github.io/ahp/.

Figure 1. (a) The pipeline of our framework to tackle the task of human de-occlusion, which contains two stages of mask completion and content recovery. (b) Some applications which demonstrate that better results can be obtained after human de-occlusion.

1. Introduction

Visual recognition tasks have witnessed significant advances driven by deep learning, such as classification [21, 14], detection [38, 10] and segmentation [28, 51]. Despite the achieved progress, amodal perception, i.e., reasoning about the occluded parts of objects, is still challenging for vision models. In contrast, it is easy for humans to interpolate the occluded portions with the human visual system [59, 34]. To reduce the recognition ability gap between models and humans, recent works have proposed methods to infer the occluded parts of objects, including estimating the invisible segmentation [25, 59, 37, 50, 54] and recovering the invisible content of objects [6, 50, 54].

In this paper, we aim at the problem of estimating the invisible masks and the appearance content of humans. We refer to this task as human de-occlusion. Compared with general amodal perception, human de-occlusion is a more special and important task, since completing the human body plays a key role in many vision tasks, such as portrait editing [1], 3D mesh generation [19], and pose estimation [11], as illustrated in Fig. 1 (b). Different from the common object de-occlusion task (e.g., vehicles, buildings, furniture), human de-occlusion presents new challenges in three aspects. First, humans are non-rigid bodies and their poses vary dramatically. Second, the context relations between humans and backgrounds are relatively weaker than those of common objects, which are usually limited to specific scenes. Third, to segment and recover the invisible portions of humans, the algorithm should be able to recognize the occluded body parts so as to be aware of symmetrical or interrelated patches.

To precisely generate the invisible masks and the appearance content of humans, we propose a two-stage framework to accomplish human de-occlusion, as shown in Fig. 1 (a). The first stage segments the invisible portions of humans, and the second stage recovers the content inside the regions obtained in the previous stage. In the testing phase, the two stages are cascaded in an end-to-end manner.

Our contributions are summarized as follows:

• We propose a two-stage framework for precisely generating the invisible parts of humans. To the best of our knowledge, this is the first study which considers both human mask completion and content recovery.

• A stacked network structure is devised to do mask completion progressively, which specifically refines inaccurate modal masks. Additionally, human parsing and typical poses are introduced to bring prior cues.

• We propose a novel parsing guided attention (PGA) module to capture body part guidance and context information across multiple scales.

• A dataset, namely AHP, is collected to settle the task of human de-occlusion.

Previous methods [54, 6, 50] adopt a perfect modal¹ mask as an extra input (e.g., a ground-truth annotation or a well-segmented mask) and are expected to output the amodal mask. However, obtaining ground-truth or accurately segmented masks is non-trivial in actual applications. To address this issue, our first stage utilizes a stacked hourglass network [31] to segment the invisible regions progressively. Specifically, the network first applies an hourglass module to refine the inaccurate input modal mask; then a secondary hourglass module is applied to estimate the integrated amodal mask. In addition, our network benefits from human parsing and typical pose masks: inspired by [22, 50], human parsing pseudo labels serve as auxiliary supervision to bring prior cues, and typical poses are introduced as references.
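As a minimal illustration, the progressive, two-step structure of this first stage can be sketched as follows. Both modules are placeholder callables standing in for the two learned hourglass networks; all names, shapes, and the union-with-prior heuristic are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def refine_modal(image, coarse_modal):
    """Placeholder for the first hourglass module: clean up the inaccurate
    modal (visible) mask. Here we simply binarize; the real module is a
    learned CNN refining the instance segmentation output."""
    return (coarse_modal > 0.5).astype(np.float32)

def complete_amodal(image, modal, pose_prior):
    """Placeholder for the second hourglass module: estimate the integrated
    amodal mask from the refined modal mask and a typical-pose prior mask.
    A union with the prior stands in for learned completion."""
    return np.clip(modal + pose_prior, 0.0, 1.0)

def stage1(image, coarse_modal, pose_prior):
    """Progressive mask completion: refine the modal mask, then complete it."""
    modal = refine_modal(image, coarse_modal)
    amodal = complete_amodal(image, modal, pose_prior)
    return modal, amodal

# Toy inputs: a 2x2 "image", a noisy modal mask, and a pose prior mask.
image = np.zeros((2, 2), dtype=np.float32)
coarse = np.array([[0.9, 0.2], [0.8, 0.1]], dtype=np.float32)
prior = np.array([[1.0, 0.0], [1.0, 1.0]], dtype=np.float32)
modal, amodal = stage1(image, coarse, prior)
```

The property the sketch preserves is the one the paper relies on: the predicted amodal mask always contains the refined modal mask.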
In the second stage, our model recovers the appearance content inside the invisible portions. Different from typical inpainting methods [36, 27, 52, 30, 49], the context information inside humans should be sufficiently explored. To this end, a novel parsing guided attention (PGA) module is proposed to capture and consolidate context information from the visible portions to recover the missing content across multiple scales. Briefly, the module contains two attention streams. The first path isolates different body parts to obtain the relations of the missing parts with the other parts. The second spatially calculates the relationship between the invisible and the visible portions to capture contextual information. The two streams are then fused by concatenation, and the module outputs an enhanced deep feature. The module works at different scales for stronger recovery capability and better performance.

As most existing amodal perception datasets [59, 37, 16] focus on amodal segmentation and involve few human images, an amodal perception dataset specific to the human category is required to evaluate our method. On account of this, we introduce an Amodal Human Perception dataset, namely AHP. There are three main advantages of our dataset: a) the number of humans in AHP is comparatively larger than in other amodal perception datasets; b) the occlusion cases synthesized from AHP have amodal segmentation and appearance content ground-truths from real-world scenes, hence subjective judgements and consistency of the invisible portions from individual annotators are unnecessary; c) occlusion cases with an expected occlusion distribution can be readily obtained, whereas existing amodal perception datasets have fixed occlusion distributions that may not match some scenarios.

2. Related Work

Amodal Segmentation and De-occlusion. The task of modal segmentation is to assign a categorical or instance label to each pixel in an image, including semantic segmentation [28, 4, 57, 33] and instance segmentation [38, 13, 3]. Differently, amodal segmentation aims at segmenting the visible regions and estimating the invisible regions, i.e., the amodal mask, for each instance. Research on amodal segmentation emerges from [25], which iteratively enlarges the detected box and recomputes the heatmap of each instance based on modally annotated data. Afterwards, AmodalMask [59], MLC [37], ORCNN [8] and PC-Nets [54] push the field of amodal segmentation forward by releasing datasets or techniques.

De-occlusion [54, 6, 50] targets predicting the invisible content of instances. It is different from general inpainting [27, 52, 30, 49], which recovers missing areas (manual holes) to make the recovered images look reasonable. At present, de-occlusion usually depends on the invisible masks predicted by amodal segmentation. SeGAN [6] studies amodal segmentation and de-occlusion of indoor objects with synthetic data. Yan et al. [50] propose an iterative framework to complete cars. Xu et al. [48] aim at portrait completion without involving amodal segmentation. Compared with them, we jointly tackle the tasks of human amodal segmentation and de-occlusion. Moreover, humans commonly have larger variations in shape and texture details than rigid vehicles and furniture.

Amodal Perception Datasets. Amodal perception is a challenging task [8, 37, 59, 6] aiming at recognizing the occluded parts of instances. Zhu et al. [59] point out that humans are able to predict occluded regions with a high degree of consistency even though the task is ill-posed. There are some pioneers building amodal segmentation datasets.
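Across these datasets and methods, the annotated quantities relate by simple mask algebra: the invisible region that de-occlusion must complete is the amodal mask minus the modal one. A minimal sketch, on toy masks not drawn from any dataset:

```python
import numpy as np

# modal = visible pixels of the instance; amodal = its full extent,
# visible plus occluded. Toy 3x3 example of a partially occluded human.
modal = np.array([[0, 1, 1],
                  [0, 1, 1],
                  [0, 0, 0]], dtype=bool)
amodal = np.array([[0, 1, 1],
                   [0, 1, 1],
                   [0, 1, 1]], dtype=bool)

invisible = amodal & ~modal                  # region to be completed
occlusion_rate = invisible.sum() / amodal.sum()
print(invisible.sum(), round(occlusion_rate, 3))  # prints: 2 0.333
```

By construction the invisible region is disjoint from the modal mask, which is why mask completion (predicting `amodal`) must precede content recovery (filling `invisible`).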
COCOA [59] elaborately annotates modal and amodal masks for 5,000 images in total, which originate from COCO [26]. To alleviate the problem that the data scale of COCOA is

¹The terms "modal" and "amodal" are used to refer to the visible and the integrated portions of an object, respectively.

Figure 2. Illustration of the refined mask completion network. A pretrained instance segmentation network Ni is applied to obtain the initial modal mask Mi from the input image Is, which contains an occluded human. (The diagram shows the two stacked hourglass modules, a reweighting convolution, and a mask discriminator Dm.)
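To make the two-stream design of the PGA module from Section 1 concrete, here is a minimal single-scale sketch. The feature sizes, the per-part average pooling used for the first stream, the dot-product attention used for the second, and the final 1x1-convolution fusion are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
feat = rng.standard_normal((C, H * W))      # flattened deep feature map
parsing = rng.integers(0, 3, H * W)         # pseudo body-part label per pixel
visible = np.zeros(H * W, dtype=bool)
visible[: H * W // 2] = True                # first half of pixels is visible

# Stream 1: isolate body parts. Average-pool features within each part and
# broadcast back, so every position carries its part-level context.
part_feat = np.zeros_like(feat)
for p in np.unique(parsing):
    idx = parsing == p
    part_feat[:, idx] = feat[:, idx].mean(axis=1, keepdims=True)

# Stream 2: spatial attention restricted to visible positions, so occluded
# positions gather context only from what is actually observed.
scores = feat.T @ feat                      # (HW, HW) feature similarity
scores[:, ~visible] = -np.inf               # mask out invisible keys
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)     # rows sum to 1
ctx_feat = feat @ attn.T                    # visible-only context per position

# Fuse the two streams by concatenation, then a 1x1 conv (a matmul here),
# yielding the enhanced deep feature the module outputs.
fused = np.concatenate([part_feat, ctx_feat], axis=0)   # (2C, HW)
w = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
out = w @ fused                                         # (C, HW)
```

In the paper this fusion is applied at multiple feature scales; the sketch shows a single scale only.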