arXiv:2001.05613v2 [cs.CV] 14 Oct 2020
Synergetic Reconstruction from 2D Pose and 3D Motion for Wide-Space Multi-Person Video Motion Capture in the Wild

Takuya Ohashi1,2 Yosuke Ikegami2 Yoshihiko Nakamura2
1NTT DOCOMO 2The University of Tokyo
[email protected] [email protected] [email protected]

Figure 1: All futsal players’ motions were captured using 12 video cameras surrounding the court. (left) Input images and reprojected joint position. (right) Bone CG drawing based on the calculated joint angles.

Abstract

Although many studies have investigated markerless motion capture, the technology has not been applied to real sports or concerts. In this paper, we propose a markerless motion capture method with spatiotemporal accuracy and smoothness from multiple cameras in wide-space and multi-person environments. The proposed method predicts each person’s 3D pose and determines a sufficiently small bounding box in each of the multi-camera images. This prediction and spatiotemporal filtering based on a human skeletal model enable 3D reconstruction of the person with high accuracy. The accurate 3D reconstruction is then used to predict the bounding box of each camera image in the next frame. This is feedback from the 3D motion to 2D pose, and provides a synergetic effect on the overall performance of video motion capture. We evaluated the proposed method using various datasets and a real sports field. The experimental results demonstrate that the mean per joint position error (MPJPE) is 31.5 mm and the percentage of correct parts (PCP) is 99.5% for five people dynamically moving while satisfying the range of motion (RoM). Video demonstration, datasets, and additional materials are posted on our project page (http://www.ynl.t.u-tokyo.ac.jp/research/vmocap-syn).

1. Introduction

Human motion data are widely used in various fields, e.g., sports training, CG production, rehabilitation, medical diagnosis, behavioral understanding, and even humanoid robot operation [43, 32, 37]. Various motion capture methods have been developed to obtain such data, e.g., optical motion capture, where reflective markers are attached to characteristic parts of the body, and these 3D positions are then measured [1, 5]. Inertial motion capture uses IMU sensors attached to body parts, and then, the positions are calculated using sensor speed [6, 2]. Markerless motion capture uses a depth camera or single/multiple RGB video cameras [34, 38, 3, 4]. However, although various methods for using motion data exist, this technology is only used in limited locations. Few examples of motion capture being used in locations with practical value (e.g., sports matches, concerts, and public roadways) have been reported.

This raises the question “why are motion data not captured in the real world?” Motion capture under real-world conditions is challenging because human motion is continuous; thus, the motion data must also be continuous, i.e., parts or all of the body must not be lost at any time. However, the real world has three specific factors that make motion capture difficult. The first difficulty is the existence of multiple subjects, which causes occlusion and requires individual identification and tracking. The second difficulty is related to the large measurement field. A wider measurement field incurs greater calibration error; however, precise calibration is required in motion capture. In addition, the measurement field can sometimes be open, i.e., people can enter and exit the field. The third difficulty is derived from real-world environments, which are not ideal and restrict measurement conditions. For competitive sporting events or concerts, measurement constraints must be avoided, e.g., markers, IMU sensors, or specific shirts/pants. Furthermore, other constraints exist, e.g., taking measurements in severe lighting conditions or being unable to set the sensor at the desired position. Due to these various difficulties, even with the latest technology, motion capture under real-world conditions has not been fully developed.

In this paper, we discuss multi-person video motion capture, which means image-based 3D human motion reconstruction with spatiotemporal accuracy and smoothness even in a challenging multi-person environment, by extending the single-person video motion capture method [33]. In the proposed method, multiple synchronized calibrated cameras are used to record video images of human subjects from different directions. A human skeletal model is also used to reconstruct 3D motion by spatiotemporal filtering of joint movements. The key concept of the proposed method is predicting each person’s 3D pose and determining a sufficiently small bounding box. Using this bounding box, the keypoint positions of each subject in each image are estimated using a top-down pose estimation approach [42, 36]. The estimated positions are received as part confidence maps (PCM), which express the probability of the keypoint existence at each pixel location as continuous values in the range [0, 1]. Probable keypoint positions can be calculated using the PCM of multi-camera images and a predicted past 3D motion. Then, the skeletal model’s current 3D pose is reconstructed by minimizing the error between the probable keypoint positions and the skeletal model’s corresponding joint positions. The reconstructed 3D motion is then used to predict the bounding box of each camera image in the next frame. This feedback from 3D motion to 2D pose provides a synergetic effect on the overall video motion capture performance.

The proposed method was quantitatively evaluated using various datasets [10] (see the project page). We also applied the proposed method to actual futsal matches to evaluate it in real-world environments. Additionally, the proposed method uses inverse kinematics (IK) for optimization; thus, it is possible to calculate not only the joint positions but also the joint angles considering the range of motion (RoM). As a qualitative evaluation, bone CG was generated using the joint angles, as shown in Fig. 1.

2. Related work

2.1. Single-view pose estimation

Human 2D pose estimation from a single image is the task of detecting human keypoint positions, e.g., knees and shoulders, in an image. Typically, two approaches are used: the top-down approach, which first detects the positions of multiple people in an image as a bounding box and then estimates the keypoint positions of a single person in the cropped image [22, 15, 42, 36], and the bottom-up approach, which first estimates the 2D keypoint positions of all people in the entire image and then associates the positions for each person [40, 14, 25, 16]. In general, top-down approaches are more accurate, and bottom-up approaches are faster. However, a top-down approach is heavily dependent on human detection results for accuracy; therefore, estimation is likely to fail in environments with severe occlusion.

In recent years, several studies have estimated human 3D poses only from a single image by extending detected 2D keypoint positions to 3D spaces [31, 27, 8], directly estimating 3D poses [29, 12, 24, 48, 28], and estimating not only poses but detailed body shapes [41, 20]. However, 3D pose estimation from a single image is a fundamentally ill-posed problem because various assumptions must be made. Therefore, the estimation accuracy obtained in a complex environment, e.g., a multi-person environment, is inferior to that of methods that use multiple cameras.

2.2. Multi-view 3D pose estimation

Previous studies have investigated 3D pose estimation using multiple cameras. Most early research efforts extracted a person region from an image, considered the region of the human body in 3D space, and continuously tracked the region over time [17, 35]. This tracking-based approach can independently estimate motion of the subject’s pose and has achieved remarkable results. However, for preparation, it is necessary to create a detailed human model (including clothing). Thus, this approach may fail depending on the light conditions, backgrounds, and clothing of the subject.

In recent years, as 2D pose estimation methods have achieved remarkable results, approaches that combine 2D pose estimation and multi-view geometry have been assessed, e.g., reconstructing estimated keypoints in 3D [23, 18, 13] and comparing 3D keypoint probability and 3D pictorial structure [10, 11, 19]. However, most of these methods do not reconstruct 3D keypoint positions when the 2D keypoint is undetected or falsely detected. As a result, continuity, which is essential for motion capture, is lost. One recent study [46] combined per-view parsing, cross-view matching, and temporal tracking and achieved fast, high-performance multi-person motion capture. However, this study uses a bottom-up pose estimator, so the recognition performance depends largely on the pixel resolution of the person in the image. Therefore, it is difficult to use it in a wide field like a soccer court.

Figure 2: Flowchart of proposed multi-person video motion capture method. (Block labels: Multi-View Images; Subject-Specific Skeletal Model; Initialization; Choose Camera from Each Viewpoint and Predict Bounding Box; 2D Keypoint Estimation; Spatiotemporal 3D Reconstruction; Time Series Data of Joint Positions & Angles.)

A previous study [33] proposed a method that uses a bottom-up approach [40, 14] from multiple cameras to estimate 3D keypoint positions, and applies filtering based on the human skeletal model and continuity of joint movements. This method demonstrates high-accuracy and smooth motion capture using a few cameras. However, this method presented three difficulties. First, this method

40 degrees of freedom (DoF), as shown in Fig.3[a]. Then, the 3D pose in the next time frame is accurately calculated.
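
The feedback from 3D motion to 2D pose described in the introduction, in which a predicted 3D pose determines the bounding box in each camera image for the next frame, can be illustrated with a minimal sketch. The sketch rests on assumptions that are not taken from the paper: the next-frame pose is guessed by constant-velocity extrapolation of the two most recent frames, each camera is a pinhole camera described by a 3x4 projection matrix, and the box is padded by a fixed pixel margin. The function names `extrapolate_joints`, `project`, and `predict_bbox` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
BBox = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def extrapolate_joints(prev: List[Point3D], curr: List[Point3D]) -> List[Point3D]:
    """Guess next-frame joint positions by constant velocity.

    Assumption of this sketch only: x_next = x_curr + (x_curr - x_prev).
    """
    return [
        tuple(2.0 * c - p for p, c in zip(pj, cj))
        for pj, cj in zip(prev, curr)
    ]


def project(P: Sequence[Sequence[float]], X: Point3D) -> Tuple[float, float]:
    """Project a 3D point to pixel coordinates with a 3x4 pinhole matrix P."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(row[k] * xh[k] for k in range(4)) for row in P)
    if w <= 0.0:
        raise ValueError("point is behind the camera")
    return (u / w, v / w)


def predict_bbox(
    P: Sequence[Sequence[float]], joints: List[Point3D], margin: float = 20.0
) -> BBox:
    """Bounding box of the projected joints, padded by `margin` pixels.

    The margin value is an illustrative choice, not a value from the paper.
    """
    pts = [project(P, X) for X in joints]
    us = [p[0] for p in pts]
    vs = [p[1] for p in pts]
    return (min(us) - margin, min(vs) - margin, max(us) + margin, max(vs) + margin)


if __name__ == "__main__":
    # Hypothetical camera: focal length 1000 px, principal point (500, 300).
    P = [[1000.0, 0.0, 500.0, 0.0],
         [0.0, 1000.0, 300.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
    prev = [(0.0, 0.0, 4.0), (0.5, -1.0, 4.0)]
    curr = [(0.25, 0.0, 4.0), (0.75, -1.0, 4.0)]
    print(predict_bbox(P, extrapolate_joints(prev, curr)))
```

In a multi-camera setting, `predict_bbox` would be called once per camera with that camera's projection matrix, and the resulting crop would be passed to a top-down 2D keypoint estimator. A tighter box keeps other people out of the crop, at the cost of relying more heavily on the quality of the predicted pose.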