Arxiv:2106.14166V1 [Cs.CV] 27 Jun 2021

Indoor Panorama Planar 3D Reconstruction via Divide and Conquer Cheng Sun1;2 Chi-Wei Hsiao1 [email protected] [email protected] Ning-Hsu Wang1 Min Sun1;3 Hwann-Tzong Chen1;4 [email protected] [email protected] [email protected] Abstract Indoor panorama typically consists of human-made structures parallel or perpendicular to gravity. We leverage this phenomenon to approximate the scene in a 360-degree image with (H)orizontal-planes and (V)ertical-planes. To this end, Figure 1: Planar surfaces of human-made structures are we propose an effective divide-and-conquer strategy that mostly horizontal or vertical with respect to the gravity di- divides pixels based on their plane orientation estimation; rection. Given an RGB panorama, we propose to model a then, the succeeding instance segmentation module conquers 3D scene as horizontal or vertical planes (the H&V-planes). the task of planes clustering more easily in each plane orientation group. Besides, parameters of V-planes depend on scenes, which would cost more computational time and re- camera yaw rotation, but translation-invariant CNNs are less sources. As 360° devices get popularized, the amount of aware of the yaw change. We thus propose a yaw-invariant 360° data has significantly increased, with many panorama V-planar reparameterization for CNNs to learn. We create datasets [4, 2, 26] being released to facilitate learning-based a benchmark for indoor panorama planar reconstruction methods. By taking input in 360° format, 3D reconstruc- by extending existing 360 depth datasets with ground truth tion of an entire scene can be done with only one snapshot. H&V-planes (referred to as “PanoH&V” dataset) and adopt Considering the benefit of real-world 360° data in planar state-of-the-art planar reconstruction methods to predict reconstruction and the gap of existing literature in this re- H&V-planes as our baselines. Our method outperforms the search field, we believe the planar reconstruction task from baselines by a large margin on the proposed dataset. Code is panorama imagery is worthy of investigation. available at https://github.com/sunset1995/PanoPlane360. In this work, we construct the first real-world indoor 360° H&V-plane dataset (PanoH&V dataset), where we sim- 1. Introduction plify the planar reconstruction task by focusing on horizontal and vertical planes (illustrated in Fig.1). To this end, we Reconstructing planar surfaces from single-view images extend existing large-scale 360° datasets by fitting the pro- have many applications such as interior modeling, AR/VR, vided depth modality with horizontal and vertical planes. arXiv:2106.14166v2 [cs.CV] 9 Sep 2021 robot navigation, and scene understanding. Generally, an The H&V-planes are similar to the concept of the Manhattan indoor scene consisting of human-made structures can be world [5] and the Atlanta world [22], but we only constrain approximated by a small set of dominant planes, making the extracted planes to be vertical or horizontal without as- planar reconstruction suitable for 3D indoor modeling. suming other inter-plane relationship. With the extended Recent works on planar reconstruction [17, 18, 20] em- modality from large-scale H&V-planes, we train two cur- ploy state-of-the-art instance segmentation methods and rent state-of-the-art planar reconstruction methods, PlaneR- achieve promising results. However, these works are mostly CNN [17] and PlaneAE [32], and report their performance trained on the planar datasets derived from ScanNet [6] to serve as the strong baselines in our benchmark. or NYUv2 [23] with a small field-of-view (FoV). Data We find existing planar reconstruction methods subopti- of this kind require multiple images to reconstruct entire mal when applied to the presented PanoH&V dataset. First, existing methods employ instance segmentation to detect 1 National Tsing Hua University planes from visual cue with less consideration about the 2ASUS AICS Department 3Joint Research Center for AI Technology and All Vista Healthcare estimated geometry. In practice, some planes are easier to 4Aeolus Robotics differentiate through geometry instead of visual appearance (e.g., walls with similar appearance). Second, plane parame- refining the inter-planar relationship. ters depend on camera poses, but 360° camera yaw rotations Previous works rarely use the plane geometry prior when are left-right circular shifts on equirectangular images and segmenting plane instances. Plane parameter estimation may hard for the translation-invariant CNNs to observe. In con- correlate with instance segmentation either via loss [18, 29, trast to the existing works, our method addresses the above 32] or by an additional module for refinement [17, 21]. In issues appropriately: i) We use a divide-and-conquer strategy. contrast, our method directly groups pixels based on plane The proposed surface orientation grouping distinguishes pix- orientations so that the plane segmentation module can detect els of different plane orientations, so the succeeding instance unique planes in each group separately. segmentation module applied to each group can focus on a Estimating per-pixel surface normals for vertical planes simpler subproblem. ii) We then propose a residual form from an equirectangular image is challenging for CNNs. yaw-invariant parameterization for V-planar geometry such Specifically, vertical-plane parameters depend on 360° cam- that it is independent of the camera yaw rotation. We show era’s yaw rotation, but the counterpart left-right circular that the yaw-invariant parameterization brings significant shifting on the equirectangular image is less discerned by improvements in V-plane orientations estimation, which also the translation-invariant CNNs. Although the surface normal benefits other methods. is a fundamental property to many 360° applications aside We summarize the contribution of this work in two as- from the planar reconstruction, existing methods [30, 24, 28] pects. In terms of technical contribution, the proposed which estimate surface normal from an equirectangular im- method consists of i) a divide-and-conquer strategy for the age are less aware of the 360° camera yaw ambiguous prob- task of plane instance segmentation, which exploits the esti- lem. A workaround by [8] is to use CoordConv [19] to mated plane orientations to divide the task into multiple sim- make the model condition on image u-coordinates so that pler subproblems; ii) a yaw-invariant vertical plane param- the yaw ambiguous problem in 360° is alleviated. However, eterization addressing the 360° yaw ambiguity, which can this relies on the deep net capability to learn the relationship also boost other existing methods. In terms of system con- between the u-coordinates and the plane orientations. On tribution, we construct a new real-world 360° piece-wise the contrary, we propose a yaw-invariant parameterization planar benchmark, which focuses on evaluating horizontal for vertical planes, which solves the yaw-rotation ambiguous and vertical planes. Finally, our approach outperforms the problem adequately. two strong baselines adapted from existing state-of-the-art planar reconstruction methods on the new benchmark. 3. PanoH&V dataset 2. Related work In this section, we first introduce the large-scale panoramic public datasets used to construct our dataset Reconstructing 3D planes from an image involves two (Sec. 3.1). We then show the statistical analysis on these subtasks: segmenting plane instances and estimating plane datasets to support the validity of scene approximation geometric parameters. To solve the problems, PlaneNet [18] with H&V-planes (Sec. 3.2). Finally, we outline the auto- trains CNN and DCRF that reconstruct a fixed number of matic H&V-plane annotation algorithm from depth modality planes by estimating plane parameters and plane segmen- (Sec. 3.3). tation masks both in an instance-wise manner. PlaneRe- 3.1. Panorama dataset sources cover [29] also predicts a fixed number of planes but learns directly from depth modality with plane structure-induced We construct our dataset from three public 360° RGB-D loss. Recent state-of-the-art approaches relax the constraint datasets, including Matterport3D [4] and Stanford2D3D [2] on the number of planes by exploiting popular frameworks in real-world scenes, and Structure3D [34] in synthetic en- in instance segmentation. PlaneRCNN [17] modifies the vironments. These three datasets consist of large-scale ag- two-stage architecture Mask R-CNN [11] with object cate- gregation of panoramic RGB images and depth maps. All gory classification replaced by plane geometry prediction, panorama images in this work are represented in the equirect- followed by a network to refine the segmentation masks. angular format with resolution of 512 × 1024. Similar PlaneAE [32] predicts per-pixel plane parameters and adopts to [17, 18] deriving plane modality to train the learning- associative embedding [3, 9, 16, 20], which trains a network based method, our annotations of H&V-planar masks and to map each pixel to embedding space and then clusters parameters are derived from the ground-truth 360° depths. the embedded pixels to generate instances. DualRPN [13] We assume all panorama images are aligned with the groups planes into object and layout categories, each with gravity direction. In case that g-sensor and tripod are its network branch, and infers plane representations for both not equipped with the 360° camera, and the image is not the visible and occlusion

Load more