SuperPlane: 3D Plane Detection and Description from a Single Image
2021 IEEE Virtual Reality and 3D User Interfaces (VR)

Weicai Ye*1  Hai Li†1  Tianxiang Zhang2  Xiaowei Zhou‡1  Hujun Bao§1  Guofeng Zhang¶1
1State Key Lab of CAD&CG, Zhejiang University
2Beijing Institute of Spacecraft System Engineering

*e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]
§e-mail: [email protected]
¶e-mail: [email protected], the corresponding author

ABSTRACT

We present a novel end-to-end plane detection and description network named SuperPlane to detect and match planes in two RGB images. SuperPlane takes a single image as input, and extracts 3D planes and generates the corresponding descriptors simultaneously. A mask-attention module and an instance-triplet loss are proposed to improve the distinctiveness of the plane descriptors. For image matching, we also propose an area-aware Kullback-Leibler (KL) divergence retrieval method. Extensive experiments show that the proposed method outperforms state-of-the-art methods and retains good generalization capacity. Applications in image-based localization and augmented reality also demonstrate the effectiveness of SuperPlane.

Index Terms: Human-centered computing—Human computer interaction (HCI)—Interaction paradigms—Mixed / augmented reality; Computing methodologies—Artificial intelligence—Computer vision—Computer vision problems

Figure 1: Weak-texture challenge in man-made scenes. The left column shows the query image and the right column shows the reference image [29]. Feature point based methods fail to extract enough matching points to group the plane (first row), while SuperPlane (second row) can directly detect the planes and generate their descriptions. Matched planes are indicated with the same color.

1 INTRODUCTION

Finding the correspondence between different views is a key problem in 3D vision tasks such as augmented reality (AR) applications [2–4, 38] and image-based localization (IBL) [7, 22, 31, 32, 34]. In AR applications, virtual objects are often placed on extracted planes [15, 18]. Traditional plane extraction usually follows this paradigm: triangulate the matched feature points into 3D points from multiple views, then estimate the plane parameters by clustering and expanding the 3D points. However, it is non-trivial to obtain enough matching feature points in challenging conditions, such as textureless scenes (Fig. 1). Some methods directly perform depth estimation and then triangulate the plane, so that virtual objects can be placed on it, but they cannot differentiate semantically different areas. For example, a wall and a door may have the same depth, so only a single plane is detected, which is insufficient to realize the AR effect of hanging a hat on the door. Human-made scenes generally contain rich planar structures, and human perception of the world may be based on individual planar features rather than low-level feature points or global image features. Mid-level features such as plane structure can simulate, to some extent, the way humans perceive the world. In view of this, we highlight that plane detection and description deserve more attention.

In the image-based localization (IBL) task, most existing methods utilize global [1, 20] or semantic [10, 23, 33] features, which are highly influenced by salient regions and sensitive to dynamic objects [8], such as moving people, resulting in ambiguous matches. Although SFRS [8] introduces image-to-region supervision for training image features in a self-supervised manner, it ignores multi-region-to-multi-region supervision.

Recently, several works [16, 17, 37, 39] have emerged that detect planes from a single image. Note that all of them only emphasize plane detection and ignore plane description. PlaneMatch [28] introduces an RGB-D patch descriptor designed for detecting coplanar surfaces in RGB-D reconstruction, but it overlooks the fact that even a small patch may still consist of multiple small planes, so using such a patch directly to establish constraints may cause errors. It also cannot detect planes and generate the corresponding descriptors from a single image. These existing methods do not perform joint plane detection and description because of quite a few challenging problems, such as how to determine the number of planes and how to describe the detected planes. How to construct a dataset to supervise plane detection and description is also a critical issue.

For the aforementioned problems, we offer the following analysis. Plane detection should be related to object instances in the real world, so as different images are obtained, the number of detected planes should also change. The plane descriptor should retain its discriminative ability under viewpoint and even illumination changes. We can follow a plane detection network such as PlaneRCNN [16] to detect the planes and construct triplet samples as supervision for the corresponding plane descriptors. The triplet samples consist of the detected planes, not the complete images.
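To make this form of supervision concrete, the sketch below shows one way a triplet margin loss over plane-instance descriptors could be implemented in PyTorch. It is a minimal sketch, not the paper's exact formulation: the function name, the margin value, and the assumption of L2-normalized descriptors are ours.

import torch.nn.functional as F

def instance_triplet_loss(anchor, positive, negative, margin=0.3):
    """anchor/positive/negative: (N, D) L2-normalized plane descriptors.

    Each triplet is built from detected plane instances: the anchor and
    positive describe the same physical plane seen in two views, while
    the negative describes a different plane.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # distance to matching plane
    d_neg = F.pairwise_distance(anchor, negative)  # distance to non-matching plane
    # Push matching planes closer than non-matching ones by the margin.
    return F.relu(d_pos - d_neg + margin).mean()

Because the triplets are formed from detected plane instances rather than whole images, the loss directly supervises the distinctiveness of each plane descriptor.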
Based on the above considerations, we propose a novel framework named SuperPlane (Fig. 2) to detect 3D planes and generate the corresponding descriptions from a single image. Our model detects 3D planes and generates globally consistent descriptors that can handle illumination or large viewpoint changes. To the best of our knowledge, ours is the first attempt to detect 3D planes with descriptions from a single image.

Figure 2: SuperPlane Framework. Our framework takes a single RGB image as input and outputs the 3D planes and corresponding descriptors. Our network consists of three sub-networks: a plane detection network (upper), a plane description network (bottom), and a joint warping network (Fig. 3 left). The plane description network is composed of a mask-attention module, three convolutional layers, an average pooling layer, a fully-connected layer, intra-normalization, and L2 normalization.
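As a rough illustration of the components listed in the Fig. 2 caption, the following PyTorch sketch assembles a plane description head in that order. All channel sizes, the descriptor dimension, the group count for intra-normalization, and the exact form of the mask-attention module are our assumptions, not the paper's specification.

import torch.nn as nn
import torch.nn.functional as F

class PlaneDescriptionHead(nn.Module):
    def __init__(self, in_ch=256, dim=128, groups=8):
        super().__init__()
        # Three convolutional layers, as in the Fig. 2 caption.
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(256, dim)  # fully-connected layer
        self.groups = groups

    def forward(self, feat, mask):
        # feat: (B, in_ch, H, W) backbone features; mask: (B, 1, H, W) plane mask.
        # Mask-attention (assumed form): restrict features to the plane region.
        x = self.convs(feat * mask)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)   # average pooling -> (B, 256)
        x = self.fc(x)                               # (B, dim)
        # Intra-normalization: L2-normalize within groups of dimensions,
        # then L2-normalize the whole descriptor.
        b, d = x.shape
        x = F.normalize(x.view(b, self.groups, d // self.groups), dim=2).view(b, d)
        return F.normalize(x, dim=1)

One head invocation per detected plane mask yields one descriptor per plane instance, so the number of descriptors varies with the number of detected planes.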
In addition, due to the lack of available datasets for training our framework, we introduce a Plane Description Benchmark (PDB). Furthermore, we propose a novel pipeline (Fig. 3) for the IBL task using our 3D plane detection and description network (SuperPlane). Existing retrieval-based methods directly use the global features of the query and gallery images to calculate the similarity between them; we additionally use the multi-plane descriptors to obtain the similarity of the two images. It is non-trivial to merge the many-to-many plane similarities into a single image-to-image similarity. The Kullback-Leibler (KL) divergence is generally utilized to estimate the average difference between two distributions P and Q. We cast the plane descriptors of an image as a distribution and exploit the KL divergence to estimate the difference between two images. Since our system detects planes of different sizes, each plane has a different effect on the image similarity. We therefore extend the traditional KL divergence into an area-aware KL divergence similarity method to retrieve similar images, which shows better performance.

Overall, our contributions are four-fold:

• We propose a novel unified framework for simultaneous 3D plane detection and description from a single image.

• We introduce a new training and test benchmark for plane description from a single image and propose an instance-triplet loss to train our model.

• We apply SuperPlane to the image-based localization task and further introduce an area-aware Kullback-Leibler divergence retrieval method to retrieve similar images.

• The proposed system outperforms previous state-of-the-art methods on image-based localization benchmarks and demonstrates significant generalization capability.

2 RELATED WORK

In this section, we briefly review the related methods.

2.1 Feature Extraction and Matching

Robust feature extraction and correct matching are particularly important in SLAM and AR applications. There exist several low-level feature detectors and descriptors such as SIFT [21], ORB [26], and SuperPoint [6]. It is still a challenge to extract robust features in weakly textured scenes, such as the white wall in the first row of Fig. 1. High-level global or semantic features, on the other hand, attend to the image as a whole and omit multi-region features. Human-made environments generally contain rich planar regions, which often have weak or repetitive textures. Furthermore, human perception and understanding of the real world may rely more on multi-plane structure, and in AR applications virtual objects are placed on planar surfaces [13]. Therefore, we argue that mid-level features, such as plane detection and description, are necessary.

2.2 Plane Detection and Description

Recently, several works on plane detection from a single image have emerged. A pioneering work named PlaneNet [17] was proposed for end-to-end planar reconstruction from a single RGB image. PlaneRecover [37] casts the 3D plane recovery problem as a depth prediction problem. However, these two methods can only handle a fixed number of planes. PlaneRCNN [16], built on Mask R-CNN [11], uses a detection network to extract an arbitrary number of planar regions, and further improves the accuracy of the plane parameters and the depth map with geometric constraints. PlanarReconstruction [39] exploits pixel-wise embeddings and proposes a fast mean-shift clustering algorithm to group the pixel embeddings into plane instances, but it does not generate plane descriptors. Note that all of these works focus only on plane detection and ignore plane description. The most similar work, PlaneMatch [28], introduces an RGB-D patch

Figure 3: Image-Based Localization Pipeline with SuperPlane. We use the SuperPlane network with shared weights to infer the planes and descriptors of the query image and the gallery image, respectively. Assume that the query image consists of m planes and the gallery image consists of n planes.
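Under one plausible reading of the retrieval step described above, the m-to-n plane similarities of Fig. 3 could be merged as follows: cast each plane descriptor as a probability distribution, match every query plane to its closest gallery plane in KL divergence, and weight each per-plane term by normalized plane area. The sketch below follows that reading; the softmax casting, the best-match aggregation, and all names are our assumptions rather than the paper's exact method.

import numpy as np

def kl_div(p, q, eps=1e-8):
    # KL divergence D(p || q) between two discrete distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def area_aware_kl_distance(query_desc, query_areas, gallery_desc):
    """query_desc: (m, D) plane descriptors, query_areas: (m,) pixel areas,
    gallery_desc: (n, D). Returns a dissimilarity (lower = more similar)."""
    def to_dist(x):
        # Cast each descriptor as a probability distribution via softmax.
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    P, Q = to_dist(query_desc), to_dist(gallery_desc)
    w = query_areas / query_areas.sum()  # larger planes contribute more
    score = 0.0
    for i in range(len(P)):
        # Best-matching gallery plane for this query plane.
        score += w[i] * min(kl_div(P[i], Q[j]) for j in range(len(Q)))
    return score

Gallery images would then be ranked by ascending distance to the query, with large planes contributing more to the ranking than small ones.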