arXiv:1811.01571v2 [cs.CV] 24 Jan 2019
SPNet: Deep 3D Object Classification and Retrieval using Stereographic Projection

Mohsen Yavartanoo1[0000−0002−0109−1202], Eu Young Kim1[0000−0003−0528−6557], and Kyoung Mu Lee1[0000−0001−7210−1036]

Department of ECE, ASRI, Seoul National University, Seoul, Korea
https://cv.snu.ac.kr/
fmyavartanoo, shreka116, [email protected]

Abstract. We propose an efficient Stereographic Projection Neural Network (SPNet) for learning representations of 3D objects. We first transform a 3D input volume into a 2D planar image using stereographic projection. We then present a shallow 2D convolutional neural network (CNN) to estimate the object category, followed by a view ensemble, which combines the responses from multiple views of the object to further enhance the predictions. Specifically, the proposed approach consists of four stages: (1) stereographic projection of a 3D object, (2) view-specific feature learning, (3) view selection, and (4) view ensemble. The proposed approach performs comparably to state-of-the-art methods while requiring substantially less GPU memory and fewer network parameters. Despite its light weight, experiments on 3D object classification and shape retrieval demonstrate the high performance of the proposed method.

Keywords: 3D object classification · 3D object retrieval · Stereographic Projection · Convolutional Neural Network · View Ensemble · View Selection.

1 Introduction

In recent years, the success of deep learning methods, in particular convolutional neural networks (CNNs), has spurred rapid development in various computer vision applications such as image classification, object detection, and super-resolution. Along with the drastic advances in 2D computer vision, understanding 3D shapes and environments has also attracted great attention.

Many traditional CNNs on 3D data simply extend the 2D convolutional operations to 3D, for example, the work of Wu et al.
[36], which extends a 2D deep belief network to a 3D deep belief network, or the works of Maturana et al. [18] and Sedaghat et al. [25], which extend 2D convolutional kernels to 3D convolutional kernels. Furthermore, Brock et al. [5] and Wu et al. [35] proposed to build deeper 3D CNNs following the structures of inception modules, residual connections, and Generative Adversarial Networks (GANs) to improve the generalization capability. However, these methods are based on 3D convolutions and therefore have high computational complexity and GPU memory consumption.

An alternative approach is based on projected 2D views of the 3D object to exploit established 2D CNN architectures. MVCNN [31] renders multiple 2D views of a 3D object and uses them as input to 2D CNNs. Some other works [28,26,1] propose to use 2D panoramic views of a 3D shape. However, these methods can observe only part of the 3D object, failing to cover its full 3D surface.

To address all these limitations, we introduce a novel 3D shape representation technique using stereographic mapping to project the full surface of a 3D object onto a 2D planar image. This 2D stereographic image becomes the input to our proposed shallow 2D CNN, thereby substantially reducing the number of network parameters and the GPU memory consumption compared to the state-of-the-art 3D convolution-based methods, while achieving high accuracy.

By taking advantage of multiple projected views generated from a single 3D shape, we propose a view ensemble to combine the predictions of the most discriminative views, which are sampled by our view selection network. In contrast, conventional methods [31,28,33,20,34,1] simply aggregate the responses of all views via max or average pooling.

2 Related Work

In this section, we review recent deep learning methods for 3D feature learning.
These methods are categorized in terms of their feature representations: (1) point cloud-based representations, (2) 3D model-based representations, and (3) 2D and 2.5D image-based representations.

Point cloud-based methods: While previous works often combine hand-crafted features or descriptors with a machine learning classifier [11,32,4,9], point cloud-based methods operate directly on point clouds in an end-to-end manner. In [6,21,16], the authors designed novel neural network architectures suitable for handling unordered point sets in 3D. Features based on point clouds often require spatial neighborhood queries, which can be hard to handle for inputs with large numbers of points.

3D model-based methods: Voxel-based methods learn 3D features from voxels, which represent a 3D shape by the distribution of corresponding binary variables. In 3D ShapeNets [36], the authors proposed a method that learns global features from voxelized 3D shapes based on the 3D convolutional restricted Boltzmann machine. Similarly, Maturana and Scherer [18] proposed VoxNet, which integrates a volumetric occupancy grid representation with a supervised 3D CNN. In a follow-up, Sedaghat et al. [25] extended VoxNet by introducing an auxiliary task: they add an orientation loss to the general classification loss, so that the architecture predicts both the pose and the class of the object. Furuya et al. [10] proposed the Deep Local feature Aggregation Network (DLAN), which combines rotation-invariant 3D local features and their aggregation in a single architecture. Sharma et al. [27] proposed a fully convolutional denoising auto-encoder to perform unsupervised global feature learning. In addition, 3D variational auto-encoders and generative adversarial networks have been adopted by Brock et al. [5] and Wu et al. [35], respectively.
Furthermore, recent works [34,22] exploit the sparsity of 3D input using the octree data structure to reduce the computational complexity and speed up the learning of global features.

2D/2.5D image-based methods: Image-based methods have been considered one of the fundamental approaches in 3D object classification. The Light Field Descriptor (LFD) [8] by Chen et al. uses multiple views around a 3D shape and evaluates the dissimilarity between two shapes by comparing the corresponding two view sets in a greedy way, instead of learning global features by combining multi-view information. Bai et al. [3] used a similar approach but measured the similarity between two 3D shapes using the Hausdorff distance between the corresponding view sets. Su et al. [31] proposed a CNN architecture that aggregates information from multiple views rendered from a 3D object, which achieves higher recognition performance than single-view architectures. By decomposing each view sequence into a set of view pairs, Johns et al. [14] classified each pair independently and learned an object classifier by weighting the contribution of each pair, which allows 3D shape recognition over arbitrary camera viewpoints. To perform pooling more efficiently, Wang et al. [33] proposed a dominant set clustering technique in which pooling is performed in each cluster individually. Kanezaki et al. [15] proposed RotationNet, which takes multi-view images of an object and jointly estimates its category and pose. RotationNet learns viewpoint labels in an unsupervised manner. Moreover, it learns view-specific feature representations shared across classes to boost the performance.

As an alternative approach, Gomez-Donoso et al. [12] proposed LonchaNet, which uses three orthogonal slices from a 3D point cloud as input to three independent GoogLeNet networks, each learning specific features for its slice. Cohen et al.
[7], in Spherical CNNs, proposed a definition of the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed efficiently using a generalized Fast Fourier Transform (FFT) algorithm. Papadakis et al. [19] proposed PANORAMA, which uses a set of panoramic views of a 3D object that describe the position and orientation of the object's surface in 3D space. The 2D Discrete Fourier Transform and the 2D Discrete Wavelet Transform are computed for each view. Shi et al., in DeepPano [28], projected each 3D shape into a panoramic view around its principal axis and used a CNN to learn representations from these views. To make the learned representations invariant to rotation around the principal axis, a row-wise max-pooling layer is applied between the convolution and fully-connected layers. To obtain a better feature descriptor for a 3D object in the training phase, Sfikas et al. [1] use three panoramic views corresponding to the major axes and apply average pooling over the feature descriptors of the views to train an ensemble of CNNs.

3 Proposed Stereographic Projection Network

In this section, we provide details of our proposed approach. We first describe how to transform a 3D object into a 2D planar image using stereographic projection. Then, we give a detailed description of the proposed shallow 2D CNN architecture, SPNet, followed by the procedures for view selection and view ensemble.

3.1 Stereographic Representation

Stereographic projection is a mapping that projects a 2D manifold onto a 2D plane. Such a technique is well developed in the fields of topology and geography, where it is used to project the surface of the Earth onto a 2D planar map [30]. Since then, various projection functions have been proposed to improve the quality of the mapping.
In this work, we explore different types of projection functions and show that stereographic projection preserves more detailed surface structure of a 3D object.

To construct the stereographic representation of a 3D object, we first normalize the 3D object such that a unit sphere can fully cover it. We then translate the origin of the sphere to the center of the object, assuming that the orientation of the object is aligned. For each point p on the surface of the object, we denote by e the unit vector from the origin o toward the point p, as shown in Fig. 1(a).
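
The normalization and direction-vector steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the object is assumed to be given as an (N, 3) array of surface points, the "center of the object" is taken to be the centroid of those points, and the last function uses the textbook stereographic projection from the north pole (0, 0, 1) onto the plane z = 0, which may differ from the projection function the paper actually adopts. All function names are hypothetical.

```python
import numpy as np


def normalize_to_unit_sphere(points):
    """Center the points and scale them so a unit sphere covers them all.

    Assumption: the object's center is the centroid of the surface points.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    if radius == 0.0:
        raise ValueError("all points coincide; cannot normalize")
    return centered / radius


def unit_directions(points):
    """Return the unit vector e from the origin o toward each point p."""
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    if np.any(norms == 0.0):
        raise ValueError("a point at the origin has no direction")
    return points / norms


def stereographic_plane_coords(e):
    """Textbook stereographic projection from the north pole (0, 0, 1).

    Maps unit vectors (x, y, z) to plane coordinates (x / (1 - z), y / (1 - z)).
    The north pole itself (z = 1) has no finite image and is not handled here.
    """
    e = np.asarray(e, dtype=float)
    denom = 1.0 - e[:, 2]
    return np.column_stack([e[:, 0] / denom, e[:, 1] / denom])


if __name__ == "__main__":
    # Toy "object": a few random surface points (illustrative data only).
    rng = np.random.default_rng(0)
    surface_points = rng.normal(size=(5, 3))
    normalized = normalize_to_unit_sphere(surface_points)
    directions = unit_directions(normalized)
    print(stereographic_plane_coords(directions))
```

After normalization the farthest point lies exactly on the unit sphere, and each direction vector e has unit length, so it can be fed to whichever spherical-to-planar mapping is chosen.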