Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
Hao Su*, Charles R. Qi*, Yangyan Li, Leonidas J. Guibas
Stanford University
*Indicates equal contributions.

Abstract

Object viewpoint estimation from 2D images is an essential task in computer vision. However, two issues hinder its progress: scarcity of training data with viewpoint annotations, and a lack of powerful features. Inspired by the growing availability of 3D models, we propose a framework to address both issues by combining render-based image synthesis and CNNs (Convolutional Neural Networks). We believe that 3D models have the potential to generate a large number of images of high variation, which can be well exploited by a deep CNN with high learning capacity. Towards this goal, we propose a scalable and overfit-resistant image synthesis pipeline, together with a novel CNN specifically tailored for the viewpoint estimation task. Experimentally, we show that the viewpoint estimation from our pipeline significantly outperforms state-of-the-art methods on the PASCAL 3D+ benchmark.

Figure 1. System overview. We synthesize training images by overlaying images rendered from large 3D model collections on top of real images. A CNN is trained to map images to the ground-truth object viewpoints; the training data is a combination of real and synthesized images. The learned CNN is applied to estimate the viewpoints of objects in real images. Project homepage: https://shapenet.cs.stanford.edu/projects/RenderForCNN

1. Introduction

3D recognition is a cornerstone problem in many vision applications and has been widely studied. Despite its critical importance, existing approaches are far from robust when applied to cluttered real-world images. We believe that two issues have to be addressed to enable more successful methods: scarcity of training images with accurate viewpoint annotation, and a lack of powerful features specifically tailored for 3D tasks.

The first issue, scarcity of images with accurate viewpoint annotation, is mostly due to the high cost of manual annotation and the associated inaccuracies caused by human error. Consequently, the largest 3D image dataset, PASCAL 3D+ [39], contains only ~22K images. As such, it is limited in diversity and scale compared with object classification datasets such as ImageNet, which contains millions of images [5].

The second issue is a lack of powerful features specifically tailored for viewpoint estimation. Most 3D vision systems rely on features such as SIFT and HoG, which were designed primarily for classification and detection tasks. However, this runs contrary to a recent finding: features learned with task-specific supervision lead to much better task performance [16, 11, 14]. Ideally, we want to learn stronger features with a deep CNN; this, however, requires a huge number of viewpoint-annotated images.

In this paper, we propose to address both issues by combining render-based image synthesis and CNNs, enabling us to learn discriminative features. We believe that 3D models have the potential to generate a large number of images of high variation, which can be well exploited by a deep CNN with high learning capacity.

The inspiration comes from our key observation: more and more high-quality 3D CAD models are available online. In particular, many geometric properties, such as symmetry and joint alignment, can be efficiently and reliably estimated by algorithms with limited human effort (Sec 2). By rendering the 3D models, we automatically convert the rich information they carry into 3D annotations.

To explore the idea of "Render for CNN" for 3D tasks, we focus on the viewpoint estimation problem: for an input RGB image and a bounding box from an off-the-shelf detector, our goal is to estimate the viewpoint.

To prepare training data for this task, we augment real images by synthesizing millions of highly diverse images. Several techniques are applied to increase the diversity of the synthesized dataset, in order to prevent the deep CNN from picking up unreliable patterns and to push it toward learning more robust features.
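To make the overlay step concrete, here is a minimal sketch of it in Python with Pillow, assuming the 3D model has already been rendered off-line to an RGBA image with a transparent background. It illustrates only the compositing; the full pipeline also varies background patterns, illumination and viewpoint. The file names and the helper synthesize_example are hypothetical; note that the viewpoint label comes for free from the rendering parameters (here encoded in the render's file name).

```python
# Minimal sketch of the overlay step: composite a rendered view (RGBA,
# transparent background) onto a random crop of a real background image.
# File names are hypothetical; lighting/viewpoint randomization happens
# at render time and is not shown here.
import random
from PIL import Image

def synthesize_example(render_path: str, background_path: str) -> Image.Image:
    render = Image.open(render_path).convert("RGBA")
    background = Image.open(background_path).convert("RGB")

    # Random-crop the background to the render's size (enlarge if too small).
    bw, bh = background.size
    rw, rh = render.size
    if bw < rw or bh < rh:
        background = background.resize((max(bw, rw), max(bh, rh)))
        bw, bh = background.size
    x = random.randint(0, bw - rw)
    y = random.randint(0, bh - rh)
    crop = background.crop((x, y, x + rw, y + rh))

    # Alpha-composite: rendered pixels replace the background where opaque.
    crop.paste(render, (0, 0), mask=render.split()[-1])
    return crop

# Viewpoint label (azimuth 30, elevation 10) is known from rendering.
example = synthesize_example("chair_az030_el010.png", "pascal_bg_00042.jpg")
example.save("train_000001.jpg")
```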
To fully exploit this large-scale dataset, we design a deep CNN specifically tailored for the viewpoint estimation task. We formulate a class-dependent, fine-grained viewpoint classification problem and solve it with a novel loss layer adapted for this task.

The results are surprising: trained on millions of "rendered" images, our CNN-based viewpoint estimator significantly outperforms state-of-the-art methods when tested on "real" images from the challenging PASCAL 3D+ dataset.

In summary, our contributions are as follows:
• We show that training a CNN with massive synthetic data is an effective approach for 3D viewpoint estimation. In particular, we achieve state-of-the-art performance on the benchmark dataset;
• Based upon existing 3D model repositories, we propose a synthesis pipeline that generates millions of images with accurate viewpoint labels at negligible human cost. The pipeline is scalable, and the generated data is resistant to overfitting by CNNs;
• Leveraging the large synthesized dataset, we propose a fine-grained view classification formulation, with a loss function encouraging strong correlation of nearby views. This formulation allows us to accurately predict views and capture underlying viewpoint ambiguities.

2. Related Work

3D Model Datasets. Prior work has focused on manually collecting organized 3D model datasets (e.g., [7, 10]). Recently, several large-scale online 3D model repositories have grown to tremendous sizes through public aggregation, including the Trimble 3D Warehouse (above 2.5M models in total), TurboSquid (300K models) and Yobi3D (1M models). Using data from these repositories, [38] built a dataset of ~130K models from over 600 categories. More recently, ShapeNet [33] annotated ~330K models from over 4K categories. Using geometric analysis techniques, they semi-automatically aligned 57K models from 55 categories by orientation.

3D Object Detection. Most 3D object detection methods are based on representing objects with discriminative features for points [4], patches [6] and parts [19, 29, 34], or by exploring topological structures [15, 2, 3]. More recently, 3D models have been used for supervised learning of appearance and geometric structure. For example, [32] and [20] proposed similar methods that learn a 3D deformable part model and demonstrate superior performance for cars and chairs, respectively; [21] and [1] formulated an alignment problem and built keypoint correspondences between 2D images and rendered 3D views. In contrast to these prior efforts, which use hand-designed models based on hand-crafted features, we let CNNs learn viewpoint estimation directly from data.

Synthesizing Images for Training. Synthetic images rendered from 3D models have long been used in computer vision [25, 26, 37, 23]. Recently, [32, 19, 13, 27] used 3D models to render images for training object detectors and viewpoint classifiers. They tweak the rendering parameters to maximize model usage, since they have a limited number of 3D models (typically below 50 models per category), insufficient to capture the geometric and appearance variance of objects in practice. Leveraging 3D repositories, [20, 1] use 250 and 1.3K chair models, respectively, to render tens of thousands of training images, which are then used to train deformable part models (DPM [8]). In our work, we synthesize several orders of magnitude more images than existing work. We also explore methods to increase data variation by changing background patterns, illumination, viewpoint, etc., which are critical for preventing overfitting of the CNN.

While both [27] and our work connect synthetic images with CNNs, they are fundamentally different in task, approach and result. First, [27] focused on 2D object detection, whereas our work focuses on 3D viewpoint estimation. Second, [27] used a small set of synthetic images (2,000 in total) to train linear classifiers on features extracted by out-of-the-box CNNs [16, 11]. In contrast, we develop a scalable synthesis pipeline and generate more than two million images to learn geometry-aware features by training deep CNNs (initialized by [11]). Third, the performance of [27], though better than that of previous work using synthetic data, did not match an R-CNN baseline trained on real images [11]. In contrast, we show significant performance gains (Sec 5.2) over previous work [39] that uses the full set of real data from PASCAL VOC 2012 (train set).

3. Problem Statement

For an input RGB image, our goal is to estimate its viewpoint. We parameterize the viewpoint as a tuple (θ, φ, ψ) of camera rotation parameters, where θ is the azimuth (longitude) angle, φ is the elevation (latitude) angle, and ψ is the in-plane rotation (around the optical axis) angle. They are discretized in a fine-grained manner, with the azimuth, elevation and in-plane rotation angles each divided into N bins. The viewpoint estimation problem is then formalized as classifying the camera rotation parameters into these fine-grained bins (classes).

By adopting a fine-grained (N = 360) viewpoint classification formulation, our estimation is informative and accurate. Compared with regression-based formulations [22], our formulation returns the probability of each viewpoint, thus capturing the underlying viewpoint ambiguity possibly caused by symmetry or occlusion patterns; this information can be useful for further processing. Compared with traditional coarse-grained classification-based formulations, which typically have 8 to 24 discrete classes [29, 39], our formulation is capable of producing much more fine-grained viewpoint estimates.
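To make this formulation concrete, the sketch below shows, for a single angle, the two ingredients just described: discretization into N = 360 one-degree bins, and a soft cross-entropy whose target mass decays with circular distance from the ground-truth bin, one plausible way to realize a loss that encourages strong correlation of nearby views (Sec 1). The exponential weighting, the bandwidth sigma and the function names are illustrative assumptions rather than the exact loss layer, and the class-dependent aspect is omitted.

```python
# Sketch of fine-grained viewpoint classification: each angle maps to
# one of N = 360 one-degree bins, and a geometry-aware cross-entropy
# spreads the target over nearby bins so that predictions close to the
# ground truth are penalized less. The exp(-d/sigma) weighting is an
# illustrative choice, not necessarily the paper's exact loss layer.
import numpy as np

N = 360  # one bin per degree

def angle_to_bin(angle_deg: float) -> int:
    """Map an angle in degrees to one of N fine-grained bins."""
    return int((angle_deg % 360.0) * N // 360)

def viewpoint_loss(log_probs: np.ndarray, gt_bin: int, sigma: float = 5.0) -> float:
    """Soft cross-entropy over viewpoint bins.

    log_probs: (N,) log-probabilities predicted by the CNN for one angle.
    gt_bin: ground-truth bin index.
    """
    bins = np.arange(N)
    # Circular distance between every bin and the ground-truth bin.
    d = np.minimum(np.abs(bins - gt_bin), N - np.abs(bins - gt_bin))
    weights = np.exp(-d / sigma)   # nearby views receive large weight
    weights /= weights.sum()       # normalize to a target distribution
    return float(-(weights * log_probs).sum())

# Example: a uniform prediction scored against azimuth 42 degrees.
log_probs = np.log(np.full(N, 1.0 / N))
print(viewpoint_loss(log_probs, angle_to_bin(42.0)))
```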
Figure 2. 3D model set augmentation by symmetry-preserving deformation.

Figure 3. Synthetic image examples. Three example images are shown for each of the 12 classes from PASCAL 3D+.

We augment the 3D model set with a symmetry-preserving free-form deformation, defined via regularly placed control points in the bounding cube, similar to the approach of [30]. Our choice is largely due to the model's simplicity and efficiency; more advanced methods can detect and preserve more structures, such as partial symmetry and rigidity [31]. To generate a deformed model from a seed model,
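As a rough illustration of free-form deformation driven by regularly placed control points, the sketch below jitters a lattice of control-point offsets over the unit bounding cube and displaces each mesh vertex by trilinear interpolation of the offsets of the eight surrounding control points. The symmetry-preserving constraint is omitted (one simple variant would mirror the offsets across the shape's symmetry plane), so this is a simplified stand-in rather than the exact deformation model; the function name and parameters are hypothetical.

```python
# Rough sketch of lattice-based free-form deformation: control points on
# a regular grid over the unit bounding cube are randomly jittered, and
# each vertex is displaced by trilinear interpolation of the offsets of
# the 8 control points around it. Symmetry preservation is omitted.
import numpy as np

def ffd_deform(vertices: np.ndarray, grid: int = 4,
               magnitude: float = 0.05, rng=None) -> np.ndarray:
    """vertices: (V, 3) float array normalized to the unit cube [0, 1]^3."""
    rng = rng or np.random.default_rng()
    # Random offsets for a (grid+1)^3 lattice of control points.
    offsets = rng.uniform(-magnitude, magnitude, size=(grid + 1,) * 3 + (3,))

    scaled = np.clip(vertices, 0.0, 1.0) * grid
    base = np.minimum(scaled.astype(int), grid - 1)  # cell index per vertex
    frac = scaled - base                             # position inside cell

    out = vertices.astype(float)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Trilinear weight of the (dx, dy, dz) corner of the cell.
                w = (np.where(dx, frac[:, 0], 1 - frac[:, 0]) *
                     np.where(dy, frac[:, 1], 1 - frac[:, 1]) *
                     np.where(dz, frac[:, 2], 1 - frac[:, 2]))
                out += w[:, None] * offsets[base[:, 0] + dx,
                                            base[:, 1] + dy,
                                            base[:, 2] + dz]
    return out
```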