Image-based 3D Reconstruction: Neural Networks vs. Multiview Geometry

Julius Schöning and Gunther Heidemann
Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany
Email: {juschoening,gheidema}@uos.de

Abstract—Methods using multiple view geometry (MVG), like structure from motion (SfM), are still the dominant approaches for image-based 3D reconstruction. These reconstruction methods have become quite robust and accurate. However, how robustly and accurately can artificial neural networks (ANNs) reconstruct a priori unknown 3D objects? Going beyond the restriction to object categories, this paper evaluates ANNs for reconstructing arbitrary 3D objects. Using a synthetic scalable cube dataset for training, testing, and validating ANNs, it is shown that ANNs are capable of learning mathematical principles of 3D reconstruction. The design of the different ANN architectures is inspired by the global, hierarchical, and incremental key-point matching strategies of SfM approaches. Based on these benchmarks and a review of the used dataset, it is shown that voxel-based 3D reconstruction cannot be scaled, and thus a voxel-based representation might be misleading for capturing the complexity of real-world images. Also, the benchmark results show that ANNs have the same limitations in reconstructing unknown object categories as current MVG approaches.

Index Terms—3D Reconstruction; Artificial Neural Networks; Image-Based 3D Reconstruction; Structure from Motion

I. INTRODUCTION

In theory, "standard multilayer feedforward networks are capable of approximating any measurable function to any desired degree of accuracy, in a very specific and satisfying sense" [1]. Hence, because MVG methods consist of a concatenation of several algorithms, it must be possible to approximate their underlying mathematical principle by an ANN. The accuracy of commercial as well as academic 3D reconstruction software like PhotoScan [2], VisualSfM [3], and ARC 3D [4] is sufficient for most tasks [5]. But since this software is based on MVG algorithms, the number as well as the size of the input images increase the processing time exponentially.

In the last decade, the computing power of GPUs has improved remarkably due to the use of several thousand parallel computing cores. GPUs thus allow the massive parallelization of algorithms, so that the training as well as the execution of ANNs, even deep configurations with millions of trainable weights, become feasible. Up to now, it has been shown that ANNs outperform handcrafted algorithms in many fields of application, such as object recognition tasks [6]–[8]. If ANNs can learn the underlying mathematical principle of MVG, then the trained ANNs can reconstruct any unknown 3D object or scene without having seen it during training. Such an image-based reconstruction ANN integrated into 3D reconstruction software might, in contrast to the MVG algorithms, guarantee linear processing time, even in cases where the number and resolution of the input images make MVG-based approaches infeasible.

For researching whether the underlying mathematical principle of MVG can be learned by ANNs, datasets like ShapeNet [9] and ModelNet40 [10] cannot be used, since they encode object categories such as planes, chairs, and tables. The needed dataset must not contain shape priors of object categories, and it must also be scalable in its complexity to provide a large body of samples with ground truth data, ensuring that even deep ANNs can be trained. For this reason, we use the synthetic scalable cube dataset [11], [12], which provides 3D objects and 2D images of different complexity. Within classical 3D reconstruction pipelines, a global, a hierarchical, or an incremental method for key-point matching is used, depending on the amount and complexity of the data [13]. Assigning these three high-level concepts to the architecture of ANNs, the networks schematized in Fig. 1(b)–Fig. 1(d) are designed. To ensure comparable benchmark results between the global, hierarchical, and incremental networks, all of them have by design the same number of trainable weights.

II. STATE OF THE ART

Using ANNs for image-based 3D reconstruction is still at an early stage. For known object categories, a few approaches [14], [15] were introduced recently. In contrast, there are no ANN-based approaches for the 3D reconstruction of unknown or unseen objects and scenes. Therefore we provide a brief overview of related approaches such as 3D object representation, single-view 3D reconstruction, and stereo image generation by ANNs.

Using a multilayer feedforward network design, Peng and Shamsuddin's study [16] explores the ability of ANNs to learn 3D shapes by estimating the object's z-coordinate. They showed that a 3D object representation by an ANN is more accurate than its representation by a 3rd-order polynomial.

Based on stacks of convolutional neural networks (CNNs), image-based depth prediction is possible [17]. This multi-scale CNN produces a set of features for the entire image at the first scale; the second scale predicts depth information at mid-level resolution, and the third scale yields depth information at high resolution.

Roy et al. [18] propose a convolutional regression forest where each node in the forest is associated with a CNN. The CNN makes a depth estimation for a window around each pixel and returns a probability value for the depth accuracy.

For single-view depth estimation, Garg et al. [19] introduced an unsupervised framework to train a deep ANN without labeled data. As input data, an image pair is needed that shows the same scene from a slightly different viewpoint.

III. MVG BY ANNS—WHICH OUTPUT REPRESENTATION?

ANNs have frequently been applied to object recognition [6]–[8]. For classification tasks, the output layer usually has one output neuron for each category. However, to describe the shape of 3D objects, e.g., in computer graphics and computer vision, vertices, edges, and faces are required, and their numbers are equivalent to the object complexity. The first 3D computer games used a simplified representation of 3D objects: the volumetric pixels (voxels). To represent a 3D object or scene at the output and bottleneck layers of an ANN, thus i) vertices, ii) edges, iii) faces, or iv) voxels can be utilized. Since ANNs perform well on binary classification tasks, the iv) voxel-based representation is chosen as ANN output. As an additional advantage, the voxel space can easily be scaled, which in turn affects the 3D reconstruction accuracy. Next to the output representation, the input representation for MVG by ANNs should be, as for SfM approaches, images captured from different viewpoints.

IV. SYNTHETIC SCALABLE CUBE DATASET

In the development history of digital cameras, the first running prototypes had a low number of pixels, but these prototypes with their low resolution formed the basis for scaling up the camera resolution to several tens of megapixels. Following the same line of argument, we [11] designed a synthetic scalable cube dataset for image-based 3D reconstruction of voxel cubes. To obtain the large numbers of examples required for training, testing, and validating ANNs with various amounts of data, this dataset comes with a database generator that can handle the Wavefront obj file format. Since the dataset should cover a variety of unrepeatable 3D obj objects, random cube-based voxel objects are chosen as basic geometric shapes. These cubes represent both simple and complex objects, which would be difficult to quantify if real object categories were used. The cubes are generated in two steps: first, the cube generator creates randomized r × r × r voxel cubes as 3D obj files; second, a generator creates the input and output data for the ANNs, i.e., it creates the w images showing these objects from different viewpoints as input data, and it creates the corresponding voxel matrix as output data.

A. Cube Generator

The cube generator randomly generates a number of n 3D objects. Each such object is created by taking a unit cube in R³ and subdividing it into r × r × r subcubes. Each subcube represents a voxel. The resolution, i.e., the number r of subcubes in each dimension, can be defined by the user. By ensuring the unique nature of the cube distribution in the r × r × r grid, 2^(r³) different 3D objects can be created and saved. When exporting valid 3D obj files, duplicate vertices are combined, and inner faces, where two subcubes are connected with each other, are deleted before saving the generated cube.
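For illustration, a minimal sketch of such a generator is given below. It is our own re-implementation under the stated assumptions, not the released tool from [11], and all function names are ours. It draws one of the 2^(r³) possible occupancy patterns and exports it as a Wavefront obj file with merged vertices and culled inner faces:

```python
import numpy as np

# Corner offsets of the outward face of a unit subcube for each axis direction.
FACE_CORNERS = {
    ( 1, 0, 0): [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)],
    (-1, 0, 0): [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)],
    ( 0, 1, 0): [(0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)],
    ( 0,-1, 0): [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)],
    ( 0, 0, 1): [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)],
    ( 0, 0,-1): [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)],
}

def random_cube(r, rng=None):
    """Random binary r x r x r occupancy grid: 1 = voxel, 0 = empty."""
    rng = rng or np.random.default_rng()
    return rng.integers(0, 2, size=(r, r, r), dtype=np.uint8)

def cube_to_obj(grid, path):
    """Export a Wavefront obj file: shared vertices are merged, and inner
    faces between two occupied subcubes are culled (cf. Sec. IV-A)."""
    r = grid.shape[0]
    index, verts, faces = {}, [], []

    def vert_id(p):                      # merge duplicate vertices
        if p not in index:
            index[p] = len(verts) + 1    # obj indices are 1-based
            verts.append(p)
        return index[p]

    def occupied(x, y, z):
        return 0 <= x < r and 0 <= y < r and 0 <= z < r and grid[x, y, z]

    for x, y, z in zip(*np.nonzero(grid)):
        for (dx, dy, dz), quad in FACE_CORNERS.items():
            if occupied(x + dx, y + dy, z + dz):
                continue                 # inner face: neighbor voxel present
            faces.append([vert_id((x + ox, y + oy, z + oz))
                          for ox, oy, oz in quad])
    with open(path, "w") as f:
        f.writelines(f"v {vx} {vy} {vz}\n" for vx, vy, vz in verts)
        f.writelines("f " + " ".join(map(str, q)) + "\n" for q in faces)
```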
B. Image and Voxel Generator

For the generation of the w input images with a pixel resolution of x × x, as well as one voxel cloud with a resolution of v × v × v voxels, a script is provided that accepts any 3D obj file as input data. Before running this script, the number of views w taken from the 3D object and their pixel resolution x can be defined by the user, as well as the voxel resolution v. If the randomized cubes are used as input for this generator, the voxel resolution v should be set to r.

To generate w images of the object from w different viewpoints, the generator chooses viewpoints uniformly distributed around the object to provide perspectives that are as different as possible. The viewpoints are evenly distributed on a sphere using the Fibonacci lattice [20]. From these viewpoints, grayscale images are rendered. For each scene, a light source is added next to the viewpoint. The images, as well as the voxel matrix of the objects, are created after scaling the object to the v × v × v grid.
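A common golden-angle formulation of this Fibonacci lattice sampling could look as follows; this is a sketch, and the sphere radius as well as the exact offset variant are our assumptions, not taken from the released generator:

```python
import numpy as np

def fibonacci_viewpoints(w, radius=2.5):
    """w camera centers evenly spread on a sphere via the Fibonacci lattice
    [20]; each camera is assumed to look at the object in the origin."""
    i = np.arange(w)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))   # ~2.39996 rad
    z = 1.0 - 2.0 * (i + 0.5) / w                 # heights uniform in (-1, 1)
    theta = golden_angle * i                      # azimuth spiral
    rho = np.sqrt(1.0 - z * z)
    pts = np.stack([rho * np.cos(theta), rho * np.sin(theta), z], axis=1)
    return radius * pts

cams = fibonacci_viewpoints(12)   # the ready-to-use datasets use w = 12 views
```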

C. Cube Datasets

In addition to the generators, Schöning et al. [11] have released three ready-to-use datasets—one in a 3 × 3 × 3, one in a 4 × 4 × 4, and one in an 8 × 8 × 8 setup. This ensures reproducibility, so that their baseline can be tackled. These defined snapshots consist of 100 000, 300 000, and 430 000 different cube objects for the respective setups. All objects are captured at a resolution of 100 × 100 pixels from 12 different viewpoints.

V. GLOBAL, HIERARCHICAL, AND INCREMENTAL ANNS FOR 3D RECONSTRUCTION

The designed ANNs should be able to acquire 3D voxel matrices from the ready-to-use dataset, i.e., from 12 different views. The output voxel matrix is defined on a binary voxel grid where the values are 1/0 for voxel/non-voxel, respectively. We design three different feedforward ANNs to emulate the SfM strategies of global, hierarchical, and incremental key-point matching, cf. Fig. 1(b)–Fig. 1(d). To ensure comparability, all three architectures have the same number of trainable parameters (weights). These three simple feedforward ANNs are implemented using the Keras [21] framework with Theano [22] as backend. The voxel representation as output turns image-based 3D reconstruction by ANNs into a kind of multi-label classification task. All three architectures have i) an input layer comprising 12 images as 2D matrices, ii) several convolutional layers, which are used for creating the architectural differences, iii) a dropout layer (20%), iv) a pooling layer, v) a fully connected layer, and finally v³ output neurons, where each neuron represents one voxel.

The activation functions of all nodes are rectifiers. Because fine-tuning of the ANNs is not the objective here, the filter count as well as the kernel size are fixed to 32 and 5, respectively. In future benchmarks we are planning to perform a hyper-parameter optimization; thus, the ANNs are designed such that they still have the same number of weights for all values of filter counts and kernel sizes.

Dealing with the available hardware, we needed to downscale the input images to 20 × 20 pixels. We also store all objects with their 12 images and the corresponding binary voxel matrix in an HDF5 file, so the objects can be loaded directly from hard disk, relaxing the working-memory consumption during training, testing, and validating. During the complete benchmark, the first 30% of the cubes are used as training samples, the next 50% of the cubes are used as testing samples, and the last 20% are used as validation samples.
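The following sketch shows how the global variant of Fig. 1(b) could be written in current tf.keras (the paper used Keras with a Theano backend); the exact number of convolutional layers, the merge placement, the size of the fully connected layer, and the optimizer/loss choice are simplifications and assumptions on our part:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_global_ann(w=12, x=20, v=3):
    """Global strategy sketch: merge all w grayscale x-by-x views first, then
    apply shared conv layers; v**3 sigmoid neurons, one per output voxel."""
    inputs = [layers.Input(shape=(x, x, 1), name=f"view_{k}") for k in range(w)]
    merged = layers.Concatenate(axis=-1)(inputs)        # "global" early merge
    h = layers.Conv2D(32, 5, activation="relu")(merged) # 32 filters, kernel 5
    h = layers.Conv2D(32, 5, activation="relu")(h)
    h = layers.Dropout(0.20)(h)                         # 20% dropout, as stated
    h = layers.MaxPooling2D(2)(h)
    h = layers.Flatten()(h)
    h = layers.Dense(128, activation="relu")(h)         # width is our assumption
    out = layers.Dense(v ** 3, activation="sigmoid")(h) # binary voxel vector
    model = models.Model(inputs, out)
    # Per-voxel binary decision -> binary cross-entropy, as in multi-label tasks.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["binary_accuracy"])
    return model
```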

Fig. 1. Simplified schemes of the different benchmarked ANN architectures for image-based 3D reconstruction: (a) single-input-image architecture, (b) global-feature-matching-like architecture, (c) hierarchical-feature-matching-like architecture, and (d) incremental-feature-matching-like architecture. In architecture (a), all w images from various viewpoints are concatenated into a single input image; based on this single input image, a simple six-layer architecture is used for the prediction of each single voxel [11]. The benchmarked architectures (b)–(d) are inspired by the three different feature matching methods—global, hierarchical, and incremental—and use all w images of the object individually as input. As output, all networks have a binary vector of v³ voxels, depending on the resolution of the voxel space. [Layer diagrams: input images passing through convolutional, merge, dropout, max-pooling, and fully connected layers to the output voxel vector.]

TABLE I. Benchmark between the SfM strategies global, hierarchical, and incremental on the synthetic cube dataset in the 3 × 3 × 3, 4 × 4 × 4, and 8 × 8 × 8 setups. The accuracy of correctly predicted voxels drops significantly with increasing size of the voxel cube; all strategies perform almost identically on this task. [Nine plots: percentage of correctly predicted voxels over 200 training epochs, for the training and testing samples of each strategy and setup.]

VI. BENCHMARK

The best performance we achieved during the benchmark on our dataset is a voxel accuracy of 98.03%, which leads to 58.64% completely correctly reconstructed cubes. As seen in Table III, this result is achieved by the hierarchical strategy learned and validated on the 3 × 3 × 3 dataset. In Table I, the voxel accuracy over the epochs is plotted for each of the three architectures on the 3 × 3 × 3, 4 × 4 × 4, and 8 × 8 × 8 setups, for both the testing and training samples. A comparison of the object accuracy, where 100% of the voxels of a cube must be predicted correctly, can be seen in Table II.
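Both benchmark metrics follow directly from the binary voxel vectors. A sketch of how we read their definitions; the 0.5 decision threshold on the sigmoid outputs is our assumption:

```python
import numpy as np

def benchmark_metrics(prob, truth):
    """prob: (n, v**3) sigmoid outputs in [0, 1]; truth: (n, v**3) binary.
    Voxel accuracy: share of correctly predicted voxels overall.
    Object accuracy: share of cubes whose voxels are ALL predicted correctly."""
    pred = (prob >= 0.5).astype(truth.dtype)
    voxel_acc = (pred == truth).mean()
    object_acc = (pred == truth).all(axis=1).mean()
    return voxel_acc, object_acc
```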

Fig. 2. Average accuracy of each voxel after 350 training epochs in (a) the 3 × 3 × 3 setup, (b) the 4 × 4 × 4 setup, and (c) the 8 × 8 × 8 setup, predicted by the global ANN architecture; from left to right, a slice-by-slice walk through the voxel grid. Accuracy is color coded from 0.5 (random chance, blue) to 1.0 (always correct, dark red). See also the animated results in the supplementary material. [Color-coded voxel-grid slices.]
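A per-voxel accuracy map like the one in Fig. 2 can be recomputed from the test predictions as sketched below; the colormap and plotting layout are our choices, as the paper does not specify them:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_voxel_accuracy(prob, truth, v):
    """Average over the test set how often each voxel position is correct,
    reshape to the v x v x v grid, and show one image per slice."""
    acc = ((prob >= 0.5) == truth).mean(axis=0).reshape(v, v, v)
    fig, axes = plt.subplots(1, v, figsize=(3 * v, 3))
    for k, ax in enumerate(axes):
        # 0.5 = chance level for a binary voxel (blue), 1.0 = always correct.
        im = ax.imshow(acc[k], vmin=0.5, vmax=1.0, cmap="jet")
        ax.set_title(f"slice {k + 1}")
    fig.colorbar(im, ax=axes.tolist())
    plt.show()
```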

TABLE II. Comparison of object accuracy—all voxels of a cube are predicted correctly—using the 3 × 3 × 3 dataset. Here, the SfM strategies show different behavior; note the stepwise increments of the performance of the global ANN, which cannot be seen in the other architectures (cf. the discussion in Section VII). [Three plots: object accuracy over 400 training epochs for the training and testing samples of the global, hierarchical, and incremental strategies.]

The footprint on the hard disk of all trained ANNs varied from 4 MB to 160 MB per network. The training for all architectures on all setups was continued over 350 iterations, also called epochs.

To investigate how well a voxel at a certain position was learned, the accuracy for each voxel is calculated. Fig. 2 shows the average accuracy of the global ANN for all three setups, where an accuracy at random chance—50%—is marked blue and an accuracy of 100% is marked dark red. Next to the static visualization of single epochs, we rendered video sequences showing the training and testing accuracy of each voxel over all epochs.

VII. DISCUSSION—ANNS VS. MVG

When discussing which reconstruction accuracy ANNs performing image-based 3D reconstruction can reach, one important remark is necessary first: classical MVG approaches [2], [3], [23], [24] reconstruct vertices and faces instead of voxels. However, the first observations, presented in Table III, suggest that image-based reconstruction by ANNs is not as simple as one might expect. Due to the hidden inner object voxels, which are mostly not predicted correctly, as seen in Fig. 2, the complexity of the 3D reconstruction task leads to at most 58% object accuracy. These inner voxels may be occluded and not visible in any image; thus, without object knowledge, these inner voxels cannot be reconstructed.

TABLE III. Validation results of the different strategies learned and trained on the different dataset setups. The object accuracy is a measure of how many cubes were reconstructed entirely correctly.

    setup        ANN strategy    voxel accuracy    object accuracy
    3 × 3 × 3    global          97.72%            54.404%
    3 × 3 × 3    hierarchical    98.03%            58.646%
    3 × 3 × 3    incremental     95.77%            34.168%
    4 × 4 × 4    global          92.00%             0.006%
    4 × 4 × 4    hierarchical    94.62%             0.045%
    4 × 4 × 4    incremental     84.56%             0.001%
    8 × 8 × 8    global          54.40%             0.000%
    8 × 8 × 8    hierarchical    58.32%             0.000%
    8 × 8 × 8    incremental     56.41%             0.000%

Based on these considerations, we investigated whether there are positions in the voxel space that are harder to predict than others. Fig. 2 shows the average accuracy for each voxel. The accuracy of the outer voxels of the cube is high in all setups, while the accuracy drops towards the center of the voxel cube. This can be explained by the fact that these voxels are often blocked from view by other parts of the object.

By emulating the SfM strategies in the ANN architectures, we recognized no significant differences between the networks regarding the voxel accuracy, cf. Table I. But when investigating the object accuracy, shown in Table II, one can see plateaus over the number of epochs only in the global architecture. This effect might be an indicator that the global architecture is preferable for 3D reconstruction tasks by ANNs.

As expected, the biggest issue, next to occlusion, is that the performance of all benchmarked ANNs drops significantly with an increasing number of voxels. Next to the drop in performance, the complexity as well as the processing time of the ANNs increases.

Based on these issues and the unsatisfactory accuracy of the ANNs even on this simple cube dataset, another output representation for the ANNs should be considered. Since the accuracy drops significantly and the footprint of the ANN increases exponentially when scaling up the voxel space, the voxel-based representation might be misleading. Reflecting the MVG pipelines, ANNs should instead be designed and trained to predict the fundamental matrix based on two views, the trifocal tensor based on three views, or the w-focal tensor based on w views. On these essential matrices and tensors, an ANN-powered 3D object reconstruction that results in a point cloud can be implemented. By focusing on these essential matrices and tensors, a non-fixed number of points within the point cloud could be generated without the need for, e.g., recurrent ANNs.
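For reference, the standard two- and three-view relations such a network would have to regress are:

```latex
% Two-view constraint: corresponding homogeneous image points x <-> x' are
% linked by the fundamental matrix F (3x3, rank 2, 7 degrees of freedom).
\[
  \mathbf{x}'^{\top} F \,\mathbf{x} = 0, \qquad \operatorname{rank}(F) = 2 .
\]
% Three-view analogue: the trifocal tensor T = (T_1, T_2, T_3) relates a
% point triplet x <-> x' <-> x'' via the point-point-point incidence relation
\[
  [\mathbf{x}']_{\times} \Bigl( \textstyle\sum_{i=1}^{3} x^{i} T_{i} \Bigr)
  [\mathbf{x}'']_{\times} = 0_{3 \times 3},
\]
% where [.]_x denotes the skew-symmetric cross-product matrix of a vector.
```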

VIII. CONCLUSION

In this paper, we have benchmarked three ANN architectures that replace the whole MVG 3D reconstruction pipeline. The different ANN architectures are inspired by the three most important strategies of SfM pipelines—global, hierarchical, and incremental key-point matching. This benchmark is done on a reference dataset for evaluating ANNs for image-based 3D reconstruction. Using convolutional ANNs, we show for small scales that 3D reconstruction of unknown object categories is possible with a voxel accuracy of 98%. Based on the benchmark, we recognized a lack of performance, especially for higher voxel resolutions. Therefore, we discuss whether a voxel-based representation is misleading for introducing ANNs into the domain of 3D reconstruction. In our opinion, a smarter way is to design ANNs that predict the fundamental matrix or the trifocal tensor. Thus, the voxel space, which mainly causes issues like hardware limitations and the number of needed training samples when scaling it up, could be avoided. As forthcoming work, we are going to design ANNs to predict the fundamental matrix based on two views. For training, testing, and validation of these ANNs, a dataset based on 3D CAD objects will be created.

ACKNOWLEDGMENT

The training and validation of the different ANNs were performed on an SGI UV2000 granted by the German Research Foundation (DFG), http://gepris.dfg.de/gepris/projekt/239246210, as well as on a Titan X (Maxwell) donated by the NVIDIA Corporation.

REFERENCES

[1] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[2] Agisoft. (2017, Jan.) Agisoft PhotoScan. [Online]. Available: http://www.agisoft.ru/
[3] C. Wu. (2011, Jan.) VisualSFM: A visual structure from motion system. [Online]. Available: http://ccwu.me/vsfm/
[4] M. Vergauwen and L. Van Gool, "Web-based 3D reconstruction service," Machine Vision and Applications (MVA), vol. 17, no. 6, pp. 411–426, 2006.
[5] J. Schöning and G. Heidemann, "Evaluation of multi-view 3D reconstruction software," in Computer Analysis of Images and Patterns (CAIP). Springer International Publishing, 2015, pp. 450–461.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[8] P. Simard, D. Steinkraus, and J. Platt, "Best practices for convolutional neural networks applied to visual document analysis," Seventh International Conference on Document Analysis and Recognition (ICDAR), vol. 3, pp. 958–962, 2003.
[9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An information-rich 3D model repository."
[10] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
[11] J. Schöning, T. Behrens, P. Faion, P. Kheiri, G. Heidemann, and U. Krumnack, "Structure from motion by artificial neural networks," in Scandinavian Conference on Image Analysis (SCIA). Springer International Publishing, 2017, pp. 146–158.
[12] J. Schöning, G. Heidemann, and U. Krumnack, "Structure from neural networks (SfN2)," Journal of Computers, vol. 13, no. 8, pp. 988–999, 2018.
[13] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction," in Computer Vision – ECCV 2016. Springer International Publishing, 2016, pp. 628–644.
[15] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE - Institute of Electrical and Electronics Engineers, 2015.
[16] L. W. Peng and S. M. Shamsuddin, "3D object reconstruction and representation using neural networks," in Proceedings of the International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia (GRAPHITE). Association for Computing Machinery (ACM), 2004, pp. 139–147.
[17] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658, 2015.
[18] A. Roy and S. Todorovic, "Monocular depth estimation using neural regression forest," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[19] R. Garg, V. K. B.G., G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in Computer Vision – ECCV 2016. Springer Nature, 2016, pp. 740–756.
[20] Á. González, "Measurement of areas on a sphere using Fibonacci and latitude–longitude lattices," Mathematical Geosciences, vol. 42, no. 1, pp. 49–64, 2009.
[21] F. Chollet, "Keras," https://github.com/fchollet/keras, Jan. 2017.
[22] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, 2016.
[23] Q. Pan, G. Reitmayr, and T. Drummond, "ProFORMA: Probabilistic feature-based on-line rapid model acquisition," British Machine Vision Conference (BMVC), pp. 112.1–112.11, 2009.
[24] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, et al., "Detailed real-time urban 3D reconstruction from video," International Journal of Computer Vision, vol. 78, no. 2–3, pp. 143–167, 2008.

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.