DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Reconstruction and recommendation of realistic 3D models using cGANs

MÓNICA VILLANUEVA AYLAGAS

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: June 15, 2018
Supervisors: Hedvig Kjellström and Mario Romero Vega
Examiner: Danica Kragic Jensfelt
Swedish title: Rekonstruktion och rekommendation av realistiska 3D-modeller som använder cGANs
School of Electrical Engineering and Computer Science

Abstract

Three-dimensional modeling is the process of creating a representation of a surface or object in three dimensions using specialized software, in which the modeler scans a real-world object into a point cloud, creates a completely new surface or edits the selected representation. This process can be challenging due to factors like the complexity of the 3D creation software or the number of dimensions in play. This work proposes a framework that recommends three types of reconstructions of an incomplete or rough 3D model using Generative Adversarial Networks (GANs). These reconstructions follow the distribution of real data, resemble the user model, and stay close to the dataset while keeping features of the input, respectively. The main advantage of this approach is that the GAN accepts 3D models as input instead of latent vectors, which removes the need to train an extra network to project the model into the latent space. The systems are evaluated both quantitatively and qualitatively. The quantitative measure relies on the Intersection over Union (IoU) metric, while the qualitative evaluation is carried out through a user study. Experiments show that it is hard to create a system that generates realistic models, following the distribution of the dataset, since users have different opinions on what is realistic. However, similarity between the user input and the reconstruction is well accomplished and is, in fact, the feature most valued by modelers.

Sammanfattning

Tredimensionell modellering är processen att skapa en representation av en yta eller ett objekt i tre dimensioner via en specialiserad programvara där modelleraren skannar ett verkligt objekt i ett punktmoln, skapar en helt ny yta eller redigerar den valda representationen. Denna process kan vara utmanande på grund av faktorer som komplexiteten i den 3D-skapande programvaran eller antalet dimensioner i spel. I det här arbetet föreslås ett ramverk som rekommenderar tre typer av rekonstruktioner av en ofullständig eller grov 3D-modell med Generative Adversarial Networks (GAN). Dessa rekonstruktioner följer distributionen av reella data, liknar användarmodellen och håller sig nära datasetet medan respektive egenskaper av ingången behålls. Den främsta fördelen med detta tillvägagångssätt är acceptansen av 3D-modeller som input för GAN istället för latenta vektorer, vilket förhindrar behovet av att träna ett extra nätverk för att projicera modellen i latent rymd. Systemen utvärderas både kvantitativt och kvalitativt. Den kvantitativa åtgärden beror på Intersection over Union (IoU) metrisk medan den kvalitativa utvärderingen mäts av en användarstudie. Experiment visar att det är svårt att skapa ett system som genererar realistiska modeller efter distributionen av datasetet, eftersom användarna har olika åsikter om vad som är realistiskt. Likvärdighet mellan användarinmatning och rekonstruktion är väl genomförd och i själva verket den mest uppskattade funktionen för modellerare.

Contents

1 Introduction
    1.1 Research question
    1.2 Motivation
    1.3 Delimitations
    1.4 Societal, ethical and sustainability aspects
    1.5 Outline of the Master Thesis

2 Background and related work
    2.1 Background
        2.1.1 Generative models
        2.1.2 3D models
    2.2 Related work
        2.2.1 2D
        2.2.2 3D
        2.2.3 User studies

3 Method
    3.1 Data
        3.1.1 Format
        3.1.2 Noise functions
    3.2 GANs
        3.2.1 Network architectures
        3.2.2 Objective function
    3.3 Distance functions
    3.4 Recommendation system
    3.5 Evaluation
        3.5.1 Quantitative: Distance measure
        3.5.2 Qualitative: User study
    3.6 Hardware description

4 Experiments and results
    4.1 Distance functions
    4.2 Noise generalization
    4.3 Discriminator strength
    4.4 Recommendation system
        4.4.1 Realistic system
        4.4.2 Balanced system
        4.4.3 Similar system
        4.4.4 System comparative
    4.5 User models
    4.6 Qualitative evaluation: User study
        4.6.1 Population statistics
        4.6.2 Data preprocessing and analysis
        4.6.3 Realism experiment
        4.6.4 Similarity experiment
        4.6.5 Preference experiment

5 Discussion and conclusions
    5.1 Achievements
    5.2 Future work

Bibliography

A Complete list of noise functions
    A.1 Unstructured noise
    A.2 Structured noise

B Architecture

C User study
    C.1 Realism experiment
    C.2 Similarity experiment
    C.3 Preference experiment
    C.4 Exit survey

D Additional resources

Chapter 1

Introduction

Three-dimensional modeling is the process of creating a representation of a surface or object in three dimensions using specialized software in which the modeler can create and edit the representation. Another way of creating the surfaces is scanning real-world objects into a point cloud. There are multiple software packages for creating 3D models, each with its own characteristics, tools and render engines. Figure 1.1 shows the User Interface (UI) of two modeling packages: Blender, as an example of open source software (GNU GPLv2+ licence), and Autodesk Maya, as a commercial one.

Figure 1.1: User interfaces of different software: (a) Blender interface, (b) Maya interface.

Currently, 3D modeling is difficult to master. Not everyone can successfully reproduce what they see, even less what they imagine. This can be the result of personal limitations, the complexity of working with multiple dimensions, or the intricacy of the 3D modeling software.

Movies, many video games and even virtual and mixed reality apps surround us with an increasing need for realistic models. Experienced modelers can benefit from a tool that helps them speed up content creation. Furthermore, with 3D printers becoming part of everyday life, even beginners would be able to create their own natural-looking models.

The field of Computer Graphics is not the only one benefiting from advances in the generation of 3D models. Many robotic applications use 3D models to solve problems like interacting with objects. The area of medicine also employs 3D models for segmentation of cancer or injuries.

The increase in computational power is boosting research in Deep Learning which, in synergy with generative models, is increasing the amount and quality of 3D Computer-Aided Designs (CADs). This data enhancement is, in turn, improving the learning processes and helping achieve better models, adding value to the Machine Learning pipeline.

The aim of this Master Thesis is the design and development of an end-to-end recommendation system for 3D models using GANs to generate novel reconstructions from a user input. To the best of the author's knowledge, no recommendation systems are included in 3D modeling software, nor has the idea been researched so far.

The decision to use GANs to solve this problem is supported by the preference to reconstruct novel objects and the fact that this method is the state of the art in generation, as revealed by the literature study in Chapter 2. The full motivation behind this work is outlined in Section 1.2.

1.1 Research question

This work addresses the following research question:

What are the benefits and limitations of using conditional Generative Adversarial Nets to reconstruct unpolished voxelized models and recommend plausible alternatives?

The reconstructions are assessed using Intersection over Union as a quantitative measure, and the users' perception is evaluated through a user study regarding both the level of realism and the similarity with respect to the model entered by the user, as measured by forced pair-wise comparison.

1.2 Motivation

The ambition behind this work is the creation of a complete end-to-end system that reconstructs 3D models designed by users, in other words, helping to bring a preliminary 3D sketch closer to the final result the user intended when modeling it.

The reconstructions are guided by three different similarity measures, which make the output follow the distributions of natural 3D models, look like the user models or share features from both natural 3D CADs and the sketch created by the user. This makes it possible to build up a system that uses the reconstructions as recommendations for unfinished or crude models, comparable to predictive text in mobile phones. No previous work has been found that uses end-to-end generation of 3D models to build a recommendation system.

The main difference from the closest related work [19] is the lack of an additional network aside from the GAN. As explained in Section 2.2.2, Liu, Yu, and Funkhouser [19] project the 3D model onto the manifold to obtain a latent vector that is later used as input for the GAN. In this Master Thesis, the 3D models are fed directly to the Generative Adversarial Network.

Moreover, the system of Liu, Yu, and Funkhouser [19] is not meant as a recommendation system, but as a tool that improves the current 3D model through iterative interaction with the user. The recommendation system developed for this work instead presents the user with three outputs based on different perceptual attributes.

Finally, a quantitative measure that indicates the accuracy of the system may not be very convenient in this particular case where, ultimately, the resulting models will be judged by humans. Therefore, in addition to evaluating the systems quantitatively, a study is carried out to assess how natural the results are and how much they resemble the model created by the user. It also measures which of these qualities is more useful for a 3D recommendation system.

Given this information, it is possible to analyze the validity of the mathematical assumptions as well as to estimate the most valuable weighting for these features according to the users.

1.3 Delimitations

The project focuses on the technical difficulties of training GANs for the reconstruction of 3D models, not on developing a graphical interface for editing 3D models and visualizing recommendations.

The recommendation system can be extended by training additional GANs using different sets of parameters for the weights that modify the distance functions. However, only three sets of these parameters are actually trained to test the theoretical assumptions related to the features of the reconstructions.

The work targets a one-class system that must be trained for every category separately, leaving the multiclass design, implementation and testing for future work. Only chairs are tested for illustrative reasons, but different classes can be employed after the modification of certain hyperparameters.

1.4 Societal, ethical and sustainability aspects

The use of Artificial Intelligence and Machine Learning applications has multiple repercussions in society that need to be addressed by scientists. For this particular work, the most dangerous issue could be job loss due to the automation of a process. The recommendation system, however, does not aim to replace human beings, but to present a tool to ease the task of 3D modelers in what has been called Artificial Intelligence Augmentation (AIA), or IA for short.

At the same time, it is possible to argue that improving the productivity of a worker can reduce the need to hire more staff. This problem is real, but it is not new. For instance, with the advances of the Industrial Revolution, threshers and seeders were introduced in the fields to help humans. Nowadays, the number of people working in agriculture is not comparable to what it was then.

However, this leads to a deeper discussion about shifts in long-term job creation. It is said that "Artificial Intelligence will create more jobs than it eliminates" [23]. It is necessary to get ready for the change in terms of education, preparing new generations for the new job market, and in terms of economic repercussions, such as fewer companies or individuals holding most of the power and wealth. These are topics on which this work does not dwell, but they are important to keep in mind.

Additionally, this work aspires to help inexperienced users create more realistic 3D models and thereby bring them closer to technology. With the increasing popularization of 3D printers, closing this gap could mean attracting people who are initially not interested in modeling but are drawn to the idea of making small changes to customize premade models, engaging more girls in STEM [24] thanks to the creative nature of 3D modeling, or allowing different kinds of designers to see their creations come true.

On a different note, there is an essential issue that is not commonly addressed and that I would like to mention. Researching Machine Learning and moving the developed algorithms into production consumes large amounts of power, which is, at least currently, mostly generated from nonrenewable sources. According to Bawden [2], the amount of electricity consumed by data centers is higher than that of the whole UK and amounts to 2% of global power consumption. Until electricity is generated in a completely clean and sustainable way, there should be awareness of this topic.

1.5 Outline of the Master Thesis

The rest of the work is organized as follows: Chapter 2 reviews the basic background theory needed to understand the work and examines several papers solving similar problems with the selected method, GANs. In Chapter 3, the reader can find a detailed description of the data used for training, the architecture of the network, the goal of the recommendation system, the evaluation method and the hardware used. Chapter 4 shows the results of learning experiments testing different distance functions, noise generalization, several discriminator strengths and various learning approaches for each of the systems in the recommender. It also reports the results of the user study. Finally, Chapter 5 summarizes this work by stating the achievements of the solution as well as highlighting some directions for future work.

Chapter 2

Background and related work

For the particular problem that concerns us, the reconstruction or improvement of 3D models, the background study must cover areas related to generative models and 3D handling, including representations and metrics of similarity. The related work, on the other hand, should include solutions to different problems using the selected Machine Learning technique, Generative Adversarial Nets (GANs), as well as other approaches to the same problem. Furthermore, it needs to address how these works design their user studies in order to measure their goals as perceived by human subjects.

2.1 Background

The addressed research question intends to produce models that are as similar as possible to the input created by a user without diverging from the realistic space of 3D models. This makes it clear that the selected method should be part of the family of generative models since the goal is to create improved 3D objects. In order to condition the generative model on the input, it is essential to choose a representation for the 3D model. This representation will also affect the way to measure the accuracy of the reconstruction. These issues are discussed in the following sections.

2.1.1 Generative models

The goal of generative models is to learn the true underlying distribution of the data and allow sampling from the joint distribution of the observation and the class. There are several such models, ranging from traditional methods, like Gaussian Mixture Models [32] and Hidden Markov Models [29], to Deep Learning approaches, like Generative Adversarial Networks [10].

A possible approach to the problem stated in this Thesis could be finding the closest sample to the input using k-NN on a database of realistic models. However, since the goal is to create novel reconstructions, and given the complexity of the underlying distribution and the amount of data necessary to approximate it, the most common methods to achieve this are currently Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs) and Autoregressive models. Their advantages and disadvantages have already been studied by Karpathy et al. [16].

VAEs set up a mathematical Bayesian framework in which to learn complex latent variable spaces, and they scale well to large datasets. The ability to choose a prior distribution is useful when there is domain knowledge. However, their main drawback is that the sampled data is blurry. This is caused by the use of variational inference, which tends to model the mean or mode of the data.

GANs, on the other hand, are designed for generative tasks where the distribution is not as important as the perceptual result. Generated samples are sharper because the network is encouraged to imitate the training data. Although GANs could seem like the solution, they are known for their unstable training process. Their two networks are trained by competing against each other until reaching the Nash equilibrium, which can be complicated to achieve using gradient descent.

Meanwhile, Autoregressive models have a simpler and more stable training that returns more realistic results. Nevertheless, sampling from these models is highly inefficient due to sequential generation.

The selected method, GANs, has been a hot topic in the field since it was first published by Goodfellow et al. [10] in 2014. Generative Adversarial Networks are composed of two architectures that learn from each other by minimizing opposing losses until convergence: a generative model that tries to learn the data distribution and a discriminative model that attempts to infer whether a sample comes from the original distribution or from the generator. The method is described in more detail in Chapter 3.

In the first publication, these two networks were implemented as multilayer perceptrons and trained using gradient descent, but several improvements have been published over the years regarding architecture and learning tricks to make the training more stable, like conditioning the data generation on class labels or on some part of the data, as in conditional GANs (cGANs) [26].

Convolutional Neural Networks (CNNs) take advantage of the structure of the data and can be used as the architecture for the GAN, improving the results and the stability compared with perceptrons. This architecture, called Deep Convolutional Generative Adversarial Networks (DCGANs) [30], makes use of all-convolutional nets, replacing pooling with strides in the convolution, which are easier to reverse and let the network learn its own positional information. It also avoids fully connected layers to increase stability. The generator applies ReLU as the non-linear activation function except on the output layer, where using a bounded function like Tanh speeds up the learning. The discriminator, on the other hand, employs Leaky ReLU to avoid dead units, since this is the only source of information for the generator to learn. Batch normalization is applied in all layers except the output of the generator and the input of the discriminator, and Adam is used as the gradient descent optimization algorithm. In the conditional version, cDCGAN, removing the scale and bias from batch normalization achieved better results.

In order to improve the learning process, researchers have developed several practices like training the discriminator only if the accuracy of the last batch drops below 80% [39], using dropout after Leaky ReLU or sampling from a normal distribution instead of a uniform one [19]. However, the most important progress in the field is the Wasserstein [1] training objective used in WGAN and its improved version with gradient penalty [11] in IWGAN. Both methods avoid the problem of balancing the generator and discriminator training because the Earth-Mover (EM) distance or Wasserstein-1 cost function, unlike the Jensen-Shannon (JS) divergence, allows the discriminator to train until optimality. Moreover, it appears to work well even without a careful design of the architecture. The improved version proposes an alternative to weight clipping in order to compute this cost function. Penalizing the norm of the gradient of the discriminator with respect to the input removes the need to tune an additional hyperparameter and yields a more stable and precise training, since clipping biases the discriminator toward learning less complex distributions.
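The gradient-penalty objective mentioned above can be illustrated with a small sketch. In the formulation of Gulrajani et al. [11], the discriminator (critic) loss is the mean score on generated samples minus the mean score on real samples, plus a weighted penalty on how far the norm of the critic's input gradient deviates from 1. The function below assumes that the scores and gradient norms have already been computed elsewhere, for example by an automatic differentiation framework at random interpolations between real and generated samples. The function name, the argument names and the default weight of 10 are illustrative choices, not the implementation used in this thesis.

```python
from statistics import fmean


def critic_loss_gp(real_scores, fake_scores, grad_norms, lam=10.0):
    """Critic loss with gradient penalty, given precomputed quantities.

    real_scores: critic outputs on real samples
    fake_scores: critic outputs on generated samples
    grad_norms:  L2 norms of the critic's gradient with respect to its input,
                 evaluated at interpolations of real and generated samples
    lam:         penalty weight (hypothetical default)
    """
    # Wasserstein term: the critic wants low scores on fakes, high on reals.
    wasserstein_term = fmean(fake_scores) - fmean(real_scores)
    # Penalty: push the gradient norm toward 1 instead of clipping weights.
    penalty = fmean([(g - 1.0) ** 2 for g in grad_norms])
    return wasserstein_term + lam * penalty
```

When every gradient norm equals 1 the penalty vanishes and only the Wasserstein term remains, which is the behaviour weight clipping tries to approximate by constraining the weights directly.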

2.1.2 3D models

The same way that images can be represented in different formats depending on the problem, such as gray level, RGB (Red, Green, Blue) or HSL (Hue, Saturation, Luminance), 3D models can also be represented in several ways. Some of the most important 3D object representations are point cloud, voxel and polygon mesh [8], represented in Figure 2.1.

Point cloud is the raw data product of a 3D laser scan. It is formed by a set of data points in a virtual three-dimensional space that represents the surface of the object being scanned. This format can be converted into any other higher-level representation. For example, it is possible to perform surface reconstruction using Delaunay triangulation, resulting in a triangle mesh.

Polygon mesh is a collection of edges, faces and vertices that form a 3D polyhedron. The polygons constituting the model are usually triangles or quadrilaterals. This representation can produce a more realistic description with fewer data points, but as resolution increases it can require a large amount of storage space. Its advantage is that current GPUs are designed to optimize polygons.

Voxels represent a value on a three-dimensional grid. The location of a voxel is inferred from its position relative to other voxels, which makes it possible to represent heterogeneously filled spaces efficiently, in contrast to polygons.

Figure 2.1: The Stanford bunny modeled in point cloud, mesh and voxel representation: (a) point cloud, (b) mesh, (c) voxel.

In Machine Learning, most approaches use voxel representation either as input or output [39, 19, 37, 40], though there is no consensus [14, 27, 43] and researchers are still trying to find an adequate representation [18].
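To make the relation between a point cloud and a voxel grid concrete, the following minimal sketch maps a list of points to the set of occupied cells of a cubic grid. It is only an illustration under simple assumptions (uniform scaling to the bounding box, one occupied cell per point) and is not the conversion used to build the dataset. The function name and the default resolution of 32 are chosen for the example.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Point], resolution: int = 32) -> Set[Voxel]:
    """Map points to the set of occupied cells of a resolution^3 grid."""
    pts = list(points)
    if not pts:
        return set()
    # Axis-aligned bounding box of the cloud.
    mins = [min(p[i] for p in pts) for i in range(3)]
    maxs = [max(p[i] for p in pts) for i in range(3)]
    # A single scale for all axes keeps the proportions of the object.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    occupied = set()
    for p in pts:
        cell = tuple(
            # Points on the far boundary are clamped into the last cell.
            min(int((p[i] - mins[i]) / extent * resolution), resolution - 1)
            for i in range(3)
        )
        occupied.add(cell)
    return occupied
```

Several nearby points fall into the same cell, which is why a voxel grid has a fixed size regardless of how densely the surface was sampled.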

Measuring 3D shape similarity is an unsolved problem [40], and the 3D representation directly affects the available methods. Focusing on voxels, the most widespread technique in research seems to be Intersection over Union (IoU), especially in the field of medicine, where the voxel representation is as long-established as it is common [5], but it is also extensively used in other fields [33]. Other error measurements for different representations are addressed in [3, 6], to name just a few examples.

Nevertheless, these traditional geometric metrics are not the only ones. Recently, a method has been developed based on the representation learned by the different filters of a Convolutional Neural Network. This method, perceptual similarity, is based on the idea that if two images or objects are the same, they produce the same filter responses [7, 19].
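As an illustration of the IoU metric on voxels, the sketch below computes it for two occupancy grids given as nested lists. The helper names, the 0.5 occupancy threshold and the convention of returning 1.0 for two empty grids are assumptions made for the example.

```python
def occupied_cells(grid, level=0.5):
    """Return the (x, y, z) indices of a nested-list grid with value >= level."""
    return {
        (x, y, z)
        for x, plane in enumerate(grid)
        for y, row in enumerate(plane)
        for z, value in enumerate(row)
        if value >= level
    }


def iou(grid_a, grid_b, level=0.5):
    """Intersection over Union of the occupied voxels of two grids."""
    a = occupied_cells(grid_a, level)
    b = occupied_cells(grid_b, level)
    union = a | b
    if not union:
        return 1.0  # convention: two empty grids are considered identical
    return len(a & b) / len(union)
```

The score ranges from 0 (no shared occupied voxel) to 1 (identical occupancy), and it ignores voxels that are empty in both grids, so large empty regions do not inflate the result.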

2.2 Related work

In recent years GANs have undergone intense study, particularly in the area of images, but they have also been employed for 3D modeling, naturally drawing on the wider 2D research and experience. In most cases the ultimate goal of these systems is to please the final user. Therefore, not only quantitative measures are used, but also user studies that judge the developed algorithm against previous works or ground-truth samples.

2.2.1 2D

Examples of work on problems related to the task at hand include Liu et al. [20] and Choi et al. [4].

The work of Liu et al. [20], called Auto-painter, is a good example of a cDCGAN that uses a partial input, in this case a black and white sketch, and tries to reconstruct a missing feature, color, based on what was learned from a database of colorful images.

Choi et al. [4], in their system called StarGAN, instead perform image-to-image translation using only one GAN for multiple domains. The learning process is conditioned both on the input image that is to be modified and on the class label of the target domain.

Isola et al. [13] show how GANs can be used as a general-purpose solution for multiple problems, removing the need to engineer specific loss functions. The goal of pix2pix is to prove that it is possible to learn a mapping between pairs of representations using the same framework.

The discriminator architecture of these implementations is based on the PatchGAN discriminator [13], with StarGAN adding an extra output layer for classification. In addition, Auto-painter and pix2pix use a U-net architecture for the generator [34].

The closest 2D investigation to the research question under study is the paper published by Zhu et al. [42], in which the authors design manipulation operations so that edits on the image give rise to realistic results. First, they approximate the manifold using a GAN and then project the input image to find the closest latent vector using a hybrid approach of feedforward and optimization. Next, they manipulate the latent vector with constraints to match the user's intent and stay close to the input in the manifold using gradient descent. Finally, the edit is transferred back to the original picture by applying traditional optical flow methods.

2.2.2 3D

When talking about 3D reconstruction, most papers tackle the problem of inferring a 3D object from a single image [39], a stereo system or a 2.5D sketch [40], but few works deal with 3D objects as an input [19]. In addition, and as explained before, several generative approaches have been used in order to solve these problems [27, 43]. Since the focus of this project is on GANs, as motivated in Section 2.1.1, the detailed study is centered on these particular solutions.

The first major work on 3D generation using GANs was proposed by Wu et al. [39]. In this paper, the advances in GANs are combined with those in volumetric convolutional networks to perform classification, generate random 3D models and reconstruct these models from a 2D image by building a Variational Autoencoder on top of the GAN. In addition, Smith and Meger [37] extend this work by implementing the advances of Gulrajani et al. [11]. However, neither of these studies addresses the problem of controlling the output or conditioning it on an input 3D object.

The closest source available is the work of Liu, Yu, and Funkhouser [19], inspired by the previously presented work of Zhu et al. [42]. Despite this fact, their goal is not to optimize edits but to project original or modified 3D objects onto the learned manifold. The method trains three neural networks to achieve its objective. The first two are the generator and discriminator networks that learn the latent space of the manifold. The generator accepts as input a latent vector representing the user model. The last network, called the projection operator, maps the input 3D model to its latent vector so that it can be used in the generator. This is the most important difference from the approach proposed in this Master Thesis, where the generator is fed directly with the 3D model instead of training an additional network.

The GAN architecture used is a revised version of 3DGAN [39] that preserves stability during training. The projection operator, on the other hand, is the novel contribution of the paper. It adjusts the importance of the plausibility of the generated model as well as its similarity to the original input. In their implementation, they use a feedforward network to optimize the similarity and use the result as an initial guess to find a local minimum using gradient descent on the whole objective loss. In this case, computational time is important due to the interactive nature of the tool.

2.2.3 User studies

Traditionally, user studies comparing the plausibility of different samples have used p-values [15, 21]. However, in recent years, researchers just report the percentage of users preferring a certain result over another, where 50% represents chance. A reason for this change of methodology can be the number of articles criticizing p-values as a statistical tool to prove hypotheses [12], particularly in the medical field. The detractors argue that p-value hacking is a common practice and, more importantly, that a p-value cannot tell you whether your hypothesis is correct, since it is the probability of the data given the hypothesis. Basic and Applied Social Psychology was the first journal to ban p-values [38], in 2015.
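The percentage-based reporting described above amounts to a simple tally over forced-choice trials. The sketch below is a hypothetical helper, not code from any of the cited studies: it takes the option chosen in each trial and returns the fraction of trials won by each option, so that in a two-option study a value near 0.5 corresponds to chance.

```python
from collections import Counter


def preference_rates(choices):
    """Fraction of forced-choice trials in which each option was selected."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {option: n / total for option, n in counts.items()}
```

For example, `preference_rates(["A", "B", "A", "A"])` reports that option A was preferred in 75% of the trials and option B in 25%.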

The work by Mantiuk, Tomaszewska, and Mantiuk [22] compares four subjective methods for quality assessment, namely single and double stimulus, forced choice and similarity judgments. Single stimulus shows a sample for a certain amount of time and asks the user to rate it, while double stimulus shows two samples in sequence before requesting a grade for both. The task in forced choice is to select the best sample given two or more options. The difference from similarity judgments is that it uses a relative scale instead of absolute terms. The conclusions of this work are that forced choice pairwise comparison is the most accurate method and also the fastest when using a sorting algorithm for the adaptive approach.

Accordingly, the most frequent methods used in other papers are variants of this forced choice pairwise comparison. For measuring realism, the studies compare the output of a developed system against that of a previous work and/or a ground-truth sample in pair-wise trials. Users have to choose the more plausible example or indicate their lack of preference. These studies have two versions: comparing original models with the result of an algorithm [13] and comparing among algorithms [15]. However, there are researchers who still prefer the other methods, like the single stimulus in the case of Zhu et al. [42]. Liu et al. [20] show four synthetic examples and ask users to rate the best one and the worst one according to subjective appeal. The results are ranked by a "popularity index".

Concerning similarity, the most common study design shows an original sample and asks the subjects to select, from two other examples, the one that looks more similar [21, 40].

Chapter 3

Method

The solution proposed in this Master Thesis depends entirely on the data used for training, the decisions made to design the GAN, including the distance functions tested during training, and the way the results are evaluated.

All of these particulars are explained in the next sections, along with a description of the equipment used to implement, train and execute the project.

3.1 Data

There are three types of data that are needed to train the system. The most important is a ground-truth dataset from which to learn the natural distribution of 3D objects. The quality and variability of this dataset establish the limitations of the problem. The second type of data is the models designed by users that are to be improved at test time. Finally, the third class of models is produced by introducing noise into the ground-truth dataset so that it is possible to measure the distance between the reference model and its reconstruction.

3.1.1 Format

Following the lead of similar papers [18, 19, 37, 39, 41, 43], the selected dataset is ModelNet 10 [17]. This dataset includes 10 classes of objects (chair, sofa, bed, monitor, table, toilet, desk, dresser, bathtub and nightstand) oriented into 12 views of size 32 × 32 × 32 and split into training and test subsets. For the purpose of this study, the experiments are performed for the class with the most samples: chairs. The chair class contains 10 657 training samples and 1 200 test samples, making a total of 11 857 models.

The voxelized version of the dataset, available in the project 3D ShapeNet [41], uses a Matlab format to store the models. Since the system works with Python, these files are preprocessed to 3D matrices in Numpy format. In a similar way, user models can be preprocessed by converting .stl into .npy files.

The output of the system, which has the same resolution as the input, is not subjected to any condition such as connectivity requirements or thresholding on the existence or absence of voxels. However, applying these constraints transforms the output matrices into the same format as the input, and it also cleans and improves the visual appeal of the final results, which calls for a custom postprocessing step.
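The postprocessing step mentioned above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function names, the default threshold of 0.5 and the choice of 6-connectivity (face neighbours only) are assumptions made for the example.

```python
from collections import deque

import numpy as np

# Face neighbours only (6-connectivity); this choice is an assumption.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def largest_connected_component(occupancy):
    """Keep only the largest 6-connected component of a boolean voxel grid."""
    occupancy = np.asarray(occupancy, dtype=bool)
    visited = np.zeros_like(occupancy, dtype=bool)
    best = []
    for seed in zip(*np.nonzero(occupancy)):
        if visited[seed]:
            continue
        # Breadth-first flood fill starting from an unvisited occupied voxel.
        component, queue = [], deque([seed])
        visited[seed] = True
        while queue:
            x, y, z = queue.popleft()
            component.append((x, y, z))
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                inside = all(0 <= n[i] < occupancy.shape[i] for i in range(3))
                if inside and occupancy[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
        if len(component) > len(best):
            best = component
    cleaned = np.zeros_like(occupancy, dtype=bool)
    if best:
        cleaned[tuple(np.array(best).T)] = True
    return cleaned


def postprocess(probabilities, threshold=0.5):
    """Threshold voxel probabilities, then enforce connectivity."""
    occupied = np.asarray(probabilities) >= threshold
    return largest_connected_component(occupied).astype(np.uint8)
```

For a 32 × 32 × 32 generator output `probs`, `postprocess(probs, threshold=0.3)` would return a binary matrix in the same format as the input models. A library routine for connected-component labelling could replace the explicit flood fill.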

3.1.2 Noise functions

The synthetic data aim to recreate user-generated models through the introduction of noise in the database models. This synthetic data is used during training so that it is possible to measure the distance with respect to a ground-truth. Training with CADs modeled by real users would forfeit this possibility. The synthetic data must be as similar as possible to the typical modeling errors introduced by modelers. Designing a set of plausible noise functions is a challenging topic that this work does not aim to solve. Therefore, some basic distortions and crude user-like functions are combined to create this synthetic data during the training stage. Examples are shown in Figure 3.1.

Figure 3.1: Examples of noise functions applied to a reference 3D model. Panels: (a) reference 3D model, (b) remove voxels, (c) add voxels, (d) dilation, (e) remove part, (f) move part, (g) bump.

• Unstructured noise

  – Remove voxels: Each voxel has a 50% probability of being removed. This creates clouds of voxels with the shape of the original object, which can make the training more robust. Note that this kind of noise is never used for the similar and balanced systems (Section 3.4), since users would rarely produce this sort of model.
  – Add voxels: Each void space has a 2% probability of becoming a voxel. After modifying the model, connectivity is enforced and only the largest connected component is returned.
  – Dilation: Basic mathematical morphology operation using the minimum 3-dimensional structuring element with connectivity = 1.

• Structured noise

  – Remove part: Removes a box with a half side length of 3 around a center randomly selected among all possible voxels.
  – Move part: Selects a random center and axis among all possible voxels and shifts the whole structure 2 units towards the closest boundary, dragging the structure so that the model is still connected.
  – Create a bump: Copies the structure from a randomly selected center among all possible voxels to the closest boundary in each direction for a radius of 10 units.

Some of the combined functions are the addition of voxels followed by dilation, creating a bump and moving a part, and dilation followed by voxel removal. Appendix A contains the complete list of noise functions with illustrative figures.
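Some of the simpler noise functions listed above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function names and signatures are hypothetical, the connectivity step of "add voxels" is omitted, and the box in "remove part" is assumed to be a cube clipped to the grid.

```python
import numpy as np


def remove_voxels(model, p=0.5, rng=None):
    """Drop each occupied voxel independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(model.shape) >= p
    return (model.astype(bool) & keep).astype(np.uint8)


def add_voxels(model, p=0.02, rng=None):
    """Fill each empty cell independently with probability p.

    The connectivity enforcement described in the text is not included here.
    """
    rng = np.random.default_rng() if rng is None else rng
    spawn = rng.random(model.shape) < p
    return (model.astype(bool) | spawn).astype(np.uint8)


def dilate(model):
    """One binary dilation step with the 6-neighbour (connectivity 1) element."""
    occupied = model.astype(bool)
    # Pad with empty cells so that shifting never wraps around the grid.
    padded = np.pad(occupied, 1, mode="constant", constant_values=False)
    out = occupied.copy()
    for axis in range(3):
        for shift in (-1, 1):
            out |= np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    return out.astype(np.uint8)


def remove_part(model, center, half_side=3):
    """Empty a cubic box around `center`, clipped to the grid boundaries."""
    out = model.copy()
    box = tuple(slice(max(c - half_side, 0), c + half_side + 1) for c in center)
    out[box] = 0
    return out
```

Combined noise functions can then be written as compositions, for example `dilate(add_voxels(model))` for the addition of voxels followed by dilation.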

3.2 GANs

Generative Adversarial Networks are composed of two architectures, a Discriminator (D) and a Generator (G), that compete against each other as described by Goodfellow et al. [10] and depicted in Figure 3.2.

In this Master Thesis, the generator tries to produce 3D models that look like the samples in the dataset, minimizing Equation 3.5c, while the task of the discriminator is to classify the inputs into real (x) or generated (r = G(c), where c is the conditional input model), minimizing Equation 3.5b. The system is trained when the losses of the two networks (Lgen and Ldisc) reach an equilibrium point in which, optimally, the generator has learned the distribution of the data and the discriminator cannot tell apart generated 3D models from samples belonging to the dataset.

Figure 3.2: GAN framework described by Goodfellow et al. [10]. Figure credit: Skymind [36]

3.2.1 Network architectures

The system is built using Convolutional Neural Networks (CNNs). This architecture takes advantage of the spatial information in the input, which is one reason why it is so widely used. Unlike image-fed networks, networks for 3D models need 3D convolutions to take depth information into account.

The framework includes a conditional generative model that takes into account the input model created by the user. The generator (Figure 3.3) is designed to use a simple encoder-decoder architecture [4] in which the user's (conditional) model is encoded into a low dimensional representation that keeps the important information of the input. The decoder component learns how to reconstruct the original voxel representation from the low dimensional representation following the restrictions imposed by the loss function.

Figure 3.3: Generator module based on an encoder-decoder architecture

Figure 3.4: Discriminator module based on the PatchGAN architecture [13]

The discriminator architecture choice, depicted in Figure 3.4, is based on the PatchGAN described by Isola et al. [13], which classifies parts of the input as real or generated instead of the whole sample. The resulting classification is the average over all the patches. This design penalizes implausible fine-grained patches of size 2 × 2 × 2, improving detail in the reconstruction and addressing a known problem of generative models, namely producing realistic but generic shapes with scarce detail.

The architectural components of the networks are not very different from those originally described in Radford, Metz, and Chintala [30]: the CNN architecture is all convolutional, with no fully-connected or pooling layers. Batch Normalization is replaced in this case by Instance Normalization layers, following Choi et al. [4]. The two differ in that Batch Normalization standardizes the activations using the statistics of the whole batch, while Instance Normalization standardizes each sample separately, which has proved to be useful in stylization tasks. Normalization and ReLU activation are applied in the generator, except for the output layer. Leaky ReLU is applied in the discriminator to avoid dead units and to improve the gradient that the generator uses to learn, as previously stated in Section 2.1.1. The Adam optimizer is used in training with parameters set to η = 0.0002 and β1 = 0.5, and dropout is used both in training and test to introduce some randomness into the generator [13].

For the specific problem at hand, the best suited activation function for the output layer of the generator is a sigmoid. This activation function is bounded to the interval (0, 1). The bounded nature of the function helps accelerate the learning process, and its output can be interpreted as the probability that a voxel exists, transforming the problem into a classification where the positive class signifies voxel existence.
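The difference between the two normalization schemes can be illustrated with a short NumPy sketch. It assumes activations stored as an (N, C, D, H, W) tensor and leaves out the learnable scale and shift parameters that library layers usually include; the function names are chosen for this example only.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Standardise each channel with statistics shared by the whole batch."""
    axes = (0, 2, 3, 4)  # batch axis plus the three spatial axes
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def instance_norm(x, eps=1e-5):
    """Standardise each channel of each sample with its own statistics."""
    axes = (2, 3, 4)  # spatial axes only, so samples do not influence each other
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

After `instance_norm`, every sample has zero mean per channel regardless of the other samples in the batch, whereas after `batch_norm` only the batch as a whole does.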

The details of the architectures are provided in Appendix B for replicability.

3.2.2 Objective function

The improved Wasserstein loss function is used instead of the traditional GAN loss in order to make the training more stable [11]. This loss is combined with a distance loss function (see Section 3.3) that helps to guide the system into learning the true distribution of the dataset while staying close to the user's input.

The loss functions that constitute the objective function are explained next:

• Improved conditional GAN loss: The adversarial loss encourages 3D models to look as close as possible to the true distribution, balancing the loss of the discriminator and the generator until they reach Nash equilibrium. The improved version includes a gradient penalty that enforces unit gradient norm along straight lines between the dataset distribution and the generator distribution.

$$L_{adv} = \mathbb{E}_{x}[D(x)] - \mathbb{E}_{c}[D(G(c))] - \lambda_{gp}\,\mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{3.1}$$

where x comes from the dataset distribution, c represents the conditional input modeled by the user and x̂ is sampled uniformly along straight lines between pairs of points drawn from the two distributions.

• Reconstruction loss: Content loss that forces the generated model to stay close to the sample in the dataset.

$$L_{rec} = \mathbb{E}_{x,c}[\mathrm{dist}(x, G(c))] \tag{3.2}$$

• Similarity loss: Content loss that describes the error between the generated model and the user input.

$$L_{sim} = \mathbb{E}_{c}[\mathrm{dist}(c, G(c))] \tag{3.3}$$

The combination of these three losses forms the objective loss, i.e., the complete loss function to minimize. The weights λrec and λsim combine the previously described partial losses, giving different relevance to following the dataset distribution, reconstructing the synthetic model or staying close to the input model, respectively.

$$L = L_{adv} + \lambda_{rec} L_{rec} + \lambda_{sim} L_{sim} \tag{3.4}$$

Since the training of the two networks is performed separately, this alternative notation is more useful:

$$L = L_{disc} + L_{gen} \tag{3.5a}$$

$$L_{disc} = -\mathbb{E}_{x}[D(x)] + \mathbb{E}_{c}[D(G(c))] + \lambda_{gp}\,\mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{3.5b}$$

$$L_{gen} = -\mathbb{E}_{c}[D(G(c))] + \lambda_{rec} L_{rec} + \lambda_{sim} L_{sim} \tag{3.5c}$$

λgp is set to 10 following the guidelines of Gulrajani et al. [11], while λrec and λsim take different values depending on which of the three components of the recommendation system is being trained (for more details see Section 3.4).
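As a rough illustration of how Equations 3.5b and 3.5c are assembled, the following NumPy sketch computes both losses from precomputed quantities. It is not the training code of this work: the function names are hypothetical, and the discriminator scores and the gradient norms at the interpolated samples are assumed to be supplied by the caller (in practice an automatic differentiation library would compute them).

```python
import numpy as np


def interpolate(x_real, x_fake, rng=None):
    """Sample x_hat uniformly on the segment between paired real and generated models."""
    rng = np.random.default_rng() if rng is None else rng
    # One mixing coefficient per sample, broadcast over the voxel axes.
    shape = (x_real.shape[0],) + (1,) * (x_real.ndim - 1)
    alpha = rng.random(shape)
    return alpha * x_real + (1.0 - alpha) * x_fake


def discriminator_loss(d_real, d_fake, grad_norms, lambda_gp=10.0):
    """Equation 3.5b, given scores D(x), D(G(c)) and gradient norms at x_hat."""
    penalty = np.mean((np.asarray(grad_norms) - 1.0) ** 2)
    return -np.mean(d_real) + np.mean(d_fake) + lambda_gp * penalty


def generator_loss(d_fake, l_rec, l_sim, lambda_rec, lambda_sim):
    """Equation 3.5c, given scores D(G(c)) and the two content losses."""
    return -np.mean(d_fake) + lambda_rec * l_rec + lambda_sim * l_sim
```

For example, `generator_loss(d_fake, l_rec, l_sim, lambda_rec=5, lambda_sim=0)` corresponds to the weights that Section 3.4 reports for the realistic system.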

3.3 Distance functions

A distance function, dist, is used in the loss functions described in Equations 3.2 and 3.3. One measures the difference with respect to the sample from the dataset being reconstructed, while the other measures it with respect to the synthetic version (see Section 3.1.2) of the same sample that simulates the user input.

Since the presence/absence of voxels in the array is unbalanced, it is inadvisable to use Intersection over Union (IoU). With this kind of metric and a naïve algorithm that selects all positions as void, it is possible to reach an accuracy as high as the unbalance between classes.

Three different measures used in the literature are tested on the problem under study: optimized IoU (sometimes called soft IoU) [31], optimized Dice (sometimes called soft Dice) [25] and balanced binary cross-entropy, based on re-weighting, similar to the loss used by Salehi, Erdogmus, and Gholipour [35].

In the following descriptions y represents the ground-truth model in the comparisons. In reality, the ground-truth can be either a dataset sample x or a synthetic sample c, depending on the loss function being computed. To simplify the notation, let us call the reconstructed model r = G(c). V is the set of all possible voxels in the defined space.

• Soft IoU: measures the similarity between the predicted model and the ground-truth. To avoid setting a threshold that converts the probabilities produced by the sigmoid layer into existent voxels, soft IoU approximates the IoU score according to:

$$\frac{|r \cap y|}{|r \cup y|} = \frac{\sum_{v \in V} r_v \, y_v}{\sum_{v \in V} (r_v + y_v - r_v \, y_v)}$$

IoU, also known as Jaccard index or similarity coefficient, returns 1 when the models are identical. In order to be used as a loss function, it is necessary to convert it to a distance by subtracting the score from 1 as follows:

$$\mathrm{dist}_{IoU} = 1 - \frac{\sum_{v \in V} r_v \, y_v}{\sum_{v \in V} (r_v + y_v - r_v \, y_v)} \tag{3.6}$$

• Soft Dice: computes the harmonic mean of precision and recall, thus weighting false positives and false negatives equally. Like IoU, the soft version uses the probabilities produced by the network:

$$\frac{2\,|r \cap y|}{|r| + |y|} = \frac{2 \sum_{v \in V} r_v \, y_v}{\sum_{v \in V} r_v + \sum_{v \in V} y_v}$$

Again, and in the same fashion as IoU, this coefficient needs to be converted into a distance to be used as a loss:

$$\mathrm{dist}_{Dice} = 1 - \frac{2 \sum_{v \in V} r_v \, y_v}{\sum_{v \in V} r_v + \sum_{v \in V} y_v} \tag{3.7}$$

• Balanced binary cross-entropy: in its standard version, it is one of the most used losses for classification. The modification included for this work weights each voxel inversely proportionally to the probability of its class.

$$\mathrm{dist}_{BCE} = -W_v \left[ y_v \log r_v + (1 - y_v) \log(1 - r_v) \right] \tag{3.8}$$
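The three distances in Equations 3.6 to 3.8 can be sketched in NumPy as shown below. The small constant `eps` that guards against division by zero and log(0), the averaging over voxels in the cross-entropy, and the estimation of the class weights from the ground-truth sample itself are assumptions of this example; the exact weighting used in this work may differ.

```python
import numpy as np


def soft_iou_dist(r, y, eps=1e-7):
    """Equation 3.6: one minus the soft Intersection over Union."""
    intersection = np.sum(r * y)
    union = np.sum(r + y - r * y)
    return 1.0 - intersection / (union + eps)


def soft_dice_dist(r, y, eps=1e-7):
    """Equation 3.7: one minus the soft Dice coefficient."""
    return 1.0 - 2.0 * np.sum(r * y) / (np.sum(r) + np.sum(y) + eps)


def balanced_bce_dist(r, y, eps=1e-7):
    """Equation 3.8 averaged over voxels, with inverse class-frequency weights."""
    r = np.clip(r, eps, 1.0 - eps)
    # Fraction of occupied voxels in the ground truth (assumed weight source).
    pos_freq = np.clip(np.mean(y), eps, 1.0 - eps)
    weights = np.where(y > 0.5, 1.0 / pos_freq, 1.0 / (1.0 - pos_freq))
    losses = -weights * (y * np.log(r) + (1.0 - y) * np.log(1.0 - r))
    return float(np.mean(losses))
```

Here `r` holds the sigmoid outputs of the generator and `y` the binary ground-truth model, both as floating-point arrays of the same shape. All three functions return values close to 0 when the prediction matches the ground truth.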

3.4 Recommendation system

It would be convenient for the user to be able to decide how close the generated model should be to the true distribution or the input model. Since the final objective function (see Equation 3.4) includes parameters that weight each of the losses that minimize these distances, it is not possible to modify the behavior at test time. However, it is possible to discretize the range of possibilities by training different networks and building a recommendation system that proposes models generated by each set of parameters.

For the purpose of testing the design assumptions and the user's preferences, three systems are trained to deliver the two available extreme results and a balance between them.

• Realistic system: This system reconstructs the input modeled by the user according to the distribution learned by the generator. It is possible to think about it as reconstructing the closest model in the learned manifold.

  For the experiments in this work the parameters are set to λrec = 5 and λsim = 0.

• Balanced system: The parameters that weight the loss functions are counterbalanced so that the generated model should depict a homogeneous mix between the input and the true distribution, even if the result is not completely natural.

  For the experiments in this work the parameters are set to λrec = 2.5 and λsim = 2.5.

• Similar system: Still takes into account the distribution of the dataset, since the final goal of a potential user would be to improve the input. Nevertheless, the emphasis is on the similarity reconstruction in hopes of learning to what point the verisimilitude of the model is secured and what the reaction of the users to it is.

  For the experiments in this work the parameters are set to λrec = 0 and λsim = 5.

3.5 Evaluation

The success of the reconstruction is measured both quantitatively and qualitatively. A sound empirical approach is to define an objective goodness metric and compare results of different algorithms among themselves or against a baseline. However, in this case, a good numerical performance is not sufficient, since the goal is to create 3D models that are visually appealing to humans.

It is therefore necessary to design user studies to measure the model quality, as well as the human predilection for diverse reconstructions. The first desired measurement is the level of realism of the reconstruction, i.e. how similar it is to a dataset sample. The objective of the user study is to compare the empirical perception of the users to the mathematical definition of this difference, the Intersection over Union of the voxels in the two models being compared. The second experiment measures similarity, or how closely the generated model resembles the user input. The same empirical-theoretical contrast is performed.

3.5.1 Quantitative: Distance measure

The approach selected for this work is slightly different from those in the literature. Due to the scarcity of previous work in the area and the unavailability of code and time to reproduce experiments, it is not possible to compare the proposed algorithm to others in the literature. Nonetheless, the learning algorithm is designed so that it is possible to compare a reconstruction against the original sample in the database and the input model (synthetic data).

The metric selected to compute the differences between models is Intersection over Union. This statistic resembles how humans perceive similarity, that is, whether a voxel that exists in the reference model also exists in the reconstruction, and likewise for non-existent voxels. This metric requires that the values of the matrix representing the 3D model take either 0 or 1 values. For that purpose, the output of the generator is thresholded and connectivity is ensured in order to produce more plausible results.

Although the same measurements are performed for every system setup, the importance of the results varies. For the realistic system, the priority is on the distance between the reconstruction and the dataset, the difference between the reconstruction and the input being just an estimate of how related the models are. For the balanced system, both measurements are equally important since the goal is to balance these two distances. Finally, for the similar system, the first concern is the resemblance of the generated model to the input. However, the distance to the original sample from the dataset must be taken into account, considering that it is the only way to measure quantitative plausibility.
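A minimal sketch of a thresholded IoU computation is given below. It implements the standard Jaccard index over occupied voxels and leaves out the connectivity step; the function name, the convention of returning 1 when both models are empty and the default threshold are assumptions of this example, so the exact metric implementation used for the reported numbers may differ.

```python
import numpy as np


def iou(reference, reconstruction_probs, threshold=0.5):
    """Jaccard index between a binary reference and a thresholded reconstruction."""
    ref = np.asarray(reference) >= 0.5
    rec = np.asarray(reconstruction_probs) >= threshold
    union = np.logical_or(ref, rec).sum()
    if union == 0:
        return 1.0  # both models are empty; treated as identical (assumption)
    return np.logical_and(ref, rec).sum() / union
```

Lowering the threshold turns low-confidence voxels into occupied ones, which is why the same generator output can score differently at thresholds 0.5, 0.3 and 0.1.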

3.5.2 Qualitative: User study

The goal of the study is to empirically measure the success of the system, focusing on the results of each of the three versions of the generator trained to minimize different objective functions. These results are assessed in terms of plausibility and similarity to the model entered by a user, but also in terms of users' preference for practical reasons.

The 24 subjects are 3D modelers or users that have close experience with 3D models. The only restrictions are that the subjects should be over 18 and legally sighted. The study is conducted on the KTH premises using the author's computer. Each experiment is designed to take 5 minutes so that the whole study takes no more than 30 minutes per participant. This design is meant to avoid fatigue [22]. Subjects are recruited from Facebook groups, mainly related to KTH, where the probability of finding people meeting the study requirements is higher. Due to the lack of funding for this work, no compensation is offered to participants.

In a first stage, the subject is informed about the objective and structure of the study. If the subject agrees to sign the consent form and proceed with the study, a test is performed showing how the visualization works in order to ease the subsequent interaction. Examples of the User Interface can be found in Appendices C.1, C.2 and C.3.

The realism test presents two models side by side. These models are randomly sampled from the following sources: the dataset, or the output of the realistic, balanced or similar systems. The task of the user is to decide which one looks more realistic. The comparison groups are realistic (A) vs balanced (B), realistic (A) vs similar (C), realistic (A) vs dataset (D), balanced (B) vs similar (C), balanced (B) vs dataset (D) and similar (C) vs dataset (D), following the probabilities in Table 3.1. The experiment is designed so that there is a greater number of comparisons between more similar conditions, namely (A), (B) and (C). In particular, the estimated weights produce approximately twice the number of comparisons between similar conditions as between the dataset and the different reconstructions. The permutations represent the position in which the models are presented.

The pilot study showed that the selection pace is highly dependent on the subject, which is why it was finally decided to set a fixed amount of time per experiment (5 minutes) instead of a fixed number of comparisons. This reduces the possibility of concentration loss for those participants for whom the study would take longer.

The similarity test shows two reconstructions of the same reference model, which is placed above them. The task of the user is to decide which one looks closer to the reference. The reference models are selected from the noisy version of the dataset so that the differences with a natural-looking model are more apparent. The testing groups are realistic (A) vs balanced (B), realistic (A) vs similar (C) and balanced (B) vs similar (C). In this case, the estimated weights produce roughly the same number of comparisons between conditions, as shown in Table 3.2.

        A       B       C       D
A       -     11.11   11.11    5.55
B     11.11     -     11.11    5.55
C     11.11   11.11     -      5.55
D      5.55    5.55    5.55     -

Table 3.1: Weighted probabilities (%) for each comparison in the realism experiment.

        A       B       C
A       -     16.66   16.66
B     16.66     -     16.66
C     16.66   16.66     -

Table 3.2: Uniform probabilities (%) for each comparison in the similarity experiment.
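The probabilities in Tables 3.1 and 3.2 can be reproduced with a small sketch that gives each ordered pair (left and right position) a weight of 2 when both conditions are reconstructions and a weight of 1 when the dataset is involved. The function names and the use of Python's `random.choices` are choices made for this illustration, not a description of the software used in the study.

```python
import itertools
import random

CONDITIONS = ["A", "B", "C", "D"]  # realistic, balanced, similar, dataset


def realism_pairs():
    """Ordered pairs for the realism test with the weights behind Table 3.1."""
    pairs = list(itertools.permutations(CONDITIONS, 2))
    weights = [1 if "D" in pair else 2 for pair in pairs]
    total = sum(weights)
    return {pair: weight / total for pair, weight in zip(pairs, weights)}


def similarity_pairs():
    """Ordered pairs for the similarity test with uniform weights (Table 3.2)."""
    pairs = list(itertools.permutations(CONDITIONS[:3], 2))
    return {pair: 1 / len(pairs) for pair in pairs}


def draw(table, rng=random):
    """Draw one (left, right) comparison according to the given probabilities."""
    pairs = list(table)
    return rng.choices(pairs, weights=[table[p] for p in pairs], k=1)[0]
```

With 12 ordered pairs in the realism test, the weights sum to 18, which gives 2/18 ≈ 11.11% for pairs among (A), (B) and (C) and 1/18 ≈ 5.55% for pairs involving (D). The six ordered pairs of the similarity test each receive 1/6 ≈ 16.66%.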

The last experiment measures the recommendation preference of the participants by showing the three possible reconstructions along with the unfinished model being reconstructed. These models are submitted by the same subjects and processed before the study. Due to time constraints and the difficulty of finding subjects with experience in 3D modeling who are willing to participate in the study, the number of user models amounts to 22 examples. In this manner, subjects are presented with triplets of reconstructions in random order until the time is over or they run out of examples.

Finally, some questions are asked (Appendix C.4) in order to better understand the answers of the subjects and the potential adoption of the system as a modeling tool.

3.6 Hardware description

The implementation¹ of the system uses PyTorch as the main library for building and training the networks. The system runs on a high-performance node of the PDC supercomputing system [28], using an NVIDIA Tesla K80 GPU.

¹ The complete code developed for this work will be publicly available in a GitHub repository under the user of the author: https://github.com/MonicaVillanueva/3D-ReconstGAN

Chapter 4

Experiments and results

This work is mainly a research project to find the benefits and limitations of using conditional Generative Adversarial Networks to build a recommendation system that reconstructs 3D models. Due to the novel approach of the architecture and the loss function, it is necessary to perform experiments that clarify how to obtain the best performance in each of the systems that compose the recommender.

In the following sections, the reader can find experiments on the base model (the realistic system) regarding several distance functions, generalization to different types of noise and different discriminator strengths. Once these elements are optimized, various learning approaches are used to train each of the systems separately. The best ones are selected to assemble the final recommendation system that is evaluated in the user study. Section 4.5 describes the limitations of the method when the input models are too different from the training data.

4.1 Distance functions

It is important to find a distance function (dist) suitable for the problem in order to guide the system into reconstructions that resemble the dataset (Equation 3.2) or the models introduced by the user (Equation 3.3). Theoretically, the three functions described in Section 3.3 are appropriate. Still, the functions are tested over a small number of epochs to compare empirical results. The realistic system is taken as the baseline to test different parameters. The distance with the best results is used for the following experiments.

Initially, several values for λrec in Equations 3.4 and 3.5c are searched for all distance functions (distIoU, distDice, distBCE), modifying the voxel existence threshold during visualization to perceive the changes in confidence. The final values for the different systems are reported in Section 3.4. Results during training (Figure 4.1) show how the distance function affects the learning, while test results (Figures 4.2 and 4.3) prove generalization to unseen examples.

Figure 4.1: Training results (columns: original, epoch 5, epoch 10, epoch 20, epoch 40). The first row shows the learning results over the epochs 5, 10, 20 and 40 with the voxel existence threshold set to 0.5 using soft IoU as the distance function. The second row shows the same results for soft Dice with threshold 0.5, while the last row shows the results for balanced cross-entropy with threshold 0.3.

Figure 4.1 displays the reconstructions during training for the three different distances. The probability assigned to the existence of each voxel by the network is guided by the distance function, which makes the whole algorithm learn differently.

The soft Dice differs from the other two distances in that it creates a general shape and, epoch after epoch, carves out the model, though it also corrects and extends parts that are not completely developed.

Meanwhile, soft IoU and balanced binary cross-entropy (balanced BCE) are more similar because they give more importance to true positives. Both distances create models that are more conservative in the number of voxels and keep adding detail while fixing wrongly placed voxels. The difference between them is the certainty in early stages. In Figure 4.1, the results obtained by using balanced BCE as the distance function are thresholded at a probability of 0.3 while soft IoU uses 0.5. However, since this threshold is just a postprocessing tool and does not alter training, the learning speed may be a more important factor. Regarding this matter, it is possible to see that the development of legs is more advanced using balanced BCE (epoch 20) and the results more distinct (epoch 40) than with its counterpart using soft IoU.

Despite the usefulness of appreciating the way the algorithm is learning, the real value is in the reconstruction of the unseen samples in the test set (Figure 4.2).

Figure 4.2: Test results after 5 epochs (columns: original, threshold 0.5, threshold 0.3, threshold 0.1). The first row shows the test results after 5 epochs for the voxel existence threshold set to 0.5, 0.3 and 0.1 using soft IoU as the distance function. The second row shows the same results for soft Dice, while the last row shows the results for balanced cross-entropy.

It is interesting to notice that there is no clear correlation between the IoU metric and the visual similarity of the models being compared. In Table 4.1 the best average value for the test set is the one achieved by the balanced BCE distance using threshold 0.5. This may actually result in empty models for some samples (see the result in Figure 4.2 for the mentioned parameters).

Test after 5 epochs: IoU

Threshold        0.5        0.3        0.1
Soft IoU         0.955083   0.939283   0.805049
Soft Dice        0.936344   0.875890   0.166469
Balanced BCE     0.965469   0.963394   0.800237

Table 4.1: IoU metric measuring the average similarity between the test dataset and its reconstructions after 5 epochs using different distances and thresholds.

A 40-epoch execution is used for the purpose of selecting a final distance based on the training evolution and the generalization shown in the test set while saving resources. Nevertheless, the choice is not trivial due to the similarity of the results and the disassociation between the visual perception and the similarity metric used. Additionally, this difference underlines the importance of performing studies with users.

It is obvious, but worth pointing out, that the voxel confidence decreases in test, which implies that in order to obtain results as good as those in Figure 4.1 the threshold needs to be lowered. Once more, it is possible to appreciate that the best value in Table 4.2, which corresponds to the soft Dice distance and the threshold 0.5, represents a chair without rear legs and an irregular hole in the rest. Visually, the best model could be identified as soft IoU with threshold 0.1 if a larger emphasis is set on the body, or balanced BCE with threshold 0.1 if the importance is shifted to the legs. Given the difficulty that the generators seem to have with reconstructing legs, the distance used for the subsequent experiments is the balanced BCE. All the same, it would be interesting to run further experiments with the other distances. Due to lack of time, this is left for future work.

Test after 40 epochs: IoU

Threshold        0.5        0.3        0.1
Soft IoU         0.985563   0.985095   0.983689
Soft Dice        0.985879   0.985829   0.985217
Balanced BCE     0.980784   0.984049   0.977174

Table 4.2: IoU metric measuring the average similarity between the test dataset and its reconstructions after 40 epochs using different distances and thresholds.

Figure 4.3: Test results after 40 epochs (columns: original, threshold 0.5, threshold 0.3, threshold 0.1). The first row shows the test results after 40 epochs for the threshold set to 0.5, 0.3 and 0.1 using soft IoU as the distance. The second row shows the same results for soft Dice, while the last row shows the results for balanced BCE.

4.2 Noise generalization

A perfectly understandable question to ask is whether the system is learning just to reconstruct certain types of noise, namely those with which it is trained. If the result of applying noise functions were close to typical user errors, this behavior would not be a concern. However, since the noise functions described in Section 3.1.2 are basic transformations, not necessarily faithful to user mistakes, it is interesting to check if a system trained with certain noise functions generalizes well in test to another set of noise functions.

For this experiment three systems are trained using the balanced BCE distance: one with unstructured noise, another with structured noise and the last one without noise, that is, the model introduced is the same one that the generator is supposed to reconstruct. These systems are tested on all of the types of noise for several thresholds. The results are displayed in Table 4.3.

Test after 40 epochs: IoU by threshold

Trained on      Tested on       0.5        0.3        0.1
Unstructured    Unstructured    0.980784   0.984049   0.977174
                Structured      0.979197   0.982159   0.976545
                -               0.980968   0.984446   0.978921
                Average         0.980316   0.983551   0.977547
Structured      Unstructured    0.977274   0.978339   0.971616
                Structured      0.980821   0.985431   0.982204
                -               0.981352   0.986181   0.983033
                Average         0.979816   0.983317   0.978951
-               Unstructured    0.977131   0.978605   0.971630
                Structured      0.979386   0.984444   0.981136
                -               0.981887   0.987377   0.983586
                Average         0.979468   0.983475   0.978784

Table 4.3: Average IoU metric over the test set for systems trained and tested for all possible combinations of unstructured noise, structured noise and no noise. Results are reported for thresholds 0.5, 0.3 and 0.1.

The results show an interesting phenomenon: the best accuracy for a system trained on a specific noise is not reached by testing on the same noise. Regardless of the type of noise used during training, the best reconstructions are achieved when testing without noise and using a 0.3 threshold. In fact, this particular threshold accomplishes better results than the others in all cases. On the one hand, this outcome makes sense because the generator does not need to create new information (add or remove voxels), but on the other hand, it is trained to do so in the cases where unstructured and structured noise are used during training. Indeed, this is visible in the IoU values noted in Table 4.3, where the highest value corresponds to the model trained without noise, followed by structured noise and finally unstructured.

The reason why the unstructured noise is more difficult to reconstruct is precisely that it is more different from the target model.

However, training on unstructured noise yields the best average IoU at thresholds 0.5 and 0.3, which relates to generalization. The initial idea was that this kind of noise function makes the generator more robust, and it seems that the empirical results support this assumption.

4.3 Discriminator strength

When using GANs there is an additional parameter to tune, the balance between the generator and the discriminator training. It can also be described, in a simpler way, as the number of update steps of the discriminator for every step of the generator. If the discriminator is too powerful, there is no gradient left for the generator to follow. On the other hand, if the discriminator is weak, the generator can learn meaningless features or exploit a particular weakness, producing similar outputs in what is known as mode collapse [9]. Using the Wasserstein GAN loss [1] allows more freedom in balancing the two networks without destabilizing the learning process. That is why it is possible to change this parameter and show the results in the following analysis.

For this experiment, results are shown for two different settings of the balance between the generator and the discriminator. Both systems are trained with all the noise functions in Section 3.1.2 and the hyperparameters described in Section 3.2.1. They differ in that the weaker discriminator version updates the generator once per 5 updates of the discriminator, while the stronger one updates the discriminator 10 times for every update of the generator. To make the results comparable, the stronger discriminator system is trained for twice as many epochs so that the number of generator updates is the same in both systems. In particular, the weaker version is trained for 100 epochs and the stronger for 200.
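The reasoning behind doubling the epochs can be checked with a short calculation. The sketch below assumes one discriminator update per training batch; the function name and the number of batches per epoch are hypothetical and only serve to show that both configurations give the generator the same number of updates.

```python
def generator_updates(epochs, batches_per_epoch, disc_steps_per_gen_step):
    """Number of generator updates when the generator is stepped once
    every `disc_steps_per_gen_step` discriminator updates."""
    # Assumption: every batch triggers exactly one discriminator update.
    discriminator_updates = epochs * batches_per_epoch
    return discriminator_updates // disc_steps_per_gen_step


BATCHES_PER_EPOCH = 50  # hypothetical value; the equality holds for any size

weaker = generator_updates(100, BATCHES_PER_EPOCH, 5)
stronger = generator_updates(200, BATCHES_PER_EPOCH, 10)
print(weaker, stronger)
```

Both calls return the same count because doubling the epochs exactly compensates for doubling the discriminator steps per generator step.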

Test after 100/200 epochs

IoU
Threshold        0.5        0.3        0.1
Weaker disc.     0.984643   0.984398   0.983384
Stronger disc.   0.981642   0.981492   0.980914

Table 4.4: Average IoU in test similarity with respect to dataset for different systems and thresholds.

Disc. accuracy
Samples          Dataset    Generated
Weaker disc.     1.0        0.035
Stronger disc.   0.77583    0.0

Table 4.5: Discriminator accuracy for database and reconstruction samples in test with threshold=0.5.

Table 4.4 shows that the weaker discriminator reconstructs models more accurately regardless of the threshold value. In terms of the discriminator knowledge (Table 4.5), the weaker one still classifies all dataset samples correctly and is fooled in 96.5% of the cases when samples come from the generator. The stronger one shows worse accuracy on dataset examples, and a higher rate of generated models passes as true samples. This suggests better learning by the stronger version, since the features that make the generated models better also make the discriminator doubt the origin of the dataset models.

[Figure 4.4 panels. Columns: Original, Threshold 0.5, Threshold 0.3, Threshold 0.1. Rows: Weaker disc., Stronger disc.]

Figure 4.4: Test results after 100 epochs for the weaker discriminator (first row) and 200 for the stronger (second row). The voxel existence threshold is set to 0.5, 0.3 and 0.1 and balanced cross-entropy is used as the distance function.

Disc. accuracy
Samples          Dataset    Generated
Weaker disc.     1.0        0.039
Stronger disc.   0.77583    0.0

Table 4.6: Discriminator accuracy for database and reconstruction samples in test with threshold=0.1.

The reconstructions in Figure 4.4 depict a higher probability of voxel existence in the legs for the system with the stronger discriminator. Even so, the produced models are less defined, which makes the reconstructed model of the weaker discriminator using threshold 0.1 look more natural. The discriminator accuracy (Table 4.6), however, remains almost unchanged, contrary to what could be expected.

4.4 Recommendation system

The unstructured noise, particularly the function that randomly removes voxels, was initially included in an attempt to make the system more robust. Nevertheless, since both the balanced and the similar systems keep some of the features of the input model, and given that this kind of noise is highly unlikely for a real user, it was omitted when training the final frameworks. This decision also helps to improve the IoU metric. Table 4.3 corroborates that the closer the noisy input is to the original sample, the better the metric becomes.

As seen in the previous experiments, the IoU metric does not always correlate with what humans regard as similarity. This poses a problem when deciding which three systems should be selected as the best ones. For each of the systems, several learning approaches are tested and inspected in order to decide on a final version. These approaches include different numbers of iterations and strengths of the discriminator, decaying strategies, and best model selection with early stopping using a 10% validation set extracted from the training set.

For the early stopping approach, the similarity distance metric on the validation set is computed every epoch. The model is saved only if the target IoU improves (against dataset samples for the realistic system, user input for the similar one and both for the balanced). If the average distance has not improved after 10 epochs, the execution halts.

The decaying strategy is based on linear decay from the smallest number of epochs with good results to twice that number. For example, the realistic system achieves good results with a minimum of 100 epochs. The learning rate then decays linearly to zero over the next 100 epochs.
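The two strategies just described can be sketched as follows. This is a minimal illustration under stated assumptions: the names `linear_decay_lr` and `EarlyStopping`, the base learning rate and the scores are hypothetical, and the actual training code may organize these steps differently.

```python
def linear_decay_lr(base_lr, epoch, constant_epochs, decay_epochs):
    """Keep `base_lr` for `constant_epochs`, then decay linearly to zero
    over the following `decay_epochs`."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


class EarlyStopping:
    """Track the best validation IoU and stop after `patience` stale epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_iou = float("-inf")
        self.stale_epochs = 0

    def update(self, validation_iou):
        """Return (save_model, stop_training) for this epoch's target IoU."""
        if validation_iou > self.best_iou:
            self.best_iou = validation_iou
            self.stale_epochs = 0
            return True, False  # improvement: save the model, keep training
        self.stale_epochs += 1
        return False, self.stale_epochs >= self.patience


# Realistic-system schedule from the text: 100 epochs, then 100 epochs of decay.
# The base learning rate of 1.0 is a placeholder, not the value used in this work.
for epoch in (0, 100, 150, 200):
    print(epoch, linear_decay_lr(1.0, epoch, 100, 100))

# Hypothetical validation scores with a short patience to show the stop signal.
stopper = EarlyStopping(patience=2)
for score in (0.90, 0.80, 0.85):
    print(score, stopper.update(score))
```

With `patience=10` the class reproduces the rule in the text: a checkpoint is written only on improvement and training halts after ten epochs without one.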

Some of the best results are shown in the following sections as a comparison to motivate the selection of the final systems. A comparison of the best system achieved for every set of λ parameters is presented in Section 4.4.4.

4.4.1 Realistic system

All the previous experiments are carried out using the realistic system, which gives the reader a deeper understanding of it. The results displayed below show the final research performed to fine-tune the selection of this specific system. The other frameworks are adjusted independently because another parameter is taken into account, the similarity to the input, which may render the learning process completely different.

After appraising the results of the different learning approaches, and particularly comparing the reconstructions with the results in the following Sections 4.4.2 and 4.4.3, the realistic system is also trained without the random removal of voxels noise, resulting in improved reconstructions.

The goal of the realistic system is to generate samples that look as close as possible to the dataset, so the target IoU is the score obtained by comparing the reconstruction to the dataset. The IoU metrics are reported for both sets of noise functions (Tables 4.7 and 4.8), but only the best results are reported in images (Figure 4.5).

The difficulty in assessing the results of the realistic system is that the reconstructions should follow the distribution of the dataset, but at the same time

Test after 100/200 epochs with all noise functions

IoU               Dataset    Input      Average
Weaker 100 ep     0.984398   0.968770   0.976584
Stronger 200 ep   0.981492   0.966923   0.974208
Weaker 200 ep     0.983989   0.968331   0.97616
Decay             0.984105   0.968467   0.976286
Validation        0.981978   0.967733   0.974856

Table 4.7: IoU metric for the realistic system trained with all noise functions, measuring the mean similarity with respect to both the test dataset and the user input, thresholding at 0.3. The last column indicates the average of both similarity values.

Test after 100/200 epochs without random removal of voxels

IoU               Dataset    Input      Average
Weaker 100 ep     0.985214   0.974970   0.980092
Weaker 200 ep     0.984793   0.974208   0.979501
Decay             0.249668   0.257863   0.253766
Validation        0.982110   0.974729   0.978420

Table 4.8: IoU metric for the realistic system trained without random removal of voxels, measuring the mean similarity with respect to both the test dataset and the user input, thresholding at 0.3. The last column indicates the average of both similarity values.

stay as close as possible to the sample in the dataset from which the synthetic data comes. This setup provides two features to take into account: similarity with respect to the original in Figure 4.5 and realism.

For the first chair example, the reconstruction of the system trained for 100 epochs looks more natural and more different from the original sample than the other reconstructions. This is because the arms and connections between the legs are what make the models look both more similar to the original and noisier at the same time. For the second example, all the reconstructions look realistic, but the one that resembles the original most is, once again, the output of the system trained for 100 epochs. The system trained with validation and early stopping achieves the best reconstruction regarding both features for the last example. Nevertheless, the reconstructions that are closer to the original and look more natural are, in general, the ones that result from training for 100 epochs, a conclusion that is supported by the IoU value in Table 4.8.

[Figure 4.5 panels. Columns: Original, 100 epochs, 200 epochs, Validation.]

Figure 4.5: Test results for the realistic system trained without the random removal of voxels noise and thresholded to 0.3 for three different dataset samples using several training approaches.

4.4.2 Balanced system

The balanced system is meant to be a combination of the qualities found in the realistic and the similar systems, meaning that the reconstructions should be close to the input model while still following the distribution of the dataset. Consequently, it is particularly difficult to assess the success of the results due to the hybrid nature of the training and the output.

Test

IoU               Dataset    Input      Average
Weaker 60 ep      0.979190   0.983888   0.981539
Stronger 120 ep   0.979548   0.982328   0.980938
Weaker 120 ep     0.980906   0.985178   0.983042
Decay             0.982523   0.985416   0.983970
Validation        0.975414   0.975393   0.975404

Table 4.9: IoU metrics for the balanced system at test time thresholding at 0.3.

[Figure 4.6 panels. Columns: Input, Weaker 60 ep, Stronger 120 ep, Weaker 120 ep, Decay.]

Figure 4.6: Test results for the balanced system thresholded to 0.3 for three different dataset samples using several training approaches.

For this particular system, the average between the dataset and the user models is the target IoU, given that it is a quantity measuring both similarities. The results obtained with validation are comparatively worse, so reconstruction images are not included in Figure 4.6.

If we discard alternatives that remove structural elements of the original, the systems trained with a weaker discriminator for 60 epochs and with a stronger one for 120 would be avoided. These systems erase or modify the arms in examples 1 and 3. Coincidentally, these are also the systems with lower IoU with the dataset.

The model examples shown for similarity with the user model may not be the best ones. But if the IoU value is to be trusted as in the previous case, then the best system is the one trained with linear decay. In fact, it achieves the best metric for all similarity measures. The stated evidence supports the selection of the decay approach as the choice representing the balanced framework in the final recommendation system.

4.4.3 Similar system

With the similar system, the goal is to reconstruct models as close as possible to the user model while retaining realism to a certain extent. That is why in this section the target IoU is the one computed with respect to the input. As with the balanced system, the validation approach is dismissed due to its low performance.

Test

IoU               Dataset    Input      Average
Weaker 60 ep      0.975507   0.983702   0.978823
Stronger 120 ep   0.974902   0.982819   0.978861
Weaker 120 ep     0.977981   0.986298   0.982140
Decay             0.979298   0.987392   0.983345
Validation        0.965963   0.971143   0.968553

Table 4.10: IoU metrics for the similar system at test time thresholding at 0.3.

[Figure 4.7 panels. Columns: Input, Weaker 60 ep, Stronger 120 ep, Weaker 120 ep, Decay.]

Figure 4.7: Test results for the similar system thresholded to 0.3 for three different dataset samples using several training approaches.

With the focus set on the differences introduced by the noise functions, it is possible to notice that, for the first example in Figure 4.7, the system that best retains the user features is the one trained with the weaker discriminator for 60 epochs. However, this reconstruction lacks structural elements like the arms and part of the legs.

Compared with the rest of the systems, which do reconstruct these morphological components, both the system trained with the weaker discriminator for 60 epochs and the one trained with the stronger discriminator for 120 epochs must be relegated in favor of the other systems.

The decay approach reaches the highest IoU in all comparisons, most notably for the input, closely competing with the system trained with the weaker discriminator over 120 epochs. In the second example in Figure 4.7, the descriptive feature in the user input, the hole in the base of the rest, is more clearly kept, and the arms in the first example are maintained.

Supported by all these arguments, the decay approach is selected for the definitive recommendation system.

4.4.4 System comparative

The results reported in this section serve as a summary of the best systems achieved in the previous sections. These are, for the realistic system, a weaker discriminator trained for 100 epochs without random removal of voxels and, for the balanced and the similar systems, a weaker discriminator trained for 60 epochs with linear decay for another 60. The metrics and reconstructions are placed together to simplify the comparison and give a better overview of the conclusions.

The IoU scores in Table 4.11 indicate not only that each of the systems is the best among all the tested alternatives with the same set of parameters (Tables 4.8, 4.9 and 4.10), but also that it is the best in its target IoU when compared with the other systems. That is, the IoU with the dataset is higher for the realistic system than for the balanced and the similar systems. The same happens for the balanced system with the averaged IoU metric and for the similar system with the IoU with the input. The reconstructions in Figure 4.8 are self-explanatory.

Test

IoU         Dataset    Input      Average
Realistic   0.985214   0.974970   0.980092
Balanced    0.982523   0.985416   0.983970
Similar     0.979298   0.987392   0.983345

Table 4.11: Comparison of the average IoU metric for the best trained systems at test time thresholding at 0.3.
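The claim that each system wins on its own target IoU can be verified directly from the values in Table 4.11. The snippet below is only a reading aid; the dictionary layout and the helper `best_system` are illustrative names introduced here.

```python
# IoU values copied from Table 4.11 (threshold 0.3).
TABLE_4_11 = {
    "Realistic": {"dataset": 0.985214, "input": 0.974970, "average": 0.980092},
    "Balanced": {"dataset": 0.982523, "input": 0.985416, "average": 0.983970},
    "Similar": {"dataset": 0.979298, "input": 0.987392, "average": 0.983345},
}


def best_system(scores, target):
    """Name of the system with the highest IoU for the given target column."""
    return max(scores, key=lambda name: scores[name][target])


for target in ("dataset", "average", "input"):
    print(target, "->", best_system(TABLE_4_11, target))
```

The three targets map to the realistic, balanced and similar systems respectively, matching the selection made in the text.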

[Figure 4.8 panels. Columns: Dataset, Input, Realistic, Balanced, Similar.]

Figure 4.8: Test results for the complete recommendation system thresholded to 0.3 for three different samples. The first two columns represent the dataset model and the user input while the following three depict reconstructions performed by the realistic, balanced and similar systems respectively.

4.5 User models

One important point has not been addressed so far: the similarity between the database and the user input. The GAN loss function helps to learn the underlying distribution of the samples in the dataset, while the other components of the objective function guide the reconstruction toward looking closer to the input or the dataset. If the model being reconstructed at test time is too different from the dataset/synthetic data, the reconstructions are flawed. However, if the model is similar enough, the reconstruction results are as expected. Examples of these two cases can be found in Figure 4.9.

[Figure 4.9 panels, user models. Columns: Input, Realistic, Balanced, Similar.]

Figure 4.9: Reconstruction of 3D models designed by humans using the three developed systems. The results in the first row correspond to reconstructions of a model that follows the dataset distribution while the one in the second row does not.

The main difference between the input models in the first and the second row of Figure 4.9 is the position of the seat. In the dataset, most samples have the seat at a mid-height position, and even most of the stool samples have some kind of backrest that ends in a higher position. These elements are never modified by the noise functions when creating the synthetic data for training. The reconstruction generated by the realistic system shows how the network tries to push the seat down to a position that is natural in the dataset. This effect lessens when the generator starts keeping the input features, which is the reason why the balanced and the similar reconstructions look better.

This is always a problem when training with idealized and processed data, as in this case with the Princeton ModelNet [17], and then using the trained system on real, imperfect data. This problem can be attenuated by using several datasets with potentially different distributions or processed differently. Another possibility is training with the same distribution that is going to be used at test time, collecting or learning 3D models online directly from the final users of the recommender, as is done for predictive keyboards. In any case, these solutions are left for future work since they are not trivial and may destabilize the training process.

4.6 Qualitative evaluation: User study

The user study is designed to evaluate the discrepancies between the similarity measure assumed in the loss function and the perception that human modelers have concerning preference, plausibility and likeness of 3D models. The reader should keep in mind that the results are influenced by the subjective preferences of the individuals. In the following sections, the information retrieved from the three experiments, as well as the questionnaire filled in at the end of the session, is analyzed. More information about the User Interface (UI) employed during the study and the exit survey is available in Appendix C.

4.6.1 Population statistics

Figure 4.10 shows the demographic information of the population that took part in the study. As is quite common in STEM, the percentage of women is lower than that of men. In this case, females represent 1/3 of the population.

The age distribution of the study may be biased towards a younger population because the subjects were gathered from KTH students and acquaintances of the author. However, it is also true that younger generations have a closer relationship with technology in general, and 3D technologies have become increasingly important in recent years, which may affect the distribution in a similar way. In Figure 4.10b, it is possible to see that more than half of the population is below 25 years old. There is a lack of representation in the groups above 44, and only 8.3% of the subjects are between 35 and 44 years old.

Figure 4.11 shows the proficiency level of the participants in 3D modeling for graphics in general and specifically using voxels. It is important to note that, due to a general lack of interest in participating in the study and the difficulty of recruiting subjects with 3D modeling experience, the study had to be modified to accept participants without modeling experience but with a close relationship with 3D models. Figure 4.11 shows that these subjects represent 1/4 of the sample. More than 50% of participants regarded themselves as beginners and only two of them as proficient.


Figure 4.10: Demographic information of the study subjects regarding gender (4.10a) and age (4.10b).

Proficiency with voxels is understandably lower, with more than half the population without experience and 1/3 beginners. Given this information, the reported proficiency in general 3D modeling is the one used for segmentation later on.


Figure 4.11: Proficiency level of the study subjects modeling 3D graphics in general (4.11a) and using voxels in particular (4.11b).

Cross-referencing the data, it is possible to see that 75% of the females are beginners, whereas for males this figure is 43.75%. Gender correlations can be biased by proficiency due to imbalance in the segments. The same can happen with age, where 77.78% of the participants between 25 and 34 years old are beginners.

Finally, the plots in Figure 4.12 give several indications of how useful a recommendation system could be for 3D modeling. Essentially, a selection in the recommendation system would return a 3D model that the users would have to modify further in most cases. The question in Figure 4.12a seeks to know whether 3D modelers are used to modifying premade models or would rather model from scratch because modifying these CAD models is either difficult or uncomfortable.

Only 12.5% report that they never use premade models, while 41.6% do so frequently. Additionally, 58.3% of the respondents would probably use a recommendation system, while 16.7% think it is unlikely.


Figure 4.12: Indicators of the potential of a recommender system for 3D graphics: frequency of use of premade models (4.12a) and likelihood of using a recommendation system (4.12b).

Cross-referenced data show that even though the average use of premade models is "Sometimes", inexperienced subjects respond "Rarely" on average and experienced ones are closer to "Often". It is curious that inexperienced participants do not use premade models, and it is possible that they misunderstood the question. But there is another explanation: they are still at an early stage where they are learning and do not need or know how to import external models.

Proficiency correlates with the likelihood of using a recommendation system: inexperienced users are on average "Somewhat probable", and proficient participants lie between that probability and "Neutral". In any case, it looks like users, regardless of their experience level, are willing to use a recommendation system.

4.6.2 Data preprocessing and analysis

For each of the experiments, every participant selects a system in multiple comparisons. This design causes the result of each comparison per subject to be a probability or confidence percentage.

There are several options for aggregating the data. For this analysis, two methods have been used: an absolute vote obtained by quantizing the confidence rate, and box-and-whisker diagrams showing the minimum and maximum values as well as the quartiles.

To compute the absolute vote for a particular pair-wise comparison (Sections 4.6.3 and 4.6.4), e.g. dataset vs realistic system, the vote is counted as a preference for the dataset if the confidence is equal to or greater than 50%, and for the realistic system otherwise. In Section 4.6.5, since the comparison includes three terms, the preferred option is the system with the highest confidence compared to the other two. Likewise, the rejected system is the one with the lowest percentage.

The box-and-whisker diagram (example in Figure 4.13) is selected as a visual representation that gives plenty of information in a simple plot. The whiskers reach the minimum and maximum values, depicting the dispersion of the data. The lower and upper bounds of the box are the Q1 and Q3 values of the distribution, while Q2 (the median) is represented by the line dividing the two colors. Moreover, the median is more robust than the average, which makes it a more suitable measure for describing the data given the large variance in some cases.
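The aggregation rules above can be sketched in a few lines. The function names and the sample confidences are hypothetical, and the quartile interpolation method (`"inclusive"` from Python's `statistics` module) is an assumption, since the text does not state which convention the plots use.

```python
import statistics


def pairwise_vote(confidence_first, first, second):
    """Quantize a confidence (in %) for `first` into a discrete vote."""
    return first if confidence_first >= 50 else second


def three_way_choice(confidences):
    """Return (preferred, rejected) for a {system: confidence} mapping."""
    preferred = max(confidences, key=confidences.get)
    rejected = min(confidences, key=confidences.get)
    return preferred, rejected


def five_number_summary(values):
    """Minimum, Q1, median (Q2), Q3 and maximum, as drawn in a box plot."""
    # Assumption: inclusive interpolation; other conventions shift Q1 and Q3 slightly.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical per-subject confidences, not data from the study.
print(pairwise_vote(50, "dataset", "realistic"))
print(three_way_choice({"realistic": 20, "balanced": 30, "similar": 50}))
print(five_number_summary([0, 25, 50, 75, 100]))
```

A confidence of exactly 50% counts for the first option, following the "equal to or greater than" rule stated in the text.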

4.6.3 Realism experiment

In discrete votes, 91.67% of the population choose the dataset when compared against the realistic system, 79.17% select the dataset over the balanced system and 91.67% over the similar one. According to this, we could expect the balanced system to be chosen over both the realistic and the similar systems. Yet, the comparisons between the systems yield a nontransitive relationship. The empirical data shows that 58.33% of the population regard the realistic system as better suited than the balanced one, and only 66.67% prefer the balanced system when compared to the similar one.

Figure 4.13: Box-and-whisker diagram of the results for the realism experiment.

The plot in Figure 4.13 shows the quartiles as well as the maximum and minimum values to display the spread of the data in the distribution of answers. When looking at the median value of the first three comparisons, against the dataset, we can see that the selection rate of the dataset is high. However, the confidence when comparing the reconstruction systems among themselves lies around 50%. This, along with the nontransitive relationship discovered before and the large standard deviation in the data, may indicate randomness in the answers or that several groups understand realism differently. To study this phenomenon further, a segmentation by proficiency is performed on the subjects and the data is reanalyzed (Figure 4.14).

(a) Inexperienced segment (b) Beginner segment (c) Experienced and Proficient segments

Figure 4.14: Box-and-whisker diagram of the results for the realism experiment segmented by proficiency in 3D modeling.

Studying the segments separately, the data dispersion decreases slightly, but no clear correlation is visible. The number of experienced and proficient participants is so low that nothing can be said about them independently unless they were unanimous.

The inexperienced group prefers the realistic system to the balanced one, as opposed to the mixture of more experienced subjects. The beginner group still presents a random behavior, probably due to a wider range of knowledge. The opposing opinions of the inexperienced and the more experienced groups can be interpreted as groups looking for different qualities due to differences in their skills.

Something analogous happens with the comparison between the realistic and the similar systems. In this case, the inexperienced group is the one displaying a random behavior, while beginners prefer the realistic system in opposition to the more experienced group.

It is therefore possible to affirm that realism is a very personal quality that is hardly captured by the so-called realistic system, since it is selected less often by the population of the study when compared against the dataset, and by the more experienced subjects when compared with the other systems. The results of the experiment are in partial disagreement with the mathematical assumptions of the loss function.

4.6.4 Similarity experiment

87.50% of the people interviewed voted that the balanced system generates models that resemble the input better than those generated by the realistic system. 91.67% prefer the similar system when compared to the realistic one, and 70.83% select the similar system over the balanced one.

Figure 4.15: Box-and-whisker diagram of the results for the similarity experiment.

This experiment is clearer: the transitive relation holds. The system that is chosen more often against the realistic one is also the one that is preferred when compared with the remaining one. This is so not only in number of votes but also in confidence rate, according to Figure 4.15. The results of the experiment are thus in agreement with the mathematical assumptions of the loss function.
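The transitivity of the majority preference can be checked mechanically from the three pairwise results reported in this section. The sketch below uses illustrative helper names and treats a share of at least 50% as a majority, which is an assumption consistent with the vote quantization in Section 4.6.2.

```python
from itertools import permutations

# Share of participants preferring the first system over the second
# (discrete votes reported in Section 4.6.4).
PAIRWISE = {
    ("Balanced", "Realistic"): 0.8750,
    ("Similar", "Realistic"): 0.9167,
    ("Similar", "Balanced"): 0.7083,
}


def majority_relation(pairwise):
    """Set of (winner, loser) pairs decided by simple majority."""
    beats = set()
    for (first, second), share in pairwise.items():
        beats.add((first, second) if share >= 0.5 else (second, first))
    return beats


def is_transitive(beats):
    """True if a > b and b > c always implies a > c."""
    systems = {name for pair in beats for name in pair}
    return all(
        (a, c) in beats
        for a, b, c in permutations(systems, 3)
        if (a, b) in beats and (b, c) in beats
    )


beats = majority_relation(PAIRWISE)
wins = {name: sum(1 for winner, _ in beats if winner == name)
        for pair in beats for name in pair}
ranking = sorted(wins, key=wins.get, reverse=True)
print(is_transitive(beats), ranking)
```

For these votes the relation is transitive and the induced order is similar, balanced, realistic. A cyclic set of outcomes would make `is_transitive` return `False`.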

4.6.5 Preference experiment

For this experiment, the first and last choices in discretized votes are studied separately. The most selected alternative, with the support of 79.17% of the population, is the similar system. It is also the system least often selected as the last choice, which makes it the best possible option.

The other systems are selected as the first alternative by 29.17% of the population (realistic system) and 4.17% (balanced system). Following the same analysis performed previously, the realistic system is chosen as the last option by 54.17% and the balanced by 50% of the subjects. Once again, it is possible to see (Figure 4.17) that the dispersion of the data decreases for the group merging the segments "Experienced" and "Proficient".

Figure 4.16: Box-and-whisker diagram of the results for the preference experiment.

With the results of the realism and similarity experiments in hand, it is easy to see why the similar system is the best empirical choice. These results show that the reconstructions of the similar system are, all in all, better able to capture both the realism and similarity qualities according to the most experienced subjects.

(a) Inexperienced segment (b) Beginner segment (c) Experienced and Proficient segments

Figure 4.17: Box-and-whisker diagram of the results for the preference experiment segmented by proficiency in 3D modeling.

Chapter 5

Discussion and conclusions

The following sections summarize the contributions of the project and describe a set of ideas that can be useful to improve the results. These ideas are organized into approaches that can be applied directly to this project and suggestions that could improve both the results and the usefulness of the solution altering the foundation of the project.

5.1 Achievements

The work in this Master Thesis shows that it is possible to reconstruct partially complete or unrefined 3D models using GANs in a simpler, more direct way than previously attempted in the literature [19]. The proposed solution uses a novel 3D conditional Generative Adversarial Network (cGAN) that accepts 3D models in voxel form as the input and generates reconstructions in the same format.

Three loss functions are used in order to achieve different qualities in the reconstruction, namely realism, similarity with respect to the input and a mixture of both, and several learning approaches are tested in order to get the best results for each one.

A user study is conducted to reconcile the mathematical formulation of the loss functions with human perception and to assess the potential of adapting this idea as a modeling tool based on a recommendation system. Although the realism property does not match the users' conception completely, the similarity quality lives up to expectations. Moreover, the questionnaire of the user study reveals that people working with 3D are open to using recommenders.

The main drawback of this solution is the limited variability in the training dataset, which can produce flawed reconstructions when the input model falls too far from the data used during training, even if it is an actual sample of the class being reconstructed. This problem, however, has already been discussed in Section 4.5, where the easiest fix would entail using several datasets for training.


5.2 Future work

This work can benefit from deeper research in terms of testing different architectures, like U-net [34], and several hyperparameter values. The ones used in this work are the ones reported in the literature.

Pros and cons of different distance functions are suggested in Section 4.1. However, the results of longer executions may provide better arguments for choosing one of the functions over the rest.

Improving the noise functions so that they resemble user errors would help with the generalization problem. Additional work could consist of studying and classifying the different types of modeling errors typically introduced by modelers, measuring their frequency and impact in practice so that the synthetic data could adapt accordingly. Another possible solution would be changing the learning process completely in order to employ user-generated models instead of synthesizing data to simulate them.

One step further in the improvement of the work is the implementation of a multiclass system similar to that of StarGAN described by Choi et al. [4]. After the GAN is trained, the discriminator can be used at test time to classify the input model, passing the most probable label to the generator along with the user-generated CAD model. This would be not only useful but indispensable if the recommendation system is to be used as a commercial tool for 3D designers. In this case, the user could also correct the discriminator by introducing a custom label.

An important improvement in the area would be the development of metrics that match human perception better than Intersection over Union. Even though the mismatch between the visual results and the IoU score seems to decrease with training due to the reconstruction of most structural elements, there is a basic underlying disagreement: humans prioritize the existence of functional components over detail elements. Topological features could be an interesting starting point for researching these new metrics.

Accepting other representations, like polygon meshes, would increase the usefulness of the tool for professional 3D modelers, since this is the most common format used for video games, animations, etc. However, the change in the representation entails the modification of the network, the distance functions in the losses and the accuracy metric, which would mean a completely different approach, stating a new research question and starting the whole study over.

Finally, the theoretical understanding of Deep Learning networks, and particularly generators, is still at an early stage. Improving the comprehension of these models will lead to better design decisions and better empirical results.

Bibliography

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein gan”. In: arXiv preprint arXiv:1701.07875 (2017).
[2] Tom Bawden. Global warming: Data centres to consume three times as much energy in next decade, experts warn. https://www.independent.co.uk/environment/global-warming-data-centres-to-consume-three-times-as-much-energy-in-next-decade-experts-warn-a6830086.html. Accessed: 2018-05-25. 2016.
[3] Ding-Yun Chen et al. “On visual similarity based 3D model retrieval”. In: Computer graphics forum. Vol. 22. 3. Wiley Online Library. 2003, pp. 223–232.
[4] Yunjey Choi et al. “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation”. In: arXiv preprint arXiv:1711.09020 (2017).
[5] Özgün Çiçek et al. “3D U-Net: learning dense volumetric segmentation from sparse annotation”. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. 2016, pp. 424–432.
[6] Paolo Cignoni, Claudio Rocchini, and Roberto Scopigno. “Metro: Measuring error on simplified surfaces”. In: Computer Graphics Forum. Vol. 17. 2. Wiley Online Library. 1998, pp. 167–174.
[7] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. “Image style transfer using convolutional neural networks”. In: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE. 2016, pp. 2414–2423.
[8] S Gebhardt et al. “Polygons, point-clouds and voxels: A comparison of high-fidelity terrain representations”. In: Simulation Interoperability Workshop and Special Workshop on Reuse of Environmental Data for Simulation—Processes, Standards, and Lessons Learned. 2009.
[9] Ian Goodfellow. “NIPS 2016 tutorial: Generative adversarial networks”. In: arXiv preprint arXiv:1701.00160 (2016).
[10] Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in neural information processing systems. 2014, pp. 2672–2680.


[11] Ishaan Gulrajani et al. “Improved training of wasserstein gans”. In: arXiv preprint arXiv:1704.00028 (2017).
[12] John PA Ioannidis. “Why most published research findings are false”. In: PLoS medicine 2.8 (2005), e124.
[13] Phillip Isola et al. “Image-to-image translation with conditional adversarial networks”. In: arXiv preprint (2017).
[14] Chiyu Jiang, Philip Marcus, et al. “Hierarchical Detail Enhancing Mesh-Based Shape Generation with 3D Generative Adversarial Network”. In: arXiv preprint arXiv:1709.07581 (2017).
[15] Evangelos Kalogerakis et al. “A probabilistic model for component-based shape synthesis”. In: ACM Transactions on Graphics (TOG) 31.4 (2012), p. 55.
[16] Andrej Karpathy et al. Generative Models. https://blog.openai.com/generative-models/. Accessed: 2018-01-25. 2016.
[17] Princeton Vision & Robotics Labs. Princeton ModelNet. 2015. URL: http://modelnet.cs.princeton.edu/ (visited on 2018).
[18] Chieh Lin, Derek Liu, and Alex Kelly. “Deep Adversarial 3D Shape Net”. In: ().
[19] Jerry Liu, Fisher Yu, and Thomas Funkhouser. “Interactive 3D modeling with a generative adversarial network”. In: arXiv preprint arXiv:1706.05170 (2017).
[20] Yifan Liu et al. “Auto-painter: Cartoon Image Generation from Sketch by Using Conditional Generative Adversarial Networks”. In: arXiv preprint arXiv:1705.01908 (2017).
[21] Jingwan Lu et al. “HelpingHand: example-based stroke stylization”. In: ACM Transactions on Graphics (TOG) 31.4 (2012), p. 46.
[22] Rafał K Mantiuk, Anna Tomaszewska, and Radosław Mantiuk. “Comparison of four subjective methods for image quality assessment”. In: Computer Graphics Forum. Vol. 31. 8. Wiley Online Library. 2012, pp. 2478–2491.
[23] Rob van der Meulen and Christy Pettey. Gartner Says By 2020, Artificial Intelligence Will Create More Jobs Than It Eliminates. https://www.gartner.com/newsroom/id/3837763. Accessed: 2018-05-24. 2017.
[24] Microsoft. Creativity in STEM – a contradiction in terms? Not for Europe’s girls! https://news.microsoft.com/europe/2017/12/05/creativity-stem-contradiction-terms-not-europes-girls/. Accessed: 2018-05-25. 2017.

[25] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. “V-net: Fully convolutional neural networks for volumetric medical image segmentation”. In: 3D Vision (3DV), 2016 Fourth International Conference on. IEEE. 2016, pp. 565–571.
[26] Mehdi Mirza and Simon Osindero. “Conditional generative adversarial nets”. In: arXiv preprint arXiv:1411.1784 (2014).
[27] Charlie Nash and Chris KI Williams. “The shape variational autoencoder: A deep generative model of part-segmented 3D objects”. In: Computer Graphics Forum. Vol. 36. 5. Wiley Online Library. 2017, pp. 1–12.
[28] PDC. Tegner. 2017. URL: https://www.pdc.kth.se/hpc-services/computing-systems/tegner-1.737437 (visited on 2018).
[29] Lawrence R Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition”. In: Proceedings of the IEEE 77.2 (1989), pp. 257–286.
[30] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).
[31] Md Atiqur Rahman and Yang Wang. “Optimizing Intersection-Over-Union in Deep Neural Networks for ”. In: International Symposium on . Springer. 2016, pp. 234–244.
[32] Douglas Reynolds. “Gaussian mixture models”. In: Encyclopedia of biometrics (2015), pp. 827–832.
[33] Jason Rock et al. “Completing 3D object shape from one depth image”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 2484–2493.
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2015, pp. 234–241.
[35] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. “Auto-context Convolutional Neural Network for Geometry-Independent Brain Extraction in Magnetic Resonance Imaging”. In: arXiv preprint arXiv:1703.02083 (2017).
[36] Skymind. GAN: A Beginner’s Guide to Generative Adversarial Networks. https://deeplearning4j.org/generative-adversarial-network. Accessed: 2018-05-21. 2017.

[37] Edward Smith and David Meger. “Improved adversarial systems for 3D object generation and reconstruction”. In: arXiv preprint arXiv:1707.09557 (2017).
[38] David Trafimow and Michael Marks. Editorial banning p-values from BASP. http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/BASP2015.pdf. Accessed: 2018-02-02. 2015.
[39] Jiajun Wu et al. “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling”. In: Advances in Neural Information Processing Systems. 2016, pp. 82–90.
[40] Jiajun Wu et al. “Marrnet: 3d shape reconstruction via 2.5 d sketches”. In: Advances In Neural Information Processing Systems. 2017, pp. 540–550.
[41] Zhirong Wu et al. “3d shapenets: A deep representation for volumetric shapes”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1912–1920.
[42] Jun-Yan Zhu et al. “Generative visual manipulation on the natural image manifold”. In: European Conference on Computer Vision. Springer. 2016, pp. 597–613.
[43] Chuhang Zou et al. “3d-prnn: Generating shape primitives with recurrent neural networks”. In: The IEEE International Conference on Computer Vision (ICCV). 2017.

Appendix A

Complete list of noise functions

A.1 Unstructured noise

• Remove voxels*: Each voxel has a 50% probability of being removed.

• Add voxels: Each void space has a 2% probability of becoming a voxel.

• Dilation: Basic mathematical morphology operation using the minimum 3-dimensional structure with connectivity = 1.

• Add voxels + dilation: Applies the addition of voxels followed by a dilation.

• Dilation + add voxels: Performs the dilation before the random addition of voxels.

• Dilation + remove voxels*: Executes the dilation before the random removal of voxels.

• Remove + add voxels*: Removes 50% of the voxels before converting 2% of the empty spaces into voxels.

• Remove voxels + dilation*: Performs the dilation operation after the random removal of voxels.

Functions marked with (*) are avoided when training the final systems (Section 4.4).
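The three basic operations above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes a boolean NumPy occupancy grid, and the function names (`remove_voxels`, `add_voxels`, `dilate`) are hypothetical. The dilation uses the 6-connected face neighbourhood, which is the same structuring element that `scipy.ndimage.generate_binary_structure(3, 1)` produces.

```python
import numpy as np


def remove_voxels(grid, p=0.5, rng=None):
    """Drop each occupied voxel independently with probability p."""
    grid = np.asarray(grid, dtype=bool)
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(grid.shape) >= p
    return grid & keep


def add_voxels(grid, p=0.02, rng=None):
    """Turn each empty cell into a voxel independently with probability p."""
    grid = np.asarray(grid, dtype=bool)
    rng = np.random.default_rng() if rng is None else rng
    fill = rng.random(grid.shape) < p
    return grid | fill  # already occupied cells stay occupied


def dilate(grid):
    """One dilation step with the 6-connected (connectivity = 1) element."""
    grid = np.asarray(grid, dtype=bool)
    # Pad with empty cells so that shifts never wrap occupied voxels around.
    padded = np.pad(grid, 1, mode="constant", constant_values=False)
    out = grid.copy()
    for axis in range(3):
        for shift in (-1, 1):
            shifted = np.roll(padded, shift, axis=axis)
            out |= shifted[1:-1, 1:-1, 1:-1]
    return out


def add_then_dilate(grid, rng=None):
    """Composite noise: random addition of voxels followed by a dilation."""
    return dilate(add_voxels(grid, rng=rng))
```

The composite functions in the list are plain compositions of these primitives in the stated order, as `add_then_dilate` shows for one of them.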


Figure A.1: Examples of unstructured noise applied to a reference 3D model. Panels: (a) Reference 3D model, (b) Remove voxels, (c) Add voxels, (d) Dilation, (e) Add voxels + dilation, (f) Dilation + add voxels, (g) Dilation + remove voxels, (h) Remove + add voxels, (i) Remove voxels + dilation.

A.2 Structured noise

• Remove part: Removes a cubic box with a half side length of 3, centered on a voxel selected at random among all possible voxels.

• Move part: Selects a random center and axis among all possible voxels and shifts the whole structure 2 units to the closest boundary, dragging the structure so that the model is still connected.

• Create a bump: Copies the structure from a randomly selected center among all possible voxels to the closest boundary in each direction for a radius of 10 units.

• Remove part + create bump: Removes a part of the model before creating a bump in the remaining object.

• Move + remove part: Applies the move part noise function followed by the remove part one.

• Create bump + move part: Creates a bump after executing the move part noise function.
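As an illustration of the simplest structured noise, the remove part function could look like the sketch below. It rests on assumptions that the description above does not settle: the center is drawn from the occupied voxels, the box spans the center plus 3 voxels on each side (7 voxels per axis), and the box is clipped at the grid boundary. The name `remove_part` is hypothetical.

```python
import numpy as np


def remove_part(grid, half_side=3, rng=None):
    """Clear a cubic box around a randomly chosen occupied voxel.

    Assumption: the center is sampled among occupied voxels and the box
    covers [c - half_side, c + half_side] on every axis, clipped to the grid.
    """
    grid = np.asarray(grid, dtype=bool)
    rng = np.random.default_rng() if rng is None else rng
    occupied = np.argwhere(grid)
    if len(occupied) == 0:
        return grid.copy()  # nothing to remove from an empty model
    center = occupied[rng.integers(len(occupied))]
    lo = np.maximum(center - half_side, 0)
    hi = center + half_side + 1  # slicing past the edge is clipped by NumPy
    out = grid.copy()
    out[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] = False
    return out
```

Composite functions such as move + remove part would then chain this function after the other primitives in the stated order.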

Figure A.2: Examples of structured noise applied to a reference 3D model. Panels: (a) Reference 3D model, (b) Remove part, (c) Move part, (d) Bump, (e) Remove part + bump, (f) Move + remove part, (g) Bump + move part.

Appendix B

Architecture

Discriminator

| Name         | Input             | Output            | Information                                  |
|--------------|-------------------|-------------------|----------------------------------------------|
| Input Layer  | 1 × 32 × 32 × 32  | 64 × 16 × 16 × 16 | Conv(C=64, K=4, S=2, P=1) + Leaky ReLU(0.2)  |
| Hidden Layer | 64 × 16 × 16 × 16 | 128 × 8 × 8 × 8   | Conv(C=128, K=4, S=2, P=1) + Leaky ReLU(0.2) |
| Hidden Layer | 128 × 8 × 8 × 8   | 256 × 4 × 4 × 4   | Conv(C=256, K=4, S=2, P=1) + Leaky ReLU(0.2) |
| Hidden Layer | 256 × 4 × 4 × 4   | 512 × 2 × 2 × 2   | Conv(C=512, K=4, S=2, P=1) + Leaky ReLU(0.2) |
| Output Layer | 512 × 2 × 2 × 2   | 1 × 32 × 32 × 32  | Conv(C=1, K=3, S=1, P=1)                     |

Table B.1: Detailed discriminator architecture

Generator

| Name          | Input             | Output            | Information                               |
|---------------|-------------------|-------------------|-------------------------------------------|
| Downsampling  | 1 × 32 × 32 × 32  | 64 × 16 × 16 × 16 | Conv(C=64, K=4, S=2, P=1) + IN + ReLU     |
| Downsampling  | 64 × 16 × 16 × 16 | 128 × 8 × 8 × 8   | Conv(C=128, K=4, S=2, P=1) + IN + ReLU    |
| Downsampling  | 128 × 8 × 8 × 8   | 256 × 4 × 4 × 4   | Conv(C=256, K=4, S=2, P=1) + IN + ReLU    |
| Downsampling  | 256 × 4 × 4 × 4   | 512 × 2 × 2 × 2   | Conv(C=512, K=4, S=2, P=1) + IN + ReLU    |
| Bottleneck ×6 | 512 × 2 × 2 × 2   | 512 × 2 × 2 × 2   | ResBlock                                  |
| Upsampling    | 512 × 2 × 2 × 2   | 256 × 4 × 4 × 4   | Deconv(C=256, K=4, S=2, P=1) + IN + ReLU  |
| Upsampling    | 256 × 4 × 4 × 4   | 128 × 8 × 8 × 8   | Deconv(C=128, K=4, S=2, P=1) + IN + ReLU  |
| Upsampling    | 128 × 8 × 8 × 8   | 64 × 16 × 16 × 16 | Deconv(C=64, K=4, S=2, P=1) + IN + ReLU   |
| Upsampling    | 64 × 16 × 16 × 16 | 1 × 32 × 32 × 32  | Deconv(C=1, K=4, S=2, P=1) + Sigmoid      |

Table B.2: Detailed generator architecture

Tables B.1 and B.2 give detailed descriptions of the elements that constitute the architectures of the discriminator (PatchGAN) and the generator (encoder-decoder), respectively. The abbreviations used in the tables are defined below:


• Conv(C, K, S, P) = 3D Convolution(number of output Channels, cubic dimensions of the Kernel, Stride, Padding)

• Deconv(C, K, S, P) = 3D Deconvolution(number of output Channels, cubic dimensions of the Kernel, Stride, Padding)

• IN = 3D Instance Normalization

• ResBlock = Conv(C=512, K=4, S=2, P=2) + IN + ReLU + Dropout(0.5)
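The spatial sizes in the tables can be checked from K, S and P alone. The sketch below assumes the usual output-size formulas for a convolution and for a transposed convolution without dilation or output padding; the helper names `conv_out` and `deconv_out` are hypothetical. Note that under this formula a Conv(K=3, S=1, P=1) layer preserves the spatial size of its input.

```python
def conv_out(size, k, s, p):
    """Output size along one axis of a convolution (no dilation)."""
    return (size + 2 * p - k) // s + 1


def deconv_out(size, k, s, p):
    """Output size along one axis of a transposed convolution
    (assuming no dilation and no output padding)."""
    return (size - 1) * s - 2 * p + k


# Downsampling path: four Conv(K=4, S=2, P=1) layers starting from 32.
size = 32
down = []
for _ in range(4):
    size = conv_out(size, k=4, s=2, p=1)
    down.append(size)

# ResBlock convolution: Conv(K=4, S=2, P=2) applied to a 2 x 2 x 2 volume.
bottleneck = conv_out(size, k=4, s=2, p=2)

# Upsampling path: four Deconv(K=4, S=2, P=1) layers.
up = []
for _ in range(4):
    size = deconv_out(size, k=4, s=2, p=1)
    up.append(size)

print(down)        # [16, 8, 4, 2]
print(bottleneck)  # 2
print(up)          # [4, 8, 16, 32]
```

Each stride-2 layer halves or doubles the side length, which is why the generator returns to the 32 × 32 × 32 input resolution.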

Additionally, the following parameters are reported for reproducibility:

• Batch size = 50

• Discriminator learning rate, ηd = 0.0001

• Generator learning rate ηg = 0.0001

• Adam’s β1 = 0.5

• Adam’s β2 = 0.9

Appendix C

User study

C.1 Realism experiment

During the user study, participants interact with the User Interface (UI) shown below. Figure C.1 shows the particulars of the realism experiment, but most of the features are shared with the similarity and preference experiments.

Figure C.1: User Interface for the realism experiment of the user study.

1. Interface control buttons: Allow the modification of the visualization

2. Model selection buttons: Select the preferred model according to the experiment, record the answer and display the next comparison.

3. Actionable windows: Accept camera translations and rotations, zoom, etc.


4. Pair of models: Each of the CADs comes from one of the following distributions: the dataset, or a reconstruction by the realistic, balance or similar system. The group comparison and the presentation order are randomized per subject.
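The per-subject randomization described in item 4 could be organized as in the sketch below. It is only an illustration under assumptions not stated in the text: every pair of distinct groups is compared once, the subject identifier seeds the random generator, and the names `GROUPS` and `realism_trials` are hypothetical.

```python
import itertools
import random

# Groups named in item 4; the identifiers themselves are illustrative.
GROUPS = ("dataset", "realistic", "balance", "similar")


def realism_trials(subject_id, groups=GROUPS):
    """Return a per-subject list of (left, right) group comparisons.

    Assumption: each unordered pair of groups appears exactly once; both
    the order of the comparisons and the left/right placement are shuffled.
    """
    rng = random.Random(subject_id)  # reproducible per subject
    pairs = [list(pair) for pair in itertools.combinations(groups, 2)]
    rng.shuffle(pairs)       # randomize which comparison comes first
    for pair in pairs:
        rng.shuffle(pair)    # randomize left/right presentation
    return [tuple(pair) for pair in pairs]
```

Seeding with the subject identifier keeps each subject's sequence reproducible while still varying it across subjects.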

C.2 Similarity experiment

The similarity experiment includes a reference model with which to compare the two reconstructions below.

Figure C.2: User Interface for the similarity experiment of the user study.

4. Pair of models: Each of the CADs is a reconstruction by the realistic, balance or similar system. The group comparison and the presentation order are randomized per subject.

5. Reference model: This CAD comes from the synthetic data distribution so that the comparison with the reconstructions (4) is easier.

C.3 Preference experiment

In this case, the reference models are real user-generated CADs submitted by some of the participants, and the subjects can select among three different reconstructions.

2. Model selection buttons: Include one more button to select the extra model. The behavior is the same as in the previous experiments.

4. Trio of models: Each of the CADs is the reconstruction of the reference model (5) by one of the three systems. The presentation order is randomized per subject.

5. Reference model: This CAD comes from the pool of subject-generated models submitted before the study.

Figure C.3: User Interface for the preference experiment of the user study.

C.4 Exit survey

The results shown in Section 4.6.1 come from the answers to the exit survey, available at the following URL: https://docs.google.com/forms/d/e/1FAIpQLSfdBp1Zm6tSAatQ6lVFcBD5uj7-a3v0qGvVCjnULXaRXWjShQ/viewform

Appendix D

Additional resources

For completeness, the code, trained models and further results are at the disposal of the reader.

1. The complete implementation of the work shown in this Master's thesis is available at https://github.com/MonicaVillanueva/3D-ReconstGAN.

2. The trained networks for all the reconstruction systems can be found at https://drive.google.com/open?id=1O6qMDohDezpj6UjrMve-fwr29ML38WFo

3. Finally, an illustrative video showing additional results is also accessible via the above two links.
