
Addressing GAN limitations: resolution, lack of novelty, and lack of control over generations

Camille Couprie, Facebook AI Research, 2019
Joint work with O. Sbai, M. Aubry, A. Bordes, M. Elhoseiny, M. Riviere, Y. LeCun, M. Mathieu, P. Luc, N. Neverova, J. Verbeek.

Why do we care about generative models?

• Scene understanding can be assessed by checking the ability to generate plausible new scenes.
• Generative models are interesting if they can be used to go beyond the training data: generate data at higher resolution, perform data augmentation to help train better classifiers, reuse the learned representations in other tasks, or make predictions about uncertain events...

Outline

• 1/ Design inspiration from adversarial generative networks
• 2/ High resolution, decoupled generation
• 3/ Vector image generation by learning parametric layer decomposition
• 4/ Future frame prediction

Design inspiration from generative networks

Sbai, Elhoseiny, Bordes, LeCun, Couprie, ECCV workshop '17

[Plot: hedonic value as a function of novelty.]

Generative Adversarial Networks (Goodfellow et al., 2014)

[Diagram: random numbers → generator network → generated image; the adversarial network scores it as fake.]

Generative Adversarial Networks (Goodfellow et al., 2014)

[Diagram: the adversarial network also receives real input images, which it must score as real; its outputs are real/fake probabilities such as 0.3, 0.7, 0.1, 0.8.]
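To make the diagrams concrete, here is a minimal PyTorch sketch of one adversarial training step. G, D, their optimizers, and the sigmoid output of D are assumptions for illustration; this is the common non-saturating variant, not code from the talk.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_images, opt_G, opt_D, z_dim=100):
    """One GAN update: D learns to score real images 1 and generated
    images 0; G learns to make D score its samples 1.
    Assumes D ends with a sigmoid, so its output is a probability."""
    batch = real_images.size(0)
    z = torch.randn(batch, z_dim, device=real_images.device)

    # Discriminator update (G is frozen via .detach()).
    opt_D.zero_grad()
    d_real = D(real_images)
    d_fake = D(G(z).detach())
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # Generator update: push D's score on fakes toward "real".
    opt_G.zero_grad()
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```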

Deep convolutional GANs

Radford et al., ICLR 2016

Training with pictures of about 2,000 items, with texture and shape labels

Texture classes: floral, tiled, dotted, animal print, graphical, striped. Shape classes: skirt, pullover, T-shirt.

Class conditioned GAN

[Diagram: random numbers plus a shape class and a texture class feed the generator network; the adversarial network receives real inputs and generated images, and outputs real/fake (0/1) together with class predictions.]

GAN Optimization objectives

• Generator’s loss (without conditioning / with class conditioning)
• Discriminator’s loss (without conditioning / with class conditioning)
• Auxiliary classifier discriminator, with an additional loss for the generator

Introduction of a Style Deviation criterion

[Diagrams: random numbers → generator → generated image → adversarial network, which also predicts a distribution over the seven texture classes (dotted, floral, graphical, uniform, tiled, striped, animal print). Without the criterion, the predicted distribution can concentrate on a few classes; with the Style Deviation criterion (CAN H), the generator is trained so that the predicted style distribution spreads across classes.]

Tested deviation objectives

• Binary cross-entropy loss
• Multi-class cross-entropy loss

Both objectives are sketched below.
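A sketch of the two deviation objectives, following the creative adversarial network (CAN) formulation, assuming $K$ texture classes and writing $D_c(k \mid x)$ for the discriminator's style-classifier output (a standard form, given here as a reference):

$$L_{\mathrm{BCE}} = -\sum_{k=1}^{K}\Big[\tfrac{1}{K}\,\log D_c(k \mid G(z)) + \big(1-\tfrac{1}{K}\big)\log\big(1-D_c(k \mid G(z))\big)\Big]$$

$$L_{\mathrm{MCE}} = -\sum_{k=1}^{K}\tfrac{1}{K}\,\log D_c(k \mid G(z))$$

Both push the style classifier's prediction on generated images toward the uniform distribution over the $K$ classes.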

Human Evaluation Study

[Bar chart: overall likability (%) of generations for GAN, CAN and CAN(h), with appearance/style/texture deviation criteria; values around 53.5–64.5%.]

CAN: GAN with creativity loss; (H) stands for the use of a holistic loss. Models with texture deviation are the most popular.

[Scatter plot: likability (%) (roughly 62–74) vs. novelty for GAN, CAN texture, and CAN(h) shape/texture/style models. Novelty is judged by humans and measured as a distance to similar training images.]

Decoupled adversarial image generation

M. Riviere, C. Couprie, Y. LeCun

Motivation:
- Take advantage of white-background clothing datasets
- Potentially avoid defects in generated shapes
- Better enforce shape conditioning of generations

1024x1024 generations on the RTW dataset

Using Morgane Riviere's PyTorch implementation of progressive growing of GANs (Karras et al., ICLR'18), available online.

Decoupled architecture

[Diagram: the latent code z feeds a shape generator Gs, conditioned on shape and pose classes, and a texture generator Gt, conditioned on texture, color, shape and pose classes; the generated shape and texture are multiplied to form the final generation. Two discriminators output real/fake against real images: Ds on the shape side (shape and pose classes) and Dg on the full generation (texture, color, shape and pose classes). A sketch of the composition follows.]
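A minimal sketch of the decoupled composition, assuming the two generator networks and their class conditionings are given (module and argument names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class DecoupledGenerator(nn.Module):
    """Sketch: generate a shape mask and a texture image separately,
    then multiply them so the texture only appears inside the shape."""
    def __init__(self, gen_shape: nn.Module, gen_texture: nn.Module):
        super().__init__()
        self.gen_shape = gen_shape      # (z, shape/pose classes) -> 1-channel mask
        self.gen_texture = gen_texture  # (z, all classes) -> RGB texture

    def forward(self, z, shape_cls, texture_cls):
        mask = torch.sigmoid(self.gen_shape(z, shape_cls))      # in [0, 1]
        texture = self.gen_texture(z, shape_cls, texture_cls)
        return mask * texture  # shape × texture composition
```

Multiplying by a mask in [0, 1] confines the texture to the generated shape, which is one way the shape conditioning can be enforced explicitly.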

Random generations

[Figures: random generations from progressive growing vs. progressive growing with the decoupled architecture.]

Better class conditioning

[Chart: per-class accuracy differences, e.g. +12%, +14%, -12%.]

Accuracy of classifiers trained on FashionGen Clothing and FashionGen, evaluated on our different models (GAN-test metric). Overall average improvement: 4.7%.

Vector Image Generation by learning parametric layer decomposition
Sbai, Couprie, Aubry, arXiv Dec '18

Current deep generative models are great but... they are limited in resolution and in the control they offer over generations.

Related work

Yang et al.: Layered GANs (LR-GAN), ICLR'17
Ganin et al.: SPIRAL, ICML'18

Our approach

Spoiler alert: yes, we can generate sharper images; this is just an example.

Our iterative pipeline for image reconstruction

Iterative generation: $x_t = f(x, x_{t-1})$
Vectorized mask generation: $M_t(u,v) = f(u, v, p_t)$
Alpha-blending: $x_t(u,v) = x_{t-1}(u,v)\,(1 - M_t(u,v)) + c_t\,M_t(u,v)$

(The compositing step is sketched in code below.)
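A minimal sketch of the alpha-blending step above in PyTorch (tensor shapes and the random stand-ins for the parametric mask and color are assumptions): each step paints a constant color c_t through a soft mask M_t over the running canvas.

```python
import torch

def blend_step(canvas, mask, color):
    """One compositing step: canvas x_{t-1} of shape (B, 3, H, W),
    soft mask M_t of shape (B, 1, H, W) with values in [0, 1],
    per-layer color c_t of shape (B, 3, 1, 1)."""
    return canvas * (1.0 - mask) + color * mask

# Toy usage: start from a black canvas and composite T layers.
B, H, W, T = 2, 64, 64, 20
canvas = torch.zeros(B, 3, H, W)
for _ in range(T):
    mask = torch.rand(B, 1, H, W)   # stand-in for the parametric mask M_t
    color = torch.rand(B, 3, 1, 1)  # stand-in for the predicted color c_t
    canvas = blend_step(canvas, mask, color)
```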

Our iterative pipeline for image generation

Training criteria

Adversarial net criterion: Wasserstein loss with gradient penalty (WGAN-GP), Gulrajani et al., NIPS'17.

Our generator loss in the GAN setting builds on this adversarial criterion; the standard WGAN-GP form is sketched below.
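For reference, the standard WGAN-GP objectives from Gulrajani et al. (reproduced in their published form, as the slide's formulas are not shown here):

$$L_D = \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x\sim\mathbb{P}_r}\big[D(x)\big] + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\Big[\big(\lVert\nabla_{\hat{x}}D(\hat{x})\rVert_2 - 1\big)^2\Big]$$

with $\hat{x}$ sampled uniformly along lines between real and generated samples, and the generator minimizing $L_G = -\mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big]$.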

Our generator loss in the image reconstruction setting relies on a reconstruction term (l1 in the results below).

Results using an l1 reconstruction loss

[Figures: target images and our iterative reconstructions.]

Applications

Editing the original image using extracted masks, by performing local modifications of luminosity (top) or color modifications using a blending of masks (bottom).

Applications

Image editing using masks: using chosen extracted mask(s) from image reconstruction, we apply object removal (top) and color modifications with object removal (bottom).

Applications

Image vectorization: reconstruction results on MNIST images. Our model learns a vectorized mask representation of digits that can be generated at any resolution without interpolation artifacts.

Baselines

1/ MLP baseline: a ResNet encoder maps the RGB image to a latent code z; an MLP decodes z back to an RGB image.
2/ ResNet baseline: the same ResNet encoder, with a ResNet decoder.
3/ MLP-xy baseline: the MLP decoder additionally receives the pixel coordinates (x, y).

Comparative results

- Using an l1 reconstruction loss
- Parameters for our approach: 20 masks, p of size 10, c of size 3: 260 parameters
- Baselines: size of the latent code z = 20 × 13 = 260

GAN results

CelebA generations: trained on 64x64 images, sampled at 256x256.
CIFAR10 generations: trained on 32x32 images, sampled at 256x256.

GAN results on ImageNet

Trained on 64x64 images, sampled at 1024x1024.

Results with perceptual loss

[Figure: target vs. result with perceptual loss.]

Conclusion and future work

• Faster training
• Use class conditioning
• Texture image generation

Predicting next frames in videos

Michael Mathieu, Camille Couprie, Yann LeCun, ICLR'16

[Figure: 4 input images and our 2 predictions.]

Deep multi-scale video prediction beyond mean square error

• Result with a simple convolutional network trained by minimizing an l2 loss
• Our result using:
  • a multi-scale architecture
  • an image gradient difference loss
  • adversarial training

Predicting deeper into the future of semantic segmentations
P. Luc, N. Neverova, C. Couprie, J. Verbeek, Y. LeCun, ICCV'17

• Predictions in the RGB space quickly become blurry despite previous attempts
• Idea: predict in the space of semantic segmentation

[Figure: 4 input images and our 2 predictions.]

Approach – predicting deeper into the future

Single time-step model: from S_{t-3}, S_{t-2}, S_{t-1}, S_t, predict S_{t+1} (loss L_{t+1}). The autoregressive mode is only possible for X2X, S2S, XS2XS.

Batch model: predicts S_{t+1}, S_{t+2}, S_{t+3} jointly (losses L_{t+1}, L_{t+2}, L_{t+3}).

Autoregressive model, which feeds its own predictions back as inputs, is either:
• used for inference without additional training (w.r.t. the single time-step model) (AR), or
• fine-tuned using backpropagation through time (AR fine-tune).

[In the diagrams, same color = shared weights.] A rollout sketch follows.
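A minimal sketch of the autoregressive rollout in PyTorch (the stand-in model, window length and channel-wise concatenation are assumptions for illustration):

```python
import torch

def autoregressive_rollout(model, inputs, n_steps):
    """inputs: list of the last 4 segmentation maps [S_{t-3}, ..., S_t],
    each of shape (B, C, H, W). At each step, predict S_{t+1} from the
    current window, then slide the window to include the prediction
    (requires output space == input space, i.e. the S2S/X2X/XS2XS setting)."""
    window = list(inputs)
    predictions = []
    for _ in range(n_steps):
        s_next = model(torch.cat(window, dim=1))  # channel-wise concat
        predictions.append(s_next)
        window = window[1:] + [s_next]            # slide the window
    return predictions

# Toy usage with a stand-in model that returns the last map's channels.
B, C, H, W = 1, 3, 8, 8
model = lambda x: x[:, -C:]
seq = [torch.rand(B, C, H, W) for _ in range(4)]
preds = autoregressive_rollout(model, seq, n_steps=3)
print(len(preds), preds[0].shape)  # 3 predictions of shape (B, C, H, W)
```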

4 9 Some results

Performance measure (mean IoU) of our approach and baselines on the Cityscapes dataset. Baselines:
• Copy the last input frame to the output.
• Flow baseline: estimate the flow between the two last inputs, and project the last input forward using the flow.

[Figures: flow baseline vs. our autoregressive fine-tuned result.]

Instance level segmentation: Mask R-CNN (K. He, G. Gkioxari, P. Dollar, R. Girshick, '17)

• Extends Faster R-CNN [Ren et al. '15] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. A usage sketch follows.
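As a pointer for experimentation, a minimal sketch using torchvision's off-the-shelf Mask R-CNN (torchvision's implementation and weights, not the code discussed in the talk):

```python
import torch
import torchvision

# Load a Mask R-CNN pretrained on COCO (requires torchvision >= 0.13
# for the `weights` argument).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# One RGB image as a float tensor in [0, 1], shape (3, H, W).
image = torch.rand(3, 480, 640)  # stand-in for a real image

with torch.no_grad():
    output = model([image])[0]

# Per-instance boxes, class labels, confidence scores and soft masks.
print(output["boxes"].shape)   # (N, 4)
print(output["labels"].shape)  # (N,)
print(output["scores"].shape)  # (N,)
print(output["masks"].shape)   # (N, 1, H, W)
```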

Predicting future segmentation instances by forecasting convolutional features
P. Luc, C. Couprie, Y. LeCun, J. Verbeek, ECCV'18

[Figure: instance predictions with per-object confidence scores (persons, cars, bicycles).]

[Figures: predictions of Luc, Neverova et al., ICCV'17 vs. our F2F predictions.]

Conclusions

Some open problems:
- Automatic metrics to evaluate the performance of generative models
- Non-deterministic training losses for future prediction

Torch code online:
- Vector image generation: available soon on Othman Sbai's GitHub
- For future video prediction:
  - of RGB frames: on Michael Mathieu's GitHub
  - of semantic segmentations: on Pauline Luc's GitHub
  - of instance segmentations: on Pauline Luc's GitHub