PaintsTorch: a User-Guided Anime Line Art Colorization Tool with Double Generator Conditional Adversarial Network
Yliess Hati, Gregor Jouet, Francis Rousseaux, Clément Duhart

To cite this version:

Yliess Hati, Gregor Jouet, Francis Rousseaux, Clément Duhart. PaintsTorch: a User-Guided Anime Line Art Colorization Tool with Double Generator Conditional Adversarial Network. European Conference on Visual Media Production (CVMP), 2019, London, United Kingdom. pp. 1-10, 10.1145/3359998.3369401. hal-02455373

HAL Id: hal-02455373 https://hal.archives-ouvertes.fr/hal-02455373 Submitted on 9 Dec 2020

PaintsTorch: a User-Guided Anime Line Art Colorization Tool with Double Generator Conditional Adversarial Network

Yliess HATI, [email protected], Pole Universitaire Leonard de Vinci, Research Center, La Defense, France
Gregor JOUET, [email protected], ADALTAS, Meudon, France
Francis ROUSSEAUX, [email protected], URCA CReSTIC, Moulin de la Housse, Reims, France
Clement DUHART, [email protected], MIT MediaLab, Responsive Environments Group, Cambridge, USA

ABSTRACT
The lack of information provided by line art makes user-guided colorization a challenging task for computer vision. Recent contributions from the deep learning community based on Generative Adversarial Networks (GAN) have shown incredible results compared to previous techniques. These methods employ user-input color hints as a way to condition the network. The current state of the art has shown the ability to generalize and generate realistic and precise colorization by introducing a custom dataset and a new model with its training pipeline. Nevertheless, their approach relies on randomly sampled pixels as color hints for training. Thus, in this contribution, we introduce a stroke simulation based approach for hint generation, making the model more robust to messy inputs. We also propose a new cleaner dataset, and explore the use of a double generator GAN to improve visual fidelity.

CCS CONCEPTS
• Computing methodologies → Reconstruction; Image processing; • Applied computing → Media arts; • Human-centered computing → User studies.

KEYWORDS
deep learning, neural networks, generative adversarial network, user-guided colorization, datasets

Figure 1: PaintsTorch guided colorization on line art. PaintsTorch takes two inputs: a grayscale lineart and a color hint. It outputs a colored illustration following the color hints and prior knowledge learned from a custom illustration dataset.

ACM Reference Format:
Yliess HATI, Gregor JOUET, Francis ROUSSEAUX, and Clement DUHART. 2019. PaintsTorch: a User-Guided Anime Line Art Colorization Tool with Double Generator Conditional Adversarial Network. In European Conference on Visual Media Production (CVMP '19), December 17–18, 2019, London, United Kingdom. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3359998.3369401

DISCLAIMER
The illustrations shown in this paper belong to their respective owners and are used purely for academic and research purposes. Some content may hurt the sensibility of the reader, but the images shown are representative of the dataset and communities they originate from.

1 INTRODUCTION
Line art colorization plays a critical part in the work of artists, illustrators, and animators. The task is labor intensive, redundant, and exhaustive, especially in the animation industry, where artists have to colorize every frame of the animated product. The process is often executed by hand or via the use of image editing software such as Photoshop, PaintMan, PaintToolSai, ClipStudio, and Krita. Therefore, one can see automatic colorization pipelines as a way to improve the artist's workflow, and such a system has recently been implemented into ClipStudio.

Automatic user-guided colorization is a challenging task for computer vision, as black and white line art does not provide enough semantic information. As colorization represents an essential part of the artist's process and directly influences the final art piece, automatic approaches are required to produce aesthetically pleasing and consistent results while conserving enough texture and material information.

Previous contributions from the deep learning community have explored image colorization [4, 6, 7, 11, 20, 22, 25, 30, 31]. While the first works focused on user-guided gray scale image colorization [7, 11, 31] and could not handle sparse inputs, others explored color stroke colorization [6, 20, 25]. Nevertheless, none of these methods yield suitable generalization on unseen images nor generate pleasing enough images. More recent works addressed these issues and enabled for the first time the use of such pipelines in production environments, providing qualitative results [4, 22]. The current state of the art [4] introduced a new model and its training pipeline. The authors also stated that, based on a previous paper, randomly sampling pixels to generate color hints for training is enough to enable user stroke inputs at inference time. One issue is that this statement is based on gray scale image colorization [31], a task close yet far enough from the one of line art colorization.

Our contributions include:

• The use of stroke simulation as a substitute for random pixel sampling to provide color hints during training.
• The introduction of a cleaner and more visually pleasing dataset containing high-quality anime illustrations filtered by hand.
• The exploration of a double generator GAN for this task, previously studied by contributions for multi-domain training.

The name "PaintsTorch" has been chosen to refer to this work. "Paints" stands for painting and "Torch" for the PyTorch deep learning framework. The name is analogous to the "PaintsChainer" tool [22], where "Chainer" refers to the Chainer deep learning library.

2 RELATED WORK
As previous works in the literature have exhaustively described non deep learning based approaches, these are not explained in this paper. Nowadays, deep learning approaches to the line art colorization problem have shown to be the trend and outperform previous methods.

2.1 Synthetic Colorization
Previous works have studied gray scale mapping [7, 11, 31]. It usually consists of trying to map gray scale images to colored ones using a Convolutional Neural Network (CNN) or a generative model such as a GAN [8]. By using high-level and low-level semantic information, such models generate photo-realistic and visually pleasing outputs. Moreover, previous works also explored direct mapping between human sketches and realistic images while providing a way to generate multiple outputs out of one sketch. However, as explained before, semantic information for black and white line art colorization is not conceivable, and these models do not explore the entire output space.

2.2 Generative Adversarial Network
GAN models [8] are responsible for successful contributions in computer vision generation tasks such as super-resolution, high-definition image synthesis, image inpainting, and image denoising [14, 18, 19]. This approach has often been described as one of the most beautiful ideas of the deep learning field. It consists of training two networks against each other, one being a discriminative model and the other a generative one. Hopefully, at some point, the discriminator is fooled by the generator, and we consider the model trained.

While being able to generate good quality results, the vanilla implementation of GAN [8] suffers from mode collapse, vanishing gradient issues, and others. Improvements of the former model have been discussed in the literature, introducing a gradient penalty [9] and a new loss based on the Wasserstein-1 distance [2]. When conditioned on external information such as class labels, the model, referred to as cGAN [21], can generate higher quality outputs as well as enabling natural control over the generator.

The current state of the art for user-guided line art colorization [4] used such a model, referred to as cWGAN-GP, to obtain their results. As well as introducing a deeper model compared to previous work, they introduce the use of a local feature network, described in Section 3.3, thus providing semantic-like information to the generator and the discriminator models. Furthermore, their method manages to train a GAN with training data (illustrations) different from the inference data (line arts).

2.3 Double Generator GAN
The task of cross-domain training has been studied by previous works such as StarGAN [3] and DualGAN [27]. StarGAN translates an image to another domain using one generator inspired by the classical image-to-image GAN [13]. As their work handles discrete labels as target domains, our work considers hint matrices and feature vectors from a local feature network as continuous labels. This capacity is essential to the artistic field.

DualGAN goes a step further and introduces a double generator. Their first generator is used for translation and their second one for reconstruction. The two generators share only a few parameters. As this contribution allows better visually perceptive results, we consider the approach interesting enough to be explored for the task of line art colorization.

Table 1: The Table describes the dataset composition. The Paper Line Arts and Paper Colored lines refer to the dataset of the current state of the art, while the Ours Colored line refers to our image dataset adding up to the Total.

Source            Images
Paper Line Arts    2 780
Paper Colored     21 931
Ours Colored      21 930
Total Colored     43 861
Total             46 641

Figure 2: The illustration describes the entire transformation pipeline of the model's inputs. The pipeline outputs a lineart and corresponding color hint image from an input illustration. The process can be applied to any given illustration dataset.

3 PROPOSED METHOD
All the models presented in the paper are trained following the next instructions. In Section 3.1 we describe the dataset used for training and evaluation, in Section 3.2 the preprocessing pipeline, in Section 3.3 the model architecture, in Section 3.4 the losses used to train the models, and in Section 3.5 the training process.

3.1 Dataset
As stated by the current state of the art, datasets for colorization are available on the web. However, if we consider two of the most known and large datasets, Nico-opendata [7, 12, 16] and Danbooru2017 or Danbooru2018 [1], they both contain messy illustrations, and line arts are mixed with colored illustrations. In this sense, Ci et al. [4] gathered their custom dataset containing 21 930 colored illustrations for training and 2 780 high-quality line arts for evaluation.

Nevertheless, after investigation, we found images that cannot be qualified as illustrations, and the quality of the images is not consistent over the entire set of colored images. To this end, we collected a custom training dataset composed of 21 930 consistent, high-quality anime-like illustrations. These illustrations have been filtered by hand to ensure some subjective quality standards. On the other hand, the line art set used for evaluation is not subject to these critics. The exact composition of the dataset can be found in Table 1.

Figure 3: xDoG fake line art on the left generated out of the illustration on the right with the parameters described in Section 3.2.1 and σ = 0.4.

3.2 Preprocessing
3.2.1 Synthetic Line Art. Illustrations do not often come with their corresponding high-quality line arts. To counter this issue, previous works use synthetic line arts to train their model. To generate high-quality fake line arts out of colored illustrations, the Extended Difference of Gaussians (xDoG) [28] has proven to be one of the best methods. xDoG produces realistic enough sketches, as can be observed in Figure 3. In this work, we use the same set of parameters as the previous state of the art: γ = 0.95, ϕ = 1e9, k = 4.5, ϵ = −1e−1, σ ∈ {0.3; 0.4; 0.5}.

3.2.2 Simulated Hint Strokes. As mentioned, the current state of the art [4] stated that randomly sampled pixel hints during training are enough to enable natural interaction using user strokes as input at inference time. Their assumption is based on the Zhang et al. contribution [31], which deals with gray scale image colorization. As their problem is not entirely the same, we explored the use of simulated strokes as a substitute for hint generation.

We simulated human strokes using the PIL library with a round brush. To this end, we define four parameters: the number of strokes nstrokes ∈ [0; 120], the brush thickness tbrush ∈ [1; 4], the number of waypoints per stroke npoints ∈ [1; 5], and a square of width wrange ∈ [0; 10] which defines the range of movement of one brush stroke. By doing so, we aim to make the model robust to messy user inputs. An example of a brush stroke simulation generated from an illustration can be found in Figure 5.

3.2.3 Model Inputs Transformations. To be handled by the deep learning model, all inputs are preprocessed and follow certain transformations. First, the illustration input is randomly flipped to the left or the right. Then, the image is scaled to match 512 pixels on its smaller side before being randomly cropped to a 512 by 512 pixels image. The obtained resized and cropped illustration is then used to generate the synthetic gray scale line art (512x512x1). The same transformed illustration is then resized to 128 by 128 pixels and used to generate a stroke simulated hint and its corresponding black and white mask, to finally obtain the hint image used for training (128x128x4). All pixel data except the one coming from the black and white mask is normalized with a mean and standard deviation of [0.5, 0.5, 0.5], and the mask is normalized to map the [0; 1] range. The entire transformation pipeline can be observed in Figure 2.
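A minimal sketch of the xDoG extraction of Section 3.2.1 is given below. The soft-thresholding variant and the grayscale input convention are assumptions rather than the exact implementation used in our pipeline; only the parameter values come from the paper.

```python
# Sketch of an xDoG-style line extractor consistent with the parameters of
# Section 3.2.1. The soft threshold form is assumed (standard xDoG), not the
# authors' exact code.
import numpy as np
from scipy.ndimage import gaussian_filter


def xdog(gray, gamma=0.95, phi=1e9, k=4.5, eps=-1e-1, sigma=0.4):
    """gray: float grayscale image in [0, 1]. Returns a fake line art in [0, 1]."""
    g1 = gaussian_filter(gray, sigma)
    g2 = gaussian_filter(gray, sigma * k)
    diff = g1 - gamma * g2                                   # difference of Gaussians
    # Soft threshold: 1 where diff >= eps, tanh ramp (near 0 with phi = 1e9) below.
    out = np.where(diff < eps, 1.0 + np.tanh(phi * (diff - eps)), 1.0)
    return np.clip(out, 0.0, 1.0)
```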
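The following sketch illustrates the kind of round-brush simulator of Section 3.2.2. The uniform sampling of the parameters and the choice of picking each stroke color under its first waypoint are assumptions, not the authors' exact procedure.

```python
# Sketch of the stroke simulator of Section 3.2.2: random round-brush strokes
# whose colors are picked from the 128x128 illustration. Sampling details are
# assumed.
import random
from PIL import Image, ImageDraw


def simulate_hint(illustration):
    """illustration: 128x128 RGB PIL image. Returns an RGBA hint image whose
    alpha channel plays the role of the black and white mask."""
    w, h = illustration.size
    hint = Image.new("RGBA", (w, h), (0, 0, 0, 0))
    draw = ImageDraw.Draw(hint)

    for _ in range(random.randint(0, 120)):                  # n_strokes
        thickness = random.randint(1, 4)                     # t_brush
        n_points = random.randint(1, 5)                      # waypoints per stroke
        move_range = 10                                      # w_range square
        x, y = random.randint(0, w - 1), random.randint(0, h - 1)
        color = illustration.getpixel((x, y)) + (255,)       # opaque hint color
        points = [(x, y)]
        for _ in range(n_points):
            x = min(max(x + random.randint(-move_range, move_range), 0), w - 1)
            y = min(max(y + random.randint(-move_range, move_range), 0), h - 1)
            points.append((x, y))
        draw.line(points, fill=color, width=thickness, joint="curve")
    return hint
```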
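Finally, a sketch of the input preparation of Section 3.2.3 is shown below. The `make_lineart` and `make_hint` callables stand for the xDoG and stroke-simulation steps (for instance the two sketches above, each returning a PIL image); the torchvision calls are an assumed rendering of the pipeline, not the authors' exact code.

```python
# Sketch of the Section 3.2.3 transformations: flip, 512-px resize and crop,
# 128x128 hint generation, and mean/std 0.5 normalization of everything but
# the mask channel.
import random
import torchvision.transforms.functional as TF


def prepare_inputs(illustration, make_lineart, make_hint):
    """illustration: PIL RGB image; make_lineart / make_hint return PIL images."""
    if random.random() < 0.5:                                # random horizontal flip
        illustration = TF.hflip(illustration)
    illustration = TF.resize(illustration, 512)              # smaller side -> 512 px
    w, h = illustration.size
    top, left = random.randint(0, h - 512), random.randint(0, w - 512)
    illustration = TF.crop(illustration, top, left, 512, 512)

    lineart = TF.to_tensor(make_lineart(illustration))                   # 1x512x512
    hint = TF.to_tensor(make_hint(TF.resize(illustration, (128, 128))))  # 4x128x128

    # Everything except the mask (alpha) channel is normalized with mean/std 0.5;
    # the mask itself stays mapped to [0, 1].
    lineart = (lineart - 0.5) / 0.5
    hint[:3] = (hint[:3] - 0.5) / 0.5
    target = (TF.to_tensor(illustration) - 0.5) / 0.5                    # 3x512x512
    return lineart, hint, target
```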

Figure 4: The Figure shows the Generator and Discriminator Convolutional Neural Network architectures. Both are composed of ResNeXt blocks (violet) and Pixel Shuffle blocks (orange). The Generator uses residual connections in the form of a U-Net.

Figure 5: Example of a simulated brush stroke hint on the right generated from the illustration on the left. The strokes are supposed to represent a normal usage of the software. Their density and thickness vary randomly to make the model more robust in its usage.

3.3 Model
Regarding the model architecture, the GAN we used is similar to the one used by Ci et al. [4]. This model is shown in Figure 4, but a more detailed explanation of the model can be found in their paper. The Generator G1 is a deep U-Net model [23] composed of ResNeXt [10] blocks with dilated convolutions to increase the receptive field of the network. LeakyReLU [29] is used as activation with a 0.2 slope, except for the last layer, which uses a Tanh activation. The Discriminator is inspired by the SRGAN one but upgraded with the same kind of blocks as the generator, without any dilation and using more layers. They also introduced the use of a Local Feature Network F1. This network is an Illustration2Vec [24] model able to tag illustrations. These tags are passed to the Generator and the Discriminator along with the hint image as a way to condition and give semantic information to the GAN.

Our contribution introduces the use of a second Generator G2 using the same architecture as G1. This second Generator is responsible for the generation of a synthetic line art out of the fake illustration inferred by the first one. This kind of approach has been used for cross-domain training [27]. By doing so, we aim to improve the overall perceptive quality of the generated illustration as well as giving G1 further insight and a better training objective. A schematic of the whole architecture can be found in Figure 6.

Figure 6: The illustration describes the overall model architecture from a higher level. Arrows describe the path of the data through the models to the losses. The colors allow distinguishing between each path and each piece of the architecture. Green arrows refer to connections with the loss modules, plain blue ones to the input modules, red ones to the generators, and violet to the discriminator.
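The shape-level sketch below summarizes the data flow of Figure 6 with stub modules standing in for the real networks: the actual G1 and G2 are the dilated ResNeXt U-Nets described above and D is the SRGAN-style critic. The reduced spatial sizes, the assumed feature dimension, and the way features are injected are placeholders only.

```python
# Shape-level sketch of the Figure 6 data flow: G1 colors the line art, G2
# reconstructs a line art from the fake illustration, D scores the result
# conditioned on the F1 features. Stub modules only; dimensions are reduced.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 32  # stand-in for the Illustration2Vec feature size (assumption)


class StubGenerator(nn.Module):
    """Placeholder for G1/G2: (image, hint, features) -> image."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 4 + FEAT_DIM, out_ch, 3, padding=1)

    def forward(self, x, hint, feat):
        hint = F.interpolate(hint, size=x.shape[-2:], mode="nearest")
        feat = feat[:, :, None, None].expand(-1, -1, *x.shape[-2:])
        return torch.tanh(self.conv(torch.cat([x, hint, feat], dim=1)))


class StubCritic(nn.Module):
    """Placeholder for D: (image, F1 features) -> critic score."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 4, stride=4)
        self.fc = nn.Linear(16 + FEAT_DIM, 1)

    def forward(self, y, feat):
        h = self.conv(y).mean(dim=(2, 3))                   # global pooling
        return self.fc(torch.cat([h, feat], dim=1))


G1 = StubGenerator(in_ch=1, out_ch=3)   # line art + hint -> colored illustration
G2 = StubGenerator(in_ch=3, out_ch=1)   # fake illustration -> fake line art
D = StubCritic()

X = torch.rand(1, 1, 64, 64)            # line art (512x512 in the real pipeline)
H = torch.rand(1, 4, 16, 16)            # hint + mask (128x128 in the real pipeline)
feat = torch.rand(1, FEAT_DIM)          # F1(X) local features

fake = G1(X, H, feat)                   # colored illustration
fake_line = G2(fake, H, feat)           # reconstructed line art
score = D(fake, feat)                   # critic score conditioned on F1(X)
```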

3.4 Loss
As indicated in the Ci et al. paper [4], the loss functions are a combination of all the GAN improvements described earlier in the paper. We want the second Generator to back-propagate its signal to the first one, so we add a new term to the Generator loss, which relies on a simple MSE. In this Section, we describe each loss used to train the model.

First, we define the global Generator loss as a combination of three components: a content component, an adversarial one, and a reconstruction one.

L_G = L_cont + λ_1 · L_adv + L_recon    (1)

The adversarial part is computed thanks to the local feature network F_1 used as conditional input, and WGAN-GP [9] used to distinguish fake examples from real ones. The component is weighted by the parameter λ_1 = 1e−4.

L_adv = −E_{G_1(X, H, F_1(X)) ∼ P_g} [D(G_1(X, H, F_1(X)), F_1(X))]    (2)

A perceptual loss is used for the content loss, which relies on an L2 difference between the generated output and target CNN feature maps coming from the fourth convolution activation of a VGG16 [26] pretrained on ImageNet [5].

L_cont = (1 / chw) ‖F_2(G_1(X, H, F_1(X))) − F_2(X)‖²_2    (3)

The loss signal we call reconstruction loss describes the ability of generator G_2 to produce a fake line art out of the fake illustration generated by G_1, as close as possible to the xDoG synthetic line art used for training. As the output does not contain multi-channel information, the difference is computed with a mean squared error.

L_recon = MSE[G_2(G_1(X, H, F_1(X)), H, F_1(X)), X]    (4)

Concerning the Discriminator loss, it is a combination of the Wasserstein loss and the penalty loss.

L_D = L_w + L_p    (5)

The critic loss is described in the WGAN paper [2].

L_w = E_{G_1(X, H, F_1(X)) ∼ P_g} [D(G_1(X, H, F_1(X)), F_1(X))] − E_{Y ∼ P_r} [D(Y, F_1(X))]    (6)

The penalty term, as described in the current state of the art [4], is composed of two components, a gradient penalty term and an extra drift constraint from Karras et al. [15]. The two parts are weighted by the parameters λ_2 = 10 and ϵ_drift = 1e−3.

L_p = λ_2 · E_{Ŷ ∼ P_Ŷ} [(‖∇_Ŷ D(Ŷ, F_1(X))‖_2 − 1)²] + ϵ_drift · E_{Y ∼ P_r} [D(Y, F_1(X))²]    (7)

Table 2: The Table compares the Frechet Inception Distance (FID) of multiple models trained over 100 epochs. [Paper] refers to the colored images used for training by Ci et al. [4], [Custom] to the image dataset we collected, and [Custom + Paper] to the combination of both. Lower values are better. STD stands for standard deviation to the mean.

Model and Options                  FID     STD
(Paper) Random, Simple             104.07  0.016
(Paper) Strokes, Simple             68.28  0.048
(Paper) Strokes, Double             83.25  0.019
(Custom) Random, Simple             82.23  0.022
(Custom) Strokes, Simple            64.81  0.035
(Custom) Strokes, Double            65.15  0.006
(Custom + Paper) Strokes, Double    75.71  0.032

Table 3: The Table compares the FID of our model trained over 100 epochs for different batch sizes: 4, 16, and 32. Higher batch sizes tend to return lower FID values. Lower values are better. STD stands for standard deviation to the mean.

Batch Sizes  FID    STD
4            74.53  0.003
16           64.35  0.061
32           65.15  0.006

3.5 Training
The models are trained on an Nvidia DGX station using four V100 GPUs with 32 GB of dedicated RAM each. The ADAM optimizer [17] is used with the parameters: learning rate α = 1e−4 and betas β_1 = 0.5, β_2 = 0.9. The same training pipeline as previous work has been applied. One gradient descent step is first applied to train the Discriminator D, then to train Generator G_1, and finally G_2. For comparison, all models have been trained for 100 epochs. However, the final one is trained for 300 epochs.

4 RESULTS
In our contribution, we trained and experimented with different model pipelines. To evaluate and compare these models, we realized multiple evaluations. In this Section, we describe our results.

4.1 FID Evaluation
Peak Signal-to-Noise Ratio (PSNR), as stated by Ci et al. [4], does not assess joint statistics between targets and results. Moreover, our dataset does not provide colored illustrations with their corresponding line arts. In that sense, measuring similarities between the two data distributions, synthetic colorized line arts and authentic line arts, is more appropriate to evaluate the model's performance. The FID can measure intra-class dropping, diversity, and quality. A small FID means that two distributions are similar. The FID evaluation is performed the same way as in Ci et al. [4] between the colored illustration train set and the line art test set. It extracts features from intermediate layers of a pre-trained Inception network and models the data distribution using a multivariate Gaussian.
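The losses of Section 3.4 can be sketched in PyTorch as follows. The networks `G1`, `G2`, `D`, the VGG16 feature extractor `vgg_feat`, and the Illustration2Vec features `feat` are assumed to exist (the stub modules of the earlier sketch would do for a dry run), and the perceptual target of Eq. (3) is passed explicitly.

```python
# Sketch of Eq. (1)-(7) with the paper's weights: lambda_1 = 1e-4,
# lambda_2 = 10, eps_drift = 1e-3. Not the authors' exact implementation.
import torch
import torch.nn.functional as F

LAMBDA_1, LAMBDA_2, EPS_DRIFT = 1e-4, 10.0, 1e-3


def generator_loss(G1, G2, D, vgg_feat, X, H, feat, target):
    """X: line art, H: hint, feat: F1(X), target: perceptual target of Eq. (3)."""
    fake = G1(X, H, feat)
    l_adv = -D(fake, feat).mean()                           # Eq. (2)
    l_cont = F.mse_loss(vgg_feat(fake), vgg_feat(target))   # Eq. (3), 1/chw included
    l_recon = F.mse_loss(G2(fake, H, feat), X)              # Eq. (4)
    return l_cont + LAMBDA_1 * l_adv + l_recon              # Eq. (1)


def critic_loss(D, G1, X, H, feat, Y):
    """Y: real illustrations drawn from P_r."""
    fake = G1(X, H, feat).detach()
    l_w = D(fake, feat).mean() - D(Y, feat).mean()          # Eq. (6)
    # Gradient penalty on samples interpolated between real and fake (Eq. (7)).
    a = torch.rand(Y.size(0), 1, 1, 1, device=Y.device)
    y_hat = (a * Y + (1 - a) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(y_hat, feat).sum(), y_hat, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    drift = (D(Y, feat) ** 2).mean()
    return l_w + LAMBDA_2 * penalty + EPS_DRIFT * drift     # Eq. (5) and (7)
```

Following Section 3.5, each training step would call `critic_loss` for one update of D, then `generator_loss` for G_1, and finally the reconstruction term alone for G_2, all with ADAM(lr = 1e−4, betas = (0.5, 0.9)).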
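The FID used in Section 4.1 reduces to the Fréchet distance between two Gaussians fitted on Inception activations; a sketch is given below, with the activations assumed to be precomputed for both image sets.

```python
# Sketch of the FID: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)),
# computed from precomputed Inception activations (N x d arrays).
import numpy as np
from scipy import linalg


def fid(act1, act2):
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):       # numerical noise can yield a complex result
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```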

Results of the FID evaluations can be found in Tables 2, 3 and Figures 7, 8. This objective evaluation allows us to infer some insight about the different model training pipelines we tried during our experimentation. Stroke simulation, instead of randomly sampled pixels for hint generation, provides the most notable positive impact on the FID value. The overall quality improvement of the training illustrations also yields better results. Though, we cannot deduce if the double generator visually improves the output with this kind of evaluation, as art is a completely subjective matter. Finally, a greater batch size does not have that much of an impact on the FID value, but allows faster training.

Table 4: The Table describes the Mean Opinion Score (MOS) for every model we compared in our study. PaperRS stands for the current state of the art illustration data with randomly sampled pixels and one generator, CustomSS for our illustration data with stroke simulation and one generator, and CustomSD for the double generator. STD stands for standard deviation to the mean.

Model     MOS   STD
PaperRS   1.60  0.85
CustomSS  2.95  0.92
CustomSD  3.10  1.02

Figure 7: The graph compares the FID on a logarithmic scale of every model trained over 100 epochs.

Figure 8: The graph compares the FID on a logarithmic scale for different batch sizes (32, 16 and 4) over 100 epochs.

Figure 9: The heatmap compares the MOS ratings for every model we study. PaperRS stands for the current state of the art illustration data with randomly sampled pixels and one generator, CustomSS for our illustration data with stroke simulation and one generator, and CustomSD for the double generator.

4.2 MOS Evaluation
It is generally challenging to evaluate art results as artistic perception is different for every person, and the FID is not good at assessing the model's quality in this context. Thus, as previous works did, we also conducted a MOS evaluation of the different model pipelines. This evaluation aims at quantifying the reconstruction of perceptually convincing line art colorization. To do so, we asked 16 users to score synthetic illustrations between one and five, corresponding to bad quality and excellent quality. The evaluation is composed of 160 randomly selected line arts from the validation set. Corresponding hint images have been created by hand by non-professional users and used to generate 160 corresponding illustrations for each of the three models we compared. Thus, the overall number of images to rate per user is 480. Examples of the illustrations shown to the users are available in Figure 10.

Our results show in Table 4 and Figure 9 that our models are perceptively better when compared to the previous state of the art.

Figure 10: The illustrations show examples generated from our contribution model using the line art on the left and the hint in the middle.

Figure 11: The illustrations compare the previous state of the art (top) to our results (bottom) on 3 simple tasks: filling a circle without hint, filling a circle with on-the-edge brush strokes, and performing gradients using messy inputs.

Figure 12: The illustrations show the differences between PaintsChainer and our model when the brush strokes are thin versus when the hint is made out of thicker strokes.

We realized a unilateral Student test with a significance level α = 0.001 to compare the MOS mean of our model to the current state of the art with a sample size n = 16. We obtained a t value of 4.525, approximately equivalent to a p-value inferior to 0.001. Statistically, our study validates that our contributions improve the model described by Ci et al. [4]. This evaluation score also allows us to conclude on the use of a double generator. The double generator seems to slightly improve the illustration quality and provides higher contrast with misplaced colors.

4.3 Visual Improvements
The differences in our approach result in visibly perceptive improvements. As can be observed in Figure 11, training the models using simulated strokes improves the general ability to fill the inner part of forms as well as allowing the user's inputs to be messier. When the user's strokes slightly exceed the outer part of the area, the current state of the art fails at capturing the user's will to only fill the inner part. As shown in Figure 11, color gradients are also visually more pleasant using our contribution.

5 APPLICATION
In this Section, we discuss the possible use of this kind of application and the web app we developed to ease the experiments.

In order to use the models, we developed a web application allowing real-time user interaction with our contribution through simple tools such as a brush pen, an eyedropper, and an eraser with various brush sizes. A visual of the app can be found in Figure 13. It has been created using a dockerized Flask REST API to serve the PyTorch models on the DGX station we used for training. The web application performs an API call each time the user's touch to the canvas ends. To ease the production of the figures for this contribution, and to allow users to share their creations, we also provide some tools to save the illustration and the hint.
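A minimal sketch of such an endpoint is shown below. The route, payload format, exported model file, and call signature are hypothetical placeholders rather than the actual API; the real G1 is also conditioned on Illustration2Vec features (Section 3.3), omitted here for brevity.

```python
# Minimal Flask-serving sketch in the spirit of the Section 5 web app.
# All names (route, "paintstorch_g1.pt", payload keys) are hypothetical.
import base64
import io

import torch
import torchvision.transforms.functional as TF
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
G1 = torch.jit.load("paintstorch_g1.pt").eval()   # assumed TorchScript export


def decode(b64):
    return Image.open(io.BytesIO(base64.b64decode(b64)))


@app.route("/colorize", methods=["POST"])
def colorize():
    data = request.get_json()
    lineart = TF.to_tensor(decode(data["lineart"]).convert("L")).unsqueeze(0)
    hint = TF.to_tensor(decode(data["hint"]).convert("RGBA")).unsqueeze(0)
    with torch.no_grad():
        out = G1((lineart - 0.5) / 0.5, hint)     # assumed two-input signature
    out = TF.to_pil_image((out.squeeze(0) + 1) / 2)
    buf = io.BytesIO()
    out.save(buf, format="PNG")
    return jsonify({"illustration": base64.b64encode(buf.getvalue()).decode()})
```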

Figure 13: The Figure is a screenshot of the web app environment created to use our model. On the left, there is a toolbar with a color selector, an eyedropper, a pen, an eraser, and sizes for the brush. On the right, the toolbar allows importing a line art, saving the colored illustration, and saving the hint canvas. The top right toolbox is used to select a model to use. These models are the ones described in the study. The left canvas is the one the user can draw on, whereas the right one displays the illustration results.

Figure 14: The illustration shows the appearance of artifacts on the output image of our model in the presence of highly saturated colors or densely populated hint images.

As shown in Figure 15, this kind of tool can be included in an artist's workflow, saving time and providing new types of creativity. While the contribution can be useful to professional illustrators, it can also leverage the power of digital colorization for beginners through natural interaction. As it can be used for finalized art pieces, it is also a way to prototype quickly. The field of animation could also benefit from this kind of application if the model can allow temporally stable outputs.

6 LIMITATIONS
PaintsTorch enables natural interaction when painting illustrations. Even though our solution provides visually pleasing colored illustrations, it sometimes fails to some extent.

When too many strokes are included in the hint image, or when colors are highly saturated, the network tends to produce artifacts, as Figure 14 shows. These artifacts can be the result of the dataset color distribution. It could be resolved by introducing data augmentation on the source and hint images, such as changing the hue, saturation, and contrast, but also by allowing more strokes per hint map in the stroke simulation.

Moreover, in some cases, our network does not always apply the exact same colors as the given ones. As can be observed in Figure 15, it failed to capture the artist's intent to make the eyes pinkish.

Finally, our pipeline does not use any layer system like painting software such as Photoshop, Krita, and others do. Digital artists usually work with multiple layers. Some are used for the sketch, others for the lineart and colors. PaintsTorch only delivers a final illustration, including the lineart. The colors cannot be separated by the artist afterward, forcing him to paint directly on the produced colored illustration.

7 CONCLUSION
Guided line art colorization is a challenging task for the computer vision domain. Previous works have shown that deep learning yields better results than previous methods. In our contribution, we propose three changes that can improve the current state of the art's results. Our first contribution is the introduction of stroke simulation as a way to replace random pixel activation to generate the hints used for training. Our second contribution is the use of a custom, high resolution, and quality controlled dataset of training illustrations. Our third contribution is the exploration of the use of a second generator, which is in charge of generating synthetic line arts based on the produced artificial illustrations. These three contributions, as the study shows, allow for improved perceptive results compared to previous works.

Our results allow producing quality illustrations on unseen line arts with different input stroke sizes. However, the model still suffers from small artifacts. Moreover, it does not always seem to use the exact color information provided by the user's hints. The model could also provide increased robustness to thinner or thicker line arts and color strokes by changing a few training parameters.

One extension of this work could be studying the impact of using a more massive and diversified dataset. We are also planning to make the model temporally stable to be used for animation purposes.

ACKNOWLEDGEMENTS
We would like to thank the De Vinci Innovation Center for providing an Nvidia DGX station, as well as all the users who took the time to participate in our study, and the PyTorch team for providing such a useful and accessible framework for deep learning on multi-GPU machines.

Figure 15: The Figure is a representation of how an artist would naturally embed our contribution in his workflow.

REFERENCES
[1] Anonymous, the Danbooru community, Gwern Branwen, and Aaron Gokaslan. 2019. Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset. https://www.gwern.net/Danbooru2018. Accessed: DATE.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, International Convention Centre, Sydney, Australia, 214–223. http://proceedings.mlr.press/v70/arjovsky17a.html
[3] Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2017. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. CoRR abs/1711.09020 (2017). arXiv:1711.09020 http://arxiv.org/abs/1711.09020
[4] Yuanzheng Ci, Xinzhu Ma, Zhihui Wang, Haojie Li, and Zhongxuan Luo. 2018. User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks. CoRR abs/1808.03240 (2018). arXiv:1808.03240 http://arxiv.org/abs/1808.03240
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[6] Kevin Frans. 2017. Outline Colorization through Tandem Adversarial Networks. CoRR abs/1704.08834 (2017). arXiv:1704.08834 http://arxiv.org/abs/1704.08834
[7] Chie Furusawa, Kazuyuki Hiroshiba, Keisuke Ogaki, and Yuri Odagiri. 2017. Comicolorization: Semi-automatic Manga Colorization. CoRR abs/1706.06759 (2017). arXiv:1706.06759 http://arxiv.org/abs/1706.06759
[8] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680. http://dl.acm.org/citation.cfm?id=2969033.2969125
[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 5767–5777. http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
[11] Paulina Hensman and Kiyoharu Aizawa. 2017. cGAN-based Manga Colorization Using a Single Training Image. CoRR abs/1706.06918 (2017). arXiv:1706.06918 http://arxiv.org/abs/1706.06918
[12] Hikaru Ikuta, Keisuke Ogaki, and Yuri Odagiri. 2016. Blending Texture Features from Multiple Reference Images for Style Transfer. In SIGGRAPH ASIA 2016 Technical Briefs (SA '16). ACM, New York, NY, USA, Article 15, 4 pages. https://doi.org/10.1145/3005358.3005388
[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. CoRR abs/1611.07004 (2016). arXiv:1611.07004 http://arxiv.org/abs/1611.07004
[14] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. CoRR abs/1603.08155 (2016). arXiv:1603.08155 http://arxiv.org/abs/1603.08155
[15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive Growing of GANs for Improved Quality, Stability, and Variation. CoRR abs/1710.10196 (2017). arXiv:1710.10196 http://arxiv.org/abs/1710.10196
[16] Y. Kataoka, T. Matsubara, and K. Uehara. 2017. Automatic manga colorization with color style by generative adversarial nets. In 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). 495–499. https://doi.org/10.1109/SNPD.2017.8022768
[17] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 http://arxiv.org/abs/1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
[18] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. 2017. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. CoRR abs/1711.07064 (2017). arXiv:1711.07064 http://arxiv.org/abs/1711.07064
[19] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2016. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. CoRR abs/1609.04802 (2016). arXiv:1609.04802 http://arxiv.org/abs/1609.04802
[20] Yifan Liu, Zengchang Qin, Zhenbo Luo, and Hua Wang. 2017. Auto-painter: Cartoon Image Generation from Sketch by Using Conditional Generative Adversarial Networks. CoRR abs/1705.01908 (2017). arXiv:1705.01908 http://arxiv.org/abs/1705.01908
[21] Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. CoRR abs/1411.1784 (2014). arXiv:1411.1784 http://arxiv.org/abs/1411.1784
[22] Preferred Networks. 2017. PaintsChainer. https://paintschainer.preferred.tech/.
[23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR abs/1505.04597 (2015). arXiv:1505.04597 http://arxiv.org/abs/1505.04597
[24] Masaki Saito and Yusuke Matsui. 2015. Illustration2Vec: A Semantic Vector Representation of Illustrations. In SIGGRAPH Asia 2015 Technical Briefs (SA '15). ACM, New York, NY, USA, Article 5, 4 pages. https://doi.org/10.1145/2820903.2820907
[25] Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2016. Scribbler: Controlling Deep Image Synthesis with Sketch and Color. CoRR abs/1612.00835 (2016). arXiv:1612.00835 http://arxiv.org/abs/1612.00835
[26] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[27] Hao Tang, Dan Xu, Wei Wang, Yan Yan, and Nicu Sebe. 2019. Dual Generator Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. CoRR abs/1901.04604 (2019). arXiv:1901.04604 http://arxiv.org/abs/1901.04604
[28] Holger Winnemöller, Jan Eric Kyprianidis, and Sven C. Olsen. 2012. XDoG: An eXtended difference-of-Gaussians compendium including advanced image stylization. Computers & Graphics 36, 6 (2012), 740–753. https://doi.org/10.1016/j.cag.2012.03.004. 2011 Joint Symposium on Computational Aesthetics (CAe), Non-Photorealistic Animation and Rendering (NPAR), and Sketch-Based Interfaces and Modeling (SBIM).
[29] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical Evaluation of Rectified Activations in Convolutional Network. CoRR abs/1505.00853 (2015). arXiv:1505.00853 http://arxiv.org/abs/1505.00853
[30] Lvmin Zhang, Yi Ji, and Xin Lin. 2017. Style Transfer for Anime Sketches with Enhanced Residual U-net and Auxiliary Classifier GAN. CoRR abs/1706.03319 (2017). arXiv:1706.03319 http://arxiv.org/abs/1706.03319
[31] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, and Alexei A. Efros. 2017. Real-Time User-Guided Image Colorization with Learned Deep Priors. CoRR abs/1705.02999 (2017). arXiv:1705.02999 http://arxiv.org/abs/1705.02999