Style transfer-based image synthesis as an efficient regularization technique in

Agnieszka Mikołajczyk, Michał Grochowski
Department of Electrical Engineering, Control Systems and Informatics, Gdańsk University of Technology, Poland
agnieszka.mikolajczyk@pg.edu.pl, [email protected]

Abstract— Deep learning is currently the fastest-growing area in the field of machine learning, and Convolutional Neural Networks are the main tool used for image analysis and classification. Despite great achievements and perspectives, deep neural networks and the accompanying learning algorithms still have some relevant challenges to tackle. In this paper, we focus on the most frequently mentioned problem in the field of machine learning, namely relatively poor generalization abilities. Partial remedies for this are regularization techniques, e.g. dropout, batch normalization, weight decay, transfer learning, early stopping and data augmentation. We focus on data augmentation and propose a method based on neural style transfer, which allows generating new unlabeled images of high perceptual quality that combine the content of a base image with the appearance of another one. In the proposed approach, the newly created images are described with pseudo-labels and then used as a training dataset. Real, labeled images are divided into the validation and test sets. We validated the proposed method on a challenging skin lesion classification case study. Four representative neural architectures are examined. The obtained results show the strong potential of the proposed approach.

Keywords—deep neural networks, regularization, neural style transfer, data augmentation, decision support system, diagnosis, skin lesions

I. INTRODUCTION

Currently, systems based on deep neural networks (DNN) can reach near human-level performance in many types of tasks, such as classification, segmentation, natural language processing, image generation, signal processing and many more. Despite great achievements and bright perspectives, huge deep neural models with many layers and millions of parameters still do not always generalize well enough, even though they show a small difference between the training and test performance [1]. One way of making models generalize better is regularization, which is currently one of the most important techniques in the field of deep learning. Regularization is defined as any supplementary technique that aims at making the model generalize better, i.e. produce better results on the test set [2]. Common regularization strategies are: to inject random noise into various parts of the model (dropout [3]), to limit the growth of the weights (weight decay [4]), to select the right moment to stop the optimization procedure (early stopping [5]), and to select the initial model weights (transfer learning [6]). Moreover, a popular approach is to use dedicated optimizers such as Momentum, AdaGrad and Adam [7], learning rate schedules [8], or techniques such as batch normalization [9] and online batch selection. Since the quality of a trained model depends strongly on the training data, one of the most popular ways to regularize is regularization via data [2]. Data augmentation with traditional transformations such as shear, zoom, rotation, hue or contrast modification is widely used to increase the dataset size, even when the original set is large enough [10]. In the case of class imbalance, a common regularization strategy is to undersample or oversample the data to improve generalization abilities [2]. Despite their many advantages, in some cases simple classical operations are not enough to significantly improve the neural network accuracy or to overcome the problem of overfitting. Moreover, they do not bring any new features to the images that could significantly improve the learning abilities of the algorithm used, as well as the further generalization abilities of the networks [10].

The main goal of this paper is to propose an efficient regularization technique based on data augmentation, which uses neural style transfer (NST) to synthesize images rich in information and features [11]. In order to improve the generalization abilities of the Convolutional Neural Network (CNN), synthesized images are used to train the CNN instead of the real images, while real images are used only for validation and testing, which we think is an unconventional but interesting approach. As a case study, we chose a very challenging benchmark dataset of skin lesions used for classification and segmentation tasks [12]. The main contributions of this paper are:
• The proposition of using neural style transfer as a training data augmentation method (skin lesion case study: benign skin lesion to malignant lesion);
• Incorporating unlabeled data into the training;
• Using only synthesized images as the CNN's training set while using real images as a validation set;
• Showing that the above significantly improved the accuracy as well as the repeatability of deep neural networks.

The remainder of the paper is organized as follows: in Section II we briefly summarize the current state of the art in image augmentation methods. Next, we present a fresh look at the style transfer method and show that it can be used as a data augmentation method (Section III). Moreover, we present a methodology for using neural style transfer to generate new, unlabeled data and for incorporating them into the training (Section IV). In the last two sections, we present our results and discuss the advantages and drawbacks of our method.

II. DATA AUGMENTATION METHODS

In general, the image synthesis methods that have made a great impact in deep learning can be grouped into three categories: traditional transformations, advanced transformations and neural-based transformations.

A. Traditional transformations

Traditional transformations are the most common data augmentation methods applied in deep learning. They are mainly defined as affine and geometric (elastic) transformations. The category of affine transformation methods covers techniques that apply linear transformations to the image, such as rotation, shear, reflection and scaling [10,13], while the most popular geometric transformations are hue modification, contrast modification, white-balance and sharpening [14]. Both affine and elastic distortions or deformations are commonly used to increase the number of samples for training deep neural models [15], to balance the size of datasets [16] or to increase the robustness of the model [14].

B. Advanced transformations

Recent years have also seen more sophisticated transformations that are not as popular, commonly used and well known as traditional transformations. We would like to present a few interesting approaches which can be used as data augmentation techniques. The goal of using those methods is to make DNNs easier to train, with a greater ability to generalize, and more robust to input changes (which is important in the case of adversarial attacks) [17].

One of the interesting approaches developed to make a CNN robust was data augmentation by adding stochastic additive noise to an image [18]. Another approach is the random erasing technique proposed in [19], which is fast and relatively easy to implement yet gives good results in the generalization ability of CNNs. In this method, one randomly paints a noise-filled rectangle in an image, thereby changing the original pixel values. As the authors explained, expanding the dataset with images of various levels of occlusion reduces the risk of overfitting and makes the model more robust. Very surprising results were shown by the authors of [20]. They created between-class images by mixing two images belonging to different classes with a random ratio and added them to the training set, which significantly improved the network performance in the task of skin lesion classification.

C. Neural-based transformations

One of the neural-based data augmentation approaches is Generative Adversarial Networks (GANs), a popular yet relatively new tool for unsupervised generation of new images using a min-max strategy [21]. GANs have been found very useful in many different image generation and manipulation problems, such as text-to-image synthesis [22], super-resolution (generating a high-resolution image out of a low-resolution one) [23], image-to-image translation (e.g. converting sketches to images) [24], image blending (mixing selected parts of two images to get a new one) [25], image inpainting (restoring missing pieces of an image) [26], and unconditional [27] and conditional synthesis [28]. Recently, GANs started gaining attention in the field of medical image processing for tasks such as de-noising [29], segmentation [30], detection [31] and classification [32].

Another CNN-based method is the algorithm of artistic style, based on a deep neural network that creates artistic images of high perceptual quality [11]. The first proposition of using CNN feature activations to mimic the artistic style of one image while maintaining the content of another was published in 2015 [11], and since then many different modifications of this method have emerged, such as universal style transfer, semantic style transfer (ST), instance ST, doodle ST, stereoscopic ST, portrait ST, video ST and many more [33].

III. DATA SYNTHESIS WITH A STYLE TRANSFER

A. Fundamentals

The goal of image style transfer is to mimic the style of one image while keeping the content of another one. Earlier style transfer methods were mostly based on texture transfer, where one copies small fragments of one image to another [34]. Although in some cases this looks good and gives the impression of a visually pleasing style transfer, it does not bring any new information to those pictures, but only copies and pastes small patches. Such an approach does not enrich the training data set with new information and therefore does not help in teaching the network. Since 2015, a new way of style transfer named neural style transfer has started to gain popularity [11]. The idea is based on the property of a Convolutional Neural Network to carry information about the data it was trained on within its convolutional layers. The first CNN filters represent detailed pixel values, but deeper into the network, more advanced and abstract features start to emerge. Therefore, neural style transfer combines properties from the last layers of the CNN, which carry information about the content, and from the first layers, which carry information about the style of an image. The method allows recombining those features to obtain an image with the content of one photograph but with the style of another (Fig. 1). To combine the content and the style of an image with NST, one first needs to train a CNN of choice on a selected dataset (for example, a benchmark dataset like ImageNet).

Fig. 1. a) Synthetic image created on the basis of content and style images (Content + Style = Synthetic image), skin lesion example; b) Neural style transfer flowchart. MSE: mean square error; x: base image; a: style image; p: content image; Al: style representation; Pl: content representation; Gl: style representation of the base image; Fl: content representation of the base image; Ls: style loss; Lc: content loss; Lt: total loss.
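The symbols listed in the caption of Fig. 1 can be related to each other as sketched below. This is only an illustration in the spirit of [11], assuming that both losses are mean square errors as the caption states. The element counts N_l and M_l, the set of style layers S, the weights α and β, and the step size η are notation introduced here for illustration and are not taken from the paper; a plain gradient step is shown for simplicity, and the optimizer actually used may differ.

```latex
\begin{align}
L_c &= \frac{1}{N_l}\sum_{i}\left(F^{l}_{i} - P^{l}_{i}\right)^{2}
      && \text{(content loss, one deep layer } l\text{)} \\
L_s &= \sum_{l \in S}\frac{1}{M_l}\sum_{i}\left(G^{l}_{i} - A^{l}_{i}\right)^{2}
      && \text{(style loss, summed over style layers)} \\
L_t &= \alpha L_c + \beta L_s
      && \text{(total loss)} \\
x &\leftarrow x - \eta\,\frac{\partial L_t}{\partial x}
      && \text{(update of the base image)}
\end{align}
```

With α = β = 1 the total loss reduces to the plain sum of the content and style losses. The content weight of 0.025 and the style weight of 1.0 reported in Section IV.D would play the roles of α and β in this notation.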

Then, both the style image a and the content image p should be passed through the CNN in order to calculate and save their filter responses. For the style representation Al, usually all or almost all layers are stored, while for the content representation Pl only one of the last layers is saved. Next, in the same manner, the base image x (which is, for instance, a white-noise image or the content image) is passed through the network and its content Fl and style Gl representations are saved. The saved representations are later used to calculate the style and content loss functions, which are equal to the mean square error of the representations. The base image is then updated based on this error, where the total loss Lt is equal to the sum of the content loss Lc and the style loss Ls. The neural style transfer flowchart is presented in Fig. 1b.

B. Style transfer for data augmentation

If both style and content images are chosen wisely, style transfer allows one to, for instance, change the illumination of an image, create a retouch effect or mimic an artistic style. Although NST gives many possibilities, it is still underestimated in the field of data augmentation and is mainly used for artistic purposes. The flowchart of the neural style transfer method, with an example of synthesizing an artificial skin lesion image, is presented in Fig. 1. Medical databases are often highly imbalanced in terms of the number of images in each class; for example, in public skin lesion datasets there are usually about ten times fewer malignant lesions than benign lesions. In this paper, we propose to use neural style transfer to synthesize a skin lesion dataset. Instead of undersampling or oversampling the images to balance the number of images in each class, which is a popular approach [35], we decided to synthesize it by merging the content of a benign lesion with the style of a malignant one. This resulted in the creation of a new image with properties of both images (including possible artifacts such as dense hair, water bubbles, gel).

Fig. 2. Examples of synthetic images generated with the same content but with different styles: a) first example, b) second example.

IV. EFFICIENT METHOD OF NEURAL NETWORK TRAINING

A. Idea of the methodology

The general idea of this method is to increase the amount of data further used to train the deep neural networks (data augmentation), but in a way that ensures an increase in the amount and variety of information that the generated images carry. It should lead to an improvement of the image classification accuracy. The method consists of four subtasks: data augmentation, data labelling, dataset splitting and CNN training (Fig. 3).

Fig. 3. The general idea of the method.

The first subtask is to generate new images by merging the style and content of two different images. In order to achieve this, the neural style transfer method [11] was employed [36]. This method does not simply copy and paste patches of an image but allows adding new information to the synthesized images. Moreover, it enables producing more visually pleasing results than traditional style transfer techniques. Better insight into this phase was presented in Section III. After synthesizing the data, we labelled them. The details are presented in Section IV.B. Finally, we trained our CNN and saved the results. The proposed approach is a new regularization technique that can be successfully used alongside others and still improve the results. Common regularization techniques such as data augmentation based on affine transformations, dropout and early stopping are employed in all tests along with the artificial dataset [37].

B. Data labelling

The downside of the style transfer data augmentation method is that the generated images are unlabeled and hence unusable in supervised training. Our goal was to incorporate all of the unlabeled synthesized data into the training, hence we decided to label the images. We wanted to avoid manual labelling, so we classified and labelled the generated images with a DNN that was previously trained on the target dataset to properly classify skin lesions. Moreover, we modified the DNN classification threshold, selecting a new value such that the number of images in all classes is equal. This resulted in very noisy pseudo-labels, but the problem of uncertain labelling was addressed by the dataset splitting described in the next section.

C. Dataset splitting

We prepared the training, validation and test sets using both real and generated data. All of the generated images were selected as the training set, while all real data was divided into validation and test sets. This type of data splitting is robust to the labelling uncertainty present in the training set, because during training the DNN is validated only on the validation set (real annotated data). In total, we prepared 5-fold cross-validation [38] subsets. We paid special attention to ensure that no image used to generate the augmented database appeared in the validation or test sets (the data leakage phenomenon [39]).

D. Implementation details

To synthesize the images we used the method of artistic neural style transfer based on [11] with improvements detailed in [40], implemented in Keras 2.0. In detail, we used VGG16 pretrained on ImageNet, with an input image size of 224x224 px and Conv5_2 as the content layer. Each base image was initialized with the content picture; the content weight was equal to 0.025, the style weight to 1.0, and the pooling layers used the maximum operator. We aimed to draw more general conclusions and therefore we conducted training and tests on four different architecture types, namely VGG9, VGG11, VGG16 and DenseNet121 [41,42]. Each experiment was validated on five different testing folds.
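The threshold selection described in Section IV.B (choosing a classification threshold so that both classes receive the same number of pseudo-labels) can be illustrated with a minimal sketch. The paper does not state how the threshold was computed; using the median of the predicted probabilities is an assumption made here, and the function name `balanced_pseudo_labels` and the randomly generated probabilities are hypothetical stand-ins for the outputs of the pretrained DNN.

```python
import random
import statistics


def balanced_pseudo_labels(malignant_probs):
    """Return binary pseudo-labels and the threshold that balances the classes.

    ASSUMPTION: the median of the predicted probabilities is used as the
    threshold; the paper only says the threshold was chosen so that all
    classes contain the same number of images.
    """
    threshold = statistics.median(malignant_probs)
    # Images scored above the threshold are pseudo-labelled malignant (1),
    # the rest benign (0).
    labels = [1 if p > threshold else 0 for p in malignant_probs]
    return labels, threshold


# Hypothetical classifier outputs for 1000 synthesized images.
random.seed(0)
probs = [random.betavariate(2.0, 5.0) for _ in range(1000)]

labels, threshold = balanced_pseudo_labels(probs)
n_malignant = sum(labels)
n_benign = len(labels) - n_malignant
print(f"threshold={threshold:.3f} malignant={n_malignant} benign={n_benign}")
```

Because the threshold depends only on the ranking of the scores, the resulting labels can be very noisy when the classifier is poorly calibrated, which is why the method validates and tests only on real, annotated images.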

V. RESULTS

Using neural style transfer, we augmented the data set from 1088 malignant skin lesions and 12433 benign original images to 248 489 synthesized ones, using only 418 malignant lesions from the original dataset. Next, we evaluated the performance of four popular and commonly used architecture types. We trained each of them on 54030 synthesized images, validated on 1976 real images, and tested on 200 images for each of the five folds. We applied the following regularization techniques to every tested network: traditional data augmentation (rotation, zoom, shear, reflection), dropout and early stopping. The results are gathered in Table I and presented in graphical form in Fig. 4.
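The average gain reported for the style-transfer augmentation can be recomputed from the per-fold values of Table I. The sketch below copies those values and compares the overall means; the variable and function names are illustrative only.

```python
# Per-fold results copied from Table I (folds 0-4).
without_st = {
    "VGG9": [0.805, 0.820, 0.845, 0.845, 0.805],
    "VGG11": [0.820, 0.799, 0.805, 0.840, 0.845],
    "VGG16": [0.830, 0.850, 0.815, 0.875, 0.840],
    "DenseNet121": [0.820, 0.825, 0.810, 0.850, 0.810],
}
with_st = {
    "VGG9": [0.835, 0.880, 0.845, 0.875, 0.850],
    "VGG11": [0.815, 0.895, 0.870, 0.885, 0.840],
    "VGG16": [0.825, 0.850, 0.835, 0.830, 0.815],
    "DenseNet121": [0.870, 0.895, 0.865, 0.855, 0.860],
}


def overall_mean(results):
    """Mean over all architectures and folds."""
    values = [v for folds in results.values() for v in folds]
    return sum(values) / len(values)


def mean(values):
    return sum(values) / len(values)


# Absolute difference between the two overall means.
gain = overall_mean(with_st) - overall_mean(without_st)

# Per-architecture difference of the fold averages.
per_arch_gain = {
    name: mean(with_st[name]) - mean(without_st[name]) for name in with_st
}

print(f"overall gain: {gain:.4f}")
for name, value in per_arch_gain.items():
    print(f"{name}: {value:+.4f}")
```

The overall difference comes out at about 0.0268, which suggests that the reported 2.68% is an absolute difference in percentage points rather than a relative change. The per-architecture differences are positive for VGG9, VGG11 and DenseNet121 and negative for VGG16.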

TABLE I. RESULTS OF EXPERIMENTS WITH AND WITHOUT DATA AUGMENTATION

Data augmentation  Architecture  Fold 0  Fold 1  Fold 2  Fold 3  Fold 4  Average
Without            VGG9          0.805   0.820   0.845   0.845   0.805   0.824
Without            VGG11         0.820   0.799   0.805   0.840   0.845   0.822
Without            VGG16         0.830   0.850   0.815   0.875   0.840   0.842
Without            DenseNet121   0.820   0.825   0.810   0.850   0.810   0.823
With               VGG9          0.835   0.880   0.845   0.875   0.850   0.857
With               VGG11         0.815   0.895   0.870   0.885   0.840   0.861
With               VGG16         0.825   0.850   0.835   0.830   0.815   0.831
With               DenseNet121   0.870   0.895   0.865   0.855   0.860   0.869

Fig. 4. AUC for the evaluated architectures with or without style transfer (ST) for 5 folds.

Networks trained with the additional DA increased their area under the curve (AUC) on average by 2.68% compared with networks trained traditionally. What is more, we observed that DNNs trained with DA tend not to overfit as quickly as in the traditional approach. A CNN trained without the additional DA increased its training accuracy after a few epochs, while its validation accuracy (ACC) dropped. In contrast, CNNs with DA slowly gained on both validation and training ACC, which is the desired situation during training. Less overfitting makes DNNs generally more robust, and they generalize better to new, unseen samples. Interestingly, the VGG-16 architecture, which is bigger in terms of the number of parameters, received a slightly lower AUC with DA than without it. Moreover, smaller architectures such as VGG-9 and DenseNet trained with DA achieved better results than the bigger architecture VGG-16 without DA.

VI. SUMMARY AND CONCLUDING REMARKS

In the paper we presented the current state of the art of regularization techniques along with a variety of different data augmentation techniques. We proposed our own regularization technique based on data synthesis and tested it on the very demanding case of skin lesion classification. We created the images with neural style transfer, mixing benign content and malignant style to synthesize new skin lesion images. Next, we created pseudo-labels for each image using a CNN and prepared a specific dataset split: we put all of the synthesized images into the training set, while the real images were split into validation and test sets. Finally, the data prepared in this way were used to train four different representative types of convolutional neural network architectures. The experiments carried out for all architectures and for 5 folds showed an average 2.68% improvement of AUC compared with the approach without the proposed regularization technique.

Our NST regularization method can be used for image analysis, where NST gives satisfying results [11], and for other applications, e.g. video style transfer for scene recognition, speech style transfer for speech recognition, or others. Moreover, instead of NST, our method can be used with GANs [20] or other methods which allow creating additional unannotated data, or even where new data is available but is unlabeled or labeled with high uncertainty. The NST regularization method should be used not instead of but alongside other regularization techniques.

While this method is fast and easy to implement for any architecture, it is especially useful when using an unpopular or specifically designed CNN architecture, for which it is not possible to download a pretrained CNN from the web and transfer learning is available only after additionally pretraining it. Hence, because it is faster than pretraining the network and then training it, the method is especially useful in the task of neural architecture search [43]. When needed, it can also be used together with transfer learning.

REFERENCES

[1] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. ArXiv Preprint ArXiv:161103530 2016.
[2] Kukačka J, Golkov V, Cremers D. Regularization for deep learning: A taxonomy. ArXiv Preprint ArXiv:171010686 2017.
[3] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 2014;15:1929–1958.
[4] Krogh A, Hertz JA. A simple weight decay can improve generalization. Advances in neural information processing systems, 1992, p. 950–957.
[5] Prechelt L. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 1998;11:761–767.
[6] Pan SJ, Yang Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 2010;22:1345–1359.
[7] Wilson AC, Roelofs R, Stern M, Srebro N, Recht B. The marginal value of adaptive gradient methods in machine learning. Advances in Neural Information Processing Systems, 2017, p. 4148–4158.
[8] Darken C, Moody JE. Note on learning rate schedules for stochastic optimization. Advances in neural information processing systems, 1991, p. 832–838.
[9] Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv:150203167 [Cs] 2015.
[10] Mikolajczyk A, Grochowski M. Data augmentation for improving deep learning in image classification problem. 2018 International Interdisciplinary PhD Workshop (IIPhDW), IEEE; 2018, p. 117–122.
[11] Gatys LA, Ecker AS, Bethge M. A neural algorithm of artistic style. ArXiv Preprint ArXiv:150806576 2015.
[12] Jaworek-Korjakowska J. A Deep Learning Approach to Vascular Structure Segmentation in Dermoscopy Colour Images. BioMed Research International 2018;2018.
[13] Wang J, Perez L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis Recognit 2017.
[14] Galdran A, Alvarez-Gila A, Meyer MI, Saratxaga CL, Araújo T, Garrote E, et al. Data-driven color augmentation techniques for deep skin image analysis. ArXiv Preprint ArXiv:170303702 2017.
[15] Grochowski M, Kwasigroch A, Mikolajczyk A. Diagnosis of Malignant Melanoma by neural network ensemble-based system utilising hand-crafted skin lesion features. Metrology and Measurement Systems 2019.
[16] Grochowski M, Wąsowicz M, Mikolajczyk A, Ficek M, Kulka M, Wróbel M, et al. Machine learning system for automated blood smear analysis. Metrology and Measurement Systems 2019.
[17] Gu S, Rigazio L. Towards deep neural network architectures robust to adversarial examples. ArXiv Preprint ArXiv:14125068 2014.
[18] Jin J, Dundar A, Culurciello E. Robust convolutional neural networks under adversarial noise. ArXiv Preprint ArXiv:151106306 2015.
[19] Zhong Z, Zheng L, Kang G, Li S, Yang Y. Random erasing data augmentation. ArXiv Preprint ArXiv:170804896 2017.
[20] Tokozume Y, Ushiku Y, Harada T. Between-class learning for image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, p. 5486–5494.
[21] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. Advances in neural information processing systems, 2014, p. 2672–2680.
[22] Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H. Generative adversarial text to image synthesis. ArXiv Preprint ArXiv:160505396 2016.
[23] Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. CVPR, vol. 2, 2017, p. 4.
[24] Isola P, Zhu J-Y, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. ArXiv Preprint 2017.
[25] Wu H, Zheng S, Zhang J, Huang K. Gp-gan: Towards realistic high-resolution image blending. ArXiv Preprint ArXiv:170307195 2017.
[26] Yeh RA, Chen C, Lim T-Y, Schwing AG, Hasegawa-Johnson M, Do MN. Semantic Image Inpainting with Deep Generative Models. CVPR, vol. 2, 2017, p. 4.
[27] Bodnar C. Text to image synthesis using generative adversarial networks. ArXiv Preprint ArXiv:180500676 2018.
[28] Shaban MT, Baur C, Navab N, Albarqouni S. Staingan: Stain style transfer for digital histological images. ArXiv Preprint ArXiv:180401601 2018.
[29] Kazeminia S, Baur C, Kuijper A, van Ginneken B, Navab N, Albarqouni S, et al. GANs for medical image analysis. ArXiv Preprint ArXiv:180906222 2018.
[30] Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Medical Image Analysis 2017;42:60–88.
[31] Mendoza R, Menkovski V, Mavroeidis D. Anomaly Detection with generative models. PhD Thesis. Master's thesis, Eindhoven University of Technology, 2018.
[32] Zhang L, Gooya A, Frangi AF. Semi-supervised assessment of incomplete lv coverage in cardiac mri using generative adversarial nets. International Workshop on Simulation and Synthesis in Medical Imaging, Springer; 2017, p. 61–68.
[33] Jing Y, Yang Y, Feng Z, Ye J, Yu Y, Song M. Neural style transfer: A review. ArXiv Preprint ArXiv:170504058 2017.
[34] Chen TQ, Schmidt M. Fast patch-based style transfer of arbitrary style. ArXiv Preprint ArXiv:161204337 2016.
[35] Grochowski M, Kwasigroch A, Mikołajczyk A. Selected technical issues of deep neural networks for image classification purposes. Bulletin of the Polish Academy of Sciences: Technical Sciences 2019.
[36] Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. ArXiv Preprint 2017.
[37] Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. vol. 1. MIT press Cambridge; 2016.
[38] Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Ijcai, vol. 14, Montreal, Canada; 1995, p. 1137–1145.
[39] Rouhani BD, Riazi MS, Koushanfar F. Deepsecure: Scalable provably-secure deep learning. Proceedings of the 55th Annual Design Automation Conference, ACM; 2018, p. 2.
[40] Novak R, Nikulin Y. Improving the neural algorithm of artistic style. ArXiv Preprint ArXiv:160504603 2016.
[41] Kwasigroch A, Mikolajczyk A, Grochowski M. Deep neural networks approach to skin lesions classification - A comparative analysis. 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), Miedzyzdroje, Poland: IEEE; 2017, p. 1069–74. doi:10.1109/MMAR.2017.8046978.
[42] Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. ArXiv:160806993 [Cs] 2016.
[43] Kwasigroch A, Grochowski M. Deep neural network architecture search using network morphism. 24th International Conference on Methods and Models in Automation and Robotics (MMAR), 2019.