Fake It Until You Make It: Exploring If Synthetic GAN-Generated Images Can Improve the Performance of a Deep Learning Skin Lesion Classifier


Fake It Until You Make It: Exploring if synthetic GAN-generated images can improve the performance of a deep learning skin lesion classifier

Nicolas Carmont Zaragoza
[email protected]

4th Year Project Report
Artificial Intelligence and Computer Science
School of Informatics
University of Edinburgh
2020

Abstract

In the U.S. alone, more people are diagnosed with skin cancer each year than with all other types of cancer combined [1]. For those that have melanoma, the average 5-year survival rate is 98.4% in early stages and drops to 22.5% in late stages [2]. Skin lesion classification using machine learning has recently become especially popular because of its ability to match the accuracy of professional dermatologists [3]. Increasing the accuracy of these classifiers is an ongoing challenge to save lives. However, many skin classification research papers assert that there are not enough images in skin lesion datasets to improve the accuracy further [4] [5]. Over the past few years, Generative Adversarial Networks (GANs) have earned a reputation for their ability to generate highly realistic synthetic images from a dataset. This project explores the effectiveness of GANs as a form of data augmentation to generate realistic skin lesion images. Furthermore, it explores to what extent these GAN-generated lesions can help improve the performance of a skin lesion classifier. The findings suggest that GAN augmentation does not provide a statistically significant increase in classifier performance. However, the generated synthetic images were realistic enough to lead professional dermatologists to believe they were real 41% of the time during a Visual Turing Test. Areas of further study for skin lesion GANs and potential future applications of the synthetic skin lesion images in educational material were also explored.

Acknowledgements

To Robert Fisher for his ongoing support and guidance throughout this project, for our insightful weekly conversations and for his help in the pronunciation of "Wasserstein".

To the anonymous professional dermatologists for their participation in the study and the Visual Turing Test.

To Mom, Dad, Blanca and Oli, who were always there for me every step of the way, just a phone call away, throughout my 4 years at university. Constantly providing love, laughter and reminding me of how pale I look.

In loving memory of Nanny, who always taught me to "Smile, be happy and don't forget the doughnuts".

In loving memory of Dottie, an exemplary fighter.

To Poppy, Avia, Avia and my family in Australia and Spain for their unmatchable love and constant source of inspiration and joy.

To the Redfort Mafia, Ashish, Justin, Sameer, Jay, Angus, Adityha and Gurz for making this the most enjoyable 4 years.

To Joanna for her constant positivity and enthusiasm even across time zones.

To my friends in Amsterdam and Barcelona for some great memories.

In loving memory of Xana, a beautiful young soul who left us too soon.

Contents

1 Introduction
  1.1 The Problem
  1.2 A Potential Solution
  1.3 Malignant Lesions
  1.4 Research Question
  1.5 Roadmap
2 Background & Literature
  2.1 The role of Artificial Neural Networks in skin lesion classification
  2.2 Convolutional Neural Networks
  2.3 State-of-the-art in skin lesion classification
  2.4 DERMOFIT Dataset
  2.5 Work on the DERMOFIT Dataset
  2.6 What is Data Augmentation?
  2.7 How GANs create synthetic images
  2.8 Existing work on skin lesion GANs
3 GAN Implementation
  3.1 Building the GAN
  3.2 Selecting the dimensions of the generator input and output
  3.3 Training the GAN
  3.4 Challenges with mode collapse
  3.5 The Wasserstein GAN
  3.6 Many GANs or a single GAN?
  3.7 When to finish training the GAN
4 Testing Methodology
  4.1 Test Framework
  4.2 Transfer Learning as Feature Extraction
  4.3 Choice of Hyper-parameters
  4.4 Baseline and Comparison
  4.5 Reproducibility and Reliability
  4.6 Performance Metrics
5 Results
  5.1 Training Performance of the VGG-16
  5.2 Performance of the 4 Augmentation Techniques
    5.2.0.1 No Augmentation
    5.2.0.2 Affine Augmentation
    5.2.0.3 WGAN Augmentation
    5.2.0.4 Both Affine and WGAN Augmentation
  5.3 Statistical Analysis
6 Qualitative Evaluation: Visual Turing Test
  6.1 Methodology of VTT
  6.2 Results
  6.3 Analysis
7 Conclusion & Evaluation
  7.1 Analysis & Conclusion
  7.2 Comparison to Literature
  7.3 Method Strengths
  7.4 Method Weaknesses
  7.5 Potential Applications
  7.6 Further Study
8 Appendix
  8.1 Visual Turing Test (VTT) Survey

Chapter 1: Introduction

"Reality is merely an illusion, albeit a very persistent one." - Albert Einstein [6]

1.1 The Problem

Approximately two in three Australians will be diagnosed with skin cancer by the time they turn 70 [7]. It is the most common type of cancer worldwide, ranking 18th among global health threats [8]. The largest sufferers are pale-skinned populations in sun-exposed countries such as Australia, the United States and New Zealand [9]. Despite evidence that individuals are getting less sun exposure, the total frequency of skin cancer incidence has been rising recently, perhaps due to a variety of factors including more intense UV presence and rising rates of longevity [4] [10].

Fortunately, medical tools used for skin cancer detection have greatly improved over the past decades [11]. Vast improvements in skin lesion classification in particular have made it possible for dermatologists to augment their diagnostic capabilities using technology. Today, various medical devices and even apps exist to give dermatologists additional information for diagnosis [12]. However, studies have shown that the accuracy of these devices is limited [12][13]. This is concerning, as a false negative misdiagnosis by a skin lesion classifier can lead a doctor to believe that a patient with a malignant skin cancer is fine.

Skin lesion classification research papers often blame a lack of data as the primary reason for not achieving higher performance [4]. Most datasets used for machine learning with neural networks contain 50,000-100,000 images, whereas most skin lesion datasets contain only about 800-8,000 images [14]. The problem with getting more data is that it often takes a lot of time (typically years), substantial legal approval from patients and extensive collaboration with various medical facilities [14]. Nevertheless, there may exist other ways of acquiring more data without needing to gather new skin lesion images.

1.2 A Potential Solution

In the past few years, Generative Adversarial Networks (GANs) have earned a reputation for their ability to generate highly realistic synthetic images from a dataset [15]. To verify this, the reader is encouraged to classify which of the face images below are real and which were generated by AI.

Figure 1.1: Mixture of real and GAN-generated face images [16]

In fact, all six of the images above were generated by StyleGAN2 [16].

One of the most distinctive aspects of a GAN is that it learns implicitly [17]. This means that the component of the GAN responsible for generating synthetic images (the generator) learns only through feedback given to it by another component that acts as a supervisor (the discriminator). The generator therefore never actually sees real images. At every step of training, the generator creates a sample image and is then told whether the discriminator believed it to be real or not. Much like humans, the generator learns from failure. It essentially learns to "paint pictures" by itself, correcting itself when it gets bad grades, rather than by looking at the solutions. Because of this, the generator can create synthetic images that do not necessarily look like any single image in the dataset, but instead incorporate a variety of features from each in a realistic way [18]. Chapter 2 explores further how GANs work.
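To make this feedback loop concrete, the sketch below shows a single training step of a vanilla GAN in PyTorch. It is a minimal illustration only, not the architecture or framework used in this project; the fully connected layers, the 100-dimensional noise vector and the 64x64 image size are assumptions chosen for brevity. The point it demonstrates is that the generator's weights are updated only through the discriminator's verdict on its samples, never by looking at real images directly.

```python
# Minimal GAN training step (illustrative sketch only; not the project's
# actual implementation). Layer sizes, latent dimension and image size
# are assumptions made for brevity.
import torch
import torch.nn as nn

latent_dim = 100          # assumed size of the generator's noise input
image_dim = 64 * 64 * 3   # assumed flattened lesion-image size

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images: torch.Tensor) -> None:
    # real_images: a batch of flattened lesion images scaled to [-1, 1].
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator step: learn to separate real lesions from fakes.
    noise = torch.randn(batch, latent_dim)
    fake_images = generator(noise).detach()
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Generator step: the only signal it receives is the discriminator's
    #    verdict, so it is rewarded when its fakes are classified as real.
    noise = torch.randn(batch, latent_dim)
    g_loss = bce(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The implementation described in Chapter 3 uses a Wasserstein GAN rather than this vanilla formulation, which helps with the mode-collapse issues discussed there.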
From the standpoint of a skin lesion classifier, generating more data, specifically for rare malignant lesions, is highly desirable: many skin lesions are rare, and having more variations of a malignant lesion can help in its detection. Hence, if GANs could generate realistic skin lesion images that add new information, they could potentially help increase the performance of a skin lesion classifier and thereby help save lives. However, in order to better understand what "realistic" means in the context of skin lesions, we must first look at what the different types of skin cancer and pre-cancer look like.

1.3 Malignant Lesions

Legal note: the following material is meant as background on skin lesions and should not be taken as medical advice. The author is not a certified doctor.

When talking about malignant skin lesions in this project, we specifically refer to the different types of skin cancer and pre-cancer. Skin cancer is typically divided into two main categories: melanoma skin cancer and non-melanoma skin cancer. Within non-melanoma skin cancer there exist two types: Squamous cell carcinoma and Basal cell carcinoma. Overall, Basal cell carcinoma (non-melanoma) is the most common type of skin cancer, with Squamous cell carcinoma (non-melanoma) following soon after. Melanoma is the least common skin cancer of the three, but it is also the most deadly, with twice the death rate per incidence in the US [1].