DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Generative adversarial networks for single image super resolution in microscopy images

SAURABH GAWANDE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Generative adversarial networks for single image super resolution in microscopy images

SAURABH GAWANDE

Master's Thesis at KTH Information and Communication Technology
Supervisor: Mihhail Matskin
Examiner: Dr. Anne Håkansson
Industrial Supervisor: Dr. Kevin Smith

TRITA-EECS-EX-2018:10

Abstract

Image super-resolution is a widely studied problem in computer vision, where the objective is to convert a low-resolution image to a high-resolution image. Conventional methods for achieving super-resolution, such as image priors, interpolation, and sparse coding, require a lot of pre/post processing and optimization. Recently, deep learning methods such as convolutional neural networks and generative adversarial networks have been used to perform super-resolution with results competitive with the state of the art, but none of them have been applied to microscopy images. In this thesis, a generative adversarial network, mSRGAN, is proposed for super-resolution with a perceptual loss function consisting of an adversarial loss, a mean squared error loss and a content loss. The objective of our implementation is to learn an end-to-end mapping between the low/high-resolution images and optimize the upscaled image for quantitative metrics as well as perceptual quality. We then compare our results with the current state of the art methods in super-resolution, conduct a proof of concept segmentation study to show that super-resolved images can be used as an effective pre-processing step before segmentation, and validate the findings statistically.

Keywords: Deep Learning, Generative adversarial networks, Super resolution, High content screening microscopy

Abstract

Image super-resolution is a widely studied problem in computer vision, where the goal is to convert a low-resolution image into a high-resolution image. Conventional methods for achieving super-resolution, such as image priors, interpolation, and sparse coding, require a great deal of pre- and post-processing and optimization. Recently, deep learning methods such as convolutional neural networks and generative adversarial networks have been used to perform super-resolution with results competitive with the state of the art, but none of them have been applied to microscopy images. In this thesis, a generative adversarial network, mSRGAN, is proposed for super-resolution with a perceptual loss function consisting of an adversarial loss, a mean squared error loss and a content loss. The goal of our implementation is to learn an end-to-end mapping between low- and high-resolution images and to optimize the upscaled image for quantitative metrics as well as perceptual quality. We then compare our results with the current state of the art methods in super-resolution, carry out a proof of concept segmentation study to show that super-resolved images can be used as an effective pre-processing step before segmentation, and validate the findings statistically.

Keywords: Deep Learning, Generative adversarial networks, Super resolution, High content screening microscopy

Acknowledgements

First of all, I would like to express my sincerest gratitude to Dr. Kevin Smith for giving me the opportunity to work on this exciting topic and without whom this work would not have materialized. Having a guide like Kevin was truly a blessing, and I could not have wished for a better mentor. Thank you, Kevin, for always being patient with me, pushing me to go the extra mile and always making time for me despite your hectic schedule. I would like to thank Dr. Hossein Azizpour for his continuous feedback and ideas, for always being available to clear my doubts no matter how naive, and for serving as a beacon of inspiration. I am also thankful to my examiner Dr. Anne Håkansson for providing me the support I needed to stay on track and helping me maintain the scientific quality of this work.

Last but not least, no amount of thanks will ever be enough for my parents, who have loved, supported and cared for me unconditionally throughout my tumultuous and protracted journey.

May all your minima always be local! Tack!

Contents

Abbreviations

1 Introduction
  1.1 Image Super-resolution
  1.2 Background
  1.3 Problem
  1.4 Purpose and Goal
  1.5 Ethics and Sustainability
  1.6 Methodology
  1.7 Delimitations
  1.8 Outline
  1.9 Contributions

2 Relevant Theory
  2.1 Background knowledge
    2.1.1 Definitions
    2.1.2 Strategies to increase image resolution
    2.1.3 Evaluation metric for Super-Resolution
  2.2 Neural Networks
    2.2.1 Convolutional Neural Networks
    2.2.2 Generative Adversarial Networks
  2.3 Literature Study
    2.3.1 Traditional Single Image super resolution
    2.3.2 Deep Learning based Single Image super resolution

3 Motivation
  3.1 HCS microscopy problems in image acquisition
    3.1.1 Photo bleaching
    3.1.2 Bleed through / Crosstalk
    3.1.3 Phototoxicity
    3.1.4 Uneven illumination
    3.1.5 Color and contrast errors
  3.2 Inefficiency of pixel wise MSE
  3.3 Feature transferability issues in CNNs for distant source and target domains

4 Methods
  4.1 Methods
  4.2 Mathematical formulation
  4.3 Generative Adversarial Network architecture
  4.4 Loss functions
    4.4.1 Perceptual Loss
    4.4.2 Pixel wise Mean squared error
    4.4.3 Content Loss
    4.4.4 Adversarial Loss
    4.4.5 Flowchart
  4.5 Data Acquisition
    4.5.1 Data processing

5 Experiments and Results
  5.1 mSRGAN
  5.2 mSRGAN - VGG2
  5.3 mSRGAN - VGG5
  5.4 SRRESNET
  5.5 mSRGAN - CL (Only content loss)
  5.6 SRGAN
  5.7 Nuclei Segmentation
    5.7.1 Statistical Validation

6 Discussion
  6.1 mSRGAN vs SRGAN (Evaluating Hypothesis 1)
  6.2 Content loss vs MSE (Evaluating Hypothesis 2)
    6.2.1 Effect of Different VGG layers
    6.2.2 PSNR variation for mSRGAN variants
  6.3 Segmentation results
  6.4 GAN failures
  6.5 Checkerboard Artifacts
  6.6 Lack of training data

7 Conclusion
  7.1 Future Work

Bibliography

Appendices

A More Results

Abbreviations

α weight coefficient for MSE

β weight coefficient for content loss

CNN Convolutional neural networks

CT Computed tomography

DL Deep Learning

GAN Generative adversarial networks


HCS High content screening

HR High resolution

HVS Human visual system

LR Low resolution

MRI Magnetic resonance imaging

MSE Mean squared error

PSNR Peak signal to noise ratio

SC Sparse coding

SGD Stochastic gradient descent

SISR Single image super resolution

SR Super resolution

Chapter 1

Introduction

In this thesis project, we explore the use of generative adversarial networks for performing single image super resolution on high content screening microscopy images. The project was carried out within the Bioimage Informatics Facility at the Science for Life Laboratory, Sweden.

1.1 Image Super-resolution

In most digital imaging applications, high-resolution images are preferred and often required to accomplish tasks. Image super-resolution (SR) is a widely studied problem in computer vision, where the objective is to generate one or more high-resolution images from one or more low-resolution images. An SR algorithm aims to produce details finer than the sampling grid of a given imaging device by increasing the number of pixels per unit area in an image. SR is a well known ill-posed inverse problem, in which a high-resolution image is restored from a low-resolution image (usually corrupted by noise, motion blur, aliasing, optical distortion, etc.) [1] [2].

SR techniques can be applied in many scenarios where multiple frames of a single scene can be obtained, for example when several images of the same object are captured by a single camera, or when images of a scene are available from numerous sources (several cameras capturing a single scene from various locations).

SR has applications in varied fields such as satellite imaging (e.g., remote sensing), where several images of a single area are available; security and surveillance, where it may be required to enlarge a particular point of interest in a scene (such as zooming in on the face of a criminal or the numbers of a license plate); and computer vision, where it can improve the performance of pattern recognition and other areas such as facial image analysis, text image analysis, biometric identification, fingerprint image enhancement, etc. [1].

SR is of particular importance in medical imaging, where more detailed images are required on demand, and high-resolution medical images can aid doctors in making a correct diagnosis, e.g., in computed tomography (CT) and magnetic resonance imaging (MRI), where the acquisition of multiple images is possible albeit with limited resolution.

1.2 Background

Convolutional neural networks (CNNs) have been in existence for a long time [2], and recently deep CNNs have shown an upsurge in popularity due to their various successes in image classification tasks, one of them being the ImageNet Large Scale Visual Recognition Challenge, a benchmark in object classification and detection consisting of millions of images and thousands of classes [3]. CNNs have also been applied to other sub-problems of computer vision such as object detection [4], face recognition [5] and pedestrian detection [6]. Various factors are instrumental in the progress and effectiveness of CNNs: a) the advent of more powerful graphics processing units [3], which makes it easier to train more complex models on large datasets; b) the exponential increase in the amount of available data, which helps in training large models and obtaining more accurate results; and c) the adoption in the machine learning community of activation functions such as ReLU and LeakyReLU, which help CNN models converge faster, maintain high accuracy and avoid overfitting [7].

Generative models, particularly GANs, have shown remarkable results in image generation applications such as super-resolution [8], generating art, image-to-image translation [9], etc. One advantage of GANs is the adversarial loss component, which allows them to work well with multi-modal outputs, e.g., in image generation tasks where an input can have multiple acceptable correct answers. Traditional machine learning methods use pixel-wise mean squared error as the optimization objective and are hence not able to produce multiple correct outputs. GANs excel in tasks which require generating samples resembling a particular distribution. One such task is super-resolution, where a high-resolution equivalent has to be estimated from a low-resolution image, and multiple high-resolution images corresponding to a low-resolution image are possible [9].

Image restoration and denoising techniques deal with accounting for noise and other disturbances to recover a less degraded image from the original image. Super-resolution and image restoration are theoretically quite similar, differing in that super-resolution produces an upscaled, noise-free version of the original image. There has been considerable work on image restoration using deep learning methods to achieve image denoising. Burger et al. applied a multi-layer perceptron to natural image denoising and post-blurring denoising [10]. Jain et al. used CNNs for natural image denoising and for removing noise patterns such as rain and dirt [11]. Cui et al. [12] proposed to include auto-encoders in their super-resolution pipeline based on an internal example-based approach. Ledig et al. [8] most recently reported state of the art results in super-resolution using a generative adversarial network.

These recent developments show that there is a lot of potential in applying deep learning (especially deep CNNs and GANs) to image SR and achieving results competitive with or better than the state of the art methods in image SR.

1.3 Problem

The problem of generating a high-resolution (HR) image from a low-resolution (LR) image is an underdetermined inverse problem which does not have a unique solution. This is made worse by the fact that a variety of different solutions exist for any given low-resolution pixel.

While capturing a digital image, there is a significant loss of spatial resolution caused by optical distortions, motion blur due to limited shutter speed, and noise that occurs within the sensor or during transmission, resulting in significant differences between the original scene and the captured scene. So, apart from scaling the low-resolution image, the SR algorithm also needs to account for these factors. Standard errors in acquiring an image aside, microscopy images are more susceptible to problems like photobleaching, crosstalk, phototoxicity, etc. (discussed in more detail in section 3.1).

The most common diagnostic errors in biomedical imaging are missed diagnoses, compared to those that were late or incorrect. Many patients are misdiagnosed from images produced by CT scans, mammograms, MRIs, etc. These misdiagnoses partly occur on account of observer error (due to unclear images and multiple psychophysiological factors, including the level of observer alertness, observer fatigue, and the duration of the observation task) and perceptual errors (failure to detect an abnormality in the images).

1.4 Purpose and Goal

This thesis will investigate whether deep learning (generative adversarial networks, convolutional neural networks) can be used to achieve better results than conventional methods for super-resolution in high content screening microscopy images.

The end goals of this project are to:

• Propose and implement a deep learning method for single microscopic image super-resolution (SR) that directly learns an end-to-end mapping between the low/high-resolution images, takes a low-resolution microscopic image as the input and outputs the high-resolution one.

• Evaluate the HR images produced by the method against the original LR images, compare the obtained results with the state of the art methods, and discuss the suitability of deep learning for achieving SR in microscopy images.

1.5 Ethics and Sustainability

This work will benefit the biomedical research community in general and patients in particular, since this technique might help doctors and clinicians reduce the number of misdiagnoses. Our use of publicly available datasets assuages any privacy concerns there might be. The results are reported as they are obtained without any manipulation, and appropriate sources are credited wherever possible to avoid plagiarism.

The biggest risks we foresee at the moment are:

• Acquisition of medical image data - Acquired medical data, especially images, might be filled with noise and artifacts owing to instrument acquisition errors, faulty human handling, etc. This factor should be taken into account while performing the data analysis and interpreting the results, as faulty conclusions might lead to incorrect diagnoses causing harm to patients.

• Adoption of algorithms in real environments - Clinicians prefer to use the raw data/images produced from experiments without any processing, and hence there is a possibility that they might not adopt our technique in practice.

1.6 Methodology

Since this thesis targets research and no known deep learning technique has been applied to super-resolution of microscopy images, the use of generative adversarial networks (which have reported state of the art results for super-resolution in natural images) is explored for super-resolution in microscopy images. Moreover, acquiring microscopy images comes with a host of challenges (described in detail in section 3.1) leading to noisy captured images, emphasizing the need for applying super-resolution techniques. I propose a generative adversarial network, mSRGAN, for performing super-resolution exclusively on microscopy images, inspired by Ledig et al.'s SRGAN [8], which performs super-resolution on natural images optimized for perceptual quality.

First, an extensive literature survey is conducted to investigate the classical and more recent deep learning based approaches for super-resolution. The objective of this study was to get acquainted with the algorithms used for super-resolution, make use of the latest state of the art techniques and learn about the main challenges present in the research area. Through the study, it was found that deep learning based approaches for super-resolution have shown promise, with some of them breaking the state of the art results. One drawback, however, is that the resulting super-resolved images are not visually pleasing to the human observer, and the main reason for this is the use of pixel-wise mean squared error as the optimization function to generate the super-resolved image.

To offset the lack of visual quality and building on the work by Ledig et al. [8], I propose a novel perceptual loss function which makes use of pixel-wise MSE and a content loss (minimizing the distance between feature representations of images extracted by a mini VGG19 network). An adversarial network similar to SRGAN by Ledig et al. [8] is used, with the exception that a mini VGG19 network trained from scratch on microscopy images is proposed instead of a pre-trained VGG19 trained on ImageNet images. Using a weighted combination of pixel-wise MSE and content loss makes sure that generated images benefit from the strong points of both losses. The generated image quality is measured by PSNR and compared with bicubic interpolation and SRGAN. Finally, to test the applicability of our mSRGAN model in real life applications, a nuclei segmentation study is performed, and the segmentation performance is measured by the Dice coefficient and validated further using statistical tests.
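As a concrete illustration of this weighted objective, the following is a minimal Python/PyTorch sketch, not the exact implementation used in this thesis: the feature_extractor argument stands in for the mini VGG19 described above, and the weights alpha and beta are placeholder values.

import torch
import torch.nn as nn

class WeightedPerceptualLoss(nn.Module):
    """Weighted combination of pixel-wise MSE and a VGG-feature content loss.

    feature_extractor is assumed to be a frozen network mapping images to
    feature maps (e.g. a mini VGG19 trained on microscopy images); alpha and
    beta are illustrative weighting coefficients.
    """
    def __init__(self, feature_extractor, alpha=1.0, beta=0.006):
        super().__init__()
        self.features = feature_extractor.eval()
        for p in self.features.parameters():
            p.requires_grad = False          # the feature network stays fixed
        self.alpha = alpha
        self.beta = beta
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        pixel_loss = self.mse(sr, hr)                      # pixel-wise MSE
        content_loss = self.mse(self.features(sr),         # distance between
                                self.features(hr))         # feature representations
        return self.alpha * pixel_loss + self.beta * content_loss

In the full objective the generator additionally receives an adversarial term from the discriminator, weighted in the same spirit.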

1.7 Delimitations

In this project, due to time and resource constraints, we do not conduct a comprehensive qualitative study on the quality of the images generated by super-resolution. The visual quality of the images was evaluated only by the author. How good an image looks is a very subjective matter, with some images looking pleasing to one person while not so pleasing to others. Thus the observations in the report about the visual quality of the images are prone to the author's biases.

1.8 Outline

The rest of the report is organized as follows. In chapter 2 we give an overview of the theory and concepts that are essential to understanding the work done in the rest of the project; this chapter also includes a section on related work. The contributions of this work are highlighted in section 1.9. The motivation for conducting this work is presented in chapter 3. We introduce the model and methods used in this project in chapter 4. Then we present our experiments, results, and evaluation in chapter 5, followed by a discussion of the methods, experiments, and results in chapter 6. Finally, we conclude the report in chapter 7 and also give some ideas for future work.

1.9 Contributions

The main contributions of this work are as follows -

• We propose the first generative adversarial network, mSRGAN, to perform SR on microscopy images optimized for visual quality. We integrate the traditional pixel-wise MSE with a loss calculated on the feature representations of a mini VGG19 network trained from scratch on fluorescence microscopy images.

• We then conduct a proof of concept nuclei segmentation study on the super-resolved, bicubic interpolated and ground truth images. Using the Dice coefficient and statistical validation, we demonstrate that super-resolution by mSRGAN improves segmentation performance compared to bicubic interpolated SR images and can be used as an effective pre-processing step for performing nuclei segmentation.
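For reference, the Dice coefficient on binary masks can be computed as in the short numpy sketch below; this is a generic formulation, not the exact evaluation code used in this work.

import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (1 = nucleus, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)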

Chapter 2

Relevant Theory

This chapter begins with subchapters 2.1 and 2.2, aimed at providing the reader sufficient background knowledge to understand the various technical concepts used in the thesis. Then a review of the related work on non-deep learning as well as deep learning methods for single image super-resolution is given in subchapter 2.3.

2.1 Background knowledge

2.1.1 Definitions

Image Resolution - The term resolution in image processing corresponds to the amount of information contained in an image and can be used to judge the quality of the image and of image acquisition/processing devices. Resolution can be classified into several categories, such as pixel or spatial resolution, spectral resolution, temporal resolution, and radiometric resolution. For this project, we will be dealing with spatial resolution, and the term resolution used henceforth will imply spatial resolution. Spatial resolution is the number of pixels that are used to construct the image and is measured as the number of pixel columns (width) × the number of pixel rows (height), e.g., 800 × 600.


Figure 2.1: As can be seen in the above figures, L has more spatial resolution than R.

Pixels - They are the smallest addressable parts of an image. Each image can be considered as a matrix consisting of several pixel values. Every pixel stores a value proportional to the light intensity at a particular location, and for an 8-bit grayscale image, the pixel can take values from 0 to 255.

Low resolution - A low-resolution image implies that the pixel density of the image is small thereby giving fewer details.

High resolution - A high-resolution image implies that the pixel density of the image is high leading to more details.

Super-resolution - SR is constructing an HR image from a single/multiple LR images.

Super-resolution methods can be divided into two categories based on the number of images involved: a) multiframe super-resolution and b) single image super-resolution.

Multiframe super-resolution - This method utilizes multiple LR images to reconstruct an HR image. These multiple images can come from various cameras at separate locations capturing a scene, or from several pictures of the same scene. These multiple input LR images more or less contain the same information; however, the information of interest is the subpixel shifts that occur due to movement of objects, scene shifts, and motion in imaging systems (e.g., satellites). If the different LR image inputs have different subpixel shifts, then this unique information contained in each LR image can be leveraged to reconstruct a good HR image [13].


Single image super-resolution (SISR) - In SISR, the super-resolving algorithm is applied to only one input image. Since in most cases there is no underlying ground truth, the main challenge is to create an acceptable image. The majority of SISR algorithms employ learning algorithms to hallucinate the missing details of the output HR image, utilizing the relationship between LR and HR images from a training database.

The SR reconstruction problem can be formulated in terms of an observation model [1], as shown in Figure 2.2, which relates the HR image to the input LR images.

Figure 2.2: Observation model between an LR and HR image for a real imaging system. First, by continuous signal sampling, the desired HR image is produced, which is then subjected to translation and rotation, and to blurring due to optics, motion, imaging system motion, etc. Next, LR observation images are obtained by downsampling the blurred image.
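The observation model can be simulated directly: the sketch below, written with numpy and scipy, produces a toy LR observation from an HR image by blurring, downsampling and adding noise. The blur width, scale factor and noise level are illustrative values, not parameters taken from this thesis.

import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_lr(hr, scale=4, blur_sigma=1.5, noise_std=0.01):
    """Toy observation model: blur the HR image, downsample it, add sensor noise.

    hr is a 2-D float array in [0, 1]; scale, blur_sigma and noise_std are
    illustrative parameters.
    """
    blurred = gaussian_filter(hr, sigma=blur_sigma)            # optical/motion blur
    lr = blurred[::scale, ::scale]                             # downsampling
    lr = lr + np.random.normal(0.0, noise_std, lr.shape)       # additive noise
    return np.clip(lr, 0.0, 1.0)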

2.1.2 Strategies to increase image resolution

The resolution of an image can be increased either by increasing the hardware capabilities of imaging devices or by using a software/algorithmic approach.

• Hardware Approach - One direct way to increase the spatial resolution is to increase the number of pixels per unit area by reducing the pixel size through sensor manufacturing techniques [14]. But reducing the pixel size beyond a threshold (which is already reached by current technologies) leads to the generation of shot noise, as less light is available for each of the smaller pixels, degrading the image quality severely. Another way to enhance the spatial resolution is to increase the sensor chip size, which leads to an increase in capacitance; this is not sufficient since increased capacitance adversely affects the charge transfer rate [1]. Also, the high cost of high precision optics and sensors hinders the adoption of these approaches in commercial solutions.

• Software Approach - To avoid the disadvantages of the hardware-based approaches mentioned above, software and algorithmic methods (i.e., SR algorithms) are preferred. Techniques such as image interpolation, restoration and rendering are widely used to enhance spatial resolution. Image interpolation approximates the color and intensity of a pixel based on the neighboring pixels' values but fails to reconstruct the high-frequency details, as noise is introduced in the HR image. Image restoration works by applying deblurring, sharpening and removing sources of corruption such as motion blur, noise, camera misfocus, etc., keeping the size of the input and output images the same. In image rendering, a model of an HR scene with imaging parameters is given, which is used to predict the HR observation of the camera. Image super-resolution is a signal processing technique which uses single/multiple LR images to construct an HR image [15] [16]. Apart from costing less than the hardware-based approaches, SR techniques can be applied to existing imaging systems.

2.1.3 Evaluation metric for Super-Resolution

Peak signal to noise ratio (PSNR) - PSNR is a metric used to measure the quality of a reconstructed/restored image with respect to its reference or ground truth image.

For a given noise-free m × n monochrome image I and its noisy approximation K, the Mean squared error is given by -

MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \quad (2.1)

and the PSNR is given by

PSNR = 10 \log_{10} \left( \frac{MAX_I^2}{MSE} \right) \quad (2.2)

where MAX_I is the maximum possible pixel value of the image.
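Equations 2.1 and 2.2 translate directly into code. A small numpy sketch, assuming 8-bit images so that MAX_I = 255:

import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """PSNR between a ground truth image and its reconstruction (eqs. 2.1 and 2.2)."""
    reference = reference.astype(np.float64)
    distorted = distorted.astype(np.float64)
    mse = np.mean((reference - distorted) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)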

2.2 Neural Networks

A neural network consists of an input layer, an output layer and at least one intermediate layer called the hidden layer. The layers consist of units called neurons, which are connected to neurons of the preceding layer in a directed acyclic graph, i.e., the neuron outputs from the previous layer can become the neuron inputs in the next layer. The most common layer type used in neural networks is the fully connected layer, wherein all the neurons in adjacent layers are pairwise connected with each other and connections between neurons of the same layer are prohibited [17].

Figure 2.3: A graphical representation of a neuron with three inputs (Input1, Input2, Input3), their corresponding weights (Weight1, Weight2, Weight3), activation function and the resulting output.

As shown in Fig 2.3, a neuron computes the weighted sum of its inputs and a bias, to which a linear/nonlinear activation function is then applied. To represent the process formally, for given inputs x_i with their respective weights w_{ij}, a neuron y_j computes the weighted sum along with the bias b_j and applies an activation function f to the whole sum as shown below:

y_j = f\left( \sum_i w_{ij} x_i + b_j \right) \quad (2.3)

The activation function f introduces nonlinearity to the output of neuron y_j. This comes in handy since we want the network to account for the nonlinear patterns in the data, and most real world data has a nonlinear structure. Some of the commonly used activation functions are listed below (a short numerical sketch of equation 2.3 with these activations follows the list) -

• Sigmoid - The sigmoid function is given by

\sigma(z) = \frac{1}{1 + e^{-z}} \quad (2.4)


• Tanh - The tanh function is given by

\tanh(z) = \frac{\sinh z}{\cosh z} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \quad (2.5)

Figure 2.4: Visual representation of the tanh activation function

• ReLU - The ReLU function is given by

ReLU(z) = max(0, z) (2.6)

Figure 2.5: Visual representation of the ReLU activation function
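The numerical sketch promised above: a single neuron implementing equation 2.3 with the three activation functions just listed. The input, weight and bias values are arbitrary example numbers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=relu):
    """Single neuron (eq. 2.3): weighted sum of the inputs plus a bias, then an activation."""
    return activation(np.dot(w, x) + b)

# Example with three inputs, as in Figure 2.3 (values are arbitrary).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
print(neuron(x, w, b=0.2, activation=sigmoid))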


2.2.1 Convolutional Neural Networks

Convolutional neural networks are a category of neural networks that have proven to be remarkably effective in computer vision and classification applications such as object detection, self-driving cars, super-resolution, etc. [3] [4] [5] [6].

A convolutional layer consists of two-dimensional filters/kernels. The idea is to organize the neurons into units whose inputs come from a local neighborhood in the image, which gives rise to these filters. The filters are learned during the training of the algorithm, unlike the custom handcrafted features that are used in conventional machine learning algorithms. This operation is similar to the standard mathematical concept of convolution and is named after it. The learned filters are convolved with the input image, and the resulting feature responses are passed to the next processing layer as input. Neural networks that have such convolutional layers as cascaded stacks are called deep convolutional neural networks [17]. Some well-known architectures are AlexNet, which uses five convolutional layers and achieved the best recognition performance at ILSVRC 2012, and ResNet, a 152-layer deep residual network that won ILSVRC 2015 and consists almost entirely of convolutional layers [3].
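To make the idea of stacked, learned filters concrete, the following PyTorch sketch applies two convolutional layers with 5 x 5 and then 3 x 3 filters, mirroring the layout of Figure 2.6; the channel counts and input size are arbitrary illustrative choices.

import torch
import torch.nn as nn

# Two stacked convolutional layers: the first learns 6 filters of size 5x5,
# the second learns 12 filters of size 3x3 over the resulting feature maps.
layers = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv2d(6, 12, kernel_size=3, padding=1),
    nn.ReLU(),
)

image = torch.randn(1, 1, 64, 64)     # one grayscale 64x64 input image
feature_maps = layers(image)
print(feature_maps.shape)             # torch.Size([1, 12, 64, 64])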

Figure 2.6: Visual representation of a convolutional neural network successively creating two sets of feature maps, first with a filter of size 5 and then with a filter of size 3.

2.2.2 Generative Adversarial Networks

Generative adversarial networks (GANs) are a class of generative models used in unsupervised machine learning, consisting of two networks (the generator and the discriminator) competing against each other in a zero-sum game framework. GANs use a latent code that describes everything that is generated later. GANs are asymptotically consistent, meaning that if one can find the equilibrium point of the game defining a GAN, it is guaranteed that the real distribution generating the data is recovered; given an infinite amount of training data, the correct distribution is eventually recovered [18][19].

To describe the working of the GAN framework, we have two competing models in the sense of game theory: there is a game with defined payoff functions, with each player trying to maximise its payoff.

Within this game, one of the networks is the generator, which is our primary model of interest; it produces samples (generated/fake samples) with the aim of mimicking those from the real training distribution (real samples). The other competing model is the discriminator, which inspects a sample and determines whether it is real or fake. During training, images or any other samples are fed to the discriminator. The discriminator can be any differentiable function (usually a deep neural network) whose parameters can be learned by gradient descent. When the discriminator is applied to samples/images that come from the training set (real samples), its objective is to yield a value close to one, representing a high probability that the input was real rather than fake.

The discriminator is also applied to the samples generated by the generator (fake samples), and the goal of the discriminator in this scenario is to make the output as close to zero as possible, implying that the sample is fake. The generator is a differentiable function (usually a deep neural network) whose parameters can be learned by gradient descent. The generator function is applied to a sampled latent vector z, which is nothing but noise at the start, acting as a source of randomness and helping the generator produce a wide range of outputs. The images produced by the generator are then fed to the discriminator, and the generator tries to make the discriminator output one, fooling it into thinking the generated image is real when it is not. The reader can find more detailed technical information on GANs in [18].

On a higher level, the generator can be viewed as a counterfeiter trying to create fake currency, while the discriminator can be viewed as the police trying to ban fake currency while allowing real currency. As these two adversaries are forced to compete against each other, the counterfeiter must create more and more realistic currency samples, with the ultimate objective of fooling the police into believing that the generated fake currency is real.


Figure 2.7: Visual representation of a generative adversarial network in action.

Technical formulation - As mentioned above, a generative adversarial network consists of two competing networks, the generator and the discriminator, usually differentiable multi-layer networks. The generator network G learns a mapping from a representation (latent) space to the space of the training data. This is done by first defining a prior on the input noise variables p_z(z), then representing a mapping to the data space as G(z; \theta_g), where \theta_g are the parameters of the generator [18][19]. Expressing this more formally,

G : \mathbb{R}^{|z|} \to \mathbb{R}^{|x|}, \quad z \mapsto G(z)

where z \in \mathbb{R}^{|z|} is a sample from the latent space and x \in \mathbb{R}^{|x|} is a sample from the training data.

A second differentiable multi-layer network, the discriminator network D(x; \theta_d), is defined as a function that returns a scalar probability, mapping the image data to a probability and effectively telling whether the image comes from the real distribution (training images) or the fake distribution (images generated by the generator) [18][19]. Expressing this more formally,


D : \mathbb{R}^{|x|} \to (0, 1), \quad x \mapsto D(x)

where D(x) is the probability that x comes from the true data distribution p_x instead of the generator distribution p_g, and \theta_d are the parameters of the discriminator D.

The generator G is trained to minimize \log(1 - D(G(z))) in order to find parameters which confuse the discriminator the most, and the discriminator D is trained to maximize the probability assigned to the training examples and the generated samples from G [18].

Training proceeds by optimizing the value function V(G, D) [18]:

\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (2.7)

The training is done in an alternating fashion, with the parameters of one model updated while the parameters of the other model are fixed. The training process is described in detail below in Algorithm 1 [18], and for a fixed generator G there is an optimal discriminator D^* such that

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \quad (2.8)


Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial networks. The number of steps to apply to the discriminator, k, is a hyperparameter.

1: for number of training iterations do
2:   for k steps do
3:     Sample a minibatch of m noise samples {z^{(1)}, ..., z^{(m)}} from the noise prior p_g(z).
4:     Sample a minibatch of m examples {x^{(1)}, ..., x^{(m)}} from the data generating distribution p_{data}(x).
5:     Update the discriminator by ascending its stochastic gradient:
       \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]
6:   end for
7:   Sample a minibatch of m noise samples {z^{(1)}, ..., z^{(m)}} from the noise prior p_g(z).
8:   Update the generator by descending its stochastic gradient:
       \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))
9: end for
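Algorithm 1 maps onto a short training loop. The PyTorch sketch below is a hedged illustration in which the generator G, the discriminator D (assumed to end in a sigmoid so that it outputs a probability), the data loader and the hyperparameters are all placeholders rather than the models used later in this thesis.

import torch

def train_gan(G, D, data_loader, latent_dim=100, k=1, epochs=10, lr=1e-3):
    """Minibatch stochastic gradient descent training of a GAN, following Algorithm 1."""
    opt_d = torch.optim.SGD(D.parameters(), lr=lr)
    opt_g = torch.optim.SGD(G.parameters(), lr=lr)
    eps = 1e-8                                   # numerical stability inside the logs

    for _ in range(epochs):
        for real, _ in data_loader:
            m = real.size(0)
            # k discriminator steps: ascend log D(x) + log(1 - D(G(z)))
            for _ in range(k):
                z = torch.randn(m, latent_dim)
                fake = G(z).detach()             # do not update G in this step
                loss_d = -(torch.log(D(real) + eps).mean()
                           + torch.log(1.0 - D(fake) + eps).mean())
                opt_d.zero_grad()
                loss_d.backward()
                opt_d.step()
            # one generator step: descend log(1 - D(G(z)))
            z = torch.randn(m, latent_dim)
            loss_g = torch.log(1.0 - D(G(z)) + eps).mean()
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()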

Goodfellow et al. [20] show that there exists an optimal generator G when p_g(x) = p_{data}(x), i.e., the optimal discriminator predicts 0.5 for all samples drawn from x and is unable to distinguish between the real and fake samples.

Goodfellow et al. [20] further show the convergence of Algorithm 1 (i.e., p_g converges to p_{data}) under the condition that the generator and discriminator are individually strong enough and the discriminator is permitted to attain its optimum for a given generator G, with p_g then updated so as to improve:

\mathbb{E}_{x \sim p_{data}(x)}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))] \quad (2.9)

2.3 Literature Study

In this section, we review the literature on single image super-resolution reconstruction techniques, covering both traditional methods and recent deep learning methods.

2.3.1 Traditional Single Image super resolution

Traditional single image super-resolution methods are categorized into learning based, reconstruction based, and interpolation based approaches.

Learning Based

This approach usually involves a training step where the relationship between HR images belonging to a specific class, such as face images or fingerprints, and their LR counterparts is learned, and this knowledge is incorporated into the a priori term of the reconstruction. For obvious reasons, the training data set should be good enough (regarding sufficiency and predictability) to generalize to the test set and avoid overfitting. Different learning based single image SR algorithms are discussed below.

Feature pyramids - In this method by Baker and Kanade [21], HR images are downsampled and blurred to produce a Gaussian resolution pyramid, which is then used for the generation of Laplacian and feature pyramids. After training the system, for a given LR test image, the LR image most similar to it is found among all the available pyramids. The work uses a nearest neighbor method for detecting the most similar images/patches. The authors also tried a different approach, arranging the patches/images in a tree structure; in particular, the LR image and its higher resolution counterparts are arranged in a child/parent structure. The relationship between them is learned and used as an a priori term in MAP algorithms [2].

Belief Network - Freeman et al. proposed the use of a belief network such as a Markov network [22]. The LR image and its corresponding HR image are divided into patches. The corresponding patches from the LR and HR images are associated through an observation function, which represents how strongly two patches are related to each other. The neighboring patches in the HR image are assumed to be associated with each other and are represented by a transition function. After training the model, the LR image is reconstructed into an HR image, and the missing details of the HR image are estimated (learned) using a belief propagation algorithm, generating a MAP super-resolved image [2].

Neural Nets - These are similar to belief nets but cover diverse types of neural networks: probabilistic neural networks, integrated recurrent neural networks, multilayer perceptrons, feed-forward neural networks, Hopfield networks, linear associative memories with single and dual associative learning, RBF networks, etc. [2].


Manifold based methods - This technique involves two steps. The goal in the first step is to add a global constraint over the super-resolved image, and this is achieved by integrating the manifold based methods with a MAP method or a Markov based learning method. In the next step, a local constraint is added to the super-resolved image by finding the transformations between the LR and HR residual patches, achieved using methods such as kernel ridge regression, graph embedding, radial basis functions and partial regression. Manifold based methods use multiple nearest-neighbor patches of the LR image, as opposed to most learning based techniques, which use only the single closest LR patch and its corresponding HR patch from the training set [2].

Reconstruction Based

These methods address the aliasing artifacts that might be present in the input LR image and are classified into the following three groups.

Primal Sketches - The priors used by other algorithms apply only to a specific class of image (e.g., faces). This is extended to generic priors, with primal sketches used as the prior. The hallucination algorithm is applied only to primitives (edges, ridges, corners, terminations, etc.) but not to the non-primitive parts of the image, since a prior can be learned for primitives but not for non-primitives [2].

Then, based on the primal sketch prior and using Markov chain inference, the corresponding HR patch for every LR patch is substituted [23]. This step hallucinates the high-frequency counterparts of the primitives. The hallucinated image is then used as the starting point for the IBP algorithm to produce an HR image [2].

Gradient profile - The shape statistics of gradient profiles in natural images are robust against changes in image resolution, which motivates the introduction of a gradient profile prior. Gradient profile based methods learn the similarity between the shape statistics of the low and high-resolution images, and the learned information is used to apply a gradient-based constraint to the reconstruction process [2].

Fields of experts - Fields of experts is a prior for learning the strongly non-Gaussian statistics of natural images. Here, contrastive divergence is usually used to determine a set of filters from a training database [2].

Interpolation Based

Interpolation based approaches utilize sampling theory to approximate the high-resolution image from a low-resolution image. The disadvantage of these methods is the introduction of aliasing artifacts along the edges [2]. Bicubic interpolation is one example of an interpolation based technique, and it will be used in this thesis as a baseline against which to compare the SR images generated by mSRGAN.
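Bicubic upscaling of the kind used as this baseline is a one-line operation in most imaging libraries; the sketch below uses Pillow, and the 4x factor is only an illustrative choice, not necessarily the one used in the experiments.

from PIL import Image

def bicubic_upscale(path, scale=4):
    """Upscale an image with bicubic interpolation."""
    lr = Image.open(path)
    new_size = (lr.width * scale, lr.height * scale)
    return lr.resize(new_size, resample=Image.BICUBIC)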

2.3.2 Deep Learning based Single Image super resolution

Dong et al. [24] were the first to demonstrate that deep learning can be utilized in solving the classical computer vision problem of SR, introducing their deep learning method (SRCNN) to perform super-resolution. They draw their inspiration from the traditional sparse coding (SC) based SR method and establish a relationship between their proposed method and SC. This relationship serves as a guideline in designing their network structure, which is an entirely convolutional neural network that learns an end-to-end mapping between the low and high-resolution images. Their unified framework requires very little pre/post processing beyond the optimization, as opposed to SC, wherein the steps in the pipeline have rarely been optimized or considered in a unified optimization framework. The SRCNN network consists of three layers. Given a low-resolution image which is first upscaled by bicubic interpolation, layer 1 is responsible for extracting overlapping patches from the image and representing each patch as a high dimensional vector comprising a set of feature maps. Each of these high dimensional vectors is further mapped into another high dimensional vector by layer 2. These vectors, which consist of another set of feature maps, conceptually represent the patches of a high-resolution image. The final layer aggregates the previously generated high-resolution patch-wise representations to create the final HR image, which is expected to be close to the ground truth image. The authors show that SRCNN, which has a lightweight structure, demonstrates state-of-the-art restoration quality, achieves fast speed for practical online usage, and can operate on three channels simultaneously, performing better than the previous state of the art methods.
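As described above, SRCNN is a three-layer, fully convolutional network applied to a bicubically upscaled input. A minimal PyTorch sketch is shown below with the filter sizes and channel counts commonly quoted for SRCNN (9-1-5 kernels with 64 and 32 feature maps); treat these numbers as illustrative rather than a reproduction of the original implementation.

import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN-style network: patch extraction, non-linear mapping,
    and reconstruction. The input is the bicubically upscaled LR image."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),   # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                    # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),   # reconstruction
        )

    def forward(self, x):
        return self.net(x)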

Kim et al. [25] propose a very deep convolutional neural network for SR (VDSR), inspired by the VGG network used for ImageNet classification. Their primary motivation stems from typical drawbacks of existing SR methods, especially SRCNN: a) SRCNN relies on the context of small image regions (it has only three layers with a receptive field of 13 × 13), and other methods use even smaller regions; this is unfavorable since the information contained in small patches is not sufficient for detailed recovery, especially for larger scale factors. b) Training in these methods converges too slowly; SRCNN, which uses a learning rate of 10^{-5}, takes several days to converge. c) Most of the existing techniques handle different scale factors independently, so an SRCNN model trained for a scale factor of 3 would not work for a scale factor of, say, 4, and a separate model would have to be trained for it. Keeping these drawbacks in mind, the authors design a deep CNN architecture which 1) utilizes more contextual information spread over extensive image regions using larger receptive fields (41 × 41 vs. the 13 × 13 used by SRCNN), ultimately taking a larger image context into consideration; 2) converges faster due to residual learning and extremely high learning rates (their initial learning rate is 10^4 times higher than SRCNN's); boosting convergence rates can potentially lead to the problem of vanishing/exploding gradients, which is handled by residual learning and gradient clipping, leading to more stable training; and 3) is capable of learning and processing different scale factors without training additional models. The network structure of VDSR consists of 20 cascaded layers (convolutional and nonlinear) having a 41 × 41 receptive field and 3 × 3 filters in each layer. An interpolated low-resolution image is fed through these layers and transformed into an HR image. The network estimates a residual image, and adding the interpolated low-resolution image to the residual gives the desired output. VDSR outperforms every other state of the art method, including SRCNN, by a large margin in terms of accuracy, speed, and visual quality.

Tang et al. [26] proposed a compact hourglass-shaped CNN structure (FSRCNN) for accelerating and producing better results than SRCNN, which can be used in practical scenarios demanding real-time performance (24 fps). They identify two inherent limitations that serve as bottlenecks in the runtime of SRCNN: 1) the original LR image must first be upsampled to the desired HR size by bicubic interpolation to serve as the input; this causes the computational complexity to grow quadratically with the spatial size of the HR image, and for an upscaling factor of n, the computational cost of convolution with the interpolated LR image will be n^2 times that of the original non-interpolated LR image; 2) SRCNN has a costly nonlinear mapping step wherein input image patches are projected onto a high dimensional feature space, which is then followed by another complex mapping to a high dimensional HR space, all at the cost of running time.

To address the first limitation posed by SRCNN, they take the non-interpolated LR image as the input to the network and introduce a deconvolution layer at the end of the network, which is responsible for upsampling the LR image. Due to this, the computational complexity is now proportional to the spatial size of the original non-interpolated LR image, as the mapping is learned directly from the (non-interpolated) LR image to an HR image. For the second limitation, they add a shrinking and an expanding layer at the beginning and end of the mapping layer, respectively, to restrict the mapping to a low dimensional feature space, leading to their model using smaller filter sizes and thus saving computational cost. Their experiments showed that FSRCNN achieves a speed-up of more than 40x while achieving even better quality than SRCNN. The authors also present a small FSRCNN network that produces image restoration quality similar to SRCNN but is 17x faster and can run in real-time applications on a generic CPU.

Kim et al. [27] propose a deeply recursive convolutional network (DRCN) to perform image super-resolution. Their network has up to 16 very deep recursive layers, and they hypothesize that increasing the recursion depth can improve the reconstruction performance without introducing new parameters for the additional convolutions. In previous approaches such as SRCNN, increasing network depth can present problems such as overfitting and the model becoming too big to be stored and retrieved. They solve these issues by proposing DRCN, which repeatedly applies the same convolutional layer as many times as required (the network efficiently reuses the weight parameters while exploiting a large image context), ensuring that no additional parameters are introduced while more recursions are performed. DRCN optimized with stochastic gradient descent (SGD) does not converge efficiently because of vanishing/exploding gradients; to make the model converge more quickly, they introduce recursion supervision and skip connections. In recursion supervision, feature maps originating after each recursion are used to reconstruct the corresponding desired HR image, and all the different predictions from each of the recursion layers are aggregated to generate a more accurate final HR image prediction. Using skip connections, the authors explicitly connect the input to the output layers for image reconstruction. This is particularly helpful since in SR an input LR image is highly correlated with the output HR image, and an exact copy of the input image is likely to be diminished during the feedforward passes. DRCN outperforms existing state of the art methods by a large margin on benchmark datasets.

Shi et al. [28] proposed a sub-pixel convolutional neural network which can perform real-time SR of 1080p videos on a single K2 GPU. They do this by introducing an efficient sub-pixel convolution layer at the end of the network, which learns upscaling filters that turn the final LR feature maps produced by the network into an HR output image. By doing this, they eliminate the need for upscaling the input LR image by bicubic interpolation in the first step of the SR pipeline, ultimately reducing the computational complexity of the overall SR algorithm. They evaluate their proposed approach on publicly available images and videos and show that it performs significantly better and is an order of magnitude faster than existing CNN based SR methods.
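The sub-pixel layer rearranges feature-map channels into spatial positions, an operation exposed in PyTorch as nn.PixelShuffle. The sketch below shows only the final upscaling stage; the channel counts and scale factor are illustrative, not the configuration of Shi et al.

import torch
import torch.nn as nn

scale = 4
# The last convolution produces scale**2 feature maps per output channel;
# PixelShuffle rearranges them into an image that is `scale` times larger.
upscale_head = nn.Sequential(
    nn.Conv2d(64, 1 * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),
)

lr_features = torch.randn(1, 64, 32, 32)    # feature maps at LR resolution
sr_image = upscale_head(lr_features)
print(sr_image.shape)                       # torch.Size([1, 1, 128, 128])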

Johnson et al. [29] propose the use of perceptual loss functions for image transformation problems such as style transfer and SR, where an input image is transformed into an output image. Recent methods such as CNNs typically use a per-pixel loss between the output and ground truth images. In other recent work, high-quality images are produced by optimizing perceptual loss functions based on high-level features extracted from pre-trained networks. The authors combine the benefits of both approaches and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks; finally, they experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gave them visually pleasing results.

Ledig et al. [8] present a generative adversarial network, SRGAN, which is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. They propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes their solution towards the natural image manifold, while the content loss is motivated by perceptual similarity instead of similarity in pixel space. Their deep residual network can recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score test showed vast gains in perceptual quality. They also report state of the art PSNR values on benchmark SR datasets.

Chapter 3

Motivation

This chapter discusses in detail the motivation behind proposing a GAN utilizing a perceptual loss for high content screening (HCS) microscopy images, namely:

1. Common problems in HCS that cause acquired images to contain artifacts/noise (acquisition errors)

2. Inefficiency of the traditional pixel wise mean squared error (MSE)

3. Feature transferability issues in CNNs

3.1 HCS microscopy problems in image acquisition

Apart from suffering from the usual challenges in image acquisition mentioned in section 1.3, microscopy images are prone to a host of domain specific challenges which might further degrade the quality of the acquired images. In this chapter we attempt to highlight these common problems faced while acquiring microscopy images, which create a need for denoising and super-resolution approaches.


Figure 3.1: An overview of common problems encountered while acquiring microscopy images. Image Source [30]

3.1.1 Photo bleaching

Photobleaching (also called fading) occurs when, due to extended illumination periods, fluorophores suffer a diminished excitation response and lose their ability to fluoresce, caused by photon-induced chemical damage and covalent modification. In simple terms, when we say a fluorophore is photobleached, it means that it has lost its ability to fluoresce, i.e., to absorb and emit light. During the photobleaching process, the imaged sample gradually loses the amount of fluorescence observed, ultimately leading to a loss of image quality. Loss of fluorescence caused by photobleaching is crucial to take into account while performing image quantification studies, as it can alter the quantitative data and lead to false and misleading results.


Figure 3.2: Photo bleaching over time (seconds) in quantum dot labels (shown in red) and organic dye molecules (shown in green). Image Source [31]

Figure 3.3: Images captured (a-f) at 2 minute intervals for multiply stained specimens. Image Source [32]


3.1.2 Bleed through / Crosstalk

Bleed through/crosstalk artifacts appear when two or more fluorescent markers are excited simultaneously, and the channel of interest displays fluorescence from the neighboring channel.

Figure 3.4: In a sample containing two non-overlapping objects dyed red and green, as the crosstalk factor increases, the red object looks increasingly yellowish, since its signal is recorded in the green channel in addition to the red channel. Image Source [33]

Figure 3.5: Another instance of crosstalk where two distinct fluorophores appear in the same channel. A fluorophore observed in the TRITC filter is also observed in the FITC filter. Image Source [34]


3.1.3 Phototoxicity

In live cell imaging, overexposing the cells to light (of both low and high wavelengths) for a prolonged time eventually ends up damaging them, causing phototoxicity. One of the reasons for phototoxicity is that most cells used in a typical imaging experiment are not used to the sheer number of photons aimed at them. Fig 3.6 illustrates phototoxicity.

Figure 3.6: The cells at the top show disastrous protrusion of the plasma membrane of a cell (also known as blebbing) indicating phototoxicity, while the neighboring cells are relatively healthier. Image Source [35]


3.1.4 Uneven illumination

There are instances when a sample is not evenly illuminated by light across the field of view, giving rise to uneven illumination in the image, with darker, unclear regions and more brightly illuminated areas occurring together.

Figure 3.7: In the above figure, cells are stained with nucleic acid dye and uneven illumination is observed. Image Source [36]


3.1.5 Color and contrast errors

Color errors occur for many reasons, such as color degradation from autofluorescence, improper filtration, etc. Contrast errors occur due to misconfiguration of the optical train or the use of wrong filter combinations.

Figure 3.8: Visual demonstration of color error across slides. Image Source [30]

Figure 3.9: Visual demonstration of contrast error across slides. Image Source [30]


3.2 Inefficiency of pixel wise MSE

L2 loss, or mean squared error, is widely used in machine learning applications such as regression, pattern recognition, signal processing and image processing, and is the de facto error metric for measuring the pixel-wise distance between generated and ground truth images. MSE is used in a wide variety of image applications such as super-resolution, segmentation, colorization, depth and surface normal prediction, etc. Several factors make it a popular choice, such as its convexity, symmetry, differentiability (favorable for optimization problems), simplicity (parameter free and inexpensive to compute), and its additivity for independent sources of distortion [37]. Another catalyst for this widespread adoption is that standard software packages such as Caffe, Tensorflow, Keras, etc. make it easy to use MSE but not many other loss functions for regression, discouraging practitioners from experimenting with different loss functions. More detailed advantages of MSE can be found in [37].

However, M.S.E has a lot of flaws for generating images, and images produced by optimizing M.S.E do not correlate well with the image quality perceived by a human observer. One of the reasons for this is the assumptions made while using M.S.E, such as that the impact of noise does not depend on the local characteristics of an image, and that the noise follows a Gaussian model, which is not the case in many settings. Contrary to these assumptions, the sensitivity of the human visual system (HVS) to noise depends on local luminance, contrast, and structure. M.S.E overly penalizes larger errors while being more forgiving of small errors, ignoring the underlying structure of the image. M.S.E also tends to have more local minima, which makes it challenging to converge towards a good local minimum. Consequently, the most common metric used to quantitatively measure image quality, psnr, corresponds poorly to a human's perception of image quality. As can be observed in equation 3.1 below, M.S.E and psnr share an inverse relationship, with minimizing M.S.E leading to a high psnr. Thus, psnr by itself cannot be an indicator of how well an image looks perceptually, and there is a need to adopt other loss metrics that capture the intricate details affecting the HVS.

\mathrm{psnr} = 10 \log_{10} \frac{L^2}{\mathrm{MSE}} \qquad (3.1)
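As a sanity check, the relationship in equation 3.1 can be computed directly. Below is a minimal sketch in Python (NumPy only), assuming 8-bit images so that the maximum pixel value L is 255; this is an illustrative helper, not the thesis implementation.

```python
import numpy as np

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio between a ground-truth HR image and an SR estimate."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```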

Recently, it has been shown that visually good looking, high-quality images are generated by perceptual loss optimization, where differences between feature representations from pre-trained convolutional neural networks are minimized instead of pixel differences. This approach has been applied to invert feature representations, to visualize image features learned by a deep CNN, and to perform style transfer between content and style images. More recently, a perceptual loss has been used for super-resolution by Ledig et al. in SRGAN [8] with great success in generating

photorealistic SR images.

Figure 3.10: SR image generated by optimizing M.S.E (left), achieving state of the art psnr results, and SR image generated by perceptual optimization (right). The visual differences are apparent, with the perceptually optimized image (right) appearing sharper and more realistic despite having a lower psnr value. Image source [38].

We hypothesize that using perceptual optimization, as opposed to pixel-wise optimization alone, for generating microscopic SR images has the potential to make the generated images look visually pleasing and closer to the ground truth HR images. Despite its shortcomings, M.S.E can still be an asset in accounting for pixel-wise changes in the images which a perceptual loss function might otherwise miss, and we will not completely discard M.S.E as the authors of SRGAN do. Hence, we will use a weighted combination of the pixel-wise M.S.E with the perceptual loss to combine the benefits of both approaches, and we believe the generated images will look visually pleasing (something not possible with M.S.E alone) and have a respectable psnr (something not likely when using the perceptual loss alone).


3.3 Feature transferability issues in CNNs for distant source and target domains

There is compelling evidence that for performing visual recognition tasks such as object detection/classification, deep convolutional neural networks are the most powerful way to learn feature representations, as their deep architecture makes it possible to extract several critical distinguishing features at multiple layers of abstraction. Azizpour et al. [39] show that the features obtained from training a deep convolutional neural network should be the first choice in visual recognition tasks. Deep networks trained on a large labeled dataset such as Imagenet yield the best results by a substantial margin by learning useful generic image feature representations. Apart from learning the representations, one essential aspect of CNNs is the "transferability" of these representations, which can be used off the shelf for solving many visual recognition tasks with remarkable performance. This transferability, however, is influenced by several factors, one of them being the distance between the source and target tasks. Bengio et al. [40] present evidence that the transferability of features is inversely proportional to the distance between the source and target tasks, i.e., transferability of features decreases as the distance between source and target tasks increases. Extensive studies conducted by Azizpour et al. [41] further cement the fact that there is, in fact, an inverse relation between the performance achieved at target tasks and their respective distance from the source task.

This transferability is one of the prime reasons that the SRGAN model proposed by Ledig et al. [8] generates impressive photo-realistic upscaled images. They incorporate a content loss that minimizes differences between feature representations in the different layers of a VGG19 model pre-trained on Imagenet data, and evaluate the model on widely benchmarked SR datasets such as Set5, Set14 and BSD100, which are visually very similar to the labeled images in ImageNet.

Even though SRGAN beats every other architecture and achieves state of the art results, it might be ill-suited to apply directly to the domain of microscopic images, as the authors themselves acknowledge. Considering the problems of transferability of features between distant domains mentioned above, we foresee two significant issues acting as a hindrance in directly applying SRGAN to high content screening microscopy images -

1. There is a vast difference between high content screening microscopy images and the Imagenet images. Owing to this distance in the domains, an SRGAN model trained on Imagenet might not yield the ideal results for the content loss, as it won't be able to leverage the feature representations stored in the VGG19 layers for the target task of upscaling high content screening microscopy images. One of the guidelines for learning the representations between distant source and target tasks is to train the network from scratch using target data [39].


2. SRGAN is trained on Imagenet images to optimize the visual appearance of images. The primary goal of the algorithm is to improve the perceptual quality of the SR images, so in a microscopy/medical setting a damaged cell in the image could be reconstructed as a healthy cell just because it looks nice perceptually. This can be a catastrophic situation in medical settings. Also, the VGG network the authors use in their paper is not trained on images belonging to the microscopic domain, hence the SRGAN algorithm might suffer from convergence issues and fail to generalize appropriately for microscopic images.

Drawing inspiration from the original SRGAN architecture [8], and to mitigate the challenges mentioned earlier, we propose a newer, extended version of SRGAN specialized for super-resolving microscopic images. In this proposed extended architecture, we train a mini VGG19 network from scratch within the original SRGAN framework, for the sole task of recognizing and classifying microscopic images, and subsequently use the learned feature representations of these images for minimizing the content loss. We hypothesize that this modified architecture will result in better super-resolved images than using the plain SRGAN architecture alone, as it can leverage the feature representations of microscopic images for perceptual optimization. We also train the generator and discriminator of the network only on microscopic images, leading to faster convergence and plausible reconstruction of images which represent the true distribution of microscopic images.

Summing up the motivations as hypotheses -

1. SRGAN will not perform optimally for performing SR on microscopic images compared to mSRGAN since it utilizes feature representations from a domain (natural images) distant from the target domain (microscopy images).

2. Pixel-wise M.S.E will be inefficient in generating photorealistic microscopic images compared to the perceptual loss.

Chapter 4

Methods

4.1 Research Methods

We describe the research methods in this thesis formally using the portal provided by Hakånsson [42] as the reference.

• Quantitative and Qualitative methods - Quantitative methods are concerned with quantitatively measuring variables and accepting/rejecting hypotheses by means of rigorous experimentation. Qualitative methods make use of deciphering opinions, behaviours, etc. to formulate a hypothesis that is not explicitly measurable via quantitative metrics [42].

This thesis uses the triangulation method, which is a combination of both quantitative and qualitative methods, although we lean heavily towards quantitative methods to draw inferences about the generated image quality. Quantitative research methods make use of experiments and testing of specific variables to validate/invalidate a hypothesis which can be measured using quantifications and not vague terms. We propose and evaluate a measurable hypothesis that our proposed model, mSRGAN, leads to better super-resolved images compared to Bicubic interpolation and SRGAN with respect to a metric, i.e., the peak signal to noise ratio.

Another measurable hypothesis proposed and evaluated is that super-resolved images by mSRGAN lead to better nuclei segmentation compared to super-resolved images by bicubic interpolation. The hypothesis is validated/invalidated by measuring and comparing the dice coefficients, and the resulting increase/decrease in performance is statistically validated using paired t-tests.

Using a purely quantitative measure (such as psnr) is not entirely indicative of the image quality (as described in section 3.2) and can sometimes be misleading. Also, the hypothesis that super-resolved images generated by using a content loss instead of pixel-wise M.S.E are visually more pleasing cannot


be measured quantitatively. Hence, a qualitative approach is employed while comparing the image quality of super-resolved images produced by mSRGAN, SRGAN and Bicubic interpolation (section 6.2).

• Philosophical Assumptions - Philosophical assumptions act as the foundation and genesis of the whole research process, effectively guiding a researcher towards the assumptions one makes about the research and the research methods to employ, giving a point of view for the project [42].

This thesis takes the positivism philosophical approach, as it works well for examining performance within ICT and in projects of an experimental/testing nature. We examine and compare the performance of the mSRGAN, SRGAN and bicubic models for super-resolution and nuclei segmentation by quantitatively measuring variables (psnr and dice coefficient), drawing inferences about the population (the domain of microscopic images) from the samples (CYTO 2017 Image analysis challenge) and finally accepting/rejecting the hypotheses.

• Research Methods - Research methods present a set of procedures for accomplishing the tasks that a research project might entail. They provide a framework for carrying out the research with respect to initiating, conducting and finishing specific research tasks [42].

The applied research method is used in this work since this project builds upon the existing research carried out in super-resolution, data from the real world is used directly in the form of microscopic images which are an outcome of drug screening experiments, and a practical application (nuclei segmentation) is explored.

• Research Approaches - Research approaches provide the processes for con- ducting research which later on helps in concluding the study by proving what is true or false [42].

We follow the deductive research approach to verify/falsify the hypothesis that our proposed model (mSRGAN) performs better than the state of the art methods and is useful in other applications. These hypotheses are validated quantitatively by measuring the psnr metric and conducting statistical significance tests on the dice coefficients.

• Research Design - Research design provides a direction for performing the research, which consists of organizing, planning, designing and conducting the research [42].

An experimental research strategy is used in this thesis, despite the lack of extensive


training data. Measures such as the peak signal to noise ratio and the dice coefficient are used for evaluating the hypotheses.

• Data Collection - Data collection methods are used for collecting data for the research, such as experiments, questionnaires, case studies, etc. [42].

The data is collected using the experiments method, and we use the microscopic images provided by the CYTO 2017 competition belonging to the Human Protein Atlas.

• Data Analysis - After the data for the research has been collected, data analysis methods are used for analysing this data, helping in the process of inspecting, cleaning, transforming and modelling data, aiding decision making and arriving at conclusions [42].

The data analysis strategy used is Statistics, as we calculate the results for a sample (psnr and dice coefficient) and examine the significance of the results obtained using a paired t-test.

• Quality Assurance - The data analysis results have to be validated and verified, which is precisely what the quality assurance method helps with [42].

We employ stringent quality assurance measures to make sure that the software packages and frameworks were used correctly and without bugs which might have influenced the outcome of the experiments. The Python programming language (version 3.6) was used to code the entire project, and the frameworks with their packages are listed in table 4.1. To ensure the reproducibility of our experiments, we make the whole implementation available to the public on GitHub 1.

Table 4.1: Software packages used in the thesis

Library      Version
Tensorflow   1.0.0
Keras        2.0.2
OpenCV       3.3.0

4.2 Mathematical formulation

Given a low-resolution image ($I^{LR}$), the primary objective in single image SR is to approximate a super-resolved image ($I^{SR}$) which should be as close as possible

1https://github.com/Saurabh23/mSRGAN-A-GAN-for-single-image-super-resolution-on-high- content-screening-microscopy-images.

to the corresponding ground truth high-resolution image ($I^{HR}$). Specifically, our generative adversarial network takes the low resolution image x as input and predicts its corresponding high-resolution image y. The goal of the generator is to learn the function G in y' = G(x), where y' is the estimate of the ground truth high-resolution image y.

Let w × h × c be the width, height, and number of channels of the low-resolution image. If r is the scaling factor, the respective width, height, and number of channels of the SR image ($I^{SR}$) are given by r·w, r·h, and c. The LR image is obtained by downscaling the ground truth high-resolution image ($I^{HR}$) by the factor r.

The generator network is trained as a feedforward convolutional neural network $G_{\theta_G}$, parametrized by $\theta_G = \{W_{1:L}; b_{1:L}\}$, which denotes the weights and biases of an L-layer deep network, and optimized using a custom loss function designed for super-resolution. For a set of HR training images

$I^{HR}_n$, n = 1, ..., N with corresponding $I^{LR}_n$, n = 1, ..., N, we solve for

\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l^{SR}\left(G_{\theta_G}(I^{LR}_n), I^{HR}_n\right) \qquad (4.1)

In this thesis, we design a custom perceptual loss function $l^{SR}$ by integrating several loss components (using a weighted sum) that individually represent and account for the various desirable characteristics of an estimated SR image. They are discussed in detail in section 4.4.

4.3 Generative Adversarial Network architecture

We present a discriminator network $D_{\theta_D}$ which is optimized alongside the generator network $G_{\theta_G}$ in an alternating fashion, to find the solution of the adversarial min-max problem as described by Goodfellow et al. in their landmark paper [18]:

\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{train}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\right)\right] \qquad (4.2)

The above formulation aims at training G to produce plausible, realistic looking images, thereby deceiving the discriminator D, which is trained to classify images generated by the generator as real or fake. This type of adversarial training ensures that the images produced by the generator are perceptually superior and closer to the real HR images, something which is often lacking in images produced by minimizing the L2 pixel-wise loss, which, even though it offers a high psnr, usually yields images that are not visually appealing because they lie far from the subspace, the manifold, of natural images.
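The alternating optimization of equation 4.2 can be sketched as a simple training step. This is only an illustrative loop, not the thesis implementation: `generator`, `discriminator` and `combined` are assumed to be Keras models compiled with a single binary cross-entropy loss, where `combined` stacks the generator and a frozen discriminator and is trained against the generator's loss.

```python
import numpy as np

def train_step(generator, discriminator, combined, lr_batch, hr_batch):
    """One alternating update of D and G on a batch of LR/HR image pairs."""
    # 1. Update D: real HR images -> label 1, generated SR images -> label 0
    sr_batch = generator.predict(lr_batch)
    d_loss_real = discriminator.train_on_batch(hr_batch, np.ones((len(hr_batch), 1)))
    d_loss_fake = discriminator.train_on_batch(sr_batch, np.zeros((len(sr_batch), 1)))
    # 2. Update G (through `combined`, with D frozen): try to make D label SR images as real
    g_loss = combined.train_on_batch(lr_batch, np.ones((len(lr_batch), 1)))
    return 0.5 * (d_loss_real + d_loss_fake), g_loss
```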


The generator network G consists of B residual blocks, each with the same design layout, as proposed by Gross and Wilber [43]. Each residual block consists of two successive convolutions with 3 × 3 filters and 64 feature maps, each of which is followed by a batch normalization layer and a ReLU activation. The image resolution is increased using two sub-pixel convolution layers [28] at the penultimate and antepenultimate positions in the network. Pooling layers are avoided as they result in a loss of information which can be detrimental to creating SR images.
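A sketch of one such residual block in tf.keras is shown below. It only mirrors the layout described above (two 3 × 3 convolutions with batch normalization and ReLU, plus a skip connection); the exact implementation in the thesis repository may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Residual block: two 3x3 convolutions, each with batch norm and ReLU, plus a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    return layers.Add()([shortcut, y])
```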

The discriminator network consists of eight convolutional layers, with the number of filter kernels increasing from 64 to 512 following the original VGG network architecture [44]. The first convolutional layer uses leaky ReLU as the activation function, whereas convolution layers 2 to 8 are each followed by a batch normalization layer and a leaky ReLU activation. These convolution layers end with 512 feature maps, followed by two dense layers and finally a sigmoid activation function that returns the classification probability.

We use a mini VGG network trained from scratch on the CYTO 2017 microscopic images from the Human Protein Atlas [45] [46] instead of a VGG19 pre-trained on ImageNet. The mini VGG architecture is similar to the original VGG19 proposed by its authors [44]; it consists of 16 convolution layers with 3 × 3 filter kernels and feature maps increasing from 64 to 512. This is followed by two fully connected layers with 4096 neurons each and a final softmax layer which provides the probability of each class. In addition to the original architecture proposed by the VGG19 authors [44], we add a batch normalization layer after each convolution layer, which helps normalize the input given to the next layer. The batch normalization layer is followed by a leaky ReLU activation function instead of ReLU to ensure better gradient flow and avoid sparse gradients. This mini VGG is trained on Human Protein Atlas images to classify 13 types of subcellular protein localization in cell images.
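A minimal sketch of this classifier in tf.keras follows. The block layout (2-2-4-4-4 convolutions, giving 16 convolution layers as in VGG19) and the input size are illustrative assumptions; only the ingredients named above (3 × 3 convolutions, batch normalization, leaky ReLU, two 4096-unit dense layers, 13-way softmax) are taken from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(x, filters, convs):
    """A VGG-style block: `convs` 3x3 convolutions, each with batch norm and leaky ReLU, then pooling."""
    for _ in range(convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    return layers.MaxPooling2D(2)(x)

def build_mini_vgg(input_shape=(96, 96, 3), n_classes=13):
    inp = layers.Input(input_shape)
    x = inp
    for filters, convs in [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)]:  # 16 conv layers total
        x = vgg_block(x, filters, convs)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dense(4096, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```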


Figure 4.1: Architecture of the Generative adversarial network with kernel size (k), number of feature maps (n), stride (s).

4.4 Loss functions

4.4.1 Perceptual Loss The proposed perceptual loss function in this thesis is the loss function optimized to generate perceptually plausible looking and high psnr SR images. Building upon and improving the loss functions used by Ledig et al. [8] and Johnson [29], the suggested perceptual loss forms the crux of our generator network. Conventional deep learning based SR approaches rely on the pixel-wise M.S.E as the loss function to generate SR images. However, such methods often make the generated images stray away from their ground truth natural manifold, resulting in overly smooth images despite a high psnr, as mentioned in chapter 3. The most recent state of the art SR solution completely disregards the pixel-wise M.S.E and instead relies on the weighted sum of the content loss (VGG feature representations) and adversarial loss as the final perceptual loss. However, based on the results by [47] and on empirical findings after carrying out various experiments, we discover that using M.S.E, content loss (as in SRGAN) and adversarial loss concurrently leads to better microscopic SR images, not only in terms of the psnr metric but also in terms of visual quality.

Hence, we propose the following perceptual loss function as the weighted combination of pixel-wise MSE, content loss and adversarial loss (elaborated in more

detail in the next subchapter).

l^{SR} = \alpha \times M.S.E + \beta \times l^{SR}_X + 10^{-5} \times l^{SR}_{Gen} \qquad (4.3)

where,

$l^{SR}$ = Total loss (Perceptual Loss)

M.S.E = pixel wise mean squared error between generated SR and ground truth HR

$l^{SR}_X$ = Content loss (Feature Reconstruction loss from the mini VGG layers)

$l^{SR}_{Gen}$ = Adversarial Loss

α = weight coefficient for M.S.E (also referred to as 'mse_coeff' in the report)

β = weight coefficient for content loss (also referred to as 'cl_coeff' in the report)
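In code, the weighted combination of equation 4.3 reduces to a one-line helper. The sketch below is illustrative only; `mse`, `content_loss` and `adversarial_loss` stand in for the terms of equations 4.4-4.6, and the default coefficients follow the mSRGAN row of table 5.1.

```python
def perceptual_loss(mse, content_loss, adversarial_loss, mse_coeff=0.4, cl_coeff=0.6):
    """Total perceptual loss of equation 4.3: alpha*MSE + beta*content + 1e-5*adversarial."""
    return mse_coeff * mse + cl_coeff * content_loss + 1e-5 * adversarial_loss
```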

4.4.2 Pixel wise Mean squared error The pixel wise Mean squared error is calculated as

l^{SR}_{MSE} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2 \qquad (4.4)

where W, H and r are the width, height and scaling factor respectively.

$G_{\theta_G}(I^{LR})_{x,y}$ is the SR image generated by the GAN and $I^{HR}_{x,y}$ is the ground truth HR image.

M.S.E is the most common optimization loss function used in many deep learning based state of the art solutions for SR. Using this function leads to superior psnr values (equation 3.1) for the generated images, but they lack high-frequency content, resulting in perceptually implausible and overly smooth textures. We multiply equation 4.4 by an M.S.E coefficient (α) to adjust the weight given to M.S.E and add it to the content loss (also multiplied by a content loss coefficient, β) and the adversarial loss specified above to give the total loss.


4.4.3 Content Loss Building upon the ideas of Gatys et al. (style transfer) and Johnson et al. (perceptual loss), and taking inspiration from Ledig et al. (SRGAN), we propose a content loss that is defined on the feature activations of a mini VGG19 trained on microscopic images, to optimize for the perceptual quality of the SR images.

l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2 \qquad (4.5)

where

$\phi_{i,j}$ - feature maps obtained from the j-th convolution before the i-th maxpooling layer in the mini VGG19 network

$W_{i,j}$ and $H_{i,j}$ - dimensions of the respective feature maps in the mini VGG19 network

The VGG loss is the Euclidean distance between the feature representations of the super-resolved image $G_{\theta_G}(I^{LR})$ and the ground truth HR image $I^{HR}$.
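A feature-space loss of this kind can be sketched by slicing the trained classifier at an intermediate layer and comparing activations. In the sketch below, `mini_vgg` is a placeholder for the trained mini VGG19 and the layer name is an assumption; averaging over all feature-map entries differs from equation 4.5 only by a constant factor.

```python
import tensorflow as tf

def make_content_loss(mini_vgg, layer_name="conv5_4"):
    """Build a content loss: squared distance between feature maps of HR and SR images (eq. 4.5)."""
    feat = tf.keras.Model(mini_vgg.input, mini_vgg.get_layer(layer_name).output)
    def content_loss(hr, sr):
        f_hr = feat(hr)
        f_sr = feat(sr)
        return tf.reduce_mean(tf.square(f_hr - f_sr))
    return content_loss
```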

4.4.4 Adversarial Loss The adversarial (generative) component of the loss is added to the weighted sum of the MSE and content losses described above to push the generated SR images towards the natural image manifold. The generator loss is given in the equation below.

l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right) \qquad (4.6)

where,

$D_{\theta_D}(G_{\theta_G}(I^{LR}))$ represents the probability that the reconstructed image $G_{\theta_G}(I^{LR})$ is a real image.
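Equation 4.6 can be written directly as a function of the discriminator's output probabilities on the generated images. The snippet below is a sketch; a small epsilon is added for numerical stability, which is an implementation detail not stated in the equation.

```python
import tensorflow as tf

def adversarial_loss(d_probs_on_sr, eps=1e-8):
    """Generator adversarial loss of equation 4.6: sum of -log D(G(I_LR)) over the batch."""
    return tf.reduce_sum(-tf.math.log(d_probs_on_sr + eps))
```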

4.4.5 Flowchart


Figure 4.2: Flowchart illustrating the working of mSRGAN for 4X super-resolution. G is the generator network trained to fool the discriminator into thinking the upscaled fake HR images, produced from a latent representation z, are real. The features of the generated fake HR image and of the real HR images stored in the mini VGG are compared to optimize the perceptual loss, and together with the adversarial loss the content loss is minimized, leading to updated fake HR images optimized for perceptual quality. The discriminator network D is trained to distinguish between real HR images and fake generated HR images. D and G are optimized in an alternating fashion, with G producing HR images with realistic-looking textures and D getting better at spotting the fake generated HR images.


Table 4.2: Counts of classes in the major13 dataset

No.   Organelle Class           Count
1     Nucleus                   5807
2     Nucleoli                  1264
3     Nuclear Membrane          268
4     Golgi Apparatus           1025
5     Endoplasmic Reticulum     497
6     Vesicles                  1242
7     Plasma Membrane           517
8     Mitochondria              1541
9     Cytosol                   6617
10    Intermediate filaments    144
11    Microtubules              282
12    Centrosome                535
13    Actin Filaments           261

4.5 Data Acquisition

We use the fluorescence microscopy images from the Human Protein atlas database [46] which are a part of The CYTO 2017 image analysis challenge [45]. The image data provided in this challenge were generated by the Cell Atlas part of the Human Protein Atlas database [46].

The images visualize immunostaining of human proteins, and the goal of the challenge was to identify subcellular protein localizations to major organelles. The images were acquired using Leica SP5 confocal microscopes with a 63x/1.2 NA oil objective and Nyquist sampling rate in 4 fluorescence channels [45]. Each field of view consists of 4 images, which are as follows:

DAPI staining of the nucleus - Blue

Antibody based staining of microtubules - Red

Endoplasmic reticulum - Yellow

Protein localizations - Green

The major13 dataset, part of sub-challenge 2 of the CYTO 2017 competition [45], which consists of 20,000 fields of view containing multi-label data for 13 protein localizations, is used for training. Images from hold-out test sets consisting of 1216 fields of view are chosen for performing validation.


Figure 4.3: Visual description of the 13 protein sub cellular localization classes. Image Source [45] .


4.5.1 Data processing The dataset consists of separate .tiff images for individual stains, and we merge these images into a single RGB image with the channels assigned as

R - Antibody-based staining of microtubules
G - Protein localizations
B - DAPI staining of the nucleus

The yellow channel, representing the Endoplasmic reticulum stained images, is discarded. Following the guidelines for microscopic image acquisition by Anne Carpenter [?], we avoid compressing these merged images into lossy image formats such as JPG/JPEG, as they sacrifice image quality which cannot be recovered afterward, even after converting them to a lossless format. Hence, the resulting merged RGB images are converted to .png, which ensures lossless compression, and finally resized to 96x96x3 (width x height x number of channels).

The 96x96x3 images are then downscaled using bicubic interpolation by a factor of 4, and the resulting 24x24x3 images are used as the input to the GAN. Rasmus et al. [48] show that performing data augmentation on images helps increase the accuracy of generated super-resolved images. Hence, we apply data augmentation in the form of normalization, flips, brightness changes, cropping and blur.
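A minimal preprocessing sketch with OpenCV is shown below. The function name, file paths and channel naming are illustrative, not the exact layout of the CYTO 2017 data; only the channel assignment, the 96x96 HR size and the 4X bicubic downscaling come from the text above.

```python
import cv2

def merge_and_downscale(red_path, green_path, blue_path, hr_size=96, scale=4):
    """Merge single-stain tiffs into one RGB image and create the bicubically downscaled LR input."""
    r = cv2.imread(red_path, cv2.IMREAD_GRAYSCALE)    # antibody-based staining of microtubules
    g = cv2.imread(green_path, cv2.IMREAD_GRAYSCALE)  # protein localizations
    b = cv2.imread(blue_path, cv2.IMREAD_GRAYSCALE)   # DAPI staining of the nucleus
    hr = cv2.resize(cv2.merge([b, g, r]), (hr_size, hr_size))  # OpenCV stores channels in BGR order
    cv2.imwrite("merged_hr.png", hr)  # lossless .png, avoiding JPG/JPEG compression artifacts
    lr = cv2.resize(hr, (hr_size // scale, hr_size // scale), interpolation=cv2.INTER_CUBIC)
    return hr, lr
```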

We randomly sample 70% of the images from the major13 dataset for training. For the test set, we use the hold-out sets provided by the CYTO 2017 competition [45], amounting to a total of 1216 images which are distinct from the training images.

Chapter 5

Experiments and Results

We perform the experiments in the following order - First, we implement mSRGAN and investigate different configurations for the content loss to observe the effects of different layers of the mini VGG on image quality and psnr. Then we study the impact of the loss functions on the performance measures (psnr and visual quality). We then implement the original SRGAN model by Ledig et al. [8] and compare it with our model mSRGAN to approve/disprove Hypothesis 1, that mSRGAN is more likely to give better results on microscopic images. All the SR models (mSRGAN, SRGAN, etc.) are compared to bicubic interpolation as the baseline. Finally, we investigate whether it makes sense to use SR images as a precursor to applications such as nuclei segmentation, by means of the Dice coefficient measure, and statistically validate the observed increase/decrease in performance due to SR. All the experiments with a brief description are listed below -

• mSRGAN (VGG all layers) - This is our proposed model, with the content loss defined on all the layers of the mini VGG network. The goal of this experiment is to investigate how mSRGAN performs when it utilizes the feature representations from the lower as well as the higher layers of the network.

• mSRGAN-VGG2: $l^{SR}_{VGG/2.2}$ with $\phi_{2.2}$ - Here, mSRGAN uses a content loss defined on the lower level feature maps of the mini VGG network, and the objective is to recognize the advantages/disadvantages that might arise when leveraging the lower level features stored in the mini VGG.

• mSRGAN-VGG5: $l^{SR}_{VGG/5.4}$ with $\phi_{5.4}$ - Here, mSRGAN uses a content loss defined on the higher level feature maps of the mini VGG network, and the objective is to recognize the advantages/disadvantages that might arise when leveraging the high-level features stored in the mini VGG.

• SRRESNET - This experiment evaluates the performance of the generator of mSRGAN using just the pixel-wise M.S.E as the loss function to generate


SR images. The goal of this experiment is to observe how solely using M.S.E as the optimization function plays a role in influencing the psnr and visual quality of the generated SR images.

• mSRGAN - CL (Just content loss) - This experiment evaluates the performance of mSRGAN using just the content loss as the loss function to generate SR images. The goal of this experiment is to observe how solely using the content loss as the optimization function influences the psnr and visual quality of the generated SR images.

• SRGAN - This experiment implements the SRGAN model trained on Imagenet by Ledig et al. and is used as a comparison with mSRGAN, thereby evaluating hypothesis 1.

• Nuclei Segmentation and Statistical analysis - The goal of this experiment is to evaluate whether SR images reconstructed by mSRGAN can be used in other applications (in this case nuclei segmentation) as a preprocessing step to improve segmentation performance. Here, we compare the dice coefficients of the segmentation masks obtained from the SR image produced by bicubic interpolation and the SR image produced by mSRGAN. If there is an increase/decrease in performance by mSRGAN, this change in performance is statistically validated by a paired t-test to ascertain whether these gains/losses are statistically significant or not.

Table 5.1: Parameters for different Model Configurations

Model          α, β       Learning rate        Optimizer            Loss function   VGG Layer No
mSRGAN         0.4, 0.6   G - 1e-3, D - 1e-4   G (ADAM), D (SGD)    MSE+CL+Adv      All
mSRGAN-VGG2    0.7, 0.7   G - 1e-3, D - 1e-4   G (ADAM), D (SGD)    MSE+CL+Adv      2
mSRGAN-VGG5    0.7, 0.7   G - 1e-3, D - 1e-4   G (ADAM), D (SGD)    MSE+CL+Adv      5
SRRESNET       1, -       G - 1e-3, D - 1e-4   G (ADAM), D (SGD)    MSE             -
mSRGAN (CL)    -, 1       G - 1e-3, D - 1e-4   G (ADAM), D (SGD)    CL+Adv          All
SRGAN          -, 1       G - 1e-4, D - 1e-4   G (ADAM), D (ADAM)   CL+Adv          5

where,

α - Weight coefficient assigned to the mean squared error


β - Weight coefficient assigned to the content loss

G - Generator

D - Discriminator

Adv - Adversarial Loss

5.1 mSRGAN

Model Details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3, and a batch size of 16 is used for training. In this model, we give the content loss a slightly higher weight (0.6) than MSE (0.4). In order to leverage both the lower and higher level feature representations of the mini VGG, we use all the layers to optimize the content loss. The optimizers used for the generator and discriminator are ADAM and stochastic gradient descent, with learning rates of 1e-3 and 1e-4 respectively.

Evaluation - The SR images generated by mSRGAN and bicubic interpolation are shown in fig 5.1 and fig 5.2, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.2. It appears that bicubic interpolation has a slight edge (0.86 dB higher) compared to mSRGAN. One of the reasons for this might be that we assigned a lower weight to MSE (0.4), due to which the GAN optimized more for the content loss rather than psnr. However, if the resulting SR images are inspected visually, the SR images generated by mSRGAN are far better than bicubic. The visual difference is evident if we observe the recreated texture patterns (to the credit of the content loss) in the cells, especially the microtubules and cytoplasm surrounding the nucleus (oval structures in the images).


Figure 5.1: Input low resolution image (top row), SR reconstructed images by Bicubic interpolation (2nd row) mSRGAN (All layers) (3rd row) and the ground truth image (bottom row).


Figure 5.2: Input low resolution image (top row), SR reconstructed images by Bicubic interpolation (2nd row) mSRGAN (All layers) (3rd row) and the ground truth image (bottom row).

Table 5.2: Averaged psnr values for Bicubic interpolated SR image, SR image gen- erated by mSRGAN and Ground truth high resolution image

        Bicubic interpolation   mSRGAN   Ground Truth HR
psnr    27.57                   26.71    ∞


5.2 mSRGAN - VGG2

Model Details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3, and a batch size of 16 is used for training. In this model, we give the content loss and MSE equal and high weights (0.7 each). In order to leverage the lower level feature representations of the mini VGG, we use the second layer of the mini VGG to optimize the content loss. The optimizers used for the generator and discriminator are ADAM and stochastic gradient descent, with learning rates of 1e-3 and 1e-4 respectively.

Evaluation - The SR images generated by mSRGAN and bicubic interpolation are shown in fig 5.3, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.3. Since this time we assigned a higher weight to MSE (0.7), the psnr value of mSRGAN has improved compared to bicubic interpolation (by 0.39 dB). Compared to bicubic, the SR images generated by mSRGAN-VGG2 are superior at recreating contour and texture details. However, it appears that more emphasis is laid by mSRGAN-VGG2 on drawing the exact outlines of the cell objects, and the microtubules/cytoplasm around the nucleus (oval structure in the images) look averaged, smooth and bland.


Figure 5.3: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), mSRGAN-VGG2 (3rd row) and the ground truth image (bottom row)

Table 5.3: Averaged psnr values for Bicubic interpolated SR image, SR image gen- erated by mSRGAN-VGG2 and Ground truth high resolution image

        Bicubic interpolation   mSRGAN-VGG2   Ground Truth HR
psnr    27.40                   27.79         ∞


5.3 mSRGAN - VGG5

Model Details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3 and a batch size of 16 is used for training. In this model, we give the content loss and MSE equal and high weights (0.7 each). In order to leverage the higher level feature representations of the mini VGG, we use the fifth layer of the mini VGG to optimize the content loss. The optimizers used for the generator and discriminator are ADAM and stochastic gradient descent, with learning rates of 1e-3 and 1e-4 respectively.

Evaluation - The SR images generated by mSRGAN and bicubic interpolation are shown in fig 5.4 and fig 5.5, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.4. There is a minute increase in the psnr value for mSRGAN-VGG5 compared to bicubic interpolation (by 0.02 dB), and visually the SR images by mSRGAN-VGG5 look superior. The higher level representations seem to have placed more emphasis on recreating the texture patterns, with less importance given to creating sharp outlines as by mSRGAN-VGG2.


Figure 5.4: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), mSRGAN-VGG5 (3rd row) and the ground truth image (bottom row)


Figure 5.5: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), mSRGAN-VGG5 (3rd row) and the ground truth image (bottom row)


Table 5.4: psnr values for bicubic, mSRGAN-VGG5, Ground truth HR images

        Bicubic interpolation   mSRGAN-VGG5   Ground Truth HR
psnr    27.73                   27.75         ∞

5.4 SRRESNET

Model details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3 and a batch size of 16 is used for training. In this model, we only use the pixel wise MSE for generating the SR images to observe the effects of using MSE alone. The optimizers used for the generator and discriminator are ADAM and Stochastic gradient descent with learning rates of 1e-3 and 1e-4 respectively.

Evaluation - The SR images generated by SRRESNET and bicubic interpolation are shown in fig 5.6 and fig 5.7, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.5. There is a dramatic increase in the psnr values for SRRESNET compared to bicubic interpolation (by 0.91 dB), and the SRRESNET model gives the highest psnr among all the mSRGAN variants. The SR images generated by SRRESNET, however, are not as impressive: they look very smooth compared to the other mSRGAN variants which leverage the content loss.


Figure 5.6: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), SRRESNET (3rd row) and the ground truth image (bottom row)


Figure 5.7: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), SRRESNET (3rd row) and the ground truth image (bottom row)


Table 5.5: psnr values for bicubic, SRRESNET and HR

        Bicubic interpolation   SRRESNET   Ground Truth HR
psnr    26.89                   27.80      ∞

5.5 mSRGAN - CL (Only content loss)

Model details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3 and a batch size of 16 is used for training. In this model, we only use the content loss for generating the SR images, to observe the effects of using the content loss alone. The optimizers used for the generator and discriminator are ADAM and stochastic gradient descent, with learning rates of 1e-3 and 1e-4 respectively.

Evaluation - The SR images generated by mSRGAN-CL and bicubic interpolation are shown in fig 5.8 and fig 5.9, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.6. The psnr value for mSRGAN-CL is slightly lower than bicubic interpolation (by 0.04 dB), but the visual appearance of the SR images generated by mSRGAN-CL is better than the bicubic interpolated SR images. In most of the mSRGAN-CL generated SR images there appears to be a hazy white film over the image, especially in the darker regions. The content loss seems to have ignored the pixel-wise variations and only focused on optimizing the high level object details in the image.


Figure 5.8: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), mSRGAN-CL (3rd row) and the ground truth image (bottom row)


Figure 5.9: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), mSRGAN-CL (3rd row) and the ground truth image (bottom row)


Table 5.6: psnr values for mSRGAN-CL (only content loss), bicubic and ground truth HR image

        Bicubic interpolation   mSRGAN-CL   Ground Truth HR
psnr    27.73                   27.69       ∞


5.6 SRGAN

Model Details - The GAN architecture and loss functions described in section 4.3 are used with 15 Resnet layers. The input image dimensions are 24 x 24 x 3 and a batch size of 16 is used for training. In this model, we only use the content loss for generating the SR images, following the guidelines of the SRGAN authors [8]. The same optimizer, ADAM, is used for both the generator and discriminator, with a learning rate of 1e-4 each.

Evaluation - The SR images generated by SRGAN and bicubic interpolation are shown in fig 5.10 and fig 5.11, along with the input low resolution and ground truth HR images. The average psnr values measured over 50 images are presented in table 5.7. There is a remarkable increase in the psnr of SRGAN compared to bicubic interpolation (by 0.80 dB). SRGAN not only does well psnr-wise, but the visual quality of the SR images it generates is stunning and by far the best among all the experiments we carried out. Pixel level changes are taken into account, the contour lines are nicely drawn and, most importantly, the textures are reconstructed in remarkable detail, with some images close to the ground truth images. A more detailed discussion about these results is carried out in section 6.1.


Figure 5.10: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), SRGAN (3rd row) and the ground truth image (bottom row)

Figure 5.11: Input low resolution image (top row), SR reconstructed images by Bicubic Interpolation (2nd row), SRGAN (3rd row) and the ground truth image (bottom row)

Table 5.7: psnr values for SRGAN (Imagenet), bicubic and ground truth HR image

        Bicubic interpolation   SRGAN   Ground Truth HR
psnr    27.60                   28.40   ∞


5.7 Nuclei Segmentation

We use the open source cell image analysis software Cellprofiler [49] to perform a proof of concept study for nuclei segmentation of generated SR images by bicubic interpolation and mSRGAN. The objective of this study is to evaluate the usefulness of super-resolution for an important application in cell biology (segmentation) and whether super-resolution can be used as an effective pre-processing step for such applications.

Segmentation configuration details - We use the Identify Primary Objects analysis module in Cellprofiler to perform nuclei segmentation. The diameter range of the objects to be detected (nuclei) is set between 10 and 40 pixel units. The thresholding strategy is set to automatic, and object intensity is used to distinguish clumped objects. Objects touching the border of the image or outside the diameter range are discarded. The segmentation masks of the SR images generated by bicubic interpolation, mSRGAN and the ground truth images are obtained (shown as green outlines in fig 5.12). Then, Dice coefficients are calculated by comparing the segmentation masks of the segmented bicubic SR image and the segmented mSRGAN SR image with that of the ground truth image. The Dice coefficient values range from 0 to 1, with 0 being the worst and 1 being the best. The average Dice coefficients calculated over 30 images are presented in table 5.8, and these values are also represented as boxplots in fig 5.13.
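The Dice overlap between two segmentation masks can be computed as in the sketch below. It assumes the masks are binary arrays of identical shape exported from Cellprofiler; handling of two empty masks is an assumption of this illustration.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a = np.asarray(mask_a).astype(bool)
    b = np.asarray(mask_b).astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```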


Figure 5.12: Nuclei segmentation on Bicubic SR, mSRGAN SR and ground truth HR

Table 5.8: Dice coefficient values for BicubicSR - Ground truth segmented images and mSRGAN - Ground truth segmented images

                   Bicubic - Ground truth   mSRGAN - Ground truth
Dice Coefficient   0.57                     0.60


Figure 5.13: Box and whisker plots for BicubicSR - Ground truth segmented dice values and mSRGAN- Ground truth segmentation dice values

As can be seen in table 5.8 and fig 5.13, there is a marked increase in the Dice values for images belonging to mSRGAN vs. bicubic SR images (by 0.03 on average). The top Dice value for mSRGAN SR segmented masks is 0.84 vs. 0.77 for bicubic SR segmented masks. Even the lowest Dice value for mSRGAN SR segmented masks, 0.26, is higher than the 0.24 of the bicubic SR segmented masks. While mSRGAN shows noticeable gains compared to bicubic interpolation, it is worthwhile checking whether these improvements are statistically significant. For this reason, we perform a paired t-test on the Dice values of images belonging to mSRGAN vs. bicubic SR images. The findings are presented in the next section.

5.7.1 Statistical Validation Rationale for the choice of the test - A paired t-test is used when a comparison between two methods of measurement has to be performed, wherein the two methods are applied to the same entity. This perfectly suits our scenario, where two different super-resolution techniques (bicubic interpolation, mSRGAN) are applied to the same low-resolution input image. The conditions for performing a paired t-test are: a] the variables in question should be continuous (the Dice values are continuous), and b] the variables should be approximately normally distributed (

which is shown in fig 5.14 and fig 5.15). All the statistical experiments are carried out in SAS.

Normality testing Since the paired t-test relies on the assumption that the data is normal, we conduct tests for normality on the Dice coefficients (mSRGAN) and the Dice coefficients (bicubic interpolation) at a 0.05 significance level. The null hypothesis (H0) of the normality tests is that the data follows a specified distribution, in our case the normal distribution, and the alternative hypothesis (H1) is that it does not follow the specified distribution.

H0 : F0 = F1

H1 : F0 ≠ F1

Fig 5.14 and fig 5.15 display the results of the normality tests for the bicubic and mSRGAN Dice values respectively.

The Kolmogorov-Smirnov test does not reject the normality assumption for the data, with p-values equal to 0.15 and 0.1094 (>0.05). However, it is a less powerful test when parameters of the distribution must be estimated (in this case the mean and variance of the normal distribution). More appropriate tests are then the Cramér-von Mises and Anderson-Darling tests. Both the Cramér-von Mises and Anderson-Darling tests do not reject normality, with p-values equal to 0.1033 and 0.0998, respectively. Shapiro-Wilk, which is the most powerful test for checking normality, also does not reject normality with a p-value of 0.0960, but since there are ties in the data we do not take the results of the Shapiro-Wilk test into account. Since the normality of the variables is established, we can proceed to finally perform the paired t-test.
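The thesis runs these tests in SAS; an equivalent check in Python with SciPy would look roughly like the sketch below, where `dice_bicubic` and `dice_msrgan` are the paired Dice values for the same images. Note that, as a simplification, the sketch applies the normality check to the paired differences rather than to each variable separately.

```python
from scipy import stats

def validate(dice_bicubic, dice_msrgan, alpha=0.05):
    """Shapiro-Wilk normality check on the paired differences, then a paired t-test on the means."""
    diffs = [m - b for m, b in zip(dice_msrgan, dice_bicubic)]
    w_stat, p_norm = stats.shapiro(diffs)                          # H0: differences are normal
    t_stat, p_paired = stats.ttest_rel(dice_msrgan, dice_bicubic)  # H0: equal means
    return {"normality_p": p_norm,
            "paired_t_p": p_paired,
            "significant": p_paired < alpha}
```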

Figure 5.14: Normality testing results for Bicubic Dice values


Figure 5.15: Normality testing results for mSRGAN Dice values

Results of the paired t-test The null hypothesis of the paired t-test is that the difference between the paired variables equals zero, i.e., that the means are equal. The significance level is set at 0.05, and the results of the paired t-test are presented in fig 5.16.

H0 : m1 = m2
H1 : m1 ≠ m2

Figure 5.16: Results of the paired t test

Since the p-value (0.001) is significantly smaller than 0.05, we rule in favour of the alternative hypothesis that there is indeed a difference between the means of the Dice values obtained from bicubic interpolation and mSRGAN. Hence, it can be said that the gains observed in table 5.8 and fig 5.13 are significant in nature. Therefore, we

conclude from this proof of concept study that mSRGAN does indeed make a positive difference when it comes to increasing nuclei segmentation performance, and that super-resolved images from mSRGAN can be used as an effective pre-processing step for such applications.

Chapter 6

Discussion

6.1 mSRGAN vs SRGAN (Evaluating Hypothesis 1)

Revisiting hypothesis 1, we said: "SRGAN will not perform optimally for performing SR on microscopic images compared to mSRGAN since it utilizes feature representations from a domain (natural images) distant from the target domain (microscopy images)." We proposed the mSRGAN model under the premise that the SRGAN model by Ledig et al. [8], trained on and utilizing feature representations from the Imagenet database, would not perform optimally for a distinct target domain, i.e., high content screening microscopy images. This premise, however, is disproven in this study, as can be seen from the psnr values in tables 5.5 and 5.7 and figs 5.1, 5.2, 5.10 and 5.11, which are slightly better than the best value of mSRGAN (28.40 > 27.80). Though SRGAN has a slight edge over mSRGAN, we believe this is not significant, and mSRGAN has competitive results concerning psnr and visual quality.

There are a few factors at play which we believe played a role in inhibiting the performance of mSRGAN.

Lack of training data - One of the most dominant reasons can be the scarcity of the training data used for training mSRGAN and for learning the feature representations it leverages. The original SRGAN model uses 350K randomly selected images from the Imagenet database, which in itself carries the most comprehensive and diverse coverage of natural images, consisting of millions of images spanning more than a thousand categories. This enormous database clearly gives SRGAN an advantage in terms of learning feature representations both in breadth and depth. This is not the case for our proposed model mSRGAN, which uses only 16K labeled microscopic images belonging to 13 classes (for training both the mini VGG19 and the GAN), as opposed to 350K images used in training the VGG19 and another 350K for training SRGAN. This disparity of nearly 40X can be a severely debilitating factor for mSRGAN's performance. This is discussed in more detail in the discussion chapter.


Instability of GANs - Another reason for mSRGAN underperforming can be the notorious convergence issues of GANs that we faced while performing our experiments (described in detail in section 6.4). With a multitude of parameters under the hood controlling adversarial training, achieving a perfect equilibrium between the generator and the discriminator remains a challenging task. If the discriminator is not powerful enough, then the SR images produced by the generator are passed off as being real even if they are far from the real HR image. Conversely, if the discriminator becomes too powerful compared to the generator, then no generated SR image is passed off as being real, no matter how close it is to the real HR image. Also, by the end of adversarial training, the losses for the generator and discriminator should both converge, something we achieved only after numerous experiments. The sheer amount of time and resources required to run each experiment prevented us from optimizing every experiment we did. Some of the obstacles we encountered while training mSRGAN and its variants (VGG2, VGG5, VGGALL, SRRESNET, etc.) are described in more detail in section 6.4, and we believe these problems might have played a role in preventing mSRGAN from realizing the full potential of adversarial training for generating photorealistic images in the natural manifold of microscopic images.

Content Loss + M.S.E - Right off the bat, recognizing the individual significance of the perceptual loss and the pixel-wise M.S.E, we use both these losses in conjunction, assigning them weights α and β respectively to control the emphasis given to each one. By doing this, we strive to achieve a balance between optimizing the signal to noise ratio and the perceptual quality of the generated SR images. Solely using the pixel-wise M.S.E gives a high psnr but leads to overly smooth, averaged solutions (as discussed in the motivation and results), and exclusively using the content loss leads to visually pleasing results but suffers severely in terms of psnr.

With the introduction of the weight parameters (α and β) comes the problem of finding the optimal combination of weights between M.S.E and the content loss. While we experimented with different combinations for mSRGAN and its variants, we could not perform a comprehensive study due to computational and time limitations. We believe further experimentation with and optimization of these weights hold the key to achieving more visually pleasing SR images with high psnr for mSRGAN.


6.2 Content loss vs M.S.E (Evaluating hypothesis 2)

Revisiting Hypothesis 2: 'Pixel-wise M.S.E will be inefficient in generating photo-realistic microscopic images compared to the perceptual loss.' The content loss and the pixel-wise M.S.E each have their advantages, as described in the motivation chapter. However, we discover that entirely discarding M.S.E, as the SRGAN authors do, is not ideal for the use case of microscopic images, and the best results, quantitatively and visually, are achieved when both the content loss and M.S.E are used in conjunction. To obtain a balance and adjust the importance given to each loss, we set weights α and β for M.S.E and the content loss respectively, which can be modified according to the use case.

We notice that the downside of using M.S.E alone is that the resulting SR images tend to be overly smooth and visually far from the ground truth regarding textures and colors. This can be because pixel-wise M.S.E simply averages all the possible solutions, validating hypothesis 2. One advantage of using M.S.E is that the generated SR images have the highest psnr, as demonstrated in table 5.5.

Using the content loss results in visually more pleasing and believable SR images, with contours and texture patterns very close to the ground truth HR images, supporting hypothesis 2. However, the content loss tends to ignore changes in the pixel space, and it is observed that using the content loss alone leads to a hazy white film over the generated SR images. Using the content loss alone also leads to checkerboard artifacts in several places in the image, more pronounced in regions having bright colors. This can be disadvantageous and can serve as a hindrance to achieving good results in segmentation applications.

It is noticed that adopting the content loss and M.S.E in unison ensures that changes in the pixel space are taken into account, leading to the removal of the hazy white film and considerably reducing checkerboard artifacts. This combination builds on the strengths of both the content loss and M.S.E, ensuring that the images are perceptually similar to real HR images and have a high psnr, taking into account both the spatial and pixel components.


6.2.1 Effect of Different VGG layers To explore the content loss, we trained different variants of mSRGAN using feature activations from different layers of the mini VGG19 network, to see how these individual layers impact the generated SR images both in terms of psnr and visually.

We observe that SR images generated via the lower layers are superior when it comes to drawing the contour lines of cells, giving sharp outlines (see nuclei and microtubules in fig 5.3), and nearly reproduce the exact pixel values of the ground truth HR images. However, in this case less emphasis is paid to the texture patterns, making the generated colors look more bland, averaged and smooth.

As we move to the later layers of the mini VGG19, e.g., VGG5, we find that the generated SR images place more emphasis on the texture details rather than focusing explicitly on the contour lines. This behavior can be attributed to the higher layers of the mini VGG19 focusing more on the higher level details in the image, such as the objects and their spatial arrangement, and paying less attention to reconstructing the exact pixel-level details.


6.2.2 PSNR variation for mSRGAN variants The psnr values remain reasonably constant in the 26.7 - 27.8 range across the different variants of mSRGAN, and we believe the primary reason for this is our use of the content loss and M.S.E in conjunction. When M.S.E is used in isolation (SRRESNET), the highest gains in psnr are achieved, validating our assumption from the motivation section that optimizing for M.S.E alone gives a high psnr.

Figure 6.1: psnr values for different model combinations

However, we further validate the fact that achieving a high psnr does not amount to a visually pleasing image resembling the ground truth, as can be seen in fig 5.7. The best performing mSRGAN model in terms of visual quality has by far the lowest psnr (likely an effect of the weights α = 0.4, β = 0.6), but generates less smooth and more realistic looking SR images.


6.3 Segmentation results

There is a marked increase in the Dice values for segmented SR images produced by mSRGAN compared to segmented SR images produced by Bicubic interpolation, indicating better segmentation results (fig 5.13). We also show via a paired t-test that this increase is statistically significant (fig 6.16), further confirming the superiority of the SR images generated by mSRGAN.
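The comparison boils down to computing one Dice score per image for each upscaling method and running a paired t-test over the two sets of scores. A minimal sketch follows; the numeric values are placeholders, not results from the thesis.

```python
import numpy as np
from scipy import stats

def dice(mask_a, mask_b):
    """Dice coefficient between two binary segmentation masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# One Dice score per test image for each method, measured against the
# ground-truth segmentation (made-up example values).
dice_msrgan = np.array([0.82, 0.79, 0.85, 0.80])
dice_bicubic = np.array([0.78, 0.77, 0.81, 0.78])

t_stat, p_value = stats.ttest_rel(dice_msrgan, dice_bicubic)  # paired t-test
```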

The difference between the SR images generated by these two algorithms is apparent when we visually compare the raw SR images from Bicubic interpolation and mSRGAN (fig 6.12). The raw SR image from mSRGAN does a better job of creating sharp outlines of objects in the image (especially the brightly lit nucleus), which gives it an edge when generating segmentation masks for primary object identification, as opposed to the SR images from bicubic interpolation, where objects are not distinguishable.

For this proof of concept study, where the objective was to segment primary objects such as nuclei, we believe that models leveraging feature reconstructions from the earlier layers (e.g., mSRGAN VGG2) would give the optimum results, since the earlier layers in the VGG network do a good job of emphasizing the contour details of the image as opposed to texture details (which might not play a vital role in primary object segmentation).


6.4 GAN failures

Despite the theoretical guarantees of unique solutions for GANs, we discover that GANs can be very difficult to train, and achieving a perfect equilibrium between the generator and discriminator is quite tricky. Moreover, perfect convergence for GAN-based training remains unproven. In GANs it is desired that the generator and the discriminator learn from each other in sync, with both improving their capabilities as the training progresses. We would like the generator to produce SR images that resemble the true HR images and are passed by the discriminator as real, in an evolving manner where the discriminator improves its distinguishing capabilities over time and the generator, using the discriminator's improvements, brings its generated SR images closer to the distribution of real images.

However, it often happens that the balance of power gets skewed during the course of training, with either the generator or the discriminator becoming much stronger than the other and completely dominating how the training proceeds. This goes against the very basic concept of adversarial training, where the opposing parties learn from each other, and results in no fruitful SR images being generated after a point. In the context of generating SR images, we feel that the discriminator has an edge over the generator, since its task is relatively simple, i.e., distinguishing between real and fake images, versus generating plausible looking SR images. Goodfellow and Salimans et al. [18] observe that gradient descent as an optimization method for updating the generator and discriminator parameters can be inefficient, as the optimal solution to a GAN problem constitutes a saddle point.

There are a few things that can be done to remedy the skewed balance of power:

• Make the generator more powerful

• Make the discriminator less powerful

• Alter the learning rates of the generator/discriminator if either of them becomes too strong

• Try other tricks such as using Batch Normalization, different nonlinear activation functions (lrelu, prelu etc.), normalizing the inputs, different optimizers (ADAM, SGD), inference noise etc.

Making the generator more powerful is a difficult option given the already deep architecture/configuration used for the generator, and the gains are minuscule given the resources used. If the generator becomes too powerful, then any random junk image it generates is passed off as real by the discriminator (fig 6.2). If the discriminator becomes too powerful, no matter how good an SR image the generator produces, it is always classified as fake. Technically speaking, the discriminator

loss in this case quickly converges to zero (see fig 6.3, attempt 1) at the start of the training, thereby providing no reliable gradient updates for the generator to learn from and improve its generated SR images over time.

We now document and discuss the GAN training experiments we conducted and the changes we adopted to stabilize the training, arriving at the design choices for the final model (mSRGAN).

Figure 6.2: Instance of generator overfitting (L) and a random noise SR image (R) generated by the generator which is passed off as being real by the discriminator.

Attempt 1 - In our initial attempts, using the parameters listed by Ledig et al. [8], we faced the issue of the discriminator being too powerful and the generator not being able to improve its generated SR images by leveraging the adversarial loss. The only improvements it achieves are due to the content loss.


Figure 6.3: Plots visualizing discriminator (top) and generator (bottom) loss respectively. An instance of the discriminator becoming too powerful, preventing the generator from reaching full convergence (the step-wise decrease in loss is due to optimizing the content loss and not the adversarial loss).

This behaviour is explained by Arjovsky et al. [50], who show that the supports of p_g(x) and p_data(x) lie in a lower dimension than the training data (X), making it easy for the discriminator to distinguish between real and fake samples with one hundred percent accuracy, since there is no actual overlap between p_g(x) and p_data(x).

Attempt 2 - Using separate learning rates for generator and discriminator

Heusel et al. [51] propose a two-time-scale update rule (TTUR) for training GANs with stochastic gradient descent using separate learning rates for the generator and discriminator. They show that the TTUR converges to a stationary local Nash equilibrium. Using this idea, we use different learning rates for training the generator and discriminator.

To be specific, we use the ADAM optimizer for both the generator and discriminator, with learning rates set to 1e-4 and 1e-6 respectively.

Figure 6.4: Plots visualizing discriminator (top) and generator (bottom) loss respectively.


Figure 6.5: The generator loss (without the content loss) visualized

As can be seen in fig 6.4, the discriminator loss does not immediately converge to zero, and the generator converges quickly compared to our earlier attempt (fig 6.3). Despite these improvements, this is still not an ideal condition for adversarial training, since the discriminator loss does, in fact, converge to zero at 80K iterations, giving the generator no reliable updates to improve the generated SR images. This can be seen in the way the generator loss (without the content loss, fig 6.5) evolves during the course of training. At the start of training there is an increase in the loss (which is expected, since the initial SR images generated are just random noise), but after ∼20k iterations the loss remains stable at around 3e-4 with no signs of improvement, indicating that there are no proper gradient updates from the discriminator to learn from (due to the discriminator loss saturating to ∼0).

Attempt 3 - Using separate optimizers and learning rates for generator and dis- criminator

Attempt 2, using separate learning rates, did not yield fruitful results for stabilizing the training, and the discriminator still dominates. To cripple the discriminator's abilities, we use separate optimizers and learning rates for the generator and discriminator. We use the Stochastic gradient descent (SGD) optimizer for the discriminator and the ADAM optimizer for the generator, which often performs better than SGD in practice.
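The two configurations are easy to express in code. The sketch below (PyTorch, with tiny placeholder networks standing in for the actual generator and discriminator) shows the Attempt 2 setup with Adam at two time scales, and the Attempt 3 variant that additionally swaps the discriminator over to plain SGD; the SGD learning rate shown is an assumption for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder modules; the real mSRGAN generator/discriminator are much deeper.
generator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.PReLU())
discriminator = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1),
                              nn.LeakyReLU(0.2))

# Attempt 2: same optimizer, two time scales (generator 1e-4, discriminator 1e-6).
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-6)

# Attempt 3: keep Adam for the generator but slow the discriminator with SGD
# (learning rate here is illustrative).
opt_d = torch.optim.SGD(discriminator.parameters(), lr=1e-6)
```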


Figure 6.6: Plots visualizing discriminator (top) and generator (bottom) loss respectively while training the GAN using separate optimizers and learning rates.


Figure 6.7: The generator loss (without the content loss) while training the GAN using separate optimizers and learning rates visualized.


As can be seen in fig 6.6, the discriminator loss converges linearly and does not overfit as it did in the earlier attempts. The generator loss (without the content loss, fig 6.7) also appears to converge after the expected initial increase. Although it seems that the training is moving in the right direction, there is still scope to increase the rate of convergence for both the generator and the discriminator. Our next attempt ventures precisely in that direction.

Attempt 4 - Use lrelu, batch normalization and dropout

One reason the stability of GAN training might suffer is the use of ReLU as the nonlinear activation, as it leads to sparse gradients. To avoid this, we use leaky ReLU in both the generator and the discriminator, which ensures that a small non-zero gradient is passed even when a unit is not active. We also inject noise into all of the generator layers via dropout, both during training and at test time, in an attempt to make the network more robust: dropout randomly removes units along with their connections, preventing units from co-adapting too much. Dropout forces the network to learn multiple representations of the same data by not relying too heavily on any single representation, and can be viewed as a form of regularization that prevents overfitting. Figs. 6.8 and 6.9 illustrate the change in the loss caused by adopting the techniques mentioned above during training.
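A single building block combining these three ingredients might look as follows. This is a PyTorch sketch; the channel counts and dropout rate are assumptions, not the exact mSRGAN configuration.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, p_drop=0.3):
    """Conv -> BatchNorm -> LeakyReLU -> Dropout, in the spirit of Attempt 4."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),            # speeds up generator training
        nn.LeakyReLU(negative_slope=0.2),  # small gradient for inactive units
        nn.Dropout2d(p_drop),              # noise injection / regularization
    )
```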


Figure 6.8: The generator loss and the discriminator loss visualized.


Figure 6.9: The generator loss (without content loss) and the discriminator loss for the fake (generated SR) image visualized.


As can be seen in fig 6.8, the earlier problems of the discriminator overfitting are avoided, and the rate of convergence for the discriminator has increased. The generator loss (without the content loss), after a high initial bump, shows linear convergence. More interestingly, the discriminator loss for SR images generated by the generator, after an initial drop (∼2k iterations), increases steadily, which implies that the generator is steadily fooling the discriminator into believing that its generated images are real. At around 20k iterations the loss is at 8, and at this point the generator fools the discriminator 50% of the time (since the batch size is 16 and the penalty for each misclassification is 0.5). The slight oscillating behavior of the generator (fig 6.8, generator loss) is also a testament to the fact that the generator exploits the weaknesses of the discriminator at regular intervals.

Out of all our attempts, this was by far the most successful configuration, leading to good convergence for both the generator and discriminator without overfitting, and hence we use it for all the experiments.

While there is no perfect formula for stabilizing GAN training, we believe the following key takeaways play an important role in helping the generator and discriminator reach convergence:

• Using Batch Normalization in the generator speeds up the training considerably, whereas using batch normalization in the discriminator can be catastrophic, as it makes the discriminator very powerful compared to the generator.

• Using lrelu (instead of relu) and dropout helps gradients flow better and prevents overfitting.

• In image generation applications using GANs, the discriminator is usually stronger than the generator, and it may be beneficial to debilitate its capabilities (by reducing its number of features so that the generator has relatively more capacity).


6.5 Checkerboard Artifacts

One thing that was prominent in the generated SR images was the presence of checkerboard artifacts, which were particularly dominant in the brightly colored regions of the generated images.

We attribute this behavior mainly to the deconvolution layer (also known as transposed or fractionally strided convolution), which we use in the GAN architecture to upscale the LR (24 x 24) image to an HR (96 x 96) image. Unlike most Deep Learning approaches for SR, which first upscale the LR image to the dimensions of the desired HR image using simple methods such as nearest neighbor interpolation and then feed this upscaled image as input to the network, we feed the LR image directly to the network, skipping the upscaling step and performing the computations on the LR image itself. This was inspired by the SR study presented by Shi et al. [28], who show that directly feeding an LR image to the network and then upscaling it using subpixel convolution layers reduces the number of computational parameters, making the architecture fast enough to run on 1080p videos, and also lessens checkerboard patterns. This choice is also based on our intuition that by operating on the LR image, the convolution filters obtain a larger effective receptive field and a better sense of the objects at a higher level. Odena et al. [52] highlight two main causes of checkerboard artifacts in generated images: deconvolution overlap and loss functions.

The deconvolution operation facilitates estimating high-resolution descriptors from low-resolution information. Roughly speaking, deconvolution layers ensure that the model uses every point in the LR image to paint a square in the estimated HR image. However, when the kernel size is not divisible by the stride, deconvolution overlap occurs, adding the metaphorical paint unevenly, with some regions getting more than others [52]. A checkerboard-like pattern is formed in two dimensions when the overlaps along the two axes multiply together; since the uneven overlaps in both dimensions are multiplied, the unevenness is increased by a factor of four. In our mSRGAN architecture, we use several deconvolution layers in the generator network to upscale the LR images, and these multiple stacked deconvolutions compound the effect of uneven overlaps, leading to checkerboard artifacts of varying strength in the generated SR image.

Odena et al. [52] further point out that downsampling operations such as strided convolution and max pooling (used in our mini VGG19 network) can also produce checkerboard artifacts due to their uneven gradient updates. They recommend using resize convolutions, i.e., first upscaling the LR image by nearest neighbor or bilinear interpolation and then applying regular convolutions, to mitigate checkerboard artifacts. Their experiments show that these artifacts are significantly reduced across diverse contexts with this approach. We encourage researchers to try resize convolutions in the future for upsampling

applications, as they might potentially reduce the checkerboard artifacts.
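A resize convolution is essentially a drop-in replacement for the transposed convolution used for upscaling. The sketch below (PyTorch, with an assumed channel count) contrasts the two 2x-upsampling options discussed above.

```python
import torch.nn as nn

# Transposed convolution: kernel size 3 is not divisible by stride 2, so the
# "paint" overlaps unevenly and checkerboard patterns can appear.
transposed_upsample = nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)

# Resize convolution: nearest-neighbour upsample first, then a regular
# convolution, which distributes contributions evenly and avoids the artifact.
resize_conv_upsample = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
```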

6.6 Lack of training data

Even though we achieve better results than the bicubic baseline and competitive results against SRGAN, we speculate that adding more diverse labeled training data for both the mini VGG and the GAN would boost the psnr and visual quality of the images generated by mSRGAN. The Imagenet dataset used by the authors of SRGAN comprises thousands of classes selected from a collection of millions of labeled images, giving it an edge when it comes to learning and transferring feature representations for optimizing the perceptual loss.

It was challenging to employ HCS microscopy datasets approaching the size of Imagenet in this study, because large-scale labeled HCS microscopic images are scarce for several reasons. Cells in microscopic images exhibit unique characteristics (less light illumination, noisier) compared to natural objects, which appear at relatively consistent scales and poses. The different phenotypes observed in cell objects are subtle and intricate to label (often carrying multiple labels simultaneously), and a domain expert is required to decipher the appropriate annotation, making the image acquisition and labeling process costly. This is in contrast to the labels used in datasets such as Imagenet, Cifar-10, Cifar-100, etc., which are abundant, easily recognizable, and whose annotation can be crowdsourced to laypeople.

Another hindrance to gathering more HCS microscopy training data is that the whole process of culturing cells, staining, applying experiments and capturing the images with a microscope is economically costly, time intensive and error prone, since the captured images are likely to suffer from standard microscopy errors as well as HCS-specific errors, as discussed in the motivation chapter, which limits the rate at which HCS images are made available.

Recent efforts in the community, such as the Human Protein Atlas, the Broad Bioimage Benchmark Collection and the CYTO challenge, to make HCS imaging data more accessible to researchers are a positive step in combating this problem of limited labeled microscopy data, and we are optimistic about the role of these growing imaging datasets in bolstering the generalization capabilities of future deep learning models used for SR in the microscopy domain.

Chapter 7

Conclusion

In this work, we presented mSRGAN, the first Generative adversarial network for performing super-resolution exclusively on high content screening microscopy images. We highlight the inefficiency of using MSE alone, which generates super-resolved images that are not agreeable to the human visual system. Our work was also motivated by the fact that the existing state of the art in super-resolution, SRGAN, trained on natural images, may struggle to transfer feature representations efficiently to a distant domain (microscopy). SRGAN also pays disproportionate attention to optimizing the perceptual quality of the images by using content loss while entirely discarding MSE, which leads to poor psnr values.

We introduce a more balanced perceptual loss, which builds on the strengths of MSE, content loss and adversarial loss, resulting in visually pleasing images close to the natural manifold of real images that do not suffer from poor psnr, since pixel level changes have been appropriately taken into account. The SR images generated by mSRGAN are then compared visually and quantitatively against Bicubic interpolation and SRGAN, where mSRGAN consistently beats Bicubic interpolation and achieves competitive results against SRGAN. We document the challenges faced while training GANs in practice and recommend strategies to ensure convergence of the generator and discriminator.

To demonstrate the usefulness of performing super-resolution on microscopy images, a proof of concept study focusing on nuclei segmentation is performed. We compare the segmented Bicubic SR and mSRGAN SR images via the dice coefficient and observe an increase in segmentation performance (∼3%) with mSRGAN, which is further validated statistically, leading to the conclusion that super-resolved images produced by mSRGAN can be an asset in boosting segmentation performance when used as a pre-processing step before segmentation.

We conclude that Generative adversarial networks optimized for carefully designed loss functions can lead to aesthetically pleasing SR images (notwithstanding the

convergence issues) and that implementing such creative loss functions geared towards the application domain holds an important key to generating more realistic images.

7.1 Future Work

Mean opinion scoring - One major drawback of this work was the way the visual quality of the generated SR images was evaluated. Visual quality is very subjective, with some images being pleasing to one group and the same images not looking agreeable to another. The visual analysis of all the SR images was conducted by the author, leaving open the possibility of bias in the visual evaluation. To offset these issues, a proper Mean opinion scoring study should be done, wherein microscopy / cell biology experts are asked to rate the generated SR images on a scale and the averaged final results are evaluated to get a precise idea of the visual quality. We had neither the time nor the resources to conduct such a large-scale study, but this is something for researchers working on visual evaluation of images to keep in mind in the future.

Different GANs - Recently, the research on Generative adversarial networks has taken several exciting directions. Approaches other than vanilla GANs, such as the Wasserstein GAN, which uses an alternative cost function obtained by approximating the Wasserstein distance and is shown to be more stable when it comes to providing reliable gradient updates to the generator, and class conditional GANs, which take into account the class of images, can prove beneficial in avoiding convergence issues and creating good quality images [19]. We encourage researchers to try these new GAN variants for super-resolution.

MEGA super-resolution model - With the advent of huge biomedical datasets such as Medical Imagenet, this proof of concept work on super-resolution for microscopy images can be extended by creating a general purpose super-resolution model for health care images, which would work on different imaging modalities such as ultrasound images, CT scans, MRI, MG, PET, optical images etc. The implications of this work would be a great asset to clinicians, physicians and researchers alike.

Bibliography

[1] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” Signal Processing Magazine, IEEE, vol. 20, no. 3, pp. 21–36, May 2003. doi: 10.1109/msp.2003.1203207. [Online]. Available: http://ieeexplore.ieee.org/document/1203207/

[2] P. B. Chopade and P. M. Patil, “Article: Single and multi frame image super-resolution and its performance analysis: A comprehensive survey,” International Journal of Computer Applications, vol. 111, no. 15, pp. 29–34, February 2015.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[4] C. Szegedy, S. E. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” CoRR, vol. abs/1412.1441, 2014. [Online]. Available: http://arxiv.org/abs/1412.1441

[5] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” CoRR, vol. abs/1406.4773, 2014. [Online]. Available: http://arxiv.org/abs/1406.4773

[6] W. Ouyang and X. Wang, “Joint deep learning for pedestrian detection,” in Proceedings of the 2013 IEEE International Conference on Computer Vision, ser. ICCV ’13. Washington, DC, USA: IEEE Computer Society, 2013. doi: 10.1109/ICCV.2013.257. ISBN 978-1-4799-2840-8 pp. 2056–2063. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2013.257

[7] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), J. Fürnkranz and T. Joachims, Eds. Omnipress, 2010, pp. 807–814. [Online]. Available: http://www.icml2010.org/papers/432.pdf


[8] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” CoRR, vol. abs/1609.04802, 2016. [Online]. Available: http://arxiv.org/abs/1609.04802

[9] I. J. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” CoRR, vol. abs/1701.00160, 2017. [Online]. Available: http://arxiv.org/abs/ 1701.00160

[10] H. Burger, C. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” Jun. 2012, pp. 2392 – 2399.

[11] V. Jain and S. Seung, “Natural image denoising with convolutional networks,” in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Curran Associates, Inc., 2009, pp. 769–776. [Online]. Available: http://papers.nips.cc/paper/ 3506-natural-image-denoising-with-convolutional-networks.pdf

[12] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen, Deep Network Cascade for Image Super-resolution. Cham: Springer International Publishing, 2014, pp. 49–64. ISBN 978-3-319-10602-1. [Online]. Available: https://doi.org/10.1007/978-3-319-10602-1_4

[13] K. Nasrollahi and T. B. Moeslund, “Super-resolution: A comprehensive survey,” Mach. Vision Appl., vol. 25, no. 6, pp. 1423–1468, Aug. 2014. doi: 10.1007/s00138-014-0623-4. [Online]. Available: http://dx.doi.org/10.1007/ s00138-014-0623-4

[14] K. Su, Q. Tian, Q. Xue, N. Sebe, and J. Ma, “Neighborhood issue in single- frame image super-resolution,” in ICME. IEEE Computer Society, 2005, pp. 1122–1125.

[15] T. Gotoh and M. Okutomi, “Direct super-resolution and registration using raw cfa images,” Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., vol. 2, pp. II–II, 2004.

[16] K. Nasrollahi and T. B. Moeslund, “Super-resolution: a comprehensive survey,” Mach. Vis. Appl., vol. 25, no. 6, pp. 1423–1468, 2014. doi: 10.1007/s00138- 014-0623-4. [Online]. Available: https://doi.org/10.1007/s00138-014-0623-4

[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[18] I. J. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” CoRR, vol. abs/1701.00160, 2017. [Online]. Available: http://arxiv.org/abs/ 1701.00160


[19] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” CoRR, vol. abs/1710.07035, 2017. [Online]. Available: http://arxiv.org/abs/1710.07035

[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

[21] S. Baker and T. Kanade, “Hallucinating faces,” March 2000.

[22] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” MITSUBISHI ELECTRIC RESEARCH LABORATORIES, 2000. [Online]. Available: http://www.merl.com/publications/docs/TR2000-05.pdf

[23] K. K. Sweta Patel, “Analysis of various single frame super resolution techniques for better psnr,” International Research Journal of Engineering and Technology (IRJET), Tech. Rep., 2016.

[24] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016. doi: 10.1109/TPAMI.2015.2439281. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2015.2439281

[25] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” CoRR, vol. abs/1511.04587, 2015. [Online]. Available: http://arxiv.org/abs/1511.04587

[26] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” CoRR, vol. abs/1608.00367, 2016. [Online]. Available: http://arxiv.org/abs/1608.00367

[27] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” CoRR, vol. abs/1511.04491, 2015. [Online]. Available: http://arxiv.org/abs/1511.04491

[28] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” CoRR, vol. abs/1609.05158, 2016. [Online]. Available: http://arxiv.org/abs/1609.05158

[29] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and super-resolution,” CoRR, vol. abs/1603.08155, 2016. [Online]. Available: http://arxiv.org/abs/1603.08155


[30] M. Abramowitz and M. W. Davidson, Troubleshooting Photomicrography Errors, 2015, https://micro.magnet.fsu.edu/primer/photomicrography/fluorescenceerrors.html.

[31] F. Vietmeyer, S. Volkan-Kacso, P. Frantsuzov, M. Kuno, and B. Janko, FLUORESCENCE IMAGING: Understanding fluorescence blinking is the first path to an imaging solution, 2011, http://www.laserfocusworld.com/articles/2011/02/fluorescence-imaging-understanding-fluorescence-blinking-is-the-first-path-to-an-imaging-solution.html.

[32] K. R. Spring and M. W. Davidson, Troubleshooting Photomicrography Errors. microscopyu, 2012, https://www.microscopyu.com/techniques/fluorescence/introduction-to-fluorescence-microscopy.

[33] Crosstalk or bleedthrough. Scientific Volume Imaging B.V, 2017, https://svi.nl/CrossTalk.

[34] Thermo Fisher Scientific, Bleed-Through in Fluorescence Imaging. Thermofisher, 2016, https://www.thermofisher.com/se/en/home/life-science/cell-analysis/cell-analysis-learning-center/molecular-probes-school-of-fluorescence/imaging-basics/protocols-troubleshooting/troubleshooting/bleed-through.html.

[35] ——, Phototoxicity in Live-Cell Imaging and Ways to Reduce It. Thermofisher, 2016, https://www.thermofisher.com/se/en/home/life-science/cell-analysis/cell-analysis-learning-center/molecular-probes-school-of-fluorescence/imaging-basics/protocols-troubleshooting/troubleshooting/photoxicity.html.

[36] ——, Uneven Illumination in Fluorescence Imaging. Thermofisher, 2016, https://www.thermofisher.com/se/en/home/life-science/cell-analysis/cell-analysis-learning-center/molecular-probes-school-of-fluorescence/imaging-basics/protocols-troubleshooting/troubleshooting/uneven-illumination.html.

[37] Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,” IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.

[38] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” CoRR, vol. abs/1612.07919, 2016. [Online]. Available: http://arxiv.org/abs/1612.07919

[39] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off- the-shelf: an astounding baseline for recognition,” CoRR, vol. abs/1403.6382, 2014. [Online]. Available: http://arxiv.org/abs/1403.6382


[40] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” CoRR, vol. abs/1411.1792, 2014. [Online]. Available: http://arxiv.org/abs/1411.1792

[41] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” CoRR, vol. abs/1406.5774, 2014. [Online]. Available: http://arxiv.org/abs/1406.5774

[42] A. Håkansson, “Portal of research methods and methodologies for research projects and degree projects,” in Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2013, p. 1.

[43] S. Gross and M. Wilber, Training and investigating Residual Nets, 2016, http://torch.ch/blog/2016/02/04/resnets.html.

[44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large- scale image recognition,” CoRR, vol. abs/1409.1556, 2014.

[45] U. Mathias, O. Per, F. Linn, L. Emma, J. Kalle, F. Mattias, Z. Martin, K. Caroline, W. Kenneth, H. Sophia, W. Henrik, B. Lisa, and P. Fredrik, “Cyto 2017 image analysis challenge.” [Online]. Available: https://www.proteinatlas.org/CYTO_challenge2017/Cyto2017_Image_analysis_challenge_description.pdf

[46] ——, “Towards a knowledge-based human protein atlas,” Nature Biotechnology, vol. 28, 2010. [Online]. Available: http://dx.doi.org/10.1038/nbt1210-1248

[47] W. Lotter, G. Kreiman, and D. D. Cox, “Unsupervised learning of visual structure using predictive generative networks,” CoRR, vol. abs/1511.06380, 2015. [Online]. Available: http://arxiv.org/abs/1511.06380

[48] R. Timofte, R. Rothe, and L. J. V. Gool, “Seven ways to improve example-based single image super resolution,” CoRR, vol. abs/1511.02228, 2015. [Online]. Available: http://arxiv.org/abs/1511.02228

[49] D. Dao, A. N. Fraser, J. Hung, V. Ljosa, S. Singh, and A. E. Carpenter, “Cellprofiler analyst: interactive data exploration, analysis, and classification of large biological image sets,” bioRxiv, 2016. doi: 10.1101/057976. [Online]. Available: https://www.biorxiv.org/content/early/2016/06/09/057976

[50] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” CoRR, vol. abs/1701.04862, 2017. [Online]. Available: http://arxiv.org/abs/1701.04862

[51] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” CoRR, vol. abs/1706.08500, 2017. [Online]. Available: http://arxiv.org/abs/1706.08500

[52] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016. doi: 10.23915/distill.00003. [Online]. Available: http://distill.pub/2016/deconv-checkerboard

Appendix A

More Results

Figure A.1: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.2: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.3: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.4: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.5: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.6: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.7: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.8: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.9: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.10: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)

Figure A.11: Input LR image (Left), Output SR image by mSRGAN (Middle), Ground truth HR image (Right)
