
Loss Functions for Image Restoration with Neural Networks

Hang Zhao*,†, Orazio Gallo*, Iuri Frosio*, and Jan Kautz*

Abstract—Neural networks are becoming central in several areas of computer vision and image processing, and different architectures have been proposed to solve specific problems. The impact of the loss layer of neural networks, however, has not received much attention in the context of image processing: the default and virtually only choice is `2. In this paper, we bring attention to alternative choices for image restoration. In particular, we show the importance of perceptually-motivated losses when the resulting image is to be evaluated by a human observer. We compare the performance of several losses, and propose a novel, differentiable error function. We show that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged.

Index Terms—Image Processing, Image Restoration, Neural Networks, Loss Functions.

I. INTRODUCTION

For decades, neural networks have shown various degrees of success in several fields, ranging from robotics, to regression, to pattern recognition. Despite the promising results already produced in the 1980s on handwritten digit recognition [1], the popularity of neural networks in the field of computer vision has grown exponentially only recently, when deep learning boosted their performance in image recognition [2].

In the span of just a couple of years, neural networks have been employed for virtually every computer vision and image processing task known to the research community. Much research has focused on the definition of new architectures that are better suited to a specific problem [3], [4]. A large effort was also made to understand the inner mechanisms of neural networks, and what their intrinsic limitations are, for instance through the development of deconvolutional networks [5], or trying to fool networks with specific inputs [6]. Other advances were made on the techniques to improve the network's convergence [7].

The loss layer, despite being the effective driver of the network's learning, has attracted little attention within the image processing research community: the choice of the cost function generally defaults to the squared `2 norm of the error [8], [3], [9], [10]. This is understandable, given the many desirable properties this norm possesses. There is also a less well-founded, but just as relevant reason for the continued popularity of `2: standard neural networks packages, such as Caffe [11], only offer the implementation for this metric.

However, `2 suffers from well-known limitations. For instance, when the task at hand involves image quality, `2 correlates poorly with image quality as perceived by a human observer [12]. This is because of a number of assumptions implicitly made when using `2. First and foremost, the use of `2 assumes that the impact of noise is independent of the local characteristics of the image. On the contrary, the sensitivity of the Human Visual System (HVS) to noise depends on local luminance, contrast, and structure [13]. The `2 loss also works under the assumption of white Gaussian noise, which is not valid in general.

We focus on the use of neural networks for image restoration tasks, and we study the effect of different metrics for the network's loss layer. We compare `2 against four error metrics on representative tasks: image super-resolution, JPEG artifacts removal, and joint denoising plus demosaicking. First, we test whether a different local metric such as `1 can produce better results. We then evaluate the impact of perceptually-motivated metrics. We use two state-of-the-art metrics for image quality: the structural similarity index (SSIM [13]) and the multi-scale structural similarity index (MS-SSIM [14]). We choose these among the plethora of existing indexes because they are established measures, and because they are differentiable—a requirement for the backpropagation stage. As expected, on the use cases we consider, the perceptual metrics outperform `2. However, and perhaps surprisingly, this is also true for `1, see Figure 1. Inspired by this observation, we propose a novel loss function and show its superior performance in terms of all the metrics we consider.

We offer several contributions.
First, we bring attention to the importance of the error metric used to train neural networks for image processing: despite the widely-known limitations of `2, this loss is still the de facto standard. We investigate the use of three alternative error metrics (`1, SSIM, and MS-SSIM), and define a new metric that combines the advantages of `1 and MS-SSIM (Section III). We also perform a thorough analysis of their performance in terms of several image quality indexes (Section IV). Finally, we discuss their convergence properties. We empirically show that the poor performance of some losses is related to local minima of the loss functions (Section V-A), and we explain the reasons why SSIM and MS-SSIM alone do not produce the expected quality (Section V-B). For each of the metrics we analyze, we implement a loss layer for Caffe, which we make available to the research community¹.

*NVIDIA, †MIT Media Lab

The supplementary material is available at http://research.nvidia.com/publication/loss-functions-image-restoration-neural-networks. The material includes additional numerical and visual comparisons, and mathematical derivations. Contact [email protected] for further questions about this work.

¹https://github.com/NVlabs/PL4NN

(a) Clean image
(b) Clean (c) Noisy (d) `2 (e) `1 (f) SSIM (g) MS-SSIM (h) Mix
(i) Clean (j) Noisy (k) `2 (l) `1 (m) SSIM (n) MS-SSIM (o) Mix

Fig. 1: Comparisons of the results of joint denoising and demosaicking performed by networks trained on different loss functions (best viewed in the electronic version by zooming in). `2, the standard loss function for neural networks for image processing, produces splotchy artifacts in flat regions (d).

II. RELATED WORK

In this paper, we target neural networks for image restoration, which, in our context, is the set of all the image processing algorithms whose goal is to output an image that is appealing to a human observer. To this end, we use the problems of super-resolution, JPEG artifacts removal, and joint demosaicking plus denoising as benchmark tests. Specifically, we show how established error measures can be adapted to work within the loss layer of a neural network, and how this can positively influence the results. Here we briefly review the existing literature on the subject of neural networks for image processing, and on the subject of measures of image quality.

A. Neural networks for image restoration

Following their success in several computer vision tasks [15], [2], neural networks have received considerable attention in the context of image restoration. Neural networks have been used for denoising [8], [3], deblurring [4], demosaicking [10], and super-resolution [9] among others. To the best of our knowledge, however, the work on this subject has focused on tuning the architecture of the network for the specific application; the loss layer, which effectively drives the learning of the network to produce the desired output quality, is based on `2 for all of the approaches above.

We show that a better choice for the error measure has a strong impact on the quality of the results. Moreover, we show that even when `2 is the appropriate loss, alternating the training loss function with a related loss, such as `1, can lead to finding a better solution for `2.

B. Evaluating image quality

The squared error, `2, is arguably the dominant error measure across very diverse fields, from regression problems, to pattern recognition, to signal and image processing. Among the main reasons for its popularity is the fact that it is convex and differentiable—very convenient properties for optimization problems. Other interesting properties range from the fact that `2 provides the maximum likelihood estimate in case of independent and identically distributed Gaussian noise, to the fact that it is additive for independent noise sources. There is a longer list of reasons for which we refer the reader to the work of Wang and Bovik [16].

These properties paved the way for `2's widespread adoption, which was further fueled by the fact that standard software packages tend to include tools to use `2, but not many other error functions for regression. In the context of image processing, Caffe [11] actually offers only `2 as a loss layer², thus discouraging researchers from testing other error measures.

However, it is widely accepted that `2, and consequently the Peak Signal-to-Noise Ratio, PSNR, do not correlate well with human's perception of image quality [12]: `2 simply does not capture the intricate characteristics of the human visual system (HVS).

There exists a rich literature of error measures, both reference-based and non reference-based, that attempt to address the limitations of the simple `2 error function. For our purposes, we focus on reference-based measures. A popular reference-based index is the structural similarity index (SSIM [13]). SSIM evaluates images accounting for the fact that the HVS is sensitive to changes in local structure. Wang et al. [14] extend SSIM observing that the scale at which local structure should be analyzed is a function of factors such as image-to-observer distance. To account for these factors, they propose MS-SSIM, a multi-scale version of SSIM that weighs SSIM computed at different scales according to the sensitivity of the HVS. Experimental results have shown the superiority of SSIM-based indexes over `2. As a consequence, SSIM has been widely employed as a metric to evaluate image processing algorithms.
Moreover, given that it can be used as a differentiable cost function, SSIM has also been used in iterative algorithms designed for image compression [16], image reconstruction [17], denoising and super-resolution [18], and even downscaling [19]. To the best of our knowledge, however, SSIM-based indexes have never been adopted to train neural networks.

Recently, novel image quality indexes based on the properties of the HVS showed improved performance when compared to SSIM and MS-SSIM [12]. One of these is the Information Content Weighted SSIM (IW-SSIM), a modification of MS-SSIM that also includes a weighting scheme proportional to the local image information [20]. Another is the Visual Information Fidelity (VIF), which is based on the amount of shared information between the reference and distorted image [21]. The Gradient Magnitude Similarity Deviation (GMSD) is characterized by simplified math and performance similar to that of SSIM, but it requires computing the standard deviation over the whole image [22]. Finally, the Feature Similarity Index (FSIM) leverages the perceptual importance of phase congruency, and measures the dissimilarity between two images based on local phase congruency and gradient magnitude [23]. FSIM has also been extended to FSIMc, which can be used with color images. Despite the fact that they offer an improved accuracy in terms of image quality, the mathematical formulation of these indexes is generally more complex than SSIM and MS-SSIM, and possibly not differentiable, making their adoption for optimization procedures not immediate.

²Caffe indeed offers other types of loss layers, but they are only useful for classification tasks.

III. LOSS LAYERS FOR IMAGE RESTORATION

The loss layer of a neural network compares the output of the network with the ground truth, i.e., processed and reference patches, respectively, for the case of image processing.

In our work, we investigate the impact of different loss layers for image processing. Consider the case of a network that performs denoising and demosaicking jointly. The insets in Figure 1 show a zoom-in of different patches for the image in Figure 1(a) as processed by a network trained with different loss functions (see Section IV for the network's description). A simple visual inspection is sufficient to appreciate the practical implications of the discussion on `2 (Section II).

Specifically, Figure 1(d) shows that in flat regions the network strongly attenuates the noise, but it produces visible splotchy artifacts. This is because `2 penalizes larger errors, but is more tolerant to small errors, regardless of the underlying structure in the image; the HVS, on the other hand, is more sensitive to luminance and color variations in texture-less regions [24]. A few splotchy artifacts are still visible, though arguably less apparent, in textured regions, see Figure 1(k). The sharpness of edges, however, is well-preserved by `2, as blurring them would result in a large error. Note that these splotchy artifacts have been systematically observed before in the context of image processing with neural networks [3], but they have not been attributed to the loss function. In Section V-A we show that the quality achieved by using `2 is also dependent on its convergence properties.

In this section we propose the use of different error functions. We provide a motivation for the different loss functions and we show how to compute their derivatives, which are necessary for the backpropagation step. We also share our implementation of the different layers that can be readily used within Caffe.

For an error function E, the loss for a patch P can be written as

\mathcal{L}^{E}(P) = \frac{1}{N} \sum_{p \in P} E(p),    (1)

where N is the number of pixels p in the patch.

A. The `1 error

As a first attempt to reduce the artifacts introduced by the `2 loss function, we want to train the exact same network using `1 instead of `2. The two losses weigh errors differently—`1 does not over-penalize larger errors—and, consequently, they may have different convergence properties.

Equation 1 for `1 is simply:

\mathcal{L}^{\ell_1}(P) = \frac{1}{N} \sum_{p \in P} |x(p) - y(p)|,    (2)

where p is the index of the pixel and P is the patch; x(p) and y(p) are the values of the pixels in the processed patch and the ground truth, respectively. The derivatives for the back-propagation are also simple, since ∂L^{`1}(p)/∂q = 0, ∀q ≠ p. Therefore, for each pixel p in the patch,

\partial \mathcal{L}^{\ell_1}(P) / \partial x(p) = \mathrm{sign}\left(x(p) - y(p)\right).    (3)

The derivative of L^{`1} is not defined at 0. However, if the error is 0, we do not need to update the weights, so we use the convention that sign(0) = 0.
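To make Equations 2 and 3 concrete, the following is a minimal NumPy sketch of the `1 loss and its gradient for a single patch. The function name is ours, and this is an illustration, not the Caffe layer we release:

```python
import numpy as np

def l1_loss_and_grad(x, y):
    # Eq. 2: mean absolute error over the patch.
    diff = x - y
    loss = np.abs(diff).mean()
    # Eq. 3: per-pixel derivative. NumPy's sign(0) = 0 matches the
    # convention adopted in the text; the constant 1/N factor is
    # omitted, as in Eq. 3.
    grad = np.sign(diff)
    return loss, grad
```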
Note that, although L^{`1}(P) is a function of the patch as a whole, the derivatives are back-propagated for each pixel in the patch. Somewhat unexpectedly, the network trained with `1 provides a significant improvement for several of the issues discussed above, see Figure 1(e), where the splotchy artifacts in the sky are removed. Section V-A analyzes the reasons behind this behavior. Although `1 shows improved performance over `2, however, its results are still sub-optimal (see for instance the artifacts in the sky and at the boundary of the building in Figures 3(c) and 3(g)).

B. SSIM

If the goal is for the network to learn to produce visually pleasing images, it stands to reason that the error function should be perceptually motivated, as is the case with SSIM. SSIM for pixel p is defined as

\mathrm{SSIM}(p) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}    (4)
               = l(p) \cdot cs(p)    (5)

where we omitted the dependence of means and standard deviations on pixel p. Means and standard deviations are computed with a Gaussian filter with standard deviation σ_G, G_{σ_G}. The loss function for SSIM can then be written setting E(p) = 1 − SSIM(p):

\mathcal{L}^{\mathrm{SSIM}}(P) = \frac{1}{N} \sum_{p \in P} 1 - \mathrm{SSIM}(p).    (6)
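A sketch of Equations 4–6 in the same spirit follows. It assumes single-channel patches with intensities in [0, 1] and the customary SSIM constants C1 = 0.01² and C2 = 0.03², which the text does not restate; SciPy's gaussian_filter plays the role of G_{σ_G}:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_loss_map(x, y, sigma_g=5.0, c1=0.01 ** 2, c2=0.03 ** 2):
    # Gaussian-weighted local means, variances, and covariance (Eq. 4).
    mu_x, mu_y = gaussian_filter(x, sigma_g), gaussian_filter(y, sigma_g)
    var_x = gaussian_filter(x * x, sigma_g) - mu_x ** 2
    var_y = gaussian_filter(y * y, sigma_g) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, sigma_g) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)   # luminance term
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)               # contrast-structure term
    # Per-pixel loss map 1 - SSIM(p) (Eq. 6); Eq. 7 below keeps only
    # the value at the center pixel.
    return 1.0 - l * cs
```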

Equation 4 highlights the fact that the computation of SSIM(p) requires looking at a neighborhood of pixel p as large as the support of G_{σ_G}. This means that L^{SSIM}(P), as well as its derivatives, cannot be calculated in some boundary region of P. This is not true for `1 or `2, which only need the value of the processed and reference patch at pixel p.

However, the convolutional nature of the network allows us to write the loss as

\mathcal{L}^{\mathrm{SSIM}}(P) = 1 - \mathrm{SSIM}(\tilde{p}),    (7)

where p̃ is the center pixel of patch P. Again, this is because, even though the network learns the weights maximizing SSIM for the central pixel, the learned kernels are then applied to all the pixels in the image. Note that the error can still be back-propagated to all the pixels within the support of G_{σ_G} as they contribute to the computation of Equation 7.

Recall that we have to compute the derivatives at p̃ with respect to any other pixel q in patch P. More formally:

\frac{\partial \mathcal{L}^{\mathrm{SSIM}}(P)}{\partial x(q)} = -\frac{\partial}{\partial x(q)} \mathrm{SSIM}(\tilde{p}) = -\left( \frac{\partial l(\tilde{p})}{\partial x(q)} \cdot cs(\tilde{p}) + l(\tilde{p}) \cdot \frac{\partial cs(\tilde{p})}{\partial x(q)} \right),    (8)

where l(p̃) and cs(p̃) are the first and second term of SSIM (Equation 5) and their derivatives are

\frac{\partial l(\tilde{p})}{\partial x(q)} = 2 \cdot G_{\sigma_G}(q - \tilde{p}) \cdot \left( \frac{\mu_y - \mu_x \cdot l(\tilde{p})}{\mu_x^2 + \mu_y^2 + C_1} \right)    (9)

and

\frac{\partial cs(\tilde{p})}{\partial x(q)} = \frac{2}{\sigma_x^2 + \sigma_y^2 + C_2} \cdot G_{\sigma_G}(q - \tilde{p}) \cdot \left[ (y(q) - \mu_y) - cs(\tilde{p}) \cdot (x(q) - \mu_x) \right],    (10)

where G_{σ_G}(q − p̃) is the Gaussian coefficient associated with pixel q. We refer the reader to the additional material for the full derivation.

C. MS-SSIM

The choice of σ_G has an impact on the quality of the processed results of a network that is trained with SSIM, as can be seen from the derivatives in the previous section. Specifically, for smaller values of σ_G the network loses the ability to preserve the local structure and the splotchy artifacts are reintroduced in flat regions, see Figure 9(e). For large values of σ_G, we observe that the network tends to preserve noise in the proximity of edges, Figure 9(c). See Section V-B for more details.

Rather than fine-tuning σ_G, we propose to use the multi-scale version of SSIM, MS-SSIM. Given a dyadic pyramid of M levels, MS-SSIM is defined as

\mathrm{MS\text{-}SSIM}(p) = l_M^{\alpha}(p) \cdot \prod_{j=1}^{M} cs_j^{\beta_j}(p)    (11)

where l_M and cs_j are the terms we defined in Section III-B at scale M and j, respectively. For convenience, we set α = β_j = 1, for j = {1, ..., M}. Similarly to Equation 7, we can approximate the loss for patch P with the loss computed at its center pixel p̃:

\mathcal{L}^{\mathrm{MS\text{-}SSIM}}(P) = 1 - \mathrm{MS\text{-}SSIM}(\tilde{p}).    (12)

Because we set all the exponents of Equation 11 to one, the derivatives of the loss function based on MS-SSIM can be written as

\frac{\partial \mathcal{L}^{\mathrm{MS\text{-}SSIM}}(P)}{\partial x(q)} = \left( \frac{\partial l_M(\tilde{p})}{\partial x(q)} + l_M(\tilde{p}) \cdot \sum_{i=0}^{M} \frac{1}{cs_i(\tilde{p})} \cdot \frac{\partial cs_i(\tilde{p})}{\partial x(q)} \right) \cdot \prod_{j=1}^{M} cs_j(\tilde{p}),    (13)

where the derivatives of l_j and cs_j are the same as in Section III-B. For the full derivation we refer the reader to the supplementary material.

Using L^{MS-SSIM} to train the network, Equation 11 requires that we compute a pyramid of M levels of patch P, which is a computationally expensive operation given that it needs to be performed at each iteration. To avoid this, we propose to approximate and replace the construction of the pyramid: instead of computing M levels of the pyramid, we use M different values for σ_G, each one being half of the previous, on the full-resolution patch. Specifically, we use σ_G^i ∈ {0.5, 1, 2, 4, 8} and define cs_i ≐ G_{σ_G^i} · cs_0(p̃) and ∂cs_i(p̃)/∂x(q) ≐ G_{σ_G^i} · ∂cs_0(p̃)/∂x(q), where the Gaussian filters are centered at pixel p̃, and "·" is a point-wise multiplication. The terms depending on l_M can be defined in a similar way. We use this trick to speed up the training in all of our experiments.
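The approximation above can be sketched as follows, again a single-channel NumPy illustration with assumed SSIM constants, not our Caffe implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

SIGMAS = (0.5, 1.0, 2.0, 4.0, 8.0)  # the sigma_G^i values from the text, M = 5

def _l_cs(x, y, sigma, c1=0.01 ** 2, c2=0.03 ** 2):
    # Gaussian-weighted local statistics at one scale (cf. Eq. 4).
    mu_x, mu_y = gaussian_filter(x, sigma), gaussian_filter(y, sigma)
    var_x = gaussian_filter(x * x, sigma) - mu_x ** 2
    var_y = gaussian_filter(y * y, sigma) - mu_y ** 2
    cov = gaussian_filter(x * y, sigma) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    cs = (2 * cov + c2) / (var_x + var_y + c2)
    return l, cs

def ms_ssim_loss(x, y, sigmas=SIGMAS):
    # Eq. 12 with the approximation above: M Gaussian windows on the
    # full-resolution patch in place of an M-level dyadic pyramid.
    center = tuple(s // 2 for s in x.shape)
    prod_cs, l_coarsest = 1.0, None
    for sigma in sigmas:                 # fine to coarse
        l, cs = _l_cs(x, y, sigma)
        prod_cs *= cs[center]
        l_coarsest = l[center]           # l is kept only at the coarsest scale
    return 1.0 - l_coarsest * prod_cs
```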
D. The best of both worlds: MS-SSIM + `1

By design, both MS-SSIM and SSIM are not particularly sensitive to uniform biases (see Section V-B). This can cause changes of brightness or shifts of colors, which typically become more dull. However, MS-SSIM preserves the contrast in high-frequency regions better than the other loss functions we experimented with. On the other hand, `1 preserves colors and luminance—an error is weighed equally regardless of the local structure—but does not produce quite the same contrast as MS-SSIM.

To capture the best characteristics of both error functions, we propose to combine them:

\mathcal{L}^{\mathrm{Mix}} = \alpha \cdot \mathcal{L}^{\mathrm{MS\text{-}SSIM}} + (1 - \alpha) \cdot G_{\sigma_G^M} \cdot \mathcal{L}^{\ell_1},    (14)

where we omitted the dependence on patch P for all loss functions, and we empirically set α = 0.84.³ The derivatives of Equation 14 are simply the weighed sum of the derivatives of its two terms, which we computed in the previous sections. Note that we add a point-wise multiplication between G_{σ_G^M} and L^{`1}: this is because MS-SSIM propagates the error at pixel q based on its contribution to MS-SSIM of the central pixel p̃, as determined by the Gaussian weights, see Equations 9 and 10.

³We chose the value for this parameter so that the contribution of the two losses would be roughly balanced. While we tested a few different values, we did not perform a thorough investigation. We did, however, notice that the results were not significantly sensitive to small variations of α.
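Putting the two terms together, a sketch of Equation 14 could read as follows (reusing ms_ssim_loss from the previous sketch, with α = 0.84 and σ_G^M = 8 as in the text; single-channel patches assumed):

```python
import numpy as np

def mix_loss(x, y, alpha=0.84, sigma_m=8.0):
    # Eq. 14 at the center pixel: alpha * (1 - MS-SSIM) plus
    # (1 - alpha) times the l1 error weighted by G_{sigma_G^M}
    # centered at the patch center p~.
    h, w = x.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * sigma_m ** 2))
    g /= g.sum()                          # Gaussian weights of the coarsest scale
    ms_term = ms_ssim_loss(x, y)          # 1 - MS-SSIM(p~), from the sketch above
    l1_term = np.sum(g * np.abs(x - y))   # Gaussian-weighted l1 error
    return alpha * ms_term + (1 - alpha) * l1_term
```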

IV. RESULTS

For our analysis of the different loss functions we focus on joint demosaicking plus denoising, a fundamental problem in image processing. We also confirm our findings by testing the different loss functions on the problems of super-resolution and JPEG artifacts removal.

A. Joint denoising and demosaicking

We define a fully convolutional neural network (CNN) that takes a 31×31×3 input. The first layer is a 64×9×9×3 convolutional layer, where the first term indicates the number of filters and the remaining terms indicate their dimensions. The second convolutional layer is 64×5×5×64, and the output layer is a 3×5×5×64 convolutional layer. We apply parametric rectified linear unit (PReLU) layers to the output of the inner convolutional layers [15]. The input to our network is obtained by bilinear interpolation of a 31×31 Bayer patch, which results in a 31×31×3 RGB patch; in this sense the network is really doing joint denoising + demosaicking. We trained the network considering different cost functions (`1, `2, SSIM5, SSIM9, MS-SSIM and MS-SSIM+`1)⁴ on a training set of 700 RGB images taken from the MIT-Adobe FiveK Dataset [25], resized to a size of 999×666. To simulate a realistic image acquisition process, we corrupted each image with a mix of photon shot and zero-mean Gaussian noise, and introduced clipping due to the sensor zero level and saturation. We used the model proposed by Foi et al. [26] for this task, with parameters a = 0.005 and b = 0.0001. The average PSNR for the testing images after adding noise was 28.24dB, as reported in Table I. Figure 1(c) shows a typical patch corrupted by noise. We used 40 images from the same dataset for testing (the network did not see this subset during training).

Beyond considering different cost functions for training, we also compare the output of our network with the images obtained by a state-of-the-art denoising method, BM3D, operating directly in the Bayer domain [27], followed by a standard demosaicking algorithm [28]. Since BM3D is designed to deal with Gaussian noise, rather than the more realistic noise model we use, we apply a Variance Stabilizing Transform [26] to the image data in the Bayer domain before applying BM3D, and its inverse after denoising. This is exactly the strategy suggested by the authors of BM3D for RAW data [27], the paper we compare against for BM3D.

Figures 1 and 5 show several results and comparisons between the different networks. Note the splotchy artifacts for the `2 network on flat regions, and the noise around the edges for the SSIM5 and SSIM9 networks. The network trained with MS-SSIM addresses these problems but tends to render the colors more dull, see Section V-B for an explanation. The network trained with MS-SSIM+`1 generates the best results. Some differences are difficult to see in side-by-side comparisons, please refer to the additional material, which allows to flip between images.

We also perform a quantitative analysis of the results. We evaluate several image quality metrics on the output of the CNNs trained with the different cost functions and with BM3D [27], which is the state of the art in denoising. The image quality indexes range from the traditional `2 metric and PSNR, to the most refined, perceptually inspired metrics, like FSIM [23]. The values of these metrics on the testing dataset are reported in Table I. When the network is trained using `1 as a cost function, instead of the traditional `2, the average quality of the output images is superior for all the quality metrics considered. It is quite remarkable to notice that, even when the `2 or PSNR metrics are used to evaluate image quality, the network trained with the `1 loss outperforms the one trained with the `2 loss. We offer an explanation of this in Section V-A. On the other hand, we note that the network trained with SSIM performs either at par or slightly worse than the one trained with `1, both on traditional metrics and on perceptually-inspired losses. The network trained on MS-SSIM performs better than the one based on SSIM, but still does not clearly outperform `1. This is due to the color shifts mentioned above and discussed in Section V-B. However, the network that combines MS-SSIM and `1 achieves the best results on all of the image quality metrics we consider.

For this image restoration task, we also evaluated the combination of `2 and MS-SSIM to test whether the use of `1 is critical to the performance of Mix. The results on all the indexes considered in Table I show that Mix is still superior. As expected, however, `2 + MS-SSIM outperforms `2 alone on the perceptual metrics (see supplementary material).

⁴SSIMk means SSIM computed with σ_G = k.
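As an illustration of the acquisition model used to generate the training data, the corruption step can be sketched with a heteroskedastic Gaussian approximation of the Poissonian-Gaussian model of Foi et al. [26]. The exact parametrization of [26] is richer than this, so the sketch is only indicative:

```python
import numpy as np

def simulate_sensor_noise(img, a=0.005, b=0.0001, rng=None):
    # Heteroskedastic approximation: the variance at each pixel is
    # a * img + b, combining photon shot noise (signal-dependent)
    # and zero-mean Gaussian read noise. img is a clean image in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(np.clip(a * img + b, 0.0, None))
    noisy = img + sigma * rng.standard_normal(img.shape)
    # Clipping models the sensor zero level and saturation.
    return np.clip(noisy, 0.0, 1.0)
```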
B. Super-resolution

We also verify the outcome of our analysis on the network for super-resolution proposed by Dong et al. [9]. We start from the network architecture they propose, and make a few minor but important changes to their approach. First, we use PReLU, instead of ReLU, layers. Second, we use bilinear instead of bicubic interpolation for initialization (we do this for both training and testing). The latter introduces high-frequency artifacts that hurt the learning process. Finally, we train directly on the RGB data. We made these changes for all the loss functions we test, including `2, which also serves as a comparison with the work by Dong et al. We use this modified network in place of their proposed architecture to isolate the contribution of the loss layer to the results—and because it produces better or equal results than the one produced by the original architecture. Figure 2 shows some sample results. Given the results of the previous section, we only compared the results of `1, `2, MS-SSIM, and Mix, see Table I. An analysis of the table brings similar considerations as for the case of joint denoising and demosaicking.

C. JPEG artifacts removal

Our final benchmark is JPEG artifact removal. We use the same network architecture and ground truth data as for joint denoising and demosaicking, but we create the corrupted data by means of aggressive JPEG compression. For this purpose we use Matlab's function imwrite with a quality setting of 25. Note that the corruption induced by JPEG compression is spatially variant and has a very different statistic than the noise introduced in Section IV-A.
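For readers without Matlab, an equivalent sketch of the corruption step using Pillow (a substitution of ours, not the pipeline we actually used) is:

```python
import numpy as np
from io import BytesIO
from PIL import Image

def jpeg_corrupt(img_uint8, quality=25):
    # Round-trip an RGB uint8 image through aggressive JPEG compression,
    # analogous to Matlab's imwrite with a quality setting of 25.
    buf = BytesIO()
    Image.fromarray(img_uint8).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))
```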

(a) LR (b) `2 (c) `1 (d) Mix (e) LR (f) `2 (g) `1 (h) Mix

Fig. 2: Results for super-resolution. Notice the grating artifacts on the black stripes of the wing and around the face of the girl produced by `2.

(a) JPEG (b) `2 (c) `1 (d) Mix (e) JPEG (f) `2 (g) `1 (h) Mix

Fig. 3: Results for JPEG de-blocking. The insets are taken from the image in Figure 1. Notice the artifacts around the edges (a)-(c) and how Mix (d) removes them better than either `1 or `2. Mix also outperforms the other metrics in the relatively flat regions, where the blocking artifacts are more apparent, e.g., (e)-(h).

We generate the input patches with two different strides. First, we use a stride of 8, which causes the boundaries of the 8×8 DCT blocks to be aligned in each 31×31 patch. We also use a stride of 7, which causes the grid of blocks to be in different locations in each patch. We found that the latter strategy better removes the JPEG artifacts while producing sharper images, which is why we ran all of our experiments with that configuration. Again, we only compared the results of `1, `2, MS-SSIM, and Mix, see Table I. Figure 3 shows that our loss function, Mix, outperforms `1 and `2 on uniform regions and that it attenuates the ringing artifacts around the edge of the building better. More results are shown in Figure 6.

More results and comparisons, both numerical and visual, can be found in the supplementary material.
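The stride-7 patch extraction just described can be sketched as follows (a hypothetical helper of ours, illustrating the data-generation step):

```python
import numpy as np

def extract_patches(img, size=31, stride=7):
    # With stride 7, the 8x8 DCT grid of JPEG falls at a different
    # offset in every 31x31 patch; a stride of 8 would align the grid
    # identically in all of them.
    h, w = img.shape[:2]
    patches = [img[r:r + size, c:c + size]
               for r in range(0, h - size + 1, stride)
               for c in range(0, w - size + 1, stride)]
    return np.stack(patches)
```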
V. DISCUSSION

In this section we delve into a deeper analysis of the results. Among other considerations, we look into the convergence properties of the different losses, and we offer an interpretation of the reasons some losses perform better than others.

A. Convergence of the loss functions

Table I highlights an unexpected result: even after convergence, a CNN trained on one loss function can outperform another network even based on the very loss with which the second was trained. Consider, for instance, the two networks trained with `1 and `2 respectively for joint denoising and demosaicking: the table shows that the network trained with `1 achieves a lower `2 loss than the network trained with `2. Note that we ran the training multiple times with different initializations. We hypothesize that this result may be related to the smoothness and the local convexity properties of the two measures: `2 gets stuck more easily in a local minimum, while for `1 it may be easier to reach a better minimum, both in terms of `1 and `2—the "good" minima of the two should be related, after all.

To test this hypothesis, we ran an experiment in which we take two networks trained with `1 and `2 respectively, and train them again until they converge using the other loss. Figure 7 shows the `2 loss computed on the testing set at different training iterations for either network. The network trained with `1 only (before epoch 1200 in the plot) achieves a better `2 loss than the one trained with `2. However, after switching the training loss functions, both networks yield a lower `2 loss, confirming that the `2 network was previously stuck in a local minimum. While the two networks achieve a similar `2 loss, they converge to different regions of the space of parameters. At visual inspection, the network trained with `2 first and `1 after produces results similar to those of `1 alone; the output of the network trained with `1 first and `2 after is still affected by splotchy artifacts in flat areas, though it is better than `2 alone, see Figure 8. Specifically, the network trained with `1 first and `2 afterwards achieves an `2 loss of 0.3896·10⁻³, which is the lowest across all the networks we trained, confirming that the network trained with `2 alone had convergence issues. However, neither of these two networks outperforms Mix on any of the perceptual metrics we use, confirming that the advantage of the proposed loss goes beyond convergence considerations. Table II shows the complete evaluation of the two networks, where bold indicates a result that is better than any of the results shown in Table I.

B. On the performance of SSIM and MS-SSIM

Table I also reveals that SSIM and MS-SSIM do not perform as well as `1. This does not have to do with the convergence properties of these losses. To investigate this, we trained several SSIM networks with different σ_G's and found that smaller values of σ_G produce better results at edges, but worse results in flat regions, while the opposite is true for larger values, see Figure 9.


Fig. 4: The reference images for the details shown in Figures 5 and 6.

(a) Clean (b) Noisy (c) BM3D (d) `2 (e) `1 (f) SSIM (g) MS-SSIM (h) Mix

(a) Clean (b) Noisy (c) BM3D (d) `2 (e) `1 (f) SSIM (g) MS-SSIM (h) Mix

Fig. 5: Results for denoising+demosaicking for different approaches. The noisy patches are obtained by simple bilinear interpolation. Note the splotchy artifacts `2 produces in flat regions. Also note the change in colors for SSIM-based losses. The proposed metric, MS-SSIM+`1, referred to as Mix, addresses the former issues. Reference images are shown in Figure 4.

(a) JPEG (b) `2 (c) `1 (d) Mix (e) JPEG (f) `2 (g) `1 (h) Mix

Fig. 6: Results for JPEG de-blocking for different approaches. Note that Mix outperforms both `2 and `1 on uniform regions as well as on the ringing artifacts in high-gradient regions. Reference images are shown in Figure 4.


Denoising + demosaicking
                       Training cost function
Image quality metric   Noisy     BM3D      `2        `1        SSIM5     SSIM9     MS-SSIM   Mix
1000 · `2              1.65      0.45      0.56      0.43      0.58      0.61      0.55      0.41
PSNR                   28.24     34.05     33.18     34.42     33.15     32.98     33.29     34.61
1000 · `1              27.36     14.14     15.90     13.47     15.90     16.33     15.99     13.19
SSIM                   0.8075    0.9479    0.9346    0.9535    0.9500    0.9495    0.9536    0.9564
MS-SSIM                0.8965    0.9719    0.9636    0.9745    0.9721    0.9718    0.9741    0.9757
IW-SSIM                0.8673    0.9597    0.9473    0.9619    0.9587    0.9582    0.9617    0.9636
GMSD                   0.1229    0.0441    0.0490    0.0434    0.0452    0.0467    0.0437    0.0401
FSIM                   0.9439    0.9744    0.9716    0.9775    0.9764    0.9759    0.9782    0.9795
FSIMc                  0.9381    0.9737    0.9706    0.9767    0.9752    0.9746    0.9769    0.9788

Super-resolution
                       Training cost function
Image quality metric   Bilinear  `2        `1        MS-SSIM   Mix
1000 · `2              2.5697    1.2407    1.1062    1.3223    1.0990
PSNR                   27.16     30.66     31.26     30.11     31.34
1000 · `1              28.7764   20.4730   19.0643   22.3968   18.8983
SSIM                   0.8632    0.9274    0.9322    0.9290    0.9334
MS-SSIM                0.9603    0.9816    0.9826    0.9817    0.9829
IW-SSIM                0.9532    0.9868    0.9879    0.9866    0.9881
GMSD                   0.0714    0.0298    0.0259    0.0316    0.0255
FSIM                   0.9070    0.9600    0.9671    0.9601    0.9680
FSIMc                  0.9064    0.9596    0.9667    0.9597    0.9677

JPEG de-blocking
                       Training cost function
Image quality metric   JPEG      `2        `1        MS-SSIM   Mix
1000 · `2              0.6463    0.6511    0.6027    1.9262    0.5580
PSNR                   32.60     32.73     32.96     27.66     33.25
1000 · `1              16.5129   16.2633   16.0687   33.6134   15.5489
SSIM                   0.9410    0.9427    0.9467    0.9364    0.9501
MS-SSIM                0.9672    0.9692    0.9714    0.9674    0.9734
IW-SSIM                0.9527    0.9562    0.9591    0.9550    0.9625
GMSD                   0.0467    0.0427    0.0413    0.0468    0.0402
FSIM                   0.9805    0.9803    0.9825    0.9789    0.9830
FSIMc                  0.9791    0.9790    0.9809    0.9705    0.9815

TABLE I: Average value of different image quality metrics on the testing dataset for the different cost functions. For SSIM, MS-SSIM, IW-SSIM, GMSD, and FSIM the value reported here has been obtained as an average of the three color channels. Best results are shown in bold. (Lower is better for `1, `2, and GMSD; higher is better for the others.)

                       Training cost function
Image quality metric   `2 first, `1 after   `1 first, `2 after
1000 · `2              0.3939               0.3896
PSNR                   34.76                34.77
1000 · `1              12.7932              12.8919
SSIM                   0.9544               0.9540
MS-SSIM                0.9753               0.9748
IW-SSIM                0.9634               0.9624
GMSD                   0.0432               0.0405
FSIM                   0.9777               0.9781
FSIMc                  0.9770               0.9774

TABLE II: Average value of different image quality metrics for the networks trained for denoising + demosaicking with alternating loss functions. Bold indicates that the network achieves a better score than any of the networks in Table I, see Section V-A.

Fig. 7: `2 loss on the testing set for the two networks that switch loss functions during training.

This can be understood by looking at Figure 10(a), where the size of the support for the computation of SSIM for the three values of σ_G is shown for a pixel P close to an edge. A larger σ_G is more tolerant to the same amount of noise because it detects the presence of the edge, which is known to have a masking effect for the HVS: in this toy example, SSIM9 yields a higher value for the same pixel because its support spans both sides of the edge. Figure 10(b) shows SSIM across the edge of the same profile of (a), and particularly at pixel P. Note that SSIM9 estimates a higher quality in a larger region around the step.

Despite the size of its support, however, SSIM is not particularly sensitive to a uniform bias on a flat region. This is particularly true in bright regions, as shown in Figure 10(d), which plots the value of SSIM for the noisy signal of Figure 10(c). The same bias on the two sides of pixel S impacts SSIM's quality assessment differently. Because the term l(p) in Equation 5 measures the error in terms of contrast, it effectively reduces its impact when the background is bright. This is why, thanks to its multi-scale nature, MS-SSIM solves the issue of noise around edges, but does not solve the problem of the change of colors in flat areas, in particular when at least one channel is strong, as is the case with the sky.

Note that, while these are important observations because they generalize to any optimization problem that uses SSIM or MS-SSIM, the definition of a novel image quality metric is beyond the scope of this paper. We limit ourselves to the analysis of their performance within the context of neural networks.
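A two-line numerical example makes the bias argument concrete. Considering the luminance term l(p) of Equation 5 in isolation, with the customary C1 = 0.01² (an assumption) and intensities in [0, 1], the same absolute bias of 0.1 is discounted far more on a bright background:

```python
def luminance_term(mu_x, mu_y, c1=0.01 ** 2):
    # The l(p) term of Eq. 5, as a function of the local means only.
    return (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)

print(luminance_term(0.2, 0.3))  # ~0.923: dark background, bias penalized
print(luminance_term(0.7, 0.8))  # ~0.991: bright background, bias almost ignored
```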

(a) `1 (b) `1→`2 (c) `2 (d) `2→`1 (e) `1 (f) `1→`2 (g) `2 (h) `2→`1

Fig. 8: Visual results of the networks trained alternating loss functions. The insets here correspond to the regions marked in Figure 1. Note that `i → `j indicates a network trained first with `i and then with `j.

(a) Clean image (b) SSIM1 (c) SSIM3 (d) SSIM9 (e) SSIM1 (f) SSIM3 (g) SSIM9

Fig. 9: Comparison of the results of networks trained with SSIM with different sigmas (SSIMk means σ_G = k). Insets (b)–(d) show an increasingly large halo of noise around the edge: smaller values of σ_G help at edges. However, in mostly flat regions, larger values of σ_G help reducing the splotchy artifacts (e)–(g). Best viewed by zooming in on the electronic copy.

Finally, we would like to point out that using SSIM-based losses in the context of color images introduces an approximation, since these metrics were originally designed for grayscale images. A more principled approach would be to base the loss function on a metric defined specifically for color images, such as FSIMc. However, to the best of our knowledge, no such metric has been proposed that is also differentiable. We plan on addressing this problem in future work.

C. Are we committing the inverse crime?

To numerically evaluate and compare the results of the different loss functions, we need access to the ground truth. For the applications described in this paper, this is only possible if we generate synthetic data. For instance, for the super-resolution experiments, we generate the low-resolution inputs by low-passing and down-sampling the original images. However, when generating synthetic data, one needs to be careful about the validity of the results. In the context of inverse problems, Colton and Kress introduce the concept of inverse crime, which is "committed" when synthetic data is fabricated with a forward model that is related to the inverse solver ([29], page 133), potentially hiding shortcomings of the approach. By this definition, it could be argued that we did commit the inverse crime, as we generate the synthetic data for both training and testing with the same process. However, two considerations support our choice. First, while in general the inverse crime is to be avoided, the forward model should resemble the physical system as closely as possible ([29], page 163).
In the super-resolution example described above, for instance, performing low-passing followed by down-sampling mimics what the PSF of the lens and the sensor do to the incoming light. Second, we are studying the effect of the loss layer when all other conditions are equal: if we change the forward model, the quality of the results may degrade, but we expect the relationship between losses to hold. This is exactly what we observed when we tested the networks trained with the different losses on data generated with a varying standard deviation for the Gaussian filter. Changing the standard deviation from 3 to 5 pixels, for instance, causes the average MS-SSIM yielded by the network trained on Mix to drop from 0.9829 to 0.9752. However, the networks trained with `1 and `2 experience a similar drop, yielding an average MS-SSIM of 0.9744 and 0.9748 respectively. We observed a similar behavior for the other metrics. These considerations also apply to joint denoising and demosaicking, as well as JPEG de-blocking.

VI. CONCLUSIONS

We focus on an aspect of neural networks that is usually overlooked in the context of image restoration: the loss layer. We propose several alternatives to `2, which is the de facto standard, and we also define a novel loss. We use the problems of joint denoising and demosaicking, super-resolution, and JPEG artifacts removal for our tests. We offer a thorough analysis of the results in terms of both traditional and perceptually-motivated metrics, and show that the network trained with the proposed loss outperforms other networks. We also notice that the poor performance of `2 is partially related to its convergence properties, and this can help improve results of other `2-based approaches. Because the networks we use are fully convolutional, they are extremely efficient, as they do not require an aggregation step. Nevertheless, thanks to the loss we propose, our joint denoising and demosaicking network outperforms CFA-BM3D, a variant of BM3D tuned for denoising in the Bayer domain, which is the state-of-the-art denoising algorithm. Table I shows the superiority of Mix over all the other losses. This can also be appreciated by visual inspection even for networks trained with `1, the closest competitor, which produce stronger artifacts (see, for instance, Figure 3). We also make the implementation of the layers described in this paper available to the research community.

Fig. 10: Panels (a) and (b) show how the size of the support affects SSIM for a pixel P close to an edge, in the presence of zero-mean Gaussian noise. Panels (c) and (d) show that SSIM is less sensitive to a uniform bias in bright regions, such as the one to the right of pixel S (see Section V-B).

REFERENCES

[1] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4) (1989) 541–551
[2] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
[3] Burger, H., Schuler, C., Harmeling, S.: Image denoising: Can plain neural networks compete with BM3D? In: IEEE Conference on Computer Vision and Pattern Recognition. (June 2012) 2392–2399
[4] Xu, L., Ren, J.S., Liu, C., Jia, J.: Deep convolutional neural network for image deconvolution. In: NIPS. (2014) 1790–1798
[5] Zeiler, M., Krishnan, D., Taylor, G., Fergus, R.: Deconvolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition. (2010) 2528–2535
[6] Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: IEEE Conference on Computer Vision and Pattern Recognition. (2015)
[7] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. CoRR arXiv:1207.0580 (2012)
[8] Jain, V., Seung, S.: Natural image denoising with convolutional networks. In Koller, D., Schuurmans, D., Bengio, Y., Bottou, L., eds.: Advances in Neural Information Processing Systems 21. Curran Associates, Inc. (2009) 769–776
[9] Dong, C., Loy, C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision. (2014) 184–199
[10] Wang, Y.Q.: A multilayer neural network for image demosaicking. In: IEEE International Conference on Image Processing. (2014) 1852–1856
[11] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093 (2014)
[12] Zhang, L., Zhang, L., Mou, X., Zhang, D.: A comprehensive evaluation of full reference image quality assessment algorithms. In: IEEE International Conference on Image Processing. (2012) 1477–1480
[13] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4) (2004) 600–612
[14] Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: Asilomar Conference on Signals, Systems and Computers. Volume 2. (2003) 1398–1402
[15] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR arXiv:1502.01852 (2015)
[16] Wang, Z., Bovik, A.: Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26(1) (2009) 98–117
[17] Brunet, D., Vrscay, E.R., Wang, Z.: Structural similarity-based approximation of signals and images using orthogonal bases. In: International Conference on Image Analysis and Recognition. (2010) 11–22
[18] Rehman, A., Rostami, M., Wang, Z., Brunet, D., Vrscay, E.: SSIM-inspired image restoration using sparse representation. EURASIP Journal on Advances in Signal Processing 2012(1) (2012)
[19] Öztireli, A.C., Gross, M.: Perceptually based downscaling of images. ACM Trans. Graph. 34(4) (2015) 77:1–77:10
[20] Wang, Z., Li, Q.: Information content weighting for perceptual image quality assessment. IEEE Transactions on Image Processing 20(5) (2011) 1185–1198
[21] Sheikh, H., Bovik, A.: Image information and visual quality. IEEE Transactions on Image Processing 15(2) (2006) 430–444
[22] Xue, W., Zhang, L., Mou, X., Bovik, A.: Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing 23(2) (2014) 684–695
[23] Zhang, L., Zhang, D., Mou, X., Zhang, D.: FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing 20(8) (2011) 2378–2386
[24] Winkler, S., Süsstrunk, S.: Visibility of noise in natural images. In: Electronic Imaging 2004, International Society for Optics and Photonics (2004) 121–129
[25] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: IEEE Conference on Computer Vision and Pattern Recognition. (2011)
[26] Foi, A.: Clipped noisy images: Heteroskedastic modeling and practical denoising. Signal Processing 89(12) (2009) 2609–2629
[27] Danielyan, A., Vehvilainen, M., Foi, A., Katkovnik, V., Egiazarian, K.: Cross-color BM3D filtering of noisy raw data. In: Intern. Workshop on Local and Non-Local Approximation in Image Processing. (2009) 125–129
[28] Zhang, D., Wu, X.: Color demosaicking via directional linear minimum mean square-error estimation. IEEE Trans. on Image Processing 14(12) (2005) 2167–2178
[29] Colton, D., Kress, R.: Inverse acoustic and electromagnetic scattering theory. Volume 93. Springer Science & Business Media (2012)

Hang Zhao received a B.E. degree in information engineering from Zhejiang University, China in 2013. He is currently a Ph.D. student at the Massachusetts Institute of Technology. His research interests include computer vision, machine learning, and their robotic applications.

Orazio Gallo earned an M.S. degree in Biomedical Engineering from Politecnico di Milano in 2004, and a Ph.D. in Computer Engineering from the University of California at Santa Cruz in 2011. From 2004 to 2006 he was a research assistant at the Smith-Kettlewell Eye Research Institute. In 2011, Orazio joined NVIDIA, where he is a senior research scientist. His research interests range from computational photography, to image processing, applied visual perception, and computer vision. Orazio is also an associate editor of the IEEE Transactions on Computational Imaging, as well as Signal Processing: Image Communication.

Iuri Frosio got his M.S. and Ph.D. degrees in biomedical engineering at the Politecnico di Milano in 2003 and 2006, respectively. He was a research fellow at the Computer Science Department of the Università degli Studi di Milano from 2003 to 2006, and an assistant professor in the same department from 2006 to 2013. In the same period, he worked as a consultant for various companies in Italy and abroad. He joined NVIDIA in 2014 as a senior research scientist. His research interests include image processing, computer vision, inertial sensors, and machine learning. Iuri is also an associate editor of the Journal of Electronic Imaging.

Jan Kautz leads the Visual Computing Research team at NVIDIA, working predominantly on computer vision problems—from low-level vision (denoising, super-resolution, computational photography) and geometric vision (structure from motion, SLAM, optical flow) to high-level vision (detection, recognition, classification), as well as machine learning problems such as deep reinforcement learning and efficient deep learning. Before joining NVIDIA in 2013, Jan was a tenured faculty member at University College London. He holds a BSc in Computer Science from University of Erlangen-Nürnberg (1999), an MMath from the University of Waterloo (1999), received his Ph.D. from the Max-Planck-Institut für Informatik (2003), and worked as a post-doc at the Massachusetts Institute of Technology (2003-2006). Jan was program co-chair of the Eurographics Symposium on Rendering 2007, program chair of the IEEE Symposium on Interactive Ray-Tracing 2008, program co-chair for Pacific Graphics 2011, and program chair of CVMP 2012. He co-chaired Eurographics 2014 and was on the editorial board of IEEE Transactions on Visualization & Computer Graphics as well as The Visual Computer.