Noise2Self: Blind Denoising by Self-Supervision

Joshua Batson* and Loic Royer*
*Equal contribution. Chan-Zuckerberg Biohub. Correspondence to: Joshua Batson <[email protected]>, Loic Royer <[email protected]>.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We propose a general framework for denoising high-dimensional measurements which requires no prior on the signal, no estimate of the noise, and no clean training data. The only assumption is that the noise exhibits statistical independence across different dimensions of the measurement, while the true signal exhibits some correlation. For a broad class of functions ("$\mathcal{J}$-invariant"), it is then possible to estimate the performance of a denoiser from noisy data alone. This allows us to calibrate $\mathcal{J}$-invariant versions of any parameterised denoising algorithm, from the single hyperparameter of a median filter to the millions of weights of a deep neural network. We demonstrate this on natural image and microscopy data, where we exploit noise independence between pixels, and on single-cell gene expression data, where we exploit independence between detections of individual molecules. This framework generalizes recent work on training neural nets from noisy images and on cross-validation for matrix factorization.

1. Introduction

We would often like to reconstruct a signal from high-dimensional measurements that are corrupted, under-sampled, or otherwise noisy. Devices like high-resolution cameras, electron microscopes, and DNA sequencers are capable of producing measurements in the thousands to millions of feature dimensions. But when these devices are pushed to their limits, taking videos with ultra-fast frame rates at very low illumination, probing individual molecules with electron microscopes, or sequencing tens of thousands of cells simultaneously, each individual feature can become quite noisy. Nevertheless, the objects being studied are often very structured and the values of different features are highly correlated. Speaking loosely, if the "latent dimension" of the space of objects under study is much lower than the dimension of the measurement, it may be possible to implicitly learn that structure, denoise the measurements, and recover the signal without any prior knowledge of the signal or the noise.

Traditional denoising methods each exploit a property of the noise, such as Gaussianity, or structure in the signal, such as spatiotemporal smoothness, self-similarity, or having low rank. The performance of these methods is limited by the accuracy of their assumptions. For example, if the data are genuinely not low rank, then a low-rank model will fit it poorly. This requires prior knowledge of the signal structure, which limits application to new domains and modalities. These methods also require calibration, as hyperparameters such as the degree of smoothness, the scale of self-similarity, or the rank of a matrix have dramatic impacts on performance.

In contrast, a data-driven prior, such as pairs $(x_i, y_i)$ of noisy and clean measurements of the same target, can be used to set up a supervised learning problem. A neural net trained to predict $y_i$ from $x_i$ may be used to denoise new noisy measurements (Weigert et al., 2018). As long as the new data are drawn from the same distribution, one can expect performance similar to that observed during training. Lehtinen et al. demonstrated that clean targets are unnecessary (2018). A neural net trained on pairs $(x_i, x_i')$ of independent noisy measurements of the same target will, under certain distributional assumptions, learn to predict the clean signal. These supervised approaches extend to image denoising the success of convolutional neural nets, which currently give state-of-the-art performance for a vast range of image-to-image tasks.
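The difference between these two training regimes lies only in the choice of target. The following is a minimal sketch, assuming a generic PyTorch image-to-image model and a data loader yielding matched noisy/clean/second-noisy triples; all names here are hypothetical and not taken from the papers cited above.

```python
# Hypothetical sketch (names not from the paper): training an image-to-image
# denoising network in the two regimes described above. The only difference
# between training on clean targets (Weigert et al., 2018) and Noise2Noise
# training (Lehtinen et al., 2018) is which tensor is used as the target.
import torch
import torch.nn as nn

def train_denoiser(model, loader, target_kind="clean", epochs=1, lr=1e-4):
    """`loader` is assumed to yield (noisy, clean, second_noisy) triples.

    target_kind="clean"   -> supervised pairs (x_i, y_i)
    target_kind="noisy2"  -> independent noisy pairs (x_i, x_i')
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for noisy, clean, second_noisy in loader:
            target = clean if target_kind == "clean" else second_noisy
            optimizer.zero_grad()
            loss = mse(model(noisy), target)
            loss.backward()
            optimizer.step()
    return model
```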
Both of these methods require an experimental setup in which each target may be measured multiple times, which can be difficult in practice.

In this paper, we propose a framework for blind denoising based on self-supervision. We use groups of features whose noise is independent conditional on the true signal to predict one another. This allows us to learn denoising functions from single noisy measurements of each object, with performance close to that of supervised methods. The same approach can also be used to calibrate traditional image denoising methods such as median filters and non-local means, and, using a different independence structure, denoise highly under-sampled single-cell gene expression data.

Figure 1. (a) The box represents the dimensions of the measurement $x$. $J$ is a subset of the dimensions, and $f$ is a $J$-invariant function: it has the property that the value of $f(x)$ restricted to dimensions in $J$, $f(x)_J$, does not depend on the value of $x$ restricted to $J$, $x_J$. This enables self-supervision when the noise in the data is conditionally independent between sets of dimensions. Here are 3 examples of dimension partitioning: (b) two independent image acquisitions, (c) independent pixels of a single image, (d) independently detected RNA molecules from a single cell.

We model the signal $y$ and its noisy measurement $x$ as a pair of random variables in $\mathbb{R}^m$. If $J \subset \{1, \dots, m\}$ is a subset of the dimensions, we write $x_J$ for $x$ restricted to $J$.

Definition. Let $\mathcal{J}$ be a partition of the dimensions $\{1, \dots, m\}$ and let $J \in \mathcal{J}$. A function $f : \mathbb{R}^m \to \mathbb{R}^m$ is $J$-invariant if $f(x)_J$ does not depend on the value of $x_J$. It is $\mathcal{J}$-invariant if it is $J$-invariant for each $J \in \mathcal{J}$.

We propose minimizing the self-supervised loss

$$\mathcal{L}(f) = \mathbb{E}\,\|f(x) - x\|^2 \qquad (1)$$

over $\mathcal{J}$-invariant functions $f$. Since $f$ has to use information from outside of each subset of dimensions $J$ to predict the values inside of $J$, it cannot merely be the identity.
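As an illustration of how this loss can be evaluated in practice, here is a minimal sketch, not the authors' reference implementation and with all function names hypothetical: it wraps an arbitrary denoiser $g$ so that it becomes $\mathcal{J}$-invariant for a grid partition of the pixels, and then computes the empirical version of the loss in equation (1).

```python
# A minimal sketch, not the authors' reference implementation: wrap an arbitrary
# denoiser g so that it becomes J-invariant for a grid partition of the pixels,
# and evaluate the empirical self-supervised loss of equation (1).
import numpy as np
from scipy.ndimage import convolve

# 3x3 "donut" kernel: the average of a pixel's 8 neighbors, excluding the pixel
# itself, so the interpolated value at a pixel never depends on that pixel.
_NEIGHBOR_MEAN = np.array([[1, 1, 1],
                           [1, 0, 1],
                           [1, 1, 1]]) / 8.0

def j_invariant(g, x, k=4):
    """Apply a denoiser g (2D array -> 2D array) in a J-invariant way.

    J is the partition of pixels into k*k subsets by their (row % k, col % k)
    phase. For each subset J, the pixels in J are replaced by the neighbor
    interpolation before g is applied, and g's output is kept only on J, so the
    result on J does not depend on the input values on J.
    """
    x = np.asarray(x, dtype=float)
    interp = convolve(x, _NEIGHBOR_MEAN, mode="reflect")
    rows, cols = np.indices(x.shape)
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            mask = (rows % k == i) & (cols % k == j)
            masked_input = np.where(mask, interp, x)  # hide x_J from g
            out[mask] = g(masked_input)[mask]         # keep the prediction on J only
    return out

def self_supervised_loss(g, x, k=4):
    """Empirical version of equation (1): mean of (f(x) - x)^2 over all pixels."""
    x = np.asarray(x, dtype=float)
    return np.mean((j_invariant(g, x, k=k) - x) ** 2)
```

For a classical filter with a single hyperparameter, such as a median filter of a given radius, this loss can be computed for each candidate value of the hyperparameter and the value with the smallest loss selected, without ever seeing clean data; this is the calibration strategy described below.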
Proposition 1. Suppose $x$ is an unbiased estimator of $y$, i.e., $\mathbb{E}[x \mid y] = y$, and the noise in each subset $J \in \mathcal{J}$ is independent from the noise in its complement $J^c$, conditional on $y$. Let $f$ be $\mathcal{J}$-invariant. Then

$$\mathbb{E}\,\|f(x) - x\|^2 = \mathbb{E}\,\|f(x) - y\|^2 + \mathbb{E}\,\|x - y\|^2. \qquad (2)$$

That is, the self-supervised loss is the sum of the ordinary supervised loss and the variance of the noise. By minimizing the self-supervised loss over a class of $\mathcal{J}$-invariant functions, one may find the optimal denoiser for a given dataset.

For example, if the signal is an image with independent, mean-zero noise in each pixel, we may choose $\mathcal{J} = \{\{1\}, \dots, \{m\}\}$ to be the singletons of each coordinate. Then "donut" median filters, with a hole in the center, form a class of $\mathcal{J}$-invariant functions, and by comparing the value of the self-supervised loss at different filter radii, we are able to select the optimal radius for denoising the image at hand (see §3).

The donut median filter has just one parameter and therefore limited ability to adapt to the data. At the other extreme, we may search over all $\mathcal{J}$-invariant functions for the global optimum:

Proposition 2. The $\mathcal{J}$-invariant function $f^*_{\mathcal{J}}$ minimizing (1) satisfies

$$f^*_{\mathcal{J}}(x)_J = \mathbb{E}[\,y_J \mid x_{J^c}\,]$$

for each subset $J \in \mathcal{J}$.

That is, the optimal $\mathcal{J}$-invariant predictor for the dimensions of $y$ in some $J \in \mathcal{J}$ is their expected value conditional on observing the dimensions of $x$ outside of $J$.

In §4, we use analytical examples to illustrate how the optimal $\mathcal{J}$-invariant denoising function approaches the optimal general denoising function as the amount of correlation between features in the data increases.

In practice, we may attempt to approximate the optimal denoiser by searching over a very large class of functions, such as deep neural networks with millions of parameters. In §5, we show that a deep convolutional network, modified to become $\mathcal{J}$-invariant using a masking procedure, can achieve state-of-the-art blind denoising performance on three diverse datasets.

Sample code is available on GitHub (https://github.com/czbiohub/noise2self) and deferred proofs are contained in the Supplement.

2. Related Work

Each approach to blind denoising relies on assumptions about the structure of the signal and/or the noise. We review the major categories of assumption below, and the traditional and modern methods that utilize them. Most of the methods below are described in terms of application to image denoising, which has the richest literature, but some have natural extensions to other spatiotemporal signals and to generic measurements of vectors.

Smoothness: Natural images and other spatiotemporal signals are often assumed to vary smoothly (Buades et al., 2005b). Local averaging, using a Gaussian, median, or some other filter, is a simple way to smooth out a noisy input. The degree of smoothing to use, e.g., the width of a filter, is a hyperparameter often tuned by visual inspection.

Self-Similarity: Natural images are often self-similar, in that each patch in an image is similar to many other patches in the same image.

An autoencoder provides a general framework for learnable compression: each sample is mapped to a low-dimensional representation (the value of the neural net at the bottleneck layer) and then back to the original space (Gallinari et al., 1987; Vincent et al., 2010). An autoencoder trained on noisy data may produce cleaner data as its output. The degree of compression is determined by the width of the bottleneck layer.
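As a concrete picture of that last point, here is a minimal sketch, not taken from any of the works cited above, of a denoising autoencoder whose bottleneck width is the compression hyperparameter; all names are hypothetical.

```python
# A minimal sketch (not from the paper) of the denoising-autoencoder idea above:
# the width of the bottleneck layer controls the degree of compression, and the
# network is trained to reconstruct its own noisy input.
import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_autoencoder(model, noisy_batches, epochs=10, lr=1e-3):
    """Minimize the reconstruction error of noisy samples against themselves."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in noisy_batches:
            optimizer.zero_grad()
            loss = mse(model(x), x)
            loss.backward()
            optimizer.step()
    return model
```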