Noise Learning Based Denoising Autoencoder

Woong-Hee Lee, Mustafa Ozger, Ursula Challita, and Ki Won Sung, Member, IEEE

arXiv:2101.07937v2 [cs.LG] 21 Jun 2021

This work was supported by a Korea University Grant. This research was supported by the BK21 FOUR (Fostering Outstanding Universities for Research) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF). This work was partly funded by the European Union Horizon 2020 Research and Innovation Programme under the EU/KR PriMO-5G project with grant agreement No. 815191. (Corresponding author: Ki Won Sung.)

W.-H. Lee is with the Department of Control and Instrumentation Engineering, Korea University, Republic of Korea (e-mail: [email protected]). U. Challita is with Ericsson Research, Stockholm, Sweden (e-mail: [email protected]). M. Ozger and K. W. Sung are with the School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: [email protected]).
Abstract—This letter introduces a new denoiser that modifies the structure of the denoising autoencoder (DAE), namely the noise learning based DAE (nlDAE). The proposed nlDAE learns the noise of the input data. The denoising is then performed by subtracting the regenerated noise from the noisy input. Hence, nlDAE is more effective than DAE when the noise is simpler to regenerate than the original data. To validate the performance of nlDAE, we provide three case studies: signal restoration, symbol demodulation, and precise localization. Numerical results suggest that nlDAE requires a smaller latent space dimension and a smaller training dataset compared to DAE.

Index Terms—machine learning, noise learning based denoising autoencoder, signal restoration, symbol demodulation, precise localization.

I. INTRODUCTION

Machine learning (ML) has recently received much attention as a key enabler for future wireless communications [1], [2], [3]. While the major research effort has been put into deep neural networks, there is an enormous number of Internet of Things (IoT) devices that are severely constrained in computational power and memory size. Therefore, the implementation of efficient ML algorithms is an important challenge for IoT devices, as they are energy and memory limited. The denoising autoencoder (DAE) is a promising technique to improve the performance of IoT applications by denoising the observed data, which consists of the original data and the noise [4]. DAE is a neural network model for the construction of learned representations that are robust to the addition of noise to the input samples [?], [5]. The representative feature of DAE is that the dimension of the latent space is smaller than the size of the input vector, which means that the neural network model is capable of encoding and decoding through a smaller dimension in which the data can be represented.

The main contribution of this letter is to improve the efficiency and performance of DAE with a modification of its structure. Consider a noisy observation Y which consists of the original data X and the noise N, i.e., Y = X + N. From the information-theoretic perspective, DAE attempts to minimize the expected reconstruction error by maximizing a lower bound on the mutual information I(X; Y). In other words, Y should capture the information of X as much as possible although Y is a function of the noisy input. Additionally, from the manifold learning perspective, DAE can be seen as a way to find a manifold on which Y represents the data in a low-dimensional latent space corresponding to X. However, we often face the problem that the stochastic feature of X to be restored is too complex to regenerate or represent. This is called the curse of dimensionality, i.e., the dimension of the latent space for X is still too high in many cases.

What can we do if N is simpler to regenerate than X? It will be more effective to learn N and subtract it from Y instead of learning X directly. In this light, we propose a new denoising framework, named noise learning based DAE (nlDAE). The main advantage of nlDAE is that it maximizes the efficiency of the ML approach (e.g., the required dimension of the latent space or the size of the training dataset) for capability-constrained devices, e.g., IoT devices, where N is typically easier to regenerate than X owing to their stochastic characteristics. To verify the advantage of nlDAE over the conventional DAE, we provide three practical applications as case studies: signal restoration, symbol demodulation, and precise localization.

The following notations will be used throughout this letter.
• Ber, Exp, U, N, CN: the Bernoulli, exponential, uniform, normal, and complex normal distributions, respectively.
• x, n, y ∈ R^P: the realization vectors of the random variables X, N, Y, respectively, whose dimension is P.
• P' (< P): the dimension of the latent space.
• W ∈ R^{P'×P}, W' ∈ R^{P×P'}: the weight matrices for encoding and decoding, respectively.
• b ∈ R^{P'}, b' ∈ R^{P}: the bias vectors for encoding and decoding, respectively.
• S: the sigmoid function, acting as the activation function of the neural networks, i.e., S(a) = 1/(1 + e^{-a}), and S(a) = (S(a[1]), ..., S(a[P]))^T where a ∈ R^P is an arbitrary input vector.
• f_θ: the encoding function, where the parameter θ is {W, b}, i.e., f_θ(y) = S(Wy + b).
• g_{θ'}: the decoding function, where the parameter θ' is {W', b'}, i.e., g_{θ'}(f_θ(y)) = S(W' f_θ(y) + b') (a short code sketch of these two functions follows this list).
• M: the size of the training dataset.
• L: the size of the test dataset.
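For concreteness, the encoder f_θ and decoder g_{θ'} defined above are single sigmoid layers. The following minimal NumPy sketch illustrates them for a P-dimensional input and a P'-dimensional latent space; it is an illustration only (the letter provides no code), and names such as W_dec and P_latent are ours, not the authors'.

```python
import numpy as np

P, P_latent = 12, 9  # input dimension P and latent dimension P' (< P)

rng = np.random.default_rng(0)
# theta = {W, b} for encoding, theta' = {W', b'} for decoding (random init, for illustration)
W = rng.normal(scale=0.1, size=(P_latent, P))      # W  in R^{P'×P}
b = np.zeros(P_latent)                             # b  in R^{P'}
W_dec = rng.normal(scale=0.1, size=(P, P_latent))  # W' in R^{P×P'}
b_dec = np.zeros(P)                                # b' in R^{P}

def S(a):
    """Element-wise sigmoid activation S(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def f_theta(y):
    """Encoding function f_θ(y) = S(W y + b)."""
    return S(W @ y + b)

def g_theta_prime(h):
    """Decoding function g_θ'(h) = S(W' h + b')."""
    return S(W_dec @ h + b_dec)
```

A reconstruction of a noisy sample y is then simply g_theta_prime(f_theta(y)).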
II. METHOD OF NLDAE

In the traditional estimation problem of signal processing, N is treated as an obstacle to the reconstruction of X. Therefore, most studies have focused on restoring X as much as possible, which can be expressed as a function of X and N. Along with this philosophy, ML-based denoising techniques, e.g., DAE, have also been developed in various signal processing fields with the aim of maximizing the ability to restore X from Y. Unlike the conventional approaches, we hypothesize that, if N has a simpler statistical characteristic than X, it is better to subtract it from Y after restoring N.

We first look into the mechanism of DAE to build the neural networks. Recall that DAE attempts to regenerate the original data x from the noisy observation y via training the neural network. Thus, the parameters of a DAE model can be optimized by minimizing the average reconstruction error in the training phase as follows:

θ*, θ'* = arg min_{θ,θ'} (1/M) Σ_{i=1}^{M} L(x^{(i)}, g_{θ'}(f_θ(y^{(i)}))),   (1)

where L is a loss function such as the squared error between the two inputs. Then, the j-th regenerated data x̃^{(j)} from y^{(j)} in the test phase can be obtained as follows for all j ∈ {1, ..., L}:

x̃^{(j)} = g_{θ'*}(f_{θ*}(y^{(j)})).   (2)

It is noteworthy that, if there are two different neural networks which attempt to regenerate the original data and the noise from the noisy input, the linear summation of these two regenerated data would be different from the input.

Fig. 1: An illustration of the concept of nlDAE: (a) the training phase and (b) the test phase, showing encoding and decoding over the input, hidden, and output layers.

The training and test phases of nlDAE are depicted in Fig. 1. The parameters of the nlDAE model can be optimized as follows for all i ∈ {1, ..., M}:

θ_nl*, θ'_nl* = arg min_{θ,θ'} (1/M) Σ_{i=1}^{M} L(n^{(i)}, g_{θ'}(f_θ(y^{(i)}))).   (3)

Notice that the only difference from (1) is that x^{(i)} is replaced by n^{(i)}. Let x̃_nl^{(j)} denote the j-th regenerated data based on nlDAE, which can be represented as follows for all j ∈ {1, ..., L}:

x̃_nl^{(j)} = y^{(j)} − g_{θ'_nl*}(f_{θ_nl*}(y^{(j)})).   (4)
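To make the contrast between (1)–(2) and (3)–(4) concrete, the sketch below trains two autoencoders of the shape defined in Section I, one on clean targets x^{(i)} (DAE) and one on noise targets n^{(i)} = y^{(i)} − x^{(i)} (nlDAE), and then denoises according to (2) and (4). It is a minimal NumPy illustration with plain batch gradient descent and a squared-error loss, not the authors' implementation; the training details (optimizer, learning rate, epochs, data normalization) are assumptions.

```python
import numpy as np

def S(a):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(Y, T, P_latent, epochs=500, lr=0.5, seed=0):
    """Fit g_θ'(f_θ(·)) to map noisy inputs Y (M×P) to targets T (M×P) by
    minimizing the average squared error, i.e., (1) when T holds the clean
    data x and (3) when T holds the noise n. Plain batch gradient descent."""
    M, P = Y.shape
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(P_latent, P)); b1 = np.zeros(P_latent)
    W2 = rng.normal(scale=0.1, size=(P, P_latent)); b2 = np.zeros(P)
    for _ in range(epochs):
        H = S(Y @ W1.T + b1)              # encode: f_θ(y)
        R = S(H @ W2.T + b2)              # decode: g_θ'(f_θ(y))
        D2 = (R - T) * R * (1.0 - R)      # error signal through the output sigmoid
        D1 = (D2 @ W2) * H * (1.0 - H)    # error signal through the hidden sigmoid
        W2 -= lr * D2.T @ H / M; b2 -= lr * D2.mean(axis=0)
        W1 -= lr * D1.T @ Y / M; b1 -= lr * D1.mean(axis=0)
    return W1, b1, W2, b2

def denoise(Y, params, learned_noise=False):
    """Apply (2) for DAE (learned_noise=False) or (4) for nlDAE (learned_noise=True)."""
    W1, b1, W2, b2 = params
    R = S(S(Y @ W1.T + b1) @ W2.T + b2)   # regenerated data (DAE) or regenerated noise (nlDAE)
    return Y - R if learned_noise else R

# Usage (inputs and targets assumed scaled to (0, 1) to match the sigmoid output range):
#   dae   = train_autoencoder(Y_train, X_train,           P_latent=9)
#   nldae = train_autoencoder(Y_train, Y_train - X_train, P_latent=9)
#   x_dae = denoise(Y_test, dae)          # eq. (2)
#   x_nl  = denoise(Y_test, nldae, True)  # eq. (4)
```

Sweeping σ_N, generating (x, n, y) according to the examples below, and comparing the MSE of x_dae and x_nl against the clean test data would give a comparison in the spirit of Fig. 2.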
To provide the reader with insights into nlDAE, we examine two simple examples where the standard deviation of X is fixed to 1, i.e., σ_X = 1, and that of N varies. Y = X + N is composed as follows:
• Example 1: X ∼ U(0, 2√3) and N ∼ N(0, σ_N).
• Example 2: X ∼ Exp(1) and N ∼ N(0, σ_N).

Fig. 2: A simple example of comparison between DAE and nlDAE: reconstruction error (MSE) according to σ_N; the plot marks the region where nlDAE is better than DAE.

Fig. 2 describes the performance comparison between DAE and nlDAE in terms of the mean squared error (MSE) for the two examples. Here, we set P = 12, P' = 9, M = 10000, and L = 5000. It is observed in Fig. 2 that nlDAE is superior to DAE when σ_N is smaller than σ_X. The gap between nlDAE and DAE widens with lower σ_N. This implies that the standard deviation is an important factor when we select the denoiser between DAE and nlDAE.

These examples show that the choice depends on whether X or N is easier to regenerate, which is highly related to the differential entropy of each random variable, H(X) and H(N) [?]. The differential entropy is normally an increasing function of the standard deviation of the corresponding random variable, e.g., H(N) = log(σ_N √(2πe)). Naturally, it is efficient to reconstruct a random variable with a small amount of information, and the standard deviation can be a good indicator.

III. CASE STUDIES

To validate the advantage of nlDAE over the conventional