
Learning Sparse Wavelet Representations

Daniel Recoskie∗ and Richard Mann
University of Waterloo, David R. Cheriton School of Computer Science, Waterloo, Canada
{dprecosk, mannr}@uwaterloo.ca

arXiv:1802.02961v1 [cs.LG] 8 Feb 2018

Abstract

In this work we propose a method for learning wavelet filters directly from data. We accomplish this by framing the discrete wavelet transform as a modified convolutional neural network. We introduce an autoencoder wavelet transform network that is trained using gradient descent. We show that the model is capable of learning structured wavelet filters from synthetic and real data. The learned wavelets are shown to be similar to traditional wavelets that are derived using Fourier methods. Our method is simple to implement and easily incorporated into neural network architectures. A major advantage of our model is that we can learn from raw audio data.

1 Introduction

The wavelet transform has several useful properties that make it a good choice for a feature representation, including a linear time algorithm, perfect reconstruction, and the ability to tailor wavelet functions to the application. However, the wavelet transform is not widely used in the machine learning community. Instead, methods like the Fourier transform and its variants are often used (e.g. [5]). We believe that one cause of the lack of use is the difficulty in designing and selecting appropriate wavelet functions. Wavelet filters are typically derived analytically using Fourier methods. Furthermore, there are many different wavelet functions to choose from. Without a deep understanding of wavelet theory, it can be difficult to know which wavelet to choose. This difficulty may lead many to stick to simpler methods.

We propose a method that learns wavelet functions directly from data using a neural network framework. As such, we can leverage the theoretical properties of the wavelet transform without the difficult task of designing or choosing a wavelet function. An advantage of this method is that we are able to learn directly from raw audio data. Learning from raw audio has shown success in audio generation [10].

We are not the first to propose using wavelets in neural network architectures. There has been previous work in using fixed wavelet filters in neural networks, such as the wavelet network [16] and the scattering transform [9]. Unlike our proposed method, these works do not learn wavelet functions from data.

∗ Thanks to NSERC for funding.

One notable work involving learning wavelets

can be found in [12]. Though the authors also propose learning wavelets from data, there are several differences from our work. One major difference is that second generation wavelets are considered instead of the traditional (first generation) wavelets considered here [15]. Secondly, the domain of the signals was over the vertices of graphs, as opposed to R.

We begin our discussion with the wavelet transform. We will provide some mathematical background as well as outline the discrete wavelet transform algorithm. Next, we outline our proposed wavelet transform model. We show that we can represent the wavelet transform as a modified convolutional neural network. We then evaluate our model by demonstrating that we can learn useful wavelet functions by using an architecture similar to traditional autoencoders [6].

2 Wavelet transform

We choose to focus on a specific type of linear time-frequency transform known as the wavelet transform. The wavelet transform makes use of a dictionary of wavelet functions that are dilated and shifted versions of a mother wavelet. The mother wavelet, ψ, is constrained to have zero mean and unit norm. The dilated and scaled wavelet functions are of the form

    \psi_j[n] = \frac{1}{2^j} \psi\left(\frac{n}{2^j}\right)    (1)

where n, j ∈ Z. The discrete wavelet transform is defined as

    W x[n, 2^j] = \sum_{m=0}^{N-1} x[m] \psi_j^*[m - n]    (2)

for a discrete real signal x.

The wavelet functions can be thought of as a bandpass filter bank. The wavelet transform is then a decomposition of a signal with this filter bank. Since the wavelets are bandpass, we require the notion of a lowpass scaling function that is the sum of all wavelets above a certain scale j in order to fully represent the signal. We define the scaling function, φ, such that its Fourier transform, φ̂, satisfies

    |\hat{\phi}(\omega)|^2 = \int_1^{+\infty} \frac{|\hat{\psi}(s\omega)|^2}{s} \, ds    (3)

with the phase of φ̂ being arbitrary [8].

The discrete wavelet transform and its inverse can be computed via a fast decimating algorithm. Let us define two filters

    h[n] = \left\langle \frac{1}{\sqrt{2}} \phi\left(\frac{t}{2}\right), \phi(t - n) \right\rangle    (4)

    g[n] = \left\langle \frac{1}{\sqrt{2}} \psi\left(\frac{t}{2}\right), \phi(t - n) \right\rangle    (5)

The following equations connect the wavelet coefficients to the filters h and g, and give rise to a recursive algorithm for computing the wavelet transform.

Wavelet Filter Bank Decomposition:

    a_{j+1}[p] = \sum_{n=-\infty}^{+\infty} h[n - 2p] a_j[n]    (6)

    d_{j+1}[p] = \sum_{n=-\infty}^{+\infty} g[n - 2p] a_j[n]    (7)

Wavelet Filter Bank Reconstruction:

    a_j[p] = \sum_{n=-\infty}^{+\infty} h[p - 2n] a_{j+1}[n] + \sum_{n=-\infty}^{+\infty} g[p - 2n] d_{j+1}[n]    (8)
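The recursion in Equations 6-8 can be sketched directly in code. The following is a minimal NumPy illustration of one decomposition step and one reconstruction step, using circular (periodic) indexing in place of the doubly infinite sums and the Haar filter pair as a concrete example; it is a sketch of the algorithm, not the implementation used in this paper.

```python
import numpy as np

def dwt_step(a, h, g):
    # One level of Equations 6 and 7: correlate with h and g, then
    # downsample by two. Circular indexing stands in for the infinite sums.
    N = len(a)
    a_next = np.zeros(N // 2)
    d_next = np.zeros(N // 2)
    for p in range(N // 2):
        for m in range(len(h)):
            a_next[p] += h[m] * a[(m + 2 * p) % N]
            d_next[p] += g[m] * a[(m + 2 * p) % N]
    return a_next, d_next

def idwt_step(a_next, d_next, h, g):
    # One level of Equation 8: upsample by two and filter with h and g.
    N = 2 * len(a_next)
    a = np.zeros(N)
    for n in range(len(a_next)):
        for m in range(len(h)):
            a[(m + 2 * n) % N] += h[m] * a_next[n] + g[m] * d_next[n]
    return a

# Haar filters, the simplest pair satisfying the equations.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
approx, detail = dwt_step(x, h, g)
x_rec = idwt_step(approx, detail, h, g)
print(np.allclose(x, x_rec))  # True: the decomposition is perfectly invertible
```

Iterating `dwt_step` on the approximation coefficients yields the full multi-scale transform described below.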

We call a and d the approximation and detail coefficients respectively. The detail coefficients are exactly the wavelet coefficients defined by Equation 2. As shown in Equations 6 and 7, the wavelet coefficients are computed by recursively computing the coefficients at each scale, with a_0 initialized with the signal x. At each step of the algorithm, the signal is split into high and low frequency components by convolving the approximation coefficients with h (scaling filter) and g (wavelet filter). The low frequency component becomes the input to the next step of the algorithm. Note that a_i and d_i are downsampled by a factor of two at each iteration. An advantage of this algorithm is that we only require two filters instead of an entire filter bank. The wavelet transform effectively partitions the signal into frequency bands defined by the wavelet functions. We can reconstruct a signal from its wavelet coefficients using Equation 8. We call the reconstruction algorithm the inverse discrete wavelet transform. A thorough treatment of the wavelet transform can be found in [8].

Figure 1: The discrete wavelet transform represented as a neural network. This network computes the discrete wavelet transform of its input using Equations 6 and 7.

3 Proposed Model

We propose a method for learning wavelet functions by defining the discrete wavelet transform as a convolutional neural network (CNN). CNNs compute a feature representation of an input signal through a cascade of filters. They have seen success in many signal processing tasks, such as speech recognition and music classification [13, 3]. Generally, CNNs are not applied directly to raw audio data. Instead, a transform is first applied to the signal (such as the short-time Fourier transform). This representation is then fed into the network.

Our proposed method works directly on the raw audio signal. We accomplish this by implementing the discrete wavelet transform as a modified CNN. Figure 1 shows a graphical representation of our model, which consists of repeated applications of Equations 6 and 7. The parameters (or weights) of this network are the wavelet and scaling filters g and h. Thus, the network computes the wavelet coefficients of a signal, but allows the wavelet filter to be learned from the data. We can similarly define an inverse network using Equation 8.

We can view our network as an unrolling of the discrete wavelet transform algorithm, similar to unrolling a recurrent neural network (RNN) [11]. Unlike an RNN, our model takes as input the entire input signal and reduces the scale at every layer through downsampling. Each layer of the network corresponds to one iteration of the algorithm. At each layer, the detail coefficients are passed directly to the final layer. The

final layer output, denoted W(x), is formed as a concatenation of all the computed detail coefficients and the final approximation coefficients. We propose that this network be used as an initial module as part of a larger neural network architecture. This would allow a neural network architecture to take as input raw audio data, as opposed to some transformed version.

We restrict ourselves to quadrature mirror filters. That is, we set

    g[n] = (-1)^n h[-n]    (9)
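As a small illustration of Equation 9, the wavelet filter can be generated from the scaling filter in one line. For a finite-length filter, h[-n] is taken here to mean the time-reversed filter; this is an assumed discrete convention (conventions differ by an index shift), not something the paper specifies.

```python
import numpy as np

def qmf(h):
    # Equation 9: g[n] = (-1)^n h[-n].
    # For a finite filter we read h[-n] as time reversal (an assumed
    # convention; discrete QMF definitions differ by an index shift).
    n = np.arange(len(h))
    return ((-1.0) ** n) * h[::-1]

h_haar = np.array([1.0, 1.0]) / np.sqrt(2.0)  # Haar scaling filter
g_haar = qmf(h_haar)
print(g_haar)  # the Haar wavelet filter: [ 0.7071..., -0.7071...]
```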

By making this restriction, we reduce our parameters to only the scaling filter h.

The model parameters will be learned by gradient descent. As such, we must introduce constraints that will guarantee the model learns wavelet filters. We define the wavelet constraints as

    L_w(h, g) = (\|h\|_2 - 1)^2 + (\mu_h - \sqrt{2}/k)^2 + \mu_g^2    (10)

where µ_h and µ_g are the means of h and g respectively, and k is the length of the filters. The first two terms correspond to finite L2 and L1 norms respectively. The third term is a relaxed orthogonality constraint. Note that these are soft constraints, and thus the filters learned by the model are only approximately wavelet filters. See Figure 2 for examples of randomly chosen wavelet functions derived from filters that minimize Equation 10. We have not explored the connection between the space of wavelets that minimize Equation 10 and those of parameterized wavelet families [2].

Figure 2: Examples of random wavelet functions that satisfy Equation 10 for different filter lengths.

4 Evaluation

We will evaluate our wavelet model by learning wavelet filters that give sparse representations. We achieve this by constructing an autoencoder as illustrated in Figure 3. Autoencoders are used in unsupervised learning in order to learn useful data representations [6]. Our autoencoder is composed of a wavelet transform network followed by an inverse wavelet transform network. The loss is made up of a reconstruction term, a sparsity term, and the wavelet constraints. Let x̂_i denote the reconstructed signal. The loss function is defined as

    L(X; g, h) = \frac{1}{M} \sum_{i=1}^{M} \|x_i - \hat{x}_i\|_2^2 + \lambda_1 \frac{1}{M} \sum_{i=1}^{M} \|W(x_i)\|_1 + \lambda_2 L_w(h, g)    (11)

for a dataset X = {x_1, x_2, ..., x_M} of fixed length signals. In our experiments, we fix λ_1 = λ_2 = 1/2.

We conducted experiments on synthetic and real data. The real data consists of segments taken from the MIDI aligned piano dataset

(MAPS) [4]. The synthetic data consists of harmonic data generated from simple periodic waves. We construct a synthetic signal, x_i, from a base periodic wave function, s, as follows:

    x_i(t) = \sum_{k=0}^{K-1} a_k \cdot s(2^k t + \phi_k)    (12)

where φ_k ∈ [0, 2π] is a phase offset chosen uniformly at random, and a_k ∈ {0, 1} is the kth harmonic indicator, which takes the value 1 with probability p. We considered three different base waves: sine, sawtooth, and square. A second type of synthetic signal was created similarly to Equation 12 by windowing the base wave at each scale with randomly centered Gaussian windows (multiple windows at each scale were allowed). The length of the learned filters is 20 in all trials. The length of each x_i is 1024 and we set K = 5, p = 1/2, and M = 32000. We implemented our model using Google's TensorFlow library [1]. We make use of the Adam algorithm for stochastic gradient descent [7]. We use a batch size of 32 and run the optimizer until convergence.

Figure 3: The reconstruction network is composed of a wavelet transform followed by an inverse wavelet transform.

The wavelet filters learned are unique to the type of data that is being reconstructed. Example wavelet functions are included in Figure 4. These functions are computed from the scaling filter coefficients using the cascade algorithm [14]. Note that the learned functions are highly structured, unlike the random wavelet functions in Figure 2.

In order to compare the learned wavelets to traditional wavelets, we first define a distance measure between filters of length k:

    \mathrm{dist}(h_1, h_2) = \min_{0 \le i < k} \left( 1 - \langle h_1, \mathrm{shift}(h_2, i) \rangle \right)    (13)

where shift(h_2, i) is h_2 circularly shifted by i samples. This measure is the minimum cosine distance under all circular shifts of the filters. To compare filters of different lengths, we zero-pad the shorter filter to the length of the longer. We restrict our consideration to the following traditional wavelet families: Haar, Daubechies, Symlets, and Coiflets. The middle column of Figure 4 shows the closest traditional wavelet to the learned wavelets according to Equation 13. The distances are listed in the right column.

Figure 4: Left column: Learned wavelet (solid) and scaling (dashed) functions. Middle column: Closest traditional wavelet (solid) and scaling (dashed) functions. Right column: Plots of the scaling filters from the first two columns with corresponding distance measure.

In order to determine how well the learned wavelets capture the structure of the training data, we consider signals randomly generated from the learned wavelets. To generate signals, we begin by sparsely populating wavelet coefficients from [−1, 1]. The coefficients from the three highest frequency scales are then set to zero. Finally, the generated signal is obtained by performing an inverse wavelet transform of the sparse coefficients. Qualitative results are shown in Figure 5. Typical training examples are shown in the left column. Example generated signals are shown in the right column. Note that the generated signals have visually similar structure to the training examples. This provides evidence that the learned wavelets have captured the structure of the data.

Figure 5: Left column: Examples of synthetic training signals. Right column: Examples of signals generated from the corresponding learned wavelet filters. Base waves from top to bottom: sine, square, sawtooth.

5 Conclusion

We have proposed a new model capable of learning useful wavelet representations from data. We accomplish this by framing the wavelet transform as a modified CNN. We show that we can learn useful wavelet filters by gradient descent, as opposed to the traditional derivation of wavelets using Fourier methods. The learned wavelets are able to capture the structure of the data. We hope that our work leads to wider use of the wavelet transform in the machine learning community.

Framing our model as a neural network has the benefit of allowing us to leverage deep learning software frameworks, and also allows for simple integration into existing neural network architectures. An advantage of our method is the ability to learn directly from raw audio data, instead of relying on a fixed representation such as the Fourier transform.
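To make the training objective of Section 4 concrete, the terms of Equations 10 and 11 can be written out as below. This NumPy sketch treats filters and coefficients as plain arrays; the helper names (`wavelet_constraint`, `total_loss`) and the batch layout are illustrative assumptions, not the authors' TensorFlow implementation.

```python
import numpy as np

def wavelet_constraint(h, g):
    # Equation 10: soft constraints pushing (h, g) toward wavelet filters:
    # unit L2 norm for h, mean of h near sqrt(2)/k, and zero mean for g.
    k = len(h)
    return ((np.linalg.norm(h) - 1.0) ** 2
            + (np.mean(h) - np.sqrt(2.0) / k) ** 2
            + np.mean(g) ** 2)

def total_loss(xs, xhats, coeffs, h, g, lam1=0.5, lam2=0.5):
    # Equation 11: mean squared reconstruction error, mean L1 sparsity of
    # the wavelet coefficients W(x), and the wavelet constraint term.
    M = len(xs)
    recon = sum(np.sum((x - xh) ** 2) for x, xh in zip(xs, xhats)) / M
    sparsity = sum(np.sum(np.abs(c)) for c in coeffs) / M
    return recon + lam1 * sparsity + lam2 * wavelet_constraint(h, g)

# The Haar pair satisfies the constraints exactly, so the penalty vanishes.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)
print(wavelet_constraint(h, g) < 1e-12)  # True
```

Minimizing `total_loss` by gradient descent over h (with g tied to h through Equation 9) is the training procedure the paper describes.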

References

[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] C. Sidney Burrus, Ramesh A. Gopinath, and Haitao Guo. Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall, Inc., 1997.

[3] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 2392–2396. IEEE, 2017.

[4] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643–1654, 2010.

[5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.

[6] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[7] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] Stéphane Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008.

[9] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

[10] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[11] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[12] Raif Rustamov and Leonidas J. Guibas. Wavelets on graphs via deep learning. In Advances in Neural Information Processing Systems, pages 998–1006, 2013.

[13] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for LVCSR. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8614–8618. IEEE, 2013.

[14] Gilbert Strang and Truong Nguyen. Wavelets and Filter Banks. SIAM, 1996.

[15] Wim Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546, 1998.

[16] Qinghua Zhang and Albert Benveniste. Wavelet networks. IEEE Transactions on Neural Networks, 3(6):889–898, 1992.
