4th Mediterranean Conference on Embedded Computing MECO - 2015 Budva, Montenegro

CS reconstruction of the speech and musical signals

Trifun Savi ć, Radoje Albijani ć Faculty of Electrical Engineering University of Montenegro, Podgorica [email protected], [email protected]

Abstract—The application of Compressive sensing approach Generally speaking, the CS can be used for different types to the speech and musical signals is considered in this paper. of data such as video, images, audio signals, signals used in Compressive sensing (CS) is a new approach to the signal medical applications (MRI, ECG), etc. In this paper we sampling that allows signal reconstruction from a small set of consider the CS application to the audio and speech signals[5]- randomly acquired samples. This method is developed for the [8]. Having in mind that audio signals consist of harmonics, signals that exhibit the sparsity in a certain domain. Here we have observed two sparsity domains: discrete Fourier and i.e., sinusoid-like components, these signals are good discrete cosine transform domain. Furthermore, two different candidates for CS. Namely, a pure sinusoid is a type of signal types of audio signals are analyzed in terms of sparsity and CS which has, in the frequency domain, just one non-zero performance - musical and speech signals. Comparative analysis component corresponding to the frequency of the sinusoid. of the CS reconstruction using different number of signal Therefore, sinusoidal signals can be considered as sparse in samples is performed in the two domains of sparsity. It is shown frequency domain. The sound quality, number of harmonics, that the CS can be successfully applied to both, musical and played note, could be of different complexity depending on speech signals, but the speech signals are more demanding in the instrument used to play the sound[9]. Consequently, terms of the number of observations. Also, our results show that different number of measurements, used for reconstruction, is discrete cosine transform domain allows better reconstruction using lower number of observations, compared to the Fourier required for the signals of different complexity. transform domain, for both types of signals. Speech signals have more complex nature and therefore they are less sparse in the frequency domain, compared to the Keywords- Discrete Fourier transform, discrete cosine pure musical tones. As these signals have short-time transform, compressive sensing, sparsity, signal reconstruction, l1 stationarity property[10], short length frames should be minimization considered for CS reconstruction. Namely, when observing I. INTRODUCTION just a small time interval of the speech signal, it is found that it Compressive sensing (CS) is a new theory aiming to can be considered as sparse. In this way, successful optimize the signal acquisition process. The main idea behind reconstruction is assured using lower number of available the CS lies in possibility to reconstruct signal using much observations. lower number of samples than it is determined by Shannon- In this paper we have used different transformation Nyqiust theorem[1]-[4]. The Shannon-Nyqiust theorem for both musical and speech signals, with different number of requires samples to be taken in equidistant time intervals, one randomly acquired samples. The quality of reconstructed after another. In CS scenarios, the signal samples should be signals is proved by performing the listening test on the randomly selected or acquired[1]. By using just small number resulting signals, and illustrated by graphical representation of of signal samples, CS automatically lowers the sampling rate signals in time and frequency domains. and avoids later signal compression. Also, smaller number of The rest of the paper is organized as follows: In section II samples means faster and faster further theoretical background on CS is presented. The reconstruction transmission. of musical and speech signals will be detailed analyzed in Besides the random selection of the signal samples, CS section III. In section IV the experimental results are shown, requires signal to be sparse in a certain transform domain. including mean square errors between original and Sparse signals are characterized by important information reconstructed signal as well as error versus measurement condensed in a small number of non-zero coefficients in the percentage graph. Conclusion is given in Section V . sparsity domain[1], [2]. Depending on type of signal, various II. THEORETICAL BACKGROUND domains where signals have sparse representation, can be used, e.g., Discrete Fourier Transform - DFT, Discrete Cosine Let us introduce discrete signal f in terms of transform Transform - DCT, Transform, etc. Note that the domain Ψ and transform domain vectors x as follows: observations are randomly taken from the domain where signal f=Ψ x , (1) has dense representation. The random selection assures where matrix Ψ is of N×N (N is the signal length) dimension incoherence property to be satisfied. Incoherence and sparsity and x is the signal f transformed into vector in Ψ domain. are two requirements that need to be satisfied in order to apply Vector x is sparse vector, containing K non-zero coefficients, CS approach. where K<

4th Mediterranean Conference on Embedded Computing MECO - 2015 Budva, Montenegro randomly taken from the dense vector f. In other words, Ψ vibration of strings. Pressing a key on keyboard causes a matrix is subsampled by randomly selecting certain number of padded hammer to hit the string and make them vibrate. In the columns/rows. Let the matrix θM×N be the sub-matrix of the case of speech, the signal is made by vocal chords, which matrix Ψ, formed by randomly selecting rows/columns of the vibrate as the air flows through them. By losing or tensioning matrix Ψ. Vector y, of M×1 dimension, is the one containing vocal chords we make sounds of different frequencies. It is not CS measurements. It is important to emphasize that the number necessary for speech sound to be consisted of pitch sound and of selected rows/columns M, should be much smaller than the multiples of its frequency. Speech signal can contain lots of signal length, i.e. M<

4th Mediterranean Conference on Embedded Computing MECO - 2015 Budva, Montenegro of available signal samples. Mean square error (MSE) is Figure 2 shows the original signal in the frequency domain calculated after each reconstruction, and MSE versus number (DCT and DFT), while the reconstructed signal in DCT and of measurements graphs are displayed. DFT domains are shown in Figure 3 (with 30% and 50% samples used for the reconstruction). As a more relevant Example 1: Audio signal indicator of reconstruction quality, in Figure 4 we ploted the mean square error (MSE) between reconstructed piano signal For experiments with the audio signals, the piano tone is using DCT and DFT as transform domains, for different chosen with the length of 3000 samples. The reconstruction number of measurements. Again, it is shown that the MSE is performance is analyzed for different number of much lower in the case when DCT is used as a sparse basis. measurements: 20% - 90% samples available. Also, both the -7 x 10 DFT and the DCT transform basis are observed as sparsity 4 domains. Figure 1 shows the results of reconstruction using 30% and 50% samples, for both basis. It can been concluded 3 that, by using the DCT as sparsity domain only 30% of the available samples are enough for successful reconstruction. 2 When using the DFT as a sparse basis, the number of 1 measurements needs to larger such as 50% of the total signal Error Square Mean length. 0 -3 -3 20 30 40 50 60 70 80 90 x 10 x 10 2 2 Samples Rate in % 0 0 Figure 4. MSE vs number of measurements for the piano signal; DCT (blue), and DFT (red) -2 -2

Amplitude

-4 -4 Example 2: Speech signal 0 100 200 300 400 500 0 100 200 300 400 500 -3 -3 x 10 x 10 2 2 In this example, speech signals with length of 3000 0 0 samples are observed. Experiments are repeated several times,

-2 -2 using different number of samples as CS measurements: from Amplitude 20% to 90% of the original number of samples. Figure 5 -4 -4 0 100 200 300 400 500 0 100 200 300 400 500 represents time domains of the original and reconstructed Samples Samples Figure 1.Original piano signal (blue), reconstructed (red), using 30% DCT speech signal (as a represent of speech signal we illustrated (upper left graph), 50% DCT (upper right graph), 30% DFT (lower left the vowel E). The graphs are for the 50% and 70% of samples graph), 50% DFT(lower right graph), time domain used as measurements and with the DCT and the DFT domains. The DCT gives satisfactory reconstruction using Original Signal, Discrete Cosine Transform Original Signal, Discrete Fourier Transform 0.08 4 50% of samples, while the DFT requires 70% of samples for the successful reconstruction. 0.06 3 0.5 0.5

0.04 2 0 0 Amplitude Amplitude 0.02 1 -0.5 -0.5 Amplitude

-1 0 0 -1 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 0 100 200 300 400 500 0 100 200 300 400 500 Samples Samples 0.5 0.5 Figure 2.Original piano signal in DCT (left graph), in DFT (right graph) 0 0 Recovered Signal, Discrete Cosine TransformdomainRecovered Signal, Discrete Cosine Transform

0.08 0.08 Amplitude -0.5 -0.5 0.06 0.06 -1 -1 0.04 0.04 100 200 300 400 500 100 200 300 400 500 Samples Samples Amplitude 0.02 0.02 Figure 5.Original speech signal (blue), reconstructed (red), using 50% DCT 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 (upper left graph), 70% DCT (upper right graph), 50%DFT (lower left graph), 4 4 Samples 70% DFT(lower right graph), time domain 3 3 2 2 Therefore, we can again conclude that DCT domain is more Amplitude 1 Amplitude 1 applicable as sparse domain, for both types of signals. Figures 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 6 and 7 show frequency domains of original and reconstructed Samples Samples Figure 3.Reconstructed piano signal using 30% DCT (upper left graph), 50% signals. DCT (upper right graph), 30%DFT (lower left graph), 50% DFT(lower right The graph representing the MSE in terms of the number of graph), frequency domain measurements graph, is presented in Figure 8. The DCT and the DFT domains are used in reconstruction and MSE is

4th Mediterranean Conference on Embedded Computing MECO - 2015 Budva, Montenegro calculated for both domains. As we can see from experimental VI. ACKNOWLEDGEMENT results, DCT transform basis gives smaller MSE for the same The authors are thankful to Professors and assistants within number of used measurements, compared to the DFT. the Laboratory for Multimedia Signals and Systems, at the

Original Signal, Discrete Cosine Transform Original Signal, Discrete Fourier Transform University of Montenegro, for providing the ideas, codes, 10 400 literature and results developed for the project CS-ICT (funded by the Montenegrin Ministry of Science). 5 200

Amplitude REFERENCES 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 [1] E. Candès, J. Romberg, J. K.; Tao, Terence, "Stable signal recovery Samples Samples Figure 6.Original speech signal using DCT (left graph), using DFT (right from incomplete and inaccurate measurements", Communications on graph), frequency domain Pure and Applied Mathematics, vol. 59, no. 8, 2006, 1207. doi:10.1002/cpa.20124. Recovered Signal, Discrete Cosine Transform Recovered Signal, Discrete Cosine Transform [2] Mark A. Davenport, Marco F. Duarte, Yonina C. Eldar and 10 10 GittaKutyniok, “Introduction to ,” in Compressed Sensing: Theory and Applications, Y. Eldar and G. Kutyniok, eds., 5 5 Cambridge University Press, 2011 Amplitude Amplitude [3] S. Stankovic, I. Orovic, E. Sejdic, “Multimedia Signals and Systems“, 0 0 Springer-Verlag, New York 2012. 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 Samples Samples [4] Jarvis Haupt and Rob Nowak, Signal reconstruction from noisy random Recovered Signal, Discrete Fourier Transform Recovered Signal, Discrete Fourier Transform 400 400 projections. (IEEE Trans. on Information Theory, 52(9), pp. 4036-4048, September 2006)

200 200 [5] A. Griffin, P. Tsakalides, “Compressed Sensing of Audio Signals using Multiple Sensors”, Eusipco 2008, Lausanne, Switzerland, august 25-29, Amplitude Amplitude 2008. 0 0 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 [6] M. G. Christensen, J. Ostergaard, S. H. Jensen, “On Compressed Samples Samples Sensing and its Application to Speech and Audio Signals”, 2009 IEEE. Figure 7. Reconstructed speech signal using 50% DCT (upper left graph), 70% DCT (upper right graph), 50%DFT (lower left graph), 70% DFT(lower [7] M. Scekic, R. Mihajlovic, I. Orovic, S. Stankovic, "CS Performance right graph), frequency domain Analysis for the Musical Signals Reconstruction," 3rd Mediterranean Conference on Embedded Computing , MECO, 2014 0.025 [8] I. Orovic, S. Stankovic, B. Jokanovic, A. Draganic, "On Compressive 0.02 Sensing in Audio Signals," ETRAN 2012, Zlatibor, 2012 [9] I. Orovic, S. Stankovic, A. Draganic, "Time-Frequency Analysis and 0.015 Singular Value Decomposition Applied to the Highly Multicomponent Musical Signals," ActaAcustica united with Acustica , Vol. 100, No 1, 0.01 pp. 93-101(9), 2014 [10] X. Feng, W. Xia ,Z. Xiao-Dong, W. Hao, “An Adaptive Compressed Mean Square Error Square Mean 0.005 Sensing Method in Speech”, International Journal of Advancements in Computing Technology(IJACT), Vol4, No 8, May 2012. 0 20 30 40 50 60 70 80 90 [11] S. G. Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Samples Rate in % Dictionaries, IEEE Transactions on Signal Processing, December 1993, Figure 8. MSE using DCT (blue), and DFT (red), speech signal pp. 3397–3415. [12] J. A. Tropp and A. C. Gilbert, “Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit”, IEEE Transactions on V. CONCLUSION Information Theory, Vol. 53, NO. 12, December 2007 In this paper we analyzed the Compressive Sensing [13] D. L. Donoho , Y.Tsaig , I.Drori , J.L.Starck, "Sparse Solution of application to the speech and audio signals. Different Underdetermined Systems of Linear Equations by Stagewise Orthogonal Matching Pursuit ", IEEE Transactions on Information Theory, Vol. 58, transform domains are tested, in order to choose suitable basis No. 2, 2012, pp. 1094-1121 for successful reconstruction from small set of available [14] S. Stankovic, I. Orovic, LJ. Stankovic, "An Automated Signal samples. It is shown that the DCT domain requires smaller Reconstruction Method based on Analysis of Compressive Sensed number of samples compared to the DFT, for both types of Signals in Noisy Environment," Signal Processing, vol. 104, Nov 2014, pp. 43 - 50, 2014 signals. Moreover, when observing the MSE for audio signal, [15] Boyd, S.P., Vandenberghe, L.: ‘Convex Optimization’ (Cambridge it is shown that reconstruction with 20% of measurements and Univ. Press, 2004). the DCT as transform basis, gives the same error as 80% of [16] S. Stankovic, I. Orovic, LJ. Stankovic, "Single-Iteration Algorithm for measurements and the DFT as transform basis. Further, the Compressive Sensing Reconstruction," 21st Telecommunications Forum audio signals are less complex compared to the speech signals TELFOR 2013. and therefore much smaller number of samples is required for . reconstruction of audio signals. Audio signals can be reconstructed with 30% of samples, while for the speech signal this number is slightly larger – 50%.