<<

2016 3rd International Conference on Smart Materials and Nanotechnology in Engineering (SMNE 2016) ISBN: 978-1-60595-338-0

A Visualization Scheme for Complex Spectrograms with Complete Amplitude and Phase Information

Caixia Zheng School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China. School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China. Guangyan Li School of Physics, Northeast Normal University, Changchun 130024, China Tingfa Xu Photoelectric Imaging and Information Engineering Institute, Beijing Institute of Technology, Beijing 100081, China Shili Liang School of Physics, Northeast Normal University, Changchun 130024, China Xiaolin Cao College of Automotive Engineering, Jilin University, Changchun 130025, China Shuangwei Wang * School of Physics, Northeast Normal University, Changchun 130024, China *Corresponding author: [email protected]

ABSTRACT: We present a novel visualization scheme for complex spectrograms of and other non-stationary signals, in which both the amplitude and phase information is visualized simultaneously using the conventional RGB (red, green, blue) color model. The three-layer image matrix is constructed in such a way that the absolute values of the real and imaginary parts of the time-frequency analysis of speech are used to fill the first (red) and third (blue) layers, respectively, and an scheme is adopted to represent both signs of the real and imaginary parts and the computed values based on this encoding scheme are used to fill the second (green) layer. The importance of phase in spectrogram applications has been gradually recognized, and imaging processing techniques are increasingly used in spectral analysis of non-stationary one-dimensional (1D) signals such as speech signal. The ability to present complete spectrogram information (both the amplitude and phase information) on a single RGB image is potentially useful, especially for those applications in which phase information plays a complementary role.

1 INTRODUCTION instrument called sound spectrograph for mapping a onto a graph (Koenig et al., 1946). The Spectrogram visualizes the spectrum of frequencies instrument was initially used to identify enemy in a sound or other signals as they vary with time voices on telephones and radios, and has gradually (Cohen, 1989). The most common format is a become an indispensable tool in speech analysis and gray-scale or pseudo-color two dimensional (2D) studies. image with vertical axis representing frequency and As a visual representation of sound, spectrogram horizontal axis representing time. The gray or color simultaneously displays the characteristics of voice scale of each point represents the amplitude of a information in the closely related time and particular frequency at a particular time. Sound frequency domains, and obviously, spectrogram is spectrogram was first introduced during World War compact and efficient, and shows more pattern II for military purposes. The Bell laboratories in information than a time-domain or 1941 successfully developed a voice visualization frequency-domain signal does alone. Speech

289 waveforms in the time-domain vary greatly between With the advances of modern sensor and digital utterances but the frequency-domain representations signal processing (DSP) technology, spectrograms usually show very consistent patterns. By tracking can be efficiently generated using short time Fourier the frequency content of a sound over time, speech transform (STFT), and this greatly expands the spectrogram additionally provides the characteristics concept of spectrogram. Firstly, the frequency range of speech as well as its speaker. Spectrogram that spectrograms represent is extended from audible involves many linguistic aspects such as sound to infrasonic and ultrasonic frequencies. articulatory and , recognition of Secondly, spectrogram analysis goes beyond speech various , and thorough and and other audio signals. As a matter of fact, any knowledge of speech (Zue and Lamel, 1986). temporally or spatially varying signal can be Human beings excel at extracting features from analyzed from the spectrogram approach. images. With intensive training spectrogram experts Traditionally spectrograms have been widely used in can reach impressive recognition rates. In the past the development of the fields of audio, speech, few decades, researchers began using advanced music, sonar, radar, and so on. In addition, they have digital signal and image processing techniques for also found uses in other fields. Spectrograms as a removing noise and extracting useful features in complementary tool was used to study ultrashort spectrograms, and noticeable progresses have been pulse lasers (Kane and Trebino, 1993). With proper made in speech and speaker recognition. mapping of character strings to numerical sequences, Conventionally only peak amplitude or power is genomic sequences can be studied using a variety of required to construct speech spectrograms, and modern DSP techniques including color phase information has long been considered of little spectrograms (Anastassiou, 2001). Indeed, use and was often conveniently neglected in audio spectrograms are a useful tool for studying 1D and speech signal processing applications (Mowlaee non-stationary signals. et al., 2014) (Hegde et al., 2007; Oppenheim and The complex Fourier transform results can be Lim, 1981; Wang and Lim, 1982). Although phase represented either by phase/magnitude or separately information is usually not required to describe the by real and imaginary parts. To our best knowledge, time-frequency characteristics of a non-stationary there is no report in the literature that attempt to signal, it found uses in some situations that phase display the complete amplitude and phase information plays an important role. Both phase and information in a spectrogram. Xu et al in their power spectrograms were used as a diagnostics tool modified nonlocal means (NL-means) algorithm to detect cracks in vibrational beams, and the phase considered the real and imaginary parts of the information was also graphically presented in the complex speech spectrogram each as a separate study (Léonard, 2007). Three-dimensional (3D) image but with the sign (or partial phase) phase and frequency spectrograms were constructed information discarded (Xu et al., 2008). Such to analyze specific small slope linear chirp, phase spectrograms do not contain the complete jump and small frequency jump in musical signals information of the source signals. Because phase can (Navarro et al., 2008). Even in the traditional audio play a complementary role in a variety of and speech processing applications, the importance spectrogram applications, a proper way to preserve of phase information has been recognized. The and present phase information graphically in a usefulness of phase in human speech was spectrogram may be desired. examined to show that the phase spectrum contributes to speech intelligibility (Paliwal and Alsteris, 2003). A novel method for using phase 2 VISUALIZATION OF COMPLEX information was proposed to demonstrate the SPECTROGRAM potential of phase spectra in spectral subtractive enhancement applications (Kleinschmidt et al., We propose a visualization method to display both 2011). The discontinuity problems of phase was the amplitude and phase characteristics of the overcome and the phase information was used as an time-frequency analysis of speech signals on a same additional feature stream to achieve significant image using the conventional RGB color model. In improvement in audio retrieval and classification the proposed method, the complete amplitude and (Paraskevas and Chilton, 2004). The phase spectrum phase information is preserved and displayed information was used to improve simultaneously. The absolute values of the real part performance (Schluter and Ney, 2001). These of the time-frequency analysis of speech signal fill examples illustrate that phase can play a the first (red) layer matrix; the absolute values of the complementary role in a broad range of spectrogram imaginary part fill the third (blue) layer matrix; the applications. signs of both the real and imaginary parts are In the early years, spectrograms were created combined and encoded to make a sign (green) using a series of bandpass analog filters, and such matrix. This three-layer image matrix is used to laborious process limited the use of spectrograms. drive the RGB color model for visualizing the 290

time-frequency analysis of speech signals. In this = + IjRX kikiki (4) visualization scheme, a pair of unsigned complex . data in the time-frequency domain is uniquely i kN ⋅⋅⋅⋅⋅⋅=⋅⋅⋅⋅⋅⋅= ,,2,1;,,2,1 M mapped to a point in the red-blue (RB) color space and, the partial phase information, or the To properly display the data in the RGB color combination of the signs of both the real and space, the two parts of X need to be normalized. Let imaginary parts, is shown in the shades of green. d be the maximum absolute value of all elements in R and I, normalization of R and I is realized using 2.1 Frame size and window function the following formula, To conduct time-frequency analysis, the original 1 (5) R = abs( R), audio recording is divided into windowed frames of n d 10~20 ms long. A proper filter window is added before the STFT analysis of each frame (Schilling 1 (6) and Harris, 2011). When selecting window function, I n = abs( I) , a tradeoff between the main lobe width and side lobe d roll-off rate has to be made: narrow main lobe reduces the uncertainty in frequency and faster where function abs() is used to obtain the absolute roll-off reduces dependencies on neighboring speech value of each elements in a matrix, and Rn and In are samples. the normalized absolute real and imaginary parts of With frame size N and frame overlapping matrix X, respectively. Using the same constant b, a speech signal of length L sampled at Fs normalization coefficient d ensures that the dynamic can generate an N × M matrix x, in which M is the ranges of Rn and In are consistent. number of frames given by L (1) 2.3 Generation of sign matrix = roundM ( ) , − bN )1( The corresponding sign matrix SR and SI are simply obtained by applying the sign function to all the elements in the matrix R and I, where round() function rounds the value in the (7) parenthesis to the nearest integer. Each column in x S R = sign R )( , represents a speech frame of length N, and element th th th xik of i row and k column represents the i data in I = sign IS )( . (8) the kth frame. 2.2 Construction of amplitude matrix The sign function sign(x) is equal to −1 if x < 0, +1 if x > 0, and 0 if x = 0. We now devise an Fast Fourier Transform (FFT) is applied to each encoding scheme that packs both SR and SI into one column of matrix x to generate an N × M matrix X single sign matrix S, which is obtained by with a frequency resolution of ∆f = Fs / N. The computing the weighted sum of SR and SI using a element in the ith row and kth column of X shows the th th specially devised formula of property of i frequency of the k Fourier transform = 3 + SSS . (9) data, IR

2π There are totally nine combinations of the signs N kl −−− )1()1( N from the real and imaginary parts, R and I. The ki = ∑ kl exX . corresponding nine values by formula (9) are l=1 (2) i ⋅⋅⋅⋅⋅= kN ⋅⋅⋅⋅⋅⋅=⋅ ,,2,1;,,2,1 M tabulated in Table 1. The green (G) channel in the RGB color model is used to carry the sign or phase information in the matrix S. To properly visualize S, In computer programming matrix index often the numerical values in Table 1 are shifted by 4, and starts from zero whereas it starts from one in the subsequently normalized to obtain Sn using the mathematical formula. For convenience, index i and formula below, k in the original formula are replaced by (i−1) and 1 (10) (k−1) without affecting the result. The complex n SS += )4( . 200 matrix X can be written as a sum of two parts as below: Formula (10) converts the integers in Table 1 to = + jIRX . (3) smaller values ranging from 0 to 0.04, which are listed in Table 2. The normalization factor of 200 R and I are the real and imaginary parts of X, ensures that the maximum value in the G channel is respectively, and j is imaginary unit. Alternatively, comparable to those values in the R and B channels formula (3) is written in terms of components, such that the green shades representing the sign 291 combination values in the G channel are properly displayed and do not interfere with the natural RB color.

Table 1. Nine designated numerical values for the nine combinations of the signs of real and imaginary data. Sign of real part (numerical value) + (3) 0 (0) − (−3) Sign of imaginary part (numerical value) (+, +) (0, +) (−, +) Figure. 1. Dual color (RB) space and its corresponding + (1) 4 1 -2 unsigned complex space in the time-frequency domain. The (0, +) (0,0) (0, −) X-axis (R channel) represents the real part and the Y-axis (B 0 (0) 3 0 -3 channel) represents the imaginary part. The pixel position in (−, +) (−, 0) (−, −) the image is determined by the normalized absolute values of − (−1) 2 −1 −4 real and imaginary parts of the FFT result. For example, the coordinate of point A (0.8, 0.2) in the RB color space indicates that the normalized absolute values of the real and imaginary Table 2. Nine normalized numerical values parts are 0.8 and 0.2, respectively. corresponding to those in Table 1. Sign of the real part + 0 − Sign of the imaginary part (+, +) (0, +) (−, +) + 0.04 0.0252 0.01 (0, +) (0,0) (0, −) 0 0.0352 0.02 0.005 (−, +) (−, 0) (−, −) − 0.03 0.0148 0 2.4 Assembly of RGB matrix

A three layer RGB image matrix Y is constructed as follows: the three normalized matrixes Rn, Sn and In Figure 2. Nine shades of green for a total of nine sign form the first, second and third layer of the image combinations. Each real or imaginary number has three matrix to drive the R, G and B channels, respectively. possible signs, ‘−‘, ‘0’ and ‘+’. Therefore, there are totally 9 Because we intentionally the values in the G combinations for a pair of numbers. Each combination is represented by a shade of green. channel less than those in the R and B channels (due to a larger normalization factor), the color complex 2.5 Sample complex spectrogram spectrogram is faithfully represented by a dual color Figure 3 is the audio recording for the phrase of (RB) image with a green texture background. Figure “Speech spectrograms” and its corresponding 1 shows the dual color (RB) space mapping to the complex spectrogram is shown in Figure 4. The unsigned complex space in the time-frequency Hamming window with 50% overlapping is applied domain. With the help of G channel carrying sign to the original signal sampled at 44.1 kHz. Fourier information, the real and imaginary parts of a transformation is done using a 512 point FFT. In complex number in a spectrogram can be resolved Figure 3, the green texture background displays the directly from the RB color ratio. Figure 2 shows signs of both the real and imaginary parts while the green color blocks for labeling the sign information RB dual color shows the amplitude spectrogram. of the real and imaginary parts. For better visual effect, values in Figure 2 are set to the range of 0 2.6 Speech signal reconstruction and 1. Without any signal processing of the image matrix, the whole construction process described in the previous sections can be easily reversed to recover the original audio signals. 1) Rebuild the amplitude matrixes Extract the first and third layers of the image matrix as Rn and In for later use. 2) Restore the sign matrix 292

The data in the G channel are extracted to restore I = − 3SSS R (13) the normalized sign matrix Sn: a) Restore the sign of the real part by According to formula (13) the sign of the imaginary re-arranging formula (10) to recover matrix S: part is positive (+1) when S = 4, 1, −2 with the corresponding S = +1, 0, −1, respectively. (11) R = SS n − 4200 . Numerical computations of (4−3), (1−0) and (−2− (−3)) all give +1. The integers in Table 1 can be recovered using the 3) Reconstruct the time-frequency matrix formula above. The sign matrix SR of the real part is The real and imaginary parts of the complex computed use the following formula, matrix X are restored by combining the sign and

R i,k = SuS , ki − − − − Su , ki ),01.1()01.1( (12) amplitude values, R = d SR○Rn and I = d SI○In, , respectively. Here we define the element-wise i kN ⋅⋅⋅⋅⋅⋅=⋅⋅⋅⋅⋅⋅= ,,2,1;,,2,1 M product of two matrices, A=(Aik) and B=(Bik), as A○B=(AikBik). where the step function u(x) is defined to be 0 if x < 4) Synthesize the original speech signal 0; 0.5 if x = 0; or +1 if x > 0. Formula (12) gives the Compute the inverse FFT of X = R + jI to obtain result of +1 (corresponding to a positive sign) if Si,k the multi-frame matrix x of the original speech = 2, 3, 4; 0 if Si,k = −1, 0, +1; or −1 if Si,k = −2, −3, signal. Speech synthesis is accomplished by linking −4. the columns of matrix x head-to-tail to form a 1D b) Restore the sign of the imaginary part: With speech sequence with the correct window function known S and SR, the sign of the imaginary part SI (Hamming) and overlapping rate (50%).Note that can be simply computed using the formula below: speech resynthesis would not be straightforward if the matrix has been manipulated.

Figure 3. Audio recording for the phrase of “Speech spectrogram”. Note that amplitude is normalized.

Figure 4. Complex spectrogram for the audio recording of “speech spectrograms”. Note that the amplitude spectrum (shown in violet color) is displayed on the top of dark green shades. any other non-stationary signals in the time-frequency or space-frequency domain and they 3 CONCLUSIONS have become useful tools for analyzing a variety of signals in the fields far beyond sound and speech. Spectrograms are visual representations of sound or Traditional spectrograms are constructed using only 293 the amplitude information of Fourier transform Proceedings of the IEEE 77, 941-981. results whereas the phase information has been [3] Hegde Rajesh M., Murthy Hema and Gadde Venkata Ramana Rao, 2007. Significance of the modified group conveniently ignored. However, phase does play a delay feature in speech recognition. Audio, Speech, and complementary role in some spectrogram Processing, IEEE Transactions on 15, 190-202. applications. To our best knowledge there is no [4] Kane Daniel J. and Trebino Rick, 1993. Characterization of report in the literature to present phase information arbitrary femtosecond pulses using frequency-resolved simultaneously with amplitude information in a optical gating. Quantum Electronics, IEEE Journal of 29, same spectrogram. Thus a proper means to process 571-579. [5] Kleinschmidt Tristan, Sridharan Sridha and Mason Michael, and display both amplitude and phase information 2011. The use of phase in complex spectrum subtraction for graphically may be desired. robust speech recognition. Computer Speech & Language The purpose of this paper is to develop a novel 25, 585-600. encoding and visualization scheme for complex [6] Koenig, W., Dunn H. K., and Lacy L. Y., 1946. The sound spectrograms with both the amplitude and phase spectrograph. The Journal of the Acoustical Society of America 18, 19-49. information of the time-frequency analysis displayed [7] Léonard François, 2007. Phase spectrogram and frequency simultaneously using the conventional RGB color spectrogram as new diagnostic tools. Mechanical systems model. Spectrogram images can not only display the and signal processing 21, 125-137. complete spectrum information but also be [8] Mowlaee Pejman, Saeidi Rahim and Stylianou Yannis, recognized by a human reader or processed using 2014. INTERSPEECH 2014 Special Session: Phase digital image processing techniques. The ability to Importance in Speech Processing Applications, Proceedings of the 15th International Conference on present the complete amplitude and phase Spoken Language Processing. information on a single RGB image is potentially [9] Navarro Laurent, Courbebaisse Guy and Pinoli useful, especially for those applications in which Jean-Charles, 2008. Continuous frequency and phase phase information plays a complementary role. spectrograms: a study of their 2D and 3D capabilities and Due to the nature of the signal and, very likely, application to musical signal analysis. Journal of Zhejiang University SCIENCE A 9, 199-206. due to our encoding scheme of phase information, [10] Oppenheim Alan V. and Lim Jae S., 1981. The the spectrogram (Figure 4) does not provide any importance of phase in signals. Proceedings of the IEEE particular useful information other than the dark 69, 529-541. green background. But for right type of signals, the [11] Paliwal Kuldip K. and Alsteris Leigh, 2003. Usefulness of green background may display vital phase phase spectrum in human , information. The proposed visualization scheme is INTERSPEECH. [12] Paraskevas Ioannis, and Chilton Edward, 2004. far from ideal and improvements can be made for Combination of magnitude and phase statistical features for optimal visualization from several perspectives such audio classification. Research Letters Online 5, as: 1) different window functions can be chosen and 111-117. overlapping rate can be adjusted; 2) the encoding [13] Schilling Robert and Harris Sandra, 2011. Fundamentals scheme for the signs of both real and imaginary of digital signal processing using MATLAB. Cengage . parts is arbitrarily devised, there are other options to [14] Schluter Ralf, and Ney Hemann, 2001. Using phase show the phase information; and 3) other color or spectrum information for improved speech recognition even text models can be explored. performance, Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP'01). 2001 IEEE International Conference on. IEEE, 133-136. 4 ACKNOWLEDGEMENTS [15] Wang David L., and Lim Jae S., 1982. The unimportance of phase in speech enhancement. Acoustics, Speech and Signal Processing, IEEE Transactions on 30, 679-681. This paper was supported by Natural Science [16] Xu Haitian, Tan Zheng-Hua, Dalsgaard Paul and Foundation of China (No. 61471111) Lindberg Brge, 2008. Robust speech recognition by nonlocal means denoising processing. Signal Processing Letters, IEEE 15, 701-704. [17] Zue Victor W. and Lamel Lori F., 1986. An expert REFERENCES spectrogram reader: A knowledge-based approach to speech recognition, Acoustics, Speech, and Signal Processing, [1] Anastassiou Dimitris, 2001. Genomic signal processing. IEEE International Conference on ICASSP'86, 1197-1200. Signal Processing Magazine, IEEE 18, 8-20. [2] Cohen Leon, 1989. Time-frequency distributions-a review.

294