J. Acoust. Soc. Jpn. (E) 19, 6 (1998)

Blind signal separation for recognizing overlapped speech

Tomohiko Taniguchi, Shouji Kajita, Kazuya Takeda, and Fumitada Itakura

Department of Information Electronics, Graduate School of Engineering, Nagoya University, Furo-cho 1, Chikusa-ku, Nagoya, 464-8603 JAPAN

(Received 26 February 1998)

The utilization of a blind signal separation method for enhancing recognition accuracy under multi-speaker conditions is discussed. The separation method is based upon ICA (Independent Component Analysis) and makes very few assumptions about the mixing process of the speech signals. Recognition experiments are performed under various conditions concerning (1) the acoustic environment, (2) the interfering speakers, and (3) the recognition system. The results obtained under those conditions are summarized as follows. (1) The separation method can improve recognition accuracy by more than 20% when the SNR against the interfering signal is 0 to 6 dB in a soundproof room; in a reverberant room, however, the improvement is reduced to about 10%. (2) In general, the more the recognition accuracy deteriorates as the number of interfering speakers increases, the more improvement is obtained by the separation. (3) There is no significant difference between the DTW isolated word discrimination and HMM continuous speech recognition results, except that a saturation of the improvement is observed under high SNR conditions in HMM CSR.

Keywords: ICA, Blind signal separation, Overlapped speech

PACS number: 43.72.Ew, 43.72.Ne

1. INTRODUCTION

Since the multi-speaker situation is quite common in speech communication, recognizing overlapped speech is one of the most typical and important problems of speech recognition in the real world. Owing to what is known as the "cocktail party effect," humans can focus on a particular speaker's speech in the presence of interfering speech. Based on different interpretations of the cocktail party effect, various approaches for recognizing overlapped speech have been studied.

One approach is to form a directional beam that does not receive the interfering signal, based on array signal processing.1-3) The effectiveness of this method for speech recognition has been confirmed when the interfering signal is a Gaussian random signal.4) However, its performance in the multi-speaker situation, where the interfering signal is also a speech signal and the sound sources are not always well apart from each other, has not yet been clarified. Tracking speech structure hidden in the acoustic signal, for example harmonic structure or formant frequencies, has also been studied.5-7) Such methods, however, have inherent difficulty in separating non-speech signals such as ambient noise.

While these methods assume particular spatial or spectral characteristics of the overlapping signals, recently proposed blind signal separation methods8,9) do not require any special conditions on the signal mixing process except that 1) the source signals are super-Gaussian, 2) there is no time difference in the mixing process, and 3) the source signals are statistically independent. Blind signal separation thus seems applicable to a wider range of problems and is potentially suited to recognizing overlapped speech.


The purpose of this paper is to clarify the effectiveness of the blind signal separation method in terms of acoustic preprocessing for recognizing overlapped speech. For this purpose, speech recognition experiments are performed under various conditions of interfering speech, SNR, and recognition system.

This paper consists of the following sections. In Section 2, the blind separation method is briefly described. In Section 3, the experimental conditions for evaluating the method are described. After discussing the experimental results in Section 4, we conclude the paper in Section 5.

2. METHOD

In this section, the basic method of blind signal separation is described.9)

If the statistically independent source signals s1[n], ..., sN[n] are mixed by an N × N mixing matrix A, the mixed signals x1[n], ..., xN[n] are obtained as follows:

  x = As,  (1)

where x = (x1[n], x2[n], ..., xN[n])^T and s = (s1[n], s2[n], ..., sN[n])^T. Note that there is no time difference among the terms of a source signal, si[n], contributing to a mixed signal, xi[n].

Based upon this modeling, blind separation of the signals can be formalized as the problem of finding the mixing matrix A or, more directly, its inverse A^-1. From the assumption that the source signals are independent, the inverse matrix A^-1 is expected to be estimated as the matrix minimizing the mutual information between the signals obtained by multiplying an unknown matrix W with the mixed signal, i.e.,

  W = argmin_W I(Wx).  (2)

Bell et al. found that, as long as s follows a super-Gaussian distribution, the minimization of the mutual information I(Wx) can be achieved by maximizing the joint entropy after nonlinearly squashing Wx so as to bound its variance. Defining y as the squashed version of Wx, the above minimization can therefore be accomplished by

  y = g(Wx),  (3)
  W = argmax_W H(y),  (4)

where g( ) is a nonlinear squashing function, e.g. the sigmoid. The joint entropy can be calculated as the expected value of the negative log likelihood of the joint distribution,

  H(y) = -E[ln f_y(y)],  (5)

and the joint distribution of y can be determined in terms of the joint distribution of x and the Jacobian J of the mapping from x to y:

  f_y(y) = f_x(x) / |J|.  (6)

Substituting (6) into (5) gives H(y) = E[ln |J|] - E[ln f_x(x)]. Since the joint distribution of x is independent of W, it is enough to find the W that maximizes the first term, E[ln |J|], in order to maximize the joint entropy of y. A gradient method can then be applied to search for W by differentiating the logarithm of the Jacobian with respect to each element of the W matrix:

  ΔW = λ ∂E[ln |J|]/∂W,  (7)
  W(i+1) = W(i) + ΔW,  (8)

where λ is a learning factor and i is the iteration index. The iteration can be regarded as converged when ΔW becomes smaller than a predetermined threshold. Finally, with the estimated matrix W, the source signals can be recovered by inverting the mixing:

  ŝ = Wx.  (9)
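To make the update rule concrete, the following Python sketch (an illustrative addition of this text, not the authors' implementation) applies the Bell-Sejnowski rule with a sigmoid squashing function, for which the gradient of E[ln |J|] takes the standard form (W^T)^-1 + E[(1 - 2y)x^T]. The function name, learning rate, iteration budget, and convergence threshold are assumptions of the sketch.

import numpy as np

def infomax_ica(x, lr=0.01, n_iter=5000, tol=1e-7, seed=0):
    """Blind separation by infomax (after Bell & Sejnowski, ref. 9).

    x : (N, T) array holding N observed mixtures (zero-mean).
    Returns the unmixing matrix W, so that s_hat = W @ x (eq. (9)).
    """
    rng = np.random.default_rng(seed)
    n, t = x.shape
    w = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # near-identity start
    for _ in range(n_iter):
        u = w @ x                      # linear stage, u = Wx
        y = 1.0 / (1.0 + np.exp(-u))   # sigmoid squashing, eq. (3)
        # Gradient of E[ln|J|]: (W^T)^{-1} + E[(1 - 2y) x^T]  (cf. eq. (7))
        grad = np.linalg.inv(w.T) + (1.0 - 2.0 * y) @ x.T / t
        dw = lr * grad                 # eq. (7) with learning factor lr
        w += dw                        # eq. (8)
        if np.linalg.norm(dw) < tol:   # converged: ||ΔW|| below threshold
            break
    return w

# Two-source check: mix super-Gaussian (Laplacian) sources with an
# instantaneous 2x2 matrix A; W @ A should approach a scaled permutation
# of the identity, since blind separation recovers sources only up to
# scaling and ordering.
rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 20000))
A = np.array([[1.0, 0.6], [0.5, 1.0]])
W = infomax_ica(A @ s)
print(W @ A)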

3. EXPERIMENTAL SETUP

3.1 Recording Environment
The overlapped speech for the recognition experiments is recorded both in an unreverberant soundproof room and in a quiet but reverberant (T60 = 0.70 s) laboratory room. The background noise levels are 28 dBA and 36 dBA, respectively, in those rooms. The arrangement of loudspeakers and microphones for recording multi-speaker speech is illustrated in Fig. 1 for the soundproof room condition. The same loudspeakers and microphones are arranged in the same way in a larger (3 m × 5 m) laboratory room for the recording under the reverberant condition. The test speech and the interfering speech are played by loudspeakers 1 and 2, respectively. Two directional microphones are located at the same point so as to form a 90° angle.

[Fig. 1 Arrangement of speakers and microphones for recording overlapped speech. The outer frame corresponds to the size of the soundproof room.]

As shown in Fig. 2, in the 200-4,000 Hz frequency band the characteristics of the transfer functions from a loudspeaker to the two microphones are almost the same except for their gains. Therefore, the resultant mixed signals satisfy the mixing condition of linear addition, i.e. x1 = a11 s1 + a12 s2.

[Fig. 2 Transfer characteristics from loudspeaker 1 to microphones 1 and 2. Since the spectral responses of the two channels are almost identical, the mixed signal can be represented by linear addition, i.e. x1 = a11 s1 + a12 s2.]

Furthermore, under this arrangement the difference in arrival time from the same loudspeaker is negligible between the two microphones. (In our preliminary experiments, at least a 10 dB improvement in signal-to-deviation ratio is obtained when the time difference between the two microphones is less than 0.375 ms, i.e. 12 cm in path length.10,11)) In other words, τ11 = τ21 and τ12 = τ22 hold for a mixing process of the form

  x1[n] = a11 s1[n - τ11] + a12 s2[n - τ12],
  x2[n] = a21 s1[n - τ21] + a22 s2[n - τ22],  (10)

so that redefining each source with its common delay reduces Eq. (10) to the instantaneous mixing of Eq. (1).


Sources of the test and interfering signals are stored in digital 16-bit form at 16 kHz sampling. The mixed signals are recorded at 8 kHz and 16 kHz for the DTW and CSR experiments, respectively. As interfering speech, human speech-like noise (HSLN), a kind of babble noise generated by superimposing independent speech signals,12) is utilized; each interfering utterance is selected to be different from the interfered speech. By changing the number of superpositions, we can simulate various multi-speaker conditions: when HSLN of one superposition is used for interference, the overlap of two speakers' speech is simulated, whereas when the number of superpositions is large, a cocktail party sound is simulated. It should be noted, however, that the experiment does not simulate a real cocktail party environment, because the interfering speech is not spatially distributed. When the number of superpositions reaches several hundred, HSLN sounds like stationary noise (a schematic sketch of the HSLN generation and SNR mixing follows).
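As a schematic of how such interference can be produced and mixed at a prescribed global SNR, consider the sketch below. It reflects our reading of the HSLN construction rather than the exact recipe of ref. 12; the function names, the random-excerpt strategy, and the sqrt(n) level normalization are assumptions of this illustration.

import numpy as np

def make_hsln(utterances, n_super, length, rng):
    """Human speech-like noise: superimpose n_super independent speech
    excerpts (schematic version of ref. 12)."""
    noise = np.zeros(length)
    for _ in range(n_super):
        u = utterances[rng.integers(len(utterances))]      # pick an utterance
        start = rng.integers(max(1, len(u) - length + 1))  # random excerpt
        seg = u[start:start + length]
        noise[:len(seg)] += seg
    # Divide by sqrt(n_super) so the power stays roughly constant for
    # independent signals (an illustrative normalization choice).
    return noise / np.sqrt(max(1, n_super))

def mix_at_global_snr(speech, noise, snr_db):
    """Scale the noise so the global speech-to-noise power ratio equals
    snr_db, then add it to the speech."""
    n = noise[:len(speech)]
    gain = np.sqrt(np.mean(speech**2) / (np.mean(n**2) * 10.0**(snr_db / 10.0)))
    return speech + gain * n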

3.2 Recognition Systems
Both DTW isolated word discrimination and HMM continuous speech recognition experiments are performed in a speaker-dependent manner. The isolated word discrimination task is to discriminate the utterances of a phonetically similar pair of Japanese city names. For the DTW experiment, SGDS13) is used as the spectral measure, as it outperformed MFCC in a preliminary experiment. 68 utterances of five male speakers are used, and the results below are averages over the five speakers. The rest of the conditions are summarized in Table 1, together with the CSR case (a sketch of the DTW back end follows).

[Table 1 Acoustic analysis conditions for the recognition experiments.]
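For concreteness, a minimal sketch of the DTW discrimination back end follows. To stay self-contained it uses a plain Euclidean local cost between generic feature vectors; the actual system uses the SGDS spectral measure,13) so this is a stand-in rather than the authors' implementation.

import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between feature sequences a (T1, D)
    and b (T2, D): Euclidean local cost, symmetric step pattern,
    normalized by path length T1 + T2."""
    t1, t2 = len(a), len(b)
    d = np.full((t1 + 1, t2 + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[t1, t2] / (t1 + t2)

def discriminate(test, ref_a, ref_b):
    """Isolated word discrimination: assign the test utterance to the
    reference word with the smaller DTW distance."""
    return "A" if dtw_distance(test, ref_a) < dtw_distance(test, ref_b) else "B"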

For the continuous speech recognition (CSR) experiment, the HTK speech recognizer is used with the conditions listed in Table 1. Unlike the DTW case, no significant difference was found between SGDS and MFCC in a preliminary experiment. For the CSR experiment, ten sentences each from four male and four female speakers are used as test data. As in the DTW experiment, the results averaged over the speakers are discussed below. The monophone HMM models are trained on 140 phonetically balanced sentences.

4. RESULTS

4.1 DTW Word Discrimination
Figure 3 shows the recognition accuracy across the global SNR of the test speech against an interfering signal of HSLN with 256 superpositions. The baseline performance, using the test signal recorded by microphone 1 with no interfering signal, is shown as the topmost line (CLEAN). Owing to the directional property of the microphones, 10 to 20% better accuracy is obtained by microphone 1 (MIC 1) than by microphone 2 (MIC 2). Even so, the accuracy is 5 to 25% lower than that of the clean signal. (Note that the accuracy usually never falls below 50% in the word discrimination task.)

After separation (SEP 1), the recognition accuracy is improved by about 2, 5, 7 and 10% under SNRs of 12, 6, 0 and -6 dB, respectively; that is, the error rate is reduced to a half or two thirds of that of the pre-separated data. On the other hand, after separation the accuracy on the interfering signal (SEP 2) is greatly reduced, which is further evidence of the effectiveness of the signal separation. From these results, it is clear that the separation method works well as pre-processing for speech recognition systems.

4.2 HMM Continuous Speech Recognition
The same tendency as in isolated word discrimination is observed in the HMM continuous speech recognition experiment, illustrated in Fig. 4 as word correctness (% Corr., disregarding insertion errors). The baseline word correctness, however, is quite low, especially when the SNR is below 0 dB. By applying the separation, an improvement in correctness of more than 20% is obtained under the 0 and 6 dB conditions.

The conspicuous difference from the DTW results is that the improvement saturates at 12 dB SNR, where the signals before separation outperform the separated speech. From these results, it can be concluded that the blind separation improves the performance more effectively when the baseline accuracy is low.

Figure 5 shows the results across the type of interfering speech, i.e. the number of superpositions of HSLN. The SNR is fixed at 0 dB throughout these experiments. The results include simple overlapping with a single speaker's speech, male (1M) or female (1F). As shown in the figure, in general, the overall recognition accuracy decreases as the number of interfering speakers increases. However, the improvement obtained by the separation increases as the number of superpositions increases.

[Fig. 3 Recognition accuracy (correct rate) in DTW-based word discrimination of interfered and separated speech under various SNR conditions. MIC 1 is located in front of loudspeaker 1, which plays the test speech. SEP 1 is the separated signal corresponding to the test speech, whereas SEP 2 corresponds to that of the interfering speech.]

[Fig. 4 Word correct rate (% Corr.) of HMM-based continuous speech recognition under various SNR conditions. The arrangement of microphones and loudspeakers is the same as in the DTW word discrimination experiment.]


In addition, the recognition results after single-microphone spectral subtraction (SS) are also shown in the figure. Since SS assumes that the noise is stationary, an improvement over the original result is obtained only for HSLN of 256 superpositions, where the HSLN is almost stationary. Even then, the improved result is still lower than that of the blind separation.

[Fig. 5 % Corr. of HMM-based continuous speech recognition plotted across the number of superpositions of the interfering human speech-like noise (HSLN). 1M and 1F indicate the cases where an utterance of a male or a female speaker is used as interfering speech, i.e. the simple overlapping speech situation.]

4.3 Reverberant Room Results
The CSR results for the laboratory room recordings in Fig. 6 show that in the reverberant condition the separation method does not work as well as in the soundproof room: the improvements under the 0 and 6 dB SNR conditions are only about 10%. In our previous simulation experiments,10,11) the most critical condition governing the separation performance was the time difference in the mixing process (described by Eq. (10)). In the reverberant room, where various acoustic echo paths exist, this condition does not hold, owing to the mixing of delayed signals.

[Fig. 6 Word correct rate (% Corr.) of HMM-based continuous speech recognition under the reverberant laboratory room conditions, with interfering signals at various SNRs.]

5. CONCLUSION

In this paper we evaluated Bell's blind separation method as preprocessing for recognizing overlapped speech. In both DTW word discrimination and HMM continuous speech recognition, the separation method can improve recognition accuracy by more than 10% where the SNR of the signal is below 10 dB. From these results, it can be concluded that the blind separation method is effective for separating overlapped speech.

However, two problems were found: 1) saturation of the performance improvement in relatively clean conditions, and 2) insufficient results under highly reverberant conditions. The latter is a consequence of the theoretical assumptions of the method and is the most important problem to be solved in future work.

ACKNOWLEDGEMENTS

The authors are grateful to Dr. Hani Yehia for his suggestions and discussions on this work. This work has been partially supported by the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Corporation (JST).

REFERENCES

1) J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," J. Acoust. Soc. Am. 78, 1508-1518 (1985).
2) J. L. Flanagan, A. C. Surendran, and E. E. Jan, "Spatially selective sound capture for speech and audio processing," Speech Commun. 13, 207-222 (1993).
3) Y. Kaneda and J. Ohga, "Adaptive microphone-array system for noise reduction," IEEE Trans. Acoust. Speech Signal Process. 34, 1391-1400 (1986).
4) S. Nakamura and K. Shikano, "Room acoustics and reverberation: Impact on hands-free recognition," Proc. Eur. Conf. Speech Commun. Technol., 2419-2422, Rhodes, Sep. (1997).


5) T. W. Parsons, "Separation of speech from interfering speech by means of harmonic selection," J. Acoust. Soc. Am. 60, 911-918 (1976).
6) K. Kashino, K. Nakadai, T. Kinoshita, and H. Tanaka, "Organization of hierarchical perceptual sounds," Proc. 14th Int. Joint Conf. Artificial Intelligence, Vol. 1, 158-164, Montreal, August (1995).
7) H. G. Okuno, T. Nakatani, and T. Kawabata, "A new speech enhancement: Speech stream segregation," Proc. Int. Conf. Spoken Language Processing, Vol. 4, 2356-2359, Philadelphia, Oct. (1996).
8) C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture," Signal Process. 24, 1-10 (1991).
9) A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Comput. 7, 1129-1159 (1995).
10) T. Taniguchi, H. Yehia, S. Kajita, K. Takeda, and F. Itakura, "On the problems in applying Bell's blind separation to real environments," ASJ-ASA 3rd Joint Meet., 1pSP4, pp. 2602 (in JASA Special Issue), pp. 1257-1260 (in ASJ Proceedings), Dec. (1996).
11) K. Takeda, T. Taniguchi, S. Kajita, and F. Itakura, "Separation of multi-speaker speech," Tech. Rep. IEICE, EA97-14, 31-37 (1997) (in Japanese).
12) D. Kobayashi et al., "Extracting speech features from human speech-like noise," Proc. ICSLP 96, Vol. 1, 418-421 (1996).
13) F. Itakura and T. Umezaki, "Distance measure for speech recognition based on the smoothed group delay spectrum," Proc. Int. Conf. Acoust. Speech Signal Process., 1257-1260, Dallas, April (1987).

Tomohiko Taniguchi received the Bachelor and Master of Engineering degrees from Nagoya University in 1994 and 1996, respectively. He is now working for Matsushita Electric Industrial Co.

Shoji Kajita was born in Japan on April 20, 1967. He received the Bachelor, Master, and Doctor of Information Engineering degrees from Nagoya University in 1990, 1992 and 1998, respectively. He is an Assistant Professor in the Center for Information Media Studies of Nagoya University. His research interests include speech processing based on the human auditory system, especially its applications in speech representation and speech recognition.

Kazuya Takeda was born in Sendai, Japan on September 1, 1960. He received the B.E.E., M.E.E. and Doctor of Engineering degrees, all from Nagoya University, in 1983, 1985 and 1994, respectively. In 1986, he joined ATR (Advanced Telecommunication Research Laboratories), where he was involved in two major projects on speech database construction and system development. In 1989, he moved to KDD R&D Laboratories and participated in a project for constructing a voice-activated telephone extension system. Since 1995, he has been working for Nagoya University as an Associate Professor.

Fumitada Itakura was born in Toyokawa, Japan on August 6, 1940. He received the B.E.E., M.E.E. and Doctor of Engineering degrees, all from Nagoya University, in 1963, 1965 and 1972, respectively. In 1968, he joined the Electrical Communication Laboratory of NTT, Musashino, Tokyo, and participated in speech processing research, including maximum likelihood spectrum estimation, the PARCOR method, the line spectrum pair method, the composite sinusoidal method, and APC-AB speech coding. In 1981, he was appointed Head of the Speech and Acoustics Research Section of the ECL, NTT. In 1984, he left NTT to become a professor at Nagoya University, where he teaches courses on communication theory and signal processing. In 1975, he received the IEEE ASSP Senior Award for his paper on speech recognition based on the minimum prediction residual principle. He is a co-recipient, with B. S. Atal, of the 1986 Morris N. Liebmann Award for contributions to linear predictive coding for speech processing. In 1997, he received the IEEE Signal Processing Society Award.
