<<

TIMBRE BASED PERCEPTUAL FEATURE FOR SINGER IDENTIFICATION

Swe Zin Kalayar Khine, Tin Lay Nwe, Haizhou Li
Institute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613

ABSTRACT

Timbre can be defined as a feature of an auditory stimulus that allows us to distinguish sounds which have the same pitch and loudness. In this paper, we explore a timbre based perceptual feature for singer identification. We start with a vocal detection process to extract the vocal segments from the songs. The cepstral coefficients, which reflect timbre characteristics, are then computed from the vocal segments. The cepstral coefficients of timbre are formulated by combining information of harmonics and the dynamic characteristics of the sound, such as vibrato and the attack-decay envelope of the songs. Bandpass filters that spread according to the octave scale are used to extract vibrato and harmonic information of sounds. The experiments are conducted on a database of 84 popular songs. The results show that the proposed timbre based perceptual feature is robust and effective. We achieve an average error rate of 12.2% in segment level singer identification.

1. INTRODUCTION

The rapid evolution of digital multimedia technologies in computing and the Internet has enabled huge multimedia databases. With these continually growing databases, automatic music information retrieval (MIR) has become increasingly important. Singer Identification (SingerID) is one of the important tasks in MIR. It is the process of identifying the singer of a song. In general, a SingerID process comprises three major steps. The first step is detecting segments (vocals) in a song. Vocals can be either pure singing voice or a mixture of singing voice with background instrumentals (nonvocal). The second step is singer feature computation. Features are extracted from vocal segments. The last step is formulating a singer classifier using the feature parameters. In this paper, we propose new solutions for the second step of singer feature computation.

Earlier studies in SingerID use features such as Mel Frequency Cepstral Coefficients (MFCC) [6]. Recently, studies have started looking into perceptually motivated features which are able to appreciate the aesthetic characteristics of singing voice for music content processing and analysis. For example, vibrato motivated acoustic features are used to identify singers in [1, 11]. Besides vibrato, harmonics are also a useful feature for SingerID. In fact, the harmonics of a soprano singer's voice are widely spaced in the spectrum, in contrast to those of a bass singer's voice [8]. Hence, the harmonic spectrum is useful to differentiate between low and high pitch singers. One of the basic elements of sound is timbre or color. Timbre is the quality of sound which allows human ears to distinguish among different types of sounds [15]. Cleveland [3] states that an individual singer has a characteristic timbre that is a function of the laryngeal source and vocal tract resonances. Timbre is assumed to be invariant with an individual singer. On the other hand, Erickson [6] states that the traditional concept of an invariant timbre associated with a singer is inaccurate, and that vocal timbre must be conceptualized in terms of transformations in perceived quality that occur across an individual singer's range and/or registers. In general, these studies suggest that timbre is invariant with an individual singer, or that there is a particular range of timbre quality associated with an individual singer. In this paper, we would like to study the use of timbre based features in the SingerID task.

Poli [9] measured the timbre quality from the spectral envelope of MFCC features to identify singers. In [17], timbre is characterized by the harmonic lines of the harmonic sound. In this paper, we propose determining timbre by the harmonic content of a sound and its dynamic characteristics, such as vibrato and the attack-decay envelope [15].

The rest of this paper is organized as follows. In section 2, we present the methods for vocal detection. In section 3, we study perceptually motivated acoustic features and their characteristics. In section 4, we describe the popular song database, experiment setup and results. Finally, we conclude our study in section 5.

2. VOCAL DETECTION

We extract vocal segments from songs. Subband based Log Frequency Power Coefficients (LFPC) [10] are used as acoustic features. We train hidden Markov models (HMM) as vocal and nonvocal acoustic models to detect vocal segments. Vocal detection errors can affect SingerID performance. To reduce the vocal detection error, we formulate the vocal detection as a hypothesis test [12] based on a confidence score. In this way, we only retain vocal segments with which the acoustic models have high confidence. Vocal segments with a confidence measure higher than a predetermined threshold are accepted for the SingerID experiment.
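To make the hypothesis-test step concrete, here is a minimal Python sketch of the confidence-based filtering, assuming the per-segment log-likelihoods under the vocal and nonvocal HMMs are already available. The log-likelihood ratio used as the confidence score and the threshold value are illustrative choices, since the paper does not spell out its exact score formula.

```python
# Confidence-based retention of vocal segments (sketch). The two model
# scores are assumed to be per-segment log-likelihoods from the vocal and
# nonvocal HMMs trained on LFPC features; the log-likelihood ratio serves
# as a hypothetical confidence score for the hypothesis test.

def filter_vocal_segments(segments, ll_vocal, ll_nonvocal, threshold=0.0):
    """Keep only segments whose confidence exceeds a preset threshold.

    segments    -- list of segment identifiers (e.g. frame ranges)
    ll_vocal    -- log-likelihood of each segment under the vocal HMM
    ll_nonvocal -- log-likelihood of each segment under the nonvocal HMM
    threshold   -- hypothetical confidence threshold, tuned on held-out data
    """
    accepted = []
    for seg, lv, ln in zip(segments, ll_vocal, ll_nonvocal):
        confidence = lv - ln          # log-likelihood ratio as confidence
        if lv > ln and confidence > threshold:
            accepted.append(seg)      # high-confidence vocal segment
    return accepted


# Example: three 1-second segments with toy model scores.
print(filter_vocal_segments(["seg1", "seg2", "seg3"],
                            [-120.0, -150.0, -130.0],
                            [-135.0, -140.0, -131.0],
                            threshold=5.0))   # -> ['seg1']
```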


3. ACOUSTIC FEATURES

We next study several perceptually motivated features, namely harmonic, vibrato and timbre features, to characterize song segments. We propose to use subband filters on an octave frequency scale in formulating these acoustic features.

3.1. Harmonic

Soprano singers have a higher fundamental frequency than bass singers. Hence, the harmonics of a soprano's voice are widely spaced, in contrast to those of bass singing. The upper panels of Figures 1(a) and (b) show examples of the harmonic spectrum of soprano and bass singing respectively. To capture this information, we implement the harmonic filters with the centers located at each of the musical notes, as in the middle panels of Figures 1(a) and (b). The list of the frequencies of the musical notes can be found in [7]. The subband filters span up to 8 octaves (16 kHz). Each octave has 12 subbands and there are 96 subbands in total. The output of the subband filtering is given in the lower panels of Figure 1. For the soprano, the lower panel of Figure 1(a) shows widely spaced peaks. However, the peaks are narrowly spaced in the lower panel of Figure 1(b) for the bass singer.

Figure 1. Harmonics and harmonic filtering: (a) soprano singer, (b) bass singer. The upper panels show the harmonic spectra, the middle panels the harmonic filters, and the lower panels the subband filter outputs (frequency axis in kHz).
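The following sketch illustrates the octave-scale subband analysis just described: 96 equal-tempered center frequencies (12 per octave over 8 octaves) with each band summing FFT energy around a note. The lowest center (eight octaves below 16 kHz, about 62.5 Hz) and the rectangular 1-semitone band shape are assumptions made for illustration; the paper takes the note frequencies from [7] and does not specify the exact harmonic filter shape.

```python
# Octave-scale subband log energies (sketch): 96 note-centered bands,
# 12 semitones per octave over 8 octaves, ending near 16 kHz.

import numpy as np

F_LOW = 16000.0 / 2 ** 8        # ~62.5 Hz, assumed lowest band center

def note_center_frequencies(n_octaves=8, notes_per_octave=12, f_low=F_LOW):
    """Equal-tempered center frequencies for the 96 subbands."""
    n = np.arange(n_octaves * notes_per_octave)
    return f_low * 2.0 ** (n / notes_per_octave)

def subband_log_energies(frame, sr, centers, half_bw_semitones=0.5):
    """Log energy of an FFT power spectrum in each note-centered band.

    Each band spans +/- half_bw_semitones around its center, i.e. one
    semitone in total, matching the bandwidth quoted in Section 3.2."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    ratio = 2.0 ** (half_bw_semitones / 12.0)
    energies = []
    for fc in centers:
        band = (freqs >= fc / ratio) & (freqs <= fc * ratio)
        energies.append(np.log(spectrum[band].sum() + 1e-10))
    return np.array(energies)

# Example: analyze one 20 ms frame of a synthetic 440 Hz tone at 32 kHz.
sr = 32000
t = np.arange(int(0.02 * sr)) / sr
frame = np.sin(2 * np.pi * 440.0 * t)
centers = note_center_frequencies()
# Index of the band containing the 440 Hz partial (around band 34).
print(np.argmax(subband_log_energies(frame, sr, centers)))
```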

3.2. Vibrato

Vibrato is a periodic, rather sinusoidal, modulation of the pitch and amplitude of a musical tone [14]. Vocal vibrato can be seen as a function of the style of singing associated with a particular singer [11].

Figure 2. Vibrato waveforms of 3 singers at note D6, 1174.6 Hz: (a) Singer-A, (b) Singer-B, (c) Singer-C. Each panel plots the deviation from the note (Hz) against time (ms); panel (a) marks the vibrato extent and the period (rate = 1/period).

Vibrato is characterized by two parameters, the extent (or excursion) and the rate, as illustrated in Figure 2(a). Female singers tend to have a slightly faster mean vibrato rate than male singers [4]. Vibrato excursions occurring at the tone D6 for three different singers are shown in Figures 2(a), (b) and (c). In Figure 2(c), the vibrato excursion up and down from the note is balanced. However, unbalanced vibrato excursions can be seen in Figures 2(a) and (b). According to [2], such irregular vibrato excursions are very common in most of the tones. In [5], vibrato extent is categorized into two different types, 'wobble' and 'bleat'. Wobble has a wider pitch fluctuation and a slower rate of vibrato, as in Singer-A. However, bleat has a narrower pitch fluctuation and a faster rate, as in Singer-C. Hence, information such as 1) regularity or irregularity in vibrato excursion, 2) the two vibrato types of 'wobble' and 'bleat', and 3) vibrato rate is integrated into the acoustic feature.

As a result of the modulation of pitch, the frequencies of all the partials vary regularly and in sync with the pitch frequency modulation [13]. Therefore, we implement the subband filters with the center frequencies located at each of the musical notes to characterize the vibrato. The list of the frequencies of the musical notes can be found in [7]. Since singing voice contains high frequency harmonics [16], our subband filters span up to 8 octaves (16 kHz). Our subband filters are implemented with a bandwidth extending beyond 1 semitone from each note, since the vibrato extent can increase to more than ±1 semitone when a singer raises his/her vocal loudness [13]. We employ cascaded subband filters (referred to as the vibrato filter) [11] to capture vibrato information from the acoustic signal.

We illustrate the vibrato filter [11] and the subband outputs in Figures 3 and 4. The filter has two cascaded layers of subbands. The first layer has overlapped trapezoidal filters. The second layer has 5 non-overlapped rectangular filters of equal bandwidth for each trapezoidal subband. The trapezoidal filters are tapered between ±0.5 semitone and ±1.5 semitone.

Figure 3. A bank of cascaded subband filters: the overlapped trapezoidal first layer and the rectangular second layer, shown as frequency responses (dB) over a frequency axis reaching 16 kHz.

Figure 4. Vibrato fluctuations and vibrato filtering observed at the note G#5, 830.6 Hz: (a) vibrato fluctuates left, (b) no fluctuation, (c) vibrato fluctuates right. The upper panels show the spectrum partials. The middle panels present the frequency response of the vibrato filter. The lower panels demonstrate the instantaneous amplitude output of the vibrato filter. With the output from the vibrato filter, we are able to track the local maxima to derive the vibrato extent [11].
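As an illustration of how vibrato extent can be read off local maxima, the sketch below collapses the two-layer filter of Figure 3 into a direct peak search within ±1.5 semitones of a note: the signed deviation of the strongest partial from the note center is the instantaneous extent, and its frame-to-frame oscillation reflects the vibrato rate. This is a simplified stand-in for the authors' cascaded filter, not their exact implementation.

```python
# Vibrato extent tracking (sketch): per frame, locate the spectral local
# maximum within +/-1.5 semitones of the note and record its deviation
# from the note center.

import numpy as np

def vibrato_extent_track(frames, sr, note_hz, nfft=8192):
    """Signed deviation (Hz) of the strongest partial near note_hz, per frame."""
    lo, hi = note_hz * 2 ** (-1.5 / 12), note_hz * 2 ** (1.5 / 12)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sr)
    band = (freqs >= lo) & (freqs <= hi)
    window = np.hanning(frames.shape[1])
    track = []
    for frame in frames:
        mag = np.abs(np.fft.rfft(frame * window, n=nfft))
        peak = freqs[band][np.argmax(mag[band])]   # local maximum in the band
        track.append(peak - note_hz)               # deviation = vibrato extent
    return np.array(track)

# Example: synthesize a D6 (1174.6 Hz) tone with a 5.5 Hz, +/-20 Hz vibrato.
sr, dur = 16000, 1.0
t = np.arange(int(sr * dur)) / sr
phase = 2 * np.pi * 1174.6 * t - (20.0 / 5.5) * np.cos(2 * np.pi * 5.5 * t)
x = np.sin(phase)
frame_len, hop = 512, 128
frames = np.stack([x[i:i + frame_len]
                   for i in range(0, len(x) - frame_len + 1, hop)])
dev = vibrato_extent_track(frames, sr, 1174.6)
# Peak deviations come out at approximately +/-20 Hz (slightly smeared
# by the 32 ms analysis window); their oscillation gives the 5.5 Hz rate.
print(round(float(dev.max()), 1), round(float(dev.min()), 1))
```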


The vibrato fluctuations are observed by tracking the local maxima in the instantaneous amplitude output of the subbands in the second layer, as shown in the lower panel of Figure 4. The local maxima indicate the position of the vibrato. The distance between the center frequency of the corresponding filter and the local maximum gives the vibrato extent. The tapered and overlapped trapezoidal filters in the first layer allow the vibrato fluctuations of adjacent notes, observed at the output of the second-layer subbands, to be 'continuous'. The vibrato filter captures irregularities or regularities in vibrato excursion as well as the 'wobble' and 'bleat' vibrato types.

3.3. Timbre

Sounds may be generally characterized by pitch, loudness and quality. For sounds that have the same pitch and loudness, the sound quality or timbre describes the characteristics which allow human ears to distinguish among them. Timbre is a general term for the distinguishable characteristics of a tone. Timbre is mainly determined by the harmonic content of a sound and its dynamic characteristics, such as vibrato and the attack-decay envelope of the sound [15]. The attack-decay processes of two different singers are shown in Figures 5(a) and (b). Singer-1's voice takes more time to develop to its peak than that of Singer-2, and the decay process of Singer-1 is more gradual than that of Singer-2.

Figure 5. Attack-decay envelopes (amplitude against time), with the peak amplitudes marked: (a) Singer-1, (b) Singer-2.

Studies have found that it takes a duration of about 60 ms to recognize the timbre of a tone. If a tone is shorter than 4 ms, it is perceived as an atonal click [15].

3.4. Cepstral Coefficient Computation

We first divide a music signal into frames of 20 ms with 13 ms overlap and apply a Hamming window to each frame to minimize signal discontinuities at the frame boundaries. Each audio frame is passed through the harmonic filters for harmonic content analysis to derive the log energy of each band. Finally, we compute a total of 13 Octave Frequency Cepstral Coefficients (OFCChar) from the log energies using the Discrete Cosine Transform.

We then replace the harmonic filters with the vibrato filters to compute the OFCCvib coefficients. To account for timbre characteristics, the output log energy of the vibrato filter is augmented by that of the harmonic and Mel-scale filters. Then, we compute 13 TimBre Cepstral Coefficients (TBCC). We augment the feature coefficients with time derivatives, or delta parameters, computed from the two neighbouring frames to capture temporal information. For example, the delta parameters take care of the vibrato rate and the attack-decay envelope in OFCCvib and TBCC respectively.
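A minimal sketch of this cepstral computation follows: the log subband energies of one frame are decorrelated with a DCT-II, truncated to 13 coefficients, and delta parameters from the two neighbouring frames are appended. The filter-bank energies themselves are taken as given here (harmonic, vibrato, or, for TBCC, their concatenation with Mel-scale energies).

```python
# Cepstral coefficients from filter-bank log energies (sketch).

import numpy as np

def cepstra_from_log_energies(log_e, n_ceps=13):
    """DCT-II of one frame's log energies, truncated to n_ceps coefficients."""
    n = len(log_e)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return basis @ log_e

def add_deltas(ceps):
    """Append first-order deltas computed from the two neighbouring frames."""
    padded = np.pad(ceps, ((1, 1), (0, 0)), mode='edge')
    deltas = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([ceps, deltas])

# Example: 5 frames of 96 subband log energies -> 5 x 26 feature vectors
# (13 static coefficients plus 13 deltas).
rng = np.random.default_rng(0)
log_e = rng.normal(size=(5, 96))
ceps = np.array([cepstra_from_log_energies(f) for f in log_e])
print(add_deltas(ceps).shape)   # (5, 26)
```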

4. EXPERIMENTS AND DISCUSSION

We compile a database of 84 popular songs from the albums of 12 solo singers. Examples of the album titles are 'My own best enemy' (singer: Richard Marx) and 'Like a virgin' (singer: Madonna). A total of 7 songs are extracted from each album. Four songs of each singer are allocated to TrainDB and the remaining 3 songs to TestDB. Vocal and nonvocal segments of each song are manually annotated to provide the ground truth. We define the error rate (ER) as the number of errors divided by the total number of test trials.

We first segment a song into 1 second fixed-length segments. Then, each segment is classified as vocal or nonvocal using the method described in section 2. The average vocal/nonvocal classification error rate of the vocal detection system is 8.7%.

Using the vocal segments from the vocal detector, the SingerID experiments are then conducted. We use continuous density HMMs with four states and two Gaussian mixtures per state for all HMM models in our experiments. Using the TrainDB, we train a singer model, λs, for each of the 12 singers. To identify the singer of a vocal segment O, we estimate the likelihood score of O being generated by each of the 12 singer models. The model with the highest likelihood suggests the singer identity. We conduct experiments to compare the SingerID performance of five feature types, namely TBCC, OFCCvib, OFCChar, MFCC, and LPCC. The window size is 20 ms and the frame shift is 7 ms in all tests.
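The identification rule can be sketched as follows, here using the hmmlearn package as a stand-in for the authors' HMM toolkit (their training setup is not specified beyond the 4-state, 2-mixture topology): one model λs is trained per singer, and a test segment O is assigned to the singer whose model scores the highest log-likelihood.

```python
# Maximum-likelihood singer identification with per-singer HMMs (sketch).
# Feature matrices are assumed to be (n_frames x n_dims) arrays, e.g. TBCCs.

import numpy as np
from hmmlearn import hmm

def train_singer_models(train_feats):
    """train_feats: dict singer -> list of (n_frames x n_dims) arrays."""
    models = {}
    for singer, segments in train_feats.items():
        # 4 states, 2 Gaussian mixtures per state, as in the paper.
        model = hmm.GMMHMM(n_components=4, n_mix=2, covariance_type='diag')
        X = np.vstack(segments)
        model.fit(X, lengths=[len(s) for s in segments])
        models[singer] = model
    return models

def identify(models, segment):
    """Return the singer whose model maximizes the segment log-likelihood."""
    return max(models, key=lambda s: models[s].score(segment))

# Example with toy 26-dimensional features for two hypothetical singers.
rng = np.random.default_rng(1)
train = {"singerA": [rng.normal(0, 1, (100, 26)) for _ in range(3)],
         "singerB": [rng.normal(3, 1, (100, 26)) for _ in range(3)]}
models = train_singer_models(train)
print(identify(models, rng.normal(3, 1, (50, 26))))   # -> 'singerB'
```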

Feature   TBCC   OFCCvib   OFCChar   MFCC   LPCC
ER (%)    12.2   12.8      13.6      12.9   21.9

Table 1. Error rate (ER%) of SingerID on TestDB.

Table 1 shows that the TBCC feature, with an average error rate of 12.2%, outperforms all other features. As mentioned in section 3.3, a duration of about 60 ms is necessary to recognize the timbre of a tone. Hence, the error rate of 12.2% may not be the optimal performance for TBCC, since the window size is only 20 ms. We therefore conduct several further experiments with the TBCC feature at different window sizes, with the frame shift fixed at 7 ms in all tests. In Table 2, the results show that performance peaks when the window size is 63 ms, giving the best error rate of 11.9%. The timbre based features capture the singer characteristics well, with 0.9% and 1.7% absolute (or 7% and 12.5% relative) error rate reductions over the OFCCvib (reported earlier in [11]) and OFCChar features respectively. We believe that a window size of around 60 ms is suitable for extracting timbre characteristics from a music signal.

Window size (ms)   ER (%)        Window size (ms)   ER (%)
59                 12.1          63                 11.9
60                 15.3          64                 15.4
61                 15.0          65                 13.6

Table 2. Error rate (ER%) on TestDB with different window sizes of TBCC.


Both vocal and instrumental sounds have musical characteristics such as harmonics, vibrato, and timbre. This raises the question of whether the three features, TBCC, OFCCvib and OFCChar, capture these musical characteristics from the vocal or from the instrumental sound. To look into this, we conduct SingerID experiments using manually annotated nonvocal segments. The SingerID system performance using (a) vocal segments and (b) nonvocal segments is presented in Table 3.

Features   Case (a)   Case (b)
TBCC       12.2       60.1
OFCCvib    12.8       66.7
OFCChar    13.6       60.6

Table 3. Average error rates using (a) vocal segments and (b) nonvocal segments.

Without surprise, in the presence of vocal timbre, vibrato and harmonics, the three features work best, as in Case (a). In the absence of vocal timbre, vibrato and harmonics, as in Case (b), the error rate increases. This is because the singing voice usually stands out from the background musical accompaniment [13], and the three features capture musical characteristics from the vocal rather than from the background instruments.

5. CONCLUSIONS

We have presented an approach for the singer identification of popular songs. The proposed approach explores perceptually motivated timbre characteristics for SingerID. The main contributions of this paper are summarized as follows: 1) we propose several perceptually motivated features using harmonic, vibrato and timbre information to represent the singer's characteristics; 2) with these features, we found that there is a strong correlation between singer characteristics and system performance. We conclude that perceptually motivated features, especially timbre features, are effective and help improve system performance.

6. REFERENCES

[1] Bartsch, M.A., and Wakefield, G.H. "Singing Voice Identification Using Spectral Envelope Estimation," IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 100-109, March 2004.

[2] Bretos, J., and Sundberg, J. "Measurements of Vibrato Parameters in Long Sustained Crescendo Notes as Sung by Ten Sopranos," Journal of Voice, vol. 17, pp. 343-352, 2003.

[3] Cleveland, T.F. "Acoustic Properties of Voice Timbre Types and Their Influence on Voice Classification," Journal of the Acoustical Society of America, vol. 61, pp. 1622-1629, June 1977.

[4] Dejonckere, P.H., Hirano, M., and Sundberg, J. Vibrato, San Diego: Singular Pub., 1995, ch. 2.

[5] Dromey, C., Carter, N., and Hopkin, A. "Vibrato Rate Adjustment," Journal of Voice, vol. 17, pp. 168-178, 2003.

[6] Erickson, M., Perry, S., and Handel, S. "Discrimination Functions: Can They Be Used to Classify Singing Voices?" Journal of Voice, Elsevier, vol. 15, pp. 492-502, December 2001.

[7] Everest, F.A. Master Handbook of Acoustics, New York: McGraw-Hill Professional, 2000.

[8] Joliveau, E., Smith, J., and Wolfe, J. "Vocal Tract Resonances in Singing: The Soprano Voice," Journal of the Acoustical Society of America, vol. 116, pp. 2434-2439, 2004.

[9] Poli, G.D., and Prandoni, P. "Sonological Models for Timbre Characterization," Journal of New Music Research, vol. 26, pp. 170-197, 1997.

[10] Nwe, T.L., Foo, S.W., and De Silva, L.C. "Stress Classification Using Subband Based Features," IEICE Transactions on Information and Systems, Special Issue on Speech Information Processing, vol. E86-D, no. 3, pp. 565-573, March 2003.

[11] Nwe, T.L., and Li, H. "Exploring Vibrato-Motivated Acoustic Features for Singer Identification," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 2, 2007.

[12] Sukkar, R.A., and Lee, C.H. "Vocabulary Independent Discriminative Utterance Verification for Nonkeyword Rejection in Subword Based Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 420-429, November 1996.

[13] Sundberg, J. The Science of the Singing Voice, Northern Illinois University Press, 1987.

[14] Timmers, R., and Desain, P. "Vibrato: Questions and Answers from Musicians and Science," Proc. Int. Conf. on Music Perception and Cognition, England, 2000.

[15] Winckel, F. Music, Sound and Sensation, Dover, NY, 1967.

[16] Zhang, T. "System and Method for Automatic Singer Identification," in Proceedings of the IEEE International Conference on Multimedia and Expo, Baltimore, MD, 2003.

[17] Zhang, T., and Kuo, C.C.J. Content-Based Audio Classification and Retrieval for Audiovisual Data Parsing, Kluwer Academic, USA, 2001.
