Classification of Solo Phrases using Tuning Factor and Percussive Factor

Hui Li Tan, Yongwei Zhu, Lekha Chaisorn
Institute for Infocomm Research (I2R), A*STAR
1 Fusionopolis Way, Singapore 138632
{hltan, ywzhu, clekha}@i2r.a-star.edu.sg

Abstract—Live sound audio mixing is the art of processing and combining sound sources captured at live performances for the performers or audience in real time. With the development of an automatic live sound audio mixing system in mind, the present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument classification of solo phrases. In particular, we focus on the classification of vocals and drums phrases, two common instrument categories in live performances. Using the Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel, vocals and drums classification rates of 84% and 97% are achieved respectively, compared to 72% and 96% when using Mel-Frequency Cepstral Coefficients (MFCCs).

I. INTRODUCTION

Mixing is not just limited to the studio! Audio signals captured at live performances can be processed and combined in real time to create new mixes for the performers or audience. Depending on the performance requirements, a variety of different mixes may be needed. Two typical types of mixes are the monitor mixes meant for the performers and the Front of House (FOH) mixes meant for the audience. While the role of a monitor engineer is to enable the individual performer to hear his individual mix with more clarity and to reduce confusing house sound, the role of a FOH engineer is to reinforce the sound sources to cover the audience and/or introduce a variety of processors and effects to give some styling to the house mix. We are looking into the development of an automatic live sound mixing system, which will automatically recognize the sound sources captured by the microphones and then process them appropriately.

In this paper, we address the first issue of automatically recognizing the sound sources captured by the microphones, a musical instrument classification problem in music information retrieval (MIR). The authors in [1] presented an in-depth summary of the related work on musical instrument classification, which evolved from the recognition of isolated musical sounds to the recognition of instruments in solo phrases [2], [3], [4], [5] and/or duet phrases, and then to the more complex tasks of recognizing instruments in polyphonic music and complex mixtures. Besides investigating the distinction between the different instrument families, some researchers investigate the finer distinction between the instrument categories within the instrument families. The increasing difficulty of the musical instrument classification problem has led to the development from flat classification systems (which assign instrument labels in a single step) to hierarchical classification systems (which assign instrument labels progressively, from the broader categories to the more specific categories down a hierarchical tree).

Most of the work on recognition of instruments in solo phrases utilizes basic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and spectral features; the inclusion of temporal features such as rise time and attack time is rarely attempted, as this requires a reliable front-end onset detector. The research on higher-level features, such as F0-dependent features [6] and Line Spectrum Frequencies [7], includes some works that explore new features in this context. The present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument classification of solo phrases. These two features are evaluated against the baseline feature, MFCCs, which was suggested as being robust by many of the above references.

Our automatic live sound mixing system directs our focus to the classification of the vocals and drums, as these are among the most common sound inputs present in many live performances in pubs or churches. These require specific audio processing such as equalization, compression, etc.

We first elaborate on the two proposed features in Section II, before moving on to the classification scheme in Section III. The experimental dataset and results are presented in Section IV, and the conclusion and future work are presented in Section V.

II. FEATURE EXTRACTION

Two higher-level, musically relevant features, the Tuning Factor and the Percussive Factor, are proposed. The definitions and descriptions of the proposed features will first be presented, followed by the elaboration of the extraction steps.

A. Tuning Factor

In most music performances, pitch tuning, in which the pitches of the instruments are tuned to a common reference pitch, is practised so that the instruments sound harmonic when played concurrently. The reference pitch to which the pitches of the instruments are tuned may however vary slightly from orchestra to orchestra or from band to band.



The concert pitch, A = 440 Hz, specified by the ISO 16 standard, is widely used; but there exist other historical pitch standards such as the diapason normal, in which A = 435 Hz. In our context, the tuning index is estimated as the amount of shift of the pitches from the concert pitch, and the Tuning Factor is a measure of the presence of tuned pitch in an audio signal. Tuning index extraction is detailed in [8], while the Tuning Factor is derived from the tuning indexes of an audio signal. The main extraction steps of the tuning index are outlined below, followed by the extraction of the Tuning Factor.

Considering 7 octaves of 12 semitones from f_0 = 27.5 Hz to f_6 = 3520 Hz, the CQT spectrogram of the audio signal is first obtained. A pitch resolution of 10 points per semitone is used. Let x̂^cqt(t, k) be the CQT spectrogram after local peak picking, where t represents the time index and k ∈ {1, ..., 7 × 12 × 10} represents the frequency band index. For each t, the accumulated energy for each of the 10 shifts is:

$$\hat{E}(t, n) = \sum_{m=1}^{12 \times 7 = 84} \hat{x}^{cqt}(t, n + 10m),$$

where n ∈ {1, ..., 10}. The tuning index, which is the tuning pitch position with the maximal accumulated energy, is then:

$$P(t) = \arg\max_{n \in \{1, \ldots, 10\}} \hat{E}(t, n).$$

Conditions are imposed to ensure that the energy of a tuning index is due to a single frequency shift position; any tuning index obtained due to noisy sound is eliminated, and P(t) is set to 0 for such frames. As mentioned, the tuning indexes indicate the deviations of the pitches from the concert pitch, i.e. 440 Hz for A4. Taking the distance between two semitones to be 100 cents, a pitch resolution of 10 points per semitone implies a shift of 10 cents per point. Hence, bin 1 represents the exact concert pitch, bin 2 represents 10 cents above the concert pitch, and bin 10 represents 10 cents below the concert pitch.

For an audio signal of length T, the tuning pitch histogram H(p), p ∈ {1, ..., 10}, is constructed by accumulating all occurrences of P(t), t ∈ T. Figure 1 illustrates the tuning pitch histograms of drum, brass and vocal excerpts respectively. Low accumulated values are observed for the drum and vocal excerpts.

Fig. 1. Tuning Pitch Histogram of (a) Brass (b) Vocals (c) Drums. Low accumulated values for the drums and vocal excerpts are illustrated.

The Tuning Factor is derived from the tuning pitch histogram and is a measure of the presence of tuned pitch in an audio signal. Both the duration (1) and energy (2) percentages are considered:

$$TF_{dur} = \frac{\sum_{t \in T,\, P(t) \in \{1,\ldots,10\}} 1}{T}, \qquad (1)$$

$$TF_{en} = \frac{\sum_{t \in T,\, P(t) \in \{1,\ldots,10\}} E(t)}{\sum_{t \in T} E(t)}, \qquad (2)$$

where E(t) is the energy of the audio signal.
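For concreteness, a minimal Python sketch of the tuning-index and Tuning Factor computation is given below. It assumes librosa for the CQT; the local peak picking, the single-shift (noise) condition and the frame energy E(t) are simplified stand-ins, since the exact procedure is deferred to [8].

```python
import numpy as np
import librosa


def tuning_factor(y, sr, bins_per_semitone=10):
    """Duration- and energy-based Tuning Factors (Eqs. 1-2); a rough sketch."""
    # 7 octaves x 12 semitones x 10 bins per semitone, starting at A0 = 27.5 Hz
    n_bins = 7 * 12 * bins_per_semitone
    C = np.abs(librosa.cqt(y, sr=sr, fmin=27.5, n_bins=n_bins,
                           bins_per_octave=12 * bins_per_semitone))  # (840, T)

    # crude local peak picking along the frequency axis
    peaks = np.zeros_like(C)
    mask = (C[1:-1] > C[:-2]) & (C[1:-1] > C[2:])
    peaks[1:-1][mask] = C[1:-1][mask]

    n_frames = C.shape[1]
    # E_hat(t, n): accumulate energy over every 10th bin for each shift n
    E_hat = peaks.reshape(7 * 12, bins_per_semitone, n_frames).sum(axis=0)

    # P(t): shift with maximal accumulated energy; frames without a clearly
    # dominant shift get P(t) = 0 (a stand-in for the single-shift condition)
    dominance = E_hat.max(axis=0) / (E_hat.sum(axis=0) + 1e-12)
    P = np.where(dominance > 0.5, E_hat.argmax(axis=0) + 1, 0)

    E = (C ** 2).sum(axis=0)                      # frame energy E(t), CQT-based stand-in
    tuned = P > 0
    TF_dur = tuned.mean()                         # Eq. (1)
    TF_en = E[tuned].sum() / (E.sum() + 1e-12)    # Eq. (2)
    return TF_dur, TF_en
```

For a mono phrase y sampled at rate sr, `TF_dur, TF_en = tuning_factor(y, sr)` returns the two percentages used as features.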

B. Percussive Factor

The Percussive Factor is a measure of the average bandwidth of abrupt energy changes (percussivity) in an audio signal. Figure 2 shows the spectrograms of drum, brass and vocal excerpts respectively, illustrating the decrease in percussivity. While drums are characterized by their wide-band and transient nature, vocals only exhibit gradual energy changes.

Fig. 2. Spectrogram of (a) Drums (b) Brass (c) Vocals. Decreasing percussivity is illustrated.

A list of potential percussive onsets, O = {l_1, ..., l_L}, is first obtained using the percussive measure D(i):

$$D(i) = \sum_{j \in J} \begin{cases} 0 & \text{if } d_i(j) < H \\ 1 & \text{if } d_i(j) \geq H \end{cases}, \qquad (3)$$

where

$$d_i(j) = \log_2 \frac{|\mathrm{STFT}_x^w(j, i)|}{|\mathrm{STFT}_x^w(j, i-1)|},$$

and STFT_x^w(j, i) is the Short-Time Fourier Transform computed with hop x and window w, using the Hamming window. Here, i ∈ {1, ..., I} represents the time index, while j ∈ J represents the frequency band index; J = [1 kHz, 5 kHz] defines the spectral range over which the measure is evaluated. The average bandwidth of all such potential onsets is then computed as:

$$B = \frac{\sum_{l \in O} D(l)}{L} \qquad (4)$$

and used as an indicator of the presence of highly percussive onsets in the signal. The estimated Percussive Factor can roughly distinguish highly percussive onsets, such as those of the drums and brass, from weakly percussive onsets, such as those of the vocals.
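A short sketch of the percussive measure and the Percussive Factor follows. The window and hop sizes, the threshold H, and the rule for selecting the potential onsets are illustrative assumptions rather than the paper's values.

```python
import numpy as np
import librosa


def percussive_factor(y, sr, n_fft=1024, hop=256, H=1.0):
    """Percussive measure D(i) and average onset bandwidth B (Eqs. 3-4); a sketch."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming"))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    J = (freqs >= 1000.0) & (freqs <= 5000.0)      # J = [1 kHz, 5 kHz]

    eps = 1e-10
    # d_i(j): log2 ratio of successive magnitude frames within band J
    d = np.log2((S[J, 1:] + eps) / (S[J, :-1] + eps))

    # D(i): number of bands in J with an abrupt energy rise (>= H)
    D = (d >= H).sum(axis=0)

    onsets = D[D > 0]                              # potential percussive onsets
    return float(onsets.mean()) if onsets.size else 0.0   # B, Eq. (4)
```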

III. CLASSIFICATION

The Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel is used in the experiments, since it has demonstrated good performance on various data classification tasks [2], [7]. Given two classes, the SVM tries to find the optimal separating hyperplane (the decision boundary separating the tuples of one class from another). While well suited to binary classification, the SVM can also be extended to perform N-class classification. For our purpose, a one-vs-all classification (vocals-vs-non-vocals and drums-vs-non-drums) is adopted.
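One possible realization of the one-vs-all detectors is sketched below with scikit-learn; the file names, regularization parameters and feature scaling are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector per phrase (e.g. Tuning Factor, Percussive Factor or MFCC means);
# y: 1 for vocals phrases, 0 for every other category. File names are placeholders.
X = np.load("phrase_features.npy")
y = np.load("vocals_labels.npy")

# RBF-kernel SVM for vocals-vs-non-vocals; drums-vs-non-drums is trained analogously.
vocals_clf = make_pipeline(StandardScaler(),
                           SVC(kernel="rbf", C=1.0, gamma="scale"))
vocals_clf.fit(X, y)
predictions = vocals_clf.predict(X)   # 1 = vocals, 0 = non-vocals
```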

IV. EXPERIMENTS

A. Training and Testing Dataset

42 excerpts are collected from the RWC dataset, internet sources such as YouTube, and personal collections. The dataset comprises seven musical instrument categories, namely brass, drums, plucked strings, struck strings, bowed strings, vocals and winds. Each excerpt is 5-10 seconds long and is segmented into phrases of 0.74 seconds. For each phrase, the MFCCs are taken to be the mean of the MFCCs computed with a hop size of 0.046 seconds. The Percussive Factor and Tuning Factor are extracted on the phrases. Phrases with low energy are eliminated for both training and testing.

B. Experimental Results

Singing contains both intentional and unintentional deviations from the nominal note pitches. Intentional deviations are used to enhance expressiveness, while unintentional deviations are caused by a lack of voice training. Hence, vocals have a potentially unstable tuning and are non-percussive. Drums, on the other hand, are un-pitched and percussive. As shown in Figure 3, the vocals samples are distributed across the bottom left corner of the feature space, while the drums samples are distributed across the top left corner.

Fig. 3. Feature Space of (Top) Vocals (Bottom) Drums.

The musical instrument classification accuracy for the solo phrases is evaluated. Referring to Table I, we consider the correct rate, false positive rate and false negative rate for the Tuning Factor (duration) (F1), Tuning Factor (energy) (F2), Percussive Factor (F3) and MFCCs (F4-F16), over a cross validation of 100 runs.

TABLE I
F1: TUNING FACTOR (DURATION), F2: TUNING FACTOR (ENERGY), F3: PERCUSSIVE FACTOR, F4-F16: MFCCS. (TOP) RESULTS FOR VOCALS CLASSIFICATION. (BOTTOM) RESULTS FOR DRUMS CLASSIFICATION.

Vocals classification:
Features          Correct (%)   False Positive (%)   False Negative (%)
F1, F3            85.17         8.35                 6.47
F2, F3            84.25         9.23                 6.52
F4-F16            72.21         9.39                 18.41
F1, F3, F4-F16    73.08         9.27                 17.64
F2, F3, F4-F16    74.69         6.64                 18.67

Drums classification:
Features          Correct (%)   False Positive (%)   False Negative (%)
F1, F3            97.16         0.72                 2.13
F2, F3            97.25         0.49                 2.25
F4-F16            96.52         0.32                 3.16
F1, F3, F4-F16    96.23         0.00                 3.77
F2, F3, F4-F16    96.32         0.00                 3.68

For the vocals, the correct rate increased from about 72% using MFCCs to 84-85% using the proposed features, seemingly contributed by the decrease in the false negative rate. The combination of the proposed features with the MFCCs also gave a higher correct rate compared to using only the MFCCs. For the drums, the correct rate was not significantly increased when using the proposed features compared to the MFCCs; nonetheless, the combination of the features decreased the false positive rate to zero.

Although the features could also contribute to the classification of some other categories, those results are not reported here. As this is a one-vs-all classification that could be used in developing the hierarchical approach, the improvement for the vocals and drums categories will also increase the overall classification accuracy of all categories.
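As a concrete reading of the setup in Section IV-A, the sketch below segments an excerpt into 0.74 s phrases and computes the mean MFCC vector per phrase with a 0.046 s hop. The sampling rate, the low-energy threshold and the use of librosa are assumptions not stated in the paper; the phrase length, hop size and 13 coefficients follow Section IV-A.

```python
import numpy as np
import librosa


def phrase_mfcc_features(path, sr=22050, phrase_s=0.74, hop_s=0.046, n_mfcc=13):
    """Segment an excerpt into 0.74 s phrases and return mean MFCC vectors."""
    y, sr = librosa.load(path, sr=sr)
    phrase_len = int(phrase_s * sr)
    hop = int(hop_s * sr)

    feats = []
    for start in range(0, len(y) - phrase_len + 1, phrase_len):
        seg = y[start:start + phrase_len]
        if np.mean(seg ** 2) < 1e-4:             # drop low-energy phrases (threshold assumed)
            continue
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
        feats.append(mfcc.mean(axis=1))          # mean MFCC vector per phrase (cf. F4-F16)
    return np.array(feats)
```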

V. CONCLUSION AND FUTURE WORKS

With the development of an automatic live sound audio mixing system in mind, the present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument recognition of solo phrases. Using the Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel, we achieve good classification results for the vocals and drums categories, as compared to the baseline MFCC features.

Although full categorization of the whole range of musical instruments is not attempted, this evaluation would aid the development of a hierarchical system for full musical instrument classification. As discussed, the proposed features are meant for the un-tuned categories of drums and vocals; features for the sub-hierarchies are to be further investigated. We will also be looking into enlarging the dataset, in particular to introduce more variety of drum samples.

REFERENCES

[1] P. Herrera-Boyer, A. Klapuri and M. Davy. Automatic classification of pitched musical instrument sounds. In A. Klapuri and M. Davy (eds.), Signal Processing Methods for Music Transcription, pp. 163-200.
[2] J. Marques and P. J. Moreno. A study of musical instrument classification using Gaussian mixture models and support vector machines. Compaq Computer Corporation, Tech. Rep., 1999.
[3] J. C. Brown, O. Houix and S. McAdams. Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109(3), March 2001, pp. 1064-1072.
[4] A. Krishna and T. Sreenivas. Music instrument recognition: from isolated notes to solo phrases. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Montreal, Canada, May 2004, pp. 265-268.
[5] S. Essid, G. Richard and B. David. Efficient musical instrument recognition on solo performance music using basic features. In AES 25th International Conference, London, UK, June 2004.
[6] T. Kitahara, M. Goto and H. G. Okuno. Musical instrument identification based on F0-dependent multivariate normal distribution. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, April 2003.
[7] N. Chetry and M. Sandler. Linear prediction models for musical instrument identification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[8] Y. Zhu and M. Kankanhalli. Precise pitch profile extraction from musical audio for key detection. IEEE Transactions on Multimedia, vol. 8, no. 3, June 2006.
