Classification of Solo Phrases using Tuning Factor and Percussive Factor

Hui Li Tan, Yongwei Zhu, Lekha Chaisorn
Institute for Infocomm Research (I2R), A*STAR
1 Fusionopolis Way, Singapore 138632
{hltan, ywzhu, clekha}@i2r.a-star.edu.sg

Abstract—Live sound audio mixing is the art of processing and combining sound sources captured at live performances for the performers or audience in real time. With the development of an automatic live sound audio mixing system in mind, the present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument classification of solo phrases. In particular, we focus on the classification of vocals and drums phrases, two common instrument categories in live performances. Using the Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel, vocals and drums classification rates of 84% and 97% are achieved respectively, compared to 72% and 96% when using Mel-Frequency Cepstral Coefficients (MFCCs).

I. INTRODUCTION

Mixing is not just limited to the studio! Audio signals captured at live performances can be processed and combined in real time to create new mixes for the performers or audience. Depending on the performance requirements, a variety of different mixes may be needed. Two typical types of mixes are the monitor mixes meant for the performers and the Front of House (FOH) mixes meant for the audience. While the role of a monitor engineer is to enable the individual performer to hear his individual mix with more clarity and to reduce confusing house sound, the role of a FOH engineer is to reinforce the sound sources to cover the audience and/or introduce a variety of processors and effects to give some styling to the house mix. We are looking into the development of an automatic live sound mixing system, which will automatically recognize the sound sources captured by the microphones and then process them appropriately.

In this paper, we address the first issue of automatically recognizing the sound sources captured by the microphones, a musical instrument classification problem in music information retrieval (MIR). The authors in [1] presented an in-depth summary of the related work on musical instrument classification, which evolved from the recognition of isolated musical sounds to the recognition of instruments in solo phrases [2], [3], [4], [5] and/or duet phrases, and then to the more complex tasks of recognizing instruments in polyphonic music and complex mixtures. Besides investigating the distinction between the different instrument families, some researchers investigate the finer distinction between the instrument categories within the instrument families. The increasing difficulty of the musical instrument classification problem has led to the development from flat classification systems (which assign instrument labels in a single step) to hierarchical classification systems (which assign instrument labels progressively, from the broader categories to the more specific categories down a hierarchical tree).

Most of the work on recognition of instruments in solo phrases utilizes basic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and spectral features; the inclusion of temporal features such as rise time and attack time is rarely attempted, as this requires a reliable front-end onset detector. The research on higher-level features, such as F0-dependent features [6] and Line Spectrum Frequencies [7], includes some works that explore new features in this context. The present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument classification of solo phrases. These two features are evaluated against the baseline feature, MFCCs, which was suggested as being robust by many of the above references.

Our automatic live sound mixing system directs our focus to the classification of the vocals and drums, as these are among the most common sound inputs present in many live performances in pubs or churches. These require specific audio processing such as equalization, compression, etc.

We first elaborate on the two proposed features in Section II, before moving on to the classification scheme in Section III. The experimental dataset and results are presented in Section IV, and the conclusion and future work are presented in Section V.

II. FEATURE EXTRACTION

Two higher-level, musically relevant features, the Tuning Factor and the Percussive Factor, are proposed. The definitions and descriptions of the proposed features will first be presented, followed by the elaboration of the extraction steps.

A. Tuning Factor

In most music performances, pitch tuning, in which the pitches of the instruments are tuned to a common reference pitch, is practised so that the instruments sound harmonic when played concurrently. The reference pitch to which the pitches of the instruments are tuned may however vary slightly from orchestra to orchestra or from band to band.



The concert pitch, A = 440 Hz, specified by the ISO 16 standard, is widely used; but there exist other historical pitch standards such as the diapason normal, in which A = 435 Hz. In our context, the tuning index is estimated as the amount of shift of the pitches from the concert pitch, and the Tuning Factor is a measure of the presence of tuned pitch in an audio signal. Tuning index extraction is detailed in [8], while the Tuning Factor is derived from the tuning indexes of an audio signal. The main extraction steps of the tuning index are outlined below, followed by the extraction of the Tuning Factor.

Considering 7 octaves of 12 semitones from f_0 = 27.5 Hz to f_6 = 3520 Hz, the CQT spectrogram of the audio signal is first obtained. A pitch resolution of 10 points per semitone is used. Let x̂^cqt(t, k) be the CQT spectrogram after local peak picking, where t represents the time index and k ∈ {1, ..., 7 × 12 × 10} represents the frequency band index. For each t, the accumulated energy for each of the 10 shifts is:

$$\hat{E}(t, n) = \sum_{m=1}^{12 \times 7 = 84} \hat{x}^{cqt}(t, n + 10m),$$

where n ∈ {1, ..., 10}. The tuning index, which is the tuning pitch position with the maximal accumulated energy, is then:

$$P(t) = \arg\max_{n \in \{1, \ldots, 10\}} \hat{E}(t, n).$$

Conditions are imposed to ensure that the energy of a tuning index is due to a single frequency shift position; any tuning index obtained due to noisy sound is eliminated, and P(t) is set to 0 for such frames. As mentioned, the tuning indexes indicate the deviations of the pitches from the concert pitch, i.e. 440 Hz for A4. Taking the distance between two semitones to be 100 cents, a pitch resolution of 10 points per semitone implies a shift of 10 cents per point. Hence, bin 1 represents the exact concert pitch, bin 2 represents 10 cents above the concert pitch, and bin 10 represents 10 cents below the concert pitch.

For an audio signal of length T, the tuning pitch histogram H(p), p ∈ {1, ..., 10}, is constructed by accumulating all occurrences of P(t), t ∈ T. Figure 1 illustrates the tuning pitch histograms of drum, brass and vocal excerpts respectively. Low accumulated values are observed for the drum and vocal excerpts.

Fig. 1. Tuning Pitch Histogram of (a) Brass (b) Vocals (c) Drums. Low accumulated values for the drums and vocal excerpts are illustrated.

The Tuning Factor is derived from the tuning pitch histogram and is a measure of the presence of tuned pitch in an audio signal. Both the duration (1) and energy (2) percentages are considered:

$$TF_{dur} = \frac{\sum_{t \in T,\, P(t) \in \{1,\ldots,10\}} 1}{T}, \qquad (1)$$

$$TF_{en} = \frac{\sum_{t \in T,\, P(t) \in \{1,\ldots,10\}} E(t)}{\sum_{t \in T} E(t)}, \qquad (2)$$

where E(t) is the energy of the audio signal.
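For concreteness, a minimal Python sketch of the tuning-index and Tuning Factor computation is given below. It assumes librosa for the CQT; the local peak picking, the single-shift (noise) condition and the frame energy E(t) are simplified stand-ins, since the exact procedure is deferred to [8].

```python
import numpy as np
import librosa


def tuning_factor(y, sr, bins_per_semitone=10):
    """Duration- and energy-based Tuning Factors (Eqs. 1-2); a rough sketch."""
    # 7 octaves x 12 semitones x 10 bins per semitone, starting at A0 = 27.5 Hz
    n_bins = 7 * 12 * bins_per_semitone
    C = np.abs(librosa.cqt(y, sr=sr, fmin=27.5, n_bins=n_bins,
                           bins_per_octave=12 * bins_per_semitone))  # (840, T)

    # crude local peak picking along the frequency axis
    peaks = np.zeros_like(C)
    mask = (C[1:-1] > C[:-2]) & (C[1:-1] > C[2:])
    peaks[1:-1][mask] = C[1:-1][mask]

    n_frames = C.shape[1]
    # E_hat(t, n): accumulate energy over every 10th bin for each shift n
    E_hat = peaks.reshape(7 * 12, bins_per_semitone, n_frames).sum(axis=0)

    # P(t): shift with maximal accumulated energy; frames without a clearly
    # dominant shift get P(t) = 0 (a stand-in for the single-shift condition)
    dominance = E_hat.max(axis=0) / (E_hat.sum(axis=0) + 1e-12)
    P = np.where(dominance > 0.5, E_hat.argmax(axis=0) + 1, 0)

    E = (C ** 2).sum(axis=0)                      # frame energy E(t), CQT-based stand-in
    tuned = P > 0
    TF_dur = tuned.mean()                         # Eq. (1)
    TF_en = E[tuned].sum() / (E.sum() + 1e-12)    # Eq. (2)
    return TF_dur, TF_en
```

For a mono phrase y sampled at rate sr, `TF_dur, TF_en = tuning_factor(y, sr)` returns the two percentages used as features.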

B. Percussive Factor

The Percussive Factor is a measure of the average bandwidth of abrupt energy changes (percussivity) in an audio signal. Figure 2 shows the spectrograms of drum, brass and vocal excerpts respectively, illustrating the decrease in percussivity. While drums are characterized by their wide-band and transient nature, vocals only exhibit gradual energy changes.

Fig. 2. Spectrogram of (a) Drums (b) Brass (c) Vocals. Decreasing percussivity is illustrated.

A list of potential percussive onsets, O = {l_1, ..., l_L}, is first obtained using the percussive measure D(i):

$$D(i) = \sum_{j \in J} \begin{cases} 0 & \text{if } d_i(j) < H \\ 1 & \text{if } d_i(j) \geq H \end{cases}, \qquad (3)$$

where

$$d_i(j) = \log_2 \frac{|\mathrm{STFT}_x^w(j, i)|}{|\mathrm{STFT}_x^w(j, i-1)|},$$

and STFT_x^w(j, i) is the Short-Time Fourier Transform computed with hop x and window w, using the Hamming window. Here, i ∈ {1, ..., I} represents the time index, while j ∈ J represents the frequency band index; J = [1 kHz, 5 kHz] defines the spectral range over which the measure is evaluated. The average bandwidth of all such potential onsets is then computed as:

$$B = \frac{\sum_{l \in O} D(l)}{L} \qquad (4)$$

and used as an indicator of the presence of highly percussive onsets in the signal. The estimated Percussive Factor can roughly distinguish highly percussive onsets, such as those of the drums and brass, from weakly percussive onsets, such as those of the vocals.
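A short sketch of the percussive measure and the Percussive Factor follows. The window and hop sizes, the threshold H, and the rule for selecting the potential onsets are illustrative assumptions rather than the paper's values.

```python
import numpy as np
import librosa


def percussive_factor(y, sr, n_fft=1024, hop=256, H=1.0):
    """Percussive measure D(i) and average onset bandwidth B (Eqs. 3-4); a sketch."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming"))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    J = (freqs >= 1000.0) & (freqs <= 5000.0)      # J = [1 kHz, 5 kHz]

    eps = 1e-10
    # d_i(j): log2 ratio of successive magnitude frames within band J
    d = np.log2((S[J, 1:] + eps) / (S[J, :-1] + eps))

    # D(i): number of bands in J with an abrupt energy rise (>= H)
    D = (d >= H).sum(axis=0)

    onsets = D[D > 0]                              # potential percussive onsets
    return float(onsets.mean()) if onsets.size else 0.0   # B, Eq. (4)
```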

III. CLASSIFICATION

The Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel is used in the experiments, since it has demonstrated good performance on various data classification tasks [2], [7]. Given two classes, the SVM tries to find the optimal separating hyperplane (the decision boundary separating the tuples of one class from another). While well suited to binary classification, the SVM can also be extended to perform N-class classification. For our purpose, a one-vs-all classification (vocals-vs-non-vocals and drums-vs-non-drums) is adopted.
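One possible realization of the one-vs-all detectors is sketched below with scikit-learn; the file names, regularization parameters and feature scaling are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector per phrase (e.g. Tuning Factor, Percussive Factor or MFCC means);
# y: 1 for vocals phrases, 0 for every other category. File names are placeholders.
X = np.load("phrase_features.npy")
y = np.load("vocals_labels.npy")

# RBF-kernel SVM for vocals-vs-non-vocals; drums-vs-non-drums is trained analogously.
vocals_clf = make_pipeline(StandardScaler(),
                           SVC(kernel="rbf", C=1.0, gamma="scale"))
vocals_clf.fit(X, y)
predictions = vocals_clf.predict(X)   # 1 = vocals, 0 = non-vocals
```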

IV. EXPERIMENTS

A. Training and Testing Dataset

42 excerpts are collected from the RWC dataset, internet sources such as YouTube, and personal collections. The dataset comprises seven musical instrument categories, namely brass, drums, plucked strings, struck strings, bowed strings, vocals and winds. Each excerpt is 5-10 seconds long and is segmented into phrases of 0.74 seconds. For each phrase, the MFCCs are taken to be the mean of the MFCCs computed with a hop size of 0.046 seconds. The Percussive Factor and Tuning Factor are extracted on the phrases. Phrases with low energy are eliminated for both training and testing.

B. Experimental Results

Singing contains both intentional and unintentional deviations from the nominal note pitches. Intentional deviations are used to enhance expressiveness, while unintentional deviations are caused by a lack of voice training. Hence, vocals have a potentially unstable tuning and are non-percussive. Drums, on the other hand, are un-pitched and percussive. As shown in Figure 3, the vocals samples are distributed across the bottom left corner of the feature space, while the drums samples are distributed across the top left corner.

Fig. 3. Feature Space of (Top) Vocals (Bottom) Drums.

The musical instrument classification accuracy for the solo phrases is evaluated. Referring to Table I, we consider the correct rate, false positive rate and false negative rate for the Tuning Factor (duration) (F1), Tuning Factor (energy) (F2), Percussive Factor (F3) and MFCCs (F4-F16), over a cross validation of 100 runs.

TABLE I
F1: TUNING FACTOR (DURATION), F2: TUNING FACTOR (ENERGY), F3: PERCUSSIVE FACTOR, F4-F16: MFCCS. (TOP) RESULTS FOR VOCALS CLASSIFICATION. (BOTTOM) RESULTS FOR DRUMS CLASSIFICATION.

Vocals classification:
Features          Correct (%)   False Positive (%)   False Negative (%)
F1, F3            85.17         8.35                 6.47
F2, F3            84.25         9.23                 6.52
F4-F16            72.21         9.39                 18.41
F1, F3, F4-F16    73.08         9.27                 17.64
F2, F3, F4-F16    74.69         6.64                 18.67

Drums classification:
Features          Correct (%)   False Positive (%)   False Negative (%)
F1, F3            97.16         0.72                 2.13
F2, F3            97.25         0.49                 2.25
F4-F16            96.52         0.32                 3.16
F1, F3, F4-F16    96.23         0.00                 3.77
F2, F3, F4-F16    96.32         0.00                 3.68

For the vocals, the correct rate increased from about 72% using MFCCs to 84-85% using the proposed features, seemingly contributed by the decrease in the false negative rate. The combination of the proposed features with the MFCCs also gave a higher correct rate compared to using only the MFCCs. For the drums, the correct rate was not significantly increased when using the proposed features compared to the MFCCs; nonetheless, the combination of the features decreased the false positive rate to zero.

Although the features could also contribute to the classification of some other categories, those results are not reported here. As this is a one-vs-all classification that could be used in developing the hierarchical approach, the improvement for the vocals and drums categories will also increase the overall classification accuracy of all categories.
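As a concrete reading of the setup in Section IV-A, the sketch below segments an excerpt into 0.74 s phrases and computes the mean MFCC vector per phrase with a 0.046 s hop. The sampling rate, the low-energy threshold and the use of librosa are assumptions not stated in the paper; the phrase length, hop size and 13 coefficients follow Section IV-A.

```python
import numpy as np
import librosa


def phrase_mfcc_features(path, sr=22050, phrase_s=0.74, hop_s=0.046, n_mfcc=13):
    """Segment an excerpt into 0.74 s phrases and return mean MFCC vectors."""
    y, sr = librosa.load(path, sr=sr)
    phrase_len = int(phrase_s * sr)
    hop = int(hop_s * sr)

    feats = []
    for start in range(0, len(y) - phrase_len + 1, phrase_len):
        seg = y[start:start + phrase_len]
        if np.mean(seg ** 2) < 1e-4:             # drop low-energy phrases (threshold assumed)
            continue
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
        feats.append(mfcc.mean(axis=1))          # mean MFCC vector per phrase (cf. F4-F16)
    return np.array(feats)
```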

V. CONCLUSION AND FUTURE WORKS

With the development of an automatic live sound audio mixing system in mind, the present study evaluates the performance of two higher-level features, the Tuning Factor and the Percussive Factor, for musical instrument recognition of solo phrases. Using the Support Vector Machine (SVM) with the Gaussian Radial Basis Function (RBF) kernel, we achieve good classification results for the vocals and drums categories, as compared to the baseline MFCC features.

Although full categorization of the whole range of musical instruments is not attempted, this evaluation would aid the development of a hierarchical system for full musical instrument classification. As discussed, the proposed features are meant for the un-tuned categories of drums and vocals; features for the sub-hierarchies are to be further investigated. We will also be looking into enlarging the dataset, in particular to introduce more variety of drum samples.

REFERENCES

[1] P. Herrera-Boyer, A. Klapuri and M. Davy. Automatic classification of pitched musical instrument sounds. In A. Klapuri and M. Davy (eds.), Signal Processing Methods for Music Transcription, pp. 163-200.
[2] J. Marques and P. J. Moreno. A study of musical instrument classification using Gaussian mixture models and support vector machines. Compaq Computer Corporation, Tech. Rep., 1999.
[3] J. C. Brown, O. Houix and S. McAdams. Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109(3), March 2001, pp. 1064-1072.
[4] A. Krishna and T. Sreenivas. Music instrument recognition: from isolated notes to solo phrases. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Montreal, Canada, May 2004, pp. 265-268.
[5] S. Essid, G. Richard and B. David. Efficient musical instrument recognition on solo performance music using basic features. In AES 25th International Conference, London, UK, June 2004.
[6] T. Kitahara, M. Goto and H. G. Okuno. Musical instrument identification based on F0-dependent multivariate normal distribution. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, April 2003.
[7] N. Chetry and M. Sandler. Linear prediction models for musical instrument identification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[8] Y. Zhu and M. Kankanhalli. Precise pitch profile extraction from musical audio for key detection. IEEE Transactions on Multimedia, vol. 8, no. 3, June 2006.
