
Automatic Labelling of Tabla Signals

Olivier K. Gillet and Gaël Richard
GET-ENST (TELECOM Paris), 46, rue Barrault, 75013 Paris, France
[email protected], [email protected]

Abstract

Most of the recent developments in the field of music indexing and music information retrieval are focused on western music. In this paper, we present an automatic music transcription system dedicated to the Tabla, a North Indian percussion instrument. Our approach is based on three main steps: firstly, the audio signal is segmented into adjacent segments where each segment represents a single stroke. Secondly, rhythmic information such as relative durations is calculated using beat detection techniques. Finally, the transcription (recognition of the strokes) is performed by means of a statistical model based on Hidden Markov Models (HMM). The structure of this model is designed to represent the time dependencies between successive strokes and to take into account the specificities of the tabla score notation (transcription symbols may be context dependent). Realtime transcription of Tabla soli (or performances) with an error rate of 6.5% is made possible with this transcriber. The transcription system, along with some additional features such as sound synthesis or phrase correction, is integrated in a user-friendly environment called Tablascope.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2003 Johns Hopkins University.

1 Introduction

Due to the exponential growth of available digital information, there is a need for techniques that would make this information more readily accessible to the user. As a consequence, automatic indexing and retrieval of information based on content are becoming more and more important and represent very challenging research areas. Automatic indexing of digital information permits the extraction of a textual description of this information (i.e. meta-data). In the context of music signals, while such a description would ultimately be a complete transcription (for example in the form of music scores), it should at least aim at extracting two types of descriptors from the audio signal:

• a set of general descriptors such as genre, style, musicians and instruments of a music piece;

• a set of more specific descriptors such as beat, notes, chords or nuances.

Automatic labelling of instrument sounds therefore represents one of the components of a complete indexing system. Previous efforts in automatic labelling of instrument sounds have mainly been dedicated to sounds with definite pitch (see [Kaminskyj, 2001], [Martin, 1999] or [Herrera et al., 2000] for a critical review). Percussive sounds are a specific class of instrument sounds. They can be pitched or unpitched. There is much interest in devoting more effort to the latter class since there are a number of specific applications for this class of sounds (automatic drum loop recognition and retrieval, virtual dancers, drum-controlled synthesizers...).

In fact, more attention is now given to the class of unpitched percussive sounds (see for example [Herrera et al., 2003], [McDonald and Tsang, 1997], [Gouyon and Herrera, 2001]), but these efforts are still limited to western music, and in most studies only isolated sounds are considered.

We propose in this paper a study on the automatic transcription of tabla performances. The tabla is a percussion instrument widely used in (North) Indian classical and semi-classical music, which has gained more and more recognition among fusion musicians. As for a number of percussive instruments (such as congas or drums), the audio signal produced is rather impulsive and successive events are therefore rather easily detected. Another advantage for automatic analysis is that such instruments produce only limited sound mixtures (1 or 2 events at the same time, whereas in polyphonic music over 50 notes can be played simultaneously). However, the transcription of tabla signals is quite challenging because there are strong time dependencies that are essential to take into account for a successful transcription. There are two types of time dependencies: firstly, the instrument can produce long, resonant strokes which will overlap and alter the timbre of the following strokes. Secondly, the symbol used to represent a stroke can depend on the context in which it is used.

Prior works related to the computerization of the Tabla include a phrase retrieval system using MIDI input [Roh and Wilcox, 1995] and specific controllers and synthesis models [A. Kapur and Cook, 2002], but to our knowledge no transcription system of tabla performances has been reported.

A tabla sequence is composed of successive bols of different durations (note, half-note, quarter note, ...); however, the rhythmic structure can be quite complex. The basic rhythmic structures (called taals) can have a large variety of beats (for example 6, 7, 8, 10, 12 or 16), which are grouped in measures. The successive measures may have different numbers of beats (for example Dhin Na | Dhin Dhin Na | Tin Na | Dhin Dhin Na, where the vertical bars symbolise the measure boundaries). The grouping characteristics are in a sense similar to those observed in spoken or written languages, where "words" can be grouped in sentences. These properties can be modelled by means of a language model that provides an estimation of the probability of a given sequence of words. For the tabla, since the association between a sound and the corresponding bol can be context-sensitive (i.e. depends on previous bols), it seems quite appropriate to use language modelling approaches to model these dependencies. In this work, the transcription is limited to the recognition of the successive bols with their corresponding relative durations (note, half-note, etc.) but does not intend to extract the complex rhythmic structure (i.e. the measures).

The paper is organized as follows. The next section presents the instrument and the specific transcription notation that is commonly used to describe tabla performances. Section 3 is dedicated to the description of the transcription system, including a detailed section on the model proposed. Section 4 presents some evaluation results on real tabla performances. Finally, section 5 suggests some conclusions.

Figure 1: The tabla is a pair of drums: a metallic drum (the bayan) and a wooden treble drum (the dayan)

2 Presentation of the tabla

The tabla (see figure 1) is a pair of drums traditionally used for the accompaniment of Indian classical or semi-classical music. The bayan is the metallic drum, played by the left hand. It is used to produce a loud resonant sound or a damped non-resonant one. By sliding the palm of the hand on the drum head while striking, the perceived pitch can be modified. The dayan is the wooden treble drum, played by the right hand. A larger variety of sounds is produced on this drum. As for timpani for example, the drum can be tuned according to the accompanied voice or instrument. It is also important to note that for this class of instruments the main resonance (that corresponds to the main perceived pitch) is much above the resonance of the other harmonics. Further elements on the physics of the Tabla can be found in [Fletcher and Rossing, 1998] or in the early work of Raman [Raman, 1934].

A mnemonic syllable or bol is associated to each of these strokes. Since the musical tradition in India is mostly oral, compositions are often transmitted from masters to students as sung bols. Common bols are: Ge, Ke (bayan bols), Na, Tin, Tun, Ti, Te (dayan bols). Strokes on the bayan and dayan can be combined, as in the bols Dha (Na + Ge), Dhin (Tin + Ge) or Dhun (Tun + Ge). Because of these characteristics, even if the two drums are played simultaneously, the transcription can always be considered as monophonic: a single symbol is used even if the corresponding stroke is compound.

A specificity of this notation system is that two different bols can sound very similar. For example, Ti and Te sound nearly the same, but are played with different fingers. Though, it is possible to figure out which bol is used, since a fast sequence of these strokes is played by alternating between Ti and Te. Another characteristic is the existence of "words" made of several bols concatenated in stereotyped groups which are used as a whole, e.g. TiReKiTe or GeReNaGe. Such words are used as compositional units, and make the recitation and memorization of a piece easier.

3 Transcription of Tabla phrases

3.1 Architecture of the system

The architecture of our system is organized in several modules:

• Parametric representation: the signal parametric representation is obtained by means of two main sub-modules: onset detection, which permits segmentation of the signal into individual strokes, and feature extraction, which provides an acoustic vector for each segment previously extracted.

• Sequence modelling: the sequence of feature vectors (one vector per stroke) is then modelled by a classification technique such as k nearest neighbours (k-NN) or Hidden Markov Models (HMM).

• Transcription: the final transcription is given in the form of a succession of bols or words accompanied with a rhythmic notation obtained from the tempo detection module.

Figure 2 provides the general architecture of the system, including the transcription modules (in bold lines) and the tabla sequence generator/synthesizer (in dotted lines) that is further explained in section 5.2. It is important to note that the input to the system can either be an audio file (for example in .wav format) or an audio stream that is processed in real time.

3.2 Parametric representation

3.2.1 Segmentation in strokes

As for many transcription systems, the first step of our transcriber consists in performing a segmentation of the incoming stream into separate strokes. A number of approaches have been proposed in the literature for such a segmentation (see for example [Klapuri, 1999]). In the case of tabla signals, most of the strokes have a fast decaying exponential envelope. As a consequence, a simple envelope/threshold based approach seems adequate and is detailed below:

Figure 2: Architecture of the system: the transcription system is represented by bold lines and the tabla sequence generator or synthesis is represented by dotted lines
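As a rough sketch of the data flow in figure 2, the three modules of section 3.1 can be chained as below. The function names and the trivial per-module logic are ours, standing in only for the real onset detector, acoustic front-end and HMM decoder described in sections 3.2-3.4:

```python
# Toy sketch of the Figure 2 data flow: segmentation -> feature
# extraction -> stroke classification. Each stage is a stand-in
# for the real module described in the text.

def segment(signal, hop=4):
    """Stand-in for onset detection: fixed-size chunks."""
    return [signal[i:i + hop] for i in range(0, len(signal), hop)]

def features(stroke):
    """Stand-in for the acoustic front-end: a 1-element vector."""
    return (sum(stroke) / len(stroke),)

def classify(feat):
    """Stand-in for the HMM decoder: a two-class rule."""
    return "Dha" if feat[0] > 0 else "Ke"

def transcribe(signal):
    return [classify(features(s)) for s in segment(signal)]

print(transcribe([1.0] * 8 + [-1.0] * 4))  # ['Dha', 'Dha', 'Ke']
```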

• Sampling: the original input signal is sampled (or resampled) at 44.1 kHz.

• Envelope extraction: the envelope, env(t), is obtained in two steps. First, a low-frequency envelope signal (sampled at 220.5 Hz), env1(t), is calculated by taking the absolute value of the maximum in 400-sample windows. Second, a local normalisation is performed by dividing the previous envelope by another rectified very-low-frequency envelope signal, n(t) (sampled at 1 Hz), in order to compensate for the dynamic variations of the input signal. Note that n(t) is estimated by taking the absolute value of the maximum in 44100-sample windows.

• Onset detection: an onset is detected at locations where the difference between two successive samples of the rectified envelope signal exceeds a given threshold, provided that no other onset was detected during the last 100 ms. Note that due to the normalisation approach, this threshold does not depend on the initial signal energy.

3.2.2 Duration and tempo extraction

Tempo extraction has also received much interest in recent years (see [Scheirer, 1998] and [Alonso et al., 2003] for example). Due to the impulsiveness of tabla signals, a simple approach is also preferred in this work. A local tempo at time T is estimated by finding a peak in the autocorrelation function of env(t), t in [T - 10s, T + 10s], in the range [60, 240] BPM (beats per minute). Events are then aligned and quantized to the corresponding time grid. We keep for each stroke its relative duration as a fraction of the tempo (each event is then rhythmically described as a note, half-note, quarter note, etc.).

3.3 Feature extraction

Early percussion instrument classification systems used the energy in a set of well-chosen frequency bands as features. Recent systems use more complex feature sets, including temporal, spectral or cepstral features ([Herrera et al., 2003], [McDonald and Tsang, 1997], [Sillanpää et al., 2000]). The power spectra of various bols are plotted in figure 3. It can be seen that the discrimination between the bols can be achieved according to the position, relative weight and width of the spectral peaks. It is then appropriate to consider that the power spectrum of each stroke can be represented by a probability distribution and that it can be approximated by a weighted sum of N Gaussian distributions. For tabla signals, N = 4 represents an appropriate choice. The four Gaussian distributions are obtained as follows:

• An initial estimate is obtained by computing the empirical mean and variance in four pre-defined frequency bands: B1 = [0, 150] Hz, B2 = [150, 220] Hz, B3 = [220, 380] Hz, B4 = [700, 900] Hz. Although the widths of these bands are sufficient to cope with significant tuning variations of the dayan, it would be necessary to adapt this front-end to obtain a truly tuning-independent system. For example, Mel Frequency Cepstral Coefficients (MFCC), which have better generalization properties, have also been tested but have led to slightly degraded performances on our corpus.

• Then, an improved estimation is obtained by iterating the Expectation-Maximisation (EM) algorithm on a training data set. In practice, convergence is obtained in a few steps (typically in 4 iterations).

In summary, the feature vectors F consist of 12 elements, F = (f1, ..., f12): the mean, variance and relative weight of each of the 4 Gaussians.

3.4 Learning and classification of bols

Due to the time dependencies discussed above, classifying each stroke independently of its context leads to unsatisfactory results. An efficient approach that integrates context (or time) dependencies is given by the Hidden Markov Model (HMM). This class of models is particularly suitable for modelling short-term time dependencies and has been successfully used for a wide variety of problems ranging from speech recognition ([Rabiner and Juang, 1993]) to music transcription ([Raphael, 2002]). In such a framework, the sequence of feature vectors Ot is represented as the output of a Hidden Markov Model. The recognition is performed by searching for the most likely state sequence, given the output sequence of feature vectors.
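The envelope/threshold segmentation described in section 3.2.1 can be sketched as follows. The window sizes are in the spirit of the text, but the threshold value is an illustrative assumption (the paper does not give one), and the function name is ours:

```python
import numpy as np

def onsets_from_envelope(x, sr=44100, win=400, slow_win=44100,
                         threshold=0.2, refractory_s=0.1):
    """Envelope/threshold onset detector in the spirit of Sec. 3.2.1.
    The threshold value is illustrative, not the paper's setting."""
    # Coarse envelope: maximum of |x| over consecutive windows.
    n = len(x) // win
    env = np.array([np.abs(x[i * win:(i + 1) * win]).max() for i in range(n)])
    # Very slow envelope n(t) for loudness normalisation (about 1 Hz).
    k = max(1, slow_win // win)  # slow window expressed in envelope samples
    slow = np.array([env[max(0, i - k // 2): i + k // 2 + 1].max()
                     for i in range(n)])
    env = env / np.maximum(slow, 1e-12)
    # Onsets: rise of the normalised envelope above the threshold,
    # with a 100 ms refractory period between detections.
    hop_t = win / sr
    refractory = int(np.ceil(refractory_s / hop_t))
    d = np.diff(env, prepend=env[:1])
    onsets, last = [], -refractory
    for i, di in enumerate(d):
        if di > threshold and i - last >= refractory:
            onsets.append(i * hop_t)
            last = i
    return onsets
```

Because of the division by the slow envelope, the same threshold works regardless of the overall recording level, which is the point of the normalisation step in the text.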


Figure 3: Spectra of common bols (from left to right Dha, Ge, Na, Ti, Ke).
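The spectral model of section 3.3 (the power spectrum viewed as a distribution over frequency and approximated by 4 Gaussians) can be sketched with a small weighted EM. The band-based initialisation follows the paper; the function name and the numerical guards are ours:

```python
import numpy as np

def spectral_gmm_features(power, freqs, n_iter=4,
                          bands=((0, 150), (150, 220), (220, 380), (700, 900))):
    """12-element feature vector in the spirit of Sec. 3.3: mean, variance
    and weight of 4 Gaussians fitted by EM to the normalised power spectrum
    seen as a distribution over frequency."""
    p = power / power.sum()  # spectrum as a probability distribution
    # Initialisation: empirical mean/variance in the pre-defined bands.
    mu, var, w = [], [], []
    for lo, hi in bands:
        m = (freqs >= lo) & (freqs < hi)
        pm = p[m].sum()
        if pm <= 0:  # empty band: fall back to the band centre
            mu.append((lo + hi) / 2)
            var.append(((hi - lo) / 4) ** 2)
            w.append(1e-6)
            continue
        mean = (freqs[m] * p[m]).sum() / pm
        mu.append(mean)
        var.append(max(((freqs[m] - mean) ** 2 * p[m]).sum() / pm, 1.0))
        w.append(pm)
    mu, var, w = np.array(mu), np.array(var), np.array(w)
    w = w / w.sum()
    for _ in range(n_iter):  # a few EM iterations, as in the paper
        # E-step: responsibility of each Gaussian for each frequency bin.
        g = (w[:, None] / np.sqrt(2 * np.pi * var[:, None]) *
             np.exp(-(freqs[None, :] - mu[:, None]) ** 2 / (2 * var[:, None])))
        r = g / np.maximum(g.sum(axis=0, keepdims=True), 1e-300)
        # M-step: update weighted by the spectral mass p of each bin.
        mass = np.maximum((r * p[None, :]).sum(axis=1), 1e-12)
        mu = (r * p[None, :] * freqs[None, :]).sum(axis=1) / mass
        var = np.maximum((r * p[None, :] *
                          (freqs[None, :] - mu[:, None]) ** 2).sum(axis=1)
                         / mass, 1.0)
        w = mass / mass.sum()
    return np.concatenate([mu, var, w])  # 12 features
```

On a spectrum with two clear peaks, the fitted means land near the peaks and the 4 weights sum to one, mirroring the (mean, variance, weight) description of the feature vector F.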

3.4.1 Description of the model

Although many implementations of HMMs are now available, it is felt important to recall the main stages of such a model and to describe how it is applied to the particular problem of tabla sequence recognition. The model includes states, transitions between states and emission probabilities.

States: a couple of bols B1.B2 is associated to each state qt of the model. In other words, qt represents "B2 in the context of B1 at time t". For example, for the sequence of bols b = (*start*, *start*, Dha, Dhin, Dhin, Dha), the corresponding sequence of states is q0 = [*start*.*start*], q1 = [*start*.Dha], q2 = [Dha.Dhin], q3 = [Dhin.Dhin], q4 = [Dhin.Dha].

Transitions: if the state i is labelled by B1.B2 and j by B2.B3, then the transition from state q(t-1) = i to state q(t) = j is given by:

    a_ij = p(q(t) = j | q(t-1) = i) = p(b(t) = B3 | b(t-1) = B2, b(t-2) = B1)

where p(b(t) = B3) is the probability of observing the bol B3 at time t.

Initial probabilities: since two dummy states *start* are inserted at the beginning of each sequence, the first state is always q0 = [*start*.*start*].

Emission probabilities: each state i labelled by B1.B2 emits a feature vector according to a distribution b_i(x) characteristic of the bol B2 preceded by B1:

    b_i(x) = p(O(t) = x | q(t) = i) = p(O(t) = x | b(t) = B2, b(t-1) = B1)

b_i(x) represents the probability of observing x knowing that the state at time t is i. In this work, b_i(x) is either modelled by a single mixture (a Gaussian vector distribution with diagonal covariance matrix) or a mixture of two Gaussian distributions. For example, in the single mixture case, the feature vectors are modelled with a single vector distribution of 4 Gaussians (where each Gaussian characterizes the mean, variance and relative weight of the frequency content in each pre-defined frequency band, see section 3.3).

The HMM model described above is a trigram model (3-grams) since its parameters only depend on the two previous events. In this paper, 4-gram models (where the three previous events are used to estimate the various probabilities) are also used.

3.4.2 Training

Transition probabilities are estimated by counting occurrences in the training database: if the state i is labelled by B1.B2, j by B2.B3, and if C(s) is the number of occurrences of the subsequence s in the annotated corpus, then:

    a_ij = p(b(t) = B3 | b(t-1) = B2, b(t-2) = B1) = C(B1 B2 B3) / C(B1 B2)

If i is labelled by B1.B2, the emission probabilities for this state are estimated with:

• empirical mean and variance estimators on the set of feature vectors A_i = {x(t) | b(t) = B2, b(t-1) = B1}, in the case of a simple Gaussian model;

• 8 iterations of the EM algorithm after a random assignment to mixtures, in the case of a mixture model.

3.4.3 Recognition

Recognition is performed through the classical iterative Viterbi algorithm. After an initialisation step with v_q0(0) = 0 and v_q(0) = -infinity for all q other than q0, the algorithm includes:

Iteration:

    v_l(t+1) = log b_l(f(t+1)) + max over q in Q of { v_q(t) + log(a_ql) }

    backtrack_l(t) = argmax over q in Q of { v_q(t) + log(a_ql) }

Backtracking: the most likely bols sequence b* is the one which maximizes the log-likelihood and is calculated by:

    q(T) = argmax over q in Q of { v_q(T) }

    q(t) = backtrack_q(t+1)(t) and b(t) = Etq(q(t))

where Etq(q(t)) represents the label (i.e. the bol) of state q(t).

Implementation issues: in practice, it is necessary to use the log-likelihood version of this algorithm, which avoids scaling problems. Another practical problem is the inability of the model to recognize bol sequences which were not present in the training set. To allow such sequences to be recognized (i.e. to permit all transitions: a_ij != 0), the iteration step of the Viterbi algorithm has been slightly modified to:

    v_l(t+1) = log b_l(f(t+1)) + max over q in Q of { v_q(t) + log((1 - epsilon) a_ql + epsilon) }

4 Experimental results

4.1 Database and evaluation protocol

4.1.1 The database

The database used for this study is made of 64 phrases with a total of 5715 bols. Some of these phrases are long compositions with themes and variations (kaida), some others are shorter pieces (tukra) or basic taals. This database is sub-divided in three sets, each one corresponding to a specific model of tabla:

• Tabla #1: recorded using a cheap quality tabla whose strokes are much less resonant than those of professional instruments. The dayan of the tabla was tuned in C#3 and the recording was made using good home-studio equipment (Sennheiser e825s microphone), in low reverberation and low noise conditions.

• Tabla #2: recorded using a high quality instrument with a dayan tuned in D3. The recording was made with good home-studio equipment (Sennheiser e825s microphone), in low reverberation and low noise conditions.

• Tabla #3: recorded using another high quality instrument with a dayan tuned in D3. This set was recorded in a noisier environment and with a greater distance between the microphone and the musician.

4.1.2 Evaluation Protocol 1

For evaluation, the usual cross-validation approach was followed (often called the ten-fold procedure in the literature [Herrera et al., 2003]). It consists in splitting the whole database into 10 randomly selected subsets and in using nine of them for training and the last subset (i.e. 10% of the data) for testing. The procedure is then iterated by rotating the 10 subsets used for training and testing. The results are computed as the average values over the ten runs. In this protocol, one iteration thus consists in using 90% of each data set (90% of Tabla #1, Tabla #2 and Tabla #3) for training and the remaining data (coming from each set) for testing. In this protocol, it is important to note that all instruments and conditions are known in the training phase.

4.1.3 Evaluation Protocol 2

To evaluate the generalization properties of our algorithm, a second protocol is used where the training set and the testing set correspond to different instruments and recording conditions. In this protocol, the training set consists of all signals from 2 sets (for example 100% of Tabla #1 and Tabla #2) and the testing set consists of all signals from the third set (i.e. 100% of Tabla #3). This protocol therefore allows evaluation of the generalization properties of our approach, since the training set does not include signals that share the same acoustic environment and instrument as those of the testing set.

4.2 Results

The results obtained with our novel approach based on an HMM language model are compared with those of simpler classifiers such as a kernel density estimator, k-NN (k nearest neighbours) and a naive Bayesian approach. These simple classifiers are briefly described below:

k-Nearest Neighbours (k-NN): the k-NN approach is a popular technique and has already been used in several studies on musical instruments (see for example [Herrera et al., 2003], [Kaminskyj, 2001]). It consists in building a set of couples associating a feature vector F with the corresponding bol Bj. If k = 1, a stroke with feature vector F is identified as the bol Bj whose feature vector Fj is the closest to F. For k greater than 1, the decision is taken on the most represented bol Bj amongst the k nearest feature vectors Fj. The distance used is a classical Euclidean distance on the normalised feature vectors:

    f~_i = (f_i - mu_fi) / sigma_fi

where mu_fi and sigma_fi respectively correspond to the mean and standard deviation of the parameter f_i on the training set.

Naive Bayes: this approach consists in building a Gaussian model for each bol Bj. All training data for a given bol are used to build a Gaussian model for each element f_i of the feature vector, leading to a set of 24 parameters per bol (each element f_i of a given bol Bk is characterized by its mean and variance, respectively mu(f_i, Bk) and sigma(f_i, Bk)). The chosen bol Bk is then the one that maximises the likelihood given the observed feature vector F:

    p(B = Bk | o = F) = product over i = 1..12 of [ 1 / (sigma(f_i, Bk) sqrt(2 pi)) ] exp( -(f_i - mu(f_i, Bk))^2 / (2 sigma^2(f_i, Bk)) )

Kernel density estimator: this approach does not generate a real abstract model of the data. It merely approximates the distribution of the data by a sum of functions called the "kernel". This kernel can have different properties and can for example lead to a very smooth approximation of the distribution. Note that one of the advantages of this approach is that it can describe non-Gaussian distributions.

For all three classifiers described above, the implementation proposed in Weka is used ([Waikato, 2003]).

As mentioned in section 4.1, two evaluation protocols are used. The transcription results obtained using these two protocols are respectively summarized in Table 1 and Table 2. First, it is important to note that due to the sound quality of the database, the segmentation is very efficient and resulted in 98.3% correct segment detection (only 97 errors: 70 strokes not detected, 27 wrongly detected onsets) on the whole database.

Database                                  All      Tabla #1   Tabla #2   Tabla #3
# of bols                                 5715     1678       2216       1821
Classification using only features of stroke n
  Kernel density estimator                81.7%    81.8%      82.4%      85.2%
  5-NN                                    83.0%    81.7%      83.3%      85.6%
  Naive Bayes                             76.6%    79.4%      78.6%      78.5%
Classification using features of strokes n, n-1, n-2
  Kernel density estimator                86.8%    86.0%      88.7%      92.0%
  5-NN                                    88.9%    87.2%      88.4%      90.6%
  Naive Bayes                             81.8%    86.5%      83.8%      85.8%
Classification using language modelling
  HMM, 3-grams, 1 mixture                 88.0%    90.6%      89.9%      92.6%
  HMM, 4-grams, 2 mixtures                93.6%    92.0%      91.9%      93.4%

Table 1: Recognition scores using evaluation protocol 1
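The bigram-state trigram HMM of sections 3.4.2 and 3.4.3, including the (1 - epsilon)a + epsilon transition smoothing, can be sketched as follows. The function names are ours, and the emission scores are supplied by the caller (toy values in the usage below), whereas the paper uses per-state Gaussian or Gaussian-mixture densities:

```python
import math
from collections import Counter

def train_transitions(sequences):
    """Trigram model of Sec. 3.4.2:
    a(B1,B2,B3) = C(B1 B2 B3) / C(B1 B2), counted on annotated phrases."""
    c3, c2 = Counter(), Counter()
    for seq in sequences:
        s = ["*start*", "*start*"] + list(seq)
        for i in range(len(s) - 2):
            c2[(s[i], s[i + 1])] += 1
            c3[(s[i], s[i + 1], s[i + 2])] += 1
    return {k: v / c2[(k[0], k[1])] for k, v in c3.items()}

def viterbi(T, bols, obs_loglik, a, eps=1e-4):
    """Log-domain Viterbi over bigram-labelled states (Sec. 3.4.3),
    with the (1 - eps) * a + eps smoothing that permits transitions
    unseen in training. obs_loglik(t, bol) is the emission log-score."""
    v = {("*start*", "*start*"): 0.0}  # v_q0(0) = 0, others -inf
    back = []
    for t in range(T):
        nv, nb = {}, {}
        for (b1, b2), score in v.items():
            for b3 in bols:
                trans = (1 - eps) * a.get((b1, b2, b3), 0.0) + eps
                cand = score + math.log(trans) + obs_loglik(t, b3)
                if cand > nv.get((b2, b3), -math.inf):
                    nv[(b2, b3)] = cand
                    nb[(b2, b3)] = (b1, b2)
        back.append(nb)
        v = nv
    state = max(v, key=v.get)      # most likely final bigram state
    seq = [state[1]]
    for nb in reversed(back[1:]):  # follow the back-pointers
        state = nb[state]
        seq.append(state[1])
    return list(reversed(seq))

a = train_transitions([["Dha", "Dhin", "Dhin", "Dha"]])
obs = ["Dha", "Dhin", "Dhin", "Dha"]
loglik = lambda t, b: 0.0 if b == obs[t] else -5.0
print(viterbi(4, ["Dha", "Dhin"], loglik, a))
# -> ['Dha', 'Dhin', 'Dhin', 'Dha']
```

Because every smoothed transition probability is at least eps, the log never receives zero, which is the point of the modification described under "Implementation issues".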

In Table 1, it can be observed that the simple classifiers (kernel density estimator, k-NN, Naive Bayes) already obtain satisfactory results, which indicates that the basic parametric representation chosen is valuable. However, these models cannot deal with time dependencies. A simple way of modelling these dependencies is to extend the feature vectors with the features of the 2 previous bols: in these experiments, the feature vector is simply the concatenation of the stroke's feature vector with those of the two previous strokes. Clearly, this simple approach succeeds in modelling some of the time dependencies and noticeably improves the recognition score, especially for the non-parametric models. However, the best results are achieved with the HMM language model approach, which appears to be a very appropriate solution for the transcription of Tabla signals.

Table 2 summarizes the results obtained with protocol 2, where the training sets and test set are recorded on different instruments and in different conditions. In other words, the conditions and specificities of the test set are not known in the training phase. In this setting, the classifier which gave the best recognition scores with protocol 1 failed to properly generalize and to adapt to other instruments or recording conditions. This is clearly another advantage of the HMM approach, whose performance remains at a very high level even with signals recorded in different conditions. Though, it is interesting to note that in this case the simpler HMM model leads to the best results. This may be explained by the fact that using more complex mixture models as distributions for the acoustic features may lead to an overfitting of the training data.

In further analyzing the errors of our algorithms, it can be noticed that most of them occur with similar strokes. If bols are grouped in 5 categories corresponding to a similar production mechanism (see Table 3), one can observe that most of the recognition errors happen within the same stroke category ("non-resonant bayan strokes"). It is worth noting that the overall category recognition rate is fairly high, since it reaches 95.7% with the best model (HMM, 4-grams, 2 mixtures).

5 Tablascope: a fully integrated environment

5.1 Description

The tabla transcription algorithm has been embedded in a fully integrated environment called Tablascope. This system has been developed following a client/server approach which provides a great level of flexibility and portability. All the signal processing and recognition modules are implemented in a server written in C++, while the graphical client is a Java application. A protocol based on the serialization of objects as XML messages has been chosen for the communication between the two modules. The database and the transcriptions are also stored using the XML format. Tablascope also integrates the possibility to export transcriptions as Csound scores. An overview of the environment is presented in figure 4.

5.2 Applications

Since the recognition algorithm runs in realtime, a number of other applications are possible, including:

Tabla sequence generation or synthesis: in that context, the user types in the desired bols sequence and the overall tempo. Then, the HMM sequence model can check the phrase input by the user, i.e. check whether it contains unknown bols or forbidden bols sequences. In this case, Tablascope suggests the nearest correct phrase according to an edit distance. Synthesis is then performed by an overlap-add procedure slightly adapted to permit pitch shifting.

Tabla-controlled synthesizer: Tablascope can also be used as a MIDI controller, that is, to control a synthesizer with input coming from a Tabla without using any other sensor than a microphone. In such an application, a given bol can be associated to a specific instrument of the synthesizer, giving the Tabla player a wider range of performance capabilities.

6 Conclusion

Most music transcription systems are focused on melody and western music. In this paper, we presented a transcription system for Tabla - a north Indian percussion instrument - based on a statistical machine learning approach. The identification of Tabla bols with a low (6.5%) error rate is achieved through the use of an HMM model, in a similar way that language models are exploited in large vocabulary speech recognition systems. Additional functions such as tabla sequence correction and generation (or synthesis) are integrated in a user-friendly environment called Tablascope. Although this study was conducted on Tabla signals, it can easily be generalized to other types of percussive signals, such as percussion or drum loops, which also show strong time dependencies. As a matter of fact, except for the signal parametric representation (i.e. the acoustic front-end), which is rather specific to the Tabla, all other modules are fairly generic.

7 Acknowledgements

The authors wish to thank Sanjay and Uday Deshpande for their precious advice about Tabla theory, Claire Waast-Richard for fruitful discussions on current speech recognition technologies, and Nicolas Lelandais for the tabla performances used in this work.

Training set                 Tabla #1 & Tabla #2      Tabla #2 & Tabla #3
Test set                     Tabla #3 (noisy rec.)    Tabla #1 (cheap quality)
5-NN                         79.8%                    78.2%
HMM, 3-grams, 1 mixture      90.2%                    88.4%
HMM, 4-grams, 2 mixtures     84.5%                    85.0%

Table 2: Generalization test using evaluation protocol 2

   a     b     c     d     e    <- classified as
1241    22     2     7     8    a: resonant dayan strokes (Tin, Na, Tun...)
  20  1076     2     1     5    b: bayan+dayan strokes (Dhin, Dha...)
   1     3   766     5    20    c: resonant bayan strokes (Ge, Gi...)
   8     2     2   448    61    d: non-resonant bayan strokes (Ke, Ki...)
  11     7     6    50  1938    e: non-resonant dayan strokes (Te, Ti, Tek...)

Table 3: Confusion matrix by bol category (with the HMM, 4-grams, 2 mixtures classifier)

Figure 4: An overview of the Tablascope environment: phrase edition (BolPad), audio annotation, realtime transcription, corpus management and tuning windows

References

[Alonso et al., 2003] Alonso, M., Badeau, R., David, B., and Richard, G. (2003). Musical tempo estimation using noise subspace projections. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '03).

[A. Kapur and Cook, 2002] A. Kapur, G. Essl, P. D., and Cook, P. R. (2002). The electronic tabla controller. In Proceedings of the 2002 Conference on New Interfaces for Musical Expression (NIME-02).

[Raman, 1934] Raman, C. (1934). The Indian musical drum. Proceedings of the Indian Academy of Sciences. Reprinted in Musical Acoustics: Selected Reprints, ed. T. D. Rossing, Am. Assn. Phys. Tech., College Park, MD, 1988.

[Raphael, 2002] Raphael, C. (2002). Automatic transcription of piano music. In Proceedings of ISMIR 2002, pages 15-16.

[Roh and Wilcox, 1995] Roh, J. and Wilcox, L. (1995). Exploring tabla drumming using rhythmic input. In Proceedings of CHI '95, pages 310-311.

[Scheirer, 1998] Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588-601.

[Sillanpää et al., 2000] Sillanpää, J., Klapuri, A., Seppänen, J., and Virtanen, T. (2000). Recognition of acoustic noise mixtures by combined bottom-up and top-down approach. In Proceedings of the European Signal Processing Conference, EUSIPCO-2000.

[Waikato, 2003] Waikato (2003). Weka 3: Machine learning software in Java. http://www.cs.waikato.ac.nz/ml/weka/.

[Fletcher and Rossing, 1998] Fletcher, N. and Rossing, T. (1998). The Physics of Musical Instruments. Springer-Verlag, 2nd edition.

[Gouyon and Herrera, 2001] Gouyon, F. and Herrera, P. (2001). Exploration of techniques for automatic labeling of audio drum tracks. In Proceedings of MOSART: Workshop on Current Directions in Computer Music.

[Herrera et al., 2000] Herrera, P., Amatriain, X., Batlle, E., and Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of automatic musical instrument classification. In Proceedings of ISMIR 2000.

[Herrera et al., 2003] Herrera, P., Dehamel, A., and Gouyon, F. (2003). Automatic labeling of unpitched percussion sounds. In 114th AES Convention, Amsterdam, The Netherlands.

[Kaminskyj, 2001] Kaminskyj, I. (2001). Multi feature sound classifier. In Proceedings of the Australasian Computer Music Conference.

[Klapuri, 1999] Klapuri, A. (1999). Sound onset detection by applying psychoacoustic knowledge. In IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing, Phoenix, Arizona.

[Martin, 1999] Martin, K. (1999). Sound Source Recognition: A Theory and Computational Model. Ph.D. thesis, Massachusetts Institute of Technology.

[McDonald and Tsang, 1997] McDonald, S. and Tsang, C. (1997). Percussive sound identification using spectral centre trajectories. In Proceedings of 1997 Postgraduate Research Conference.

[Rabiner and Juang, 1993] Rabiner, L. and Juang, B. (1993). Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.