
JOINT TRANSCRIPTION OF LEAD, BASS, AND RHYTHM GUITARS BASED ON A FACTORIAL HIDDEN SEMI-MARKOV MODEL

Kentaro Shibata¹, Ryo Nishikimi¹, Satoru Fukayama², Masataka Goto², Eita Nakamura¹, Katsutoshi Itoyama¹, Kazuyoshi Yoshii¹

¹Graduate School of Informatics, Kyoto University, Japan
²National Institute of Advanced Industrial Science and Technology (AIST), Japan

ABSTRACT

This paper describes a statistical method for estimating musical scores for lead, bass, and rhythm guitars from polyphonic audio signals of typical band-style music. To perform multi-instrument transcription involving multi-pitch detection and part assignment, it is crucial to formulate a musical language model that represents the characteristics of each part in order to resolve the ambiguity of part assignment and estimate a musically natural score. We propose a factorial hidden semi-Markov model that consists of three language models corresponding to the three guitar parts (three latent chains) and an acoustic model of a mixture spectrogram (emission model). The language model for the rhythm guitar represents a homophonic sequence of musical notes (a chord sequence), and those for the lead and bass guitars represent monophonic sequences of musical notes in higher and lower frequency ranges, respectively. The acoustic model represents a spectrogram as a sum of low-rank spectrograms of the three guitar parts approximated by NMF. Given a spectrogram, we estimate the note sequences using Gibbs sampling. We show that our model outperforms a state-of-the-art multi-pitch detection method in the accuracy and naturalness of the transcribed scores.

Index Terms— Automatic music transcription, multi-pitch estimation, multi-instrument transcription, HSMM, and NMF.

[Fig. 1. A factorial hidden semi-Markov model (FHSMM) that represents the generative process of audio signals of the three guitar parts of band music: a chord sequence for the rhythm guitar, note sequences for the lead and bass guitars, and low-rank spectrograms (basis spectra and activations) that sum to the observed music spectrogram.]

1. INTRODUCTION

Transcribing a musical score from an audio signal is a fundamental and challenging problem in music information processing [1]. From a practical viewpoint, it is important to develop a transcription method for band music to facilitate widely enjoyed cover performances. A typical band playing popular music (e.g., The Beatles) consists of vocal, several guitar, and drum parts. Transcription of the vocal and drum parts has been studied extensively, and high accuracies have been reported [2-7]. We therefore focus on the accompaniment parts, which are typically played by three guitars (lead, bass, and rhythm).

Accompaniment-part transcription for band music is a challenging task because multi-pitch detection and part assignment of each note are both required. A major approach to multi-pitch detection is to use probabilistic latent component analysis (PLCA) or nonnegative matrix factorization (NMF) based on the sparseness and low-rankness of source spectrograms. For part assignment, these methods have been extended to use pre-learned spectral templates of multiple instruments [8-13]. More recently, end-to-end neural networks have been applied successfully to performing multi-pitch detection and part assignment simultaneously [14, 15].

Since most previous methods rely on the timbral characteristics of instruments, they struggle with band music in which multiple instruments have similar timbres (e.g., two guitars). Recently, attempts have been made to incorporate a music language model representing a musical grammar (e.g., the sequential dependency of chords and musical notes) to complement the acoustic model in such difficult situations. Although Sigtia et al. [14] proposed an end-to-end polyphonic transcription method with a recurrent neural network (RNN)-based language model, the model's effect is essentially smoothing because it is defined at the frame level. As they noted, a beat-level language model is considered necessary for obtaining well-formed transcriptions. Schramm et al. [16] combined a PLCA-based acoustic model with a hidden Markov model (HMM)-based language model (also defined at the frame level), where multiple vocal parts were treated independently.
In this paper, we propose a transcription method for the accompaniment part of band music based on the beat-level sequential characteristics of the accompanying instruments. We assume that three kinds of guitars (lead, bass, and rhythm) are used in band music and that they have different sequential characteristics, as listed in Table 1. Our method uses a factorial hidden semi-Markov model (FHSMM) [17-19] that consists of three language models (semi-Markov models) corresponding to the three guitar parts and an NMF-based acoustic model of a mixture spectrogram (Fig. 1). The language model for the rhythm guitar represents a homophonic sequence of musical notes (a chord sequence), while those for the lead and bass guitars represent monophonic sequences of musical notes in higher and lower frequency ranges, respectively. The acoustic model represents a spectrogram as a sum of the spectrograms of the three guitar parts, each of which is represented as a low-rank matrix obtained by the product of an activation vector and basis spectra. A key feature of the acoustic model is that only one of the basis spectra is allowed to be activated in each tatum in each guitar part. Given a mixture spectrogram as observed data, the basis spectra, activations, and latent chains are statistically estimated using Gibbs sampling. The musical score for each guitar is then obtained by using the Viterbi algorithm.

A major contribution of this study is to achieve joint transcription of multiple musical instruments that have similar timbral characteristics. More specifically, we propose an integrated language and acoustic model that can describe beat-level symbolic musical grammar and the dependency between multiple instrument parts. We also show that the initialization problem of the NMF-based model can be mitigated with support from a DNN-based model, which has a strong capability of representing acoustic signals. We achieve an improvement in accuracy, which is an important step towards a practical transcription method for popular music.

This work was supported in part by JST ACCEL No. JPMJAC1602, JSPS KAKENHI No. 16H01744 and No. 16J05486, and the Kyoto University Foundation.

Table 1. The characteristics of the three guitar parts.
Part | Sequence | Frequency range | Rhythm
Lead | Monophonic | Middle - High | Complex
Bass | Monophonic | Low | Complex
Rhythm | Homophonic | Wide | Simple

2. PROPOSED METHOD

Our method jointly performs multi-pitch estimation and part assignment in a unified framework. We formulate a generative model of a music spectrogram and the corresponding musical score and then solve the inverse problem; that is, given a music spectrogram, we estimate the score described as latent variables in the model. The problem is defined as follows:

Input: the magnitude spectrogram of a target signal X ∈ R_+^{F×T} and 16th-note-level tatum times
Output: musical scores of the three guitar parts

Here, F is the number of frequency bins and T is the number of time frames. In this paper we assume that the time signature of the target signal is 4/4.

2.1. Probabilistic Factorial Modeling

We represent a music spectrogram X ∈ R_+^{F×T} as a sum of the spectrograms X^R, X^L, and X^B ∈ R_+^{F×T} of the rhythm, lead, and bass guitars (Fig. 1):

  x_{ft} = x^R_{ft} + x^L_{ft} + x^B_{ft}.  (1)

The generative processes of X^R, X^L, and X^B are represented by the rhythm, lead, and bass guitar models, respectively. In each guitar model, the corresponding musical score is represented as a latent chain of variables, whose generative process is described by a semi-Markov model. Given the three chains, the spectrograms X^R, X^L, and X^B are generated by an NMF-based acoustic model conditioned on the musical scores. Finally, the observed spectrogram X is generated according to Eq. (1). The generative process for each guitar part is described as a hidden semi-Markov model (HSMM) as explained below. The total model is a factorial HSMM with three latent chains.
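To make the latent score representation concrete, the following Python sketch encodes one part's symbols together with their 16th-note tatum onsets, matching the Input/Output definition above. The class and field names are ours, not the paper's, and the values are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GuitarPart:
    """One latent chain: symbols (chords or pitches) and their tatum onsets.

    For the rhythm guitar, `symbols` holds chord labels such as "C" or "Am";
    for the lead and bass guitars it holds pitch names such as "E4" or "A2".
    `onsets` holds the relative 16th-note tatum position u_n in {0, ..., 15};
    as in the metrical Markov model below, a non-increasing onset means that
    the next symbol starts in the following measure.
    """
    symbols: List[str]
    onsets: List[int]

# A toy score for the three parts (illustrative values only).
rhythm = GuitarPart(["C", "Am", "F"], [0, 8, 0])
lead = GuitarPart(["E4", "G4", "A4", "C5"], [0, 4, 8, 2])
bass = GuitarPart(["C2", "A1", "F1"], [0, 8, 0])
```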

2.1.1. Language Model

We formulate a rhythm guitar HSMM representing the generative process of X^R, and lead and bass guitar conditional HSMMs representing the generative processes of X^L and X^B given the chord sequence specified by the rhythm guitar (Fig. 2). For convenience, "⋆" is used to represent "L" (lead guitar) or "B" (bass guitar), and "•" is used to represent "R" (rhythm guitar), "L", or "B".

[Fig. 2. The generative process of the spectrogram of the lead or bass guitar based on a conditional HSMM: given the chord sequence of the rhythm guitar, pitches and tatum-level onset positions are generated by transitions between pitches and between tatums, and at each tatum a single basis spectrum is selected and scaled by an activation to form a low-rank spectrogram generated under a Poisson likelihood. ⋆ indicates lead (L) or bass (B).]

The latent chain of the rhythm guitar is specified by a sequence of chord symbols Z^R = {z^R_1, ..., z^R_{N^R}} with relative onset positions U^R = {u^R_1, ..., u^R_{N^R}}, where N^R is the number of chords, z^R_n takes one of the 24 values of {C, ..., B} × {major, minor}, and u^R_n takes an integer from 0 to 15 indicating a position on the 16th-note-level tatum grid in a measure. Likewise, the latent chain of the lead or bass guitar's HSMM is specified by a sequence of pitches Z^⋆ = {z^⋆_1, ..., z^⋆_{N^⋆}} with relative onset positions U^⋆ = {u^⋆_1, ..., u^⋆_{N^⋆}}, where N^⋆ is the number of musical notes, z^L_n takes one of the 45 pitches {E2, ..., C6}, z^B_n takes one of the 28 pitches {E1, ..., G3}, and u^⋆_n takes an integer from 0 to 15.

The sequence of chord symbols Z^R and the sequences of pitches Z^⋆ are represented by Markov models as follows:

  p(z^R_1 | π^R) = π^R_{z^R_1},   p(z^R_n | z^R_{n-1}, φ^R) = φ^R_{z^R_{n-1}, z^R_n},  (2)
  p(z^⋆_1 | Z^R, U^R, π^⋆) = π^⋆_{z^R_1, z^⋆_1},  (3)
  p(z^⋆_n | z^⋆_{n-1}, Z^R, U^R, φ^⋆) = φ^⋆_{z^R_{ρ^⋆(n)}, z^⋆_{n-1}, z^⋆_n},  (4)

where π^R_a is the initial probability of chord a, φ^R_{a,b} is the transition probability from chord a to chord b, ρ^⋆(n) indicates the chord to which musical note z^⋆_n belongs, π^⋆_{c,a} is the initial probability of pitch a given chord c, and φ^⋆_{c,a,b} is the transition probability from pitch a to pitch b given chord c. This formulation is based on the fact that the pitches of the lead and bass guitars are strongly correlated with the chords of the rhythm guitar.

The sequence of chord onset times U^R and the sequences of note onset times U^⋆ are represented by metrical Markov models [20, 21] as follows:

  p(u^•_n | u^•_{n-1}, χ^•) = χ^•_{u^•_{n-1}, u^•_n},  (5)

where χ^•_{a,b} is the transition probability from tatum position a to tatum position b. Note that if a ≥ b, the chord or note continues over a bar line. The maximum duration of a chord or note is restricted to the measure length (16 tatums). When combined with the models for pitches and chords, this forms a semi-Markov model because the time unit of state transition differs from the tatum unit and the metrical Markov model describes the distribution of sojourn times of latent states.
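As a concrete illustration of Eqs. (2)-(5), the following sketch draws a chord sequence with tatum onsets from the rhythm-guitar language model and then draws lead-guitar pitches conditioned on the sampled chords. The transition tables here are random placeholders (and the note-to-chord mapping is trivial); in the actual method the parameters are learned in advance (Section 3.1).

```python
import numpy as np

rng = np.random.default_rng(0)

N_CHORDS, N_PITCHES, N_TATUMS = 24, 45, 16  # C..Bm, E2..C6, 16th-note grid

# Placeholder language-model parameters (each row is a normalized distribution).
pi_R = np.full(N_CHORDS, 1.0 / N_CHORDS)                      # initial chord prob.
phi_R = rng.dirichlet(np.ones(N_CHORDS), size=N_CHORDS)       # chord transitions
chi_R = rng.dirichlet(np.ones(N_TATUMS), size=N_TATUMS)       # tatum transitions
pi_L = rng.dirichlet(np.ones(N_PITCHES), size=N_CHORDS)       # pitch init. given chord
phi_L = rng.dirichlet(np.ones(N_PITCHES), size=(N_CHORDS, N_PITCHES))  # pitch trans.

def sample_rhythm_chain(n):
    """Sample n chords z^R and tatum onsets u^R following Eqs. (2) and (5)."""
    z = [rng.choice(N_CHORDS, p=pi_R)]
    u = [0]  # assume the first chord starts on the first tatum
    for _ in range(n - 1):
        z.append(rng.choice(N_CHORDS, p=phi_R[z[-1]]))   # previous chord -> next chord
        u.append(rng.choice(N_TATUMS, p=chi_R[u[-1]]))   # previous tatum -> next tatum
    return z, u

def sample_lead_chain(n, chord_of_note):
    """Sample n lead pitches z^L following Eqs. (3)-(4), conditioned on chords."""
    z = [rng.choice(N_PITCHES, p=pi_L[chord_of_note[0]])]
    for i in range(1, n):
        z.append(rng.choice(N_PITCHES, p=phi_L[chord_of_note[i], z[-1]]))
    return z

z_R, u_R = sample_rhythm_chain(8)
# rho(n): here we simply assign every lead note to the first sampled chord.
z_L = sample_lead_chain(12, chord_of_note=[z_R[0]] * 12)
```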
2.1.2. Acoustic Model

The generative processes of the spectrograms of the three guitars are formulated in the same way. We use the probabilistic formulation of NMF based on the Kullback-Leibler divergence (KL-NMF) [13]. The spectrogram X^• of the rhythm, lead, or bass guitar part is generated by using a set of basis spectra W^• and an activation vector h^• ∈ R_+^T. The generative process of the mixture spectrogram is described as follows:

  p(x_{ft} | Z, U, W, h_t) = Poisson(x_{ft} | Σ_{•=R,L,B} w^•_{f,η^•(t)} h^•_t),  (6)

where η^•(t) indicates the musical note to which frame t belongs and is determined by Z^• and U^•. The basis spectra W^• are given by W^R = {w^R_C, ..., w^R_Bm} ∈ R_+^{F×24}, W^L = {w^L_E2, ..., w^L_C6} ∈ R_+^{F×45}, and W^B = {w^B_E1, ..., w^B_G3} ∈ R_+^{F×28}, where w^R_z indicates the basis spectrum of chord z and w^⋆_z indicates the basis spectrum of pitch z. A key feature of our model is that only a single basis spectrum, specified by η^•(t), is activated in each frame t for generating the spectrogram X^•.
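The observation model of Eq. (6) can be sketched as follows: in each part only the single basis selected by η^•(t) is active at frame t, and the observed magnitude is modeled as Poisson with the summed intensity. The names, shapes, and random initializations below are ours; in practice the basis matrices would be tied to the template spectra described in Section 3.1.

```python
import numpy as np
from scipy.stats import poisson

F, T = 288, 500  # frequency bins and frames (illustrative sizes)

rng = np.random.default_rng(1)
W = {"R": rng.random((F, 24)),   # chord bases C..Bm
     "L": rng.random((F, 45)),   # pitch bases E2..C6
     "B": rng.random((F, 28))}   # pitch bases E1..G3
H = {p: rng.random(T) for p in "RLB"}                        # activations h_t per part
eta = {p: rng.integers(0, W[p].shape[1], T) for p in "RLB"}  # active basis per frame

def mixture_intensity(W, H, eta):
    """Poisson intensity of Eq. (6): one active basis per frame and per part."""
    lam = np.zeros((F, T))
    for p in "RLB":
        # Column eta[p][t] of W[p], scaled by the activation h_t, for every frame t.
        lam += W[p][:, eta[p]] * H[p][None, :]
    return lam

lam = mixture_intensity(W, H, eta)
X = rng.poisson(lam)                      # draw a toy spectrogram from the model
loglik = poisson.logpmf(X, lam).sum()     # log-likelihood of X under Eq. (6)
```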

2.2. Bayesian Formulation

To integrate the sub-models described in Sections 2.1.1 and 2.1.2, we formulate a semi-Bayesian model given by

  p(X, Y; Θ) = p(X | Z, U, W, H) p(Z | π, φ) p(U | χ) p(W) p(H),  (7)

where Θ = {π^R, π^L, π^B, φ^R, φ^L, φ^B, χ^R, χ^L, χ^B} are parameters that are trained in advance and Y = {Z, U, W, H} are random variables estimated for an observed spectrogram X at run time. Here Z = {Z^R, Z^L, Z^B}, and U, W, and H are defined similarly.

We introduce prior distributions p(W) and p(H) to keep W close to the template bases and to make H sparse. We use gamma priors for W as follows:

  w^•_{fk} ~ Gamma(a^{w•}_{fk}, b^{w•}_{fk}),  (8)

where k ∈ {C, ..., Bm}, {E2, ..., C6}, or {E1, ..., G3} denotes a chord or a pitch, and a^w and b^w are the shape and rate hyperparameters. We use weak priors on W^R (a^{wR} and b^{wR} are smaller than a^{w⋆} and b^{w⋆}) to handle the large variance of the rhythm guitar. The hyperparameters are set so that the prior expectation of W matches the template spectra of chords and musical notes prepared in advance. Similarly, we use gamma priors for H as follows:

  h^•_t ~ Gamma(a^{h•}_t, b^{h•}_t),  (9)

where a^{h•}_t and b^{h•}_t are hyperparameters.

2.3. Bayesian Inference

Given a music spectrogram X, we aim to calculate the posterior distribution according to Bayes' theorem:

  p(Y | X, Θ) = p(X, Y | Θ) / p(X | Θ).  (10)

Since this distribution is analytically intractable, we use Gibbs sampling to alternately and iteratively sample the latent variables Z and U, the basis spectra W, and the activations H. That is, we draw samples of G ⊂ Y from a conditional posterior distribution p(G | Y_¬G, X, Θ), where Y_¬G indicates the subset of Y obtained by removing G from Y. We henceforth omit the dependency on Θ for brevity.

2.3.1. Updating Latent Variables Z and U

To sample Z^R and U^R from p(Z^R, U^R | Y_¬{Z^R,U^R}, X), an efficient forward filtering-backward sampling algorithm can be used. In forward filtering, a forward message of the rhythm guitar part α^R(z^R_n, u^R_n) is calculated recursively as follows:

  α^R(z^R_1, u^R_1) = p(z^R_1) = π^R_{z^R_1},  (11)
  α^R(z^R_n, u^R_n) = p(x_{τ^R(n-1)+1:τ^R(n)} | z^R_n, u^R_n) Σ_{z^R_{n-1}} Σ_{u^R_{n-1}} φ^R_{z^R_{n-1}, z^R_n} χ^R_{u^R_{n-1}, u^R_n} α^R(z^R_{n-1}, u^R_{n-1}),  (12)

where τ^R(n) is the last frame of the n-th chord and x_{a:b} denotes {x_a, ..., x_b}.

In the backward sampling step, z^R_n and u^R_n are sampled recursively by calculating a backward message β^R(z^R_n, u^R_n) as follows:

  β^R(z^R_N, u^R_N) = p(z^R_N, u^R_N | X) ∝ α^R(z^R_N, u^R_N),  (13)
  β^R(z^R_n, u^R_n) = p(z^R_n, u^R_n | z^R_{n+1:N}, u^R_{n+1:N}, X) ∝ φ^R_{z^R_n, z^R_{n+1}} χ^R_{u^R_n, u^R_{n+1}} α^R(z^R_n, u^R_n).  (14)

The latent variables Z^⋆ and U^⋆ are updated similarly, except that the transition probability φ^R_{a,b} is replaced with φ^⋆_{c,a,b}.
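The following sketch illustrates the forward filtering-backward sampling recursion of Eqs. (11)-(14) for the rhythm chain in a deliberately simplified setting: the acoustic segment likelihoods p(x_seg | z, u) are replaced by a precomputed random table, the mapping from segments to variable-length frame spans is not modeled, and the first-segment likelihood is folded in directly. All names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

N_CHORD, N_TATUM, N_SEG = 24, 16, 20  # chords, tatum positions, number of segments

# Placeholder model parameters and per-segment log-likelihoods log p(x_seg | z, u).
pi_R = np.full(N_CHORD, 1.0 / N_CHORD)
phi_R = rng.dirichlet(np.ones(N_CHORD), size=N_CHORD)      # chord transitions
chi_R = rng.dirichlet(np.ones(N_TATUM), size=N_TATUM)      # tatum transitions
loglik = rng.normal(size=(N_SEG, N_CHORD, N_TATUM))        # stand-in for the acoustic term

def forward_filter_backward_sample():
    """Simplified FFBS over the joint state (z^R_n, u^R_n), cf. Eqs. (11)-(14)."""
    # Forward messages alpha[n, z, u], kept in the log domain for stability.
    log_alpha = np.empty((N_SEG, N_CHORD, N_TATUM))
    log_alpha[0] = loglik[0] + np.log(pi_R)[:, None] - np.log(N_TATUM)
    for n in range(1, N_SEG):
        # Sum over previous (z', u'):  phi[z', z] * chi[u', u] * alpha[n-1, z', u'].
        scale = log_alpha[n - 1].max()
        prev = np.exp(log_alpha[n - 1] - scale)
        trans = np.einsum("ab,cd,ac->bd", phi_R, chi_R, prev)   # shape (z, u)
        log_alpha[n] = loglik[n] + np.log(trans) + scale

    # Backward sampling: draw the last state from alpha, then walk backwards.
    z = np.empty(N_SEG, dtype=int)
    u = np.empty(N_SEG, dtype=int)
    p = np.exp(log_alpha[-1] - log_alpha[-1].max()).ravel()
    idx = rng.choice(N_CHORD * N_TATUM, p=p / p.sum())
    z[-1], u[-1] = divmod(idx, N_TATUM)
    for n in range(N_SEG - 2, -1, -1):
        w = np.exp(log_alpha[n] - log_alpha[n].max())
        w *= phi_R[:, z[n + 1]][:, None] * chi_R[:, u[n + 1]][None, :]
        idx = rng.choice(N_CHORD * N_TATUM, p=(w / w.sum()).ravel())
        z[n], u[n] = divmod(idx, N_TATUM)
    return z, u

z_R, u_R = forward_filter_backward_sample()
```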

2.3.2. Updating Basis Spectra W and Activations H

As in the Bayesian inference of NMF [22], W and H are sampled directly from the conditional posterior distribution p(W, H | Z, U, X). Let λ^•_{ft} be an auxiliary variable calculated from the latest samples of W and H as

  λ^•_{ft} = w^•_{f,η^•(t)} h^•_t / (w^R_{f,η^R(t)} h^R_t + w^L_{f,η^L(t)} h^L_t + w^B_{f,η^B(t)} h^B_t).  (15)

Using λ, W and H are sampled as follows:

  w^•_{fk} ~ Gamma(a^{w•}_{fk} + Σ_t x_{ft} λ^•_{ft}, b^{w•}_{fk} + Σ_t h^•_t),  (16)
  h^•_t ~ Gamma(a^{h•}_t + Σ_f x_{ft} λ^•_{ft}, b^{h•}_t + Σ_f w^•_{f,η^•(t)}).  (17)
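A minimal sketch of the conditional updates in Eqs. (15)-(17), assuming the per-frame active-basis indices η^•(t) are fixed by the current samples of Z and U. Shapes, hyperparameter values, and variable names are illustrative; the sums in Eq. (16) are restricted to the frames where basis k is active, as implied by the single-active-basis constraint.

```python
import numpy as np

rng = np.random.default_rng(3)
F, T = 288, 500
K = {"R": 24, "L": 45, "B": 28}                      # basis counts per part

X = rng.poisson(5.0, size=(F, T)).astype(float)      # observed spectrogram (toy)
W = {p: rng.gamma(2.0, 1.0, size=(F, K[p])) for p in "RLB"}
H = {p: rng.gamma(2.0, 1.0, size=T) for p in "RLB"}
eta = {p: rng.integers(0, K[p], size=T) for p in "RLB"}  # fixed by current Z, U
a_w, b_w = 2.0, 2.0                                   # gamma hyperparameters (toy)
a_h, b_h = 1.0, 1.0

def gibbs_update_W_H(X, W, H, eta):
    """One sweep of the conditional updates in Eqs. (15)-(17)."""
    # Eq. (15): responsibility of each part for each time-frequency bin.
    comp = {p: W[p][:, eta[p]] * H[p][None, :] for p in "RLB"}   # (F, T) per part
    total = comp["R"] + comp["L"] + comp["B"] + 1e-12
    lam = {p: comp[p] / total for p in "RLB"}

    for p in "RLB":
        # Eq. (16): sample each basis column from its gamma posterior.
        for k in range(W[p].shape[1]):
            frames = eta[p] == k                      # frames where basis k is active
            shape = a_w + (X[:, frames] * lam[p][:, frames]).sum(axis=1)
            rate = b_w + H[p][frames].sum()
            W[p][:, k] = rng.gamma(shape, 1.0 / rate)
        # Eq. (17): sample the activation of each frame from its gamma posterior.
        shape = a_h + (X * lam[p]).sum(axis=0)
        rate = b_h + W[p][:, eta[p]].sum(axis=0)
        H[p] = rng.gamma(shape, 1.0 / rate)
    return W, H

W, H = gibbs_update_W_H(X, W, H, eta)
```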

2.4. Musical Score Estimation

We obtain the most likely Z^• and U^• by the Viterbi algorithm and the final estimate of H by taking the mean of the posterior distribution in Eq. (17). Musical scores for the three guitar parts are estimated from these variables. The musical score for the lead or bass guitar can be obtained directly from the pitches Z^⋆ and the onset times U^⋆; note that a series of musical notes with the same pitch can be represented by self-transitions (e.g., z^L_{n-1} = z^L_n = C3). For the rhythm guitar part, on the other hand, we need to determine the actual onsets (attack times) within each region of a chord symbol in Z^R because the same chord can be struck multiple times. To obtain a natural rhythmic pattern within each measure, we perform template matching using a dictionary of rhythmic patterns (16-dimensional binary vectors) whose entries are obtained from a collection of musical pieces. We first obtain a tentative rhythmic pattern by detecting peaks in the activation H^R and then use the pattern in the dictionary that is closest in cosine distance for estimating the score.

The variables Z, U, and H are estimated as follows. First, we generate a sufficient number of samples of Z, U, W, and H by Gibbs sampling. Second, we fix the parameters W and H to the last samples, and the maximum-a-posteriori (MAP) estimates of Z and U are obtained by the Viterbi algorithm. Finally, the MAP estimate of H is obtained from the posterior distribution in Eq. (17).

3. EVALUATION

3.1. Experimental Setup

Ten songs by The Beatles in 4/4 time consisting of rhythm, lead, and bass guitars, vocals, and drums were used for evaluation. To extract the accompaniment sounds by suppressing the drum and vocal sounds, we used a harmonic/percussive source separation method [2] and a singing voice separation method [5]. The tatum-level chord label accuracy was measured for the rhythm guitar, and the tatum-level precision and recall rates and F-measure were measured for the lead and bass guitars. The chord vocabulary consisted of 24 labels (w^R_C, ..., w^R_Bm), and no-chord regions were not considered. Precision, recall, and F-measure were given by P = N_tp/N_sys, R = N_tp/N_ref, and F = 2RP/(R + P), where N_tp is the number of correctly estimated pitches, N_sys the number of detected pitches, and N_ref the number of ground-truth pitches.

Table 2. Performance of transcription for the three guitar parts.
Method | Part | P (%) | R (%) | F (%) | Chord (%)
Deep saliency [15] | Lead | 32.4 | 19.0 | 22.8 | –
Deep saliency [15] | Bass | 57.0 | 61.4 | 58.8 | –
Madmom [23] | Rhythm | – | – | – | 74.8
FHSMM-RND | Lead | 20.8 | 12.8 | 15.0 | –
FHSMM-RND | Bass | 34.3 | 36.6 | 35.3 | –
FHSMM-RND | Rhythm | – | – | – | 48.2
FHSMM-DSM | Lead | 34.7 | 19.8 | 24.1 | –
FHSMM-DSM | Bass | 57.7 | 62.2 | 60.0 | –
FHSMM-DSM | Rhythm | – | – | – | 74.8
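For concreteness, the tatum-level precision, recall, and F-measure defined above can be computed as in the following sketch, where the estimated and reference transcriptions are given as sets of (tatum index, pitch) pairs; this data representation is our assumption, not part of the paper's evaluation code.

```python
def tatum_level_prf(estimated, reference):
    """Compute precision, recall, and F-measure from sets of (tatum, pitch) pairs.

    `estimated` and `reference` are sets such as {(0, "E4"), (4, "G4"), ...};
    a pitch counts as correct only if it occurs at the same tatum in both.
    """
    n_tp = len(estimated & reference)        # correctly estimated pitches
    n_sys = len(estimated)                   # detected pitches
    n_ref = len(reference)                   # ground-truth pitches
    precision = n_tp / n_sys if n_sys else 0.0
    recall = n_tp / n_ref if n_ref else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Toy example: three of four estimated notes match the reference.
est = {(0, "E4"), (4, "G4"), (8, "A4"), (12, "B4")}
ref = {(0, "E4"), (4, "G4"), (8, "A4"), (12, "C5")}
print(tatum_level_prf(est, ref))  # (0.75, 0.75, 0.75)
```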

[Fig. 3. Musical scores of the lead and bass guitars estimated by the deep saliency method and by the proposed method initialized with the deep saliency method (FHSMM-DSM) for "Don't Bother Me." Green: correct pitch; blue: correct pitch class; red: insertion error; gray: deletion error.]

For comparison, we tested a state-of-the-art method for melody and bass line estimation based on convolutional neural networks (CNNs) [15] (called the deep saliency method) for transcribing the lead and bass guitar parts. For transcribing the rhythm guitar part, we tested a state-of-the-art chord estimation method based on CNNs (Madmom [23]). Since these methods estimate the guitar parts at the frame level, their results were quantized to the tatum level. To investigate the sensitivity of the proposed model to initialization, we compared random initialization (FHSMM-RND) with initialization using the results of the above methods (FHSMM-DSM).

The log-frequency magnitude spectrogram of a music signal was obtained by the constant-Q transform [24] with 96 bins per octave, a shifting interval of 11 ms, and a frequency range from 32.7 Hz (C1) to 8372.0 Hz (C9). The tatum times were estimated in advance by a beat tracking method [23]. The hyperparameters a^{wR} and b^{wR} of the prior on W^R were determined using the spectra of 24 low-position guitar chords. The hyperparameters a^{w⋆} and b^{w⋆} of the prior on W^⋆ were determined in a similar way using the spectra of the 45 pitches from E2 to C6 or the 28 pitches from E1 to G3. These reference spectra were synthesized from MIDI. The hyperparameters of the activations were set to a^{hR}_t = a^{h⋆}_t = 1 and b^{hR}_t = b^{h⋆}_t = 1. Uniform initial probabilities were used: π^L = 1/45, π^B = 1/28, and π^R = 1/24. The language model parameters φ and χ and the dictionary of rhythmic patterns for the rhythm guitar were learned in advance from 206 pieces of Japanese popular music by the maximum-likelihood method.
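Maximum-likelihood estimation of the transition parameters amounts to counting and normalizing transitions over the training scores. A minimal sketch follows, assuming the training data are given as lists of integer symbol indices per piece; the data format and the smoothing constant are our assumptions.

```python
import numpy as np

def ml_transition_matrix(sequences, n_states, smoothing=1e-6):
    """Estimate a transition matrix by counting bigrams over training sequences.

    `sequences` is a list of lists of integer state indices (e.g., chord indices
    0-23 or tatum positions 0-15 extracted from training scores). A tiny
    smoothing constant avoids zero probabilities for unseen transitions.
    """
    counts = np.full((n_states, n_states), smoothing)
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

# Toy usage: chord transitions (24 chords) and tatum transitions (16 positions).
chord_sequences = [[0, 9, 5, 7, 0], [0, 5, 7, 0]]      # e.g., C, Am, F, G, C, ...
onset_sequences = [[0, 4, 8, 12, 0], [0, 8, 0, 8]]
phi_R = ml_transition_matrix(chord_sequences, n_states=24)
chi_R = ml_transition_matrix(onset_sequences, n_states=16)
```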
3.2. Experimental Results

Table 2 shows the performance of multi-guitar transcription. Note that P, R, and F can never reach 100% because the lead and bass guitars were assumed to be monophonic here, whereas this assumption does not always hold in reality; e.g., the lead guitar sometimes plays chords. FHSMM-DSM consistently outperformed the deep saliency method in the transcription of the lead and bass guitars. As shown in Fig. 3, unnatural musical notes such as short insertion errors produced by the deep saliency method were significantly reduced by the proposed method. This shows the effectiveness of the language models, which induce musical naturalness in the estimated scores. FHSMM-RND, on the other hand, underperformed Madmom and the deep saliency method because the NMF-based acoustic model was sensitive to initialization. We found that the proposed joint guitar transcription can find more accurate scores than independent transcription methods if it is appropriately initialized. The chord estimation performance of FHSMM-DSM was almost the same as that of Madmom, probably because good local optima had already been found by Madmom.

Although we achieved an improvement in transcription accuracy, there is still much room for improving the performance of lead guitar transcription. First, it is necessary to deal with rests, because the lead guitar often rests during vocal parts. It is also necessary to relax the strong constraints that the lead guitar plays only a monophonic melody and that the rhythm guitar is described with a single set of bases. Another limitation is that the Markov model does not have sufficient capability of expressing the long-term dynamics of musical notes. We plan to extend a deep generative model such as a variational autoencoder (VAE) [25] or a generative adversarial network (GAN) [26] to deal with such time-series data.

4. CONCLUSION

We have described a statistical method for estimating the musical scores of lead, bass, and rhythm guitars from a polyphonic audio signal of typical band-style music. We proposed a unified Bayesian framework that integrates language and acoustic models of multi-instrument music based on a factorial hidden semi-Markov model, and we have shown that the unified model improves the accuracy and naturalness of the transcribed scores. Although we focused on band-style popular music in this paper, our framework can be applied to various kinds of music transcription involving multi-pitch detection and part assignment. In piano transcription, for example, more reasonable estimates could be obtained by jointly executing multi-pitch detection and the separation of right-hand and left-hand streams [27]. In the future, we plan to develop a system that generates a complete score consisting of vocal, guitar, and drum parts by combining the work described in this paper with automatic vocal (melody) and drum transcription methods.

5. REFERENCES

[1] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri, "Automatic music transcription: Challenges and future directions," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.
[2] Derry Fitzgerald, "Harmonic/percussive separation using median filtering," in Proceedings of the International Conference on Digital Audio Effects (DAFX), 2010, pp. 1–4.
[3] Carl Southall, Ryan Stables, and Jason Hockman, "Automatic drum transcription using bi-directional recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 591–597.
[4] Chih-Wei Wu and Alexander Lerch, "Drum transcription using partially fixed non-negative matrix factorization with template adaptation," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 257–263.
[5] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 745–751.
[6] Emilio Molina, Lorenzo J. Tardón, Ana M. Barbancho, and Isabel Barbancho, "SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 2, pp. 252–263, 2015.
[7] Ryo Nishikimi, Eita Nakamura, Masataka Goto, Katsutoshi Itoyama, and Kazuyoshi Yoshii, "Scale- and rhythm-aware musical note estimation for vocal F0 trajectories based on a semi-tatum-synchronous hierarchical hidden semi-Markov model," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 376–382.
[8] Mert Bay, Andreas F. Ehmann, James W. Beauchamp, Paris Smaragdis, and J. Stephen Downie, "Second fiddle is important too: Pitch tracking individual voices in polyphonic music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 319–324.
[9] Graham Grindlay and Daniel P. W. Ellis, "Transcribing multi-instrument polyphonic music with hierarchical eigeninstruments," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1159–1169, 2011.
[10] Emmanouil Benetos, Roland Badeau, Tillman Weyde, and Gaël Richard, "Template adaptation for improving automatic music transcription," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 175–180.
[11] Emmanuel Vincent and Xavier Rodet, "Music transcription with ISA and HMM," in Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA), 2004, pp. 1197–1204.
[12] Emmanuel Vincent, Nancy Bertin, and Roland Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[13] Paris Smaragdis and Judith C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003, pp. 177–180.
[14] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
[15] Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello, "Deep salience representations for F0 estimation in polyphonic music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 63–70.
[16] Rodrigo Schramm, Andrew McLeod, Mark Steedman, and Emmanouil Benetos, "Multi-pitch detection and voice assignment for a cappella recordings of multiple singers," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 552–559.
[17] Zoubin Ghahramani and Michael I. Jordan, "Factorial hidden Markov models," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 1996, pp. 472–478.
[18] Gautham J. Mysore and Maneesh Sahani, "Variational inference in non-negative factorial hidden Markov models for efficient audio source separation," in Proceedings of the International Conference on Machine Learning (ICML), 2012, pp. 1887–1894.
[19] Alexey Ozerov, Cédric Févotte, and Maurice Charbit, "Factorial scaled hidden Markov model for polyphonic audio representation and source separation," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2009, pp. 121–124.
[20] Christopher Raphael, "A hybrid graphical model for rhythmic parsing," Artificial Intelligence, vol. 137, pp. 217–238, 2002.
[21] Masatoshi Hamanaka, Masataka Goto, Hideki Asoh, and Nobuyuki Otsu, "A learning-based quantization: Unsupervised estimation of the model parameters," in Proceedings of the International Computer Music Conference (ICMC), 2003, pp. 369–372.
[22] Ali Taylan Cemgil, "Bayesian inference for nonnegative matrix factorisation models," Computational Intelligence and Neuroscience, no. 785152, pp. 1–17, 2009.
[23] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer, "Madmom: A new Python audio and music signal processing library," in Proceedings of the ACM International Conference on Multimedia, 2016, pp. 1174–1178.
[24] Christian Schörkhuber and Anssi Klapuri, "Constant-Q transform toolbox for music processing," in Proceedings of the Sound and Music Computing Conference (SMC), 2010, pp. 3–64.
[25] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[27] Eita Nakamura, Kazuyoshi Yoshii, and Shigeki Sagayama, "Rhythm transcription of polyphonic piano music based on merged-output HMM for multiple voices," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 794–806, 2017.
