Joint Transcription of Lead, Bass, and Rhythm Guitars Based on a Factorial Hidden Semi-Markov Model
Kentaro Shibata1  Ryo Nishikimi1  Satoru Fukayama2  Masataka Goto2  Eita Nakamura1  Katsutoshi Itoyama1  Kazuyoshi Yoshii1
1 Graduate School of Informatics, Kyoto University, Japan
2 National Institute of Advanced Industrial Science and Technology (AIST), Japan

ABSTRACT

This paper describes a statistical method for estimating musical scores for lead, bass, and rhythm guitars from polyphonic audio signals of typical band-style music. To perform multi-instrument transcription involving multi-pitch detection and part assignment, it is crucial to formulate a musical language model that represents the characteristics of each part in order to resolve the ambiguity of part assignment and estimate a musically natural score. We propose a factorial hidden semi-Markov model that consists of three language models corresponding to the three guitar parts (three latent chains) and an acoustic model of a mixture spectrogram (emission model). The language model for the rhythm guitar represents a homophonic sequence of musical notes (a chord sequence), and those for the lead and bass guitars represent monophonic sequences of musical notes in a higher and a lower frequency range, respectively. The acoustic model represents a spectrogram as a sum of low-rank spectrograms of the three guitar parts approximated by NMF. Given a spectrogram, we estimate the note sequences using Gibbs sampling. We show that our model outperforms a state-of-the-art multi-pitch detection method in the accuracy and naturalness of the transcribed scores.

Index Terms— Automatic music transcription, multi-pitch estimation, multi-instrument transcription, HSMM, and NMF.

[Fig. 1. A factorial hidden semi-Markov model (FHSMM) that represents the generative process of audio signals of the three guitar parts of band music.]

1. INTRODUCTION

Transcribing a musical score from an audio signal is a fundamental and challenging problem in music information processing [1]. From a practical viewpoint, it is important to develop a transcription method for band music to facilitate widely enjoyed cover performances. A typical band playing popular music (e.g., The Beatles) consists of vocal, several guitar, and drum parts. Transcription of the vocal and drum parts has been studied extensively, and high accuracies have been reported [2–7]. We therefore focus on the accompaniment parts, which are typically played by three guitars (lead, bass, and rhythm guitars).

Accompaniment-part transcription for band music is a challenging task because multi-pitch detection and part assignment of each note are both required. A major approach to multi-pitch detection is to use probabilistic latent component analysis (PLCA) or nonnegative matrix factorization (NMF), exploiting the sparseness and low-rankness of source spectrograms. For part assignment, these methods have been extended to use pre-learned spectral templates of multiple instruments [8–13]. More recently, end-to-end neural networks have been applied successfully to performing multi-pitch detection and part assignment simultaneously [14, 15].

Since most previous methods rely on the timbral characteristics of instruments, they struggle with band music in which multiple instruments have similar timbres (e.g., two guitars). Recently, attempts have been made to incorporate a music language model representing a musical grammar (e.g., the sequential dependency of chords and musical notes) to complement the acoustic model in such difficult situations. Although Sigtia et al. [14] proposed an end-to-end polyphonic transcription method with a recurrent neural network (RNN)-based language model, the language model's effect is essentially smoothing because it is defined at the frame level. As they noted, a beat-level language model is considered effective for obtaining well-formed transcriptions. Schramm et al. [16] combined a PLCA-based acoustic model with a hidden Markov model (HMM)-based musical language model (also defined at the frame level), where multiple vocal parts were treated independently.

In this paper, we propose a transcription method for the accompaniment part of band music based on the beat-level sequential characteristics of accompanying instruments.
We assume that three kinds of guitars—lead, bass, and rhythm—are used in band music and have different sequential characteristics, as listed in Table 1. Our method uses a factorial hidden semi-Markov model (FHSMM) [17–19] that consists of three language models (semi-Markov models) corresponding to the three guitar parts and an NMF-based acoustic model of a mixture spectrogram (Fig. 1). The language model for the rhythm guitar represents a homophonic sequence of musical notes (a chord sequence), while those for the lead and bass guitars represent monophonic sequences of musical notes in a higher or a lower frequency range, respectively. The acoustic model represents a spectrogram as a sum of the spectrograms of the three guitar parts, each of which is represented as a low-rank matrix obtained as the product of an activation vector and basis spectra. A key feature of the acoustic model is that only one of the basis spectra is allowed to be activated in each tatum in each guitar part. Given a mixture spectrogram as observed data, the basis spectra, activations, and latent chains are statistically estimated using Gibbs sampling. The musical score for each guitar is then obtained by using the Viterbi algorithm.

A major contribution of this study is to achieve joint transcription of multiple musical instruments that have similar timbral characteristics. More specifically, we propose an integrated language and acoustic model that can describe a beat-level symbolic musical grammar and the dependency between multiple instrument parts. We also show that the initialization problem of the NMF-based model can be alleviated with the support of a DNN-based model, which has a strong capability of expressing acoustic signals. The resulting improvement in accuracy is an important step towards developing a practical transcription method that can be used for popular music.

This work was supported in part by JST ACCEL No. JPMJAC1602, JSPS KAKENHI No. 16H01744 and No. 16J05486, and the Kyoto University Foundation.

Table 1. The characteristics of the three guitar parts

Part    Sequence     Frequency range   Rhythm
Lead    Monophonic   Middle - High     Complex
Bass    Monophonic   Low               Complex
Rhythm  Homophonic   Wide              Simple

[Fig. 2. The generative process of spectrograms for the lead or bass guitar based on a conditional HSMM; "⋆" indicates lead (L) or bass (B).]
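To make the acoustic model's key constraint concrete (exactly one basis spectrum activated per tatum in each guitar part), the following NumPy sketch assembles a toy mixture spectrogram as a sum of three part spectrograms, each rank-1 within every tatum. All dimensions, variable names, and the random parameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

F, T = 64, 32               # frequency bins, time frames (toy sizes)
n_tatums = 8                # 16th-note tatums covered by the excerpt
frames_per_tatum = T // n_tatums

def part_spectrogram(n_bases):
    """One guitar part: per tatum, exactly one basis spectrum is
    activated, scaled by a single activation value (the key
    constraint of the acoustic model)."""
    W = rng.random((F, n_bases))                  # basis spectra
    H = rng.random(n_tatums)                      # one activation per tatum
    choice = rng.integers(0, n_bases, n_tatums)   # selected basis per tatum
    X = np.zeros((F, T))
    for t in range(n_tatums):
        frames = slice(t * frames_per_tatum, (t + 1) * frames_per_tatum)
        X[:, frames] = H[t] * W[:, [choice[t]]]   # rank-1 within the tatum
    return X

# Mixture spectrogram = sum of the rhythm, lead, and bass part spectrograms.
X_R, X_L, X_B = part_spectrogram(24), part_spectrogram(45), part_spectrogram(28)
X = X_R + X_L + X_B
```

In the actual method the bases, activations, and per-tatum selections are latent quantities inferred by Gibbs sampling; here they are simply drawn at random to show the structure of the generative process.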
2. PROPOSED METHOD

Our method jointly performs multi-pitch estimation and part assignment in a unified framework. We formulate a generative model of a music spectrogram and the corresponding musical score, and then solve the inverse problem; that is, given a music spectrogram, we estimate the score described as the latent variables of the model. The problem is defined as follows:

Input: the magnitude spectrogram of a target signal X ∈ R_+^{F×T} and the 16th-note-level tatum times
Output: the musical scores of the three guitar parts

Here, F is the number of frequency bins and T is the number of time frames. In this paper we assume that the time signature of the target signal is 4/4.

2.1. Probabilistic Factorial Modeling

We represent a music spectrogram X ∈ R_+^{F×T} as a sum of the spectrograms X^R, X^L, and X^B ∈ R_+^{F×T} of the rhythm, lead, and bass guitar parts:

    X ≈ X^R + X^L + X^B.                                                              (1)

In the following, "•" is used to represent "R" (rhythm guitar), "L" (lead guitar), or "B" (bass guitar), and "⋆" is used to represent "L" or "B".

The latent chain of the rhythm guitar is specified by a sequence of chord symbols Z^R = {z_1^R, ..., z_{N^R}^R} with relative onset positions U^R = {u_1^R, ..., u_{N^R}^R}, where N^R is the number of chords, z_n^R takes one of the 24 values of {C, ..., B} × {major, minor}, and u_n^R takes an integer from 0 to 15 indicating a position on the 16th-note-level tatum grid in a measure. Likewise, the latent chain of the lead or bass guitar's HSMM is specified by a sequence of pitches Z^⋆ = {z_1^⋆, ..., z_{N^⋆}^⋆} with relative onset positions U^⋆ = {u_1^⋆, ..., u_{N^⋆}^⋆}, where N^⋆ is the number of musical notes, z_n^L takes one of the 45 pitches of {E2, ..., C6}, z_n^B takes one of the 28 pitches of {E1, ..., G3}, and u_n^⋆ takes an integer from 0 to 15.

The sequence of chord symbols Z^R and the sequences of pitches Z^⋆ are represented by Markov models as follows:

    p(z_1^R | π^R) = π^R_{z_1^R},   p(z_n^R | z_{n-1}^R, ψ^R) = ψ^R_{z_{n-1}^R, z_n^R},   (2)
    p(z_1^⋆ | Z^R, U^R, π^⋆) = π^⋆_{z_1^R, z_1^⋆},                                        (3)
    p(z_n^⋆ | z_{n-1}^⋆, Z^R, U^R, ψ^⋆) = ψ^⋆_{z^R_{ρ^⋆(n)}, z^⋆_{n-1}, z^⋆_n},          (4)

where π^R_a is the initial probability of chord a, ψ^R_{a,b} is the transition probability from chord a to chord b, ρ^⋆(n) indicates the chord to which musical note z_n^⋆ belongs, π^⋆_{c,a} is the initial probability of pitch a given chord c, and ψ^⋆_{c,a,b} is the transition probability from pitch a to pitch b given chord c.
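Given the parameters π and ψ, the chord and pitch models of Eqs. (2)–(4) can be evaluated directly. The sketch below scores toy chord and pitch sequences under such a model; the uniform-random parameters, the index conventions, and the explicit note-to-chord map rho are illustrative assumptions (in the paper, ρ^⋆ is derived from the onset positions U^R and U^⋆, and the parameters are learned from data).

```python
import numpy as np

rng = np.random.default_rng(1)
N_CHORDS, N_PITCHES = 24, 45   # 12 roots x {major, minor}; lead-guitar range E2-C6

def normalize(a):
    return a / a.sum(axis=-1, keepdims=True)

# Toy stand-ins for the model parameters (learned from data in the paper).
pi_R  = normalize(rng.random(N_CHORDS))                           # initial chord prob.
psi_R = normalize(rng.random((N_CHORDS, N_CHORDS)))               # chord -> chord
pi_S  = normalize(rng.random((N_CHORDS, N_PITCHES)))              # first pitch given chord
psi_S = normalize(rng.random((N_CHORDS, N_PITCHES, N_PITCHES)))   # pitch -> pitch given chord

def log_p_chords(z_R):
    """Eq. (2): log p(Z^R) under the first-order chord Markov model."""
    lp = np.log(pi_R[z_R[0]])
    for a, b in zip(z_R[:-1], z_R[1:]):
        lp += np.log(psi_R[a, b])
    return lp

def log_p_pitches(z_S, z_R, rho):
    """Eqs. (3)-(4): log p(Z^* | Z^R); rho[n] gives the index in z_R of
    the chord that note n belongs to (the role of rho^*(n) in the text)."""
    lp = np.log(pi_S[z_R[rho[0]], z_S[0]])
    for n in range(1, len(z_S)):
        lp += np.log(psi_S[z_R[rho[n]], z_S[n - 1], z_S[n]])
    return lp

z_R = [0, 21, 0]        # three chords (indices into the 24 chord labels)
z_S = [12, 16, 19, 16]  # four melody pitches (indices into the 45 pitches)
rho = [0, 0, 1, 2]      # which chord is active at each note
total = log_p_chords(z_R) + log_p_pitches(z_S, z_R, rho)
```

Conditioning the pitch transitions on the active chord is what lets the language model prefer melody notes that are consonant with the rhythm-guitar chord sequence.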