2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2015, New Paltz, NY

A SEPARATE AND RESTORE APPROACH TO SCORE-INFORMED MUSIC DECOMPOSITION

Christian Dittmar, Jonathan Driedger, Meinard Muller¨

International Audio Laboratories Erlangen∗, Am Wolfsmantel 33, 91058 Erlangen, Germany, {christian.dittmar;jonathan.driedger;meinard.mueller}@audiolabs-erlangen.de

ABSTRACT the decomposition technique. In this paper, we follow the paradigm of score-informed Non-Negative Matrix Factorization (NMF) to Our goal is to improve the perceptual quality of signal components separate the mixture magnitude spectrogram as described in [3, 4]. extracted in the context of music source separation. Specifically, This procedure involves soft masks used to derive the targeted com- we focus on decomposing polyphonic, mono-timbral piano record- ponent magnitude spectrograms by Wiener filtering. The soft masks ings into the sound events that correspond to the individual notes of are usually obtained from multiplying suitable NMF templates and the underlying composition. Our separation technique is based on activations. In contrast to that, we propose to refine the soft masks score-informed Non-Negative Matrix Factorization (NMF) that has on the basis of a -adapted dictionary of isolated tones. We re- been proposed in earlier works as an effective means to enforce a fer to this extension of the conventional method as restore approach musically meaningful decomposition of piano music. However, the and show that it can be beneficial for certain types of mixtures. method still has certain shortcomings for complex mixtures where The remainder of this paper is organized as follows: Section 2 pro- the tones strongly overlap in frequency and time. As the main con- vides a brief overview of related work, Section 3 reviews score- tribution of this paper, we propose a restoration stage based on re- informed NMF, Section 4 describes our proposed restore approach, fined Wiener filter masks to score-informed NMF. Our idea is to in- Section 5 discusses the experimental results and indicates directions troduce notewise soft masks created from a dictionary of perfectly for future work. isolated piano tones, which are then adapted to match the timbre of the target components. A basic experiment with mixtures of pi- ano tones shows improvements of our novel reconstruction method with regard to perceptually motivated separation quality metrics. A second experiment with more complex piano recordings shows that 2. RELATED WORK further investigations into the concept are necessary for real-world applicability. Music source separation using score information has first been in- Index Terms— Score-informed music processing, source sep- troduced in [5] and [6]. Related approaches used tensor factoriza- aration, music decomposition, signal reconstruction. tion [7] or synthesized music for component initialization [8]. An important starting point for our work is the procedure for score- informed music decomposition described in [3], where the authors 1. INTRODUCTION describe how to impose musically meaningful constraints on the components of a Non-Negative Matrix Factorization (NMF) with- The goal of music source separation is to decompose a music out the need for a dedicated component training. The same principle recording into its constituent signal components [1, 2]. We focus is extended to note-wise decomposition in [4]. Other authors de- on the special case of polyphonic, mono-timbral piano recordings, vised elaborate source-filter models to account for the time-varying where we aim to extract sound events that correspond to the com- spectral envelope of components with fixed pitch [9, 10]. In [11], a position’s individual notes. We assume that each sound event corre- semi-adaptive NMF variant, which allows to efficiently capture the sponding to a musical note can be characterized as a harmonic tone temporal evolution of component spectrograms, was proposed. with constant pitch as well as a sharp attack, a stable sustain, and a In [12], Ewert discussed the problems inherent to NMF decompo- release phase. Moreover, the mixture signal is assumed to be a lin- sition in case of overlapping partials of the targeted components. ear superposition of the isolated tones. This is, of course, a simpli- He proposed to use the activations and gains computed by a first fication since we neglect acoustic effects such as room impulse re- NMF stage to infer a time-frequency (TF) dependent weighting of sponses and resonances in the piano. With these the mixture magnitude spectrogram accounting for possible phase assumptions, the decomposition of piano music into isolated tones cancellations. He could show that this leads to more meaningful de- constitutes a limited, yet challenging source separation task, which compositions in a second NMF stage. The use of a weighted NMF may pave the way to more complex scenarios. had already been proposed by Virtanen in [13], where it was used as One of the challenges is to improve the perceptual quality of the sep- a means to fill gaps in the magnitude spectrogram that occurred due arated signals which may suffer from audible artifacts, depending to binary masking of predominant pitched signal components. In on the complexity of the music, the recording conditions, as well as [14], Cano et al. investigated the complex mutual influence of mag- ∗ nitude and phase on the quality of separated signals in source sep- This work has been supported by the German Research Foundation (DFG MU 2686/6-1). The International Audio Laboratories Erlangen aration. Cano proposed to soften the additivity constraint in source (AudioLabs) is a joint institution of the Friedrich-Alexander-Universitat¨ separation and suggested to use instrument specific resynthesis ap- Erlangen-Nurnberg¨ (FAU) and Fraunhofer IIS. proaches in [15]. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2015, New Paltz, NY

stood element-wise. Furthermore, J ∈ RK×M denotes an all-one matrix.

11840 3.2. Constraint Components via Score-Informed NMF 5920  2960 W(0) H(0) 1480 = Proper initialization of and is an effective means to con- 740 strain the degrees of freedom in the NMF iterations and enforces 392 Frequency (Hz) 0 1 2 3 4 convergence to a desired, musically meaningful solution. One pos- Time (Seconds) sibility is to impose constraints derived from a time-aligned, sym- 11840 G4 (392 Hz) A4 (440 Hz) bolic representation (i.e., machine readable score) of the recording 5920 ଵ ଶ 2960 [7]. Three constraints can be obtained from the musical score. First, 1480 + + the rank R of the decomposition is chosen according to the num- (0) 740 ber of unique musical pitches. Second, each column of W is 392 initialized with a prototype harmonic series reflecting the 11840 5920 B4 (494 Hz) D5 (587 Hz) expected nature of a musical tone corresponding to the assigned ଷ ସ (0) 2960 musical pitch. Third, the rows of H are initialized as follows: 1480 + C ∈ RR×M 740 A binary constraint matrix s is constructed for each 392 s ∈ [1 : S], where Cs is 1 at entries that correspond to the pitch and 0 1 2 3 4 0 1 2 3 4 temporal position of the sth aligned note event and 0 otherwise. The (0) union (OR-sum) of all Cs is then used as initialization of H . With Figure 1: Artificial piano score used as illustrative example through- this initialization, each template obtained from iteration (1) typi- out the paper. Our target is the extraction of the isolated magnitude cally corresponds to an average spectrum (usually 1-normalized spectrograms corresponding to the individual notes G4, A4, B4, and [2]) of the corresponding musical pitch and each activation function D5 (ordered by pitch instead of onset time). obtained from (2) corresponds to the temporal amplitude envelope of all occurrences of that particular pitch throughout the recording. 3. SEPARATE 4. RESTORE In this section, we summarize the score-informed NMF approach as described in [3, 4]. In our signal model, we assume that the given Score-informed NMF as described in Section 3.2 yields a decom- V W(L) piano recording x is a linear mixture of notewise audio events xs, position of into musically meaningful templates and acti- (L) s ∈ [1 : S], where S ∈ N is the number of musical notes speci- vations H . In the following, we discuss the issues inherent to the V fied in the musical score (see our example score in Figure 1) such restoration of our targeted s from the components and introduce that x := s xs. Let X (m, k), m, k ∈ Z, be a complex-valued our extension to the conventional procedure. TF coefficient at the mth time frame and kth frequency bin of the Short-Time Fourier Transform (STFT) of our mixture signal x. Let 4.1. Component Magnitude Spectrogram Reconstruction T K×M V:=|X | ∈ R≥0 be a transposed version of the mixture sig- nal’s magnitude spectrogram. Our objective is to decompose V into As shown on the left hand side of Figure 2(a), we can use the component magnitude spectrograms Vs that correspond to the indi- results of score-informed NMF to reconstruct magnitude spec- VNMF ≈ vidual note events xs. Ignoring possible phase issues, we assume trograms corresponding to individual note objects as s  (L) (L) that the additive relationship V:= s Vs is fulfilled (see Figure 1). W H  Cs . In the conceptual illustration, the template and activation corresponding to the sth tone are indicated by a 3.1. Music Decomposition via NMF hatched column and row, respectively. The binary activation in the V corresponding Cs is visualized by a black box inside the hatched NMF can be used to decompose the magnitude spectrogram VNMF into spectral basis functions (also called templates) encoded by the row. In order to obtain a time-domain signals from s ,itis K×R columns of W ∈ R and time-varying gains (also called acti- common practice to use the mixture phase information of the origi- ≥0 nal STFT X and to invert the resulting modified STFTs via the sig- H ∈ RR×M V ≈ WH vations) encoded by the rows of ≥0 such that . nal reconstruction method from [17]. However, NMF-based mod- (0) NMF typically starts with a suitable initialization of matrices W els typically yield only a rough approximation of the original mag- (0) and H . Subsequently, these matrices are iteratively updated to nitude spectrogram, where spectral nuances may not be captured adapt to V with regard to a suitable distance measure. In this work, well. Therefore, the audio components reconstructed in this way we use the well-known update rules for minimizing the Kullback- may contain a number of audible artifacts. In order to better cap- Leibler Divergence [16] given by ture the temporal evolution of the spectral nuances, it is common T practice to calculate soft masks that can be interpreted as a weight- V H() th (+1) () () () ing matrix reflecting the contribution of the s tone to the original W =W  W H (1) JH()T mixture V. The mask corresponding to the desired note event can T NMF NMF (L) (L) W(+1) V be computed as Ms := Vs  W H + , where  (+1) () (+1) () H =H  W H (2) W()TJ denotes element-wise division and is a small positive constant to avoid division by zero. We obtain the masking-based estimate of Mask NMF for =0, 1, 2,...,L for some L ∈ N. The symbol  de- the component magnitude spectrogram as Vs := V  Ms . notes element-wise multiplication and the division is also under- This procedure is also often referred to as Wiener filtering. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2015, New Paltz, NY

୒୑୊ ௅ ௅ ୒୑୊ ୒୑୊ (௦ (a ௄ǡଵ   ٖ  ௦ ௦ ௦ 1 ڮ

Dictionary Spectrum ୈ୧ୡ୲ 0.8 ٖ ڮ a) ൎ ή ൌ) Dictionary Spectral Envelope______௦ 0.6 ܭൈܯ ܭൈܴ ܴൈܯ ܭൈܯ ܭൈܯ 0.4 େ୭୫ୠ ෩ୈ୧ୡ୲ ୒୑୊ ௦ ௦ ௄ǡଵ ௦ 0.2 ڮ

b) ൎ ٖ 0) 0 500 1000 1500 2000 2500 3000 3500 4000

ܭൈܯ ܭൈܯ ܭൈܯ (b) 1 Adapted Dictionary Spectrum 0.8 Figure 2: Conceptual illustration of (a) the procedures described in Target Spectral Envelope______୒୑୊

Relative Magnitude ௦ Section 4.1 and Section 4.2, and (b), the restore approach described 0.6 Target Spectrum

in Section 4.3. 0.4

0.2

0 4.2. Difficult Mixtures 0 500 1000 1500 2000 2500 3000 3500 4000 As discussed in [12], even the integration of score information Frequency (Hz) might not suffice to separate certain mixtures. This is especially true in the case of mutually overlapping harmonics and transients. Figure 3: (a): Spectrogram frame (bold black curve) located in the attack phase of the dictionary tone representing the note A4. Its Our artificial example exhibits some of these problems. The decay Dict of the two quarter notes (G4 and B4) is interfered by the attack tran- spectral envelope Es (thin black curve) is extracted via the true sient of the subsequent notes. The last two notes (A4 and D5) are envelope method [18]. (b): Due to an imperfect NMF decomposi- played simultaneously, so their attack transients overlap and some tion, a spurious peak is present around 600 Hz (marked by the red of their harmonics collide since they have very close center frequen- oval) in the target spectrum (blue curve). After an envelope transfer, the adapted dictionary spectrum follows nicely the target envelope cies due to the harmonic relationship between the two notes (fourth NMF interval). As can be seen in Figure 3(b) this leads to corrupted NMF Es , but does not contain the artifact. templates. The note s =2(A4) exhibits a spurious peak around 600 Hz in a spectrogram frame that lies within the attack phase of NMF Vs . The artifact is caused by “crosstalk” of the first partial of envelope can be applied to the complete tone spectrogram. The es- D5 whose center frequency is 587 Hz and leads to deteriorated sep- timate could e.g., be derived from an average spectrum. In our case, aration quality. For the following considerations, we introduce an we expect the target component to dominate over potential cross- K×M alternative representation Bs ∈ R≥0 of the targeted Vs defined talk components in a spectral frame located in the attack phase of by the tone. From that particular frame, we extract the spectral enve- NMF K×1 lope Es ∈ R using the so-called true envelope method as Vs(k, m) Bs(k, m):= (3) described by Robel¨ et al. in [18]. This procedure iteratively refines Gs(m) an estimate for the spectral envelope obtained by conventional cep- 1×M 1 stral liftering. Using the same method, we extract an estimate for where Gs ∈ R contains element-wise the -norm of each Dict the spectral envelope Es of the dictionary spectrogram. Then, spectrogram frame of the original Vs. On the right hand side of NMF NMF we introduce the timbre-adapted dictionary note spectrogram as Figure 2(a), we show Bs derived by application of (3) to Vs . NMF   Each spectrogram frame in Bs is depicted as a replicate of the ENMFJ 1 s V˜ Dict := VDict  s 1,M -normalized template corresponding to . This seemingly redun- s s ( +EDictJ ) (4) dant representation will become useful in Section 4.3, where we s 1,M replace the potentially imperfect NMF templates by timbre-adapted 1×M 1 where the division is understood element-wise and J1,M ∈ R and -normalized spectral templates taken from a dictionary. denotes an all-one matrix needed to replicate both spectral en- ˜ Dict velopes across all frames. Subsequently, we apply (3) to Vs in ˜ Dict 4.3. Component Restoration using a Note Dictionary order to obtain Bs that we now use to replace the potentially cor- BNMF Inspired by the work of [7] and [8], we construct a dictionary of rupted s as shown in Figure 2(b). This way, we obtain novel Dict crosstalk-free magnitude spectrograms Vs obtained from iso- estimates for the component spectrogram and corresponding soft lated piano tones. Via the score information, we select the appro- mask as   priate tone and place it in the TF domain according to the onset po- VComb := B˜ Dict  J · GNMF sition of the target note. However, the benefit of having an artifact- s s K,1 s (5) Dict  free Vs comes at the expense that the dictionary tone is likely to  Comb Comb Comb differ in timbre from the target tone in the mixture. In order to adapt Ms := Vs  + Vs (6) the timbral qualities of the dictionary without propagating potential s errors in the NMF decomposition, we propose to transfer the spec- K×1 tral envelope of the target tone to the corresponding dictionary tone. where JK,1 ∈ R denotes an all-one matrix needed to replicate Figure 3 illustrates this procedure for the note A4 (s =2) in our the corresponding note activation across all frequency bins. In short, NMF artificial example. Similar to [12], we take Vs as best estimate the sequence of (4),(5), and (6) allows us to combine the NMF- of the target component regardless of potential decomposition arti- based activations with 1-normalized dictionary spectra that have facts. Furthermore, we assume that a single estimate for the spectral been adapted to match the timbre captured in the NMF-based tem- 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2015, New Paltz, NY

100 plates. The resulting note component spectrogram is again obtained Proposed Method 75 Prop Comb Conventional Method by Wiener filtering as Vs := V  Ms . (a) 50 25 0 5. EXPERIMENTS -60 -40 -20 0 20 40 60 80 100 Proposed Method 75 We conducted two source separation experiments using wellknown Conventional Method (b) 50 quality metrics to assess the possible improvements achievable with 25 our proposed approach. First, we evaluated the separation of sim- 0 Overall Perceptual Score 0 1 2 3 4 5 6 7 8 9 10 11 plistic piano tone mixtures. Second, we tried to decompose more Piano Tone Intervals complex piano recordings into bass and treble component signals. Figure 4: The Overall Perceptual Score (OPS) [19, 20] computed 5.1. Dataset for separated piano tone mixtures with the conventional method (black bars) vs. the proposed method (gray bars). (a): OPS aver- Our first test set consisted of pair-wise piano tone combinations. We aged over interval classes. (b): OPS averaged over absolute, octave- assigned MIDI pitch P1 to the first tone in the mixture and defined wrapped interval classes. it to be the interfering signal. Consequently, we assigned MIDI pitch P2 to the second tone and defined it to be the target signal. Both P1,P2 were varied from MIDI pitch P =21(A0, 27.5 Hz) to P = 108 (C8, 4186 Hz) resulting in 7569 tone pairs (including shows that the restore approach surpasses the conventional Wiener unison intervals). The underlying single tone signals were recorded filtering approach in terms of OPS mostly for positive intervals, from a real piano, while the tone dictionary used for separation was i.e., where the MIDI pitch of the target is above the MIDI pitch synthesized using the Pianoteq1 physical modeling plugin. Each of the interferer. Interestingly, this trend can not be observed for the tone pair was treated as individual test item, i.e. only R =2com- SDR. Instead, the improvements are more evenly distributed and ponents were used. only decrease for very wide intervals regardless if they are positive The second test uses a subset of 11 MIDI files from the Saarland or negative. Figure 4(b) shows that the proposed restore approach Music Data (SMD2) collection. SMD contains MIDI files for vari- is approx. 9.5 OPS points ahead of the conventional approach if we ous classical piano pieces which were performed by students of the ignore octave information. The average SDR improvement in that Hochschule fur¨ Musik Saar on a Yamaha Disklavier. The Disklavier case amounts to approx. 0.7 dB. stores all key and pedal movements performed by the pianist in an We obtained very mixed results in our second experiment with re- interpreted MIDI file that is suitable for synthesizing piano perfor- alistic piano performances. On average, we achieved an OPS of mances with expressive dynamics and timing. Following the ex- 38.06 (SDR 10.16 dB) using conventional Wiener filtering, while perimental design in [12], we split the note events in each of the our proposed restore approach yielded a tiny OPS increase to 38.32 interpreted MIDI files into a bass (MIDI pitch P<60) and a treble (SDR 10.25 dB). Unfortunately, there is no consistent improvement set (MIDI pitch P ≥ 60). We again used Pianoteq to synthesize across the test items, roughly half of them exhibit lower quality met- the note sequences for the bass and treble set individually. The bass rics compared to the conventional approach. From inspection of tones were defined to be the interferer and the treble tones the target, selected examples, we believe that this might be related to inferior respectively. The superposition of both yielded our mixture signal. separation of tones with a long sustain phase. Since we transfer All test files had 44.1 kHz sampling rate, the STFT was com- the spectral envelope taken from the tone’s attack, the sustain phase puted with a blocksize of approx. 46.4 ms and a hopsize of ap- might diverge from the desired target over time. prox. 5.8 ms. The number of NMF iterations was set to L =30. Still, we see potential in further developing the principal concept of Mask For each test item, we used Vs as magnitude spectrogram rep- our separate and restore approach. One obvious possibility would Prop resenting the conventional approach and Vs as magnitude spec- be an interval-selective application of the proposed method. An- trogram representing proposed reconstruction. We employed the other possible direction is to investigate dedicated processing of PEASS Toolkit [19, 20] in order to evaluate the quality of the sep- percussive and harmonic NMF components to remedy some of the arated audio signals obtained from application of the conventional remaining problems related to unsatisfactory separation of complex and the proposed method. From the available metrics, we focused mixtures. Audio examples covering positive and negative separa- on the perceptually-motivated Overall Perceptual Score (OPS) and tion results are available online3. used the objective Source-Distortion Ratio (SDR) to complement the evaluation. 5.3. Conclusions and Future Work 5.2. Results and Discussion We presented a method for post-processing score-informed music From the first experiment, we derived two interval related evalu- decomposition by means of refined soft masks based on a dictio- ations. First, we aggregated the quality measures to the interval nary of timbre-adapted piano tone spectrograms. In simple tone ΔP1,2 := P2 − P1 between the two piano tones in each mixture mixtures, this step attenuates the mutual interference between com- and averaged the respective measurements. Second, we applied the ponents. In the future, we want to further enhance this concept and same result aggregation, this time mapping to 12 absolute, octave- ˜ investigate its applicability to other source separation tasks, such as agnostic interval classes ΔP1,2 := (ΔP1,2 mod 12). Figure 4(a) drum sound separation from drum loops. 1https://www.pianoteq.com/ 2http://resources.mpi-inf.mpg.de/SMD/SMD_ 3http://www.audiolabs-erlangen.de/resources/MIR/ MIDI-Audio-Piano-Music.html 2015-WASPAA-SeparateAndRestore/ 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2015, New Paltz, NY

6. REFERENCES [13] T. Virtanen, A. Mesaros, and M. Ryynanen,¨ “Combining pitch-based inference and non-negative spectrogram factor- [1] S. Ewert, B. Pardo, M. Muller,¨ and M. Plumbley, “Score- ization in separating vocals from polyphonic music.” in Pro- informed source separation for musical audio recordings,” ceedings of the ISCA Tutorial and Research Workshop on Sta- IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116– tistical And Perceptual Audition (SAPA), Brisbane, Australia, 124, April 2014. September 2008, pp. 17–22. [2] P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and [14] E. Cano, J. Abeßer, C. Dittmar, and G. Schuller, “Influence of semi-supervised separation of sounds from single-channel phase, magnitude and location of harmonic components in the mixtures,” in Proceedings of the International Conference perceived quality of extracted solo signals,” in Proceedings of on Independent Component Analysis and Signal Separation the Audio Engineering Society (AES) Conference on Semantic (ICA), London, UK, September 2007, pp. 414–421. Audio, Ilmenau, Germany, July 2011, pp. 247–252. [3] S. Ewert and M. Muller,¨ “Using score-informed constraints [15] E. Cano, C. Dittmar, and G. Schuller, “Re-thinking sound sep- for NMF-based source separation,” in Proceedings of the aration: Prior information and additivity constraint in sepa- IEEE International Conference on Acoustics, Speech, and Sig- raton algorithms,” in Proceedings of the International Con- nal Processing (ICASSP), Kobe, Japan, 2012, pp. 129–132. ference on Digital Audio Effects (DAFx), Maynooth, Ireland, [4] J. Driedger, H. Grohganz, T. Pratzlich,¨ S. Ewert, and 2013. M. Muller,¨ “Score-informed audio decomposition and appli- [16] D. D. Lee and H. S. Seung, “Algorithms for non-negative ma- cations,” in Proceedings of the ACM International Conference trix factorization,” in Proceedings of the Neural Information on Multimedia (ACM-MM), Barcelona, Spain, 2013, pp. 541– Processing Systems (NIPS), Denver, CO, USA, 2000, pp. 556– 544. 562. [5] B. P. J. Woodruff and R. B. Dannenberg, “Remixing stereo [17] D. W. Griffin and J. S. Lim, “Signal estimation from modified music with score-informed source separation,” in Proceed- short-time fourier transform,” IEEE Transactions on Acous- ings of the International Conference on Music Information tics, Speech and Signal Processing, vol. 32, no. 2, pp. 236– Retrieval (ISMIR), Victoria, Canada, October 2006, pp. 314– 243, April 1984. 319. [18] A. Robel¨ and X. Rodet, “Efficient spectral envelope estima- [6] J. W. Y. Li and D. L. Wang, “Monaural musical sound separa- tion and its application to pitch shifting and envelope preserva- tion using pitch and common amplitude modulation,” IEEE tion,” in Proceedings of the International Conference on Digi- Transactions on Audio, Speech, and Language Processing, tal Audio Effects (DAFx), Madrid, Spain, September 2005, pp. vol. 17, no. 7, pp. 1361–1371, 2009. 344–349. [7] U. S¸ims¸ekli and A. T. Cemgil, “Score guided musical source [19] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Sub- separation using generalized coupled tensor factorization,” in jective and Objective Quality Assessment of Audio Source Proceedings of the European Signal Processing Conference Separation,” IEEE Transactions on Audio, Speech, and Lan- (EUSIPCO). IEEE, 2012, pp. 2639–2643. guage Processing, vol. 19, no. 7, pp. 2046–2057, 2011. [8] J. Fritsch and M. D. Plumbley, “Score informed audio source [20] E. Vincent, “Improved perceptual metrics for the evaluation of separation using constrained nonnegative matrix factoriza- audio source separation,” in Proceedings of the International tion and score synthesis,” in Proceedings of IEEE Interna- Conference on Latent Variable Analysis and Signal Separa- tional Conference on Acoustics, Speech, and Signal Process- tion (LVA/ICA), Tel Aviv, Israel, March 2012, pp. 430–437. ing (ICASSP). IEEE, 2013, pp. 888–891. [9] R. Hennequin, R. Badeau, and B. David, “NMF with time– frequency activations to model nonstationary audio events,” IEEE Transactions on Audio, Speech, and Language Process- ing, vol. 19, no. 4, pp. 744–753, 2011. [10] J.-L. Durrieu, B. David, and G. Richard, “A Musically Mo- tivated Mid-Level Representation for Pitch Estimation and Musical Audio Source Separation,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1180–1191, 2011. [11] C. Dittmar and D. Gartner,¨ “Real-time transcription and sep- aration of drum recordings based on NMF decomposition,” in Proceedings of the International Conference on Digital Au- dio Effects (DAFx), Erlangen, Germany, September 2014, pp. 187–194. [12] S. Ewert, M. D. Plumbley, and M. Sandler, “Accounting for phase cancellations in non-negative matrix factorization us- ing weighted distances,” in Proceedings of IEEE Interna- tional Conference on Acoustics, Speech, and Signal Process- ing (ICASSP), Florence, Italy, 2014, pp. 7480–7484.