A Separate and Restore Approach to Score-Informed Music Decomposition

2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 18-21, 2015, New Paltz, NY

Christian Dittmar, Jonathan Driedger, Meinard Müller

International Audio Laboratories Erlangen*, Am Wolfsmantel 33, 91058 Erlangen, Germany
{christian.dittmar;jonathan.driedger;meinard.mueller}@audiolabs-erlangen.de

* This work has been supported by the German Research Foundation (DFG MU 2686/6-1). The International Audio Laboratories Erlangen (AudioLabs) is a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS.

ABSTRACT

Our goal is to improve the perceptual quality of signal components extracted in the context of music source separation. Specifically, we focus on decomposing polyphonic, mono-timbral piano recordings into the sound events that correspond to the individual notes of the underlying composition. Our separation technique is based on score-informed Non-Negative Matrix Factorization (NMF), which has been proposed in earlier works as an effective means to enforce a musically meaningful decomposition of piano music. However, the method still has certain shortcomings for complex mixtures where the tones strongly overlap in frequency and time. As the main contribution of this paper, we propose adding a restoration stage, based on refined Wiener filter masks, to score-informed NMF. Our idea is to introduce notewise soft masks created from a dictionary of perfectly isolated piano tones, which are then adapted to match the timbre of the target components. A basic experiment with mixtures of piano tones shows improvements of our novel reconstruction method with regard to perceptually motivated separation quality metrics. A second experiment with more complex piano recordings shows that further investigations into the concept are necessary for real-world applicability.

Index Terms: Score-informed music processing, source separation, music decomposition, signal reconstruction.

1. INTRODUCTION

The goal of music source separation is to decompose a music recording into its constituent signal components [1, 2]. We focus on the special case of polyphonic, mono-timbral piano recordings, where we aim to extract sound events that correspond to the composition's individual notes. We assume that each sound event corresponding to a musical note can be characterized as a harmonic tone with constant pitch as well as a sharp attack, a stable sustain, and a release phase. Moreover, the mixture signal is assumed to be a linear superposition of the isolated tones. This is, of course, a simplification since we neglect acoustic effects such as room impulse responses and sympathetic string resonances in the piano. With these assumptions, the decomposition of piano music into isolated tones constitutes a limited, yet challenging source separation task, which may pave the way to more complex scenarios.

One of the challenges is to improve the perceptual quality of the separated signals, which may suffer from audible artifacts, depending on the complexity of the music, the recording conditions, as well as the decomposition technique. In this paper, we follow the paradigm of score-informed Non-Negative Matrix Factorization (NMF) to separate the mixture magnitude spectrogram as described in [3, 4]. This procedure involves soft masks used to derive the targeted component magnitude spectrograms by Wiener filtering. The soft masks are usually obtained by multiplying suitable NMF templates and activations. In contrast to that, we propose to refine the soft masks on the basis of a timbre-adapted dictionary of isolated tones. We refer to this extension of the conventional method as the restore approach and show that it can be beneficial for certain types of mixtures.

The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work, Section 3 reviews score-informed NMF, Section 4 describes our proposed restore approach, and Section 5 discusses the experimental results and indicates directions for future work.

2. RELATED WORK

Music source separation using score information was first introduced in [5] and [6]. Related approaches used tensor factorization [7] or synthesized music for component initialization [8]. An important starting point for our work is the procedure for score-informed music decomposition described in [3], where the authors describe how to impose musically meaningful constraints on the components of a Non-Negative Matrix Factorization (NMF) without the need for dedicated component training. The same principle is extended to note-wise decomposition in [4]. Other authors devised elaborate source-filter models to account for the time-varying spectral envelope of components with fixed pitch [9, 10]. In [11], a semi-adaptive NMF variant that efficiently captures the temporal evolution of component spectrograms was proposed.

In [12], Ewert discussed the problems inherent to NMF decomposition in case of overlapping partials of the targeted components. He proposed to use the activations and gains computed by a first NMF stage to infer a time-frequency (TF) dependent weighting of the mixture magnitude spectrogram accounting for possible phase cancellations. He showed that this leads to more meaningful decompositions in a second NMF stage. The use of a weighted NMF had already been proposed by Virtanen in [13], where it was used as a means to fill gaps in the magnitude spectrogram that occurred due to binary masking of predominant pitched signal components. In [14], Cano et al. investigated the complex mutual influence of magnitude and phase on the quality of separated signals in source separation. Cano proposed to soften the additivity constraint in source separation and suggested using instrument-specific resynthesis approaches in [15].

[Figure 1 panels: mixture magnitude spectrogram shown as the sum of the four note spectrograms G4 (392 Hz), A4 (440 Hz), B4 (494 Hz), and D5 (587 Hz); axes: Time (Seconds), Frequency (Hz).]

Figure 1: Artificial piano score used as illustrative example throughout the paper. Our target is the extraction of the isolated magnitude spectrograms corresponding to the individual notes G4, A4, B4, and D5 (ordered by pitch instead of onset time).

3. SEPARATE

In this section, we summarize the score-informed NMF approach as described in [3, 4]. In our signal model, we assume that the given piano recording $x$ is a linear mixture of notewise audio events $x_s$, $s \in [1 : S]$, where $S \in \mathbb{N}$ is the number of musical notes specified in the musical score (see our example score in Figure 1) such that $x := \sum_s x_s$. Let $\mathcal{X}(m, k)$, $m, k \in \mathbb{Z}$, be a complex-valued TF coefficient at the $m$th time frame and $k$th frequency bin of the Short-Time Fourier Transform (STFT) of our mixture signal $x$. Let $\mathbf{V} := |\mathcal{X}|^{\mathsf{T}} \in \mathbb{R}_{\geq 0}^{K \times M}$ be a transposed version of the mixture signal's magnitude spectrogram. Our objective is to decompose $\mathbf{V}$ into component magnitude spectrograms $\mathbf{V}_s$ that correspond to the individual note events $x_s$. Ignoring possible phase issues, we assume that the additive relationship $\mathbf{V} := \sum_s \mathbf{V}_s$ is fulfilled (see Figure 1).

3.1. Music Decomposition via NMF

NMF can be used to decompose the magnitude spectrogram $\mathbf{V}$ into spectral basis functions (also called templates) encoded by the [...] understood element-wise. Furthermore, $\mathbf{J} \in \mathbb{R}^{K \times M}$ denotes an all-one matrix.

3.2. Constraint Components via Score-Informed NMF

Proper initialization of $\mathbf{W}^{(0)}$ and $\mathbf{H}^{(0)}$ is an effective means to constrain the degrees of freedom in the NMF iterations and enforces convergence to a desired, musically meaningful solution. One possibility is to impose constraints derived from a time-aligned, symbolic representation (i.e., machine readable score) of the recording [7]. Three constraints can be obtained from the musical score. First, the rank $R$ of the decomposition is chosen according to the number of unique musical pitches. Second, each column of $\mathbf{W}^{(0)}$ is initialized with a prototype harmonic overtone series reflecting the expected nature of a musical tone corresponding to the assigned musical pitch. Third, the rows of $\mathbf{H}^{(0)}$ are initialized as follows: a binary constraint matrix $\mathbf{C}_s \in \mathbb{R}^{R \times M}$ is constructed for each $s \in [1 : S]$, where $\mathbf{C}_s$ is 1 at entries that correspond to the pitch and temporal position of the $s$th aligned note event and 0 otherwise. The union (OR-sum) of all $\mathbf{C}_s$ is then used as initialization of $\mathbf{H}^{(0)}$. With this initialization, each template obtained from iteration (1) typically corresponds to an average spectrum (usually 1-normalized [2]) of the corresponding musical pitch, and each activation function obtained from (2) corresponds to the temporal amplitude envelope of all occurrences of that particular pitch throughout the recording.

4. RESTORE

Score-informed NMF as described in Section 3.2 yields a decomposition of $\mathbf{V}$ into musically meaningful templates $\mathbf{W}^{(L)}$ and activations $\mathbf{H}^{(L)}$. In the following, we discuss the issues inherent to the restoration of our targeted $\mathbf{V}_s$ from the components and introduce our extension to the conventional procedure.

4.1. Component Magnitude Spectrogram Reconstruction

As shown on the left hand side of Figure 2(a), we can use the results of score-informed NMF to reconstruct magnitude spectrograms corresponding to individual note objects as $\mathbf{V}_s^{\mathrm{NMF}} \approx \mathbf{W}^{(L)} \cdot (\mathbf{H}^{(L)} \odot \mathbf{C}_s)$, with $\odot$ denoting element-wise multiplication. In the conceptual illustration, the template and activation corresponding to the $s$th tone are indicated by a hatched column and row, respectively. The binary activation in the corresponding $\mathbf{C}_s$ is visualized by a black box inside the hatched row.
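
The score-informed initialization of Section 3.2 and the notewise reconstruction of Section 4.1 can be illustrated with a minimal NumPy sketch. Several parts of it are assumptions rather than the procedure of this paper: the update equations (1) and (2) are not reproduced in this excerpt, so the standard multiplicative updates for the Kullback-Leibler divergence are used in their place; the Gaussian-shaped harmonic prototypes, the column normalization to unit sum, the toy score, the constant `eps`, and all function and variable names (`harmonic_template`, `score_informed_nmf`, `notes`) are hypothetical. The final soft-masking step follows the conventional Wiener-style filtering mentioned in Section 1, not the proposed restore approach.

```python
import numpy as np

def harmonic_template(f0_hz, freqs_hz, n_partials=10, width_hz=40.0):
    """Prototype overtone series: one Gaussian bump per partial (assumed shape)."""
    template = np.zeros_like(freqs_hz)
    for p in range(1, n_partials + 1):
        template += np.exp(-0.5 * ((freqs_hz - p * f0_hz) / width_hz) ** 2) / p
    return template / template.sum()

def score_informed_nmf(V, W0, H0, n_iter=50, eps=1e-12):
    """Standard multiplicative KL-divergence updates (assumed stand-in for (1), (2)).

    Because the updates are multiplicative, zero entries of H0 remain zero,
    which is how the binary score constraints are enforced.
    """
    W, H = W0.copy(), H0.copy()
    J = np.ones_like(V)  # all-one matrix of the same size as V
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ J + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (J @ H.T + eps)
        # Rescale so that each template sums to one; the product W @ H is unchanged.
        col_sums = W.sum(axis=0, keepdims=True) + eps
        W /= col_sums
        H *= col_sums.T
    return W, H

# Hypothetical toy score: (pitch index, onset frame, offset frame) per note.
rng = np.random.default_rng(0)
freqs = np.linspace(0.0, 4000.0, 257)
pitches_hz = [392.0, 440.0]  # G4 and A4, so the rank R is 2
notes = [(0, 0, 20), (1, 10, 30), (0, 25, 40)]
n_frames = 40

# One binary constraint matrix C_s per note; their union initializes H.
C = []
for pitch_idx, onset, offset in notes:
    C_s = np.zeros((len(pitches_hz), n_frames))
    C_s[pitch_idx, onset:offset] = 1.0
    C.append(C_s)
H0 = np.clip(np.sum(C, axis=0), 0.0, 1.0)

# Synthetic mixture spectrogram built from narrower "true" templates.
W_true = np.stack([harmonic_template(f, freqs, width_hz=20.0) for f in pitches_hz], axis=1)
V = W_true @ (H0 * rng.uniform(0.5, 1.0, size=H0.shape))

W0 = np.stack([harmonic_template(f, freqs) for f in pitches_hz], axis=1)
W, H = score_informed_nmf(V, W0, H0)

# Notewise reconstruction V_s^NMF = W (H * C_s), followed by soft masking of V.
eps = 1e-12
V_nmf = [W @ (H * C_s) for C_s in C]
total = np.sum(V_nmf, axis=0)
masks = [V_s / (total + eps) for V_s in V_nmf]
V_notes = [mask * V for mask in masks]
```

In this sketch the two G4 notes share one template and one activation row, and only the binary matrices $\mathbf{C}_s$ tell them apart in time. The masked note spectrograms add up to the mixture wherever the model assigns energy, which mirrors the additivity assumption of Section 3.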
