DEEP LEARNING BASED SOURCE SEPARATION APPLIED TO CHOIR ENSEMBLES

Darius Petermann1, Pritish Chandna1, Helena Cuesta1, Jordi Bonada1, Emilia Gómez2,1
1 Music Technology Group, Universitat Pompeu Fabra, Barcelona
2 European Commission, Joint Research Centre, Seville
[email protected], {pritish.chandna, helena.cuesta, jordi.bonada, emilia.gomez}@upf.edu

ABSTRACT

Choral singing is a widely practiced form of ensemble singing wherein a group of people sing simultaneously in polyphonic harmony. The most commonly practiced setting for choir ensembles consists of four parts: Soprano, Alto, Tenor and Bass (SATB), each with its own range of fundamental frequencies (F0s). The task of source separation for this choral setting entails separating the SATB mixture into its constituent parts. Source separation for musical mixtures is well studied and many deep learning based methodologies have been proposed for it. However, most of the research has focused on a typical case which consists in separating vocal, percussion and bass sources from a mixture, each of which has a distinct spectral structure. In contrast, the simultaneous and harmonic nature of ensemble singing leads to high structural similarity and overlap between the spectral components of the sources in a choral mixture, making source separation for choral mixtures a harder task than the typical case. This, along with the lack of an appropriate consolidated dataset, has led to a dearth of research in the field so far. In this paper we first assess how well some of the recently developed methodologies for musical source separation perform for the case of SATB choirs. We then propose a novel domain-specific adaptation for conditioning the recently proposed U-Net architecture for musical source separation using the fundamental frequency contour of each of the singing groups, and demonstrate that our proposed approach surpasses results from domain-agnostic architectures.

© D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: D. Petermann, P. Chandna, H. Cuesta, J. Bonada, and E. Gómez, "Deep Learning Based Source Separation Applied To Choir Ensembles", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

1. INTRODUCTION

Choir music is a well-established and long-standing practice involving a body of singers performing together. Such ensembles are usually referred to as choirs and may perform with or without instrumental accompaniment. A choir ensemble is usually structured by grouping the voices into four different sections, each covering a different frequency range for the singers: Soprano (260 Hz–880 Hz), Alto (190 Hz–660 Hz), Tenor (145 Hz–440 Hz), and Bass (90 Hz–290 Hz) [1]. This type of structural setting is usually referred to as an SATB setting. Although different variants of this structure exist, the SATB is the most well documented, with several conservatories across Europe dedicated to the study and practice of the art form, highlighting its cultural significance. This setting will be the main focal point of our study.

The segregation of a mixture signal into its components is a well researched branch of signal processing, known as source separation. For polyphonic music recordings, this implies the isolation of the various instruments mixed together to form the whole. With applications such as music remixing, rearrangement, audio restoration, and full source extraction, its potential use in music is of great appeal. While the task remains similar regardless of the type of setting involved, the nature of the sources (e.g. speech, musical instrument, singing voice) and their relations may entail various challenges and, consequently, require different separation methodologies.

The most studied case of musical source separation focuses on pop/rock songs, which typically have three common sources: vocals, drums, and bass, along with other instrumental sources which are usually grouped together as "others". A large body of research [2–4] has been published in this field over the last few years, beginning with the consolidation of a common dataset for researchers to train and evaluate their models on. In 2016, DSD100 [5] was first introduced and made available to the public; it was later extended to MUSDB18 [6], which comprises 150 full-length music tracks for a total of approximately 10 hours of music. To this day, MUSDB18 represents the largest freely available dataset of its kind.
While source separation for the pop/rock case has come on leaps and bounds in the last few years, it remains largely unexplored for the SATB choir case, despite its cultural importance. This is partly due to the lack of a consolidated dataset, similar to MUSDB18, and partly due to the nature of the task itself. The sources to be separated in pop/rock have distinct spectral structures: the voice is a harmonic instrument with a distinct spectral shape, defined by a fundamental frequency, its harmonic partials, and formants. The bass element to be separated also has a harmonic structure, but lacks the formants found in the human voice and has a much lower fundamental frequency. The spectrum of a percussive instrument, in turn, is generally inharmonic, with energy usually spread across the spectrum. In contrast, the sources to be separated in an SATB choir all share a similar spectral structure, with a fundamental frequency, partials, and formants. This makes the task more challenging than its more studied counterpart. However, the distinct ranges of fundamental frequencies of the sources to be separated can be used to distinguish between them, a key aspect that we aim to explore in our study.

We build on top of some recently proposed Deep Neural Network (DNN) models to separate SATB monaural recordings into each of their respective singing groups and then propose a specific adaptation to one of the models. The rest of the paper is organized as follows: Section 2 presents and investigates some of the recently proposed high-performance deep learning based algorithms used for common musical source separation tasks, such as the U-Net [7] architecture and its waveform-based adaptation, Wave-U-Net [8]. Section 3 goes over the dataset curation carried out for this experiment. Section 4 presents our adaptation of the conditioned U-Net model described in [9], with a control mechanism conditioned on the input sources' fundamental frequency (F0). Section 5 defines the evaluation metrics and methodology used in this experiment. In Section 5.2 we evaluate and compare how existing models and our proposed adaptation perform on the task of source separation for SATB recordings, and we present and discuss the results. Section 6 finally concludes with a discussion around our experiment and provides comments on future research that we intend to carry out.
2. RELATED WORK

While source separation has remained relatively unexplored for the case of SATB choirs, a number of architectures have been proposed over the last few years for musical source separation in the pop/rock case. A comprehensive overview of all proposed models is beyond the scope of this study, but we provide a summary of some of the most pertinent models that we believe can easily be adapted to the case under study.

2.1 U-Net

The U-Net architecture [7], which was originally developed to process and segment biomedical images, inspired many subsequent audio-related adaptations due to its unprecedented performance. The original model includes an encoding path, which reduces the initial input into a latent representation (bottleneck), followed by a decoding path, which expands the channels' receptive field back into its original shape while concatenating the feature maps from the contracting path by means of skip connections.

One of the first papers to present a U-Net adaptation towards audio source separation was by Jansson et al. [10], who propose an architecture which specifically targets vocal separation performed on western commercial music (or pop music). The authors present an architecture directly derived from the original U-Net, which takes spectrogram representations of the sources as input and aims at predicting a soft mask for the targeted source (either vocal or instrumental). The predicted mask is then multiplied element-wise with the original mixture spectrogram in order to obtain the predicted isolated source. It is worth mentioning that for each of the given sources, a separate U-Net instance is trained in order to predict its respective mask. In the case of SATB mixtures, four U-Net instances are necessary in order to predict each of the four singing groups.
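To make the masking step concrete, the sketch below shows how a predicted soft mask could be applied to a mixture to recover one source. This is our own illustration rather than the implementation of [10]; the function name, the STFT parameters, and the use of librosa are assumptions.

import numpy as np
import librosa  # assumed available for the STFT/ISTFT

def apply_soft_mask(mixture, mask, n_fft=1024, hop_length=256):
    """Multiply a predicted soft mask (same shape as the magnitude
    spectrogram) with the mixture magnitude and resynthesise the
    source using the mixture phase."""
    stft = librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length)
    magnitude, phase = np.abs(stft), np.angle(stft)
    source_magnitude = mask * magnitude              # element-wise masking
    source_stft = source_magnitude * np.exp(1j * phase)
    return librosa.istft(source_stft, hop_length=hop_length)

For an SATB mixture, this step would be repeated with the mask predicted by each of the four U-Net instances.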

2.2 Conditioned-U-Net

Depending on the nature of the separation task, its underlying process can easily lead to scaling issues. The conditioned U-Net (C-U-Net) architecture, described in [9], aims at addressing this limitation by introducing a mechanism controlled by external data which governs a single U-Net instance. C-U-Net does not diverge much from the initial U-Net; as an alternative to the multiple instances of the model, each of which is specialized in isolating a specific source, C-U-Net proposes the insertion of feature-wise linear modulation (FiLM) layers [11], which apply an affine transform, defined by two scalars γ and β, across the architecture. This allows for the application of linear transformations to intermediate feature maps. These specialized layers preserve the shape of the original intermediate feature input while modifying the underlying mapping of the filters themselves.

FiLM(x) = γ(z) · x + β(z)    (1)

In eq. (1), x is the input of the FiLM layer, and γ and β are the parameters that scale and shift x based on external information z [9]. γ and β modulate the feature maps according to the input vector z, which describes the source to separate. The condition generator block depicted in Figure 1 is a neural network that embeds the one-hot encoded input z into the most suitable values to be used by the FiLM layer.

Figure 1: C-U-Net control mechanism with the vector z, a one-hot representation of the source to separate, which dictates the N sets of γ and β values to be used by the FiLM layer at each of the blocks of the encoding path, 1 to N.
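As an illustration of how such a modulation can be realized, the sketch below shows a minimal FiLM layer with a toy condition generator, assuming PyTorch; the class name and layer sizes are illustrative and do not reproduce the exact C-U-Net code.

import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Applies x -> gamma(z) * x + beta(z), where z is the external
    conditioning vector (e.g. a one-hot source label)."""
    def __init__(self, cond_dim, n_channels):
        super().__init__()
        # toy condition generator: one (gamma, beta) pair per channel
        self.gamma = nn.Linear(cond_dim, n_channels)
        self.beta = nn.Linear(cond_dim, n_channels)

    def forward(self, x, z):
        # x: feature maps of shape [batch, channels, freq, time]
        g = self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # broadcast over freq/time
        b = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return g * x + b  # the shape of x is preserved

In C-U-Net itself, the γ and β values come from the dedicated condition generator shown in Figure 1 rather than from a single linear layer.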

2.3 Wave-U-Net

In [8], the authors present a time-domain adaptation of the U-Net architecture, which performs the separation operation on the waveform. As the input is a one-dimensional signal, the feature maps are computed directly from the waveform samples through 1D convolution operations. Because Wave-U-Net takes raw waveforms as input, the initial U-Net model has to be adapted accordingly in order to accommodate the nature of the input. Consequently, the feature maps along both contracting and expanding paths are computed by means of one-dimensional convolution layers. Both paths contain twelve convolutional layers each, for a total of 24 layers. In the down-sampling path, the receptive field is halved after each layer while the number of feature maps is increased by 24 every time. Conversely, in the up-sampling path the time context is doubled after every convolutional layer while the number of feature maps is reduced, again by 24, after every layer. By this means, the receptive field and the number of channels of the original input signal are preserved at the output stage. Although Wave-U-Net has proven to deliver satisfying results on common musical source separation tasks, waveform-based architectures in general require more data than their spectrogram-based counterparts.

3. DATASET

The training data we have curated for this experiment are composed of the following two datasets:

• Choral Singing Dataset [12] (CSD): three songs performed by 16 singers from the Anton Bruckner Choir (SATB), available at https://zenodo.org/record/1286570
• A proprietary dataset with 26 Spanish SATB songs by 4 singers, one for each part.

There are very few publicly available choir music datasets, so our choice remains limited. To this day and to the best of our knowledge, there is no existing dataset which is specifically suited to our task; thus, part of the subsidiary work of this experiment revolves around curating a proper and complete dataset to train our various models.

For our experiment, we take advantage of the CSD [12]. This dataset was recorded in a professional studio and contains individual tracks for each of the 16 singers of an SATB choir, i.e. 4 singers per choir section. It comprises three different choral pieces: Locus Iste, written by Anton Bruckner, Niño Dios d'Amor Herido, written by Francisco Guerrero, and El Rossinyol, a Catalan popular song; all of them were written for 4-part mixed choir. The dataset is very well suited for our experiment, as the isolated track for each individual singer allows us to proceed the same way as in [13], that is, to create artificial mixes by combining stems from different groups. Using different combinations of all 16 singers, we created 256 SATB quartets for each piece, which represent all possible combinations of singers under the restriction that exactly one singer per voice is needed.

The second dataset we use is a proprietary one including 26 songs with exactly one singer per part (i.e. 4 stems per song), which is a well-suited format for our task as well. All songs offered as part of this dataset are performed in Spanish and their lengths revolve around two to three minutes, for a total of 58 minutes of audio data.

Our curation work consists in consolidating these two datasets and making sure that all the data we use remain consistent and well-formatted across all the audio stems, which includes length and amplitude normalization as well as standardization of file properties. Most of the files included as part of the initial datasets were presented as 10-second snippets rather than full-length songs, which is not an ideal format to work with. Hence, some additional effort has been devoted to turning these files into a more convenient and consolidated format.
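For illustration, the combinatorial step described above can be sketched as follows; the singer identifiers and data layout are hypothetical, and each resulting quartet mixture would simply be the sum of the four selected stems.

import itertools

# hypothetical singer identifiers: 4 singers per SATB part, as in the CSD
singers = {
    "soprano": ["S1", "S2", "S3", "S4"],
    "alto":    ["A1", "A2", "A3", "A4"],
    "tenor":   ["T1", "T2", "T3", "T4"],
    "bass":    ["B1", "B2", "B3", "B4"],
}

# every combination with exactly one singer per voice: 4^4 = 256 quartets
quartets = list(itertools.product(*singers.values()))
print(len(quartets))  # 256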
4. APPROACH

Injecting domain knowledge into DNNs has been shown to be an effective way to learn complex input-output relations with high accuracy when the available data are scarce and limited [14], as is the case here. Each of the singing groups in an SATB recording performs within its own respective frequency range; that is, the voices' F0 contours will rarely overlap across the various groups. This factor makes the sources' F0 a suitable discriminative feature, which can be injected into the DNN during the training stage and potentially improve the separation of the various singing groups in SATB recordings. In this view, we propose to adapt the original C-U-Net architecture, which initially embeds the instrument to be separated, z, in order to produce the various FiLM parameters (γ, β), and to use the F0 track of the target source as the external control input instead. The new control input vector z will thus hold both time and frequency dimensions.

4.1 Control Input Representation

As previously mentioned, we use the frame-wise F0 as the external control data of our condition generator network. This entails a few preliminary steps prior to the training stage. We first automatically extract the sources' F0 tracks using the DIO algorithm [15]. Once the raw pitch tracks are obtained, we convert each time-step's F0 into a one-hot encoded representation as proposed in [16], that is, 60 frequency bins per octave over 6 musical octaves, with a base frequency of 32.7 Hz, for a total of 360 frequency bins per time-step. As a result, for 128 time-steps, our control input has the shape [128, 360].
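A minimal sketch of this encoding step is given below (our own illustration, assuming NumPy); the function name and the handling of unvoiced frames are assumptions.

import numpy as np

def f0_to_one_hot(f0_track, f_min=32.7, bins_per_octave=60, n_octaves=6):
    """Map a frame-wise F0 track (Hz) to a one-hot matrix of shape
    [time, 360]: 60 bins per octave over 6 octaves above 32.7 Hz."""
    n_bins = bins_per_octave * n_octaves              # 360 bins in total
    one_hot = np.zeros((len(f0_track), n_bins))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                                   # unvoiced frame: left all-zero
            continue
        bin_idx = int(round(bins_per_octave * np.log2(f0 / f_min)))
        if 0 <= bin_idx < n_bins:
            one_hot[t, bin_idx] = 1.0
    return one_hot

Stacking 128 such frames yields the [128, 360] control input described above.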

Figure 3: C-U-Net Control Mechanism adapted to our task, with the one-hot vector z depicting the various SATB singing groups’ F0 contour.

4.2 Control Model

The control model used in our proposed architecture embeds the one-hot encoded CQT F0 representation for a given time-step into a set of transforms of the same shape as the spectrogram input. This is achieved by modeling the condition vectors as 1-D data with multiple feature channels. The condition vectors are then fed into a convolutional neural network (CNN) with a kernel of size 10, seizing contextual information from the adjacent time-steps. As a result, all input channels of the initial convolution contribute to all resulting feature maps in the output of the first convolutional layer. Finally, a dense layer provides a specific conditioning for each frequency bin at each time-step, taking into account the contextual information previously captured by the CNN. Figure 2 shows the condition generator architecture in greater detail.

Figure 2: Control model architecture. The convolution is performed across the frequency bins for each time-step. The dense layer provides a specific conditioning for each frequency bin.

As the temporal relation between the external control input data and the input spectrogram is crucial, it is important to apply these affine transformations while the receptive field of the input is still intact. Hence the FiLM layer is applied prior to the encoding path. Figure 3 shows the overall structure behind the proposed conditioning architecture.

We propose two variants of the architecture described above, each of which differs slightly in the way it embeds the control input data. The first variant applies a unique affine transform to each individual frequency bin at every input time-step; the resulting scalars at the output of the external CNN model are thus of shape [512, 128] for a given input spectrogram of the same time context. The second variant applies a single transform to all frequency bins at a given time-step, resulting in a set of scalars of shape [1, 128] for 128 spectrogram frames given as input. We refer to the two approaches as "Domain-Specific Global" and "Domain-Specific Local", respectively, and denote them by the acronyms C-U-Net D-S G and C-U-Net D-S L in the rest of this paper, while the three models covered in Section 2 will be referred to as "Domain-Agnostic" models.

5. EVALUATION

To assess our proposed approach and show that injecting domain knowledge as control input data to the network improves its performance on SATB recordings, we evaluate the performance of three domain-agnostic state-of-the-art DNN models: U-Net, its waveform adaptation Wave-U-Net, and the original C-U-Net. We then compare the results with the two domain-specific models proposed in Subsection 4.2. We evaluate the performance of a model by computing three metrics, SDR, SIR, and SAR [5], between the predicted and true audio sources. The three metrics describe the overall quality of the separation, the level of interference with other sources, and the amount of artifacts added by the separation algorithm, respectively. The metrics are computed using the mir_eval toolbox [17] for each of the SATB singing groups.
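For reference, the snippet below illustrates how these metrics can be obtained with mir_eval's BSS-Eval implementation; the toy signals and part ordering are purely illustrative.

import numpy as np
import mir_eval

# toy stand-ins for the four true and four predicted SATB sources
rng = np.random.default_rng(0)
reference_sources = rng.standard_normal((4, 44100))   # [parts, samples]
estimated_sources = reference_sources + 0.1 * rng.standard_normal((4, 44100))

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
for part, d, i, a in zip(["Soprano", "Alto", "Tenor", "Bass"], sdr, sir, sar):
    print(f"{part}: SDR={d:.2f} dB  SIR={i:.2f} dB  SAR={a:.2f} dB")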
5.1 Train - Test Split

Given the limited size of our data, we opted to set apart one song from the proprietary dataset, as well as one singer per voice for each song from the CSD, in order to build our first use-case test set. The rest of the data was used for training. This allowed us to evaluate the model on unseen songs and singers. Our second test case contains unison singing, which was not seen at all during training. As such, we used the three songs from the CSD with all singers for evaluation.

5.2 Experiment Results

Subsections 5.2.1 and 5.2.2 present the performance results for the two test cases described earlier; that is, for the test set involving exactly one singer per part and the one involving exactly four singers per part, respectively. C-U-Net D-A refers to the domain-agnostic architecture, while C-U-Net D-S L and C-U-Net D-S G refer to our two domain-specific adaptations. For testing, we use the oracle fundamental frequency of each of the sources, precomputed prior to model inference. In a complete source separation pipeline, we would complement our system with the multi-pitch algorithm proposed in [18].

5.2.1 Use-Case 1:

Table 1 portrays the mean SIR and SAR results on all the SATB parts, and their average, for all five models mentioned in the previous sections. Figure 4a details the SDR score distributions on our first test set. We observe that our two adaptations, C-U-Net D-S L and C-U-Net D-S G, open a significant score gap between domain-agnostic and domain-specific models, with an average increase of about 1 dB SDR and 1.5 dB SIR between the two approaches. These improvements underline an overall better quality of the predicted sources (SDR) as well as a decline in interference between the various predictions. Our domain-specific architecture hence demonstrates a better ability to cope with the correlated nature of the various SATB sources and seems to predict an appropriate spectral mask for each of them. We also observe that our proposed adaptations return their lowest SDR, SIR and SAR performances for the Bass part specifically. This could be due to the fact that the Bass group, among all SATB groups, shares the highest number of harmonics with its other source counterparts.

We also note that the mean SDR drops substantially for the Tenor singing group across nearly all domain-agnostic models, reaching a negative result with the C-U-Net D-A architecture (−1.25 dB SDR). We speculate that the reason behind this decline is directly related to the close nature of the F0 contours of the Alto and Tenor singing groups, making it harder for domain-agnostic architectures to distinguish between the two sources. This limitation brings yet another justification for the conditioning approach taken in this paper.

5.2.2 Use-Case 2:

Figure 4b, as well as the bottom portion of Table 1, presents the SDR, SIR and SAR scores on the second use-case test set. We observe that the introduction of more complex mixtures involving a higher number of singers (i.e. 16 singers in this case) decreases the performance of our proposed models, with an average SDR barely surpassing U-Net's for our C-U-Net D-S G model and levelling out with it for the C-U-Net D-S L model. This can be attributed to the use of the mean of the various pitches present in a singing group to represent the pitch of the unison. Since domain-agnostic models, such as the plain U-Net, do not hold this assumption, these architectures are less prone to errors when exposed to this type of mixture setting. Audio examples showcasing how the inference of our proposed models compares against other state-of-the-art architectures, along with a link to our pre-trained models, are available online at https://darius522.github.io/satb-source-separation-results/

6. CONCLUSIONS AND FUTURE WORK

In this work we have presented the task of musical source separation applied to SATB choir recordings. We first described the consolidated dataset that we have specifically curated for this experiment and its potential use and application in future related research. We then assessed how well recent domain-agnostic deep learning based architectures for musical source separation perform on this task, given two different use-cases: 4-singer mixture and 16-singer mixture separation. An adaptation of the U-Net architecture was then proposed, consisting in conditioning some of the network parameters on the fundamental frequency contour of each of the SATB mixture sources. The preliminary results showed that taking advantage of domain knowledge during the training process improved the performance on both of our proposed use-cases. For the evaluation presented in this paper, the oracle F0 is used as the external control input data to the network.
In a complete source separation pipeline, we plan on combining the task of multi-pitch tracking [18] with the system presented in this paper. We also plan on validating our evaluation with perceptual listening tests and on exploring applications of SATB separation, including remixing, transcription and transposition, combined with the work presented in [19, 20].

7. ACKNOWLEDGEMENTS

The TITAN X used for this research was donated by the NVIDIA Corporation. This work is partially supported by the Towards Richer Online Music Public-domain Archives (TROMPA H2020 770376) project. Helena Cuesta is supported by the FI Predoctoral Grant from AGAUR (Generalitat de Catalunya). The authors would like to thank Rodrigo Schramm and Emmanouil Benetos for sharing their singing voice datasets for this research.

(a) Four-Parts SDR Boxplot Results, Use-Case 1: 4-Singers Mixture

(b) Four-Parts SDR Boxplot Results, Use-Case 2: 16-Singers Mixture

Figure 4: Boxplot SDR results on the five U-Net based models described in previous sections. Subfigure 4a shows the result distribution over the first use-case test set while 4b depicts the results for the second use-case. For each one of the SATB parts, the model performing with the highest median is indicated in a dark orange color.

Test Use-Case 1 - SIR (dB)
Model            Soprano      Alto         Tenor        Bass         Avg.
Wave-U-Net       5.99±2.4     9.19±2.9     4.62±2.1     8.49±3.5     7.07
U-Net            10.28±2.4    10.77±4.1    6.70±3.2     9.45±2.0     9.30
C-U-Net D-A      10.09±2.6    7.81±1.6     3.32±3.3     7.61±2.2     7.21
C-U-Net D-S L    9.71±1.7     12.37±1.5    9.89±2.2     9.71±1.7     10.42
C-U-Net D-S G    12.72±1.8    14.04±1.5    11.79±1.5    9.78±2.1     12.08

Test Use-Case 1 - SAR (dB)
Model            Soprano      Alto         Tenor        Bass         Avg.
Wave-U-Net       5.36±1.7     7.11±2.4     4.79±1.4     4.89±1.4     5.54
U-Net            5.35±1.8     7.13±3.0     5.32±1.8     4.94±1.1     5.69
C-U-Net D-A      5.19±1.6     4.41±2.8     2.65±1.2     4.12±2.0     4.09
C-U-Net D-S L    5.44±1.0     8.75±2.0     5.58±1.3     5.51±1.7     6.32
C-U-Net D-S G    7.02±1.1     9.02±1.6     6.86±1.5     5.93±1.6     7.21

Test Use-Case 2 - SIR (dB)
Model            Soprano      Alto         Tenor        Bass         Avg.
Wave-U-Net       8.13±2.1     10.02±0.9    6.80±2.2     7.45±2.0     8.10
U-Net            12.41±1.8    13.11±1.2    10.26±1.1    8.50±2.3     11.07
C-U-Net D-A      11.99±1.8    9.08±2.8     5.65±3.1     7.60±2.0     8.58
C-U-Net D-S L    10.32±1.1    13.06±1.7    10.77±1.5    8.89±2.2     10.76
C-U-Net D-S G    12.08±1.5    13.50±2.5    12.05±1.3    8.91±2.1     11.63

Test Use-Case 2 - SAR (dB)
Model            Soprano      Alto         Tenor        Bass         Avg.
Wave-U-Net       5.75±1.0     6.73±1.1     4.79±1.5     3.23±0.9     5.13
U-Net            6.31±1.4     7.97±1.0     6.59±1.8     5.27±1.1     6.54
C-U-Net D-A      5.77±1.7     4.60±2.9     3.39±1.8     4.15±1.2     4.48
C-U-Net D-S L    6.02±0.9     8.59±1.2     6.45±1.7     5.59±1.1     6.66
C-U-Net D-S G    6.68±1.2     7.70±1.3     6.17±1.6     5.17±0.8     6.43

Table 1: SIR and SAR mean and standard deviation results on the four SATB parts, as well as their average, for the five U-Net based models described in the previous sections. The top table depicts the results obtained from the first use-case test set, while the bottom one depicts those of the second use-case test set.

8. REFERENCES

[1] M. Scirea and J. A. Brown, "Evolving four part harmony using a multiple worlds model," in Proceedings of the 7th International Joint Conference on Computational Intelligence (IJCCI), vol. 1. IEEE, 2015, pp. 220–227.

[2] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural Audio Source Separation Using Deep Convolutional Neural Networks," in International Conference on Latent Variable Analysis and Signal Separation, 2017, pp. 258–266.

[3] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - A Reference Implementation for Music Source Separation," Journal of Open Source Software, 2019. [Online]. Available: https://doi.org/10.21105/joss.01667

[4] F. Lluís, J. Pons, and X. Serra, "End-to-End Music Source Separation: Is It Possible in the Waveform Domain?" in Proceedings of Interspeech 2019, September 2019. [Online]. Available: http://dx.doi.org/10.21437/interspeech.2019-1177

[5] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2017, pp. 323–332.

[6] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," December 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372

[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[8] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.

[9] G. Meseguer-Brocal and G. Peeters, "Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations," November 2019.

[10] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing Voice Separation with Deep U-Net Convolutional Networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.

[11] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, "FiLM: Visual Reasoning with a General Conditioning Layer," in Proceedings of the 32nd AAAI Conference on Artificial Intelligence, April 2018.

[12] H. Cuesta, E. Gómez, A. Martorell, and F. Loáiciga, "Analysis of Intonation in Unison Choir Singing," in Proceedings of the International Conference of Music Perception and Cognition (ICMPC), Graz, Austria, July 2018, pp. 125–130.

[13] H. Cuesta, E. Gómez, and P. Chandna, "A Framework for Multi-f0 Modeling in SATB Choir Recordings," in Proceedings of the Sound and Music Computing (SMC) Conference, Málaga, Spain, April 2019.

[14] M. Silvestri, M. Lombardi, and M. Milano, "Injecting domain knowledge in neural networks: a controlled experiment on a constrained problem," ArXiv, vol. abs/2002.10742, 2020.

[15] M. Morise, H. Kawahara, and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," in AES 35th International Conference on Audio for Games, February 2009. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=15165

[16] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, "Deep Salience Representations for F0 Estimation in Polyphonic Music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.

[17] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, "mir_eval: A Transparent Implementation of Common MIR Metrics," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014.

[18] H. Cuesta, B. McFee, and E. Gómez, "Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (Virtual), 2020.

[19] P. Chandna, H. Cuesta, and E. Gómez, "A Deep Learning Based Analysis-Synthesis Framework For Unison Singing," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (Virtual), 2020.

[20] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "Content Based Singing Voice Extraction From a Musical Mixture," in Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2020, pp. 781–785.