Neural Granular Sound Synthesis
arXiv:2008.01393v3 [cs.SD] 3 Jul 2021

Adrien Bitton (IRCAM, CNRS UMR9912, Paris, France), Philippe Esling (IRCAM, CNRS UMR9912, Paris, France), Tatsuya Harada (The University of Tokyo, RIKEN, Tokyo, Japan)
[email protected], [email protected], [email protected]

ABSTRACT

Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows. In order to control the synthesis, all grains in a given corpus are analyzed through a set of acoustic descriptors. This provides a representation reflecting some form of local similarities across the grains. However, the quality of this grain space is bound by that of the descriptors. Its traversal is not continuously invertible to signal and does not render any structured temporality. We demonstrate that generative neural networks can implement granular synthesis while alleviating most of its shortcomings. We efficiently replace its audio descriptor basis by a probabilistic latent space learned with a Variational Auto-Encoder. In this setting, the learned grain space is invertible, meaning that we can continuously synthesize sound when traversing its dimensions. It also implies that the original grains are not stored for synthesis. Another major advantage of our approach is to learn structured paths inside this latent space by training a higher-level temporal embedding over arranged grain sequences. The model can be applied to many types of libraries, including pitched notes or unpitched drums and environmental noises. We report experiments on the common granular synthesis processes as well as novel ones such as conditional sampling and morphing.

Figure 1. Left: A grain library is analysed and scattered (+) into the acoustic dimensions. A target is defined, by analysing another signal (o) or as a free trajectory, and matched to the library through the acoustic descriptors. Subsequently, grains are selected and arranged into a waveform. Right: The grain latent space can continuously synthesize waveform grains. Latent features can be encoded from an input signal, sampled from a structured temporal embedding or freely drawn. Explicit controls can be learned as target conditions for the decoder.

Copyright: ©2020 Adrien Bitton et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

The process of generating musical audio has seen a continuous expansion since the advent of digital systems. Audio synthesis methods relying on parametric models can be derived from physical considerations, spectral analysis (e.g. sinusoids plus noise [1] models) or signal processing operations (e.g. frequency modulation). As an alternative to those signal generation techniques, samplers provide synthesis mechanisms by relying on stored waveforms and sets of audio transformations. However, when tackling large audio sample libraries, these methods cannot scale and are also unable to aggregate a model over the whole data. Therefore, they cannot globally manipulate the audio features in the sound generation process. To this extent, corpus-based synthesis has been introduced by slicing sets of signals into shorter audio segments, which can be rearranged into new waveforms through a selection algorithm.

An instance of corpus-based synthesis, named granular sound synthesis [2], uses short waveform windows of a fixed length. These units (called grains) usually have a size ranging between 10 and 100 milliseconds. For a given corpus, the grains are extracted and can be analyzed through audio descriptors [3] in order to facilitate their manipulation. Such an analysis space provides a representation that reflects some form of local similarities across grains. The grain corpus is displayed as a cloud of points whose distances relate to some of their acoustic relationships. By relying on this space, resynthesis can be done with concatenative sound synthesis [4]. To a certain extent, this process can emulate the spectro-temporal dynamics of a given signal.

However, the perceptual quality of the audio similarities, assessed through predefined sets of acoustic descriptors, is inherently biased by their design. These only offer a limited consistency across many different sounds, within the corpus and with respect to other targets. Furthermore, it should be noted that the synthesis process can only use the original grains, precluding continuously invertible interpolations in this grain space.

To enhance the expressivity of granular synthesis, grain sequences should be drawn in more flexible ways, by understanding the temporal dynamics of trajectories in the acoustic descriptor space. However, current methods are restricted to random or simple hand-drawn paths. Traversals across the space map to grain series that are ordered according to the corresponding features. However, given that the grain space from current approaches is not invertible, these paths do not correspond to continuous audio synthesis, besides that of each of the scattered original grains. This could be alleviated by having a denser grain space (leading to a smoother assembled waveform), but it would require a correspondingly increasing amount of memory, quickly exceeding the gigabyte scale given the size of modern sound sample libraries. In a real-time setting, this imposes further limitations on a traditional granular synthesis space. As current methods only account for local relationships, they cannot generate the structured temporal dynamics of musical notes or drum hits without a strong inductive bias, such as a target signal. Finally, the audio descriptors and the slicing size of the grains are critical parameters to choose for these methods. They model the perceptual relationships across elements and set a trade-off: shorter grains allow for a denser space and faster sound variations, at the expense of a limited estimate of the spectral features and the need to process longer series for a given signal duration.

In this paper, we show that we can address most of the aforementioned shortcomings by drawing parallels between granular sound synthesis and probabilistic latent variable models. We develop a new neural granular synthesis technique that refines granular synthesis and is efficiently solved by generative neural networks (Figure 1). Through the repeated observation of grains, our proposed technique adaptively learns analysis dimensions without supervision, structuring a latent grain space which is continuously invertible to the signal domain. Such a space embeds the training dataset, which is no longer required in memory for generation, and allows novel grains to be generated continuously at any interpolated latent position. In a second step, this space serves as the basis for a higher-level temporal modeling, obtained by training a sequential embedding over contiguous series of grain features. As a result, we can sample latent paths with a consistent temporal structure, and moreover relieve some of the challenges of learning to generate raw waveforms. Its architecture is suited to optimizing local spectro-temporal features that are essential for audio quality, as well as longer-term dependencies that are efficiently extracted from grain-level sequences rather than from individual waveform samples. The trainable modules used are well-grounded in digital signal processing (DSP), and are thus interpretable and efficient for sound synthesis.

Latent variable models consider that observed examples x ∈ R^dx derive from a lower-dimensional space z ∈ R^dz (dz ≪ dx), acting as a higher-level representation generating any given example. The complete model is defined by the joint distribution p(x, z) = p(x|z) p(z). However, a real-world dataset follows a complex distribution that cannot be evaluated analytically. The idea of variational inference (VI) is to address this problem through optimization, by assuming a simpler distribution q_φ(z|x) ∈ Q from a family of approximate densities [5]. The goal of VI is to minimize the difference between the approximate and real distributions, by using their Kullback-Leibler (KL) divergence

    q*_φ(z|x) = argmin_{q_φ(z|x) ∈ Q} D_KL[ q_φ(z|x) ‖ p_θ(z|x) ].    (1)

By developing this divergence and re-arranging terms (the detailed development can be found in [5]), we obtain

    log p(x) − D_KL[ q_φ(z|x) ‖ p_θ(z|x) ] = E_z[ log p(x|z) ] − D_KL[ q_φ(z|x) ‖ p_θ(z) ].    (2)

This formulation of the Variational Auto-Encoder (VAE) relies on an encoder q_φ(z|x), which aims at minimizing the distance to the unknown conditional latent distribution. Under this assumption, the Evidence Lower Bound Objective (ELBO) is optimized by minimizing a β-weighted KL regularization over the latent distribution, added to the reconstruction cost of the decoder p_θ(x|z):

    L_{θ,φ} = −E_{q_φ(z|x)}[ log p_θ(x|z) ] + β · D_KL[ q_φ(z|x) ‖ p_θ(z) ],    (3)

where the first term is the reconstruction cost and the second is the regularization. The second term of this loss requires defining a prior distribution over the latent space, which, for ease of sampling and back-propagation, is chosen to be an isotropic Gaussian of unit variance p_θ(z) = N(0, I). Accordingly, a forward pass of the VAE consists in encoding a given data point q_φ : x → {µ(x), σ(x)} to obtain a mean µ(x) and variance σ(x). These allow us to obtain the latent z by sampling from the Gaussian, such that z ∼ N(µ(x), σ(x)).

The representation learned with a VAE has a smooth topology [6], since its encoder is regularized on a continuous density and intrinsically supports sampling within its unsupervised training process. Its latent dimensions can serve both for analysis when encoding new samples, or as generative variables that can continuously be decoded back to
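As an aside, the grain extraction and rearrangement described in the introduction can be made concrete with a minimal pure-Python sketch: a waveform is sliced into fixed-size, Hann-windowed grains, and a grain sequence is reassembled by overlap-add. The function names, list-based buffers, and the 50%-overlap periodic-Hann choice are illustrative assumptions, not the paper's implementation.

```python
import math

def slice_into_grains(signal, grain_size, hop):
    """Split a waveform (list of floats) into overlapping, Hann-windowed grains."""
    # Periodic Hann window: w[n] = 0.5 - 0.5*cos(2*pi*n/N), n = 0..N-1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / grain_size)
              for n in range(grain_size)]
    return [[signal[start + n] * window[n] for n in range(grain_size)]
            for start in range(0, len(signal) - grain_size + 1, hop)]

def overlap_add(grains, hop):
    """Rearrange a sequence of grains back into one waveform by overlap-add."""
    out = [0.0] * ((len(grains) - 1) * hop + len(grains[0]))
    for i, grain in enumerate(grains):
        for n, value in enumerate(grain):
            out[i * hop + n] += value
    return out
```

With a periodic Hann window and a hop of half the grain size, overlapped windows sum to one, so overlap-adding unmodified grains reconstructs the interior of the signal exactly; this is the constant-overlap-add property that makes concatenative resynthesis seamless.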
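The target-matching step of classical descriptor-driven granular synthesis (Figure 1, left) amounts to a nearest-neighbour search in descriptor space. A toy sketch follows, using two hand-crafted descriptors (RMS energy and zero-crossing rate) and hypothetical function names; real systems rely on much richer descriptor sets [3], which is precisely the design bias discussed above.

```python
import math

def descriptors(grain):
    """Toy acoustic descriptors for one grain: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in grain) / len(grain))
    zcr = sum(1 for a, b in zip(grain, grain[1:]) if a * b < 0) / (len(grain) - 1)
    return (rms, zcr)

def match_grains(target_grains, library_grains):
    """For each target grain, select the library grain closest in descriptor space."""
    library_features = [descriptors(g) for g in library_grains]
    selected = []
    for grain in target_grains:
        feats = descriptors(grain)
        # Squared Euclidean distance in the (rms, zcr) plane.
        best = min(range(len(library_features)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(feats, library_features[i])))
        selected.append(library_grains[best])
    return selected
```

Note that the output can only ever contain the original stored grains: interpolating between two selections has no waveform counterpart, which is the non-invertibility the proposed latent space removes.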
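For diagonal Gaussian posteriors, the VAE forward pass and the regularizer of Eq. (3) reduce to a reparameterized sample and a closed-form KL divergence. The following is a minimal sketch in plain Python with hypothetical function names; an actual model would produce µ(x) and σ(x) with a learned encoder network.

```python
import math
import random

def reparameterize(mu, sigma, rng=random):
    """Sample z ~ N(mu, diag(sigma^2)) via z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_unit_gaussian(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), as in the second term of Eq. (3)."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))
```

The β-weighted loss of Eq. (3) would add β times this KL term to the decoder's reconstruction cost; the reparameterization keeps the sampling step differentiable with respect to µ and σ for back-propagation.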