Neural Granular Sound Synthesis

Adrien Bitton, IRCAM, CNRS UMR9912, Paris, France, bitton@.fr
Philippe Esling, IRCAM, CNRS UMR9912, Paris, France, [email protected]
Tatsuya Harada, The University of Tokyo, RIKEN, Tokyo, Japan, [email protected]

ABSTRACT

Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows. In order to control the synthesis, all grains in a given corpus are analyzed through a set of acoustic descriptors. This provides a representation reflecting some form of local similarities across the grains. However, the quality of this grain space is bound by that of the descriptors. Its traversal is not continuously invertible to signal and does not render any structured temporality. We demonstrate that generative neural networks can implement granular synthesis while alleviating most of its shortcomings. We efficiently replace its audio descriptor basis by a probabilistic latent space learned with a Variational Auto-Encoder. In this setting the learned grain space is invertible, meaning that we can continuously synthesize sound when traversing its dimensions. It also implies that original grains are not stored for synthesis. Another major advantage of our approach is to learn structured paths inside this latent space by training a higher-level temporal embedding over arranged grain sequences. The model can be applied to many types of libraries, including pitched notes or unpitched drums and environmental noises. We report experiments on the common granular synthesis processes as well as novel ones such as conditional sampling and morphing.

Figure 1. Left: a grain library is analysed and scattered (+) into the acoustic dimensions. A target is defined, by analysing another signal (o) or as a free trajectory, and matched to the library through the acoustic descriptors. Subsequently, grains are selected and arranged into a waveform. Right: the grain latent space can continuously synthesize waveform grains. Latent features can be encoded from an input signal, sampled from a structured temporal embedding or freely drawn. Explicit controls can be learned as target conditions for the decoder.

Copyright: © 2020 Adrien Bitton et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

The process of generating musical audio has seen a continuous expansion since the advent of digital systems. Audio synthesis methods relying on parametric models can be derived from physical considerations, spectral analysis (e.g. sinusoids plus noise models [1]) or signal processing operations (e.g. modulation). As an alternative to those signal generation techniques, samplers provide synthesis mechanisms relying on stored waveforms and sets of audio transformations. However, when tackling large audio sample libraries, these methods cannot scale and are unable to aggregate a model over the whole data. Therefore, they cannot globally manipulate the audio features in the sound generation process. To this extent, corpus-based synthesis has been introduced by slicing sets of signals into shorter audio segments, which can be rearranged into new waveforms through a selection algorithm.

An instance of corpus-based synthesis, named granular sound synthesis [2], uses short waveform windows of a fixed length. These units (called grains) usually have a size ranging between 10 and 100 milliseconds. For a given corpus, the grains are extracted and can be analyzed through audio descriptors [3] in order to facilitate their manipulation. Such an analysis space provides a representation that reflects some form of local similarities across grains. The grain corpus is displayed as a cloud of points whose distances relate to some of their acoustic relationships. By relying on this space, resynthesis can be done with concatenative sound synthesis [4]. To a certain extent, this process can emulate the spectro-temporal dynamics of a given signal.

However, the perceptual quality of the audio similarities, assessed through predefined sets of acoustic descriptors, is inherently biased by their design. These only offer a limited consistency across many different sounds, within the corpus and with respect to other targets. Furthermore, it should be noted that the synthesis process can only use the original grains, precluding continuously invertible interpolations in this grain space.

To enhance the expressivity of granular synthesis, grain sequences should be drawn in more flexible ways, by understanding the temporal dynamics of trajectories in the acoustic descriptor space. However, current methods are restricted to performing random or simple hand-drawn paths. Traversals across the space map to grain series that are ordered according to the corresponding features. However, given that the grain space from current approaches is not invertible, these paths do not correspond to continuous audio synthesis, besides that of each of the scattered original grains. This could be alleviated by having a denser grain space (leading to a smoother assembled waveform), but it would require a correspondingly increasing amount of memory, quickly exceeding the gigabyte scale for today's sound sample library sizes. In a real-time setting, this causes further limitations in a traditional granular synthesis space. As current methods only account for local relationships, they cannot generate the structured temporal dynamics of musical notes or drum hits without a strong inductive bias, such as a target signal. Finally, the audio descriptors and the slicing size of grains are critical parameters to choose for these methods. They model the perceptual relationships across elements and set a trade-off: shorter grains allow for a denser space and faster sound variations, at the expense of a limited estimate of the spectral features and the need to process longer series for a given signal duration.
In this paper, we show that we can address most of the aforementioned shortcomings by drawing parallels between granular sound synthesis and probabilistic latent variable models. We develop a new neural granular synthesis technique that refines granular synthesis and is efficiently solved by generative neural networks (Figure 1). Through the repeated observation of grains, our proposed technique adaptively and unsupervisedly learns analysis dimensions, structuring a latent grain space which is continuously invertible to the signal domain. Such a space embeds the training dataset, which is no longer required in memory for generation. It allows novel grains to be generated continuously at any interpolated latent position. In a second step, this space serves as the basis for higher-level temporal modeling, by training a sequential embedding over contiguous series of grain features. As a result, we can sample latent paths with a consistent temporal structure, and moreover relieve some of the challenges of learning to generate raw waveforms. The architecture is suited to optimizing local spectro-temporal features that are essential for audio quality, as well as longer-term dependencies that are efficiently extracted from grain-level sequences rather than individual waveform samples. The trainable modules used are well-grounded in digital signal processing (DSP), thus interpretable and efficient for sound synthesis. By providing simple variations of the model, it can adapt to many audio domains as well as different user interactions. With this motivation, we report several experiments applying the creative potential of granular synthesis to neural waveform modeling: continuous free-synthesis with variable step size, one-shot sample generation with controllable attributes, analysis/resynthesis for audio style transfer and high-level interpolation between audio samples.

2. STATE OF THE ART

2.1 Generative neural networks

Generative models aim to understand a given set x ⊂ R^dx by modeling an underlying probability distribution p(x) of the data. To do so, we consider latent variables defined in a lower-dimensional space z ∈ R^dz (dz ≪ dx), as a higher-level representation generating any given example. The complete model is defined by p(x, z) = p(x|z)p(z). However, a real-world dataset follows a complex distribution that cannot be evaluated analytically. The idea of variational inference (VI) is to address this problem through optimization, by assuming a simpler distribution q_φ(z|x) ∈ Q from a family Q of approximate densities [5]. The goal of VI is to minimize the difference between the approximated and real distribution, by using their Kullback-Leibler (KL) divergence

    q_φ*(z|x) = argmin_{q_φ(z|x) ∈ Q} D_KL[ q_φ(z|x) ‖ p_θ(z|x) ].    (1)

By developing this divergence and re-arranging terms (the detailed development can be found in [5]), we obtain

    log p(x) − D_KL[ q_φ(z|x) ‖ p_θ(z|x) ] = E_z[ log p(x|z) ] − D_KL[ q_φ(z|x) ‖ p_θ(z) ].    (2)

This formulation of the Variational Auto-Encoder (VAE) relies on an encoder q_φ(z|x), which aims at minimizing the distance to the unknown conditional latent distribution. Under this assumption, the Evidence Lower Bound Objective (ELBO) is optimized by minimizing a β-weighted KL regularization over the latent distribution, added to the reconstruction cost of the decoder p_θ(x|z)

    L_θ,φ = − E_{q_φ(z)}[ log p_θ(x|z) ] + β · D_KL[ q_φ(z|x) ‖ p_θ(z) ],    (3)

where the first term is the reconstruction cost and the second the latent regularization. The second term of this loss requires the definition of a prior distribution over the latent space, which for ease of sampling and back-propagation is chosen to be an isotropic Gaussian of unit variance p_θ(z) = N(0, I). Accordingly, a forward pass of the VAE consists in encoding a given data point q_φ : x → {µ(x), σ(x)} to obtain a mean µ(x) and variance σ(x). These allow us to obtain the latent z by sampling from the Gaussian, such that z ∼ N(µ(x), σ(x)). The representation learned with a VAE has a smooth topology [6] since its encoder is regularized on a continuous density, and it intrinsically supports sampling within its unsupervised training process. Its latent dimensions can serve both for analysis when encoding new samples, and as generative variables that can continuously be decoded back to the target data domain. Furthermore, it has been shown [7] that it can be successfully applied to audio generation. Thus, it is the core of our neural model for granular synthesis of raw waveforms.
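As a minimal illustration of Eq. (3), the following PyTorch sketch shows the reparameterized sampling z = µ(x) + ε·σ(x) and a β-weighted objective for a diagonal Gaussian posterior against the unit-variance prior; the L2 reconstruction term is only a placeholder, since the model described later relies on spectral losses.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + eps * sigma with eps ~ N(0, I), keeps the sampling differentiable
    std = torch.exp(0.5 * logvar)
    return mu + torch.randn_like(std) * std

def kl_to_unit_gaussian(mu, logvar):
    # closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def beta_elbo_loss(x, x_hat, mu, logvar, beta=1.0):
    # eq. (3): reconstruction cost + beta-weighted KL regularization
    # (placeholder L2 reconstruction; the paper uses spectrogram criteria)
    recon = torch.mean((x - x_hat) ** 2, dim=-1)
    kl = kl_to_unit_gaussian(mu, logvar)
    return (recon + beta * kl).mean()
```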
2.2 Neural waveform generation

Applications of generative neural networks to raw audio data must face the challenge of modeling time series with very high sampling rates. Hence, the models must account both for local features ensuring the generated audio quality, as well as for longer-term relationships (consistent over tens of thousands of samples) in order to form meaningful signals. The first proposed approaches were based on auto-regressive models, which exploit the causal nature of audio. Given the whole waveform x = {x_1, ..., x_T}, these models decompose the joint distribution into a product of conditional distributions. Hence, each sample is generated conditionally on all previous ones

    p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t−1}).    (4)

Amongst these models, WaveNet [8] has been established as the reference solution for high-quality speech synthesis. It has also been successfully applied to musical audio with the NSynth dataset [9]. However, generating a signal in an auto-regressive manner is inherently slow since it iterates one sample at a time. Moreover, a large convolutional structure is needed in order to infer even a limited context of 100 ms. This results in heavy models, only adapted to large databases and requiring long training times.

Specifically for musical audio generation, the Symbol-to-Instrument Neural Generator (SING) proposes an overlap-add convolutional architecture [10] on top of which a sequential embedding S is trained on frame steps F_1...f, by conditioning over instrument, pitch and velocity classes (I, P, V). The model processes signal windows of 1024 points with a 75% overlap, thus reducing the temporal dimension by 256 before the forward pass of the up-sampling convolutional decoder D. Given an input signal with log-magnitude spectrogram l(x) = log(ε + |STFT[x]|²), the decoder outputs a reconstruction x̂ in order to optimize

    argmin_{D,S} ‖ l(x) − l(x̂) ‖_1    (5)

for x̂ = D(S(F, I, P, V)). This approach removes auto-regressive computation costs and offers meaningful controls, while achieving high-quality synthesis. However, given its specific architecture, it does not generalize to generative tasks other than sampling individual instrumental notes of fixed duration in pitched domains.

Recently, additional inductive biases arising from digital signal processing have allowed tighter constraints to be specified on model definitions, leading to high sound quality with lower training costs. In this spirit, the Neural Source-Filter (NSF) model [11] applies the idea of Spectral Modeling Synthesis (SMS) [1] to speech synthesis. Its input module receives acoustic features and computes conditioning information for the source and temporal filtering modules. In order to render both voiced and unvoiced sounds, sinusoidal and Gaussian noise excitations are fed into separate filter modules. Estimation of the noisy and harmonic components is further improved by relying on a multi-scale spectrogram reconstruction criterion.

Similar to NSF, but for pitched musical audio, the Differentiable Digital Signal Processing (DDSP) model [12] has been proposed. Compared to NSF, this architecture features a harmonic additive synthesizer that is summed with a subtractive noise synthesizer. Envelopes for the fundamental frequency and loudness, as well as latent features, are extracted from a waveform and fed into a recurrent decoder which controls both synthesizers. An alternative filter design is proposed by learning frequency-domain transfer functions of time-varying Finite Impulse Response (FIR) filters. Furthermore, the summed output is fed into a reverberation module that refines the acoustic quality of the signal. Although this process offers very promising results, it is restricted in the nature of signals that can be generated.
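For reference, the log-magnitude spectrogram distance of Eq. (5), which is also reused by the multi-scale objectives discussed later, could be implemented in PyTorch along these lines; the FFT size, hop and ε below are illustrative values rather than the exact settings of the cited works.

```python
import torch

def log_magnitude(x, n_fft=1024, hop=256, eps=1e-3):
    # l(x) = log(eps + |STFT[x]|^2), computed on a batch of waveforms
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return torch.log(eps + spec.abs() ** 2)

def spectral_l1(x, x_hat, **kwargs):
    # L1 distance between log-magnitude spectrograms, as in eq. (5)
    return torch.mean(torch.abs(log_magnitude(x, **kwargs)
                                - log_magnitude(x_hat, **kwargs)))
```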
3. NEURAL GRANULAR SOUND SYNTHESIS

In this paper, we propose a model that can learn both a local audio representation and modeling at multiple time scales, by introducing a neural version of granular sound synthesis [4]. The audio quality of short-term signal windows is ensured by efficient DSP modules optimized with a spectro-temporal criterion suited to both periodic and stochastic components. We structure the relative acoustic relationships in a latent grain space, by explicitly reconstructing waveforms through an overlap-add mechanism across audio grain sequences. This synthesis operation can model any type of spectrogram, while remaining interpretable. Our proposal allows for analysis prior to data-driven resynthesis and also performs continuous, variable-length free-synthesis trajectories. Taking advantage of this grain-level representation, we further train a higher-level sequence embedding to generate audio events with a meaningful temporal structure. In its least restrictive definition, our model allows for unconditional sampling, but it can be trained with additional independent controls (such as pitch or user classes) for more explicit interactions in composition and sound transfer. The complete architecture is depicted in Figure 2.

Figure 2. Overview of the neural granular sound synthesis model.

3.1 Latent grain space

Formally, we consider a set X of audio grains x_i ∈ R^dx extracted from the audio waveforms x of a given sound corpus, with fixed grain size dx. This set of grains follows an underlying probability density p(x_i) that we aim to approximate through a parametric distribution p_θ. This would allow the synthesis of consistent novel audio grains by sampling x̂_j ∼ p_θ(x_i). Since this likelihood is usually intractable, we tackle the problem by introducing a set of latent variables z ∈ R^dz (dz ≪ dx). This low-dimensional space is expected to represent the most salient features of the data, which might have led to generate a given example. In our case, it will efficiently replace the use of acoustic descriptors, by optimizing continuous generative features. This latent grain space is based on an encoder network that models q_φ(z_i|x_i), paired with a decoder network p_θ(x_i|z_i) allowing to recover x̂_i for every grain x_i ∈ X. We use the Variational Auto-Encoder [5] with a mean-field family and Gaussian prior to learn a smooth latent distribution p(z).

3.2 Latent path encoder

As we perform an overlap-add reconstruction, our model processes series of g grains s_x = {x_1, ..., x_g} extracted from a given waveform x. The down-sampling ratio between the waveform duration T and the number of grains g is given by the hop size separating neighboring grains. Each of these grains x_i is analyzed separately by the encoder in order to produce q_φ(z_i|x_i) = N(µ(x_i), σ(x_i)). Hence, the successive encoded grains form a corresponding series s_z = {z_1, ..., z_g} of latent coordinates such that

    z_i = µ(x_i) + ε · σ(x_i)    (6)

with ε ∼ N(0, I). The layers of the encoder are first strided residual convolutions that successively down-sample the input grains through temporal 1-dimensional filters. The output of these layers is then fed into several fully-connected linear layers that map to the Gaussian means and variances at the desired latent dimensionality dz.
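A minimal, non-residual version of such a grain encoder can be sketched in PyTorch as follows; the channel counts, kernel sizes and strides are illustrative choices, not the exact configuration used in the paper, while the grain size 2048 and latent dimension 96 match the settings reported in Section 4.1.

```python
import torch
import torch.nn as nn

class GrainEncoder(nn.Module):
    """Strided 1-d convolutions followed by linear heads for (mu, log-variance)."""
    def __init__(self, grain_size=2048, latent_dim=96):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        flat = 128 * (grain_size // 64)          # three stride-4 layers -> length / 64
        self.mu = nn.Linear(flat, latent_dim)
        self.logvar = nn.Linear(flat, latent_dim)

    def forward(self, grains):                   # grains: [batch, grain_size]
        h = self.convs(grains.unsqueeze(1)).flatten(1)
        return self.mu(h), self.logvar(h)

# one latent coordinate per grain, sampled as z_i = mu + eps * sigma (eq. 6)
enc = GrainEncoder()
mu, logvar = enc(torch.randn(8, 2048))
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```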
3.3 Spectral filtering decoder

Given a latent series s_z, the decoder must first synthesize each grain prior to the overlap-add operation. To that end, we introduce a filtering model that adapts the design of [12] to granular synthesis. Hence, each z_i is processed by a set of residual fully-connected layers that produces the frequency-domain coefficients H_i ∈ R^dh of a filtering module that transforms uniform noise excitations n_i ∼ U[−1,1]^dx into waveform grains. We replace the recurrence over envelope features proposed in [12] by performing separate forward passes over the overlapping grain features. Denoting the Discrete Fourier Transform DFT and its inverse iDFT, this amounts to computing

    X̂_i = H_i · DFT(n_i)    (7)
    x̂_i = iDFT(X̂_i).    (8)

Since the DFT of a real-valued signal is Hermitian, symmetry implies that for an even grain size dx, the network only filters the dh = dx/2 + 1 positive frequencies. These grains are then used in an overlap-add mechanism that produces the waveform, which is passed through a final learnable post-processing inspired from [13]. This module applies a multi-channel temporal convolution that learns a parallel set of time-invariant FIR filters and improves the audio quality of the assembled signal x̂.
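The filtering step of Eqs. (7)–(8) maps naturally onto real FFTs, which handle the Hermitian symmetry and the dh = dx/2 + 1 positive bins directly. The sketch below replaces the residual filter network by a random stand-in tensor and uses a simple Hann-windowed overlap-add, so it only illustrates the signal path, not the trained modules.

```python
import torch

def filter_noise(H, grain_size):
    # eq. (7)-(8): multiply the rFFT of uniform noise by the predicted
    # response H [n_grains, grain_size // 2 + 1], then invert to waveform grains
    noise = torch.rand(H.shape[0], grain_size, device=H.device) * 2 - 1
    spectrum = H * torch.fft.rfft(noise)
    return torch.fft.irfft(spectrum, n=grain_size)

def overlap_add(grains, hop):
    # assemble a series of grains [n_grains, grain_size] with a 75% overlap
    n_grains, grain_size = grains.shape
    out = torch.zeros(hop * (n_grains - 1) + grain_size)
    window = torch.hann_window(grain_size)
    for i in range(n_grains):
        out[i * hop: i * hop + grain_size] += window * grains[i]
    return out

grain_size, hop = 2048, 512
H = torch.rand(16, grain_size // 2 + 1)   # stand-in for the filter MLP output
waveform = overlap_add(filter_noise(H, grain_size), hop)
```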
3.4 Sequence trajectories embedding

As argued earlier, generative audio models need to sample audio events with a consistent long-term temporal structure. Our model provides this in an efficient manner, by learning a higher-level distribution of sequences s_ψ(s_z) that models temporal trajectories in the granular latent space s_z ∈ R^(dz×g). This allows the down-sampling of an intermediate frame-level representation to be used in order to learn longer-term relationships. This is achieved by training a temporal recurrent neural network on ordered sequences of grain features s_z. This process can be applied equivalently to any type of audio signal. As a result, our proposal can also synthesize and transfer meaningful temporal paths inside the latent grain space. It starts by sampling e ∈ R^de from the Gaussian e ∼ N(0, I), then sequentially decoding s_ψ(s_z|e) and finally generating the grains and the overlap-add waveform with p_θ(x̂|s_z).

3.5 Multi-scale training objective

To optimize the waveform reconstruction, we rely on a multi-scale spectrogram loss [11, 12], where STFTs are computed with increasing hop and window sizes, so that the temporal scale is down-sampled while the spectral accuracy is refined. We use both linear and log-frequency STFTs [14], on which we compare log-magnitudes l(x) = log(ε + |STFT[x]|²) with the L1 distance ‖·‖_1. In addition to fitting multiple resolutions of STFT_1...N, we can explicitly control the trade-off between low and high-energy components with the ε floor value [10]. In order to optimize a latent grain space, KL regularization and sampling (6) are performed for each latent point z_i, thus we extend the original VAE objective (3) as

    L_θ,φ = Σ_{n=1}^{N} ‖ l_n(x) − l_n(x̂) ‖_1 + β · Σ_{i=1}^{g} D_KL[ q_φ(z_i|x_i) ‖ p_θ(z) ],    (9)

where the first sum gathers the spectrogram reconstructions and the second the latent regularizations, N is the number of scales in the spectrogram loss and g is the number of grains processed in one sequence.
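Assembled in code, the objective of Eq. (9) could take the following form; the STFT resolutions and ε are illustrative, the log-frequency scales of [14] are omitted, and the KL term is the closed form for a diagonal Gaussian posterior against the unit prior.

```python
import torch

def log_mag(x, n_fft, eps=5e-3):
    # single-scale log-magnitude spectrogram l_n(x)
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=n_fft // 4, window=window,
                      return_complex=True)
    return torch.log(eps + spec.abs() ** 2)

def granular_vae_loss(x, x_hat, mu, logvar, beta,
                      n_ffts=(128, 256, 512, 1024, 2048)):
    # eq. (9): multi-scale spectral reconstruction + per-grain KL regularization
    recon = sum(torch.mean(torch.abs(log_mag(x, n) - log_mag(x_hat, n)))
                for n in n_ffts)
    # mu, logvar: [batch, n_grains, latent_dim]; KL summed over the grains of a sequence
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return recon + beta * kl.sum(dim=-1).mean()
```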

4. EXPERIMENTS

4.1 Datasets

In order to evaluate our model across a wide variety of sound domains, we train on the following datasets:

1. Studio-On-Line provides individual note recordings sampled at 22050 Hz, with labels (pitch, instrument, playing technique) for 12 orchestral instruments. The tessituras of Alto-Saxophone, Bassoon, Clarinet, Flute, Oboe, English-Horn, French-Horn, Trombone, Trumpet, Cello, Violin and Piano are on average played in 10 different extended techniques. The full set amounts to around 15000 notes [15].

2. 8 Drums gathers around 6000 one-shot recordings sampled at 16000 Hz in the Clap, Cowbell, Crash, Hat, Kick, Ride, Snare and Tom instrument classes (https://github.com/chrisdonahue/wavegan/tree/v1).

3. 10 animals contains around 3 minutes of recordings sampled at 22050 Hz for each of the Cat, Chirping Birds, Cow, Crow, Dog, Frog, Hen, Pig, Rooster and Sheep classes of the ESC-50 dataset (https://github.com/karolpiczak/ESC-50).

For datasets sampled at 22050 Hz, we use a grain size dx = 2048, which subsequently sets the filter size dh = 1025, and compute spectral losses for STFT window sizes [128, 256, 512, 1024, 2048]. For datasets sampled at 16000 Hz, dx = 1024 and the STFT window sizes range from 32 to 1024. Hop sizes for both grain series and STFTs are set with an overlap ratio of 75%. Log-magnitudes are computed with a floor value ε = 5e−3. Dimensions for the latent features are dz = 96 and de = 256.

4.2 Models

Since the datasets provide some labels, we train both unconditional models and variants with decoder conditioning. For instance, Studio-On-Line can be trained with control over pitch and/or instrument classes when using multiple instrument subsets. Otherwise, for a single instrument we can instead condition on its playing styles (such as Pizzicato or Tremolo for the violin). To do so, we concatenate one-hot encoded labels oh_class to the latent vectors at the input of the decoder. During generation we can explicitly set these target conditions, which provide independent controls over the considered sound attributes

    p_θ^cond. : (s_z, oh_class) → ŝ_x^cond. → x̂.    (10)
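The decoder conditioning of Eq. (10) amounts to concatenating a one-hot class code to every latent grain vector before decoding; a minimal sketch follows, in which the class count and the shapes are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def condition_latents(s_z, class_index, n_classes):
    # s_z: [batch, n_grains, latent_dim]; class_index: [batch] integer labels
    one_hot = F.one_hot(class_index, n_classes).float()          # [batch, n_classes]
    one_hot = one_hot.unsqueeze(1).expand(-1, s_z.shape[1], -1)  # repeat per grain
    return torch.cat([s_z, one_hot], dim=-1)                     # decoder input

s_z = torch.randn(4, 16, 96)                  # 16 grains, latent dimension 96
conditioned = condition_latents(s_z, torch.tensor([0, 2, 5, 7]), n_classes=8)
print(conditioned.shape)                      # torch.Size([4, 16, 104])
```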
4.3 Training

The model is trained according to Eq. (9). In the first epochs only the reconstruction is optimized, which amounts to β = 0. This regularization strength is then linearly increased to its target value during some warm-up epochs. The last epochs of training optimize the full objective at the target regularization strength, which is roughly fixed in order to balance the gradient magnitudes when individually back-propagating each term of the objective. The number of training iterations varies depending on the dataset; we use a minibatch size of 40 grain sequences, an initial learning rate of 2e−4 and the ADAM optimizer. In this setting, a model can be fitted within 10 hours on a single GPU, such as an Nvidia Titan V.

5. RESULTS

The model performance is first compared to some baseline auto-encoders in Table 1. To assess the generative qualities of the model, we provide audio samples of data reconstructions as well as examples of neural granular sound synthesis (https://adrienchaton.github.io/neural_granular_synthesis/). These are generations based on its common processes as well as on novel interactions enabled by our proposed neural architecture.

5.1 Baseline comparison

In the first place, the granular VAE could be implemented using a convolutional decoder that symmetrically reverts the latent mapping of the encoder we use. Strided down-sampling convolutions can be mirrored with transposed convolutions, or with up-sampling followed by convolutions. We refer to these baselines as VAE_tr and VAE_up, while our model with the spectral filtering decoder is VAE_fi, and VAE_fi+pp with the added learnable post-processing. We train these models on the Studio-On-Line dataset for the full orchestra in ordinario and the strings in all playing modes, as well as on the 8 Drums dataset, keeping all other hyper-parameters identical. We report their test set spectrogram reconstruction scores for the Root Mean Squared Error (RMSE), the Log-Spectral Distance (LSD) and their average time per training iteration. Each model was trained for about 10 hours. Accordingly, we can see that our proposal globally outperforms the convolutional decoder baselines, while training and generating fast. The latency of our model to synthesize 1 second of audio is about 19.7 ms on GPU and 25.0 ms on CPU.

                             VAE_tr   VAE_up   VAE_fi   VAE_fi+pp
  Studio-On-Line ordinario
    RMSE                      6.86     6.65     6.22     4.86
    LSD                       1.60     1.62     1.29     1.17
  Studio-On-Line strings
    RMSE                      5.68     5.78     5.29     4.07
    LSD                       1.39     1.43     1.19     1.05
  8 Drums
    RMSE                      3.85     4.39     2.65     2.79
    LSD                       0.94     0.66     0.52     0.52
  sec./iter                   2.32     2.87     0.54     0.58

Table 1. Report of the baseline model comparison; lower is better for each evaluation.

5.2 Common granular synthesis processes

The audio quality of the models trained on the different sound domains can be judged from data reconstructions. It gives a sense of the model performance at auto-encoding various types of sounds. This extends to generating new sounds by sampling latent sequences rather than encoding features from input sounds. For structured one-shot samples, such as musical notes and drum hits, latent sequences are generated from the higher-level sequence embedding. For use in composition (e.g. from a MIDI score), this sampling can be done with conditioning over user classes such as the target pitch and instrument (Eq. 10). Since the VAE learns a continuously invertible grain space, it can as well be explored with smooth interpolations that render free-synthesis trajectories. We map some multidimensional latent curves to overlap-add grain sequences, including linear interpolations between random samples from the latent Gaussian prior, circular paths and spirals. When repeating forward and backward traversals of a linear interpolation, or when looping a circular curve, we can modulate the steps between latent points non-uniformly in order to bring additional expressivity to the synthesis. Free-synthesis can be performed at variable lengths (in multiples of g) by concatenating several contiguous latent paths.
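Both generation modes of this section can be sketched in a few lines: sampling a structured event by unrolling a recurrent network from e ∼ N(0, I), and free synthesis by linearly interpolating between two latent grains drawn from the prior. The GRU below is only a stand-in for the actual sequence decoder s_ψ, and the resulting series would then be decoded and overlap-added as in Section 3.

```python
import torch
import torch.nn as nn

latent_dim, embed_dim, n_grains = 96, 256, 16

# (a) structured one-shot sampling: e ~ N(0, I) unrolled into a latent series s_z
gru = nn.GRU(input_size=embed_dim, hidden_size=latent_dim, batch_first=True)  # stand-in
e = torch.randn(1, embed_dim)
s_z, _ = gru(e.unsqueeze(1).repeat(1, n_grains, 1))      # [1, n_grains, latent_dim]

# (b) free synthesis: a linear interpolation between two prior samples
z_start, z_end = torch.randn(latent_dim), torch.randn(latent_dim)
alphas = torch.linspace(0.0, 1.0, n_grains).unsqueeze(-1)
path = (1 - alphas) * z_start + alphas * z_end           # [n_grains, latent_dim]
```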

5.3 Audio style and temporal manipulations

To perform data-driven resynthesis, a target sample is analyzed by the encoder. Its corresponding latent features are then decoded, thus emulating the target sound in the style of the learned grain space. A conditioning over multiple timbres (e.g. instrument classes) allows for a finer control over such audio transfer between multiple target styles. To perform resynthesis of audio samples longer than the grain series length g, we auto-encode several contiguous segments that are assembled with fade-out/fade-in overlaps. Since the model also learns a continuous temporal embedding, by interpolating this higher-level space we can generate successive latent series in the grain space that are decoded into signals with evolving temporal structures. We illustrate this feature in Figure 3.

Figure 3. An interpolation in the continuous temporal embedding generates series of latent grain features corresponding to waveforms with evolving temporal structure; here three drum sounds with an increasingly sustained envelope. The point e^α is set half-way between e^1 and e^2.

5.4 Real-time sound synthesis

With GPU support, for instance a sufficient dedicated laptop chip or external Thunderbolt hardware, the models can be run in real time. In order to apply trained models to these different generative tasks, we are currently working on prototype interfaces based on a Python OSC (https://pypi.org/project/python-osc/) server controlled from a MaxMsp (https://cycling74.com) patch, for instance a neural drum machine featuring a step-sequencer driving a model with sequential embedding and conditioning trained over the 8 Drums dataset classes (examples at https://adrienchaton.github.io/neural_granular_synthesis/).
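As an illustration of such an interface setup (not the authors' released patch), a minimal python-osc server could receive trigger messages from a MaxMsp step-sequencer and call a generation routine; the OSC address, port and the generation logic are hypothetical placeholders.

```python
from pythonosc import dispatcher, osc_server

def handle_trigger(address, *args):
    # e.g. "/generate <class_index>" sent from the MaxMsp step-sequencer
    class_index = int(args[0]) if args else 0
    print(f"received {address}, sampling drum class {class_index}")
    # here one would sample the sequence embedding, decode the grains
    # and send or write the resulting waveform back to the patch

osc_dispatcher = dispatcher.Dispatcher()
osc_dispatcher.map("/generate", handle_trigger)   # hypothetical OSC address
server = osc_server.BlockingOSCUDPServer(("127.0.0.1", 5005), osc_dispatcher)
server.serve_forever()
```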
6. CONCLUSIONS

We propose a novel method for raw waveform generation that implements concepts from granular sound synthesis and digital signal processing in a Variational Auto-Encoder. It adapts to a variety of sound domains and supports neural audio modeling at multiple temporal scales. The architecture components are interpretable with respect to their spectral reconstruction power. Such a VAE addresses some limitations of traditional techniques by learning a continuously invertible grain latent space and a hierarchical temporal embedding. Moreover, it enables multiple modes of generation derived from granular sound synthesis, as well as potential controls for composition purposes. By doing so, we hope to enrich the creative use of neural networks in the field of musical sound synthesis.

Acknowledgments

This JSPS International Research Fellow research was partially supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, as well as the ANR:17-CE38-0015-01 MAKIMOno project, the SSHRC:895-2018-1023 ACTOR Partnership, the Emergence(s) ACIDITEAM project from Ville de Paris and the ACIMO project of Sorbonne Université.

7. REFERENCES

[1] X. Serra and J. Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.

[2] C. Roads, "Automated Granular Synthesis of Sound," Computer Music Journal, vol. 2, p. 61, 1978.

[3] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, "The Timbre Toolbox: Extracting audio descriptors from musical signals," The Journal of the Acoustical Society of America, vol. 130, no. 5, pp. 2902–2916, 2011.

[4] D. Schwarz, G. Beller, B. Verbrugghe, and S. Britton, "Real-Time Corpus-Based Concatenative Synthesis with CataRT," in Proceedings of the International Conference on Digital Audio Effects, 2006.

[5] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," International Conference on Learning Representations, 2014.

[6] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," International Conference on Learning Representations, 2016.

[7] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Generative timbre spaces with variational audio synthesis," in Proceedings of the International Conference on Digital Audio Effects, 2018.

[8] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016.

[9] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, "Neural audio synthesis of musical notes with WaveNet autoencoders," International Conference on Machine Learning, vol. 70, pp. 1068–1077, 2017.

[10] A. Defossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, "SING: Symbol-to-Instrument Neural Generator," Advances in Neural Information Processing Systems, vol. 31, pp. 9041–9051, 2018.

[11] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter waveform models for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 402–415, 2019.

[12] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable Digital Signal Processing," International Conference on Learning Representations, 2020.

[13] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," International Conference on Learning Representations, 2019.

[14] K. W. Cheuk, K. Agres, and D. Herremans, "nnAudio: A PyTorch audio processing tool using 1D convolution neural networks," International Society for Music Information Retrieval, 2019.

[15] G. Ballet, R. Borghesi, P. Hoffmann, and F. Lévy, "Studio Online 3.0: An internet 'killer application' for remote access to IRCAM sounds and processing tools," Journée Informatique Musicale, 1999.