
DIFFERENTIABLE SIGNAL PROCESSING WITH BLACK-BOX AUDIO EFFECTS

Marco A. Martínez Ramírez∗, Oliver Wang, Paris Smaragdis, Nicholas J. Bryan

Adobe Research, USA        Centre for Digital Music, Queen Mary University of London, UK        University of Illinois at Urbana-Champaign, USA

∗This work was performed while interning at Adobe Research.

ABSTRACT

We present a data-driven approach to automate audio signal processing by incorporating stateful, third-party audio effects as layers within a deep neural network. We then train a deep encoder to analyze input audio and control effect parameters to perform the desired signal manipulation, requiring only input-target paired audio data as supervision. To train our network with non-differentiable black-box effects layers, we use a fast, parallel stochastic gradient approximation scheme within a standard auto differentiation graph, yielding efficient end-to-end backpropagation. We demonstrate the power of our approach with three separate automatic audio production applications: tube amplifier emulation, automatic removal of breaths and pops from voice recordings, and automatic music mastering. We validate our results with a subjective listening test, showing our approach not only can enable new automatic audio effects tasks, but can yield results comparable to a specialized, state-of-the-art commercial solution for music mastering.

Fig. 1. Our DeepAFx method consists of a deep encoder that analyzes audio and predicts the parameters of one or more black-box audio effects (Fx) to achieve a desired audio processing task. At training time, gradients for black-box audio effects are approximated via a stochastic gradient method.

Index Terms— audio effects, deep learning, differentiable signal processing, black-box optimization, gradient approximation.

1. INTRODUCTION

Audio effects (Fx) are ubiquitous and used to manipulate different sound characteristics such as loudness, dynamics, frequency, and timbre across a variety of media. Many Fx, however, can be difficult to use or are simply not powerful enough to achieve a desired task, motivating new work. Past methods to address this include audio effects circuit modeling [1], analytical methods [2], and intelligent audio effects that dynamically change their parameter settings by exploiting sound engineering best practices [3]. The most common approach for the latter is adaptive audio effects or signal processing systems based on the modeling and automation of traditional processors [4]. More recent deep learning methods for audio effects modeling and intelligent audio effects include 1) end-to-end direct transformation methods [5, 6, 7], where a neural proxy learns and applies the transformation of an audio effect target, 2) parameter estimators, where a deep neural network (DNN) predicts the parameter settings of an audio effect [8, 9, 10], and 3) differentiable digital signal processing (DDSP) [11, 12], where signal processing structures are implemented within a deep learning auto-differentiation framework and trained via backpropagation.

While these past methods are very promising, they are limited in several ways. First, direct transform approaches can require special, custom modeling strategies per effect (e.g. distortion), are often based on large and expensive networks, and/or use models with limited or no editable parameter control [5, 6]. Second, methods with parameter control commonly require expensive human-labeled data from experts (e.g. inputs, control parameters, and target outputs) [7]. They are also typically optimized to minimize parameter prediction error and not audio quality directly [10], which can reduce performance. Third, DDSP approaches require a differentiable implementation for learning with backpropagation, re-implementation of each Fx, and in-depth knowledge, limiting use to known differentiable effects and causing high engineering effort [11, 12].

In this work, we introduce DeepAFx, a new approach to differentiable signal processing that enables the use of arbitrary stateful, third-party black-box processing effects (plugins) within any deep neural network, as shown in Figure 1. Using our approach, we can mix-and-match traditional deep learning operators and black-box effects and learn to perform a variety of intelligent audio effects tasks without the need of neural proxies, re-implementation of Fx plugins, or expensive parameter labels (only paired input-target audio data is required). To do so, we present 1) a deep learning architecture in which an encoder analyzes input audio and learns to control Fx black-boxes that themselves perform signal manipulation, 2) an end-to-end backpropagation method that allows differentiation through non-differentiable black-box audio effects layers via a fast, parallel stochastic gradient approximation scheme used within a standard auto differentiation graph, 3) a training scheme that can support stateful black-box processors, 4) a delay-invariant loss to make our model more robust to effects layer group delay, 5) application of our approach to three audio production tasks including tube amplifier emulation, automatic non-speech sounds removal, and automatic music post-production or mastering, and 6) listening test results that show our approach can not only enable new automatic audio production tasks, but is high-quality and comparable to a state-of-the-art proprietary commercial solution for automatic music mastering. Audio samples and code are online at https://mchijmma.github.io/DeepAFx/.

2. METHOD

2.1. Architecture

Our model is a sequence of two parts: a deep encoder and a black-box layer consisting of one or more audio effects. The encoder analyzes the input audio and predicts a set of parameters, which is fed into the audio effect(s) along with the input audio for transformation.

The deep encoder can be any deep learning architecture for learning representations of audio. To allow the deep encoder to learn long temporal dependencies, the input x̃ consists of the current audio frame x centered within a larger audio frame containing previous and subsequent context samples. The encoder front-end consists of a non-trainable log-scaled mel-spectrogram layer followed by a batch normalization layer. The last layer of the encoder is a dense layer with P units and sigmoid activation, where P is the total number of parameters. The output is the predicted parameters θ̂ for the current input frame x.

The audio Fx layer is a stateful black-box consisting of one or more connected audio effects. This layer uses the input audio x and θ̂ to produce an output waveform ȳ = f(x, θ̂). We calculate a loss between the transformed audio and the targeted output audio.
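To make the encoder's input and output concrete, the following is a minimal sketch (not the paper's implementation) of the interface described above: a non-trainable log-scaled mel-spectrogram front-end, a stand-in for the trainable network body, and a final sigmoid layer that maps to P parameters in [0, 1]. The `weights` and `bias` arguments are hypothetical placeholders for the trained encoder, and the batch normalization layer is omitted for brevity.

```python
import numpy as np
import librosa


def encode_frame(x_context, weights, bias, sr=22050, n_mels=128):
    """Sketch of the encoder interface: context audio in, P normalized Fx parameters out."""
    # Non-trainable front-end: log-scaled mel-spectrogram of the context frame.
    n_fft = 1024  # approx. 46 ms analysis window at 22,050 Hz
    mel = librosa.feature.melspectrogram(y=x_context, sr=sr, n_fft=n_fft,
                                         hop_length=768, n_mels=n_mels)  # 25% overlap
    log_mel = np.log(mel + 1e-7)
    # Stand-in for the convolutional encoder body (Inception or MobileNetV2 in the paper):
    # here we simply flatten the time-frequency features.
    features = log_mel.flatten()
    # Final dense layer with sigmoid activation yields the predicted parameters theta_hat.
    theta_hat = 1.0 / (1.0 + np.exp(-(weights @ features + bias)))
    return theta_hat  # shape (P,), each entry in [0, 1]
```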

2.2. Gradient approximation

In order to train our model via backpropagation, we calculate the gradients of the loss function with respect to the audio Fx layer and the deep encoder. For the latter, we compute the gradient using standard automatic differentiation. For the former, gradient approximation has been addressed in various ways, such as finite differences (FD) [13], FD together with evolution strategies [14], reinforcement learning [15], and by employing neural proxies [16]. For our work, we use a stochastic gradient approximation method called simultaneous perturbation stochastic approximation (SPSA) [17], which is rooted in standard FD gradient methods and has low computational cost [13]. In our formulation, we ignore the gradient of the input signal x and only approximate the gradient of the parameters θ̂.

FD can be used to approximate the gradient of the audio Fx layer at a current set of parameters θ̂_0, where the gradient ∇̃ is defined by the vector of partial derivatives ∇̃f(θ̂_0)_i = ∂f(θ̂_0)/∂θ_i for i = 1, ..., P. The two-side FD approximation ∇̃^FD [17] is based on backward and forward measurements of f() perturbed by a constant ε. The ith component of ∇̃^FD is

    \tilde{\nabla}^{\mathrm{FD}} f(\hat{\theta}_0)_i = \frac{f(\hat{\theta}_0 + \epsilon\,\hat{d}_i^P) - f(\hat{\theta}_0 - \epsilon\,\hat{d}_i^P)}{2\epsilon},    (1)

where 0 < ε ≪ 1 and d̂_i^P denotes a standard basis vector of dimension P with a 1 in the ith place. From eq. (1), each parameter θ_i is perturbed one at a time, which requires 2P function evaluations. This can be computationally expensive for even small P, given that f() represents one or more audio effects. Even worse, when f() is stateful, FD perturbation requires 2P instantiations of f() to avoid state corruption, resulting in high memory costs.

SPSA offers an alternative and is an algorithm for stochastic optimization of multivariate systems. It has proven to be an efficient gradient approximation method that can be used with gradient descent for optimization [17]. The SPSA gradient estimator ∇̃^SPSA is based on the random perturbation of all the parameters θ̂ at the same time. The ith element of ∇̃^SPSA is

    \tilde{\nabla}^{\mathrm{SPSA}} f(\hat{\theta}_0)_i = \frac{f(\hat{\theta}_0 + \epsilon\,\hat{\Delta}^P) - f(\hat{\theta}_0 - \epsilon\,\hat{\Delta}^P)}{2\epsilon\,\Delta_i^P},    (2)

where ∆̂^P is a P-dimensional random perturbation vector sampled from a symmetric Bernoulli distribution, i.e. ∆_i^P = ±1 with probability 0.5 [18]. In each iteration, the total number of evaluations of f() is two, since the numerator in eq. (2) is identical for all the elements of ∇̃^SPSA. This is especially valuable when P is large and/or when f() is stateful. While the random search directions do not follow the steepest descent, over many iterations the errors tend to average out to the sample mean of the stochastic process [17].
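As a concrete illustration of eq. (2), the sketch below applies the two-sided SPSA estimate to a scalar loss computed through a black-box effect. The `fx` and `loss_fn` callables are assumptions standing in for an audio plugin and the loss of Sec. 2.4; in the paper the estimate is used inside the auto differentiation graph rather than called directly like this.

```python
import numpy as np


def spsa_gradient(fx, loss_fn, x, theta, target, epsilon=1e-2, rng=None):
    """Two-sided SPSA estimate of d(loss)/d(theta) through a black-box effect fx(x, theta).

    Only two evaluations of fx are needed regardless of the number of parameters P,
    which is what makes the scheme practical for stateful audio plugins.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Symmetric Bernoulli (Rademacher) perturbation: each entry is +1 or -1 with prob. 0.5.
    delta = rng.choice(np.array([-1.0, 1.0]), size=theta.shape)
    loss_plus = loss_fn(fx(x, theta + epsilon * delta), target)
    loss_minus = loss_fn(fx(x, theta - epsilon * delta), target)
    # The scalar numerator is shared by every element; divide element-wise by the perturbation.
    return (loss_plus - loss_minus) / (2.0 * epsilon * delta)
```

A finite-difference version of the same estimate would instead loop over all P parameters and call fx twice per parameter, which is exactly the 2P-evaluation cost of eq. (1).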
2.3. Training with stateful black-boxes

While training samples typically should be generated from an independent and identically distributed (i.i.d.) process, we note that audio effects are stateful systems. This means output samples depend on previous input samples or internal states [19]. Therefore, audio Fx layers should not be fed i.i.d.-sampled audio during training, as this will corrupt state and yield unrealistic, transient outputs. As such, we feed consecutive non-overlapping frames of size N to the audio Fx layer, i.e. audio frames with a hop size of N samples, where the internal block size of each audio effect is set to a divisor of N. This ensures that there are no discrepancies between the behavior of the audio effects during training and inference time.

To train the model using mini-batches and maintain state, we use a separate Fx processor for each item in the batch. Therefore, for a batch size of M, we instantiate M independent audio effects for the forward pass of backpropagation. Each of the M plugins receives training samples from random audio clips; however, across minibatches, each instance is fed samples from the same input until swapped with a new sample. When we consider the two-side function evaluations for SPSA, we meet the same stateful constraints by using two additional Fx per batch item. This gives a total of 3M audio effects when optimizing with SPSA gradients, whereas FD would require an unmanageable (2P + 1)M effects for a large number of parameters P or batch size M. Finally, we parallelize our gradient operator per item in a minibatch, similar to recent distributed training strategies [20], but using a single one-GPU, multi-CPU setup.
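A minimal sketch of this bookkeeping, assuming a hypothetical `make_fx()` factory that instantiates one stateful plugin and a hypothetical `process(frame, theta)` method on each instance: three plugin copies are kept per batch item (forward pass plus the two SPSA evaluations), and each copy is always fed consecutive non-overlapping frames from its assigned clip so that internal state is never corrupted.

```python
def build_fx_bank(make_fx, batch_size):
    """One forward-pass plugin plus two SPSA-side plugins per batch item (3M in total)."""
    return [{"forward": make_fx(), "spsa_plus": make_fx(), "spsa_minus": make_fx()}
            for _ in range(batch_size)]


def stream_clip(fx, clip, thetas, frame_size):
    """Feed consecutive non-overlapping frames (hop = frame_size) to one stateful plugin."""
    outputs = []
    for i, start in enumerate(range(0, len(clip) - frame_size + 1, frame_size)):
        frame = clip[start:start + frame_size]
        outputs.append(fx.process(frame, thetas[i]))  # hypothetical plugin API
    return outputs
```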
2.4. Delay-invariant loss

Multiband audio effects are audio processors that split the input signal into various frequency bands via different types of filters [19]. These filters can introduce group delay, which is a frequency-dependent time delay of the sinusoidal components of the input [21]. These and similar types of effects can also apply a 180° phase shift, i.e. invert the sign of the input. While such properties are critical to the functioning of Fx, difficulties can arise when directly applying a loss function to output and target signals that are not exactly aligned, either in the time or frequency domain. Thus, we propose a delay-invariant loss function designed to mitigate these issues.

We first compute the time delay τ = argmax(ȳ ⋆ y) between the output ȳ and target y audio frames via cross-correlation (⋆). Then, the loss in the time domain,

    \mathcal{L}_{\mathrm{time}} = \min\left(\lVert \bar{y}_\tau - y_\tau \rVert_1,\ \lVert \bar{y}_\tau + y_\tau \rVert_1\right),    (3)

corresponds to the minimum L1 distance between the time-aligned target y_τ and both a 0° and a 180° phase-shifted version of the time-aligned output ȳ_τ. We then compute Ȳ_τ and Y_τ: 1024-point Fast Fourier Transform (FFT) magnitudes of ȳ_τ and y_τ, respectively. The loss in the frequency domain L_freq is defined as

    \mathcal{L}_{\mathrm{freq}} = \lVert \bar{Y}_\tau - Y_\tau \rVert_2 + \lVert \log \bar{Y}_\tau - \log Y_\tau \rVert_2.    (4)

The final loss function is L = α_1 L_time + α_2 L_freq, where we empirically tune the weighting and use α_1 = 10 and α_2 = 1.
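The loss of eqs. (3)-(4) can be prototyped directly on NumPy arrays. The following is a minimal sketch under the assumption of equal-length 1-D output and target frames; it uses the absolute cross-correlation peak so that a phase-inverted output still yields a sensible alignment, a small robustness tweak relative to the plain argmax in the text.

```python
import numpy as np


def delay_invariant_loss(y_bar, y, alpha1=10.0, alpha2=1.0, n_fft=1024, eps=1e-7):
    """Delay- and polarity-invariant loss combining L1 time and L2 (log-)magnitude terms."""
    # Estimate the delay tau from the cross-correlation peak.
    corr = np.correlate(y_bar, y, mode="full")
    tau = int(np.argmax(np.abs(corr))) - (len(y) - 1)
    # Time-align output and target by cropping to their overlapping region.
    if tau > 0:
        y_bar_t, y_t = y_bar[tau:], y[:len(y) - tau]
    else:
        y_bar_t, y_t = y_bar[:len(y_bar) + tau], y[-tau:]
    # Eq. (3): minimum L1 distance over a 0-degree and a 180-degree phase shift.
    l_time = min(np.abs(y_bar_t - y_t).sum(), np.abs(y_bar_t + y_t).sum())
    # Eq. (4): L2 distances between linear and log FFT magnitudes of the aligned frames.
    Y_bar = np.abs(np.fft.rfft(y_bar_t, n=n_fft))
    Y = np.abs(np.fft.rfft(y_t, n=n_fft))
    l_freq = np.linalg.norm(Y_bar - Y) + np.linalg.norm(np.log(Y_bar + eps) - np.log(Y + eps))
    return alpha1 * l_time + alpha2 * l_freq
```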
learning devices distortion deep reference analog or of amplifiers sound the 2] tube recreate [1, to field is highly-researched goal a whose is effects distortion of emulation The emulation amplifier Tube 3.1. loss. validation on stop early epoch and size per optimization batch Adam minutes use with 3 We GPU. train takes to [25] steps) (EQ) encoder (1000 Fx Inception equalizer the parametric with scheme and optimization final our to filter intuition, lowpass 1. a mel- and apply we 0 128 inference between At and scaled parameters overlap, continuous 25% with experiment size, the and window from effects ms use mel-spectrogram We 46 log-scaled bands. a The has rate. layer sampling input 22,050Hz a at ms) frame audio current

0.0 value Hz Hz Hz and opesrmdfisteapiuednmc fadob ap- by audio of dynamics amplitude the modifies compressor A MobileNetV2 Lf)Tb mlfireuain Cne)Atmtcnnsec onsrmvl Rgt uoai ui atrn.Fo o to top From mastering. music Automatic (Right) removal. sounds non-speech Automatic (Center) emulation. amplifier Tube (Left) . 0.2 ´ nze l 5 ossigo 20rwntsfo aiu elec- various from notes raw 1250 of consisting [5] al. et ınez uptgains output specifically 2 0.4 . 8 and M th-4 0.6 2] h ubro aaeesfrec encoder each for parameters of number The [23]. time (s) x eindfrti ak aigi ninteresting an it making task, this for designed namliadcmrso 2] o training, For [28]. compressor multiband a on ratio-4 2 0.8 r 06 n 04smls(.5scad46 and sec (1.85 samples 1024 and 40960 are .EXPERIMENTS 3. . 2 ,rsetvl.Teiptcontext input The respectively. M, 1.0 uepemlfir h ri,validation, train, The preamplifier. tube threshold gain-4 LV2 1.2 ui lgnoe tnad[24] standard open plugin audio 1.4 knee-4 , M aepgain makeup θ rqec splits frequency ˆ oesr mohes For smoothness. ensure to 100 = Inception

0.0 value Hz Hz Hz naTesla-V100 a on nvra Audio Universal , ratio 0.5 ewr [22] network n the and ; and , 1.0 x ˜ knee 1.5 and in- time (s) th-1 th 2.0 and , atrdtakt 2dF.Tetan aiain n etdataset respectively. test and and minutes, 50.3 validation, and 51.1, train, 429.3, The is size -25dBFS. un- to each normalize track loudness mastered and cross-correlation using alignment Library Download from tracks music mastered ( the iter and bands quency the for 3 parameters the 33 bands, frequency 4 the etanormdlt er 0prmtr 6prmtr o the for parameters 16 – parameters of 50 ( ratio learn compressor a to multiband using ad- model by aggressively our e.g. more train threshold, that We the processor above range input any dynamic justs a is limiter A a of consisting compressor task, band this perform engineers mastering way the mas- [38]. online (OMS) proprietary software as tering commercialized co- even gain and compression range [37], dynamic efficients predict to EQ. effects DNNs audio as using adaptive 36], such using [35, explored processors, been has frequency-based com- mastering a and Automatic as such limiter; effects, and range dynamic pressor using out en- carried mastering is experienced and an [3]. gineer by content a done frequency typically enhancing is and of manipulation dynamics This process its the manipulating by is recording mastering or post-production Music mastering music Automatic 3.3. 30.2, 213.5, is size dataset test respectively. and minutes, 23.8 validation, and lip train, and breaths The removed manually smacks. with recordings speech clean and 3 gains the put bands; frequency 4 7prmtr ( parameters 17 33]. and [32, [31] recognition recordings speaker for drum breaths for of the gate detection tangential noise knowledge, automatic the intelligent with our investigated an of been of not best exception has the task this to of automation and, knowledge, expert require gain a edn ymnal dtn h ui aeomo yuiga using task by This or waveform audio [30]. the engineers editing gate sound manually noise by by done performed be task can smacks, common lip and a breaths is as such sounds, vocal non-speech Removing removal sound non-speech Automatic 3.2. th-2 2.5 gain-c etanamdlwt utpeadoefcsi eis iia to similar series, in effects audio multiple with model a train We etanamdlwt a with model a train We threshold and 3.0 th-3 ratio entrsodadcmrso aepgi,respectively. gain, makeup compressor and threshold mean .W s the use We ). ordc inlblwacertain a below signal reduce to , 3.5 th-4 .Frtann aa ecletd18umsee and unmastered 138 collected we data, training For ). etn 1] ohapoce a etm consuming, time be can approaches Both [19]. setting threshold 2] a [28], 4] n spercsigse,w efr time- perform we step, preprocessing as and [41], rpi EQ graphic 0.0 uptgain output threshold Hz Hz Hz

DAPS value , rpi EQ graphic rqec splits frequency euto gain reduction 0.5 utbn os gate noise multiband h Mxn ert’Fe Multitrack Free Secrets’ ‘Mixing The rqec splits frequency aae 3]wihcnan 0 raw 100 contains which [34] dataset input-gain , 1.0 aepgain makeup (the ,ad1prmtrfrtelim- the for parameter 1 and ), 1.5 2]admono and [25] gain time (s) gain-c-1 and n the and ; 2.0 threshold o aho h 2fre- 32 the of each for n the and , ratio and 2.5 2]F ae with layer Fx [28] gain-eq-8 ratio o aho the of each for 3.0 i a via input limiter nu gain input o ahof each for 3.5 reduction and ∞ limit multi- [39]. [40]. out- ), 1.0 1.0 1.0 ˜ Model Epochs Time dMFCC 0.8 0.8 0.8

Table 1. Training epochs & time (hours), and test MFCC distance.

Task                         Model         Epochs   Time (h)   d̃_MFCC
Tube amplifier emulation     Inception     97       9.07       0.2596
                             MobileNetV2   63       6.4        0.2186
                             CAFx          723      5.5        0.0826
Non-speech sounds removal    Inception     89       7.4        0.0186
                             MobileNetV2   60       4.8        0.0231
Music mastering              Inception     202      19.8       0.0282
                             MobileNetV2   178      17.5       0.0542
                             OMS           -        -          0.0157

Fig. 3. Listening test violin plots. (Left) Tube amplifier emulation results show our approach does not outperform CAFx [5], but still yields good quality. (Center) Non-speech sounds removal results show near human quality. (Right) Music mastering results show our method performs equivalently to the state-of-the-art (OMS). Conditions in each panel, left to right: LA, MA, HA, Ours, plus CAFx (left panel) or OMS (right panel).

4. RESULTS & ANALYSIS

4.1. Qualitative Evaluation

We show a qualitative analysis by visual inspection via Figure 2, in which we depict input, target, and predicted spectrograms for all tasks. For the tube amplifier task, it can be seen that the model emulates the analog target very closely. Also, the multiband compressor parameters move over time along the attack and decay of the guitar note to emulate the nonlinear hysteresis behavior of the tube amplifier. For the non-speech sound removal task, the model successfully removes breath and lip smacking sounds. It can be seen how the different thresholds vary according to the non-speech sounds present at the first and third second of the input recording. The music mastering example also successfully matches the sound engineer mastering target, and the parameters of the three audio effects gradually change based on the input content of the unmastered music track.

4.2. Quantitative Evaluation

We show a quantitative evaluation in Table 1, including the number of training epochs until early stopping, training time in hours, and the mean cosine distance of the mel-frequency cepstral coefficients, or d̃_MFCC, as a proxy for a perceptual metric, following past work [5]. For this, we compute 13 MFCCs from a log-power mel-spectrogram using a window size of 1024 samples, 25% hop size, and 128 bands. As a baseline for the tube amplifier emulation task, we use the Convolutional Audio Effects Modeling Network (CAFx) [5], and for the music mastering task, we use an online mastering software (OMS) [38]. We see that the tube amplifier emulation distance for our approach is higher than for the other two tasks, likely caused by using a compressor to achieve distortion. We see both encoders achieved similar performance, although the Inception model tends to perform slightly better, and all training times are under a day. We also find the CAFx and OMS models have lower distance, but defer to perceptual listening tests for further conclusions.
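A minimal sketch of the d̃_MFCC proxy metric described above, assuming librosa for feature extraction, equal-length input signals, and a per-frame cosine distance averaged over time (the exact aggregation used in the paper may differ):

```python
import numpy as np
import librosa


def mfcc_cosine_distance(y_out, y_ref, sr=22050):
    """Mean cosine distance between 13 MFCCs of an output and a reference signal."""
    kwargs = dict(sr=sr, n_mfcc=13, n_fft=1024, hop_length=256, n_mels=128)  # 25% hop, 128 bands
    m_out = librosa.feature.mfcc(y=y_out, **kwargs)  # shape (13, T)
    m_ref = librosa.feature.mfcc(y=y_ref, **kwargs)
    T = min(m_out.shape[1], m_ref.shape[1])
    m_out, m_ref = m_out[:, :T], m_ref[:, :T]
    # Cosine distance per frame, averaged over time.
    num = np.sum(m_out * m_ref, axis=0)
    den = np.linalg.norm(m_out, axis=0) * np.linalg.norm(m_ref, axis=0) + 1e-12
    return float(np.mean(1.0 - num / den))
```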
4.3. Perceptual Evaluation

We performed a subjective listening test to evaluate the perceptual quality of our results using the multiple stimulus hidden reference (MUSHRA) protocol [42]. Seventeen participants took part in the experiment, which was conducted using the Web Audio Evaluation Tool [43]. Five participants identified as female and twelve participants identified as male. Participants were musicians, sound engineers, or experienced in critical listening.

Five listening samples were used from each of the three test subsets. Each sample consisted of a 4-second segment, with the exception of the tube amplifier emulation samples, where each note is 2 seconds long. Participants were asked to rate samples according to similarity relative to a reference sound, i.e. a sample processed by a sound engineer or a recording from an analog device.

The samples consisted of an unprocessed sample as low-anchor (LA), a mid-anchor (MA), a hidden reference as high-anchor (HA), and the output of our DeepAFx method with the Inception encoder (Ours). The mid-anchor corresponds to a tube distortion [44] with a drive of +8 dB; a gate [45] with default settings; and peak normalization to -1dBFS, for each task respectively. All samples for the first two tasks were loudness normalized to -23dBFS. The samples for the music mastering task were monophonic and not loudness normalized so as to evaluate loudness as part of the task.

The results of the listening test can be seen in Figure 3 as violin plots. For tube amplifier emulation, the median scores in figure order are (0.0, 0.24, 0.96, 0.63, 0.93), respectively. The CAFx method ranks above our model, which is reasonable considering it corresponds to a custom, state-of-the-art network. Our result for this task is still remarkable, however, since we are only using a multiband compressor in a field where complex models or deep neural proxies have predominated. For the non-speech sounds removal task, the median scores in figure order are (0.39, 0.11, 0.89, 0.96), respectively. Our model is rated almost as high as the hidden anchor, which is again remarkable given the typical expertise and time required for the task. Also note 1) the mid-anchor is rated lower than the low-anchor because of agitating gating artifacts from Fx processing, and 2) the high variance of the low anchor, which contains different speech content (breaths and voice pops) and appears to confuse listeners when rating similarity. For the automatic music mastering task, the median scores in figure order are (0.15, 0.24, 0.95, 0.88, 0.79). Our model is rated above OMS, which we consider state of the art. We further conducted multiple post-hoc paired t-tests with Bonferroni correction [46] for each task and condition vs. our method. We found the rank ordering of the results is significant at p < 0.01, with the exception of our approach compared to OMS (p ≈ 0.0378).

5. CONCLUSION

We present a new approach to differentiable signal processing that enables the use of stateful, third-party black-box audio effects (plugins) within any deep neural network. The main advantage of our approach is that it can be used with general, unknown plugins and can be retrained for a wide variety of tasks by substituting in new audio training data and/or customizing with specific Fx. We believe that this framework has the potential for broader use, including optimizing legacy signal processing pipelines, style matching, and generative modeling for audio synthesis. For future work, we hope to address limitations including optimization of discrete parameters, gradient approximation accuracy, removing the need for paired training data, and robustness to black-box software brittleness.

6. REFERENCES

[1] D. T. Yeh, J. S. Abel, A. Vladimirescu, and J. O. Smith III, "Numerical methods for simulation of guitar distortion circuits," Computer Music Journal, 2008.
[2] F. Eichas and U. Zölzer, "Black-box modeling of distortion circuits with block-oriented models," in DAFx, 2016.
[3] B. De Man, R. Stables, and J. D. Reiss, Intelligent Music Production, Focal Press, 2020.
[4] D. Moffat and M. B. Sandler, "Approaches in intelligent music production," Arts, MDPI, 2019.
[5] M. A. Martínez Ramírez, E. Benetos, and J. D. Reiss, "Deep learning for black-box modeling of audio effects," Applied Sciences, MDPI, 2020.
[6] A. Wright, E.-P. Damskägg, L. Juvela, and V. Välimäki, "Real-time guitar amplifier emulation with deep learning," Applied Sciences, MDPI, 2020.
[7] S. H. Hawley, B. Colburn, and S. I. Mimilakis, "SignalTrain: Profiling audio compressors with deep neural networks," in AES Convention, 2019.
[8] J. Rämö and V. Välimäki, "Neural third-octave graphic equalizer," in DAFx, 2019.
[9] D. Sheng and G. Fazekas, "A feature learning siamese model for intelligent control of the dynamic range compressor," in IJCNN, 2019.
[10] S. I. Mimilakis, N. J. Bryan, and P. Smaragdis, "One-shot parametric audio production style transfer with application to frequency equalization," in ICASSP, 2020.
[11] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable digital signal processing," in ICLR, 2020.
[12] B. Kuznetsov, J. D. Parker, and F. Esqueda, "Differentiable IIR filters for machine learning applications," in DAFx, 2020.
[13] L. M. Milne-Thomson, The Calculus of Finite Differences, American Mathematical Soc., 2000.
[14] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv:1703.03864, 2017.
[15] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," ML, 1992.
[16] A. Jacovi et al., "Neural network gradient-based learning of black-box function interfaces," in ICLR, 2019.
[17] J. C. Spall, "An overview of the simultaneous perturbation method for efficient optimization," The Johns Hopkins APL Technical Digest, 1998.
[18] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," TAC, 1992.
[19] U. Zölzer, DAFX: Digital Audio Effects, J. Wiley & Sons, 2011.
[20] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, "ImageNet training in minutes," in ICPP, 2018.
[21] J. O. Smith III, Introduction to Digital Filters: with Audio Applications, vol. 2, W3K Publishing, 2007.
[22] J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, "Metric learning vs classification for disentangled music representation learning," in ISMIR, 2020.
[23] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[24] D. Robillard, "LV2 plugins," 2015, http://lv2plug.in/.
[25] V. Sadovnikov, "LSP parametric equalizer x32 mono," 2015, https://lsp-plug.in/.
[26] M. A. Martínez Ramírez and J. D. Reiss, "Modeling of nonlinear audio effects with end-to-end deep neural networks," in ICASSP, 2019.
[27] E.-P. Damskägg, L. Juvela, E. Thuillier, and V. Välimäki, "Deep learning for tube amplifier emulation," in ICASSP, 2019.
[28] K. Foltman, M. Schmidt, C. Holschuh, T. H. Johanssen, and T. Szilagyi, "Calf studio gear," 2020, http://calf-studio-gear.org/doc.
[29] M. Stein, J. Abeßer, C. Dittmar, and G. Schuller, "Automatic detection of audio effects in guitar and bass recordings," in AES Convention, 2010.
[30] B. Owsinski, The Mixing Engineer's Handbook, Nelson Education, 2013.
[31] M. Terrell, J. D. Reiss, and M. Sandler, "Automatic noise gate settings for drum recordings containing bleed from secondary sources," EURASIP JASP, 2011.
[32] S. H. Dumpala and K. N. R. K. R. Alluri, "An algorithm for detection of breath sounds in spontaneous speech with application to speaker recognition," in SPECOM, 2017.
[33] É. Székely, G. E. Henter, and J. Gustafson, "Casting to corpus: Segmenting and selecting spontaneous dialogue for TTS with a CNN-LSTM speaker-dependent breath detector," in ICASSP, 2019.
[34] G. J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges," SPL, 2014.
[35] S. I. Mimilakis, K. Drossos, A. Floros, and D. Katerelos, "Automated tonal balance enhancement for audio mastering applications," in AES Convention, 2013.
[36] M. Hilsamer and S. Herzog, "A statistical approach to automated offline dynamic processing in the audio mastering process," in DAFx, 2014.
[37] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "Deep neural networks for dynamic range compression in mastering applications," in AES Convention, 2016.
[38] "LANDR," 2020, https://www.landr.com/.
[39] V. Sadovnikov, "LSP limiter mono," 2015, https://lsp-plug.in/.
[40] J. D. Reiss and A. McPherson, Audio Effects: Theory, Implementation and Application, CRC Press, 2014.
[41] M. Senior, Mixing Secrets for the Small Studio, Taylor & Francis, 2011, https://www.cambridge-mt.com/ms/mtk/.
[42] B. Series, "Recommendation ITU-R BS.1534-3," 2014.
[43] N. Jillings, B. De Man, D. Moffat, and J. D. Reiss, "Web Audio Evaluation Tool: A browser-based listening test environment," in SMC, 2015.
[44] "Invada tube distortion," 2020, https://launchpad.net/invada-studio/.
[45] V. Sadovnikov, "LSP gate mono manual," 2015, https://lsp-plug.in/?page=manuals&section=gate_mono.
[46] S. Holm, "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, 1979.