
InverSynth: Deep Estimation of Synthesizer Parameter Configurations from Audio Signals

Oren Barkan, David Tsiris, Noam Koenigstein and Ori Katz

Abstract— Sound synthesis is a complex field that requires domain expertise. Manual tuning of synthesizer parameters to match a specific sound can be an exhaustive task, even for experienced sound engineers. In this paper, we introduce InverSynth - an automatic method for synthesizer parameter tuning to match a given input sound. InverSynth is based on strided convolutional neural networks and is capable of inferring the synthesizer parameter configuration from the input spectrogram and even from the raw audio. The effectiveness of InverSynth is demonstrated on a subtractive synthesizer with four frequency modulated oscillators, an envelope generator and a gater effect. We present extensive quantitative and qualitative results that showcase the superiority of InverSynth over several baselines. Furthermore, we show that the network depth is an important factor that contributes to the prediction accuracy.

Index Terms— deep synthesizer parameter estimation, automatic sound synthesis, inverse problems, InverSynth

I. INTRODUCTION

The art of sound synthesis is a challenging task traditionally reserved to sound designers and engineers [1]. Nowadays, synthesizers play a key role in electronic music production. Typically, music producers are equipped with sets of preprogrammed sound patches per synthesizer. These sets are essentially a collection of different configurations of the synthesizer parameters. Each configuration was carefully tuned by an expert to produce a different type of sound.

In the last decade, advances in deep learning pushed forward state-of-the-art results in various fields such as computer vision [16], speech recognition [17], natural language understanding [18]-[23] and music information retrieval [2]-[4]. In particular, convolutional neural networks (CNNs) demonstrated extraordinary results in genre classification [5]. One of the key properties of CNNs is automatic feature learning through shared filters, which enables capturing similar patterns at different locations across the signal. CNNs are a natural fit for many visual and auditory tasks and provide better generalization and overall performance [6].

In this paper, we investigate the problem of estimating the synthesizer parameter configuration that best reconstructs a source audio signal. Our model assumes that the source audio signal is generated by a similar synthesizer with a hidden parameter configuration. Hence, the aim of the learning process is to reveal the hidden configuration that was used to generate the source signal. This is particularly useful in scenarios where a music artist is interested in replicating sounds that were generated by other artists or sound designers using the same synthesizer (assuming no patch is released). Indeed, one of the most frequently asked questions in music production and sound synthesis forums is: What is the recipe for replicating a sound X that was generated by synthesizer Y? An even more general question is: What is the recipe for replicating a sound X using synthesizer Y? While in the former X is assumed to be synthesized by Y (intra-domain), in the latter this assumption is no longer valid (cross-domain). This paper focuses on the intra-domain problem and leaves the more general cross-domain problem for future investigation.

To this end, we propose the InverSynth method. InverSynth is based on strided CNNs and comes in two variants: the first variant is a spectrogram based CNN that is trained to predict the synthesizer parameter configuration from the log Short Time Fourier Transform (STFT) spectrogram [1] of the input audio signal. The second variant successfully performs end-to-end learning directly from the raw audio to the synthesizer parameter domain. This is done by adding several convolutional layers that are designed to learn an alternative representation for the log STFT spectrogram.

The InverSynth variants are depicted in Fig. 1. In Fig. 1(a), an input audio signal is transformed into a STFT spectrogram matrix, which is then fed to a CNN. The CNN analyzes the spectrogram and predicts a parameter configuration. Finally, the synthesizer is configured according to the predicted parameter values and synthesizes the output audio signal. In Fig. 1(b), a CNN performs end-to-end learning and predicts the parameter configuration directly from the raw audio.

In addition, we compare the performance of InverSynth against two other types of fully connected (FC) neural network models: the first type is an FC network that receives a Bag of Words (BoW) representation of the spectrogram as input. The second type is an FC network that receives a set of complex handcrafted features [10] that are designed to capture spectral properties and temporal variations in the signal.

The audio signals that are used for training and testing the models are generated by synthesizer parameter configurations that are randomly sampled, i.e. for each synthesizer parameter, we define an effective range of valid values and sample the parameter value from this range, uniformly. Hence, each configuration in the dataset is obtained by a set of (parameter, sampled value) pairs.

We present a comprehensive investigation of various network architectures and demonstrate the effectiveness of InverSynth in a series of quantitative and qualitative experiments. Our findings show that the network depth is an important factor that contributes to the prediction accuracy, and that a spectrogram based CNN of sufficient depth outperforms its end-to-end counterpart, while both InverSynth variants outperform FC networks and other models that utilize handcrafted features.

O. B. is with Microsoft. D. T. is with Tel Aviv University, Israel. N. K. is with Tel Aviv University, Israel. O. K. is with Technion, Israel. Parts of this manuscript are based on a previous publication [26].

Fig. 1. The InverSynth models. (a) The STFT spectrogram of the input signal is fed into a 2D CNN that predicts the synthesizer parameter configuration. This configuration is then used to produce a sound that is similar to the input sound. (b) End-to-end learning. A CNN predicts the synthesizer parameter configuration directly from the raw audio. The first convolutional layers perform 1D convolutions that learn an alternative representation for the STFT spectrogram. Then, a stack of 2D convolutional layers analyzes the learned representation to predict the synthesizer parameter configuration.

II. RELATED WORK

The problem of synthesizer parameter estimation has been studied in the literature [39]. Several attempts have been made to apply traditional machine learning techniques to physical modeling of musical instruments such as bowed and plucked strings [7]–[9]. Physical modeling synthesis is mainly used for creating real-sounding instruments, but is less common than subtractive synthesis with frequency modulation (FM) [1], which is one of the dominant synthesis methods in electronic music production and enables the creation of extremely diversified sounds.

Itoyama and Okuno [10] employed multiple linear regression on a set of handcrafted features for the task of learning synthesizer parameters. Our approach differs from [10] in two main aspects: first, we solve a classification task rather than a regression task. Second, we employ deep CNN architectures with non-linear gates. CNNs have sufficient capacity to learn complex filters (weights), which enables capturing important patterns while maintaining robust generalization. This eliminates the need for handcrafted features and enables the use of the STFT spectrogram or raw audio as input without further manipulation. We validate this claim empirically by comparing the CNN models to linear and non-linear FC models that use the handcrafted features from [10].

An additional attempt to learn synthesizer parameters through regression was presented by M. Yee-King et al. [27], [28]. They experimented with a variety of algorithms such as a hill climber, a genetic algorithm, MLP networks and RNN networks. In contrast to [27], [28], we focus on network architectures that are based on strided convolutional operators. It is worth mentioning that our initial experiments did in fact cover RNNs. However, we discovered that RNNs had major difficulty handling the lengthy audio sequences containing thousands of samples per second. In an effort to alleviate this difficulty, we employed convolutional layers to compress the time axis into shorter sequences that were subsequently fed into the RNN. Nevertheless, this approach did not yield any improvement over the final models presented in this paper. Finally, in contrast to both [27], [28] and [10], we do not solve a regression task; instead, we formulate the parameter estimation task as a classification problem, which enables better quantification and understanding of the parameter estimation accuracy.

The above papers are previously published research work dealing with the problem of synthesizer parameter estimation. Nevertheless, others have dealt with different yet somewhat related tasks.

x(a) = sgn(sin(a)).

Note that each oscillator y_c is associated with its own set of f_c, v_c, A_c, B_c parameters. Finally, the outputs from all oscillators are summed to y_osc = Σ_{c∈C} y_c. Therefore, the total number of parameters in the y_osc component is 16. The function y_osc from Eq. (1) is a special case of a general
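The dataset-generation procedure described above (uniform sampling of each parameter within its effective range, yielding a set of (parameter, sampled value) pairs) can be sketched as follows. The parameter names and ranges here are illustrative placeholders, not the paper's actual synthesizer specification:

```python
import random

# Hypothetical effective ranges for a few synthesizer parameters
# (illustrative only -- the paper's synthesizer defines its own set).
PARAM_RANGES = {
    "osc1_freq": (20.0, 2000.0),   # Hz
    "osc1_fm_amount": (0.0, 1.0),
    "env_attack": (0.0, 0.5),      # seconds
    "gate_rate": (0.5, 16.0),      # Hz
}

def sample_configuration(rng=random):
    """Draw one configuration: each parameter value uniform in its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

config = sample_configuration(random.Random(0))
```

Each sampled configuration is then rendered by the synthesizer to produce one (audio, configuration) training pair.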
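The paper casts parameter estimation as classification rather than regression. One common way to realize this (a sketch under an assumed bin count, not necessarily the paper's exact quantization scheme) is to quantize each continuous parameter range into a fixed number of classes, predict a class per parameter, and map the predicted class back to a representative value:

```python
def quantize(value, lo, hi, n_classes):
    """Map a continuous value in [lo, hi] to a class index in [0, n_classes)."""
    t = (value - lo) / (hi - lo)                 # normalize to [0, 1]
    return min(int(t * n_classes), n_classes - 1)

def dequantize(idx, lo, hi, n_classes):
    """Map a class index back to the center of its bin."""
    return lo + (idx + 0.5) / n_classes * (hi - lo)

# Example: a frequency-like parameter in [20, 2000] Hz with 16 classes.
idx = quantize(440.0, 20.0, 2000.0, 16)          # -> class 3
approx = dequantize(idx, 20.0, 2000.0, 16)       # -> 453.125 Hz
```

Measuring accuracy per parameter class is what enables the "better quantification and understanding" of estimation quality mentioned above.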
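The "strided" convolutions that both InverSynth variants rely on downsample the time-frequency grid as they filter it, instead of using separate pooling layers. A minimal pure-Python illustration of a single strided 2D convolution over a spectrogram-shaped matrix (a toy example, not the paper's architecture):

```python
def conv2d_strided(x, kernel, stride):
    """Valid 2D cross-correlation with equal stride on both axes."""
    kh, kw = len(kernel), len(kernel[0])
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H - kh + 1, stride):
        row = []
        for j in range(0, W - kw + 1, stride):
            row.append(sum(x[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A 6x6 "spectrogram" filtered by a 2x2 averaging kernel at stride 2
# halves each spatial dimension: 6x6 -> 3x3.
spec = [[float(i * 6 + j) for j in range(6)] for i in range(6)]
k = [[0.25, 0.25], [0.25, 0.25]]
out = conv2d_strided(spec, k, stride=2)
```

Stacking several such layers shrinks a large spectrogram down to a compact representation from which the parameter classes are predicted.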
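The trailing fragment above describes the synthesizer's square-wave FM oscillators: x(a) = sgn(sin(a)), each oscillator carrying its own (f, v, A, B) parameters, with the oscillator outputs summed. A sketch of that construction follows; the FM form v * x(2πft + A sin(2πBt)) is an assumption consistent with the fragment, not necessarily the paper's Eq. (1) verbatim:

```python
import math

def sgn(z):
    return (z > 0) - (z < 0)

def square(a):
    """Square wave: x(a) = sgn(sin(a))."""
    return sgn(math.sin(a))

def fm_oscillator(t, f, v, A, B, wave=square):
    """One frequency-modulated oscillator with volume v, carrier frequency f,
    modulation amplitude A and modulation frequency B (assumed FM form)."""
    return v * wave(2 * math.pi * f * t + A * math.sin(2 * math.pi * B * t))

def osc_sum(t, params):
    """Sum of all oscillators: y_osc(t) = sum_c y_c(t)."""
    return sum(fm_oscillator(t, *p) for p in params)

# Four oscillators, (f, v, A, B) each -> 4 * 4 = 16 parameters in total,
# matching the count stated in the text (parameter values are arbitrary).
params = [(440.0, 0.5, 1.0, 5.0), (220.0, 0.3, 0.0, 1.0),
          (110.0, 0.1, 2.0, 0.5), (55.0, 0.1, 0.0, 2.0)]
y = osc_sum(0.001, params)
```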