Music sound synthesis using machine learning: Towards a perceptually relevant control space
Fanny ROCHE PhD Defense 29 September 2020
Supervisor: Laurent GIRIN. Co-supervisors: Thomas HUEBER, Maëva GARNIER, Samuel LIMIER
Fanny ROCHE - PhD Defense - 29 September 2020

Context and Objectives
Music sound synthesis

Abstract algorithms → e.g. FM, waveshaping, ...
+ rich sounds − complex parameters

Processed recordings → e.g. sampling, wavetable, granular, ...
+ computationally efficient − very memory consuming

Spectral modeling → e.g. additive, subtractive, ...
+ close to human sound perception − numerous & very specific

Physical modeling → e.g. solving wave equations, modal synthesis, ...
+ physically meaningful controls − very specific sounds
Thesis project: new machine learning methods to tackle these issues and get:
◦ perceptually-meaningful control parameters ◦ independent control parameters ◦ accurate sound modeling
Challenges / Research questions
1. Define verbal descriptors adapted to synthetic sounds
2. Find a suitable method for extracting a high-level representation space & generating high-quality sounds
3. Get perceptually-meaningful control parameters for the synthesis
Content
1 Perceptual characterization of synthetic timbre State-of-the-art Free verbalization perceptual test Perceptual scaling test Conclusion
2 Unsupervised representation learning Methodology Comparative study Conclusion
3 Towards weak supervision using timbre perception Methodology Experiments Perceptual evaluation Conclusion
4 Conclusion and perspectives
Perceptual characterization of synthetic timbre
State-of-the-art

Ambiguous definition of timbre → multidimensional perceptual attribute

Timbre perception approaches:
Multidimensional scaling (MDS) studies [Grey 1977; Iverson and Krumhansl 1993; McAdams et al. 1995; Lakatos 2000]
Qualitative description of timbre:
Free verbalization [Faure 2000; Traube 2004; Garnier et al. 2007; Cance and Dubois 2015]
Free categorization [Guyot 1996; Gaillard 2000; Bensa et al. 2004; Ehrette 2004]
Semantic differential (SD) method [Faure 2000; Ehrette 2004; Zacharakis et al. 2014]

⇒ No consensus, BUT agreement on its spectro-temporal shape: temporal dynamics vs. spectral content [Schaeffer 1996; Castellengo 2015]
Context-dependent perceptual dimensions → type of sounds, listeners, language, ...
Free verbalization perceptual test
Objective: collect verbal descriptors that are frequently and transversally used to describe synthesizer sounds in French
Participants: 101 responses
Stimuli: creation of the ARTURIA dataset → 1,233 samples from ARTURIA's software synthesizers; 50 carefully chosen stimuli
Protocol:

Results analysis
Pre-processing: collected 784 different terms
Semantic clustering based on a co-occurrence matrix
Frequency & transversality analysis

Observations
Both terms commonly used for usual musical instruments and new terms:
→ e.g. brillant (bright), chaud (warm), métallique (metallic), ... see for example [Faure 2000; Traube 2004; Cheminée et al. 2005; Garnier et al. 2007; Lavoie 2013]
→ e.g. distordu (distorted), explosif (explosive), rétro-futuriste (retro-futuristic), robotique (robotic), saccadé (jerky), spatial, ...
The 5 most frequent and transversal perceptual categories are all related to the spectral content of the sound
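The semantic-clustering step can be illustrated in code. A minimal sketch with invented term names and co-occurrence counts (the actual 784-term matrix is not reproduced here): terms are grouped by hierarchical clustering of a dissimilarity derived from the co-occurrence matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical co-occurrence counts: cooc[i, j] = number of times term i
# and term j were used to describe the same stimulus (diagonal = term frequency).
terms = ["brillant", "metallique", "chaud", "doux", "agressif"]
cooc = np.array([
    [9, 7, 1, 0, 2],
    [7, 8, 0, 1, 3],
    [1, 0, 9, 8, 0],
    [0, 1, 8, 9, 0],
    [2, 3, 0, 0, 7],
], dtype=float)

# Turn co-occurrences into a dissimilarity: terms that often co-occur,
# relative to their own frequencies, are considered semantically close.
sim = cooc / np.sqrt(np.outer(np.diag(cooc), np.diag(cooc)))
dist = 1.0 - sim

# Condensed upper-triangular distance vector, as expected by linkage().
iu = np.triu_indices(len(terms), k=1)
Z = linkage(dist[iu], method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = {c: [t for t, l in zip(terms, labels) if l == c] for c in set(labels)}
```

With these toy counts, the spectral terms (brillant, metallique, agressif) separate from the "soft" terms (chaud, doux); frequency and transversality would then be analyzed per cluster.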
Verbal descriptor selection: prototypes of the scales, covering all timbre dimensions
⇒ These descriptors will serve as semantic labels of the scales for the second perceptual test

Perceptual scaling test
Objectives:
. Evaluate the consensus on the 8 highlighted perceptual dimensions
. Annotate a subset of samples along these dimensions
Participants: 71 responses
Stimuli:
ARTURIA dataset; stimuli selection: 80 carefully chosen samples
Stimuli assignment: training vs. main phase samples
Protocol:

Results analysis
Intra-subject consensus: evaluate the consistency of each participant and remove "unreliable" listeners
Figure: Intra-subject correlation matrix
Inter-subject consensus: distinguish groups of participants with a shared conception of the dimensions
Figure: Resulting dendrogram of the inter-subject correlation coefficients for the Agressif scale

Observations
No significant differences between groups of participants
Discrepancy between scales:
◦ métallique ◦ qui vibre → least consensual
◦ chaud ◦ qui résonne
◦ soufflé ◦ qui évolue
◦ percussif ◦ agressif → very consensual
Consensus degree consistent across both analyses → intra- and inter-subject correlations

Final label vector computation
Selection of the cluster with the largest number of participants
Scale-wise annotation of every sound using the mean ratings
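The two consensus analyses can be sketched numerically. A minimal illustration on simulated ratings (all numbers invented): intra-subject consistency as the correlation between two rating passes, and inter-subject consensus as the pairwise correlation matrix from which the dendrograms are built.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_sounds = 6, 20

# Hypothetical ratings on one semantic scale (0-100): five listeners share
# a common conception of the scale (noisy copies of one profile), while
# the last one answers at random ("unreliable").
profile = rng.uniform(0, 100, n_sounds)
ratings = np.vstack([profile + rng.normal(0, 5, n_sounds) for _ in range(5)]
                    + [rng.uniform(0, 100, n_sounds)])

# Intra-subject consistency: correlate each listener's ratings with a
# second rating pass (simulated); the unreliable listener re-rates at random.
second_pass = ratings + rng.normal(0, 5, ratings.shape)
second_pass[5] = rng.uniform(0, 100, n_sounds)
intra = np.array([np.corrcoef(ratings[s], second_pass[s])[0, 1]
                  for s in range(n_subjects)])

# Inter-subject consensus: pairwise correlations between listeners;
# hierarchical clustering of this matrix yields the dendrogram on the slide.
inter = np.corrcoef(ratings)
```

Listeners with low intra-subject correlation would be removed, and the largest cluster found in `inter` would provide the mean ratings used as label vectors.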
Conclusion

Identification of the most frequent/transversal terms characterizing synthesizer sounds ⇒ 8 verbal descriptors:
◦ métallique (metallic) ◦ qui vibre (vibrating)
◦ chaud (warm) ◦ qui résonne (resonant)
◦ soufflé (breathy) ◦ qui évolue (evolving)
◦ percussif (percussive) ◦ agressif (aggressive)
Shared conception of the terms ⇒ despite discrepancies between scales
Dual aspect of 2 dimensions ⇒ their use in a commercial synthesizer is open to question
Annotation of a subset of the ARTURIA dataset with perceptual scores
Unsupervised representation learning
Objective: investigate a well-suited deep learning algorithm to extract, from a dataset of sounds, a high-level representation space with interesting interpolation and extrapolation properties

Questions:
→ Is it possible to extract such a space automatically from a low-level representation of the signals?
→ Is it suitable for synthesis control and perceptually relevant?

Related work
Autoencoder-based models:
Autoencoders [Sarroff and Casey 2014; Colonel et al. 2017]
WaveNet autoencoders [Engel et al. 2017]
Variational autoencoders → speech [Blaauw and Bonada 2016; Hsu et al. 2017; Akuzawa et al. 2018]; music sounds [Esling et al. 2018]
Generative Adversarial Network (GAN)-based models:
WaveGAN & SpecGAN [Donahue et al. 2019]
GANSynth [Engel et al. 2019]

Methodology
Analysis-transformation-synthesis methodology

Autoencoder-based models (AE):
AE and deep AE (DAE) [Hinton and Salakhutdinov 2006; Bengio et al. 2007]
Recurrent AE (LSTM-AE) [Hochreiter and Schmidhuber 1997]
Variational AE (VAE) [Kingma and Welling 2014; Rezende et al. 2014]
Baseline: principal component analysis (PCA) [Bishop 2006]
Autoencoder framework (AE, DAE and LSTM-AE)
. Encoding: z = f_enc(W_enc x + b_enc)
. Decoding: x̂ = f_dec(W_dec z + b_dec)
. Training by minimizing the reconstruction error: MSE(x̂, x)
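The encoding/decoding equations above translate directly into code. A minimal NumPy sketch of one forward pass with random, untrained weights (the dimensions and tanh activation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(x_hat, x):
    return np.mean((x_hat - x) ** 2)

# Toy dimensions: a 513-bin magnitude spectrum compressed to 16 latents.
D, L = 513, 16
W_enc = rng.normal(0.0, 0.01, (L, D)); b_enc = np.zeros(L)
W_dec = rng.normal(0.0, 0.01, (D, L)); b_dec = np.zeros(D)

def encode(x):
    # z = f_enc(W_enc x + b_enc)
    return np.tanh(W_enc @ x + b_enc)

def decode(z):
    # x_hat = f_dec(W_dec z + b_dec); identity output activation here
    return W_dec @ z + b_dec

x = rng.random(D)          # one magnitude spectrum frame
z = encode(x)              # low-dimensional latent representation
x_hat = decode(z)          # reconstruction
loss = mse(x_hat, x)       # quantity minimized during training
```

Training would backpropagate `loss` through both maps; deep (DAE) and recurrent (LSTM-AE) variants replace the single affine layers with stacks or LSTM cells.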
VAE: probabilistic framework
. Parametric model: pθ(x, z) = pθ(x|z) pθ(z)
. Prior on the latent space: pθ(z) = N(z; 0, I_L)
. Probabilistic decoder: pθ(x|z) = N(x; μθ(z), σθ²(z))
. Probabilistic encoder: qφ(z|x) = N(z; μ̃φ(x), σ̃φ²(x))
. Log-likelihood decomposition: log pθ(x) = DKL(qφ(z|x) ‖ pθ(z|x)) + L(φ, θ, β, x), where DKL(·‖·) ≥ 0
. Training by maximizing the variational lower bound (VLB):
L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z))
(first term: reconstruction accuracy; second term: regularization)
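With the diagonal-Gaussian encoder and standard-normal prior above, the KL term of the VLB has a closed form. A sketch of the β-weighted objective, written as a minimized loss with an MSE reconstruction term (a common practical substitute for the Gaussian log-likelihood, assumed here):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def neg_vlb(x, x_hat, mu, log_var, beta=1.0):
    """Negative VLB up to constants: reconstruction error + beta * KL.
    beta > 1 trades reconstruction accuracy for a better-regularized
    latent space, as in the beta-VAE variants compared in the thesis."""
    recon = np.mean((x_hat - x) ** 2)
    return recon + beta * gaussian_kl(mu, log_var)

# When q_phi(z|x) equals the prior N(0, I), the KL term vanishes.
kl0 = gaussian_kl(np.zeros(16), np.zeros(16))
```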
Fanny ROCHE - PhD Defense - 29 September 2020 25 / 48 Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016] Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]
Unsupervised representation learning Comparative study
Datasets:
NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz)
Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]
Unsupervised representation learning Comparative study
Datasets:
NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz) Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016]
Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]
Unsupervised representation learning Comparative study
Datasets:
NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz) Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016] Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE
Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Unsupervised representation learning Comparative study
Datasets:
NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz) Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016] Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]
Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Unsupervised representation learning Comparative study
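Since only the magnitude spectrogram is modeled, the phase must be reconstructed at synthesis time. A compact sketch of the Griffin & Lim iteration (STFT parameters and signal length are illustrative, not those used in the thesis):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, hop=128, length=8192, n_iter=30, seed=0):
    """Estimate a time-domain signal whose STFT magnitude matches `mag`
    by alternating inverse/forward STFTs and re-imposing the target
    magnitude at each iteration (Griffin & Lim)."""
    rng = np.random.default_rng(seed)
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    nov = n_fft - hop
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=n_fft, noverlap=nov)
        # Keep a fixed length so the STFT grid stays consistent across iterations.
        x = x[:length] if x.size >= length else np.pad(x, (0, length - x.size))
        _, _, S = stft(x, nperseg=n_fft, noverlap=nov)
        spec = mag * np.exp(1j * np.angle(S))  # keep target magnitude, update phase
    _, x = istft(spec, nperseg=n_fft, noverlap=nov)
    return np.real(x[:length])

# Usage: analyze a test tone, discard its phase, then re-estimate it.
t = np.arange(8192) / 16000.0
tone = np.sin(2.0 * np.pi * 440.0 * t)
_, _, S_ref = stft(tone, nperseg=512, noverlap=384)
y = griffin_lim(np.abs(S_ref))
```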
Reconstruction accuracy evaluation on NSynth
(a) PCA, AE, DAE and LSTM-AE (b) VAE for different β values (PCA recalled)
Audio examples: original vs. PCA (enc = 100), PCA (enc = 32) and VAE (enc = 32) reconstructions
Cross-correlation of latent dimensions → with a latent space of dimension 16
Sound morphing
Audio examples: source and target sounds, and VAE-generated interpolations between them
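The morphing examples correspond to decoding latent vectors sampled between the source and target codes. A minimal sketch, assuming linear interpolation (the slides do not specify the interpolation scheme):

```python
import numpy as np

def morph(z_src, z_tgt, n_steps=5):
    """Latent-space morphing: return latent vectors on a straight line
    between the source and target codes; decoding each row with the
    trained VAE decoder yields the intermediate sounds."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - a) * z_src + a * z_tgt for a in alphas])

z_src, z_tgt = np.zeros(16), np.ones(16)   # toy latent codes
path = morph(z_src, z_tgt)                  # rows to feed to the decoder
```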
Conclusion

Good reconstruction audio quality ⇒ on both datasets
AE-based models are well adapted to the task
Latent vectors extracted by VAEs ⇒ good candidates for control
BUT the latent space is poorly related to perceptually relevant dimensions
Towards weak supervision using timbre perception
Objective: give a perceptual meaning to the dimensions extracted by the neural model

Questions:
→ Is it possible to add perceptual supervision so as to "force" the meaning of the latent dimensions?
→ Is it possible to change the behavior of the model using very few annotated data?

Related work
Timbre-based perceptual regularization [Esling et al. 2018]:
timbre space relying on several MDS studies [Grey 1977; Krumhansl 1989; Iverson and Krumhansl 1993; McAdams et al. 1995; Lakatos 2000]
fully-labeled dataset of orchestral instruments
perceptual regularization added to the VAE → additional term in the VLB to optimize
Weakly supervised learning ⇒ 2-step semi-supervised learning procedure [Hinton and Salakhutdinov 2007]

Methodology
Proposed 2-step learning procedure (the 2 steps are repeated N times):

1. Unsupervised pre-training → optimizing the classic VLB on XU ∪ XL:
L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z))
(reconstruction accuracy; regularization)

2. Supervised fine-tuning → optimizing the VLB with an extra regularization on XL:
L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z)) + α R(z, P)
(reconstruction accuracy; regularization; perceptual regularization)

Datasets:
Unlabeled dataset XU → the 1,233 samples of the ARTURIA dataset
Labeled dataset XL → the 80 samples annotated during the perceptual scaling test
Perceptual regularization metric: R(z, P) = MSE(z_1:8, p_s)
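The fine-tuning objective with this metric can be sketched as a minimized loss. Dimensions and label values below are illustrative, and the MSE reconstruction term and minimization sign convention are practical assumptions:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def finetune_loss(x, x_hat, mu, log_var, z, p_s, beta=1.0, alpha=1.0):
    """Supervised fine-tuning objective as a minimized loss:
    reconstruction error + beta * KL + alpha * R(z, P), where
    R(z, P) = MSE(z_1:8, p_s) ties the first 8 latent dimensions to the
    8 perceptual scores of the sound."""
    recon = np.mean((x_hat - x) ** 2)
    perceptual = np.mean((z[:8] - p_s) ** 2)      # R(z, P)
    return recon + beta * gaussian_kl(mu, log_var) + alpha * perceptual

x = np.zeros(513)                                  # toy spectrum frame
mu, log_var = np.zeros(16), np.zeros(16)           # encoder outputs
z = np.concatenate([np.full(8, 0.5), np.zeros(8)]) # sampled latent vector
p_s = np.full(8, 0.5)                              # hypothetical label vector
loss = finetune_loss(x, x, mu, log_var, z, p_s)
```

During pre-training, `alpha` would be 0 and all samples (labeled or not) would be used; fine-tuning applies the full loss on the 80 labeled samples only.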
Experiments
Reconstruction accuracy evaluation
(a) Different values of α (b) Different number of 2-step procedure iterations
Latent space organization as revealed by t-SNE [van der Maaten and Hinton 2008]
Perceptual evaluation using A/B testing
Objective: assess the effectiveness of the perceptual regularization using a simple listening test
Participants: 30 responses
Stimuli:
Time-related scales removed: → percussif → qui résonne → qui évolue
Remaining 5 scales (including Métallique and Agressif)
60 pairs of generated stimuli → 2 offset values
12 source samples per scale: → 3 very representative → 3 unrepresentative → 6 from the test set
Protocol:
Logistic random effects regression analysis [Li et al. 2011]

Two-fold purpose of the analysis:
Analyze participants' perception of the VAE variations
Investigate the influence of the dataset (train or test)

Observations
The perceptually-regularized model is able to generalize
Discrepancy between scales
Towards weak supervision using timbre perception Conclusion

Fair-to-good audio quality achievable for well-chosen α values
⇒ the extra regularization slightly degrades quality, BUT this can be tackled by enlarging both XU and XL

Influence of repeating the 2-step procedure not clearly evidenced

Preliminary validation of our methodology
⇒ ability to generalize despite very few labeled data
⇒ captured the acoustic properties of Agressif & Qui vibre very well
⇒ difficulty modeling Chaud & Soufflé
⇒ results to be confirmed for Métallique
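The trade-off governed by α can be made concrete with a simplified sketch of a perceptually regularized VAE objective (an assumption for illustration, not the exact thesis loss): reconstruction plus KL divergence plus an α-weighted term tying selected latent dimensions to the participants' ratings on the small labeled subset XL.

```python
import numpy as np

def perceptual_vae_loss(x, x_hat, mu, logvar, z_perc, ratings, alpha=0.1):
    """Simplified sketch of a semi-supervised VAE objective.
    z_perc: latent dimensions designated as perceptual controls;
    ratings: perceptual annotations for the labeled examples only."""
    recon = np.mean((x - x_hat) ** 2)                           # reconstruction error
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))  # KL to N(0, I)
    perc = np.mean((z_perc - ratings) ** 2)                     # perceptual regularization
    return recon + kl + alpha * perc
```

Increasing α strengthens the mapping between latent dimensions and perceptual scales but, as the slide notes, slightly degrades audio quality; enlarging XU and XL relaxes this trade-off.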
Conclusion and perspectives Content
1 Perceptual characterization of synthetic timbre State-of-the-art Free verbalization perceptual test Perceptual scaling test Conclusion
2 Unsupervised representation learning Methodology Comparative study Conclusion
3 Towards weak supervision using timbre perception Methodology Experiments Perceptual evaluation Conclusion
4 Conclusion and perspectives
Conclusion and perspectives Main contributions

Listed 784 terms used to describe synthesizer sounds in French
Identified 8 perceptual verbal descriptors:
◦ métallique (metallic) ◦ qui vibre (vibrating) ◦ chaud (warm) ◦ qui résonne (resonant) ◦ soufflé (breathy) ◦ qui évolue (evolving) ◦ percussif (percussive) ◦ agressif (aggressive)
Evaluated the degree of consensus on these 8 dimensions
Annotated a subset of 80 synthesizer sounds
Performed an extensive comparison of AE-based models on 2 different datasets (NSynth & ARTURIA)
⇒ validation of the use of these models, and of VAEs in particular
Proposed a new methodology to perceptually regularize VAEs
⇒ validation of the 2-step learning method, both objectively and perceptually
Conclusion and perspectives Perspectives

Deeper analysis of the perceptual dimensions
→ correlation/redundancy analysis
→ analyze the semantic relationships between the terms
→ investigate the underlying acoustic correlates

Improve the VAE framework
→ consider different distance metrics for the perceptual regularization
→ explore recurrent/dynamical VAEs [Girin et al. 2020]

Improve the global framework
→ enlarge the ARTURIA dataset
→ investigate more robust real-time phase reconstruction algorithms
→ train the VAE on complex spectrograms or temporal data [Nugraha et al. 2019]
→ explore neural vocoders
→ explore conditional GANs (C-GANs)
Thank you for your attention!
List of publications:
→ F. Roche, T. Hueber, S. Limier, and L. Girin (2019). "Autoencoders for music sound modeling: A comparison of linear, shallow, deep, recurrent and variational models". In: Proceedings of the Sound and Music Computing Conference (SMC). Málaga, Spain.
→ L. Girin, T. Hueber, F. Roche, and S. Leglaive (2019). "Notes on the use of variational autoencoders for speech and audio spectrogram modeling". In: Proceedings of the International Conference on Digital Audio Effects (DAFx). Birmingham, UK.
Journal submission:
→ F. Roche, T. Hueber, M. Garnier, S. Limier, and L. Girin. Article currently under blind review for publication in an international journal.
References
[Akuzawa et al. 2018] "Expressive speech synthesis via modeling expressions with variational autoencoder". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Bengio et al. 2007] "Greedy layer-wise training of deep networks". In: Advances in Neural Information Processing Systems (NIPS).
[Bensa et al. 2004] "Perceptive and cognitive evaluation of a piano synthesis model". In: Proceedings of the International Symposium on Computer Music Modeling and Retrieval (CMMR).
[Bishop 2006] Pattern recognition and machine learning. Springer.
[Blaauw and Bonada 2016] "Modeling and transforming speech using variational autoencoders". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Cance and Dubois 2015] "Dire notre expérience du sonore : nomination et référenciation". In: Langue Française.
[Castellengo 2015] Ecoute musicale et acoustique : avec 420 sons et leurs sonagrammes décryptés. Eyrolles.
[Cheminée et al. 2005] "Analyses des verbalisations libres sur le son du piano versus analyses acoustiques". In: Proceedings of the Conference on Interdisciplinary Musicology (CIM05).
[Colonel et al. 2017] "Improving neural net auto encoders for music synthesis". In: Audio Engineering Society Convention.
[Donahue et al. 2019] "Adversarial audio synthesis". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Ehrette 2004] "Les voix des services vocaux, de la perception à la modélisation". PhD thesis. Paris 11.
[Engel et al. 2017] "Neural audio synthesis of musical notes with wavenet autoencoders". In: Proceedings of the International Conference on Machine Learning (ICML).
[Engel et al. 2019] "GANSynth: Adversarial neural audio synthesis". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Esling et al. 2018] "Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces". In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
[Faure 2000] "Des sons aux mots, comment parle-t-on du timbre musical ?" PhD thesis. Ecole des Hautes Etudes en Sciences Sociales (EHESS).
[Gaillard 2000] "Etude de la perception des transitoires d’attaque des sons de steeldrum : particularités acoustiques, transformation par synthèse et catégorisation". PhD thesis. Toulouse 2.
[Garnier et al. 2007] "Characterisation of voice quality in Western lyrical singing: From teachers’ judgements to acoustic descriptions". In: Journal of Interdisciplinary Music Studies 1.2.
[Girin et al. 2020] "Dynamical variational autoencoders: A comprehensive review". arXiv preprint arXiv:2008.12595.
[Grey 1977] "Multidimensional perceptual scaling of musical timbres". In: The Journal of the Acoustical Society of America 61.5.
[Griffin and Lim 1984] "Signal estimation from modified short-time Fourier transform". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2.
[Guyot 1996] "Etude de la perception sonore en termes de reconnaissance et d’appréciation qualitative : une approche par la catégorisation". PhD thesis. Le Mans.
[Hinton and Salakhutdinov 2006] "Reducing the dimensionality of data with neural networks". In: Science 313.5786.
[Hinton and Salakhutdinov 2007] "Using deep belief nets to learn covariance kernels for Gaussian processes". In: Advances in Neural Information Processing Systems (NIPS).
[Hochreiter and Schmidhuber 1997] "Long short-term memory". In: Neural Computation 9.8.
[Hsu et al. 2017] "Learning latent representations for speech generation and transformation". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Huber and Kollmeier 2006] "PEMO-Q: A new method for objective audio quality assessment using a model of auditory perception". In: IEEE Transactions on Audio, Speech, and Language Processing (TASLP) 14.6.
[Iverson and Krumhansl 1993] "Isolating the dynamic attributes of musical timbre". In: The Journal of the Acoustical Society of America 94.5.
[Kingma and Welling 2014] "Auto-encoding variational Bayes". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Krumhansl 1989] "Why is musical timbre so hard to understand?". In: Structure and Perception of Electroacoustic Sound and Music.
[Lakatos 2000] "A common perceptual space for harmonic and percussive timbres". In: Perception & Psychophysics 62.7.
[Lavoie 2013] "Conceptualisation et communication des nuances de timbre à la guitare classique". PhD thesis. Université de Montréal.
[Li et al. 2011] "Logistic random effects regression models: a comparison of statistical packages for binary and ordinal outcomes". In: BMC Medical Research Methodology.
[Magron 2016] "Reconstruction de phase par modèles de signaux : application à la séparation de sources audio". PhD thesis. Telecom ParisTech.
[McAdams et al. 1995] "Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes". In: Psychological Research 58.3.
[Nugraha et al. 2019] "A deep generative model of speech complex spectrograms". In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[Rezende et al. 2014] "Stochastic backpropagation and approximate inference in deep generative models". In: Proceedings of the International Conference on Machine Learning (ICML).
[Sarroff and Casey 2014] "Musical audio synthesis using autoencoding neural nets". In: Proceedings of the Joint International Computer Music Conference (ICMC) and Sound & Music Computing Conference (SMC).
[Schaeffer 1966] Traité des objets musicaux. Le Seuil.
[Traube 2004] "An interdisciplinary study of the timbre of the classical guitar". PhD thesis. McGill University.
[van der Maaten and Hinton 2008] "Visualizing data using t-SNE". In: Journal of Machine Learning Research 9.
[Zacharakis et al. 2014] "An interlanguage study of musical timbre semantic dimensions and their acoustic correlates". In: Music Perception: An Interdisciplinary Journal 31.4.